A Gabor Wavelet Pyramid-Based Object Detection Algorithm

Yasuomi D. Sato 1,2, Jenia Jitsev 2,3, Joerg Bornschein 2, Daniela Pamplona 2, Christian Keck 2, and Christoph von der Malsburg 2

1 Department of Brain Science and Engineering, Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, 2-4, Hibikino, Wakamatsu-ku, Kitakyushu, 808-0196, Japan
[email protected]
2 Frankfurt Institute for Advanced Studies (FIAS), Johann Wolfgang Goethe-University, Ruth-Moufang-Str. 1, 60438 Frankfurt am Main, Germany
{sato,jitsev,bornschein,pamplona,keck,malsburg}@fias.uni-frankfurt.de
3 Max-Planck-Institute for Neurological Research, Gleueler Str. 50, 50931, Koeln, Germany
[email protected]
Abstract. We introduce a visual object detection architecture that makes full use of the technical merits of the so-called multi-scale feature correspondence in the neurally inspired Gabor pyramid. The remarkable property of the multi-scale Gabor feature correspondence is found with scale-space approaches: an original image that is Gabor-filtered at an individual frequency level is approximated by the correspondingly sub-sampled image smoothed with a low-pass filter. The multi-scale feature correspondence is used to reduce the computational cost of filtering effectively. In particular, we show that the multi-scale Gabor feature correspondence plays an effective role in matching an input image to the model representation for object detection.

Keywords: Gabor Pyramid, Visual Object Detection, Multi-scale Feature Correspondence, Computer Vision.
D. Liu et al. (Eds.): ISNN 2011, Part II, LNCS 6676, pp. 232–240, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Real-time object detection is one of the requisite processes in visual object recognition. In this work, we propose a so-called Gabor pyramid, which finds multi-scale feature correspondences for visual object detection, as shown in Fig. 1. It is based on a model of receptive fields that are localized, or decomposed, in the Gabor spatial frequency domain (physiologically plausible Gabor decomposition). It has been hypothesized that the receptive fields in mammalian visual systems closely resemble Gabor kernels, and this has been confirmed by a number of physiological studies on cats and primates [1].

In this Gabor pyramid object detection system, an input (I) image is down-sampled at an arbitrary scale, which may be related to the spatial frequency levels of the Gabor wavelets. The search region for a model (M) object already stored in the system is localized in each down-sampled image (broken squares in Fig. 1), and the localization is carried out by finding the Gabor feature correspondence to the M feature. This allows the most likely position of the M object to be specified gradually in each image, analogous to a flow from low to high resolution. Finally, an accurate position of the M object can be detected on the image I with the highest resolution.

Fig. 1. Sketch of the whole Gabor pyramid algorithm for visual object detection. (x, y) represents the position of a single pixel within the image. J is the Gabor feature extracted from the image. ⊗ denotes the inner product between the Gabor features of a model (M) image and of the down-sampled (DS) image. These mathematical symbols are described in detail in Sect. 2.

Correspondence finding between the Gabor filter for the image M and the low-pass Gabor filter for the down-sampled version of the image I is the central aspect of the Gabor pyramid algorithm. Identifying these feature correspondences makes it possible to find a search region for the M object effectively, even at a resolution lower than that of the M image. As the image resolution is increased, the search region gradually converges on the most likely position. This is analogous to coarse-to-fine template matching in pattern recognition, a potentially useful method that substantially lowers the computational cost [2]. However, to our knowledge, coarse-to-fine matching by finding the aforementioned Gabor feature correspondence has not yet been demonstrated.

In addition, the physiologically plausible Gabor decomposition offers another crucial advantage in the Gabor pyramid: it achieves a low computational cost with fewer Gabor filters on a limited image space, without losing any physiological constraints. Conventionally, in the correspondence-based visual object recognition model of dynamic link matching [3], 40 Gabor filters have to be used to establish recognition, which places a heavy burden on the performance of the software system. The same computational cost problem occurs in the feature-based models of the Hubel-Wiesel type [4][5]. Thus, there is still a great deal of discussion about how best to deal with Gabor filters and the associated computational cost.
Fig. 2. Face detection process after down-sampling an input (I) image. In each down-sampled (DS) image, a window of 64×64 pixels (for example, the solid squares in DS1, DS2 and DS3) is set up to search for the most likely pixel position of the model (M) face. A filled circle marks the extraction point for the M face. After this process, our Gabor pyramid system detects the M face on the image I with a small frame of 20×20 pixels.
In this work, making full use of the physiologically plausible Gabor decomposition and of scale correspondence finding between multiple resolutions and Gabor features, we develop a Gabor pyramid algorithm with which model object images stored in the system can be detected effectively and rapidly in an input image. We also show that the Gabor pyramid technically supports the functionality of coarse-to-fine template matching. This artificial vision system has significant potential for practical applications while preserving the physiological nature of the Gabor filter. In Sect. 2, the object detection mechanism of the Gabor pyramid is explained in detail. In Sects. 3 and 4, numerical results on feature correspondence and on multi-object as well as multi-face detection are given. In the final section, the results are discussed and conclusions are drawn.
2 A Gabor Pyramid System

An outline of the Gabor pyramid system proposed here is shown in Fig. 1. We assume that a grey-scale natural input (I) image showing several people is first prepared with w × h pixels (w is the width and h is the height). The image I is down-sampled by the factor $(1/2)^l$ ($l = 0, \ldots, 4$) and is then stored as an image DS$^l$ in the system. Here the $l = 0$ case represents the original size of the image I. Let the image M with 100 × 100 pixels be cut out from the image I. A single object must appear centered in the image M. One single feature $J_{l'}^M = \{J_{l',r}^M\}_{r=0,1,\ldots,7}$ for the $l'$-th spatial frequency (where $r$ indexes the orientation components and $l' = 0, \ldots, 4$) is extracted at the center of the image M. It is defined as the convolution of the image with a set of Gabor wavelets. The Gabor filter responses $\hat{J}$ are usually given by:

$$ \hat{J}(z) = \int I(z')\,\psi(z - z')\,d^2 z', \tag{1} $$

$$ \psi(z) = \frac{k^2}{\sigma^2} \exp\left(-\frac{k^2 z^2}{2\sigma^2}\right) \left[\exp(ikz) - \exp\left(-\frac{\sigma^2}{2}\right)\right], \tag{2} $$

where $\sigma = 2\pi$ to approximate the shape of receptive fields observed in the primary visual cortex. The wave vector is parameterized as

$$ \vec{k} = \begin{pmatrix} k_x \\ k_y \end{pmatrix} = \begin{pmatrix} k_l \cos\varphi_r \\ k_l \sin\varphi_r \end{pmatrix}, \quad k_l = 2^{-\frac{l+2}{2}}\pi, \quad \varphi_r = \frac{\pi}{8}\,r, \tag{3} $$

with the orientation parameter $r = 0, \ldots, 7$ and the scale parameter $l = 0, \ldots, 4$. As feature values we use the magnitude

$$ J = |\hat{J}(z)|. \tag{4} $$

Fig. 3. Scale correspondence of the model (M) feature to the features of the low-passed versions. The feature similarity for the l'-th spatial frequency is plotted as a function of the down-scaling index l, showing the average value and the SD of the similarity calculated over 100 different sample images.
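As an illustration of Eqs. (2)–(4), the following is a minimal NumPy sketch of how a single Gabor feature (one magnitude per orientation) could be sampled at one pixel. It is not the authors' implementation: the function names (`wave_vector`, `gabor_kernel`, `jet_at`), the truncation of the kernel at about three envelope widths, and the zero padding at the image border are all assumptions made for this example.

```python
import numpy as np

SIGMA = 2 * np.pi  # sigma = 2*pi, as stated in the text


def wave_vector(l, r):
    """Wave vector of Eq. (3): k_l = 2^{-(l+2)/2} * pi, phi_r = r * pi / 8."""
    k = 2.0 ** (-(l + 2) / 2.0) * np.pi
    phi = r * np.pi / 8.0
    return k * np.cos(phi), k * np.sin(phi)


def gabor_kernel(l, r):
    """Sample the wavelet of Eq. (2) on a square pixel grid."""
    kx, ky = wave_vector(l, r)
    k = 2.0 ** (-(l + 2) / 2.0) * np.pi
    # Assumption: truncate at about three widths (sigma / k) of the envelope.
    half = int(round(3 * SIGMA / k))
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = (k**2 / SIGMA**2) * np.exp(-k**2 * (x**2 + y**2) / (2 * SIGMA**2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-SIGMA**2 / 2)
    return envelope * carrier


def jet_at(image, x0, y0, l, n_orient=8):
    """Magnitudes of Eq. (4) at pixel (x0, y0) for one frequency level l."""
    jet = np.empty(n_orient)
    for r in range(n_orient):
        psi = gabor_kernel(l, r)
        half = psi.shape[0] // 2
        # Assumption: zero padding so the kernel window always fits.
        padded = np.pad(image.astype(float), half, mode="constant")
        patch = padded[y0:y0 + 2 * half + 1, x0:x0 + 2 * half + 1]
        # Convolution evaluated at one point: the kernel is flipped.
        jet[r] = np.abs(np.sum(patch * psi[::-1, ::-1]))
    return jet
```

For a vertical grating whose wave number equals $k_0 = \pi/2$, for instance, `jet_at(image, x, y, l=0)` responds most strongly in the orientation component r = 0.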
In each image DS$^l$, a region of interest (ROI$^l$) of 64 × 64 pixels is extracted, whose center is located at $(x_c^{(l)}, y_c^{(l)})$ and which is matched to the model Gabor feature $J_l^M$. In the ROI$^l$, the Gabor features $J^{(l)}(x^{(l)}, y^{(l)}) = \{J_r^{(l)}(x^{(l)}, y^{(l)})\}_{r=0,1,\ldots,7}$ are extracted for each $(x^{(l)}, y^{(l)})$ in order to calculate the similarities to the relevant model feature, $S(J^{(l)}(x^{(l)}, y^{(l)}), J_l^M)$, which is given by

$$ S(J^{(l)}(x^{(l)}, y^{(l)}), J_l^M) = \frac{\sum_r J_r^{(l)}(x^{(l)}, y^{(l)}) \cdot J_{l,r}^M}{\sqrt{\sum_r \left(J_r^{(l)}(x^{(l)}, y^{(l)})\right)^2 \sum_r \left(J_{l,r}^M\right)^2}}. \tag{5} $$

We then choose the candidate point $(x_0^{(l)}, y_0^{(l)})$ by computing the highest value of the pixel location-specific similarity:

$$ (x_0^{(l)}, y_0^{(l)}) = \arg\max_{x^{(l)}, y^{(l)}} \left\{ S(J^{(l)}(x^{(l)}, y^{(l)}), J_l^M) \right\}, \tag{6} $$
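The similarity computation and the selection of the candidate point can be sketched as follows. This is a hedged illustration rather than the original code: it assumes that the similarity is the normalized inner product (cosine similarity) of the magnitude vectors, that the features of one ROI are already available as an array of shape (height, width, 8), and the names `similarity` and `best_match` are hypothetical.

```python
import numpy as np


def similarity(jet, model_jet):
    """Normalized inner product between two Gabor feature vectors."""
    num = float(np.dot(jet, model_jet))
    den = float(np.sqrt(np.dot(jet, jet) * np.dot(model_jet, model_jet)))
    return num / den if den > 0 else 0.0


def best_match(jets, model_jet):
    """Return the ROI position (x0, y0) with the highest similarity.

    jets: array of shape (H, W, R), one feature vector per ROI pixel.
    model_jet: array of shape (R,), the stored model feature.
    """
    num = jets @ model_jet
    den = np.sqrt(np.sum(jets**2, axis=-1) * np.dot(model_jet, model_jet))
    # Pixels with an all-zero feature get similarity 0 instead of NaN.
    sim = np.divide(num, den, out=np.zeros_like(num, dtype=float), where=den > 0)
    y0, x0 = np.unravel_index(np.argmax(sim), sim.shape)
    return (int(x0), int(y0)), sim
```

Because the similarity is normalized, it is insensitive to a uniform rescaling of the feature magnitudes, so only the pattern across the eight orientations decides the match.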
and span the new search region of the defined 64 × 64 pixel size around the normalized pixel location $(x_c^{(l-1)}, y_c^{(l-1)})$ on the next up-sampled level:

$$ (x_c^{(l-1)}, y_c^{(l-1)}) = 2\,(x_0^{(l)}, y_0^{(l)}). \tag{7} $$
By repeating this position-specifying process for each down-sampled image, an exact position $(x_0^I, y_0^I)$ of the model object is finally determined on the original image I with the highest resolution (see the small square of 20×20 pixels in the I image of Fig. 2). This object detection, carried out here on a laptop computer (Intel Core(TM)2 Duo CPU 1.40GHz RAM 1.91GHz), is demonstrated in Fig. 2. We note that the fixed-size search window appears to converge gradually on the desired model object from low to high resolution, as shown in Fig. 2. The runtime of the object detection process was less than 500 msec.
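The coarse-to-fine search described above can be summarized in the following sketch. It is an illustration under stated assumptions, not the authors' code: the callback `similarity_in_window` (which returns the similarity values of one level inside a given window) is hypothetical, the first window is assumed to be centered on the coarsest image, and the window is clipped to the image border.

```python
import numpy as np


def coarse_to_fine(similarity_in_window, shapes, roi=64, scale=2):
    """Track the best match from the coarsest level down to level 0.

    similarity_in_window(l, x_min, y_min, x_max, y_max) -> 2-D array of
        similarities at level l inside the window (hypothetical callback).
    shapes[l] = (height, width) of the down-sampled image at level l.
    """
    h, w = shapes[-1]
    xc, yc = w // 2, h // 2  # assumption: start at the center of the coarsest image
    x0 = y0 = 0
    for l in range(len(shapes) - 1, -1, -1):
        h, w = shapes[l]
        # Place a roi x roi window around (xc, yc), clipped to the image.
        x_min = int(np.clip(xc - roi // 2, 0, max(w - roi, 0)))
        y_min = int(np.clip(yc - roi // 2, 0, max(h - roi, 0)))
        x_max, y_max = min(x_min + roi, w), min(y_min + roi, h)
        sim = similarity_in_window(l, x_min, y_min, x_max, y_max)
        dy, dx = np.unravel_index(np.argmax(sim), sim.shape)
        x0, y0 = x_min + int(dx), y_min + int(dy)
        # Center of the next search window on the next finer level.
        xc, yc = scale * x0, scale * y0
    return x0, y0
```

Only one window of at most 64 × 64 positions is evaluated per level, which is where the saving relative to an exhaustive search on the full-resolution image comes from.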
3 Scale Feature Correspondence to Image Resolution

A substantial reason why our Gabor pyramid algorithm can effectively detect not only faces but also general objects is that it finds feature correspondences between the resolutions of the down-sampled images and the spatial frequency factors of the Gabor feature. We confirm this feature correspondence here using 100 different images, each showing a single person i (i = 0, …, 99). Each image is used both as the model and as the input, called $M_i$ and $I_i$ respectively. From the center of the image $M_i$, one scale feature vector consisting of 8 orientation components is extracted for each scale factor $l'$ ($l' = 0, \ldots, 4$). The image $I_i$, on the other hand, is down-sampled to the size $2^{-l/2} h_i$ by $2^{-l/2} w_i$ ($l = 0, \ldots, 4$), which we refer to as DS$_i^l$. From the center of each down-sampled image, one feature vector with the same number of orientation components as in the $M_i$ case is obtained by filtering with a standard spatial frequency of the Gabor wavelets. The inner product of the Gabor feature for $l'$ of $M_i$ with the Gabor feature of DS$_i^l$ is then taken to calculate their feature similarity:

$$ S(J^{(l)}, J_{l'}^M) = \frac{\sum_r J_r^{(l)} \cdot J_{r,l'}^M}{\sqrt{\sum_r \left(J_r^{(l)}\right)^2 \sum_r \left(J_{r,l'}^M\right)^2}}. \tag{8} $$
The feature similarities for each spatial frequency $l'$ are shown in Fig. 3. In this figure, the feature similarity values are averaged over the sample of facial images, and their standard deviation (SD) is calculated. We have thus obtained tuning curves for each spatial frequency of the M feature. As shown in Fig. 3, the low-pass Gabor filter for the image DS$^l$ best matches the model Gabor filter with the same spatial frequency factor, whereas it yields an incorrect scale correspondence for the other DS images.
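The averaging that produces such tuning curves can be sketched as below. This is only an illustration: it assumes that the features have already been extracted into arrays, that the similarity is the normalized inner product, and that the SD is the population standard deviation; the function name `tuning_curves` is hypothetical.

```python
import numpy as np


def tuning_curves(model_jets, ds_jets):
    """Mean and SD of the feature similarity over all sample images.

    model_jets: shape (N, Lm, R), model features per image and level l'.
    ds_jets:    shape (N, Ld, R), features of the down-sampled images per level l.
    Returns two arrays of shape (Lm, Ld): mean and SD of the similarity.
    """
    # Inner products for every pair (l', l) of every sample image.
    num = np.einsum("nar,nbr->nab", model_jets, ds_jets)
    norm_m = np.linalg.norm(model_jets, axis=-1)  # shape (N, Lm)
    norm_d = np.linalg.norm(ds_jets, axis=-1)     # shape (N, Ld)
    sim = num / (norm_m[:, :, None] * norm_d[:, None, :])
    return sim.mean(axis=0), sim.std(axis=0)
```

Row l' of the returned mean array corresponds to one tuning curve, and a correct scale correspondence shows up as the maximum of that row lying at l = l'.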
Fig. 4. Multi-face detection results. Solid squares on the input images are detection windows. Filled circles mark the positions at which features stored in memory were correctly detected.
Sato et al. [6][7] suggested the use of scale-correspondence finding between the feature components of the input and of the model. For example, the components of the Gabor feature vector can be arranged in an angular-eccentricity coordinate system. If an input object is scaled down relative to its original size (in other words, if it has a lower resolution than the original), the components are correspondingly shifted radially towards the outside of this coordinate system. This addresses the scale correspondence between the resolution of the down-sampled image and the size of the Gabor kernel applied to the image. To support this point, we have shown the scale correspondence in Fig. 3.
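One way to make this scale correspondence concrete is a short arithmetic check. The sketch assumes that down-sampling an image by the factor $2^{-l/2}$ (Sect. 3) multiplies every wave number measured per pixel by $2^{l/2}$, and it uses $k_l = 2^{-(l+2)/2}\pi$ from Eq. (3). The function names are hypothetical.

```python
import math


def k_level(l):
    """Wave number of Eq. (3): k_l = 2^{-(l+2)/2} * pi."""
    return 2.0 ** (-(l + 2) / 2.0) * math.pi


def wave_number_on_ds(l_model, l_ds):
    """Wave number per pixel, on the image down-sampled by 2^{-l_ds/2},
    of a structure that has wave number k_{l_model} on the original image."""
    return k_level(l_model) * 2.0 ** (l_ds / 2.0)


if __name__ == "__main__":
    for l in range(5):
        # Structure tuned to level l on the original image, seen on DS^l.
        print(l, wave_number_on_ds(l, l), k_level(0))
```

Under these assumptions `wave_number_on_ds(l, l)` equals `k_level(0)` for every l, so a filter with one standard spatial frequency applied to DS$^l$ responds to the same structure as the level-l model filter on the original image.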
4 Simulation Results

Finally, we demonstrate two detection experiments, one for faces and one for a number of general objects. As shown in Fig. 4(a), even when a face is so small that its appearance is blurred, and when another face is partially occluded, the Gabor pyramid can easily detect these faces. This is achieved by extracting a single Gabor feature from a fiducial point (here, the tip of the nose). At present, the size of the detection window is fixed. Consequently, when the face is large, the detection window covers only a small portion of the face. When the face is small (i.e., smaller than 20×20 pixels), the whole face lies within the window. If the Gabor pyramid is improved so that it estimates the most likely size of the detected face, it may be able to adapt the detection window automatically to that size. Such an improvement requires an additional method for estimating the size. The scale- and rotation-transformation-specific similarity computation proposed in [11] would be appropriate, since it is a powerful method for scale- and rotation-invariant object recognition. When this similarity computation is integrated into the framework of our Gabor pyramid, which provides translation invariance, the improved Gabor pyramid system would achieve recognition that is fully invariant to changes in scale, rotation and translation.
Fig. 5. Detection results for an image containing multiple objects (an ashtray, a cake, a spoon, a glass and a coffee cup). Solid squares centered on the filled black circles in the upper figure denote that a part of each general object, stored as one of the model feature representations (yellow circles), can be successfully detected.
However, we have to be aware of some limits on the detection ability when matching relies on a single feature correspondence only. This is shown in Fig. 4(b). The tip of the nose could not be located for two of the five faces in the natural image, even though the model Gabor feature was extracted from this fiducial point on the face. In such cases, the face must be detected using another fiducial point, such as the mouth. This result implies that the system should store not just a single feature but multiple features in memory to achieve better face detection performance.

Next, we test the detection ability for a number of general objects (in Fig. 5, an ashtray, a glass, a coffee cup, a cake and a spoon). In this test, the single Gabor feature is extracted from the contour of the respective object. A detection window of the Gabor pyramid system then covers a segment of the object. As mentioned above, such detection may be interpreted as selecting one of several stored features that together form the whole object representation and locating its correct position on the input image. Thus, by proposing visual object detection with the Gabor pyramid, we may suggest a possible model of visual object recognition.
5 Discussion and Conclusion

This work is an important preliminary step toward practical applications in image processing and object recognition. As a next step, we will attempt to establish a visual object recognition system that is fully invariant to scale, rotation and translation, by integrating into the Gabor pyramid algorithm of this work another algorithm that detects the most likely scale and rotation transformation states of the object. Such a scale and rotation transformation detection has already been proposed in [7].
The face detection system proposed by Viola and Jones [8] is widely used in computer vision and image processing research. It is well known that this face detector fails when a person turns their head to one side or when the facial resolution is too low [9]. Such detection problems are expected to be solved by an improved version of the Gabor pyramid in which another invariant recognition mechanism is integrated. Consequently, the construction of such an integrated visual object recognition system is an urgent task.

We are planning to implement the Gabor pyramid algorithm on an FPGA. With this implementation, we expect the Gabor pyramid to run much faster with larger numbers of faces or objects. Our Gabor pyramid could detect a couple of faces or general objects in 1 s without parallel processing. The FPGA implementation should allow even more faces or objects to be detected in real time.

Several features extracted from fiducial points on the face or from contours of the object, organized for example as graphs, are often used in correspondence-based recognition models, and they are necessary to achieve smooth visual object detection. This is because the simulation results in this work indicate that there are still some difficulties associated with a detection process that uses only a single feature, one of which is shown in Fig. 4(b). To overcome such difficulties, topological constraints, such as a facial graph consisting of several Gabor features, are required.

In conclusion, there is still a great deal of work to be done in the construction of a neurally plausible object recognition model. We must also stress that the work described here is at a fundamental stage with regard to practical applications. For these applications to be successful, the method must demonstrate the flexibility and universality of the underlying concept of Gabor pyramid processing.

One of the most crucial mechanisms is correspondence finding between images of different resolutions and the spatial frequencies of the Gabor filter. These correspondences are found thanks to the physiologically plausible Gabor decomposition and have a great deal of potential for solving computer vision problems. The mechanism also introduces the possibility of recycling view-dependent information about the initially unknown size of an object, which may previously have been regarded as unnecessary. The similarity computation pursued here will contribute substantially to the further understanding of the highly integrated visual recognition mechanism behind invariance.

Acknowledgments. This work was financially supported by the “Bernstein Focus: Neurotechnology” through research grant 01GQ0840, funded by the German Federal Ministry of Education and Research (BMBF). Y.D.S. was supported by the Grant-in-Aid for Young Scientists (B) No. 22700237.
References

1. Jones, J.P., Palmer, L.A.: An evaluation of the two-dimensional Gabor filter model of simple receptive fields in the cat striate cortex. Journal of Neurophysiology 58(6), 1233–1258 (1987)
2. Dufour, R.M., Miller, E.L.: Template Matching Based Object Recognition With Unknown Geometric Parameters. IEEE Transactions on Image Processing 11(12), 1385–1396 (2002)
3. Lades, M., Vorbrueggen, J.C., Buhmann, J., et al.: Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers 42(3), 300–311 (1993)
4. Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., Poggio, T.: Robust Object Recognition with Cortex-like Mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(3), 411–426 (2007)
5. Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. Journal of Physiology 160(1), 106–154 (1962)
6. Sato, Y.D., Wolff, C., Wolfrum, P., von der Malsburg, C.: Dynamic Link Matching between Feature Columns for Different Scale and Orientation. In: Ishikawa, M., Doya, K., Miyamoto, H., Yamakawa, T. (eds.) ICONIP 2007, Part I. LNCS, vol. 4984, pp. 385–394. Springer, Heidelberg (2008)
7. Sato, Y.D., Jitsev, J., von der Malsburg, C.: A Visual Object Recognition System Invariant to Scale and Rotation (The ICANN 2008 Special Issue). Neural Network World 19(5), 529–544 (2009)
8. Viola, P., Jones, M.J.: Robust Real-Time Face Detection. International Journal of Computer Vision 57(2), 137–154 (2004)
9. Hayashi, S., Hasegawa, O.: Robust Face Detection for Low-Resolution Images. Journal of Advanced Computational Intelligence and Intelligent Informatics 10(1), 93–101 (2006)