Position, rotation, scale and orientation invariant multiple object recognition from cluttered scenes Peter Bone, Rupert Young1, Chris Chatwin Laser and Photonic Systems Research Group Department of Engineering and Design University of Sussex, Brighton BN1 9QT
ABSTRACT A method of detecting target objects in still images despite any kind of geometrical distortion is demonstrated. Two existing techniques are combined, each capable of creating invariance to various types of distortion of the target object. A Maximum Average Correlation Height (MACH) filter is used to create invariance to orientation and gives good tolerance to background clutter and noise. A log r-θ mapping is employed to give invariance to in-plane rotation and scale by transforming rotation and scale variations of the target object into vertical and horizontal shifts. The MACH filter is trained on the log r-θ map of the target for a range of orientations and applied sequentially over regions of interest in the input image. Areas producing a strong correlation response can then be used to determine the position, in-plane rotation and scale of the target objects in the scene. Keywords: logmap, MACH filter, correlation filter, invariant pattern recognition
1. INTRODUCTION The problem of recognizing objects despite distortions in position, orientation and scale 1-2, and within cluttered backgrounds, is a demanding pattern recognition problem. A system capable of detecting target objects despite any kind of geometrical distortion has many practical applications, since it can detect target objects when the orientation and position of the target or camera is unknown. The ability to classify objects as in-class or out-of-class in cluttered backgrounds is also crucial if the system is to be used in a practical application. The first major success in solving the invariance problem came from the development of the Synthetic Discriminant Function (SDF) 3-5, which included the expected distortions in the filter design to create invariance to such distortions. More recent attempts have been based on the Maximum Average Correlation Height (MACH) filter 6, which can be tuned to give maximum performance and is far more immune to background clutter. To attempt to solve the full invariance problem, we combine two existing techniques, each capable of achieving invariance to several of the possible variations of a target object. Out-of-plane rotation invariance is achieved using a Maximum Average Correlation Height (MACH) filter. Aside from creating out-of-plane rotation invariance, this filter is capable of discriminating the target objects from cluttered or noisy backgrounds. Scale and in-plane rotation invariance is created with the use of a log r-θ mapping (logmap) of a localised region of the image space. A change in scale or rotation of the target object results in a horizontal or vertical shift in the logmap, which makes the object detectable by
1 E-mail: [email protected]; Tel: +44 (0)1273 678908; Fax: +44 (0)1273 690814
correlation with the logmap of the reference image. Simulation results are presented that demonstrate the detection of the target object in a scene under a number of conditions that test the filter’s invariance.
2. IN-PLANE ROTATION AND SCALE INVARIANCE A method was required to detect target objects in a scene despite their differences in scale or in-plane rotation relative to the target reference images. To do this a log r-θ mapping 7-9, or logmap, was used. The logmap uses a variation on the basic x-y grid sensor used in conventional image processing. The structure of the sensor is based on a Weiman polar exponential grid 7,10,11 and consists of concentric, exponentially spaced rings of pixels, which increase in size from the centre to the edge. This produces an arrangement similar to that found in the mammalian retina, where photoreceptive cells are small and densely packed in the fovea and increase in size exponentially to create a blurred periphery. Each sensor pixel in the circular region of the x-y Cartesian space is mapped into a rectangular region in the polar image space (r, θ). The sensor's geometry maps concentric circles in the Cartesian space into vertical lines in the polar space, and radial lines in the Cartesian space into horizontal lines in the polar space. It will be shown mathematically that this transformation offers scale and rotation invariance about its centre, since rotation or scale changes simply produce vertical or horizontal shifts in the polar space. Aside from invariance to in-plane rotation and scale, this mapping also has the advantage of giving a wide visual field whilst maintaining high resolution at the centre where it is needed most, which means that it uses a minimum amount of memory and computational resources. The logmap sensor performs a complex logarithmic mapping of the image data from the circular retinal region into a rectangular region. Fig. 2.1 shows the complex logarithmic mapping performed by the sensor geometry. The vertical lines in the w-plane map to concentric circles in the z-plane and the horizontal lines in the w-plane map to radial lines in the z-plane.
Fig. 2.1 Sensor geometry and logarithmic mapping: the z-plane (z = x + iy) and the w-plane (w = u + iv) are related by w = log z, with inverse z = exp w.
The complex logarithmic mapping can be written as:

w = log z    (2.1)

Applying the complex notation,

z = x + iy = r e^(iθ), where r = √(x² + y²) and θ = arctan(y/x)    (2.2)

Thus,

w = log r + iθ = u + iv, where u = log r and v = θ    (2.3)

Hence, an image in the z-plane with co-ordinates x and y is mapped to the w-plane with co-ordinates u and v. The mapped image from the Cartesian space (z-plane) into the polar space (w-plane) is referred to as the log-polar mapping, or the logmap. This process can be reversed to produce the inverse mapping from polar space (w-plane) to Cartesian space (z-plane). The following invariances are created by the logmap and are useful for image processing applications, in addition to the wide field of view, reduced pixel count and central highly focused area inherent in the space-variant sensor geometry. Firstly, the log-polar mapping offers in-plane rotation invariance. If an image is rotated by an angle α about the origin, then:
z = r e^(i(θ + α))

Thus,

w = log r + i(θ + α) = u + i(v + α)    (2.4)
In effect, rotating the image by the angle α has resulted in a vertical shift in the mapped image by the rotation angle. Secondly, the log-polar mapping offers scale invariance. If the image in the z-plane is scaled by a factor β, then:
z = βr e^(iθ)

Thus,

w = log r + log β + iθ = (u + log β) + iv    (2.5)
In effect, scaling the image by the factor β has resulted in a horizontal shift in the mapped image by the logarithm of the scaling factor. Thirdly, the mapping also offers projection invariance: if we move towards a fixed point in the z-plane, the mapped image shifts horizontally in the w-plane while its size and shape remain unchanged. This is equivalent to progressively scaling the z-plane image, but with a changing perspective ratio. The fact that rotation and scale changes result in vertical and horizontal shifts respectively in the logmap means that the logmap creates invariance to these transformations, since a linear correlator object recognition system is invariant to x-y translation. It should be emphasised, however, that the log-polar mapping is not shift invariant, so the properties described above hold only with respect to the origin of the Cartesian image space.
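The rotation-to-vertical-shift property of equations 2.4 and 2.5 can be checked numerically. The sketch below is illustrative only: `logmap` is a hypothetical nearest-neighbour sampler (not the sub-pixel interpolation described later in section 2.3), and `pattern` renders a synthetic off-centre blob whose rotation about the image centre is known analytically.

```python
import numpy as np

def logmap(image, n=64, u_max=40, r_min=1.0):
    """Nearest-neighbour log-polar sampling of `image` about its centre.

    A minimal sketch of eq. (2.3): cell (u, v) samples the image at
    radius r_min*exp(u*g) and angle v*g, so rows index log r and
    columns index theta.
    """
    cy, cx = (np.array(image.shape) - 1) / 2.0
    g = 2 * np.pi / n                                  # granularity, eq. (2.6)
    uu, vv = np.meshgrid(np.arange(u_max), np.arange(n), indexing="ij")
    r = r_min * np.exp(uu * g)                         # eq. (2.8)
    t = vv * g                                         # eq. (2.9)
    x = np.clip(np.round(cx + r * np.cos(t)).astype(int), 0, image.shape[1] - 1)
    y = np.clip(np.round(cy + r * np.sin(t)).astype(int), 0, image.shape[0] - 1)
    return image[y, x]                                 # shape (u_max, n)

def pattern(alpha):
    """Synthetic 129x129 test image: an isotropic Gaussian blob at radius 30,
    rotated by `alpha` radians about the image centre."""
    yy, xx = np.mgrid[0:129, 0:129] - 64.0
    px, py = 30 * np.cos(alpha), 30 * np.sin(alpha)
    return np.exp(-((xx - px) ** 2 + (yy - py) ** 2) / 40.0)
```

Rotating the pattern by a whole number of granularity steps (here 8·g radians) produces, up to nearest-neighbour rounding error, `np.roll(logmap(pattern(0)), 8, axis=1)`: a pure shift along the θ axis, which is what makes a shift-invariant correlator rotation invariant in this space.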
2.2 Polar-exponential sensor geometry The geometry of the sensor can be varied to make optimum use of the image space 12. A suitable mapping is one in which the angular size of each sensor cell is proportional to its radial size, giving sensor cells that are roughly square and increase exponentially in size with distance from the centre of the mapping. The sensor geometry is characterised by its parameters (n, R, r_min, dr). The grain of the array n is defined as the number of sensor cells in each ring of the array, and so is equal to the number of spokes in the sensor array. Thus, n spokes cover a range of 2π radians, and so the angular resolution of each sensor cell, named the granularity of the array g, is:

g = 2π/n    (2.6)
The gain of the array k is defined as the ratio of pixel sizes between any two successive rings of pixels in the sensor:

k = exp(2π/n) = exp(g)    (2.7)

The gain also describes the resolution of the sensor, which decreases from the centre to the periphery; smaller gains give a higher overall resolution than larger gains. The position of each ring and spoke in the sensor array can then be computed from the sensor's granularity g:

r(u) = exp(u·g)    (2.8)

θ(v) = v·g    (2.9)

where u and v are the integer indices of the logmap, with u_min ≤ u ≤ u_max and 0 ≤ v ≤ n − 1.
The next parameter of the sensor is the field width R, defined as the distance from the centre to the edge of the mapping. By setting R to different values and fixing the focus of expansion at the centre of the image, we control the data sampled by the sensor. If R is set to half the image size, the circular visual field omits only the image data at the corners; increasing R to 1/√2 of the image size (half the diagonal) includes all the image data, but unwanted data around the image is included as well. The last parameter of the sensor to be defined is the blind spot of the mapping, which affects the overall resolution of the mapping. By varying the radius of the first ring r_min, the image data at the centre of the sensor can be omitted. This blind spot can be filled with a high-resolution array of uniformly spaced rectangular cells, in a similar way to the mammalian retina, or with an r-θ mapping, or simply left empty. The circular area of the blind spot is described by an offset dr into the mapping:

size of blind spot = r_min + dr    (2.10)
The number of sensor cells in the mapping and the range of pixels which they cover into the logmap will be determined by the field width R, the radius of the first ring from the centre of the mapping r min , and the sensor’s grain n.
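Under the assumption that r(u) = r_min · exp(u·g) generalises equation 2.8 to a first ring of radius r_min, the geometry relations 2.6-2.9 reduce to a few lines; the parameter values below are the example ones from section 2.3 (n = 100, R = 50, r_min = 1):

```python
import numpy as np

# Relations (2.6)-(2.9) for the example geometry of section 2.3
# (n = 100, R = 50, r_min = 1). The r_min factor in r(u) is our
# generalisation of eq. (2.8) to a first ring of radius r_min.
n, R, r_min = 100, 50.0, 1.0
g = 2 * np.pi / n                                   # granularity, eq. (2.6)
k = np.exp(g)                                       # gain: ratio of successive ring radii, eq. (2.7)
u_max = int(np.ceil(np.log(R / r_min) / g))         # rings needed for r(u) to reach R
rings = r_min * np.exp(np.arange(u_max + 1) * g)    # ring radii, eq. (2.8)
spokes = np.arange(n) * g                           # spoke angles, eq. (2.9)

print(round(g, 4), round(k, 4), u_max)              # -> 0.0628 1.0648 63
```

Note how the number of rings grows only logarithmically with the field width, which is the source of the reduced pixel count mentioned above.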
2.3 Logmap of an x-y raster image An x-y raster image must be converted to a logmap image by interpolation. The intensity value of each sensor cell is computed by averaging the intensity of all the x-y image pixels that fall within it. The most basic, and fastest, way of doing this is a nearest-neighbour interpolation method, where each image pixel contributes to only one sensor cell, depending on which sensor cell its centre falls within. However, this method does not account for image pixels that fall between sensor cell locations and therefore produces a very broken logmap, which is not fully scale and rotation invariant. The logmap can be improved dramatically by sub-dividing the pixels that fall between sensor locations into smaller sub-pixels, computing the corresponding sensor cell for each sub-pixel, and hence the proportion of each image pixel that covers each logmap region. This produces a logmap image that is more rotation and scale invariant, but the computational time increases as the level of sub-division is increased. If the same sensor geometry is being used to create logmaps of multiple images, then the time taken to create each logmap can be reduced significantly by precomputing a lookup table to store the image pixels that contribute to each sensor cell and the proportion they contribute. Figure 2.2 shows an original image (a), the log-mapped image (b) and the inverse mapping of the logmap (c). The parameters used in the logmap geometry were n = 100, R = 50, r_min = 1, dr = 0.
Fig. 2.2 Original image (a), logmap image (b) and inverse mapping (c)
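A possible reading of the lookup-table scheme described above is sketched below (an illustration, not the authors' code): each image pixel is split into sub × sub sub-pixels, each sub-pixel is assigned to the sensor cell containing its centre, and the pixel-to-cell lists are precomputed once so that every subsequent logmap is just a set of weighted sums. `build_lut` and `apply_lut` are hypothetical names.

```python
import numpy as np

def build_lut(size, n=100, r_min=1.0, R=None, sub=4):
    """Precompute which logmap cell each sub-divided image pixel feeds.

    Illustrative sketch of the lookup-table scheme of section 2.3: every
    pixel of a size-by-size window is split into sub*sub sub-pixels and
    each sub-pixel is assigned to the sensor cell containing its centre.
    """
    if R is None:
        R = size / 2.0
    g = 2 * np.pi / n
    u_max = int(np.ceil(np.log(R / r_min) / g))
    c = (size - 1) / 2.0                       # sensor centre
    S = size * sub
    yy, xx = np.mgrid[0:S, 0:S]                # sub-pixel grid
    ys = (yy + 0.5) / sub - 0.5 - c            # sub-pixel centres, relative to the sensor centre
    xs = (xx + 0.5) / sub - 0.5 - c
    pix = ((yy // sub) * size + (xx // sub)).ravel()   # parent pixel of each sub-pixel
    ys, xs = ys.ravel(), xs.ravel()
    r = np.hypot(xs, ys)
    keep = (r >= r_min) & (r < R)              # drop blind-spot and out-of-field samples
    u = np.floor(np.log(r[keep] / r_min) / g).astype(int)
    v = np.floor((np.arctan2(ys[keep], xs[keep]) % (2 * np.pi)) / g).astype(int) % n
    return pix[keep], u * n + v, (u_max, n)

def apply_lut(image, lut):
    """Average the contributing sub-pixels into each sensor cell."""
    pix, cell, shape = lut
    ncell = shape[0] * shape[1]
    num = np.bincount(cell, weights=image.ravel()[pix], minlength=ncell)
    den = np.bincount(cell, minlength=ncell).astype(float)
    out = np.zeros(ncell)
    np.divide(num, den, out=out, where=den > 0)   # cells with no samples stay 0
    return out.reshape(shape)
```

With the table built once per window geometry, each new sub-image logmap is just two `bincount` calls, which is the speed-up described above and exploited again in section 4.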
3. MAXIMUM AVERAGE CORRELATION HEIGHT (MACH) FILTER The MACH filter 6, like the SDF, is a method of creating invariance to distortions in the target object by including the expected distortions in the construction of the filter. The MACH filter maximizes the relative height of the average correlation peak with respect to the expected distortions. Unlike the SDF, the MACH filter can be tuned to maximize the correlation peak height, peak sharpness and noise suppression, while also being tolerant to distortions of the target object that fall between the distortions given in the training set. However, the peak height of the MACH filter is unconstrained, making the results of the correlation more difficult to interpret 13. The MACH filter is derived by maximizing a performance metric called the average correlation height (ACH). However, several other performance measures have to be balanced to suit different application scenarios. These measures are the Average Correlation Energy (ACE), Average Similarity Measure (ASM) and Output Noise Variance (ONV) 5,14. Thus, to form an optimal trade-off filter 6,14, the following energy function is formed and minimized, with the non-negative parameters α, β, γ and δ controlling the trade-off between the criteria:
E(h) = α(ONV) + β(ACE) + γ(ASM) − δ(ACH) = α h⁺Ch + β h⁺Dxh + γ h⁺Sxh − δ h⁺mx    (3.1)
The resulting optimal trade-off (OT) MACH filter (in the frequency domain) is given as:

h = mx* / (αC + βDx + γSx)    (3.2)
where α, β and γ are non-negative OT parameters, mx is the average of the training image vectors x1, x2, …, xN (in the frequency domain), and C is the diagonal power spectral density matrix of additive input noise, usually set as the white noise covariance matrix C = σ²I. Dx is the diagonal average power spectral density of the training images:
Dx = (1/N) ∑ᵢ Xi*Xi    (3.3)

where the sum runs over i = 1, …, N and Xi is the diagonal matrix of the ith training image. Sx denotes the similarity matrix of the training images:
Sx = (1/N) ∑ᵢ (Xi − Mx)*(Xi − Mx)    (3.4)
where Mx is the average of Xi. The different values of α, β and γ control the MACH filter's behaviour to match different application requirements 15. If β = γ = 0, the resulting filter behaves much like an MVSDF filter 16, with relatively good noise tolerance but broad peaks. If α = γ = 0, the filter behaves more like a MACE filter 17, which generally exhibits sharp peaks and good clutter suppression but is very sensitive to distortion of the target object. If α = β = 0, the filter gives high tolerance to distortion but is less discriminating.
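Because C, Dx and Sx are diagonal, equation 3.2 reduces to elementwise arithmetic on per-frequency arrays. A minimal sketch, assuming equally sized training images stacked along the first axis and white input noise (the function names are ours; the α, β, γ defaults are the values quoted in section 4):

```python
import numpy as np

def otmach(train, alpha=0.01, beta=0.3, gamma=0.1, noise_var=1.0):
    """OT-MACH filter synthesis, eq. (3.2), sketched elementwise.

    `train` is an (N, H, W) stack of (log-mapped) training images; because
    C, Dx and Sx are diagonal they reduce to per-frequency arrays. White
    input noise of variance `noise_var` is assumed (C = sigma^2 I).
    """
    X = np.fft.fft2(train, axes=(-2, -1))     # training spectra
    m = X.mean(axis=0)                        # m_x: average training spectrum
    C = noise_var * np.ones(m.shape)          # noise power spectral density
    D = (np.abs(X) ** 2).mean(axis=0)         # Dx: average power spectral density
    S = (np.abs(X - m) ** 2).mean(axis=0)     # Sx: spread of the spectra about the mean
    return np.conj(m) / (alpha * C + beta * D + gamma * S)

def correlate(image, h):
    """Correlation plane of `image` with filter `h` (zero offset at [0, 0])."""
    return np.fft.ifft2(np.fft.fft2(image) * h)
```

Correlating an in-class image with the filter should give a markedly taller peak than correlating unrelated clutter, which is what the classification threshold of section 4 exploits.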
4. FULLY INVARIANT FILTER By combining the in-plane rotation invariance of the logmap and the distortion invariance of the MACH it was possible to create a filter that is invariant to any kind of geometrical distortion of the target object, while maintaining high performance even in cluttered scenes. Such a filter was constructed and tested for the problem of detecting a particular car (a model of a Jaguar S-Type) in a scene (Fig 4.1). The system was expected to correctly detect the car despite variations in its out-of-plane orientation, in-plane rotation, scale, position and with noisy or cluttered backgrounds. To create invariance to changes in out-of-plane orientation of the car, the MACH filter was created using a training image set consisting of the expected range of rotation taken at small intervals of viewing angle.
Fig. 4.1 Various distortions of the target object (a model of a Jaguar S-Type)
The problem of combining the logmap with the MACH filter was solved simply by creating a logmap of each reference image before synthesis of the filter. The input image then also needed to be log-mapped before being correlated with the filter. However, the logmap gains its in-plane rotation and scale invariance at the expense of position invariance. This means that searching the entire input image for the target object requires the correlation process to be repeated for each location in the image. To accomplish this, a moving window system was created to raster-scan the input image and perform the MACH logmap correlation on the sub-image contained in the window. The size of the window needed to be as small as possible to reduce computational time while being large enough to fit any expected distortion of the target object. To classify each sub-image as containing an in-class or out-of-class object it was necessary to compare the correlation peak height of each sub-image with some threshold, above which the sub-image would be classified as containing an in-class target. The threshold was calculated by correlating each of the reference images with the MACH filter and taking an average of their resultant peak intensities (the intensity at the centre of the correlation plane). This gave a value close to what would be expected after correlation between the filter and an in-class target. The threshold was then set slightly lower than this value by multiplying by 0.75, so that any in-class objects would produce peaks slightly above the threshold, allowing for some reduction in amplitude, while still being high enough for all out-of-class objects to fall below the threshold. The threshold was thus calculated as:
Threshold = (0.75/N) ∑ᵢ max(x,y) |Ci(x, y)|    (4.1)
where Ci(x, y) is the correlation plane amplitude of the ith training image. Once a target object has been detected, it is then possible to calculate the in-plane rotation and scale of the target in the image. This is possible since rotation and scale variations in the target object produce vertical and horizontal shifts respectively in the logmap. The position of the maximum correlation peak will therefore also be shifted from its central position when target objects are rotated or scaled relative to the set of reference images used to build the filter. By measuring the horizontal and vertical offset of the peak from the centre of the correlation plane it is possible to calculate the scale ratio and rotation relative to the reference images. From equations 2.8 and 2.9 it can be seen that the rotation and scale can be calculated as:
ScaleRatio = exp(u_offset · g)    (4.2)

Rotation = v_offset · g    (4.3)
where u_offset and v_offset are the horizontal and vertical offsets of the maximum correlation peak from the centre of the correlation plane. As well as calculating the scale and in-plane rotation of detected targets, it is also possible to calculate the out-of-plane rotation (orientation) as a post-processing operation. This is done by correlating the log-mapped sub-image with each of the original log-mapped reference images and seeing which one gives the strongest response. The logmap parameters used for the filter were n = 100, R = 50, r_min = 5 and dr = 0. The value of the grain n was chosen based on the resolution of the reference images so that they would not be over- or under-sampled. The value of R was chosen based on the maximum expected dimensions of the target object, since the aim is to make R as small as possible while being large enough to cover any target object. The width of the moving window used to sample the input image was set to 80, as this meant that it was fully sampled by the logmap while being as small as possible. r_min was set to 5 to create a blind spot at the centre of the image. The blind spot was necessary for two reasons. Firstly, the area at the centre of the sensor is over-sampled, so a single pixel in the original image is mapped to many pixels in the logmap image, making the logmap image unnecessarily large. Secondly, the over-sampling creates large areas of flat colour on the left-hand side of the logmap (this can be seen in Fig. 2.2 (b), where a blind spot was not used). Edges are created between these areas of colour when the MACH filter is applied, which inhibit the in-plane rotation invariance of the filter. The area inside the blind spot was left empty, but this loss of data did not significantly affect the performance of the filter since it represents a relatively small proportion of the expected target size.
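Equations 4.1-4.3 amount to only a few lines of code. The sketch below assumes correlation planes are supplied as complex arrays with the zero-offset position at the plane centre; `detection_threshold` and `rotation_and_scale` are hypothetical names, and the recovered rotation is in radians (the text quotes degrees):

```python
import numpy as np

def detection_threshold(planes, factor=0.75):
    """Eq. (4.1): `factor` times the mean peak magnitude of the correlation
    planes obtained from the N training images."""
    return factor * np.mean([np.abs(p).max() for p in planes])

def rotation_and_scale(plane, g):
    """Eqs. (4.2)-(4.3): recover scale ratio and in-plane rotation (radians)
    from the offset of the peak from the centre of a logmap-space plane."""
    cu, cv = plane.shape[0] // 2, plane.shape[1] // 2
    u_pk, v_pk = np.unravel_index(np.argmax(np.abs(plane)), plane.shape)
    u_off, v_off = u_pk - cu, v_pk - cv          # horizontal / vertical peak offsets
    return np.exp(u_off * g), v_off * g          # (ScaleRatio, Rotation)
```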
The normal method of creating the logmap from each sub-image is very slow compared to the other stages of Fourier transforming the logmap, multiplying it by the filter and Fourier transforming again to create the correlation plane. This was especially true due to the fact that a subdivision level of 16 was chosen in the interpolation algorithm in order to create a very smooth logmap (i.e. each image pixel was divided into 4 by 4 smaller pixels to take into account pixels that fall between logmap regions). However, the logmap geometry used on each sub-image is the same and it is therefore possible to remove the redundancy of performing the interpolation geometry calculations for each sub-image by employing a look-up table. The look-up table is an array, each element of which corresponds to a particular logmap region. Each element contains a list of image pixels that contribute to that particular logmap region and the fraction they contribute. The intensity of each logmap pixel can then be computed simply by summing the intensity of its contributing parts without having to calculate what the contributing parts are. Using this system makes the creation of a logmap for each sub-image much faster. The system was able to process approximately 10 sub-images per second, including logmap creation, two Fourier transforms, correlation peak location and classification, running on a 3GHz Pentium using simulation code written and executed in MATLAB. The parameters used in the MACH filter function were set to α=0.01, β=0.3 and γ=0.1. These values were set after a long period of testing with different values to correctly balance the discriminating ability against the sharpness of the correlation peak and the general performance of the filter.
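The moving-window search itself can be sketched as follows (an illustration, not the MATLAB implementation): the window is raster-scanned over the input, each sub-image is log-mapped by whatever routine plays the role of the lookup-table mapping (passed in here as the callable `to_logmap`, a hypothetical interface), correlated with the frequency-domain filter `h`, and classified against the threshold of equation 4.1:

```python
import numpy as np

def scan(image, win, step, to_logmap, h, threshold):
    """Raster-scan `image` with a `win`-wide moving window, log-map each
    sub-image with `to_logmap`, correlate it with the frequency-domain
    filter `h`, and keep windows whose peak exceeds `threshold`.
    Returns (x, y, peak) triples for the windows classified in-class.
    """
    hits = []
    H, W = image.shape
    for y in range(0, H - win + 1, step):
        for x in range(0, W - win + 1, step):
            lm = to_logmap(image[y:y + win, x:x + win])
            peak = np.abs(np.fft.ifft2(np.fft.fft2(lm) * h)).max()
            if peak > threshold:                 # classify sub-image as in-class
                hits.append((x, y, peak))
    return hits
```

In the paper's system the step would be small and `to_logmap` would be the precomputed lookup-table mapping, so the per-window cost is dominated by the two Fourier transforms.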
4.1 Performance Criteria To quantify the performance of the filter accurately it was necessary to calculate some basic measures of the quality of the correlation plane for detected targets. The most basic measure of the target correlation peak is the correlation output peak intensity (COPI). This refers to the maximum intensity value of the correlation output plane, i.e. the peak intensity. COPI is defined as 18:
COPI = max(x,y) { |C(x, y)|² }    (4.4)
where C(x, y) is the output correlation amplitude value at position (x, y). A filter that produces a high COPI value has good performance and better detectability. To provide the best detectability and performance it is necessary for a filter to yield a sharp correlation peak as well as a high COPI value, while keeping side peaks to a minimum. The filter's ability to do this can be measured using the peak-to-correlation energy (PCE) measure. The basis of the PCE measure is that the correlation peak intensity should be as high as possible while the overall correlation energy in the plane should be as low as possible. The PCE ratio is defined as 18:
PCE = (COPI − μ) / σ    (4.5)

where μ = ∑(x,y) |C(x, y)|² / (NxNy) is the mean value of the correlation output plane intensity and σ = { ∑(x,y) [ |C(x, y)|² − μ ]² / (NxNy − 1) }^(1/2) is its standard deviation over the Nx × Ny correlation plane.
A high PCE value therefore implies that the filter performs well.
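Equations 4.4 and 4.5 translate directly into code; the sketch below assumes the correlation plane is a (possibly complex) 2-D array:

```python
import numpy as np

def copi(plane):
    """Correlation output peak intensity, eq. (4.4)."""
    return (np.abs(plane) ** 2).max()

def pce(plane):
    """Peak-to-correlation energy, eq. (4.5): peak intensity above the mean
    plane intensity, in units of the plane-intensity standard deviation."""
    inten = np.abs(plane) ** 2
    mean = inten.mean()
    std = np.sqrt(((inten - mean) ** 2).sum() / (inten.size - 1))
    return (inten.max() - mean) / std
```

A plane that is flat except for a single sharp peak scores a high PCE, while a plane full of comparable side peaks scores a low one, matching the intent described above.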
4.2 Results 4.2.1 Correlation with training image The first test to be carried out was a basic correlation using one of the training set images as the input image. It was expected that the filter would perform well in this test since the input image was directly associated with the construction of the filter. Figure 4.4 (a) shows the correlation plane output of the filter - trained on car images with viewing angles 0, 10, 20 and 30 degrees and correlated with the 0 degree image. The filter performed well, producing a tall sharp peak with no out-of-class peaks giving a PCE value of 25.9. It should be noted that the position of the peak in the correlation plane does not represent the position of the target object in the input image since correlation is performed in logmap space – the position of the peak therefore represents the rotation angle and log of scale ratio of the detected target object. The correlation plane shown is where the strongest peak is found during scanning of the image using the moving window method described above i.e. centred over the target object. The test was repeated 3 times by correlating the filter with the 10, 20 and 30 degree training images. Similar results were observed with high COPI and PCE values. 4.2.2 Intermediate target object tolerance In a practical implementation of the system the filter would be expected to correctly detect in-class targets even if they were at an angle that did not exactly match any of the training set images. To test the distortion tolerance of the filter it was correlated with an image of the car whose angle fell between the training set images. The filter was trained on car images with viewing angles 0, 5, 10 and 30 degrees and correlated with a 20 degree input image. The results, shown in Figure 4.4 (b), show that the MACH filter’s response was still large and sharp giving a PCE value of 22. There was only a slight reduction in COPI and a slight broadening of the base of the peak. 
The filter still easily maintained its discriminating ability. The test was repeated several times using different training and intermediate images, which produced similar results. 4.2.3 In-plane rotation tolerance To test the filter's tolerance to in-plane rotation of the target object, the filter was correlated with a car image that had been rotated, using bicubic interpolation, by an angle of 30 degrees. Any rotation of the image, other than at multiples of 90 degrees, produces a slightly distorted image since there is not a one-to-one mapping between pixels. This, coupled with the interpolation distortion of the logmap, means that in-plane rotation tolerance is a fairly demanding test for any object recognition system. Several tests were performed using different sets of training images and correlating with different rotated images. Figure 4.4 (c) shows a typical peak produced after training the filter on car images with viewing angles 0, 5, 15 and 20 degrees and correlating with the 0 degree image rotated in-plane by 30 degrees. It can be seen that the MACH filter has performed well and still produces a strong peak, with a PCE value of 24.9, allowing targets to be correctly classified as in-class objects despite their rotation. It can also be seen that the peak has been shifted vertically downwards due to the rotation of the target in the image. The simulation program was able to correctly calculate the rotation of the target given the offset from the centre using equation 4.3 - this was accurate to within a few degrees. To test how the filter responds to in-plane rotation in more detail, an image of the car was rotated by 360 degrees in 1 degree intervals, using bicubic interpolation, log-mapped and then correlated with the MACH filter at each interval. The maximum correlation peak height was found for each interval and plotted against rotation angle (Fig. 4.2). The MACH filter was constructed using car images with viewing angles of 0, 5, 10 and 15 degrees.
The 15 degree image was used as the rotated target. It can be seen that the correlation peak height varies slightly due to interpolation errors in rotating the original image and creating the logmap. The interpolation cannot be perfect due to the pixelation of the images. The peak height does, however, stay well above the detection threshold, calculated from equation 4.1, showing that the filter is fully rotation invariant; this full 360 degree invariance is partly due to the wrap-around nature of the Fourier transform. The high peaks at regular 90 degree intervals are where there is an exact mapping of pixels from the original image to the rotated image, and so interpolation errors do not occur.
Fig. 4.2 Variation of correlation intensity peak height with in-plane rotation variation.
4.2.4 Scale tolerance To test the filter's tolerance to scale changes in the target object, which are directly related to the proximity of the target object to the camera, the filter was correlated with car images that had been scaled by 120% using bicubic interpolation. The scaling interpolation and pixelation of the logmap produce a distorted logmap image in the same way as in the in-plane rotation test. However, with scale variation there is the added factor that any reduction in scale of the target relative to the training images results in a loss of information, which adds to the difficulty of detection. Figure 4.4 (d) shows a typical peak produced when the filter was trained on car images with viewing angles 5, 10, 15 and 20 degrees and correlated with the 15 degree image scaled by 120%. The MACH filter maintained its discrimination ability, giving typical PCE values of around 22. The peak had been shifted sideways slightly due to the scale change, and from this the simulation program was able to calculate the scale ratio to within a few percent using equation 4.2. Once again, a test was carried out to examine the response to distortion in more detail by varying the distortion gradually and calculating the corresponding peak height at each interval. A car image was scaled from 0% to 300% in 1% intervals, using bicubic interpolation, log-mapped and then correlated with the MACH filter at each interval. The maximum correlation intensity peak height was found for each scale value and plotted against scale (Fig. 4.3). The MACH filter was constructed using car images with viewing angles of 5, 10, 15 and 20 degrees. The 15 degree image was used as the scaled target. It can be seen from Fig. 4.3 that the correlation peak height is maximum at 100%, when the target scale matches the scale of the training images used to construct the filter. Either side of this the peak height decreases, as expected.
The detection range of the filter is between 63% and 221.5% and is shown as dotted lines on the graph. However, this range would decrease as other distortions of the target object are included. It can be seen that the peak height drops off more sharply on the < 100% side of the graph, which means that the filter is less tolerant to reductions in scale. This was expected, since information is lost in the case of reduced targets but not in the case of enlarged targets. The slight undulation in the graph is due to the pixelation of the logmap and could be reduced by increasing the resolution of the logmap at the expense of an increase in computational time.
Fig. 4.3 Variation of correlation intensity peak height with scale variation.
4.2.5 Background clutter tolerance The filter was tested with a car image that had been superimposed onto a background image (Fig. 4.6 (a)) to test the filter's immunity to cluttered and noisy backgrounds. Figure 4.5 (a) shows a typical peak produced when the filter was trained on car images with viewing angles 0, 10, 15 and 20 degrees and correlated with the 15 degree image superimposed on a background scene. Again, the filter performed well and produced strong enough peaks to correctly classify target objects, with PCE values around 23. There is an increase in out-of-class peaks compared to targets tested against plain backgrounds, but their amplitude is very low compared to the peak corresponding to the target object. This result was expected, since the MACH filter was designed to be immune to cluttered scenes. 4.2.6 Worst case scenario The final test combined all possible distortions to see how the filter would perform in a worst case scenario. The filter was trained on car images with viewing angles 5, 10, 15 and 20 degrees and correlated with the 15 degree image scaled by 120%, rotated by 70 degrees and superimposed on a background scene (Fig. 4.6 (b)). Figure 4.5 (b) shows that the performance of the filter has decreased compared to previous tests. However, considering it is such a severe test, it still produces a fairly strong in-class peak that is much larger than the surrounding out-of-class peaks and still greater than the detection threshold.
(a)
(b)
(c)
(d)
Fig. 4.4 MACH filter correlated with: one of the training set images (a), an image intermediate between the training set views (b), a training set image rotated by 30 degrees (c), and a training set image scaled by 120% (d). The transparent planes show the detection threshold.
(a)
(b)
Fig. 4.5 MACH filter correlated with: one of the training set images superimposed on a background scene (a), and one of the training set images scaled by 120%, rotated by 70 degrees and superimposed onto a background scene (b).
(a)
(b)
Fig. 4.6 Car image superimposed onto a background scene (a) and car image scaled by 120%, rotated by 70 degrees and superimposed onto a background scene (b).
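The PCE figures quoted in the tests above can be computed from the correlation plane in the style of Kumar and Hassebrook's performance measures (reference 18). The exact normalisation used in the paper is not stated, so the per-pixel mean energy used in this sketch is an assumption:

```python
import numpy as np

def pce(corr_plane):
    """Peak-to-correlation-energy of a correlation output plane.

    Defined here as the squared peak magnitude divided by the mean energy
    per pixel of the plane; sharper, stronger peaks over a flat background
    give larger values. The normalisation is an assumption, so absolute
    values need not match the ~23 reported in the text.
    """
    c = np.abs(np.asarray(corr_plane, dtype=float))
    peak = c.max()
    mean_energy = np.mean(c ** 2)
    return peak ** 2 / mean_energy
```

Under this definition a perfectly flat plane scores 1.0 and an ideal delta-function peak scores the number of pixels in the plane, so thresholding PCE separates sharp in-class peaks from diffuse clutter responses.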
5. CONCLUSIONS
By combining two image processing techniques it was possible to create an object recognition system capable of detecting a target object in a background scene despite any kind of geometrical distortion. Such a system is useful in practical applications where the position and orientation of the target object or camera is unknown.

The implementation of a MACH filter gave high performance, providing tolerance to background clutter and noise while also providing invariance to the orientation of the target object. Correlation peaks were classified as in-class or out-of-class by computing the average expected peak intensity, obtained by correlating the filter with each of the reference images. This made object detection simple, since the height of a peak could be compared directly to a predefined threshold.

Invariance to in-plane rotation and scale of the target was successfully achieved by log r-θ mapping the training set images and the input image prior to filter synthesis and correlation. However, the inclusion of these invariances greatly increased the computational time required to implement the system, since correlation with the filter had to be performed at every location of the input image. This means that implementing the system in software alone is not a viable option for most practical applications. However, since the filter is based on Fourier techniques, it would be possible to implement the system with optical hardware, employing a Spatial Light Modulator (SLM) to transduce the input image at high speed and an optical correlator to filter the sampled images and produce the correlation output. It may also be possible to implement the system in real time using a specialised Digital Signal Processing (DSP) chip set and associated efficient code.
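The peak-thresholding scheme summarised in the conclusions can be sketched as a frequency-domain correlation followed by a threshold derived from the training-set peaks. The function names and the 50% threshold fraction below are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def correlate(image, filt):
    """Frequency-domain correlation of two same-size 2-D arrays: the
    operation an optical correlator or DSP implementation would perform."""
    F = np.fft.fft2(image)
    H = np.conj(np.fft.fft2(filt))
    return np.abs(np.fft.ifft2(F * H))

def detection_threshold(references, filt, fraction=0.5):
    """Set the detection threshold from the mean peak height obtained by
    correlating the filter with each training reference image, as described
    in the conclusions. The fraction is an assumed working value."""
    peaks = [correlate(ref, filt).max() for ref in references]
    return fraction * float(np.mean(peaks))
```

Any correlation peak exceeding the threshold is declared in-class; since the whole pipeline reduces to forward and inverse Fourier transforms, it maps naturally onto the SLM-based optical correlator mentioned above.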
REFERENCES
1. D. Casasent and D. Psaltis, "Position, rotation, and scale invariant optical correlation", Applied Optics, Vol. 15, No. 7, 1795-1799 (1976).
2. K. Mersereau and G. Morris, "Scale, rotation, and shift invariant image recognition", Applied Optics, Vol. 25, No. 14, 2338-2342 (1986).
3. H. J. Caulfield and W. Maloney, "Improved discrimination in optical character recognition", Applied Optics, Vol. 8, No. 11, 2354-2356 (1969).
4. C. F. Hester and D. Casasent, "Multivariant technique for multiclass pattern recognition", Applied Optics, Vol. 19, 1758-1761 (1980).
5. Z. Bahri and B. V. K. Kumar, "Generalized synthetic discriminant functions", Journal of the Optical Society of America, Vol. 5, No. 4, 562-571 (1988).
6. A. Mahalanobis, B. V. K. Vijaya Kumar, S. Song, S. R. F. Sims and J. F. Epperson, "Unconstrained correlation filters", Applied Optics, Vol. 33, 3751-3759 (1994).
7. C. F. R. Weiman and G. Chaikin, "Logarithmic spiral grids for image processing and display", Computer Graphics and Image Processing, Vol. 11, 197-226 (1979).
8. G. Sandini and P. Dario, "Active vision based on space-variant sensing", in: 5th International Symposium on Robotics Research, 75-83 (1989).
9. E. L. Schwartz, D. Greve and G. Bonmasser, "Space-variant active vision: definition, overview and examples", Neural Networks, Vol. 8, No. 7/8, 1297-1308 (1995).
10. C. F. R. Weiman, "3-D sensing with polar exponential sensor arrays", in: Digital and Optical Shape Representation and Pattern Recognition, Proc. SPIE Conf. on Pattern Recognition and Signal Processing, Vol. 938, 78-87 (1988).
11. C. F. R. Weiman, "Exponential sensor array geometry and simulation", in: Digital and Optical Shape Representation and Pattern Recognition, Proc. SPIE Conf. on Pattern Recognition and Signal Processing, Vol. 938, 129-137 (1988).
12. C-G. Ho, R. C. D. Young and C. R. Chatwin, "Sensor Geometry and Sampling Methods for Space-Variant Image Processing", Journal of Pattern Analysis and Applications, Vol. 5, 369-384 (2002).
13. I. Kypraios, R. C. D. Young, P. Birch and C. Chatwin, "Object recognition within cluttered scenes employing a Hybrid Optical Neural Network (HONN) filter", Optical Engineering, Special Issue on Trends in Pattern Recognition, Vol. 43, No. 8, 1839-1850 (2004).
14. Ph. Refregier, "Optimal trade-off filters for noise robustness, sharpness of the correlation peak and Horner efficiency", Optics Letters, Vol. 16, No. 11, 829-831 (1991).
15. H. Zhou and T.-H. Chao, "MACH filter synthesising for detecting targets in cluttered environment for gray-scale optical correlator", Proc. SPIE, Vol. 3715, 394-398 (1999).
16. B. V. K. Kumar, "Minimum Variance Synthetic Discriminant Function", Journal of the Optical Society of America A, Vol. 3, 1579-1584 (1986).
17. A. Mahalanobis, B. V. K. Vijaya Kumar and D. Casasent, "Minimum average correlation energy filters", Applied Optics, Vol. 26, No. 17, 3633-3640 (1987).
18. B. V. K. Kumar and L. Hassebrook, "Performance measures for correlation filters", Applied Optics, Vol. 29, 2997-3006 (1990).