Correspondence Using Distinct Points Based on Image Invariants

K. N. Walker, T. F. Cootes and C. J. Taylor
Dept. Medical Biophysics, Manchester University, UK
email: [email protected]

Abstract
We present a method, based on the idea of distinctive points, for locating point correspondences between two images of similar objects independently of scale, orientation and position. Distinctive points are those which have a low probability of being mistaken for other points, and are therefore more likely to be correctly located in a similar image. The local image structure at each image point is described by vectors of Cartesian differential invariants computed at a range of scales. Distinctive points lie in low density regions of the distribution of all vectors of invariants found in an image. The vectors of invariants of distinct points are used to locate similar points in a second image. Results of applying this technique to find correspondences between images of faces are shown.
1 Introduction

A common problem in computer vision is that of establishing correspondences between images of similar objects. For images of objects with fixed 3D geometry five correspondences are, in principle, sufficient. For more variable classes of object, such as faces, a much larger number of correspondences may be required. Previous approaches have tended to rely on matching image features chosen by the system designer [4][2]. We have explored the idea of automatically selecting features (and their scales) which are unlikely to be confused with other points.

The approach uses the differential structure of the image to construct a vector of invariants over a range of scales at each image point. These vectors describe the local geometry around a point at a particular scale independently of translation and orientation. In order to locate distinctive points we estimate the probability density distribution of the vectors of invariants representing candidate points over a range of scales, and choose those points whose vectors lie in low probability regions. The vector of invariants describing a distinct point in one image can then be used to search for similar points in a second image by comparing the distinct point's vector with each of the vectors extracted from all points and scales in the second image.

In the following we describe how to construct the vectors of invariants, how to determine distinctive vectors, and show how they can be used to locate corresponding features in a new image.
2 Background

Many people have studied the problem of finding point correspondences between images using vectors of local image descriptors. Lades et al. [8] used vectors containing the responses of Gabor wavelets at different wavelengths to measure local image structure. An object was then represented by a set of these vectors extracted at the vertices of a sparse grid placed over the object of interest. The vertices of the grid are unlikely to fall over the most distinctive points, so when searching for the grid in a similar image only the vertices which happen to lie near distinctive points are found accurately. Triesch et al. [13] tackle this problem by extracting the vectors from the vertices of a graph which is placed manually over the object. The vertices are chosen to coincide with heavily textured positions (i.e. distinctive points). Our approach improves on this by providing an automatic, principled way of locating these points.

Several groups have used statistical models of the intensities in a patch ('eigen-features') to locate features [4][10]. Many authors have shown that using the distinctiveness of image features can improve the robustness of object recognition algorithms [1][14][11], but this has typically been applied to finding distinctive segments on an object boundary. Florack et al. [6] describe how Cartesian differential invariants can be constructed from Gaussian differential operators which describe the local image geometry independently of spatial position and orientation.
3 Cartesian Differential Invariants

Cartesian differential invariants describe the differential structure of an image independently of the chosen Cartesian coordinate system. To construct these invariants we first extract the differential structure of the image. This is done using a scaled set of differential operators. The zeroth order operator, on which the rest of the operators are based, is the normalised isotropic Gaussian. The higher order operators are generated from derivatives of the Gaussian, allowing us to study the differential structure of the image up to any order using a set of discrete convolution filters. The normalised Gaussian kernel is given by:
$$G(\mathbf{x}; \sigma) = \frac{1}{(2\pi\sigma^2)^{D/2}} \, e^{-\frac{\mathbf{x}\cdot\mathbf{x}}{2\sigma^2}} \qquad (1)$$
where $\sigma$ is the scale and $D$ is the number of dimensions. The first order kernels are:
$$G_x(\mathbf{x}; \sigma) = \frac{\partial}{\partial x} G(\mathbf{x}; \sigma) \qquad G_y(\mathbf{x}; \sigma) = \frac{\partial}{\partial y} G(\mathbf{x}; \sigma)$$

The second order kernels are:

$$G_{xx}(\mathbf{x}; \sigma) = \frac{\partial^2}{\partial x^2} G(\mathbf{x}; \sigma) \qquad G_{yy}(\mathbf{x}; \sigma) = \frac{\partial^2}{\partial y^2} G(\mathbf{x}; \sigma) \qquad G_{xy}(\mathbf{x}; \sigma) = \frac{\partial^2}{\partial x \, \partial y} G(\mathbf{x}; \sigma)$$
Invariant | Order | Manifest invariant notation
I1 | 1 | $L_i L_i$
I2 | 2 | $L_i L_{ij} L_j$
I3 | 2 | $L_{ii} L_j L_j - L_{ij} L_i L_j$
I4 | 2 | $\epsilon_{ij} L_{jk} L_i L_k$
I5 | 3 | $\epsilon_{ij} (L_{jkl} L_i L_k L_l - L_{jkk} L_i L_l L_l)$
I6 | 3 | $L_{iij} L_j L_k L_k - L_{ijk} L_i L_j L_k$
I7 | 3 | $\epsilon_{ij} L_{jkl} L_i L_k L_l$
I8 | 3 | $L_{ijk} L_i L_j L_k$

Table 1: The set of 2D polynomial invariants to third order expressed in tensorial manifest invariant notation.
and so on. The set of all such filters completely determines the local image structure. Notice that each filter has a single parameter, $\sigma$, defining the scale of the filter.

The response of the above filters depends on the choice of Cartesian coordinate frame. However, certain combinations of filter responses can be shown to be independent of the choice of coordinate frame. These combinations are known as Cartesian differential invariants. Table 1 shows a canonical set of eight two-dimensional polynomial invariants to third order, expressed in tensorial manifest invariant index notation [12] (Appendix A). In order to ensure that the invariants are independent of the image's contrast, each tensor is divided by the L tensor (i.e. by the intensity).

An obvious question to address, before basing any scheme on the use of differential quantities, is their reliability in the presence of noise. We have tested the invariants to see how they respond to noise. Typical results for a set of face images similar to those in Figures 2 and 5 are shown in Figure 1. We defined the error in response as:
$$\mathrm{error}^2 = \frac{\sum_{i=1}^{n} \left( R(\mathbf{x}_i; s) - R(\mathbf{x}_i; 0) \right)^2}{n \, s_r^2}$$

where $R(\mathbf{x}_i; s)$ is the response of the invariant at point $\mathbf{x}_i$ given an image with random Gaussian noise of standard deviation $s$, $n$ is the number of points sampled and $s_r$ is the standard deviation of all responses with no noise. The image intensities were in the range 0 to 255. The results show that even for quite large noise amplitudes, the change in invariant values is small compared to the intrinsic variation across the image.
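As a concrete illustration of how the filter responses and invariants might be computed, the sketch below (our own, not the authors' implementation) uses scipy's derivative-of-Gaussian filters to obtain $L$, $L_x$, $L_y$, $L_{xx}$, $L_{xy}$ and $L_{yy}$ at a single scale and combines them into the first two invariants $I_1 = L_i L_i$ and $I_2 = L_i L_{ij} L_j$. The function names and the small eps guard are assumptions; the division by $L$ follows the contrast normalisation described above.

```python
# Minimal sketch: scaled Gaussian derivative responses and the first two
# invariants of Table 1 at a single scale sigma, for a 2D greyscale image.
import numpy as np
from scipy.ndimage import gaussian_filter

def derivative_responses(image, sigma):
    """Return the L tensors up to second order at scale sigma."""
    L   = gaussian_filter(image, sigma, order=(0, 0))
    Ly  = gaussian_filter(image, sigma, order=(1, 0))   # d/dy (first axis)
    Lx  = gaussian_filter(image, sigma, order=(0, 1))   # d/dx (second axis)
    Lyy = gaussian_filter(image, sigma, order=(2, 0))
    Lxy = gaussian_filter(image, sigma, order=(1, 1))
    Lxx = gaussian_filter(image, sigma, order=(0, 2))
    return L, Lx, Ly, Lxx, Lxy, Lyy

def first_invariants(image, sigma, eps=1e-6):
    """I1 = Li Li and I2 = Li Lij Lj, each tensor normalised by L."""
    L, Lx, Ly, Lxx, Lxy, Lyy = derivative_responses(image, sigma)
    # divide each derivative tensor by the intensity L (contrast invariance)
    Lx, Ly, Lxx, Lxy, Lyy = (T / (L + eps) for T in (Lx, Ly, Lxx, Lxy, Lyy))
    I1 = Lx * Lx + Ly * Ly
    I2 = Lx * Lxx * Lx + 2.0 * Lx * Lxy * Ly + Ly * Lyy * Ly
    return I1, I2
```

The remaining invariants of Table 1 follow the same pattern, combining third order derivative responses and the Levi-Civita tensor as in Appendix A.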
4 Locating Corresponding Points with Invariants

The Cartesian differential invariants defined above are invariant to position and orientation, but depend on the choice of a scale parameter $\sigma$. In order to capture image structure at appropriate scales and deal with image magnification it is essential to compute the invariants over a range of scales.
Figure 1: Graphs showing the error in the response of (a) $L_i L_i$, (b) $L_i L_{ij} L_j$ and (c) $L_{iij} L_j L_k L_k - L_{ijk} L_i L_j L_k$ at $\sigma = 6$, plotted against the variance of the Gaussian noise.
We can construct a vector, $\mathbf{v}(\sigma)$, that represents the local image structure around a given image point over a range of $2t + 1$ scales:

$$\mathbf{v}(\sigma) = [\, I_1(k_{-t}\sigma) \; I_2(k_{-t}\sigma) \; \ldots \; I_8(k_{-t}\sigma) \;\; I_1(k_{1-t}\sigma) \; I_2(k_{1-t}\sigma) \; \ldots \; I_8(k_{1-t}\sigma) \;\; \ldots \;\; I_1(k_{t}\sigma) \; I_2(k_{t}\sigma) \; \ldots \; I_8(k_{t}\sigma) \,]^T$$

where $\sigma$ is the geometric mean scale, $k_i = k^i$, $k$ is the base of a logarithmic series which determines the scales to be sampled, and $I_j(s)$ is the response of the $j$-th invariant filter at scale $s$.
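An illustrative sketch of assembling this vector is shown below; it is not the authors' code, and it assumes a helper `invariants(image, s)` that returns the eight invariant images $I_1 \ldots I_8$ at scale $s$ (for example, built along the lines of the sketch in Section 3). The values of `k` and `t` are example defaults only.

```python
# Sketch: all eight invariants evaluated at the 2t+1 scales
# k^(-t)*sigma ... k^t*sigma around the geometric mean scale sigma.
import numpy as np

def invariant_vector(image, x, y, sigma, invariants, k=1.5, t=2):
    """Return v(sigma) for pixel (x, y): 8 * (2t + 1) invariant responses."""
    v = []
    for i in range(-t, t + 1):
        scale = (k ** i) * sigma
        for I in invariants(image, scale):     # I1(scale) ... I8(scale)
            v.append(I[y, x])                  # response at this pixel
    return np.array(v)
```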
We can use the vector of invariants for a chosen point in one image to locate similar points in a second image. For every point $(x, y)$ in the second image we generate several vectors, $\mathbf{v}_{(x,y)}(\sigma_1) \ldots \mathbf{v}_{(x,y)}(\sigma_{n_s})$, one at each of a range of base scales $\sigma_i$. We then compare each in turn with the vector for the original point. Two vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ are compared using the metric:

$$M(\mathbf{v}_1, \mathbf{v}_2) = (\mathbf{v}_1 - \mathbf{v}_2)^T S^{-1} (\mathbf{v}_1 - \mathbf{v}_2)$$

where $S$ is the covariance matrix of the distribution of vectors of invariants in the first image ($M$ is a Mahalanobis distance). A similarity image can then be constructed by selecting the similarity value of the best matching scale for each pixel (the smaller $M$, the greater the similarity). This shows the similarity with respect to spatial position. The final matches can then be found by selecting a set of the lowest troughs in the similarity image.

An example is shown in Figures 2, 3 and 5. In Figure 2 we attempt to find points similar to point A from Figure 5. Figure 3 shows the similarity image calculated during the search; the peaks of this image are superimposed on the search image (bright areas are those most similar to A). The size of the superimposed points indicates the scale of the match. Note that the algorithm has detected both the correct position and the correct scale for matches with the eyes.
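Building on the previous sketch, the following (again an assumed interface rather than the paper's implementation) constructs the similarity image by taking, at each pixel of the second image, the smallest Mahalanobis distance to the reference vector over a set of base scales. The brute-force pixel loop is for clarity only.

```python
# Sketch: per-pixel similarity search with a Mahalanobis distance.
# `vector_fn` is the invariant_vector helper from the previous sketch,
# S_inv is the inverse covariance of the invariant vectors in image one.
import numpy as np

def similarity_search(image2, v_ref, S_inv, base_scales, vector_fn):
    """Return per-pixel minimum Mahalanobis distance and the best scale."""
    h, w = image2.shape
    best_d = np.full((h, w), np.inf)
    best_scale = np.zeros((h, w))
    for sigma in base_scales:
        for y in range(h):
            for x in range(w):
                d = v_ref - vector_fn(image2, x, y, sigma)
                m = d @ S_inv @ d              # Mahalanobis distance
                if m < best_d[y, x]:
                    best_d[y, x], best_scale[y, x] = m, sigma
    # final matches: the lowest troughs of best_d (smaller = more similar)
    return best_d, best_scale
```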
Figure 2: The results of the search for point A in Figure 5.
Figure 3: The similarity image corresponding to Figure 2.
A problem exists with the search algorithm due to the size of the scale sampling steps. In the space of all vectors of invariants, each pixel is associated with a scale string whose shape depends on how the pixel's differential structure changes with scale. Figure 4 shows an example of such a scale string. At present we sample this scale string on a logarithmic scale. We have shown that the distance between vectors of invariants at neighbouring scales is larger than the average separation between samples from different points. This can cause a mismatch if the target scale falls halfway between the sampled scales. The proposed solution is to interpolate between points in order to form a better approximation of the actual scale string, but this is beyond the scope of this paper.
5 Locating Distinct Points

In the example used above we searched in one image for points similar to one selected from a second image. For some choices of point in the second image (e.g. points on the cheeks) we would have many possible matches in the first image. In order to simplify the process of establishing correspondences we would like to find matches between the most distinctive points. Distinct points are those which are least likely to be confused with any other points during a search. Ideally we would like to choose points which occur exactly once in each image we wish to search. As a step towards this we show how to find those points in a single image which are significantly different from all other points in that image.

For every pixel in the image we construct several vectors of invariants, one at each of a range of scales. The full set of vectors forms a multi-variate distribution. If we can model this distribution, we can use it to estimate how likely a given point is to be confused with other points. Distinctive points correspond to vectors which lie in low density regions of the space. In the following we describe how to approximate the distribution and show examples of distinctive points determined by the approach.
Figure 4: The two most significant modes of variation of the space of vectors of invariants, with the scale string of a single pixel highlighted
5.1 Density Estimation

We estimate the local density $\hat{p}$ at a point $\mathbf{x}$ in a distribution by summing the contributions from a mixture of $m$ Gaussians:

$$\hat{p}(\mathbf{x}) = \sum_{i=1}^{m} w_i \, G(\mathbf{x} - \boldsymbol{\mu}_i; \Sigma_i) \qquad (2)$$
We have tried three methods for choosing the parameters.
Kernel method: The Kernel method positions a Gaussian at every sample in the distribution. In this case the parameters are $m = N$, $w_i = \frac{1}{N}$, $\boldsymbol{\mu}_i = \mathbf{x}_i$ and $\Sigma_i = hS$, where $N$ is the number of samples in the distribution, $S$ is the covariance of the whole distribution and $h$ is a scaling factor. Silverman [3] suggests a value for $h$ given by:

$$h = \left( \frac{4}{N(D + 2)} \right)^{\frac{1}{D+4}} \qquad (3)$$

where $D$ is the number of dimensions. This method gives good results but its complexity is of order $n^2$.
Sub-sampled Kernel method: This method approximates the Kernel method by placing Gaussian kernels at a randomly selected $n_s$ of the original $n$ points. Equation (3) suggests that $h$ should increase as $n_s$ decreases. Evaluation of the probability density function is of order $n_s^2$, so this can be much more efficient than the Kernel method, but it comes with some loss of accuracy.

Fitting a mixture model by EM: Expectation Maximisation (EM) is an iterative optimisation technique, originally proposed by Dempster, Laird and Rubin [5], which can optimise the parameters of a mixture of multi-variate Gaussians to best fit a data set [9].
The last two methods allow the number of Gaussians used to be specified, which means the execution time can be controlled.
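The sketch below illustrates equation (2) with the Kernel and Sub-sampled Kernel parameter choices, including Silverman's scaling factor from equation (3). It is a simplified example rather than the authors' implementation: every kernel shares a single covariance $hS$, as in the text, and (3) is applied to the number of kernel centres actually used.

```python
# Sketch: (sub-sampled) kernel density estimate over the set of invariant
# vectors, with Silverman's scaling factor for the shared covariance.
import numpy as np

def kernel_density(samples, n_sub=None, seed=0):
    """Return a function p_hat(x) estimating the density of `samples`."""
    N, D = samples.shape
    if n_sub is None:                                   # full Kernel method
        centres = samples
    else:                                               # Sub-sampled method
        idx = np.random.default_rng(seed).choice(N, n_sub, replace=False)
        centres = samples[idx]
    m = len(centres)
    h = (4.0 / (m * (D + 2))) ** (1.0 / (D + 4))        # Silverman's factor
    S = h * np.cov(samples, rowvar=False)               # shared covariance hS
    S_inv = np.linalg.inv(S)
    norm = 1.0 / (m * np.sqrt(((2.0 * np.pi) ** D) * np.linalg.det(S)))

    def p_hat(x):
        d = centres - x                                 # (m, D) differences
        md2 = np.einsum('ij,jk,ik->i', d, S_inv, d)     # squared Mahalanobis
        return norm * np.exp(-0.5 * md2).sum()

    return p_hat
```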
5.2 Choosing Distinct Points

The density estimate for each vector of invariants $\mathbf{v}(\sigma)$ corresponds directly to the distinctiveness of the pixel at scale $\sigma$: the lower the density, the more distinctive the point. A distinctiveness image can be constructed by setting each pixel to the density of the corresponding vector at the most distinctive scale for that pixel. The most distinctive points are then found by locating the lowest troughs in the distinctiveness image. A scale image can be constructed by plotting the most distinctive scale at each pixel. This shows the scale at which each region of the image is most distinct.

An example of locating distinctive points is shown in Figures 5, 6 and 8. Figure 6 is the distinctiveness image obtained from the image in Figure 5 using the Sub-sampled Kernel method with 50 Gaussians. The bright regions are the most distinct. Figure 8 is the corresponding scale image. The peaks of the distinctiveness image are superimposed on the original image; the size of the points indicates the scale at which the points are distinctive.
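As a small illustration of this final step, the sketch below (assumed names and an illustrative neighbourhood size, not the authors' code) picks the most distinctive points as the lowest local minima of the distinctiveness (density) image and reports them together with the scale recorded in the corresponding scale image.

```python
# Sketch: locate the lowest troughs of a distinctiveness (density) image.
import numpy as np
from scipy.ndimage import minimum_filter

def distinct_points(density, scale_img, size=11, n_points=10):
    """Return (x, y, sigma) for the n_points lowest troughs of `density`."""
    troughs = density == minimum_filter(density, size=size)
    ys, xs = np.nonzero(troughs)
    order = np.argsort(density[ys, xs])[:n_points]    # lowest density first
    return [(int(xs[i]), int(ys[i]), float(scale_img[ys[i], xs[i]]))
            for i in order]
```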
Figure 5: Original image showing some of the most distinctive points.
Figure 6: Distinctiveness image calculated from Figure 5.
The choice of density estimation method will affect the results. In Figure 7 we compare the distinctiveness images obtained using the Kernel method and the Sub-sampled Kernel method with 50 Gaussians. It can be seen that the Sub-sampled Kernel method provides a good approximation to the Kernel method.
Figure 7: (b) is the distinctiveness image of (a) obtained using the Sub-sampled Kernel method, and (c) uses the Kernel method.

When locating globally distinctive points the finer scales should be ignored, because the coarse scale features tend to be the most stable across a set of similar images. For example, human faces have common coarse scale features, such as the eyes, which always look similar, whereas fine scale features such as wrinkles tend to vary significantly between individuals. The coarse scales are also less susceptible to image noise. The finer scale features could be used to locally refine an approximate match made with a coarse scale distinct point, but this is beyond the scope of this paper.
6 Discussion

We have described how vectors of image invariants can be used to locate matches between one image and another, and how we can use a probabilistic measure of distinctiveness to determine those points least likely to generate false positive matches. A single match constrains the position and scale of the object in the second image, but not its orientation. By using the matches from several distinct points we anticipate that we can achieve a robust correspondence between objects in two images. Cootes et al. [4] demonstrated how statistical models of the relative positions of points could be used to select the most plausible set of responses. We intend to use this approach with our feature detectors.

So far we have only considered locating distinctive points in a single image, which means there is no guarantee that similar points exist in similar images. We intend to generalise the algorithm to work on sets of images, discarding distinctive points which are not present in the majority. We anticipate that the method of detecting distinctive points and locating their positions in new images will prove useful both for generating cues to prime a shape model for search, and for helping to automatically train a statistical shape model by locating common landmarks on a set of training images.
Figure 8: Scale image corresponding to the distinctiveness image in Figure 6. Dark regions indicate fine scales and light regions indicate coarse scales.
Appendix A: Expanding Tensorial Notation

Tensorial manifest invariant index notation is the most common notation used to express invariants. $L_{ij}$ is the result of applying filter $G_{ij}$ to an image, hence $L_{ij}$ is a second order tensor. A polynomial invariant of the $L$ terms can be constructed by expanding the manifest invariant notation using Einstein summation [7]. For example, the second order invariant $L_i L_{ij} L_j$ expands as follows in two dimensions:
$$L_i L_{ij} L_j = L_x L_{xx} L_x + L_x L_{xy} L_y + L_y L_{yx} L_x + L_y L_{yy} L_y = L_{xx} L_x^2 + 2 L_x L_{xy} L_y + L_{yy} L_y^2$$

As well as the $L$ tensors, $\epsilon_{ij}$ represents a constant tensor known as Epsilon or the Levi-Civita tensor. In $D$ dimensions Epsilon has $D$ indices and is defined as follows:
$$\epsilon_{i_1 \ldots i_D} = \begin{cases} +1 & \text{if } (i_1, \ldots, i_D) \text{ is an even permutation of } (1, \ldots, D) \\ -1 & \text{if } (i_1, \ldots, i_D) \text{ is an odd permutation of } (1, \ldots, D) \\ 0 & \text{otherwise} \end{cases}$$
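As a quick numerical check of the expansion above (illustrative only; the derivative images could come from the sketch in Section 3), the following evaluates $L_i L_{ij} L_j$ at every pixel by explicit Einstein summation with numpy.einsum.

```python
# Sketch: Einstein summation of Li Lij Lj over i, j in {x, y}.
import numpy as np

def second_order_invariant(Lx, Ly, Lxx, Lxy, Lyy):
    """Li Lij Lj = Lxx Lx^2 + 2 Lx Lxy Ly + Lyy Ly^2 at every pixel."""
    Li = np.stack([Lx, Ly])                      # shape (2, H, W)
    Lij = np.array([[Lxx, Lxy], [Lxy, Lyy]])     # shape (2, 2, H, W)
    return np.einsum('i...,ij...,j...->...', Li, Lij, Li)
```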