Journal of Earth Science and Engineering 2 (2012) 310-316

Registration of Multi Time Images Using SIFT (Scale Invariant Feature Transformation)

Yehia Hassan Miky Hassan 1, Jacob Odeh Ehiorobo 2, Raphael Ehigiator Irughe 3 and Oyenmwen Mabel Ehigiator 4

1. Faculty of Engineering, South Valley University, Aswan 81718, Egypt
2. Geomatics Engineering Unit, Department of Civil Engineering, Faculty of Engineering, University of Benin, Benin City 300283, Nigeria
3. Department of Engineering Geodesy, Siberian State Academy of Geodesy, Novosibirsk 630108, Russian Federation
4. Faculty of Basic Sciences, Benson-Idahosa University, Benin City 300252, Nigeria

Received: April 12, 2012 / Accepted: May 4, 2012 / Published: May 20, 2012.

Abstract: Image registration determines the point-to-point correspondences between two images of the same scene acquired by different sensors, or by the same sensor at different times or with different parameters. The crucial point of feature-based methods is to adopt discriminative and robust feature descriptors that are invariant to the assumed differences between the two input images; the extraction of invariant features is therefore critical to the registration result. This paper presents a method for image feature generation called SIFT (scale invariant feature transformation). In this study, the images were smoothed in MatLab 7 by applying the Gaussian function. Local maxima and minima of the difference-of-Gaussian images were detected, and a detailed fit to the nearby data for location, scale and ratio of principal curvatures was carried out. Tests were carried out using a variety of decision techniques to label a keypoint as a match or otherwise. The results show that SIFT transforms an image into a large collection of local feature vectors, each of which is invariant to image translation, scaling and rotation.

Key words: Image registration, object recognition, Gaussian function, affine transformation.

Corresponding author: Jacob Odeh Ehiorobo, Ph.D., main research fields: GNSS and geodetic positioning, deformation surveys and analysis, engineering and construction surveys, remote sensing, GIS and environmental hazards analysis. E-mail: [email protected].

1. Introduction

Object recognition in cluttered real-world scenes requires local image features that are unaffected by nearby clutter or partial occlusion. The features must be at least partially invariant to illumination, projective transforms, and common object variations. On the other hand, the features must also be sufficiently distinctive to identify specific objects among many alternatives. The difficulty of the object recognition problem is due in large part to the lack of success in finding such image features. However, recent research on the use of dense local features, e.g., Ref. [1], has shown that efficient recognition can often be achieved by using local image descriptors sampled at a large number of repeatable locations.

The scale-invariant features are efficiently identified by using a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The major stages of computation used to generate the set of image features are:

• Scale-space extrema detection: The first stage of computation searches over all scales and image locations. It is implemented efficiently by using a difference-of-Gaussian function to identify potential interest points that are invariant to scale and orientation;
• Keypoint localization: At each candidate location, a detailed model is fit to determine location and scale. Keypoints are selected based on measures of their stability;
• Orientation assignment: One or more orientations are assigned to each keypoint location based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations;
• Keypoint descriptor: The local image gradients are measured at the selected scale in the region around each keypoint. These are transformed into a representation that allows for significant levels of local shape distortion and change in illumination.

This approach has been named SIFT (scale invariant feature transformation), as it transforms image data into scale-invariant coordinates relative to local features.

2. SIFT Keypoint Description

A SIFT keypoint, as shown in Fig. 1, is a circular image region with an orientation. It is described by a geometric frame of four parameters: the keypoint center coordinates x and y, its scale (the radius of the region), and its orientation (an angle expressed in radians). The SIFT detector is invariant to translation, rotation, and rescaling of the image.

Fig. 1  Description of image keypoint.

Searching for keypoints at multiple scales is accomplished by constructing a so-called "Gaussian scale space" [2]. The scale space is simply a collection of images obtained by progressively smoothing the input image, which is analogous to gradually reducing the image resolution. Conventionally, the smoothing level is called the scale of the image.

2.1 Detection of Scale-Space Extrema

As described in the introduction, we will detect keypoints using a cascade filtering approach that uses efficient algorithms to identify candidate locations that are then examined in further detail. The first stage of keypoint detection is to identify locations and scales that can be repeatably assigned under differing views of the same object. Detecting locations that are invariant to scale change of the image can be accomplished by searching for stable features across all possible scales, using a continuous function of scale known as scale space [3]. It has been shown by Koenderink [4] and Lindeberg [5, 6] that, under a variety of reasonable assumptions, the only possible scale-space kernel is the Gaussian function. Therefore, the scale space of an image is defined as a function, $L(x, y, \sigma)$, that is produced from the convolution of a variable-scale Gaussian, $G(x, y, \sigma)$, with an input image, $I(x, y)$:

    $L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$    (1)

where $*$ is the convolution operation in x and y, and

    $G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/2\sigma^2}$    (2)

To efficiently detect stable keypoint locations in scale space, Lowe [2] proposed using scale-space extrema in the difference-of-Gaussian function convolved with the image, $D(x, y, \sigma)$, which can be computed from the difference of two nearby scales separated by a constant multiplicative factor k:

    $D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y)$    (3)

or equivalently, using Eq. (1),

    $D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)$    (4)
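As a concrete illustration of Eqs. (1)-(4), the following minimal sketch computes L at two nearby scales and their difference D. It uses Python with NumPy and SciPy as an assumption; the study itself used MatLab 7, and the function name is illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussian(image, sigma, k=np.sqrt(2)):
    """D(x, y, sigma) of Eqs. (3)-(4): the difference of two Gaussian
    smoothings of the input image at nearby scales sigma and k*sigma."""
    img = image.astype(np.float64)
    L_lo = gaussian_filter(img, sigma)        # L(x, y, sigma), Eq. (1)
    L_hi = gaussian_filter(img, k * sigma)    # L(x, y, k*sigma)
    return L_hi - L_lo                        # Eq. (4): simple image subtraction
```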

It is a particularly efficient function to compute, since the smoothed images L are needed in any case for scale-space feature description, and D can therefore be computed by simple image subtraction. As shown in Fig. 2, for each octave of scale space, the initial image is repeatedly convolved with Gaussians to produce the set of scale-space images shown on the left. Adjacent Gaussian images are subtracted to produce the difference-of-Gaussian images on the right. After each octave, the Gaussian image is down-sampled by a factor of 2, and the process is repeated.

Fig. 2  Scale space images shown on the left, difference-of-Gaussian images on the right.

For key localization using the MatLab 7 program, the input image is first convolved with the Gaussian function using $\sigma = \sqrt{2}$ to give an image A. This is then repeated a second time with a further incremental smoothing of $\sigma = \sqrt{2}$ to give a new image, B, which now has an effective smoothing of $\sigma = 2$. The difference-of-Gaussian image is obtained by subtracting image B from image A, resulting in a ratio of $2/\sqrt{2} = \sqrt{2}$ between the two Gaussians. To generate the next pyramid level, we resample the already smoothed image B using a relative scale of $\sigma = \sqrt{2}$. Fig. 3 shows the scale-space images for the first octave with $\sigma = (1, \sqrt{2}, 2, 2\sqrt{2}, 4, 4\sqrt{2})$, and Fig. 4 shows the effect of the number of scale samples per octave on the number of keypoints per image.

Fig. 3  Scale space images for the first octave with $\sigma = (1, \sqrt{2}, 2, 2\sqrt{2}, 4, 4\sqrt{2})$.

Fig. 4  Effect of number of scale samples per octave on the number of keypoints per image.
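The octave construction described above can be sketched as follows, again in Python. The $\sqrt{2}$ incremental smoothing, the subtraction of adjacent Gaussian images, and the down-sampling of the already smoothed image B by a factor of 2 follow the text; the function and parameter names are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_dog_pyramid(image, n_octaves=4, levels_per_octave=6, sigma0=np.sqrt(2)):
    """Gaussian and difference-of-Gaussian pyramids, with adjacent levels
    separated by a constant factor k = sqrt(2) as described in the text."""
    k = np.sqrt(2)
    base = image.astype(np.float64)
    gaussians, dogs = [], []
    for _ in range(n_octaves):
        # Progressive smoothing: level i has smoothing sigma0 * k**i,
        # i.e. sqrt(2), 2, 2*sqrt(2), ... as in Fig. 3.
        octave = [gaussian_filter(base, sigma0 * k**i)
                  for i in range(levels_per_octave)]
        gaussians.append(octave)
        # Adjacent Gaussian images are subtracted to give the DoG images.
        dogs.append([hi - lo for lo, hi in zip(octave[:-1], octave[1:])])
        # Resample the already smoothed image B (effective sigma = 2)
        # by a factor of 2 to start the next octave.
        base = octave[1][::2, ::2]
    return gaussians, dogs
```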

2.2 Keypoint Localization

In order to detect the local maxima and minima of $D(x, y, \sigma)$, each sample point is compared to its eight neighbors in the current image and its nine neighbors in the scale above and below (Fig. 5). It is selected only if it is larger than all of these neighbors or smaller than all of them. As shown in Fig. 5, extrema are detected by comparing a pixel (marked with X) to its 26 neighbors in the 3 × 3 regions at the current and adjacent scales (marked with circles). Once a keypoint candidate has been found by comparing a pixel to its neighbors, the next step is to perform a detailed fit to the nearby data for location, scale, and ratio of principal curvatures. This information allows points to be rejected that have low contrast (and are therefore sensitive to noise) or are poorly localized along an edge. Fig. 6 shows some of the detected keypoints from the tested image.
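A minimal sketch of the 26-neighbor comparison, under the assumption that the difference-of-Gaussian images of one octave are stored as a list of equally sized arrays:

```python
import numpy as np

def is_extremum(dog, s, y, x):
    """Return True if sample dog[s][y, x] is strictly larger or strictly
    smaller than all 26 neighbors in the 3 x 3 regions at the current
    and adjacent scales (Fig. 5). Assumes 1 <= s <= len(dog) - 2 and
    that (y, x) is at least one pixel from the image border."""
    v = dog[s][y, x]
    cube = np.stack([dog[s + ds][y - 1:y + 2, x - 1:x + 2]
                     for ds in (-1, 0, 1)])
    # Drop the center sample (flat index 13 of the 27) and compare
    # against the remaining 26 neighbors.
    others = np.delete(cube.ravel(), 13)
    return bool((v > others).all() or (v < others).all())
```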

Fig. 5  Maxima and minima of the difference-of-Gaussian images.

Fig. 6  Some of the detected keypoints.

2.3 Orientation Assignment

By assigning a consistent orientation to each keypoint based on local image properties, the keypoint descriptor can be represented relative to this orientation and therefore achieve invariance to image rotation. This approach contrasts with the orientation-invariant descriptors of Ref. [1], in which each image property is based on a rotationally invariant measure. The disadvantage of that approach is that it limits the descriptors that can be used and discards image information by not requiring all measures to be based on a consistent rotation.

Following experimentation with a number of approaches to assigning a local orientation, the following approach was found to give the most stable results. The scale of the keypoint is used to select the Gaussian-smoothed image, L, with the closest scale, so that all computations are performed in a scale-invariant manner. For each image sample, L(x, y), at this scale, the gradient magnitude, m(x, y), and orientation, $\theta(x, y)$, are computed using pixel differences [2]:

    $m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}$    (5)

    $\theta(x, y) = \tan^{-1}\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right)$    (6)

Fig. 7  Image gradients (left) and keypoint descriptor (right).

Fig. 8  Tested image with associated descriptors.
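Eqs. (5) and (6) can be sketched as below; the sketch uses the full-quadrant arctan2 in place of $\tan^{-1}$, a common practical substitution. The second function shows one common way to pick a dominant orientation from a magnitude-weighted histogram, a detail not spelled out in the text; both names and the 36-bin choice are illustrative assumptions.

```python
import numpy as np

def gradient_field(L):
    """Gradient magnitude m and orientation theta of a Gaussian-smoothed
    image L via the pixel differences of Eqs. (5) and (6). The first
    array axis is treated as x and the second as y in the comments."""
    dx = np.zeros_like(L)
    dy = np.zeros_like(L)
    dx[1:-1, :] = L[2:, :] - L[:-2, :]   # L(x+1, y) - L(x-1, y)
    dy[:, 1:-1] = L[:, 2:] - L[:, :-2]   # L(x, y+1) - L(x, y-1)
    m = np.sqrt(dx**2 + dy**2)           # Eq. (5)
    theta = np.arctan2(dy, dx)           # Eq. (6), quadrant-aware
    return m, theta

def dominant_orientation(m, theta, n_bins=36):
    """Peak of a magnitude-weighted orientation histogram over a
    keypoint region: one common realization of orientation assignment."""
    hist, edges = np.histogram(theta, bins=n_bins,
                               range=(-np.pi, np.pi), weights=m)
    peak = np.argmax(hist)
    return 0.5 * (edges[peak] + edges[peak + 1])
```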

3. The Local Image Descriptor

The previous operations have assigned an image location, scale, and orientation to each keypoint. These parameters impose a repeatable local 2D coordinate system in which to describe the local image region, and therefore provide invariance to these parameters. The next step is to compute a descriptor for the local image region that is highly distinctive yet as invariant as possible to the remaining variations, such as change in illumination or 3D viewpoint.

A keypoint descriptor, as shown in Fig. 7, is created by first computing the gradient magnitude and orientation at each image sample point in a region around the keypoint location, as shown on the left. These are weighted by a Gaussian window, indicated by the overlaid circle. The samples are then accumulated into orientation histograms summarizing the contents over 4 × 4 subregions, as shown on the right, with the length of each arrow corresponding to the sum of the gradient magnitudes near that direction within the region. The experiments in this paper use 4 × 4 descriptors computed from a 16 × 16 sample array, with 8 orientation bins in each histogram; each keypoint therefore yields a 4 × 4 × 8 = 128-element feature vector. Fig. 8 shows a tested image with its associated descriptors.

Finally, the feature vector is modified to reduce the effects of illumination change. First, the vector is normalized to unit length. A change in image contrast, in which each pixel value is multiplied by a constant, will multiply the gradients by the same constant, so this contrast change is canceled by vector normalization. A brightness change, in which a constant is added to each image pixel, will not affect the gradient values, as they are computed from pixel differences.

Matching feature descriptors: In the original SIFT descriptor comparison approach, every descriptor in one image is compared to every descriptor in the second image via the Euclidean distance [2]:

    $\mathrm{dist}(d_1, d_2) = \sqrt{\sum_{i=1}^{128} (d_{1i} - d_{2i})^2}$    (7)
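A minimal sketch of the unit-length normalization and the Euclidean distance of Eq. (7):

```python
import numpy as np

def normalize_descriptor(d):
    """Normalize a 128-element descriptor to unit length, so that a
    global contrast change (multiplication by a constant) cancels."""
    norm = np.linalg.norm(d)
    return d / norm if norm > 0 else d

def euclidean_distance(d1, d2):
    """dist(d1, d2) of Eq. (7) between two 128-element descriptors."""
    return np.sqrt(np.sum((d1 - d2) ** 2))
```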

The basic approach being the same, the altered matching algorithm also takes into account the variance of the shared descriptor. When comparing a new image to the dictionary of features of an object, the dictionary is scanned to find the best match for each SIFT keypoint in the new image. The variance of the shared descriptor found as the best match in the dictionary is then used to decide whether the keypoint in the new image qualifies as a match or not. Many experiments were carried out using a variety of decision techniques to label a keypoint as a match or not. One successful technique labeled a keypoint as a match if, in feature space, the Euclidean distance to the closest shared descriptor was less than the variance of that descriptor, which implies that the given keypoint is close in feature space to the shared descriptor and to the keypoints that it represents (see Fig. 9). Another technique, also quite successful, was a modification of the matching technique described by Crowley et al. [7], which uses the distances to the two closest shared descriptors in feature space and checks whether these two distances are within a certain percentage of each other. This would indicate that the keypoint is also close to another point in the object feature space, which in turn indicates a high probability that the keypoint belongs to the object feature space.
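The two decision techniques can be sketched as follows. The dictionary layout, the per-descriptor variance array, and the closeness threshold are illustrative assumptions, not the authors' actual data structures or values, and the percentage check encodes the criterion as stated in the text.

```python
import numpy as np

def label_match(descriptor, shared, variances, closeness=0.8):
    """Decide whether a keypoint descriptor matches the object's feature
    dictionary. `shared` is an (N, 128) array of shared descriptors and
    `variances` their per-descriptor variances (illustrative names)."""
    dists = np.linalg.norm(shared - descriptor, axis=1)  # Eq. (7) to all entries
    order = np.argsort(dists)
    best, second = order[0], order[1]
    # Technique 1: distance to the closest shared descriptor must be
    # smaller than that descriptor's variance.
    variance_match = dists[best] < variances[best]
    # Technique 2 (after Crowley et al. [7], as described in the text):
    # the two closest distances lie within a certain percentage of
    # each other.
    percentage_match = dists[best] >= closeness * dists[second]
    return best, bool(variance_match), bool(percentage_match)
```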

Fig. 9  Tested image with matching feature description.

4. Affine Transformation

Once three or more entries in each cluster were identified, the affine pose between the matched features can be recovered, thereby allowing outliers to be located, using the affine transformation. If $f(x, y)$ and $f'(x', y')$ are the feature descriptors from the training image and the test image respectively, the transformation of the object from the training image to the test image may be given as:

    $\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} m_1 & m_2 \\ m_3 & m_4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix}$    (8)

where $m_1$, $m_2$, $m_3$, $m_4$ and $t_x$, $t_y$ are the parameters of the affine transformation of the object from the training appearance view to the test scene. They were determined by solving the following system using a least-squares approach, in which each match between $f(x, y)$ and $f'(x', y')$ contributes two equations. Since there are six unknowns, at least three match pairs (six equations) are needed to determine the transformation parameters. In compact form,

    $\mathbf{b} = A\,\mathbf{p}, \quad \mathbf{p} = [m_1\; m_2\; m_3\; m_4\; t_x\; t_y]^T$    (9)

written out in full as

    $\begin{bmatrix} x'_1 \\ y'_1 \\ x'_2 \\ y'_2 \\ \vdots \\ x'_n \\ y'_n \end{bmatrix} = \begin{bmatrix} x_1 & y_1 & 0 & 0 & 1 & 0 \\ 0 & 0 & x_1 & y_1 & 0 & 1 \\ x_2 & y_2 & 0 & 0 & 1 & 0 \\ 0 & 0 & x_2 & y_2 & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_n & y_n & 0 & 0 & 1 & 0 \\ 0 & 0 & x_n & y_n & 0 & 1 \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ t_x \\ t_y \end{bmatrix}$    (10)

Once the transformation parameters were determined, the affine transformation was carried out. The results are shown in Fig. 10.

Fig. 10  Transformed images.
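Eq. (10) can be solved directly with a standard least-squares routine. The following sketch builds the design matrix row pattern of Eq. (10) and recovers the six parameters; the function name and array layout are assumptions.

```python
import numpy as np

def fit_affine(src, dst):
    """Recover m1..m4, tx, ty of Eq. (8) by least squares, Eq. (10).

    src, dst: (n, 2) arrays of matched (x, y) and (x', y') coordinates.
    Since there are six unknowns, n >= 3 match pairs are required.
    """
    n = src.shape[0]
    if n < 3:
        raise ValueError("at least three match pairs (six equations) are needed")
    A = np.zeros((2 * n, 6))
    A[0::2, 0:2] = src    # rows [x_i, y_i, 0, 0, 1, 0] for the x' equations
    A[0::2, 4] = 1.0
    A[1::2, 2:4] = src    # rows [0, 0, x_i, y_i, 0, 1] for the y' equations
    A[1::2, 5] = 1.0
    b = dst.reshape(-1)   # [x'_1, y'_1, x'_2, y'_2, ...]
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    m1, m2, m3, m4, tx, ty = p
    return np.array([[m1, m2], [m3, m4]]), np.array([tx, ty])
```

With the recovered M and t, training coordinates map into the test scene as dst ≈ src @ M.T + t, the batched form of Eq. (8).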

5. Conclusions

The SIFT keypoints described in this paper are particularly useful due to their distinctiveness, which enables the correct match for a keypoint to be selected from a large database of other keypoints. This distinctiveness is achieved by assembling a high-dimensional vector representing the image gradients within a local region of the image. The keypoints are invariant to image rotation and scale, and robust across a substantial range of affine distortion, addition of noise, and change in illumination. Large numbers of keypoints can be extracted from typical images, which leads to robustness in extracting small objects among clutter. The fact that keypoints are detected over a complete range of scales means that small local features are available for matching small and highly occluded objects, while large keypoints perform well for images subject to noise and blur. Their computation is efficient, so that several hundred keypoints can be extracted from a typical image.

References

[1] C. Schmid, R. Mohr, Local gray value invariants for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (5) (1997) 530-534.
[2] D.G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the International Conference on Computer Vision, Corfu, Greece, 1999, pp. 1150-1157.
[3] A.P. Witkin, Scale-space filtering, in: Proceedings of the International Joint Conference on Artificial Intelligence, Karlsruhe, Germany, 1983, pp. 1019-1022.
[4] J. Koenderink, The structure of images, Biological Cybernetics 50 (1984) 363-396.
[5] T. Lindeberg, Scale-space theory: A basic tool for analysing structures at different scales, Journal of Applied Statistics 21 (2) (1994) 224-270.
[6] T. Lindeberg, Detecting salient blob-like image structures and their scales with a scale-space primal sketch: A method for focus-of-attention, International Journal of Computer Vision 11 (3) (1993) 283-318.
[7] J.L. Crowley, C.A. Parker, A representation for shape based on peaks and ridges in the difference of low-pass transform, IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (2) (1984) 156-170.
