
Journal of Electronic Imaging 24(3), 033002 (May∕Jun 2015)

Improving scale invariant feature transform-based descriptors with shape–color alliance robust feature Rui Wang,a Zhengdan Zhu,a,* and Liang Zhangb

a Beihang University, School of Instrumentation Science and Opto-Electronics Engineering, Laboratory of Precision Opto-Mechatronics Technology, No. 37 Xueyuan Road, Haidian District, Beijing 100191, China b University of Connecticut, Department of Electrical and Computer Engineering, 371 Fairfield Way, U-2157, Storrs, Connecticut 06269, United States

Abstract. Constructing appropriate descriptors for interest points in image matching is a critical task in computer vision and pattern recognition. A method called the shape–color alliance robust feature (SCARF) descriptor is presented as an extension of the scale invariant feature transform (SIFT) descriptor. To address the problem that SIFT is designed mainly for gray images and lacks global information about feature points, the proposed approach improves the SIFT descriptor by means of a concentric-rings model and integrates the color invariant space and the shape context with SIFT to construct the SCARF descriptor. The SCARF method is more robust than conventional SIFT not only with respect to color and photometric variations but also in measuring similarity as a global variation between two shapes. A comparative evaluation of different descriptors shows that the SCARF approach provides better results than four other state-of-the-art related methods. © 2015 SPIE and IS&T [DOI: 10.1117/1.JEI.24.3.033002]

Keywords: scale invariant feature transform; color invariance; global information; shape–color alliance robust feature.

Paper 14382 received Jul. 7, 2014; accepted for publication Apr. 8, 2015; published online May 7, 2015.

*Address all correspondence to: Zhengdan Zhu, E-mail: sy1017133@aspe.buaa.edu.cn

1 Introduction

Interest point matching is the task of finding corresponding points between two or more images. It has proven to be crucially important in many computer vision and pattern recognition tasks, such as three-dimensional (3-D) reconstruction, 3-D object recognition and tracking, texture recognition, stereo correspondence, video data mining, and image retrieval.1–5 One of the most effective ways to represent interest points is to create a distinguishing descriptor for each feature point. To advance the state-of-the-art research on descriptors of interest points, two criteria must be considered simultaneously. The first is distinctiveness, i.e., the extracted feature should carry enough information to distinguish the interest point being described from other points. The second is robustness, which means that the descriptor should be robust with respect to many kinds of geometric and photometric transformations. Until now, many different descriptors for interest points have been developed and widely used in a large number of applications.6,7 These descriptors can be categorized into the following four types.

The first category is distribution-based descriptors. These descriptors use histograms to represent different feature points. A descriptor established by Zabih and Woodfill8 relied on histograms of ordering and reciprocal relations between pixel intensities and was robust to illumination changes. Belongie et al.9 developed an approach by starting with a collection of shape points and, for each shape point, capturing the distribution of the remaining points by building a histogram in log-polar space. Carneiro and Jepson10 described phase-based local features that represent the


phase rather than the magnitude of local spatial frequencies, which is likely to provide improved invariance to illumination. Lowe6,11 developed the scale invariant feature transform (SIFT) descriptor based on the gradient distribution in the detected corresponding regions. The descriptor is represented by a 3-D histogram of gradient locations and orientations, and the contribution to the location and orientation histogram bins is weighted by the gradient magnitude. Delponte et al.12 proposed using the singular value decomposition method along with SIFT to reduce and correct incorrect matches between points in two SIFT feature sets. Bay et al.13 improved on the SIFT descriptor with speeded-up robust features, using integral images for image convolutions and a fast-Hessian detector. Morel and Yu14 proposed affine SIFT, which simulates all the distortions caused by variations in the direction of the camera's optical axis and then applies SIFT to the simulated images. Guo et al.15 presented a mirror reflection invariant descriptor inspired by SIFT.

The second category is differential descriptors. These descriptors use a set of image derivatives calculated up to a given order to approximate a point neighborhood. Florack et al.16 presented a method using differential invariants, which combine components of the local derivatives to obtain rotation invariance. Schmid and Mohr17 described interest points using local differential gray-level invariants, and the resulting descriptors are invariant to scale, rotation, and intensity changes.

The third category is color-based descriptors. These descriptors utilize color information to render descriptors robust against varying imaging conditions. Gevers and Smeulders18 proposed new color models for the purpose


of recognition of multicolored objects. Geusebroek et al.19 developed various physics-based color invariants for invariant color representations under different imaging conditions. Diplaros et al.20 proposed an approach to integrate color and shape invariant information in the context of object recognition and image retrieval. Abdel-Hakim and Farag21 introduced colored SIFT (CSIFT) as a novel colored local invariant feature descriptor for the purpose of combining both color and geometrical information in object description. Stokman and Gevers22 proposed a generic selection model to select and weigh color invariant models for discriminative and robust image feature detection. Verma et al.23 proposed new color SIFT descriptors extending the SIFT descriptor to different color spaces.

Apart from the three aforementioned types of descriptors, there are also other extended descriptors, which we group as a fourth category. For example, Shokoufandeh et al.24 provided a distinctive feature descriptor using wavelet coefficients. Mikolajczyk and Schmid25 established a feature descriptor that uses local edges while ignoring unrelated nearby edges, thus providing the ability to find stable features. Moreno et al.26 improved the SIFT descriptor with Gabor smoothing derivative filters. Kim et al.27 proposed a local feature descriptor using a combination of binary Haar-like features and an edge class that is robust to monotonic illumination changes and Gaussian noise.

Although the SIFT descriptor has been proven to perform better than other existing local invariant feature descriptors,22 it does not have sufficient distinctiveness and robustness in the case of color and photometric transformations and may result in incorrect image matching when the image has many locally similar areas. In these cases, many mismatches can occur because SIFT is a local feature-based descriptor and is designed mainly for gray images. In this paper, we propose a method to improve the performance of the SIFT descriptor by addressing the drawbacks mentioned previously. Specifically, a concentric-rings model is used to enhance the invariance to viewpoint change. Then, the color invariant space is applied to increase the color and photometric invariance. Next, the shape context is used as a global cue to improve the measurement of similarity between two shapes. Finally, the color and global descriptors are integrated with the SIFT descriptor to ensure more accurate point matching.

The rest of the paper is organized as follows: Section 2 briefly reviews SIFT. Our shape–color alliance robust feature (SCARF) algorithm is proposed in Sec. 3, and Sec. 4 introduces the evaluation criteria together with the datasets for the experiments. We present experimental results and analysis in Sec. 5 and draw our conclusions in Sec. 6.

2 Scale Invariant Feature Transform Review

Among the strongest local descriptors, one can cite SIFT, developed by Lowe,6,11 and speeded-up robust features (SURF), developed by Bay et al.13 The SURF descriptor, an alternative to SIFT with lower computational requirements, comprises a 64-dimensional vector, whereas SIFT uses 128 dimensions. However, when the SIFT and SURF descriptors are compared, the SIFT descriptor is more suitable for identifying images altered by blurring, rotation, and scaling, for instance.28 The SIFT algorithm consists of four steps:6

Step 1. Finding the extreme points in scale space.26 This is implemented efficiently by using a difference-of-Gaussian (DoG) function pyramid to identify potential key points that are invariant to scale and orientation. The DoG is a close approximation to the scale-normalized Laplacian-of-Gaussian. Each sample pixel is compared with its 26 neighbors in scale space, and a pixel is selected as a candidate feature point if its value is larger or smaller than all of its neighbors.

Step 2. Extracting key points. Extreme points that are unstable and sensitive to noise are filtered out, leaving extreme points that are treated as key points. Accurate positions of the remaining key points are localized.

Step 3. Assigning direction parameters to the key points to quantize the description. A dominant orientation is determined by building a histogram of gradient orientations from the feature point's neighborhood, weighted by a Gaussian and by the gradient magnitude. Every peak in the histogram with a height of at least 80% of the maximum produces a feature point with the corresponding orientation. This orientation is assigned to each final key point based on local image gradient directions to achieve invariance to image rotation.

Step 4. Computing the key point descriptors. The local image gradients are measured at the selected scale in the region around each key point. These are transformed into a representation that allows for significant levels of local shape distortion and changes in illumination. The experiments by Lowe show that the best results are achieved with a 4 × 4 array of histograms with eight orientation bins each. Therefore, the SIFT descriptor for each key point is usually a 128-dimensional vector.
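As an illustration of Step 1, the following minimal NumPy sketch tests whether a sample in a difference-of-Gaussian stack is an extremum among its 26 scale-space neighbors. It assumes one octave of the DoG pyramid is stored as a 3-D array (levels × rows × columns); the function name and the contrast threshold are illustrative assumptions, not the original SIFT implementation.

```python
import numpy as np

def is_scale_space_extremum(dog_octave, s, y, x, contrast_thresh=0.03):
    """Return True if dog_octave[s, y, x] is larger or smaller than all 26 neighbors.

    dog_octave: 3-D array (levels x rows x cols) holding one octave of the DoG pyramid.
    """
    value = dog_octave[s, y, x]
    if abs(value) < contrast_thresh:
        return False                      # low-contrast responses are unstable (see Step 2)
    cube = dog_octave[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]  # 3x3x3 neighborhood
    if value > 0:
        return value >= cube.max()        # value is the maximum of its neighborhood
    return value <= cube.min()            # value is the minimum of its neighborhood
```

Candidates found this way would still be refined, filtered, oriented, and described as in Steps 2–4.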

3 The Shape–Color Alliance Robust Feature Descriptor

As mentioned previously, SIFT was developed mainly for gray images, which limits its performance on some colored objects. Moreover, although the key points extracted by SIFT make full use of the information of neighboring points in scale space, SIFT is inherently limited by the locality of its descriptor. Consequently, SIFT is unable to discriminate repeated image patterns, since no contextual information is encoded in the algorithm. To improve the performance of SIFT and correct these deficiencies, we propose a new framework that combines color space and shape context global information with SIFT, using a circular region instead of a rectangular region. In the feature detection stage, i.e., scale-space extrema detection, we still adopt the SIFT detector because of its high accuracy and full invariance to rotation and scale changes. For every interest point detected, we build a three-component vector consisting of a SIFT descriptor representing local properties, a global context vector to disambiguate locally similar features, and a color descriptor. Thus, our SCARF vector is defined as

SCARF = [\alpha S, \; \beta G, \; (1 - \alpha - \beta) C]^T,   (1)


where S is the 128-dimension SIFT descriptor, G is a 32-dimension global descriptor, C is a 32-dimension color descriptor, and α and β are the relative weighting factors satisfying α > 0, β > 0, and α + β < 1. The detailed scheme for the proposed SCARF descriptor is shown in Fig. 1. The specific steps to construct the shape context global descriptor and the color descriptor, as well as their integration with the 128-dimension SIFT descriptor, are introduced in the following subsections.

3.1 Shape Context Global Descriptor

In practice, an image may have many areas that are locally similar to each other. When matching local descriptors such as SIFT in this situation, multiple locally similar areas will produce ambiguities. Figure 2 illustrates the matching result for an image pair containing a number of similar shapes and structures, in which the incorrect matches are labeled by red circles and lines. Several attempts have been made in the literature20,29 to address this problem. For instance, a global context vector, which adds curvilinear shape information from a much larger neighborhood, was proposed by Mortensen et al.30 They use the maximum curvature at each feature point to describe the global information of the key point. In that paper, a log-polar coordinate system is built and curvature values are accumulated in each log-polar bin to form a 5 × 12 dimension descriptor. Specifically, the shape context's radius r is chosen so that the diameter of the log-polar coordinate system, 2r, equals the image diagonal. The bins have five radial increments, r/16, r/16, r/8, r/4, and r/2, and the system is divided into 12 equal angular pieces based on the dominant orientation of the feature points (i.e., each accounts for 30 deg of angle). An illustration of this method is shown in Fig. 3, where Figs. 3(a) and 3(b) depict the sampled edge points of two shapes and Fig. 3(c) depicts the partition by the log-polar bins. Inspired by this work and by the fact that global information conveys much more distinguishable information, which helps discriminate between local features, we use curvilinear information to create a global descriptor for each feature point. Specifically, we adopt convenient concentric ring bins for accumulating curvature values and compute the maximum curvature only at each key point. In the

Fig. 1 Shape–color alliance robust feature (SCARF) flow chart.


Fig. 2 Example of incorrect matching by scale invariant feature transform (SIFT).

remainder of this section, we will discuss the construction of the global descriptor in more detail.

3.1.1 Building concentric circles

As mentioned previously, considering all the points in an image for object description is not feasible. Therefore, highly informative points are selected as interest points by using a SIFT detector (see Fig. 1). After localizing the interest points, calculating the dominant orientation histogram is a critical step before feature descriptors are built to characterize these points. To this end, rather than using a log-polar coordinate system, which requires a transformation between coordinate systems, we propose to build a simpler and more convenient shape context histogram based on concentric circles. In this coordinate system, the key point itself is used as the center of the concentric circles. Instead of using the image diagonal as the radius, as in Mortensen et al.,30 the radius of the circle in our paper is determined by the feature point's scale σ: for each selected point, k × σ is the radius of its outermost circle, where k is an experimentally chosen parameter. The size of the global context vector is therefore a function of the interest point's scale rather than the image size, which affords better scale invariance, and by construction the concentric bins allow for increasing uncertainty as the relative distance increases. We therefore believe that building a context histogram with a concentric-circles system simplifies the calculation and is more efficient than a log-polar histogram. Hence, we also employ the concentric-circles coordinate system to construct the color descriptor, which will be illustrated in the next section. In this paper, the neighboring region around the key point consists of 32 concentric circles. Therefore, the radial increment is

Fig. 3 Shape context proposed by Mortensen et al.30 (a) and (b) sampled edge points of two shapes. (c) Diagram of log-polar histogram bins used in computing the shape contexts.


equally set to k × σ/32. To carry out this process, the feature points of an image are first extracted, and only these feature points are considered when calculating the global descriptor based on the concentric circles.

3.1.2 Constructing the global descriptor

For each extracted feature point, the principal curvature is calculated first. Given an image point (x, y), the principal curvature is the absolute maximum eigenvalue of the Hessian matrix

H(x, y) = \begin{bmatrix} r_{xx} & r_{xy} \\ r_{xy} & r_{yy} \end{bmatrix} = I(x, y) * \begin{bmatrix} g_{xx}^{\sigma} & g_{xy}^{\sigma} \\ g_{xy}^{\sigma} & g_{yy}^{\sigma} \end{bmatrix},   (2)

where r_{xx} and r_{yy} are the second partial derivatives of the image in x and y, respectively, and r_{xy} is the second cross-partial derivative. The second derivatives are computed by convolving the image with the corresponding second derivatives of a Gaussian, g_{xx}, g_{xy}, and g_{yy}, with σ equal to the scale used in SIFT (see Sec. 2 and Fig. 1). Thus, the curvature image is defined as

c(x, y) = |e(x, y)|,   (3)

where e(x, y) is the eigenvalue of Eq. (2) with the largest absolute value. According to Steger,31 e(x, y) can be computed in a numerically stable and efficient manner with just a single Jacobi rotation of H(x, y) so as to eliminate the r_{xy} term. Then, we use an approach similar to shape contexts to accumulate the principal curvature values of the other feature points in each concentric circle bin instead of in each log-polar bin. Each pixel's curvature value is weighted by an inverted Gaussian and then added to the corresponding circle bin. The Gaussian weighting function is

\omega(x, y) = 1 - e^{-[(x - x_k)^2 + (y - y_k)^2] / 2\sigma_k^2},   (4)

where (x_k, y_k) is the feature point position and σ_k is the same scale used to weight the SIFT feature's neighborhood. More specifically, for each extracted feature point, a 32-dimensional descriptor G = [G_1, G_2, …, G_i, …, G_32] is constructed, where G_i is generated from the corresponding concentric circle bin i. Suppose that bin i contains J_i feature points, whose principal curvature values are weighted by the Gaussian function described in Eq. (4). We can then define G_i as

G_i = \sum_{j=1}^{J_i} c(x_{ij}, y_{ij}) \, \omega(x_{ij}, y_{ij}),   (5)

where (x_{ij}, y_{ij}) is the j'th feature point in bin i. To some extent, the 32-dimensional descriptor is a very discriminative point descriptor, facilitating easy and robust correspondence recovery by incorporating global shape information into a local point feature. In addition, compared to a descriptor based on the log-polar coordinate system, the proposed scheme requires less computational time with a simpler and more effective design, because it omits the coordinate transformation and computes the maximum curvature only at each feature point.
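To make the construction above concrete, the following minimal NumPy/SciPy sketch accumulates Gaussian-weighted principal curvature values into 32 concentric ring bins around a key point, following Eqs. (2)–(5). The function names, the SciPy Gaussian-derivative calls, and the default k value are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def principal_curvature(image, sigma):
    """Curvature image c(x, y): largest-magnitude eigenvalue of the Hessian, Eqs. (2)-(3)."""
    # Second derivatives via convolution with second derivatives of a Gaussian.
    rxx = gaussian_filter(image, sigma, order=(0, 2))   # d^2/dx^2 (x = column axis)
    ryy = gaussian_filter(image, sigma, order=(2, 0))   # d^2/dy^2
    rxy = gaussian_filter(image, sigma, order=(1, 1))   # cross derivative
    # Eigenvalues of the symmetric 2x2 Hessian at every pixel.
    half_trace = (rxx + ryy) / 2.0
    root = np.sqrt(((rxx - ryy) / 2.0) ** 2 + rxy ** 2)
    e1, e2 = half_trace + root, half_trace - root
    return np.where(np.abs(e1) > np.abs(e2), np.abs(e1), np.abs(e2))

def global_descriptor(curvature, keypoints, kp_index, sigma_k, k=6, n_bins=32):
    """32-dimensional global descriptor G for one key point, Eq. (5)."""
    xk, yk = keypoints[kp_index]
    radius = k * sigma_k                                # outermost ring radius
    G = np.zeros(n_bins)
    for j, (x, y) in enumerate(keypoints):
        if j == kp_index:
            continue
        d2 = (x - xk) ** 2 + (y - yk) ** 2
        d = np.sqrt(d2)
        if d >= radius:
            continue                                    # outside the neighboring region
        ring = int(d / (radius / n_bins))               # concentric ring index
        w = 1.0 - np.exp(-d2 / (2.0 * sigma_k ** 2))    # inverted Gaussian weight, Eq. (4)
        G[ring] += curvature[int(y), int(x)] * w
    return G
```

The same ring-binning loop is reused for the color descriptor of Sec. 3.2.3, with the curvature value replaced by the color invariant K.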

3.2 Color Descriptors

Besides the shape context global information, color invariance features provide highly discriminative cues for colored objects, since color may vary significantly with changes in camera viewpoint, object pose, and illumination. If color information is neglected, a very important source of distinction may be lost, and many objects may be mismatched and misclassified. Unfortunately, although the SIFT descriptor can accurately extract local feature points that are invariant to many basic image transformations, it is not invariant to light color changes, because the intensity channel of a gray image is simply a combination of the R, G, and B channels. To combine both geometrical and color invariants in a single descriptor, Abdel-Hakim and Farag21 introduced CSIFT, a colored local invariant feature descriptor that uses the color invariance model19 based on the Kubelka–Munk theory32,33 to achieve color invariance. However, illumination and gray information may be dropped by this method. In this paper, we make some adjustments to the CSIFT descriptor by using a robust and simple feature selection strategy for identifying correspondences among colors.

3.2.1 Concentric circles neighboring region

After localizing the interest points at the extrema of a DoG pyramid of the input image as shown in Fig. 1, instead of the rectangular region used in CSIFT, we make use of the concentric circles discussed in Sec. 3.1.1 to build our color descriptor. In other words, the concentric circle bins for the same-scale neighboring pixels of an interest point are used as the key entries of the descriptor. In this way, using the same coordinate system to describe both global information and color features requires less calculation of the matching distance for the global descriptor and the color descriptor, and thus leads to a more efficient computational process for the SCARF algorithm.

3.2.2 Computation of color invariance

Inspired by the fact that the C-invariant used in the CSIFT descriptor can be intuitively viewed as the normalized opponent color space, we develop our color invariance model based on the Kubelka–Munk theory employed in CSIFT to model the reflected spectrum of colored bodies. The Kubelka–Munk theory models the photometric reflectance by

E(\lambda, \tilde{x}) = e(\lambda, \tilde{x}) [1 - \rho_f(\tilde{x})]^2 R_\infty(\lambda, \tilde{x}) + e(\lambda, \tilde{x}) \rho_f(\tilde{x}),   (6)

where λ is the wavelength, x̃ is a two-dimensional vector representing the image position, e(λ, x̃) denotes the illumination spectrum, and ρ_f(x̃) is the Fresnel reflectance at x̃. In addition, R_∞(λ, x̃) denotes the material reflectivity and E(λ, x̃) represents the reflected spectrum in the viewing direction. By differentiating Eq. (6) with respect to λ once and twice and taking the ratio of the derivatives, we obtain

K = \frac{E_\lambda}{E_{\lambda\lambda}} = \frac{\partial E / \partial \lambda}{\partial^2 E / \partial \lambda^2} = \frac{\partial R_\infty(\lambda, \tilde{x}) / \partial \lambda}{\partial^2 R_\infty(\lambda, \tilde{x}) / \partial \lambda^2} = f[R_\infty(\lambda, \tilde{x})].   (7)

Clearly, K is a reflectance property that is independent of viewpoint, surface orientation, illumination direction, illumination intensity, and the Fresnel reflectance coefficient. Using a good approximation of the human vision system and the Commission Internationale de l'Éclairage (CIE) 1964 XYZ basis, the Gaussian color model can be expressed in terms of RGB19 as

\begin{bmatrix} E \\ E_\lambda \\ E_{\lambda\lambda} \end{bmatrix} = \begin{bmatrix} 0.06 & 0.63 & 0.27 \\ 0.30 & 0.04 & -0.35 \\ 0.34 & -0.60 & 0.17 \end{bmatrix} \times \begin{bmatrix} R \\ G \\ B \end{bmatrix}.   (8)

Therefore, K can be obtained from Eqs. (7) and (8).
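A minimal per-pixel sketch of this computation is given below, assuming an RGB image stored as a floating-point NumPy array; the function name and the small epsilon used to guard the division are assumptions made for illustration.

```python
import numpy as np

# Gaussian color model matrix of Eq. (8); rows give E, E_lambda, E_lambdalambda.
GCM = np.array([[0.06,  0.63,  0.27],
                [0.30,  0.04, -0.35],
                [0.34, -0.60,  0.17]])

def color_invariant_K(rgb, eps=1e-6):
    """Per-pixel color invariant K = E_lambda / E_lambdalambda, Eqs. (7) and (8).

    rgb: float array of shape (H, W, 3) holding the R, G, B channels.
    eps guards against division by zero and is an assumed value.
    """
    gcm_channels = rgb @ GCM.T          # shape (H, W, 3): E, E_lambda, E_lambdalambda
    E_l = gcm_channels[..., 1]
    E_ll = gcm_channels[..., 2]
    return E_l / np.where(np.abs(E_ll) < eps, eps, E_ll)
```

In Sec. 3.2.3, these per-pixel K values at the detected feature points are accumulated over the same 32 concentric ring bins used for the global descriptor.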

3.2.3 Constructing the color descriptor

In the stage of color descriptor construction, we use the invariant K derived in the previous subsection in a different way from CSIFT. Since our algorithm, illustrated in Fig. 1 and Eq. (1), focuses on designing a novel SCARF descriptor that keeps the original 128-dimensional SIFT vector and adds a new color descriptor and global descriptor as a supplement, we construct the new color descriptor by accumulating the color invariance K of the feature points in each concentric circle region. In a similar manner, for each extracted feature point, a 32-dimensional descriptor C = (C_1, C_2, …, C_i, …, C_32) is constructed, and all color invariance values are weighted by the Gaussian function ω(x, y) [see Eq. (4)]. Then C_i is generated in the corresponding concentric circle bin i and is given by

C_i = \sum_{j=1}^{J_i} K(x_{ij}, y_{ij}) \, \omega(x_{ij}, y_{ij}),   (9)

where K(x_{ij}, y_{ij}) is the color invariance of the j'th feature point in bin i. Finally, the 32-dimensional vector C is normalized to unit magnitude so that it is invariant to changes in image contrast.

3.3 Combined Descriptor and Image Matching

In Secs. 3.1 and 3.2, the global shape descriptor and the color invariant descriptor were developed for our new SCARF descriptor framework, as represented in Eq. (1). In particular, we construct the three relatively independent descriptors based on simple coordinate systems: the SIFT descriptor is based on its original rectangular coordinate system, while both the global descriptor and the color descriptor are based on the same set of concentric circles detailed in Sec. 3.1.1. More specifically, we can accumulate the principal curvature and the color invariance, respectively, in the same corresponding regions, thus simultaneously forming the global descriptor and the color descriptor. In this manner, the computation process of the algorithm is simplified and, hence, its efficiency is improved. In this section, we combine the descriptors and apply them to image matching. Specifically, during image matching, we aim to meet two criteria: (a) the principle of similarity and (b) the principle of exclusion.34 As the humble nearest-neighbor classifier is asymptotically optimal, a property not possessed by several considerably more complicated techniques, we compare the values of the descriptors for each feature point based on the simple nearest neighbor distance, or the nearest neighbor with ambiguity rejection, with a threshold T on the match. If two or more points in another image match a single point in the image, we keep the pair with the best match and discard the other(s). Given the definition of our feature point descriptor in Eq. (1) and the three descriptors S, G, and C, let S_i(k), G_i(k), and C_i(k) denote the value of the k'th element of descriptors S, G, and C for point i, respectively. Then, the matching distance d_S between point i and point j for the original SIFT descriptor is calculated using a simple Euclidean distance metric,

d_S = |S_i - S_j| = \sqrt{\sum_{k=1}^{128} [S_i(k) - S_j(k)]^2}.   (10)

Distinct from the global context presented in Mortensen et al.,30 the matching distance d_G for the global descriptor is similar to d_S because concentric circles are used, and likewise for d_C, the matching distance for the color invariance between two points. They are respectively defined as

d_G = |G_i - G_j| = \sqrt{\sum_{k=1}^{32} [G_i(k) - G_j(k)]^2},   (11)

d_C = |C_i - C_j| = \sqrt{\sum_{k=1}^{32} [C_i(k) - C_j(k)]^2}.   (12)

The combined matching distance is then computed as the weighted sum

d = \alpha d_S + \beta d_G + (1 - \alpha - \beta) d_C,   (13)

where α and β are the same weighting factors as in Eq. (1). As mentioned previously, we discard matches with any matching distance greater than the threshold T. By adjusting T, we can obtain diverse results of our descriptor to satisfy different requirements. As a result, using the model detailed in Sec. 3, a systematic approach is developed to provide a set of invariance properties by incorporating the global shape context and the color invariant descriptor into Lowe's basic SIFT framework. Hence, we expect that this technique can achieve improved invariance with respect to image translation, scaling, and rotation, and robustness to local geometric distortion. It is also expected that the algorithm is capable of handling local regions with illumination and color changes as well as helping disambiguate multiple regions with locally similar appearance. To validate these expectations, we test the performance of the algorithm in the next section and compare it with other typical techniques described in the literature.
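To make the matching procedure concrete, the sketch below combines the three distances of Eqs. (10)–(13) and applies the threshold and one-to-one exclusion rules described above. It assumes the SCARF components of each image are already available as NumPy arrays; all function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def scarf_distances(S1, G1, C1, S2, G2, C2, alpha=1/3, beta=1/3):
    """Pairwise combined SCARF distances, Eqs. (10)-(13).

    S1: (N1, 128), G1/C1: (N1, 32) descriptors of image 1; S2/G2/C2 likewise for image 2.
    """
    def euclid(A, B):
        # (N1, N2) matrix of Euclidean distances between rows of A and rows of B.
        return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    return (alpha * euclid(S1, S2) + beta * euclid(G1, G2)
            + (1 - alpha - beta) * euclid(C1, C2))

def match(d, T):
    """Nearest-neighbor matching with threshold T and one-to-one exclusion."""
    candidates = []
    for i in range(d.shape[0]):
        j = int(np.argmin(d[i]))                  # nearest neighbor in image 2
        if d[i, j] <= T:
            candidates.append((i, j, d[i, j]))
    best = {}
    for i, j, dist in candidates:                 # keep only the best match per point in image 2
        if j not in best or dist < best[j][1]:
            best[j] = (i, dist)
    return [(i, j) for j, (i, _) in best.items()]
```

With α = β = 1/3, as used in the experiments of Sec. 5, the three distances are weighted equally.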


4 Criterion and Dataset

In this section, we first describe the evaluation criterion used to verify our results. Then, we introduce the experimental datasets in detail.

4.1 Performance Evaluation

Recall–precision is one of the most popular metrics for evaluating detection tasks.25 In this paper, we use recall–precision to evaluate the matching performance of the developed SCARF algorithm, since the main goal of the algorithm is to detect more corresponding points between two images, and the objective of this study is to improve the algorithm's accuracy. The recall is the proportion of correctly matched interest regions among the corresponding interest regions between two images of the same scene:

\text{recall} = \frac{\#\text{correct matches}}{\#\text{correspondences}}.   (14)

1 − precision is the proportion of false matches among the total number of matches:

1 - \text{precision} = \frac{\#\text{false matches}}{\#\text{total matches}}.   (15)

We generate the recall versus 1 − precision graphs for our experiments by varying the threshold T for each descriptor. In Eqs. (14) and (15), the total number of matches is determined by the matching experiment, while the numbers of correct and false matches are determined by the overlap error of the two images. Specifically, the overlap error is defined in terms of the ratio of the intersection and union of the regions, i.e.,

\varepsilon_S = 1 - \frac{A \cap H^T B H}{A \cup H^T B H},

where A and B are the regions and H is the homography between the images. Given the homography and the matrices defining the regions, this error can be calculated numerically: we count the number of pixels in the union and in the intersection of the regions. We assume that a match is correct if the overlap error in the image area covered by two corresponding regions is less than 50% of the region union. The overlap error is calculated for the measurement regions, which are used to compute the descriptors.

4.2 Dataset

We evaluate the proposed SCARF descriptor using the sequences of images in the Institut National de Recherche en Informatique et en Automatique (INRIA) datasets, which contain eight groups of images (see Fig. 4). Five image transformations are evaluated: viewpoint change, zoom + rotation, image blur, illumination change, and JPEG compression. Every dataset in INRIA contains six images of the same scene. According to the problems on which our method focuses, we choose two images from every dataset to evaluate the descriptor. In the viewpoint-change images, the viewpoint of the camera is changed by 45 deg, as shown in Figs. 4(a) and 4(b). We used Figs. 4(c) and 4(d) with a rotation angle of about 45 deg, which represents the most difficult case of image rotation, and scale changes of approximately a factor of 2.0. As shown in Figs. 4(e) and 4(f), the images are transformed with significant blur. As shown in Fig. 4(g),

Fig. 4 Sample images of the Institut National de Recherche en Infomatique et Automatique (INRIA) datasets used for experiments: (a) wall + (b) graffiti: viewpoint change; (c) bark + (d) boat: zoom + rotation; (e) bikes + (f) tree: blur; (g) leuven: light change; and (h) ubc: JPEG compression.


Fig. 5 Illustration of the performance of recall versus 1 − precision curves: (a) and (b) viewpoint change; (c) and (d) zoom + rotation; (e) and (f) image blur; (g) light change; and (h) JPEG compression.


the illumination changes are introduced by varying the camera aperture. The JPEG sequence is generated with the image quality parameter set to 5%, as shown in Fig. 4(h).

5 Experiments and Results

In this section, we evaluate the performance of the proposed SCARF approach, which incorporates global shape information and a color invariant vector into the local SIFT descriptor, in the task of image matching. Specifically, the SCARF algorithm is compared with four existing descriptors in the literature: (1) the standard SIFT feature representation (denoted as "SIFT");6 (2) colored SIFT introduced by Abdel-Hakim and Farag21 (denoted as "CSIFT"); (3) principal component analysis–SIFT by Ke and Sukthankar35 (denoted as "PCA-SIFT"); and (4) the SIFT descriptor with global context presented by Mortensen et al.30 (denoted as "SIFT + GC"), all based on the precision–recall criterion with respect to a number of critical system parameters. In the experiments, we use a value of α = β = 1/3 in Eq. (13). Moreover, we generate the recall versus 1 − precision graphs for our experiments by varying the threshold for each algorithm. For each image transformation, we calculate the average recall and tabulate it. It should be noted that using α = β = 1/3 assumes equal preference for all three distance measures, d_S, d_G, and d_C, which is appropriate for general image processing operations with limited prior knowledge. In fact, our experiments show that α = β = 1/3 is usually reasonable for handling the matching problem in the INRIA database, in which there is no uniform particularity. However, if prior knowledge about the images to be matched is available, different values of the weight parameters α and β can be used. For instance, when matching images of buildings, a heavier weight should be assigned to the global descriptor distance d_G to achieve better matching performance.

First, we use the wall and graffiti datasets [see Figs. 4(a) and 4(b)] to evaluate the performance for viewpoint change in image matching. Then, we use the boat and bark datasets [see Figs. 4(c) and 4(d)] to evaluate the matching performance of the different descriptors under zoom + rotation transformations, and use the bikes and trees datasets [see Figs. 4(e) and 4(f)] for image blur. Note that the wall dataset has similar local features, which are prone to introduce false matches. Figure 5(a) plots the results of the experiment on the wall dataset. From the results in Table 1 we can see that SCARF's mean average recall on wall is

0.6876, which is much higher than that of the other descriptors. This occurs possibly because the global shape and color descriptors are integrated in SCARF. As such, the technique is able to augment local regions with the "big picture," which provides an overall reference that helps disambiguate multiple regions with locally similar appearance. Figure 5(b) plots the results of the viewpoint change experiment on the graffiti dataset. As one can see from the figure, the matching performance of the SCARF algorithm is the same or slightly better than that of the other descriptors. When handling the zoom + rotation transformations with the boat and bark datasets, although all five descriptors are invariant to scale change and image rotation by virtue of the SIFT detector, Figs. 5(c) and 5(d) show that our proposed method performs better than the other four approaches. As shown in Figs. 5(e) and 5(f), our SCARF descriptor also outperforms the other descriptors when performance is measured for images with a significant amount of blur. Furthermore, Table 1 indicates that our approach gives a mean average recall of 0.6698 on bark and 0.7565 on bikes. The explanation may be that a change in viewpoint yields shape variations such as changes in the orientation and scale of the object. Nevertheless, the SCARF descriptor's invariance to viewpoint change is improved by using the concentric circle neighboring region for every feature point. Moreover, SCARF is a systematic approach that increases photometric invariance and discriminative power by integrating color and global shape information into the matching process. Using this model, a simple yet more robust measure of shape and light color similarity is adopted to provide a set of invariance properties that can achieve invariance to light color or shape changes and to shifts due to image blur.

Finally, we test the five descriptors on the leuven dataset and the ubc dataset to investigate their robustness to illumination change and JPEG compression. The performance comparison of the matching experiments is displayed in Figs. 5(g) and 5(h). All the descriptors are calculated on illumination-normalized image regions. Note that all five feature representations are quite robust to lighting changes and JPEG compression. However, from the results of Fig. 5(g) for illumination changes introduced by adjusting the camera aperture, our approach and CSIFT perform better than the other three descriptors. In addition, we can see from the average recall values in Table 1 that the performance of our method (0.7853) is slightly better than that of CSIFT (0.7633). The reason may be that the 32-dimensional color invariance components used in SCARF can provide more distinctive

Table 1 A summary of the INRIA datasets' average recall.

Transformation      Dataset   SIFT     CSIFT    PCA-SIFT  SIFT+GC  SCARF
Viewpoint           wall      0.4795   0.4721   0.4369    0.6080   0.6876
Zoom + rotation     bark      0.6537   0.6148   0.6182    0.6044   0.6698
Image blur          bikes     0.6940   0.7347   0.6741    0.6880   0.7565
Light change        leuven    0.7216   0.7633   0.7026    0.7160   0.7853
JPEG compression    ubc       0.7238   0.7327   0.7136    0.7186   0.7292


Fig. 6 SIFT matching and SCARF matching: (a) character recognition using SIFT; (b) character recognition using SCARF; (c) architecture retrieval using SIFT; and (d) architecture retrieval using SCARF.

information than the light color invariance components used in CSIFT. From the results shown in Fig. 5(h) and Table 1 for JPEG compression, it can be seen that the performance of the five descriptors is almost the same. Of interest is that PCA-SIFT does not perform well in almost every experiment; this is consistent with its weaker invariance properties. The aforementioned benchmark-based experiments indicate that the SCARF descriptor developed in this paper is the leading performer on nearly all the datasets. This result suggests that it is worthwhile to investigate combinations of several descriptors, such as the proposed SCARF descriptor, since they are not completely redundant. Results on two further tasks, traditional Chinese character recognition and architecture retrieval (see Fig. 6), show that SCARF performs better than SIFT.

6 Conclusions

In this paper, we developed a combined feature descriptor framework based on SIFT. The proposed framework, called SCARF, combines color and global information with a SIFT descriptor and can be utilized in many pattern recognition and computer vision applications. Specifically, the SCARF algorithm uses 32 concentric circles to construct the global descriptor and the color invariance descriptor. Experimental comparisons with four existing SIFT-related feature descriptors were carried out to evaluate the matching performance of the proposed approach. The results indicate that SCARF is more distinctive and more robust to common image transformations than the other four descriptors. In future work, we will reduce the dimensionality of the descriptors and improve the real-time performance of SCARF.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (60974108) and sponsored in part by a grant from the Major State Basic Research Development Program of China (2009CB724002) and the China Scholarship Council (CSC). We thank Rob Hess for providing the code for the descriptors.

References

1. T. Tuytelaars and L. Van Gool, "Matching widely separated views based on affine invariant regions," Int. J. Comput. Vision 59(1), 61–85 (2004).
2. D. Shin and T. Tjahjadi, "Clique descriptor of affine invariant regions for robust wide baseline image matching," Pattern Recognit. 43(10), 3261–3272 (2010).
3. Y. Li et al., "Query difficulty estimation for image retrieval," Neurocomputing 95, 48–53 (2012).
4. Z. Zhu et al., "Dynamic mutual calibration and view planning for cooperative mobile robots with panoramic virtual stereo vision," Comput. Vision Image Understanding 95(3), 261–286 (2004).
5. S. Jung-Young et al., "Recent developments in 3-D imaging technologies," J. Disp. Technol. 6(10), 394–403 (2010).
6. D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision 60(2), 91–110 (2004).
7. C. Wang et al., "A novel watermarking algorithm based on feature points and SVD," in Proc. Int. Conf. on Computational Intelligence and Software Engineering (CiSE 2009), pp. 1–4, IEEE (2009).
8. R. Zabih and J. Woodfill, "Non-parametric local transforms for computing visual correspondence," in Proc. Computer Vision–ECCV'94, pp. 151–158, Springer (1994).
9. S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Mach. Intell. 24(4), 509–522 (2002).
10. G. Carneiro and A. D. Jepson, "Phase-based local features," in Proc. Computer Vision–ECCV 2002, pp. 282–296, Springer (2002).
11. D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. Seventh IEEE Int. Conf. on Computer Vision, pp. 1150–1157, IEEE (1999).
12. E. Delponte et al., "SVD-matching using SIFT features," Graphical Models 68(5), 415–431 (2006).
13. H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: speeded up robust features," in Proc. Computer Vision–ECCV 2006, pp. 404–417, Springer (2006).


14. J.-M. Morel and G. Yu, "ASIFT: a new framework for fully affine invariant image comparison," SIAM J. Imaging Sci. 2(2), 438–469 (2009).
15. X. Guo et al., "MIFT: a mirror reflection invariant feature descriptor," in Proc. Computer Vision–ACCV 2009, pp. 536–545, Springer (2010).
16. L. Florack et al., "General intensity transformations and differential invariants," J. Math. Imaging Vision 4(2), 171–187 (1994).
17. C. Schmid and R. Mohr, "Local grayvalue invariants for image retrieval," IEEE Trans. Pattern Anal. Mach. Intell. 19(5), 530–534 (1997).
18. T. Gevers and A. W. Smeulders, "Color-based object recognition," Pattern Recognit. 32(3), 453–464 (1999).
19. J.-M. Geusebroek et al., "Color invariance," IEEE Trans. Pattern Anal. Mach. Intell. 23(12), 1338–1350 (2001).
20. A. Diplaros, T. Gevers, and I. Patras, "Combining color and shape information for illumination-viewpoint invariant object recognition," IEEE Trans. Image Process. 15(1), 1–11 (2006).
21. A. E. Abdel-Hakim and A. A. Farag, "CSIFT: a SIFT descriptor with color invariant characteristics," in Proc. 2006 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 1978–1983, IEEE (2006).
22. H. Stokman and T. Gevers, "Selection and fusion of color models for image feature detection," IEEE Trans. Pattern Anal. Mach. Intell. 29(3), 371–381 (2007).
23. A. Verma, C. Liu, and J. Jia, "New colour SIFT descriptors for image classification with applications to biometrics," Int. J. Biom. 3(1), 56–75 (2011).
24. A. Shokoufandeh, I. Marsic, and S. J. Dickinson, "View-based object recognition using saliency maps," Image Vision Comput. 17(5), 445–460 (1999).
25. K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1615–1630 (2005).
26. P. Moreno, A. Bernardino, and J. Santos-Victor, "Improving the SIFT descriptor with smooth derivative filters," Pattern Recognit. Lett. 30(1), 18–26 (2009).
27. B. Kim et al., "Haar-like compact local binary pattern for illumination-robust feature matching," J. Electron. Imaging 21(4), 043014 (2012).
28. L. Juan and O. Gwun, "A comparison of SIFT, PCA-SIFT and SURF," Int. J. Image Process. 3(4), 143–152 (2009).
29. J.-N. Liu and G.-H. Zeng, "Improved global context descriptor for describing interest regions," J. Shanghai Jiaotong Univ. 17(2), 147–152 (2012).
30. E. N. Mortensen, H. Deng, and L. Shapiro, "A SIFT descriptor with global context," in Proc. 2005 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR 2005), pp. 184–190, IEEE (2005).
31. C. Steger, "An unbiased detector of curvilinear structures," IEEE Trans. Pattern Anal. Mach. Intell. 20(2), 113–125 (1998).
32. P. Kubelka, "New contributions to the optics of intensely light-scattering materials. Part I," J. Opt. Soc. Am. 38(5), 448 (1948).
33. G. Wyszecki and W. S. Stiles, Color Science, Wiley, New York (1982).
34. M. Pilu, "A direct method for stereo correspondence based on singular value decomposition," in Proc. 1997 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 261–266, IEEE (1997).
35. Y. Ke and R. Sukthankar, "PCA-SIFT: a more distinctive representation for local image descriptors," in Proc. 2004 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR 2004), Vol. 2, pp. II-506–II-513, IEEE (2004).

Rui Wang is an associate professor in the School of Instrumentation and Opto-Electronics Engineering at Beihang University, where she received her on-the-job PhD degree in machine vision of precision instrument and mechanics in 2006. She received her BS and MS degrees in optical instruments from the Department of Precision Instruments and Mechanology, Tsinghua University, China, in 1988 and 1990, respectively. Her current research interests include machine vision, optical sensing and image processing, pattern recognition, and tracking. Zhengdan Zhu is a postgraduate in the School of Instrumentation and Opto-Electronics Engineering at Beihang University, where he received his BS and MS degrees in 2006 and 2014, respectively. His current research interests include machine vision, image processing, and pattern recognition. Liang Zhang is an assistant professor in the Department of Electrical and Computer Engineering at the University of Connecticut. He received his BE and ME degrees from the Department of Automation, Tsinghua University, China, in 2002 and 2004, respectively, and his PhD degree in electrical engineering systems from the University of Michigan in 2009. He was with the University of Wisconsin–Milwaukee from 2009 to 2013. His research interests include analysis and control of manufacturing and battery systems, and image processing.
