Locality Preserving Verification for Image Search

Shanmin Pang¹, Jianru Xue¹, Nanning Zheng¹, Qi Tian²
¹Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
²Dept. of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249
[email protected],
[email protected],
[email protected],
[email protected]
ABSTRACT

Establishing correct correspondences between two images has a wide range of applications, such as 2D and 3D registration, structure from motion, and image retrieval. In this paper, we propose a new matching method based on spatial constraints. The proposed method has linear time complexity and is efficient when applied to image retrieval. The main assumption behind our method is that the local geometric structure among a feature point and its neighbors is not easily affected by geometric or photometric transformations, and thus should be preserved in corresponding images. We model this local geometric structure by the linear coefficients that reconstruct the point from its neighbors. The method is flexible: it can not only estimate the number of correct matches between two images efficiently, but also determine the correctness of each match accurately. Furthermore, it is simple and easy to implement. When applied to re-ranking images in an image search engine, it outperforms state-of-the-art techniques.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Selection process

Keywords
Image retrieval, linear time complexity, local geometric structure, re-ranking
1. INTRODUCTION
Image retrieval, with the goal of finding images similar to a query in a large corpus, has improved greatly in recent years. Many state-of-the-art retrieval systems use the bag-of-words (BoW) model, in which each image is represented as a histogram of visual words. Owing to its efficiency and low memory footprint, the BoW representation makes it possible to query databases of millions or even billions of images. However, BoW has a limitation: retrieval accuracy is often low, since the representation discards the spatial information of images. Large-scale image retrieval therefore typically adds a post-processing step after the search step. Specifically, in the search step, similar images are retrieved from the large database based on the BoW representation, and an initial ranking is generated. The post-processing step then refines the initial ranking and provides a more precise ordering of the retrieved images. This step is linear in the number of images to match, hence its speed is crucial.

Post-processing is usually accomplished by spatial verification, which is important for improving the accuracy of matching and re-ranking. It considers the spatial configurations of features, filters out the mismatches produced by quantizing features with different semantics into the same visual word, and thereby re-ranks the retrieved images. Existing spatial verification methods can be roughly divided into two categories: local methods and global methods. The methods in [5, 9] belong to the first category. A weak geometric constraint, which requires that neighboring matches in the query image lie in a surrounding area in the retrieved image, is used in [9]. A similar idea appears in [5], in which the inverted file stores the scale and orientation of local features, and groups of matches must agree on their relative scale. Examples of the second category include [7, 10, 11]. In [11], the geometric context among local features is encoded into binary maps, and geometrically inconsistent matches are removed by analyzing those coding maps. Based on the assumption that two images are related by a homography, FSM [7] estimates a global transformation between the query image and a database image using LO-RANSAC [4, 6]. An advantage of FSM is that it enumerates all hypotheses by exploiting the shape information of Hessian-affine regions, and therefore removes the randomness of RANSAC. Unfortunately, FSM is quadratic in the number of matches. HPM [10] is a relaxed spatial matching method with linear time complexity; however, it cannot determine the correctness of an individual match accurately.

In this paper, we propose a new spatial verification method that exploits the local linear geometric structure in images. The proposed method finds correct correspondences accurately and efficiently. Its underlying assumption is that the local geometric structure among a feature point and its neighbors is not easily affected by complex transformations (for example, shape deformations), and thus should be preserved under photometric and geometric transformations. We model this structure by the linear coefficients that reconstruct each point from its neighbors.
2. FAST SPATIAL MATCHING METHOD
In this section, we first present our spatial matching method (Lines 1-8 of Algorithm 1), and then present two ways to identify the correct matches: the fast implementation (Lines 9-10 of Algorithm 1) estimates the number of correct matches efficiently, and the accurate implementation (Lines 11-12 of Algorithm 1) filters out mismatches exactly.

Algorithm 1: The proposed spatial matching method
Input: the set of N tentative matches, P
Output: the number of correct matches M, or the set of correct matches Q
Initialization: initialize Q to be an empty set, and let L = 0
 1: for i = 1 : N do
 2:   Find the K nearest neighbors x_{i1}, ..., x_{iK} of x_i;
 3:   Reconstruct x_i and y_i by Eq. (1) and Eq. (3), respectively;
 4:   if Eq. (4) holds then
 5:     Q = Q ∪ {x_i ↔ y_i, x_{i1} ↔ y_{i1}, ..., x_{iK} ↔ y_{iK}};
 6:     L = L + 1.
 7:   end if
 8: end for
 9: if the fast implementation then
10:   Compute M by Eq. (6);
11: else if the accurate implementation then
12:   Determine the correctness of each match in P − Q by Eq. (10), and update Q.
13: end if
2.1 Spatial Matching by Reconstruction

In image search, an image is represented by a set of local features, where each local feature is described by a descriptor, position, scale, orientation, etc. Suppose there are N tentative matches between images I_1 and I_2, where a match means the corresponding descriptors are quantized to the same visual word in the BoW model. For convenience, we use a set P to store these N tentative matches, and write x_i ↔ y_i for the i-th (i = 1, ..., N) match, where x_i = (u_i, v_i)^T and y_i = (u_i', v_i')^T are the locations of this match in I_1 and I_2, respectively.

For x_i, we find its K (K ≥ 2) nearest neighbors x_{i1}, ..., x_{iK} in image I_1, where i_j ∈ {1, ..., N} and j = 1, ..., K. The point x_i and its neighbors thus form a local geometric structure, which can be described by the linear coefficients that reconstruct x_i from x_{i1}, ..., x_{iK}. That is,

    x_i = \sum_{j=1}^{K} w_{ij} x_{ij},    (1)

where the coefficients w_{ij} are obtained by solving the following least-squares problem:

    \min_{w_i} \left\| x_i - \sum_{j=1}^{K} w_{ij} x_{ij} \right\|_2^2, \quad \text{s.t.} \; \sum_{j=1}^{K} w_{ij} = 1.    (2)

It should be noted that Eq. (1) reflects the relative location relationship among x_i and the x_{ij} (j = 1, ..., K). According to [8], the local geometric structure described by Eq. (1) is invariant to similarity transformations. Similarly, we can describe the relative location relationship between y_i and the y_{ij} (j = 1, ..., K) by

    y_i = \sum_{j=1}^{K} w_{ij}' y_{ij},    (3)

where y_{ij} is the location of the feature point in I_2 that corresponds to x_{ij}. Since the x_{ij} (j = 1, ..., K) are neighbors of x_i, and Eq. (1) is similarity invariant, we can conclude that if x_i ↔ y_i and all x_{ij} ↔ y_{ij} are correct matches, then w_{ij} and w_{ij}' should be nearly the same, or at least

    \| w_i - w_i' \|_1 < \epsilon,    (4)

where w_i = (w_{i1}, ..., w_{iK})^T, w_i' = (w_{i1}', ..., w_{iK}')^T, and ε > 0 is a preset threshold. Conversely, if Eq. (4) holds, we can safely regard x_i ↔ y_i and all x_{ij} ↔ y_{ij} as correct; otherwise, at least one of these matches is false, though we do not know which one.

We repeat the above process until every feature point x_i (i = 1, ..., N) has been processed. The matches that satisfy Eq. (4) are then collected, and constitute a subset Q of correct matches. Clearly, the set P − Q still contains both positive and negative matches. The remaining problem is to determine the correctness of each match in P − Q; we provide two solutions in Section 2.2 and Section 2.3.
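To make Eqs. (1)-(4) concrete, here is a minimal sketch, assuming NumPy/SciPy, with X and Y as N×2 arrays of matched point locations in I_1 and I_2. The function names and the relative regularization of the local Gram matrix (a standard safeguard borrowed from LLE [8]) are our own, not the authors' implementation:

```python
import numpy as np
from scipy.spatial import cKDTree

def reconstruction_weights(p, nbrs, reg=1e-3):
    """Solve Eq. (2): weights reconstructing point p from its K neighbors,
    subject to the weights summing to one (the classic LLE weight problem)."""
    Z = nbrs - p                                   # neighbors shifted so p is the origin, (K, 2)
    G = Z @ Z.T                                    # local Gram matrix, (K, K)
    G = G + (reg * np.trace(G) + 1e-12) * np.eye(len(G))  # regularize: G is often singular
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()                             # enforce the sum-to-one constraint

def verify_matches(X, Y, K=2, eps=0.5):
    """Lines 1-8 of Algorithm 1: X[i] <-> Y[i] are the N tentative matches.
    Returns the indices flagged as correct (the set Q) and the count L."""
    tree = cKDTree(X)
    Q, L = set(), 0
    for i in range(len(X)):
        _, idx = tree.query(X[i], k=K + 1)         # K+1: the point is its own nearest neighbor
        nbrs = [j for j in idx if j != i][:K]
        w  = reconstruction_weights(X[i], X[nbrs])
        wp = reconstruction_weights(Y[i], Y[nbrs])  # w' from the second image, as in Eq. (3)
        if np.abs(w - wp).sum() < eps:             # the L1 test of Eq. (4)
            Q.update([i, *nbrs])
            L += 1
    return Q, L
```

Note that with K = 2 points in the plane, the 2×2 Gram matrix is frequently rank-deficient, which is why the sketch adds the relative regularization term before solving.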
2.2 The Fast Implementation

Our first method is a coarse approach that only estimates the number of correct matches between I_1 and I_2. Assume that M out of the N matches are correct, and that L columns of W and W' satisfy Eq. (4), where W = (w_1, ..., w_N) and W' = (w_1', ..., w_N'). A column can satisfy Eq. (4) only when the match and its K neighbors are all correct, which happens with probability roughly (M/N)^{K+1} if the correct matches are spread uniformly among the tentative ones. Thus, by maximum likelihood estimation, we get

    (M/N)^{K+1} = L/N.    (5)

This means M is approximately equal to

    M = N \times (L/N)^{1/(K+1)}.    (6)

Though this method does not explicitly determine which matches are correct, we will see that it is effective for the purpose of geometric verification (Section 3). Moreover, it has O(N) complexity, so it is efficient and well suited for post-processing in large-scale image retrieval.
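As a quick sanity check of Eq. (6), here is a toy computation; the values of N, L, and K below are illustrative, not taken from the paper's experiments:

```python
N, K = 134, 2    # tentative matches and neighborhood size (illustrative values)
L = 50           # neighborhoods that passed the Eq. (4) test
M = N * (L / N) ** (1 / (K + 1))
print(round(M))  # -> 96, i.e., roughly 96 of the 134 matches are estimated correct
```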
2.3 The Accurate Implementation

The second method estimates the correctness of each match accurately. As mentioned above, if I_1 and I_2 are similar, some matches satisfying Eq. (4) have already been collected. Without loss of generality, assume these matches are x_l ↔ y_l (l = 1, ..., n). The relationship between I_1 and I_2 can then be characterized by these n correct matches. In fact, a correct match x_l ↔ y_l means there exists a vector-valued function f (if the photographed scene is a plane, f is the familiar homography) which maps point x_l to point y_l, i.e.,

    f(x_l) = y_l, \quad l = 1, ..., n.    (7)
To achieve accuracy and robustness to noise, we use the method proposed in [3] to estimate f. That is, we solve the following minimization problem:

    \min_{f \in H} \left\{ \frac{1}{n} \sum_{l=1}^{n} \| y_l - f(x_l) \|_2^2 + \lambda \| f \|_H^2 \right\},    (8)
where λ > 0 is a regularization parameter, H is a reproducing kernel Hilbert space (RKHS), and \| \cdot \|_H denotes the norm in H. The solution of Eq. (8) is given by [3]:

    f(x) = \sum_{l=1}^{n} \gamma(x, x_l) c_l, \quad \text{with} \quad (\Gamma + \lambda n I_{2n \times 2n}) C = Y,    (9)

where γ(x, x_l) is a kernel matrix of size 2 × 2 and c_l ∈ R^{2×1} is the corresponding coefficient. Γ is an n × n block matrix whose (i, j)-th block is γ(x_i, x_j), and C = (c_1^T, c_2^T, ..., c_n^T)^T and Y = (y_1^T, y_2^T, ..., y_n^T)^T are 2n-dimensional vectors.

For efficiency, we choose γ(x_i, x_j) = e^{-\beta \| x_i - x_j \|^2} I_{2×2} (where β > 0 is a scalar parameter), one of the simplest kernel matrices, for solving Eq. (9). With this choice, the coefficient matrix Γ + λn I_{2n×2n} is positive definite. Furthermore, to solve the linear system in Eq. (9), we use the low-rank matrix approximation strategy of [1], so f can be obtained with linear time complexity. After obtaining f, we determine the correctness of each match x_i ↔ y_i in P − Q by

    \xi_i = \| y_i - f(x_i) \|_2^2, \quad i = n+1, ..., N.    (10)

If ξ_i < τ (where τ > 0 is a scalar threshold), x_i ↔ y_i is judged to be a true match and added to Q; otherwise, it is considered false. The proposed fast spatial matching method is summarized in Algorithm 1.

[Figure 1: Two examples of the proposed method and FSM. (a) 134 tentative matches obtained by the BoW model between two similar images (top); 42 mismatches obtained by BoW between two unrelated images (bottom). (b) 5-dof FSM finds only 40 correct matches in the top image pair, missing many correct ones; in the bottom pair, it wrongly accepts 3 matches as correct. (c) The proposed method finds 99 correct matches consistent with human perception in the top pair, and no correct matches in the bottom pair. Note: for visibility, only 50 randomly selected matches are shown in the top-left and top-right image pairs.]
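The following is a minimal NumPy sketch of the accurate implementation, assuming the n seed matches are stored in n×2 arrays Xn and Yn. Because the chosen kernel is a scalar Gaussian times I_{2×2}, the 2n×2n system of Eq. (9) decouples into a single n×n scalar system shared by both coordinates; the sketch solves it directly and omits the low-rank speedup of [1]. Function names and array layout are ours:

```python
import numpy as np

def fit_f(Xn, Yn, beta=0.1, lam=3.0):
    """Estimate f of Eqs. (8)-(9) from the n seed matches Xn -> Yn."""
    n = len(Xn)
    D2 = ((Xn[:, None, :] - Xn[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    K = np.exp(-beta * D2)                                  # scalar Gaussian kernel matrix
    A = np.linalg.solve(K + lam * n * np.eye(n), Yn)        # coefficients, one column per coordinate
    def f(X):
        d2 = ((X[:, None, :] - Xn[None, :, :]) ** 2).sum(-1)
        return np.exp(-beta * d2) @ A                       # f evaluated at the query points
    return f

def verify_remaining(f, X_rest, Y_rest, tau=0.1):
    """Eq. (10): accept x_i <-> y_i when the squared residual is below tau."""
    xi = ((Y_rest - f(X_rest)) ** 2).sum(-1)
    return xi < tau
```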
3. EXPERIMENTS AND RESULTS

Datasets. We apply the proposed method to perform re-ranking in an image retrieval engine, and test it on two publicly available image retrieval datasets: the Oxford 5K dataset [7] and the Flickr 1M dataset [5]. To test performance on large-scale image retrieval, we add the Flickr 1M dataset as distractors to the Oxford 5K dataset. For a fair comparison with other methods, we use the descriptors provided by [7, 5] on the dataset web pages.

Evaluation criteria. We measure the performance of the proposed method by mean Average Precision (mAP), and evaluate it against two state-of-the-art methods: Fast Spatial Matching (FSM) [7] and Hough Pyramid Matching (HPM) [10]. In image retrieval, the standard tf-idf weighting scheme [2] is often used to weight matched features differently. However, our experiments show that this scheme has little effect on retrieval results: on the Oxford 5K dataset, simply counting the number of matched features yields an mAP of 0.611, while tf-idf weighting only increases it to 0.618. This observation is also reported in [11]. For this reason, we calculate the similarity between two images by counting the number of matches, with L2 normalization; a sketch of this scoring appears below.
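The paper does not spell out the exact scoring formula, so the following is a hedged sketch of one plausible reading, where bow_q and bow_d are the BoW histograms, matched_words lists the visual words matched between the two images, and idf holds per-word inverse document frequencies; all names and the precise normalization are our assumptions:

```python
import numpy as np

def similarity_count(n_matches, bow_q, bow_d):
    # plain match counting with L2 normalization (our reading of the paper's scheme)
    return n_matches / (np.linalg.norm(bow_q) * np.linalg.norm(bow_d))

def similarity_tfidf(matched_words, idf, bow_q, bow_d):
    # tf-idf variant [2]: a matched word contributes its squared idf weight;
    # on Oxford 5K this reportedly gains little (mAP 0.611 -> 0.618)
    score = sum(idf[w] ** 2 for w in matched_words)
    return score / (np.linalg.norm(idf * bow_q) * np.linalg.norm(idf * bow_d))
```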
Parameter initialization. To demonstrate the performance of the proposed method, four vocabulary sizes (200K, 500K, 750K, 1M) are tested. Throughout this paper, we set ε = 0.5, β = 0.1, λ = 3 and τ = 0.1 in Algorithm 1. When re-ranking the top 800 images on the Oxford 5K dataset, the mAP of the fast implementation is 0.725, 0.708 and 0.714 with K = 2, 3 and 4, respectively; K = 2 gives the best results and is used in all reported results.

3.1 Results

We compare our method with 5-dof FSM (in [7], 3-dof, 4-dof and 5-dof transformations are implemented, and the 5-dof affine transformation performs best among them) on three factors: speed, memory usage and retrieval accuracy.

Speed: Our method has linear time complexity, while FSM is quadratic in the number of matches N. In our experiments, the fast implementation (TFI) is about 5 times faster than FSM in matching image pairs, and the accurate implementation (TAI) is about 3 times faster than FSM.

Memory usage: In the inverted file, our method only stores the image ID and the location of each feature, whereas FSM and HPM need additional memory to store the scale and angle of each feature point. Our method therefore saves a couple of bytes per indexed feature compared with FSM and HPM.
Retrieval accuracy: Varying the number of re-ranked images, we measure the mAP of FSM and our method, and report the comparison results on the Oxford 5K dataset and the Flickr 1M dataset separately. Two examples of the proposed method and FSM are shown in Fig. 1, from which it can be observed that our approach is more accurate than FSM at removing mismatches.

Table 1: Comparison of our method and FSM on the Oxford 5K dataset (mAP).

(a) 200K visual vocabulary (no re-ranking: 0.472)
Method / Re-rank N    100     200     400     800
FSM (5-dof)           0.513   0.528   0.544   0.560
TFI                   0.518   0.533   0.553   0.579
TAI                   0.511   0.527   0.546   0.574

(b) 500K visual vocabulary (no re-ranking: 0.518)
Method / Re-rank N    100     200     400     800
FSM (5-dof)           0.549   0.562   0.578   0.593
TFI                   0.558   0.579   0.602   0.629
TAI                   0.560   0.582   0.606   0.633

(c) 750K visual vocabulary (no re-ranking: 0.532)
Method / Re-rank N    100     200     400     800
FSM (5-dof)           0.577   0.589   0.600   0.613
TFI                   0.584   0.603   0.624   0.668
TAI                   0.582   0.599   0.618   0.659

(d) 1M visual vocabulary (no re-ranking: 0.611)
Method / Re-rank N    100     200     400     800
FSM (5-dof)           0.648   0.657   0.660   0.664
TFI                   0.650   0.670   0.689   0.725
TAI                   0.647   0.666   0.683   0.716

Table 1 compares the mAP of our approach with that of FSM under different vocabulary sizes on the Oxford 5K dataset. It clearly shows that both TFI and TAI outperform FSM. When re-ranking the top 800 images with a 1M visual vocabulary, our method raises the mAP from FSM's 0.664 to 0.725 (TFI), a 9% improvement; there are also clear improvements with vocabulary sizes of 750K, 500K and 200K. It should be noted that our method is at the same time much faster than FSM. Moreover, our best score (0.725) surpasses the best score achieved by HPM (0.692, reported in [10]), even though the latter re-ranks 1K images and uses its own specific codebook.

Table 2: Comparison of our method and FSM on the Oxford 5K dataset with Flickr 1M images as distractors (mAP). The vocabulary size is 1M; the mAP with no re-ranking is 0.389.
Method / Re-rank N    100     200     400     800
FSM (5-dof)           0.435   0.446   0.455   0.462
TFI                   0.452   0.469   0.486   0.503
TAI                   0.449   0.466   0.477   0.491

Table 2 shows the retrieval accuracy when the Flickr 1M images are added as distractors. The results clearly show that both TAI and TFI again outperform FSM. In theory, TAI should be better than TFI; surprisingly, however, TFI performs similarly to, and sometimes better than, TAI (see Tables 1 and 2). We suspect this is caused by the choice of the kernel matrix γ in TAI, and will investigate it in future work.
4. CONCLUSIONS

In this paper, by exploiting local geometric structure, we have proposed an efficient and effective spatial verification method that boosts the performance of image search. The proposed method is simple and easy to implement. Both the theoretical analysis and the experimental results show that it has three major advantages over FSM: 1) it has linear complexity and is faster than FSM, which is quadratic in the number of correspondences; 2) it does not require that the two images be related by a homography (i.e., that the photographed scene lie in a plane), whereas FSM does; 3) it is more accurate than FSM and uses less memory.
5. ACKNOWLEDGEMENTS

This work was supported by the National Basic Research Program of China (973 Program) under Grant No. 2012CB316400, and by National Natural Science Foundation of China (NSFC) projects 61273252 and 90920301. The work of Dr. Qi Tian was supported in part by ARO grant W911BF-12-1-0057, NSF IIS 1052851, Faculty Research Awards from Google, FXPAL, and NEC Laboratories of America, a 2012 UTSA START-R Research Award, and NSFC 61128007.
6. REFERENCES

[1] A. Myronenko and X. Song. Point set registration: Coherent point drift. IEEE TPAMI, 32(12):2262–2275, Dec. 2010.
[2] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.
[3] L. Baldassarre, L. Rosasco, A. Barla, and A. Verri. Vector field learning via spectral filtering. In ECML, pages 56–71, 2010.
[4] O. Chum, J. Matas, and Š. Obdržálek. Enhancing RANSAC by generalized model optimization. In ACCV, pages 812–817, 2004.
[5] H. Jégou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, pages 304–317, 2008.
[6] K. Lebeda, J. Matas, and O. Chum. Fixing the locally optimized RANSAC. In BMVC, pages 1–11, 2012.
[7] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, pages 1–8, 2007.
[8] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[9] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, pages 1470–1477, 2003.
[10] G. Tolias and Y. Avrithis. Speeded-up, relaxed spatial matching. In ICCV, pages 1653–1660, 2011.
[11] W. Zhou, Y. Lu, H. Li, Y. Song, and Q. Tian. Spatial coding for large scale partial-duplicate web image search. In ACM Multimedia, pages 511–520, 2010.