GPU-enabled High Performance Online Visual Search with High Accuracy

Ali Cevahir
Rakuten Institute of Technology, Rakuten, Inc.
Tokyo, Japan
[email protected]

Junji Torii
Rakuten Institute of Technology, Rakuten, Inc.
Tokyo, Japan
[email protected]

Abstract—We propose an online image search engine based on local image features (keypoints), which runs fully on GPUs. State-of-the-art visual image retrieval techniques are based on the bag-of-visual-words (BoV) model, which is an analogy to text-based search. In BoV, each keypoint is rounded off to the nearest visual word. In this work, by contrast, we exploit the vector computation power of GPUs and use the real values of the keypoint descriptors. We match keypoints in two steps. The first step is similar to visual word matching in BoV. In the second step, we match at the keypoint level. By keeping the identity of each keypoint, the closest keypoints are retrieved accurately in real time. Image search also has different characteristics than textual search: we implement one-to-one keypoint matching, which is more natural for images. Our experiments reveal a 265 times speedup for offline index generation, a 104 times speedup for online index search and a 20.5 times speedup for online keypoint matching, compared to the CPU implementation. Our proposed keypoint-matching-based search improves the accuracy of BoV by 9.5%.

Keywords—Content-based image retrieval; GPU computing; k-means clustering.

I. INTRODUCTION

In this study, we introduce an online image search engine for large image collections. We propose an image retrieval method which achieves high accuracy, and a GPU implementation that accelerates the retrieval components enough to make accurate online image search practical.

Image search is defined as follows. For a query image, the search engine retrieves visually similar images, such as those taken at different angles, scales, lighting conditions, etc. For example, to search for an item on an e-commerce site, the user takes a picture of the item and uses the picture as a search query. The search engine is expected to return a list of images in which the images containing the queried object are listed on top. There may be billions of images in the database, and the search results should be returned in real time.

Visually similar image search has been a very active research area over the last decade. Recent techniques utilize scale- and rotation-invariant local feature points of images. Feature points, or keypoints, of an image are interesting points on the image, which can be extracted using various detectors [1]. Extracted keypoints are then represented by high-dimensional feature vectors, or descriptors [2].


For example, the 128-element SIFT [3] feature vector representation is the most popular. An image is described by hundreds to thousands of such feature vectors.

Once images are stored as sets of high-dimensional feature vectors, visually similar image search can be realized by matching the keypoints of the query image with the closest keypoints in the database. However, as the number of images increases, it becomes impractical to match keypoints by exhaustive search over high-dimensional feature vectors. For this reason, state-of-the-art techniques focus on matching visual words instead of individual keypoints. A visual word dictionary is calculated by clustering or quantization of keypoints. Bag-of-visual-words (BoV) is the representation of images as sets of visual words; the BoV representation is similar to bag-of-words for text documents. Therefore, retrieval techniques applied to text search can be utilized in BoV. However, as the feature vectors are rounded off to visual words, matching quality decreases. To improve matching quality and accuracy, researchers have proposed different retrieval methods and scoring algorithms at the visual word level, some of which we explain in Section II.

In this work, we propose a two-step matching technique for keypoints, instead of visual words, as explained in Section III. In the first step, we do an index search, where the index is precomputed by clustering. In the second step, we further search for closer keypoints corresponding to the matching indexes. By doing so, we reduce the quantization errors resulting from visual word matching in BoV. Thanks to the superior performance of GPUs in vector computing, online feature vector matching can be realized in real time.

GPU accelerators contain a large number of processors. They were originally designed for accelerating graphics processing, but recent GPUs can be programmed to run general-purpose high-performance computations. For example, NVIDIA's CUDA GPUs, CUDA also being the name of the C++ library for programming the hardware, have recently gained great popularity in many areas of computational science. We use GPUs in every step of the image search: extracting keypoints, matching indexes and matching keypoints. We use GPUs not only in online processing but also in offline preprocessing, i.e., clustering keypoints.

For well-known clustering algorithms, GPUs cluster the feature vectors in much less time than CPUs. In our tests, up to a 265 times speedup is gained with GPUs when clustering large numbers of keypoints. Using GPUs, more time can be devoted to generating better clusters. Employing GPUs, it is also possible to use larger training data for clustering, such as the actual image keypoints to be searched, and to update the index in shorter cycles as the images in the database change and the clusters become distorted.

Although BoV draws an analogy between visual search and text search, the characteristics of image matching are somewhat different from keyword matching. For example, when searching for a keyword in text documents, documents with many occurrences of the keyword, except for stop words, are ranked higher. However, this is not always the case for image search. When the Turkish flag, with only one star, is queried, it is not appropriate to rank the US flag higher because it has 50 stars. To prevent such mistakes, we implement one-to-one keypoint matching, which improved retrieval accuracy in our tests.

The disadvantage of the proposed two-step matching algorithm is that it requires storing the 128-element feature vector of each keypoint. Although real-time search is possible even for large image databases when feature vectors are stored on high-speed disks, it may not be feasible for low-spec systems. In this case, GPU acceleration is still possible with BoV, although accuracy is sacrificed.

II. RELATED WORK

Content-based image retrieval (CBIR) is a well-studied research field. In CBIR systems, images are represented by values, called features, which are used in the retrieval process. These features may be global, such as color histograms, or local, such as keypoints. Comparisons of various image features for CBIR are presented in [4].

Recently, local features have gained attention for object recognition and visual search. Local feature points can be matched for retrieving images by applying kNN or approximate kNN [3], [5], [6]. However, matching high-dimensional feature points by exhaustive search is time-consuming and becomes impractical for large image databases. To query large databases fast, BoV-based matching was proposed by Sivic and Zisserman [7]. They apply k-means clustering of SIFT features to generate visual words; cluster centers are considered as visual words. The closest cluster centers are found by calculating distances between cluster centers and query keypoints. Retrieval is implemented using an inverted list of visual words, and TF-IDF scoring is used for ranking results. They also do some processing to acquire a level of spatial consistency in matching.

Many studies have been published inspired by the BoV idea, mainly discussing clustering algorithms, matching of visual words and scoring techniques.

In one such study, which also had a great impact, Nister and Stewenius proposed hierarchical k-means tree clustering for generating visual words [8]. To generate the vocabulary tree, each cluster is divided into subclusters until a predetermined number of levels is reached. For a large number of keypoints in the training set and a large number of visual words (clusters), flat k-means clustering is very time-consuming. The cost of k-means clustering is O(N K), where N is the number of keypoints to cluster and K is the number of clusters, whereas the cost of hierarchical tree clustering is O(N log K). They also explain how to calculate scores using the cluster tree information, to handle errors resulting from keypoints that are close to the cluster boundaries. Approximate k-means with multiple randomized kd-trees is used for building the visual word dictionary in [9]. Although the kd-tree is an approximation for calculating distances, using multiple randomized kd-trees mitigates clustering errors for keypoints lying close to the boundaries. They also describe a post-processing method for re-ranking retrieval results by spatial matching. There are studies which match multiple visual words with a query keypoint, instead of only one, and adjust scores accordingly [10], [11], [12]. Other papers presenting advanced scoring algorithms are [13], [14]. One-to-one matching of keypoints between two images is studied in [6] and shown to be more effective than many-to-many matching.

Utilization of GPUs in online search systems is rather rare, because per-query operations are usually not computation-intensive. List intersections and index compression for Web text search engines are investigated in [15]. A factorial correspondence analysis and filtering algorithm on GPUs for image search based on visual words is considered in [16].

III. VISUAL SEARCH ALGORITHMS FOR LARGE IMAGE DATABASES

Images are retrieved by matching their keypoints, described as feature vectors. When a query image is received, the search engine first extracts its keypoints and describes the image as a list of feature vectors. Query keypoints are then matched to the ones in the database. In this work, we skip implementation and evaluation of spatial matching of keypoints, which can be realized by post-processing on the retrieved images [9]. Instead, we concentrate on efficient matching of keypoints. In this section, we explain the details of the algorithms for achieving high-accuracy, fast keypoint matching for image search. Note that the algorithms we provide can also be implemented on CPUs. However, for large databases, the power of GPU computing is required to realize real-time online search, as will be discussed in the experimental results.

Offline preprocessing is required before online search. We first cluster all keypoints extracted from all images, so that closer points are gathered in the same cluster. During online search, each cluster center serves as an index to the keypoints in the cluster. We first explain the online search algorithm assuming the dataset is already clustered. Then, we explain how to build the clusters by hierarchical k-means.

A. 2-Step Keypoint Matching

Without clustering the keypoints in the database, it is possible to match them with query keypoints: for each query keypoint, the closest keypoints can be retrieved by exhaustive search, that is, by applying kNN over all feature vectors in the database. However, this is extremely time-consuming for large image databases. Therefore, we utilize clustering of feature vectors for online search.

Online search consists of two steps. In the first step, keypoints are matched with the closest cluster centers. Considering cluster centers as visual words, this step is similar to matching visual words in the BoV approach. In the second step, we match individual keypoints with query keypoints. A query keypoint is matched with at most k keypoints from the cluster it matched in the first step. That is, we apply kNN within each matching cluster. Since the number of keypoints within a cluster is much lower than the number of all keypoints, it is possible to achieve this kind of matching in real time using the power of GPUs in vector computing.

The quality of clustering affects the correctness of keypoint matching. Assuming we have high-quality clusters, very precise matching, close to exhaustive search over all features, can be achieved. The point of selecting a subset of keypoints within clusters is to reduce errors when matching keypoints that are close to the cluster borders. We assume the number of matching keypoints, k, is sufficiently small compared to the cluster sizes. Better accuracy can also be achieved with larger clusters, which increase the search space per cluster and diminish border effects. However, we do not put any restrictions on k or the cluster sizes. Accuracy can be increased by having larger clusters, but matching speed decreases. These parameters can be adjusted according to the database size and the precision vs. query time trade-off. Another factor affecting search precision is the distance measure used to compare vectors. The most widely used distance measures in the literature are the L1 and L2 distances. In our tests, we have seen that the L1 distance is slightly better, which conforms to what is reported in [8].
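As an illustration of the two steps, the following is a minimal CPU reference sketch (the paper notes that the algorithms can also be implemented on CPUs; the actual system runs both steps on the GPU). The data layout, the function names and the flat scan over cluster centers in the first step are illustrative assumptions rather than the authors' implementation:

```cpp
// CPU reference sketch of the 2-step matching: step 1 finds the closest cluster
// center, step 2 runs kNN (sort-and-select) only inside that cluster.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

constexpr int DIM = 128;                      // SIFT descriptor length

struct Keypoint {
    std::uint8_t desc[DIM];                   // one unsigned byte per element
    int image_id;                             // image the keypoint belongs to
};

struct Cluster {
    std::uint8_t center[DIM];                 // cluster center = "visual word"
    std::vector<Keypoint> members;            // keypoints assigned to the cluster
};

// L1 distance, the measure the paper found slightly better than L2.
int l1(const std::uint8_t* a, const std::uint8_t* b) {
    int d = 0;
    for (int i = 0; i < DIM; ++i) d += std::abs(int(a[i]) - int(b[i]));
    return d;
}

// Step 1: match the query descriptor with the closest cluster center
// (a flat scan here; assumes at least one cluster exists).
std::size_t closest_cluster(const std::vector<Cluster>& clusters,
                            const std::uint8_t* q) {
    std::size_t best = 0;
    int best_d = l1(clusters[0].center, q);
    for (std::size_t c = 1; c < clusters.size(); ++c) {
        int d = l1(clusters[c].center, q);
        if (d < best_d) { best_d = d; best = c; }
    }
    return best;
}

// Step 2: kNN inside the matched cluster only (sort-and-select).
struct Match { int dist; int image_id; };

std::vector<Match> knn_in_cluster(const Cluster& cl, const std::uint8_t* q, int k) {
    std::vector<Match> m;
    m.reserve(cl.members.size());
    for (const Keypoint& p : cl.members)
        m.push_back({l1(p.desc, q), p.image_id});
    std::sort(m.begin(), m.end(),
              [](const Match& a, const Match& b) { return a.dist < b.dist; });
    if ((int)m.size() > k) m.resize(k);
    return m;
}
```

In the actual engine the first step may instead traverse the k-means tree, and the second step is realized on the GPU as sort-and-select, as discussed in Sections III-C and III-D.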

Figure 1. Many-to-many vs. one-to-one keypoint matching of two images.

B. One-to-One Keypoint Matching

The term one-to-one matching refers to one-to-one keypoint matching between two images. That is, one keypoint of a query image Q may match only one keypoint from an image A. Similarly, one keypoint from image A can only match one keypoint from Q. Note that in one-to-one matching we do not prevent a query keypoint from matching multiple keypoints from different images; as explained in the previous section, each query keypoint is matched with k keypoints from different images in the database.

For visual matching of images, this kind of matching is more natural. Reoccurring patterns may boost the rankings of false matches if many-to-many matching between two images is not prohibited. See Fig. 1 for an example, in which the left and right images are matched. A similar pattern (a dark lane between two bright lanes) occurs 3 times on the left-hand side and 12 times on the right-hand side. Although the images are completely different, there are 36 matches with many-to-many matching. On the other hand, there are only 3 matches with one-to-one matching, as depicted in the second picture of the figure. From this comparison it can be seen that many-to-many matching unnecessarily boosts the scores of reoccurring patterns. We should note that in an appropriate BoV-based implementation many-to-many matching should not be an issue, since the visual word frequencies of images can be compared during scoring. However, for keypoint matching, frequency comparison is not useful, since there is a countless number of distinct keypoints. One-to-one matching of keypoints is proposed instead.

Actually, one-to-one matching is a weighted bipartite graph matching problem between the query image and each of the matching images. This problem has been studied on GPUs [17]. However, since weighted bipartite graph matching for each retrieved image becomes costly, we apply an approximate greedy algorithm for one-to-one matching during online search. For each query keypoint, if there are multiple matches within a database image, we keep only the closest one. If a keypoint in a database image has been matched before, we do not match it with another query keypoint.
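A minimal sketch of this greedy rule, combined with the image scoring described in the next paragraph, is given below. The 1/(1 + distance) match weight and the container names are illustrative assumptions; the paper only states that closer matches score higher and that scores are normalized by the number of keypoints an image contains:

```cpp
// Sketch of greedy one-to-one matching followed by image scoring.
#include <algorithm>
#include <map>
#include <set>
#include <utility>
#include <vector>

struct Candidate {      // one candidate match produced by the step-2 kNN
    int query_kp;       // index of the query keypoint
    int image_id;       // database image of the matched keypoint
    int db_kp;          // global id of the matched database keypoint
    int dist;           // L1 distance of the match
};

std::map<int, double> score_images(const std::vector<Candidate>& candidates,
                                   const std::map<int, int>& keypoints_per_image) {
    // Keep only the closest match of each query keypoint within each image.
    std::map<std::pair<int, int>, Candidate> best;
    for (const Candidate& c : candidates) {
        auto key = std::make_pair(c.query_kp, c.image_id);
        auto it = best.find(key);
        if (it == best.end() || c.dist < it->second.dist) best[key] = c;
    }

    // A database keypoint may be matched at most once: accept the remaining
    // candidates greedily in ascending distance order.
    std::vector<Candidate> pruned;
    for (const auto& kv : best) pruned.push_back(kv.second);
    std::sort(pruned.begin(), pruned.end(),
              [](const Candidate& a, const Candidate& b) { return a.dist < b.dist; });

    std::set<int> used_db;
    std::map<int, double> score;
    for (const Candidate& c : pruned) {
        if (used_db.count(c.db_kp)) continue;
        used_db.insert(c.db_kp);
        score[c.image_id] += 1.0 / (1.0 + c.dist);   // closer matches score higher
    }
    // Normalize by the number of keypoints each image contains.
    for (auto& kv : score) kv.second /= keypoints_per_image.at(kv.first);
    return score;
}
```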

After matching keypoints, images are ranked according to the cumulative scores of the matched keypoints they contain. Many measures can be taken into account for scoring matches. In this work, since we calculate distances between individual keypoints, we can use this information for ranking images, i.e., closer matches are scored higher. This information is not available in BoV-based search, although the distance between query keypoints and the cluster centers representing visual words can be calculated. Normalization of the number of matches by the number of keypoints that the image contains is also an important factor in ranking.

C. Hierarchical K-means Clustering

K-means clustering is one of the most frequently used methods to gather closer points together, in which K clusters are generated around K mean points. We need to cluster all keypoints according to the same distance measure used in online matching. At the end of clustering, each cluster should include a reasonable number of keypoints to handle online matching within the required response time. As there should be many clusters for large datasets, clustering time becomes the bottleneck.

We adopt hierarchical clustering to reduce the k-means clustering time, as follows. We first cluster the N keypoints into K′ clusters, where K′ is much smaller than K. We continue clustering each cluster into sub-clusters, until the number of points in each cluster falls below a predetermined number. Note that the clustering factor for each level is not restricted to K′; it can be adjusted according to the number of points in the cluster and the estimated number of points within each sub-cluster. The run time of this kind of clustering of N points is O(N log K), whereas flat k-means is O(N K).

Hierarchical k-means clustering is also used to build visual words in [8]. However, the depths of the leaves in their k-means tree are all the same, regardless of the number of elements within the clusters. In contrast, we stop further clustering when a cluster is sufficiently small, while other clusters on the same level may be clustered further if they are not small enough. The motivation for clustering in [8] and in this work is similar. However, their objective is to reduce keypoints to visual words, so that the time required for keypoint matching is decreased. This is also similar to our goal in the first step of online matching, but our ultimate goal is to have an acceptable number of points in each cluster for further keypoint matching. As will be discussed in Section IV, fixed-level clustering may result in some clusters becoming extremely large while others are very small. Besides runtime, imbalanced clusters may affect retrieval accuracy. Our hierarchical clustering approach generates better-balanced clusters. As a result, keypoint matching latencies in the second step are also balanced.
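A compact CPU sketch of this adaptive-depth clustering is shown below. The seeding, the fixed number of Lloyd iterations and the parameter names are illustrative assumptions; it shows the control flow only, not the authors' GPU implementation:

```cpp
// Adaptive-depth hierarchical k-means: a node is split further only while it
// still holds more than max_leaf_size points.
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <vector>

constexpr int DIM = 128;
using Desc = std::vector<std::uint8_t>;    // one 128-byte SIFT descriptor

static int l1(const Desc& a, const Desc& b) {
    int d = 0;
    for (int i = 0; i < DIM; ++i) d += std::abs(int(a[i]) - int(b[i]));
    return d;
}

struct Node {
    Desc center;                               // used when traversing the tree
    std::vector<std::unique_ptr<Node>> children;
    std::vector<int> points;                   // descriptor ids (leaf nodes only)
};

// One flat k-means run over the points listed in `ids` (plain Lloyd iterations).
static std::vector<std::vector<int>> kmeans_split(const std::vector<Desc>& data,
                                                  const std::vector<int>& ids,
                                                  std::vector<Desc>& centers,
                                                  int iters = 10) {
    std::vector<std::vector<int>> groups(centers.size());
    for (int it = 0; it < iters; ++it) {
        for (auto& g : groups) g.clear();
        for (int id : ids) {                   // assignment step (L1 distance)
            std::size_t best = 0;
            for (std::size_t c = 1; c < centers.size(); ++c)
                if (l1(data[id], centers[c]) < l1(data[id], centers[best])) best = c;
            groups[best].push_back(id);
        }
        for (std::size_t c = 0; c < centers.size(); ++c) {   // update step
            if (groups[c].empty()) continue;
            std::vector<long> sum(DIM, 0);
            for (int id : groups[c])
                for (int i = 0; i < DIM; ++i) sum[i] += data[id][i];
            for (int i = 0; i < DIM; ++i)
                centers[c][i] = std::uint8_t(sum[i] / groups[c].size());
        }
    }
    return groups;
}

// Recursively split `ids` until every leaf holds at most max_leaf_size points.
void build(Node& node, const std::vector<Desc>& data, std::vector<int> ids,
           std::size_t branch, std::size_t max_leaf_size) {
    if (ids.size() <= max_leaf_size) { node.points = std::move(ids); return; }
    std::vector<Desc> centers;                 // naive seeding: first `branch` points
    for (std::size_t c = 0; c < branch && c < ids.size(); ++c)
        centers.push_back(data[ids[c]]);
    auto groups = kmeans_split(data, ids, centers);
    for (std::size_t c = 0; c < centers.size(); ++c) {
        if (groups[c].empty()) continue;
        auto child = std::make_unique<Node>();
        child->center = centers[c];
        if (groups[c].size() == ids.size())    // no progress: stop to avoid endless recursion
            child->points = groups[c];
        else
            build(*child, data, groups[c], branch, max_leaf_size);
        node.children.push_back(std::move(child));
    }
}
```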

Optimally, clustering is applied over all keypoints extracted from all images in the dataset. Since k-means becomes very fast on GPUs, millions of points can be precisely clustered into thousands of clusters within several minutes. However, if the resources are insufficient to process all keypoints in the required time, a subset of keypoints, representative of the whole dataset, may be selected for clustering. Once the smaller subset of keypoints is clustered, its cluster centers are used for clustering the remaining keypoints: each remaining point is added to the cluster whose center is closest. Obviously, online matching precision decreases somewhat with this kind of clustering. Note that multiple GPUs can be used to cluster as many points as possible if this precision loss is not desirable.

Approximation of flat k-means clustering using kd-trees may be applied to obtain O(N log K) clustering time, as explained in [9]. However, as will be discussed in Section IV, we found that hierarchical clustering has accuracy similar to flat k-means clustering without approximation, as opposed to what has been reported in [9]. The authors of [9] do not provide the threshold they used to terminate k-means iterations, i.e., the ratio of points changing clusters between iterations. The reason they observed relatively good precision for approximate k-means and bad precision for hierarchical k-means (as described in [8]) may be that they set the threshold very high. In that case, the flat k-means may already have been not very precise, so that its difference from the approximation is not very significant; error propagation also becomes high as clustering continues in hierarchical k-means. As they did not provide the threshold, we cannot confirm this assumption.

When matching cluster centers with query keypoints in the first step of online search, all cluster centers can be scanned, or the k-means tree can be traversed for faster matching. Cluster centers are kept for each node of the k-means tree, for traversing down to the leaves (the actual clusters). In our tests, we found that traversing the cluster tree instead of a brute-force scan does not reduce precision significantly.

D. Implementation Details on GPU

Extracting keypoints from a large number of images can be accelerated by GPU implementations of keypoint detectors. For example, SiftGPU [18] is an open-source implementation of SIFT with the DoG keypoint detector, which speeds up the keypoint extraction process by more than an order of magnitude. Once the keypoints of all images are extracted, hierarchical k-means runs over them. As mentioned before, the cluster centers are the indices for the keypoints to be matched. The keypoint database changes with the addition or removal of images.

We do not change the index, that is, we do not update cluster centers, for each addition or removal. Rather, if the image database changes significantly, the k-means tree is recomputed. During recomputation of the k-means tree, the centers of the root's children in the previous cluster tree can be used as the initial cluster centers.

For each image query, the GPU engine first extracts keypoints. In the first step of matching, the cluster tree is traversed for each keypoint. The cluster tree is stored on the GPU; therefore, matching cluster centers with query keypoints can be executed very fast. For a multi-GPU implementation, the cluster tree can be distributed among GPUs. Unfortunately, the feature vector data is too big to be stored in GPU memory for a large number of images. Hence, feature vectors are stored either in main memory or on high-speed disks, such as SSDs. It is of course possible to use HDDs, but this decreases matching speed. During keypoint matching in the second step, the required feature vectors and the respective image IDs are copied to the GPUs. The time required for CPU-GPU transfer and matching on the GPU is much less than the time for matching keypoints on the CPU. For a parallel implementation, feature vectors are distributed conformably with the distribution of the k-means cluster tree.

SIFT feature vectors have 128 elements. We store each element as an unsigned byte; hence, each vector requires 128 bytes of storage. Let us assume we have 1 million images and the average number of feature vectors per image is 512. In these settings, 61 GB of storage is needed for the feature vectors. Assuming an average cluster size of 1024, 64 MB of feature vector data should be copied to GPU memory when an image with 512 keypoints is queried. The GPU transfer takes 8 ms for copying these feature vectors over PCIe 2.0 x16, which has an 8 GB/s transfer rate. The main computational load of the keypoint matching is the kNN within clusters. We implement kNN by sort-and-select on the GPU, since recently developed GPU k-selection is faster than sort-and-select only for larger data sizes [19].

IV. EXPERIMENTAL RESULTS

In this section, we discuss runtime and precision experiments for the algorithms and implementations explained in the paper.

A. Processing Times

We use SiftGPU [18] to extract image features. Empirically, we found that an average of 300 to 500 features per image gives the best retrieval quality. We use the ukbench dataset for precision experiments, which is provided by the authors of [8]. The dataset contains 10,200 images of 2,550 different scenes, each taken from 4 different viewpoints. The size of each image is 640x480, and the number of extracted features is 6,170,773. It is a small but well-known dataset; therefore, we use it to justify the accuracy of the proposed methods.

Table I. K-means speedups over the CPU implementation for varying numbers of clusters.

  # of clusters    1-GPU    2-GPU
  16                23.9     47.8
  64                64       120
  256              116.7     210
  1024             149.1     265.2

Table II. Average query component times in ms for the ukbench dataset and for a 20 times larger dataset with 5 times smaller cluster sizes.

  Component              ukbench    large data
  SiftGPU                  33          33
  K-means tree search       1.7         2.9
  kNN                      74          42
  1-to-1 match              2           1.6
  Ranking                   1.5         2.3

To justify runtime scalability, we use a 20 times larger dataset of varying image sizes, composed of images from www.rakuten.com. We ran the experiments on a Linux machine with kernel version 3.0, a dual-core Intel Xeon 2.4 GHz CPU and 24 GB of memory. We used an NVIDIA GeForce GTX 590 GPU card, which contains two GPUs with a total of 1024 processors and 3 GB of device memory, and CUDA version 4.0 for programming the GPUs.

During generation of the k-means tree, for a large number of images, the feature vector data is too big to be held on the GPUs as a whole, so feature vectors must be streamed to the GPUs in each iteration. The data streaming from main memory to GPU memory slows down k-means. However, even with memory streaming, by overlapping memory copies and computations, a speedup of more than two orders of magnitude is achieved. In our implementation, on 2 GPUs, approximately 4 million keypoints are clustered per second in one iteration of 16-cluster k-means. Table I depicts the GPU speedups of k-means over a single-threaded CPU implementation, for varying numbers of clusters. During the computations, SIFT features are streamed from main memory to GPU memory, but no disk streaming is done. For larger numbers of clusters, speedups are more visible, since GPU thread utilization becomes better. Also, as can be observed from the table, the 2-GPU speedup over 1 GPU is close to 2, which means the data transfer latency to the GPU is successfully hidden during multi-GPU execution.

The second column of Table II depicts the average time distribution of online query components on the GPU search engine for the ukbench dataset. We take the maximum branch factor of the k-means tree as 128 and the maximum number of keypoints in a cluster as 5000. We match 50 keypoints with each query keypoint. Feature vectors are stored in main memory. As can be seen from the table, the most time-consuming search component is computing kNN within the matched clusters.

If we further break kNN into its components, the time required for kNN includes copying feature vectors from main memory to the GPU, measuring distances between query keypoints and the keypoints in the clusters, and sorting keypoints according to distance. The local computations of distances and the sort are fast on GPUs; the main time-consuming part of this component is the memory transfer between CPU and GPU. Memory transfer costs 64% of the total kNN time, while distance computation and sort-and-select cost 18% each. Although CPU-to-GPU memory transfer makes kNN slower, it is still much faster on GPUs than on CPUs: on the CPU, the kNN computation is 20.5 times slower on average. The difference is more obvious for the k-means tree search (index search), since the tree is deployed on the GPU and no memory transfer is required for it. K-means tree search on the CPU is 104 times slower than on the GPU, meaning that it takes 177 ms on average to search for matching cluster centers using the k-means tree.

Query time depends on the k-means tree parameters rather than the number of images in the database. To demonstrate runtime scalability for a larger number of images, we generate a set of keypoints which is more than 20 times larger than that of ukbench (128 million keypoints), as explained above. We set the maximum branch factor to 256 and the maximum number of points within each cluster to 1000, instead of 5000. We match 50 keypoints in a cluster for each query keypoint, as we do for the ukbench dataset. We query all images of the ukbench dataset against this dataset and take the average runtimes. The results are shown in the third column of Table II. As can be seen from the table, kNN becomes faster, because the number of keypoints within clusters is lower. Since the k-means cluster tree becomes larger, tree search becomes slower, but since its effect on the overall runtime is minor, image search with these settings becomes even faster.

In the timing results discussed above we assumed the data fits into main memory and did not consider disk streaming of feature vectors. There may be cases where disk streaming is inevitable, e.g., for a large number of feature vectors with less memory available. In this case, high-speed disks can be used to return results faster. For example, our test machine has an SSD with a 250 MB/s read rate. In this setting, if the feature vectors are deployed on the SSD, it takes around 1 second to copy 500 clusters of feature vectors with an average size of 4000 keypoints per cluster. Therefore, to be able to return query results within 1 second, the maximum number of features within a cluster should be chosen below 4000. This computation ignores memory caching; as recently retrieved clusters are cached in memory, queries become much faster. Note that, ignoring memory caching, the main factor affecting the query time with disk streaming is the number of feature vectors within clusters. So, we can expect the search engine to scale out by keeping the average number of keypoints within clusters the same for large databases.
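As a quick check, the transfer-time figures quoted here and in Section III-D follow directly from the 128-byte descriptor size and the stated cluster sizes and bandwidths:

```latex
\[
512 \times 1024 \times 128\,\mathrm{B} = 64\,\mathrm{MB}, \qquad
64\,\mathrm{MB} \,/\, 8\,\mathrm{GB/s} \approx 8\,\mathrm{ms} \quad \text{(PCIe 2.0 x16)}
\]
\[
500 \times 4000 \times 128\,\mathrm{B} \approx 256\,\mathrm{MB}, \qquad
256\,\mathrm{MB} \,/\, 250\,\mathrm{MB/s} \approx 1\,\mathrm{s} \quad \text{(SSD streaming)}
\]
```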

B. Search Precision

We provide precision results for two different test scenarios. In the first scenario, we use the ukbench dataset. In the second scenario, we use a subset of the book cover images in Rakuten's online bookstore (http://books.rakuten.co.jp/), which contains 650,530 images with 208,941,248 features.

For the ukbench dataset, when an image in the dataset is queried, we expect to find the same image, as well as the 3 other images of different views, in the top 4. In Table III, we show the average P@4 precision over all images, for different implementations and parameters. P@4 is calculated as the number of true positives in the top 4 divided by 4. We use the k-means tree with the same settings as explained in Section IV-A. While clustering, we iterate each k-means run until the ratio of feature vectors changing their cluster assignment falls below 0.1%.

In the first two rows of the table, we compare a BoV implementation with k-means tree clustering and with flat k-means clustering, to observe whether there is any precision loss from using the hierarchical k-means tree. We cluster the data by flat k-means directly into the same number of clusters as the k-means tree has (8K). As can be seen from the table, the precision of flat k-means is even a little worse. The reason may be the high variance of the number of features within clusters for flat k-means clustering: the standard deviation of cluster sizes is 4515 for flat k-means clustering, while it is 900 for k-means tree clustering.

For comparing BoV with our proposed 2-step keypoint matching, using the same k-means tree for both is not fair. We match 50 keypoints with each query keypoint in the keypoint matching implementation; therefore, for BoV, we generated a larger k-means tree having 50 keypoints per cluster, on average. BoV using the larger k-means tree has better precision than BoV using the same k-means tree as used for keypoint matching. However, one-to-one keypoint matching outperforms all BoV implementations. The precision difference between many-to-many keypoint matching and one-to-one keypoint matching can be seen from the fourth and fifth rows of the table: allowing many-to-many matches reduces precision significantly.

In order to observe the precision loss resulting from any type of clustering, we add the precision of matching keypoints retrieved by kNN over all keypoints, without clustering. It took around 2 weeks to query all images by brute-force kNN. The sixth row of the table can be considered the limit precision for any matching-based search with the SiftGPU keypoints we use. There are many other factors affecting precision, such as the quality of the extracted keypoints. See the last two rows of Table III for the precision of our BoV and keypoint matching implementations when keypoints optimized for the dataset are used.
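For reference, the P@4 measurement just described amounts to the small evaluation loop below (a sketch: search_top4 stands in for the actual engine, and the ids-in-groups-of-four convention follows the ukbench numbering):

```cpp
// Mean P@4 over all queries: each database image is used as a query, and
// precision is the fraction of its top-4 results showing the same scene.
#include <functional>
#include <vector>

double mean_p_at_4(int num_images,
                   const std::function<std::vector<int>(int)>& search_top4) {
    double total = 0.0;
    for (int q = 0; q < num_images; ++q) {
        const std::vector<int> top = search_top4(q);   // ranked result ids, 4 entries
        int hits = 0;
        for (int r : top)
            if (r / 4 == q / 4) ++hits;                // same 4-image scene group
        total += hits / 4.0;
    }
    return total / num_images;                          // average P@4 over all queries
}
```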

Table III. Average P@4 precision for all images in the ukbench dataset.

  Keypoints                              Search algorithm                             P@4
  SiftGPU keypoints                      BoV with flat clustering                     0.655
  SiftGPU keypoints                      BoV with k-means tree                        0.663
  SiftGPU keypoints                      BoV with larger tree                         0.721
  SiftGPU keypoints                      Many-to-many keypoint matching               0.720
  SiftGPU keypoints                      1-to-1 keypoint matching                     0.789
  SiftGPU keypoints                      1-to-1 matching by kNN without clustering    0.838
  Keypoints provided by dataset owners   BoV                                          0.782
  Keypoints provided by dataset owners   1-to-1 keypoint matching                     0.837

The keypoints used in these rows are provided by the authors of [8], who generated the dataset. Our BoV implementation achieves a P@4 precision of 0.782, and our 2-step keypoint-matching-based implementation achieves 0.837. Note that the authors of [8] report 0.767, 0.822 and 0.823 precision for the different trees they use, with their hierarchical scoring scheme for BoV (precision data available at http://www.vis.uky.edu/~stewe/ukbench/data/). Our keypoint-matching precision is better than their best performing choice of k-means tree. On the other hand, we avoid discussing the keypoint extraction process, since it is out of the scope of this work. Likewise, we avoid discussing post-processing for spatial matching of retrieved images. Instead, we have focused on vector matching quality for a given set of feature vectors.

For the precision results with the Rakuten Books dataset, we use a different test scenario. In this scenario, the user takes a picture of a book and submits the picture as a query to search for the book on the Web site. We have eight test queries. In the database, there are 15 books in the same series as the query images, so we measure P@15 for these 15 images. See Table IV for the eight query images and their P@15 scores for BoV and one-to-one keypoint matching, and Fig. 2 for all 15 true positive images for P@15. Actual image sizes vary between 120x180 and 300x436.

In Table IV, we only provide P@15 scores for the 15 books in the same series; however, we have to note that P@1 scores are 1 for all instances, except for the BoV search of the second query. For the P@1 computation, we accept the same book as the query image as a true positive; other images, even in the same series, are considered false positives. Also, the P@15 score does not say anything about the rankings of the true positives in the top 15. We observe that the average ranks of images are higher for the two-step one-to-one keypoint matching search. For some images, precision is lower. This is mostly because of the characteristics of SIFT features, which are weak at handling affine transformations. For three images, the precision difference is significant. This is mostly because BoV matches the stripe patterns of the table in the image background with similar patterns on book covers. In two-step keypoint matching, however, closer features are matched precisely, so the matches on the image background are eliminated. See Table V for the top 4 image matches for the lowest-precision query.

V. CONCLUSION

In this paper, we have demonstrated that, using GPUs, it is possible to achieve high-precision image retrieval in reasonable query time. To do so, we have proposed a 2-step keypoint matching technique as an alternative to the well-known bag-of-visual-words approach. We use a combination of simple techniques for keypoint matching; however, it is not feasible to use these techniques for real-world problems without GPU support. We have also shown that one-to-one keypoint matching considerably improves search accuracy. We have discussed runtime and precision experiments in detail on two datasets of different sizes, and we have shown that the search engine scales smoothly with an increasing number of keypoints. In most of the papers we have read, implementation details and runtime analysis are lacking; we believe this paper also helps to fill that gap. We have given some hints for a distributed implementation of the search engine; however, we leave the scalability and precision analysis of an even larger scale distributed search engine as future work.

Table IV. P@15 scores of BoV and one-to-one keypoint matching for the eight query images used in the precision test (query image thumbnails not reproduced here).

  Query               1     2     3     4     5     6     7     8
  BoV                0.47  0.07  0.67  1.00  0.93  0.27  0.93  1.00
  Keypoint matching  0.93  0.20  0.93  1.00  0.93  0.27  1.00  1.00

Figure 2. Images of books in the same series with the query books. (Thumbnails of the 15 true positive images are not reproduced here.)

Table V. Top 4 results for BoV and for one-to-one keypoint matching for the worst performing query in Table IV. (Result image thumbnails are not reproduced here.)

REFERENCES

[1] K. Mikolajczyk, and C. Schmid, "Scale and affine invariant interest point detectors," Int. J. Computer Vision, 60(1), 2004, pp. 63–86.

[2] K. Mikolajczyk, and C. Schmid, "A performance evaluation of local descriptors," IEEE Trans. Pattern Analysis and Machine Intelligence, 27(10), 2005, pp. 1615–1630.

[3] D. G. Lowe, "Object recognition from local scale-invariant features," In Proc. 7th Intl Conf. Computer Vision, 1999, pp. 1150–1157.

[4] T. Deselaers, D. Keysers, and H. Ney, "Features for image retrieval: an experimental comparison," Information Retrieval, 11(2), 2008.

[5] D. Omercevic, O. Drbohlav, and A. Leonardis, "High-dimensional feature matching: employing the concept of meaningful nearest neighbors," In Proc. 11th Intl Conf. Computer Vision, 2007, pp. 1–8.

[6] W. Zhao, Y. G. Jiang, and C. W. Ngo, "Keyframe retrieval by keypoints: can point-to-point match help?" In Proc. 5th Intl Conf. on Image and Video Retrieval, 2006, pp. 72–81.

[7] J. Sivic, and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," In Proc. Intl Conf. Computer Vision, 2003, pp. 1470–1477.

[8] D. Nister, and H. Stewenius, "Scalable recognition with a vocabulary tree," In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 2161–2168.

[9] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2007.

[10] H. Jegou, H. Harzallah, and C. Schmid, "A contextual dissimilarity measure for accurate and efficient image search," In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2007.

[11] Y. G. Jiang, and C. W. Ngo, "Visual word proximity and linguistics for semantic video indexing and near-duplicate retrieval," Computer Vision and Image Understanding, 2008.

[12] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Lost in quantization: improving particular object retrieval in large scale image databases," In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2008.


[13] H. Jegou, M. Douze, and C. Schmid, "Recent advances in large scale image search," Emerging Trends in Visual Computing, 2009, pp. 305–326.

[14] J. Sivic, and A. Zisserman, "Efficient visual search of videos cast as text retrieval," IEEE Trans. Pattern Analysis and Machine Intelligence, 31(4), 2009, pp. 591–606.

[15] N. Ao, F. Zhang, D. Wu, D. S. Stones, G. Wang, X. Liu, J. Liu, and S. Lin, "Efficient parallel lists intersection and index compression algorithms using graphics processing units," In Proc. VLDB Endowment, 4(8), 2011, pp. 470–481.

[16] N. K. Pham, A. Morin, and P. Gros, "Accelerating image retrieval using factorial correspondence analysis on GPU," LNCS 5702, 2009, pp. 565–572.

[17] C. N. Vasconcelos, and B. Rosenhahn, "Bipartite graph matching computation on GPU," In Proc. 7th International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, 2009.

[18] C. Wu, "SiftGPU: A GPU implementation of Scale Invariant Feature Transform (SIFT)," http://cs.unc.edu/~ccwu/siftgpu/, 2007.

[19] T. Alabi, J. D. Blanchard, B. Gordon, and R. Steinbach, "Fast k-selection algorithms for graphics processing units," Preprint, http://code.google.com/p/ggks/, Dec. 2011.