Fast Near-Duplicate Image Detection Using Uniform Randomized Trees YANQIANG LEI, Sun Yat-sen University GUOPING QIU, University of Nottingham LIGANG ZHENG, Guangzhou University JIWU HUANG, Shenzhen University
Indexing structure plays an important role in fast near-duplicate image detection, since it can narrow down the search space. In this article, we develop a cluster of uniform randomized trees (URTs) as an efficient indexing structure to perform fast near-duplicate image detection. The main contribution of this article is that we introduce "uniformity" and "randomness" into the indexing construction. The uniformity requires classifying the object images into subsets of the same scale. Such a decision makes good use of two facts in near-duplicate image detection, namely: (1) the number of categories is huge; (2) a single category usually contains only a small number of images. Therefore, the uniform distribution is very beneficial for narrowing down the search space and does not significantly degrade the detection accuracy. The randomness is embedded into the generation of the feature subspace and projection direction, improving the flexibility of indexing construction. The experimental results show that the proposed method is more efficient than the popular locality-sensitive hashing and more stable and flexible than the traditional KD-tree.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Retrieval models and Search process
General Terms: Algorithms, Experimentation, Verification
Additional Key Words and Phrases: Fast near-duplicate image detection, indexing structure, uniform randomized tree
ACM Reference Format: Yanqiang Lei, Guoping Qiu, Ligang Zheng, and Jiwu Huang. 2014. Fast near-duplicate image detection using uniform randomized trees. ACM Trans. Multimedia Comput. Commun. Appl. 10, 4, Article 35 (June 2014), 15 pages. DOI: http://dx.doi.org/10.1145/2602186
1. INTRODUCTION
With the rapid advancement of multimedia and Internet technologies, the amount of digital images that are easily accessible to users has become overwhelmingly large. It is often the case that a digital image has many near duplicates on the Internet, which can be easily observed by using a search engine
This work was supported in part by 973 Program (2011CB302200), National Science & Technology Pillar Program (2012BAK16B06), NSFC (U1135001, 61332012, 61173147, 61300205). Authors' addresses: Y. Lei, School of Information Science and Technology, Sun Yat-sen University, Guangzhou 510006, China; G. Qiu, School of Computer Science, The University of Nottingham, Nottingham NG8 1BB, UK; L. Zheng, School of Computers, Guangzhou University, Guangzhou 510006, China; J. Huang (corresponding author), College of Information Engineering (also Shenzhen Key Laboratory of Media Security), Shenzhen University, Shenzhen 518060, China; email:
[email protected]. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
[email protected]. © 2014 ACM 1551-6857/2014/06-ART35 $15.00 DOI: http://dx.doi.org/10.1145/2602186 ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 10, No. 4, Article 35, Publication date: June 2014.
35
35:2
•
Y. Lei et al.
such as Google or Bing. This phenomenon inevitably leads to a huge waste of storage and network resources, as well as problems such as copyright infringement. Therefore, effectively detecting near-duplicate images has become an important issue [Chang et al. 1998]. Note that "near duplicate" refers to a transformed version of the original image [Joly et al. 2007]. The common transformations (also called content-preserving operations) include geometric manipulations, blurring, noise contamination, enhancement, and compression. On some occasions, only a part of the whole image is a near duplicate of another image, making it very challenging to successfully detect the near duplicates [Wu et al. 2009, 2010]. For near-duplicate image detection (NDID), an important challenge is to extract effective image content representations. In recent years, many features have been proposed [Kim 2003; Wu et al. 2007; Xu et al. 2010; Sivic and Zisserman 2003; Jégou et al. 2012; Oliva and Torralba 2001; Lei et al. 2011; Zheng et al. 2012] for NDID. In Kim [2003], each image was first partitioned into 8 × 8 subimages and then the ordinal measurement [Bhat and Nayar 1998] of the DCT coefficients was calculated as a fingerprint for NDID. This feature is robust against noise contamination, but cannot tolerate rotation manipulation. In Wu et al. [2007], the authors developed an elliptical track division strategy to extract the ordinal measurement, improving the performance of the ordinal measurement-based scheme. In Xu et al. [2010], the authors employed the differences of multiresolution histograms (MHD) to perform the detection. The main advantage of MHD is its low computational cost; its disadvantage is its poor robustness to geometric operations. The vector of locally aggregated descriptors (VLAD) [Jégou et al. 2012] and bag-of-features (BOF) [Sivic and Zisserman 2003] are techniques that aggregate local descriptors (e.g., SIFT [Lowe 2004]) into global representations.
In recent years, VLAD and BOF have shown strong competitiveness in near-duplicate image detection, but they are sensitive to noise contamination and blurring. GIST [Oliva and Torralba 2001], which was proposed to describe a scene, is another well-known image representation. It achieves excellent robustness to many kinds of image manipulations [Douze et al. 2009]. Its main disadvantage is that it cannot resist geometric distortions, such as rotation. In our previous work [Lei et al. 2011], we proposed a high-order-invariant moment (HOIM) based on the Radon transform. HOIM shows very good robustness to image rotation, scaling, and translation, but lacks tolerance to local editing. In Zheng et al. [2012], the authors introduced the salient covariance matrix (SCOV) as a compact descriptor for NDID. However, SCOV is a Riemannian feature and there are few indexing techniques suitable for it. Another major challenge is the runtime performance. The image database is usually very large, therefore an efficient algorithm is necessary to make the detection not only accurate but also fast. Existing fast NDID algorithms are usually based on a two-stage model. The first stage uses an indexing structure to largely reduce the size of the detection space, while the second stage refines the result of the first stage by exhaustive search. This model is also sometimes called the coarse-to-fine detection model. Obviously, the number of candidates returned by the coarse stage dominates the detection efficiency. In this work, we focus on constructing an efficient indexing structure to conduct fast NDID. As a matter of fact, the indexing structure acts as a classifier that projects similar data into the same category. In a typical NDID application, the images are totally unlabeled and therefore unsupervised learning is usually employed to construct the indexing structure.
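For illustration, the two-stage (coarse-to-fine) model can be sketched as follows. This is a minimal sketch of ours, not code from the paper: `two_stage_detect` and `coarse_index` are assumed names, with `coarse_index` standing in for any indexing structure.

```python
import numpy as np

def two_stage_detect(query, features, coarse_index, distance=None):
    """Coarse-to-fine NDID: the index narrows the search space, then
    exhaustive search refines only the returned candidates.
    `coarse_index(query)` is assumed to return candidate image ids."""
    distance = distance or (lambda a, b: float(np.linalg.norm(a - b)))
    candidates = coarse_index(query)             # stage 1: coarse detection
    scored = [(t, distance(query, features[t]))  # stage 2: fine detection
              for t in candidates]
    return sorted(scored, key=lambda p: p[1])
```

Since only the candidates returned by stage 1 reach the exhaustive stage, the size of that candidate set dominates the overall cost.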
Up to now, locality-sensitive hashing (LSH) [Andoni and Indyk 2008] has been the most popular indexing employed in NDID [Ke et al. 2004; Qamra et al. 2005; Chum et al. 2007; Hu et al. 2008; Cao et al. 2011]. The key idea of LSH is to hash the input points (each feature is represented as a point in a high-dimensional space) into different buckets using a set of hash functions. LSH-based approaches can achieve excellent detection accuracy, but their detection efficiency is not very good. In fact, LSH usually classifies most of the samples into one bucket while other buckets contain only a few points, as shown in Figure 1.
Fig. 1. An instance of LSH.
Fig. 2. Uniform classification.
During the coarse detection stage, a larger bucket is easier to hit, which leads to much more time in the subsequent fine detection. Such a classification phenomenon of LSH is blameless, but not suitable for fast NDID. The main reason is that LSH ignores two facts in the application of NDID: (1) each original image and its near duplicates form a category, so the number of categories is huge; (2) an original image usually has only a few near duplicates in the database, and most original images have none at all. In practical applications, it is impossible to construct a category for each original image and its near duplicates. A reasonable solution is to classify similar samples (including original images and their near duplicates) into the same category. According to the previous two facts, we believe that the classification accuracy will not significantly affect the detection accuracy. Therefore, we can sacrifice classification accuracy to improve other performance, such as detection efficiency. As mentioned previously, the detection efficiency of a fast NDID algorithm is determined by the number of candidates retrieved by the indexing (coarse detection), so the average number of candidates returned in the first stage should be as small as possible. Such a goal can be achieved by uniform classification, as illustrated in Figure 2. The uniform classification ensures that each coarse detection on average retrieves the fewest candidates, thus drastically accelerating the fine detection. In the following, we will address how to embed uniformity into the indexing construction. To our knowledge, this work is the first to study this uniformity strategy in the application of fast NDID. Image features are usually multidimensional vectors. It is very difficult to directly implement uniform classification in high-dimensional spaces due to the difficulty in estimating the probability density function.
To get around this difficulty, we simplify the model by projecting the feature vectors into a one-dimensional space. Then, the uniform classification of these one-dimensional samples can be easily implemented in a binary tree by comparing against the median threshold. In order to decrease the storage and improve the flexibility, we randomly select the feature subspace and generate the projection direction when growing the tree. We can formulate the construction of a single tree as the following three steps: (1) generating a random subspace and a random projection direction; (2) calculating the random projection; and (3) splitting the input dataset into two subsets uniformly. In this article, the proposed tree structure is called a uniform randomized tree (URT), and a cluster of URTs is used as a powerful indexing for fast NDID. In fact, the proposed URTs comprise a typical kind of random forest widely used in recognition and classification applications [Lepetit and Fua 2006; Bosch et al. 2007; Ramirez et al. 2009; Yu et al. 2011]. However, our construction process is novel and very different from the traditional random forest. In Section 2, we will give a detailed description of the construction procedure. The experimental results show that the uniformity strategy improves the detection efficiency and that the proposed method is much faster than LSH-based schemes while achieving the same accuracy. Note that the proposed URTs comprise a special approach for fast NDID; in other applications the previously mentioned two facts may not hold, so the method cannot be directly extended to them. In addition, a single URT usually gets worse accuracy than other structures (e.g., a hash table) due to the uniform classification. However, a cluster of URTs can achieve nearly the same accuracy as exhaustive search and is not inferior to any other approach.
Fig. 3. The NDID system based on URTs.
The remainder of this article is organized as follows. Section 2 describes the proposed scheme in detail. Section 3 shows the experimental results and gives relevant discussions. Finally, concluding remarks and future works are given in Section 4.

2. PROPOSED ALGORITHM
This work focuses on developing a cluster of URTs for the application of NDID. We can summarize the rationale of the URTs as follows.

(1) We observe two facts in NDID, that is, the number of categories is huge and a single category usually contains a small number of samples. Existing algorithms usually ignore these facts when constructing indexing.
(2) Uniform classification can make good use of the two facts. It can accelerate the detection and would not degrade the detection accuracy.
(3) It is difficult to directly implement uniform classification in high-dimensional spaces, so we simplify the model by projecting the feature vectors into a one-dimensional space. These one-dimensional samples can be uniformly divided into two subsets by comparing against the median threshold.
(4) We iteratively partition the new subsets and obtain a binary classification tree.
(5) To improve the flexibility and decrease the storage cost of the tree, we randomly select the feature subspace and generate the projection direction to grow the tree.
(6) To improve the detection accuracy, we use a cluster of trees as the indexing structure for fast NDID.

The NDID system based on URTs is shown in Figure 3. URTs are used to perform coarse detection, whereas the fine detection only executes on those candidates retrieved from the first process. In the following, we first describe the construction of URTs. Then, we introduce the coarse detection based on the proposed URTs. In addition, we will analyze the storage load of a single URT and discuss the query cost in a URT.

2.1 Construction of Uniform Randomized Trees

Since the URTs are independent, we only describe the construction of a single tree. Assuming that the features of all images in the database are available and denoted as f = { ft : t = 1, 2, . . . , N},
Fig. 4. The construction of the root node.
where ft ∈ R^d is a d-dimensional feature vector describing the t-th image, and N is the cardinality of the database, we can construct a randomized classification tree as follows. First, all the images (features) are assigned to the root node, as shown in Figure 4(a). Then, a subspace D ⊂ {1, 2, . . . , d} and a projection direction w = [w1, w2, . . . , wdsub] are randomly generated, where |D| = |w| = dsub < d. According to the following rule

    Assign ft to fL if wT · ftD ≤ θ, and to fR otherwise,

feature ft will be assigned to the left subset fL (left child node) or the right subset fR (right child node), where ftD is a |D|-dimensional feature vector consisting only of those components of ft whose indices are in D (as the red components shown in Figure 4(b)), and θ is a decision threshold. The two new subsets satisfy the following two conditions: fL ∩ fR = ∅ and fL ∪ fR = f. The selection of the threshold θ is another key issue in our algorithm. Since there is no a priori information about the images in the database, an image's category cannot be decided. According to the two facts in NDID, one simple and reasonable assumption is that the probabilities of any given image belonging to the two subsets are equal, namely the cardinalities of fL and fR are nearly the same. Such an idea is consistent with the proposed uniform classification. Therefore, we use the maximal information entropy principle [Qiu et al. 2007] to choose the threshold θ, as

    θ* = argmax_θ {−pl log pl − pr log pr},    (1)

where pl and pr represent the probabilities of the left and right subsets, respectively:

    pl = |fL| / (|fL| + |fR|),    (2)
    pr = |fR| / (|fL| + |fR|).    (3)

Once all the random projections at a node are available, the median value can be easily obtained. Then, we randomly generate many candidate thresholds around the median value (the number of candidate thresholds is 50 in the experiments) and select the one with the maximal information entropy.
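Under our reading of this construction, a single URT can be grown as follows. This is an illustrative sketch, not the authors' implementation: the names (`build_urt`, `choose_threshold`), the width of the search neighborhood around the median, and the choice wi ∈ [−1, 1] for the random direction are our assumptions.

```python
import numpy as np

class Node:
    def __init__(self):
        self.dims = self.w = self.theta = None  # non-leaf parameters (D, w, theta)
        self.left = self.right = None
        self.items = None                       # image indices, set only at leaves

def choose_threshold(proj, n_candidates=50, rng=None):
    """Pick theta near the median maximizing -pl*log(pl) - pr*log(pr) (Eq. (1))."""
    rng = rng or np.random.default_rng()
    median = float(np.median(proj))
    spread = proj.std() * 0.1 + 1e-12           # "around the median": our choice of width
    best, best_h = median, -1.0
    for theta in median + rng.uniform(-spread, spread, n_candidates):
        p_l = np.count_nonzero(proj <= theta) / len(proj)
        p_r = 1.0 - p_l
        if 0.0 < p_l < 1.0:
            h = -p_l * np.log2(p_l) - p_r * np.log2(p_r)
            if h > best_h:
                best, best_h = theta, h
    return best

def build_urt(feats, idx=None, depth=0, max_depth=10, min_leaf=20, d_sub=4, rng=None):
    """Grow one uniform randomized tree over the rows of `feats`."""
    rng = rng or np.random.default_rng()
    idx = np.arange(len(feats)) if idx is None else idx
    node = Node()
    if depth >= max_depth or len(idx) <= min_leaf:  # the two stopping constraints
        node.items = idx
        return node
    node.dims = rng.choice(feats.shape[1], size=d_sub, replace=False)  # random subspace D
    node.w = rng.uniform(-1.0, 1.0, size=d_sub)                        # random direction w
    proj = feats[np.ix_(idx, node.dims)] @ node.w
    node.theta = choose_threshold(proj, rng=rng)
    go_left = proj <= node.theta
    node.left = build_urt(feats, idx[go_left], depth + 1, max_depth, min_leaf, d_sub, rng)
    node.right = build_urt(feats, idx[~go_left], depth + 1, max_depth, min_leaf, d_sub, rng)
    return node
```

A cluster of M such trees is obtained simply by calling `build_urt` M times with different random seeds.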
Fig. 5. Coarse detection based on uniform randomized trees.
Fig. 6. Examples of candidate images retrieved by the coarse detection.
Now, the images contained in the root node have been uniformly divided into two new subsets. We can summarize the construction of the root node in the following three steps. (1) Generate a random subspace D and a random projection direction w. (2) Calculate the random projection wT · ftD. (3) Split the input dataset into two subsets based on the maximal information entropy principle. Once the input dataset is separated into two subsets, we can further construct a new decision node for each subset using the same principle. This process is repeated until either of the two following constraints is satisfied. One is that the tree reaches the preset depth. The other is that the number of images in the node is smaller than a threshold. After this, each leaf of the tree will contain a small number of images; in fact, the images in the database are uniformly distributed into the leaf nodes of each tree. Since the proposed tree structure introduces "uniformity" and "randomness" into the traditional classification tree, we call it a Uniform Randomized Tree (URT). The uniformity is beneficial for accelerating the detection while the randomness improves the flexibility of tree construction.

2.2 Uniform Randomized Trees-Based Near-Duplicate Image Detection

In this section, the proposed URTs will be used as an efficient indexing to perform fast NDID. During the coarse detection, the query image can easily find its corresponding leaf node in each tree based on the same rules used for constructing the tree, as illustrated in Figure 5, where the red lines represent the search paths, Si (1 ≤ i ≤ M) denotes the retrieved images from the i-th tree, and M is the number of trees. Figure 6 shows an instance of images from the coarse detection. According to the voting
Table I. Voting Results of the Instance in Figure 6

    ft          f15  f47  f20  f35  f55  f18  f50  f81  f26  f72
    v(fq, ft)     6    6    5    5    4    3    3    3    2    2
technique, we can calculate the vote score between the query image and an arbitrary object as

    v(fq, ft) = Σ_{i=1}^{M} δ(fq, ft),    (4)

where δ(·, ·) is an indicator function, defined as

    δ(fq, ft) = 1, if fq and ft fall into the same leaf (of the i-th tree);
    δ(fq, ft) = 0, otherwise.    (5)
Then, we can obtain the sorted results. Taking the retrieved images in Figure 6 as an example, Table I depicts their ranks. We are interested in the top-ranked results that are most similar to the query, since higher similarity indicates a greater probability that the two images are a near-duplicate pair. For an arbitrary query image, assuming that the sorted images/features are denoted as { fq,1, fq,2, . . . , fq,nq } (nq is the number of images from the coarse detection), we can select the first ξ · nq (0 ≤ ξ ≤ 1) images as the candidates for the following fine detection, as

    S = { fq,t | t ≤ ξ · nq }.    (6)
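The voting of Eqs. (4)–(6) can be sketched as follows. This is our illustration, not the authors' code; for brevity, each tree is summarized by the leaf id it assigns to every database image.

```python
from collections import Counter

def vote_candidates(query_leaves, db_leaves, xi=1.0):
    """query_leaves[i]: leaf reached by the query in tree i.
    db_leaves[i][t]: leaf containing database image t in tree i.
    Returns candidate ids sorted by vote score v(fq, ft) (Eq. (4))."""
    votes = Counter()
    for q_leaf, leaves in zip(query_leaves, db_leaves):
        for t, leaf in enumerate(leaves):
            if leaf == q_leaf:          # delta(fq, ft) = 1 (Eq. (5))
                votes[t] += 1
    ranked = [t for t, _ in votes.most_common()]
    keep = int(xi * len(ranked))        # Eq. (6): first xi * nq candidates
    return ranked[:keep]
```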
Smaller ξ means better efficiency but worse accuracy, and vice versa. In brief, the parameter ξ can be used to trade off efficiency and accuracy. In our experiments, we set ξ = 1, indicating that all the images retrieved from coarse detection are used as the candidates. The reason for such a setup is twofold: (1) once the indexing is established, the detection efficiency and accuracy are not affected by other parameters; (2) it is fair to compare with LSH, which is beneficial for verifying the importance of the uniformity strategy to the indexing structure. Once the candidate images are established, the fine detection is performed by exhaustive search with raw feature distances. Obviously, the computations of the fine detection mainly depend on the number of candidates. The proposed uniform distribution ensures that the average number of candidates from the coarse detection is the least, possibly significantly saving the query cost. This is an important reason why we introduce uniformity into the indexing construction.

2.3 Storage and Computation Complexity Analysis

This section will analyze the storage cost of a URT and the query complexity in a URT. Note that we do not discuss the construction of a URT, since it is usually offline. An L-level URT has at most 2^L leaves and 2^L − 1 non-leaf nodes. A leaf needs L bits to indicate its address, so all the leaves together cost 2^L · L bits. For an arbitrary non-leaf node, three parameters, namely the feature subspace D, projection direction w, and threshold θ, should be recorded. We can coarsely estimate the storage cost of the non-leaf nodes as (2^L − 1)(dsub^max(⌈log2 d⌉ + μ(w)) + μ(θ)) bits, where dsub^max is the dimensionality of the maximum subspace and μ(x) returns the precision of each component of x. It is clearly observed that the storage cost of a URT depends on several parameters. In Section 3.2, we will discuss the selection of these parameters in detail.
A query in a URT traverses from the root to a leaf node. At an arbitrary level, it is necessary to calculate the random projection wT · fqD and compare it with the decision threshold θ. Thus, the total computations of a query in a URT include L · dsub^max multiplications, L · (dsub^max − 1) additions, and L comparisons. Usually, L and dsub^max are small, meaning that the query processing in a URT is very quick.
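Under our reading of this analysis, the query traversal and the cost formulas can be expressed as follows (an illustrative sketch; the dict-based node layout and the 32-bit precision are our assumptions, not the paper's):

```python
import math

def query_urt(node, f):
    """Traverse one URT from root to leaf; each level costs d_sub
    multiplications, d_sub - 1 additions, and one comparison."""
    while node.get("items") is None:                 # non-leaf node
        proj = sum(w_i * f[j] for w_i, j in zip(node["w"], node["dims"]))
        node = node["left"] if proj <= node["theta"] else node["right"]
    return node["items"]

def query_cost(L, d_sub_max):
    """Total operation counts for one query in an L-level URT."""
    return {"mul": L * d_sub_max, "add": L * (d_sub_max - 1), "cmp": L}

def urt_storage_bits(L, d, d_sub_max, mu=32):
    """2^L leaf addresses of L bits each, plus (2^L - 1) non-leaf nodes, each
    storing D (ceil(log2 d) bits per index), w, and theta (mu bits per value)."""
    return (2**L) * L + (2**L - 1) * (d_sub_max * (math.ceil(math.log2(d)) + mu) + mu)
```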
Table II. Set of Content-Preserving Operations

    Operations                 Parameters of the operations
    JPEG compression           Quality factor 10 : 20 : 90
    Rotation                   Degree 5, 15, 45, 60, 90
    Scaling                    Ratio 0.5, 0.8, 1.2, 1.5, 2
    AWGN                       PSNR 10 : 5 : 30
    Gaussian blur              Filter order 3 : 2 : 11, σ = 3
    Illumination modification  Ratio 0.5 : 0.2 : 1.3
    Cropping                   Percentage 5% : 5% : 25%

3. EXPERIMENTS
In this section, we evaluate the proposed URTs for the application of fast near-duplicate image detection. We first describe the setup for the experiments and then discuss the determination of the corresponding parameters in our method. Subsequently, we present the experimental results compared with LSH- and KD-tree-based approaches. Finally, we discuss the scalability of the proposed structure.

3.1 Experimental Setup

3.1.1 Database. We evaluate the proposed scheme on a dataset constructed from two image databases: the INRIA Copydays database [Douze et al. 2009] and a distractor database (around 200,000 images) downloaded from Flickr [Huiskes and Lew 2008], LabelMe [Russell et al. 2008], and the Internet. The 157 images in Copydays are used as test samples. For each sample, we generate 35 near-duplicate versions by using the set of typical content-preserving operations listed in Table II. In the experiments, the generated near duplicates (157 × 35 = 5495 images) are inserted into the dataset and the original 157 samples are used as query images.

3.1.2 Image Features. The proposed URTs are suitable for global features. In the experiments, three global representations are used to verify the effectiveness of the proposed indexing: the 15-D high-order-invariant moment (HOIM) [Lei et al. 2011], the 40-D GIST descriptor [Cao et al. 2011], and the 96-D differences of multiresolution histograms (MHD) [Xu et al. 2010].

3.1.3 Evaluation Criteria. Accuracy, speed, and storage are the three main evaluation criteria for an NDID system. For detection accuracy, the mean Average Precision (mAP) is used as the evaluation metric. In fact, mAP refers to the area under the precision-recall curve. The mAP of exhaustive search is used as the baseline in our experiments, indicating that the accuracy is evaluated as

    mAP(ratio) = (mAP of indexing-based method) / (mAP of exhaustive search-based method).
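Both evaluation ratios, mAP(ratio) above and the acceleration factor N/Nr used for speed, are simple quotients; a trivial sketch (function names are ours):

```python
def map_ratio(map_indexing, map_exhaustive):
    """mAP(ratio): accuracy of the indexing-based method relative to
    exhaustive search (the baseline)."""
    return map_indexing / map_exhaustive

def acceleration_factor(n_total, n_retrieved):
    """N / Nr: how much the coarse stage shrinks the fine-search space."""
    return n_total / n_retrieved
```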
For detection speed, we discuss the acceleration factor, which can be estimated by N/Nr, where N is the total number of images in the dataset and Nr is the number of candidate images retrieved by the coarse detection. For storage evaluation, we discuss the memory cost required by the indexing structure. Note that the storage load of the image ID is not considered.

3.2 Parameter Selection for URT

Three parameters need to be carefully considered in the construction of a URT. They are dsub (dimensionality of subspace D), w (projection direction), and L (tree depth). In the following, we will discuss the effect of these parameters on detection accuracy, speed, and storage.
Fig. 7. Results of 15-D HOIM.
Fig. 8. Results of 40-D GIST.
dsub affects the generation of the feature subspace D and the projection direction w. We find two points: (1) a higher-dimensional feature usually needs a larger dsub to retain the accuracy; (2) a larger dsub costs more storage. In our tests, we set dsub ∈ [3, 8] empirically. For the projection direction w = [w1, . . . , wi, . . . , wdsub], we consider three cases:

    #1: wi ∈ [−1, 1];
    #2: wi ∈ [0, 1];
    #3: wi = 1,

and we observe that they achieve almost the same accuracies (we do not include their results here, since it is hard to discriminate them in a figure). In the following experiments, we employ case #3, since it does not cost any storage to record w. For the tree depth L, we can estimate it as log2(N/NExp), where N is the cardinality of the database and NExp is the expected number of retrieved candidates from a URT. Usually, the value of NExp is on the order of several dozens or hundreds. We can properly adjust NExp to meet the practical requirements. Here, we present results with values of L set to 9, 10, and 11 for the three features, as shown in Figures 7–9. From the results, we clearly observe that the proposed method can achieve almost the same accuracy as exhaustive search if there is a sufficient number of URTs. We also observe that more URTs lead to slower detection speed and higher storage cost. Furthermore, larger L (a deeper URT) can accelerate the detection but also increases the storage cost. A deeper URT usually has more nodes, which costs more storage. More nodes also mean that each leaf node contains fewer candidates due to the uniform distribution, thus accelerating the detection. In our experiments, the dataset contains around 200,000 images and tree depth L = 10 may be a reasonable selection. A 10-level URT divides the dataset into 2^10 = 1024 categories. Each category includes around 200 candidate images.
If the dataset scale increases, we can employ deeper URTs to obtain a good trade-off among accuracy, speed, and storage.
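The rule of thumb L ≈ log2(N/NExp) can be checked numerically (our sketch, with an assumed function name):

```python
import math

def estimate_depth(n_images, n_expected):
    """Tree depth so that each of the 2^L leaves holds about n_expected images."""
    return round(math.log2(n_images / n_expected))
```

For example, a database of around 200,000 images with roughly 200 expected candidates per leaf gives L = 10, matching the choice above.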
Fig. 9. Results of 96-D MHD.
3.3 Performance Comparison with LSH

In order to show the effectiveness of the proposed URT for NDID, we compare our method with LSH-based approaches, since LSH is one of the most popular indexing techniques used in fast NDID [Ke et al. 2004; Qamra et al. 2005; Chum et al. 2007; Hu et al. 2008; Cao et al. 2011]. The key idea of LSH is to project the input features, using a hash function, into different buckets. The generated structure is called a hash table. In the experiments, we employ the typical LSH family whose hash function is defined as [Andoni and Indyk 2008]

    h(x) = ⌊(Ax + b) / w⌋,    (7)

where h(x) indicates the bucket address, A is a kh × d random matrix following a Gaussian distribution, kh represents the length of the hash key, w is a preset segment length, b ∈ [0, w), and ⌊·⌋ is the floor function. During the coarse detection, the images retrieved by each hash table are all used as candidates for the following fine detection. According to the previous description, we can easily estimate the storage cost of a hash table as nb log2 nb + kh d μ(A) bits, where nb is the number of buckets (nb ≪ 2^kh) and μ(A) returns the precision of each component of matrix A. In the test, the source code of LSH is provided by Shakhnarovich [2008]. Based on the construction rules, we know that URT and LSH are both techniques that partition the object dataset into many subsets. The differences between them are twofold. First, URT is hierarchical whereas LSH is a one-level structure. Second, the number of categories is controlled by the tree depth in URT, but limited by the length of the hash key in LSH. In the following, we will compare the detection accuracy, speed, and storage between URT- and LSH-based approaches. Figures 10–12 show the comparison results (in terms of mAP versus acceleration factor and mAP versus additional storage) using 15-D HOIM, 40-D GIST, and 96-D MHD. Note that all the parameters in URT and LSH are single precision (represented by 4 bytes).
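Equation (7) can be sketched as follows (illustrative only; the matrix A is drawn from a standard Gaussian, and the values of kh and w here are arbitrary examples, not the paper's settings):

```python
import numpy as np

def lsh_hash(x, A, b, w):
    """h(x) = floor((A x + b) / w): the k_h-dimensional bucket key of Eq. (7)."""
    return tuple(np.floor((A @ x + b) / w).astype(int))

# Example setup (arbitrary sizes, for illustration)
rng = np.random.default_rng(0)
d, k_h, w = 15, 8, 4.0
A = rng.normal(size=(k_h, d))      # k_h x d Gaussian random matrix
b = rng.uniform(0.0, w, size=k_h)  # offsets drawn from [0, w)
key = lsh_hash(rng.normal(size=d), A, b, w)
```

Points falling into the same bucket share the full key, so one hash table simply maps keys to lists of image ids.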
For LSH, we present results with values of kh set to 15, 30, 45, and 60 (denoted as LSH15, LSH30, LSH45, and LSH60, respectively). Usually, values for kh between 15 and 45 are reasonable and effective [Qamra et al. 2005]. For URT, we just include the results of L set to 10 (denoted as URT10) for the reason mentioned in Section 3.2. The numbers of hash tables and randomized trees are both varied to obtain different detection performance. Note that the knots on each curve in Figures 10–12 are related to the number of hash tables or randomized trees. For each curve, the lowest knot indicates that the result is obtained by one table/tree, and higher knots mean more tables/trees. From the comparisons, we observe that, at the same detection accuracy, URT10 always performs much faster than the four LSH-based cases, and the storage cost of URT10 is just between LSH30 and LSH45. Such results reflect the effectiveness of the proposed URT to some degree. In order to further verify the effectiveness of the reported URT, we provide more detailed discussions of URT and LSH. Taking the results of HOIM for example, URT10 divides the dataset into 1,024
Fig. 10. Comparisons between URT and LSH using 15-D HOIM.
Fig. 11. Comparisons between URT and LSH using 40-D GIST.
Fig. 12. Comparisons between URT and LSH using 96-D MHD.
categories, while LSH15, LSH30, LSH45, and LSH60 on average partition the objects into 616, 5,048, 16,025, and 32,130 groups, respectively. The performance of the LSH structure (denoted as LSHExp) that separates the dataset into around 1,024 categories should be between LSH15 and LSH30, as the case "LSH: length=Exp" shown in Figure 10. It is clearly observed that LSHExp performs much slower
Table III. Comparisons between URT and KD-Tree Using 15-D HOIM

Depth     Methods        mAP (ratio)   Acceleration Factor   Additional Storage (KB)
9 bits    KD-tree        0.8477        512                   2.8
          One URT        0.8452        512                   4.6
          Twenty URTs    1             39                    91.1
10 bits   KD-tree        0.8167        1024                  5.7
          One URT        0.7972        1024                  9.2
          Twenty URTs    0.9986        78                    184.9
11 bits   KD-tree        0.7599        2048                  11.7
          One URT        0.7283        2049                  18.7
          Twenty URTs    0.9982        166                   374.8
Table IV. Comparisons between URT and KD-Tree Using 40-D GIST

Depth     Methods        mAP (ratio)   Acceleration Factor   Additional Storage (KB)
9 bits    KD-tree        0.7878        512                   2.9
          One URT        0.7334        512                   5.6
          Thirty URTs    0.9942        25                    166.6
10 bits   KD-tree        0.7558        1024                  6.0
          One URT        0.6976        1024                  11.2
          Thirty URTs    0.9894        49                    337.2
11 bits   KD-tree        0.7232        2049                  12.2
          One URT        0.6539        2048                  22.7
          Thirty URTs    0.9801        101                   682.2
than URT10, but LSHExp costs less storage. However, even if LSH spends more storage to improve its detection speed, it still runs much slower. For GIST and MHD, we can draw similar conclusions from Figures 11 and 12. These discussions demonstrate the effectiveness of the proposed uniformity strategy.

3.4 Performance Comparison with KD-Tree

To further demonstrate the merits of the proposed scheme, we compare our method with the traditional KD-tree-based approach, since the proposed URT can be treated as an improved version of the KD-tree. The KD-tree is also a binary tree. At each non-leaf node, a special dimension (usually the one with maximal variance) is first chosen, and the median value along that dimension is selected as the decision threshold. All points with values smaller than the threshold are then put into the left subset, and the others into the right subset. A KD-tree structure can easily be used in the coarse detection stage for fast NDID, with all the samples retrieved at a leaf node serving as candidates for the subsequent fine detection. In addition, for an L-level KD-tree, we can estimate its storage cost (in bits) as 2^L · L + (2^L − 1)(⌈log2 d⌉ + μ(θ)), where d is the feature dimensionality, θ represents the decision threshold of a non-leaf node, and μ(x) returns the precision of each component of x.

Tables III–V show the comparison results in terms of mAP, acceleration factor, and additional storage. From the comparisons, we observe that the KD-tree usually costs less storage than one URT, since the separation in a KD-tree is based on a single dimension, whereas the division in a URT is based on a subspace. We also find that, for 15-D HOIM and 40-D GIST, the KD-tree achieves slightly better accuracies than one URT, while their acceleration factors are nearly the same. However, for the 96-D MHD, a URT achieves both better accuracy and faster speed than the KD-tree.
The reason is that the decision of the subspace-based scheme is much more stable than that of a single-dimensionality-based technique.
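The KD-tree splitting rule and the storage formula above can be sketched as follows. This is a minimal illustrative Python sketch under our own assumptions (dict-based nodes, a fixed-depth recursion cutoff); it is not the paper's implementation. As a sanity check on the formula, with μ(θ) = 32 bits for a single-precision threshold and d = 15, a 9-level tree needs 2^9 · 9 + 511 · (4 + 32) = 23,004 bits ≈ 2.8 KB, consistent with the KD-tree entry in Table III.

```python
import math
import numpy as np

def build_kdtree(points, depth, max_depth):
    """Recursively split `points` (an (n, d) array) into a binary tree.

    At each non-leaf node, pick the dimension with maximal variance and
    split at its median: smaller values go left, the rest go right.
    """
    if depth == max_depth or len(points) <= 1:
        return {"leaf": True, "points": points}
    dim = int(np.argmax(points.var(axis=0)))      # max-variance dimension
    threshold = float(np.median(points[:, dim]))  # median as decision threshold
    left_mask = points[:, dim] < threshold
    return {
        "leaf": False,
        "dim": dim,
        "threshold": threshold,
        "left": build_kdtree(points[left_mask], depth + 1, max_depth),
        "right": build_kdtree(points[~left_mask], depth + 1, max_depth),
    }

def kdtree_storage_bits(L, d, mu_theta):
    """Storage cost of an L-level KD-tree, following the formula in the
    text: 2^L * L plus, for each of the 2^L - 1 non-leaf nodes,
    ceil(log2(d)) bits for the split dimension and mu(theta) bits for
    the threshold. (The reading of the terms is our interpretation.)"""
    return 2**L * L + (2**L - 1) * (math.ceil(math.log2(d)) + mu_theta)
```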
Table V. Comparisons between URT and KD-Tree Using 96-D MHD

Depth     Methods        mAP (ratio)   Acceleration Factor   Additional Storage (KB)
9 bits    KD-tree        0.4213        320                   3.0
          One URT        0.4456        509                   6.1
          Forty URTs     0.9923        17                    242.1
10 bits   KD-tree        0.3829        640                   6.1
          One URT        0.4023        1007                  12.2
          Forty URTs     0.9865        32                    489.6
11 bits   KD-tree        0.3193        1280                  12.5
          One URT        0.3606        2052                  24.7
          Forty URTs     0.9746        63                    989.6
Fig. 13. Results of scalability evaluation using 15-D HOIM.
In addition, neither the KD-tree nor a single URT can obtain good accuracy. However, we can easily build many URTs to recover the accuracy, while the KD-tree cannot. This result shows that the proposed URT is much more flexible than the KD-tree.

3.5 Scalability Evaluation for URT

In this section, we briefly evaluate the scalability of the proposed scheme. Once the URTs have been constructed, a new data item can easily be inserted into them: all we have to do is traverse the URTs with each new image and assign the image to the appropriate leaf nodes. We can do this because adding a single data item does not change the threshold of each node much. Of course, if a large amount of new data is to be added to the database, the URTs should be rebuilt. Figure 13 shows the results of adding different numbers of new images. The original database contains around 100k images, and we test three cases, adding 1k, 10k, and 100k new images to the dataset. We clearly observe that with 1k new images, "insert" and "rebuild" achieve almost the same performance; such images can therefore be inserted into the original database without rebuilding the URTs. However, there is an obvious performance degradation if we directly insert 100k new images into the URTs. These results are consistent with our previous analysis.
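The insertion procedure described above can be sketched in a few lines. The node layout here (a `proj` projection vector, a scalar `threshold`, and a per-leaf list of image ids) is our own illustrative assumption, not the paper's data structure; for a cluster of URTs, the same routine would simply be called once per tree.

```python
def insert_image(tree, feature, image_id):
    """Insert a new image into an already-built URT without rebuilding.

    Traverse the tree exactly as during a query: at each non-leaf node,
    project the feature onto the node's random direction and branch on
    the stored threshold; then append the image id at the reached leaf.
    """
    node = tree
    while not node["leaf"]:
        value = sum(p * f for p, f in zip(node["proj"], feature))
        node = node["left"] if value < node["threshold"] else node["right"]
    node["ids"].append(image_id)
```

Because the node thresholds are left untouched, the leaf subsets slowly drift away from the uniform distribution as images accumulate, which is why a rebuild becomes necessary after many insertions.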
4. CONCLUSION
In this article, we formulate indexing construction in near-duplicate image detection (NDID) as a classification problem. We observe two facts in NDID applications: (1) the number of categories is huge; (2) a single category usually contains only a small number of images. Based on these two observations, we develop a novel indexing structure, a cluster of uniform randomized trees (URTs), for fast NDID. The key idea of the proposed URTs is to uniformly distribute the target images in the dataset into small subsets. Such a strategy significantly improves the detection efficiency. The extensive experimental
results show that the proposed scheme is much more efficient than the popular locality-sensitive hashing structure, and more stable and flexible than the traditional KD-tree. In addition, the proposed URTs have satisfactory scalability. The randomized tree is used as the basic structure in our method, which improves the flexibility of the indexing construction. There are many possible ways to construct randomized trees, and this article has presented only a very simple approach. Our future work will investigate alternative construction rules for fast near-duplicate video detection.

REFERENCES

A. Andoni and P. Indyk. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Comm. ACM 51, 1, 117–122.
D. N. Bhat and S. K. Nayar. 1998. Ordinal measures for image correspondence. IEEE Trans. Pattern Anal. Mach. Intell. 20, 4, 415–423.
A. Bosch, A. Zisserman, and X. Munoz. 2007. Image classification using random forests and ferns. In Proceedings of the IEEE International Conference on Computer Vision. 1–8.
Y. Cao, H. Zhang, and J. Guo. 2011. Weakly supervised locality sensitive hashing for duplicate image retrieval. In Proceedings of the IEEE International Conference on Image Processing. 2461–2464.
E. Chang, J. Wang, C. Li, and G. Wiederhold. 1998. RIME: A replicated image detector for the world-wide web. In SPIE Multimedia Storage and Archiving Systems III, Vol. 3527, 58–67.
O. Chum, J. Philbin, M. Isard, and A. Zisserman. 2007. Scalable near identical image and shot detection. In Proceedings of the ACM International Conference on Image and Video Retrieval. 549–556.
M. Douze, H. Jegou, H. Sandhawalia, L. Amsaleg, and C. Schmid. 2009. Evaluation of GIST descriptors for web-scale image search. In Proceedings of the ACM International Conference on Image and Video Retrieval. Article 19, 1–8.
Y. Hu, M. Li, and N. Yu. 2008. Efficient near-duplicate image detection by learning from examples. In Proceedings of the IEEE International Conference on Multimedia and Expo. 657–660.
M. J. Huiskes and M. S. Lew. 2008. The MIR Flickr retrieval evaluation. In Proceedings of the ACM International Conference on Multimedia Information Retrieval. 39–43.
H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. 2012. Aggregating local image descriptors into compact codes. IEEE Trans. Pattern Anal. Mach. Intell. 34, 9, 1704–1716.
A. Joly, O. Buisson, and C. Frelicot. 2007. Content-based copy retrieval using distortion-based probabilistic similarity search. IEEE Trans. Multimedia 9, 2, 293–306.
Y. Ke, R. Sukthankar, and L. Huston. 2004. Efficient near-duplicate detection and sub-image retrieval. In Proceedings of the ACM International Conference on Multimedia. 869–876.
C. Kim. 2003. Content-based image copy detection. Signal Process. Image Comm. 18, 3, 169–184.
Y. Lei, Y. Wang, and J. Huang. 2011. Robust image hash in Radon transform domain for authentication. Signal Process. Image Comm. 26, 6, 280–288.
V. Lepetit and P. Fua. 2006. Keypoint recognition using randomized trees. IEEE Trans. Pattern Anal. Mach. Intell. 28, 9, 1465–1479.
D. G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 2, 91–110.
A. Oliva and A. Torralba. 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 42, 3, 145–175.
A. Qamra, Y. Meng, and E. Y. Chang. 2005. Enhanced perceptual distance functions and indexing for image replica recognition. IEEE Trans. Pattern Anal. Mach. Intell. 27, 3, 379–391.
G. Qiu, J. Morris, and X. Fan. 2007. Visual guided navigation for image retrieval. Pattern Recogn. 40, 6, 1711–1721.
J. Ramirez, J. M. Gorriz, R. Chaves, M. Lopez, D. Salas-Gonzalez, I. Alvarez, and F. Segovia. 2009. SPECT image classification using random forests. Electron. Lett. 45, 12, 604–605.
B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. 2008. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis. 77, 1–3, 157–173.
G. Shakhnarovich. 2008. The source code of locality sensitive hashing. http://www.ttic.edu/gregory.
J. Sivic and A. Zisserman. 2003. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the IEEE International Conference on Computer Vision. 1470–1477.
M.-N. Wu, C.-C. Lin, and C.-C. Chang. 2007. Novel image copy detection with rotating tolerance. J. Syst. Softw. 80, 7, 1057–1069.
Z. Wu, Q. Ke, M. Isard, and J. Sun. 2009. Bundling features for large scale partial-duplicate web image search. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. 25–32.
Z. Wu, Q. Xu, S. Jiang, Q. Huang, P. Cui, and L. Li. 2010. Adding affine invariant geometric constraint for partial-duplicate image retrieval. In Proceedings of the IEEE International Conference on Pattern Recognition. 842–845.
Z. Xu, H. Ling, F. Zou, Z. Lu, and P. Li. 2010. Robust image copy detection using multi-resolution histogram. In Proceedings of the ACM International Conference on Multimedia Information Retrieval. 129–136.
G. Yu, N. A. Goussies, J. Yuan, and Z. Liu. 2011. Fast action detection via discriminative random forest voting and top-k subvolume search. IEEE Trans. Multimedia 13, 3, 507–517.
L. Zheng, Y. Lei, G. Qiu, and J. Huang. 2012. Near-duplicate image detection in a visually salient Riemannian space. IEEE Trans. Inf. Forens. Secur. 7, 5, 1578–1593.
Received May 2013; revised January 2014; accepted March 2014