Rotation Invariant HOG for Object Localization in Web Images

Ali Vashaee (a), Reza Jafari (a), Djemel Ziou (a), Mohammad Mehdi Rashidi (b,c)

(a) Département d'informatique, Université de Sherbrooke, Sherbrooke, QC J1K 2R1, Canada
(b) Shanghai Key Lab of Vehicle Aerodynamics and Vehicle Thermal Management Systems, Tongji University, 4800 Cao An Rd., Jiading, Shanghai 201804, China
(c) ENN-Tongji Clean Energy Institute of Advanced Studies, Shanghai, China
Abstract

Localizing objects in Web images calls for an invariant descriptor. The HOG (histogram of oriented gradients) descriptor is used to increase the accuracy of localization. It is a shape descriptor that considers frequencies of gradient orientation in localized portions of an image. This well-known descriptor, however, does not cover rotation variations of an object in images. This paper introduces a rotation invariant feature descriptor based on HOG. The proposed descriptor is used in a top-down searching technique that covers the scale variation of objects in images. The efficiency of this method is validated by comparing its performance with existing research in a similar domain on the Caltech-256 Web dataset. The proposed method not only provides robustness against geometrical transformations of objects but also is computationally more efficient.

Keywords: Rotation invariant HOG, Object localization, Top-down searching.

1. Introduction

Web images usually contain a high degree of background clutter and often contain multiple objects per image. To retrieve an image, Content-Based Image Retrieval (CBIR) systems [1, 2] are usually used. These systems try to retrieve images similar to a user-defined specification or pattern (e.g., a shape sketch or an example image). Generally, the algorithms used in these systems are divided into three tasks: extraction, selection, and
classification [3]. The extraction task transforms the rich content of images into various features. Feature extraction is the process of generating features to be used in the selection and classification tasks. Feature selection reduces the number of features provided to the classification task; the features that are likely to assist in discrimination are selected and used for classification. Among these three activities, feature extraction is the most critical, because the particular features made available for discrimination directly influence the performance of the classification task. Our study focuses on feature extraction and its effect on image ranking performance.

Usually, there are several objects in a Web image. Thus, object representation based on global feature extraction will likely not result in accurate object categorization [4]. In some applications the accuracy of image retrieval is of utmost importance. For instance, the user may want to know which images in the database contain a given query object (in the literature, this query image is also called a template image or query object). In such cases, it is first necessary to search inside the images to find, or localize, the object. Then, to rank the image among all images in the database, a weight is assigned to the image based on the similarity of the found object to the given template. The traditional solution is template matching: given a template, all possible locations in the image are searched by a sliding window. Template matching has major issues such as dependency on the scale and orientation of the template. Furthermore, the complexity of the search is O(n²) for an image of size n × n. To address such dependency issues, one could test all possible scales and orientations of the objects in the image, but that would be an inefficient and very time consuming approach.

We propose the Rotation Invariant HOG (RIHOG) feature to cover the different possible orientations of the objects in the image. To address all possible scales and locations of the object of interest in the image, we propose a top-down searching method. By searching each image in the proposed top-down manner and using RIHOG as the feature for comparing selected regions with the template, we can find the object of interest in each database image and rank the images accordingly using their RIHOG correlation with the given template image. We refer to this ranking approach as the Rapid Ranking Method (RRM). In case there are several objects of the same type in one image (e.g., several guns in one image), the top-down searching window converges to the object that has the highest similarity with the template image under RIHOG features. This method
exhibits robustness against geometrical transformations of objects and has an efficient computational complexity.

The rest of this paper is organized as follows. In the following section, we mention some related work. In Section 3, we explain our method and how we use it for the image ranking application. Section 4 presents experimental results.

2. Related Research

Several geometric invariant feature extraction algorithms appear in the literature. Among them, BRISK [8], FREAK [9], SURF [10], and SIFT [11] are widely used in computer vision applications. Canclini et al. [12] showed that, in applications related to object recognition and retrieval, the SIFT feature outperforms the other mentioned geometrically invariant methods in terms of true positive results. A derivative of SIFT is HOG, a well-known shape descriptor that is used in several applications such as pedestrian detection [13] and iris state detection [14]. Liu et al. [15] proposed an invariant HOG. Unfortunately, their method has high complexity (i.e., more than O(n²)) due to several conversions of the whole image to log-polar coordinates. Hence, this approach is not efficient for fast object localization applications. Other related work that uses the FFT to obtain rotation invariance [16] is based on a costly O(n³) computation on the tangents of the boundary points of the object.

Dalal and Triggs [13] used HOG on a fixed-size searching window; however, this is not applicable in the case of template and test images with variable sizes. This issue has been addressed by using variable block sizes [17]. In this configuration, any given region is divided into nine overlapping blocks (blk_i)_{i=1...9}. The blocks overlap half of their area with one another (see Figure 1). Pixels in each block vote for the corresponding bins in the histogram. The obtained histograms for the blocks are denoted as hog(blk_i, B)_{i=1..9}. In order to account for changes in illumination and contrast, the histograms are locally normalized (Equation 1) over each block by their L2 norm (Euclidean norm):

    N hog(blk_i, B) = hog(blk_i, B) / ||hog(blk_i, B)||_2 ,   i = 1...9,   (1)

where N hog(blk_i, B) is the normalized histogram of oriented gradients of the region bounded by blk_i. The final HOG(R, B) is a vector with 81 elements, obtained by concatenating the 9-bin histograms over the 9 blocks.
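For concreteness, a minimal Python sketch of this block-wise construction is given below. It assumes unsigned gradient orientations, B = 9 bins, and nine half-overlapping blocks arranged on a 3 × 3 grid as in Figure 1; the helper names (block_hog, hog_descriptor) are hypothetical and not part of the original formulation.

```python
import numpy as np

def block_hog(region, bins=9):
    """Histogram of gradient orientations for one block (numerator of Eq. 1)."""
    gy, gx = np.gradient(region.astype(float))
    ang = np.mod(np.arctan2(gy, gx), np.pi)            # unsigned orientation in [0, pi)
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    return hist

def hog_descriptor(region, bins=9):
    """81-D HOG of a region: 9 half-overlapping blocks, each L2-normalized (Eq. 1)."""
    h, w = region.shape
    bh, bw = h // 2, w // 2                            # block size: half the region
    ys = [0, (h - bh) // 2, h - bh]                    # 3 x 3 grid of block origins,
    xs = [0, (w - bw) // 2, w - bw]                    # so neighbours overlap half their area
    feats = []
    for y in ys:
        for x in xs:
            hb = block_hog(region[y:y + bh, x:x + bw], bins)
            feats.append(hb / (np.linalg.norm(hb) + 1e-12))   # Eq. 1 normalization
    return np.concatenate(feats)                       # 9 blocks x 9 bins = 81 elements
```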
Figure 1: Variable block configuration for HOG feature extraction.
3. Proposed Method

The main goal of this study is to provide a fast, accurate, and invariant object localization method to be used in object retrieval applications. For feature selection we focus on the HOG feature, which has the following attributes: it is robust against small local rotational variations of objects, and it provides a distinctive description of objects. Furthermore, as this feature uses histograms, we can decrease its extraction complexity to O(c) by using the integral form of the histograms over the test images.

There are two issues regarding HOG feature exploitation in an image ranking framework: first, the dependency of the HOG feature on rotation and scale changes of the object; second, the computational cost of constructing the HOG feature. The complexity of searching over all locations of the image is O(n²). Furthermore, the complexity of HOG feature extraction during the search is O(n²). Hence, in total, object localization has a computational cost of O(n⁴). That is to say, although image ranking based on HOG provides accurate ranking results, it is not computationally effective. We address these issues in the following subsections. In Section 3.1, we tackle the scale variations of objects and the complexity of the localization problem.
In Section 3.2, we tackle the rotation dependency of HOG by introducing a new derivation of HOG features that is invariant to rotation (RIHOG). In Section 3.3, we use these two proposed methods for the image ranking application. We address several objects of the same class in one image in Section 3.4.

3.1. Scale Invariant Top-Down Searching Approach

Scale variation of objects in images is an issue for conventional template matching with fixed-size templates. Hence, instead of scanning images with different scales of templates, we propose a top-down, iterative searching approach that directs the searching process to smaller windows that have a higher resemblance to the template. In this approach, a test image Im is first considered as the main window S0; then a sliding window (Swx) six pixels smaller than S0 along the x axis is considered. It should be noted that the size of the sliding window and its intervals can be treated as parameters, which vary depending on the required accuracy and speed of the system. Here, a window-size difference of six pixels and intervals of two pixels are selected. For instance, if the size of S0 is 100 × 200, then the size of Swx is 94 × 200. Swx is slid over the test image (S0) with intervals of 2 pixels. During this process, three different regions corresponding to the slid windows (Swx)_{w=1...3} are selected. The same process is then performed along the y axis of S0, which results in three other regions corresponding to the slid windows (Swy)_{w=1...3}. In the next step, all the selected regions of Swx and Swy are compared with the given template and the most similar one is selected as the new S0. The same process is then performed iteratively until the size of the S0 window reaches 25% of the original image Im along the x and y axes. The stopping condition of this process could be treated as a parameter of the final image ranking system; we experimentally select 25% of the original image's size. An example of this process with four sub-windows ((Swx)_{w=1,2} and (Swy)_{w=1,2}) is depicted in Figure 2. In general, region comparison could be performed with any similarity metric and feature space. The complexity of searching a test image Im of size n × n pixels is:

    O( (n − n/m) / p ) ≈ O(n),   (2)

where the size of the object of interest is 1/m of the size of Im, and at each iteration the size of the search window is reduced by p pixels along both the x and y axes.
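A minimal sketch of this top-down iteration, using the parameter values quoted above (six-pixel window difference, two-pixel intervals, 25% stopping size); the similarity function is left abstract, and the function name is hypothetical:

```python
import numpy as np

def top_down_search(image, template, similarity, step=6, stride=2, min_frac=0.25):
    """Iteratively shrink S0 toward the most template-like sub-window (Section 3.1).

    `similarity(region, template)` is any region-comparison function
    (e.g. correlation of RIHOG descriptors); it is assumed, not prescribed here.
    """
    y0, x0, y1, x1 = 0, 0, image.shape[0], image.shape[1]
    min_h = int(image.shape[0] * min_frac)
    min_w = int(image.shape[1] * min_frac)
    trace = []                                      # similarity of every selected S0
    while (y1 - y0) > min_h or (x1 - x0) > min_w:
        candidates = []
        if (x1 - x0) > min_w:                       # three windows, `step` px narrower in x
            for dx in range(0, step, stride):
                candidates.append((y0, x0 + dx, y1, x1 - (step - dx)))
        if (y1 - y0) > min_h:                       # three windows, `step` px shorter in y
            for dy in range(0, step, stride):
                candidates.append((y0 + dy, x0, y1 - (step - dy), x1))
        scores = [similarity(image[a:c, b:d], template) for a, b, c, d in candidates]
        best = int(np.argmax(scores))
        y0, x0, y1, x1 = candidates[best]           # the winner becomes the new S0
        trace.append(scores[best])
    return (y0, x0, y1, x1), trace
```

The trace returned here corresponds to the L1 vector used later in Section 3.4.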
Figure 2: An example of the top-down searching approach with just four sub-windows. Each sub-window is slightly smaller than the test image (S0) along the x axis (sub-windows 1 and 2) or along the y axis (sub-windows 3 and 4). One of these windows is selected as the new S0 based on its similarity value with the given template. The same process then continues until the size of S0 reaches 25% of the original test image's size in both axes.
3.2. Rotation Invariant HOG (RIHOG)

As mentioned in Section 2, in order to construct the HOG features of a given image, the image is divided into several blocks. Assume a region r is enclosed in a block. The associated feature is represented by a histogram with B bins denoted by hog(r, B). The hog(r, B) is a normalized histogram computed by Equation 1, which provides an illumination-invariant representation of the region r. A rotation of the region r by θ degrees is then a translation by η bins in its histogram, denoted hog(r, B − η). For example, Figure 3 shows a region r containing a simple edge, with corresponding three-bin histograms. From left to right, a rotation by θ = 90 results in a shift of the corresponding bins by one to the right.

Figure 3: Simple example of an edge with one dominant gradient orientation and the corresponding HOG with three bins. Rotation of the object (first row) is equal to shifts in the corresponding histogram of oriented gradients space (second row).

Here, the aim is to come up with a description of the region r that does not change with the region's rotation. To do so, we use a translation-invariant transform (the Fourier transform) over the HOG features, based on the shift theorem of the Fourier transform:

    ||FT(hog(r, B − η))||_2 = ||FT(hog(r, B))||_2 ,   (3)

where FT(.) is the Fourier transform and hog(r, B) is the histogram of gradients over region r in a block blk with B bins. For instance, if hog(r, B) is a d-dimensional vector, its Fourier transform is a d-dimensional complex vector. If we consider the magnitude of each element of this vector, we obtain a d-dimensional real-valued vector. Therefore, the rotation invariant histogram of gradients of that block, denoted RIhog(blk, B), is defined as:

    RIhog(blk, B) = ||FT(hog(blk, B))||_2 .   (4)
Since the Fourier transform coefficients in RIhog(blk, B) are symmetric, we consider only half of the coefficients to avoid redundancy. The DC coefficient always has the highest value relative to the rest of the coefficients; as cross-correlation is used for feature comparison in object localization, we neglect the DC coefficient. To obtain the rotation-invariant property over a given image R, pixels located in each block must remain in the same block after rotation of the image. In other words, the local distribution of the blocks over the image must be preserved after rotation.
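A toy check of this construction: the FFT magnitude of a bin histogram is unchanged by a circular shift of its bins, and a per-block RIhog keeps half of the spectrum with the DC term dropped. This is a sketch under the assumption that rotation produces a circular bin shift, as in Figure 3; rihog_block is a hypothetical helper name.

```python
import numpy as np

# Toy check of Equations 3-4: the FFT magnitude of a histogram is unchanged
# by a circular shift of its bins (the effect a rotation has on hog(r, B)).
hog = np.array([0.1, 0.7, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])   # 9-bin example
shifted = np.roll(hog, 2)                                        # rotation -> bin shift
assert np.allclose(np.abs(np.fft.fft(hog)), np.abs(np.fft.fft(shifted)))

def rihog_block(hog_block):
    """RIhog of one block (Eq. 4): keep half the spectrum, drop the DC term."""
    mag = np.abs(np.fft.fft(hog_block))
    return mag[1:len(mag) // 2 + 1]        # symmetric spectrum -> keep half, skip DC
```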
Figure 4: Configuration of nested blocks in a test image. Four nested blocks are indicated with different colors. With the nested configuration, each block preserves the majority of the same region before and after rotation of the image. Thus, the bin shifts that occur in hog(blk) after rotation can be handled by a translation-invariant transform (the Fourier transform).
For instance, if the blocks are distributed with the spatial layout of Figure 1, then after rotation of the image the regions enclosed by the blocks do not remain in the same blocks. To address this issue, we rearrange the spatial distribution of the blocks in an eccentric manner, which results in a nested, overlapping spatial layout for the blocks (Figure 4). With this block configuration, most of the pixels in each block remain in the same block after rotation of the region.

The feature construction process of RIHOG is performed in three steps. First, the given image is divided into d nested blocks denoted by (blk_i)_{i=1..d}. Then each block is represented by RIhog(blk_i, B). The final rotation-invariant descriptor of the region R, denoted RIHOG(R, B), is constructed by concatenating the obtained RIhog(blk_i, B) for i = 1...d. An example of extracted RIHOG features is illustrated in Figure 5. A given image (Figure 5.a) is rotated by 90 degrees (Figure 5.b) and the corresponding RIHOG(R, B) features are illustrated in the same column. In order to have a better representation of the features, the number of bins (B) is selected as 40, with one block (d = 1) and two blocks (d = 2); RIHOG is thus a 20-dimensional feature vector (using half of the Fourier coefficients). If one block (d = 1) is used to represent the images, the corresponding RIHOG features are shown in Figure 5.c and Figure 5.d. The RIHOG representations with two blocks (d = 2) are shown in Figure 5.e and Figure 5.f; all of them illustrate the robustness of RIHOG against rotation variations.
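A sketch of the full descriptor might look as follows; the nested blocks are approximated here by centred sub-windows that shrink toward the centre, which is only one plausible reading of the layout in Figure 4, and the default parameter values (B = 27, d = 10, d2 = 3) anticipate the grid-search result reported in Section 3.3.

```python
import numpy as np

def rihog(region, bins=27, n_blocks=10, keep_outer=3):
    """RIHOG(R, B): concatenated RIhog of nested blocks (Section 3.2).

    `keep_outer` plays the role of the d2 parameter (outermost blocks only).
    The centred, shrinking block geometry is an assumption of this sketch.
    """
    gy, gx = np.gradient(region.astype(float))
    ang = np.mod(np.arctan2(gy, gx), np.pi)
    mag = np.hypot(gx, gy)
    h, w = region.shape
    cy, cx = h // 2, w // 2
    feats = []
    for i in range(keep_outer):                        # outermost blocks first
        frac = 1.0 - i / float(n_blocks)               # nested block i shrinks toward centre
        dy, dx = max(1, int(cy * frac)), max(1, int(cx * frac))
        sel = (slice(cy - dy, cy + dy), slice(cx - dx, cx + dx))
        hist, _ = np.histogram(ang[sel], bins=bins, range=(0, np.pi),
                               weights=mag[sel])
        hist = hist / (np.linalg.norm(hist) + 1e-12)   # Eq. 1 normalization
        spec = np.abs(np.fft.fft(hist))                # Eq. 4
        feats.append(spec[1:bins // 2 + 1])            # drop DC, keep half the spectrum
    return np.concatenate(feats)
```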
Figure 5: Representation of Rotation Invariant Histogram of Oriented Gradients (RIHOG) features for an original image (a) and its 90-degree rotation (b). The corresponding extracted features are shown in (c)-(f).
To address the computational cost of HOG feature construction, we use the integral form of the histogram [18], which allows us to rapidly compute the HOG features for any region in the test image. To do so, the given image is converted to gray level. Then the first derivatives of the image are computed with respect to both axes. After computing the gradient orientation of each pixel, an accumulating function Ihog() constructs the integral form of HOG for the given image.

3.3. Image Ranking Using RIHOG and the Top-Down Searching Approach

The application of RIHOG in the top-down searching approach for image ranking is referred to as the Rapid Ranking Method (RRM). The process of image ranking is divided into three steps: object localization, signature value assignment, and ranking. In the RRM ranking method, we use the top-down search method for object localization, which also covers scale variation of objects in the image. For signature value assignment, we use similarity under RIHOG features, which also covers object rotations in images. In this approach, the size of the searching window gradually shrinks toward the region of interest. This process is susceptible to noise in images, which may affect the direction of convergence toward the object of interest. To alleviate the background noise effect, a Gaussian filter is applied to the test images and the template image (q) before the search process. Figure 6 illustrates a general flowchart of this approach.

If the scale of the Gaussian filter is too large, we lose some information; if it is too small, it is not effective at filtering the noise. So, to choose proper values for the variables of the system (e.g., the scale of the Gaussian filter σ, and the bin size (B) and block size (d) of the RIHOG features), we perform a grid search [19]. We use a validation dataset containing 3000 complementary images from the Caltech-256 dataset. The ranges of bins, blocks, and Gaussian scales are as follows: B = [9, 18, 27, 36], d = [2, 5, 10, 20, 40], and σ = [0.5, 1.16, 1.66] with corresponding mask sizes of [3 × 3, 5 × 5, 10 × 10] for the Gaussian filter. Furthermore, we add another parameter related to the number of blocks (d). Outer blocks are more informative in the sense that they carry information about all parts of the object. Inner blocks, on the other hand, contain a local representation of the object, which sometimes varies considerably even among intra-class objects. In order to select an appropriate number of blocks to represent the object, we define the variable d2, which indicates the number of outer blocks selected for concatenation to represent the object. In the validation step, for each d_i we compare the performance over a range of d2 varying from one to the total number of selected blocks (d_i). For instance, for the test with a block size of 10 (d = 10), d2 varies from 1 to 10. The best ranking performance over the validation dataset is achieved by the following setting: B = 27, d = 10, d2 = 3, and σ = 1.16. An example of searching based on these parameters is shown in Figures 8 and 9.
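The accumulating function Ihog() is not spelled out further; a minimal sketch of the integral-histogram idea [18] it relies on, assuming a single-channel gray image and hypothetical helper names, is:

```python
import numpy as np

def integral_hog(gray, bins=27):
    """Integral histogram of oriented gradients: per-bin cumulative sums,
    so the orientation histogram of any rectangle costs O(bins) (Section 3.2)."""
    gy, gx = np.gradient(gray.astype(float))
    ang = np.mod(np.arctan2(gy, gx), np.pi)
    mag = np.hypot(gx, gy)
    idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    h, w = gray.shape
    vote = np.zeros((h, w, bins))
    vote[np.arange(h)[:, None], np.arange(w)[None, :], idx] = mag
    # pad with a leading row/column of zeros so rectangle sums need no edge cases
    return np.pad(vote.cumsum(0).cumsum(1), ((1, 0), (1, 0), (0, 0)))

def region_hog(ihog, y0, x0, y1, x1):
    """Orientation histogram of the rectangle [y0:y1, x0:x1] from the integral form."""
    return ihog[y1, x1] - ihog[y0, x1] - ihog[y1, x0] + ihog[y0, x0]
```

With this form, the orientation histogram of any rectangle costs O(B) additions, which is what makes the per-window feature cost O(c) during the search.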
Figure 6: Flowchart of the RRM .
3.4. Several Objects in One Image

In image retrieval applications, we are after the most similar object in each image. There could be one or several objects of the same class (e.g., Figures 8 and 9) in one image; in either case we want to find the most similar one. The process of selecting among the various regions and objects while searching inside an image is as follows. As explained for top-down searching, at each iteration several sub-windows are extracted and compared with the query image, and one of them is selected as the main region of interest (S0). The same procedure is then performed on S0. In fact, the S0 regions are the possible object candidates that we are searching for, so we keep track of them in the image. The similarities of all selected S0 windows with the template q are collected in a vector called L1 (e.g., Figure 7). We can then select among the object candidates by taking the maximum value of L1. This value is also used to determine the ranking position of the image in the database. Figure 7 shows an example of searching for a given template
Figure 7: An example of top-down image searching based on RIHOG features. a) Template image. b) Test image. c) L1 similarity values of the parsed windows; windows corresponding to local maxima of the L1 curve are candidates for the object of interest. d) Localized object.
(Figure 7.a) in a test image (Figure 7.b) with RIHOG features. The L1 similarity vector is computed as follows:

    L1(S0^i, q) = Corr( RIHOG(S0^i), RIHOG(q) ),   i = 1...n,   (5)

    Signature(Im) = max_i ( L1(S0^i, q) ),   (6)
where n is the number of selected S0 windows in the test image and Corr is the correlation. RRM provides a fast, generic, and invariant ranking system. It is invariant to object rotation changes (using RIHOG features) and to object scale changes (using the top-down search). In this method, object localization is fast because the integral histogram is used to extract RIHOG features with a complexity of O(c), where c is a constant related to the number of blocks used to construct the RIHOG features. The complexity of the top-down search is O(n). So, the complexity of the final top-down template matching to calculate the signature value for each image is O(n).
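Putting the pieces together, a sketch of the signature computation of Equations 5-6, reusing the hypothetical rihog() and top_down_search() sketches from Sections 3.1-3.2, could be:

```python
import numpy as np

def rihog_correlation(region, template_feat, bins=27):
    """Similarity used in Eq. 5: correlation of RIHOG vectors
    (rihog() is the sketch from Section 3.2; assumes non-constant features)."""
    f = rihog(region, bins=bins)
    return float(np.corrcoef(f, template_feat)[0, 1])

def signature(image, template, bins=27):
    """Eq. 5-6: run the top-down search, keep the L1 trace, return its maximum."""
    template_feat = rihog(template, bins=bins)
    sim = lambda region, _tmpl: rihog_correlation(region, template_feat, bins)
    _best_window, l1 = top_down_search(image, template, sim)   # Section 3.1 sketch
    return max(l1) if l1 else 0.0
```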
Figure 8: Localizing the object given in the query image (a) within the test image (b); the localized object of interest is shown in (c). Details of the search are depicted in Figure 9.
4. Experimental Results

In this study, we use Caltech-256 [20], which consists of 30607 Web images. Images are assigned to 256 categories and have been evaluated by humans in order to ensure image quality and relevance. Images usually contain several objects with non-uniform backgrounds and a variety of rotations, scales, and illuminations. The Caltech-256 dataset is used for large-scale image classification, object detection, and retrieval purposes [21, 22, 23, 24]. As Caltech-256 is a large dataset, bootstrapping [25] is used to analyze the generic image ranking performance. Bootstrapping is an appropriate way to control and check the stability of ranking performance results, and it allows us to analyze our algorithm on a large dataset of images without necessarily testing every image in that dataset. In bootstrapping, we iteratively sample the given dataset of object classes and create smaller bootstrapped datasets. In each iteration, 30 random object classes are selected to construct the bootstrapped dataset. The average of all ranking results over the constructed bootstrapped datasets determines the final ranking performance. We use the mean Average Precision (mAP) and Normalized Discounted Cumulative Gain (NDCG), measures that are widely used in the information retrieval community. The mAP and NDCG over the first 10 retrieved images are used for ranking performance evaluation.

Canclini et al. [12] showed that, in object recognition and retrieval applications, the SIFT feature outperforms all the other mentioned geometrically invariant methods in terms of true positive results. Hence, in this study, we compare our method with SIFT. In order to evaluate SIFT on image ranking, we use an effective technique introduced by Fan [11] to localize objects in images. The results of image ranking based on the Fan method are reported as SIFT ranking in Table 1. To check the effect of object rotation on ranking results, we rotate the template images used in the SIFT ranking step by 180 degrees; the results are reported as SIFT rotated. To simulate the scale transformation of objects, the size of the template images is reduced to half (SIFT scaled). The SIFT-based ranking shows robust image ranking performance under geometrical transformations. The same ranking tests are applied to HOG features. Although HOG provides high ranking performance (HOG ranking), it is not effective under geometrical transformations of objects: the ranking performance decreases drastically when objects are transformed in images. The related results are reported as HOG scaled and HOG rotated in Table 1. Following the same evaluation, our proposed method (RRM) achieves ranking performance almost the same as HOG while remaining robust against all geometrical transformations of objects in images; the related results are reported as RRM scaled and RRM rotated.

We measured the localization error of our method under RIHOG. Since the accuracy of object localization relies on human examination, we randomly selected 30 classes of objects from Caltech-256 and applied our localization algorithm under the rotation invariant HOG. Afterwards, a user checked the results of object localization in the given images. The average localization error is 43%.

We also applied sliding windows for object localization in the ranking framework; the results are reported in Table 2. The sliding-windows approach is based on a pixel-by-pixel comparison of the given template with all locations in the image. Hence, this method is very sensitive to the scale and orientation of the given template. Since objects in real-world Web images do not all share the same scale and orientation, the retrieval accuracy of sliding windows is lower than that of the scale-invariant top-down searching method. Besides, the computational cost of object localization with sliding windows is higher than with the top-down searching approach (Table 2), which results in slow response times.

In order to analyze the effect of several low-level features on image ranking performance, we also consider the Region Covariance (RC) feature descriptor [30]. To have a better overview of the effect of the selected features
Figure 9: Example of localizing the object given in Figure 8. (a-e) Some of the selected windows during top-down searching, with their associated RIHOG features. (f) Given query image and its associated RIHOG features.
on image ranking performance, two different feature vectors are used in the RC ranking. The applied feature vectors are denoted as RC Gray and RC RGB in Table 3. In the RC RGB case, RGB color band information is used along with orientation information. Ranking results with RGB features outperform those with grayscale features. Hence, we apply further tests
under different orientations and scales. The related ranking results under object transformations are reported as RC RGB scaled and RC RGB rotated in Table 1.

Table 1: Image ranking scores under different feature spaces.

Methods/evaluation    mAP    NDCG
SIFT ranking          0.30   0.29
SIFT scaled           0.27   0.31
SIFT rotated          0.14   0.12
HOG ranking           0.41   0.52
HOG scaled            0.12   0.13
HOG rotated           0.22   0.23
RC Gray               0.21   0.23
RC RGB                0.33   0.38
RC RGB scaled         0.12   0.12
RC RGB rotated        0.22   0.25
RRM                   0.36   0.40
RRM scaled            0.32   0.37
RRM rotated           0.33   0.37
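For reference, a sketch of the evaluation protocol (AP over the top-10 retrieved images, averaged over bootstrapped 30-class subsets) is given below; the exact AP normalization used in the paper is not specified, so this is only an assumed variant, and the helper names are hypothetical.

```python
import numpy as np

def average_precision_at_k(ranked_labels, query_label, k=10):
    """AP over the first k retrieved images (an image is relevant if its label matches)."""
    rel = np.asarray(ranked_labels[:k]) == query_label
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precisions * rel).sum() / rel.sum())

def bootstrap_map(per_query_ap, class_of_query, n_rounds=20, classes_per_round=30, seed=0):
    """Bootstrapped mAP (Section 4): mean AP over randomly drawn 30-class subsets."""
    rng = np.random.default_rng(seed)
    per_query_ap = np.asarray(per_query_ap)
    class_of_query = np.asarray(class_of_query)
    classes = np.unique(class_of_query)
    scores = []
    for _ in range(n_rounds):
        subset = rng.choice(classes, classes_per_round, replace=False)
        mask = np.isin(class_of_query, subset)
        if mask.any():                               # guard against empty subsets
            scores.append(per_query_ap[mask].mean())
    return float(np.mean(scores))
```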
Table 2: Top-down searching compared with sliding windows.

Method               mAP    Scale invariant   Complexity
Top-down searching   0.36   Yes               O(n)
Sliding windows      0.20   No                O(n²)
Table 3: Different sets of pixel attributes used in Region Covariance feature construction.

Method    Extracted feature vector for each pixel
RC RGB    F(x, y) = [x, y, R(x, y), G(x, y), B(x, y), |∂I(x,y)/∂x|, |∂I(x,y)/∂y|, |∂²I(x,y)/∂x²|, |∂²I(x,y)/∂y²|]
RC Gray   F(x, y) = [x, y, I(x, y), |∂I(x,y)/∂x|, |∂I(x,y)/∂y|, |∂²I(x,y)/∂x²|, |∂²I(x,y)/∂y²|]
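A sketch of the RC RGB descriptor of Table 3 (the covariance of the per-pixel attribute vectors over a patch [30]); using the gray-level intensity for the derivative terms is an assumption of this sketch.

```python
import numpy as np

def region_covariance(rgb):
    """Region Covariance descriptor [30] of an RGB patch, using the RC RGB
    attribute vector of Table 3 (x, y, R, G, B, first/second intensity derivatives)."""
    h, w, _ = rgb.shape
    gray = rgb.astype(float).mean(axis=2)
    gy, gx = np.gradient(gray)
    gyy, _ = np.gradient(gy)                 # second derivative along y
    _, gxx = np.gradient(gx)                 # second derivative along x
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.stack([xs, ys,
                      rgb[..., 0], rgb[..., 1], rgb[..., 2],
                      np.abs(gx), np.abs(gy), np.abs(gxx), np.abs(gyy)], axis=-1)
    flat = feats.reshape(-1, feats.shape[-1]).astype(float)
    return np.cov(flat, rowvar=False)        # 9 x 9 covariance of pixel attributes
```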
In recent years, a family of deep learning based image retrieval methods has been proposed [31, 32, 33, 34]. Recently, Wan et al. [31] evaluated deep learning for image retrieval on Caltech-256. They used convolutional networks for feature extraction and deep learning to train their neural networks. It must be noted that feature extraction using convolutional networks is not rotation and scale invariant. In order to compare our searching method with deep learning based image retrieval, we use the proposed top-down approach with rotation-dependent features, such as RC, GIST [35], and HOG. To do so, we apply the top-down searching under RC features. The comparison metric is based on the distance ρ [30] between the search window S0^i and q. If RC(S0^i) and RC(q) are, respectively, the region covariance matrices of the searching window S0^i and the query image q, then the associated values in the L1 similarity vector are:

    L1(S0^i, q) = ρ( RC(S0^i), RC(q) ),   i = 1...n,   (7)
    ρ(RC(S0^i), RC(q)) = sqrt( Σ_{i=1..n} ln² λ_i(RC(S0^i), RC(q)) ),   (8)

where λ_i(RC(S0^i), RC(q)) are the generalized eigenvalues of RC(S0^i) and RC(q), computed from:

    λ_i RC(S0^i) x_i − RC(q) x_i = 0,   i = 1...d,   (9)
where x_i ≠ 0 are the eigenvectors. The object candidate regions are the local minima of the L1 similarity curve. At this stage, to find the final object of interest in the image, some post-processing is done on all local minima of the L1 curve. First, we extract the regions associated with the local minima of the L1 curve. Second, for each selected region, we compute the correlations of its HOG and GIST features with the query image. Third, we take the average of the two computed correlations. The maximum of these averaged correlations is then the signature of the test image. We refer to this ranking as RRM2. The average time of object localization for five images of 600 × 800 is around two seconds on a computer with the following specification: Core i7 CPU (3.5 GHz) and 16 GB of RAM. The comparison results are reported in Table 4. Our method provides higher ranking performance and is also invariant to the scale variation of objects in the image thanks to the top-down searching approach. Besides, we test the performance on all 255 classes of the Caltech database, while their search method (i.e., deep learning) is tested on just 50 classes.
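A sketch of the distance of Equations 8-9, using SciPy's generalized symmetric eigensolver (the covariance matrices are assumed positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def covariance_distance(c1, c2):
    """Eq. 8-9: dissimilarity of two region covariance matrices [30],
    sqrt of the sum of squared logs of their generalized eigenvalues."""
    lam = eigh(c1, c2, eigvals_only=True)    # generalized eigenvalues, Eq. 9
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```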
Table 4: RRM2 compared with deep learning image retrieval.

Method                      mAP    Scale invariant   Number of classes
RRM2                        0.63   Yes               255
Deep learning search [31]   0.54   No                50
In order to enrich our experimental results, we also test RRM2 on the Oxford Buildings dataset (Table 5). Oxford contains 5063 high-resolution images downloaded from Flickr.

Table 5: RRM2 results on the Oxford dataset.

Methods/evaluation   mAP
RRM2                 0.68
RRM2 rotated         0.63
RRM2 scaled          0.65
As can be seen in the RRM row of Table 6, geometrical variations of objects are covered and the computational complexity of object localization in the ranking system is reduced from O(n²) to O(n). The average time of object localization for five images of 600 × 800 is around one second on a computer with the specification mentioned above.

Table 6: Comparison of the implemented methods' performance.

                      Ranking method   mAP    Scale invariant   Rotation invariant   Complexity
Implemented methods   HOG ranking      0.41   No                No                   O(cn⁴)
Implemented methods   SIFT ranking     0.30   Yes               No                   O(n²)
Proposed method       RRM              0.36   Yes               Yes                  O(cn)
Table 7 shows that RRM outperforms SIFT in object recognition results and complexity. RRM also outperforms HOG in object recognition under geometrical variations of objects in images (Table 1). In Table 8, we compare our method with previous CBIR frameworks that use Caltech-256 for performance analysis. The results show that the ranking performance of the proposed method is higher than that of similar documented image retrieval works. Note that, in previous CBIR methods, it is assumed that each image represents one object in order to achieve fast retrieval. In our study, such an assumption would result in considering a large amount of noise in each Web image, which has a negative impact on retrieval performance.
Table 7: Implemented methods' performance against object scale and rotation variations.

                      Ranking method   Scaled object (50%)   Rotated object (180°)
Implemented methods   HOG ranking      0.12                  0.22
Implemented methods   SIFT ranking     0.27                  0.14
Proposed method       RRM              0.32                  0.33
Although retrieval methods such as hashing strategies improve image retrieval time, they are not as accurate as sequential searching in images [26].

Table 8: Performance comparison of the proposed method with existing CBIR systems on the Caltech-256 dataset.

                  Ranking method     Top-10 precision
Previous CBIR     Deng et al. [26]   0.27
Previous CBIR     OASIS [27]         0.24
Previous CBIR     MCML [28]          0.21
Previous CBIR     LEGO [27]          0.20
Previous CBIR     LMNN [29]          0.19
Previous CBIR     Euclidean [27]     0.18
Proposed method   RRM                0.36
5. Conclusion

The main objective of this study was to find a proper representation (features) of objects in order to achieve high ranking performance on databases of Web images. Since the HOG feature has proven to be an efficient method for object detection, we focused on this feature and on how to address its rotation and scale dependency issues. To address rotation variation of objects in images, the rotation invariant RIHOG feature was introduced. Furthermore, a top-down searching method was introduced that covers scale variation of objects in images. These two methods are used together in an image ranking framework; the resulting image ranking method is called RRM. Experimental results show that this method outperforms image ranking based on SIFT and HOG features. This ranking method is not only invariant to scale and rotation changes of objects in images but also maintains the ranking performance. Other applications of the proposed methods include fast and invariant object detection, image matching, cyber-security, and object tracking.
References

[1] W.J.X. Wangming and L. Xinhai, Application of image SIFT features to the context of CBIR. International Conference on Computer Science and Software Engineering, pp. 552-555, 2008.
[2] A. Marchiori, C. Brodley, J. Dy, C. Pavlopoulou, A. Kak, L. Broderick, and A.M. Aisen, CBIR for medical images - an evaluation trial. IEEE Workshop on Content-Based Access of Image and Video Libraries, pp. 89-93, 2001.
[3] R. Choras, Image feature extraction techniques and their applications for CBIR and biometrics systems. International Journal of Biology and Biomedical Engineering, 1(1), pp. 6-16, 2007.
[4] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin, The QBIC project: querying images by content, using color, texture, and shape. International Society for Optics and Photonics, pp. 173-187, 1993.
[5] Y. Pang, Y. Yuan, X. Li, and J. Pan, Efficient HOG human detection. Signal Processing, 91(4), pp. 773-781, 2011.
[6] N. Chen, W.N. Chen, and J. Zhang, Fast detection of human using differential evolution. Signal Processing, 110(1), pp. 155-163, 2014.
[7] Y. Yuan, X. Lu, and X. Chen, Multi-spectral pedestrian detection. Signal Processing, 110(1), pp. 94-100, 2015.
[8] S. Leutenegger, M. Chli, and R.Y. Siegwart, BRISK: Binary robust invariant scalable keypoints. IEEE International Conference on Computer Vision, pp. 2548-2555, 2011.
[9] A. Alahi, R. Ortiz, and P. Vandergheynst, FREAK: Fast retina keypoint. IEEE Conference on Computer Vision and Pattern Recognition, pp. 510-517, 2012.
[10] A.C. Murillo, J.J. Guerrero, and C. Sagues, SURF features for efficient robot localization with omnidirectional images. IEEE International Conference on Robotics and Automation, pp. 3901-3907, 2007.
[11] L. Fan, Intra-class variation, affine transformation and background clutter: Towards robust image matching. IEEE First Symposium on Multi-Agent Security and Survivability, pp. 22-26, 2004.
[12] A. Canclini, M. Cesana, A. Redondi, M. Tagliasacchi, J. Ascenso, and R. Cilla, Evaluation of low-complexity visual feature detectors and descriptors. International Conference on Digital Signal Processing, pp. 1-7, 2013.
[13] N. Dalal and B. Triggs, Histograms of oriented gradients for human detection. International Conference on Computer Vision and Pattern Recognition, pp. 886-893, 2005.
[14] R. Jafari and D. Ziou, Eye-gaze estimation under various head positions and iris states. Expert Systems with Applications, 42(1), pp. 510-518, 2015.
[15] K. Liu, H. Skibbe, T. Schmidt, T. Blein, K. Palme, T. Brox, and O. Ronneberger, Rotation invariant HOG descriptors using Fourier analysis in polar and spherical coordinates. International Journal of Computer Vision, 106(3), pp. 342-364, 2008.
[16] Y. Su and Y. Wang, Rotation invariant shape contexts based on feature-space Fourier transformation. International Conference on Image and Graphics, pp. 575-579, 2007.
[17] L. Oswaldo and D. Delgado, Trainable classifier-fusion schemes: an application to pedestrian detection. IEEE Conference on Intelligent Transportation Systems, pp. 1-6, 2009.
[18] F. Porikli, Integral histogram: A fast way to extract histograms in Cartesian spaces. IEEE Conference on Computer Vision and Pattern Recognition, pp. 829-836, 2005.
[19] J. Bergstra and Y. Bengio, Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, pp. 281-305, 2012.
[20] A.H.G. Gregory and P. Perona, Caltech-256 object category dataset. 2007.
[21] A.Z.A. Bosch and X. Muoz, Image classification using random forests and ferns. International Conference on Computer Vision, pp. 1-8, 2007.
[22] P. Gehler and S. Nowozin, On feature combination for multi-class object classification. International Conference on Computer Vision, pp. 221-228, 2009.
[23] E.S.O. Boiman and M. Irani, In defense of nearest-neighbor based image classification. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2008.
[24] J.C. van Gemert, C.J. Veenman, A.W.M. Smeulders, and J.M. Geusebroek, Visual word ambiguity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(7), pp. 1271-1283, 2010.
[25] P. Hall and B. Presnell, Intentionally biased bootstrap methods. Journal of the Royal Statistical Society, 61(1), pp. 143-158, 1999.
[26] A.B.J. Deng and L. Feis, Hierarchical semantic indexing for large scale image retrieval. IEEE Conference on Computer Vision and Pattern Recognition, pp. 785-792, 2011.
[27] U.S.G. Chechik, V. Sharma, and S. Bengio, Large scale online learning of image similarity through ranking. The Journal of Machine Learning Research, 11, pp. 1109-1135, 2010.
[28] A. Globerson and S. Roweis, Metric learning by collapsing classes. Advances in Neural Information Processing Systems, pp. 451-458, 2005.
[29] J. Blitzer and K. Weinberger, Distance metric learning for large margin nearest neighbor classification. Advances in Neural Information Processing Systems, pp. 1473-1480, 2005.
[30] O. Tuzel, F. Porikli, and P. Meer, Region covariance: A fast descriptor for detection and classification. Computer Vision - ECCV 2006, 3952(1), pp. 589-600, 2006.
[31] J. Wan, D. Wang, S.C.H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, Deep learning for content-based image retrieval: A comprehensive study. Proceedings of the ACM International Conference on Multimedia, pp. 157-166, 2014.
[32] A. Krizhevsky, I. Sutskever, and G.E. Hinton, ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, pp. 1106-1114, 2012.
[33] H. Xie, Y. Zhang, J. Tan, L. Guo, and J. Li, Contextual query expansion for image retrieval. IEEE Transactions on Multimedia, 16(4), pp. 1104-1114, 2014.
[34] A.S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition. Computer Vision and Pattern Recognition, abs/1403.6382, 2014.
[35] C. Gao and N. Sang, Inspired template matching using scene context. Computer Vision and Pattern Recognition Workshops, pp. 48-55, 2011.