Local Hypersphere Coding Based on Edges between Visual Words

Weiqiang Ren1, Yongzhen Huang1, Xin Zhao2, Kaiqi Huang1, and Tieniu Tan1

1 National Laboratory of Pattern Recognition, CASIA, China
2 Department of Automation, University of Science and Technology of China, China
{wqren,yzhuang,xzhao,kqhuang,tnt}@nlpr.ia.ac.cn
Abstract. Local feature coding has drawn much attention in recent years, and many excellent coding algorithms have been proposed to improve the bag-of-words model. This paper proposes a new local feature coding method called local hypersphere coding (LHC), which differs from traditional coding methods in two respects. Firstly, we describe local features by the edges between visual words. Secondly, the reconstruction center is moved from the origin to the nearest visual word, so that feature coding is performed on the hypersphere of the feature space. We evaluate our coding method on several benchmark datasets for image classification. The proposed method outperforms several state-of-the-art coding methods, indicating its effectiveness.
1
Introduction
Local feature coding has become a standard technique in bag-of-words based image classification. In recent years a large number of coding methods have been designed, achieving state-of-the-art performance on public classification benchmarks. The typical bag-of-words image classification framework consists of four steps: (1) Local feature extraction. Usually local feature descriptors (e.g. SIFT [1], HOG [2]) are densely extracted from each image. (2) Local feature coding. In this step, each local feature is encoded with a precomputed dictionary. Coding methods play a vital role in enhancing the discriminative power of the image representation. Much recent research focuses on designing more powerful coding methods, from basic hard quantization [3] to more sophisticated schemes, e.g. soft quantization [4], sparse coding [5], locality-constrained linear coding (LLC) [6], super vector coding [7], the improved Fisher kernel [8,9], etc. (3) Spatial pooling. The coding responses of the local features on each visual word are pooled into a single value; concatenating all these pooled values produces the final image representation. Successful pooling operations include max pooling [5], average pooling [10], weighted average pooling [7], etc. As the standard bag-of-words model does not consider geometric layout, spatial pyramid matching (SPM) [10] is usually adopted to improve the classification performance.

K.M. Lee et al. (Eds.): ACCV 2012, Part I, LNCS 7724, pp. 190–203, 2013. © Springer-Verlag Berlin Heidelberg 2013

(4) Classification. Usually an over-complete dictionary is used for
local feature coding, thus the image representation typically has very high dimensionality. In this case, a linear SVM greatly reduces training time while obtaining even better performance than non-linear classifiers. As an important step in the bag-of-words model, local feature coding can be viewed as a feature space transformation: local features are transformed from the local feature space into a new, higher-dimensional feature space spanned by a set of precomputed visual words. Traditionally, local feature coding methods are designed based on reconstruction by a linear combination of dictionary bases under different constraints. Sparse coding [11] adds a sparsity constraint on the reconstruction term, while LLC [6] introduces a locality constraint. On the one hand, reconstruction based local feature coding methods retain the most important information by restricting the reconstruction error to be low. On the other hand, they put specific constraints on the solution to obtain discriminative and robust representations. These are also the goals of the coding method proposed in this paper. As local features and visual words are usually l2-normalized to unit length, they are all distributed on the surface of a hyperball. Traditional coding methods are carried out in this hyperball, with the local feature described by a linear combination of the visual words. Fig 1-(a) illustrates the reconstruction of local feature x by a set of visual words: visual words d_1, d_2 and d_3 are used to describe local feature x. If d_1, d_2, d_3 and x are coplanar, x can be perfectly reconstructed by these three visual words. However, in most situations this assumption does not hold. Wang et al. [6] pointed out that, under certain assumptions, reconstructing a local feature with visual words located on a locally smooth manifold produces lower reconstruction error than standard sparse coding.
Motivated by the local smoothness assumption, we propose to move the origin to the hypersphere and perform feature coding on the hypersphere. As shown in Fig 1-(b), rather than reconstructing the local feature with visual words, we use the edges connecting the visual words for reconstruction. As a local region on the hypersphere is close to a hyperplane, reconstruction on the hypersphere produces less error than the traditional methods. This paper makes three main contributions:
1. Compared with traditional reconstruction based coding methods, we extend the idea of locality and smoothness and perform feature coding on the hypersphere. Moving the origin onto the hypersphere and reconstructing on a locally smooth hypersphere yields better reconstruction and a more distinctive representation.
2. The edges between visual words are utilized to reconstruct the edge from each visual word to the local feature. To the best of our knowledge, this is the first work that encodes local features with edges between visual words.
3. The proposed coding scheme can be readily applied to existing methods. The notion of reconstruction with edges between visual words is general, and more regularization can be added to obtain more specialized representations.
Fig. 1. The reconstruction scheme of traditional coding methods and the proposed method. The red stars on the hypersphere are the visual words, while the blue rectangle denotes the local feature to be described. (a) illustrates the reconstruction of traditional reconstruction based coding methods. (b) demonstrates the reconstruction in the proposed method. Note that the coding process is performed on the hypersphere. See text for a detailed explanation.
The rest of the paper is organized as follows: Section 2 reviews the development of coding methods in the bag-of-words model. Section 3 presents the proposed feature coding method on the hypersphere. In Section 4, we report experimental results on three widely used benchmark datasets and discuss the parameter settings and their influence on the classification results. Finally, in Section 5, we draw conclusions and discuss future work.
2
Related Work
Over the years, many novel coding methods have been proposed to improve the bag-of-words model. These coding methods roughly fall into two categories: (1) reconstruction based coding methods, like hard quantization, sparse coding, LLC, etc.; (2) methods describing the differences between local features and visual words, for example super vector coding and the Fisher kernel. Hard quantization [3] is the basic coding method applied in bag-of-words based image classification. Each local feature votes on the nearest visual word, and the final image representation is a histogram of the voting responses. From the reconstruction point of view, hard quantization utilizes only the nearest visual word, and can be viewed as a description of the local feature with one single visual word. When the dictionary size is small, hard quantization introduces large quantization errors. Soft quantization [4] alleviates this problem by voting on multiple visual words. Salient coding [12] employs the ratio between a local descriptor's nearest code and its other codes as the salient representation.
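As a concrete illustration, the two quantization schemes above can be sketched in a few lines of NumPy. This is a minimal sketch, not the cited authors' implementations; the neighbourhood size `k` and the smoothing weight `beta` are assumed values.

```python
import numpy as np

def hard_quantize(x, D):
    """Hard quantization [3]: vote on the single nearest visual word.

    x: (M,) local feature; D: (M, K) dictionary of K visual words.
    Returns a (K,) one-hot response vector.
    """
    dists = np.linalg.norm(D - x[:, None], axis=0)
    z = np.zeros(D.shape[1])
    z[np.argmin(dists)] = 1.0
    return z

def soft_quantize(x, D, k=5, beta=10.0):
    """Localized soft quantization [4,19]: spread Gaussian-kernel votes
    over the k nearest visual words (beta is an assumed smoothing value),
    then normalize the votes to sum to one.
    """
    dists = np.linalg.norm(D - x[:, None], axis=0)
    nn = np.argsort(dists)[:k]
    z = np.zeros(D.shape[1])
    z[nn] = np.exp(-beta * dists[nn] ** 2)
    return z / z.sum()
```

Averaging (or max pooling) such per-descriptor vectors over an image yields the bag-of-words histogram.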
Sparse coding [11] is also a reconstruction based method, where a sparsity constraint is added. The solution of sparse coding is generally sparse, and the responses on most of the visual words are close to zero. Sparse coding achieves better reconstruction than hard quantization and produces more distinctive sparse responses. It has been adopted in image classification (see [5]) and achieves state-of-the-art results on several public benchmarks. However, the l1 regularization in sparse coding is not smooth, so sparse coding might project similar local features onto totally different visual words. Moreover, the l1 regularization makes sparse coding computationally expensive. To tackle these problems, LLC [6] replaces the sparsity constraint with a locality constraint. It encodes each local feature by projecting it into its local coordinate system. An approximated version of LLC has a clearer locality constraint and a simpler form than standard LLC: the coding process is performed only on the K nearest neighbours, where a constrained least squares problem can be solved efficiently. Recently, several more powerful feature coding methods have been proposed, such as the code graph [13], the Fisher kernel [8,9], super vector coding [7], and vector difference coding [14]. Huang et al. [13] propose to improve image classification by exploring the relationships between visual words. The Fisher kernel combines the power of both generative and discriminative models: rather than explicitly reconstructing the local feature, it records the first and second order differences between local features and visual words. Super vector coding encodes a local feature by recording the differences between the local feature and the visual words. Zhao et al. [14] perform feature coding based on the vector difference in a high-dimensional space via explicit feature maps.
Fisher kernel, super vector and vector difference coding representations are about M times larger than those of other coding methods, such as hard quantization, soft quantization, and LLC. Nevertheless, they achieve state-of-the-art results on several challenging benchmarks such as PASCAL VOC and ImageNet.
3
Proposed Method
In this section, we first briefly review the traditional reconstruction scheme of several classical bag-of-words based coding methods (Section 3.1), and then present the proposed local feature coding method in detail (Section 3.2).

3.1 Traditional Reconstruction Scheme of Bag-of-Words Based Coding
Denote x ∈ R^M as a local feature descriptor (e.g. SIFT, HOG), where M is the dimensionality of the descriptor. Let X = [x_1, x_2, · · · , x_N] ∈ R^{M×N} be the local features extracted from an image, and let D = [d_1, d_2, · · · , d_K] ∈ R^{M×K} be the dictionary of K visual words trained by clustering or dictionary learning. Let z be the encoded feature. For reconstruction based coding methods, the dimensionality of z is usually K, while for other coding methods it may be far larger than K.
For the traditional reconstruction based coding methods in the bag-of-words model, such as hard quantization, sparse coding, LLC, etc., we can readily put them into a unified framework:

z = arg min_z ||x − Dz||_2^2 + λ r(z),  s.t. z ∈ C    (1)
where r(z) is the regularization term designed from different intuitions, and C denotes constraints on the solution. For hard quantization, r(z) = 0 and C = {z | ||z||_0 = 1, 1^T z = 1}. The constraint forces the solution to have only one nonzero response, which means that hard quantization always produces a coarse reconstruction of x. For sparse coding, r(z) = ||z||_1 and C = R^K; the l1 regularization forces the solution to be sparse. As pointed out by Yang et al. [5], sparse coding obtains lower reconstruction error and allows the representation to be specialized and salient. For LLC, r(z) = Σ_i ||exp(dist(x, d_i)/σ) z_i||_2^2 and C = R^K; the regularization term puts a locality constraint on the solution so that a local feature has large responses on its closest visual words. For efficient computation, the approximated LLC removes the weighted l2 regularization term and explicitly restricts reconstruction to the K nearest visual words. All the reconstruction based coding methods share the same least squares error term ||x − Dz||_2^2, as shown in Eqn (1). The local feature x is described as a linear combination of a set of visual words (the dictionary), namely,

x ≈ Σ_{i=1}^{K} z_i d_i    (2)
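As an illustration of this unified framework, the approximated LLC variant described above can be sketched as follows: x is reconstructed from its k nearest visual words under a sum-to-one constraint, via the analytic solution of the shifted least squares problem. This is a hedged sketch rather than the authors' released code; the small regularizer `eps` is an assumed numerical stabilizer.

```python
import numpy as np

def approx_llc(x, D, k=5, eps=1e-6):
    """Sketch of approximated LLC [6]: reconstruct x from its k nearest
    visual words under the sum-to-one constraint, in the spirit of Eqn (1).

    x: (M,) local feature; D: (M, K) dictionary. Returns a (K,) sparse code.
    """
    dists = np.linalg.norm(D - x[:, None], axis=0)
    nn = np.argsort(dists)[:k]
    B = D[:, nn]                         # the k nearest bases
    shifted = B - x[:, None]             # shift the coordinate origin to x
    C = shifted.T @ shifted              # local covariance matrix
    C += eps * np.trace(C) * np.eye(k)   # regularize for numerical stability
    c = np.linalg.solve(C, np.ones(k))   # solve C c = 1
    c /= c.sum()                         # rescale to enforce 1^T c = 1
    z = np.zeros(D.shape[1])
    z[nn] = c                            # scatter into the K-dim code, Eqn (2)
    return z
```

The dense code z then plugs directly into x ≈ D z of Eqn (2), with at most k nonzero coefficients.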
As shown in Eqn (2), local feature x is transformed into a new feature space spanned by all the visual words in the dictionary. The coefficient vector z is the new representation of local feature x, which is more discriminative than the original local feature. The reconstruction term in the objective function constrains the new representation so that most of the important information is preserved. Bad reconstruction loses too much information about the original local feature, which inevitably leads to bad performance. For example, in hard quantization the local feature is described by its nearest visual word, so quantization error always exists. Soft quantization alleviates this problem by describing one local feature with multiple visual words, leading to better performance than hard quantization. Sparse coding and LLC take sparsity and locality constraints into account, respectively. The parameter λ in Eqn (1) balances the reconstruction error and the penalty.

3.2 Local Hypersphere Coding (LHC)
As demonstrated in Fig 1-(a), in traditional reconstruction based coding algorithms, the local feature vector is represented by a set of visual words on the hypersphere. Only the relationships between visual words and local features are
considered, ignoring the relationships between visual words. Fig 1-(b) shows that, for one visual word, the edges connecting nearby visual words tend to lie on a local hypersphere approximating a hyperplane, and reconstructing a local feature on a hyperplane produces less error. Motivated by this observation, we propose to perform feature coding on the local hypersphere to achieve better reconstruction. Rather than reconstructing the raw local feature x, we reconstruct the difference between local feature x and visual word d_i. D = [d_1, d_2, · · · , d_K] ∈ R^{M×K} is a dictionary of size K. The directed edge from visual word d_i to d_j is denoted as e_ij, namely

e_ij = d_j − d_i    (3)

The directed edge from visual word d_j to local feature x is denoted as y_j, namely

y_j = x − d_j    (4)
As illustrated in Fig 1-(b), we propose to reconstruct y_j (the blue arrow) using a subset of the directed edges between visual words (drawn as red arrows). More precisely, denoting the set of the L nearest visual words of d_j as N_L(d_j), the subset D_e = {e_jk | k ∈ N_L(d_j)} is retained for reconstruction; see Fig 1-(b). The proposed feature coding method describes y_j with z_j by solving the following problem:

arg min_{z_j} ||y_j − D_e z_j||_2^2,  s.t. 1^T z_j = 1    (5)

Eqn (5) is a constrained least squares problem, which can be solved efficiently. Denoting Y = [y_j, y_j, · · · , y_j] ∈ R^{M×L}, Eqn (5) can be rewritten as

arg min_{z_j} z_j^T C z_j,  s.t. 1^T z_j = 1    (6)
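Under the definitions above, a minimal sketch of the per-word solver of Eqns (3)–(6) might look as follows. This is an illustrative implementation, not the authors' code; excluding d_j from its own neighbour set and the small regularizer `eps` are assumptions.

```python
import numpy as np

def lhc_local_code(x, D, j, L=5, eps=1e-6):
    """Sketch of Eqns (5)/(6): describe y_j = x - d_j with the edges from
    visual word d_j to its L nearest visual words, under 1^T z_j = 1.

    x: (M,) local feature; D: (M, K) dictionary; j: index of a nearby word.
    Returns the (L,) coefficient vector z_j.
    """
    d_j = D[:, j]
    # L nearest visual words of d_j (d_j itself excluded, since e_jj = 0)
    dists = np.linalg.norm(D - d_j[:, None], axis=0)
    nbrs = np.argsort(dists)[1:L + 1]
    De = D[:, nbrs] - d_j[:, None]     # edges e_jk = d_k - d_j, Eqn (3)
    y_j = x - d_j                      # edge from d_j to the feature, Eqn (4)
    diff = y_j[:, None] - De           # Y - De, with Y = [y_j, ..., y_j]
    C = diff.T @ diff                  # covariance matrix of Eqn (6)
    C += eps * np.trace(C) * np.eye(L) # small regularizer (assumed)
    z_j = np.linalg.solve(C, np.ones(L))  # solve C z_j = 1
    return z_j / z_j.sum()             # rescale to enforce 1^T z_j = 1
```

The solve-then-rescale step mirrors the text: Eqn (6) is minimized by solving the linear system C z_j = 1 and normalizing the result.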
where C is the covariance matrix defined as C = (Y − D_e)^T (Y − D_e). Eqn (6) can be efficiently minimized by solving the linear system C z_j = 1, followed by a rescaling to enforce the sum-to-one constraint. The final coded feature of x is:

β = [w_1 z_1^T, w_2 z_2^T, · · · , w_K z_K^T]^T    (7)

where w_i is a weighting factor indicating the influence of z_i. Here a locality constraint is introduced to ensure that closer visual words contribute more. When visual words are too far away from local feature x, it is justifiable to ignore them and set the corresponding responses to zero:

w_i = exp(−||x − d_i||_2^2 / σ) if i ∈ N_S(x), and w_i = 0 otherwise    (8)

where N_S(x) is the index set of the S nearest visual words of x, and σ is a smoothing factor. Unlike normal coding methods that produce only one response on each visual word, our method produces an L-dimensional vector on each visual word. Thus the final representation of each local feature is KL-dimensional.
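The locality weighting of Eqn (8) and the assembly of β in Eqn (7) can then be sketched as below, assuming the per-word codes z_i (from any solver of Eqn (6)) have been collected as rows of a K × L array. The function names are illustrative; the parameter defaults follow the paper's settings (S = 10, σ = 0.1).

```python
import numpy as np

def lhc_weights(x, D, S=10, sigma=0.1):
    """Locality weights of Eqn (8): the S nearest visual words of x get a
    Gaussian weight exp(-||x - d_i||^2 / sigma); all others are zeroed."""
    dists = np.linalg.norm(D - x[:, None], axis=0)
    w = np.zeros(D.shape[1])
    nn = np.argsort(dists)[:S]
    w[nn] = np.exp(-dists[nn] ** 2 / sigma)
    return w

def assemble_beta(w, Z):
    """Eqn (7): stack the weighted per-word codes into one KL-dim vector.

    w: (K,) weights from Eqn (8); Z: (K, L) array whose row i is z_i^T.
    """
    return (w[:, None] * Z).ravel()
```

Because w_i = 0 outside N_S(x), only S of the K per-word blocks of β are nonzero, which keeps the KL-dimensional code sparse.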
3.3 Properties of Local Hypersphere Coding (LHC)
The proposed feature coding algorithm has three desirable properties:
– Better Reconstruction. As can be seen from Fig 1-(b), reconstruction on the local hypersphere with edges between visual words is closer to coplanar than in traditional coding methods, which naturally yields better reconstruction.
– Distinctive Representation. Several works have considered the distinctiveness of features, such as SIFT, LLC, and salient coding. The proposed coding method produces a distinctive representation in the sense that, by considering both locality and visual word ambiguity, dissimilar features are more easily distinguished from each other. On the one hand, two features with different nearest neighbours will have totally different responses under LHC. On the other hand, we produce similar responses for similar local features and retain most of the important information.
– Efficient Implementation. As the proposed feature coding can be reformulated as a linear system of equations, it is computationally simple and can be solved efficiently. This is especially desirable for large scale problems where efficient feature coding is extremely important. The time cost of each method for processing one image with codebook size 64 is: HQ (0.31s), SQ (0.46s), LHC (3.27s), LLC (5.17s), SC (52.23s).
4
Experiments
In this section, we first introduce the general experimental settings in Section 4.1. Then in the following sections we present the performance of the proposed method on three benchmark datasets: Caltech-101 [15], 15-Scenes [16,17,10] and PASCAL VOC 2007 [18].

4.1 Experimental Settings
For all the experiments we present, a single local feature descriptor, SIFT, is used. The 128-dimensional SIFT descriptors are densely extracted from each image, with a step size of 4 pixels and three scales: 16 × 16, 24 × 24 and 32 × 32. For one image, roughly 10k to 20k SIFT descriptors are extracted. The dictionary is generated using the standard K-means clustering algorithm. For all experiments on the same dataset with the same dictionary size, we used the same dictionary, trained with 2,000,000 patches randomly sampled from the training images. After coding the local features with the different coding methods, the encoded features are pooled over sub-regions with spatial pyramid matching (SPM) [10].
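The pooling step described above can be sketched as a minimal max pooling over SPM sub-regions. The function name and grid-layout argument are illustrative; the 1 × 1, 2 × 2, 4 × 4 default matches the grids used on 15-Scenes and Caltech-101.

```python
import numpy as np

def spm_max_pool(codes, xy, img_w, img_h, levels=((1, 1), (2, 2), (4, 4))):
    """Sketch of SPM [10] with max pooling [5]: pool per-descriptor coding
    responses over each sub-region of every pyramid level and concatenate.

    codes: (N, P) coding responses for N descriptors; xy: (N, 2) descriptor
    centers (x, y) in pixels. Returns a (P * total_cells,) image vector.
    """
    pooled = []
    for gx, gy in levels:
        # grid-cell index of each descriptor at this pyramid level
        cx = np.minimum((xy[:, 0] * gx / img_w).astype(int), gx - 1)
        cy = np.minimum((xy[:, 1] * gy / img_h).astype(int), gy - 1)
        cell = cy * gx + cx
        for c in range(gx * gy):
            mask = cell == c
            pooled.append(codes[mask].max(axis=0) if mask.any()
                          else np.zeros(codes.shape[1]))
    return np.concatenate(pooled)
```

With the default grids the representation is P × (1 + 4 + 16) = 21P dimensional, which is why over-complete dictionaries lead to very high dimensional image vectors.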
We adopted 1 × 1, 2 × 2, 4 × 4 sub-regions for SPM on the 15-Scenes and Caltech-101 datasets, while on PASCAL VOC 2007 we used 1 × 1, 2 × 2, 3 × 1 sub-regions. As the focus of this paper is on coding methods for local feature description, we chose max pooling1 for all the coding methods. Finally, for the classifier we use a linear kernel, as do most recent works in the literature. We utilized the widely used SVM software LIBLINEAR [20] for image classification. To verify the effectiveness of the proposed method, we compare our method with four representative coding methods:
– HQ: the basic hard quantization [3]. We implemented hard quantization ourselves following the original paper.
– SQ: soft quantization [4]. We implemented a "localized" version of soft quantization ourselves following [19], considering only the 5 nearest visual words.
– SC: sparse coding. We use the publicly available source code from ScSPM [5]. The regularization parameter λ in sparse coding is set to 0.15.
– LLC: locality-constrained linear coding. We use the implementation supplied by the authors [6]. The approximated LLC is used throughout the evaluation. The parameter K is set to 5, which means that the 5 nearest visual words are used for reconstruction.
– LHC: the proposed local hypersphere coding method. The parameters S = 10 and L = 5 are used for all the experiments. In Section 4.5 we study the impact of S and L in detail. σ in Eqn (8) is set to 0.1.
This paper focuses on evaluating the effectiveness of the proposed local feature coding method. Rather than comparing to the results published in the literature, we put all methods in the same framework, with the same dictionary, SPM settings and classifier parameters for all experiments, so that the only difference is the coding method.

4.2 Caltech-101
The Caltech-101 dataset [15] is composed of 101 object classes and a background class. There are 9144 images in total, and the number of images per class ranges from 31 to 800. Following the literature, we randomly sampled a fixed number of images from each class for training and used the remaining images for testing. For each experimental setting, we repeated 5 times with different splits of the training and testing sets, and report the mean and standard deviation as the final result. Table 1 shows the experimental results on Caltech-101 with 30 training images per class, under dictionary sizes of 64, 128, 256, 512 and 1024. As shown, our coding method achieves the best performance among the listed methods. The performance differences become smaller as a larger dictionary is used, which is consistent with what we observed on 15-Scenes. We also compare our results with several published results in Table 2.

1 Theoretical analysis [19] and experimental results show that max pooling generally gives higher performance than sum pooling and average pooling.
Table 1. Classification rate (%) comparison on Caltech-101 with 30 training images per class using different dictionary sizes

K      64          128         256         512         1024
HQ     57.8 ± 0.7  64.9 ± 0.6  69.2 ± 0.4  71.1 ± 0.8  73.1 ± 1.1
SQ     59.7 ± 0.5  66.8 ± 0.9  70.5 ± 1.1  73.9 ± 0.4  75.1 ± 0.8
SC     63.0 ± 1.5  67.3 ± 1.0  69.4 ± 0.5  72.4 ± 0.9  73.7 ± 1.0
LLC    63.7 ± 0.9  66.5 ± 1.5  69.5 ± 1.4  70.2 ± 0.7  71.3 ± 0.6
LHC    69.0 ± 0.8  71.2 ± 1.3  73.5 ± 1.3  74.9 ± 0.9  75.5 ± 1.2
Fig. 2. The mean classification accuracy and standard deviation of different coding methods on Caltech-101, using different numbers of training samples. We choose 1, 3, 5, 10, 15, 20 and 30 training samples per class and use the rest for testing. The dictionary sizes used are 64, 128, 256 and 512, respectively. For each setting, we repeat 5 times with randomly split training and testing sets. (Please view in color.)
For both 15 and 30 training images, our proposed coding method achieves the best performance. Note that for [21] we list the result of the linear kernel, since we do not use non-linear kernels in our experiments.

Table 2. Classification rate (%) comparison on Caltech-101 with results from the literature

Algorithms            15 training    30 training
Jain et al. [22]      61.0           69.6
ScSPM [5]             67.0 ± 0.45    73.2 ± 0.54
LLC [6]               65.43          73.44
Boureau et al. [21]   -              75.1 ± 0.9
LHC                   68.44 ± 0.54   75.49 ± 1.24
To see the influence of the number of training images on performance, we report results with 1, 3, 5, 10, 15, 20 and 30 training images per class, using dictionary sizes K = 64, 128, 256 and 512. The experimental results are presented in Fig 2. When the dictionary size is K = 64, the proposed method clearly outperforms the other coding methods by a large margin. As more images are used for training, the performance keeps growing; this is common to all coding methods. We can also see that a larger dictionary always yields better performance, while the performance differences between the coding methods become smaller. This is natural, as more visual words are used for reconstruction.

4.3 15-Scenes
The 15-Scenes dataset [10] is an expansion of previously published datasets [16,17]. It consists of 4485 images from 15 categories of natural and indoor scenes. Each category contains 200 to 400 images with a resolution of about 300 × 250 pixels. Following the standard evaluation protocol in the literature, we randomly sampled 100 images per category for training and used the remaining images for testing. Each setting was repeated 5 times with different splits of the training and testing sets, and we report the mean and standard deviation as the final result in Table 3. For each dictionary size, we present the performance of HQ, SQ, SC, LLC, and the proposed coding method. As shown, the proposed method significantly outperforms the other coding methods. Especially when the dictionary size is small, our method outperforms HQ by more than 11 percent, SQ by more than 8 percent, SC by more than 5 percent, and LLC by 3.9 percent. This remarkable improvement shows the discriminative power of the proposed method. We again note that as the dictionary size grows, the performance differences between the coding methods become smaller. Nevertheless, our coding method still outperforms the best of the four other methods by 1.4 percent.
Table 3. Classification rate (%) comparison on 15-Scenes with different dictionary sizes

K      32          64          128         256         512         1024
HQ     62.2 ± 0.8  69.4 ± 0.4  74.1 ± 0.3  76.4 ± 0.7  78.5 ± 0.3  79.9 ± 0.4
SQ     65.3 ± 0.4  71.6 ± 0.8  75.4 ± 0.7  77.4 ± 0.3  79.8 ± 0.3  80.4 ± 0.6
SC     68.1 ± 0.4  71.8 ± 0.7  74.6 ± 0.2  77.5 ± 0.2  78.9 ± 0.5  80.7 ± 1.1
LLC    70.0 ± 0.7  73.1 ± 0.1  75.1 ± 0.6  76.4 ± 0.5  78.0 ± 0.4  78.6 ± 0.3
LHC    73.9 ± 0.7  76.1 ± 0.4  78.1 ± 0.7  80.2 ± 0.5  80.1 ± 0.3  82.1 ± 0.4
4.4
PASCAL VOC 2007
Table 4. Comparison of different coding methods on PASCAL VOC 2007, with dictionary size 1024

AP(%)         HQ     SQ     SC     LLC    LHC    Improvement
aeroplane     63.87  66.64  68.93  67.15  69.68  +0.75
bicycle       46.69  51.46  53.98  52.76  56.65  +2.67
bird          30.06  34.28  36.45  34.51  42.25  +5.80
boat          57.69  61.08  61.09  61.97  62.68  +0.71
bottle        13.91  16.87  21.06  19.50  22.00  +0.94
bus           43.42  49.27  51.25  46.95  58.26  +7.01
car           68.78  71.57  72.78  72.15  76.13  +3.35
cat           46.47  50.60  45.12  45.78  52.84  +2.24
chair         43.35  47.42  49.75  47.54  50.16  +0.41
cow           35.10  36.16  33.60  35.60  39.91  +3.75
diningtable   31.39  34.03  36.99  35.96  38.63  +1.64
dog           32.56  35.50  36.19  32.45  38.05  +1.86
horse         67.29  69.20  73.17  71.80  73.92  +0.75
motorbike     48.57  53.06  52.03  51.91  58.45  +5.39
person        77.35  78.82  78.21  77.96  81.19  +2.37
pottedplant   14.61  17.45  15.95  15.42  18.75  +1.30
sheep         32.82  34.99  35.67  37.21  38.80  +1.59
sofa          42.76  45.91  43.40  42.68  49.02  +3.11
train         65.63  68.37  63.55  67.83  70.87  +2.50
tvmonitor     41.70  46.78  47.56  45.88  48.68  +1.12
average       45.20  48.47  48.84  48.15  52.35  +2.46
We also evaluate our algorithm on PASCAL VOC 2007 [18], which consists of 9963 images from 20 classes. The dataset is divided into three subsets: a training set with 2501 images, a validation set with 2510 images, and a testing set with 4952 images. PASCAL VOC 2007 is one of the most challenging benchmark datasets for classification, as there is large intra-class divergence, including variation in scale, illumination, viewpoint and deformation, as well as severe
object occlusions. The classification performance is measured by average precision (AP), the area under the precision/recall curve; this measure favours both high precision and high recall. We used both the training set and the validation set for training, and report the average precision for each class on the testing set. Table 4 shows the 20 scores obtained by our method as well as the other coding methods with dictionary size 1024. As shown, the proposed coding method achieves a 52.35% mean AP score, significantly outperforming the other methods by 3.5% to more than 7%. For per-class AP, the proposed method performs better than the other four methods on all 20 classes. The last column of Table 4 presents the improvement of the proposed coding method over the best of the four other coding methods: we achieve an improvement of 2.46% on average, and three classes improve by more than 5 percent (bird: +5.8%, bus: +7.0% and motorbike: +5.4%).

4.5 The Impact of S and L on the Performance
There are two important nearest-neighbour parameters in the proposed method: L from Eqn (5) and S from Eqn (8). L controls the number of bases used for reconstructing the edge from a visual word to the local feature, and directly affects the dimensionality of the final representation. S dominates the contribution of each visual word to the local feature being described, and also controls the sparsity of the encoded feature. In this section, we study the impact of S and L on the final classification accuracy of the proposed method. We carry out experiments with a small dictionary size of 32. For each combination of S and L, we repeat the experiments 5 times and report the mean and standard deviation. The results are shown in Table 5.

Table 5. The impact of S and L on the performance on 15-Scenes, with dictionary size 32
S \ L   5           10          15          20
5       73.5 ± 0.7  74.3 ± 0.7  73.1 ± 0.6  74.7 ± 0.6
10      73.9 ± 0.7  73.9 ± 0.9  73.5 ± 0.2  74.2 ± 0.6
15      73.4 ± 0.7  73.9 ± 0.5  74.2 ± 1.1  74.3 ± 0.8
20      73.6 ± 0.8  73.9 ± 0.6  73.4 ± 0.5  74.8 ± 0.7
When the parameters S and L change, the classification accuracy changes only slightly, indicating that the proposed coding method is not sensitive to these two parameters. In practice, choosing a small L is sufficient and more efficient, since a larger L means a higher dimensional feature vector. This is why we fix S = 10 and L = 5 in our experiments.
5
Conclusion
In this paper, we have analyzed traditional reconstruction based coding methods in the bag-of-words model and shown that reconstruction on the hypersphere performs better. Based on this observation and analysis, we have proposed a new local feature coding method called local hypersphere coding (LHC). It performs feature coding on the hypersphere of the feature space and describes a local feature with the edges between visual words. Experiments on three benchmark datasets show that the proposed coding method significantly outperforms other reconstruction based coding methods, indicating its effectiveness. The proposed coding scheme is general and can be extended easily. Coding of a local feature x starts by calculating the directed edge y_j from visual word d_j to x (see Fig 1-(b)), followed by a decomposition of y_j over the edges D_e = {e_jk | k ∈ N_L(d_j)} between visual words. The second step is a reconstruction of y_j with D_e, where any other reconstruction based regularization can be added; in fact, some existing coding methods could be adopted here to perform the decomposition. In future work, we will study other coding methods under the proposed reconstruction scheme, with the goal of obtaining more compact and discriminative image representations.

Acknowledgement. This work is funded by the National Natural Science Foundation of China (Grant No. 61175007), the National Basic Research Program of China (Grant No. 2012CB316302), and the National Key Technology R&D Program (Grant No. 2012BAH07B01).
References
1. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60, 91–110 (2004)
2. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 1, pp. 886–893 (2005)
3. Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, ECCV, pp. 1–22 (2004)
4. van Gemert, J.C., Geusebroek, J.M., Veenman, C.J., Smeulders, A.W.M.: Kernel Codebooks for Scene Categorization. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 696–709. Springer, Heidelberg (2008)
5. Jianchao, Y., Kai, Y., Yihong, G., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 1794–1801 (2009)
6. Jinjun, W., Jianchao, Y., Kai, Y., Fengjun, L., Huang, T., Yihong, G.: Locality-constrained linear coding for image classification. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 3360–3367 (2010)
7. Zhou, X., Yu, K., Zhang, T., Huang, T.S.: Image Classification Using Super-Vector Coding of Local Image Descriptors. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 141–154. Springer, Heidelberg (2010)
8. Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1–8 (2007)
9. Perronnin, F., Sánchez, J., Mensink, T.: Improving the Fisher Kernel for Large-Scale Image Classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 143–156. Springer, Heidelberg (2010)
10. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2169–2178 (2006)
11. Olshausen, B.A., Field, D.J.: Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research 37, 3311–3325 (1997)
12. Huang, Y., Huang, K., Yu, Y., Tan, T.: Salient coding for image classification. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1753–1760. IEEE (2011)
13. Huang, Y., Huang, K., Wang, C., Tan, T.: Exploring relations of visual codes for image classification. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1649–1656. IEEE (2011)
14. Zhao, X., Yu, Y., Huang, Y., Huang, K., Tan, T.: Feature coding via vector difference for image classification. In: IEEE International Conference on Image Processing, ICIP (2012)
15. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples. In: Workshop on Generative-Model Based Vision, IEEE Proc. CVPR (2004)
16. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision 42, 145–175 (2001)
17. Fei-Fei, L., Perona, P.: A bayesian hierarchical model for learning natural scene categories. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 2, pp. 524–531 (2005)
18. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 (VOC 2007) Results (2007)
19. Liu, L., Wang, L., Liu, X.: In defense of soft-assignment coding. In: 2011 IEEE International Conference on Computer Vision, ICCV, pp. 2486–2493 (2011)
20. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
21. Boureau, Y.L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 2559–2566 (2010)
22. Jain, P., Kulis, B., Grauman, K.: Fast image search for learned metrics. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008, pp. 1–8 (2008)