Combining Descriptors Extracted from Feature Maps of Deconvolutional Networks and SIFT Descriptors in Scene Image Classification

Dung A. Doan¹, Ngoc-Trung Tran¹, Dinh-Phong Vo², Bac Le¹, and Atsuo Yoshitaka³

¹ University of Science, 227 Nguyen Van Cu Street, District 5, Ho Chi Minh City, Viet Nam
² C29, TSI Department, Telecom ParisTech, 75013, Paris, France
³ Japan Advanced Institute of Science and Technology, Japan
anhdungnt91@gmail.com, tntrung@fit.hcmus.edu.vn, vo@enst.fr, lhbac@fit.hcmus.edu.vn, ayoshi@jaist.ac.jp
Abstract. This paper presents a new method to combine descriptors extracted from feature maps of Deconvolutional Networks with SIFT descriptors by converting both into histograms of local patterns, so that concatenation can be applied while increasing the classification rate. We use the K-means clustering algorithm to construct codebooks and compute spatial histograms to represent the distribution of local patterns in an image. Consequently, we can concatenate these histograms to make a new one that represents more local patterns than the originals. In the classification step, an SVM with the Histogram Intersection Kernel is utilized. In experiments on the Scene-15 Dataset, which contains 15 categories, the classification rate of our method is around 84%, outperforming Reconfigurable Bag-of-Words (RBoW), Sparse Covariance Patterns (SCP), Spatial Pyramid Matching (SPM), Spatial Pyramid Matching using Sparse Coding (ScSPM) and Visual Word Reweighting (VWR).

Keywords: Scene image classification, Deconvolutional Networks, Bag-of-Words model, Spatial Pyramid Matching.
1 Introduction
Scene classification is an essential and challenging open problem in computer vision with many applications, for example content-based image retrieval, automatic labeling of images, and grouping images by given keywords. Natural properties of images such as ambiguity, illumination and scaling make scene classification difficult, and many approaches have been proposed to overcome these challenges.
Early works on scene classification extract appearance features (color, texture, power spectrum, etc.) [1][2][3] and use dissimilarity measures [4][5] to distinguish scene categories, but they are only applicable when classifying images into a small number of categories, such as indoor/outdoor or human-made/natural. In 2006, Svetlana Lazebnik et al. [6] presented Spatial Pyramid Matching (SPM), a remarkable extension of bag-of-features (BoF). SPM exploits descriptors inside each local patch, partitions the image into segments, and computes histograms of these local features within each segment. The SPM framework then uses the histogram intersection kernel with a Support Vector Machine (SVM) to classify images. According to Lazebnik [6], SPM achieves high performance when using SIFT descriptors [7] or "gist" [8], and the method is also a major component of state-of-the-art systems [9].

In 2010, Matthew Zeiler et al. [10] proposed Deconvolutional Networks (DN), which reconstruct images while maintaining stable latent representations and local information such as edge intersections, parallelism and symmetry. More specifically, DN is built principally on the convolution operator, so the networks support grouping behavior and pooling operations on feature maps to extract descriptors (DN descriptors, for short). However, because edges in scene images are complex and DN cannot control the shape of its learned filters, it is difficult for DN to represent edge information comprehensively. Figure 1 shows some scene images poorly reconstructed by a 1-layer DN. One way to overcome this disadvantage of DN is to use SIFT descriptors, because SIFT features provide directional information that makes edge representation more powerful. Nevertheless, DN and SIFT descriptors are not the same representation, so naively combining them does not guarantee better performance.

In this paper, we propose a new method that converts DN and SIFT descriptors into histograms of local patterns, which can then be concatenated to produce a new histogram for the classification step. Specifically, we first use the K-means clustering algorithm to construct two codebooks; each word in these codebooks corresponds to a local pattern. Two spatial histograms are then built to represent the distribution of local patterns in an image. After that, histogram concatenation is carried out to make a new histogram that represents more local patterns than the originals. Finally, an SVM with the Histogram Intersection Kernel [6] is used to assign images their appropriate labels. Our way of representing local patterns is similar to bag-of-features [11][12]; note, however, that to improve performance we construct Spatial Pyramid Histograms following the approach of Lazebnik [6]. To evaluate our method, we ran experiments on a large database of fifteen natural scene categories [6] and obtained a significant improvement in accuracy over several recent methods.

The paper is organized as follows: Section 2 introduces recent related work. Section 3 describes our method in detail. Our results on the Scene-15 Dataset are presented in Section 4. Finally, Section 5 gives conclusions, discussion and future work.
Fig. 1. Original images and the corresponding images reconstructed by the first layer of Deconvolutional Networks
2 Related Work

2.1 Deconvolutional Networks
Deconvolutional Networks, proposed by Matthew Zeiler et al. [10] in 2010, consist only of a decoder that tries to generate feature maps whose reconstruction is close to the original image. More specifically, given an input image and learned filters, DN sums the values generated by convolving feature maps with filters in each layer, so that these values approximate the image. A sparseness constraint on the feature maps encourages economical representation at each level of the hierarchy, so more complex, higher-level features emerge naturally. With both sparsity and convolution, DN preserves locality, mid-level concepts and basic geometric elements, which opens the way for pooling operations and grouping behavior to extract descriptors. In practice, DN descriptors are particularly successful when applied to object recognition and image denoising.
2.2 Image Classification by Spatial Pyramid Matching
Motivated by Grauman and Darrell's pyramid matching method [13], Spatial Pyramid Matching (SPM) was proposed by Svetlana Lazebnik [6] in 2006. First, SPM extracts local descriptors (for example, SIFT descriptors) inside each subregion of the image, quantizes these descriptors into vectors, and then performs K-means to construct a dictionary. Second, with the constructed dictionary, SPM computes histograms of local descriptors, and these histograms are multiplied by appropriate weights at each increasing resolution. Finally, all the histograms are put together to form a pyramid match kernel for the classification step.
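As a concrete illustration, the following minimal sketch (in Python with NumPy; the paper ships no code, and all function and variable names here are our own) computes a weighted spatial pyramid histogram of the kind just described, assuming each local descriptor has already been quantized to a codeword index and has a known image position.

```python
import numpy as np

def spatial_pyramid_histogram(codes, xy, img_w, img_h, vocab_size, levels=3):
    """Weighted spatial pyramid histogram in the style of SPM [6].

    codes : (n,) int array, codeword index of each local descriptor
    xy    : (n, 2) float array, pixel coordinates (x, y) of each descriptor
    """
    parts = []
    L = levels - 1
    for level in range(levels):
        cells = 2 ** level                       # grid cells per axis
        # SPM weights: 1/2^L at level 0, 1/2^(L - level + 1) for level >= 1
        weight = 1.0 / 2 ** L if level == 0 else 1.0 / 2 ** (L - level + 1)
        cx = np.minimum((xy[:, 0] * cells // img_w).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] * cells // img_h).astype(int), cells - 1)
        for i in range(cells):
            for j in range(cells):
                in_cell = (cx == i) & (cy == j)
                hist = np.bincount(codes[in_cell], minlength=vocab_size)
                parts.append(weight * hist)
    pyramid = np.concatenate(parts)
    return pyramid / max(pyramid.sum(), 1e-12)   # L1-normalize
```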
3 The Proposed Method

3.1 Descriptor Extraction
Deconvolutional Networks Descriptors: To train the filters of DN, we are given a set of $I_u$ unlabeled images $y_u^{(1)}, y_u^{(2)}, y_u^{(3)}, \ldots, y_u^{(I_u)}$. With $K_0$ denoting the number of color channels, the cost function for the first layer is:

$$C_1(y_u) = \frac{\lambda_T}{2} \sum_{i=1}^{I_u} \sum_{c=1}^{K_0} \Big\| \sum_{k=1}^{K_1} z_{k,1}^{(i)} \oplus f_{k,c}^{1} - y_{u,c}^{(i)} \Big\|_2^2 + \sum_{i=1}^{I_u} \sum_{k=1}^{K_1} \big| z_{k,1}^{(i)} \big|^p \qquad (1)$$

where $z_{k,l}^{(i)}$ and $f_{k,c}^{l}$ are, respectively, feature map $k$ and filter $k$ (for channel $c$) of layer $l$, and $K_l$ indicates the number of feature maps in layer $l$; thus $l = 0$ and $l = 1$ in equation (1). $\lambda_T$ is a constant that balances the contribution of the reconstruction of $y_{u,c}^{(i)}$ against the sparsity of $z_{k,1}^{(i)}$. Following Matthew Zeiler [10], we can also form a hierarchy in which the feature maps of layer $l$ become the inputs of layer $l+1$.

In the reconstruction step, given the learned filters $f_{k,c}^{1}$ and a scene image $y^{scene}$, we infer $z_{k,1}$ by minimizing the reconstruction error:

$$\min_{z_{k,1}} \; \frac{\lambda_R}{2} \sum_{c=1}^{K_0} \Big\| \sum_{k=1}^{K_1} z_{k,1} \oplus f_{k,c}^{1} - y_c^{scene} \Big\|_2^2 + \sum_{k=1}^{K_1} \big| z_{k,1} \big|^p$$

Feature maps in higher layers can also be inferred by following [10]. Each feature map $z_{k,1}$ is split into overlapping $p_1 \times p_1$ patches with a spacing of $s_1$ pixels; each patch is pooled and then grouped to give the local descriptors.

SIFT Descriptors: In our experiments we observed that, because edges in scene images are very complicated, the feature maps $z$ of a 1-layer DN are not sufficient (Figure 1). Therefore, we utilize SIFT descriptors to support edge representation in the scene recognition problem. Concretely, we densely extract local SIFT descriptors in overlapping $p_2 \times p_2$ patches at a stride of $s_2$ pixels.
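For clarity, here is a minimal sketch of how the reconstruction objective above could be evaluated for a single image, using SciPy's 2-D convolution. This only evaluates the cost; it is not the inference procedure of [10], which iteratively minimizes it. Array shapes and names are our own assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def dn_layer1_cost(y, z, f, lam=10.0, p=1):
    """Value of the first-layer DN objective for one image.

    y : (K0, H, W)          image channels
    z : (K1, H+h-1, W+w-1)  feature maps (larger than the image, as in [10],
                            so a 'valid' convolution matches y's size)
    f : (K1, K0, h, w)      filters f^1_{k,c}
    """
    K0, K1 = y.shape[0], z.shape[0]
    recon_err = 0.0
    for c in range(K0):
        # sum over k of z_{k,1} convolved with f^1_{k,c}
        recon = sum(convolve2d(z[k], f[k, c], mode='valid') for k in range(K1))
        recon_err += np.sum((recon - y[c]) ** 2)
    sparsity = np.sum(np.abs(z) ** p)   # |z|^p sparsity term
    return 0.5 * lam * recon_err + sparsity
```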
3.2 Building Histograms

Given a set of SIFT descriptors $X_{SIFT} = [x_{SIFT}^{(1)}, x_{SIFT}^{(2)}, \ldots, x_{SIFT}^{(N)}]^T \in \mathbb{R}^{N \times 128}$ and DN descriptors $X_{DN} = [x_{DN}^{(1)}, x_{DN}^{(2)}, \ldots, x_{DN}^{(M)}]^T \in \mathbb{R}^{M \times D}$, where $x_{SIFT}^{(i)}$ and $x_{DN}^{(i)}$ lie in 128- and $D$-dimensional feature spaces respectively, and with $B_{SIFT}$ and $B_{DN}$ being the codebooks of SIFT and DN descriptors, the K-means method is applied to minimize the following cost function:

$$\min_{\substack{V_{SIFT},\, B_{SIFT} \\ V_{DN},\, B_{DN}}} \; \sum_{i=1}^{N} \big\| x_{SIFT}^{(i)} - (B_{SIFT})^T v_{SIFT}^{(i)} \big\|_2^2 + \sum_{i=1}^{M} \big\| x_{DN}^{(i)} - (B_{DN})^T v_{DN}^{(i)} \big\|_2^2 \qquad (2)$$

subject to: $\|v_{SIFT}^{(i)}\|_0 = 1$, $\|v_{SIFT}^{(i)}\|_1 = 1$, $v_{SIFT,j}^{(i)} \geq 0 \;\forall i,j$; and $\|v_{DN}^{(i)}\|_0 = 1$, $\|v_{DN}^{(i)}\|_1 = 1$, $v_{DN,j}^{(i)} \geq 0 \;\forall i,j$,

where $v_{SIFT,j}^{(i)}$ and $v_{DN,j}^{(i)}$ are the elements of the vectors $v_{SIFT}^{(i)}$ and $v_{DN}^{(i)}$ respectively, and $V_{SIFT} = [v_{SIFT}^{(1)}, v_{SIFT}^{(2)}, \ldots, v_{SIFT}^{(N)}]^T$ and $V_{DN} = [v_{DN}^{(1)}, v_{DN}^{(2)}, \ldots, v_{DN}^{(M)}]^T$ are the index vectors. In the training phase, we minimize cost function (2) with respect to $B_{DN}$, $B_{SIFT}$, $V_{DN}$ and $V_{SIFT}$; in the coding phase, with the learned $B_{DN}$ and $B_{SIFT}$, we minimize (2) only with respect to $V_{DN}$ and $V_{SIFT}$. After obtaining $V_{DN}$ and $V_{SIFT}$, we compute the histograms:

$$H_{SIFT} = \frac{1}{N} \sum_{i=1}^{N} v_{SIFT}^{(i)}, \qquad H_{DN} = \frac{1}{M} \sum_{i=1}^{M} v_{DN}^{(i)}$$
To further improve performance, Spatial Pyramid Histograms are built by following the SPM approach [6].
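Since the constraints in (2) force each index vector to have exactly one nonzero entry equal to 1, minimizing (2) amounts to ordinary K-means with hard assignment. A minimal sketch using scikit-learn (our choice of library; the paper does not specify an implementation, and the helper names are ours):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(X, size, seed=0):
    """Training phase of eq. (2): learn a codebook B (one row per codeword)."""
    return KMeans(n_clusters=size, random_state=seed).fit(X)

def pattern_histogram(kmeans, X):
    """Coding phase plus histogram: H = (1/N) * sum_i v^(i), hard assignment."""
    idx = kmeans.predict(X)                          # nearest codeword per descriptor
    hist = np.bincount(idx, minlength=kmeans.n_clusters)
    return hist / max(len(X), 1)

# Separate codebooks for the two descriptor types (sizes are illustrative):
# B_sift = build_codebook(train_sift_descriptors, 1000)
# B_dn   = build_codebook(train_dn_descriptors, 1000)
# H_sift = pattern_histogram(B_sift, image_sift_descriptors)
# H_dn   = pattern_histogram(B_dn, image_dn_descriptors)
```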
3.3 Image Classification
At this stage, the DN and SIFT descriptors have been brought into the same representation, so the following equation can be applied:

$$H = H_{SIFT} \,\|\, H_{DN}$$

where $\|$ denotes the concatenation operation. The new histogram $H$, which is fed into an SVM with the Histogram Intersection Kernel [6], represents more local patterns than $H_{SIFT}$ and $H_{DN}$ alone. Figure 2 shows all steps of our method.
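A minimal sketch of this classification step, again with scikit-learn as our implementation choice: the histogram intersection kernel is precomputed and passed to the SVM.

```python
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(A, B):
    """Histogram Intersection Kernel: K(a, b) = sum_j min(a_j, b_j).
    A: (n, d), B: (m, d) -> (n, m). Builds an (n, m, d) temporary,
    so it is only suitable for modest dataset sizes."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

# H_train, H_test: rows are concatenated histograms H = H_SIFT || H_DN
# clf = SVC(kernel='precomputed').fit(intersection_kernel(H_train, H_train), y_train)
# y_pred = clf.predict(intersection_kernel(H_test, H_train))
```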
4 Experiments
We adopt the Scene-15 Dataset [6], which contains 15 categories (office, kitchen, living room, mountain, etc.). Each category has from 200 to 400 images.
Fig. 2. The steps of our method. After extracting DN and SIFT descriptors, K-means is applied to build two codebooks. Then the distribution of local patterns is represented in the Histogram Building stage. Histogram Concatenation is carried out to produce a new histogram that represents more local patterns. Finally, an SVM with the Histogram Intersection Kernel is used to classify images.
The images have an average size of 300×250 pixels; the major image sources are the COREL collection, Google image search and personal photographs. Example images from the Scene-15 Dataset are shown in Figure 5. In our experiments, all images are converted to gray-scale and contrast-normalized before being given to the DN. We train the 8 feature maps of a 1-layer DN using only 20 images (10 fruit and 10 city images); the Scene-15 images are used only to train the supervised classifier. Specifically, we follow Lazebnik's experimental setup for the Scene-15 Dataset [6], training on 100 images per class and testing on the rest. In the classification step, the multi-class SVM rule is 1-vs-All: an SVM classifier is trained to separate each class from the rest, and a test image is assigned the label with the highest response.
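The 1-vs-All rule above can be sketched as follows (scikit-learn, precomputed kernel; the function name is ours), training one binary SVM per class and taking the label with the largest decision value:

```python
import numpy as np
from sklearn.svm import SVC

def one_vs_all_predict(K_train, K_test, y_train, C=1.0):
    """1-vs-All multi-class rule with a precomputed (e.g. intersection) kernel."""
    classes = np.unique(y_train)
    scores = np.empty((K_test.shape[0], len(classes)))
    for j, c in enumerate(classes):
        # binary SVM: class c versus the rest
        svm = SVC(kernel='precomputed', C=C)
        svm.fit(K_train, (y_train == c).astype(int))
        scores[:, j] = svm.decision_function(K_test)
    return classes[np.argmax(scores, axis=1)]   # label with the highest response
```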
4.1 Codebook Size
The codebook size controls a trade-off between discriminative power and generalization. Concretely, a small codebook can lack discriminative power, since dissimilar features may be assigned to the same cluster/local pattern. On the other hand, a large codebook is more discriminative but less generalizable and less tolerant to noise, because similar features may be mapped to different local patterns. In this experiment, we therefore survey how the sizes of the DN and SIFT codebooks affect the classification rate, as sketched below. For the survey of SIFT codebook size, we keep the DN codebook fixed at 200 words and gradually increase the SIFT codebook size from 50 to 2000. Similarly, for the survey of DN codebook size, the SIFT codebook is fixed at 200 words and the DN codebook size is raised from 50 to 2000. Note that spatial pyramid histograms are not used in this experiment; detailed results appear in Figure 3. In both cases, as the dictionary size increases from 50 to 1000 the performance rises rapidly and reaches a peak; increasing the dictionary size further causes the classification rate to decrease gradually.
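A sketch of this survey loop, reusing the hypothetical helpers from the sketch in Section 3.2 (the sweep values and the `train_and_score` evaluation routine are our own stand-ins):

```python
# Survey of SIFT codebook size with the DN codebook fixed at 200 words.
B_dn = build_codebook(train_dn_descriptors, 200)
for size in [50, 100, 200, 500, 1000, 1500, 2000]:
    B_sift = build_codebook(train_sift_descriptors, size)
    accuracy = train_and_score(B_sift, B_dn)   # hypothetical: histograms + SVM
    print(size, accuracy)
```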
Fig. 3. The classification rates at different sizes of the DN and SIFT codebooks
4.2 Histogram Combination and Naive Combination
In this experiment, we compare histogram combination with naive combination. Concretely, in naive combination the following equation is first applied to the DN and dense SIFT descriptors:

$$x = x_{DN} \,\|\, x_{SIFT} \qquad (3)$$

where $x_{DN}$ and $x_{SIFT}$ denote DN and dense SIFT descriptors respectively, and $\|$ denotes concatenation. A codebook is then constructed using K-means, and local pyramid histograms are built in each sub-region of the image at increasing resolutions. Finally, an SVM with the Histogram Intersection Kernel is used to classify images. The steps are illustrated in Figure 4.
Fig. 4. In naive combination, the DN and SIFT descriptors are concatenated first. Then a codebook is constructed using K-means, and the distribution of local patterns is represented in Histogram Building. Finally, an SVM with the Histogram Intersection Kernel is used in the classification step.
In the naive combination approach, the parameters we use are $K_1 = 8$, $p_1 = 16$, $s_1 = 8$, $p_2 = 16$, $s_2 = 8$, $\lambda_T = \lambda_R = 10$, and the images are resized to 150×150 before extracting dense SIFT descriptors so that equation (3) can be performed easily, as in the sketch below. Experiments are conducted 5 times with different randomly selected training and testing images, and the mean and standard deviation are calculated.
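For equation (3) to make sense, each DN descriptor must be paired with the SIFT descriptor of the same patch; with both patch sizes equal to 16 and both strides equal to 8 on 150×150 images, the two sampling grids align and descriptors can be concatenated row-wise. A minimal sketch (names are ours):

```python
import numpy as np

def naive_combine(x_dn, x_sift):
    """Eq. (3): per-patch concatenation x = x_DN || x_SIFT.

    x_dn  : (n, D)   DN descriptors on a 16x16 / stride-8 grid
    x_sift: (n, 128) SIFT descriptors on the same grid (row i = same patch)
    """
    assert x_dn.shape[0] == x_sift.shape[0], "descriptor grids must align"
    return np.hstack([x_dn, x_sift])
```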
Method                  Classification rate (%)
DN descriptors          75.3 ± 0.9
SIFT descriptors        81.5 ± 0.3
Naive combination       72.8 ± 0.9
Histogram combination   84.3 ± 0.2

Table 1. Histogram combination compared with naive combination, DN descriptors and SIFT descriptors.
The mean and standard deviation results are shown in Table 1, where histogram combination is compared not only with the naive combination approach but also with DN and SIFT descriptors individually. As Table 1 shows, the proposed histogram combination outperforms the others.
4.3 Comparison with Other Methods
In this experiment, we compare the performance of our method with other recent methods; the parameters we use are $K_1 = 8$, $p_1 = 16$, $s_1 = 2$, $p_2 = 16$, $s_2 = 8$, $\lambda_T = \lambda_R = 10$. The experiment is repeated 10 times with different randomly selected training and testing images, and the final results are reported as the mean and standard deviation of the recognition rates. Our results, using 3-fold cross-validation in SVM training, are shown in Table 2. As shown, our method outperforms several recent methods.
Method        Classification rate (%)   Year
SPM [6]       81.4 ± 0.5                2006
ScSPM [14]    80.3 ± 0.9                2009
VWR [15]      83.0 ± 0.2                2011
RBoW [16]     78.6 ± 0.7                2012
SCP [17]      80.4 ± 0.5                2012
Our method    84.4 ± 0.4

Table 2. Classification rate (%) comparison on the Scene-15 Dataset.
5 Conclusion, Discussion and Future Works
Motivated by our observations and experiments, we find that because edges in scene images are very complicated and the learned filters of Deconvolutional Networks cannot be controlled, the feature maps of DN are not sufficient to represent edge information comprehensively. In this paper we proposed a new method that uses SIFT descriptors to support edge representation: both DN and SIFT descriptors are converted into histograms of local patterns, so these histograms can be concatenated into a new one that represents more local patterns than the originals. Consequently, our method makes the data for the classification step more discriminative, and experimental results on the Scene-15 Dataset show that it performs better than the recent methods considered. However, our method has some disadvantages: real-time application remains difficult because processing is slow, and using K-means for codebook construction is too restrictive in that each sample is assigned to only one local pattern. In future work, we would like to make the codebook construction step more flexible than K-means, consider supervised codebook construction, and implement a practical application.
Fig. 5. Some example images of Scene-15 Dataset
References

1. Faloutsos, C., Barber, R., Flickner, M., Hafner, J., Niblack, W., Petkovic, D., Equitz, W.: Efficient and effective querying by image content. Journal of Intelligent Information Systems (1994)
2. Hampapur, A., Gupta, A., Horowitz, B., Shu, C., Fuller, C., Bach, J., Gorkani, M., Jain, R.: Virage video engine. In: Electronic Imaging '97, International Society for Optics and Photonics (1997)
3. Ma, W., Manjunath, B.: NeTra: A toolbox for navigating large image databases. In: Proceedings of the International Conference on Image Processing (1997)
4. Puzicha, J., Buhmann, J., Rubner, Y., Tomasi, C.: Empirical evaluation of dissimilarity measures for color and texture. In: Proceedings of the Seventh IEEE International Conference on Computer Vision (1999)
5. Santini, S., Jain, R.: Similarity measures. IEEE Transactions on Pattern Analysis and Machine Intelligence (1999)
6. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2006)
7. Lowe, D.: Towards a computational model for object recognition in IT cortex. In: Biologically Motivated Computer Vision (2000)
8. Torralba, A., Murphy, K., Freeman, W., Rubin, M.: Context-based vision system for place and object recognition. In: Proceedings of the Ninth IEEE International Conference on Computer Vision (2003)
9. Bosch, A., Zisserman, A., Muñoz, X.: Image classification using random forests and ferns. In: IEEE 11th International Conference on Computer Vision (2007)
10. Zeiler, M., Krishnan, D., Taylor, G., Fergus, R.: Deconvolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition (2010)
11. Yang, J., Jiang, Y., Hauptmann, A., Ngo, C.: Evaluating bag-of-visual-words representations in scene classification. In: Proceedings of the International Workshop on Multimedia Information Retrieval (2007)
12. Jiang, Y., Yang, J., Ngo, C., Hauptmann, A.: Representations of keypoint-based semantic concept detection: A comprehensive study. IEEE Transactions on Multimedia (2010)
13. Grauman, K., Darrell, T.: The pyramid match kernel: Discriminative classification with sets of image features. In: Tenth IEEE International Conference on Computer Vision (2005)
14. Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (2009)
15. Zhang, C., Liu, J., Wang, J., Tian, Q., Xu, C., Lu, H., Ma, S.: Image classification using spatial pyramid coding and visual word reweighting. In: Computer Vision – ACCV 2010 (2011)
16. Parizi, S., Oberlin, J., Felzenszwalb, P.: Reconfigurable models for scene recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)
17. Wang, L., Li, Y., Jia, J., Sun, J., Wipf, D., Rehg, J.: Learning sparse covariance patterns for natural scenes. In: IEEE Conference on Computer Vision and Pattern Recognition (2012)