IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 22, NO. 2, FEBRUARY 2013


Linear Distance Coding for Image Classification

Zilei Wang, Jiashi Feng, Shuicheng Yan, Senior Member, IEEE, and Hongsheng Xi

Abstract—The feature coding-pooling framework has been shown to perform well in image classification tasks, because it can generate discriminative and robust image representations. However, the unavoidable information loss incurred by feature quantization in the coding process and the undesired dependence of pooling on the image spatial layout may severely limit the classification performance. In this paper, we propose a linear distance coding (LDC) method to capture the discriminative information lost in traditional coding methods while simultaneously alleviating the dependence of pooling on the image spatial layout. The core of LDC lies in transforming the local features of an image into more discriminative distance vectors, where the robust image-to-class distance is employed. These distance vectors are further encoded into sparse codes to capture the salient features of the image. LDC is theoretically and experimentally shown to be complementary to the traditional coding methods, and thus their combination can achieve higher classification accuracy. We demonstrate the effectiveness of LDC on six datasets, two of each of three types (specific object, scene, and general object), i.e., Flower 102 and PFID 61, Scene 15 and Indoor 67, and Caltech 101 and Caltech 256. The results show that our method generally outperforms the traditional coding methods, and achieves or is comparable to state-of-the-art performance on these datasets.

Index Terms—Image classification, image-to-class distance, linear distance coding (LDC).

Manuscript received February 16, 2012; revised August 30, 2012; accepted August 30, 2012. Date of publication September 13, 2012; date of current version January 10, 2013. This work was supported in part by the National Natural Science Foundation of China under Grant 61203256 and by the Singapore Ministry of Education under Grant MOE2010-T2-1-087. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Erhardt Barth.

Z. Wang is with the Department of Automation, University of Science and Technology of China (USTC), Hefei 230027, China, and also with the Department of Electrical and Computer Engineering, National University of Singapore, 117576 Singapore (e-mail: [email protected]).
J. Feng and S. Yan are with the Department of Electrical and Computer Engineering, National University of Singapore, 117576 Singapore (e-mail: [email protected]; [email protected]).
H. Xi is with the School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2012.2218826

I. INTRODUCTION

GENERATING compact, discriminative, and robust image representations is undoubtedly critical to image classification [1], [2]. Recently, several local features, e.g., SIFT [3] and HOG [4], have become quite popular for representing images due to their ability to capture distinctive details of the images. However, local features are rarely fed directly into image classifiers, due to the computational complexity and their sensitivity to noise. A common strategy is to first integrate the local features into a global image representation. To this end, various methods [1], [2], [5], [6] have been proposed,

among which the Bag of Words (BoW) based ones [1], [2], [5] present outstanding simplicity and effectiveness. A BoW image representation is typically generated via the following three steps: 1) extract local features of an image at interest points; 2) generate a dictionary/codebook and then quantize/encode the local features into codes accordingly; and 3) pool all the codes together to generate the global image representation. Such a process can be summarized as a feature extraction-coding-pooling pipeline. It has been widely used in recent image classification methods and achieves impressive performance [1], [2], [7].

Within the above framework, the coding process inevitably introduces information loss due to feature quantization. Such undesirable information loss damages the discriminative power of the generated image representation and thus decreases the image classification performance. Therefore, various coding methods have been proposed to encode local features more accurately with less information loss. Most of these methods are developed from Vector Quantization (VQ), which conducts hard assignment in the coding process [5]. In spite of its great simplicity, its inherent large coding error¹ often leads to unrecoverable loss of discriminative information and severely limits the classification performance [8]. To alleviate this issue, various coding methods have been proposed. For example, soft-assignment coding [6], [9], [10] estimates memberships of each local feature to multiple visual words instead of a single one. Another modified method is Super Vector (SV) coding [11], which additionally incorporates the difference between the local feature and the selected visual word; SV thus captures higher-order information and shows improved performance.

Though many coding methods [1], [2], [10], [11] have been proposed to accurately represent the input features, the information loss in the feature quantization for coding is still inevitable. In fact, Boiman et al. [8] have pointed out that local features from a long-tail distribution are inherently inappropriate for quantization, and that the information lost in feature quantization is quite important for good image classification performance. To tackle this issue, the Naive Bayes Nearest Neighbor (NBNN) method was proposed to avoid the feature coding process altogether, by employing the image-to-class distance for image classification [8]. Benefiting from alleviating the information loss, NBNN achieves classification performance competitive with coding-based methods on multiple datasets. Motivated by its success, several methods [12]–[14] have been developed to further improve NBNN. However, all variants of NBNN practically employ uniform summation to aggregate the image-to-class distances calculated based on local features.

¹Also called the coding residual, which refers to the difference between the original local feature and the feature reconstructed from the produced codes.



This introduces two inherent drawbacks: such methods are sensitive to noisy features and are easily dominated by outlier features.

In essence, the BoW-based methods and the NBNN-based methods use different visual characteristic statistics to perform image classification. The former depends on the salient features of an image, while the latter treats all local features equally. In addition, the NBNN methods replace image-level similarities with the image-to-class distance when performing classification, in order to generate more robust results. Therefore, the BoW and NBNN based methods may be suitable for different types of images. For example, for images with cluttered background, the BoW based ones show better classification performance due to their ability to capture the salient features. It is therefore reasonable to expect that if we can combine the advantages of both, namely capturing the saliency of images without information loss, the classification performance can be improved further.

Besides reducing the information loss of feature coding, effectively exploring the spatial context is also crucial for achieving good classification performance. In most of the coding-pooling based methods, Spatial Pyramid Matching (SPM) [7] has been widely adopted in the pooling procedure due to its effectiveness and simplicity. However, SPM strictly requires the involved images to present similar spatial layouts, to ensure that the generated image representations can be matched well in an element-wise manner [15]. This requirement originates from the fact that the used local features often represent object-specific visual patterns. Such a requirement has a negative effect on classification accuracy, because realistic images usually show various spatial layouts even within the same category. Alternatively, if the elements of the adopted features can be transformed to bear class-specific semantics, this requirement would be greatly relieved.

In this paper, we propose a novel Linear Distance Coding (LDC) method to simultaneously inherit the nice properties of BoW and NBNN while relieving the image spatial alignment requirement of SPM. LDC also works under the feature extraction-coding-pooling framework, i.e., it generates the image representations from the salient characteristic local features for the classification, as shown in Figure 1. The proposed LDC particularly focuses on utilizing the discriminative information lost by the traditional coding methods and on more effectively exploiting the spatial information. In practice, LDC transforms each local feature into a distance vector, an alternative discriminative pattern of the local feature, in the class-manifold coordinate system. Compared with the original local features, each element of the distance vectors represents a certain class-specific semantic, namely the distance of the local feature to a class-specific manifold. Thus the strict requirement of image layout similarity in the original SPM can be effectively relieved, since the embedded class semantic in each feature element robustifies the similarity calculation between objects posing differently, as detailed later.

Comprehensive experiments on various types of datasets consistently show that the image representation produced by LDC achieves better or competitive performance compared with the state of the art.



Fig. 1. Illustration of linear distance coding. The local features extracted from various classes of training images are first used to generate a manifold for each class, represented by a set of local features (i.e., anchor points). Based on the obtained class manifolds, the local feature $x_i$ is transformed into a more discriminative distance vector $d_i = [d_{i,1}, d_{i,2}, \ldots, d_{i,K}]^T$, where $K$ denotes the number of classes. On these transformed distance vectors, linear coding and max-pooling are performed to produce the final image representation. The principle of the distance transformation from the original local feature $x_i$ to the distance feature $d_i$ is to form a class-manifold coordinate system with the $K$ obtained class manifolds, where each class corresponds to one axis. For the $k$th class manifold $M_k$, the coordinate value $d_{i,k}$ of local feature $x_i$ corresponds to the distance between $x_i$ and this class manifold. Image best viewed in color.

Furthermore, the image representations produced by LDC are proven to be complementary to those from the original coding methods. Thus their combination, even a direct concatenation of the resulting image representations, can yield remarkable performance improvement, as expected.

The main contributions of this work can be summarized as follows:
1) We propose a novel distance pattern of local features through constructing the class-manifold coordinate system. The produced distance vectors are quite discriminative and are able to relieve the strict requirement of SPM on image spatial layout, benefiting from the more robust image-to-class distance adopted.
2) We propose a linear distance coding (LDC) method, which conducts linear coding and max-pooling on the transformed distance vectors to elegantly aggregate the salient features of images. Compared with the NBNN methods, such a process avoids the undesired case where the discriminative features are dominated by outlier or noisy features, especially for images with cluttered background.
3) From both theoretical analysis and experimental verification, the image representations produced by LDC are complementary to those from the traditional coding methods, and their combination is shown to outperform each of them individually and to achieve state-of-the-art performance on various benchmark datasets.

This paper is organized as follows. Section II introduces the related works, including the linear coding models and the NBNN methods.


Section III proposes the distance pattern by introducing the class-manifold coordinate system. Section IV applies linear coding and max-pooling to the transformed distance vectors, and discusses the combination of LDC and the original coding method. The experiments on three types of datasets are presented in Section V, where the sensitivity of classification performance to the key parameters is also discussed. Finally, Section VI concludes this work.

II. RELATED WORKS

The proposed Linear Distance Coding (LDC) simultaneously utilizes the linear coding methods and the image-to-class distance adopted in NBNN [8]. In this section, we briefly discuss the conventional coding methods and the NBNN methods.

1) Linear Coding Models: Linear coding approximates the input feature by a linear combination of bases from a given dictionary. Through the coding process, input features are transformed into more discriminative codes. Popular linear coding models include Vector Quantization (VQ) [5], Soft-assignment Coding [6], Sparse Coding (SC) [1], Locality-constrained Linear Coding (LLC) [2], and their variants [16]. Given a dictionary $B = [b_1, b_2, \ldots, b_p] \in \mathbb{R}^{d \times p}$ consisting of $p$ basis features with dimensionality $d$, linear coding computes a reconstruction coefficient vector $v \in \mathbb{R}^p$ to represent the input feature $x \in \mathbb{R}^d$ by minimizing the following loss function:

$$L(v) = \frac{1}{2}\|x - Bv\|_2^2 + \lambda R(v) \qquad (1)$$

where the first term measures the approximation error and the second serves as regularization. In fact, existing coding models mainly differ from each other in imposing different prior structures on the generated code $v$ via a specific regularization $R(\cdot)$.

In particular, LLC [2] considers locality to be more essential than sparsity for feature coding. It adopts a locality adaptor in the regularization $R(\cdot)$ to replace the $\ell_1$-norm used in SC. The locality regularization takes into account the underlying manifold structure of local features and thus ensures good approximation. Inspired by LLC, Liu et al. [10] propose to inject locality into soft-assignment coding, devising the Localized Soft-Assignment (LSA) coding method. For any local feature $x$, its membership estimation is restricted to a certain number of nearest bases in the dictionary. LSA discards the possibly unreliable interpretations from distant bases and obtains a more accurate posterior probability estimation. However, the accuracy of such posterior estimation (i.e., the coding result) heavily depends on the size of the adopted dictionary and the underlying distribution of local features, which determine the performance of image classification.

Inspecting the feature coding in (1), the information loss may originate from two aspects. The first is the inaccurate linear approximation and the imperfectness of the dictionary $B$. The second is that the structure enforced by $R(\cdot)$ can only be achieved by sacrificing some approximation accuracy. In linear coding models, which operate on the original local features, such information loss is inevitable. However, the lost information is probably quite important for accurate image classification [8].
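To make the coding step concrete, the following is a minimal NumPy sketch of the approximated fast LLC solution mentioned above; the function and parameter names are our own (`beta` is a small stabilizing regularizer), and the reference implementation accompanies [2]:

```python
import numpy as np

def llc_encode(x, B, k=5, beta=1e-4):
    """Approximated LLC coding of one feature x (d,) against dictionary B (d, p):
    restrict the code to the k nearest bases (locality instead of sparsity) and
    solve the shift-invariant constrained least squares."""
    p = B.shape[1]
    # 1) find the k nearest bases
    d2 = np.sum((B - x[:, None]) ** 2, axis=0)
    idx = np.argsort(d2)[:k]
    # 2) solve min ||x - B_k w||^2 s.t. 1'w = 1 via the shifted covariance trick
    Z = B[:, idx].T - x                    # (k, d) bases shifted by x
    C = Z @ Z.T                            # local covariance
    C += beta * np.trace(C) * np.eye(k)    # regularization for numerical stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                           # enforce the sum-to-one constraint
    v = np.zeros(p)
    v[idx] = w
    return v
```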


2) NBNN Methods: The Naive Bayes Nearest Neighbor (NBNN) method [8] is essentially a non-parametric classification method without a training phase, where classification is performed based on the summation of Euclidean distances between local features of the test image and reference classes (i.e., the image-to-class distance) [8], [12]–[14]. By avoiding feature coding, NBNN effectively reduces the information loss and thus achieves competitive classification performance on multiple benchmark datasets. In the NBNN methods, all local features from the same class are assumed to be i.i.d. samples from a certain class-specific distribution, and thus image classification is equivalent to a maximum likelihood estimation problem [8]:

$$\hat{c} = \arg\max_c p(c|Q) = \arg\max_c \prod_{x \in Q} p(x|c) \qquad (2)$$

where $c$ denotes the class and $Q$ denotes all the descriptors of the query image. In particular, NBNN estimates the likelihood through a set of Parzen kernel functions (typically the Gaussian kernel):

$$\hat{p}(x|c) = \frac{1}{L}\sum_{j=1}^{r} \exp\left(-\frac{1}{2\sigma^2}\|x - x_j^c\|^2\right) \qquad (3)$$

where $x_j^c$ is the $j$-th nearest neighbor in class $c$, $\sigma$ is the bandwidth of the kernel function, $L$ is a normalization factor, and $r$ denotes the number of nearest neighbors. In NBNN, the case $r = 1$ is particularly used due to its simplicity and interpretability. In this case, the resulting NBNN criterion simplifies to:

$$\hat{c} = \arg\min_c \sum_{i=1}^{N} \|x_i - x_i^c\|_2^2 \qquad (4)$$

where $x_i^c$ is the nearest neighbor of $x_i$ in class $c$, and $N$ is the number of local features.

The original NBNN method [8] treats local features and classes equally and independently via the summation in (4), which causes sensitivity to noisy features and outliers. Consequently, the classification performance cannot be greatly improved even though the more robust image-to-class distance is adopted. More specifically, the original NBNN algorithm suffers from the following three drawbacks: 1) the spatial information [7] is not fully exploited, which however is shown to be quite useful for image classification; 2) the computational complexity rapidly increases with the number of local features, so the scalability is severely limited — in particular, the time complexity for one query image with $N$ features is $O(N N_D \log N_D)$, where $N_D$ is the number of all local features of the training images [8]; and 3) it treats all classes equally for any local feature of a testing image, and consequently cannot adapt to the involved dataset or capture the image saliency well, as discussed above. To alleviate these issues, various modified methods have been proposed, such as using a class-specific Mahalanobis metric instead of the Euclidean distance [13], associating class-specific parameters with each class [12], and kernelizing the NBNN [14].
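For reference, the $r = 1$ decision rule of Eq. (4) can be written as the following brute-force sketch (our own illustrative code; practical implementations replace the exhaustive search with approximate nearest-neighbor structures, which is exactly the $O(N N_D \log N_D)$ cost discussed above):

```python
import numpy as np

def nbnn_classify(query_feats, class_feats):
    """NBNN rule of Eq. (4): sum, over the query's local features, the squared
    distance to the nearest feature of each class, and pick the smallest sum.
    query_feats: (N, d); class_feats: dict mapping class label -> (N_c, d)."""
    best_class, best_dist = None, np.inf
    for c, F in class_feats.items():
        # squared distances from every query feature to every feature of class c
        d2 = ((query_feats[:, None, :] - F[None, :, :]) ** 2).sum(-1)
        image_to_class = d2.min(axis=1).sum()  # nearest neighbor per feature, then sum
        if image_to_class < best_dist:
            best_class, best_dist = c, image_to_class
    return best_class
```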


These modified NBNN methods [12]–[14] share two features, although they seem quite different. First, all of them use the same strategy to improve classification performance, namely enhancing the adaptiveness of the resultant metrics by learning some key parameters. In fact, such a learning process is an alternative to training parametric models on the training samples. Second, the final classification criterion always reduces to the summation of certain distances over all local features within each image, no matter what distance metric is adopted. Such a uniform summing operation usually renders the generated metric sensitive to noisy points, as aforementioned. Consequently, the individual NBNN methods cannot outperform the feature coding based methods in image classification tasks.

III. DISTANCE PATTERN

In this work, we focus on solving the image classification problem formally stated as follows: given a set of local features $X_i$ and the class label $y_i$ of the $i$-th image $I_i$, we want to learn a classifier from local features to image label, $C: X_i \rightarrow y_i$, such that the classification error is minimized w.r.t. both the training and test images. In particular, we aim at a method generating more discriminative image representations from $X_i$ for better classification performance. Here we propose a novel coding method which preserves the superior discriminative capability and robustness of the feature coding based methods [2], while effectively capturing the information lost by previous coding methods. In the following, we first introduce the proposed distance pattern, which is more discriminative and robust.

A. Class-Specific Distance

Using the distance between a local feature and a certain class to estimate image membership can provide better generalization capability. Such class-specific distance is fundamental to the NBNN methods and crucial for achieving outstanding classification performance [8]. In particular, all of the existing NBNN methods approximate the class-specific distance by calculating the distance between the local feature and its corresponding nearest neighbor retrieved in the reference images [8]. Formally, let $d(x_i, c)$ denote the distance between a local feature $x_i$ and the class $c$. Here the class $c$ consists of a set of local features $\{x_j^c\}$, all of which are extracted from the training images of $c$. Then $d(x_i, c)$ is computed as

$$d(x_i, c) = \min_{x \in \{x_j^c\}} \|x_i - x\|_2^2 = \|x_i - x_i^c\|_2^2 \qquad (5)$$

where $x_i^c$ denotes the mapped point of $x_i$ in class $c$, which reduces to the nearest neighbor of $x_i$ in the NBNN methods.

However, the distance in Equation (5) suffers from the following drawbacks: 1) It is quite sensitive to noisy features in the training set $\{x_j^c\}$. A local feature is prone to change significantly even under slight appearance variation, which causes ubiquitous noisy features. In the presence of noisy features or outliers in $\{x_j^c\}$, the estimated distance of local features in the testing image may severely deviate from the correct one because of the fragile quadratic criterion.

This may lead to a quite unreliable distance pattern and consequently degrade the performance of any classification criterion based on it. 2) It is computationally expensive to find the nearest neighbor for each query feature, as aforementioned. The computational complexity $O(N N_D \log N_D)$ increases proportionally with the number of local features in the training set. In practice, many works extract a huge number of local features, which heavily limits the efficiency of NBNN based methods. Although there are some accelerated algorithms [17], [18], the low efficiency is still a bottleneck of such distance calculation.

To alleviate these issues, we propose a novel algorithm to calculate the distance $d(x_i, c)$. The essential idea is to calculate a more appropriate mapping point $x_i^c$, rather than simply finding the nearest neighbor as in NBNN. The new $x_i^c$ is allowed to be a virtual local feature in class $c$. In particular, we assume the local features of each class are sampled from a class-specific manifold $M^c$, which is completely determined by the available local features of the corresponding class, $\{m_i^c\}_{i=1}^{n_c}$. Such features are called "anchor points" [19], and they can be obtained by clustering the local features from class $c$. Here the manifold of class $c$ is denoted as $M^c = [m_1^c, m_2^c, \ldots, m_{n_c}^c]$. Then the computational complexity for a single input image with $N$ features becomes $O(N n_c \log(n_c))$ with $n_c \ll N_D$, where $N_D$ is the number of all training local features. For example, in our following experiments there are about 60,000 local features per class, with 2,000 features per image and 30 training images. After the clustering preprocessing, only $n_c = 1024 \ll 60{,}000$ anchor points are used to describe the manifold. In addition to reducing the complexity, using the cluster centers as anchor points can effectively reduce the influence of noisy features and thus produce a more robust description of the manifold. This rests on the reasonable assumption that the fraction of outliers is small, so the resultant centers are mainly determined by the dominant inlier features.

Now we present an efficient algorithm to determine a good mapping point $x_i^c$, even when relatively few anchor points are provided. By utilizing the locally linear structure of the manifold, $x_i^c$ can be calculated through locally linear regression. More specifically, $x_i^c$ is computed as a linear combination of its neighboring anchors in the manifold $M^c$. Here we apply the approximate fast solution of LLC [2] to our problem, which selects only a fixed number of nearest neighbors and can be formulated as follows:

$$\min_{v_i} \|x_i - M^c v_i\|_2^2 \quad \text{subject to: } v_{i,j} = 0 \ \text{if } m_j^c \notin N_i^k; \quad \mathbf{1}^T v_i = 1, \ \forall i \qquad (6)$$

where $v_i = [v_{i,1}, v_{i,2}, \ldots, v_{i,n_c}]^T$ is the vector of linear representation coefficients of $x_i$ on the manifold $M^c$, and $N_i^k$ is the set of $k$ nearest neighbors of $x_i$. Substituting the resultant $x_i^c$ derived from (6) into (5), the distance $d(x_i, c)$ is finally obtained, which is denoted as $d_i^c$. Such class-specific distance is motivated by capturing the underlying manifold structure of the local features and is computed in a robust linear regression manner. Thus it gains stronger discriminative power and more robustness to noisy and outlier features.
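A minimal sketch of this robust class-specific distance follows, under our own parameter names (`k` plays the role of the neighbor count $k_{nn}^d$ discussed later, and `beta` is a small regularizer we add for numerical stability):

```python
import numpy as np

def class_distance(x, Mc, k=3, beta=1e-4):
    """Distance d(x, c) of Eqs. (5)-(6): locally reconstruct x from its k nearest
    anchor points of the class manifold Mc (shape (d, n_c)), then measure the
    squared residual to the virtual mapped point x^c."""
    d2 = np.sum((Mc - x[:, None]) ** 2, axis=0)
    idx = np.argsort(d2)[:k]              # neighbor set N_i^k
    Z = Mc[:, idx].T - x                  # shifted anchors, as in LLC's fast solution
    C = Z @ Z.T + beta * np.eye(k)        # regularized local covariance
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                          # sum-to-one constraint of Eq. (6)
    x_c = Mc[:, idx] @ w                  # virtual mapped point x^c
    return np.sum((x - x_c) ** 2)         # d(x, c) = ||x - x^c||_2^2
```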


B. What Is a Good Distance Pattern?

Let $d_i = [d_{i,1}, d_{i,2}, \ldots, d_{i,K}]^T \in \mathbb{R}^K$ denote the distance vector of the local feature $x_i$, which aggregates its relationship to all $K$ classes. In contrast to original local features (e.g., SIFT), which describe the appearance patterns of a characteristic object, the distance vector represents a relative pattern that captures the discriminative part of local features w.r.t. the specified classes, i.e., it is more class-specific, as we desire. In fact, the distance vector is the projection residue of local features onto the class manifolds, as shown in Figure 1. Note that in the figure each axis denotes one class manifold. Through such residue-pursuit feature transformation, the distance vector gains the following advantages over the original local features:
1) The distance vector preserves the discriminative information of local features that is lost in the traditional feature coding process.
2) The distance vector coordinates better with additional operations that explore useful spatial information, e.g., SPM. The spatial pooling of traditional local features requires the involved images to have similar object layouts so that the resulting representations of different images can be matched element-wise. This overly strict requirement is significantly relieved by the distance vector, because of the class-specific characteristic of the adopted image-to-class distance, as shown in Figure 2.

Compared with previous NBNN methods, which directly sum up the image-to-class distances for classification, here we propose to use the distance vector as a new kind of local feature. Thus, any classification model used on the original local features fits the distance vector perfectly. Before providing the more robust and discriminative distance pattern, we first recall the original NBNN strategy for image classification. Given an image $I$ with $N$ local features $x_i$, the distance vectors $d_i \in \mathbb{R}^K$ are calculated as in (5). Then the estimated category $\hat{c}$ of $I$ is determined by the following criterion:

$$\hat{c} = \arg\min_k \left[\sum_{i=1}^{N} d_i\right]_k = \arg\min_k \left[\sum_{i=1}^{N} d_{i,1}, \sum_{i=1}^{N} d_{i,2}, \ldots, \sum_{i=1}^{N} d_{i,K}\right]_k \qquad (7)$$

where $k$ is the index of the element corresponding to the category. Namely, the original NBNN method only considers the element-wise semantics of the obtained distance vectors separately, and completely ignores the intrinsic pattern described by the distance vector as a whole. Different from the previous methods, we regard each distance vector as an integral feature, and then apply an outperforming coding model to these transformed features.
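To make the contrast concrete, Eq. (7) amounts to a single line over the stacked distance vectors; note how it discards the per-feature pattern that our coding step will exploit (a sketch with our own naming):

```python
import numpy as np

def nbnn_on_distance_vectors(D):
    """Eq. (7): given the stacked distance vectors D of shape (N, K), the
    original NBNN rule sums them over features and picks the class (column)
    with the smallest total, ignoring the vectors' joint pattern."""
    return int(np.argmin(D.sum(axis=0)))
```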


Fig. 2. Schematic diagram of how the distance pattern relieves the requirement of layout similarity. In the original feature space, each class has multiple clusters of characteristic features. When the involved images have different layouts, the resulting image representations may be quite different, because the features contained in the same SPM grid of different images differ. This negatively affects the usual element-wise matching based methods in achieving high classification accuracy. Such an undesired situation can be significantly resolved by our proposed distance transformation, as all distance vectors within the same class turn out to be more similar in the distance feature space, benefiting from the class-specific characteristic of the adopted image-to-class distance. Consequently, image representations of the same class become closer to each other in the image-level representation space, even though they show totally different layouts (e.g., the distance image representations $v_{I_1}^d$ and $v_{I_2}^d$ in class 1). Different shapes represent different classes in certain feature spaces, and different colors indicate different features (e.g., the pink rectangles represent the indistinctive features in class 1, lying close to class 2). Image best viewed in color.

In particular, the final distance pattern used in our method admits the following form:

$$d_i \leftarrow d_i - \min(d_i)\,\mathbf{1}, \qquad \bar{d}_i = f_n(d_i) = \frac{1}{\|d_i\|_2}[d_{i,1}, d_{i,2}, \ldots, d_{i,K}]^T \qquad (8)$$

where $f_n(\cdot)$ is the normalization function with the $\ell_2$-norm. From Equation (8), the used $\bar{d}_i$ mainly represents the distance pattern, with $\|\bar{d}_i\|_2 = 1$. In practice, compared with the direct normalization $f_n(d_i)$ without the minimum subtraction, the normalization in (8) is experimentally shown to produce a slightly higher classification accuracy [14], which may benefit from the increased gap between elements that describes features more discriminatively. For simplicity, we use $d_i$ to refer to $\bar{d}_i$ when there is no ambiguity in the following sections. Finally, we summarize the procedure to compute the adopted distance pattern in Algorithm 1.

IV. LINEAR DISTANCE CODING

Here we explore how to utilize the obtained distance vectors to produce a discriminative and robust image representation. Different from the previous NBNN-like methods, we aggregate the obtained distance pattern under the coding-pooling framework, which provides state-of-the-art performance in previous works. An overview of the image classification flowchart is shown in Figure 3. The distance vectors are transformed from local features one by one; then the distance vectors and the original local features are separately encoded and pooled to generate two image representations $v_I^d$ and $v_I$.


Algorithm 1: Distance Pattern
Data: $N$ local features $\{x_i\}_{i=1}^{N}$ of image $I$; the class-specific manifolds $M^c$, $c = 1, 2, \ldots, K$.
Result: The desired distance vectors $\bar{d}_i$, $i = 1, 2, \ldots, N$.
for $i \leftarrow 1$ to $N$ do
    for $k \leftarrow 1$ to $K$ do
        calculate $v_i$ using (6); then $d_{i,k} = \|x_i - M^k v_i\|_2^2$.
    end
    Construct the distance vector $d_i = [d_{i,1}, d_{i,2}, \ldots, d_{i,K}]^T$.
    Obtain the normalized distance vector $\bar{d}_i$ from (8).
end
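A NumPy rendering of Algorithm 1 might look as follows; this is a sketch under our own naming, where the inner solve reuses the LLC-style local regression of Eq. (6):

```python
import numpy as np

def distance_pattern(X, manifolds, k=3, beta=1e-4):
    """Algorithm 1 as a NumPy sketch. X: (N, d) local features; manifolds: list
    of K anchor-point matrices M^c of shape (d, n_c). Returns (N, K) vectors."""
    N, K = X.shape[0], len(manifolds)
    D = np.empty((N, K))
    for i, x in enumerate(X):
        for c, M in enumerate(manifolds):
            idx = np.argsort(((M - x[:, None]) ** 2).sum(0))[:k]  # N_i^k of Eq. (6)
            Z = M[:, idx].T - x
            C = Z @ Z.T + beta * np.eye(k)
            w = np.linalg.solve(C, np.ones(k))
            w /= w.sum()                                  # 1^T v_i = 1
            D[i, c] = ((x - M[:, idx] @ w) ** 2).sum()    # d_{i,c} = ||x_i - M^c v_i||^2
    # Eq. (8): subtract the per-feature minimum, then l2-normalize
    D -= D.min(axis=1, keepdims=True)
    D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-12
    return D
```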


Fig. 3. Overview of the image classification flowchart. This architecture has been proven to achieve state-of-the-art performance on the basis of a single type of feature, e.g., LLC [2]. (a) Linear coding and max-pooling are sequentially performed on the originally extracted local features, resulting in an original image representation. (b) All local features are transformed into distance vectors, on which linear coding and max-pooling are sequentially performed. This coding process is called LDC in this paper, and it results in a distance image representation. Finally, the original image representation and the distance image representation are simply concatenated so that they complement each other, and a linear SVM is adopted for the final classification.

Fig. 4. Illustration of the complementarity between the image representations produced by LLC-like coding methods and by our LDC method. In the coding-pooling framework, the original local feature $x$ is approximated by the fixed visual words (anchor points) and the corresponding code $v$. Here we suppose the anchor points of all classes are concatenated to form a fixed global dictionary $B = [M^1, M^2, \ldots, M^K]$. Then the information of the original feature $x$ can be completely expressed by the generated codes $v = [v_1, v_2, \ldots, v_K]^T$ together with the residue errors $[n_1, n_2, \ldots, n_K]^T$. In fact, the proposed LDC utilizes the residue error information by compressing $n_k$ into $d_k$ with $d_k = \|n_k\|_2^2$. Therefore, the image representations $v_I$ and $v_I^d$ are complementary to each other, owing to their complementary perspectives on utilizing the original information.

Finally, the linear SVM is adopted to classify the images based on an individual image representation or on their concatenated image representation. To verify the effectiveness and generality of the distance transformation, we apply two different coding models independently, i.e., LLC [2] and Localized Soft-Assignment coding (LSA) [10], to encode the distance vectors, owing to the high efficiency provided by their approximate fast solutions.² We illustrate this procedure via LLC. Let $B \in \mathbb{R}^{K \times P}$ be the distance dictionary consisting of $P$ distance vectors $b_1, b_2, \ldots, b_P$, which can be obtained by k-means clustering on the distance vectors of the training images. For an input distance vector $d_i$, the corresponding code $y_i$ is calculated as follows [2]:

$$\min_{y_i} \|d_i - B y_i\|_2^2 + \lambda \|e_i \odot y_i\|_2^2, \quad \text{subject to: } \mathbf{1}^T y_i = 1, \ \forall i \qquad (9)$$

where $\odot$ denotes element-wise multiplication, $\mathbf{1}$ is a $P$-dimensional all-one vector, and $e_i \in \mathbb{R}^P$ is the locality adaptor that gives each visual word a different freedom proportional to its similarity to the input distance feature $d_i$. After linear coding of the distance vectors, max-pooling is performed on the obtained sparse codes $\{y_i\}$ to produce the distance image representation $v_I^d$ for image $I$, namely

$$v_I^d = \max(y_1, y_2, \ldots, y_N) \qquad (10)$$

where the max is performed element-wise over the involved vectors. In addition, SPM with three levels is adopted for the spatial pooling. Thus, the distance image representation $v_I^d$ is equally compact, salient, and discriminative as the original image representation $v_I$.

Here we provide a brief analysis of the relationship between the original image representation $v_I$ and the distance image representation $v_I^d$. The most intuitive difference is that they are derived from two different local features: the original local features $\{x_i\}$ and the distance vectors $\{d_i\}$, respectively. For an individual point within an image, the coding quantization of the original local features inevitably loses some important information, since only the principal information is preserved, while the distance vector captures the discriminative information in the residue part and thus compensates for the information loss, as shown in Figure 4. So it is credible that the resulting image representations $v_I$ and $v_I^d$ are complementary to each other. In practice, we simply concatenate $v_I$ and $v_I^d$ to form a longer vector $v_I^c$, which is expected to achieve better performance. The benefit of such complementarity is well verified by the following experiments on multiple types of benchmark datasets.

²The counterpart for LSA follows [10]; see that reference for details.
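As an illustration of Eq. (10) combined with the three-level SPM pooling described above, here is a self-contained sketch (our own naming; feature positions are assumed normalized to [0, 1)):

```python
import numpy as np

def spm_max_pool(codes, positions, levels=(1, 2, 4)):
    """Spatial max-pooling (Eq. (10)) over an SPM grid: take the element-wise
    max of the codes whose feature falls in each block, then concatenate all
    blocks. codes: (N, P); positions: (N, 2) normalized (x, y) in [0, 1)."""
    pooled = []
    for g in levels:  # 1x1, 2x2, and 4x4 grids, i.e., SPM2
        cells = np.minimum((positions * g).astype(int), g - 1)
        for bx in range(g):
            for by in range(g):
                mask = (cells[:, 0] == bx) & (cells[:, 1] == by)
                block = codes[mask].max(axis=0) if mask.any() else np.zeros(codes.shape[1])
                pooled.append(block)
    return np.concatenate(pooled)  # length P * (1 + 4 + 16)
```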

V. EXPERIMENTS

In this section, we evaluate the performance of the proposed method on three groups of benchmark datasets: specific objects (e.g., flowers, food), scenes, and general objects. In particular, the specific object datasets are Flower 102 [20] and PFID 61 [21], in which the images are relatively clean without cluttered background; the scene datasets are Scene 15 [7] and Indoor 67 [22]; and the general object datasets are Caltech 101 [23] and Caltech 256 [24].

Among the various feature coding models producing relatively compact image representations, Locality-constrained Linear Coding (LLC) and Localized Soft-Assignment Coding (LSA) almost always achieve state-of-the-art classification performance [2], [10].

In addition, compared with ScSPM and other similar methods, they have much lower computational complexity owing to their fast approximate solutions [2]. Thus we adopt LLC and LSA individually as the coding model in our method, where max-pooling is always employed. Of course, similar coding models can also be naturally applied to the transformed distance features, e.g., the Laplacian Sparse Coding (LScSPM) proposed by Gao et al. [16]. The main target of the following experiments is to verify the uniform effectiveness of the proposed distance pattern in improving classification performance. Moreover, we adopt the best performance of the comparable methods ever reported on each dataset, together with the accuracies achieved by LLC and LSA, as the baselines in the performance evaluation. Before reporting the detailed classification results on these datasets, we first give the experimental settings.

A. Experimental Settings

For fair comparison with previously reported results, local features of a single type, dense SIFT [3], are used throughout the experiments. In all of our experiments, SIFT features are extracted at a single scale from densely located patches of gray images. The patches are centered at every 4 pixels and have a fixed size of 16 × 16 pixels; the VLFeat library [25] is used. Before feature extraction, all images are resized, with preserved aspect ratio, to no more than 300 × 300 pixels. The anchor points $\{m_i^c\}$ of each class manifold $M^c$ are learned from the training images of that class, and their number is fixed at $n_c = 1024$ for all classes throughout our experiments. For the original dense SIFT features and the corresponding distance vectors, global dictionaries containing $P$ visual words are learned individually from all training samples via k-means clustering. In particular, $P = 2048$ is fixed for all datasets. Each SIFT feature $x_i$ or distance vector $d_i$ is normalized by its $\ell_2$-norm and then encoded into a $P$-dimensional vector.

An important parameter of LLC and LSA is the number of nearest neighbors $k_{nn}^c$ used for encoding local features. In our method, the distance vector is similarly calculated based on $k_{nn}^d$ neighbors in the specified class manifold. To account for their influence on classification performance, four different values are tried for each of these parameters, i.e., $k_{nn}^d \in \{1, 2, 3, 4\}$ and $k_{nn}^c \in \{2, 5, 10, 20\}$, as suggested in LLC [2]. In the experiments, we report the best result for each method under these parameters; the influence of these parameters is discussed in the following subsection. In addition, the bandwidth parameter $\beta$ of LSA is fixed at 10, following the authors' setting in [10]. The SPM is used by hierarchically partitioning each image into 1×1, 2×2, and 4×4 blocks on 3 levels, whose cumulative concatenations are denoted by SPM0, SPM1, and SPM2, respectively. In particular, SPM2 means that all three levels (from 0 to 2) are used by concatenating their pooling vectors. All obtained image-level representations are fed into the linear SVM in the training and testing phases (the libLinear package [26]), where the penalty parameter of the SVM is fixed at $C = 1$. We found the classification performance to be quite stable for different penalty parameter values.
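For concreteness, the classifier stage under these settings could be sketched with scikit-learn's LinearSVC, which wraps the same LIBLINEAR solver; the array names below are hypothetical:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_combined(v_sift, v_dist, labels):
    """Train the linear SVM on the concatenated representation ('Combine').
    v_sift, v_dist: hypothetical (n_images, D) arrays of SPM-pooled
    representations from coding SIFT features and distance vectors."""
    v_comb = np.hstack([v_sift, v_dist])  # direct concatenation v_I^c
    clf = LinearSVC(C=1.0)                # fixed penalty C = 1, as in the paper
    return clf.fit(v_comb, labels)
```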


Fig. 5. Example images of the Flower 102 dataset, where each row represents one category. (a) Original images. (b) Corresponding segmented images. Limited by the performance of the segmentation algorithm, the segmented images may contain part of the background, lose part of the object, or even lose the whole object. Image best viewed in color.

The number of repetitions and the numbers of training and testing samples follow the configuration provided with each dataset. Performance is measured by the average classification accuracy over all classes. For multiple runs, both the mean and the standard deviation of the classification accuracy are reported. For the evaluations of the proposed methods, we report the results of three different image-level representations: the original feature representation $v_I$, the distance image representation $v_I^d$, and their direct concatenation $v_I^c$. In the experimental results, LLC and LSA are each paired with different input features. For example, LLC-SIFT refers to applying LLC to the original SIFT features to produce the image-level representation, and LLC-Combine refers to the result of concatenating the image representations from LLC-SIFT and LLC-Distance.

B. Specific Object Datasets

We first evaluate the proposed method on the Flower 102 [20] and PFID 61 [21] datasets, whose images are relatively clean and whose backgrounds are less cluttered.

1) Flower 102: Flower 102 is a 102-category flower dataset [20] containing 8189 images, with each class consisting of 40 to 258 images. Some examples are shown in Figure 5. The images possess small inter-class differences and large intra-class variance. Here we focus on classifying the segmented images available from the dataset. Limited by the imperfectness of the segmentation algorithm, the segmented foreground may contain part of the background or lose part of the object; therefore, classifying such segmented images is still challenging. The dataset has been divided into a training set, a validation set, and a testing set in the provided protocol. The training and validation sets consist of 10 images per class, and the testing set consists of the remaining 6149 images (at least 20 per class).

2) PFID 61: The Pittsburgh Fast-Food Image Dataset is a collection of fast food images from 13 chain restaurants (e.g., McDonald's, Pizza Hut, KFC) acquired under lab and realistic settings [21].


Fig. 6. Example images of the PFID 61 dataset, where each row of the left and right parts represents one category. Each category contains three instances, and each instance has six images from different views; two images of each instance are shown here. Image best viewed in color.

It contains 61 categories of food items selected from 101 categories. There are 3 instances of each food item, each bought from a different branch on a different day, and 6 images from 6 viewpoints (60 degrees apart) for each food instance. Figure 6 shows 14 of the categories with two example images per category. It is notable that the appearance of different instances in each category varies greatly, and some different categories (e.g., hamburgers) are too similar to distinguish even by the human eye. Such large instance variance and tiny inter-class differences make the classification quite challenging.

For Flower 102, most of the previous classification methods employing a single feature are based on the $\chi^2$ kernel function of the clustered SIFTint and SIFTbdy features [27]. In stark contrast, we directly use the much simpler and more efficient linear SVM to classify the segmented images. We train the classifier on the training and validation images, as done by the baseline method provided in [20]. Namely, 20 images per class are used for training, and the remaining ones are used for testing. For PFID 61, we follow the experimental protocol proposed in previous work [21], [28] and use 3-fold cross-validation to evaluate the performance. In each iteration, the 12 images of two instances are used for training and the 6 images of the third one are used for testing. We repeat the training and testing process 3 times, with a different instance serving as the test set each time.

Table I gives the classification performance of different methods on the Flower 102 and PFID 61 datasets. Here KMTJSRC-CG is the method proposed by Yuan et al. [27], which uses multi-task joint sparse coding and achieves the state-of-the-art performance of 55.20% on this dataset. As for PFID 61, the state-of-the-art performance is 28.20%, achieved by Yang et al. [28] through utilizing the spatial relationship of local features. Besides these methods, we run the adopted coding methods LLC and LSA on both datasets to demonstrate the effectiveness of our proposed LDC in improving the classification performance.

TABLE I
CLASSIFICATION ACCURACY (%) COMPARISON ON TWO OBJECT DATASETS FLOWER 102 AND PFID 61

Methods                     Flower 102   PFID 61
SVM (SIFTint) [20]^a        55.10        -
KMTJSRC-CG (SIFTint) [27]   55.20        -
Bag of SIFT [21]^b          -            9.20
OM [28]^c                   -            28.20
LLC-SIFT                    57.75        44.63 ± 4.00
LLC-Distance                59.76        48.45 ± 3.58
LLC-Combine                 61.45        48.27 ± 3.59
LSA-SIFT                    57.80        43.35 ± 3.36
LSA-Distance                58.78        46.90 ± 3.47
LSA-Combine                 60.38        46.54 ± 3.08

^a The best baseline accuracy for a single feature provided by the authors of Flower 102, based on SVM.
^b One of the baseline accuracies on the 61 categories provided by the authors of PFID 61.
^c Orientation and Midpoint (OM), one of a set of methods based on the statistics of pairwise local features proposed by Yang et al., yields their best accuracy; the $\chi^2$ kernel is adopted with SVM.

From Table I, it can be observed that the proposed method significantly outperforms LLC and LSA with SIFT features and generally achieves state-of-the-art performance. This well verifies that the proposed distance pattern of local features is able to more effectively capture the discriminative

information among multiple classes. According to our analysis, the combination of the distance vector and the original SIFT features should yield better classification accuracy than either of them individually, because the combination compensates for the information loss and provides more useful information. This is well demonstrated on Flower 102, where the combination achieves the best accuracy of 61.45%. However, the effectiveness of such a combination does not hold on PFID 61, where the individual distance vector achieves the best performance of 48.45%, rather than the combination. The reason is that different instances in PFID 61 possess very large variations, so the consistency of the local feature distributions between the training and testing images is not well guaranteed. This is experimentally reflected in the larger accuracy deviations of both the LLC and LSA methods in Table I. In this case, the combination may slightly overfit the training data and lead to a negligible decrease of classification accuracy; e.g., the average accuracy decreases from 48.45% to 48.27% when LLC-Distance is combined with LLC-SIFT.

C. Scene Datasets

Now we evaluate the proposed method on the scene datasets Scene 15 and Indoor 67. Scene recognition is a challenging open problem in high-level vision, because each image contains not only indeterminate characterizing objects but also complex background [22]. Compared with object classification, the variations of images in scene classification are more severe, especially in lighting conditions, scale, and spatial layout.

1) Scene 15: This dataset consists of 15 scene categories, among which 8 were originally collected by Oliva et al. [29], 5 were added by Fei-Fei et al. [5], and 2 were adopted from Lazebnik et al. [7]. Each class contains 200 to 400 images, and the average image size is around 300 × 250 pixels.


Fig. 7. Example images of the Scene 15 dataset, containing all 15 categories with two images per category.

Figure 7 shows some example images of each category.

2) Indoor 67: This dataset contains 67 indoor scene categories and a total of 15,620 images [22]. The images were collected from three different sources: online image search tools (Google and Altavista), online photo sharing sites (Flickr), and the LabelMe dataset. All images have a minimum resolution of 200 pixels along the smaller axis. The number of images varies across categories, but there are at least 100 images per category. To facilitate seeing the variety of scene categories, they are organized into 5 big scene groups (Store, Home, Public spaces, Leisure, and Working places), as shown in Figure 8.

Fig. 8. Example images of the Indoor 67 dataset, containing 67 categories. All categories are organized into five big groups: Store, Home, Public spaces, Leisure, and Working places. Four categories with two images per category are shown for each group. Due to the complex background, images within each category vary widely. Image best viewed in color.

For Scene 15, we follow the setting in [7] and randomly choose 100 images per class for training, testing on the rest. In particular, we repeat the evaluation three times, then report the average results and the standard deviation. As for Indoor 67, we follow the settings of the baseline method provided in [22]: 80 images of each class are used for training and 20 images for testing, with the partition provided on the dataset website.

TABLE II
CLASSIFICATION ACCURACY (%) COMPARISON ON TWO SCENE DATASETS SCENE 15 AND INDOOR 67

Methods                      Scene 15      Indoor 67
ROI + gist-annotation [22]^a -             26.50
Object Bank [30]^b           80.90         37.60
KSPM [7]                     81.40 ± 0.50  -
ScSPM [1]                    80.28 ± 0.93  -
SC + linear kernel [31]^c    84.10 ± 0.50  -
NBNN [13]^d                  77.00         -
LLC-SIFT                     79.81 ± 0.35  43.78
LLC-Distance                 80.30 ± 0.62  43.53
LLC-Combine                  82.40 ± 0.35  46.28
LSA-SIFT                     80.12 ± 0.60  44.19
LSA-Distance                 79.73 ± 0.70  42.04
LSA-Combine                  82.50 ± 0.47  46.69

^a The baseline result provided by the authors of Indoor 67, where region-of-interest (ROI) detection is employed to reduce the interference of cluttered background and an RBF-kernel SVM is adopted.
^b Object Bank pre-trains one object detector for each class.
^c For comparison, the result with basic features is shown here, but it adopts the intersection kernel rather than our adopted linear SVM.
^d An optimized version of NBNN, where the image-to-class distance is learned by employing Mahalanobis metrics.

Table II provides the classification results on Scene 15 and Indoor 67, along with several baseline results on these two scene datasets. The compared methods include detection based methods, linear coding methods, and the NBNN method. For these two datasets, the distance vectors

yield classification performance close to that of the original local features, due to the relatively poor consistency of the feature distributions between training and testing images. As expected, the combination achieves the best performance for both LLC and LSA, as the spatial robustness of the transformed distance vectors strengthens the robustness of the final combined image-level representation.

D. General Object Datasets

Here we conduct experiments on the Caltech 101 and Caltech 256 datasets, in which each image contains a certain object against a cluttered background. The Caltech 101 dataset [23] contains 9144 images in 101 object categories, including animals, vehicles, flowers, buildings, etc. The number of images per category varies from 31 to 800. The Caltech 256 dataset [24] contains 30,607 images from 256 object categories, and each category contains at least 80 images. Besides the object categories, each dataset includes an extra "background" class, i.e., BACKGROUND_Google and clutter, respectively. Figure 9 gives some example images. Compared with Caltech 101, Caltech 256 presents much greater variation in object size, location, pose, etc. For both datasets, we randomly select 30 images per category for training and test on the rest. In particular, we repeat this three times and then report the average classification accuracy and the corresponding standard deviation.

Table III provides the resultant classification performance on these two datasets. Here we compare our method mainly with the linear coding methods and the NBNN method. In particular, LLC in [2] adopted three-scale SIFT features, while our work uses only single-scale SIFT features. For Caltech 256, LLC [2] adopted a dictionary of 4096 visual words to further improve the performance, whereas our dictionary size is fixed at 2048.


Fig. 9. Example images of the Caltech 101 and Caltech 256 datasets, containing 102 and 257 categories, respectively. Besides the object categories, each dataset contains one extra background category, namely BACKGROUND_Google for Caltech 101 and clutter for Caltech 256. All categories in both datasets have large object variations with cluttered background. Compared with Caltech 101, Caltech 256 has a more irregular object layout, which may degrade the classification performance due to the imperfect matching of spatial pooling. Image best viewed in color.

TABLE III
CLASSIFICATION ACCURACY (%) COMPARISON ON CALTECH 101 AND CALTECH 256

Methods                    Caltech 101   Caltech 256
SVM-KNN [32]               66.20 ± 0.50  -
KSPM [7], [24]             64.60 ± 0.80  34.10
ScSPM [1]                  73.20 ± 0.54  34.02 ± 0.35
SC + linear kernel [31]^a  71.50 ± 1.10  -
LScSPM [16]                -             35.74 ± 0.10
NBNN [2], [8]^b            70.40         37.00
LLC [2]^c                  73.44         41.19
LSA [10]                   74.21 ± 0.81  -
LLC-SIFT                   72.65 ± 0.33  36.27 ± 0.27
LLC-Distance               73.34 ± 0.95  37.40 ± 0.07
LLC-Combine                74.59 ± 0.54  38.41 ± 0.11
LSA-SIFT                   72.86 ± 0.33  36.52 ± 0.26
LSA-Distance               71.45 ± 0.87  36.30 ± 0.06
LSA-Combine                74.47 ± 0.46  38.25 ± 0.08

^a For fair comparison, the result with basic features and a linear kernel is shown here. Higher accuracy is also reported in [31], but with the intersection kernel.
^b Performance of the original NBNN [8] as provided in [2].
^c LLC adopts three-scale SIFT features and a global dictionary of size 4096, which can yield higher accuracy than single-scale features, especially for Caltech 256 with its larger scale variation.

However, even following the same setting on the Caltech 101 dataset, our own results are slightly worse than those reported in the previous literature; the same holds for LSA. Such a decrease may be introduced by implementation details. For fair comparison, here we only compare the results from our own implementation. Comparing the results in Table III, we can observe that the combination of the distance vector and the original features

always yields better performance than either individual one, as expected. Compared with the previous methods, our method achieves satisfying performance and outperforms the similar methods that use a linear SVM and a single feature. Actually, the classification accuracy could be increased further if some advanced learning-based model [15] or graph-matching kernel [33] were adopted, at the cost of their additional complexity.

From the above experimental results on several different types of image datasets, we can summarize the effectiveness of the proposed method as follows:
1) The distance vectors are quite discriminative under the mild condition that the distributions of the training data and the testing data are consistent to some extent, e.g., when the involved images suffer less interference from cluttered background.
2) The transformation to the distance vector relaxes the requirement for similarity of object spatial layout, owing to its independence of the spatial position of distinctive objects. This is one of the critical differences from the original local features.
3) Under the coding-pooling framework, the distance vector and the original feature are complementary to each other. Consequently, their combination can more comprehensively capture the useful classification information and generally achieves higher classification performance, which is uniformly effective on all used datasets.

E. Discussion

We have proposed the linear distance coding method and verified its effectiveness on multiple types of benchmark datasets. Here we evaluate the influence of the numbers of nearest neighbors used for calculating distances and for coding, separately. In particular, we select the datasets Flower 102, Indoor 67, and Caltech 101, one per type, to investigate the performance under different values, where LLC is employed.

1) Neighbor Number $k_{nn}^d$ for Calculating Distance: In Section III, we introduced the class manifolds to calculate the distance of a local feature to a certain class, with the aim of reducing the complexity and the interference of noisy features. To investigate how $k_{nn}^d$ affects the final classification performance, we provide the average classification accuracy under the different values $k_{nn}^d \in \{1, 2, 3, 4\}$; the plot is shown in Figure 10. From these results, we have the following observations. First, the combined representation is more robust to $k_{nn}^d$ than the individual distance vector, since the combination also encapsulates the information from the original features, which is not affected by this parameter. Second, the influence of this parameter varies a lot across datasets, especially when only the distance vector is adopted. For example, the classification accuracy on Flower 102 keeps increasing as $k_{nn}^d$ increases from 1 to 4. In fact, the performance shows only slight fluctuation once the results under $k_{nn}^d = 1$ are discarded. Based on the observations on the different datasets, we suggest $k_{nn}^d = 3$ as a good trade-off.

Fig. 10. Classification accuracy of the proposed methods under different $k_{nn}^d \in \{1, 2, 3, 4\}$, where three types of data sets, Flower 102, Indoor 67, and Caltech 101, are adopted. Compared to the individual distance vector, the combination is more robust to the parameter $k_{nn}^d$, as it provides more complete information. Image best viewed in color.


2) Neighbor Number $k_{nn}^c$ on Coding: Now we investigate the effect of $k_{nn}^c$ on the final classification performance, where $k_{nn}^d = 3$ is universally used for calculating the distance vector. Similarly, we show the classification performance under different values in Fig. 11. In particular, the results of LLC on the SIFT features are provided besides those of the distance vector and the combination, where four values $k_{nn}^c \in \{2, 5, 10, 20\}$ are explored, as suggested in [2]. For a fair comparison, all results here are produced by our own implementation; a sketch of the coding step appears below.
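As a reference point for this experiment, the following sketch shows the fast approximated LLC coding of [2] restricted to the $k_{nn}^c$ nearest codewords. The function name, the beta regularizer value, and the zero-padding of the full code are our assumptions for illustration, not details fixed by the paper.

    import numpy as np

    def llc_code(x, B, knn_c=10, beta=1e-4):
        """Approximated LLC coding of one descriptor x (D,) against a
        dictionary B (M, D), keeping only the knn_c nearest codewords,
        following the fast approximation suggested in [2]."""
        d2 = ((B - x) ** 2).sum(axis=1)
        idx = np.argsort(d2)[:knn_c]                # nearest codewords
        z = B[idx] - x                              # shift to origin
        C = z @ z.T                                 # local covariance
        C += np.eye(knn_c) * beta * np.trace(C)     # regularization
        w = np.linalg.solve(C, np.ones(knn_c))
        w /= w.sum()                                # sum-to-one constraint
        code = np.zeros(B.shape[0])
        code[idx] = w                               # sparse full-length code
        return code

The same routine can encode either the original SIFT features or the transformed distance vectors; only the dictionary differs.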

Fig. 11. Classification accuracy curves of LLC (Original), LDC (Distance), and their combination (Combine) for different $k_{nn}^c \in \{2, 5, 10, 20\}$, where three types of data sets, Flower 102, Indoor 67, and Caltech 101, are adopted. The three methods show different trends as $k_{nn}^c$ varies. In particular, the combination exhibits the smallest variation, i.e., the combination can be considered insensitive to the parameter $k_{nn}^c$. Image best viewed in color.

From Fig. 11, the optimal parameter of the different methods heavily depends on the characteristics of the involved dataset, e.g., the variation of the images, the degree of background clutter, etc. We summarize the observations of Fig. 11 for the different representations individually as follows.

1) SIFT: For the selected three datasets, the optimal parameter is quite different, e.g., $k_{nn}^c = 2$ for Flower 102, while $k_{nn}^c = 5$ for Indoor 67 and Caltech 101. This may be caused by the dependence of the optimal parameter value on the interference of cluttered background. In particular, the images in Flower 102 are all segmented, which significantly reduces the influence of background, so a small neighborhood is sufficient.
2) Distance: The distance vector possesses a different semantic from the original local feature, introduced by our proposed transformation. Compared with SIFT, the performance of the distance vector is relatively stable across datasets. For example, the optimal accuracy is almost always achieved at $k_{nn}^c = 10$.
3) Combine: By taking advantage of both the stable SIFT and the discriminative distance vector, the combination is the most robust to the value of $k_{nn}^c$ across all datasets. For example, it achieves almost the same accuracy on Flower 102 at the different values $k_{nn}^c = 1, 2, 3, 4$. A sketch of one plausible fusion is given after this list.
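The sketch below shows one plausible way the two pooled image representations could be fused before classification: l2-normalize each part so neither dominates, concatenate, and feed the result to a linear SVM such as LIBLINEAR [26]. The normalization choice and the function name are our assumptions; the paper does not pin these details down here.

    import numpy as np

    def combine_representations(f_orig, f_dist):
        """Fuse the pooled code of the original features (f_orig) with
        the pooled code of the distance vectors (f_dist). Each part is
        l2-normalized first so neither dominates the concatenation."""
        f_orig = f_orig / (np.linalg.norm(f_orig) + 1e-12)
        f_dist = f_dist / (np.linalg.norm(f_dist) + 1e-12)
        return np.concatenate([f_orig, f_dist])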

From the above analysis, the parameter $k_{nn}^c$ is very influential to performance when using the original SIFT features to perform LLC, but such dependence is relaxed for the transformed distance vector. In particular, $k_{nn}^c = 10$ is suggested for both the individual distance vector and the combination in this work.

VI. CONCLUSION

In this paper, we proposed the linear distance coding method to capture the discriminative information of local features and to relieve the dependence of spatial pooling on the object-layout similarity of images. Consequently, the proposed method can effectively improve the classification performance, which is well verified on various types of datasets. In essence, the distance vector extracts discriminative information based on the image-to-class distance, a motivation quite different from that of the traditional coding models. The analysis and the experiments show that the distance vector and the original features are complementary to each other; thus the combination of the two image representations generally yields higher classification performance. Comparing the classification results of the proposed method on the different types of benchmark datasets, we conclude that cluttered background significantly degrades the final classification performance because of its influence on the salient features of different classes. Inspired by this observation, we plan to design a new model that reduces the interference of background so as to improve the classification performance, e.g., by embedding segmentation results into the classification framework, which forms one of our future directions.

REFERENCES

[1] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1794–1801.
[2] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3360–3367.
[3] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[4] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 1, Jun. 2005, pp. 886–893.
[5] L. Fei-Fei and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2, Jun. 2005, pp. 524–531.
[6] J. van Gemert, C. Veenman, A. Smeulders, and J. Geusebroek, “Visual word ambiguity,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 7, pp. 1271–1283, Jul. 2010.
[7] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2, Jun. 2006, pp. 2169–2178.


[8] O. Boiman, E. Shechtman, and M. Irani, “In defense of nearest-neighbor based image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[9] J. van Gemert, J. Geusebroek, C. Veenman, and A. Smeulders, “Kernel codebooks for scene categorization,” in Proc. Eur. Conf. Comput. Vis., Oct. 2008, pp. 696–709.
[10] L. Liu, L. Wang, and X. Liu, “In defense of soft-assignment coding,” in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 2486–2493.
[11] X. Zhou, K. Yu, T. Zhang, and T. Huang, “Image classification using super-vector coding of local image descriptors,” in Proc. Eur. Conf. Comput. Vis., vol. 5, Sep. 2010, pp. 141–154.
[12] R. Behmo, P. Marcombes, A. S. Dalalyan, and V. Prinet, “Toward optimal naive Bayes nearest neighbor,” in Proc. Eur. Conf. Comput. Vis., vol. 4, Sep. 2010, pp. 171–184.
[13] Z. Wang, Y. Hu, and L.-T. Chia, “Image-to-class distance metric learning for image classification,” in Proc. Eur. Conf. Comput. Vis., vol. 1, Sep. 2010, pp. 706–719.
[14] T. Tuytelaars, M. Fritz, K. Saenko, and T. Darrell, “The NBNN kernel,” in Proc. Int. Conf. Comput. Vis., vol. 1, Nov. 2011, pp. 1824–1831.
[15] J. Feng, B. Ni, Q. Tian, and S. Yan, “Geometric p-norm feature pooling for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 2609–2704.
[16] S. Gao, I. Tsang, L. Chia, and P. Zhao, “Local features are not lonely – Laplacian sparse coding for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., San Francisco, CA, Jun. 2010, pp. 3555–3561.
[17] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration,” in Proc. Int. Joint Conf. Comput. Vis. Theory Appl., vol. 1, Lisboa, Portugal, Feb. 2009, pp. 331–340.
[18] H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011.
[19] K. Yu and T. Zhang, “Improved local coordinate coding using local tangents,” in Proc. Int. Conf. Mach. Learn., Jun. 2010, pp. 1215–1222.
[20] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Proc. Indian Conf. Comput. Vis., Graph. Image Process., Dec. 2008, pp. 722–729.
[21] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang, “PFID: Pittsburgh fast-food image dataset,” in Proc. Int. Conf. Image Process., Nov. 2009, pp. 289–292.
[22] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 413–420.
[23] F.-F. Li, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” Comput. Vis. Image Understand., vol. 106, no. 1, pp. 59–70, 2007.
[24] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” Dept. Comput. Sci., California Inst. Technology, Tech. Rep. 7694, Apr. 2007.
[25] A. Vedaldi and B. Fulkerson. (2008). VLFeat: An Open and Portable Library of Computer Vision Algorithms [Online]. Available: http://www.vlfeat.org/
[26] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, “LIBLINEAR: A library for large linear classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, May 2008.
[27] X. Yuan and S. Yan, “Visual classification with multi-task joint sparse representation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3493–3500.
[28] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, “Food recognition using statistics of pairwise local features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2249–2256.
[29] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, 2001.
[30] L.-J. Li, H. Su, E. P. Xing, and F.-F. Li, “Object bank: A high-level image representation for scene classification & semantic feature sparsification,” in Proc. Adv. Neural Inf. Process. Syst., Dec. 2010, pp. 1378–1386.
[31] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce, “Learning mid-level features for recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2559–2566.


[32] H. Zhang, A. C. Berg, M. Maire, and J. Malik, “SVM-KNN: Discriminative nearest neighbor classification for visual category recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2. Jun. 2006, pp. 2126–2136. [33] O. Duchenne, A. Joulin, and J. Ponce, “A graph-matching kernel for object categorization,” in Proc. Int. Conf. Comput. Vis., vol. 5. Barcelona, Spain, Nov. 2011, pp. 1792–1799.

Zilei Wang received the B.S. and Ph.D. degrees in control theory and control engineering from the University of Science and Technology of China (USTC), Hefei, China, in 2002 and 2007, respectively. He is currently an Associate Professor with the Department of Automation, USTC, and is also with the Vision and Machine Learning Laboratory, National University of Singapore, Singapore, as a Research Fellow. His current research interests include computer vision and media streaming techniques.

Jiashi Feng received the B.S. degree from the University of Science and Technology of China, Hefei, China, in 2007. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. His current research interests include computer vision and machine learning.

Shuicheng Yan (M’06–SM’09) is currently an Assistant Professor with the Department of Electrical and Computer Engineering, National University of Singapore, where he is the Founding Lead of the Learning and Vision Research Group (http://www.lv-nus.org). His current research interests include computer vision, multimedia, and machine learning. He has authored or co-authored over 200 technical papers. He was a recipient of the Best Paper Award from ICIMCS in 2009, ACM MM in 2010, and ICME in 2010, the Winner Prize of the Classification Task in PASCAL VOC in 2010, the Honorable Mention Prize of the Detection Task in PASCAL VOC in 2010, and the TCSVT Best Associate Editor (BAE) Award in 2010, and he is a co-author of the Best Student Paper Awards of PREMIA in 2009 and 2011. He is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, and a Guest Editor of special issues for TMM and CVIU.

Hongsheng Xi received the B.S. and M.S. degrees in applied mathematics from the University of Science and Technology of China (USTC), Hefei, China, in 1980 and 1985, respectively. He is currently a Professor with the Department of Automation, USTC, where he also directs the Laboratory of Network Communication Systems and Control. His current research interests include stochastic control systems, network performance analysis and optimization, wireless communications, and signal processing.