Multimed Tools Appl DOI 10.1007/s11042-015-2524-6

Image classification based on improved VLAD Xianzhong Long · Hongtao Lu · Yong Peng · Xianzhong Wang · Shaokun Feng

Received: 25 August 2014 / Revised: 22 December 2014 / Accepted: 18 February 2015 © Springer Science+Business Media New York 2015

Abstract Recently, a coding scheme called vector of locally aggregated descriptors (VLAD) has achieved tremendous success in large scale image retrieval due to the efficiency of its compact representation. VLAD employs only the nearest neighbor visual word in the dictionary to aggregate each descriptor. It offers fast retrieval speed and high retrieval accuracy under a small dictionary size. In this paper, we give three improved VLAD variations for image classification: first, similar to the bag of words (BoW) model, we count the number of descriptors belonging to each cluster center and add it to VLAD; second, in order to expand the impact of residuals, squared residuals are taken into account; third, instead of one nearest neighbor visual word, we look for the two nearest neighbor visual words when aggregating each descriptor. Experimental results on the UIUC Sports Event, Corel 10 and 15 Scenes datasets show that the proposed methods outperform some state-of-the-art coding schemes in terms of classification accuracy and computation speed.

X. Long () School of Computer Science & Technology, School of Software, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China e-mail: [email protected] H. Lu · Y. Peng · X. Wang · S. Feng Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China H. Lu e-mail: [email protected] Y. Peng e-mail: [email protected] X. Wang e-mail: [email protected] S. Feng e-mail: [email protected]


Keywords Image classification · Scale-invariant feature transform · Vector of locally aggregated descriptors · K-means clustering algorithm

1 Introduction
As one of the most important and challenging tasks in computer vision and pattern recognition, image classification has recently received much attention. Several benchmark datasets are used to evaluate the performance of image classification algorithms, for example, UIUC Sports Event [23], Corel 10 [26], 15 Scenes [21], Caltech 101 [10] and Caltech 256 [14]. Many image classification models have recently been proposed, such as generative models [2, 22, 33], discriminative models [9, 18, 27, 39] and hybrid generative/discriminative models [3]. A generative model classifies images from the viewpoint of probability; it depends only on the data themselves and does not require training or learning parameters. In contrast, a discriminative model solves the classification problem from a non-probabilistic perspective and needs to train or learn the parameters that appear in the classifier. Here, we only consider image classification based on discriminative models. Among the discriminative models, the earliest bag of words (BoW) technique [35] won the greatest popularity and has a wide range of applications in image retrieval [31], video event detection [37] and image classification [6, 13]. However, the BoW representation does not possess enough descriptive capability because it is merely the histogram of the number of image descriptors assigned to each visual word, and it ignores the spatial information of the image. To solve this problem, the Spatial Pyramid Matching (SPM) model was put forward in [21], which takes the spatial information of the image into account. In fact, SPM is an extension of the BoW model and has been shown to achieve better image classification accuracy than the latter [15, 36, 38]. Image classification based on the SPM model consists of five steps: local descriptor extraction, dictionary learning, feature coding, spatial pooling and classifier selection.
Specifically, commonly used local descriptors include Scale-Invariant Feature Transform (SIFT) [25], Histogram of oriented Gradients (HoG) [7], Affine Scale-Invariant Feature Transform (ASIFT) [28] and Oriented Fast and Rotated BRIEF (ORB) [34]. After obtaining the descriptors of all images, vector quantization [21] or sparse coding [38] is utilized to train a dictionary. In the feature coding phase, each image's descriptor matrix corresponds to a coefficient matrix generated by a particular coding strategy. It is necessary to illustrate the principle of spatial pooling clearly because it dominates the whole image classification framework based on the SPM model. During spatial pooling, an image is divided into increasingly finer subregions over L layers, with 2^l × 2^l subregions at layer l, l = 0, 1, · · · , L−1. A typical partition has three layers, i.e., L = 3. At layer 0, the image is treated as a whole; at layer 1, the image is divided into four subregions; and at layer 2, each subregion of layer 1 is further divided into four, resulting in 16 smaller subregions. This process generates a spatial pyramid of three layers with a total of 1 + 4 + 16 = 21 subregions. The spatial pyramid is then combined with the feature coding process, and different pooling functions are exploited, i.e., sum pooling [21] and max pooling [36, 38]. Finally, the feature vectors of the 21 subregions are concatenated into a long feature vector for the whole image. This process is the spatial pyramid representation of the image, and the dimensionality of the new representation for each image is 21P (P is the dictionary size). It is noteworthy that when l = 0,


SPM reduces to the original BoW model. In the last step, a classifier such as a Support Vector Machine (SVM) [5] or Adaptive Boosting (AdaBoost) [11] is applied to classify images. Over the past several years, a number of dictionary learning methods and feature coding strategies have been brought forward for image classification. In [6], the K-means clustering algorithm, a vector quantization (VQ) technique, was used to generate the dictionary; during the feature coding phase, each local descriptor was given a binary value that specified the cluster center to which it belonged. This process, called BoW, produces the histogram representation of visual words. However, this approach is likely to result in large reconstruction error because it limits the ability to represent descriptors. To address this problem, the SPM based on sparse coding (ScSPM) method was proposed in [38], which employed an L1-norm-based sparse coding scheme in place of the previous K-means clustering method and generated the dictionary by learning from randomly sampled SIFT feature vectors. During the feature coding period, ScSPM used a sparse coding strategy to code each local descriptor. However, the computation speed of ScSPM is very slow when the dictionary size becomes large. In order to accelerate the computation while maintaining high classification accuracy, locality-constrained linear coding (LLC) was put forward in [36], which gives an analytical solution for feature coding. Furthermore, several improved image classification schemes based on SPM have also been suggested recently, such as spatial pyramid matching using Laplacian sparse coding [12], discriminative spatial pyramid [15], discriminative affine sparse codes [20] and nearest neighbor basis vectors spatial pyramid matching (NNBVSPM) [24]. Finding efficient feature coding strategies has become an urgent research direction.
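The vector quantization coding and spatial pyramid pooling steps described above can be sketched as follows. This is a minimal NumPy illustration under our own naming, not the authors' MATLAB implementation; it uses sum pooling as in [21], while max pooling simply replaces the `sum` with a `max` per subregion.

```python
import numpy as np

def vq_codes(V, W):
    """Hard-assignment (BoW) coding: a one-hot P-dimensional code per descriptor.

    V : (D, M) matrix of descriptors; W : (D, P) dictionary of visual words.
    """
    d2 = ((V[:, :, None] - W[:, None, :]) ** 2).sum(axis=0)  # (M, P) squared distances
    H = np.zeros((W.shape[1], V.shape[1]))
    H[d2.argmin(axis=1), np.arange(V.shape[1])] = 1.0        # nearest-word assignment
    return H

def spm_pool(H, xy, width, height, L=3):
    """Three-layer spatial pyramid sum pooling -> a 21*P vector.

    Layer l has 2**l x 2**l subregions, so 1 + 4 + 16 = 21 in total for L = 3.
    xy : (M, 2) integer pixel coordinates of the descriptors.
    """
    out = []
    for l in range(L):
        n = 2 ** l                                   # n x n grid at this layer
        col = np.minimum(xy[:, 0] * n // width, n - 1)
        row = np.minimum(xy[:, 1] * n // height, n - 1)
        cell = row * n + col                         # subregion index per descriptor
        for c in range(n * n):
            out.append(H[:, cell == c].sum(axis=1))  # sum pooling as in [21]
    return np.concatenate(out)                       # length 21 * P
```

Note that the layer-0 part of the pooled vector is exactly the BoW histogram of the whole image, which is why SPM with L = 1 reduces to BoW.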
In the field of pattern recognition, the Fisher vector (FV) technique has been used for image classification [4, 19, 29, 30]. FV is a strong framework which combines the advantages of generative and discriminative approaches. The key point of FV is to represent a signal using a gradient vector derived from a generative probability model and to subsequently input this representation to a discriminative classifier. Therefore, FV can be seen as a hybrid generative/discriminative model. The vector of locally aggregated descriptors (VLAD) can be viewed as a non-probabilistic version of the FV in which the gradient is associated only with the mean and Gaussian mixture model (GMM) clustering is replaced by K-means. VLAD has been successfully applied to image retrieval [1, 8, 16, 17]. When some higher-order statistics are considered, two further coding methods arise, i.e., vectors of locally aggregated tensors (VLAT) [32] and super-vector (SV) [41]. The dimensionality of VLAT is P(D + D^2), where D is the dimension of each descriptor; this high-dimensional representation can result in very large computation time. Besides, SV is based on a probabilistic viewpoint and is still a generative model. Therefore, we do not consider the VLAT and SV feature coding algorithms. In this paper, we concentrate only on image classification methods based on discriminative models; BoW, ScSPM, LLC and VLAD are selected for comparison with our improved VLAD methods. In order to obtain stronger coding ability and improve the classification rate and speed, three improved VLAD versions for image classification are given in this paper. First, similar to the bag of words (BoW) model, we count the number of descriptors belonging to each cluster center and add it to VLAD. In this way, our improved VLAD method possesses the characteristics of BoW. Second, in order to expand the impact of residuals, squared residuals are added to the original VLAD.
This makes the dimension of the new representation twice that of the original. Third, some descriptors have nearly the same


distance to more than one visual word. Thus, assigning such descriptors only to the nearest visual word, as in the original VLAD, is not appropriate. Instead of one nearest neighbor visual word, we look for the two nearest neighbor visual words when aggregating each descriptor. The remainder of the paper is organized as follows: Section 2 introduces the basic ideas of existing schemes. Our improved VLAD methods are presented in Section 3. In Section 4, the comparison results of image classification on three widely used datasets are reported. Finally, conclusions are drawn and some future research issues are discussed in Section 5.

2 Related work
Let V be a set of D-dimensional local descriptors extracted from an image, i.e., V = [v1, v2, · · · , vM] ∈ R^{D×M}. Given a dictionary with P entries, W = [w1, w2, · · · , wP] ∈ R^{D×P}, different feature coding schemes convert each descriptor into a P-dimensional code to generate the final image representation coefficient matrix H = [h1, h2, · · · , hM] ∈ R^{P×M}. Each column of V is a local descriptor, corresponding to one coefficient vector, i.e., one column of H.
2.1 Bag of words (BoW)
The BoW representation groups local descriptors. It first generates a dictionary W with P visual words, usually obtained by the K-means clustering algorithm. Each D-dimensional local descriptor from an image is then assigned to the closest center. The BoW representation is the histogram of the assignments of all image descriptors to visual words. It therefore produces a P-dimensional vector whose elements sum to the number of descriptors in the image. However, the BoW model does not consider the spatial structure of the image and has large reconstruction error, so its image classification ability is restricted [6].
2.2 Sparse coding spatial pyramid matching (ScSPM)
In ScSPM [38], by using sparse coding in place of vector quantization, followed by multi-layer spatial max pooling, the authors developed an extension of the traditional SPM method [21] and presented a linear SPM kernel based on SIFT sparse coding. In the process of image classification, ScSPM solves the following optimization problem:

$$\min_{W,H} \sum_{i=1}^{M} \|v_i - W h_i\|_2^2 + \lambda \|h_i\|_1 \qquad (1)$$

where $\|\cdot\|_2$ denotes the L2 norm of a vector, i.e., the square root of the sum of the squares of its entries, and $\|\cdot\|_1$ is the L1 norm of a vector, i.e., the sum of the absolute values of its entries. The parameter λ controls the sparsity of the solution of formula (1): the larger λ is, the sparser the solution will be. Experimental results in [38] demonstrated that linear SPM based on sparse coding of SIFT descriptors significantly outperformed the linear SPM kernel on histograms and was even better than the nonlinear SPM


kernels. Nevertheless, utilizing sparse coding to learn the dictionary and to encode features is time-consuming, especially for a large-scale image dataset or a large dictionary.
2.3 Locality-constrained linear coding (LLC)
In LLC [36], inspired by the viewpoint of [40] that locality is more important than sparsity, the authors generalized sparse coding to locality-constrained linear coding, replacing the sparsity constraint in formula (1) with a locality constraint. LLC solves the following optimization problem:

$$\min_{H} \sum_{i=1}^{M} \|v_i - W h_i\|_2^2 + \lambda \|d_i \odot h_i\|_2^2 \quad \text{s.t.} \quad \mathbf{1}^T h_i = 1, \ \forall i \qquad (2)$$

where $\mathbf{1} = (1, 1, \cdots, 1)^T$, $\odot$ denotes element-wise multiplication, and $d_i \in R^P$ is a weight vector. In addition, each coefficient vector $h_i$ is normalized such that $\mathbf{1}^T h_i = 1$. Experimental results in [36] showed that LLC outperforms ScSPM on several benchmark datasets due to its excellent properties, i.e., better reconstruction, local smooth sparsity and an analytical solution.
2.4 Vector of locally aggregated descriptors (VLAD)
The VLAD representation was proposed in [16] for image retrieval. Let V = [v1, v2, · · · , vM] ∈ R^{D×M} be the descriptor set extracted from an image. As in BoW, a dictionary W = [w1, w2, · · · , wP] ∈ R^{D×P} is first learned using K-means. Then, for each local descriptor vi, we look for its nearest neighbor visual word NN(vi) in the dictionary. Finally, for each visual word wj, the differences vi − wj of the vectors vi assigned to wj are accumulated. C = [c1^T, c2^T, · · · , cP^T]^T ∈ R^{PD} (cj ∈ R^D, j = 1, 2, · · · , P) is the final VLAD vector representation, obtained according to the following formula:

$$c_j = \sum_{v_i : NN(v_i) = w_j} (v_i - w_j) \qquad (3)$$

The VLAD representation is the concatenation of the D-dimensional vectors cj and is therefore PD-dimensional, where P is the dictionary size. Algorithm 1 gives the VLAD coding process. Like the Fisher vector, the VLAD can then be power- and L2-normalized sequentially, where the power normalization parameter α is empirically set to 0.5. It is worth noting that there is no SPM or pooling process in the VLAD coding algorithm. Existing experiments have shown that VLAD is an efficient feature coding method under a small dictionary size.
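Formula (3) together with the power- and L2-normalization can be sketched as follows. This is an illustrative NumPy version under our own naming, not the reference implementation of [16].

```python
import numpy as np

def vlad_encode(V, W, alpha=0.5):
    """VLAD coding (formula (3)): V is a (D, M) descriptor matrix, W a (D, P)
    matrix of K-means centers; returns the power- and L2-normalized P*D vector."""
    D, M = V.shape
    P = W.shape[1]
    # squared distance from every descriptor to every visual word, shape (M, P)
    d2 = ((V[:, :, None] - W[:, None, :]) ** 2).sum(axis=0)
    nn = d2.argmin(axis=1)                   # NN(v_i) for each descriptor
    C = np.zeros((P, D))
    for i in range(M):
        C[nn[i]] += V[:, i] - W[:, nn[i]]    # accumulate residuals per visual word
    c = C.ravel()                            # concatenation -> P*D dimensions
    c = np.sign(c) * np.abs(c) ** alpha      # power normalization, alpha = 0.5
    norm = np.linalg.norm(c)
    return c / norm if norm > 0 else c       # L2 normalization
```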

3 Improved VLAD
In this section, three improved VLAD methods are presented. They are named VLAD based on BoW, Magnified VLAD and Two Nearest Neighbor VLAD, respectively. The same


as for VLAD, the improved VLAD representations can also be power- and L2-normalized, where the parameter α is empirically set to 0.5.
3.1 VLAD based on BoW
Inspired by BoW, we count the number of descriptors belonging to each cluster wj (j = 1, · · · , P) and add it to VLAD. This improved method is called VLAD based on BoW (abbreviated VLAD+BoW). The dimensionality of the VLAD+BoW representation is therefore P(D + 1), where the extra dimension per visual word stores the BoW count. By integrating the histogram information of the visual words into VLAD, we hope that VLAD+BoW possesses the characteristics of BoW and improves classification performance. VLAD+BoW is presented in Algorithm 2.
3.2 Magnified VLAD
In order to magnify the impact of residuals, squared residuals are taken into account. This improved version is called Magnified VLAD (abbreviated MVLAD) and its dimension is 2PD. The computation of MVLAD is given in Algorithm 3.
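The two variants above (before normalization) can be sketched jointly. The helper below uses our own illustrative naming: the BoW count is appended after each visual word's residual block for VLAD+BoW, and the squared residuals are concatenated after the residuals for MVLAD.

```python
import numpy as np

def vlad_bow_and_mvlad(V, W):
    """Unnormalized VLAD+BoW (P*(D+1) dims) and MVLAD (2*P*D dims) codes."""
    D, M = V.shape
    P = W.shape[1]
    nn = ((V[:, :, None] - W[:, None, :]) ** 2).sum(axis=0).argmin(axis=1)
    C = np.zeros((P, D))          # VLAD residuals
    S = np.zeros((P, D))          # squared residuals (MVLAD)
    counts = np.zeros((P, 1))     # BoW histogram (VLAD+BoW)
    for i in range(M):
        r = V[:, i] - W[:, nn[i]]
        C[nn[i]] += r
        S[nn[i]] += r ** 2
        counts[nn[i], 0] += 1
    vlad_bow = np.hstack([C, counts]).ravel()       # P * (D + 1) dimensions
    mvlad = np.concatenate([C.ravel(), S.ravel()])  # 2 * P * D dimensions
    return vlad_bow, mvlad
```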


3.3 Two nearest neighbor VLAD
In addition to the nearest neighbor center, we seek a second nearest neighbor center for each descriptor. This process is referred to as two nearest neighbor VLAD (abbreviated TNNVLAD). The dimension of the TNNVLAD representation is still PD. TNNVLAD is a kind of soft coding method and can reduce the representation error. The specific details are shown in Algorithm 4. If d1 > βd2, where d1 and d2 are the distances from vi to its nearest and second nearest centers, the differences between vi and these two centers are each weighted by 0.5 and accumulated. The value of β is chosen according to our experiments.
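The TNNVLAD aggregation rule can be sketched as follows (our illustrative version of Algorithm 4, with β = 0.8 as chosen in the experiments of Section 4):

```python
import numpy as np

def tnnvlad_encode(V, W, beta=0.8):
    """TNNVLAD: when the nearest distance d1 exceeds beta * d2 (i.e. the two
    closest centers are nearly equidistant), half of each residual is
    accumulated into both centers; otherwise standard VLAD assignment is used."""
    D, M = V.shape
    P = W.shape[1]
    dist = np.sqrt(((V[:, :, None] - W[:, None, :]) ** 2).sum(axis=0))  # (M, P)
    order = dist.argsort(axis=1)
    C = np.zeros((P, D))
    for i in range(M):
        j1, j2 = order[i, 0], order[i, 1]     # two nearest visual words
        if dist[i, j1] > beta * dist[i, j2]:  # nearly tied -> soft assignment
            C[j1] += 0.5 * (V[:, i] - W[:, j1])
            C[j2] += 0.5 * (V[:, i] - W[:, j2])
        else:
            C[j1] += V[:, i] - W[:, j1]
    return C.ravel()                          # still P * D dimensional
```

Note that with β = 1 the condition d1 > d2 can never hold, so the sketch reduces to plain VLAD; smaller β makes the soft assignment fire more often.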

4 Experimental results
This section begins with a description of our experimental setting, followed by comparisons of our schemes with other prominent methods on three datasets, i.e., UIUC Sports Event, Corel 10 and 15 Scenes. Figure 1 shows example images from these datasets.


4.1 Experimental setting
A typical experimental setting for classifying images mainly contains four steps. First of all, we adopt the widely used SIFT descriptor [25] due to its good performance in image classification as reported in [12, 21, 36, 38]. Specifically, SIFT features are invariant to image scale and rotation and robust across a substantial range of affine distortion, addition of noise and change in illumination. To be consistent with previous work, we use the same setting to extract SIFT descriptors: 128-dimensional SIFT descriptors are densely extracted from image patches on a grid with a step size of 8 pixels and a single patch size of 16 × 16. We resize the maximum side (i.e., length or width) of each image to 300 pixels, except for the UIUC Sports Event dataset, for which we resize the maximum side to 400 pixels because of the high resolution of the original images. Next, about twenty descriptors from each image are chosen at random to form a new matrix, which is taken as the input of the K-means clustering or sparse coding algorithm, and we then learn a dictionary of the specified size. In the third step, we exploit the BoW, sparse coding, LLC, VLAD and improved VLAD schemes to encode the descriptors and produce each image's new representation. For the BoW model, the dimensionality of the new representation is the dictionary size P. For ScSPM and LLC, we combine a three-layer spatial pyramid matching model (comprising 21 subregions) with the max pooling function, so the dimension of the new representation is 21P. The dimensionalities for VLAD and the improved VLAD methods can be found in Algorithms 1-4. At the final step, we apply a linear SVM classifier


Fig. 1 Image examples of the datasets UIUC Sports Event (the left four), Corel 10 (the middle four), and 15 Scenes (the right four)


for the new representations, randomly selecting some columns per class for training and some other columns per class for testing. We then obtain a classification accuracy for each category by comparing the predicted labels of the test set with its ground-truth labels. Finally, we sum the classification accuracies of all categories and divide by the number of categories to obtain the overall classification accuracy. All results are obtained by repeating five independent experiments, and the average classification accuracy and standard deviation over the five experiments are reported. All experiments are conducted in MATLAB on a server with an Intel X5650 CPU (2.66GHz, 12 cores) and 32GB RAM. For the TNNVLAD algorithm, Fig. 2 shows the choice of the parameter β on the three datasets. Specifically, Fig. 2 plots the classification accuracy of TNNVLAD as β varies in the interval [0.1, 1] with the dictionary size fixed at 130. The results in Fig. 2 indicate that β = 0.8 is the best choice for TNNVLAD; therefore, we fix β = 0.8 in our experiments.
4.2 UIUC sports event dataset
UIUC Sports Event [23] contains 8 categories and 1579 images in total, with the number of images per category ranging from 137 to 250. The 8 categories are badminton, bocce, croquet, polo, rock climbing, rowing, sailing and snowboarding. In order to compare with other methods, we randomly select 70 images per class as training data and 60 images per class as test data. We compare the classification accuracy of our three improved VLAD schemes with the other four methods under different dictionary


Fig. 2 Classification accuracy of our TNNVLAD algorithm under different β on the UIUC Sports Event, Corel 10 and 15 Scenes datasets


Fig. 3 Classification accuracy comparisons of various coding methods under different dictionary size on the UIUC Sports Event dataset

size in Fig. 3, where the dictionary size ranges from 10 to 420 with a step length of 10. From the results presented in Fig. 3, we notice that the classification accuracy of our methods surpasses all the other algorithms when the dictionary size is small and is comparable to the existing schemes when the dictionary size becomes large. This phenomenon may be explained by the fact that the goal of VLAD is to aggregate local image descriptors into compact codes, so VLAD obtains good performance with a small dictionary size. Besides, Fig. 3 shows that the performance of BoW is the lowest, ScSPM is better than BoW, and the classification accuracy of LLC is higher still; these observations are consistent with reports in the existing literature. Based on Fig. 3, we list the best classification accuracy of each approach in Table 1, where the average classification accuracy, standard deviation and corresponding dictionary

Table 1 The best classification accuracy comparisons on the UIUC Sports Event dataset (mean±std-dev)%

Algorithm      Classification Accuracy (Dictionary Size)
BoW [6]        73.38 ± 0.85 (390)
ScSPM [38]     83.71 ± 2.20 (400)
LLC [36]       84.17 ± 1.36 (330)
VLAD [17]      84.38 ± 2.67 (220)
VLAD+BoW       85.29 ± 0.87 (210)
MVLAD          84.75 ± 1.85 (220)
TNNVLAD        85.25 ± 1.26 (220)

[Figure 4 panels: confusion matrices (%) of the VLAD+BoW, MVLAD and TNNVLAD algorithms on the UIUC Sports Event dataset, over the classes badminton, bocce, croquet, polo, rockclimbing, rowing, sailing and snowboarding.]

Fig. 4 Confusion Matrices of our algorithms on UIUC Sports Event dataset


size are given. From Table 1, we can conclude that the best classification accuracies of our three improved methods are higher than those of the other four schemes on the UIUC Sports Event dataset. Our VLAD+BoW and TNNVLAD methods achieve more than 1 % higher accuracy than LLC, a state-of-the-art method based on the SPM model. Furthermore, the original and improved VLAD achieve their best classification accuracy with a small dictionary, whereas BoW, ScSPM and LLC need a large dictionary to reach their highest accuracy. Moreover, the confusion matrices of our algorithms on the UIUC Sports Event dataset are shown in Fig. 4. When obtaining the confusion matrices, the dictionary size is set to 130 for our three improved VLAD methods. In a confusion matrix, the element in the i-th row and j-th column (i ≠ j) is the percentage of images from class i that are misidentified as class j; the average classification accuracies over five independent experiments for the individual classes lie along the main diagonal. Figure 4 shows the classification and misclassification status of each individual class. Our algorithms perform well on the badminton and rock climbing classes. We also notice that the bocce and croquet classes have a high percentage of misclassifications, which may result from their visual similarity to each other; the balls in the bocce and croquet classes have very similar appearance. To further demonstrate the superiority of our methods in running speed, the computation time of the various approaches under different dictionary sizes on the UIUC Sports Event dataset is reported in Fig. 5. The computation time of each method is the total time of five independent experiments, in seconds. From Fig. 5, we can see that the computing speed of the BoW method is the fastest due to its low-dimensional representation.
Meanwhile, we also observe that the ScSPM algorithm is the slowest. This is because a sparse coding strategy is used to learn the dictionary and to encode features in ScSPM, and solving the L1-norm minimization problem is very time-consuming. The computation times of VLAD and our three improved VLAD methods

Fig. 5 Computation time comparisons of various coding methods under different dictionary size on the UIUC Sports Event dataset

Fig. 6 Classification accuracy comparisons of various coding methods under different dictionary size on the Corel 10 dataset

are smaller than that of LLC. These experimental results show that our algorithms have a certain advantage in computation time.
4.3 Corel 10 dataset
Corel 10 [26] contains 10 categories with 100 images per category. The categories are beach, buildings, elephants, flowers, food, horses, mountains, owls, skiing and tigers. Following the setting of [12, 26], we randomly select 50 images from each class as training data and use the remaining 50 images per class as test data. Classification accuracy comparisons of the various coding methods under different dictionary sizes on the Corel 10 dataset are shown in Fig. 6. We again see that our improved VLAD algorithms obtain good performance when the dictionary size is small. Based on Fig. 6, the best classification accuracies of the different algorithms are reported in Table 2. From the results, we can see that the best classification accuracies of our three improved VLAD algorithms are better than those of the other four schemes on the Corel 10

Table 2 The best classification accuracy comparisons on the Corel 10 dataset (mean±std-dev)%

Algorithm      Classification Accuracy (Dictionary Size)
BoW [6]        67.44 ± 0.91 (340)
ScSPM [38]     75.24 ± 1.24 (340)
LLC [36]       79.20 ± 1.66 (380)
VLAD [17]      78.76 ± 1.47 (110)
VLAD+BoW       79.88 ± 0.48 (130)
MVLAD          79.96 ± 1.20 (280)
TNNVLAD        81.32 ± 1.45 (130)

[Figure 7 panels: confusion matrices (%) of the VLAD+BoW, MVLAD and TNNVLAD algorithms on the Corel 10 dataset, over the classes beach, buildings, elephants, flowers, food, horses, mountains, owls, skiing and tiger.]

Fig. 7 Confusion Matrices of our algorithms on Corel 10 dataset


dataset. Moreover, all the VLAD-based algorithms obtain their best classification accuracy with a small dictionary, while BoW, ScSPM and LLC need a large dictionary. Our TNNVLAD method is more than two percentage points higher than LLC, the best of the other methods. The confusion matrices for the Corel 10 dataset are given in Fig. 7. Our algorithms perform well on the flower and horse classes and poorly on the mountain class. Figure 8 gives the computation time comparisons of the various coding methods under different dictionary sizes on the Corel 10 dataset. The ScSPM algorithm requires more time than the other six algorithms. Although MVLAD needs more time than BoW and LLC, it still requires far less than ScSPM.
4.4 15 Scenes dataset
The 15 Scenes dataset [21] contains 15 categories and 4485 images in total, with the number of images per category ranging from 200 to 400. The 15 categories are bedroom, suburb, industrial, kitchen, living room, coast, forest, highway, inside city, mountain, open country, street, tall building, office and store. The image content is diverse, containing not only indoor scenes, such as living room and store, but also outdoor sceneries, such as coast and forest. In order to compare with other methods, we randomly select 100 images per class as training data and use the rest as test data. Figure 9 gives the classification accuracy comparisons of the various coding methods under different dictionary sizes on the 15 Scenes dataset. The VLAD-based algorithms get better performance than ScSPM and LLC when the dictionary size is small, but become slightly lower than LLC as the dictionary size increases.

Fig. 8 Computation time comparisons of various coding methods under different dictionary size on Corel 10 dataset (y-axis: computation time in seconds; x-axis: dictionary size from 50 to 400; methods: BoW, ScSPM, LLC, VLAD, VLAD+BoW, MVLAD, TNNVLAD)

Fig. 9 Classification accuracy comparisons of various coding methods under different dictionary size on the 15 Scenes dataset (y-axis: classification accuracy in %; x-axis: dictionary size from 50 to 400; methods: BoW, ScSPM, LLC, VLAD, VLAD+BoW, MVLAD, TNNVLAD)

On the basis of the data in Fig. 9, the best classification accuracies are presented in Table 3. For the 15 Scenes dataset, the best performance of our improved VLAD algorithms is comparable with or slightly lower than that of LLC and ScSPM. The confusion matrices for the 15 Scenes dataset are shown in Fig. 10. Our algorithms perform well on the calsuburb and forest classes. Besides, the bedroom and living room classes are frequently confused with each other, as are the kitchen and living room classes; this is likely because they are visually similar. Figure 11 reports the computation time comparisons of the various coding methods under different dictionary sizes on the 15 Scenes dataset. The ScSPM algorithm requires more time than the other six algorithms.

Table 3 The best classification accuracy comparisons on the 15 Scenes dataset (mean±std-dev)%

Algorithm      Classification Accuracy (Dictionary Size)
BoW [6]        65.87 ± 0.61 (90)
ScSPM [38]     79.54 ± 0.70 (420)
LLC [36]       80.85 ± 1.02 (420)
VLAD [17]      77.35 ± 0.50 (400)
VLAD+BoW       80.09 ± 0.51 (280)
MVLAD          78.82 ± 0.50 (280)
TNNVLAD        79.23 ± 0.62 (400)
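The mean±std-dev figures come from repeating a random per-class train/test split and averaging the resulting accuracies; a sketch of that protocol, where `evaluate` is a stand-in for the full train-and-test pipeline (names are ours, not the paper's):

```python
import random
import statistics

def random_split(labels, n_train_per_class):
    """Return (train_idx, test_idx) with n_train_per_class images per class."""
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train, test = [], []
    for idxs in by_class.values():
        random.shuffle(idxs)
        train += idxs[:n_train_per_class]
        test += idxs[n_train_per_class:]
    return train, test

def mean_std_accuracy(labels, evaluate, n_train_per_class, n_runs=5):
    """Repeat the random split n_runs times; report (mean, std-dev) of accuracy."""
    accs = [evaluate(*random_split(labels, n_train_per_class))
            for _ in range(n_runs)]
    return statistics.mean(accs), statistics.stdev(accs)
```

For the 15 Scenes setup, `n_train_per_class` would be 100 and the rest of each class is used for testing.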

Fig. 10 Confusion Matrices of our algorithms on 15 Scenes dataset (panels: VLAD+BoW, MVLAD and TNNVLAD; per-class percentages over bedroom, calsuburb, industrial, kitchen, livingroom, coast, forest, highway, insidecity, mountain, opencountry, street, tallbuilding, office and store)

Fig. 11 Computation time comparisons of various coding methods under different dictionary size on the 15 Scenes dataset (y-axis: computation time in seconds; x-axis: dictionary size from 50 to 400; methods: BoW, ScSPM, LLC, VLAD, VLAD+BoW, MVLAD, TNNVLAD)

5 Conclusion and future work

In this paper, three feature coding schemes based on VLAD are proposed for image classification. We compare our schemes with some state-of-the-art methods, including BoW, ScSPM, LLC and VLAD. Experiments on different kinds of datasets (the UIUC Sports Event, Corel 10 and 15 Scenes datasets) demonstrate that the classification accuracy of our improved VLAD coding strategies is better than that of the four classical methods under small dictionary sizes. At the same time, it is noteworthy that our schemes are much faster than ScSPM, because the ScSPM algorithm needs more time to learn the dictionary and to code features with its sparse coding strategy. In many cases, classification accuracy and classification speed need to be considered simultaneously. In the future, we will try to find more efficient feature coding strategies and apply them to large scale image datasets.

Acknowledgments This work is sponsored by NUPTSF (Grant No. NY214168), National Natural Science Foundation of China (Grant No. 61300164, 61272247), Shanghai Science and Technology Committee (Grant No. 13511500200) and European Union Seventh Framework Programme (Grant No. 247619).
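The three VLAD variants summarized above can be sketched on top of plain VLAD aggregation. The following is a simplified illustration under our own naming (a k-means dictionary `C` of K centers, local descriptors `X`); it is not the paper's implementation, and omits details such as SIFT extraction and the SVM classifier:

```python
import numpy as np

def encode(X, C, variant="vlad"):
    """Aggregate local descriptors X (n x d) over a dictionary C (K x d).

    vlad     : sum of residuals to the nearest center (baseline).
    vlad_bow : VLAD vector concatenated with per-center descriptor counts.
    mvlad    : squared residuals, enlarging the residuals' influence.
    tnnvlad  : residuals accumulated on the two nearest centers.
    """
    K, d = C.shape
    V = np.zeros((K, d))           # per-center aggregated residuals
    counts = np.zeros(K)           # per-center descriptor counts (BoW part)
    for x in X:
        order = np.argsort(np.linalg.norm(C - x, axis=1))
        counts[order[0]] += 1
        # TNNVLAD aggregates over the two nearest centers, others over one
        for k in (order[:2] if variant == "tnnvlad" else order[:1]):
            r = x - C[k]
            V[k] += r ** 2 if variant == "mvlad" else r  # MVLAD: squared residuals
    v = V.ravel()
    if variant == "vlad_bow":      # VLAD+BoW: append the count histogram
        v = np.concatenate([v, counts])
    return v / (np.linalg.norm(v) + 1e-12)  # L2 normalization

# toy example: 2 descriptors, 2 centers in 2-D
X = np.array([[1.0, 0.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0], [0.0, 0.0]])
v = encode(X, C, "vlad")           # length K*d; "vlad_bow" yields K*d + K
```

The resulting code length is K·d for VLAD, MVLAD and TNNVLAD, and K·d + K for VLAD+BoW, which is why all four remain compact even at small dictionary sizes.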

References

1. Arandjelovic R, Zisserman A (2013) All about VLAD. In: IEEE conference on computer vision and pattern recognition, pp 1578–1585
2. Boiman O, Shechtman E, Irani M (2008) In defense of nearest-neighbor based image classification. In: IEEE conference on computer vision and pattern recognition, pp 1–8
3. Bosch A, Zisserman A, Muñoz X (2008) Scene classification using a hybrid generative/discriminative approach. IEEE Trans Pattern Anal Mach Int 30(4):712–727
4. Cinbis RG, Verbeek J, Schmid C (2012) Image categorization using Fisher kernels of non-iid image models. In: IEEE conference on computer vision and pattern recognition, pp 2184–2191
5. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
6. Csurka G, Dance CR, Fan LX, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Workshop on statistical learning in computer vision, ECCV, vol 1, p 22
7. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: IEEE conference on computer vision and pattern recognition, vol 1, pp 886–893
8. Delhumeau J, Gosselin PH, Jégou H, Pérez P (2013) Revisiting the VLAD image representation. In: ACM international conference on multimedia, pp 653–656
9. Elad M, Aharon M (2006) Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans Image Proc 15(12):3736–3745
10. Fei-Fei L, Fergus R, Perona P (2007) Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. Comp Vision Image Underst 106(1):59–70
11. Freund Y, Schapire R (1995) A decision-theoretic generalization of on-line learning and an application to boosting. In: Computational learning theory, pp 23–37
12. Gao SH, Tsang IWH, Chia LT, Zhao PL (2010) Local features are not lonely: Laplacian sparse coding for image classification. In: IEEE conference on computer vision and pattern recognition, pp 3555–3561
13. Grauman K, Darrell T (2005) The pyramid match kernel: discriminative classification with sets of image features. In: International conference on computer vision, vol 2, pp 1458–1465
14. Griffin G, Holub A, Perona P (2007) Caltech-256 object category dataset
15. Harada T, Ushiku Y, Yamashita Y, Kuniyoshi Y (2011) Discriminative spatial pyramid. In: IEEE conference on computer vision and pattern recognition, pp 1617–1624
16. Jégou H, Douze M, Schmid C, Pérez P (2010) Aggregating local descriptors into a compact image representation. In: IEEE conference on computer vision and pattern recognition, pp 3304–3311
17. Jégou H, Perronnin F, Douze M, Sánchez J, Pérez P, Schmid C (2012) Aggregating local image descriptors into compact codes. IEEE Trans Pattern Anal Mach Int 34(9):1704–1716
18. Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. In: International conference on computer vision, vol 1, pp 604–610
19. Krapac J, Verbeek J, Jurie F (2011) Modeling spatial layout with Fisher vectors for image categorization. In: IEEE international conference on computer vision, pp 1487–1494
20. Kulkarni N, Li BX (2011) Discriminative affine sparse codes for image classification. In: IEEE conference on computer vision and pattern recognition, pp 1609–1616
21. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE conference on computer vision and pattern recognition, vol 2, pp 2169–2178
22. Li FF, Pietro P (2005) A Bayesian hierarchical model for learning natural scene categories. In: IEEE conference on computer vision and pattern recognition, vol 2, pp 524–531
23. Li LJ, Li FF (2007) What, where and who? Classifying events by scene and object recognition. In: International conference on computer vision, pp 1–8
24. Long X, Lu H, Li W (2012) Image classification based on nearest neighbor basis vectors. Multimed Tools Appl:1–18
25. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
26. Lu Z, Ip HHS (2009) Image categorization with spatial mismatch kernels. In: IEEE conference on computer vision and pattern recognition, pp 397–404
27. Moosmann F, Triggs B, Jurie F (2007) Fast discriminative visual codebooks using randomized clustering forests. Advances in neural information processing systems 19
28. Morel J, Yu G (2009) ASIFT: a new framework for fully affine invariant image comparison. SIAM J Imaging Sci 2(2):438–469
29. Perronnin F, Dance C (2007) Fisher kernels on visual vocabularies for image categorization. In: IEEE conference on computer vision and pattern recognition, pp 1–8
30. Perronnin F, Sánchez J, Mensink T (2010) Improving the Fisher kernel for large-scale image classification. In: European conference on computer vision, pp 143–156
31. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. In: IEEE conference on computer vision and pattern recognition, pp 1–8
32. Picard D, Gosselin PH (2011) Improving image similarity with vectors of locally aggregated tensors. In: IEEE international conference on image processing, pp 669–672
33. Quelhas P, Monay F, Odobez JM, Gatica-Perez D, Tuytelaars T, Van Gool L (2005) Modeling scenes with local descriptors and latent aspects. In: International conference on computer vision, vol 1, pp 883–890
34. Rublee E, Rabaud V, Konolige K, Bradski G (2011) ORB: an efficient alternative to SIFT or SURF. In: International conference on computer vision
35. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: International conference on computer vision, pp 1470–1477
36. Wang JJ, Yang JC, Yu K, Lv FJ, Huang T, Gong YH (2010) Locality-constrained linear coding for image classification. In: IEEE conference on computer vision and pattern recognition, pp 3360–3367
37. Xu D, Chang S (2008) Video event recognition using kernel methods with multilevel temporal alignment. IEEE Trans Pattern Anal Mach Int 30(11):1985–1997
38. Yang JC, Yu K, Gong YH, Huang T (2009) Linear spatial pyramid matching using sparse coding for image classification. In: IEEE conference on computer vision and pattern recognition, pp 1794–1801
39. Yang L, Jin R, Sukthankar R, Jurie F (2008) Unifying discriminative visual codebook generation with classifier training for object category recognition. In: IEEE conference on computer vision and pattern recognition, pp 1–8
40. Yu K, Zhang T, Gong YH (2009) Nonlinear learning using local coordinate coding. Adv Neural Inf Process Syst 22:2223–2231
41. Zhou X, Yu K, Zhang T, Huang TS (2010) Image classification using super-vector coding of local image descriptors. In: European conference on computer vision, pp 141–154

Xianzhong Long obtained his Ph.D. degree from Shanghai Jiao Tong University in June 2014. He received his B.S. degree from Henan Polytechnic University in 2007 and his M.S. degree from Xihua University in 2010, both in computer science. He is now an assistant professor at Nanjing University of Posts and Telecommunications. His research interests are computer vision, machine learning and image processing, specifically image classification, object recognition and clustering.


Hongtao Lu got his Ph.D. degree in Electronic Engineering from Southeast University, Nanjing, in 1997. After graduation he spent two years as a postdoctoral fellow in the Department of Computer Science, Fudan University, Shanghai, China. In 1999, he joined the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, where he is now a professor. His research interests include machine learning, computer vision and pattern recognition, and information hiding. He has published more than sixty papers in international journals, such as IEEE Transactions and Neural Networks, and in international conferences. His papers have received more than 400 citations from other researchers.

Yong Peng received the B.S. degree in computer science from Hefei New Star Research Institute of Applied Technology and the M.S. degree from the Graduate University of the Chinese Academy of Sciences. He is now working towards his Ph.D. degree at Shanghai Jiao Tong University. His research interests include machine learning, pattern recognition and evolutionary computation.


Xianzhong Wang received the B.S. degree in computer science from Anhui University of Technology. He is now a Master's candidate in the Department of Computer Science and Engineering, Shanghai Jiao Tong University. His research interests include machine learning and human action recognition.

Shaokun Feng received the B.S. degree in information science from the University of Shanghai for Science and Technology. He is now working towards his M.S. degree at Shanghai Jiao Tong University. His research interests include machine learning, pattern recognition and deep learning.