Neurocomputing 144 (2014) 408–416
A GA-based feature selection and parameter optimization for linear support higher-order tensor machine

Tengjiao Guo a, Le Han a, Lifang He b, Xiaowei Yang a,*

a Department of Mathematics, School of Sciences, South China University of Technology, Guangzhou 510641, China
b School of Computer Science and Engineering, South China University of Technology, Guangzhou 510641, China

* Corresponding author. E-mail address: [email protected] (X. Yang).
Article info

Article history: Received 5 November 2013; received in revised form 5 March 2014; accepted 7 May 2014; available online 24 May 2014. Communicated by Deng Cai.

Abstract
In the fields of pattern recognition, computer vision, and image processing, many real-world image and video data are more naturally represented as tensors. Recently, based on the supervised tensor learning (STL) framework, a linear support higher-order tensor machine (SHTM) has been proposed. Considering that there is much redundant information in tensor data and that the model parameter largely affects the performance of SHTM, in this study we present a genetic algorithm (GA) based feature selection and parameter optimization algorithm for the linear SHTM. The proposed algorithm can remove the redundant information in tensor data and obtain a better generalized accuracy by searching for the optimal model parameter and feature subset simultaneously. A set of experiments is conducted on nine second-order face recognition datasets and three third-order gait recognition datasets to illustrate the performance of the proposed algorithm. The statistical test shows that, compared with the original linear SHTM, the proposed algorithm can provide a significant performance gain in terms of generalized accuracy for tensor classification.
Keywords: Tensor classification; Support higher-order tensor machine; Genetic algorithm; Supervised tensor learning; Tensor rank-one decomposition
1. Introduction

In the fields of pattern recognition, computer vision, and image processing, many real-world image and video data are more naturally represented as matrices (second-order tensors) or higher-order tensors. For example, gray-level face images [1,2] are inherently represented as matrices. Color images [3,4], gray-level video sequences [5–7], gait silhouette sequences [8,9], and hyperspectral cubes [10] are commonly represented as third-order tensors. Color video sequences [11,12] are usually represented as fourth-order tensors. At present, it is a hot topic to construct learning models and design fast algorithms for tensor data.

In the past several years, some researchers suggested constructing multilinear models by extending the support vector machine (SVM) learning framework to tensor patterns. For example, based on the rank-one decomposition of the weight tensor [13,14], Tao et al. [15] presented a supervised tensor learning (STL) framework by applying a combination of convex optimization and multilinear operators. Under this learning framework, Cai et al. [16] and Wang et al. [17] studied the second-order tensor and proposed a linear tensor least squares classifier,
Tao et al. [18] extended the classical linear C-SVM [19], ν-SVM [20] and least squares SVM (LS-SVM) [21] to general tensor patterns, Liu et al. [22] used dimension-reduced tensors as input for multimodality video semantic concept detection, Zhang et al. [23] presented a twin support tensor machine for microcalcification cluster detection, Wu et al. [24] proposed a transductive learning model for multimodality video semantic concept detection, and Zhang et al. [25] applied a multifeature tensor to solve the remote-sensing target recognition problem. Based on local learning, Liu et al. [26] presented a locally maximum margin classifier for image and video classification. Instead of the classical maximum-margin criterion, Wolf et al. [27] suggested minimizing the rank of the weight tensor with orthogonality constraints on its columns, and Pirsiavash et al. [28] relaxed the orthogonality constraints to further improve Wolf's method. Recently, based on decomposing the weight tensor into a sum of R rank-one tensors, Kotsia et al. [29] proposed four higher rank support tensor machines for visual recognition, and Guo et al. [30] built several tensor learning models for regression.

A potential disadvantage of the STL framework is that it gives rise to a non-convex optimization problem. This leads to two main drawbacks of STL-based methods. On one hand, they may suffer from the local minima problem. On the other hand, the non-convex optimization problem is usually solved with iterative techniques, which is very time-consuming.
In order to overcome these two shortcomings, Hao et al. [31] proposed a novel linear support higher-order tensor machine (SHTM) to deal with tensor classification, based on SVM and the tensor rank-one decomposition. The proposed approach leads to a convex optimization problem and fits into the same primal–dual framework underlying SVM-like algorithms. Therefore, as in SVM, the model parameter largely affects the performance of SHTM.

In the fields of computer vision, machine learning, and data mining, tensor data are usually high-dimensional data containing a large amount of redundant information. For example, the redundancy from inter-band correlation is very high in hyperspectral images [32]. In gait silhouette sequences, there is a lot of information that is not related to gait recognition. Thus, there is a need for feature selection technologies for tensor data.

In this paper, considering that there is much redundant information in tensor data and that the model parameter largely affects the performance of SHTM, we use a genetic algorithm (GA) to search for the optimal model parameter and to conduct feature subset selection for tensors. Consequently, we not only improve the efficiency of the inner product computation and save storage space, but also improve the generalized accuracy of SHTM. In order to examine the performance of the proposed algorithm, we conduct a set of experiments on twelve tensor classification datasets.

The rest of this paper is organized as follows. Section 2 covers some preliminaries, including notation, basic definitions and a brief review of SHTM. In Section 3, the proposed algorithm for tensor classification is described. The experimental results and analyses are presented in Section 4. Finally, Section 5 gives conclusions and future work.
2. Preliminaries

Before presenting our work, we first briefly introduce some notation and basic definitions used throughout the paper, and review the SHTM model.
2.1. Notation and basic definitions

For convenience, we follow the conventional notation and definitions used in multilinear algebra, pattern recognition and signal processing [33,34]. In this study, vectors are denoted by boldface lowercase letters, e.g., a; matrices by boldface capital letters, e.g., A; and tensors by calligraphic letters, e.g., 𝒜. Their elements are denoted by indices, which typically range from 1 to the capital letter of the index, e.g., n = 1, …, N. To make this clearer, Table 1 lists the fundamental symbols used in this study.

Table 1
List of symbols.

M — Total number of tensor samples
L — Number of classes
{X_m, y_m}_{m=1}^{M} — A set of tensor samples
X_m — The mth input tensor sample
y_m — Label of X_m
N — Order of X_m ∈ R^{I_1×I_2×⋯×I_N}
R = rank(X_m) — Rank of X_m
W — Weight parameter (weight tensor)
b — Bias variable
C — Trade-off parameter
ξ — Slack vector
α, β — Lagrange multiplier vectors
ε — Threshold parameter
w^(1) ∘ w^(2) ∘ ⋯ ∘ w^(N) — Rank-1 tensor of W
X ×_n w — n-mode product of X and w
‖·‖_F — The Frobenius norm
Definition 1 (Tensor). A tensor, also known as an Nth-order tensor, multidimensional array, N-way or N-mode array, is an element of the tensor product of N vector spaces. It is a higher-order generalization of a vector (first-order tensor) and a matrix (second-order tensor), and is denoted as 𝒜 ∈ R^{I_1×I_2×⋯×I_N}, where N is the order of 𝒜, also called its ways or modes. The elements of 𝒜 are denoted by a_{i_1,i_2,…,i_N}, 1 ≤ i_n ≤ I_n, 1 ≤ n ≤ N.

Definition 2 (Tensor product or outer product). The tensor product X ∘ Y of a tensor X ∈ R^{I_1×I_2×⋯×I_N} and another tensor Y ∈ R^{I'_1×I'_2×⋯×I'_M} is defined by

$$(\mathcal{X} \circ \mathcal{Y})_{i_1, i_2, \ldots, i_N, i'_1, i'_2, \ldots, i'_M} = x_{i_1, i_2, \ldots, i_N}\, y_{i'_1, i'_2, \ldots, i'_M} \qquad (1)$$

for all values of the indices.

Definition 3 (Inner product). The inner product of two same-sized tensors X, Y ∈ R^{I_1×I_2×⋯×I_N} is defined as the sum of the products of their entries, i.e.,

$$\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} x_{i_1, i_2, \ldots, i_N}\, y_{i_1, i_2, \ldots, i_N}. \qquad (2)$$

Definition 4 (n-mode product). The n-mode product of a tensor 𝒜 ∈ R^{I_1×I_2×⋯×I_N} and a matrix U ∈ R^{J_n×I_n}, denoted by 𝒜 ×_n U, is a tensor in R^{I_1×⋯×I_{n-1}×J_n×I_{n+1}×⋯×I_N} given by

$$(\mathcal{A} \times_n \mathbf{U})_{i_1, \ldots, i_{n-1}, j_n, i_{n+1}, \ldots, i_N} = \sum_{i_n=1}^{I_n} a_{i_1, i_2, \ldots, i_N}\, u_{j_n, i_n} \qquad (3)$$

for all index values.
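For readers who want to experiment with these operations, the following is a minimal NumPy sketch of Definitions 3 and 4. The function names and tensor shapes are ours, chosen for illustration only; the paper's own implementation is in C++ and is not reproduced here.

```python
# Minimal NumPy sketch of Definitions 3 and 4 (illustration only; the names and
# shapes below are ours, not from the paper's C++ implementation).
import numpy as np

def inner_product(X, Y):
    """Definition 3: sum of the elementwise products of two same-sized tensors."""
    assert X.shape == Y.shape
    return float(np.sum(X * Y))

def n_mode_product(A, U, n):
    """Definition 4: contract mode n of tensor A with matrix U of shape (J_n, I_n)."""
    out = np.tensordot(A, U, axes=([n], [1]))  # the new J_n axis ends up last
    return np.moveaxis(out, -1, n)             # move it back to position n

if __name__ == "__main__":
    A = np.random.rand(4, 5, 6)                # a third-order tensor
    U = np.random.rand(3, 5)                   # J_2 = 3, I_2 = 5
    print(inner_product(A, A))                 # squared Frobenius norm of A
    print(n_mode_product(A, U, 1).shape)       # (4, 3, 6)
```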
Remark. Given a tensor 𝒜 ∈ R^{I_1×I_2×⋯×I_N} and a sequence of matrices U^(n) ∈ R^{J_n×I_n}, J_n < I_n, n = 1, …, N, the projection of 𝒜 onto the tensor subspace R^{J_1×J_2×⋯×J_N} is defined as 𝒜 ×_1 U^(1) ×_2 U^(2) ⋯ ×_N U^(N). Given a tensor 𝒜 ∈ R^{I_1×⋯×I_N} and two matrices F ∈ R^{J_n×I_n} and G ∈ R^{J_m×I_m}, one has (𝒜 ×_n F) ×_m G = (𝒜 ×_m G) ×_n F = 𝒜 ×_n F ×_m G.

Definition 5 (Frobenius norm). The Frobenius norm of a tensor 𝒜 ∈ R^{I_1×⋯×I_N} is the square root of the sum of the squares of all its elements, i.e.,

$$\|\mathcal{A}\|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle} = \sqrt{\sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} a_{i_1, i_2, \ldots, i_N}^2}. \qquad (4)$$

Remark. Given two same-sized tensors 𝒜, ℬ ∈ R^{I_1×⋯×I_N}, the distance between 𝒜 and ℬ is defined as ‖𝒜 − ℬ‖_F. Note that the Frobenius norm of the difference between two tensors equals the Euclidean distance of their vectorized representations [35].

Definition 6 (Tensor rank-one decomposition). Let 𝒜 ∈ R^{I_1×I_2×⋯×I_N} be a tensor. If it can be written as

$$\mathcal{A} = \sum_{r=1}^{R} \mathbf{u}_r^{(1)} \circ \mathbf{u}_r^{(2)} \circ \cdots \circ \mathbf{u}_r^{(N)}, \qquad (5)$$

we call (5) a tensor rank-one decomposition of 𝒜 with length R. In particular, if R = 1, 𝒜 is called a rank-1 tensor. If R is the minimum number of rank-1 tensors that yield 𝒜 in a linear combination, R is defined as the rank of 𝒜, denoted by R = rank(𝒜). Moreover, if u_i^(n) and u_j^(n) are mutually orthonormal for all i ≠ j, 1 ≤ i, j ≤ R, n = 1, …, N, formula (5) is often called the rank-R approximation [36–38].
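Equation (5) can be made concrete with a short sketch that rebuilds a tensor from R rank-one terms and checks its Frobenius norm (Definition 5). Again, this is only an illustration under our own naming, not the authors' code.

```python
# Sketch of Eq. (5): rebuilding a tensor from R rank-one terms and checking its
# Frobenius norm (Definition 5). Names and shapes are illustrative assumptions.
import numpy as np

def from_rank_one_factors(factors):
    """factors[r] is the list of N mode vectors (u_r^(1), ..., u_r^(N))."""
    A = None
    for vecs in factors:
        term = vecs[0]
        for v in vecs[1:]:
            term = np.multiply.outer(term, v)  # outer (tensor) product, Definition 2
        A = term if A is None else A + term
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    R, shape = 3, (4, 5, 6)
    factors = [[rng.standard_normal(I) for I in shape] for _ in range(R)]
    A = from_rank_one_factors(factors)
    print(A.shape, np.linalg.norm(A))          # Frobenius norm of the rebuilt tensor
```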
2.2. Linear SHTM for binary classification

Consider a training set of M pairs of samples {X_m, y_m}_{m=1}^{M} for a binary classification problem, where X_m ∈ R^{I_1×I_2×⋯×I_N} is the input data and y_m ∈ {−1, +1} is the corresponding class label of X_m. The linear SHTM model for binary classification is the following [31]:

$$\min_{\mathcal{W},\, b,\, \boldsymbol{\xi}} \; J(\mathcal{W}, b, \boldsymbol{\xi}) = \frac{1}{2}\|\mathcal{W}\|_F^2 + C \sum_{m=1}^{M} \xi_m, \qquad (6)$$

$$\text{s.t.} \quad y_m \big(\langle \mathcal{W}, \mathcal{X}_m \rangle + b\big) \ge 1 - \xi_m, \quad m = 1, \ldots, M, \qquad (7)$$

$$\xi_m \ge 0, \quad m = 1, \ldots, M, \qquad (8)$$

where W is the normal tensor (or weight tensor) of the hyperplane, b is the bias, ξ_m is the error of the mth training sample, and C is the trade-off between the classification margin and the misclassification error. Obviously, this is a convex optimization model in the tensor space. When the input samples X_m are vectors, the optimization model (6)–(8) degenerates into the standard linear support vector machine.

Let the rank-one decompositions of X_i and X_j be X_i ≈ ∑_{r=1}^{R} x_{ir}^(1) ∘ x_{ir}^(2) ∘ ⋯ ∘ x_{ir}^(N) and X_j ≈ ∑_{r=1}^{R} x_{jr}^(1) ∘ x_{jr}^(2) ∘ ⋯ ∘ x_{jr}^(N), respectively. In the linear SHTM, the inner product of X_i and X_j is calculated as follows:

$$\langle \mathcal{X}_i, \mathcal{X}_j \rangle \approx \Big\langle \sum_{r=1}^{R} \mathbf{x}_{ir}^{(1)} \circ \cdots \circ \mathbf{x}_{ir}^{(N)}, \; \sum_{r=1}^{R} \mathbf{x}_{jr}^{(1)} \circ \cdots \circ \mathbf{x}_{jr}^{(N)} \Big\rangle = \sum_{p=1}^{R} \sum_{q=1}^{R} \langle \mathbf{x}_{ip}^{(1)}, \mathbf{x}_{jq}^{(1)} \rangle \langle \mathbf{x}_{ip}^{(2)}, \mathbf{x}_{jq}^{(2)} \rangle \cdots \langle \mathbf{x}_{ip}^{(N)}, \mathbf{x}_{jq}^{(N)} \rangle. \qquad (9)$$
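A small sketch of Eq. (9) follows: the inner product is evaluated from the mode vectors of the two decompositions and cross-checked against the dense computation. The helper names are hypothetical and not from the paper.

```python
# Sketch of Eq. (9): the inner product of two tensors evaluated from their
# rank-one factors, checked against the dense computation. Helper names are ours.
import numpy as np

def inner_product_from_factors(Xi_factors, Xj_factors):
    """Each argument is a list of R items; each item is a list of N mode vectors."""
    total = 0.0
    for p_vecs in Xi_factors:
        for q_vecs in Xj_factors:
            prod = 1.0
            for xp, xq in zip(p_vecs, q_vecs):
                prod *= float(np.dot(xp, xq))  # mode-wise vector inner products
            total += prod
    return total

def rebuild(factors):
    """Dense tensor from the rank-one terms (for the cross-check only)."""
    A = 0.0
    for vecs in factors:
        t = vecs[0]
        for v in vecs[1:]:
            t = np.multiply.outer(t, v)
        A = A + t
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    shape, R = (4, 5, 6), 3
    fi = [[rng.standard_normal(I) for I in shape] for _ in range(R)]
    fj = [[rng.standard_normal(I) for I in shape] for _ in range(R)]
    dense = float(np.sum(rebuild(fi) * rebuild(fj)))
    print(inner_product_from_factors(fi, fj), dense)  # the two values agree
```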
2.3. Linear SHTM for multi-classification

For an L-class classification problem, the one-against-one support vector machine (OAO-SVM) [39] needs to construct L(L−1)/2 binary SVM models, each of which is trained on the data points from two classes. Inspired by this idea, for the tensor samples X_m of the ith and the jth classes, setting y_m = 1 if X_m belongs to the ith class and y_m = −1 otherwise, we solve the following binary classification problem:

$$\min_{\mathcal{W}^{ij},\, b^{ij},\, \boldsymbol{\xi}} \; J(\mathcal{W}^{ij}, b^{ij}, \boldsymbol{\xi}) = \frac{1}{2}\|\mathcal{W}^{ij}\|_F^2 + C \sum_{m=1}^{M} \xi_m^{ij}, \qquad (10)$$

$$\text{s.t.} \quad y_m \big(\langle \mathcal{W}^{ij}, \varphi(\mathcal{X}_m) \rangle + b^{ij}\big) \ge 1 - \xi_m^{ij}, \qquad (11)$$

$$\xi_m^{ij} \ge 0, \quad m = 1, \ldots, M. \qquad (12)$$

We call this model the one-against-one support higher-order tensor machine (OAO-SHTM). Once the OAO-SHTM models have been solved, the class label of a testing example X can be predicted by the majority voting strategy, i.e., the vote counting takes into account the outputs of all binary classifiers. If X belongs to the ith class, the ith class gets one vote; otherwise the jth class gets one vote. X is labeled by the class with the most votes.

3. The proposed algorithm

Feature selection is a basic problem in pattern recognition, machine learning, and data mining. It can remove irrelevant, redundant, or noisy information from patterns, improve the performance of the learning algorithm, reduce the computational cost, and provide a better understanding of the datasets. At present, feature selection algorithms are widely categorized into two groups: filter methods [40] and wrapper methods [41]. The filter method evaluates the goodness of a feature subset by using the intrinsic characteristics of the data. It is relatively computationally cheap since it does not involve the learning machine; however, it also runs the risk of selecting feature subsets that may not match the chosen learning machine. The wrapper method directly uses the learning machine to evaluate feature subsets, and the wrapper-based feature selection problem is essentially a combinatorial optimization problem. It generally outperforms the filter method in terms of prediction accuracy, but it is generally more computationally intensive. In this study, we use the wrapper method to select a feature subset of tensor data.

When using SVM, two problems are confronted: how to choose the optimal input feature subset for SVM, and how to set the best model parameters. Both problems are crucial, because the feature subset choice influences the appropriate model parameters. Therefore, the optimal feature subset and model parameters must be obtained simultaneously. Since GA is a general adaptive optimization search methodology based on a direct analogy to Darwinian natural selection and genetics in biological systems [42,43] and is widely applied to combinatorial optimization problems, some researchers have applied it to SVM model selection and feature selection [44,45]. Considering that the linear SHTM is the generalization of the standard linear SVM to tensor patterns in tensor space, in this study we use GA to select the optimal feature subset and model parameter of SHTM. The chromosome code, fitness function, selection operator, crossover operator and mutation operator of the proposed GA are described in detail as follows.

3.1. Chromosome code

In SHTM, we use the tensor rank-one decomposition to compute the inner product of tensors. In order to obtain the optimal feature subset, it is reasonable to use the same feature mask for the R vectors in the same mode space. Inspired by this idea, in our proposed GA the chromosome is composed of N+1 parts: the model parameter and N feature masks, one for each of the N mode spaces. A binary coding system is used to represent the chromosome. Fig. 1 shows the binary chromosome representation of our design: G_C represents the binary value of the model parameter C, G_F1 represents the feature mask of the first mode space, G_Fi the feature mask of the ith mode space, and G_FN the feature mask of the Nth mode space. The length of G_C is the number of bits representing the model parameter C, which depends on the value range of C; the length of G_F1 is I_1, the length of G_Fi is I_i, and the length of G_FN is I_N. In general, the minimum and maximum values of the model parameter C are determined by the user. For the parts of the chromosome representing the feature masks, a bit with value '1' indicates that the corresponding feature is selected, and '0' indicates that it is not selected.

Fig. 1. The chromosome representation in GA (the concatenated parts G_C, G_F1, …, G_Fi, …, G_FN).

3.2. Fitness function

In our proposed GA, we use the generalized accuracy and the number of selected features as the criteria for designing the fitness function. In general, a chromosome with a high generalized accuracy and a small number of selected features produces a high fitness value. A chromosome with a high fitness value has a high probability of being preserved to the next generation.
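To make the chromosome layout of Section 3.1 (Fig. 1) concrete, the sketch below encodes the model parameter as a 4-bit exponent (so that C ∈ {2^0, …, 2^15}, matching the range used later in Section 4.2) followed by one mask bit per dimension of each mode space. The bit width and helper names are our assumptions, not part of the paper.

```python
# Sketch of the chromosome of Fig. 1 (assumptions: C is encoded by a 4-bit
# exponent so that C = 2^0, ..., 2^15; the helper names are ours).
import numpy as np

def random_chromosome(mode_sizes, c_bits=4, rng=None):
    rng = rng or np.random.default_rng()
    return rng.integers(0, 2, size=c_bits + sum(mode_sizes))

def decode(chrom, mode_sizes, c_bits=4):
    """Return the model parameter C and one boolean feature mask per mode space."""
    exponent = int("".join(map(str, chrom[:c_bits])), 2)   # G_C encodes log2(C)
    C = 2.0 ** exponent
    masks, start = [], c_bits
    for size in mode_sizes:                                # G_F1, ..., G_FN
        masks.append(chrom[start:start + size].astype(bool))
        start += size
    return C, masks

if __name__ == "__main__":
    chrom = random_chromosome([32, 32])        # e.g. a 32 x 32 face image
    C, masks = decode(chrom, [32, 32])
    print(C, [int(m.sum()) for m in masks])    # parameter value and selected counts
```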
Table 2
The proposed algorithm (TFS-SHTM).

Input: a set of tensor samples {X_m ∈ R^{I_1×I_2×⋯×I_N}, y_m ∈ {−1, 1}}_{m=1}^{M}, the threshold parameter ε, the parameter K, the maximum iteration number, the population size P, the selection probability P_s, the crossover probability P_c, and the mutation probability P_m.
Output: the optimal generalized accuracy, the optimal model parameter and the corresponding feature subset.
Step 1: Apply ALS to conduct the tensor rank-one decompositions of all the tensors.
Step 2: Randomly generate an initial population of chromosomes.
Step 3: For each chromosome in the population, use the decoded model parameter and feature subset to solve the optimization problem (6)–(8), and calculate the fitness value of the chromosome based on the generalized accuracy.
Step 4: While the stopping condition is not satisfied, conduct the genetic operations as follows.
  4.1 Conduct the selection operation.
  4.2 Conduct the crossover operation.
  4.3 Conduct the mutation operation.
  4.4 For each new chromosome, use the decoded model parameter and feature subset to solve the optimization problem (6)–(8).
  4.5 Calculate the fitness values of the new chromosomes based on the generalized accuracy.
  4.6 Retain the first P chromosomes with the highest fitness values and remove the others.
  4.7 Go to Step 4.
Step 5: Output the optimal generalized accuracy, the optimal model parameter and the corresponding feature subset, and stop.
3.3. Selection operator

In the proposed GA, we select K pairs of chromosomes for the crossover operation based on the roulette wheel strategy [46–48].

3.4. Crossover operator

For each part of the chromosome, we conduct the uniform crossover operation [49,50] based on the crossover probability.

3.5. Mutation operator

For each part of the chromosome, we conduct the uniform mutation operation [51,52] based on the mutation probability.

Based on the discussions above, we choose the most popular and widely used alternating least squares (ALS) method [37,53] to conduct the tensor rank-one decomposition. After obtaining the tensor rank-one decompositions, we use GA to search for the optimal model parameter and feature subset simultaneously. Finally, we output the optimal generalized accuracy. The detailed procedure of the proposed algorithm, which is called TFS-SHTM, is presented in Table 2. In the proposed algorithm, the stopping condition is that the iteration number exceeds the maximum iteration number.
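A compact sketch of the three operators of Sections 3.3–3.5 (roulette-wheel selection, uniform crossover and uniform mutation) is given below. It is a generic illustration with assumed function names, not the implementation used in the experiments, and the fitness values are assumed to be positive (e.g., accuracies).

```python
# Sketch of the operators of Sections 3.3-3.5: roulette-wheel selection, uniform
# crossover and uniform mutation on binary chromosomes (generic illustration,
# not the implementation used in the experiments; fitness is assumed positive).
import numpy as np

def roulette_select(population, fitness, k, rng):
    """Pick k chromosomes with probability proportional to their fitness."""
    probs = np.asarray(fitness, dtype=float)
    probs = probs / probs.sum()
    idx = rng.choice(len(population), size=k, p=probs)
    return [population[i].copy() for i in idx]

def uniform_crossover(a, b, p_c, rng):
    """With probability p_c, swap each gene between the two parents with prob. 0.5."""
    a, b = a.copy(), b.copy()
    if rng.random() < p_c:
        swap = rng.random(a.shape) < 0.5
        tmp = a[swap].copy()
        a[swap] = b[swap]
        b[swap] = tmp
    return a, b

def uniform_mutation(chrom, p_m, rng):
    """Flip each bit independently with probability p_m."""
    chrom = chrom.copy()
    flip = rng.random(chrom.shape) < p_m
    chrom[flip] = 1 - chrom[flip]
    return chrom

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pop = [rng.integers(0, 2, size=20) for _ in range(6)]
    fit = [0.80, 0.60, 0.90, 0.70, 0.50, 0.85]             # e.g. CV accuracies
    p1, p2 = roulette_select(pop, fit, 2, rng)
    c1, c2 = uniform_crossover(p1, p2, p_c=0.9, rng=rng)
    print(uniform_mutation(c1, p_m=0.1, rng=rng))
```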
4. Experimental results and analyses

In this section, we evaluate the performance of the proposed algorithm on four benchmark databases (Yale-B, ORL, CMU PIE and USF HumanID). In order to establish a comparative study, we use the linear SHTM as the baseline.

In the field of machine learning, the most important criterion for evaluating the performance of learning machines is their generalized accuracy. For different datasets, researchers usually use one of the following three methods to estimate the generalized accuracy of learning machines:

(a) For datasets where both training and testing sets are available, the training set is first used to find the optimal parameters based on the k-fold cross-validation strategy. Secondly, the learning machine is retrained on the whole training set using the optimal parameters. Finally, the trained learning machine is used to predict the testing set, and the generalized accuracy of the learning machine is reported [54,55].

(b) For smaller datasets where test data may not be available, researchers usually conduct a k-fold cross-validation on the whole training data and report the best cross-validation accuracy [54,56,57]. The detailed steps of the k-fold cross-validation strategy are as follows (see the sketch after this list). Firstly, the given training set W is randomly divided into k subsets of approximately equal size, W_1, W_2, …, W_k, where W_i ∩ W_j = ∅. Secondly, for each pair of model parameters, the given algorithm is run on W \ W_j, the trained learning machine is tested on W_j to obtain the testing accuracy T_j, where j runs from 1 to k, and the average testing accuracy T_average = (1/k) ∑_{j=1}^{k} T_j is computed. Thirdly, the maximum average testing accuracy and the corresponding model parameters are obtained.

(c) For larger datasets where test data may not be available, researchers usually partition the whole dataset into three parts: a training set, a validation set and a testing set [58,59]. The validation set is used to find the optimal parameters based on the k-fold cross-validation strategy, the training set is used to train the learning machine with the optimal parameters, and the testing set is used to test the generalized accuracy of the learning machine.

In our study, the datasets are small and test data are unavailable. Therefore, we use the second method to evaluate the generalized accuracy of the learning machines.
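The k-fold procedure of item (b) can be sketched as follows; train_fn and predict_fn are placeholders for an arbitrary learning machine and are not functions from the paper. The demo at the bottom uses a toy nearest-mean classifier purely to exercise the routine.

```python
# Sketch of the k-fold procedure of item (b); train_fn and predict_fn are
# placeholders for an arbitrary learning machine, not functions from the paper.
import numpy as np

def k_fold_accuracy(samples, labels, train_fn, predict_fn, k=10, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(samples)), k)   # W_1, ..., W_k
    accs = []
    for j in range(k):
        test_idx = folds[j]
        train_idx = np.concatenate([folds[i] for i in range(k) if i != j])
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])        # learn on W \ W_j
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        accs.append(correct / len(test_idx))                    # T_j
    return float(np.mean(accs))                                 # T_average

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = [rng.normal(loc=2 * c, size=5) for c in (0, 1) for _ in range(30)]
    y = [c for c in (0, 1) for _ in range(30)]

    def train(Xs, ys):       # toy nearest-mean "learning machine" for the demo
        return {c: np.mean([x for x, t in zip(Xs, ys) if t == c], axis=0) for c in set(ys)}

    def predict(model, x):
        return min(model, key=lambda c: np.linalg.norm(x - model[c]))

    print(k_fold_accuracy(X, y, train, predict, k=10))
```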
4.1. Experimental datasets

In the experiments, we use a total of twelve tensor datasets. Nine of them (Yale32×32, Yale64×64, ORL32×32, ORL64×64, C05, C07, C09, C27, and C29) are second-order face recognition datasets obtained from http://www.zjucadcg.cn/dengcai/Data/FaceData.html, and the other three (USFGait17_32×22×10, USFGait17_64×44×20, and USFGait17_128×88×20) are third-order gait recognition datasets obtained from https://sites.google.com/site/tensormsl/. The detailed information about these twelve datasets is listed in Table 3, in which the data source of each dataset is also given in the first column. To better understand the tensor structures of the experimental data, one example from each database is illustrated in Fig. 2.

As a preprocessing step, we scale each attribute to the range [0, 1] in order to facilitate a fair comparison.
We divide each dataset randomly into ten parts of approximately equal size while keeping the proportion of samples in each class, and use nine parts as the training set and the remaining part as the testing set. The two learning machines use the same training and testing sets. Using the aforementioned ALS algorithm, we decompose each tensor into R rank-one tensors.

4.2. Parameter settings

The two learning machines select the optimal model parameter from C ∈ {2^0, 2^1, 2^2, …, 2^15}. Considering that there is no known closed-form solution to determine the rank R of a tensor a priori [60], and that rank determination of a tensor is still an open problem [61,62], we use grid search to determine the optimal rank, where R ∈ {3, 4, 5, 6, 7, 8}.
Table 3
The detailed information of the experimental datasets.

Data source    Dataset                  Number of samples   Number of classes   Size
Yale-B         Yale32×32                165                 15                  32×32
               Yale64×64                165                 15                  64×64
ORL            ORL32×32                 400                 40                  32×32
               ORL64×64                 400                 40                  64×64
CMU PIE        C05                      3332                68                  64×64
               C07                      1629                68                  64×64
               C09                      1632                68                  64×64
               C27                      3329                68                  64×64
               C29                      1632                68                  64×64
USF HumanID    USFGait17_32×22×10       731                 71                  32×22×10
               USFGait17_64×44×20       731                 71                  64×44×20
               USFGait17_128×88×20      731                 71                  128×88×20
In the proposed GA, the population size is P = 4 + ∑_{i=1}^{N} I_i, the selection probability is P_s = 0.1, the crossover probability is P_c = 0.9, the mutation probability is P_m = 0.1, K = P, and the maximum iteration number is 30. To obtain an unbiased statistical result, all the optimal parameters are searched using a ten-fold cross-validation strategy.

All the programs are written in C++ and compiled with the Microsoft Visual Studio 2008 compiler. All the experiments are conducted on a computer with an Intel(R) Core(TM) i7-3770 3.40 GHz processor and 16 GB of RAM running Microsoft Windows 7.
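As an illustration of the grid search described above, the following sketch enumerates the (C, R) grid and keeps the pair with the best cross-validated accuracy; decompose_fn and cv_accuracy_fn are placeholders (standing in for the ALS rank-one decomposition and the ten-fold estimate) and are not functions from the paper. In the experiments, C_grid corresponds to {2^0, …, 2^15} and R_grid to {3, …, 8}.

```python
# Sketch of the grid search of this subsection over (C, R); decompose_fn stands
# in for the ALS rank-one decomposition and cv_accuracy_fn for the ten-fold
# cross-validation estimate. Both are placeholders, not functions from the paper.
def grid_search(samples, labels, decompose_fn, cv_accuracy_fn,
                C_grid=tuple(2.0 ** e for e in range(16)),     # C in {2^0, ..., 2^15}
                R_grid=(3, 4, 5, 6, 7, 8)):
    best = (None, None, -1.0)                                   # (C, R, accuracy)
    for R in R_grid:
        decomposed = [decompose_fn(x, R) for x in samples]      # decompose once per R
        for C in C_grid:
            acc = cv_accuracy_fn(decomposed, labels, C)
            if acc > best[2]:
                best = (C, R, acc)
    return best
```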
4.3. Classification performance

In this section, we conduct experiments on the twelve datasets to compare the performance of TFS-SHTM with that of the linear SHTM. To conduct a fair comparison, the OAO strategy is applied for multi-classification in both TFS-SHTM and the linear SHTM. Table 4 shows the experimental results for the twelve datasets, including the generalized accuracy, the training time, and the corresponding optimal parameters. The generalized accuracy and training time are averaged over 10 trials. The best generalized accuracy and training time among the two learning machines are highlighted in bold type. In order to demonstrate that TFS-SHTM can effectively remove redundant information from tensor data, on one hand, Table 5 reports which features in the different mode spaces are finally selected for the twelve datasets; on the other hand, comparisons of the original tensors and the corresponding tensors with feature subsets on the datasets Yale64×64, ORL64×64, C07, C09, C27 and C29 are illustrated in Fig. 3. In order to illustrate the result for a third-order tensor more clearly, a comparison of the original tensor and the corresponding tensor with feature subset on the dataset USFGait17_32×22×10 is shown in Fig. 4.
Fig. 2. Illustration of the tensor datasets: (a) the Yale64×64 samples, (b) the ORL64×64 samples, (c) the C05 samples, and (d) the USFGait17_32×22×10 samples.
Table 4
Comparison of the results of TFS-SHTM and SHTM on the twelve experimental datasets.

Dataset                  Learning machine   R   C        Generalized accuracy (%)   Training time (s)
Yale32×32                TFS-SHTM           7   1024     84.00                      0.06
                         SHTM               4   1024     79.33                      0.03
Yale64×64                TFS-SHTM           7   1        89.00                      0.12
                         SHTM               4   16       85.00                      0.06
ORL32×32                 TFS-SHTM           5   16       99.00                      0.29
                         SHTM               3   1        98.00                      0.17
ORL64×64                 TFS-SHTM           3   16,348   98.75                      0.22
                         SHTM               3   16       98.25                      0.32
C05                      TFS-SHTM           7   8        99.04                      59.38
                         SHTM               6   2        98.51                      67.85
C07                      TFS-SHTM           7   512      97.42                      18.16
                         SHTM               5   128      96.39                      13.84
C09                      TFS-SHTM           7   8        98.16                      17.46
                         SHTM               5   256      97.11                      14.08
C27                      TFS-SHTM           8   16       97.23                      24.72
                         SHTM               6   128      96.32                      20.00
C29                      TFS-SHTM           5   2        97.42                      9.58
                         SHTM               8   2        96.40                      41.57
USFGait17_32×22×10       TFS-SHTM           7   512      80.99                      6.15
                         SHTM               8   1        79.48                      10.71
USFGait17_64×44×20       TFS-SHTM           8   16       83.84                      12.03
                         SHTM               8   1        82.00                      19.71
USFGait17_128×88×20      TFS-SHTM           5   256      82.86                      7.27
                         SHTM               7   1        82.49                      26.76

From Table 4, we have the following observations:

• In terms of generalized accuracy, TFS-SHTM outperforms the linear SHTM on all the datasets. This indicates that TFS-SHTM can remove the redundant information in the tensors and improve the generalized accuracy of SHTM.

• In terms of training time, TFS-SHTM is faster than the linear SHTM on the ORL64×64, C05, C29, USFGait17_32×22×10, USFGait17_64×44×20, and USFGait17_128×88×20 datasets. On the other datasets, TFS-SHTM is slower than the linear SHTM. The main reason is that the optimal rank of TFS-SHTM is larger than that of SHTM on these datasets.

From Figs. 3 and 4 and Table 5, we can see that TFS-SHTM largely removes the redundant information of the tensor data while retaining the most important information for classification.

In statistical analysis [63,64], the Wilcoxon signed-ranks test is usually used to compare the significant differences between two learning machines. In this study, we use the Wilcoxon signed-ranks test to conduct a statistical comparison of TFS-SHTM and SHTM. Its statistic is computed as

$$z(a,b) = \frac{T(a,b) - N(N+1)/4}{\sqrt{\tfrac{1}{24} N(N+1)(2N+1)}}, \qquad (13)$$

where T(a,b) = min{R⁺(a,b), R⁻(a,b)}, R⁺(a,b) is the sum of ranks over the experimental datasets on which learning machine b outperforms learning machine a, and R⁻(a,b) is the sum of ranks for the opposite case; they are defined as

$$R^{+}(a,b) = \sum_{d_i > 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i), \qquad (14)$$

$$R^{-}(a,b) = \sum_{d_i < 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i), \qquad (15)$$

where d_i is the difference between the performance scores of the two learning machines on the ith experimental dataset and rank(d_i) is the rank value of |d_i|. The values of d_i and rank(d_i) on the twelve classification datasets are reported in Table 6. From Table 6, based on formulas (13)–(15), we obtain z(SHTM, TFS-SHTM) = −3.06 < −1.96. This shows that, at the significance level of 0.05, TFS-SHTM is significantly better than the linear SHTM in terms of generalized accuracy.
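For reference, the statistic of formulas (13)–(15) can be reproduced with a few lines of NumPy; the d_i values below are taken from Table 6, and the helper name is ours (scipy.stats.wilcoxon offers an equivalent test).

```python
# Reproducing formulas (13)-(15) in NumPy; the d_i values are the accuracy
# differences of Table 6 (TFS-SHTM minus SHTM). The helper name is ours, and
# scipy.stats.wilcoxon provides an equivalent test.
import numpy as np

def wilcoxon_z(d):
    d = np.asarray(d, dtype=float)
    N = len(d)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1              # ranks of |d_i| (no ties here)
    r_plus = ranks[d > 0].sum() + 0.5 * ranks[d == 0].sum()    # Eq. (14)
    r_minus = ranks[d < 0].sum() + 0.5 * ranks[d == 0].sum()   # Eq. (15)
    T = min(r_plus, r_minus)
    return (T - N * (N + 1) / 4.0) / np.sqrt(N * (N + 1) * (2 * N + 1) / 24.0)  # Eq. (13)

if __name__ == "__main__":
    d = [4.67, 4.00, 1.00, 0.50, 0.53, 1.03, 1.05, 0.91, 1.02, 1.51, 1.84, 0.37]
    print(round(wilcoxon_z(d), 2))                             # -3.06, as reported above
```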
Table 5
Feature subsets obtained by TFS-SHTM on the twelve experimental datasets (one binary mask per mode space; '1' means the feature is selected).

Yale32×32
  mode 1: 10011011010010110010100110000111
  mode 2: 01011100001011011100110001010001
Yale64×64
  mode 1: 1111011011101011001001100010000100000010111010100011010011001101
  mode 2: 1010111100111011000100001111111100001011101001010010000101001101
ORL32×32
  mode 1: 11011111101011001000110000001010
  mode 2: 10011111001101001100011010110110
ORL64×64
  mode 1: 0110001110110101001111000100000111100111101101101000001100111000
  mode 2: 1110000100001101000001101100101110010110111000111001100001000110
C05
  mode 1: 1110111001000000010110001100100110001001000000010000111100000101
  mode 2: 0010001101100101011000010111011100010100000010101000101000001110
C07
  mode 1: 1001111101100100111111100111011110011110000011100010010000001110
  mode 2: 1000001101111010101101000010100001010100000001001111001101001010
C09
  mode 1: 1101000100111100001101110100111000100000001000001000110000101010
  mode 2: 0101000011111111101111111100000100100001111111001110111001010001
C27
  mode 1: 1111101011010000111100111010011010001010001000011000101000111110
  mode 2: 0000100010000001110110010010110110010101001100110110111010001111
C29
  mode 1: 1110011010101011110000111010100010010000000000000000110000000100
  mode 2: 0010000010010110100111001111101110110001101110010110010111101110
USFGait17_32×22×10
  mode 1: 11111000111001101100000101101111
  mode 2: 0101111111101111101001
  mode 3: 1110110111
USFGait17_64×44×20
  mode 1: 1111001111100001011011101000100110101101100100001011000101111010
  mode 2: 11000100011101011010000011111110110100000011
  mode 3: 01000010110110110111
USFGait17_128×88×20
  mode 1: 10011001101111011101011100011000110011100110111101000000100110110000100011001101011010010000100001010111010100001000010011111000
  mode 2: 1110001001100011010000011000101010011111101010110010110111010101000000001100101110011110
  mode 3: 11100111011110110011
Fig. 3. Comparison of the original tensors and the corresponding tensors with feature subsets on the datasets Yale64×64, ORL64×64, C07, C09, C27 and C29: (a) the original tensor from Yale64×64; (b) the tensor with feature subset from Yale64×64; (c) the original tensor from ORL64×64; (d) the tensor with feature subset from ORL64×64; (e) the original tensor from C07; (f) the tensor with feature subset from C07; (g) the original tensor from C09; (h) the tensor with feature subset from C09; (i) the original tensor from C27; (j) the tensor with feature subset from C27; (k) the original tensor from C29; (l) the tensor with feature subset from C29. In the tensors with feature subsets, black pixels denote that the corresponding features have not been selected.
Fig. 4. Comparison of the original tensor and the corresponding tensor with feature subset on the dataset USFGait17_32×22×10: (a) the original tensor; (b) the tensor with feature subset. In (b), white pixels denote that the corresponding features have not been selected.

Table 6
The difference between the optimal average generalized accuracies of TFS-SHTM and SHTM, and the corresponding rank values, on the twelve classification datasets.

Dataset                  SHTM     TFS-SHTM   d_i     rank(d_i)
Yale32×32                79.33    84.00      4.67    12
Yale64×64                85.00    89.00      4.00    11
ORL32×32                 98.00    99.00      1.00    5
ORL64×64                 98.25    98.75      0.50    2
C05                      98.51    99.04      0.53    3
C07                      96.39    97.42      1.03    7
C09                      97.11    98.16      1.05    8
C27                      96.32    97.23      0.91    4
C29                      96.40    97.42      1.02    6
USFGait17_32×22×10       79.48    80.99      1.51    9
USFGait17_64×44×20       82.00    83.84      1.84    10
USFGait17_128×88×20      82.49    82.86      0.37    1
5. Conclusion and future work

In this paper, focusing on the fact that the linear SHTM applies the tensor rank-one decomposition to calculate the tensor inner product, we have designed a GA-based feature selection and model parameter optimization algorithm. The proposed algorithm can largely remove the redundant information in tensor data and make SHTM yield a better generalized accuracy. Experiments have been conducted on nine second-order face recognition datasets and three third-order gait recognition datasets to test the performance of TFS-SHTM. The results show that TFS-SHTM is more effective than the original linear SHTM in terms of generalized accuracy.

In future work, we will investigate reconstruction techniques for tensor data so that TFS-SHTM can handle high-dimensional vector data more effectively. Another interesting topic is to design tensor kernels for SHTM so as to obtain better generalization performance. Further study on this topic will also include applications of TFS-SHTM to real-world classification problems with tensor representations.

Acknowledgments

The work presented in this paper is supported by the National Natural Science Foundation of China (61273295), the Major Project of the National Social Science Foundation of China (11&ZD156), and the Open Project of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Chinese Ministry of Education (93K-17-2009-K04).
References [1] M. Turk and A. Pentland, Face recognition using eigenfaces, in: Proceedings of CVPR1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1991, pp. 586–591. [2] M. Felsberg, Low-level image processing with the structure multivector (Doctoral dissertation), Institute of Computer Science and Applied Mathematics, Christian-Albrechts-University, Kiel, 2002 (TR no. 0203). [3] D. Gavrila, The visual analysis of human movement: a survey, Comput. Vis. Image Underst. 73 (1) (1999) 82–92. [4] K. Plataniotis, A. Venetsanopoulos, Color Image Processing and Applications, Springer-Verlag, Berlin, Germany, 2000. [5] R. Green, L. Guan, Quantifying and recognizing human movement patterns from monocular video images-Part II: applications to biometrics, IEEE Trans. Circuits Syst. Video Technol. 14 (2) (2004) 191–198.
[6] R. Chellappa, A. Roy-Chowdhury, S. Zhou, Recognition of humans and their activities using video, synthesis lectures on image, Video Multimed. Process. 1 (1) (2005) 1–173. [7] P. Negi, D. Labate, 3-D discrete shearlet transform and video processing, IEEE Trans. Image Process. 21 (6) (2012) 2944–2954. [8] S. Sarkar, P. Phillips, Z. Liu, I. Vega, P. Grother, K. Bowyer, The humanid gait challenge problem: data sets, performance, and analysis, IEEE Trans. Pattern Anal. Mach. Intel. 27 (2) (2005) 162–177. [9] H.P. Lu, K. Plataniotis, A. Venetsanopoulos, MPCA: multilinear principal component analysis of tensor objects, IEEE Trans. Neural Netw. 19 (1) (2008) 18–39. [10] N. Renard, S. Bourennane, Dimensionality reduction based on tensor modeling for classification methods, IEEE Trans. Geosci. Remote Sens. 47 (4) (2009) 1123–1131. [11] M. Kim, J. Jeon, J. Kwak, M. Lee, C. Ahn, Moving object segmentation in video sequences by user interaction and automatic object tracking, Image Vis. Comput. 19 (5) (2001) 245–260. [12] H. Wang and N. Ahuja, Compact representation of multidimensional data using tensor rank-one decomposition, in: Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), 2004, pp. 44–47. [13] A. Shashua and A. Levin, Linear image coding for regression and classification using the tensor-rank principle, in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001, pp. 42–49. [14] T. Kolda, B. Bader, Tensor decompositions and applications, SIAM Rev. 51 (3) (2009) 455–500. [15] D. Tao, X. Li, W. Hu, S. Maybank, and X. Wu, Supervised tensor learning, in: Proceedings of 15th IEEE International Conference on Data Mining, Houston, Texas, USA, 2005, pp. 450–457. [16] D. Cai, X. He, J. Han, Learning with tensor representation, Technical Report UIUCDCS-R-2006-2716, 2006. [17] Z. Wang, S. Chen, New least squares support vector machines based on matrix patterns, Neural Process. Lett. 26 (1) (2007) 41–56. [18] D. Tao, X. Li, X. Wu, W. Hu, S. Maybank, Supervised tensor learning, Knowl. Inf. Syst. 13 (1) (2007) 1–42. [19] C. Cortes, V. Vapnik, Support vector networks, Mach. Learn. 20 (3) (1995) 273–297. [20] B. Schö lkopf, A. Smola, R. Williamson, P. Bartlett, New support vector algorithms, Neural Comput. 12 (5) (2000) 1207–1245. [21] J. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (3) (1999) 293–300. [22] Y. Liu, F. Wu, Y. Zhuang, and J. Xiao, Active post-refined multimodality video semantic concept detection with tensor representation, in: Proceedings of the 16th ACM Conference on Multimedia, 2008, pp. 91–100. [23] X.S. Zhang, X.B. Gao, Y. Wang, Twin support tensor machines for MCS detection, J. Electron. 26 (3) (2009) 318–325. [24] F. Wu, Y.N. Liu, Y.T. Zhuang, Tensor-based transductive learning for multimodality video semantic concept detection, IEEE Trans. Multimed. 11 (5) (2009) 868–878. [25] L.F. Zhang, L.P. Zhang, D.C. Tao, X. Huang, A multifeature tensor for remotesensing target recognition, IEEE Geosci. Remote Sens. Lett. 8 (2) (2011) 374–378. [26] Y. Liu, Y. Liu, K.C.C. Chan, Tensor-based locally maximum margin classifier for image and video classification, Comput. Vis. Image Underst. 115 (3) (2011) 300–309. [27] L. Wolf, H. Jhuang, and T. Hazan, Modeling appearances with low-rank SVM, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–6. [28] H. Pirsiavash, D. Ramanan, C. 
Fowlkes, Bilinear classifiers for visual recognition, in: Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, A. Culotta (eds), Advances in Neural Information Processing Systems, 2009, pp. 1482–1490. [29] I. Kotsia, W.W. Guo, I. Patras, Higher rank support tensor machines for visual recognition, Pattern Recognit. 45 (12) (2012) 4192–4203. [30] W.W. Guo, I. Kotsia, I. Patras, Tensor Learning for Regression, IEEE Trans. Image Process. 21 (2) (2012) 816–827. [31] Z.F. Hao, L.F. He, B.Q. Chen, X.W. Yang, A linear support higher-order tensor machine for classification, IEEE Trans. Image Process. 22 (7) (2013) 2911–2920. [32] C.I. Chang, H. Safavi, Progressive dimensionality reduction by transform for hyperspectral imagery, Pattern Recognit. 44 (10) (2011) 2760–2773. [33] L. De Lathauwer, Signal processing based on Multilinear Algebra (Ph.D. thesis), Katholieke Universiteit Leuven, 1997. [34] S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang, H.J. Zhang, Multilinear discriminant analysis for face recognition, IEEE Trans. Image Process. 16 (1) (2007) 212–220. [35] H.P. Lu, K. Plataniotis, A. Venetsanopoulos, A taxonomy of emerging multilinear discriminant analysis solutions for biometric signal recognition, Biometrics: Theory Methods, and Applications 200921–45.
[36] S. Goreinov, E. Tyrtyshnikov, N. Zamarashkin., A theory of pseudoskeleton approximations, Linear Algebra Appl. 261 (1) (1997) 1–21. [37] T. Zhang, G. Golub, Rank-one approximation to high order tensors, SIAM J. Matrix Anal. Appl. 23 (2) (2001) 534–550. [38] H. Wang and N. Ahuja, Rank-R approximation of tensors using image-asmatrix representation, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, pp. 346–353. [39] U. Krebel, Pairwise classification and support vector machines, Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge MA (1999) 255–268. [40] G. John, R. Kohavi, and K. Peger, Irrelevant features and the subset selection problem, in: Proceedings of the 11th International Conference on Machine Learning, San Mateo, 1994, CA, pp. 121–129. [41] R. Kohavi, G. John, Wrappers for feature subset selection, Artificial Intelligence 97 (1–2) (1997) 273–324. [42] L. Davis, Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York, USA (1991) 10–80. [43] K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evolut. Comput. 6 (2) (2002) 182–197. [44] C.L. Huang, C.J. Wang, A GA-based feature selection and parameters optimization for support vector machines, Expert Syst. Appl. 31 (2) (2006) 231–240. [45] Z. Wang, Y.H. Shao, T.R. Wu, A GA-based model selection for smooth twin parametric-margin support vector machine, Pattern Recognit. 46 (2013) 2267–2277. [46] K.Y. Lee, F.F. Yang, Optimal reactive power planning using evolutionary algorithms: a comparative study for evolutionary programming, evolutionary strategy, genetic algorithm, and linear programming, IEEE Trans. Power Syst. 13 (1) (1998) 101–108. [47] J.H. Zhong, X.H. Hu, M. Gu, J. Zhang, Comparison of performance between different selection strategies on simple genetic algorithms, in: Proceedings of the 2005 International Conference on Computational Intelligence for Modelling, Control and Automation, and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC’05), 2005, vol. 2, pp. 1115–1121. [48] W. Chinnasri, N. Sureerattanan, Comparison of performance between different selection strategies on genetic algorithm with course timetabling problem, in: Proceedings of the IEEE International Conference on Advanced Management Science (ICAMS), 2010, vol. 2, pp. 105–108. [49] G. Sywerda, Uniform crossover in genetic algorithms, in: Proceedings of the 3rd International Conference on Genetic Algorithms, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1989, pp. 2–9. [50] S. Picek, M. Golub, Comparison of a crossover operator in binary-coded genetic algorithms, WSEAS Trans. Comput. 9 (9) (2010) 1064–1073. [51] B.H.F. Hasan, M.S.M. Saleh, Evaluating the effectiveness of mutation operators on the behavior of genetic algorithms applied to non-deterministic polynomial problems, Informatica 35 (4) (2011) 513–518. [52] B.R. Rajakumar, Impact of static and adaptive mutation techniques on the performance of genetic algorithm, Int. J. Hybrid Intel. Syst. 10 (1) (2013) 11–22. [53] P. Kroonenberg, J. Leeuw, Principal component analysis of three-mode data by means of alternating least squares algorithms, Psychometrika 45 (1) (1980) 69–97. [54] C.W. Hsu, C.J. Lin, A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Netw. 13 (2) (2002) 415–425. [55] R. Batuwita, V. 
Palade, FSVM-CIL: fuzzy support vector machines for class imbalance learning, IEEE Trans. Fuzzy Syst. 18 (3) (2010) 558–571. [56] R. Debnath, N. Takahide, H. Takahashi, A decision based one-against-one method for multi-class support vector machine, Pattern Anal. Appl. 7 (2) (2004) 164–175. [57] X.W. Yang, G.Q. Zhang, J. Lu, J. Ma, A kernel fuzzy c-means clustering based fuzzy support vector machine algorithm for classification problems with outliers or noises, IEEE Trans. Fuzzy Syst. 19 (1) (2011) 105–115. [58] F. Chang, C.Y. Guo, X.R. Lin, C.J. Lu, Tree decomposition for large-scale SVM problems, J. Mach. Learn. Res. 11 (2010) 2935–2972. [59] G.C. Wu, F.Z. Xiao, J.Q. Xi, X.W. Yang, L.F. He, H.R. Lv, X.L. Liu, A hierarchical clustering and fixed-layer local learning based support vector machine algorithm for large scale classification problems, J. Donghua Univ. 29 (2) (2012) 46–50. [60] M. Kilmer, C. Martin, Factorization strategies for third-order tensors, Linear Algebra Appl. 435 (3) (2011) 641–658. [61] V. Silva, L.-H. Lim, Tensor rank and the ill-posedness of the best low-rank approximation problem, SIAM J. Matrix Anal. Appl. 30 (3) (2008) 1084–1127. [62] C. Martin, The rank of a 2 2 2 tensor, Linear Multilinear Algebra 59 (8) (2011) 943–950. [63] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30. [64] Y. Hochberg, A sharper Bonferroni procedure for multiple tests of significance, Biometrika 75 (4) (1988) 800–802.
Tengjiao Guo received the B.S. degree in mathematics and applied mathematics from South China University of Technology, Guangzhou, China, in 2011. He is currently a graduate student in computational mathematics in the Department of Mathematics, South China University of Technology. His research interests cover machine learning and pattern recognition.
Le Han received the B. S. degree in pure mathematics and the M. Sc. degree in computational mathematics from Wuhan University in 1999 and 2002, respectively, and the Ph.D. degree in computational mathematics from Sun Yat-sen University in 2008. She is a lecturer in the Department of Mathematics, South China University of Technology. Her research interests include matrix optimization and computer graphics.
Lifang He received the B. S. degree in information and computing science from the Northwest Normal University in 2009. She is currently a Ph.D. candidate in the Department of Computer Science and Engineering, South China University of Technology. Her current research interests include manifold learning, machine learning, tensor learning, and pattern recognition.
Xiaowei Yang received the B.S. degree in theoretical and applied mechanics, the M.Sc. degree in computational mechanics, and the Ph.D. degree in solid mechanics from Jilin University, Changchun, China, in 1991, 1996, and 2000, respectively. He is currently a full-time professor in the Department of Mathematics, South China University of Technology. His current research interests include designs and analyses of algorithms for large-scale pattern recognition, imbalanced learning, semi-supervised learning, support vector machines, tensor learning, and evolutionary computation. He has published more than 90 journal and refereed international conference articles, including the areas of structural reanalysis, interval analysis, soft computing, support vector machines, and tensor learning.