Combining Support Vector Regression with Feature Selection for Multivariate Calibration

Guo-Zheng Li¹*, Hao-Hua Meng¹, Mary Qu Yang², Jack Y. Yang³

¹ School of Computer Engineering and Science, Shanghai University, Shanghai 200072, China
² National Human Genome Research Institute, National Institutes of Health (NIH), U.S. Department of Health and Human Services, Bethesda, MD 20852, USA
³ Harvard Medical School, Harvard University, Cambridge, Massachusetts 02140, USA
* Corresponding author. Tel: +86-21-5633-5263; Fax: +86-21-5633-3061; Email: [email protected]
Abstract. Multivariate calibration is a classic problem in the analytical chemistry field and has frequently been solved by Partial Least Squares and Artificial Neural Networks in previous works. The special difficulty of multivariate calibration is high dimensionality with a small sample. Here we apply Support Vector Regression (SVR), as well as Artificial Neural Networks and Partial Least Squares, to the multivariate calibration problem of determining the three aromatic amino acids (phenylalanine, tyrosine and tryptophan) in their mixtures by fluorescence spectroscopy. The results of the leave-one-out method show that SVR performs better than the other methods and appears to be a good method for this task. Furthermore, feature selection is performed for SVR to remove redundant features, and a novel algorithm named PRIFER (Prediction RIsk based FEature selection for support vector Regression) is proposed. Results on the above multivariate calibration data set show that PRIFER is a powerful tool for solving multivariate calibration problems.
1 Introduction
In the analytical chemistry field, multivariate calibration methods provide a convenient way to determine several components of a mixture in a single experimental step, without the tedious pre-separation of these components. The calculation method usually used is Partial Least Squares (PLS) [1, 2]. Artificial Neural Networks (ANNs) [3, 4] are also often used, especially when the data set exhibits obvious nonlinearity, but they are prone to overfitting. Several techniques have therefore been developed to prevent overfitting. Early stopping is one of them; it was introduced into the chemistry field by Tetko et al. [5], and ANNs with early stopping achieved better results than naive ANNs did. ANNs with regularization techniques [6, 7] have also reached such a stage of maturity that they can be used in the chemistry field. In the last few years, a new family of algorithms, Support Vector Machines (SVMs) [8, 9], has been proposed and developed by Vapnik and his coworkers. SVMs are based on statistical learning theory and excel at handling small-sample training data sets. They have been successfully applied to several challenging real-world problems and have delivered state-of-the-art performance [10]; among them, Support Vector Regression (SVR) was proposed to handle regression problems.

In this paper, we compare the following regression techniques: SVR without a kernel function and SVR with the Gauss kernel function; ANNs with weight decay, ANNs with early stopping and naive ANNs; as well as PLS. They are used for multivariate calibration in the simultaneous determination of the aromatic amino acids (phenylalanine, tyrosine and tryptophan), and the results are obtained by the leave-one-out method to test the relative accuracy of these algorithms. Furthermore, we perform feature selection on the data set: a novel algorithm, PRIFER (Prediction RIsk based FEature selection for support vector Regression), is proposed by combining prediction-risk-based embedded feature selection with SVR, and it is compared with a filter feature selection algorithm, mutual information based feature selection [11], on the multivariate calibration task.

Amino acids are the structural units of proteins, and some of them are used as drugs or food additives, so the determination of amino acids is useful both for biochemical research and for commercial product analysis. Among the essential amino acids there are three aromatic amino acids, phenylalanine, tyrosine and tryptophan, which exhibit fluorescence when excited by ultraviolet rays, so it is possible to determine them by the fluorescence spectroscopic method. The λmax of phenylalanine, tyrosine and tryptophan are 282 nm, 303 nm and 348 nm respectively, but their fluorescence spectra partially overlap [12]. Since the separation of these three amino acids is rather tedious and troublesome, it is desirable to use a multivariate calibration method to determine them in mixtures by fluorescence spectroscopy.

In this work, a comparative study of SVR, ANNs and PLS for the multivariate calibration of the spectrofluorimetric determination of aromatic amino acids is presented, and furthermore PRIFER is proposed and applied to this task. The rest of this paper is arranged as follows: Section 2 introduces the experimental data sets; Section 3 briefly describes the learning algorithms used in this paper; Section 4 gives the leave-one-out results of the compared algorithms; Section 5 proposes the novel algorithm PRIFER, and its computational results are shown in Section 6. At last, we end with a discussion of the results in Section 7.
2 Experimental data sets
Standard solutions of phenylalanine (1012 µg/ml), tyrosine (256 µg/ml) and tryptophan (250 µg/ml) are added successively into a volumetric flask, and 4 ml of phosphate buffer solution is added to adjust the pH to 7.4. Doubly distilled water is then added until the final volume reaches 25 ml. The wavelength of the excitation light is 216.6 nm. The fluorescence spectra are scanned on a Hitachi 850 fluorescence spectrometer. The instrument parameters are selected as in Table 1.
Table 1. Parameters of the instrument used

Parameter       Value
EX slit width   2 nm
EM slit width   20 nm
Scan speed      480 nm/min
Response time   2 seconds
The fluorescence intensities of 23 examples at 13 selected wavelengths are measured for the multivariate calibration. The components of the 23 examples are listed in Table 2 and the corresponding fluorescence intensities are shown in Figure 1.
Fig. 1. Fluorescence spectra of the 23 examples (horizontal axis: wavelength in nm, 250–700 nm; vertical axis: relative intensity).
3 Computational Methods
In this work, Support Vector Regression (SVR), Artificial Neural Network (ANN) and Partial Least Squares (PLS) methods are used to process the spectral data shown in Figure 1. Results obtained by the leave-one-out method are used to compare the accuracy of these methods, which are briefly described as follows.
Table 2. Components of the used samples (µg/ml)

No.  Tyr.    Try.    Phe.
1    2.0040  0.0512  4.0480
2    1.5030  0.1024  3.5420
3    1.0020  0.2560  3.0360
4    0.5010  0.5120  2.5300
5    0.2505  1.0240  2.0240
6    0.1002  1.5360  1.5180
7    0.0501  2.0480  1.0120
8    0.1002  1.5360  4.0480
9    0.2004  2.0480  0.5060
10   2.0040  0.0205  0.5060
11   0.4008  2.0480  6.0720
12   1.0020  1.0240  5.0600
13   4.0080  0.1536  2.0240
14   3.0060  0.2048  2.0240
15   0.1002  2.0480  4.0480
16   0.3006  2.0480  4.0480
17   0.2004  1.5360  3.0360
18   0.5010  1.0480  2.5300
19   1.0020  0.5120  2.0240
20   1.5030  0.3072  4.5540
21   2.0040  0.2048  3.0360
22   2.5050  0.1024  1.0120
23   2.0040  0.1048  2.0240

3.1 Support Vector Regression
Compared with other machine learning methods such as ANNs and PLS, Support Vector Regression (SVR) has several advantages: 1) the ε-insensitive loss function: in the SVR computation, any regression residual smaller than some small value ε is considered meaningless, which prevents the noise in the training set from influencing the mathematical model; 2) the principle of flatness: the norm of the weight vector is minimized in order to prevent the magnification of the error on the training sample; 3) the application of kernel functions: introducing a kernel function allows nonlinear data sets to be treated by linear algorithms.

Given a training sample denoted by $S = ((x_1, y_1), \ldots, (x_\ell, y_\ell)) \subseteq (X \times Y)^\ell$, where $X \subseteq \mathbb{R}^m$ and $Y \subseteq \mathbb{R}$, the regression function is written as $y = \langle w \cdot x \rangle + b$, where $w$ is called the weight vector and $b$ is a bias. In this paper, we use the 2-norm soft margin slack variable version of SVMs [13]. By fixing the margin at 1 and minimizing the weight vector, we get the optimization problem

$$
\begin{aligned}
\text{minimize} \quad & \|w\|^2 + C \sum_{i=1}^{\ell} (\xi_i^2 + \hat{\xi}_i^2), \\
\text{subject to} \quad & (\langle w \cdot x_i \rangle + b) - y_i \le \varepsilon + \xi_i, \quad i = 1, \ldots, \ell, \\
& y_i - (\langle w \cdot x_i \rangle + b) \le \varepsilon + \hat{\xi}_i, \quad i = 1, \ldots, \ell, \\
& \xi_i, \hat{\xi}_i \ge 0, \quad i = 1, \ldots, \ell,
\end{aligned}
\tag{1}
$$
where $\xi_i, \hat{\xi}_i$ are the margin slack variables, $C$ measures the trade-off between model complexity and loss minimization, and $\varepsilon$ is the fixed parameter of the ε-insensitive loss function, which is defined as

$$L_\varepsilon(x, y, f) = |y - f(x)|_\varepsilon = \max(0, |y - f(x)| - \varepsilon).$$

By building a Lagrangian and using the Karush-Kuhn-Tucker (KKT) complementarity conditions [14, 15], we can solve the optimization problem (1). Because of the KKT conditions, only those Lagrange multipliers $\alpha_i, \hat{\alpha}_i$ that make a constraint active are non-zero; the points corresponding to the non-zero $\alpha_i, \hat{\alpha}_i$ are called support vectors. Therefore we can describe the regression function in terms of $\alpha$ and $b$:

$$y = \sum_{i=1}^{\ell} (\alpha_i - \hat{\alpha}_i) \langle x_i \cdot x \rangle + b.$$

If we replace $\langle x_i \cdot x \rangle$ with some function $K(x_i, x)$ that satisfies Mercer's condition [16], then the regression function can be written as

$$y = \sum_{i=1}^{\ell} (\alpha_i - \hat{\alpha}_i) K(x_i, x) + b,$$

where $K(x, z)$ is the kernel function. A commonly used kernel function is the Gauss kernel, $K(x, z) = \exp(-\|x - z\|^2/\sigma^2)$, which is also recommended by C.J. Lin and coworkers [17]. The SVR software used in this paper is the mySVM package [18]. For more details about kernel machines, please refer to http://www.kernel-machines.org.
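As an illustration only, the sketch below shows how an ε-insensitive SVR with a Gauss kernel of the above form could be set up in Python with scikit-learn. This is not the mySVM configuration used in the paper: scikit-learn's SVR solves the standard 1-norm soft-margin problem rather than the 2-norm variant of (1), and the parameter names C, epsilon and gamma = 1/σ² are scikit-learn's, mapped here from the paper's notation; the data in the usage example are random placeholders.

```python
# Hedged sketch: a Gauss-kernel epsilon-SVR in scikit-learn as a stand-in for mySVM.
# The kernel exp(-||x - z||^2 / sigma^2) corresponds to the RBF kernel with
# gamma = 1 / sigma^2; epsilon is an assumed value, not taken from the paper.
import numpy as np
from sklearn.svm import SVR

def make_gauss_svr(C=1000.0, sigma=8.0, epsilon=0.01):
    """Build an SVR with the Gauss kernel K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    return SVR(kernel="rbf", C=C, gamma=1.0 / sigma**2, epsilon=epsilon)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((23, 13))   # 23 examples, 13 wavelengths (placeholder data)
    y = rng.random(23)         # one amino-acid concentration (placeholder)
    model = make_gauss_svr().fit(X, y)
    print(model.predict(X[:3]))
```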
3.2 Artificial Neural Networks with regularization techniques
A simple way to regularize ANNs is to add a regularization term to the objective function during training; the objective function MSEreg can then be written as

$$\mathrm{MSE}_{\mathrm{reg}} = \alpha\,\mathrm{MSE} + \beta\,\mathrm{MSW},$$

where $\alpha$ and $\beta$ control the relative contributions of MSE and MSW. MSE is the objective function of traditional training, written as

$$\mathrm{MSE} = \frac{1}{\ell} \sum_{i=1}^{\ell} (y_i^e - y_i)^2,$$

where $y_i^e$ is the predicted value, $y_i$ is the real target value, and $\ell$ is the total number of examples in the training data set. MSW is often defined [6] as

$$\mathrm{MSW} = \frac{1}{M} \sum_{j=1}^{M} \omega_j^2,$$

where $\omega_j$ are the weights of the ANNs and $M$ is the number of weights. The new objective function forces the ANNs to have small weights, so the response is smoother and overfitting is suppressed.

The above algorithm is often called weight decay, and one of its problems is the choice of $\alpha$ and $\beta$. A simple method is a grid search, which divides the parameter space into several cells and trains the ANNs with one pair of parameters from each cell in turn. This method is computationally heavy and cannot find the optimal pair. Foresee and Hagan [7] proposed to compute the parameters in the Bayesian learning framework, as

$$\alpha^* = \frac{\gamma}{2\,\mathrm{MSE}(\omega^*)} \quad \text{and} \quad \beta^* = \frac{M - \gamma}{2\,\mathrm{MSW}(\omega^*)},$$

where $\gamma = M - 2\alpha^*\,\mathrm{tr}(\mathbf{H}^*)^{-1}$ is the effective number of parameters, which measures how many parameters are actually used by the neural network when training with the objective function MSEreg and ranges from zero to $M$, and $\mathbf{H} = \alpha\nabla^2\mathrm{MSE} + \beta\nabla^2\mathrm{MSW}$ is the Hessian matrix of the objective function. The star notation means that the quantities are computed at the minimum point of the training. The above algorithm has been implemented in the neural network toolbox of MATLAB [19].
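MATLAB's Bayesian-regularization training has no direct Python equivalent, so the following is only a minimal sketch of the simpler weight-decay idea from this subsection: an L2 penalty on the weights whose strength is picked by a small grid search scored with leave-one-out error. It assumes scikit-learn's MLPRegressor, whose alpha parameter plays the role of the weight-penalty coefficient; it does not implement the Bayesian update of α* and β*.

```python
# Hedged sketch: weight decay for a small ANN via MLPRegressor's L2 penalty (alpha).
# The alpha grid mimics the "grid search" strategy described in the text; it is
# not the Bayesian alpha*/beta* update of Foresee and Hagan.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

def fit_weight_decay_ann(X, y, alphas=(1e-4, 1e-3, 1e-2, 1e-1, 1.0)):
    """Grid-search the L2 penalty by leave-one-out MSE, then refit on all data."""
    best_alpha, best_mse = None, np.inf
    for a in alphas:
        net = MLPRegressor(hidden_layer_sizes=(10,), alpha=a,
                           max_iter=5000, random_state=0)
        scores = cross_val_score(net, X, y, cv=LeaveOneOut(),
                                 scoring="neg_mean_squared_error")
        mse = -scores.mean()
        if mse < best_mse:
            best_alpha, best_mse = a, mse
    return MLPRegressor(hidden_layer_sizes=(10,), alpha=best_alpha,
                        max_iter=5000, random_state=0).fit(X, y)
```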
3.3 Artificial Neural Networks with early stopping
Artificial Neural Networks with early stopping [20] rely on the observation that, as training proceeds, the error on the training data set keeps decreasing while the error on a held-out validation data set first decreases and then increases. If we stop the training process before the error on the validation data set starts to increase, we can avoid overfitting and improve the prediction accuracy. In practice, the validation data set is generated from the training data set by some strategy, and whether training is stopped depends on the error on the validation data set. In chemometrics, Tetko et al. [5] used early stopping to overcome the overfitting of ANNs, and ANNs with early stopping achieved better results than naive ANNs did. In order to compare with the other algorithms, we implement the algorithm in MATLAB [19]. A back-propagation algorithm based on the Scaled Conjugate Gradient method is used to train the neural networks. The training data set is randomly split into two equal parts, one used as the training data set and the other as the validation data set. Other parameters are kept at their MATLAB defaults.
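For the early-stopping variant, a comparable sketch (again an assumption, not the MATLAB setup used in the paper; MLPRegressor does not offer the Scaled Conjugate Gradient optimizer) uses MLPRegressor's built-in early stopping with half of the training data held out internally as a validation set, mirroring the 50/50 split described above.

```python
# Hedged sketch: ANN with early stopping; half of the training data is held out
# internally as a validation set and training stops once its score stops improving.
from sklearn.neural_network import MLPRegressor

def make_early_stopping_ann():
    return MLPRegressor(hidden_layer_sizes=(10,),
                        early_stopping=True,
                        validation_fraction=0.5,   # 50/50 split as in the text
                        n_iter_no_change=10,
                        max_iter=5000,
                        random_state=0)
```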
3.4 Partial Least Square method
In this work, Partial Least Squares (PLS) is used as a baseline method on the same data for comparison. The PLS software used in this work is implemented in MATLAB [21]. We choose the number of latent variables with the smallest PRESS, which is 11.
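The choice of the number of latent variables by the smallest PRESS can be reproduced approximately with the sketch below, assuming scikit-learn's PLSRegression in place of the N-way toolbox; PRESS is computed here as the sum of squared leave-one-out prediction errors.

```python
# Hedged sketch: pick the number of PLS latent variables by the smallest PRESS,
# where PRESS is the sum of squared leave-one-out prediction errors.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def select_pls_components(X, Y, max_components=12):
    """Y: targets of shape (n_samples,) or (n_samples, n_targets)."""
    Y = np.asarray(Y, float)
    if Y.ndim == 1:
        Y = Y[:, None]
    best_k, best_press = None, np.inf
    for k in range(1, max_components + 1):
        Y_pred = cross_val_predict(PLSRegression(n_components=k), X, Y,
                                   cv=LeaveOneOut())
        press = np.sum((Y - Y_pred) ** 2)
        if press < best_press:
            best_k, best_press = k, press
    return best_k, best_press
```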
3.5 Assessment of regression quality
The regression accuracy of the above learning algorithms is compared using the following error measures.

Root mean square error (RMSE): for the jth component it is defined as

$$\mathrm{RMSE}_j = \sqrt{\frac{1}{\ell} \sum_{i=1}^{\ell} (y_{ij}^e - y_{ij})^2},$$

and for the whole it is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \mathrm{RMSE}_j^2}, \tag{2}$$

where $y_{ij}^e$ is the jth predicted target value of the ith example, $y_{ij}$ is the jth real target value of the ith example, $\ell$ denotes the number of examples, and $n$ denotes the number of target values of each example, which is 3 in this paper.

Mean absolute error (MAE): for the jth component it is defined as

$$\mathrm{MAE}_j = \frac{1}{\ell} \sum_{i=1}^{\ell} |y_{ij}^e - y_{ij}|,$$

and for the whole it is

$$\mathrm{MAE} = \frac{1}{n} \sum_{j=1}^{n} \mathrm{MAE}_j,$$

where $y_{ij}^e$, $y_{ij}$, $\ell$ and $n$ have the same meaning as in the definition of RMSE above.
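These error measures translate directly into code. The following helper is an illustrative implementation (not taken from the paper's software), with y_true and y_pred as arrays of shape (number of examples, number of components); it computes the per-component and overall RMSE and MAE exactly as in Equation (2) and the MAE formulas.

```python
# Illustrative implementation of the per-component and overall RMSE/MAE measures.
import numpy as np

def rmse_mae(y_true, y_pred):
    """y_true, y_pred: arrays of shape (n_examples, n_components)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rmse_j = np.sqrt(np.mean((y_pred - y_true) ** 2, axis=0))  # RMSE_j per component
    mae_j = np.mean(np.abs(y_pred - y_true), axis=0)           # MAE_j per component
    rmse_total = np.sqrt(np.mean(rmse_j ** 2))                 # Eq. (2)
    mae_total = np.mean(mae_j)
    return rmse_j, mae_j, rmse_total, mae_total
```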
4 Computational Results by SVR, ANNs and PLS
The data set of Section 2 has been processed by the SVR, ANN and PLS variants listed in Table 3. For the Try., Tyr. and Phe. components, SVR uses the parameters (C = 10, σ = 8), (C = 1000, σ = 15) and (C = 1000, σ = 8), respectively. Leave-one-out is used as the validation method. The MAE and RMSE results of the listed algorithms are given in Table 4.
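As a sketch of how this leave-one-out evaluation could be reproduced, one Gauss-kernel SVR is fitted per component with the (C, σ) pairs quoted above, under the same scikit-learn assumptions as the Section 3.1 sketch (the epsilon value is assumed, and X and Y are placeholders for the measured spectra and concentrations, which are not reproduced here); the resulting prediction matrix can be passed to the rmse_mae helper of Section 3.5.

```python
# Hedged sketch: leave-one-out predictions from per-component Gauss-kernel SVRs
# with the (C, sigma) pairs quoted in the text; X (23 x 13 spectra) and
# Y (23 x 3 concentrations) are placeholders, not the paper's data.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import LeaveOneOut, cross_val_predict

PARAMS = {"Tyr.": (1000.0, 15.0), "Try.": (10.0, 8.0), "Phe.": (1000.0, 8.0)}

def loo_predictions(X, Y, names=("Tyr.", "Try.", "Phe.")):
    """Return leave-one-out predictions, one SVR per component, column-stacked."""
    cols = []
    for j, name in enumerate(names):
        C, sigma = PARAMS[name]
        model = SVR(kernel="rbf", C=C, gamma=1.0 / sigma**2, epsilon=0.01)
        cols.append(cross_val_predict(model, X, Y[:, j], cv=LeaveOneOut()))
    return np.column_stack(cols)
```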
Table 3. Learning algorithms used in this paper

Algorithm  Description
L-SVR      SVR without any kernel function
G-SVR      SVR with the Gauss kernel function
BNN        ANNs with regularization in the Bayesian learning frame
WD         ANNs with weight decay
ES         ANNs with early stopping
ANN        ANNs without any technique to prevent overfitting
PLS        Partial Least Squares
Table 4. Comparison of the results by SVR, ANNs and PLS

           Tyr.            Try.            Phe.            Total
Algorithm  RMSE    MAE     RMSE    MAE     RMSE    MAE     RMSE    MAE
L-SVR      0.0766  0.0637  0.1334  0.0998  0.3762  0.2644  0.2347  0.1427
G-SVR      0.1249  0.0893  0.1308  0.0882  0.2781  0.2093  0.1915  0.1289
BNN        0.1558  0.1130  0.1677  0.1269  0.2632  0.2140  0.2014  0.1513
WD         0.2182  0.1355  0.1887  0.1406  0.2906  0.2270  0.2364  0.1677
ES         0.2881  0.1672  0.1618  0.1240  0.5057  0.3635  0.3488  0.2182
ANN        0.1950  0.1218  0.1873  0.1378  0.3240  0.2662  0.2436  0.1753
PLS        0.1177  0.0842  0.2769  0.2074  0.4260  0.3313  0.3011  0.2076
From Table 4, we can see that: 1) SVR with the Gauss kernel function gives the best performance of all the algorithms, ANNs with a regularization term in the Bayesian learning frame give the second best performance, and the results of PLS (a linear algorithm) are not very good in this case; 2) SVR performs better than ANNs overall; 3) SVR with the Gauss kernel function performs better than linear SVR does; 4) ANNs with regularization techniques perform better than ANNs with early stopping and naive ANNs do.
5 Feature selection for support vector regression
Considering that there are redundant features in the data set, we perform feature selection for Support Vector Regression. Feature selection is a hot topic in the machine learning and bioinformatics fields [22, 11, 23, 24]. Feature selection methods are categorized into three models: the filter model, the wrapper model and the embedded model. The filter model is independent of the learning machine, while both the embedded model and the wrapper model depend on the learning machine; the embedded model, however, has lower computational complexity than the wrapper model and has been widely studied in recent years, especially for support vector machines [22, 25, 26]. In this paper, we apply the filter model and the embedded model to improve the generalization performance of SVR. Mutual information is a commonly used filter criterion for feature selection [11, 27], and we use it as a baseline method. The other method used follows the embedded model: although many works have been proposed on feature selection for support vector machines, most of them address classification problems [28, 25, 26] and few address regression problems. Therefore, we propose a novel algorithm, PRIFER (Prediction RIsk based FEature selection for support vector Regression), for the multivariate calibration problem.
5.1 Mutual Information
Mutual information describes the statistical dependence of two random features, or the amount of information that one feature contains about the other; it is a widely used information-theoretic measure of the stochastic dependency of discrete random features [27]. The mutual information between two features R and S can be defined in terms of the probability distributions of their intensities as

$$I(R; S) = \sum_{r \in R} \sum_{s \in S} p\{r, s\} \lg \frac{p\{r, s\}}{p\{r\}\,p\{s\}}, \tag{3}$$

where $p\{r, s\}$ is the joint probability distribution of the intensities of the two features R and S, and $p\{r\}$ and $p\{s\}$ are the individual probability distributions of the intensities of R and S, respectively.

The mutual information criterion has been widely used in the filter feature selection model [11, 27]; therefore, we apply this method to SVR as a baseline method named MIFS (Mutual Information based Feature Selection). The basic idea of MIFS is as follows: we first compute the mutual information (MI) between each feature and the target values according to Equation (3). Then the MI values are ranked, the feature corresponding to the smallest one is removed first, and RMSE and MAE are computed on the remaining feature subset. Finally, we obtain the best RMSE and MAE results with the optimal feature subset.
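A hedged sketch of the MIFS ranking idea follows, assuming scikit-learn's mutual_info_regression as the MI estimator; the paper's version discretizes the features and targets instead, so the numbers will differ, and the SVR parameter defaults here are placeholders. Features are dropped one at a time in order of increasing MI with the target, and each subset is scored by leave-one-out RMSE.

```python
# Hedged sketch of MIFS: rank features by mutual information with one target and
# drop them from least to most informative, scoring each subset by LOO RMSE.
# mutual_info_regression is a k-NN based estimator, not the discretized MI of the paper.
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVR

def mifs(X, y, C=1000.0, sigma=8.0):
    mi = mutual_info_regression(X, y, random_state=0)
    order = np.argsort(mi)                      # least informative first
    best_rmse, best_subset = np.inf, None
    for n_removed in range(X.shape[1]):         # always keep at least one feature
        keep = np.sort(order[n_removed:])
        model = SVR(kernel="rbf", C=C, gamma=1.0 / sigma**2, epsilon=0.01)
        pred = cross_val_predict(model, X[:, keep], y, cv=LeaveOneOut())
        rmse = np.sqrt(np.mean((pred - y) ** 2))
        if rmse < best_rmse:
            best_rmse, best_subset = rmse, keep
    return best_subset, best_rmse
```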
5.2 Prediction Risk
Since the embedded model has been employed for support vector classification with satisfactory results [25], we also employ this model for regression problems, with the prediction risk criterion used to rank the features. The prediction risk criterion was proposed by Moody and Utans [6]; it evaluates each feature by estimating the prediction error of the data set when the values of that feature in all examples are replaced by their mean value:

$$R_i = \mathrm{RMSE}(\bar{x}_i) - \mathrm{RMSE}, \tag{4}$$

where RMSE, defined as in Equation (2), is the training error, and $\mathrm{RMSE}(\bar{x}_i)$ is the error on the training data set when the values of the ith feature are replaced by their mean $\bar{x}_i$; here $\ell$ and $D$ denote the numbers of examples and features, respectively. The feature with the smallest $R_i$ is deleted, because this feature causes the smallest change of error and is the least important one.

In this paper, the embedded feature selection model with the prediction risk criterion is employed to select relevant features for SVR; the resulting method is named PRIFER (Prediction RIsk based FEature selection for support vector Regression). The basic steps of PRIFER are described in Fig. 2.

Input: training data set Sr(x1, x2, ..., xD, C) and test data set St
Procedure:
1. Train a model L on the training set Sr by using the support vector regression algorithm and calculate the training error RMSE.
2. Compute all the prediction risk values R using Equation (4).
3. Rank R, remove the feature corresponding to the smallest value and compute the test error E on St, until all the features are removed.
4. Select the smallest Eo with the corresponding feature subset Sro.
Output: the test error Eo with the feature subset Sro

Fig. 2. The PRIFER approach
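The following is a minimal Python sketch of the PRIFER loop under the same scikit-learn assumptions as the earlier sketches. Two details are assumptions about points the description leaves open: the subsets are scored by leave-one-out RMSE rather than a separate test set, and the prediction risks of Equation (4) are recomputed after every removal; the SVR parameter defaults are placeholders.

```python
# Hedged sketch of PRIFER: backward elimination driven by the prediction risk of
# Eq. (4).  Subsets are scored by leave-one-out RMSE instead of a separate test
# set, and risks are recomputed after each removal; both are assumed details.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVR

def prediction_risks(model, X, y):
    """R_i = RMSE with feature i replaced by its mean, minus the training RMSE."""
    base = np.sqrt(np.mean((model.predict(X) - y) ** 2))
    risks = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        X_mean = X.copy()
        X_mean[:, i] = X[:, i].mean()
        risks[i] = np.sqrt(np.mean((model.predict(X_mean) - y) ** 2)) - base
    return risks

def prifer(X, y, C=1000.0, sigma=8.0):
    make = lambda: SVR(kernel="rbf", C=C, gamma=1.0 / sigma**2, epsilon=0.01)
    keep = list(range(X.shape[1]))
    best_err, best_subset = np.inf, None
    while keep:
        Xs = X[:, keep]
        pred = cross_val_predict(make(), Xs, y, cv=LeaveOneOut())
        err = np.sqrt(np.mean((pred - y) ** 2))
        if err < best_err:
            best_err, best_subset = err, list(keep)
        if len(keep) == 1:
            break
        risks = prediction_risks(make().fit(Xs, y), Xs, y)
        keep.pop(int(np.argmin(risks)))     # drop the least important feature
    return best_subset, best_err
```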
6 Computational Results by SVR with feature selection
Mutual information based feature selection (MIFS) and PRIFER are applied to the multivariate calibration of the spectrofluorimetric determination of aromatic amino acids described in Section 2, with the parameters and validation methods set as in Section 3. The computational results are shown in Table 5. Results on feature subsets of different sizes are illustrated in Fig. 3, from which we find that the optimal values are obtained by MIFS when the sizes of the feature subsets are 13, 7 and 13 for Tyr., Try. and Phe. respectively, while those obtained by PRIFER are 9, 4 and 9 for Tyr., Try. and Phe. respectively.

Table 5. Comparison of the results by SVR with feature selection
           Tyr.            Try.            Phe.            Total
Algorithm  RMSE    MAE     RMSE    MAE     RMSE    MAE     RMSE    MAE
G-SVR      0.1249  0.0893  0.1308  0.0882  0.2781  0.2093  0.1915  0.1289
MIFS       0.1249  0.0893  0.1155  0.0836  0.2781  0.2093  0.1882  0.1257
PRIFER     0.1205  0.0837  0.0788  0.0601  0.2582  0.1889  0.1707  0.1109
From Table 5 and Fig. 3, we can see that: 1) SVR is improved by at least 0.0033 in RMSE and 0.0034 in MAE by using feature selection methods; 2) PRIFER obtains better results than MIFS, improving the RMSE and MAE of SVR by 0.0208 and 0.0180 respectively; 3) feature selection is needed, since not all the features are helpful in the SVR modeling process and some need to be removed.
7 Discussions
From the above results, we can see that SVR with the Gauss kernel function and ANNs with regularization techniques achieve satisfactory performance, while the linear algorithms, such as SVR without a kernel function and PLS, do not perform well. This shows that the data set in this work is potentially nonlinear, and that SVR with a kernel function and ANNs can treat this nonlinear data set well. When feature selection is performed for SVR, the prediction results of SVR are further improved, and PRIFER obtains the best results among all the learning algorithms.

In this work, SVR obtains better results than ANNs overall, which shows that SVR has more powerful prediction ability than ANNs. The main reasons are the following: 1) With the ε-insensitive loss function, SVR is to some degree immune to the effects of points corrupted by noise, so it can suppress the overfitting caused by noise. 2) Because of the Mercer condition on the kernel function, SVR trains the objective function by solving a quadratic optimization problem and gives the globally optimal value, while ANNs give locally optimal values. 3) With the concept of margin, SVR realizes the principle of data-dependent structural risk minimization, so it makes use of the relation between the target function and the data set and does not depend on the dimension of the data set, while ANNs only realize the principle of empirical risk minimization, so they depend strongly on the dimension of the data set and will overfit the data, especially when the dimension is high.
Fig. 3. Results on feature subsets of different sizes obtained by SVR with feature selection (upper panel: results obtained by MIFS; lower panel: results obtained by PRIFER; the horizontal axis is the number of features remaining in the data subset; curves are shown for Tyr., Try. and Phe.).
In this work, ANNs with regularization techniques exhibit better performance than the other ANNs do. As is well known, ANNs are powerful tools for nonlinear problems, but they are prone to overfitting and then cannot achieve ideal performance. ANNs with early stopping can prevent overfitting by using a validation data set generated from the training data set to monitor the training process, but when the training sample is small, the validation examples cannot represent the whole data set well enough to prevent overtraining, and the ANNs may end up overfitting or underfitting. In contrast, ANNs with regularization techniques overcome overfitting by minimizing the norm of the weights, so they realize the principle of structural risk minimization and use the whole training data set. Therefore, ANNs with regularization techniques can potentially achieve better results than ANNs with early stopping do, especially in small sample cases.

Feature selection has been performed and has improved the generalization performance of SVR, where PRIFER, using the embedded model, obtained better results than MIFS, using the filter model, did. Since the features in the used data set were chosen by the authors in the first place, there are no irrelevant features, so it must be redundant features that reduced the generalization performance of SVR. The feature selection methods MIFS and PRIFER prove able to remove redundant features, and PRIFER, which combines the embedded model with SVR, produces better results than MIFS, which is independent of the SVR learning algorithm. Another reason for MIFS performing worse is that MIFS needs to discretize the continuous features and target values, which inevitably loses information useful for modeling.

In conclusion, Support Vector Machines with feature selection, as well as Artificial Neural Networks with regularization techniques, treat the nonlinear data well and may become useful methods for multivariate calibration and other topics in chemometrics.
Acknowledgments. Thanks to the late Professor Nian-Yi Chen for his advice on this paper. This work was supported in part by the Natural Science Foundation of China under grant no. 20503015 and by open funding from the Institute of Systems Biology, Shanghai University.
References

1. Peussa, M., Härkönen, S., Puputti, J., Niinistö, L.: Application of PLS multivariate calibration for the determination of the hydroxyl group content in calcined silica by DRIFTS. Journal of Chemometrics 14 (2000) 501–512
2. Marx, B.D., Eilers, P.H.C.: Multivariate calibration stability: A comparison of methods. Journal of Chemometrics 16 (2002) 129–140
3. Tormod, N., Knut, K., Tomas, I., Charles, M.: Artificial neural networks in multivariate calibration. Journal of Near Infrared Spectroscopy 1 (1993) 1–11
4. Poppi, R.J., Massart, D.L.: The optimal brain surgeon for pruning neural network architecture applied to multivariate calibration. Analytica Chimica Acta 375 (1998) 187–195
5. Tetko, V.I., Livingstone, J.D., Luik, I.A.: Neural network studies: 1. Comparison of overfitting and overtraining. Journal of Chemical Information and Computer Sciences 35 (1995) 826–833
6. Moody, J., Utans, J.: Principled architecture selection for neural networks: Application to corporate bond rating prediction. In: Moody, J.E., Hanson, S.J., Lippmann, R.P. (eds.): Advances in Neural Information Processing Systems, Morgan Kaufmann Publishers, Inc. (1992) 683–690
7. Foresee, F.D., Hagan, M.T.: Gauss-Newton approximation to Bayesian regularization. In: Proceedings of the 1997 International Joint Conference on Neural Networks (1997) 1930–1935
8. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
9. Chen, N.Y., Lu, W.C., Yang, J., Li, G.Z.: Support Vector Machines in Chemistry. World Scientific Publishing Company, Singapore (2004)
10. Belousov, A.I., Verzakov, S.A., von Frese, J.: Applicational aspects of support vector machines. Journal of Chemometrics 16 (2002) 482–489
11. Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 17(3) (2005) 1–12
12. Lakowicz, J.R.: Principles of Fluorescence Spectroscopy. Plenum Press, New York (1983)
13. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press, Cambridge (2000)
14. Karush, W.: Minima of functions of several variables with inequalities as side constraints. Master's thesis, Department of Mathematics, University of Chicago (1939)
15. Kuhn, H.W., Tucker, A.W.: Nonlinear programming. In: Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press (1951) 481–492
16. Mercer, J.: Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London A 209 (1909) 415–446
17. Hsu, C.W., Chang, C.C., Lin, C.J.: A Practical Guide to Support Vector Classification. Technical report, Department of Computer Science and Information Engineering, National Taiwan University. Available: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf [14 August 2003] (2003)
18. Rüping, S.: mySVM-Manual. University of Dortmund, Lehrstuhl Informatik 8. Available: http://www-ai.cs.uni-dortmund.de/SOFTWARE/mySvm/ [14 August 2003] (2000)
19. Demuth, H., Beale, M.: Neural Network Toolbox User's Guide for Use with MATLAB (4th Ed.). The MathWorks, Inc. (2001)
20. Sarle, W.S.: Stopped training and other remedies for overfitting. In: Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics (1995) 352–360
21. Andersson, C.A., Bro, R.: The N-way toolbox for MATLAB. Chemometrics & Intelligent Laboratory Systems 52 (2000) 1–4
22. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3 (2003) 1157–1182
23. Liu, H., Dougherty, E.R., Dy, J.G., Torkkola, K., Tuv, E., Peng, H., Ding, C., Long, F., Berens, M., Parsons, L., Yu, L., Zhao, Z., Forman, G.: Evolving feature selection. IEEE Intelligent Systems 20(6) (2005) 64–76
24. Zhang, Y.Q., Rajapakse, J.C.: Machine Learning in Bioinformatics. John Wiley & Sons, New York (2007)
25. Li, G.Z., Yang, J., Liu, G.P., Xue, L.: Feature selection for multi-class problems using support vector machines. In: Lecture Notes in Artificial Intelligence 3173 (PRICAI 2004), Springer (2004) 292–300
26. Lal, T.N., Chapelle, O., Weston, J., Elisseeff, A.: Embedded methods. In: Guyon, I., Gunn, S., Nikravesh, M. (eds.): Feature Extraction, Foundations and Applications. Springer, Physica-Verlag (2006)
27. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(8) (2005) 1226–1238
28. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Machine Learning 46 (2002) 389–422