SVM CLASSIFICATION WITH LINEAR AND RBF KERNELS

Vasileios Apostolidis-Afentoulis
Department of Information Technology, Alexander TEI of Thessaloniki
P.O. Box 141, 574 00, Thessaloniki, Greece
[email protected]

Konstantina-Ina Lioufi
Department of Information Technology, Alexander TEI of Thessaloniki
P.O. Box 141, 574 00, Thessaloniki, Greece
[email protected]

ABSTRACT

This paper attempts to survey existing research and development efforts involving the use of Matlab for classification. In particular, it aims at providing a representative view of support vector machines and of the way they can be trained to learn from all kinds of data. Two kinds of algorithms are presented with short overviews, then discussed separately and finally compared through their results, including a few figures. A summary of the considered systems is presented together with the experimental results.

Index Terms— SVM, Classification, Matlab, Linear, RBF

1. INTRODUCTION

1.1. Support vector machines

The support vector machine (SVM) technique was introduced by Vapnik [1] and has developed rapidly in recent years. Several studies have reported that SVMs generally deliver higher classification accuracy than other existing classification algorithms [2], [3]. In the last decade, SVMs have emerged as an important learning technique for solving classification and regression problems in various fields, most notably in computational biology, finance and text categorization. This is due in part to built-in mechanisms that ensure good generalization, which leads to accurate prediction, the use of kernel functions to model non-linear distributions, the ability to train relatively quickly on large datasets using novel mathematical optimization techniques and, most significantly, the possibility of theoretical analysis using computational learning theory [5].

The main objective of statistical learning is to find a description of an unknown dependency between measurements of objects and certain properties of these objects. The measurements, also known as "input variables", are assumed to be observable in all objects of interest. On the contrary, the properties of the objects, or "output variables", are in general available only for a small subset of objects known as examples. The purpose of estimating the dependency between the input and output variables is to be able to determine the values of the output variables for any object of interest. In pattern recognition, this amounts to estimating a function f: R^N → {±1} that can correctly classify new examples based on past observations [4]. The SVM software used here is LIBSVM [6] with the linear kernel and the RBF (radial basis function) kernel. As the execution time of model selection is an important issue for practical applications of SVMs, a number of studies on this problem have been conducted [7], [8], [9], [10]. The basic approach employed by these studies is to reduce the search space of the parameter combinations [11].

1.2. Linear SVMs

1.2.1 Separable case

In the binary classification setting, let ((x1, y1), ..., (xn, yn)) be the training dataset, where the xi are the feature vectors representing the instances (i.e. observations) and yi ∈ {−1, +1} are the labels of the instances. Support vector learning is the problem of finding a separating hyperplane that separates the positive examples (labeled +1) from the negative examples (labeled −1) with the largest margin. The margin of the hyperplane is defined as the shortest distance between the positive and negative instances that are closest to the hyperplane. The intuition behind searching for the hyperplane with a large margin is that a hyperplane with the largest margin should be more resistant to noise than a hyperplane with a smaller margin.

Formally, suppose that all the data satisfy the constraints

    w · xi + b ≥ +1   for yi = +1,    (1)
    w · xi + b ≤ −1   for yi = −1,    (2)

where w is the normal to the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| is the Euclidean norm of w.


Figure 1 - A hyperplane separating two classes with the maximum margin. The circled examples that lie on the canonical hyperplanes are called support vectors.

The two constraints can be conveniently combined into the following:

    yi (w · xi + b) − 1 ≥ 0   for all i.    (3)

The training examples for which (3) holds with equality lie on the canonical hyperplanes (H1 and H2 in figure 1). The margin ρ can be easily computed as the distance between H1 and H2:

    ρ = 2 / ||w||.    (4)

Hence, the maximum margin separating hyperplane can be constructed by solving the following primal optimization problem:

    min(w,b)  (1/2) ||w||²   subject to   yi (w · xi + b) ≥ 1  for all i.    (5)

We switch to the Lagrangian formulation of this problem for two main reasons: i) the constraints are easier to handle, and ii) the training data only appear as dot products between vectors. This formulation introduces a new Lagrange multiplier αi for each constraint, and the minimization problem then becomes

    L_P = (1/2) ||w||² − Σi αi [ yi (w · xi + b) − 1 ].    (6)

1.2.2 Non-Separable case

The previous section discussed the case where it is possible to linearly separate the training instances that belong to different classes. Obviously, this SVM formulation will not find a solution if the data cannot be separated by a hyperplane. Even in cases where the data are linearly separable, the SVM may overfit the training data in its search for the hyperplane that completely separates all of the instances of both classes. For instance, an individual outlier in a dataset, such as a pattern which is mislabeled, can crucially affect the hyperplane. These concerns prompted the development of soft margin SVMs [10], which can handle linearly non-separable data by introducing positive slack variables ξi that relax the constraints in (1) and (2) at a cost proportional to the value of ξi. Based on this new criterion, the relaxed constraints with the slack variables become

    w · xi + b ≥ +1 − ξi   for yi = +1,
    w · xi + b ≤ −1 + ξi   for yi = −1,
    ξi ≥ 0   for all i.    (7)

These relaxed constraints permit some instances to lie inside the margin or even cross further among the instances of the opposite class (see figure 2). While this relaxation gives the SVM flexibility to decrease the influence of outliers, from an optimization perspective it is not desirable to have arbitrarily large values for ξi, as that would cause the SVM to obtain trivial and sub-optimal solutions.

Figure 2 - Soft margin SVM

Thus, the relaxation is constrained by making the slack variables part of the objective function in (5), yielding

    min(w,b,ξ)  (1/2) ||w||² + C Σi ξi    (8)

subject to the constraints in (7). The cost coefficient C > 0 is a hyperparameter that specifies the misclassification penalty and is tuned by the user based on the classification task and dataset characteristics. The objective (8) is minimized with respect to w, b and the ξi; switching again to the Lagrangian formulation, the derivatives of the Lagrangian with respect to w, b and all the ξi are required to vanish.
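To make the step from (8) to the dual explicit, the following display sketches the standard Lagrangian stationarity conditions for the soft-margin problem; the multipliers μi ≥ 0 for the constraints ξi ≥ 0 appear only in this sketch and are not used elsewhere in the paper.

\[
L_P = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i
      - \sum_i \alpha_i \bigl[y_i(w \cdot x_i + b) - 1 + \xi_i\bigr]
      - \sum_i \mu_i \xi_i
\]
\[
\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i,
\qquad
\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0,
\qquad
\frac{\partial L_P}{\partial \xi_i} = 0 \;\Rightarrow\; C = \alpha_i + \mu_i
\;\Rightarrow\; 0 \le \alpha_i \le C.
\]

The first condition is exactly the expansion (9) given next, and substituting it back into L_P eliminates the slack terms together with the μi, which is why the dual (10) differs from the separable case only through the upper bound C on the αi.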

As in the separable case, the solution to (8) can be shown to have an expansion

    w = Σi αi yi xi,    (9)

where the training instances with αi > 0 are the support vectors of the SVM solution. Note that the penalty term related to the slack variables is linear, and it disappears when (8) is transformed into the dual formulation:

    max(α)  Σi αi − (1/2) Σi Σj αi αj yi yj (xi · xj)
    subject to  0 ≤ αi ≤ C  and  Σi αi yi = 0.    (10)

The dual formulation is conveniently very similar to the linearly separable case, with the only difference being the extra upper bound of C on the coefficients αi. Obviously, as the misclassification penalty C → ∞, (10) converges to the linearly separable case.

1.3 RBF SVMs

1.3.1 General

In general, the RBF kernel, K(xi, xj) = exp(−γ ||xi − xj||²) with γ > 0, is a reasonable first choice. This kernel nonlinearly maps samples into a higher dimensional space, so it, unlike the linear kernel, can handle the case where the relation between class labels and attributes is nonlinear. Furthermore, the linear kernel is a special case of RBF [13], since the linear kernel with a penalty parameter has the same performance as the RBF kernel with some parameters (C, γ). In addition, the sigmoid kernel behaves like RBF for certain parameters [14]. The second reason is the number of hyperparameters, which influences the complexity of model selection; the polynomial kernel has more hyperparameters than the RBF kernel. Finally, the RBF kernel has fewer numerical difficulties. One key point is that 0 < Kij ≤ 1, in contrast to polynomial kernels, whose kernel values may go to infinity (γ xiᵀxj + r > 1) or zero (γ xiᵀxj + r < 1) when the degree is large. Moreover, we must note that the sigmoid kernel is not valid (i.e. not the inner product of two vectors) under some parameters [15]. There are some situations where the RBF kernel is not suitable; in particular, when the number of features is very large, one may just use the linear kernel.
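To connect the formulation above with the software used later, the following Matlab sketch shows how LIBSVM [6] is called with a linear kernel versus an RBF kernel. The variable names (Xtrain, ytrain, Xtest, ytest) and the parameter values C = 1 and γ = 0.01 are illustrative placeholders, not the settings used in the experiments of section 2.

% Assumes LIBSVM's Matlab interface (svmtrain/svmpredict) is on the path and
% that Xtrain/Xtest are instance matrices with label vectors ytrain/ytest.

% Linear kernel: -t 0 selects the linear kernel, -c sets the cost C.
model_lin = svmtrain(ytrain, Xtrain, '-t 0 -c 1');
[pred_lin, acc_lin, ~] = svmpredict(ytest, Xtest, model_lin);

% RBF kernel: -t 2 selects the RBF kernel, -g sets gamma in exp(-gamma*||u-v||^2).
model_rbf = svmtrain(ytrain, Xtrain, '-t 2 -c 1 -g 0.01');
[pred_rbf, acc_rbf, ~] = svmpredict(ytest, Xtest, model_rbf);

% acc_lin(1) and acc_rbf(1) hold the classification accuracy in percent.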

1.3.2 Cross-validation and Grid-search

There are two parameters for an RBF kernel: C and γ. It is not known beforehand which C and γ are best for a given problem; consequently, some kind of model selection (parameter search) must be done. The goal is to identify good (C, γ) so that the classifier can accurately predict unknown data (i.e. testing data). Note that it may not be useful to achieve high training accuracy (i.e. a classifier which accurately predicts training data whose class labels are indeed known). As discussed above, a common strategy is to separate the data set into two parts, one of which is considered unknown. The prediction accuracy obtained on the "unknown" set more precisely reflects the performance on classifying an independent data set. An improved version of this procedure is known as cross-validation [17]. In v-fold cross-validation, we first divide the training set into v subsets of equal size. Sequentially, each subset is tested using the classifier trained on the remaining v − 1 subsets. Thus, each instance of the whole training set is predicted once, and the cross-validation accuracy is the percentage of data which are correctly classified. The cross-validation procedure can prevent the overfitting problem. Figure 3 shows a binary classification problem that illustrates this issue. Filled circles and triangles are the training data, while hollow circles and triangles are the testing data. The testing accuracy of the classifiers in Figures 3a and 3b is not good since they overfit the training data. If we think of the training and testing data in Figures 3a and 3b as the training and validation sets in cross-validation, the accuracy is not good either. On the other hand, the classifier in Figures 3c and 3d does not overfit the training data and gives better cross-validation as well as testing accuracy. A "grid-search" on C and γ using cross-validation is recommended: various pairs of (C, γ) values are tried and the one with the best cross-validation accuracy is picked. Trying exponentially growing sequences of C and γ is a practical method to identify good parameters (for example, C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3). The grid-search is straightforward but may seem naive. In fact, there are several advanced methods which can save computational cost by, for example, approximating the cross-validation rate. However, there are two motivations for preferring the simple grid-search approach [17].

Figure 3 - An overfitting classifier and a better classifier (● and ▲: training data; ○ and ∆: testing data). (a) Training data and an overfitting classifier. (b) Applying an overfitting classifier on testing data. (c) Training data and a better classifier. (d) Applying a better classifier on testing data.

One is that, psychologically, we may not feel safe using methods which avoid an exhaustive parameter search by approximations or heuristics. The other reason is that the computational time required to find good parameters by grid-search is not much more than that of advanced methods, since there are only two parameters. Furthermore, the grid-search can be easily parallelized because each (C, γ) is independent, whereas many advanced methods are iterative processes, e.g. walking along a path, which can be hard to parallelize [17]. Since doing a complete grid-search may still be time-consuming, it is recommended to use a coarse grid first. After identifying a "better" region on the grid, a finer grid search on that region can be conducted. To illustrate this, we run an experiment on the problem german from the Statlog collection [16]. After scaling this set, we first use a coarse grid (Figure 5) and find that the best (C, γ) is (2^3, 2^-5) with a cross-validation rate of 77.5%. Next we conduct a finer grid search on the neighborhood of (2^3, 2^-5) (Figure 6) and obtain a better cross-validation rate of 77.6% at (2^3.25, 2^-5.25). After the best (C, γ) is found, the whole training set is trained again to generate the final classifier. The above approach works well for problems with thousands or more data points. For very large data sets, a feasible approach is to randomly choose a subset of the data set, conduct grid-search on it, and then do a better-region-only grid-search on the complete data set.
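As a concrete illustration of the coarse grid-search described above, the sketch below uses LIBSVM's built-in cross-validation mode (the -v option, which makes svmtrain return the cross-validation accuracy instead of a model). The variables Xtrain and ytrain and the choice of 5 folds are placeholders; the grid follows the exponential ranges quoted in the text.

% Coarse grid over exponentially spaced C and gamma values, scored by
% 5-fold cross-validation accuracy.
best_acc = 0; best_c = NaN; best_g = NaN;
for log2c = -5:2:15
    for log2g = -15:2:3
        opts = sprintf('-t 2 -c %g -g %g -v 5', 2^log2c, 2^log2g);
        acc  = svmtrain(ytrain, Xtrain, opts);   % scalar CV accuracy (%)
        if acc > best_acc
            best_acc = acc; best_c = 2^log2c; best_g = 2^log2g;
        end
    end
end
% Retrain on the whole training set with the selected (C, gamma)
% before a finer search around (best_c, best_g) is attempted.
final_model = svmtrain(ytrain, Xtrain, sprintf('-t 2 -c %g -g %g', best_c, best_g));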

Figure 5 - Coarse grid-search on C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3 [17].

Figure 6 - Fine grid-search on C = 2^1, 2^1.25, ..., 2^5 and γ = 2^-7, 2^-6.75, ..., 2^-3 [17].

1.4 Dataset Description

Title of dataset: ISOLET (Isolated Letter Speech Recognition) [12]. This data set was generated as follows: 150 subjects spoke the name of each letter of the alphabet twice; hence, there are 52 training examples from each speaker. The speakers are grouped into sets of 30 speakers each, referred to as isolet1, isolet2, isolet3, isolet4 and isolet5. The data appear in isolet1+2+3+4.data in sequential order: first the speakers from isolet1, then isolet2, and so on. The test set, isolet5, is a separate file. Note that three examples are missing; they were dropped due to difficulties in recording. This is a good domain for a noisy, perceptual task, and it is also a very good domain for testing the scaling abilities of algorithms. We have merged the two separate files into one data file (isolet12345.data) for convenience, and we will provide it as well. The number of instances in isolet1+2+3+4.data is 6238 and in isolet5.data is 1559, for a total of 7797 instances. The number of attributes is 617, plus 1 for the class, which is the last column. All attributes are continuous, real-valued attributes scaled into the range −1.0 to 1.0. The features include spectral coefficients, contour features, sonorant features, pre-sonorant and post-sonorant features. There are no missing attribute values.
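For reference, a minimal Matlab sketch for loading the merged file described above follows; it assumes isolet12345.data is plain comma-separated numeric text with the 617 attributes followed by the class label in the last column, as described in this section.

% Load the merged ISOLET file (assumed comma-separated, label in last column).
data = csvread('isolet12345.data');
X = data(:, 1:617);    % 617 continuous attributes scaled to [-1, 1]
y = data(:, 618);      % class label, one of the 26 letters
fprintf('%d instances with %d attributes\n', size(X, 1), size(X, 2));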

2. EXPERIMENTS

2.1 General Explanation - Linear Experiments

First of all, recall that the given dataset was split into two separate files, so we had to combine them into one. The file combination was verified with WinMerge, an open source differencing and merging tool for Windows.

Cross-validation has been used to split the dataset into random subsets. For this implementation in Matlab, the «Holdout» parameter is used and sets the amount of data that is left out of the training procedure. Keeping only 10% of the data for training, two kinds of indices are created, one for the training data and one for the testing data; this process also yields the total number of instances. The next part contains the selection of the SVM training model, where the cost parameter C (the misclassification penalty explained in the introduction) is included, and a vector is created to store the values needed for the experiments. After «svmtrain» is set up, «svmpredict» takes over: as returned values we get the labels and the accuracy, and we ignore the last output, which contains information about the trained model. This is done separately for the training and the testing data. The results of the training and testing predictions are printed into a plot with two subplots: the first subplot presents the training data and the other one the testing data. The whole procedure is wrapped in a loop of one hundred iterations, so it is important to record an average accuracy value for the training and testing predictions. Finally, there is one more plot, a semi-logarithmic one which, because of the nature of the C parameter, helps us comprehend the results more easily; it compares the mean accuracy with the C parameter. In the following figures, you can see the visualization of the training and testing procedure for some specific values of C. The X axis represents the number of classes and the Y axis represents the total number of instances that have been used. From the results we can see that for C = 10^-4 we get a very low accuracy of 0.85% (figure 7). For C = 10^-3, the accuracy rises to 8.28% (figure 8). For C = 10^1 and higher values, our system reaches its highest peak (figure 9), with the accuracy reaching 92.61% (figure 10).
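The loop just described can be outlined as follows in Matlab, using cvpartition for the «Holdout» split and LIBSVM's svmtrain/svmpredict. The C values follow the text, while the number of repetitions is reduced here and the variable names are placeholders, so this is a sketch of the procedure rather than the exact experimental script.

% Holdout split: keep only 10% of the data for training, as described above.
Cvals = [1e-4 1e-3 1e1 1e2];
reps  = 10;                                   % the experiments use 100 iterations
meanAcc = zeros(size(Cvals));
for k = 1:numel(Cvals)
    acc = zeros(reps, 1);
    for r = 1:reps
        cv  = cvpartition(y, 'Holdout', 0.9); % 90% of the data left out of training
        Xtr = X(training(cv), :);  ytr = y(training(cv));
        Xte = X(test(cv), :);      yte = y(test(cv));
        model     = svmtrain(ytr, Xtr, sprintf('-t 0 -c %g', Cvals(k)));
        [~, a, ~] = svmpredict(yte, Xte, model);
        acc(r)    = a(1);                     % classification accuracy in percent
    end
    meanAcc(k) = mean(acc);
end
semilogx(Cvals, meanAcc, '-o');               % mean accuracy versus C (cf. figure 10)
xlabel('C'); ylabel('mean test accuracy (%)');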

Figure 8 - Training & Testing Data for C=10^-3

Figure 9 - Training & Testing Data for C=10^1 and higher values

Figure 10 – Final plot, comparison of Accuracy with C

Figure 7 - Training & Testing Data for C=10^-4


2.2 RBF Experiments

The first step in the SVM-RBF code is to load the dataset and set values for the parameters C and G. Both initial values are set to 100; C is kept fixed, while G changes constantly. Then the data are initialized and the path where the graphs will be stored is defined. A big part of the program is the iteration loop, in which cross-validation is implemented. Two pairs of indices are created:
• the first pair of indices holds the instances and the labels of the training data;
• the second pair of indices holds the instances and the labels of the testing data.

Figure 11 - Training & Testing Data for C=10^2 and G=3*10^-1

Figure 12 - Training & Testing Data for C=10^2 and G=8*10^-2

The training of the RBF kernel then starts, using the parameters that have already been set. A series of tests evaluates the model, and the prediction values, the accuracy and the label information are returned. There is also an inspection of the elements of the accuracy vector and its conversion into string format. The next part of the program refers to the graphs. A plot is created with two subplots; green circles and red dots have been selected for display in the subplots. Moreover, a legend has been created, including the titles of the graphs, the number of each iteration, the accuracy value and the values of the C and G parameters. After that, the output data are temporarily exported into an xls file, the file type for saving images is set, and the kind of information that will appear in each graph is selected (minimum/maximum accuracy). The most important part of this section is the reduction of G by dividing it by 1.1 in each rerun. The X axis represents the number of classes and the Y axis represents the total number of instances that have been used. Finally, there is a procedure for comparing the values of G with the accuracy, which appears in the final graph. A quick look at this graph (figure 14) shows that the accuracy reaches really high values, at 92.71%.
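The G-reduction loop described above can be sketched in Matlab as follows. The fixed C = 100, the initial G = 100 and the division by 1.1 per rerun follow the text; the holdout split and the number of iterations are illustrative assumptions.

% Training/testing split as in the linear experiments (10% for training).
cv  = cvpartition(y, 'Holdout', 0.9);
Xtr = X(training(cv), :);  ytr = y(training(cv));
Xte = X(test(cv), :);      yte = y(test(cv));

% RBF experiment outline: C fixed at 100, G starts at 100 and is divided
% by 1.1 after every iteration; the accuracy is recorded for each G.
C = 100;  G = 100;  nIter = 100;
Gvals = zeros(nIter, 1);  accG = zeros(nIter, 1);
for it = 1:nIter
    model     = svmtrain(ytr, Xtr, sprintf('-t 2 -c %g -g %g', C, G));
    [~, a, ~] = svmpredict(yte, Xte, model);
    Gvals(it) = G;
    accG(it)  = a(1);     % classification accuracy in percent
    G = G / 1.1;          % shrink gamma for the next rerun
end
semilogx(Gvals, accG, '-o');   % accuracy versus G (cf. figure 14)
xlabel('G (gamma)'); ylabel('test accuracy (%)');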

Figure 13 - Training & Testing Data for C=10^2 and G=9*10^-3 and lower

Figure 14 - Final plot, Comparison of Accuracy with G

3. COMPARISON OF EXPERIMENTAL RESULTS

In order to show the validity and the classification accuracy of our algorithms, we performed a series of experiments on a standard benchmark data set. In this series of experiments, the data were split into training and test sets. The difference between the algorithms is that in the linear kernel four values were used for the C parameter, so that the output could be checked, whereas in the radial basis function kernel the C parameter was kept fixed at 10^2 while the Gamma parameter was changed constantly from 10^2 down to 8*10^-2. As far as we can tell from the two final graphical representations, the results achieved by the two kernels are almost the same. More specifically, both kernels achieve the same level of accuracy, almost 93%. In the current dataset the data are essentially linearly separable, so we cannot make a real comparison of the two kernels; on a different dataset with non-linear data, the radial basis function kernel would generalize much better than the linear kernel. In the table below, we can observe the elements of each algorithm separately and compare the results.

Table 1 - Linear kernel in comparison with RBF kernel results

Isolet Dataset              Linear Kernel                     RBF Kernel
Instances                   7797                              7797
Attributes                  617                               617
Train Data                  780                               780
Test Data                   7017                              7017
Iterations                  100                               100
C                           10^-4 / 10^-3 / 10^1 / 10^2       10^2
G                           -                                 8*10^-2 to 10^2
Accuracy (%)                0.85 / 8.28 / 92.61               0.85 / 8.49 / 92.73
Decision boundary           Linear                            Nonlinear
Related distance function   Euclidean distance                Euclidean distance
Regularization [18]         Training-set cross-validation to select C (misclassification penalty)    Training-set cross-validation to select C and γ (RBF width)

4. REFERENCES

[1] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273–297, 1995.

[2] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.

[3] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in Proceedings of ECML-98, 10th European Conference on Machine Learning, 1998, number 1398, pp. 137–142.

[4] S. Ertekin, "Learning in Extreme Conditions: Online and Active Learning with Massive, Imbalanced and Noisy Data," Citeseer, 2009.

[5] R. S. Shah, "Support Vector Machines for Classification and Regression," McGill University, 2007.

[6] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[7] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, "Choosing multiple parameters for support vector machines," Machine Learning, vol. 46, pp. 131–159, 2002.

[8] S. S. Keerthi, "Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms," IEEE Transactions on Neural Networks, 2002.

[9] K. Duan, S. S. Keerthi, and A. N. Poo, "Evaluation of simple performance measures for tuning SVM hyperparameters," Neurocomputing, 2002.

[10] D. DeCoste and K. Wagstaff, "Alpha seeding for support vector machines," in Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD-2000), 2000.

[11] Y.-Y. Ou, C.-Y. Chen, S.-C. Hwang, and Y.-J. Oyang, "Expediting model selection for support vector machines based on data reduction," in Systems, Man and Cybernetics, 2003 IEEE International Conference on, 2003, vol. 1, pp. 786–791.

[12] UCI Machine Learning Repository, http://www.ics.uci.edu/mlearn/MLRepository.html.

[13] S. S. Keerthi and C.-J. Lin, "Asymptotic behaviors of support vector machines with Gaussian kernel," Neural Computation, vol. 15, no. 7, pp. 1667–1689, 2003.

[14] H.-T. Lin and C.-J. Lin, "A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods," Technical report, Department of Computer Science, National Taiwan University, 2003.

[15] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, NY, 1995.

[16] D. Michie, D. J. Spiegelhalter, C. C. Taylor, and J. Campbell, editors, Machine Learning, Neural and Statistical Classification, Ellis Horwood, Upper Saddle River, NJ, USA, 1994. ISBN 0-13-106360-X. Data available at http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/

[17] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, "A Practical Guide to Support Vector Classification," National Taiwan University, Taipei 106, Taiwan. Last updated: April 15, 2010.

[18] M. Misaki, Y. Kim, P. A. Bandettini, and N. Kriegeskorte, "Comparison of multivariate classifiers and response normalizations for pattern-information fMRI," Neuroimage, vol. 53, no. 1, pp. 103–118, Oct. 2010.