Multiclass SVM Model Selection Using Particle Swarm Optimization

Bruno Feres de Souza, André C.P.L.F. de Carvalho, Rodrigo Calvo and Renato Porfírio Ishii
Institute of Mathematical and Computer Sciences, University of São Paulo
Av. Trabalhador São-Carlense, 400, São Carlos, SP, Brazil
{bferes,andre,rcalvo,rpi}@icmc.usp.br

Abstract

Tuning SVM hyperparameters is an important step for achieving good classification performance. In the binary case, the model selection issue is well studied. For multiclass problems, it is harder to choose appropriate values for the base binary models of a decomposition scheme. In this paper, the authors employ Particle Swarm Optimization to perform multiclass model selection, optimizing the hyperparameters considering both local and global models. Experiments conducted over four benchmark problems show promising results.

1 Introduction

The determination of the optimal values for the hyperparameters (the regularization term C plus the kernel parameters) of Support Vector Machines (SVMs) [19] is known as model selection. For the binary case, the issue is by now well established and studied [7][1]. For the multiclass counterpart, there is an ongoing research effort to develop efficient methods [12][14][15].

The most common approach for multiclass SVM model selection is based on Grid search [9]. It has two versions. The first globally applies the same hyperparameter values to all binary SVMs of the decomposition. The second locally applies different hyperparameter values to each binary SVM, independently. Both approaches have drawbacks. On one hand, the same set of values for all classifiers may be sub-optimal in some situations. On the other hand, optimizing each classifier very well individually does not ensure that they will perform well together.

A better way to perform model selection would be to allow the binary SVMs to have different values for their hyperparameters while also considering the classification performance on the whole multiclass problem. Unfortunately, the number of possible combinations of hyperparameter values in this case is large, and the computational burden of the model selection would be too high, since $\left(\prod_{i=1}^{h} d_i\right)^{n}$ SVM configurations would have to be trained, where $n$ is the number of SVMs of the decomposition, $h$ is the number of hyperparameters to be tuned for each SVM and $d_i$ is the number of values the $i$-th hyperparameter can take.

The use of heuristic optimization methods may avoid an exhaustive search throughout the hyperparameter space. In this paper, the authors apply the Particle Swarm Optimization algorithm (PSO) [13] to deal efficiently with the model selection problem for multiclass SVMs with the RBF kernel. The method is able to tune multiple SVM hyperparameters in simultaneous local and global fashion, i.e., the parameters of the classification algorithm are optimized considering both the individual components and the interactions between the parts.

The paper is organized as follows. Section 2 briefly introduces SVMs for the binary and multiclass cases and reviews a method for estimating the SVM generalization error. Section 3 presents relevant work on SVM model selection and discusses some of its drawbacks. Section 4 describes the proposed PSO-based method for model selection. Section 5 shows the results of the experimental study. Finally, Section 6 draws some conclusions.
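To get a feeling for the scale of this joint search space, consider a small back-of-the-envelope computation (the class count is illustrative; the 15x15 grid matches the C and γ ranges used later in Section 5):

```python
# Size of the joint hyperparameter space when each binary SVM of an
# OVO decomposition is tuned independently (illustrative numbers).
N = 10                        # number of classes (assumed for illustration)
n = N * (N - 1) // 2          # 45 binary SVMs in the OVO decomposition
d = [15, 15]                  # grid sizes for the h = 2 hyperparameters (C, gamma)
combinations = (d[0] * d[1]) ** n
print(f"{combinations:.2e}")  # ~7.05e+105 joint configurations
```

Even modest grids per SVM make an exhaustive joint search hopeless, which motivates the heuristic search described next.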

2 Support Vector Machines

2.1 Binary SVMs

SVMs constitute a class of learning algorithms that has exhibited good performance on a large range of applications [19]. In the simplest case of binary classification, they work by constructing a hyperplane that maximizes the margin of separation between the examples of the two classes. By doing so, they implement the principle of Structural Risk Minimization, which deals with a trade-off between the empirical risk (commonly referred to as the training error) and the classifier complexity, in order to minimize a theoretical bound on the generalization error of the classifier. A comprehensive introduction to SVMs can be found in [19].
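To make the role of the hyperparameters concrete, here is a minimal sketch of training a soft-margin RBF SVM with scikit-learn's SVC, a wrapper around LIBSVM (the library used in Section 5); the data is synthetic and the (C, γ) values are arbitrary:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary labels

# C controls the empirical-risk / complexity trade-off; gamma is the
# RBF kernel width. These are exactly the hyperparameters tuned later.
clf = SVC(kernel='rbf', C=1.0, gamma=0.25).fit(X, y)
print(clf.score(X, y))                   # empirical (training) accuracy
```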


2.2 Generalization Error of SVMs

Several techniques have been proposed to estimate the generalization error of SVMs. Duan et al. [4] empirically compared many of them for the problem of choosing optimal hyperparameters for binary SVMs with the RBF kernel. The $\xi\alpha$-estimator [11] performed very well, with the advantage of not being computationally expensive, so it can be readily used with population-based optimization approaches such as PSO. The $\xi\alpha$-estimator is an upper bound on the leave-one-out error. It is defined as:

$$Err^{n}_{\xi\alpha} = \frac{\left|\left\{\, i : \rho\,\alpha_i R_\Delta^2 + \xi_i \geq 1 \,\right\}\right|}{n} \qquad (1)$$

where $\rho = 2$, and $\alpha$ and $\xi$ are results of the SVM optimization problem. $R_\Delta^2$ is an upper bound on $k(x, x) - k(x, z)$, $\forall x, z$. In this paper, the extended $\xi\alpha$-estimator provided in the SVM-Light 6.01 library (available at svmlight.joachims.org/), with search depth equal to 10, was employed.
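The following is a minimal sketch of the plain estimator of Eq. 1, not the extended depth-10 variant that SVM-Light computes; it assumes the dual coefficients α, the slacks ξ and the kernel bound R²_Δ have already been extracted from a trained SVM:

```python
import numpy as np

def xi_alpha_error(alpha, xi, r2_delta, rho=2.0):
    """Plain xi-alpha estimator (Eq. 1): the fraction of training
    examples i with rho * alpha_i * R_Delta^2 + xi_i >= 1, an upper
    bound on the leave-one-out error."""
    alpha, xi = np.asarray(alpha), np.asarray(xi)
    return float(np.mean(rho * alpha * r2_delta + xi >= 1.0))
```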

2.3 Multiclass SVMs

SVMs were originally designed for binary classification problems. Extensions to the multiclass case usually involve either solving a single large optimization problem at once or decomposing the original problem into smaller binary subproblems and then combining their solutions [9][16]. Although the two approaches usually present no significant difference in performance when the hyperparameters are properly tuned [16], decomposition is computationally more attractive. There are two main decomposition schemes: One-Versus-One (OVO) and One-Versus-All (OVA). They have been largely employed due to their simplicity, efficiency and similarly good classification performance [16]. This paper focuses on the OVO scheme, even though the proposed approach could be suitably applied to other multiclass schemes as well. The OVO method constructs N(N−1)/2 SVMs, one for each binary combination of classes. When a test example is provided, it is applied to all the SVMs and their outputs are combined. The MaxWins voting scheme [9] employed here counts how often each class is output by the binary SVMs and assigns the test example to the most voted class.
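A minimal sketch of the OVO decomposition with MaxWins voting, using scikit-learn's SVC for the base learners; the per-pair `params` dictionary is our stand-in for independently tuned (C, γ) values:

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def train_ovo(X, y, params):
    """Train one RBF SVM per pair of classes: N(N-1)/2 binary subproblems."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        C, gamma = params[(a, b)]
        models[(a, b)] = SVC(kernel='rbf', C=C, gamma=gamma).fit(X[mask], y[mask])
    return models

def max_wins(models, x):
    """MaxWins voting: each binary SVM casts one vote and the test
    example is assigned to the most voted class."""
    votes = {}
    for model in models.values():
        winner = model.predict(x.reshape(1, -1))[0]
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```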

3 SVM Model Selection

Model selection for binary SVMs is an active research area. Classical methods minimize some estimate of the generalization error of the SVM using gradient information over a set of hyperparameters [7][1]. Evolutionary approaches have been proposed to overcome common drawbacks of gradient-based methods, such as the need for differentiable functions and the high risk of being trapped in local minima. Genetic Algorithms (GAs) were applied in both single-objective [3][10] and multi-objective frameworks [18]. Friedrichs and Igel [6] used a covariance matrix adaptation evolution strategy to determine the model parameters. Runarsson and Sigurdsson [17] solved the problem using an asynchronous parallel evolution strategy. Howley and Madden [8] evolved SVM kernels by means of Genetic Programming.

Despite the development of sophisticated methods, Grid search remains the de facto standard for general SVM model selection. It works by varying each SVM parameter throughout a wide range with predefined step sizes and by calculating a measure of generalization error for every combination of values. The combination with the smallest error is selected. An obvious limitation of this approach is its computational burden in the presence of several parameters with many discrete values.

Multiclass SVM model selection has been less investigated. Lee et al. [12] extended the Generalized Approximate Cross-Validation (GACV) to the multiclass case. Passerini et al. [14] developed a new bound on the leave-one-out error of the Error Correcting Output Codes framework for tuning kernel hyperparameters. As both methods rely on a Grid search throughout the parameter space, when the parameters of the individual SVMs in a decomposition approach (e.g., OVO) are to be independently tuned, model selection becomes intractable. To deal with this problem efficiently, Peng and Chan [15] developed a GA-based model selection method for multiclass SVMs. First, they perform a rough Grid selection over a range of possible values for C and γ and then use the GA for refinement. A possible improvement to their work is to jointly optimize these two parameters. Furthermore, they considered an equal C for all binary SVMs, which can be sub-optimal, and the initial rough selection can provide erroneous values, misleading the GA. All these problems are addressed by the approach proposed next.
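For reference, the Grid search just described can be sketched as follows; `train_and_score` is a hypothetical callback that trains the SVM(s) with a given pair and returns a generalization-error estimate:

```python
import numpy as np
from itertools import product

def grid_search(train_and_score, C_grid, gamma_grid):
    """Exhaustive Grid search: score every (C, gamma) combination and
    keep the pair with the smallest estimated error."""
    best_pair, best_err = None, np.inf
    for C, gamma in product(C_grid, gamma_grid):
        err = train_and_score(C, gamma)
        if err < best_err:
            best_pair, best_err = (C, gamma), err
    return best_pair, best_err

# The ranges the paper's Local method uses in Section 5:
C_grid = [2.0 ** k for k in range(-2, 13)]      # 2^-2 .. 2^12
gamma_grid = [2.0 ** k for k in range(-10, 5)]  # 2^-10 .. 2^4
```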

4 PSO for Model Selection

Particle Swarm Optimization is a global optimization technique inspired by the social interactions of animals, such as birds, fish and even human beings [13]. It is based on the observation that the social sharing of information may provide some kind of evolutionary advantage to the individuals of a population, enhancing their capability to solve complex problems.

PSO consists of a swarm of interacting particles moving in an n-dimensional search space of the problem's possible solutions. Each particle $i$ is represented by four elements: its current position $x_i$, its velocity $v_i$, its best previous position $p_i$ and the best global position ever found in the swarm, $g$. In this work, the authors consider the real-valued version of PSO and the fully connected network topology scheme [13]. The PSO iterates for several generations, updating particle positions and velocities. Information about good solutions spreads throughout the swarm, so the particles can explore promising regions. The updating formulas are given by Eqs. 2, 3 and 4:

$$x_i(t+1) = x_i(t) + v_i(t+1) \qquad (2)$$

$$v_i(t+1) = K\left[\,v_i(t) + \phi_1 \cdot r \cdot \big(p_i(t) - x_i(t)\big) + \phi_2 \cdot r \cdot \big(g(t) - x_i(t)\big)\,\right] \qquad (3)$$

$$K = \frac{2}{\left|\,2 - \theta - \sqrt{\theta^2 - 4\theta}\,\right|}, \quad \text{with } \theta = \phi_1 + \phi_2,\ \theta > 4 \qquad (4)$$

where $t$ represents the iteration, $r \in [0, 1]$ is a uniformly distributed random number, $\phi_1$ and $\phi_2$ are the acceleration constants and $K$ is the so-called constriction factor, proposed by Clerc [2] to ensure the convergence of PSO. To avoid too high velocities, the $v_i$ are constrained to $[-v_{max}, v_{max}]$, where $v_{max} = s \cdot x_{max}$, with $0.1 \leq s \leq 1$. When a position $x_i$ exceeds the search space boundary, it is repositioned at the boundary and its velocity is set to zero.

The real-valued PSO particles encode $N(N-1)/2$ $(C, \gamma)$ pairs, allowing independent values of $(C, \gamma)$ to be tuned for each SVM. The values considered are $\gamma \in [2^{-10}, 2^{4}]$ and $C \in [2^{-2}, 2^{12}]$. For multiclass SVM model selection, the PSO aims to optimize both the individual binary SVMs and the multiclass scheme as a whole. The authors believe that the synergy of joint local and global model selection leads to better overall classification accuracies. The $\xi\alpha$-estimator of Eq. 1 is employed for the local optimization of each SVM. To estimate how well the SVMs work together in the multiclass scheme, the accuracy on a validation set is used. Thus, the fitness function of a particle $x$, to be minimized by PSO, is defined by Eq. 5:

$$f(x) = \frac{1}{\frac{N(N-1)}{2}} \sum_{i=1}^{\frac{N(N-1)}{2}} Err^{n}_{\xi\alpha}(x)_i \; + \; Err_{val}(x) \qquad (5)$$

where $N$ is the number of classes, $Err^{n}_{\xi\alpha}(x)_i$ is the $\xi\alpha$-estimator of Eq. 1 for the $i$-th binary SVM and $Err_{val}(x)$ is the error on a validation set, considering the SVM parameters encoded in $x$. Each of the two fitness terms ranges from 0 to 1.
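As a concrete illustration, here is a minimal sketch of the constriction-factor update of Eqs. 2-4. It is an assumption-laden reading of the paper: it uses the common PSO convention of two independent uniform draws per update (the paper writes a single r), and the function names are ours:

```python
import numpy as np

def constriction(phi1=2.05, phi2=2.05):
    """Constriction factor K of Eq. 4 (Clerc). With phi1 = phi2 = 2.05,
    theta = 4.1 and K is approximately 0.7298."""
    theta = phi1 + phi2  # must satisfy theta > 4
    return 2.0 / abs(2.0 - theta - np.sqrt(theta ** 2 - 4.0 * theta))

def pso_step(x, v, p, g, v_max, phi1=2.05, phi2=2.05, rng=None):
    """One swarm update (Eqs. 2-3). x, v, p are (particles, dims) arrays,
    g is the (dims,) global best; velocities are clamped to [-v_max, v_max]."""
    rng = rng or np.random.default_rng()
    K = constriction(phi1, phi2)
    r1 = rng.uniform(size=x.shape)
    r2 = rng.uniform(size=x.shape)
    v = K * (v + phi1 * r1 * (p - x) + phi2 * r2 * (g - x))
    v = np.clip(v, -v_max, v_max)  # velocity clamping
    return x + v, v                # Eq. 2
```

The Eq. 5 fitness can then be sketched as below, reusing max_wins from the OVO sketch in Section 2.3; the argument names are hypothetical:

```python
def fitness(models, xi_alpha_errors, X_val, y_val):
    """Eq. 5 for one particle: the mean xi-alpha error of the N(N-1)/2
    binary SVMs (local term) plus the MaxWins multiclass error on a
    validation set (global term). Both terms lie in [0, 1]."""
    local_term = float(np.mean(xi_alpha_errors))
    preds = np.array([max_wins(models, x) for x in X_val])
    global_term = float(np.mean(preds != y_val))
    return local_term + global_term
```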

5 Experimental Results

The experiments carried out in this work compare the PSO approach with other model selection methods in terms of classification error. For this purpose, each dataset is partitioned according to the 10-fold Cross-Validation (CV) scheme, producing 10 different training/testing splits. SVMs are constructed (with the LIBSVM library) using the training data and have their performance assessed on the test data. Table 1 exhibits the numerical characteristics of the four benchmark datasets used (available at www.csie.ntu.edu.tw/~cjlin/libsvm/). It presents, for each dataset, the number of classes, the number of examples, the number of features and the mean, minimum and maximum number of examples per class.

Table 1. Characteristics of the datasets used.

Name        Classes  Examples  Features  mean/min/max
Iris             3       150         4   50.0/50/50
SVMGuide2        3       391        20   130.3/53/221
Vowel           11       528        10   48.0/48/48
Wine             3       178        13   59.3/48/71
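Under stated assumptions (stratification and the helper name are our choices; the paper specifies only 10-fold CV, plus the 70/30 train/validation split of each fold's training portion used by the Global and PSO methods described next), the evaluation protocol can be sketched as:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def cv_protocol(X, y, seed=0):
    """Yield (fit, validation, test) partitions for each of the 10 folds."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        X_fit, X_val, y_fit, y_val = train_test_split(
            X[train_idx], y[train_idx], test_size=0.3,
            stratify=y[train_idx], random_state=seed)
        yield (X_fit, y_fit), (X_val, y_val), (X[test_idx], y[test_idx])
```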

Three methods are compared with PSO. The first is a baseline method, named Naive, that sets the same hyperparameter values for all binary SVMs. Naive employs the default values of the LIBSVM library, i.e., C = 1 and γ = 1/k, where k is the number of features of the dataset. The second method, named Local, optimizes each SVM independently. It performs a Grid search over $C \in \{2^{-2}, 2^{-1}, \ldots, 2^{12}\}$ and $\gamma \in \{2^{-10}, 2^{-9}, \ldots, 2^{4}\}$; the pair with the smallest $\xi\alpha$-estimator error is selected. The last method, named Global, also applies the same hyperparameter values to all binary SVMs, but uses a validation set to estimate the generalization error; the pair of values with the smallest error is selected. The validation sets used by the Global and PSO methods came from the partition of the training set of each CV fold into train (70%) and validation (30%) data. PSO used the following parameters, already discussed: 30 particles, 150 generations, $\phi_1 = \phi_2 = 2.05$, $s = 1$. They were all empirically defined.

Table 2. CV errors of the methods.

Name         Naive        Local        Global       PSO
Iris          5.33/4.00    4.66/4.26    4.00/3.26    4.66/4.26
SVMGuide2    43.47/0.46   18.15/3.11   21.23/6.30   19.69/3.98
Vowel        25.95/5.35    2.27/1.47    4.72/4.26    1.18/1.62
Wine          2.22/3.68    1.66/2.54    3.30/3.66    2.22/4.44

Table 2 presents the CV errors obtained using the different model selection approaches (mean and standard deviation over the 10 folds are shown). As PSO is a stochastic method, it is run 10 times for each fold of the CV process, generating different solutions, and the classification error is measured on the test partition. The error closest to the mean error of the 10 runs is chosen to represent the PSO solutions in the comparisons in Table 2.

Not surprisingly, Naive was the worst method, emphasizing the need for model selection. PSO, Local and Global presented comparable performance. In order to assess whether the performance differences are statistically significant, a multiple hypothesis test was applied at the 95% confidence level [5]. According to the test, PSO, Local and Global present no difference in performance, while, on the SVMGuide2 and Vowel datasets, Naive was the worst.

[Figure: convergence curves, one per dataset (Iris, SVMGuide2, Vowel, Wine); y-axis: average best fitness, from 0 to 0.45; x-axis: generations, from 0 to 140.]

Figure 1. Average fitness of the best particle on a typical fold of the tested datasets.


Figure 1 shows the convergence behavior of the PSO for each dataset, on a typical fold, plotting the best particle's fitness at each generation averaged over the 10 runs of the algorithm. One can see that PSO converges very fast. The most notable case is the Wine dataset, where the particles achieve a fitness equal to 0.


6 Conclusion

This paper addressed the model selection problem in the context of multiclass SVMs. A novel PSO-based approach was presented. It efficiently optimized the SVM hyperparameters considering both the individual binary SVMs and the interactions between them, dealing with the complete multiclass problem. Although the experiments conducted here did not show the superiority of any method, they showed that PSO performed well, encouraging further investigation. The authors believe that for more complex search problems, with several parameters taking many values, PSO will outperform the other methods, producing good results with manageable computational effort. Additional experiments using larger datasets and other model selection methods, including population-based ones, may also provide a better picture of the potential of PSO for model selection.

References

[1] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Mach. Learn., 46(1-3):131–159, 2002.
[2] M. Clerc. The swarm and the queen: Towards a deterministic and adaptive particle swarm optimization. In Proc. of IEEE CEC 1999, pages 1951–1957, Piscataway, New Jersey, 1999. IEEE Press.
[3] G. Cohen, M. Hilario, and A. Geissbühler. Model selection for support vector classifiers via genetic algorithms. An application to medical decision support. In Proc. of ISBMDA 2004, LNCS, pages 200–211, Valencia, Spain, 2004. Springer-Verlag.
[4] K. Duan, S. S. Keerthi, and A. N. Poo. Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 51:41–59, 2003.
[5] A. Feelders and W. Verkooijen. On the statistical comparison of inductive learning methods. In Learning from Data: Artificial Intelligence and Statistics V, pages 271–279. Springer-Verlag, 1996.
[6] F. Friedrichs and C. Igel. Evolutionary tuning of multiple SVM parameters. Neurocomputing, 64:107–117, 2005.
[7] T. Glasmachers and C. Igel. Gradient-based adaptation of general Gaussian kernels. Neural Comput., 17(10):2099–2105, 2005.
[8] T. Howley and M. G. Madden. The genetic kernel support vector machine: Description and evaluation. Artif. Intell. Rev., 24(3-4):379–395, 2005.
[9] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13:415–425, 2002.
[10] F. Imbault and K. Lebart. A stochastic optimization approach for parameter tuning of support vector machines. In Proc. of ICPR 2004, volume 4, pages 597–600. IEEE Computer Society, 2004.
[11] T. Joachims. Estimating the generalization performance of a SVM efficiently. In P. Langley, editor, Proc. of the 17th ICML, pages 431–438, Stanford, US, 2000. Morgan Kaufmann Publishers, San Francisco, US.
[12] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines: theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99:67–81, March 2004.
[13] K. E. Parsopoulos and M. N. Vrahatis. Recent approaches to global optimization problems through particle swarm optimization. Natural Computing, 1(2-3):235–306, 2002.
[14] A. Passerini, M. Pontil, and P. Frasconi. New results on error correcting output codes of kernel machines. IEEE Transactions on Neural Networks, 15(1):45–54, January 2004.
[15] X. Peng and A. Chan. An efficient algorithm on multi-class support vector machine model selection. In Proc. of IJCNN 2003, volume 4, pages 3229–3232, Portland, Oregon, July 2003. IEEE Press.
[16] R. Rifkin and A. Klautau. In defense of one-vs-all classification. J. Mach. Learn. Res., 5:101–141, 2004.
[17] T. P. Runarsson and S. Sigurdsson. Asynchronous parallel evolutionary model selection for support vector machines. Neural Information Processing - Letters and Reviews, 3(3):59–67, June 2004.
[18] T. Suttorp and C. Igel. Multi-objective optimization of support vector machines. In Multi-objective Machine Learning, volume 16 of Studies in Computational Intelligence, pages 199–220. Springer-Verlag, 2006.
[19] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.