An EA multi-model selection for SVM multiclass schemes

G. Lebrun (1), O. Lezoray (1), C. Charrier (1), and H. Cardot (2)

(1) LUSAC EA 2607, Vision and Image Analysis Team, IUT SRC, 120 Rue de l'exode, Saint-Lô, F-50000, France
{gilles.lebrun, olivier.lezoray, christophe.charrier}@unicaen.fr
(2) Laboratoire d'Informatique (EA 2101), Université François-Rabelais de Tours, 64 Avenue Jean Portalis, Tours, F-37200, France
[email protected]
Abstract. Multiclass problems with binary SVM classifiers are commonly treated by decomposition into several binary sub-problems. An open question is how to properly tune the SVM hyperparameters of all these sub-problems in order to obtain the lowest error rate for a decomposition-based SVM multiclass scheme. In this paper, we propose a new approach to optimize the generalization capacity of such SVM multiclass schemes. This approach consists in a global selection of the hyperparameters of all sub-problems together and is denoted as multi-model selection. Multi-model selection can outperform the classical individual model selection used until now in the literature. An evolutionary algorithm (EA) is proposed to perform multi-model selection. Experiments with our EA method show the benefits of our approach over the classical one.
1 Introduction
The multiclass classification problem refers to assigning one class from a set of possible ones to a feature vector. Among all possible inducers, SVM have particularly high generalization abilities [?] and have become very popular over the last few years. However, SVM are binary classifiers, and several combination schemes [?] were developed to extend SVM to problems with more than two classes. These schemes are based on different principles: probabilities [?,?], error correcting codes [?], correcting classifiers [?] and evidence theory [?,?]. All these combination schemes involve the following three steps: 1) decomposition of the multiclass problem into several binary sub-problems, 2) SVM training on all sub-problems to produce the corresponding binary decision functions, and 3) a decoding strategy to take a final decision from all binary decisions. The difficulties lie in the choice of the combination scheme and in how to optimize it. In this paper, we focus on step 2) when steps 1) and 3) are fixed. For that step, the SVM hyperparameters of each binary sub-problem must be properly tuned in order to obtain a low multiclass error rate with the combination of all the binary decision functions involved. The search for efficient hyperparameter values is commonly designated by the term model selection. The classical way to optimize a multiclass scheme is an individual model selection for each related binary sub-problem. This methodology implicitly assumes that a multiclass scheme based on SVM combination is optimal
when each binary classifier involved in that scheme is optimal on its dedicated binary problem. However, if one supposes that a decoding strategy can more or less easily correct binary classification errors, then individual model selection on each binary sub-problem cannot guarantee an optimal result on the full multiclass problem. For this main reason, we argue that another way to directly optimize multiclass schemes is a global multi-model selection over all binary problems together. In fact, the goal is to have a minimum of errors on the multiclass problem. The selection of all sub-problem models has to be performed globally to achieve that goal, even if this means that error rates are not optimal on every binary sub-problem. Multi-model selection is a hard problem, because it corresponds to a search in a huge space. Therefore, we propose an evolutionary algorithm (EA) to achieve this multi-model selection. A specific fitness function and recombination operator are defined in a framework where individual model selection is based on grid search techniques [?]. We compared the classical methodology and our EA methodology on several datasets. Two combination schemes are used: one-versus-all and one-versus-one. This choice is motivated firstly because these are the most commonly used, and secondly because Rifkin's experiments in [?] have shown that these simple combination schemes are as efficient as more complicated ones when the SVM hyperparameters are well tuned. Experimental results highlight that our EA multi-model selection produces more efficient multiclass classifiers than the classical individual model selection. Section 2 gives important overviews and definitions needed to understand the evolutionary model selection proposed in the sequel. Section 3 details our evolutionary optimization method. Section 4 gives the experimental protocol details and results. Section 5 draws the conclusions of this paper and proposes future research directions.
2 Overviews and definitions

2.1 Support Vector Machines (SVM)
SVM were developed by Vapnik according to the structural risk minimization principle from statistical learning theory [?]. Given training data (x_i, y_i), i = 1, ..., m, x_i ∈ R^n, y_i ∈ {−1, +1}, SVM map an input vector x into a high-dimensional feature space H through some mapping function φ : R^n → H and construct an optimal separating hyperplane in this space. The mapping φ(·) is performed by a kernel function K(·, ·) which defines an inner product in H. The separating hyperplane given by a SVM is: w · φ(x) + b = 0. The optimal hyperplane is characterized by the maximal distance to the closest training data. Computing this hyperplane is equivalent to minimizing the following optimization problem:

    V(w, b, ξ) = (1/2) ||w||^2 + C Σ_{i=1}^{m} ξ_i

under the constraints y_i [w · φ(x_i) + b] ≥ 1 − ξ_i and ξ_i ≥ 0 for all i = 1, ..., m, which require that all training examples are correctly classified up to some slack ξ_i; C is a parameter allowing a trade-off between training errors and model complexity. This optimization is a convex quadratic programming problem. Its dual problem [?] is to maximize

    W(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j y_i y_j K(x_i, x_j)

subject to 0 ≤ α_i ≤ C for all i = 1, ..., m and Σ_{i=1}^{m} y_i α_i = 0. The optimal solution α* specifies the coefficients of the optimal hyperplane w* = Σ_{i=1}^{m} α_i* y_i φ(x_i) and defines the subset SV of all support vectors. An example x_i of the training set is a support vector if α_i* > 0 in the optimal solution. The SV subset gives the binary decision function (BDF) h:

    h(x) = sign(s(x)),   s(x) = Σ_{i ∈ SV} α_i* y_i K(x_i, x) + b*        (1)
where the threshold b* is computed via the unbounded support vectors [?] (i.e. those with 0 < α_i* < C). An efficient algorithm, SMO [?], and many refinements [?] were proposed to solve the dual problem.
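To make the notation concrete, the binary decision function of Eq. (1) can be evaluated directly from a dual solution. The sketch below is illustrative only: it assumes the support vectors, their labels, the coefficients α_i* and the threshold b* have already been produced by some SVM solver, and all names are hypothetical.

```python
import numpy as np

def gaussian_kernel(xi, x, gamma):
    # K(xi, x) = exp(-gamma * ||xi - x||^2), the kernel also used in section 4
    return np.exp(-gamma * np.sum((np.asarray(xi) - np.asarray(x)) ** 2))

def svm_decision(x, support_vectors, sv_labels, alphas, b, gamma):
    # s(x) = sum_{i in SV} alpha_i* y_i K(x_i, x) + b*   (Eq. 1)
    s = sum(a * y * gaussian_kernel(xi, x, gamma)
            for a, y, xi in zip(alphas, sv_labels, support_vectors)) + b
    return np.sign(s), s  # h(x) = sign(s(x)) and the raw SVM output s(x)
```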
2.2 SVM probabilities estimation
The output of an SVM is not a probabilistic value but an uncalibrated distance of an example x to the separating hyperplane. For some decoding strategies, it is necessary to have a probability estimate p(y = +1|x) (see section 2.3). Platt has proposed a method [?] to map the SVM output s(x) into a positive class posterior probability by applying a sigmoid function to it:

    p(y = +1|x) = 1 / (1 + exp(a_1 · s(x) + a_2))        (2)
The parameters a_1 and a_2 are determined by minimizing the negative log-likelihood on a test set [?].
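A minimal sketch of this mapping is given below. The fitting of (a_1, a_2) is shown as a generic negative log-likelihood minimization rather than Platt's exact fitting procedure, and the helper names are ours, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def platt_probability(s, a1, a2):
    # p(y=+1|x) = 1 / (1 + exp(a1 * s(x) + a2))   (Eq. 2)
    return 1.0 / (1.0 + np.exp(a1 * np.asarray(s) + a2))

def fit_platt(scores, labels):
    # Fit (a1, a2) by minimizing the negative log-likelihood on held-out SVM outputs.
    # labels are expected in {-1, +1}; they are mapped to targets in {0, 1}.
    t = (np.asarray(labels) + 1) / 2.0
    def nll(params):
        p = np.clip(platt_probability(scores, *params), 1e-12, 1.0 - 1e-12)
        return -np.sum(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
    return minimize(nll, x0=np.array([-1.0, 0.0]), method="Nelder-Mead").x
```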
2.3 SVM combination schemes
SVM are specifically designed for binary problems. Several combination schemes have been developed to take that specificity into account and deal with multiclass problems [?,?,?,?,?,?]. Among all combination schemes, the one-versus-all scheme based on a winner-takes-all strategy and the one-versus-one (or pairwise) scheme based on a max-wins voting strategy are the most generally used. When class probabilities are estimated on each binary problem (see section 2.2), both schemes have adapted decoding strategies to estimate class probabilities for the multiclass problem. Let ω denote the set of class labels and ω_i (i ∈ [1, |ω|]) one of the class labels. The class c selected as the final decision is the one with maximum probability: c = arg max_{ω_i ∈ ω} p(ω_i|x).
For the one-versus-all combination scheme, k = |ω| binary sub-problems are constructed from the initial multiclass problem. The i-th sub-problem is built by using all examples of class ω_i as positive instances and the examples of all other classes as negative instances. The binary decision function h_i is produced by training a SVM on the i-th sub-problem. Let p_(i)(x) (determined by Platt mapping from the output of the SVM h_i) denote the posterior probability of an example x being a positive instance of the i-th binary sub-problem. The decoding method to estimate class probabilities is:

    p(ω_i|x) = p_(i)(x) / Σ_{j=1}^{|ω|} p_(j)(x)        (3)
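The decoding of Eq. (3) is simply a normalization of the k = |ω| Platt probabilities; a small illustrative sketch (array names are ours):

```python
import numpy as np

def one_versus_all_decode(p):
    # p[i] = p_(i)(x): Platt probability that x is a positive instance of the
    # i-th one-versus-all sub-problem.  Returns the estimates p(omega_i|x) of Eq. (3).
    p = np.asarray(p, dtype=float)
    return p / p.sum()

# Final winner-takes-all decision: c = np.argmax(one_versus_all_decode(p))
```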
For the one-versus-one combination scheme, k = |ω|·(|ω|−1)/2 binary sub-problems are constructed from the initial multiclass problem by pairwise decomposition. Let (i, j) denote a binary sub-problem (i, j ∈ [1, |ω|], i < j) which is built by using all examples of class ω_i as positive instances and all examples of class ω_j as negative instances. Let p_(i,j)(x) denote the posterior probability of an example x being a positive instance of the (i, j) sub-problem (Platt mapping is used in the same way as with one-versus-all to determine the p_(i,j)(x) values). There are several decoding methods to estimate the class probabilities p(ω_i|x) from all the p_(i,j)(x) values [?]. One fast and efficient way to do this is to use the formulation of Price [?]:

    p(ω_i|x) = 1 / ( Σ_{j=1, j≠i}^{|ω|} 1/p_(i,j)(x) − (|ω| − 2) )        (4)
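Eq. (4) can be sketched as follows. The array `p_pair` is assumed to hold p_(i,j)(x) for every i ≠ j, filled using the symmetry p_(j,i)(x) = 1 − p_(i,j)(x) of the pairwise decomposition; this layout is an assumption of the sketch, not a statement from the paper.

```python
import numpy as np

def one_versus_one_decode(p_pair):
    # p_pair[i, j] = p_(i,j)(x) for i != j (probability that x belongs to class i
    # rather than class j).  Returns the estimates p(omega_i|x) of Eq. (4).
    k = p_pair.shape[0]
    probs = np.empty(k)
    for i in range(k):
        inv_sum = sum(1.0 / p_pair[i, j] for j in range(k) if j != i)
        probs[i] = 1.0 / (inv_sum - (k - 2))
    return probs

# Final decision: c = np.argmax(one_versus_one_decode(p_pair))
```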
2.4 Multi-model optimization problem
A multiclass combination scheme induces several binary sub-problems. The number k and the nature of the binary sub-problems depend on the decomposition involved in the combination scheme. For each binary sub-problem, a SVM must be trained to produce an appropriate binary decision function h_i (1 ≤ i ≤ k). The quality of h_i greatly depends on the selected model θ_i and is characterized by the expected error rate e_i on new data with the same binary decomposition. The model θ_i contains the SVM regularization parameter C and all other parameters associated with the kernel function (the bandwidth of the Gaussian kernel, for example). The expected error rate e_i associated with a model θ_i is commonly estimated by cross-validation techniques. All the models θ_i together constitute the multi-model θ = (θ_1, ..., θ_k). The expected error rate e of a SVM multiclass combination scheme directly depends on the selected multi-model θ. Let Θ denote the multi-model space for a multiclass problem (i.e. θ ∈ Θ) and Θ_i the model space for the i-th binary sub-problem (i.e. θ_i ∈ Θ_i). The best multi-model θ* is the one for which the expected error e is minimum and corresponds to the following optimization problem:

    θ* = arg min_{θ ∈ Θ} e(θ)        (5)
where e(θ) denotes the expected error of the multiclass combination scheme with multi-model θ. The huge size of the multi-model space Θ = Θ_1 × ... × Θ_k makes the optimization problem (5) very hard. To reduce its complexity, it is classic to use the following approximation:

    θ̃ = { arg min_{θ_i ∈ Θ_i} e_i(θ_i) | i ∈ [1, k] }        (6)
The hypothesis is made that e(θ̃) ≈ e(θ*); this also supposes that e_i(θ_i*) ≈ e_i(θ̃_i) for all i ∈ [1, k]. While it is evident that each model θ_i* in the best multi-model θ* must correspond to an efficient SVM (low value of e_i) on the corresponding i-th binary sub-problem, the individually best models θ̃_i do not necessarily define the best multi-model θ*. The first reason is that all error rates e_i are estimated with some tolerance, and the combination of all these deviations can have a great impact on the final multiclass error rate e. The second reason is that even if the binary classifiers of a combination scheme have identical e_i error rates for different multi-models, these binary classifiers can make different binary class predictions for the same example. The multiclass predictions obtained by combining these binary classifiers can therefore also differ for the same feature vector, since the correction achieved by a given decoding strategy depends on the nature of the internal errors of the binary classifiers (mainly, their number). Thus, multiclass classification schemes with the same internal errors e_i but different multi-models θ can have different generalization capacities. For all these reasons, we claim that multi-model optimization (Eq. 5) can outperform individual model optimization (Eq. 6).
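For comparison, the classical approximation of Eq. (6) boils down to one independent grid search per binary sub-problem. The sketch below assumes a user-supplied `estimate_binary_ber(sub_problem, C, gamma)` returning a cross-validated balanced error rate; this helper and the data layout are hypothetical.

```python
import itertools

def individual_model_selection(sub_problems, C_grid, gamma_grid, estimate_binary_ber):
    # Eq. (6): for each binary sub-problem, keep the model (C, gamma)
    # minimizing its own cross-validated balanced error rate e_i.
    multi_model = []
    for sub_problem in sub_problems:
        best = min(itertools.product(C_grid, gamma_grid),
                   key=lambda model: estimate_binary_ber(sub_problem, *model))
        multi_model.append(best)
    return multi_model  # theta_tilde; not necessarily the optimum theta* of Eq. (5)
```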
3 Evolutionary optimization method
Evolutionary algorithms (EA) [?] belong to a family of stochastic search algorithms inspired by natural evolution. These algorithms operate on a population of potential solutions and apply a survival principle, according to a fitness measure associated with each solution, to produce better and better approximations of the optimal solution. At each iteration, a new set of solutions is created by selecting individuals according to their level of fitness and applying several operators to them. These operators model natural processes such as selection, recombination, mutation, migration, locality and neighborhood. Although the basic idea of EA is straightforward, the solution coding, the population size, the fitness function and the operators must be defined in compliance with the kind of problem to optimize.

Within our EA multi-model selection method, a fitness measure f is associated with a multi-model θ; it is all the larger as the error e associated with θ is small, which enables solving the optimization problem (Eq. 5). The fitness value is normalized so that f = 1 when the error e is zero and f = 0 when the error e corresponds to a random draw. Moreover, the number of examples in each class is not always well balanced in multiclass datasets; to deal with this, the error e is a Balanced Error Rate (BER). Taking these two key points into account, the proposed fitness formulation is:

    f = (1 / (1 − 1/|ω|)) · (1 − 1/|ω| − e)        (7)

In the same way, the internal fitness of the i-th binary classifier with corresponding BER e_i is defined as f_i = 1 − 2·e_i (i.e. Eq. (7) with |ω| = 2).

The EA cross-over operator for the combination of two multi-models θ^1 and θ^2 must favor the selection of the most efficient models (θ_i^1 or θ_i^2, for each i ∈ [1, k]) from these two multi-models. It is worth noting that one should not systematically select all the best models to produce an efficient child multi-model θ (see section 2.4). For each sub-problem, the internal fitnesses f_i^1 and f_i^2 are used to determine the probability

    p_i = (f_i^1)^2 / ((f_i^1)^2 + (f_i^2)^2)        (8)
to select the i-th model of θ^1 as a model of θ, where f_i^j denotes the internal fitness of the i-th binary classifier within the multi-model θ^j. An important advantage for the child multi-models generated by the cross-over operator is that no new SVM training is necessary if all the related binary classifiers have already been trained; only the BER of each child multi-model has to be evaluated. SVM training is only necessary for the first step of the EA and when models go through the mutation operator (i.e. a hyperparameter modification). The rest of our EA for multi-model selection is similar to other EA approaches. First, a population {θ^1, ..., θ^λ} of λ multi-models is generated at random: each model θ_i^j is drawn uniformly at random from all possible values of the SVM hyperparameters (see section 4 for experimental details). New multi-models are produced by combining couples of multi-models selected by a Stochastic Universal Sampling (SUS) strategy; a selective pressure of 2 is used for the SUS selection. Each model θ_i^j (the i-th binary classifier of the j-th multi-model, j ∈ [1, λ] and i ∈ [1, k]) has a probability p_m/k of mutating (uniform random draw, as for the initialization of the EA). The fitness f of each child multi-model is evaluated, and the children become the multi-models of the next iteration. The number of iterations of the EA is fixed to n_max. At the end of the EA, the multi-model with the best fitness f over all iterations is selected as θ*.
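The overall procedure can be summarized by the following much-simplified sketch. It assumes user-supplied helpers `random_model()` (uniform draw of one (C, γ) pair), `evaluate(theta)` (returning the cross-validated multiclass BER and the k binary BERs of a multi-model) and `mutate(model)`; parent selection is shown as plain fitness-proportional sampling instead of the SUS strategy with selective pressure 2 actually used, so this is an illustration of the idea rather than the authors' exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(ber, n_classes):
    # Eq. (7): 1 for a perfect classifier, 0 for a random draw.
    return (1.0 - 1.0 / n_classes - ber) / (1.0 - 1.0 / n_classes)

def crossover(parent1, parent2, fits1, fits2):
    # Eq. (8): take the i-th model of parent1 with probability
    # p_i = (f_i^1)^2 / ((f_i^1)^2 + (f_i^2)^2).
    child = []
    for m1, m2, f1, f2 in zip(parent1, parent2, fits1, fits2):
        denom = f1 ** 2 + f2 ** 2
        p_i = 0.5 if denom == 0.0 else f1 ** 2 / denom
        child.append(m1 if rng.random() < p_i else m2)
    return child

def ea_multi_model_selection(k, n_classes, random_model, evaluate, mutate,
                             lam=50, p_m=0.01, n_max=100):
    # Random initial population of lam multi-models, each with k models.
    population = [[random_model() for _ in range(k)] for _ in range(lam)]
    best_theta, best_f = None, -np.inf
    for _ in range(n_max):
        evals = [evaluate(theta) for theta in population]            # (BER e, [e_1..e_k])
        fits = np.array([fitness(e, n_classes) for e, _ in evals])
        internal = [[1.0 - 2.0 * ei for ei in eis] for _, eis in evals]  # f_i = 1 - 2 e_i
        i_best = int(np.argmax(fits))
        if fits[i_best] > best_f:
            best_theta, best_f = population[i_best], fits[i_best]
        probs = np.clip(fits, 1e-12, None)                            # crude selection,
        probs = probs / probs.sum()                                   # not the paper's SUS
        children = []
        for _ in range(lam):
            a, b = rng.choice(lam, size=2, p=probs)
            child = crossover(population[a], population[b], internal[a], internal[b])
            child = [mutate(m) if rng.random() < p_m / k else m for m in child]
            children.append(child)
        population = children
    return best_theta
```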
4 Experimental results
In this section, three well known multiclass datasets are used: Satimage (|ω| = 6) and Letter (|ω| = 26) from the Statlog collection [?], and the USPS (|ω| = 10) dataset [?]. The same random splits and scaling factors are applied to the feature vectors as in the experiments of Wu et al. [?]. Two sampling sizes (identical to [?]) are used, 300/500 and 800/1000 for the training/testing sets, and for each size 20 random splits are generated (again identical to [?]). Two optimization methods are used to determine the best multi-model θ* for each training set: the classical individual model selection and our EA multi-model selection. For both methods, the two combination schemes presented in section 2.3 are used. For each binary problem, a SVM with Gaussian kernel K(x_i, x_j) = exp(−γ ||x_i − x_j||^2) is trained. The possible values of the SVM hyperparameters for a model θ_i are identical for all binary problems (as induced by the grid-search technique used with LIBSVM [?]): Θ_i = {2^−5, 2^−3, ..., 2^15} × {2^−5, 2^−3, ..., 2^15}. The BER e on the multiclass problem and the BER e_i on the binary sub-problems are estimated by five-fold cross-validation (CV); these BER values are used for multi-model selection. For each tested model θ_i, the associated values of a_1 and a_2 (cf. section 2.2) are determined by averaging the values found by a second level of CV (four-fold CV; see section 7.1 of [?] for the detailed reasons of that process). The final BER e of the selected multi-model θ* is estimated on the test dataset. Table 1 gives the average BER over all 20 splits of the previously mentioned datasets, for the two testing set sizes, for the two combination schemes (one-versus-one and one-versus-all), and for the two above-mentioned selection methods (columns ē_classic and ē_EA). The column ∆ē of Table 1 provides the average variation of BER between our multi-model selection and the classical one.
                            Size 500                                  Size 1000
                ē_classic      ē_EA          ∆ē          ē_classic      ē_EA          ∆ē
one-versus-one
  Satimage     14.7 ± 1.8 %  14.5 ± 2.1 %   -0.2 %      11.8 ± 0.9 %  11.8 ± 1.0 %   -0.0 %
  USPS         12.8 ± 1.2 %  11.0 ± 1.8 %   -1.8 %       8.9 ± 0.9 %   8.4 ± 1.6 %   -0.5 %
  Letter       40.5 ± 3.0 %  35.9 ± 2.9 %   -4.6 %      21.4 ± 1.7 %  18.6 ± 2.1 %   -2.8 %
one-versus-all
  Satimage     14.6 ± 1.7 %  14.5 ± 2.0 %   -0.1 %      11.5 ± 0.8 %  11.6 ± 1.0 %   +0.1 %
  USPS         11.9 ± 1.3 %  11.2 ± 1.5 %   -0.7 %       8.8 ± 1.3 %   8.5 ± 1.6 %   -0.3 %
  Letter       41.9 ± 3.3 %  36.3 ± 3.3 %   -5.6 %      22.1 ± 1.3 %  19.7 ± 1.8 %   -2.4 %

Table 1. Two decoding methods and two training/testing sizes are used for each dataset. Column ē_classic corresponds to the average of the 20 balanced error rates obtained with individual model selection for each binary problem. Column ē_EA corresponds to the average of the 20 balanced error rates obtained with EA multi-model selection over all binary problems (λ = 50, p_m = 0.01, n_max = 100). Column ∆ē corresponds to the variation of error rate between EA multi-model and individual model selection (i.e. ∆ē = ē_EA − ē_classic).
Results of the ∆ē column are particularly important. For two datasets (USPS and Letter), our optimization method produces SVM combination schemes with better generalization capacities than the classical one. That effect appears to be more marked when the number of classes in the multiclass problem increases. A reason is that the size of the multi-model search space increases exponentially with the number k of binary problems involved in a combination scheme (121^k for these experiments). This effect is directly linked to the number of classes |ω| and could explain why improvements are not measurable on the Satimage dataset. In some way, a classical optimization method explores the multi-model space Θ blindly, because the cumulative effect of combining the k SVM decision functions cannot be determined without estimating e. That effect is emphasized when the estimated BER e_i are poor (i.e. when training and testing data sizes are low); comparing the ∆ē values in Table 1 when the training/testing dataset size changes illustrates this.
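For reference, the kind of cross-validated BER estimation used throughout the selection procedures can be sketched as follows. This uses scikit-learn purely for illustration, whereas the experiments above rely on LIBSVM and the splits of Wu et al.; the five-fold protocol and the RBF kernel match the description in this section, the rest is an assumption.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cv_balanced_error_rate(X, y, C, gamma, n_splits=5, seed=0):
    # Five-fold cross-validated BER of an RBF-kernel SVM for one model (C, gamma).
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    bers = []
    for train_idx, test_idx in folds.split(X, y):
        clf = SVC(C=C, gamma=gamma, kernel="rbf").fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[test_idx])
        bers.append(1.0 - balanced_accuracy_score(y[test_idx], y_pred))
    return float(np.mean(bers))

# The 11 x 11 hyperparameter grid of this section:
# grid = [2.0 ** p for p in range(-5, 16, 2)]
```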
5 Conclusion
In this paper, a new EA multi-model selection method is proposed to optimize the generalization capacities of SVM combination schemes. The definition of a cross-over operator based on the internal fitness of the SVM on each binary problem is the core of our EA method. Experimental results show that our method increases the generalization capacities of one-versus-one and one-versus-all combination schemes when compared with the individual model selection method. In future work, the proposed EA multi-model selection method has to be tested with other combination schemes [?] and with new datasets covering a greater range of |ω|. Adding feature selection abilities to our EA multi-model selection is also of importance. Another key point to take into account is the reduction of the learning time of our EA method, which is currently expensive. One way to explore this is to use fast CV error estimation techniques [?] for the estimation of the BER.