Neural Comput & Applic (2008) 17:19–25 DOI 10.1007/s00521-006-0078-2

ORIGINAL ARTICLE

Polynomial kernel adaptation and extensions to the SVM classifier learning

Ramy Saad · Saman K. Halgamuge · Jason Li

Received: 17 February 2006 / Accepted: 6 November 2006 / Published online: 10 January 2007
© Springer-Verlag London Limited 2006

Abstract  Three extensions to the Kernel-AdaTron training algorithm for Support Vector Machine classifier learning are presented. These extensions allow the trained classifier to adhere more closely to the constraints imposed by Support Vector Machine theory. The results of these modifications show improvements over the existing Kernel-AdaTron algorithm. A method of parameter optimisation for polynomial kernels is also proposed.

Keywords  Kernel methods · Support vector machines · Kernel-AdaTron algorithm

R. Saad · S. K. Halgamuge · J. Li
Mechatronics Research Group, University of Melbourne, Melbourne, VIC 3010, Australia
e-mail: [email protected]

1 Introduction

Support Vector Machine (SVM) learning is used in many applications, including cancer classification, image recognition and protein function prediction, for its robustness and approximation capability [1–3]. Presently, various parameters of SVM learning algorithms have to be set by the user by trial and error. This work attempts to automate this process to improve the practicality of SVM, and also suggests extensions to a well-known SVM learning algorithm.

Since the conception of SVM, a number of papers have been published regarding its modifications and improvements, parameter choosing, its learning theory and its applications [1, 2, 4–6]. Among these, two well-known training algorithms are Sequential Minimal Optimisation (SMO) and Kernel-AdaTron. They are well known for their implementation simplicity, which is an important aspect for many researchers. SMO is based on repeated local optimisation of a pair of Lagrangian variables [5], and thus has a strong analytical background; however, it relies on heuristics for fast convergence. Kernel-AdaTron is based on stochastic-gradient ascent and is arguably the simplest training algorithm for SVM. However, it does not guarantee satisfaction of the equality constraint of the SVM optimisation problem, as will be shown in Section 2. Despite their apparent differences, it has been demonstrated in [7] that SMO and Kernel-AdaTron share strong similarities in their training processes. This work suggests a few extensions to the Kernel-AdaTron algorithm.

Some other recent research aims to improve the training speed by reformulating the underlying equations of SVM; examples are Lagrangian SVM [8], LS-SVM [9] and the Nearest Point Algorithm [4]. However, these depart from the optimisation concept of the original SVM. There is also literature concerned with choosing the type of kernel and the parameters of a chosen kernel [10, 11]. To date, there is no automatic algorithm for choosing the right kernel for a given application. Some attempts have been made to automatically adapt the parameter of Gaussian kernels [12]; unfortunately, these do not generalise to other kernel types. This work addresses the parameter optimisation of polynomial kernels.

The remainder of the paper is organised as follows. Section 2 gives an introduction to the SVM classifier and Kernel-AdaTron. Section 3 presents three modifications to the existing Kernel-AdaTron algorithm
that will both accelerate the learning procedure and make the results more accurate. Section 4 demonstrates a method for automatically adapting the polynomial kernel parameter to its optimum. Section 5 presents the results of the modifications on several data sets including the Iris data set [13], the Monk’s problems [14] and the leukaemia data set [15]. Section 6 concludes the paper.

2 Kernel-AdaTron and Support Vector Machine classification

Support Vector Machine binary classification is a quadratic programming (QP) optimisation problem that aims to maximise the performance function:

J = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)    (1)

subject to the constraints:

\sum_{i=1}^{m} \alpha_i y_i = 0    (2)

0 \le \alpha_i \le C, \quad \forall i    (3)

where m is the number of training examples, K the kernel function, x the input vectors, y ∈ {−1, +1} represents one of the two possible output classes, C a trade-off parameter for generalisation performance, and α = {α_1, ..., α_m} are the Lagrangian variables to be optimised [16, 17]. The conceptual space between y = −1 and y = +1 is called the classification margin. The kernel is generally interpreted as a function that performs an implicit inner product between two transformed vectors:

K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle    (4)

where ⟨·,·⟩ denotes an inner product and φ is a mapping from the input space to a feature space which, in most cases, does not need to be explicitly defined. Given a set of values for α, the decision function of the SVM classifier is expressed as:

f(x) = \mathrm{sign}\left( \sum_{i=1}^{v} \alpha_i y_i K(x, x_i) + b \right)    (5)

where v ≤ m is the number of support vectors (i.e., α_i > 0) and b is the bias, which must satisfy the Karush–Kuhn–Tucker (KKT) optimality conditions of the SVM dual problem. Let

F_i = \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j).    (6)

The KKT conditions are summarised as three cases [5]:

Case 1: (F_i + b) y_i - 1 > 0,  \forall \alpha_i = 0    (7)
Case 2: (F_i + b) y_i - 1 = 0,  \forall 0 < \alpha_i < C    (8)
Case 3: (F_i + b) y_i - 1 \le 0,  \forall \alpha_i = C.    (9)

Kernel-AdaTron is a widely used SVM training algorithm that solves for both α and b, and it has the merit of simplicity. Let

g_i = y_i F_i    (10)

and initialise α_i = η/1000, ∀i, with the learning rate η set to a small value. The Kernel-AdaTron algorithm is described as follows [6, 18]:

1. For all i, calculate F_i and g_i using (6) and (10), respectively.

2. Let

   \Delta\alpha_i = \eta \frac{\partial J}{\partial \alpha_i} = \eta (1 - g_i)    (11)

   be the proposed change for α_i.

3. At time T + 1, each α_i is updated with:

   \alpha_i(T+1) = \begin{cases} \alpha_i(T) + \Delta\alpha_i(T), & \text{if } 0 < \alpha_i(T) + \Delta\alpha_i(T) < C \\ 0, & \text{if } \alpha_i(T) + \Delta\alpha_i(T) \le 0 \\ C, & \text{if } \alpha_i(T) + \Delta\alpha_i(T) \ge C. \end{cases}    (12)

4. Let

   b_{up} = \min_{\{i \mid y_i = +1\}} (F_i) \quad \text{and} \quad b_{low} = \max_{\{i \mid y_i = -1\}} (F_i).    (13)

   Calculate the bias b with:

   b = -\frac{1}{2}\left(b_{up} + b_{low}\right).    (14)

5. If a maximum number of iterations is exceeded, or the margin

   p = \frac{1}{2}\left(b_{up} - b_{low}\right)    (15)

   is approximately 1, i.e.,

   1 - t < p < 1 + t,    (16)

   where t is a threshold, then stop the training. Otherwise return to step 1.
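To make the procedure above concrete, the following is a minimal NumPy sketch of the loop in steps 1–5. It is an illustration rather than a reference implementation: the function names are ours, the α updates are applied to all i at once for brevity (the original description is a sample-by-sample update), and it assumes a precomputed kernel matrix with at least one training example in each class.

```python
import numpy as np

def kernel_adatron(K, y, C=10.0, eta=0.01, t=0.01, max_iter=10000):
    """Sketch of Kernel-AdaTron, steps 1-5; K is the m x m kernel matrix, y in {-1, +1}."""
    m = len(y)
    alpha = np.full(m, eta / 1000.0)            # initialise alpha_i = eta/1000
    b = 0.0
    for _ in range(max_iter):
        F = K @ (alpha * y)                     # F_i = sum_j alpha_j y_j K(x_i, x_j)   (6)
        g = y * F                               # g_i = y_i F_i                         (10)
        delta = eta * (1.0 - g)                 # proposed change                       (11)
        alpha = np.clip(alpha + delta, 0.0, C)  # clamp to [0, C]                       (12)
        b_up = F[y == +1].min()                 # bias bounds                           (13)
        b_low = F[y == -1].max()
        b = -0.5 * (b_up + b_low)               # bias                                  (14)
        p = 0.5 * (b_up - b_low)                # approximate margin                    (15)
        if 1.0 - t < p < 1.0 + t:               # stopping test                         (16)
            break
    return alpha, b

def predict(K_new, alpha, y, b):
    """Decision function (5): rows of K_new hold K(x, x_i) for each new point x."""
    return np.sign(K_new @ (alpha * y) + b)
```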

3 Proposed extensions to the Kernel-AdaTron learning algorithm

This section addresses some of the drawbacks of Kernel-AdaTron and proposes a solution to each of them. Some of these drawbacks have also been discussed in [6, 7, 17].

3.1 Choosing the learning rate

The learning rate is highly dependent on the data set, making it difficult to find a suitable value; it is quite often found by trial and error. A method for finding optimal learning rates is proposed in [6]. However, it uses different rates for training different variables, adding complexity to the algorithm. The learning rate proposed here is calculated by:

\eta = \frac{\lambda}{m} \left[ \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} K(x_i, x_j) \right]^{-1}    (17)

where λ ≥ 1 is a scaling factor and the divisor m reflects the fact that Kernel-AdaTron is a sample-update algorithm rather than a batch-update algorithm. The bracketed term is the inverse of the average value of the kernel matrix, which has some similarity with the adaptive rate proposed in [6]. In their work, the kernel entry K(x_i, x_i) for each data point x_i is inverted to compute a local learning rate for that particular data point, whereas in our method the average value across the whole kernel matrix is computed to give a single representative learning rate. Our method has an obvious advantage when the number of data points is large, as only a single learning rate needs to be retained in memory. Moreover, since the kernel matrix must be calculated anyway, evaluating (17) takes only a little additional computational time and needs to be done only once.

The inspiration for (17) comes from classical engineering dimensional analysis [19]. The term g_i in (10) has the same unit of measurement as the kernel matrix. Hence, dividing it by the average value of the kernel matrix makes the correction of α_i in (11) dimensionless, i.e., just a scale multiplied by the fraction λ/m. The value of η in (17) with λ = 1 is conservative; our experiments suggest that 1 ≤ λ ≤ 2 can be used in practice. The amount by which η can be scaled up indicates much about the separability of the data: the more separable the data, the more η can be scaled up and so the larger λ can be. Note that, initially, α_i = η/1000 for all i.

3.2 Updating the Lagrangian variables

The learning algorithm does not guarantee that the Lagrangian variables α are updated in such a way that the equality condition (2), \sum_{i=1}^{m} \alpha_i y_i = 0, is satisfied; the constraint is guaranteed to hold only when kernels that do not require an explicit bias in the SVM are used, such as the Gaussian RBF kernel [7]. This is an intrinsic problem of the Kernel-AdaTron algorithm, arising from its stochastic-gradient nature [17]. To tackle this, let us define:

F = \left( \frac{1}{v} \sum_{i=1}^{v} \alpha_i y_i \right)^2.    (18)

Minimising F guarantees the condition \sum_{i=1}^{m} \alpha_i y_i = 0. Together with the objective function J described in (1), our new optimisation problem becomes

Maximise  J - wF,    (19)

subject to the same constraints as in (2) and (3):

\sum_{i=1}^{m} \alpha_i y_i = 0,    (20)

0 \le \alpha_i \le C, \quad \forall i,    (21)

where w is a weight specifying the importance of F relative to J. For Kernel-AdaTron, our experiments suggest that w = 1 is a good value. With the new objective function and w = 1, (11) becomes

\Delta\alpha_i = \eta \frac{\partial (J - F)}{\partial \alpha_i} = \eta \left( \frac{\partial J}{\partial \alpha_i} - \frac{\partial F}{\partial \alpha_i} \right).    (22)

Therefore,

\Delta\alpha_i = \eta \left( 1 - g_i - \frac{2 y_i}{v^2} \sum_{j=1}^{v} \alpha_j y_j \right).    (23)

Equation (23) replaces (11) in the Kernel-AdaTron algorithm.
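A rough sketch of how (17) and (23) could be realised, continuing the NumPy style of the earlier sketch: the helper names are ours, w is fixed at 1 as suggested above, and the support-vector set in (23) is taken to be the currently non-zero α's, which is our reading of v.

```python
import numpy as np

def adaptive_learning_rate(K, lam=1.0):
    """Learning rate of (17): eta = (lam / m) * [average kernel value]^-1."""
    m = K.shape[0]
    return (lam / m) / K.mean()        # K.mean() equals (1/m^2) * sum_ij K(x_i, x_j)

def delta_alpha_extended(alpha, y, g, eta):
    """Modified update (23), replacing (11); adds the equality-constraint correction term."""
    sv = alpha > 0                      # current support vectors (alpha_i > 0)
    v = max(int(sv.sum()), 1)           # guard against an empty support-vector set
    correction = (2.0 * y / v**2) * np.sum(alpha[sv] * y[sv])
    return eta * (1.0 - g - correction)
```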


3.3 Stopping condition

The condition used to terminate the learning procedure is generally overlooked, even though it can have a serious effect on the results. Even when it is addressed, most proposed thresholds are data dependent and hence hard to use in a fully automated system; they are generally determined by trial and error. The proposed method is to halt the learning when the classification margin has stabilised. The margin for an SVM system is [17, 18]:

\gamma = \frac{2}{\sqrt{\sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)}}.    (24)

Note that (24) reflects the true classification margin in the feature space, whereas the margin p in (15) is an approximation based on the bias. Only at the optimum do these two margins give the same information, i.e., γ = 2p. The stability of the margin γ can be detected using the following straightforward condition:

\epsilon > \max_{t \in \{0, \ldots, q-1\}} \frac{|\gamma_{T-t} - \bar{\gamma}|}{|\bar{\gamma}|},    (25)

where ε is a parameter with a value close to 0, q defines the number of iterations involved in the computation of stability, T indicates the current iteration, γ_{T−t} is the margin t iterations ago and \bar{\gamma} is the mean value of the previous margins:

\bar{\gamma} = \frac{1}{q} \sum_{t=0}^{q-1} \gamma_{T-t}.    (26)

The learning is halted when condition (25) is satisfied. Setting the parameter ε to be very small (e.g., 1 × 10^{-5}) and q to be relatively large (e.g., 100) will ensure that the margin has stabilised before the learning stops. The most computationally costly component of this method is the denominator of (24). However, as the training progresses, more α's become zero because they correspond to non-support vectors, which can largely reduce the time required to compute (24). Hence, to improve overall training time, margin stability (and hence the stopping condition) should only be checked in the later stages of training, where more α's are zero.

4 Dynamically adapting kernel parameters

Dynamically adapting kernel parameters has been the focus of several research papers [10–12]. In this work, we address the dynamic adaptation of the polynomial kernel:

k(x_i, x_j) = \left( \langle x_i, x_j \rangle + c \right)^d,    (27)

where both c and d are kernel parameters. Parameter d determines the number of dimensions in the induced feature space and hence the complexity of the hyperplane. In our work, we aim to adapt d to a value giving optimal classification performance, while c is set to zero.

The primary challenge to overcome here is to find a method for comparing the effect of different d values on a trained classifier. Generally, the classification margin γ could be used to measure efficiency. However, the margin given by (24) depends on the parameter d itself as well as on the trained α and b, which makes the margin change by orders of magnitude as d changes. In fact, margins obtained with different d should not be compared, as they lie in different feature spaces. An evaluation measure that does not depend on the feature space must be found in order to compare the results of different kernel parameters and hence to choose the best one. Another possible measure is the classification accuracy of the SVM. This, however, is too coarse for the fine tuning of parameters and is also highly dependent on the relative distribution of the data points between the training set and the testing set. Therefore, we propose the following performance measure, Perf:

Perf = \frac{1}{T E},    (28)

where

E = \frac{\sqrt{\sum_{i=1}^{m} \left[ \alpha_i y_i \left( \sum_{j=1}^{m} \alpha_j y_j k(x_i, x_j) + b \right) - 1 \right]^2}}{\sum_{i=1}^{m} \alpha_i},    (29)

T is the number of iterations the algorithm has processed so far, and E, deduced from the KKT conditions (7)–(9), represents the amount of error in the current classifier. A higher value of Perf indicates better performance of the system. The proposed performance measure is based on the following assumption: a faster and more accurate SVM training (i.e., lower T and lower E, respectively) implies a larger separation space between the two classes. This assumption only holds for polynomial kernels. Training and testing results demonstrate the effectiveness of this performance measure.
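The quantities in (24)–(29) can be computed directly from the kernel matrix and the current α and b. The sketch below follows the reconstructed formulas above; the function names, the margin-history handling and the bracketing of the residual in (29) reflect our own reading and should be treated as an illustration only.

```python
import numpy as np

def polynomial_kernel(X, Z, d, c=0.0):
    """Polynomial kernel (27): k(x_i, x_j) = (<x_i, x_j> + c)^d, with c = 0 in this work."""
    return (X @ Z.T + c) ** d

def classification_margin(alpha, y, K):
    """Feature-space margin (24): gamma = 2 / sqrt((alpha*y)^T K (alpha*y))."""
    ay = alpha * y
    return 2.0 / np.sqrt(ay @ K @ ay)

def margin_stabilised(margins, eps=1e-5, q=100):
    """Stability test (25)-(26) over the last q recorded margins (newest last)."""
    if len(margins) < q:
        return False
    recent = np.asarray(margins[-q:])
    mean = recent.mean()                                      # (26)
    return np.max(np.abs(recent - mean)) / abs(mean) < eps    # (25)

def performance(alpha, y, K, b, T):
    """Perf = 1 / (T * E), with E following our literal reading of (29)."""
    F = K @ (alpha * y)
    residual = alpha * y * (F + b) - 1.0
    E = np.sqrt(np.sum(residual ** 2)) / np.sum(alpha)
    return 1.0 / (T * E)
```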


The second step is to define the stabilisation condition for d, which once satisfied halts the adaptation:

\epsilon_d > \max_{t_d \in \{0, \ldots, q_d - 1\}} \frac{|d_{T_d - t_d} - \bar{d}|}{|\bar{d}|},    (30)

where ε_d, q_d and T_d have the same meanings as ε, q and T in (25) but take different values here. Also, d_{T_d − t_d} is the d value t_d adaptations ago and \bar{d} is the mean value:

\bar{d} = \frac{1}{q_d} \sum_{t_d = 0}^{q_d - 1} d_{T_d - t_d}.    (31)

Setting ε_d to 0.01 and q_d to 10 will ensure that d has stabilised before the adaptation of d halts. Similar to (25), condition (30) tests whether d has fluctuated by a significant amount over the previous adaptations. The learning algorithm for d is described as follows: if condition (25) holds true and d is not yet stabilised, i.e., condition (30) is false, then keep updating d with:

d_{T_d + 1} = d_{T_d} - \eta_d \frac{\log(\mathrm{Perf}_{T_d - 1}) - \log(\mathrm{Perf}_{T_d})}{d_{T_d} - d_{T_d - 1}},    (32)

where η_d is set to 0.1. After each adaptation, η in (17) should be recalculated using the new kernel parameter d. This process is repeated until d is stable and the adaptation of d is halted.
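Putting the pieces together, one adaptation step of d according to (32), gated by conditions (25) and (30), might look as follows. The history handling is schematic and assumes the helpers sketched earlier; divisions by zero (e.g., two identical consecutive d values) are not guarded against here.

```python
import numpy as np

def degree_stabilised(d_history, eps_d=0.01, q_d=10):
    """Stabilisation test (30)-(31) for the kernel degree d; d_history is newest-last."""
    if len(d_history) < q_d:
        return False
    recent = np.asarray(d_history[-q_d:])
    mean = recent.mean()                                        # (31)
    return np.max(np.abs(recent - mean)) / abs(mean) < eps_d    # (30)

def update_degree(d_history, perf_history, eta_d=0.1):
    """Secant-style update of d from (32), using the last two d and Perf values."""
    d_now, d_prev = d_history[-1], d_history[-2]
    grad = (np.log(perf_history[-2]) - np.log(perf_history[-1])) / (d_now - d_prev)
    return d_now - eta_d * grad

# Schematic outer loop: once the margin has stabilised (25) but d has not (30),
# apply update_degree, rebuild the kernel matrix with the new d, recompute eta
# from (17), and retrain; stop adapting d when degree_stabilised returns True.
```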

5 Simulation results

The three proposed extensions have been tested with several data sets, including two benchmark data sets, the Iris data set [13] and the Monk's data set [14], and an application data set, the Leukaemia AML/ALL cancer data set [15]. The Iris data set is perhaps the best-known database in the pattern recognition literature. It contains three classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the others; the other two classes are not linearly separable. The Monk's data set is composed of three Monk's problems and was the basis of an international comparison of learning algorithms. One of the problems has noise added to increase the learning difficulty. There are 432 instances and 7 attributes in each problem. The AML/ALL leukaemia data set was obtained from 72 samples collected from acute leukaemia patients at the time of diagnosis; there are 7,129 attributes compiled from DNA microarray data. The raw data of these data sets were not pre-processed, and no normalisation or scaling was applied.

Polynomial kernels with dynamically adapted parameter d are used for the simulations. The trade-off parameter C of the SVM is set to 10 in all cases. The effects of the proposed extensions are observed using two different measurement criteria. The first criterion is the degree of satisfaction of the constraint given in (2). The second is the classification accuracy.

We observe that the calculation of Δα using (23) provides additional feedback to the training process and has significant positive effects on the learning of the SVM. Tables 1, 2 and 3 show that the extended Kernel-AdaTron algorithm satisfies constraint (2) far better than the standard algorithm. It should also be noted that the automatic learning rate and stopping criterion have led to stable training processes in all three cases, without affecting the classification rate.

Figures 1 and 2 show the inverse performance (i.e., 1/Perf) for the Iris class 1 data set and the Leukaemia data set when simulated using various d values for the polynomial kernel. Similar graphs are also obtained for Iris class 2, Iris class 3 and the other data sets. In these graphs, a lower value on the y-axis indicates better performance; the optimal d value corresponds to the minimum point of the graph (i.e., the point of highest Perf).

Table 1  The Iris data set results for classifying class 1 versus the rest, class 2 versus the rest and class 3 versus the rest

\sum_{i=1}^{v} \alpha_i y_i     Class 1       Class 2      Class 3
Without extensions              -1.66         -1.64        -1.67
With four extensions            -4.16E-16     1.77E-15     1.78E-15

Table 2  The Monk's data set results for classifying the three Monk's problems

\sum_{i=1}^{v} \alpha_i y_i     Monks 1       Monks 2      Monks 3
Without extensions              1.53          1.57         1.54
With four extensions            5.11E-15      1.28E-15     -7.22E-16

Table 3  The Leukaemia data set results for classification

\sum_{i=1}^{v} \alpha_i y_i     AML/ALL
Without extensions              1.52
With three extensions           -1.37E-14

Fig. 1  Inverse performance versus kernel parameter d for the Iris data set (class 1 vs. the rest)

Fig. 2  Inverse performance versus kernel parameter d for the leukaemia data set

Table 5  Classification rate of the training and testing results for various data sets

Data set        Traditional Kernel-AdaTron    Modified Kernel-AdaTron
                Training      Testing         Training      Testing
Iris class 1    25/25         25/25           25/25         25/25
                50/50         50/50           50/50         50/50
Iris class 2    25/25         25/25           25/25         25/25
                50/50         43/50           50/50         44/50
Iris class 3    25/25         24/25           25/25         24/25
                50/50         45/50           50/50         46/50
Monks 1         60/62         144/216         62/62         201/216
                60/62         144/216         62/62         186/216
Monks 2         64/64         104/142         64/64         105/142
                102/105       230/290         105/105       233/290
Monks 3         60/60         199/228         60/60         201/228
                61/62         198/204         62/62         199/204
AML/ALL         11/11         14/14           11/11         14/14
Leukaemia       27/27         19/20           27/27         19/20

Bolded values indicate cases where the proposed system outperforms the traditional system

Table 4 summarises the optimal d values for the different data sets. The proposed kernel adaptation technique adapts to these optimal values within a 1% error.

Table 4  The d values for best performances

                   Iris class 1   Iris class 2   Iris class 3   Monks 1   Monks 2   Monks 3   Leukaemia
Optimal d values   1.0            1.2            1.2            4.5       4.3       4.6       12.5

Results of classification accuracy for the extended Kernel-AdaTron algorithm, in comparison to the standard algorithm, are shown in Table 5. For each result, we have recorded the number of correctly classified data samples out of the total number of samples for both the positive and the negative class: y_i = +1 (first line) and y_i = -1 (second line). These results show a clear overall improvement with the modifications to the learning algorithm. The improvement is most significant for the Monks 1 and Monks 2 problems. This may be explained by the intrinsic complexity of these problems, for which the standard learning algorithm was not able to give good results. The results also show an improvement in the learning ability of the extended algorithm for large data sets. Since the improvement on the testing set is clear, we may conclude that the generalisation capability of the classifier produced by the extended algorithm is also better. For less complex and smaller data sets, little or no improvement in classification accuracy is observed with the extended algorithm.

The computational overheads of (23) and (26) have led to a 3–20% increase in SVM training time. The computational costs of these equations decrease as more α's become zero; consequently, their impact on the training speed varies with the complexity of the underlying data set. An effective way of reducing the training time is to modify the algorithmic implementation in such a way that less complex computational hardware is required [20, 21].

6 Conclusion

We have presented three modifications to the Kernel-AdaTron learning algorithm and data-independent formulae for automatically choosing the learning rate and stopping criterion. A method for adapting the polynomial kernel parameter has also been presented. The proposed extensions are effective in improving the optimality of the trained classifier in relation to SVM theory and in improving the classification performance for large data sets.

References

1. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16(10):906–914
2. Zien A, Ratsch G, Mika S, Scholkopf B, Lengauer T, Muller KR (2000) Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 16(8):799–807
3. Hammer B, Gersmann K (2003) A note on the universal approximation capability of support vector machines. Neural Process Lett 17(1):43–45
4. Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK (2000) A fast iterative nearest point algorithm for support vector machine classifier design. IEEE Trans Neural Netw 11(1):124–136
5. Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK (2001) Improvements to Platt's SMO algorithm for SVM classifier design. Neural Comput 13:637–649
6. Campbell C, Cristianini N (1998) Simple learning algorithms for training support vector machines. Technical Report, University of Bristol. Available: http://www.svms.org/training
7. Kecman V, Vogt M, Huang TM (2003) On the equality of Kernel AdaTron and sequential minimal optimisation in classification and regression tasks and alike algorithms for kernel machines. In: Proceedings of the 11th European symposium on artificial neural networks, Bruges, Belgium, Apr 2003
8. Mangasarian OL, Musicant DR (2001) Lagrangian support vector machines. J Mach Learn Res 1:161–177
9. Suykens JAK, Vandewalle J (1999) Least squares support vector machine classifiers. Neural Process Lett 9:293–300
10. Baesens B, Viaene S, Van Gestel T, Suykens JAK, Dedene G, De Moor B, Vanthienen J (2000) An empirical assessment of kernel type performance for least squares support vector machine classifiers. In: Proceedings of the 4th international conference on knowledge-based intelligent engineering systems and allied technologies, 2000
11. Cristianini N, Campbell C, Shawe-Taylor J (1999) Dynamically adapting kernels in support vector machines. In: Kearns MS, Solla SA, Cohn DA (eds) Advances in neural information processing systems 11. MIT Press, Cambridge
12. Askew A, Miettinen H, Padley B (2003) Event selection using adaptive gaussian kernels. In: Proceedings of statistical problems in particle physics, astrophysics, and cosmology, Stanford, CA, Sep 2003
13. Gates GW (1972) The reduced nearest neighbor rule. IEEE Trans Inf Theory 18(3):431–433
14. Thrun SB, Bala J, Bloedorn E, Bratko I, Cestnik B, Cheng J, et al (1991) The MONK's problems – a performance comparison of different learning algorithms. Technical Report CS-CMU-91-197, Carnegie Mellon University
15. Getz G, Levine E, Domany E (2000) Coupled two-way clustering analysis of gene microarray data. PNAS 97(22):12079–12084
16. Vapnik VN (1998) Statistical learning theory. Wiley, New York
17. Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge
18. Principe JC, Euliano NR, Lefebvre WC (1999) Neural and adaptive systems: fundamentals through simulations. Wiley, New York
19. Barenblatt GI (1987) Dimensional analysis. Gordon & Breach, New York
20. Halgamuge SK, Poechmueller W, Glesner M (1995) An alternative approach for generation of membership functions and fuzzy rules based on radial and cubic basis function networks. Int J Approx Reason 12(3–4):279–298
21. Hollstein T, Halgamuge SK, Glesner M (1996) Computer-aided design of fuzzy systems based on generic VHDL specifications. IEEE Trans Fuzzy Syst 4(4):403–417
