Neural Comput & Applic (2008) 17:19–25 DOI 10.1007/s00521-006-0078-2
ORIGINAL ARTICLE
Polynomial kernel adaptation and extensions to the SVM classifier learning

Ramy Saad · Saman K. Halgamuge · Jason Li

Received: 17 February 2006 / Accepted: 6 November 2006 / Published online: 10 January 2007
© Springer-Verlag London Limited 2006

R. Saad · S. K. Halgamuge · J. Li (corresponding author)
Mechatronics Research Group, University of Melbourne, Melbourne, VIC 3010, Australia
e-mail: [email protected]
Abstract  Three extensions to the Kernel-AdaTron training algorithm for Support Vector Machine classifier learning are presented. These extensions allow the trained classifier to adhere more closely to the constraints imposed by Support Vector Machine theory. The results of these modifications show improvements over the existing Kernel-AdaTron algorithm. A method of parameter optimisation for polynomial kernels is also proposed.

Keywords  Kernel methods · Support vector machines · Kernel-AdaTron algorithm
1 Introduction

Support Vector Machine (SVM) learning is used in many applications, including cancer classification, image recognition and protein function prediction, for its robustness and approximation capability [1–3]. At present, various parameters of SVM learning algorithms have to be set by the user in a trial-and-error manner. This work attempts to automate this process to improve the practicality of SVM, and also suggests extensions to a well-known SVM learning algorithm.

Since the conception of SVM, a number of papers have been published on its modifications and improvements, parameter selection, learning theory and applications [1, 2, 4–6]. Among these, two well-known training algorithms are Sequential Minimal Optimisation (SMO) and Kernel-AdaTron. They are well known for their implementation simplicity, which is an important consideration for many researchers. SMO is based on repeatedly optimising a pair of Lagrangian variables locally [5], and thus has a strong analytical background; however, it relies on heuristics for fast convergence. Kernel-AdaTron is based on stochastic-gradient ascent and is arguably the simplest training algorithm for SVM. However, it does not guarantee satisfaction of the equality constraint of the SVM optimisation problem, as will be shown in Section 2. Despite their apparent differences, it has been demonstrated in [7] that SMO and Kernel-AdaTron share strong similarities in their training processes. This work suggests several extensions to the Kernel-AdaTron algorithm.

Some other recent research aims to improve training speed by reformulating the underlying equations of SVM; examples are Lagrangian SVM [8], LS-SVM [9] and the Nearest Point Algorithm [4]. However, these depart from the optimisation concept of the original SVM. There is also literature concerned with choosing the type of kernel and the parameters of a chosen kernel [10, 11]. To date, there is no automatic algorithm for choosing the right kernel for a given application. Some attempts have been made to automatically adapt the parameter of Gaussian kernels [12]; unfortunately, these do not generalise to other kernel types. This work addresses the parameter optimisation of polynomial kernels.

The remainder of the paper is organised as follows. Section 2 gives an introduction to the SVM classifier and Kernel-AdaTron. Section 3 presents three modifications to the existing Kernel-AdaTron algorithm
that will both accelerate the learning procedure and make the results more accurate. Section 4 demonstrates a method for automatically adapting the polynomial kernel parameter to its optimum. Section 5 presents the results of the modifications on several data sets including the Iris data set [13], the Monk’s problems [14] and the leukaemia data set [15]. Section 6 concludes the paper.
2 Kernel-AdaTron and Support Vector Machine classification

Support Vector Machine binary classification is a quadratic programming (QP) optimisation problem that aims to maximise the performance function:

J = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j),   (1)

subject to the constraints:

\sum_{i=1}^{m} \alpha_i y_i = 0,   (2)

0 \le \alpha_i \le C, \quad \forall i,   (3)

where m is the number of training examples, K the kernel function, x the input vectors, y ∈ {−1, +1} represents one of the two possible output classes, C a trade-off parameter for generalisation performance, and α = {α_1, ..., α_m} are the Lagrangian variables to be optimised [16, 17]. The conceptual space between y = −1 and y = +1 is called the classification margin. The kernel is generally interpreted as a function that performs an implicit inner product between two transformed vectors:

K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle,   (4)

where ⟨·,·⟩ denotes an inner product and φ is a mapping from the input space to a feature space, which in most cases does not need to be explicitly defined. Given a set of values for α, the decision function of the SVM classifier is expressed as:

f(x) = \mathrm{sign}\left( \sum_{i=1}^{v} \alpha_i y_i K(x, x_i) + b \right),   (5)

where v ≤ m is the number of support vectors (i.e., α_i > 0) and b is the bias, which must satisfy the Karush–Kuhn–Tucker (KKT) optimality conditions of the SVM dual problem. Let

F_i = \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j).   (6)

The KKT conditions are summarised as three cases [5]:

Case 1: (F_i + b) y_i − 1 > 0,  ∀ α_i = 0,   (7)
Case 2: (F_i + b) y_i − 1 = 0,  ∀ 0 < α_i < C,   (8)
Case 3: (F_i + b) y_i − 1 ≤ 0,  ∀ α_i = C.   (9)

Kernel-AdaTron is a widely used SVM training algorithm that solves for both α and b. It has the merit of simplicity. Let

g_i = y_i F_i   (10)

and initialise α_i = η/1000, ∀i, with the learning rate η set to a small value. The Kernel-AdaTron algorithm is described as follows [6, 18]:

1. For all i, calculate F_i and g_i using (6) and (10), respectively.

2. Let

   \Delta\alpha_i = \eta \frac{\partial J}{\partial \alpha_i} = \eta (1 - g_i)   (11)

   be the proposed change for α_i.

3. At time T + 1, each α_i is updated with:

   \alpha_i(T+1) = \begin{cases} \alpha_i(T) + \Delta\alpha_i(T), & \text{if } 0 < \alpha_i(T) + \Delta\alpha_i(T) < C \\ 0, & \text{if } \alpha_i(T) + \Delta\alpha_i(T) \le 0 \\ C, & \text{if } \alpha_i(T) + \Delta\alpha_i(T) \ge C. \end{cases}   (12)

4. Let

   b_{up} = \min_{\{i \,|\, y_i = +1\}} (F_i) \quad \text{and} \quad b_{low} = \max_{\{i \,|\, y_i = -1\}} (F_i).   (13)

   Calculate the bias b with:

   b = -\frac{1}{2}\left(b_{up} + b_{low}\right).   (14)

5. If a maximum number of iterations is exceeded, or the margin

   p = \frac{1}{2}\left(b_{up} - b_{low}\right)   (15)

   is approximately 1, i.e.,

   1 - t < p < 1 + t,   (16)

   where t is a threshold, then stop the training. Otherwise return to step 1.
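To make the procedure concrete, a minimal Python sketch of steps 1–5 is given below. It assumes a precomputed kernel matrix K and labels y in {−1, +1}; the function name, the per-sample sweep order within each iteration, and the defaults for the iteration cap and threshold t are illustrative choices, not part of the original algorithm description.

```python
import numpy as np

def kernel_adatron(K, y, eta, C, t=0.05, max_iter=1000):
    """Sketch of the Kernel-AdaTron loop, eqs. (6) and (10)-(16)."""
    m = len(y)
    alpha = np.full(m, eta / 1000.0)              # initialise alpha_i = eta/1000
    b = 0.0
    for _ in range(max_iter):
        for i in range(m):                        # sample-by-sample update
            F_i = np.dot(alpha * y, K[i])         # eq. (6)
            g_i = y[i] * F_i                      # eq. (10)
            # eqs. (11)-(12): propose a change and clip into [0, C]
            alpha[i] = np.clip(alpha[i] + eta * (1.0 - g_i), 0.0, C)
        F = K @ (alpha * y)
        b_up, b_low = F[y == 1].min(), F[y == -1].max()   # eq. (13)
        b = -0.5 * (b_up + b_low)                 # eq. (14)
        p = 0.5 * (b_up - b_low)                  # eq. (15): approximate margin
        if 1.0 - t < p < 1.0 + t:                 # eq. (16): stop when margin ~ 1
            break
    return alpha, b
```

A new input x can then be classified with the decision function (5), i.e., sign(Σ_i α_i y_i K(x, x_i) + b).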
3 Proposed extensions to the Kernel-AdaTron learning algorithm

This section addresses some of the drawbacks of Kernel-AdaTron and proposes a solution to each of them. Some of these drawbacks have also been discussed in [6, 7, 17].

3.1 Choosing the learning rate

The learning rate is highly dependent on the data set, making it difficult to find a suitable value; its discovery is quite often a trial-and-error procedure. A method for finding optimal learning rates is proposed in [6], but it uses different rates for training different variables, adding complexity to the algorithm. The learning rate proposed here is calculated by:

\eta = \frac{k}{m} \left[ \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} K(x_i, x_j) \right]^{-1},   (17)

where k ≥ 1 is a scaling factor and the divisor m accounts for Kernel-AdaTron being a sample-update algorithm rather than a batch-update algorithm. The bracketed term represents the inverse of the average value of the kernel matrix, which has some similarity with the adaptive rate proposed in [6]. In that work, the kernel entry K(x_i, x_i) for each data point x_i is inverted to compute a local learning rate for that particular data point, whereas in our method the average value across the whole kernel matrix is computed to give a single representative learning rate. Our method has an obvious advantage when the number of data points is large, as only a single learning rate needs to be retained in memory. Moreover, since the kernel matrix must be calculated anyway, the calculation of (17) takes only a little additional computational time and needs to be performed only once.

The inspiration for (17) comes from classical engineering dimensional analysis [19]. The term g_i in (10) has the same unit of measurement as the kernel matrix. Hence, dividing it by the average value of the kernel matrix makes the correction of α_i in (11) dimensionless, i.e., just a scale multiplied by the fraction k/m. The value of η in (17) with k = 1 is conservative; our experiments suggest that 1 ≤ k ≤ 2 can be used in practice. The amount by which η can be scaled up indicates much about the separability of the data: the more separable the data, the more η can be scaled up and so the larger the k that can be used. Note that initially α_i = η/1000 for all i.

3.2 Updating the Lagrangian variables

The learning algorithm does not guarantee to update the Lagrangian variables α in such a way as to satisfy equality condition (2), that is, \sum_{i=1}^{m} \alpha_i y_i = 0; the constraint is guaranteed to hold only when kernels that do not require an explicit bias in the SVM are used, such as the Gaussian RBF kernel [7]. This is an intrinsic problem of the Kernel-AdaTron algorithm, accounted for by its stochastic-gradient nature [17]. To tackle this, let us define:

F = \left( \frac{1}{v} \sum_{i=1}^{v} \alpha_i y_i \right)^2.   (18)

Minimising F guarantees the condition \sum_{i=1}^{m} \alpha_i y_i = 0. Together with the objective function J described in (1), our new optimisation problem becomes

Maximise \quad J - wF,   (19)

subject to the same constraints as in (2) and (3):

\sum_{i=1}^{m} \alpha_i y_i = 0,   (20)

0 \le \alpha_i \le C, \quad \forall i,   (21)

where w is a weight specifying the importance of F relative to J. For Kernel-AdaTron, our experiments suggest that w = 1 is a good value. With the new objective function and w = 1, (11) becomes

\Delta\alpha_i = \eta \frac{\partial (J - F)}{\partial \alpha_i} = \eta \left( \frac{\partial J}{\partial \alpha_i} - \frac{\partial F}{\partial \alpha_i} \right).   (22)

Therefore,

\Delta\alpha_i = \eta \left( 1 - g_i - \frac{2 y_i}{v^2} \sum_{j=1}^{v} \alpha_j y_j \right).   (23)

Equation (23) replaces (11) in the Kernel-AdaTron algorithm.
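The two extensions above slot directly into the loop sketched in Section 2. The fragment below is a rough illustration, assuming a precomputed kernel matrix; the helper names, the default k = 1 and the guard against an empty support-vector set are illustrative choices rather than part of the published algorithm.

```python
import numpy as np

def auto_learning_rate(K, k=1.0):
    """Eq. (17): eta = (k/m) divided by the average kernel value."""
    m = K.shape[0]
    return (k / m) / K.mean()        # K.mean() = (1/m^2) * sum_ij K(x_i, x_j)

def corrected_delta_alpha(alpha, y, g, eta):
    """Eq. (23): update with the extra term driving sum_i alpha_i y_i towards 0 (w = 1)."""
    support = alpha > 0.0            # current support vectors
    v = max(int(support.sum()), 1)   # guard against division by zero (assumption)
    s = np.dot(alpha[support], y[support])
    return eta * (1.0 - g - (2.0 * y / v**2) * s)
```

Here g is the vector of g_i values from (10); the returned Δα is clipped into [0, C] exactly as in (12).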
3.3 Stopping condition

The condition for terminating the learning procedure is generally overlooked, even though it can have a serious effect on the results. Even when addressed, most proposed thresholds are data dependent and hence can be quite hard to use in a fully automated system; they are generally determined by trial and error. The method proposed here is to halt the learning when the classification margin becomes stabilised. The margin for an SVM system is [17, 18]:

\gamma = \frac{2}{\sqrt{\sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)}}.   (24)

Note that (24) reflects the true classification margin in the feature space, whereas the margin p in (15) is an approximation based on the bias. Only at the optimum do these two margins give the same information, i.e., γ = 2p. The stability of the margin γ can be detected using the following straightforward condition:

\epsilon > \max_{t \in \{0, \ldots, (q-1)\}} \frac{|\gamma_{T-t} - \bar{\gamma}|}{|\bar{\gamma}|},   (25)

where ε is a parameter with value close to 0, q defines the number of iterations involved in the computation of stability, T indicates the current iteration, γ_{T−t} is the margin t iterations ago and \bar{\gamma} is the mean value of the previous margins:

\bar{\gamma} = \frac{1}{q} \sum_{t=0}^{q-1} \gamma_{T-t}.   (26)

The learning is halted when condition (25) returns true. Setting the parameter ε to be very small (e.g., 1 × 10^{-5}) and q to be relatively large (e.g., 100) will ensure that the margin has stabilised before the learning stops. The most computationally costly component of this method is the denominator of (24). However, as the training progresses, more α's become zero because they correspond to non-support vectors, which can largely reduce the time required to compute (24). Hence, to improve overall training time, margin stability (and hence the stopping condition) should only be checked in the later stages of training, when more α's are zero.
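A possible realisation of this stopping test is sketched below, under the assumption that the margin of (24) is recomputed once per iteration and appended to a history list; the helper names are illustrative, and the defaults follow the values suggested above.

```python
import numpy as np

def margin(alpha, y, K):
    """Eq. (24): gamma = 2 / sqrt(sum_ij alpha_i alpha_j y_i y_j K_ij)."""
    ay = alpha * y
    return 2.0 / np.sqrt(ay @ K @ ay)

def margin_stable(history, eps=1e-5, q=100):
    """Eqs. (25)-(26): true when the last q margins fluctuate
    by less than eps relative to their mean."""
    if len(history) < q:
        return False
    recent = np.asarray(history[-q:])
    mean = recent.mean()                                     # eq. (26)
    return np.max(np.abs(recent - mean)) / abs(mean) < eps   # eq. (25)
```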
4 Dynamically adapting kernel parameters

Dynamically adapting kernel parameters has been the focus of several research papers [10–12]. In this work, we address the dynamic adaptation of the polynomial kernel:

k(x_i, x_j) = \left( \langle x_i, x_j \rangle + c \right)^d,   (27)

where both c and d are kernel parameters. Parameter d determines the number of dimensions in the induced feature space and hence the complexity of the hyperplane. In our work, we aim to adapt d to a value giving optimal classification performance while c is set to zero.

The primary challenge here is to find a method for comparing the effect of different d values on a trained classifier. Generally, the classification margin γ can be used to measure efficiency. However, the margin given by (24) depends on the parameter d itself as well as on the trained α and b, which makes the margin change by orders of magnitude as d changes. In fact, margins obtained with different d should not be compared, as they lie in different feature spaces. An evaluation measure that does not depend on the feature space must therefore be found for comparing the results of different kernel parameters, so that the most suitable one can be chosen. Another possible measure is the classification accuracy of the SVM; this is, however, too coarse for the fine tuning of parameters and is also highly dependent on the relative distribution of the data points in the training and testing sets. We therefore propose the following performance measure, Perf:

\mathrm{Perf} = \frac{1}{T \cdot E},   (28)

where

E = \frac{\sqrt{\sum_{i=1}^{m} \alpha_i \left[ y_i \left( \sum_{j=1}^{m} \alpha_j y_j k(x_i, x_j) + b \right) - 1 \right]^2}}{\sum_{i=1}^{m} \alpha_i},   (29)

T is the number of iterations the algorithm has processed so far, and E, deduced from the KKT conditions (7)–(9), represents the amount of error of the current classifier. A higher value of Perf indicates better performance of the system. The proposed performance measure is based on the following assumption: a faster and more accurate SVM training (i.e., lower T and lower E, respectively) implies a larger separation space between the two classes. This assumption only holds for polynomial kernels. Training and testing results demonstrate the effectiveness of this performance measure.
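A minimal sketch of this performance measure is given below, assuming the reading of (29) in which each squared KKT residual is weighted by its α_i; the function names are illustrative only.

```python
import numpy as np

def kkt_error(alpha, y, K, b):
    """Eq. (29): root of the alpha-weighted squared KKT residuals,
    normalised by the sum of the alphas (assumed reading of E)."""
    F = K @ (alpha * y)                       # F_i from eq. (6)
    residual = y * (F + b) - 1.0              # KKT expression from cases (7)-(9)
    return np.sqrt(np.sum(alpha * residual**2)) / np.sum(alpha)

def performance(n_iterations, E):
    """Eq. (28): Perf = 1 / (T * E)."""
    return 1.0 / (n_iterations * E)
```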
The second step is to define the stabilisation condition for d, which once satisfied halts the adaptation:

\epsilon_d > \max_{t_d \in \{0, \ldots, (q_d - 1)\}} \frac{|d_{T_d - t_d} - \bar{d}|}{\bar{d}},   (30)

where ε_d, q_d and T_d have the same meanings as ε, q and T in (25) but take different values here. Also, d_{T_d − t_d} is the d value t_d adaptations ago and \bar{d} is the mean value:

\bar{d} = \frac{1}{q_d} \sum_{t_d = 0}^{q_d - 1} d_{T_d - t_d}.   (31)

Setting ε_d to 0.01 and q_d to 10 will ensure that d has stabilised before its adaptation halts. Similar to (25), condition (30) tests whether d has fluctuated by a significant amount over the previous adaptations. The learning algorithm for d is described as follows: if condition (25) holds true and d is not yet stabilised, i.e., condition (30) is false, then keep updating d with:

d_{T_d + 1} = d_{T_d} - \eta_d \frac{\log(\mathrm{Perf}_{T_d - 1}) - \log(\mathrm{Perf}_{T_d})}{d_{T_d} - d_{T_d - 1}},   (32)

where η_d is set to 0.1. After each adaptation, η in (17) should be recalculated using the new kernel parameter d. This process is repeated until d is stable, at which point the adaptation of d is halted.
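The secant-style update of (32) and the stabilisation test of (30)–(31) can be sketched as follows, assuming short histories of past d and Perf values are kept; the helper names and the requirement of at least two history entries for the update are illustrative assumptions.

```python
import numpy as np

def d_stable(d_history, eps_d=0.01, q_d=10):
    """Eqs. (30)-(31): true when the last q_d values of d fluctuate
    by less than eps_d relative to their mean."""
    if len(d_history) < q_d:
        return False
    recent = np.asarray(d_history[-q_d:])
    mean = recent.mean()                                       # eq. (31)
    return np.max(np.abs(recent - mean)) / abs(mean) < eps_d   # eq. (30)

def update_d(d_history, perf_history, eta_d=0.1):
    """Eq. (32): move d along the observed slope of log(Perf) versus d.
    Requires at least two past values of d and Perf."""
    d_now, d_prev = d_history[-1], d_history[-2]
    p_now, p_prev = perf_history[-1], perf_history[-2]
    return d_now - eta_d * (np.log(p_prev) - np.log(p_now)) / (d_now - d_prev)
```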
5 Simulation results

The three proposed extensions have been tested on several data sets: two benchmark data sets, the Iris data set [13] and the Monk's data set [14], and an application data set, the AML/ALL leukaemia cancer data set [15]. The Iris data set is perhaps the best-known database in the pattern recognition literature. It contains three classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the others; the other two classes are not linearly separable from each other. The Monk's data set comprises three Monk's problems and was the basis of an international comparison of learning algorithms; one of the problems had noise added to increase the learning difficulty. There are 432 instances and 7 attributes in each problem. The AML/ALL leukaemia data set was obtained from 72 samples collected from acute leukaemia patients at the time of diagnosis; there are 7,129 attributes compiled from DNA microarray data. The raw data of these data sets were not pre-processed, and no normalisation or scaling was applied. Polynomial kernels with dynamically adapted parameter d are used for the simulations. The trade-off parameter C of the SVM is set to 10 in all cases.

The effects of the proposed extensions are observed using two different measurement criteria. The first criterion is the degree of satisfaction of the constraint given in (2); the second is the classification accuracy. We observe that calculating Δα using (23) provides additional feedback to the training process and has significant positive effects on the learning of the SVM. Tables 1, 2 and 3 show that the extended Kernel-AdaTron algorithm satisfies constraint (2) far better than the standard algorithm. It should also be noted that the automatic learning rate and stopping criterion have led to stable training processes in all three cases, without affecting the classification rate.

Figures 1 and 2 show the inverse performance (i.e., 1/Perf) for the Iris class 1 data set and the leukaemia data set when simulated using various d values for the polynomial kernel. Similar graphs are also obtained for Iris class 2, Iris class 3 and the other data sets. In these graphs, a lower value on the y-axis indicates better performance; the optimal d value corresponds to the minimum point of the graph (i.e., the point of highest Perf).
Table 1  The Iris data set results for classifying class 1 versus the rest, class 2 versus the rest and class 3 versus the rest (values of \sum_{i=1}^{v} \alpha_i y_i)

                        Class 1       Class 2      Class 3
Without extensions      -1.66         -1.64        -1.67
With four extensions    -4.16E-16     1.77E-15     1.78E-15

Table 2  The Monk's data set results for classifying the three Monk's problems (values of \sum_{i=1}^{v} \alpha_i y_i)

                        Monks 1       Monks 2      Monks 3
Without extensions      1.53          1.57         1.54
With four extensions    5.11E-15      1.28E-15     -7.22E-16

Table 3  The leukaemia data set results for classification (values of \sum_{i=1}^{v} \alpha_i y_i)

                        AML/ALL
Without extensions      1.52
With three extensions   -1.37E-14
Fig. 1  Inverse performance versus kernel parameter d for the Iris data set (class 1 vs. the rest)

Fig. 2  Inverse performance versus kernel parameter d for the leukaemia data set

Table 5  Classification rate of the training and testing results for various data sets

Data set            Traditional Kernel-AdaTron      Modified Kernel-AdaTron
                    Training       Testing          Training       Testing
Iris class 1        25/25          25/25            25/25          25/25
                    50/50          50/50            50/50          50/50
Iris class 2        25/25          25/25            25/25          25/25
                    50/50          43/50            50/50          44/50
Iris class 3        25/25          24/25            25/25          24/25
                    50/50          45/50            50/50          46/50
Monks 1             60/62          144/216          62/62          201/216
                    60/62          144/216          62/62          186/216
Monks 2             64/64          104/142          64/64          105/142
                    102/105        230/290          105/105        233/290
Monks 3             60/60          199/228          60/60          201/228
                    61/62          198/204          62/62          199/204
AML/ALL Leukaemia   11/11          14/14            11/11          14/14
                    27/27          19/20            27/27          19/20

Bolded values indicate cases where the proposed system outperforms the traditional system
Table 4 summarises the optimal d values for the different data sets. The proposed kernel adaptation technique adapts to these optimal values within a 1% error. Results of classification accuracy for the extended Kernel-AdaTron algorithm, in comparison to the standard algorithm, are presented in Table 5. For each result, we have recorded the number of correctly classified data samples out of the total number of samples for both the positive and negative classes: y_i = +1 (first line) and y_i = -1 (second line). These results show a clear overall improvement with the modifications to the learning algorithm.

Improvement is most significant for the Monks 1 and Monks 2 problems. This may be explained by the intrinsic complexity of these problems, for which the standard learning algorithm was not able to give good results. The results also show an improvement in the learning ability of the extended algorithm for large data sets. Since the improvement on the testing sets is evident, we may conclude that the generalisation capability of the classifier resulting from the extended algorithm is also better. For less complex and smaller data sets, little or no improvement in classification accuracy is observed with the extended algorithm.

The computational overheads of (23) and (26) have led to a 3–20% increase in the SVM training time. The computational cost of these equations decreases as more α's become zero; consequently, their impact on the training speed varies depending on the complexity of the underlying data set. An effective way of reducing the training time is to modify the algorithmic implementation in such a way that less complex computational hardware is required [20, 21].
Table 4  The d values for best performance

                   Iris class 1   Iris class 2   Iris class 3   Monks 1   Monks 2   Monks 3   Leukaemia
Optimal d values   1.0            1.2            1.2            4.5       4.3       4.6       12.5

6 Conclusion

We have presented three modifications to the Kernel-AdaTron learning algorithm and data-independent formulae for automatically choosing the learning rate and stopping criterion. A method for adapting the polynomial kernel parameter has also been presented. The proposed extensions are effective in improving the optimality of the trained classifier with respect to SVM theory and in improving the classification performance for large data sets.
References

1. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16(10):906–914
2. Zien A, Ratsch G, Mika S, Scholkopf B, Lengauer T, Muller KR (2000) Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 16(8):799–807
3. Hammer B, Gersmann K (2003) A note on the universal approximation capability of support vector machines. Neural Process Lett 17(1):43–45
4. Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK (2000) A fast iterative nearest point algorithm for support vector machine classifier design. IEEE Trans Neural Netw 11(1):124–136
5. Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK (2001) Improvements to Platt's SMO algorithm for SVM classifier design. Neural Comput 13:637–649
6. Campbell C, Cristianini N (1998) Simple learning algorithms for training support vector machines. Technical Report, University of Bristol. Available: http://www.svms.org/training
7. Kecman V, Vogt M, Huang TM (2003) On the equality of Kernel AdaTron and sequential minimal optimisation in classification and regression tasks and alike algorithms for kernel machines. In: Proceedings of the 11th European symposium on artificial neural networks, Bruges, Belgium, Apr 2003
8. Mangasarian OL, Musicant DR (2001) Lagrangian support vector machines. J Mach Learn Res 1:161–177
9. Suykens JAK, Vandewalle J (1999) Least squares support vector machine classifiers. Neural Process Lett 9:293–300
10. Baesens B, Viaene S, Van Gestel T, Suykens JAK, Dedene G, De Moor B, Vanthienen J (2000) An empirical assessment of kernel type performance for least squares support vector machine classifiers. In: Proceedings of the 4th international conference on knowledge-based intelligent engineering systems and allied technologies, 2000
11. Cristianini N, Campbell C, Shawe-Taylor J (1999) Dynamically adapting kernels in support vector machines. In: Kearns MS, Solla SA, Cohn DA (eds) Advances in neural information processing systems 11. MIT Press, Cambridge
12. Askew A, Miettinen H, Padley B (2003) Event selection using adaptive Gaussian kernels. In: Proceedings of statistical problems in particle physics, astrophysics, and cosmology, Stanford, CA, Sep 2003
13. Gates GW (1972) The reduced nearest neighbor rule. IEEE Trans Inf Theory 18(3):431–433
14. Thrun SB, Bala J, Bloedorn E, Bratko I, Cestnik B, Cheng J, et al. (1991) The MONK's problems: a performance comparison of different learning algorithms. Technical Report CS-CMU-91-197, Carnegie Mellon University
15. Getz G, Levine E, Domany E (2000) Coupled two-way clustering analysis of gene microarray data. PNAS 97(22):12079–12084
16. Vapnik VN (1998) Statistical learning theory. Wiley, New York
17. Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge
18. Principe JC, Euliano NR, Lefebvre WC (1999) Neural and adaptive systems: fundamentals through simulations. Wiley, New York
19. Barenblatt GI (1987) Dimensional analysis. Gordon & Breach, New York
20. Halgamuge SK, Poechmueller W, Glesner M (1995) An alternative approach for generation of membership functions and fuzzy rules based on radial and cubic basis function networks. Int J Approx Reason 12(3–4):279–298
21. Hollstein T, Halgamuge SK, Glesner M (1996) Computer-aided design of fuzzy systems based on generic VHDL specifications. IEEE Trans Fuzzy Syst 4(4):403–417