2014 IEEE International Conference on System Science and Engineering (ICSSE), July 11-13, 2014, Shanghai, China
An Elaboration of Sequential Minimal Optimization for Support Vector Regression

Chan-Yun Yang*, Kuo-Ho Su**, Gene Eu Jan***

*, *** Department of Electrical Engineering, National Taipei University, New Taipei City, Taiwan
** Graduate Institute of Digital Mechatronic Technology, Chinese Culture University, Taipei, Taiwan

[email protected]; [email protected]; [email protected]
Abstract: The computational reduction by sequential minimal optimization (SMO) is crucial for support vector regression (SVR) with large-scale function approximation. Owing to this importance, the paper surveys the relevant research broadly, digests its essentials, and then reorganizes the theory with a plain explanation. Seeking first to provide a literal comprehension of SVR-SMO, the paper reforms the mathematical development within a framework of unified and uninterrupted derivations, together with appropriate illustrations to visually clarify the key ideas. The development is also examined from an alternative viewpoint. The cross-examination makes the foundation of the development more solid and leads to a consistent suggestion of a straightforward generalized algorithm. Some consistent experimental results are also included.

Index Terms: Regression, Support Vector Machine, Sequential Minimal Optimization.

I. INTRODUCTION
Support vector regression (SVR) [1] is an emerging paradigm for function approximation. Inspired by the theory of statistical learning [2-3], a regression from this paradigm gains not only a relationship between the system inputs and output but also excellent properties that a traditional regression never provided, e.g., the maximal margin, the regularization of quadratic programming, the kernel function, the generalization ability, and the sparsity of support vectors [4-5]. With these properties, intrinsic facts behind the given data can be discovered and exploited to pursue a general trend of the input/output relationship. Recovered rationally from the given data, the relationship obtained by such a regression can be advantageous for applications in physical engineering, social science, or biological science that rebuild a data-driven model [6]. Despite the promising advantages, most engagements with SVR still favor scientific discovery over applications. Appropriate enhancements that bridge SVR from science to engineering are therefore also crucial; the influence would be elementary and substantial if a broad bridge connected the theory and the applications well. From this aspect, computational feasibility that avoids the curse of dimensionality is the most urgent need for large-scale applications [7]. To overcome the problem, the sequential minimal optimization (SMO) algorithm, which decomposes the n-element working set of optimization variables into minimal 2-element sets and heuristically updates the variables by an iterative scheme, guarantees the
efficiency of the corresponding computations. The original SMO by Platt [8] was created for support vector classification, not for SVR. Smola and Schölkopf first brought SMO to SVR in their famous article [1]. Later, contributions from respected researchers, including Shevade et al. [9], Flake and Lawrence [10], Dong et al. [11], Barbero and Dorronsoro [12], Keerthi et al. [13], Fan et al. [14], Glasmachers and Igel [15], and Keerthi and Gilbert [16], pushed SMO progressively toward a more and more realistic, but fragmented, state. To characterize the novelty of their findings, most of these contributions detailed the facts they found and shortened the developments previously established. With such fragmented and terse descriptions, the promotion of SVR-SMO becomes relatively harder. Based on a sufficient survey, this paper hence seeks to reform SVR-SMO into an easier tutorial. In particular, the mathematical expressions elaborated in this paper are reorganized to be unified and uninterrupted to enhance readability.

II. SUPPORT VECTOR REGRESSION
Based on regularization theory [2, 17], SVR is founded on a trade-off between fitting accuracy and model complexity to pursue the best function approximation from a series of inputs {x_1, x_2, ..., x_n} to their corresponding target values {y_1, y_2, ..., y_n}:

  arg min_{f in H}  λ_n Ω[f] + R_emp[f]   (1)

where R_emp[f] is the empirical error, Ω[f] is a regularization term, and λ_n denotes a scalar regularization factor. Instead of the expected error, the empirical error R_emp[f] is used in the objective (1) to seek a good approximation. Together with a minimal R_emp[f], the objective simultaneously considers the least λ_n-weighted cost of model complexity, represented by Ω[f]. In the expression, H denotes a reproducing kernel Hilbert space (RKHS), a high-dimensional space mapped from the finite-dimensional input data space by a kernel function K(·,·). In H, a sophisticated nonlinear problem can be solved linearly by the kernel function [2-3]. For support vector machines, Ω[f] is in general asserted as ||w||²/2, which regards the orientation of the best fit f(x) = w·z + b, where z = φ(x) denotes the mapped feature of x in H and φ(·) is the mapping function. To cope with the trade-off in the objective (1), an ε-insensitive loss function ξ(e) = max(|e| - ε, 0) (Fig. 1a) is introduced to convert the individual empirical errors into different levels of loss, represented symbolically by the twin slack variables ξ = [ξ_1, ξ_2, ..., ξ_n]^T and ξ* = [ξ*_1, ξ*_2, ..., ξ*_n]^T. Utilizing the loss function, the minimization of (1) approaches an accurate fitting. By changing λ_n into a reciprocal C and attaching it to R_emp[f], a complete primal form of SVR can thus be given as [1]:

  min_{w, ξ, ξ*}  (1/2)||w||² + C Σ_{k=1}^{n} (ξ_k + ξ*_k)   (2)

subject to

  y_k - (w·φ(x_k) + b) ≤ ε + ξ_k
  (w·φ(x_k) + b) - y_k ≤ ε + ξ*_k
  ξ_k, ξ*_k ≥ 0,  k = 1, 2, ..., n   (3)

where the parameter ε, corresponding to the ε-insensitive loss function, specifies a zero-loss region [-ε, +ε] as shown in Fig. 1a. The zero-loss tolerance of the loss function later produces an insensitive ε-tube around the regression line (curve), as depicted in Fig. 1b.
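A minimal Python sketch of the ε-insensitive loss, with the residual vector and ε = 0.05 chosen only for illustration, reads:

import numpy as np

def eps_insensitive_loss(residuals, eps=0.05):
    """Return max(|e| - eps, 0) elementwise; errors inside the eps-tube cost nothing."""
    return np.maximum(np.abs(residuals) - eps, 0.0)

# residuals inside [-eps, +eps] produce zero loss, the others grow linearly
print(eps_insensitive_loss(np.array([-0.2, -0.03, 0.0, 0.04, 0.3]), eps=0.05))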
By taking Lagrange multipliers α = [α_1, α_2, ..., α_n]^T and α* = [α*_1, α*_2, ..., α*_n]^T, the primal problem (2)-(3) can be converted into a dual convex quadratic problem [1]:

  L_D = -(1/2) Σ_{k=1}^{n} Σ_{l=1}^{n} (α_k - α*_k)(α_l - α*_l) K(x_k, x_l) + Σ_{k=1}^{n} y_k (α_k - α*_k) - ε Σ_{k=1}^{n} (α_k + α*_k)   (4)

subject to

  Σ_{k=1}^{n} (α_k - α*_k) = 0, and 0 ≤ α_k, α*_k ≤ C, k = 1, 2, ..., n.   (5)
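For concreteness, the dual objective (4) and the feasibility conditions (5) can be evaluated in a few lines of Python; the RBF kernel and the helper names below are illustrative assumptions, not prescriptions of the paper:

import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Gram matrix K_kl = exp(-||x_k - x_l||^2 / (2*sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def dual_objective(alpha, alpha_star, y, K, eps):
    """L_D of (4) for the current multiplier vectors."""
    s = alpha - alpha_star                      # (alpha_k - alpha_k*)
    quad = -0.5 * s @ K @ s                     # -1/2 sum_kl s_k s_l K_kl
    lin = y @ s - eps * np.sum(alpha + alpha_star)
    return quad + lin

def feasible(alpha, alpha_star, C, tol=1e-8):
    """Check the constraints (5): sum_k (alpha_k - alpha_k*) = 0 and the box bounds."""
    in_box = np.all((alpha >= -tol) & (alpha <= C + tol) &
                    (alpha_star >= -tol) & (alpha_star <= C + tol))
    return abs(np.sum(alpha - alpha_star)) < tol and in_box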
Figure 1. The ε-insensitive loss function: (a) the loss function; (b) the ε-tube.

From the viewpoint of the consequent regression function f(x), only the optimized pairs with α_k ≠ 0 or α*_k ≠ 0, called support vectors (SVs) [11], are collected to form the subsequent f(x):

  f(x) = Σ_{k in SVs} (α_k - α*_k) K(x_k, x) + b   (6)

The intrinsic sparsity of the SVs has merits in both computational time and space. Observing the SVs further, there is also an interesting characteristic of the twin pair (α_k, α*_k). Due to the complementary KKT conditions, only one of α_k and α*_k can be nonzero, which conducts:

  α_k α*_k = 0,  k = 1, 2, ..., n.   (7)

III. SEQUENTIAL MINIMAL OPTIMIZATION

Sequential minimal optimization (SMO) is an excellent method for solving the quadratic programming (QP) optimization. The original SMO was invented to tackle the computational problem of SVM classification and is broadly employed in implementations such as the famous LIBSVM supported by Chang and Lin [18] and the Matlab toolbox [19]. In contrast to the classification applications, references detailing the implementation of SMO for regression applications are relatively rare. As known, there is still a small difference between SVM classification and SVM regression. This leaves room for a detailed model recheck that provides the whole picture of SMO for the applications of SVR. As an efficient QP solver, SMO reduces the computational complexity significantly from O(n²) to O(n·Card(SVs)), where n denotes the training sample size and SVs denotes the set of support vectors. The computational reduction is mainly contributed by the methodology of splitting the entire working set of the QP into many small working sets. The splitting decomposes the n²-scale QP problem into many small-scale sub-problems and accordingly reduces the computation. The most efficient decomposition would preliminarily be one variable at a time for each sub-problem if the QP were free of constraints. For SVM, only two-variable sub-problems can be decomposed, due to the equality constraint in (5). With the smallest decomposed working set, SMO can be more efficient than other methods such as chunking [20-21] or active-set methods [22-24]. The decomposition starts from the objective function L_D in (4). To keep the expressions and derivations dense, we first assume

  y'_k = y_k - θ_k ε   (8)

where θ_k = +1 when associated with α_k and θ_k = -1 when associated with α*_k, and L_D can be re-expressed as:
  L_D = -(1/2) Σ_{k=1}^{n} Σ_{l=1}^{n} (α_k - α*_k)(α_l - α*_l) K_kl + Σ_{k=1}^{n} [(y_k - ε)α_k - (y_k + ε)α*_k]   (9)

where K_kl = K(x_k, x_l) and {α_k, α*_k | k = 1, 2, ..., n} are the variables optimizing L_D. The idea of the 2-variable decomposition extracts {i, j} from the whole index set I = {1, 2, ..., n} and decomposes L_D as:

  L_D = -(1/2)(α_i - α*_i)² K_ii - (α_i - α*_i)(α_j - α*_j) K_ij - (1/2)(α_j - α*_j)² K_jj
        - (α_i - α*_i) Σ_{k in I\{i,j}} (α_k - α*_k) K_ik - (α_j - α*_j) Σ_{k in I\{i,j}} (α_k - α*_k) K_jk
        + y'_i (α_i - α*_i) + y'_j (α_j - α*_j) + (terms independent of α_i, α*_i, α_j, α*_j)   (10)

By maintaining most of the α_k's, k in I\{i, j}, as constants, the remaining variables α_i, α*_i, α_j and α*_j are the selected candidates for the sub-problem optimization. Ignoring the unchanged terms corresponding to the unchanged α_k's, the
objective function L_D' and its corresponding constraints of the sub-problem are thus extracted and written as:

  L_D' = -(1/2)(α_i - α*_i)² K_ii - (α_i - α*_i)(α_j - α*_j) K_ij - (1/2)(α_j - α*_j)² K_jj + (α_i - α*_i) v_i + (α_j - α*_j) v_j   (11)

subject to

  (α_i - α*_i) + (α_j - α*_j) = π,  0 ≤ α_i, α*_i, α_j, α*_j ≤ C   (12)

where

  π = -Σ_{k in I\{i,j}} (α_k - α*_k)

keeps constant during the iteration, and v_i denotes

  v_i = y'_i - Σ_{k in I\{i,j}} (α_k^t - α*_k^t) K_ik   (13)

Because the problem is solved iteratively, the superscript t is used to quote the iteration state of the operational variables, such as v_i^t and e_i^t.
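A minimal sketch of the sub-problem constants, assuming a vector theta of +1/-1 flags indicating which multiplier of each sample is currently active (an assumption of the sketch, following (8)), is:

import numpy as np

def subproblem_constants(i, j, alpha, alpha_star, y, K, eps, theta):
    """pi of (12) and v_i, v_j of (13) for the selected pair (i, j)."""
    s = alpha - alpha_star                         # alpha_k - alpha_k*
    mask = np.ones_like(s, dtype=bool)
    mask[[i, j]] = False                           # k in I \ {i, j}
    pi = -np.sum(s[mask])                          # pi = -sum over the unchanged multipliers
    y_prime = y - theta * eps                      # y'_k = y_k - theta_k * eps, from (8)
    v_i = y_prime[i] - s[mask] @ K[mask, i]        # (13)
    v_j = y_prime[j] - s[mask] @ K[mask, j]
    return pi, v_i, v_j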
IV. LAGRANGE MULTIPLIER UPDATE RULE
By the constraint in (12), (α_j - α*_j) can be substituted by π - (α_i - α*_i), which makes L_D' a function dependent only on (α_i - α*_i):

  L_D' = -(1/2)(α_i - α*_i)² K_ii - (α_i - α*_i)(π - (α_i - α*_i)) K_ij - (1/2)(π - (α_i - α*_i))² K_jj + (α_i - α*_i) v_i + (π - (α_i - α*_i)) v_j   (14)

To gain the maximal change in L_D', the derivative of L_D' with respect to (α_i - α*_i) should be zero. We have

  (α_i - α*_i)^{t+1} (K_ii - 2K_ij + K_jj) = v_i^t - v_j^t - π(K_ij - K_jj)   (15)

By denoting the error e_i measured from the estimate f(x_i):

  e_i = y_i - f(x_i) = y_i - Σ_{k=1}^{n} (α_k - α*_k) K(x_k, x_i) - b   (16)

the relationship between v_k and e_k is:

  v_k^t = y_k - f^t(x_k) + b + (α_i^t - α*_i^t) K_ik + (α_j^t - α*_j^t) K_jk - θ_k ε
        = e_k^t + b + (α_i^t - α*_i^t) K_ik + (α_j^t - α*_j^t) K_jk - θ_k ε,  k in {i, j}   (17)

The relationship can be extended to the difference of v and e between samples i and j:

  v_i^t - v_j^t - π(K_ij - K_jj)
    = e_i^t - e_j^t + (α_i^t - α*_i^t)(K_ii - K_ij) + (α_j^t - α*_j^t)(K_ij - K_jj) - (θ_i - θ_j)ε - π(K_ij - K_jj)
    = e_i^t - e_j^t + (α_i^t - α*_i^t)(K_ii - 2K_ij + K_jj) - (θ_i - θ_j)ε
    = e_i^t - e_j^t + (α_i^t - α*_i^t)η - (θ_i - θ_j)ε   (18)

where

  η = K_ii - 2K_ij + K_jj   (19)

The relationship in (18) conveniently conducts a general update rule of SMO. Substituting (18) into (15) gives

  (α_i - α*_i)^{t+1} = (α_i - α*_i)^t + [e_i^t - e_j^t - (θ_i - θ_j)ε] / η

Since only one case from (α_i, α_j), (α_i, α*_j), (α*_i, α_j), and (α*_i, α*_j) is selected at a time for the update, the general update rule splits further to fit the specific conditions, respectively α*_i = α*_j = 0, α*_i = α_j = 0, α_i = α*_j = 0, and α_i = α_j = 0 in each case:

  case 1: α_i^{t+1}  = α_i^t  + (e_i^t - e_j^t) / η,        α_j^{t+1}  = π - α_i^{t+1}
  case 2: α_i^{t+1}  = α_i^t  + (e_i^t - e_j^t - 2ε) / η,   α*_j^{t+1} = α_i^{t+1} - π
  case 3: α*_i^{t+1} = α*_i^t - (e_i^t - e_j^t + 2ε) / η,   α_j^{t+1}  = π + α*_i^{t+1}
  case 4: α*_i^{t+1} = α*_i^t - (e_i^t - e_j^t) / η,        α*_j^{t+1} = -α*_i^{t+1} - π   (20)
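A compact sketch of the case-wise update (20), under the sign conventions derived above and with eta taken from (19), is given below; the clipping of Table I is applied afterwards:

def update_pair(case, a_i, a_is, a_j, a_js, e_i, e_j, eta, eps, pi):
    """Apply one of the four cases of (20).

    a_i, a_is, a_j, a_js are alpha_i, alpha_i*, alpha_j, alpha_j* at step t;
    e_i, e_j are the errors of (16); pi is the constant of (12)."""
    if case == 1:                                  # (alpha_i, alpha_j) active
        a_i = a_i + (e_i - e_j) / eta
        a_j = pi - a_i
    elif case == 2:                                # (alpha_i, alpha_j*) active
        a_i = a_i + (e_i - e_j - 2.0 * eps) / eta
        a_js = a_i - pi
    elif case == 3:                                # (alpha_i*, alpha_j) active
        a_is = a_is - (e_i - e_j + 2.0 * eps) / eta
        a_j = pi + a_is
    else:                                          # case 4: (alpha_i*, alpha_j*) active
        a_is = a_is - (e_i - e_j) / eta
        a_js = -a_is - pi
    return a_i, a_is, a_j, a_js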
As observed, the update in each case of α or α* is directly proportional to the error difference between samples i and j. Considering further the constraint 0 ≤ α_i, α*_i, α_j, α*_j ≤ C, the updated α or α* should be clipped if it exceeds the [0, C] boundaries, as illustrated in Fig. 2. The conditional clippers are summarized and tabulated in Table I.

Figure 2. Clipping of the updated α or α* at the [0, C] boundaries for the four update cases.

TABLE I. CONDITIONAL CLIPPERS SUMMARIZED FROM FIG. 2

  Updated pair     Upper clipper           Lower clipper
  (α_i, α_j)       α_i  = min(C, π)        α_i  = max(0, π - C)
  (α_i, α*_j)      α_i  = min(C, C + π)    α_i  = max(0, π)
  (α*_i, α_j)      α*_i = min(C, C - π)    α*_i = max(0, -π)
  (α*_i, α*_j)     α*_i = min(C, -π)       α*_i = max(0, -π - C)
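The clippers of Table I translate directly into a small helper; in this sketch the bounds are applied to the leading multiplier of the selected case, and its partner is then recomputed from the equality in (12):

def clip_bounds(case, pi, C):
    """Return the (low, high) interval for the updated alpha_i (cases 1-2)
    or alpha_i* (cases 3-4), as tabulated in Table I."""
    if case == 1:                               # alpha_i + alpha_j = pi
        return max(0.0, pi - C), min(C, pi)
    if case == 2:                               # alpha_i - alpha_j* = pi
        return max(0.0, pi), min(C, C + pi)
    if case == 3:                               # alpha_j - alpha_i* = pi
        return max(0.0, -pi), min(C, C - pi)
    return max(0.0, -pi - C), min(C, -pi)       # case 4: -alpha_i* - alpha_j* = pi

def clip(value, low, high):
    """Project the updated multiplier back onto [low, high]."""
    return min(max(value, low), high)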
An alternative analysis of the update rule is also commented on in [12]. By terming the change an additional step length δ, the analysis first assumes that only the selected pair {h, l} is altered in the iteration from t to t+1:

  ᾱ_h^{t+1} = ᾱ_h^t + δ,  ᾱ_l^{t+1} = ᾱ_l^t - θ_h θ_l δ   (21)

With the complementary θ_k attached to both α_k and α*_k, all the α_k and α*_k are merged together to create a "double" array of α:

  ᾱ = [ᾱ_1, ᾱ_2, ..., ᾱ_n, ᾱ_{n+1}, ..., ᾱ_{2n}]^T = [α_1, α_2, ..., α_n, α*_1, α*_2, ..., α*_n]^T

for the extended corresponding "double" inputs X̄ and Ȳ:

  X̄ = [x_1, ..., x_n, x_{n+1}, ..., x_{2n}] = [x_1, ..., x_n, x_1, ..., x_n]
  Ȳ = [y_1, ..., y_n, y_{n+1}, ..., y_{2n}] = [y_1, ..., y_n, y_1, ..., y_n]

The unchanged portion of the objective function L_D(ᾱ) can be decomposed and withdrawn during the iteration from t to t+1, and the remainder, the changed portion, is termed q(δ), a function of the step length δ:

  q(δ) = θ_h δ (w^t · Δz) + (δ²/2) ||Δz||² - θ_h δ (y'_h - y'_l)

where w^t = Σ_k θ_k ᾱ_k^t z_k, z_k = φ(x̄_k), and Δz = z_h - z_l. Minimizing q(δ), i.e., maximizing the gain of L_D, with respect to δ leads to the same update as (20); the cross-examination makes the foundation of the development more solid.

V. WORKING PAIR SELECTION

Define alternatively an error ē, slightly different from e in (16), which is still a measure between the expectation, y' = y - θε, and the prediction, w·z = f(x) - b:

  ē_k = y'_k - w·z_k   (35)

Figure 3. Five partitions of the scattering data points.

In the case of Fig. 3, the samples in the different partitions can be categorized into two groups. The categorization is based on the direction of the error ē measured from the regression line to the sample's location, which defines ē^up and ē^lo accordingly. These quantities lead to the optimality condition (38) used to stop the iteration and to the selection rule of the most violating pair {h, l}:

  h = arg max_k { ē_k | (θ_k = +1, α_k > 0) ∨ (θ_k = -1, α*_k < C) }
  l = arg min_k { ē_k | (θ_k = +1, α_k < C) ∨ (θ_k = -1, α*_k > 0) }   (40)
VI. ALGORITHM
Distilling the elaborations in Sections IV and V, a generalized SVR-SMO algorithm can therefore be asserted. The algorithm, following the one given by Shevade et al. [9], is aggregated as in Table II.

TABLE II. THE SMO ALGORITHM

Algorithm:
  Randomly initialize the vectors α and α*
  While (38) is unsatisfied for some h and l:
    Use (40) to select h and l
    For the paired cases (α_h, α_l), (α_h, α*_l), (α*_h, α_l), (α*_h, α*_l), repeat:
      Update the current case of the pair by (20)
      Clip the updated pair by the conditional clippers listed in Table I
      If both elements of the updated pair are zero, select the next case
    Until a non-zero updated pair is obtained
    By αα* = 0 in (7), vanish the complementary multipliers of the updated pair
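A simplified end-to-end reading of Table II is sketched below. It is illustrative only: the bias b is kept fixed, the stopping test is a plain tolerance on the ē gap standing in for the unavailable condition (38), and the helpers rbf_kernel_matrix, subproblem_constants, update_pair, clip_bounds, clip, and select_pair from the earlier sketches are assumed.

import numpy as np

def smo_svr_sketch(X, y, C=1.0, eps=0.1, sigma=1.0, tol=1e-3, max_iter=1000):
    """Simplified SVR-SMO loop following Table II (sketch only, fixed bias b)."""
    n = len(y)
    K = rbf_kernel_matrix(X, sigma)                  # Gram matrix, assumed helper
    rng = np.random.default_rng(0)
    alpha = rng.uniform(0.0, C, n)                   # random initialization, per Table II
    alpha_star = np.zeros(n)
    b = 0.0
    for _ in range(max_iter):
        s = alpha - alpha_star
        f = K @ s + b                                # predictions as in (6)
        e = y - f                                    # errors of (16)
        theta = np.where(alpha_star > 0, -1, 1)      # active-multiplier flags, an assumption
        e_bar = (y - theta * eps) - (f - b)          # e_bar of (35)
        h, l = select_pair(e_bar, alpha, alpha_star, theta, C)
        if e_bar[h] - e_bar[l] < tol:                # stand-in for the stopping condition (38)
            break
        eta = K[h, h] - 2.0 * K[h, l] + K[l, l]      # (19)
        if eta <= 1e-12:                             # degenerate pair; a full solver reselects
            break
        pi, _, _ = subproblem_constants(h, l, alpha, alpha_star, y, K, eps, theta)
        for case in (1, 2, 3, 4):                    # try the paired cases in turn (Table II)
            a_i, a_is, _, _ = update_pair(case, alpha[h], alpha_star[h],
                                          alpha[l], alpha_star[l],
                                          e[h], e[l], eta, eps, pi)
            low, high = clip_bounds(case, pi, C)
            if low > high:
                continue                             # this case is infeasible for the current pi
            lead = clip(a_i if case in (1, 2) else a_is, low, high)
            partner = {1: pi - lead, 2: lead - pi, 3: pi + lead, 4: -lead - pi}[case]
            if lead <= 0.0 and partner <= 0.0:
                continue                             # both zero: select the next case
            # vanish the complementary multipliers of the updated pair, by (7)
            alpha[h] = alpha_star[h] = alpha[l] = alpha_star[l] = 0.0
            if case == 1:
                alpha[h], alpha[l] = lead, partner
            elif case == 2:
                alpha[h], alpha_star[l] = lead, partner
            elif case == 3:
                alpha_star[h], alpha[l] = lead, partner
            else:
                alpha_star[h], alpha_star[l] = lead, partner
            break
    return alpha, alpha_star, b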
VII. EXPERIMENTAL SIMULATIONS
To verify the correctness of the developed algorithm, one experiment was conducted on a simulated Sinc dataset. By adding different levels ρ of contamination, 200 noisy data points were generated by y = sinc(x) + ρ·rand() for regression. One of the results, with contamination level ρ = 0.5 shown in Fig. 4, indicates that the fit of the contaminated dataset is coherent between the SMO algorithm and the traditional Newton method. Both fitted curves were generated with the same parameter setting: C = 10³, σ = 1, ε = 0.05, and the RBF kernel function. The coherence confirms the correctness, while the computational expense in Table III shows the significant time reduction from 11.11 seconds to 0.34 seconds.
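The contaminated Sinc data of this experiment can be regenerated along the following lines; the input range and the centering of the noise are not specified in the text and are assumed here for illustration:

import numpy as np

rng = np.random.default_rng(0)
rho = 0.5                                   # contamination level
x = np.linspace(-10.0, 10.0, 200)           # assumed input range
noise = rho * (rng.random(200) - 0.5)       # rho * rand(), centered for illustration
y = np.sinc(x / np.pi) + noise              # np.sinc(t) = sin(pi*t)/(pi*t), so this is sin(x)/x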
Figure 4. Contaminated Sinc dataset simulation.
The experiments then stepped further to investigate more datasets acquired from the UCI repository [25]. To ensure the comparisons are based on the best parameter settings, a performance-optimization procedure was carried out in advance with a genetic algorithm to provide the best selection of the corresponding parameters for each dataset. Upon the best parameter selection, the computational expense of SVR with the SMO algorithm and with the Newton method was compared and listed in Table III. The considerable time reduction on every dataset confirms the potential of SVR-SMO for future large-scale applications.

TABLE III. COMPUTATIONAL REDUCTIONS BY SMO ON THE SIMULATED AND UCI DATASETS

               Parameter setting                     Time (sec)
  Dataset      C       σ        ε            Without SMO    With SMO
  Sinc         10³     1        5.00×10⁻²    11.11          0.34
  Servo        10³     1×2¹     5.10×10⁻²    1.51           0.21
  Pyrim        10⁻¹    1×2¹     5.00×10⁻²    0.89           0.28
  Triazines    10⁰     1×2⁰     8.00×10⁻²    4.03           0.32
  AutoMpg      10¹     1×2⁰     2.04×10⁻²    22.61          0.23
  Housing      10¹     1×2⁰     2.40×10⁻²    2.66           0.29
VIII. CONCLUDING REMARKS

The study elaborates the development of the SMO algorithm for support vector regression and clarifies some unclear points in connecting previous contributions. Given the shortage of easy-reading references, the paper offers a timely, comprehensive, and integrated guide for researchers who want to engage in the variety of SVR-SMO works, especially application works. Using the suggested basic algorithm, one can employ and modify it flexibly to explore and realize new ideas on every topic.

ACKNOWLEDGMENT
The authors acknowledge the financial support of the National Science Council of Taiwan through its grants NSC102-2221-E-305-006 and NSC101-2218-E-305-001.

REFERENCES

[1] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199-222, 2004.
[2] V. N. Vapnik, The Nature of Statistical Learning Theory, New York: Springer-Verlag, 1995.
[3] V. N. Vapnik, Statistical Learning Theory, New York: John Wiley & Sons, 1998.
[4] B. Schölkopf and A. J. Smola, Learning with Kernels, Cambridge, MA: MIT Press, 2002.
[5] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge, UK: Cambridge University Press, 2004.
[6] I. Guyon, "SVM application list," Clopinet.com, CA. [Online]. Available: http://www.clopinet.com/isabelle/Projects/SVM/applist.html. [Accessed Mar. 16, 2014].
[7] C.-H. Ho and C.-J. Lin, "Large-scale linear support vector regression," Journal of Machine Learning Research, vol. 13, pp. 3323-3348, 2012.
[8] J. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods - Support Vector Learning, ch. 12, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1999, pp. 185-208.
[9] S. K. Shevade, S. S. Keerthi, C. Bhattacharyya, and K. R. K. Murthy, "Improvements to the SMO algorithm for SVM regression," IEEE Transactions on Neural Networks, vol. 11, no. 5, pp. 1188-1193, 2000.
[10] G. W. Flake and S. Lawrence, "Efficient SVM regression training with SMO," Machine Learning, vol. 46, pp. 271-290, 2002.
[11] J.-X. Dong, A. Krzyzak, and C. Y. Suen, "A practical SMO algorithm," in Proceedings of the International Conference on Pattern Recognition, vol. 3, 2002.
[12] A. Barbero and J. R. Dorronsoro, "A simple maximum gain algorithm for support vector regression," in Proceedings of the 10th International Work-Conference on Artificial Neural Networks (IWANN 2009), Salamanca, Spain, pp. 73-80, 2009.
[13] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, "Improvements to Platt's SMO algorithm for SVM classifier design," Neural Computation, vol. 13, no. 3, pp. 637-649, 2001.
[14] R.-E. Fan, P.-H. Chen, and C.-J. Lin, "Working set selection using second order information for training support vector machines," Journal of Machine Learning Research, vol. 6, pp. 1889-1918, 2005.
[15] T. Glasmachers and C. Igel, "Maximum-gain working set selection for SVMs," Journal of Machine Learning Research, vol. 7, pp. 1437-1466, 2006.
[16] S. S. Keerthi and E. G. Gilbert, "Convergence of a generalized SMO algorithm for SVM classifier design," Machine Learning, vol. 46, pp. 351-360, 2002.
[17] C.-J. Lin, "Formulations of support vector machines: a note from an optimization point of view," Neural Computation, vol. 13, no. 2, pp. 307-317, 2001.
[18] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1-27:27, 2011.
[19] The MathWorks Inc., Matlab and Statistics Toolbox, Release 2013a, Natick, MA, 2013.
[20] E. Osuna, R. Freund, and F. Girosi, "An improved training algorithm for support vector machines," in Proc. IEEE Workshop on Neural Networks for Signal Processing, Amelia Island, pp. 276-285, 1997.
[21] T. Joachims, "Making large-scale support vector machine learning practical," in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1999, pp. 169-184.
[22] K. Scheinberg, "An efficient implementation of an active set method for SVMs," Journal of Machine Learning Research, vol. 7, pp. 2237-2257, 2006.
[23] E. Wong, Active-Set Methods for Quadratic Programming, Ph.D. dissertation, Department of Mathematics, University of California, San Diego, 2011.
[24] C. M. Maes, A Regularized Active-Set Method for Sparse Convex Quadratic Programming, Ph.D. dissertation, Institute for Computational and Mathematical Engineering, Stanford University, 2010.
[25] K. Bache and M. Lichman, "UCI Machine Learning Repository," School of Information and Computer Science, UC Irvine, CA. [Online]. Available: http://archive.ics.uci.edu/ml. [Accessed Nov. 01, 2013].