Regression Based on Support Vector Classification

Marcin Orchel

AGH University of Science and Technology, Mickiewicza Av. 30, 30-059 Kraków, Poland,
[email protected]
Abstract. In this article, we propose a novel regression method which is based solely on Support Vector Classification. The experiments show that the new method has comparable or better generalization performance than ε-insensitive Support Vector Regression. The tests were performed on synthetic data, on various publicly available regression data sets, and on stock price data. Furthermore, we demonstrate how a priori knowledge that has already been incorporated into Support Vector Classification for predicting indicator functions can be used directly for a regression problem.

Keywords: Support Vector Machines, a priori knowledge
1 Introduction
One of the main learning problems is regression estimation. Vapnik [6] proposed a regression method called ε-insensitive Support Vector Regression (ε-SVR). It belongs to the group of methods called Support Vector Machines (SVM). For estimating indicator functions, the Support Vector Classification (SVC) method was developed. SVM were derived from statistical learning theory. They are efficient learning methods partly because of the following important properties: they lead to convex optimization problems, they generate sparse solutions, and kernel functions can be used to obtain nonlinear solutions. In this article, we analyze the differences between ε-SVR and SVC. We point out the following advantages of SVC over ε-SVR: ε-SVR has the additional free parameter ε, and in ε-SVR the minimized term ∥w∥² rewards flat functions, while in SVC the same term has a meaning fully dependent on the training data, since it takes part in finding a maximal margin hyperplane. This is the motivation for the proposed new regression method, which is fully based on SVC. Additionally, the proposed method has an advantage when incorporating a priori knowledge into SVM. Incorporating a priori knowledge into SVM is an important task and has been researched extensively in recent years [3]. In practice, most a priori knowledge is first incorporated into SVC, and additional effort is needed to introduce the same a priori knowledge into ε-SVR. We show by example that a particular type of a priori knowledge already incorporated into
SVC can be used directly for a regression problem through the proposed method. Recently, some attempts were made to combine SVC with ε-SVR [7]. They differ substantially from the proposed method in that the proposed method is a replacement for ε-SVR.

1.1 Introduction to ε-SVR and SVC
In regression estimation, we consider a set of training vectors $a_i$ for $i = 1, \dots, l$, where $a_i = (a_i^1, \dots, a_i^m)$. The $i$-th training vector is mapped to $y_r^i \in \mathbb{R}$, and $m$ is the dimension of the problem. The ε-SVR soft case optimization problem is

OP 1. Minimization of
$$f\left(w_r, b_r, \xi_r, \xi_r^*\right) = \|w_r\|^2 + C_r \sum_{i=1}^{l} \left(\xi_r^i + \xi_r^{*i}\right)$$
with constraints $y_r^i - g(a_i) \le \varepsilon + \xi_r^i$, $g(a_i) - y_r^i \le \varepsilon + \xi_r^{*i}$, $\xi_r^i \ge 0$, $\xi_r^{*i} \ge 0$ for $i \in \{1, \dots, l\}$, where $g(a_i) = w_r \cdot a_i + b_r$.

The function $g^*(x) = w_r^* \cdot x + b_r^*$ is the regression function. Optimization problem 1 is transformed to an equivalent dual optimization problem. The regression function becomes
$$g^*(x) = \sum_{i=1}^{l} \left(\alpha_i^* - \beta_i^*\right) K(a_i, x) + b_r^*, \qquad (1)$$
where $\alpha_i$, $\beta_i$ are Lagrange multipliers of the dual problem and $K(\cdot, \cdot)$ is a kernel function incorporated into the dual problem. The most popular kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid. A kernel function which is a dot product of its variables is called a simple linear kernel. The $i$-th training example is a support vector when $\alpha_i^* - \beta_i^* \ne 0$. It can be proved that the set of support vectors contains all training examples which fall outside the ε tube, and some of the examples which lie on the ε tube. The conclusion is that the number of support vectors can be controlled by the tube height ε.

For indicator function estimation, we consider a set of training vectors $a_i$ for $i = 1, \dots, l$, where $a_i = (a_i^1, \dots, a_i^m)$. The $i$-th training vector is mapped to $y_c^i \in \{-1, 1\}$, and $m$ is the dimension of the problem. The SVC 1-norm soft margin case optimization problem is

OP 2. Minimization of
$$f\left(w_c, b_c, \xi_c\right) = \|w_c\|^2 + C_c \sum_{i=1}^{l} \xi_c^i$$
with constraints $y_c^i h(a_i) \ge 1 - \xi_c^i$, $\xi_c^i \ge 0$ for $i \in \{1, \dots, l\}$, where $h(a_i) = w_c \cdot a_i + b_c$.
The equation $h^*(x) = w_c^* \cdot x + b_c^* = 0$ defines the decision curve of the classification problem. Optimization problem 2 is transformed to an equivalent dual optimization problem. The decision curve becomes
$$h^*(x) = \sum_{i=1}^{l} y_c^i \alpha_i^* K(a_i, x) + b_c^* = 0, \qquad (2)$$
where $\alpha_i$ are Lagrange multipliers of the dual problem and $K(\cdot, \cdot)$ is a kernel function incorporated into the dual problem. Margin boundaries are defined as the two hyperplanes $h(x) = -1$ and $h(x) = 1$. Optimal margin boundaries are defined as the two hyperplanes $h^*(x) = -1$ and $h^*(x) = 1$. The $i$-th training example is a support vector when $\alpha_i^* \ne 0$. It can be proved that the set of support vectors contains all training examples which fall below the optimal margin boundaries ($y_c^i h^*(a_i) < 1$), and some of the examples which lie on the optimal margin boundaries ($y_c^i h^*(a_i) = 1$).

Comparing the number of free parameters of both methods, ε-SVR has the additional ε parameter. One of the motivations for developing a regression method based on SVC is the flatness property of ε-SVR. The minimization term ∥w_r∥² in ε-SVR is related to the following property of a linear function: for two linear functions $g_1(x) = w_1 \cdot x + b_1$ and $g_2(x) = w_2 \cdot x + b_2$, whenever $\|w_2\| < \|w_1\|$, we say that $g_2(x)$ is flatter than $g_1(x)$. Flatter functions are rewarded by ε-SVR. The flatness property of a linear function is not related to the training examples. This differs from SVC, where the minimized term ∥w_c∥² is related to the training examples, because it is used for finding a maximal margin hyperplane.
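For illustration, the decision curve (2) can be evaluated numerically from the dual solution. The following sketch is a minimal example with hypothetical multipliers, labels, and training vectors; it is not tied to any particular SVC implementation.

```python
import numpy as np

def rbf_kernel(a, x, sigma=1.0):
    """RBF kernel K(a, x) = exp(-||a - x||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - x) ** 2) / (2.0 * sigma ** 2))

def decision_value(x, alphas, labels, train, b, kernel=rbf_kernel):
    """Evaluate h*(x) = sum_i y_c^i alpha_i^* K(a_i, x) + b_c^*, as in (2)."""
    return sum(y * a * kernel(ai, x) for y, a, ai in zip(labels, alphas, train)) + b

# Hypothetical dual solution for three training vectors in two dimensions.
train = np.array([[0.1, 0.2], [0.5, 0.7], [0.9, 0.4]])
labels = np.array([1.0, -1.0, 1.0])
alphas = np.array([0.8, 1.3, 0.5])   # only support vectors have alpha_i^* != 0
b = 0.1

print(decision_value(np.array([0.4, 0.5]), alphas, labels, train, b))
```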
2 Regression Based on SVC
We consider a set of training vectors $a_i$ for $i = 1, \dots, l$, where $a_i = (a_i^1, \dots, a_i^m)$. The $i$-th training vector is mapped to $y_r^i \in \mathbb{R}$. The proposed regression method (SVCR) is based on the following scheme of finding a regression function:

1. Every training example $a_i$ is duplicated; the output value $y_r^i$ is translated by a value of a parameter φ > 0 for the original training example, and by −φ for the duplicated training example.
2. Every training example $a_i$ is converted to a classification example by incorporating the output as an additional feature and setting class 1 for original training examples and class −1 for duplicated training examples.
3. SVC is run on the classification mappings.
4. The solution of SVC is converted to a regression form.

The above procedure is repeated for different values of φ; for a particular φ it is depicted in Fig. 1. The best solution among various φ is selected based on the mean squared error (MSE) measure. The result of the first step is a set of training mappings for $i \in \{1, \dots, 2l\}$
$$b_i = \begin{cases} (a_{i,1}, \dots, a_{i,m}) \to y_i + \varphi & \text{for } i \in \{1, \dots, l\} \\ (a_{i-l,1}, \dots, a_{i-l,m}) \to y_{i-l} - \varphi & \text{for } i \in \{l+1, \dots, 2l\} \end{cases}$$
Fig. 1. In the left figure, there is an example of regression data for the function y = 0.5 sin 10x + 0.5 with Gaussian noise; in the right figure, the regression data are translated to classification data. Translated original examples are marked with '+', translated duplicated examples with 'x'
for φ > 0. The parameter φ is called the translation parameter. The result of the second step is a set of training mappings for $i \in \{1, \dots, 2l\}$
$$c_i = \begin{cases} (b_{i,1}, \dots, b_{i,m}, y_i + \varphi) \to 1 & \text{for } i \in \{1, \dots, l\} \\ (b_{i,1}, \dots, b_{i,m}, y_{i-l} - \varphi) \to -1 & \text{for } i \in \{l+1, \dots, 2l\} \end{cases}$$
for φ > 0. The dimension of the $c_i$ vectors is equal to $m + 1$. The set of $a_i$ mappings is called a regression data setting; the set of $c_i$ mappings is called a classification data setting. In the third step, we solve OP 2 with the $c_i$ examples. Note that $h^*(x)$ is an implicit form of the last coordinate of $x$. In the fourth step, we have to find an explicit form of the last coordinate. The explicit form is needed, for example, for testing new examples. The $w_c$ variable of the primal problem for the simple linear kernel case is found from the solution of the dual problem as
$$w_c = \sum_{i=1}^{2l} y_c^i \alpha_i c_i.$$
For the simple linear kernel, the explicit form of (2) is
$$x^{m+1} = \frac{-\sum_{j=1}^{m} w_c^j x^j - b_c}{w_c^{m+1}}.$$
The regression solution is $g^*(x) = w_r \cdot x + b_r$, where $w_r^i = -w_c^i / w_c^{m+1}$ for $i = 1, \dots, m$ and $b_r = -b_c / w_c^{m+1}$.
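The whole procedure for the simple linear kernel can be sketched in a few lines. The code below is a minimal illustration, assuming scikit-learn's SVC as the underlying classifier; it is not the implementation used in the experiments (which is based on LibSVM ported to Java), and the toy data are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def svcr_fit_linear(X, y, phi, C=1.0):
    """SVCR with a simple linear kernel: returns (w_r, b_r) of g*(x) = w_r . x + b_r."""
    l = len(X)
    # Steps 1-2: duplicate examples, translate outputs by +/- phi,
    # append the output as an extra feature, assign classes +1 / -1.
    data = np.vstack([np.hstack([X, (y + phi)[:, None]]),
                      np.hstack([X, (y - phi)[:, None]])])
    labels = np.hstack([np.ones(l), -np.ones(l)])
    # Step 3: run SVC on the classification data setting.
    clf = SVC(kernel="linear", C=C).fit(data, labels)
    w, b = clf.coef_[0], clf.intercept_[0]
    # Step 4: convert the implicit decision curve w . (x, x_{m+1}) + b = 0
    # to the explicit form of the last coordinate.
    return -w[:-1] / w[-1], -b / w[-1]

# Toy usage: noisy linear data in one dimension.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(90, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=90)
w_r, b_r = svcr_fit_linear(X, y, phi=0.2)
pred = X @ w_r + b_r
```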
For nonlinear kernels, the conversion to the explicit form has some limitations. First, the decision curve can have more than one value of the last coordinate for specific values of the remaining coordinates of $x$, and therefore it cannot be converted unambiguously to a function (e.g. a polynomial kernel of degree 2). Second, even when the conversion to a function is possible, there may be no explicit analytical formula (e.g. a polynomial kernel of degree greater than 4), or it may not be easy to find, and hence a special method for finding an explicit formula for the coordinate has to be used, e.g. a bisection method. The disadvantage of such a solution is a longer time of testing new examples. To overcome these problems, we propose a new kernel type in which the last coordinate appears only inside a linear term. A new kernel is constructed from an original kernel by removing the last coordinate and adding a linear term with the last coordinate. For the most popular kernels, polynomial, radial basis function (RBF), and sigmoid, the conversions are respectively
$$(x \cdot y)^d \;\to\; \left(\sum_{i=1}^{m} x_i y_i\right)^{d} + x_{m+1} y_{m+1}, \qquad (3)$$
$$\exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right) \;\to\; \exp\left(-\frac{\sum_{i=1}^{m} (x_i - y_i)^2}{2\sigma^2}\right) + x_{m+1} y_{m+1}, \qquad (4)$$
$$\tanh\left(x \cdot y\right) \;\to\; \tanh\left(\sum_{i=1}^{m} x_i y_i\right) + x_{m+1} y_{m+1}. \qquad (5)$$
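As an illustration of this construction, a modified kernel of the form (3)–(5) can be built from any base kernel by applying it to the first m coordinates and adding a linear term in the last one. The sketch below is an assumption-level example in Python; the helper and parameter names are illustrative.

```python
import numpy as np

def make_modified_kernel(base_kernel):
    """Build a kernel of the form (3)-(5): the original kernel applied to the first m
    coordinates plus a linear term in the last coordinate."""
    def kernel(A, B):
        return base_kernel(A[:, :-1], B[:, :-1]) + np.outer(A[:, -1], B[:, -1])
    return kernel

def poly_kernel(A, B, d=3):
    """(sum_i x_i y_i)^d computed over the given coordinates, as a Gram matrix."""
    return (A @ B.T) ** d

modified_poly = make_modified_kernel(poly_kernel)
```

With scikit-learn, for instance, such a callable could be passed as SVC(kernel=modified_poly); the callable receives two data matrices and must return the corresponding Gram matrix.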
The proposed method of constructing new kernels always generates a function fulfilling Mercer's condition, because the result is a sum of two kernels. For the new kernel type, the explicit form of (2) is
$$x^{m+1} = -\frac{\sum_{i=1}^{2l} y_c^i \alpha_i K_r\left(c_r^i, x_r\right) + b_c}{\sum_{i=1}^{2l} y_c^i \alpha_i\, c_i^{m+1}},$$
where $c_r^i = (c_i^1, \dots, c_i^m)$, $x_r = (x^1, \dots, x^m)$, and $K_r$ denotes the original kernel applied to the first $m$ coordinates.
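Given a classifier trained on the classification data setting with such a modified kernel, this explicit form can be evaluated directly from the dual coefficients. The sketch below assumes scikit-learn's SVC, whose dual_coef_ attribute stores the products $y_c^i \alpha_i$ for the support vectors; it is an illustration, not the paper's implementation, and the variable names in the commented usage are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_part(A_feat, B_feat, sigma=0.5):
    """Original RBF kernel K_r applied to the first m coordinates only (Gram matrix)."""
    sq = (np.sum(A_feat ** 2, axis=1)[:, None]
          + np.sum(B_feat ** 2, axis=1)[None, :]
          - 2.0 * A_feat @ B_feat.T)
    return np.exp(-sq / (2.0 * sigma ** 2))

def modified_rbf(A, B, sigma=0.5):
    """Kernel (4): K_r on the first m coordinates plus a linear term in the last one."""
    return rbf_part(A[:, :-1], B[:, :-1], sigma) + np.outer(A[:, -1], B[:, -1])

def predict_last_coordinate(clf, C_train, X_new, sigma=0.5):
    """Explicit form of (2) for the modified RBF kernel:
    x^{m+1} = -(sum_i y_c^i alpha_i K_r(c_r^i, x_r) + b_c) / (sum_i y_c^i alpha_i c_i^{m+1})."""
    sv = C_train[clf.support_]       # support vectors c_i (classification data setting)
    dual = clf.dual_coef_[0]         # y_c^i * alpha_i for the support vectors
    b_c = clf.intercept_[0]
    numer = dual @ rbf_part(sv[:, :-1], X_new, sigma) + b_c
    denom = dual @ sv[:, -1]
    return -numer / denom

# Hypothetical usage:
# clf = SVC(kernel=modified_rbf, C=10.0).fit(C_train, labels)
# y_pred = predict_last_coordinate(clf, C_train, X_test_features)
```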
2.1 Support Vectors
SVCR runs the SVC method on a duplicated number of examples, and therefore the maximal number of support vectors of the SVC is 2l. The SVCR algorithm is constructed in such a way that, while searching for the best value of φ, the cases for which the number of SVC support vectors is bigger than l are omitted. We prove that in this case the set of SVC support vectors does not contain any two training examples where one of them is a duplicate of the other; therefore the set of SVC support vectors is a subset of the $a_i$ set of training examples. Let us call a vector lying on or below the margin boundaries an essential margin vector, and the set of such vectors $EMV$.

Theorem 1. If the $a_i$ examples are not collinear and $|EMV| \le l$, then $EMV$ does not contain duplicates.

Proof (sketch). Assume that $EMV$ contains a duplicate $a_{t'}$ of the example $a_t$. Let $p(\cdot) = 0$ be a hyperplane parallel to the margin boundaries and containing $a_t$; then the set of $EMV$ examples for which $p(\cdot) \ge 0$ has $r$ elements, where $r \ge 1$. Let $p'(\cdot) = 0$ be a hyperplane parallel to the margin boundaries and containing $a_{t'}$; then the set of $EMV$ examples for which $p'(\cdot) \le 0$ has at least $l - r + 1$ elements, and so $|EMV| \ge l + 1$, which contradicts the assumption. ⊓⊔

For the nonlinear case the same theorem applies in the induced kernel feature space. It can be proved that the set of support vectors is a subset of $EMV$, and therefore the same theorem applies to the set of support vectors. Experiments show that it is a rare situation when for every value of φ checked by SVCR the set of support vectors has more than l elements. In such a situation, the best solution among those violating the constraint is chosen.

Here we consider how changes of the value of φ influence the number of support vectors. First, we can see that for φ = 0, $l \le |EMV| \le 2l$. When for a particular value of φ both classes are separable, then $0 \le |EMV| \le 2l$. By a configuration of essential margin vectors we mean a list of essential margin vectors, each with its distance to one of the margin boundaries.

Theorem 2. For two values of φ, $\varphi_1 > 0$ and $\varphi_2 > 0$ with $\varphi_1 > \varphi_2$, for every margin boundaries for $\varphi_2$ there exist margin boundaries for $\varphi_1$ with the same configuration of essential margin vectors.

Proof (sketch). Consider the $EMV$ for $\varphi_2$ with particular margin boundaries. When increasing the value of φ by $\varphi_1 - \varphi_2$, in order to preserve the same configuration of essential margin vectors we extend the margin-bounded region by $\varphi_1 - \varphi_2$ on both sides. ⊓⊔

When increasing the value of φ, new sets of essential margin vectors arise, and all sets present for the lower values of φ remain. When both classes become separable by a hyperplane, further increasing the value of φ does not change the collection of sets of essential margin vectors. The above suggests that increasing the value of φ would lead to solutions with a smaller number of support vectors.
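In practice, the filter on the number of support vectors can be checked directly on the fitted classifier. A minimal sketch, assuming scikit-learn's SVC (whose support_ attribute lists the indices of the support vectors):

```python
from sklearn.svm import SVC

def phi_is_admissible(clf: SVC, l: int) -> bool:
    """Return True if the SVC solution for a given phi uses at most l support
    vectors, i.e. no training example appears together with its duplicate
    (the condition used by the SVCR search over phi)."""
    return len(clf.support_) <= l
```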
2.2 Comparison with ε-SVR
Both methods have the same number of free parameters: for ε-SVR they are C, kernel parameters, and ε; for SVCR they are C, kernel parameters, and φ. When using a particular kernel function for ε-SVR and a related kernel function for SVCR, both methods have the same hypothesis space. Both parameters ε and φ control the number of support vectors. There is a slight difference between the two methods when we compare configurations of essential margin vectors. For ε-SVR, we define the margin boundaries as the lower and upper ε tube boundaries. Among various values of ε, every configuration of essential margin vectors is unique. In SVCR, based on Thm. 2, configurations of essential margin vectors are repeated as the value of φ increases. This suggests that for particular values of φ and ε the set of configurations of essential margin vectors is richer for SVCR than for ε-SVR.
3 Experiments
First, we compare the performance of SVCR and ε-SVR on synthetic data and on publicly available regression data sets. Second, we show that, by using SVCR, a priori knowledge in the form of detractors, introduced in [5] for classification problems, can be applied to regression problems. For the first part, we use the LibSVM [1] implementation of ε-SVR, and we use LibSVM for solving the SVC problems in SVCR; we use a version ported to Java. For the second part, we use the author's implementation of SVC with detractors. For all data sets, every feature is scaled linearly to [0, 1], including the output. For variable parameters such as C, σ for the RBF kernel, φ for SVCR, and ε for ε-SVR, we use a grid search to find the best values; the scaling and the selection of φ are sketched below. The number of values searched by the grid method is a trade-off between accuracy and the speed of simulations. Note that for particular data sets it is possible to use more accurate grid searches than for massive tests with a large number of simulations. Preliminary tests confirm that as φ is increased, the number of support vectors decreases.
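A minimal sketch of the preprocessing and the φ search, assuming a generic SVCR fit/predict interface (for instance the linear-kernel sketch above); the grid values and the interface are illustrative, not the exact setup used in the experiments.

```python
import numpy as np

def minmax_scale(M):
    """Scale every column (features and output) linearly to [0, 1]."""
    lo, hi = M.min(axis=0), M.max(axis=0)
    return (M - lo) / np.where(hi > lo, hi - lo, 1.0)

def select_phi(X, y, phi_grid, fit, predict):
    """Grid search over the translation parameter phi; the best solution is the
    one with the lowest mean squared error on the training data."""
    best = None
    for phi in phi_grid:
        model = fit(X, y, phi)
        mse = np.mean((predict(model, X) - y) ** 2)
        if best is None or mse < best[0]:
            best = (mse, phi, model)
    return best

# Hypothetical usage:
# best_mse, best_phi, model = select_phi(X, y, np.linspace(0.05, 1.0, 20),
#                                        fit=..., predict=...)
```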
3.1 Synthetic Data Tests
We compare the SVCR and ε-SVR methods on data generated from particular functions with Gaussian noise added to the output values. We perform tests with a linear kernel on linear functions, with a polynomial kernel on the polynomial function, and with the RBF kernel on the sine function. The tests and results are presented in Table 1. We can notice generally slightly worse training performance for SVCR; the reason is that ε-SVR directly minimizes the MSE. We can notice fairly good generalization performance for SVCR, which is slightly better than for ε-SVR. We can notice a smaller number of support vectors for the SVCR method for linear kernels. For the RBF kernel, SVCR is slightly worse.

3.2 Real World Data Sets
The real world data sets were taken from the LibSVM site [1][4], except for the stock price data. The stock price data consist of monthly prices of the DJIA index from 1898 up to 2010. We generated the training data as follows: for every month, the output value is the growth/fall compared to the next month, and every feature i is the percent price change between the month and the i-th previous month (a sketch of this construction is given below). In every simulation, the training data are randomly chosen and the remaining examples become the test data. The tests and results are presented in Table 2. For linear kernels, the tests show better generalization performance of the SVCR method; the performance gain on testing data ranges from 0 to 2%. For the polynomial kernel, we can notice better generalization performance of SVCR (performance gain from 68 to 80%), and the number of support vectors is comparable for both methods. For the RBF kernel, the results strongly depend on the data: for two test cases SVCR has better generalization performance (10%).
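The stock features can be constructed from a monthly price series as sketched below. This is an illustrative reconstruction of the description above, assuming the output is the percent change to the next month; the exact preprocessing used for the stock data may differ.

```python
import numpy as np

def make_stock_data(prices, dim=10):
    """Build (features, outputs) from a 1-D array of monthly prices.
    Feature i is the percent price change between the current month and the
    i-th previous month; the output is the percent change to the next month."""
    X, y = [], []
    for t in range(dim, len(prices) - 1):
        X.append([(prices[t] - prices[t - i]) / prices[t - i] for i in range(1, dim + 1)])
        y.append((prices[t + 1] - prices[t]) / prices[t])
    return np.array(X), np.array(y)
```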
Table 1. Description of test cases with results for synthetic data. Column descriptions: function – the function used for generating data, where $y_1 = \sum_{i=1}^{\mathrm{dim}} x_i$, $y_4 = \left(\sum_{i=1}^{\mathrm{dim}} x_i\right)^{kerP}$, $y_5 = 0.5 \sum_{i=1}^{\mathrm{dim}} \sin 10 x_i + 0.5$; simC – the number of simulations over which results are averaged; σ – the standard deviation used for generating noise in the output; ker – the kernel (pol – a polynomial kernel); kerP – the kernel parameter (for a polynomial kernel it is the degree, for the RBF kernel it is σ); trs – the training set size; tes – the testing set size; dm – the dimension of the problem; tr12M – the percent average difference in MSE for training data (if greater than 0, SVCR is better); te12M – the same as tr12M, but for testing data; tr12MC – the percentage of tests on training data in which SVCR is better (SVCR is better when the value is greater than 50%); te12MC – the same as tr12MC, but for testing data; s1 – the average number of support vectors for ε-SVR; s2 – the average number of support vectors for SVCR. The value 'var' means that we search for the best value by comparing the training data MSE

function     simC  σ     ker  kerP  trs  tes  dm  tr12M   te12M   tr12MC  te12MC  s1  s2
y1           100   0.04  lin  –     90   300  4   0%      0.5%    20%     58%     50  46
y2 = 3y1     100   0.04  lin  –     90   300  4   −0.4%   −0.4%   10%     40%     74  49
y3 = 1/3y1   100   0.04  lin  –     90   300  4   0%      1%      50%     80%     50  40
y4           100   0.04  pol  3     90   300  4   −2%     10%     2%      80%     61  61
y5           20    0.04  rbf  var   90   300  4   −500%   −10%    30%     20%     90  90
Generally, the tests show that the new SVCR method has good generalization performance on the synthetic and real world data sets used in the experiments, and it is often better than that of ε-SVR.
3.3 Incorporating a Priori Knowledge in the Form of Detractors to SVCR
In the article [5], the concept of detractors was proposed for the classification case. Detractors were used for incorporating a priori knowledge in the form of a lower bound (a detractor parameter b) on the distance from a particular point (called a detractor point) to the decision surface. We show that we can use the concept of detractors directly in a regression case by using the SVCR method. We define a detractor for the SVCR method as a point with a parameter d and a side (1 or −1). We modify the SVCR method in the following way: the detractor is added to the training data set and transformed to the classification data setting so that when the side is 1, d = b + φ and for the duplicate d = 0; when the side is −1, d = 0 and for the duplicate d = b − φ. The primary application of detractors was to model a decision function (i.e. moving it away from a detractor). Indeed, a synthetic test shows that we can use detractors for modeling a regression function; the transformation of a detractor point is sketched below. In Fig. 2, we can see that adding a detractor causes the regression function to move away from the detractor point.
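A sketch of the data-side transformation of a regression detractor into the classification setting, following the rule above. The SVC-with-detractors solver from [5] is not reproduced here, so the function only prepares the two classification points and their detractor parameters; the interface is hypothetical, and it is assumed that the detractor point's output coordinate is translated by ±φ like an ordinary training example.

```python
def regression_detractor_to_classification(point, b, side, phi):
    """Turn a regression detractor (point, lower bound b, side in {+1, -1}) into
    two classification examples with detractor parameters d, as used by SVCR.
    `point` is the full vector (features plus the output coordinate); the exact
    handling inside the SVC-with-detractors solver is an assumption here."""
    features, out = list(point[:-1]), point[-1]
    original = (features + [out + phi], +1)   # class +1 copy, output shifted by +phi
    duplicate = (features + [out - phi], -1)  # class -1 copy, output shifted by -phi
    if side == 1:
        d_original, d_duplicate = b + phi, 0.0
    else:
        d_original, d_duplicate = 0.0, b - phi
    return (original, d_original), (duplicate, d_duplicate)
```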
Table 2. Description of test cases with results for real world data. Column descriptions: name – the name of the test; simT – the number of random simulations (training data are randomly selected, results are averaged); ker – the kernel (pol – a polynomial kernel); kerP – the kernel parameter (for a polynomial kernel it is the degree, for the RBF kernel it is σ); trs – the training set size; all – the number of all data, i.e. the sum of training and testing data; dm – the dimension of the problem; tr12M – the percent average difference in MSE for training data (if greater than 0, SVCR is better); te12M – the same as tr12M, but for testing data; tr12MC – the percentage of tests on training data in which SVCR is better (SVCR is better when the value is greater than 50%); te12MC – the same as tr12MC, but for testing data; s1 – the average number of support vectors for ε-SVR; s2 – the average number of support vectors for SVCR. The value 'var' means that we search for the best value by comparing the training data MSE
name      simT  ker  kerP  trs  all   dm  tr12M    te12M  tr12MC  te12MC  s1  s2
abalone1  100   lin  –     90   4177  8   −0.2%    2%     20%     70%     35  38
abalone2  100   pol  5     90   4177  8   −90%     80%    0%      100%    78  73
abalone3  20    rbf  var   90   4177  8   70%      10%    90%     65%     90  90
caData1   100   lin  –     90   4424  8   −1.5%    2%     1%      55%     41  44
caData2   100   pol  5     90   4424  8   −105%    68%    0%      100%    79  75
caData3   20    rbf  var   90   4424  8   −25%     10%    50%     50%     90  90
stock1    100   lin  –     90   1351  10  0%       0%     40%     55%     35  32
stock2    100   pol  5     90   1351  10  −4500%   78%    0%      100%    90  87
stock3    20    rbf  var   90   1351  10  76%      −6%    100%    25%     90  90
Fig. 2. In the left figure, the best SVCR translation for particular regression data is depicted; in the right figure, the best SVCR translation for the same data, but with a detractor at the point (0.2, 0.1) with d = 10.0, is depicted. We can see that the detractor causes the regression function to move away from it. Note that the best translation parameter φ is different in the two cases
4 Conclusions
The SVCR method is an alternative to ε-SVR. We focus on two advantages of the new method. First, the generalization performance of SVCR is comparable to or better than that of ε-SVR, based on the conducted experiments. Second, we show, on the example of a priori knowledge in the form of detractors, that a priori knowledge already incorporated into SVC can be used for a regression problem solved by SVCR. In such a case, we do not have to analyze and implement the incorporation of a priori knowledge into other regression methods (e.g. into ε-SVR). Further analysis of SVCR will concentrate on analyzing and comparing the generalization performance of the proposed method in the framework of statistical learning theory.

Just before submitting this paper, we found a very similar idea in [2]. However, the authors solve an additional optimization problem in the testing phase to find a root of a nonlinear equation, and therefore two problems arise: multiple solutions and lack of solutions. Instead, we propose a special type of kernels (3)–(5) which overcomes these difficulties. In [2], the authors claim that by modifying the φ parameter for every example, so that the examples with the lowest and highest values of $y_i$ have smaller values of φ than the middle ones, a solution with a smaller number of support vectors can be obtained. However, this modification leads to the necessity of tuning an additional parameter during the training phase.

Acknowledgments. The research is financed by the Polish Ministry of Science and Higher Education, project No. NN519579338. I would like to express my sincere gratitude to Professor Witold Dzwinel (AGH University of Science and Technology, Department of Computer Science) for contributing ideas, discussion, and useful suggestions.
References

1. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
2. Fuming Lin, J.G.: A novel support vector machine algorithm for solving nonlinear regression problems based on symmetrical points. In: Proceedings of the 2010 2nd International Conference on Computer Engineering and Technology (ICCET) (2010)
3. Lauer, F., Bloch, G.: Incorporating prior knowledge in support vector machines for classification: A review. Neurocomput. 71(7-9), 1578–1594 (2008)
4. LibSVM data sets, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
5. Orchel, M.: Incorporating detractors into SVM classification. In: Kacprzyk, P.J. (ed.) Man-Machine Interactions, Advances in Intelligent and Soft Computing, pp. 361–369. Springer (2009)
6. Vapnik, V.N.: Statistical Learning Theory. Wiley-Interscience (September 1998)
7. Wu, C.A., Liu, H.B.: An improved support vector regression based on classification. In: Proceedings of the 2007 International Conference on Multimedia and Ubiquitous Engineering, pp. 999–1003. MUE '07, IEEE Computer Society, Washington, DC, USA (2007)