Incorporating detractors into SVM classification

Marcin Orchel

AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland
[email protected]

Summary. As was shown recently [19], prior knowledge has significant importance in machine learning from the viewpoint of the no free lunch theorem. One type of prior information for the classification task is knowledge about the data. Here we propose another type of prior knowledge, in which the distance from the decision boundary to selected data samples (detractors) is maximised. Support Vector Machines (SVMs) are a widely used algorithm for data classification. Detractors are incorporated into SVMs by weighting the samples. Because standard C-SVM sample weights are not suitable for maximising the distance to selected points, additional SVM weights are proposed. We show that detractors can enhance classification quality in areas lacking training samples and for time series classification. We also demonstrate that incorporating detractors into stock price predictive models can lead to increased investment profits.

Support Vector Machines (SVMs) are a widely used method for statistical classification. SVMs have already been applied in many domains, such as credit scoring [6], stock price movements [7][1][5], weather forecasting [15], and customer relationship management [2]. Prior knowledge can significantly enhance SVM classification quality [9]. There are two main types of prior information: class invariance to some transformations of the input data, and knowledge about the data. Particular cases of the latter are unlabelled samples, imbalance of the training set, and differing quality of the data. There are two main ways to incorporate additional knowledge into a classifier: modify the feature set, or modify the classification algorithm. The best-known example of modifying the classifier is the inclusion of polyhedral knowledge, which disallows a particular class inside polyhedra, explored in [4][3]. In this article we propose another type of prior knowledge: maximising the distance from the decision boundary to some chosen points, called detractors. We incorporate detractors by modifying the SVM classifier.


1 Detractors

We use sample weights to incorporate detractors into the SVM algorithm. Sample weights have been investigated for the C-SVM problem formulation in [20][16][8][10] and for ν-SVM in [17][18]. In this article we consider the C-SVM formulation. The 1-norm soft margin SVM optimisation problem with sample weights $C_i$ is:

Optimisation problem 1. Minimisation of $f(w, b, \xi)$, where

$$f(w, b, \xi) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} C_i \xi_i$$

with constraints

$$y_i g(A_i) \geq 1 - \xi_i, \qquad \xi_i \geq 0$$

for $i \in \{1, \ldots, n\}$.

The weights $C_i$ are not suitable for maximising the distance to detractors: when $\xi_i = 0$ for a point $i$, increasing its weight does not change the decision boundary. Another possibility would be to decrease the weights of all points other than the chosen one; however, lowering weights disturbs the solution, because errors become more acceptable. To overcome these difficulties with the $C_i$ weights, we introduce additional weights $b_i$ in the following way:

Optimisation problem 2. Minimisation of $f(w, b, \xi)$, where

$$f(w, b, \xi) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} C_i \xi_i$$

with constraints

$$y_i g(A_i) \geq 1 - \xi_i + b_i, \qquad \xi_i \geq 0$$

for $i \in \{1, \ldots, n\}$.

When $b_i = 0$ we recover the original formulation. When $b_i < 0$ a point may lie closer to the decision boundary, and when $b_i > 0$ a point may lie farther from it. For the detractor idea the interesting case is $b_i > 0$. An example with different $b_i$ values is presented in Fig. 1. Analysis shows that increasing the $b_i$ parameter leads to the expected result only when the $C_i$ parameter is high enough.
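To make the effect of the $b_i$ weights concrete, the following is a minimal sketch that solves Optimisation problem 2 directly as a quadratic program for the linear case, $g(A_i) = w \cdot A_i + b$. The use of the cvxopt solver, the toy data, and all variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

# Toy data; the detractor choice is illustrative.
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape
C = np.full(n, 10.0)      # per-sample error weights C_i
b_det = np.zeros(n)
b_det[1] = 2.0            # b_i > 0 turns sample 1 into a detractor

# Variable vector z = [w (d entries), b (1 entry), xi (n entries)].
# Objective: 1/2 |w|^2 + sum_i C_i xi_i (tiny ridge for conditioning).
P = 1e-8 * np.eye(d + 1 + n)
P[:d, :d] += np.eye(d)
q = np.concatenate([np.zeros(d + 1), C])

# Constraints y_i (w . x_i + b) >= 1 - xi_i + b_i and xi_i >= 0,
# rewritten as G z <= h for cvxopt.
G = np.vstack([
    np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n)]),
    np.hstack([np.zeros((n, d + 1)), -np.eye(n)]),
])
h = np.concatenate([-(1.0 + b_det), np.zeros(n)])

sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
z = np.array(sol['x']).ravel()
w, b = z[:d], z[d]
print('w =', w, 'b =', b)  # boundary is pushed away from the detractor
```

Raising `b_det[1]` while keeping `C` large moves the boundary away from sample 1, matching the behaviour described above; with small `C` the extra slack absorbs the shift instead.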


[Figure: scatter plot of data from class 1 and class -1 with a marked detractor; the decision boundary without detractors is compared against decision boundaries with detractors for four settings a)-d).]

Fig. 1. Comparison of the original SVM problem and the SVM problem with detractors

In order to construct an efficient algorithm for the modified SVM problem we derive its dual form. The dual problem is:

Optimisation problem 3. Maximisation of $d(\alpha, r)$, where

$$d(\alpha, r) = \min_{w, b, \xi} h(w, b, \alpha, \xi, r)$$

$$h(w, b, \alpha, \xi, r) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} C_i \xi_i - \sum_{i=1}^{n} \alpha_i \left( y_i g(A_i) - 1 + \xi_i - b_i \right) - \sum_{i=1}^{n} r_i \xi_i$$

with constraints $\alpha_i \geq 0$, $r_i \geq 0$ for $i \in \{1, \ldots, n\}$.

The partial derivative with respect to $w_i$ is

$$\frac{\partial h(w, b, \alpha, \xi, r)}{\partial w_i} = w_i - \sum_{j=1}^{n} \alpha_j y_j a_{ji} = 0$$

for $i \in \{1, \ldots, m\}$. The partial derivative with respect to $b$ is

$$\frac{\partial h(w, b, \alpha, \xi, r)}{\partial b} = \sum_{i=1}^{n} \alpha_i y_i = 0 .$$

The partial derivative with respect to $\xi_i$ is

$$\frac{\partial h(w, b, \alpha, \xi, r)}{\partial \xi_i} = C_i - r_i - \alpha_i = 0 .$$

After substituting the above equations into $d(\alpha, r)$ the $\xi$ terms cancel, the term $-\alpha_i(-1 - b_i)$ yields the linear factor $\alpha_i(1 + b_i)$, and we finally get:

Optimisation problem 4. Maximisation of $d(\alpha)$, where

$$d(\alpha) = \sum_{i=1}^{n} \alpha_i (1 + b_i) - \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{n} \alpha_k \alpha_i y_k y_i \sum_{j=1}^{m} a_{ij} a_{kj}$$

with constraints

$$\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \leq \alpha_i \leq C_i$$

for $i \in \{1, \ldots, n\}$.

In the above formulation, as for the original SVM problem, it is possible to use a kernel function instead of the scalar product, so the new formulation can be used for non-linear SVM classification:

Optimisation problem 5. Maximisation of $d(\alpha)$, where

$$d(\alpha) = \sum_{i=1}^{n} \alpha_i (1 + b_i) - \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{n} \alpha_k \alpha_i y_k y_i K_{ik}$$

with constraints

$$\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \leq \alpha_i \leq C_i$$

for $i \in \{1, \ldots, n\}$.
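As a cross-check of Optimisation problem 5, here is a minimal sketch that solves the kernelized dual with a generic QP solver. The cvxopt solver, the RBF kernel choice, and the data are assumptions for illustration; the only deviation from the standard SVM dual is the $(1 + b_i)$ linear term.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

def rbf_kernel(X, gamma=1.0):
    # Gram matrix K_ik = exp(-gamma * |x_i - x_k|^2)
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

X = np.array([[0.1, 0.2], [0.2, 0.1], [0.5, 0.5], [0.6, 0.4]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)
C = np.full(n, 10.0)                  # upper bounds C_i on alpha_i
b_det = np.zeros(n)
b_det[0] = 1.0                        # detractor weight b_i > 0

K = rbf_kernel(X)
# cvxopt minimises 1/2 a'Pa + q'a, so negate the dual objective;
# a tiny ridge keeps P numerically positive definite.
P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(n))
q = matrix(-(1.0 + b_det))
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))   # 0 <= alpha_i <= C_i
h = matrix(np.concatenate([np.zeros(n), C]))
A = matrix(y.reshape(1, -1))                     # sum_i alpha_i y_i = 0
b = matrix(0.0)

alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
print('alpha =', alpha)
```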


[Figure: nonlinear example with data from class 1 and class -1, a marked detractor, and decision boundaries without detractors and with detractors for two settings a)-b).]

Fig. 2. Comparison of the original SVM problem and the SVM problem with detractors for the nonlinear case

SVM with detractors for the nonlinear case is depicted in Fig. 2. We now derive an efficient algorithm for solving Optimisation problem 5, similar to Sequential Minimal Optimization (SMO) [14], which solves the original SVM dual optimisation problem. For two parameters the new objective function has the form

$$d(\alpha_1, \alpha_2) = \alpha_1 b_1 + \alpha_2 b_2 + d^{\mathrm{old}} .$$

After substituting $\alpha_1 = \gamma - y_1 y_2 \alpha_2$, where $\gamma = \alpha_1^{\mathrm{old}} + y_1 y_2 \alpha_2^{\mathrm{old}}$, we get

$$d(\alpha_1, \alpha_2) = b_1 \gamma - b_1 y_1 y_2 \alpha_2 + \alpha_2 b_2 + d^{\mathrm{old}} .$$

Differentiating gives

$$\frac{\partial d(\alpha_1, \alpha_2)}{\partial \alpha_2} = b_2 - b_1 y_1 y_2 + \frac{\partial d^{\mathrm{old}}(\alpha_1, \alpha_2)}{\partial \alpha_2}$$

and the solution is

$$\alpha_2^{\mathrm{new}} = \alpha_2 + \frac{y_2 (E_1 - E_2)}{\kappa}$$

where

$$E_i = \sum_{j=1}^{n} y_j \alpha_j K_{ij} - y_i - y_i b_i \qquad (1)$$

$$\kappa = K_{11} + K_{22} - 2 K_{12} .$$


After that, $\alpha_2$ is clipped in the same way as for SMO with per-sample weights, $U \leq \alpha_2 \leq V$, where, when $y_1 \neq y_2$,

$$U = \max\left(0, \alpha_2^{\mathrm{old}} - \alpha_1^{\mathrm{old}}\right), \qquad V = \min\left(C_2, C_1 - \alpha_1^{\mathrm{old}} + \alpha_2^{\mathrm{old}}\right)$$

and, when $y_1 = y_2$,

$$U = \max\left(0, \alpha_1^{\mathrm{old}} + \alpha_2^{\mathrm{old}} - C_1\right), \qquad V = \min\left(C_2, \alpha_1^{\mathrm{old}} + \alpha_2^{\mathrm{old}}\right) .$$

The parameter $\alpha_1$ is computed in the same way as for SMO. The Karush-Kuhn-Tucker complementarity conditions are

$$\alpha_i \left( y_i g(A_i) - 1 - b_i + \xi_i \right) = 0, \qquad (C_i - \alpha_i)\, \xi_i = 0 .$$

Based on these conditions it is possible to derive equations for the SVM heuristic and the SVM stopping criteria. Equations for the original heuristic are described in [12][11]. After incorporating the weights $b_i$, the heuristic and stopping criteria are almost the same, with one difference: the $E_i$ are computed as stated in (1).

Detractors modify the classification boundary based on external knowledge. An equivalent implementation would be to modify feature values so that the classification boundary changes its position in the same way as with the original implementation. There are, however, differences between these approaches: in the second approach one needs to derive a method to modify the feature values properly, and modifying feature values is a computationally demanding task, especially for large data sets.
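Putting the update formula (1) and the clipping bounds together, the following is a minimal sketch of one two-variable SMO step with the detractor weights $b_i$. The function and variable names are illustrative; a full solver would add the pair-selection heuristic and stopping criteria mentioned above.

```python
import numpy as np

def smo_step(alpha, y, K, C, b_det, i1, i2):
    """One two-variable update for the dual with detractor weights b_det.

    E_i includes the extra -y_i * b_i term from (1); everything else
    follows the standard SMO update and clipping with per-sample C_i.
    """
    # Errors with the detractor term: E_i = sum_j y_j a_j K_ij - y_i - y_i b_i
    E = K @ (alpha * y) - y - y * b_det
    kappa = K[i1, i1] + K[i2, i2] - 2.0 * K[i1, i2]
    if kappa <= 0:
        return alpha  # degenerate pair; a real solver would handle this case

    a1_old, a2_old = alpha[i1], alpha[i2]
    a2 = a2_old + y[i2] * (E[i1] - E[i2]) / kappa

    # Clipping bounds U <= a2 <= V
    if y[i1] != y[i2]:
        U = max(0.0, a2_old - a1_old)
        V = min(C[i2], C[i1] - a1_old + a2_old)
    else:
        U = max(0.0, a1_old + a2_old - C[i1])
        V = min(C[i2], a1_old + a2_old)
    a2 = np.clip(a2, U, V)

    # a1 preserves the equality constraint sum_i alpha_i y_i = 0
    a1 = a1_old + y[i1] * y[i2] * (a2_old - a2)
    alpha = alpha.copy()
    alpha[i1], alpha[i2] = a1, a2
    return alpha
```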

2 Testing

Detractors are valuable in two cases. The first is when data for some area are unavailable and there is external knowledge about the empty areas, which can be expressed by detractors. The second case is classifying time series data. Classification of time series data is analysed in [13]. The most common way of classifying time series is to transform them into a fixed number of attributes and then apply a static classification algorithm. However, it is sometimes desirable to create a dynamically parametrized classification model, in which the decision boundary depends on time periods and is controlled by detractors.

The concept of detractors for the time series case is incorporated into a stock price movement predictive model. In our example, we analyse NASDAQ daily data from 02-04-2007. We have 6 features; every feature is the percentage daily growth for the previous day, for the day before the previous day, and so on. The classification value is 1 when there was a growth from the previous day, and -1 otherwise. We have 100 training vectors and 356 testing vectors. In Table 1 we compare the original algorithm and the algorithm with detractors. In our examples we chose two detractors arbitrarily, although in real stock prediction systems detractors should be chosen based on the trading strategies used.

SVM algorithm        misclassified training data   misclassified test data
without detractors   3                             182
with 2 detractors    5                             156

Table 1. Comparison of the original SVM algorithm and the SVM algorithm with detractors
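For concreteness, here is a minimal sketch of the feature construction described above, assuming a plain array of daily closing prices; the data, the lag ordering, and all names are illustrative assumptions rather than the paper's exact preprocessing.

```python
import numpy as np

def make_dataset(close, n_lags=6):
    """Build lagged-return features and up/down labels from closing prices.

    Feature j of sample t is the percentage growth j+1 days before t;
    the label is 1 if the price grew from day t-1 to day t, else -1.
    """
    returns = 100.0 * (close[1:] - close[:-1]) / close[:-1]
    X, y = [], []
    for t in range(n_lags, len(returns)):
        X.append(returns[t - n_lags:t][::-1])  # most recent lag first
        y.append(1.0 if returns[t] > 0 else -1.0)
    return np.array(X), np.array(y)

close = np.array([100.0, 101.5, 100.8, 102.0, 103.1,
                  102.5, 104.0, 103.2, 105.0])
X, y = make_dataset(close)
print(X.shape, y)
```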

We have shown that detractors can be incorporated into Support Vector Machines in an efficient way. Moreover, detractors are a useful type of prior knowledge, which makes it possible to control dynamic classification models. However, finding detractors is a domain-specific task and can be a challenging one.

Acknowledgement. This research is financed by an internal AGH Institute of Computer Science grant and by the project co-financed by the EU and the Polish Ministry of Science and Higher Education (MNiSW), number UDA-POKL.04.01.01-00-367/08-00, entitled "Doskonalenie i Rozwój Potencjału Dydaktycznego Kierunku Informatyka w AGH". I would like to express my sincere gratitude to Professor Witold Dzwinel and Tomasz Arodź (AGH University of Science and Technology, Institute of Computer Science) for contributing ideas, discussion, and useful suggestions.

References

1. Chen, W.H., Shih, J.Y., Wu, S.: Comparison of support-vector machines and back propagation neural networks in forecasting the six major Asian stock markets. International Journal of Electronic Finance 1(1), 49–67 (2006)
2. Coussement, K., Van den Poel, D.: Churn prediction in subscription services: An application of support vector machines while comparing two parameter-selection techniques. Expert Syst. Appl. 34(1), 313–327 (2008)
3. Fung, G.M., Mangasarian, O.L., Shavlik, J.W.: Knowledge-based nonlinear kernel classifiers. In: Learning Theory and Kernel Machines, Lecture Notes in Computer Science, pp. 102–113. Springer, Berlin/Heidelberg (2003)
4. Fung, G.M., Mangasarian, O.L., Shavlik, J.W.: Knowledge-based support vector machine classifiers. In: S. Becker, S. Thrun, K. Obermayer (eds.) Advances in Neural Information Processing Systems 15, pp. 521–528. MIT Press, Cambridge, MA (2003)
5. Gao, C., Bompard, E., Napoli, R., Cheng, H.: Price forecast in the competitive electricity market by support vector machine. Physica A: Statistical Mechanics and its Applications 382(1), 98–113 (2007)
6. Huang, C.L., Chen, M.C., Wang, C.J.: Credit scoring with a data mining approach based on support vector machines. Expert Syst. Appl. 33(4), 847–856 (2007)
7. Huang, W., Nakamori, Y., Wang, S.Y.: Forecasting stock market movement direction with support vector machine. Comput. Oper. Res. 32(10), 2513–2522 (2005)
8. Joachims, T.: Transductive inference for text classification using support vector machines. In: ICML '99: Proceedings of the Sixteenth International Conference on Machine Learning, pp. 200–209. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1999)
9. Lauer, F., Bloch, G.: Incorporating prior knowledge in support vector machines for classification: A review. Neurocomput. 71(7-9), 1578–1594 (2008)
10. Lin, C.F., Wang, S.D.: Fuzzy support vector machines. IEEE Transactions on Neural Networks 13(2), 464–471 (2002)
11. Orchel, M.: Support vector machines: Sequential multidimensional subsolver (SMS). In: A. Dabrowski (ed.) Signal Processing: Algorithms, Architectures, Arrangements, and Applications SPA 2007, pp. 135–140. IEEE (2007)
12. Orchel, M.: Support vector machines: Heuristic of alternatives (HoA). In: R.S. Romaniuk (ed.) Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2007 (Proceedings Volume), vol. 6937. SPIE (2008)
13. Orsenigo, C., Vercellis, C.: Time series classification by discrete support vector machines. In: Artificial Intelligence and Data Mining Workshop (2006)
14. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization (1999)
15. Trafalis, T.B., Adrianto, I., Richman, M.B.: Active learning with support vector machines for tornado prediction. In: ICCS '07: Proceedings of the 7th International Conference on Computational Science, Part I, pp. 1130–1137. Springer-Verlag, Berlin, Heidelberg (2007)
16. Wang, L., Xue, P., Chan, K.L.: Incorporating prior knowledge into SVM for image retrieval. In: ICPR '04: Proceedings of the 17th International Conference on Pattern Recognition, Volume 2, pp. 981–984. IEEE Computer Society, Washington, DC, USA (2004)
17. Wang, M., Yang, J., Liu, G.P., Xu, Z.J., Chou, K.C.: Weighted-support vector machines for predicting membrane protein types based on pseudo amino acid composition. Protein Engineering, Design & Selection 17(6), 509–516 (2004)
18. Wei, H., Jia, Y., Jia, C.: A new weighted nu-support vector machine. In: Wireless Communications, Networking and Mobile Computing, WiCom 2007, International Conference on. ACM (2007)
19. Wolpert, D.H.: The lack of a priori distinctions between learning algorithms. Neural Computation 8(7), 1341–1390 (1996)
20. Wu, X., Srihari, R.: Incorporating prior knowledge with weighted margin support vector machines. In: KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 326–333. ACM, New York, NY, USA (2004)