Procedia Computer Science 72 (2015) 477-484
The Third Information Systems International Conference
High-Dimensional Data Classification Based on Smooth Support Vector Machines

Santi Wulan Purnami*, Shofi Andari, Yuniati Dian Pertiwi

Department of Statistics, Institut Teknologi Sepuluh Nopember, ITS Campus, Sukolilo, Surabaya 60111
Abstract

Classification on high-dimensional data arises in many statistical and data mining studies. Support vector machines (SVM) are a data mining technique that has been extensively studied and has shown remarkable success in many applications. Many studies have developed the SVM to increase its performance, for example the smooth support vector machine (SSVM). In this study, variants of SSVM (spline SSVM and piecewise polynomial SSVM) are proposed for high-dimensional classification. Theoretical results demonstrate that the piecewise polynomial SSVM yields better classification, and numerical comparison results show that the piecewise polynomial SSVM performs slightly better than the spline SSVM.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the organizing committee of The Third Information Systems International Conference (ISICO 2015).
Keywords: high-dimensional data, classification, SSVM, polynomial, spline
1. Introduction

In pattern recognition problems, including classification and clustering, choosing the proper algorithm is the key to obtaining the best classifier and, with it, the best analysis performance. However, people are often prone to making mistakes during analyses, or when trying to establish relationships between multiple features [1]. One should therefore pay attention to how particular algorithms behave on particular data; machine learning can often be applied successfully to these problems. One of the challenging aspects of supervised learning is dealing with high-dimensional data. This term refers to any dataset with a large number of features, whether metric or nonmetric. This kind of data
* Corresponding author. Tel.: +62-31-594-3352; fax: +62-31-592-2940. E-mail address: [email protected].

doi:10.1016/j.procs.2015.12.129
is attracting growing attention in many fields along with developments in databases and data engineering. Machine learning is, in many respects, still successfully applied to classifying high-dimensional datasets. For instance, in healthcare, quantitative measures have become indispensable for classification problems, particularly for detecting and identifying diseases from microarray data; without them, identifying a disease in a patient is highly subjective and reliant on the physician's expertise [2]. Support vector machines (SVMs) [3], as well as other machines such as neural networks and logistic regression, have been developed in many directions to deal with supervised learning and classification problems. The SVM is well known for its high prediction accuracy, owing to the global solution of its optimization problem. However, the elapsed computing time on larger datasets remains a main problem, and many variants of the machine have therefore been developed. The smooth support vector machine (SSVM) [4] was proposed by Lee and Mangasarian as a classifier that uses smoothing methods to reformulate the SVM: because the objective function is not twice differentiable, a smoothing function is introduced to smooth the unconstrained optimization problem. Since then, much research has focused on improving the smoothing function, e.g. the quadratic polynomial function [11], the fourth-order polynomial function [11], the spline function [8], and piecewise polynomial functions [7,10]. All four functions were compared in [5], which reported that the SSVM with the piecewise polynomial function proposed by Luo et al. [7] (PPSSVM-1) provided the best performance in classifying breast cancer diagnostics. In [9], Purnami et al. tested the performance of PPSSVM-1 against another piecewise polynomial function proposed by Wu and Wang [10], from now on called PPSSVM-2.
The comparison showed that PPSSVM-2 performed better overall, both on simulated datasets and in classifying cervical cancer diagnostics [9]. The data-generation scheme of the previous study [9], designed to accommodate high-dimensional scenarios, is adopted in this study with more challenging dataset sizes, focusing on datasets with many features. The machines used in this study are the SSVM with the original smoothing function, the SSVM with the spline function (TSSVM), and PPSSVM-2. These three methods are compared in terms of binary classification accuracy.
2. Literature Review

2.1. Smooth Support Vector Machine (SSVM)

SSVM was proposed by Lee and Mangasarian [4]. It starts from the linear case, which can be converted into an unconstrained optimization problem. We consider the problem of classifying $n$ points in the $m$-dimensional real space $R^m$, represented by the $n \times m$ matrix $A$, according to the membership of each point $A_i$ in class $+1$ or $-1$, as specified by a given $n \times n$ diagonal matrix $D$ with $+1$ or $-1$ along its diagonal. For this problem, the standard SVM is given by the following quadratic program:

$$\min_{(w,\gamma,y)\in R^{m+1+n}} \; \nu e'y + \tfrac{1}{2}w'w \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0 \qquad (1)$$

where $\nu > 0$ is a positive weight, $y$ is the vector of slack variables, $e$ is a column vector of ones of appropriate dimension, $w$ is the normal vector of size $m \times 1$, and $\gamma$ is the bias that determines the relative location of the separating hyperplane.
In the SSVM approach, the SVM problem is modified as follows:

$$\min_{(w,\gamma,y)\in R^{m+1+n}} \; \frac{\nu}{2}y'y + \frac{1}{2}(w'w + \gamma^2) \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0 \qquad (2)$$

At a solution of (2) the constraint holds with equality wherever it is active, so $y$ can be written as

$$y = \big(e - D(Aw - e\gamma)\big)_+ \qquad (3)$$

where $(\cdot)_+$ replaces each negative component by zero. Thus, we can replace $y$ in (2) by (3) and convert the SVM problem (2) into an equivalent unconstrained optimization problem:

$$\min_{(w,\gamma)} \; \frac{\nu}{2}\left\| \big(e - D(Aw - e\gamma)\big)_+ \right\|_2^2 + \frac{1}{2}(w'w + \gamma^2) \qquad (4)$$
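For intuition, the unconstrained objective (4) takes only a few lines of code. The sketch below is illustrative (the paper publishes no code, and the function and variable names are ours); it evaluates (4) for a linear classifier using the non-smooth plus function directly:

```python
import numpy as np

def svm_objective(w, gamma, A, d, nu):
    """Objective of the unconstrained problem (4) for a linear classifier.

    A  : (n, m) data matrix, one observation per row
    d  : (n,) class labels in {-1, +1} (the diagonal of D)
    nu : positive weight on the misclassification term
    """
    # residual e - D(Aw - e*gamma); positive entries are margin violations
    r = 1.0 - d * (A @ w - gamma)
    plus = np.maximum(r, 0.0)  # the non-smooth plus function (.)_+
    return 0.5 * nu * plus @ plus + 0.5 * (w @ w + gamma ** 2)
```

Because of the `np.maximum`, this objective is not twice differentiable, which is exactly the difficulty the smoothing functions address.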
The second derivative of the objective function in (4) does not exist, so conventional optimization methods that require the Hessian matrix cannot be applied directly. Lee and Mangasarian [4] therefore applied a smoothing technique, replacing $x_+$ by the integral of the sigmoid function:

$$p(x,\alpha) = x + \frac{1}{\alpha}\log\!\left(1 + \varepsilon^{-\alpha x}\right), \quad \alpha > 0 \qquad (5)$$

where $\varepsilon$ denotes the base of the natural logarithm.
This $p$ function with smoothing parameter $\alpha$ is used to replace the plus function in (4), yielding the smooth support vector machine (SSVM):

$$\min_{(w,\gamma)\in R^{m+1}} \; \frac{\nu}{2}\left\| p\big(e - D(Aw - e\gamma), \alpha\big) \right\|_2^2 + \frac{1}{2}(w'w + \gamma^2) \qquad (6)$$
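The smoothing function (5) is straightforward to implement. A numerically safe version (a sketch, not the authors' code) is:

```python
import math

def p(x, alpha):
    """Integral-of-sigmoid smoothing, Eq. (5):
    p(x, alpha) = x + log(1 + exp(-alpha * x)) / alpha.

    Rewritten so the exponential never overflows for large |alpha * x|.
    """
    return max(x, 0.0) + math.log1p(math.exp(-abs(alpha * x))) / alpha
```

`p(x, alpha)` is smooth, always at least $x_+ = \max(x, 0)$, and approaches it as `alpha` grows; for example, `p(0, alpha)` equals `log(2) / alpha`.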
The optimization problem for the nonlinear SSVM is as follows:

$$\min_{(u,\gamma)\in R^{n+1}} \; \frac{\nu}{2}\left\| p\big(e - D(K(A,A')Du - e\gamma), \alpha\big) \right\|_2^2 + \frac{1}{2}(u'u + \gamma^2) \qquad (7)$$

where $K(A,A')$ is a kernel map from $R^{n\times m} \times R^{m\times n}$ to $R^{n\times n}$.
2.2. Spline Smooth Support Vector Machine (TSSVM)

Spline SSVM was proposed by Yuan et al. [8], who used a new smooth function to replace the plus function in optimization problem (7). It is the three-order spline function:

$$T(x,k) = \begin{cases} 0, & x < -\frac{1}{k} \\[4pt] \frac{k^2}{6}x^3 + \frac{k}{2}x^2 + \frac{1}{2}x + \frac{1}{6k}, & -\frac{1}{k} \le x < 0 \\[4pt] -\frac{k^2}{6}x^3 + \frac{k}{2}x^2 + \frac{1}{2}x + \frac{1}{6k}, & 0 \le x \le \frac{1}{k} \\[4pt] x, & x > \frac{1}{k} \end{cases} \qquad (8)$$

Replacing the plus function in problem (7) by the spline function (8) gives a new smooth SVM model:

$$\min_{(u,\gamma)\in R^{n+1}} \; \frac{\nu}{2}\left\| T\big(e - D(K(A,A')Du - e\gamma), k\big) \right\|_2^2 + \frac{1}{2}(u'u + \gamma^2) \qquad (9)$$

It is called the three-order spline smooth support vector machine (TSSVM).
2.3. Piecewise Polynomial Smooth Support Vector Machine

Many studies have proposed variant smoothing functions for the SSVM to increase its performance. Luo [7] proposed a piecewise polynomial function to approximate the plus function. The formulation of this function is as follows:

$$f_1(x,k) = \begin{cases} x, & x \ge \frac{1}{k} \\[4pt] \frac{1}{32k}(kx+1)^4\left(k^2x^2 - 4kx + 5\right), & -\frac{1}{k} < x < \frac{1}{k} \\[4pt] 0, & x \le -\frac{1}{k} \end{cases} \qquad (10)$$
Then, Wu and Wang [10] also proposed a piecewise polynomial function with a different formulation:

$$f_2(x,k) = \begin{cases} 0, & x < -\frac{1}{3k} \\[4pt] \frac{3k^2}{2}\left(x + \frac{1}{3k}\right)^3, & -\frac{1}{3k} \le x < 0 \\[4pt] x + \frac{3k^2}{2}\left(\frac{1}{3k} - x\right)^3, & 0 \le x \le \frac{1}{3k} \\[4pt] x, & x > \frac{1}{3k} \end{cases} \qquad (11)$$

The piecewise polynomial function proposed by Luo [7] is called PP-1, and the piecewise polynomial function proposed by Wu and Wang [10] is called PP-2. If we replace the plus function in optimization problem (7) by PP-1, the SSVM based on PP-1, called PPSSVM-1, is obtained:

$$\min_{(u,\gamma)\in R^{n+1}} \; \frac{\nu}{2}\left\| f_1\big(e - D(K(A,A')Du - e\gamma), k\big) \right\|_2^2 + \frac{1}{2}(u'u + \gamma^2) \qquad (12)$$

Likewise, if PP-2 replaces the plus function of optimization problem (7), PPSSVM-2 is obtained:

$$\min_{(u,\gamma)\in R^{n+1}} \; \frac{\nu}{2}\left\| f_2\big(e - D(K(A,A')Du - e\gamma), k\big) \right\|_2^2 + \frac{1}{2}(u'u + \gamma^2) \qquad (13)$$
Purnami et al. [9] compared PPSSVM-1 and PPSSVM-2 both theoretically and numerically. The theoretical comparison shows that PPSSVM-2 approximates the plus function better than PPSSVM-1; the proof can be found in [9]. The numerical comparison used several combinations of the number of observations and the number of features, and PPSSVM-2 showed better accuracy, sensitivity, specificity, and computation time than PPSSVM-1. In this research, SSVM, TSSVM and PPSSVM-2 are compared for high-dimensional data analysis. The comparison covers both theoretical results and numerical experiments.
3. Comparison of Three SSVMs on High-Dimensional Data

3.1. Theoretical Comparison of Three SSVMs

The comparison of the three SSVMs (SSVM, TSSVM, and PPSSVM-2) is based on the following lemmas.

Lemma 1. Let $p(x,k)$ be defined as the integral of the sigmoid function (5) and let $x_+$ be the plus function. Then, for every $\rho > 0$ and every $x$ with $|x| < \rho$:

$$p(x,k)^2 - x_+^2 \le \left(\frac{\log 2}{k}\right)^2 + \frac{2\rho}{k}\log 2$$

The proof of Lemma 1 is given in [4].

Lemma 2. Let $\rho = \frac{1}{k}$. Taking the integral function of (5) and applying Lemma 1, we obtain:

$$p(x,k)^2 - x_+^2 \le \frac{(\log 2)^2 + 2\log 2}{k^2} \approx \frac{0.6927}{k^2}$$

Lemma 3. Let $T(x,k)$ be defined as in (8). Then the following results are easily obtained:

(i) $T(x,k) \in C^2(R)$; in particular, at the knots $x = \pm\frac{1}{k}$ the one-sided limits of $T$, $T'$ and $T''$ agree, with $T(-\tfrac{1}{k},k) = 0$, $T'(-\tfrac{1}{k},k) = 0$, $T''(-\tfrac{1}{k},k) = 0$ and $T(\tfrac{1}{k},k) = \tfrac{1}{k}$, $T'(\tfrac{1}{k},k) = 1$, $T''(\tfrac{1}{k},k) = 0$;

(ii) $T(x,k) \ge x_+$ for all $x \in R$;

(iii) $T(x,k)^2 - x_+^2 \le \dfrac{1}{24k^2}$ for all $x \in R$ and $k > 0$.

Lemma 4. The PP-2 function defined in (11) has the following properties:

(i) $f_2(x,k) \ge x_+$ for all $x \in R$;

(ii) $f_2(x,k)^2 - x_+^2 \le \dfrac{1}{216k^2}$ for all $x \in R$.

The proof of this lemma is written in [9].

According to the results of Lemmas 1-4, the following performance comparison of the smooth functions is obtained.

Theorem 1. Let $\rho = \frac{1}{k}$. Then:

(i) for the integral-of-sigmoid smooth function defined in (5), by Lemma 2, $p(x,k)^2 - x_+^2 \le \dfrac{(\log 2)^2 + 2\log 2}{k^2} \approx \dfrac{0.6927}{k^2}$;

(ii) for the three-order spline smooth function defined in (8), by Lemma 3, $T(x,k)^2 - x_+^2 \le \dfrac{1}{24k^2} \approx \dfrac{0.0417}{k^2}$;

(iii) for the piecewise polynomial smooth function PP-2 defined in (11), by Lemma 4, $f_2(x,k)^2 - x_+^2 \le \dfrac{1}{216k^2} \approx \dfrac{0.00463}{k^2}$.

From Theorem 1, it can be concluded that the piecewise polynomial smooth function (PP-2) has the smallest gap between the squared smooth function and the squared plus function; in other words, PP-2 approximates the plus function better than the others.
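The gaps bounded in Theorem 1 can also be checked numerically on a grid. The sketch below is a rough check, not a proof; the three functions are restated from (5), (8) and (11), with our own names:

```python
import math

def p(x, k):
    # integral-of-sigmoid smoothing, Eq. (5), overflow-safe form
    return max(x, 0.0) + math.log1p(math.exp(-abs(k * x))) / k

def spline_T(x, k):
    # three-order spline smoothing, Eq. (8)
    if x < -1.0 / k:
        return 0.0
    if x < 0.0:
        return (k * k / 6.0) * x**3 + (k / 2.0) * x**2 + 0.5 * x + 1.0 / (6.0 * k)
    if x <= 1.0 / k:
        return -(k * k / 6.0) * x**3 + (k / 2.0) * x**2 + 0.5 * x + 1.0 / (6.0 * k)
    return x

def pp2(x, k):
    # piecewise polynomial smoothing PP-2, Eq. (11)
    if x < -1.0 / (3.0 * k):
        return 0.0
    if x < 0.0:
        return 1.5 * k * k * (x + 1.0 / (3.0 * k))**3
    if x <= 1.0 / (3.0 * k):
        return x + 1.5 * k * k * (1.0 / (3.0 * k) - x)**3
    return x

def max_gap(f, k, lo=-2.0, hi=2.0, n=40001):
    """Grid maximum of f(x,k)^2 - (x_+)^2, the quantity bounded in Theorem 1."""
    best = 0.0
    for i in range(n):
        x = lo + (hi - lo) * i / (n - 1)
        best = max(best, f(x, k) ** 2 - max(x, 0.0) ** 2)
    return best

k = 1.0
gaps = {f.__name__: max_gap(f, k) for f in (p, spline_T, pp2)}
```

On this grid the maxima respect the ordering of Theorem 1: the spline gap stays below $1/24$ and the PP-2 gap below $1/216$, each far smaller than the gap of the sigmoid-integral function.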
3.2. Numerical Comparison of Three SSVMs

The performance of the three SSVMs is evaluated by numerical analysis. We generated high-dimensional data with varying numbers of observations $n$ and numbers of variables $m$, and used 10-fold testing for accuracy and processing time. All experiments were performed in Matlab R2011b (32-bit) on a PC with an Intel Core i7 3.20 GHz processor, 9 GB of RAM, and a 64-bit operating system. The results of the experiments on high-dimensional data using SSVM, TSSVM and PPSSVM2 are presented as follows.

Table 1. Accuracy of SSVM, TSSVM and PPSSVM2

Data (n, m)    SSVM      TSSVM     PPSSVM2
50, 10         92.50%    97.50%    97.50%
50, 50         75.00%    67.50%    75.00%
50, 100        62.50%    65.00%    67.50%
50, 500        52.50%    42.50%    52.50%
100, 10        93.33%    94.44%    94.44%
100, 50        87.78%    84.44%    90.00%
100, 100       75.00%    76.00%    78.00%
100, 500       70.00%    68.89%    70.00%
Table 2. Processing time of SSVM, TSSVM and PPSSVM2

Data (n, m)    SSVM (sec)    TSSVM (sec)    PPSSVM2 (sec)
50, 10         1.6692        6.1152         1.2324
50, 50         1.1388        2.8392         1.4267
50, 100        1.2012        2.4804         1.5600
50, 500        1.6692        1.8252         1.7628
100, 10        2.1996        16.302         7.5972
100, 50        2.0124        6.8328         4.3212
100, 100       2.3244        6.9732         4.6644
100, 500       3.1356        4.6488         4.1184
With a smaller number of features, all three SSVMs achieve remarkable accuracy; as the number of features grows, accuracy decreases. Table 1 shows that PPSSVM2 attains accuracy at least as high as the other two SSVMs on every dataset. Although several datasets yield the same accuracy for more than one method, this does not change the overall picture for high-dimensional data analysis: the formulation of PPSSVM2 is more complex, but it performs better (see Theorem 1). The processing times of the three methods do not differ greatly; TSSVM and PPSSVM2 are a few seconds slower because their algorithms are more complex than the original SSVM. Generally, PPSSVM2 performs slightly better than TSSVM.
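The paper's MATLAB implementations are not published. As a minimal illustration of how PPSSVM-2 can be trained, the sketch below fits the linear version of objective (13) by plain gradient descent on a toy dataset (the literature typically uses a Newton-Armijo method instead; all names, parameter values, and the data are ours):

```python
import numpy as np

def pp2(r, k):
    """PP-2 smoothing of the plus function, Eq. (11), applied elementwise."""
    out = np.where(r > 1.0 / (3.0 * k), r, 0.0)
    mid_neg = (r >= -1.0 / (3.0 * k)) & (r < 0.0)
    mid_pos = (r >= 0.0) & (r <= 1.0 / (3.0 * k))
    out = np.where(mid_neg, 1.5 * k * k * (r + 1.0 / (3.0 * k)) ** 3, out)
    out = np.where(mid_pos, r + 1.5 * k * k * (1.0 / (3.0 * k) - r) ** 3, out)
    return out

def pp2_deriv(r, k):
    """First derivative of PP-2 with respect to its argument."""
    g = np.where(r > 1.0 / (3.0 * k), 1.0, 0.0)
    mid_neg = (r >= -1.0 / (3.0 * k)) & (r < 0.0)
    mid_pos = (r >= 0.0) & (r <= 1.0 / (3.0 * k))
    g = np.where(mid_neg, 4.5 * k * k * (r + 1.0 / (3.0 * k)) ** 2, g)
    g = np.where(mid_pos, 1.0 - 4.5 * k * k * (1.0 / (3.0 * k) - r) ** 2, g)
    return g

def train_linear_ppssvm2(A, d, nu=5.0, k=5.0, lr=0.002, iters=10000):
    """Minimize (nu/2)*||f2(e - D(Aw - e*gamma), k)||^2 + (1/2)(w'w + gamma^2)
    by plain gradient descent.  A is (n, m); d holds labels in {-1, +1}."""
    w = np.zeros(A.shape[1])
    gamma = 0.0
    for _ in range(iters):
        r = 1.0 - d * (A @ w - gamma)
        c = nu * pp2(r, k) * pp2_deriv(r, k)   # chain rule through f2(.)^2
        w -= lr * (w - (c * d) @ A)
        gamma -= lr * (gamma + (c * d).sum())
    return w, gamma
```

Because $f_2$ is convex and nonnegative, the smoothed objective is strongly convex, so gradient descent with a small enough step converges to its unique minimizer; predictions are then `sign(A @ w - gamma)`.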
4. Conclusions and Recommendations

The theoretical results demonstrate that the piecewise polynomial smooth function (PP-2) approximates the plus function better than the other smooth functions. The numerical results show that the three SSVMs compared in this study achieve broadly consistent accuracy, with PPSSVM2 performing slightly better than TSSVM.
References

[1] S. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques," Informatica, vol. 31, pp. 249-268, 2007.
[2] Y. Haowen and G. Rumbe, "Comparative Study of Classification Techniques on Breast Cancer," International Journal of Artificial Intelligence and Interactive Multimedia, vol. 1, no. 3, pp. 5-12, 2010.
[3] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.
[4] Y. J. Lee and O. L. Mangasarian, "SSVM: A Smooth Support Vector Machine for Classification," Computational Optimization and Applications, vol. 20, pp. 5-22, 2001.
[5] S. W. Purnami, E. Abdullah, J. M. Zain and S. P. Rahayu, "A comparison of smoothing functions in smooth support vector machine," International Conference on Software Engineering & Computer Systems, 2009.
[6] Y. J. Lee and O. L. Mangasarian, "A Smooth Support Vector Machine," 2011.
[7] L. Luo, "Study on Piecewise Polynomial Smooth Approximation to the Plus Function," in Proceedings of the ICARCV, 2006.
[8] Y. Yuan, W. Fan and D. Pu, "Spline Function Smooth Support Vector Machine for Classification," Journal of Industrial and Management Optimization, vol. 3, no. 3, pp. 529-542, 2007.
[9] S. W. Purnami, V. Chosuvivatwong, H. Sriplung, M. R. Dewi and E. Suryanto, "Comparison of Piecewise Polynomial Smooth Support Vector Machine to Classify Diagnosis of Cervical Cancer," International Journal of Applied Mathematics and Statistics, vol. 53, no. 6, pp. 159-166, 2015.
[10] Q. Wu and W. Wang, "Piecewise-Smooth Support Vector Machine for Classification," Mathematical Problems in Engineering, 2013.
[11] Y. Yuan, J. Yan and C. Xu, "Polynomial smooth support vector machine (PSSVM)," Chinese Journal of Computers, vol. 28, pp. 9-17, 2005.