Neurocomputing 101 (2013) 59–67


Sphere Support Vector Machines for large classification tasks

Robert Strack, Vojislav Kecman, Beata Strack, Qi Li
Virginia Commonwealth University, Computer Science Department, Richmond, VA 23284-3019, United States

Article history: Received 9 March 2012; Received in revised form 1 July 2012; Accepted 7 July 2012; Communicated by K.J. Cios; Available online 25 August 2012.

Keywords: Support Vector Machines; Core Vector Machines; Minimum enclosing ball; Large datasets; Classification.

Abstract

This paper introduces the Sphere Support Vector Machine (SphereSVM) as a new fast classification algorithm based on combining a minimal enclosing ball approach, state-of-the-art nearest point problem solvers, and probabilistic techniques. The blending of the three significantly speeds up the training phase of SVMs while attaining practically the same accuracy as the other classification models over several large real datasets, within the strict validation frame of a double (nested) cross-validation. The results promote SphereSVM as an outstanding alternative for handling large and ultra-large datasets in a reasonable time, without switching to the various parallelization schemes for SVM algorithms proposed recently. © 2012 Elsevier B.V. All rights reserved.

1. Introduction

Support Vector Machines are considered to be among the best classification tools available today. Many experimental results achieved on a variety of classification (and regression) tasks complement the highly appreciated theoretical properties of SVMs. However, there is one property of the SVM learning algorithm that has required, and still requires, special attention: the learning time of SVMs grows quickly with the number of training data points. The time complexity of the general-purpose, state-of-the-art SVM implementations is somewhere between O(n) and O(n^2.3). Hence, with the increase in dataset sizes, the learning phase can be quite a slow process. Some successful attempts to deal with this matter include the decomposition approaches that have led to several efficient pieces of software, the most popular being SVMlight [1] and LIBSVM [2]. However, none of these algorithms achieves linear complexity, and the ever increasing size of datasets has driven SVM training times beyond acceptable limits. The two remedial avenues for overcoming the issues of large datasets employed during the last decade include various parallelization attempts (including the newest GPU embedded implementations [3,4]) and the use of geometric approaches. The latter include solving the SVM learning problem by employing both convex hull and enclosing ball approaches [5,6].

Corresponding author: R. Strack. E-mail address: [email protected].
doi: 10.1016/j.neucom.2012.07.025

The most recent and advanced method, known as the Ball Vector Machine (BVM) [7], has shown a high capacity for handling large datasets. The Sphere Support Vector Machine (SphereSVM) proposed here combines the two techniques (the convex hull and enclosing ball approaches). While keeping the same level of accuracy, it achieves a significant speedup with respect to all three of L1 LIBSVM, L2 LIBSVM and BVM.

Although the most popular SVM solvers (such as Platt's SMO [8]) are based on the Lagrange multipliers method and search for solutions in the dual space, there is substantial research directed towards finding efficient algorithms that work directly in the feature space. These algorithms are mostly based on the geometric interpretation of maximum margin classifiers. The geometric properties of hard margin SVM classifiers have been known for a long time [9]. More recently, Keerthi et al. [10] and Franc [11] proposed algorithms based on the geometric interpretation of the SVM problem for cases with separable classes. Their approach treats the problem of finding the maximum margin between two classes as the problem of finding the two closest points belonging to the convex polytopes covering the classes. Crisp et al. analyzed the geometric properties of the ν-SVM algorithm [12] and, based on this work, Mavroforakis introduced the reduced convex hulls (RCH) [13]. This idea allowed a geometric approach to be used for SVM problems with overlapping classes: reduced convex hulls shrink the overlapping convex polytopes covering each class. This reduction creates a margin between the two overlapping classes and permits separating them (previously infeasible with Keerthi's or Franc's algorithms). Another line of research involves algorithms based on the minimal enclosing ball (MEB) problem. Tsang et al. [5,14] formulated


the SVM problem as an MEB problem and proposed the Core Vector Machine (CVM) algorithm as an approach suitable for very large SVM training. Their algorithm is an application of Badoiu and Clarkson's work [15], which investigates the use of coresets in finding an approximation of the MEB. An additional speedup is obtained by using the "probabilistic speedup" approach proposed by Smola and Schölkopf [16]. CVM was further generalized [17] by allowing the use of any kernel function (and not only the normalized ones, as previously required). Furthermore, Tsang et al. [7] improved the idea of Core Vector Machines by introducing a new algorithm that does not require a QP solver—the Ball Vector Machine (BVM). Moreover, Asharaf et al. [18] proposed another extension of CVM that is capable of handling multi-class problems.

In this paper, we propose a new algorithm that improves the BVM by applying ideas previously used in SVM learning based on RCH. The original SVM solver involving RCH was improved by López et al. [19] by replacing the SK algorithm [20], which was used for searching for the closest points, with the faster MDM algorithm introduced by Mitchell et al. [21]. Our work, similar to López's, introduces an algorithm originating from the MDM solver as the technique for finding an enclosing ball (EB) surrounding the data points. This novel EB solver is then applied to the EB problem obtained by reformulating the original L2 SVM problem as an EB task.

2. Sphere Support Vector Machines

It has been shown in [5] that, for normalized kernels,¹ the learning setting of the L2 SVM, defined as

  argmin_{w,b,ξ,ρ}  (1/2)||w||² + b²/2 − ρ + (C/2) Σ_{i=1}^m ξ_i²,   (1)

  subject to  y_i(x_i · w + b) ≥ ρ − ξ_i,  i = 1, …, m,   (2)

can be rewritten as a minimization task equivalent to solving the problem of finding the minimal enclosing ball,

  argmin_{R,c}  R²,   (3)

  subject to  ||c − x̃_i||² ≤ R²,  i = 1, …, m,   (4)

in the feature space F̃ defined by the kernel

  k̃_ij = y_i y_j k_ij + y_i y_j + δ_ij / C,   (5)

where k_ij = φ(x_i) · φ(x_j) is the kernel used in the original L2 SVM problem, δ_ij is Kronecker's delta, and x̃_i = φ̃(x_i) is the image of the vector x_i in the feature space F̃. In other words, solving the minimal enclosing ball problem will also produce a solution of the L2 SVM task. This idea is the foundation of the CVM and BVM algorithms introduced by Tsang, as well as of the new approach presented here.

The minimal enclosing ball problem solved by the CVM algorithm was slightly simplified in the BVM approach. Specifically, Tsang found an accurate estimate of the radius of the enclosing ball. Since the ball's radius could be considered known, the only unknown was the center of the ball. This way, the minimal enclosing ball problem was replaced by an enclosing ball problem. This approach turned out to be very effective, and so we decided to apply it to our algorithm. SphereSVM, like its predecessor BVM, does not try to minimize the radius of the enclosing ball.

¹ Kernels satisfying the condition k_ii = φ(x_i) · φ(x_i) = t = const, e.g., for a Gaussian kernel k_ii = 1.

2.1. The algorithm

The SphereSVM algorithm is a novel reformulation of the BVM approach. Therefore, some parts of both algorithms are similar. For instance, the initialization procedure, the way the violating vectors² are found, and the stopping criterion are all the same. However, there are important differences, the main one being the way updates of the center are performed. In the case of the BVM algorithm, all weights α_i corresponding to vectors x̃_i belonging to the coreset are modified in each updating step. In the SphereSVM algorithm proposed here, only two weights, α_v and α_u, are updated. The first weight α_v corresponds to the vector that is furthest from the ball center, while the other weight α_u belongs to the support vector closest to the center. According to the KKT conditions of the MEB problem,

  α_i (||c − x̃_i||² − R²) = 0,   (6)

if the condition α_i ≠ 0 holds then x̃_i lies on the boundary of the minimal enclosing ball. In other words, the vectors lying inside the ball are not support vectors and do not affect the solution. This observation leads to the conclusion that there are two types of violators: the vectors lying outside the enclosing ball and the vectors having nonzero weights that are lying inside the ball. The SphereSVM algorithm aims at eliminating support vectors inside the ball. The simplified pseudo-code of the SphereSVM algorithm is presented in Algorithm 1.

² Violating vectors are the data points that violate some predefined conditions. In the case of the BVM algorithm, these are the vectors lying outside the enclosing ball. In SphereSVM, there is also another type of violator, namely vectors that do not satisfy the KKT conditions (samples x̃_i having non-zero weight α_i and not lying on the surface of the enclosing ball).

Algorithm 1. SphereSVM algorithm.
Require: ε ∈ [0, 1) {the parameter of the stopping criterion}
Require: N_r ∈ Z⁺ {the size of the random subset}
Require: N_a ∈ Z⁺ {the number of draw attempts}
Ensure: c = Σ_{i=1}^m α_i x̃_i {the approximation of the EB center}
 1: α ← 0, α_0 ← 1
 2: R̂ ← sqrt(t + 1 + 1/C)
 3: ε̂ ← 1/2
 4: repeat
 5:   ε̂ ← max{ε̂, ε}
 6:   i ← 0
 7:   repeat
 8:     X_r ← random subset of X of size N_r
 9:     v ← argmax_{i : x̃_i ∈ X_r ∧ ||c − x̃_i|| > (1 + ε̂)R̂} ||c − x̃_i||
10:     i ← i + 1
11:   until v ≠ ∅ or i > N_a
12:   if v ≠ ∅ then
13:     u ← argmin_{i : α_i > 0} ||c − x̃_i||
14:     r ← (x̃_v − x̃_u) · (x̃_v − c) / ||x̃_v − x̃_u||²
15:     β̂ ← r − sqrt(r² − (||x̃_v − c||² − R̂²) / ||x̃_v − x̃_u||²)
16:     β ← min{β̂, α_u}
17:     α_v ← α_v + β
18:     α_u ← α_u − β
19:   else
20:     ε̂ ← ε̂ / 2
21:   end if
22: until ε̂ < ε
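To make the pseudo-code above more concrete, here is a rough Python sketch of Algorithm 1 for a precomputed kernel matrix. It is only an illustration under our own simplifying assumptions (dense Gram matrix held in memory, no kernel cache, none of the extra α-tuning described later in Section 2.1), and the function and variable names are ours rather than those of the LibCVM-based implementation used in the experiments.

import numpy as np

def tilde_gram(K, y, C):
    """Transformed Gram matrix of Eq. (5): k~_ij = y_i y_j k_ij + y_i y_j + delta_ij / C."""
    yy = np.outer(y, y)
    return yy * K + yy + np.eye(len(y)) / C

def sphere_svm(K, y, C, eps=1e-3, n_r=59, n_a=10, rng=None):
    """Illustrative, non-optimized sketch of Algorithm 1.
    K: kernel matrix with constant diagonal t (e.g. Gaussian, t = 1); y: labels in {-1, +1}.
    Returns the weight vector alpha approximating the EB center."""
    rng = np.random.default_rng(rng)
    m = len(y)
    Kt = tilde_gram(K, y, C)
    t = K[0, 0]                            # k_ii = t for normalized kernels
    R_hat = np.sqrt(t + 1.0 + 1.0 / C)     # line 2: estimated radius
    alpha = np.zeros(m)
    alpha[0] = 1.0                         # line 1: start from a single point
    Ka = Kt[:, 0].copy()                   # Ka = Kt @ alpha, maintained incrementally
    eps_hat = 0.5
    while True:
        eps_hat = max(eps_hat, eps)
        v = None
        for _ in range(n_a):                                   # lines 7-11
            cand = rng.choice(m, size=min(n_r, m), replace=False)
            c2 = alpha @ Ka                                    # ||c||^2
            d2 = c2 - 2.0 * Ka[cand] + np.diag(Kt)[cand]       # ||c - x~_i||^2
            far = d2 > ((1.0 + eps_hat) * R_hat) ** 2
            if far.any():
                v = cand[far][np.argmax(d2[far])]
                break
        if v is None:                                          # lines 19-20
            eps_hat /= 2.0
            if eps_hat < eps:
                return alpha
            continue
        sv = np.flatnonzero(alpha > 0)                         # line 13
        c2 = alpha @ Ka
        d2_sv = c2 - 2.0 * Ka[sv] + np.diag(Kt)[sv]
        u = sv[np.argmin(d2_sv)]
        dvu2 = Kt[v, v] + Kt[u, u] - 2.0 * Kt[v, u]            # ||x~_v - x~_u||^2
        dv2 = c2 - 2.0 * Ka[v] + Kt[v, v]                      # ||c - x~_v||^2
        r = (Kt[v, v] - Kt[u, v] - Ka[v] + Ka[u]) / dvu2       # Eq. (11)
        beta_hat = r - np.sqrt(max(r * r - (dv2 - R_hat ** 2) / dvu2, 0.0))  # Eq. (10)
        beta = min(beta_hat, alpha[u])                         # Eq. (12)
        alpha[v] += beta                                       # lines 17-18
        alpha[u] -= beta
        Ka += beta * (Kt[:, v] - Kt[:, u])                     # keep Kt @ alpha current

With a Gaussian Gram matrix K and labels y in {−1, +1}, a call such as sphere_svm(K, y, C=4.0) would return the weight vector α that approximates the enclosing-ball center.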

Our approach applies the ideas introduced in the MDM method by Mitchell et al. [21] as a solution to the nearest point problem (NPP). Here, we adopted the MDM approach for solving the enclosing ball (EB) problem. During the initialization part of the algorithm (lines 1 to 2), a random support vector is chosen (here, the support vector with index 0) and its weight is initialized to 1. Then, the radius of the enclosing ball is estimated as R̂ = sqrt(t̃), where t̃ = k̃_ii = t + 1 + 1/C is the squared norm of the vectors x̃_i in the feature space F̃, and t = k_ii is the squared norm of the vectors x_i in the feature space induced by the kernel k_ij = φ(x_i) · φ(x_j). It was shown in [7] that R̂ ≥ R and that, when the size of the dataset and the dimensionality of the feature space are large, the difference between R and R̂ is negligible.

Similarly as in the MDM algorithm, two violating vectors are selected in each iteration. First, a random subset X_r of size N_r is drawn from the entire dataset. Then, from among the vectors x̃_i ∈ X_r, a vector x̃_v is chosen whose distance from the center of the ball c is greater than (1 + ε̂)R̂. If such a vector is not found in the subset X_r, another random subset is selected and the search is performed again. Drawing the X_r subset may be repeated up to N_a times. If after N_a attempts there is no outlier satisfying the condition ||c − x̃_i|| > (1 + ε̂)R̂, then ε̂ is decreased and the algorithm continues with the next iteration.

After the violator x̃_v is selected, searching for another violator begins. The algorithm finds a support vector that violates the KKT conditions (6) the most. In other words, the algorithm searches for a support vector x̃_u that lies closest to the center of the ball, i.e., for which ||c − x̃_i||² is minimal.

After the two violating vectors are selected, an update of the center of the ball is performed. The center of the ball is shifted along the line connecting the two violating vectors,

  c' = c + β(x̃_v − x̃_u),   (7)

as can be seen in Fig. 1. The β coefficient is selected in such a way that the new sphere centered at c' touches the violator x̃_v (x̃_v must lie on the boundary of the new enclosing ball). Specifically, the following condition must be satisfied:

  ||c' − x̃_v|| = R̂.   (8)

Substituting (7) into (8), we obtain

  ||c + β(x̃_v − x̃_u) − x̃_v||² = R̂²,   (9)

which can be reduced to the following:

  β̂ = r − sqrt(r² − (||x̃_v − c||² − R̂²) / ||x̃_v − x̃_u||²),   (10)

where r is

  r = (x̃_v − x̃_u) · (x̃_v − c) / ||x̃_v − x̃_u||².   (11)

In the dual space, (7) is equivalent to an increase of α_v by β and a decrease of α_u, also by β. It is important to keep all the conditions arising from the Lagrange multiplier method satisfied. In particular, the non-negativity condition on the α_i weights must be fulfilled. Therefore, β ≤ 1 − α_v and β ≤ α_u must be true. The first of these requirements is always fulfilled. However, one must assure non-negativity of all α_i. For this reason, the β coefficient must be limited from above by the weight α_u:

  β = min{β̂, α_u}.   (12)

Having the value β, it is possible to update the center of the ball and resume the algorithm by checking the stopping criterion and searching for other violators.

Fig. 1. One step of the SphereSVM algorithm. The center c is shifted along the vector x̃_v − x̃_u to the new position c'. After that, the vector x̃_v becomes a support vector. The previously estimated radius R̂ of the enclosing ball does not change.

We use the multi-scale approximation method proposed in [7]. The algorithm searches for violators x̃_v lying further than (1 + ε̂)R̂ from the center c, where ε̂ is a parameter initialized to 1/2 and gradually decreased towards ε whenever the algorithm is unable to make progress (specifically, when it is not possible to find a violator x̃_v after N_a attempts). Our experiments showed that this approach slightly improves the time performance, and significantly improves the accuracy, of the obtained model.

There are two main differences between the pseudo-code we present in Algorithm 1 and the actual implementation used in our experiments. First, we used a kernel cache in order not to repeat unnecessary kernel computations. Second, our implementation contains an additional step that tunes the values of the α vector (in each iteration, another update of the α vector is performed, with the difference that the violator x̃_v is searched for among support vectors only). Both modifications originate from Tsang's implementation of the BVM algorithm and are included in the LibCVM toolkit [5,7].

2.2. Convergence and computational complexity

In this section, we show that SphereSVM reaches the final ε-approximation of the enclosing ball in a finite number of iterations, which depends only upon ε. For simplicity's sake, let us assume that we are looking for the ε-approximation of the EB that satisfies ||c − x̃_i|| ≤ (1 + ε)R̂ for all vectors x̃_i. For this purpose, we set N_r = m and N_a = 1. By doing so, X_r = X is enforced, which means that the random set X_r selected in line 8 of the algorithm is identical to the entire dataset X. This results in picking x̃_v as the worst violator (the point lying furthest from the center). The following property holds for all possible vectors α having α_i ≥ 0 and Σ_{i=1}^m α_i = 1:

  Σ_{i=1}^m α_i ||c − x̃_i||² = R̂² − ||c||² ≤ R̂².   (13)

The initialization procedure of the algorithm ensures that Σ_{i=1}^m α_i ||c − x̃_i||² = 0. In each iteration, the center update expressed in (7) changes the value of Σ_{i=1}^m α_i ||c − x̃_i||² by

  Σ_{i=1}^m α'_i ||c' − x̃_i||² − Σ_{i=1}^m α_i ||c − x̃_i||² = −2β(x̃_v − x̃_u) · c − β² ||x̃_v − x̃_u||².   (14)

Now, in order to prove the convergence of the algorithm, it is sufficient to show that this change increases the sum Σ_{i=1}^m α_i ||c − x̃_i||² by a value greater than some positive constant. Let us assume that the ratio of the number of updates where β̂ > α_u (clipped updates) to the number of updates having


β = min{β̂, α_u} = β̂ (full updates) is bounded by a constant. In other words, we postulate that the number of clipped updates (updates limited by the value of the weight α_u) is not significantly greater than the number of full updates. This hypothesis is very plausible. The results presented in Figs. 5 and 8 reveal that the numbers of support vectors for the BVM algorithm, which is not capable of removing support vectors from the coreset, and for the SphereSVM method, which can discard support vectors from the coreset (by performing a clipped update), are similar. This allows us to conclude that vectors once selected to become support vectors are very unlikely to be eliminated from the final coreset. Therefore, the number of clipped updates is expected to be much smaller than the number of full updates. For this reason, we can analyze only the updates for which β = β̂ and assume that all other updates do not increase the value of Σ_{i=1}^m α_i ||c − x̃_i||² at all.

Fig. 2 visualizes the update step performed by the SphereSVM algorithm. It contains the projection of the feature space F̃ onto the plane determined by the violators x̃_v, x̃_u and the current center of the ball c. The current solution is drawn as the black circle of radius R̂ and center c. The blue dashed circle represents the locations of the data points x̃_i. Since we are using a normalized kernel and k̃_ii = t̃ is constant, all points are mapped onto a sphere with radius R̂ = sqrt(t̃) centered at the origin 0.

It is true that x̃_u · c ≥ ||c||². Otherwise, either c would not lie inside the convex hull formed by the support vectors or there would be another support vector closer to the center c. Moreover, ||c|| ≥ εR̂, because otherwise all data points would be enclosed within the ε-approximation of the enclosing ball, which would already form an acceptable solution. Therefore, we can conclude that

  ||c − x̃_u|| ≤ sqrt(R̂² − ||c||²) ≤ sqrt(1 − ε²) R̂.   (15)

From (8) and (15) and the fact that ||c'' − x̃_v|| = ||c − x̃_u||, it follows that

  ||c' − c''|| > (1 − sqrt(1 − ε²)) R̂.   (16)

Fig. 2. Visualization of the update step (projection on the plane determined by points x~ v , x~ u and the ball center c). The black circle is the current solution, which is a ball with center c and radius R. The blue (dashed) circle, centered in the origin, ~ (since k~ ii ¼ t~ is represents the locations of points x~ i in the feature space F constant, all points are mapped on the sphere). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

Also,

  ||β(x̃_v − x̃_u)|| = ||c − c'|| ≥ εR̂,   (17)

because if not, then ||c' − x̃_v|| > R̂, the new enclosing ball with center c' and radius R̂ would not include the data point x̃_v, and (8) would not be satisfied. From Fig. 2 it follows that

  2(c − c') · c = (c − c') · (c − c''),   (18)

which leads to

  2(c − c') · c = (||c − c'|| + ||c' − c''||) ||c − c'||.   (19)

Now, the consequence of (16), (17) and (19) is the following lower bound on (14):

  −2β(x̃_v − x̃_u) · c − ||β(x̃_v − x̃_u)||² > ε(1 − sqrt(1 − ε²)) R̂².   (20)

Hence, in each iteration the value of Σ_{i=1}^m α_i ||c − x̃_i||² is increased by at least ηR̂², where η = ε(1 − sqrt(1 − ε²)). Now, it is clear that the maximum number of iterations after which Σ_{i=1}^m α_i ||c − x̃_i||² = R̂² holds true is equal to 1/η. Moreover, 1/η constitutes the maximum number of support vectors (there cannot be more support vectors than the number of iterations since, in each iteration, at most one support vector is added). In each iteration, the algorithm must evaluate m/η kernels in order to find the violator x̃_v, and at most 1/η kernels in order to find the violator x̃_u. Therefore, the complexity of this algorithm is O(m/η²). Note, however, that in real applications, both the bound on the number of iterations and the bound on the number of support vectors will rarely be reached, and the training usually finishes in a much shorter time.

Now, let us assume that N_d = N_r · N_a < m. This modification changes the deterministic character of the algorithm into a probabilistic one. The procedure stops when it is unable to find a violator x̃_v within N_d attempts. In other words, the algorithm stops when the probability of finding a violator that lies further than (1 + ε)R̂ from the center is estimated to be less than 1/N_d. The probabilistic algorithm requires no more than N_d/η kernel computations in each iteration to find the violator x̃_v, and at most 1/η kernel computations in order to find the violator x̃_u. This means that its complexity is O(N_d/η + 1/η²), which is equivalent to O(1/η²) since N_d is a constant independent of the number of data points.
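As a rough feel for how loose this bound is in absolute terms (our own numerical illustration, not taken from the paper), the iteration bound 1/η can be evaluated for two sample tolerances:

from math import sqrt

for eps in (0.05, 0.001):
    eta = eps * (1.0 - sqrt(1.0 - eps ** 2))   # eta = eps(1 - sqrt(1 - eps^2))
    print(f"eps = {eps:g}: eta = {eta:.2e}, 1/eta = {1.0 / eta:,.0f} iterations")
# eps = 0.05  -> eta ~ 6.3e-05, i.e. a bound of about 16,000 iterations
# eps = 0.001 -> eta ~ 5.0e-10, i.e. a bound of about 2.0e+09 iterations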

3. Datasets and experimental setup

3.1. Datasets

The datasets used in our experiments, and the results obtained, are divided into two groups dubbed "medium" and "large" here. Note that these are (almost) the same datasets as in [7], where the BVM algorithm was introduced (Table 1). However, the experimental environment in [7] was not as strict as the double CV used here. The two adjectives do not describe the size of the datasets only; they also reflect the complexity of the classification tasks. Such complexity is usually reflected in the number of data samples that become support vectors as well as in the number of classes m_C (because the number of classification models which have to be designed equals m_C(m_C − 1)/2 for pairwise multi-class classification). All the datasets can be downloaded from the LIBSVM³ and LibCVM⁴ sites.

3.2. Experimental environment

All results presented in this section were obtained using the double cross-validation procedure, a very rigorous scheme for assessing a classification model's performance [22,23].

³ LIBSVM is available at http://www.csie.ntu.edu.tw/cjlin/libsvm/.
⁴ LibCVM is available at http://c2inet.sce.ntu.edu.sg/ivor/cvm.html.


Table 1
Datasets used in the experiments: number of classes, number of features (dimension), total number of patterns, and average number of patterns used in one training run (within a nested 5 × 5 cross-validation and with pairwise classification for multi-class datasets).

Dataset        # of classes   Dim.    Total # of patterns   Avg. # of training patterns
Medium datasets
optdigits      10             64      5620                  719
satimage       6              36      6435                  1373
usps           10             256     9298                  1190
pendigits      10             16      10,992                1407
reuters        2              8315    11,069                7084
letter         26             16      20,000                985
Large datasets
adult          2              123     48,842                31,259
w3a            2              300     49,749                31,839
shuttle        7              7       58,000                10,606
web (w8a)      2              300     64,700                41,408
ijcnn1         2              22      141,691               90,682
intrusion      2              127     5,209,460             3,334,054
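Since several of the sets in Table 1 are multi-class and the experiments use pairwise (one-vs-one) classification, the number of binary models per training run follows the m_C(m_C − 1)/2 rule quoted in Section 3.1; a quick check (our own illustration):

def pairwise_models(n_classes):
    # number of one-vs-one binary classifiers for n_classes classes
    return n_classes * (n_classes - 1) // 2

print([pairwise_models(k) for k in (10, 26, 7)])   # optdigits: 45, letter: 325, shuttle: 21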

Fig. 3. Medium datasets–accuracy obtained during nested cross-validation.

Here, we evaluate the generalization performance of six models by using this procedure, structured as a two-loop algorithm. In the outer loop, the dataset is separated into J1 roughly equal-sized parts (for the purposes of this paper, J1 = 5 was used in all experiments). Each part is held out, in turn, as the test set, and the remaining four parts are used as the training set. In the inner loop, J2-fold cross-validation is performed over the training set only, to determine the best values of the hyper-parameters (here, J2 = 5). The best model obtained in the inner loop is then applied to the test set. The double cross-validation procedure ensures that the class labels of the test data are not seen when tuning the hyper-parameters. This is consistent with the real-world application scenario. Obviously, such a rigorous procedure requires many runs, but if the main goal is to fairly compare different classification models on the same datasets and under the same conditions, double cross-validation must be used. It is the only objective procedure for comparing the performances of various classification algorithms, and it is suggested as the required environment for model comparisons in the future.
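A compact sketch of this double (nested) cross-validation loop is given below. It uses scikit-learn's SVC and GridSearchCV purely as stand-ins for the classifiers compared in this paper, assumes the features are already scaled to [0, 1], and takes the (C, γ) grid from Eqs. (21) and (22) below; it illustrates the protocol, not the code used in our experiments.

import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

def double_cross_validation(X, y, j1=5, j2=5, seed=0):
    """Outer loop estimates accuracy; inner loop tunes (C, gamma) on training folds only."""
    param_grid = {"C": [4.0 ** n for n in range(-2, 6)],        # Eq. (21)
                  "gamma": [4.0 ** n for n in range(-5, 3)]}    # Eq. (22)
    outer = StratifiedKFold(n_splits=j1, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in outer.split(X, y):
        inner = GridSearchCV(SVC(kernel="rbf"), param_grid,
                             cv=StratifiedKFold(n_splits=j2, shuffle=True, random_state=seed))
        inner.fit(X[train_idx], y[train_idx])                   # test labels never seen here
        scores.append(inner.score(X[test_idx], y[test_idx]))    # best model refit, then tested once
    return float(np.mean(scores)), float(np.std(scores))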

The datasets were first normalized by a linear transformation of the feature values into the [0, 1] range. Then the training process, which involved searching for the best model parameters using a grid search, was performed. The parameters were selected among 64 possible combinations of the regularization parameter C and the γ coefficient of the Gaussian kernel φ(x_i) · φ(x_j) = exp(−γ||x_i − x_j||²). There were eight possible values of the C parameter,

  C ∈ {4^n},  n = −2, …, 5,   (21)

and eight possible γ values,

  γ ∈ {4^n},  n = −5, …, 2.   (22)

The tolerance parameter ε used in the stopping criterion was set to ε = 10⁻³ for the L1 SVM and L2 SVM algorithms. As shown in [4], for SMO based algorithms the value of ε does not affect accuracy or time performance significantly. In other words, decreasing ε does not improve accuracy, nor does a reasonable increase of its value speed up the training procedure in a way that could change the results substantially. Therefore, we believe that this setting is a good trade-off between accuracy and the time required to train the model. For the BVM and SphereSVM algorithms, we used an ε value calculated using a heuristic proposed by Tsang, namely

  ε = 2 × 10⁻⁶ (1 + N_S / (C(t + 1))) / (t + 1 + 1/C),   (23)

where N_S is the expected number of support vectors (we assumed that N_S = 15,000). This heuristic permitted us to estimate the ε parameter based on the value of the regularization coefficient C. In our case, the ε values were in the range [10⁻⁶, 10⁻³], depending on the C value (smaller ε for larger C). More information regarding the dependency of the ε parameters for both methods can be found in [24,25].

The selection of the best model parameters was done using five-fold cross-validation applied to the previously selected training sets. After the best parameters were chosen, one additional SVM model was trained using the entire training dataset. This model was then assessed on the test dataset.

Our experiments were performed on a computer cluster composed of six nodes. Each node was equipped with two E5520 Intel Xeon CPUs (4-core, 2.27 GHz) and 24 GB of RAM. Although the implementation of the algorithms presented in this paper does not support multi-threaded execution, we utilized the multi-core environment by decomposing the nested cross-validation procedure into several independent processes. In other words, double cross-validation was performed by parallel execution of independent training and testing processes.

4. Performance of SphereSVM and comparisons to other SVMs

4.1. Performance of the Sphere Support Vector Machines

In this section, we present comparisons of our algorithms to both the classical L1 and L2 SVMs and the BVM algorithm. The LIBSVM [2] software is used as the reference implementation of the L1 and L2 SVMs, whereas the BVM implementation is taken from the LibCVM [5,7] package. We compared three different versions of SphereSVM:

• SphereSVM: This version uses the same settings as the BVM algorithm (N_r = 59 and N_a = 10).
• SphereSVM-590: Here, the values of the N_r and N_a parameters were changed to N_r = 1 and N_a = 590. This configuration therefore preserves the same stopping probability: the algorithm stops when the probability of finding an appropriate violator x̃_v is estimated to be less than 1/(N_r N_a) = 1/590.
• SphereSVM-100: The parameters for this configuration were set to N_r = 1 and N_a = 100, thereby increasing the stopping probability to 1/100.
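The sampling guarantees behind these three configurations can be checked with a few lines of Python (our own illustration; the 59-draw/95% guarantee is the one quoted in Appendix A, and the violator fraction p used below is a hypothetical value):

# Probability that a random subset of 59 points contains at least one of the
# 5% of points lying furthest from the center (BVM's setting, see Appendix A):
print(1.0 - 0.95 ** 59)           # ~0.952

# With N_r = 1: probability that N_a consecutive single-point draws all miss,
# when a fraction p of the data still violates the (1 + eps_hat)R_hat condition:
def all_draws_miss(p, n_a):
    return (1.0 - p) ** n_a

print(all_draws_miss(0.01, 100))  # ~0.366  (SphereSVM-100)
print(all_draws_miss(0.01, 590))  # ~0.003  (SphereSVM-590)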

4.1.1. Medium datasets

Fig. 3 shows the accuracies obtained during nested cross-validation for the optdigits, satimage, usps, pendigits, reuters and letter datasets. It can readily be seen that the accuracies of all six models are very similar. More precisely, they are equal within the standard error for four datasets, namely optdigits, satimage, pendigits (not for SphereSVM-100) and reuters. For the usps dataset only, L1 and L2 SVMs are just slightly better (less than 0.5%) than SphereSVM, and for the letter data, only SphereSVM-100 is below the standard error difference, by about 0.4%. As for the latter, this tiny difference in accuracy is compensated by a faster training stage. Overall, we can say that the accuracy of SphereSVM-100 is competitive with the other algorithms.

Fig. 4. Medium datasets–total nested cross-validation time.

Fig. 5. Medium datasets–average percent of support vectors for each of the models obtained in one-vs-one training.

Fig. 4 presents the total time of the nested cross-validation procedure. One can readily see that the original BVM is slower than SphereSVM, which in turn is slower than the SphereSVM-590 and SphereSVM-100 algorithms. One can also see that, for the not too big datasets, SphereSVM-100 performs best in terms of training time, followed by LIBSVM's implementation of the L1 SVM and SphereSVM-590. The average percentage of support vectors (calculated as the average percentage of support vectors of the models resulting from the one-vs-one training) is presented in Fig. 5. All algorithms have similar percentages of support vectors.

4.1.2. Large datasets

Figs. 6–8 show the results of the nested cross-validations obtained for the large datasets adult, w3a, shuttle, web, ijcnn1 and intrusion. Note that for the intrusion dataset, we present only the results obtained by the algorithms based on the enclosing ball approach. The reason is simple: both L1 and L2 SVM were unable to complete the learning process in a reasonable time. We decided to abort their training after approximately 60 h because neither of the algorithms was able to finish cross-validation even for a single set of parameters. Since our nested cross-validation procedure consists of searching for an optimal set of parameters among 64 combinations of C and γ values, we could roughly estimate that the learning process would not have finished within 160 days, which is a huge difference compared to the three days required by SphereSVM-590.

Fig. 6 shows the accuracy obtained by the nested cross-validation procedure for the large datasets. For w3a, shuttle, web, ijcnn1 and intrusion, there is no significant difference between the methods (except for SphereSVM-100). The accuracy of SphereSVM-100 is approximately 0.3% lower than that of the other algorithms, but it is usually more than 10 times faster than the other methods (see Fig. 7). One can notice that SphereSVM achieved much lower accuracy for the adult dataset. The reason for that is too large a value of the tolerance parameter ε. We obtained competitive accuracy after decreasing its value to 10⁻⁶. This is the only time the heuristic used by Tsang for determining the value of ε failed.

The learning times presented in Fig. 7 demonstrate that all versions of SphereSVM are faster than their predecessor BVM (more precisely, SphereSVM-590 is usually four to five times more efficient than BVM). Moreover, the capabilities of the SphereSVM approach are even more evident when comparing SphereSVM-100 and BVM: the former is up to two orders of magnitude faster than the latter (although, as mentioned earlier, its accuracy is approximately 0.3% lower).

Fig. 6. Large datasets–accuracy obtained during nested cross-validation.


Fig. 7. Large datasets–total nested cross-validation time.


Fig. 9. Dependency of the training time of the SphereSVM algorithm (having N_r = 1 and N_a = 590) upon the number of support vectors, for the best hyper-parameter values.

Furthermore, the SphereSVM approach is more than competitive with L1 SVM. SphereSVM-590 is at least as fast as L1 SVM (except for the shuttle dataset) and, in the case of the w3a and web datasets, is even more than 35 times faster.

As for the models' sizes, Fig. 8 shows that the models generated by all three SphereSVM algorithms are smaller than the models obtained by the BVM algorithm. This is caused by the difference in the way the two algorithms update the weight vector α. The update scheme applied in SphereSVM allows a reduction in the number of support vectors (removal of a support vector is performed in line 18 of Algorithm 1 whenever β = α_u). In contrast, the BVM algorithm cannot remove a support vector from the coreset; once a vector x̃_i becomes a support vector, it is not possible to decrease its weight α_i to 0 (although the α_i value may asymptotically approach 0).

Fig. 9 shows how the training time of the SphereSVM algorithm depends on the complexity of the dataset, measured as the number of support vectors required to build the model. The version of the algorithm having N_r = 1 and N_a = 590 (SphereSVM-590) was used. A linear correlation between the training time and the number of support vectors is clearly evident (a linear regression of the logarithm of the model size vs. the logarithm of the training time gives p-value = 0.007 and R² = 0.54).

Fig. 8. Large datasets–average percent of support vectors for each of the models obtained in one-vs-one training.

4.2. Draw scheme for Sphere Support Vector Machines

Based on the foregoing figures, it is apparent that SphereSVM-100 (SphereSVM with parameters N_a = 100 and N_r = 1) is by far the fastest algorithm while still producing accuracy similar to the other methods. This is particularly pronounced for the large datasets (web, w3a and intrusion), where the speedup is up to three orders of magnitude. In the case of intrusion, both classic SVM algorithms (L1 and L2 as implemented in LIBSVM) never converged and the speedup cannot be expressed. Thus, when the number of data points goes into the domain of a few million or more (ultra-large datasets), SphereSVM seems to be the only alternative right now. This raises the question of what would constitute a good number of draws for SphereSVM.

Figs. 10, 11 and 12 present the dependencies of accuracy, training time, and percentage of support vectors upon the number of draws for the medium datasets. By "number of draws" we mean the maximal number of random vector draws performed in one iteration (parameter N_a of Algorithm 1). Moreover, we assume that the size of the random subset X_r is 1 (the parameter N_r = 1). In other words, in each iteration, the algorithm performs up to N_a draws from the entire dataset until it finds a violator x̃_v lying further than (1 + ε̂)R̂ from the center. The algorithms dubbed SphereSVM-590 (having N_a = 590 and N_r = 1) and SphereSVM-100 (having N_a = 100 and N_r = 1) presented in Section 4.1 are examples of the new draw scheme presented here.

Fig. 10 shows the dependency between the classification accuracy and the number of draws for the optdigits, pendigits, reuters and satimage datasets. It can be observed that increasing N_a beyond 100 does not affect the performance of the algorithm. Although for these datasets even N_a = 100 seems to be enough to obtain maximal performance, one can expect that, for more complex datasets (having more support vectors), this number should be increased. This expectation is partially confirmed by the results presented in Fig. 6 where, for the largest datasets, the accuracy of the SVM model trained with parameter N_a = 100 is slightly smaller than for the rest of the models.


Fig. 10. Dependency of the classification accuracy upon the maximal number of draws allowed during one iteration.

Fig. 12. Dependency of the percent of support vectors upon the maximal number of draws allowed during one iteration.

Fig. 11. Dependency of the total nested cross-validation time upon the maximal number of draws allowed during one iteration.

Fig. 11 visualizes the dependency between the nested cross-validation time and the number of random draws. It is noteworthy that the training time increases linearly once the N_d parameter is large enough (N_d > 100 for optdigits, pendigits and satimage, and N_d > 600 for reuters). The results presented in Fig. 12 suggest that, for optdigits, pendigits and satimage, increasing the number of draws beyond 100 does not affect the size of the model. The exception here is the reuters dataset, which achieves a stable number of support vectors only for N_a ≫ 100.

5. Conclusions

The novel L2 SVM classification algorithm proposed in this paper, dubbed SphereSVM, is aimed at classifying large and very large datasets. It shows a significant speedup with respect to the L1 and L2 SVM implementations in LIBSVM and to the Ball Vector Machines approach. While achieving a speedup exceeding a few orders of magnitude for complex datasets, SphereSVM still attains accuracies comparable to the other three algorithms. All comparisons have been performed within double (nested) cross-validation, and thus the accuracy estimates have been obtained on samples not seen by the classifiers during the training phase. Such a rigorous experimental environment produces accuracy estimates that can be expected in real-life applications of all the models.

A proof of convergence for SphereSVM is given, and it states that the time complexity of the algorithm depends upon the tolerance parameter ε only. However, as is often the case with bounds, this one is also loose. In all our simulations, we have attained results in a much shorter time than given by the theoretical bound.

Finally, it is worth noting that SphereSVM usually generates sparser models (having fewer support vectors) than the BVM, which is the result of an efficient elimination of support vectors inside the enclosing sphere. At present, it seems that SphereSVM may well be the recommended sequential classification approach when data sizes go into the ultra-large domain (say, when the number of samples begins to exceed several million).

Appendix A. Draw scheme for Sphere Support Vector Machines

In the original BVM algorithm, the authors applied the probabilistic speedup heuristic [16]. Instead of finding a violating vector in the entire dataset (a very expensive operation), they picked the violator from a random subset of the dataset. More precisely, they chose the values of the N_r and N_a parameters as 59 and 10, respectively. In other words, in each iteration, the BVM algorithm made up to 10 attempts to find a violator x̃_v, and the search for a violator was performed by finding the vector most distant from the center c among 59 randomly chosen vectors. This strategy ensures, with 95% probability, that at least one of these randomly selected vectors is among the 5% of vectors that are furthest from the center [5,16,24].

Although this approach has proven to be quite effective, there is a better way of tuning the values of the N_r and N_a parameters in SphereSVM. During the experiments, it turned out that the algorithm having N_r = 1 often outperformed other configurations (both in terms of speed and accuracy). Moreover, the total number of random draws N_d = N_r · N_a = 590 could be decreased without affecting accuracy in any significant way.


Table A1
The upper bound on the percentage of data points (violators) left outside the (1 + ε)R̂ hyper-sphere, by the Agresti–Coull estimator.

N_d    Upper bound on the percentage of outliers with a confidence level of:
       90%       95%       99%
100    3.1%      4.4%      7.5%
590    0.55%     0.78%     1.3%

Decreasing the N_d value can dramatically decrease the time required to find the model (see Figs. 10 and 11). This is especially important when the number of model training runs is high (e.g., when using grid search for model selection). In addition, for very large datasets (say, more than 1 million data points), SphereSVM with a smaller N_d seems to be the only algorithm able to perform extensive cross-validation in a reasonable amount of time.

In order to understand the impact of the N_d parameter on the accuracy of the EB estimation, we used the Agresti–Coull confidence interval estimator [26] to find the upper bound on the percentage of vectors located outside the enclosing ball obtained by the algorithms. The results are shown in Table A1. If, for example, we use the approach with a maximal number of draws N_d = 100 (as in SphereSVM-100 used in Section 4.1), then according to Table A1 one can conclude, with 90% confidence, that fewer than 3.1% of the data points are left outside the enclosing hyper-sphere with radius (1 + ε)R̂ and center c. These statistics give us insight into what portion of the data is neglected by the algorithm during the training process.
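The numbers in Table A1 can be reproduced, to within rounding, from the Agresti–Coull upper bound for zero observed violators in N_d draws. The short sketch below uses two-sided critical values, which is our assumption about the convention behind the table:

from math import sqrt
from statistics import NormalDist

def agresti_coull_upper(n_draws, confidence):
    """Agresti-Coull upper confidence bound on the violator fraction
    after observing 0 violators in n_draws random draws."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)   # two-sided critical value
    n_tilde = n_draws + z * z
    p_tilde = (z * z / 2.0) / n_tilde                          # 0 successes observed
    return p_tilde + z * sqrt(p_tilde * (1.0 - p_tilde) / n_tilde)

for n_d in (100, 590):
    print(n_d, [round(100 * agresti_coull_upper(n_d, c), 2) for c in (0.90, 0.95, 0.99)])
# 100 -> [3.17, 4.44, 7.44] percent; 590 -> [0.55, 0.78, 1.34] percent (cf. Table A1)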


References

[1] T. Joachims, Making large-scale support vector machine learning practical, in: Advances in Kernel Methods, MIT Press, 1999, pp. 169–184.
[2] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intelligent Syst. Technol. 2 (2011) 27:1–27:27.
[3] Q. Li, R. Salman, E. Test, R. Strack, V. Kecman, GPUSVM: a comprehensive CUDA based support vector machine package, Cent. Eur. J. Comput. Sci. 1 (2011) 387–405.
[4] Q. Li, Fast Parallel Machine Learning Algorithms for Large Datasets Using Graphic Processing Unit, Ph.D. Thesis, Virginia Commonwealth University, 2011.
[5] I.W. Tsang, J.T. Kwok, P.-M. Cheung, Core vector machines: fast SVM training on very large data sets, J. Mach. Learn. Res. 6 (2005) 363–392.
[6] K.P. Bennett, E.J. Bredensteiner, Duality and geometry in SVM classifiers, in: Proceedings of the 17th International Conference on Machine Learning, 2000, pp. 57–64.
[7] I.W. Tsang, A. Kocsor, J.T. Kwok, Simpler core vector machines with enclosing balls, in: Proceedings of the 24th International Conference on Machine Learning—ICML '07, ACM Press, New York, USA, 2007, pp. 911–918.
[8] J.C. Platt, Sequential minimal optimization: a fast algorithm for training support vector machines, in: Advances in Kernel Methods: Support Vector Learning, Schölkopf, Burges, and Smola (Eds.), MIT Press, Cambridge, MA, 1998, pp. 185–208.
[9] T.M. Cover, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition, IEEE Trans. Electron. Comput. (1965) 326–334.
[10] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, K.K. Murthy, A fast iterative nearest point algorithm for support vector machine classifier design, IEEE Trans. Neural Networks 11 (2000) 124–136.
[11] V. Franc, V. Hlaváč, An iterative algorithm learning the maximal margin classifier, Pattern Recognition 36 (2003) 1985–1996.
[12] D.J. Crisp, C.J.C. Burges, A geometric interpretation of nu-SVM classifiers, in: Advances in Neural Information Processing Systems, vol. 12, 2000, pp. 223–229.
[13] M.E. Mavroforakis, S. Theodoridis, A geometric approach to support vector machine (SVM) classification, IEEE Trans. Neural Networks 17 (2006) 671–682.
[14] I.W. Tsang, J.T. Kwok, P.-M. Cheung, Very large SVM training using core vector machines, in: Proceedings of the 10th International Workshop on Artificial Intelligence, 2005, pp. 349–356.
[15] M. Badoiu, K.L. Clarkson, Smaller core-sets for balls, in: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, vol. 1, 2003, pp. 801–802.
[16] A.J. Smola, B. Schölkopf, Sparse greedy matrix approximation for machine learning, in: Proceedings of the 17th International Conference on Machine Learning, 2000, pp. 911–918.
[17] I.W. Tsang, J.T. Kwok, J.M. Zurada, Generalized core vector machines, IEEE Trans. Neural Networks 17 (2006) 1126–1140.
[18] S. Asharaf, M.N. Murty, S.K. Shevade, Multiclass core vector machine, in: Proceedings of the 24th International Conference on Machine Learning—ICML '07, 2007, pp. 41–48.
[19] J. López, A. Barbero, J.R. Dorronsoro, An MDM solver for the nearest point problem in scaled convex hulls, in: The 2010 International Joint Conference on Neural Networks (IJCNN), IEEE, 2010, pp. 1–8.
[20] B.N. Kozinec, Recurrent algorithm separating convex hulls of two sets, in: Learning Algorithms in Pattern Recognition, 1973, pp. 43–50.
[21] B.F. Mitchell, V.F. Demyanov, V.N. Malozemov, Finding the point of a polyhedron closest to the origin, SIAM J. Control 12 (1974) 19–26.
[22] S. Varma, R. Simon, Bias in error estimation when using cross-validation for model selection, BMC Bioinformatics 7 (2006) 91.
[23] T. Scheffer, Error Estimation and Model Selection, Ph.D. Thesis, Technische Universität Berlin, 1999.
[24] G. Loosli, S. Canu, Comments on the "Core vector machines: fast SVM training on very large data sets", J. Mach. Learn. Res. 8 (2007) 291–301.
[25] I.W. Tsang, J.T. Kwok, Authors' reply to the "Comments on the Core vector machines: fast SVM training on very large data sets", 2007.
[26] A. Agresti, B.A. Coull, Approximate is better than "exact" for interval estimation of binomial proportions, Am. Stat. 52 (1998) 119–126.

Robert Strack received his M.S. Eng. degree in Computer Science from AGH University of Science and Technology, Krakow, Poland, in 2007. He is now working towards his Ph.D. degree in Computer Science at Virginia Commonwealth University, Richmond, USA. His research is oriented towards Machine Learning and Data Mining algorithms and his field of interest includes support vector machines classification and parallel computing.


Vojislav Kecman is with VCU, Department of CS, Richmond, VA, USA, working in the fields of machine learning by both support vector machines (SVMs) and neural networks, as well as by local approaches such as adaptive local hyperplane (ALH) and local SVMs, in different regression (function approximation) and pattern recognition (classification, decision making) tasks. He was a Fulbright Professor at MIT, Cambridge, MA, a Konrad Zuse Professor at FH Heilbronn, a DFG Scientist at TU Darmstadt, and a Research Fellow at Drexel University, Philadelphia, PA, and at Stuttgart University. Dr. Kecman authored several books on ML (see www.supportvector.ws and www.learning-from-data.com).

Beata Strack received her M.S. degree in Applied Mathematics from AGH University of Science and Technology, Krakow, Poland, in 2008. Currently, she is a Ph.D. student at the Department of Computer Science, Virginia Commonwealth University, Richmond, USA. Her research interests include computational neuroscience and data mining.

Qi Li received his B.S. degree in Electronic Engineering from Beijing University of Posts and Telecommunications, Beijing, China, in 2007 and M.S. degree in Computer Science from Virginia Commonwealth University, Richmond, United States, in 2008. He is now a Ph.D. candidate in Computer Science at Virginia Commonwealth University. His research interests include data mining and parallel computing using GPU.
