Chapter 1

SELECTING DATA FOR FAST SUPPORT VECTOR MACHINE TRAINING

Jigang Wang, Predrag Neskovic, Leon N. Cooper

Institute for Brain and Neural Systems, Physics Department, Brown University, Providence, RI 02912, USA∗
[email protected], [email protected], [email protected]
Abstract
In recent years, support vector machines (SVMs) have become a popular tool for pattern recognition and machine learning. Training an SVM involves solving a constrained quadratic programming problem, which requires large memory and enormous amounts of training time for large-scale problems. In contrast, the SVM decision function is fully determined by a small subset of the training data, called support vectors. Therefore, it is desirable to remove from the training set the data that is irrelevant to the final decision function. In this paper we propose two new methods that select a subset of data for SVM training. Using real-world datasets, we compare the effectiveness of the proposed data selection strategies in terms of their ability to reduce the training set size while maintaining the generalization performance of the resulting SVM classifiers. Our experimental results show that a significant amount of training data can be removed by our proposed methods without degrading the performance of the resulting SVM classifiers.
Keywords: Support vector machines, quadratic programming, data selection, statistical confidence, Hausdorff distance, random sampling
1. Introduction
Support vector machines (SVMs), introduced by Vapnik and coworkers in the structural risk minimization (SRM) framework [1–3], have gained wide acceptance due to their solid statistical foundation and the good generalization performance they have demonstrated in a wide range of applications. Training an SVM involves solving a constrained quadratic programming (QP) problem, which requires large memory and enormous amounts of training time for large-scale applications [5]. On the other hand, the SVM decision function depends only on a small subset of the training data, called support vectors. Therefore, if one knows in advance which patterns correspond to the support vectors, the same solution can be obtained by solving a much smaller QP problem that involves only the support vectors. The problem is then how to select training examples that are likely to be support vectors.

∗ This work was partially supported by ARO under grant W911NF-04-1-0357. Jigang Wang was supported by a dissertation fellowship from Brown University.

Recently, there has been considerable research on data selection for SVM training. For example, Shin and Cho proposed a method that selects patterns near the decision boundary based on neighborhood properties [6]. In [7–9], k-means clustering is employed to select patterns from the training set. In [10], Zhang and King proposed a β-skeleton algorithm to identify support vectors. In [11], Abe and Inoue used the Mahalanobis distance to estimate boundary points. In the reduced SVM (RSVM) setting, Lee and Mangasarian chose a subset of training examples by random sampling [12]. In [13], it was shown that uniform random sampling is the optimal robust selection scheme in terms of several statistical criteria.

In this work, we introduce two new data selection methods for SVM training. The first method selects training data based on a statistical confidence measure that we describe later. The second method uses the minimal distance from a training example to the training examples of a different class as a criterion for selecting patterns near the decision boundary; it is motivated by the geometrical interpretation of SVMs in terms of (reduced) convex hulls. To understand how effective these strategies are in terms of their ability to reduce the training set size while maintaining generalization performance, we compare the results obtained by SVM classifiers trained with data selected by these two new methods, by random sampling, and by a data selection method based on the distance from a training example to the desired optimal separating hyperplane. Our comparative study shows that a significant amount of training data can be removed from the training set by our methods without degrading the performance of the resulting SVM classifier. We also find that, despite its simplicity, random sampling performs well and often provides results comparable to those obtained by the method based on the desired SVM outputs. Furthermore, in our experiments, we find that incorporating the class
distribution information in the training set often improves the efficiency of the data selection methods. The remainder of the paper is organized as follows. In Section 2, we give a brief overview of support vector machines for classification and the corresponding training problem. In Section 3, we present the two new methods that select subsets of training examples for training SVMs. In Section 4 we report the experimental results on several real-world datasets. Concluding remarks are provided in Section 5.
2. Related Background
In this section we give a brief overview of Support Vector Machines for classification. The reader is referred to [1–4] for more details on the SVM approach. For simplicity, we only consider the binary classification problem.
Large Margin Classifiers

Given a set of training examples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ are input vectors and $y_i \in \{-1, 1\}$ are the corresponding class labels, support vector machines seek to construct a hyperplane that separates the data with the maximum margin. Suppose that the problem is linearly separable, i.e., there exists a hyperplane $\langle w, x\rangle + b = 0$ such that
$$y_i(\langle w, x_i\rangle + b) > 0 \quad \forall i = 1, \ldots, n, \qquad (1.1)$$
where $w$ is normal to the hyperplane and $b$ is a threshold. Rescaling $w$ and $b$ such that the point(s) closest to the hyperplane satisfy $y_i(\langle w, x_i\rangle + b) = 1$, we obtain a canonical form $(w, b)$ of the hyperplane, satisfying $y_i(\langle w, x_i\rangle + b) \geq 1$. Note that in this case, the minimum Euclidean distance between the two classes (i.e., twice the margin), measured perpendicularly to the hyperplane, equals $2/\|w\|$. Therefore, the problem of finding the separating hyperplane with the largest margin can be formulated as
$$(w^*, b^*) = \arg\max_{w,b} \frac{2}{\|w\|^2} \qquad (1.2)$$
subject to the constraints
$$y_i(\langle w, x_i\rangle + b) \geq 1 \quad \forall i = 1, \ldots, n. \qquad (1.3)$$
This constrained optimization problem can be reformulated as the following quadratic programming problem:
$$(w^*, b^*) = \arg\min_{w,b} \|w\|^2 \qquad (1.4)$$
subject to the constraints
$$y_i(\langle w, x_i\rangle + b) \geq 1 \quad \forall i = 1, \ldots, n. \qquad (1.5)$$
In practice, however, a separating hyperplane may not exist, e.g., when different classes overlap in some regions of the input space due to the choice of feature representation or a high noise level. In this case, it may be desirable to allow some examples to violate the constraints in (1.3). To allow for this possibility, a standard approach is to introduce slack variables ([2, 3])
$$\xi_i \geq 0 \quad \forall i = 1, \ldots, n, \qquad (1.6)$$
along with relaxed constraints
$$y_i(\langle w, x_i\rangle + b) \geq 1 - \xi_i \quad \forall i = 1, \ldots, n. \qquad (1.7)$$
Support vector machines seek to find a hyperplane that minimizes the objective function
$$\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \xi_i \qquad (1.8)$$
subject to the constraints (1.6) and (1.7). The parameter $C > 0$ in the above objective function is a regularization constant that controls the trade-off between the separation margin and the number of training errors. To construct the optimal separating hyperplane, one therefore needs to solve the following quadratic optimization problem:
$$\min_{w,b} \; \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^{n} \xi_i \qquad (1.9)$$
subject to the constraints:
$$y_i(\langle w, x_i\rangle + b) \geq 1 - \xi_i \quad \forall i = 1, \ldots, n \qquad (1.10)$$
$$\xi_i \geq 0 \quad \forall i = 1, \ldots, n. \qquad (1.11)$$
This constrained optimization problem can be dealt with by using the Lagrange multiplier method. We introduce Lagrange multipliers $\alpha_i$ and $\beta_i$, $i = 1, \ldots, n$, one for each of the inequality constraints (1.10) and (1.11), respectively. This gives the Lagrangian:
$$L = \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\big(y_i(\langle w, x_i\rangle + b) - 1 + \xi_i\big) - \sum_{i=1}^{n}\beta_i\xi_i. \qquad (1.12)$$
From the Lagrangian $L$ we can obtain the dual objective by minimizing $L$ with respect to the primal variables $w$, $b$, and $\xi_i$, for all $i = 1, \ldots, n$:
$$\Theta(\alpha, \beta) = \inf_{w,b,\xi} L(w, b, \xi, \alpha, \beta). \qquad (1.13)$$
The Lagrangian dual problem is then defined as
$$\max_{\alpha, \beta} \; \Theta(\alpha, \beta) \qquad (1.14)$$
subject to
$$\alpha_i \geq 0 \quad \forall i = 1, \ldots, n \qquad (1.15)$$
$$\beta_i \geq 0 \quad \forall i = 1, \ldots, n. \qquad (1.16)$$
Note that the primal problem is a convex optimization problem. Furthermore, the objective function of the primal problem is differentiable. Therefore, according to the Karush-Kuhn-Tucker (KKT) theory, a solution $(w^*, b^*, \xi^*)$ is optimal if and only if there exist dual variables $\alpha^*, \beta^*$ such that
$$\nabla_w L(w^*, b^*, \xi^*, \alpha^*, \beta^*) = 0 \qquad (1.17)$$
$$\frac{\partial L}{\partial b} = 0 \qquad (1.18)$$
$$\alpha_i^*\big(y_i(\langle w^*, x_i\rangle + b^*) - 1 + \xi_i^*\big) = 0 \qquad (1.19)$$
$$\beta_i^* \xi_i^* = 0 \qquad (1.20)$$
$$y_i(\langle w^*, x_i\rangle + b^*) - 1 + \xi_i^* \geq 0 \qquad (1.21)$$
$$\xi_i^* \geq 0 \qquad (1.22)$$
$$\alpha_i^* \geq 0 \qquad (1.23)$$
$$\beta_i^* \geq 0 \qquad (1.24)$$
The conditions that, at the optimal solution, the derivatives of $L$ with respect to $w$ and $b$ must vanish, $\frac{\partial L}{\partial w} = 0$ and $\frac{\partial L}{\partial b} = 0$, lead to
$$\sum_{i=1}^{n} \alpha_i y_i = 0 \qquad (1.25)$$
and
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i. \qquad (1.26)$$
The solution vector $w$ is thus an expansion in terms of a subset of the training examples, namely those training examples whose Lagrange multipliers $\alpha_i$ are non-zero. According to the Karush-Kuhn-Tucker (KKT) conditions for the optimal solution, we have
$$\alpha_i = 0 \;\Rightarrow\; y_i(\langle w, x_i\rangle + b) \geq 1 \;\text{ and }\; \xi_i = 0$$
$$0 < \alpha_i < C \;\Rightarrow\; y_i(\langle w, x_i\rangle + b) = 1 \;\text{ and }\; \xi_i = 0$$
$$\alpha_i = C \;\Rightarrow\; y_i(\langle w, x_i\rangle + b) \leq 1 \;\text{ and }\; \xi_i \geq 0$$
Therefore, only the $\alpha_i$ that correspond to training examples $x_i$ lying either on the margin or inside the margin area are non-zero. All the remaining training examples are irrelevant and their corresponding $\alpha_i$ are zero. By substituting (1.26) into the Lagrangian $L$, one eliminates the primal variables $w$, $b$, and $\xi_i$, for all $i = 1, \ldots, n$, and arrives at the following Wolfe dual form of the primal optimization problem
$$\min_{\alpha_i,\, i=1,\ldots,n} \; \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle - \sum_{i=1}^{n} \alpha_i \qquad (1.27)$$
subject to
$$0 \leq \alpha_i \leq C \quad \forall i = 1, \ldots, n \qquad (1.28)$$
$$\sum_{i=1}^{n} \alpha_i y_i = 0. \qquad (1.29)$$
This dual problem, like the primal problem, is a constrained quadratic programming problem. However, there are at least two reasons why the dual problem is preferred. The first is that the constraints (1.28) and (1.29) are simpler than the constraints (1.10) and (1.11) and are much easier to handle. The second is that both the dual problem and the decision function involve only inner products between input vectors. This property is crucial because it allows us to apply the so-called kernel trick to generalize the large margin classifier to deal with classification problems with highly complex decision boundaries, as will become clear later. Solving the dual problem, one obtains the multipliers $\alpha_i$, $i = 1, \ldots, n$, which give $w$ as an expansion
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i. \qquad (1.30)$$
Knowing $w$, the bias term $b$ can subsequently be determined using the KKT conditions
$$\alpha_i\big(y_i(\langle w, x_i\rangle + b) - 1 + \xi_i\big) = 0 \quad \forall i = 1, \ldots, n. \qquad (1.31)$$
Therefore, the bias term $b$ can be determined as
$$b = y_i - \langle w, x_i\rangle \qquad (1.32)$$
from any training example with $0 < \alpha_i < C$. This leads to the following linear decision function
$$f(x) = \mathrm{sgn}\Big(\sum_{i=1}^{n} \alpha_i y_i \langle x, x_i\rangle + b\Big). \qquad (1.33)$$
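As a concrete illustration of the expansion (1.30) and the decision function (1.33), the following sketch (assuming NumPy and scikit-learn are available; the toy data and variable names are illustrative, not from the paper) trains a linear SVM and reconstructs the decision values from the dual coefficients.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative two-class toy data with labels relabeled to {-1, +1}
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
y = 2 * y - 1

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only, so the
# expansion w = sum_i alpha_i y_i x_i (Eq. 1.30) reduces to a sum over them.
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
b = clf.intercept_[0]

# f(x) = sgn(<w, x> + b) (Eq. 1.33); the raw scores match the library's output.
manual_scores = X @ w + b
print(np.allclose(manual_scores, clf.decision_function(X)))  # expected: True
print("support vectors:", len(clf.support_), "of", len(X), "training examples")
```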
Feature Spaces and Kernels

In practice, linear decision boundaries are generally not rich enough for pattern separation in real-world problems. Historically, this is also the difficulty encountered by the perceptron algorithm. To overcome this problem, more flexible decision boundaries are needed. One solution is to first map the data into some high-dimensional space $F$ (usually called the feature space) via a nonlinear mapping
$$\Phi : \mathbb{R}^d \to F \qquad (1.34)$$
and then compute the separating hyperplane in the high-dimensional feature space $F$. If the mapping $\Phi$ is properly chosen, linear decision boundaries in the feature space may represent highly complex decision boundaries in the input space. For instance, suppose we are given training patterns $x \in \mathbb{R}^d$ where most information is contained in the $M$-th order monomials of the entries $x_j$ of $x$, i.e., $x_{j_1} x_{j_2} \cdots x_{j_M}$, where $j_1, \ldots, j_M \in \{1, \ldots, d\}$. In this case, we can first map a pattern $x$ to a feature vector containing all $M$-th order monomials and then work with the feature vectors in the feature space. The problem with this approach, however, is that it quickly becomes computationally infeasible for large real-world problems. For $d$-dimensional input patterns, the number of $M$-th order monomials is $\binom{M+d-1}{M}$. Taking images of $16 \times 16$ pixels as input patterns and 5-th order monomials as the mapping $\Phi$, one would map each input pattern to a feature vector of dimension $\binom{5+256-1}{5} \approx 10^{10}$. Fortunately, for certain mappings and corresponding feature spaces $F$, there exists a highly effective trick for computing inner products in the feature spaces by using kernel functions. Let us consider the monomial example again:
$$\Phi : (x_1, x_2) \in \mathbb{R}^2 \to (z_1, z_2, z_3) := (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in \mathbb{R}^3 \qquad (1.35)$$
The inner product between two feature vectors $\Phi(x)$ and $\Phi(x')$ of patterns $x$ and $x'$ can be expressed as
$$\langle\Phi(x), \Phi(x')\rangle = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2) \cdot (x_1'^2, \sqrt{2}\,x_1' x_2', x_2'^2) = \big((x_1, x_2)\cdot(x_1', x_2')\big)^2 = \langle x, x'\rangle^2. \qquad (1.36)$$
This result can be generalized. For instance, for $x, x' \in \mathbb{R}^d$ and $M \in \mathbb{N}$, the kernel function
$$k(x, x') = \langle x, x'\rangle^M \qquad (1.37)$$
computes an inner product in the space spanned by all products of exactly $M$ dimensions of $\mathbb{R}^d$. Similarly, the kernel function
$$k(x, x') = (\langle x, x'\rangle + c)^d \qquad (1.38)$$
with $c > 0$ can also be shown to compute an inner product in the space spanned by all monomials of order up to $d$. More generally, the following theorem from functional analysis shows that kernel functions $k$ of positive integral operators give rise to mappings $\Phi$ and corresponding $\ell_2$ spaces such that $\langle\Phi(x), \Phi(x')\rangle = k(x, x')$.
Theorem 1.1. If $k$ is a continuous symmetric kernel of a positive integral operator $T$, i.e.,
$$(Tf)(x') = \int_X k(x, x') f(x)\, dx \qquad (1.39)$$
with
$$\int_{X \times X} k(x, x') f(x) f(x')\, dx\, dx' \geq 0 \qquad (1.40)$$
for all $f \in L_2(X)$, then it can be expanded in a uniformly convergent series (on $X \times X$) in terms of $T$'s eigenfunctions $\phi_j$ and positive eigenvalues $\lambda_j$,
$$k(x, x') = \sum_{j=1}^{N_F} \lambda_j \phi_j(x) \phi_j(x'), \qquad (1.41)$$
where $N_F \leq \infty$ is the number of positive eigenvalues.

From Mercer's theorem, it follows that $k(x, x')$ corresponds to an inner product in $\ell_2^{N_F}$, i.e., $k(x, x') = \langle\Phi(x), \Phi(x')\rangle$ with
$$\Phi : x \mapsto \big(\sqrt{\lambda_j}\,\phi_j(x)\big)_{j=1,\ldots,N_F}, \qquad (1.42)$$
for almost all $x \in X$. In fact, the uniform convergence of the series implies that, given $\epsilon > 0$, there exists an $n \in \mathbb{N}$ such that even if the range of $\Phi$ is infinite-dimensional, $k$ can be approximated with accuracy $\epsilon$ as an inner product in $\mathbb{R}^n$, between images of
$$\Phi_n : x \mapsto \big(\sqrt{\lambda_1}\,\phi_1(x), \ldots, \sqrt{\lambda_n}\,\phi_n(x)\big). \qquad (1.43)$$
Rather than thinking of the feature space as an $\ell_2$ space, we can alternatively represent it as a Hilbert space $H_k$ containing all linear combinations of the functions $f(\cdot) = k(x_i, \cdot)$ ($x_i \in X$). To ensure that the map $\Phi : X \to H_k$, which in this case is defined as
$$\Phi(x) = k(x, \cdot), \qquad (1.44)$$
satisfies
$$k(x, x') = \langle\Phi(x), \Phi(x')\rangle, \qquad (1.45)$$
we need to endow $H_k$ with a suitable inner product $\langle\cdot,\cdot\rangle$ such that
$$\langle k(x, \cdot), k(x', \cdot)\rangle = k(x, x'), \qquad (1.46)$$
which amounts to requiring that $k$ is a reproducing kernel for $H_k$. This can be achieved by defining the following inner product. Let $f(\cdot) = \sum_{i=1}^{m}\alpha_i k(\cdot, x_i)$ and $g(\cdot) = \sum_{j=1}^{m'}\beta_j k(\cdot, x_j')$, and define
$$\langle f, g\rangle = \sum_{i=1}^{m}\sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x_j'). \qquad (1.47)$$
First we check that this satisfies the reproducing property of the kernel. Note that for any $f(\cdot) = \sum_{i=1}^{m}\alpha_i k(\cdot, x_i)$, we have
$$\langle k(\cdot, x), f\rangle = \sum_{i=1}^{m}\alpha_i k(x_i, x) = f(x). \qquad (1.48)$$
Plugging the kernel $k(\cdot, x')$ in for $f$, we have
$$\langle k(\cdot, x), k(\cdot, x')\rangle = k(x, x'), \qquad (1.49)$$
which verifies the reproducing property. It is also easy to verify that this definition indeed gives an inner product: $\langle f, g\rangle = \langle g, f\rangle$ because $k$ is symmetric; linearity follows directly from the definition; and $\langle f, f\rangle \geq 0$ by the positivity of the Gram matrix. Moreover, since $k$ is positive definite, one can show that
$$f^2(x) = \langle k(\cdot, x), f\rangle^2 \leq k(x, x)\,\langle f, f\rangle, \qquad (1.50)$$
which implies that if $\langle f, f\rangle = 0$, then $f = 0$. Therefore, the definition above really is an inner product. For a Mercer kernel, such an inner product does exist. Since $k$ is symmetric, the $\phi_i$ ($i = 1, \ldots, N_F$) can be chosen to be orthogonal with respect to the inner product in $L_2(X)$, i.e., we can construct the inner product $\langle\cdot,\cdot\rangle$ such that
$$\langle\phi_j, \phi_n\rangle = \frac{\delta_{jn}}{\lambda_n}. \qquad (1.51)$$
By Mercer's theorem, it is easy to check that the linear space $H_k$ endowed with such an inner product is a Hilbert space. The importance of Mercer's theorem becomes apparent by noticing that both the construction of the optimal hyperplane in $F$ and the evaluation of the corresponding decision function (1.33) only require the evaluation of inner products $\langle\Phi(x), \Phi(x')\rangle$, and never require the images $\Phi(x)$ in explicit form. Therefore, by simply replacing the inner products $\langle x_i, x_j\rangle$ in the dual problem with a kernel function $k(x_i, x_j)$, support vector machines implicitly map the training vectors $x_i$ to feature vectors $\Phi(x_i)$ in some feature space $F$ such that
$$\langle\Phi(x_i), \Phi(x_j)\rangle = k(x_i, x_j). \qquad (1.52)$$
Consequently, we get a linear decision function of the following form in the feature space $F$:
$$f(x) = \mathrm{sgn}\Big(\sum_{i=1}^{n}\alpha_i y_i \langle\Phi(x), \Phi(x_i)\rangle + b\Big) \qquad (1.53)$$
$$\phantom{f(x)} = \mathrm{sgn}\Big(\sum_{i=1}^{n}\alpha_i y_i\, k(x, x_i) + b\Big), \qquad (1.54)$$
which might be a nonlinear decision function in the original input space, depending on the choice of the kernel $k$. The power of the kernel method lies in the fact that the inner products in the feature space $F$ are computed implicitly, without explicitly carrying out or even knowing the mapping $\Phi$. Kernels that have proven effective for pattern classification include the Gaussian kernel
$$k(x_i, x_j) = \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\Big) \qquad (1.55)$$
and the polynomial kernel
$$k(x_i, x_j) = (1 + \langle x_i, x_j\rangle)^p. \qquad (1.56)$$
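To make the kernel computations concrete, the short sketch below (NumPy only; the function names are ours, not from the paper) numerically checks the identity (1.36) for the explicit monomial map (1.35) and evaluates the Gaussian kernel (1.55) directly, without any explicit feature map.

```python
import numpy as np

def phi(x):
    """Explicit monomial feature map of Eq. (1.35) for 2-D inputs."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel of Eq. (1.55)."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma**2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

# Kernel trick check: <Phi(x), Phi(z)> equals <x, z>^2 (Eq. 1.36 / Eq. 1.37 with M = 2)
print(np.dot(phi(x), phi(z)), np.dot(x, z) ** 2)   # both evaluate to 2.25

# The Gaussian kernel is evaluated in the input space only.
print(gaussian_kernel(x, z, sigma=1.0))
```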
Training Support Vector Machines

As we have shown, to train an SVM classifier one needs to solve the dual quadratic programming problem (1.27) under the constraints (1.28) and (1.29). To achieve nonlinear separation, one only needs to replace the inner products $\langle x_i, x_j\rangle$ in (1.27) with a suitable kernel function $k(x_i, x_j)$, such as the Gaussian kernel (1.55). For a small training set, standard QP solvers, such as CPLEX, LOQO, MINOS and the Matlab QP routines, can readily be used to obtain the solution. However, for a large training set, standard QP solvers become infeasible due to the large memory and enormous training time required. To alleviate the problem, a number of solutions have been proposed that exploit the sparsity of the SVM solution and the KKT optimality conditions.

The first solution, known as chunking [14], uses the fact that only the support vectors are relevant for the final form of the decision function. At each step, chunking solves a QP problem that consists of all non-zero Lagrange multipliers $\alpha_i$ from the last step and some of the $\alpha_i$ that violate the KKT conditions. The size of the QP problem varies but finally equals the number of non-zero Lagrange multipliers. At the last step, the entire set of non-zero Lagrange multipliers is identified and the QP problem is solved. This technique is suitable for fairly large problems; however, it is still limited by the maximal number of support vectors that one can handle, and it still requires a quadratic optimizer to solve the sequence of smaller QP problems.

Another method, known as decomposition, was proposed in [15, 16]. It solves a large QP problem by breaking it down into a series of smaller QP sub-problems. This method is based on the observation that solving a sequence of QP sub-problems that always contain at least one training example that violates the KKT conditions will eventually lead to the optimal solution. The algorithm in [15] suggests keeping the size of the QP sub-problems fixed and adding and removing one example in each iteration. In practice, researchers add and remove multiple examples using various heuristics. This method allows training on arbitrarily large datasets and usually achieves fast convergence even on large datasets. However, a quadratic optimizer is still required to implement it.

Recently, Platt proposed a method called Sequential Minimal Optimization (SMO). It implements an extreme form of the decomposition method by iteratively solving a QP sub-problem of size two [17]. The key idea is that a QP sub-problem of size two can be solved analytically without invoking a quadratic optimizer. The main issue is to choose a good pair of examples to jointly optimize in each iteration. Heuristics based
on the KKT conditions are usually used for this purpose. This method has been reported to be several orders of magnitude faster than the classical chunking algorithm and to exhibit better scaling properties.

Note that all the above methods make use of the whole training set. However, according to the KKT optimality conditions, the optimal separating hyperplane depends only on the support vectors, i.e., the training examples that lie either on the margin or inside the margin area. In many real-world applications, the number of support vectors is expected to be much smaller than the total number of training examples. Therefore, the speed of SVM training would be improved significantly if only the set of support vectors were used for training, and the resulting separating hyperplane would be exactly the same as if the whole training set had been used. Although in theory one has to solve the full QP problem with the whole training set in order to identify the support vectors, it is easy to observe that the support vectors are training examples that are close to the optimal decision boundary and are therefore more likely to be misclassified. Therefore, if there exists a computationally efficient way to find a reduced training set that, with high probability, contains the desired support vectors and whose size is small compared to that of the total training set, then the speed of SVM training can be improved without degrading the generalization performance. The size of the reduced training set can be larger than the set of desired support vectors; as long as it is much smaller than the total training set, SVM training will be significantly faster, since on many problems the training algorithm scales quadratically in the training set size [5]. In the next section, we present several strategies that select a subset of training examples for fast SVM training.
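The claim above, that training on the support vectors alone reproduces the full solution, can be checked empirically. The sketch below (scikit-learn; the synthetic dataset and parameter values are illustrative assumptions, not settings from the paper) solves the full problem once, retrains on only the identified support vectors, and compares the two decision functions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

params = dict(kernel="rbf", C=1.0, gamma=0.5)   # fixed kernel parameters
full = SVC(**params).fit(X, y)

# Retrain using only the support vectors of the full solution ...
X_sv, y_sv = X[full.support_], y[full.support_]
reduced = SVC(**params).fit(X_sv, y_sv)

# ... and compare the two decision functions on the training data.
diff = np.abs(full.decision_function(X) - reduced.decision_function(X)).max()
print(f"training set size: {len(X)} -> {len(X_sv)}, max |f_full - f_reduced| = {diff:.2e}")
```

Up to solver tolerance the two functions coincide, which is the property the selection strategies in the next section try to exploit without solving the full problem first.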
3. Training Data Selection for Support Vector Machines
Data Selection based on Confidence Measure

A good heuristic for determining whether a training example is near the class boundary is the number of training examples contained in the largest sphere that can be centered on it without covering examples of other classes. Centered on each training example $x_i$, let us therefore create a sphere that is as large as possible without covering training examples of other classes, and denote by $N(x_i)$ the number of training examples that fall inside the sphere. This can easily be achieved by setting the sphere's center on $x_i$ and its
radius $r_i$ as
$$r_i = \min_{j:\, y_j \neq y_i} \|x_i - x_j\| - \epsilon, \qquad (1.57)$$
where $\epsilon > 0$ is an arbitrarily small number. Subsequently, we have
$$N(x_i) = \sum_{j=1}^{n} \mathbf{1}_{\|x_i - x_j\| \leq r_i}. \qquad (1.58)$$
Fig. 1.1 shows two such spheres, centered on training examples 1 and 2, respectively.

Figure 1.1. Examples of the largest spheres centered on training examples 1 and 2 without enclosing training examples of the opposite class.

It can be shown that the number $N(x_i)$ is related to the
statistical confidence that can be associated with the class label $y_i$ of the training example $x_i$. Roughly speaking, the larger the number $N(x_i)$, the more likely it is that $x_i$ truly belongs to the class $y_i$ as labeled in the training data, i.e., the more confident we are about its class membership; see [20]. Intuitively, it is also easy to see that the larger the number $N(x_i)$, the more training examples of the same class as $x_i$ are scattered around $x_i$ before the sphere reaches the other class, and therefore the less likely it is that $x_i$ is close to the decision boundary and is a support vector. As illustrated in Fig. 1.1, training example 2 is much more likely to be a support vector than example 1. Because we want to select examples that are likely to be support vectors, this number can be used as a criterion to decide which training examples should belong to the reduced training set: for each training example $x_i$, we compute $N(x_i)$, sort the training data according to these values, and choose the subset of data with the smallest values of $N(x_i)$ as the reduced training set. We call this data selection scheme the confidence measure-based training set selection.
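A minimal sketch of this confidence measure-based selection (NumPy only, using straightforward O(n²) pairwise distances; the function name and the selected fraction are illustrative):

```python
import numpy as np

def confidence_select(X, y, fraction=0.3, eps=1e-12):
    """Return indices of the `fraction` of examples with the smallest N(x_i).

    N(x_i) counts the training examples inside the largest sphere centered
    at x_i that contains no example of the opposite class (Eqs. 1.57-1.58).
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    opposite = y[:, None] != y[None, :]
    r = np.where(opposite, D, np.inf).min(axis=1) - eps          # radii r_i (Eq. 1.57)
    N = (D <= r[:, None]).sum(axis=1)                            # counts N(x_i) (Eq. 1.58)
    k = max(1, int(fraction * len(X)))
    return np.argsort(N)[:k]                                     # smallest N(x_i) first
```

In the experiments reported below, the selection is in fact done per class, so that the class distribution of the reduced set matches that of the full set; that refinement is omitted here for brevity.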
Data Selection based on Hausdorff Distance

Our second data selection strategy is based on the Hausdorff distance. In the separable case, it has been shown that the optimal SVM separating hyperplane is identical to the hyperplane that bisects the line segment connecting the two closest points of the convex hulls of the positive and of the negative training examples [18, 19]. The problem of finding the two closest points in the convex hulls can be formulated as
$$\min_{z^+, z^-} \|z^+ - z^-\|^2 \qquad (1.59)$$
subject to
$$z^+ = \sum_{i:\, y_i = 1}\alpha_i x_i \quad\text{and}\quad z^- = \sum_{i:\, y_i = -1}\alpha_i x_i, \qquad (1.60)$$
where the $\alpha_i \geq 0$ satisfy the constraints
$$\sum_{i:\, y_i = 1}\alpha_i = 1 \quad\text{and}\quad \sum_{i:\, y_i = -1}\alpha_i = 1. \qquad (1.61)$$
Based on this geometrical interpretation, the support vectors are the training examples corresponding to the vertices of each convex hull that are closest to the convex hull of the training examples of the opposite class. For the non-separable case, a similar result holds with the convex hulls replaced by reduced convex hulls [18, 19]. Therefore, a good heuristic for determining whether a training example is likely to be a support vector is the shortest distance from the training example to the convex hull of the training examples of the opposite class, which can be computed by solving the following quadratic programming problem:
$$\min_{z} \|x_i - z\|^2 \qquad (1.62)$$
subject to
$$z = \sum_{j:\, y_j \neq y_i}\alpha_j x_j, \qquad (1.63)$$
where the $\alpha_j \geq 0$ satisfy the constraint
$$\sum_{j:\, y_j \neq y_i}\alpha_j = 1. \qquad (1.64)$$
To simplify the computation, the distance from a training example to the closest training example of the opposite class can be used as an approximation. We denote this minimal distance as
$$d(x_i) = \min_{j:\, y_j \neq y_i}\|x_i - x_j\|, \qquad (1.65)$$
which is essentially the radius $r_i$ defined in (1.57) and is also the Hausdorff distance between the training example $x_i$ and the set of training examples belonging to the other class. To select a subset of training examples, we sort the training set according to $d(x_i)$ and select the examples with the smallest Hausdorff distances $d(x_i)$ as the reduced training set. This method will be referred to as the Hausdorff distance-based selection method.
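Since $d(x_i)$ in (1.65) is the same nearest-opposite-class distance used for the radius above, a sketch of the Hausdorff distance-based selection only needs a sort on those distances (NumPy; the function name is illustrative):

```python
import numpy as np

def hausdorff_select(X, y, fraction=0.3):
    """Return indices of the `fraction` of examples closest to the other class (Eq. 1.65)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d = np.where(y[:, None] != y[None, :], D, np.inf).min(axis=1)   # d(x_i)
    k = max(1, int(fraction * len(X)))
    return np.argsort(d)[:k]   # examples nearest the opposite class first
```

The exact criterion would instead solve the small QP (1.62)-(1.64) for each example against the (reduced) convex hull of the opposite class; the nearest-neighbor distance is the simplification adopted here.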
Data Selection based on Random Sampling and Desired SVM Outputs

To study the effectiveness of the proposed data selection strategies, we compare them to two other strategies: random sampling and a data selection strategy based on the distance from the training examples to the desired separating hyperplane. The random sampling strategy simply selects training examples uniformly at random to form the reduced training set. It is straightforward to implement and requires no extra computation. However, since the training data are selected at random, there is no guarantee that the selected training examples will be close to the class boundaries. The other data selection strategy we compare our methods to is implemented as follows. Given the training set and the parameter setting, we solve the full QP problem to obtain the desired separating hyperplane. Then for each training example $x_i$, we compute its distance to the desired separating hyperplane as
$$f(x_i) = y_i\Big(\sum_{j=1}^{n}\alpha_j y_j k(x_i, x_j) + b\Big). \qquad (1.66)$$
Note that Eq. (1.66) takes the class information into account: training examples that are misclassified by the desired separating hyperplane have negative distances. According to the KKT optimality conditions, support vectors are training examples with relatively small values of the distance $f(x_i)$. We sort the training examples according to their distances to the separating hyperplane and select the subset of training examples with the smallest distances as the reduced training set. This strategy, although impractical because one needs to solve the full QP problem first, is ideal for comparison purposes, as the distance from a training example to the desired separating hyperplane provides the optimal criterion for selecting the support vectors according to the KKT conditions.
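The two reference strategies can be sketched in the same style (scikit-learn; labels are assumed to be ±1 and the function names are illustrative). The SVM output-based criterion requires one full solve to obtain the signed distances (1.66), whereas random sampling requires no computation at all.

```python
import numpy as np
from sklearn.svm import SVC

def svm_output_select(X, y, fraction=0.3, **svm_params):
    """Rank examples by y_i * f(x_i) from a full SVM solve (Eq. 1.66); labels in {-1, +1}."""
    clf = SVC(**svm_params).fit(X, y)            # the full QP solve (for comparison only)
    margins = y * clf.decision_function(X)       # negative for misclassified examples
    k = max(1, int(fraction * len(X)))
    return np.argsort(margins)[:k]               # smallest signed distances first

def random_select(X, y, fraction=0.3, seed=0):
    """Uniform random sampling baseline."""
    k = max(1, int(fraction * len(X)))
    return np.random.default_rng(seed).choice(len(X), size=k, replace=False)
```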
4. Results and Discussion
In this section we report experimental results on several real-world datasets from the UCI Machine Learning repository.¹ The SVM training algorithm was implemented based on the SMO method. For all datasets, Gaussian kernels were used and the generalization error of the SVMs was estimated using 5-fold cross-validation. For each training set, according to the data selection method used, a portion of the training set (ranging from 10 to 100 percent) was selected as the reduced training set and used to train the SVM classifier. The error rate reported is the average error rate of the resulting SVM classifiers on the test sets over the 5 iterations. Note that when the data selection method is based on the desired SVM outputs, the SVM training procedure has to be run twice in each iteration: the first time, an SVM classifier is trained on the full training set to obtain the desired separating hyperplane, and a portion of the training examples is then selected to form the reduced training set based on their distances to the desired separating hyperplane (see Eq. (1.66)); the second time, an SVM classifier is trained on the reduced training set.

Given a training set and a particular data selection criterion, there are two ways to form the reduced training set. One can either select training examples regardless of which classes they belong to, or select training examples from each class separately while maintaining the class distribution. We found in our experiments that selecting training examples from each class separately often improves the classification accuracy of the resulting SVM classifiers. Therefore, we only report results for this case.

Fig. 1.2 shows the error rates of SVMs on the Wisconsin Breast Cancer dataset when trained with reduced training sets of various sizes selected by the four different data selection methods. This dataset consists of 683 examples from two classes (excluding the 16 examples with missing attribute values). Each example has 8 attributes. The size of the training set in each iteration is 547 and the size of the test set is 136. The average number of support vectors is 238.6, which is 43.62% of the training set size. From Fig. 1.2 one can easily see that a significant amount of data can be removed from the training set without degrading the performance of the resulting SVM classifier. When more than 20% of the training data is selected, the confidence-based data selection method outperforms the other two methods. Its performance is actually as good as that of the method based on the desired SVM outputs.
Figure 1.2. Error rates of SVMs on the Breast Cancer dataset when trained with reduced training sets of various sizes.

The method based on
the Hausdorff distance gives results comparable to those of random sampling, and these two have the worst overall results. However, when the data reduction rate is high, e.g., when less than 20 percent of the training data is selected, the results obtained by the Hausdorff distance-based method and by random sampling are much better than those based on the confidence measure and the desired SVM outputs.

Fig. 1.3 shows the corresponding results obtained on the BUPA Liver dataset, which consists of 345 examples, each with 6 attributes. The sizes of the training and test sets in each iteration are 276 and 69, respectively. The average number of support vectors is 222.2, which is 80.51% of the training set size. Interestingly, the method based on the desired SVM outputs has the worst overall results. When less than 80% of the data is selected for training, the Hausdorff distance-based method and random sampling have similar performance and outperform the methods based on the confidence measure and the desired SVM outputs.

Fig. 1.4 provides the results on the Ionosphere dataset, which has a total of 351 examples, each with 34 attributes. The sizes of the training and test sets in each iteration are 281 and 70, respectively. The average number of support vectors is 159.8, which is 56.87% of the training set size. From Fig. 1.4 we see that the data selection method based on the desired SVM outputs gives the best results when more than 20% of the data is selected. When more than 50% of the data is selected, the results of the confidence-based method are very close to the best achievable results. However, when the reduction rate is high, random sampling performs best. The Hausdorff distance-based method has the worst overall results.
Figure 1.3. Error rates of SVMs on the BUPA Liver dataset when trained with reduced training sets of various sizes.
Figure 1.4. Error rates of SVMs on the Ionosphere dataset when trained with reduced training sets of various sizes.
Fig. 1.5 shows the corresponding results on the Pima Indians dataset. This dataset consists of 768 examples, each with 8 attributes. The sizes of the training and test sets in each iteration are 615 and 153, respectively. The average number of support vectors is 477.6, which is 77.66% of the training set size. As we can see, the method based on the desired SVM outputs has the worst results and random sampling gives the best results. The methods based on the confidence measure and the Hausdorff distance are comparable.

Fig. 1.6 shows the experimental results on the Sonar dataset. This dataset consists of 208 examples, each with 60 attributes. The sizes of the training and test sets in each iteration are 167 and 41, respectively. The average number of support vectors is 94.8, which is 56.77% of the training set size. As we can see, the method based on the desired SVM outputs has the best performance.
Figure 1.5. Error rates of SVMs on the Pima Indians dataset when trained with reduced training sets of various sizes.

When
more than 50% of the data is selected for training, the performance of the confidence-based method is close to optimal, followed by random sampling and the Hausdorff distance-based method. At high reduction rates, however, random sampling and the method based on the Hausdorff distance perform better than the method based on the confidence measure.
Figure 1.6. Error rates of SVMs on the Sonar dataset when trained with reduced training sets of various sizes.
An interesting finding of the experiments is that the performance of the SVM classifiers deteriorates significantly when the reduction rate is high, e.g., when the size of the reduced training set is much smaller than the number of desired support vectors. This is especially true for the data selection strategies based on the desired SVM outputs and on the proposed heuristics. The effect is less pronounced for random sampling, which, as we have seen, usually has better relative performance at higher data reduction rates. From a theoretical point of view, this is not surprising: when only a subset of the support vectors is chosen as the reduced training set, there is no guarantee that the solution of the reduced QP problem will remain the same. In fact, if the reduction rate is high and the criterion is based on the desired SVM outputs or on the proposed heuristics, the reduced training set is likely to be dominated by ’outliers’, leading to worse classification performance.

To overcome this problem, we can remove the training examples that lie far inside the margin area, since they are likely to be ’outliers’. For the data selection strategy based on the desired SVM outputs, this means discarding the part of the training data with extremely small values of the distance to the desired separating hyperplane (see Eq. (1.66)). For the methods based on the confidence measure and the Hausdorff distance, we can similarly discard the part of the training data with extremely small values of $N(x_i)$ and of the Hausdorff distance, respectively.

In Fig. 1.7 we show the results of this modification on the Breast Cancer dataset. Comparing Figs. 1.2 and 1.7, it is easy to see that, when only a very small subset of the training data (compared to the number of desired support vectors) is selected for SVM training, removing training patterns that are extremely close to the decision boundary, according to the confidence measure or to the underlying SVM outputs, significantly improves the performance of the resulting SVM classifiers. The effect is less obvious for the method based on the Hausdorff measure. Similar results have also been observed on other datasets.
Figure 1.7. Effect of removing ’outliers’ while performing data reduction.
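The trimming step described above amounts to dropping the examples with the very smallest criterion values before taking the reduced set; a brief sketch follows (the trim fraction is an illustrative choice, not a value taken from the paper):

```python
import numpy as np

def select_with_trimming(scores, fraction, trim=0.02):
    """Drop the lowest `trim` fraction of scores (likely 'outliers'), then select the
    `fraction` of examples with the smallest remaining scores.

    `scores` may be N(x_i), the Hausdorff distance d(x_i), or the signed
    distance of Eq. (1.66), depending on the selection method used.
    """
    order = np.argsort(scores)
    n_trim = int(trim * len(scores))
    kept = order[n_trim:]                      # discard the most extreme examples
    k = max(1, int(fraction * len(scores)))
    return kept[:k]
```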
5. Conclusion
In this paper we presented two new data selection methods for SVM training. To analyze their effectiveness in terms of their ability to reduce the training data while maintaining the generalization performance of the resulting SVM classifiers, we conducted a comparative study using several real-world datasets. More specifically, we compared the results obtained by these two new methods with the results of the simple random sampling scheme and with the results obtained by the selection method based on the desired SVM outputs. Through our experiments, several important observations have been made:

(1) In many applications, significant data reduction can be achieved without degrading the performance of the SVM classifiers. For this purpose, the performance of the confidence measure-based selection method is often comparable to or better than that of the method based on the desired SVM outputs.

(2) When the reduction rate is high, some of the training examples that are ‘extremely’ close to the decision boundary have to be removed in order to maintain the generalization performance of the resulting SVM classifiers.

(3) In spite of its simplicity, random sampling performs consistently well, especially when the reduction rate is high. However, at low reduction rates, random sampling performs noticeably worse than the confidence measure-based method.

(4) When conducting training data selection, sampling training data from each class separately according to the class distribution often improves the performance of the resulting SVM classifiers.

By directly comparing various data selection schemes with the scheme based on the desired SVM outputs, we are able to conclude that the confidence measure provides a criterion for training data selection that is almost as good as the optimal criterion based on the desired SVM outputs. At high reduction rates, removing training data that are likely to be outliers boosts the performance of the resulting SVM classifiers. Random sampling performs consistently well in our experiments, which is consistent with the results obtained by Syed et al. in [21] and with the theoretical analysis of Huang and Lee in [13]. The robustness of random sampling at high reduction rates suggests that, although an SVM classifier is fully determined by its support vectors, the generalization performance of an SVM is less reliant on the choice of training data than it might appear.
Notes

1. http://www.ics.uci.edu/~mlearn/MLRepository.html
References

[1] Boser, B. E., Guyon, I. M., Vapnik, V. N.: A training algorithm for optimal margin classifiers. In: Haussler, D. (ed.): Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (1992) 144–152
[2] Cortes, C., Vapnik, V. N.: Support vector networks. Machine Learning 20 (1995) 273–297
[3] Vapnik, V. N.: Statistical Learning Theory. Wiley, New York, NY (1998)
[4] Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, U.K. (2000)
[5] Joachims, T.: Making large-scale SVM learning practical. In: Schölkopf, B., Burges, C. J. C., Smola, A. J. (eds.): Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA (1999) 169–184
[6] Shin, H. J., Cho, S. Z.: Fast pattern selection for support vector classifiers. In: Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Lecture Notes in Artificial Intelligence (LNAI 2637) (2003) 376–387
[7] Almeida, M. B., Braga, A. P., Braga, J. P.: SVM-KM: speeding SVMs learning with a priori cluster selection and k-means. In: Proceedings of the 6th Brazilian Symposium on Neural Networks (2000) 162–167
[8] Zheng, S. F., Lu, X. F., Zheng, N. N., Xu, W. P.: Unsupervised clustering based reduced support vector machines. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2 (2003) 821–824
[9] Koggalage, R., Halgamuge, S.: Reducing the number of training samples for fast support vector machine classification. Neural Information Processing - Letters and Reviews 2(3) (2004) 57–65
[10] Zhang, W., King, I.: Locating support vectors via β-skeleton technique. In: Proceedings of the International Conference on Neural Information Processing (ICONIP) (2002) 1423–1427
[11] Abe, S., Inoue, T.: Fast training of support vector machines by extracting boundary data. In: Proceedings of the International Conference on Artificial Neural Networks (ICANN) (2001) 308–313
[12] Lee, Y. J., Mangasarian, O. L.: RSVM: Reduced support vector machines. In: Proceedings of the First SIAM International Conference on Data Mining (2001)
[13] Huang, S. Y., Lee, Y. J.: Reduced support vector machines: a statistical theory. Technical report, Institute of Statistical Science, Academia Sinica, Taiwan. http://www.stat.sinica.edu.tw/syhuang/ (2004)
[14] Vapnik, V. N.: Estimation of Dependence Based on Empirical Data. Springer-Verlag, Berlin (1982)
[15] Osuna, E., Freund, R., Girosi, F.: Support vector machines: training and applications. A.I. Memo AIM-1602, MIT A.I. Lab (1996)
[16] Osuna, E., Freund, R., Girosi, F.: Training support vector machines: an application to face recognition. In: Proceedings of Computer Vision and Pattern Recognition (1997)
[17] Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., Burges, C. J. C., Smola, A. J. (eds.): Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA (1999) 185–208
[18] Bennett, K. P., Bredensteiner, E. J.: Duality and geometry in SVM classifiers. In: Proceedings of the 17th International Conference on Machine Learning (2000) 57–64
[19] Crisp, D. J., Burges, C. J. C.: A geometric interpretation of ν-SVM classifiers. Advances in Neural Information Processing Systems 12 (1999)
[20] Wang, J., Neskovic, P., Cooper, L. N.: Neighborhood selection in the k-nearest neighbor rule using statistical confidence. Pattern Recognition 39(3) (2006) 417–423
[21] Syed, N. A., Liu, H., Sung, K. K.: A study of support vectors on model independent example selection. In: Proceedings of the Workshop on Support Vector Machines at the International Joint Conference on Artificial Intelligence (1999)