
Proceedings of International Joint Conference on Neural Networks, Atlanta, Georgia, USA, June 14-19, 2009

A Fast SVM Training Method for Very Large Datasets

Boyang LI, Qiangwei WANG and Jinglu HU

Boyang LI, Qiangwei WANG and Jinglu HU are with the Graduate School of Information, Production and Systems, Waseda University, Hibikino 2-7, Wakamatsu-ku, Kitakyushu-shi, Fukuoka-ken, Japan (phone/fax: (+81) 93-692-5271; email: [email protected]@fuji.waseda.jp, jinglu@waseda.jp).

Abstract - In a standard support vector machine (SVM), the training process has O(n^3) time and O(n^2) space complexity, where n is the size of the training dataset. Thus it is computationally infeasible for very large datasets. Reducing the size of the training dataset is therefore a natural way to address this problem. An SVM classifier depends only on the support vectors (SVs), which lie close to the separation boundary; therefore, we need to preserve the samples that are likely to become SVs. In this paper, we propose a method based on the edge detection technique to detect these samples. To preserve the distribution properties of the entire dataset, we also use a clustering algorithm such as K-means to calculate the centroids of clusters. The samples selected by the edge detector and the centroids of the clusters are used to reconstruct the training dataset. The reconstructed training dataset, being much smaller, makes the training process much faster without degrading classification accuracy.

I. INTRODUCTION

SVM is a prominent application of kernel methods. Many kernel methods are formulated as quadratic programming (QP) problems. Denoting the number of training samples by n, the training time complexity of the QP is O(n^3) and its space complexity is at least O(n^2). Hence, a major problem is how to reduce the training time and space complexities on large datasets.

In order to reduce the time and space complexities, many improved approaches have been proposed. One of them is to obtain low-rank approximations of the kernel matrix, by using greedy approximation [1], sampling [2] or matrix decompositions [3]. However, the resulting rank of the kernel matrix may still be too high to be handled efficiently. Another approach is chunking or more sophisticated decomposition methods [4]. However, chunking needs to optimize over the entire set of non-zero Lagrange multipliers, and the resulting kernel matrix may still be too large to fit in memory. A third kind of approach avoids the QP problem altogether, such as the core vector machine algorithm [5], scale-up methods [6], and the Lagrangian SVM (LSVM) [7]. However, for nonlinear kernels, it still requires a large matrix.

Another kind of approach scales down the training data before the SVM training process. This more direct and radical kind of approach is also the focus of this paper. Pavlov et al. [8] and Collobert et al. [9] used boosting and a neural-network-based "gater" to combine small SVMs. Lee and Mangasarian [10] proposed the reduced SVM (RSVM), which uses a random subset of the kernel matrix. Instead of random sampling, one can also use active learning [11], squashing [12], editing [13] or even clustering [14].

The basic problem with this kind of approach is how to detect the non-relevant samples in the training dataset. Most of the methods mentioned above can reduce the size of the training dataset, but many non-relevant samples are still used as training samples. Hence, a more efficient way of preserving the relevant samples needs to be developed.

In this paper, we introduce an edge detection technique to scale down the training dataset. In digital image processing, edge detection is a technique that reduces the amount of data and filters out useless information while preserving the important structural properties of an image. The same requirement arises when scaling down the training data. Therefore, the edge detection technique is introduced into this fast SVM training algorithm to preserve the local properties around the separation boundary. In addition, a clustering technique is applied to preserve the distribution properties of the entire dataset. The precision of the clustering is not critical, so K-means clustering is sufficient for this paper. The reconstructed training dataset consists of the samples detected by the edge detector and the centroids of the clusters. Two parameters are used to adjust the precision of the edge detection and the number of clusters. Since the proposed method focuses on the samples around the edge between classes, SVs are preserved to the greatest extent possible.

The remainder of this paper is organized as follows: the next section provides an introduction to the SVM classifier. Section 3 then introduces the proposed training data reduction approach based on edge detection. Simulation results and discussions are presented in Section 4. Finally, concluding remarks are given in the last section.

II. SUPPORT VECTOR MACHINE

SVM has demonstrated prominent capability in many practical applications, especially classification problems. In the elementary design of an SVM classifier, bounding planes of each class are considered and the distance between the bounding planes is defined as the margin. The purpose of SVM is to find an optimal separation boundary by maximizing the margin between classes. Since real-world classification problems are commonly non-separable and non-linear, a nonlinear mapping function should be introduced and violations should be accepted. Input vectors are first mapped into a high-dimensional feature space, in which a separating hyperplane is found by solving a QP problem in its dual form. Thus the space complexity is at least quadratic. A binary classifier is the simplest model and the basis of any more complex classifier, so we consider only this case here. Suppose that we have a training dataset {x_i, y_i}, i = 1, 2, ..., N, where x_i is the i-th input vector and y_i ∈ {+1, -1} is its class label.



The training dataset can be divided into two classes, A and B, which correspond to the labels +1 and -1, respectively. The distance between the bounding planes of these two classes is defined as the margin. It is well known that maximizing this margin generally improves the generalization ability of the classifier [15]. In the case where the training data are non-separable, we attempt to minimize the separation error and maximize the margin simultaneously. The SVM classifier is obtained by solving an optimization problem whose objective function balances a term that forces separation of the classes and a term that maximizes the margin of separation [15]. The only samples relevant to determining the optimal separation boundary found in the training process are called support vectors (SVs). The number of SVs is much smaller than the size of the training dataset and is proportional to a bound on the generalization error of the classifier. Following Vapnik's method [16], the problem is formulated as a QP problem [17] in its dual space by introducing the vector of Lagrange multipliers α = (α_1, ..., α_N):

    max_α Q(α) = -1/2 Σ_{i,j=1}^{N} y_i y_j K(x_i, x_j) α_i α_j + Σ_{j=1}^{N} α_j        (1)



subject to

    Σ_{i=1}^{N} α_i y_i = 0,   0 ≤ α_i ≤ C,   i = 1, ..., N,

where C is the penalty parameter and K(x_i, x_j) = φ(x_i)^T φ(x_j) is the kernel function [15]. In the experiments we use the Radial Basis Function (RBF) kernel, a common choice for non-linear modeling:

    K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))        (2)

where σ^2 is the kernel width. The RBF kernel performs well in many practical applications. The decision function is written as a sign function:

    y(x) = sign[ Σ_{i=1}^{N} α_i y_i K(x, x_i) + b ]        (3)

where y(x) is the predicted class label of the input vector x.
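To make equations (1)-(3) concrete, the following is a minimal Python/NumPy sketch; the paper's own experiments use the MATLAB SVM-KM toolbox, so this is only an illustration, not the authors' implementation. It solves the dual problem (1) for a small dataset with a general-purpose SciPy solver and evaluates the decision function (3) with the RBF kernel (2). The choices of C, σ, and the bias-recovery rule are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X1, X2, sigma=1.0):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)), equation (2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_svm_dual(X, y, C=1.0, sigma=1.0):
    # Dual problem (1): maximize -1/2 sum_ij y_i y_j K(x_i,x_j) a_i a_j + sum_j a_j
    # subject to sum_i a_i y_i = 0 and 0 <= a_i <= C, solved here by
    # minimizing the negated objective with a generic SLSQP solver.
    N = len(y)
    K = rbf_kernel(X, X, sigma)
    Q = (y[:, None] * y[None, :]) * K
    neg_obj = lambda a: 0.5 * a @ Q @ a - a.sum()
    neg_grad = lambda a: Q @ a - np.ones(N)
    cons = {"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y.astype(float)}
    res = minimize(neg_obj, np.zeros(N), jac=neg_grad, method="SLSQP",
                   bounds=[(0.0, C)] * N, constraints=[cons])
    alpha = res.x
    # Bias b from a free support vector (0 < a_i < C); this standard recovery
    # rule is an assumption, it is not stated explicitly in the paper.
    free = (alpha > 1e-6) & (alpha < C - 1e-6)
    i = np.argmax(free)
    b = y[i] - (alpha * y) @ K[:, i]
    return alpha, b

def predict(X_train, y_train, alpha, b, X_test, sigma=1.0):
    # Decision function (3): y(x) = sign(sum_i a_i y_i K(x, x_i) + b)
    K = rbf_kernel(X_test, X_train, sigma)
    return np.sign(K @ (alpha * y_train) + b)
```

In practice a dedicated SVM solver would be used instead of SLSQP; the sketch only mirrors the structure of equations (1)-(3).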

III. TRAINING DATA REDUCTION BASED ON EDGE DETECTION

A. Issues to be considered

Since SVM is formulated as a QP problem, the training time and space complexities are O(n^3) and O(n^2) respectively, where n is the number of training samples. For example, with n = 10^5 samples, a dense double-precision kernel matrix alone requires about 8 x 10^10 bytes (80 GB) of memory. Hence, there can be clear advantages, in terms of speed and space requirements, in reducing the size of the training data before selecting SVs. However, the training data reduction should not affect the classification results. In SVM, the separation boundary is decided based on the SVs. In order to maintain the same accuracy, it is therefore important to keep the samples that could possibly become SVs [14].

Suppose that the whole dataset can be expressed as an image and each class has a certain color; then the decision boundaries can be regarded as edges in that image. According to SVM theory, samples close to the separation boundaries have a higher likelihood of being SVs. Therefore, samples close to the decision boundaries can be detected as points near the edge.

Fig. 1. Edge detection in a classification problem.

B. Edge detection for scaling down the training data

Edge detection is a term from digital image processing and computer vision. Edges characterize boundaries and are therefore a problem of fundamental importance in image processing. Similarly, in classification problems, the samples close to the decision boundary are also very important. These samples have a higher likelihood of being SVs, so they need to be preserved in the training data reduction process.

In image processing, the edge detection technique aims at identifying points in a digital image at which the image brightness changes sharply or, more formally, has discontinuities. Edges are areas with strong intensity contrasts (a jump in intensity from one pixel to the next) [18]. Edge detection significantly reduces the amount of data while preserving the important properties. In classification problems, our purpose in detecting sharp changes between different classes is to capture the important samples around the boundary.

Edge detection in classification problems is simpler than in image processing. Discontinuities in image brightness are likely to correspond to discontinuities in depth, discontinuities in surface orientation, changes in material properties, and variations in scene illumination. In classification problems, however, we only consider the change of class label. A typical edge might, for instance, be the border between a block of red color and a block of yellow; similarly, a typical classification problem might be binary. In image processing, we need to scan the neighboring pixels of a pixel to detect sharp changes of brightness and color. In our edge detection model, this rule becomes finding the m nearest neighbor samples. As shown in Fig. 1, assume that m = 5; for a certain training data sample P, if one of its neighbor samples has a different class label, then P is selected as an edge sample to be reserved.
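As an illustration of this rule (not the authors' code), the following NumPy sketch checks a single sample P against its m nearest neighbors; the Euclidean distance metric and the tie handling of argsort are assumptions, since the paper does not specify them.

```python
import numpy as np

def is_edge_sample(X, y, p_index, m=5):
    # Rule from Section III-B: sample P is kept as an edge sample if at least
    # one of its m nearest neighbors carries a different class label.
    dists = np.linalg.norm(X - X[p_index], axis=1)   # Euclidean distance (assumed metric)
    dists[p_index] = np.inf                          # exclude P itself
    neighbors = np.argsort(dists)[:m]                # indices of the m nearest neighbors
    return np.any(y[neighbors] != y[p_index])
```

A vectorized version of this check, together with the K-means step, is sketched after the four-step procedure below.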


Fig. 2. Scaling down the training data: the original training data, the reformed training data, and the samples found by the edge detector.

In this case, the result of applying the edge detector to the training dataset is a set of samples that lie around the separation boundary between the classes. Thus, applying an edge detector to the training dataset may significantly reduce the amount of data to be processed and filter out information that may be regarded as less relevant, while preserving the important structural properties close to the separation boundary. Since the SVs are samples that lie around the boundary, the samples retained by the edge detector also retain the SVs and have no adverse effect on the classification result.

Sometimes, using only the samples retained by edge detection to train the SVM may lead the classifier to overfit; in other words, the classifier may not be suitable for the whole dataset. Hence, we also need to add some data describing the structural properties of the entire dataset. Therefore, we apply a clustering technique in each class, and the centroids of the clusters are also reserved. K-means clustering is an algorithm that groups objects, based on their attributes, into k groups, where k is a positive integer. The grouping is done by minimizing the sum of squared distances between the data and the corresponding cluster centroids [19].

The samples selected by the edge detector and the cluster centroids selected by K-means are all used to reconstruct the training data. This process consists of four steps (a sketch of the whole procedure is given after this list):

Step 1 - Decide on the value of m, the number of neighbor samples.
Step 2 - Use the edge detector to select the samples around the edge:
  • Find the m nearest neighbor samples for each sample in the training dataset.
  • Check the class labels of the neighbor samples. If one of them differs from the class label of the test sample, the test sample is reserved; otherwise it is removed.
Step 3 - Determine the value of k, the number of clusters, and run K-means to calculate the cluster centroids for the whole data.
Step 4 - Reconstruct the training dataset from the samples selected by the edge detector and the centroids calculated by the K-means clustering method.
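The following is a hedged Python sketch of Steps 1-4. The paper's experiments use the MATLAB SVM-KM toolbox, so scikit-learn's NearestNeighbors, KMeans and SVC are stand-ins rather than the authors' implementation; running K-means separately inside each class (so that every centroid inherits that class label) and using k centroids per class are assumptions, since the paper only states that k is the number of clusters.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def reduce_training_set(X, y, m=5, k=10, random_state=0):
    """Steps 1-4: keep edge samples plus per-class K-means centroids."""
    # Step 2: for every sample, inspect its m nearest neighbors (m + 1 queries
    # because the sample itself is returned as its own nearest neighbor).
    nn = NearestNeighbors(n_neighbors=m + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_labels = y[idx[:, 1:]]                  # drop the sample itself
    edge_mask = np.any(neighbor_labels != y[:, None], axis=1)
    X_edge, y_edge = X[edge_mask], y[edge_mask]

    # Step 3: K-means centroids, computed separately in each class so that
    # every centroid carries that class label.
    cent_X, cent_y = [], []
    for label in np.unique(y):
        Xc = X[y == label]
        km = KMeans(n_clusters=min(k, len(Xc)), n_init=10,
                    random_state=random_state).fit(Xc)
        cent_X.append(km.cluster_centers_)
        cent_y.append(np.full(km.n_clusters, label))

    # Step 4: reconstruct the reduced training set.
    X_new = np.vstack([X_edge] + cent_X)
    y_new = np.concatenate([y_edge] + cent_y)
    return X_new, y_new

# Usage sketch: train an RBF-kernel SVM on the reduced set.
# X, y = ...  (y in {+1, -1})
# X_red, y_red = reduce_training_set(X, y, m=5, k=10)
# clf = SVC(kernel="rbf").fit(X_red, y_red)
```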

This process can also be seen clearly in Fig. 2. Obviously, this method can scale down the size of the training data and make the SVM training process much faster.

IV. SIMULATIONS AND DISCUSSIONS

A. Checkerboard Data

We first experiment on the 4 x 4 checkerboard data, which is commonly used for evaluating large-scale SVM implementations. The original training data and testing data are created randomly. We use training sets with a maximum of 1,000 samples and 1,000 samples for testing. The SVM parameters are left at their default settings. Since our focus is on nonlinear kernels, we chose the RBF kernel. In this paper, our proposed approach is adapted from the SVM-KM toolbox (in MATLAB) [20]. A conventional SVM is also implemented for comparison with our proposed approach.

The training data and their classification results are shown graphically in Fig. 3. Figure 3(a) shows the original training data. Figure 3(b) shows the training data reconstructed using the proposed approach. Figure 3(c) shows the test data and the separation boundaries built from the original data in Fig. 3(a). Figure 3(d) shows the test data and the separation boundaries built from the reconstructed data in Fig. 3(b). Obviously, the size of the training data reconstructed by our proposed approach is much smaller than the size of the original training data. From Fig. 3(c) and Fig. 3(d), we can also see that the proposed approach essentially maintains the same separation boundaries.
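As a rough illustration of this experiment (not the paper's exact setup), the sketch below generates random 4 x 4 checkerboard data, reduces the training set with the reduce_training_set function sketched in Section III, and compares test accuracy against an SVM trained on the full data. The sampling range [0, 4) x [0, 4), the labeling of the unit cells, and the values of m and k are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def make_checkerboard(n, rng):
    # Points drawn uniformly over [0, 4) x [0, 4); the label alternates like a
    # 4 x 4 checkerboard depending on which unit cell the point falls into.
    X = rng.uniform(0.0, 4.0, size=(n, 2))
    cells = np.floor(X).astype(int)
    y = np.where((cells[:, 0] + cells[:, 1]) % 2 == 0, 1, -1)
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_checkerboard(1000, rng)
X_test, y_test = make_checkerboard(1000, rng)

# Baseline: conventional SVM on the full training set.
full_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

# Proposed: train on the reduced set (reduce_training_set from the earlier sketch).
X_red, y_red = reduce_training_set(X_train, y_train, m=5, k=10)
red_acc = SVC(kernel="rbf").fit(X_red, y_red).score(X_test, y_test)

print(f"full set: {len(X_train)} samples, accuracy {full_acc:.3f}")
print(f"reduced set: {len(X_red)} samples, accuracy {red_acc:.3f}")
```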


