A Fast Parallel SVM Algorithm for Massive Classification Tasks

Thanh-Nghi Do¹, Van-Hoa Nguyen², and François Poulet²

¹ CIT, CanTho University, VietNam ([email protected])
² IRISA, Rennes, France ({vhnguyen,francois.poulet}@irisa.fr)

Abstract. The new parallel incremental Support Vector Machine (SVM) algorithm aims at classifying very large datasets on graphics processing units (GPUs). SVMs and kernel-related methods have been shown to build accurate models, but the learning task usually requires solving a quadratic program, so that learning from large datasets demands both a large memory capacity and a long training time. We extend the recent finite Newton classifier to build a parallel incremental algorithm. The new algorithm uses graphics processors to gain high performance at low cost. Numerical test results on the UCI and Delve dataset repositories show that our parallel incremental algorithm using GPUs is about 45 times faster than a CPU implementation and often significantly more than 100 times faster than the state-of-the-art algorithms LibSVM, SVM-perf and CB-SVM.

Keywords: Support vector machines, incremental learning, parallel algorithm, graphics processing unit, massive data classification.

1 Introduction

Since SVM learning algorithms were first proposed by Vapnik [26], they have been shown to build accurate models with practical relevance for classification, regression and novelty detection. Successful applications of SVMs have been reported in fields as varied as facial recognition, text categorization and bioinformatics [14]. In particular, SVMs using the idea of kernel substitution have been shown to build good models, and they have become increasingly popular classification tools. In spite of these prominent properties, current SVMs cannot easily deal with very large datasets. A standard SVM algorithm requires solving a quadratic program (QP), so its computational cost is at least O(m²), where m is the number of training datapoints; the memory requirements of SVMs also frequently make them intractable. Unfortunately, real-world databases double in size roughly every 9 months [12], [16], so there is a need to scale up these learning algorithms to deal with massive datasets. Effective heuristic methods to improve SVM learning time divide the original quadratic program into a series of small problems [2], [21], [22]. Incremental learning methods [3], [7], [9], [10], [13], [23], [24] improve memory performance on massive datasets by updating the solution over a growing training set without needing to load the entire dataset into memory at once.


Parallel and distributed algorithms [9], [23] improve learning performance for large datasets by dividing the problem into components that execute on large numbers of networked personal computers (PCs). Active learning algorithms [8], [25] choose interesting subsets of datapoints (active sets) to construct models instead of using the whole dataset, but they still cannot easily deal with very large datasets.

In this paper, we describe how to build an incremental and parallel Newton SVM algorithm for classifying very large datasets on GPUs, for example an Nvidia GeForce 8800 GTX graphics card. Our work is based on the Newton SVM classifier proposed by Mangasarian [17]. He proposed to modify the margin maximization term and to add a least squares 2-norm error to the standard SVM, which yields an unconstrained optimization problem that is solved by a finite stepless Newton method. The Newton SVM formulation thus requires only the solution of linear equations instead of a QP, which makes the training time very short. We have extended Newton SVM in two ways:

1. We developed an incremental algorithm for classifying massive datasets (billions of datapoints) of dimensionality up to 10^3.
2. Using a GPU (a massively parallel computing architecture), we developed a parallel version of the incremental Newton SVM algorithm to gain high performance at low cost.

Performance in terms of learning time and accuracy is evaluated on the UCI repository [1] and Delve [6], including the Forest cover type, KDD cup 1999, Adult and Ringnorm datasets. The results show that our algorithm using a GPU is about 45 times faster than a CPU implementation. An example of the effectiveness of the new algorithm is its performance on the 1999 KDD cup dataset: it performed a binary classification of 5 million datapoints in a 41-dimensional input space within 18 seconds on the Nvidia GeForce 8800 GTX graphics card (compared with 552 seconds on a CPU, Intel Core 2, 2.6 GHz, 2 GB RAM). We also compared the performance of our algorithm with the highly efficient standard SVM algorithm LibSVM [4] and with two recent algorithms, SVM-perf [15] and CB-SVM [28].

The remainder of this paper is organized as follows. Section 2 introduces the Newton SVM classifier. Section 3 describes how to build the incremental learning algorithm on top of the Newton SVM algorithm for classifying large datasets on CPUs. Section 4 presents a parallel version of the incremental Newton SVM using GPUs. We present numerical test results in section 5 before the conclusion and future work.

Some notations are used throughout this paper. All vectors are column vectors unless transposed to row vectors by a T superscript. The inner (dot) product of two vectors x, y is denoted by x·y. The 2-norm of the vector x is denoted by ‖x‖. The m×n matrix A holds m datapoints in the n-dimensional real space R^n. e denotes the column vector of ones. w and b are the normal vector and the scalar defining the separating hyperplane. z is the slack variable and C is a positive constant. I denotes the identity matrix.

2 Newton Support Vector Machine

Let us consider a linear binary classification task, as depicted in Figure 1, with m datapoints x_i (i = 1, ..., m) in the n-dimensional input space R^n. The data are represented by the m×n matrix A, with corresponding labels y_i = ±1 collected in the m×m diagonal matrix D of ±1 (where D[i,i] = +1 if x_i is in class +1 and D[i,i] = −1 if x_i is in class −1).

Fig. 1. Linear separation of the datapoints into two classes

For this problem, the SVM algorithm tries to find the best separating plane (denoted by the normal vector w ∈ R^n and the scalar b ∈ R^1), i.e. the plane furthest from both class +1 and class −1. It can simply maximize the distance, or margin, between the supporting planes for each class (x·w − b = +1 for class +1, x·w − b = −1 for class −1). The margin between these supporting planes is 2/‖w‖ (where ‖w‖ is the 2-norm of the vector w). Any point x_i falling on the wrong side of its supporting plane is considered to be an error (having a corresponding slack value z_i > 0). Therefore, an SVM algorithm has to simultaneously maximize the margin and minimize the error. This is accomplished through the following QP (1):

min f(w, b, z) = (1/2)‖w‖² + C eᵀz
s.t. D(Aw − eb) + z ≥ e                                                    (1)

where z ∈ R^m is the non-negative slack vector and the positive constant C ∈ R^1 is used to tune the trade-off between the errors and the margin size. The plane (w, b) is obtained by solving the QP (1). Then, the classification function for a new datapoint x based on this plane is: predict(x) = sign(w·x − b). SVMs can use other classification functions as well, for example a polynomial function of degree d, an RBF (Radial Basis Function) or a sigmoid function. To change from a linear to a non-linear classifier, one must only substitute a kernel evaluation in (1) for the original dot product. More details about SVMs and other kernel-based learning methods can be found in [5].
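To make the notation concrete, the following minimal Python/NumPy sketch (an illustration only, not the authors' C/C++ implementation; the helper names are ours) builds the objects A, D and e used above and applies the linear classification function:

```python
import numpy as np

def make_problem(X, y):
    """Build A (m x n), the diagonal label matrix D (m x m) and the ones vector e
    from a data matrix X and labels y in {-1, +1}."""
    A = np.asarray(X, dtype=np.float64)
    D = np.diag(np.asarray(y, dtype=np.float64))
    e = np.ones(A.shape[0])
    return A, D, e

def predict(X, w, b):
    """Classification function of a new datapoint: predict(x) = sign(w.x - b)."""
    return np.sign(np.asarray(X) @ w - b)
```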


Recent developments for massive linear SVM algorithms proposed by Mangasarian [17], [18] reformulate the classification as an unconstrained optimization. By changing the margin maximization to the minimization of (1/2)‖(w, b)‖² and adding a least squares 2-norm error, the reformulated SVM algorithm with a linear kernel is given by the QP (2):

min f(w, b, z) = (1/2)‖(w, b)‖² + (C/2)‖z‖²
s.t. D(Aw − eb) + z ≥ e                                                    (2)

where z is the non-negative slack vector and the positive constant C tunes the trade-off between the errors and the margin size. The formulation (2) can be rewritten by substituting z = [e − D(Aw − eb)]₊ (where (x)₊ replaces the negative components of a vector x by zeros) into the objective function f. We get the unconstrained problem (3):

min f(w, b) = (1/2)‖(w, b)‖² + (C/2)‖[e − D(Aw − eb)]₊‖²                   (3)

By setting u = [w₁ w₂ ... wₙ b]ᵀ and H = [A  −e], the SVM formulation (3) is rewritten as (4):

min f(u) = (1/2)uᵀu + (C/2)‖(e − DHu)₊‖²                                   (4)

Mangasarian [17] has shown that a finite stepless Newton method can be used to solve the strongly convex unconstrained minimization problem (4). The algorithm is described in Figure 2, and Mangasarian has proved that the sequence u_i produced by the algorithm terminates at the global minimum solution. In most of the tested cases, the Newton algorithm reaches a good solution within 5 to 8 iterations. The SVM formulation (4) thus requires only the solution of linear equations in (w, b) instead of a QP. If the dimension of the input space is small enough (less than 10^3), even with millions of datapoints, the Newton SVM algorithm is able to classify them in minutes on a PC.
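Since Figure 2 survives here only as a caption, it helps to state explicitly the quantities the Newton iteration manipulates. Reconstructed from the standard Newton SVM derivation and from the block sums given in section 4 (so the exact bookkeeping may differ slightly from the original figure), the gradient and generalized Hessian of (4) at u are

∇f(u) = u + C(−DH)ᵀ(e − DHu)₊
∂²f(u) = I + C(−DH)ᵀ diag([e − DHu]∗)(−DH)

where [x]∗ is the step function (component-wise 1 for positive entries, 0 otherwise). Each Newton step then solves ∂²f(u_i) δ = −∇f(u_i) and sets u_{i+1} = u_i + δ, stopping when the gradient vanishes.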

Fig. 2. Newton SVM algorithm
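The following Python/NumPy sketch illustrates one possible implementation of the finite stepless Newton iteration for problem (4). It is an illustration of the method only, not the authors' C/C++/Lapack++ code; the tolerance, iteration cap and default value of C are our assumptions.

```python
import numpy as np

def newton_svm(A, y, C=1.0, max_iter=20, tol=1e-6):
    """Finite stepless Newton method for
    min f(u) = (1/2) u'u + (C/2) ||(e - D H u)_+||^2, with H = [A  -e], u = [w; b]."""
    A = np.asarray(A, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    m, n = A.shape
    e = np.ones(m)
    H = np.hstack([A, -np.ones((m, 1))])                   # H = [A  -e]
    DH = y[:, None] * H                                    # D H: rows of H scaled by the labels
    u = np.zeros(n + 1)
    for _ in range(max_iter):
        r = e - DH @ u                                     # residual e - D H u
        grad = u + C * (-DH).T @ np.maximum(r, 0.0)        # gradient of f at u
        if np.linalg.norm(grad) < tol:                     # stop at (near) zero gradient
            break
        step = (r > 0.0).astype(np.float64)                # [e - D H u]_*
        hess = np.eye(n + 1) + C * (DH.T * step) @ DH      # generalized Hessian of f at u
        u = u - np.linalg.solve(hess, grad)                # stepless Newton update
    return u[:n], u[n]                                     # w, b

# Example usage on hypothetical data:
# w, b = newton_svm(A, y, C=1.0)
# labels = np.sign(A @ w - b)
```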

3 Incremental Newton SVM Algorithm

Although the Newton SVM algorithm is fast and efficient for classifying large datasets, it needs to load the whole dataset into memory. For a large dataset, e.g. one billion datapoints in a 20-dimensional input space, Newton SVM requires more than 80 GB of RAM, and any machine learning algorithm has difficulties dealing with such a challenge. Our investigation aims at scaling up the Newton SVM algorithm to classify very large datasets on PCs (Intel CPUs). Incremental learning algorithms are a convenient way to handle very large datasets because they avoid loading the whole dataset into main memory: only subsets of the data are considered at any one time, and the solution is updated over a growing training set. The main idea is to incrementally compute the gradient and the generalized Hessian of f at u for each iteration of the finite Newton algorithm, as described in Figure 3. Suppose we have a very large dataset decomposed into small blocks of rows A_i, D_i. The incremental Newton SVM algorithm can then compute the gradient and the generalized Hessian of f incrementally by the formulations (5) and (6). Consequently, the incremental Newton SVM algorithm can handle massive datasets on a PC. If the dimension of the input space is small enough (less than 10^3), even with billions of datapoints, the incremental Newton SVM algorithm is able to classify them on a standard personal computer (Pentium IV, 512 MB RAM). The algorithm only needs to store a small (n+1)×(n+1) matrix and two (n+1)×1 vectors in memory between two successive steps (where n is the number of dimensions). The accuracy of the incremental algorithm is exactly the same as that of the original one.
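As a concrete sketch of this incremental scheme (an illustration under the assumption that the row blocks arrive as in-memory arrays, e.g. read from disk one block at a time; it is not the authors' C/C++ code), the data-dependent parts of the gradient and generalized Hessian can be accumulated block by block, so only an (n+1)×(n+1) matrix and (n+1)-sized vectors are ever kept in memory:

```python
import numpy as np

def accumulate_blocks(blocks, u, n, C=1.0):
    """Accumulate the gradient and generalized Hessian of f at u over row blocks
    (A_j, y_j) without ever loading the full dataset into memory."""
    grad = u.copy()                                # identity term u of the gradient
    hess = np.eye(n + 1)                           # identity term I of the Hessian
    for A_j, y_j in blocks:                        # blocks may be streamed from disk
        DH_j = y_j[:, None] * np.hstack([A_j, -np.ones((A_j.shape[0], 1))])
        r_j = np.ones(A_j.shape[0]) - DH_j @ u     # e - D_j H_j u
        grad += C * (-DH_j).T @ np.maximum(r_j, 0.0)
        step = (r_j > 0.0).astype(np.float64)      # [e - D_j H_j u]_*
        hess += C * (DH_j.T * step) @ DH_j
    return grad, hess

# Each Newton iteration then solves hess @ delta = -grad and updates u,
# exactly as in the in-memory version.
```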

4 Parallel Incremental Newton SVM Using GPUs

The incremental Newton SVM algorithm described above is able to deal with very large datasets on a PC, but it only runs on a single processor. We have therefore extended it to build a parallel version using a GPU. During the last decade, GPUs, described in [27], have developed into highly specialized processors for the acceleration of raster graphics. The GPU has several advantages over CPU architectures for highly parallel, compute-intensive workloads, including higher memory bandwidth, significantly higher floating-point throughput, and thousands of hardware thread contexts with hundreds of parallel compute pipelines executing programs in a single instruction multiple data (SIMD) mode. The GPU can be an alternative to CPU clusters in high performance computing environments. Recent GPUs have added programmability and have been used for general-purpose computation, i.e. non-graphics computation, including physics simulation, signal processing, computational geometry, database management, computational biology and data mining.


Fig. 3. Incremental Newton SVM algorithm

NVIDIA has introduced a new GPU, the GeForce 8800 GTX, and a C-language programming API called CUDA (Compute Unified Device Architecture) [19]. The NVIDIA GeForce 8800 GTX architecture comprises 16 multiprocessors. Each multiprocessor has 8 SPs (streaming processors) for a total of 128 SPs, and each group of 8 SPs shares one L1 data cache. An SP contains a scalar ALU (arithmetic logic unit) and can perform floating point operations. Instructions are executed in a SIMD mode. The NVIDIA GeForce 8800 GTX has 768 MB of graphics memory, with a peak observed performance of 330 GFLOPS and a peak memory bandwidth of 86 GB/s. This specialized architecture can sufficiently meet the needs of many massively data-parallel computations. In addition, NVIDIA CUDA provides a C-language API to program the GPU for general-purpose applications. In CUDA, the GPU is a device that can execute multiple concurrent threads. The CUDA software stack is composed of a hardware driver, an API, its runtime and higher-level mathematical libraries of common usage, including an implementation of the Basic Linear Algebra Subprograms (CUBLAS [20]). The CUBLAS library gives access to the computational resources of NVIDIA GPUs. The basic model by which applications use the CUBLAS library is to create matrix and vector objects in GPU memory space, fill them with data, call a sequence of CUBLAS functions and, finally, copy the results from GPU memory space back to the host. Furthermore, the data-transfer rate between GPU and CPU memory is about 2 GB/s. We thus developed a parallel version of the incremental Newton SVM algorithm based on GPUs to gain high performance at low cost. The parallel incremental implementation in Figure 4 uses the CUBLAS library to perform the matrix computations on the GPU's massively parallel computing architecture. Note that in CUDA/CUBLAS the GPU can execute multiple concurrent threads, so the parallel computations are performed implicitly.


Fig. 4. Parallel incremental Newton SVM algorithm using GPUs

First, we split the large dataset A, D into small blocks of rows A_j, D_j. For each incremental step, a data block A_j, D_j is loaded into CPU memory; a data-transfer task copies A_j, D_j from CPU to GPU memory; and the GPU then computes, in parallel, the sums

∇f(u_i) = ∇f(u_i) + (−D_j H_j)ᵀ(e − D_j H_j u_i)₊
∂²f(u_i) = ∂²f(u_i) + (−D_j H_j)ᵀ diag([e − D_j H_j u_i]∗)(−D_j H_j)

The results ∇f(u_i) and ∂²f(u_i) are then copied from GPU memory back to CPU memory to update u at the i-th iteration. The accuracy of the new algorithm is exactly the same as that of the original one.
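The sketch below shows the same block loop with the matrix computations offloaded to the GPU. CuPy is used here purely as a readable stand-in for the CUBLAS calls described above; it is our assumption, not the authors' actual CUDA/CUBLAS code, but the data-transfer points mirror the CPU-to-GPU and GPU-to-CPU copies of the description.

```python
import numpy as np
import cupy as cp   # assumption: a GPU array library standing in for CUBLAS

def accumulate_blocks_gpu(blocks, u, n, C=1.0):
    """For each block: copy (A_j, y_j) from CPU to GPU memory, compute its
    contribution to the gradient and generalized Hessian on the GPU, and
    finally copy the small accumulated results back to CPU memory."""
    u_gpu = cp.asarray(u)
    grad = cp.asarray(u)                           # identity term u
    hess = cp.eye(n + 1)                           # identity term I
    for A_j, y_j in blocks:
        A_gpu = cp.asarray(A_j)                    # CPU -> GPU transfer of the block
        y_gpu = cp.asarray(y_j)
        DH_j = y_gpu[:, None] * cp.hstack([A_gpu, -cp.ones((A_gpu.shape[0], 1))])
        r_j = cp.ones(A_gpu.shape[0]) - DH_j @ u_gpu
        grad = grad + C * (-DH_j).T @ cp.maximum(r_j, 0.0)
        step = (r_j > 0.0).astype(cp.float64)
        hess = hess + C * (DH_j.T * step) @ DH_j
    return cp.asnumpy(grad), cp.asnumpy(hess)      # GPU -> CPU transfer of the results
```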

5 Numerical Test Results

We prepared an experimental setup using a PC with an Intel Core 2, 2.6 GHz, 2 GB RAM, and an Nvidia GeForce 8800 GTX graphics card with NVIDIA driver version 6.14.11.6201 and CUDA 1.1, running Linux Fedora Core 6. We implemented two versions (GPU and CPU code) of the incremental Newton SVM algorithm in C/C++ using NVIDIA's CUDA and CUBLAS APIs [19], [20] and the high performance linear algebra library Lapack++ [11]. The GPU implementation results are compared against the CPU results under Linux Fedora Core 6. We only evaluated the computational time, without the time needed to read data from disk. We focus on numerical tests with large datasets from the UCI repository, including the Forest cover type, KDD cup 1999 and Adult datasets (cf. Table 1). We created additional massive datasets using the RingNorm program: a 20-dimensional, 2-class classification example in which each class is drawn from a multivariate normal distribution. Class 1 has mean equal to zero and covariance 4 times the identity. Class 2 (considered as −1) has unit covariance with mean equal to 2/sqrt(20).
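For reference, RingNorm data as described above can be generated with a short sketch like the following (the random seed and the even class split are our assumptions; the original RingNorm program may differ in such details):

```python
import numpy as np

def ringnorm(m, n=20, seed=0):
    """RingNorm: class +1 ~ N(0, 4I) (std 2); class -1 ~ N((2/sqrt(n)) * 1, I)."""
    rng = np.random.default_rng(seed)
    y = np.where(rng.random(m) < 0.5, 1.0, -1.0)
    X = np.empty((m, n))
    X[y == 1.0] = rng.normal(0.0, 2.0, size=(int((y == 1.0).sum()), n))
    X[y == -1.0] = rng.normal(2.0 / np.sqrt(n), 1.0, size=(int((y == -1.0).sum()), n))
    return X, y
```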

Table 1. Dataset description

Datasets            Dimensions   Training set   Testing set
Adult                      110          32561         16281
Forest covertype            54         495141         45141
KDD cup 1999                41        4898429        311029
Ringnorm 1M                 20        1000000        100000
Ringnorm 10M                20       10000000       1000000

Table 2. Classification results reported on a CPU (Intel Core 2, 2.6 GHz, 2 GB RAM) and a GPU (NVIDIA GeForce 8800 GTX)

Datasets            GPU time (sec)   CPU time (sec)   Accuracy (%)
Adult                         0.48            17.52          85.18
Forest covertype              2.42            84.17          77.18
KDD cup 1999                 18.01           552.98          92.31
Ringnorm 1M                   0.39            39.01          75.07
Ringnorm 10M                 17.44           395.67          76.68

First, we split the datasets into small blocks of rows so that the whole dataset never has to fit in memory. Table 2 presents the classification results obtained by the GPU and CPU implementations of the incremental Newton SVM algorithm. The GPU version is a factor of 45 faster than the CPU implementation. For the Forest cover type dataset, the standard LibSVM ran for 21 days without producing any result. Recently published results indicate that the SVM-perf algorithm performed this classification in 171 seconds (CPU time) on a 3.6 GHz Intel Xeon processor with 2 GB RAM, which indicates that our GPU implementation of incremental Newton SVM is probably about 70 times faster than SVM-perf. The KDD Cup 1999 dataset consists of network data indicating either normal connections (negative class) or attacks (positive class). LibSVM ran out of memory on this dataset. CB-SVM classified it with over 90% accuracy in 4750 seconds (CPU time) on a Pentium 800 MHz with 1 GB RAM, while our algorithm achieved over 92% accuracy in only 18.01 seconds, i.e. it appears to be about a factor of 264 faster than CB-SVM. These numerical test results show the effectiveness of the new algorithm in dealing with very large datasets on GPUs.

6 Conclusion and Future Work

We have presented a new parallel incremental Newton SVM algorithm able to deal with very large datasets in classification tasks on GPUs.


We have extended the recent Newton SVM algorithm proposed by Mangasarian in two ways. First, we developed an incremental algorithm for classifying massive datasets: our algorithm avoids loading the whole dataset into main memory, since only subsets of the data are considered at any one time and the solution is updated over a growing training set. Second, we developed a parallel version of the incremental Newton SVM algorithm based on GPUs to gain high performance at low cost. We evaluated the performance in terms of learning time on very large datasets from the UCI repository and Delve. The results show that our algorithm using a GPU is about 45 times faster than a CPU implementation. We also compared the performance of our algorithm with the efficient standard SVM algorithm LibSVM and with two recent algorithms, SVM-perf and CB-SVM; our GPU implementation of incremental Newton SVM is probably over 100 times faster than LibSVM, SVM-perf and CB-SVM. A forthcoming improvement will extend our methods to deal with complex non-linear classification tasks.

References

1. Blake, C., Merz, C.: UCI Repository of Machine Learning Databases (2008)
2. Boser, B., Guyon, I., Vapnik, V.: A Training Algorithm for Optimal Margin Classifiers. In: Proc. of 5th ACM Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, pp. 144–152 (1992)
3. Cauwenberghs, G., Poggio, T.: Incremental and Decremental Support Vector Machine Learning. In: Advances in Neural Information Processing Systems, vol. 13, pp. 409–415. MIT Press, Cambridge (2001)
4. Chang, C.C., Lin, C.J.: LIBSVM – A Library for Support Vector Machines (2001)
5. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge (2000)
6. Delve: Data for Evaluating Learning in Valid Experiments (1996)
7. Do, T.N., Poulet, F.: Towards High Dimensional Data Mining with Boosting of PSVM and Visualization Tools. In: Proc. of 6th Int. Conf. on Enterprise Information Systems, pp. 36–41 (2004)
8. Do, T.N., Poulet, F.: Mining Very Large Datasets with SVM and Visualization. In: Proc. of 7th Int. Conf. on Enterprise Information Systems, pp. 127–134 (2005)
9. Do, T.N., Poulet, F.: Classifying One Billion Data with a New Distributed SVM Algorithm. In: Proc. of 4th IEEE International Conference on Computer Science, Research, Innovation and Vision for the Future, pp. 59–66 (2006)
10. Do, T.N., Fekete, J.D.: Large Scale Classification with Support Vector Machine Algorithms. In: Proc. of 6th International Conference on Machine Learning and Applications, pp. 7–12. IEEE Press, USA (2007)
11. Dongarra, J., Pozo, R., Walker, D.: LAPACK++: A Design Overview of Object-Oriented Extensions for High Performance Linear Algebra. In: Proc. of Supercomputing 1993, pp. 162–171. IEEE Press, Los Alamitos (1993)
12. Fayyad, U., Piatetsky-Shapiro, G., Uthurusamy, R.: Summary from the KDD-03 Panel - Data Mining: The Next 10 Years. SIGKDD Explorations 5(2), 191–196 (2004)


13. Fung, G., Mangasarian, O.: Incremental Support Vector Machine Classification. In: Proc. of the 2nd SIAM Int. Conf. on Data Mining SDM, USA (2002)
14. Guyon, I.: Web Page on SVM Applications (1999)
15. Joachims, T.: Training Linear SVMs in Linear Time. In: Proc. of the ACM SIGKDD Intl Conf. on KDD, pp. 217–226 (2006)
16. Lyman, P., Varian, H.R., Swearingen, K., Charles, P., Good, N., Jordan, L., Pal, J.: How Much Information (2003)
17. Mangasarian, O.: A Finite Newton Method for Classification Problems. Data Mining Institute Technical Report 01-11, Computer Sciences Department, University of Wisconsin (2001)
18. Mangasarian, O., Musicant, D.: Lagrangian Support Vector Machines. Journal of Machine Learning Research 1, 161–177 (2001)
19. NVIDIA CUDA: CUDA Programming Guide 1.1 (2007)
20. NVIDIA CUDA: CUDA CUBLAS Library 1.1 (2007)
21. Osuna, E., Freund, R., Girosi, F.: An Improved Training Algorithm for Support Vector Machines. Neural Networks for Signal Processing VII, 276–285 (1997)
22. Platt, J.: Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In: Schoelkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods – Support Vector Learning, pp. 185–208 (1999)
23. Poulet, F., Do, T.N.: Mining Very Large Datasets with Support Vector Machine Algorithms. In: Camp, O., Filipe, J., Hammoudi, S., Piattini, M., et al. (eds.) Enterprise Information Systems V, pp. 177–184. Kluwer Academic Publishers, Dordrecht (2004)
24. Syed, N., Liu, H., Sung, K.: Incremental Learning with Support Vector Machines. In: Proc. of the 6th ACM SIGKDD Intl Conf. on KDD 1999, USA (1999)
25. Tong, S., Koller, D.: Support Vector Machine Active Learning with Applications to Text Classification. In: Proc. of 17th Int. Conf. on Machine Learning, pp. 999–1006 (2000)
26. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
27. Wasson, S.: Nvidia's GeForce 8800 Graphics Processor. Technical Report, PC Hardware Explored (2006)
28. Yu, H., Yang, J., Han, J.: Classifying Large Data Sets Using SVMs with Hierarchical Clusters. In: Proc. of the ACM SIGKDD Intl Conf. on KDD, pp. 306–315 (2003)
