An Intelligent System for Accelerating Parallel SVM Classification Problems on Large Datasets Using GPU

Qi Li
Dept. of Computer Science
Virginia Commonwealth University
Richmond, USA
[email protected]

Raied Salman
Dept. of Computer Science
Virginia Commonwealth University
Richmond, USA
[email protected]

Abstract—Support Vector Machine (SVM) is one of the most popular tools for solving general classification and regression problems because of its high predicting accuracy. However, the training phase of a nonlinear kernel based SVM is computationally expensive, especially for large datasets. In this paper, we propose an intelligent system for solving large classification problems with parallel SVMs. The system utilizes the latest powerful GPU devices to improve the speed of the SVM training and predicting phases. The memory constraints imposed by large datasets are addressed through either data reduction or data chunking techniques. The complete system consists of multiple executable modules, all managed through a main script, which reduces implementation difficulty and offers platform portability. Empirical results show that our system achieves an order of magnitude speedup compared to the classic SVM tool, LIBSVM. The speed is further improved to two orders of magnitude by slightly compromising predicting accuracy.

Keywords-SVM; multi-GPU; HPC; parallel;

I. INTRODUCTION

Since the semiconductor industry revealed that higher-performance processors can no longer be built by simply increasing the clock frequency, the scientific computing market has been offered alternative multi-core, multi-processor, and many-core products, all shifting to parallel architectures. One of these successful products is the Graphics Processing Unit (GPU) based computational device. GPUs used to be integrated only on video cards specialized for 2D and 3D graphics rendering. These applications normally require more capability in floating point operations than in logic control and memory fetch operations. Thus the GPU is designed with many built-in floating point units that compute in parallel, and the GPU itself acts as a co-processor assisting the CPU. Because of this nature, the GPU has become more and more popular in applications that require intensive computation. General purpose programming on the parallel architecture of a GPU is difficult, not only because most libraries offered are solely for graphics-related programming, but also because finding the parallelism in many well known problems is hard.

978-1-4244-8136-1/10/$26.00 © 2010 IEEE

Vojislav Kecman
Dept. of Computer Science
Virginia Commonwealth University
Richmond, USA
vkecman@vcu.edu

This situation changed when NVIDIA released the Compute Unified Device Architecture (CUDA) [1] in 2007. CUDA offers a simplified programming interface, an extension of the C language, for general purpose programming on the GPU. Meanwhile, NVIDIA also released a GPU based computational device called Tesla. The core modules of the proposed system are implemented in CUDA and run on Tesla cards.

The Support Vector Machine (SVM) [2] is a learning algorithm used in many classification and regression problems. With properly selected parameters, SVM generally achieves better predicting accuracy than other statistical classification algorithms, e.g. K-Nearest Neighbors and Linear Discriminant Analysis. Nevertheless, the training phase of an SVM, especially a nonlinear kernel based SVM on a large dataset, is much more computationally expensive. Training a dataset like Mnist [3] can easily take several hours using LIBSVM [4] on a mainstream sequential computer. This restricts the popularity of SVM for large classification problems. Due to GPUs' remarkable floating point performance, Catanzaro et al. [5] and Herrero-Lopez et al. [6] have both proposed efficient SMO based parallel SVM implementations on the GPU, which significantly improve the speed compared to CPU based SVM tools. In this paper, we continue in this research direction, using multiple GPUs to accelerate SVM tools by developing an intelligent system that is feasible for solving large classification problems and can further boost speed by slightly compromising predicting accuracy.

The organization of this paper is as follows. Section II briefly reviews the theory of the classic SVM and introduces the hardware specification of our HPC workstation as well as the datasets used in the experiments. Section III introduces the architecture of the intelligent system and describes part of the implementation. Performance comparison results are given in Section IV. Section V summarizes the conclusions of this project and discusses possible future work.


II. BACKGROUND

A. Support Vector Machine Classification

Given n data points (x_1, y_1), ..., (x_n, y_n), where each point x_i ∈ R^m and y_i ∈ {−1, +1}, 1 ≤ i ≤ n, a soft margin linear SVM problem is defined by introducing the slack variables ξ_i:

    min_{w, ξ}  C Σ_{i=1}^{n} ξ_i + (1/2) ||w||^2                    (1)

subject to: y_i (w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, 1 ≤ i ≤ n. The dual form of this problem is given by

    max_{α ∈ R^n}  Σ_{i=1}^{n} α_i − (1/2) α^T K α                   (2)

subject to: Σ_{i=1}^{n} y_i α_i = 0 and 0 ≤ α_i ≤ C, 1 ≤ i ≤ n, where K_ij = y_i y_j x_i^T x_j. Equation (2) is a quadratic programming optimization problem which is usually solved by Platt's Sequential Minimal Optimization (SMO) algorithm [7]. The linear kernel function can be interchanged with nonlinear kernel functions, e.g. the Gaussian kernel and the polynomial kernel. Cao et al. presented a parallel SMO using the Message Passing Interface in [8], and this idea was adopted for a GPU implementation in [6]. We follow the same methodology to implement the individual SVM solver on the GPU.

B. Hardware Specification

The testing platform used in this project is an HPC workstation. It is equipped with one Intel Xeon E5426 2.8GHz quad-core CPU, 16GB of DDR2 RAM, and three Tesla C1060 cards. Each Tesla card has 4GB of GDDR3 RAM with a memory bandwidth of 102GB/s. The operating system is the 64-bit Fedora Core 10 Linux distribution with CUDA 3.1 installed.

C. Datasets

The datasets used in the experiments are listed in Table I. C and γ are the constraint parameter and the shape parameter of the Gaussian kernel. Mnist* is a binary dataset converted from the 10-class Mnist using even vs. odd digits. Covertype [9] has 581012 data points; the first 500000 are used for training and the remaining 81012 for testing.

Table I
BINARY CLASSIFICATION DATASETS

Datasets     # Training Points   # Testing Points   # Features   (C, γ)
Adult [9]    32561               16281              123          (100, 0.5)
Web [7]      49749               14951              300          (64, 7.8125)
Mnist*       60000               10000              780          (10, 0.125)
Covertype    500000              81012              54           (2048, 0.03125)
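The SMO based dual solver referenced in Section II-A can be sketched in pure Python. This is a toy, sequential illustration of Platt's *simplified* SMO variant with a linear kernel, not the paper's parallel GPU implementation; the data and all names here are illustrative only.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def simplified_smo(X, y, C=1.0, tol=1e-4, max_passes=30, seed=0):
    """Simplified SMO: optimize two alphas at a time, second index random."""
    rng = random.Random(seed)
    n = len(X)
    K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]  # kernel matrix
    alpha = [0.0] * n
    b = 0.0

    def f(i):  # decision value for training point i
        return sum(alpha[k] * y[k] * K[k][i] for k in range(n)) + b

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = f(i) - y[i]
            # pick i only if it violates its KKT condition within tol
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = rng.randrange(n - 1)
                if j >= i:
                    j += 1
                Ej = f(j) - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                if y[i] != y[j]:
                    L, H = max(0.0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0.0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                if L == H:
                    continue
                eta = 2.0 * K[i][j] - K[i][i] - K[j][j]
                if eta >= 0:
                    continue
                # analytic step for alpha_j, clipped to the box [L, H]
                alpha[j] = min(H, max(L, aj_old - y[j] * (Ei - Ej) / eta))
                if abs(alpha[j] - aj_old) < 1e-6:
                    continue
                # keep the equality constraint sum(alpha_k * y_k) = 0
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
                b1 = b - Ei - y[i] * (alpha[i] - ai_old) * K[i][i] \
                       - y[j] * (alpha[j] - aj_old) * K[i][j]
                b2 = b - Ej - y[i] * (alpha[i] - ai_old) * K[i][j] \
                       - y[j] * (alpha[j] - aj_old) * K[j][j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2.0
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

def predict(X, y, alpha, b, x):
    s = sum(alpha[k] * y[k] * dot(X[k], x) for k in range(len(X))) + b
    return 1 if s >= 0 else -1

# Toy, linearly separable data (illustrative only).
X_toy = [[2.0, 2.0], [3.0, 1.0], [1.5, 2.5], [-2.0, -2.0], [-3.0, -1.0], [-1.5, -2.5]]
y_toy = [1, 1, 1, -1, -1, -1]
alpha_toy, b_toy = simplified_smo(X_toy, y_toy, C=1.0)
```

The full SMO used by the system (and by LIBSVM) adds working-set selection heuristics and caching; the GPU versions in [5], [6] parallelize the kernel evaluations and the error-vector updates across threads.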

Figure 2. Performance of randomly selected sub training datasets
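The random sub-sampling experiments described in Section III-A, together with the stratified variant recommended there for imbalanced sets, can be sketched as follows. The function names are illustrative, not part of the actual system.

```python
import random
from collections import defaultdict

def random_subset(X, y, size, seed=0):
    """Plain random sub-sampling of the training set (no distance tolerance)."""
    idx = random.Random(seed).sample(range(len(X)), size)
    return [X[i] for i in idx], [y[i] for i in idx]

def stratified_subset(X, y, size, seed=0):
    """Random sub-sampling that preserves the class proportions of y.

    Note: with many classes, per-class rounding may make the total differ
    slightly from `size`; a production version would redistribute the remainder.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    chosen = []
    for label, idx in by_class.items():
        k = round(size * len(idx) / len(y))  # proportional quota for this class
        chosen.extend(rng.sample(idx, k))
    rng.shuffle(chosen)
    return [X[i] for i in chosen], [y[i] for i in chosen]

# Illustrative imbalanced set: 800 positive, 200 negative points.
X_demo = [[float(i)] for i in range(1000)]
y_demo = [1] * 800 + [-1] * 200
X_red, y_red = stratified_subset(X_demo, y_demo, 100)
```

A stratified selection of 100 points from this 80/20 set keeps exactly 80 positives and 20 negatives, which is what keeps the accuracy variance small across repeated trials.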

III. THE INTELLIGENT SYSTEM

Fig. 1 shows a simplified flow chart of the proposed intelligent system. There are two major branches after the training dataset is loaded: data reduction (Method 1) and data chunking (Method 2). Both are suitable for normal or large datasets. The system can also feed the complete training dataset into the training phase if there is no memory constraint.

A. Data Reduction

Lee et al. [10] developed RSVM, which uses a randomly selected small subset of the original training set to address the memory constraint brought by large datasets. Their method requires that the distance between any pair of data points in the selected subset be bigger than a tolerance value; in this manner, the reduced subset still covers the scope of the original dataset in the hyper space. We conducted multiple experiments on the Web dataset, randomly selecting subsets of different sizes with and without the distance tolerance consideration. The results show that the variance in predicting accuracy between them is trivial. This means a simple random selection is good enough and there is no need to involve distance calculations. Besides, calculating the distance matrix of the training set to find a proper tolerance can be computationally expensive when the dataset contains a large number of samples.

Fig. 2 shows the performance of the GPU based SVM solver in terms of speed and accuracy on the reduced Web datasets. The training dataset contains 49749 data points and the testing dataset contains 14951 data points; the number of attributes is 300 and the number of classes is 2. The Gaussian kernel is used in the SVM solver. Training on the complete dataset takes about 151 seconds and achieves an accuracy of 99.45%. When the size of the training set is reduced to 5000, the training time is about 2.62 seconds (61x faster) and the accuracy is about 97.63% (a 1.82% drop). Because the sub training set is randomly selected, we run 20 different trials, selecting 5000 samples each time, to estimate the variance of the training time and of the predicting accuracy, shown in Fig. 3 and Fig. 4. Both variances are very small, indicating stable results. When the training set is imbalanced, it is recommended to use stratified sampling in order to minimize the variance introduced by random selection. This is also very important for training multi-class datasets.

2010 10th International Conference on Intelligent Systems Design and Applications

Figure 1. Flowchart of the intelligent classification system

B. Data Chunking

Although data reduction can address the memory constraint of large datasets and significantly shorten the training phase, trading predicting accuracy for speed may not be favored in certain applications with large datasets. In this situation, data chunking combined with the cascade SVM training structure should be used. Graf et al. proposed the cascade SVM in [11] to split a large training dataset into multiple small subsets solved by individual SVM solvers. The results from these individual SVM solvers are combined using a reverse binary tree structure. In this way, large SVM classification problems

Figure 3. Variation of training time on 5,000 randomly selected data points in 20 trials

Figure 4. Variation of predicting accuracy using 5,000 randomly selected data points in 20 trials

can be distributed to a cluster system. Graf et al. prove that this approach eventually converges to the globally optimal solution. There are two feasible ways to implement the cascade SVM method. The first wraps everything into a single executable program which uses multiple threads to manage multiple GPU devices simulating the "computing nodes". Notice that only one CPU process is generated, which can execute on one CPU core only. All shared data (the SVM sub-models) are stored in main memory, so merging the sub-models is faster in this case. The second builds multiple executable programs, including the SVM training tool, the model combining tool, the SVM predicting tool and some file I/O tools. All of these tools are managed by a main script written in a scripting language such as Python or Perl. SVM models are stored as files on disk and processed by the model combining tool. This method launches multiple CPU processes in the training phase; the number of processes is determined by the number of available GPU devices in the system. It can utilize the power of a multi-core CPU, but it also brings more file I/O operations. Both methods have their advantages and disadvantages. Our implementation uses the second method for the sake of simplicity and platform portability, since the first requires multi-thread library support, which usually comes in platform dependent packages.

IV. RESULTS

The performance comparison between the proposed system and LIBSVM is shown in Table II. Adult, Web and Mnist* are considered normal datasets, which can fit into a GPU device's memory. On the other hand, Covertype is not suitable for training on the complete set. The performance of LIBSVM is given as a baseline for speedup comparison. LIBSVM trains on the complete datasets to generate the predicting model and makes predictions on the testing datasets. The proposed intelligent system offers three possible approaches. It can train on the complete dataset as LIBSVM does if there is no memory constraint. It can also use the data reduction and data chunking techniques for large datasets, and these also work for normal datasets. LIBSVM and the GPU solver acquire the same predicting accuracy when training on the complete dataset, while the GPU achieves a speedup of 13.53x to 52.71x. This confirms the correctness of the individual GPU SVM solver. The speed is further improved by training on the reduced datasets; this part of the speed gain could also be achieved by running LIBSVM on the same reduced datasets. This significant speed improvement is a trade-off for a slight loss of predicting accuracy. All datasets except Mnist* are reduced to a stratified selection of 5000 data samples.
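The reverse binary tree merging described in Section III-B can be sketched as follows. This is a structural sketch only: `train_svm` is a hypothetical stand-in that returns the "support vectors" of its chunk, whereas the real system invokes the GPU training tool in one process per device and combines model files through the model combining tool.

```python
def cascade_train(chunks, train_svm):
    """One pass of the cascade: train each chunk, then repeatedly pair up
    the resulting support-vector sets (a reverse binary tree) and retrain
    on each merged pair until a single model remains."""
    level = [train_svm(c) for c in chunks]       # leaf solvers, one per GPU/process
    while len(level) > 1:
        merged = []
        for a, b in zip(level[0::2], level[1::2]):
            merged.append(train_svm(a + b))      # retrain on the union of SVs
        if len(level) % 2 == 1:                  # odd node is promoted unchanged
            merged.append(level[-1])
        level = merged
    return level[0]

def toy_solver(points):
    """Stand-in 'SVM': pretend the two extreme points are the support vectors.

    Like real support vectors, the extremes of a merged pair equal the
    extremes of the union, so the cascade reaches the global answer."""
    return [min(points), max(points)]

chunks = [[5, 1, 3], [9, 2, 8], [7, 4, 6], [0, 11, 10]]
final_model = cascade_train(chunks, toy_solver)
```

With four chunks the tree has two levels: four leaf "models", two merged models, then one final model holding the global extremes.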
Mnist* suffers a 10% accuracy drop, which is not acceptable, when the size of the reduced dataset is 5000; thus 30000 data samples are used in the reduced Mnist* dataset. The data chunking method can use multiple GPU devices because of its task level parallelism. The Adult dataset is broken down into 4 subsets, while the Web and Mnist* datasets are broken down into 8 subsets. The Covertype dataset is split into 64 subsets. The speed performance varies with both the number and the size of the subsets. The predicting accuracy, on the other hand, is more stable.

V. CONCLUSION AND FUTURE WORK

The performance results of the proposed system are analyzed in the previous section and the following conclusions are drawn. A GPU based SVM solver can achieve the same predicting accuracy as a CPU based SVM solver with more than an order of magnitude speedup. Data reduction shows superior performance on both normal datasets and


large datasets. In situations where predicting accuracy is not critical, there is potentially a significant speed gain from using data reduction. It is not recommended to use the cascade SVM on normal datasets that can fit into memory. Because the cascade system only runs one iteration, due to the close performance shown in [11], the globally optimal solution is never reached; in other words, it cannot reach the accuracy acquired by training on the complete dataset in one piece. For large classification problems, it is a good question to ask whether it is necessary to cover all data points of the full dataset, using techniques such as the cascade SVM, in the training phase. Future work will include adapting the system to solve multi-class problems. A similar SVM based regression solver system is also under development.

REFERENCES

[1] NVIDIA, CUDA Compute Unified Device Architecture Programming Guide, June 2007.
[2] V. N. Vapnik, The Nature of Statistical Learning Theory, November 1999.
[3] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[4] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[5] B. Catanzaro, N. Sundaram, and K. Keutzer, "Fast support vector machine training and classification on graphics processors," in ICML '08: Proceedings of the 25th International Conference on Machine Learning. New York, NY, USA: ACM, 2008, pp. 104-111.
[6] S. Herrero-Lopez, J. R. Williams, and A. Sanchez, "Parallel multiclass classification using SVMs on GPUs," in GPGPU '10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. New York, NY, USA: ACM, 2010, pp. 2-11.
[7] J. C. Platt, "Sequential minimal optimization: A fast algorithm for training support vector machines," in Advances in Kernel Methods - Support Vector Learning, Tech. Rep., 1998.
[8] L. Cao, S. Keerthi, C.-J. Ong, J. Zhang, U. Periyathamby, X. J. Fu, and H. Lee, "Parallel sequential minimal optimization for the training of support vector machines," IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 1039-1049, July 2006.
[9] A. Frank and A. Asuncion, "UCI machine learning repository," 2010. [Online]. Available: http://archive.ics.uci.edu/ml
[10] Y.-J. Lee and O. L. Mangasarian, "RSVM: Reduced support vector machines," Data Mining Institute, Computer Sciences Department, University of Wisconsin, Tech. Rep. 00-07, 2001.


Table II
PERFORMANCE RESULTS COMPARISON

             LIBSVM                      1 GPU, Complete Dataset     1 GPU, Data Reduction        3 GPUs, Data Chunking
Datasets     Training Time / Accuracy    Training Time / Accuracy    Training Time / Accuracy     Training Time / Accuracy
Adult        488.85s / 82.70%            36.13s (13.53x) / 82.70%    3.09s (158.20x) / 81.71%     25.23s (19.38x) / 82.14%
Web          3362.54s / 99.45%           151.39s (22.21x) / 99.45%   2.62s (1283.41x) / 97.63%    78.40s (42.89x) / 99.31%
Mnist*       21834.23s / 95.32%          414.27s (52.71x) / 95.32%   157.59s (138.55x) / 94.10%   168.42s (129.64x) / 94.77%
Covertype    N/A / N/A                   N/A / N/A                   0.83s / 75.12%               651.90s / 76.03%

[11] H. P. Graf, E. Cosatto, L. Bottou, I. Durdanovic, and V. Vapnik, "Parallel support vector machines: The cascade SVM," in Advances in Neural Information Processing Systems. MIT Press, 2005, pp. 521-528.
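As a sanity check, the speedup factors reported in Table II follow directly from the published training times (each factor is the LIBSVM baseline time divided by the corresponding training time):

```python
# (libsvm_time_s, method_time_s, published_speedup) triples taken from Table II.
speedups = [
    (488.85,   36.13,   13.53),   # Adult, 1 GPU, complete dataset
    (3362.54,  151.39,  22.21),   # Web, 1 GPU, complete dataset
    (21834.23, 414.27,  52.71),   # Mnist*, 1 GPU, complete dataset
    (488.85,   3.09,    158.20),  # Adult, data reduction
    (3362.54,  2.62,    1283.41), # Web, data reduction
    (21834.23, 157.59,  138.55),  # Mnist*, data reduction
    (488.85,   25.23,   19.38),   # Adult, data chunking
    (3362.54,  78.40,   42.89),   # Web, data chunking
    (21834.23, 168.42,  129.64),  # Mnist*, data chunking
]

# Every published factor matches the ratio of raw times to two decimal places.
checks = [abs(base / t - published) < 0.01 for base, t, published in speedups]
```

Covertype has no LIBSVM baseline (training the complete set was infeasible), so its rows carry raw times without a speedup factor.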

