Scalable Fast Parallel SVM on Cloud Clusters for Large Datasets Classification Ghazanfar Latif, Rafiul Hassan Computer Science Department, College of Computer Science Engineering, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
[email protected]
Abstract: A support vector machine (SVM) is a supervised learning method that analyzes data and recognizes patterns, and is used for classification and regression analysis. Training and testing an SVM on large multidimensional datasets requires substantial computing resources in terms of memory and computational power. We propose a scalable and cost-effective technique for running the support vector machine in parallel on distributed cloud cluster nodes, which reduces both the memory requirements and the computational power needed. We divide the dataset into 'n' equal parts and process each part on a separate cluster node. The support vectors produced by all cluster nodes are then combined on the master node, and the SVM algorithm is applied again. We tested our solution on different publicly available datasets using a local single-node machine, HPC clusters, and Amazon cloud clusters, and compared the results in terms of efficiency and accuracy. We show that our proposed solution is very efficient in terms of training time compared to existing techniques and that it classifies the datasets correctly with a minimal error rate.
Keywords: Support Vector Machine; Cloud Computing; Parallel SVM; Cluster Computing; Amazon Web Services.
1. Introduction
Distributed parallel classification [7] is gaining importance due to the geographical, physical, and computational constraints of centralized systems. The support vector machine (SVM), which implements the principle of structural risk minimization, is one of the most popular classification algorithms in many fields. SVM-based classification is founded on the notion of hyperplanes that act as class separators in common binary classification, such as spam versus ham in the context of spam filtering. With properly selected parameters, SVM generally achieves better predictive accuracy than other statistical classification algorithms, e.g., k-nearest neighbors and linear discriminant analysis. Nevertheless, the training phase of an SVM, especially a nonlinear kernel-based SVM on large datasets, is much more computationally expensive. Scientific computing at this scale requires very high processing power to produce good results for growing problem sizes in a reasonable time frame. Only a few universities and research centers can afford expensive on-site supercomputers [] for research on very large projects where such processing power is required; due to the lack of high performance computing labs in many institutions, many researchers are unable to work on these types of projects.
A. Distributed Computing
A single program can be run on more than one computer simultaneously by dividing it into tasks, one for each computer. Distributed computing is a networked system in which a job is divided into many such tasks [15], each solved by one node (computer). The tasks communicate by message passing. A distributed system may consist of different kinds of computers, networks, and network topologies, and it must also tolerate failures. Distributed computing offers advantages in performance and reliability: it uses more than one CPU and other resources in parallel [6], and together those resources form a supercomputer-like system. It is also more reliable, since tasks run on dedicated servers and most security issues can be confined to those servers. The main goal of a distributed system is to coordinate the sharing of resources.
B. Cloud Computing
Cloud computing [1] proposes an alternative in which resources are no longer hosted by the researcher's own computational facilities but are leased from large data centers only when needed. Despite the existence of several cloud computing vendors, such as Amazon, the potential of clouds for scientific workloads remains largely unexplored. To address this issue, in this paper we present a performance analysis of cloud computing services for scientific computing. Users do not need to install an operating system or applications, nor do they need to perform updates or maintenance, which can greatly reduce computing costs, and resources can be added or removed on demand. The cloud computing paradigm therefore holds good promise for the performance-hungry scientific community: clouds [2] promise to be a cheap alternative to supercomputers and specialized clusters, a much more reliable platform than grids, and a much more scalable platform than the largest commodity clusters or resource pools.
The organization of the paper is as follows. Section 2 describes our approach for distributing SVM training on cloud clusters. Section 3 presents detailed experimental results for the proposed solution. We conclude in Section 4.
2. Distributing SVM on Cloud Clusters
In this paper, we show how support vector machine training and classification can be adapted to a highly parallel, yet widely available and affordable, computing platform. Eliminating non-support vectors early from the optimization is an effective strategy for accelerating SVMs; using this idea we developed a filtering process that allows the SVM to be parallelized efficiently. We also designed a performance evaluation method that allows an assessment of the parallel SVM on cloud cluster nodes connected by a high-speed network for large dataset classification. We perform all our measurements on the EC2 [3] environment.
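The cluster nodes are provisioned and released by an automated script, as described in the algorithm below. As a point of reference only, the following sketch shows how such a script could start a group of EC2 instances and later shut down the worker nodes using the boto3 SDK; the AMI ID, instance type, and key pair name are hypothetical placeholders rather than the exact configuration used in our measurements.

```python
# Illustrative sketch only: starting and terminating a group of EC2 cluster
# nodes with the boto3 SDK. The AMI ID, instance type, and key pair name are
# hypothetical placeholders, not the configuration used in this paper.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

def start_cluster(n_nodes, ami_id="ami-xxxxxxxx", instance_type="c5.2xlarge"):
    """Launch n_nodes identical instances; the first one is treated as the master."""
    instances = ec2.create_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=n_nodes,
        MaxCount=n_nodes,
        KeyName="my-key-pair",             # placeholder key pair
    )
    for inst in instances:
        inst.wait_until_running()
    return instances                        # instances[0] acts as the master node

def stop_workers(instances):
    """Terminate every node except the master once support vectors are collected."""
    for inst in instances[1:]:
        inst.terminate()
```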
The parallel SVM on cloud clusters works as follows:
1. Start "N" cluster nodes on the cloud server using an automated script for EC2 HPC instances [4]. The first node acts as the master node.
2. Each cluster node trains an SVM on its equally distributed part of the dataset, i.e., node i processes the partition D_i = { d_(iL/N + 1), ..., d_((i+1)L/N) }, where "D" is the input dataset, "L" is the size of the dataset, and i = 0, 1, 2, ..., N-1.
3. After processing, all identified support vectors are collected on the master node, and the automated script turns off every cloud cluster node except the master node.
Figure 1. Proposed architecture: the input dataset "d" is distributed into n equal parts (d/n each) across cluster nodes #1 ... #n; each node produces its support vectors (SV-1 ... SV-n), which are merged on the master cluster node into a new support vector set (NewSV).
4. The master node takes the collected support vectors as input and applies SVM again to obtain more refined results (a minimal end-to-end sketch of these steps is given below).
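To make the four steps concrete, the sketch below emulates the procedure on a single machine: local worker processes stand in for the cloud cluster nodes, the dataset is split into equal parts, each worker trains an SVM and returns only its support vectors, and a final SVM is retrained on the merged support vectors. This is an illustration under assumptions (scikit-learn's SVC with an RBF kernel, a synthetic toy dataset), not the exact implementation used in our experiments.

```python
# Minimal sketch of the partition -> train -> merge -> retrain scheme.
# Separate cloud nodes are emulated with local worker processes, and
# scikit-learn's SVC stands in for the SVM solver used in the experiments.
import numpy as np
from multiprocessing import Pool
from sklearn.svm import SVC

def train_partition(args):
    """Train an SVM on one partition and return only its support vectors."""
    X_part, y_part = args
    clf = SVC(kernel="rbf", C=1.0).fit(X_part, y_part)
    sv_idx = clf.support_
    return X_part[sv_idx], y_part[sv_idx]

def parallel_svm(X, y, n_nodes=4):
    # Step 2: split the dataset into n_nodes (nearly) equal parts.
    parts = list(zip(np.array_split(X, n_nodes), np.array_split(y, n_nodes)))
    # Steps 1-3: each "node" trains independently; support vectors are collected.
    with Pool(n_nodes) as pool:
        results = pool.map(train_partition, parts)
    X_sv = np.vstack([xs for xs, _ in results])
    y_sv = np.concatenate([ys for _, ys in results])
    # Step 4: the master retrains an SVM on the merged support vectors only.
    return SVC(kernel="rbf", C=1.0).fit(X_sv, y_sv)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)      # toy two-class problem
    model = parallel_svm(X, y, n_nodes=4)
    print("final support vectors:", model.support_vectors_.shape[0])
```

In a real deployment, the pool.map call is replaced by dispatching each partition to a separate cluster node and copying the returned support vectors back to the master.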
3. Experimental Results of the Proposed System
The single-node experimental results are shown in Table I, where PT is the processing time on a single node and ISV is the number of identified support vectors. Table II shows the initial results of our proposed method: we started 4 cloud cluster nodes with the same specification and gave them the same datasets as input. In Table II, TSV is the total number of support vectors generated by running SVM on the 4 parallel nodes.
Table III shows the final results after merging the support vectors identified by the parallel cluster nodes and applying SVM again to obtain more refined results; it also includes a performance and accuracy comparison with the single-node runs. In Table III, TPT is the total processing time of SVM on the multiple parallel cloud cluster nodes.

TABLE I. SINGLE NODE PERFORMANCE ANALYSIS

Test #  Data Size  # of Features  PT         ISV     Accuracy %
1       2000       2              14.549     804     86.2
2       5000       2              89.35      1916    84.84
3       10000      2              982.68     3620    85.12
4       16000      2              21422.22   5715    84.84
5       24000      2              79195      8407    84.97
6       4000       4              388.5193   1815    90.375
7       22400      4              53052.36   8647    85.96
8       59535      8              83517      25074   96.797
TABLE II. MULTIPLE PARALLEL NODES PERFORMANCE ANALYSIS (STEP 1)

Test #  Data Size  # of Features  Node 1 (PT / ISV)   Node 2 (PT / ISV)   Node 3 (PT / ISV)   Node 4 (PT / ISV)   TSV
1       2000       2              0.634 / 251         0.553 / 228         0.505 / 241         0.515 / 228         948
2       5000       2              8.269 / 563         8.407 / 530         8.649 / 534         8.648 / 542         2169
3       10000      2              31.021 / 1001       24.772 / 964        18.939 / 1039       20.824 / 1015       4019
4       16000      2              58.139 / 1526       61.31 / 1591        52.27 / 1577        45.71 / 1566        6260
5       24000      2              200.94 / 2303       123.21 / 2286       135.26 / 2272       227.79 / 2219       9080
6       4000       4              7.737 / 593         7.786 / 594         8.224 / 617         7.913 / 609         2413
8       22400      4              1054.898 / 2428     1231.171 / 2420     910.6977 / 2363     2246.163 / 2500     9711
9       59535      8              13931 / 7979        14037 / 8773        8606.2 / 6046       12018 / 8254        31052
TABLE III. MULTIPLE PARALLEL NODES PERFORMANCE ANALYSIS (STEP 2): MERGING THE MULTI-NODE RESULTS ON A SINGLE NODE

Test #  Data Size  # of Features  TSV    PT        ISV    Accuracy %  TPT       Efficiency %  Accuracy Effect
1       2000       2              948    4.321     721    85.3        4.955     65.94         1.04%
2       5000       2              2169   37.53     1822   84.88       46.179    49            -0.047%
3       10000      2              4019   313.1     3494   85.09       344.121   64.88         0.035%
4       16000      2              6260   2102.75   5603   84.8        2164.06   89.89         0.047%
5       24000      2              9080   4959.9    8259   85.021      5187.69   93.45         -0.06%
6       4000       4              2413   214.1918  1610   89.125      222.4164  42.75         1.30%
8       22400      4              9711   25815.7   7959   85.92       28061.87  47.1          0.10%
9       59535      8              31052  36007     24467  96.67       50044     46.01         0.131%
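The Efficiency and Accuracy Effect columns of Table III are, as far as the reported numbers indicate, relative differences against the single-node baseline of Table I; the short check below reproduces the values of test 1 under that assumption.

```python
# How the Efficiency and Accuracy Effect columns appear to be computed
# (relative change against the single-node baseline of Table I).
def efficiency(pt_single, tpt_parallel):
    return (pt_single - tpt_parallel) / pt_single * 100.0

def accuracy_effect(acc_single, acc_merged):
    return (acc_single - acc_merged) / acc_single * 100.0

# Test 1: single node PT = 14.549, accuracy = 86.2 %;
# parallel TPT = 4.955, merged accuracy = 85.3 %.
print(round(efficiency(14.549, 4.955), 2))      # ~65.94, as reported
print(round(accuracy_effect(86.2, 85.3), 2))    # ~1.04, as reported
```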
TABLE IV. COMPARISON BETWEEN OUR PROPOSED SOLUTION AND EXISTING TECHNIQUES

Amazon Cloud Clusters [4] (proposed solution): Efficiency: up to 60%; Accuracy: on average 0.20% overhead; Resources: hourly based cloud instances; Cost: pay only for what you use.
GPU Clusters [16, 11]: Efficiency: up to 80%; Accuracy: on average 0.55% overhead; Resources: physical machines; Cost: GPU maintenance cost.
Local Cascade SVM Method [13]: Efficiency: depends upon the number of iterations; Accuracy: depends upon the number of iterations; Resources: physical machines; Cost: networking cost.
Local Strongly Connected Networks [8]: Efficiency: depends upon the number of iterations; Accuracy: depends upon the number of iterations; Resources: physical machines; Cost: networking cost.
Local Single Node [5]: Efficiency: maximum processing time; Accuracy: maximum; Resources: normal physical machine.
Fig. 2 shows the accuracy comparison between the single-node results and the multiple cluster nodes in the cloud; in the graph, S-Accuracy and M-Accuracy denote the single-node accuracy and the multiple-cluster-node accuracy, respectively. Fig. 3 shows the efficiency achieved by using multiple cloud cluster nodes: the processing time is reduced by up to 60% when running SVM in parallel on multiple cluster nodes. Fig. 4 compares the total support vectors generated by running SVM on a single node with those generated on multiple cluster nodes. Table IV compares existing techniques for parallel processing of SVM on single or multiple nodes with our proposed solution; the table shows that, overall, our proposed solution is better than the existing techniques.
Figure 2. Accuracy comparison between a single node and multiple cluster nodes
Figure 3. Efficiency (reduction in processing time) achieved by running SVM in parallel on multiple cloud cluster nodes
Figure 4. Total support vectors generated on a single node versus on multiple cluster nodes
4. Conclusion
In this paper we proposed a simple approach for running SVM in parallel on cloud clusters for large dataset classification. We showed that the proposed solution is very efficient in terms of training time compared to existing techniques and that it classifies the datasets correctly with a minimal error rate. Experimental results over real-world and test datasets show that the algorithm is scalable and robust. In future work we will extend the performance evaluation by running similar experiments on other IaaS providers and on other real large-scale platforms, such as grids and commodity clusters.
References
[1] F. Schatz, S. Koschnicke, N. Paulsen, C. Starke, and M. Schimmler, "MPI Performance Analysis of Amazon EC2 Cloud Services for High Performance Computing," in A. Abraham et al. (Eds.): ACC 2011, Part I, CCIS 190, Springer-Verlag Berlin Heidelberg, 2011, pp. 371–381.
[2] S. Ostermann, A. Iosup, N. Yigitbasi, R. Prodan, T. Fahringer, and D. Epema, "A Performance Analysis of EC2 Cloud Computing Services for Scientific Computing," in D. R. Avresky et al. (Eds.): Cloudcomp 2009, LNICST 34, Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, 2010, pp. 115–131.
[3] Amazon Elastic Compute Cloud (Amazon EC2): http://aws.amazon.com/ec2/
[4] High Performance Computing (HPC) on AWS Clusters: http://aws.amazon.com/hpcapplications/
[5] G. Zanghirati and L. Zanni, "A parallel solver for large quadratic programs in training support vector machines," Parallel Comput., vol. 29, pp. 535–551, Nov. 2003.
[6] C. Caragea, D. Caragea, and V. Honavar, "Learning support vector machine classifiers from distributed data sources," in Proc. 20th Nat. Conf. Artif. Intell. Student Abstract and Poster Program, Pittsburgh, PA, 2005, pp. 1602–1603.
[7] A. Navia-Vazquez, D. Gutierrez-Gonzalez, E. Parrado-Hernandez, and J. Navarro-Abellan, "Distributed support vector machines," IEEE Trans. Neural Netw., vol. 17, no. 4, pp. 1091–1097, Jul. 2006.
[8] Y. Lu, V. Roychowdhury, and L. Vandenberghe, "Distributed parallel support vector machines in strongly connected networks," IEEE Trans. Neural Netw., vol. 19, no. 7, Jul. 2008.
[9] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001, software available at http://www.csie.ntu.edu.tw/cjlin/libsvm
[10] B. Catanzaro, N. Sundaram, and K. Keutzer, "Fast support vector machine training and classification on graphics processors," in ICML '08: Proceedings of the 25th International Conference on Machine Learning, New York, NY, USA: ACM, 2008, pp. 104–111.
[11] S. Herrero-Lopez, J. R. Williams, and A. Sanchez, "Parallel multiclass classification using SVMs on GPUs," in GPGPU '10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, New York, NY, USA: ACM, 2010, pp. 2–11.
[12] L. Cao, S. Keerthi, C.-J. Ong, J. Zhang, U. Periyathamby, X. J. Fu, and H. Lee, "Parallel sequential minimal optimization for the training of support vector machines," IEEE Trans. Neural Netw., vol. 17, pp. 1039–1049, 2006.
[13] H. P. Graf, E. Cosatto, L. Bottou, I. Dourdanovic, and V. Vapnik, "Parallel support vector machines: the cascade SVM," in Advances in Neural Information Processing Systems 17 (L. K. Saul, Y. Weiss, and L. Bottou, Eds.), Cambridge, MA: MIT Press, 2005, pp. 521–528.
[14] G. Wu, E. Chang, Y. K. Chen, and C. Hughes, "Incremental approximate matrix factorization for speeding up support vector machines," in KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA: ACM Press, 2006, pp. 760–766.
[15] L. Zanni, T. Serafini, and G. Zanghirati, "Parallel software for training large scale support vector machines on multiprocessor systems," J. Mach. Learn. Res., vol. 7, pp. 1467–1492, 2006.
[16] Q. Li, R. Salman, and V. Kecman, "An intelligent system for accelerating parallel SVM classification problems on large datasets using GPU," in Proc. 10th International Conference on Intelligent Systems Design and Applications, 2010.