Scalable Fast Parallel SVM on Cloud Clusters for Large Datasets Classification

Ghazanfar Latif, Rafiul Hassan
Computer Science Department, College of Computer Science Engineering, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
[email protected]

Abstract: A support vector machine (SVM) is a supervised learning method that analyzes data and recognizes patterns, used for classification and regression analysis. Training and testing an SVM on large multidimensional datasets requires substantial computing resources in terms of memory and computational power. We propose a scalable and cost-effective technique for running SVM in parallel on distributed cloud cluster nodes, which reduces both memory requirements and computational demand. We divide the dataset into n equal parts and process each part on a separate cluster node. The support vectors produced by all cluster nodes are then collected on the master node, and the SVM algorithm is applied to them again. We tested our solution on several publicly available datasets using a local single-node machine, HPC clusters, and Amazon cloud clusters, and compared the results in terms of efficiency and accuracy. We show that our proposed solution is very efficient in terms of training time compared to existing techniques and that it classifies the datasets correctly with a minimal error rate.

Keywords: Support Vector Machine; Cloud Computing; Parallel SVM; Cluster Computing; Amazon Web Services.

1. Introduction

Distributed parallel classification [7] is gaining importance due to the geographical, physical, and computational constraints of centralized systems. The support vector machine (SVM), which implements the principle of structural risk minimization, is one of the most popular classification algorithms in many fields. The SVM classification approach is based on hyperplanes that act as class separators in common binary classification tasks, such as spam versus ham in spam filtering. With properly selected parameters, SVM generally achieves better predictive accuracy than other statistical classification algorithms, e.g., k-nearest neighbors and linear discriminant analysis. Nevertheless, the training phase of an SVM, especially a nonlinear kernel-based SVM on large datasets, is much more computationally expensive. Scientific computing requires very high processing power to produce good results for growing problem sizes in a reasonable time frame. Only a few universities and research centers can afford expensive supercomputers [] on site for research on very large projects that require substantial processing power; due to the lack of high-performance computing labs in many institutions, many researchers are unable to work on these types of projects.

A. Distributed Computing

It is possible to run a computer program on more than one computer simultaneously by dividing the program into tasks, one for each computer. Distributed computing is a networked system in which the overall task is divided into many subtasks [15], each solved by one node (computer). The nodes communicate by message passing. A distributed system can consist of different kinds of computers, networks, network topologies, and numbers of computers, and it also has to tolerate failures. Distributed computing offers better performance and reliability: it uses more than one CPU and other resources in parallel [6], and together those resources form a supercomputer. It is also more reliable, since tasks run on particular servers and most security issues can be confined to those servers. The main goal of a distributed system is to coordinate the sharing of resources.

B. Cloud Computing

Cloud computing [1] proposes an alternative in which resources are no longer hosted by the researcher's own computational facilities but leased from large data centers only when needed. Despite the existence of several cloud computing vendors, such as Amazon, the potential of clouds remains largely unexplored. To address this issue, in this paper we present a performance analysis of cloud computing services for scientific computing. There is no need to install an operating system or applications, nor to perform updates or maintenance, and computing costs can be greatly reduced because users can add or remove resources on demand. The cloud computing paradigm therefore holds good promise for the performance-hungry scientific community. Clouds [2] promise to be a cheap alternative to supercomputers and specialized clusters, a much more reliable platform than grids, and a much more scalable platform than the largest commodity clusters or resource pools. The organization of the paper is as follows. Section 2 briefly describes the SVM classification problem and how it is distributed on cloud clusters. Section 3 presents detailed experimental results of our proposed solution. Section 4 concludes the paper.

2. Distributing SVM on Cloud Clusters

In this paper, we show how support vector machine training and classification can be adapted to a highly parallel, yet widely available and affordable, computing platform. Eliminating non-support vectors early from the optimization is an effective strategy for accelerating SVMs; using this concept, we developed a filtering process that parallelizes SVM training efficiently. We designed a performance evaluation method that allows an assessment of parallel SVM on cloud cluster nodes connected by a high-speed network for large-dataset classification. We perform all our measurements on the EC2 [3] environment.
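As an illustration of how such cluster nodes could be provisioned and torn down programmatically, the sketch below starts N EC2 instances and later shuts down every node except the master. It is a hypothetical example, not the authors' actual script: the AMI id, instance type, and key name are placeholders, and it assumes the boto3 AWS SDK for Python with configured credentials.

```python
# Hypothetical cluster provisioning sketch (not the authors' script).
# Assumes boto3 and configured AWS credentials; the AMI id, instance type,
# and key name below are placeholders.
import boto3

def start_cluster(n_nodes, ami="ami-xxxxxxxx", instance_type="cc2.8xlarge", key_name="my-key"):
    ec2 = boto3.client("ec2")
    resp = ec2.run_instances(ImageId=ami, InstanceType=instance_type,
                             MinCount=n_nodes, MaxCount=n_nodes, KeyName=key_name)
    ids = [inst["InstanceId"] for inst in resp["Instances"]]
    ec2.get_waiter("instance_running").wait(InstanceIds=ids)
    return ids[0], ids[1:]          # the first node acts as the master node

def shutdown_workers(worker_ids):
    # After the support vectors are collected, turn off every node except the master.
    boto3.client("ec2").terminate_instances(InstanceIds=worker_ids)
```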

Algorithm: Parallel SVM on cloud clusters works as follows.

1. Start N cluster nodes on the cloud server using an automated script for EC2 HPC instances [4]. The first node works as the master node.

2. Each cluster node processes its equally distributed part of the dataset: node i processes the chunk of D covering records i·L/N through (i+1)·L/N − 1, where D is the input dataset, L is the size of the dataset, and i = 0, 1, 2, …, N−1.

3. After processing, all the identified support vectors are collected on the master node, and the automated script turns off all cloud cluster nodes except the master node.

4. The master node takes the collected support vectors as input and applies SVM again to get more refined results (Figure 1).

Figure 1. Proposed architecture: the input dataset d is distributed equally (d/n per node) across cluster nodes #1 … #n, each node produces its identified support vectors (SV-1 … SV-n), and these are merged on the master cluster node into a new support vector set (NewSV).
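A minimal, single-machine sketch of these four steps is given below. It is not the authors' implementation: it uses scikit-learn's SVC in place of whatever SVM library was actually used, and it simulates the n cluster nodes with local processes, whereas in the proposed architecture each partition is trained on a separate EC2 node and only the resulting support vectors are sent to the master.

```python
# Minimal sketch of the two-stage parallel SVM idea, simulating cluster nodes
# with local processes. The SVM library (scikit-learn) and kernel settings are
# assumptions; the paper does not specify them.
from multiprocessing import Pool
import numpy as np
from sklearn.svm import SVC

def train_partition(args):
    """Train an SVM on one dataset partition and return its support vectors."""
    X_part, y_part = args
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X_part, y_part)
    idx = clf.support_                     # indices of the identified support vectors (ISV)
    return X_part[idx], y_part[idx]

def parallel_svm(X, y, n_nodes=4):
    # Step 2: split the dataset into n equal parts, one per cluster node.
    parts = list(zip(np.array_split(X, n_nodes), np.array_split(y, n_nodes)))
    # Steps 2-3: train on every partition "node" and collect the support vectors.
    with Pool(n_nodes) as pool:
        results = pool.map(train_partition, parts)
    X_sv = np.vstack([xs for xs, _ in results])        # merged support vectors (TSV)
    y_sv = np.concatenate([ys for _, ys in results])
    # Step 4: the master node retrains an SVM on the merged support vectors only.
    master = SVC(kernel="rbf", C=1.0, gamma="scale")
    master.fit(X_sv, y_sv)
    return master

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)            # toy 2-feature dataset
    model = parallel_svm(X, y, n_nodes=4)
    print("final support vectors:", model.support_vectors_.shape[0])
```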

3. Proposed System Experimental Results

The single-node experimental results are shown in Table I, where PT is the processing time on a single node and ISV is the number of identified support vectors. Table II shows the initial results of our proposed method: we started 4 cloud cluster nodes with the same specification and gave them the same datasets as input. In this table, TSV is the total number of support vectors generated by running SVM on the 4 parallel nodes.

Table III shows the final results after merging the identified support vectors from the parallel cluster nodes and applying SVM again to get more refined results; a performance and accuracy comparison is also included. In Table III, TPT is the total processing time of SVM on the multiple parallel cloud cluster nodes.

TABLE I. SINGLE NODE PERFORMANCE ANALYSIS

Test # | Data Size | # of Features | PT       | ISV   | Accuracy %
1      | 2000      | 2             | 14.549   | 804   | 86.2
2      | 5000      | 2             | 89.35    | 1916  | 84.84
3      | 10000     | 2             | 982.68   | 3620  | 85.12
4      | 16000     | 2             | 21422.22 | 5715  | 84.84
5      | 24000     | 2             | 79195    | 8407  | 84.97
6      | 4000      | 4             | 388.5193 | 1815  | 90.375
7      | 22400     | 4             | 53052.36 | 8647  | 85.96
8      | 59535     | 8             | 83517    | 25074 | 96.797

TABLE II. MULTIPLE PARALLEL NODES PERFORMANCE ANALYSIS (STEP 1)

Multi Node Parallel Clusters
Test # | Data Size | # of Features | Node 1 PT | Node 1 ISV | Node 2 PT | Node 2 ISV | Node 3 PT | Node 3 ISV | Node 4 PT | Node 4 ISV | TSV
1      | 2000      | 2 | 0.634    | 251  | 0.553    | 228  | 0.505    | 241  | 0.515    | 228  | 948
2      | 5000      | 2 | 8.269    | 563  | 8.407    | 530  | 8.649    | 534  | 8.648    | 542  | 2169
3      | 10000     | 2 | 31.021   | 1001 | 24.772   | 964  | 18.939   | 1039 | 20.824   | 1015 | 4019
4      | 16000     | 2 | 58.139   | 1526 | 61.31    | 1591 | 52.27    | 1577 | 45.71    | 1566 | 6260
5      | 24000     | 2 | 200.94   | 2303 | 123.21   | 2286 | 135.26   | 2272 | 227.79   | 2219 | 9080
6      | 4000      | 4 | 7.737    | 593  | 7.786    | 594  | 8.224    | 617  | 7.913    | 609  | 2413
8      | 22400     | 4 | 1054.898 | 2428 | 1231.171 | 2420 | 910.6977 | 2363 | 2246.163 | 2500 | 9711
9      | 59535     | 8 | 13931    | 7979 | 14037    | 8773 | 8606.2   | 6046 | 12018    | 8254 | 31052

TABLE III. MULTIPLE PARALLEL NODES PERFORMANCE ANALYSIS (STEP 2)

Multi Node Parallel Clusters (P2): Merging Results of Multi Node to Single Node
Test # | Data Size | # of Features | TSV   | PT       | ISV   | Accuracy | TPT      | Efficiency | Accuracy Effect
1      | 2000      | 2 | 948   | 4.321    | 721   | 85.3   | 4.955    | 65.94 | 1.04%
2      | 5000      | 2 | 2169  | 37.53    | 1822  | 84.88  | 46.179   | 49    | -0.047%
3      | 10000     | 2 | 4019  | 313.1    | 3494  | 85.09  | 344.121  | 64.88 | 0.035%
4      | 16000     | 2 | 6260  | 2102.75  | 5603  | 84.8   | 2164.06  | 89.89 | 0.047%
5      | 24000     | 2 | 9080  | 4959.9   | 8259  | 85.021 | 5187.69  | 93.45 | -0.06%
6      | 4000      | 4 | 2413  | 214.1918 | 1610  | 89.125 | 222.4164 | 42.75 | 1.30%
8      | 22400     | 4 | 9711  | 25815.7  | 7959  | 85.92  | 28061.87 | 47.1  | 0.10%
9      | 59535     | 8 | 31052 | 36007    | 24467 | 96.67  | 50044    | 46.01 | 0.131%
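The Efficiency and Accuracy Effect columns are not defined by explicit formulas in the text, but the tabulated values are consistent with efficiency being the percentage reduction in processing time relative to the single-node run in Table I, and the accuracy effect being the relative accuracy change. The small check below (an inference from the numbers, not the authors' stated definition) reproduces the row for test #1.

```python
# Inferred relations between Table I and Table III (not stated explicitly in the
# paper), checked against test #1: single-node PT = 14.549 and accuracy = 86.2,
# multi-node TPT = 4.955 and merged accuracy = 85.3.
def efficiency(single_pt, tpt):
    """Percentage of single-node processing time saved by the parallel run."""
    return (single_pt - tpt) / single_pt * 100

def accuracy_effect(single_acc, merged_acc):
    """Relative accuracy change (%) introduced by the two-stage training."""
    return (single_acc - merged_acc) / single_acc * 100

print(round(efficiency(14.549, 4.955), 2))      # 65.94, matching the Efficiency column
print(round(accuracy_effect(86.2, 85.3), 2))    # 1.04, matching the Accuracy Effect column
```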

TABLE IV. COMPARISON BETWEEN OUR PROPOSED SOLUTION AND EXISTING TECHNIQUES

Type of Infrastructure                | Efficiency                       | Accuracy                         | Resources               | Cost
Amazon Cloud Clusters [4]             | Up to 60%                        | On average 0.20% overhead        | Hourly based            | Pay only what you use
GPU Clusters [16, 11]                 | Up to 80%                        | On average 0.55% overhead        | Physical machines       | GPU maintenance cost
Local Cascade SVM Method [13]         | Depends upon the # of iterations | Depends upon the # of iterations | Physical machines       | Networking cost
Local Strongly Connected Networks [8] | Depends upon the # of iterations | Depends upon the # of iterations | Physical machines       | Networking cost
Local Single Node [5]                 | Maximum time                     | Maximum efficiency               | Normal physical machine |

Fig. 2 compares the accuracy of the single-node results and the multiple cluster nodes in the cloud; in the graph, S-Accuracy and M-Accuracy denote single-node and multi-node accuracy, respectively. Fig. 3 shows the efficiency achieved by using multiple cloud cluster nodes: the processing time is reduced by up to 60% by running SVM in parallel on multiple cluster nodes. Fig. 4 compares the total support vectors generated by running SVM on a single node with those generated on multiple cluster nodes. Table IV compares existing techniques for parallel processing of SVM on single or multiple nodes with our proposed solution; it shows that, overall, our proposed solution is better than the existing techniques.

Figure 2. Accuracy comparison between single node and multiple nodes

Figure 3. Efficiency achieved using multiple cloud cluster nodes

Figure 4. Total support vectors generated on a single node and on multiple cluster nodes

4. Conclusion

In this paper we proposed a simple approach to parallel SVM on cloud clusters for large-dataset classification. We showed that our proposed solution is very efficient in terms of training time compared to existing techniques and that it classifies the datasets correctly with a minimal error rate. Experimental results on real-world and test datasets show that the algorithm is scalable and robust. We will extend the performance evaluation by running similar experiments on other IaaS providers and on other real large-scale platforms, such as grids and commodity clusters.

References

[1] F. Schatz, S. Koschnicke, N. Paulsen, C. Starke, and M. Schimmler, "MPI performance analysis of Amazon EC2 cloud services for high performance computing," in A. Abraham et al. (Eds.): ACC 2011, Part I, CCIS 190, pp. 371–381, Springer-Verlag Berlin Heidelberg, 2011.
[2] S. Ostermann, A. Iosup, N. Yigitbasi, R. Prodan, T. Fahringer, and D. Epema, "A performance analysis of EC2 cloud computing services for scientific computing," in D. R. Avresky et al. (Eds.): Cloudcomp 2009, LNICST 34, pp. 115–131, 2010.
[3] Amazon Elastic Compute Cloud (Amazon EC2): http://aws.amazon.com/ec2/
[4] High Performance Computing (HPC) on AWS Clusters: http://aws.amazon.com/hpcapplications/
[5] G. Zanghirati and L. Zanni, "A parallel solver for large quadratic programs in training support vector machines," Parallel Computing, vol. 29, pp. 535–551, Nov. 2003.
[6] C. Caragea, D. Caragea, and V. Honavar, "Learning support vector machine classifiers from distributed data sources," in Proc. 20th Nat. Conf. Artificial Intelligence, Student Abstract and Poster Program, Pittsburgh, PA, 2005, pp. 1602–1603.
[7] A. Navia-Vazquez, D. Gutierrez-Gonzalez, E. Parrado-Hernandez, and J. Navarro-Abellan, "Distributed support vector machines," IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 1091–1097, Jul. 2006.
[8] Y. Lu, V. Roychowdhury, and L. Vandenberghe, "Distributed parallel support vector machines in strongly connected networks," IEEE Transactions on Neural Networks, vol. 19, no. 7, Jul. 2008.
[9] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm
[10] B. Catanzaro, N. Sundaram, and K. Keutzer, "Fast support vector machine training and classification on graphics processors," in Proc. 25th International Conference on Machine Learning (ICML '08), New York, NY, USA: ACM, 2008, pp. 104–111.
[11] S. Herrero-Lopez, J. R. Williams, and A. Sanchez, "Parallel multiclass classification using SVMs on GPUs," in Proc. 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU '10), New York, NY, USA: ACM, 2010, pp. 2–11.
[12] L. Cao, S. Keerthi, C.-J. Ong, J. Zhang, U. Periyathamby, X. J. Fu, and H. Lee, "Parallel sequential minimal optimization for the training of support vector machines," IEEE Transactions on Neural Networks, vol. 17, pp. 1039–1049, 2006.
[13] H. P. Graf, E. Cosatto, L. Bottou, I. Dourdanovic, and V. Vapnik, "Parallel support vector machines: the Cascade SVM," in L. K. Saul, Y. Weiss, and L. Bottou (Eds.), Advances in Neural Information Processing Systems 17, pp. 521–528, Cambridge, MA: MIT Press, 2005.
[14] G. Wu, E. Chang, Y. K. Chen, and C. Hughes, "Incremental approximate matrix factorization for speeding up support vector machines," in Proc. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06), New York, NY, USA: ACM, 2006, pp. 760–766.
[15] L. Zanni, T. Serafini, and G. Zanghirati, "Parallel software for training large scale support vector machines on multiprocessor systems," Journal of Machine Learning Research, vol. 7, pp. 1467–1492, 2006.
[16] Q. Li, R. Salman, and V. Kecman, "An intelligent system for accelerating parallel SVM classification problems on large datasets using GPU," in Proc. 10th International Conference on Intelligent Systems Design and Applications (ISDA), 2010.
