European Journal of Scientific Research ISSN 1450-216X Vol.27 No.3 (2009), pp.313-321 © EuroJournals Publishing, Inc. 2009 http://www.eurojournals.com/ejsr.htm

A Grid-Based Distributed SVM Data Mining Algorithm

Ali Meligy
Middle East University for Graduate Studies, Faculty of Information Technology, Amman, Jordan
E-mail: [email protected]

Manar Al-Khatib
Middle East University for Graduate Studies, Faculty of Information Technology, Amman, Jordan

Abstract

Distributing data and computation allows larger problems to be solved and supports applications that are distributed in nature. In this paper we present a grid-based distributed Support Vector Machine (SVM) algorithm. The Grid is a distributed computing infrastructure that enables coordinated resource sharing within dynamic organizations consisting of individuals, institutions, and resources. Grid environments can be used both for compute-intensive tasks and for data-intensive applications, as they offer resources, services, and data access mechanisms. Data mining algorithms and knowledge discovery processes are both compute and data intensive; the Grid can therefore offer a computing and data management infrastructure for supporting decentralized and parallel data analysis. The SVM algorithm is implemented in C and MPI.

Keywords: Data Mining, Distributed Computing, SVM-Algorithm.

1. Introduction

A grid is a distributed system that enables the sharing, selection, and aggregation of geographically distributed "autonomous" resources dynamically at runtime, depending on their availability, capability, performance, cost, and the user's quality-of-service requirements. Grids use the resources of many separate computers connected by a network (usually the Internet) to solve large-scale computational problems. Grids provide the ability to perform computations on large data sets by breaking them down into many smaller ones, or the ability to perform many more computations at once than would be possible on a single computer, by modeling a parallel division of labor between processes. Grid computing is optimized for workloads that consist of many independent jobs or packets of work, which do not have to share data during the computation. Grids manage the allocation of jobs to computers, which perform the work independently of the rest of the grid cluster. Resources such as storage may be shared by all the nodes, but intermediate results of one job do not affect jobs in progress on other nodes of the grid [Mel06]. Grid technology allows the creation of a computing system with extreme performance and capacity from geographically distributed computational and memory resources. Such a system has the features of a global, worldwide computer in which all of the components are connected via the Internet. To the user it appears like a common workstation, but segments of a submitted task are computed in different parts

of the system. The main advantage of the Grid is the high efficiency with which it uses the available technological capacity: the combined potential of its users, together with safety, reliability, effectiveness, and a high level of portability for computational applications. The main building block of the Grid is the network. Geographically distributed resources are linked together via networks, which also allow them to be used collectively. Networks connect the resources on the Grid, the most prevalent of which are computers with data storage. Computational elements can be of any level of power and capability; some Grids involve nodes that are high-performance machines or clusters. These Grid nodes provide major resources for simulation, analysis, data mining, text mining, and other activities [Tal06]. Grid computing represents the natural evolution of the distributed computing infrastructure that enables coordinated resource sharing within dynamic organizations consisting of individuals, institutions, and resources. The main aim of grid computing is to give organizations and application developers the ability to create distributed computing environments that can utilize computing resources on demand. Grid computing can leverage the computing power of large numbers of server computers, desktop PCs, clusters, and other kinds of hardware. It can therefore help increase efficiency and reduce the cost of computing networks by decreasing data processing time, optimizing resources, and distributing workloads, allowing users to achieve much faster results on large operations at lower cost.
As the Grid is becoming a well-accepted computing infrastructure in science and industry, it is necessary to provide general data mining services, algorithms, and applications that help analysts, scientists, organizations, and professionals leverage Grid capacity for high-performance distributed computing to solve their data mining problems in a distributed way [Tal06]. Data mining technology has emerged as a means for identifying patterns and trends in large quantities of data [Wit00], [Han01]. Data mining and data warehousing go hand-in-hand: most tools operate on the principle of gathering all data into a central site, then running an algorithm against that data (Figure 1). There are a number of applications that are infeasible under such a methodology, leading to a need for distributed data mining [Bae99]. The obvious solution of a "virtual" data warehouse (heterogeneous access to all the data) is not always possible. The problem is not simply that the data is distributed, but that it must be distributed. There are several situations where this arises:
1. Connectivity. Transmitting large quantities of data to a central site may be infeasible.
2. Heterogeneity of sources. It may be easier to combine results than to combine sources.
3. Privacy of sources. Organizations may be willing to share data mining results, but not data.

Figure 1: Data Warehouse approach to Distributed Data Mining

Distributed data mining in environments such as virtual organization networks, the Internet, corporate intranets, sensor networks, and other decentralized infrastructures questions the suitability of centralized KDD architectures for large-scale knowledge discovery in a network environment. Distributed data mining analyzes data in a distributed fashion and pays careful attention to the trade-off between centralized collection and distributed analysis of data [Cli02], [Jan06]. Distributed computing plays an important role in the data mining process for several reasons. First, data mining often requires huge amounts of resources in storage space and computation time; to make systems scalable, it is important to develop mechanisms that distribute the workload among several sites in a flexible way. Second, data is often inherently distributed across several databases, making centralized processing of this data very inefficient and prone to security risks. Distributed data mining explores techniques for applying data mining in a non-centralized way [DO06].

2. Support Vector Machine Algorithm (SVM)

In recent years, the Support Vector Machine (SVM) has met with significant success in numerous real-world learning tasks. It is based on the Structural Risk Minimization principle, for which error-bound analyses have been theoretically motivated [Mou97], [Ton01]. However, like most machine learning algorithms, SVMs are generally applied using a randomly selected training set classified in advance [Dhi02]. The method is defined over a vector space in which the problem is to find a decision surface that "best" separates the data vectors into two classes [Mou97]. In its simplest linear form, an SVM is a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin [Tan06]. For instance, in Figure 2 we can see hyperplanes that separate the training data by a maximal margin. All vectors lying on one side of the hyperplane are labeled as -1, and all vectors lying on the other side are labeled as +1. The training instances that lie closest to the hyperplane are called support vectors [Dhi02].

Figure 2: (a) A simple linear support vector machine. (b) An SVM (dotted line) and transductive SVM (solid line). Solid circles represent unlabeled instances.
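The maximum-margin idea can be stated concretely. In the standard hard-margin formulation, which the paper does not spell out, the separating hyperplane w . x - b = 0 is found by solving

    minimize    (1/2) ||w||^2
    subject to  y_i (w . x_i - b) >= 1   for all training pairs (x_i, y_i), with y_i in {-1, +1},

so that the two supporting hyperplanes w . x - b = +1 and w . x - b = -1 lie at a distance 2/||w|| from each other; maximizing the margin is therefore equivalent to minimizing ||w||.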

The formula for the output of a linear SVM is u = w . x, where w is the normal vector to the hyperplane and x is the input vector. In the linear case, the margin is defined by the distance of the hyperplane to the nearest of the positive and negative examples [Tan06]. The hyperplane can be found in the original dataset (referred to as a linear SVM), or it can be found in a higher-dimensional space by transforming the dataset into a representation having more dimensions (input variables) than the original dataset (referred to as a nonlinear SVM). Mapping the dataset into a higher-dimensional space in this way, and then reducing the problem to a linear one, provides a simple solution.

An advantage of the method is that the model only deals with the support vectors, rather than the whole training dataset, so the size of the training set is not usually an issue. A disadvantage is that the algorithm is sensitive to the choice of parameter settings, making it harder to use [Tan06].

3. Grid-based Support Vector Machine

Some data mining methods are time-consuming, and use of the Grid infrastructure can bring significant benefits. Implementing mining techniques in a distributed environment allows us to access different geographically distributed data collections and to perform mining tasks in a parallel and distributed fashion. SVM is one common method used in data mining to extract predictive information. Grid computing is regarded as the extension of PC clusters, and its future development is therefore highly valued. Basically, a Grid is a very large-scale, generalized distributed computing system that can scale to Internet-size environments, with machines distributed across multiple organizations and administrative domains [Tsa04].
The main effort of this section is to give a theoretical description of how the SVM algorithm can be modified into a grid-based distributed data mining version. Since grid computing has attracted a lot of attention, several grid-based data mining infrastructures have been proposed, such as the Knowledge Grid (K-Grid) and NASA's Information Power Grid (IPG); both provide input on which data mining algorithms work to extract new knowledge. The objective is to build the next-generation computing infrastructure, providing intensive analysis and integration over the World Wide Web. In addition, the Data Grid project developed by the European Union focuses on core Grid data services, such as data access and metadata access; it can process data sets ranging from hundreds of terabytes to petabytes. In general, basic Grid technology services include security, scheduling, data service, database service, user service, application management, autonomy and monitoring, information service, composition service, and message service. Grid architecture is by now mature, and we will not elaborate the various service details here. Our architecture is shown in Figure 3.
We use the open-source Globus Toolkit as Grid middleware. It provides three functions, namely resource management, data management, and information services, built on the GSI (Grid Security Infrastructure). Data exchange protocols, allocation, control, and security in the Grid environment all depend on this middleware. Web Services provide XML and HTTP services: a corresponding suite of XML schemas describes the objects and services associated with the application, and the particular application is built on top of them. Users submit instructions to the SVM application and, through Globus, find suitable resources. The workload is estimated according to resource efficiency. Next, the attribute data and its description are specified in XML; note that the attribute data may already have been sorted by task type. All the work is processed through Web Services. The XML documents not only contain attribute data but also describe the task, including some parameters for running the algorithm. Finally, Globus sends the resulting documents back to the client side [Tsa04].

Figure 3: Grid-Based SVM
The PSVM Algorithm

We employ a parallel SVM (PSVM) algorithm and implement its parallelization. The type of parallelism reflects the structure of either the application or the data, and both types may exist in different parts of the same application. Parallelism arising from the structure of the application is called functional parallelism: different parts of the program perform different tasks in a concurrent and cooperative manner. Parallelism may also be found in the structure of the data. Parallel applications can be classified into a few well-defined programming paradigms, which are used repeatedly to develop many parallel programs. Each paradigm is a class of algorithms that share the same control structure. The choice of paradigm is determined by the available parallel computing resources and by the type of parallelism inherent in the problem. As before, the output of a linear SVM is u = w . x, where w is the normal vector to the hyperplane and x is the input vector.

Task Farming Paradigm (Master/Slave)

The task farming paradigm consists of two entities: a master and multiple slaves. The master is responsible for:
• Decomposing the problem into small tasks and distributing these tasks among a farm of slave processes.
• Gathering the partial results in order to produce the final result of the computation.
The slave processes execute a very simple cycle:
• Get a message with the task.
• Process the task.
• Send the result to the master.
Communication takes place only between the master and the slaves (Figure 4). Task farming may use either static or dynamic load balancing. In static load balancing, the distribution of tasks is performed entirely at the beginning of the computation, which allows

the master to participate in the computation after each slave has been allocated its fraction of the work. Task farming can achieve high computational speedups and an interesting degree of scalability. For large numbers of processors, however, the centralized control of the master process can become a bottleneck. It is possible to enhance the scalability of the paradigm by extending the single master to a set of masters, each controlling a different group of slave processes [Roo00].

Figure 4: Static Load Balancing (Master/Slave)

Sequential SVM Algorithm code:

1. Read the input data "training sample weight" from the file and store it in an array:
   while (readfromfile >> trainingdata[j]) { j = j + 1; }

2. Read the 8 classes of "test data weight" from the files and store them in arrays:
   while (readfromfile1 >> class1[i1]) { i1 = i1 + 1; }
   while (readfromfile2 >> class2[i2]) { i2 = i2 + 1; }
   while (readfromfile3 >> class3[i3]) { i3 = i3 + 1; }
   while (readfromfile4 >> class4[i4]) { i4 = i4 + 1; }
   while (readfromfile5 >> class5[i5]) { i5 = i5 + 1; }
   while (readfromfile6 >> class6[i6]) { i6 = i6 + 1; }
   while (readfromfile7 >> class7[i7]) { i7 = i7 + 1; }
   while (readfromfile8 >> class8[i8]) { i8 = i8 + 1; }

3. Merge the training array with each of the 8 classes of test data:
   for x1 = 1 to i1 do { merge1[x1] = trainingdata[x1] * class1[x1]; }

   for x2 = 1 to i2 do { merge2[x2] = trainingdata[x2] * class2[x2]; }
   for x3 = 1 to i3 do { merge3[x3] = trainingdata[x3] * class3[x3]; }
   for x4 = 1 to i4 do { merge4[x4] = trainingdata[x4] * class4[x4]; }
   for x5 = 1 to i5 do { merge5[x5] = trainingdata[x5] * class5[x5]; }
   for x6 = 1 to i6 do { merge6[x6] = trainingdata[x6] * class6[x6]; }
   for x7 = 1 to i7 do { merge7[x7] = trainingdata[x7] * class7[x7]; }
   for x8 = 1 to i8 do { merge8[x8] = trainingdata[x8] * class8[x8]; }

4. Find the summation of each of the 8 merge arrays:
   for z1 = 1 to i1 do { sum1 = sum1 + merge1[z1]; }
   for z2 = 1 to i2 do { sum2 = sum2 + merge2[z2]; }
   for z3 = 1 to i3 do { sum3 = sum3 + merge3[z3]; }
   for z4 = 1 to i4 do { sum4 = sum4 + merge4[z4]; }
   for z5 = 1 to i5 do { sum5 = sum5 + merge5[z5]; }
   for z6 = 1 to i6 do { sum6 = sum6 + merge6[z6]; }
   for z7 = 1 to i7 do { sum7 = sum7 + merge7[z7]; }
   for z8 = 1 to i8 do { sum8 = sum8 + merge8[z8]; }

5. Store the 8 summations in an array sum[1..8] and find the maximum value among them; the maximum identifies the class label for the training data:
   max = sum[1];
   for s = 2 to 8 do
       if sum[s] > max then max = sum[s];

6. The class label for the training data is the class with the maximum value:
   classlabel = max;

4. Distributed SVM Algorithm Code

The distributed SVM algorithm is written in C and MPI (Message Passing Interface), so the calculation can be distributed among many processes working in parallel, saving time.

Code:

Master Process:
1. Read the input data "training sample weight" from the file and store it in an array:
   while (readfromfile >> trainingdata[j])

   { j = j + 1; }

2. Send the training data, with its size, to each of the eight slave processes:
   MPI_Send(trainingdata, j, MPI_INT, 1, 99, comm);
   MPI_Send(trainingdata, j, MPI_INT, 2, 99, comm);
   MPI_Send(trainingdata, j, MPI_INT, 3, 99, comm);
   MPI_Send(trainingdata, j, MPI_INT, 4, 99, comm);
   MPI_Send(trainingdata, j, MPI_INT, 5, 99, comm);
   MPI_Send(trainingdata, j, MPI_INT, 6, 99, comm);
   MPI_Send(trainingdata, j, MPI_INT, 7, 99, comm);
   MPI_Send(trainingdata, j, MPI_INT, 8, 99, comm);

3. Receive the output result from each slave process (each receive names its own source rank):
   MPI_Recv(&sum1, 1, MPI_INT, 1, 99, comm, &status1);
   MPI_Recv(&sum2, 1, MPI_INT, 2, 99, comm, &status2);
   MPI_Recv(&sum3, 1, MPI_INT, 3, 99, comm, &status3);
   MPI_Recv(&sum4, 1, MPI_INT, 4, 99, comm, &status4);
   MPI_Recv(&sum5, 1, MPI_INT, 5, 99, comm, &status5);
   MPI_Recv(&sum6, 1, MPI_INT, 6, 99, comm, &status6);
   MPI_Recv(&sum7, 1, MPI_INT, 7, 99, comm, &status7);
   MPI_Recv(&sum8, 1, MPI_INT, 8, 99, comm, &status8);

4. Store the 8 values in an array sum[1..8] and find the maximum among them; the maximum identifies the class label:
   max = sum[1];
   for s = 2 to 8 do { if sum[s] > max then max = sum[s]; }

Slave Process:
1. Receive the training data from the master process and store it in an array:
   MPI_Recv(training, j, MPI_INT, 0, 99, comm, &status);

2. Read this slave's class ("test data weight") from its file and store it in an array (class 1 shown; slaves 2-8 are identical up to the file name):
   while (readfromfile1 >> class1[i1]) { i1 = i1 + 1; }

3. Merge the training data with the class and store the result in an array:
   for x1 = 1 to i1 do { merge1[x1] = training[x1] * class1[x1]; }

4. Find the summation of the merge array:
   for z1 = 1 to i1 do { sum1 = sum1 + merge1[z1]; }

5. Send the summation back to the master process (with the same tag, 99, that the master expects):
   MPI_Send(&sum1, 1, MPI_INT, 0, 99, comm);

5. Conclusion

We have presented a grid-based distributed SVM algorithm. Grid computing enables coordinated resource sharing within dynamic organizations consisting of individuals, institutions, and

resources. The Grid can offer a computing and data management infrastructure that supports decentralized and parallel data analysis. The algorithm has been implemented in C and MPI. Detailed testing is still needed to investigate the efficiency of the algorithm. Interesting extensions include multiple classification and parameter optimization for multi-class applications.

References

[Bae99] Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1st edition, 1999.
[Cli02] Chris Clifton, "Privacy Preserving Distributed Data Mining", ACM SIGKDD Explorations, 4(2), December 2002.
[Dhi02] Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar, "Enhanced Word Clustering for Hierarchical Text Classification", Department of Computer Sciences, University of Texas at Austin, 2002.
[DO06] T.-N. Do and F. Poulet, "Classifying One Billion Data with a New Distributed SVM Algorithm", Proc. of RIVF'06, 4th IEEE International Conference on Computer Science, Research, Innovation and Vision for the Future, Ho Chi Minh City, Vietnam, 2006, pp. 59-66.
[Han01] David Hand, Heikki Mannila, and Padhraic Smyth, Principles of Data Mining, The MIT Press, 1st edition, 2001.
[Jan06] Ivan Janciak, Martin Sarnovsky, A Min Tjoa, and Peter Brezany, "Distributed Classification of Textual Documents on the Grid", Lecture Notes in Computer Science, vol. 4208, pp. 710-718, 2006.
[Mel06] Ali Meligy, Parallel Computer Graphics, Dar Elnahda, Cairo, Egypt, June 2006.
[Mou97] Isabelle Moulinier, "Feature Selection: A Useful Preprocessing Step", Proceedings of the 19th Annual BCS-IRSG Colloquium on IR Research, pp. 140-158, 1997.
[Roo00] S. Roosta, Parallel Processing and Parallel Algorithms, Springer Verlag, 2000.
[Tal06] Domenico Talia, "Grid-based Distributed Data Mining Systems, Algorithms and Services", Proc. Workshop HPDM 2006 at the SIAM Data Mining Conference, Bethesda, USA, April 2006.
[Tan06] Pang-Ning Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison Wesley, London, 2006.
[Ton01] Simon Tong and Daphne Koller, "Support Vector Machine Active Learning with Applications to Text Classification", Proceedings of ICML-00, 17th International Conference on Machine Learning, 2000.
[Tsa04] Shu-Tzu Tsai and Chao-Tung Yang, "Decision Tree Construction for Data Mining on Grid Computing", Proceedings of the 2004 IEEE International Conference on e-Technology, e-Commerce and e-Service (EEE'04), IEEE, 2004.
[Wit00] Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, San Francisco, California, 2000.
