Machine Learning Using Clusters of Computers
Bruce Wooley, Diane Mosser-Wooley, Anthony Skjellum, and Susan Bridges
Mississippi State University, Mississippi State, MS 39762
{bwooley, dwooley, tony, bridges}@cs.msstate.edu
Abstract: Machine learning using large data sets is a computationally intensive process. One technique that offers an inexpensive opportunity to reduce the wall time for machine learning is to perform the learning in parallel. Raising the level of abstraction of the parallelization to the application level allows existing learning algorithms to be used without modification while providing the potential for a significant reduction in wall time. An MPI shell is needed to partition and distribute the data in order to accommodate this higher level of abstraction. Executing this shell on a cluster of computers offers the potential for significant speedups with respect to both processors and parallel I/O. A method must also be identified for combining the results obtained by applying the learning algorithm to each partition of the data; this combining step must require minimal time relative to the time needed to train on a single partition.
I. INTRODUCTION
Chan and Stolfo (2, 3, 4, 5, 6, 7) developed techniques for allowing supervised learning (classifiers) to scale with respect to the size of the data set. They refer to their technique as a meta classifier. The meta classifier partitions the data into P partitions, trains a different classifier for each partition, and merges the results of the P trained classifiers into a single trained system. Wooley et al. (15, 16) applied a similar technique to unsupervised learning (clustering). They refer to their technique as a meta categorizer. Because the meta categorizer is used with unsupervised learning, merging the different trained categorizers requires additional work to match the categories produced by the different categorizers. The meta classifier and the meta categorizer allow each individual partition to be processed by an algorithm that is independent of the algorithms used on the other partitions, and they allow the processing (training) of the partitions to be performed independently. Additionally, since there is no communication between the independent classifiers (or categorizers) during the training phase, existing implementations of training algorithms may be used without modification. Finally, since both methods allow the P data sets to be operated on in parallel, the overall process has the potential to scale with respect to the number of processors (or number of partitions) without alterations to the original training algorithms. There are many existing algorithms that may be used for supervised learning and many others for unsupervised learning. The meta classifier and the meta categorizer are designed to work with these existing algorithms. Both describe techniques for merging the independently trained systems, and these techniques are outlined in the following
sections.
The meta categorizer was initially applied using the AutoClass clustering software (8). We were able to implement the meta categorizer with AutoClass as the learning algorithm by adding MPI code to the application so that the input data for each classifier could reside in a unique directory. Unfortunately, this is not a very general solution. Attempts to add MPI commands to broadcast the data instead of reading it from different directories turned out to involve a significant coding effort. The idea of raising the level of abstraction of MPI from the functional level to the application level and using a cluster of computers (where each node has a disk drive) was developed to accomplish the broadcast without making changes to the source code of the learning algorithm. This reduces I/O contention by broadcasting the data to local drives; the local data is then used as input for the learning algorithm executing on that node. A software package called MPI-Shell was developed to accomplish this task. The remaining sections of this paper discuss the meta learning techniques (for both supervised and unsupervised learning), describe MPI-Shell and its API, and provide a summary and a description of planned future work.
II. LEARNING ALGORITHMS
A. Supervised Learning – Meta Classifier
The meta classifier is a method developed by Chan and Stolfo (2, 3, 4, 5, 6, 7) that combines the results of independently trained classifiers to create a single trained classifier that depends on all the training data. It is important to recall that all the training/test data used in supervised learning contains the correct classification. This allows the trained classifiers to be evaluated based on how well their predictions on the training data (and test data) match the correct answers. It also allows the meta classifier to be trained using a new data set constructed by combining the classification results from each of the base classifiers with the training data and the known correct classification. Chan and Stolfo (3, 4, 7) present two approaches for developing a meta classifier. The first approach uses an arbiter and an arbitration rule, as pictured in Figure 1. The arbiter is a classification system that is trained on a subset of the raw data on which the base classifiers perform poorly. The arbitration rule determines the final classification for any specific instance based on the results of the p classifiers and the arbiter. The second approach Chan and Stolfo present, called a combiner and pictured in Figure 2, is a learning system that is trained by processing raw data through the p classifiers and presenting
the output of the p classifiers as input to the combiner. An alternative training set for the combiner may include the raw
data and the output of the p classifiers.
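As a rough illustration of the combiner idea, the sketch below builds a single meta-level training example whose attributes are the predictions of the p base classifiers and whose label is the known correct class. The classify() stub, the number of classifiers, and the attribute dimensions are illustrative assumptions, not part of Chan and Stolfo's implementation.

```c
/* Sketch of building one meta-level training example for a combiner:
 * the attributes are the predictions of the p base classifiers
 * (optionally followed by the raw attributes) and the label is the
 * known correct class.  The classify() stub and the sizes used here
 * are illustrative assumptions. */
#include <stdio.h>

#define P        3        /* number of base classifiers  */
#define RAW_DIM  2        /* raw attributes per instance */

/* Stand-in for a trained base classifier (dummy decision rule). */
static int classify(int which, const double *raw)
{
    return ((int)(raw[0] + raw[1]) + which) % 2;
}

int main(void)
{
    double raw[RAW_DIM] = {1.5, 2.5};
    int true_class = 1;
    int meta_row[P + 1];
    int i;

    /* Meta-level attributes: the predictions of the p base classifiers. */
    for (i = 0; i < P; i++)
        meta_row[i] = classify(i, raw);
    meta_row[P] = true_class;     /* label used to train the combiner */

    for (i = 0; i <= P; i++)
        printf("%d ", meta_row[i]);
    printf("\n");
    return 0;
}
```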
Figure 1: An Arbiter with Two Classifiers. From (Chan and Stolfo 1995a, 92).
Figure 2: A Combiner with Two Classifiers. From (Chan and Stolfo 1995a, 92).
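To make the arbiter scheme of Figure 1 concrete, the sketch below applies one plausible arbitration rule: a plurality vote over the base classifiers' predictions and the arbiter's prediction, with ties resolved in favor of the arbiter. The specific rule, the number of classes, and the tie-breaking choice are illustrative assumptions rather than Chan and Stolfo's exact procedure.

```c
/* Sketch of a plurality-vote arbitration rule over p base predictions
 * plus an arbiter prediction; ties are resolved in favor of the
 * arbiter.  The rule and the number of classes are illustrative
 * assumptions. */
#include <stdio.h>

#define NUM_CLASSES 4

/* Return the final class for one instance given the p base predictions
 * and the arbiter's prediction. */
static int arbitrate(const int *base_pred, int p, int arbiter_pred)
{
    int votes[NUM_CLASSES] = {0};
    int i, best;

    for (i = 0; i < p; i++)
        votes[base_pred[i]]++;
    votes[arbiter_pred]++;          /* the arbiter also votes */

    best = arbiter_pred;            /* ties go to the arbiter */
    for (i = 0; i < NUM_CLASSES; i++)
        if (votes[i] > votes[best])
            best = i;
    return best;
}

int main(void)
{
    int preds[2] = {1, 3};          /* two base classifiers disagree */
    printf("final class = %d\n", arbitrate(preds, 2, 3));
    return 0;
}
```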
B. Unsupervised Learning – Meta Categorizer
The meta categorizer developed by Wooley et al. (15, 16) differs from the meta classifier in the way the arbiter/combiner works. The training and test data sets used in supervised learning include the correct classification of each example; this provides a means for the arbiter/combiner to evaluate the performance of the base classifiers. This correct classification does not exist in the training data used with unsupervised learning. In fact, experts often disagree among themselves about the correct categorization of data used with unsupervised learning. Additionally, the clusters produced by applying an unsupervised learning algorithm to a data set may not represent the same set of clusters produced by applying a different learning algorithm to the same data set. For example, cluster number 1 produced by one training algorithm may most closely correspond to cluster number 3 produced by a different training algorithm (or even by the same algorithm with different initial conditions) on the same data set. Two basic techniques are being investigated to build correspondences between the clusters obtained from unsupervised learning algorithms that are trained on partitioned data.
The first technique is to provide the entire data set as input to each of the trained categorizers and to build a correspondence matrix from the resulting classifications. This matrix is then used to identify a new set of clusters (which typically contains more clusters than the output of any individual categorizer) and a special cluster for “leftovers” that do not fit well in any of the identified clusters. Because using the trained categorizers to predict the cluster of a data item is very fast compared to training, the bulk of the time needed to train the meta categorizer is spent building the correspondence matrix. Efficient implementations require a sparse matrix because of the potentially high dimensionality of the training data (the number of clusters and the number of partitions). Preliminary results indicate this technique works well for up to 8 processors. A minimal sketch of the matrix construction appears at the end of this subsection. The second technique also uses each of the trained categorizers as a classifier for the entire data set. It then builds a set of feature vectors from the resulting classifications and submits this new data to an unsupervised learning algorithm to discover the correspondences between the clusters. This technique may use all of the new feature vectors or a subset (sample) of these correspondence vectors. Two other research groups working in this area (1, 10, 17) have developed closely related techniques that use statistical summaries of the data as input for subsequent clustering. Using these statistical summaries as input for the clustering results in smaller training sets and thus reduced training time. Both of the techniques described above require that each classifier predict the class of all training instances. This may be accomplished in two different ways. The first alternative is to broadcast the complete training data set to each processor (where each processor is being used to train one classifier). Each processor would then train its classifier using only its designated partition of the training set; at the conclusion of training, each classifier would be used to predict the category of every instance in the training data set. These final predictions would then be gathered to a central point where a serial process would learn the correspondences. This technique requires all the training data to be shared with each node. An alternative approach shares classifiers rather than data sets. Using this alternative, only the partition of the data set that will be used by a particular processor is sent to that processor. Each processor uses its learning algorithm to train a classifier based on its partition. The resulting classifier is then broadcast to all other nodes, where it is used to classify the local data partition. In cases where the nodes are running different learning algorithms, each node must have a copy of the learning algorithm stored, or the learning algorithm must also be sent with the classifier. Each node can then apply the classifiers sent by the other nodes to classify its partition of the data. The resulting classifications would then be gathered and used to train the meta categorizer as described for each of the two techniques.
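The following minimal sketch shows the core of the correspondence-matrix construction for two categorizers over a toy data set. The dense, fixed-size matrix and the hard-coded cluster labels are simplifications for illustration; as noted above, an efficient implementation would use a sparse representation.

```c
/* Sketch of building a correspondence matrix between the clusterings
 * produced by two independently trained categorizers.  Entry [i][j]
 * counts the instances that categorizer A placed in cluster i and
 * categorizer B placed in cluster j; large entries suggest that
 * cluster i of A corresponds to cluster j of B.  The dense matrix and
 * hard-coded labels are illustrative simplifications. */
#include <stdio.h>

#define N_INSTANCES 8
#define K_A 3
#define K_B 3

int main(void)
{
    /* Cluster labels each categorizer assigns to the full data set
     * (toy values for illustration). */
    int label_a[N_INSTANCES] = {0, 0, 1, 1, 2, 2, 2, 0};
    int label_b[N_INSTANCES] = {2, 2, 0, 0, 1, 1, 1, 2};
    int corr[K_A][K_B] = {{0}};
    int n, i, j;

    for (n = 0; n < N_INSTANCES; n++)
        corr[label_a[n]][label_b[n]]++;

    for (i = 0; i < K_A; i++) {
        for (j = 0; j < K_B; j++)
            printf("%3d", corr[i][j]);
        printf("\n");
    }
    return 0;
}
```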
III. MPI SHELL
The need for communication in the learning algorithms described in this paper is at a higher level of abstraction than in traditional parallel computational algorithms; in our case, the communication must be performed outside of the individual tasks being executed. MPI-Shell, a communication/command shell that executes on top of MPI/Pro, allows users to describe the partitioning scheme of the data and the algorithms that are to be executed on each node. MPI-Shell allows any embarrassingly parallel algorithm to be implemented without making any changes to that application algorithm (no MPI code is added to the algorithm). It accomplishes the parallelization by broadcasting or partitioning the data, initiating an appropriate algorithm on each node, and gathering the results computed by each node. The general API for this batch shell includes the following commands.
A. Broadcast
Broadcast allows broadcasting a barrier, an instruction, a file, a partition, or a time-stamp request to all nodes. The barrier is used to synchronize the MPI-Shell process on all nodes so that no shell commands may be executed until all processes reach the barrier. Broadcasting an instruction tells all nodes to execute the requested instruction, where each instruction must be defined in the environment of each node. Broadcasting a file sends the file over the network; each copy of the MPI-Shell process receives the file and writes it to a local disk. The partitioning broadcast performs a single broadcast but marks each data element with the appropriate node for the partitioning process; each node's MPI-Shell process then writes two files (the broadcast file and the partition file) from the single broadcast. Finally, the time-stamp request notifies each MPI-Shell process that it should create a time-stamp when it encounters this broadcast step.
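The file broadcast can be sketched in a few lines of MPI; the program below illustrates only the underlying idea, not MPI-Shell's implementation. The root reads a file, broadcasts its length and contents, and every rank writes a local copy that a serial learning program could then read. The file names and the single-buffer read are assumptions made for brevity.

```c
/* Sketch of the idea behind a file broadcast: the root reads a file,
 * broadcasts its length and contents, and every rank writes its own
 * copy to a local disk where a serial learning program can read it.
 * File names and the single-buffer read are illustrative assumptions,
 * not MPI-Shell's implementation. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    long len = 0;
    char *buf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                          /* root reads the source file */
        FILE *in = fopen("training.dat", "rb");
        if (in == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        fseek(in, 0, SEEK_END);
        len = ftell(in);
        rewind(in);
        buf = malloc(len);
        fread(buf, 1, len, in);
        fclose(in);
    }

    MPI_Bcast(&len, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0)
        buf = malloc(len);
    MPI_Bcast(buf, (int)len, MPI_BYTE, 0, MPI_COMM_WORLD);

    {                                         /* every node writes a local copy */
        FILE *out = fopen("/tmp/training.dat", "wb");
        fwrite(buf, 1, len, out);
        fclose(out);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```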
B. Message
Message allows communication between any two MPI-Shell processes and can target a single specific MPI-Shell process (with an instruction or a time-stamp request). The communication may be an instruction, a file, or a time-stamp. The instruction notifies a specific process that it is to execute the specified instruction. The file communication transfers the file using point-to-point communication: the sender reads and sends, and the receiver receives and writes a new local file. The time-stamp notifies the specific process that it is to create a time-stamp.
C. Scatter
Scatter allows a file to be partitioned and scattered to all the processes. The partitioning may be by rows or by columns. Each process receives its partition and writes it to a file on a local disk.
D. Gather
Gather allows partitions held by each process (a file on a local disk) to be gathered to the root process and written to any drive to which the root has access. The gather may be by rows or by columns.
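The row-wise scatter and gather can be sketched as follows. The table shape, the equal-sized row blocks, and the local file names are illustrative assumptions, not MPI-Shell's own data format.

```c
/* Sketch of a row-wise scatter and gather: the root holds a small
 * table, equal blocks of rows are scattered so each rank can write its
 * partition to a local file, and the partitions are later gathered
 * back to the root.  The table contents, block sizes, and file names
 * are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ROWS_PER_RANK 2
#define COLS          3

int main(int argc, char **argv)
{
    int rank, size, i;
    double local[ROWS_PER_RANK * COLS];
    double *table = NULL;
    char fname[64];
    FILE *out;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                          /* root builds the full table */
        table = malloc(size * ROWS_PER_RANK * COLS * sizeof(double));
        for (i = 0; i < size * ROWS_PER_RANK * COLS; i++)
            table[i] = (double)i;
    }

    /* Scatter: each rank receives its own block of rows... */
    MPI_Scatter(table, ROWS_PER_RANK * COLS, MPI_DOUBLE,
                local, ROWS_PER_RANK * COLS, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    /* ...and writes that partition to a local disk for the learner. */
    sprintf(fname, "/tmp/partition_%d.dat", rank);
    out = fopen(fname, "wb");
    fwrite(local, sizeof(double), ROWS_PER_RANK * COLS, out);
    fclose(out);

    /* Gather: the partitions are collected back at the root. */
    MPI_Gather(local, ROWS_PER_RANK * COLS, MPI_DOUBLE,
               table, ROWS_PER_RANK * COLS, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    free(table);                              /* free(NULL) is a no-op on non-root ranks */
    MPI_Finalize();
    return 0;
}
```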
IV. SUMMARY
Many machine learning algorithms are computationally intensive and embarrassingly parallel. Maintaining a parallel version of each algorithm would be a difficult task. Abstracting the parallelization from the function level to the application level allows the user to adopt the latest version of a learning algorithm without the overhead of modifying it for parallel computers. This abstraction led to the development of a parallel batch shell called MPI-Shell. Using MPI-Shell on clusters of computers to parallelize applications that work with partitioned data offers several benefits. First, MPI-Shell moves the focus of parallel computing from the SPMD model to the MPMD model, and its strength is realized in a cluster environment with applications that are embarrassingly parallel (SPMD) or with multiple applications that can execute concurrently, thus taking advantage of multiple processors (MPMD). Traditional methods of programming were aimed at parallel computing in a homogeneous, scientific, and specialized environment and are typically designed around the SPMD form of programming. The benefits of SPMD are realized by parallelizing the individual programs and exploiting functional parallelism within an application; this is well suited to the tightly coupled, homogeneous environments predominantly in use today for parallel computation. MPI-Shell, however, allows exploitation of the true MPMD capabilities of the loosely coupled, heterogeneous cluster environment by abstracting the parallel aspects to the application level.
A second benefit of MPI-Shell is heterogeneity. In addition to the reduced expense and competitive pricing of cluster hardware and software, this provides the true potential for heterogeneous computing, in both hardware and software. This capability can be exploited at the application level, allowing the user to match each application to the hardware that provides the best performance for that application. A further benefit is that off-the-shelf software components can be used unaltered, which allows system designers to choose the hardware and software components that best meet their needs without the significant cost of modifications. As new hardware, middleware, and software components are released, the system designer may also update these components with minimal changes to the other system components, thus improving performance with minimal interference in the day-to-day operation of the system. Another benefit of MPI-Shell is that it may be used in other parallel computing environments. It can be useful in multi-computer and distributed computing environments by aiding the user in scheduling the execution of applications and providing tools for managing synchronization among these applications. It also allows the user to define the desired synchronization between applications and data communications, which can enhance the use of parallel I/O in a cluster environment. Finally, MPI-Shell provides a user interface that is easier to learn. This interface is at a higher level of abstraction and has a limited number of functions. Users who are not proficient in high performance computing should have a reduced learning curve, since all communication is at the file or batch-instruction level of abstraction. Users who are already familiar with MPI should be able to use MPI-Shell with very little effort. MPI-Shell is implemented on top of an existing message-passing interface, MPI/Pro (14). This allows one to make use of the services and benefits of message passing, such as the portability of the MPI environment. MPI-Shell is a tool that allows users to select off-the-shelf software in the same way they currently select off-the-shelf hardware, reducing the cost of their high performance system. Replacing hardware or middleware to improve performance and flexibility will not require alterations to the application software. MPI-Shell will work on any computational units that support MPI. This enhances functionality for applications such as distributed data mining that work with partitioned data, as well as for distributing corporate workloads over multiple computational units.
V. FUTURE WORK
There are other artificial intelligence algorithms that may benefit from using MPI-Shell to perform communication at the application level. For example, we plan to write a wrapper for genetic algorithms that will use different initial populations on different processors. As each copy of the GA runs, it will be possible to exchange the most fit individuals between nodes in order to increase the evolutionary pressure. This offers the potential of finding a better result in less wall-clock time; a sketch of the idea appears below. It may also be possible to adapt other AI search algorithms to this partitioned-data environment. Finally, outside of AI, small businesses that are computationally bound (or I/O bound) and need more horsepower may be able to partition their workload so that it is spread over multiple computational units without modifying their applications. This would allow them to choose low-cost hardware instead of installing more powerful multiprocessor machines.
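A minimal sketch of the planned GA wrapper idea follows: each rank evolves its own randomly initialized population and, after every generation, migrates its fittest individual to the next node in a ring. The one-value-per-individual encoding, the dummy fitness function, and the ring topology are assumptions made for illustration; they do not describe an existing implementation.

```c
/* Sketch of the planned GA wrapper idea: each rank evolves its own
 * randomly initialized population and, after every generation, passes
 * its fittest individual to the next node in a ring.  The encoding
 * (one double per individual), the dummy fitness function, and the
 * ring topology are illustrative assumptions only. */
#include <mpi.h>
#include <stdlib.h>

#define POP          16
#define GENERATIONS  5

static double fitness(double x) { return -(x - 3.0) * (x - 3.0); }

int main(int argc, char **argv)
{
    int rank, size, g, i, best, worst;
    double pop[POP], migrant;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    srand(rank + 1);                          /* different initial population per node */
    for (i = 0; i < POP; i++)
        pop[i] = 10.0 * rand() / RAND_MAX;

    for (g = 0; g < GENERATIONS; g++) {
        /* Selection, crossover, and mutation would go here; this sketch
         * only perturbs each individual randomly. */
        for (i = 0; i < POP; i++)
            pop[i] += 0.1 * (2.0 * rand() / RAND_MAX - 1.0);

        best = 0;                             /* find the local champion */
        for (i = 1; i < POP; i++)
            if (fitness(pop[i]) > fitness(pop[best]))
                best = i;

        /* Migrate the champion around a ring of nodes. */
        MPI_Sendrecv(&pop[best], 1, MPI_DOUBLE, (rank + 1) % size, 0,
                     &migrant, 1, MPI_DOUBLE, (rank + size - 1) % size, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        worst = 0;                            /* the migrant replaces the weakest individual */
        for (i = 1; i < POP; i++)
            if (fitness(pop[i]) < fitness(pop[worst]))
                worst = i;
        pop[worst] = migrant;
    }

    MPI_Finalize();
    return 0;
}
```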
VI. REFERENCES
[1] Bradley, P. S., Usama Fayyad, and Cory Reina. 1998. Scaling clustering algorithms to large databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, edited by Rakesh Agrawal and Paul Stolorz, 9-15. Menlo Park, CA: AAAI Press.
[2] Chan, Philip K. and Salvatore J. Stolfo. 1994. Toward scalable and parallel inductive learning: A case study in splice junction prediction. Presented at the ML94 Workshop on Machine Learning and Molecular Biology. 1-21.
[3] Chan, Philip K. and Salvatore J. Stolfo. 1995a. A comparative evaluation of voting and meta learning on partitioned data. In Proceedings of the Twelfth International Conference on Machine Learning, edited by Armand Prieditis and Stuart Russell, 90-98. San Francisco, CA: Morgan Kaufmann. URL: www.cs.columbia.edu/~sal/recent-papers.html (Accessed 13 October 1998).
[4] Chan, Philip K. and Salvatore J. Stolfo. 1995b. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, edited by Usama M. Fayyad and Ramasamy Uthurusamy, 39-44. Menlo Park, CA: AAAI Press.
[5] Chan, Philip K. and Salvatore J. Stolfo. 1996. Sharing learned models among remote database partitions by local meta learning. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. URL: www.cs.columbia.edu/~sal/recent-papers.html (Accessed 13 October 1998).
[6] Chan, Philip K. and Salvatore J. Stolfo. 1997a. JAM: Java agents for meta learning over distributed databases. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, edited by David Heckerman, Heikki Mannila, Daryl Pregibon, and Ramasamy Uthurusamy, 74-81. Menlo Park, CA: AAAI Press. URL: www.cs.columbia.edu/~sal/recent-papers.html (Accessed 13 October 1998).
[7] Chan, Philip K. and Salvatore J. Stolfo. 1997b. Scalability of learning arbiter and combiner trees from partitioned data. Work in progress. URL: www.cs.columbia.edu/~sal/recent-papers.html (Accessed 13 October 1998).
[8] Cheeseman, Peter, and John Stutz. 1996. Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining, edited by Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, 158-180. Menlo Park, CA: AAAI Press.
[9] Chi, Zheru, Hong Yan, and Tuan Pham. 1996. Fuzzy Algorithms with Applications to Image Processing and Pattern Recognition. River Edge, NJ: World Scientific.
[10] Livny, Miron, Raghu Ramakrishnan, and Tian Zhang. 1996. Fast density and probability estimation using CF-Kernel method for very large databases. Technical report. URL: www.cs.wisc.edu/~zhang/birch.html (Accessed 13 October 1998), under relevant publications.
[11] Pacheco, Peter S. 1997. Parallel Programming with MPI. San Francisco, CA: Morgan Kaufmann Publishers, Inc.
[12] Kumar, Vipin, Ananth Grama, Anshul Gupta, and George Karypis. 1994. Introduction to Parallel Computing: Design and Analysis of Algorithms. Redwood City, CA: The Benjamin/Cummings Publishing Company, Inc.
[13] Foster, Ian. 1995. Designing and Building Parallel Programs. Reading, MA: Addison-Wesley Publishing Company.
[14] MPI Software Technology, Inc. 1999. MPI/Pro. URL: http://mpi-softtech.com/.
[15] Wooley, Bruce, Yoginder Dandass, Susan Bridges, Julia Hodges, and Anthony Skjellum. 1998. Scalable knowledge discovery from oceanographic data. In Intelligent Engineering Systems Through Artificial Neural Networks, Volume 8 (ANNIE 1998), edited by Cihan H. Dagli, Metin Akay, Anna L. Buczak, Okan Ersoy, and Benito R. Fernandez, 413-424. New York, NY: ASME Press.
[16] Wooley, Bruce, Susan Bridges, Julia Hodges, and Anthony Skjellum. 2000. Scaling the data mining step in knowledge discovery using oceanographic data. Accepted at IEA/AIE 2000.
[17] Zhang, Tian, Raghu Ramakrishnan, and Miron Livny. 1996. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD '96 International Conference on Management of Data, Montreal, Canada. URL: www.cs.wisc.edu/~zhang/birch.html (Accessed 13 October 1998), under relevant publications.