Data Mining on the Grid for the Grid

Nitesh V. Chawla, Douglas Thain, Ryan Lichtenwalter, David A. Cieslak
Department of Computer Science and Engineering
University of Notre Dame, USA
{nchawla, dthain, rlichten, dcieslak}@cse.nd.edu

Abstract

Both users and administrators of computing grids are presented with enormous challenges in debugging and troubleshooting. Diagnosing a problem with one application on one machine is hard enough, but diagnosing problems in workloads of millions of jobs running on thousands of machines is a problem of a new order of magnitude. Suppose that a user submits one million jobs to a grid, only to discover some time later that half of them have failed. Users of large scale systems need tools that describe the overall situation, indicating which problems are commonplace versus occasional, and which are deterministic versus random. Machine learning techniques can be used to debug these kinds of problems in large scale systems. We present a comprehensive framework from data to knowledge discovery as an important step towards achieving this vision.

1. Introduction

Computing grids have achieved many technical successes in the execution of large scale computations. Campus-scale grids measured in hundreds or thousands of CPUs are found in every research university in the United States, executing computation intensive workloads for every imaginable branch of science. These campus scale grids are often aggregated into larger structures. For example, the Open Science Grid (OSG) joins together resources at 77 major research universities to provide computing resources to projects both large and small. The OSG is not an anomaly: similar sprawling systems are found around the world. Examples include the NSF TeraGrid, the NASA Information Power Grid, the Europe-wide EGEE Grid, and the Japanese NAREGI Grid. Although each differs in policy and implementation details, all make use of common software systems such as Condor and the Globus Toolkit.

However, grids remain very difficult to test, tune, and debug. The underlying reason for this is simply heterogeneity on every axis imaginable. Computing grids are constructed from machines of varying ages, with different CPU architectures and implementations, with varying storage and memory resources, and with networking technologies ranging from 10Mb packet-switched Ethernet to 10Gb circuit-switched optics.

At the software level, a grid may run many different operating systems on each node, each with slightly different installation and configuration options. Each cluster may have its own batch scheduling software and run a different version of the grid-level software. Local administrators add to the mix with different policies for adding and removing users, retaining storage, and configuring firewalls and other network devices.

The result is that jobs are difficult to port from site to site. A sophisticated user may carefully construct a job that runs for months on hundreds of CPUs in one cluster. When expanding to multiple clusters on the grid, the job may run perfectly on certain machines, run poorly on others, and fail to start entirely on the rest. The reason is often due to subtle incompatibilities not known to the user.

Many people have observed this property through firsthand experience, and researchers have made formal observations of the same behavior. Gardner et al. observed a thirty percent failure rate on Grid3 (the predecessor of the Open Science Grid), usually due to exhaustion of shared resources such as disks [1]. Iosup et al. observe approximately a twenty percent failure rate in OSG, TeraGrid, and LCG [2]. Thain and Livny observed that approximately ten percent of machines are offline at any given moment in a five-year study of Condor worldwide [3]. Chun et al. observe a failure rate of thirty percent on simple benchmarks in a grid testbed, diagnosing a combination of factors including user error, credential expiration, and filesystem failure [4]. Medeiros et al. observe that failures in grids are typically due to middleware configuration and not simple hardware failure [5]. Schopf has observed through anecdotal discussion with a variety of grid users that, when failures occur, there is no reliable mechanism for diagnosing them [6].

Contributions. In this paper, we present our data to knowledge discovery framework for troubleshooting on the grid. The framework provides a capability for understanding the nature of failures and successes of jobs on a grid. As part of the project, we have thus far implemented an extensive, feature-rich website that allows a user to visualize various statistics and properties of both jobs and the system. The user can also upload his or her log files for specific analyses. The uploaded data also helps us to garner more diverse properties of different jobs and to refine our models. At the knowledge discovery end, we have implemented a capability for learning decision trees and rules from the data, enabling classification of jobs and focused action on the part of the user for answering the question: why did my job fail?

2. Condor Log Properties

In Condor, each job and machine in the system is described by a data structure of name-value pairs called a ClassAd. Each ClassAd contains about 100 properties describing both critical and prosaic attributes. For example:

Job ClassAd
ClusterId = 11530
QDate = 1158686002
User = "[email protected]"
Cmd = "mysim.exe"
Arguments = "-l -s 10 -p 5 -i config"
Output = "mysim.out"
ImageSize = 20485
BufferSize = 524288
DiskUsage = 1024850
(90 more properties...)

Machine ClassAd
Name = "aluminum.helios.nd.edu"
MachineGroup = "elements"
MachineOwner = "crc"
UidDomain = "nd.edu"
Arch = "INTEL"
OpSys = "LINUX"
VirtualMemory = 2047980
Disk = 7763980
CondorVersion = "6.7.19"
(90 more properties...)
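To give a concrete sense of how these records can be consumed, the following is a minimal Python sketch (our own illustration, not part of Condor or of the system described here) that parses a flat text dump of ClassAd name-value pairs, such as the long-format output of condor_status, into dictionaries. The file name machines.classads is an assumption for the example.

# Minimal sketch: parse a flat ClassAd dump (one "Name = Value" pair per line,
# blank lines separating individual ads) into a list of Python dictionaries.
def parse_classads(path):
    ads, current = [], {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:                      # a blank line ends the current ad
                if current:
                    ads.append(current)
                    current = {}
            elif "=" in line:
                name, _, value = line.partition("=")
                current[name.strip()] = value.strip().strip('"')
    if current:
        ads.append(current)
    return ads

machines = parse_classads("machines.classads")   # hypothetical file name
print(len(machines), "machines;", machines[0].get("Arch"), machines[0].get("OpSys"))

A dictionary-per-ad representation like this is enough to join machine attributes against per-job outcomes in the analyses described below.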

In addition, the system produces a log recording the relationships between jobs and machines, including placement attempts:

Activity Log
job 17661 09/22 16:44:14 Job submitted from host:
job 17661 09/22 16:44:30 Job executing on host:
job 17661 09/22 16:45:13 Job was evicted.
job 17662 09/22 16:45:36 Job submitted from host:
job 17662 09/22 16:46:28 Job executing on host:
job 17662 09/22 17:03:53 Job terminated with exit code 0
...

Given these three data sources – job data, machine data, and job-machine bindings – we can apply machine learning techniques to automatically extract the properties of jobs or machines that are correlated with failures.

To do this, we require some external specification of what constitutes a failure versus a success class. In this case, we consider a failure to be any job that is explicitly evicted by a host (i.e., the machine's owner is typing at the keyboard) or any job that exits with a non-zero status (i.e., an error detected by application-level logic).
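As a rough illustration of this labeling rule, the sketch below (our own, written against the simplified activity log excerpt shown above rather than the full Condor user log format) assigns each job a success or failure class: evicted jobs and jobs terminating with a non-zero exit code are labeled failures.

# Sketch: derive success/failure labels from the simplified activity log above.
def label_jobs(log_path):
    labels = {}
    with open(log_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2 or parts[0] != "job":
                continue
            job_id = parts[1]
            if "Job was evicted" in line:
                labels[job_id] = "failure"
            elif "terminated with exit code" in line:
                exit_code = line.rstrip(". \n").rsplit(" ", 1)[-1]
                labels[job_id] = "success" if exit_code == "0" else "failure"
    return labels

labels = label_jobs("mysim.log")                  # hypothetical log file name
failed = [j for j, c in labels.items() if c == "failure"]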

3. Debugging Web Server

The web server is available to the public via http://www.cse.nd.edu/~dial/debug.

Design: The web server is designed to provide a helpful view of the data resident in Condor logs and Condor ClassAds. In some cases, and increasingly so as more functionality is added, it also mines that data to help users theorize about the reasons behind their job failures. To accomplish these tasks, the website presents several tabs that organize the types of data presentation and generation of which the tool is capable.

File Upload: The website requires the Condor user log file. This is the chronological log of job events that occur from the point of job submission onward. Without this file, no analysis is possible. The Condor user log may be obtained by specifying a log output location in the Condor job submission file. After job submission and processing is completed by Condor, the complete user log file will be available at the specified location. Additional analysis can be performed if a Condor machine ClassAds file is uploaded. This file provides a wealth of attributes describing the machines in the Condor pool on which jobs execute. Since machine characteristics are often largely responsible for job successes and failures, this information is highly valuable. Still, this file is optional, as it is not required for basic analysis. It may be obtained by running the "condor_status -l" command and directing its output to some location in the file system.

Summary Tab: This tab provides basic information about the pool of jobs, such as how many succeeded and failed and the average time jobs spent in the pool. Moving forward, this tab will be the main location for textual information describing the general characteristics of logs.

Job Browser: This tab provides a tool for browsing jobs in a hierarchical fashion. Each job is an object with characteristics, some of which are themselves objects or lists and can be expanded accordingly. This provides a logical, encapsulated view that is helpful in determining information about specific jobs individually, which is difficult to achieve with the purely chronological ordering of the Condor user log.

Charts: This tab provides a collection of charts that can be dynamically generated from uploaded log files and machine ClassAds. When users choose not to upload machine ClassAds, some of the chart options are not displayed. Some of the chart options are as follows. Image Updates, Shadow Exceptions, and Jobs Executed by IP Address all break down, in the form of a bar chart, how various log events were distributed across the pool of machines. These charts are useful, for example, when one machine experiences a high number of shadow exceptions, processes jobs quickly due to rapid failures, and therefore causes many execution failures to appear in the log. Failures by Type shows a pie chart of the types of failures that occurred. Successes and Failures vs. Problems is a line chart correlating the number of problematic events jobs experience with the number of jobs that fail. The success and failure plots with respect to machine attributes such as architecture and operating system are bar charts illustrating how well each set of machines with a given characteristic does in successfully processing jobs. The box plots show numerical characteristics of the machines in a way that reveals which average characteristics are correlated with success or failure in a given set of jobs. They can also show boundary conditions a machine may possess that cause all jobs processed on it to fail; for instance, a machine with a limit on address space will fail all jobs requiring more address space. Where possible, all charts are presented in the manner most conducive to determining the cause of failures. For instance, all bar charts are sorted so that the x values responsible for the most failures appear at the top of the chart.

Timelines: This tab provides a Gantt chart view of every job. Users can select a particular job to view, and all of its associated log events are displayed chronologically.

Information Gain: This tab provides a tool that interfaces with the information gain algorithm. It requires that a machine ClassAds file be uploaded. The machine ClassAds file is first parsed for feature construction. The information gain algorithm is then run to determine which machine attributes are most correlated with the job success and failure classes. The information in this tab is useful for determining what to examine in the other tabs. For instance, if physical memory ranks highly in information gain, it may be interesting to look at the machine memory box plot. (A minimal sketch of this computation is given after the tab descriptions below.)

Model Result Tab (In Progress): We are currently adding this tab to provide an interface for viewing the decision tree or rules learned on the data set. These models yield classifications that support action-oriented decision making by the user. This will shortly be part of the website functionality.
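For concreteness, the following is a minimal sketch of the information gain computation underlying the Information Gain tab; it is our own illustration rather than the website's actual code, and it assumes that machine attribute values have already been joined to per-job success/failure labels (numeric attributes would be discretized first).

from math import log2
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels (e.g. "success"/"failure").
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def information_gain(values, labels):
    # Reduction in entropy obtained by splitting the jobs on one attribute's values.
    n = len(labels)
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

# Hypothetical usage: rank attributes such as Arch or OpSys by how well they
# separate successes from failures.
# gains = {a: information_gain([job[a] for job in jobs], labels) for a in attrs}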

5. Case Studies

We constructed a variety of workloads with success and failure classes. We considered C4.5 decision trees [7] and the RIPPER [8] classification algorithm as the base learning algorithms. We present two case analyses in this paper. The first uses only the machine properties, and the second relies on job properties.

1. Workload A (91,343 jobs on 2,153 machines on the Purdue TeraGrid nodes): Jobs often fail on machines with (Memory > 1920MB). Later diagnosis: a bug in pointer arithmetic in the application.

In workload A, a pointer arithmetic bug was not exercised until the application happened to run on a machine with more than 2GB of memory. This is an example of how such a troubleshooter can be used to diagnose problems on systems that are simply not in the application designer's testing set.

2. The oil-svm case study applies a linear support vector machine classifier to the Oil dataset. Given the highly imbalanced nature of the data set, we first applied a variant of a standard clustering algorithm to isolate the sub-complexities in the data set. Then, each cluster was treated differently, in terms of the amount of oversampling or undersampling performed, based on its class distribution. We noted that the SVM was taking very long (if it was finishing at all) to complete on a fairly small dataset. Essentially, this set of experiments was designed to check two parameters of the algorithm: k = 2, 4, 6, 8 (the number of clusters) and rep = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 (the number of times the minority class was replicated). We ran this experiment on Condor, which eventually came back with 46 successes and 2237 failures. On this failure/success data, we learned a decision tree using the job parameters. We observed that the number of clusters k plays a fairly important role in determining the rate of failure and success. This corroborates our belief that the linear separation required by the SVM is very difficult to achieve in this application.

We relied on the tree constructed from the job parameters for our analysis. It highlighted the non-linear aspects of the problem that caused the algorithm to fail to deliver a solution: the algorithm could not converge, resulting in either suspension or failure. This illustrates the usefulness of job-specific troubleshooting, as it brings forth important caveats of the algorithm specific to a property of the data set.
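As a sketch of how such rules can be recovered in practice, the snippet below learns a small decision tree over machine properties; the case studies above used C4.5 and RIPPER, whereas this illustration substitutes scikit-learn's CART-style DecisionTreeClassifier, and the feature values and labels are made up rather than taken from the real workloads.

# Illustrative only: a stand-in for C4.5/RIPPER using scikit-learn's CART tree.
# Rows are jobs described by machine properties; label 1 marks a failed job.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[1024, 40], [2048, 80], [4096, 120], [512, 30], [3072, 100]])  # [Memory (MB), Disk (GB)]
y = np.array([0, 1, 1, 0, 1])

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["Memory", "Disk"]))
# On the real workload, such a tree surfaces splits like "Memory > 1920 -> failure".

The same procedure applies to the job parameters in the oil-svm study, with k and rep as the features and the Condor success/failure outcome as the class.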

6. Related Work

Many researchers have proposed the large scale collection of data in order to localize problems in complex systems.

For example, Moe and Carr collect data generated by a common object framework to understand runtime relationships between objects [9]. Chen et al. have applied association rules to identify which component is likely at fault for a given set of observed failures [10]. Aguilera et al. describe a technique for identifying a performance bottleneck using only knowledge of the messages passed between components [11].

Engler et al. [12] use language and compiler techniques to observe conventions of code use, and then flag deviations from convention to infer errors in system-level code. Redstone et al. [13] collect data on computer failures and then use rule mining to assist users. Zheng et al. [14] applied a statistical machine learning approach to software debugging, especially in the presence of multiple bugs; they utilized MOSS, a software plagiarism detection program with a large user community, as their test bed. Palatin et al. [15] proposed a distributed data mining approach for detecting misconfigured machines on a grid. Their Grid Monitoring System collects data from the various sources available on a Grid, and then applies a distributed outlier detection approach to discover the misconfigured machines. Duan et al. [16] applied data mining for fault tolerance on the Grid. Utilizing historical data on the grid, including user level faults, activity level faults, service level faults, and QoS level faults, they conduct classification for detecting anomalies and failures. There have been other works on fault and failure detection in grids [17, 18]. However, the current stream of work is limited; it is not general purpose in its application to the sundry troubleshooting problems on the Grid, which is critical given the heterogeneous, dynamic, and uncertain state of the grid. In each of these cases, data mining is used to narrow down the source or precise description of a single problem.

7. Discussion


We believe that large scale distributed systems are sorely lacking in debugging and troubleshooting facilities. Most users of the grid are limited by usability, not by performance. Troubleshooting via machine learning represents one method that helps the user draw general conclusions from large amounts of data. Advances in tools and techniques are needed in order to draw out the specific reasons why individual components fail. We have presented a framework from data to knowledge discovery that offers troubleshooting opportunities to both users and administrators of distributed systems.

Acknowledgements

This work is supported by the National Science Foundation under grant CNS-07-20813.

References

[1] Rob Gardner et al., "The Grid2003 Production Grid: Principles and Practice," Proceedings of High Performance Distributed Computing, 2004.
[2] Alexandru Iosup, Catalin Dumitrescu, D.H.J. Epema, Hui Li, and Lex Wolters, "How are real grids used? The analysis of four grid traces and its implications," IEEE Grid Computing, 2006.
[3] Douglas Thain, Todd Tannenbaum, and Miron Livny, "How to Measure a Large Open Source Distributed System," Concurrency and Computation: Practice and Experience, to appear in 2006. DOI: 10.1002/cpe.1041.
[4] Greg Chun, Holly Dail, Henri Casanova, and Allan Snavely, "Benchmark Probes for Grid Assessment," IEEE International Parallel and Distributed Processing Symposium, 2004.
[5] Raissa Medeiros, Walfredo Cirne, Francisco Vilar Brasileiro, and Jacques Philippe Sauve, "Faults in grids: Why are they so bad and what can be done about it," IEEE Grid Computing, 2003.
[6] Jennifer Schopf, "State of Grid Users: 25 Conversations with UK eScience Groups," Argonne National Laboratory Tech Report ANL/MCS-TM-278, October 2004.
[7] R. Quinlan, "Induction of decision trees," Machine Learning, 1:81-106, 1986.
[8] W. Cohen, "Fast effective rule induction," Proceedings of Machine Learning: the 12th International Conference, pages 115-123, July 1995.
[9] Johan Moe and David Carr, "Using Execution Trace Data to Improve Distributed Systems," Software - Practice and Experience, 32(9), July 2002.
[10] M. Chen, E. Kiciman, E. Fratkin, E. Brewer, and A. Fox, "Pinpoint: Problem Determination in Large, Dynamic, Internet Services," International Conference on Dependable Systems and Networks, 2002.
[11] Marcos Aguilera, Jeffrey Mogul, Janet Wiener, Patrick Reynolds, and Athicha Muthitacharoen, "Performance Debugging of Black-Box Distributed Systems," ACM Symposium on Operating Systems Principles, October 2003.
[12] Dawson Engler, David Chen, Seth Hallem, Andy Chou, and Benjamin Chelf, "Bugs as Deviant Behavior: A General Approach to Inferring Errors in System Code," ACM Symposium on Operating Systems Principles, October 2001.
[13] Joshua A. Redstone, Michael M. Swift, and Brian N. Bershad, "Using Computers to Diagnose Computer Problems," USENIX Workshop on Hot Topics in Operating Systems, 2003.
[14] Alice X. Zheng, Michael I. Jordan, Ben Liblit, and Mayur Naik, "Statistical Debugging: Simultaneous Identification of Multiple Bugs," Proceedings of the International Conference on Machine Learning, 2006.
[15] Noam Palatin, Assaf Schuster, and Ran Wolff, "Mining for misconfigured machines in Grid systems," 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
[16] R. Duan, R. Prodan, and T. Fahringer, "Short Paper: Mining-based fault prediction and detection on the grid," Proceedings of IEEE High Performance Distributed Computing, 2006.
[17] P. Stelling, C. DeMatteis, I. Foster, C. Kesselman, C. Lee, and G. von Laszewski, "A fault detection service for wide area distributed computations," Cluster Computing, 2:117-128, 1999.
[18] P. Townend and J. Xu, "Fault tolerance within a grid environment," Proceedings of AHM2003, 2003.