auto-AID: A Data Mining Framework for Autonomic Anomaly Identification in Networked Computer Systems

Qiang Guan and Song Fu
Department of Computer Science and Engineering, University of North Texas
[email protected]; [email protected]

Abstract

Networked computer systems continue to grow in scale and in the complexity of their components and interactions. Component failures become norms instead of exceptions in these environments. A failure causes one or more computers to become unavailable, which affects resource utilization and system throughput. When a computer fails to function properly, health-related data are valuable for troubleshooting. However, it is challenging to effectively identify anomalies from the voluminous amount of noisy, high-dimensional data. In this paper, we present auto-AID, an autonomic mechanism for anomaly identification in networked computer systems. It is composed of a set of data mining techniques that facilitate automatic analysis of system health data. The identification results are valuable for system administrators to manage systems and schedule the available resources. We implement a prototype of auto-AID and evaluate it on a production institution-wide compute grid. The results show that auto-AID can effectively identify anomalies with little human intervention.

Keywords: Anomaly identification; Data mining; System dependability; Parallel and distributed systems.

1 Introduction

Networked computer systems continue to grow in scale and in the complexity of their components and interactions. In these systems, component failures become norms instead of exceptions. Failure occurrence, as well as its impact on system performance and operation costs, is becoming an increasingly important concern to system designers and administrators [26, 17, 30]. The growing complexity of hardware and software mandates autonomic management of failures in production systems. When a system fails to function properly, health-related


data collected across the system are valuable for troubleshooting. However, localizing anomalies in large-scale complex systems is challenging. The major challenges include data volume and diversity, data dependency, anomaly characteristics, system dynamics, and more [3]. The data collected for analysis from large-scale systems are characterized by their huge volume, usually on the order of gigabytes per day [22]. Moreover, the data often have various formats and semantics, which makes it difficult to process them in a uniform manner. Finding anomalies in such an overwhelming amount of diverse data is challenging. The collected data are also mixtures of independent and dependent signals, and they often contain noise. How to discover their dependencies and remove the noise is critical. In addition, there are many types of anomalies in large-scale computing systems and some of them are very complex. The dynamics of system behaviors add further difficulty to defining the normal behaviors of system components and detecting anomalies [23].

In this paper, we present an autonomic Anomaly IDentification (auto-AID) framework, which provides a collection of data mining techniques that enable autonomic analysis of runtime data and identification of abnormal behaviors in large-scale networked computer systems. Specifically, data transformation is first employed to tackle data diversity by reducing the problem of anomaly identification with different data types to the problem of finding outliers in a new space of a single data type. Then, to address the overwhelming data volume and inherent data dependency, feature selection based on mutual information is performed to convert the multi-dimensional data into a space of lower dimension with reduced relevance and redundancy for quicker and better analysis. Finally, outlier detection automatically extracts the expected normal behaviors from the data and identifies significant deviations as anomalies. Together, these techniques form an unsupervised learning framework, which addresses the unknown and dynamic characteristics of anomalies in large-scale networked computer systems. We implement a prototype of

auto-AID and evaluate it on a production institution-wide compute grid, which contains 362 high-performance compute nodes. The experimental results show that auto-AID can effectively identify anomalies with little human intervention.

The rest of this paper is organized as follows. Section 2 describes the framework of auto-AID. Section 3 presents the key techniques for analyzing health data and identifying anomalous behaviors. Experimental results are presented and discussed in Section 4. Section 5 describes the related work. Conclusions and remarks on future work are presented in Section 6.

2 Autonomic Anomaly Identification Framework

We propose an autonomic Anomaly IDentification (auto-AID) framework to process the massive volume of diverse health-related data by leveraging pattern recognition technologies. Health-related data are collected across the system and sent for analysis, which includes data transformation, feature extraction, clustering and outlier identification. By investigating the structure of a networked computer system, in which multiple high-performance clusters are interconnected by high-speed networks, we define the health-related variables that are used in our anomaly identification framework, as listed in Figure 1. In total, there are fifty-two variables that we monitor on each node in a networked computer system. They characterize the runtime statistics of an entire node, including its processors, memory, I/O devices, network connections and disks. The data are collected by the operating system on each node. The runtime state of a node is defined by the values of these variables at each time point. The data provide insightful information about system behaviors and are valuable for identifying anomalies.

The health-related data are collected across a networked computer system, and the data transformation component assembles the data into a uniform format. A feature in the data set refers to any individual measurable variable of a compute node being monitored. It can be the system or user utilization, CPU idle time, memory utilization, volume of I/O operations, and more. There may be hundreds or even thousands of features for large-scale systems. Feature extraction examines the dependency of anomaly occurrences on the values of features and selects the most relevant features for further analysis. It reduces the dimensionality of the data to be processed while keeping the most important information. The resulting data are classified into multiple groups by clustering, and the outlier detector identifies the nodes that are far away from the majority as potential anomalies.

To identify anomalies, we can invoke the autonomic

mechanism periodically with a predefined or an adaptive frequency. Alternatively, a tool that monitors the system dynamics can trigger the mechanism when it finds suspicious events. To validate the correctness of the detection results, the identified anomalies are sent to and checked by the system administrators. By integrating automatic processing by computers with human expertise, our anomaly identification framework can identify anomalies quickly and accurately. The design objectives of our anomaly identification system are to provide high accuracy and time efficiency in analyzing health-related data and identifying anomalous behaviors in large networked computer systems.
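As a concrete illustration of these two invocation modes, the sketch below shows a periodic driver and an event-triggered driver wrapped around a single identification pass; the function names and the event format are illustrative assumptions, not part of the auto-AID implementation.

```python
# Illustrative drivers for invoking the identification mechanism. The function
# names and the event format are hypothetical, not part of auto-AID itself.
import time

def run_identification_cycle():
    """Placeholder for one pass of data analysis and anomaly reporting."""
    print("auto-AID: analyzing the latest health data ...")

def periodic_driver(interval_seconds=300, cycles=3):
    """Invoke the mechanism at a predefined (or adaptively tuned) frequency."""
    for _ in range(cycles):
        run_identification_cycle()
        time.sleep(interval_seconds)

def event_driver(event_stream):
    """Invoke the mechanism only when a monitor reports a suspicious event."""
    for event in event_stream:
        if event.get("suspicious", False):
            run_identification_cycle()

# Event-triggered mode with a synthetic event stream.
event_driver([{"suspicious": False}, {"suspicious": True}, {"suspicious": False}])
```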

3 Anomaly Identification Mechanisms

3.1 Data Transformation

There are a variety of tools available for health monitoring in modern computer systems. For example, hardware sensors monitor the processor temperature, disk rotation speed, and fan speed. Above the hardware layer, hypervisors and operating systems provide various hypercalls or system calls to trace the usage of processors, memory, network communication, I/O operations and more, and write the data into event/system logs. In the application layer, users can define their own mechanisms to profile the resource usage of their application programs. There are also many third-party tools available for monitoring system performance. The data collected by all these hardware and software tools can be used by our framework.
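To illustrate the kind of normalization this step performs, the following sketch maps records from different monitoring sources onto one fixed vector of variables; the variable list, source names, field mappings, and unit conversions are assumptions made for the example, not the output format of any particular tool.

```python
# A minimal sketch of assembling heterogeneous monitoring records into a
# uniform feature vector. The variable list, source names and unit conversion
# are assumptions for the example, not the output format of any specific tool.
UNIFORM_VARIABLES = ["cpu_user_pct", "cpu_system_pct", "mem_used_kb", "io_tps"]

def to_uniform(record, source):
    """Map one raw record from a given source onto the uniform variable order."""
    if source == "sysstat":                      # field names as reported by sysstat
        mapping = {"cpu_user_pct": record["%user"],
                   "cpu_system_pct": record["%system"],
                   "mem_used_kb": record["kbmemused"],
                   "io_tps": record["tps"]}
    elif source == "custom_profiler":            # hypothetical tool reporting bytes
        mapping = {"cpu_user_pct": record["user_time_pct"],
                   "cpu_system_pct": record["sys_time_pct"],
                   "mem_used_kb": record["mem_used_bytes"] / 1024.0,
                   "io_tps": record["io_rate"]}
    else:
        raise ValueError("unknown source: " + source)
    return [float(mapping[v]) for v in UNIFORM_VARIABLES]

# Records from different tools end up in the same vector layout.
print(to_uniform({"%user": 12.0, "%system": 3.5, "kbmemused": 2048000, "tps": 7.2},
                 "sysstat"))
```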

3.2 Dimensionality Reduction

The large number of performance metrics that are measured and the overwhelming volume of data collected by health monitoring tools make the data model extremely complex. Moreover, the existence of interacting metrics and external environmental factors introduces measurement noise in the collected data. To make anomaly identification tractable and yield high detection accuracy, we apply dimensionality reduction, which transforms the collected health data to a new feature space with only the more relevant attributes preserved [15]. Data presented in a low-dimensional subspace are easier to classify into distinct groups, which facilitates anomaly identification.

3.2.1 Relevance Reduction

For anomaly identification, large feature sets introduce high dimensionality. Moreover, with unsupervised learning, high dimensionality makes features less distinguishable from each other, especially in clustering with Euclidean distance.

Figure 1. Variables characterizing system dynamics.
Usually, features with high similarity impede the performance of clustering in unsupervised learning. The similarity of features can be evaluated using mutual information. In the following discussion, we denote the discrete random variables of different features by $X_1, X_2, \cdots, X_n$. The mutual information of two features is defined as

$$I(X_i; X_j) = \sum_{x_i} \sum_{x_j} p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \qquad (3.1)$$

It quantifies how much information is shared between $X_i$ and $X_j$. Mutual information has been widely used for feature selection [6]. $I(X_i; X_j)$ measures the goodness of a term globally between two features. Feature pairs with high co-relevance have large mutual information. If $X_i$ and $X_j$ are independent, their mutual information takes its minimum value of zero. If $X_i$ and $X_j$ are the same feature, their mutual information takes its maximum value of one. The objective of utilizing mutual information is to reduce the relevance within a selected subset of features. Therefore, a feature that has high mutual information with other features should be excluded from the subset. The index for evaluating the relevance of feature $X_i$ is defined as

$$\mathrm{Index}(X_i) = \sum_{j=1}^{i-1} I(X_i; X_j) + \sum_{j=i+1}^{N} I(X_i; X_j) \qquad (3.2)$$

It can be shown that two linearly related features have high index values, which indicates high relevance.
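The following numpy sketch estimates Eq. (3.1) by histogram binning and computes the relevance index of Eq. (3.2) for every feature; the binning granularity and the entropy-based normalization to [0, 1] are assumptions, since the paper does not specify how the continuous metrics are discretized or normalized.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Eq. (3.1): I(X;Y) from a 2-D histogram estimate of the joint distribution."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                                   # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def normalized_mi(x, y, bins=10):
    """Scale I(X;Y) into [0, 1] by the entropies (an assumed normalization)."""
    hx = mutual_information(x, x, bins)            # I(X;X) equals H(X) of the binned data
    hy = mutual_information(y, y, bins)
    return mutual_information(x, y, bins) / max(np.sqrt(hx * hy), 1e-12)

def relevance_index(data, bins=10):
    """Eq. (3.2): Index(X_i) = sum over j != i of I(X_i; X_j) for each feature column."""
    n = data.shape[1]
    mi = np.array([[normalized_mi(data[:, i], data[:, j], bins) if i != j else 0.0
                    for j in range(n)] for i in range(n)])
    return mi.sum(axis=1)

# Features with a large index (high relevance to the others) are candidates for removal.
```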

Figure 2. Database schema for storing formatted data.

3.2.2 Redundancy Reduction

For feature selection, it is known that combining individually independent features does not necessarily lead to good clustering performance. The existence of redundant dimensions can cause problems in clustering. PCA (principal component analysis) has been widely used and has proven powerful in eliminating redundancy within a subset of features. Using PCA, a matrix H is generated from the health-related dataset. It is constructed as an $m \times n$ matrix, where $m$ is the number of instances in the dataset and $n$ is the number of attribute values. Then the covariance matrix of H, denoted by C, is calculated. Each element of C is defined as

$$c_{i,j} = \mathrm{covariance}(h_{*,i}, h_{*,j}) \qquad (3.3)$$

where $c_{i,j}$ is the covariance of the $i$th and $j$th attributes of the health data. The covariance of two attributes measures the extent to which they vary together. PCA performs a coordinate rotation that aligns the transformed axes with the directions of maximum variance. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. This means that if a set of leading principal components captures most of the variability of the whole dataset, such as 90% or above, those components can take the place of the original dataset. These components constitute the selected subset.
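A minimal numpy sketch of this step, following the construction above: center H, form the covariance matrix C of Eq. (3.3), and keep the leading principal components whose cumulative variance ratio reaches a chosen threshold such as 90%; the function interface is an assumption for illustration.

```python
import numpy as np

def pca_reduce(H, variance_target=0.90):
    """Project the m-by-n health-data matrix H onto the fewest leading
    principal components whose cumulative variance ratio >= variance_target."""
    Hc = H - H.mean(axis=0)                      # center each attribute
    C = np.cov(Hc, rowvar=False)                 # covariance matrix C, Eq. (3.3)
    eigvals, eigvecs = np.linalg.eigh(C)         # C is symmetric, so use eigh
    order = np.argsort(eigvals)[::-1]            # sort components by variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()   # cumulative explained variance
    k = int(np.searchsorted(ratio, variance_target) + 1)
    return Hc @ eigvecs[:, :k], ratio[:k]        # reduced data and explained ratios

# Example: reduce eight selected features to the leading components.
H = np.random.rand(1000, 8)
reduced, explained = pca_reduce(H)
print(reduced.shape, explained)
```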

3.3 Outlier Detection

In order to identify anomalies, we need to find nodes whose behaviors are significantly different from those of the majority. These nodes are called outliers. Simply put, an outlier is a data point that is quite different from the other data according to some criterion. In this work, we use the Euclidean distance to quantify the dissimilarity between two data points. To detect outliers, we calculate the number of neighbors within a given distance of each object; an object with too few such neighbors is reported as an outlier.
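A minimal sketch of this neighbor-counting rule is shown below; the distance radius and the minimum neighbor count are tunable parameters assumed for the example, not values prescribed by the paper.

```python
import numpy as np

def detect_outliers(points, radius, min_neighbors):
    """Flag points that have fewer than `min_neighbors` other points
    within Euclidean distance `radius` of them."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances (fine for modest data sizes; O(n^2) memory).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    neighbor_counts = (dists <= radius).sum(axis=1) - 1   # exclude the point itself
    return np.where(neighbor_counts < min_neighbors)[0]

# Example: the last point is far from the dense group and is reported.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
print(detect_outliers(data, radius=0.5, min_neighbors=2))   # -> [4]
```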

4 Performance Evaluation

We have designed a generic anomaly identification framework, auto-AID, in which performance data collected by various health monitoring tools can be used. As a proof of concept, we implemented a prototype of auto-AID. In this section, we evaluate the performance of our framework for autonomic anomaly identification in a production networked computer system.

(a) User Time Utilization (x-axis: User Utilization (%), y-axis: Frequency, logarithmic scale). (b) Memory Utilization (x-axis: Memory Utilization (%), y-axis: Frequency).

Figure 3. Distributions of CPU and Memory Utilization in the Institute-wide Compute Grid System.

4.1 Experiment Platform

The institute-wide computational grid consists of 11 Linux clusters located in separate buildings on campus. Two clusters contain 116 nodes each, while the others host 32, 16, 22, 20, 10 and 8 nodes. In total, there are 362 high-performance compute servers in the grid, and they are dedicated to computational research. Among all the nodes, 87.6% were up most of the time from July 17, 2008 to December 31, 2009. The compute nodes are equipped with 4 to 8 Intel Xeon or AMD Opteron cores and 2.5 to 16 GB of RAM. Within each cluster, nodes are interconnected by gigabit Ethernet switches (the two clusters with 116 nodes each are also equipped with 2-Gigabit Myrinet). Connections between clusters are through gigabit Ethernet. Typical applications running on the grid include molecular dynamics simulations, genome and proteome analysis, chemical kinetics simulations, materials and metallodrug property analysis, and more. These parallel applications ran on 8 to 128 nodes and some of them lasted for more than 20 days. The grid is also open to institute students to execute their sequential and parallel programs.

4.2 Health Data Collection and Preprocessing

We used sysstat [1] to collect performance data on each node in the grid. The values of 83 performance metrics were recorded every five minutes. They cover the statistics of every component of each node, including CPU usage, process creation, task switching activity, memory and swap space utilization, paging and page faults, interrupts, network activity, I/O and data transfer, power management, and more. The collected data were pushed to master nodes in the grid for system health analysis.

Figure 4. Mutual Information of CPU and Memory Related Features. The 16 features are F1: proc/s; F2: %user; F3: %system; F4: %iowait; F5: %idle; F6: kbmemfree; F7: kbmemused; F8: %memused; F9: kbbuffers; F10: kbcached; F11: kbswpfree; F12: kbswpused; F13: %swpused; F14: kbswpcad; F15: pgpgout/s; F16: fault/s.

The collected data are cleaned first. Missing values of features are filled with the average of the two adjacent values; that is, $f_t = (f_{t-1} + f_{t+1})/2$, where $t$ denotes the time point at which the value of feature $f$ is missing. Then the data are parsed and transformed into a uniform format. A C program of about 1,500 lines, built on regular expressions, was written to parse the collected data. After parsing, the data are formatted as CSV and inserted into a database. Figure 2 depicts the database schema.
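The gap-filling rule and the CSV formatting step can be sketched as follows; the column layout is a simplified stand-in for the schema in Figure 2, assumed only for illustration.

```python
import csv
import math

def fill_gaps(series):
    """Replace each missing sample (None/NaN) with the mean of its two neighbors,
    i.e. f_t = (f_{t-1} + f_{t+1}) / 2, when both neighbors are present."""
    filled = list(series)
    for t in range(1, len(filled) - 1):
        v = filled[t]
        if v is None or (isinstance(v, float) and math.isnan(v)):
            prev, nxt = filled[t - 1], filled[t + 1]
            if prev is not None and nxt is not None:
                filled[t] = (prev + nxt) / 2.0
    return filled

# Example: clean one metric and write it into a simple CSV layout
# (node, timestamp index, value) -- a simplified stand-in for the real schema.
user_util = [12.0, None, 14.0, 13.5, float("nan"), 15.0]
with open("health_user_util.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["node", "t", "user_util_pct"])
    for t, value in enumerate(fill_gaps(user_util)):
        writer.writerow(["node-001", t, value])
```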

4.3 Anomaly Identification Performance

Figure 3 depicts the distributions of CPU utilization (i.e., user time) and memory utilization in the grid system. Data are cleaned based on these distributions. After the data are cleaned, the training data are passed to the feature selection component. A mutual information-based feature selection algorithm is used to choose independent features that capture the most information. Table 1 and Table 2 list the mutual information for each pair of features in the CPU Time Statistics category and the Memory Utilization Statistics category, respectively. In addition, we calculate the mutual information of the sixteen CPU and memory related features. Figure 4 depicts their normalized relevance. From the figure, we find that the normalized relevance of %user, %system and %idle is less than 0.6 among the CPU related features. Among the features of memory utilization statistics, kbbuffers, pgpgout/s and fault/s are selected based on Figure 4. By applying the mutual information based selection algorithm to other sets of features, we additionally select rxpck/s, txpck/s, and tps. In total, eight variables are selected to characterize system behavior, and we exploit them to identify anomalies.

Then, the PCA algorithm is utilized to reduce redundancy among the selected eight features. The results are shown in Figure 5. The first and second principal components account for more than 94.54% of the variability of the original dataset. Therefore, we reduce the dimension from five to two. With the two selected principal components, we apply a k-means clustering algorithm. The clustering results are shown in Figure 6. A heuristic method is applied to automatically determine the number of clusters that should be generated, based on rules about the data such as its mean, variability, and distribution. The k-means algorithm clusters the health-related data based on their Euclidean distance in the space of the selected principal components. Based on the distances, we identify outliers, which are sent to the system administrator for verification.
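The paper does not detail the heuristic used to choose the number of clusters, so the sketch below substitutes a common alternative, silhouette-based selection, and then flags the points farthest from their cluster centers as outlier candidates; both the heuristic and the top-distance cutoff are assumptions for illustration, not the authors' exact method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_and_flag(pc_data, k_candidates=range(2, 7), outlier_fraction=0.02):
    """Cluster points in the principal-component space and flag the points
    farthest from their cluster centers as potential outliers."""
    best_k, best_score, best_model = None, -1.0, None
    for k in k_candidates:                          # assumed heuristic: best silhouette
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pc_data)
        score = silhouette_score(pc_data, model.labels_)
        if score > best_score:
            best_k, best_score, best_model = k, score, model
    centers = best_model.cluster_centers_[best_model.labels_]
    dist = np.linalg.norm(pc_data - centers, axis=1)  # distance to own cluster center
    cutoff = np.quantile(dist, 1.0 - outlier_fraction)
    return best_k, np.where(dist >= cutoff)[0]        # indices sent for verification

# Usage with the two selected principal components of the health data
# (reduced_two_pc_matrix is a hypothetical m-by-2 array):
# k, suspects = cluster_and_flag(reduced_two_pc_matrix)
```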

Figure 5. Redundancy Reduction by Principal Component Analysis.

Figure 6. Clustering Results with the Selected Principal Components.

5 Related Work

As the complexity and scale of networked computing systems increase, health monitoring and anomaly identification tasks require significantly higher levels of automation. Examples include diagnosis and prediction based on realtime streams of computer events and continuous monitoring of runtime services. The core of autonomic computing [14, 16] is the ability to analyze data in realtime and to identify potential anomalies automatically. The objective is to avoid fatal failures or mitigate their impact by promptly executing remedy actions. To automate the identification of anomalies in large-scale computer systems, it is imperative to understand the characteristics of anomalous behaviors. Research in [26, 18, 25, 28] studied event traces collected from clusters and supercomputers. They found that failures are common in large-scale systems and their occurrences are quite dynamic, displaying uneven inter-arrival times. Sahoo et al. [25] examined the correlation of failure rate with the hour of the day and the distribution of failures across nodes. They reported that less than 4% of the nodes in a machine room experience almost 70% of the failures, and found failure rates during the day to be four times higher than during the night. A similar result was observed by Schroeder and Gibson [26]. Several studies [2, 27] have examined system logs to identify causal events that lead to failures. A correlation between workload intensity and failure rate in real systems was pointed out in many studies [4, 21]. Anomaly and failure detection based on the analysis of system logs has also been studied. Hellerstein et al. [20] developed a method to discover patterns such as message bursts, periodicity and dependencies from SNMP data in an enterprise network. Yamanishi et al. [29] modeled syslog sequences as a mixture of Hidden Markov Models to find messages that are likely to be related to critical failures.

Table 1. Normalized Mutual Information Matrix for CPU Time Statistics Features.

Features   proc/s  %user   %system  %iowait  %idle
proc/s     0       0.681   0.745    1.000    0.089
%user      0.681   0       0.656    0.714    0.042
%system    0.745   0.656   0        0.609    0.127
%iowait    1.000   0.714   0.609    0        0.118
%idle      0.089   0.042   0.127    0.118    0

Table 2. Normalized Mutual Information Matrix for Memory Utilization Statistics Features. (F1: kbmemfree; F2: kbmemused; F3: %memused; F4: kbbuffers; F5: kbcached; F6: kbswpfree; F7: kbswpused; F8: %swpused; F9: kbswpcad; F10: pgpgout/s; F11: fault/s)

Features   F1      F2      F3      F4      F5      F6      F7      F8      F9      F10     F11
F1         0       0.687   0.711   0.593   0.661   0.054   0.019   0.041   0.038   0.094   0.071
F2         0.687   0       0.862   0.602   0.771   0.040   0.037   0.036   0.040   0.088   0.067
F3         0.711   0.862   0       0.515   0.627   0.038   0.022   0.028   0.041   0.090   0.069
F4         0.593   0.602   0.515   0       0.609   0.078   0.051   0.078   0.055   0.047   0.034
F5         0.661   0.771   0.627   0.609   0       0.075   0.080   0.072   0.096   0.052   0.041
F6         0.054   0.040   0.038   0.078   0.075   0       0.728   1.000   0.598   0.017   0.079
F7         0.019   0.037   0.022   0.051   0.080   0.728   0       0.789   0.727   0.017   0.081
F8         0.041   0.036   0.028   0.078   0.072   1.000   0.789   0       0.612   0.016   0.073
F9         0.038   0.040   0.041   0.055   0.096   0.598   0.727   0.612   0       0.015   0.080
F10        0.094   0.088   0.090   0.047   0.052   0.017   0.017   0.016   0.015   0       0.732
F11        0.071   0.067   0.069   0.034   0.041   0.079   0.081   0.073   0.080   0.732   0

Lim et al. [19] analyzed a large-scale enterprise telephony system log with multiple heuristic filters to search for messages related to failures. However, treating a log as a single time series does not perform well in large-scale computer systems, where multiple independent processes generate interleaved logs; the model becomes overly complex and its parameters are hard to tune [29]. Our analysis is based on groups of health data rather than a time series of individual data points. The grouping approach makes it possible to obtain useful results with simple and efficient algorithms suitable for autonomic operation.

Recently, data mining and statistical learning technologies have received growing attention for failure detection and diagnosis. They do not require a priori models or knowledge of failure distributions. Patterns are learned from normal system behaviors and then used to detect anomalous behaviors [24]. For example, the group at the Berkeley RAD laboratory applied statistical learning techniques for failure diagnosis in Internet services [33]. Similar techniques were applied to automate failure management in information technology systems [5]. Statistical approaches were studied for forecasting failure events on the BlueGene/L supercomputer [17]. In our own studies [10, 11, 9, 13, 31, 8, 7, 12, 32], we developed a framework for failure prediction in networked computer systems and proactively managed system resources in a failure-aware manner.

6 Conclusions

In this paper, we present auto-AID, an anomaly identification framework for the autonomic management of large-scale networked computer systems. It exploits a collection of techniques to analyze runtime health data and identify anomalous behaviors in a system. The collected data are transformed to a uniform format, and the complexity of the data is reduced by extracting the primary features that characterize the system health dynamics. Anomalies are identified as outliers from the normal behaviors. Experimental results on an institute-wide computational grid system demonstrate the feasibility of applying auto-AID to autonomic anomaly identification in large-scale networked computer systems for dependability assurance.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive comments and suggestions. This research was supported in part by U.S. NSF Grant CNS-0915396 and LANL Grant IAS-1103.

References

[1] sysstat. Available at: http://pagesperso-orange.fr/sebastien.godard/.
[2] H. Berenji, J. Ametha, and D. Vengerov. Inductive learning for fault diagnosis. In Proceedings of IEEE Conference on Fuzzy Systems, 2003.
[3] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41(3):1–58, 2009.
[4] B. Chun and A. Vahdat. Workload and failure characterization on a large-scale federated testbed. Technical Report IRB-TR-03-040, Intel Research Berkeley, 2003.
[5] I. Cohen, M. Goldszmidt, T. Kelly, J. Symons, and J. S. Chase. Correlating instrumentation data to system states: a building block for automated diagnosis and control. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2004.
[6] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[7] S. Fu. Failure-aware construction and reconfiguration of distributed virtual machines for high availability computing. In Proceedings of IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid), 2009.
[8] S. Fu. Dependability enhancement for coalition clusters with autonomic failure management. In Proceedings of the 15th IEEE International Symposium on Computers and Communications (ISCC), 2010.
[9] S. Fu. Failure-aware resource management for high-availability computing clusters with distributed virtual machines. Journal of Parallel and Distributed Computing, 70(4):384–393, 2010.
[10] S. Fu and C.-Z. Xu. Exploring event correlation for failure prediction in coalitions of clusters. In Proceedings of ACM/IEEE Supercomputing Conference (SC), 2007.
[11] S. Fu and C.-Z. Xu. Quantifying temporal and spatial correlation of failure events for proactive management. In Proceedings of IEEE International Symposium on Reliable Distributed Systems (SRDS), 2007.

[12] S. Fu and C.-Z. Xu. Proactive resource management for failure resilient high performance computing clusters. In Proceedings of IEEE International Conference on Availability, Reliability and Security (ARES), 2009.
[13] S. Fu and C.-Z. Xu. Quantifying event correlations for proactive failure management in networked computing systems. Journal of Parallel and Distributed Computing, 70(11):1100–1109, 2010.
[14] A. G. Ganek and T. A. Corbi. The dawning of the autonomic computing era. IBM Systems Journal, 42(1):5–18, 2003.
[15] J. Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., 2005.
[16] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003.
[17] Y. Liang, Y. Zhang, A. Sivasubramaniam, M. Jette, and R. K. Sahoo. BlueGene/L failure analysis and prediction models. In Proceedings of IEEE Conference on Dependable Systems and Networks (DSN), 2006.
[18] Y. Liang, Y. Zhang, A. Sivasubramaniam, R. Sahoo, J. Moreira, and M. Gupta. Filtering failure logs for a BlueGene/L prototype. In Proceedings of IEEE DSN, 2005.
[19] C. Lim, N. Singh, and S. Yajnik. A log mining approach to failure analysis of enterprise telephony systems. In Proceedings of IEEE Conference on Dependable Systems and Networks (DSN), 2008.
[20] S. Ma and J. L. Hellerstein. Mining partially periodic event patterns with unknown periods. In Proceedings of IEEE International Conference on Data Engineering (ICDE), 2001.
[21] J. Meyer and L. Wei. Analysis of workload influence on dependability. In Proceedings of Symposium on Fault-Tolerant Computing (FTCS), 1988.
[22] A. J. Oliner and J. Stearley. What supercomputers say: A study of five system logs. In Proceedings of IEEE Conference on Dependable Systems and Networks (DSN), 2007.
[23] D. Oppenheimer, A. Ganapathi, and D. Patterson. Why do Internet services fail, and what can be done about it. In Proceedings of USENIX Symposium on Internet Technologies and Systems (USITS), 2003.

[24] W. Peng, T. Li, and S. Ma. Mining log files for computing system management. In Proceedings of IEEE International Conference on Autonomic Computing (ICAC), 2005.
[25] R. K. Sahoo, A. Sivasubramaniam, M. S. Squillante, and Y. Zhang. Failure data analysis of a large-scale heterogeneous server environment. In Proceedings of IEEE DSN, 2004.
[26] B. Schroeder and G. Gibson. A large-scale study of failures in high-performance-computing systems. In Proceedings of IEEE Conference on Dependable Systems and Networks (DSN), 2006.
[27] R. Vilalta and S. Ma. Predicting rare events in temporal domains. In Proceedings of IEEE Conference on Data Mining (ICDM), 2002.
[28] P. Yalagandula, S. Nath, H. Yu, P. B. Gibbons, and S. Seshan. Beyond availability: Towards a deeper understanding of machine failure characteristics in large distributed systems. In Proceedings of USENIX WORLDS, 2004.
[29] K. Yamanishi and Y. Maruyama. Dynamic syslog mining for network failure monitoring. In Proceedings of ACM Conference on Knowledge Discovery in Data Mining (KDD), 2005.
[30] Y. Zhang, M. S. Squillante, A. Sivasubramaniam, and R. K. Sahoo. Performance implications of failures in large-scale cluster scheduling. In Proceedings of Workshop on Job Scheduling Strategies for Parallel Processing, 2004.
[31] Z. Zhang and S. Fu. Failure prediction for autonomic management of networked computer systems with availability assurance. In Proceedings of IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems, in conjunction with IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2010.
[32] Z. Zhang and S. Fu. A hierarchical failure management framework for dependability assurance in compute clusters. International Journal of Computational Science, 2010.
[33] A. Zheng, J. Lloyd, and E. Brewer. Failure diagnosis using decision trees. In Proceedings of IEEE International Conference on Autonomic Computing (ICAC), 2004.
