Muthumani N et al. / International Journal of Engineering Science and Technology (IJEST)

A SURVEY ON FAILURE PREDICTION METHODS

MUTHUMANI N
Department of Computer Applications, SNR Sons College, Coimbatore-641 006.

DR. ANTONY SELVADASS THANAMANI
Department of Computer Science, NGM College, Pollachi.

Abstract: Preventive measures against anomalous system behavior depend on a failure prediction mechanism. An enormous number of faults can occur in a computing system, and any of them can lead to system failure. Faults themselves are unknown and cannot be measured directly; they become visible only through the error messages produced when they are detected. This paper presents a survey of various failure prediction methods.

Keywords: Fault; Error; Failure; Failure prediction; Log files.

1. Introduction

An important characteristic of an intelligent agent is its ability to learn from previous experience in order to predict future events. The mechanization of the learning process by computer algorithms has led to a vast amount of research on the construction of predictive algorithms. Failure prediction methods differ fundamentally in whether they evaluate the current state of the system. Since the current state can only be taken into account if some monitoring of the system is used as input data, such methods are also called monitoring-based methods. Methods that evaluate the current system state can be further divided into three categories according to the stage of failure evolution at which observations are taken: monitoring of symptoms, detection of errors, or observation of failures.

2. Literature Survey

Errin W. Fulp et al. [8] introduced a system failure prediction method based on Support Vector Machines (SVM). The source of data is system log files. The approach takes advantage of the sequential nature of log messages and determines which sequences of messages are precursors to failure. Each message is represented by its tag value, which offers an indication of message criticality. Experimental results using log files from a large 1024-node Linux-based compute cluster indicate that the spectrum representation of messages combined with an SVM classifier can achieve an accuracy of 73%.

Fu et al. [9] developed a spherical covariance model with an adjustable timescale parameter to quantify temporal correlation, and a stochastic model to characterize spatial correlation. They discovered more correlations among failure instances by taking into account information about application allocation. Failure events are clustered based on their correlations, and the clusters are used to predict future occurrences. Experimental results on a production coalition system, the Wayne State Grid, show that their offline and online predictions can forecast 72.7% to 85.3% of the failure occurrences and capture failure correlations in a cluster coalition environment.

Xiaojuan Ren et al. [21] developed a multi-state model to represent the characteristics of resource failures in Fine-Grained Cycle Sharing (FGCS) systems. They applied a semi-Markov Process (SMP) to predict the probability that no resource failure will happen in a future time window, based on the host resource usage history. The resulting prediction model was implemented and tested in the iShare Internet sharing system. Experimental results show that the prediction algorithm adds less than 0.006% overhead to a guest job and that the prediction accuracy is higher than 86.5% on average.
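The prediction target in [21], the probability that no resource failure occurs within a future time window, can be illustrated with a much simpler model. The sketch below is a simplification under stated assumptions: it uses a plain discrete-time Markov chain rather than the semi-Markov process of [21], and the state layout and transition probabilities are hypothetical values chosen only for illustration.

```python
def survival_probability(transition, start, failed, steps):
    """Probability of not entering the `failed` state within `steps` transitions.

    transition[i][j] is the (hypothetical) probability of moving from
    state i to state j in one time step; each row should sum to 1.
    """
    if start == failed:
        return 0.0
    n = len(transition)
    # Probability mass that is still in a non-failed state.
    dist = [0.0] * n
    dist[start] = 1.0
    for _ in range(steps):
        nxt = [0.0] * n
        for i in range(n):
            if i == failed or dist[i] == 0.0:
                continue
            for j in range(n):
                # Mass that moves into the failed state is dropped,
                # which makes failure absorbing for this calculation.
                if j != failed:
                    nxt[j] += dist[i] * transition[i][j]
        dist = nxt
    return sum(dist)


# Hypothetical host states: 0 = idle, 1 = heavily loaded, 2 = failed.
example_transition = [
    [0.90, 0.08, 0.02],
    [0.30, 0.60, 0.10],
    [0.00, 0.00, 1.00],
]
print(survival_probability(example_transition, start=0, failed=2, steps=10))
```

A semi-Markov model additionally tracks how long the system stays in each state, which this sketch deliberately leaves out.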
Ren et al. [21] also tested how well the prediction accommodates deviations of host workloads, and the results show that the impact of such deviations on the prediction is negligible.

Li Y and Lan Z [17] proposed an adaptive fault management approach, FT-Pro, that exploits failure prediction. FT-Pro uses a cost-based evaluation algorithm to integrate proactive process migration with reactive checkpointing, and it makes use of local failure event prediction at each node. They demonstrated that the proposed adaptive fault management scheme can be effective even with modest prediction accuracy.

ISSN : 0975-5462

Vol. 3 No. 2 Feb 2011

F. Salfner et al. [22] presented a failure prediction approach called Similar Events Prediction (SEP), which is based on the recognition of suspicious patterns of error events. They compared SEP with two other failure prediction techniques of the same class, that is, techniques that evaluate event logs such as error or failure logs: the Dispersion Frame Technique (DFT) and a reliability-based technique. All three models were applied to data from a complex commercial telecommunication system. The predictive power of the approaches was compared in terms of precision, recall, F-measure and accumulated runtime cost. SEP outperformed the other failure prediction techniques in all measures and achieved a precision of 80% and a recall of 92%.

Woochul Kang and Andrew Grimshaw [27] presented a method for failure prediction in which periodic failures are first identified and then filtered from the failure list (Filtered Failure Prediction). The remaining failures are then used in a traditional statistical method. The use of prefiltering leads to predictions that are an order of magnitude better.

Liang et al. [18] collected event logs over an extensive period from IBM BlueGene/L and developed a prediction model based on the real failure data. They partitioned time into intervals and tried to find the fatal and failure events within the prediction window based on the event characteristics of the preceding intervals. They addressed two main challenges: feature selection and classification.

Zhiguo Li et al. [29] presented a data-driven technique to predict the occurrence of failure events based on event sequence data. The Cox proportional hazard model was used to provide a rigorous statistical prediction of system failure events.
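Precision, recall and F-measure, the measures used in [22] to compare SEP with the other techniques, can be computed from prediction counts as sketched below. The balanced F-measure (beta = 1) is assumed here, since the survey does not state which variant was used, and the counts in the example are hypothetical.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw prediction counts."""
    predicted = true_pos + false_pos   # all failure warnings raised
    actual = true_pos + false_neg      # all failures that really occurred
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 8 correct warnings, 2 false alarms, 2 missed failures.
p, r = precision_recall(true_pos=8, false_pos=2, false_neg=2)
print(p, r, f_measure(p, r))
```

Plugging in the reported precision of 0.80 and recall of 0.92 gives a balanced F-measure of roughly 0.86; this value is derived here and is not reported in the survey.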
Li et al. [29] also developed an algorithm to extract frequent failure signatures; two types of failure signatures, parallel and serial, were identified efficiently. By coding the failure signatures as time-dependent covariates and interactions, a Cox prediction model was developed based on the frequent failure signatures.

Turnbull and Alldrin [24] analyzed hardware sensor data to predict failures in a high-end compute server. Features are extracted using sensor windows and potential failure windows. They trained radial basis function (RBF) networks on these features and achieved a 0.87 true positive rate and a 0.10 false positive rate for predicting failures, using a data set of sensor and failure information collected over a 5-month period. This shows that sensor data can be used to predict failures in hardware systems. They demonstrated that RBF network classifiers work well in terms of both computational performance and classification accuracy, and that classification accuracy is further improved by feature subset selection.

P. Gujrati et al. [10] presented a framework for failure prediction in Blue Gene/L that comprises three phases: event preprocessing, base prediction and meta-learning prediction. They proposed the use of meta-learning to improve failure prediction in large-scale clusters such as Blue Gene/L. The framework adaptively combines two widely used base prediction methods, a statistical method and a rule-based method, to discover various fault modes. They demonstrated that meta-learning prediction can improve prediction accuracy by up to three times.

Jiexing Gu et al. [11] presented a dynamic meta-learning prediction engine for large-scale systems.
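In a Cox proportional hazard model, the hazard at time t for covariates x is h(t | x) = h0(t) * exp(beta . x), so each active failure signature multiplies the baseline hazard by a constant factor. The sketch below shows only this relative-hazard step under stated assumptions: the helper names, the coefficient values and the window length are hypothetical, and fitting the coefficients from data, as done in [29], is not shown.

```python
import math


def signature_active(t, occurrences, window):
    """Time-dependent covariate: 1 if a failure signature was observed
    within the last `window` time units before t, otherwise 0."""
    return int(any(t - window < s <= t for s in occurrences))


def relative_hazard(coefficients, covariates):
    """exp(beta . x): hazard relative to the baseline h0(t)."""
    if len(coefficients) != len(covariates):
        raise ValueError("one coefficient per covariate is required")
    return math.exp(sum(b * x for b, x in zip(coefficients, covariates)))


# Hypothetical coefficients for two signatures (not taken from [29]).
betas = [0.7, 1.2]
now = 10.0
x = [
    signature_active(now, occurrences=[3.0, 9.5], window=1.0),  # seen recently
    signature_active(now, occurrences=[2.0], window=1.0),       # not recent
]
print(x, relative_hazard(betas, x))
```

A relative hazard above 1 means the monitored system is, under the model, more likely to fail at that moment than a system with no active signatures.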
In the engine of Gu et al. [11], the “dynamic” part has two aspects: the training set is continuously enlarged during system operation, and the rules describing failure patterns are dynamically modified by tracing prediction accuracy at runtime. They used a 130-week RAS log from the production Blue Gene/L system at SDSC and showed that the engine can effectively forecast failures with a precision of 0.9-1.0 and a recall of 0.7-0.8.

Hoffmann et al. [13] employed two modelling approaches for failure forecasting: an extended Markov chain model and a function approximation technique utilising universal basis functions (UBF). Their results show that these methods can achieve 82%-92% accuracy in predicting rare failure events.

Yang, S [28] presented a failure prediction and processing scheme for preventive maintenance (PM), using a thermal power plant as an example, based on a hybrid Petri net modeling method endowed with fault-tree analysis and Kalman filtering. The first step is to construct the FPN (a Petri net dealing with system failure). The next step is to obtain control charts for all fault places in the FPN in order to prescribe thresholds and increment times for every step in Kalman prediction. Afterwards, the system model of each place in the FPN must be derived to perform Kalman filtering.

Sahoo et al. [23] discussed critical event prediction in terms of proactive management in autonomic computing, applied to large-scale computer clusters. They suggest the use of linear time-series models and rule-based classification techniques for rare event prediction. The prediction is made by detecting occurrences of a set of event types in a time window. They assumed that a predictor has detailed information about event types, which is rarely available in Grid environments.
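The rule-based idea described for [23], predicting a critical event when a given set of event types occurs within a time window, can be sketched as follows. This is a minimal illustration, not the classifier of [23]: the function name, the event type labels and the window length are hypothetical.

```python
def predicts_critical_event(events, rule_types, now, window):
    """Return True if every event type in `rule_types` was observed
    in the interval [now - window, now].

    events: iterable of (timestamp, event_type) pairs.
    """
    seen = {etype for ts, etype in events if now - window <= ts <= now}
    return set(rule_types) <= seen


# Hypothetical log excerpt: (timestamp in minutes, event type).
log = [(1, "disk_warn"), (5, "net_retry"), (9, "ecc_error")]

# Rule: a critical event is predicted if both types occur in the window.
print(predicts_critical_event(log, {"net_retry", "ecc_error"}, now=10, window=6))
print(predicts_critical_event(log, {"disk_warn", "ecc_error"}, now=10, window=6))
```

The sketch makes the dependence on event types explicit: the rule can only fire if the log labels each event with a type, which is the information the survey notes is rarely available in Grid environments.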


R. Vilalta and S. Ma [25] described an approach to detect patterns in event sequences. They defined special events called target events. Using association rule mining techniques, they find patterns that frequently occur before target events. The patterns are then combined into a rule-based model for prediction. They demonstrated the importance of the size of the time window preceding target events: experiments on two different combinations of event type and host of interest show that the false negative error rate decreases significantly as the time window increases.

Hamerly and Elkan [12] introduced two Bayesian methods. The first is a mixture model of naive Bayes submodels (i.e. clusters) that is trained using expectation-maximization. The second is a naive Bayes classifier, a supervised learning approach. Both methods were tested on real-world data concerning 1936 drives. The predictive accuracy of both algorithms is far higher than that of the thresholding methods used in the disk drive industry at the time, and both perform well enough to be useful in practice.

Bianca Schroeder et al. [1] analyzed failure data from a high-performance computing site. The data was collected at Los Alamos National Laboratory and includes 23000 failures recorded on more than 20 different systems, mostly large clusters of SMP and NUMA nodes. They studied the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. The time between failures is modeled well by a Weibull distribution with decreasing hazard rate.

I. Lee et al. [16] presented a methodology for the analysis of automatically generated event logs from fault-tolerant systems. They used event log data from three Tandem systems. The raw event log was reduced by event filtering and time-domain clustering.
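The "decreasing hazard rate" reported in [1] for the time between failures can be made concrete with the Weibull hazard function h(t) = (k / lambda) * (t / lambda)^(k - 1), which decreases in t whenever the shape parameter k is below 1. The sketch below evaluates this formula; the shape and scale values are hypothetical and are not the fitted values from [1].

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1), for t > 0."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1.0)


# Hypothetical parameters: shape < 1 gives a decreasing hazard rate.
shape, scale = 0.7, 100.0
for hours_since_failure in (1.0, 10.0, 100.0, 1000.0):
    print(hours_since_failure, weibull_hazard(hours_since_failure, shape, scale))
```

With a shape below 1, the longer a system has run since its last failure, the lower its instantaneous failure rate; a shape of exactly 1 reduces to the constant hazard of an exponential distribution.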
In the analysis of Lee et al. [16], probability distributions that characterize the error detection and recovery processes are obtained and the corresponding hazards are calculated. Multivariate statistical techniques (factor analysis and cluster analysis) are used to investigate error and failure dependency among different system components.

Chang-Hua Hu et al. [4] developed a reliability prediction technique based on the evidential reasoning (ER) algorithm. The ER algorithm is applied to forecast reliability in turbocharger engine systems, and its feasibility and validity in system reliability prediction are examined. Nonlinear optimization models are used to find the optimal parameters of the forecasting model by minimizing the mean square error (MSE) criterion.

J. Brevik et al. [3] examined the problem of predicting machine availability in desktop and enterprise computing environments. They compared one parametric and two non-parametric methods for predicting machine availability, using a synthetic trace and machine availability traces from three separate desktop and enterprise computing environments. Their results show that a non-parametric approach is better in most experiments at estimating the lower bound of a given quantile, especially when the sample size is small. They found that a non-parametric method based on a binomial approach generates the most accurate estimates.

Ei-Aroui, M. and Soler, J. [7] used a Bayesian statistical model to track and predict software reliability. They assumed a stochastic model in which the successive times between software failures are exponentially distributed. They showed, on numerical examples with simulated failure data and on real data, that the proposed method is useful.

T.-T. Y. Lin and D. P. Siewiorek [20] considered two types of errors: transient and intermittent.
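A binomial approach to bounding a quantile from below, of the kind mentioned for [3], can be sketched with order statistics. The k-th smallest sample lies at or below the true q-quantile exactly when at least k samples fall at or below that quantile, and the number of such samples follows a Binomial(n, q) distribution. The code below is a standard textbook construction and is only assumed to resemble the method in [3]; the function name and the sample values are hypothetical.

```python
from math import comb


def quantile_lower_bound(samples, q, confidence):
    """Largest order statistic that is a lower bound on the q-quantile
    with at least the given confidence, or None if no sample qualifies."""
    n = len(samples)
    ordered = sorted(samples)
    best = None
    tail = 1.0  # P(Binomial(n, q) >= 0)
    for k in range(1, n + 1):
        # Remove P(X = k - 1) so that tail becomes P(X >= k).
        tail -= comb(n, k - 1) * q ** (k - 1) * (1.0 - q) ** (n - k + 1)
        if tail >= confidence:
            best = ordered[k - 1]
        else:
            break  # the tail probability only shrinks as k grows
    return best


# Hypothetical machine availability durations in hours.
durations = [10, 1, 9, 2, 8, 3, 7, 4, 6, 5]
print(quantile_lower_bound(durations, q=0.5, confidence=0.9))
```

The construction makes no assumption about the shape of the availability distribution, which is why it remains usable for small samples where fitting a parametric model is unreliable.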
Lin and Siewiorek [20] developed the Dispersion Frame Technique (DFT), which is based on the shape of the interarrival time function of the intermittent errors observed in actual error logs. The DFT was implemented in a distributed on-line monitoring and predictive diagnostic system for the campus-wide Andrew file system at Carnegie Mellon University. Data collected from 13 file servers over a 22-month period were analyzed using both the DFT and conventional statistical methods. The results show that the DFT can extract intermittent errors from the error log and uses only one fifth of the error log entry points required by statistical methods for failure prediction.

Liang et al. [19] predicted failures of IBM’s BlueGene/L from event logs containing reliability, availability and serviceability (RAS) data. They used temporal and spatial compression. Temporal compression groups all events at a single location that occur with inter-event times lower than some threshold, and spatial compression groups all messages that refer to the same location within some time window.

Berenji et al. [2] presented a hybrid Model based and Data Clustering (MDC) architecture for fault monitoring and diagnosis, which is suitable for complex dynamic systems with continuous and discrete variables.

Cheng et al. [5] proposed an application cluster service (APCS) scheme, which provides both a failover scheme and a state recovery scheme for failure management.

Hughes et al. [14] employed a rank sum hypothesis test to identify failure-prone hard disks. Two improved SMART algorithms are proposed, which use the SMART internal drive attribute measurements available in present drives.
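The temporal compression step described for [19] can be sketched as a filter that keeps one representative for each burst of events at the same location. This is an interpretation under stated assumptions: the survey does not say how a burst is represented, so keeping the first event of each burst is a choice made here, and the record layout and threshold are hypothetical.

```python
def temporal_compress(events, threshold):
    """Collapse bursts of events at the same location.

    events: iterable of (timestamp, location, message) tuples.
    An event is dropped when the previous event at the same location
    occurred less than `threshold` time units earlier.
    """
    last_seen = {}  # location -> timestamp of the most recent event there
    kept = []
    for ts, loc, msg in sorted(events, key=lambda e: e[0]):
        prev = last_seen.get(loc)
        if prev is None or ts - prev >= threshold:
            kept.append((ts, loc, msg))
        last_seen[loc] = ts
    return kept


# Hypothetical RAS-style records: (seconds, node, message).
raw = [
    (0, "node1", "ecc error"),
    (2, "node1", "ecc error"),
    (3, "node2", "link down"),
    (20, "node1", "ecc error"),
]
print(temporal_compress(raw, threshold=5))
```

Because the gap is measured to the most recent event at a location rather than to the last kept event, a long run of closely spaced messages is reduced to a single entry.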


In the approach of Hughes et al. [14], the existing warning algorithm based on maximum error thresholds is replaced by distribution-free statistical hypothesis tests.

Daidone et al. [6] proposed a hidden Markov model approach. Taking advantage of the hidden Markov model formalism, which is widely used in pattern recognition, they proposed a formalization of the diagnosis process that addresses the complete chain constituted by monitored component, deviation detection and state diagnosis. The method is based on concurrent monitoring, so it could also be used for failure prediction: if a component is detected to be faulty, a failure is likely to occur.

Weiss [26] introduced a failure prediction technique called Timeweaver, a genetic-algorithm-based machine learning system that solves the event prediction problem by identifying predictive temporal and sequential patterns within data.

Leangsuksun et al. [15] implemented predictive checkpointing for a high-availability, high-performance Linux cluster.

3. Conclusion

Preventive measures against anomalous system behavior depend on a failure prediction mechanism. An enormous number of faults can occur in a computing system, and any of them can lead to system failure. Faults themselves are unknown and cannot be measured directly; they become visible only through the error messages produced when they are detected. A survey of failure prediction methods has been presented here.

Table 1: Failure Prediction Methods

| Study | Date | Length | Environment | Type of Data | Approach |
|---|---|---|---|---|---|
| 1 | 2008 | 24 Months | 1024 Node Linux-Based Compute Cluster | System Log Files | SVM |
| 2 | 2007 | - | Wayne State Grid | Failure Log | Spherical Covariance & Stochastic Model |
| 3 | 2006 | 2 Days | Commercial Telecommunication Platform | Error Logs | SEP |
| 4 | 2006 | 8 Months | Supercomputer Platinum at NCSA | Failure Log | FT-Pro |
| 5 | 2006 | 3 Months | iShare Internet Sharing System | Log | Semi-Markov |
| 6 | 2005 | 1 Month | CT | Log Files | Cox Proportion Model |
| 7 | 2005 | 20 Weeks | IBM BlueGene/L | RAS Event Logs | Customized Nearest Neighbor |
| 8 | 2005 | 3 Months | University of Virginia | Monitoring | FFP |
| 9 | 2004 | 5 Months | Single Server | Sensor and Failure Information | RBF |
| 10 | 2004 | 130 Weeks | IBM BlueGene/L Systems at SDSC | RAS Event Logs | Dynamic Meta Learner |
| 11 | 2004 | 20 Months | IBM BlueGene/L Systems at ANL and ... | RAS Event Logs | Meta Learner |
| 12 | 2004 | 53 Days | Commercial Telecommunication Platform | RAS Event Logs & Error Logs | UBF |
| 14 | 2002 | 1 Month | Network Having 750 Hosts | Event Log | Rule Based Model |
| 15 | 2002 | 1 Year | 350 Nodes Cluster System | Event Log, SAR Data, Node Topology | Time Series, Rule Based, Bayesian Network |
| 16 | 2001 | - | Hardware Systems-Disk Drives | Quantum SMART Dataset | Naive Bayes EM |
| 17 | 1997 | 9 Years | Los Alamos National Laboratory | Failure Data | Weibull Distribution |
| 18 | 1991 | - | 3 Tandem Systems | Event Log | Multivariate Statistical Techniques |
| 21 | 1996 | - | 40 Suits of Turbochargers | Time to Failure Data | ER Algorithm |
| 22 | 1990 | 22 Months | 13 VICE File Servers | Error Logs | DFT |


References

[1] Bianca Schroeder, Garth A. Gibson, A Large-Scale Study of Failures in High-Performance Computing Systems, Proceedings of the International Conference on Dependable Systems and Networks (DSN 2006), Philadelphia, USA, June 25-28, 2006.
[2] Berenji, H., Ametha, J., & Vengerov, D. Inductive Learning for Fault Diagnosis. In IEEE Proceedings of 12th International Conference on Fuzzy Systems (FUZZ’03), volume 1. 2003.
[3] Brevik, J., D. Nurmi and R. Wolski, Automatic Methods for Predicting Machine Availability in Desktop Grid and Peer-to-Peer Systems, CCGRID, IEEE, 2004, pp. 190-199.
[4] Chang-Hua Hu, Xiao-Sheng Si, Jian-Bo Yang, System Reliability Prediction Model Based on Evidential Reasoning Algorithm with Nonlinear Optimization.
[5] Cheng, F., Wu, S., Tsai, P., Chung, Y., & Yang, H. Application Cluster Service Scheme for Near-Zero-Downtime Services. In IEEE Proceedings of the International Conference on Robotics and Automation, 4062–4067. 2005.
[6] Daidone, A., Di Giandomenico, F., Bondavalli, A., & Chiaradonna, S. Hidden Markov Models as a Support for Diagnosis: Formalization of the Problem and Synthesis of the Solution. In IEEE Proceedings of the 25th Symposium on Reliable Distributed Systems (SRDS 2006). Leeds, UK, 2006.
[7] Ei-Aroui, M., & Soler, J. (1996). A Bayes Nonparametric Framework for Software Reliability Analysis. IEEE Transactions on Reliability, 45, 652–660.
[8] Errin W. Fulp, Glenn A. Fink, Jereme N. Haack, Predicting Computer System Failures Using Support Vector Machines, Proceedings of the First USENIX Conference on Analysis of System Log
[9] Fu, S. & Xu, C.-Z. Quantifying Temporal and Spatial Fault Event Correlation for Proactive Failure Management. In IEEE Proceedings of Symposium on Reliable and Distributed Systems (SRDS 07). 2007.
[10] Gujrati, P., Y. Li, Z. Lan, R. Thakur, and J. White, “A Meta-Learning Failure Predictor for BlueGene/L Systems,” Proc. of ICPP’07.
[11] Jiexing Gu, Ziming Zheng, Zhiling Lan, John White, Eva Hocks, Byung-Hoon Park, Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study, Proceedings of the International Conference on Parallel Processing, 2008.
[12] Hamerly, G. & Elkan, C. Bayesian Approaches to Failure Prediction for Disk Drives. In Proceedings of the Eighteenth International Conference on Machine Learning, 202–209. Morgan Kaufmann Publishers Inc., 2001.
[13] Hoffmann, G. A., Salfner, F., Malek, M. Advanced Failure Prediction in Complex Software Systems, SRDS 2004.
[14] Hughes, G., Murray, J., Kreutz-Delgado, K., & Elkan, C. Improved Disk-Drive Failure Warnings. IEEE Transactions on Reliability, volume 51(3): 350–357, 2002.
[15] Leangsuksun, C., Liu, T., Rao, T., Scott, S., & Libby, R. A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster. In The 5th LCI International Conference on Linux Clusters: The HPC Revolution, 18–20. 2004.
[16] Lee, I., R. K. Iyer and D. Tang, Error/Failure Analysis Using Event Logs from Fault Tolerant Systems, Proceedings 21st Intl. Symposium on Fault-Tolerant Computing, 1991, pp. 10-17.
[17] Li, Y. & Lan, Z. Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing. In IEEE Proceedings of the Sixth International Symposium on Cluster Computing and the Grid (CCGrid’06), 531–538. IEEE Computer Society, Los Alamitos, CA, USA.
[18] Liang, Y., Zhang, Y., Xiong, H., and Sahoo, R. Failure Prediction in IBM BlueGene/L Event Logs. In Proceedings of the IEEE International Conference on Data Mining (2007).
[19] Liang, Y., Zhang, Y., Sivasubramaniam, A., Jette, M., & Sahoo, R. BlueGene/L Failure Analysis and Prediction Models. In IEEE Proceedings of the International Conference on Dependable Systems and Networks (DSN 2006), 425–434. 2006.
[20] Lin, T.-T. Y. and D. P. Siewiorek. Error Log Analysis: Statistical Modeling and Heuristic Trend Analysis. IEEE Transactions on Reliability, 39(4):419–432, Oct. 1990.
[21] Ren, X., S. Lee, R. Eigenmann and S. Bagchi, Resource Failure Prediction in Fine-Grained Cycle Sharing System, IEEE HPDC, Paris, France, 2006.
[22] Salfner, F., M. Schieschke, and M. Malek. Predicting Failures of Computer Systems: A Case Study for a Telecommunication System. In Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS 2006), DPDNS Workshop, Rhodes Island, Greece, Apr. 2006.
[23] Sahoo, R. K., A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta and A. Sivasubramaniam, Critical Event Prediction for Proactive Management in Large-Scale Computer Clusters, KDD '03: Proceedings of the Ninth ACM International Conference on Knowledge Discovery and Data Mining, ACM Press, Washington, D.C., 2003, pp. 426-435.
[24] Turnbull, D. & Alldrin, N. Failure Prediction in Hardware Systems. Technical Report, University of California, San Diego, 2003.
[25] Vilalta, R. and S. Ma, “Predicting Rare Events in Temporal Domains”, Proc. of IEEE Intl. Conf. on Data Mining, 2002.
[26] Weiss, G. Timeweaver: A Genetic Algorithm for Identifying Predictive Patterns in Sequences of Events. In Proceedings of the Genetic and Evolutionary Computation Conference, 718–725. Morgan Kaufmann, San Francisco, CA, 1999.
[27] Woochul Kang and Andrew Grimshaw, Failure Prediction in Computational Grids.
[28] Yang, S. A Condition-Based Failure-Prediction and Processing-Scheme for Preventive Maintenance. IEEE Transactions on Reliability, volume 52(3): 373–383, 2003.
[29] Zhiguo Li, Shiyu Zhou, Suresh Choubey and Crispian Sievenpiper, Failure Event Prediction Using the Cox Proportional Hazard Model Driven by Frequent Failure Signatures, IIE Transactions (2007) 39, 303–315.
