Providing Persistent and Consistent Resources through Event Log Analysis and Predictions for Large-scale Computing Systems

Ramendra K. Sahoo, Myung Bae*, Ricardo Vilalta, Jose Moreira, Sheng Ma & Manish Gupta
IBM TJ Watson Research Center, Yorktown Heights, NY 10598
*Unix Development Laboratory, IBM, Poughkeepsie, NY 12601

1. INTRODUCTION

The ability to track and analyze every possible fault condition, whether transient (soft) or permanent (hard), is one of the most critical requirements for large-scale cluster computer systems. All such events are generally termed "RAS events" (RAS for Reliability, Availability, and Serviceability). Depending on the complexity of the cluster, RAS events are categorized into logs for a number of subsystems, such as (a) the CPU subsystem, (b) the memory subsystem, (c) the power supply subsystem, and (d) the I/O subsystem [9,10]. RAS event monitoring not only helps in addressing problems related to hardware maintenance, heartbeat monitoring, etc., but also in determining the role of the software system, including applications, in such events. With an eye toward future cluster systems such as the Blue Gene (BG/L) series of computers [1], we are developing an automated health monitoring and prediction system to minimize human intervention in system management and control. To achieve a degree of self-healing and automatic resource management, we are also evaluating the merits and drawbacks of existing monitoring methods and models, including various "forecasting" algorithms, so that failures can be predicted and acted on proactively. This paper covers three important aspects from a feasibility study point of view: (1) analysis of error events, based on real data collected from a 350-node cluster; (2) development of a proxy based resource monitoring model built on system management characteristics; and (3) evaluation of fault prediction algorithms for large-scale cluster systems. The rest of the paper is divided into six sections. Section 2 takes a brief look at some of the literature on error/event monitoring, current cluster system management tools, and relevant fault prediction algorithms. Section 3 describes event characteristics observed through error logs collected on a real system. A brief description of the Proxy Resource Model [3], developed to address the functionality and scale of a clustered system, is given in Section 4. Section 5 covers a number of prediction/forecasting methods used to predict rare events for clustered networks or computer systems. The results and discussion section (Section 6) presents event distributions and characteristics based on realistic data obtained from a 350-node UNIX based cluster. Finally, we conclude the paper with a summary of our results and future work plans.

2. LITERATURE REVIEW

Most of the earlier literature on related topics is spread over three different areas: (1) error logging and analysis in the area of fault-tolerant computing; (2) computer system management and automation, covering mechanisms to capture events from a single system up to large-scale clusters; and (3) event forecasting or prediction, in the category of prediction algorithms or standard time-series forecasting [4]. Our literature review is confined to some of the recent work on multi-processor systems addressing these three areas.

2.1 Error/Event Log

A number of studies related to error/failure logs have been carried out in the fault-tolerant computing literature [5,6,9,10,17]. Some of the earlier works were based on heuristic trend analysis and theoretical models built on raw measurements from small to medium scale computer systems during the early 1980s and 1990s. Tsao's research at CMU [17] demonstrated the feasibility of tuple-based classification to reduce the data observed on a DEC system. Lee et al. [9] and Lin et al. [10] independently analyzed error trends for Tandem systems and DCE environments, establishing lognormal and other functional distributions by fitting the observed data to specific functions. More recently, Buckley [6] carried out a study on a fairly large VAX/VMS cluster with a total of 193 systems, collecting 2.35 million events covering about 335 machine-years. His work coalesced related events into a set of critical event logs based on Tsao's tuple-based approach. His scheme provided extensive analyses of the event logs, yielding a large body of data from which fault diagnosis and recovery can be carried out.

2.2 Resource Management Model

Most resource management models and related work for large-scale computer systems are complex and proprietary in nature. Some of the recent projects in this area are the "autonomic computing" initiative from IBM [8], "N1" from Sun, and the "Utility Data Center" from HP. The S/390 RAS work [11] is a good example of a resource management model implemented in the mainframe domain. Recent work on UNIX based platforms includes Reliable Scalable Cluster Technology (RSCT) [15], which provides seamless High Availability Topology and Group Services (HATS/HAGS). RSCT also provides other features, such as First Failure Data Capture (FFDC), to isolate and pinpoint system-related problems or faults. For Blue Gene type clusters such technology needs to be revisited and evaluated, because of specialized needs and the presence of hundreds of thousands of system-on-chip processors.

2.3 Prediction

Event prediction is the study of how to anticipate events represented by categorical data. Predictive algorithms play a crucial role in system management by alerting the user to potential failures. We define the event prediction problem for large cluster computer systems similarly to the telecommunication problems reported in the literature [21]. Time-series based prediction tools are mostly used to predict telecommunication-related problems [7,19,20,21]. However, these prediction tools are not sufficient to address the prediction requirements of computer systems. Hence, either dispersion-frame based techniques or heuristic approaches are commonly used in the literature for prediction purposes [4]. Applying either time-series based or heuristic techniques to large-scale computer systems would require developing either complex event-based or associative classification rules. Moreover, the presence of uneven inter-arrival times for the events would require either variable inter-arrival-time or "time-normalization" based techniques to predict the rare events through data mining [18].

3. EVENT CHARACTERISTICS

Most computers capture hardware or software events, such as power on/off, through event log files stored in a system directory. From an event analysis and prediction perspective, these logs provide the raw data along with a number of repetitions and/or unwanted pieces of information (which can be termed noise). The logs can exhibit a number of problems, for example:
1. Sometimes events are lost or deleted before they are written to the log file, because of problems in storing or communicating the information to the file.
2. Unwanted information resulting from scheduled operations and time-bound status reporting is also recorded.

3. A single log entry can be repeated or even suppressed, because the co-occurrence of another event may confuse the recording of the information.

Based on the data collected from a 350-node cluster, we can classify the events either by their severity to the system or by the subsystem in which they occurred. A severity-based classification differentiates the logs into the following types:
1. PEND: The loss of availability of a device or component is imminent.
2. PERF: The performance of the device/component has degraded below an acceptable level.
3. PERM: Permanent error (unrecoverable; most severe).
4. TEMP: Condition recovered from after a number of unsuccessful attempts.
5. UNKN: Unknown error (the severity cannot be determined).
6. INFO: The entry is informational or a warning.
A subsystem-based classification can also be used to record the logs under the following categories:
1. Hardware related events (event class H)
2. Software related events (event class S)
3. Events for information only (event class O)
4. Undetermined events (event class U)
It may be noted that the INFO category of the severity-based classification is not normally recorded in event logs, whereas event class 'O', containing the events "for information only", might contain no hardware, software, or undetermined subsystem problems; it might, however, include PEND or TEMP events of the severity-based classification. Such a broad classification helps in correlating a series of events into a number of event-based subsystems [10] or tuple-based classifications [6]. More details of our analysis of event characteristics are covered in [14].
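To make this two-way classification concrete, the following minimal Python sketch tags each parsed log record with one of the severity codes and one of the event classes listed above. The record layout, field names, and sample entries are hypothetical illustrations and not the actual format of the cluster logs used in this study.

    # Hypothetical illustration of the two-way (severity, event class) labeling.
    # The dictionary-based record layout is an assumption, not the real log format.
    SEVERITY_CODES = {"PEND", "PERF", "PERM", "TEMP", "UNKN", "INFO"}
    EVENT_CLASSES = {"H", "S", "O", "U"}   # hardware, software, information-only, undetermined

    def classify(record):
        """Return a (severity, event_class) pair for a parsed log record."""
        severity = record.get("severity", "UNKN")
        if severity not in SEVERITY_CODES:
            severity = "UNKN"              # cannot determine the severity
        event_class = record.get("class", "U")
        if event_class not in EVENT_CLASSES:
            event_class = "U"              # undetermined subsystem
        return severity, event_class

    if __name__ == "__main__":
        sample = [
            {"node": 17, "event_id": 9, "severity": "PERM", "class": "H"},
            {"node": 17, "event_id": 32, "severity": "TEMP", "class": "S"},
            {"node": 4, "event_id": 10, "severity": "???", "class": "O"},
        ]
        for rec in sample:
            print(rec["node"], rec["event_id"], classify(rec))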

4. RESOURCE ATTRIBUTE MODEL

As discussed earlier, an efficient system-level event logging mechanism is required before the logs can even be picked up for analysis and prediction. Such a procedure depends entirely on how the different resources (hardware or software) are abstracted from a global system point of view, including well-defined relationships and dependencies. For BG/L type clusters, the presence of thousands of nodes at the same level of control, and their inter-dependencies, makes it difficult to simultaneously monitor the status of all the nodes, process predictive failure analyses, and take smart actions. We have developed a proxy based, semi-hierarchical model [12,13] at a very high level to address the scalability problem. Complementing this semi-hierarchical model at the higher level, we have also formulated a proxy based remote resource attribute model [2,3] at the component level to provide persistent and consistent system management. In this approach, a cluster of nodes (called service nodes, or S-nodes) provides the proxy function to a set of nodes (compute nodes, or C-nodes, and I/O nodes, or IO-nodes, in the case of Blue Gene/L). The Reliable Scalable Cluster Technology (RSCT) [15,16] infrastructure on the S-nodes provides a common abstraction for resource management (monitoring, configuring, and controlling) of every resource in the system, including the compute nodes. It also helps in the design and implementation of fault-tolerant applications, including the high availability service and consistent recovery. Using this architecture, the monitoring, control, and coordination of all hardware and software components among the nodes are managed efficiently. A higher level of resource abstraction also helps to address very large-scale clusters without any limitations. This mechanism monitors and controls the resources that exist on any node (whether C-node or IO-node) through a proxy resource manager (PxRM) and a proxy resource agent (PRA) (Figure 1).

A PxRM is located on a node (either an S-node or an IO-node) that runs the resource management infrastructure (RMI) and communicates with the PRAs on the IO-nodes. Although it supports the monitoring and control of remote resources, it may not provide consistent dynamic attributes (e.g., the up/down status of the resources) if the PxRM fails or is restarted after a failure. The existing infrastructure may report the attributes of the resources as failed or unknown even after the PxRM is restarted, because the restarted PxRM does not know the previous status of the resources, including whether they went down during the failure of the PxRM. Hence this model is required to provide persistent and consistent attributes even if a PxRM fails or is restarted. A PRA acts as a peer agent running on any non-RMI node (an IO-node for BG/L) to control the C-nodes or any other controllable devices (such as fans, power supplies, etc.).
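The following Python sketch illustrates, under assumed class and method names (it is not the RSCT or BG/L implementation), the restart problem described above: a PxRM that merely caches the attributes reported by its PRAs loses all dynamic state on a restart and can only report the remote resources as unknown.

    # Assumed-name illustration of the proxy caching problem described above.
    class ProxyResourceManager:
        def __init__(self):
            self.attributes = {}            # resource name -> dynamic attribute (e.g. up/down)

        def on_pra_report(self, resource, status):
            """Record the status reported by a PRA for a remote resource."""
            self.attributes[resource] = status

        def query(self, resource):
            # After a PxRM failure and restart the cache is empty, so every
            # remote resource appears "unknown" even if it never went down.
            return self.attributes.get(resource, "unknown")

    pxrm = ProxyResourceManager()
    pxrm.on_pra_report("C-node-1", "up")
    print(pxrm.query("C-node-1"))           # "up"
    pxrm = ProxyResourceManager()           # simulated restart: cached state is lost
    print(pxrm.query("C-node-1"))           # "unknown" -- the gap the PGN mechanism closes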


Figure 1: Semi-hierarchical proxy resource monitoring model for Blue Gene/L clusters (cluster control/service nodes running PxRMs on the RMI, I/O node proxies running PRAs, and C-nodes organized into P-sets and failover groups).

Providing persistent and consistent attribute values for the resources between the PRA and the PxRM is carried out through a Proxy Generation Number (PGN). The PGN must be changed properly and tracked by both the PxRM and its PRA, so that the PxRM knows the current status of the resource attributes. A PGN is a unique, time-specific number. This property guarantees that there is no ambiguous state in determining whether the PGN has changed. More details about the proxy resource model are covered in [2] and [3].
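A minimal Python sketch of how a generation number can keep proxy-held attributes consistent across a PxRM restart is given below. The class and method names are assumptions for illustration only; the actual PGN protocol is described in [2] and [3].

    import time

    class ProxyResourceAgent:
        """PRA side: holds the authoritative resource status and the current PGN."""
        def __init__(self):
            self.status = {}                     # resource -> "up" / "down"
            self.pgn = time.time_ns()            # unique, time-specific number

        def set_status(self, resource, status):
            self.status[resource] = status
            self.pgn = time.time_ns()            # any change advances the generation

        def snapshot(self):
            return self.pgn, dict(self.status)

    class ProxyResourceManager:
        """PxRM side: caches attributes together with the PGN they belong to."""
        def __init__(self):
            self.known_pgn = None
            self.cache = {}

        def resync(self, pra):
            pgn, status = pra.snapshot()
            if pgn != self.known_pgn:            # a restart or missed update is detected
                self.cache = status              # refresh instead of reporting "unknown"
                self.known_pgn = pgn

    pra = ProxyResourceAgent()
    pra.set_status("C-node-1", "up")
    pxrm = ProxyResourceManager()                # e.g. a freshly restarted PxRM
    pxrm.resync(pra)
    print(pxrm.cache["C-node-1"])                # "up": consistent after the restart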

5. PREDICTION ALGORITHMS

The task of predicting target events across a computer network requires techniques different from the standard time-series methods of predicting rare events. Classical time-series based tools such as ANSWER [20] and TIMEWEAVER [21] help in predicting performance variables or threshold violations. However, cluster-based event logs would require fairly idealistic simplifications and assumptions to be fitted into such methods. The presence of uneven inter-arrival times for the target events and the multi-dimensional inter-dependency of the events require the development of time-normalization based algorithms.

In event prediction we aim at estimating a categorical or nominal value for the occurrence of a critical event in the near future, e.g., "the communication link will be down within 5 minutes". The nature of the problem calls for machine-learning based data-mining techniques. Hence, even though the prediction problem for multi-processor clusters falls into the category of "rare event prediction in temporal domains", it requires either a normalization of the uneven inter-arrival times or a similar reduction in time scale for accurate prediction. Some of the results presented in Section 6 demonstrate these requirements. We are in the process of developing a prediction algorithm for uneven inter-arrival rates. Details of the algorithms are discussed in [18].
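As one illustration of what normalizing uneven inter-arrival times can mean in practice, the sketch below builds training examples over windows defined by event counts rather than wall-clock time, so that bursts and long quiet periods yield comparable feature windows. This is an assumed construction for illustration only and not the algorithm of [18].

    def make_examples(events, target_id, history=5, horizon=3):
        """Turn a time-ordered event stream into (features, label) pairs.

        events: list of (timestamp, event_id) sorted by timestamp.
        Windows are defined by event counts rather than elapsed time, which
        normalizes away the uneven inter-arrival times of the raw log.
        """
        ids = [eid for _, eid in events]
        examples = []
        for i in range(history, len(ids) - horizon):
            features = ids[i - history:i]                  # the last `history` event IDs
            label = int(target_id in ids[i:i + horizon])   # does the rare event occur soon?
            examples.append((features, label))
        return examples

    # Hypothetical stream of (minutes since start, event ID); 99 is the rare target event.
    stream = [(0, 9), (1, 9), (2, 32), (30, 4), (31, 99), (400, 9), (401, 32), (402, 10), (403, 9)]
    for feats, label in make_examples(stream, target_id=99, history=3, horizon=2):
        print(feats, "->", label)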

6. RESULTS AND DISCUSSION

All the results are based on error logs collected from a 350-node cluster primarily running scientific workloads, with a few nodes running commercial workloads such as databases and mail servers. The raw data were taken from each node, then processed and filtered for analysis. We observed 2,289,049 raw events recorded over a span of 121 days on this cluster. These events include scheduled maintenance, node shutdowns, and reboots, apart from software or hardware related events.

6.1 Data Filtration

Filtering the data to eliminate redundant and repetitive events is one of the most difficult tasks. We filtered the data using a simple algorithm (described below) and considered the filtered data as input to our analysis, taking an aggressive view of the error or event logs.

6.1.1 Filtering Algorithm

The filtering algorithm records any new event type as a new event ID and any new node number as a new node ID. It compares the event ID and node ID at any time T with the event ID and node ID at time (T-1). If both IDs are the same, it keeps only the event ID and node ID corresponding to time (T-1) and discards the event recorded at time T. If either of the IDs is different, it records the event as a separate event (a minimal sketch of this filter is given after the list of findings below). Through this simple filtering process, we were able to eliminate roughly 99% of the data, arriving at a total of 26,611 distinct events.

Table 1: Error Breakup based on Error Class
Error Class   Class ID   Error Count
H             1          9824
S             2          5780
O             3          9631
U             4          1376

Table 2: Error Breakup based on Error Type
Error Type   Sev.   ID   Error Count
PEND         P      1    11221
PERF         U      2    5427
PERM         T      3    3774
TEMP         I      4    5977
UNKNOWN      N      5    212

6.2 Filtered Data Analysis

All the filtered data are analyzed with respect to three different parameters: (1) event inter-arrival time, (2) number of events or event types, and (3) number of nodes. A total of 26,611 events within a time span of 121 days corresponds to an average of 220 events per day. An initial breakup of the events in terms of error classes and types is shown in Tables 1 and 2. Based on the events noted in Tables 1 and 2, we can infer the following results:
1. 37% of the total events in the 350-node cluster were hardware related logs.
2. 22% of the total events were software or software related errors.

3. 14% of the errors were of the unrecoverable type, i.e., these nodes require at least some type of human intervention, either for hardware or for software resets.
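The following Python sketch implements the duplicate-collapsing filter described in Section 6.1.1: consecutive records with the same (event ID, node ID) pair are merged into a single event. The tuple-based record format is an assumed simplification of the raw logs.

    def filter_events(records):
        """Collapse consecutive records that share the same (event_id, node_id) pair.

        records: iterable of (timestamp, event_id, node_id) in time order.
        The first record of each run is kept, mirroring the filter of Section 6.1.1.
        """
        filtered = []
        previous_key = None
        for timestamp, event_id, node_id in records:
            key = (event_id, node_id)
            if key != previous_key:          # a new event type or a new node: keep it
                filtered.append((timestamp, event_id, node_id))
            previous_key = key               # identical consecutive records are discarded
        return filtered

    # Hypothetical raw log fragment: the repeated (9, 17) records collapse to one entry.
    raw = [(0, 9, 17), (1, 9, 17), (2, 9, 17), (3, 32, 17), (4, 9, 17), (5, 9, 4)]
    print(filter_events(raw))
    # [(0, 9, 17), (3, 32, 17), (4, 9, 17), (5, 9, 4)]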

Figure 2: No. of Error Events vs. Node Number.

Table 3: Top Error/Event Recording Nodes
Node No.   Host ID   Count   % of Total Events   Cum. % of Total Events
1          4         3392    12.74               12.74
2          6         3020    11.34               24.08
3          100       2178    8.18                32.26
4          291       1219    4.58                36.84
5          17        1110    4.17                41.01
6          178       943     3.54                44.55
7          1         833     3.13                47.68
8          69        508     1.91                49.59
9          284       500     1.88                51.47
10         141       475     1.78                53.25
11         59        435     1.63                54.88

6.2.1 Analysis of Event Logs based on Nodes

Figure 2 shows the distribution of the number of events with respect to different nodes. It is quite clear from the figure that most of the events are confined to a small fraction of the total number of nodes. A closer look at the distribution of different events (Table 3) shows that more than 50% of the errors are confined to only 10 nodes, and for each of these 10 nodes more than 90% of the errors are of a particular category. We are in the process of splitting the errors/events in terms of dependency graphs, similar to the tuple-based analysis carried out in [6] for VAX clusters.

6.2.2 Analysis of Event Logs based on Events and Time Intervals

Figure 3 shows the variation of the number of events per event ID. Out of a total of 160 types of events recorded, 20 types account for more than 95% of the events. A global picture of the event inter-arrival time against the number of events (Figure 4) shows that a significant number of events occur within 10 minutes of another, different event, either on the same node or on a different node. It is difficult to pick a particular time window as a unit into which to group a series of events and pick up the rare events from there. In order to magnify and investigate this further, we zoomed in on the occurrences of two particular events on a number of nodes (Figures 5 and 6). These clearly show why we need a variable inter-arrival-time based algorithm for the analysis and prediction of some of the rare events.

Table 4: Top Event Types
No.   Event ID   Count   % of Total Events   Cum. % of Total Events
1     9          4487    16.86               16.86
2     32         4279    16.08               32.94
3     4          3376    12.69               45.63
4     10         2214    8.32                53.95
5     162        1424    5.35                59.30
6     172        1270    4.77                64.07
7     14         860     3.23                67.30
8     1          571     2.15                69.45
9     82         551     2.07                71.52
10    2          511     1.92                73.44
11    24         411     1.54                74.98

Figure 3: No. of Error/Events vs. Event IDs.

Figure 4: Inter-arrival Time vs. No. of Events.
Figure 5: Inter-arrival Time vs. No. of Events for a Particular Event on a Particular Node (Case 1).
Figure 6: Inter-arrival Time vs. No. of Events for a Particular Event on a Particular Node (Case 2).
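The distributions behind Figures 4 through 6 can be reproduced in outline by the kind of computation sketched below: group the filtered events by (event ID, node) pair, take successive time differences, and histogram them. The field layout and the sample values are assumptions for illustration, not the actual cluster data.

    from collections import defaultdict

    def inter_arrival_times(events):
        """Collect inter-arrival times (in seconds) per (event_id, node_id) pair.

        events: iterable of (timestamp_seconds, event_id, node_id) in time order.
        """
        last_seen = {}
        gaps = defaultdict(list)
        for ts, event_id, node_id in events:
            key = (event_id, node_id)
            if key in last_seen:
                gaps[key].append(ts - last_seen[key])
            last_seen[key] = ts
        return gaps

    # Hypothetical filtered events; in the study these would come from the 350-node logs.
    filtered = [(0, 9, 17), (120, 9, 17), (150, 32, 17), (600, 9, 17), (660, 32, 17)]
    for key, deltas in inter_arrival_times(filtered).items():
        print(key, deltas)   # e.g. (9, 17) [120, 480]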

7. SUMMARY

We have carried out a preliminary, three-pronged study toward developing a self-adaptive resource management and control system through health event/error monitoring and predictive failure analysis for large-scale clustered systems. An event log analysis carried out for a 350-node UNIX based cluster shows that most event logs are very repetitive in nature until the problem is fixed or superseded by another disastrous event. A simple filter for redundant errors eliminates 95-99% of the events, because of redundant recording. Within the filtered data, 50-55% of the events are confined to 10 to 12 machines or processors in the 350-node cluster. The error/event log distribution fits some of the standard theoretical models, such as the Weibull or lognormal functions reported in the distributed and fault-tolerant computing literature. We developed a remote resource monitoring technique with a problem identification and isolation mechanism (based on RSCT), primarily targeted at managing system-level software and hardware events simultaneously for hundreds of thousands of processors or system components. Because of the variable inter-arrival times and inter-dependency of the events, and the independent processor based subsystems, standard time-series based event prediction algorithms do not hold up well. Apart from developing a variable inter-arrival-time based prediction algorithm, it is also necessary to develop an intelligent dynamic-dependency graph method to predict some of the future events. We are in the process of collecting more error logs from a variety of large-scale clusters running a wide variety of workloads. In order to link all three components together, we are developing a dependency graph for the different types of errors or events and disastrous scenarios, from both hardware and software points of view. Apart from developing and evaluating various predictive algorithms, there is also an ongoing effort to link the real event log analysis data to cluster system simulators.

8. REFERENCES

1. G. Almasi et al., "Cellular Supercomputing with System-On-A-Chip", Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), 2001.
2. M. Bae, S. Fakhouri, L. L. Fong, J. Moreira, J. Pershing and R. K. Sahoo, "Autonomic Resource Monitoring of Large-scale Clustered Systems", submitted to the IBM Systems Journal.

3. M. Bae, R. K. Sahoo and J. Moreira, "A Monitoring Method of the Remotely Accessible Resources to Provide Persistent and Consistent Resource States", patent submitted, 2002.
4. P. J. Brockwell and R. Davis, "Introduction to Time Series and Forecasting", Springer-Verlag, 2002.
5. M. F. Buckley and D. P. Siewiorek, "VAX/VMS Event Monitoring and Analysis", FTCS-25: 25th Intl. Symp. on Fault-Tolerant Computing, Digest of Papers, Pasadena, CA, June 1995, pp. 414-423.
6. M. F. Buckley and D. P. Siewiorek, "A Comparative Analysis of Event Tupling Schemes", Proc. of the 26th Intl. Symp. on Fault-Tolerant Computing, June 1996, pp. 294-303.
7. T. Dietterich and R. Michalski, "Discovering Patterns in Sequences of Events", Artificial Intelligence, vol. 25, 1985, pp. 187-232.
8. P. Horn, "Autonomic Computing: IBM's Perspective on the State of Information Technology", IBM Corp., October 2001.
9. I. Lee, R. K. Iyer and D. Tang, "Error/Failure Analysis Using Event Logs from Fault Tolerant Systems", Proc. 21st Intl. Symp. on Fault-Tolerant Computing, June 1991, pp. 10-17.
10. T. Y. Lin and D. P. Siewiorek, "Error Log Analysis: Statistical Modeling and Heuristic Trend Analysis", IEEE Trans. on Reliability, vol. 39, no. 4, pp. 419-432, Oct. 1990.
11. M. Mueller, L. C. Alves, W. Fischer, M. L. Fair and I. Modi, "RAS Strategy for IBM S/390 G5 and G6", IBM J. Research and Development, vol. 43, no. 5/6, Sept./Nov. 1999.
12. R. K. Sahoo, M. Bae and J. Moreira, "A Semi-hierarchical Monitoring and Management System for Very Large Computer Reliability, Availability and Serviceability (RAS)", patent submitted, 2002.
13. R. K. Sahoo, M. Bae and J. Moreira, "Semi-hierarchical Approach for Reliability, Availability, and Serviceability of Cellular Systems", HPCA-8, Work-in-Progress Session, Cambridge, MA, 2002; also ACM Computer Architecture News, June 2002.
14. R. K. Sahoo, J. Moreira, R. Vilalta, S. Ma and M. Gupta, "Hardware/Software Failures (Events) Analysis and Prediction for Large Cluster Computer Systems", in preparation.
15. Reliable Scalable Cluster Technology (RSCT) Project, http://rs6000.pok.ibm.com/afs/aix/project/linux/www/cluster-project/index.html
16. IBM eServer Clusters, http://www-1.ibm.com/servers/eserver/clusters/
17. M. M. Tsao, "Trend Analysis and Fault Prediction", PhD Dissertation, Carnegie-Mellon University, May 1983.
18. R. Vilalta, C. Apte, J. Hellerstein, S. Ma and S. Weiss, "Predictive Algorithms in the Management of Computer Systems", IBM Systems Journal, Special Issue on Artificial Intelligence, vol. 41, no. 3, 2002.
19. G. M. Weiss, "Predicting Telecommunication Equipment Failures from Sequences of Network Alarms", Handbook of Knowledge Discovery and Data Mining, Oxford University Press.
20. G. M. Weiss, J. P. Ros and A. Singhal, "ANSWER: Network Monitoring Using Object-Oriented Rules", Proc. 10th Conf. on Innovative Applications of Artificial Intelligence, AAAI Press, 1998.
21. G. M. Weiss, "TIMEWEAVER: A Genetic Algorithm for Identifying Predictive Patterns in Sequences of Events", Proceedings, San Francisco, 1999, Morgan Kaufmann.