Cross-Core Event Monitoring for Processor Failure Prediction

Felix Salfner, Peter Tröger, and Steffen Tschirpke
Humboldt University Berlin
{salfner,troeger,tschirpke}@informatik.hu-berlin.de

ABSTRACT

A recent trend in the design of commodity processors is the combination of multiple independent execution units on one chip. With the resulting increase in complexity and transistor count, it becomes more and more likely that a single execution unit on a processor becomes faulty. To tackle this situation, we propose an architecture for dependable process management in chip-multiprocessing machines. In our approach, execution units survey each other to anticipate future hardware failures. The prediction relies on the analysis of processor hardware performance counters by a statistical rank-sum test. Initial experiments with the Intel Core processor platform demonstrated the feasibility of the approach, but also showed the need for further investigation, since prediction quality varied strongly in most cases.

KEYWORDS: multi-core, fault injection, performance counter, failure prediction

1. INTRODUCTION

The support for multiple parallel activities has a long history in computer hardware design. Besides the traditional support for multiple processors in one system, there has always been a class of solutions described as chip multi-threading [1], which exploits parallelism inside the processor chip through instruction-level parallelism or simultaneous multithreading. The most recent extension of chip multi-threading is the chip multi-processing (CMP) approach, where a set of independent execution units ("cores") is packaged as one processor chip. The widely promoted new "era" of multi-core systems essentially refers to the introduction of this CMP design in standard desktop processors. A report from Berkeley predicts CMP processors with thousands of parallel execution units as the mainstream hardware of the future [2].

With the ever-increasing number of transistors on these chips, hardware reliability is about to become a pressing issue in the upcoming years. AMD already sells triple-core processors that were originally intended as quad-core processors but contain one disabled defective execution unit. This shows that new paradigms and approaches are needed to ensure the dependability of modern CMP-based computer systems.

One way to improve dependability (in the sense of performance and availability) at run-time in imperfect systems is proactive failure management. In contrast to classical fault tolerance techniques, which typically react after a problem has occurred, it relies on the short-term anticipation of upcoming failures, realized by permanent evaluation of the system state at runtime. A successful prediction initiates a subsequent proactive phase in which the effects of a possible failure are compensated in advance. One typical countermeasure is migrating workload or data to redundant resources.

In the online failure prediction approach presented here, we treat the multiple execution units of modern CMP hardware as a set of redundant computational resources. They are either used for load balancing, as originally intended, or as spare resources in case of a partial processor failure. The proactive part can be realized by process migration or controlled application shutdown.

Fault anticipation requires continuous system state information in order to detect patterns of anomalies that indicate an upcoming failure. System monitoring usually relies on either operating system-specific or application-specific solutions. The technique investigated in this work operates on the processor hardware level only, which makes it independent of both the operating system and the application. The cores monitor each other and predict upcoming failures by analyzing hardware event sampling data.

This paper presents our initial experiences with this approach. It is organized as follows: Section 2 describes our approach in detail, Section 3 explains the experiment setup, Section 4 discusses the obtained measurement results, Section 5 reviews related work, and Section 6 outlines the next steps towards a full processor failure prediction solution.

2. APPROACH

Of the different information sources available in modern processor hardware, performance event monitoring provides the most detailed information. Performance events are signaled by hardware components in the execution engine, for example after the completion (retirement) of a micro-instruction, a cache miss, a branch misprediction, or a misaligned memory access. Performance events of a typical CMP processor are monitored by configuring built-in hardware counters with an event type and an overflow threshold value. Depending on the particular processor type, multiple counters can be active at the same time. With each counter overflow, a hardware interrupt is triggered that can be used to save a sample of the processor state at the time of overflow. Each sample contains all monitored counter and register values, including the current instruction pointer. The primary purpose of these facilities is to support performance profiling tools, which map the context information to an investigated running application. Our approach, however, aims at using these values as a representation of the state of the whole execution engine.

An initial investigation showed that hardware performance event monitoring is available on all major CMP platforms. Every vendor provides corresponding hardware counter support; the solutions mainly differ in the set of monitorable events per execution engine. The Intel Core technology architecture distinguishes between model-specific ("non-architectural") and standardized ("architectural") processor performance events [3]. The AMD multi-core processor families provide a comparable, but smaller, set of performance events. The SPARC CMP processor series offers performance instrumentation counters (PIC) for different event types [4]. Different versions of the IBM POWER processor line also support hardware performance counters [5]. Overall, it can be assumed that hardware performance event counting is a common feature of modern CMP processors.

2.1. CROSS-CORE MONITORING

Our research hypothesis assumes that there is a detectable change in the behavior of hardware performance events before a failure of the particular core occurs. If a failure prediction algorithm is able to detect such a change in the event pattern early enough, it can trigger preventive actions accordingly.

The general idea is to let the monitored core (which is also running the workload) periodically trigger an interrupt routine that moves observed data samples to a small buffer on a second core. The second core performs the failure prediction based on the monitoring values in the buffer. If the prediction algorithm detects a problem that might lead to a core failure, a warning signal is sent to the workload application. It can then perform actions to cope with the failure-prone situation, e.g., it might be check-pointed or moved by an external entity such as the operating system scheduler.

The concept of cross-core event monitoring demands some specific conditions. The chosen algorithm must work with relatively small computational overhead, in order to perform the failure prediction as a background task during normal operation. The processor hardware needs to support event counting with comparatively low overhead. Finally, predictions must be accurate enough to justify preventive actions, especially with respect to false warnings.
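The division of labour between the two cores can be illustrated with a small Python sketch. It is a simplification under stated assumptions: samples arrive through an ordinary queue instead of the performance monitoring interrupt handler, and read_sample, predict, and warn are hypothetical placeholders for the counter readout, the rank-sum test of Section 2.4, and the warning signal; the core numbers depend on the machine topology.

```python
import collections
import os

BUFFER_SIZE = 64    # number of recent samples kept for the predictor (assumed value)

def monitored_core(sample_queue, read_sample, core=0):
    """Runs on the monitored core: forwards every counter sample to the predictor.
    In the real system this forwarding is done by the PMI handler, not in user space."""
    os.sched_setaffinity(0, {core})        # pin this process to the monitored core
    while True:
        sample_queue.put(read_sample())    # one sample per counter overflow

def predicting_core(sample_queue, predict, warn, core=2):
    """Runs on a different core: keeps a small buffer of recent samples and tests it."""
    os.sched_setaffinity(0, {core})        # pin to the surveying core
    window = collections.deque(maxlen=BUFFER_SIZE)
    while True:
        window.append(sample_queue.get())
        if len(window) == BUFFER_SIZE and predict(list(window)):
            warn()                         # e.g. ask the workload to checkpoint or migrate
```

Both loops are pinned to fixed cores to reflect that the predicting core is distinct from the monitored one; a multiprocessing.Queue shared by two processes could serve as the sample buffer in this sketch.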

2.2. EVENT TYPE REDUCTION

All CMP platforms with hardware performance monitoring support offer a large number of events. However, only a very limited number can be monitored at the same time, due to the restricted number of hardware counting units and their internal wiring [6]. It is therefore necessary to identify a small but indicative set of counters in a first step. We analyzed correlations among the counters available in our experiment setup, in order to remove those that behave very similarly to other counters. This allows measuring as many independent event types as possible on one processor core at the same time. The analysis is realized by running the chosen workload with different counter configurations, each containing one possible combination of event types. The two sampled data sets are then analyzed for their Spearman's rank correlation coefficient. This specific measure has the advantage of not assuming any frequency distribution of the variables, which is in fact not known for performance event samples. The result is a list of unique event type combinations to be monitored at run-time. Additionally, a qualitative reduction of the counter set can be done: events with an obviously strictly monotonic behavior, such as the number of cycles elapsed, can be sorted out. They would not show any significantly different behavior in the failure case, which renders them irrelevant for the prediction task.
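To illustrate the reduction step, the following sketch drops one counter of every highly correlated pair. The correlation threshold of 0.9 and the data layout are assumptions for illustration, not values from our setup.

```python
import itertools
from scipy.stats import spearmanr    # rank correlation, no distribution assumption

def reduce_event_types(samples, threshold=0.9):
    """samples: dict mapping event-type name -> equally long list of counter samples,
    all recorded under the same workload. Returns the event types worth keeping."""
    dropped = set()
    for a, b in itertools.combinations(sorted(samples), 2):
        if a in dropped or b in dropped:
            continue
        rho, _ = spearmanr(samples[a], samples[b])
        if abs(rho) >= threshold:    # b behaves like a, so monitoring both adds little
            dropped.add(b)
    return [event for event in sorted(samples) if event not in dropped]
```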

2.3. SAMPLING RATE

Every sampling approach demands the choice of a suitable sampling rate, based either on a time threshold or on an event count threshold. A time-based approach would lead to major technical difficulties, since modern processors have no fixed timing behavior due to frequency scaling, pipelining, and out-of-order execution. Even the classical Pentium time stamp counter (TSC) is not guaranteed to provide a constant rate on some processor models [3]. Our experimental setup therefore relies on a threshold for the number of instructions processed on the core. This gives us a constant scale with respect to the execution of the load application. Workload-based sampling has also been shown to be more appropriate in other areas of failure prediction [7,8].

2.4. FAILURE PREDICTION ALGORITHM

We used the Wilcoxon rank-sum test as the failure prediction approach. A one-sided version of the test has successfully been applied to a comparable scenario [9]. The test has the characteristics relevant for our environment: No assumptions about the form of the distribution have to be made; in particular, the counter values do not need to be normally distributed (it is a nonparametric test). It is based on ordinal statistics, i.e., it handles outliers more robustly. It is computationally light, meaning that it can be performed in parallel to other workloads on the predicting core.

The Wilcoxon rank-sum test compares a test data set to a reference data set in order to determine whether both have about the same median. In our case, the reference data set corresponds to CPU counter values that have been measured and stored during normal operation without failures. The test data set consists of the samples measured at runtime. Note that the test data set can be much smaller than the reference data set.

Figure 1. Example for Rank Sum Test

Figure 1 shows an explanatory example. A test data set, which is observed during runtime, is merged and sorted with the stored reference data set. This is done to determine the ranks of the test values according to their position in the combined data set. The sum of the resulting ranks is compared to a pre-computed threshold. The predictor issues a failure warning if the rank sum deviates from its expected value by more than this threshold. In the example, the test values tend to be larger than the reference data, and the resulting rank sum is hence larger than expected. In our specific failure prediction scenario, the last n samples from the monitored core are compared to the reference data set. The assumption is that prior to a failure the counter's median deviates significantly from the median of the reference data set.
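A minimal sketch of this prediction step, using SciPy's two-sided rank-sum test; the significance level stands in for the pre-computed rank-sum threshold described above and is an illustrative choice, not a value from our experiments.

```python
from scipy.stats import ranksums    # two-sided Wilcoxon rank-sum test

def predict_failure(reference, recent, alpha=0.01):
    """reference: counter samples stored from known fault-free operation.
    recent: the last n samples from the monitored core (n may be much smaller).
    Returns True, i.e. issues a failure warning, if the medians differ significantly."""
    _statistic, p_value = ranksums(reference, recent)
    return p_value < alpha          # large rank-sum deviation -> warn
```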

3. EXPERIMENT

In order to set up an experimental environment for our concept, several choices had to be made, including the choice of particular hardware, of a performance event monitoring approach, and of a fault injection technique. We performed our experiments on an Intel Core2 Quad CPU (Q6600) at 2.40 GHz with 2 GB of memory and a 64-bit Linux 2.6 operating system. The tested system was connected to a monitoring computer by a custom-made heartbeat line. The monitoring system had the responsibility to automatically reset the tested multi-core system, either when it became unreachable by ICMP ping or after a maximum time period. Every reset initiated a new test run on the multi-core machine, which allowed us to run automated tests with different performance event combinations to be monitored.

The performance counter monitoring was realized with a modified version of the perfmon2 toolkit for Linux. Performance data collection on Intel processors (since the Pentium 4) can operate in three different modes: Event counting instructs one of the processor counters to count an event type; the software periodically fetches the counter value from a register. Non-precise event-based sampling instructs a counter to generate a performance monitoring interrupt (PMI) on overflow. With precise event-based sampling (PEBS), the CPU stores its architectural state on overflow in a special memory buffer by itself. Even though PEBS provides better measurement accuracy [6], it is only available for a small subset of events. We therefore decided to stick with the PMI sampling approach. The Intel Core technology architecture also distinguishes between model-specific ("non-architectural") and standardized ("architectural") processor performance events [3]. We were forced to use the non-architectural types, since the number of architectural ones is very limited. After the reduction step, we ended up with 31 distinct Intel performance event types to be tested.
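The reset logic on the monitoring computer can be summarised roughly as follows. The host address, timeout, and reset hook are assumptions for illustration rather than a literal reproduction of our setup, which used a dedicated heartbeat line.

```python
import subprocess
import time

TEST_HOST = "192.168.1.20"     # address of the tested multi-core machine (assumed)
MAX_RUN_SECONDS = 1800         # upper bound per test run (assumed)

def reachable(host):
    """Sends a single ICMP echo request and reports whether the host answered."""
    return subprocess.call(["ping", "-c", "1", "-W", "2", host],
                           stdout=subprocess.DEVNULL) == 0

def watchdog(reset_machine):
    """Resets the tested system when it stops answering pings or exceeds the time
    budget; every reset starts the next automated test run."""
    start = time.time()
    while True:
        if not reachable(TEST_HOST) or time.time() - start > MAX_RUN_SECONDS:
            reset_machine()    # hypothetical hook, e.g. toggling the heartbeat/reset line
            start = time.time()
        time.sleep(10)
```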

3.1. FAULT INJECTION AND WORKLOAD

The experimental analysis of hardware failure prediction demands a suitable fault injection technique: since failures occur too rarely under normal conditions, erroneous hardware behavior must be triggered explicitly. The most obvious choice is over-clocking of the processor hardware. Due to the danger of permanent damage to the system hardware we rejected this approach. Instead we opted for an under-volting approach, where the CPU core voltage is set to a level below normal operation; this option is offered by many motherboards aimed at gaming and over-clocking. In order to generate workload on the monitored core, we used the Mersenne prime number test application MPRIME. Initial experiments showed that within a certain voltage reduction range (25%-30% in our case), the system is in a semi-stable operational mode where it starts to generate machine check exceptions (MCEs) for the particular core during the execution of MPRIME, but not during normal operation. This allowed us to boot the tested machine, start the performance measurement, and trigger the core failure by executing the load application on one of the cores. Figure 2 illustrates the setup.

All Intel processors based on the Core micro-architecture work with an L2 cache shared between the two cores on one die [10]. The quad-core models are realized as a combination of two such dies, so that each die has its own L2 cache, shared by its two cores but independent of the other die's cache [11]. We accounted for this hardware design by excluding one die's cores from the regular operating system scheduler.

3.2. PREDICTION EVALUATION

In order to determine how accurate a failure prediction algorithm is, three data sets of counter values are needed. The first is a reference data set containing values from normal operation. The second is a test data set containing values recorded prior to CPU failures; several runs are needed in order to obtain a statistically significant number of such failure data sets. The third data set again contains values from normal operation, separate from the reference data; it is necessary in order to determine the false positive rate of the prediction implementation. After the three sets have been recorded, we analyzed the data offline in order to determine the quality of the failure predictor for the different event types. Data recording and analysis had to be performed for each CPU counter.

Several metrics exist to express the accuracy of a prediction; one of them is called accuracy itself. However, accuracy is not an appropriate metric to evaluate predictions in the case of skewed classes: if, for example, only one in a thousand test sequences precedes a failure, a predictor that never issues a warning already achieves an accuracy of 99.9%. Since failures occur far more rarely than non-failure examples, this skew is present in our target scenario. We therefore chose receiver operating characteristic (ROC) curves and the corresponding area under curve (AUC) metric as suitable evaluation measures.

Figure 2. Experiment Setup

MCEs normally indicate an unrecoverable failure in the hardware operation, and operating systems therefore stop the machine when such an event occurs. In order to be able to save the relevant last samples before the core failure, we reconfigured the Linux kernel to continue operating as far as possible after an MCE. This resulted in a behavior where the CPU continued to operate for a short period of time after the initial MCE (there were always subsequent fatal ones). Since the inherent MCE logging of the Linux kernel provided us with a core-specific TSC value, we were able to identify the last performance sample before the actual processor failure.

ROC curves plot the true positive rate (tpr) over the false positive rate (fpr). The true positive rate denotes the fraction of true failures that have been predicted, i.e., for which a warning has been issued; the false positive rate denotes the fraction of non-failure cases for which a warning was issued although no failure was coming up:

tpr = (correctly predicted failures) / (all failure cases)
fpr = (false failure warnings) / (all non-failure cases)

Both quantities can be estimated from the three recorded data sets, leading to one (fpr, tpr) tuple per performance event type. A perfect predictor would achieve (0, 1), which means that all true failures are predicted (tpr = 1) without any false alarms (fpr = 0). By varying the rank-sum comparison threshold, a tuple can be determined for each threshold, resulting in the ROC curve. Since curves cannot be compared numerically, the area under the ROC curve is calculated for comparison.

A perfect predictor would achieve an AUC of one. A random predictor, i.e., a predictor that randomly warns about an upcoming failure, results in a linear ROC curve with slope one (tpr = fpr) and an AUC of 0.5.

Failure prediction evaluation, i.e., the ROC curve generation and subsequent AUC computation, is performed on a finite set of event sampling data. It follows that the obtained results are only a stochastic estimate of the true prediction accuracy. In order to assess the exactness of the evaluation, we apply bootstrapping techniques [10]. Bootstrapping involves repeated random resampling in order to determine means and confidence intervals for the estimates. We plotted average ROC curves and added box-whisker plots at selected threshold values. The boxes of the plots show the first quartile, median, and third quartile. The whiskers indicate the minimum and maximum of non-outlier data, and circles denote mild outliers (between 1.5 and three times the inter-quartile range below the first or above the third quartile).
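The offline evaluation can be sketched as follows: sweep the decision threshold, collect (fpr, tpr) points, integrate the resulting curve, and bootstrap the AUC. The score convention (a larger rank-sum deviation indicates a more failure-like window) and the number of resamples are assumptions for illustration.

```python
import random

def roc_points(failure_scores, nonfailure_scores):
    """Scores are the rank-sum deviations reported by the predictor; each candidate
    threshold contributes one (fpr, tpr) point."""
    thresholds = sorted(set(failure_scores) | set(nonfailure_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in failure_scores) / len(failure_scores)
        fpr = sum(s >= t for s in nonfailure_scores) / len(nonfailure_scores)
        points.append((fpr, tpr))
    points.append((1.0, 1.0))
    return sorted(points)

def auc(points):
    """Trapezoidal area under the ROC curve: 1.0 = perfect, 0.5 = random predictor."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def bootstrap_auc(failure_scores, nonfailure_scores, repetitions=1000):
    """Resamples both score sets with replacement to estimate the mean AUC and an
    approximate 95% confidence interval, mirroring the bootstrapped box-whisker plots."""
    estimates = sorted(
        auc(roc_points(random.choices(failure_scores, k=len(failure_scores)),
                       random.choices(nonfailure_scores, k=len(nonfailure_scores))))
        for _ in range(repetitions))
    return (sum(estimates) / repetitions,
            (estimates[int(0.025 * repetitions)], estimates[int(0.975 * repetitions)]))
```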

4. RESULTS

We investigated 31 performance counters with unique behavior and collected about 30 failures for each of the counters. Several sources of randomness occurred in the experiment. The reference data set was extracted from a random test run; within that run, it was extracted at a random timestamp within a fault-free region. In order to determine the fpr, non-failure sequences were tested against the reference data set. Since performing predictions on all monitored samples is not computationally feasible, non-failure sequences were sampled from random positions within fault-free regions of randomly selected test runs. Since failure sequences were rare due to technical limitations (about 30 per counter), all failure sequences were used to determine the tpr. We also determined the variability introduced by these random factors with up to 3250 repetitions of the prediction step.

4.1. ROC CURVES AND AUC

Our experimental work revealed three major groups of event types, each behaving differently in terms of prediction feasibility. The first group of counters acted like purely random predictors; 24 out of the 31 counters fell into this category. A second group sometimes performed very well, but sometimes also worked very badly, showing great variability among experiments. A third group of counters always worked better than a random predictor, although variability was comparatively high here as well.

Figure 3. Example for High Variability Group

Figure 3 shows an example of the second group, the highly varying counters. The solid line shows that, on average across all experiments, the counter behaves like a random predictor. However, taking into account that a predictor with AUC = 0.23 can be turned into a predictor with AUC = 1 - 0.23 = 0.77 by simply inverting its output, rather good predictions could be obtained quite frequently. The quality of the predictor appears to depend heavily on the reference data set. It remains an open question at this point whether these counters can be turned into consistently good predictors; validation and model selection techniques could be used to sort out reference data sets that do not perform well. In our experiments, five out of the 31 counters belong to this group.

Figure 4. Example for High AUC Group

Two of the 31 counters fell into the third group of predictors that consistently perform better than a random prediction (minimum AUC value of 0.68). Figure 4 shows a comparatively good predictor achieving AUC values of up to 0.91. It can be seen from the figure that when a false positive rate of 20% is accepted (fpr = 0.2), in one experiment about 90% (maximum whisker of the plot) of all failures could be predicted. While a false positive rate of 20% is not acceptable in many fault-tolerance scenarios, it is less of a problem for the intended cross-core surveillance scenario, since the cost of an unnecessarily performed proactive recovery is comparatively low. Nevertheless, similar to the second group, variability was still quite high here. Validation and model selection techniques might also help to arrive at a more consistent predictor. Note that these techniques are applied offline and do not make the computation of predictions at runtime more complex.

One of the reasons why only two counters ended up in the third group is the fault injection approach of this work. In most cases, it generated exactly the same kind of MCE, specifically a hardware problem with the bus used for memory access. The RESOURCE_STALLS.ROB_FULL performance event shown in Figure 4 is clearly related to this particular failure. It expresses the number of cycles the processor was stalled due to a filled pipeline, which happens when long-latency operations (such as loads and stores without a cache hit) prevent progress in pipeline processing. The memory access hardware problem triggered by the under-voltage operation thus manifested itself as an unusual change in this particular counter value.

5. RELATED WORK

There has been exhaustive research on monitoring and predicting failures of specific components in computer systems in the past.

A widely known approach is the S.M.A.R.T. monitoring of hard disk hardware parameters. A study by Pinheiro et al. [11] showed that the relevant counters do not provide any useful prediction in over 56% of the cases. Hughes et al. [9] accordingly investigated other techniques, such as the rank-sum test, to improve prediction quality for these counters. Liang et al. [12] analyzed more than 100 days of failure logs of a BlueGene/L parallel computer system and derived matching failure prediction methods. The developed strategies relied on the burst nature and spatial skewness of failure events, properties that are not directly applicable to hardware performance events as in our case.

Vaidyanathan et al. [13] estimated the exhaustion of system resources as a function of time and system workload. This is realized by constructing a semi-Markov reward model, a typical approach that is infeasible for our online surveillance setting.

Vilalta and Ma's approach [14] extracts a set of indicative error events in order to predict the occurrence of target failure events. A multi-stage sorting approach detects event sets that are typical before a chosen target event. The opposite approach is the dispersion frame technique by Lin [15], which relies only on the interval time between successive error events. Both approaches rely on an understanding of error events, a kind of information that is not provided by hardware performance counters. It remains an open question whether alternative information sources about the processor might render one of these approaches feasible for our scenario.

Some researchers use hardware performance counters for online performance analysis. Azimi et al. [16] presented an approach where application stall times are used for sampling activities, which reduces the application interference caused by sampling to zero. In our experiments, we achieved the same effect by simply adjusting the sampling rate to a value small enough not to influence the execution of the application.

Hardware fault injection for processors is a comparatively unusual approach, since most methodologies rely on software-implemented fault injection. One example is the work by Carreira et al. [17], who work on the pin level for pure hardware fault injection.

6. CONCLUSION AND FUTURE WORK

We presented the concept and design of a proactive management framework for partial processor hardware failures, based on cross-core event monitoring and failure prediction. Our initial experiments with the Intel Core technology proved the feasibility of the concepts, but leave room for further improvements and investigations.

Future work first needs to focus on improving the fault injection approach, in order to generate more types of (partial) processor failures. So far, the event types that work for prediction match exactly the processor failure triggered by the under-voltage fault injection. In order to be able to react to other MCEs as well, alternative fault injection mechanisms must be applied. Using non-failed cores as spare resources also raises the question of fault containment. A hardware problem within one execution unit might also affect the other cores in the system, making them unavailable for recovery activities. The degree of spreading is related to the nature of the underlying fault (e.g., an L1 cache defect vs. a signaling defect). A discussion of corresponding fault models was left out in favor of the prediction mechanism analysis, but should be included in future research.

It should also be noted that the presented approach is less portable than classical software event monitoring for failure prediction purposes. Hardware counters are specific to each processor platform, so the right set of event types is specific to each CPU design. On the other hand, these information sources are completely decoupled from the software executed on the machine. We will work on a meta-model for the hardware performance events of different CMP architectures, in order to render our approach more general.

ACKNOWLEDGEMENTS

This work was sponsored in part by Intel.

REFERENCES

[1] L. Spracklen and S. G. Abraham, "Chip Multithreading: Opportunities and Challenges," in 11th International Symposium on High-Performance Computer Architecture (HPCA-11), 2005, pp. 248–252.
[2] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley," Electrical Engineering and Computer Sciences, University of California at Berkeley, Tech. Rep. UCB/EECS-2006-183, December 2006.
[3] Intel Corporation, "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2," Nov. 2008.
[4] Sun Microsystems Inc., "UltraSPARC Architecture 2007," Nov. 2008.
[5] B. Sprunt, "Managing The Complexity Of Performance Monitoring Hardware: The Brink And Abyss Approach," Int. J. High Perform. Comput. Appl., vol. 20, no. 4, pp. 533–540, 2006.
[6] B. Sprunt, "Pentium 4 Performance-Monitoring Features," IEEE Micro, vol. 22, no. 4, pp. 72–82, 2002.
[7] K. S. Trivedi, K. Vaidyanathan, and K. Goseva-Popstojanova, "Modeling and Analysis of Software Aging and Rejuvenation," in Proceedings of the IEEE Annual Simulation Symposium, Apr. 2000.
[8] A. Andrzejak and L. Silva, "Deterministic Models of Software Aging and Optimal Rejuvenation Schedules," in 10th IEEE/IFIP International Symposium on Integrated Network Management (IM '07), May 2007, pp. 159–168.
[9] G. Hughes, J. Murray, K. Kreutz-Delgado, and C. Elkan, "Improved disk-drive failure warnings," IEEE Transactions on Reliability, vol. 51, no. 3, pp. 350–357, Sep. 2002.
[10] B. Efron, "Bootstrap Methods: Another Look at the Jackknife," The Annals of Statistics, vol. 7, no. 1, pp. 1–26, 1979.
[11] E. Pinheiro, W. D. Weber, and L. A. Barroso, "Failure Trends in a Large Disk Drive Population," in Proceedings of the FAST '07 Conference on File and Storage Technologies, 2007.
[12] Y. Liang, Y. Zhang, A. Sivasubramaniam, M. Jette, and R. Sahoo, "BlueGene/L Failure Analysis and Prediction Models," in Proceedings of the International Conference on Dependable Systems and Networks (DSN 2006), Jun. 2006, pp. 425–434.
[13] K. Vaidyanathan and K. S. Trivedi, "A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems," in Proceedings of the International Symposium on Software Reliability Engineering (ISSRE), Nov. 1999.
[14] R. Vilalta and S. Ma, "Predicting Rare Events In Temporal Domains," in Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM '02), Washington, DC, USA: IEEE Computer Society, 2002, pp. 474–482.
[15] T.-T. Y. Lin, "Design and evaluation of an on-line predictive diagnostic system," Ph.D. dissertation, Department of Electrical and Computer Engineering, Carnegie-Mellon University, Pittsburgh, PA, Apr. 1988.
[16] R. Azimi, M. Stumm, and R. W. Wisniewski, "Online performance analysis by statistical sampling of microprocessor performance counters," in ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing, New York, NY, USA: ACM, 2005, pp. 101–110.
[17] J. Carreira, D. Costa, and J. Silva, "Fault injection spot-checks computer system dependability," IEEE Spectrum, vol. 36, pp. 50–55, Aug. 1999.
