Proactive Detection of Software Aging Mechanisms in Performance Critical Computers

Kenny C. Gross, Sun Microsystems
Vatsal Bhardwaj, Duke University
Randy Bickford, Expert Microsystems

Abstract

Software aging is a phenomenon, usually caused by resource contention, that can cause mission-critical and business-critical computer systems to hang, panic, or suffer performance degradation. If the incipience or onset of software aging mechanisms can be reliably detected in advance of performance degradation, corrective actions can be taken to prevent system hangs, or dynamic failover events can be triggered in fault-tolerant systems. In the 1990s the U.S. Dept. of Energy and NASA funded development of an advanced statistical pattern recognition method called the Multivariate State Estimation Technique (MSET) for proactive online detection of dynamic sensor and signal anomalies in nuclear power plants and Space Shuttle Main Engine telemetry data. The present investigation was undertaken to assess the feasibility and practicability of applying MSET for realtime proactive detection of software aging mechanisms in complex, multi-CPU servers. The procedure uses MSET for model-based parameter estimation in conjunction with statistical fault detection and Bayesian fault decision processing. A realtime software telemetry harness was designed to continuously sample over 50 performance metrics related to computer system load, throughput, queue lengths, and transaction latencies. A series of fault injection experiments was conducted using a "memory leak" injector tool with controllable parasitic resource consumption rates. MSET was able to reliably detect the onset of resource contention problems with high sensitivity and excellent false-alarm avoidance. Spin-off applications of this NASA-funded innovation for business-critical eCommerce servers are described.

1. Introduction

Software aging is a phenomenon in which the state of the software system degrades with time.[1,2] This is not to be confused with the phenomenon documented by Parnas [3] wherein legacy software becomes more difficult to maintain as it ages. Software aging mechanisms in the sense discussed herein refer to resource contention issues that can cause performance degradation or can cause systems to hang, panic, or crash. Software aging mechanisms can include memory leaks, unreleased file locks, accumulation of unterminated threads, data corruption/round-off accrual, filespace fragmentation, shared memory pool latching, and others. Software rejuvenation [4] is a proactive fault management technique for cleaning up the system internal state to prevent the future occurrence of more severe failures or system performance degradation. Proactive rejuvenation can include measures such as monitoring resources and flushing stale locks, reinitializing application components, purging shared-memory pool latches, or node/application failover (in cluster systems). Software aging mechanisms and associated hangs and performance problems have become commonplace for many home PCs. For business-critical eCommerce computers, software aging can result in hangs with lost revenues exceeding $100K per minute.[5] Software aging mechanisms also occur in safety-critical systems, necessitating adoption of preventive maintenance procedures including software rejuvenation. Software aging was responsible for the loss of American soldier lives during the first Gulf War in the well-known case involving the Patriot Missile software.[1,6] The solution for this problem was to reboot and restart the Patriot software system every 8 hours. For NASA's X2000 Advanced Flight Systems software, rejuvenation was achieved via duty switching between system components, thereby slowing down the system's aging process and enhancing mission reliability.[7] Both of these approaches are examples of time-based rejuvenation. Although effective, time-based rejuvenation carries some performance and/or availability cost. A superior approach is known as prediction-based rejuvenation, which performs rejuvenation actions only when required to avert catastrophic failures.
In this paper we demonstrate the feasibility and practicability of a novel and powerful approach to prediction-based rejuvenation. For this approach we take an advanced statistical pattern recognition system that was demonstrated for NASA's Space Shuttle Main Engine dynamic signal validation, and we use it to proactively detect the incipience or onset of software aging mechanisms in large, business-critical servers. The Multivariate State Estimation Technique (MSET) [8-10] is a nonlinear, nonparametric modeling method that was originally developed by Argonne National Laboratory (ANL) for high-sensitivity proactive fault monitoring applications in advanced commercial nuclear power systems, where plant downtime can cost utilities and their customers on the order of $1 million per day. MSET is a statistical modeling technique that learns a high-fidelity model of an asset from a sample of its normal operating data. Once built, the software model provides an accurate estimate for each observed signal given a new data observation from the asset. Each estimated signal is compared to its actual signal counterpart using a highly sensitive fault detection procedure called the Sequential Probability Ratio Test (SPRT) [11-14] to statistically determine whether the actual signal agrees with the learned model or, alternatively, is indicative of a process anomaly, sensor data quality problem, or equipment problem. During the mid-1990s Expert Microsystems worked in collaboration with ANL under funding provided by NASA to customize MSET to meet the unique and stringent requirements for realtime signal validation, sensor operability validation, and proactive fault monitoring for space shuttle vehicle and ground support systems. This work culminated in Expert Microsystems' commercial SureSense® diagnostic monitoring software.[15,16] It is this implementation of MSET that has been employed in the present investigation. Our objective for this study was to explore the feasibility and practicability of MSET for proactive annunciation of the incipience or onset of software aging mechanisms in large, Unix-based multiprocessor servers that are used in mission-critical and business-critical eCommerce applications.
Metrics employed to evaluate the feasibility of MSET for this application include sensitivity, time-to-annunciation, and false-alarm avoidance. A user-tunable fault injection utility was developed that allows very finely adjustable parasitic free-memory consumption ramp rates. This tool allows subtle fault injections to be actuated during periods of large and chaotic multi-user transactional processing, to fully evaluate the sensitivity of MSET for identifying the onset of very subtle anomalies that would ostensibly be masked by dynamic system loads.
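The fault injection utility is described only by its tunable ramp rate; its implementation is not given in the paper. A minimal sketch of such a tool might look like the following, where the parameterization of the ramp as a percentage of total memory per 24 hours (and the class/parameter names) are assumptions for illustration:

```python
import time

class MemoryLeakInjector:
    """Toy parasitic-memory consumer with a tunable ramp rate.

    rate_pct_per_day: percentage of total_bytes to leak per 24 hours
    (e.g. 9.0 leaks 9% of total memory per day), spread evenly over
    fixed-length ticks.
    """

    def __init__(self, total_bytes, rate_pct_per_day, tick_seconds=2.0):
        self.bytes_per_tick = int(
            total_bytes * (rate_pct_per_day / 100.0) * tick_seconds / 86400.0
        )
        self.tick_seconds = tick_seconds
        self._hoard = []  # references are kept forever, so nothing is freed

    def leaked_bytes(self):
        return sum(len(chunk) for chunk in self._hoard)

    def run(self, n_ticks, sleep=False):
        for _ in range(n_ticks):
            self._hoard.append(bytearray(self.bytes_per_tick))
            if sleep:
                time.sleep(self.tick_seconds)
```

For example, on a machine with 1 GB of memory a 9%-per-day ramp sampled every 2 seconds leaks only about 2 KB per tick, which is why such an injection is invisible against normal load fluctuations.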

2. Diagnostic Procedure Overview

The SureSense implementation of MSET has been designed to enable process engineers to define a diagnostic model for their process or equipment, optimize the design, and then automatically generate the software that performs the online diagnostic function. The model development process is accomplished using a graphical, mouse-driven set of tools and a proven methodology. This combination of tools and methodology permits complex diagnostic systems to be developed in a very short amount of time. The software's validation algorithm combines MSET parameter estimators with sequential hypothesis test fault detection techniques and Bayesian belief network decision logic, for a solution that provides excellent realtime performance and well-defined error rates, and is scalable to validate any number of system signals. MSET uses advanced statistical pattern recognition techniques to measure the similarity or overlap between data signals within a learned operational domain. The learned patterns or relationships among the signals are used to estimate the operating state that most closely corresponds with the current measured set of signals. By quantifying the relationship between the current and learned states, MSET estimates the current expected response of the system signals. For cases where it can be established that sensor failure (and not an equipment malfunction) is responsible for anomalous behavior, MSET's analytical estimate of the signal can be used as a temporary substitute for the erroneous signal until a repair can be accomplished. The mathematical foundations of the MSET algorithms are well described in the literature [8,9]. Referring to Fig. 1, the overall diagnostic framework consists of a training procedure and a monitoring procedure. The training procedure is used to characterize the monitored equipment using historical operating data or simulation data. There are two important attributes of the historical operating data used to train an MSET model of the asset.
First, the data should contain all modes and ranges of operation that are to be considered normal operation of the asset. Second, the training data should not contain any operating anomalies, sensor failures, or equipment failures that would be considered abnormal operation of the asset. These criteria are prerequisites for the learned MSET model to fully characterize normal operation of the asset. After a comprehensive and error-free set of training data has been assembled, the MSET training algorithms are used to build an MSET model of the asset. The training procedure evaluates the training data and selects a subset of the training data observations that are determined to best characterize the asset's normal operation. First, the observations containing the minimum and maximum observed value for each included signal are selected. The procedure then fills in additional operating states by first ordering the training data observations using a statistical method based on the Euclidean norms of the data observations and then selecting observations at equal intervals to fill the MSET model with a user-specified number of operating state examples. Training creates a stored model of the equipment that is used in the monitoring procedure to estimate the expected values of the signals.
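The two-stage selection just described (min/max capture, then norm-ordered equal-interval sampling) can be sketched as follows. This is a reconstruction from the text above, not the SureSense implementation, and the function and parameter names are illustrative:

```python
import numpy as np

def select_memory_matrix(X, n_states):
    """Select training observations for an MSET-style memory matrix.

    X: (n_obs, n_signals) array of normal-operation training data.
    n_states: user-specified number of operating-state examples.
    Returns the selected observations, one per row.
    """
    n_obs, n_signals = X.shape
    chosen = set()
    # Stage 1: observations holding each signal's min and max value
    for j in range(n_signals):
        chosen.add(int(np.argmin(X[:, j])))
        chosen.add(int(np.argmax(X[:, j])))
    # Stage 2: order the remaining observations by Euclidean norm and
    # pick at equal intervals until n_states examples are collected
    remaining = [i for i in range(n_obs) if i not in chosen]
    remaining.sort(key=lambda i: np.linalg.norm(X[i]))
    n_fill = max(0, n_states - len(chosen))
    if n_fill and remaining:
        idx = np.linspace(0, len(remaining) - 1, n_fill).astype(int)
        chosen.update(remaining[k] for k in idx)
    return X[sorted(chosen)]
```

Stage 1 guarantees that the learned model spans each signal's full observed range; stage 2 spreads the remaining examples across the operating envelope.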

[Figure 1 flowchart: training path — Training Data → Train Model → Online Model; monitoring path — Equipment → Acquire Data → Parameter Estimation → Fault Detection → Fault Found? → (Yes) Diagnostic Decision → Alarm or Control Action]

Figure 1. Diagnostic Procedure Overview

In the monitoring step, a new observation of the asset signals is first acquired. Again referring to Fig. 1, this observation is used in conjunction with the previously trained MSET model to estimate the expected values of the signals. The estimation procedure is accomplished by comparing the new observation to the previously learned examples. A weighting method produces the estimate as an optimal combination of example data values from the training "memory matrix". The major power and utility of the patented MSET modeling engine lies in the fact that those examples most similar to the current observation are heavily weighted, while those that are dissimilar are negligibly weighted. Similarity between the current observation and the learned examples is computed using a sophisticated multivariable pattern matching technique. The weighted combination of the most similar learned examples is used to compute the estimated signal values given the current observed signal values. The MSET technique provides an extremely accurate estimate of sensor signals, with error rates that are typically 1% to 2% of the standard deviation of the input signal. The difference between a signal's predicted value and its directly sensed value is termed a residual. The residuals for each monitored signal trace out time series that are used as the indicator for sensor and equipment faults in prior applications of MSET, and software aging faults in the present investigation. Instead of using simple thresholds to detect fault indications (i.e., declaring a fault when a signal's residual value exceeds a preset threshold), the software's fault detection procedure employs a Sequential Probability Ratio Test (SPRT) technique to determine whether the residual error value is uncharacteristic of the learned process model and thereby indicative of an incipient anomaly. The SPRT algorithm improves the threshold detection process by providing more definitive information about signal validity using statistical hypothesis testing. The SPRT technique allows the user to specify false-alarm and missed-alarm probabilities, allowing a high degree of user control over the likelihood of false alarms or missed detections. The SPRT technique is a superior surveillance tool because it is sensitive not only to disturbances in the signal mean, but also to very subtle changes in the statistical quality (variance, skewness, bias) of the signals. For sudden, gross failures of a sensor, component, or subsystem, the SPRT procedure will annunciate the disturbance as fast as a conventional threshold limit check. However, for slow degradation, or for subtle intermittent anomalies that appear first in the noise associated with the signals, the SPRT procedure can detect the incipience or onset of the disturbance long before it would be apparent with conventional threshold limits. In general, the SPRT procedure is accomplished by first establishing the expected distribution of the residual values when the asset is operating normally. This step is accomplished in conjunction with the MSET model training procedure.
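The similarity-weighted estimation and residual generation described above can be illustrated with a simple kernel-weighted combination of memory-matrix rows. The Gaussian similarity used here is only a stand-in for MSET's patented similarity operator, and the bandwidth parameter is an arbitrary assumption:

```python
import numpy as np

def similarity_estimate(D, x, bandwidth=1.0):
    """Estimate expected signal values from a memory matrix D.

    D: (m, n) matrix of learned example observations (one per row).
    x: (n,) current observation.
    Examples close to x receive large weights; dissimilar examples
    are negligibly weighted.  A Gaussian kernel stands in for MSET's
    proprietary similarity operator.
    """
    d2 = np.sum((D - x) ** 2, axis=1)          # squared distances to examples
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))   # similarity weights
    w = w / w.sum()                            # normalize to sum to 1
    return w @ D                               # weighted combination of rows

def residual(x, estimate):
    """One sample of the residual time series: observed minus estimated."""
    return x - estimate
```

When the current observation sits near a learned operating state, the estimate collapses onto that state and the residual is driven by noise alone; a drifting residual then signals departure from learned normal behavior.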
Once an MSET model is trained by learning examples from the training data, the remaining (unselected) training data observations are run through the model in order to characterize the expected distribution of the residual values. Having characterized the expected distribution of the residual values when the asset is operating normally, the SPRT procedure may be used to detect those conditions that deviate from the learned MSET model. In operation, a time series of residual values is evaluated to determine whether the series of values is characteristic of the expected distribution or, alternatively, of some other specified distribution. Four possible fault-type distributions are considered in the current software: 1) the residual mean value has shifted high; 2) the residual mean value has shifted low; 3) the residual variance value has increased; and 4) the residual variance value has decreased. The sensitivity of the SPRT procedure when selecting between the expected (null-type) distribution and a fault-type distribution is primarily established by a user-configurable setting termed the system disturbance magnitude. The system disturbance magnitude controls the crossover point at which a disturbance in the residual values is deemed uncharacteristic of the normal operating states of the monitored asset. The definitive fault alarm decision is made using a Bayesian conditional probability analysis of a series of SPRT fault detection results in order to reduce the potential for single-observation false alarms. The conditional probability technique improves on the conventional multi-cycle voting approach (e.g., fail on N fault indications out of the last M observations) by allowing the user to more explicitly control the statistical confidence level used in the fault alarm decision. In the final step, the software uses a Bayesian belief network decision manager to process the fault events and determine the most probable cause of the faults, including sensor failures, equipment failures, and unmodeled operating conditions. The belief network operates much like a rule base with the addition of conditional probability information. Bayesian belief networks enable a systematic analysis to be performed in which all diagnoses are weighed against one another to determine which has the most evidence for its substantiation. Belief networks are resilient to missing information (e.g., sensor failures), and gracefully handle multiple system failures. They have been used in a wide variety of applications to represent probabilistic knowledge for automated reasoning. Using Bayesian probability theory, the software captures expert beliefs about the dependencies between variables and propagates the fault results consistently and quantitatively for diagnosing the current state of the monitored equipment. Figure 2 shows an example of the working of MSET in monitoring mode.
The green and blue curves show the MSET estimate and the observation, respectively. The middle subplot in the figure shows the residual time series. The bottom subplot shows the SPRT alarms. MSET techniques have been successfully applied in a number of reliability-critical applications, including monitoring of Space Shuttle Main Engine sensors [15,16], military gas turbine engine sensors, industrial process equipment, high-performance computers, and nuclear power plant sensors [7-10,17].

Figure 2. Top: Observation and MSET Estimate; Middle: Residuals; Bottom: SPRT Alarms
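The paper describes the SPRT and the Bayesian decision stage only at the algorithmic level. The sketch below pairs a standard Wald SPRT for a positive mean shift in the residuals with a simple Bayesian posterior update standing in for the full belief network; all numerical parameters (alpha, beta, prior, per-observation likelihoods) are illustrative assumptions, not values from SureSense:

```python
import math

def sprt_mean_shift(residuals, sigma, m, alpha=0.001, beta=0.001):
    """Wald SPRT deciding between H0 (residual mean 0) and H1 (mean m),
    assuming Gaussian residuals with standard deviation sigma.

    alpha and beta are the user-specified false-alarm and missed-alarm
    probabilities.  Returns ("fault" | "normal" | "continue", llr).
    """
    upper = math.log((1.0 - beta) / alpha)   # crossing -> accept H1 (fault)
    lower = math.log(beta / (1.0 - alpha))   # crossing -> accept H0 (normal)
    llr = 0.0
    for r in residuals:
        # log-likelihood-ratio increment for a mean shift of size m
        llr += (m / sigma ** 2) * (r - m / 2.0)
        if llr >= upper:
            return "fault", llr
        if llr <= lower:
            return "normal", llr
    return "continue", llr

def posterior_fault_probability(sprt_flags, prior=0.01,
                                p_flag_given_fault=0.95,
                                p_flag_given_normal=0.001):
    """Bayes update of P(fault) over a series of binary SPRT results.

    sprt_flags: iterable of booleans (True = SPRT tripped).  The prior
    and likelihoods are illustrative placeholders.
    """
    p = prior
    for flag in sprt_flags:
        lf = p_flag_given_fault if flag else 1.0 - p_flag_given_fault
        ln = p_flag_given_normal if flag else 1.0 - p_flag_given_normal
        p = lf * p / (lf * p + ln * (1.0 - p))
    return p

def fault_decision(sprt_flags, confidence=0.999):
    """Declare a fault only when the posterior exceeds a user-specified
    confidence level, suppressing single-observation false alarms."""
    return posterior_fault_probability(sprt_flags) >= confidence
```

A single tripped SPRT moves the posterior well above the prior but not past a 99.9% confidence threshold; a short run of consecutive trips does, which is the sense in which the Bayesian decision generalizes N-of-M voting.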

3. Analytical Results

The design of an online diagnostic model begins with the user's definition of the signal parameters to be processed by the model. These include all signals required for diagnostic processing, either as targets of validation or as collaborating data. For the purposes of this investigation, a telemetry harness was designed that is portable to any Unix-based server and collects digitized signals related to performance, throughput, and quality of service. The parameters that are continuously sampled can conveniently be divided into three categories:

Internal variables: dynamic loads on CPU, memory, and cache; I/O traffic; queue lengths; transaction latencies; operational profiles from OS kernel "virtual sensors".

Physical variables: distributed internal board and module temperatures, voltages, and currents.

External variables: "canary test" variables (distributed synthetic transaction generators that provide QoS metrics 24x7).
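The harness itself is not described in implementation detail. As a sketch, it can be viewed as a fixed-interval sampling loop over a set of metric readers; every reader name below is a hypothetical stand-in for a kernel counter, temperature sensor, or synthetic-transaction probe:

```python
import time

def sample_metrics(readers):
    """Read one observation: a dict of metric name -> current value.

    readers: dict mapping metric names to zero-argument callables.
    """
    return {name: read() for name, read in readers.items()}

def telemetry_loop(readers, interval_s, n_samples, clock=time.monotonic,
                   sleep=lambda s: None):
    """Collect n_samples timestamped observations at a fixed interval.

    The clock/sleep hooks keep the sketch testable; a real harness
    would sleep for interval_s seconds between samples.
    """
    samples = []
    for _ in range(n_samples):
        t = clock()
        samples.append((t, sample_metrics(readers)))
        sleep(interval_s)
    return samples
```

A production harness would additionally persist each observation and tag it with host and sensor metadata, but the core pattern is the same fixed-interval sweep over all three variable categories.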

The telemetry harness designed for this investigation provides 240 performance metrics with a user-adjustable sampling interval. Data sampling intervals for the experiments reported herein were 2 seconds. For MSET training, a subset of the 240 monitored variables was selected on the basis of a cross-correlation analysis. The top 50 signals were used in the MSET model training as described in the previous section. Typical training data for two of the signals employed in the model are shown in Fig. 3. The data plot shows the actual signal values in blue and the MSET-estimated values in green. The excellent ability of the MSET algorithm to track normal trends and fluctuations in the actual signal data is made readily apparent by the close overlay of the observed and estimated curves in the figure.
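The paper states that the 50 model signals were chosen from the 240 by a cross-correlation analysis but does not give the exact ranking criterion. One plausible sketch, assuming the criterion is each signal's mean absolute correlation with the other signals, is:

```python
import numpy as np

def select_top_signals(X, k):
    """Rank signals by how strongly they correlate with the others.

    X: (n_obs, n_signals) telemetry matrix.
    Returns the column indices of the k signals with the highest mean
    absolute cross-correlation with the remaining signals.
    """
    C = np.abs(np.corrcoef(X, rowvar=False))  # signal-by-signal |corr|
    np.fill_diagonal(C, 0.0)                  # ignore self-correlation
    score = C.mean(axis=1)                    # mean |corr| with others
    return np.argsort(score)[::-1][:k]        # top-k indices, best first
```

Signals that move together are the ones an MSET model can exploit, so a selection of this kind discards metrics that carry little mutual information with the rest of the telemetry.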

Figure 3: Typical Training Data for Signals 9 and 27 (Two of Fifty Monitored Performance Metrics)

In Fig. 4, the behavior of Signal 9 is shown during an experiment with a controlled parasitic resource consumption rate of 9% per 24 hours. The upper plot in the figure shows observed (blue) and estimated (green) signals plotted against experiment time in hours. The controlled fault injection was activated at 5.0 hours into the experiment. The upper subplot of Fig. 4 shows the subtle onset of a departure between the measured signal and the MSET estimate. The lower subplot in Fig. 4 shows the SPRT alarm index. SPRT fault alarms are signified by red x markers when the SPRT index jumps to 1.0, which occurs when the data disturbance hypothesis is satisfied with a confidence factor of 99.9% for the analyses presented here. Note that software aging fault alarms start tripping only 33 minutes after the onset of the controlled parasitic resource consumption, and well before the disturbance is observable by visual examination of the raw signals.

Figure 4: MSET Detection of Onset of Software Aging Mechanism in Signal 9 (Parasitic Resource Consumption Rate = 9% per 24 Hrs)

For the experimental results shown in Fig. 5, the rate of controlled parasitic resource consumption was set at an extremely small rate of only 1% per 24 hours. The very high sensitivity of MSET for catching subtle software aging phenomena is evident in the lower subplot, in which SPRT alarms start tripping only 35 minutes after the onset of the controlled fault injection. This fault would not have been evident from conventional performance monitoring thresholds, nor by visual inspection of the signals.

Figure 5: MSET Detection of Onset of Software Aging Mechanism in Signal 27 (Parasitic Resource Consumption Rate = 1% per 24 Hrs)

4. Summary and Conclusion

An advanced nonlinear, nonparametric statistical pattern recognition system called MSET was originally customized for high-sensitivity proactive annunciation of dynamic sensor and instrumentation anomalies for DOE and NASA applications in the mid-1990s. In this paper we apply MSET to a new class of problems: those originating in software. MSET does not detect bugs in software; rather, it is used to detect the incipience or onset of a class of transient errors called software aging. For the experimental studies conducted in this investigation, a memory-leak fault injector was developed that allows systematic replication of experiments with controlled rates of parasitic resource consumption in operating software systems. MSET was found to be extremely effective at catching the onset of resource contention phenomena, even when the magnitude of the anomaly is significantly smaller than the normal process variability in the parameters under surveillance. MSET also exhibited zero false alarms in the experiments conducted during this investigation, which is as important for NASA applications as for business-critical eCommerce applications, where system downtime can cost millions of dollars an hour in lost revenues. For eCommerce computing applications, MSET can provide system administrators with reliable early notification of the incipience of resource contention problems, well before critical levels are approached, thereby allowing operators to perform corrective rejuvenation actions before outages occur. Future research is focused on integrating MSET in closed-loop autonomic control systems for enhanced performance management of business-critical enterprise computing systems.

5. Acknowledgment

The authors wish to thank Dr. Larry Votta, Sun Distinguished Engineer, for his helpful suggestions and feedback during manuscript development.

6. References

1. K. S. Trivedi, K. Vaidyanathan, and K. Goseva-Popstojanova, "Modeling and Analysis of Software Aging and Rejuvenation," Proc. 33rd Annual Simulation Symp., pp. 270-279, IEEE Computer Society Press (2000).

2. K. Vaidyanathan, R. E. Harper, S. W. Hunter, and K. S. Trivedi, "Analysis and Implementation of Software Rejuvenation in Cluster Systems," ACM Sigmetrics 2001/Performance 2001, June 2001.

3. D. L. Parnas, "Software Aging," Proc. 16th Intl. Conf. on Software Eng., Sorrento, Italy, IEEE Press, pp. 279-287, May 16-21, 1994.

4. T. Dohi, K. Goseva-Popstojanova, and K. S. Trivedi, "Estimating Software Rejuvenation Schedules in High-Assurance Systems," The Computer Journal (44), pp. 473-482 (2001).

5. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Chapter 1, Morgan Kaufmann Publishers (2002).

6. E. Marshall, "Fatal Error: How Patriot Overlooked a Scud," Science, p. 1347, March 13, 1992.

7. A. T. Tai, L. Alkalai, and S. N. Chau, "On-Board Preventive Maintenance for Long-Life Deep-Space Missions: A Model-Based Analysis," Proc. IEEE Int'l Computer Performance and Dependability Symp. (Sep. 1998).

8. K. C. Gross, R. M. Singer, S. W. Wegerich, J. P. Herzog, R. VanAlstine, and F. Bockhurst, "Application of a Model-Based Fault Detection System to Nuclear Plant Signals," Proc. Intelligent System Application to Power Systems (ISAP '97), Seoul, Korea, July 6-10, 1997.

9. R. M. Singer, K. C. Gross, J. P. Herzog, R. W. King, and S. Wegerich, "Model-Based Nuclear Power Plant Monitoring and Fault Detection: Theoretical Foundations," Proc. Intelligent System Application to Power Systems (ISAP '97), Seoul, Korea, July 6-10, 1997.

10. K. C. Gross, S. Wegerich, and R. M. Singer, "New Artificial Intelligence Technique Detects Instrument Faults Early," Power Magazine (42), no. 6, pp. 89-95, McGraw-Hill (1998).

11. A. Wald, Sequential Analysis, Wiley, New York, 1947.

12. K. C. Gross and K. Humenik, "Nuclear Power Plant Component Surveillance Implemented in SAS Software," Proc. SAS Users Group Intl. Conf., pp. 1127-1131, San Francisco (April 1989).

13. K. E. Humenik and K. C. Gross, "Sequential Probability Ratio Tests for Reactor Signal Validation and Sensor Surveillance Applications," Nucl. Sci. and Eng. (105), pp. 383-390 (Aug. 1990).

14. J. W. Hines, A. Gribok, and B. Rasmussen, "On-Line Sensor Calibration Verification: A Survey," Proc. 14th Intl. Congress and Exhibition on Condition Monitoring and Diagnostic Engineering Mgmt., Manchester, England, Sept. 2001.

15. R. L. Bickford, et al., "MSET Signal Validation System for Space Shuttle Main Engine," Final Report, NASA Contract NAS8-98027, August 2000.

16. R. Bickford, C. Meyer, and V. Lee, "Online Signal Validation for Assured Data Quality," Proc. 2001 Instrument Society of America.

17. R. Bickford, E. Davis, R. Rusaw, and R. Shankar, "Development of an Online Predictive Monitoring System for Power Generating Plants," 45th ISA POWID Symposium, Paper No. 805, June 3-5, 2002, San Diego, California.