Hierarchical Computation of Interval Availability and Related Metrics

Dong Tang, Sun Microsystems, Inc., [email protected]
Kishor S. Trivedi, Duke University, [email protected]

Abstract

As the new generation of high-availability commercial computer systems incorporates deferred repair service strategies, steady-state availability metrics may no longer reflect reality. For such systems it is desirable to solve availability models transiently and compute interval availability over shorter time horizons. While many solution methods for transient analysis have been proposed, how to apply these methods to hierarchical models has not been well addressed. This paper describes an approach to computing interval availability and related metrics for hierarchical Markov models. The approach divides the time interval of interest into small subintervals, so that the input parameters can be treated as constants in each subinterval and each model satisfies the homogeneous Markov property, and then passes the output interval availability metrics as constants from each submodel to its parent model. Finally, these quantities are integrated to obtain the expected interval availability for the entire interval. The study also addresses methods of passing parameters across levels to generate multiple metrics from a hierarchical model. The approach is illustrated with an example model and has been implemented in RAScad. All computations for the example model have also been carried out using the SHARPE textual language interface.

1. Introduction

In the development of highly available computer server, storage, and networking systems, system designers perform reliability, availability, and serviceability (RAS) modeling to assess the operational availability, performability, and service cost achievable by the architectures under consideration, in order to optimize the design. The modeling is typically based on well-accepted mathematical models (e.g., Markov chains) that are solved by commercial or in-house software tools. Traditionally, steady-state analysis methods are used to evaluate availability and associated metrics on availability models (without absorbing states) [4, 16]. Most commercially available modeling tools [9, 11, 12, 14] provide either no support or limited support for generating transient results in availability analysis, especially for hierarchical models. With decreasing cost for components of massive usage such as memory chips and disks, the new high-end server and storage systems can afford to tolerate multiple component failures without having to repair faulty components before the number of faulty components reaches a threshold or a scheduled maintenance action occurs, reducing both downtime and service cost. This is called deferred repair [1, 2]. The traditional steady-state analysis provision in availability modeling tools has been found inadequate by design engineers at Sun Microsystems for evaluating new architectures with deferred repair service strategies. The interval availability (average availability for a time interval from 0 to T) and associated measures, such as interval performability, interval failure rate, and interval service call rate, are the metrics that should be used for this type of model. The interval availability concept was defined over 16 years ago [4]. Numerical methods for calculating the expected interval availability and the distribution of interval availability have been addressed by many studies [4, 5, 8, 10, 13]. However, how to apply these methods to a hierarchical model has not been addressed in published studies. The hierarchical modeling approach has proven very useful in practical engineering design [11, 14, 15]. There are two fundamental reasons for using the hierarchical modeling approach:
1. Reducing model complexity so that human construction of the model is feasible
2. Facilitating identification of RAS bottlenecks in terms of subsystems or components
It is thus necessary to provide the capability of computing interval availability and related metrics for hierarchical models. This capability is not explicitly supported in commercial dependability modeling tools, probably in part because its practical value has not been widely recognized.
In response to these newly recognized engineering needs, we present in this paper an approach to the hierarchical computation of expected interval availability and related metrics. The approach has been implemented in a Sun internal RAS modeling tool, RAScad [15], and the results presented in this paper have been independently verified by solving the models with SHARPE [11]. The rest of the paper is organized as follows: Section 2 reviews the interval availability definition, introduces several other interval metrics, and describes the basic computation approach for hierarchical Markov models. Section 3 uses an example to illustrate hierarchical
Proceedings of the 2004 International Conference on Dependable Systems and Networks (DSN’04) 0-7695-2052-9/04 $ 20.00 © 2004 IEEE
modeling of deferred repair and shows how interval availability metrics are rolled up from submodels to the parent model. Section 4 analyzes results that show the necessity of interval availability evaluation. Section 5 concludes the paper.
2. Definitions and Basic Approach

The interval availability for the time interval (0, T) is defined as [4, 11, 16]

    A_I(T) = (1/T) \int_0^T A(t) dt                                        (1)

where A(t) is the instantaneous availability at time t. The interval availability is the average availability over the interval (0, T). When T approaches infinity, A_I(T) approaches A, the steady-state availability. In addition to interval availability, there are several other interval metrics that are useful in engineering design. First, we define the interval failure rate and interval repair rate for the time interval (0, T) by extending the definition of equivalent failure rate and repair rate from [6, 16] as follows:

    \lambda_I(T) = [ \sum_{i \in UP, j \in DN} q(i,j) \pi_i^I(T) ] / A_I(T)          (2a)

    \mu_I(T) = [ \sum_{i \in DN, j \in UP} q(i,j) \pi_i^I(T) ] / (1 - A_I(T))        (2b)

where UP and DN are the sets of working (up) states and failure (down) states, respectively, q(i,j) is the transition rate from state i to state j, and \pi_i^I(T) is the interval state probability for interval (0, T), calculated by

    \pi_i^I(T) = (1/T) \int_0^T \pi_i(t) dt                                (3)

where \pi_i(t) is the probability that the system is in state i at time t. By this definition, the interval failure rate is the average failure rate over the time interval (0, T). Replacing the availability symbol in Eqs. (1) and (2) with the performability symbol, we can define interval performability as the average performability and the interval degradation rate as the average rate into degraded states for the time interval (0, T). In a Markov availability model, similar to the partition into working states and failure states for availability evaluation, we can partition all states into two sets for service cost evaluation: the set of states in which no service action is taken and the set of states in which a service action is being performed. Applying Eqs. (1) and (2) again, we define the interval service call rate as the average rate into the service states of a Markov chain for the time interval (0, T).
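As a concrete illustration (ours, not part of the original paper), Eq. (1) can be checked numerically for the textbook two-state up/down Markov model, whose instantaneous availability A(t) has a well-known closed form; the rates used below are hypothetical.

```python
import math

def instantaneous_availability(t, lam, mu):
    """A(t) for a two-state (up/down) Markov model with constant failure
    rate lam and repair rate mu, starting in the up state (closed form)."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

def interval_availability(T, lam, mu, n=10_000):
    """A_I(T) = (1/T) * integral_0^T A(t) dt, Eq. (1), by the trapezoidal rule."""
    h = T / n
    total = 0.5 * (instantaneous_availability(0.0, lam, mu) +
                   instantaneous_availability(T, lam, mu))
    for k in range(1, n):
        total += instantaneous_availability(k * h, lam, mu)
    return total * h / T

lam, mu = 2.0, 100.0                         # hypothetical rates, per year
print(interval_availability(5.0, lam, mu))   # slightly above mu/(lam + mu)
```

Because A(t) decays monotonically from 1 toward the steady-state value mu/(lam + mu), the interval availability of a system that starts in the up state always lies above its steady-state availability, which is exactly the effect analyzed in Section 4.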
The typical lifetime of a computer server or storage system is roughly 5 to 10 years. The interval availability metrics defined above are good for
quantifying the average RAS behavior over the entire operational lifetime of the modeled system. Often, system designers wish to know the average RAS behavior in a subinterval, or increment interval (typically one month to one year), within the projected lifetime of the system under design. Thus, it is necessary to extend the interval availability concept to the increment interval availability:

    A_I(T_{k-1}, T_k) = [1/(T_k - T_{k-1})] \int_{T_{k-1}}^{T_k} A(t) dt        (4)

where k >= 1 and T_k > T_{k-1}. Notice we use the same notation to represent interval availability and increment interval availability. The difference is in the number of parameters: the interval availability has one parameter while the increment interval availability has two. If we set the parameter T_{k-1} to 0, the increment interval availability reduces to the interval availability. The increment interval availability concept is useful not only in engineering design, but also in the hierarchical computation of interval availability discussed below. The interval failure rate and repair rate definitions can be similarly extended to the increment interval failure rate and increment interval repair rate:

    \lambda_I(T_{k-1},T_k) = [ \sum_{i \in UP, j \in DN} q(i,j) \pi_i^I(T_{k-1},T_k) ] / A_I(T_{k-1},T_k)          (5a)

    \mu_I(T_{k-1},T_k) = [ \sum_{i \in DN, j \in UP} q(i,j) \pi_i^I(T_{k-1},T_k) ] / (1 - A_I(T_{k-1},T_k))        (5b)

where UP, DN, and q(i,j) have the same meaning as in Eq. (2) (when these equations are applied in the approach of Fig. 1 discussed below, q(i,j) is also a function of (T_{k-1},T_k)), and \pi_i^I(T_{k-1},T_k) is the increment interval state probability for interval (T_{k-1},T_k), calculated by
    \pi_i^I(T_{k-1}, T_k) = [1/(T_k - T_{k-1})] \int_{T_{k-1}}^{T_k} \pi_i(t) dt        (6)
where \pi_i(t) is the probability that the system is in state i at time t. The same extension also applies to the other interval metrics defined above. Now we derive a relationship between the interval availability and the increment interval availability. We divide T into N subintervals of equal length and let \Delta T = T/N, so that T_k = k \Delta T. Then we have

    A_I(T) = (1/T) \int_0^T A(t) dt
           = (1/T) \sum_{k=1}^{N} \int_{T_{k-1}}^{T_k} A(t) dt
           = (1/N) \sum_{k=1}^{N} A_I(T_{k-1}, T_k)                             (7)

That is, the interval availability can be calculated from the increment interval availabilities of all its subintervals. When \Delta T is small enough (typically a month or a quarter for a commercial system), the input failure
rates and repair rates for a model in a hierarchical structure can be treated as constants. The same holds for the output metrics (increment interval failure rate, etc.); i.e., it is viable to pass the increment interval failure rate as a constant to the parent model. Thus, over the small interval \Delta T, each model in the hierarchical structure can be considered to satisfy the homogeneous Markov property, and numerical methods for calculating interval availability are applicable to the model. Fig. 1 shows how interval availability is calculated for a Markov chain in the hierarchical model. In the figure, \Theta(T_{k-1},T_k) represents the input parameter vector, which determines the generator matrix of the Markov chain, Q(T_{k-1},T_k), for the subinterval (T_{k-1},T_k). P(T_k) represents the state probability vector of the Markov chain at time T_k. The increment interval availability, A_I(T_{k-1},T_k), and other increment interval metrics (failure rate, repair rate, etc.) can be calculated from P(T_{k-1}) and Q(T_{k-1},T_k). These quantities, in turn, can serve as input parameters to the parent model. If the interval availability, A_I(T), and other interval metrics need to be evaluated, they can be derived from the corresponding increment interval metrics using Eq. (7).
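One step of this scheme for a single Markov chain can be sketched as follows (our own sketch, not from the paper): given the state probability vector at T_{k-1} and a generator held constant over the subinterval, the routine propagates the probabilities with a simple forward-Euler integrator (a production tool would use uniformization or a stiff ODE solver) and accumulates the increment interval availability of Eqs. (4) and (6).

```python
def step_interval(p0, Q, dt, up, n=20_000):
    """One iteration of the Fig. 1 scheme for a single Markov chain: from
    the state probability vector p0 at T_{k-1} and a generator Q held
    constant over a subinterval of length dt, return (p at T_k, increment
    interval availability).  Forward-Euler sketch only."""
    m, h = len(p0), dt / n
    p, acc = list(p0), [0.0] * len(p0)
    for _ in range(n):
        for i in range(m):
            acc[i] += p[i] * h                    # accumulate integral of pi_i(t)
        # dp_j/dt = sum_i p_i * Q[i][j]
        p = [p[j] + h * sum(p[i] * Q[i][j] for i in range(m))
             for j in range(m)]
    return p, sum(acc[i] for i in up) / dt        # Eqs. (6) and (4)

# Two-state up/down chain with hypothetical rates (per year)
lam, mu = 2.0, 100.0
Q = [[-lam, lam], [mu, -mu]]
p1, a_inc = step_interval([1.0, 0.0], Q, 1.0, up=(0,))
```

For the two-state chain the result can be checked against the closed-form transient availability; the returned vector p1 is then carried into the next subinterval, exactly as the iteration described below requires.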
3. Example Model and Metrics to Roll up

In this section, we use an example hierarchical model with deferred repair to illustrate how the metrics defined in the previous section are related to the model and how they are rolled up from a child model to the parent model. Assume we wish to model the impact of permanent faults on a system with two types of components: CPU and memory. Each type has N units, and the system can tolerate up to two unit failures of each type before a repair action is taken. Although this example is simple, it is representative of the model structures for more complicated architectures with deferred repair. The RAScad Markov reward models for the CPU submodel and memory submodel are shown in Figs. 2 and 3, respectively. The parameter values shown in the diagrams are all hypothetical and do not represent particular products.
Figure 2. The CPU Submodel

Figure 1. Interval Availability Calculation for a Markov Chain in the Hierarchy
In this approach, it is important for each model in the hierarchy to remember its state probability vector at time T_k for use in the next iteration, i.e., the calculation for the next subinterval. Each iteration starts at the bottom level, moves up through all the models in the hierarchy, and ends at the top-level model. Notice that the generator matrix is reconstructed in each iteration. At the beginning of the first iteration, P(T_0) is specified by the user. At the end of the k-th iteration, P(T_k) is available for use in the next iteration. When N iterations are completed, both the N sets of increment interval availability metrics and the interval availability metrics for (0, T) are available. A similar approach has been used in the past for non-hierarchical reliability models: by [7] to solve a nonhomogeneous Markov model using time-stepping, i.e., approximating the time-dependent generator matrix with piecewise-constant matrices over small subintervals, and by [3] to solve phased-mission reliability models. Our hierarchical approach differs in that it solves multiple models, passes parameters across levels in each iteration, and calculates availability metrics.
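The whole loop can be sketched for a toy two-level hierarchy (our own construction; the three-state child chain and all rates are hypothetical and are not the example model of Section 3): each iteration solves the child chain over one subinterval, converts its interval state probabilities into an equivalent failure and repair rate via Eqs. (5a)-(5b), rebuilds the parent generator from those constants, and finally averages the parent increments per Eq. (7).

```python
def propagate(p0, Q, dt, n=5_000):
    """Transient solution over one subinterval by forward Euler; returns
    (state probabilities at the end of the subinterval, interval state
    probabilities per Eq. (6)).  A real tool would use uniformization."""
    m, h = len(p0), dt / n
    p, acc = list(p0), [0.0] * len(p0)
    for _ in range(n):
        for i in range(m):
            acc[i] += p[i] * h
        p = [p[j] + h * sum(p[i] * Q[i][j] for i in range(m))
             for j in range(m)]
    return p, [a / dt for a in acc]

# Hypothetical child model with deferred repair:
# 0 = Ok (up) -> 1 = Dead, deferred (still up) -> 2 = Repair (down) -> 0
lam1, lam2, mu = 1.0, 1.0, 50.0          # per-year rates, illustrative
Qc = [[-lam1, lam1, 0.0],
      [0.0, -lam2, lam2],
      [mu, 0.0, -mu]]
UP, DN = (0, 1), (2,)

N, dT = 20, 0.5                          # 10 years in 20 half-year steps
pc, pp = [1.0, 0.0, 0.0], [1.0, 0.0]     # child / parent state vectors at T_0
parent_inc = []
for k in range(N):
    pc, pic = propagate(pc, Qc, dT)
    A_inc = sum(pic[i] for i in UP)                                        # Eq. (4)
    lam_eq = sum(Qc[i][j] * pic[i] for i in UP for j in DN) / A_inc        # Eq. (5a)
    mu_eq = sum(Qc[i][j] * pic[i] for i in DN for j in UP) / (1.0 - A_inc) # Eq. (5b)
    Qp = [[-lam_eq, lam_eq], [mu_eq, -mu_eq]]  # parent generator, rebuilt each pass
    pp, pip = propagate(pp, Qp, dT)
    parent_inc.append(pip[0])            # parent increment interval availability
A_interval = sum(parent_inc) / N         # Eq. (7)
print(A_interval)
```

Feeding lam_eq and mu_eq into the parent as constants for each subinterval is exactly the "pass the increment interval failure rate as a constant to the parent model" step described above.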
Figure 3. The Memory Submodel
For a RAScad Markov model, there are three user-defined reward vectors associated with its states, as displayed in the circles representing states:
1. Availability (0 or 1)
2. Performance (>= 0)
3. Service Cost (>= 0)
The first reward vector is used to calculate system
availability and system failure rate. The second reward vector is used to calculate performability and the mean rate to degraded states. The third reward vector is used to calculate annual service cost and the mean service call rate. The interval system availability and performability are the expected accumulated rewards during interval (0, T) using the first two reward vectors, respectively. The interval system service cost, SSC(T), for (0, T) is calculated by

    SSC(T) = \sum_{i=1}^{M} AVF_i(T) \cdot SC_i                                 (8)
where AVFi(T) denotes the average annual visit frequency to state i during interval (0, T), SCi the third reward value for state i, and M the number of states in the model. SCi could be an actual cost in terms of dollar amount. If it is set to 1 for the states where a service action is taken, then the annual service cost is just the annual service call rate. Fig. 2 is a model with deferred repair, cold swap, and non-transparent recovery. When a CPU fails, the system does an automatic recovery by a reboot during which the failed CPU is deconfigured from the system and a short downtime is experienced. When the second CPU fails, a repair action is scheduled at an off-peak time, which is modeled by a waiting time (Twaiting). The states Ok, 1 Dead and 2 Dead are working states, so the first reward value is set to 1 for these states and to 0 for the other states. Assume there are initially 10 CPUs in the system and the full performance is thus defined as 10 (the second reward value). In the degraded states 1 Dead and 2 Dead, the performance number is set to 9 and 8, respectively. The only service state is Repair for which the third reward value is set to 1. Fig. 3 is a model with deferred repair, hot swap, and transparent recovery. When a fault occurs on a memory component (e.g., DRAM), the system does an on-line automatic recovery by replacing the faulty component with a hot spare and reconstructing data on the spare. A successful recovery should not incur a system downtime (transparent). If the recovery is not successful, modeled by Prf (probability of recovery failure), a system reboot is needed to do a boot-up reconfiguration which incurs a short system downtime. In any case, there is no repair action associated with the first fault until the second fault occurs. The repair can be performed concurrently with the system operation because hot swap is supported. 
However, an imperfect repair, due either to diagnostic problems or to human error, would bring the system down, incurring a downtime (Trestore). The imperfect repair is modeled by the parameter Pre (probability of repair error). Notice that in the working states Ok, 1 Dead, 2 Dead, and Repair, there is no performance degradation (the second reward value is 10) because the memory size is not reduced. Also, there are two service states in the model: Repair and RepairError. One (Repair) is a working state and the other (RepairError) is a failure state. For each of the above submodels, interval metrics
associated with the three reward vectors can be evaluated. How are these metrics related to, or integrated into, the parent system model? Fig. 4 shows the System Model. The model has only three states: Ok (a working state), CPU_Fail, and Mem_Fail (both failure states). The two gray rectangular boxes are the interfaces to the submodels in which parameter bindings are defined. As shown in the first box, the overall CPU failure rate, La_cpu, and repair rate, Mu_cpu, are bound to the CPU Submodel outputs Lambda1 and Mu1, which are the submodel's equivalent failure rate and repair rate evaluated with the first reward vector. The same binding rule applies to La_mem and Mu_mem, as shown in the second box. This approach of binding output measures of a submodel to parameters in the current model is called explicit parameter passing.
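To make the binding concrete, here is a hypothetical instance (values invented for illustration, not taken from the paper) of a generator for the three-state System Model assembled from bound parameters; in RAScad, La_cpu/Mu_cpu and La_mem/Mu_mem would be bound to the submodel outputs Lambda1 and Mu1.

```python
# Hypothetical bound values; in RAScad these would come from the CPU and
# Memory submodel outputs (Lambda1, Mu1) evaluated on the first reward
# vector -- the numbers below are illustrative only.
La_cpu, Mu_cpu = 1.2e-5, 0.05     # equivalent failure/repair rates, per hour
La_mem, Mu_mem = 0.8e-5, 0.02

# States of Fig. 4: 0 = Ok (up), 1 = CPU_Fail (down), 2 = Mem_Fail (down)
Q_sys = [[-(La_cpu + La_mem), La_cpu, La_mem],
         [Mu_cpu, -Mu_cpu, 0.0],
         [Mu_mem, 0.0, -Mu_mem]]

# Sanity check: every generator row must sum to zero
for row in Q_sys:
    assert abs(sum(row)) < 1e-15
```

Rebuilding Q_sys from freshly bound values at the start of each subinterval is what "the generator matrix is reconstructed in each iteration" means in practice.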
Figure 4. The System Model
Explicit parameter passing is not sufficient to support hierarchical modeling when multiple reward vectors are used to generate multiple metrics from the same model, as is typically required in system design. To allow integration of performability measures from submodels into the parent model, we define performance loss as maximum performance (the largest value of the performance reward vector in the model) minus performability (the expected performance evaluated from the model). If all submodels are independent, the performance loss as well as the service cost generated from the submodels can be integrated at the parent model. In RAScad, the following formulas are used to calculate the interval system performance loss (SPL) and system service cost (SSC) for a parent model:

    SPL(T) = SPL_cur(T) + \sum_{i \in S} C2_i \cdot SPL_i(T)                      (9)

    SSC(T) = SSC_cur(T) + \sum_{i \in S} C3_i \cdot SSC_i(T)                      (10)
where SPL_cur(T) and SSC_cur(T) are the interval system performance loss and service cost evaluated from the current Markov diagram (Fig. 4), respectively, SPL_i(T) and SSC_i(T) are the interval system performance loss and service cost evaluated from submodel i, respectively, and S is the set of submodels of the current model. C2_i and C3_i are coefficients defined by the user in specifying
submodel i; these coefficients make it possible to count contributions from multiple instances of a submodel. If they are set to 0, the performance loss and service cost are not rolled up and integrated into the parent model. The way of passing parameters shown in Eqs. (9) and (10) is called implicit parameter passing. Looking at the reward vector assignments in the System Model, we can see that both SPL_cur and SSC_cur are 0. Assume there is only one CPU Submodel and one Memory Submodel, i.e., C2 and C3 are set to 1 for both models. According to Eqs. (9) and (10), the System Model performance loss is then simply the sum of the performance losses of the two submodels, and the System Model service cost is the sum of their service costs. These metrics can continue to roll up if there is another parent model above the System Model.
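The implicit parameter passing of Eqs. (9)-(10) reduces to a small weighted sum; the sketch below uses invented submodel values purely for illustration.

```python
def roll_up(spl_cur, ssc_cur, submodels):
    """Implicit parameter passing, Eqs. (9)-(10): combine the current
    model's own interval performance loss / service cost with the
    coefficient-weighted contributions of its submodels.
    submodels: iterable of (SPL_i, SSC_i, C2_i, C3_i) tuples."""
    spl = spl_cur + sum(c2 * spl_i for spl_i, _, c2, _ in submodels)
    ssc = ssc_cur + sum(c3 * ssc_i for _, ssc_i, _, c3 in submodels)
    return spl, ssc

# System Model of Fig. 4: SPL_cur = SSC_cur = 0, one CPU and one Memory
# submodel, C2 = C3 = 1 for both (the submodel values are illustrative).
spl, ssc = roll_up(0.0, 0.0, [(0.53, 0.013, 1, 1),   # CPU submodel
                              (0.0, 0.045, 1, 1)])   # Memory submodel
print(spl, ssc)   # with zero current-model terms, just the sums
```

Setting a coefficient to 0 drops that submodel's contribution, matching the opt-out behavior described above.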
4. Analysis of Results

Having discussed the example model and how parameters are passed from submodels to the parent model, we now analyze results generated from the model to demonstrate the necessity of interval availability evaluation and the advantages of hierarchical modeling. First, we look at the increment interval failure rates generated from the CPU submodel and memory submodel, shown in Figs. 5 and 6.
In the figures, the increment interval \Delta T equals 1/4 year, i.e., one quarter. The failure rate unit used is the FIT, which represents one failure per 10^9 hours. The increment interval failure rate for both submodels changes over time; it does not reach the steady-state failure rate within 10 years. This quantity is used as an input parameter to the parent model when constructing the System Model (Fig. 4). To reconcile the time variance of the input parameters with the requirement for constant transition rates to satisfy the homogeneous Markov property, it is necessary to apply the "divide and conquer" approach described in Fig. 1.

Table 1 compares steady-state and interval availability metrics evaluated from the system-level model. The steady-state results are clearly much more pessimistic than the transient results for the interval covering the first 5 years of lifetime. Table 2 further shows that the convergence of the interval results to the steady-state results is slow (they are still not identical at 100 years). These results indicate the necessity of interval availability evaluation for models with deferred repair.

  Measure                   | Steady State    | Interval (5 years)
  Expected Yearly Downtime  | 3 min. 55 sec.  | 2 min. 1 sec.
  Mean Failure Rate         | 15,690 FITs     | 12,602 FITs
  Expected Performance      | 9.47 (CPUs)     | 9.83 (CPUs)
  Mean Service Call Rate    | 14,607 FITs     | 6,693 FITs

Table 1. Comparison of Steady-State and Interval Measures for System Model

  Time Interval | Yearly Downtime | Service Call Rate
  1 year        | 1 min. 14 sec.  | 1,890 FITs
  10 years      | 2 min. 35 sec.  | 9,534 FITs
  50 years      | 3 min. 36 sec.  | 13,467 FITs
  100 years     | 3 min. 46 sec.  | 14,037 FITs
  Steady State  | 3 min. 55 sec.  | 14,607 FITs

Table 2. Convergence Speed of Interval Measures for System Model

Figure 5. Increment Interval Failure Rate for CPU Submodel
Why is there such a large difference between the steady-state and interval results over a meaningful time period? A closer look at the state probabilities for the steady state and for a 5-year interval for the Memory Submodel, shown in Table 3, helps explain the issue.

Figure 6. Increment Interval Failure Rate for Memory Submodel

  State       | Steady State | Interval (5 years)
  Ok          | 0.494        | 0.734
  1 Dead      | 0.506        | 0.266
  2 Dead      | 2.369E-4     | 1.242E-4
  Repair      | 4.936E-6     | 2.587E-6
  Reboot1     | 8.226E-8     | 1.224E-7
  Reboot2     | 8.226E-8     | 4.315E-8
  RepairError | 9.871E-7     | 5.173E-7

Table 3. Comparison of Steady-State and Interval State Probabilities for Memory Submodel
For the interval covering the first 5 years of lifetime, most systems in the field (73%) would be in the Ok state (no fault) and only 27% of the population would be in the "1 Dead" state (one fault). In the steady state, by contrast, less than 50% of all systems would be in the Ok state and over 50% would be in the "1 Dead" state. However, accumulating faults and transitioning to the "1 Dead" state is a very long process (over 100 years); in reality, this accumulation process never reaches the steady state within the system lifetime. Table 4 shows the interval (5 years) yearly downtime and service call rate distribution by submodel. The CPU Submodel dominates the downtime while the Memory Submodel dominates the service call rate. This table shows that hierarchical modeling facilitates the identification of RAS bottlenecks in terms of subsystems.

  Submodel | Yearly Downtime    | Service Call Rate
  CPU      | 1.66 min. (82.2%)  | 1,520 FITs (22.7%)
  Memory   | 21.5 sec. (17.8%)  | 5,173 FITs (77.3%)

Table 4. Distribution of 5-Year Interval Yearly Downtime and Service Call Rate by Submodel
5. Conclusions

In this paper, we analyzed an example model showing that traditional steady-state availability metrics are no longer appropriate for quantifying the dependability of systems incorporating deferred repair. Instead, interval availability metrics should be used to quantify the RAS behavior of such systems over their useful lifetime. This is the first study to show the importance of interval availability in solving real engineering problems. We made two contributions in addressing the hierarchical computation of interval availability and related metrics:
1. Proposed and implemented a "divide and conquer" approach to the hierarchical calculation of interval availability metrics. The approach allows not only passing interval availability metrics from submodels to the parent model, but also accepting general time-dependent failure rates as input parameters to the model.
2. Identified a method to explicitly and implicitly pass output quantities up from submodels to the parent model, so that multiple interval metrics can be generated from the same model in one evaluation procedure. This method is important for developing highly productive RAS modeling tools.
Acknowledgments

The authors thank William Bryson and Robert Cypher for raising the issue of the validity of steady-state availability in evaluating RAS architectures incorporating deferred repair, and for testing the interval availability solution methods implemented in RAScad. The authors also thank Dazhi Wang for implementing the example model discussed in this paper using the SHARPE textual language, which produces the same results as those generated by RAScad.
References

[1] S. Bose, V. Kumar, and K. S. Trivedi, "Effect of Deferring the Repair on Availability," Supplement Volume of the 2003 International Conference on Dependable Systems and Networks, June 2003, pp. B32-B33.
[2] D. C. Bossen, A. Kitamorn, K. F. Reick, and M. S. Floyd, "Fault-Tolerant Design of the IBM pSeries 690 System Using POWER4 Processor Technology," IBM J. Res. & Dev., Vol. 46, No. 1, January 2002.
[3] J. B. Dugan, "Automated Analysis of Phased-Mission Reliability," IEEE Trans. on Reliability, April 1991, pp. 45-51.
[4] A. Goyal, S. S. Lavenberg, and K. S. Trivedi, "Probabilistic Modeling of Computer System Availability," Annals of Operations Research, No. 8, March 1987, pp. 285-306.
[5] A. Goyal and A. N. Tantawi, "A Measure of Guaranteed Availability and its Numerical Evaluation," IEEE Trans. on Computers, January 1988, pp. 25-32.
[6] M. Lanus, L. Yin, and K. S. Trivedi, "Hierarchical Composition and Aggregation of State-Based Availability and Performability Models," IEEE Trans. on Reliability, March 2003, pp. 44-52.
[7] S. Ramani, S. Gokhale, and K. S. Trivedi, "SREPT: Software Reliability Estimation and Prediction Tool," Performance Evaluation, Vol. 39, 2000, pp. 37-60.
[8] A. Reibman, R. Smith, and K. S. Trivedi, "Markov and Markov Reward Model Transient Analysis: An Overview of Numerical Approaches," European Journal of Operational Research, Vol. 40, 1989, pp. 257-267.
[9] Relex Software Corporation, "Relex Markov," http://www.relexsoftware.com/products/markov.asp, Sept. 2002.
[10] G. Rubino and B. Sericola, "Interval Availability Analysis Using Denumerable Markov Processes: Application to Multiprocessor Subject to Breakdowns and Repair," IEEE Trans. on Computers, February 1995, pp. 286-291.
[11] R. A. Sahner, K. S. Trivedi, and A. Puliafito, Performance and Reliability Analysis of Computer Systems - An Example-Based Approach Using the SHARPE Software Package, Kluwer Academic Publishers, 1996.
[12] W. H. Sanders, W. D. Obal II, M. A. Qureshi, and F. K. Widjanarko, "The UltraSAN Modeling Environment," Performance Evaluation, Oct./Nov. 1995, pp. 89-115.
[13] B. Sericola, "Availability Analysis of Repairable Computer Systems and Stationarity Detection," IEEE Trans. on Computers, November 1999, pp. 1166-1172.
[14] D. Tang, M. Hecht, J. Miller, and J. Handal, "MEADEP - A Dependability Evaluation Tool for Engineers," IEEE Trans. on Reliability, Dec. 1998, pp. 443-450.
[15] D. Tang, J. Zhu, and R. Andrada, "Automatic Generation of Availability Models in RAScad," Proceedings of the International Conference on Dependable Systems and Networks, 2002, pp. 488-492.
[16] K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 2nd Edition, John Wiley & Sons, 2002.