Survey of Software Tools for Evaluating Reliability, Availability, and Serviceability

ALLEN M. JOHNSON, JR.
IBM, Advanced Engineering Systems, Austin, Texas 78758

MIROSLAW MALEK
Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, Texas 78712

In computer design, it is essential to know the effectiveness of different design options in improving performance and dependability. Various software tools have been created to evaluate these parameters, applying both analytic and simulation techniques, and this paper reviews those related primarily to reliability, availability, and serviceability. The purpose, type of models used, type of systems modeled, inputs, and outputs are given for each package. Examples of some of the key modeling elements such as Markov chains, fault trees, and Petri nets are discussed. The information is compiled to facilitate recognition of similarities and differences between various models and tools and can be used to aid in selecting models and tools for a particular application or designing tools for future needs. Tools included in the evaluation are CARE-III, ARIES-82, SAVE, MARK1, HARP, SHARPE, GRAMP, SURF, SURE, ASSIST, METASAN, METFAC, ARM, and SUPER. Modeling tools, such as REL70, RELCOMP, CARE, CARSRA, and CAST, that were forerunners to some of the current tools are noted for their contributions. Modeling elements that have gained widespread use for general systems, as well as fault-tolerant systems, are included. Tools capable of modeling both repairable and nonrepairable systems, accepting constant or time-varying failure rates, and predicting reliability, availability, and serviceability parameters are surveyed.

Categories and Subject Descriptors: C.4 [Performance of Systems]: Measurement techniques, Modeling techniques, Reliability, availability, and serviceability; I.6.3 [Simulation and Modeling]: Applications

General Terms: Measurement, Reliability

Additional Key Words and Phrases: Dependability, failure rate, fault tolerance, fault tree, maintainability, Markov model, Petri nets, repair

INTRODUCTION

With the increasing complexity of systems and the importance of performing their intended functions correctly and without interruption, the need for reliability models has intensified. Whether it is a military, commercial, or university environment, the performance parameters of reliability, availability, and serviceability (RAS) are crucial to the viability of a project. The major life-cycle costs in time and money are related to these performance parameters. Therefore, being able to model these parameters accurately and use this information to make design decisions becomes crucial to the ultimate success of a project.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1988 ACM 0360-0300/88/1200-0227 $01.50


CONTENTS

INTRODUCTION
1. BASIC MODELING ELEMENTS AND TECHNIQUES
   1.1 Failure Rate Models
   1.2 Service Cost and Repair Time Model
   1.3 Reliability Graph
   1.4 Fault Tree
   1.5 Markov Model
   1.6 Extended Stochastic Petri Nets (a Model for Simulation)
   1.7 Simulation
2. RAS TOOLS FOR COMPUTER SYSTEMS
   2.1 Characteristics of the Tools
   2.2 RAS Tools
3. CONCLUSIONS AND FUTURE DIRECTIONS
ACKNOWLEDGMENTS
REFERENCES

These parameters are defined as follows. Formally, reliability is the conditional probability at a given confidence level that a system will perform its intended function properly without failure and satisfy specified performance requirements during a given time interval [0, t] when used in the manner and for the purpose intended while operating under the specified application and operation environment stress levels. Instantaneous availability, A(t), is the probability that a system is performing properly at time t and is equal to reliability for nonrepairable systems. Steady-state availability is the probability that a system will be operational at any random point of time and is expressed as the expected fraction of time a system is operational during the period it is required to be operational. Serviceability has to do with the aspects of a system design contributing to ease of diagnosis and repair (i.e., the maintainability of the system). "Maintainability is the probability of successfully performing and completing a specified corrective maintenance action within a prescribed period of time at a desired confidence level with specified manpower, skill levels, test equipment, technical data, operating and maintenance documentation, and maintenance support organizations and facilities, and under specific environmental conditions" as defined by Kececioglu [1986].
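To make the steady-state definition concrete, a repairable system's long-run availability is commonly computed from its mean time to failure (MTTF) and mean time to repair (MTTR) as A = MTTF/(MTTF + MTTR), a standard relationship that follows directly from the "expected fraction of time operational" definition above. A minimal sketch follows; the hour figures are illustrative only, not values from any tool surveyed here.

import sys

# Steady-state availability from MTTF and MTTR (illustrative values).
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """A = expected fraction of time the system is operational."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical system: fails on average every 10,000 hours and takes
# 4 hours to repair.
print(f"A = {steady_state_availability(10_000.0, 4.0):.6f}")  # ~0.999600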

This paper discusses the basic models used in RAS evaluation tools and surveys many of the RAS tools developed by universities and/or industry for both military and commercial use to provide assessments of repairable and nonrepairable systems. Many are designed specifically with fault-tolerant systems in mind. Because of the complexity, no single tool is able to handle all aspects of the problem, but most of them perform well within the constraints of their design. Evaluation tools are evolving that use multiple types of models to solve the problem and maintain databases of information that do not have to be recalculated. This paper intends to capture, in a succinct manner, the essence, origin, purpose, application, and operational characteristics of these tools. Since a tool is only as good as its input, considerable effort in any serious project is spent making the prediction of the input parameters coincide with the environmental condition of the system when it is put into operation. Even the most accurate prediction can miss the mark if the actual environmental conditions have deviated from the assumed environment. To interpret the environment and the system design accurately requires a considerable amount of technical knowledge in the area of reliability prediction. A lack of understanding in this technical area, coupled with the lack of consideration of design failures, can lead to a reliability prediction missing reality by a wide margin. A vital input for any model is the behavior of the various components, and a valuable output is the translation of the system RAS parameters into time and money. This translation to time and money is important in both the commercial and military environment, since a lack of this knowledge can be disastrous to a company's financial health, taxpayers' money, and human safety. Thus, included with the basic modeling elements in Section 1 are the primary failure rate model and a service cost and repair model for evaluating life-cycle cost. Many of the currently available software tools using these basic modeling elements will be introduced in Section 2 for repairable and nonrepairable systems, followed by conclusions in Section 3.

1. BASIC MODELING ELEMENTS AND TECHNIQUES

A model is an abstraction of the various assumptions about a system's behavior. These assumptions represent mathematical or logical relationships, which, if they are simple enough, lead to analytic solutions. As more reality is put into the models, however, analytical solutions become intractable and simulation emerges as a reasonable way to determine the operational characteristics of the model. Simulation involves evaluating a system numerically over some relevant period of time and using the data gathered to characterize the model's behavior. When more information is needed than is provided by either analytical evaluation or simulation, system testing must be done. All of the tools discussed in this paper use at least three of the modeling elements mentioned in this section to provide analytic solutions. Some of the tools, such as GRAMS [Fleming et al. 1984] and HARP [Trivedi et al. 1984], also provide the capability for doing simulation. A reasonable strategy would be to use the simplest model that leads to reasonable analytic solutions and to supplement this with simulation when adequate analytic solutions are not available or prohibitively expensive due to complexity. This follows the consistent theme found in the use of these modeling elements, which is to simplify a complex system environment that presents an intractable problem to the point at which key assumptions enable the problem to be solved. The modeling elements and techniques discussed in this section are as follows:

(1) For system input:
    (a) Failure rate
    (b) Repair rate
    (c) Costs: parts, service, time

(2) For system behavior:
    Combinatorial:
    (a) Reliability graph
    (b) Fault tree
    Noncombinatorial:
    (a) Markov chain or process
    (b) Petri nets


Combinatorial models enumerate all the combinations of failed and working elements to represent either the success or failure of a system. Noncombinatorial models do not always enumerate all combinations. Siewiorek and Swarz [1982] provide a comprehensive discussion of many of these modeling elements. The objective of the modeling activity is to obtain output that reflects in some desired form the reliability, availability, or serviceability characteristics of the system. Additional modeling elements are required to obtain information such as service cost and repair time. The modeling elements discussed here typically focus on system behavior at the highest level where, for example, a failing element is an entire processor. When the modeling activity goes to lower levels, the state space explosion begins to make use of these system behavior modeling elements prohibitive. Either very specific analytic models are developed, such as those discussed by Rutledge [1985] for memory with error-correcting codes (ECC), or simulation is done to capture the desired level of detail over time. If these models are being used to determine which design alternative is best with respect to various RAS parameters, then modeling at the lowest level at which replacements for repair are made in the system may be required. Components at this level are typically the replaceable units for either hardware or software. These components are assumed to fail independently of other components. For simplicity, not necessarily reality, these components are considered to be in either an operational or failed state, and a state graph can be generated where each state is a unique combination of the operational and failed states of the components. (An example is Figure 6, in which the only components considered are processors and a voter.) Analytical models tend to be sensitive to minor modifications to the assumptions on which the models are based, which can either render the models "useless" or lead to such an increase in complexity that calculations become intractable. One way to overcome such problems is to use Monte Carlo simulation with these models (which is discussed in Section 1.7).


1.1 Failure Rate Models

Although many factors contribute to device failures, the primary assumption is that the failure rate of semiconductor devices is a function of temperature. Thus, the model for failure rates can be represented by the Arrhenius relationship with temperature, as shown in the following equation [Department of Defense 1986; Siewiorek and Swarz 1982]:

λ = λ_R exp[(E/K)(1/T_R − 1/T_J)],

where
  λ = component failure rate,
  λ_R = failure rate at the reference temperature,
  T_J = operating (junction) temperature in degrees Kelvin,
  T_R = reference temperature in degrees Kelvin,
  E = activation energy in electron volts (eV) of the dominant failure mechanism,
  K = 8.63 × 10⁻⁵ eV/°K (Boltzmann's constant).

The objective is to derive either a constant failure rate or an average failure rate λ that can be used in the reliability equation R(t) = e^(−λt). This constant failure rate (exponential) model is assumed in many of the tools discussed in this paper. Unfortunately, reality tends to deviate from the exponential model, and extreme caution must be exercised in applying the results of this model to a real system. Only tools that employ semi-Markov models, nonhomogeneous Markov models, or Petri nets have the capability to eliminate this assumption and allow other distributions such as Weibull. Deviation from the exponential model leads to difficult analytic solutions, and usually simulation is performed to reach a solution. The actual failure rate will typically vary with time and may be more appropriately modeled by the Weibull distribution. Confidence in a particular model for a technology emerges only after extensive testing and usage in the field.

The most well-known source of failure rates is MIL-HDBK 217E [Department of Defense 1986]. This is the standard for military contractors.
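The sketch below applies the Arrhenius scaling just described. The reference failure rate, temperatures, and activation energy are hypothetical values chosen for illustration, not handbook data.

import math

K_BOLTZMANN = 8.63e-5  # eV/K, as used in the text

def arrhenius_failure_rate(lambda_ref: float, t_junction_k: float,
                           t_ref_k: float, e_activation_ev: float) -> float:
    """Scale a reference failure rate to a new junction temperature."""
    return lambda_ref * math.exp(
        (e_activation_ev / K_BOLTZMANN) * (1.0 / t_ref_k - 1.0 / t_junction_k))

# Hypothetical part: 0.05 failures/10^6 hr at 298 K, 0.4 eV mechanism.
lam = arrhenius_failure_rate(0.05, t_junction_k=348.0, t_ref_k=298.0,
                             e_activation_ev=0.4)
print(f"failure rate at 348 K: {lam:.3f} failures/10^6 hr")  # ~0.467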

In deriving an average failure rate, many factors other than temperature are taken into account in the equation, as shown below:

λ_p = Π_Q Π_L [C_1 Π_T Π_V + (C_2 + C_3) Π_E] failures/10⁶ hours,

where
  Π_Q = the quality factor, which accounts for the effects of different quality levels,
  Π_L = the learning factor, which reflects the improvement in people's proficiency with experience and is related to the calendar time that a part is in production,
  Π_T = the temperature factor (Arrhenius relationship) = 0.1 exp[(−E/K)(1/T_J − 1/T_R)], where T_R = 298°K and T_J is the worst-case junction temperature in degrees Kelvin,
  Π_V = the voltage derating stress factor,
  Π_E = the environment (application) factor,
  C_1 and C_2 = the circuit complexity failure rates based on chip complexity,
  C_3 = the package complexity failure rate.

Although the part stress analysis prediction equation given above provides a great deal of insight into what may cause microelectronic components to fail, it also provides many opportunities to make mistakes when used. A simpler alternative is provided in Part 2 of MIL-HDBK 217E, which is called the "Parts Count Reliability Prediction" method. It utilizes the following much simpler equation:

λ_EQUIP = Σ_{i=1}^{n} N_i (λ_G Π_Q)_i

for a given equipment environment, where
  λ_EQUIP = total equipment failure rate (failures/10⁶ hr),
  λ_G = generic failure rate for the ith generic part (failures/10⁶ hr),
  Π_Q = quality factor for the ith generic part,
  N_i = quantity of the ith generic part,
  n = number of different generic part categories.
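As an illustration of the parts count method, the sketch below sums N_i(λ_G Π_Q)_i over a small parts list. The generic failure rates and quality factors are placeholders, not values taken from MIL-HDBK 217E.

# Parts count prediction: lambda_EQUIP = sum_i N_i * (lambda_G * pi_Q)_i,
# in failures per 10^6 hours, for one fixed equipment environment.
# The part list below is hypothetical.
parts = [
    # (quantity N_i, generic rate lambda_G, quality factor pi_Q)
    (12, 0.060, 1.0),   # e.g., logic ICs
    (4,  0.150, 2.0),   # e.g., memory ICs
    (60, 0.002, 1.0),   # e.g., resistors
]

lambda_equip = sum(n * lam_g * pi_q for n, lam_g, pi_q in parts)
print(f"lambda_EQUIP = {lambda_equip:.3f} failures/10^6 hr")  # 2.040
print(f"MTBF ~ {1e6 / lambda_equip:,.0f} hr")                 # ~490,196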

Survey of Software Tools Because of the extremely conservative predictions derived from MIL-HDBK 217, many commercial and military computer companies develop their own models and tools in order to assess the technology they are using accurately [Spencer 19861. This allows them to introduce time dependency and additional updated knowledge they have about a technology. For example, Bellcore [Healy 19861 has developed a method to obtain failure rates for integrated circuits (ICs) for varying complexity, technology, and packaging. A failure rate for specific ICs is a function of the number of bits, the number of transistors or the number of gates, and a package type. It is expressed in FITS (failures in 10’ hours). For example, for a dynamic RAM in MOS technology in a nonhermetic package, a failure rate X in FITS is x = 45(B + o.25)“.60,

where B is the memory size in kilobits. Spencer showed the extent of MILHDBK 217’s inaccuracy when he compared the output of Bellcore’s reliability prediction procedure (RPP) and five other commercially available reliability prediction tools to that of MIL-HDBK 217. MILHDBK 217 consistently provided higher failure rates than the other six models and in some cases was as much as 6000 times higher than one of the other models. The other six models typically provided predictions that varied from one another by no more than a factor of 5, but there were instances where variations by a factor of almost 1000 were found. Failure rate predictions that are substantially higher than the actual failure rates can lead to unnecessary expenditures to meet reliability objectives. Predictions that are far lower than the actual failure rates can have disastrous consequences for the user or the party responsible for maintenance. Accurate predictions are worth the investment because they lead to correct business and technical decisions. An accurate prediction requires more than providing correct mean values for the life of the product. Failure rates should also be accurate for the observed time units throughout the life and thus should be time

l

231

varying. A frequently used technique for representing failure rates that vary with time is to represent them as piecewise constant failure rates, where failure rates are given by time intervals such as one month. For commercial companies that are building thousands of units, accuracy is important. Inaccuracy can lead to either cancellation of projects that could be successful, which means lost opportunity, or a significant overrun of maintenance cost, which can be unprofitable. Whereas absolute reliability assessments are essential for evaluating whether a design meets the cost and reliability requirements, doing a relative reliability assessment is an important evaluation approach in making design decisions involving several alternative designs. Frequently, during the early design phase, an accurate absolute reliability assessment cannot be performed because the necessary failure rate information is not available for some of the technologies being used or an accurate count and identification of components has not been done. Doing a relative reliability assessment requires consistency in assumptions, data, and approach used but does not require accuracy. Thus, a relative assessment requires much less time and resources to perform and quite appropriately is often the preferred approach until the design solidifies and more is known about component failure rates. Finally, there are factors at the system level not included in failure rate models, such as MIL-HDBK 217E, that can have a significant affect on the system reliability. Such factors include the repair and replacement strategy, failures induced by users, production variations, deficiencies in the design, and environmental aspects not introduced in the part level formula such as vibration, humidity, and switching effects. 1.2 Service Cost and Repair Time Model
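A short sketch of the Bellcore DRAM formula quoted above, evaluated for an assumed 256-kilobit part; the conversion between FITs and failures/10⁶ hours follows from the definitions.

# Bellcore-style DRAM failure rate (FITs = failures in 10^9 hours),
# from the formula quoted above for a MOS dynamic RAM in a nonhermetic
# package; the 256-kilobit size is just an example.
def dram_fits(kilobits: float) -> float:
    return 45.0 * (kilobits + 0.25) ** 0.60

fits = dram_fits(256.0)
print(f"{fits:.0f} FITs = {fits * 1e-3:.3f} failures/10^6 hr")  # ~1254 FITs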

1.2 Service Cost and Repair Time Model

When a system is designed for repair, there are three important parameters to be estimated to determine service cost: the number of repairs that will be made, the hours spent servicing systems, and the cost of parts required for maintenance. These parameters are used to calculate the service


cost from the vendor's (the company that sells and services the system) point of view. An additional factor crucial to the customer that is not discussed in this article is the cost of unavailability, which is application dependent but should be included in the total life-cycle cost. This is true in both government and commercial environments. Because the emphasis may be different in each particular environment, the list of parameters can be expanded beyond the three mentioned here. For example, in the commercial environment an important input parameter is how the inventory varies with time. The repair actions (RAs) in a given month (a month is chosen as a convenient time unit) for a system can be represented by

RA_month(i) = A(i)λ(i) + G(i),

where
  i = current month,
  A(i) = a composite of various application factors,
  λ(i) = the time-varying failure rate,
  G(i) = the number of times an adjustment is made to correct a failure condition plus the number of times no problem is located (possibly because the fault became latent).

The total repair actions for the life of a system is given by

RA_life = Σ_{i=1}^{I} RA_month(i),

where I is the number of months for the life of the system. The mean repair actions per month for the life of the system is

RA_total/month = (1/I) RA_life.

There are some similarities between calculating the failure rate and the RAs, and under certain conditions they could be the same. It is more important to focus on the conditions that can make them different. A(i) is an uplift to reflect imperfect applications resulting from situations such as abnormal operating conditions, to reflect the impact of defects escaping the testing process in manufacturing assembly, and to reflect the occurrence of intermittent and marginal faults.

To calculate the hours spent servicing a given type of system, there are several items that must be derived first. One is the average duration of a repair action (DRA_sys) for a component in the system. For systems with no fault tolerance, this is equivalent to the mean time to repair (MTTR_sys). For complex systems with fault tolerance, the MTTR_sys can be derived from a Markov model or from simulation. For a triple modular redundancy system, assuming a repair person for every failed processor, MTTR can be calculated from the general equation for a K-out-of-N system with K = 2 and N = 3:

MTTR_sys = MTTR_proc/(N − K + 1) = MTTR_proc/(3 − 2 + 1) = MTTR_proc/2,

where N is the total number of processors and K is the minimum number of processors that must be working for the system to be operational.

To calculate the service cost for the system, the DRA_sys must be used rather than MTTR_sys. The system is divided into replaceable units (RUs), which may be referred to in industry by such names as field replaceable units (FRUs in a commercial environment) or line replaceable units (LRUs in a military environment). The mean DRA for the system in month i is the failure-rate-weighted mean of the RU repair times,

DRA(i)_sys = [Σ_{j=1}^{P} MTTR(i)_j λ(i)_j] / [Σ_{j=1}^{P} λ(i)_j],

where P is the total number of RUs in the system (if a given RU appears n times in the system, it is counted n times) and MTTR(i)_j is the MTTR for the jth RU in the ith month. The mean DRA for the life of the system is

DRA_life = (1/I) Σ_{i=1}^{I} DRA(i)_sys.
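The sketch below computes the monthly repair actions, the K-out-of-N MTTR, and the mean repair duration described above. All input values are hypothetical, and the weighting in dra_sys reflects the failure-rate-weighted reconstruction of DRA(i)_sys given above.

# Sketch of the repair-action and repair-time quantities above.
# All input values are hypothetical.

def repair_actions_per_month(a_i: float, lambda_i: float, g_i: float) -> float:
    """RA_month(i) = A(i)*lambda(i) + G(i)."""
    return a_i * lambda_i + g_i

def mttr_k_of_n(mttr_unit: float, n: int, k: int) -> float:
    """MTTR of a K-out-of-N system, one repair person per failed unit."""
    return mttr_unit / (n - k + 1)

def dra_sys(mttrs, rates):
    """Failure-rate-weighted mean repair duration over the RUs."""
    return sum(m * r for m, r in zip(mttrs, rates)) / sum(rates)

print(repair_actions_per_month(a_i=1.2, lambda_i=0.004, g_i=0.001))  # 0.0058
print(mttr_k_of_n(mttr_unit=4.0, n=3, k=2))                          # TMR: 2.0
print(dra_sys([2.0, 6.0], [0.003, 0.001]))                           # 3.0 hours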

Survey of Software Tools DRA(i)sus can be a function of time due to changes in the experience and skill of the service personnel. The more reliable a system, the slower service personnel gain experience, and in many cases skill deteriorates due to the service personnel forgetting their training from lack of use. Therefore, the mean time to repair in month i is equal to MTTR(i)j = TAti) +

TC

+ TB(Pdiag + (1 - Pdiag)Udiag)

(i 1(pisol + (1 - Pisol) Usymp)

+ To(i)

TE = time required to tics to verify that been fixed. TF = time required to the administrative sociated with the

l

233

run the diagnosthe problem has clean up and do paperwork asrepair action.

Similar models for MTTR have been presented in the literature, such as the one in Jager and Krause [1987]. The service hours for the j th RU in the ith month is given by T(i),,,+

= MTTR(i)jA

+ TE + TF,

-I- T(ihvrF-j

(i)j h(i)j + T(iJm-j,

where TA (i) = average time to talk to customer and obtain the fault symptoms, identification of failing unit, and any other preparation required in the ith month. TB = time required to run diagnostics or analyze information logged at the time of the error to determine the fault symptoms. Pdiag= probability that the diagnostics or logout analysis will be effective in determining the fault symptoms. Udiag= application factor to obtain the additional time required when the diagnostics or logout analysis are not effective in isolating the problem to a single RU. Tc(i) = time required to use the maintenance analysis procedures to isolate the failing RU in the ith month. Pisol = probability that the error symptoms uniquely identify the failing RU (dependent upon the maintenance strategy applied). u symp= the average additional time required to account for when the error symptoms do not uniquely identify a single RU and the other RUs have a greater probability of being replaced first. TI,(i) = time required to remove and replace or make an adjustment to a RU in the ith month.

where TNTF = additional time required when there is no trouble found and no repair made. T,,, = additional time required for service not related to system failure. This reflects user friendliness. The third parameter to be calculated is the maintenance parts cost. The maintenance parts cost in the ith month for the jth part is C(i)Mp-j

= A(i)X(i)B(i)RRM(C(i)R”_j - S(ihj), where B(~)RRM = ratio of removed RUs to total failing RUs in the ith month. This factor is determined by the defective parts in the field stock, isolation capability of the system, its diagnostics, and maintenance procedures [Bossen and Hsiao 19811, reseating of cards and cables, and requirements for adjustments. Only the requirements for adjustments can cause this factor to be less than 1. C(i),“-j = replacement cost of the jth RU in the ith month. S(i)Ru-j = salvage value for the j th RU being replaced in the i th month. ACM Computing

Surveys, Vol. 20, No. 4, December 1988
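A sketch of the MTTR(i)_j factor model above, with illustrative times in hours; none of these values come from the paper.

# MTTR(i)_j from the maintenance-time factors above; numbers are
# illustrative, not field data (times in hours).
def mttr_ru(t_a, t_b, p_diag, u_diag, t_c, p_isol, u_symp, t_d, t_e, t_f):
    return (t_a
            + t_b * (p_diag + (1.0 - p_diag) * u_diag)
            + t_c * (p_isol + (1.0 - p_isol) * u_symp)
            + t_d + t_e + t_f)

mttr = mttr_ru(t_a=0.3, t_b=0.5, p_diag=0.9, u_diag=3.0,
               t_c=0.4, p_isol=0.8, u_symp=2.5, t_d=0.5, t_e=0.3, t_f=0.2)
print(f"MTTR(i)_j = {mttr:.2f} hr")  # 2.42 hr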


The total maintenance parts cost for the life of a system is

C_MP−j = Σ_{i=1}^{I} A(i)λ(i)B(i)_RRM[C(i)_RU−j − S(i)_RU−j].

The cost of the service time for the jth RU in the ith month is

C(i)_ST−j = T(i)_serv−j C_LR,

where C_LR is the hourly labor rate for service personnel.

Figure 1. Graph representation for three parallel processors.

The total service cost can now be computed using the following equation:

SC_total = Σ_{i=1}^{I} Σ_{j=1}^{P} [C(i)_ST−j + C(i)_MP−j] + C_travel,

where C_travel is the cost associated with service personnel traveling to perform maintenance.

1.3 Reliability Graph

The reliability graph is perhaps the simplest model for modeling system behavior. The focus of this combinatorial model is primarily limited to determining system reliability. Use of graphs for reliability modeling is divided into two categories: physical interconnection modeling and system failure dependency modeling (success graphs). The physical interconnection model uses a probabilistic reliability graph where nodes represent system components and edges correspond to communication links. The probability of failure can be assigned to nodes or edges or both. This model is commonly used in solving communication network reliability problems. In this environment, the typical theoretical questions are, What is the probability for reliable transmission of a message from each node to all others, from a given node to all others, and from one specific node to another (two-terminal reliability)? Since solving the general model is exponentially complex, several simplifying assumptions are frequently used. Examples include assuming that the node failure probability is zero (i.e., nodes do not fail) and that edges may fail

independently with the same probability. Even with these assumptions, the problem remains exponentially complex because, in general, solutions require enumeration of all paths between two terminals for even the simplest two-terminal reliability problem. Several simplification methods for series-parallel or parallel-series systems exist [Satyanarayana and Hagstrom 1981; Satyanarayana and Wood 1985] that are beyond the scope of this paper. In general, for a graph G with edge failure probability equal to p and m edges, the reliability (the probability that there is a connection between two nodes) is

R(G, p) = 1 − Σ_{i=1}^{m} N_i p^i (1 − p)^(m−i),

where N_i is the number of edge disconnecting sets of size i. Research in this area is extensive, but most of it focuses on determining bounds or solving special cases [Ball 1979; Boesch 1986; Colbourn 1986; Provan and Ball 1984; Satyanarayana and Wood 1985; Wilkov 1972]. When the reliability model reflects the physical interconnection, the reliability problem to be solved must be described along with the graph. This is the approach commonly used in solving communication network reliability problems, as in the case of Wilkov [1972] and Satyanarayana and Hagstrom [1981]. For example, in Figure 1, there is the graph representation of the physical interconnection of three processors represented by edges A, B, and C operating in parallel. If they are all executing the same instruction but operating on

Survey of Software Tools different data, a failure in any processor will cause a system failure. The reliability question to be asked is, What is the probability of three processors operating without failure? The resulting reliability would be Rsys = RARBRC. If the processors are executing both the same instruction and data, then the question might be, What is the probability of at least one processor operating without failure? The resulting reliability is R,,, = RA + Rg + RC - RARe - &RC - ReRc + RARsRc. A software tool that uses this type of reliability graph is ADVISER [Kini and Siewiorek 19821. A useful feature of ADVISER is its ability to construct reliability graphs from the PMS structure of the system and a statement of the operational requirements for the structure. This minimizes the requirement for the user to have a solid background in reliability analysis and reduces the possibility of human error. A disadvantage of any tool based solely on the reliability graph model such as ADVISER is its difficulty in dealing with such issues as (1) the consequences of policy decisions regarding manner of use and service, (2) transient and intermittent failures, (3) statistical dependence of component failures, and (4) prohibitive computational complexity for an arbitrary graph model. When the reliability graph is used to model system failure dependency, it is commonly known as the reliability block diagram (RBD) or success diagram. Each path through the diagram represents one set of components that have an operational dependency. Any given component can be in more than one path depending on the system design. Thus, the RBD is not a direct mapping of the physical structure and is intended only to characterize the system’s reliability. A tutorial for constructing RBDs can be found in MIL-STD-756B [Department of Defense 19811. If the system reliability can be modeled as a series/parallel network, then solutions can be derived very quickly from the reliability diagram or fault tree (described in the next section) as in the case of the performance and reliability tool SPADE [Sahner and Trivedi 19851. If the system must be defined by a nonseries/parallel

Figure 2. Nonseries/parallel reliability graph.

graph, then solutions for the reliability graph can become intractable as the size of the network grows. An example of a reliability graph that is not series/parallel is shown in Figure 2. Bounds and partial solutions for some specific networks, multistage interconnection networks, which are a special class of nonseries/nonparallel networks, are given in Cherkassky and Malek [1987], Cherkassky et al. [1984], and Johnson et al. [1988].

The next example is a triple modular redundancy (TMR) system, which is used throughout the paper. This TMR system consists of three processors and a voter. The system is operational if at least two of the three processors and a voter are operational. When fewer than two processors and a voter are operational, the system is considered failed. Operational processors are determined by a voter, which is assumed to be nonfailing in this example. A processor that disagrees with the other two is considered faulty and is no longer allowed to participate. This system is shown in Figure 3a. The RBD for this system is shown in Figure 4. Observe that the structure of the RBD (Figure 4), which shows reliability dependencies (this structure is popularly used in the literature), is quite different from the physical structure represented in Figure 3a and its graphical representation found in Figure 3b.

For comparison purposes, a simple derivation is performed. Let X_1 = AB, X_2 = AC, and X_3 = BC for the RBD in Figure 4. Then the reliability for this TMR system can be written as the probability of the system successfully being in an operational state:

R_TMR = P(system works) = P(X_1 ∪ X_2 ∪ X_3) P(voter works).


Figure 3. TMR system with three processors (a) and its reliability graph (b).

Figure 4. Reliability diagram for TMR system (two out of three).

Thus, the probability of the system working is the union of the probability of the two processors in each path both being operational, times the probability P(v) that the voter works. This probability can be rewritten as follows:

P(system works) = [P(X_1) + P(X_2) + P(X_3) − P(X_1X_2) − P(X_2X_3) − P(X_1X_3) + P(X_1X_2X_3)]P(v).

The probability of a path being operational can be stated in terms of the reliability of each processor:

P(X_1) = R_A R_B, P(X_2) = R_A R_C, and P(X_3) = R_B R_C,

where R_A, R_B, and R_C are reliabilities of processors A, B, and C, respectively. The probability of the voter being operational can likewise be stated: P(v) = R_v.

The reliability for the TMR system can then be stated as

R_TMR = (R_A R_B + R_A R_C + R_B R_C − R_A R_B R_A R_C − R_A R_C R_B R_C − R_A R_B R_B R_C + R_A R_B R_A R_C R_B R_C)R_v.

Because elements appear in more than one path, there is a logical dependence between paths. Applying the Boolean algebra theorem R_Y R_Y = R_Y reduces the equation to

R_TMR = (R_A R_B + R_A R_C + R_B R_C − 2R_A R_B R_C)R_v.

If all the processors have the same reliability, then

R_TMR = (3R² − 2R³)R_v.

Note that the same result can be obtained from the reliability graph in Figure 3b:

R_TMR = [R³ + 3R²(1 − R)]R_v = (3R² − 2R³)R_v,

where R³ represents the probability that all three processors operate correctly and 3R²(1 − R) corresponds to the three cases of a system with a single faulty processor where two out of three operate correctly. Not surprisingly,

the result is the same as for the RBD in Figure 4.

The usefulness of both the reliability graph and the RBD is limited to determining a numerical value of reliability in an environment in which a mission time is given and there is no repair. Mission time is the length of time the equipment is required to be operational to do its job. For a rocket booster, the mission time might be a matter of seconds, whereas the satellite launched by this rocket may have a mission time of several years. If more information is desired, other appropriate representations should be used. For example, if an evaluation of the failure mode effects is needed, fault trees are a more appropriate representation; they are discussed in the next section.
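The TMR result can also be checked by brute force: enumerate all 2⁴ combinations of processor and voter states and sum the probabilities of the operational ones. A minimal sketch with arbitrary reliability values follows.

from itertools import product

# Exhaustive check of the TMR formula: enumerate every combination of
# processor/voter states and sum the probability of the operational ones.
def tmr_reliability(ra, rb, rc, rv):
    total = 0.0
    for a, b, c, v in product([0, 1], repeat=4):
        p = ((ra if a else 1 - ra) * (rb if b else 1 - rb)
             * (rc if c else 1 - rc) * (rv if v else 1 - rv))
        if v and (a + b + c) >= 2:   # voter up and at least 2 processors up
            total += p
    return total

R, Rv = 0.95, 0.999
closed_form = (3 * R**2 - 2 * R**3) * Rv
print(tmr_reliability(R, R, R, Rv), closed_form)  # both ~0.9918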

1.4 Fault Tree

A fault tree description like the RBD serves as input to many of the software tools for evaluating RAS. The purpose of a fault tree is to model the conditions that result in a system (subsystem) failure. This requires the enumeration of the fault and/or normal events (hence, this is a combinatorial model) that result in a failure (nonoperational state) of the system under consideration. Proper construction is heavily dependent upon the knowledge and understanding the analyst has of the system. There are numerous references in the literature discussing fault tree construction and analysis, including Arsenault and Roberts [1980], Barlow and Lambert [1975], Dhillon and Singh [1978], and Shooman [1970]. Fault tree analysis was shown to be equivalent to the reliability block diagram [Shooman 1970]. Thus, both forms have been used as input descriptions for reliability modeling. CARE III [Stiffler et al. 1979] and SHARPE [Sahner and Trivedi 1986] are examples of tools that make use of these models.

A fault tree for the TMR system is shown in Figure 5. The reliability for the system is derived by starting with the Boolean expression for the probability of system failure in Figure 5, which is the unreliability of the TMR system:

R̄_TMR = (R̄_A + R̄_B)(R̄_A + R̄_C)(R̄_B + R̄_C) + R̄_v,

and R_TMR is the Boolean complement of this expression. Because the leaves of the tree are not independent, this equation will yield an incorrect result if substitutions are made. Thus, further Boolean reductions are necessary to obtain the correct result:

R_TMR = (R_A R_B R̄_C + R_A R̄_B R_C + R̄_A R_B R_C + R_A R_B R_C)R_v.

Since R̄ = 1 − R, then

R_TMR = [R_A R_B + R_A(1 − R_B)R_C + (1 − R_A)R_B R_C]R_v = (R_A R_B + R_A R_C + R_B R_C − 2R_A R_B R_C)R_v;

for R_A = R_B = R_C = R,

R_TMR = (3R² − 2R³)R_v.

This is the same reliability equation that was derived from the reliability graph and the RBD for the TMR system.

An advantage of fault trees is that very large systems can easily be described. A disadvantage of the fault tree is its inability to handle complex behavior scenarios such as interdependence of failures among various system elements or repairable systems that do not have an independent repair crew for each component. Examples of interdependencies include the failure of a cooling fan accelerating the failure of the semiconductor components and a CPU hardware failure corrupting an operating system such that it crashes. Observe that changing the order of failures will not necessarily bring about the same results (i.e., a semiconductor failure will not cause a cooling fan to fail unless it is part of the cooling fan, nor will an operating system failure be likely to cause a hardware

Figure 5. Fault tree for TMR system.

failure). To handle this type of behavior requires the use of Markov models, which are discussed next.

1.5 Markov Model

A system has some degree of fault tolerance if there exist component failures that do not cause a system failure. Instead, a component failure usually results in the system operating without certain resources available or in a degraded mode. Thus, in nearly all instances, it takes some combination of multiple component failures to cause a system failure. As failures occur, the system's next operating state is determined completely by its current state without regard to how it got to the current state. This characteristic makes fault-tolerant systems ideal subjects to be modeled by Markov chains [Trivedi 1982]. Furthermore, Markov chains are capable of modeling sequence-dependent failures and repairable systems (allowing assumptions that there is not a repair person for every failing component), which is not possible with a fault tree. Also, if the system structure is dynamic rather than static, this can be modeled by Markov chains but not by a fault tree or RBD. An example of a dynamic redundant structure is a system with standby spares. Such a system can be dynamically reconfigured to bring a spare on line and decouple the failing component.

A collection of random variables {X(t) | t ∈ T} defines a stochastic process.

Let X(t) represent the state of the system at some observed time t, which is a member of T, where T can consist of either discrete times or an interval of real values. There is a finite conditional probability P(X(t)) of being in each possible state X(t). The state space of this stochastic process defines a finite-state Markov chain (process) if the following conditions hold:

(1) There is a finite number of states.
(2) The conditional probability of being in any future state is determined only by the present state and is thereby independent of any past states.
(3) The probability of a transition from one state to another does not change with time.
(4) A set of initial probabilities is defined for all states.

Such a finite-state Markov chain can be used to model a fault-tolerant system, where each state of the chain is defined by the number of components that are operational or failed. These states can be divided into two classes: operational states for the system and failed states for the system. The transitions from one state to another can be divided into three categories:

(1) Transitions representing component failure rates (for both repairable and nonrepairable systems).
(2) Transitions representing repairs being made that take the system from one system nonfailed state to another (needed for repairable systems to calculate both reliability and availability).
(3) Transitions representing a repair being made that takes the system from a system failed state (inoperative, but not destructively failed) to a system nonfailed state (needed for repairable systems to calculate availability).

Figure 6. Continuous-time time-invariant (closed) reliability model for a nonrepairable system.

If each system component is considered to be in one of two states (operational or failed), there can be 2^n system states for a system with n components. For complex systems with large numbers of components, the number of system states can grow prohibitively large; thus, a major part of any reliability modeling effort is to reduce the complexity of the problem. One approach is compression, which usually involves merging all equivalent states. For example, consider the simplified Markov model of the TMR system with no repair shown in Figure 6. In this model, all states corresponding to a system failed state have been merged. The model can be further simplified if the failure rates for A, B, and C are identical; in that case states 1, 2, and 3 can be merged into one state as shown in Figure 7. One source of rules for merging

states can be found in Shooman and Laemmel [1987]. A second technique is truncation, which involves eliminating those states that have a very low probability of being entered relative to other states. Two truncation techniques are state space truncation and sequential truncation. In state space truncation the highest probability of the states truncated must be less than the lowest probability of the remaining states. If new absorbing states are generated by the truncation, then they should be deleted, or the truncated state that generated these absorbing states should be retained as an absorbing state. This technique is used in SURF [Costes et al. 1981] and ARM [Liceaga and Siewiorek 1986]. Sequential truncation involves calculating the state probabilities every time a new state is generated and deleting states with probabilities below a predetermined value. Another approach is the functional decomposition of the system into a set of subsystems. Models that use this technique include MARK1 [Lala 1983a], GRAMP [Dolny et al. 1983], and SHARPE [Sahner and Trivedi 1986]. SHARPE provides additional state space reduction capability by using a hierarchical modeling structure


that permits a mixture of different kinds of models at different levels. Another approach for decomposing the system is behavioral decomposition. In this approach submodels are created along temporal lines by separating the behavior into fault-occurrence behavior and fault/error-handling behavior. This method works because fault/error handling is orders of magnitude faster than the frequency of fault occurrence. Models that use this technique are included in CARE III [Stiffler et al. 1979] and HARP [Trivedi et al. 1984]. After compression, truncation, and decomposition come algorithms for handling sparse matrices and efficient numerical methods for solving ordinary differential equations, such as those utilized in SAVE and GRAMP. All of these techniques for simplifying the problem are for naught unless there are ways to simplify the generation of the state space. A major disadvantage of Markov models is the difficulty of generating (describing) the chain for all but small (less than 100 states) models. An automatic Markov chain generator is required to make this model practical. ASSIST [Johnson 1986] is an example of a Markov chain generator that can be used to provide input to tools such as SURE and ARM.

Figure 7. Merged Markov model for TMR system.

Markov models can be classified by model types or model application, as shown in Figure 8. Homogeneous model refers to a model in which transition rates are constant and are called time invariant. Nonhomogeneous model refers to a model in which transition rates are not constant with respect to time and are called time variant.
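The merged model of Figure 7 can be solved numerically as a continuous-time Markov chain. The sketch below builds the generator matrix for the no-repair TMR model (the voter is ignored for brevity), computes the state probabilities with a matrix exponential, and checks the classic closed form 3e^(−2λt) − 2e^(−3λt); the failure rate is an assumed value.

import numpy as np
from scipy.linalg import expm

# Merged TMR Markov model (cf. Figure 7), no repair: state 0 = all three
# processors up, state 1 = one failed, state 2 = system failed (absorbing).
lam = 1e-4          # per-processor failure rate, failures/hr (assumed)
Q = np.array([[-3 * lam, 3 * lam, 0.0],
              [0.0, -2 * lam, 2 * lam],
              [0.0, 0.0, 0.0]])          # generator matrix

t = 1000.0                                # mission time in hours
p = np.array([1.0, 0.0, 0.0]) @ expm(Q * t)
reliability = p[0] + p[1]                 # probability of a nonfailed state
print(reliability)
print(3 * np.exp(-2 * lam * t) - 2 * np.exp(-3 * lam * t))  # closed form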

If the Markov model is continuous time, the transitions can occur at any instant in time and the transition probabilities follow a continuous distribution. If the continuous-time model is homogeneous, the transition probabilities have an exponential distribution, where the distribution function is given by

F(t) = 1 − e^(−λt)  if t ≥ 0,
F(t) = 0  otherwise,

where t represents any instant in time. If the continuous-time model is nonhomogeneous, the transition probabilities have time-dependent coefficients. If a parameter other than time is used, the model choices may be restricted to discrete, as in the case in which the parameter chosen is the number of unique test cases run for some software reliability model.

If the Markov model is a discrete-time model, the transitions can only occur at discrete intervals of time and the transition probabilities follow a discrete distribution. If the discrete-time model is homogeneous, the transition probabilities have a geometric distribution, where the distribution function is given by

F(x) = 1 − (1 − p)^(x+1)  if x ≥ 0 and x ∈ {0, 1, . . .},
F(x) = 0  otherwise,

where x represents an interval of time. If the discrete-time model is nonhomogeneous, the transition probabilities have a

distribution other than geometric, such as discrete uniform, binomial, or Poisson. If the model is time variant, the assumption that the state of all components for a Markov model is governed by a global clock is violated, since replacing the faulty component with a new component (not a hot spare) resets the time for that component to zero or some other nonglobal value. The new component represents either a spare being activated or, in the case of a repairable system, a repair being made. One solution is to make use of a simulation model. HARP [Bavuso et al. 1987] is an example of a model that overcomes this difficulty through the use of simulation. HARP uses a Markov model for the fault-occurrence level and an extended stochastic Petri net (ESPN) for the fault-coverage level. Fault coverage is the probability that the system will recover given that a fault has occurred.

Figure 8. Finite-state Markov models classified by types and applications for a nonrepairable TMR system. (The classification tree distinguishes discrete versus continuous and homogeneous versus nonhomogeneous model types, and reliability, availability, and periodically renewed system applications.)

The application of the Markov model, independent of the type of model, can be divided into two categories: nonrepairable systems and repairable systems. A nonrepairable system is not accessible for repair once it is put into operation, whereas a repairable system is continuously available for repair. The application is easily identifiable from the graph of the Markov model for the system being evaluated. Nonrepairable systems contain no cycles involving

more than one state (a node represents a state) in the graph. Thus, it is not possible to traverse the directed graph and return to a state previously visited. An example of a nonrepairable system graph is shown in Figure 6. Reliability is the main parameter that can be calculated in a nonrepairable system. Instantaneous availability equals reliability when there is no repair. One variation of a nonrepairable system is one that can be periodically renewed. In this case, the modeling must be split into two phases, an operation phase and a renewal phase. The renewal phase involves cycles in the graph to represent the restoration to the initial state from whatever state existed at the time of renewal. Repairable systems are easily recognized from their graph because they contain cycles to represent the restoration to a previous state via a repair rate. The repair rate can be a function of the complexity of the system, the physical packaging, the symptoms, the skill of the maintenance personnel, and other variables. The repair rate may vary with time due to the skill of the maintenance personnel varying with time. For repairable systems, both availability and reliability can be calculated. Availability graphs are recognizable by the condition that no system-fail state (sometimes called a trapping or death state) is an absorbing

state (i.e., there is a repair rate linking back to some previous operating state). Conversely, reliability graphs are conspicuous by the fact that all system-fail states are absorbing states with no links directed out. Reliability will be determined by how long it takes to traverse from the initial state to an absorbing state. A graph for a repairable system with one repair person is shown in Figure 9.

Figure 9. Continuous-time time-variant availability model. (a) Repairable TMR with one repair person. (b) Same system with states merged.

A model closely related to the Markov model is the semi-Markov model. This model is used in CARE-III [Bavuso 1982]

for fault handling and in SURE [Butler 1986a] and HARP [Trivedi et al. 1984]. The semi-Markov model [Ross 1983] is like the Markov model as far as the state transitions are concerned. Unlike the Markov model, however, there is a random or constant amount of time between changes. This local holding time can have any distribution. In summary, for a semi-Markov model, when a process enters a state i, the next state j is determined by transition probabilities that are exponential (or geometric for a discrete-time model), but the process is held in this state for a positive amount of time governed by a holding probability distribution function, which can be any distribution. Use of semi-Markov models facilitates reproducing the effect of competing events in fault handling, in contrast to an instantaneous coverage model, which does not.

1.6 Extended Stochastic Petri Nets (a Model for Simulation)

When the limits of the Markov chain for detailed modeling are reached, a natural transition is to consider Petri nets (PN) [Peterson 1977]. One of the problems that leads to the use of Petri nets is the manifestation of inherent concurrency, for example, the concurrency between fault behavior and fault-handling behavior as seen in the common situation of an intermittent fault being transformed to a benign fault during a recovery operation. A similar example is a second fault arriving while a system is reconfiguring itself as a result of the first fault. Other problems that lead to the use of Petri nets arise when nonexponential transition rates are being used and/or there is a state space explosion. Under these circumstances, an analytic solution becomes intractable, and simulation using Petri nets may be the viable alternative. Several researchers [Beyaert et al. 1981; Dugan et al. 1985; Movaghar and Meyer 1984] have proposed the use of extended stochastic Petri nets (ESPN) for evaluating computer systems' performance and RAS parameters.

Petri nets are specified by a bipartite graph [Peterson 1977] with a finite set of places, P (represented by circles), as one type of node and a finite number of transitions, T (represented by bars), as the second type of node. These nodes are connected by a set of directed edges, E, connecting places to transitions or connecting transitions to places. If a directed edge (called an arc) exists from a place to a transition, then the place is an input to the transition. If an arc exists from a transition to a place, then the place is an output of the transition. The dynamic properties of Petri nets are represented by the movement of tokens


(represented by dots inside a circle) through the network. When modeling RAS characteristics, these tokens can represent such items as faults or operational components in the system. There is usually a limit on the number of tokens that can be present in the system. This means that the Petri net is K-bounded, where K is the maximum number of faults allowed or the maximum number of operational components in a given network. The position of the tokens in a network represents the marking of the network. When there is a bound on the number of tokens, then there are a finite number of possible states (markings) of the PN. A transition is enabled when there is a token in each of its input places. An enabled transition has the possibility of firing, but only one enabled transition may fire at any instant of time. When a transition fires, one token is removed from each input place and one token is deposited in each output place, which results in a change in the state of the PN (the markings have changed).

When the firing does not occur instantaneously and the firing time associated with each transition is an exponentially distributed random event, the net becomes a stochastic Petri net (SPN). When a transition is enabled, an exponentially distributed amount of time elapses, followed by the transition firing provided it is still enabled. If both timed transitions and instantaneous transitions are allowed, we have a generalized stochastic Petri net (GSPN) [Balbo et al. 1987]. When both timed and instantaneous transitions are enabled, only the instantaneous transitions can fire. Both instantaneous and timed transition representations are shown in Figure 10. Both SPNs and GSPNs have been shown to be equivalent to continuous-time Markov processes as a result of the memoryless property of exponentially distributed firing times [Marsan et al. 1984; Molloy 1981]. Thus, they can easily be converted to Markov processes and solved analytically rather than by simulation. An extended stochastic Petri net, as defined by Dugan et al. [1985], allows the firing times to belong to an arbitrary distribution. Other extensions included


Figure 10. Some Petri net elements and generalizations: (a) inhibitor arc, (b) probabilistic arc, (c) counter arc, (d) instantaneous transitions, (e) timed transitions, (f) multiple output timed transitions.

(shown in Figure 10) are inhibitor arcs and probabilistic arcs. An inhibitor arc goes from a place to a transition with a small circle rather than an arrowhead at the transition. The firing rules are changed such that a transition is now enabled when there is no token in any of its inhibitor input places and there are tokens in all of its other (normal) input places. When a transition fires, tokens are removed from all normal input places and deposited in one output place associated with each output arc, while the number of tokens in the inhibiting input place remains zero. A probabilistic arc from a transition to a set of n (n is an integer ≥ 2) output places through a distribution node deposits a token in exactly one of the n places in the set. The selection of a place to receive the token is determined by the probability labels on each branch arc emanating from the distribution node. In Figure 10b, an enabled transition has a choice of three places to deposit one token according to the probabilities listed on each branch arc.

Although they are not true extensions to Petri nets, both counter arcs (multiple input arcs) and counter-alternate arcs reduce the number of transitions and places that must be shown in a Petri net. A counter arc connects a place to a transition and is labeled with an integer value k. This modifies the firing rule so that a transition is enabled when tokens are present in all of its normal input places and at least k tokens are at the counter input place. The firing of the transition removes one token from each normal input place and k tokens from the counter place. There can be a counter-alternate arc associated with a counter arc, which enables an alternate transition when the count has a value from 1 to (k − 1). The alternate transition can fire once each time a token is deposited in the counting input place until there are k tokens existing in the counter input place.

Because of the modeling and decision power of ESPNs, they have wide application in modeling not only the RAS characteristics of a system but the performance characteristics as well. Popular applications include modeling performance, reliability, fault recovery, fault tolerance, and fault coverage [Beyaert et al. 1981; Dugan et al. 1985; Movaghar and Meyer 1984]. Such models are able to handle permanent, intermittent, and transient faults. HARP is an example of a tool that utilizes ESPN

Survey of Software Tools

Figure 11. ESPN model for TMR sient faults.

in evaluating system RAS. METASAN’ utilizes a similar generalized version of stochastic Petri nets called SANS. The major advantage of generalized versions of stochastic Petri nets is the potential they have for modeling detail. The major disadvantage is that complexity increases faster with the size of the problem than previously discussed behavioral models. Figure 11 shows an ESPN model for the TMR system (the voter is assumed to be nonfailing), with the additional capability of modeling the handling of transient faults. In the initial state, all three processors are operational. Note that by changing the number of tokens and the values of kl and kp, the graceful degradation of any Kout-of-N system can be modeled. Timed transition Tl fires when a processor fails causing the token representing that processor to transfer to the fault-handling place. While in this place, there are various probabilities that govern which transition is taken out of this state. If there is only one token in the fault-handling place, then there are three mutually exclusive exits from this place through timed transitions ’ METASAN and SANSCRIPT are trademarks Industrial Technology Institute.

of the

.

245

system that handles tran-

T,, TX, and T4. There is a probability that the failure is transient, in which case T2 fires and the processor is returned to the operational processors place. There is a probability that the failure is permanent, which results in transition T3 firing and transfer of the processor token to the processors failed place. If there are two tokens in the fault-handling place, then instantaneous transition T5 fires to take the system to the system failed place. If a failed processor is repaired, then timed transition Ts fires, transferring the token from P3 to PI. If two processor tokens appear at place P3 at the same time, instantaneous transition T7 fires, transferring both tokens to the system failed place (P4). Whenever tokens arrive at P4, instantaneous transitions T9, Tlo, and Tll are enabled, which results in all tokens being transferred to P4. When the system is repaired, all tokens are transferred back to P, by the firing of time transition T8. Observe that this model does not account for all three processors being failed at once since it is determined to be a very low probability occurrence compared to the two failed processor case, nor does it account for the voter failing. Simple modifications are required to incorporate these capabilities. ACM Computing
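The firing semantics described above are straightforward to mechanize. The following is a minimal Python sketch of stochastic Petri net execution for a simplified two-place failure/repair net; it is illustrative only (it is not the ESPN of Figure 11 or any surveyed tool's implementation, and the places, rates, and horizon are assumed values).

```python
import random

# Minimal SPN sketch: enabled exponential timed transitions race, and the
# transition with the smallest sampled firing time fires. Because exponential
# distributions are memoryless, resampling at each step is valid.
marking = {"operational": 3, "failed": 0}   # initial marking (TMR processors)
LAMBDA, MU = 1e-4, 1e-2                     # assumed failure and repair rates

def enabled_transitions(m):
    """Enabled timed transitions with marking-dependent rates."""
    t = []
    if m["operational"] > 0:
        t.append(("fail", m["operational"] * LAMBDA))
    if m["failed"] > 0:
        t.append(("repair", MU))            # a single repair person
    return t

def fire(m, name):
    """Remove a token from the input place and deposit it in the output place."""
    src, dst = ("operational", "failed") if name == "fail" else ("failed", "operational")
    m[src] -= 1
    m[dst] += 1

clock, horizon = 0.0, 100_000.0
while clock < horizon and marking["operational"] >= 2:   # TMR majority intact
    name, delay = min(((n, random.expovariate(r))
                       for n, r in enabled_transitions(marking)),
                      key=lambda x: x[1])
    clock += delay
    fire(marking, name)

print("majority lost" if marking["operational"] < 2 else "survived",
      "at t =", round(min(clock, horizon), 1))
```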


1.7 Simulation

There are many different types of simulation techniques and no standard definition agreed upon in the literature. All the models described in previous sections can be used in simulation. Simulation involves conducting experiments with a model in order to understand how a system will behave and to obtain numerical evaluations of the various operational strategies. The specific behavior we are concerned with in this paper is the RAS characteristics of a system. There are many ways to classify simulation. One classification is as either static or dynamic. Static simulation does not involve time, while in dynamic simulation, time is a variable. Static simulation usually means Monte Carlo simulation, but Monte Carlo simulation can also be used with dynamic simulation. In Monte Carlo simulation, random numbers are generated to obtain random observations from probability distributions. Extreme care must be used in creating a pseudorandom number generator. Knuth [1981] provides several methods, including the subtractive method, for creating "good" pseudorandom number generators. Since the law of large numbers (i.e., the larger the sample size, the more certain the sample mean will be a good approximation of the population mean) applies, discipline must be implemented to keep the number of simulations required to achieve accurate results from becoming too large. This involves using variance reducing techniques such as stratified sampling or the method of complementary random numbers, which are described in Hillier and Lieberman [1980]. Simulation can typically reproduce a more accurate model than analytic models, although at a higher computational cost. Without variance reduction techniques, simulation is not practical for assessing highly reliable/available systems.

To illustrate how Monte Carlo simulation works, consider the RBD for the TMR example as shown in Figure 4. Assume R_A = 0.9, R_B = 0.8, R_C = 0.7, and R_V = 1.0. Using a table or an algorithm such as the subtractive method, generate a random number between 0.000000001 and 1.00. Compare this random number to R_A. If the number is less than 0.9, then processor A is operational; otherwise, processor A has failed. Apply this same procedure to R_B and R_C. After some number of trials determined to be sufficient to provide an accurate answer, tabulate the results. The reliability of the system is calculated from the equation

R_TMR = (# trials where the number of operational processors ≥ 2) / (total # trials)
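As an illustration, the following short Python sketch implements exactly this procedure for the reliabilities assumed above (R_A = 0.9, R_B = 0.8, R_C = 0.7, nonfailing voter), using the 100 trials of the text.

```python
import random

# Monte Carlo estimate of R_TMR: one uniform draw per processor per trial;
# a processor is operational if its draw falls below its reliability.
R = {"A": 0.9, "B": 0.8, "C": 0.7}
TRIALS = 100

successes = 0
for _ in range(TRIALS):
    operational = sum(random.random() < r for r in R.values())
    if operational >= 2:        # majority voting succeeds with >= 2 good units
        successes += 1

print("R_TMR estimate:", successes / TRIALS)
```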

For this example, a run of 100 trials yielded R_TMR = 90/100 = 0.9. In this example, it is observed that the use of heterogeneous processors in a TMR application yields no better reliability than the use of processor A alone, but the cost is far greater for the TMR configuration than for the uniprocessor.

One form of dynamic simulation is discrete-event simulation. In discrete-event simulation, events are generated randomly to occur at instants of time determined by the transition rates between states. When these events occur, the state of the system changes. For example, in Figure 9b, if the current state is all three processors operational, then the occurrence of a failure of processor A would take the system to the state where only two processors are operational. A Markovian queueing model is often used in discrete-event simulation. The state diagram of Figure 9b has a corresponding queueing model, shown in Figure 12, when a nonfailing voter is assumed. The arrival of failures to the queue is determined by the exponential distribution with a failure rate of 3λ. Each failure receives service (repair) at the rate μ. The system is considered failed when the queue length reaches two, and the arrival of any further failures is suspended until the queue length is reduced to one. The mean time to failure (MTTF) is determined by the time it takes to go from the initial operating state, with three processors operational, to the failed state, with two processors failed. The mean time between failures is determined by calculating the average time the system is operational between failures.

Figure 12. Queueing model for simulating TMR with one repair person.

Dynamic simulation need not be event oriented; it can also be process oriented. In this case, tokens representing each of the three processors can be routed from state to state according to the transition probability distributions. The ESPN model shown in Figure 11 is an example of process-oriented simulation. This structure lends itself very well to modeling concurrency and creating hierarchies of models. Statistics can be gathered on how much time is spent in the system failed state versus how much time is spent elsewhere, and the availability can be calculated from this information. Corresponding calculations can also be made for the other states. Although other forms of simulation exist, the primary forms used in the tools discussed in this paper have been covered above. Useful references for exploring simulation in more detail are Law and Kelton [1982] and Lazowska et al. [1984].
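A minimal discrete-event sketch of the Figure 12 queueing model makes the MTTF estimate concrete. The rates and run count below are assumed for illustration; the failure arrival rate decreases as processors fail, following the state diagram of Figure 9b.

```python
import random

LAMBDA, MU, RUNS = 1e-3, 1e-1, 1000   # assumed failure/repair rates, run count

def time_to_failure():
    """Simulate until two failures are queued (system failed); one repair person."""
    queue, clock = 0, 0.0
    while queue < 2:
        next_failure = random.expovariate((3 - queue) * LAMBDA)
        if queue == 0:
            clock += next_failure
            queue = 1
        else:
            next_repair = random.expovariate(MU)   # repair races the next failure
            clock += min(next_failure, next_repair)
            queue += 1 if next_failure < next_repair else -1
    return clock

print("estimated MTTF:", sum(time_to_failure() for _ in range(RUNS)) / RUNS)
```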

2. RAS TOOLS FOR COMPUTER SYSTEMS

2.1 Characteristics of the Tools

The objective of these models is to provide a number of the assessments and evaluations related to dependability. Examples include:

(1) Reliability evaluation
(2) Availability evaluation
(3) MTTF and MTTR
(4) Fault coverage evaluation
(5) Life-cycle evaluation
(6) Service cost estimation
(7) Comparative analysis of design alternatives
(8) Design "weak spot" identification

Currently known reliability prediction tools are related to one another and to common predecessors. These analytic models utilize one or more of three basic modeling techniques [Geist and Trivedi 1983]:

(1) Combinatorial techniques
(2) Markov chains (processes)/Petri nets
(3) Simulation (only with variance reduction techniques for fault occurrence models, but generally useful for fault handling with behavioral decomposition)

The following section provides information about various tools that is intended to be helpful in understanding which tools are available and which ones might serve an intended purpose. Making decisions about which tools should be used, however, requires more information than can be presented here. Information concerning limitations and capabilities is not all inclusive, since it is based primarily on what has been published about a given tool. Furthermore, many tools are still being improved and updated. Another important consideration is the actions taken to validate the tools. Since a given tool may have many different applications, a user's main concern should be whether the tool is valid for the desired applications. Validation comes in many forms; there is validation of the correctness of the algorithms, validation that the results are consistent with those of other known tools, and validation that the results are consistent with the actual field performance of the systems evaluated. An understanding of the assumptions of the tool is needed for effective evaluation but is seldom provided with the model. A comprehensive validation process for a tool typically involves many test sites over a period of time. Since most tools do not provide support for analysis of test data, commercially available tools such as SAS² should be used [SAS Institute, Inc. 1985].

² SAS is a registered trademark of SAS Institute, Inc.


2.2 RAS Tools

2.2.1 ARIES-82 (Automated Reliability Interactive Estimation System)

1. References: [Makam and Avizienis 1982; Makam et al. 1982; Ng and Avizienis 1977]
2. Purpose: Performs reliability and life-cycle analysis for fault-tolerant systems, serves as a teaching aid, and is a base for future integration of reliability and analysis techniques.
3. Implementation media, size, and capacity
   a. Implemented on a VAX/11-780 in C-language procedures
   b. Approximately 85 kilobytes of executable code
   c. No capacity limitations were stated in the literature
4. Basic model: Homogeneous Markov process
5. System types modeled
   a. Nonrepairable systems
   b. Repairable systems
   c. Systems with transient fault recovery
   d. Periodically renewed nonrepairable systems
6. Genealogy: The first ARIES was developed at UCLA by Ng and Avizienis [1977] in 1976 and was originally written in APL. It was created out of the experience of using REL70 [Bouricius et al. 1969], CARE [Mathur 1972], RELCOMP [Fleming 1971], and RMS [Rennels and Avizienis 1973]. It evolved to ARIES-81 [Makam and Avizienis 1982] and then finally to ARIES-82 [Makam et al. 1982].
7. Inputs
   a. Interactive user interface
   b. Five classes of parameters
      (1) Structural: Initial number of active and spare modules, number of degradations allowed in the active set, number of good modules in the first safe shutdown state, and others


      (2) Physical: Failure rate for each active module, failure rate for each spare module, failure rate for each good module in the safe shutdown state, transient fault arrival rate for each active module, and the mean duration of a transient fault
      (3) Logistical
         (a) Number of repair facilities, repair rate for each module when the system is operating in a less than full configuration state or is in a safe shutdown state, and restart rate to attain the full-configuration state from the crash-failure state
         (b) Renewal process: Service interval length, replacement or repair rate per module, restart rate from the crash-fail state, mean system checkout duration, and the total time interval between the initiation of the ith renewal phase and the resumption of system operation with full configuration. The total time interval is a random variable estimated using the other parameters.
      (4) Detection and recovery for permanent faults: Probability of recovery (coverage) from a spare module failure, coverage for active modules, coverage for the active set when spares remain, and coverage for successful final degradation to the safe shutdown state
      (5) Detection and recovery for transient faults: The number of recovery phases, recoverability (the conditional probability that the fault is noncatastrophic, given that a fault occurs), the failure rate of all the hardware engaged in the execution of the transient recovery processes, the recovery duration vector, and the recovery effectiveness vector

8. Outputs
   a. Mission-time measures for the system
      (1) The reliability
      (2) Mean time to first failure
      (3) Failure rate
      (4) The balance of the unreliability contribution of each subsystem
      (5) Reliability improvement factor resulting from the redundancy
      (6) Mission time improvement factor between two competing designs
   b. Life-cycle measures for the system
      (1) Aggregate state probabilities
         (a) Probability that the system is operating without apparent degradation either in performance or in fault tolerance
         (b) Probability that the system is operating in any of the degraded states
         (c) Probability that the system is operating in one of the unsafe states or in states with some failed spares that are unrecoverable
         (d) Probability that the system is in a safe shutdown state
         (e) Probability of a catastrophic failure
      (2) Steady-state probabilities for expressing the limiting behavior of repairable and renewable systems
      (3) Instantaneous availability: the probability that the system is operational at time t
      (4) Safety: the probability that the system is in an operational state or in a safe shutdown state
      (5) Availability of a specified potential level of performance (capacity)
      (6) Average expected potential performance level (capacity)
      (7) Frequency of failures
      (8) Mean lost computation time due to crashes
      (9) Mean lost computation time due to safe shutdowns
      (10) Mean down time
      (11) Average number of module failures
9. Major restrictions
   a. Assumption of constant transition rates (failure rates)
   b. Assumption of distinct eigenvalues for the matrix of transition rates, which results in a solution of O(n⁵) as compared to the classical solution that is O(n⁴) (applies only to models of repairable systems)

2.2.2 CARE-III (Computer-Aided Reliability Estimation Program)

1. References: [Bavuso 1982; Bavuso 1984; Geist and Trivedi 1983; Stiffler et al. 1979]
2. Purpose: To be a general-purpose reliability estimation tool for very large, highly reliable digital fault-tolerant avionic systems and to provide comparisons of the stochastic attributes of alternative systems.
3. Implementation media, size, and capacity
   a. CDC Cyber 170 in FORTRAN IV with plotting capability for DISSPLA and Plot 10.
   b. VAX-11 in FORTRAN under VMS with a user-friendly prompting front end.
   c. Can model on the order of one million Markovian equivalent states.
   d. Can model up to 70 stages with a maximum value of 70 for M and N, where N is the maximum number of modules needed and M is the minimum number for stage survival.
4. Basic models
   a. Fault trees to specify fault-occurrence behavior
   b. Nonhomogeneous Markov chain to model fault-occurrence behavior


   c. Semi-Markov chain to model fault/error-handling behavior. Within the fault/error-handling model, the two main components are a single-fault model and a double-fault model. The double-fault model recognizes the possibility that system failures may be dominated by critically coupled failures.
5. System type modeled: Highly reliable nonrepairable systems
6. Genealogy: Codeveloped by NASA Langley and the Raytheon Company to overcome limitations in previous versions of CARE and other related models. The original CARE program was developed by NASA's Jet Propulsion Laboratory for application to space-borne computer systems. This model implemented the concept of coverage that was developed by Roth et al. [1967] and Bouricius et al. [1969] and was combinatorial like several other predecessors [Chelson 1967, 1971; Computer Sciences Corp. 1970]. The techniques of CARE were combined with TASRA (tabular system reliability analysis) [Bavuso 1982], developed at Battelle Memorial Laboratories, in the creation of CARE-II by Raytheon and NASA Langley. Further evolution occurred when the methodology of CAST (combined analytic simulative technique) [Conn et al. 1977] was merged with CARSRA (computer-aided redundant system reliability analysis) [Bavuso 1982] and with software reliability studies [Nagel 1982]. CAST focused on modeling transients and introduced the coupling of an analytical approach with computer simulation. Furthermore, coverage was expanded into components that could be evaluated in a straightforward manner: coverage c = uvw, where u is the probability that a fault is detected given that a fault occurs, v is the probability that a fault is isolated given that a fault is detected, and w is the probability that the system recovers given that the fault is isolated. CARSRA used the Markov approach with a state reduction technique. Stage dependencies were modeled in order to assess the variety of system configurations that constitute continued mission success. This is one of the most mature evaluation tools, with more than 50 copies distributed to beta test sites.
7. Inputs
   a. User interface is interactive on the VAX
   b. Stage description input (a stage can be composed of hardware or software modules or represent a function that can be a mathematical model of some component)
      (1) Stage name (identifier for a subsystem made up of modules with the same Weibull time-to-failure distribution)
      (2) Number of initially functioning modules in the stage
      (3) Minimum number of modules for stage operation
      (4) Set(s) of modules subject to critical pair failures
      (5) Critical fault threshold
   c. Fault-handling models
      (1) Fault type (either permanent, transient, or intermittent)
      (2) Transition rate from active to benign fault state (benign state: fault manifestation has vanished)
      (3) Transition rate from benign to active fault state
      (4) Rate at which a fault is detected by self-test
      (5) Rate at which a fault generates errors
      (6) Rate at which errors are detected
      (7) Single-point failure probability
      (8) Probability that a detected fault is permanent
   d. Fault occurrence model
      (1) Failure distributions and parameters for previously defined stages (Weibull or exponential)
   e. Fault tree description
      (1) User describes relationships between system stages by using a fault tree language

      (2) Description language supports AND, OR, M-out-of-N, and invert input gates, and up to 2000 total events and 70 gate input events
      (3) Hardware and functional redundancy are described
   f. Output control/format selection
      (1) Coverage moment function data
      (2) Reliability print or plot data
      (3) Coverage function plots
      (4) Plot axis selections
      (5) Mission time for the assessment
      (6) Truncation value and number of integration steps are given for numerical integration routines
8. Outputs
   a. Probability that a given "module in stage X" (module(x)) has not experienced a specified category fault by time t
   b. Reliability of a module(n)
   c. Rate of occurrence of a specified category fault in a given operational module(x)
   d. Rate of occurrence of faults in the remaining fault-free modules at time t, given a specified number of faulty modules, summed over all stages
   e. Probability that a given module(x) has a specified category latent fault at time t given that it has experienced some fault by time τ
   f. Probability that a given module(x) has a latent fault at time t given that it has experienced some fault by time t
   g. For a given stage, the probability that a subsystem contains a specified number of latent faults given a specified number of faulty modules
   h. Probability that a system having a specified accumulated number of faults has a specified accumulated number of latent faults
   i. Probability that a system containing a specified number of faults would be in a supercritical state if a given category fault occurred at time t
   j. Probability that a system containing a specified number of faults would enter a specified critical state if a given fault occurred at time t
   k. Probability that a fault in a given category is active at time t given that it is latent at time τ
   l. Probability, given that a system enters a specified critical state at time t, that this event eventually causes a system failure
   m. Probability that a system having a specified number of faults is in a critical state at time t
   n. Rate at which systems having a specified number of faults fail at time t due to critical fault conditions
   o. Probability that the system recovers from a fault
9. Major restrictions
   a. Limited to highly reliable nonrepairable fault-tolerant systems
   b. Coverage model is limited in size due to the difficulty in solving the semi-Markov model
   c. Cannot model sequence dependencies
   d. Cannot model "cold" spares with lower failure rates
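The coverage decomposition c = uvw introduced in the CARE genealogy above is easy to evaluate numerically. The probabilities in this small sketch are assumed values for illustration only.

```python
# Coverage c = u * v * w: detection, isolation, and recovery must all succeed.
u = 0.99   # P(fault detected | fault occurs)        (assumed)
v = 0.98   # P(fault isolated | fault detected)      (assumed)
w = 0.97   # P(system recovers | fault isolated)     (assumed)

c = u * v * w
print(f"coverage c = {c:.4f}")   # ~0.9411: even high per-step probabilities compound
```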

2.2.3 GRAMP (Generalized Reliability and Maintainability Program) and GRAMS (Generalized Reliability and Maintainability Simulator)

1. References: [Dolny et al. 1983; Fleming 1980; Fleming et al. 1984]
2. Purpose: GRAMP models the complexities of a fault-tolerant design, including coverage, preventative maintenance, acquisition cost, operations cost, and support cost, and provides sensitivity analysis. GRAMS predicts reliability, maintainability, and life-cycle cost of the same fault-tolerant systems as GRAMP.
3. Implementation media, size, and capacity
   a. Consists of about 80K lines of FORTRAN 77 code and runs on a VAX under VMS.
   b. Can solve and perform sensitivity analysis on systems having up to 1500 Markov states.
4. Basic model: Continuous-time Markov model for GRAMP. GRAMS is a Monte Carlo discrete-event digital simulator. Time-varying failure rates are handled as piecewise constant failure rates. The model assumptions are as follows:
   a. System
      (1) Randomly deteriorating (stochastic failure process)
      (2) Independent component failures
      (3) (Piecewise) constant failure rates
      (4) State transition immediately following component failure
      (5) Components are either working or failed
   b. Maintenance
      (1) Stationary maintenance policies
      (2) Repair brings a component back to "new" condition
      (3) Maintenance actions at the instant of component failure
      (4) Instantaneous repair; unlimited service capacity
      (5) When a subsystem fails, fix everything
      (6) Maintenance policy restrictions from Coherent System Repair Models
   c. Time clock
      (1) Continuous time, single failure per state transition
5. System types modeled: Both nonrepairable and repairable systems
6. Genealogy: Grew out of work by Fleming for his Ph.D. dissertation [Fleming 1980]. Benefited from work on prior models, such as ARIES, SURF, and CARE III. Emphasis is on application as a tool to assist in design decisions based on reliability, maintainability, and availability requirements. This tool continues to be expanded and improved. It is the most complete system evaluated in terms of types of capabilities. It handles the full gamut of responsibilities, from component failure rate data to translation of RAS parameters into time and money. Because of the attention paid to identifying model assumptions relevant to system design, the possibility of misuse due to lack of understanding is significantly reduced.
7. Inputs (GRAMP)
   a. System
      (1) Fixed charge for repair
      (2) Failure charge for breakdown
      (3) Run options
      (4) Print options
      (5) Mission length
      (6) Acquisition cost (system level and module level)
      (7) Average repair cost
   b. Subsystem
      (1) Number of modules in the subsystem
      (2) Reliability requirement
      (3) Opportunistic maintenance option. (When one module is repaired, this parameter allows the user to specify whether the maintenance strategy involves searching for other failed modules and repairing those too.)
      (4) Subsystem structure (critical sets)
   c. Module
      (1) Preventative maintenance level
      (2) Redundancy level
      (3) Replacement level
      (4) Sensitivities to compute
      (5) Failure rates
      (6) Coverage: active and spare units
8. Inputs (GRAMS): Input is accepted at the system, subsystem, and module level, just as in GRAMP.
   a. System level input
      (1) Costs associated with various maintenance actions

         (a) Cost of required maintenance on a backup system when it fails its premission inspection
         (b) Cost incurred when the switching mechanism for the backup system fails after the primary system has failed
         (c) Cost of opportunistic maintenance done on one or more components that are repaired while the host is in the shop for reaching its maximum operating time (MOT) or for on-schedule maintenance
         (d) Cost incurred whenever scheduled line replaceable unit (LRU) maintenance is done
         (e) Cost incurred whenever unscheduled LRU maintenance is done
         (f) Costs associated with host removal, shipment to the shop, and repair for scheduled maintenance and for unscheduled maintenance
      (2) Event probabilities needed in system level input
         (a) Probability of mission type A
         (b) Probability of doing end-of-mission maintenance even when not required
         (c) Probability of not doing end-of-mission maintenance even when required
         (d) Probability of discrete events (such as maximum temperature and lightning events)
         (e) Probability of each level of severity (n of them) for the discrete events (such as maximum temperature and lightning events)
         (f) Probability of host failure on a mission
         (g) Probability of backup inspection failures
         (h) Probability of backup switching failures
      (3) Number of host operating hours per life and per year
      (4) Lengths of missions A and B in hours
      (5) Number of severity levels (n) for discrete events
      (6) Number of hosts to be simulated
      (7) Number of hosts in the fleet
      (8) Array of times for failure rate data
      (9) Host population and the minimum and maximum number of hosts

9. Outputs
   a. GRAMP output
      (1) Cost evaluator model (CEM) (on a module or subsystem basis)
         (a) Failure/maintenance events per million hours
         (b) Mean time between events
         (c) Relative operations and support (O&S) cost
         (d) Acquisition cost
         (e) Weight
         (f) Sensitivities
      (2) Reliability feasibility model (module or subsystem basis)
         (a) Reliability
         (b) Mean times to failure
         (c) Reliability time history plot
      (3) System bookkeeping (system basis)
         (a) Reliability
         (b) MTTF
         (c) O&S and acquisition costs
         (d) Total cost of ownership; includes fuel, delay, line maintenance, and shop repair cost
         (e) Weight
         (f) CEM events per million hours (abort rate)
         (g) Mean time between failures (MTBF) and mean time between repairs (MTBR)
         (h) Subsystem summaries of the above seven categories
   b. GRAMS output
      (1) Namelist input
      (2) Parameter definitions
      (3) Subsystem definitions
      (4) Module definitions
      (5) System results
      (6) Subsystem reliability
      (7) Subsystem MTBF
      (8) Module MTBF due to component failure/module MTBF due to coverage failure
      (9) System MTBR by maintenance type
      (10) MTBR driven by subsystem maintenance (average over host life)
      (11) MTBR driven by subsystem maintenance (yearly)
      (12) Events driven by subsystem maintenance (yearly average over host life)
      (13) Events driven by subsystem maintenance (yearly)
      (14) O&S cost breakdown
      (15) Component replacements by maintenance type (yearly average over life cycle)
      (16) Component replacements by maintenance type (yearly)
      (17) Component replacements attributed to maintenance plan (yearly and aggregate)
      (18) Backup failures
      (19) Print option definitions (determines which outputs are selected for printing)
10. Major restrictions
   a. Emphasis appears to be on military systems rather than commercial systems
   b. Can handle a maximum of 1500 Markov states

2.2.4 HARP (Hybrid Automated Reliability Predictor)

1. References: [Bavuso et al. 1987; Dugan et al. 1985; Geist and Trivedi 1983; Trivedi et al. 1984]
2. Purpose: To provide a hybrid model for evaluating the reliability and instantaneous availability of large complex systems.
3. Implementation media, size, and capacity
   a. Consists of nearly 30K lines of FORTRAN 77 code and comments and has been tested under AT&T UNIX³ and DEC VMS.⁴ Program development is being done on a VAX.
   b. The graphics interface is written in C and runs on an IBM PC AT.⁵ PC HARP produces text files that can be used to solve a small system (500 Markovian states) or can be uploaded to a larger machine to model a large system.
   c. Has an introduction and user's guide.
   d. Capable of modeling a large system. Has been used to model a system of 20 components distributed among 7 stages, with coverage, that produced a Markov chain with 24,533 states and more than 335,000 transitions (fault occurrence model) and ran for between 4 and 8 hours.
4. Basic model: There are two models
   a. A fault occurrence and repair model (FORM), which involves a Markov chain that can handle exponential, Weibull, and general distributions
   b. A fault/error handling model (FEHM). Seven different types are available, which can be mixed:
      (1) Markov version of CARE III
      (2) Extended stochastic Petri net
      (3) ARIES transient fault recovery model

      (4) Probabilities and distributions
      (5) Probabilities and empirical data
      (6) Probabilities and moments
      (7) User-defined values
   c. System types modeled: Repairable and nonrepairable systems
5. Genealogy: One of a series of models developed at Duke University and currently one of the most mature in that it is undergoing beta test at more than 50 locations
6. Inputs
   a. Fault tree representation in graphic form
   b. Markov model representation in graphic form
   c. Failure rates with variation band
   d. Repair rates
   e. Initial conditions
   f. Near-coincident faults by location
   g. Near-coincident faults by component
   h. Ignore near-coincident faults
   i. Inputs for the default ESPN model for fault handling
      (1) ACTIVE transition (from active to benign intermittent)
      (2) BENIGN transition (from benign to active intermittent)
      (3) Lifetime of a transient fault
      (4) DETECT transition (from DETECT to counter)
      (5) ERROR transition (from ERROR to being detected or becoming a system failure)
      (6) ERROR DETECT transition (from ERROR to DETECT only)
      (7) ISOLATION transition [from permanent fault to reconfigure (place) or system failure (place)]
      (8) RECOVERY transition (from transient recovery to recovery complete)
      (9) RECONFIGURATION transition (from reconfiguration to degraded operation or system failure)
      (10) Probability of fault detection by self-test
      (11) Probability of error detection
      (12) Probability of isolating a detected fault
      (13) Number of recovery attempts
      (14) Probability of successful reconfiguration
      (15) Fraction of faults that are transient
      (16) Desired s-confidence level
      (17) Allowable error
   j. Description of the recovery processes in terms of a Petri net
7. Outputs
   a. Sensitivity of the reliability or availability prediction to parameter variations and initial state uncertainty. This information provides a measure of the system's sensitivity to design faults.
   b. Reliability (unreliability) and instantaneous availability (unavailability) with respect to time.
   c. Failure probabilities attributed to exhaustion of redundant elements, single-point failures, and near-coincident faults.
   d. Probability of a near-coincident fault.
8. Major restrictions
   a. Does not address maintenance strategies.
   b. Cannot generate models with sequence dependencies or cold spares.

³ UNIX is a trademark of Bell Laboratories.
⁴ VMS is a trademark of DEC.
⁵ PC AT is a registered trademark of the IBM Corporation.

2.2.5 MARK1 (Markov Modeling Package)

1. References: [Lala 1983a, 1983b]
2. Purpose: To evaluate the reliability of any complex system whose characteristics can be modeled using Markov chains.
3. Implementation media, size, and capacity
   a. Written in PL/1 and run on an Amdahl 470 under MVS. The program can run on any IBM-compatible machine.

   b. Plotting is done, using Calcomp plotting routines, on a Calcomp and/or Versatec plotter.
   c. With six megabytes of memory, it is possible to process 50 models of 100 states each over a time period of 100 time points (10 decades).
4. Basic model: Discrete-state continuous-time Markov chain
5. System types modeled: Any nonrepairable system whose characteristics can be modeled using Markov chains
6. Genealogy: Combined many of the concepts of Avizienis [1975] with computer and mathematical tools previously developed at Draper Laboratory for performance prediction using Markov chains. This tool was used to do reliability predictions for the fault-tolerant multiprocessor (FTMP) developed for NASA [Smith and Lala 1986].
7. Inputs: The inputs are currently organized as an 80-column card deck image, but a new user-friendly interface is currently being developed (Lala, personal communication, 1987)
   a. Model specification
      (1) Number of states in the model
      (2) Description of each state
      (3) Occupancy probabilities at some initial time for each state
      (4) Transition rates between the states (defines the interconnection between states)
   b. Time span for which the Markov model should be solved
   c. Commands to merge the results of the models, if there is more than one, to obtain the system state probabilities
8. Outputs
   a. Plots of various state probabilities as a function of time (e.g., plot of the probability of being in a given state or the probability of not being in a given state for a specified model)
   b. Plot of MTBF
   c. Plot of average state occupancy probability

9. Major restrictions
   a. Considers exponential distributions only
   b. Coverage cannot be specified
   c. Does not handle repairable systems
   d. Does not address transient and intermittent faults

2.2.6 METASAN (Michigan Evaluation Tool for the Analysis of Stochastic Activity Networks)

1. References: [Movaghar and Meyer 1984; Sanders and Meyer 1986]
2. Purpose: This tool is designed to evaluate the performance and reliability of a system via analysis and simulation.
3. Implementation media, size, and capacity
   a. Version 1.0 consists of about 37,000 lines of source code. It is written using UNIX tools (C, Yacc, Lex, Csh).
   b. User interface is menu driven to permit access to one of two files, a structure file or an experiment file.
   c. Has been used to evaluate a model with more than 500 places, gates, and activities.
4. Basic model: A stochastic activity network (SAN), which is an extension of stochastic Petri nets.
5. System types modeled: Both repairable and nonrepairable systems (a different solution module is used for each type).
6. Genealogy: Was developed at the Industrial Technology Institute in Ann Arbor, Michigan. The development of SANs was at the University of Michigan. An earlier related tool is METAPHOR (Michigan evaluation aid for performability).
7. Inputs: A description language called SANSCRIPT¹ permits specification of the SAN, which includes the initial markings, coverage, distributions for activity times (firing times), case distributions, and halting conditions.
8. Outputs: Performability for repairable and nonrepairable systems.

9. Major restrictions: Computational complexity increases rapidly with the size of the problem.

2.2.7 SAVE (System AVailability Estimator)

1. References: [Conway and Goyal 1986; Goyal et al. 1985, 1986]
2. Purpose: For constructing and solving probabilistic models of computer system availability and reliability, for both mission-oriented systems with high-reliability requirements (typically nonrepairable systems such as space computers and avionics systems) and continuously operating systems with high-availability requirements (typically repairable systems such as telephone switching systems, general-purpose computer systems, and transaction processing systems).
3. Implementation media, size, and capacity
   a. Written in FORTRAN 77.
   b. Currently operational on an IBM System/370.
   c. Capable of modeling tens of thousands of states. Utilizes sparse matrix storage techniques and iterative methods to solve the equations.
4. Basic model: Homogeneous Markov chain
   a. Steady-state availability is computed by solving the homogeneous set of simultaneous linear equations derived from the Markov chain.
   b. Sensitivity with respect to the transition rate parameters (failure rate, repair rate) is calculated by differentiating the linear equations that satisfy the Markov chain.
   c. Mean time to failure is obtained from the transient behavior of the system.
5. System types modeled: Repairable and nonrepairable computer systems
6. Models can be solved both analytically [Goyal et al. 1986] and through Monte Carlo simulation [Conway and Goyal 1986]. The Monte Carlo simulation is for application to large models, and two approaches may be used: direct Monte Carlo and analog Monte Carlo. Direct Monte Carlo simulation involves the random generation of the time to failure and time to repair of each component. Analog Monte Carlo involves simulating the Markov chain by generating state transitions randomly according to the jump probabilities of the chain. When the number of component types is large, the analog approach is faster.
7. Genealogy: Knowledge of predecessor models such as RESQ [Sauer et al. 1982], ARIES, and CARE III, and the parallel work on HARP, helped to influence the development of this model.
8. Inputs
   a. There is an input language for SAVE.
   b. The model to be evaluated is specified:
      MODEL: (modelname)
   c. The method to be used for constructing a Markov chain can be defined. A numerical method can be defined using the input language, a Markov chain can be specified directly, or a combinatorial method can be specified:
      METHOD: (NUMERICAL | MARKOV | COMBINATORIAL)
   d. Components are specified by type, such as processor, database, or spare:
      COMPONENT: (comp-name) ((no.-of-comps))
   e. Spares are listed for each component along with the spare's failure rate:
      SPARES: (no.-of-spares)
      SPARES FAILURE RATE: (expression)
   f. Operational dependencies for each component are specified, such as a database depending on a processor to be operational:
      OPERATION DEPENDS UPON: (comp-name) ((no.)), ...


   g. Failure rates are specified for each component in both the operational and dormant states:
      FAILURE RATE: (expression)
      DORMANT FAILURE RATE: (expression)
   h. Repair rates for each component are specified:
      REPAIR RATE: (expression)
   i. Repair dependencies specify the components that must be operational for a component to be repaired:
      REPAIR DEPENDS UPON: (comp-name) ((no.)), ...
   j. Failure modes are specified for the system.
   k. Failure mode probabilities are specified for each component:
      FAILURE MODE PROBABILITIES: (prob-value), ...
   l. For each failure mode, specify the components affected:
      COMPONENTS AFFECTED: (NONE | list-name | comp-name((no.))), ...
      (LIST-NAME | COMP-NAME): (affect-prob-val), ...
      (LIST-NAME | COMP-NAME): (affect-prob-val), ...
   m. The conditions under which the system is considered operational are specified using the EVALUATION CRITERIA construct:
      EVALUATION CRITERIA: (ASSERTIONS | BLOCKDIAGRAM | FAULTTREE)
   n. When multiple components have failed, the REPAIR STRATEGY construct specifies the order in which components are repaired:
      REPAIR STRATEGY: (PRIORITY | ROS)
      The PRIORITY option is used to specify repair priorities for the various component types. The ROS (random order service) option specifies that all components are given equal priority.


   o. The number of repairmen available for the system can be specified:
      REPAIRMEN: (number-of-repairmen)
9. Outputs
   a. Steady-state availability
   b. Sensitivity analysis results, obtained by differentiating the stationary probability vector (obtained from the Markov chain) with respect to the transition rate parameters
   c. Mean time to failure of the system
10. Major restrictions
   a. Considers exponential distributions only
   b. Does not address handling transient and intermittent faults
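Item 4a above, steady-state availability from the homogeneous linear equations of the Markov chain, can be sketched in a few lines. The following is not SAVE itself; it solves πQ = 0 with Σπ = 1 for a single repairable component with assumed failure and repair rates.

```python
import numpy as np

lam, mu = 1e-3, 1e-1   # assumed failure and repair rates (per hour)

# Generator matrix Q for a two-state chain: state 0 = up, state 1 = down.
# Row sums are zero; off-diagonal entries are transition rates.
Q = np.array([[-lam,  lam],
              [  mu,  -mu]])

# Solve pi @ Q = 0 subject to sum(pi) = 1: replace one balance equation
# with the normalization constraint.
A = np.vstack([Q.T[:-1], np.ones(2)])
b = np.array([0.0, 1.0])
pi = np.linalg.solve(A, b)

print("steady-state availability:", pi[0])   # analytically mu / (lam + mu)
```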

2.2.8 SHARPE

1. References: [Sahner and Trivedi 1986, 1987]
2. Purpose: To provide a hybrid, hierarchical modeling framework for doing reliability and availability evaluations of complex fault-tolerant systems. By providing a variety of different kinds of models at multiple levels, a state space explosion is usually avoided, and model dependencies such as repair dependence, state-dependent failure rates, and near-coincident fault dependence can usually be handled easily. A further purpose is to provide a result that is symbolic in the mission time variable; thus, solutions can be obtained for multiple mission times without repeating the computation, by simply substituting the desired mission time into the symbolic equation obtained as a solution. For the TMR example, assuming a nonfailing voter, the solution equation is

   R(t) = 3e^(-2λt) - 2e^(-3λt)

Since λ is known, substituting the desired mission time t will yield the reliability of the system for that mission time (a small sketch follows this entry).
3. Implementation media, size, and capacity


   a. Consists of about 4800 lines of C code and runs under UNIX or VMS
   b. Runs on a DEC VAX 11/750 or a VAX 11/780
   c. Can be used either interactively or in batch mode
4. Basic model: Framework allows five model types
   a. Series-parallel reliability block diagrams
   b. Fault trees without repeated nodes
   c. Cyclic or acyclic Markov chains
   d. Acyclic semi-Markov chains
   e. Series-parallel directed (acyclic) graphs
5. System types modeled: Both nonrepairable and repairable systems.
6. Genealogy: One of several tools developed at Duke University. Predecessors include SPADE, DEEP, and HARP.
7. Inputs
   a. Multiple levels of models can be specified.
   b. A solution from one submodel can be used as a specification to another submodel, which means solutions from all submodels must have the same form. The class of functions chosen is the exponential polynomial, which has the following form:

      Σ (j = 1 to n) a_j t^(k_j) exp(b_j t),

      where k_j is a nonnegative integer and a_j and b_j are real or complex numbers, but the distribution function is always a real value.
   c. The type of model is specified and named. It can be one of the five types.
   d. For the type Markov, the name and the transition rates are given. For example, Markov bridge (u1, u2, u3, u4, u5), where Markov is the type, bridge is the name, and u1 through u5 are the transition rates.
   e. For each Markov chain, a set of state transitions is given with an associated rate. The rates can be any valid arithmetic expression.


   f. For the type block, it is given a name; then the basic components are defined, with each one assigned a cumulative distribution function (CDF).
   g. Components of a block can be defined using previously defined components.
   h. The connections for a block are defined. This includes both a list of components in series and components in parallel.
   i. Assign values to bind variables.
   j. Specify the system components and mission time for which reliability calculations are desired.
   k. Define the interval over which system reliability is to be calculated and the time increments desired.
8. Outputs
   a. Reliability of selected components
   b. System reliability over an interval
   c. System steady-state availability
9. Major restrictions
   a. No interactive user interface.
   b. Does not address nonhomogeneous transition rates on a global scale.
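The payoff of SHARPE's symbolic output, mentioned in the purpose above, is that one solution serves many mission times. A small sketch using the TMR solution R(t) = 3e^(-2λt) - 2e^(-3λt), with an assumed failure rate:

```python
from math import exp

lam = 1e-4   # assumed failure rate (per hour)

def r_tmr(t):
    """TMR reliability with a nonfailing voter, from the symbolic solution."""
    return 3 * exp(-2 * lam * t) - 2 * exp(-3 * lam * t)

# Substituting mission times into the symbolic equation; no model re-solution.
for t in (10.0, 100.0, 1000.0):
    print(f"R({t:g} h) = {r_tmr(t):.6f}")
```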

2.2.9 SURE (Semi-Markov Unreliability Range Evaluator)

1. References: [Butler 1986a, 1986b]
2. Purpose
   • To be used in performing reliability analysis of fault-tolerant architectures.
   • To provide upper and lower bounds of unreliability using semi-Markov models.
   • Models permanent, transient, and intermittent faults.
3. Implementation media, size, and capacity
   • The front end and computation modules are written in Pascal.
   • The graphics output module is written in FORTRAN but uses the graphics library TEMPLATE. (SURE can be installed and used without the graphics output module.)


   • Runs under VMS 3.7 on a VAX-11/750 and a VAX-11/780.
   • Solutions are obtained from an algebraic computation method, which is computationally more efficient than methods that solve differential equations or perform simulation.
4. Basic model: Semi-Markov model
5. System types modeled: Both repairable and nonrepairable systems
6. Genealogy: Developed at the NASA Langley Research Center. It is based on work by White [1985], who developed a new mathematical theorem that enables efficient computation of bounds for the death-state probabilities of a large family of semi-Markov models using the means and variances of the transitions. Subsequent work by Lee [1985] added a generalized method for bounding the probability of entering a death state. Benefited from work on prior models such as ARIES, SURF, and CARE-III. Like CARE III, it is assumed that fault occurrences can be modeled by slow exponential transitions, and recovery processes can be modeled by fast general transitions.
7. Inputs: Instead of using fault trees or a PMS description for input, an abstract language was created [Butler 1986a] for defining the set of rules used to create the state transition matrix. This language was implemented in the ASSIST program [Johnson 1986], which is written in Pascal and runs on a VAX 11/750. ASSIST is actually independent of SURE and can be used with other tools. The language consists of five types of statements:
   a. The constant-definition statement
   b. The SPACE statement
   c. The START statement
   d. The DEATHIF statement
   e. The TRANTO statement

A constant-definition statement equates an identifier consisting of alphamerics to a number. Examples are

   LAMBDA = 0.0034;
   RECOVER2 = 0.002;

Constants can also be defined in terms of previously defined constants. The generic syntax is

   "ident" = "expression";

where “ident” is a string of up to eight characters and digits beginning with a character, and “expression” is an arbitrary mathematical expression using constants and any combination of mathematical or logical operations. The SPACE statement specifies the state space on which the Markov model is defined as an n-dimensional vector, where each component of the vector defines an attribute of the system being modeled. For example, if the state space consists of two elements, number of processors working (NW) and number of processors failed (NF), where the total number of processors is three for the TMR example, assuming the voter is nonfailing, then the abstract language statement is SPACE = (NW:ARRAY[l NF:ARRAY[l

.. 31 OF 0 .. 3, .. 31 OF 0 .. 3);

The 0..3 represents the range of values over which the components can vary. ARRAY allows each processor to be uniquely identified. The START statement indicates which state is the start state of the model that represents the initial state of the system. For the TMR example,

   START = (1, 1, 1, 0, 0, 0);

The DEATHIF statement specifies which states are death states (i.e., trapping states in the model). For example, in Figure 6, there is only one death state, (4) system failed. This could be represented by the statement

   DEATHIF (NF > NW);

The TRANTO statement is used to generate all of the transitions required for a model recursively. For the TMR example, if states 1, 2, and 3 are merged into one state:

   IF NW = 3 TRANTO (NW - 1, NF + 1) BY NW * LAMBDA;

8. Outputs
   • Upper and lower bounds on the probability of total system failure
   • Probability bounds for each death state in the model
   • List of every path in the model and its probability of traversal
9. Major restrictions: The focus is on reliability and availability. There is no attempt to provide many different types of capabilities, such as maintenance capability.
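The recursive character of the TRANTO statement described above can be sketched as a small state-space exploration. This is not ASSIST itself; the rule, rate, and state encoding mirror the TMR example for illustration only.

```python
LAMBDA = 0.0034   # from the constant-definition example above

def is_death(state):
    nw, nf = state            # working and failed processor counts
    return nf > nw            # DEATHIF NF > NW

def tranto(state):
    nw, nf = state
    # TRANTO (NW - 1, NF + 1) BY NW * LAMBDA
    return [((nw - 1, nf + 1), nw * LAMBDA)]

start = (3, 0)                # START state: three working, none failed
frontier, visited, transitions = [start], {start}, []
while frontier:
    state = frontier.pop()
    if is_death(state):
        continue              # trapping state: generate no outgoing transitions
    for successor, rate in tranto(state):
        transitions.append((state, successor, rate))
        if successor not in visited:
            visited.add(successor)
            frontier.append(successor)

for src, dst, rate in transitions:
    print(src, "->", dst, "at rate", rate)
```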

2.2.10 Other Tools

There are several tools not mentioned in previous sections because less information was available; some of these are discussed here. One tool is ARM (automated reliability modeling) [Liceaga and Siewiorek 1986], which is currently under development at CMU. The intent of this tool is to generate reliability and availability Markov models automatically for arbitrary interconnection structures at the processor-memory-switch (PMS) level (utilizing work done by Kini and Siewiorek [1982]). The output of ARM will be a file containing the reliability or availability state transition matrix. The user can specify the format of the output to serve as input to one of the evaluation programs SURE, HARP, or ARIES. The following inputs have been defined for ARM:

1. Component types: List describing the types of components in the PMS structure. There are nine classes a type declaration can contain, with the first two classes being required in every declaration.
   a. TYPE: name of the component type.
   b. HARD: the permanent failure rate λ.
   c. TRANSIENTS: consists of two rates. One is the transient failure rate τ; the second is the transient duration rate δ.
   d. INTERMITTENTS: consists of three rates: (1) the intermittent failure rate ι, (2) the intermittent benign rate β, and (3) the active rate α.
   e. COVERAGE: fault coverage C expressed as the probability that the system can survive a fault in this type of component and successfully recover.
   f. REPAIR: the repair rate (μ) at which components of this type are repaired and returned to service.
   g. RECOVERY: the rate (ρ) at which the system can detect, isolate, and reconfigure from faults in components of this type by using a shadow (a hot or powered-up spare that is imitating the active component).
   h. SHADOW: the rate (σ) at which the system can provide a shadow.
   i. DEGRADATION: the rate (θ) at which the system can gracefully degrade by eliminating one redundant group of components that are all of this type. Degradation is necessary when a group component fails, there are no spares to replace it, and the number of these groups is above the minimum requirements for the system.
2. Redundant groups: List specifying any redundant group of components in the system.
3. System watchdog timers: List specifying which component type or group of components acts as watchdog timer for the system and the rate at which the watchdog can restart the system.
4. PMS structure: An interconnection list of the PMS structure.
5. Intracomponent port connections: List specifying the internal port connectivity of some components and/or component types. This facilitates analyzing which component failures will prevent communication between critical components and therefore cause system failure.
6. Intracomponent-type communication: List specifying the component types for which communication between components of like type is necessary.

7. Component clustering: List specifying which components form clusters (i.e., subsystems with their own separate requirements).
8. System functionality requirements: Succinct statement of the minimum set of critical component types and/or component groups that are required for the system to be operational.

A second RAS evaluation tool is the SUPER software system developed by AT&T Bell Laboratories [Leon and Tortorella 1986]. SUPER (system used for prediction and evaluation of reliability) is intended to help designers do reliability prediction, reliability allocation, and reliability analysis and to manage the process of part reliability data acquisition. This is a Markov-based model that accepts as input the reliability block diagram for the system. The input structures permitted are series, parallel, Wheatstone bridge, k-out-of-n cold standby systems, and any hierarchical combination of these. More than seven built-in distributions can be used to describe the reliability of the "atomic" blocks of the diagram, or blocks can be described by a parts list with reliability parameters specified for each individual part. Repair times are an input for maintained systems. For a nonrepairable system, SUPER gives the values of the system life distribution (time to failure or probability of system failure as a function of time), the system survivor function (system reliability or probability of survival as a function of time), and the instantaneous failure rate at a finite number of points in the study interval determined by start time, end time, and increment, as well as the mean and standard deviation of the time to the first system failure. For maintained systems, two perspectives are given: mission reliability and basic reliability. Mission reliability provides output that represents the view of the system as seen by the system user. Mission reliability includes availability and the number of failures that will occur. Basic reliability provides output that represents the view of the system as seen by the service personnel.


Basic reliability includes the number of maintenance actions or failures that require maintenance attention. Five models to describe the reliability of a maintained system are currently available. The instantaneous system renewal model and the instantaneous system revival mission reliability model are used to estimate the expected number of system failures under different assumptions about how a repair is made. The instantaneous maintainable block revival basic reliability model and the maintainable block alternating renewal mission reliability model permit repair time distributions to be specified to describe the time it takes to repair or replace a maintainable block. The mission time model is used to estimate system availability and downtime. Sensitivity analysis can be done varying the architecture and component reliability. Thus, SUPER is designed to help the reliability engineer generate the information on which reliability improvement decisions are based.

A third tool is SURF [Costes et al. 1981], which is similar to ARIES in capability, except that SURF allows the use of time-varying failure rates and has a different approach to constructing Markov models. SURF was developed at the Laboratoire d'Automatique et d'Analyse des Systemes du CNRS in France. It can be used to perform reliability, availability, and maintainability evaluations of fault-tolerant systems with either constant or nonconstant transition rates. Rather than use semi-Markov processes like CARE-III and SURE to handle nonconstant transition rates, SURF transforms non-Markov processes into Markov processes by using the Coxian method of stages, which adds fictitious states to the model [Cox 1955]. This approximation method was chosen to facilitate performing sensitivity analysis, but it is susceptible to state explosion. In order to avoid the state explosion, two techniques are used: state merging and the truncation of states with low probability. SURF provides analytical solutions only and avoids simulation. The entire SURF program contains approximately 3000 PL/1 instructions. The executable modules require 190 Kbytes, and the interactive portion needs the software support of a time-sharing system such as IBM TSO.

A fourth software tool, METFAC, is designed to model and evaluate complex fault-tolerant computing systems [Carrasco and Figueras 1986]. The code consists of approximately 6000 lines of FORTRAN 77 and Pascal and runs on a VAX 750. The user inputs a set of production rules that results in a Markov model being generated. Generative specification, a high-level behavioral description, is used for constructing the evaluated transition digraph model. The generative specification has the following items:

PE-set of integer variables called structural parameters PF-set of real variables called functional parameters VE-set of integer variables called state variables R-set of production r&S rk A-set of positive real functions Xk (PE, PF, VE) called action rates C-set of positive real functions c” (PE, PF, VE) called response probabilities I (PE, PF, VE)-positive real function called index if ok (PE, VE) then if a: (PE, VE) then VE t if i,kd (PE, VE) then VE t end Table 1. Tool

st (PE, VE) syhk

(PE,VE)
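How a generative specification of this kind yields a Markov chain can be sketched as a breadth-first expansion of the reachable state space: starting from an initial assignment of the state variables, every rule whose guard holds is fired to produce successor states and transition rates. The fragment below is our own schematic of that idea, not METFAC's FORTRAN/Pascal code, and it simplifies the rule format to a (guard, update, rate) triple.

```python
from collections import deque

# Our own schematic of rule-driven Markov chain generation (not METFAC code).
def build_markov_chain(initial, rules):
    """Breadth-first expansion of the state space reachable from `initial`."""
    states = {initial: 0}
    transitions = []                        # (from_index, to_index, rate)
    queue = deque([initial])
    while queue:
        s = queue.popleft()
        for guard, update, rate in rules:
            if guard(s):
                t = update(s)
                if t not in states:         # first visit: assign an index
                    states[t] = len(states)
                    queue.append(t)
                transitions.append((states[s], states[t], rate(s)))
    return states, transitions

# Invented example: n identical units and a single repair facility; the
# state variable is simply the number of working units.
n, lam, mu = 3, 1e-3, 0.1
rules = [
    (lambda s: s > 0, lambda s: s - 1, lambda s: s * lam),  # one unit fails
    (lambda s: s < n, lambda s: s + 1, lambda s: mu),       # one unit repaired
]
states, trans = build_markov_chain(n, rules)
print(len(states), "states,", len(trans), "transitions")    # 4 states, 6 transitions
```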

The basic model generated is a Markov model. METFAC can be used to model both repairable and nonrepairable systems. The outputs for METFAC include the following:

All
  Constant
    - Steady-state availability
    - Mean time up
    - Mean time in failure
    - Mean time between failures
    - Mean time to first failure
    - Mean time to operation
  Variable dependent
    - Availability
    - Reliability
    - Maintainability
    - Life-cycle reliability
    - Life-cycle maintainability

Performance related
  Constant
    - Steady-state expected performance
    - Mean service to first failure
    - Mean service during operation
  Variable dependent
    - Expected performance
    - Serviceability
    - Life-cycle serviceability

Cost related
  Constant
    - Steady-state expected cost rate
  Variable dependent
    - Expected cost rate
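The constant (steady-state) measures in this list follow from the stationary distribution of the generated chain, which Table 1 notes METFAC obtains by LU decomposition. A minimal sketch of that step, using our own three-state machine-repair example rather than a METFAC-generated model (NumPy's direct solver performs an LU factorization internally):

```python
import numpy as np

# Our own 3-state example: 0 = both units up, 1 = one up, 2 = system down.
lam, mu = 1e-3, 0.1
Q = np.array([[-2 * lam,       2 * lam,  0.0],
              [      mu, -(mu + lam),    lam],
              [     0.0,          mu,    -mu]])

# Solve pi Q = 0 with sum(pi) = 1 by replacing one balance equation with
# the normalization condition; np.linalg.solve uses an LU-based solve.
A = Q.T.copy()
A[-1, :] = 1.0
b = np.zeros(3)
b[-1] = 1.0
pi = np.linalg.solve(A, b)

availability = pi[0] + pi[1]             # steady-state availability
fail_freq = pi[1] * lam                  # rate of system-failure events
mean_time_up = availability / fail_freq  # mean duration of an up period
print(availability, mean_time_up)
```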

The summary of tool characteristics is shown in Table 1.
Table 1. Summary of Characteristics of Software Tools for Evaluating RAS

ARIES-82
  Application: Repairable/nonrepairable homogeneous subsystems; temporary and permanent faults
  Input: Exponential failure and repair distributions; physical and logical structure needed to build the Markov process
  Models: Continuous-state homogeneous Markov process
  Model solutions: Steady state: eigenvalues; transient: eigenvalues; FEHM: analytic

CARE-III
  Application: Very large nonrepairable flight control systems; temporary and permanent faults
  Input: Weibull or exponential failure distribution; fault tree
  Models: Continuous nonhomogeneous Markov chain; semi-Markov model
  Model solutions: Transient: convolution integral; FEHM: numerical

GRAMP and GRAMS
  Application: Repairable and nonrepairable systems; maintainable systems; permanent faults; catastrophic discrete events; various maintenance strategies; life-cycle costs
  Input: Piecewise time-varying failure rates; reliability requirements; input at the module, subsystem, and system level; physical and logical structure information; maintenance strategies; maintenance costs; removal/shipping costs
  Models: Continuous-time homogeneous Markov process; Monte Carlo discrete-event digital simulator
  Model solutions: Analytic: conjugate gradient algorithm for sparse matrix inversion; simulation

HARP
  Application: Large repairable and nonrepairable flight control systems; temporary and permanent faults
  Input: Any failure distribution; fault tree; Markov process representation in graphic form; ESPN model inputs
  Models: Seven fault/error handling models; homogeneous and nonhomogeneous Markov models
  Model solutions: Transient: Runge-Kutta; FEHM: simulation, analytic

MARK1
  Application: Nonrepairable systems
  Input: Poisson failure distribution; Markov chain description
  Models: Discrete-state homogeneous Markov chain
  Model solutions: Steady state: integration of difference equation

METASAN
  Application: Repairable and nonrepairable systems
  Input: Description of stochastic activity network
  Models: Stochastic activity networks
  Model solutions: Analytical, steady state: (1) Gaussian elimination, (2) iterative method using sparse matrix techniques; simulation: discrete-event, next-event time advance

METFAC
  Application: Repairable/nonrepairable systems
  Input: Production rules
  Models: Markov
  Model solutions: Steady state: LU decompositions; transient: state dissolution

SAVE
  Application: Repairable/nonrepairable computer systems; permanent faults; maintenance strategies
  Input: Exponential failure and repair distribution; fault tree; input language to describe Markov chain
  Models: Fault trees; continuous-state homogeneous Markov chain
  Model solutions: Analytical, steady state: successive overrelaxation (SOR); transient: method of randomization; simulation

SHARPE
  Application: Repairable and nonrepairable computer systems
  Input: Exponential polynomial distribution; multiple levels of models can be specified
  Models: Series-parallel RBD or directed graphs; fault trees; continuous-state homogeneous Markov chain; semi-Markov chain
  Model solutions: Steady state: SOR; transient: Laplace

SUPER
  Application: Repairable and nonrepairable systems
  Input: RBD; hierarchical structure; multiple distributions
  Models: Markov
  Model solutions: Not mentioned

SURE
  Application: Repairable and nonrepairable systems; temporary and permanent faults
  Input: Markov chain description
  Models: Semi-Markov
  Model solutions: Computation of bounds using means and variances of transitions

SURF
  Application: Repairable and nonrepairable systems
  Input: Transition matrix
  Models: Markov processes with stages and "fictitious" events
  Model solutions: Transient: Laplace

3. CONCLUSIONS AND FUTURE DIRECTIONS

We have presented a survey of state-of-the-art tools for evaluating reliability, availability, and serviceability. All the software tools discussed are highly sophisticated and require a user with a high degree of expertise in reliability engineering and computer design. Supplying the inputs for such models is a major task, and properly interpreting and using the outputs requires considerable resources and skill. Early software tools were particularly labor intensive in the input generation phase. Focus on solving this deficiency is reflected in the more recently developed software tools such as GRAMP, SHARPE, and ARM.

The most monumental task faced in using a RAS modeling tool is obtaining realistic input parameters that accurately predict the system dependability. There are many possible causes for system behavior to be radically different from the solution derived using one of the software tools. Inaccurate failure rates are one major cause. It has been shown that, depending on which failure rate model is used, failure rate predictions for the same electronic component can vary by as much as a factor of 6000 [Spencer 1986]. Inadequate testing is another major cause of RAS predictions being dramatically different from the actual performance. Without adequate testing, the probability of either a design bug in the software or
hardware or a manufacturing defect that causes the system to fail is very high. Inadequate testing thus leads to a violation of the most basic assumption of a reliability prediction, which is that the system is assumed to be defect free when put into operation. If there is insufficient discipline in the design and manufacturing process, the RAS predictions are worthless for that system.

A third major cause of errors in RAS predictions is errors in the models used. Smotherman [1984] provides a detailed analysis of these errors. Structural errors can appear in the form of an incorrect number of states and/or transitions and initial state uncertainty. Parametric errors can appear due to limitations of the representation of failure rates and inaccurate coverage factors. Model solution errors can occur due to approximation errors, truncation, and round-off errors.

Beyond being able to predict the RAS performance for a system accurately, these software tools can be used very effectively to make design decisions. Making design decisions requires a comparative study to be made among several alternatives. The focus is on the relative dependability and performance of the designs being considered at a given cost. In such instances, the actual numbers for reliability, MTBF, MTTR, and availability are not as important as the relative values obtained from comparing the designs.
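As a small worked instance of such a comparison, consider two hypothetical designs and the steady-state availability A = MTTF/(MTTF + MTTR); the figures below are invented for illustration. Even if both MTTF predictions were biased by a common factor, the ranking of the designs would be unchanged, which is why relative comparisons are more robust than the absolute numbers.

```python
# Invented figures for two hypothetical designs (hours).
designs = {
    "design A": {"mttf": 8000.0, "mttr": 4.0},
    "design B": {"mttf": 6000.0, "mttr": 1.0},
}
for name, d in designs.items():
    a = d["mttf"] / (d["mttf"] + d["mttr"])   # steady-state availability
    print(f"{name}: A = {a:.6f}")
# design A: A = 0.999500; design B: A = 0.999833 -- the design with the
# faster repair wins despite its lower MTTF.
```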

The development of models is still very much an evolutionary process. The tools have evolved from using only combinatorial models such as reliability graphs and fault trees to adding Markov models and extensions of stochastic Petri nets to their arsenal. As the systems being evaluated become larger and more complex, simulation becomes an essential part of the process. A judicious balance between analytic and simulation approaches is necessary to achieve the desired accuracy. An additional capability to do sensitivity analysis is available in many of the tools.

There is a need for future tools to be able to translate confidence intervals on input parameters into confidence intervals on the output in order to provide a better assessment of the risk involved. There is also a need for future tools to include a maintenance cost evaluation as GRAMP does. Whether viewed from the vendor's point of view or the user's, maintenance cost is an essential element in any RAS evaluation. There should be more focus in the future on integrating both hardware and software RAS evaluations into one system model. Specific work was done for CARE III [Nagel 1982] on the software element, but combining hardware and software fault handling has not appeared very often as a major research activity.

Based on the current activity, tools with more capabilities and capacity can be expected in the future. The software tools discussed in this paper represent significant achievements along the path to providing more effective RAS (dependability) evaluation tools. Tools should be able to handle all types of faults (intermittent, transient, and permanent) and catastrophic events. Tools should be able to handle all types of architectures: static and redundant architectures, hybrid redundant architectures, and multiprocessor systems connected by buses, crossbar switches, or multistage interconnection networks. Tools should be able to handle imperfect coverage, various repair strategies, various maintenance support strategies, and costs. Software RAS considerations should be integrated into the tools for both fault occurrence and fault handling. Tools should transform a problem to the simplest model possible to achieve a solution given the requirements of the problem. This would take advantage of the apparent isomorphism that exists between the different behavioral models when the problem is appropriately constrained.

Tools for evaluating RAS should offer a totally integrated package to the user, which includes capabilities for statistical analysis. Except for GRAMP, none of these tools approaches a total solution. Ideally, manual entry of data should be reduced to a minimum and reentry of data avoided entirely. The databases for the system design at the PMS level, component failure rates, software reliability, component costs, and service times should be available to the user without requiring the user to enter the data. The maximum gain from these tools is achieved if they are used during the development phase to influence the logic, system, and physical packaging design and to better understand the environment in which the system will operate. This will increase dramatically the probability that accurate predictions and reliable systems are attained.

ACKNOWLEDGMENTS

We thank many of the tool developers for their comments and information, including Randall Fleming, Ambuj Goyal, Sally Johnson, Jay Lala, John Meyer, Mike Tortorella, and Kishor Trivedi. We would also like to thank the reviewers for their suggestions and insight. This work was supported in part by DARPA under Grant No. N00039-86-C0167, by ONR under Grant No. N0014-86-K0554, and by IBM Corporation.

REFERENCES

ARSENAULT, J. E., AND ROBERTS, J. A. 1980. Reliability and Maintainability of Electronic Systems. Computer Science Press, Rockville, Md.

AVIZIENIS, A. 1975. A unifying reliability model for closed fault-tolerant systems. In Digest of the 5th Annual Symposium on Fault-Tolerant Computing.

BALBO, G., CHIOLA, G., FRANCESCHINIS, G., AND ROET, G. M. 1987. Generalized stochastic Petri nets for the performance evaluation of FMS. In Proceedings of the 1987 IEEE International Conference on Robotics and Automation (Mar. 31-Apr. 3), pp. 1013-1018.

BALL, M. O. 1979. Computing network reliability. Operations Research 27, pp. 823-836.

BARLOW, R. E., AND LAMBERT, H. E. 1975. Introduction to fault tree analysis. In Reliability and Fault Tree Analysis, Theoretical and Applied Aspects of System Reliability and Safety Assessment, Barlow, R. E., Fussell, J. B., Singpurwalla, N. D., Eds. SIAM, Philadelphia, pp. 7-35.

BAVUSO, S. J. 1982. Advanced reliability modeling of fault-tolerant computer-based systems. NASA-TM-84501 (May).

BAVUSO, S. J. 1984. A user's view of CARE III. In Proceedings of the 1984 Annual Reliability and Maintainability Symposium (Jan.), pp. 382-389.

BAVUSO, S. J., DUGAN, J. B., TRIVEDI, K. S., ROTHMANN, E. M., AND SMITH, W. E. 1987. Analysis of typical fault-tolerant architectures using HARP. IEEE Trans. Reliabil. R-36, 2, 176-185.

BEYAERT, B., FLORIN, G., LONC, P., AND NATKIN, S. 1981. Evaluation of computer systems dependability using stochastic Petri nets. In Digest of the 11th Annual Symposium on Fault-Tolerant Computing (June), pp. 79-81.

BOESCH, F. T. 1986. On unreliability polynomials and graph connectivity in reliable network synthesis. J. Graph Theory 10, 3, 339-352.

BOSSEN, D. C., AND HSIAO, M. Y. 1981. ED/FI: A technique for improving computer system RAS. In Digest of the 11th Annual Symposium on Fault-Tolerant Computing (June). IEEE, New York, pp. 2-7.

BOURICIUS, W. G., CARTER, W. C., AND SCHNEIDER, P. R. 1969. Reliability modeling techniques for self-repairing computer systems. In Proceedings of the 24th National Conference of ACM. ACM, New York, pp. 295-309.

BUTLER, R. W. 1986a. An abstract language for specifying Markov reliability models. IEEE Trans. Reliabil. R-35, 5 (Dec.), 595-601.

BUTLER, R. W. 1986b. The SURE reliability analysis program. NASA TM 87593 (Feb.).

CARRASCO, J. A., AND FIGUERAS, J. 1986. METFAC: Design and implementation of a software tool for modeling and evaluation of complex fault-tolerant computing systems. In Digest of the 16th Annual Symposium on Fault-Tolerant Computing (July). IEEE, New York, pp. 424-429.

CHELSON, P. O. 1967. Reliability math modeling using the digital computer. Jet Propulsion Laboratory TR-32-1089 (Apr.).

CHELSON, P. O. 1971. Reliability computation using fault tree analysis. Jet Propulsion Laboratory TR-32-1542 (Dec.).

CHERKASSKY, V., AND MALEK, M. 1987. Graceful degradation of multiprocessor systems. In Proceedings of the 1987 International Conference on Parallel Processing (Aug. 17-21). IEEE Computer Society Press, Washington, D.C., pp. 885-888.

CHERKASSKY, V., OPPER, E., AND MALEK, M. 1984. Reliability and fault diagnosis analysis of fault-tolerant multistage interconnection networks. In Digest of the 14th Annual Symposium on Fault-Tolerant Computing (June 20-22). IEEE, New York, pp. 246-251.

COLBOURN, C. J. 1986. Exact algorithms for network reliability. Congressus Numerantium 51 (March), pp. 7-57.

COMPUTER SCIENCES CORPORATION. 1970. RELAN: Reliability analysis package. CSC Sales Brochure No. 333.

CONN, R., MERRYMAN, P., AND WHITELAW, K. 1977. CAST: A complementary analytic-simulative technique for modeling complex fault-tolerant computing systems. In Proceedings of the AIAA Computer Aerospace Conference (Nov.). AIAA, New York.

CONWAY, A. E., AND GOYAL, A. 1986. Monte Carlo simulation of computer system availability/reliability models. IBM Research Rep. RC 12459 (Dec. 15).

COSTES, A., DOUCET, J. E., LANDRAULT, C., AND LAPRIE, J. C. 1981. SURF: A program for dependability evaluation of complex fault-tolerant computing systems. In Digest of the 11th Annual Symposium on Fault-Tolerant Computing (June). IEEE, New York, pp. 72-78.

COX, D. R. 1955. A use of complex probabilities in the theory of stochastic processes. In Proceedings of the Cambridge Philosophical Society, vol. 51. Cambridge University Press, Cambridge, U.K., pp. 313-319.

DEPARTMENT OF DEFENSE. 1986. Military Handbook: Reliability Prediction of Electronic Equipment. MIL-HDBK-217E (Oct.).

DEPARTMENT OF DEFENSE. 1981. Military Standard: Reliability Modeling and Prediction. MIL-STD-756B (Nov. 18).

DHILLON, B. S., AND SINGH, C. 1978. Bibliography of literature on fault-trees. Microelectron. Reliability 17, 501-503.

DOLNY, L. J., FLEMING, R. E., AND DE HOFF, R. L. 1983. Fault-tolerant computer system design using GRAMP. In Proceedings of the 1983 Annual Reliability and Maintainability Symposium (Jan.). IEEE, New York, pp. 417-422.

DUGAN, J. B., TRIVEDI, K. S., GEIST, R., AND NICOLA, V. F. 1985. Extended stochastic Petri nets: Applications and analysis. In Performance '84: Models of Computer System Performance. North-Holland, Amsterdam.

FLEMING, J. L. 1971. RELCOMP: A computer program for calculating system reliability and MTBF. IEEE Trans. Reliabil. R-20, 3 (Aug.).

FLEMING, R. E. 1980. Coherent system repair models. Ph.D. dissertation, T.R. 195, Dept. of Operations Research and Dept. of Statistics, Stanford Univ., Stanford, Calif.

FLEMING, R. E., AND DOLNY, L. J. 1984. Fault-tolerant design-to-specs with GRAMP & GRAMS. In Proceedings of the 1984 Annual Reliability and Maintainability Symposium (Jan.). IEEE, New York, pp. 403-408.

GEIST, R. M., AND TRIVEDI, K. S. 1983. Ultrahigh reliability prediction for fault-tolerant computer systems. IEEE Trans. Comput. C-32, 12 (Dec.).

GOYAL, A., CARTER, W. C., DE SOUZA E SILVA, E., LAVENBERG, S. S., AND TRIVEDI, K. S. 1986. The system availability estimator. In Digest of the 16th Annual Symposium on Fault-Tolerant Computing (July). IEEE, New York, pp. 84-89.

GOYAL, A., LAVENBERG, S. S., AND TRIVEDI, K. S. 1985. Probabilistic modeling of computer system availability. IBM Research Rep. RC 11076 (Mar. 27).

HEALY, J. 1986. Modeling IC failure rates. In Proceedings of the Annual Reliability and Maintainability Symposium. IEEE, New York, pp. 307-311.

HILLIER, F. S., AND LIEBERMAN, G. J. 1980. Introduction to Operations Research, 3rd ed. Holden-Day, Oakland, Calif.

JAGER, R., AND KRAUSE, G. S. 1987. Generic automated model for early MTTR predictions. In Proceedings of the 1987 Annual Reliability and Maintainability Symposium (Jan.). IEEE, New York, pp. 280-285.

JOHNSON, A. M., JR., MENEZES, B., MALEK, M., YAU, K. H., AND JENEVEIN, R. 1988. Options for achieving fault tolerance in multiprocessor interconnection networks. IBM Technical Rep. TR 51.0432 (May).

JOHNSON, S. C. 1986. ASSIST user's manual. NASA Technical Memorandum 87735 (Aug.).

KECECIOGLU, D. 1986. Maintainability Engineering. The University of Arizona, Tucson.

KINI, V., AND SIEWIOREK, D. P. 1982. Automatic generation of symbolic reliability functions for processor-memory-switch structures. IEEE Trans. Comput. C-31, 8 (Aug.), 752-770.

KNUTH, D. E. 1981. The Art of Computer Programming. Vol. 2, Seminumerical Algorithms, 2nd ed. Addison-Wesley, Reading, Mass.

LALA, J. H. 1983a. MARK1 Markov modeling package. The Charles Stark Draper Laboratory, Cambridge, Mass.

LALA, J. H. 1983b. Interactive reductions in the number of states in Markov reliability analysis. In Proceedings of the AIAA Guidance and Controls Conference (Aug.). AIAA, New York.

LAW, A. M., AND KELTON, W. D. 1982. Simulation Modeling and Analysis. McGraw-Hill, New York.

LAZOWSKA, E. D., ZAHORJAN, J., GRAHAM, G. S., AND SEVCIK, K. C. 1984. Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Prentice-Hall, Englewood Cliffs, N.J.

LEE, L. D. 1985. Reliability bounds for fault-tolerant systems with competing responses to component failures. NASA TP-2409.

LEON, R. V., AND TORTORELLA, M. 1986. The SUPER software system for reliability modeling and prediction. AT&T Bell Laboratories, Holmdel, N.J.

LICEAGA, C. A., AND SIEWIOREK, D. P. 1986. Towards automatic Markov reliability modeling of computer architectures. NASA Technical Memorandum 89009 (Aug.).

MAKAM, S. V., AND AVIZIENIS, A. 1982. ARIES 81: A reliability and life-cycle evaluation tool for fault-tolerant systems. In Digest of the 12th Annual Symposium on Fault-Tolerant Computing (June). IEEE, New York, pp. 267-274.

MAKAM, S., AVIZIENIS, A., AND GRUSAS, G. 1982. UCLA ARIES 82 users' guide. Tech. Rep. CSD-82030, Computer Science Dept., Univ. of California, Los Angeles.

MARSAN, M. A., CONTE, G., AND BALBO, G. 1984. A class of generalized stochastic Petri nets for the performance evaluation of multiprocessor systems. ACM Trans. Comput. Syst. 2, 2 (May), 93-122.

MATHUR, F. P. 1972. Automation of reliability evaluation procedures through CARE: The computer-aided reliability estimation program. In Proceedings of the Fall Joint Computer Conference, vol. 41. AFIPS, pp. 65-82.

MOLLOY, M. K. 1981. On the integration of delay and throughput measures in distributed processing models. Ph.D. dissertation, Univ. of California, Los Angeles.

MOVAGHAR, A., AND MEYER, J. F. 1984. Performability modeling with stochastic activity networks. In Proceedings of the Real-Time Systems Symposium (Dec. 4-6). IEEE, New York, pp. 215-224.

NAGEL, P. M. 1982. Software reliability: Repetitive run experimentation and modeling. NASA CR-165836. Boeing Computer Services Co.

NG, Y. W., AND AVIZIENIS, A. 1977. ARIES: An automated reliability estimation system. In Proceedings of the 1977 Annual Reliability and Maintainability Symposium (Jan.), pp. 108-113.

PETERSON, J. L. 1977. Petri nets. ACM Comput. Surv. 9, 3 (Sept.), 223-252.

PROVAN, J. S., AND BALL, M. O. 1984. Computing network reliability in time polynomial in the number of cuts. Operations Research 32, pp. 516-526.

RENNELS, D. A., AND AVIZIENIS, A. 1973. RMS: A reliability modeling system for self-repairing computers. In Digest of the 3rd Annual Symposium on Fault-Tolerant Computing. IEEE, New York, pp. 131-135.

ROTH, J. P., BOURICIUS, W. G., CARTER, W. C., AND SCHNEIDER, P. R. 1967. Phase II of an architectural study for a self-repairing computer. SAMSO TR-67-106 (Nov.).

ROSS, S. M. 1983. Stochastic Processes. John Wiley & Sons, New York.

RUTLEDGE, R. A. 1985. Models for the reliability of memory with ECC. In Proceedings of the 1985 Annual Reliability and Maintainability Symposium (Jan.). IEEE, New York, pp. 57-62.

SAHNER, R. A., AND TRIVEDI, K. S. 1985. SPADE: A tool for performance and reliability evaluation. Rep. No. AFOSR-TR-85-0745 (July).

SAHNER, R. A., AND TRIVEDI, K. S. 1986. A hierarchical, combinatorial-Markov method of solving complex reliability models. In Proceedings of the 1986 Fall Joint Computer Conference (Nov. 2-6). AFIPS, New York, pp. 817-825.

SAHNER, R. A., AND TRIVEDI, K. S. 1987. Reliability modeling using SHARPE. IEEE Trans. Reliabil. R-36, 2 (June), 186-193.

SANDERS, W. H., AND MEYER, J. F. 1986. METASAN: A performability evaluation tool based on stochastic activity networks. In Proceedings of the 1986 Fall Joint Computer Conference (Nov. 2-6). AFIPS, New York, pp. 807-816.

SAS INSTITUTE, INC. 1985. SAS User's Guide: Basics, Version 5 ed. Cary, N.C.

SATYANARAYANA, A., AND HAGSTROM, J. N. 1981. A new algorithm for the reliability analysis of multiterminal networks. IEEE Trans. Reliabil. R-30, 4 (Oct.), 325-334.

SATYANARAYANA, A., AND WOOD, R. K. 1985. A linear-time algorithm to compute the reliability of planar cube-free networks. SIAM J. Comput. 14, 818-832.

SAUER, C. H., MACNAIR, E. A., AND KUROSE, J. F. 1982. The research queueing package: Past, present and future. In Proceedings of the National Computer Conference (June).

SHOOMAN, M. L. 1970. The equivalence of reliability diagrams and fault-tree analysis. IEEE Trans. Reliabil. R-19, 2 (May), 74-75.

SHOOMAN, M. L., AND LAEMMEL, A. E. 1987. Simplification of Markov models by state merging. In Proceedings of the 1987 Annual Reliability and Maintainability Symposium (Jan.). IEEE, New York, pp. 159-164.

SIEWIOREK, D. P., AND SWARZ, R. S. 1982. The Theory and Practice of Reliable System Design. Digital Press, Bedford, Mass.

SMITH, T. B., III, AND LALA, J. H. 1986. Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Vol. IV, FTMP executive summary. NASA-CR-172286 (Feb.).

SMOTHERMAN, M. K. 1984. Parametric error analysis and coverage approximations in reliability modeling. Ph.D. dissertation, Dept. of Computer Science, Univ. of North Carolina.

SPENCER, J. L. 1986. The highs and lows of reliability predictions. In Proceedings of the Annual Reliability and Maintainability Symposium. IEEE, New York, pp. 156-162.

STIFFLER, J. J., BRYANT, L. A., AND GUCCIONE, L. 1979. CARE III final report phase I, volumes I & II. NASA Contractor Rep. 159122 & 159123 (Nov.).

TRIVEDI, K. S. 1982. Probability & Statistics with Reliability, Queuing, and Computer Science Applications. Prentice-Hall, Englewood Cliffs, N.J.

TRIVEDI, K. S., DUGAN, J. B., GEIST, R., AND SMOTHERMAN, M. 1984. Hybrid reliability modeling of fault-tolerant computer systems. Int. J. Comput. Electr. Eng. 11, 2/3, 87-108.

WHITE, A. L. 1985. Synthetic bounds for semi-Markov reliability models. NASA CR-178008.

WILKOV, R. 1972. Analysis and design of reliable computer networks. IEEE Trans. Commun. COM-20, 3 (June), 660-678.

Received November 1987; final revision accepted June 1988.
