Varaince reduction in mean time to failure ... - Stanford University

Proceedings

of the 1988 Winter

Simulation

Conference

M. Abrams, P. Haigb, and J. Comfort (eds.)

Varaince

reduction

in mean

time

to failure

simulations

Perwez Shahabuddin **I Victor F. Nicola * Philip Heidelberger * Ambuj Goyal * Peter W. Glynn **2 * IBM Research Division T.J. Watson Research Center P.O. Box 704 Yorktown Heights, New York 10598 ** Department of Operations Research Stanford University Stanford, California 94305

ABSTRACT

with complexity

of the system. Thus conventional

numerical

solution techniques are only feasible for relatively small mod-

We describe two variance reduction methods for estimating the mean time to failure (MTTF) in Markovian models of highly reliable systems. The first method is based on a ratio representation of the MTTF and employs importance sampling. The second method is based on a hybrid simulation/analytic technique where the number of simulated transitions are reduced by computing partial results analytically. Experiments with a large example show the effectiveness of both techniques for highly reliable systems.

els, i.e., simple systems. Simulation analysis is an alternative approach,

however,

because system failures

are rare, ex-

tremely long simulations may be required in order to obtain accurate estimates of the measures of interest. Our goal is to obtain variance reduction methods which are applicable to a broad class of models. Specifically, we are in-

1. INTRODUCTION The requirement

terested in models defined by the reliability

for highly reliable systems, such as airborne

modeling language described in [7], so that the techniques can

computing systems, is increasing the importance of reliability and mean time to failure (MTTF) phase of these systems.

be implemented in a software package and made available to

prediction during the design

These systems are different

designers in an automatic and transparent fashion.

than

system contains multiple

highly available systems, such as banking or airline reservation systems, where continuous important.

operation

(availability)

I

2

Markov

nation of component

While such systems

performed

types with redundant

Failure of these systems is

failures leading to a system failure.

Failed components may be repaired.

If all components are

repairable and component failures are Poisson, then a regen-

model increases rapidly

This work was performed while this author was visiting IBM Research. This work was partially

A typical

usually caused by exhaustion of redundancy or by a combi-

can typically be modeled as Markov chains (see, e.g., [6]), the size of the corresponding

component

units for each component type.

is more

In highly reliable systems any system failure dur-

ing the mission causes a mission failure.

and availability

while this author was a visiting scientist at IBM Research

491

erative state for the. system (see, e.g., [4]) is the state where

pendently

all units of all component types are operational.

If, in addi-

(actually, importance sampling was not used at all to estimate

tion, all repair times are exponentially

then, typi-

the denominator).

cally, a continuous

time Markov

distributed,

chain is obtained.

As

which resulted in additional

variance

reduction

We call this technique Measure Specific

DIS (MSDIS).

recommended in [ 1 I], we transform the continuous time chain

The importance sampling technique discussed in previous pa-

into an appropriate

pers does not yield significant

estimation

discrete time Markov

of steady state availability

chain.

In [S] the

for such systems was

mating the MTTF

variance reductions

using the standard simple estimator.

discussed. In this paper, we discuss estimation of the MTTF.

intuuive

Specifically,

events occur more often by choosing an appropriate

we consider an importance

sampling

(see, e.g., [9]) and a hybrid simulation/analytic Importance

method

method.

for estiAn

reason for this is as follows. When we make failure

tance sampling distribution,

impor-

the value of the estimator ends

sampling for rare event simulation has been suc-

up smaller than the actual value and, in addition, the likeli-

cessfully used in [2], [3], [8], [12], [ 131 and [ 141. Proper se-

hood ratio is less than one. This actually ends up producing

lection of the importance sampling distribution

variance rather than reducing it. However, if we formulate the

makes the rare

events more likely to occur which results in a variance re-

MTTF problem as a ratio of two expectations

duction.

Section

The key, of course, is to choose a good importance

sampling distribution.

The theory of large deviations was used

in 131, [13], and [14] to select an effective distribution

2), then significant

variance

achieved using importance sampling.

(as shown in

reductions

can be

In Section 2, both DIS

and MSDIS methods are used to estimate the ratio represen-

for

problems arising in Markov chains with “small increments”,

tation of the MTTF.

random walks, and queueing systems, respectively.

in [8] gives the best results.

Effective heuristics were used in [2] and [12] to select impor-

In Section 3 we outline a hybrid simulation/analytic

tance sampling distributions

esti-

described in [ 101 and describe its application in estimating the

models. For example,

MTTF in simulations of highly reliable systems. In highly re-

for steady state availability

mation in large machine repairmen-like

The MSDIS sampling scheme proposed

four orders of magnitude reduction in variance was reported

liable systems, a large number of transitions

in [2] for a system with 70 components (the Markov chain for

before the system fails.

this model has 270 states).

original Markov

state performance

For regenerative

systems, steady

measures can be expressed as a ratio.

[2], a single importance sampling distribution

was used to es-

chain into a different,

tional of the transformed

Markov

timate both the numerator and denominator of this ratio. The

MTTF.

distribution

used in [2] is dynamic in the sense that it does not

failure is less in the transformed

correspond

directly

chain.

to a time homogeneous Markov

We call this technique Dynamic Importance In [Xl, different

dynamic importance

chain.

Sampling (DIS).

but related, Markov defined func-

chain is identical

to the

Furthermore, the expected number of transitions until

This reduction

chain than in the original

usually translates into a variance re-

duction since, for a given number of total transitions

simu-

lated, more failures occur in the transformed chain than in the

sampling distributions

were used to estimate the numerator and denominator

occur

The hybrid method transforms the

chain. The expected value of an appropriately

In

typically

method

original chain.

inde-

492

In Section 4, we summarize the results, discuss the relative

mator.

The reason is that the naive estimator is effectively

merits of the two methods and indicate future areas of re-

identical to the new estimator, in the sense that the two esti-

search.

mators are probabiistically

identical at the instant at which

The difference

failures occur.

comes from the fact that

2. RATIO METHOD

Equation 2.1 is formulated in terms of the probability

Let jr,, s > 0) be a continuous time Markov chain (CTMC)

rare event ( P{a, < a03 ) which can be estimated effectively

with state space E = 10, 1,2, . ..m (N may be either finite or

using importance sampling.

infinite),

and

transition

rate

Q = (q(ij)).

matrix

Let

As recommended in [S] and [ll],

q(i) = - q(i,i) = Zq(ij)

we simulate only the em-

denote the total rate out of state i. i#l Pick a starting state, say state 0, set Y0 = 0 and for any set B

bedded discrete

let aB be the first entrance time of the CTMC to B. In a reli-

of [X,, n 2 01 is P = (PJ where P, = q(ij)/q(i)

ability

8, = 0.

states (e.g., in a reliability

Let F = E be a set of

states,

then

E[cw,]-E[u,]/P{a,

[ 141 applies

the

sampling (DIS), uses the same importance

merator to

of Equation

2.1.

The second

the numerator and denominator which are then estimated in-

In fact for any

dependently.

CTMC we have the exact relationship

aa,1 =

and denominator

(MSDIS), uses different importance sampling distributions for

P{+ < a,,]. (Since F is a rare set of

states, Pja, < a03 is very close to zero.)

sampling distrib-

method, called measure specific dynamic importance sampling

estimate the numerator E[cr,] and importance sampling to estimate the denominator

for i # j and

ution (and the same sample paths) to estimate both the nu-

approximation

< a03 and then uses direct simulation

The transition matrix

Equation 2.1. The first method, called dynamic importance

state and the set F represents a rare set of Walrand

{X., n 1 01 and use

As in [8], we consider two methods of estimating the ratio in

model they might be the set of

states where the system is failed) for which 0 p F. If state 0 is a regenerative

chain

deterministic holding times hi = l/q(i).

model, we generally take state 0 to be the state for

which all components are operational.

time Markov

of the

In fact, like W&and,

(no importance

sampling)

we use direct simulation

to estimate the numerator

since

very stable estimates of E[ min ((Y,,,c+)] can be obtained via

E[ min (q,, adI PlaF c qJ

(2.1)

direct simulation.

In addition, for such highly reliable systems,

the expected number of transitions in a regenerative cycle is which

is easily

obtained

by considering

the two

cases

typically small. Expressions for the resulting asymptotic vari-

{a, < cyJ and (1~~> aFl, and applying the Markov property

ances are given in [8]; they are closely

attimecu,ifor, 2, systems of equations of size ] S, ] need to be

to be the expected reward earned by the original chain until

solved.

exit from S,). Thus many transitions may be saved since, in

number of states in S, depends on the queueing discipline for

models of highly reliable systems, the expected number of

the repairmen.

transitions to go from S, to Fk may be large.

its own repairman, then ] S, 1 = O(Mz).

On the set Fkr

This may be feasible for small values of k, but the

For example, if each type of component has

lx’,,] and {X,1 are lstochastically identical.

We now consider the application

As discussed in [lo],

estimating the MTTF in two examples.

if transitions in the two chains require

the same amount of computer time, then an approximation variance

the

VR = Var[z’] rience

in

reduction

E[r’,]/Var[Zl reliability

VRZE[~‘J/E[TJ.

is

model of a fault-tolerant

to

of the hybrid method for The fist example is a

system with two types of compo-

ratio

nents and two units of each type. All units are initially oper-

E[T,]. If, as has been our expe-

ational. Units of component types 1 and 2 fail at rates h, and

simulations,

the

Var[Z’]cVar[a,

They are repaired according to a FCFS disci-

of

pline by separate repair facilities at rates y, and pL2.respec-

when there are j failed

tively. The system is operational if at least one unit of any type

If, in the original chain, the probability

a failme before a repair is 0(&,)x0

X,, respectively.

then

then the expected number of transitions until k-l reaching a state with k failed components is 0( l/jv,~j) and

is operational.

thus we expect the hybrid method to produce a variance rek-1 duction of VRXO(~!,&~). If ej = E for allj, then VRXO(@-‘).

method.

This will be verified for several examples.

described in [ll].

Consider now the general class of models described in the In-

obtained

troduction

If we increase the failure rates by setting h, = h, = 10, then

components,

with M types of components

The resulting Markov chain, while having only

nine states, is illustrative

of the performance

We applied the hybrid method with k = 2,3 and

computed the variance ratio numerically

and redundancy.

of the hybrid

using the techniques

For ,n, = pL2= 100 and h, = A, = 1, we

VR = 0.03 for k = 2 and VR = 0.0004 for k = 3.

Figure 1 depicts the transition structure on the set S, and its

VR = 0.25 fork

interface to the set F2. The set S, = 10, 1, . . . , M) with state

variance reductions are due to the reduction in the expected

0 corresponding

sponding to one component of component type i. Because of

number of transitions until failure and are consistent with the k-l prediction that VRXO(~~, E,).

this special structure, Equations 3.1 can be solved analytically

The second example is the model with ten types of compo-

for the case k = 2: flo = (f, + jfrPOd)/(l

- J!,PO#‘J and

nents and two components of each type described in Section

for 1 5 i 5 M. Similarly, if E, is the proba-

2. The repair rate was set to be n = 1 and the component

f’; =A + Paf’,, bility

to no component failures and state i corre-

of exiting

E”j = P&l

- PN)/(l

S, via state j given - lflPr$“)

for

= 2 and VR = 0.028 fork

= 3. Most of the

failure rate was X = 0.001. This model was simulated with

that X, = i, then

1 2 j 5 M, E, = Z’J$, 496

of the MTTF.

The numerator of the ratio is estimated using

direct simulation while the denominator is, independently, timated using importance

sampling.

es-

The second (hybrid)

method combines simulation and numerical techniques. hybrid method requires some preliminary

The

computations,

al-

though in its simplest form these can be done analytically.

A

different, but related, Markov chain is then simulated.

In this

Markov chain, the expected number of transitions until failure may be greatly reduced.

Both methods produced significant

variance reduction in simulations of a large model. Although

the two methods are not necessarily mutually ex-

clusive, the hybrid method, while effective, has a number of disadvantages that reduce its utility Figure 1. Transition structure on the set S,.

when one considers its

implementation

into a broadly applicable

lation package.

First, it requires preliminary

these parameter settings the predicted variance reduction is

specifically designed to estimate the MTTF and requires sim-

F, = 0.019 which is of the same order of magnitude as the

ulation of its own stochastic process. Therefore, the output

observed variance reduction.

from this method may not be useful for estimating other per-

between

the actual and predicted

ductions

is due to the fact that hybrid

variance

method

numerical com-

putations.

crepancy

the

simu-

k = 2 and the estimated variance ratio was VR = 0.036. For

The apparent factor of two dis-

Second,

availability

as described

here is

formance measures, such as as the steady state unavailability.

re-

(as we

Third, the method is best suited for situations in which the

it) requires two transitions, rather than one, to

amount of redundancy is low since, in this case, the expected

exit S, : one to select the exit state in S, and another to select

number of transitions to failure in the modified chain remains

the entrance state in F2. For a general purpose simulator this

reasonable.

represents a realistic implementation

of the hybrid

implemented

method

of the method. For this

More specifically,

the practical implementation

method with k = 2 uses direct

simulation

example, the best variance reduction using MSDIS yields a

whenever the system is in a state with two or more failed

variance

components, whereas the ratio method with importance sam-

ratio

of

0.00035

(from

Table

1,

with

pling quickly moves the system through states with more than

p’ = p” = 0.9).

two failed components to the failed state. Therefore the var-

4. SUMMARY

iance reductions using importance

In this paper we described two methods for estimating the

(and in practice) greater than those using the hybrid method.

MTTF in a class of Markovian

It might appear that since the ratio method requires simulation

models of highly reliable sys-

tems. The first (ratio) method relies on a ratio representation

sampling are potentially

of two processes, it too would be of limited value in estimating

497

other performance

measures.

However,

the technique de-

scribed in [8] for estimating steady state unavailability

also

independently

of a

simulates the numerator and denominator

[:j]

Fox, B.L. and Glynn, P.W. (1986). Discrete-Time Conversion for Simulating Semi-Markov Processes. 0peration.s Research Letters 5, 191-196.

[a

Geist, R.M. and Trivedi, KS. (1983). Ultra-High Reliability Prediction for Fault-Tolerant Computer Systems. IEEE Transactions on Computers C-32, 111 S- 1127.

[71

Goyal, A. and Lavenberg, S.S. (1987). Modeling and Analysis of Computer System Availability. IBM Journal of Research and Development, 31 6,651-664.

ratio (in that case importance sampling is used for the numerator and direct simulation is used for the denominator). thermore, the samr: importance

sampling distribution

Furcan be

used in estimating both the MTTF and the steady state unavailability

rfd Goyal, A., Heidelberger, P. and Shahabuddin, P. (1987).

so that both of these performance measures can be

estimated simultaneously.

to apply this technique to simultaneously performance

Measure Specific Dynamic Importance Sampling for Availability Simulations. 1987 Winter Simulation Conference Proceedings. A. Thesen, H. Grant and W.D. Kelton (eds.). IEEE Press, 351-357.

In addition, it should be possible

measures at different

estimate multiple

parameter

settings (e.g.,

different failure rates), since this is an inherent capability

J.M. and Handscomb, D.C. 191 Hammersley, Monte Carlo Methods. Methuen, London.

of

1101Heidelberger,

P. (1979). A Variance Reduction Technique That Increases Regeneration Frequency. Current Issues in Computer Simulation. N.R. Adam and A. Dogramaci (eds.). Academic Press, Inc., 257-269.

importance sampling. Therefore,

our current research is directed towards further

experimentation

with and extensions of the Importance sam-

pling approach, including sampling distributions

(1964).

improved

1111Hordijk,

A., Iglehart, D.L. and Schassberger, R. (1976). Discrete Time Methods for Simulating Continuous Time Markov Chains. Adv. AppI. Prob. 8, 772-788.

selection of importance

for unbalanced systems, its application

in estimating the failure time and interval availability

1121 Lewis, E.E. and Bohm, F. (1984).

distrib-

lation of Markov Unreliability neering and Design 77, 49-62.

utions, as well as extensions to gradient estimation and slmulations of non-Markovian

Importance Sampling in the 1131 Siegmund, D. (1976). Monte Carlo Study of Sequential Tests. The Annuls of Statistics 4, 673-684.

models.

REFERENCES ill

Chung, K.L. (1967). Markov Chains With Stationary Transition Probabilities, Second Edition. SpringerVerlag, New ‘York.

El

Conway, A.ES. and Goyal, A. (1987). Monte Carlo Simulation of Computer System Availability/Reliability Models. Proceedings of the Seventeenth Symposium on Fault-Tolerant Computing. Pittsburgh, Pennsylvania, 230-235.

Monte Carlo SimuModels. Nuclear Engi-

t141 Walrand, J. (1987). Quick Simulation of Rare Events in Queueing Networks. Proceedings of the Second Znternational Workshop on Applied Mathematics and Performance/Reliability Models of Computer/Communication Systems. G. Iazeolla, P.J. Courtois and O.J. Boxma (eds). North Holland Publishing Company, Amsterdam, 275-286.

AUTHOR’S

BIOGRAPHIFS

PERWEZ SHAHABUDDIN received his B.Tech. degree in Mechanical Engineering from the Indian Institute of Technology, Delhi, in 1984. After working for a year in Engineers India Limited, India, he joined the Ph.D. program in Operations Research at Stanford University, California. His research interests include simulation techniques, queueing theory and diffusion processes. He is a student member of the Operations Research Society of America.

[31 Cottrell,

M., Fort, J.C. and Malgouyres, G. (1983). Large Deviat.ions and Rare Events in the Study of Stochastic Algorithms. IEEE Transactions on Automatic Control AC-28,907-920.

[41 Crane, M.A. and Iglehart, D.L. (1975).

Simulating Stable Stochastic Systems, III: Regenerative Processes and Discrete Event Simulations. Operations Research 23, 33-45.

Perwez Shahabuddin Department of Operations Research Stanford University

498

Stanford, California 94305 (415) 723-9383

Yorktown Heights, New York 10598 (914) 789-7504

PHILIP HEIDELBERGER received a B.A. in mathematics from Oberlin College, Oberlin, Ohio, in 1974 and a Ph.D. in Operations Research from Stanford University, Stanford, Cahfornia, in 1978. He has been a Research Staff Member at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York since 1978. His research interests include modeling and analysis of computer performance and statistical analysis of simulation output. Dr. Heidelberger is the Program Chairman of the 1989 Winter Simulation Conference and an Associate Program Chairman of the 1988 Winter Simulation He is an Associate Editor of Operations Conference. Research. He is a member of the Operations Research Society of America, the Association for Computing Machinery and the IEEE.

VICTOR F. NICOLA received the B.S. and MS. degrees in electrical engineering from Cairo University, Egypt, and Eindhoven University of Technology, The Netherlands, respectively. In 1979 he joined the scientific staff of the Measurement and Control Group at Eindhoven University. From 1983 to 1986 he was at Duke University, Durham, North Carolina, where he received the Ph.D. degree in Computer Science. Currently he is a Research Staff Member at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York. His main research interests are in the areas of performance and reliability modeling of computer systems and fault-tolerance. Victor Nicola IBM Research Division Thomas J. Watson Research Center P.O. Box 704 Yorktown Heights, New York 10598 (914) 789-7502

Philip Heidelberger IBM Research Division Thomas J. Watson Research Center P.O. Box 704 Yorktown Heights, New York 10598 (914) 789-7156

PETER GLYNN is an associate professor in the Department of Operations Research at Stanford University. He received a B.S. from Carleton University in 1978 and a Ph.D. in operations research from Stanford University in 1982. From 1982 to 1987, he was an associate professor in the Industrial Engineering Department of the University of Wisconsin at Madison. His current research interests include simulation, queueing theory and applied probability. He is a member of ORSA, TIMS, IMS and the Statistical Society of Canada.

AMBUJ GOYAL received the B.Tech. degree from the Indian Institute of Technology, Kanpur, and the M.S. and Ph.D. degrees from The University of Texas at Austin. At present, he is a Research Staff Member and manages Fault-Tolerant Systems Theory Group at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York. His research interest include fault-tolerant computing, computer architecture, communication networks and performance evaluation. Dr. Goyal is a member of the IEEE Computer Society.

Peter Glyrm Department of Operations Research Stanford University Stanford, California 94305 (415) 7250554

Ambuj Goyal IBM Research Division Thomas J. Watson Research Center P.O. Box 704

499