Proceedings of the 1988 Winter Simulation Conference
M. Abrams, P. Haigh, and J. Comfort (eds.)
Variance Reduction in Mean Time to Failure Simulations
Perwez Shahabuddin **1, Victor F. Nicola *, Philip Heidelberger *, Ambuj Goyal *, Peter W. Glynn **2

* IBM Research Division, T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, New York 10598
** Department of Operations Research, Stanford University, Stanford, California 94305

1 This work was performed while this author was visiting IBM Research.
2 This work was partially performed while this author was a visiting scientist at IBM Research.

ABSTRACT

We describe two variance reduction methods for estimating the mean time to failure (MTTF) in Markovian models of highly reliable systems. The first method is based on a ratio representation of the MTTF and employs importance sampling. The second method is based on a hybrid simulation/analytic technique where the number of simulated transitions is reduced by computing partial results analytically. Experiments with a large example show the effectiveness of both techniques for highly reliable systems.

1. INTRODUCTION

The requirement for highly reliable systems, such as airborne computing systems, is increasing the importance of reliability and mean time to failure (MTTF) prediction during the design phase of these systems. These systems are different from highly available systems, such as banking or airline reservation systems, where continuous operation (availability) is more important. In highly reliable systems any system failure during the mission causes a mission failure. While such systems can typically be modeled as Markov chains (see, e.g., [6]), the size of the corresponding Markov model increases rapidly with the complexity of the system. Thus conventional numerical solution techniques are only feasible for relatively small models, i.e., simple systems. Simulation analysis is an alternative approach; however, because system failures are rare, extremely long simulations may be required in order to obtain accurate estimates of the measures of interest. Our goal is to obtain variance reduction methods which are applicable to a broad class of models. Specifically, we are interested in models defined by the reliability and availability modeling language described in [7], so that the techniques can be implemented in a software package and made available to designers in an automatic and transparent fashion.

A typical system contains multiple component types with redundant units for each component type. Failure of these systems is usually caused by exhaustion of redundancy or by a combination of component failures leading to a system failure. Failed components may be repaired. If all components are repairable and component failures are Poisson, then a regenerative state for the system (see, e.g., [4]) is the state where all units of all component types are operational. If, in addition, all repair times are exponentially distributed, then, typically, a continuous time Markov chain is obtained. As recommended in [11], we transform the continuous time chain into an appropriate discrete time Markov chain. In [8] the estimation of steady state availability for such systems was discussed. In this paper, we discuss estimation of the MTTF. Specifically, we consider an importance sampling method (see, e.g., [9]) and a hybrid simulation/analytic method.

Importance sampling for rare event simulation has been successfully used in [2], [3], [8], [12], [13] and [14]. Proper selection of the importance sampling distribution makes the rare events more likely to occur, which results in a variance reduction. The key, of course, is to choose a good importance sampling distribution. The theory of large deviations was used in [3], [13], and [14] to select an effective distribution for problems arising in Markov chains with "small increments", random walks, and queueing systems, respectively. Effective heuristics were used in [2] and [12] to select importance sampling distributions for steady state availability estimation in large machine repairmen-like models. For example, four orders of magnitude reduction in variance was reported in [2] for a system with 70 components (the Markov chain for this model has $2^{70}$ states).

For regenerative systems, steady state performance measures can be expressed as a ratio. In [2], a single importance sampling distribution was used to estimate both the numerator and denominator of this ratio. The distribution used in [2] is dynamic in the sense that it does not correspond directly to a time homogeneous Markov chain. We call this technique Dynamic Importance Sampling (DIS). In [8], different dynamic importance sampling distributions were used to estimate the numerator and denominator independently (actually, importance sampling was not used at all to estimate the denominator), which resulted in additional variance reduction. We call this technique Measure Specific DIS (MSDIS).

The importance sampling technique discussed in previous papers does not yield significant variance reductions for estimating the MTTF using the standard simple estimator. An intuitive reason for this is as follows. When we make failure events occur more often by choosing an appropriate importance sampling distribution, the value of the estimator ends up smaller than the actual value and, in addition, the likelihood ratio is less than one. This actually ends up producing variance rather than reducing it. However, if we formulate the MTTF problem as a ratio of two expectations (as shown in Section 2), then significant variance reductions can be achieved using importance sampling. In Section 2, both DIS and MSDIS methods are used to estimate the ratio representation of the MTTF. The MSDIS sampling scheme proposed in [8] gives the best results.

In Section 3 we outline a hybrid simulation/analytic method described in [10] and describe its application in estimating the MTTF in simulations of highly reliable systems. In highly reliable systems, a large number of transitions typically occur before the system fails. The hybrid method transforms the original Markov chain into a different, but related, Markov chain. The expected value of an appropriately defined functional of the transformed Markov chain is identical to the MTTF. Furthermore, the expected number of transitions until failure is less in the transformed chain than in the original chain. This reduction usually translates into a variance reduction since, for a given number of total transitions simulated, more failures occur in the transformed chain than in the original chain.
In Section 4, we summarize the results, discuss the relative merits of the two methods and indicate future areas of research.

2. RATIO METHOD

Let $\{Y_s, s \ge 0\}$ be a continuous time Markov chain (CTMC) with state space $E = \{0, 1, 2, \ldots, N\}$ ($N$ may be either finite or infinite), and transition rate matrix $Q = (q(i,j))$. Let $q(i) = -q(i,i) = \sum_{j \ne i} q(i,j)$ denote the total rate out of state $i$. Pick a starting state, say state 0, set $Y_0 = 0$ and for any set $B$ let $\alpha_B$ be the first entrance time of the CTMC to $B$. In a reliability model, we generally take state 0 to be the state for which all components are operational. Let $F \subset E$ be a set of states (e.g., in a reliability model they might be the set of states where the system is failed) for which $0 \notin F$. If state 0 is a regenerative state and the set $F$ represents a rare set of states, then Walrand [14] applies the approximation $E[\alpha_F] \approx E[\alpha_0]/P\{\alpha_F < \alpha_0\}$ and then uses direct simulation to estimate the numerator $E[\alpha_0]$ and importance sampling to estimate the denominator $P\{\alpha_F < \alpha_0\}$. (Since $F$ is a rare set of states, $P\{\alpha_F < \alpha_0\}$ is very close to zero.) In fact for any CTMC we have the exact relationship

$$E[\alpha_F] = \frac{E[\min(\alpha_0, \alpha_F)]}{P\{\alpha_F < \alpha_0\}} \qquad (2.1)$$

which is easily obtained by considering the two cases $\{\alpha_F < \alpha_0\}$ and $\{\alpha_F > \alpha_0\}$, and applying the Markov property at time $\alpha_0$ if $\alpha_0 < \alpha_F$.

[...] estimator. The reason is that the naive estimator is effectively identical to the new estimator, in the sense that the two estimators are probabilistically identical at the instant at which failures occur. The difference comes from the fact that Equation 2.1 is formulated in terms of the probability of a rare event ($P\{\alpha_F < \alpha_0\}$) which can be estimated effectively using importance sampling.

As recommended in [5] and [11], we simulate only the embedded discrete time Markov chain $\{X_n, n \ge 0\}$ and use deterministic holding times $h_i = 1/q(i)$. The transition matrix of $\{X_n, n \ge 0\}$ is $P = (P_{ij})$ where $P_{ij} = q(i,j)/q(i)$ for $i \ne j$ and $P_{ii} = 0$.

As in [8], we consider two methods of estimating the ratio in Equation 2.1. The first method, called dynamic importance sampling (DIS), uses the same importance sampling distribution (and the same sample paths) to estimate both the numerator and denominator of Equation 2.1. The second method, called measure specific dynamic importance sampling (MSDIS), uses different importance sampling distributions for the numerator and denominator which are then estimated independently. In fact, like Walrand, we use direct simulation (no importance sampling) to estimate the numerator since very stable estimates of $E[\min(\alpha_0, \alpha_F)]$ can be obtained via direct simulation. In addition, for such highly reliable systems, the expected number of transitions in a regenerative cycle is typically small. Expressions for the resulting asymptotic variances are given in [8]; they are closely [...]
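As a rough illustration of the ratio in Equation 2.1, the following minimal Python sketch estimates the numerator by direct simulation and the denominator by importance sampling, each from independent replications, using the embedded chain with deterministic holding times $h_i = 1/q(i)$. Everything specific here is an assumption made for illustration and is not taken from the paper: the small birth-death model (three identical units, one repairman), the constant failure bias `p_bias`, and all function names. In particular, the constant bias is only a stand-in and is not claimed to be the sampling scheme of [8].

```python
import random

def make_model(n=3, lam=0.01, mu=1.0):
    """Hypothetical model: state i = number of failed units, F = {n}."""
    # Total rate out of state i: remaining units fail, one repair if i > 0.
    q = [(n - i) * lam + (mu if i > 0 else 0.0) for i in range(n)]
    p_fail = [(n - i) * lam / q[i] for i in range(n)]  # embedded chain; p_fail[0] == 1
    hold = [1.0 / qi for qi in q]                      # deterministic h_i = 1/q(i)
    return n, p_fail, hold

def cycle_time(model, rng):
    """Direct simulation of min(alpha_0, alpha_F) over one cycle from state 0."""
    n, p_fail, hold = model
    state, t = 0, 0.0
    while True:
        t += hold[state]
        state += 1 if rng.random() < p_fail[state] else -1
        if state == 0 or state == n:
            return t

def hit_indicator_is(model, rng, p_bias=0.5):
    """Importance-sampled replicate of 1{alpha_F < alpha_0} times the likelihood ratio."""
    n, p_fail, _ = model
    state, lr = 1, 1.0  # the first transition 0 -> 1 has probability one
    while 0 < state < n:
        if rng.random() < p_bias:          # biased failure transition
            lr *= p_fail[state] / p_bias
            state += 1
        else:                              # biased repair transition
            lr *= (1.0 - p_fail[state]) / (1.0 - p_bias)
            state -= 1
    return lr if state == n else 0.0

def mttf_ratio_estimate(model, n_cycles=20000, seed=1):
    """Ratio estimate of E[alpha_F]: numerator and denominator use separate replications."""
    rng = random.Random(seed)
    num = sum(cycle_time(model, rng) for _ in range(n_cycles)) / n_cycles
    den = sum(hit_indicator_is(model, rng) for _ in range(n_cycles)) / n_cycles
    return num / den

def mttf_exact(model):
    """First-step analysis for this birth-death chain, writing T_i = T_0 - c_i."""
    n, p_fail, hold = model
    c_prev, c = 0.0, 0.0
    for i in range(n):
        c_prev, c = c, (c + hold[i] - (1.0 - p_fail[i]) * c_prev) / p_fail[i]
    return c  # T_n = 0 implies T_0 = c_n

if __name__ == "__main__":
    model = make_model()
    print("ratio estimate:", mttf_ratio_estimate(model))
    print("exact value:   ", mttf_exact(model))
```

Because this toy chain is small, `mttf_exact` gives a reference value against which the ratio estimate can be compared; for the large models the paper targets, no such check is available and the variance of the estimator has to be assessed from the replications themselves.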
[...] to be the expected reward earned by the original chain until exit from $S_k$). Thus many transitions may be saved since, in models of highly reliable systems, the expected number of transitions to go from $S_k$ to $F_k$ may be large. On the set $F_k$, $\{X'_n\}$ and $\{X_n\}$ are stochastically identical.

As discussed in [10], if transitions in the two chains require the same amount of computer time, then an approximation to the variance reduction ratio is $VR = \mathrm{Var}[Z']\,E[\tau'_F] / (\mathrm{Var}[Z]\,E[\tau_F])$. If, as has been our experience in reliability simulations, $\mathrm{Var}[Z'] \approx \mathrm{Var}[\alpha_F]$, then $VR \approx E[\tau'_F]/E[\tau_F]$. If, in the original chain, the probability of a failure before a repair is $O(\epsilon_j) \approx 0$ when there are $j$ failed components, then the expected number of transitions until reaching a state with $k$ failed components is $O(1/\prod_{j=1}^{k-1} \epsilon_j)$ and thus we expect the hybrid method to produce a variance reduction of $VR \approx O(\prod_{j=1}^{k-1} \epsilon_j)$. If $\epsilon_j = \epsilon$ for all $j$, then $VR \approx O(\epsilon^{k-1})$. This will be verified for several examples.

Consider now the general class of models described in the Introduction with $M$ types of components and redundancy. Figure 1 depicts the transition structure on the set $S_2$ and its interface to the set $F_2$. The set $S_2 = \{0, 1, \ldots, M\}$, with state 0 corresponding to no component failures and state $i$ corresponding to one failed component of component type $i$. Because of this special structure, Equations 3.1 can be solved analytically for the case $k = 2$: $f'_0 = (f_0 + \sum_{j=1}^{M} f_j P_{0j})/(1 - \sum_{j=1}^{M} P_{0j}P_{j0})$ and $f'_i = f_i + P_{i0} f'_0$ for $1 \le i \le M$. Similarly, if $E_{ij}$ is the probability of exiting $S_2$ via state $j$ given that $X_0 = i$, then $E_{0j} = P_{0j}(1 - P_{j0})/(1 - \sum_{l=1}^{M} P_{0l}P_{l0})$ for $1 \le j \le M$, [...]

Figure 1. Transition structure on the set $S_2$.

For $k > 2$, systems of equations of size $|S_k|$ need to be solved. This may be feasible for small values of $k$, but the number of states in $S_k$ depends on the queueing discipline for the repairmen. For example, if each type of component has its own repairman, then $|S_3| = O(M^2)$.

We now consider the application of the hybrid method for estimating the MTTF in two examples. The first example is a reliability model of a fault-tolerant system with two types of components and two units of each type. All units are initially operational. Units of component types 1 and 2 fail at rates $\lambda_1$ and $\lambda_2$, respectively. They are repaired according to a FCFS discipline by separate repair facilities at rates $\mu_1$ and $\mu_2$, respectively. The system is operational if at least one unit of any type is operational. The resulting Markov chain, while having only nine states, is illustrative of the performance of the hybrid method. We applied the hybrid method with $k = 2, 3$ and computed the variance ratio numerically using the techniques described in [11]. For $\mu_1 = \mu_2 = 100$ and $\lambda_1 = \lambda_2 = 1$, we obtained $VR = 0.03$ for $k = 2$ and $VR = 0.0004$ for $k = 3$. If we increase the failure rates by setting $\lambda_1 = \lambda_2 = 10$, then $VR = 0.25$ for $k = 2$ and $VR = 0.028$ for $k = 3$. Most of the variance reductions are due to the reduction in the expected number of transitions until failure and are consistent with the prediction that $VR \approx O(\prod_{j=1}^{k-1} \epsilon_j)$.

The second example is the model with ten types of components and two components of each type described in Section 2. The repair rate was set to be $\mu = 1$ and the component failure rate was $\lambda = 0.001$. This model was simulated with $k = 2$ and the estimated variance ratio was $VR = 0.036$. For these parameter settings the predicted variance reduction is $\epsilon_1 = 0.019$, which is of the same order of magnitude as the observed variance reduction. The apparent factor of two discrepancy between the actual and predicted variance reductions is due to the fact that the hybrid method (as we implemented it) requires two transitions, rather than one, to exit $S_2$: one to select the exit state in $S_2$ and another to select the entrance state in $F_2$. For a general purpose simulator this represents a realistic implementation of the method. For this example, the best variance reduction using MSDIS yields a variance ratio of 0.00035 (from Table 1, with $p' = p'' = 0.9$).

4. SUMMARY

In this paper we described two methods for estimating the MTTF in a class of Markovian models of highly reliable systems. The first (ratio) method relies on a ratio representation of the MTTF. The numerator of the ratio is estimated using direct simulation while the denominator is, independently, estimated using importance sampling. The second (hybrid) method combines simulation and numerical techniques. The hybrid method requires some preliminary computations, although in its simplest form these can be done analytically. A different, but related, Markov chain is then simulated. In this Markov chain, the expected number of transitions until failure may be greatly reduced. Both methods produced significant variance reduction in simulations of a large model.

Although the two methods are not necessarily mutually exclusive, the hybrid method, while effective, has a number of disadvantages that reduce its utility when one considers its implementation into a broadly applicable simulation package. First, it requires preliminary numerical computations. Second, the hybrid method as described here is specifically designed to estimate the MTTF and requires simulation of its own stochastic process. Therefore, the output from this method may not be useful for estimating other performance measures, such as the steady state unavailability. Third, the method is best suited for situations in which the amount of redundancy is low since, in this case, the expected number of transitions to failure in the modified chain remains reasonable. More specifically, the practical implementation of the hybrid method with $k = 2$ uses direct simulation whenever the system is in a state with two or more failed components, whereas the ratio method with importance sampling quickly moves the system through states with more than two failed components to the failed state. Therefore the variance reductions using importance sampling are potentially (and in practice) greater than those using the hybrid method.

It might appear that since the ratio method requires simulation of two processes, it too would be of limited value in estimating other performance measures. However, the technique described in [8] for estimating steady state unavailability also independently simulates the numerator and denominator of a ratio (in that case importance sampling is used for the numerator and direct simulation is used for the denominator). Furthermore, the same importance sampling distribution can be used in estimating both the MTTF and the steady state unavailability so that both of these performance measures can be estimated simultaneously. In addition, it should be possible to apply this technique to simultaneously estimate multiple performance measures at different parameter settings (e.g., different failure rates), since this is an inherent capability of importance sampling. Therefore, our current research is directed towards further experimentation with and extensions of the importance sampling approach, including improved selection of importance sampling distributions for unbalanced systems, its application in estimating the failure time and interval availability distributions, as well as extensions to gradient estimation and simulations of non-Markovian models.

REFERENCES

[1] Chung, K.L. (1967). Markov Chains With Stationary Transition Probabilities, Second Edition. Springer-Verlag, New York.

[2] Conway, A.E. and Goyal, A. (1987). Monte Carlo Simulation of Computer System Availability/Reliability Models. Proceedings of the Seventeenth Symposium on Fault-Tolerant Computing. Pittsburgh, Pennsylvania, 230-235.

[3] Cottrell, M., Fort, J.C. and Malgouyres, G. (1983). Large Deviations and Rare Events in the Study of Stochastic Algorithms. IEEE Transactions on Automatic Control AC-28, 907-920.

[4] Crane, M.A. and Iglehart, D.L. (1975). Simulating Stable Stochastic Systems, III: Regenerative Processes and Discrete Event Simulations. Operations Research 23, 33-45.

[5] Fox, B.L. and Glynn, P.W. (1986). Discrete-Time Conversion for Simulating Semi-Markov Processes. Operations Research Letters 5, 191-196.

[6] Geist, R.M. and Trivedi, K.S. (1983). Ultra-High Reliability Prediction for Fault-Tolerant Computer Systems. IEEE Transactions on Computers C-32, 1118-1127.

[7] Goyal, A. and Lavenberg, S.S. (1987). Modeling and Analysis of Computer System Availability. IBM Journal of Research and Development 31, 6, 651-664.

[8] Goyal, A., Heidelberger, P. and Shahabuddin, P. (1987). Measure Specific Dynamic Importance Sampling for Availability Simulations. 1987 Winter Simulation Conference Proceedings. A. Thesen, H. Grant and W.D. Kelton (eds.). IEEE Press, 351-357.

[9] Hammersley, J.M. and Handscomb, D.C. (1964). Monte Carlo Methods. Methuen, London.

[10] Heidelberger, P. (1979). A Variance Reduction Technique That Increases Regeneration Frequency. Current Issues in Computer Simulation. N.R. Adam and A. Dogramaci (eds.). Academic Press, Inc., 257-269.

[11] Hordijk, A., Iglehart, D.L. and Schassberger, R. (1976). Discrete Time Methods for Simulating Continuous Time Markov Chains. Adv. Appl. Prob. 8, 772-788.

[12] Lewis, E.E. and Bohm, F. (1984). Monte Carlo Simulation of Markov Unreliability Models. Nuclear Engineering and Design 77, 49-62.

[13] Siegmund, D. (1976). Importance Sampling in the Monte Carlo Study of Sequential Tests. The Annals of Statistics 4, 673-684.

[14] Walrand, J. (1987). Quick Simulation of Rare Events in Queueing Networks. Proceedings of the Second International Workshop on Applied Mathematics and Performance/Reliability Models of Computer/Communication Systems. G. Iazeolla, P.J. Courtois and O.J. Boxma (eds). North Holland Publishing Company, Amsterdam, 275-286.

AUTHORS' BIOGRAPHIES

PERWEZ SHAHABUDDIN received his B.Tech. degree in Mechanical Engineering from the Indian Institute of Technology, Delhi, in 1984. After working for a year in Engineers India Limited, India, he joined the Ph.D. program in Operations Research at Stanford University, California. His research interests include simulation techniques, queueing theory and diffusion processes. He is a student member of the Operations Research Society of America.
Perwez Shahabuddin
Department of Operations Research
Stanford University
Stanford, California 94305
(415) 723-9383

VICTOR F. NICOLA received the B.S. and M.S. degrees in electrical engineering from Cairo University, Egypt, and Eindhoven University of Technology, The Netherlands, respectively. In 1979 he joined the scientific staff of the Measurement and Control Group at Eindhoven University. From 1983 to 1986 he was at Duke University, Durham, North Carolina, where he received the Ph.D. degree in Computer Science. Currently he is a Research Staff Member at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York. His main research interests are in the areas of performance and reliability modeling of computer systems and fault-tolerance.

Victor Nicola
IBM Research Division
Thomas J. Watson Research Center
P.O. Box 704
Yorktown Heights, New York 10598
(914) 789-7502

PHILIP HEIDELBERGER received a B.A. in mathematics from Oberlin College, Oberlin, Ohio, in 1974 and a Ph.D. in Operations Research from Stanford University, Stanford, California, in 1978. He has been a Research Staff Member at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York since 1978. His research interests include modeling and analysis of computer performance and statistical analysis of simulation output. Dr. Heidelberger is the Program Chairman of the 1989 Winter Simulation Conference and an Associate Program Chairman of the 1988 Winter Simulation Conference. He is an Associate Editor of Operations Research. He is a member of the Operations Research Society of America, the Association for Computing Machinery and the IEEE.

Philip Heidelberger
IBM Research Division
Thomas J. Watson Research Center
P.O. Box 704
Yorktown Heights, New York 10598
(914) 789-7156

AMBUJ GOYAL received the B.Tech. degree from the Indian Institute of Technology, Kanpur, and the M.S. and Ph.D. degrees from The University of Texas at Austin. At present, he is a Research Staff Member and manages the Fault-Tolerant Systems Theory Group at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York. His research interests include fault-tolerant computing, computer architecture, communication networks and performance evaluation. Dr. Goyal is a member of the IEEE Computer Society.

Ambuj Goyal
IBM Research Division
Thomas J. Watson Research Center
P.O. Box 704
Yorktown Heights, New York 10598
(914) 789-7504

PETER GLYNN is an associate professor in the Department of Operations Research at Stanford University. He received a B.S. from Carleton University in 1978 and a Ph.D. in operations research from Stanford University in 1982. From 1982 to 1987, he was an associate professor in the Industrial Engineering Department of the University of Wisconsin at Madison. His current research interests include simulation, queueing theory and applied probability. He is a member of ORSA, TIMS, IMS and the Statistical Society of Canada.

Peter Glynn
Department of Operations Research
Stanford University
Stanford, California 94305
(415) 725-0554
499