Minimizing Completion Time of a Program by Checkpointing and Rejuvenation
Sachin Garg1* sgarg@ee.duke.edu
Chandra Kintala2 cmk@research.att.com
1 Center
for
Adv.
Department
of
Elec.
Duke
Yennun
Kishor
and
&
2AT&T
NC
technique
gram
in
the
is corrective
failures In this
mostly
together
time
the
*Supported
amount in part
Bell laboratories
from how
to further
of a program.
to reduce
aimed the
reduce The
idea
of rollback by an IBM
reduce
“aging” these
well
nor
permission
to
to
lists,
requires
with
Avenue NJ
combining
07974
for
it
with
rejuvenation.
expected
completion
finite
failure
free
cases
when;
(a)
rejuvenation
is employed,
phenomenon. may
running
(b)
only
We time
time
neither
of
for
the
checkpointing checkpointing
time
distribution
optimal
is taken
the
and by an AT&T
prior
finally
(c)
is
numerical
expected
completion
sults,
some
efits
of these
ure
distribution.
three
rejuvenation
time.
interesting
the
numerical
are drawn
in relation
to the
discuss
minimizes re-
about
nature
benof fail-
Introduction with
rollback
involves
It
recovery
occasional
storage.
is a well known
saving
Upon
of the
a failure,
ware/program
does not need to be restarted but
checkpoint
(rollback
nite
free
failure reduces In earlier
ures
reju-
failure
and
that
Using
conclusions
techniques
cases
very beginning
tially
and
results for Weibull
above
and
gram state on stable
and/or a fee.
the
checkpointing
technique.
SIGMETRICS '96 5/96 PA, USA © 1996 ACM 0-89791-793-6/96/0005...$3.50
checkpointing ‘
for
(7heckpointing
specific
both
are employed.
We also present
comple-
checkpoints
a failure
and
venation
summer internship
redistribute
by
three
employed,
unexpected
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on or
Hill,
equations
following
1
servers,
the
a program
of a pro-
techniques
of using
fellowship
ain
to preventive
the expected
upon
a
checkpointing
refers to
both
time
While
rejuvenation
resulting we show
is
the completion
of failures.
of software
paper,
be used tion
presence in nature,
maintenance
rollback-recovery
to reduce
Mount
Murray
a step further with
Laboratories
27705
derive Checkpointing
Bell
600
Engg.
Abstract known
S. Trivedil
University
Durham,
.com
kst @ee.duke.edu
Comm.
Comp.
and
Huang2
yenQresearch.att
t .com
Comp.
by Checkpointing
were
can be restarted recovery).
running
its work
assumed
For
time,
completion
this
be
from
the Lmt
a program
the
saved
with
technique
fi-
subst an-
time.
on the analysis to
from
pro-
the soft-
caused
of checkpointing, mostly
by
fail-
hardware
faults,
independent
of the program/software
running
on
, Failure I
them,
and for the most part
failure
process
has constantly reliability,
the assumption
was adequate. improved
in terms
it has been observed
technology
of performance
[8] that
emphasizing covery
software
blocks
[9], N-version
self-checking
nent techniques
for tolerating
of design diversity,
reactive
in nature,
tive approach
software
failure avoided ~—~_-
Re-
failurefi,
D _
Baeed are
Another
reac-
has been proposed
Figure
1: Effect
which
is preventive
stopping
The execution of a program is determined by three components: the volatile state, the persistent state and the OS environment [5]. The volatile state consists of the program stack and static and dynamic data segments. Persistent state refers to all the user files related to a program's execution, while OS environment refers to resources that the program must access through the operating system, such as swap space, file systems, communication channels, keyboard, monitors, etc.
large percentage
OS environment buffer
failures
volve.
Lee [12] also observed that more than 70% of software failures in Tandem's system software were because of transient faults such as race conditions, timing problems, etc. Such failures are manifestations of an undesirable, faulty OS environment state of the program.
occur
reached
ing the hardware undesirable time
states
causing
of software failure.
aging
in the
servation
failures
is re-executed the
“aging”
calls for a fault-tolerant diversity
in the Flushing
freeing
reinitializing
unused
the inter-
up the file system
of what
known
is the “reboot”
In this paper,
rejuvenation
and very simple
etc.
are
might
in-
way of re-
of a computer.
we combine
which
completion
time
that further
reduction
by incorporating
state
reju-
We show
time is possible
Checkpointing
on stable
time in the
reduces the
can fail.
in the completion
however,
time
by itself
which
rejuvenation.
the volatile
environment,
execution
Checkpointing
of a program
with
the completion
has a finite
absence of failures,
saving
checkpointing
The goal is to minimize
of a program
involves
storagel.
The
is not saved as part
OS
of a check-
point.
This
are likely after
such
accrue with phenomenon
of the OS environment technique
the benefit
in the
venation
and is explained
program
fails at an arbitrary
shows
amount This
reduction
This idea is taken a step further
program
to disappear
phenomenon.
ing comes from
ob-
based on
[11] and has led to an approach
that
the
absence
due
to
sulte
in a smaller
point.
Assume
1sometimes, may
in
instant
checkpointing,
depending
also be saved.
of rollback.
by incorporating
the
1 as follows.
The
to the beginning
the
only
to the
failure
had
on the application,
See [5] for details.
reju-
and as seen in Fig-
of checkpointing.
rollback, that
of checkpoint-
amount
in Figure
ure 1(a), involves a large rollback
as a transient
a certain
In the presence of failures,
bugs
for shar-
resources,
manifests
and re-initialization
counteracts
environment
to “age”.
eventually
Such transient
of clean-up
system
in the OS environment
the software
if the program which
[7] and due to interactions and operating
conditions
lead to failure.
memory,
examples
its internal
It is also observed
in [4] that owing to the presence of subtle software called “Heisenbugs”
a
in na-
ture, i.e., they may not occur again if the program to be re-executed.
transient
by file server,
cleaning
A commonly
juvenating
and cleaning
might
allocated
some physical
al. [4] call
com-
that
are transient
that
tables,
et.
and it consists of occasionally
queues maintained
and wrongly
time etc. [5].
[1, 6], it has been observed
of software
and rejuvenation
Huang
program
state to remove the accrued
venation. In recent studies
in nature.
the running
nal kernel
to a
the OS environment
the program
of checkpointing
it Soj3ware Rejuvenation
in [13]. Behavior
Checkpoint Rejuvenation
the means of deal-
it has occurred.
based on data diversity
(c) No rollback
[10] and N
these techniques
i.e., they provide after
(b)
[2] are some of the promi-
on the principle
ing wit h a failure
; Failure
Evolution
fault-tolerance.
programming
programming
Rollback
and
has added to this effect,
the need for software
(a)
J
most of the fail-
ures are caused due to defects in the software. of large and complex
v
1
of Poisson
As hardware
of the
Figure same last
failure
saved
occurred
l(b) re-
checkbecause
the persistent
state
through
gradual
deterioration,
an undesirable
reached in the OS environment ure 1(c), the program a checkpoint, state
thus removing
occurred
prevents The
so far.
planned
This
stopping,
to failure
checkpoint
results
to a large extent us with
‘(renewal”
cleaning
failures
rest
model
on checkpointing pointing pletion
deals with
analysis
model
as follows.
previous
work
a very simple
and precisely
a numerical
possible
Figure
of
2: Assumptions
on time to failure
tion
of
with
the case of minimizing
time
imizing
deadline
retries
com-
is the most
Weibull
As most
of comthe pa-
feature
terchangeably
and “program”
of the earlier
work
hardware
sumption
is that
the time
ponential
[18, 17, 14]. This
Figure
2 illustrates
rejuvenation
Figures
analysis
has only
cal problem which
will depend
alyzed
both
pletion
system.
new concept carried
out.
the rejuvenation of interest
The formulation
and its
point)
A t ypipolicy
given a partic-
2(b)-(d)
of the problem
running
time.
In this
with paper,
Such analyses
the performance set of assumptions
measure
transaction failure
we restrict
differ
as-
is ex-
was removed
in
and the failure
restriction
(Figure
generalization
to
after
respects; and the
the system behavior.
2(c)).
the time
during
checkat each
in [14, 19] by allowing
upon Figure
that
tributed,
which by it memoryless
checkpointing.
completion 2(d)
shows
of each another
process does not renew
[18, 15].
assumed
to failure
to the case shown in Figure
a check-
distribution
renewed
through
where the failure
each checkpoint
(the
2(b)) was made in [17]. The
was removed renewed
2(a)
by Fx(t)),
undergoing to failure
of no failures
process to continue still
Figure
of the program
process getting
(shown in Figure
checkpoint
oriented
process.
the program
points
in assumptions
(denoted
by the time
An early assumption
It is, however,
free com-
ourselves
in two main
evaluated/optimized
made regarding
distribution
difference
distribution
Fx (t).
the failure
a finite
as
distributions.
plot the execution
bars represent
former
on the other hand, has been well an-
for infinitely
checkpointing
restriction
of the failure
superimposed
checkpoint
on the system specifics.
and programs
the latter.
been
of finding
the measure
Checkpointing, software
recently
is that
optimizes
ular software
is a fairly
made process.
most common
to failure
the
plots the time to failure
Work
vertical Software
important
of the failure
failures,
of
are used in-
in this paper.
Previous
Another
regarded
general
made on the renewal
2
by a
number
is the set of assumptions
[19, 15] by considering “software”
of a task
specified
and renewal
a way of tolerating
of this work
of completion
[14] or in a finite
on the distribution
comple-
[17, 15, 19, 18], max-
[20] has also been evaluated.
distinguishing
In
the expected
common
the probability
given
state the
cases are derived.
implications
Whereas
check-
for the expected
example
In
done
directions.
The words
distribution
checkpointing
time distribution to illustrate the benefit the two techniques. Finally, we conclude
and future
“t)m(d)t
the
and state the contributions
In Section 4, expressions
6 with
t
This leaves
the
a particular
In Section 3, we present
per in Section
a
it reduces
failures.
and is organized
analysis
5, we provide
failure bining
(as
after
the proba-
non-zero
time for the three possible
Section
right
Although,
and compare
and rejuvenation
problem.
failure.
?
under
2, we outline
this paper.
time)
is still
paper
above three scenarios Section
of the program
of when and how often should
of the
and rejuvenation
in the OS
up and restarting
for such transient
be rejuvenated
The
after
an unexpected
in no rollback.
the question
program
immediately
at an arbitrary
of unexpected
In Fig-
the degradation
(or at least postpones)
opposed
bility
is rejuvenated
“t)=(a)
state was
of the program.
2(c).
In [18], however,
it is
is exponentially
dis-
property
is equivalent
Leung
and Choo [15]
assume arbitrary
distribution
lem in the most
general
failures
to occur during
Our assumptions shown
in Figure
rollback
2(d).
checkpointing
arbitrary
distribution.
The
after
rejuvenation
the program
however, Failure
fails
State
failure
continues
during
As an example,
libf
form application
t2 provides
library
level checkpointing
Our
of
stable
time yf.
Rejuvenation
stopping from
a failure
[3].
bined
with
knowledge,
rejuvenation
for the first
per as a two-dimensional tal
number
cution
along which the expected and Gilbert,
ered preventive but assumed nance with checkpointing “save”.
that every
the two
checkpoint.
and preventive
the exe-
operation
maintenance
essentially
optimization
failure
total
in the failyielding
2(c) and still remaining
time
simply
problem.
shall
controlled
3
Problem
Assume ecution
that time
checkpointing
a given
Formulation program
to complete
requires
in the
or rejuvenation.
which
absence
Further,
w is a constant and shall be referred requirement” of the program. Time z Mbf
t is a
registeredtrademark
w units
of AT&T
time
of failures, assume
constant
the first
Laboratories.
exits.
~mpficity,
dimension
w/N+
ex-
of the opti-
being
with
each
w/N.
3 The
(including
CYdenoted
of the program
Our
it after
as /3. The
including
integer
every
constant
with
have
of execution
kth
check-
given
assumed takes
and
dimension
be
along
completion
k and then
that the Nth
may
is to be minimized.
the expected
the
We In our
value
the expected N
is
checkpoint.
whose
time
evaluate
model
distance.
the second
completion
is to first
we
rejuvenation
rejuvenation
and it constitutes
w units
to finish
whose value can be
interval
Model:
of a program
3 For
pleting
the program
are equidistant
values of N and k for which
that
to as the “workto failure of the Bell
We assume that
of N checkpoints
to k as the
goal
astimes.
it runs on, none of the
of each of these segments
the expected
Our
of ex-
may
is thus N~.
k is an
model,
time
to justifying
Model:
is therefore
to perform refer
compared
and rejuvenation
a total
checkpoints
Rejuvenation
a one-
aa the
can be controlled.
work requirement
pointing
applies
recovery
free inter-checkpoint
the checkpoint)
is very small
and a system
work requirement
was called a
resulted
The
The assump-
in checkpointing
reasoning
and constitutes
mization.
of
state
N is an integer
varied
[16] consid-
renewed at every checkpoint
model of Figure
ecution.
some mainte-
The joint
a program
needs to complete
checkpointing,
underwent
Similar
as it cleans inter-
time is remonable
variation
of constant
Checkpoznting
num-
reboot
may also sim-
and frees memory.
To-
dimensions
models
along with
the system
the
after a “crash”
Rejuvenation
planned
above parameters
pa-
time is minimized.
in one of their
The combination
dimensional
and during
completion
maintenance
ure process getting a failure
checkpoints
constitute
in this
problem.
to be performed
of the program
Coffman
time
A sim-
occasional,
Given
is com-
involve
reinitializes
sumptions
of a program.
optimization
of equidistant
ber of rejuvenations
time
both
equal to ~,.
ply involve
slight
It
time y~. Since
of restarting
to save the volatile
after
up, and reload-
nal tables,
be ignored.
can be
checkpointing
cleaning
checkpointing
right
a checkpoint.
and rejuvenation,
example
by cr. and the
to take a
is performed
Yf is typically
occurs is “reboot”.
a
the last saved checkis assumed
completed
the program,
ple (yet common)
time
the completion
To the best of our
and is denoted
This
has successfully
to w. Thus,
used to minimize
storage,
with
a check-
by reloading
is restarted
from
failure
to per-
X
program
the same procedure,
level.
and checkpointing
variable
to complete
some cleanup is performed
recovery
Contribution
We show how rejuvenation
random
Time
to be constant
tion of constant
2.1
the
ing and is also assumed to take a constant
the check-
routines
is assumed
by
Fx (.).
Upon a crash failure,
includes
or
which
process when it is done at the application
point
the program
process is
degradation
distribution
constant
is as-
and is restarted
is denoted
given
point
has an
respect to its OS environment,
causes the transient pointing
continues
to failure
in recovery.
is performed.
with
process
program,
if the program
program
for
those of [15] and are
failure
The
also allow
recovery.
and the time
sumed not to fail while only
They
closely follow
through
renewed
and have solved the prob-
setting.
to find
completion
Program
checkpoint
after
and
com-
then
Our goal is to find the values of $N$ and $k$ for which the expected completion time is minimal. We shall denote these optimal values by $N^*$ and $k^*$ respectively.

Figure 3: Program with no Checkpointing or Rejuvenation

Figure 4: Program with checkpointing

4 Expected Completion Time

Let $T(w)$ denote the completion time of the program when neither checkpointing nor rejuvenation is employed and $w$ is the "work-requirement". Clearly, $T(w)$ is a random variable due to randomness in the failure process. Note that in the absence of any failures, $T(w) = w$. Similarly, let the random variable $T_c(N\beta, N)$ denote the completion time of the same program when only checkpointing is employed ($N\beta$ represents the work requirement in this case), and finally let $T_{r,c}(N\beta, N, k)$ denote the completion time when both checkpointing and rejuvenation are employed.

It is straightforward to see that employing only rejuvenation is meaningless, as there is no saved state to restart from an intermediate point. Under a deterministic rejuvenation policy of rejuvenating every $\tau$ time units, execution begins again from the beginning after rejuvenating, so any $\tau < w$ would never allow the program to complete.

We now proceed to derive expressions for the expected values of these three random variables, viz. $E[T(w)]$, $E[T_c(N\beta, N)]$ and $E[T_{r,c}(N\beta, N, k)]$.

4.1 No Checkpointing or Rejuvenation

Let $\bar F_X(w) = 1 - F_X(w)$; then the expected completion time is given by the following theorem.

Theorem 1:
$$E[T(w)] = w + \frac{\gamma_f F_X(w) + \int_0^w x\,dF_X(x)}{\bar F_X(w)}$$

Proof: We proceed by conditioning on the time to first failure $X = x$ (see Figure 3). If $x > w$, i.e., if the failure does not occur within $w$ units, the program completes. Therefore, the completion time is $w$. On the other hand, if a failure does occur before completion ($x < w$), the program completion time is the sum of three distinct components. First, the time used so far ($x$); second, the time to recover and restart ($\gamma_f$); and last, the expected completion time for the remaining work-requirement. As the program is started from the very beginning (shown by the dotted line in Figure 3), the remaining work requirement is still $w$ and its expected completion time is $E[T(w)]$. Formally, the conditional expected completion time is written as:
$$E[T(w) \mid X = x] = \begin{cases} w, & x > w \\ x + \gamma_f + E[T(w)], & x < w \end{cases}$$
By the law of total expectation,
$$E[T(w)] = \int_0^\infty E[T(w) \mid X = x]\,dF_X(x).$$
Therefore, $E[T(w)]$ is given by
$$E[T(w)] = w\,\bar F_X(w) + \int_0^w \left(x + \gamma_f + E[T(w)]\right) dF_X(x) = w\,\bar F_X(w) + \left(\gamma_f + E[T(w)]\right) F_X(w) + \int_0^w x\,dF_X(x).$$
Rearranging with respect to $E[T(w)]$, we get
$$E[T(w)] = w + \frac{\gamma_f F_X(w) + \int_0^w x\,dF_X(x)}{\bar F_X(w)} \qquad \Box$$
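As an illustration of how Theorem 1 may be evaluated, the following is a minimal numerical sketch rather than the procedure used for the results in this paper. It assumes the Weibull parameterization $F_X(t) = 1 - e^{-\lambda t^{\theta}}$ and uses integration by parts, $\int_0^w x\,dF_X(x) = w F_X(w) - \int_0^w F_X(x)\,dx$, so that only the distribution function is needed. The function names and the choice of Simpson's rule are hypothetical choices made for this example. For $\theta = 1$ (exponential failures) the theorem reduces to $(\gamma_f + 1/\lambda)(e^{\lambda w} - 1)$, which gives a convenient check.

```python
import math

def weibull_cdf(t, lam, theta):
    """Assumed parameterization: F_X(t) = 1 - exp(-lam * t**theta)."""
    return 1.0 - math.exp(-lam * t ** theta) if t > 0.0 else 0.0

def integral_of_cdf(cdf, upper, steps=20000):
    """Composite Simpson approximation of int_0^upper F_X(x) dx.

    `steps` must be even for Simpson's rule.
    """
    h = upper / steps
    total = cdf(0.0) + cdf(upper)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * cdf(i * h)
    return total * h / 3.0

def expected_time_no_checkpoint(w, gamma_f, cdf):
    """Theorem 1: E[T(w)] = w + (gamma_f*F(w) + int_0^w x dF(x)) / (1 - F(w))."""
    f_w = cdf(w)
    # Integration by parts turns int_0^w x dF(x) into w*F(w) - int_0^w F(x) dx.
    partial_mean = w * f_w - integral_of_cdf(cdf, w)
    return w + (gamma_f * f_w + partial_mean) / (1.0 - f_w)

if __name__ == "__main__":
    lam = 1.0 / 900.0  # exponential case with MTTF = 900
    value = expected_time_no_checkpoint(
        1200.0, 5.0, lambda t: weibull_cdf(t, lam, 1.0))
    print(value)
```

Passing the distribution function as a callable keeps the sketch independent of the Weibull assumption, so any other $F_X$ can be substituted.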
4.2 Checkpointing Only

The program execution is shown in Figure 4. Under our model, $N$ equidistant checkpoints are taken, dividing the program into $N$ segments, each with a work requirement $\beta$ (including the checkpointing time). The expected completion time is given by the following recurrence relation.

Theorem 2:
$$E[T_c(N\beta, N)]\,\bar F_X(\beta) = N\beta\,\bar F_X(N\beta) + \gamma_f F_X(N\beta) + \int_0^{N\beta} x\,dF_X(x) + \sum_{i=1}^{N-1} E[T_c((N-i)\beta,\,N-i)]\,\left[F_X((i+1)\beta) - F_X(i\beta)\right].$$

Figure 5: Program with checkpointing and rejuvenation

Proof: Again, we proceed by conditioning on the time to first failure $X = x$. If $x > N\beta$, i.e., the failure does not occur within the completion time of the program, the completion time is $N\beta$. If, however, the failure occurs in any of the $N$ segments (say the $i$th segment), i.e., $(i-1)\beta < x < i\beta$ for $i$ from 1 to $N$, the completion time consists of three distinct components. First, the time spent so far, $x$; second, the restart time $\gamma_f$; and last, the expected completion time with the remaining work requirement. As $(i-1)$ checkpoints have already been completed, the remaining work requirement is $(N-i+1)\beta$, the completion time of which is denoted by the random variable $T_c((N-i+1)\beta,\,N-i+1)$. Formally, the conditional expected completion time may be written as:
$$E[T_c(N\beta, N) \mid X = x] = \begin{cases} N\beta, & x > N\beta \\ x + \gamma_f + E[T_c((N-i+1)\beta,\,N-i+1)], & (i-1)\beta < x < i\beta,\ 1 \le i \le N \end{cases}$$
By the law of total expectation,
$$\begin{aligned} E[T_c(N\beta, N)] &= \int_0^\infty E[T_c(N\beta, N) \mid X = x]\,dF_X(x) \\ &= \int_0^{\beta} \left(x + \gamma_f + E[T_c(N\beta, N)]\right) dF_X(x) + \int_{\beta}^{2\beta} \left(x + \gamma_f + E[T_c((N-1)\beta,\,N-1)]\right) dF_X(x) + \cdots \\ &\quad + \int_{(N-2)\beta}^{(N-1)\beta} \left(x + \gamma_f + E[T_c(2\beta, 2)]\right) dF_X(x) + \int_{(N-1)\beta}^{N\beta} \left(x + \gamma_f + E[T_c(\beta, 1)]\right) dF_X(x) + \int_{N\beta}^{\infty} N\beta\,dF_X(x). \end{aligned}$$
Combining the $x$ and $\gamma_f$ terms of the integrals, we get
$$E[T_c(N\beta, N)] = N\beta\,\bar F_X(N\beta) + \gamma_f F_X(N\beta) + \int_0^{N\beta} x\,dF_X(x) + \sum_{i=0}^{N-1} \int_{i\beta}^{(i+1)\beta} E[T_c((N-i)\beta,\,N-i)]\,dF_X(x).$$
Evaluating the integral in the summation, combining $E[T_c(N\beta, N)]$ on both sides and rearranging, we get
$$E[T_c(N\beta, N)]\,\bar F_X(\beta) = N\beta\,\bar F_X(N\beta) + \gamma_f F_X(N\beta) + \int_0^{N\beta} x\,dF_X(x) + \sum_{i=1}^{N-1} E[T_c((N-i)\beta,\,N-i)]\,\left[F_X((i+1)\beta) - F_X(i\beta)\right] \qquad \Box$$

The above expression for $E[T_c(N\beta, N)]$ is a recurrence relation which involves a weighted sum of $E[T_c(i\beta, i)]$ for all $1 \le i < N$ and does not have a simple closed form solution. However, a numerical iterative or recursive solution is straightforward.
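The iterative solution of the Theorem 2 recurrence mentioned above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used for the results in this paper: the Weibull parameterization $F_X(t) = 1 - e^{-\lambda t^{\theta}}$, the function names and the Simpson integration are choices made for this example. The segment length is held at $\beta = w/N + \alpha$ throughout, and $E[T_c(m\beta, m)]$ is filled in bottom-up for $m = 1, \ldots, N$.

```python
import math

def weibull_cdf(t, lam, theta):
    """Assumed parameterization: F_X(t) = 1 - exp(-lam * t**theta)."""
    return 1.0 - math.exp(-lam * t ** theta) if t > 0.0 else 0.0

def partial_mean(cdf, upper, steps=2000):
    """int_0^upper x dF(x) = upper*F(upper) - int_0^upper F(x) dx (Simpson, even steps)."""
    h = upper / steps
    total = cdf(0.0) + cdf(upper)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * cdf(i * h)
    return upper * cdf(upper) - total * h / 3.0

def expected_time_checkpointing(w, n, alpha, gamma_f, cdf):
    """E[T_c(N*beta, N)] from the Theorem 2 recurrence, solved bottom-up."""
    beta = w / n + alpha                        # work per segment, incl. checkpoint
    F = [cdf(i * beta) for i in range(n + 1)]   # F[i] = F_X(i * beta)
    e = [0.0] * (n + 1)                         # e[m] = E[T_c(m * beta, m)]
    for m in range(1, n + 1):
        rhs = (m * beta * (1.0 - F[m]) + gamma_f * F[m]
               + partial_mean(cdf, m * beta))
        # Weighted sum over failures in segments 2..m (index i = 1..m-1).
        for i in range(1, m):
            rhs += e[m - i] * (F[i + 1] - F[i])
        e[m] = rhs / (1.0 - F[1])
    return e[n]

if __name__ == "__main__":
    cdf = lambda t: weibull_cdf(t, 1.0 / 900.0, 1.0)
    print(expected_time_checkpointing(1200.0, 15, 4.0, 5.0, cdf))
```

For exponential failures the segments are independent and identically distributed, so the recurrence should agree with $N(\gamma_f + 1/\lambda)(e^{\lambda\beta} - 1)$. A value of $N$ minimizing the result can then be sought by a simple scan, for example `min(range(1, 41), key=lambda n: expected_time_checkpointing(1200.0, n, 4.0, 5.0, cdf))`.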
4.3 Checkpointing and Rejuvenation Combined

Under our checkpointing and rejuvenation model, the program takes a total of $N$ checkpoints and rejuvenates after every $k$th checkpoint, where $k \ge 1$.
For d = 1, Weibull
distribution. in the density
of failure
of
Furthermore, higher
values
function
are concentrated
of
where
in a small
is given by [21]
aa work re-
is k’/3. of the
time.
The mean time to failure
~ 1/0
The work requirement
however,
Versatility
time,
of the distribution,
larger probabilities
For i
with
to the exponential
for a given d imply
the two parameters.
lies in the choice of 6 to vary the failure
If 01.0
data associated
program
N*, k“)]
oryless property
using
Checkpointing
is performed,
17[T’’c(N*P,
function
are close to the time
close to the time it takes to save critical
Given
time
of 15 checkpoints,
dist ante takes a value from
rejuvenation
being fixed at 900. Larger
a large computer.
with a large scientific
venation
Restarting
and d, A is calculated
it is performed.
for % = 1.0, for a total
O takes values from
is the density
and the restart
5.0
7: Effect of O on completion
head of y, every time
parame-
take y, = ~f = 5
6 plots the Weibull
O values, MTTF
Figure
in the absence of re-
= 900 minutes.
1.0, to 5.0. Given the MTTF equation
1200 min-
takes a = 4 minutes.
MTTF
values of O).
The mean time to failure
juvenation
and rejuvenation
values of 0. It can be seen
(larger
following
only
checkpointing
~
we obtain
from the numerical results that for the same mean time to failure, the benefit from rejuvenation increases for peakier
checkpointing
the optimal
increase
in 8.
and should
The variation on actual
rejuvenation Note
that
k is
not be mistaken in time to rejuve-
parameter
values and an
is not obvious.
any Figure
in an over-
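The numerical results fix the MTTF and vary $\theta$, with $\lambda$ calculated from the MTTF equation. A minimal sketch of that calculation is given below. It assumes the Weibull parameterization $F_X(t) = 1 - e^{-\lambda t^{\theta}}$, for which the mean is $\lambda^{-1/\theta}\,\Gamma(1 + 1/\theta)$. Both the parameterization and the function names are assumptions made for this example.

```python
import math

def weibull_mttf(lam, theta):
    """Mean time to failure for F_X(t) = 1 - exp(-lam * t**theta) (assumed form)."""
    return lam ** (-1.0 / theta) * math.gamma(1.0 + 1.0 / theta)

def weibull_lambda(mttf, theta):
    """Invert the MTTF equation: lam = (Gamma(1 + 1/theta) / MTTF) ** theta."""
    return (math.gamma(1.0 + 1.0 / theta) / mttf) ** theta

if __name__ == "__main__":
    # MTTF fixed at 900 minutes, theta swept over part of the range 1.0 to 5.0.
    for theta in (1.0, 2.2, 5.0):
        print(theta, weibull_lambda(900.0, theta))
```

Holding the MTTF fixed while sweeping $\theta$ changes only the shape of the failure density, which is what allows results for different $\theta$ to be compared.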
8 shows the expected
completion
time
plot-
1700.0 t ante assumed
1600.0
.-8 fi g
Checkpointing
1500,0
ability
g ~
in checkpointing
~ ~ 1300.0
of literature
checkpointing
of interest.
Rejuvenation
[Figure 8 x-axis: number of checkpoints (N)]
In computer
systems,
8: Effect
of rejuvenation
on completion
time
the number
of checkpoints
value of O (2.2 in this case) and illustrates of the cost of checkpointing
value for a certain juvenation pointing
number
distance
the amount
the extra time required these operations
Given
of rollback
the cost involved
starts
corrective
tion
Figure
dist ante
quired
increases,
to minimize
increases.
the expected
Note that
same for different
If the check-
[1] M. Sullivan
in performing
and their
as the rejuvenaof checkpoints
completion
impact
[2] J-C. Laprie, “Architectural
J. Arlat,
combined
to reduce the completion
completion
times
pared the results this
work
We derived
rejuvenation
equations
for the three possible numerically.
is to derive
the distribution
can be
Fault
of completion
real time
systems.
checkpointing dynamic
and
optimization
which dunamic checkpoint
by a given Another
extension
rejuvenation problem.
programming
ing policy.
deadline
extension
apply
John,
Proc.
pp.
and K. Kanoun,
M.
R. Lyu,
John,
1995.
Software
Wiley
fault-tolerance Fault
Tolerance,
& sons. ltd.,
[5] Y-M
Wang,
Kintala,
to
Proc.
in
pp. 231-
the equidis-
design,
“A
P. Vo, P-Y
Computing
Chung
and C.
and its applications”, on
Fault
California,
census of tandem
between
1985 and 1990”,
IEEE
ity,
39, pp.
Oct.
Vol.
implementation
CA, June 1995.
Y. Huang,
Pasadena,
N. D. Fulton,
of Fault-tolerant
“Checkpointing
[6] J. Gray,
is used to find the optimal
Proc.
of Symposium
Systems,
aa a two-dimensional
N. Koletis,
Pasadena,
Symposium,
is to formulate
Using this approach,
in
fault-tolerance”,
“Software
layer”,
Rejuvenation-
and analysis”,
of
of the comple-
[19] is an example
- A study
Symposium,
Ed.
pp. 47-80,
C. Kintala,
“Software
cases and com-
which
defects
systems”,
C. B60unes
Tolerance,
& sons. ltd.,
[4] Y. Huang,
tion time for these cases. This will enable us to evaluate other performance measures such as the probability
in
248, 1995.
for expected
One natural
availability
issues in software
Ed. M. R. Lyu, checkpointing
research
“Software
Computing
[3] Y. Huang and C, Kintala,
Work
that
with
be explored.
2-9, 1991.
Software
Future
on system
Fault-Tolerant
also
We have shown in this paper of a program.
this preventive
spur further
in operating
in the application
time
with
and should
re-
values of k,
and
Com-
and R. Chillarege,
failures
is not the
time
the value of this minima
Conclusions
will
prove useful.
References
Wiley
6
paper
toler-
dominates
of field
the number
this
of
fault
a re-
to dominate.
8 also shows that
techniques
may also be beneficial
IEEE
Second,
alternative
shifts
nature
this area.
If the check-
because of failures.
is too frequent,
and as the transient
such as rejuvenation
We hope that
First,
as the cause of failures
(minimum)
of checkpoints.
(or no rejuvenation)
is infrequent,
pointing
an optimum
other
technique
the tradeoffs
and rejuvenation.
each of the six curves attains
bining
for a particular
check-
optimiza-
based systems.
is accentuated,
ant techniques
finding
with
as a two-dimensional
to software
failures
Again,
has been a problem
for such transaction
from hardware software
strategy
can be combined
and analyzed
tion problem
avail-
based database systems and a wide exists in its analysis.
the optimal pointing
1200.0 0
ted against
mod-
haa also been used to maximize
of transaction
body
1400.0
Figure
and rejuvenation
els is not necessary.
[1] M. Sullivan and R. Chillarege, "Software defects and their impact on system availability - A study of field failures in operating systems", Proc. IEEE Fault-Tolerant Computing Symposium, pp. 2-9, 1991.
[2] J-C. Laprie, J. Arlat, C. Béounes and K. Kanoun, "Architectural issues in software fault-tolerance", in Software Fault Tolerance, Ed. M. R. Lyu, John Wiley & Sons Ltd., pp. 47-80, 1995.
[3] Y. Huang and C. Kintala, "Software fault-tolerance in the application layer", in Software Fault Tolerance, Ed. M. R. Lyu, John Wiley & Sons Ltd., pp. 231-248, 1995.
[4] Y. Huang, C. Kintala, N. Kolettis and N. D. Fulton, "Software Rejuvenation - Design, implementation and analysis", Proc. of Fault-tolerant Computing Symposium, Pasadena, CA, June 1995.
[5] Y-M Wang, Y. Huang, P. Vo, P-Y Chung and C. Kintala, "Checkpointing and its applications", In Proc. of Symposium on Fault Tolerant Computing, Pasadena, California, 1995.
[6] J. Gray, "A census of tandem system availability between 1985 and 1990", IEEE Trans. on Reliability, Vol. 39, pp. 409-418, Oct. 1990.
[7] J. Gray, "Why do computers stop and what can be done about it?", Proc. of 5th Symp. on Reliability in Distributed Software and Database Systems, pp. 3-12, January 1986.
[8] J. Gray and D. P. Siewiorek, "High-availability computer systems", IEEE Computer Mag., pp. 39-48, Sept. 1991.
[9] B. Randell, "System structure for software fault tolerance", IEEE Trans. on Software Engg., Vol. SE-1, pp. 220-232, June 1975.
[10] A. Avizienis, "The N-version approach to fault-tolerant software", IEEE Trans. on Software Engg., Vol. SE-11, No. 12, pp. 1491-1501, December 1985.
[11] P. Jalote, Y. Huang and C. Kintala, "A framework for understanding and handling transient failures", In Proc. of 2nd ISSAT Intnl. Conf. on Reliability and Quality in Design, March 8-10, 1995, Orlando, Florida, pp. 231-237.
[12] Inhwan Lee, "Software dependability in the operational phase", Ph.D. Thesis, Dept. of Electrical and Computer Engineering, Univ. of Illinois, Urbana-Champaign, October 1995.
[13] P. E. Ammann and J. C. Knight, "Data-diversity: an approach to software fault-tolerance", Proc. of 17th Intnl. Symp. on Fault Tolerant Computing, pp. 122-126, June 1987.
[14] K. G. Shin, T. Lin and Y. Lee, "Optimal checkpointing of real-time tasks", IEEE Transactions on Computers, Vol. C-36, No. 11, November 1987.
[15] C. H. C. Leung and Q. H. Choo, "On the execution of large batch programs in unreliable computing systems", IEEE Trans. on Software Engg., Vol. SE-10, No. 4, July 1984, pp. 444-450.
[16] E. G. Coffman and E. N. Gilbert, "Optimal strategies for scheduling checkpoints and preventive maintenance", IEEE Trans. on Reliability, Vol. 39, No. 1, April 1990, pp. 9-18.
[17] A. Duda, "The effects of checkpointing on program execution time", Information Processing Letters, Vol. 16, pp. 221-229, 1983.
[18] V. G. Kulkarni, V. F. Nicola and K. S. Trivedi, "Effects of checkpointing and queuing on program performance", Communications in Statistics - Stochastic Models, 6(4), pp. 615-648, 1990.
[19] S. Toueg and O. Babaoglu, "On the optimum checkpoint selection problem", SIAM Journal on Computing, Vol. 13, No. 3, pp. 630-649, August 1984.
[20] R. Geist, R. Reynolds and J. Westall, "Selection of a checkpoint interval in a critical-task environment", IEEE Trans. on Reliability, 37(4), pp. 395-400, 1988.
[21] K. S. Trivedi, "Probability and Statistics with reliability, queuing and computer science applications", Prentice-Hall, 1982.