Minimizing Completion Time of a Program by Checkpointing and Rejuvenation

Sachin Garg¹*, Yennun Huang², Chandra Kintala², Kishor S. Trivedi¹
sgarg@ee.duke.edu, yen@research.att.com, cmk@research.att.com, kst@ee.duke.edu

¹ Center for Adv. Comp. & Comm., Department of Elec. and Comp. Engg., Duke University, Durham, NC 27705
² AT&T Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974

* Supported in part by an IBM fellowship and by an AT&T Bell Laboratories summer internship.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMETRICS '96 5/96 PA, USA © 1996 ACM 0-89791-793-6/96/0005...$3.50

Abstract

Checkpointing with rollback-recovery is a well known technique to reduce the completion time of a program in the presence of failures. While checkpointing is corrective in nature, rejuvenation refers to preventive maintenance of software aimed to reduce unexpected failures mostly resulting from the "aging" phenomenon. In this paper, we show how both these techniques may be used together to further reduce the expected completion time of a program. The idea of using checkpoints to reduce the amount of rollback upon a failure is taken a step further by combining it with rejuvenation. We derive equations for the expected completion time of a program with finite failure free running time for the following three cases when; (a) neither checkpointing nor rejuvenation is employed, (b) only checkpointing is employed, and finally (c) both checkpointing and rejuvenation are employed. We also present numerical results for Weibull failure time distribution for the above three cases and discuss optimal checkpointing and rejuvenation that minimizes the expected completion time. Using the numerical results, some interesting conclusions are drawn about the benefits of these techniques in relation to the nature of failure distribution.

1 Introduction

Checkpointing with rollback recovery is a well known technique. It involves occasional saving of the program state on stable storage. Upon a failure, the software/program does not need to be restarted from the very beginning but can be restarted from the last saved checkpoint (rollback recovery). For a program with finite failure free running time, this technique substantially reduces the completion time. In earlier work on the analysis of checkpointing, failures were assumed to be caused mostly by hardware faults, independent of the program/software running on

, Failure I

them,

and for the most part

failure

process

has constantly reliability,

the assumption

was adequate. improved

in terms

it has been observed

technology

of performance

[8] that

emphasizing covery

software

blocks

[9], N-version

self-checking

nent techniques

for tolerating

of design diversity,

reactive

in nature,

tive approach

software

failure avoided ~—~_-

Re-

failurefi,

D _

Baeed are

Another

reac-

has been proposed

Figure

1: Effect

which

is preventive

stopping

ponents;

of a program

the volatile

OS environment program

state

program’s

execution

to resources operating

by three com-

the persistent

The volatile

stack and static

Persistent

is determined

state,

[5].

state and the

state

consists

and dynamic

of the

data segments.

refers to all the user files related

that

while

refers

must access through

the

system, such as swap space, file systems,

munication

channels,

keyboard,

monitors,

large percentage

OS environment buffer

failures

volve.

Lee [12] also observed

than 70% of software

failures

ware are manifestations conditions,

timing

of transient

problems,

because of an undesirable OS environment

in Tandem’s faults

etc. faulty

state

were

that

more

system

soft-

such as race

Such failures

of the program.

occur

reached

ing the hardware undesirable time

states

causing

of software failure.

aging

in the

servation

failures

is re-executed the

“aging”

calls for a fault-tolerant diversity

in the Flushing

freeing

reinitializing

unused

the inter-

up the file system

of what

known

is the “reboot”

In this paper,

rejuvenation

and very simple

etc.

are

might

in-

way of re-

of a computer.

we combine

which

completion

time

that further

reduction

by incorporating

state

reju-

We show

time is possible

Checkpointing

on stable

time in the

reduces the

can fail.

in the completion

however,

time

by itself

which

rejuvenation.

the volatile

environment,

execution

Checkpointing

of a program

with

the completion

has a finite

absence of failures,

saving

checkpointing

The goal is to minimize

of a program

involves

storagel.

The

is not saved as part

OS

of a check-

point.

This

are likely after

such

accrue with phenomenon

of the OS environment technique

the benefit

in the

venation

and is explained

program

fails at an arbitrary

shows

amount This

reduction

This idea is taken a step further

program

to disappear

phenomenon.

ing comes from

ob-

based on

[11] and has led to an approach

that

the

absence

due

to

sulte

in a smaller

point.

Assume

1sometimes, may


in

instant

checkpointing,

depending

also be saved.

of rollback.

by incorporating

the

1 as follows.

The

to the beginning

the

only

to the

failure

had

on the application,

See [5] for details.

reju-

and as seen in Fig-

of checkpointing.

rollback, that

of checkpoint-

amount

in Figure

ure 1(a), involves a large rollback

as a transient

a certain

In the presence of failures,

bugs

for shar-

resources,

manifests

and re-initialization

counteracts

environment

to “age”.

eventually

Such transient

of clean-up

system

in the OS environment

the software

if the program which

[7] and due to interactions and operating

conditions

lead to failure.

memory,

examples

its internal

It is also observed

in [4] that owing to the presence of subtle software called “Heisenbugs”

a

in na-

ture, i.e., they may not occur again if the program to be re-executed.

transient

by file server,

cleaning

A commonly

juvenating

and cleaning

might

allocated

some physical

al. [4] call

com-

that

are transient

that

tables,

et.

and it consists of occasionally

queues maintained

and wrongly

time etc. [5].

[1, 6], it has been observed

of software

and rejuvenation

Huang

program

state to remove the accrued

venation. In recent studies

in nature.

the running

nal kernel

to a

the OS environment

the program

of checkpointing

it Software Rejuvenation

in [13]. Behavior

Checkpoint Rejuvenation

the means of deal-

it has occurred.

based on data diversity

(c) No rollback

[10] and N

these techniques

i.e., they provide after

(b)

[2] are some of the promi-

on the principle

ing wit h a failure

; Failure

Evolution

fault-tolerance.

programming

programming

Rollback

and

has added to this effect,

the need for software

(a)

J

most of the fail-

ures are caused due to defects in the software. of large and complex

v

1

of Poisson

As hardware

of the

Figure same last

failure

saved

occurred

l(b) re-

checkbecause

the persistent

state

through

gradual

deterioration,

an undesirable

reached in the OS environment ure 1(c), the program a checkpoint, state

thus removing

occurred

prevents The

so far.

planned

This

stopping,

to failure

checkpoint

results

to a large extent us with

“renewal”

cleaning

failures

rest

model

on checkpointing pointing pletion

deals with

analysis

model

as follows.

previous

work

a very simple

and precisely

a numerical

possible

Figure

of

2: Assumptions

on time to failure

tion

of

with

the case of minimizing

time

imizing

deadline

retries

com-

is the most

Weibull

As most

of comthe pa-

feature

terchangeably

and “program”

of the earlier

work

hardware

sumption

is that

the time

ponential

[18, 17, 14]. This

Figure

2 illustrates

rejuvenation

Figures

analysis

has only

cal problem which

will depend

alyzed

both

pletion

system.

new concept carried

out.

the rejuvenation of interest

The formulation

and its

point)

A t ypipolicy

given a partic-

2(b)-(d)

of the problem

running

time.

In this

with paper,

Such analyses

the performance set of assumptions

measure

transaction failure

we restrict

differ

as-

is ex-

was removed

in

and the failure

restriction

(Figure

generalization

to

after

respects; and the

the system behavior.

2(c)).

the time


during

checkat each

in [14, 19] by allowing

upon Figure

that

tributed,

which by it memoryless

checkpointing.

completion 2(d)

shows

of each another

process does not renew

[18, 15].

assumed

to failure

to the case shown in Figure

a check-

distribution

renewed

through

where the failure

each checkpoint

(the

2(b)) was made in [17]. The

was removed renewed

2(a)

by Fx(t)),

undergoing to failure

of no failures

process to continue still

Figure

of the program

process getting

(shown in Figure

checkpoint

oriented

process.

the program

points

in assumptions

(denoted

by the time

An early assumption

It is, however,

free com-

ourselves

in two main

evaluated/optimized

made regarding

distribution

difference

distribution

Fx (t).

the failure

a finite

as

distributions.

plot the execution

bars represent

former

on the other hand, has been well an-

for infinitely

checkpointing

restriction

of the failure

superimposed

checkpoint

on the system specifics.

and programs

the latter.

been

of finding

the measure

Checkpointing, software

recently

is that

optimizes

ular software

is a fairly

made process.

most common

to failure

the

plots the time to failure

Work

vertical Software

important

of the failure

failures,

of

are used in-

in this paper.

Previous

Another

regarded

general

made on the renewal

2

by a

number

is the set of assumptions

[19, 15] by considering “software”

of a task

specified

and renewal

a way of tolerating

of this work

of completion

[14] or in a finite

on the distribution

comple-

[17, 15, 19, 18], max-

[20] has also been evaluated.

distinguishing

In

the expected

common

the probability

given

state the

cases are derived.

implications

Whereas

check-

for the expected

example

In

done

directions.

The words

distribution

checkpointing

time distribution to illustrate the benefit the two techniques. Finally, we conclude

and future

“t)m(d)t

the

and state the contributions

In Section 4, expressions

6 with

t

This leaves

the

a particular

In Section 3, we present

per in Section

a

it reduces

failures.

and is organized

analysis

5, we provide

failure bining

(as

after

the proba-

non-zero

time for the three possible

Section

right

Although,

and compare

and rejuvenation

problem.

failure.

?

under

2, we outline

this paper.

time)

is still

paper

above three scenarios Section

of the program

of when and how often should

of the

and rejuvenation

in the OS

up and restarting

for such transient

be rejuvenated

The

after

an unexpected

in no rollback.

the question

program

immediately

at an arbitrary

of unexpected

In Fig-

the degradation

(or at least postpones)

opposed

bility

is rejuvenated

“t)=(a)

state was

of the program.

2(c).

In [18], however,

it is

is exponentially

dis-

property

is equivalent

Leung

and Choo [15]

assume arbitrary

distribution

lem in the most

general

failures

to occur during

Our assumptions shown

in Figure

rollback

2(d).

checkpointing

arbitrary

distribution.

The

after

rejuvenation

the program

however, Failure

fails

State

failure

continues

during

As an example,

libf

form application

t2 provides

library

level checkpointing

Our

of

stable

time yf.

Rejuvenation

stopping from

a failure

[3].

bined

with

knowledge,

rejuvenation

for the first

per as a two-dimensional tal

number

cution

along which the expected and Gilbert,

ered preventive but assumed nance with checkpointing “save”.

that every

the two

checkpoint.

and preventive

the exe-

operation

maintenance

essentially

optimization

failure

total

in the failyielding

2(c) and still remaining

time

simply

problem.

shall

controlled

3

Problem

Assume ecution

that time

checkpointing

a given

Formulation program

to complete

requires

in the

or rejuvenation.

which

absence

Further,

w is a constant and shall be referred requirement” of the program. Time z Mbf

t is a

registered trademark

w units

of AT&T

time

of failures, assume

constant

the first

Laboratories.

exits.


~mpficity,

dimension

w/N+

ex-

of the opti-

being

with

each

w/N.

3 The

(including

CYdenoted

of the program

Our

it after

as /3. The

including

integer

every

constant

with

have

of execution

kth

check-

given

assumed takes

and

dimension

be

along

completion

k and then

that the Nth

may

is to be minimized.

the expected

the

We In our

value

the expected N

is

checkpoint.

whose

time

evaluate

model

distance.

the second

completion

is to first

we

rejuvenation

rejuvenation

and it constitutes

w units

to finish

whose value can be

interval

Model:

of a program

3 For

pleting

the program

are equidistant

values of N and k for which

that

to as the “workto failure of the Bell

We assume that

of N checkpoints

to k as the

goal

astimes.

it runs on, none of the

of each of these segments

the expected

Our

of ex-

may

is thus N~.

k is an

model,

time

to justifying

Model:

is therefore

to perform refer

compared

and rejuvenation

a total

checkpoints

Rejuvenation

a one-

as the

can be controlled.

work requirement

pointing

applies

recovery

free inter-checkpoint

the checkpoint)

is very small

and a system

work requirement

was called a

resulted

The

The assump-

in checkpointing

reasoning

and constitutes

mization.

of

state

N is an integer

varied

[16] consid-

renewed at every checkpoint

model of Figure

ecution.

some mainte-

The joint

a program

needs to complete

checkpointing,

underwent

Similar

as it cleans inter-

time is reasonable

variation

of constant

Checkpointing

num-

reboot

may also sim-

and frees memory.

To-

dimensions

models

along with

the system

the

after a “crash”

Rejuvenation

planned

above parameters

pa-

time is minimized.

in one of their

The combination

dimensional

and during

completion

maintenance

ure process getting a failure

checkpoints

constitute

in this

problem.

to be performed

of the program

Coffman

time

A sim-

occasional,

Given

is com-

involve

reinitializes

sumptions

of a program.

optimization

of equidistant

ber of rejuvenations

time

both

equal to ~,.

ply involve

slight

It

time y~. Since

of restarting

to save the volatile

after

up, and reload-

nal tables,

be ignored.

can be

checkpointing

cleaning

checkpointing

right

a checkpoint.

and rejuvenation,

example

by cr. and the

to take a

is performed

Yf is typically

occurs is “reboot”.

a

the last saved checkis assumed

completed

the program,

ple (yet common)

time

the completion

To the best of our

and is denoted

This

has successfully

to w. Thus,

used to minimize

storage,

with

a check-

by reloading

is restarted

from

failure

to per-

X

program

the same procedure,

level.

and checkpointing

variable

to complete

some cleanup is performed

recovery

Contribution

We show how rejuvenation

random

Time

to be constant

tion of constant

2.1

the

ing and is also assumed to take a constant

the check-

routines

is assumed

by

Fx (.).

Upon a crash failure,

includes

or

which

process when it is done at the application

point

the program

process is

degradation

distribution

constant

is as-

and is restarted

is denoted

given

point

has an

respect to its OS environment,

causes the transient pointing

continues

to failure

in recovery.

is performed.

with

process

program,

if the program

program

for

those of [15] and are

failure

The

also allow

recovery.

and the time

sumed not to fail while only

They

closely follow

through

renewed

and have solved the prob-

setting.

to find

completion

Program

checkpoint

after

and

com-

then

We shall denote by $N^*$ and $k^*$ respectively the optimal values of $N$ and $k$, for which the expected completion time is minimal.

4 Expected Completion Time

Let $T(w)$ denote the completion time of the program when neither checkpointing nor rejuvenation is employed and $w$ is the "work-requirement". Clearly, $T(w)$ is a random variable due to randomness in the failure process. Note that in the absence of any failures, $T(w) = w$ units (deterministic in this case). Similarly, let the random variable $T_c(N\beta, N)$ denote the completion time of the same program when only checkpointing is employed ($N\beta$ represents the work requirement) and finally let $T_{r,c}(N\beta, N, k)$ denote the completion time when both checkpointing and rejuvenation are employed. It is straightforward to see that the case of employing only rejuvenation is meaningless, as there is no saved state to restart from an intermediate point. As the program is started from the beginning after rejuvenating, a rejuvenation policy of rejuvenating every $r < w$ time units would never allow the program to complete. We now proceed to derive expressions for the expected values of these three random variables, viz. $E[T(w)]$, $E[T_c(N\beta, N)]$ and $E[T_{r,c}(N\beta, N, k)]$.

4.1 No Checkpointing or Rejuvenation

Figure 3: Program with no Checkpointing or Rejuvenation

Let $\bar F_X(w) = 1 - F_X(w)$; then the expected completion time is given by the following theorem.

Theorem 1:

$$E[T(w)] = w + \frac{\gamma_f F_X(w) + \int_0^w x \, dF_X(x)}{\bar F_X(w)}$$

Proof: We proceed by conditioning on the time to first failure $X = x$ (see Figure 3). If $x > w$, i.e., if the failure does not occur within $w$ units, the program completes. On the other hand, if a failure does occur before completion ($x < w$), the program completion time is the sum of three distinct components. First, the time used so far ($x$), second, the time to recover and restart the program ($\gamma_f$) and last, the time for the remaining work-requirement. As the program is started from the very beginning (shown by the dotted line in Figure 3), the remaining work requirement is still $w$ and its expected completion time is $E[T(w)]$. Formally, the conditional completion time is written as:

$$E[T(w) \mid X = x] = \begin{cases} w & \text{if } x \ge w \\ x + \gamma_f + E[T(w)] & \text{if } x < w \end{cases}$$

By the law of total expectation,

$$E[T(w)] = \int_0^\infty E[T(w) \mid X = x] \, dF_X(x).$$

Therefore,

$$E[T(w)] = w \bar F_X(w) + \int_0^w \left( x + \gamma_f + E[T(w)] \right) dF_X(x) = w \bar F_X(w) + \left( \gamma_f + E[T(w)] \right) F_X(w) + \int_0^w x \, dF_X(x).$$

Rearranging with respect to $E[T(w)]$, we get

$$E[T(w)] = w + \frac{\gamma_f F_X(w) + \int_0^w x \, dF_X(x)}{\bar F_X(w)} \qquad \Box$$

4.2 Checkpointing Only

Figure 4: Program with checkpointing

The program execution is shown in Figure 4. Under our model, $N$ equidistant checkpoints are taken, dividing the program into $N$ segments, each with a work requirement $\beta$ (including the checkpointing time). The expected completion time is given by the following recurrence relation.

Theorem 2:

$$E[T_c(N\beta, N)] \, \bar F_X(\beta) = N\beta \, \bar F_X(N\beta) + \gamma_f F_X(N\beta) + \int_0^{N\beta} x \, dF_X(x) + \sum_{i=1}^{N-1} E[T_c((N-i)\beta, N-i)] \left[ F_X((i+1)\beta) - F_X(i\beta) \right]$$

Proof: Again, we proceed by conditioning on the time to first failure $X = x$. If $x > N\beta$, i.e., failure does not occur within any of the $N$ segments, the completion time is $N\beta$. If however, failure occurs in the $i$th segment (say $(i-1)\beta < x < i\beta$, for $i$ from 1 to $N$), the program completion time is the sum of three distinct components. First, the time spent so far, $x$, second, the restart time $\gamma_f$ and last, the expected completion time of the remaining work requirement. As $(i-1)$ segments have already been completed, the remaining work requirement is $(N-i+1)\beta$, the completion time of which is denoted by the random variable $T_c((N-i+1)\beta, (N-i+1))$. Formally, the conditional completion time may be written as:

$$E[T_c(N\beta, N) \mid X = x] = \begin{cases} N\beta & \text{if } x \ge N\beta \\ x + \gamma_f + E[T_c((N-i+1)\beta, N-i+1)] & \text{if } (i-1)\beta \le x < i\beta, \; 1 \le i \le N \end{cases}$$

By the law of total expectation,

$$E[T_c(N\beta, N)] = \int_0^\infty E[T_c(N\beta, N) \mid X = x] \, dF_X(x)$$

$$= \int_0^{\beta} \left( x + \gamma_f + E[T_c(N\beta, N)] \right) dF_X(x) + \int_{\beta}^{2\beta} \left( x + \gamma_f + E[T_c((N-1)\beta, N-1)] \right) dF_X(x) + \cdots$$

$$+ \int_{(N-2)\beta}^{(N-1)\beta} \left( x + \gamma_f + E[T_c(2\beta, 2)] \right) dF_X(x) + \int_{(N-1)\beta}^{N\beta} \left( x + \gamma_f + E[T_c(\beta, 1)] \right) dF_X(x) + \int_{N\beta}^{\infty} N\beta \, dF_X(x).$$

Combining the $x$ and $\gamma_f$ terms, we get

$$E[T_c(N\beta, N)] = N\beta \, \bar F_X(N\beta) + \gamma_f F_X(N\beta) + \int_0^{N\beta} x \, dF_X(x) + \sum_{i=0}^{N-1} \int_{i\beta}^{(i+1)\beta} E[T_c((N-i)\beta, (N-i))] \, dF_X(x).$$

Combining $E[T_c(N\beta, N)]$ on both sides, evaluating the integral and rearranging, we get

$$E[T_c(N\beta, N)] \, \bar F_X(\beta) = N\beta \, \bar F_X(N\beta) + \gamma_f F_X(N\beta) + \int_0^{N\beta} x \, dF_X(x) + \sum_{i=1}^{N-1} E[T_c((N-i)\beta, N-i)] \left[ F_X((i+1)\beta) - F_X(i\beta) \right] \qquad \Box$$

The above recurrence relation for $E[T_c(N\beta, N)]$ involves a weighted sum of $E[T_c(i\beta, i)]$ for all $1 \le i < N$ and does not have a simple closed form expression. However, a numerical solution, iterative or recursive, is straightforward.

4.3 Checkpointing and Rejuvenation Combined

Figure 5: Program with checkpointing and rejuvenation

Under our checkpointing and rejuvenation model, the program takes a total of $N$ checkpoints, and rejuvenation is performed after every $k$th checkpoint, with $k \ge 1$.
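As a numerical illustration of Theorem 1 (Section 4.1) and the recurrence of Theorem 2 (Section 4.2), the following is a minimal Python sketch, not the authors' implementation. It makes several assumptions that are not confirmed by the text: the Weibull distribution is taken as $F_X(t) = 1 - e^{-\lambda t^\theta}$ with MTTF $= \lambda^{-1/\theta}\,\Gamma(1 + 1/\theta)$, each segment is taken to have $\beta = w/N + \alpha$, the failure process is assumed to renew after every recovery, and all function names and the trapezoidal integration are hypothetical choices made for illustration. The parameter values in the usage section echo the numbers that appear in the numerical discussion (MTTF = 900, $\alpha$ = 4, $\gamma_f$ = 5, $\theta$ = 2.2), with $w$ = 1200 assumed to be the work requirement.

```python
import math


def weibull_cdf(t, lam, theta):
    """Assumed Weibull form: F_X(t) = 1 - exp(-lam * t**theta)."""
    return 1.0 - math.exp(-lam * t ** theta)


def lam_from_mttf(mttf, theta):
    """Solve MTTF = lam**(-1/theta) * Gamma(1 + 1/theta) for lam."""
    return (math.gamma(1.0 + 1.0 / theta) / mttf) ** theta


def partial_mean(upper, cdf, steps=20000):
    """Integral of x dF_X(x) over [0, upper].

    Uses integration by parts, upper*F(upper) - integral of F(x) dx,
    with the trapezoidal rule for the remaining integral.
    """
    h = upper / steps
    area = 0.5 * (cdf(0.0) + cdf(upper))
    for j in range(1, steps):
        area += cdf(j * h)
    return upper * cdf(upper) - area * h


def expected_no_checkpoint(w, gamma_f, cdf):
    """Theorem 1: E[T(w)] = w + (gamma_f*F(w) + int_0^w x dF) / (1 - F(w))."""
    f_w = cdf(w)
    return w + (gamma_f * f_w + partial_mean(w, cdf)) / (1.0 - f_w)


def expected_checkpoint_only(n_ckpt, beta, gamma_f, cdf):
    """Theorem 2, solved bottom-up for n = 1 .. n_ckpt segments."""
    e = [0.0] * (n_ckpt + 1)  # e[n] holds E[T_c(n*beta, n)]
    for n in range(1, n_ckpt + 1):
        total = n * beta
        rhs = (total * (1.0 - cdf(total))
               + gamma_f * cdf(total)
               + partial_mean(total, cdf))
        # Weighted sum over the smaller sub-problems E[T_c((n-i)*beta, n-i)].
        for i in range(1, n):
            rhs += e[n - i] * (cdf((i + 1) * beta) - cdf(i * beta))
        e[n] = rhs / (1.0 - cdf(beta))
    return e[n_ckpt]


if __name__ == "__main__":
    # Illustrative values only; w = 1200 is an assumed work requirement.
    mttf, theta, w, alpha, gamma_f = 900.0, 2.2, 1200.0, 4.0, 5.0
    lam = lam_from_mttf(mttf, theta)
    cdf = lambda t: weibull_cdf(t, lam, theta)
    print("no checkpointing:", expected_no_checkpoint(w, gamma_f, cdf))
    for n in (5, 10, 15, 20):
        beta = w / n + alpha  # assumed segment length incl. checkpoint time
        print(n, "checkpoints:", expected_checkpoint_only(n, beta, gamma_f, cdf))
```

With a single segment ($N = 1$) the recurrence reduces to Theorem 1, and for an exponential distribution ($\theta = 1$) Theorem 1 simplifies to $(\gamma_f + 1/\lambda)(e^{\lambda w} - 1)$, which gives two quick sanity checks on the sketch.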

For d = 1, Weibull

distribution. in the density

of failure

of

Furthermore, higher

values

function

are concentrated

of

where

in a small

is given by [21]

aa work re-

is k’/3. of the

time.

The mean time to failure

~ 1/0

The work requirement

however,

Versatility

time,

of the distribution,

larger probabilities

For i

with

to the exponential

for a given d imply

the two parameters.

lies in the choice of 6 to vary the failure

If 01.0

data associated

program

N*, k“)]

oryless property

using

Checkpointing

is performed,

17[T’’c(N*P,

function

are close to the time

close to the time it takes to save critical

Given

time

of 15 checkpoints,

distance takes a value from

rejuvenation

being fixed at 900. Larger

a large computer.

with a large scientific

venation

Restarting

and d, A is calculated

it is performed.

for % = 1.0, for a total

$\theta$ takes values from

is the density

and the restart

5.0

7: Effect of $\theta$ on completion

head of y, every time

parame-

take $\gamma_r = \gamma_f = 5$

6 plots the Weibull

O values, MTTF

Figure

in the absence of re-

= 900 minutes.

1.0, to 5.0. Given the MTTF equation

4.0

3.0

e

1200 min-

takes $\alpha$ = 4 minutes.

MTTF

2.0

1.0

values of O).

The mean time to failure

juvenation

and rejuvenation

values of $\theta$. It can be seen

(larger

following

only

checkpointing

~

we obtain

from the numerical results that for the same mean time to failure, the benefit from rejuvenation increases for peskier

checkpointing

the optimal

increase

in $\theta$.

and should

The variation on actual

rejuvenation Note

that

k is

not be mistaken in time to rejuve-

parameter

values and an

is not obvious.

any Figure

in an over-


8 shows the expected

completion

time

plot-

1700.0 t ante assumed

1600.0

.-8 fi g

Checkpointing

1500,0

ability

g ~

in checkpointing

~ ~ 1300.0

of literature

checkpointing

of interest.

Rejuvenation

5

10

15

number

of checkpoints

(N)

20

In computer

systems,

8: Effect

of rejuvenation

on completion

time

the number

of checkpoints

value of $\theta$ (2.2 in this case) and illustrates of the cost of checkpointing

value for a certain juvenation pointing

number

distance

the amount

the extra time required these operations

Given

of rollback

the cost involved

starts

corrective

tion

Figure

distance

quired

increases,

to minimize

increases.

the expected

Note that

same for different

If the check-

[1] M. Sullivan

in performing

and their

as the rejuvenaof checkpoints

completion

impact

[2] J-C. Laprie, “Architectural

J. Arlat,

combined

to reduce the completion

completion

times

pared the results this

work

We derived

rejuvenation

equations

for the three possible numerically.

is to derive

the distribution

can be

Fault

of completion

real time

systems.

checkpointing dynamic

and

optimization

which dynamic checkpoint

by a given Another

extension

rejuvenation problem.

programming

ing policy.

deadline

extension

apply

John,

Proc.

pp.

and K. Kanoun,

M.

R. Lyu,

John,

1995.

Software

Wiley

fault-tolerance Fault

Tolerance,

& sons. ltd.,

[5] Y-M

Wang,

Kintala,

to

Proc.

in

pp. 231-

the equidis-


design,

“A

P. Vo, P-Y

Computing

Chung

and C.

and its applications”, on

Fault

California,

census of tandem

between

1985 and 1990”,

IEEE

ity,

39, pp.

Oct.

Vol.

implementation

CA, June 1995.

Y. Huang,

Pasadena,

N. D. Fulton,

of Fault-tolerant

“Checkpointing

[6] J. Gray,

is used to find the optimal

Proc.

of Symposium

Systems,

as a two-dimensional

N. Koletis,

Pasadena,

Symposium,

is to formulate

Using this approach,

in

fault-tolerance”,

“Software

layer”,

Rejuvenation-

and analysis”,

of

of the comple-

[19] is an example

- A study

Symposium,

Ed.

pp. 47-80,

C. Kintala,

“Software

cases and com-

which

defects

systems”,

C. Béounes

Tolerance,

& sons. ltd.,

[4] Y. Huang,

tion time for these cases. This will enable us to evaluate other performance measures such as the probability

in

248, 1995.

for expected

One natural

availability

issues in software

Ed. M. R. Lyu, checkpointing

research

“Software

Computing

[3] Y. Huang and C. Kintala,

Work

that

with

be explored.

2-9, 1991.

Software

Future

on system

Fault-Tolerant

also

We have shown in this paper of a program.

this preventive

spur further

in operating

in the application

time

with

and should

re-

values of k,

and

Com-

and R. Chillarege,

failures

is not the

time

the value of this minima

Conclusions

will

prove useful.

References

Wiley

6

paper

toler-

dominates

of field

the number

this

of

fault

a re-

to dominate.

8 also shows that

techniques

may also be beneficial

IEEE

Second,

alternative

shifts

nature

this area.

If the check-

because of failures.

is too frequent,

and as the transient

such as rejuvenation

We hope that

First,

as the cause of failures

(minimum)

of checkpoints.

(or no rejuvenation)

is infrequent,

pointing

an optimum

other

technique

the tradeoffs

and rejuvenation.

each of the six curves attains

bining

for a particular

check-

optimiza-

based systems.

is accentuated,

ant techniques

finding

with

as a two-dimensional

to software

failures

Again,

has been a problem

for such transaction

from hardware software

strategy

can be combined

and analyzed

tion problem

avail-

based database systems and a wide exists in its analysis.

the optimal pointing

1200.0 0

ted against

mod-

has also been used to maximize

of transaction

body

1400.0

Figure

and rejuvenation

els is not necessary.

409-418,

Tolerant

Computer

1995. system

availability

Trans.

on Reliabil-

1990.

In

[7] J. Gray, “Why do computers stop and what can be done about it?”, Proc. of 5th Symp. on Reliability in Distributed Software and Database Systems, pp. 3-12, January 1986.

[8] J. Gray and D. P. Siewiorek, “High-availability computer systems”, IEEE Computer Mag., pp. 39-48, Sept. 1991.

[9] B. Randell, “System structure for software fault tolerance”, IEEE Trans. on Software Engg., Vol. SE-1, pp. 220-232, June 1975.

[10] A. Avizienis, “The n-version approach to fault-tolerant software”, IEEE Trans. on Software Engg., Vol. SE-11, No. 12, pp. 1491-1501, December 1985.

[11] P. Jalote, Y. Huang and C. Kintala, “A framework for understanding and handling transient failures”, In Proc. of 2nd ISSAT Intnl. Conf. on Reliability and Quality in Design, March 8-10, 1995, Orlando, Florida, pp. 231-237.

[12] Inhwan Lee, “Software dependability in the operational phase”, Ph. D. Thesis, Dept. of Electrical and Computer Engineering, Univ. of Illinois, Urbana-Champaign, October 1995.

[13] P. E. Ammann and J. C. Knight, “Data-diversity: an approach to software fault-tolerance”, Proc. of 17th Intnl. Symp. on Fault Tolerant Computing, pp. 122-126, June 1987.

[14] K. G. Shin, T. Lin and Y. Lee, “Optimal checkpointing of real-time tasks”, IEEE Transactions on Computers, Vol. C-36, No. 11, November 1987.

[15] C. H. C. Leung and Q. H. Choo, “On the execution of large batch programs in unreliable computing systems”, IEEE Trans. on Software Engg., Vol. SE-10, No. 4, July 1984, pp. 444-450.

[16] E. G. Coffman and E. N. Gilbert, “Optimal strategies for scheduling checkpoints and preventive maintenance”, IEEE Trans. on Reliability, Vol. 39, No. 1, April 1990, pp. 9-18.

[17] A. Duda, “The effects of checkpointing on program execution time”, Information Processing Letters, Vol. 16, pp. 221-229, 1983.

[18] V. G. Kulkarni, V. F. Nicola and K. S. Trivedi, “Effects of checkpointing and queuing on program performance”, Communications in Statistics-Stochastic Models, Vol. 6(4), pp. 615-648, 1990.

[19] S. Toueg and O. Babaoglu, “On the optimum checkpoint selection problem”, SIAM Journal on Computing, Vol. 13, No. 3, pp. 630-649, August 1984.

[20] R. Geist, R. Reynolds and J. Westall, “Selection of a checkpoint interval in a critical-task environment”, IEEE Trans. on Reliability, 37(4), 395-400, 1988.

[21] K. S. Trivedi, “Probability and Statistics with reliability, queuing and computer science applications”, Prentice-Hall, 1982.
