On the Performance of Software Fault-Tolerance Strategies - CiteSeerX

1 downloads 159 Views 244KB Size Report
fault-tolerant software approaches3–5 which allow to recover after a software .... service .. rate of “Safe Down and Repair State” while “Continue” and “Fadure”.
ON THE

PERFORMANCE

OF SOFTWARE

A. Grnarov’,

FAULT-TOLERANCE

J. Arlat””,

STRATEGIES+

A. Avizienis

Computer Science Department University of California Los Angeles, Cahfornia 90024, USA

ABSTRACT In the papera

comparison

continuation of the program when A T k successful (for RB only), No processing time is required for this state, “Update” is execution

of processing

time

and reliability

of the deciwon strategy in NV followed by updating of all versions with tbe “best” result. “Recovery” means selection and initialization of the next alternate or version for RB or NVS respectively, For NVP (as well as for RB and NVS after tbe last alternate or version

performance for the Recovery Blocks scheme and N-Version Programming technique IS presented. Derived queueing models can be useful in deciding which of tbe strategies should be used, depending on system parameters.

bas been executed),

INTRODUCTION Despite

all

efforts

in careful

design

of

software

systems,

they are unlikely to be error free and a lot of effort is still required to ehminate “bugs” after a system has been made available to the user, Recently self-checking software approachesi,2 (limited to detection

of

software

failures),

and

fault-tolerant

software

approaches3–5 which allow to recover after a software failure occured, were proposed. Figure 1 presents a general model for umfied interpretation of the two mam fault-tolerance strategies: Programmings’. The Recovery Blocks scheme~b and N-Version Recovery Blocks scheme (RB) consists of three major entities: a a list of supplementary alternates alterndte primary (AI), (.42, ,.. , AN-I) and an acceptance test (AT). The AT can be

(~)

for NV7.

“Continue”

fault-tolerance

rate of “Safe Down

and Repair”

the average segment processing time or NV fault-tolerance strategy, tbe

strategy;

and Repair

p~

= ~

is the

..

State” while

“Continue”

average

service

and “Fadure”

states have infimte service rates. Queueing models have been derived for each strategy, but due to the limitation of space, derivation of NVS mode19 will not be presented in this paper. In

the

followlng

P:

= 1 – Q;

we

where “*”

will

notation P, =

probability

BLOCKS

=l—q* *

and

MODEL

of no faults

in alternate

2, where

the

following

A,

of success of,4 T

P, =

Probability

of success of A, if A, executed.

Q’, =

Probablhty

of failure

The service ‘This research is supported in part by the ONR Contract NOOO1479-C-0866, in part by the INRIA, France, in part by the NSF Contract MCS78-18918, and in part by a Grant of the Battelle Memorial Institute. * On leave from Electrical Engineering Department, Umverslty of Skopje, Yugoslavia. *‘ On leave from LAAS du CNRS, Toulouse, France.

p * *

that

A RB model is shown in Figure is used for 1 = 1, 2, . .. . N–1:

PAT = Probability

is

assume

stands for any symbol.

RECOVERY

multiple hardware systems, which make a parallel execution of the versions (NW) quite practical. However, in order to perform a fair comparison wltb RB, we also consider a “sequentia~ application of NV (NVS). Inthemodel ltisassumed that aprogram is partitioned into S segments, executed oneat a given time, and that the’’length (execution time) of the segments is exponentially distributed P1, P2, ,P~ represents alternates and versions in the case of RB and NV respectively, “TesV k the execution of AT m the case of “c-vectors”

In order to determine the system with RB

in

wltbout

compared and, in the case of disagreement, a preferred result can be identified by a majority vote (possibly inexact) or another predeterThe bad versions are mined strategy (e.g., two-out-of-N). recovered by updating them with data provided by identified good versions and normal processing of all N versions can be resumed. The primary motivation for NVwasthe exploitation of fault-tolerant

and the comparison

to “Safe Down

queueing theory8 is used. The program segments are considered as “customers” and “service rates” of states are defined as follows: p~ is the average service rate for processing a segment in a perfect system

considered as an abstract implementation of the function performed by alternates. When conditions stated by ATare not met, a purging of data E performed and a new alternate is called. According to pubhshed work, in tbe paper a sequential application of RB is considered only. The N-Version Programmmg (NV) requires N>2 independently designed programs (versions) for the same function (task) The obtained results after each stage of computation are

of RB,

it means a transition

state, where a higher level (global) recovery is initiated. “Next Segment” is the outcome of this state meanning that all actions that return the system to the successful execution of this segment are included as part of the “Repair” action, “Failure” is entered when a fault is undetected I.e., invalid results are passed on to the next segment. This can happen when correlated faults occur i.e., when m the same way fail: a) RB: an alternate and AT, b) NVS: two versions, c) NVP: majority of the versions.

in the same way of A, and AT

rate of state (alternate)

I is given

by ~, = 1

t,

where

the

average alternate i processing time ~ = ~ + ~ + ~~~ while ~ is the average segment execution time, ~ is the average alternate i setup time, and ~4~ IS the average ,4 T execution time. For computational convenience, we use IA, =w,

(l +a,

+k)-’

=K,

(l + tr, )-l

251 1604-8/80/0000-0251$00.75

@1980

IEEE

p

1 p~ = ~,

where We

also

c“JE[l,

use

+-]

7A T

a, = ~,

,s

the

for~

k = y

and b, == a,+k

notation:

Q, = c“, q,

=2,3,...,

N–1;

where

c“, = 1,

and

and also Q’, = c’, q, where

-, c’, < [0,1]

is the conditional

probability

P(,4, and ,47’ fail in the same

way if A, failed). ‘The following

assumption

can be made for the model:

a) The setup times for different alternates do not dffer and accordingly, a, = a, b, = b, and p, = P,(1 + b)-l b) The probabilities P, =P~,

of no faults

in the alternates

significantly = Ka

are the same, i.e.,

c“, = c“, and c’, = C’. while L = N–m–l LL>O), q“= The probability of fadure is given

c) Since the coverage of AT increases with its “length (execution time) and since the probability of software faults increases with the length, we combine both of them in the probability of success of the acceptance test c = [(1–a)k+al

According to the previous, stages 8, the average segment

fault

in A T if its length

‘RB=H+QP’11-(UN-2 +*(P’)N-21 i-

1– I

QN-’

PAT q.

Concerning the probability

R



‘l+b

of the RB

of entering

QN-2

.

P

the reliability

strategy,

the “Failure”

[

Probability

at least m versions

P’ =

Probabdity

while

majority

Adoption notation:

of analogous

if versions

II

N+l — 2

assumptions

Figure

3,

considered strategies as functions of Nd; we consider also the influence of RB, the reliabihty of the A T as function of the initial coverage a and the processing time ratio k; NV, the “correlation” factor c’. These results point out the criticality of the A T on the reliability of RB and shows that the impact of c’ with respect to NV is approximately constant in the case of 2NVS while It increases with the value of Nd in the case of NVP. In both Figures, the values of other system parameters are assumed to be. pa= PV=,999, qo=.OO1, c’’= 101a =.l, andd=. l.

agree

for N odd or even

as for RB leads to the following

CONCLUDING w., =K,

p where

a = ~,

version

setup

choice

as m =

MODEL in

(l +a

+ d)-]

=K,

ts

time,

and

and v = a + d; ‘d t

is the average

while

?

is the average

The models presented in the paper have been introduced in to achieve comprehensive and quantitative evaluation of the The performances of the software fault-tolerance strategies.

decision

and best result

order

developed queueing models are used for both time efficiency and reliability evaluauon of the two main fault-tolerance strategies: the Recovery Blocks scheme and the N-Version Programming. Some of the obtained results are presented in the paper showing that the models can be useful for system designers m deciding which of the strategies should be used depending on system parameters.

time.

For /=1,2,.., where q‘, k sion failed)

N, we consider also: p, = p, ; q’, = q’ = c“ q,, the conditional probability P( V, fails if any other verwhile

c“E [1, ~];

conditional probability version if V, faded).

and the factor

P( V, fails

It follows that the average segment NVP strategy is given by

REMARKS

(l + u)-’

id

d = =-,

NV STRATEGIES

ties Nd, with the repair ratio R as a parameter. RB appears to be more sensitive to the value of R than NV. Furthermore for both 2NVS and NVP, the time penalty is considerably reduced when Nd grows; while for RB, the influence of Nd strongly depends on the reliability of the AT, as it can be seen for ddTerent values of A T processing time ratio k Figure 5 presents the variations of the probabilities of failure for the

agree

at least m good versions m N defined

m

1 IS shown by:

OF RB AND

PFO, where /3: (RB, 2NVS and ~sVP) In order to perform a coherent comparison, we define Nd, the number of diversified entities required by each strategy, as: RB, N – 1 alternates and the AT; NV, N versions. Figure 4 compares the relative time penalties introduced by use of RB, 2NVS and NVP strategies as functions of the diversified enti-

is equal to

PROGRAMMING

A model for parallel NV strategy the transition probabilities are defined

where P =

N-VERSION

l–c’q’

[n this section, we present some of the results derwed from the discussed models, as well as results for sequential NV with twoout-of-N decision strategy (2 NVS). As measures for evaluation, we use: a) the relative increase of_the average segment processing time Tfl-? b) the probabihty of failure in a system defined as 66 = ~,

11

we are interested

state which

PF~B = QAT 1 – (P’) N-l

PARALLEL

COMPARISON

k

and using the general series processing time in the sYstem

parallel is given by:

p”=

PFNVP = z

given by F’4~ = c .pAT where the coverage and ~~T = (l–k. qO) while a k the minimal cov-

erage and q. k the probabilhy of software the same as the length of an alternate.

c’q’and by

c’, = c’ to denote

m the same way as another

the

failed REFERENCES

processing

time

1.

in a system with

252

Yau, S. S, and Cheung, R. Software”, Proc. 1975 Itlr. Cor~

C., “Design of Self-Checking Rel!able Software, pp. 450-457.

2.

Ayache,

J. M , et.

al.,

“Observer:

Detection of Control Errors 1979 Irrt, Symp, Fault-Tolerant

A

Concept

for

6,

On-Line

in Concurrent Systems”, Proc. Computmg, Madison, Wisconsin,

June 1979, pp. 3-8.

June 1979, pp. 79-86.

7.

Avizienis, A. and Chen, L., “On the Implementation of NVersion Programming for Software Fault-Tolerance During Proc. of COMPSAC 77, November 1977, pp. 149Execution”, 155.

Randell, B , “System Structure for Software Fault-tolerance”, IEEE Trans. Software Eng., June 1975, pp. 220-232.

8.

Klelnrock, L., Sons, 1975.

Awzlenis,

9.

Grnarov, Software

3

Fischler,

“Distinct Reliable Computing”, Proc, 2nd Tokyo, Japan, 1975, pp. 1-7.

4,

5,

Lee, P. A., et. al., “A Recovery Cache for the PDP-1 l“, Proc. 1979 [rrt. Symp. Fault-Tolerant Computorg, Madison, Wisconsin,

M,

et

al.,

A,, “Fault-Tolerance

mentary Cotf

A.,

approaches

and Fault-Intolerance:

to Reliable

Rel[able Sojfware,

Software: An Approach to USA -Japan Computer Conj,

Computmg”,

Proc.

Comple1975 Ior.

Systems,

J. A , Arlat, Fault-Tolerance

Modehrrg 1980.

pp. 458-464.

Qaeue/ng

and S/mu/ar/on

and

Vol. 1: Theory, J. Wiley

Avizienis,

Strategies”, Conf,

A.,

“Modeling

Proc.

Pittsburgh,

2980

Pennsylvania,

*—

xl

%% Nvs

i

REPAIR

ro?+ux,

RB8NV

NW . ●

. CCNTINIX Pa Ii

y SE@ENl PF3XESSIM

SE : ALTErawrE ~ : ~RSIm

Figure

1 : General

i. NEXTSE&f34T

Model.

s

s

NEXl SEGfNT

Figure

Figure

2: A RB Model

3:

A NVP

Model.

! 3

I 4

I 5

v 62NVS A 8 NVP

._m

10+1

J:t9

.—0—

—.’=”3 .==———— = R=loz k=,l

1

R=103V R=1038 R=1OZ +

loo

I%

k=,9 _m—m

%— ~

—v :—. k=,9

R=lOO~ .

.~

R=102A R=l’O;

1’-1

._.

Figure

Ly

4:

R= 1132 — — ._-.--

v Oll=lO” =R=lOO

=

Suggest Documents