fault-tolerant software approaches3â5 which allow to recover after a software .... service .. rate of âSafe Down and Repair Stateâ while âContinueâ and âFadureâ.
ON THE
PERFORMANCE
OF SOFTWARE
A. Grnarov’,
FAULT-TOLERANCE
J. Arlat””,
STRATEGIES+
A. Avizienis
Computer Science Department University of California Los Angeles, Cahfornia 90024, USA
ABSTRACT In the papera
comparison
continuation of the program when A T k successful (for RB only), No processing time is required for this state, “Update” is execution
of processing
time
and reliability
of the deciwon strategy in NV followed by updating of all versions with tbe “best” result. “Recovery” means selection and initialization of the next alternate or version for RB or NVS respectively, For NVP (as well as for RB and NVS after tbe last alternate or version
performance for the Recovery Blocks scheme and N-Version Programming technique IS presented. Derived queueing models can be useful in deciding which of tbe strategies should be used, depending on system parameters.
bas been executed),
INTRODUCTION Despite
all
efforts
in careful
design
of
software
systems,
they are unlikely to be error free and a lot of effort is still required to ehminate “bugs” after a system has been made available to the user, Recently self-checking software approachesi,2 (limited to detection
of
software
failures),
and
fault-tolerant
software
approaches3–5 which allow to recover after a software failure occured, were proposed. Figure 1 presents a general model for umfied interpretation of the two mam fault-tolerance strategies: Programmings’. The Recovery Blocks scheme~b and N-Version Recovery Blocks scheme (RB) consists of three major entities: a a list of supplementary alternates alterndte primary (AI), (.42, ,.. , AN-I) and an acceptance test (AT). The AT can be
(~)
for NV7.
“Continue”
fault-tolerance
rate of “Safe Down
and Repair”
the average segment processing time or NV fault-tolerance strategy, tbe
strategy;
and Repair
p~
= ~
is the
..
State” while
“Continue”
average
service
and “Fadure”
states have infimte service rates. Queueing models have been derived for each strategy, but due to the limitation of space, derivation of NVS mode19 will not be presented in this paper. In
the
followlng
P:
= 1 – Q;
we
where “*”
will
notation P, =
probability
BLOCKS
=l—q* *
and
MODEL
of no faults
in alternate
2, where
the
following
A,
of success of,4 T
P, =
Probability
of success of A, if A, executed.
Q’, =
Probablhty
of failure
The service ‘This research is supported in part by the ONR Contract NOOO1479-C-0866, in part by the INRIA, France, in part by the NSF Contract MCS78-18918, and in part by a Grant of the Battelle Memorial Institute. * On leave from Electrical Engineering Department, Umverslty of Skopje, Yugoslavia. *‘ On leave from LAAS du CNRS, Toulouse, France.
p * *
that
A RB model is shown in Figure is used for 1 = 1, 2, . .. . N–1:
PAT = Probability
is
assume
stands for any symbol.
RECOVERY
multiple hardware systems, which make a parallel execution of the versions (NW) quite practical. However, in order to perform a fair comparison wltb RB, we also consider a “sequentia~ application of NV (NVS). Inthemodel ltisassumed that aprogram is partitioned into S segments, executed oneat a given time, and that the’’length (execution time) of the segments is exponentially distributed P1, P2, ,P~ represents alternates and versions in the case of RB and NV respectively, “TesV k the execution of AT m the case of “c-vectors”
In order to determine the system with RB
in
wltbout
compared and, in the case of disagreement, a preferred result can be identified by a majority vote (possibly inexact) or another predeterThe bad versions are mined strategy (e.g., two-out-of-N). recovered by updating them with data provided by identified good versions and normal processing of all N versions can be resumed. The primary motivation for NVwasthe exploitation of fault-tolerant
and the comparison
to “Safe Down
queueing theory8 is used. The program segments are considered as “customers” and “service rates” of states are defined as follows: p~ is the average service rate for processing a segment in a perfect system
considered as an abstract implementation of the function performed by alternates. When conditions stated by ATare not met, a purging of data E performed and a new alternate is called. According to pubhshed work, in tbe paper a sequential application of RB is considered only. The N-Version Programmmg (NV) requires N>2 independently designed programs (versions) for the same function (task) The obtained results after each stage of computation are
of RB,
it means a transition
state, where a higher level (global) recovery is initiated. “Next Segment” is the outcome of this state meanning that all actions that return the system to the successful execution of this segment are included as part of the “Repair” action, “Failure” is entered when a fault is undetected I.e., invalid results are passed on to the next segment. This can happen when correlated faults occur i.e., when m the same way fail: a) RB: an alternate and AT, b) NVS: two versions, c) NVP: majority of the versions.
in the same way of A, and AT
rate of state (alternate)
I is given
by ~, = 1
t,
where
the
average alternate i processing time ~ = ~ + ~ + ~~~ while ~ is the average segment execution time, ~ is the average alternate i setup time, and ~4~ IS the average ,4 T execution time. For computational convenience, we use IA, =w,
(l +a,
+k)-’
=K,
(l + tr, )-l
251 1604-8/80/0000-0251$00.75
@1980
IEEE
p
1 p~ = ~,
where We
also
c“JE[l,
use
+-]
7A T
a, = ~,
,s
the
for~
k = y
and b, == a,+k
notation:
Q, = c“, q,
=2,3,...,
N–1;
where
c“, = 1,
and
and also Q’, = c’, q, where
-, c’, < [0,1]
is the conditional
probability
P(,4, and ,47’ fail in the same
way if A, failed). ‘The following
assumption
can be made for the model:
a) The setup times for different alternates do not dffer and accordingly, a, = a, b, = b, and p, = P,(1 + b)-l b) The probabilities P, =P~,
of no faults
in the alternates
significantly = Ka
are the same, i.e.,
c“, = c“, and c’, = C’. while L = N–m–l LL>O), q“= The probability of fadure is given
c) Since the coverage of AT increases with its “length (execution time) and since the probability of software faults increases with the length, we combine both of them in the probability of success of the acceptance test c = [(1–a)k+al
According to the previous, stages 8, the average segment
fault
in A T if its length
‘RB=H+QP’11-(UN-2 +*(P’)N-21 i-
1– I
QN-’
PAT q.
Concerning the probability
R
—
‘l+b
of the RB
of entering
QN-2
.
P
the reliability
strategy,
the “Failure”
[
Probability
at least m versions
P’ =
Probabdity
while
majority
Adoption notation:
of analogous
if versions
II
N+l — 2
assumptions
Figure
3,
considered strategies as functions of Nd; we consider also the influence of RB, the reliabihty of the A T as function of the initial coverage a and the processing time ratio k; NV, the “correlation” factor c’. These results point out the criticality of the A T on the reliability of RB and shows that the impact of c’ with respect to NV is approximately constant in the case of 2NVS while It increases with the value of Nd in the case of NVP. In both Figures, the values of other system parameters are assumed to be. pa= PV=,999, qo=.OO1, c’’= 101a =.l, andd=. l.
agree
for N odd or even
as for RB leads to the following
CONCLUDING w., =K,
p where
a = ~,
version
setup
choice
as m =
MODEL in
(l +a
+ d)-]
=K,
ts
time,
and
and v = a + d; ‘d t
is the average
while
?
is the average
The models presented in the paper have been introduced in to achieve comprehensive and quantitative evaluation of the The performances of the software fault-tolerance strategies.
decision
and best result
order
developed queueing models are used for both time efficiency and reliability evaluauon of the two main fault-tolerance strategies: the Recovery Blocks scheme and the N-Version Programming. Some of the obtained results are presented in the paper showing that the models can be useful for system designers m deciding which of the strategies should be used depending on system parameters.
time.
For /=1,2,.., where q‘, k sion failed)
N, we consider also: p, = p, ; q’, = q’ = c“ q,, the conditional probability P( V, fails if any other verwhile
c“E [1, ~];
conditional probability version if V, faded).
and the factor
P( V, fails
It follows that the average segment NVP strategy is given by
REMARKS
(l + u)-’
id
d = =-,
NV STRATEGIES
ties Nd, with the repair ratio R as a parameter. RB appears to be more sensitive to the value of R than NV. Furthermore for both 2NVS and NVP, the time penalty is considerably reduced when Nd grows; while for RB, the influence of Nd strongly depends on the reliability of the AT, as it can be seen for ddTerent values of A T processing time ratio k Figure 5 presents the variations of the probabilities of failure for the
agree
at least m good versions m N defined
m
1 IS shown by:
OF RB AND
PFO, where /3: (RB, 2NVS and ~sVP) In order to perform a coherent comparison, we define Nd, the number of diversified entities required by each strategy, as: RB, N – 1 alternates and the AT; NV, N versions. Figure 4 compares the relative time penalties introduced by use of RB, 2NVS and NVP strategies as functions of the diversified enti-
is equal to
PROGRAMMING
A model for parallel NV strategy the transition probabilities are defined
where P =
N-VERSION
l–c’q’
[n this section, we present some of the results derwed from the discussed models, as well as results for sequential NV with twoout-of-N decision strategy (2 NVS). As measures for evaluation, we use: a) the relative increase of_the average segment processing time Tfl-? b) the probabihty of failure in a system defined as 66 = ~,
11
we are interested
state which
PF~B = QAT 1 – (P’) N-l
PARALLEL
COMPARISON
k
and using the general series processing time in the sYstem
parallel is given by:
p”=
PFNVP = z
given by F’4~ = c .pAT where the coverage and ~~T = (l–k. qO) while a k the minimal cov-
erage and q. k the probabilhy of software the same as the length of an alternate.
c’q’and by
c’, = c’ to denote
m the same way as another
the
failed REFERENCES
processing
time
1.
in a system with
252
Yau, S. S, and Cheung, R. Software”, Proc. 1975 Itlr. Cor~
C., “Design of Self-Checking Rel!able Software, pp. 450-457.
2.
Ayache,
J. M , et.
al.,
“Observer:
Detection of Control Errors 1979 Irrt, Symp, Fault-Tolerant
A
Concept
for
6,
On-Line
in Concurrent Systems”, Proc. Computmg, Madison, Wisconsin,
June 1979, pp. 3-8.
June 1979, pp. 79-86.
7.
Avizienis, A. and Chen, L., “On the Implementation of NVersion Programming for Software Fault-Tolerance During Proc. of COMPSAC 77, November 1977, pp. 149Execution”, 155.
Randell, B , “System Structure for Software Fault-tolerance”, IEEE Trans. Software Eng., June 1975, pp. 220-232.
8.
Klelnrock, L., Sons, 1975.
Awzlenis,
9.
Grnarov, Software
3
Fischler,
“Distinct Reliable Computing”, Proc, 2nd Tokyo, Japan, 1975, pp. 1-7.
4,
5,
Lee, P. A., et. al., “A Recovery Cache for the PDP-1 l“, Proc. 1979 [rrt. Symp. Fault-Tolerant Computorg, Madison, Wisconsin,
M,
et
al.,
A,, “Fault-Tolerance
mentary Cotf
A.,
approaches
and Fault-Intolerance:
to Reliable
Rel[able Sojfware,
Software: An Approach to USA -Japan Computer Conj,
Computmg”,
Proc.
Comple1975 Ior.
Systems,
J. A , Arlat, Fault-Tolerance
Modehrrg 1980.
pp. 458-464.
Qaeue/ng
and S/mu/ar/on
and
Vol. 1: Theory, J. Wiley
Avizienis,
Strategies”, Conf,
A.,
“Modeling
Proc.
Pittsburgh,
2980
Pennsylvania,
*—
xl
%% Nvs
i
REPAIR
ro?+ux,
RB8NV
NW . ●
. CCNTINIX Pa Ii
y SE@ENl PF3XESSIM
SE : ALTErawrE ~ : ~RSIm
Figure
1 : General
i. NEXTSE&f34T
Model.
s
s
NEXl SEGfNT
Figure
Figure
2: A RB Model
3:
A NVP
Model.
! 3
I 4
I 5
v 62NVS A 8 NVP
._m
10+1
J:t9
.—0—
—.’=”3 .==———— = R=loz k=,l
1
R=103V R=1038 R=1OZ +
loo
I%
k=,9 _m—m
%— ~
—v :—. k=,9
R=lOO~ .
.~
R=102A R=l’O;
1’-1
._.
Figure
Ly
4:
R= 1132 — — ._-.--
v Oll=lO” =R=lOO
=