

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-4, NO. 5, SEPTEMBER 1978

An Optimal Approach to Fault Tolerant Software Systems Design

THOMAS F. GANNON AND STEPHEN D. SHAPIRO, MEMBER, IEEE

Abstract-A systematic method of providing software system fault recovery with maximal fault coverage subject to resource constraints of overall recovery cost and additional fault rate is presented. This method is based on a model for software systems which provides a measure of the fault coverage properties of the system in the presence of computer hardware faults. Techniques for system parameter measurements are given. An optimization problem results which is a doubly-constrained 0,1 Knapsack problem. Quantitative results are presented demonstrating the effectiveness of the approach.

Index Terms-Error control and recovery, fault tolerant computing, software systems design.

I. INTRODUCTION

IN RECENT years, numerous publications have addressed the design and implementation of fault tolerant hardware systems, but few papers have addressed fault tolerant software systems. In designs which require fault tolerant computing, one generally encounters the introduction of a duplex or triplex computer hardware configuration which has been enumerated by Goldberg et al. [1]. Such systems rely solely on hardware detection of fault conditions and automatic hardware recovery from a fault condition, and their reliability has been analyzed from various viewpoints.¹ This paper deals with software recovery procedures which supplement hardware fault detection and recovery mechanisms, and provide fault tolerance in the presence of hardware fault conditions.

Manuscript received June 17, 1977; revised December 2, 1977. This work was supported in part by the National Science Foundation under Grant MCS 76-08176. Extensions of the material presented in this paper will also appear in the book Fault Tolerant Software Systems Design by the authors to be published by Prentice-Hall. T. F. Gannon was with Bell Laboratories, Whippany, NJ 07981. He is now with Sperry Univac Technical Research Center, Blue Bell, PA 19424. S. D. Shapiro is with the Department of Electrical Engineering, Stevens Institute of Technology, Hoboken, NJ.

¹ See [2]-[4] for current literature examples.

At present, rudimentary concepts of fault tolerant software design have been applied to operating systems. These concepts are documented by Horning et al. [5]. Papers by Goodenough [6], Hill [7], and Gannon and Horning [8], however, have addressed the effects of language structure on reliable software development and various levels of exception handling in software systems in general. Many language constructs have been devised to provide software recovery in the presence of a detectable hardware fault. Hill [7] provides a relative comparison of these constructs based on language structuring and Goodenough [6] extends these concepts to the micro-operation level.

Until recently, no guidelines have been established to clearly indicate how a particular recovery mechanism should be used. Randell [9] has recently begun work in this area, but no quantitative procedure has been defined to analyze the fault coverage and resource assignments of various software recovery schemes within a software system.

In the following sections, the authors present a method of analysis for the LABEL software recovery mechanism introduced by Hill [7]; the method yields the fault coverage of a given software recovery scheme and is similar to Randell's [9] recovery block scheme. Furthermore, given various recovery alternatives, the authors define a procedure which will yield a selection of available alternatives which maximizes the overall system fault coverage subject to the following constraints:
1) the minimum additional size of the system recovery scheme;
2) the minimum additional system fault rate introduced by the system recovery scheme.

In addition, a process control system is modeled and analyzed using the methodology developed to substantiate the intuitive claim that an optimum fault recovery scheme exists for a software system, and that the law of diminishing returns applies beyond that optimum point.

The analysis which is contained in the following sections implicitly requires the following assumptions.
1) Only single fault occurrences will be considered.
2) The software system itself will be assumed to be error-free and reliable, which implies that the coverage associated with the software recovery design is unity.
3) The software system under analysis will be core resident.
4) All hardware operations of the simplex computer system are monitored by reliable fault detection hardware using a parity code.
5) The probability of a successful hardware reconfiguration is assumed to be unity.
These assumptions will be discussed in what follows.

Several terms are used throughout this paper and should be defined for reference. A fault condition within a system is an erroneous response to an operation performed by a hardware component of the system. Recovery from a fault condition is

0098-5589/78/0900-0390$00.75

© 1978 IEEE


GANNON AND SHAPIRO: FAULT TOLERANT SOFTWARE DESIGN

Fig. 1. Illustration of a typical task hierarchy structure.

defined as the response of the system hardware or software to the fault condition, so that system execution can continue to yield correct results in spite of the fault condition. The actual steps taken by the hardware or software components of the system to effect recovery are termed recovery procedures. The fault coverage of a recovery procedure represents the probability that the system recovery will be successfully achieved given that the procedure is employed in response to a detectable hardware fault condition. Finally, the term fault tolerant system is normally used to denote system software which will continue to yield correct results in the presence of hardware and software fault conditions. As mentioned previously, the authors only address software recovery procedures which provide recovery for detectable hardware fault conditions.

II. SOFTWARE RECOVERY MECHANISMS

Among the most commonly used software recovery mechanisms, the LABEL mechanism has been shown by Hill [7] to provide a structured approach to fault recovery. This mechanism, however, only provides fault coverage for that class of hardware errors which is detectable by the computer hardware. Hence, supplementary safeguards are required to provide fault coverage for the remaining classes of probable fault conditions. The recovery block approach discussed by Randell [9] provides a consistent set of software recovery techniques which address these remaining classes of fault conditions. The approach presented by the authors is similar to the approach presented by Randell [9] with minor differences. These differences will be identified in Section II-B.

Before discussing software recovery mechanisms, however, several terms must be defined for reference. It is assumed that any software system can be decomposed into a number of mutually exclusive functional operations called tasks. (See, for example, the discussion of processes in Brinch Hansen [10].) In addition, the hierarchy of precedence relationships which exist between these tasks can be represented by a directed graph whose nodes represent tasks of the software system, as illustrated in Fig. 1. Arcs emanating from each node represent possible branch conditions to other tasks. Hence, each cyclic execution in the software system can be represented by a path from the start task T1 to the end task T8. (This representation for software systems is discussed in Dahl et al. [11] as an OR graph.)

Extending the task concept to encompass a smaller number of computations, each task can be decomposed into a number of mutually independent computations called segments. Furthermore, Fig. 2 illustrates a precedence graph which represents the hierarchy of precedence relationships for all segments of a typical task. Circular nodes of a precedence graph will be used to represent tasks, while rectangular nodes will be used to represent segments within a task.

Fig. 2. Illustration of a typical segment hierarchy structure.

A. The LABEL Recovery Mechanism

To facilitate dynamic software recovery of system tasks and segments, the LABEL mechanism is assumed to be available through the system monitor to inform a segment that a specific detectable fault condition has occurred within a given segment boundary, and to allow that segment to execute recovery procedures. The format of this mechanism can be illustrated as follows:

    ON_ERROR "LABEL";

Upon execution of this instruction, the monitor recognizes that in the event a detectable fault condition is trapped during the execution of the current segment, control will be returned to the faulted segment at the address specified by the "LABEL" argument. Furthermore, upon execution of another ON_ERROR instruction, the preceding ON_ERROR instruction is superseded. The usage of the LABEL mechanism is well defined in the literature [7]; two common recovery procedures are reviewed in the following paragraphs. Furthermore, it is assumed in the analysis of Section III that the LABEL recovery mechanism will be applied to each segment of every task within a software system.

By prefixing each segment with an ON_ERROR instruction, each task of the system is partitioned into a number of mutually exclusive recovery "zones." Each such "zone," or segment, in turn possesses a defined recovery procedure. It is generally assumed that each segment has a defined set of input and output parameters. Therefore, the computations performed in each segment are well defined, and recovery procedures for each segment can be developed. The exact nature of the recovery procedure in each case depends upon the operations performed in each segment, as well as the overall recovery objectives for the task itself. In general, however, two procedures for recovery can be defined:
1) the "Clean-Up and Get-Out" procedure;
2) the "Attempt Reexecution" procedure.

The "Clean-Up and Get-Out" (CAGO) procedure is simple and straightforward to implement. The recovery philosophy taken in this procedure is to restore all outputs from the faulted segment to nominal or default values. A status flag may also be set to inform subsequent segments and/or tasks that recovery has been attempted and that the output parameters may contain inaccuracies. Finally, the faulted segment can pass control to the following segment if subsequent computations can be continued, or the faulted segment can terminate the entire task if subsequent computations are not meaningful. It should be noted, however, that the primary objective of this recovery procedure is to permit subsequent segments and tasks to continue their sequence of computations, whenever possible.

The "Attempt Reexecution" (ARE) procedure is feasible in those cases for which the input parameters of the faulted segment can be reinitialized to previously stored redundant or nominal values. Under these circumstances, a second attempt to execute the faulted segment can be made. Hence, the primary objective of the ARE procedure is to continue the present computations, whenever possible. If the second attempt to reexecute the segment should prompt an additional fault condition, the REPORT mechanism which is discussed in the following paragraphs may be required for recovery from the recurring fault condition.

The use of the two procedures outlined above will depend upon the urgency and purpose of the task and its component segments.
Section V of this paper addresses an example of the use of these procedures as applied to a specific process control problem with specific recovery objectives. It should be noted that when the CAGO procedure is applied to iterative or recursive segments or tasks, its recovery objective is identical to that of the ARE procedure.

In the event that the CAGO and ARE procedures do not provide full recovery from the fault condition (i.e., the same fault condition arises each time the segment is executed), and the system monitor is capable of dynamically reconfiguring the hardware components of the computer system, a second recovery mechanism suggests itself:

    REPORT (id, type);

The REPORT mechanism essentially provides a communications path between the faulted segment and the system monitor for the purpose of requesting a specific reconfiguration of the computer hardware. The parameter "type" is used to specify the type of reconfiguration desired, while the parameter "id" is used to identify the segment requesting the reconfiguration.
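The recovery procedures of this section can be sketched in a modern language. The following Python fragment is only an illustration under stated assumptions: a Python exception stands in for a fault trapped by the system monitor, and the names DetectableFault, run_with_cago, run_with_are, and make_faulty_segment are hypothetical and do not appear in the LABEL literature. It shows CAGO restoring nominal outputs and setting a status flag, and ARE retrying once from saved inputs before recording a REPORT-style reconfiguration request.

```python
from typing import Callable, Dict, List, Optional, Tuple


class DetectableFault(Exception):
    """Hypothetical stand-in for a hardware fault trapped by the monitor."""


Params = Dict[str, float]
Segment = Callable[[Params], Params]


def run_with_cago(segment: Segment, inputs: Params,
                  nominal: Params) -> Tuple[Params, bool]:
    """Clean-Up and Get-Out: on a fault, return nominal outputs together
    with a status flag telling later segments that recovery took place."""
    try:
        return segment(inputs), False
    except DetectableFault:
        return dict(nominal), True


def run_with_are(segment: Segment, inputs: Params, saved_inputs: Params,
                 segment_id: str,
                 reports: List[Tuple[str, str]]) -> Optional[Params]:
    """Attempt Reexecution: on a fault, retry once from the saved inputs.
    If the retry also faults, record a REPORT-style request and give up."""
    try:
        return segment(inputs)
    except DetectableFault:
        try:
            return segment(dict(saved_inputs))
        except DetectableFault:
            # Assumed encoding of REPORT (id, type) as an (id, type) tuple.
            reports.append((segment_id, "reconfigure"))
            return None


def make_faulty_segment(faults_before_success: int) -> Segment:
    """Build a toy segment that faults a fixed number of times, then works."""
    remaining = [faults_before_success]

    def segment(inputs: Params) -> Params:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise DetectableFault("simulated parity error")
        return {"y": 2.0 * inputs["x"]}

    return segment
```

In this sketch a transient fault (one failure) is absorbed by ARE, while a recurring fault (two failures) produces an entry in the reports list, mirroring the escalation from ARE to REPORT described above.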

B. Software Fault Detection Mechanisms

Although the LABEL mechanism provides a major portion of each segment's fault coverage, its use is limited to detectable hardware faults. In large software systems, however, a significant class of fault conditions which cannot be detected by the system monitor or the computer hardware can exist.² The effects of these fault conditions can be localized, and thus prevented from propagating throughout the software system, by employing one or more of the following software fault detection techniques (otherwise known as acceptance tests):
1) input parameter reasonableness testing (IPRT);
2) output parameter reasonableness testing (OPRT);
3) intratask/intrasegment handshaking.

² A calculation of the expected fault coverage for the LABEL mechanism appears in Appendix A.

As indicated in Section II-A, each segment of a task possesses a well-defined set of input and output parameters. To insure that the preceding segment's (or task's) computations were valid, a "reasonableness" test of the current segment's input parameters could be used, where possible. The specific test employed for each parameter depends upon the usage and range of expected values for that parameter as indicated by Randell [9]. In general, however, such a test would verify that the given input parameter possesses a reasonable value. A definition of what is meant by a reasonable value could encompass one or more of the following conditions:
1) the input parameter lies within the expected range of values for the given parameter type;
2) the input parameter increases or decreases in value by a prescribed manner for successive executions of the current segment;
3) the input parameter compares favorably with an approximation of its value derived from a previous execution of the current segment to within specified limits.
Items 2 and 3 assume that the segment (and/or task) of interest is executed iteratively or recursively. As might be expected, the primary purpose of performing data reasonableness testing of input parameters is to insure that the effects of undetectable hardware or monitor faults are not allowed to propagate through the current segment.

A more encompassing test of segment computations can be employed if one assumes that an independent method of testing a segment's computations exists. In this case, a more comprehensive reasonableness test of the current segment's output parameters can be defined in terms of the segment's input parameters. Such a test would encompass one or more of the following conditions:
1) the output parameter lies within the expected range of values for the given parameter type;
2) the output parameter increases or decreases in value by a prescribed manner for successive executions of the current segment;
3) the output parameter compares favorably with an approximation of its value derived from a previous execution of the current segment to within prescribed limits;
4) the output parameter compares favorably with an approximation of its value derived from the current input parameters to within specified limits.
Items 2-4 assume that the segment (and/or task) of interest is executed iteratively or recursively. It should be noted that


when OPRT is employed, a "handshake" flag can be used to inform subsequent segments that the output data of the current segment is reasonable, erroneous, or has been reset to nominal or default values.³ Hence, the OPRT procedure eliminates the necessity of the IPRT procedure for subsequent segments and tasks, tests for the effects of undetectable hardware and monitor fault conditions which may occur during the execution of the current segment, and prevents the propagation of the effects of such faults to subsequent segments and tasks.

As indicated in Appendix A, certain classes of hardware faults, such as an even number of bit errors in an odd parity system, are undetectable by the computer hardware. In the event that the effects of such undetectable hardware or monitor fault conditions are detected using the IPRT or OPRT procedures, modifications of the CAGO and ARE recovery procedures can be applied to the input and/or output parameters of the faulted segment. Again, the exact nature of the recovery scheme used depends upon the computations involved within the faulted segment. The following general approaches, however, can be outlined for IPRT recovery:
1) restore erroneous input parameters to nominal or default values, and continue the execution of the current segment;
2) terminate the execution of the current segment;
3) terminate the execution of the current task.

A parallel outline of recovery can be summarized for the OPRT procedure as follows:
1) mark the current segment's output parameters as being erroneous using the handshake flag for subsequent segments and/or tasks;
2) mark the current segment's output parameters as nominal using the handshake flag, and reset those output parameters to nominal or approximated values (where the approximation should be based on the current input parameters or the output parameters of the previous execution of the current segment);
3) use the response outlined in item 2, above, and terminate the task.

C. Fault Coverage of Recovery Mechanisms

Because the LABEL mechanism of recovery only provides fault coverage for detectable hardware faults, which represent approximately 58 percent of all possible faults, the software recovery mechanisms of Section II-B were introduced to augment the overall fault coverage for a given segment of the software system.⁴ Even with these safeguards, however, there exists a class of undetectable errors which could occur and which could evade the software recovery mechanisms of the previous section. It is shown in Appendix A, however, that the percentage of such fault occurrences is less than 4 percent of all possible faults. In addition, it can be shown that the probability of such a fault occurrence is extremely small. For these reasons, the authors have assumed in the analysis of Section III that each applicable segment of the software system will effectively provide complete (100 percent) fault coverage for those faults associated with the execution of that segment. Fault recovery for that segment is therefore assumed to occur with a probability of unity. Furthermore, segments which do not contain such recovery mechanisms are said to have null (0 percent) fault coverage.

³ This approach differs from the concept of recovery block usage introduced by Randell [9], since all of the output parameters of each recovery block are not available to subsequent recovery blocks (or segments).

⁴ A derivation of the stated fault probabilities may be found in Appendix A.

III. OPTIMIZATION OF SOFTWARE RECOVERY CHOICES

Although the concepts introduced in Section II are by no means revolutionary, no systematic analysis has been pursued to date which provides a measure of the overall coverage of software recovery assignments as applied to software systems. Intuitively, one might expect that the implementation of software recovery mechanisms for a particular task or segment would be "more effective" than the implementation of software recovery mechanisms for other tasks or segments. In addition to the coverage obtained from a recovery scheme for a task or segment, however, each such recovery scheme possesses a specific cost or size and introduces an additional margin for fault occurrence within the software system. Hence, given realistic constraints of overall system size and system fault rate, some compromises between each segment's recovery fault coverage, size, and fault rate must be made to arrive at a truly optimal software recovery allocation for the entire software system.

A. The Optimization Problem

As stated in Section II, it is tacitly assumed that all segments of the software system possess a LABEL recovery scheme and appropriate software recovery schemes such that the fault coverage of these schemes is approximately 100 percent of all possible segment faults. To arrive at an optimal allocation of these recovery schemes for all segments within the software system, some measure of each segment's effective fault rate, recovery scheme cost, and recovery scheme fault rate must be obtained.⁵ Once these parameters have been identified and the system has been modeled, various optimization techniques can be used to determine the overall software system recovery scheme which maximizes the fault coverage of the software system. It should be noted that for any realizable software system, such a maximization problem must be bounded by the following practical constraints:
1) the total additional recovery detection memory allocation (size) cannot exceed a specified percentage of the total system size;
2) the fault rate of the overall recovery detection assignment cannot exceed a significant percentage of the total system fault rate (e.g., the recovery assignment should not introduce a significant error rate).

The nomenclature which will be used to express the measures mentioned above is given as follows:
$c_{ij}$ = cost (or size) of the recovery scheme for segment j of task i;
$e_{ij}$ = fault rate covered by the recovery scheme for segment j of task i;
$f_{ij}$ = fault rate contribution of the recovery scheme for segment j of task i;
$h_{ij}$ = index of fault coverage for the recovery scheme for segment j of task i (relative value of $e_{ij}$).

It is assumed that the probability of fault recovery associated with a given recovery scheme is unity, so that faults which are detected during the execution of a given segment can be considered recoverable. The coverage of a recovery scheme for a particular segment will be expressed as the fraction of the total system fault rate which is covered by the recovery scheme for that segment. This fraction will be referred to as the index of fault coverage for that segment and can be represented as:

$$ h_{ij} = \frac{e_{ij}}{E} \tag{1} $$

where the system fault rate (E) can be expressed as:

$$ E = \sum_i \sum_j e_{ij}. \tag{2} $$

Equation (2) implies that the normalizing parameter E represents a linearized approximation to the total fault rate of the system. Derivations contained in Appendix B indicate that this approximation is very accurate when the segment fault rates are small ($O(10^{-6})$). In addition, the linear property of (2) restricts the value of the segment index of fault coverage defined in (1) to lie within the range [0, 1].

To account for the fact that various segment recovery schemes can be either included or excluded from the total system recovery assignment, the following binary-valued variable is introduced:⁶

$$ a_{ij} = \begin{cases} 0, & \text{if the recovery scheme for segment } j \text{ of task } i \text{ is excluded from the system recovery scheme;} \\ 1, & \text{if the recovery scheme for segment } j \text{ of task } i \text{ is included in the system recovery scheme.} \end{cases} $$

Hence, for a given selection of values for parameter $a_{ij}$ for all tasks (i) and segments (j) of the software system, the following expression for the total index of fault coverage (I) of the system can be obtained for each recovery assignment:

$$ I = \sum_i \sum_j a_{ij} h_{ij} \tag{3} $$

or,

$$ I = \frac{\sum_i \sum_j a_{ij} e_{ij}}{E}. \tag{4} $$

Equation (3) or (4) therefore represents the fraction of total system fault rate which is covered by the system recovery assignment dictated by the values of parameter $a_{ij}$. Ideally, if all segment recovery schemes were implemented, (4) would reduce to a unity index of fault coverage for the system. However, several important practical constraints may bound this index to less than unity. In general, however, (3) and (4) yield a measure of coverage for any system recovery scheme which can be formulated in this manner.

As mentioned previously, two practical constraints bound the optimal choice of system recovery schemes. Using parameter $a_{ij}$ as defined previously, the total cost, or size, (C) of a generalized system recovery assignment can be expressed as:

$$ C = \sum_i \sum_j a_{ij} c_{ij}. \tag{5} $$

In addition, the additional fault rate introduced by a generalized system recovery assignment (F) can be expressed as:

$$ F = \sum_i \sum_j a_{ij} f_{ij}. \tag{6} $$

The optimization problem of choosing those segment recovery schemes which provide the best coverage of system faults can therefore be expressed as a maximization of the system index of fault coverage,

$$ I = \sum_i \sum_j a_{ij} h_{ij} \tag{7} $$

with respect to the binary-valued variable $a_{ij}$, subject to the constraints

$$ C_{\max} \geq C = \sum_i \sum_j a_{ij} c_{ij} \tag{8} $$

and,

$$ F_{\max} \geq F = \sum_i \sum_j a_{ij} f_{ij} \tag{9} $$

where
$C_{\max}$ = maximum allowable cost of the system recovery assignment.
$F_{\max}$ = maximum allowable effective fault rate of the system recovery assignment.

This problem is well known in the literature as a doubly constrained 0, 1 Knapsack problem. Saaty [12] discussed this problem as a typical integer-optimization problem which is heuristically solvable by a modified simplex linear-programming process. This problem is known to be NP complete, implying the necessity of heuristic solutions since the global optimization is essentially enumerative.⁷ Other heuristic procedures have been introduced by Gilmore et al. [13], Shapiro [14], Cabot [15], and Yormark [16] which accelerate the convergence of Knapsack algorithms. A typical heuristic solution of such an optimization problem, as applied to the example contained in Section V, may be found in Section V-B.

⁵ A detailed discussion of the necessary measurements may be found in Section IV. The average time period which is assumed for these measurements is one execution cycle of the system.

⁶ When a segment recovery scheme is excluded from the total system recovery assignment, hardware interrupts are inhibited during the execution of that segment, and no recovery scheme is provided for that segment.

⁷ See Aho et al. [27] for a discussion of algorithmic complexity analysis.
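To make the formulation in (1)-(9) concrete, the following Python sketch solves a very small instance by exhaustive enumeration. It is not the heuristic procedure referred to in Section V-B; enumeration is used here only because it is easy to verify, and it is practical only for a handful of segments. The function name best_recovery_assignment and all numeric data are hypothetical. Segments are flattened to a single index, and fault rates are given in arbitrary integer units to avoid floating-point rounding at the constraint boundary.

```python
from itertools import product
from typing import Sequence, Tuple


def best_recovery_assignment(
    e: Sequence[float],
    c: Sequence[float],
    f: Sequence[float],
    c_max: float,
    f_max: float,
) -> Tuple[Tuple[int, ...], float]:
    """Enumerate every 0/1 assignment and keep the feasible one with the
    largest index of fault coverage I, following (7)-(9)."""
    total_e = sum(e)  # system fault rate E, as in (2)
    best_a: Tuple[int, ...] = tuple(0 for _ in e)
    best_index = 0.0
    for a in product((0, 1), repeat=len(e)):
        cost = sum(a_k * c_k for a_k, c_k in zip(a, c))        # C, as in (5)
        added_rate = sum(a_k * f_k for a_k, f_k in zip(a, f))  # F, as in (6)
        if cost > c_max or added_rate > f_max:
            continue  # violates constraint (8) or (9)
        index = sum(a_k * e_k for a_k, e_k in zip(a, e)) / total_e  # I, as in (4)
        if index > best_index:
            best_a, best_index = a, index
    return best_a, best_index


# Made-up data for four segments (illustrative only, not from the paper).
E_RATES = [45, 30, 15, 10]   # covered fault rates e
COSTS = [40, 30, 10, 10]     # recovery scheme sizes c
ADDED_RATES = [2, 1, 1, 1]   # fault rates f added by each recovery scheme

assignment, coverage = best_recovery_assignment(
    E_RATES, COSTS, ADDED_RATES, c_max=50, f_max=3
)
print(assignment, coverage)
```

With these made-up numbers the first and third recovery schemes are selected, giving an index of fault coverage of 0.6. Including every scheme would give unity coverage but would exceed both the size and the added fault rate limits, which is the compromise the constraints are meant to capture.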


Fig. 3. Cyclic task hierarchy structure with steady-state execution probabilities.

Fig. 4. Cyclic segment hierarchy structure with steady-state execution probabilities.

IV. MEASUREMENT OF SOFTWARE SYSTEM RECOVERY PARAMETERS

The model of the previous section assumed that three parameters associated with a software system's fault rate and recovery philosophy could be measured in some manner. Specifically, the system fault rate, the recovery scheme cost, and the recovery scheme fault rate were required in the optimization problem formulation of Section III-A. In general, it will be shown that each of these parameters is a function of one or more of the following properties:
1) the size of each segment;
2) the complexity of computations performed within each segment;
3) the size of the specific recovery computations for each segment;
4) the complexity of the computations performed within the recovery scheme for each segment;
5) the fault probabilities associated with each instruction of the instruction repertoire of the host simplex computer.

Before embarking upon a discussion of parameter measurement, however, several assumptions regarding system task and segment characteristics must be enumerated. It is tacitly assumed in the following sections that the task hierarchy of the software system is cyclic in nature. This property is illustrated in Fig. 3 for a typical task hierarchy by a dotted edge from the end task T8 to the start task T1. The period of such a system execution cycle is specified as t. In addition, it is assumed that for every task ($T_i$) of the system there exists a task execution probability ($p_i$) which represents the probability that the given task ($T_i$) will be executed during the current execution cycle of the system under steady-state conditions. Furthermore, it is assumed that for every segment ($S_{ij}$) there exists a segment execution probability ($q_{ij}$), as illustrated by Fig. 4, which represents the conditional probability that the given segment ($S_{ij}$) will be executed during the current system execution cycle given that the segment's task ($T_i$) will be executed during the current execution cycle. For a specific software system, these task and segment execution probabilities can usually be predetermined from system design specifications, or these probabilities can be measured easily once the software system has been tested. A derivation of these execution probabilities in terms of the steady-state segment branching probabilities is contained in Appendix B-A.

A. Segment Fault Rate

As described in Section III-A and Appendix B-B, the system fault rate can be expressed approximately as:

$$ E = \sum_i \sum_j e_{ij} \tag{10} $$

where the parameter $e_{ij}$ was defined as the segment fault rate for segment j of task i. Intuitively, one might expect that parameter $e_{ij}$ would be a function of the size and complexity of the computations performed within a segment. Formally, assuming independence between segments, parameter $e_{ij}$ can be expressed as follows:

$$ e_{ij} = g_{ij} \, p_i \, q_{ij} \tag{11} $$

where
$g_{ij}$ = probability of a single fault occurrence within segment j of task i during one execution cycle of the system.
$p_i$ = execution probability of task i.
$q_{ij}$ = execution probability of segment j within task i.

As stated in the preceding section, the task and segment execution probabilities are assumed to be known system parameters. Furthermore, parameter $g_{ij}$ can be expressed as follows:

$$ g_{ij} = \sum_k n_{ijk} Z_k \tag{12} $$

where
$n_{ijk}$ = the number of times that instruction k can be executed within segment j of task i.
$Z_k$ = fault probability associated with instruction k.

Hence, (11) which describes the segment fault rate can be rewritten as follows:


$$ e_{ij} = p_i \, q_{ij} \left( \sum_k n_{ijk} Z_k \right) \tag{13} $$
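As a small numerical illustration of (11)-(13), the following Python sketch computes the fault rate of one segment from its instruction counts. The instruction mnemonics, the per-instruction fault probabilities, and the execution probabilities are invented for the example and are not measurements from any real machine; the function name segment_fault_rate is likewise hypothetical.

```python
from typing import Dict


def segment_fault_rate(p_i: float, q_ij: float,
                       n_ijk: Dict[str, int],
                       z_k: Dict[str, float]) -> float:
    """Return e_ij = p_i * q_ij * sum_k(n_ijk * Z_k), as in (13)."""
    # g_ij, as in (12): single-fault probability within the segment.
    g_ij = sum(count * z_k[instr] for instr, count in n_ijk.items())
    return p_i * q_ij * g_ij


# Hypothetical per-instruction fault probabilities Z_k.
Z = {"LOAD": 1e-9, "ADD": 2e-9, "STORE": 1e-9}
# Hypothetical instruction counts n_ijk for one segment.
N = {"LOAD": 100, "ADD": 50, "STORE": 100}

e_ij = segment_fault_rate(p_i=0.5, q_ij=0.8, n_ijk=N, z_k=Z)
print(e_ij)
```

Here each instruction class contributes 1e-7 to $g_{ij}$, so $g_{ij}$ is 3e-7 and the segment fault rate is 0.5 x 0.8 x 3e-7, or about 1.2e-7 per execution cycle. Summing such values over all segments gives the system fault rate E of (10).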

The instruction fault probability Zk found in (12) and (13).is totally dependent upon the computer hardware reliability and can be expressed in terms of the fault probabilities of the basic hardware operations, or microinstructions, which comprise each instruction. Methods of measurement for these fault probabilities have already been identified by Yetter [171, Mathur [18] ,Trufanov [19], andothers,whileCox and Rankin [20] have extended these concepts to large data processing systems. Hence, for a given computer hardware configuration, the parameter Zk can be determined by current measurement techniques. The size and complexity of each segment's computations are inherently reflected in the parameter ng,k. Hence, the measurement of the segment fault rate parameter eiq essentially reduces to the measurement of parameter ntjk. We propose that two straightforward methods of measurement can be employed using a finite state recognizer (FSR): 8 1) during the generation of assembly code, the software system compiler could be modified to maintain an account of parameter nijk; 2) after compilation of the system software source code, the system assembly code could be scanned by an FSR to generate the parameter nijkIn any event, several declaration statements must be added to the programming language of the software system to delimit task and segment boundaries, and the number of executions of various loops which encompass several segments of a task. Because branch conditions within segments influence the various instructions which might be executed within a segment, additional methods for specifying branch probabilities must also be introduced. 
To this end, the following assumptions are made which are consistent with the practices of structured programming:
1) a branch condition (or set of branch conditions) will only occur at the end of a segment;
2) loop constructs can only occur within a single segment, or include one or more segments, of a single task;
3) no loop constructs are permitted to include one or more tasks, except for the start and stop tasks of a cyclic system;
4) only one entry point is permitted for a task, although a task may have multiple exits.
The above assumptions place several restrictions upon the structure of the software system, and parallel many of the concepts set forth by Kernighan and Plauger [23], among others, for structured programming. In addition, the first assumption implies a one-to-one correspondence between branch probabilities and the segment execution probabilities. The formalization of task and segment declarations can be illustrated as follows:

    # declare begin_task M
        Task M
    # declare end_task M
    # declare begin_seg M, N
        Segment (M, N)
    # declare end_seg M, N

⁸ A description of the usage of finite state recognizers may be found in Kohavi [21] and Aho and Ullman [22].

where M defines the task number and N defines the segment number. The number of loop execution cycles can also be declared as follows:

    # declare loop L

where L defines the maximum or estimated number of loop execution cycles. Hence, upon recognition of the above statements, the system compiler or post-processor can begin an accounting of $n_{ijk}$ for each unique instruction contained within the specified task and segment. In this manner, the segment fault rate for each segment within a software system can be computed automatically by the system compilation process.⁹
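The accounting described above can be sketched as a small line scanner. The version below is only an illustration under stated assumptions: it recognizes the segment delimiters in the spelling `begin_seg` / `end_seg`, treats `;` as the assembler comment character and a trailing `:` as a label marker, and ignores loop declarations entirely. None of these details, nor the function name `count_instructions`, are taken from the authors' post-processor.

```python
import re
from collections import Counter, defaultdict

# Assumed delimiter syntax: "# declare begin_seg M, N" / "# declare end_seg M, N".
DECL = re.compile(r"^#\s*declare\s+(begin_seg|end_seg)\s+(\d+)\s*,\s*(\d+)$")


def count_instructions(lines):
    """Return {(task, segment): Counter(mnemonic -> count)} for delimited code."""
    counts = defaultdict(Counter)
    current = None  # (task, segment) currently being scanned, if any
    for raw in lines:
        line = raw.split(";", 1)[0].strip()  # drop assembler comments
        if not line:
            continue
        match = DECL.match(line)
        if match:
            kind, task, seg = match.group(1), int(match.group(2)), int(match.group(3))
            current = (task, seg) if kind == "begin_seg" else None
            continue
        if line.startswith("#") or current is None:
            continue  # other declarations, or code outside any segment
        tokens = line.split()
        if tokens[0].endswith(":"):  # skip a leading label
            tokens = tokens[1:]
        if tokens:
            counts[current][tokens[0].lower()] += 1
    return counts


sample = [
    "# declare begin_seg 1, 2",
    "loop1: mov r0, r1 ; copy",
    "       add #1, r1",
    "       mov r1, r2",
    "# declare end_seg 1, 2",
    "       halt",
]
segment_counts = count_instructions(sample)
```

The resulting per-segment counters play the role of $n_{ijk}$ for straight-line code; scaling by declared loop counts would be a further step that this sketch leaves out.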

In any large software system, there most probably exists a set of computations whose associated fault rate may be extremely small, but whose importance to system operations may be extremely great. In such cases, one must provide appropriate recovery schemes at any cost. To ensure that the recovery scheme of such a segment (or segments) is chosen by the optimization techniques outlined in Section III, some measure of the importance of these recovery schemes must be introduced into the expression for the total index of fault coverage (I). To accomplish this result, the authors propose the addition of a heuristic parameter $w_{ij}$ to (4) as follows:

$$I = \frac{\sum_i \sum_j a_{ij} w_{ij} e_{ij}}{E} \qquad (14)$$

where

$w_{ij}$ = the relative importance of segment j of task i to the software system.

For most segments within the software system, parameter $w_{ij}$ would be set equal to unity. However, for "highly important" segments, this parameter can be set to a value which will significantly increase the magnitude of $e_{ij}$. Under these circumstances, of course, the following constraint must be satisfied if the value of the index of fault coverage (I) is to have a consistent meaning:

$$\sum_i \sum_j w_{ij} = MN. \qquad (15)$$

Hence, by significantly increasing the segment fault rate for such segments, the associated recovery assignment scheme for those segments will be assured of inclusion into the total system recovery assignment by the optimization techniques of Section III.

⁹ An example of an FSR post-processor for this purpose may be found in Section V-A.


B. Segment Recovery Scheme Cost

The system recovery cost, as given in (5), utilizes the parameter $c_{ij}$, which was defined as the segment recovery scheme cost or size for segment j of task i, and parameter $a_{ij}$, which was used to specify whether such a recovery scheme was to be included or excluded from the overall system recovery assignment. The segment recovery cost $c_{ij}$, therefore, by its very nature must reflect the size and complexity of computations of the segment recovery scheme. Formally, this parameter can be expressed as follows:

$$c_{ij} = \sum_k t_{ijk} \qquad (16)$$

where

$t_{ijk}$ = the number of times that instruction k can be executed for recovery of segment j of task i.

As indicated in Section IV-A, the instruction count for a given segment can be accomplished very simply by an FSR. An FSR will be used to compute the segment recovery cost. To delimit the segment recovery procedure, the following declaration statements can be used:

    # declare begin_recov M, N
        Recovery scheme for Segment (M, N)
    # declare end_recov M, N

where M defines the task number and N defines the segment number associated with the recovery scheme. Hence, upon scanning the above sequence of source statements, the system compiler or post-processor can compute the recovery scheme cost identified by the above delimiters in a manner similar to its computation of parameter $e_{ij}$. Unfortunately, however, some recovery statements may be included within the segment computations themselves for the purpose of providing software recovery mechanisms. A computation of these imbedded segment recovery costs may be accomplished by identifying source statements as follows:

    # declare recov expression;

where "expression" is simply the desired recovery statement written in the system programming language. Such source statements would therefore be handled as recovery statements, and their resulting instruction counts would be included in the computation of parameter $t_{ijk}$ rather than the segment cost parameter $n_{ijk}$.

The above declarations, in parallel with the declarations set forth in Section IV-A, provide the facility for an automatic computation of parameter $t_{ijk}$ by techniques discussed in that section. In summary, such a computation of segment recovery cost can be expressed formally as follows:

$$t_{ijk} = m_{ijk} + m'_{ijk} \qquad (17)$$

where

$m_{ijk}$ = the number of times instruction k can be executed for recovery of segment j of task i within the "begin_recov" and "end_recov" declarations.
$m'_{ijk}$ = the number of times instruction k can be executed for recovery of segment j of task i within the statement defined by the "recov" declaration.

Upon computing both parameters given on the right-hand side of (17) using the FSR techniques set forth in the preceding section, the segment recovery scheme cost $c_{ij}$ can be computed using (16) or equivalently:

$$c_{ij} = \sum_k \left( m_{ijk} + m'_{ijk} \right) \qquad (18)$$

C. Segment Recovery Scheme Fault Rate

The third parameter discussed in Section III-A defined the fault rate for the software system recovery scheme as follows:

$$F = \sum_i \sum_j f_{ij} a_{ij} \qquad (19)$$

where $f_{ij}$ was defined as the recovery scheme fault rate contribution for segment j within task i. The parameter $f_{ij}$ should reflect the size and complexity of the segment recovery computations which are executed during normal segment computations to provide software recovery, thereby yielding a measure of the scheme's fault probability. In a manner parallel to the analysis of parameter $e_{ij}$ in Section IV-A, the segment recovery fault rate can be expressed as follows:

$$f_{ij} = p_i q_{ij} \left( \sum_k m'_{ijk} Z_k \right) \qquad (20)$$

where

$p_i$ = execution probability of task i.
$q_{ij}$ = execution probability of segment j within task i.
$m'_{ijk}$ = the number of times that instruction k can be executed for the recovery of segment j within the statement defined by the "recov" declaration.
$Z_k$ = fault probability associated with instruction k.

It has been assumed that the task and segment execution probabilities are known system parameters. In Section IV-A, techniques for measuring parameter $Z_k$ were set forth. In addition, the measurement of parameter $m'_{ijk}$ was discussed in detail in Section IV-B with the use of FSR techniques set forth in Section IV-A. Hence, for a general software system, the parameter $f_{ij}$ can be computed automatically by existing measurement techniques.

V. AN EXAMPLE OF A FAULT TOLERANT SOFTWARE SYSTEM

At the present time, a small number of large fault tolerant software systems have been developed. In each case, the software recovery approach has been chosen over a multiplex computer configuration because of the significant hardware savings realized.

For the purpose of illustrating the software recovery mechanisms set forth in the preceding sections, as well as the optimization procedures developed in Section III, the authors have developed a more tractable example of a small peripheral process control system which might be used to serve as a controller for remote valves, pumps, and measurement sensors for a

chemical processing plant under the control of a central processing system.¹⁰ The hardware configuration for such a peripheral controller is illustrated in Fig. 5, and uses standard Digital Equipment Corporation components with some modification for parity code fault detection. Dynamic reconfiguration of redundant components such as the DQ 11 synchronous line interface, the PDP 11/05 processor, and the ICS parallel interface subsystem can be provided by use of the DT03 UNIBUS programmable bus switch using algorithms discussed by Mills [24] in conjunction with the REPORT mechanism outlined in Section II. The software system, which is specified in [25], comprises six tasks:¹¹

1) the Input Protocol Task ($T_1$);
2) the Executive Control Task ($T_2$);
3) the Valve Operation Task ($T_3$);
4) the Parameter Measurement Task ($T_4$);
5) the Pump Operation Task ($T_5$);
6) the Output Protocol Task ($T_6$).

The task hierarchy structure and task execution probabilities of this system are illustrated in Fig. 6. Upon initiation of a controller execution cycle, five segments within task $T_1$ have been identified for the processing of DQ 11 input commands from the central processing computer. Task $T_2$ is composed of two segments which decode the validated input message and determine which one of three peripheral operation tasks ($T_3$, $T_4$, or $T_5$) should be executed. Each of the peripheral operation tasks $T_3$, $T_4$, and $T_5$ is similar in structure. In each case, the desired operation is decoded and the appropriate operation is commanded through the appropriate ICS port. The operation acknowledgment is monitored for completion of the desired operation or for the peripheral equipment failure. Upon completion of the monitoring operation, the appropriate operation status is formatted, and control is transferred to the output protocol task $T_6$. Within task $T_6$, three segments are used to encode the appropriate status and identification information for transmission to the central computer and output the encoded information using the DQ 11 interface. At the completion of task $T_6$, control is returned to task $T_1$ to initiate the next cycle of the controller's execution.

¹⁰ A detailed specification of this example can be found in [25].
¹¹ The system monitor function will not be considered in the example or its analysis.

[Fig. 5. Hardware configuration for example fault tolerant software system. Block labels: UNIBUS; redundant DQ 11 interface; redundant 11/05 processor and memory; redundant ICS-11 interface.]

[Fig. 6. Task hierarchy structure for example software system. Legible task execution probabilities: $p_1 = 1.00$, $p_3 = 0.18$, $p_4 = 0.42$, $p_5 = 0.11$.]

[Fig. 7. Segment hierarchy structure for task 1-input protocol task.]

The segment hierarchy and segment execution probabilities for each task of this software system are illustrated in Figs. 7-12 for tasks $T_1$ through $T_6$. Each segment of the system possesses an ON-ERROR statement which specifies an appropriate recovery procedure. The ARE recovery procedure is used exclusively in conjunction with an error counter which determines the presence of recurrent fault conditions. Upon detection of recurrent fault conditions, the most likely hardware component is dynamically replaced by a redundant unit using the REPORT mechanism. If the fault condition persists after such a reconfiguration of the hardware, a complete reconfiguration of all hardware components is requested.

A. An Assembly Language Post-Processor Summary

In the discussion of parameter measurements of Section IV, several references were made to the development of an assembly language processing scheme which could be easily constructed for various instruction repertoires to measure the software recovery parameters $h_{ij}$, $c_{ij}$, and $f_{ij}$. To illustrate this concept, we have developed an assembly language post-processor which performs the parameter measurements set

forth in Section IV for the PDP 11 instruction repertoire.¹² A discussion of the actual post-processor written in language C can be found in [25]. The results of the post-processor computations for the example system have been tabulated in Table I. Several system parameters which are of note can be summarized as follows:

$$S = 428 \text{ (words)} \qquad (21)$$

$$E = 122.66\, p \text{ (cycle)}^{-1} \qquad (22)$$

where

S = total software system size (less recovery size).
E = fault rate of the software system.
p = instruction error probability.

¹² One additional instruction (int) was added to the PDP 11 repertoire to provide the ON-ERROR operation. Reference [26] specifies these instructions in detail.

[Fig. 8. Segment hierarchy structure for task 2-executive control task.]

[Fig. 9. Segment hierarchy structure for task 3-valve operation task.]

[Fig. 10. Segment hierarchy structure for task 4-parameter measurement task.]

[Fig. 11. Segment hierarchy structure for task 5-pump operation task.]

[Fig. 12. Segment hierarchy structure for task 6-output protocol task.]


TABLE I
SUMMARY OF POST-PROCESSOR PARAMETER MEASUREMENTS

Task (i)   Segment (j)   h_ij
1          1             0.122
1          2             0.033
1          3             0.005
1          4             0.047
1          5             0.003
2          1             0.058
2          2             0.076
3          1             0.029
3          2             0.004
3          3             0.004
3          4             0.060
3          5             0.004
4          1             0.081
4          2             0.006
4          3             0.073
4          4             0.001
4          5             0.018
4          6             0.007
4          7             0.091
4          8             0.011
5          1             0.018
5          2             0.002
5          3             0.002
5          4             0.037
5          5             0.003
6          1             0.131
6          2             0.000
6          3             0.073

[The c_ij (words) and f_ij (p/cycle) columns of Table I cannot be realigned with their rows. Values in the order in which they survive:
c_ij: 15 9 11 11 2 23 17 17 16 23 17 15 17 11 17 15 15 11 23 17 17 16 11 9 0 16
f_ij: 0.80 0.20 0.72 0.05 1.44 2.52 0.27 0.27 0.18 1.44 0.18 6.02 0.52 0.17 0.13 0.04 0.65 0.22 0.17 0.17 1.00 0.43 1.54 0.11 0.11 1.00 0.00 1.00]

It should be noted that the fault rates $f_{ij}$ and E have been expressed in terms of a single instruction fault probability p. An approximate worst case estimate of $10^{-10}$ for this probability has been obtained for I/O operations using a PDP 11/70 processor and TU 16 tape drives. The parameter measurements, as they appear in Table I, will be used in the following section to illustrate the optimization procedures derived in Section III.

B. Applications of Optimization Techniques

Given the example software system set forth in Section V, the optimization techniques of Section III can be applied to this system to determine the optimum choice of segment recovery schemes which maximize the recovery scheme fault coverage subject to the scheme's maximum cost and additional fault rate. Even for a software system as small as the example, an exhaustive search for such a maximum would require the computation of all possible combinations of the variable $a_{ij}$. Given that the example software system is composed of twenty-eight segments, approximately $2^{28}$ iterations would be required! We estimate that such an exhaustive algorithm would take approximately one year of computing time on a dedicated PDP 11/70 computer. We have chosen the method developed by Cabot [15] for the heuristic solution of Knapsack problems, whose time complexity is roughly proportional to the square of the number of segments being considered, $O(n^2)$. Using this method of solution, functional variations in the index of fault coverage have been obtained with respect to the following two conditions:
1) the recovery scheme cost or size, expressed as a percentage of the total system cost or size (with the recovery scheme fault rate unconstrained);
2) the recovery scheme fault rate, expressed as a percentage of the overall system fault rate (with the recovery scheme cost unconstrained).

TABLE II
OPTIMAL FAULT COVERAGES FOR VARIOUS RECOVERY SCHEME COSTS (FAULT RATE UNCONSTRAINED)

Maximum Cost   Index of Fault   Recovery Scheme   Recovery Scheme
(percent)      Coverage         Cost (words)      Fault Rate (p/cycle)
 0%            0.000              0                0.00
 5%            0.083             11                1.60
10%            0.092             33                1.98
15%            0.106             55                2.52
20%            0.359             79                4.52
25%            0.566            107                7.57
30%            0.584            122                7.61
35%            0.644            138                7.79
40%            0.772            169                8.12
45%            0.776            186                8.39
50%            0.780            203                8.66
55%            0.786            220                9.18
60%            0.794            254                9.96
65%            0.796            271               10.13
70%            0.798            288               10.30
75%            0.871            304               11.30
80%            0.900            327               13.82
85%            0.981            350               19.84
90%            0.999            373               21.38

TABLE III
OPTIMAL FAULT COVERAGES FOR VARIOUS RECOVERY SCHEME FAULT RATES (COST UNCONSTRAINED)

Maximum Fault    Index of Fault   Recovery Scheme   Recovery Scheme
Rate (percent)   Coverage         Cost (words)      Fault Rate (p/cycle)
 0%              0.000              0                0.00
 1%              0.050              2                0.80
 2%              0.103             44                2.41
 3%              0.256             70                3.58
 4%              0.359             79                4.52
 5%              0.435             90                5.96
 6%              0.557            110                7.10
 7%              0.776            186                8.39
 8%              0.787            237                9.31
 9%              0.798            288               10.30
10%              0.871            304               11.30
11%              0.871            304               11.30
12%              0.900            327               13.82
13%              0.900            327               13.82
14%              0.900            327               13.82
15%              0.900            327               13.82
16%              0.970            335               18.86
17%              0.981            350               19.84
18%              0.999            373               21.38

A tabulation of these functional dependences may be found in Tables II and III. These results are also plotted in Figs. 13 and 14. To illustrate the contrast between the optimum solutions obtained, an average random choice of recovery schemes for each data point contained in Tables II and III was simulated using a multiplicative congruential random number generator. The results of this simulation may be found in Tables IV and V. Plots of the average choice data also appear in Figs. 13 and 14. The variations in the index of fault coverage shown in Figs. 13 and 14 exhibit several very interesting characteristics. In general, for recovery scheme choices which yield less than total fault coverage, the optimum recovery scheme is far superior to an average random choice of segment recovery schemes. As the recovery scheme cost or fault rate approaches the point at which total fault coverage can be provided, fewer choices of segment recovery schemes must be made. Hence, the optimum and average choice solutions converge to the maximum index of fault coverage.
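The paper does not give the details of the random-choice simulation, so the following is only a sketch of how one such random selection might be drawn with a multiplicative congruential generator. The multiplier 16807 and modulus $2^{31}-1$ are the widely used "minimal standard" constants, assumed here rather than taken from the authors; the function names and the cost values are likewise hypothetical.

```python
def lehmer(seed, a=16807, m=2**31 - 1):
    """Multiplicative congruential generator: x_{n+1} = a * x_n mod m."""
    x = seed
    while True:
        x = (a * x) % m
        yield x


def random_assignment(costs, budget, seed=1):
    """Pick segment recovery schemes in random order while the budget allows."""
    gen = lehmer(seed)
    order = list(range(len(costs)))
    # Fisher-Yates shuffle driven by the congruential generator.
    for i in range(len(order) - 1, 0, -1):
        j = next(gen) % (i + 1)
        order[i], order[j] = order[j], order[i]
    chosen, spent = [], 0
    for idx in order:
        if spent + costs[idx] <= budget:
            chosen.append(idx)
            spent += costs[idx]
    return sorted(chosen), spent


# Hypothetical recovery scheme costs (words) and a hypothetical budget.
example_costs = [15, 9, 11, 23, 17]
picked, used = random_assignment(example_costs, budget=30)
```

Averaging the index of fault coverage over many such draws, each with a different seed, would give one "average random choice" data point for a given budget.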

TABLE IV
AVERAGE RANDOM CHOICE FAULT COVERAGES FOR VARIOUS RECOVERY SCHEME COSTS (FAULT RATE UNCONSTRAINED)

Maximum Cost   Index of Fault   Recovery Scheme   Recovery Scheme
(percent)      Coverage         Cost (words)      Fault Rate (p/cycle)
 0%            0.000              0                0.00
 5%            0.018             15                0.04
10%            0.023             26                0.24
15%            0.044             60                1.89
20%            0.046             77                2.06
25%            0.250            101                3.23
30%            0.254            112                3.41
35%            0.304            147                4.43
40%            0.337            156                5.23
45%            0.435            188                6.10
50%            0.557            203                7.10
55%            0.561            220                7.37
60%            0.643            254                8.97
65%            0.647            271                9.24
70%            0.705            282               10.68
75%            0.823            321               16.81
80%            0.900            334               18.68
85%            0.970            350               18.86
90%            0.999            373               21.38

TABLE V
AVERAGE RANDOM CHOICE FAULT COVERAGES FOR VARIOUS RECOVERY SCHEME FAULT RATES (COST UNCONSTRAINED)

Maximum Fault    Index of Fault   Recovery Scheme   Recovery Scheme
Rate (percent)   Coverage         Cost (words)      Fault Rate (p/cycle)
 0%              0.000              0                0.00
 1%              0.023             26                0.24
 2%              0.046             77                2.06
 3%              0.228            129                3.52
 4%              0.304            147                4.43
 5%              0.417            188                6.10
 6%              0.417            188                6.10
 7%              0.570            238                8.97
 8%              0.647            271                9.24
 9%              0.705            282               10.68
10%              0.705            282               10.68
11%              0.705            282               10.68
12%              0.705            282               10.68
13%              0.705            282               10.68
14%              0.823            321               16.81
15%              0.834            332               17.24
16%              0.900            327               17.82
17%              0.970            350               18.86
18%              0.999            373               21.38

[Fig. 13. Plot of index of fault coverage versus recovery scheme cost (fault rate unconstrained). Horizontal axis: recovery scheme cost (percent), 0 to 90. Vertical axis: index of fault coverage. Curves: optimum solution; average random choice.]

[Fig. 14. Plot of index of fault coverage versus recovery scheme fault rate (cost unconstrained). Horizontal axis: recovery scheme fault rate (percent), 0 to 18. Vertical axis: index of fault coverage, 0.0 to 1.0. Curves: optimum solution; average random choice.]

The most interesting characteristic of these variations, however, is apparent in the slope of the optimum plots. In Fig. 13, for recovery scheme costs below 15 percent, the slope of the data is very gradual and on the order of 0.001 per percent of cost. At the 15 percent cost point, the slope of the curve rapidly increases, yielding a gain of approximately 0.05 per percent of cost. The variation of the index finally becomes more gradual at the 40 percent cost point and approaches a limit of zero gain at 90 percent. A similar characteristic is noted in Fig. 14 for increasing recovery scheme fault rates. Hence, the concept of diminishing returns of fault coverage for increases in recovery scheme cost has been shown to exist for the given example software system.

An explanation of the behavior of the index of fault coverage variations illustrated in Figs. 13 and 14 can be obtained by considering the strategy of most Knapsack heuristic algorithms. In the context of the problem at hand, such algorithms initially seek out those segment recovery schemes which yield a high index of fault coverage with minimal associated cost and fault rate relative to the remaining recovery schemes. This process continues iteratively until one of the constraints is exceeded, or until all segment recovery schemes have been chosen. Hence, as the overall recovery assignment cost or fault rate is increased, the more efficient segment recovery schemes are incorporated into the system recovery scheme until only the more costly and less efficient segment recovery schemes remain unchosen. As the constraints are increased further, the less efficient recovery schemes are also incorporated into the system recovery assignment with very little gain of fault coverage in comparison to cost or fault rate. At this point, the law of diminishing returns is fulfilled.
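The general strategy attributed above to Knapsack heuristics can be sketched as a greedy selection for the doubly-constrained 0,1 problem. This is not Cabot's algorithm [15]; it is a generic efficiency-ratio heuristic written only to illustrate the behavior described, and the `Scheme` record, the efficiency measure, and the sample numbers are all assumptions. Unlike the description in the text, this variant skips a scheme that does not fit and keeps trying the remaining ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    name: str
    h: float  # contribution to the index of fault coverage
    c: float  # recovery scheme cost (words)
    f: float  # recovery scheme fault rate contribution (p/cycle)


def greedy_select(schemes, max_cost, max_fault):
    """Greedy 0,1 selection under two budgets (both assumed positive)."""
    def efficiency(s):
        # Coverage gained per unit of normalized resource consumption.
        load = s.c / max_cost + s.f / max_fault
        return float("inf") if load == 0 else s.h / load

    chosen, cost, fault = [], 0.0, 0.0
    for s in sorted(schemes, key=efficiency, reverse=True):
        if cost + s.c <= max_cost and fault + s.f <= max_fault:
            chosen.append(s)
            cost += s.c
            fault += s.f
    return chosen, cost, fault


# Hypothetical candidates, not the values of Table I.
candidates = [
    Scheme("A", h=0.5, c=10, f=1.0),
    Scheme("B", h=0.3, c=10, f=1.0),
    Scheme("C", h=0.2, c=50, f=5.0),
]
selected, total_cost, total_fault = greedy_select(candidates, max_cost=25, max_fault=3)
index = sum(s.h for s in selected)
```

With these numbers the two efficient schemes are taken first and the costly one is left out, which is the mechanism behind the flattening of the optimum curves as the budgets grow.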

One observation from Fig. 13 is that a recovery scheme cost of 90 percent is required to provide maximal fault coverage for the example system. By examining the example more closely, a marked reduction in total recovery assignment cost could be achieved by combining similar recovery schemes for multiple segments into a shared recovery procedure. The inclusion of such a concept into the optimization techniques discussed in Section III would require a modification to the optimization procedure. In choosing candidate segment recovery schemes for inclusion into the overall system recovery assignment, segments which share common recovery code must be considered as a single recovery scheme candidate. By pursuing this reduction in segment recovery cost, complete fault coverage can be obtained for the example system at a cost of 45 percent. Because this reduction affects only normally executed recovery scheme code, it should be noted that no reduction in the recovery scheme fault rate is realized by this cost reduction.
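One way to picture the modification described above is to merge segments that share recovery code into a single candidate before the optimization is run. The sketch below is an assumption-laden illustration: the record layout, the `shared` label, and the rule that a shared procedure costs as much as the largest member scheme are all hypothetical, while the summing of fault rates follows the remark that sharing yields no reduction in recovery scheme fault rate.

```python
from collections import defaultdict


def merge_shared(segments):
    """Combine segments that name the same shared recovery procedure."""
    groups = defaultdict(list)
    for seg in segments:
        key = seg["shared"] or seg["id"]  # unshared segments form their own group
        groups[key].append(seg)
    merged = []
    for key, members in groups.items():
        merged.append({
            "id": key,
            "members": [m["id"] for m in members],
            "h": sum(m["h"] for m in members),  # coverage contributions add
            "f": sum(m["f"] for m in members),  # fault rate is not reduced
            "c": max(m["c"] for m in members),  # assumed: shared code is stored once
        })
    return merged


# Hypothetical segments; "ops" labels an assumed common recovery procedure.
example_segments = [
    {"id": "S31", "h": 0.10, "c": 20, "f": 1.0, "shared": "ops"},
    {"id": "S51", "h": 0.20, "c": 20, "f": 1.5, "shared": "ops"},
    {"id": "S12", "h": 0.05, "c": 10, "f": 0.5, "shared": None},
]
merged_candidates = merge_shared(example_segments)
```

In this example the total cost of all candidates drops from 50 to 30 words while the total fault rate contribution stays at 3.0, matching the qualitative effect described in the text.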

C. Results for a Fault Tolerant Software System

To verify the validity of the fault coverage concepts set forth in the preceding sections, a system simulation was devised to measure the fault coverage properties using the example fault tolerant software system described in the preceding subsections of this section. The memory organization for this core-resident system was formulated such that the data area for the system resided in the first one-hundred addressable locations of memory, and the program area for the system immediately followed the data area. During each cycle of the simulation process, a random, uniformly distributed location within either the data area or program area of the system was chosen to introduce the effect of a fault occurrence within the system.

The manifestation of each simulated fault occurrence consisted of the injection of a random number of errors into the various bits of the chosen fault location. Each such fault occurrence was deemed detectable and recoverable if and only if the software recovery assignment under test could detect the fault occurrence and execute appropriate recovery procedures. Fault occurrences which satisfied this criterion were considered as coverable by the recovery assignment under test, while all other fault occurrences were considered as not coverable by the recovery assignment under test. The index of fault coverage for a particular recovery assignment was therefore computed as the ratio of coverable fault occurrences to the total number of simulated fault occurrences.

For each optimal recovery assignment derived in Section V-B for the example fault tolerant software system, as summarized in Tables II and III, over four-hundred fault occurrences were simulated using the technique described above to yield stable estimates. The actual index of fault coverage for each assignment has been tabulated and compared with the estimated index of fault coverage in Table VI, as a function of optimal recovery scheme cost, and in Table VII, as a function of optimal recovery scheme fault rate. An illustration of these comparisons between the estimated and actual fault coverage indices can be found in Figs. 15 and 16, respectively.

TABLE VI
COMPARISON OF ACTUAL AND ESTIMATED OPTIMAL FAULT COVERAGES FOR VARIOUS RECOVERY SCHEME COSTS (FAULT RATE UNCONSTRAINED)

Maximum Cost   Actual Index of    Estimated Index of
(percent)      Fault Coverage     Fault Coverage
 0%            0.000              0.000
 5%            0.278              0.083
10%            0.297              0.092
15%            0.306              0.106
20%            0.344              0.359
25%            0.511              0.566
30%            0.577              0.584
35%            0.632              0.644
40%            0.755              0.772
45%            0.767              0.776
50%            0.779              0.780
55%            0.791              0.786
60%            0.815              0.794
65%            0.829              0.796
70%            0.833              0.798
75%            0.843              0.871
80%            0.891              0.900
85%            0.948              0.981
90%            0.999              0.999

TABLE VII
COMPARISON OF ACTUAL AND ESTIMATED OPTIMAL FAULT COVERAGES FOR VARIOUS RECOVERY SCHEME FAULT RATES (COST UNCONSTRAINED)

Maximum Fault    Actual Index of    Estimated Index of
Rate (percent)   Fault Coverage     Fault Coverage
 0%              0.000              0.000
 1%              0.264              0.050
 2%              0.297              0.103
 3%              0.334              0.256
 4%              0.344              0.359
 5%              0.38?              0.453
 6%              0.505              0.5??
 7%              0.767              0.776
 8%              0.803              0.787
 9%              0.835              0.798
10%              0.843              0.871
11%              0.843              0.871
12%              0.891              0.900
13%              0.891              0.900
14%              0.891              0.900
15%              0.891              0.900
16%              0.930              0.970
17%              0.948              0.991
18%              0.999              0.999

[Fig. 15. Plot of actual and estimated optimal indices of fault coverage versus recovery scheme cost (fault rate unconstrained). Horizontal axis: recovery scheme cost (percent). Curves: estimated index; actual index.]

[Fig. 16. Plot of actual and estimated optimal indices of fault coverage versus recovery scheme fault rate (cost unconstrained). Horizontal axis: recovery scheme fault rate (percent). Curves: estimated index; actual index.]

One noteworthy conclusion which is evident from Fig. 15 is that for small recovery scheme costs (less than 20 percent) the actual fault coverage is approximately three times the estimated fault coverage. This phenomenon arises from the fact that although a small percentage of system segments possess recovery assignments, the effects of fault occurrences which occur in other system tasks and segments can ultimately be detected and corrected by these schemes indirectly. For example, a fault occurrence which results in a branch condition error within an unprotected segment can be eventually detected by a protected segment which will be executed as the result of the erroneous branch condition, even though the fault occurrence was not detected within the segment in which the fault originally occurred. As the index of fault coverage is increased, this disparity between the actual and estimated fault coverage indices decreases rapidly as a larger number of system segments possess recovery schemes. (A similar property is evident in Fig. 16 for small recovery scheme fault rates.)

Aside from the "small coverage" effect described above, the simulation results closely agree with the theoretical estimate of fault coverage set forth in the preceding sections to within 10 percent of the estimated index of fault coverage.

To demonstrate the accuracy of the linear approximation to the segment and overall system fault rates developed in Appendix B, the index of fault coverage of each segment of the simulated system was measured for the software recovery assignment which provided full system fault coverage. A comparison between the measured and theoretical segment fault coverage indices for the example software system can be found in Table VIII. (The theoretical indices of fault coverage for the example software system were obtained using the techniques set forth in Section V-B and are tabulated in Table I.) For each segment, the theoretical index of fault coverage agrees very closely with the results obtained from the simulation.

TABLE VIII
COMPARISON OF ACTUAL AND ESTIMATED INDICES OF FAULT COVERAGE

Task (i)   Segment (j)   Theoretical h_ij   Measured h_ij   Absolute Value of Difference
1          1             0.122              0.124           0.002
1          2             0.033              0.032           0.001
1          3             0.005              0.005           0.000
1          4             0.047              0.048           0.001
1          5             0.003              0.002           0.001
2          1             0.058              0.056           0.002
2          2             0.076              0.078           0.002
3          1             0.029              0.030           0.001
3          2             0.004              0.004           0.000
3          3             0.004              0.004           0.000
3          4             0.060              0.061           0.001
3          5             0.004              0.004           0.000
4          1             0.081              0.082           0.001
4          2             0.006              0.005           0.001
4          3             0.073              0.072           0.001
4          4             0.001              0.001           0.000
4          5             0.018              0.019           0.001
4          6             0.007              0.006           0.001
4          7             0.091              0.090           0.001
4          8             0.011              0.012           0.001
5          1             0.018              0.019           0.001
5          2             0.002              0.002           0.000
5          3             0.002              0.002           0.000
5          4             0.037              0.036           0.001
5          5             0.003              0.003           0.000
6          1             0.131              0.133           0.002
6          2             0.000              0.000           0.000
6          3             0.073              0.074           0.001

VI. CONCLUSIONS

A general model and optimization scheme for fault tolerant software allocation has been presented. Using the example of a small fault tolerant software system in Section V, actual measurements of the required system parameters defined in Section IV were obtained using the assembly language post-processor illustrated in Section V-A. Functional variations of the index of fault coverage for the example system were then obtained with respect to ranges of system recovery scheme cost and fault rate.

In addition to demonstrating the feasibility of the approach to provide optimal fault coverage for fault tolerant software systems, the following conclusions were established from the optimization results of Section V-B.
1) The optimal selection of segment recovery schemes is superior to an average random choice of segment recovery schemes throughout the entire range of total system recovery scheme costs and fault rates.
2) The optimal segment recovery selection process exhibits a breakpoint of diminishing returns for increasing system recovery scheme costs and fault rates.
3) By modifying the original problem formulation to allow for shared segment recovery computations, a marked decrease in total recovery scheme cost can be realized.

The results summarized above, in addition to the concepts set forth in the preceding sections, yield a detailed characterization of fault tolerant software systems.

At the present time, the application of a global optimization procedure to large software systems is computationally infeasible using current heuristic procedures. The authors suggest, however, that a promising area of further research lies in the development of local optimization procedures for large software systems. In addition, although the authors have only addressed the topic of software recovery from detectable hardware fault conditions, many classes of software fault conditions can be shown to have identical characteristics to hardware fault conditions. This topic of indistinguishable software and hardware fault conditions also presents an interesting area of future research which the authors are pursuing.

REFERENCES

[1] J. Goldberg, P. G. Newmann, and J. H. Wensley, "Survey of fault-tolerant computing systems (revised)," Stanford Res. Inst., Aug. 1972.
[2] W. G. Boricius, W. C. Carter, and P. R. Schneider, "Reliability modeling techniques and trade-off studies for self-repairing computers," IBM Res., Feb. 1969.
[3] S. J. Bavuso, "Impact of coverage on the reliability of a fault tolerant computer," NASA Tech. Note TN D7938, Sept. 1975.
[4] F. P. Mathew, "Reliability modeling and architecture of ultrareliable fault tolerant digital computers," Univ. California, Ph.D. dissertation, 1970.
[5] J. J. Horning, H. C. Lauer, P. M. Melliar-Smith, and B. Randell, "A program structure for error detection and recovery," in Operating Systems. New York: Springer-Verlag, 1974, pp. 171-187.
[6] J. B. Goodenough, "Exception handling: Issues and a proposed notation," Commun. Ass. Comput. Mach., pp. 683-696, Dec. 1975.
[7] I. D. Hill, "Faults in functions, in ALGOL and FORTRAN," Computer, pp. 315-316, Mar. 1972.
[8] J. D. Gannon and J. J. Horning, "Language design for programming reliability," IEEE Trans. Software Eng., vol. SE-1, pp. 179-191, June 1975.
[9] B. Randell, "System structure for software fault tolerance," IEEE Trans. Software Eng., vol. SE-1, pp. 220-232, June 1975.
[10] P. Brinch Hansen, Operating System Principles. Englewood Cliffs, NJ: Prentice-Hall, 1973.
[11] O. J. Dahl, E. W. Dijkstra, and C. A. R. Hoare, Structured Programming. New York: Academic, 1972.
[12] T. L. Saaty, Optimization in Integers and Related Extremal Problems. New York: McGraw-Hill, 1970, pp. 211-215.
[13] P. C. Gilmore and R. E. Gomory, "The theory and computations of knapsack functions," Oper. Res., vol. 14, pp. 1045-1074, 1966.
[14] J. F. Shapiro, "Dynamic programming algorithms for the integer programming problem-I: The integer programming problem viewed as a knapsack type problem," Oper. Res., vol. 16, pp. 103-121, 1968.
[15] A. V. Cabot, "An enumeration algorithm for knapsack problems," Oper. Res., vol. 18, pp. 306-311, 1970.
[16] J. Yormark, "Accelerating Greenberg's method for the computation of knapsack functions," Oper. Res., vol. 20, B-33/WAMw.26, 1972.
[17] I. H. Yetter, "High speed fault simulation for Univac 1107 computer system," in Proc. Ass. Comput. Machines Nat. Conf., 1968, pp. 265-277.
[18] F. P. Mathur, "Reliability modeling, analysis, and prediction of ultrareliable fault tolerant digital systems," IEEE Trans. Comput., vol. C-20, pp. 1376-1382, 1971.
[19] S. V. Trufanov, "Markov model of digital computer operations in the presence of faults," Eng. Cybern., vol. 9, pp. 291-294, 1971.
[20] P. Cox and K. F. Rankin, "Reliability in large electronic data

5

0

.

000

404

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-4, NO. 5, SEPTEMBER 1978

[21] [22] [231

[241

[25]

[26]

processing systems," Automatic Telephone and Electric Company, Ltd., J21, pp. 162-177, 1965. Z. Kohavi, Switching and Finite Automata Theory. New York: McGraw-Hill, 1970, ch. 16, pp. 541-576. A. V. Aho and J. D. Ullman, The Theory of Parsing, Translation, and Compiling Volume 1: Parsing. Englewood Cliffs, NJ: Prentice-Hall, 1972. B. W. Kerninghan and P. J. Plauger, The Elements of Programming Style. New York: McGraw-Hill, 1974. D. L. Mills, "6Transient fault recovery in the distributed computer network," Computer Sci. Tech. Rep. Series, Univ. Maryland, Tech. Rep. TR414, Oct. 1975. T. F. Gannon, "An optimal approach to fault tolerant software system design," Ph.D. dissertation, Stevens Inst. Tech., Mar. 1977. PDP 11/45 Processor Handbook, Digital Equipment Corporation,

1973.

[271 A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms. New York: Addison-Wesley, 1974.

APPENDIX A

CALCULATION OF FAULT COVERAGE

A. Fault Coverage of LABEL Mechanisms

As set forth in Section I, it is assumed that all computer hardware operations are monitored by reliable (error-free) fault detection hardware using a parity code. It is therefore assumed that the probability of a fault within the fault detection hardware is negligible. Hence, the fault coverage of the LABEL mechanism is equal to the percentage of possible faults which are detectable by a parity code. The usage and theory of parity codes is well documented by Sellers et al. [A1]. It is well known that a parity code will detect all patterns of errors in a word w except those such that:

$$0 = \sum_{i=0}^{n} e_i \bmod r \tag{A1}$$

where the pattern of errors is

$$E_w = (e_n e_{n-1} \cdots e_1 e_0) \tag{A2}$$

where

r = number base of the word w,

and

$e_i$ = 0, if no error occurs in the ith pattern position,
$e_i$ = 1, if an error occurs in the ith pattern position.

For binary representations of the word w, therefore, (A1) becomes:

$$0 = \sum_{i=0}^{n} e_i \bmod 2. \tag{A3}$$

From this expression, it is obvious that a parity code will detect all single error occurrences in the bit positions of word [...] of undetectable errors to all possible errors within the bit positions of word w can be derived:

$$E(w) = \frac{\sum_{i=1}^{n/2} \binom{n}{2i}}{\sum_{i=1}^{n} \binom{n}{i}} \tag{A4}$$

where n = number of bits in the representation of word w, E(w) = fraction of undetectable errors to total possible errors in the bit positions of w, and

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!} \tag{A5}$$

For a typical 16-bit word in which one bit is used for parity, the following percentage is obtained:

$$E(w) = \frac{32767}{78405} = 0.4179 \tag{A6}$$

Hence, approximately 58 percent of all possible fault conditions are detectable by a parity code for a 16-bit word. It should be noted that (A4) and (A6) express the fraction of undetectable fault occurrences to the total number of possible fault occurrences. The ratio of the probability of undetectable fault occurrences to the probability of all possible fault occurrences can be expressed as:

$$P(w) = \frac{\sum_{i=1}^{n/2} \binom{n}{2i} p^{2i}}{\sum_{i=1}^{n} \binom{n}{i} p^{i}} \tag{A7}$$

where p = probability that a bit of word w is in error. The expansion of (A7) with all second-order terms neglected for a 16-bit word results in the following expression:

$$P(w) = \frac{120p^2 + O(p^4)}{16p + 120p^2 + O(p^3)} \tag{A8}$$

[...] (A9)

If it is assumed that [...] (A10)
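The counting argument behind (A4) through (A8) can be checked numerically. The sketch below assumes that (A4) is the ratio of even-weight error patterns (the ones a single parity bit misses) to all nonzero error patterns, and that (A7) weights each pattern of i bit errors by p^i; both forms are inferred from the surviving fragments and from the coefficients 120 and 16 in (A8), and the function names are hypothetical. Under that assumed reading the ratio for n = 16 comes out as 32767/65535, roughly 0.5, whereas (A6) as printed gives 32767/78405 = 0.4179; the printed denominator equals 65535 plus the binomial coefficient of 16 over 8. The sketch does not settle which reading the authors intended.

```python
from fractions import Fraction
from itertools import product
from math import comb


def undetectable_fraction(n: int) -> Fraction:
    """E(w) under the ASSUMED form of (A4):
    even-weight error patterns divided by all nonzero error patterns."""
    even = sum(comb(n, 2 * i) for i in range(1, n // 2 + 1))
    total = sum(comb(n, i) for i in range(1, n + 1))
    return Fraction(even, total)


def undetectable_fraction_bruteforce(n: int) -> Fraction:
    """Same quantity by enumerating every error pattern of an n-bit word.
    A single parity bit fails to flag a pattern exactly when its weight is even."""
    even = 0
    total = 0
    for pattern in product((0, 1), repeat=n):
        weight = sum(pattern)
        if weight == 0:
            continue  # no error occurred, so this is not a fault pattern
        total += 1
        if weight % 2 == 0:
            even += 1
    return Fraction(even, total)


def undetectable_probability(n: int, p: float) -> float:
    """P(w) under the ASSUMED form of (A7): each i-bit pattern weighted by p**i."""
    numerator = sum(comb(n, 2 * i) * p ** (2 * i) for i in range(1, n // 2 + 1))
    denominator = sum(comb(n, i) * p ** i for i in range(1, n + 1))
    return numerator / denominator
```

For small p the ratio is dominated by the leading terms of (A8), 120p^2 over 16p, so `undetectable_probability(16, p)` is close to 7.5p; the brute-force function is only practical for short words and serves as a cross-check on the binomial sums.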
