Combining Fault Avoidance, Removal and Tolerance: An Integrated Basis for Software Verification and Validation

B. Cukic, C. Fuhrman, A. Mili
Institute for Software Research
1000 Technology Drive
Fairmont, WV 26554, USA

R. Ben Ayed
School of Engineering, PO Box 37
University of Tunis
Belvedere, 1002 Tunisia

May 6, 1998

Abstract

Fault avoidance, fault removal and fault tolerance represent three successive lines of defense against the contingency of faults in software systems and their impact on system reliability. The law of diminishing returns advocates that these three sets of methods be brought to bear to achieve effective software verification and validation: each method is used in the context where it is most effective. In this paper, we present an integrated approach to verification and validation, where we identify what aspects each set of methods is best adapted to deal with.

Keywords

Fault avoidance, Fault removal, Fault tolerance, Formal specifications, Verification and validation.

1 Successive Lines of Defense

Despite three decades of intensive research, the verification and validation of software products remains an active research area. A great deal of progress has been achieved in this field, but the advent of new programming languages and new software development paradigms, combined with the increasing reliance on software and the increasing complexity of software applications, has maintained the pressure for more research. All the methods of verification and validation revolve around the theme of dealing with the existence and the manifestation of faults. Traditionally, these methods are classified into three categories, which differ by how early faults are identified and dealt with; the categories can be seen as successive lines of defense against the effects of faults on software quality.

- Fault Avoidance. These methods take the view that it is possible to build fault-free software, and focus on means to specify, verify and derive software products that are free of faults.

- Fault Removal. These methods take the view that despite our best efforts, developed software may still contain faults, and apply methods to remove faults from existing software products.

Corresponding author. Email: [email protected], [email protected]. http://www.isr.wvu.edu/


- Fault Tolerance. These methods take the view that neither fault avoidance nor fault removal ensures that software products are free of faults; they advocate that measures be taken to prevent faults from causing failure.

Not surprisingly, each family of methods is effective for some types of faults and ineffective for others, as we will discuss later in this paper; hence it makes sense to apply all three families of methods, by virtue of the Law of Diminishing Returns. By doing so, we afford the luxury of applying each method where it is most effective, and dropping it in favor of another when its effectiveness drops. In this paper, we want to discuss the technical implications of combining these three categories of methods, using a unified relational framework. We will highlight in particular that software specifications can be structured as aggregates of sub-specifications, where each sub-specification is more adapted to a particular family of methods than to others, so that the law of diminishing returns becomes less a general philosophical statement than an actual working solution. In section 2 we give a brief introduction to relations and relational calculus, which we use throughout the paper. In section 3 we use this background to provide a general model for fault avoidance, fault removal and fault tolerance. In section 4 we analyze the structure of specifications, and discuss how to decompose the verification and validation effort of a software product into a fault avoidance effort, a fault removal effort and a fault tolerance effort. Finally, section 5 presents a synthesis of our results as well as prospects for future work.

2 Relational Specifications

2.1 Elements of Relations

A relation on set S is a subset of the Cartesian product S × S. Constant relations on a set S include the identity relation, denoted by I, the universal relation, denoted by L, and the empty relation, denoted by ∅. A vector a is a relation of the form a = A × S for some non-empty set A; there is a straightforward isomorphism between vector a (a relation) and set A. An invector (inverse of a vector) is a relation of the form S × A, for some non-empty set A. Because relations are sets, we can apply to them all the set theoretic operations (union, ∪; intersection, ∩; complement, ¬; set difference, \). In addition, we define the following relation-specific operations. The domain of relation R is the set denoted by dom(R) and defined by

  dom(R) = {s | ∃s′ : (s, s′) ∈ R}.

The image set of element s by relation R is the set denoted by s·R and defined by

  s·R = {s′ | (s, s′) ∈ R}.

The inverse of relation R is the relation denoted by R̂ and defined by

  R̂ = {(s, s′) | (s′, s) ∈ R}.

The product of relations R and R′ is the relation denoted by R ◦ R′ (abbreviated by RR′, when no ambiguity arises) and defined by

  R ◦ R′ = {(s, s′) | ∃t : (s, t) ∈ R ∧ (t, s′) ∈ R′}.
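The operators above can be prototyped on finite relations by modeling a relation on set S as a Python set of pairs. This is an illustrative sketch of our own; the function names (dom, image, inverse, product) are not part of the paper's formalism.

```python
def dom(R):
    """dom(R) = {s | exists s': (s, s') in R}"""
    return {s for (s, _) in R}

def image(s, R):
    """s.R = {s' | (s, s') in R}"""
    return {sp for (t, sp) in R if t == s}

def inverse(R):
    """The inverse relation: {(s, s') | (s', s) in R}"""
    return {(sp, s) for (s, sp) in R}

def product(R, Rp):
    """R o R' = {(s, s') | exists t: (s, t) in R and (t, s') in R'}"""
    return {(s, sp) for (s, t) in R for (u, sp) in Rp if t == u}

# Example on S = {0, 1, 2}:
R = {(0, 1), (1, 2)}
print(sorted(dom(R)))         # [0, 1]
print(sorted(image(0, R)))    # [1]
print(sorted(inverse(R)))     # [(1, 0), (2, 1)]
print(sorted(product(R, R)))  # [(0, 2)]
```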

For reasons that will become apparent in section 3.3, we need another relational product operator, which we call the monotonic product, denote by R □ R′, and define as

  R □ R′ = RR′ ∩ ¬(R ¬(R′L)).

Note that for a given relation R, the relation RL is a vector; it can be written as

  RL = {(s, s′) | s ∈ dom(R)}.

The restriction of relation R to set A is the relation denoted by A\R and defined by

  A\R = {(s, s′) | s ∈ A ∧ (s, s′) ∈ R}.

Note that if a is the vector defined by a = A × S then A\R = a ∩ R. Also note that R′L ∩ R is the restriction of R to the domain of R′ (a recurrent expression in the sequel). A relation R is said to be total if and only if I ⊆ RR̂, and said to be deterministic if and only if R̂R ⊆ I. A deterministic relation is called a function. A relation R is said to be an equivalence relation if and only if it is reflexive (I ⊆ R), symmetric (R̂ ⊆ R), and transitive (RR ⊆ R).
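The remaining operators admit the same finite-relation treatment. The monotonic product below keeps a pair of RR′ only when every R-image of its input lies in dom(R′), which is one standard reading of the definition above; all names are our own illustrative choices.

```python
def dom(R):
    return {s for (s, _) in R}

def product(R, Rp):
    return {(s, sp) for (s, t) in R for (u, sp) in Rp if t == u}

def monotonic_product(R, Rp):
    """Keep RR' only on inputs all of whose R-images lie in dom(R')."""
    ok = {s for s in dom(R)
          if all(t in dom(Rp) for (u, t) in R if u == s)}
    return {(s, sp) for (s, sp) in product(R, Rp) if s in ok}

def restrict(A, R):
    """A\\R = {(s, s') | s in A and (s, s') in R}"""
    return {(s, sp) for (s, sp) in R if s in A}

def is_total(R, S):
    """Totality (I included in R R-hat): every element of S has an image."""
    return dom(R) == set(S)

def is_deterministic(R):
    """Determinism (R-hat R included in I): at most one image per input."""
    return all(len({sp for (t, sp) in R if t == s}) == 1 for s in dom(R))

R  = {(0, 1), (0, 2), (1, 2)}
Rp = {(2, 0)}
print(sorted(monotonic_product(R, Rp)))  # [(1, 0)]; 0 also maps to 1, outside dom(Rp)
print(sorted(restrict({0}, R)))          # [(0, 1), (0, 2)]
```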

2.2 Specifying with Relations

We use relations to represent specifications. A relation R on some space S contains all the input/output pairs (s, s′) that the specifier considers correct. If an element s is outside the domain of R, we understand that it is not expected to be submitted by the user, and the behavior of candidate implementations on s is arbitrary (in particular, they may fail to terminate). If s ∈ dom(R), then s·R is the set of outputs that are considered correct for input s. In the special case when s·R = S, candidate implementations may return any output for input s, but they may not fail to terminate. Homogeneous relations (from S to S) are adequate to specify software products that implement simple input/output functions, but inadequate for software products that maintain an internal state (such as objects, data types, stimulus/response systems, etc). For such systems, we also use a relational model, which we present briefly below:

- An input space, X, from which we derive a set of input histories, H, where an input history is a sequence of elements of X.

- An output space, Y.

- A relation, R, from H to Y, i.e. a subset of H × Y. The pair (h, y) is in R if and only if the specifier considers that y is a correct output for the input history h.

As a brief illustration of this model, we consider the specification of a stack of integers:

- X = {reset, pop, top, size} ∪ ({push} × integer).

- Y = integer ∪ {error}.

- We cannot give a closed form representation of R, hence will, for the purposes of this example, merely present some sample pairs.

  (reset.push(1).top, 1) ∈ R.
  (reset.push(4).size, 1) ∈ R.
  (reset.top, error) ∈ R.
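To make the sample pairs concrete, one can write a small interpreter that executes an input history and returns the output of its last operation. The interpreter below is our own illustration of the intended stack semantics, not part of the paper's relational model.

```python
def run(history):
    """Execute an input history (a list of symbols from X) and return the
    output of the last operation, or "error" where the spec demands it."""
    stack, out = [], None
    for op in history:
        if op == "reset":
            stack, out = [], None
        elif isinstance(op, tuple) and op[0] == "push":
            stack.append(op[1]); out = None
        elif op == "top":
            out = stack[-1] if stack else "error"
        elif op == "pop":
            out = stack.pop() if stack else "error"
        elif op == "size":
            out = len(stack)
    return out

# The three sample pairs of the text:
print(run(["reset", ("push", 1), "top"]))    # 1
print(run(["reset", ("push", 4), "size"]))   # 1
print(run(["reset", "top"]))                 # error
```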

We have briefly shown how relations can be used to represent the specification of a wide range of software products. For the sake of readability, we focus our future discussions on homogeneous specifications and simple input/output products, with the knowledge that our results can be readily generalized to other kinds of software products.

2.3 Refinement Ordering

We want to define an ordering relation between specifications to the effect that one specification is harder to satisfy (by candidate implementations) than another. A specification is harder to satisfy than another if it has a larger set of inputs to deal with (in relational terms: a larger domain) and/or if it imposes more conditions on the outputs (in relational terms: smaller image sets); this notion is captured in the following definition.

Definition 1 Specification R is said to refine specification R′ (denoted by R ⊒ R′ or R′ ⊑ R) if and only if

  RL ⊇ R′L ∧ R′L ∩ R ⊆ R′.

By virtue of this definition, the following propositions hold; note that in each instance the relation on the left side can be interpreted to be harder to satisfy than the relation on the right side; the space of these relations is S = real.

- {(s, s′) | s′² = s ∧ s′ ≥ 0} ⊒ {(s, s′) | s′² = s}.
- {(s, s′) | s′² = s} ∪ {(s, s′) | s < 0 ∧ s′ = −1} ⊒ {(s, s′) | s′² = s}.
- {(s, s′) | s′² = s ∧ s′ ≥ 0} ∪ {(s, s′) | s < 0 ∧ s′ = −1} ⊒ {(s, s′) | s′² = s}.
- {(s, s′) | s′² = s} ∪ {(s, s′) | s < 0 ∧ s′ = −1} ⊒ {(s, s′) | s′² = s} ∪ {(s, s′) | s < 0}.

Given that ⊒ is a partial ordering, it is legitimate to ponder the question of whether it has lattice-like properties, i.e. whether any two specifications have a join (least upper bound) and a meet (greatest lower bound). We briefly discuss meets in this section, and leave the discussion of joins to section 4.1. We have the following proposition (due to [2]).

Proposition 1 Any two specifications R and R′ have a meet (greatest lower bound), whose expression is

  R ⊓ R′ = RL ∩ R′L ∩ (R ∪ R′).

More important than the expression of the meet is its concrete interpretation: R ⊓ R′ represents the requirements information that is common to both R and R′; i.e. R ⊓ R′ represents all the requirements that are captured in both R and R′. The meet can be interpreted by means of the following characterization: if we know that under all circumstances, specification A refines R or refines R′, but we are not sure which it refines, then all we can say is A ⊒ (R ⊓ R′). As an example, consider that we want to write the specification of a program that derives the median value of an array; assume that we have decided to derive the median by sorting the array (in an arbitrary order) then returning the element in the middle of the array.
Because the order in which the array is sorted is immaterial, we can choose specification Inc (increasing) or Dec (decreasing); because we do not want to make an arbitrary decision, as we have no basis for favoring one over the other, and wish to leave our future options open, we merely say

  Mon = Inc ⊓ Dec,

where Mon stands for monotonic. If we know that either Inc or Dec is applied but do not know which of the two is applied, then all we know is that the array is monotonic.
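Definition 1 and Proposition 1 can likewise be prototyped on finite relations. Since the paper's examples live on the reals, the check below uses small hand-built relations instead; the discrete "square root" relations are our own stand-ins for the examples above.

```python
def dom(R):
    return {s for (s, _) in R}

def refines(R, Rp):
    """R refines R' iff dom(R) contains dom(R') and R agrees with R' there."""
    return dom(R) >= dom(Rp) and \
        {(s, sp) for (s, sp) in R if s in dom(Rp)} <= Rp

def meet(R, Rp):
    """The meet: RL ∩ R'L ∩ (R ∪ R')."""
    common = dom(R) & dom(Rp)
    return {(s, sp) for (s, sp) in (R | Rp) if s in common}

# Discrete stand-in for the square-root examples: roots of 0, 1 and 4.
Sqrt    = {(0, 0), (1, 1), (1, -1), (4, 2), (4, -2)}   # s'^2 = s
SqrtPos = {(0, 0), (1, 1), (4, 2)}                     # s'^2 = s and s' >= 0
print(refines(SqrtPos, Sqrt))        # True: same domain, fewer allowed outputs
print(refines(Sqrt, SqrtPos))        # False
print(meet(Sqrt, SqrtPos) == Sqrt)   # True: the meet of comparables is the weaker one
```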

3 Fault Management Methods

We discuss how the refinement ordering introduced in the previous section, as well as the associated lattice operator of meet, can be used to model all three families of verification and validation methods.

3.1 Modeling Fault Avoidance

We consider a software product P (say, a program P, for simplicity), and we consider that we have established the (total) correctness of this program with respect to some specification V. We denote by [P] the relation that program P defines on its space, i.e. the set of pairs (s, s′) such that if P starts execution in state s then it terminates in state s′. If P is implemented in a (traditional) deterministic language, then this relation is in fact a function, but whether it is actually a function or not has little impact on our subsequent discussions. Nevertheless, we refer to [P] as the functional abstraction of P; we may confuse a program and its functional abstraction when this raises no ambiguity. By virtue of the definition of [P], we infer that dom([P]) is the set of states s such that if P starts execution in state s then it terminates. To be (totally) correct with respect to specification V, program P must terminate for all input states in the domain of V, and must be partially correct with respect to V. For termination, program P must terminate for all the elements of dom(V); in other words, we must have

  dom([P]) ⊇ dom(V).

For partial correctness, whenever an input state satisfies the precondition (s ∈ dom(V)) and execution of program P terminates in some state s′ ((s, s′) ∈ [P]), then the output s′ must satisfy the postcondition ((s, s′) ∈ V); in other words,

  ∀s, s′ : s ∈ dom(V) ∧ (s, s′) ∈ [P] ⇒ (s, s′) ∈ V.

Using algebraic notations, we can write the first clause as [P]L ⊇ VL, and the second clause as VL ∩ [P] ⊆ V. The conjunction of these two clauses yields that [P] refines V; this provides the proof of the following proposition.

Proposition 2 A program P is correct with respect to a specification V if and only if [P] ⊒ V.

This characterization of correctness is equivalent, modulo superficial differences of notation, to traditional definitions of total correctness [3, 4, 7].
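Proposition 2 can be exercised on a finite input space by tabulating the functional abstraction [P] and checking refinement directly. The program and specification below are toy stand-ins of our own choosing.

```python
def dom(R):
    return {s for (s, _) in R}

def refines(R, Rp):
    return dom(R) >= dom(Rp) and \
        {(s, sp) for (s, sp) in R if s in dom(Rp)} <= Rp

def P(x):
    """Candidate program: absolute value."""
    return -x if x < 0 else x

S = range(-3, 4)
abstraction = {(s, P(s)) for s in S}    # the functional abstraction [P]
# Specification V: the output is the non-negative value whose square is s^2.
V = {(s, sp) for s in S for sp in S if sp >= 0 and sp * sp == s * s}
print(refines(abstraction, V))          # True: P is correct with respect to V
```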
The application of correctness verification produces a statement to the effect that the functional abstraction of P refines some specification V. We attempt to model the other methods in a similar manner, then discuss how to combine these results to produce a general statement about our product.

3.2 Modeling Fault Removal

We consider a program P, and we attempt to capture the result of submitting this program to a set of tests. We distinguish between three goals of software testing:

- Debugging. Under this goal, we submit input data to the program and observe its behavior, in the hope of sensitizing, identifying then removing faults.

- Certification. Under this goal, we submit input data to the program and observe its behavior, in the hope of showing that the program behaves according to its specifications for all instances of the test data. No modification of the product is involved under this goal.

- Reliability estimation. Under this goal [1, 6, 5], we submit input data to the program and observe its behavior; whenever it fails, we identify the fault, remove it, then resume testing. By keeping track of the evolution of inter-failure intervals, we hope to predict the reliability of the product after delivery.

Because we are concerned with verification and validation in this paper, the first goal is of no interest to us, but the second and third goals are. While the second goal produces a logical statement about the program, the third produces a probabilistic statement. For the sake of parsimony, we focus on the second goal in this paper, and briefly discuss our prospects for the third goal in the conclusion (section 5). We consider a program that we want to certify, i.e. to test under the goal of certification. Certification testing proceeds as follows: we are given a test oracle, which we represent by a relation, say Ω, and a set of test data, say A0. For the sake of consistency, we assume that all the test data elements are in the domain of Ω, i.e. A0 ⊆ dom(Ω). For each element of the test data, say s, we run the program P on input s and observe the outcome. We consider that the outcome is successful if and only if: first, the program terminates for input s; second, the final state s′ that it produces satisfies the condition (s, s′) ∈ Ω.
We let A be the set of input data elements for which the test is successful, and we let T be the restriction of Ω to A, i.e. T = A\Ω. By virtue of the first clause in the definition of successful tests, we know that program P terminates for all elements of A, hence the domain of [P] is a superset of A. By definition of T, we find that A is the domain of T, hence we infer [P]L ⊇ TL. Also, by virtue of the second clause in the definition of successful tests, we know that if s is in A and s′ is the output produced by [P] for input s, then (s, s′) ∈ Ω. We write this as:

  ∀s, s′ : s ∈ A ∧ (s, s′) ∈ [P] ⇒ (s, s′) ∈ Ω
⇔ { logical identity }
  ∀s, s′ : s ∈ A ∧ (s, s′) ∈ [P] ⇒ s ∈ A ∧ (s, s′) ∈ Ω
⇔ { definition of restriction }
  ∀s, s′ : s ∈ A ∧ (s, s′) ∈ [P] ⇒ (s, s′) ∈ A\Ω
⇔ { definition of relation T }
  ∀s, s′ : s ∈ A ∧ (s, s′) ∈ [P] ⇒ (s, s′) ∈ T
⇔ { because A ⊆ dom(Ω), dom(A\Ω) = A }
  ∀s, s′ : s ∈ dom(A\Ω) ∧ (s, s′) ∈ [P] ⇒ (s, s′) ∈ T
⇔ { definition of relation T }
  ∀s, s′ : s ∈ dom(T) ∧ (s, s′) ∈ [P] ⇒ (s, s′) ∈ T
⇔ { restriction of a relation to the domain of another }
  ∀s, s′ : (s, s′) ∈ TL ∩ [P] ⇒ (s, s′) ∈ T
⇔ { set theory }
  TL ∩ [P] ⊆ T
⇔ { by virtue of the earlier result: [P]L ⊇ TL }
  [P] ⊒ T.

This constitutes the proof of the following proposition.

Proposition 3 If we run certification testing on program P using oracle Ω and test data A0, where A0 ⊆ dom(Ω), and we let A be the set of all the elements of A0 that produce a successful test, then we can infer [P] ⊒ T, where T is the restriction of Ω to A.

Interestingly, certification testing, just as correctness verification, produces a statement to the effect that the program under consideration refines some specification.
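The certification process of Proposition 3 can be sketched in the same style: run the program on the test data, collect the successful inputs A, build T as the restriction of the oracle to A, and confirm that the abstraction of the program refines T. Oracle, program and fault below are our own toy choices.

```python
def dom(R):
    return {s for (s, _) in R}

def refines(R, Rp):
    return dom(R) >= dom(Rp) and \
        {(s, sp) for (s, sp) in R if s in dom(Rp)} <= Rp

def P(x):
    """Program under test: intended to square its input, faulty for x = 3."""
    return x * 2 if x == 3 else x * x

oracle = {(s, s * s) for s in range(5)}          # the oracle relation
A0 = {0, 1, 2, 3, 4}                             # test data, within dom(oracle)
A = {s for s in A0 if (s, P(s)) in oracle}       # inputs whose test succeeds
T = {(s, sp) for (s, sp) in oracle if s in A}    # restriction of the oracle to A
abstraction = {(s, P(s)) for s in range(5)}      # [P]
print(sorted(A))                 # [0, 1, 2, 4]: the test on x = 3 fails
print(refines(abstraction, T))   # True
```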

3.3 Modeling Fault Tolerance

In order to model fault tolerance, we begin by briefly defining the notions of fault, error and failure. For the purposes of this paper we do not need formal definitions, hence we will present these ideas on the basis of a simple example. We consider the following program on space S = natural:

  P: begin
       read(x);
       x := 2*x;
    0: {label}
       x := x mod 3;
       write(x)
     end;

and we consider the following specification:

  R = {(x, x′) | x′ = x² mod 3}.

Whereas the specification requires that we compute the remainder by 3 of the square of the input value, the program actually computes the remainder by 3 of twice the input value. The fault is the line x := 2*x, which should have been x := x*x instead. For input x0 = 2, the fault causes no error, because 2² = 2·2. The input x0 = 3 sensitizes the fault and causes an error, which is that x equals 6 at label 0 rather than 9; however, because 6 mod 3 = 9 mod 3, this error is masked. The input x0 = 4 will also cause an error (x = 8 rather than x = 16); this error is propagated to the output and causes a failure (the output is 2 rather than 1).

Definition 2 Fault tolerance refers to the set of measures that one can take to avoid failure after faults have caused errors.

One way to achieve fault tolerance [8] is to structure the program as a set of blocks of the form:

  B: begin
       ps := s;   {saving initial state}
       body;      {changing s, preserving ps}
       if not correct(ps, s) then recovery(ps, s)
     end;

In its most general form, predicate correct defines a binary relation, which we denote by C, between the state of the program at the beginning of the block (ps) and the state of the program at the end (s), when condition correct is checked. Likewise, procedure recovery defines a binary relation, namely its functional abstraction, which we denote by R. Whenever block B is executed and predicate correct evaluates to true, we know that B is correct with respect to specification C, which (by virtue of proposition 2) we write as [B] ⊒ C. On the other hand, whenever block B is executed and predicate correct evaluates to false, we know that procedure recovery is invoked with the (retrieved) initial state (ps), whence the effect of the execution of body is overridden by the execution of the recovery routine and we get [B] = R, which we rewrite (for the sake of uniformity) as [B] ⊒ R. Because we do not know in general whether predicate correct will evaluate to true or false, all we can infer in general is that the functional abstraction of B refines either C (if the test of predicate correct is true) or R (if the test of predicate correct is false). By virtue of the discussions of section 2.3, we infer

  [B] ⊒ F, where F = C ⊓ R.

If our program has more than one block, say two for example, that are structured in sequence, then we let F1 and F2 be the relations associated to each block and we infer:

  [P] ⊒ F1 □ F2.

The reason why we use monotonic composition, rather than the traditional relational product, is that the former is monotonic with respect to the refinement ordering, whereas the latter is not. In other words, if [B1] ⊒ F1 and [B2] ⊒ F2, we can infer that [B1; B2] ⊒ (F1 □ F2) but cannot infer that [B1; B2] ⊒ (F1 ◦ F2).
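The recovery block scheme can be sketched operationally as follows; the body, acceptance test and recovery routine are illustrative stand-ins of our own, and the recovery here simply recomputes the intended result from the saved state.

```python
def block(s, body, correct, recovery):
    """One recovery block: save the state, run the body, and fall back on
    the recovery routine if the acceptance test fails."""
    ps = s                      # saving initial state
    s = body(s)                 # changing s, preserving ps
    if not correct(ps, s):
        s = recovery(ps)        # override the effect of body
    return s

# Toy instance: the body should double the state but is faulty for s = 2.
body     = lambda s: 5 if s == 2 else 2 * s
correct  = lambda ps, s: s == 2 * ps     # the relation C
recovery = lambda ps: 2 * ps             # the relation R: recompute

print(block(3, body, correct, recovery))   # 6 (acceptance test passes)
print(block(2, body, correct, recovery))   # 4 (test fails, recovery applied)
```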
If we let F be defined by F = F1 □ F2, we find that the fault tolerance capability that we provide to program P enables us to make the statement [P] ⊒ F. This concludes our claim that, interestingly, all three methods produce statements to the effect that the program refines some relational specification. In the next section we discuss how to use this result in such a way as to break down the verification and validation effort and dispatch it among the three families of methods.

4 Combining Methods

4.1 The Lattice of Refinement

In section 2.3 we had discussed the meet operator that derives from the lattice of the refinement ordering; in this section, we briefly consider the join operator. We have the following proposition (due to [2]).

Proposition 4 Two relations R and R′ have a join (least upper bound) with respect to the refinement ordering if and only if they satisfy the condition (called the consistency condition):

  RL ∩ R′L = (R ∩ R′)L.

When R and R′ do satisfy the consistency condition, their join is given by the following expression:

  R ⊔ R′ = (R ∩ ¬(R′L)) ∪ (R′ ∩ ¬(RL)) ∪ (R ∩ R′).

The consistency condition holds whenever R and R′ add to each other's information (rather than contradicting each other); in [2] we had shown that R and R′ have a least upper bound if and only if they have an upper bound. The join reflects all the requirements information of R (upper bound of R) and all the requirements information of R′ (upper bound of R′) and nothing else (least upper bound). In other words, it performs the addition of specifications R and R′, so to speak. As illustrations of this operator, consider the following examples:

- {(s, s′) | s′² = s} ⊔ {(s, s′) | s ≥ 0 ∧ s′ ≥ 0} = {(s, s′) | s′² = s ∧ s′ ≥ 0},
- {(s, s′) | s′² = s} ⊔ {(s, s′) | s < 0 ∧ s′ = 0} = {(s, s′) | s′² = s} ∪ {(s, s′) | s < 0 ∧ s′ = 0},
- ({(s, s′) | s′² = s ∧ s′ ≥ 0} ∪ {(s, s′) | s < 0}) ⊔ ({(s, s′) | s′² = s} ∪ {(s, s′) | s < 0 ∧ s′ = 0}) = {(s, s′) | s′² = s ∧ s′ ≥ 0} ∪ {(s, s′) | s < 0 ∧ s′ = 0},
- {(s, s′) | s′ = s + 1} and {(s, s′) | s′ = s + 2} have no join,
- Sort = Prm ⊔ Ord,
- Optim = SemPres ⊔ Opt,
- Gcd = DivA ⊔ DivB ⊔ Max,

where

- Sort: the output array is the sorted permutation of the input array.
- Prm: the output array is a permutation of the input array.
- Ord: the output array is ordered.
- Optim: code optimizer.
- SemPres: semantic preservation (input and output have the same semantics).
- Opt: the output is optimized.
- Gcd: greatest common divisor.
- DivA, DivB: the output divides A, B.
- Max: the output is maximal.
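On finite relations, the consistency condition and the join admit a direct transcription; the toy relations below are ours (a discrete analogue of Sort = Prm ⊔ Ord would require more machinery than this sketch).

```python
def dom(R):
    return {s for (s, _) in R}

def consistent(R, Rp):
    """Consistency condition: dom(R) ∩ dom(R') equals dom(R ∩ R')."""
    return dom(R) & dom(Rp) == dom(R & Rp)

def join(R, Rp):
    """The join: each relation on its own domain, their agreement on the overlap."""
    assert consistent(R, Rp), "no join: the relations contradict each other"
    return ({(s, sp) for (s, sp) in R if s not in dom(Rp)}
            | {(s, sp) for (s, sp) in Rp if s not in dom(R)}
            | (R & Rp))

R  = {(0, 0), (1, 1), (1, -1)}
Rp = {(1, 1), (2, 4)}
print(sorted(join(R, Rp)))              # [(0, 0), (1, 1), (2, 4)]
# The successor relations of the text contradict each other on every input:
print(consistent({(0, 1)}, {(0, 2)}))   # False: no join exists
```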
In the sequel we discuss how this join operator can be used to combine verification and validation efforts on a software product.

4.2 Cumulating V&V Results

We consider a software product P on which we have applied three V&V methods: correctness verification with respect to some specification V, producing [P] ⊒ V; certification testing with respect to some specification T (defined by oracle Ω and test data set A), producing [P] ⊒ T; fault tolerance with respect to some specification F (defined by the correctness criterion C and the recovery routine R), producing [P] ⊒ F. By virtue of the discussions of section 4.1, specifications V, T, and F have a join (since they have an upper bound, namely [P]). Furthermore, by virtue of lattice theory, we infer

  [P] ⊒ (V ⊔ T ⊔ F).

What is even more interesting, in practice, is how to divide a given specification into components V, T and F (to which we apply verification, testing, and tolerance methods, respectively) in such a way as to minimize verification effort and/or maximize impact. This is the subject of the next section.
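The cumulation of V&V results can be illustrated end to end on a finite space: three sub-specifications established by different methods are joined, and the program's abstraction is checked against the join. All relations below are toy stand-ins of our own; this join assumes its arguments satisfy the consistency condition.

```python
def dom(R):
    return {s for (s, _) in R}

def refines(R, Rp):
    return dom(R) >= dom(Rp) and \
        {(s, sp) for (s, sp) in R if s in dom(Rp)} <= Rp

def join(R, Rp):
    return ({(s, sp) for (s, sp) in R if s not in dom(Rp)}
            | {(s, sp) for (s, sp) in Rp if s not in dom(R)}
            | (R & Rp))

P_abs = {(s, s * s) for s in range(4)}    # [P]: squaring on {0..3}
V = {(s, s * s) for s in range(2)}        # verified on {0, 1}
T = {(2, 4)}                              # certified by testing on {2}
F = {(s, sp) for s in range(4) for sp in range(10) if sp >= s}  # tolerance: sp >= s
total = join(join(V, T), F)               # V joined with T joined with F
print(refines(P_abs, total))              # True: the three results cumulate
```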

4.3 Characterizing Sub-specifications

In [2] and subsequent works, we have advocated that the join operator can be used as a means to structure complex specifications without violating the generally accepted principle of abstraction (which provides that specifications must not favor specific design choices, leaving these choices at the discretion of the designer). Given a complex set of requirements, one can capture these requirements with arbitrarily partial, arbitrarily weak sub-specifications, say R1, R2, R3, ..., Rn. Then we derive the overall specification as

  R = R1 ⊔ R2 ⊔ R3 ⊔ ... ⊔ Rn.

Given a product (program) P that is supposed to satisfy this specification, we are interested in establishing the property [P] ⊒ R by dividing the set of sub-specifications into three classes: sub-specifications for which correctness verification is appropriate will be factored together to produce

  V = Rf(1) ⊔ Rf(2) ⊔ ... ⊔ Rf(k).

We do the same for specifications T and F. In this section, we discuss how to decide, for each term Ri, whether to factor it into V, T or F.

- Correctness Verification Terms. Specification terms that are typically good candidates for correctness verification methods are represented by equivalence relations between the current state and the initial state of the program, and reflect information that the program seeks to preserve as it is proceeding towards a solution. The equivalence property is important because it means that the notoriously difficult inductive steps of program verification become trivial: the loop invariant required to prove correctness with respect to an equivalence relation is the equivalence relation itself; hence no creativity is required to carry out the proof [9]. Furthermore, it is much easier to prove preservation of a conservative property that the program is maintaining while proceeding towards termination than it is to prove complex properties about what the program is trying to establish. The following examples, although very superficial, illustrate our idea.

P = insertionsort.

R = Prm ⊔ Ord.

We recommend taking V = Prm, since Prm is an equivalence relation between the initial state of the array and the current state, and reflects what property the program tries to preserve as it sorts the array. Note that the loop invariant required to prove that the program satisfies Prm is Prm itself (no inductive argument is involved, because Prm is reflexive and transitive). Note also that it is in fact very easy to prove the preservation of Prm: whenever the array is modified, it is modified by merely swapping two cells, hence preserving Prm; end of proof. As another (even more trivial) example, consider the following factorial program. P = begin k:=1; f:=1; while k
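The insertionsort argument can be checked mechanically: every modification of the array is a swap of two cells, so the multiset of elements (the property Prm, an equivalence between initial and current state) is preserved at every step. The assertion below is our own instrumentation of a standard insertion sort.

```python
from collections import Counter

def insertion_sort(a):
    """Sort a list; assert after every swap that Prm (same multiset of
    elements as the initial state) is preserved."""
    a = list(a)
    initial = Counter(a)                       # multiset of the initial state
    for i in range(1, len(a)):
        j = i
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]    # swap two cells: preserves Prm
            assert Counter(a) == initial       # Prm is its own invariant
            j -= 1
    return a

print(insertion_sort([3, 1, 2]))   # [1, 2, 3]
```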
