Verification of Fault-Tolerance and Real-Time

Zhiming Liu
Department of Maths & Computer Science
University of Leicester
Leicester LE1 7RH, U.K.
[email protected]

Mathai Joseph
Department of Computer Science
University of Warwick
Coventry CV4 7AL, U.K.
[email protected]
Abstract A transformational method is given for specifying and verifying fault-tolerant, real-time programs. Such a program needs to be provably correct according to both its functional and real-time requirements, despite the possible occurrence of system failures. The paper demonstrates that a suitably expressive logic for real-time systems makes it possible to naturally model the state changes caused by system failures and determine their effect on the functional and real-time properties of executions. Keywords: fault-tolerance, fault-tolerant refinement, real-time, specification, transformation.
1. Introduction

Consider the development of a concurrent program by a sequence of refinement steps, from a requirement specification to an executable implementation. Each step in such a development constructs a lower-level specification Pl from a higher-level specification Ph, and it must be provable that Pl refines Ph, denoted Pl ⊑ Ph. The refinement relation ⊑ is reflexive (P ⊑ P for any specification P) and transitive (if Pll ⊑ Pl and Pl ⊑ Ph, then Pll ⊑ Ph). Different frameworks and refinement calculi have been used to prove that one specification refines another [AL88, Bac89]. These refinement steps assume that the implementation Pl is executed on a fault-free system, i.e. each operation in Pl is executed correctly, according to its defined semantics. However, concurrent programs may be run on systems which exhibit various failures, referred to here as physical faults, as processors may fail, channels may lose messages, and memory may be corrupted. Thus, if the execution of Pl suffers the effects of such faults, its behaviour may not satisfy Ph even though Pl ⊑ Ph can be proved. This paper presents a transformational framework in which it is assumed that the physical faults of a system are
modelled as being caused by a set F of 'fault operations' which perform state transformations in the same way as the ordinary program operations. The effect of the faults F on the execution of a program P can be modelled by a transformation F of P into an 'F-affected' version F(P, F). In an atomic action model, the F-affected version F(P, F) is simply the union of the operations of P and F, which are executed in an interleaved manner. Fault-tolerance is achieved if a specification Pl is an F-tolerant implementation of a specification (or program) Ph, i.e. the F-affected version F(Pl, F) refines Ph. Fault-tolerant systems often also have real-time constraints. So it is important that the timing properties of a program are refined along with the fault-tolerant and functional properties of a program specification. This paper extends the transformational approach by adding time bounds to actions. This allows fault-tolerant redundancy actions (such as checkpointing and recovery actions) to be specified with time constraints. The method presented in the paper is based on the following observations. A requirement specification Spec for a program P is generally refined into a canonical specification which can be encoded as an action system Ph. Analysis of the possible failures in the system (perhaps in consultation with system engineers) provides the specification F of the fault-environment. Ph is then refined into an F-tolerant implementation Pl. For real-time fault-tolerant programs, timing assumptions as well as the functional effects of faults have to be specified in order to verify fault-tolerant refinements with real-time constraints. Redundancy operations for fault-tolerance have to be time-constrained to meet the original deadlines given in the high-level specification. In general, the redundancy operations are required to satisfy the condition that, after a fault occurs, the system recovers to an error-free and consistent state within a time bound which takes account of the possible wasted computation caused by the fault and subsequent recovery. It is clear that fault-tolerance in real-time systems can only be achieved if reasonable timing assumptions are made about the occurrence
of faults. Therefore, it should be possible to specify a variety of such assumptions. Further, for a given high-level specification, either more powerful (or faster) machines may be needed, or the original deadlines in the high-level specification may need to be relatively loose, to allow fault-tolerant implementations. We shall use the example of a processor-memory interface to illustrate these points.
2. Program Specification and Refinement

This section introduces a computational model and a specification language that will be used in this paper.

2.1. Action system models

A program is represented as an action system which is a pair P = (I, O) consisting of an initial condition I and a set O of atomic operations on a finite set v of state variables. A state s of program P is a mapping from v to an associated set of values (i.e. the value domain) D. We use s[x], for x ∈ v, to denote the value of x in state s. The initial condition I is a first-order predicate with free variables in v; it defines the permitted initial states from which execution of P can start. Each atomic operation τ ∈ O is represented by a guarded nondeterministic multiple assignment

    g → x := x'.Q

where x is a vector of state variables in v, x' is a vector of logical variables, and Q is a first-order predicate over state variables in x and logical variables. When the guard g of τ is true (i.e. τ is enabled) in a state s, τ can be executed and will change state s into a state t such that the variables in x are assigned the values x' and all other variables remain unchanged,

    t[y] = x'(i)   if y is the ith element x(i) of x
           s[y]    otherwise

and the predicate Q is satisfied by t.

An atomic operation may introduce infinite nondeterminism. This expressive power is needed to represent a fault by an atomic operation which may damage the program state by setting a variable to an incorrect value. A deterministic assignment g → v := e is a special case of a nondeterministic assignment. We omit the guard g of an operation when it is the constant true. Because of the atomicity of operations, a behaviour of program P can be represented as an infinite sequence σ = s0, s1, s2, ... of states, where s0 is the initial state which satisfies the initial condition I, and each si+1, i ≥ 0, is obtained from si by either a stuttering step, i.e. si+1 = si, or by executing an operation of program P. The set of all the behaviours of a program is stuttering closed: if an infinite state sequence σ is a behaviour of the program, then so is any behaviour obtained from σ by adding or deleting a finite number of stuttering steps. A behaviour is terminating if it has an infinite sequence of stuttering steps.

Example Consider a simple processor-memory interface. The processor issues read and write operations that are executed by the memory. Such an interface consists of two registers, represented by the following state variables:

op: Set by the processor to indicate the desired operation, and reset by the memory after executing the operation. Its value space is {rdy, r, w}, for ready, read and write, respectively.

val: Set by the processor to indicate the value to be written by a write, and set by the memory to return the result of a read; its value space is the integers Z.

To describe the interface as an action system, we introduce an (internal) variable d with value space Z to denote the contents (data) of the memory. Let

    I   = (op = rdy)
    Rp  = (op = rdy) → op := r
    Wp  = (op = rdy) → (op, val) := (u, v).Q
    Q   = (u = w) ∧ (v ∈ Z)
    Rm  = (op = r) → (op, val) := (rdy, d)
    Wm  = (op = w) → (op, d) := (rdy, val)
    O   = {Rp, Wp, Rm, Wm}

The interface can then be described as the program

    P = (I, O)   □
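Looking ahead to the TLA encoding of Sections 2.2-2.3, a guarded nondeterministic assignment reads almost literally as an action over primed and unprimed variables. The fragment below is a sketch of our own (not part of the original development) of the write-request operation Wp; the module name and the finite set Val, which stands in for Z so that the module can be explored with TLC, are our assumptions.

---- MODULE GuardedAssignment ----
\* Sketch (ours): the guarded nondeterministic assignment
\*   Wp = (op = rdy) -> (op, val) := (u, v).((u = w) /\ (v in Z))
\* read as a TLA+ action.  Val is a finite stand-in for Z.
EXTENDS Integers
VARIABLES op, val
vars == <<op, val>>
Val  == 0..3

Init == op = "rdy" /\ val = 0

Wp == /\ op = "rdy"                      \* the guard g
      /\ \E u \in {"w"}, v \in Val :     \* choose values satisfying Q
            op' = u /\ val' = v          \* the multiple assignment (op, val) := (u, v)

Spec == Init /\ [][Wp]_vars              \* closed under stuttering, as in Section 2.2
====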
2.2. Reasoning about action systems in TLA

We use TLA [Lam90] to reason formally about the properties of an action system. Models of TLA are defined as infinite sequences of states, σ = s0, s1, ..., where each si is a state over a set of state variables. A property of program P is represented by a formula φ which defines a set of behaviours of P. An action formula A is a boolean-valued expression which specifies the relation between the values of variables before the execution of an operation and the values of 'primed' variables after the execution of the operation. It is interpreted over pairs of states: ⟨s, s'⟩ ⊨ A holds when the variables v in A are assigned values according to state s, and the primed variables v' = {x' | x ∈ v} in A are assigned values according to state s'. ⟨s, s'⟩ is called an A-step if ⟨s, s'⟩ ⊨ A. Formulas in TLA are constructed from action formulas in the following way. An action formula A is a formula. A is satisfied by a sequence of states σ = s0, s1, s2, ... of program P, denoted σ ⊨ A, if ⟨s0, s1⟩ ⊨ A. A first-order state predicate p is a particular action formula which does not have primed variables; σ satisfies p if the initial state s0 of σ satisfies p, i.e. s0 ⊨ p. If φ is a formula, so is □φ (always φ). σ satisfies □φ if all the suffixes of σ satisfy φ. The formula ◇φ (eventually φ) is defined as ¬□¬φ, and this is satisfied by σ if there is a suffix of σ which satisfies φ. Formulas can be composed using the first-order connectives ¬, ∧, ∨ and ⇒ with their standard semantics. Quantification (i.e. ∃x.φ, ∀x.φ) is possible over logical variables, whose values are fixed over states, and over state variables, whose values can change from state to state. To specify stuttering steps in the execution of a program, we define the following notation: for an action formula A and a vector u of variables,
    [A]_u = A ∨ unch(u),   where unch(u) = ∧_{x∈u} (x' = x).

2.3. Transforming action systems to TLA formulas

The relationship between action systems and TLA can be defined in a relatively straightforward way. A syntactic operation τ : g → x := x'.Q in an action system P corresponds to an equivalence class of action formulas. For each action formula A in this class and each pair ⟨s, s'⟩ of states over the state variables v of P, execution of τ in state s terminates in state s' iff ⟨s, s'⟩ is an A-step. The operation τ can be translated into an action formula ?(τ) which is a representative of the equivalence class of action formulas corresponding to τ:

    ?(τ) = g ∧ Q' ∧ unch(v \ x)

where Q' is obtained from Q by replacing each logical variable x'(i) by x(i)', the primed counterpart of the state variable x(i), and v \ x is the complement of the variables x in the set v of program variables of P. Action τ is enabled only when g holds. When τ is enabled and executed, the program state is changed so that the new values of the variables x are related to the old values by Q', and the values of any variable outside x remain unchanged. In the examples in this paper, we will omit the unch(v \ x) part when we specify operations.

Given an action system (program) P = (I, O), let

    N_P = ∨_{τ∈O} ?(τ)

This is the state-transition relation for the atomic operations of P. The exact specification of P is expressed by the formula

    Φ(P) = I ∧ □[N_P]_v

This formula specifies all the possible sequences of values that may be taken by the state variables, including the internal variables. Using existential quantification to hide the internal variables, x = (x1, ..., xn) ⊆ v, the canonical (safety) specification of P is

    Ψ(P) = ∃x : Φ(P)

A behaviour satisfies Ψ(P) iff there is a sequence of values that can be assigned to the xi, i = 1, ..., n, such that the behaviour so obtained satisfies Φ(P).

Example (continued) For the processor-memory interface, we have

    ?(Rp) = (op = rdy) ∧ (op' = r)
    ?(Wp) = (op = rdy) ∧ (op' = w) ∧ (val' ∈ Z)
    ?(Rm) = (op = r) ∧ (op' = rdy) ∧ (val' = d)
    ?(Wm) = (op = w) ∧ (op' = rdy) ∧ (d' = val)
    N_P   = ?(Rp) ∨ ?(Wp) ∨ ?(Rm) ∨ ?(Wm)
    Φ(P)  = (op = rdy) ∧ □[N_P]_v

where v = {op, val, d}. From the user's point of view, d in this specification is an internal variable. Thus, we have

    Ψ(P) = ∃d : (op = rdy) ∧ □[N_P]_v   □

Formulas Φ(P) and Ψ(P) are safety properties, i.e. they are satisfied by an infinite behaviour iff they are satisfied by every finite initial prefix of the behaviour. Safety properties allow behaviours in which a system performs correctly for a while and then leaves the values of all variables unchanged. Such behaviours are undesirable in distributed systems and they can be ruled out by adding fairness properties. For simplicity, we only consider safety properties in this paper. For the treatment of liveness properties of fault-tolerant systems, we refer the reader to [LJ96].
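For readers who want to machine-check the example, the canonical form Φ(P) maps directly onto a TLA+ specification. The module below is a sketch under our own conventions: Val is a small finite set standing in for Z so that TLC can enumerate states, the initial predicate fixes values for val and d (the paper's I constrains only op), and d is left visible, since hiding it as in Ψ(P) requires temporal quantification, which TLC does not support.

---- MODULE Interface ----
\* Sketch (ours) of the canonical specification Phi(P) of the processor-memory
\* interface.  Val is a finite stand-in for Z so that TLC can enumerate states;
\* the internal variable d is left visible (hiding it gives Psi(P)).
EXTENDS Integers
VARIABLES op, val, d
vars == <<op, val, d>>
Val  == 0..3

Init == /\ op = "rdy"                 \* the paper's initial condition I
        /\ val \in Val /\ d \in Val   \* added so that TLC has complete initial states

Rp == op = "rdy" /\ op' = "r"   /\ UNCHANGED <<val, d>>
Wp == op = "rdy" /\ op' = "w"   /\ val' \in Val /\ UNCHANGED d
Rm == op = "r"   /\ op' = "rdy" /\ val' = d     /\ UNCHANGED d
Wm == op = "w"   /\ op' = "rdy" /\ d' = val     /\ UNCHANGED val

Next == Rp \/ Wp \/ Rm \/ Wm          \* the disjunction N_P
Spec == Init /\ [][Next]_vars         \* I /\ [][N_P]_v
====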
2.4. Verification and Refinement

Given an action system P with specification Ψ(P), let a property of P be described by the formula φ. To prove that P satisfies property φ is to prove in TLA the validity of the implication Ψ(P) ⇒ φ. The relation P1 ⊑ P2 between programs P1 and P2 characterises refinement, i.e. that program P1 correctly implements P2. Let

    Ψ(P1) = ∃x : I1 ∧ □[N_P1]_v1   and   Ψ(P2) = ∃y : I2 ∧ □[N_P2]_v2

be canonical specifications of P1 and P2 respectively, where

    x = {x1, ..., xn}   and   y = {y1, ..., ym}

Then the refinement relation is formalised as

    P1 ⊑ P2   iff   Ψ(P1) ⇒ Ψ(P2)

The refinement relation P1 ⊑ P2 can be proved as the implication Ψ(P1) ⇒ Ψ(P2). For this, we must define state functions y~1, ..., y~m in terms of the variables v1 and prove the implication Φ(P1) ⇒ Φ~(P2), where Φ~(P2) is obtained from Φ(P2) by substituting y~i for all free occurrences of yi in Φ(P2), for i = 1, ..., m. The collection of state functions y~1, ..., y~m is called a refinement mapping.¹

[¹ The validity of the implication Ψ(P1) ⇒ Ψ(P2) does not guarantee the existence of a refinement mapping; in general, refinement mappings can be found if a specification is modified by adding dummy variables [AL88].]

3. The Transformational Framework For Fault-Tolerance

We now give a brief summary of the transformational approach to the formal treatment of untimed fault-tolerant programs.

3.1. Faults and their effects

Let P = (I, O) be a program with the safety specification Ψ(P) = ∃x : I ∧ □[N_P]_v. Informally, a physical fault in the execution of P causes a transition from a valid state of P into an error state. Continuing the execution of P from such an error state may lead to a failure state which violates the specification of P. Thus, in general, a physical fault can be modelled as the effect of an atomic fault-operation; this can be translated to an action formula by ? (defined in Section 2.3). For example, a malicious fault may set the variables of P to arbitrary values, a crash in a processor may cause variables to become unavailable, and a fault may cause the loss of a message from a channel. Physical faults can be described by a set of atomic operations F, called a fault-environment, which interfere with the execution of P by possibly changing the values of variables in v. F can be specified by the action formula N_F, which is the disjunction of the action formulas ?(τ) for all τ ∈ F.

Executing P = (I, O) on a system with a fault-environment F is simulated by interleaving the execution of the operations of P and F. Therefore, interference by F on the execution of P can be defined as a transformation F:

    F(P, F) = (I, O ∪ F)

The internal and external behaviours of P being executed on a system with faults F are then specified respectively by

    Φ(F(P, F)) = I ∧ □[N_P ∨ N_F]_v   and   Ψ(F(P, F)) = ∃x : I ∧ □[N_P ∨ N_F]_v

The fault-prone properties of P under F can be reasoned about in terms of the properties of F(P, F). We call F(P, F) the F-affected version of P, and a behaviour of F(P, F) an F-affected behaviour of P.

Example (continued) For the processor-memory interface, assume that the memory is faulty and that its value may be corrupted. Such a fault can be represented by the atomic operation

    fault = d := d'.(d' ≠ d)

whose action formula ?(fault) is (d' ≠ d). The fault-environment F thus contains the single action fault. The F-affected version of P is then

    F(P, F) = ((op = rdy), {Rp, Wp, Rm, Wm, fault})

Thus,

    N_F        = ?(fault)
    N_F(P,F)   = N_P ∨ ?(fault)
    Φ(F(P, F)) = (op = rdy) ∧ □[N_F(P,F)]_v
    Ψ(F(P, F)) = ∃d : Φ(F(P, F))   □
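Continuing the TLA+ sketch, the transformation F(P, F) is literally the addition of one more action to the next-state relation. Assuming the Interface module given after Section 2.3, the following module (ours) renders the F-affected version of the example.

---- MODULE FaultyInterface ----
\* Sketch (ours) of the F-affected version F(P, F): the fault that corrupts the
\* memory contents is one more action, interleaved with the program's operations.
\* Reuses the Interface module given after Section 2.3.
EXTENDS Interface

Fault == (d' \in Val \ {d}) /\ UNCHANGED <<op, val>>   \* ?(fault): d' # d

FNext == Next \/ Fault               \* N_P \/ N_F
FSpec == Init /\ [][FNext]_vars      \* Phi(F(P, F)) = I /\ [][N_P \/ N_F]_v
====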
3.2. Fault-tolerance

Given a program P = (I, O) with the specification

    Ψ(P) = ∃x : (I ∧ □[N_P]_v)

and a property described by a formula φ, P is said to implement (or satisfy) φ if Ψ(P) ⇒ φ. Such an implementation assumes that the hardware is fault-free, i.e. that each operation of P is executed according to its semantics. However, if P is run on a system subject to hardware faults, errors may occur in the program execution which lead to failures that violate the specification Ψ(P). That is, the F-affected behaviours may not satisfy φ, and F(P, F) is then not an implementation of φ.
For P to tolerate the faults F, correcting operations must be carried out to prevent an error from leading to a failure. P is called an F-tolerant implementation of φ if F(P, F) is an implementation of φ:

    Ψ(F(P, F)) ⇒ φ

This means that the behaviours of P comply with the specification φ despite the presence of faults F. When such a property φ is a canonical specification of a program Ph,

    Ψ(Ph) = ∃y : Ih ∧ □[N_Ph]_u

Pl is an F-tolerant refinement of Ph, denoted Pl ⊑_F Ph, if Pl is an F-tolerant implementation of Ψ(Ph). An F-tolerant refinement relation ⊑_F is stronger than the ordinary refinement relation: i.e. if Pl is an F-tolerant refinement of Ph, then Pl is a refinement of Ph, but the converse is not necessarily true. An F-tolerant refinement is generally not reflexive. However, it is transitive (even a bit stronger): if Pll ⊑_F Pl and Pl ⊑_F1 Ph, then Pll ⊑_F Ph. This allows stepwise development of a fault-tolerant program. Faults (or their representations) considered at a lower-level step in the development may differ from those in the previous step. Furthermore, the fault-tolerant refinements are fault-monotonic [Jan95]: if F ⊆ F1 (or N_F ⇒ N_F1) and Pl ⊑_F1 Ph, then Pl ⊑_F Ph. This means that a program which tolerates a set of faults also tolerates any subset of these faults. Realistic modelling often requires specification of both the fault state transitions F and a behavioural fault assumption B_F about the local and global properties of F [Nor92], such as the frequency and the minimum separation of faults. It follows then that the F-tolerant refinement of Ph by Pl should be proved under the condition B_F:

    Ψ(F(Pl, F)) ∧ B_F ⇒ Ψ(Ph)

Such a behavioural fault assumption B_F is in general a safety property which prevents certain fault transitions from taking place. Thus the property □[N_Pl ∨ N_F]_vl ∧ B_F is also a safety property and hence can be transformed into a formula of the form □[N_Pl ∨ N_F1]_vl. This shows that the specification Φ(F(Pl, F)) ∧ B_F equals Φ(F(Pl, F1)) for some F1, and therefore B_F can be encoded into the model of fault operations. The separation of fault transitions and behavioural fault assumptions usually makes it easier to specify the F-affected behaviours of Pl. To prove that Pl is an F-tolerant refinement of Ph, however, it is helpful to use a canonical form ∃x : Il ∧ □[N]_vl so that the proof rule for refinement given in [AL88, AL90] can be directly used.

Example (continued) Assume we must prove that the fault-free memory interface P can be implemented using three faulty memories whose values may be corrupted, under the condition that at any time at most one can be corrupted. Let d1, d2 and d3 denote the current data of the three memories respectively. Each memory is faulty and its value di may be corrupted by faulti, i = 1, 2, 3. Let fi be a variable with value space {0, 1} and assume that di has been corrupted if fi = 1. Each faulti can be specified as follows:

    ?(faulti) = (di' ≠ di) ∧ (fi' = 1)
    F         = {fault1, fault2, fault3}
    N_F       = ?(fault1) ∨ ?(fault2) ∨ ?(fault3)
    B_F       = □(f1 + f2 + f3 ≤ 1)

The specification of a program Pl which is an F-tolerant refinement of the interface P can be obtained by specifying its variables, initial condition, operations and fairness conditions. The variables and initial condition are as follows.

    vl = {op, val, d1, d2, d3, f1, f2, f3}
    Il = (op = rdy) ∧ (f1 = f2 = f3 = 0) ∧ (d1 = d2 = d3)

To specify the operations of Pl, we first define the following auxiliary function:

    vote(x, y, z) = x   if x = y or x = z
                    y   if x ≠ y and x ≠ z

The specification of the rest of the program is:

    ?(Rpl) = (op = rdy) ∧ (op' = r)
    ?(Wpl) = (op = rdy) ∧ (op' = w) ∧ (val' ∈ Z)
    ?(Rml) = (op = r) ∧ (op' = rdy) ∧ (val' = vote(d1, d2, d3))
    ?(Wml) = (op = w) ∧ (op' = rdy) ∧ (∧_{i=1..3} (di' = val)) ∧ (∧_{i=1..3} (fi' = 0))
    N_Pl   = ?(Rpl) ∨ ?(Wpl) ∨ ?(Rml) ∨ ?(Wml)
    Φ(Pl)  = Il ∧ □[N_Pl]_vl
    Ψ(Pl)  = ∃d1, d2, d3, f1, f2, f3 : Φ(Pl)

Let Pl be defined as

    Pl = (Il, {Rpl, Wpl, Rml, Wml})

The action formulas of Rpl, Wpl, Rml and Wml are respectively ?(Rpl), ?(Wpl), ?(Rml) and ?(Wml). Notice that

    Ψ(F(Pl, F)) = ∃d1, d2, d3, f1, f2, f3 : Il ∧ □[N_Pl ∨ N_F]_vl

To prove that Pl is an F-tolerant refinement of the fault-free P under the condition B_F, we first calculate the formula □[N_Pl ∨ N_F]_vl ∧ B_F into □[N_Pl ∨ N_F1]_vl, where F1 = {fault1_i : i = 1, 2, 3} and fault1_i is the action

    (f_{i⊕1} = 0 ∧ f_{i⊕2} = 0) → (di, fi) := (d, f).((d ≠ di) ∧ (f = 1))

and ⊕ is + modulo 3. Define the following mappings from vl to v:

    op~  = op
    val~ = val
    d~   = vote(d1, d2, d3)

Using the proof rules of TLA, we can then easily prove the implication Ψ(F(Pl, F1)) ⇒ Ψ(P), which shows that Pl is an F-tolerant refinement of P under the assumption B_F.   □
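This whole verification step can also be sketched in TLA+. The module below (ours) encodes Pl with the three memory copies, the voting function, and the guarded fault actions of F1, which build the behavioural assumption B_F into the fault model as described above; the refinement mapping d~ = vote(d1, d2, d3) appears as an INSTANCE substitution, and the final THEOREM corresponds to the implication proved in the example (with the internal variables still visible and the mapping substituted). TLC can check it as a refinement property on a small model.

---- MODULE TMRInterface ----
\* Sketch (ours) of the F-tolerant implementation Pl: three memory copies with
\* majority voting, plus the guarded fault actions of F1 (which encode B_F).
EXTENDS Integers
VARIABLES op, val, d1, d2, d3, f1, f2, f3
vars == <<op, val, d1, d2, d3, f1, f2, f3>>
Val  == 0..3

vote(x, y, z) == IF x = y \/ x = z THEN x ELSE y

Init == /\ op = "rdy" /\ val \in Val
        /\ d1 \in Val /\ d1 = d2 /\ d2 = d3
        /\ f1 = 0 /\ f2 = 0 /\ f3 = 0

Rpl == /\ op = "rdy" /\ op' = "r"
       /\ UNCHANGED <<val, d1, d2, d3, f1, f2, f3>>
Wpl == /\ op = "rdy" /\ op' = "w" /\ val' \in Val
       /\ UNCHANGED <<d1, d2, d3, f1, f2, f3>>
Rml == /\ op = "r" /\ op' = "rdy" /\ val' = vote(d1, d2, d3)
       /\ UNCHANGED <<d1, d2, d3, f1, f2, f3>>
Wml == /\ op = "w" /\ op' = "rdy"
       /\ d1' = val /\ d2' = val /\ d3' = val
       /\ f1' = 0 /\ f2' = 0 /\ f3' = 0 /\ UNCHANGED val

\* fault1_i: memory i may be corrupted only while the other two are uncorrupted
Fault1 == /\ f2 = 0 /\ f3 = 0
          /\ d1' \in Val \ {d1} /\ f1' = 1
          /\ UNCHANGED <<op, val, d2, d3, f2, f3>>
Fault2 == /\ f1 = 0 /\ f3 = 0
          /\ d2' \in Val \ {d2} /\ f2' = 1
          /\ UNCHANGED <<op, val, d1, d3, f1, f3>>
Fault3 == /\ f1 = 0 /\ f2 = 0
          /\ d3' \in Val \ {d3} /\ f3' = 1
          /\ UNCHANGED <<op, val, d1, d2, f1, f2>>

Next  == Rpl \/ Wpl \/ Rml \/ Wml
FNext == Next \/ Fault1 \/ Fault2 \/ Fault3
FSpec == Init /\ [][FNext]_vars

\* Refinement mapping: the abstract memory d is represented by vote(d1, d2, d3).
Abs == INSTANCE Interface WITH d <- vote(d1, d2, d3)

THEOREM FSpec => Abs!Spec   \* Pl F-tolerantly refines P, with B_F encoded as F1
====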
4. Real-Time and Fault-Tolerance

The framework developed so far can be extended to deal with both real-time and fault-tolerant properties. For this, the semantic model must be extended to allow specification and refinement of real-time properties.
4.1. Specification and refinement of real-time programs
Deadlines introduce a set of timing constraints over a program, requiring its operations to be executed neither too early nor too late. Let time be represented by the set R+ ∪ {∞}, where R+ is the set of the non-negative real numbers. To describe timing constraints, each atomic operation τ of P is given a lower time bound L(τ) and an upper time bound U(τ), both in R+ ∪ {∞}, such that L(τ) ≤ U(τ):

lower bound condition: once an operation τ is enabled, it can be performed only if it has been continuously enabled for at least L(τ) time units;

upper bound condition: an operation τ must not be continuously enabled for more than U(τ) time units without being performed.

Thus, a real-time program can be represented as a triple P^T = ⟨P, L, U⟩, where P is an 'untimed' program, as defined in the previous section, and L and U are functions from the atomic operations of P to a time domain such that L(τ) ≤ U(τ) for any operation τ of P. A semantic model of real-time programs can be found in [HMP91].

Example (continued) Assume now that the processor issues write and read operations alternately and periodically to memory, starting with a write operation. After an operation has been issued, assume that the memory takes no more than u time units to complete its execution. To ensure that a message written is read only once and is not overwritten before being read, let the period δ of issuing an operation by the processor be greater than u. The real-time program P^T = ⟨P, L, U⟩ is defined as follows.

    I  = (next, op) = (w, rdy)                          { ready to issue write }
    Rp = (next, op) = (r, rdy) → (next, op) := (w, r)
    Wp = (next = w) → (op, val, next) := (w, v, r).(v ∈ Z)
    Rm = (op = r) → (op, val) := (rdy, d)
    Wm = (op = w) → (op, d) := (rdy, val)
    O  = {Rp, Wp, Rm, Wm}
    L(Rp) = U(Rp) = L(Wp) = U(Wp) = δ
    L(Rm) = L(Wm) = 0
    U(Rm) = U(Wm) = u

The timed interface is then the program

    P^T = ⟨(I, O), L, U⟩   □

As for untimed programs, we shall require an exact specification Φ(P^T) of a real-time program P^T. We first introduce into each program a distinguished state variable now to represent time, and an operation to advance time, both constrained by the following assumptions:

time starts at 0: initially now = 0.

time never decreases: □[now' ∈ (now, ∞)]_now.

time diverges²: ∀t ∈ R+ . ◇(now > t).

These three assumptions can be combined to specify real time:

    RT = (now = 0) ∧ □[now' ∈ (now, ∞)]_now ∧ ∀t ∈ R+ . ◇(now > t)

[² Time divergence is also called the No-Zeno property: this ensures that only a finite number of operations can be performed in any finite interval of time.]

We further assume that the program state and time do not change simultaneously, and the program state can be changed only by program operations. Φ(P) can be rewritten as

    Φ(P) = I ∧ □[N_P ∧ (now' = now)]_v

Then the conjunction Φ(P) ∧ RT specifies the interleaving of program and time-advancing operations. The program operations need to be further constrained by the lower bound and the upper bound conditions. To specify the lower bound L(τ) for each operation τ, introduce a timer t_τ which is an auxiliary state variable. When τ is enabled after a state in which it was disabled, or when τ is executed, t_τ is assigned a clock time which is the current time now plus L(τ) units of time:

    timer(t_τ, L(τ), v) = ((g ∧ t_τ = L(τ)) ∨ (¬g ∧ t_τ = ∞))
                          ∧ □[ (   g' ∧ (?(τ) ∨ ¬g) ∧ t_τ' = now' + L(τ)
                                 ∨ g ∧ g' ∧ ¬?(τ) ∧ t_τ' = t_τ
                                 ∨ ¬g' ∧ t_τ' = ∞ )
                               ∧ (v' ≠ v) ∧ (now' = now) ]_(t_τ, v)

Operation τ cannot be executed until now reaches t_τ:

    mintime(t_τ, v) = □[?(τ) ⇒ (t_τ ≤ now)]_v

The conjunction timer(t_τ, L(τ), v) ∧ mintime(t_τ, v) specifies the lower bound for operation τ, and the lower bound condition for the program is the conjunction LB(P^T) of the lower bound conditions for all its operations. Similarly, for the upper bound U(τ) for operation τ, a timer T_τ is defined by timer(T_τ, U(τ), v), obtained by substitution in timer(t_τ, L(τ), v). Operation τ must be executed before now exceeds the clock time T_τ:

    maxtime(T_τ) = □[now' ≤ T_τ]_now

The conjunction timer(T_τ, U(τ), v) ∧ maxtime(T_τ) is the upper bound specification for each operation τ, and the conjunction of the upper bound specifications for all the operations of P^T is its upper bound condition UB(P^T). The time bound specification B(P^T) for P^T is the conjunction LB(P^T) ∧ UB(P^T). The real-time executions ⟦P^T⟧ are exactly specified by

    Φ(P^T) = Φ(P) ∧ RT ∧ B(P^T)

Hiding the internal variables x and the auxiliary timers, timers(P^T), gives the canonical specification of P^T:

    Ψ(P^T) = ∃x, timers(P^T) : Φ(P^T)

To prove that the program P^T satisfies (or implements) a timing property such as

    φ ⇝_δ ψ = ∀t . □(φ ∧ now = t ⇒ ◇(ψ ∧ now ≤ t + δ))

is to prove the implication Ψ(P^T) ⇒ (φ ⇝_δ ψ). Further, the refinement relation P1^T ⊑ P2^T between programs P1^T and P2^T is proved as the implication Ψ(P1^T) ⇒ Ψ(P2^T) using a refinement mapping.

Example (continued) For the timed processor-memory interface, let

    v   = {next, op, d, val}
    I   = (next = w) ∧ (op = rdy)
    N_P = ?(Rp) ∨ ?(Wp) ∨ ?(Rm) ∨ ?(Wm)

The action formulas for the operations in P^T can be obtained as before. The timing constraints are specified as follows:

    LB = timer(t_Rp, δ, v) ∧ timer(t_Wp, δ, v) ∧ mintime(t_Rp, v) ∧ mintime(t_Wp, v)
    UB = timer(T_Rp, δ, v) ∧ timer(T_Wp, δ, v) ∧ maxtime(T_Rp) ∧ maxtime(T_Wp)
         ∧ timer(T_Rm, u, v) ∧ timer(T_Wm, u, v) ∧ maxtime(T_Rm) ∧ maxtime(T_Wm)

Then

    Φ(P^T) = (next = w) ∧ (op = rdy) ∧ □[N_P ∧ (now' = now)]_v ∧ RT ∧ LB ∧ UB
    Ψ(P^T) = ∃(t_Rp, t_Wp, T_Rp, T_Wp, T_Rm, T_Wm, d) : Φ(P^T)

It can be proved that the interface P^T satisfies the property

    (op = r ∧ d = v) ⇝_u (val = v)

that is, the required value from memory will be output within u units of time of the processor issuing a read operation. Another property is that a message written is read within δ + u time units:

    ∃next : ((next = r ∧ op = rdy ∧ d = v) ⇝_{δ+u} (val = v))   □
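The timer encoding above uses real-valued time; a common way to experiment with such specifications mechanically is to discretise time. The module below is a rough sketch of our own of the timed interface in that style: the constants Delta and U play the roles of δ and u, a single deadline variable per component replaces the per-operation timers t_τ and T_τ (which is enough here because the two processor operations, and likewise the two memory operations, are never enabled at the same time), and the Tick action is prevented from passing a pending deadline, which is the maxtime condition.

---- MODULE TimedInterfaceSketch ----
\* Sketch (ours) of the timer encoding of Section 4.1 for the timed interface,
\* with time discretised so that TLC can explore it.  Delta and U play the roles
\* of delta and u; procT is both the lower- and upper-bound timer of the enabled
\* processor operation (L = U = Delta), memT is the upper-bound timer of the
\* enabled memory operation (L = 0, U = u); INF stands in for "no deadline".
\* For TLC, bound the search with a state constraint such as now <= 20.
EXTENDS Integers
CONSTANTS Delta, U
ASSUME Delta \in Nat /\ U \in Nat /\ U < Delta

VARIABLES next, op, val, d, now, procT, memT
vars == <<next, op, val, d, now, procT, memT>>
Val  == 0..3
INF  == 10000

Init == /\ next = "w" /\ op = "rdy" /\ val \in Val /\ d \in Val
        /\ now = 0 /\ procT = Delta /\ memT = INF

Rp == /\ next = "r" /\ op = "rdy" /\ now = procT   \* may fire only at its deadline
      /\ next' = "w" /\ op' = "r"
      /\ procT' = INF /\ memT' = now + U           \* memory answer due within U
      /\ UNCHANGED <<val, d, now>>
Wp == /\ next = "w" /\ op = "rdy" /\ now = procT
      /\ next' = "r" /\ op' = "w" /\ val' \in Val
      /\ procT' = INF /\ memT' = now + U
      /\ UNCHANGED <<d, now>>
Rm == /\ op = "r" /\ op' = "rdy" /\ val' = d
      /\ memT' = INF /\ procT' = now + Delta       \* next issue exactly Delta later
      /\ UNCHANGED <<next, d, now>>
Wm == /\ op = "w" /\ op' = "rdy" /\ d' = val
      /\ memT' = INF /\ procT' = now + Delta
      /\ UNCHANGED <<next, val, now>>

Tick == /\ now' = now + 1
        /\ now' <= procT /\ now' <= memT           \* time may not pass a deadline
        /\ UNCHANGED <<next, op, val, d, procT, memT>>

Next == Rp \/ Wp \/ Rm \/ Wm \/ Tick
Spec == Init /\ [][Next]_vars
====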
4.2. Faults and fault-tolerance in real-time systems

The functional properties of faults are still modelled by the set F of atomic operations specified by the action formula N_F. This assumes that the lower and upper bound of each fault operation is 0 and ∞. Given a real-time program P^T = ⟨P, L, U⟩, the F-affected version of P^T is defined as

    F(P^T, F) = ⟨F(P, F), L, U⟩

where the domain of L and U is extended to O ∪ F and each operation in F has time bounds of 0 and ∞. To achieve fault-tolerance in a real-time system in which deadlines are to be met, we need a timing assumption about the occurrence of faults. Such an assumption is usually a constraint on the frequency of occurrence of faults, or the minimum time between faults. This time should be long enough to allow recovery of the computation and for progress to be made after recovery. Let the timing assumption for faults F be the conjunction of assumptions of the form 'whenever fault1 occurs, fault2 does not occur within ε units of time'. If a fault fault1 is specified by its action formula ?(fault1), the timing assumption can be written as:

    □(⟨?(fault1)⟩_v ⇒ □_ε [¬?(fault2)]_v)

The formula □_ε ψ is interpreted as 'ψ must always hold within ε units of time'. It can be defined in terms of □ and the time variable now as

    □_ε ψ = ∀t : (now = t ⇒ □(now ≤ t + ε ⇒ ψ))

Let the timing assumption on faults F be denoted by T_F. The internal and external specifications of the F-affected version F(P^T, F) are respectively

    Φ(F(P^T, F)) = Φ(F(P, F)) ∧ RT ∧ B(P^T) ∧ T_F

and

    Ψ(F(P^T, F)) = ∃x, timers(P^T) : Φ(F(P^T, F))

The F-affected version of a real-time program P^T is also a real-time program. The normal form allows the definition of fault-tolerance for real-time systems to be similar to that for untimed systems. A real-time program P^T is an F-tolerant implementation of a real-time property ψ if the implication Ψ(F(P^T, F)) ⇒ ψ holds. P^T is an F-tolerant refinement of a real-time program Ph^T if the implication Ψ(F(P^T, F)) ⇒ Ψ(Ph^T) holds.
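As noted in Section 3.2 for B_F, a timing assumption such as T_F can also be encoded operationally in the fault model itself. The fragment below is a deliberately abstract sketch of our own (the variable x and all names are purely illustrative, and time is again discretised): the last occurrence time of fault1 is recorded in a history variable, and fault2 is guarded by the required separation ε.

---- MODULE FaultSeparationSketch ----
\* Sketch (ours) of encoding a fault timing assumption operationally:
\* "whenever fault1 occurs, fault2 does not occur within Epsilon time units".
\* x is a purely illustrative state variable; time is discretised.
\* For TLC, bound the search with a state constraint such as now <= 3 * Epsilon.
EXTENDS Integers
CONSTANT Epsilon
ASSUME Epsilon \in Nat

VARIABLES now, x, last1
vars == <<now, x, last1>>

Init == now = 0 /\ x = 0 /\ last1 = -(Epsilon + 1)   \* "fault1 has not yet occurred"

Tick   == now' = now + 1 /\ UNCHANGED <<x, last1>>
Fault1 == x' = 1 /\ last1' = now /\ UNCHANGED now
Fault2 == /\ now > last1 + Epsilon                   \* the separation assumption T_F
          /\ x' = 2 /\ UNCHANGED <<now, last1>>

Next == Tick \/ Fault1 \/ Fault2
Spec == Init /\ [][Next]_vars
====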
Example (continued) Consider how the timed fault-free processor-memory interface can be implemented using three faulty memories whose values may be corrupted. Let F = {fault1, fault2, fault3} be defined as in the untimed case. To obtain the specification of a program P1l^T, which is to be proved to be an F-tolerant refinement of P^T under the assumption B_F, define variables and initial conditions as follows:

    v1l = {next, op, val, d1, d2, d3, f1, f2, f3}
    I1l = ((next, op, f1, f2, f3) = (w, rdy, 0, 0, 0)) ∧ (d1 = d2 = d3)

The specification of the operations is:

    ?(Rp_1l) = ((next, op) = (r, rdy)) ∧ ((next', op') = (w, r))
    ?(Wp_1l) = ((next, op) = (w, rdy)) ∧ ((next', op') = (r, w)) ∧ (val' ∈ Z)
    ?(Rm_1l) = (op = r) ∧ ((op', val') = (rdy, vote(d1, d2, d3)))
    ?(Wm_1l) = (op = w) ∧ (op' = rdy) ∧ (∧_{i=1..3} (di' = val)) ∧ (∧_{i=1..3} (fi' = 0))
    N_P1l    = ?(Rp_1l) ∨ ?(Wp_1l) ∨ ?(Rm_1l) ∨ ?(Wm_1l)
    Φ(P1l)   = I1l ∧ □[N_P1l]_v1l
    Ψ(P1l)   = ∃d1, d2, d3, f1, f2, f3 : Φ(P1l)

Let P1l be defined as (I1l, {Rp_1l, Wp_1l, Rm_1l, Wm_1l}), where the action formulas of Rp_1l, Wp_1l, Rm_1l and Wm_1l are respectively ?(Rp_1l), ?(Wp_1l), ?(Rm_1l) and ?(Wm_1l). To meet the timing properties of P^T requires a guarantee that the time bounds of the actions of the implementation P1l will ensure that the time for the processor to issue an operation is still δ, and that the upper bound u1 for the memory to complete execution of an issued operation is not greater than u, where u1 ≤ u:

    L_l(Rp_1l) = U_l(Rp_1l) = L_l(Wp_1l) = U_l(Wp_1l) = δ
    L_l(Rm_1l) = L_l(Wm_1l) = 0
    U_l(Rm_1l) = U_l(Wm_1l) = u1

The implication Ψ(F(P1l^T, F)) ∧ B_F ⇒ Ψ(P^T) asserts that P1l F-tolerantly implements P^T under the assumption B_F. The assumption can be relaxed to the timing assumption T_F, where v1 = v1l ∪ {f1, f2, f3}:

    T_F = ∧_{i=1..3} □(⟨?(faulti)⟩_v1 ⇒ □_{δ+u1} [¬(?(fault_{i⊕1}) ∨ ?(fault_{i⊕2}))]_v1)

From T_F, only one of the most recently written memories may be corrupted before execution of the read operation is completed. Then P1l^T = ⟨P1l, L_l, U_l⟩ is an F-refinement of the interface P^T under the fault assumption T_F. The specifications of P^T and P1l^T demonstrate a practical fact: to achieve fault-tolerance with timing constraints requires use of a more powerful (or faster) machine. Execution of the multiple assignment Wm_1l on such a machine should not be slower than the execution of the single assignment Wm for a non-fault-tolerant implementation of P^T; similarly, execution of the multiple assignment Rm_1l with a voting function should not be slower than execution of the single assignment Rm. Alternatively, if a machine of the same speed is used, the original time bounds must have enough slack for execution of the fault-tolerant redundancy code. This point becomes even clearer when P1l^T is further refined by implementing the multiple assignments Wm_1l and Rm_1l by the sequential statements:

    Wm_s = (op = w) → (d1, f1) := (val, 0);
                      (d2, f2) := (val, 0);
                      (d3, f3) := (val, 0);
                      op := rdy

    Rm_s = (op = r) → v1 := d1; v2 := d2; v3 := d3;
                      val := vote(v1, v2, v3);
                      op := rdy

A sequential statement can be specified in our model in terms of smaller atomic actions by introducing control variables. Let cw and cr be variables with values from the sets {w1, w2, w3, w4} and {r1, r2, r3, r4, r5} respectively, and initial values w1 and r1. The sequential statement Wm_s can be decomposed into a set of four atomic operations {Wm_s1, Wm_s2, Wm_s3, Wm_s4} which are specified as the four actions:

    ?(Wm_s1) = ((op, cw) = (w, w1)) ∧ ((d1', cw') = (val, w2))
    ?(Wm_s2) = (cw = w2) ∧ ((d2', cw') = (val, w3))
    ?(Wm_s3) = (cw = w3) ∧ ((d3', cw') = (val, w4))
    ?(Wm_s4) = (cw = w4) ∧ ((op', cw') = (rdy, w1))

The sequential statement Rm_s is made up of a set of five atomic operations {Rm_s1, Rm_s2, Rm_s3, Rm_s4, Rm_s5}, as shown above for Wm_s. The new program Ps has an enlarged set of variables

    vs = v1l ∪ {cw, cr, v1, v2, v3}

and a strengthened initial condition

    Is = I1l ∧ cw = w1 ∧ cr = r1

Its operations are

    Os = {Rp_1l, Wp_1l} ∪ {Wm_s1, Wm_s2, Wm_s3, Wm_s4} ∪ {Rm_s1, Rm_s2, Rm_s3, Rm_s4, Rm_s5}

The control variables {cw, cr, v1, v2, v3} will be treated as internal variables. Consider the time bounds of this program needed to meet the time constraints of the program P1l^T. The time bounds of Rp_1l and Wp_1l should remain unchanged (i.e. the lower bound and upper bound both being δ). The sum of the upper bounds of the constituent atomic operations should not be greater than the upper bound of the corresponding sequential operation, i.e.

    Σ_{j=1..4} U_s(Wm_sj) ≤ u1   and   Σ_{i=1..5} U_s(Rm_si) ≤ u1

Correctness of the refinement and its fault-tolerance properties can be proved by showing the validity of the implication:

    Ψ(F(Ps^T, F)) ∧ T_F ⇒ Ψ(F(P1l^T, F)) ∧ T_F

Recalling that P1l^T is an F-tolerant refinement of P^T, Ps^T is also an F-refinement of P^T under the assumption T_F.   □
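The decomposition of Wm_s can again be written down directly in TLA+. The module below is a sketch of our own of the four atomic write actions sequenced by the control variable cw; a stand-in action IssueWrite plays the role of Wp_1l so that the module is self-contained, and, following the sequential statement Wm_s, each step also clears the corresponding flag fi.

---- MODULE SequentialWriteSketch ----
\* Sketch (ours) of the decomposition of the sequential write Wm_s into atomic
\* actions sequenced by the control variable cw.  IssueWrite stands in for the
\* processor operation Wp_1l so that the module is self-contained.
EXTENDS Integers
VARIABLES op, val, d1, d2, d3, f1, f2, f3, cw
vars == <<op, val, d1, d2, d3, f1, f2, f3, cw>>
Val  == 0..3

Init == /\ op = "rdy" /\ val \in Val /\ cw = "w1"
        /\ d1 \in Val /\ d1 = d2 /\ d2 = d3
        /\ f1 = 0 /\ f2 = 0 /\ f3 = 0

IssueWrite == /\ op = "rdy" /\ op' = "w" /\ val' \in Val
              /\ UNCHANGED <<d1, d2, d3, f1, f2, f3, cw>>

Ws1 == /\ op = "w" /\ cw = "w1"
       /\ d1' = val /\ f1' = 0 /\ cw' = "w2"
       /\ UNCHANGED <<op, val, d2, d3, f2, f3>>
Ws2 == /\ cw = "w2"
       /\ d2' = val /\ f2' = 0 /\ cw' = "w3"
       /\ UNCHANGED <<op, val, d1, d3, f1, f3>>
Ws3 == /\ cw = "w3"
       /\ d3' = val /\ f3' = 0 /\ cw' = "w4"
       /\ UNCHANGED <<op, val, d1, d2, f1, f2>>
Ws4 == /\ cw = "w4"
       /\ op' = "rdy" /\ cw' = "w1"
       /\ UNCHANGED <<val, d1, d2, d3, f1, f2, f3>>

Next == IssueWrite \/ Ws1 \/ Ws2 \/ Ws3 \/ Ws4
Spec == Init /\ [][Next]_vars
====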
5. Discussion

A detailed discussion of the application of the transformational framework for fault-tolerance can be found in earlier publications. In [Liu91, LJ92], it was shown how an F-tolerant implementation of a program P is achieved by a 'recovery' transformation R(P). This transformation, together with refinement transformations, can be used for formal treatment of most existing fault-tolerant mechanisms, such as multiple-implementation (N-version implementation), backward and forward recovery algorithms [Had82], recovery blocks and conversation structures for fault-tolerance [Ran75, RLT78]. [LJ93] formally dealt with fault-tolerance in asynchronous communicating systems using checkpointing and recovery, permitting faults also during checkpointing and recovery. [LJ94] presented a stepwise and modular method for the development of fault-tolerant reactive systems: it showed how to design a component of a reactive system to tolerate both software design faults and the hardware faults of a given component. It also dealt with fault-tolerant broadcasts [Sch82, SG84] and Byzantine agreement [LSP82]. The earlier work did not deal with real-time, but this could be added in a similar way to that shown here.

Suppose that fault-tolerance was to be achieved using checkpointing and backward recovery. As proposed in [LJ93], this can be done by a transformation of a non-fault-tolerant program P = ⟨I, O⟩ to a program R(P) = ⟨I, O ∪ C ∪ R⟩, which adds to P the checkpointing operations C and the recovery operation R. When a real-time program P^T = ⟨P, L, U⟩ is considered, we need to treat the time constraints on the actions of R(P) and consider the real-time program R(P)^T = ⟨R(P), L1, U1⟩. For the latter to be an F-tolerant implementation of the former, the implication

    Ψ(F(R(P)^T, F)) ∧ T_F ⇒ Ψ(P^T)

must be proved for some time assumption T_F about the faults F. Intuitively, L1, U1 and T_F together constrain the behaviour in the following way.

1. The time bounds on C ensure that checkpoints are taken frequently enough to limit the amount of computation to be undone after error-recovery. But a high frequency of checkpointing also adds to the overhead.

2. The time bounds on R guarantee that whenever a fault occurs, the recovery algorithm R executes to completion within a small time r.

3. T_F assumes that the system becomes stable for a period which is sufficient for the execution of the recovery algorithm R, and for some progress in the underlying execution of P, so that the original time constraints of P can be met.

In practice, periodic checkpointing can be used and it can be assumed that faults may occur with some minimum separation. The method for verification of fault-tolerance and real-time properties was illustrated using a simple example (the processor-memory interface). However, for programs of reasonable size, some mechanized proof checking becomes essential. Since our framework does not need a new semantic model, there is also no need for a new proof-checking tool. An existing tool, such as PVS [ORS92], which can provide support for temporal reasoning, can be used to reason about fault-tolerance. [Sha93] gives an example of the use of PVS for verifying Byzantine fault-tolerant clock synchronization.
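To make the shape of the recovery transformation R(P) = ⟨I, O ∪ C ∪ R⟩ concrete, the following module is a deliberately minimal sketch of our own: it is not the algorithm of [LJ93], and it assumes that errors are detected and signalled by a flag, but it shows how checkpointing and recovery enter the model as just two more operations alongside those of P and F.

---- MODULE RecoverySketch ----
\* Sketch (ours) of the shape of the recovery transformation R(P) = (I, O \cup C \cup R):
\* a checkpoint operation saves the state, a fault corrupts it and raises an error
\* flag (detection is assumed), and recovery rolls back to the last checkpoint.
\* This only illustrates the structure; it is not the algorithm of [LJ93].
\* For TLC, bound the search with a state constraint such as x <= 9.
EXTENDS Integers
VARIABLES x, cp, err
vars == <<x, cp, err>>

Init == x = 0 /\ cp = 0 /\ err = FALSE

Step       == ~err /\ x' = x + 1 /\ UNCHANGED <<cp, err>>     \* an operation of P
Checkpoint == ~err /\ cp' = x /\ UNCHANGED <<x, err>>         \* C: save the state
Fault      == x' \in 0..9 /\ err' = TRUE /\ UNCHANGED cp      \* F: state may be arbitrarily changed
Recover    == err /\ x' = cp /\ err' = FALSE /\ UNCHANGED cp  \* R: roll back

Next  == Step \/ Checkpoint \/ Recover
FNext == Next \/ Fault
FSpec == Init /\ [][FNext]_vars
====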
6. Conclusions

The main purpose of this paper was to show how, using transformations, stepwise refinement (e.g. [AL88, Bac89]) can be used for the development of fault-tolerant systems, with or without real-time constraints; there is no need to develop a new semantics for dealing with fault-tolerant systems. This was demonstrated here using a simple action system and TLA, but the idea of using transformations for fault-tolerance serves equally well in other formal frameworks: e.g. [Nor92] uses Hoare's CSP and [Jan95] uses Milner's CCS, both for fault-tolerance. The great advantage of this approach is that developments in specification and verification techniques for programs can also be used for fault-tolerant systems.

Acknowledgements We would like to thank the anonymous referees for suggestions which improved the paper. The second author acknowledges the support of EPSRC research grants GR/H39499 and GR/K52447.
References

[AL88] M. Abadi and L. Lamport. The existence of refinement mappings. In Proc. 3rd IEEE Symposium on Logic in Computer Science, 1988.

[AL90] M. Abadi and L. Lamport. Composing specifications. Technical Report 66, Digital SRC, California, 1990.

[Bac89] R.J.R. Back. Refinement calculus, Part II: Parallel and reactive programs. In Lecture Notes in Computer Science 340, pages 67-93. Springer-Verlag, 1989.

[Had82] V. Hadzilacos. An algorithm for minimising rollback cost. In Proceedings of the ACM Symposium on Principles of Database Systems, March 1982.

[HMP91] T. Henzinger, Z. Manna, and A. Pnueli. Temporal proof methodologies for real-time systems. In Proceedings of the 8th ACM Annual Symposium on Principles of Programming Languages, pages 269-276, 1991.

[Jan95] T. Janowski. Bisimulation and Fault-Tolerance. PhD thesis, Department of Computer Science, University of Warwick, 1995.

[Lam90] L. Lamport. A temporal logic of actions. Technical report, Digital SRC, California, April 1990.

[Liu91] Z. Liu. Fault-Tolerant Programming by Transformations. PhD thesis, Department of Computer Science, University of Warwick, 1991.

[LJ92] Z. Liu and M. Joseph. Transformation of programs for fault tolerance. Formal Aspects of Computing, 4(5):442-469, 1992.

[LJ93] Z. Liu and M. Joseph. Specifying and verifying recovery in asynchronous communicating systems. In J. Vytopil, editor, Formal Techniques in Real-Time and Fault-Tolerant Systems, pages 137-166. Kluwer Academic Publishers, 1993.

[LJ94] Z. Liu and M. Joseph. Stepwise development of fault-tolerant reactive systems. In Formal Techniques in Real-Time and Fault-Tolerant Systems, LNCS 863, pages 529-546. Springer-Verlag, 1994.

[LJ96] Z. Liu and M. Joseph. Verification of fault-tolerance and real-time. Technical Report 1996/4, Department of Maths & Computer Science, University of Leicester, Leicester LE1 7RH, U.K., March 1996.

[LSP82] L. Lamport, R. Shostak, and M. Pease. The Byzantine Generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382-401, July 1982.

[Nor92] J. Nordahl. Specification and Design of Dependable Communicating Systems. PhD thesis, Department of Computer Science, Technical University of Denmark, 1992.

[ORS92] S. Owre, J. Rushby, and N. Shankar. PVS: A prototype verification system. In Proc. 11th Conference on Automated Deduction, pages 748-752. Springer-Verlag, 1992.

[Ran75] B. Randell. System structure for software fault tolerance. IEEE Transactions on Software Engineering, SE-1(2):220-232, June 1975.

[RLT78] B. Randell, P.A. Lee, and P.C. Treleaven. Reliability issues in computing systems design. Computing Surveys, 10(2):123-165, 1978.

[Sch82] F.B. Schneider. Fault-tolerant broadcasts. ACM Transactions on Programming Languages and Systems, 4(2):125-148, April 1982.

[SG84] F.B. Schneider and D. Gries. Fault-tolerant broadcasts. Science of Computer Programming, 4:1-15, 1984.

[Sha93] N. Shankar. Verification of real-time systems using PVS. In Proc. of Computer Aided Verification '93, LNCS 697, pages 280-291. Springer-Verlag, 1993.