Formal Reasoning on Fault Coverage of Fault Tolerant Techniques: a Case Study

Cinzia Bernardeschi, Alessandro Fantechi, Luca Simoncini

Dipartimento di Ingegneria dell'Informazione, Università di Pisa, Via Diotisalvi 2 - 56126 Pisa, Italy*

Abstract. In this paper we show how formal reasoning can be applied to study the fault coverage of a fault tolerant technique when the behaviour of a system with a set of predefined faults is considered. The method is based on process algebras and equivalence theory. The behaviour of the system in absence of faults is formally specified, and faults are assumed to be random events which interfere with the system by modifying its behaviour. A fault tolerant technique can be proved to tolerate the set of predefined faults iff the actual behaviour of the system is the same as the behaviour of the system in absence of faults. The approach is illustrated by considering the design of a stable storage disk.

1 Introduction

Before any system can be designed and built, some form of specification of the required behaviour must be available. The specification provides a document against which the behaviour of the system can be judged, and a failure of a system occurs when the behaviour of the system first deviates from that required by the specification. According to Laprie [10], fault tolerance is the property of a system "to provide, by redundancy, service complying with the specification in spite of faults occurred or occurring". Fault tolerant techniques are applied in the design of fault tolerant systems to achieve fault tolerance.

When a system specification is given in a formal language, the program implementing the system can in principle be developed using formal rules which guarantee that the program will satisfy the specification when executed in a fault-free environment. Formal reasoning can also be applied to study the behaviour of a system when a set of anticipated faults, which occur during the operation of the system, is considered. A fault tolerant system is designed to tolerate (only) a set of anticipated faults, and we suppose that anticipated faults correspond to the set of faults which can occur in the operational life of the system (operational faults).

* This work was partly supported by the Italian Ministry of University and Scientific and Technologic Research (MURST-40%).

The behaviour of a system can be divided into normal behaviour, the behaviour of the system when no fault occurs, and faulty behaviour, the behaviour of the system in presence of faults. The faulty behaviour may be different for different kinds of faults, and we refer to the set of faulty behaviours as the failure mode. A fault tolerant technique is applied to design a system that recovers from the faulty behaviour, hiding the effect of faults from an external observer. Equivalence theories based on the reactions of systems to stimuli from the outside world can therefore be used in the design of fault tolerant systems [6, 20]. In particular, the relationship between the system and the fault tolerant system can be studied by using the notion of observational equivalence. Observational equivalence, first introduced in [15], is based on the idea that the behaviour of a system is determined by the way it interacts with the environment: two systems are observationally equivalent whenever no observation can distinguish them.

In [1] we developed a framework based on observational equivalence for the verification of the correctness of fault tolerant systems obtained by the application of fault tolerant techniques. In the framework, the fault tolerant design is represented in LOTOS [2], a process algebra-like formal specification language which includes constructs to define networks of processes that execute actions and communicate synchronously. The process which represents the system in absence of faults, and whose behaviour corresponds to the normal behaviour of the system, is specified. A fault is modelled explicitly as an observable action which may occur at any time during the execution of the system. The process corresponding to the faulty behaviour of the system when a fault occurs is specified, and the effect of a fault is that of transforming the system which behaves normally into the system which behaves faultily.
Important in fault tolerant system design is the fault hypothesis, which gives the constraint on how faults are supposed to occur in the system. Given a set of predefined faults, under a particular failure mode, a system is designed to tolerate the occurrence of faults as stated by the fault hypothesis iff the occurrence of such faults does not inhibit the system's ability to correctly satisfy its specification. In the framework, actions corresponding to faults are made unobservable to the external environment; observational equivalence is then checked between the process corresponding to the system specification in absence of faults and that corresponding to the fault tolerant system where fault occurrences are constrained by the fault hypothesis.

Observational equivalence was first used in [20] to compare the "correct" behaviour of a system with the behaviour of the fault tolerant system in presence of faults; in [6] the authors define new process algebra operators to describe the behaviour of a faulty system, still using observational equivalence to relate the system and its fault tolerant version. On the other hand, several works use trace-based equivalences for the same purpose [17, 21, 22, 23, 24]. Our framework has the merit of using available tools, both from the linguistic point of view and from the automatic verification point of view, to study the use of observational equivalence in this context.

In particular, in this paper we show how the fault coverage of a fault tolerant technique can be studied in this framework. The term fault coverage is generally defined as "a measure of the system's ability to successfully recover after the occurrence of a fault, therefore tolerating the fault" [13, 10]. The set of faults tolerated by a fault tolerant technique can be studied by changing the fault hypothesis and by proving observational equivalence with the system specification in absence of faults.

The paper is organised as follows: Section 2 introduces the framework developed in [1]. In Section 3 we illustrate our reasoning by considering the design of a stable storage disk: a fault tolerant design is introduced with the application of a Triple Modular Redundancy (TMR) fault tolerant technique. In Section 4 the fault coverage of the applied fault tolerant technique is studied. Some information about the possibility of automatic verification is reported throughout the paper.

2 Observational Equivalence and Verification of Fault Tolerant Systems

For a fault tolerant system, it is not its internal structure which is of interest but its behaviour with respect to the outside world. Equivalence theories which can be used to establish whether two systems are equivalent, or whether one system is a satisfactory "approximation" of another, may be useful in the design of fault tolerant systems. Many equivalences have been defined on Labelled Transition Systems [7]. A Labelled Transition System (LTS) is an abstract relational model based on two primitive notions, namely those of state and transition. In particular, LTSs are nondeterministic transition systems which can be used to model systems controllable through interaction with a surrounding environment, but also capable of performing internal or hidden actions which cannot be influenced or even observed by any external agent.

Definition 1 (LTS). An LTS is a 4-tuple $(Q, Act, \rightarrow, q_0)$ such that $Q$ is a countable set of states, $Act$ is a countable set of actions, $\rightarrow \subseteq (Q \times Act_i \times Q)$ is the labelled transition relation, with $Act_i = Act \cup \{i\}$ ($i$ is the "internal" action), and $q_0 \in Q$ is the initial state.

In this definition each of the relations $\xrightarrow{a}$, $a \in Act_i$, describes the effect of the execution of the action $a$ and, if $q, q' \in Q$, then $q \xrightarrow{a} q'$ indicates that the system, being in the state $q$, can reach the state $q'$ by performing the action $a$. The special symbol $i$ is used to denote internal actions, and $q \xrightarrow{i} q'$ indicates that a system in the state $q$ can perform a silent move to the state $q'$. A transition system can be unrolled into a tree. The initial state is the root and the transition relation is represented by arcs labelled with elements from $Act_i$; the nodes represent the states. We will freely use the word process to denote a state of an LTS. In [16] and in previous related works observational equivalence is defined by a bisimulation relation on the states of LTSs:

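Definition 2 (bisimulation). A bisimulation is a binary relation $\mathcal{R}$ on $Q$ such that whenever $p \mathcal{R} q$ and $a \in Act_i$, then:
i) $p \xrightarrow{a} p' \Rightarrow \exists q'$ such that $q \xrightarrow{a} q'$ and $p' \mathcal{R} q'$,
ii) $q \xrightarrow{a} q' \Rightarrow \exists p'$ such that $p \xrightarrow{a} p'$ and $p' \mathcal{R} q'$.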

Definition 3 (bisimulation equivalence). Two processes $p$ and $q$ are called bisimulation equivalent if and only if there exists a bisimulation $\mathcal{R}$ with $p \mathcal{R} q$, and we write $p \sim q$.

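Definition 4 (weak bisimulation). Let $\stackrel{\epsilon}{\Rightarrow}$ be the reflexive and transitive closure of $\xrightarrow{i}$, that is, a $\stackrel{\epsilon}{\Rightarrow}$ move can be seen as a sequence of zero or more unobservable actions; we define the relation $\stackrel{a}{\Rightarrow}$, $a \in Act$, as $\stackrel{\epsilon}{\Rightarrow} \xrightarrow{a} \stackrel{\epsilon}{\Rightarrow}$, where juxtaposition denotes relation composition. In particular, $\stackrel{a}{\Rightarrow}$ denotes $\{(x, y) \mid x \stackrel{\epsilon}{\Rightarrow} z,\ z \xrightarrow{a} w,\ w \stackrel{\epsilon}{\Rightarrow} y\}$. A weak bisimulation is a binary relation $\mathcal{R}$ such that whenever $p \mathcal{R} q$ then, for each $a \in Act \cup \{\epsilon\}$:
i) if $p \stackrel{a}{\Rightarrow} p'$ then for some $q'$, $q \stackrel{a}{\Rightarrow} q'$ and $p' \mathcal{R} q'$,
ii) if $q \stackrel{a}{\Rightarrow} q'$ then for some $p'$, $p \stackrel{a}{\Rightarrow} p'$ and $p' \mathcal{R} q'$.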

Definition 5 (observational equivalence). Two processes $p$ and $q$ are called weak bisimulation equivalent (or observationally equivalent) if and only if there exists a weak bisimulation $\mathcal{R}$ such that $p \mathcal{R} q$, and we write $p \approx q$.
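For finite LTSs, the weak bisimulation conditions can be checked mechanically. The following Python sketch is illustrative only and is not the algorithm used by the LOTOS toolset [9]: it assumes a hypothetical encoding of an LTS as a dictionary `trans` mapping each state to a list of `(action, target)` pairs, with the string `"i"` standing for the internal action. It computes the largest weak bisimulation by starting from all pairs of states and discarding the pairs that violate one of the two clauses.

```python
from itertools import product

TAU = "i"  # the LOTOS internal action (encoding assumed for this sketch)


def tau_closure(states, trans):
    """Map each state to the states reachable by zero or more i-moves."""
    reach = {s: {s} for s in states}
    changed = True
    while changed:
        changed = False
        for s in states:
            for t in list(reach[s]):
                for action, u in trans.get(t, ()):
                    if action == TAU and u not in reach[s]:
                        reach[s].add(u)
                        changed = True
    return reach


def weak_moves(states, trans):
    """weak[s][a]: states reachable by i* a i*; weak[s][None]: states reachable by i*."""
    eps = tau_closure(states, trans)
    weak = {s: {None: set(eps[s])} for s in states}
    for s in states:
        for t in eps[s]:
            for action, u in trans.get(t, ()):
                if action != TAU:
                    weak[s].setdefault(action, set()).update(eps[u])
    return weak


def observationally_equivalent(states, trans, p, q):
    """True iff p and q are related by the largest weak bisimulation."""
    weak = weak_moves(states, trans)
    rel = set(product(states, states))
    changed = True
    while changed:
        changed = False
        for x, y in list(rel):
            # clause i): every weak move of x is matched by y
            forth = all(any((x2, y2) in rel for y2 in weak[y].get(a, ()))
                        for a, targets in weak[x].items() for x2 in targets)
            # clause ii): every weak move of y is matched by x
            back = all(any((x2, y2) in rel for x2 in weak[x].get(a, ()))
                       for a, targets in weak[y].items() for y2 in targets)
            if not (forth and back):
                rel.discard((x, y))
                changed = True
    return (p, q) in rel


# Example processes: p0 = a; stop, q0 = i; a; stop, r0 = a; stop [] i; stop
STATES = ["p0", "p1", "q0", "q1", "q2", "r0", "r1", "r2"]
TRANS = {
    "p0": [("a", "p1")],
    "q0": [(TAU, "q1")],
    "q1": [("a", "q2")],
    "r0": [("a", "r1"), (TAU, "r2")],
}
```

In this example `p0` and `q0` are observationally equivalent, because the initial internal move of `q0` cannot be observed, while `p0` and `r0` are not: `r0` can silently move to a state in which `a` is no longer possible.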



Given a system, a failure mode, a condition on faults and a fault tolerant technique, the framework developed in [1] states that the correctness of the system obtained by the application of the fault tolerant technique can be verified by checking observational equivalence. The framework is built according to the steps reported in Fig. 1. The process corresponding to the normal behaviour of the system is specified first.

Definition 6 (correct system). The correct system $P$ is the process corresponding to the system in absence of faults (step 1).

Faults interfere with the system by modifying its behaviour.

Definition 7 (faulty system). For each fault, we define the faulty behaviour of the system by the process $FS_p^{fault}$.

Definition 8 (possibly faulty system). Let $PFS_p$ be the possibly faulty system for $P$, obtained by introducing anticipated faults as a random choice at each step of execution of the correct system $P$ (step 2). The effect of each fault in the system is described as a fault action which transforms $PFS_p$ into the faulty system $FS_p^{fault}$. If $Obs(P)$ is the set of observable actions of the system, and $\Im$ is the set of actions associated to faults, then $Obs(PFS_p) = Obs(P) \cup \Im$. The specification of $PFS_p$ is given in terms of the specification of the faulty systems $FS_p^{fault}$, and it depends on the hypothesized characteristics of faults.

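[Figure 1: diagram with boxes SPECIFICATION, CORRECT SYSTEM, POSSIBLY FAULTY SYSTEM, FAULT TOLERANT TECHNIQUE SPECIFICATION, FAULT TOLERANT SYSTEM, FAULT HYPOTHESIS and FAULT TOLERANT SYSTEM UNDER A FAULT HYPOTHESIS, connected by steps numbered 1 to 7; step 7 is the check of observational equivalence with the CORRECT SYSTEM.]

Fig. 1. The general framework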

Definition 9 (fault tolerant technique). Let FTT be the fault tolerant technique; FTT is specified by giving the set of components implementing it and how components synchronise on their observable actions (step 3).

Definition 10 (fault tolerant system). Let $FTS_p$ be the fault tolerant system obtained by the application of the fault tolerant technique to the possibly faulty system $PFS_p$; the application of FTT may employ several copies of $PFS_p$, and usually some new synchronisation actions are introduced (step 4). If $S$ is the set of new synchronisation actions, we hide the actions in $S$ in the specification of $FTS_p$, thus obtaining $Obs(FTS_p) = Obs(P) \cup \Im$.

Definition 11 (fault hypothesis). Let $FH_p$ be the fault hypothesis for $P$, that is, the faults that may occur in the system $P$ (step 5).

Definition 12 (fault tolerant system under a fault hypothesis). Let $\overline{FTS}_p$ be the definition of the behaviour of the fault tolerant system under the fault hypothesis. $\overline{FTS}_p$ (step 6) is obtained by the parallel composition of the behaviour expressions at steps 4 and 5, with synchronisation on all the actions corresponding to faults. This means that faults occur in the system according to the fault hypothesis. Finally, fault events are hidden, that is, they are considered as internal events of the obtained system. We have that $Obs(\overline{FTS}_p) = Obs(P)$.

Definition 13 (fault tolerant system correctness). Observational equivalence between the correct system specification $P$ and the fault tolerant system $\overline{FTS}_p$, obtained at step 6, is studied to decide if the fault tolerant system under the specified fault hypothesis $FH_p$ guarantees a correct behaviour of the system when faults occur according to $FH_p$ (step 7).

Actions which explicitly model fault occurrences have been introduced for the study of the FTT, while they are hidden in the global specification of the fault tolerant design, and only the effects of faults on the externally visible behaviour of the original system are modelled. The alphabet of the processes remains unchanged, concentrating on the actions observable at the interface of the fault tolerant system.

The elements of the formalization of the framework use LOTOS (Language of Temporal Ordering Specification) [2], a CCS-like specification language [16]. LOTOS is a formalism whose semantics is based on LTSs, for which a notion of observational equivalence is defined and for which a toolset [9], including automated tools for the verification of observational equivalence between specifications, has been developed. The essential component of a system specification is the "process definition", where a process is an entity able to interact with its environment through gates. The behaviour of a process is specified by a "behaviour expression". The (simplified) syntax we use for a process definition is the following, where Gatelist and Varlist are gate and variable formal parameters:

    process Id[Gatelist](Varlist) : noexit :=
      behaviour expression
    endproc

A behaviour expression is formed out of terms obtained by applying the language operators. Let B, Bi denote behaviour expressions and g, gi denote gates. The language includes:

- the operator to execute actions in sequence, g; B;
- the boolean guarded command [c] -> B, which says that the behaviour specified by B is performed only if c holds;
- the nondeterministic choice among actions, B1 [] B2;
- the hiding of actions, hide g1, ..., gn in B, where the gates gi are transformed into the internal action i;
- the parallel composition B1 |[g1, ..., gn]| B2, which means that B1 and B2 are able to execute independently any action that either is ready to perform at a gate different from any of the gi, but they must synchronise on actions at the gates gi; and
- the generalized choice operator choice x : t [] B(x), which specifies the choice among the behaviour expressions B(v) for all the values v of type t.

The language also includes the output action denotation g!e, to send the value expression e at gate g, and the input action denotation g?x : t, to receive a value via gate g and assign it to variable x of type t, the same type as the expression. Processes communicate synchronously by value passing: if process P performs g!3, process Q performs g?x : nat and the two processes synchronise at g, the result is that the value 3 is passed to Q in the variable x. Multiway synchronisation is also possible, in which more than two processes agree to perform the same action.

The operational semantics for LOTOS gives an LTS as a model of any LOTOS process, by defining the transition relation for each LOTOS operator; in this case $Act_i$ is the set $\{g\langle v \rangle \mid g \in G, v \in V\} \cup \{i\}$, where $V$ is the set of definable values of LOTOS, $G$ the set of user definable gates, and $i$ the unobservable action. The reader can refer to [2] for the LOTOS formal semantics.

3 Fault Tolerant Design of a Stable Disk Storage

A disk is used to store and retrieve data. During these operations, some faults can occur. In a stable disk storage, if faults occur, they should be tolerated by the system without leaving any observable trace. The disk is divided into several sectors. We suppose a simple storage medium that can accept at most one request at a time. The user signals with a read that he wants to read the content of some sector of the disk, while the disk returns the information stored in the sector by a content action. Finally, the user signals with a write that some information must be written onto a sector.

Let $D$ be the disk, $SN$ the number of sectors and $DA$ the set of information items. In the following, we denote by $S_i$, $1 \le i \le SN$, the information stored in sector $i$. The syntax of the operations is:

- $read(i)$, $1 \le i \le SN$, sent by the user to the stable disk storage to read sector $i$;
- $content(S_i)$, sent by the stable disk storage to the user to return $S_i$ after a $read(i)$;
- $write(i, d)$, $1 \le i \le SN$, $d \in DA$, sent by the user to the stable disk storage to write data $d$ into sector $i$.

A write is the only operation which modifies the state of the disk; if $S'_j$ denotes the information stored in sector $j$ after a write operation $write(i, d)$, then $S'_j = S_j$ for $j \neq i$, and $S'_i = d$.

3.1 Correct Disk

Let $Snat$ be the set of natural numbers from 1 to $SN$; the specification of the behaviour of the disk in absence of faults (step 1 of the framework) is:

    process Cdisk[read, write, content](S_1 : DA, ..., S_SN : DA) : noexit :=
      (read?i : Snat; content!S_i;
           Cdisk[read, write, content](S_1, ..., S_SN)
       [] write?i : Snat?d : DA;
           Cdisk[read, write, content](S_1, ..., S_{i-1}, d, S_{i+1}, ..., S_SN))
    endproc

where the content of each sector is described by a variable which is a parameter of the process definition. This process definition expresses the correct system behaviour: for each read request, the last data written in the sector is returned to the user.
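The recursive structure of Cdisk, in which each action is followed by a new instance of the process with updated parameters, can be mirrored by pure functions on a tuple of sector contents. The Python sketch below is only an illustration under that assumption; the function names `read` and `write` and the tuple encoding are not part of the LOTOS specification.

```python
def read(sectors, i):
    """Value offered on content after read(i); sectors are numbered from 1."""
    return sectors[i - 1]


def write(sectors, i, d):
    """State after write(i, d): S'_i = d and S'_j = S_j for every j != i."""
    return sectors[:i - 1] + (d,) + sectors[i:]
```

Because `write` returns a new tuple instead of modifying its argument, each call corresponds to one recursive instantiation of Cdisk with the parameter list in which only $S_i$ has changed.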

3.2 Fault Tolerant Design

Different approaches are proposed in the literature for the design of a fault tolerant stable storage disk [4, 5]. Information stored in a disk can be subject to the following malfunctions:

- temporary read/write errors due to transient faults;
- permanent errors due to permanent hardware faults, persistent malfunctions in the controller or the crash of the system during a write operation.

To realise a stable storage disk, we assume a physical disk where transient faults are assumed not to occur, and we concentrate on permanent faults. We assume in the following that two different kinds of permanent faults can occur in the system:

1. the physical damage of the information in a physical sector $i$. We denote it by $dmg(i)$;
2. a fault in the control system such that the read/write of sector $i$ results in the read/write of a different sector. We denote by $csf(i, j)$ the fault of referring to the wrong sector $j$ by sector number $i$.

The crash of the system during a write operation on sector $i$ is modelled by a damage of the sector. We suppose that the reset of the damaged sector to the correct value is not allowed. Let $\Im$ be the set of faults in the system; we have

$$\Im = \bigcup_{i=1}^{SN} dmg(i) \;\cup \bigcup_{i,j=1,\ j \neq i}^{SN} csf(i, j).$$

If a fault of kind 1 occurs, the failure mode of the system corresponds to returning corrupted data to the user after a read of the damaged sector. We model this kind of fault by using the special value $cd$ to indicate corrupted data. In addition, we maintain a set of variables $A_1, \ldots, A_{SN}$ such that $\forall i$, $A_i = 0$ at the beginning and $A_i = 1$ iff $dmg(i)$ has occurred. The variable $A_i$ is checked before any write onto sector $i$, so that a new write onto a damaged sector is never executed successfully. We maintain both the $cd$ information and the $A_i$ variables to simplify the specification of the read and write operations.

After a fault of kind 2, we always write the information to be stored in sector $i$ into sector $j$ instead. If $csf(i, j)$ occurs, then the failure mode corresponds to delivering wrong data to the user when: 1) a read of sector $i$ is requested before any write of sector $i$; 2) a read of sector $i$ is requested after a write of sector $j$ and before a write of sector $i$; 3) a read of sector $j$ is requested after a write of sector $i$ and before a write of sector $j$. We model these faults by a set of variables $B_1, \ldots, B_{SN}$ such that $B_i = i$ at the beginning and $B_i = j$ after a $csf(i, j)$.

Sectors are initialised to value 0 and, for simplicity, we assume that a fault cannot occur in the system after a read and before the corresponding content operation. Read requests are executed according to the values of the variables $B_i$, while user write requests depend on the values of the variables $A_i$ and $B_i$. When a read of sector $i$ is requested, the system returns the value $S_{B_i}$ stored in sector $B_i$. When a write of $d$ onto sector $i$ is requested, the system writes $d$ onto sector $B_i$ if and only if that sector is not damaged ($A_{B_i} = 0$). The PFS corresponding to the physical disk specification is the following, where the variables $A_i$ and $B_i$ appear as parameters of the process definition:

    process PFSCdisk[read, write, content, dmg, csf](S_1 : DA, ..., S_SN : DA, A_1 : nat, ..., A_SN : nat, B_1 : Snat, ..., B_SN : Snat) : noexit :=
      (read?i : Snat; content!S_{B_i};
           PFSCdisk[read, write, content, dmg, csf](S_1, ..., S_SN, A_1, ..., A_SN, B_1, ..., B_SN)
       [] write?i : Snat?d : DA;
           ([A_{B_i} eq 1] ->
                PFSCdisk[read, write, content, dmg, csf](S_1, ..., S_SN, A_1, ..., A_SN, B_1, ..., B_SN)
            [] [A_{B_i} eq 0] ->
                PFSCdisk[read, write, content, dmg, csf](S_1, ..., S_{B_i - 1}, d, S_{B_i + 1}, ..., S_SN, A_1, ..., A_SN, B_1, ..., B_SN))
       [] dmg?i : Snat;
           PFSCdisk[read, write, content, dmg, csf](S_1, ..., S_{i-1}, cd, S_{i+1}, ..., S_SN, A_1, ..., A_{i-1}, 1, A_{i+1}, ..., A_SN, B_1, ..., B_SN)
       [] csf?i : Snat?j : Snat;
           PFSCdisk[read, write, content, dmg, csf](S_1, ..., S_SN, A_1, ..., A_SN, B_1, ..., B_{i-1}, j, B_{i+1}, ..., B_SN))
    endproc















Moreover, let $FS^{dmg(i)}$ and $FS^{csf(i,j)}$ be the faulty systems after the faults $dmg(i)$ and $csf(i, j)$ respectively; the processes $FS^{dmg(i)}$ and $FS^{csf(i,j)}$ are described by the PFSCdisk process with different values of the parameters: $S_i = cd$ and $A_i = 1$ after $dmg(i)$, while $B_i = j$ after $csf(i, j)$. The restriction that faults do not occur in the PFS after a read action and before the corresponding content action could be eliminated by replacing the behaviour expression content!S_{B_i}; PFSCdisk[...] with a nondeterministic choice among the content action and the set of fault actions, each fault action being followed by the process which describes the faulty behaviour of the system after the fault occurrence.
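The role of the variables $S_i$, $A_i$ and $B_i$ can be made concrete with a small executable model. The Python sketch below is an illustration, not a translation of the LOTOS semantics: the class name `PossiblyFaultyDisk`, the use of mutable lists and the string `"cd"` for corrupted data are assumptions made for this example, and fault actions are simply method calls that may be issued between user requests.

```python
CD = "cd"  # special value denoting corrupted data, as in the text


class PossiblyFaultyDisk:
    """Illustrative model of one physical disk with SN sectors, numbered from 1."""

    def __init__(self, sn):
        self.S = [0] * (sn + 1)       # sector contents (index 0 unused)
        self.A = [0] * (sn + 1)       # A[i] == 1 iff dmg(i) has occurred
        self.B = list(range(sn + 1))  # B[i] == j after csf(i, j), otherwise i

    def read(self, i):
        """read(i) followed by content: returns S_{B_i}."""
        return self.S[self.B[i]]

    def write(self, i, d):
        """write(i, d): stores d in sector B_i unless that sector is damaged."""
        target = self.B[i]
        if self.A[target] == 0:
            self.S[target] = d

    def dmg(self, i):
        """Fault dmg(i): sector i holds corrupted data and rejects later writes."""
        self.S[i] = CD
        self.A[i] = 1

    def csf(self, i, j):
        """Fault csf(i, j): requests for sector i are redirected to sector j."""
        self.B[i] = j
```

With three sectors, calling `csf(1, 2)` and then `write(1, 7)` makes a subsequent `read(2)` return 7, which is the third failure mode listed for control system faults; after `dmg(3)`, every `read(3)` returns the corrupted value regardless of later writes.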

Let us apply a TMR fault tolerant technique to the previous system (see Fig. 2(a) and 2(b)). Each replica corresponds to a physical disk; read and write actions are executed synchronously by the user and by the replicas, while content actions are returned by the replicas to the Voter. Let us assume the Voter component is specified as a recursive process Vot that waits for the result of the computation from each replica and then outputs the result according to the majority of the received values. If a majority is not detected, we assume the Voter outputs one of the values in $DA$.

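[Figure 2: (a) a single Disk with gates read, write and content; (b) the TMR arrangement, in which Disk 1, Disk 2 and Disk 3 share the read and write gates and send content 1, content 2 and content 3 to the Voter, which outputs content and is connected to the replicas through the sync gate.]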

Fig. 2. TMR technique

The constraint that all replicas synchronise on the user requests is expressed in LOTOS by multiway synchronisation. Moreover, to prevent replicas from accepting new user requests before the Voter returns the result of a previously requested read operation, a synchronisation gate sync is added. Distinct communications from each replica to the Voter are obtained by renaming the content gate of each replica: the Voter awaits a communication via each of the gates content_1, content_2 and content_3. After the output of the Voter occurs, it synchronises with the replicas to enable them to receive a new user request. The process definition of the Voter process Vot used in this example is the following:

    process Vot[content_1, content_2, content_3, sync, content] : noexit :=
      content_1?x_1 : DA; content_2?x_2 : DA; content_3?x_3 : DA;
        ([x_1 eq x_2] -> content!x_1; sync; Vot[content_1, content_2, content_3, sync, content]
         [] [x_1 eq x_3] -> content!x_1; sync; Vot[content_1, content_2, content_3, sync, content]
         [] [x_2 eq x_3] -> content!x_2; sync; Vot[content_1, content_2, content_3, sync, content]
         [] ([x_1 neq x_2] -> [x_1 neq x_3] -> [x_2 neq x_3] ->
               choice y : DA [] content!y; sync; Vot[content_1, content_2, content_3, sync, content]))
    endproc
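The decision taken by Vot on the three received values can be summarised by a small function. The Python sketch below is illustrative; the function name `vote` and the extra parameter `domain`, standing for the set $DA$, are assumptions of this example. Since the LOTOS process chooses nondeterministically among all values of $DA$ when no majority exists, the function returns the set of values that may appear on the content gate rather than a single value.

```python
def vote(x1, x2, x3, domain):
    """Set of values the Voter may output on content for the inputs x1, x2, x3."""
    if x1 == x2 or x1 == x3:
        return {x1}         # first two guards of Vot: x1 agrees with another replica
    if x2 == x3:
        return {x2}         # third guard: replicas 2 and 3 agree
    return set(domain)      # no majority: any value of DA may be output
```

A singleton result corresponds to a detected majority; a result equal to the whole domain corresponds to the last alternative of Vot, where the output is unconstrained.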

We redefine PFSCdisk as the process that, after a content action, synchronises at the sync gate before accepting a new request. The new PFSCdisk is obtained from the definition given above by simply substituting the first alternative of the choice operator as follows:

    process PFSCdisk[read, write, sync, content, dmg, csf](S_1 : DA, ..., S_SN : DA, A_1 : nat, ..., A_SN : nat, B_1 : Snat, ..., B_SN : Snat) : noexit :=
      (read?i : Snat; content!S_{B_i}; sync;
           PFSCdisk[read, write, sync, content, dmg, csf](S_1, ..., S_SN, A_1, ..., A_SN, B_1, ..., B_SN)
       [] ...)
    endproc

The set of faults of the system after the application of the TMR fault tolerant technique is $\Im = \{dmg_1, dmg_2, dmg_3, csf_1, csf_2, csf_3\}$, where $dmg_k$ ($csf_k$) corresponds to a damage (control system) fault in the $k$-th replica. For simplicity, we assume in the following that the extra components added by the fault tolerant technique (in this case the Voter) never fail.

A FTT is expressed in LOTOS as a context (a behaviour expression with free process variables).

Definition 14 (fault tolerant technique). Let $n$ be the number of replicas of a fault tolerant technique FTT; FTT is a context $FTT(\bullet_1, \ldots, \bullet_n)$ of $n$ arguments, one argument for each replica.

A TMR fault tolerant technique is the context $TMR(\bullet_1, \bullet_2, \bullet_3)$, defined by the behaviour expression:

    (((•_1[Igates, sync, Vgates_1, Fgates_1]
       |[Igates, sync]|
       •_2[Igates, sync, Vgates_2, Fgates_2])
      |[Igates, sync]|
      •_3[Igates, sync, Vgates_3, Fgates_3])
     |[Vgates, sync]|
     Vot[Vgates, sync, content])

where Igates are the gates corresponding to the user requests, sync is the synchronisation gate, $Vgates_i$ are the gates by which replica $i$ sends the retrieved information to the Voter, $Vgates = \bigcup_{i=1}^{3} Vgates_i$, $Fgates_i$ are the gates corresponding to faults for replica $i$, and content is the gate corresponding to the output of the Voter. Note that TMR could be expressed as a context which takes only one argument, namely the process PFS, and generates the required instances of the argument with appropriate renaming of the channels. The distinction among the arguments also allows us to specify in a simple way fault tolerant techniques based on design diversity [13], in which variants are used instead of replicas, each variant corresponding to a particular specification of the system. The fault tolerant system is therefore the process:

    process FTSCdisk[read, write, content, dmg_1, dmg_2, dmg_3, csf_1, csf_2, csf_3](S_1 : DA, ..., S_SN : DA, A_1 : nat, ..., A_SN : nat, B_1 : Snat, ..., B_SN : Snat) : noexit :=
      hide content_1, content_2, content_3, sync in
        (((PFSCdisk[read, write, sync, content_1, dmg_1, csf_1](S_1, ..., S_SN, A_1, ..., A_SN, B_1, ..., B_SN)
           |[read, write, sync]|
           PFSCdisk[read, write, sync, content_2, dmg_2, csf_2](S_1, ..., S_SN, A_1, ..., A_SN, B_1, ..., B_SN))
          |[read, write, sync]|
          PFSCdisk[read, write, sync, content_3, dmg_3, csf_3](S_1, ..., S_SN, A_1, ..., A_SN, B_1, ..., B_SN))
         |[content_1, content_2, content_3, sync]|
         Vot[content_1, content_2, content_3, sync, content])
    endproc
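The effect of the composition can be explored on a small executable model. The Python sketch below is an illustration under simplifying assumptions: the classes `Replica` and `TMRDisk` are hypothetical names, synchronisation on read, write and sync is modelled by calling the three replicas in turn, and a missing majority is reported as `None` instead of an arbitrary value of $DA$. It is not a substitute for checking observational equivalence.

```python
CD = "cd"  # special value denoting corrupted data


class Replica:
    """One possibly faulty disk: contents S, damage flags A, address map B."""

    def __init__(self, sn):
        self.S = [0] * (sn + 1)       # index 0 unused, sectors numbered from 1
        self.A = [0] * (sn + 1)
        self.B = list(range(sn + 1))

    def read(self, i):
        return self.S[self.B[i]]

    def write(self, i, d):
        if self.A[self.B[i]] == 0:
            self.S[self.B[i]] = d

    def dmg(self, i):
        self.S[i], self.A[i] = CD, 1

    def csf(self, i, j):
        self.B[i] = j


class TMRDisk:
    """Three replicas behind a majority voter; only read and write are visible."""

    def __init__(self, sn):
        self.replicas = [Replica(sn) for _ in range(3)]

    def write(self, i, d):
        for replica in self.replicas:  # all replicas take part in the same write
            replica.write(i, d)

    def read(self, i):
        x1, x2, x3 = (replica.read(i) for replica in self.replicas)
        if x1 == x2 or x1 == x3:
            return x1
        if x2 == x3:
            return x2
        return None  # no majority: the LOTOS Voter may output any value of DA
```

For instance, after `write(1, 5)` a damage of sector 1 in the first replica is masked by the other two replicas. If a control system fault `csf(2, 1)` then occurs in the second replica, followed by `write(2, 9)`, the three replicas return three different values for sector 1 and no majority exists.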

4 Fault Coverage of a Fault Tolerant Technique

Definition 15 (fault coverage). Fault coverage $C$ is defined as the conditional probability that, given the existence of a fault, the system recovers [13]: $C = P(\text{system recovery} \mid \text{fault existence})$.

In the introduced framework it is possible to reason on fault coverage by acting on the definition of the fault hypothesis at step 5. Anticipated faults correspond to the estimation of the types of faults that can occur in the system. A system recovers from the effects of a fault iff, by choosing a fault hypothesis which allows the occurrence of that fault, the fault tolerant system under the fault hypothesis is observationally equivalent to the system in absence of faults. Each sequence of fault occurrences corresponds in principle to a different fault hypothesis in the framework, and we can check whether the sequence of faults is tolerated by specifying an ad hoc fault hypothesis and then checking observational equivalence in the framework. More generally, a fault hypothesis can be specified to cover different sequences of fault occurrences, thus proving tolerance to the whole set of sequences.

In the LOTOS constraint-oriented style a specification is structured as a conjunction of separate constraints, where the parallel operator is used as the conjunction operator. This style is useful in the specification of fault tolerant systems, since it makes the specification clearly reflect the separation of the constraints on the system behaviour from the constraints on the faults in the system. The constraint-oriented style has already been used for this aim in [6, 17], although within different specification languages. We can simply specify a fault hypothesis as a constraint on the specification of the fault tolerant system. The fault hypothesis is a process (named $FH_p$) which synchronises with the fault tolerant system to execute any action corresponding to faults. A rough evaluation of the fault occurrences that are recovered by the fault tolerant system can thus be made. The correct measure of the fault coverage would in principle be computed by defining a fault hypothesis for each configuration of occurrences of faults in the system and then proving observational equivalence.

4.1 Fault Hypothesis in the Stable Storage Disk

Given the fault tolerant system FTSCdisk, we can prove that the fault tolerant system design is observationally equivalent to the correct disk specification under the following fault hypothesis:

Definition 16 (FH1). A sector may be damaged and/or involved in a control system fault in at most one replica.

The process corresponding to a fault hypothesis executes the actions corresponding to faults, $\{fault \in \Im\}$; we have: $FH[\{fault \in \Im\}]$.

In order to express in LOTOS this fault hypothesis, let us consider separately the following fault hypotheses:

Definition 17 (FH2). A sector may be damaged in at most one replica.

Definition 18 (FH3). A sector may be involved in a csf fault in at most one replica.

The process corresponding to Fault Hypothesis 2 can be given in LOTOS in several different ways. The specification we report in the following is not the simplest one, but it is motivated by the final aim of deriving the specification of Fault Hypothesis 1. For each replica, we introduce a set of $SN$ variables to maintain information about damaged sectors. We denote such variables by $G_1, \ldots, G_{SN}$ for the first replica, by $G_{SN+1}, \ldots, G_{2SN}$ for the second replica and by $G_{2SN+1}, \ldots, G_{3SN}$ for the third replica. The variable corresponding to sector $i$ of the $k$-th replica is therefore $G_{(k-1)SN+i}$. We assume $G_i = 0$, $\forall i$, at the beginning, and $G_{(k-1)SN+i} = 1$ after a $dmg_k(i)$ fault in the $k$-th replica.

Let us denote with the usual indexed sum the nondeterministic choice among an indexed set of alternatives. Moreover, let us assume $cond_{dmg(i,k)}$ is the following formula: $\forall t = 1, 2, 3$ such that $t \neq k$, $G_{(t-1)SN+i} = 0$.

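$$
\begin{array}{l}
\textbf{process } FH2[dmg_1, dmg_2, dmg_3](G_1 : nat, \ldots, G_{SN} : nat, \\
\qquad G_{SN+1} : nat, \ldots, G_{2SN} : nat, \; G_{2SN+1} : nat, \ldots, G_{3SN} : nat) : \textbf{noexit} := \\
\quad \sum_{i \in \{1..SN\},\, k = 1,2,3} \; [cond_{dmg(i,k)}] \rightarrow dmg_k\,!i; \\
\qquad FH2[dmg_1, dmg_2, dmg_3](G_1, \ldots, G_{(k-1)SN+(i-1)}, 1, G_{(k-1)SN+(i+1)}, \ldots, G_{3SN}) \\
\textbf{endproc}
\end{array}
$$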
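The guard and the state update of FH2 can be illustrated with a small executable sketch. It is not part of the LOTOS specification: the Python names (`index`, `cond_dmg`, `fh2_step`) and the choice SN = 2 are assumptions made only for this example, and the tuple `state` stands for the variables $G_1, \ldots, G_{3SN}$.

```python
SN = 2                  # assumed number of sectors per replica (illustrative value)
REPLICAS = (1, 2, 3)    # the three replicas of the TMR structure


def index(i, k):
    """0-based position of G_{(k-1)SN+i} inside the state tuple."""
    return (k - 1) * SN + (i - 1)


def cond_dmg(state, i, k):
    """cond_dmg(i,k): sector i is still marked 0 in every replica t != k."""
    return all(state[index(i, t)] == 0 for t in REPLICAS if t != k)


def fh2_step(state, i, k):
    """Perform dmg_k!i when the guard holds; return the new state, else None."""
    if not cond_dmg(state, i, k):
        return None
    new_state = list(state)
    new_state[index(i, k)] = 1      # G_{(k-1)SN+i} := 1
    return tuple(new_state)


initial = (0,) * (3 * SN)           # G_i = 0 for all i
s1 = fh2_step(initial, i=1, k=1)    # damage sector 1 in replica 1
print(s1)                           # (1, 0, 0, 0, 0, 0)
print(fh2_step(s1, i=1, k=2))       # None: sector 1 is already damaged in replica 1
print(fh2_step(s1, i=2, k=2))       # (1, 0, 0, 1, 0, 0): a different sector
```

Note that the guard only inspects the other replicas, so a second damage of the same sector in the same replica remains possible, as in the process above.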





A csf fault always involves two sectors, i.e. sectors $i$ and $j$ in the case of $csf(i, j)$. We have in fact that both $read(i)$ and $read(j)$ user requests will return the content of sector $i$ or that of sector $j$, depending on the last write that has been executed between $write(i, d)$ and $write(j, d')$, respectively. Thus, we model both $csf(i, j)$ and $csf(j, i)$ as the damage of the sectors $i$ and $j$. We put $G_{(k-1)SN+j} = 1$ and $G_{(k-1)SN+i} = 1$ after a $csf_k(j, i)$ or a $csf_k(i, j)$ fault in the $k$-th replica.

Let us assume $cond_{csf(i,j,k)}$ is the following formula: $\forall t = 1, 2, 3$ such that $t \neq k$, $G_{(t-1)SN+i} = 0$ and $G_{(t-1)SN+j} = 0$.

$$
\begin{array}{l}
\textbf{process } FH3[csf_1, csf_2, csf_3](G_1 : nat, \ldots, G_{SN} : nat, \\
\qquad G_{SN+1} : nat, \ldots, G_{2SN} : nat, \; G_{2SN+1} : nat, \ldots, G_{3SN} : nat) : \textbf{noexit} := \\
\quad \sum_{i,j \in \{1..SN\},\, k = 1,2,3} \; [cond_{csf(i,j,k)}] \rightarrow csf_k\,!i\,!j; \\
\qquad FH3[csf_1, csf_2, csf_3](G_1, \ldots, G_{(k-1)SN+(i-1)}, 1, G_{(k-1)SN+(i+1)}, \ldots, \\
\qquad\qquad G_{(k-1)SN+(j-1)}, 1, G_{(k-1)SN+(j+1)}, \ldots, G_{3SN}) \\
\textbf{endproc}
\end{array}
$$
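As a rough illustration of FH3, the sketch below mirrors $cond_{csf(i,j,k)}$ and the marking of both sectors after a csf fault. The function names and SN = 2 are assumptions introduced for the example only; the tuple `state` again stands for $G_1, \ldots, G_{3SN}$.

```python
SN = 2                  # assumed number of sectors per replica (illustrative value)
REPLICAS = (1, 2, 3)


def index(i, k):
    """0-based position of G_{(k-1)SN+i} inside the state tuple."""
    return (k - 1) * SN + (i - 1)


def cond_csf(state, i, j, k):
    """cond_csf(i,j,k): sectors i and j are marked 0 in every replica t != k."""
    return all(
        state[index(i, t)] == 0 and state[index(j, t)] == 0
        for t in REPLICAS
        if t != k
    )


def fh3_step(state, i, j, k):
    """Perform csf_k!i!j when the guard holds; return the new state, else None."""
    if not cond_csf(state, i, j, k):
        return None
    new_state = list(state)
    new_state[index(i, k)] = 1      # both sectors involved in the csf fault
    new_state[index(j, k)] = 1      # are marked as damaged in replica k
    return tuple(new_state)


initial = (0,) * (3 * SN)
s1 = fh3_step(initial, i=1, j=2, k=2)   # csf on sectors 1 and 2 in replica 2
print(s1)                               # (0, 0, 1, 1, 0, 0)
print(fh3_step(s1, i=2, j=1, k=3))      # None: both sectors already hit in replica 2
```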

The process corresponding to Fault Hypothesis 1 is therefore obtained by defining a unique process which maintains the state variables $G_i$ and whose behaviour expression corresponds to a nondeterministic choice among the set of actions stated in process FH2 and those stated in FH3. We have:

$$
\begin{array}{l}
\textbf{process } FH1[dmg_1, dmg_2, dmg_3, csf_1, csf_2, csf_3](G_1 : nat, \ldots, G_{SN} : nat, \\
\qquad G_{SN+1} : nat, \ldots, G_{2SN} : nat, \; G_{2SN+1} : nat, \ldots, G_{3SN} : nat) : \textbf{noexit} := \\
\quad (\; \sum_{i \in \{1..SN\},\, k = 1,2,3} \; [cond_{dmg(i,k)}] \rightarrow dmg_k\,!i; \\
\qquad FH1[dmg_1, dmg_2, dmg_3, csf_1, csf_2, csf_3](G_1, \ldots, G_{(k-1)SN+(i-1)}, 1, G_{(k-1)SN+(i+1)}, \ldots, G_{3SN}) \\
\quad [] \\
\quad \sum_{i,j \in \{1..SN\},\, k = 1,2,3} \; [cond_{csf(i,j,k)}] \rightarrow csf_k\,!i\,!j; \\
\qquad FH1[dmg_1, dmg_2, dmg_3, csf_1, csf_2, csf_3](G_1, \ldots, G_{(k-1)SN+(i-1)}, 1, G_{(k-1)SN+(i+1)}, \ldots, \\
\qquad\qquad G_{(k-1)SN+(j-1)}, 1, G_{(k-1)SN+(j+1)}, \ldots, G_{3SN}) \;) \\
\textbf{endproc}
\end{array}
$$
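Because FH1 keeps a single vector of variables $G_i$ for both kinds of fault, one fault can disable faults of the other kind. The following sketch enumerates the fault actions that FH1 offers in a given state. It is only an illustration: the names `enabled_faults` and `apply_fault`, the tuple encoding of actions and SN = 2 are assumptions, and the index sets follow the sums above literally (so $i = j$ is not excluded).

```python
SN = 2                  # assumed number of sectors per replica (illustrative value)
REPLICAS = (1, 2, 3)


def index(i, k):
    """0-based position of G_{(k-1)SN+i} inside the state tuple."""
    return (k - 1) * SN + (i - 1)


def untouched_elsewhere(state, sectors, k):
    """Common shape of cond_dmg and cond_csf: all sectors are 0 in replicas t != k."""
    return all(state[index(s, t)] == 0 for s in sectors for t in REPLICAS if t != k)


def enabled_faults(state):
    """All fault actions offered by FH1 in the given state."""
    sectors = range(1, SN + 1)
    actions = []
    for k in REPLICAS:
        for i in sectors:
            if untouched_elsewhere(state, (i,), k):
                actions.append(("dmg", k, i))
            for j in sectors:
                if untouched_elsewhere(state, (i, j), k):
                    actions.append(("csf", k, i, j))
    return actions


def apply_fault(state, action):
    """Mark as damaged, in replica k, every sector named by the action."""
    _kind, k, *sectors = action
    new_state = list(state)
    for s in sectors:
        new_state[index(s, k)] = 1
    return tuple(new_state)


initial = (0,) * (3 * SN)
after_dmg = apply_fault(initial, ("dmg", 1, 1))     # dmg_1!1
print(len(enabled_faults(initial)))                 # 18
print(len(enabled_faults(after_dmg)))               # 10
print(("csf", 2, 1, 2) in enabled_faults(after_dmg))  # False
```

After the damage of sector 1 in replica 1, a csf fault involving sector 1 in replica 2 is no longer offered, which is exactly the "at most one replica" constraint of Fault Hypothesis 1.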

We can now prove that the fault tolerant system design tolerates faults according to Fault Hypothesis 1 by proving that the system obtained by the parallel composition of FH1 and $FTS_{Cdisk}$, with synchronisation on all the common gates and with hiding of the faults and of all the actions added by the fault tolerant technique (step 6 in the framework), is observationally equivalent to the correct system specification $Cdisk$ (step 7 in the framework). Let us denote by $\overline{FTS}_{Cdisk}$ the fault tolerant system under the fault hypothesis. We have that:

$$
\begin{array}{l}
\textbf{process } \overline{FTS}_{Cdisk}[read, write, content] : \textbf{noexit} := \\
\quad \textbf{hide } dmg_1, dmg_2, dmg_3, csf_1, csf_2, csf_3 \textbf{ in} \\
\quad (\, FTS_{Cdisk}[read, write, content, dmg_1, dmg_2, dmg_3, csf_1, csf_2, csf_3](0, \ldots, 0, \; 0, \ldots, 0, \; 1, \ldots, SN) \\
\qquad |[dmg_1, dmg_2, dmg_3, csf_1, csf_2, csf_3]| \\
\quad\;\; FH1[dmg_1, dmg_2, dmg_3, csf_1, csf_2, csf_3](0, \ldots, 0) \,) \\
\textbf{endproc}
\end{array}
$$

Both the $FTS_{Cdisk}$ and FH1 processes are free to engage independently in any action that is not in the other's set of observable actions, but they have to engage simultaneously in all the actions that are observable in both of them. We can formally prove that the system specification $\overline{FTS}_{Cdisk}$ is observationally equivalent to the correct system specification: $\overline{FTS}_{Cdisk} \approx Cdisk$.
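The effect of synchronising on the common gates and then hiding the fault gates can be sketched on explicit labelled transition systems. This is a toy model, not the semantics used by any LOTOS tool: the dictionary encoding, the functions `compose` and `hide`, and the two small example systems (including a fault process that allows a single `dmg1`) are all assumptions made for illustration.

```python
def compose(lts_a, init_a, lts_b, init_b, sync):
    """Reachable transitions of A |[sync]| B; composite states are pairs."""
    start = (init_a, init_b)
    seen, todo, transitions = {start}, [start], []
    while todo:
        a, b = todo.pop()
        moves = []
        for act, a2 in lts_a.get(a, []):
            if act not in sync:
                moves.append((act, (a2, b)))            # A moves alone
            else:
                for act_b, b2 in lts_b.get(b, []):
                    if act_b == act:
                        moves.append((act, (a2, b2)))   # joint move on a common gate
        for act, b2 in lts_b.get(b, []):
            if act not in sync:
                moves.append((act, (a, b2)))            # B moves alone
        for act, target in moves:
            transitions.append(((a, b), act, target))
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return transitions


def hide(transitions, hidden):
    """Rename every hidden action to the internal action i."""
    return [(s, "i" if act in hidden else act, t) for s, act, t in transitions]


# Hypothetical system: a damage may occur at any time between write and read.
system = {
    "s0": [("write", "s1")],
    "s1": [("dmg1", "s1"), ("read", "s2")],
    "s2": [("content", "s0")],
}
# Hypothetical fault constraint: dmg1 may occur at most once.
fault_constraint = {"f0": [("dmg1", "f1")]}

composed = compose(system, "s0", fault_constraint, "f0", sync={"dmg1"})
visible = hide(composed, hidden={"dmg1"})
print(sum(1 for _, act, _ in composed if act == "dmg1"))   # 1
print(sorted({act for _, act, _ in visible}))
```

The constraint process restricts how often the system can perform the fault action, and after hiding the fault shows up only as an internal step.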

Let us consider instead the case in which the two different kinds of faults are assumed to be independent; in this case the fault hypothesis is given by the process FH4, corresponding to the independent parallel execution of the processes corresponding to Fault Hypotheses 2 and 3.

$$
\begin{array}{l}
\textbf{process } FH4[dmg_1, dmg_2, dmg_3, csf_1, csf_2, csf_3] : \textbf{noexit} := \\
\quad FH2[dmg_1, dmg_2, dmg_3](0, \ldots, 0) \;\; ||| \;\; FH3[csf_1, csf_2, csf_3](0, \ldots, 0) \\
\textbf{endproc}
\end{array}
$$
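The difference between the shared state of FH1 and the two independent state vectors of FH4 can be made concrete with a small sketch. The encoding of a fault as a tuple `(kind, replica, sectors)`, the function names and SN = 2 are assumptions made for this illustration only.

```python
SN = 2                  # assumed number of sectors per replica (illustrative value)
REPLICAS = (1, 2, 3)
ZERO = (0,) * (3 * SN)  # all variables G_i initially 0


def index(i, k):
    """0-based position of G_{(k-1)SN+i} inside the state tuple."""
    return (k - 1) * SN + (i - 1)


def guard(state, sectors, k):
    """All the sectors are still marked 0 in every replica t != k."""
    return all(state[index(s, t)] == 0 for s in sectors for t in REPLICAS if t != k)


def mark(state, sectors, k):
    """Mark the sectors as damaged in replica k."""
    new_state = list(state)
    for s in sectors:
        new_state[index(s, k)] = 1
    return tuple(new_state)


def allowed_by_fh1(trace):
    """FH1: dmg and csf faults are checked against one shared state vector."""
    g = ZERO
    for _kind, k, sectors in trace:
        if not guard(g, sectors, k):
            return False
        g = mark(g, sectors, k)
    return True


def allowed_by_fh4(trace):
    """FH4 = FH2 ||| FH3: each kind of fault has its own state vector."""
    g = {"dmg": ZERO, "csf": ZERO}
    for kind, k, sectors in trace:
        if not guard(g[kind], sectors, k):
            return False
        g[kind] = mark(g[kind], sectors, k)
    return True


# Damage of sector 1 in replica 1, then a csf on sectors 1 and 2 in replica 2.
trace = [("dmg", 1, (1,)), ("csf", 2, (1, 2))]
print(allowed_by_fh1(trace))   # False
print(allowed_by_fh4(trace))   # True
```

Since FH2 and FH3 do not see each other's variables, FH4 admits a damage and a csf fault on the same sector in two different replicas.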



We can verify that the fault tolerant technique does not cover this fault hypothesis. We formally prove that the system specification $\overline{FTS}_{Cdisk}$, obtained at step 6 of the framework from $FTS_{Cdisk}$ under FH4, is not observationally equivalent to $Cdisk$: $\overline{FTS}_{Cdisk} \not\approx Cdisk$.

Fault Hypothesis 1 states that for each sector of the disk, if we consider the TMR structure associated with the sector, a single fault is allowed, which corresponds to a damage and/or a control system fault of the sector. Fault Hypothesis 4 states instead that for each sector of the disk, if we consider the TMR structure associated with the sector, a multiple fault is allowed, which corresponds to a damage of the sector in one replica and to a control system fault of the same sector in another replica. Damage and control system faults are tolerated only if they occur according to Fault Hypothesis 1; they are not tolerated if they can occur independently, according to Fault Hypothesis 4.

Let us assume the user requests a write of the value $d$ onto sector $i$ and subsequently a read of sector $i$. Let us consider the case in which a damage of sector $i$ occurs after the write and before the read request. Let T1 be the LTS of the correct system and T2 be the LTS of the fault tolerant system under Fault Hypothesis 1. T1 and T2 are reported in Fig. 3(a). In T2 the first i action corresponds to the occurrence of the damage fault in the first replica, while the internal actions after the read correspond to the data sent by the replicas to the Voter. The LTSs T1 and T2 are observationally equivalent, and the broken line indicates bisimilar states.

Under Fault Hypothesis 4, the damage of sector $i$ and a control system fault involving sector $i$ are allowed to occur in two different replicas. Let us assume the damage occurs in the first replica and the control system fault occurs in the second replica. The LTS T3 of the fault tolerant system when both faults occur before the read operation is that reported in Fig. 3(b), where sector $j$ is supposed to contain the value $d'$. Under the condition that $d \neq d'$, we have that T3 is not observationally equivalent to T1, since the Voter receives three different values and it may return any value to the user.
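The two scenarios can be mimicked with a minimal voter sketch. The function `voter_outputs` is a hypothetical stand-in, not the Voter specified in the paper: it assumes a majority voter that delivers the majority value when one exists and may otherwise deliver any of the received values, which is the behaviour described above. The strings `"cd"`, `"d"` and `"d'"` stand for the corrupted data and the two written values.

```python
from collections import Counter


def voter_outputs(values):
    """Set of values a majority voter may deliver for the three replica replies."""
    value, count = Counter(values).most_common(1)[0]
    if count >= 2:
        return {value}      # a majority exists: the outcome is determined
    return set(values)      # no majority: assumed free choice among the replies


# Under FH1: replica 1 is damaged and returns corrupted data cd.
print(voter_outputs(["cd", "d", "d"]))      # {'d'}

# Under FH4: replica 1 is damaged and replica 2 suffers a csf fault (returns d').
print(voter_outputs(["cd", "d'", "d"]))     # the three values, in some order
```

In the first case the user can only observe content!d, as in T1; in the second case the possible outcomes include values other than d, which T1 cannot match.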

4.2 Automatic Verification

The proof of observational equivalence between specifications can be automated by using the AUTO tool [8], which builds the finite state automaton of a specification, if it exists, and, given two automata, can test whether they are observationally equivalent. Observational equivalence is in fact decidable on finite state automata.
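To make the decidability claim concrete, the sketch below checks observational equivalence (weak bisimulation) on small finite LTSs by naive refinement of a relation over states. It is not the algorithm implemented in AUTO; the encoding, the state names and the three example systems are assumptions, loosely modelled on the T1, T2 and T3 scenarios, with the free choice of the Voter in T3 rendered as a three-way branch.

```python
TAU = "i"   # the LOTOS internal action


def tau_closure(lts, state):
    """States reachable from state through zero or more internal steps."""
    seen, todo = {state}, [state]
    while todo:
        s = todo.pop()
        for act, t in lts.get(s, []):
            if act == TAU and t not in seen:
                seen.add(t)
                todo.append(t)
    return seen


def weak_moves(lts, state):
    """Pairs (a, t) with state =a=> t; (TAU, t) means zero or more internal steps."""
    moves = set()
    for s in tau_closure(lts, state):
        moves.add((TAU, s))
        for act, t in lts.get(s, []):
            if act != TAU:
                moves.update((act, u) for u in tau_closure(lts, t))
    return moves


def observationally_equivalent(lts, p, q):
    """Greatest weak bisimulation, computed by deleting violating pairs."""
    states = set(lts) | {t for outs in lts.values() for _, t in outs}
    weak = {s: weak_moves(lts, s) for s in states}
    rel = {(a, b) for a in states for b in states}

    def simulated(x, y, flip):
        # every weak move of x is matched by a weak move of y into a related pair
        return all(
            any(act2 == act and ((y2, x2) if flip else (x2, y2)) in rel
                for act2, y2 in weak[y])
            for act, x2 in weak[x]
        )

    changed = True
    while changed:
        changed = False
        for a, b in list(rel):
            if not (simulated(a, b, False) and simulated(b, a, True)):
                rel.discard((a, b))
                changed = True
    return (p, q) in rel


lts = {
    # T1-like: the correct system
    "a0": [("write", "a1")], "a1": [("read", "a2")], "a2": [("content_d", "a3")],
    # T2-like: one hidden fault, then three hidden replica replies
    "b0": [("write", "b1")], "b1": [(TAU, "b2")], "b2": [("read", "b3")],
    "b3": [(TAU, "b4")], "b4": [(TAU, "b5")], "b5": [(TAU, "b6")],
    "b6": [("content_d", "b7")],
    # T3-like: two hidden faults; the voter may deliver any of three values
    "c0": [("write", "c1")], "c1": [(TAU, "c2")], "c2": [(TAU, "c3")],
    "c3": [("read", "c4")], "c4": [(TAU, "c5")], "c5": [(TAU, "c6")],
    "c6": [(TAU, "c7")],
    "c7": [("content_d", "c8"), ("content_d2", "c8"), ("content_cd", "c8")],
}
print(observationally_equivalent(lts, "a0", "b0"))   # True
print(observationally_equivalent(lts, "a0", "c0"))   # False
```

The refinement is quadratic in the number of states per pass, which is acceptable only for toy examples; it hints at why the size of the automaton limits what can be checked in practice.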

Fig. 3. The behaviour of $\overline{FTS}_{Cdisk}$ under FH1 and FH4: (a) LTSs T1 and T2; (b) LTS T3.

Since AUTO works on Basic LOTOS specifications (Basic LOTOS is a subset of LOTOS without data), we need to transform the full LOTOS specification of the system into a specification in Basic LOTOS. Since data are of main concern in the correctness property of the system, we need to restrict them to a finite set of values, and the transformation will associate with each original gate a new gate for each different value that can be exchanged at the gate. In the general case, the Basic LOTOS specification of the system will be quite large. However, to prove the correctness of the stable storage disk design, it is enough to prove the correctness of the same design reduced to a small number of sectors and a few different kinds of stored information. The proof that the fault tolerant technique covers Fault Hypothesis 1 has been done automatically for $SN = 2$ and $DA = \{0, 1, cd\}$. The automaton has 10 states and 30 transitions (see Fig. 4, where writejd corresponds to write!j!d and readj corresponds to read!j). Similarly, we can prove that the fault tolerant technique covers Fault Hypothesis 2 and Fault Hypothesis 3, while it does not cover Fault Hypothesis 4.
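The gate transformation described above can be sketched as follows. The helper `expand_gates` is hypothetical and only reproduces the naming convention of Fig. 4 (writejd for write!j!d, readj for read!j); it says nothing about how the behaviour expressions themselves are rewritten, and the choice of which gates carry which value domains is an assumption based on the text.

```python
from itertools import product

SECTORS = (1, 2)            # SN = 2
DATA = ("0", "1", "cd")     # DA = {0, 1, cd}


def expand_gates(gate, *domains):
    """One Basic LOTOS gate name for every tuple of values offered at the gate."""
    return [gate + "".join(str(v) for v in values) for values in product(*domains)]


write_gates = expand_gates("write", SECTORS, DATA)    # write!j!d  ->  writejd
read_gates = expand_gates("read", SECTORS)            # read!j     ->  readj
content_gates = expand_gates("content", DATA)         # content!d  ->  contentd

print(write_gates)
print(read_gates)
print(content_gates)
```

The number of new gates is the product of the sizes of the value domains, which is why the value sets have to be kept small for the automaton to remain manageable.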

In [1] we have applied the approach to an alternative fault tolerant technique, based on an error detection mechanism and a pair of mirrored disks, analysing the set of tolerated faults.

Fig. 4. Automaton of the fault tolerant system

5 Conclusions

A lot of effort has been put into the formalisation of fault tolerance in the literature [4, 5, 6, 12, 14, 17, 18, 19, 20, 21, 22, 23, 24]. In this paper we have shown how it is possible to formally reason about the fault coverage of a fault tolerant technique. Actions which explicitly model fault occurrences have been introduced for the study of the fault coverage of the technique, while these actions are hidden in the global specification of the fault tolerant design and only the effects of faults on the externally visible behaviour of the original system are modelled. Observational equivalence is then used to prove that a fault tolerant technique tolerates a set of faults according to a fault hypothesis. We illustrated the method by applying it to the design of a stable storage disk.

The automatic verification of observational equivalence is supported by already developed verification tools. The usual drawback of observational equivalence is that, being defined on the underlying automata, its computation requires all the information on states and transitions collected in this global structure. This limits the size of the specifications to which tools for the automatic checking of observational equivalence can successfully be applied. In our case, we need to translate our full LOTOS specification into Basic LOTOS, reducing the size of the value sets involved in order to reduce the state space. Recent advances in verification techniques, like the development of equivalence checking tools based on Binary Decision Diagrams [3] or on a notion of symbolic bisimulation [11], could help significantly to avoid the cited drawbacks.

References

1. Bernardeschi, C., Fantechi, A., Simoncini, L.: A formal framework for verifying fault tolerant systems. Internal Report IR-BFS1-93, Department of Information Engineering, University of Pisa (1993) (available on request from the authors)
2. Bolognesi, T., Brinksma, E.: Introduction to the ISO specification language LOTOS. The Formal Description Technique LOTOS, Elsevier Science Publishers B.V., North-Holland (1989) 23–73
3. Bouali, A., De Simone, R.: Symbolic bisimulation minimisation. Proc. Computer Aided Verification '92, LNCS 663 (1992) 96–108
4. Cau, A., de Roever, W.: Specifying fault tolerance within Stark's formalism. Proc. FTCS'23, Toulouse, France (1992) 392–401
5. Cristian, F.: A rigorous approach to fault tolerant programming. IEEE Transactions on Software Engineering, 11 (1), (1985) 23–31
6. De Boer, F.S., Coenen, J., Gerth, R.: Exception handling in process algebra. Proc. 1st North American Process Algebra Workshop, Workshop in Computing Series, Springer-Verlag (1993)
7. De Nicola, R.: Extensional equivalences for transition systems. Acta Informatica 24 (1987) 211–237
8. De Simone, R., Vergamini, D.: Aboard AUTO. Technical Report RT111, INRIA (1989)
9. van Eijk, P.: Tool demonstration: the Lotosphere Integrated Tool Environment LITE. Formal Description Techniques, IV, North-Holland (1992) 471–474
10. Laprie, J.C. (ed.): Dependability: basic concepts and terminology. Dependable Computing and Fault-Tolerant Systems, 5, Springer-Verlag (1992)
11. Lin, H.: A verification tool for value passing processes. Proc. Protocol Specification, Testing and Verification, XIII, North-Holland (1993) B1.1–B1.13
12. Liu, Z., Joseph, M.: Transformation of programs for fault tolerance. Formal Aspects of Computing, 4 (1992) 442–469
13. Johnson, B.: Design and analysis of fault tolerant systems. Addison-Wesley Publishing Company (1989)
14. Mancini, L.V., Pappalardo, G.: Towards a theory of replicated processing. Proc. Symposium on Formal Techniques in Real-time and Fault Tolerant Systems, LNCS 331 (1992) 175–192
15. Milner, R.: A calculus of communicating systems. LNCS 92, Springer-Verlag (1980)
16. Milner, R.: Communication and concurrency. Prentice-Hall International, Englewood Cliffs (1989)
17. Nordahl, J.: Design for dependability. In: C.E. Landwehr, B. Randell, L. Simoncini (eds.): Dependable Computing for Critical Applications 3. Dependable Computing and Fault-Tolerant Systems, 8, Springer-Verlag (1992) 65–89
18. Peled, D., Joseph, M.: A compositional approach for fault-tolerance using specification transformation. Proc. PARLE'93, LNCS 649 (1993) 173–184
19. Peleska, J.: Design and verification of fault tolerant systems with CSP. Distributed Computing, 5 (2), (1990) 95–106
20. Prasad, K.V.S.: Specification and proof of a simple fault tolerant system in CCS. Internal Report CSR-178-84, Department of Computer Science, University of Edinburgh (1984)
21. Schepers, H.: Tracing fault tolerance. In: C.E. Landwehr, B. Randell, L. Simoncini (eds.): Dependable Computing for Critical Applications 3. Dependable Computing and Fault-Tolerant Systems, 8, Springer-Verlag (1992) 91–110
22. Schepers, H., Hooman, J.: Trace-based compositional reasoning about fault tolerant systems. Proc. PARLE'93, LNCS 649 (1993) 197–208
23. Schepers, H., Gerth, R.: A compositional proof theory for fault tolerant real-time distributed systems. Proc. 12th Symposium on Reliable Distributed Systems (1993) 34–43
24. Weber, D.G.: Formal specification of fault-tolerance and its relation to computer security. ACM Software Engineering Notes, 14 (3), (1989) 273–277

This article was processed using the LaTEX macro package with LLNCS style
