Fault Tolerance in a Layered Architecture: a General Specification Pattern in B

Linas Laibinis Åbo Akademi University, Department of Computer Science, Lemminkäisenkatu 14, FIN-20520 Turku, Finland

Elena Troubitsyna Åbo Akademi University, Department of Computer Science, Lemminkäisenkatu 14, FIN-20520 Turku, Finland

Turku Centre for Computer Science TUCS Technical Report No 609 May 2004 ISBN 952-12-1350-7 ISSN 1239-1891

Abstract

Dependable control systems are usually complex and prone to errors of various kinds. Such systems are often built in a modular and layered fashion. To guarantee system dependability, we need to develop software that is not only fault-free but is also able to cope with faults of other system components. In this paper we propose a general formal specification pattern that can be recursively applied to specify fault tolerance mechanisms at each architectural layer. Iterative application of this pattern via stepwise refinement in the B Method results in the development of a layered fault tolerant system that is correct by construction. We demonstrate the proposed approach by an excerpt from a realistic case study: the development of the liquid handling workstation Fillwell™.

Keywords: formal methods, layered architecture, fault tolerance, B Method.

TUCS Laboratory Distributed Systems Laboratory

1. Introduction

Dependable control systems [8,14] are usually complex systems composed of components belonging to different physical domains. Developing such systems in a modular and layered fashion allows designers to map the physical domains onto the layers of the controlling software and is traditionally recognized as an effective way to manage system complexity [11]. The components at each architectural layer are vulnerable to various kinds of faults. To meet the system's dependability requirements, the controlling software should cope with these faults. Therefore, fault tolerance mechanisms [2,3,9] such as error detection and error recovery should be integrated into each architectural layer. Although components at each layer are susceptible to specific kinds of faults, they employ similar mechanisms for error detection and recovery. Essentially, this is the mechanism of exception raising and handling [3,4].

In this paper we propose a general specification pattern that can be recursively applied to formally specify exception raising and handling at each architectural layer. Moreover, we propose a formal systematic approach to developing layered fault tolerant systems by iterative application of this pattern. Our approach is based on stepwise refinement of a formal system model in the B Method [1,12]. While developing a system by refinement, we start from an abstract specification and step by step incorporate implementation details into it until executable code is obtained. The general idea of the development process presented in this paper is to start from an abstract specification of the upper layer and gradually add lower layers by refinement. Each refinement step elaborates on the specification of the upper layer component(s) and creates an abstract specification of the lower layer component(s). Each added layer detects and handles specific classes of errors represented as exceptions.
Exceptions that cannot be handled at a certain layer are propagated to the upper layers. As a result, error recovery (modelled as exception handling) has a hierarchical structure. The proposed approach allows us to smoothly incorporate reasoning about fault tolerance into the software development process. It results in the development of fault tolerant layered systems that are correct by construction. We argue that application of our approach allows us to achieve a higher degree of dependability, since it enables formal reasoning not only about normal functional behaviour but also about fault tolerance.

We proceed as follows. In Section 2 we briefly outline the basic modelling concepts of the B Method and demonstrate how to model communication between layers in a fault-free layered system. Section 3 describes our main contribution: a general specification and development pattern for fault tolerant layered systems. In Section 4 we illustrate our approach by a realistic case study, the development of the liquid handling workstation Fillwell™ [5]. Finally, Section 5 concludes with a discussion of the proposed approach as well as an overview of future and related work.

2. Modelling layered architectures in B

Modelling with B. The B Method [1] (further referred to as B) is an approach for the industrial development of highly dependable software. The method has been successfully used in the development of several complex real-life applications [10]. The tool support available for B assists the entire development process. For instance, Atelier B [13], one of the tools supporting the B Method, has facilities for automatic verification and code generation as well as documentation, project management and prototyping. The high degree of automation in verifying correctness improves the scalability of B, speeds up development and requires less mathematical training from the users.

The development methodology adopted by B is based on stepwise refinement [1]. While developing a system by refinement, we start from an abstract formal specification and transform it into an implementable program by a number of correctness-preserving steps, called refinements. A formal specification is a mathematical model of the required behaviour of a system (or a part of it). In B a specification is represented by a set of modules, called Abstract Machines. A common pseudo-programming notation, called Abstract Machine Notation (AMN), is used in constructing and formally verifying them. An abstract machine encapsulates the state and the operations of the specification and has the following general form:

MACHINE MachineName
SETS
    Definition of local types
VARIABLES
    list of variables
INVARIANT
    constraining predicates of variables and invariant properties of the machine
INITIALIZATION
    parallel assignment of initial values to variables
OPERATIONS
    OpName_1 = …
    …
    OpName_N = …
END

Each machine is uniquely identified by its name. The state variables of the machine are declared in the VARIABLES clause and initialized in the INITIALIZATION clause. The variables in B are strongly typed by the constraining predicates of the INVARIANT clause. The constraining predicates are joined by conjunction (denoted as &). All types in B are represented by non-empty sets, and hence set membership (denoted as :) expresses a typing constraint for a variable, e.g., x : TYPE. Local types can be introduced by enumerating the elements of the type, e.g., TYPE = {element1, element2, …}.

We can also define local types as deferred sets. In this case we introduce a new name for a type without providing any further information about the precise nature of its elements. Basically, we defer the actual definition until some later development stage.

The operations of the machine are defined in the OPERATIONS clause. The operations are atomic, meaning that once an operation is chosen, its execution runs to completion without interference. There are two standard ways to describe an operation in B: either by the preconditioned operation PRE cond THEN body END or by the guarded operation SELECT cond THEN body END. Here cond is a state predicate, and body is a B statement. If cond is satisfied, the behaviour of both the preconditioned operation and the guarded operation corresponds to the execution of their bodies. However, these operations behave differently when an attempt is made to execute them from a state where cond is false. In this case the preconditioned operation leads to a crash (i.e., unpredictable or even non-terminating behaviour) of the system, while the guarded operation blocks itself, waiting until cond becomes true.

Preconditioned operations are used to describe operations that will be implemented as procedures that can be called by the user. Guarded operations, on the other hand, are useful when we have to specify so-called event-based (reactive) systems. A SELECT operation then describes the reaction of the system when a particular event occurs.

The B statements that we use to describe a state change in operations have the following syntax:

S ==   x := e   |   S1 ; S2    |   IF cond THEN S1 ELSE S2 END
   |   x :: T   |   S1 || S2   |   ANY z WHERE cond THEN S END   |   ...

The first three constructs (an assignment, a sequential composition, which is used only in refinements, and a conditional statement) have the standard meaning. The remaining constructs allow us to model nondeterministic or parallel behaviour in a specification. Usually they are not implementable, so they have to be refined (replaced) with executable constructs at some point of program development. In our modelling of control systems we use two kinds of nondeterministic statements: the nondeterministic assignment x :: T and the nondeterministic block ANY z WHERE Q THEN S END. The nondeterministic assignment x :: T assigns to variable x an arbitrary value from the given set (type) T. The nondeterministic block ANY z WHERE Q THEN S END introduces a new local variable z, which is initialised (possibly nondeterministically) according to predicate Q and then used in S. Finally, S1 || S2 models parallel (simultaneous) execution of S1 and S2. A special case of parallel composition is a multiple assignment, denoted as x,y := e1,e2.

The B Method provides us with mechanisms for structuring the system architecture by modularization. In this paper we use the INCLUDES mechanism as our main compositionality technique. A module is described as a machine. If machine C "includes" machine D, then all variables and operations of D are visible in C. However, the only way to change the variables of D is via the operations of D. Also, the invariant properties of D are automatically included into the invariant of C. Therefore, machine C can be considered an extension of machine D.
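The difference between preconditioned and guarded operations, and the meaning of the nondeterministic assignment, can be illustrated by a small executable sketch. It is written in Python rather than B and is only an approximation under stated assumptions: the names pre_op, select_op, nondet_assign and PreconditionViolated are hypothetical, a state is a plain dictionary, the undefined behaviour of a violated precondition is represented by an exception, and a blocked guarded operation is represented by returning None.

```python
import random


class PreconditionViolated(Exception):
    """Stands in for the undefined behaviour of a PRE operation called outside cond."""


def pre_op(cond, body, state):
    """PRE cond THEN body END: calling it when cond is false is an error of the caller."""
    if not cond(state):
        raise PreconditionViolated("operation called outside its precondition")
    return body(state)


def select_op(cond, body, state):
    """SELECT cond THEN body END: when cond is false the operation is not enabled."""
    if not cond(state):
        return None  # blocked: nothing happens, the state is left untouched
    return body(state)


def nondet_assign(values, rng):
    """x :: T, approximated by a pseudo-random choice from the finite set T."""
    return rng.choice(sorted(values))


def positive(state):
    """Condition shared by both example operations: x > 0."""
    return state["x"] > 0


def decrement(state):
    """Body shared by both example operations: x := x - 1."""
    return {**state, "x": state["x"] - 1}
```

With x = 0, select_op(positive, decrement, state) returns None, whereas pre_op(positive, decrement, state) raises the exception. A real B tool treats the two cases through proof obligations rather than run-time checks, so the sketch only mirrors the informal reading given in the text.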

Event-based modelling and layered architectures. Event-based modelling has proven its worth in the design of complex parallel and reactive systems [10]. As noted above, an event-based system consists of a set of SELECT-guarded operations. Any enabled operation, i.e., an operation whose guard is true, can be chosen for execution. Even while all operations are disabled, the system is still considered to be running, but in a "waiting" or "hibernating" mode.

In this paper we focus on the design of controllers for fault tolerant control systems. Usually the development of a control system spans several engineering domains, such as mechanical engineering, software engineering, operator interface design, etc. It is widely recognized that a layered architecture is preferable in designing such complex control systems, since it allows developers to map real-world domains into software layers [11]. Usually the lowest layer contains embedded real-time subsystems which directly communicate with sensors and actuators, the electro-mechanical devices used to monitor and control the plant. These subsystems cyclically execute the standard control loop, which consists of reading the sensors and assigning new states to the actuators. The layer above contains the components that encapsulate the detailed behaviour of the lowest-level subsystems by providing abstract interfaces to them. At the highest level of the hierarchy is the component server, which serves as an interface between the operator and the components.

Let us describe the communication between the layers of a fault-free system (we discuss errors and fault tolerance in the next section). The operator interacts with the system by placing requests to execute certain services. A service is an encapsulation of a set of commands to be executed by the components. Upon receiving a request to execute a service, the component server first translates (decomposes) the service into the corresponding sequence of commands. Then it initiates and monitors the execution of the commands by placing corresponding requests on the components. In turn, the requested components further decompose these commands into lower-level commands to be executed by the real-time subsystems at the lowest level of the hierarchy. Upon completion of each command, the requested subsystem notifies the requesting component about the success of the execution. The component continues to place requests on the subsystems until the requested command is completed. Then it ceases its autonomous functioning and notifies the component server about the success of the execution. The communication between components is represented graphically in Fig. 1.

[Figure 1 (diagram not reproduced): the operator's request to execute a service is decomposed into Operation1 … OperationN; each operation is decomposed into sub-operations (SubOp1.1 … SubOp1.M, …, SubOpN.1 … SubOpN.M). Requests to execute flow downward (e.g., a request to execute Operation1, then SubOp1.1) and acknowledgements flow back upward.]

Figure 1. The structure of a system

Observe that behaviour of the components follows the same general pattern: the component is initially “dormant” but becomes active upon receiving a request to execute a certain command. In the active mode the component autonomously executes a command until completion. Then it returns the acknowledgement to the requesting component and becomes inactive again. Such behaviour can be abstractly described by the B specification of the following form:

MACHINE Component
VARIABLES flag
INVARIANT flag : {Executing, Stopping, Stopped}
INITIALISATION flag := Stopped
OPERATIONS
    Request(parameters) =
        PRE flag = Stopped THEN
            … || flag := Executing
        END;
    Execute =
        SELECT flag = Executing THEN
            IF “request is completed” THEN flag := Stopping
            ELSE …
            END
        END;
    Stop =
        SELECT flag = Stopping THEN
            … || flag := Stopped
        END
END

We use the preconditioned operation Request to model the activation of a component upon request. The autonomous behaviour of the component is modelled by the guarded operations Execute and Stop. Note that a component can be activated only if it is in a “dormant” state, i.e., its flag equals Stopped. Next we demonstrate how our simple event-based specification of a component can be enhanced to include faulty behaviour and fault tolerance mechanisms.
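The life cycle of such a component can be animated with a short Python sketch. It is an illustration only, not part of the B development: the class name, the work counter steps_needed (a hypothetical stand-in for "request is completed") and the scheduler function run are assumptions introduced here.

```python
class Component:
    """Executable approximation of the Component machine."""

    def __init__(self, steps_needed):
        self.flag = "Stopped"
        self.steps_needed = steps_needed  # hypothetical amount of work per request
        self.remaining = 0

    def request(self):
        """Request: preconditioned, may only be called on a dormant component."""
        if self.flag != "Stopped":
            raise RuntimeError("precondition flag = Stopped violated")
        self.remaining = self.steps_needed
        self.flag = "Executing"

    def execute(self):
        """Execute: guarded by flag = Executing."""
        if self.remaining == 0:  # "request is completed"
            self.flag = "Stopping"
        else:
            self.remaining -= 1

    def stop(self):
        """Stop: guarded by flag = Stopping."""
        self.flag = "Stopped"

    def enabled(self):
        """Return the guarded operation whose guard currently holds, if any."""
        return {"Executing": self.execute, "Stopping": self.stop}.get(self.flag)


def run(component):
    """Fire enabled guarded operations until none is left; return the flag trace."""
    trace = [component.flag]
    operation = component.enabled()
    while operation is not None:
        operation()
        trace.append(component.flag)
        operation = component.enabled()
    return trace
```

After request() the component runs autonomously: run returns a trace that ends in Stopping, Stopped, after which no guarded operation is enabled and a new request can be placed.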

3. Errors and exceptions in a layered architecture

Fault tolerance in layered architecture. The main goal of introducing fault tolerance is to design a system in such a way that faults of components do not result in a system failure [2,14]. While designing a controller, we should provide means for tolerating faults of various kinds. In this paper we focus on hardware faults and human errors. A fault of a component manifests itself as an error. Upon detection of an error, error recovery is performed. Error recovery is an attempt to restore a fault-free system state or at least to preclude system failure.

Next we investigate the behaviour of control systems in the presence of errors. In a fault-free system, after receiving a request to execute an operation, a component always succeeds in eventually completing it. However, the occurrence of errors might prevent a component from providing a required operation correctly. The component should detect the error and notify the requesting component about it, so that error recovery can be initiated. This behaviour coincides with the mechanism of exception raising and handling. Observe that for each component (except the lowest-level subsystems) we can identify two classes of exceptions:

1. generated exceptions: the exceptions raised by the component itself upon detection of an error, and
2. propagated exceptions: the exceptions raised at the lower layer but propagated to the component for handling.

The generated exceptions are propagated upwards (to the requesting component) for handling. Usually the component that raised an exception ceases its autonomous functioning, which models the fact that the component is unable to handle the erroneous situation. Note that the mechanism of notification about successful termination is the same as the mechanism of exception propagation, i.e., successful termination is a special case of an exception.

After receiving a propagated exception, a component evaluates and classifies it as 1) an acknowledgement of normal termination, 2) a signal indicating the occurrence of a recoverable error, or 3) a signal indicating the occurrence of an unrecoverable error. In the first case, the normal control flow continues. In the second case, error recovery is attempted from the current layer. In the third case, the exception is propagated further. Exception propagation stops at the layer that can handle the exception (i.e., the layer which classifies the error as recoverable and initiates error recovery). However, if an exception cannot be handled even at the uppermost layer, it is propagated to the operator. Observe that we design exception handling following the principle “the more critical the error, the higher the layer that can handle its exception”. Hence our exception handler has a hierarchical structure. This topic has been extensively discussed in the literature [2,3], so we omit further discussion here.

Fault tolerance: abstract specification. Next we augment the general specification scheme Component given in the previous section with mechanisms for exception handling. The resulting specification, the abstract specification FTComponent, is presented in Fig. 2. The specification defines a component raising and handling exceptions as described above. Generated and propagated exceptions are modelled by the variables exc and exc2 respectively. We abstract away from the implementation details of exceptions by choosing the deferred sets EXC and EXC2 as the types for exc and exc2. The exception evaluation functions Eval : EXC --> {OK, RECOV, UNRECOV} and Eval2 : EXC2 --> {OK, RECOV, UNRECOV} classify exceptions into three categories: normal termination, recoverable, and unrecoverable. The local state of the component is modelled by the variable state.
To model exception raising and handling, we extend the general scheme Component with the operations catch_and_handle and recover. The behaviour of a component is represented graphically in Fig. 3. Each phase of execution is specified by the corresponding operation. The value of the variable flag indicates the current phase. The conditions that the generated and propagated exceptions should satisfy when a component enters a particular phase are formulated as invariant properties. For example, the invariant conjunct

    (flag = Recovering => Eval(exc)=OK & Eval2(exc2)=RECOV)

means that the recovery operation starts only when no exception of the current layer has been raised and the propagated lower layer exception is evaluated as recoverable.

MACHINE FTComponent
VARIABLES flag, exc, exc2, state, recov_flag
INVARIANT
    flag : {Executing, Handling, Recovering, Stopping, Stopped} &
    exc : EXC & exc2 : EXC2 & state : STATE & recov_flag : BOOL &
    (flag = Executing => Eval(exc)=OK & Eval2(exc2)=OK) &
    (flag = Handling => Eval(exc)=OK) &
    (flag = Recovering => Eval(exc)=OK & Eval2(exc2)=RECOV) &
    (flag = Stopping => stop_cond(state)=TRUE or Eval(exc)/=OK or Eval2(exc2)=UNRECOV)
DEFINITIONS
    raise(ee) == exc := ee;
    not_raised == (Eval(exc) = OK) & (stop_cond(state)=FALSE) & (Eval2(exc2) /= OK)
INITIALISATION
    flag := Stopped || exc := Success || exc2 := Success2 ||
    state :: STATE || recov_flag := FALSE
OPERATIONS
    start(par) =
        PRE par : PARAM & flag = Stopped THEN
            IF Valid_param(par) = TRUE THEN
                state, flag := Init_state(par), Executing ||
                exc, exc2 := Success, Success2
            ELSE
                raise(Bad_param) || flag := Stopping
            END
        END;
    recover =
        SELECT flag = Recovering THEN
            state :: STATE || exc2 :: EXC2 ||
            recov_flag, flag := TRUE, Handling
        END;
    execute =
        SELECT flag = Executing THEN
            ANY new_exc WHERE new_exc : EXC THEN
                raise(new_exc) ||
                IF Eval(new_exc) /= OK or stop_cond(state) = TRUE THEN
                    flag := Stopping
                ELSE
                    exc2 :: EXC2 || flag := Handling || state :: STATE
                END
            END
        END;
    catch_and_handle =
        SELECT flag = Handling THEN
            IF Eval2(exc2) = OK THEN
                flag := Executing
            ELSIF Eval2(exc2) = RECOV THEN
                IF recov_flag = TRUE & one_time_recovery(exc2) = TRUE THEN
                    raise(Recovery_failed) || flag := Stopping
                ELSE
                    flag := Recovering
                END
            ELSE
                flag := Stopping
            END ||
            recov_flag := FALSE
        END;
    stop =
        SELECT flag = Stopping THEN
            state :: STATE || flag := Stopped ||
            IF not_raised THEN raise(Exc_trans(exc2)) END
        END
END

Figure 2. General specification of fault-tolerant component
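To make the phase transitions of Fig. 2 easier to follow, the machine can be approximated by a small Python simulation. This is a sketch under explicit assumptions, not the authors' model: the nondeterministic choice exc2 :: EXC2 is replaced by a scripted list lower_results, execute never chooses a failing new_exc, the state is reduced to a counter of commands still to be completed (decremented when the lower layer reports success), and the classification tables and the translation table EXC_TRANS are illustrative values (the former borrowed from the case study in Section 4, the latter hypothetical).

```python
OK, RECOV, UNRECOV = "OK", "RECOV", "UNRECOV"
EVAL = {"Success": OK, "Bad_param": RECOV,
        "Execution_failed": RECOV, "Recovery_failed": RECOV}
EVAL2 = {"Success2": OK, "Bad_param2": UNRECOV, "Failed_to_move": RECOV,
         "Pump_failed": UNRECOV, "Unsafe": UNRECOV}
# Hypothetical Exc_trans: every lower layer failure becomes Execution_failed.
EXC_TRANS = {e: "Execution_failed" for e, kind in EVAL2.items() if kind != OK}


class FTComponent:
    def __init__(self, commands, lower_results, one_time=frozenset()):
        self.left = commands               # state: commands still to be completed
        self.lower = iter(lower_results)   # scripted stand-in for exc2 :: EXC2
        self.one_time = one_time           # exceptions with one_time_recovery = TRUE
        self.flag, self.exc, self.exc2 = "Stopped", "Success", "Success2"
        self.recov_flag = False

    def start(self, valid=True):
        if self.flag != "Stopped":
            raise RuntimeError("precondition flag = Stopped violated")
        if valid:
            self.flag, self.exc, self.exc2 = "Executing", "Success", "Success2"
        else:
            self.exc, self.flag = "Bad_param", "Stopping"

    def execute(self):
        self.exc = "Success"               # simplification: new_exc is always Success
        if self.left == 0:                 # stop_cond(state) = TRUE
            self.flag = "Stopping"
        else:
            self.exc2, self.flag = next(self.lower), "Handling"

    def catch_and_handle(self):
        kind = EVAL2[self.exc2]
        if kind == OK:
            self.left -= 1                 # simplification: count the finished command
            self.flag = "Executing"
        elif kind == RECOV:
            if self.recov_flag and self.exc2 in self.one_time:
                self.exc, self.flag = "Recovery_failed", "Stopping"
            else:
                self.flag = "Recovering"
        else:
            self.flag = "Stopping"
        self.recov_flag = False

    def recover(self):
        self.exc2 = next(self.lower)
        self.recov_flag, self.flag = True, "Handling"

    def stop(self):
        not_raised = (EVAL[self.exc] == OK and self.left != 0
                      and EVAL2[self.exc2] != OK)
        if not_raised:
            self.exc = EXC_TRANS[self.exc2]
        self.flag = "Stopped"

    def invariant(self):
        e, e2, done = EVAL[self.exc], EVAL2[self.exc2], self.left == 0
        return {"Executing": e == OK and e2 == OK,
                "Handling": e == OK,
                "Recovering": e == OK and e2 == RECOV,
                "Stopping": done or e != OK or e2 == UNRECOV,
                "Stopped": True}[self.flag]

    def run(self):
        ops = {"Executing": self.execute, "Handling": self.catch_and_handle,
               "Recovering": self.recover, "Stopping": self.stop}
        trace = [self.flag]
        while self.flag in ops:
            ops[self.flag]()
            assert self.invariant(), "invariant of Fig. 2 violated"
            trace.append(self.flag)
        return trace
```

For example, FTComponent(1, ["Pump_failed"]) stops with exc equal to Execution_failed, which corresponds to an unrecoverable propagated exception being translated by Exc_trans, while FTComponent(1, ["Failed_to_move", "Success2"]) passes through the Recovering phase and terminates with Success. The invariant method re-checks the phase conjuncts of Fig. 2 after every step, which in B would be discharged by proof rather than tested at run time.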

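[Figure 3 (state diagram not reproduced). Transitions recoverable from the diagram labels and the specification in Fig. 2:
    Start -> Executing: valid parameters
    Start -> Stopping: bad parameters (raised exception)
    Executing -> Handling: lower layer command executed
    Executing -> Stopping: exception was raised or operation was completed
    Handling -> Executing: successful termination of lower layer command
    Handling -> Recovering: recoverable propagated exception
    Recovering -> Handling: lower layer command executed
    Handling -> Stopping: unrecoverable propagated exception or recovery failed
    Stopping -> Stopped: component stops, returning control to the higher layer]

Figure 3. The phases of component behaviour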

Operation start models placing a request on the component. It sets the initial state according to the input parameters of the request. If the parameters are invalid, the corresponding exception is raised. Such exceptions belong to the class of generated exceptions. The execution of the request, modelled by operation execute, starts by assessing the current state of the component (e.g., whether it is safe to execute the requested operation from the current state) and checking whether the request execution is complete. If the assessment fails or the completion of the request execution is detected, the corresponding exception is raised and the component is stopped. Otherwise, the component executes the requested operation by placing requests on the lower layer components. In the initial specification we model the effect of executing the lower layer command by receiving a propagated exception and updating the local state.

Operation catch_and_handle becomes enabled after the lower layer has completed its execution. The task of catch_and_handle is to classify the propagated exception exc2. If the propagated exception is classified as normal termination, the execution of catch_and_handle enables operation execute. If exc2 signals a recoverable error, operation recover becomes enabled. However, if exc2 is classified as an unrecoverable error, operation stop is executed next.

Operation recover is similar to operation execute in the sense that the lower layer is called: the state of the component is changed and a new propagated exception is received. The purpose of operation recover is to abstractly model the effect of error recovery. After recover, operation catch_and_handle becomes enabled, which again evaluates the propagated exception and directs the control flow accordingly.

While executing error recovery, it is important to guarantee its eventual termination. One approach would be to assume that the lower layer components actually executing the recovery guarantee its termination, i.e., by reporting an unrecoverable error when error recovery fails after a certain (finite) number of attempts. An opposite approach is to completely control error recovery at the layer from which it is attempted. We propose a combined approach. Namely, we introduce a function one_time_recovery that distinguishes errors for which we can attempt recovery several times from those for which we can attempt it only once. A failed one-time recovery is detected in operation catch_and_handle. This leads to raising the corresponding exception and stopping the component.

Operation stop becomes enabled in three cases: when a generated exception exc is raised, when an unrecoverable exception exc2 of the lower layer has occurred, or when the execution of the request is completed. In the case of an unrecoverable error, exception exc2 is converted into the generated exception exc and the component is stopped. The conversion (done by function Exc_trans : EXC2 --> EXC) means that an unrecoverable lower layer exception should be interpreted in terms of the current layer exceptions before being propagated further.

The proposed specification can be used to abstractly specify components at each layer except the lowest one. Each component at the lowest layer has only one type of exception: the generated exceptions. Hence the specification facilities for exception handling are redundant at this layer. Moreover, the operations to be executed at the lowest layer are not decomposed any further and hence can be specified as atomic preconditioned operations.

Unfolding layers by refinement. Obviously, our abstract specification lacks many implementation details and should be refined further. Refinement is a technique for incorporating implementation details into a specification. In this paper we demonstrate how refinement facilitates the development of systems structured in a layered manner.
Let us observe that the schematic representation of communication between the components of a layered system shown in Fig. 1 can also be seen as a scheme of atomicity refinement. Indeed, each layer decomposes a higher layer operation into a set of operations of smaller granularity. The decomposition continues iteratively until the lowest layer is reached. At this layer the operations are considered not to be further decomposable. From the architectural perspective, an abstract specification is a “folded” representation of the system structure. The system behaviour is specified in terms of large atomic services at the component server layer. Each refinement step adds (or “unfolds”) an architectural layer in the downward direction. Large atomic services are decomposed into operations of smaller granularity. The refinement process continues until the whole architectural hierarchy is built. We argue that a refinement process conducted in this way allows us to obtain a realistic model of fault tolerant systems. Indeed, by iterative refinement of atomicity we eventually arrive at modelling errors occurring at practically any instant of time, i.e., before and after the execution of each operation of the finest granularity. The proposed refinement process is illustrated in Fig. 4, where we outline the development pattern instantiated for a three-layered system.

[Figure 4 (diagram not reproduced): FTComponentR refines FTComponent and includes SubFTComponent; SubFTComponentR refines SubFTComponent and includes SubSubFTComponent1 ... SubSubFTComponentN.]

Figure 4. General development pattern in B

Each refinement step creates components at the lower layer by including their specifications into the refinement of the corresponding components at the previous layer. The newly introduced components are specified according to the abstract specification FTComponent. An excerpt from the refined specification of the upper layer component is given in Fig. 5. We strengthen the invariant to indicate that the lower layer can be active only when flag=Handling, i.e., when the current layer has placed a request on the lower layer component and is waiting for its response. While operation start is unaffected by the refinement, operation execute is modified to model the activation of the lower layer component. By placing the request on the lower layer component (executing its start operation, called sub_start in Fig. 5), the autonomous execution of the lower layer component is triggered. The operations of the requesting component remain disabled until the requested component terminates its autonomous execution. This is modelled by strengthening the guard of operation catch_and_handle.

Refinement also makes operation execute more deterministic by providing the functions Bad_state_exc and State_update. Bad_state_exc evaluates the current state, mapping it into one of the current layer exceptions. The function result indicates a failure if Eval(exc) /= OK. State_update describes a state change (if necessary) before making a request to the lower layer. Finally, sub_start activates the lower layer, supplying the parameters describing the request. Function Bad_state_exc can be implemented as a functional procedure. On the other hand, state := State_update(…) can be implemented as a program fragment describing a sequence of changes to the local state.

Like execute, operation recover is made more deterministic. The State_recov function describes the changes that should be made on the current layer before calling the lower layer for some recovery action. In practice the first two statements of recover are usually implemented as an IF ... ELSIF ... ELSIF ... END construct proposing different recovery actions for different recoverable situations (exceptions).

REFINEMENT FTComponentRefined
REFINES FTComponent
INCLUDES FT_subcomponent
VARIABLES flag, state, exc, recov_flag
INVARIANT
    … not(flag=Handling) => flag2=Stopped
OPERATIONS
    start(par) = … ;
    execute =
        SELECT flag = Executing THEN
            raise(Bad_state_exc(state));
            IF Eval(exc) /= OK or stop_cond(state) = TRUE THEN
                flag := Stopping
            ELSE
                state := State_update(state,state2);
                sub_start(Param2(state));
                flag := Handling
            END
        END;
    catch_and_handle =
        SELECT flag = Handling & flag2 = Stopped THEN
            …
        END;
    recover =
        SELECT flag = Recovering THEN
            state := State_recov(state,state2);
            sub_start(Param2(state));
            recov_flag := TRUE;
            flag := Handling
        END;
    stop = …
END

Figure 5. Refinement of the general specification
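The effect of the strengthened guard of catch_and_handle, namely that the upper layer stays blocked while the lower layer runs autonomously, can be sketched in Python as follows. This is a deliberately reduced illustration with hypothetical names (Lower, Upper, run): exceptions, recovery and the Stopping phase are left out, a command is just a number of work units, and the scheduler always tries the upper layer first in order to show that its guard is false while the lower layer is active.

```python
class Lower:
    """Lower layer component: runs autonomously after sub_start."""

    def __init__(self):
        self.flag, self.work = "Stopped", 0

    def sub_start(self, work):
        if self.flag != "Stopped":
            raise RuntimeError("precondition flag2 = Stopped violated")
        self.work, self.flag = work, "Executing"

    def enabled(self):
        return self.flag != "Stopped"

    def step(self):
        if self.work > 0:
            self.work -= 1
        else:
            self.flag = "Stopped"


class Upper:
    """Upper layer component after the refinement step (simplified)."""

    def __init__(self, lower, commands):
        self.lower, self.commands = lower, list(commands)
        self.flag = "Executing"

    def enabled(self):
        if self.flag == "Executing":
            return True
        # Strengthened guard: flag = Handling & flag2 = Stopped
        return self.flag == "Handling" and self.lower.flag == "Stopped"

    def step(self):
        if self.flag == "Handling":          # catch_and_handle (always OK here)
            self.flag = "Executing"
        elif not self.commands:              # execute: nothing left to do
            self.flag = "Stopped"
        else:                                # execute: activate the lower layer
            self.lower.sub_start(self.commands.pop(0))
            self.flag = "Handling"


def run(upper, lower):
    """Fire enabled operations, upper layer first, and log who acted."""
    log = []
    while True:
        actor = next((c for c in (upper, lower) if c.enabled()), None)
        if actor is None:
            return log
        actor.step()
        # Refined invariant: not(flag = Handling) => flag2 = Stopped
        assert upper.flag == "Handling" or lower.flag == "Stopped"
        log.append((type(actor).__name__, upper.flag, lower.flag))
```

With a single command of one work unit, the log shows the upper layer acting once, the lower layer acting twice while the upper layer is blocked in Handling, and the upper layer resuming only after the lower layer has reached Stopped.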

4. Case study: liquid handling workstation Fillwell

To illustrate our approach, we present an excerpt from the development of a real-life control system, the liquid handling workstation Fillwell™ [5]. The workstation belongs to the class of products for drug discovery and bioresearch. Ensuring correct functioning of the workstation is extremely important, since high precision is required in replicating the experiments leading to drug discovery. The system consists of an operating head that dispenses liquid substances into, and aspirates them from, high-density microplates placed on a processing table. A gantry moves the operating head with high precision and speed from one plate to another in the X, Y and Z directions.

The main purpose of the workstation is to perform various experiments. An experiment is described by a protocol, a sequence of commands defined by the operator. Essentially, the protocol consists of commands for aspirating and dispensing liquid substances from one plate to another. The operator passes the completed protocol to the component server, a component that provides an interface between the operator and the system components. After this the system enters the autonomous mode, in which the component server decomposes the high-level commands of the protocol into the operations to be provided by the components.

Correct functioning of the system should be provided even in the presence of component faults. The system can cope automatically (i.e., without the operator's intervention) with certain errors occurring during its functioning. However, there is also a set of errors to be handled manually (i.e., by the operator); their discussion is outside the scope of this paper.

In Fig.6 we present an excerpt from the abstract specification of the component server. The specification captures behaviour of the component server while it executes the command Aspirate(plate, amount). The abstract specification repeats the general specification pattern FTComponent.

MACHINE FW_Aspirate
…
OPERATIONS
    start(plate,amount) =
        PRE plate : NAT & amount : NAT & flag = Stopped THEN
            task := Aspirate(plate,amount) ||
            IF Valid_param(plate,amount) = TRUE THEN
                flag := Executing || exc, exc2 := Success, Success2
            ELSE
                raise(Bad_param) || flag := Stopping
            END
        END;
    execute =
        SELECT flag = Executing THEN
            ANY new_exc WHERE new_exc : EXC THEN
                raise(new_exc) ||
                IF Eval(new_exc) /= OK or task = <> THEN
                    flag := Stopping
                ELSE
                    exc2 :: EXC2 ||
                    cmd, task := first(task), tail(task) ||
                    flag := Handling
                END
            END
        END;
    catch_and_handle = …
    recover = …
    stop = …
END

Figure 6. Abstract specification of Fillwell (Aspirate)

We instantiate the abstract functions and data structures by concrete counterparts from the workstation. For example, the deferred types EXC and EXC2 are replaced with concrete sets of exceptions. Similarly, the evaluation functions Eval and Eval2 now describe the concrete classification of exceptions from EXC and EXC2. The definitions of EXC, EXC2 and Eval, Eval2 are shown below.

  EXC = {Success, Bad_param, Execution_failed, Recovery_failed};
  EXC2 = {Success2, Bad_param2, Failed_to_move, Pump_failed, Unsafe};
  Eval = {Success |-> OK, Bad_param |-> RECOV, Execution_failed |-> RECOV, Recovery_failed |-> RECOV}
  Eval2 = {Success2 |-> OK, Bad_param2 |-> UNRECOV, Failed_to_move |-> RECOV, Pump_failed |-> UNRECOV, Unsafe |-> UNRECOV}
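The classification above is a pair of finite lookup tables, which can be sketched in Python as follows. This is only an illustration under stated assumptions: the exception names and statuses are copied from the definitions above, while the Status enum and the helper by_status are hypothetical names introduced here and are not part of the B development.

```python
from enum import Enum


class Status(Enum):
    OK = "OK"
    RECOV = "RECOV"
    UNRECOV = "UNRECOV"


# Eval: classification of the current-layer exceptions (EXC), as listed in the text.
EVAL = {
    "Success": Status.OK,
    "Bad_param": Status.RECOV,
    "Execution_failed": Status.RECOV,
    "Recovery_failed": Status.RECOV,
}

# Eval2: classification of the lower-layer exceptions (EXC2), as listed in the text.
EVAL2 = {
    "Success2": Status.OK,
    "Bad_param2": Status.UNRECOV,
    "Failed_to_move": Status.RECOV,
    "Pump_failed": Status.UNRECOV,
    "Unsafe": Status.UNRECOV,
}


def by_status(table):
    """Group exception names by status (hypothetical helper, keeps table order)."""
    groups = {status: [] for status in Status}
    for exc, status in table.items():
        groups[status].append(exc)
    return groups


if __name__ == "__main__":
    for status, names in by_status(EVAL2).items():
        print(status.value, names)
```

Grouping the tables this way makes it easy to see that, on the lower layer, only Failed_to_move is eligible for automatic recovery.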

Note that the variable state of the general pattern is replaced by the variables task and cmd. The variable task contains the translation of the command Aspirate(plate,amount), i.e., the sequence of lower layer operations required to aspirate the amount of liquid given by the parameter amount from the microplate given by plate. The variable cmd is the current operation to be executed on the lower layer. It consists of the name of the command and its parameter. The aspiration is completed when task = <>, i.e., there are no commands left to be executed on the lower layer.

The component server translates the operator's command Aspirate into four different lower layer commands: moving along the x-axis, moving along the y-axis, moving along the z-axis, and finally pumping liquid from a plate. We refine the abstract specification according to the proposed pattern, as illustrated in Fig. 7.

[Diagram: FW_Aspirate is refined by FW_Aspirate2, which includes FW_Sub; FW_Sub is refined by FW_Sub2, which includes XComp, YComp, ZComp and Pump.]

Figure 7. Scheme of Fillwell development
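The way the component server consumes the translated command, one lower layer operation at a time, can be sketched as follows. This is a minimal illustration, not the authors' code: in the B development Aspirate is an abstract function into seq(CNAME*NAT), so the concrete coordinates computed by aspirate below are invented for the example, and step is a hypothetical name for the assignment cmd, task := first(task), tail(task).

```python
from typing import List, Tuple

# A lower layer command: (command name, parameter), mirroring CNAME*NAT.
Command = Tuple[str, int]


def aspirate(plate: int, amount: int) -> List[Command]:
    """Hypothetical translation of Aspirate(plate, amount).

    The plate-to-coordinate mapping is an assumption made for illustration only.
    """
    x, y, z = plate * 10, plate * 5, 3
    return [("MoveX", x), ("MoveY", y), ("MoveZ", z), ("Pump", amount)]


def step(task: List[Command]) -> Tuple[Command, List[Command]]:
    """Equivalent of: cmd, task := first(task), tail(task)."""
    return task[0], task[1:]


task = aspirate(2, 7)
issued = []
while task:  # the aspiration is completed when task is the empty sequence
    cmd, task = step(task)
    issued.append(cmd[0])

print(issued)
```

The loop stands in for repeated firing of the execute operation in the fault-free case; exception handling between steps is deliberately left out.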


The refinement step results in creating a new machine FW_sub for the lower layer and adding some implementation details to the current layer. In the refined specification of the current layer we replace the abstract model of the lower layer by explicit procedure calls activating the execution of lower layer operations described in the machine FW_sub. The excerpt from the refined specification is given in Fig. 8.

REFINEMENT FW_Aspirate2
OPERATIONS

  start = …

  execute =
    SELECT flag = Executing
    THEN
      IF Eval(exc) /= OK or task = <>
      THEN flag := Stopping
      ELSE
        cmd, task := first(task), tail(task);
        sub_start(fst(cmd),snd(cmd));
        flag := Handling
      END
    END;

  catch_and_handle = …

  recover =
    SELECT flag = Recovering
    THEN
      IF fst(cmd) = MoveX THEN sub_start(MoveX,x_init)
      ELSIF fst(cmd) = MoveY THEN sub_start(MoveY,y_init)
      ELSE sub_start(MoveZ,z_init)
      END;
      task := cmd -> task;
      recov_flag := TRUE;
      flag := Handling
    END;

  stop = …
END

Figure 8. Refinement of the abstract specification
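The effect of the recover operation in Fig. 8 can be mimicked by a small Python function. This is a sketch under assumptions: the home positions in INIT are placeholders (the B development only declares x_init, y_init and z_init as natural numbers), and the function returns the request that would be passed to sub_start instead of calling a lower layer.

```python
from typing import List, Tuple

Command = Tuple[str, int]

# Assumed home positions; the source only states that x_init, y_init, z_init are in NAT.
INIT = {"MoveX": 0, "MoveY": 0, "MoveZ": 0}


def recover(cmd: Command, task: List[Command]):
    """Sketch of recover from Fig. 8.

    Returns (request for sub_start, new task, new recov_flag).
    """
    name = cmd[0]
    # As in Fig. 8, any command other than MoveX or MoveY falls into the MoveZ branch.
    axis = name if name in ("MoveX", "MoveY") else "MoveZ"
    request = (axis, INIT[axis])
    # task := cmd -> task puts the failed command back at the front for a retry.
    return request, [cmd] + task, True


if __name__ == "__main__":
    print(recover(("MoveY", 40), [("MoveZ", 3), ("Pump", 7)]))
```

Putting the failed command back at the head of task is what makes the subsequent execute step retry it once the head has been moved to the predefined position.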


In the operation execute we model placing a request on the lower layer by the procedure call sub_start(…). The procedure has two parameters: the name of the operation to be executed and its parameter. Our refinement step also affects the recovery operation. Namely, we modify the operation recover by introducing error recovery procedures specific to each operation. Observe that the operator's command Aspirate has four lower layer commands and each of them can fail. The error recovery procedures specify the system's reaction to each failure. When a moving command fails, error recovery attempts to move the operating head to a predefined position and then retries the failed command. To avoid infinite looping of error recovery, we classify moving errors as eligible for one-time recovery only. Unlike moving errors, a failure to pump requires manual error recovery, so the corresponding exception is propagated to the operator.

We omit the presentation of the abstract machine FW_sub of the lower layer component since it again follows the same general scheme. Instead, to illustrate recovery from errors occurring in the lowest layer components, in Fig. 9 we give an excerpt from FW_sub2 (the refinement of FW_sub) and the abstract specification of one of the lowest layer components that FW_sub2 includes. The behaviour of the subsystem that moves the operating head along the x-axis is specified as a preconditioned operation x_move. It has one input parameter, the target position x_target. Besides the basic functionality, we also specify various errors that might occur while executing the operation, for instance, the situations when the motor suddenly stops or the operating head is moved outside the safety boundaries. Among these errors, only partial execution (i.e., when the head has progressed towards the target position without actually reaching it) is recoverable. The recovery is initiated from FW_sub2 by retrying the execution of the same command for the same target. Note that if, after a number of recovery attempts, the operating head gets stuck, a different exception, Failed_to_start, is raised. This exception is considered unrecoverable on the layer described in FW_sub2 and is propagated further up.

While presenting the case study, we mostly focused on the specification and refinement of fault tolerance mechanisms and omitted many details of the overall system behaviour. The complete development can be found in the Appendix.
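The one-time recovery rule can be traced with a short simulation of the catch_and_handle operation given in the Appendix. This is a sketch, not the B model itself: the Python function names are introduced here, the classification table uses the exception names from the Eval2 definition in the text, and run abstracts away the execute and recover steps by feeding the handler a prepared sequence of lower layer exceptions.

```python
# Eval2 as listed in the text (status names kept as plain strings).
EVAL2 = {
    "Success2": "OK",
    "Bad_param2": "UNRECOV",
    "Failed_to_move": "RECOV",
    "Pump_failed": "UNRECOV",
    "Unsafe": "UNRECOV",
}


def catch_and_handle(exc2, recov_flag):
    """Return (next phase, exception raised on this layer or None, new recov_flag).

    Mirrors catch_and_handle of FW_Aspirate: recov_flag is always reset to FALSE.
    """
    status = EVAL2[exc2]
    if status == "OK":
        return "Executing", None, False
    if status == "RECOV":
        if recov_flag:
            # The recovery attempt itself failed: give up instead of looping.
            return "Stopping", "Recovery_failed", False
        return "Recovering", None, False
    return "Stopping", None, False


def run(exceptions):
    """Feed successive lower layer exceptions; return (visited phases, raised exception)."""
    recov_flag = False
    phases = []
    for exc2 in exceptions:
        phase, raised, recov_flag = catch_and_handle(exc2, recov_flag)
        phases.append(phase)
        if phase == "Recovering":
            recov_flag = True  # set by recover before control returns to Handling
        if phase == "Stopping":
            return phases, raised
    return phases, None


if __name__ == "__main__":
    print(run(["Failed_to_move", "Failed_to_move"]))
    print(run(["Failed_to_move", "Success2"]))
```

In this reading, a recoverable exception that arrives immediately after a recovery attempt is escalated to Recovery_failed, whereas a successful recovery resets the flag and lets execution continue.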


REFINEMENT FW_sub2
…
  sub_recover =
    SELECT flag2 = Recovering
    THEN
      IF cname=MoveX THEN x_move(par); exc3 := x_exc
      ELSIF cname=MoveY THEN y_move(par); exc3 := y_exc
      ELSIF cname=MoveZ THEN z_move(par); exc3 := z_exc
      ELSE pump(par); exc3 := pump_exc
      END;
      flag2 := Handling
    END;
…
END

MACHINE X_Comp
…
OPERATIONS

  x_move(x_target) =
    PRE x_target : NAT
    THEN
      ANY x_new WHERE x_new : NAT
      THEN
        x_curr := x_new ||
        IF x_new = x_target THEN raise(Success3)
        ELSIF x_new < x_min or x_max < x_new THEN raise(Unsafe3)
        ELSIF x_new = x_curr THEN raise(Failed_to_start)
        ELSIF (x_curr < x_new & x_new < x_target) or
              (x_target < x_new & x_new < x_curr) THEN raise(Suddenly_stop)
        ELSE raise(Unsafe3)
        END
      END
    END
END

Figure 9. Excerpts from lower layers specifications
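The case analysis in x_move, together with the reaction of FW_sub2 to each outcome, can be sketched as follows. The tables EVAL3 and EXC_TRANS2 are copied from the constants Eval3 and Exc_trans2 of FWdata in the Appendix; the function names classify_move and handle, and the string results of handle, are hypothetical and serve only to illustrate the decision logic.

```python
# Eval3 and Exc_trans2 as given in FWdata (Appendix).
# FWdata lists no Eval3 entry for Pump_failed_to_start, so none is added here.
EVAL3 = {
    "Success3": "OK",
    "Failed_to_start": "UNRECOV",
    "Suddenly_stop": "RECOV",
    "Unsafe3": "UNRECOV",
}
EXC_TRANS2 = {
    "Success3": "Success2",
    "Failed_to_start": "Failed_to_move",
    "Pump_failed_to_start": "Pump_failed",
    "Suddenly_stop": "Failed_to_move",
    "Unsafe3": "Unsafe2",
}


def classify_move(x_curr, x_new, x_target, x_min, x_max):
    """Exception raised by x_move when the head ends at x_new (x_curr is the old position)."""
    if x_new == x_target:
        return "Success3"
    if x_new < x_min or x_max < x_new:
        return "Unsafe3"
    if x_new == x_curr:
        return "Failed_to_start"
    if x_curr < x_new < x_target or x_target < x_new < x_curr:
        return "Suddenly_stop"  # partial progress towards the target
    return "Unsafe3"  # overshoot or movement away from the target


def handle(exc3):
    """Reaction of the FW_sub2 layer to a component exception (hypothetical helper)."""
    status = EVAL3[exc3]
    if status == "OK":
        return "continue"
    if status == "RECOV":
        return "retry"
    return "propagate " + EXC_TRANS2[exc3]


if __name__ == "__main__":
    for x_new in (50, 30, 10, 70, 120):
        exc3 = classify_move(10, x_new, 50, 0, 100)
        print(x_new, exc3, handle(exc3))
```

Only the partial-progress case leads to a retry on this layer; every other failure is translated into an EXC2 exception and handed to the layer above.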


5. Conclusions

In this paper we aimed at creating a generic yet simple development pattern that would facilitate system development in a layered manner and, at the same time, suffice for reasoning about fault tolerance in complex systems. We validated the proposed approach by a realistic case study, a liquid handling workstation.

Our approach formalises the development of fault tolerant layered systems by refinement in B. We introduced a general specification pattern, which can be applied to specify components with integrated exception handling mechanisms at each architectural layer. Moreover, the proposed development technique is based on the recursive application and instantiation of this pattern via refinement and can therefore itself be seen as a development pattern.

Currently most of the work related to exception handling is associated with object-oriented systems [3]. Meanwhile, research on integrating exception handling into the system development process, especially in a formal setting, is scarce. The idea of reasoning about fault tolerance in the refinement process has also been explored by Joseph and Liu [9]. They specified a fault intolerant system in a temporal logic framework and demonstrated how to transform it into a fault tolerant system by refinement. However, they analyse a "flat" system structure, i.e., in their specification all operations are described on the same architectural level. The advantage of our approach is the possibility of introducing hierarchy (layers) and describing different exceptions and recovery actions for different layers. As a result, error recovery procedures are distributed among the layers, so that each layer handles its own class of errors (and hence masks them from the upper layers).

Arora and Kulkarni [6] have done extensive research on establishing the correctness of adding fault tolerance mechanisms to fault intolerant systems.
The correctness proof of such an extended system is based on the soundness of their algorithm, which works with the next-state (transition) relation. In our approach we start with an abstract specification of a system and develop a fault tolerant system by refinement, incorporating fault tolerance mechanisms on the way. Correctness of our transformation is guaranteed by the soundness of the B method. Moreover, the automatic tool support available for our approach facilitates verification of correctness.

Reasoning about fault tolerance in B has also been explored by Lano et al. [7]. However, they focused on structuring B specifications to model a certain mechanism for damage confinement rather than exception handling mechanisms.

While developing and validating the proposed approach, we extensively used Atelier B, the automatic tool support for B. The use of Atelier B significantly eased the verification of refinement, since the tool generated all the required proof obligations and proved most of them automatically. We argue that the generality of the proposed approach and the availability of automatic tool support make our approach scalable to complex real-life applications.

In the future we are planning to extend the proposed approach in two directions: to introduce a more sophisticated model of exceptions, e.g., by integrating all the information required for error recovery into the representation of exceptions, and to elaborate on the encapsulation of components. The two extensions are related, since the former facilitates the latter.

References

1. J.-R. Abrial. The B-Book. Cambridge University Press, 1996.
2. T. Anderson and P.A. Lee. Fault Tolerance: Principles and Practice. Dependable Computing and Fault Tolerant Systems, Vol. 3. Springer-Verlag, 1990.
3. A. Avizienis. Towards Systematic Design of Fault-Tolerant Systems. Computer 30 (4), pp. 51-58, 1997.
4. F. Cristian. Exception Handling. In T. Anderson (ed.): Dependability of Resilient Computers. BSP Professional Books, 1989.
5. Fillwell™ 2002 – Feature guide. Via http://lifescience.perkinelmer.com
6. S. Kulkarni and A. Arora. Automating the addition of fault-tolerance. Formal Techniques in Real-time and Fault-tolerant Systems (FTRTFTS'2000), Pune, India, 2000.
7. K. Lano, D. Clark, K. Androutsopoulos, P. Kan. Invariant-Based Synthesis of Fault-tolerant Systems. In Proc. of Formal Techniques in Real-Time and Fault-Tolerant Systems, FTRTFT 2000, LNCS vol. 1926, pp. 46-57. Pune, India, September 2000.
8. J.-C. Laprie. Dependability: Basic Concepts and Terminology. Springer-Verlag, Vienna, 1991.
9. Z. Liu and M. Joseph. Transformations of programs for fault-tolerance. Formal Aspects of Computing, Vol. 4, No. 5, pp. 442-469, 1992.
10. MATISSE Handbook for Correct Systems Construction. EU-project MATISSE: Methodologie and Technologies for Industrial Strength Systems Engineering, IST-1999-11345, 2003. http://www.esil.univ-mrs.fr/~spc/matisse/Handbook/
11. B. Rubel. Patterns for Generating a Layered Architecture. In J.O. Coplien, D.C. Schmidt (Eds.): Pattern Languages of Program Design. Addison-Wesley, 1995.
12. S. Schneider. The B Method. An introduction. Palgrave, 2001.
13. Steria, Aix-en-Provence, France. Atelier B, User and Reference Manuals, 2001. Available at http://www.atelierb.societe.com/index uk.html.
14. N. Storey. Safety-critical computer systems. Addison-Wesley, 1996.


Appendix

Complete B development of the Fillwell system

MACHINE FW_Aspirate
SEES FWdata
VARIABLES flag, exc, exc2, task, cmd, recov_flag
INVARIANT
  flag : PHASE & exc : EXC & exc2 : EXC2 &
  task : seq(CNAME*NAT) & cmd : CNAME*NAT & recov_flag : BOOL &
  (flag = Executing => Eval(exc)=OK & Eval2(exc2)=OK) &
  (flag = Handling => Eval(exc)=OK) &
  (flag = Recovering => Eval(exc)=OK & Eval2(exc2)=RECOV) &
  (flag = Stopping => task=<> or Eval(exc)/=OK or Eval2(exc2)=UNRECOV)
DEFINITIONS
  raise(ee) == exc := ee;
  not_raised == (Eval(exc) = OK) & (task /= <>) & (Eval2(exc2) /= OK)
INITIALISATION
  flag := Stopped || exc := Success || exc2 := Success2 ||
  task := <> || cmd :: CNAME*NAT || recov_flag := FALSE
OPERATIONS

  start(plate,amount) =
    PRE plate:NAT & amount:NAT & flag=Stopped
    THEN
      task := Aspirate(plate,amount) ||
      IF Valid_param(plate,amount)=TRUE
      THEN flag := Executing || exc := Success || exc2 := Success2
      ELSE raise(Bad_param) || flag := Stopping
      END
    END;

  execute =
    SELECT flag = Executing
    THEN
      ANY new_exc WHERE new_exc:EXC
      THEN
        raise(new_exc) ||
        IF Eval(new_exc) /= OK or task = <>
        THEN flag := Stopping
        ELSE exc2 :: EXC2 || flag := Handling || cmd,task := first(task), tail(task)
        END
      END
    END;

  catch_and_handle =
    SELECT flag = Handling
    THEN
      IF Eval2(exc2) = OK
      THEN flag := Executing
      ELSIF Eval2(exc2) = RECOV
      THEN
        IF recov_flag = TRUE
        THEN raise(Recovery_failed) || flag := Stopping
        ELSE flag := Recovering
        END
      ELSE flag := Stopping
      END ||
      recov_flag := FALSE
    END;

  recover =
    SELECT flag = Recovering
    THEN
      exc2 :: EXC2 || task :: seq(CNAME*NAT) || recov_flag := TRUE || flag := Handling
    END;

  stop =
    SELECT flag = Stopping
    THEN
      IF not_raised THEN raise(Exc_trans(exc2)) END ||
      flag := Stopped
    END
END


MACHINE FWdata
SETS
  EXC = {Success,Bad_param,Execution_failed,Recovery_failed};
  EXC2 = {Success2,Failed_to_move,Pump_failed,Unsafe2,Bad_param2};
  EXC3 = {Success3,Failed_to_start,Pump_failed_to_start,Suddenly_stop,Unsafe3};
  PHASE = {Executing,Handling,Recovering,Stopping,Stopped};
  E_STATUS = {OK, RECOV, UNRECOV};
  CNAME = {MoveX,MoveY,MoveZ,Pump}
CONSTANTS
  Eval, Eval2, Eval3, Exc_trans, Exc_trans2, Aspirate, fst, snd, Valid_param,
  x_init, y_init, z_init, x_min, y_min, z_min, x_max, y_max, z_max, p_max
PROPERTIES
  Eval: EXC --> E_STATUS &
  Eval = {Success |-> OK, Bad_param |-> RECOV,
          Execution_failed |-> RECOV, Recovery_failed |-> RECOV} &
  Eval2: EXC2 --> E_STATUS &
  Eval2 = {Success2 |-> OK, Failed_to_move |-> RECOV,
           Pump_failed |-> UNRECOV, Unsafe2 |-> UNRECOV} &
  Eval3: EXC3 --> E_STATUS &
  Eval3 = {Success3 |-> OK, Failed_to_start |-> UNRECOV,
           Suddenly_stop |-> RECOV, Unsafe3 |-> UNRECOV} &
  Exc_trans : EXC2 --> EXC &
  Exc_trans = {Success2 |-> Success, Failed_to_move |-> Execution_failed,
               Unsafe2 |-> Execution_failed, Bad_param2 |-> Bad_param} &
  Exc_trans2 : EXC3 --> EXC2 &
  Exc_trans2 = {Success3 |-> Success2, Failed_to_start |-> Failed_to_move,
                Pump_failed_to_start |-> Pump_failed,
                Suddenly_stop |-> Failed_to_move, Unsafe3 |-> Unsafe2} &
  Aspirate : NAT*NAT --> seq(CNAME*NAT) &
  fst: CNAME*NAT --> CNAME &
  fst = %(e1,e2). (e1:CNAME & e2:NAT | e1) &
  snd: CNAME*NAT --> NAT &
  snd = %(e1,e2). (e1:CNAME & e2:NAT | e2) &
  Valid_param : NAT*NAT --> BOOL &
  x_init: NAT & y_init: NAT & z_init: NAT &
  x_min: NAT & y_min: NAT & z_min: NAT &
  x_max: NAT & y_max: NAT & z_max: NAT &
  x_min …

… Eval2(exc2)/=OK or Eval3(exc3)=UNRECOV)
DEFINITIONS
  raise(ee) == exc2 := ee;
  not_raised == (Eval2(exc2) = OK) & (Eval3(exc3) /= OK)
INITIALISATION
  flag2 := Stopped || exc2 := Success2 || exc3 := Success3 ||
  cname :: CNAME || par :: NAT
OPERATIONS

  sub_start(c_name,param) =
    PRE c_name : CNAME & param : NAT & flag2=Stopped
    THEN
      flag2 := Executing || cname,par := c_name,param ||
      exc2 := Success2 || exc3 := Success3
    END;

  sub_execute =
    SELECT flag2 = Executing
    THEN
      ANY new_exc WHERE new_exc:EXC2
      THEN
        raise(new_exc) ||
        IF Eval2(new_exc) /= OK
        THEN flag2 := Stopping
        ELSE exc3 :: EXC3 || flag2 := Handling
        END
      END
    END;

  sub_catch_and_handle =
    SELECT flag2 = Handling
    THEN
      IF Eval3(exc3) = OK THEN flag2 := Executing
      ELSIF Eval3(exc3) = RECOV THEN flag2 := Recovering
      ELSE flag2 := Stopping
      END
    END;

  sub_recover =
    SELECT flag2 = Recovering
    THEN exc3 :: EXC3 || flag2 := Handling
    END;

  sub_stop =
    SELECT flag2 = Stopping
    THEN
      IF not_raised THEN raise(Exc_trans2(exc3)) END ||
      flag2 := Stopped
    END;

  exc2_OK =
    PRE flag2 = Stopped
    THEN exc2 := Success2
    END
END


REFINEMENT FW_sub2
REFINES FW_sub
SEES Alt_FWdata
INCLUDES X_Comp, Y_Comp, Z_Comp, Pump_Comp
VARIABLES flag2, exc2, exc3, cname, par
INVARIANT
  flag2 : PHASE & exc2 : EXC2 & exc3 : EXC3 & cname : CNAME & par : NAT &
  (exc3=Success3 or exc3=x_exc or exc3=y_exc or exc3=z_exc or exc3=pump_exc)
DEFINITIONS
  raise(ee) == exc2 := ee;
  not_raised == (Eval2(exc2) = OK) & (Eval3(exc3) /= OK)
INITIALISATION
  flag2 := Stopped || exc2 := Success2 || exc3 := Success3 ||
  cname :: CNAME || par :: NAT
OPERATIONS

  sub_start(c_name,param) =
    PRE c_name : CNAME & param : NAT & flag2=Stopped
    THEN
      flag2 := Executing || cname,par := c_name,param ||
      exc2 := Success2 || exc3 := Success3
    END;

  sub_execute =
    SELECT flag2 = Executing
    THEN
      IF (cname=MoveX & (par < x_min or par > x_max)) or
         (cname=MoveY & (par < y_min or par > y_max)) or
         (cname=MoveZ & (par < z_min or par > z_max)) or
         (cname=Pump & par > p_max)
      THEN raise(Bad_param2)
      END;
      IF Eval2(exc2) /= OK
      THEN flag2 := Stopping
      ELSE
        IF cname=MoveX THEN x_move(par); exc3 := x_exc
        ELSIF cname=MoveY THEN y_move(par); exc3 := y_exc
        ELSIF cname=MoveZ THEN z_move(par); exc3 := z_exc
        ELSE pump(par); exc3 := pump_exc
        END;
        flag2 := Handling
      END
    END;

  sub_catch_and_handle =
    SELECT flag2 = Handling
    THEN
      IF Eval3(exc3) = OK THEN flag2 := Executing
      ELSIF Eval3(exc3) = RECOV THEN flag2 := Recovering
      ELSE flag2 := Stopping
      END
    END;

  sub_recover =
    SELECT flag2 = Recovering
    THEN
      IF cname=MoveX THEN x_move(par); exc3 := x_exc
      ELSIF cname=MoveY THEN y_move(par); exc3 := y_exc
      ELSIF cname=MoveZ THEN z_move(par); exc3 := z_exc
      ELSE pump(par); exc3 := pump_exc
      END;
      flag2 := Handling
    END;

  sub_stop =
    SELECT flag2 = Stopping
    THEN
      IF not_raised THEN raise(Exc_trans2(exc3)) END ||
      flag2 := Stopped
    END
END


MACHINE X_Comp
SEES Alt_FWdata
VARIABLES x_exc, x_curr
INVARIANT x_exc : EXC3 & x_curr: NAT
DEFINITIONS raise(ee) == x_exc := ee
INITIALISATION x_exc := Success3 || x_curr := x_init
OPERATIONS

  x_move(x_target) =
    PRE x_target : NAT
    THEN
      ANY x_new WHERE x_new : NAT
      THEN
        x_curr := x_new ||
        IF x_new = x_target THEN raise(Success3)
        ELSIF x_new < x_min or x_max < x_new THEN raise(Unsafe3)
        ELSIF x_new = x_curr THEN raise(Failed_to_start)
        ELSIF (x_curr < x_new & x_new < x_target) or
              (x_target < x_new & x_new < x_curr) THEN raise(Suddenly_stop)
        ELSE raise(Unsafe3)
        END
      END
    END
END


MACHINE Y_Comp
SEES Alt_FWdata
VARIABLES y_exc, y_curr
INVARIANT y_exc : EXC3 & y_curr: NAT
DEFINITIONS raise(ee) == y_exc := ee
INITIALISATION y_exc := Success3 || y_curr := y_init
OPERATIONS

  y_move(y_target) =
    PRE y_target : NAT
    THEN
      ANY y_new WHERE y_new : NAT
      THEN
        y_curr := y_new ||
        IF y_new = y_target THEN raise(Success3)
        ELSIF y_new < y_min or y_max < y_new THEN raise(Unsafe3)
        ELSIF y_new = y_curr THEN raise(Failed_to_start)
        ELSIF (y_curr < y_new & y_new < y_target) or
              (y_target < y_new & y_new < y_curr) THEN raise(Suddenly_stop)
        ELSE raise(Unsafe3)
        END
      END
    END
END


MACHINE Z_Comp
SEES Alt_FWdata
VARIABLES z_exc, z_curr
INVARIANT z_exc : EXC3 & z_curr: NAT
DEFINITIONS raise(ee) == z_exc := ee
INITIALISATION z_exc := Success3 || z_curr := z_init
OPERATIONS

  z_move(z_target) =
    PRE z_target : NAT
    THEN
      ANY z_new WHERE z_new : NAT
      THEN
        z_curr := z_new ||
        IF z_new = z_target THEN raise(Success3)
        ELSIF z_new < z_min or z_max < z_new THEN raise(Unsafe3)
        ELSIF z_new = z_curr THEN raise(Failed_to_start)
        ELSIF (z_curr < z_new & z_new < z_target) or
              (z_target < z_new & z_new < z_curr) THEN raise(Suddenly_stop)
        ELSE raise(Unsafe3)
        END
      END
    END
END


MACHINE Pump_Comp
SEES Alt_FWdata
VARIABLES pump_exc, p_curr
INVARIANT pump_exc : EXC3 & p_curr: NAT
DEFINITIONS raise(ee) == pump_exc := ee
INITIALISATION pump_exc := Success3 || p_curr := 0
OPERATIONS

  pump(amount) =
    PRE amount : NAT
    THEN
      ANY new_amount WHERE new_amount : NAT & new_amount >= p_curr & new_amount …
