
IEEE TRANSACTIONS ON RELIABILITY, VOL. 55, NO. 3, SEPTEMBER 2006


Generic Fault Tolerant Software Architecture Reasoning and Customization

Ling Yuan, Jin Song Dong, Jing Sun, and Hamid Abdul Basit

Abstract—This paper proposes a novel heterogeneous software architecture GFTSA (Generic Fault Tolerant Software Architecture) which can guide the development of safety critical distributed systems. GFTSA incorporates an idealized fault tolerant component concept, and coordinated error recovery mechanism in the early system design phase. It can be reused in the high level model design of specific safety critical distributed systems with reliability requirements. To provide precise common idioms & patterns for the system designers, formal language Object-Z is used to specify GFTSA. Formal proofs based on Object-Z reasoning rules are constructed to demonstrate that the proposed GFTSA model can preserve significant fault tolerant properties. The inheritance & instantiation mechanisms of Object-Z can contribute to the customization of the GFTSA formal model. By analyzing the customization process, we also present a template of GFTSA, expressed in x-frames using the XVCL (XML-based Variant Configuration Language) methodology to make the customization process more direct & automatic. We use an LDAS (Line Direction Agreement System) case study to illustrate that GFTSA can guide the development of specific safety critical distributed systems. Index Terms—Customization, fault tolerance, formal reasoning, Object-Z, software architecture.

ACRONYMS¹
GFTSA   Generic Fault Tolerant Software Architecture
XVCL    XML-based Variant Configuration Language
FTS     Fault Tolerant System
LDAS    Line Direction Agreement System
PVS     Prototype Verification System

NOTATION
A       Class name
d       A list of declarations, referred to as the antecedent
A sequent consists of the class name A, the declarations d, a list of predicates that is also referred to as the antecedent, and a list of predicates referred to as the consequent. A sequent is valid if at least one of the predicates in the consequent is true in the local environment of A enriched by the declarations d, and satisfying all predicates in the antecedent.

Manuscript received October 17, 2005; revised February 16, 2006; accepted March 5, 2006. Associate Editor: J.-C. Lu.
L. Yuan, J. S. Dong and H. A. Basit are with the School of Computing, National University of Singapore, Singapore.
J. Sun is with the Department of Computer Science, The University of Auckland, New Zealand.
Digital Object Identifier 10.1109/TR.2006.879605

¹ The singular and plural of an acronym are always spelled the same.

I. INTRODUCTION

A distributed system can be viewed as a system composed of a set of concurrently interacting activities at different locations that cooperate with each other to perform a joint task. In practice, different kinds of concurrency might co-exist in a system, which thus would require a general supporting framework for controlling & coordinating those concurrent activities [22]. Two kinds of concurrency are mostly discussed in this context: competitive, and cooperative. Competitive concurrency indicates that concurrent activities compete for some common resources, but without explicit cooperation. Cooperative concurrency means that concurrent activities cooperate & communicate with each other [12].

Software architecture is considered a critical design methodology that provides a general framework to guide the development of distributed systems. Some software architecture styles, such as pipe-and-filter [2], can only guide the development of distributed systems with cooperative concurrency. Other basic software architecture styles, such as the repository style [3], can only guide the development of distributed systems with competitive concurrency. However, many distributed systems involve both cooperative, and competitive concurrency. Our proposed GFTSA (Generic Fault Tolerant Software Architecture) combines several widely used basic architecture styles to guide the development of these distributed systems.

For distributed systems involving competitive, and cooperative concurrency, if faults occur & cause errors, their consequences may not always be limited to one system component. To satisfy the reliability requirement, error recovery mechanisms may require stepping outside the boundaries of a computer system. We need to use a coordinated error recovery mechanism [4], [9], [22], [31], and transaction semantics [18] to facilitate recovery from faults that affect both the computer system, and its environment.
How to incorporate error recovery mechanisms with functional aspects at the software architecture level is a new research area that has recently gained considerable attention. Existing work in this area mostly emphasizes the creation of fault tolerance mechanisms [13], [21]; descriptions of software architectures with respect to their reliability properties [24], [30]; and the evolution of component-based software architectures by adding or changing components to guarantee reliability properties [7], [10]. In this paper, we incorporate error recovery mechanisms into the proposed software architecture GFTSA in the early system design phase.

0018-9529/$20.00 © 2006 IEEE



The well-defined semantics & syntax make formal modeling techniques suitable for precisely specifying, and formally verifying architecture designs. The formalisms of an architecture style can be used to provide precise, explicit, common idioms & patterns to the software system designers [17], [19], [25]. Object-Z [8], [27] is an object-oriented formal language that can describe internal state transitions, and interface communications, of software components. Compared to other formal languages such as Z, CSP, and TCL, Object-Z has inheritance & instantiation mechanisms. These two mechanisms can help the development of specific distributed systems to reuse the formal model of GFTSA. Based on the Object-Z reasoning rules, we can formally prove the fault tolerant properties of developed systems. Furthermore, the formal model of GFTSA can be reused in the high level design of a specific safety critical distributed system by the customization process. The XVCL technique (XML-based Variant Configuration Language) [14] is used to customize the formal template of GFTSA to formal models of specific systems automatically. A case study LDAS (Line Direction Agreement System) [6] is presented to illustrate the customization process.

The remainder of this paper is organized as follows. Section II describes the GFTSA, including a textual description, a formal model, and formal proofs of fault tolerant properties. Section III presents a template for customization. Section IV presents a case study LDAS to illustrate how to produce the formal model of LDAS automatically, and shows by formal proofs that the LDAS preserves the fault tolerant properties. Section V concludes the paper, and presents the future work.

II. GFTSA (GENERIC FAULT TOLERANT SOFTWARE ARCHITECTURE)

A. The Overall Description of GFTSA

1) Software Architecture Style: GFTSA involves object-oriented organization [11], pipe-and-filter architecture [2], and the repository style [3].
GFTSA can help develop safety critical distributed systems called FTS (Fault Tolerant Systems), which are composed of a set of Objects, a set of Connectors, a set of SharedResources, and a CoordinatingComponent, as shown in Fig. 1. The Object derives from the object-oriented organization [11] to accommodate the distributed environment. In the object-oriented organization, data representations, and their associated primitive operations, are encapsulated in an abstract Object. The object-oriented organization supports concurrent executions because each Object type can implement a separate task, and potentially execute in parallel with other Objects. In the pipe-and-filter architecture [2], filters represent computation entities, and pipes represent the communication channels. Filters do not need to know the identity of their upstream & downstream neighbors by virtue of the pipe style communication. The Objects need to execute concurrently in the distributed system. Therefore, the Connector in GFTSA connects the out_port of one Object to the in_port of another Object, similar to the pipe communication pattern in the pipe-and-filter architecture. The cooperative concurrency is modeled by the Objects interacting with each other via the Connectors to achieve common goals.
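The pipe-style decoupling between Objects can be illustrated with a minimal sketch. The class names `InPort`, `OutPort`, and `Connector`, and the list used as a message buffer, are assumptions made for illustration; they are not part of the GFTSA formal model. The point is only that the sending side writes to its own out_port, and the wiring to the receiver is owned by the connector.

```python
class InPort:
    """Receiving end of a connector; the owning Object reads from `received`."""

    def __init__(self, name):
        self.name = name
        self.received = []


class OutPort:
    """Sending end; the owning Object never names the receiver itself."""

    def __init__(self, name):
        self.name = name
        self._target = None  # wired by a Connector, not by the Object

    def send(self, msg):
        if self._target is None:
            raise RuntimeError(f"{self.name} is not connected")
        self._target.received.append(msg)


class Connector:
    """Connects the out_port of one Object to the in_port of another."""

    def __init__(self, out_port, in_port):
        self.out_port = out_port
        self.in_port = in_port
        out_port._target = in_port
```

For example, after `Connector(a_out, b_in)`, a call to `a_out.send("request")` appends the message to `b_in.received` without the sender referring to the receiving Object.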

Fig. 1. The generic fault-tolerant software architecture.

The style in which Objects compete for SharedResources derives from the repository style [3]. The Objects access SharedResources by request because each SharedResource needs to know the identity of the Objects to guarantee the transaction semantics [9]. The transaction semantics imply that, at a given time, each SharedResource can only be accessed by one Object. The competition of Objects for SharedResources models the competitive concurrency.

2) Coordinated Error Recovery Mechanism: To satisfy the reliability requirements of safety critical distributed systems, several fault tolerant mechanisms are incorporated into GFTSA. Each Object has its own exception context [31] to deal with the exceptions occurring in the execution process, similar to the concept of an idealized fault tolerant component [16]. The exception context has a set of exception handlers [5], one of which is called when its corresponding exception is raised. The Object defines a list of stable states. During the execution of an Object, a checkpoint is used to record the previous stable state of the Object. After calling the exception handler to deal with exceptions, the Object can either go to a stable state, or roll back to the previous stable state recorded by the checkpoint. This solution is scalable as it only requires extending the behavior of existing objects rather than adding new objects to deal with exceptions.

We classify the exceptions raised in the Object into two types: local exceptions, and global exceptions. The influence of a local exception is limited to a single Object. Global exceptions, on the other hand, affect the control flows of more than one Object within a distributed system. Once a local exception is raised in one Object, the Object can call the corresponding exception handler in its own exception context to cope with the exception. If this exception cannot be handled successfully, a global exception is signaled, which is transferred to the CoordinatingComponent.
Fig. 2. The Queue class schema.

If a global exception is originally raised in an Object, this global exception is also passed to the CoordinatingComponent. The CoordinatingComponent broadcasts the global exception to the related Objects & SharedResources within a distributed system. These components need to replace the stable execution with the exception handling execution. When several global exceptions are raised in different Objects concurrently, these global exceptions are passed to the CoordinatingComponent concurrently. The CoordinatingComponent uses the exception graph mechanism [4] to resolve these concurrently raised exceptions into a unique global exception, called the universal exception, which covers all the raised exceptions. When the CoordinatingComponent obtains the universal exception, it propagates this exception to all the related Objects & SharedResources involved in the distributed system. The Objects then call the corresponding exception handler in their own exception contexts to deal with this exception. The state of each SharedResource needs to be restored to its prior stable state.
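The resolution step can be illustrated with a small sketch. It assumes, for illustration only, that the exception graph is a tree in which each exception names the parent that covers it, so that the universal exception is the nearest common cover of the concurrently raised exceptions. The exception names and the `PARENT` table are hypothetical and are not taken from the GFTSA model.

```python
# Hypothetical exception graph: each exception maps to the exception covering it.
PARENT = {
    "sensor_fault": "hardware_fault",
    "actuator_fault": "hardware_fault",
    "hardware_fault": "system_fault",
    "timeout": "system_fault",
    "system_fault": None,  # the root covers every exception
}


def ancestors(exc):
    """Return exc followed by every exception that covers it, nearest first."""
    chain = []
    while exc is not None:
        chain.append(exc)
        exc = PARENT[exc]
    return chain


def resolve(raised):
    """Resolve concurrently raised exceptions into one universal exception."""
    raised = list(raised)
    if not raised:
        raise ValueError("no exception to resolve")
    common = set(ancestors(raised[0]))
    for exc in raised[1:]:
        common &= set(ancestors(exc))
    # Walk upward from the first exception; the first shared cover is the nearest.
    for exc in ancestors(raised[0]):
        if exc in common:
            return exc
    raise ValueError("the raised exceptions share no cover")


def broadcast(universal, components):
    """Inform every related Object and SharedResource of the universal exception."""
    return {name: universal for name in components}
```

With this table, two sibling faults resolve to their shared parent, while a fault and a timeout resolve to the root. Other graph shapes would need a different notion of cover.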

B. Formal Model of GFTSA

In the previous section, we presented GFTSA in a box-and-line fashion, accompanied by textual explanations of the generic architecture style, and a coordinated error recovery mechanism. Such a description cannot provide precise, explicit common idioms & patterns of GFTSA to the system designers. In the description of the formal model of GFTSA, we illustrate how this coordinated error recovery mechanism can be defined precisely.

Z [29] is a formal language based on set theory & predicate logic, which can help describe internal state transitions, and interface communications of a system by the state & operation schema definitions. Many researchers [1], [25] have used Z to formalize the state & computations of software architectures. Object-Z [8], [27] is an extension of Z to accommodate the object-oriented style. The inheritance & instantiation mechanisms can help the customization process. We have used Object-Z to formally specify the GFTSA in this paper. For the formal verification of the different properties of the system, we have used an extension to Object-Z as introduced in [26].

A simple queue example, presented in Fig. 2, describes the basic features of Object-Z. The Queue[Item] class schema is reused later to specify the Object, CoordinatingComponent, and SharedResource classes by the inheritance mechanism. For a full description of Object-Z notations, please refer to [27]. The Queue[Item] class schema is generic with the parameter Item representing the type of items in the queue. The visibility list specifies the interface between objects of the class schema, and their environment. The state variable items is declared in the state schema, and is changed by the operations of the class. The INIT schema defines the initial state of the state variable. The Join, and Leave operation schemas specify that one item? joins the queue, and one item! leaves the queue, together with the state transformations of the variable items.

The formal model of GFTSA² includes the global types, and the Object, Connector, CoordinatingComponent, SharedResource, and FTSystem class schemas.

1) Object: The Object class schema, presented in Fig. 3, describes the stable activities, and error recovery activities of Object components in GFTSA. The Queue[Item] class is inherited using the instantiation, and rename mechanisms of Object-Z. The constant variables n_states, l_excepts, and g_excepts represent three different sets of states that an Object can be in: a set of stable states, a set of local exception states, and a set of global exception states. To model the idea that the IO-ports are directional, we partition them into a set of in_ports, and a set of out_ports. The constant variable comp_msgs represents a set of messages that an Object can transmit to the SharedResources. We associate a message with a port in the coop_msg, which indicates that the message can be received or sent out from the associated port. The transition function specifies that when an Object receives a message at its in_port, the state of the Object can be changed while sending out a message from its out_port at the same time. The except_context function models that any exception occurring in the Object has a corresponding exception handler.
The function exception_handle is used to check whether the exception handler deals with the exception successfully. The bottom predicate part of the constant state schema imposes several constraints on the constant variables.

The state schema in the Object class schema declares four state variables: inter_state, checkpoint, ue_rec, and sr_qlist. The inter_state denotes the current state of an Object, the checkpoint records the prior stable state of an Object, the ue_rec indicates whether an Object has received a universal exception from the CoordinatingComponent, and the sr_qlist records the identity of the SharedResources which are available for the Object.

The Transition operation denotes the state transitions of an Object according to the transition function. The LocalExceptHandle operation specifies how the Object deals with a local exception. The operations GlobalExceptPropagate, UniExceptReceive, and UniExceptHandle denote how the Object implements a coordinated error recovery mechanism. Each operation describes the transformations of state variables, and the interface message communications. The operation SRRequest specifies the access request of the Object to the SharedResource. The FromSR operation describes that the Object receives answers from an available SharedResource. The ToSR operation denotes that the Object sends out the msg! to the assured available SharedResource. This communication protocol can guarantee the transaction semantics of SharedResource.

² The complete formal model of GFTSA is presented in http://www.comp.nus.edu.sg/~yuanling/gftsa.pdf.

Fig. 3. The object class schema.

Fig. 4. The connector class schema.

Fig. 5. The CoordinatingComponent class schema.

2) Connector: The Connector class schema, presented in Fig. 4, describes that a connector is responsible for connecting the send_port of an Object to the receive_port of another Object to transfer the message represented by msg. No operation occurs on the connectors. In the predicate, the declaration send_ob, receive_ob: Object means that send_ob, and receive_ob refer to the Object class schema, or any of its subclass schemas which are in the inheritance hierarchy rooted at the Object class.
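The checkpoint and local exception handling behavior of an Object can be sketched as follows. This is a simplified illustration and not a translation of the Object-Z schema: the class name `FtObject`, the dictionary encoding of the transition function, and the convention that a handler returns `True` on success are all assumptions. The attribute names `state` and `checkpoint` stand in for inter_state and checkpoint.

```python
class GlobalException(Exception):
    """Signalled when a local exception cannot be handled inside the Object."""


class FtObject:
    def __init__(self, initial_state, stable_states, transitions, handlers):
        self.stable_states = set(stable_states)
        self.state = initial_state       # plays the role of inter_state
        self.checkpoint = initial_state  # last stable state that was recorded
        self.transitions = transitions   # (state, message) -> next state
        self.handlers = handlers         # exception -> handler, as in except_context

    def transition(self, message):
        """Record the current state if it is stable, then follow the table."""
        if self.state in self.stable_states:
            self.checkpoint = self.state
        self.state = self.transitions[(self.state, message)]

    def handle_local(self, exception):
        """Run the handler; roll back on success, escalate on failure."""
        handler = self.handlers.get(exception)
        if handler is not None and handler(exception):
            self.state = self.checkpoint  # backward recovery to the checkpoint
            return
        # Unhandled local exceptions become global, for the CoordinatingComponent.
        raise GlobalException(exception)
```

The sketch only covers backward recovery; moving forward to a different stable state, which the text also allows, would need the handler to return the target state.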

3) CoordinatingComponent: The CoordinatingComponent class schema, presented in Fig. 5, describes how the CoordinatingComponent implements the coordinated error recovery mechanism when a global exception is raised, or several global exceptions are raised concurrently. The CoordinatingComponent class schema inherits the Queue[Item] class schema instantiated with type GE, and uses the rename mechanism of Object-Z. The constant variable except_graph is a function to resolve several concurrently raised exceptions into a universal exception, represented by uni_exception, that can cover all the raised exceptions. The state variable exceptions represents the sequence of received exceptions. The operations ExceptRec, and ExceptGraph are responsible for receiving exception? from Objects, resolving these received exceptions by the except_graph, and sending out the resolved exception uni_exception!.

4) SharedResource: The SharedResource class schema, presented in Fig. 6, models how the SharedResource can guarantee the transaction semantics when receiving messages from Objects, and preserve a consistent state when facing exceptions. The SharedResource class schema inherits the Queue[Item] schema by using the instantiation & rename mechanisms of Object-Z. The constant variable states represents a set of states in which the SharedResource can be, and the function trans is used to model the state transition of the SharedResource when it receives a message from an Object. The state variables semaphore, ob_qlist, sr_state, and checkpoint represent the signal to show whether the SharedResource is accessed by an Object, the request list of Objects, the current state of the SharedResource, and the recorded prior stable state of the SharedResource respectively. The ObList operation specifies that the SharedResource receives an access request from an Object. The Available operation models how, when the SharedResource is available, it sends out a signal to the Object. The Trans operation specifies the state transitions of the SharedResource according to the trans function when receiving msg? from an Object. The Except operation describes that the state of the SharedResource needs to roll back to the prior state recorded in the checkpoint when facing a uni_exception?.

5) FTSystem: The FTSystem class schema, presented in Fig. 7, describes how the components & connectors in GFTSA, which constitute an FTS (Fault Tolerant System), are synchronized. The constant variable critical represents the set of Objects whose Fail state can cause the whole FTS to stop. The Result_Control is a function to check the execution result of the FTS. If none of the critical Objects is in the Fail state, the execution result of the FTS is tolerate, which means that the FTS can recover from the exceptions; otherwise the whole FTS has to stop execution. The instances of components & connectors in the FTS are all declared in the state schema. The secondary variable ob_fail records the set of Objects in the Fail state. The SystemRecover operation models that, when the execution result of the FTS is tolerate, the states of all Objects in the FTS should be initialized. The Transition operation expression is the conjunction of the Transition operations of all Objects in the FTS. The operation ExceptPropagate specifies that, when global exceptions have been raised in the Objects, these Objects need to pass the exceptions to the CoordinatingComponent; the parallel operator is used to compose the two operations together. The ExceptGraph operation specifies that the CoordinatingComponent sends the resolved exception uni_exception! to all the Objects & SharedResources. The operations ObReqSR, SRAnsOb, and ObAccSR are used to model how the Objects compete for the SharedResources.

Fig. 6. The SharedResource class schema.

Fig. 7. The FTSystem class schema.

C. Reasoning About GFTSA

Reasoning about the formal model of the system enables the designer to gain confidence in the correctness of the system development [23]. Because GFTSA is used to help design distributed systems with reliability requirements, reasoning about GFTSA mainly involves showing that the formal model of GFTSA can preserve fault tolerant properties, which are expressed as theorems. The process of reasoning needs to use formal reasoning rules based on the semantics of Object-Z [26] to prove that fault tolerant properties can be derived from the formal model of GFTSA. The following items show that GFTSA can preserve significant fault tolerant properties.

1) When a global exception is raised by an Object in the FTS, all of the Objects & SharedResources in the FTS should be informed about the exception. This property can be formally expressed as the following theorem. The formal proof of this property is presented in Appendix I-1.



Theorem

2) If an Object in the FTS raises a local exception, the other Objects in the FTS are not influenced, and remain in their stable states. This property can be formally expressed as follows. The formal proof of this property is presented in Appendix I-2.

Theorem

3) When two global exceptions are raised concurrently by two Objects in the FTS, the FTS can inform all the Objects to handle these exceptions. This property can be formally expressed as follows. The formal proof of this property is presented in Appendix I-3.

Theorem

4) When a non-critical Object fails, the FTS can tolerate this fault, which means that the states of all Objects in the FTS can recover to their stable states. This property can be formally expressed as follows. The formal proof of this property is presented in Appendix I-4.

Theorem
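Properties 1 and 4 can be exercised on a toy model. The sketch below is only an executable illustration and does not replace the Object-Z proofs. The dictionary encoding of Object states, the result value `"stop"`, and the function names are assumptions; only the Fail state, the critical set, and the tolerate result come from the description of the FTSystem class schema.

```python
FAIL = "Fail"


def result_control(states, critical):
    """Return 'tolerate' unless some critical Object is in the Fail state."""
    failed = {name for name, state in states.items() if state == FAIL}
    return "stop" if failed & set(critical) else "tolerate"


def system_recover(states, initial, critical):
    """Re-initialize every Object when the execution result is tolerate."""
    if result_control(states, critical) != "tolerate":
        return states  # the whole FTS has to stop; nothing is recovered
    return dict(initial)


def propagate(universal, objects, shared_resources):
    """Property 1: every Object and SharedResource is informed."""
    return {name: universal for name in list(objects) + list(shared_resources)}
```

For instance, with `critical = {"O2"}`, a failure of `O1` yields `"tolerate"` and all Objects return to their initial states, whereas a failure of `O2` yields `"stop"`.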

III. THE TEMPLATE OF GFTSA IN XVCL

By customizing the formal model of GFTSA, we can build formal models of specific safety critical distributed systems. During the customization process, each class schema in the formal model of GFTSA can be inherited by the class schemas of specific safety critical distributed systems. Besides inheritance, the customization process involves several other mechanisms, such as defining a visibility list, declaring predicates of constant variables, defining an initial schema, defining new operation schemas, etc. The following class schemas describe the customization from the GFTSA model to the formal model of a specific safety critical distributed system. The Object class schema for customization is presented in Fig. 8, and the other class schemas including Connector, CoordinatingComponent, SharedResource, and FTSystem are presented in Fig. 9. Note that the items in quotation marks can be instantiated according to the requirements of specific systems. The meaning of each item is interpreted in the corresponding comment.

Fig. 8. The object class schema for customization.

Fig. 9. The Connector, CoordinatingComponent, SharedResource, and FTSystem class schemas for customization.

The class schemas above consist of fixed points, and variable points, which can be reused in the high level design of specific safety critical distributed systems through customization. During the customization process, we just need to put effort into the variable part by giving concrete definitions to the items in quotation marks. Therefore, these class schemas can be viewed as a template of the GFTSA formal model for customization.

XVCL (XML-based Variant Configuration Language) [14], [15], [28] is a meta-programming technique developed to facilitate building flexible, adaptable, and reusable software artifacts. In the XVCL technique, all variation points, small or large, can be represented as meta-expressions instantiated during the customization process according to the specific requirements. Following the mechanism of XVCL, the template of the GFTSA formal model can be built as generic, adaptable fragments, called x-frames. Each x-frame is an XML file composed of Object-Z class schema LaTeX code, and XVCL commands. To help the flexible reuse of the template of the GFTSA formal model, we build five primitive x-frames³ corresponding to the five class schemas above, called object, connector, coco, sr, and ftsystem. In the primitive x-frame, each item in quotation marks is represented as a variable with the same name, and XVCL commands are used to mark this variable to help the adaptation in the customization process.

The template of the GFTSA formal model is built based on the formal model of GFTSA with the inheritance & instantiation mechanisms of Object-Z, which ensure that the fault tolerant properties of GFTSA can be preserved. The instantiation of variation points in the template by the use of the XVCL technique during the adaptation process does not consider the semantic factor. If the formal model of the specific system generated by customization of the template needs to preserve the fault tolerant properties of GFTSA, we should establish semantic rules to guarantee the compatibility between the formal model of the specific system, and that of GFTSA. These rules are as follows:

1) When giving a concrete definition or a new predicate to a constant variable which is already defined in the formal model of GFTSA, the concrete definition or new predicate should be compatible with the predicates declared in the formal model of GFTSA associated with this constant variable.

2) When declaring a new constant variable, the name, type, and predicate of this new variable should not clash with the constant variables already declared in the formal model of GFTSA.

3) When giving the concrete initial state of a state variable, its type should be consistent with that of the initial state of the same state variable, as already declared in the INIT state schema of the formal model of GFTSA.

4) When giving the concrete definition of a function, this definition should be consistent with the function declaration defined in the formal model of GFTSA.

5) When defining an operation schema which has the same schema name as one in the formal model of GFTSA, if the newly defined predicate of this operation schema cannot be conjoined with the predicates of the operation schema of the same name declared in the formal model of GFTSA, we should rename the operation schema declared in the formal model of GFTSA.

³ These x-frames built for the template of the GFTSA formal model are all presented in http://www.comp.nus.edu.sg/~yuanling/template.pdf.

IV. CASE STUDY - LDAS (LINE DIRECTION AGREEMENT SYSTEM)

In the previous sections, we proposed a novel heterogeneous software architecture GFTSA, and the template of the GFTSA formal model for customization. In this section, we use an LDAS system case study to demonstrate how GFTSA can guide the high level design of specific safety critical distributed systems.

A. The Development of LDAS Guided by GFTSA

The safety critical distributed system LDAS (Line Direction Agreement System) [6] is designed to control the line direction to prevent head-on train crashes on the line. Each station communicates with LDACS (Line Direction Agreement Control System) to guarantee that, at any time, only one train runs on the line connecting two stations. The operator in each station can command the station. Fig. 10 shows a part of LDAS, composed of three stations. The overall LDAS can be much more complex, comprising several stations, and communication paths. In the system under consideration, StationA and StationB communicate with LDACS to control the direction of LineAB. Similarly, StationB and StationC communicate with LDACS to control LineBC. We can consider StationA, OperatorA, LDACS, StationB, and OperatorB to be a relatively independent sub-system that can be analyzed in isolation, as it has the requirement for both cooperative & competitive concurrency, even in isolation. The interaction pattern in the other sub-systems, e.g., the system comprising StationB and StationC, follows the same regulation as the above mentioned case.
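The requirement that only one train at a time runs on a line corresponds to the transaction semantics of a SharedResource in GFTSA. The following is a minimal sketch under stated assumptions: requests are served first come first served, a `holder` field stands in for the semaphore, and the state and message names (`free`, `a_to_b`, `reserve_a_to_b`) are hypothetical. The `release` method is also an assumption, since the text does not describe how access ends.

```python
from collections import deque


class SharedResource:
    def __init__(self, initial_state, trans):
        self.sr_state = initial_state
        self.checkpoint = initial_state  # prior stable state
        self.trans = trans               # (state, message) -> next state
        self.ob_qlist = deque()          # pending access requests
        self.holder = None               # None means the resource is free

    def request(self, ob):
        """Queue an access request from an Object (compare ObList)."""
        self.ob_qlist.append(ob)

    def available(self):
        """Grant access to the next requester if free (compare Available)."""
        if self.holder is None and self.ob_qlist:
            self.holder = self.ob_qlist.popleft()
        return self.holder

    def access(self, ob, message):
        """Change state on a message from the current holder only (compare Trans)."""
        if ob != self.holder:
            raise PermissionError(f"{ob} does not hold the resource")
        self.checkpoint = self.sr_state
        self.sr_state = self.trans[(self.sr_state, message)]

    def release(self, ob):
        """Hypothetical: the holder gives the resource back."""
        if ob == self.holder:
            self.holder = None

    def on_universal_exception(self):
        """Roll back to the recorded prior state (compare Except)."""
        self.sr_state = self.checkpoint
```

If StationA and StationB both request the line, only the first requester can change its state until it releases the line, and a universal exception restores the state recorded before the last change.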

Fig. 10. The LDAS system.

Fig. 11. GFTSA architecture view of LDAS sub-system.

Fig. 12. The x-frame adaptation relationship.

Fig. 13. The trans function.

Fig. 14. The StationA StateChart.

B. The Generation of Formal Model of LDAS

According to the box-and-line patterns of GFTSA shown in Fig. 1, the LDAS sub-system is composed of five Objects, called StationA, OperatorA, LDACS, StationB, OperatorB, and a SharedResource, called LineAB. Six connectors are used to assist the communication among the Objects. A CoordinatingComponent called CC is also involved in the LDAS to implement a coordinated error recovery mechanism. Fig. 11 shows the model design of LDAS in the box-and-line fashion guided by the pattern of GFTSA.
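The composition described above can be written down as plain data and checked mechanically. In the sketch below, the connector names follow the x-frame names used later in this section; OSA is stated in the text to connect OperatorA to StationA, but the endpoints of the other five connectors are inferred from their names and are therefore assumptions. The `check_architecture` function is a hypothetical helper, not part of the LDAS formal model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    name: str
    send_ob: str
    receive_ob: str


OBJECTS = {"StationA", "OperatorA", "LDACS", "StationB", "OperatorB"}
SHARED_RESOURCES = {"LineAB"}
COORDINATOR = "CC"

# Endpoints other than osa are inferred from the connector names (assumption).
CONNECTORS = [
    Connector("osa", "OperatorA", "StationA"),
    Connector("sal", "StationA", "LDACS"),
    Connector("lsa", "LDACS", "StationA"),
    Connector("lsb", "LDACS", "StationB"),
    Connector("sbl", "StationB", "LDACS"),
    Connector("osb", "OperatorB", "StationB"),
]


def check_architecture(objects, connectors):
    """Return a list of problems; every connector must join two distinct Objects."""
    problems = []
    for conn in connectors:
        if conn.send_ob not in objects or conn.receive_ob not in objects:
            problems.append(f"{conn.name}: unknown endpoint")
        elif conn.send_ob == conn.receive_ob:
            problems.append(f"{conn.name}: connects an Object to itself")
    return problems
```

Calling `check_architecture(OBJECTS, CONNECTORS)` returns an empty list for this configuration, which matches the five Objects and six connectors named in the text.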

In the following, we generate the formal model of LDAS⁴ by customizing the template of the GFTSA formal model presented in Section III. The five primitive x-frames in the template can be reused during the customization via adaptation. Following the mechanisms of XVCL, the adaptation means that a new x-frame for one component in the specific system is built based on the corresponding primitive x-frame in the template by using an XVCL command, and instantiating the variation points. We can build x-frames⁵ for the formal models of LDAS based on the primitive x-frames of the template of the GFTSA formal model, complying with the semantic rules defined in Section III.

Fig. 12 describes the x-frame adaptation relationships between the LDAS, and the template of the GFTSA formal model. The sa, sb, ldacs, oa, and ob x-frames are built for StationA, StationB, LDACS, OperatorA, and OperatorB correspondingly. The osa, sal, lsa, lsb, sbl, and osb x-frames are built for the six connectors in the LDAS. The cc x-frame is built for the CC component, and the line x-frame is built for the LineAB component. The ldas x-frame is built to describe how these components & connectors synchronize. By running the XVCL processor with the gsa SPC file, which adapts all of the 14 x-frames of LDAS, we can generate a formal model of LDAS automatically. In the following, several representative class schemas are presented to illustrate the features of the formal model of the LDAS.

⁴ The formal model of LDAS is presented in http://www.comp.nus.edu.sg/~yuanling/fldas.pdf.

⁵ The x-frames for the formal models of LDAS are presented in http://www.comp.nus.edu.sg/~yuanling/xldas.pdf.

Fig. 15. The StationA class schema.

Fig. 16. The OSA class schema.

Fig. 17. The CC class schema.

1) Object: The StationA class represents the Object components in the LDAS; it describes how StationA interacts with other Objects & the SharedResource, and how it deals with local & global exceptions. The Trans function describes the state transitions of StationA, presented in Fig. 13. These state transitions are also shown as a StateChart in Fig. 14.

The Trans function defines not only the stable state transitions, but also the exceptional state transitions of StationA. The APointClose, AskedOpen, AskedClose, and AskedOpen are four stable states. The input_exception, bothopen_exception, and output_exception are three exceptional states. When receiving messages from other Objects, the state of StationA could be transformed from one stable state to either another stable state, or an exceptional state. The StationA class schema inherits the Object class schema by using the rename mechanism of Object-Z, presented in Fig. 15. The local exception input_exception defined in l_excepts indicates that StationA cannot handle the received messages. The global exception bothopen_exception indicates that both StationA, and StationB can open the gates. The global exception bothclose_exception indicates that both StationA, and StationB can close the gates. A forward_recovery function is defined to handle exceptions. We use this forward_recovery to handle bothopen_exception, and bothclose_exception. When facing input_exception, the state of StationA rolls back to the state recorded in the checkpoint, which is described in the LocalExceptHandle operation schema.

2) Connector: The OSA class schema represents the Connector in the LDAS which connects OperatorA to StationA, presented in Fig. 16.

3) CoordinatingComponent: The CC class schema represents the CoordinatingComponent in the LDAS, which describes how the LDAS implements the coordinated error recovery mechanism, presented in Fig. 17. The concrete definition of except_graph is presented in the CC class schema. When there exists a bothopen_exception among the concurrently raised global exceptions, the bothopen_exception is chosen as the universal exception. When all of the concurrently raised global exceptions are bothclose_exception, it is chosen as the universal exception.
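The selection rule just described for the CC component can be illustrated with a short Python sketch. It is only an illustration, not the Object-Z definition of except_graph given in Fig. 17: the function name `universal_exception` and the encoding of exceptions as strings are assumptions introduced here, and since the text does not say what happens for an empty sequence or for other exception names, the sketch simply rejects those inputs.

```python
from typing import Sequence

BOTHOPEN = "bothopen_exception"
BOTHCLOSE = "bothclose_exception"


def universal_exception(raised: Sequence[str]) -> str:
    """Pick the universal exception among concurrently raised global exceptions.

    Hypothetical helper mirroring the two rules stated for the CC component.
    """
    if not raised:
        # The text only covers a non-empty sequence of exceptions.
        raise ValueError("no global exception was raised")
    unknown = set(raised) - {BOTHOPEN, BOTHCLOSE}
    if unknown:
        # Assumption: any other exception name is outside this sketch.
        raise ValueError(f"unexpected global exceptions: {sorted(unknown)}")
    # Rule 1: a bothopen_exception among the raised exceptions is chosen.
    if BOTHOPEN in raised:
        return BOTHOPEN
    # Rule 2: otherwise every raised exception is bothclose_exception.
    return BOTHCLOSE
```

For example, `universal_exception([BOTHCLOSE, BOTHOPEN])` selects bothopen_exception, while a sequence containing only bothclose_exception selects bothclose_exception.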
4) SharedResource: The LineAB class schema represents the SharedResource in the LDAS, which can be accessed by StationA, and StationB, and should guarantee the transaction semantics, presented in Fig. 18. The trans function represents the state transitions of LineAB. Since the LineAB class schema inherits the SharedResource class schema, the Except operation in the SharedResource class schema can handle the exceptions.

Fig. 18. The LineAB class schema.

5) FTSystem: How the components & connectors in the LDAS cooperate with each other is described in the LDAS class, presented in Fig. 19.

Fig. 19. The LDAS class schema.

C. Verifying Fault Tolerant Properties of an LDAS System

By adapting the template of GFTSA complying with the defined semantic rules, we generate the formal model of LDAS. Formal proofs are constructed to demonstrate that the formal model of LDAS can preserve fault tolerant properties. Similar to the proof methodology used in Section II-C, two significant fault tolerant properties can be derived from the formal model of LDAS.

1) When the bothopen_exception is raised in StationA, which represents that both StationA and StationB can open the gates, the LDAS can tolerate this exception. It means that Objects in the LDAS can handle this exception, and recover to stable states. This property can be formally expressed as follows. The formal proof of this property is presented in Appendix II-1.

Theorem

2) When the bothopen_exception is raised in LDACS, representing that both StationA and StationB can open the gates, and concurrently the bothclose_exception is raised in StationB, meaning that both stations can close the gates, the LDAS can also handle this situation. It means that each Object in the LDAS can recover to its stable state. This property can be formally expressed as follows. The formal proof of this property is presented in Appendix II-2.

Theorem

V. CONCLUSION, AND FUTURE WORK

This paper proposes a novel heterogeneous software architecture GFTSA (Generic Fault Tolerant Software Architecture) which can guide the development of safety critical distributed systems. GFTSA integrates fault tolerant mechanisms with functional aspects in the early system design phase, and combines several widely used basic architecture styles. A formal model of GFTSA can be directly reused in the high level design of specific safety critical distributed systems via a customization process. The XVCL technique makes the customization process more convenient & automatic. The formal proofs constructed from Object-Z reasoning rules demonstrate that the formal model of a specific safety critical distributed system generated by customization can preserve fault tolerant properties.

The formal proofs shown in this paper are all done manually, which is laborious, and error-prone. At this stage, we need to use a verification tool to generate the proofs systematically. PVS (Prototype Verification System) [20] is a verification system developed at SRI. PVS has a powerful interactive theorem prover, and its automation suffices to prove many straightforward results automatically. Therefore, we can embed a formal model of GFTSA into the PVS environment to prove fault tolerant properties systematically by virtue of the theorem prover of PVS.

APPENDIX I
THE PROOF OF THE FAULT TOLERANT PROPERTIES OF GFTSA

1) When a global exception is raised by an Object in the FTS, all of the Objects, and SharedResources in the FTS should be informed about the exception. This property can be formally expressed as follows.


Theorem

As an intermediate step, it is useful to think of the proof strategy informally. When a global exception is raised in an Object, the Object can use GlobalExceptPropagate to send this global exception out to the CoordinatingComponent, and the CoordinatingComponent can use the ExceptRec operation to receive this global exception. These two operations are combined in the ExceptPropagate operation expression declared in the FTSystem class schema. Because the sequence exceptions is not empty, the ExceptGraph operation in the CoordinatingComponent class schema sends out the uni_exception!. The UEReceive operation in the Object, and the Except operation in the SharedResource can receive the un_exception?, which is expressed in the ExceptGraph operation declared in the FTSystem class schema. When an Object receives the un_exception?, the value of ue_rec is changed to 1. When a SharedResource receives the un_exception?, its sr_state rolls back to its pre-state recorded in the checkpoint. These transformations assure that the Objects, and SharedResources are informed about the exception. In the following, a formal proof based on this strategy is constructed.

Proof
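As an informal complement to the strategy above (it is not a substitute for the Object-Z proof), the propagation chain can be simulated in a few lines of Python. All class and function names below are hypothetical stand-ins for the class schemas; in particular, the assumption that a single raised exception is itself returned as the universal exception is made here for illustration and is not taken from the formal model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Obj:
    """Hypothetical stand-in for the Object class schema."""
    name: str
    ue_rec: int = 0  # set to 1 once the universal exception is received

    def ue_receive(self, uni_exception: str) -> None:
        self.ue_rec = 1


@dataclass
class SharedResource:
    """Hypothetical stand-in for the SharedResource class schema."""
    sr_state: str
    checkpoint: str  # pre-state recorded before the transaction

    def handle_except(self, uni_exception: str) -> None:
        # Roll back to the pre-state recorded in the checkpoint.
        self.sr_state = self.checkpoint


@dataclass
class CoordinatingComponent:
    exceptions: List[str] = field(default_factory=list)

    def except_rec(self, exception: str) -> None:
        self.exceptions.append(exception)

    def except_graph(self) -> str:
        if not self.exceptions:
            raise RuntimeError("the sequence of exceptions is empty")
        # Assumption: with one raised exception, it is the universal one.
        return self.exceptions[0]


def propagate_global_exception(exception: str,
                               cc: CoordinatingComponent,
                               objects: List[Obj],
                               resources: List[SharedResource]) -> str:
    cc.except_rec(exception)           # GlobalExceptPropagate with ExceptRec
    uni_exception = cc.except_graph()  # CC sends out the universal exception
    for obj in objects:
        obj.ue_receive(uni_exception)  # UEReceive in every Object
    for res in resources:
        res.handle_except(uni_exception)  # Except in every SharedResource
    return uni_exception
```

After one call to `propagate_global_exception`, every `Obj` has `ue_rec == 1` and every `SharedResource` is back at its checkpoint, which is the informal content of the property.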


2) If an Object in the FTS raises a local exception, the other Objects in the FTS are not influenced, and remain in their stable states. This property can be formally expressed as follows.

Theorem

The proof strategy is described first. When a local exception is raised by an Object, the LocalExceptHandle operation in the Object class is executed to handle this exception. If this local exception is handled successfully, the state of the Object recovers to a stable state. Otherwise, the state of the Object is changed to a global exception state. There is no communication between this Object and other Objects. The other Objects therefore remain in their stable states, and cannot be influenced by the raised local exception. In the following, a formal proof based on this strategy is constructed.

Proof
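The isolation argument can also be illustrated with a small Python sketch, offered as an informal aid rather than as part of the proof. The class `Obj`, the state label `GLOBAL_EXCEPTION_STATE`, and the use of a checkpoint as the stable state to recover to are assumptions made for this example; the handler is passed in as a callable that reports whether local handling succeeded.

```python
from dataclasses import dataclass
from typing import Callable

GLOBAL_EXCEPTION_STATE = "global_exception"  # hypothetical state label


@dataclass
class Obj:
    """Hypothetical stand-in for the Object class schema."""
    name: str
    state: str
    checkpoint: str  # last recorded stable state (assumption)

    def local_except_handle(self, exception: str,
                            handler: Callable[[str], bool]) -> None:
        # Only this object's own state is read or written here,
        # which is why other objects cannot be influenced.
        if handler(exception):
            self.state = self.checkpoint          # recovered to a stable state
        else:
            self.state = GLOBAL_EXCEPTION_STATE   # escalated to a global exception
```

Because `local_except_handle` touches only `self`, any other `Obj` instance keeps whatever stable state it had before the call, whether or not the handler succeeds.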

3) When two global exceptions are raised concurrently by two Objects in the FTS, the FTS can inform all the Objects to handle these exceptions. This property can be formally expressed as follows. Theorem

When two global exceptions are raised concurrently in the FTS, each Object can use the GlobalExceptPropagate operation to send its global exception out to the CoordinatingComponent, and the CoordinatingComponent can use the ExceptRec operation to receive these two global exceptions. The GlobalExceptPropagate, and ExceptRec operations are combined in the ExceptPropagate operation



expression declared in the FTSystem class schema. Because the sequence exceptions is not empty, the ExceptGraph operation can use the function except_graph to get the uni_exception, which covers the two global exceptions. After that, the uni_exception! is sent out to all the Objects in the FTS. The UEReceive operation in the Object class can receive this un_exception?. The operations ExceptGraph, and UEReceive are combined in the ExceptGraph operation expression declared in the FTSystem class schema. When an Object receives the un_exception?, the value of ue_rec is changed to 1, which means that the Object is informed about the exceptions. In the following, a formal proof based on this strategy is constructed. Proof
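One common way to realize a function that returns an exception "covering" several concurrently raised exceptions is an exception resolution tree, where the covering exception is the lowest common ancestor. The Python sketch below assumes such a tree with a single root; this structure, the function names, and the example exception names are all assumptions for illustration, since the concrete definition of except_graph is given only in the Object-Z schemas.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical resolution tree: child -> parent, the root maps to None.
EXAMPLE_PARENT: Dict[str, Optional[str]] = {
    "sensor_fault": "hardware_fault",
    "actuator_fault": "hardware_fault",
    "hardware_fault": "universal",
    "timeout": "universal",
    "universal": None,
}


def ancestors(exc: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return exc followed by its ancestors, most specific first."""
    chain: List[str] = []
    cur: Optional[str] = exc
    while cur is not None:
        chain.append(cur)
        cur = parent[cur]
    return chain


def covering_exception(raised: Sequence[str],
                       parent: Dict[str, Optional[str]]) -> str:
    """Lowest common ancestor of all raised exceptions in the tree."""
    if not raised:
        raise ValueError("the sequence of exceptions is empty")
    common = ancestors(raised[0], parent)
    for exc in raised[1:]:
        others = set(ancestors(exc, parent))
        # Keep order (most specific first) while intersecting.
        common = [e for e in common if e in others]
    if not common:
        raise ValueError("exceptions do not share a root")
    return common[0]
```

With `EXAMPLE_PARENT`, two sibling faults resolve to their shared parent, and unrelated exceptions resolve to the root, which then plays the role of the universal exception sent to all Objects.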

4) When a non-critical Object fails, the FTS can tolerate this fault, which means that the states of all Objects in the FTS can recover to stable states. This property can be formally expressed as follows. Theorem

First, we give the informal proof description. When the state of an Object is Fail, if this Object is not in the state critical declared in the FTSystem class schema, we can obtain the execution result, namely tolerate, by using the function Result_Control. Because the execution result is tolerate, the SystemRecover operation is used to reset the states of all Objects to the initial states, which means that all the Objects recover to stable states. A formal proof based on the previous description is shown in the following.

Proof
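The decision described above can be sketched in Python as follows. This is an informal illustration under stated assumptions: `critical` is modeled as a set of names of critical Objects, the result labels "normal" and "not_tolerate" are hypothetical (the text names only tolerate), and the function names merely echo Result_Control and SystemRecover.

```python
from typing import Dict, Set

FAIL = "Fail"


def result_control(states: Dict[str, str], critical: Set[str]) -> str:
    """Classify the execution result from the current Object states."""
    failed = {name for name, state in states.items() if state == FAIL}
    if not failed:
        return "normal"        # hypothetical label: nothing failed
    if failed & critical:
        return "not_tolerate"  # hypothetical label: a critical Object failed
    return "tolerate"          # only non-critical Objects failed


def system_recover(states: Dict[str, str],
                   initial: Dict[str, str],
                   critical: Set[str]) -> Dict[str, str]:
    """Reset all Objects to their initial states when the result is tolerate."""
    if result_control(states, critical) == "tolerate":
        return dict(initial)
    return dict(states)
```

For instance, if a non-critical Object is in the Fail state, `system_recover` returns a copy of the initial states, so every Object is back in a stable state; if a critical Object fails, the sketch leaves the states unchanged because the text does not describe that case.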

APPENDIX II
THE PROOF OF FAULT TOLERANT PROPERTIES OF LDAS SYSTEM

1) When the bothopen_exception is raised in StationA, which represents that both StationA and StationB can open the gates, the LDAS can tolerate this exception. It means that Objects in the LDAS can handle this exception, and recover to stable states. This property can be formally expressed as follows.

Theorem



As an intermediate step, it is useful to think of a proof strategy informally. When a global exception called bothopen_exception is raised in StationA, StationA can use the GlobalExceptPropagate operation to send this global exception out to CC, and CC can use the ExceptRec operation to receive this global exception. These two operations are combined in the ExceptPropagate operation expression declared in the LDAS class schema. Because the sequence exceptions is not empty, the ExceptGraph operation in the CC class sends out the uni_exception!. The UEReceive operation in each Object of the LDAS can receive this un_exception?, which is expressed in the ExceptGraph operation declared in the LDAS class schema. When each Object of the LDAS receives the un_exception?, its state is changed to a stable state. These transformations assure that Objects in the LDAS can handle the global exception. In the following, a formal proof based on this strategy is constructed.

Proof

2) When the bothopen_exception is raised in LDACS, which represents that both StationA and StationB can open the gates, and concurrently the bothclose_exception is raised in StationB, which means that both stations can close the gates, the LDAS can handle this situation. It means that each Object in the LDAS can recover to a stable state. This property can be formally expressed as follows.

Theorem

When the bothopen_exception is raised in LDACS, and the bothclose_exception is raised in StationB concurrently, each of them can use the GlobalExceptPropagate operation to send the exception out to the CC, and the CC can execute the ExceptRec operation to receive these two global exceptions. The GlobalExceptPropagate, and ExceptRec operations are combined in the ExceptPropagate operation expression declared in the LDAS class schema. Because the sequence exceptions is not empty, the ExceptGraph operation in the CC class schema can send out the uni_exception!, which covers bothopen_exception, and bothclose_exception. The UEReceive operation in each Object of the LDAS can receive this un_exception?. The two operations, ExceptGraph, and UEReceive, are combined in the ExceptGraph operation expression declared in the LDAS class schema. When each Object in the LDAS receives the un_exception?, its state is changed to a stable state. The following is a formal proof based on this strategy.



Proof

ACKNOWLEDGMENT

The authors would like to thank Dr. Stan Jarzabek at the National University of Singapore for his help in applying the XVCL technique, and the editors for their constructive comments which greatly improved this paper.

REFERENCES

[1] G. D. Abowd, R. Allen, and D. Garlan, “Formalizing style to understand descriptions of software architecture,” Transactions on Software Engineering and Methodology, vol. 4, no. 4, pp. 319–364, 1995.
[2] R. Allen and D. Garlan, “A formal approach to software architectures,” in Proceedings of IFIP’92, 1992.
[3] V. Ambriola, P. Ciancarini, and C. Montangero, “Software process enactment in Oikos,” in Proceedings of the Fourth ACM SIGSOFT, California, 1990, pp. 183–192.
[4] R. Campbell and B. Randell, “Error recovery in asynchronous system,” IEEE Transactions on Software Engineering, vol. SE-12, no. 8, pp. 826–881, 1986.
[5] F. Cristian, “Exception handling and tolerance of software faults,” Software Fault Tolerance, pp. 81–107, 1994.
[6] D. Bjørner, C. W. George, B. Stig Hansen, H. Laustrup, and S. Prehn, A Railway System, Coordination’97, Case Study Workshop Example, UNU/IIST, P.O. Box 3058, Macau, Technical Report 93, 1997.
[7] R. de Lemos, “Describing evolving dependable systems using co-operative software architecture,” in Proceedings of the IEEE International Conference on Software Maintenance, 2001, pp. 320–329.
[8] R. Duke and G. Rose, Formal Object Oriented Specification Using Object-Z. Macmillan, 2000.
[9] J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[10] P. Guerra, C. Rubira, and R. de Lemos, “An idealized fault-tolerant architectural component,” in Proceedings of the 24th International Conference on Software Engineering-Workshop on Architecting Dependable Systems, 2002.
[11] W. Harrison, “RPDE: a framework for integrating tool fragments,” IEEE Software, vol. SE-4, no. 6, 1987.
[12] C. A. R. Hoare, “Communicating sequential processes,” CACM, vol. 21, no. 8, pp. 666–677, 1978.
[13] V. Issarny and J. P. Banatre, “Architecture-based exception handling,” in Proceedings of the 34th Annual Hawaii International Conference on System Sciences, IEEE, 2001.
[14] S. Jarzabek and S. B. Li, “Eliminating redundancies with a ‘composition with adaptation’ meta-programming technique,” in European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundation of Software Engineering, September 2003, pp. 237–246.
[15] S. Jarzabek and H. Zhang, “XML-based method and tool for handling variant requirements in domain models,” in 5th IEEE International Symposium on Requirements Engineering, August 2001, pp. 166–173.
[16] P. A. Lee and T. Anderson, Fault Tolerance: Principles and Practice, 2nd ed. Prentice Hall, 1990.
[17] D. Luckham and J. Vera, “An event based architecture definition language,” IEEE Transactions on Software Engineering, vol. 21, 1995.
[18] N. A. Lynch, M. Merritt, W. E. Weihl, and A. Fekete, Atomic Transactions. Morgan Kaufmann, 1993.
[19] J. Magee, N. Dulay, S. Eisenbach, and J. Kramer, “Specifying distributed software architecture,” in Proceedings of 5th European Software Engineering Conference, 1994.
[20] S. Owre and J. Rushby, “Formal verification for fault-tolerant architecture: Prolegomena to the design of PVS,” IEEE Transactions on Software Engineering, vol. SE-21, no. 2, pp. 107–125, 1995.
[21] M. Rakic and N. Medvidovic, “Increasing the confidence in off-the-shelf components: a software connector-based approach,” in Proceedings of the 2001 Symposium on Software Reusability, May 2001, pp. 11–18.
[22] B. Randell, A. Romanovsky, R. Stroud, J. Xu, and A. F. Zorzo, “Coordinated atomic actions: from concept to implementation,” Special Issue of IEEE Transactions on Computers, 1997.
[23] J. M. Rushby and F. von Henke, “Formal verification of algorithms for critical systems,” IEEE Transactions on Software Engineering, vol. SE-19, no. 1, pp. 13–23, 1993.
[24] T. Saridakis and V. Issarny, Fault Tolerant Software Architectures, INRIA/IRISA, Technical report, 1999.
[25] M. Shaw and D. Garlan, Software Architecture: Perspectives on an Emerging Discipline. Englewood Cliffs, N. J.: Prentice-Hall, 1996.
[26] G. Smith, “Extending W for Object-Z,” in 9th International Conference of Z Users, Lecture Notes in Computer Science, 1995, vol. 967.
[27] ——, The Object-Z Specification Language. Kluwer Academic Publishers, 2000.
[28] M. S. Soe, H. Zhang, and S. Jarzabek, “XVCL: A tutorial,” in Proc. of 14th Int. Conf. on Software Engineering and Knowledge Engineering, SEKE’02, ACM Press, July 2002, pp. 341–349.
[29] J. M. Spivey, The Z Notation: A Reference Manual, International Series in Computer Science. Prentice-Hall, 1989.
[30] V. Stavridou and A. Riemenschneider, “Provably dependable software architecture,” in Proceedings of the Third ACM SIGPLAN International Software Architecture Workshop, 1998.
[31] J. Xu, B. Randell, A. Romanovsky, C. Rubira, R. Stroud, and Z. Wu, “Fault tolerance in concurrent object-oriented software through coordinated error recovery,” in Proc. 25th Int. Symp. on Fault-Tolerant Computing, Pasadena, June 1995, pp. 499–508.

Ling Yuan is pursuing her Ph.D. at the Computer Science Department, School of Computing in the National University of Singapore. Her research interests are mainly reliability, formal method application in software architecture design, and fault tolerant mechanisms in distributed systems.

Jin Song Dong is an Associate Professor & Assistant Dean (Graduate) at the Computer Science Department, School of Computing in the National University of Singapore. His research area involves real-time concurrent system specification, web environment for software design, formal methods and safety critical systems, object orientation, and language semantics.

Jing Sun is a lecturer at the Department of Computer Science in the University of Auckland. His research interests are in the areas of software engineering, integrated formal methods, formal specification languages, real-time modeling, requirement analysis, specification animation, formal verification, model checking, and theorem proving.

Hamid Abdul Basit is pursuing his Ph.D. at the Computer Science Department, School of Computing in the National University of Singapore. His research interests include reuse of software artifacts, analysis of software similarities, and tools to aid software maintenance.
