Putting Detectors in Their Place

Arshad Jhumka, Dept of Computer Science, University of Warwick, Coventry, UK

Martin Hiller, Dept of Electronics and Software, Volvo Technological Department, Gothenburg, Sweden

Abstract

In this paper, we address the problem of locating detectors in a given program under resource constraints. A detector is a program component that asserts the validity of a predicate in a program. The detector location problem is to identify which program actions need to be monitored by detectors such that certain given dependability properties are met. In this paper, we focus on the following dependability properties: (i) high detection coverage, (ii) low detection latency, and (iii) low false alarm rate. Our main contributions are: (i) we provide a formal definition of the detector location problem under resource constraints, (ii) we show that the problem is NP-complete, and (iii) we investigate a special case of the detector location problem that can be solved in polynomial time, and present a sound and complete algorithm that solves it. We present an example to show the applicability of our approach, which is intended for the area of dependable embedded systems.

Keywords: Detectors, location, design, formal methods, resource constraints, embedded systems

Contact Author: Arshad Jhumka {[email protected]}

1 Introduction

Computer systems are becoming increasingly pervasive, and our reliance on their continual provision of service, in spite of external perturbations, increases accordingly, i.e., we want those systems to be fault-tolerant. However, the design of fault-tolerant systems is known to be hard and expensive. This points to a need to develop appropriate methodologies that can help conquer the associated design complexity.

One such methodology has been developed by Arora and Kulkarni [2], and by Jhumka et al. [10, 11]. In [2] it is shown that fault tolerance mechanisms can be factorized along two main dimensions, namely (i) detectors, and (ii) correctors. Informally, detectors are program components that check whether a given predicate holds in a given program state. Examples are executable assertions [7], self-checks, etc. On the other hand, correctors are program components that impose a given predicate on a given program state. Examples are error handling mechanisms, voters, etc. In this paper, we will focus on detectors, as detectors allow the system to remain in a safe state [2]. This classification allowed for modular design of fault tolerance. Based on this classification, Jhumka et al. [10, 11] identified properties of detectors that capture their operational effectiveness. The properties identified were (i) completeness and (ii) accuracy. Each property plays an important role in the design of fault tolerance.

Composing¹ a fault-intolerant² program with detector components entails determining which program action(s) these components need to “protect” such that certain dependability properties are met. Informally, this is known as the location problem of detector components. The detector location problem was identified by Leveson et al. in [14], in which the authors wrote “ineffective self-checks...wrong checks placement.” The checks were ineffective in that they allowed errors to propagate through the system, eventually leading to service disruption, and this was due to their placement and/or effectiveness. Thus, the detector location/design problem is an important one. The methodology proposed in [2] does not address this location problem, whereas the work presented in [11] addresses the problem in a restricted and implicit way. In fact, there is a dearth of frameworks or methodologies that address this important issue. Hiller et al. [5, 6] have developed a software profiling methodology that identifies weak spots in a program, and these weak spots are then proposed as candidate locations for detectors. However, this approach is both computationally expensive and heuristic. What is needed is a framework that effectively guides the system designer in locating detectors in programs such that the resulting program satisfies given dependability properties. The dependability properties investigated in this work are: (i) high detection coverage, (ii) low false alarm rate, and (iii) low detection latency.

In this paper, we make the following novel contributions: (i) We formally define the detector location problem (DLP) under resource constraints, and (ii) subsequently show that the problem is NP-complete. (iii) To address this complexity issue, we identify a special case where location of detectors can be achieved in polynomial time. To achieve this, we further develop a theory for efficiently locating detectors in programs and (iv) we present a sound and complete algorithm³ that achieves this. (v) An example is developed to show the viability of our approach, which is well-suited to a model of programs called bounded programs, whose defining property is that the set of reachable states is finite (bounded). Examples of bounded programs are designs of embedded programs that can be model-checked.

The paper is structured as follows: Sect. 2 presents the formal terminology used in the paper. Sect. 3 reviews the role of detectors in ensuring fail-safeness. We provide a formal definition of the detector location problem (and its associated complexity) in Sect. 4. Sect. 5 presents a special case of the detector location problem that can be solved in polynomial time. Sect. 6 develops an example to show the working of our approach, and Sect. 7 concludes the paper.

¹ We will formally define the term later.

² A fault-intolerant program is one that satisfies its correctness specification in the absence of faults, but violates it in the presence of faults.

Proceedings of the Third IEEE International Conference on Software Engineering and Formal Methods (SEFM’05) 0-7695-2435-4/05 $20.00 © 2005 IEEE

Authorized licensed use limited to: WARWICK UNIVERSITY. Downloaded on April 24, 2009 at 04:21 from IEEE Xplore. Restrictions apply.

2 Formal Preliminaries

In this section, we summarize the formal terminology that will be used throughout this paper. This work assumes an interleaved execution semantics, i.e., state transitions are atomic events and an execution is regarded as a linear sequence of states. We assume a shared-variable communication model, i.e., processes communicate with each other by writing data into memory locations accessible by the receiver. A program will be represented either as a set of guarded commands or as a state transition system, depending on the context.

2.1 Programs

Guarded Command Notation A program p consists of a finite set of processes {p1, ..., pn}. Each process pi contains a finite set of actions and a finite set of variables. An action has the form

⟨guard⟩ → ⟨statement⟩

where the guard is a boolean expression over the program variables and the statement is either the empty statement or an instantaneous assignment to one or more variables. The set of variables of p is the union of all process variables. Each variable stores a value from a predefined nonempty finite domain and is associated with a predefined set of initial values. An action ac of p is enabled in a state s if the guard of ac evaluates to “true” in s. An action ac can be represented by a set of state pairs. We assume that actions are deterministic.

State Transition System The state space Sp of a program p is the set of all possible assignments of values to variables. A state predicate of p is a boolean expression over the state space of p. The set of initial states Ip is defined by the set of all possible assignments of initial values to variables. A computation of p is a weakly fair (finite or infinite) sequence of states s0, s1, ... such that s0 ∈ Ip and for each j ≥ 0, sj+1 results from sj by executing a single action that is enabled in sj. Weak fairness means that if a program action ac is continuously enabled along the states of an execution, then ac is eventually chosen to be executed. We say that state s occurs in a computation s0, s1, ... iff there exists an i such that s = si. Similarly, a transition (s, s′) occurs in a computation s0, s1, ... iff there exists an i such that s = si and s′ = si+1.

Programs can equivalently be represented as state machines, i.e., a program is a tuple p = (Sp, Ip, δp) where Sp is the state space and Ip ⊆ Sp is the set of initial states. The state transition relation δp ⊆ Sp × Sp is defined by the set of actions as follows: every action ac implicitly defines a set of transitions which is added to δp. A transition (s, s′) is in δp iff there exists an action ac ∈ p such that ac is enabled in state s and execution of its statement results in state s′. We say that ac induces these transitions. State s is called the start state and s′ is called the end state of the transition.

2.2 Specifications

A specification for a program p is a set of computations which is fusion-closed. A specification S is fusion-closed iff the following holds for finite computations α, γ, a state s and computations β, δ: If α · s · β and γ · s · δ are in S, then so are α · s · δ and γ · s · β. A computation cp of p satisfies a specification S iff cp ∈ S, otherwise cp violates S. A program p satisfies a specification S iff all possible computations of p satisfy S. Intuitively, a fusion-closed specification allows a program to make decisions about future state transitions by looking at its current state only. Fusion-closed specifications are non-restrictive in the sense that every specification which is not fusion-closed can be transformed into an equivalent fusion-closed specification by adding history variables.

Alpern and Schneider [1] have shown that every specification can be written as the intersection of a safety specification and a liveness specification. A safety specification demands that “something bad never happens” [13]. Formally, it defines a set of “bad” finite computation prefixes that should not be found in any computation. Since we are mainly interested in detectors, we focus on safety specifications, and present a definition here.

Definition 1 (Safety specification) A specification S of a program p is a safety specification iff the following condition holds: For every computation σ that violates S, there exists a prefix α of σ such that for all state sequences β, α · β violates S.
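The program model of Section 2.1 can be illustrated with a small sketch. The Python fragment below is not taken from the paper: it assumes a hypothetical encoding in which an action is a guard/statement pair over a dictionary of variable values, and it enumerates the transitions that each action induces over a finite state space. The names `Action`, `freeze`, `state_space`, `induced_transitions`, and the two-variable example program are illustrative assumptions only.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Tuple

State = Tuple[Tuple[str, int], ...]  # hashable snapshot of all variable values


@dataclass(frozen=True)
class Action:
    """A guarded command <guard> -> <statement> (hypothetical encoding)."""
    name: str
    guard: Callable[[Dict[str, int]], bool]
    statement: Callable[[Dict[str, int]], Dict[str, int]]


def freeze(valuation: Dict[str, int]) -> State:
    """Turn a variable valuation into a hashable state."""
    return tuple(sorted(valuation.items()))


def state_space(domains):
    """All assignments of values to variables (the state space S_p)."""
    names = sorted(domains)
    return [freeze(dict(zip(names, values)))
            for values in product(*(domains[n] for n in names))]


def induced_transitions(action, domains):
    """Pairs (s, s') such that the action is enabled in s and its
    (deterministic) statement maps s to s'."""
    result = set()
    for s in state_space(domains):
        valuation = dict(s)
        if action.guard(valuation):
            result.add((s, freeze(action.statement(valuation))))
    return result


# Hypothetical program: a counter x in {0, 1, 2} and a flag in {0, 1}.
DOMAINS = {"x": [0, 1, 2], "flag": [0, 1]}
inc = Action("inc", lambda v: v["x"] < 2, lambda v: {**v, "x": v["x"] + 1})
raise_flag = Action("raise_flag",
                    lambda v: v["x"] == 2 and v["flag"] == 0,
                    lambda v: {**v, "flag": 1})

# The transition relation delta_p is the union of all induced transitions.
delta_p = (induced_transitions(inc, DOMAINS)
           | induced_transitions(raise_flag, DOMAINS))
```

Representing states as sorted tuples keeps them hashable, so the transition relation can be handled as an ordinary set of state pairs, which matches the view of an action as a set of state pairs.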

³ Soundness in the sense that the result indeed solves the detector location problem, and completeness in the sense that if a solution exists, the algorithm will find it.


The notion of a finite computation not being “bad”, i.e., the possibility to extend it to remain in the specification, is captured by the definition of maintains.

Definition 2 (Maintains) Let p be a program, S be a specification and α be a finite computation of p. We say that α maintains S iff there exists a sequence of states β such that α · β ∈ S.

2.3 Fault Models and Fault Tolerance

A fault model precisely describes the way in which components of the system may fail. Fault models have been categorized into different domains [16]: time faults and value faults. Traditional stopping faults cannot lead by themselves to a violation of safety. To violate a safety specification, a system must exhibit one of the disallowed computation prefixes. The standard value faults from practice (i.e., bit-flips, stuck-at faults) can directly or indirectly lead to a violation of safety. We focus on the following fault model: faults may not violate the safety specification directly, i.e., they can only cause a violation of safety indirectly. The reason for choosing this fault model is that it has the potential of being tolerated. In the context of guarded commands, any fault model which endangers a safety specification can be modeled as a set of added actions (or added transitions, in the case of state transition systems). We will study the impact of faults directly causing violation of safety in our future work.

Definition 3 (Fault model) A fault model F for program p and safety specification SS is a set of actions over the variables of p that do not violate SS, i.e., if transition (sj, sj+1) is a transition induced by F and s0, s1, ..., sj maintains SS, then s0, s1, ..., sj, sj+1 also maintains SS.

We call actions of F faulty actions (or faults). A fault occurs if a faulty action (transition) is executed. Such a fault model has been used by Liu and Joseph [15].

Definition 4 (Computation in presence of faults) A computation of p in the presence of F is a weakly p-fair sequence of states s0, s1, ... such that s0 is an initial state of p and for each j ≥ 0, sj+1 results from sj by executing a program action from p or a fault action from F.

Note: By weakly p-fair, we mean that the actions of p are treated weakly fair, but not fault actions. Rephrased in the transition system view, a fault model adds a set of transitions to the transition relation of p. We denote the modified transition relation by δpF.

Definition 5 (Fail-safe fault-tolerance) Let S be a specification, SSPEC be the smallest safety specification including S, and let F be a fault model. A program p is said to be fail-safe F-tolerant for specification S iff all computations of p in the presence of F satisfy SSPEC.

We say that a program p is F-intolerant for SSPEC iff p satisfies SSPEC in the absence of F but violates SSPEC in the presence of F. We will also write fault-intolerant instead of F-intolerant for SSPEC if F and SSPEC are clear from the context.

3 Detectors: Role and Design

In this section, we will briefly preview the role of detectors in the design of fault tolerance, and subsequently review the basis underpinning their design.

3.1 Role of Detectors in Fault Tolerance

Arora and Kulkarni [2] showed that a model of program components called detectors is necessary and sufficient to establish fail-safe fault-tolerance in the context of fusion-closed specifications. The main idea of the result is to use detectors to simply “halt” the program in a state where it is about to violate the safety specification, i.e., “halt” the program in a safe state. An important prerequisite for this sufficiency result is that specifications are fusion-closed. Fusion-closed specifications allow a safety specification to be characterized as a set of disallowed “bad” transitions (instead of a set of disallowed computation prefixes).

Definition 6 (bad transition) For a program p, and fault F, a transition t ∈ δpF is bad with respect to a safety specification SSPEC if for all computations σ of p, it is the case that if t occurs in σ then σ ∉ SSPEC.

Note that, under our fault model assumption, a fault transition cannot be a bad transition. Intuitively, maintaining a safety specification now requires keeping track of the current computation and taking precautions not to run into one of the bad transitions. A detector d refines the guard of the corresponding program action that induces the bad transition in such a way that the action is never executed whenever the computation could result in taking a bad transition, i.e.,

d ∧ ⟨guard⟩ → ⟨statement⟩

Definition 7 (Detector for an action) Let SSPEC be a safety specification. An SSPEC-detector d monitoring program action ac of p is a state predicate of p such that executing ac in a state where d holds maintains SSPEC.

We will simply talk about detectors instead of SSPEC-detectors if the relevant safety specification is clear from the context. When a detector d refines the guard of an action ac of p, we say that we compose ac with d. We will also sometimes say that the detector d is located at location ac. We also say that we compose p with d if ∃ac ∈ p such that ac is composed with d. We say that we compose a program p with a set D of detectors (denoted p[]D) iff ∀d ∈ D · ∃ac ∈ p such that ac is composed with d.
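Composing an action with a detector, i.e., refining its guard to d ∧ ⟨guard⟩, can be sketched as follows. This is a minimal illustration under assumptions of our own: actions are encoded as guard/statement pairs over a dictionary of variables, `compose` and `step` are hypothetical helper names, and the valve/pressure program is invented purely as an example.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict

Valuation = Dict[str, int]


@dataclass(frozen=True)
class Action:
    """A guarded command <guard> -> <statement> (hypothetical encoding)."""
    name: str
    guard: Callable[[Valuation], bool]
    statement: Callable[[Valuation], Valuation]


def compose(action: Action, detector: Callable[[Valuation], bool]) -> Action:
    """Return the refined action  d AND <guard> -> <statement>."""
    return replace(action, guard=lambda v: detector(v) and action.guard(v))


def step(action: Action, v: Valuation) -> Valuation:
    """Execute the action if it is enabled; otherwise leave the state as is."""
    return action.statement(v) if action.guard(v) else v


# Hypothetical example: opening a valve is unsafe while pressure exceeds 5.
open_valve = Action("open_valve",
                    lambda v: v["valve"] == 0,
                    lambda v: {**v, "valve": 1})


def pressure_ok(v: Valuation) -> bool:
    """The detector predicate d (an assumption made for this example)."""
    return v["pressure"] <= 5


guarded = compose(open_valve, pressure_ok)

safe_state = {"valve": 0, "pressure": 3}
erroneous_state = {"valve": 0, "pressure": 9}  # e.g. reached after a fault
```

In this sketch the composed action behaves like the original one in `safe_state`, while in `erroneous_state` its refined guard is false, so the action is simply not executed.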


Formally, in the transition system view of a program p, a state s is reachable by p iff, starting from an initial state of p, there exists a computation using only transitions from δp that contains s. Otherwise s is unreachable. Similarly, the notions of a state or transition being reachable in the presence of faults can be defined by referring to δpF. Using the above terminology, composing a program p with detectors results in some transitions of p becoming unreachable in the presence of faults. As noted in [2, 14], the design of detectors has usually been achieved through experience and intuition.
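Reachability with and without faults can be sketched with a breadth-first search over the transition relation. The fragment below is an illustration only: the five-state system, the fault transition, and the names `reachable` and `reachable_transitions` are hypothetical and do not reproduce the system of Fig. 1.

```python
from collections import deque


def reachable(initial, transitions):
    """States reachable from `initial` using only the given transitions."""
    successors = {}
    for s, t in transitions:
        successors.setdefault(s, set()).add(t)
    seen, queue = set(initial), deque(initial)
    while queue:
        s = queue.popleft()
        for t in successors.get(s, ()):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen


def reachable_transitions(initial, transitions):
    """Transitions whose start state is reachable."""
    states = reachable(initial, transitions)
    return {(s, t) for (s, t) in transitions if s in states}


# Hypothetical system: the fault (s1, e1) opens a path to the bad
# transition (e2, bad), which is unreachable in the absence of faults.
INITIAL = {"s0"}
DELTA_P = {("s0", "s1"), ("s1", "s2"), ("e1", "e2"), ("e2", "bad")}
FAULTS = {("s1", "e1")}
BAD = ("e2", "bad")

DELTA_PF = DELTA_P | FAULTS  # transition relation in the presence of faults

# A detector located at the action inducing (e2, bad) rejects start state
# e2, which removes that transition from the composed program.
DELTA_PF_WITH_DETECTOR = DELTA_PF - {BAD}
```

Under these assumptions the bad transition is unreachable using δp alone, becomes reachable once the fault transition is added, and is unreachable again after composition with the detector.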

3.2 Error Detection: A Review

In this section, we recall some important results that underpin the design of efficient detectors, i.e., detectors that detect all harmful errors and minimize the number of false detections. The results are based on the concept of a transition that is deemed inconsistent w.r.t. a safety specification if executing that transition can lead to a violation of safety. More details can be found in [10, 11]. Note: For reasons of space, proofs and intuitions of theorems/lemmas/propositions presented in this section have been omitted, but can be obtained from [10, 11].

3.3 Designing Efficient Detectors

The intuition behind the definition of inconsistency is that if a given computation of p in the presence of faults violates the safety specification, then some “erroneous” transition has occurred in the computation.

Definition 8 (SS-inconsistent transitions) Given a fault-intolerant program p, safety specification SS, fault F, and a computation α of p in the presence of F. A transition (s, s′) is SS-inconsistent for p w.r.t. α iff

• there exists a prefix α′ of α such that α′ violates SS,
• (s, s′) occurs in α′, i.e., α′ = σ · s · s′ · β,
• all transitions in s · s′ · β are in δp, and
• σ · s maintains SS.

Fig. 1 illustrates Definition 8. It shows the state transition relation of a program in the presence of faults (the transition (s3, s4) is introduced by F). The safety specification SS identifies a bad transition (s6, s7) which should be avoided. In the absence of faults, the bad transition is unreachable, so safety is not violated. However, in the presence of faults, this transition becomes reachable and hence the program is F-intolerant since it exhibits a computation α1 violating SS. In this computation, the three transitions following the fault transition match Definition 8 and hence are SS-inconsistent w.r.t. α1 in the presence of F. Note that an SS-inconsistent transition is only reachable in the presence of faults. Intuitively, SS-inconsistent transitions can lead the program computation on the “wrong path”. What this also means is that, if there is no “precaution” in the program, then safety can be violated. Adding detectors is a way of implementing these “precautions”. In the presence of faults, a program can violate safety if it contains an SS-inconsistent transition. Formally, we capture this by defining SS-inconsistency independent of a particular computation.

Figure 1: Graphical explanation of SS-consistency. (Diagram not reproduced; it shows states s1 to s9 with an initial state, a fault transition, a bad transition, and computations α1 and α2.)

Definition 9 (SS-inconsistent transition for p) Given a program p, fault F, and safety specification SS. A transition (s, s′) is SS-inconsistent for p iff there exists a computation α of p in the presence of F such that (s, s′) is SS-inconsistent for p w.r.t. α.

Definition 10 (SS-consistent transition for p) Given a program p with safety specification SS. A transition (s, s′) is SS-consistent for p iff (s, s′) is not SS-inconsistent for p.

Based on this concept, we introduce the notion of perfect detectors, which characterizes a model of efficient detectors, using the terminology of SS-inconsistency.

3.4 Perfect Detectors

To implement the precautions that prevent the program from executing a bad transition, detectors are needed. We also want these detectors to be operationally efficient. For this, we introduce the concept of perfect detectors. The definition of a perfect detector follows two guidelines: A detector d monitoring a given action ac (i.e., refining its guard) of program p needs to (1) “reject” the starting states of all transitions induced by ac that are SS-inconsistent for p (make the states unreachable), and (2) “keep” the starting states of all induced transitions that are SS-consistent for p (maintain reachability of the transition). These two properties are captured in the definitions of completeness and accuracy of detectors (the notions are defined in analogy to Chandra and Toueg [3]).

Definition 11 (Detector completeness) Given a program p with safety specification SS, fault model F, and a program action ac of p. A detector d monitoring action ac is SS-complete for ac in p in presence of F iff for all transitions (s, s′) induced by ac the following holds: if (s, s′) is SS-inconsistent for p, then s ∉ d.

Definition 12 (Detector accuracy) Given a program p with safety specification SS, fault model F, and a program action ac of p. A detector d monitoring ac is SS-accurate for ac in p in presence of F iff for


all transitions (s, s′) induced by ac the following holds: if (s, s′) is SS-consistent for p, then s ∈ d.

Definition 13 (Perfect detector) Given a program p with safety specification SS, fault model F, and a program action ac of p. A detector d monitoring ac is SS-perfect for ac in p in presence of F iff d is both SS-complete and SS-accurate for ac in p.

Where the specification is clear from the context we will write accuracy instead of SS-accuracy (similarly for completeness and perfection). Intuitively, the completeness property of a detector is related to the safety property of the program p, whereas the accuracy property relates to the liveness specification of p. This intuition is captured by the following lemmas. Lemma 1 uses the accuracy property to show that the fault-free behavior of a program is not affected by adding perfect detectors. Intuitively, it says that perfect detectors do not trigger “false” alarms. On the other hand, Lemma 2 uses the completeness property to show that perfect detectors indeed establish fault-tolerance.

Lemma 1 (Fault-free behavior) Given a fault-intolerant program p and a set D of perfect detectors. Consider program p′ resulting from the composition of p and D. Then the following statements hold:

1. In the absence of faults, every computation of p′ is a computation of p.
2. In the absence of faults, every computation of p is a computation of p′.

Before we characterize the role of perfect detectors in the presence of faults, we formally define critical actions of a program. Intuitively, a critical action is one which can cause violation of safety when executed in an erroneous state.

Definition 14 (Critical and non-critical actions) Given a program p with safety specification SS, and fault model F. An action ac of p is said to be critical for p in the presence of F for SS iff there exists a transition (s, s′) induced by ac such that (s, s′) is a bad transition that is reachable in presence of faults F. An action is non-critical iff it is not critical.

We now present an important fault tolerance result.

Lemma 2 (Behavior in the presence of faults) Given a fault-intolerant program p with safety specification SS, and fault model F. Given also a program p′ obtained by composing each critical action ac of p with a perfect detector for ac in presence of F. Then, p′ satisfies SS in presence of faults F.

Hence, from Lemmas 1 and 2, we observe that a program p′, obtained by composing each critical action ac of a fault-intolerant program p with a perfect detector for ac in p in presence of faults F, will be fail-safe fault-tolerant, and the detectors do not trigger false alarms. In such cases, we say that p′ has perfect fail-safe fault tolerance to faults F. Intuitively, perfect detectors have optimal coverage with respect to harmful faults, i.e., faults that can lead to violation of safety. At this point, we define a class of SS-inconsistent transitions, called earliest SS-inconsistent transitions. We will later explain their role in fault tolerance.

Definition 15 (Earliest SS-inconsistent transition) Given an F-intolerant program p with safety specification SS, and a computation α = s0 · s1 · · · si · si+1 · · · sm of p in the presence of faults F that violates SS. The transition (si, si+1) is the earliest SS-inconsistent transition for p w.r.t. α in presence of F iff the following two properties hold:

1. (si, si+1) is SS-inconsistent for p w.r.t. α in presence of F.
2. (si−1, si) is a transition induced by an F action.

4 The Detector Location Problem (DLP)

When designing fault-tolerant programs, two common metrics have been used to evaluate their dependability properties: (i) coverage [4] and (ii) latency [7]. The coverage metric determines the proportion of errors that are correctly handled (in this case, detected; hence we will talk of detection coverage), and is related to the completeness property of detectors, hence fail-safe fault tolerance (Lemma 2). The latency metric determines the delay after which an error is correctly handled (i.e., detected; hence detection latency), after the occurrence of a fault. In general, detection coverage is required to be very high (complete, if possible), and the detection latency very small (minimal, if possible). Further, another important factor is the rate of false alarms, as the system needs to keep some liveness properties. Based on the formal definitions provided in the previous section, we formally define the terms detection coverage, detection latency, and false alarm rate. This is possible due to the finite state space assumed.

Definition 16 (Detection coverage of d) Given a program p with safety specification SSPEC, and fault model F. Given also a program p′ obtained by composing an action ac in p with a detector d. The detection coverage of d at ac in p′ in the presence of F (cov(d, ac, p′, F)) is defined as the ratio of SS-inconsistent transitions induced by ac in presence of F that are rejected by d to the actual number of SS-inconsistent transitions induced by ac in presence of F.
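Completeness, accuracy, and perfection of a detector (Definitions 11 to 13), together with the coverage ratio of Definition 16, can be checked mechanically once the transitions induced by an action and the SS-inconsistent ones among them are known. The sketch below assumes a hypothetical set-based encoding: a detector is the set of start states in which its predicate holds, and the three-transition example is invented for illustration. Returning 1.0 when there is nothing to detect is also an assumption, since the definition leaves that case open.

```python
def is_complete(detector, induced, inconsistent):
    """Every SS-inconsistent induced transition starts outside d."""
    return all(s not in detector
               for (s, t) in induced if (s, t) in inconsistent)


def is_accurate(detector, induced, inconsistent):
    """Every SS-consistent induced transition starts inside d."""
    return all(s in detector
               for (s, t) in induced if (s, t) not in inconsistent)


def is_perfect(detector, induced, inconsistent):
    """A perfect detector is both complete and accurate."""
    return (is_complete(detector, induced, inconsistent)
            and is_accurate(detector, induced, inconsistent))


def coverage(detector, induced, inconsistent):
    """Fraction of SS-inconsistent induced transitions rejected by d."""
    relevant = [(s, t) for (s, t) in induced if (s, t) in inconsistent]
    if not relevant:
        return 1.0  # assumption: nothing to detect counts as full coverage
    rejected = [(s, t) for (s, t) in relevant if s not in detector]
    return len(rejected) / len(relevant)


# Hypothetical action inducing three transitions, two of them SS-inconsistent.
INDUCED = {("a", "b"), ("c", "d"), ("e", "f")}
INCONSISTENT = {("c", "d"), ("e", "f")}

D_PERFECT = {"a"}       # holds exactly in the consistent start state
D_WEAK = {"a", "c"}     # lets the inconsistent transition from c through
D_STRICT = set()        # rejects everything, including the consistent one
```

With these sets, `D_PERFECT` is both complete and accurate, `D_WEAK` is accurate but not complete and rejects half of the inconsistent transitions, and `D_STRICT` is complete but not accurate, which is the situation that causes false alarms.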

Intuitively, the coverage of a detector is linked to the proportion of bad transitions that are made unreachable 5

Proceedings of the Third IEEE International Conference on Software Engineering and Formal Methods (SEFM’05) 0-7695-2435-4/05 $20.00 © 2005

IEEE

Authorized licensed use limited to: WARWICK UNIVERSITY. Downloaded on April 24, 2009 at 04:21 from IEEE Xplore. Restrictions apply.

Then, the detection latency of d at ac in p! w.r.t. α in the presence of F ! , denoted by L(d, ac, p! , F ! , α) is either the number of transitions executed in si . . . sm , (i.e., (m-i) transitions), or ∞.

by that detector. Different detectors have different coverages [14] (depending on their design and location). To be able to grasp the understanding of the overall coverage of a given program for a given fault model, we present the following definition.

The intuition is as follows: A detector in a program refines the original transition relation so as to make the bad transitions in presence of faults unreachable. This means that there will be computations of p! that will only be prefixes of corresponding computations of p, and will not include bad transitions. In particular, those computations that are halted by a detector. Thus, focusing on a given computation, the detection latency of a detector at a given program action is taken to be the number of steps (“time taken”) between the occurrence of a fault, and the computation halted (i.e., corresponding error detected). However, if violation of the safety specification occurs before the computation is stopped, then we set the default to be ∞.

Definition 17 (Detection coverage of p! ) Given a program p with safety specification SSP EC, and fault model F . Given also a set BC of bad transitions of p in the presence of F for SSP EC, and a program p! obtained by composing p with a set of detectors D. The detection coverage of p! in the presence of F for SSP EC (cov(p! , F )) is the ratio of the number of transitions in BC that are unreachable in p! in presence of F to the number of transitions in BC . We now define the term false alarm rate. Definition 18 (False alarm rate of d) Given a program p with safety specification SSP EC, and fault model F . Given also a program p! obtained by composing an action ac in p with a detector d. The false alarm rate of d at ac in p! in the presence of F (f ar(d, ac, p! , F )) is defined as the ratio of SS-consistent transitions induced by ac in presence of F that are rejected by d to the actual number of SS-consistent transitions induced by ac in presence of F .

As in the case of detection coverage, we now define the false alarm rate for p′.

Definition 19 (False alarm rate of p′) Given a program p with safety specification SSPEC, and fault model F. Given also a set GC of transitions induced by critical actions of p in the presence of F for SSPEC that are not bad transitions for p in the presence of F for SSPEC, and a program p′ obtained by composing p with a set of detectors D. The false alarm rate of p′ in the presence of F for SSPEC (far(p′, F)) is the ratio of the number of transitions in GC that are unreachable in p′ in the presence of F to the number of transitions in GC.

We now define the term detection latency.

Definition 20 (Detection Latency) Given a fault-intolerant program p with safety specification SSPEC, and fault model F. Given also a program p′ obtained by composing p with a set of detectors D. Consider a computation α = s0 · · · si · si+1 · · · sm · sm+1 . . . of p that violates SSPEC in the presence of F, and a finite computation α′ = s0 · · · si · si+1 · · · sm of p′ in the presence of F, such that:

1. (si−1, si) is a transition induced by an F action,

2. all transitions in si . . . sm in α are SS-inconsistent, and

3. either starting from sm in α, a bad transition specified by SSPEC is reachable by using only program transitions of p, or a bad transition of p in the presence of F for SSPEC occurs in si . . . sm.

Definition 21 (Detection latency of p′) Given a program p with safety specification SSPEC, and fault model F. Given also a program p′ obtained by composing p with a set of detectors D, i.e., p′ = p[]D. Then, the maximum detection latency of p′ in the presence of F, denoted lat(p′, F), is equal to max{L(d, ac, p′, F′, α) | d ∈ D, ac ∈ p′}, ∀α.

Thus, for a fault-intolerant program, the maximum detection latency is ∞. For a fault-tolerant program to have high detection coverage, low false alarm rate, and low detection latency, locating the correct detectors at the right locations in the program is crucial [14]. However, in the context of embedded systems, resources such as memory are scarce. Hence, the location of detectors has to be cognizant of this restriction, and the resulting program has to satisfy the resource constraints. Having presented the necessary formal terminology, we now formalize the detector location problem (DLP), which is one of the major contributions of the paper.

4.1 The Detector Location Problem

Given a fault-intolerant program p = {ac1, . . . , acn}, safety specification SSPEC, and a fault model F. Given also a set D = {d1, . . . , dm} of detectors, and a set of resources, denoted by $\vec{R}$, such that $p \models \vec{R}$. Informally, the problem is to find an optimal location of detectors that optimizes an objective function in coverage, latency and false alarm rate, subject to a constraint of bounded resources. The Detector Location Problem (DLP) can be stated as follows: Select a set Dp ⊆ D such that

C1. p′ = p[]Dp has a high detection coverage (cov(p′, F)) (above a given threshold value c) in the presence of F.

C2. p′ = p[]Dp has a low detection latency (lat(p′, F)) (below a given threshold value l) in the presence of F.

Proceedings of the Third IEEE International Conference on Software Engineering and Formal Methods (SEFM’05) 0-7695-2435-4/05 $20.00 © 2005

IEEE

Authorized licensed use limited to: WARWICK UNIVERSITY. Downloaded on April 24, 2009 at 04:21 from IEEE Xplore. Restrictions apply.

C3. p′ = p[]Dp has a low false alarm rate (far(p′, F)) (below a given threshold value f) in the presence of F.

C4. p′ = p[]Dp satisfies the resource constraints, i.e., $p' \models \vec{R}$.

A program p′ satisfying the above conditions C1 . . . C4 is said to solve the DLP problem. The fourth condition stipulates that, when transforming a fault-intolerant program p that satisfies the imposed resource constraints into a program p′, p′ should satisfy the resource constraints too.

4.2 Detector Location Problem: Complexity Issues

The formalization of DLP allows us to better understand the problem to be tackled, enabling us to analyze the computational complexity associated with the problem. We show, in this section, that DLP is NP-complete, which is another major contribution of our paper. To prove this, we reduce the knapsack problem to DLP. We briefly recall the knapsack problem, and then we identify a transformation to reduce the knapsack problem to DLP.

Knapsack Problem: The knapsack problem is defined as follows: It consists of a finite set I of n items labelled 1 . . . n. With each item i is associated a size s(i) ∈ Z+ and a value v(i) ∈ Z+. There also exists a constant B > 0 called the size constraint (knapsack capacity). Question: Is there a subset I′ ⊆ I that solves the following:

maximize $\sum_{i=1}^{n} (x(i) \cdot v(i))$
subject to $\sum_{i=1}^{n} (x(i) \cdot s(i)) \le B$, $x(i) \in \{0, 1\}$, $i = 1 \ldots n$

Now, we formalize DLP as an optimization problem.

DLP: Given is a set of detectors D = {d1 . . . dm}, a fault-intolerant program p = {ac1 . . . acn}, a fault model F, and a safety specification SSPEC.

Computing the size function, s: We assume a function resources-consumed(c) that returns the resource usage of component c. Thus, resources-consumed(di) returns a vector of resources consumed by di ∈ D. In this paper, we assume a vector of dimension 1. Then,

s(di) = resources-consumed(di).

Computing the value function, v: We assume the following functions:

1. C(di, acj, p, F′): It returns cov(di, acj, p′, F′), where p′ is the program obtained by composing acj ∈ p with di, and F′ ⊆ F.

2. F(di, acj, p, F′): It returns far(di, acj, p′, F′), where p′ is the program obtained by composing acj ∈ p with di, and F′ ⊆ F.

3. L(di, acj, p, F′): It returns L(di, acj, p′, F′), where p′ is the program obtained by composing acj ∈ p with di, and F′ ⊆ F.

We also assume the following utility functions:

1. Uc(C(di, acj, p, F′)): Returns the utility associated with the detection coverage of a detector di ∈ D located at acj ∈ p in the presence of F′ ⊆ F.

2. Uf(F(di, acj, p, F′)): Returns the utility associated with the false alarm rate of a detector di ∈ D located at acj ∈ p in the presence of F′ ⊆ F.

3. Ul(L(di, acj, p, F′)): Returns the utility associated with the detection latency of a detector di ∈ D located at acj ∈ p in the presence of F′ ⊆ F.

When the detection coverage of a detector di at a location acj in the presence of some F′ is maximal, then Uc(C(di, acj, p, F′)) = 1, and when the detection coverage of di at acj in the presence of some F′ is minimal (or below a certain threshold value), then Uc(C(di, acj, p, F′)) = 0. On the other hand, when the false alarm rate of the detector di at a location acj in the presence of some F′ is minimal, then Uf(F(di, acj, p, F′)) = 1, and when the false alarm rate of di at acj in the presence of F′ is maximal (greater than a threshold value), then Uf(F(di, acj, p, F′)) = 0. The utility Ul of the detection latency is defined similarly. Since we are interested equally in (i) detection coverage, (ii) false alarm rate, and (iii) detection latency, we associate an equal weight with each attribute. Hence, the value of each detector item is computed as follows:

$v(d_i) = \lceil \max_{j = 1 \ldots n,\; F' \subseteq F} \{ U_c(C(d_i, ac_j, p, F')) + U_f(F(d_i, ac_j, p, F')) + U_l(L(d_i, ac_j, p, F')) \} * 100 \rceil$

When locating detectors in a program, a system designer looks to maximize the overall utility of each detector added. Specifically, the designer wishes to solve the following:

maximize $\sum_{i=1}^{m} (x_i \cdot v(d_i))$

subject to the following constraints:

1. $\sum_{i=1}^{m} x_i \le m$

2. $(\text{resources-consumed}(p) + \sum_{i=1}^{m} (x_i \cdot s(d_i))) \models \vec{R}_1$ (single resource)

3. $x_i \in \{0, 1\}$, $i = 1 \ldots m$, where $x_i = 1$ if di is used

At this point, we state an important result when locating detectors such that the resulting program has high detection coverage, low false alarm rate, and low detection latency.

Theorem 1 DLP is NP-complete.
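As an illustration only, the 0/1 selection problem stated just before Theorem 1 can be explored with a brute-force sketch in Python. The function name select_detectors, the list-based encoding of v(di) and s(di), and the numeric values are assumptions made for this example and are not part of the paper; a single scalar resource is assumed, as in the text. The search enumerates all 2^m subsets of detectors, so it is only practical for very small m.

```python
from itertools import combinations


def select_detectors(values, sizes, program_size, budget):
    """Exhaustively pick the subset of detectors with maximal total value.

    values[i] and sizes[i] play the roles of v(d_i) and s(d_i); program_size
    stands for resources-consumed(p) and budget for the single resource bound.
    Returns (best_value, chosen_indices), or None if even the program without
    any detector exceeds the budget.
    """
    m = len(values)
    best = None
    for r in range(m + 1):
        for subset in combinations(range(m), r):
            used = program_size + sum(sizes[i] for i in subset)
            if used > budget:
                continue  # violates the resource constraint
            total = sum(values[i] for i in subset)
            if best is None or total > best[0]:
                best = (total, subset)
    return best


# Hypothetical instance: three detectors, program uses 10 units, bound is 14.
print(select_detectors([3, 2, 2], [4, 2, 2], 10, 14))  # (4, (1, 2))
```

In this made-up instance the single most valuable detector (index 0) is not selected, because the two smaller detectors together fit the same budget and yield a higher total value.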


Proof. First, observe that DLP belongs to the class NP. A verifier for the problem takes a fault-intolerant program augmented with a set of detectors, and determines if the DLP conditions are satisfied. If this can be done in polynomial time, then DLP is in NP.

Reducing Knapsack to DLP: The reduction is as follows:

• s(di) = s(i)
• v(di) = v(i)
• m = n
• R = B
□

Since exponential complexity is unavoidable in the general case of locating detectors as in DLP, there are different ways of tackling the complexity problem. For example, one can look for special classes of the DLP problem that admit polynomial-time algorithms, or for heuristics. Here, we decide to weaken the DLP problem, to obtain an algorithm of polynomial-time complexity.

5 Weaker Specification for DLP (W-DLP)

The main problem behind the intractability result is the need to design effective detectors, and then properly locate them, whilst satisfying the resource constraints. To address this issue, we focus on a class of embedded systems where resource constraints are not imposed. Systems in this class are usually not intended for the commercial market, unlike PDAs, cars, etc. Rather, they are deployed in safety-critical settings such as avionics. In such systems, the need for effective detectors is particularly pressing. The need for perfect detectors in such systems is obvious (to prevent occurrences of catastrophic events, and false alarms). Hence, we propose to solve a weaker specification of the DLP problem, which we call W-DLP.

5.1 The Weaker Detector Location Problem

Given a fault-intolerant program p = {ac1, . . . , acn}, safety specification SSPEC, and a fault model F. Given also a set D = {d1, . . . , dm} of detectors. The Weaker Detector Location Problem (W-DLP) can be stated as follows: Select a set Dp ⊆ D such that

C1. p′ = p[]Dp has a high detection coverage (above a given threshold value c) in the presence of F.

C2. p′ = p[]Dp has a low detection latency (below a given threshold value l) in the presence of F.

C3. p′ = p[]Dp has a low false alarm rate (below a given threshold value f) in the presence of F.

A program p′ satisfying the above conditions C1 to C3 is said to solve the W-DLP problem. In earlier work [9, 10], we have indirectly solved a still weaker problem specification consisting of only conditions C1 and C3. This was made possible by our work on perfect detectors. In this paper, we will again make use of perfect detectors but in different ways. Specifically, in [9, 10], we presented sufficiency results, whereas the work presented in this paper develops both necessary and sufficient results to solve W-DLP.

5.2 Theory for Locating Detectors

As mentioned in Sect. 3, composing critical actions of a program with perfect detectors ensures fail-safeness of the resulting program (sufficiency results). In this section, we will look closer at the role of non-critical actions in fault tolerance, specifically in helping to reduce the detection latency of the resulting program, as well as in ensuring both necessary and sufficient results. To solve W-DLP, two problems have to be solved: (i) which program actions need to be monitored to reduce the detection latency of the program, and (ii) whether there is a minimal (necessary and sufficient) set of program actions that, when composed with perfect detectors, results in a program that satisfies C1 . . . C3. As mentioned previously, composing any action ac with a detector d that is perfect for ac in p′ in the presence of some faults F′ ensures that the detection coverage of d at ac in p′ in the presence of F′ is 1, and that the false alarm rate is equal to 0.

Definition 22 (Latency Critical Program Actions) Given a program p with safety specification SS, a fault model F that has to be handled, and a program action ac of p. Action ac is said to be latency critical for p in the presence of F iff, when composed with a detector d that is perfect for ac in p in the presence of F, there exists a non-empty set Fac ⊆ F for which the maximum detection latency of the resulting program p′ in the presence of Fac is 0. An action is non-latency critical for p in the presence of F if it is not latency critical for p in the presence of F.

The intuition behind the definition of latency critical program actions is that, for "every aspect" (Fac) of a fault model (F), there is a detector associated with a latency critical program action that handles it immediately. The latency critical actions are actions that are necessarily included in a program p so as to ensure immediate handling of errors once they occur. Now that we know of the importance of latency critical actions, one question is: how do we formally characterize them? To answer this question, we present the following result:

Lemma 3 (Linking Latency Critical Actions) Given a program p with safety specification SS, and fault model F. Every earliest SS-inconsistent transition for p in the presence of F is induced by a latency critical action for p in the presence of F.

For reasons of space, we only sketch a proof. Assume the contrary (that no such Fac exists), and then construct one by adding every transition in F that immediately precedes every earliest SS-inconsistent transition. Thus, to be able to identify the latency critical actions of a program p in the presence of faults, it suffices to identify the set of earliest SS-inconsistent transitions. This set can be calculated as specified by Definition 15.

Theorem 2 (Latency Critical Actions and DLP) Given a program p with safety specification SS, and fault model F that has to be handled. Conditions C1 . . . C3 of the DLP problem are satisfied iff every latency critical action ac of p is composed with a detector that is perfect for ac in p in the presence of F.

Proof Sketch: Composing critical actions of p with perfect detectors ensures high (maximal) coverage and low (minimal) false alarms (sufficiency result). Combined with Lemma 3 (necessary result), we ensure low (minimal) latency, maximal coverage, and minimal false alarm rate.

At this point, we present a sound and complete algorithm (locate-detectors, presented in Fig. 2) that solves the DLP problem. The first part is based on Lemma 3, where the set of earliest SS-inconsistent transitions is generated. This identifies the set of latency critical actions. The second part iteratively refines the detection predicate for each of these latency critical actions.

locate-detectors(δp, δF, ss: set of bad transitions)
{
  % calculate the set of earliest inconsistent transitions for p
  eit := {(s1, s2) | ∃s0 ∈ Sp. (s0, s1) ∈ δF ∧ there exists a state sequence s0 s1 s2 . . . si si+1 . . . such that (si, si+1) ∈ ss}
  % obtain detectors and locations by removing all earliest inconsistent transitions
  return (p′ = p \ eit)
}

Figure 2: Algorithm to generate and locate detectors

Theorem 3 (Soundness/Completeness of Algorithm) Algorithm locate-detectors is sound and complete.

Proof Sketch: It is sound (p′ satisfies C1 . . . C3) as it removes all earliest inconsistent transitions, and it is complete (if such a p′ exists, the algorithm will find it) as all earliest inconsistent transitions can be obtained, and hence removed.

We now briefly present an analysis of the complexity of the algorithm.

• Calculating the set eit is O(n²), where n is the size of the state space of the program.

• Removing the set eit is O(n²).

• Overall complexity is O(n²).

Thus, for this case, exponential complexity is avoided, and we instead have a polynomial-time algorithm that solves DLP. In the next section, we present an experimental evaluation of the algorithm.

6 An Example: Token Ring

We now present the working of our approach. The token ring protocol is commonly used in embedded systems to mediate access to a common communication medium among different processes. When implementing the token ring protocol, there is no assumption about resource constraints.

We first recall the problem of mutual exclusion, which can be regarded as the specification of a token ring. In the mutual exclusion problem, multiple processes have a special section of their code which is called the critical section. Processes may wish to enter the critical section, e.g., to access a shared resource (communication medium). Processes leave the critical section in finite time. A protocol solving mutual exclusion guarantees that at any point in time at most one process is in its critical section. This is the safety specification of mutual exclusion. The liveness specification states that if a process wants to enter its critical section, it will manage to do this in finite time.

We can implement mutual exclusion by using a token ring. For this, we assume that the processes are arranged in a ring and that these processes circulate a token in a particular direction. Whenever a process wants to access its critical section (to send a message on the communication medium), it waits for the token to arrive. After accessing the critical section, it forwards the token to the next process in sequence.

In our example, there are N + 1 processes, numbered from 0 to N, arranged in a ring. Process k with 0 ≤ k < N passes the token to process k + 1, whereas process N passes the token to process 0. Each process k has a binary variable, t.k. All variables are initialized to the same value. Every process has one action only. If it executes this action, it is said to "receive the token" and it then executes its critical section.

All processes k (0 < k ≤ N) compare their value t.k with that of the predecessor t.(k − 1) in the ring. If the values are different, the value of the successor is updated upon receipt of the token. Similarly, process 0 compares its value t.0 with the value t.N of process N. If both values are the same, they are made different by executing the action. The fault-intolerant program for the token ring is as follows:

ITR1 :: k ≠ 0 ∧ t.k ≠ t.(k − 1) → t.k := t.(k − 1)
ITR2 :: k = 0 ∧ t.k = t.N → t.k := ¬t.N
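The guarded commands ITR1 and ITR2 can be exercised with a small simulation. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: the names BOTTOM, enabled, execute, enabled_processes and run_fault_free are hypothetical, the corrupted value ⊥ of the fault model discussed in the remainder of this section is represented by None, and the optional fail_safe flag adds the extra guard conjunct (predecessor ≠ ⊥) of the fail-safe program FSTR1/FSTR2 that the paper derives later in this section.

```python
BOTTOM = None  # assumption: Python None stands for the corrupted value ⊥


def enabled(t, k, fail_safe=False):
    """Evaluate the guard of process k's single action on token vector t."""
    n = len(t) - 1
    pred = t[n] if k == 0 else t[k - 1]
    if fail_safe and pred is BOTTOM:
        return False  # extra conjunct of FSTR1/FSTR2: predecessor must not be ⊥
    if k == 0:
        return t[0] == t[n]  # ITR2 guard
    return t[k] != t[k - 1]  # ITR1 guard


def execute(t, k):
    """Return the token vector after process k executes its action."""
    n = len(t) - 1
    pred = t[n] if k == 0 else t[k - 1]
    if pred is BOTTOM:
        # Executing on a corrupted predecessor is not modelled in this sketch.
        raise ValueError("predecessor value is corrupted")
    new_t = list(t)
    new_t[k] = 1 - pred if k == 0 else pred
    return new_t


def enabled_processes(t, fail_safe=False):
    """List all processes whose guard currently holds."""
    return [k for k in range(len(t)) if enabled(t, k, fail_safe)]


def run_fault_free(num_processes, steps):
    """Run the fault-intolerant ring without faults; return the token holders."""
    t = [0] * num_processes  # all variables start with the same value
    holders = []
    for _ in range(steps):
        ready = enabled_processes(t)
        assert len(ready) == 1  # exactly one process holds the token
        holders.append(ready[0])
        t = execute(t, ready[0])
    return holders


print(run_fault_free(4, 8))                          # [0, 1, 2, 3, 0, 1, 2, 3]
corrupted = [0, BOTTOM, 0, 0]                        # fault sets t.1 := ⊥
print(enabled_processes(corrupted))                  # [0, 1, 2]
print(enabled_processes(corrupted, fail_safe=True))  # [0, 1]
```

In the corrupted state, process 2 (whose predecessor holds ⊥) is enabled under the ITR guards but blocked once the additional guard is present, which matches the paper's observation that a detector is needed at process k + 1 when t.k = ⊥.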


We consider faults that may corrupt the token variables at the processes in a detectable way, i.e., by setting t.k to a "bad" value ⊥. The fault actions we consider are formalized as follows (one such action exists for every k):

F :: t.k ≠ ⊥ → t.k := ⊥

This fault model satisfies our fault assumption; it would otherwise be easy to construct a fault model that simply "creates" multiple tokens in the system, against which we would not be able to protect. In the presence of faults, the values of several processes can be set to ⊥ and, following their program, they may execute actions independently, violating the safety property. The set of bad transitions ss therefore contains all transitions of process k that start in a state where t.(k − 1) = ⊥. Note that, whenever t.k = ⊥, process k + 1 will detect that at the earliest, and will not "use" any token received, i.e., will not execute its action. This means that a detector needs to be located at process k + 1. Since the fault model ranges over all processes, we expect a perfect detector to be located at every action. Applying the procedure to the fault-intolerant token ring program yields the resulting program:

FSTR1 :: t.(k − 1) ≠ ⊥ ∧ k ≠ 0 ∧ t.k ≠ t.(k − 1) → t.k := t.(k − 1)
FSTR2 :: t.N ≠ ⊥ ∧ k = 0 ∧ t.k = t.N → t.k := ¬t.N

Observe that the resulting program has high detection coverage (it always detects an error), a low false alarm rate (it never falsely detects an error), and low detection latency (when a fault occurs, i.e., t.k = ⊥, detection occurs in the next step).

7 Discussion and Conclusions

In this paper, we have analyzed the problem of locating detectors when designing efficient fault-tolerant programs under resource constraints. We have shown that the problem of locating detectors is NP-complete. To address this complexity, we have identified a special case under which the detector location problem can be efficiently solved. We have developed an example to show the working of our approach. In our proof of NP-completeness, we have made assumptions about the existence of several functions, such as C(d, a, p′, F) (for coverage). Such functions can be obtained through fault injection experiments [8]. We have also assumed the existence of utility functions for detection coverage and detection latency, and these functions can be provided by the system designer. Function s(d) (size) can be obtained during the compilation process. As future work, we will try to identify other special cases of DLP. We also plan to extend tools such as FTSyn [12] with the algorithm presented.

References

[1] Bowen Alpern and Fred B. Schneider. Defining liveness. Information Processing Letters, 21:181–185, 1985.

[2] Anish Arora and Sandeep S. Kulkarni. Detectors and correctors: A theory of fault-tolerance components. In Proceedings of the 18th IEEE International Conference on Distributed Computing Systems (ICDCS98), May 1998.

[3] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, March 1996.

[4] D. Powell, E. Martins, J. Arlat, and Y. Crouzet. Estimators for fault tolerance coverage evaluation. 1993.

[5] M. Hiller, A. Jhumka, and N. Suri. An approach for analysing the propagation of data errors in software. In DSN, pages 161–172, 2001.

[6] M. Hiller, A. Jhumka, and N. Suri. On the placement of software mechanisms for detection of data errors. In DSN, pages 135–144, 2002.

[7] Martin Hiller. Executable assertions for detecting data errors in embedded control systems. In Proceedings of the International Conference on Dependable Systems and Network (DSN 2000), pages 24–33, 2000.

[8] Mei-Chen Hsueh, Timothy K. Tsai, and Ravishankar K. Iyer. Fault injection techniques and tools. IEEE Computer, 30(4):75–82, April 1997.

[9] A. Jhumka, F. Freiling, C. Fetzer, and N. Suri. Automated synthesis of fail-safe fault-tolerance using perfect detectors. Technical report, University of Warwick, 2005.

[10] A. Jhumka, M. Hiller, and N. Suri. Approach for designing and assessing detectors for dependable component-based systems. In HASE, pages 69–78.

[11] Arshad Jhumka, Felix C. Gärtner, Christof Fetzer, and Neeraj Suri. On systematic design of fast and perfect detectors. Technical Report 200263, Swiss Federal Institute of Technology (EPFL), School of Computer and Communication Sciences, Lausanne, Switzerland, September 2002.

[12] Sandeep S. Kulkarni and Ali Ebnenasir. SYNFT: A framework for adding fault-tolerance to distributed programs. Available via email from the authors at Michigan State University, USA, 2003.

[13] Leslie Lamport. Proving the correctness of multiprocess programs. IEEE Transactions on Software Engineering, 3(2):125–143, March 1977.

[14] Nancy G. Leveson, Stephen S. Cha, John C. Knight, and Timothy J. Shimeall. The use of self checks and voting in software error detection: An empirical study. IEEE Transactions on Software Engineering, 16(4):432–443, 1990.

[15] Z. Liu and M. Joseph. Verification of fault tolerance and real time. In Proceedings of the 26th IEEE Symposium on Fault Tolerant Computing Systems (FTCS-26), pages 220–229, Sendai, Japan, June 1996. IEEE.

[16] David Powell. Failure mode assumptions and assumption coverage. In Dhiraj K. Pradhan, editor, Proceedings of the 22nd Annual International Symposium on Fault-Tolerant Computing (FTCS '92), pages 386–395, Boston, MA, July 1992. IEEE Computer Society Press.
