Using Fail-Awareness to Design Adaptive Real-Time Applications

Christof Fetzer and Flaviu Cristian
[email protected]

Abstract


We address the problem of how to adapt the quality of service of distributed, fault-tolerant, real-time applications to the failure rate that the system currently experiences. Our approach to this problem uses fail-awareness, a design concept that allows clients to detect when a distributed service cannot provide certain properties because the failure rate is too high. In this paper we describe several problems that have to be solved in applications that can adapt their quality of service, and we show how fail-awareness can be used in such applications.


Figure 1: Our goal is to support the design of applications that adapt their quality of service to the failure rate that the system currently experiences. The figure distinguishes, for each failure rate, the QoS that is practically possible, the QoS that is only theoretically possible, and the QoS that is impossible to provide.

1 Introduction

One of the problems we address in this paper is how to adapt the quality of service of a distributed real-time application with respect to the currently experienced failure rate (i.e. the number of failures per time unit) in a system (see Figure 1). We describe the quality of service by a set of service properties that the application provides. One can use redundancy to mask failures; thus, an application can provide all its properties as long as the failure rate is below some maximum failure rate. For example, an application can use two components (two processes or two objects) to provide a certain service property, and as long as one of the components is operational, the system can provide that property. However, when the failure rate becomes too high, certain properties become impossible to provide since the amount of redundancy available in the system is not sufficient to mask all failures.

In many application domains there is also a difference between what properties an application can practically provide for a certain failure rate and what is theoretically possible: some properties are theoretically possible to provide for a given failure rate and a given distributed computing system, but an actual implementation would become too complex, i.e. too expensive.

2 Synchronous vs Asynchronous Systems

Much of the current research in distributed real-time systems concentrates on the guaranteed response paradigm [15]. This paradigm uses static resource allocation and peak load assumptions to guarantee that the real-time system reacts to events occurring in the controlled object within an a priori known time bound. Applications designed according to that paradigm are based on the synchronous system model, which is characterized by a priori bounded process scheduling and message transmission delays and a bounded failure rate.



This research was performed at UCSD, where it was partially supported by a grant from the Air Force Office of Scientific Research. This paper will be published in the Proceedings of the IEEE National Aerospace and Electronics Conference, July 14-18, 1997, Dayton, Ohio, USA.



Figure 2: Synchronous services provide their standard semantics as long as the failure rate is below some a priori given bound and their behavior is generally undefined when the failure rate exceeds that bound.

Applications designed according to the guaranteed response paradigm have to provide their standard "synchronous" semantics as long as the failure rate is within some a priori given maximum failure rate; when the failure rate exceeds that bound, the behavior of the application is typically unconstrained (see Figure 2). Such applications have to use redundancy to mask all failures as long as the failure rate is within the specified maximum bound.

Another approach to designing fault-tolerant applications is to use the time-free system model [14]. This model requires that messages sent between two non-crashed processes are eventually delivered and that each non-crashed process executes its program code with an unknown but non-zero speed. This model has no notion of time and hence is not an appropriate model for constructing real-time applications. To solve a problem in this model, one assumes an upper bound on the number of crash failures that can occur during the execution of an algorithm. The "performance failure rate" in such systems can be viewed as unbounded because there exists no upper bound on the transmission delay of messages or the scheduling delay of processes. Hence, the time-free model does not allow a deterministic solution of even very weak problems like consensus [14].

Figure 3: Since the execution speed and the message transmission delays are not restricted in time-free systems, one can view them as systems in which a service has to provide all its properties even when an arbitrary number of performance failures occur. Hence, hardly any useful problem is solvable in the time-free model.

In our work we use the timed asynchronous system model [5] instead. This model defines two thresholds, σ and δ, for the maximum acceptable process scheduling delay and message transmission delay, respectively. When the scheduling delay of a process p is greater than σ, we say that p has suffered a performance failure; otherwise, we say that p is timely. When the transmission delay of a message m is greater than δ, we say that m has suffered a performance failure; otherwise, m is timely.

The timed asynchronous system model does not restrict the number of failures per time unit in a system. Hence, the process scheduling and message transmission delays are effectively not bounded. Since unbounded process scheduling and message transmission delays are characteristic properties of asynchronous systems, this model is an asynchronous system model.


3 Failure Model vs Failure Assumption

The guaranteed response paradigm requires that the maximum rate of failures be a priori known. This requirement is formalized by a failure assumption. In general, a failure assumption states the maximum number of performance, crash, and omission failures that can occur per "round" when the considered protocols are round-based, or per "time unit" when the progress of the considered protocols is measured by the passage of time. The failure model states what kinds of failures can occur in a system. Knowing what classes of failures can occur (stated in the failure model) and knowing the maximum number of these failures per round or time unit (stated in the failure assumption), one can use a sufficient amount of redundancy to mask all failures that can occur. Thus, when the failure model and the failure assumption are correct, one can exclude the occurrence of non-maskable failures, that is, system failures, by design. In other terms, the probability that the system masks all failures is at least as high as the probability that the failure model and the failure assumption are valid [16].



Typically, the failure model used in most synchronous systems assumes that processes have crash/performance failure semantics and that messages have omission/performance failure semantics [2]. While we do not make any assumptions about the rate of failures in this work, we assume that the classes of failures that can occur are restricted by a failure model. The timed asynchronous system model assumes that processes have crash/performance failure semantics and that messages have omission/performance failure semantics, i.e. messages can be lost or delivered with a transmission delay greater than δ.


Figure 4: A fail-aware service has to provide its standard semantics as long as the failure rate is below some maximum failure rate. When a server cannot provide its standard semantics anymore, i.e. when the failure rate exceeds the maximum bound, the server has to indicate to its clients that it provides a specified exception semantics.




4 Fail-Awareness

The general idea of fail-awareness is that a fail-aware server (1) has to provide its standard semantics, which is similar to the semantics of a synchronous service, as long as the failure rate is within some given maximum rate, and (2) whenever a server of the service cannot provide its standard semantics, it has to provide a predefined exception semantics and has to signal this to its clients using an (exception) indicator (see Figure 4). An indicator lets a client know whether the server currently guarantees that certain service properties hold. To explain this, we describe below a simple fail-aware clock synchronization service.


The semantics of a service can be seen as a set of properties that the service has to provide. For example, the requirements of an internal clock synchronization service are typically defined by the following two properties: (P1) at any time the deviation between any two correct clocks is bounded by an a priori given constant, and (P2) the drift rate of a clock is bounded by a given constant (i.e. a clock proceeds within a linear envelope of real-time).
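Stated symbolically, the two properties take roughly the following form (a sketch; the bound names Δ and ρ are ours, since the original symbols were lost):

```latex
% (P1) bounded deviation between any two correct clocks C_p and C_q:
\forall t: \; |C_p(t) - C_q(t)| \le \Delta
% (P2) bounded drift: every clock proceeds within a linear envelope of real-time:
\forall t \ge t_0: \; (1-\rho)(t - t_0) \le C_p(t) - C_p(t_0) \le (1+\rho)(t - t_0)
```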




Figure 5: A fail-aware service consists of a set of fail-aware servers. Each server has one or more indicators, and each of the indicators signals whether some service property (or a set of properties) currently holds.

For some services, the different properties provided by a service might require different maximum failure rates to become implementable. Therefore, it can be advantageous to have more than one indicator per server (see Figure 5): each set of properties that requires the same maximum failure rate is associated with its own indicator. We say that a server provides its standard semantics when none of the indicators signals an exception. Otherwise, the server provides an exception semantics.

For example, consider a clock synchronization service that has to provide, in addition to the properties (P1) and (P2), also the following property: (P3) all correct clocks are externally synchronized within some constant. For such a clock synchronization service it is advantageous to define for each server p two indicators I3 and I1 such that (1) when I3 is true, then the clock of p is externally synchronized and hence also internally synchronized, and (2) when I1 is true, then the clock of p is internally synchronized (see Figure 6). If none of the two indicators is true, it is still guaranteed that the drift rate of the clock of p is bounded, i.e. (P2) always holds.

Figure 6: A fail-aware clock synchronization server provides externally synchronized clocks as long as the failure rate is below some bound FR1 and internally synchronized clocks as long as the failure rate is below FR2. When the failure rate exceeds FR2, it is only guaranteed that the drift rate of a clock is within some given bound.

5 Performance and Omission Failures

To explain why fail-awareness is useful in the structuring of applications and the masking of failures, we first sketch how performance and omission failures that are not detected or masked can result in failures that look like arbitrary failures. Let us consider a system with four processes (see Figure 7). The "sensor" process reads its sensor and sends these readings to two redundant processes p and q. These two processes send commands ("up" and "down") to an actuator unit. We depict in Figure 7 a scenario in which the undetected performance failure of a "sensor" message results in a wrong command for the actuator unit: while the last command sent to the actuator unit should be Cm = "down", due to a performance failure of a sensor message the last command is Cn = "up", and the actuator cannot detect that it should reject Cn since neither Cn nor the sender of Cn has suffered a performance failure.

Figure 7: When the performance failure of the sensor message is not detected, a process can send a wrong command to the actuator. This command can be viewed as an arbitrary failure since the last command the actuator gets is "up" while it should be "down", and the actuator cannot detect that it should reject Cn.

To mask up to f arbitrary failures of the servers of a service, one needs at least 2f+1 servers (see Figure 8). Since undetected performance failures can look like arbitrary failures, one typically also needs 2f+1 servers to mask f server failures caused by undetected performance failures. Note that not necessarily all failures can be masked in that way because the rate of performance failures is not bounded, and hence the number of failures can exceed f.

Figure 8: To mask one undetected performance failure (or an arbitrary failure) of a server, one needs at least three servers. A voter has to find a majority of processes that agree on the same output value.

At the voter level it is not even always possible to detect that one cannot mask all server failures. Consider that two of the three servers have suffered the same kind of failure and thus produce the same wrong result. A voter finds a majority of servers that agree on the same value, and hence the voter cannot necessarily detect that the result of the voting process is wrong.

Using fail-awareness, one can mask up to f server failures with only f+1 servers (see Figure 9). The voter has to select the output of a server that indicates that it provides its standard semantics. When all servers provide only an exception semantics, the voter has to signal to its clients that its output is that of a server that provides an exception semantics. Thus, fail-awareness allows one to detect on a higher level if not all failures of lower levels can be masked.
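The selection rule of such a fail-aware voter is simple enough to sketch in a few lines of C++ (an illustrative sketch, not the paper's implementation; all names are assumptions):

```cpp
#include <optional>
#include <vector>

struct ServerOutput {
  int value;      // the server's computed output
  bool standard;  // true iff the server's indicator signals standard semantics
};

// Select the output of any server that still provides its standard semantics.
// Returns std::nullopt when all servers signal an exception; in that case the
// voter itself has to signal an exception to its clients.
std::optional<int> fail_aware_vote(const std::vector<ServerOutput>& outputs) {
  for (const ServerOutput& out : outputs)
    if (out.standard) return out.value;
  return std::nullopt;  // all f+1 servers provide only an exception semantics
}
```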

Figure 9: When all servers are fail-aware, the voter can select the output of a server that indicates that it is correct. If all of the servers indicate that they are providing an exception semantics, the voter also has to signal an exception.

Because performance, omission, and crash failures can be transformed into failures that look like arbitrary failures, to detect a failure of a server s it is necessary to detect all non-masked failures of all services that s depends upon. Fail-aware services therefore typically depend upon other fail-aware services to allow the detection of all non-masked failures (see Figure 10).

Figure 10: Fail-aware services typically depend upon other fail-aware services to allow the detection of all non-masked failures of lower levels.

6 Partitionable Systems

We also address the problem of communication partitions, i.e. the problem that a system can split into disjoint subsets of processes such that processes in one subset cannot communicate with the processes in other subsets due to network failures or excessive performance failures. Each such subset is informally referred to as a (communication) partition. Our goal is to allow the servers in each partition to "make progress" independently of the servers in other partitions, i.e. these servers can provide their standard semantics even though they cannot communicate with servers in other partitions.


The notion of a communication partition is hard to formalize: even though at first sight the notion might be intuitively clear, in many cases it is actually not clear at all if and/or how a system is partitioned. For example, consider that two LANs are connected by an interconnection network (see Figure 11), and that we want to elect a leader in each partition. When the communication between the two LANs becomes too slow, it might be better to have a local leader in each of the two LANs because a local leader in one LAN cannot communicate in a timely fashion with the processes in the other LAN. Hence, it might be advantageous to view each of the two LANs as a separate partition in the sense that in each LAN (i.e. partition) a local leader is elected until the network connecting the two LANs allows timely communication between the LANs again.


Figure 11: Two LANs are connected by a network. When the network is overloaded and the communication between the two LANs becomes too slow, it can be advantageous to view the two LANs as partitioned.

To address the problem of what a communication partition is, we introduced the concept of a logical partition (see Figure 12) [12]. The goal of this concept is that we want to allow the servers in each logical partition to provide their standard semantics. Since in general a server needs to be able to communicate in a timely manner with other servers and clients in its logical partition to provide its standard semantics, we define the notion of a logical partition such that (1) a logical partition looks to an application like a synchronous subsystem, i.e. the communication delay between the processes in a logical partition is bounded by an a priori given constant, and (2) a logical partition is maximal in the sense that when a process p can communicate in a timely manner with all processes in a logical partition LP, then p is a member of LP. We derived a formal definition of a logical partition in [12].

Figure 12: We designed a set of fail-aware services that collectively create for each communication partition one or more logical partition(s). A logical partition can be viewed as a synchronous subsystem.

In our approach, a server uses its indicators not only to signal to its clients whether it currently provides its standard semantics or an exception semantics; the indicators also let the clients know in what logical partition they are. When a server cannot provide its standard semantics, at least one of its indicators shows a value that signals that the property associated with the indicator cannot be guaranteed by the server.

For example, a partitionable fail-aware internal clock synchronization service synchronizes the clocks of all processes in a logical partition within some constant maximum deviation Δ. This requirement can be expressed as follows using the indicators of the clock synchronization servers: when two clock synchronization servers p and q are in the same logical partition at real-time t, i.e. Ip(t) = Iq(t) ≠ out-of-date, then their clocks Cp and Cq are at most Δ apart at t: |Cp(t) − Cq(t)| ≤ Δ.


7 Performance Failure Detection

Most of the fail-aware protocols we have designed are "round based" and use time redundancy to mask a bounded number of performance and omission failures per round. Because the failure rate is not always bounded, not all performance/omission failures are necessarily maskable. Since such non-maskable failures can lead to system failures, we require their detection to allow a higher level of abstraction to change the quality of service and to avoid that non-detected and non-masked failures are transformed into "arbitrary failures". We review some of the mechanisms we use to detect performance failures and, in particular, show how to implement indicators that allow a server to signal to its local clients that it cannot provide its standard semantics due to non-maskable performance/omission failures.


Real-time communication protocols can be divided into two broad classes [15]: time triggered and event triggered. Event-triggered systems react to events directly, while time-triggered systems react only at predefined points in time. Orthogonal to the above classification, protocols can also be classified as clock-driven or timer-driven [17]. Clock-driven protocols rely on synchronized clocks, while timer-driven protocols rely on (unsynchronized) timers. All our fail-aware protocols are event-triggered and almost all of them are clock-driven. We chose event-triggered protocols since operating systems like Unix have relatively good reaction times to events like a message reception, but they have relatively poor real-time scheduling support. The protocols are clock-driven because clock-driven protocols simplify the implementation of indicators: a process knows the clock deadline beyond which it cannot provide its standard semantics anymore, unless 'something good' happens before that deadline. Our indicator design relies on this knowledge to signal when a server starts providing its exception semantics.


Detecting performance failures is vital to many of our protocols to ensure their safety and timeliness properties. Our protocols require that a process reads its hardware clock at certain points during the protocol execution. For example, all our protocols require that a process reads its hardware clock when it (1) receives a message, (2) sends a message, (3) reads an indicator, or (4) is awakened by the operating system. Consider the situation shown in Figure 13 and let us assume that the standard execution has to be stopped as soon as one of the bounds a, B, C, or d is violated. Process q reads its clock at real-times t, u, and v, and its hardware clock returns the values T, U, and V, respectively. It checks whether the transmission delay of the message n was at most a real-time units, and it checks that the processing time between T and U was at most C clock time units. The latter is easy to achieve because q can use the time stamps T and U to test if U − T ≤ C; similarly, q can check the remaining bounds using its hardware clock time stamps. When q detects that at least one of the four bounds a, B, C, or d is violated, it no longer sends its next protocol message.
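The following sketch illustrates such checks (an illustrative reconstruction; the bound names follow Figure 13, everything else is an assumption):

```cpp
#include <cstdint>

using ClockTime = std::int64_t;

// Hardware clock time stamps taken by a process (cf. Figure 13):
// T = reception of a message, U = end of its processing, V = before the
// next protocol message is sent.
struct Timestamps { ClockTime T, U, V; };

// Check the processing-time bound C between reception and end of processing;
// analogous comparisons cover the remaining bounds a, B, and d.
bool processing_timely(const Timestamps& ts, ClockTime C) {
  return ts.U - ts.T <= C;
}

// If any bound is violated, the process stops its standard execution: it does
// not send its next protocol message and signals an exception via its indicator.
bool standard_execution_ok(const Timestamps& ts, ClockTime C, ClockTime d) {
  return processing_timely(ts, C) && (ts.V - ts.U <= d);
}
```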



Figure 13: Processes use hardware clock time stamps to detect performance failures. Process p sends a message n to process q; the hardware clock values S, T, U, V, and W are read at real-times s, t, u, v, and w, and the bounds a, B, C, and d limit the transmission and processing delays.

7.1 Indicators


Figure 14: Process p has to adjust its virtual clock before hardware clock times S, S+D, and S+2D. It actually performs the adjustments at times T, U, and V. Since p misses the deadline S+D, a process that reads p's clock between S+D and U has to detect that p's clock is not synchronized.

Figure 15: The indicator Ip of a process p consists of the logical partition lpartition of p and an expiration time expTime, i.e. the time when the indicator will become out-of-date.


To explain how clock-driven protocols help to maintain indicators, consider a simple clock synchronization protocol. A process p has to adjust a virtual clock periodically, say, before its local hardware clock Hp shows the values S, S+D, S+2D, ..., to keep its virtual clock synchronized (see Figure 14). When process p does not adjust its clock before the given deadlines, its clock Cp is not necessarily synchronized with the other clocks. Let process q be another process that is executed on the same computer node as p (i.e. p and q use the same hardware clock Hp). When process q tries to read Cp between S+D and U (measured by Hp), q has to detect that Cp is out of synchrony. We achieve that by using the following mechanism (see Figure 15). An indicator Ip of a process p consists of two parts: (1) the identification of p's logical partition (lpartition), and (2) the expiration time (expTime) beyond which the indicator has to signal that p provides its exception semantics. When a process evaluates Ip, it first reads the local hardware clock Hp. If the hardware clock shows at most time expTime, the value of Ip is lpartition. Otherwise, the value of Ip is out-of-date, which tells the reader that p provides its exception semantics. Process p updates its indicator periodically, e.g. at time T it sets the expiration time to its next deadline S+D (see Figure 14). When p suffers a performance failure, it does not update its indicator in time and hence, any client that reads the indicator will evaluate Ip to out-of-date. For example, when q evaluates Ip during the interval between S+D and U, the expiration time is S+D while Hp shows a value greater than S+D. Thus, Ip returns the value out-of-date, which allows q to detect that Cp is not in synchrony anymore.
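This mechanism can be sketched as follows (an illustrative reconstruction in C++, not Fortress code; the class layout and names are assumptions):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

using ClockTime = std::uint64_t;  // hardware clock time stamp

// Stand-in for reading the local hardware clock H_p.
ClockTime read_hardware_clock() {
  return std::chrono::duration_cast<std::chrono::microseconds>(
      std::chrono::steady_clock::now().time_since_epoch()).count();
}

constexpr int OUT_OF_DATE = -1;  // value signaling the exception semantics

class Indicator {
  std::atomic<ClockTime> exp_time_{0};        // deadline for the next renewal
  std::atomic<int> lpartition_{OUT_OF_DATE};  // id of the owner's logical partition
public:
  // Called periodically by the owning server, e.g. at time T it renews the
  // indicator until its next deadline S+D.
  void update(int lpartition, ClockTime next_deadline) {
    lpartition_.store(lpartition);
    exp_time_.store(next_deadline);
  }

  // Called by clients: returns the logical partition id, or OUT_OF_DATE when
  // the owner missed its deadline (e.g. due to a performance failure). The
  // hardware clock time stamp used in the evaluation is returned via read_at.
  int evaluate(ClockTime& read_at) const {
    read_at = read_hardware_clock();
    return read_at <= exp_time_.load() ? lpartition_.load() : OUT_OF_DATE;
  }
};
```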

A process that reads Ip might itself suffer a performance failure while evaluating the current value of Ip. An indicator therefore returns the hardware clock time stamp used in its evaluation. This allows the detection of performance failures that occur during the evaluation of Ip or during the usage of the value returned by Ip.


Note that our design of indicators allows a system to be structured such that some high priority processes check periodically that certain properties hold by querying the indicators of servers. The indicators signal an exception even when the servers (which can have lower priorities than the querying processes) suffer performance failures. For example, in fail-safe systems these processes can detect when the system has to transition to a safe mode.
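A fail-safe application could, for instance, run a high-priority watchdog of the following shape (a sketch under the same assumptions as the indicator sketch above; enter_safe_mode is a hypothetical application hook):

```cpp
#include <cstdio>
#include <vector>

// Minimal view of a server's indicator for this sketch: true while the server
// guarantees its standard semantics, false once it signals an exception.
struct IndicatorView {
  virtual ~IndicatorView() = default;
  virtual bool standard() const = 0;
};

// Application-specific reaction, e.g. lowering the crossing arms.
void enter_safe_mode() { std::puts("transition to safe mode"); }

// Executed periodically by a high-priority process: if any monitored server
// signals an exception, transition the system to its safe mode.
void watchdog_step(const std::vector<const IndicatorView*>& indicators) {
  for (const IndicatorView* ind : indicators)
    if (!ind->standard()) { enter_safe_mode(); return; }
}
```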

8 Fortress

In this section we describe the Fortress system [11], which is an attempt to provide a common framework for designing real-time applications for systems in which the failure rate cannot be bounded a priori. The main issues we address in Fortress are (1) the support of redundancy management to allow the masking of crash, performance, and omission failures, and (2) the detection of non-maskable failures to ensure application integrity and to facilitate mode changes.

Fortress is a collection of C++ classes that allow one to build fail-aware real-time applications (see Figure 16). It provides group communication services that simplify the replication of application state. The major fail-aware services it provides are an atomic broadcast, a membership, a time, and a fail-aware datagram service.

Figure 16: Fortress is a system that supports the development of fault-tolerant real-time applications. An application sits on top of the fail-aware broadcast, membership, clock synchronization, and datagram services, which in turn run on top of (real-time) Unix. All services Fortress provides are fail-aware, i.e. each server lets its clients know whenever it cannot provide its standard semantics.

Fortress is based on the following concepts. A system consists of a set of computer nodes connected by a network (see Figure 17). An application process contains a set of objects. Some of the objects implement Fortress services like a clock synchronization service or a membership service.

Figure 17: A Fortress process contains a collection of objects. An application consists of a set of processes.

8.1 Scheduling

In Fortress there is no notion of a thread. Instead, Fortress is event oriented in the sense that the occurrence of an external event (e.g. a message arrival) or an internal event (e.g. a timeout occurrence) automatically triggers the execution of a method that handles the event. We assume that the processing time of an event is quite short. Thus, we have not built a preemption mechanism into Fortress, i.e. the execution of an event handler has to terminate before the execution of the next handler starts. Events are associated with a priority and a deadline by which the processing of the event is supposed to be finished. Events are processed in order of their priority, and events with the same priority are processed in the order of their deadlines.
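A minimal sketch of such an event queue (C++; the types and names are assumptions, Fortress's real classes are not shown in the paper):

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

struct Event {
  int priority;                  // higher value = more urgent (assumption)
  std::uint64_t deadline;        // clock time by which handling should finish
  std::function<void()> handler; // method that handles the event
};

struct EventOrder {
  // Process higher priorities first; equal priorities by earlier deadline.
  bool operator()(const Event& a, const Event& b) const {
    if (a.priority != b.priority) return a.priority < b.priority;
    return a.deadline > b.deadline;
  }
};

class EventLoop {
  std::priority_queue<Event, std::vector<Event>, EventOrder> queue_;
public:
  void post(Event e) { queue_.push(std::move(e)); }
  // Run each handler to completion; there is no preemption.
  void run() {
    while (!queue_.empty()) {
      Event e = queue_.top();
      queue_.pop();
      e.handler();
    }
  }
};
```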

8.2 Communication

Each Fortress process has a unique identifier that can be used to send unicast messages between processes. A process might crash, and a crashed process is typically restarted automatically. A process loses all its state when it crashes, but it always keeps the same identifier in Fortress (even though the process identifier used by the underlying operating system might change with each restart). To support multicasts, Fortress uses the concept of a team: a team is a constant maximum set of processes that collectively implement a service [3]. Each team has a unique identifier that can be used to broadcast messages to that set of processes. Two teams may have some processes in common.


For example, an application could define four teams (see Figure 18): a sensor, a controller, an actuator, and a process team. The process team is used to broadcast messages to all processes in the system and the remaining teams to broadcast messages to a certain subset of processes. Typically, the teams of a system are defined a priori. However, an application can also create teams dynamically.

Figure 18: A team is a fixed set of processes that collectively implement a service. A team has a unique identifier, and teams are used as multicast destinations.

Message-based communication is achieved by either sending a unicast message to a certain process or by sending a broadcast message to a team. Fortress provides two broadcast services: a datagram service and an atomic broadcast service. Both services allow a process to broadcast a copy of some object to a team (see Figure 19). By calling the method Broadcast of an object O, the sender requests to ship a copy of O to all processes in the destination team. When the object arrives at some process p, O's method Deliver is called to notify p of the arrival of O.

Figure 19: Processes can send each other objects. The method Broadcast ships a copy of an object O to each process in the destination team and then calls the method Deliver of O at each of the destination processes.
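In code, the shipping interface might look like the following sketch (hypothetical class names; the paper only names the methods Broadcast and Deliver):

```cpp
#include <memory>

using TeamId = int;

// Base class for objects that can be shipped to a team; an application
// overrides Deliver to react to the arrival of a copy of the object.
class Shippable {
public:
  virtual ~Shippable() = default;
  virtual std::unique_ptr<Shippable> clone() const = 0;
  virtual void Deliver() = 0;   // called at each destination process
  void Broadcast(TeamId team);  // request to ship a copy to every team member
};

// In Fortress the runtime would ship clone() to each process of the team and
// call Deliver there; this single-process stub just delivers a copy locally.
void Shippable::Broadcast(TeamId /*team*/) { clone()->Deliver(); }

class SensorReading : public Shippable {
  double value_;
public:
  explicit SensorReading(double v) : value_(v) {}
  std::unique_ptr<Shippable> clone() const override {
    return std::make_unique<SensorReading>(*this);
  }
  void Deliver() override {
    // react to the arrival of the reading, e.g. update replicated state
  }
};
```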

The sender of a message can associate a priority with the message that determines the priority of the reception event at the receiver process(es). When a process receives multiple messages in parallel, the messages with higher priorities are delivered before any message with a lower priority is delivered.

The fail-aware datagram service calculates for each message m that it delivers an upper bound on the transmission delay of m [8]. This upper bound can be used by a client to detect when m has suffered a performance failure. For example, a process might have to detect and reject all messages containing sensor readings with a transmission delay greater than some bound, since the information contained in such messages is out-of-date.
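A client-side check based on this upper bound might look as follows (a sketch; the field and parameter names are assumptions):

```cpp
#include <cstdint>

using Micros = std::int64_t;

struct Datagram {
  Micros delay_upper_bound;  // computed by the fail-aware datagram service [8]
  // ... payload ...
};

// Reject sensor readings whose transmission delay may exceed max_age: the
// information they carry could already be out-of-date.
bool accept_sensor_reading(const Datagram& d, Micros max_age) {
  return d.delay_upper_bound <= max_age;
}
```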

8.3 Logical Partitions

Fortress supports partitioned operation: even when the system splits into disjoint subsets of processes due to network failures or excessive performance failures, Fortress allows the processes in each subset to make progress. Fortress tries to create for each communication partition a logical partition, which consists of a unique identifier and the sequence of memberships that exists in time for that identifier. When the failure rate in a partition is below a certain a priori given threshold, the membership service provides the standard semantics of a synchronous membership [13].

Figure 20: The membership service creates for each network partition a logical partition: LP1 and LP2. The membership of a logical partition can change, but at no real-time do the valid memberships of two logical partitions overlap.

The membership service creates new logical partitions for good reasons only: when it detects that physical communication partitions "split" into smaller partitions or "merge" into bigger partitions (for details see [13]). Each membership of a logical partition has a limited time validity. At any point in time a process can be in at most one valid membership. Hence, two logical partitions do not overlap, in the sense that a process in the currently valid membership of one logical partition cannot also be in the valid membership of another logical partition (see Figure 20).

The membership service updates the membership of a logical partition periodically to account for the departure or the joining of processes. Each membership of a logical partition is valid for exactly R clock time units, where R is an a priori defined constant. The membership service creates a sequence of memberships so that all processes in a logical partition can have at all times a valid membership (see Figure 21). A process that departs from a physical communication partition is excluded from the membership of the corresponding logical partition within a bounded amount of time, and a process that joins a communication partition is included in the membership within a bounded amount of time (again, see [13] for details). A process can query what its current logical partition is and what the current membership is by calling the function memberset. Fortress can also notify a process of each new logical partition or membership change by an upcall.

Figure 21: The membership service updates the membership of a logical partition periodically: each membership is valid for R clock time units before it must be replaced by a new membership.

Since we allow an unbounded failure rate, it cannot be guaranteed that a membership server always knows the current membership of its logical partition. The membership service is, like all other Fortress services, fail-aware in the sense that it lets its clients know when a membership server cannot guarantee its standard semantics. When a membership server cannot keep the membership up-to-date, it joins a predefined logical partition whose membership is always the empty set. This empty partition is used to indicate to a process that Fortress has not been able to keep the membership up-to-date or that the standard semantics of some other Fortress service cannot be guaranteed, such as when a local clock is out-of-date.
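The validity rule can be expressed in a few lines (a sketch; names are assumptions, R is the constant from Figure 21):

```cpp
#include <cstdint>
#include <vector>

using ClockTime = std::uint64_t;
using ProcessId = int;

struct Membership {
  std::vector<ProcessId> members;
  ClockTime start;     // clock time at which this membership became valid
  ClockTime validity;  // valid for R clock time units
};

// A membership may only be used while it is valid; afterwards the process must
// have installed a successor membership (or joined the empty partition).
bool is_valid(const Membership& m, ClockTime now) {
  return now >= m.start && now < m.start + m.validity;
}
```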

8.4 Clocks

Fortress provides each process with access to a local hardware clock with a known maximum drift rate. Fortress calibrates the hardware clock to compensate for systematic clock drift errors; the calibration guarantees that the maximum drift rate of a calibrated clock is typically very small. Whenever a process has access to an external real-time provider, Fortress synchronizes the hardware clock with real-time.

Fortress defines an internal time base for each logical partition independently of the availability of an access to real-time: each process p has a clock Cp, and the deviation between the clocks Cp and Cq of two processes p and q that are in the same logical partition is at most Δ. If access to external time is available and all clocks in the logical partition are synchronized within ε of real-time, Fortress ensures that Cp is within ε of real-time for each process p in the logical partition. In other words, Fortress tries to provide one common time base whenever possible. We use an improved probabilistic approach [6, 1] for synchronizing clocks, which guarantees that the clock synchronization service is fail-aware: when the clock synchronization server of some process p in a logical partition cannot guarantee that Cp is within Δ of the other clocks in the partition, it makes sure that process p leaves the partition, i.e. that p's indicator signals out-of-date.

When access to external time is available, Fortress (externally) synchronizes all clocks within ε of real-time. In such a case, any two externally synchronized clocks are within 2ε of each other. We call ε the maximum external deviation and Δ the maximum internal deviation. Fortress always provides each process p with a computed upper bound on the deviation between p's clock Cp and real-time.

Fortress provides alarm clocks that allow a process to set an alarm time with respect to either the local hardware clock or the internal time base defined by the fail-aware clock synchronization service. A process can actually specify an interval within which it wants to be awakened, the priority of the alarm, and a method that should be called when the alarm time is reached.


8.5 Atomic Broadcast

The fail-aware atomic broadcast service provides an all-or-nothing semantics that guarantees that all team members in a logical partition see the same sequence of broadcasts. Its standard semantics is that of a synchronous atomic broadcast [4]. When a process in a logical partition sends a broadcast message to a team, then (1) the message is either delivered to all team processes in the logical partition or to none, (2) if no failures occur, the message is delivered to all of these processes within a known constant ∆B time units, and (3) all broadcasts are delivered in the order of their send time stamps, i.e. when the send time stamp of a broadcast b1 is smaller than that of a broadcast b2, then b1 is delivered before b2 (see Figure 22).
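The ordering rule can be sketched as follows (illustrative only; ∆B is the a priori delivery bound, the rest is assumed):

```cpp
#include <cstdint>
#include <map>
#include <vector>

using ClockTime = std::uint64_t;

struct Broadcast {
  ClockTime send_ts;  // send time stamp taken from the sender's synchronized clock
  int payload;
};

class AtomicBroadcastBuffer {
  std::multimap<ClockTime, Broadcast> pending_;  // ordered by send time stamp
public:
  void receive(const Broadcast& b) { pending_.emplace(b.send_ts, b); }

  // Deliver, in send-time-stamp order, every broadcast whose time stamp is at
  // least delta_b old: no timely broadcast with a smaller time stamp can still
  // arrive after that point.
  std::vector<Broadcast> deliverable(ClockTime now, ClockTime delta_b) {
    std::vector<Broadcast> out;
    while (!pending_.empty() && pending_.begin()->first + delta_b <= now) {
      out.push_back(pending_.begin()->second);
      pending_.erase(pending_.begin());
    }
    return out;
  }
};
```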


Figure 22: The atomic broadcast service delivers broadcast messages in the order of their send time stamps. A broadcast message is either delivered to all members of a logical partition or to none.

If the failure rate becomes too high, it cannot be guaranteed that all broadcasts in a logical partition are delivered within ∆B time units to all team processes in the partition. For example, consider that some process c1 that controls the crossing arms of a railway crossing receives a broadcast late (see Figure 23). The broadcast service informs c1 of that condition, and c1 can take appropriate actions, such as lowering the crossing arms because c1 does not know if there is currently a train in the railway crossing.

Figure 23: Process c1 does not receive the message from s1 within ∆B time units. Thus, c1 is informed by the broadcast service no later than ∆B time units after the message's send time stamp that it has missed a broadcast.

Another failure scenario that has to be detected by the fail-aware atomic broadcast service is that of a broadcast message that is not delivered within the sender's logical partition. Consider the case that some "sensor" process s4 broadcasts a message and the broadcast service rejects the message due to some failure, for example, s4 becoming partitioned from the rest of the logical partition (see Figure 24). The processes in the logical partition have to be able to detect that messages broadcast by s4 are not delivered. This is done in Fortress by removing all processes whose broadcast messages are rejected from the membership of the logical partition. In the example given in Figure 24, process s4 will be removed from the membership, and the membership service informs c1 of that event by an upcall. When "too many" sensor processes are removed from the membership, process c1 can, for example, lower the crossing arms (because it cannot sense trains that enter the crossing).

Figure 24: Process s4 is removed from the logical partition because s4's broadcast message is not delivered in the partition.

9 Performance

We measured the performance of our services on a 10 MBit Ethernet connecting several relatively slow SUN IPX workstations running SunOS 4.1.2 in our Dependable System Laboratory at UCSD. The first measurement shows the distribution of the maximum error of the upper bounds calculated for unicast messages (see Figure 25). Since we cannot measure the exact one-way transmission delay of a message, we approximate the error made by the fail-aware datagram service by the difference between the calculated upper bound and a lower bound for the message transmission delay. Note that this is a conservative approximation, in that the real error is always smaller than our approximation. This distribution is based on 20,000 round-trips of unicast messages with a length of 248 bytes.

Figure 25: Distribution of the maximum error of the calculated upper transmission delay bound for unicast messages.

The next measurement, shown in Figure 26, describes the distribution of the time needed to elect a local leader (these local leaders are instrumental in defining the logical partitions mentioned earlier). The measurements are based on 100,000 elections. The election time increases linearly with the number of processes participating in the election: Figure 26 shows the average election time and the 99% election time, i.e. the time within which a process succeeds with 99% probability in becoming leader.

Figure 26: The average election time and 99% election time for 1 to 7 participating processes.

The next figure shows the distribution of the deviation between the hardware clock of a local leader and a process in the leader's logical partition (see Figure 27). The distribution shows an upper bound on the deviation, calculated whenever a membership server reads a local virtual clock that is synchronized with the hardware clock of the local leader.

Figure 27: Distribution of the deviation between the hardware clock of a local leader and the virtual clock of one of its supporters.

We also show the distribution of the time needed to remove a crashed process from the membership of a logical partition (see Figure 28). During this experiment the membership was updated every 80ms. Since the membership protocol gives a process a "second chance" to prove that it is still in the same partition, it typically takes between 80ms and 160ms to remove a crashed process from the membership (see [9] for details).

Figure 28: Distribution of the removal time of a crashed process by the second chance membership protocol. Processes have to send an "alive-msg" at least every 80ms and get a "second chance".

The last distribution shows the delivery times of atomic broadcasts that are ordered according to their send time stamps (see Figure 29). A broadcast message is ordered by the local leader as soon as it is known that no other message has to be delivered before it, i.e. we use an early delivery option. The threshold for slow messages and the maximum internal deviation between clocks imply that a local leader has to wait about 10ms before it can order a message. The width of the distribution ([13ms, 23ms]) is about 10ms, which reflects the scheduling resolution of the operating system, which is also 10ms.

Figure 29: Distribution of the delivery times of atomic broadcasts.

10 Conclusion

We addressed the problem of how an application can adjust its quality of service to the currently experienced failure rate. Our approach uses the concept of fail-awareness to adapt the quality of service. Fail-awareness detects all non-maskable failures of a server and propagates that information to the clients of the server using indicators.

We introduced the notion of fail-awareness in [7]. A more detailed description of Fortress can be found in [11], and an overview of fail-awareness in a partitionable setting is given in [10]. The fail-aware services are described in [8, 12, 9]. A simple synchronized traffic signaling example that demonstrates the use of fail-aware services in a partitionable environment is described in [10], and a more detailed railway crossing example is presented in [11]. The example of [11] shows how an application can use application-level redundancy to mask a bounded number of performance failures per time unit and how the system can be switched to a safe state when the amount of redundancy is exceeded due to the occurrence of too many failures. All these reports are available via our home pages.

References

[1] F. Cristian. Probabilistic clock synchronization. Distributed Computing, 3:146-158, 1989.

[2] F. Cristian. Understanding fault-tolerant distributed systems. Communications of the ACM, 34(2):56-78, Feb 1991.

[3] F. Cristian. Synchronous and asynchronous group communication. Communications of the ACM, 39(4):88-97, Apr 1996.

[4] F. Cristian, H. Aghili, R. Strong, and D. Dolev. Atomic broadcast: From simple message diffusion to Byzantine agreement. Information and Computation, 118(1):158-179, Apr 1995. Early version: FTCS-15, June 1985.

[5] F. Cristian and C. Fetzer. The timed asynchronous distributed system model. In Proceedings of the 28th Annual International Symposium on Fault-Tolerant Computing, Munich, Germany, Jun 1998.

[6] C. Fetzer. Fail-aware clock synchronization. Technical Report Time-Services, Dagstuhl-Seminar-Report 138, Mar 1996. http://www.christof.org/FACS.

[7] C. Fetzer and F. Cristian. Fail-awareness in timed asynchronous systems. In Proceedings of the 15th ACM Symposium on Principles of Distributed Computing, pages 314-321a, Philadelphia, May 1996.

[8] C. Fetzer and F. Cristian. A fail-aware datagram service. In Proceedings of the 2nd Annual Workshop on Fault-Tolerant Parallel and Distributed Systems, Geneva, Switzerland, Apr 1997.

[9] C. Fetzer and F. Cristian. A fail-aware membership service. In Proceedings of the 16th Symposium on Reliable Distributed Systems, pages 157-164, Oct 1997. http://www.christof.org/FAMS.

[10] C. Fetzer and F. Cristian. Fail-awareness: An approach to construct fail-safe applications. In Proceedings of the 27th Annual International Symposium on Fault-Tolerant Computing, Seattle, Jun 1997.

[11] C. Fetzer and F. Cristian. Fortress: A system to support fail-aware real-time applications. In IEEE Workshop on Middleware for Distributed Real-Time Systems and Services, San Francisco, Dec 1997.

[12] C. Fetzer and F. Cristian. A highly available local leader service. In Proceedings of the Sixth IFIP International Working Conference on Dependable Computing for Critical Applications, Grainau, Germany, Mar 1997.

[13] C. Fetzer and F. Cristian. Derivation of fail-aware membership service specifications. In Proceedings of the 3rd Annual Workshop on Fault-Tolerant Parallel and Distributed Systems, Orlando, Florida, Apr 1998.

[14] M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374-382, Apr 1985.

[15] H. Kopetz and P. Verissimo. Real time and dependability concepts. In S. Mullender, editor, Distributed Systems, Second Edition, chapter 16, pages 411-446. Addison-Wesley, New York, 1993.

[16] D. Powell. Failure mode assumptions and assumption coverage. In Proceedings of the 22nd International Symposium on Fault-Tolerant Computing Systems, pages 386-395, 1992.

[17] P. Verissimo. Real-time communication. In S. Mullender, editor, Distributed Systems, Second Edition, chapter 17, pages 447-490. Addison-Wesley, New York, 1993.
