The Transition from Asynchronous to Synchronous System Operation: An Approach for Distributed Fault-Tolerant Systems

Wilfried Steiner, Michael Paulitsch
Real-Time Systems Group, Technische Universität Wien, Treitlstr. 3/182, A-1040 Vienna, Austria, {willi,michael}@vmars.tuwien.ac.at

Abstract

Immediately after power-up, synchronous distributed systems need some time until essential timing properties, which are required to operate correctly, are established. We say that synchronous systems are initially in asynchronous operation. In this paper, we present an algorithm and architectural guidelines that assure the transition from asynchronous to synchronous operation within a bounded duration even in case of failures.
1 Introduction

Asynchronous distributed systems do not require any assumptions concerning the timing behavior of system components for the system to operate correctly. Synchronous distributed systems, on the other hand, require that the system components meet assumptions concerning their timing behavior. Algorithms built on top of synchronous systems can take advantage of these assumptions and allow timeliness guarantees. Generally speaking, algorithms are easier to implement on top of synchronous systems than on top of asynchronous systems; some algorithms cannot be implemented on asynchronous systems at all [2, 4]. Research has been done that takes advantage of and focuses on the properties of synchronous and asynchronous systems, respectively, such as [8, 3]. The work on synchronous systems focuses mainly on the operation of distributed systems in a state where all system properties are already guaranteed. However, after system start every system is basically asynchronous. Consequently, special algorithms, which we call start-up algorithms, must bring a system from asynchronous into synchronous operation. Work in this field has been done, e.g., by Lönn, who simulated algorithms that lead to a synchronous system after power-up [10]. He describes three different start-up algorithms and evaluates their performance using simulation.

If a synchronous system is to be used in safety-critical applications, timeliness guarantees need to be given also during execution of the startup algorithm. These guarantees must hold even in the presence of faults and basically provide an upper bound on the time needed to establish synchronous operation. It is the objective of this paper to present an algorithm and architectural guidelines that bring embedded fault-tolerant distributed systems from an asynchronous state after power-up into synchronous operation within a bounded duration. The timeliness of the algorithm will also be guaranteed in case of failures. The algorithm and guidelines will be evaluated using a time-triggered synchronous system, but may also be used for other synchronous systems.

The paper begins with a description of an architecture for synchronous operation. In Section 3, we describe a start-up algorithm. Section 4 discusses different failure scenarios in the start-up phase and possible solutions to handle the presented failures. We finally conclude in Section 5.

This work has been supported by the European projects DSoS (proj. No IST-1999-11585), PAMELA (proj. No G4RD-CT1999-00086), and NEXT TTA (proj. No IST-2001-32111). We thank Günther Bauer for fruitful discussions.
2 Architecture for Synchronous Operation

This section starts with a definition of a synchronous system. We then give an example of an architecture that meets the presented attributes, and describe a fault-tolerant architecture that operates under a given fault hypothesis.

2.1 Structure of a Distributed System

We identify the following components in our system:

Node: A node is a physical unit and a unit of failure that comprises a communication controller and a computing resource. A node creates, analyzes, sends, and receives messages and performs computational activities.

Communication Medium: A communication medium is a connection between nodes for the purpose of message transmission.

Figure 1 depicts a system of 4 nodes as an example.
Figure 1. Distributed system with 4 nodes

2.2 Characteristics of Synchronous Systems

A distributed system is classified as synchronous if it satisfies the following conditions [5]:

Bounded Transmission Delay: There is a known upper bound on message delay. This delay consists of the time it takes for sending, transporting, and receiving a message over a communication medium.

Bounded Clock Drift: Every node n has a local clock $C_n$ with a known bounded rate of drift $\rho_n \ge 0$ with respect to physical time.

Bounded Processing Time: There are known upper and lower bounds on the time required by a process to execute a processing step.

2.3 An Example of a Synchronous Architecture

We use a time-division multiple-access (TDMA) architecture such as the Time-Triggered Architecture (TTA) and its safety-critical communication protocol TTP/C [7] as an example of a synchronous system. A TDMA architecture obtains its synchronous behavior from the progression of real time, i.e., we assume the availability of a global system time, which is used for the arbitration of the communication medium. In a time-triggered system this global time is established using the local clocks of the nodes. Nevertheless, synchronous systems can also be constructed without knowledge of real time, e.g., with token-passing mechanisms.

In an architecture using a TDMA scheme, time is split up into (non-overlapping) pieces of not necessarily equal durations, which are called sending slots (slots for short). In the architecture we focus on, each of the N nodes of the system is assigned a priori one unique slot, where N equals the number of nodes of the system. When the time of a node's slot is reached, the node is allowed to broadcast one frame on the communication medium for the duration of the slot, $\tau_i^{slot}$, where $0 \le i < N$. The sequence of the slots of a system is called a TDMA round. After the end of one TDMA round the next TDMA round starts, i.e., after the sending of the node in the last slot of a TDMA round, the node that is allowed to send in the first slot sends again. Consequently, each node starts sending every $\tau^{round}$ time units, where $\tau^{round} = \sum_{i=0}^{N-1} \tau_i^{slot}$.

We identify each node and slot by a unique integer; the numbering starts at 0 and is incremented by 1 for each node. These identifiers are called node ID and slot ID, respectively. The slots are ordered by their identifiers in ascending order. By definition the node ID equals the slot ID and, thus, a sending node is also identified by its slot. Figure 2 depicts the access scheme of a system with 4 nodes and its numbering.

Figure 2. Access scheme of a TDMA system with 4 nodes (TDMA rounds n-1, n, and n+1; slot 0 to slot 3 are assigned to node 0 to node 3)

In order to work correctly, each node must have a local copy of the global transmission schedule, so that each node knows when to send and when the others send and, thus, when to receive frames. In the described system, a frame contains an arbitrary number of messages; however, as already mentioned, a node can only send 1 frame within 1 TDMA round. Each frame carries the following information:

Global Time. The current global time at the beginning of the slot.

Node ID. The node ID of the sending node.¹

Data. The messages that a node sends.

Since the transmission pattern is defined and coordinated among the nodes and the delay of the transport itself is given by the communication medium and its physical length and has an upper bound, there exists a calculable upper bound on the transmission of frames. We assume that the hardware on which the architecture is running has clocks with known bounded drift rates and that the execution times of computations and data accesses have an upper bound. This allows us to conclude that the execution time of the process at the node that performs computational activities, e.g., in order to generate the messages, has an upper bound. With respect to the points mentioned in Section 2.2, we can say that the described architecture is synchronous.
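The TDMA access scheme of Section 2.3 can be illustrated with a minimal sketch. The slot durations in `SLOTS` and the helper names `slot_offsets` and `sender_at` are hypothetical and chosen only for illustration; they are not part of the described architecture. The sketch assumes an already synchronized global time that starts at 0 with the beginning of a TDMA round.

```python
from itertools import accumulate

# Hypothetical slot durations (arbitrary time units) for a system of 4 nodes.
SLOTS = [3, 2, 4, 3]

def slot_offsets(slots):
    """Start of each slot relative to the start of a TDMA round."""
    return [0] + list(accumulate(slots))[:-1]

def sender_at(t, slots):
    """ID of the node whose slot contains global time t (t >= 0)."""
    round_duration = sum(slots)          # duration of one TDMA round
    phase = t % round_duration           # position of t inside the current round
    for node_id, start in enumerate(slot_offsets(slots)):
        if start <= phase < start + slots[node_id]:
            return node_id
    raise AssertionError("unreachable for positive slot durations")

round_duration = sum(SLOTS)
offsets = slot_offsets(SLOTS)
print(round_duration, offsets, sender_at(4, SLOTS), sender_at(21, SLOTS))
```

Because the node ID equals the slot ID, looking up the slot that contains the current time is enough to identify the only node that is allowed to send.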
2.4 Fault Tolerance in Synchronous Operation

For all modes of operation, we assume that the following fault hypothesis holds: Only a single fault occurs during the startup phase, affecting at most 1 node and/or a part of the communication medium in spatial proximity. The single-fault hypothesis is justified because the architecture aims at systems where the inter-arrival time of failures is longer than the system recovery time. We assume that nodes are not in spatial proximity.

¹ Due to the node ID, the delay of the transmission can be corrected precisely. This becomes important for the startup of large-scale systems.
2.4.1 Masking of Failures

Since a fault can affect a part of a communication medium, the communication medium consists of two replicated channels. The channels are routed physically separately in order to avoid both channels being affected by a single fault. In order to be able to detect failures of the communication medium in the value domain, each frame is protected by a checksum. A received frame is said to be correct at a receiving node if the checksum calculated by the receiving node equals the checksum of the frame. Otherwise it is said to be incorrect. A failure of a node is tolerated (i.e., it is masked to the application) by replication of nodes and the use of, e.g., Triple-Modular Redundancy (TMR) and correct voting algorithms [8].

2.4.2 Fault Isolation

A fault containment region (FCR) is "a set of components that is considered to fail (a) as an atomic unit, and (b) in a statistically independent way with respect to other FCRs" [6]. In order to assure that a node is an FCR, i.e., that a fault affects only a node and that a failure that occurred in a component is not able to affect another component, we have to provide fault isolation. For example, a fault that affects both channels can occur close to a node, where the two channels are in spatial proximity because they are connected to the node there. An example of a fault affecting both channels is a short circuit of the two channels close to a node. We use a star topology and place a device that isolates failures in the star center of each channel, which ensures that the consequences of such a failure influence the operation of at most 1 node. We call such a device a guardian. There must be at least 2 guardians to cope with the faults of the single-fault hypothesis. Moreover, a correct guardian must not be able to produce correct frames on its own [12, 1].

Figure 3. System of 4 nodes with replicated channels and guardians

The task of the guardian is to prevent a malicious node from sending outside its slot, which prevents a failure in the time domain. We assume that the guardian has access to the global transmission schedule and the system structure. This enables a guardian to analyze a frame that is sent from a node and to check whether this frame contains information that may lead to a failure at another node, such as the wrong time. Furthermore, a guardian performs signal reshaping in order to provide a consistent view on a frame to all other nodes of a cluster and to avoid Slightly-off-Specification (SoS) failures [9].
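The two guardian checks described above (sending only inside the own slot, and plausible frame content) can be sketched as follows. This is only an illustration under strong assumptions: the guardian is already synchronized, time is an integer count of ticks starting at 0 with a TDMA round, and the names `Frame`, `guardian_forwards`, and `SLOTS` are hypothetical. Signal reshaping and checksum handling are not modeled.

```python
from dataclasses import dataclass

SLOTS = [3, 2, 4, 3]  # hypothetical slot durations of a 4-node system

@dataclass
class Frame:
    node_id: int      # node ID claimed by the frame
    global_time: int  # global time at the beginning of the slot, as claimed by the frame

def guardian_forwards(frame, sender, now, slots):
    """Return True if a frame received from `sender` at time `now` may be forwarded."""
    round_duration = sum(slots)
    slot_start = sum(slots[:sender])      # offset of the sender's slot in the round
    phase = now % round_duration          # position of `now` inside the current round
    # Time domain: the sender may only transmit inside its own slot.
    in_own_slot = slot_start <= phase < slot_start + slots[sender]
    # Content: node ID and global time must match the guardian's view of the schedule.
    current_slot_begin = now - (phase - slot_start)
    content_ok = frame.node_id == sender and frame.global_time == current_slot_begin
    return in_own_slot and content_ok

ok = guardian_forwards(Frame(node_id=1, global_time=15), sender=1, now=16, slots=SLOTS)
late = guardian_forwards(Frame(node_id=1, global_time=15), sender=1, now=18, slots=SLOTS)
print(ok, late)
```

The check depends only on the sender's position in the schedule, which is why the guardian needs access to the global transmission schedule but not to the application data.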
3 The Transition from Asynchronous to Synchronous Operation

In the preceding sections, we have explained the normal operation of the system, i.e., the synchronous operation. In this section, we focus on the transition from asynchronous to synchronous operation. After power-up, the clock of a node is initially asynchronous to the clocks of the other nodes; we say that the node is asynchronous with respect to the synchronized system time, or in asynchronous operation. If nodes were left in asynchronous operation after power-up, collisions of frames could occur due to the missing coordination of the nodes' access to the communication medium. This would lead to an unboundable transmission delay. Thus, an algorithm that synchronizes the different times of the nodes after startup is essential for the operation of synchronous systems. We call such an algorithm a startup algorithm. This section describes one such algorithm.
3.1 Start-Up Algorithm

Figure 4. Finite State Machine of the Startup Algorithm (states Init, Listen, Cold Start, and Active; transitions numbered 1 to 4)

For the startup algorithm, we define three timeouts:

Startup Timeout $\tau_i^{startup}$ is unique to each node. It is given by the duration of all TDMA slots prior to the slot of the node with the node ID i, beginning at the start of a TDMA round:

$$\tau_i^{startup} = \begin{cases} 0 & i = 0 \\ \sum_{j=1}^{i} \tau_{j-1}^{slot} & i > 0 \end{cases} \qquad (1)$$

Cold Start Timeout $\tau_i^{coldstart}$ of a node i is given by the sum of the node's Startup Timeout $\tau_i^{startup}$ and a complete TDMA round $\tau^{round}$:

$$\tau_i^{coldstart} = \tau^{round} + \tau_i^{startup} \qquad (2)$$

Listen Timeout $\tau_i^{listen}$ of a node i is given by the sum of the node's Startup Timeout $\tau_i^{startup}$ and two TDMA rounds $2\tau^{round}$:

$$\tau_i^{listen} = 2\tau^{round} + \tau_i^{startup} \qquad (3)$$
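As a worked example of the three timeout definitions (1) to (3), the following sketch computes them for a system of 4 nodes. The slot durations in `SLOTS` are hypothetical values, and the function names are chosen for illustration only.

```python
# Hypothetical slot durations (arbitrary time units) for nodes 0..3.
SLOTS = [3, 2, 4, 3]

def startup_timeout(i, slots):
    """Equation (1): duration of all slots prior to the slot of node i."""
    return sum(slots[:i])

def cold_start_timeout(i, slots):
    """Equation (2): one TDMA round plus the Startup Timeout of node i."""
    return sum(slots) + startup_timeout(i, slots)

def listen_timeout(i, slots):
    """Equation (3): two TDMA rounds plus the Startup Timeout of node i."""
    return 2 * sum(slots) + startup_timeout(i, slots)

startup = [startup_timeout(i, SLOTS) for i in range(len(SLOTS))]
cold_start = [cold_start_timeout(i, SLOTS) for i in range(len(SLOTS))]
listen = [listen_timeout(i, SLOTS) for i in range(len(SLOTS))]
print(startup, cold_start, listen)
```

With positive slot durations, the Startup Timeouts are strictly increasing in the node ID, so every node obtains a different Cold Start Timeout and a different Listen Timeout.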
To explain the system startup we refer to the finite state machine in Figure 4. We distinguish between the following states: Init, Listen, Cold Start, Active.
Init State: After power-up, a node starts in Init State. After its internal initialization it transits to Listen State.

Listen State: The node starts its Listen Timeout and begins listening on the bus. If an incorrect frame or any noise is received while listening, the Listen Timeout is restarted. If a correct frame is received, the receiving node accepts the global time and node ID of the incoming frame and enters Active State. If no traffic has been detected on the communication medium while the node was listening and the Listen Timeout expires, the node enters Cold Start State.

Cold Start State: The node generates a frame using its local time as global time and its node ID (we call such a frame a Cold Start Frame). The node then starts its Cold Start Timeout, begins sending the Cold Start Frame on the communication medium, follows the transmission schedule for one TDMA round, and listens for a frame on the communication medium. If during this TDMA round no frames from other nodes have been received, the node assumes that no other node was in Listen State. It waits for its Cold Start Timeout to expire (i.e., it waits for the duration of the remaining unique Startup Timeout) and sends another Cold Start Frame. If during this TDMA round a correct frame has been received and the global time in this frame equals the node's view of the global time, the node enters Active State. The node enters Listen State if it has received a faulty frame, if other noise has been detected, or if it has received a correct frame that contains a global time different from the node's view of the global time.

Active State: In Active State the node is in synchronized operation and follows its transmission schedule cyclically, i.e., it waits until the time for its slot is reached and starts sending, as described in Section 2.3.
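The state descriptions above can be condensed into a small transition table. The following is a minimal sketch, not the authors' implementation: the event names are assumptions introduced for illustration, timers and frame transmission are not modeled, and the transition numbers of Figure 4 are not reproduced.

```python
from enum import Enum, auto

class State(Enum):
    INIT = auto()
    LISTEN = auto()
    COLD_START = auto()
    ACTIVE = auto()

class Event(Enum):               # hypothetical event names
    INIT_DONE = auto()           # internal initialization finished
    FRAME_TIME_MATCH = auto()    # correct frame, global time equals own view
    FRAME_TIME_MISMATCH = auto() # correct frame, global time differs from own view
    NOISE = auto()               # incorrect frame or noise
    LISTEN_TIMEOUT = auto()      # Listen Timeout expired without traffic
    COLD_START_TIMEOUT = auto()  # Cold Start Timeout expired without traffic

TRANSITIONS = {
    (State.INIT, Event.INIT_DONE): State.LISTEN,
    # In Listen State any correct frame is accepted; the node adopts its global time.
    (State.LISTEN, Event.FRAME_TIME_MATCH): State.ACTIVE,
    (State.LISTEN, Event.FRAME_TIME_MISMATCH): State.ACTIVE,
    (State.LISTEN, Event.NOISE): State.LISTEN,            # Listen Timeout is restarted
    (State.LISTEN, Event.LISTEN_TIMEOUT): State.COLD_START,
    (State.COLD_START, Event.FRAME_TIME_MATCH): State.ACTIVE,
    (State.COLD_START, Event.FRAME_TIME_MISMATCH): State.LISTEN,
    (State.COLD_START, Event.NOISE): State.LISTEN,
    (State.COLD_START, Event.COLD_START_TIMEOUT): State.COLD_START,  # send another Cold Start Frame
}

def next_state(state, event):
    """Return the successor state; events without an entry leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

def run(events, state=State.INIT):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state

final = run([Event.INIT_DONE, Event.NOISE, Event.LISTEN_TIMEOUT,
             Event.COLD_START_TIMEOUT, Event.FRAME_TIME_MATCH])
print(final)
```

Keeping the transitions in a table makes it easy to compare the sketch against the textual description state by state.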
3.2 Analysis of the Startup Algorithm

In this section, we analyze the startup algorithm in the failure-free case.

3.2.1 Collision

When at least two nodes start to send on the communication medium within an interval $\delta \ge 0$, which is given by the communication medium's propagation delay, the frames collide, which the receivers perceive as noise. In the startup algorithm presented in Section 3.1, there exists just one class of scenarios for collision, which depends on the nodes' power-up sequence. We will show that in the fault-free case there can be at most one collision. We assume that the drift rates of the clocks can be neglected (the circumstances where this assumption holds will be described in Section 3.2.2). Furthermore, we assume that a node cannot send and receive concurrently. While concurrent sending and receiving is technically feasible, it imposes additional cost on the bus driver and/or cabling, depending on the physical architecture.
We define $t_i^{listen}$ as the point in time when node i starts its Listen Timeout $\tau_i^{listen}$. Thus, the point in time to send the first Cold Start Frame, $t_i^{coldstart_1}$, is given per node by:

$$t_i^{coldstart_1} = t_i^{listen} + \tau_i^{listen} \qquad (4)$$

We define the class of nodes, $N^{coldstart}$, that send their first Cold Start Frames that collide as follows:

$$N^{coldstart} = \{\, x \mid t_x^{coldstart_1} \in [\min(t_i^{coldstart_1}),\ \min(t_i^{coldstart_1}) + \delta] \,\} \qquad (5)$$

where $0 \le i < N$ and $0 \le x < N$, N equals the number of nodes of the system, and $\delta$ represents the maximum propagation delay of the communication medium. We refer to the nodes of the class $N^{coldstart}$ as Cold Start Nodes.

In scenarios in which $|N^{coldstart}| = 1$ no collision occurs, because in the interval $[\min(t_i^{coldstart_1}),\ \min(t_i^{coldstart_1}) + \delta]$ only one node sends a Cold Start Frame, and this ensures that no other node sends before it has received the sent Cold Start Frame.

In scenarios in which $|N^{coldstart}| > 1$, the sent Cold Start Frames collide. The sending nodes do not recognize that there was a collision, for they cannot listen on the communication medium while sending a frame. They continue with their transmission pattern for one TDMA round. The listening nodes receive noise due to the collision and restart their Listen Timeouts. After one TDMA round the Cold Start Nodes wait for their Cold Start Timeouts to expire while they are listening on the communication medium. The node with the shortest Startup Timeout is then allowed to send its second Cold Start Frame. All other Cold Start Nodes, whose Startup Timeouts are still running, receive the sent Cold Start Frame, interpret it as a sign that a collision occurred, and fall back into Listen State. Every node that is in Listen State at this time transits into Active State. We see that in the collision scenarios the node with the shortest Startup Timeout wins.

The following figures depict the timing of nodes during startup. The start of a transmission of a node is symbolized by a thick line; collisions are symbolized by a flash sign.
Figure 5. Collision of frames of node 1 and node 2 in a system of 4 nodes (timelines of nodes n1 and n2 over real time, showing the Listen Timeout and Cold Start Timeout of each node and the points in time $t_1$ and $t_2$)

Figure 5 presents a collision scenario in a system of 4 nodes. We assume that node 2 is started first and that node 1 is started after an interval of $\tau_2^{startup} - \tau_1^{startup}$. As one can see, these two nodes collide at point $t_1$. Node 1, which has the shorter Startup Timeout, is the first one to send its next Cold Start Frame at $t_2$ and forces node 2 back into Listen State.
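The collision scenario of Figure 5 can be replayed numerically. The sketch below is an illustration under stated assumptions: the slot durations in `SLOTS`, the power-up times in `listen_start`, and the helper names are hypothetical; clock drift is neglected and the propagation delay is set to 0, as in the analysis of Section 3.2.1.

```python
SLOTS = [3, 2, 4, 3]   # hypothetical slot durations of a 4-node system
ROUND = sum(SLOTS)      # duration of one TDMA round
DELTA = 0               # propagation delay, neglected in this sketch

def startup_timeout(i):
    return sum(SLOTS[:i])

def first_cold_start(listen_start):
    """Equation (4): t_listen plus Listen Timeout, per node."""
    return {i: t + 2 * ROUND + startup_timeout(i) for i, t in listen_start.items()}

def cold_start_nodes(times, delta):
    """Equation (5): nodes whose first Cold Start Frame falls into the collision window."""
    earliest = min(times.values())
    return sorted(i for i, t in times.items() if earliest <= t <= earliest + delta)

# Node 2 starts listening at time 0, node 1 after startup_timeout(2) - startup_timeout(1).
listen_start = {2: 0, 1: startup_timeout(2) - startup_timeout(1)}
times = first_cold_start(listen_start)
colliding = cold_start_nodes(times, DELTA)

# After the collision at t1, the node with the shortest Startup Timeout sends first.
t1 = min(times.values())
winner = min(colliding, key=startup_timeout)
t2 = t1 + ROUND + startup_timeout(winner)   # t1 plus the winner's Cold Start Timeout
print(times, colliding, winner, t1, t2)
```

In this example both first Cold Start Frames fall on the same instant, and node 1 sends its second Cold Start Frame before the Cold Start Timeout of node 2 expires, which matches the outcome described for Figure 5.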
3.2.2 Clock Drift

The subject of this section is to show that clock drifts can cause additional collisions if they exceed a specific bound. In order to determine this bound, we will derive the maximum duration of a TDMA round as a function of the maximum clock drift rate, the minimum slot duration, and the maximum propagation delay of the communication medium.

Figure 6 depicts a collision scenario caused by clock drifts. In contrast to Section 3.2.1, we expect the clocks of node 2 and node 3 to drift. We can also see a reference clock with drift rate 0. According to the reference clock, node 2 would send its Cold Start Frame at $t_2$ and node 3 would send its Cold Start Frame at $t_3$. Node 2, however, has a slower clock than the reference clock and, thus, sends its Cold Start Frame at $t_d$. Node 3 has a faster clock than the reference clock and, thus, ends its Cold Start Timeout at $t_d$ and also sends a Cold Start Frame on the communication medium. As we can see, we can construct a scenario in which clock drifts cause additional collisions.

[Figure 6: timelines of node 2 (drift $+\rho^{max}$) and node 3 (drift $-\rho^{max}$) with their Listen, Cold Start, and Startup Timeouts, compared against a reference clock with drift 0 over real time; marked points in time $t_1$, $t_2$, $t_d$, $t_3$]

We will now analyze a system of N nodes: Given a minimal slot duration of $\tau_{min}^{slot}$ in the TDMA round with duration $\tau^{round}$ and a maximal clock drift rate $\rho^{max}$ for all nodes, we are able to calculate an upper bound for $\tau^{round}$. The requirements for the worst case are:

Node (N-1) and node (N-2) collide, because they have the longest Cold Start Timeouts and, thus, the clock drifts are maximal.

Node (N-2) and node (N-1) have slots of minimum duration.

Node (N-1) sends its second Cold Start Frame $2\tau^{round} - \tau_{min}^{slot}$ after the collision, node (N-2) sends its second Cold Start Frame $2\tau^{round} - 2\tau_{min}^{slot}$ after the collision.

So the sum of the clock drifts of both nodes must be lower than the duration of the minimum slot, $\tau_{min}^{slot}$, minus a constant $\delta$, which accounts for the propagation delay of the communication medium.

$$\tau_{min}^{slot} - \delta > \rho^{max}\,(2\tau^{round} - \tau_{min}^{slot}) + \rho^{max}\,(2\tau^{round} - 2\tau_{min}^{slot}) \qquad (6)$$

Algebraic transformations lead to an upper bound for the duration of one TDMA round. Given the maximum clock drift rate $\rho^{max}$, the minimum slot duration $\tau_{min}^{slot}$, and the maximum propagation delay of the communication medium $\delta$, the maximum duration of one TDMA round $\tau^{round}$ is given by: $\tau^{round}$