Clock Synchronization in the Presence of Omission and Performance Failures, and Processor Joins

Flaviu Cristian, Houtan Aghili and Ray Strong
IBM Research, Almaden Research Center
Abstract

This paper presents a simple practical protocol for synchronizing clocks in a distributed system. Synchronization consists of maintaining logical clocks which run at roughly the speed of a correct hardware clock and are within some known bound of each other. Synchronization is achieved by periodically computing adjustments to the hardware clocks present in the system. The protocol is tolerant of any number of omission failures (e.g. processor crashes, link crashes, occasional message losses) and performance failures (e.g. overloaded processors, slow links) that do not partition the communications network, and handles any number of simultaneous processor joins.
An earlier version of this paper was presented at the 16th IEEE Int. Symp. on Fault-Tolerant Computing Systems, Vienna, July 1-4, 1986. Flaviu Cristian is now with the University of California, San Diego. Houtan Aghili is now with IBM Research, T. J. Watson Research Center.
1 Introduction

Consider a set of processors interconnected in a distributed system to perform certain distributed computations, where each processor is equipped with a hardware clock. If one wants to measure the time elapsed between the occurrences of two events of a computation local to a processor p, one can instantaneously read the values of the local hardware clock register when these events occur and compute their difference. A different method must be devised if the intention is to measure the time elapsed between two events of a distributed computation. For instance, if an event ep occurs on a processor p and another event eq occurs on a different processor q, it is practically impossible for q to instantaneously read its hardware clock when the remote event ep occurs. Indeed, the sending of a message from p to q that notifies the occurrence of ep entails a random transmission delay that makes it impossible for q to know exactly the value displayed by q's hardware clock at the instant when ep occurred in p. Hence, q (and by a similar argument p) cannot compute exactly the time elapsed between ep and eq by relying only on their own clocks. The problem of measuring the time elapsed between occurrences of distributed events would be easily solved if the processor clocks could be exactly synchronized, that is, if at any instant of a postulated Newtonian time referential, all clocks read the same value. It would then be sufficient for p to send the value of its local clock reading when ep occurs to q, and for q to subtract this from the value of its clock reading when eq occurs. Unfortunately, the uncertainty on message transmission delays inherent in distributed systems makes exact clock synchronization impossible. This paper presents a protocol for maintaining approximately synchronized clocks in a loosely coupled distributed system.
The protocol maintains on each correctly functioning processor p a logical clock Cp which measures the passage of real time with an accuracy comparable to that of a hardware clock and which, at any instant t, displays a clock time Cp(t) that is within some known bound DMAX from the clock times displayed by all other logical clocks Cq running on other correct system processors q:
∀ t : |Cp(t) - Cq(t)| < DMAX

Such logical clocks allow one to measure with an a priori known accuracy the time that elapses between events which occur on distinct processors.
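To illustrate how such a bound is used, the following is a minimal sketch (all names, DMAX, and the sample timestamps are illustrative, not taken from the paper) of bounding the real time elapsed between two events timestamped on different processors:

```python
# Hypothetical sketch: bounding the real time elapsed between event ep
# on processor p and event eq on processor q, given logical clocks
# synchronized within DMAX. DMAX and the timestamps are illustrative.

DMAX = 0.010  # assumed synchronization bound, in seconds

def elapsed_bounds(cp_at_ep, cq_at_eq, dmax=DMAX):
    """Return (lower, upper) bounds on the real time elapsed between
    ep and eq, ignoring clock drift over the measured interval."""
    measured = cq_at_eq - cp_at_ep
    # The two clocks agree within dmax, so the measurement is off by
    # at most dmax in either direction.
    return (measured - dmax, measured + dmax)

lo, hi = elapsed_bounds(100.000, 100.250)
print(lo, hi)  # measured 0.250 s; true elapsed time in (0.240, 0.260)
```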
A logical clock is maintained by periodically computing an adjustment function for a hardware clock. We refer the reader to discussions of various methods for maintaining smooth logical clocks by amortizing adjustments in [DHSS] and [CS]. One of our main objectives was to design a protocol that is practical. Such a protocol should work in the presence of events that are reasonably likely to occur (e.g. processor crashes and joins, message losses or delays in communications or processing), yet be simple enough to be understandable, correctly implementable, and maintainable.
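The idea of a logical clock as a hardware clock plus a periodically recomputed adjustment can be sketched as follows. This is a minimal, hypothetical illustration; the instantaneous adjustment shown is the crude variant, with the amortized alternatives discussed in [DHSS] and [CS]:

```python
# A minimal, hypothetical sketch of a logical clock: a hardware clock
# reading plus an adjustment recomputed by the synchronization protocol.

class LogicalClock:
    def __init__(self, hw_read):
        self.hw_read = hw_read   # function returning hardware clock time
        self.adjustment = 0.0

    def read(self):
        return self.hw_read() + self.adjustment

    def adjust_to(self, target):
        # Crude instantaneous adjustment ("bumping" the clock); [DHSS]
        # and [CS] discuss amortizing it to keep the clock smooth.
        self.adjustment = target - self.hw_read()

c = LogicalClock(lambda: 0.0)  # toy hardware clock frozen at 0
c.adjust_to(42.0)
print(c.read())  # 42.0
```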
2 Failure Classification

Let P denote the set of processors of a distributed system and L the set of physical links between the processors in P. Processors and links undergo certain state transitions in response to service requests. For example, a link (p,q) ∈ L that joins processor p ∈ P to processor q ∈ P delivers a message to q if p so requests. Similarly, a processor p computes a given function if its user so requests. System components, such as processors and communication links, are correct if, to any incoming service request, they respond in a manner consistent with their specification. Service specifications prescribe the state transition that a component should undergo in response to a service request and the real time interval within which the transition should occur. A component failure occurs when a component does not deliver the service requested in the manner specified. We distinguish among three general failure classes [CASD]. If a component never responds to a service request it suffers an omission failure. If a component delivers a requested service either too early or too late (i.e. outside the real time interval specified) it suffers a timing failure. We call late timing failures performance failures. If a component delivers a service different from the one requested, or delivers unrequested "services", it suffers an arbitrary or Byzantine failure. Typical examples of omission failures are processor crashes, hardware clocks that stop running, link breakdowns, processors that occasionally do not relay messages they should, and links that sometimes lose messages. Examples of performance failures are occasional message delays caused by overloaded relaying processors, and hardware clocks running at a speed lower than (1 + ρ)^-1, where ρ is the maximum drift rate specified by the clock manufacturer. An example of an early timing failure is a hardware clock that runs at a speed that exceeds 1 + ρ.
Examples of Byzantine failures are an undetectable message corruption on a link, due to electromagnetic noise or human sabotage, or a processor that sends two messages "the time is 10:00am" and "the time is 11:00am" when the correct time is midnight. The system components that suffer failures are eventually taken off the system by maintenance personnel for repair or replacement. After maintenance, such components join the system of active components by explicitly executing a join protocol.
3 Assumptions

A hardware clock consists of an oscillator, which generates a cyclic waveform at a uniform rate, and a counting register, which records the number of cycles elapsed since the start of the clock. We assume that the hardware clocks of the processors that compose the system are driven by quartz crystal controlled oscillators that are highly stable and accurate. The use of this type of clock is very common in modern computer architectures (e.g. the IBM 4300, 3080 and 3090 series). Typically, a correct quartz clock may drift from real time at a rate of at most ρ, where ρ is on the order of 10^-6 seconds per second. Quartz clocks are not only highly stable, but are also extremely reliable. Current experience indicates that the average quartz clock that is used in medium to high-end digital computers has a mean time between failures (MTBF) in excess of 15-25 years, and that good clocks, like those used in military applications, can have MTBFs expressed in hundreds of years [MIL]. Many of the clock failures likely to occur in practice can be detected by the error detecting circuitry incorporated in clock chips. For example, if the counting register that composes a clock is self-checking, the occurrence of a physical failure within it will generate (with high probability) a clock-error exception. If a detectable physical failure affects a hardware clock, any attempt at reading its value terminates with a clock-error exception [IBM370]. Given the very significant MTBFs observed for current quartz based clock chips, and the extensive error detecting circuitry built into such chips, in this paper we will assume that the likelihood of undetectable clock failure occurrences is negligible compared to other sources of system failures (for a precise interpretation of what "negligible" means, we refer to [Cr]). Let HC(t) denote the value displayed by a hardware clock HC at some real time t.
(As in [LM,DHSS], we write the variables and constants that range over real time in lower case, and the variables and constants ranging over clock time in upper case.) We can formulate our assumption concerning the high reliability of a hardware clock by saying that a hardware clock is within a linear envelope of real time (which runs by definition with speed 1):

(A1) After it is powered on, the hardware clock HC of a processor measures the passage of time between any two successive real time instants t1, t2 correctly:

(1 + ρ)^-1 (t2 - t1) - G < HC(t2) - HC(t1) < (1 + ρ)(t2 - t1) + G.

A clock that fails signals a clock-error exception whenever an attempt at reading its value is made. G is a constant depending on the granularity of the hardware clock. For simplicity in what follows, we will assume G = 0.
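Assumption (A1) with G = 0 can be checked numerically; the following sketch uses an illustrative ρ and made-up clock readings:

```python
# Numeric check of assumption (A1) with G = 0, using an illustrative
# drift bound rho and made-up clock readings.

RHO = 1e-6  # assumed maximum drift rate (order of 10^-6 s/s)

def satisfies_A1(t1, t2, hc1, hc2, rho=RHO, g=0.0):
    """True iff the clock advance hc2 - hc1 is consistent with (A1)."""
    real = t2 - t1
    advance = hc2 - hc1
    return (real / (1 + rho) - g) < advance < (real * (1 + rho) + g)

# Over 1000 s a correct clock may gain or lose at most about 1 ms:
print(satisfies_A1(0.0, 1000.0, 0.0, 1000.0005))  # True  (0.5 ms fast)
print(satisfies_A1(0.0, 1000.0, 0.0, 1000.002))   # False (2 ms fast)
```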
Given that by hypothesis (A1) correct clocks drift from real time by at most ρ, one can infer that the rate at which two correct clocks can drift apart from each other during t real time units is at most (1 + ρ)t - (1 + ρ)^-1 t = [ρ(2 + ρ)/(1 + ρ)] t. We denote by dr = ρ(2 + ρ)/(1 + ρ) (relative clock drift rate) the factor which, when multiplied by a real time interval length, gives the net amount by which hardware clocks could drift apart in the worst case during that time interval. The next assumption is not necessary for the correct functioning of our algorithm, but it simplifies computing performance estimates.
(A2) Let N be the maximum number of processors participating in the protocol. Then the rate of drift is sufficiently small that 3Nρ(2 + ρ) < 1.
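A small numeric illustration of the relative drift rate dr, together with a check of the smallness condition of (A2) in the form 3Nρ(2 + ρ) < 1; ρ and N below are assumed values, not ones from the paper:

```python
# Numeric illustration of the relative drift rate dr and of (A2),
# assuming rho = 10^-6 and N = 8 processors (illustrative values).

RHO = 1e-6
N = 8

def dr(rho=RHO):
    # dr = (1 + rho) - 1/(1 + rho) = rho*(2 + rho)/(1 + rho)
    return rho * (2 + rho) / (1 + rho)

def satisfies_A2(n=N, rho=RHO):
    # (A2) in the form 3*N*rho*(2 + rho) < 1
    return 3 * n * rho * (2 + rho) < 1

print(dr() * 86400)    # worst-case separation after one day (~0.17 s)
print(satisfies_A2())  # True
```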
The next assumption concerns the normal speed at which messages can be sent over correct links between two processes running on adjacent correct processors:
(A3) A message sent from a correct processor p to a correct processor q over a correct link (p,q) arrives at q and is processed at q in less than ldel (link delay) real time units.
If a message sent from p to a neighbor q needs more than ldel real time units to arrive at q, or never arrives, then at least one of the processors p, q or the link (p,q) has experienced a failure. The fourth assumption states that, during clock synchronization, any two correct active processors in P are linked by at least one chain of correct links and correct intermediate processors. That is,
(A4) No partition of the system of correct processors and links occurs during clock synchronization.
If sufficiently many redundant physical communication paths exist between any two processors in a network, the likelihood of this hypothesis being violated can be made negligible. Assumptions (A3,A4) allow us to conclude that, if a synchronization message is sent by a correct processor p to a correct processor q, then there is a chain of correct links and intermediate processors over which the message can require no more than ndel = (N - 1)·ldel (network delay) real time units between p and q, where N is the maximum number of processors in the system. (For a better, but more complex, upper bound on the network delay which would guarantee a closer synchronization of clocks, the interested reader is referred to [CASD].) Our protocol is designed to tolerate omission and performance failures that do not partition the network of correct processors. We acknowledge the possibility of other types of failures (e.g. very fast clocks, or sabotaged processors). Such failures can in principle occur, and there are several more complex protocols that have been designed to handle them [LM,LL,DHSS]. We have chosen improved simplicity and performance at the expense of a chance of loss of synchronization in the presence of these rare failure types. Thus, rather than aiming at generality and power, the goal was to favor simplicity and practicality and to aim for those applications where the likelihood of early timing or Byzantine clock failures causing major damage is negligible. Our intention is to develop a protocol that can handle the overwhelming majority of failures that are likely to occur in practice, yet is simple to understand, prove correct, implement, and maintain.
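The bound ndel = (N - 1)·ldel can be illustrated with assumed values for N and ldel (both hypothetical, not from the paper):

```python
# Illustration of the bound ndel = (N - 1) * ldel implied by (A3, A4);
# N and ldel below are assumed values, not from the paper.

def network_delay_bound(n_processors, ldel):
    """Upper bound on the real time a diffused synch message needs to
    reach any correct processor over a chain of correct links."""
    return (n_processors - 1) * ldel

ndel = network_delay_bound(8, 0.005)  # 7 hops of at most 5 ms each
print(ndel)
```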
4 Objectives

The goal of the clock synchronization protocol is to ensure the following three properties:
(C1) For any two correct joined processors p, q ∈ P, the clocks Cp, Cq indicating the current logical time should be synchronized within some a priori known bound DMAX (for maximum deviation):

∃ DMAX : ∀ t : |Cp(t) - Cq(t)| < DMAX
(C2) The clocks of joined correct processors should read times within a linear envelope of real time. That is, there should exist constants X, X' such that for any clock Cp and any real time t:

X + (1 + ρ)^-1 t < Cp(t) < X' + (1 + ρ)t,

where X and X' depend on the initial conditions of the clock synchronization algorithm execution.
(C3) A correct processor that joins a set of synchronized processors should have its clock synchronized with those of the other correct processors within some a priori known real time delay jdel (join delay). Also, in the absence of processor joins, it is required that each periodic clock synchronization terminate within some known real time delay sdel (synchronization delay).
A protocol that achieves (C1,C2) is said to achieve linear envelope clock synchronization.
5 Informal Algorithm Presentation

Our algorithm is based on information diffusion [CASD], [DHSS]. It is simpler than [DHSS] because we limit the class of failures to be tolerated to omission and performance failures. It also uses a simpler method for handling processor joins. The protocol to be presented is based on the following consequence of assumptions (A3,A4): if at real time t a correct processor p ∈ P diffuses a message containing its clock time T to all other processors, and each correct processor q ∈ P sets its clock to T upon receipt of a message from p, then the clocks of all correct processors will indicate times within ndel(1 + ρ) of each other by real time t + ndel. By message diffusion we mean the following process: processor p sends a new synch message on all outgoing links, and any processor q that receives a new synch message
on some link, relays it on all other outgoing links. We call such a synchronization message diffusion a synch wave. An informal (not quite accurate) picture of how our protocol works is provided by assuming that one clock is sufficiently faster than all others, so that its time messages diffuse in a synch wave causing each other correct clock to set its time ahead to match the time of the wave. In this section we make this assumption. In the following two sections we give a more formal and more accurate description and proof of correctness of our protocol. Although immediately after a synch wave propagation the clocks of all correct processors are within NDEL = ndel(1 + ρ), as time passes, clocks will naturally tend to drift apart. For instance, t real time units after the end of a synch wave propagation, the correct clocks might be as far apart as ndel(1 + ρ) + dr·t. If the intention is to keep the processor clocks close at all times, one has to periodically re-synchronize the clocks. If PER is the clock synchronization period length (in clock time units), then in the interval between two successive synchronization waves numbered s, s+1, the clocks might drift as far apart as D = ndel(1 + ρ) + dr(PER(1 + ρ) + ndel). It will be the role of the (s+1)th synchronization wave to bring the clocks back again within ndel(1 + ρ). In the absence of processor crashes or joins, one could use a predefined synchronizer processor to generate synch waves. If processor crashes are likely, and they certainly are, the existence of a unique synchronizer becomes a single point of failure. As observed in [DHSS], it is better to distribute the role of synchronizer among all processors. The idea is that any processor should be able to initiate a synch wave if it discovers that PER clock time units have elapsed since the last synchronization occurred. If (as we assume for this section) one clock is sufficiently fast, then its synch wave will happen before any others and make the others unnecessary.
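The worst-case deviation D between successive waves can be evaluated for illustrative parameters; ρ, ldel, N and PER below are assumptions, not values from the paper:

```python
# Worst-case deviation between two successive synch waves,
#   D = ndel*(1 + rho) + dr*(PER*(1 + rho) + ndel),
# evaluated for illustrative parameters (not values from the paper).

RHO = 1e-6
LDEL = 0.005                        # assumed per-link delay bound (s)
N = 8                               # assumed maximum processor count
ndel = (N - 1) * LDEL               # network delay bound (s)
DR = RHO * (2 + RHO) / (1 + RHO)    # relative clock drift rate dr

def max_deviation(per):
    """Worst-case clock deviation D for resynchronization period PER."""
    return ndel * (1 + RHO) + DR * (per * (1 + RHO) + ndel)

# With a 60 s period the bound stays close to the ~35 ms network term,
# since the drift contribution is only about 0.12 ms:
print(max_deviation(60.0))
```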
Synch waves also have to be generated when new processors join a cluster of already synchronized processors, in order to synchronize the clocks of the new processors with the clocks of the old processors. In such a case a joiner p will send a special "new" message to all its neighbors, forcing them to initiate synch waves. The neighbors of these neighbors either propagate these waves, if their clocks are slower than the clocks of the wave initiators, or generate new synch waves, if their clocks are faster. After at most ndel real time units, a 'winning' synch wave is generated in this way by some processor with a fastest clock. When this propagates to all the other processors, including the ones that are joining, they will all synchronize their clocks within ndel(1 + ρ). Thus, within at most 2·ndel real time units from the moment a join demand is made by a processor p ∈ P, a winning synch wave is reflected back to p. At that moment, p is joined. That is, its clock is at most ndel(1 + ρ) apart from the clocks of previously joined correct processors. In the next section we give the protocol, and in the following section we discuss how this informal discussion must be modified to provide for the case when there is no winning synch wave.
6 Detailed Algorithm Description

A detailed description of the clock synchronization protocol is given in Figure 1. This description is made in terms of two abstract data types: Logical-Clock and Timer. Instances C and TP of these data types can be declared as shown in line 2 of Figure 1. Users of an instance C of the Logical-Clock data type can perform the following operations on it. An invocation of a C.initialize operation initializes the time displayed by C to 0. The operation C.adjust(L,T:Time) adjusts the local time L currently displayed by C so that after PER time units C will show the same time as a logical clock which currently shows time T (assuming that the clocks run at roughly the same speed). Such an adjustment can be implemented either by bumping the local clock to T, or by slightly increasing the speed of the local clock so as to catch up with the remote clock [C], [CS]. The operation C.read reads the current value displayed by C. The operation C.duration(T:Time), used to measure time intervals, reads the number of time units elapsed between a previous time T and the present time. The Timer data type has a unique operation "set(T:Time)". If TP is a Timer instance, the meaning of invoking the operation TP.set(T) is "ignore all previous TP.set calls and signal a Timeout condition T clock time units from now." Thus, if after invoking TP.set(100) at time 200, a new invocation is made at time 250, there is no Timeout condition at time 300, but there might be one at time 350. If no other invocation of TP.set is made between 250 and 349, then a Timeout condition occurs at time 350. For convenience of presentation, we use two independent timers TP and TJ (although one is in principle sufficient). The former is used to measure the maximum time interval which can elapse between periodic resynchronizations. The latter is used to time the join process.
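The Timer semantics just described can be sketched as follows; the class below is a hypothetical simulation driven by explicit clock-time arguments, not an implementation from the paper:

```python
# Hypothetical simulation of the Timer data type's set/Timeout
# semantics, driven by explicit clock-time arguments for the demo.

class Timer:
    def __init__(self):
        self.deadline = None  # no Timeout pending initially

    def set(self, now, t):
        # Ignore all previous set calls; fire t clock units from now.
        self.deadline = now + t

    def timeout_at(self, now):
        """True iff the Timeout condition has occurred by time `now`."""
        return self.deadline is not None and now >= self.deadline

tp = Timer()
tp.set(200, 100)           # alone, this would fire at 300 ...
tp.set(250, 100)           # ... but it is cancelled; Timeout at 350
print(tp.timeout_at(300))  # False
print(tp.timeout_at(350))  # True
```

This mirrors the example in the text: the call at time 250 cancels the pending Timeout at 300 and reschedules it for 350.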
The protocol uses the following communication primitives: receive(m,l), which receives a message m on some link and returns the identity l of that link; forward(m,l), which sends a message m on all outgoing links except l; and send-all(m), which sends m on all
outgoing links. We do not assume that the forward and send operations are atomic with respect to failure occurrences, i.e. a processor can fail after sending a given message on certain links and before sending it on the remaining links.

 1 task Time-Manager ::=
 2   var L,T: Time; C: Logical-Clock; TP,TJ: Timer;
 3       s,s': Natural-Number; joined: Boolean; l: link;
 4   s ← 0; C.initialize; joined ← false;
 5   send-all("new"); TJ.set(NDEL + TDEL);
 6   cycle
 7     select
 8       receive("new",l) → s ← s + 1; send-all(s, C.read);
 9     □
10       receive(s',T,l) → L ← C.read;
11         if
12           (s' = s) & (T > L) →
13             C.adjust(L,T);
14             forward((s,T),l);
15         □
16           (s' > s) & (T ≤ L) → s ← s'; send-all(s, L);
17         □
18           (s' > s) & (T > L) → s ← s'; C.adjust(L,T); forward((s,T),l);
19         fi;
20     □
21       Timeout TJ → joined ← true;
22     □
23       Timeout TP → s ← s + 1; L ← C.read; C.adjust(L,L); send-all(s, L);
24     endselect;
25     TP.set(PER);
26   endcycle;

Figure 1.

At processor start, the local synch wave sequence number s and the current local clock time are initialized to 0 (line 4). Then, a join phase that lasts for NDEL + TDEL time units begins with the sending of a special "new" message on all outgoing links (line 5). As before, NDEL = (1 + ρ)ndel. The constant TDEL is slightly larger than NDEL. A real time duration tdel is defined in the next section, and TDEL =
(1 + ρ)tdel. Under assumption (A2) we may choose tdel = 2·ndel. This gives the particularly simple join delay of jdel = 3·ndel. During the join phase, the "joined" Boolean variable is false, and nothing can be said about how close the local clock is to other clocks of the cluster being joined. At the end of the join phase (line 21), "joined" becomes true and measurements of delays elapsed between distributed event occurrences can begin. The Time-Manager can be awakened by three kinds of events: a "new" message that arrives from a neighbor that joins (line 8), a message belonging to a synch wave numbered s' that announces that the time is T (line 10), and a Timeout condition generated by the timers TJ or TP (lines 21, 23). The reception of a "new" message results in an attempt to generate a 'winning' wave with a new sequence number s' = s + 1 and local time L = C.read. The Boolean tests executed by a processor when a message (s',T) belonging to such a wave is received ensure that either the processor forwards the wave (s',T) to all its neighbors (if T > L holds, see lines 14, 18), or that it will attempt itself to initiate a wave (if T ≤ L holds).
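This guard logic can be rendered as executable code; the sketch below is one interpretation of the per-message tests of Figure 1, not the paper's own code, and forward/send_all are hypothetical stand-ins for the communication primitives:

```python
# Hypothetical Python rendering of the guard logic a processor applies
# on receiving a synch message (s', T); an interpretation of Figure 1,
# not the paper's own code. forward/send_all are stand-in callbacks.

def on_synch_message(state, s_prime, T, forward, send_all):
    """state holds 's', the local wave number, and 'L', the local
    clock reading; clock adjustment is modeled by overwriting 'L'."""
    L = state["L"]
    if s_prime == state["s"] and T > L:
        state["L"] = T                      # C.adjust(L,T)
        forward((state["s"], T))            # relay the faster wave
    elif s_prime > state["s"] and T <= L:
        state["s"] = s_prime                # local clock is ahead:
        send_all((state["s"], L))           # initiate a competing wave
    elif s_prime > state["s"] and T > L:
        state["s"] = s_prime                # adopt and relay the wave
        state["L"] = T
        forward((state["s"], T))
    # otherwise (s' < s, or s' = s with T <= L): stale wave, ignore

sent = []
st = {"s": 3, "L": 100.0}
on_synch_message(st, 4, 105.0, sent.append, sent.append)
print(st["s"], st["L"], sent)  # 4 105.0 [(4, 105.0)]
```

In the first example a new wave with a faster time is adopted and relayed; a processor whose clock is ahead would instead bump its wave number and broadcast its own, larger, time.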