INSTITUT DE RECHERCHE EN INFORMATIQUE ET SYSTÈMES ALÉATOIRES

PUBLICATION INTERNE No 1107

ROLLBACK-DEPENDENCY TRACKABILITY: AN OPTIMAL CHARACTERIZATION AND ITS PROTOCOL

ISSN 1166-8687

ROBERTO BALDONI, JEAN-MICHEL HÉLARY, MICHEL RAYNAL

IRISA CAMPUS UNIVERSITAIRE DE BEAULIEU - 35042 RENNES CEDEX - FRANCE

http://www.irisa.fr

Roberto Baldoni*, Jean-Michel Hélary**, Michel Raynal***

Theme 1: Networks and systems. Adp project. Internal publication no. 1107, May 1997, 29 pages

Abstract: Considering a checkpoint and communication pattern, the Rollback-Dependency Trackability (RDT) property stipulates that there is no hidden dependency between local checkpoints. In other words, if there is a dependency between two checkpoints due to a non-causal sequence of messages, then there must exist a causal sequence of messages that "doubles" the non-causal one and that establishes the same dependency. This paper provides a minimal characterization of the RDT property. This characterization defines the smallest set of non-causal sequences of messages that have to be doubled in order to ensure the RDT property. Then, we consider the family of communication-induced checkpointing protocols that ensure the RDT property on-the-fly. Assuming processes take local checkpoints independently (called basic checkpoints), protocols of this family direct them to take additional local checkpoints on-the-fly (called forced checkpoints) so that the resulting checkpoint and communication pattern satisfies the RDT property. A new protocol belonging to that family is presented and it is shown that this protocol is optimal in terms of the number of forced checkpoints and in terms of the size of the data structures it requires. The protocol attains this goal by a subtle tracking of causal dependencies on already taken checkpoints; this tracking is then used to prevent the occurrence of hidden dependencies. Finally, a set of non-optimal protocols is derived from the optimal one. These derivations show a tradeoff between the size of the control information required by a protocol and the number of forced checkpoints it takes. It is interesting to note that this set includes many of the communication-induced checkpointing protocols proposed in the literature.

Key-words: Distributed applications; Checkpointing and Communication Patterns; Rollback-Dependency Trackability; Communication-Induced Protocols.

* DIS, Università di Roma "La Sapienza", Via Salaria 113, Roma, Italy, [email protected]
** [email protected]
*** [email protected]

CENTRE NATIONAL DE LA RECHERCHE SCIENTIFIQUE

Centre National de la Recherche Scientifique (URA 227) Université de Rennes 1 – Insa de Rennes

Institut National de Recherche en Informatique et en Automatique – unité de recherche de Rennes

The RDT Property: An Optimal Characterization and Protocol

Summary: In a distributed execution with checkpoints, the RDT property stipulates that there is no hidden dependency between local checkpoints. In other words, every dependency between two local checkpoints that is due to a non-causal sequence of messages must be "doubled" by a causal dependency. This paper proposes a minimal characterization of the RDT property. This characterization defines the smallest subset of non-causal sequences whose doubling implies the doubling of all non-causal sequences. We consider a family of protocols that ensure the RDT property on-line and without modifying the computation. These protocols use only information carried by the messages of the computation. In this context, the processes of the computation take local checkpoints independently and unpredictably (basic checkpoints). The role of the protocol is then to evaluate a local condition upon the arrival of a message and, if this condition is true, to force the process to take an additional local checkpoint (forced checkpoint) so as to maintain the RDT property. A new protocol of this family is presented, and it is shown to be optimal both in the number of forced checkpoints and in the size of the messages. This goal is reached through a fine tracking of the established causal dependencies between local checkpoints. This information is used to prevent the appearance of hidden dependencies. Finally, it is shown that several already known protocols of this family can be derived from the optimal protocol. Each of them achieves a tradeoff between the number of forced checkpoints it induces and the size of the control information carried by the messages of the computation.

Keywords: Distributed applications; checkpoints; tracking of hidden dependencies; on-line protocols.

1 Introduction

A distributed computation consists of a finite set of processes connected by a communication network, that communicate and synchronize only by exchanging messages through the network. A local checkpoint is a recorded local state of a process. When a process has to record such a local state, we say that this process takes a (local) checkpoint. A checkpoint and communication pattern, defined from the set of messages and local checkpoints, is thus associated with each distributed computation. A global checkpoint is a set of local states, one from each process, and a consistent global checkpoint is a global checkpoint such that no message sent by a process after its local checkpoint is received by another process before its local checkpoint. The determination of consistent global checkpoints is an important task that has many applications such as: detection of global properties of distributed computations [1, 3, 6], determination of distributed breakpoints [5, 11] and rollback-recovery [4], to name a few.

In general, the fact that two local checkpoints are not causally related is a necessary but not sufficient condition for them to belong to the same consistent global checkpoint. They can have "hidden" dependencies that make it impossible for them to belong to the same consistent global checkpoint. These dependencies are characterized by the fact that they cannot be tracked with transitive dependency vectors. Netzer and Xu [12] have shown that dependencies between local checkpoints are due to particular sequences of messages occurring in a checkpoint and communication pattern, called Z-paths. Two categories of Z-paths have been identified: causal and non-causal.

A causal Z-path is one in which the delivery event of each message of the sequence causally precedes the send event of the message that immediately follows it in the sequence; such Z-paths create causal dependencies between pairs of local checkpoints, i.e., dependencies that are on-line trackable (e.g., by using a transitive dependency vector). On the contrary, non-causal Z-paths create "hidden" dependencies, i.e., dependencies that are not on-line trackable. However, if a non-causal Z-path is doubled by a causal one, i.e., if the pair of checkpoints related by this non-causal Z-path is also related by a causal Z-path, then this dependency is no longer hidden, and can be tracked on-line (in that case we say that the non-causal Z-path is causally doubled).

To formalize this situation, Wang [15] has defined the Rollback-Dependency Trackability (RDT) property. A checkpoint and communication pattern satisfies this property if all dependencies between local checkpoints are on-line trackable, in other words if all Z-paths occurring in this pattern are causally doubled. RDT has two noteworthy properties: (1) it ensures that any set of local checkpoints that are not pairwise causally related can be extended to form a consistent global checkpoint; (2) it allows efficient calculation of the minimum and the maximum consistent global checkpoints that contain a given set of local checkpoints. As a consequence, the RDT property has applications in a large family of dependability problems such as: distributed software diagnosis, consistent deadlock recovery, output commit, etc. [15].

In this paper we are interested in determining properties on Z-paths such that, if all the Z-paths satisfying such a property are causally doubled, then the RDT property holds (of course, the converse is always true, by definition of RDT). Such properties will be called RDT-compliant.
Each RDT-compliant property establishes a sufficient characterization of the RDT property, in the sense that it is sufficient to check that all Z-paths satisfying this property are causally doubled in order to check the RDT property. However, to our knowledge, a necessary and sufficient characterization of the RDT property has not yet been stated. By this is meant the problem of finding an RDT-compliant property on Z-paths which is implied by no other RDT-compliant property. The first contribution of this paper is a solution to this problem. This is done by introducing successive properties on Z-paths, more and more constrained, namely: Z-paths of order two, Causal-Message-Z-paths (CM-paths), Simple-Causal-Message-Z-paths (SCM-paths), Elementary-Simple-Causal-Message-Z-paths (ESCM-paths) and finally Prime-Elementary-Simple-Causal-Message-Z-paths (PESCM-paths). The following results are then proved: (1) PESCM is RDT-compliant (a characterization of RDT); (2) for every RDT-compliant property $X$ on Z-paths, we have $\mathrm{PESCM} \Rightarrow X$ (so, this characterization is optimal).

The previous result is important not only from a theoretical point of view, but also from a practical one, when considering the task of maintaining the RDT property on the fly and without adding control messages while letting processes take local checkpoints independently. Protocols that achieve this goal track Z-paths satisfying an RDT-compliant property, and force processes to take additional local checkpoints when those Z-paths are perceived as not causally doubled (such local checkpoints are said to "break" the corresponding Z-path). Such protocols have already been proposed in the literature (e.g., [13, 14, 15]). Obviously, the stronger the RDT-compliant property considered, the more efficient the protocol, in the sense that there will be fewer Z-paths to consider for potential breaking.

The second contribution of the paper is the design of a new protocol, based on the previous minimal PESCM-path subset. A family $\mathcal{F}_{RDT}$ of checkpointing protocols ensuring the RDT property is introduced. This family follows the communication-induced (footnote 1) checkpointing approach: processes take local checkpoints independently (called basic checkpoints) and the protocol requires them to take additional local checkpoints (called forced checkpoints) in order to maintain RDT; this is achieved only by piggybacking control information on application messages. Also, $\mathcal{F}_{RDT}$ is based on the following basic hypotheses. (i) On the fly: no knowledge of the "future" of the computation is available, i.e., the usable knowledge of the computation at a certain event cannot be more than the one included in the causal past of that event. (ii) The computational model is fully asynchronous (no private information, such as clock speed, about other processes is available and message transfer delays are arbitrary).

The new protocol proposed in the paper belongs to that family. It is shown that it is optimal within the family in terms of the number of forced checkpoints and that it achieves this optimality by using a minimal size of control information piggybacked on messages of the computation. It attains this goal by a subtle tracking of causal dependencies on already taken checkpoints; this tracking is then used to prevent the occurrence of hidden dependencies due to PESCM-paths which are not causally doubled. Finally, it is shown that the proposed protocol contains some previously known non-optimal protocols of the family. These protocols show there is a tradeoff between the size of the control information piggybacked on application messages and the number of forced checkpoints that are taken. This set includes many communication-induced checkpointing protocols proposed in the literature such as Checkpoint-Before-Receive, No-Receive-After-Send [13], FDAS [15] and FDI [14].

The paper is structured in six main sections. Sections 2 and 3 introduce checkpoints and RDT, respectively. Then Section 4 presents the optimal characterization of the RDT property. Section 5 presents the $\mathcal{F}_{RDT}$ family of communication-induced checkpointing protocols. The new protocol and its comparison with other protocols of the family are presented in Section 6. Finally Section 7 concludes the paper.

2 Consistent Global Checkpoints

2.1 Distributed Computations

A distributed computation consists of a finite set $P$ of $n$ processes $\{P_1, P_2, \ldots, P_n\}$ connected by a communication network, that communicate and synchronize only by exchanging messages through the network. We assume that each ordered pair of processes is connected by an asynchronous directed logical channel whose transmission delays are unpredictable. Each process runs on a processor. Processors share neither a common memory nor a common clock value; there is no bound on their relative speeds.

A process can execute internal, send and delivery statements. An internal statement does not involve communication. When $P_i$ executes the statement "send(m) to $P_j$" it puts the message $m$ into the channel from $P_i$ to $P_j$. When $P_i$ executes the statement "deliver(m)", it is blocked until at least one message directed to $P_i$ has arrived; then a message is withdrawn from one of its input channels and delivered to $P_i$. Executions of internal, send and delivery statements are modeled by internal, sending and delivery events.

Footnote 1: We use the terminology introduced in [4].

Processes of a distributed computation are sequential; in other words, each process $P_i$ produces a sequence of events $e_{i,1} \ldots e_{i,s} \ldots$ This sequence can be finite or infinite. Every process $P_i$ has an initial local state denoted $\sigma_{i,0}$. The local state $\sigma_{i,s}$ ($s > 0$) results from the execution of the sequence $e_{i,1} \ldots e_{i,s}$ applied to the initial state $\sigma_{i,0}$. More precisely, the event $e_{i,s}$ moves $P_i$ from the local state $\sigma_{i,s-1}$ to the local state $\sigma_{i,s}$. By definition we say that "$e_{i,x}$ belongs to $\sigma_{j,s}$" if $i = j$ and $x \le s$.

Let $H$ be the set of all the events produced by a distributed computation. This computation is modeled by the partially ordered set $\widehat{H} = (H, \xrightarrow{hb})$, where $\xrightarrow{hb}$ denotes Lamport's well-known happened-before relation defined in [7]:

Definition 2.1 Relation $\xrightarrow{hb}$.

\[
e_{i,s} \xrightarrow{hb} e_{j,t} \iff
\begin{cases}
j = i \text{ and } t = s + 1, \\
\text{or } e_{i,s} = send(m) \text{ and } e_{j,t} = deliver(m), \\
\text{or } \exists e : e_{i,s} \xrightarrow{hb} e \wedge e \xrightarrow{hb} e_{j,t}.
\end{cases}
\]

Two events $e_{i,s}$ and $e_{j,t}$ are concurrent, denoted $e_{i,s} \parallel e_{j,t}$, if, and only if, $\neg(e_{i,s} \xrightarrow{hb} e_{j,t})$ and $\neg(e_{j,t} \xrightarrow{hb} e_{i,s})$.
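As an illustration of Definition 2.1, the following Python sketch computes the happened-before relation as the transitive closure of the local-succession and send/deliver edges. The event encoding (pairs `(i, s)` for $e_{i,s}$), the function names and the input format are assumptions made for this example; they do not come from the paper.

```python
from typing import Dict, List, Set, Tuple

Event = Tuple[int, int]  # (process index i, local event index s), s starting at 1


def happened_before(num_events: List[int],
                    messages: List[Tuple[Event, Event]]) -> Dict[Event, Set[Event]]:
    """Return hb[e] = set of events f such that e happened before f.

    num_events[i] is the number of events produced by process i;
    messages is a list of (send event, deliver event) pairs.
    """
    events = [(i, s) for i, cnt in enumerate(num_events) for s in range(1, cnt + 1)]
    succ: Dict[Event, Set[Event]] = {e: set() for e in events}
    for (i, s) in events:
        if (i, s + 1) in succ:          # first clause: j = i and t = s + 1
            succ[(i, s)].add((i, s + 1))
    for snd, dlv in messages:           # second clause: send(m) before deliver(m)
        succ[snd].add(dlv)
    hb: Dict[Event, Set[Event]] = {e: set() for e in events}
    for e in events:                    # third clause: transitive closure
        stack = list(succ[e])
        while stack:
            f = stack.pop()
            if f not in hb[e]:
                hb[e].add(f)
                stack.extend(succ[f])
    return hb


def concurrent(hb: Dict[Event, Set[Event]], e: Event, f: Event) -> bool:
    """Two events are concurrent if neither happened before the other."""
    return f not in hb[e] and e not in hb[f]
```

For instance, with two processes producing two events each and one message sent at event `(0, 1)` and delivered at event `(1, 2)`, the events `(0, 2)` and `(1, 1)` are concurrent, while `(0, 1)` happened before `(1, 2)`.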

2.2 Local Checkpoints

A local checkpoint C is a recorded state of a process. A local state is not necessarily recorded as a local checkpoint, so the set of local checkpoints is only a subset of the set of local states.

Definition 2.2 A communication and checkpoint pattern is a pair $(\widehat{H}, C_{\widehat{H}})$ where $\widehat{H}$ is a distributed computation and $C_{\widehat{H}}$ is a set of local checkpoints defined on $\widehat{H}$.

$C_{i,x}$ represents the $x$-th local checkpoint of process $P_i$. The local checkpoint $C_{i,x}$ corresponds to some local state $\sigma_{i,s}$ with $x \le s$. Figure 1 shows an example of a checkpoint and communication pattern. We assume that each process $P_i$ takes an initial local checkpoint $C_{i,0}$ (corresponding to $\sigma_{i,0}$), and that after each event a checkpoint will eventually be taken. Thus, each process always begins, and ends, with a checkpoint.

[Figure 1: A Checkpoint and Communication Pattern. Three processes $P_i$, $P_j$, $P_k$ with local checkpoints $C_{i,0}$ to $C_{i,3}$, $C_{j,0}$ to $C_{j,3}$ and $C_{k,0}$ to $C_{k,3}$, checkpoint intervals including $I_{j,1}$, $I_{k,1}$, $I_{k,2}$, $I_{k,3}$, and messages $m_1$ to $m_7$.]

A message $m$ sent by process $P_i$ to process $P_j$ is called orphan with respect to the ordered pair of local checkpoints $(C_{i,x}, C_{j,y})$ if the delivery of $m$ belongs to $C_{j,y}$ while its sending event does not belong to $C_{i,x}$. An ordered pair of local checkpoints is consistent if and only if there are no orphan messages with respect to this pair. For example, Figure 1 shows that the pair $(C_{k,1}, C_{j,1})$ is consistent, while the pair $(C_{i,2}, C_{j,2})$ is inconsistent (because of the orphan message $m_5$).

2.3 Global Checkpoints

A global checkpoint is a set of local checkpoints, one from each process. For example, $\{C_{i,1}, C_{j,1}, C_{k,1}\}$ and $\{C_{i,2}, C_{j,2}, C_{k,1}\}$ are two global checkpoints depicted in Figure 1.

Definition 2.3 A global checkpoint is consistent if all its pairs of local checkpoints are consistent. For example, Figure 1 shows that $\{C_{i,1}, C_{j,1}, C_{k,1}\}$ is a consistent global checkpoint, while $\{C_{i,2}, C_{j,2}, C_{k,1}\}$ is not consistent (due to the inconsistent pair $(C_{i,2}, C_{j,2})$).
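The orphan-message test and Definition 2.3 can be sketched in Python as follows. The encoding is an assumption made for this example: a local checkpoint of process $i$ is identified by the number `cut[i]` of events it includes (the index $s$ of its local state $\sigma_{i,s}$), and a message is a tuple giving the sender, the local index of its send event, the receiver, and the local index of its delivery event.

```python
from typing import List, Tuple

# (sender i, index of send event on i, receiver j, index of delivery event on j)
Message = Tuple[int, int, int, int]


def is_orphan(msg: Message, cut: List[int]) -> bool:
    """True if the delivery of msg belongs to the receiver's checkpoint
    while its sending event does not belong to the sender's checkpoint."""
    i, s_send, j, s_deliver = msg
    return s_deliver <= cut[j] and s_send > cut[i]


def is_consistent(messages: List[Message], cut: List[int]) -> bool:
    """A global checkpoint is consistent if no message is orphan
    with respect to any ordered pair of its local checkpoints."""
    return not any(is_orphan(m, cut) for m in messages)
```

For example, if process 0 sends a message at its second event and process 1 delivers it at its first event, the global checkpoint `[1, 1]` is inconsistent, whereas `[2, 1]` and `[1, 0]` are consistent.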

3 Rollback-Dependency Trackability

This section first introduces the concepts of Z-path [12] and causal doubling of a Z-path and then the concept of Rollback-Dependency Trackability [15]. These concepts are related to a given checkpoint and communication pattern $(\widehat{H}, C_{\widehat{H}})$. In other words, assertions such as "$\zeta$ is a Z-path", or "the Z-path $\zeta$ is causally doubled", may hold in some checkpoint and communication pattern but not in another one. Similarly, RDT is a property attached to a checkpoint and communication pattern: a given $(\widehat{H}, C_{\widehat{H}})$ satisfies or does not satisfy the RDT property. In this section, we assume that there is an implicit given checkpoint and communication pattern $(\widehat{H}, C_{\widehat{H}})$.

3.1 Z-Paths

The sequence of events occurring at $P_i$ between $C_{i,x-1}$ and $C_{i,x}$ ($x > 0$) is called a checkpoint interval and is denoted by $I_{i,x}$; $x$ is called the index of this checkpoint interval.

Definition 3.1 A Z-path is a sequence of messages $[m_1, m_2, \ldots, m_q]$ ($q \ge 1$) such that, for each $\alpha$, $1 \le \alpha \le q - 1$, we have: $deliver(m_\alpha) \in I_{k,s} \wedge send(m_{\alpha+1}) \in I_{k,t} \wedge s \le t$.

To our knowledge this notion was introduced for the first time by Netzer and Xu in [12] under the name zig-zag path. If a Z-path $[m_1, \ldots, m_q]$ is such that $send(m_1) \in I_{i,x}$ and $deliver(m_q) \in I_{j,y}$ we say that this Z-path is from $I_{i,x}$ to $I_{j,y}$. In the checkpoint and communication pattern shown in Figure 1, $[m_3, m_2]$ is a Z-path of length 2, from $I_{k,1}$ to $I_{i,2}$; $[m_5, m_4]$ and $[m_5, m_6]$ are two Z-paths of length 2, from $I_{i,3}$ to $I_{k,2}$.

Definition 3.2 A Z-path is causal if the delivery event of each message (but the last one) occurs before the send event of the next message in the sequence. A Z-path is non-causal if it is not causal.

A Z-path with only one message is causal. A causal Z-path will also be called a causal path.
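Definitions 3.1 and 3.2 can be checked mechanically on a recorded pattern. The sketch below is only an illustration: the `Ev` and `Msg` records, in which each event carries its process, the index of its checkpoint interval and its position in the local event sequence, are hypothetical names introduced here and not part of the paper.

```python
from typing import List, NamedTuple


class Ev(NamedTuple):
    proc: int       # process on which the event occurs
    interval: int   # index x of the checkpoint interval I_{proc,x}
    pos: int        # position of the event in the local event sequence


class Msg(NamedTuple):
    send: Ev
    deliver: Ev


def is_z_path(path: List[Msg]) -> bool:
    """Definition 3.1: each message is delivered on the process that sends
    the next one, in the same or an earlier checkpoint interval."""
    return len(path) >= 1 and all(
        a.deliver.proc == b.send.proc and a.deliver.interval <= b.send.interval
        for a, b in zip(path, path[1:]))


def is_causal(path: List[Msg]) -> bool:
    """Definition 3.2: each delivery occurs before the send of the next message."""
    return is_z_path(path) and all(
        a.deliver.pos < b.send.pos for a, b in zip(path, path[1:]))
```

With this encoding, a message delivered on a process after that process has already sent the next message of the sequence, within the same interval, yields a Z-path that is not causal.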

Notation. The following notation will be used in the rest of the paper. In a Z-path $\zeta$, the first (resp. the last) message will be denoted $\zeta.first$ (resp. $\zeta.last$). The length of a Z-path $\zeta$ is the number of messages forming $\zeta$ and will be denoted $|\zeta|$. Let $\zeta$ and $\zeta'$ be two Z-paths whose concatenation is also a Z-path, e.g., $\zeta = [m_1]$ and $\zeta' = [m_2, m_3]$. The concatenation of $\zeta$ and $\zeta'$ will be equivalently denoted $\zeta \cdot \zeta'$, or $\zeta \cdot [m_2, m_3]$, or $[m_1] \cdot \zeta'$, or else $[m_1, m_2, m_3]$.

3.2 Causal Doublings

Definition 3.3 A Z-path from $I_{i,x}$ to $I_{j,y}$ is causally doubled if $i = j \wedge x \le y$ or if there exists a causal Z-path $\mu$ from $I_{i,x'}$ to $I_{j,y'}$ where $x \le x'$ and $y' \le y$.

In the checkpoint and communication pattern shown in Figure 1, the Z-path $[m_5, m_4]$ is causally doubled by $[m_5, m_6]$. Note that, by construction, every causal Z-path is causally doubled (by itself!). Thus, the only interesting notion is that of non-causal Z-path doubling.
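Definition 3.3 only involves the end intervals of the paths, so it can be sketched with a small predicate. In this illustrative Python fragment, an interval $I_{i,x}$ is encoded as the pair `(i, x)` and the known causal Z-paths are given by their (start interval, end interval) pairs; both encodings and the function name are assumptions of the example.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # (process index i, interval index x) for I_{i,x}


def is_causally_doubled(z_from: Interval, z_to: Interval,
                        causal_paths: Iterable[Tuple[Interval, Interval]]) -> bool:
    """Check whether a Z-path from z_from to z_to is causally doubled,
    given the end intervals of the causal Z-paths of the pattern."""
    (i, x), (j, y) = z_from, z_to
    if i == j and x <= y:
        return True
    return any(ci == i and cj == j and x <= x2 and y2 <= y
               for (ci, x2), (cj, y2) in causal_paths)
```

Numbering the processes $P_i$, $P_j$, $P_k$ of Figure 1 as 0, 1, 2, the Z-path $[m_5, m_4]$ goes from `(0, 3)` to `(2, 2)` and is doubled by the causal path $[m_5, m_6]$, which has the same end intervals.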

3.3 Rollback-Dependency Trackability

The following concept, Rollback-Dependency Trackability, has been introduced by Wang in [15]. It can be defined as follows (footnote 2):

Definition 3.4 A checkpoint and communication pattern $(\widehat{H}, C_{\widehat{H}})$ satisfies the Rollback-Dependency Trackability (RDT) property if and only if all Z-paths are causally doubled.

From the previous definitions, it follows that a checkpoint and communication pattern satisfies the RDT property if and only if all non-causal Z-paths are causally doubled.

4 An Optimal Characterization of RDT

This section shows that, given a checkpoint and communication pattern $(\widehat{H}, C_{\widehat{H}})$, it is not necessary to check that every non-causal Z-path is causally doubled to ensure that $(\widehat{H}, C_{\widehat{H}})$ satisfies the RDT property. In other words, a strictly weaker RDT characterization is obtained, and we show that this characterization is minimal, in the sense that weaker conditions are not sufficient to ensure RDT. To this end we introduce the concepts of elementary causal path, simple causal path and prime causal path, and we use these concepts to introduce successive embedded subsets of Z-paths, namely Z-paths of order two, Causal-Message-Z-paths (CM-paths), Simple-Causal-Message-Z-paths (SCM-paths), Elementary-Simple-Causal-Message-Z-paths (ESCM-paths) and finally Prime-Elementary-Simple-Causal-Message-Z-paths (PESCM-paths). In this section, all definitions and results are related to an implicit checkpoint and communication pattern $(\widehat{H}, C_{\widehat{H}})$.

4.1 Preliminary Definitions

4.1.1 Elementary causal paths

This section shows that every causal path is doubled by an elementary causal path.

Definition 4.1 With each Z-path $\zeta$ let us associate its traversal sequence $P_j, P_{k_1}, \ldots, P_{k_\alpha}, P_i$, which is the sequence of processes traversed by $\zeta$. We say that $\zeta$ is elementary if its traversal sequence has no repetition.

Note that the length of an elementary Z-path is $\le n - 1$. We have the following lemma:

Lemma 4.1 Every causal Z-path is causally doubled by an elementary causal Z-path.

Proof Let $\mu$ be a causal Z-path from $I_{j,y}$ to $I_{i,x}$. If $\mu$ is elementary, then the result holds.

[Figure 2: Causal Z-path doubled by Elementary causal Z-path. Processes $P_i$, $P_k$, $P_j$, with the sub-paths $\mu_1$ and $\mu_2$ and the intervals of indices $z$, $z'$, $z''$, $z'''$ on $P_k$.]

Footnote 2: Though expressed differently, this definition is equivalent to Wang's one.

If $\mu$ is not elementary, let $P_k$ be a process occurring at least twice in its traversal sequence (Figure 2). Thus, $\mu$ can be expressed as $\mu_1 \cdot \nu \cdot \mu_2$, where $\mu_1$, $\nu$, $\mu_2$ are causal Z-paths respectively from $I_{j,y}$ to $I_{k,z}$, from $I_{k,z'}$ to $I_{k,z''}$ and from $I_{k,z'''}$ to $I_{i,x}$, with $z \le z' \le z'' \le z'''$, and $|\mu_1 \cdot \mu_2|$ [...] $D_i[k]$.

Thus, to the knowledge of $P_i$ at the receipt of $m$, the set of Prime CM-paths from process $P_k$ to process $P_j$ with $m = \mu.last$ is given by the set of pairs $(k, j)$ such that:

$sent\_to_i[j] \wedge (m.D[k] > D_i[k])$

6.1.2 Tracking simple paths

To detect the set of processes $P_k$ such that there exists a simple causal path $\mu$ (with $m = \mu.last$) from $P_k$ to $P_i$, each process $P_i$ maintains a vector $simple_i[1..n]$ of booleans, with the following meaning: for all $k$ ($1 \le k \le n$), $simple_i[k]$ is true if, to the knowledge of $P_i$, all causal paths from $C_{k,D_i[k]}$ to $C_{i,D_i[i]}$ are simple. The consistency of $simple_i$ is maintained by $P_i$ as follows:

- $simple_i[i]$ is permanently true.
- When $P_i$ takes a local checkpoint (including the initial one), it resets all entries $simple_i[k]$ (with $k \ne i$) to false.
- When $P_i$ sends a message $m$, the array $simple_i$ is piggybacked on this message.
- When a message $m$ (sent by $P_\ell$) is delivered to $P_i$: observe that, for each $k$, $m.simple[k]$ is the value of $simple_\ell[k]$ when $m$ was sent.
  (i) for every $k$ such that $m.D[k] > D_i[k]$: $simple_i[k] := m.simple[k]$;
  (ii) for every $k$ such that $m.D[k] = D_i[k]$: $simple_i[k] := simple_i[k] \wedge m.simple[k]$.

Thus, to the knowledge of $P_i$ at the receipt of $m$, the set of Prime and Simple CM-paths from process $P_k$ to process $P_j$ with $m = \mu.last$ is given by the set of pairs $(k, j)$ such that:

$sent\_to_i[j] \wedge (m.D[k] > D_i[k]) \wedge m.simple[k]$
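The delivery rules (i) and (ii) and the resulting set of pairs can be illustrated by the following Python sketch. Function and parameter names are hypothetical, indices are 0-based, and the vectors are those held by $P_i$ before the delivery of $m$. One interpretation is made explicit here: entry $i$ is left untouched so that $simple_i[i]$ stays true, as the first rule requires.

```python
from typing import List, Set, Tuple


def update_simple(i: int, D_i: List[int], simple_i: List[bool],
                  m_D: List[int], m_simple: List[bool]) -> List[bool]:
    """Return the new value of simple_i after the delivery of m."""
    new_simple = list(simple_i)
    for k in range(len(D_i)):
        if k == i:
            continue  # assumption: simple_i[i] is kept permanently true
        if m_D[k] > D_i[k]:          # rule (i): newer interval of P_k
            new_simple[k] = m_simple[k]
        elif m_D[k] == D_i[k]:       # rule (ii): same interval of P_k
            new_simple[k] = simple_i[k] and m_simple[k]
    return new_simple


def prime_simple_pairs(D_i: List[int], sent_to_i: List[bool],
                       m_D: List[int], m_simple: List[bool]) -> Set[Tuple[int, int]]:
    """Pairs (k, j) with sent_to_i[j] and m.D[k] > D_i[k] and m.simple[k]."""
    n = len(D_i)
    return {(k, j) for k in range(n) for j in range(n)
            if sent_to_i[j] and m_D[k] > D_i[k] and m_simple[k]}
```

The set must be computed before $D_i$ is updated with the values carried by $m$, since it compares the piggybacked vector with the knowledge $P_i$ had before the receipt.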


6.1.3 Tracking elementary causal paths

Since there is an elementary prime and simple causal path $\mu$ from $P_k$ to $P_i$, with $\mu.last = m$, if and only if there is a prime and simple causal path $\mu'$ from $P_k$ to $P_i$, with $\mu'.last = m$, no extra data is necessary to check whether a causal path $\mu$ with $\mu.last = m$ is elementary or not. Thus, to the knowledge of $P_i$ at the receipt of $m$, there exists a PESCM-path $\mu \cdot [m']$ from process $P_k$ to process $P_j$ with $\mu.last = m$, if and only if the following condition holds:

$sent\_to_i[j] \wedge (m.D[k] > D_i[k]) \wedge m.simple[k]$

6.2 Breaking All Non-Visibly-Doubled PESCM-Paths

Let us remark that if $P_i$ takes a local checkpoint before the delivery of $m$, it breaks all PESCM-paths $\mu \cdot [m']$ such that $\mu.last = m$. On the contrary, if $P_i$ does not take a local checkpoint before the delivery of $m$, none of these PESCM-paths is broken. Thus, if, to the knowledge of $P_i$, at least one PESCM-path $\mu \cdot [m']$ (such that $\mu.last = m$) is not causally doubled, $P_i$ must take a forced checkpoint before the delivery of $m$ to prevent the formation of a non-visibly doubled PESCM-path (Corollary 5.3). In the next two sections, we address first the case of PESCM-paths from $P_k$ to $P_j$ when $k = j$ (PESCM-cycles), then the case of PESCM-paths from $P_k$ to $P_j$ when $k \ne j$.

6.2.1 Breaking PESCM-cycles (k = j )

As we have already seen in Section 4.2.2, PESCM-cycles cannot be causally doubled. Thus, they must be broken as soon as they are detected. Such a situation occurs when, upon the receipt of a message $m$, a process $P_i$ detects a PESCM-path $\zeta = \mu \cdot [m']$ ($\mu.last = m$) from $P_k$ to $P_j$ with $k = j$, i.e., when the condition $(\exists k : sent\_to_i[k] \wedge m.D[k] > D_i[k] \wedge m.simple[k])$ holds. Such a path $\zeta$ is from $I_{k,m.D[k]}$ to $I_{k,z}$ for some $z$. As explained in Section 4.2.2, $\zeta$ is doubled if and only if $m.D[k] \le z$. Otherwise, $\zeta$ is a PESCM-cycle, and cannot be doubled. So, in this situation, $P_i$ must determine whether, among the set of messages $m'$ sent to $P_k$ since the last checkpoint and up to the receipt of $m$, there is one whose delivery event belongs to $I_{k,z}$ with $m.D[k] > z$. Two cases must be considered.

1. $m.D[i] < D_i[i]$ (see Figure 7.a). It means that there is no causal path from $I_{i,D_i[i]}$ to $I_{i,D_i[i]}$ terminated by $m$. In particular, all messages $m'$ sent by $P_i$ to $P_k$ in $I_{i,D_i[i]}$ and before the receipt of $m$ satisfy $\neg(deliver(m') \xrightarrow{hb} send(\mu.first))$. From this, we conclude that $deliver(m') \in I_{k,z}$ with $m.D[k] \le z$, and $\zeta$ is doubled.

2. $m.D[i] = D_i[i]$. It means that there is at least one causal path from $I_{i,D_i[i]}$ to $I_{i,D_i[i]}$ terminated by $m$. Thus, among the messages $m'$ sent by $P_i$ to $P_k$ in $I_{i,D_i[i]}$ and before the receipt of $m$, it is possible that some of them satisfy $deliver(m') \in I_{k,z}$ with $m.D[k] \ge z$. Thus, in that case, $P_i$ must determine whether $m.D[k] > z$. That will be the case if, and only if, there is at least one checkpoint on $P_k$ between the events $deliver(m')$ and $send(\mu.first)$. Since $\mu$ is simple, this situation occurs if, and only if, the causal path $[m'] \cdot \mu$ is not simple, i.e., $\neg m.simple[i]$ (see Figure 7.b).

So, if the following condition $C_1$ holds upon the receipt of message $m$, $P_i$ detects that there is a PESCM-cycle, and breaks it by taking a forced checkpoint before delivering $m$:

$C_1 \equiv \exists (k, j) : sent\_to_i[j] \wedge (m.D[k] > D_i[k]) \wedge m.simple[k] \wedge (j = k) \wedge (m.D[i] = D_i[i]) \wedge \neg m.simple[i]$

6.2.2 Breaking non-visibly doubled PESCM-paths ($k \ne j$)

Let $P_i$ be a process detecting, during interval $I_{i,x}$, a PESCM-path $\mu \cdot [m']$ upon the receipt of a message $m$ such that $\mu.last = m$. Such a path is from $I_{k,m.D[k]}$ to $I_{j,y}$, where $deliver(m') \in I_{j,y}$. In order to determine whether this path is visibly doubled or not, $P_i$ has to answer the following question: "is there a causal path $\nu$ from $I_{k,z}$ to $I_{j,y'}$, with $m.D[k] \le z$ and $y' \le y$ and $deliver(\nu.last) \xrightarrow{hb} send(m)$ (see Figure 11 and Proposition 5.1)?" This information concerns the existence of causal paths between intervals throughout the set of intervals. It is managed as follows. Each process $P_i$ keeps a boolean matrix $causal_i$, such that, for all $(k, j)$ ($1 \le k, j \le n$), $causal_i[k, j]$ is true if and only if, to the knowledge of $P_i$, there is a causal path from $I_{k,z}$ to $I_{j,y}$ with $D_i[k] \le z$ and $y \le D_i[j]$.

- $causal_i$ is initialized to false and each time $P_i$ takes a local checkpoint, all the entries $causal_i[i, j]$ are reset to false.
- When $P_i$ sends a message $m$, the matrix $causal_i$ is piggybacked on $m$.
- When a message $m$, sent by $P_j$, is delivered to $P_i$, $causal_i$ is updated as follows:
  1. for each $k$ such that $m.D[k] > D_i[k]$: for every $\ell$, $causal_i[k, \ell] := m.causal[k, \ell]$. In fact, $P_i$ must reset its knowledge about causal paths issued from the new checkpoint interval $I_{k,m.D[k]}$.
  2. for each $k$ such that $m.D[k] = D_i[k]$: for every $\ell$, $causal_i[k, \ell] := causal_i[k, \ell] \vee m.causal[k, \ell]$. In fact, $P_i$ adds to its current knowledge the causal paths issued from the checkpoint interval $I_{k,D_i[k]}$ that it was not yet aware of.
  Then (in both cases) $causal_i[j, i] := true$, and for every $\ell$, $causal_i[\ell, i] := causal_i[\ell, i] \vee causal_i[\ell, j]$ (transitive closure).

With this setting, the condition $m.causal[k, j]$, evaluated by $P_i$ upon the receipt of $m$, is true if and only if the PESCM-path $\mu \cdot [m']$ (with $\mu.last = m$) from $P_k$ to $P_j$ is visibly doubled or if there exists a PESCM-cycle from $P_j$ to $P_j$, which cannot be causally doubled. In fact, if $m$ is sent by process $P_\ell$, the value of $m.causal[k, j]$ is the value of $causal_\ell[k, j]$ at the time when $m$ is sent.

By construction, this value is true if and only if there exists a causal path $\nu$ from $I_{k,z}$ to $I_{j,y'}$ with $D_\ell[k] \le z$ and $y' \le D_\ell[j]$, i.e., $m.D[k] \le z$ and $y' \le m.D[j]$, and a causal path $\nu'$ from $I_{j,y''}$ to $I_{\ell,t}$ with $y'' \ge y'$ such that $deliver(\nu'.last)$ occurs before $send(m)$. Thus, $m.D[j] \ge y''$. Moreover, if $y$ denotes the index of the interval where $m'$ is delivered, then either $y' \le y$ or there is a PESCM-cycle from $P_j$ to $P_j$. Suppose $y' > y$. This implies $m.D[j] > y$ which, together with $sent\_to_i[j]$, implies the existence of a PESCM-cycle from $P_j$ to $P_j$.

The previous discussion shows that, when a message $m$ is received by $P_i$, the following condition must be evaluated:

$C_2 \equiv \exists (k, j) : (m.D[k] > D_i[k]) \wedge sent\_to_i[j] \wedge m.simple[k] \wedge (k \ne j) \wedge (\neg m.causal[k, j])$

This condition means that, to the knowledge of $P_i$, there exists at least one PESCM-path from $P_k$ to $P_j$ which is non-visibly doubled. If it evaluates to true, then the protocol forces $P_i$ to take a local checkpoint before delivering $m$.
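The update of $causal_i$ at delivery time can be sketched as a pure function in Python. The function name, the 0-based indices and the list-of-lists encoding of the matrix are assumptions of this example; `s` denotes the sender of $m$ and `D_i` is the dependency vector of $P_i$ before it is itself updated with the values carried by $m$.

```python
from typing import List

Matrix = List[List[bool]]


def update_causal(i: int, s: int, D_i: List[int], causal_i: Matrix,
                  m_D: List[int], m_causal: Matrix) -> Matrix:
    """Return the new matrix causal_i after the delivery of m sent by P_s."""
    n = len(D_i)
    new = [row[:] for row in causal_i]
    for k in range(n):
        if m_D[k] > D_i[k]:
            # rule 1: a newer interval of P_k, replace row k by the sender's knowledge
            new[k] = list(m_causal[k])
        elif m_D[k] == D_i[k]:
            # rule 2: same interval of P_k, merge both pieces of knowledge
            new[k] = [a or b for a, b in zip(causal_i[k], m_causal[k])]
    # m itself is a causal path from the sender to P_i
    new[s][i] = True
    # every causal path reaching the sender is extended by m up to P_i
    for l in range(n):
        new[l][i] = new[l][i] or new[l][s]
    return new
```

For instance, if $P_1$ has learned of a causal path from $P_0$ to itself and then sends $m$ to $P_2$, the delivery of $m$ lets $P_2$ record causal paths from both $P_0$ and $P_1$ to itself.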

6.2.3 Summary

The two previous sections showed that, when a message $m$ is received by $P_i$, this process has to take a forced local checkpoint before delivering $m$ if and only if one of the two conditions $C_1$ or $C_2$ evaluates to true. This is summarized by the condition $C \equiv (C_1 \vee C_2)$:

\[
C \equiv \exists (k, j) : (m.D[k] > D_i[k]) \wedge sent\_to_i[j] \wedge m.simple[k] \wedge \big( ((k \ne j) \wedge \neg m.causal[k, j]) \vee ((k = j) \wedge (m.D[i] = D_i[i]) \wedge \neg m.simple[i]) \big)
\]
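As an illustration, condition $C$ can be written as a predicate evaluated by $P_i$ when $m$ arrives. This is a minimal sketch with hypothetical names and 0-based indices; `D_i` and `sent_to_i` are the values held by $P_i$ before the delivery, and `m_D`, `m_simple`, `m_causal` are the values piggybacked on $m$.

```python
from typing import List


def must_force(i: int, D_i: List[int], sent_to_i: List[bool],
               m_D: List[int], m_simple: List[bool],
               m_causal: List[List[bool]]) -> bool:
    """Evaluate C = C1 or C2 at process i upon the receipt of m."""
    n = len(D_i)
    for k in range(n):
        # common part: a prime and simple causal path from P_k ends with m
        if not (m_D[k] > D_i[k] and m_simple[k]):
            continue
        for j in range(n):
            if not sent_to_i[j]:
                continue
            c2 = k != j and not m_causal[k][j]                        # non-visibly doubled path
            c1 = k == j and m_D[i] == D_i[i] and not m_simple[i]      # PESCM-cycle
            if c1 or c2:
                return True
    return False
```

In the first case of the checks below, $P_1$ has sent a message to $P_2$ in its current interval and then receives $m$ from a new interval of $P_0$ with no known causal path from $P_0$ to $P_2$, so a forced checkpoint is required.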

6.3 Formal Description of the Checkpointing Protocol

Each process $P_i$ is endowed with the following arrays, whose semantics has been defined in the previous sections.

    D_i : array[1..n] of integer;
    simple_i, sent_to_i : array[1..n] of boolean;
    causal_i : array[1..n, 1..n] of boolean;

The protocol is formally described in Figure 12. It is composed of statements performed by a process $P_i$ at initialization (S0), when it sends a message (S1), when a message is received (S2) and when a checkpoint (basic or forced) has to be taken (S3). A forced checkpoint is taken when the predicate $C$ is true.

Though theoretically the size of the control information piggybacked on application messages is given by one vector of integers ($D$), one square matrix of booleans ($\mathit{causal}$) and one vector of booleans ($\mathit{simple}$), we would like to remark that, in practice, the vector $\mathit{simple}$ can be encoded in the diagonal of the matrix $\mathit{causal}$, as that diagonal is never used by the condition $C_1$. In this way, the size of the control information reduces to one vector of integers and one square matrix of booleans.

    procedure take_checkpoint is
        ∀k do sent_to_i[k] := false enddo;
        ∀j ≠ i do simple_i[j] := false; causal_i[i, j] := false enddo;
        save the current local state and a copy of the array D_i;        % ckpt event %
        D_i[i] := D_i[i] + 1;

    (S0) initialization
        ∀k do D_i[k] := 0; ∀ℓ do causal_i[k, ℓ] := false enddo enddo;
        simple_i[i] := true;
        take_checkpoint;

    (S1) procedure send(m, P_j)                                           % send event %
        sent_to_i[j] := true;
        net-send(m, D_i, simple_i, causal_i, P_j);                         % net-send event %

    (S2) when an event (m, D, simple, causal) is accepted by the Network   % net-receive event %
        if ∃(k, j) : (m.D[k] > D_i[k]) ∧ sent_to_i[j] ∧ m.simple[k] ∧
                     ( ((k ≠ j) ∧ (¬m.causal[k, j])) ∨
                       ((k = j) ∧ (m.D[i] = D_i[i]) ∧ (¬m.simple[i])) )
            then take_checkpoint                                          % taking a forced checkpoint %
        endif;
        % updating of control variables %
        ∀k do case
            m.D[k] < D_i[k] → skip
            m.D[k] > D_i[k] → D_i[k] := m.D[k]; simple_i[k] := m.simple[k];
                               ∀ℓ do causal_i[k, ℓ] := m.causal[k, ℓ] enddo;
            m.D[k] = D_i[k] → simple_i[k] := simple_i[k] ∧ m.simple[k];
                               ∀ℓ do causal_i[k, ℓ] := causal_i[k, ℓ] ∨ m.causal[k, ℓ] enddo;
        endcase enddo;
        % P_s is the sender of m %
        causal_i[s, i] := true;
        ∀ℓ do causal_i[ℓ, i] := causal_i[ℓ, i] ∨ causal_i[ℓ, s] enddo;
        deliver(m)                                                        % deliver event %

    (S3) procedure take-ckpt()                                            % take-ckpt event %
        % taking a basic checkpoint %
        take_checkpoint

Figure 12: The Checkpointing Protocol
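The statements of Figure 12 can be transcribed into a short executable sketch. The following Python is a minimal illustration under stated assumptions, not a reference implementation: the names `Message`, `Process`, `saved` and `forced` are hypothetical, indices are 0-based, the network is replaced by direct method calls, and the reset of `simple` and `causal` in `take_checkpoint` is assumed to range over $j \neq i$ so that the value $\mathit{simple}_i[i] = true$ set in (S0) is preserved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    sender: int
    D: List[int]
    simple: List[bool]
    causal: List[List[bool]]


class Process:
    def __init__(self, i: int, n: int) -> None:
        # (S0) initialization
        self.i, self.n = i, n
        self.D = [0] * n
        self.simple = [False] * n
        self.sent_to = [False] * n
        self.causal = [[False] * n for _ in range(n)]
        self.saved: List[List[int]] = []  # copy of D kept with each checkpoint
        self.forced = 0                   # number of forced checkpoints
        self.simple[i] = True
        self.take_checkpoint()

    def take_checkpoint(self) -> None:
        for k in range(self.n):
            self.sent_to[k] = False
        for j in range(self.n):
            if j != self.i:  # assumption: the reset skips index i
                self.simple[j] = False
                self.causal[self.i][j] = False
        self.saved.append(list(self.D))  # stands in for saving the local state
        self.D[self.i] += 1

    def send(self, dest: int) -> Message:
        # (S1) record the destination and piggyback copies of the arrays
        self.sent_to[dest] = True
        return Message(self.i, list(self.D), list(self.simple),
                       [row[:] for row in self.causal])

    def must_force(self, m: Message) -> bool:
        # Predicate C evaluated on the receiver's state and the message
        i = self.i
        cycle_part = m.D[i] == self.D[i] and not m.simple[i]
        for k in range(self.n):
            if not (m.D[k] > self.D[k] and m.simple[k]):
                continue
            for j in range(self.n):
                if self.sent_to[j] and ((k != j and not m.causal[k][j])
                                        or (k == j and cycle_part)):
                    return True
        return False

    def receive(self, m: Message) -> None:
        # (S2) forced checkpoint if needed, then update of control variables
        if self.must_force(m):
            self.take_checkpoint()
            self.forced += 1
        for k in range(self.n):
            if m.D[k] > self.D[k]:
                self.D[k] = m.D[k]
                self.simple[k] = m.simple[k]
                self.causal[k] = list(m.causal[k])
            elif m.D[k] == self.D[k]:
                self.simple[k] = self.simple[k] and m.simple[k]
                self.causal[k] = [a or b for a, b in zip(self.causal[k], m.causal[k])]
        s = m.sender
        self.causal[s][self.i] = True
        for l in range(self.n):
            self.causal[l][self.i] = self.causal[l][self.i] or self.causal[l][s]


# Scenario 1: P0 sends to P1, then receives a message from P2.
p0, p1, p2 = (Process(i, 3) for i in range(3))
p0.send(1)
p0.receive(p2.send(0))
forced_not_doubled = p0.forced

# Scenario 2: P0 sends to P1, P2 sends to P1, P1 delivers and then sends to P0.
q0, q1, q2 = (Process(i, 3) for i in range(3))
q0.send(1)
q1.receive(q2.send(1))
q0.receive(q1.send(0))
forced_doubled = q0.forced
```

In the first scenario the new dependency on $P_2$ reaches $P_0$ after a send to $P_1$ with no doubling causal path known, so the sketch takes a forced checkpoint. In the second scenario the message carries `causal[2][1] = True`, and no forced checkpoint is taken.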


6.4 Optimality

6.4.1 Number of forced checkpoints

Let $CP$ denote the protocol presented in the previous section. We show that $CP$ is optimal with respect to $\#f\_ckpt(CP)$. Suppose, by contradiction, that there is a protocol $CP'$ belonging to the class $F_{RDT}$ such that $\#f\_ckpt(CP') < \#f\_ckpt(CP)$. The protocol $CP'$ bases its decision to force processes to take local checkpoints on a condition $C'$, evaluated upon the receipt of each message. Let $pat$ and $pat'$ denote the checkpoint and communication patterns produced by the application together with the protocols $CP$ and $CP'$, respectively. By assumption, there is a message $m$ such that, upon the receipt of $m$ at a process $P_i$, $CP$ forces $P_i$ to take a local checkpoint $A$, whilst $CP'$ does not force $P_i$ to take this checkpoint (i.e., $C' \Rightarrow C$). As $CP$ forces $P_i$ to take $A$, there exists in $pat$ a PESCM-path $\zeta = \mu \cdot [m']$, with $\mu.\mathit{last} = m$, which is not visibly doubled. Since $CP'$ does not force $P_i$ to take $A$ and belongs to the class $F_{RDT}$, $\zeta$ must be visibly doubled in $pat'$. But $pat'$ is obtained from $pat$ by removing the local (forced) checkpoint $A$. Thus, $\zeta$ must also be visibly doubled in $pat$, a contradiction.

6.4.2 Size of data structures

To show the optimality of the protocol $CP$ with respect to the size of the data structures either stored by processes or piggybacked on messages, we proceed by "omission": if $CP'$ is a protocol derived from $CP$ in which a single entry of one of the arrays $\mathit{simple}$, $\mathit{causal}$ or $D$ is omitted on a single message, then there exist checkpoint and communication patterns such that either $CP'$ is no longer optimal in terms of the number of forced checkpoints, or the pattern no longer satisfies the RDT property (according to the "default" value chosen to replace the missing entry). Let us remark that if a process does not store even a single entry of one of those arrays, then there can be a message on which this entry is omitted, so it is sufficient to analyze the situation where entries are omitted on messages.

Array simple. Suppose that the entry $\mathit{simple}[\alpha]$ (with $1 \le \alpha \le n$) is omitted on a message $m$.

• Consider a situation where, upon the receipt of $m$ at the process $P_\alpha$, we have:

$$\exists k : (m.D[k] > D_\alpha[k]) \wedge \mathit{sent\_to}_\alpha[k] \wedge m.\mathit{simple}[k] \wedge (m.D[\alpha] = D_\alpha[\alpha])$$

If $m.\mathit{simple}[\alpha]$ has the default value false, then condition $C$ evaluates to true, and $P_\alpha$ is forced to take a checkpoint before delivering $m$, although this might not be necessary, as shown by Figure 13.a. If $m.\mathit{simple}[\alpha]$ has the default value true, then condition $C$ evaluates to false, and $P_\alpha$ is not forced to take a checkpoint before delivering $m$. So, the RDT property is violated, as shown by Figure 13.b.

• Consider a situation where, upon the receipt of $m$ at the process $P_i$, we have:

$$\exists j\ (j \neq \alpha) : (m.D[\alpha] > D_i[\alpha]) \wedge \mathit{sent\_to}_i[j] \wedge \neg m.\mathit{causal}[\alpha,j]$$

If $m.\mathit{simple}[\alpha]$ has the default value false, then condition $C$ evaluates to false and $P_i$ is not forced to take a checkpoint before delivering $m$. So, the RDT property is violated (see Figure 13.c).

[Figure 13, three panels: (a) $m.\mathit{simple}[\alpha]$ evaluated to false, but should be true; (b) $m.\mathit{simple}[\alpha]$ evaluated to true, but should be false; (c) $m.\mathit{simple}[\alpha]$ evaluated to false, but should be true.]

Figure 13: Omission of an entry $m.\mathit{simple}[\alpha]$

Array causal. Suppose that the entry $\mathit{causal}[\alpha,\beta]$ (with $1 \le \alpha \neq \beta \le n$) is omitted on a message $m$. The analysis is similar to the previous one. Let us consider the situation where, upon the receipt of $m$ at the process $P_i$, we have:

$$(m.D[\alpha] > D_i[\alpha]) \wedge \mathit{sent\_to}_i[\beta] \wedge m.\mathit{simple}[\alpha]$$

• If the default value of $\mathit{causal}[\alpha,\beta]$ is true, then condition $C$ evaluates to false and $P_i$ is not forced to take a checkpoint even if the detected PESCM-path is not visibly doubled, thus violating RDT (Figure 14.a).

• If the default value of $\mathit{causal}[\alpha,\beta]$ is false, then condition $C$ evaluates to true and $P_i$ is forced to take a checkpoint even if the detected PESCM-path is visibly doubled, as shown by Figure 14.b.

[Figure 14, two panels: (a) $m.\mathit{causal}[\alpha,\beta]$ evaluated to true, but should be false; (b) $m.\mathit{causal}[\alpha,\beta]$ evaluated to false, but should be true.]

Figure 14: Omission of an entry $m.\mathit{causal}[\alpha,\beta]$

Array D. Suppose that the entry $D[\alpha]$ (with $1 \le \alpha \le n$) is omitted on a message $m$. Then, the default value used instead of this entry can be any integer value. Thus conditions such as $m.D[\alpha] > D_i[\alpha]$ or $m.D[\alpha] = D_\alpha[\alpha]$, used in the evaluation of $C$, can be either true or false, according to the default value.

• Consider the situation where, upon the receipt of $m$ at the process $P_i$, we have:

$$\exists j\ (j \neq \alpha) : \mathit{sent\_to}_i[j] \wedge m.\mathit{simple}[\alpha] \wedge \neg m.\mathit{causal}[\alpha,j]$$

  - If $m.D[\alpha] > D_i[\alpha]$ is false by default, then $P_i$ does not take a forced checkpoint ($C$ evaluates to false), although there can be a non-visibly doubled PESCM-path ($P_i$ does not detect that it is prime). So, RDT is violated.
  - If $m.D[\alpha] > D_i[\alpha]$ is true by default, then $P_i$ takes a forced checkpoint ($C$ evaluates to true), even if the detected PESCM-path is not prime.

• Consider the situation where, upon the receipt of $m$ at the process $P_\alpha$, we have:

$$\exists (k,j)\ (j = k) : \mathit{sent\_to}_\alpha[k] \wedge m.\mathit{simple}[k] \wedge \neg m.\mathit{simple}[\alpha]$$

  - If $m.D[\alpha] = D_\alpha[\alpha]$ is evaluated by default to true, $P_\alpha$ takes a forced checkpoint ($C$ evaluates to true), even if there is no PESCM-cycle, as shown by Figure 15, producing a non-optimal solution in terms of the number of forced checkpoints.
  - If $m.D[\alpha] = D_\alpha[\alpha]$ is evaluated by default to false, $P_\alpha$ does not take a forced checkpoint ($C$ evaluates to false), although there could be a PESCM-cycle. So, RDT is violated.

[Figure 15: a pattern in which $m.D[\alpha] = D_\alpha[\alpha]$ is evaluated to true, but should be false since $m.D[\alpha] < D_\alpha[\alpha]$; a forced checkpoint is taken.]

Figure 15: Omission of an entry $m.D[\alpha]$

Hence, according to Definition 5.3, the protocol $CP$ is optimal with respect to the amount of control information and the size of the data structures.
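The omission argument can be replayed on a concrete receive situation. The sketch below is illustrative only: the function name `cond_c`, the helper `evaluate_with_default` and the sample values are assumptions, and indices are 0-based. It evaluates the condition $C$ when an omitted entry of the array simple (the one indexed by the receiver) is replaced by each of the two possible default values.

```python
def cond_c(i, D_i, sent_to_i, m_D, m_simple, m_causal):
    """Condition C of Section 6.2.3, evaluated by the receiver P_i."""
    n = len(D_i)
    cycle_part = m_D[i] == D_i[i] and not m_simple[i]
    return any(
        m_D[k] > D_i[k] and sent_to_i[j] and m_simple[k]
        and ((k != j and not m_causal[k][j]) or (k == j and cycle_part))
        for k in range(n) for j in range(n)
    )


alpha = 0                       # receiver; its own entry m.simple[alpha] is omitted
D_alpha = [3, 0]                # receiver's dependency vector
sent_to_alpha = [False, True]   # the receiver has sent a message to process 1
m_D = [3, 1]                    # m.D[1] > D_alpha[1] and m.D[alpha] = D_alpha[alpha]
m_causal = [[False, False], [False, False]]


def evaluate_with_default(default: bool) -> bool:
    """Evaluate C with the omitted entry m.simple[alpha] replaced by `default`."""
    m_simple = [default, True]  # m.simple[1] is true, as in the first situation
    return cond_c(alpha, D_alpha, sent_to_alpha, m_D, m_simple, m_causal)


forced_if_default_false = evaluate_with_default(False)
forced_if_default_true = evaluate_with_default(True)
```

With the default false the receiver takes a (possibly unnecessary) forced checkpoint, and with the default true it takes none, which matches the two cases of the first situation analyzed for the array simple.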

6.5 Size of Control Information vs Number of Forced Checkpoints

Several variants of the protocol $CP$ can be obtained by weakening the predicate $C$, i.e., by replacing $C$ by a weaker predicate $C'$ (thus $C \Rightarrow C'$) in the test performed upon the arrival of a message $m$ to decide whether a forced checkpoint must be taken. Obviously, weakening the predicate $C$ leads to variants taking at least as many forced checkpoints as $CP$, since the implication $C \Rightarrow C'$ shows that each time $CP$ takes a forced checkpoint, the variant based on $C'$ also takes a forced checkpoint. However, weakening the condition $C$ makes it possible to decrease the size of the control information piggybacked on application messages. These variants thus exhibit a tradeoff between the size of the control information and the number of forced checkpoints while ensuring the RDT property: the larger the number of forced checkpoints, the smaller the control information piggybacked on application messages.

Before examining some of these variants, let us remark that, since by construction $\forall k : m.\mathit{causal}[k,k] = false$, predicate $C$ can be rewritten as

$$C \equiv \exists (k,j) : (m.D[k] > D_i[k]) \wedge \mathit{sent\_to}_i[j] \wedge m.\mathit{simple}[k] \wedge \neg m.\mathit{causal}[k,j] \wedge \Big( (k \neq j) \vee \big((k = j) \wedge (m.D[i] = D_i[i]) \wedge \neg m.\mathit{simple}[i]\big) \Big)$$

i.e., in a more concise way:

$$C \equiv \exists (k,j) : (m.D[k] > D_i[k]) \wedge \mathit{sent\_to}_i[j] \wedge m.\mathit{simple}[k] \wedge \neg m.\mathit{causal}[k,j] \wedge \Big( (k = j) \Rightarrow \big((m.D[i] = D_i[i]) \wedge \neg m.\mathit{simple}[i]\big) \Big)$$
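The equivalence between the form of $C$ given in Section 6.2.3 and the concise form can be checked exhaustively on a tiny instance. The sketch below is only an illustration: the function names are hypothetical, indices are 0-based, and the check covers two processes with dependency values restricted to {0, 1}. It also shows that the two forms may disagree if a diagonal entry of causal were true, which is why the remark about the diagonal matters.

```python
from itertools import product

N = 2  # number of processes in the exhaustive check (assumption: tiny instance)


def form_original(i, D_i, sent_to, m_D, m_simple, m_causal):
    """C as stated in Section 6.2.3 (the k = j branch does not read causal)."""
    cyc = m_D[i] == D_i[i] and not m_simple[i]
    return any(
        m_D[k] > D_i[k] and sent_to[j] and m_simple[k]
        and ((k != j and not m_causal[k][j]) or (k == j and cyc))
        for k in range(N) for j in range(N)
    )


def form_concise(i, D_i, sent_to, m_D, m_simple, m_causal):
    """Concise form: not causal[k][j] and ((k = j) implies the cycle part)."""
    cyc = m_D[i] == D_i[i] and not m_simple[i]
    return any(
        m_D[k] > D_i[k] and sent_to[j] and m_simple[k]
        and not m_causal[k][j] and (k != j or cyc)
        for k in range(N) for j in range(N)
    )


bits = (False, True)
mismatches = 0
checked = 0
for i in range(N):
    for D_i, m_D in product(product((0, 1), repeat=N), repeat=2):
        for sent_to, m_simple in product(product(bits, repeat=N), repeat=2):
            for off_diag in product(bits, repeat=N * (N - 1)):
                it = iter(off_diag)
                # Diagonal entries are false, as stated "by construction".
                m_causal = [[False if k == j else next(it) for j in range(N)]
                            for k in range(N)]
                args = (i, D_i, sent_to, m_D, m_simple, m_causal)
                checked += 1
                if form_original(*args) != form_concise(*args):
                    mismatches += 1

# With a true diagonal entry the two forms can differ:
bad_diagonal = [[False, False], [False, True]]
sample = (0, (1, 0), (False, True), (1, 1), (False, True), bad_diagonal)
differs_with_true_diagonal = form_original(*sample) != form_concise(*sample)
```

No mismatch is found as long as the diagonal of causal is false, whereas the last sample (with `m_causal[1][1]` set to true) separates the two forms.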

No-PESCM-Cycle($j = k$). If $(k = j) \Rightarrow ((m.D[i] = D_i[i]) \wedge \neg m.\mathit{simple}[i])$ is evaluated to true by default, the predicate $C$ is weakened to:

$$C_{\neg PESCM(j=k)} \equiv \exists (k,j) : (m.D[k] > D_i[k]) \wedge \mathit{sent\_to}_i[j] \wedge m.\mathit{simple}[k] \wedge (\neg m.\mathit{causal}[k,j])$$

The corresponding protocol breaks all non-visibly doubled PESCM-paths and all PESCM-paths from a process to itself (cycles), whether doubled or not (it supposes that all such cycles are non-simple).

No-PESCM-Path($j \neq k$). If matrices $\mathit{causal}$ are omitted, i.e., evaluated to false by default, the predicate $C$ is weakened to:

$$C_{\neg PESCM(j \neq k)} \equiv \exists (k,j) : (m.D[k] > D_i[k]) \wedge \mathit{sent\_to}_i[j] \wedge m.\mathit{simple}[k] \wedge \Big( (k = j) \Rightarrow \big((m.D[i] = D_i[i]) \wedge \neg m.\mathit{simple}[i]\big) \Big)$$

In such a case we obtain a protocol that does not allow the existence of any PESCM-path in which $j \neq k$, whether it is causally doubled or not, as well as all the PESCM-cycles in the checkpoint and communication pattern. Each application message piggybacks the same control information as in the optimal algorithm.

No-PESCM. By avoiding the presence of any PESCM-path or cycle in a checkpoint and communication pattern, we obtain the following predicate:

$$C_{\neg PESCM} \equiv \exists (k,j) : (m.D[k] > D_i[k]) \wedge \mathit{sent\_to}_i[j] \wedge m.\mathit{simple}[k]$$

Each application message piggybacks the transitive dependency vector $D$ plus the boolean vector $\mathit{simple}$. It is clear that the number of forced checkpoints is not less than the one taken by both $C_{\neg PESCM(j=k)}$ and $C_{\neg PESCM(j \neq k)}$, as $C_{\neg PESCM(j=k)} \Rightarrow C_{\neg PESCM}$ and $C_{\neg PESCM(j \neq k)} \Rightarrow C_{\neg PESCM}$.

No-PECM-Cycle($j = k$). By removing the boolean vector $\mathit{simple}$ from $C_{\neg PESCM(j=k)}$, we get the following predicate:

$$C_{\neg PECM(j=k)} \equiv \exists (k,j) : (m.D[k] > D_i[k]) \wedge \mathit{sent\_to}_i[j] \wedge (\neg m.\mathit{causal}[k,j])$$

In such a case we obtain a protocol that does not allow the existence of any PECM-cycle, whether simple or not, doubled or not. Each application message piggybacks only the transitive dependency vector $D$. It is clear that $C_{\neg PECM(j=k)}$ is weaker than $C_{\neg PESCM(j=k)}$.

No-PECM-Path($j \neq k$). By removing the boolean vector $\mathit{simple}$ from $C_{\neg PESCM(j \neq k)}$, we get the following predicate:

$$C_{\neg PECM(j \neq k)} \equiv \exists (k,j) : (m.D[k] > D_i[k]) \wedge \mathit{sent\_to}_i[j] \wedge \big( (k = j) \Rightarrow (m.D[i] = D_i[i]) \big)$$

In such a case we obtain a protocol that does not allow the existence of any PECM-path from $P_k$ to $P_j$ with $j \neq k$, whether simple or not, whether causally doubled or not. Each application message piggybacks only the transitive dependency vector $D$. It is clear that $C_{\neg PECM(j \neq k)}$ is weaker than $C_{\neg PESCM(j \neq k)}$.

No-PECM. By removing the boolean vector $\mathit{simple}$ from $C_{\neg PESCM}$, we get the following predicate:

$$C_{\neg PECM} \equiv \exists (k,j) : (m.D[k] > D_i[k]) \wedge \mathit{sent\_to}_i[j]$$

In such a case we obtain a protocol that does not allow the existence of any PECM-path or cycle, whether simple or not, whether causally doubled or not. Each application message piggybacks $D$ as control information. We would like to remark that in this case the vector $\mathit{sent\_to}$ might be replaced by a single boolean flag $\mathit{after\_first\_send}$. Doing so, we get exactly the Fixed-Dependency-After-Send (FDAS) checkpointing protocol proposed by Wang in [15]. It is clear that $C_{\neg PECM}$ is weaker than $C_{\neg PECM(j=k)}$, $C_{\neg PECM(j \neq k)}$ and $C_{\neg PESCM}$.
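The weakening relations between the predicates introduced so far can be verified by brute force on a small instance. The following Python is a sketch under stated assumptions: the dictionary keys, the function names and the restriction to two processes with dependency values in {0, 1} are illustrative choices, indices are 0-based, and the diagonal of causal is kept false as stated earlier.

```python
from itertools import product

N = 2  # tiny instance used for the exhaustive check (assumption)
PAIRS = [(k, j) for k in range(N) for j in range(N)]


def predicates(i, D_i, sent_to, m_D, m_simple, m_causal):
    """Evaluate C and its weakened variants on one receive situation."""
    new = [m_D[k] > D_i[k] for k in range(N)]
    same_i = m_D[i] == D_i[i]
    cyc = same_i and not m_simple[i]

    def exists(body):
        return any(new[k] and sent_to[j] and body(k, j) for k, j in PAIRS)

    return {
        "C": exists(lambda k, j: m_simple[k] and not m_causal[k][j]
                    and (k != j or cyc)),
        "notPESCM(j=k)": exists(lambda k, j: m_simple[k] and not m_causal[k][j]),
        "notPESCM(j!=k)": exists(lambda k, j: m_simple[k] and (k != j or cyc)),
        "notPESCM": exists(lambda k, j: m_simple[k]),
        "notPECM(j=k)": exists(lambda k, j: not m_causal[k][j]),
        "notPECM(j!=k)": exists(lambda k, j: k != j or same_i),
        "notPECM": exists(lambda k, j: True),
    }


# Each pair (a, b) states that predicate a implies predicate b.
IMPLIES = [
    ("C", "notPESCM(j=k)"), ("C", "notPESCM(j!=k)"),
    ("notPESCM(j=k)", "notPESCM"), ("notPESCM(j!=k)", "notPESCM"),
    ("notPESCM(j=k)", "notPECM(j=k)"), ("notPESCM(j!=k)", "notPECM(j!=k)"),
    ("notPESCM", "notPECM"),
    ("notPECM(j=k)", "notPECM"), ("notPECM(j!=k)", "notPECM"),
]


def all_situations():
    bits = (False, True)
    for i in range(N):
        for D_i, m_D in product(product((0, 1), repeat=N), repeat=2):
            for sent_to, m_simple in product(product(bits, repeat=N), repeat=2):
                for off_diag in product(bits, repeat=N * (N - 1)):
                    it = iter(off_diag)
                    m_causal = [[False if k == j else next(it) for j in range(N)]
                                for k in range(N)]
                    yield i, D_i, sent_to, m_D, m_simple, m_causal


violations = []
fires = {}
for situation in all_situations():
    values = predicates(*situation)
    for name, value in values.items():
        fires[name] = fires.get(name, 0) + int(value)
    violations += [(a, b) for a, b in IMPLIES if values[a] and not values[b]]
```

On this instance no implication is violated, and the counters in `fires` show that each weaker predicate holds in at least as many situations as the predicates implying it. This is a check on static receive situations only, not a simulation of complete executions.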

[Figure 16 diagram, nodes: No-Non-Visibly-Doubled ($C$); No-PESCM-Cycle($j = k$) ($C_{\neg PESCM(j=k)}$); No-PESCM-Path($j \neq k$) ($C_{\neg PESCM(j \neq k)}$); No-PESCM ($C_{\neg PESCM}$); No-PECM-Cycle($j = k$) ($C_{\neg PECM(j=k)}$); No-PECM-Path($j \neq k$) ($C_{\neg PECM(j \neq k)}$); No-PECM ($C_{\neg PECM}$); Only-Causal-Path ($C_{\neg NCP}$); No-Causal-Dependency ($C_{\neg CD}$); Only-One-Delivery (true).]

Figure 16: A family of protocols satisfying RDT.

Only-Causal-Paths. Another variant can be obtained by avoiding the presence of any non-causal Z-path (see Section 3.1) by preventing the formation of any break point (i.e., a send-delivery sequence in a checkpoint interval). This can be done by using the following predicate:

$$C_{\neg NCP} \equiv \exists j : \mathit{sent\_to}_i[j]$$

In such a case, each time a message is received by a process $P_i$ after a send event, a forced checkpoint is taken in order to break the sequence. This protocol is purely "syntactic" in the sense that it does not use control information piggybacked on application messages. In this case too, the vector $\mathit{sent\_to}$ might be replaced by a single boolean flag $\mathit{after\_first\_send}$. In this way we get Russell's algorithm [13]. It is clear that $C_{\neg NCP}$ is weaker than $C_{\neg PECM}$.

No-Causal-Dependency. If we take a checkpoint each time we receive a dependency vector $D$ that brings at least one new piece of information about another process, we actually avoid the occurrence of any causal dependency between the newly learnt checkpoint interval and the current checkpoint interval. This is done by using the following predicate:

$$C_{\neg CD} \equiv \exists k : (m.D[k] > D_i[k])$$

This protocol piggybacks a dependency vector $D$ on each application message as control information. It actually corresponds to the protocol presented in [14], which was named Fixed-Dependency-Interval (FDI) by Wang in [15]. Clearly, $C_{\neg CD}$ is weaker than $C_{\neg PECM}$.

Only-One-Delivery. If a forced checkpoint is taken each time a message is received, every checkpoint interval contains at most one delivery event. This corresponds to the ultimate weakening of $C$ to the tautology. Of course, the tautology is the weakest of all predicates and thus this protocol is the least efficient in terms of the number of forced checkpoints.

Figure 16 summarizes the discussion of Section 6.5. In that figure, a plain arrow from $CP_1$ to $CP_2$ indicates that $\#f\_ckpt(CP_1) \le \#f\_ckpt(CP_2)$ and a dotted arrow indicates that $|ctrl\_inf|(CP_1) \ge |ctrl\_inf|(CP_2)$.
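To make the last weakenings concrete, the sketch below writes the flag-based predicates as small Python functions and counts how often each one fires over the same set of receive situations. The function names and the enumeration (two processes, dependency values in {0, 1}, one boolean flag standing for $\mathit{after\_first\_send}$) are assumptions made for illustration; the counts compare predicates on static situations and are not a simulation of protocol runs.

```python
from itertools import product


def c_fdas(D_i, after_first_send, m_D):
    """No-PECM with the sent_to vector replaced by one flag (FDAS [15])."""
    return after_first_send and any(md > d for md, d in zip(m_D, D_i))


def c_russell(after_first_send):
    """Only-Causal-Paths with one flag (Russell's algorithm [13])."""
    return after_first_send


def c_fdi(D_i, m_D):
    """No-Causal-Dependency (FDI [14])."""
    return any(md > d for md, d in zip(m_D, D_i))


def c_only_one_delivery():
    """Only-One-Delivery: the tautology."""
    return True


fires = {"FDAS": 0, "FDI": 0, "Russell": 0, "Only-One-Delivery": 0}
fdas_implies_others = True
for D_i, m_D in product(product((0, 1), repeat=2), repeat=2):
    for flag in (False, True):
        fdas = c_fdas(D_i, flag, m_D)
        fdi = c_fdi(D_i, m_D)
        russell = c_russell(flag)
        fires["FDAS"] += int(fdas)
        fires["FDI"] += int(fdi)
        fires["Russell"] += int(russell)
        fires["Only-One-Delivery"] += int(c_only_one_delivery())
        # The No-PECM predicate must imply both of its weakenings.
        if fdas and not (fdi and russell):
            fdas_implies_others = False
```

Over the 32 enumerated situations, the FDAS predicate fires least often and the tautology fires every time, which mirrors the ordering by number of forced checkpoints discussed above.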

7 Conclusion

Considering a checkpoint and communication pattern, the Rollback-Dependency Trackability (RDT) property stipulates that there is no hidden dependency between local checkpoints. In other words, if there is a dependency between two checkpoints due to a non-causal message chain, then there must exist a causal message chain that "doubles" the non-causal one and that establishes the same dependency. This paper has provided a minimal characterization of the RDT property. This characterization defines the smallest set of non-causal sequences of messages that have to be doubled in order to ensure the RDT property. A new protocol belonging to the family of communication-induced checkpointing protocols that ensure the RDT property has been presented, and it has been shown that this protocol is optimal in terms of the number of forced checkpoints and in terms of the size of the data structures it requires. The protocol attains this goal by a subtle tracking of causal dependencies on already taken checkpoints; this tracking is then used to prevent the occurrence of hidden dependencies. Finally, a set of non-optimal protocols has been derived from the optimal one. These derivations showed a tradeoff between the size of the control information required by a protocol and the number of forced checkpoints it takes.

Acknowledgments

The authors wish to thank Achour Mostefaoui, with whom they designed the protocol described in [2]. This protocol was the starting point of this study. We would also like to thank Yi-Min Wang for many interesting discussions and for pointing out the importance of the concept of prime Z-path. We are also grateful to Rob Netzer for helpful discussions.

References

[1] Babaoglu, O., Fromentin, E. and Raynal, M. A Unified Framework for the Specification and Run-time Detection of Dynamic Properties in Distributed Computations. Journal of Systems and Software, 33:287-298, 1996.

[2] Baldoni, R., Helary, J.M., Mostefaoui, A. and Raynal, M. A Communication-Induced Checkpointing Protocol that Ensures Rollback-Dependency Trackability. Proc. 27th IEEE Symposium on Fault-Tolerant Computing (FTCS 27), Seattle, June 1997.

[3] Chandy, K.M. and Lamport, L. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, 3(1):63-75, 1985.

[4] Elnozahy, E.N., Johnson, D.B. and Wang, Y.M. A Survey of Rollback-Recovery Protocols in Message-Passing Systems. Technical Report CMU-CS-96-181, Carnegie-Mellon University, 1996.

[5] Fowler, J. and Zwaenepoel, W. Distributed Causal Breakpoints. Proc. 10th IEEE Int. Conference on Distributed Computing Systems, Paris, pp. 134-141, May 1990.

[6] Helary, J.M., Jard, C., Plouzeau, N. and Raynal, M. Detection of Stable Properties in Distributed Applications. Proc. 6th ACM Symposium on Principles of Distributed Computing, Vancouver, pp. 125-136, 1987.

[7] Lamport, L. Time, Clocks and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7):558-565, 1978.

[8] Koo, R. and Toueg, S. Checkpointing and Rollback-Recovery for Distributed Systems. IEEE Transactions on Software Engineering, 13(1):23-31, 1987.

[9] Johnson, D. Efficient Transparent Optimistic Rollback Recovery for Distributed Application Programs. Proc. 14th IEEE Symposium on Reliable Distributed Systems, pp. 86-95, 1993.

[10] Manivannan, D. and Singhal, M. A Low-Overhead Recovery Technique Using Quasi-Synchronous Checkpointing. Proc. 16th IEEE International Conference on Distributed Computing Systems, Hong-Kong, pp. 100-107, 1996.

[11] Miller, B. and Choi, J. Breakpoint and Halting in Distributed Programs. Proc. 8th IEEE Int. Conference on Distributed Computing Systems, San Jose, CA, pp. 316-323, May 1988.

[12] Netzer, R.H.B. and Xu, J. Necessary and Sufficient Conditions for Consistent Global Snapshots. IEEE Transactions on Parallel and Distributed Systems, 6(2):165-169, 1995.

[13] Russell, D.L. State Restoration in Systems of Communicating Processes. IEEE Transactions on Software Engineering, SE6(2):183-194, 1980.

[14] Venkatesh, K., Radhakrishnan, T. and Li, H.L. Optimal Checkpointing and Local Recording for Domino-free Rollback Recovery. Information Processing Letters, 25:295-303, 1987.

[15] Wang, Y.M. Consistent Global Checkpoints That Contain a Given Set of Local Checkpoints. IEEE Transactions on Computers, 46(4):456-468, 1997.

[16] Wang, Y.M. and Fuchs, W.K. Scheduling Message Processing for Reducing Rollback Propagation. Proc. 22nd Symposium on Fault-Tolerant Computing, pp. 204-211, 1992.