Preliminares del Instituto de Matemáticas, UNAM
No. 875
Recursion in distributed computing
Eli Gafni∗
Sergio Rajsbaum†
January 18, 2010
Abstract
The benefits of developing algorithms via recursion are well known. However, recursion has seen little use in distributed algorithms, even though recursive structuring principles for distributed systems have been advocated since the beginning of the field. We present several distributed algorithms in a recursive form, which makes them easier to understand and analyze. Also, we expose several interesting issues arising in recursive distributed algorithms. Our goal is to promote the use and study of recursion in distributed algorithms.
Keywords: shared memory, distributed algorithms, wait-free, renaming, snapshots.
Regular presentation
1 Introduction
The benefits of designing and analyzing sequential algorithms using recursion are well known. Recursive algorithms are discussed in most textbooks, notably in Udi Manber’s book [25]. However, recursion has seen little use in distributed algorithms, even though recursive structuring principles for distributed systems have been advocated since the beginning of the field, e.g. [12, 26]. In this paper we describe simple and elegant recursive distributed algorithms for some important tasks, which illustrate the benefits of using recursion. We hope to convince the reader that thinking recursively is a methodology that facilitates the process of designing a distributed algorithm, at least to obtain a first version of an algorithm that can later be improved.

We consider the following fundamental tasks: snapshots [1], immediate snapshots [6, 29], renaming [4], and swap [2, 31], and present a recursive distributed algorithm for each one. We work with a wait-free shared memory model where any number of processes can fail by crashing.

We propose that studying recursion in a distributed setting is a worthwhile aim, although not without its drawbacks. There is no doubt that recursion should be covered starting with the first-year introductory computer science courses, but it has been suggested that the teaching of recursive programming be postponed until iterative programs are well mastered, as recursion can lead to extremely inefficient solutions, e.g. [30]. In distributed algorithms, a well-known example is the original Byzantine Agreement algorithm [24]. It shows that recursive distributed algorithms can be even more “dangerous” than in the centralized setting. The recursive algorithm looks simple and convincing, yet it took researchers a few years to really understand what it is doing, i.e., to unfold the recursion [5]. Even the seminal distributed spanning tree algorithm of [20], which is not recursive, can be viewed as a recursive algorithm that has been optimized [11]. Thus, one of the goals of this paper is to open the discussion of when and how recursion should be used in distributed algorithms.

∗ Computer Science Department, University of California, Los Angeles. 3731F Boelter Hall, UCLA, LA, CA 90095, USA. [email protected]
† Instituto de Matemáticas, Universidad Nacional Autónoma de México, Ciudad Universitaria, D.F. 04510, Mexico. [email protected]. Partly supported by a DGAPA-UNAM Grant.

There are several interesting issues that appear in recursive distributed algorithms and that do not appear in sequential recursion.

Naming. Consider the classical recursive binary search algorithm. It searches an ordered array for a single element by cutting the array in half with each pass. It picks a midpoint near the center of the array, and compares the data at that point with the data being searched for. If the data is found, it terminates. Otherwise, there are two cases: (1) the data at the midpoint is greater than the data being searched for, and the algorithm is called recursively on the left part of the array, or (2) the data at the midpoint is less than the data being searched for, and the algorithm is called recursively on the right part of the array. Namely, only one recursive call is invoked, either (1) or (2), but not both. In contrast, in a distributed recursive algorithm there are several processes running concurrently, and it is possible that both recursive calls are performed, by different processes. Thus, we need to name the recursive calls so that processes can identify them.

Branching. In sequential computing, recursive functions can be divided into linear and branched ones, depending on whether they make only one recursive call to themselves (e.g. factorial), or more (e.g. fibonacci). In the first case, the recursion tree is a simple path, while in the second, it is a general tree.
The distributed algorithms we present here are all of linear recursion, yet the recursion tree may not be a simple path because, as explained above, different processes may invoke different recursive calls. See Figures 3, 5 and 12.

Iterated memory. As in sequential recursion, each distributed recursive call should be completely independent from the others. Even in the sequential recursive binary search algorithm, each recursive call (at least theoretically) operates on a new copy of the array, either the left side or the right side of the array. In a distributed algorithm, a process has local variables and shared variables. Both should be local to the recursive invocation. Thus, a fresh new copy of the shared memory is associated with each recursive invocation. Recursive distributed algorithms can be used to program an iterated model of computation, where the memory is divided into sections. All processes run by accessing each section at most once, and in the same order. The benefits of working in an iterated model have been exploited in the past, e.g. [11, 13, 16, 22, 27, 28].

Tasks. In the simplest case, the shared memory that is accessed in each iteration can be a single-writer/multi-reader shared array. Namely, the recursive distributed algorithm writes to the array, reads each of its elements, and after some local computation, either produces an output or invokes the algorithm recursively. Assuming at least one process decides, fewer processes participate in each recursive invocation (on a new copy of the array), until at the bottom of the recursion only one process participates and decides. More generally, in each recursive invocation, processes communicate through a more powerful shared memory. We need to know only the specification of this memory, and not how it is implemented; we specify this shared memory by the task (see Section 2) it implements.
Inductive reasoning. As there are no “side effects,” we can logically imagine all processes going in lockstep from task to task, varying only the order in which they invoke each task; they all return from task S_1 before any of them invokes task S_2. First, this structured set of executions facilitates inductive reasoning, and greatly simplifies the understanding of distributed algorithms. Second, if we have a description of S_1 as a topological complex X_1, and of S_2 as X_2, we know that the iterated executions accessing S_1 and then S_2 have a simple topological description: replacing each simplex of X_1 by X_2 [8, 23]. We include Figures 6, 8 and 10 to illustrate this point.

While recursion in sequential algorithms is well understood, in the distributed case we do not know much. For example, we do not know if in some cases side effects are unavoidable, and persistent shared objects are required to maintain information during recursive calls. We do not know if every recursive algorithm can be unfolded and executed in an iterated model (different processes could invoke tasks in different order).
2 Model
We assume a shared memory distributed computing model, with processes Π = {p_1, . . . , p_n}, in an asynchronous wait-free setting, where any number of processes can crash.

Tasks. Processes communicate through shared objects that can be invoked at most once by each process. The input/output specification of an object is given in terms of a task, defined by a set of possible input vectors, a set of possible output vectors, and an input/output relation ∆. A more formal definition of a task is given e.g. in [23]. Process p_i can invoke an instance of a task (an object implementing the task) once, with the instruction Task_tag(x), and eventually gets back an output value, such that the vector of outputs satisfies the task’s specification. The subindex tag identifies the particular instance of the task, and x is the input to the task. The most basic task we consider is the write/scan task, which is an abstraction of a shared array SM[1..n]. When process p_i invokes Task_SM(x), the value x is written to SM[i] and then p_i reads SM[j], for each j, in arbitrary order.

Also, we are interested in designing distributed algorithms that solve a given task. The processes start with private input values, and must eventually decide on output values. For an execution where only a subset of processes participate, an input vector I specifies in its i-th entry, I[i], the input value of each participating process p_i, and ⊥ in the entries of the other processes. Similarly, an output vector O contains a decision value O[i] for each participating process p_i, and ⊥ elsewhere. If an algorithm solves the task, then O should be in ∆(I). In an inputless task, process p_i has only one possible input value, namely i.

A famous task is consensus. Each process proposes a value, and the correct processes have to decide the same value, which has to be a proposed value. This task is not wait-free solvable using read/write registers only [14]. We introduce other tasks as we discuss them.
Recursive distributed algorithms. Our recursive distributed algorithms are in the canonical form represented in Figure 1. Initially, there are n processes invoking Generic_{tag,n}(x), where x is an input parameter and tag identifies a specific instance of the algorithm; initially, tag is typically 0 or the empty set. Processes communicate with each other through shared objects that solve some task; process p_i invokes Task(i), and eventually gets back a value that it stores in a local variable val. For example, in Section 4 the task will be write/scan; namely, processes communicate by writing to a shared array, and then reading the components of the array one by one. In other sections processes communicate through other tasks. Notice that the name Task is local to the current invocation of Generic: different invocations of Generic invoke different instances of the task.

Algorithm Generic_{tag,n}(x);
(1) val ← Task(i);
(2) if P(val) then return f(val);
(3) else
(4)     tag ← g(val); x ← h(val);
(5)     Generic_{tag,n−1}(x)
Figure 1: The Generic algorithm (code for p_i)

In line 1 process p_i invokes the task, and stores the output it gets back from the task in val. Then, in line 2, the process computes a predicate P, to decide if it can compute an output value and terminate the algorithm. The output value is computed with a function f. Notice that P and f are specific to the particular recursive algorithm, and their inputs can be any of the local variables, such as val, x or i (although the code specifies only val for simplicity). In our recursive algorithms, one or more processes may terminate in line 2, but in any case, at most n − 1 processes will recursively invoke Generic_{tag,n−1}(x) in line 5.

It is important to notice that different processes may want to invoke different instances of the recursive algorithm. That is why we need tag, so that processes can identify which instance they want to invoke. For this purpose, in line 4 process p_i computes a new value for tag using a function g, and a new value for the input parameter x using a function h (as before, these two functions are computed using val and possibly other local variables).
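The canonical form of Figure 1 can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the helper `make_wscan_task` is a hypothetical stand-in in which a write followed by a scan is a single atomic step, instances are keyed by the pair (tag, n), the predicate P also receives n, and the three processes run one after the other rather than concurrently.

```python
def make_wscan_task():
    """Hypothetical stand-in for the write/scan task: one set of cells per instance."""
    arrays = {}

    def task(instance, i):
        cells = arrays.setdefault(instance, set())  # fresh memory for a new instance
        cells.add(i)                                # write
        return frozenset(cells)                     # scan (atomic here, a simplification)

    return task


def generic(i, tag, n, x, task, P, f, g, h):
    """Code for process i, following the five lines of Figure 1."""
    val = task((tag, n), i)                  # line 1: invoke this instance of the task
    if P(val, n):                            # line 2: can we decide?
        return f(val)
    tag, x = g(val, tag), h(val, x)          # line 4: name and input of the next call
    return generic(i, tag, n - 1, x, task, P, f, g, h)   # line 5


# Instantiation with an empty tag, P = "the view has n ids", f = identity.
task = make_wscan_task()
views = {
    i: generic(i, (), 3, i, task,
               P=lambda val, n: len(val) == n,
               f=lambda val: val,
               g=lambda val, tag: tag,
               h=lambda val, x: x)
    for i in (1, 2, 3)   # sequential schedule: process 1 runs alone, then 2, then 3
}
```

In this sequential schedule process 1 descends to the bottom of the recursion, process 2 stops one level above it, and process 3 decides in the first call.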
3 Two examples: linear and binary recursion
We describe a linear branching instance of the generic recursive algorithm in Section 3.1, that solves immediate snapshots, as explained in Section 4. Then we present a binary branching version in Section 3.2, that solves renaming, as explained in Section 5.1. A multi-way branching version that solves renaming using immediate snapshots is in Section 5.2, and a more general multi-way branching version that solves swap is in Section 6.
3.1 Linear recursion
Consider the Generic algorithm of Figure 1 for the case where tag is always the empty set (g returns empty), the task is write/scan, invoked with WScan, and its output is stored in local variable view. The algorithm, called IS, is in Figure 2. The predicate P consists of checking if view contains all n ids, and f is the identity function. The last process to write sees n such values, i.e., |view| = n; it returns view and terminates the algorithm. Namely, in each recursive call of the algorithm at least one process terminates the algorithm, but perhaps more than one (all processes that see |view| = n). When n = 1, the single process invoking the algorithm returns a view that contains only itself.
Algorithm IS_n(i);
(1) view ← WScan(i);
(2) if |view| = n then return view
(3) else IS_{n−1}(i)

Figure 2: Linear recursion (code for p_i)
[Figure 3 diagram: processes 1, 2, 3 invoke IS_3 and process 3 outputs 1,2,3; processes 1, 2 invoke IS_2 and process 2 outputs 1,2; process 1 invokes IS_1 and outputs 1.]
Figure 3: Linear recursion, for 3 processes 1, 2, 3

As we shall see in Section 4, this algorithm solves the immediate snapshot task. For now, we describe only its behavior as linear recursion. Namely, the recursion tree is a simple path; one execution of the algorithm is illustrated in Figure 3 for 3 processes, 1, 2 and 3. In Figure 3 all three processes invoke IS_3, each one with its own id. In this example, only process 3 sees all three values after executing WScan, and exits the algorithm with view {1, 2, 3}. Processes 1 and 2 invoke IS_2, and with it a new task instance through WScan. Namely, in an implementation of the write/scan task with a shared array, the first invocation IS_3 uses a shared array SM_1, the second invocation IS_2 uses a fresh shared array SM_2, and finally, when process 1 invokes IS_1 alone, it uses a shared array SM_3, sees only itself, and terminates with view {1}.

A process p_i that returns a view with |view| = k executes Θ(n(n − k + 1)) read/write steps. Indeed, process p_i returns view during the invocation of IS_k(i). Thus, it executed a total of n − k + 1 task invocations, one for each recursive call, starting with IS_n(i). Each invocation of the task involves one write and n read steps.
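The behavior of algorithm IS can be explored with a small simulation. The following is a minimal sketch, not an implementation from the paper: each process is a Python generator that yields after every read or write, a seeded random scheduler (the hypothetical function `run`) interleaves them, and crashes are not modeled. The dictionary `memories` plays the role of the fresh arrays SM_1, SM_2, ..., one per recursion level, and `steps` counts read/write steps so that the count n − k + 1 task invocations can be checked.

```python
import random


def wscan(memory, i, N, steps):
    """Write/scan task: write cell i, then read all N cells one by one."""
    memory[i] = i
    steps[i] += 1
    yield                          # another process may be scheduled here
    view = set()
    for j in range(1, N + 1):
        if j in memory:            # read cell j
            view.add(j)
        steps[i] += 1
        yield
    return view


def IS(i, n, N, memories, steps):
    """Figure 2, code for process i; memories[n] is the fresh array of level n."""
    view = yield from wscan(memories.setdefault(n, {}), i, N, steps)
    if len(view) == n:
        return view
    return (yield from IS(i, n - 1, N, memories, steps))


def run(N, seed):
    """Run N processes to completion under a seeded random interleaving."""
    rng = random.Random(seed)
    memories, steps = {}, {i: 0 for i in range(1, N + 1)}
    procs = {i: IS(i, N, N, memories, steps) for i in range(1, N + 1)}
    views = {}
    while procs:
        i = rng.choice(sorted(procs))      # pick the next process to take a step
        try:
            next(procs[i])
        except StopIteration as stop:      # the process returned its view
            views[i] = stop.value
            del procs[i]
    return views, steps
```

Calling `run(4, seed)` for different seeds produces different interleavings; in each of them a process that returns a view of size k has executed (n + 1)(n − k + 1) steps, in line with the bound above.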
3.2 Binary branching
Let us now consider a small variation of algorithm IS, called algorithm BR_{tag,n}, shown in Figure 4, where tag ∈ {L, R, ∅}. The first time, the algorithm is invoked by all n processes with tag equal to ∅. Recursive invocations are made with smaller values of n each time, and with tag equal to L or to R, until at the bottom of the recursion the algorithm is invoked with n = 1 by only one process, which returns with a view that contains only its own id. In line 3, after seeing all n processes, process p_i checks if its own id is the largest it saw in view. If so, it terminates the algorithm. If it is not the largest id, it recursively invokes (line 4) an instance of BR identified by tag = R and size n − 1 (at most n − 1 processes invoke it). Line 5 is executed by processes that saw fewer than n ids in the views they obtained in line 1; they all invoke the same instance of BR, now identified by tag = L and size n − 1 (at most n − 1 processes invoke it).

Algorithm BR_{tag,n}(i);
(1) view ← WScan(i);
(2) if |view| = n then
(3)     if i = max view then return view;
(4)     BR_{R,n−1}(i);
(5) else BR_{L,n−1}(i)
Figure 4: Branching recursion algorithm (code for p_i)

This time the recursive structure is a tree, not a path. An execution of the algorithm for 4 processes is illustrated in Figure 5. In the first recursive call, processes 3 and 4 see only themselves, and hence invoke BR_{L,3}, while processes 1 and 2 see all 4 processes, and as neither is the largest among them, they both invoke BR_{R,3}. The rest of the figure is self-explanatory.

Here we are not interested in the task solved by Algorithm BR, only in that it has the same recursive structure as the renaming algorithm of Section 5, and hence the same complexity. Each process executes at most n recursive calls, and hence at most n invocations of a write/scan task. Thus, the total number of read and write steps by a process is O(n^2).
4 Snapshots
In this section we describe an immediate snapshot [6, 29] recursive algorithm. As the specification of the snapshot [1] task is a weakening of the immediate snapshot task, the algorithm solves the snapshot task as well.

Immediate snapshot task. An immediate snapshot task IS abstracts a shared array SM[1..n] with one entry per process. The array is initialized to [⊥, . . . , ⊥], where ⊥ is a default value that cannot be written by a process. Intuitively, when a process p_i invokes the task, it is as if it instantaneously executes a write operation followed by a snapshot of the whole shared array. If several processes invoke the task simultaneously, then their corresponding write operations are executed concurrently, followed by a concurrent execution of their snapshot operations. For each p_i, the result of the invocation satisfies the three following properties, where we assume (without loss of generality) that i is the value written by p_i, and sm_i is the set of values, or view, it gets back from the task. If SM[k] = ⊥, the value k is not added to sm_i. We define sm_i = ∅ if the process p_i never invokes the task. These properties are:
• Self-inclusion. ∀i : i ∈ sm_i.
• Containment. ∀i, j : sm_i ⊆ sm_j ∨ sm_j ⊆ sm_i.
• Immediacy. ∀i, j : i ∈ sm_j ⇒ sm_i ⊆ sm_j.
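The three properties translate directly into a checker. The sketch below assumes, as an illustrative convention, that a run is summarized by a dictionary `views` mapping each process id i that obtained an output to its view sm_i as a Python set; the function names are hypothetical.

```python
def is_snapshot(views):
    """Self-inclusion and containment, the two properties required of a snapshot."""
    self_inclusion = all(i in sm_i for i, sm_i in views.items())
    containment = all(a <= b or b <= a
                      for a in views.values() for b in views.values())
    return self_inclusion and containment


def is_immediate_snapshot(views):
    """Snapshot properties plus immediacy: i in sm_j implies sm_i is a subset of sm_j."""
    immediacy = all(views[i] <= sm_j
                    for j, sm_j in views.items()
                    for i in sm_j
                    if i in views)      # ids without an output impose no constraint
    return is_snapshot(views) and immediacy


# Views 1,{1}; 2,{1,2}; 3,{1,2,3}: the processes run one after the other.
sequential = {1: {1}, 2: {1, 2}, 3: {1, 2, 3}}
# Here 2 is in sm_1 but sm_2 is not contained in sm_1, so immediacy fails.
not_immediate = {1: {1, 2}, 2: {1, 2, 3}, 3: {1, 2, 3}}
```

The second example satisfies self-inclusion and containment, so it is a legal output of a snapshot task but not of an immediate snapshot task.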
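[Figure 5 diagram: recursion tree for processes 1, 2, 3, 4, with root BR_4 and nodes labeled BR_{L,3}, BR_{R,3}, BR_{L,2}, BR_{L,1} and BR_{R,1}.]
Figure 5: Branching recursion, for 4 processes

[Figure 6 diagram: subdivided triangle whose vertexes are labeled 1,{1}; 1,{1,2}; 1,{1,3}; 1,{1,2,3}; 2,{2}; 2,{1,2}; 2,{2,3}; 2,{1,2,3}; 3,{3}; 3,{1,3}; 3,{2,3}; 3,{1,2,3}.]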
Figure 6: All immediate snapshot subdivision views, for 3 processes

The immediacy property can be rewritten as ∀i, j : i ∈ sm_j ∧ j ∈ sm_i ⇒ sm_i = sm_j. Thus, concurrent invocations of the task obtain the same view. A snapshot task is required to satisfy only the first two properties.

The set of all possible views obtained by the processes after invoking an immediate snapshot task can be represented by a complex, consisting of sets called simplexes, with the property that if a simplex is included in the complex, so are all its sub-simplexes. The set of all possible views, for 3 processes, is represented in Figure 6. Each vertex is labeled with a pair i, sm_i. The simplexes are the triangles, the edges, and the vertexes. The corners of each simplex are labeled with compatible views, satisfying the three previous properties. In the case of 4 processes, the complex would be 3-dimensional, including sets up to size 4, and so on for any number of processes. For more details about complexes and their use in distributed computing, see e.g. [23].
Recursive algorithm. A wait-free algorithm solving the immediate snapshot task was described in [6]. We include it in Appendix A for comparison. We encourage the reader to try to come up with a correctness proof before reading the recursive version, where the proof will be trivial. Actually, algorithm IS of Figure 2 solves the immediate snapshot task.

Theorem 1 Algorithm IS_n solves the immediate snapshot task for n processes in O(n^2) steps. Moreover, a process obtains a snapshot of size k from the algorithm in Θ(n(n − k + 1)) steps.

Proof: The complexity was analyzed in Section 3.1. Here we prove that IS solves the immediate snapshot task. Let S be the set of processes that terminate the algorithm in line 2, namely, with |view| = n. Each p_i ∈ S terminates the algorithm with a view sm_i that contains all processes. Thus, for each such p_i, the self-inclusion property holds. Also, for any two p_i, p_j in S, we have sm_i = sm_j, and the properties of containment and immediacy hold. By induction hypothesis, the three properties also hold for the other processes, which recursively call IS_{n−1}. It remains to show that the last two properties of the immediate snapshot task hold for a p_i ∈ S and a p_j ∉ S. First, containment holds: clearly sm_j ⊂ sm_i, because sm_i contains all processes while sm_j does not contain i, as p_i does not participate in the recursive call. Finally, immediacy holds: j is in sm_i (p_i sees every participating process), and we have already seen that sm_j ⊆ sm_i. In the other direction, it is impossible that i ∈ sm_j, because p_i does not participate in the recursive call. □
5 Renaming
In the renaming task [4], each of n processes, which can only compare their ids, must choose one of 2n − 1 new distinct names, called slots. It was proved in [23] that renaming is impossible with fewer than 2n − 1 slots, except for some special values of n [10]. The algorithm of [4] solves the problem with 2n − 1 slots, but is of exponential complexity [15]. Later on, [7] presented a recursive renaming algorithm based on immediate snapshots, of O(n^3) complexity. We restate this algorithm in Section 5.2, and show that its complexity is actually O(n^2). Also, we present a new renaming algorithm in Section 5.1 that is not based on immediate snapshots, also of O(n^2) complexity; it is very simple, but requires knowledge of n.

To facilitate the description of a recursive algorithm, the slots are specified by an integer First and a value Direction ∈ {−1, +1}: they are the integers in the range First + [0..2n − 2] if Direction = +1, or in the range First + [−(2n − 2)..0] if Direction = −1. Combining the two, we get slots First + Direction ∗ [0..2n − 2]. Thus, the number of slots is 2n − 1 as required; i.e., |Last − First| + 1 = 2n − 1, defining Last = First + Direction ∗ (2n − 2).
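As a small worked illustration of this slot range (the function names `slots` and `last_slot` are hypothetical, introduced only for this sketch):

```python
def slots(first, direction, n):
    """The 2n - 1 slots First + Direction * [0..2n - 2], listed starting at First."""
    return [first + direction * k for k in range(2 * n - 1)]


def last_slot(first, direction, n):
    """Last = First + Direction * (2n - 2), the slot farthest from First."""
    return first + direction * (2 * n - 2)


upward = slots(1, +1, 3)      # three processes renaming upward from slot 1
downward = slots(4, -1, 2)    # two processes renaming downward from slot 4
```

Here `upward` is the range 1..5 and `downward` lists 4, 3, 2; in both cases the last element of the list equals `last_slot` for the same arguments.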
5.1 Binary branching renaming algorithm
The algorithm has exactly the same structure as the binary branching algorithm of Figure 4, using WScan to partition the set of processes into two subsets, and then solving renaming recursively on each subset. The algorithm Renaming_n(First, Direction) is in Figure 7. It uses tags in {←, →}, to represent the intuition of renaming “going down” and “going up” as indicated by the value of Direction. Given First and Direction, the algorithm is invoked by k processes, where k ≤ n, and each process decides on a slot in the range First + Direction ∗ [0..2k − 2]. The algorithm ends in line 4, with a slot selected.
In the algorithm, the processes first write and scan a shared array, in line 1. According to the size of the view they get back, they are partitioned into 2 sets: the processes that get back a view of size n, and the processes that get back a view of size less than n.

If only k < n processes invoke the algorithm then, of course, nobody can get a view of size n. In this case they all go to line 6 and solve the problem recursively, executing Renaming^→_{n−1}(First, Direction). Thus, such recursive calls will be repeated until k = n. The variant of the algorithm described in Section 5.2 avoids these repeated calls, using immediate snapshots instead of write/scans.

Consider the case of k = n processes invoking the algorithm. In this case, some processes will get a view of size n in line 1. If one of these, say p_i, sees that it has the largest id i in its view S_i (line 4), it terminates the algorithm with slot Last. The other processes that get a view of size n, the set Y, proceed to solve the problem recursively in line 5, renaming from slot Last − 1 down (reserving slot Last in case it was taken by p_i), by calling Renaming^←_{n−1}(Last − 1, −Direction). The processes that get a view of size less than n, the set X, solve the problem recursively in line 6, going up from position First, by calling Renaming^→_{n−1}(First, Direction). Thus, we use the arrow in the superscript to distinguish the two distinct recursive invocations (of the same code). The correctness of the algorithm, in Theorem 2, is a simple counting argument, which consists of the observation that the two ranges, going down and up, do not overlap.

Algorithm Renaming_n(First, Direction);
(1) S_i ← WScan(i);
(2) Last ← First + Direction ∗ (2n − 2);
(3) if |S_i| = n then
(4)     if i = max S_i then return Last;
(5)     Renaming^←_{n−1}(Last − 1, −Direction);
(6) else Renaming^→_{n−1}(First, Direction)
Figure 7: Write/scan binary branching renaming (code for p_i)

Theorem 2 Algorithm Renaming_n solves the renaming task for n processes, in O(n^2) steps.

Proof: Clearly, the algorithm terminates, as it is called with smaller values of n in each recursive call, until n = 1, when it necessarily terminates. A process executes at most n recursive calls, and in each one it executes a write/scan. Each write/scan involves O(n) read and write steps, for a total complexity of O(n^2).

At the basis of the recursion, n = 1, we have |S_i| = 1 in line (1), so the algorithm terminates correctly in line (4), renaming into 1 slot, as Last = First when n = 1. Assume the algorithm is correct for k less than n. The induction hypothesis is that when k′ processes, k′ ≤ k, invoke Renaming_k(First, Direction), they get new names in the range First + Direction ∗ [0..2k′ − 2].

Now, assume the algorithm is invoked as Renaming_n(First, Direction), with n > 1, by k ≤ n processes. Let X be the set of processes that get a view smaller than n in line (1), |X| = x. Notice that 0 ≤ x ≤ n − 1. If k < n then all get a view smaller than n, and they all return from Renaming_{n−1}(First, Direction) in line (6), terminating correctly. So for the rest of the proof, assume k = n. Let Y be the set of processes that get a view of size n in line (1), excluding the process of largest id, |Y| = y. Thus, 0 ≤ y ≤ n − 1.
[Figure 8 diagram: subdivided triangle with vertexes labeled by the process ids 1, 2, 3; two of its triangles are marked x and y.]
Figure 8: Renaming subdivision, for 3 processes

The processes in X return from Renaming_{n−1}(First, Direction) in line (6), with slots in the range First + [0..2x − 2]. The processes in Y return from Renaming_{n−1}(Last − 1, (−1) ∗ Direction) in line (5), with slots in the range [Last − 1 − (2y − 2)..Last − 1]. To complete the proof we need to show that First + 2x − 2 < Last − 1 − (2y − 2). Recall Last = First + Direction ∗ (2n − 2); the ranges above are written for Direction = +1. Thus, we need to show that 2x − 2 < 2n − 2 − 1 − (2y − 2), that is, 2(x + y) − 2 < 2n − 2 − 1 + 2. As x + y ≤ n, it suffices that 2n − 2 < 2n − 2 − 1 + 2, and we are done. □
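The counting argument can be exercised with a simulation of Figure 7. The sketch below is not the authors' code and rests on several assumptions that should be read carefully: processes are generators scheduled by a seeded random scheduler (the hypothetical function `run`), crashes are not modeled, and each recursive instance is named by the full sequence of arrows leading to it, written as a string of "<" and ">", so that instances reached along different branches never share memory. In addition, line 5 is written with `last - direction` rather than Last − 1; the two coincide when Direction = +1, and the former is our reading of how the reserved slot Last is skipped when Direction = −1.

```python
import random


def wscan(memory, i, N):
    """Write cell i of this instance's array, then read all N cells one by one."""
    memory[i] = True
    yield
    view = set()
    for j in range(1, N + 1):
        if j in memory:
            view.add(j)
        yield
    return view


def renaming(i, n, first, direction, tag, N, memories):
    """Figure 7, code for process i; tag is the arrow path naming this instance."""
    S = yield from wscan(memories.setdefault(tag, {}), i, N)     # line 1
    last = first + direction * (2 * n - 2)                       # line 2
    if len(S) == n:                                              # line 3
        if i == max(S):                                          # line 4
            return last
        # line 5, with last - direction (assumption, see the lead-in)
        return (yield from renaming(i, n - 1, last - direction, -direction,
                                    tag + "<", N, memories))
    # line 6
    return (yield from renaming(i, n - 1, first, direction, tag + ">", N, memories))


def run(N, seed):
    """All N processes invoke Renaming_N(1, +1); returns the slot of each one."""
    rng = random.Random(seed)
    memories = {}
    procs = {i: renaming(i, N, 1, +1, "", N, memories) for i in range(1, N + 1)}
    names = {}
    while procs:
        i = rng.choice(sorted(procs))
        try:
            next(procs[i])
        except StopIteration as stop:
            names[i] = stop.value
            del procs[i]
    return names
```

For every interleaving generated this way, the slots returned by `run(N, seed)` are pairwise distinct and lie in the range 1..2N − 1.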
Notice that the renaming algorithm produces views that are immediate snapshots, obtaining the subdivision depicted in Figure 8. Vertexes are labeled with ids only, to avoid cluttering the figure. The immediate snapshot views for the 2-simplexes x and y are as follows. For x, the views are: ⟨1, {1, 2, 3}, {1, 2}⟩, that is, p_1 sees all 3 processes in its first immediate snapshot, and sees both itself and p_2 in the second; ⟨2, {1, 2, 3}, {2}⟩, that is, p_2 sees all 3 processes in its first immediate snapshot, and sees only itself in the second; ⟨3, {3}⟩, that is, p_3 sees only itself in its first immediate snapshot, and hence does not participate in the second. Similarly, for the simplex y the views are: ⟨1, {1, 3}, {1}⟩, that is, p_1 sees itself and p_3 in its first immediate snapshot, and sees only itself in the second; ⟨2, {1, 2, 3}, {2}⟩, that is, the same view as in x; ⟨3, {1, 3}, {1, 3}⟩, that is, p_3 sees itself and p_1 in both immediate snapshots.
5.2 A multi-way branching renaming algorithm
In the previous Renaming_n algorithm, the set of processes X that get back an immediate snapshot of size less than n will waste recursive calls, calling the algorithm again and again (with the same values for First and Direction, but smaller values of n) until n goes down to n′, with n′ = |X|. In this recursive call, Renaming_{n′}, the processes that get back a snapshot of size n′ will do something interesting; that is, one might get a slot, and the others will invoke Renaming recursively, but in the opposite direction. Therefore, using immediate snapshots, we can rewrite the Renaming algorithm in the equivalent form of Figure 9. This is the renaming algorithm presented in [7].

Algorithm isRenaming_tag(First, Direction);
(1) S_i ← Immediate Snapshot(i);
(2) Last ← First + Direction ∗ (2|S_i| − 2);
(3) if i = max S_i then return Last;
(4) isRenaming_{tag·|S_i|}(Last − 1, −Direction)
Figure 9: Immediate snapshot multi-way branching renaming (code for p_i)

Notice that in isRenaming_tag the subindex tag is a sequence of integers: in line 4, the new tag tag · |S_i| is the old tag with |S_i| appended at the end. These tags are a way of naming the recursive calls, so that a set of processes that should participate in the same execution of isRenaming can identify it using the same tag. The first time isRenaming is called, tag should take some default value, say the empty set.

We have seen in Figure 8 the views for the 3-process runs of the Renaming algorithm. For comparison, the views for the isRenaming version appear in Figure 10. The difference is that when 2 processes see only each other in their first snapshot, one of them will obtain a slot, and they do not participate in an immediate snapshot algorithm together. The case of triangle z is interesting. It represents a run where the views in the first immediate snapshot are: ⟨1, {1, 2, 3}⟩, that is, p_1 sees all 3 processes in its first immediate snapshot, while for the other two, the views are ⟨2, {2, 3}⟩ and ⟨3, {3}⟩. Then, processes will invoke isRenaming recursively, but each one by itself; p_1 will invoke isRenaming_3, p_2 will invoke isRenaming_2, and p_3 will not invoke it, as it directly obtains a slot. It computes 3 = max S_3, with S_3 = {3}, and returns Last = First + Direction ∗ (2|S_3| − 2) = 1 (assuming First = 1), while p_2 returns slot 2, and p_1 returns slot 4.

Theorem 3 Algorithm isRenaming solves the renaming task for n processes, in O(n^2) steps.

Proof: As explained above, the algorithm is equivalent to Renaming, and hence correctness follows from Theorem 2. Now, to show that the complexity is O(n^2) steps we do an amortized analysis, based on Theorem 1: a process obtains a snapshot of size s from the Immediate Snapshot_n algorithm in Θ(n(n − s + 1)) steps. Assume a process runs isRenaming until it obtains a slot, invoking the algorithm recursively k times.
In the i-th call, assume it gets a snapshot of size s_i (in line 1). For example, if s_1 = n, then k = 1, and using the result of Theorem 1, the number of steps executed is n. In general, the number of steps executed by a process is n times

[n − s_1] + [(n − s_1) − (n − s_2)] + [(n − s_2) − (n − s_3)] + · · · + [(n − s_{k−1}) − (n − s_k)],

which gives a total of O(n(n − s_k)). □
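A simulation of Figure 9 makes the role of the tags concrete. This is a sketch under stated assumptions, not the algorithm's reference implementation: the immediate snapshot task is treated as a primitive that the hypothetical scheduler `run` realizes by repeatedly picking a random block of processes waiting on the same tag, letting all of them write, and then giving all of them the same view; crashes are not modeled; tags are tuples of integers; and, as an assumption of this sketch, line 4 uses `last - direction`, which equals Last − 1 when Direction = +1.

```python
import random


def is_renaming(i, first, direction, tag=()):
    """Figure 9, code for process i; yields the tag of the instance it invokes."""
    S = yield tag                                   # line 1: Immediate Snapshot(i)
    last = first + direction * (2 * len(S) - 2)     # line 2
    if i == max(S):                                 # line 3
        return last
    # line 4: the new tag is the old one with |S| appended
    return (yield from is_renaming(i, last - direction, -direction, tag + (len(S),)))


def run(n, seed):
    """All n processes invoke isRenaming(1, +1) with the empty tag."""
    rng = random.Random(seed)
    procs = {i: is_renaming(i, 1, +1) for i in range(1, n + 1)}
    pending = {i: next(p) for i, p in procs.items()}   # id -> tag it is waiting on
    written = {}                                       # tag -> ids written so far
    names = {}
    while pending:
        tag = rng.choice(sorted(set(pending.values())))
        waiting = sorted(i for i, t in pending.items() if t == tag)
        block = rng.sample(waiting, rng.randint(1, len(waiting)))
        written.setdefault(tag, set()).update(block)   # the whole block writes ...
        view = frozenset(written[tag])                 # ... then snapshots together
        for i in block:
            try:
                pending[i] = procs[i].send(view)
            except StopIteration as stop:
                names[i] = stop.value
                del pending[i]
    return names
```

Processes that obtain views of the same size in the same instance obtain the same view, so they compute the same new tag and meet again in the same recursive instance, while processes with views of different sizes separate.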
[Figure 10 diagram: subdivided triangle with vertexes labeled by the process ids 1, 2, 3; one of its triangles is marked z.]
Figure 10: IS renaming subdivision, for 3 processes

6 SWAP

Here we consider two tasks, Tournament_π and Swap_π. A process can invoke these tasks with its input id, where π is the id of a process that does not invoke the task, or 0, a special virtual id. Each process gets back another process id, or π. A process never gets back its own id. Exactly one process gets back π. We think of this process as the “winner” of the invocation. The induced digraph consists of all arcs i → j, such that process i received process j as output from the task; it is guaranteed that the digraph is acyclic. We say that j is the parent of i. As every vertex has exactly one outgoing arc, except for the root, π, which has none, there is exactly one directed path from each vertex to the root. The Swap task always returns a directed path, while the Tournament task can return any acyclic digraph.

Afek et al. [2, 31] noticed that these two tasks cannot be solved using read/write registers only, but can be solved if 2-consensus tasks are also available, namely, tasks that can be used to solve consensus for 2, but not for 3 processes.¹ They presented a wait-free algorithm that solves Swap using read/write registers and Tournament tasks. The following is a recursive version of this algorithm, of the same complexity.

The Swap algorithm is in Figure 11. Process p_i invokes the algorithm with Swap_tag(i), where tag = 0. In line 1 process i invokes Tournament_tag(i) and in case it wins, i.e., gets back tag, it returns tag. By the specification of the tournament task, one, and only one, process wins. All others recursively invoke a Swap task: all processes with the same parent π invoke the same Swap_π(i) task.

Algorithm Swap_tag(i);
(1) π ← Tournament_tag(i);
(2) if tag = π then return π;
(3) else Swap_π(i)
Figure 11: The Swap algorithm (code for $p_i$)

Assuming that initially $\pi_i = 0$ for each process $p_i$, we get an invariant: the induced digraph (arcs from $p_i$ to $\pi_i$) is a directed tree rooted at 0. Initially, all processes point to 0 directly. Each time Tournament$_{tag}$ is executed, it is executed by a set of processes pointing to $tag$. The result of the execution is that exactly one process $p$ keeps on pointing to $tag$, while the others induce a directed graph rooted at $p$.

¹ The task of Afek et al. is not exactly the same as ours; for instance, they require a linearizability property, stating that the winner is first.

[Figure 12: Branching deterministic recursion, for 5 processes]

An execution for 5 processes appears in Figure 12. Notice that although it is a tree, it has a unique topological order, as opposed to the tree of Figure 5. The result of executing Tournament$_0$ is that 1 wins, 2, 3, 4 point to 1, while 5 points to 3. The result of executing Tournament$_1$ is that 2 wins, while 3, 4 point to 2. The result of executing Tournament$_2$ is that 3 wins, while 4 points to 3. The result of executing Tournament$_3$ is that 4 wins, while 5 points to 4. Finally, 5 executes Tournament$_4$ by itself and wins. Thus process 1 outputs 0, process 2 outputs 1, process 3 outputs 2, process 4 outputs 3, and process 5 outputs 4.

For the following theorem, we count as a step either a read or write operation, or a 2-consensus operation, and assume a Tournament task can be implemented using $O(n)$ steps [2, 31].

Theorem 4 Algorithm Swap solves the swap task for $n$ processes, in $O(n^2)$ steps.

Proof: The algorithm terminates, as in each invocation one process returns, namely the winner of the Tournament. Also, the basis is easy: when only one process invokes the Tournament, it is necessarily the winner. Assume inductively that the algorithm solves swap for fewer than $n$ processes. Consider an invocation of Swap, where the winner in line 1 is some process $p$. Consider the set $W$ of processes that get back $p$ in this line. Every other process will get back a descendant of these processes. Thus, the only processes that invoke Swap$_{tag}$ with $tag = p$ are the processes in $W$. Moreover, it is impossible that two processes return the same tag, say $tag = x$, because a process that returns $x$ does so during the execution of Swap$_x$, after winning the Tournament invocation, and only one process can win this invocation. For the step bound, a process takes part in at most $n$ Tournament invocations, one per distinct tag it uses, and each invocation takes $O(n)$ steps. □
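The behaviour of the Swap algorithm can be explored with a small sequential simulation. The following is only a sketch under stated assumptions, not the implementation of [2, 31]: each Tournament invocation is modelled as a single atomic step on a hypothetical `Tournament` object in which the first arrival wins and every later arrival is pointed to a randomly chosen earlier arrival; the tail recursion of line 3 is unrolled into a loop so that invocations of different processes can be interleaved by a random scheduler. The names `Tournament`, `run_swap` and `is_directed_path` are introduced here for illustration.

```python
import random


class Tournament:
    """Sequential stand-in for a Tournament_tag task (an assumption of this sketch)."""

    def __init__(self, tag, rng):
        self.tag = tag
        self.arrived = []  # ids of processes that already invoked this task
        self.rng = rng

    def invoke(self, i):
        if not self.arrived:
            result = self.tag  # the first arrival wins and gets the tag back
        else:
            # every loser is pointed to some earlier arrival, so the digraph is acyclic
            result = self.rng.choice(self.arrived)
        self.arrived.append(i)
        return result


def run_swap(n, seed=0):
    """Run Swap_0(i) for processes 1..n under a random interleaving; return i -> output."""
    rng = random.Random(seed)
    tournaments = {}
    tag = {i: 0 for i in range(1, n + 1)}  # every process starts with tag = 0
    output = {}
    while len(output) < n:
        i = rng.choice([p for p in tag if p not in output])
        task = tournaments.setdefault(tag[i], Tournament(tag[i], rng))
        pi = task.invoke(i)          # line (1)
        if pi == tag[i]:             # line (2): i won, it returns the tag
            output[i] = pi
        else:                        # line (3): continue with Swap_pi(i)
            tag[i] = pi
    return output


def is_directed_path(output):
    """Check that the arcs i -> output[i] form one directed path ending at 0."""
    child = {parent: i for i, parent in output.items()}
    if len(child) != len(output):    # two processes returned the same id
        return False
    node, seen = 0, 0
    while node in child and seen < len(output):
        node = child[node]
        seen += 1
    return seen == len(output)


if __name__ == "__main__":
    out = run_swap(5, seed=1)
    print(out, is_directed_path(out))
```

In this model every run should end with the outputs forming a single directed path rooted at 0, as required by the Swap task; the check in `is_directed_path` walks that path starting from the virtual id 0.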
References
[1] Afek Y., Attiya H., Dolev D., Gafni E., Merritt M. and Shavit N., Atomic Snapshots of Shared Memory. Proc. 9th ACM Symposium on Principles of Distributed Computing (PODC’90), ACM Press, pp. 1–13, 1990.
[2] Yehuda Afek, Eytan Weisberger, Hanan Weisman, A Completeness Theorem for a Class of Synchronization Objects (Extended Abstract). ACM PODC 1993: 159–170.
[3] Hagit Attiya, Amotz Bar-Noy, Danny Dolev, Sharing Memory Robustly in Message-Passing Systems. J. ACM 42(1): 124–142 (1995).
[4] Hagit Attiya, Amotz Bar-Noy, Danny Dolev, David Peleg, Rüdiger Reischuk, Renaming in an Asynchronous Environment. J. ACM 37(3): 524–548 (1990).
[5] Amotz Bar-Noy, Danny Dolev, Cynthia Dwork, H. Raymond Strong, Shifting Gears: Changing Algorithms on the Fly to Expedite Byzantine Agreement. Inf. Comput. 97(2): 205–233 (1992).
[6] Borowsky E. and Gafni E., Generalized FLP Impossibility Results for t-Resilient Asynchronous Computations. Proc. 25th ACM Symposium on the Theory of Computing (STOC’93), ACM Press, pp. 91–100, 1993.
[7] Borowsky E. and Gafni E., Immediate Atomic Snapshots and Fast Renaming (Extended Abstract). PODC 1993: 41–51.
[8] Borowsky E. and Gafni E., A Simple Algorithmically Reasoned Characterization of Wait-Free Computations (Extended Abstract). Proc. 16th ACM Symposium on Principles of Distributed Computing (PODC’97), ACM Press, pp. 189–198, August 1997.
[9] Borowsky E., Gafni E., Lynch N. and Rajsbaum S., The BG Distributed Simulation Algorithm. Distributed Computing, 14(3):127–146, 2001.
[10] Armando Castañeda, Sergio Rajsbaum, New combinatorial topology upper and lower bounds for renaming. PODC 2008: 295–304.
[11] Ching-Tsun Chou, Eli Gafni, Understanding and Verifying Distributed Algorithms Using Stratified Decomposition. PODC 1988: 44–65.
[12] John Dobson and Brian Randell, Introduction to “Building Reliable Secure Computing Systems out of Unreliable Insecure Components,” Department of Computing Science, University of Newcastle upon Tyne, Newcastle NE1 7RU, U.K.
[13] Tzilla Elrad, Nissim Francez, Decomposition of Distributed Programs into Communication-Closed Layers. Sci. Comput. Program. 2(3): 155–173 (1982).
[14] Fischer M.J., Lynch N.A. and Paterson M.S., Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM, 32(2):374–382, 1985.
[15] Arie Fouren, Exponential examples for two renaming algorithms. Available at http://www.cs.technion.ac.il/~hagit/publications/expo.ps.gz, Aug. 1999.
[16] Eli Gafni, Round-by-Round Fault Detectors, Unifying Synchrony and Asynchrony (Extended Abstract). PODC 1998: 143–152.
[17] Eli Gafni, The 0-1-Exclusion Families of Tasks. OPODIS 2008: 246–258.
[18] Gafni E., Merritt M. and Taubenfeld G., The Concurrency Hierarchy, and Algorithms for Unbounded Concurrency. Proc. 21st ACM Symposium on Principles of Distributed Computing (PODC’01), ACM Press, pp. 161–169, 2001.
[19] Eli Gafni, Sergio Rajsbaum, Maurice Herlihy, Subconsensus Tasks: Renaming Is Weaker Than Set Agreement. DISC 2006: 329–338.
[20] Robert G. Gallager, Pierre A. Humblet, Philip M. Spira, A Distributed Algorithm for Minimum-Weight Spanning Trees. ACM Trans. Program. Lang. Syst. 5(1): 66–77 (1983).
[21] Herlihy M.P., Wait-Free Synchronization. ACM Transactions on Programming Languages and Systems, 11(1):124–149, 1991.
[22] Maurice Herlihy and Sergio Rajsbaum, The Topology of Shared-Memory Adversaries. 29th ACM Symposium on Principles of Distributed Computing (PODC), Zurich, Switzerland, July 25–28, 2010. To appear.
[23] Herlihy M.P. and Shavit N., The Topological Structure of Asynchronous Computability. Journal of the ACM, 46(6):858–923, 1999.
[24] L. Lamport, R. Shostak, and M. Pease, The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems 4(3): 382–401, 1982.
[25] Udi Manber, Introduction to Algorithms: A Creative Approach, Addison Wesley, 1989.
[26] Randell, Brian, Recursively structured distributed computing systems. IEEE Symposium on Reliability in Distributed Software and Database Systems, 3–11, 1983.
[27] Sergio Rajsbaum, Michel Raynal, Corentin Travers, An impossibility about failure detectors in the iterated immediate snapshot model. Inf. Process. Lett. 108(3): 160–164 (2008).
[28] Sergio Rajsbaum, Michel Raynal, Corentin Travers, The Iterated Restricted Immediate Snapshot Model. COCOON 2008: 487–497.
[29] Saks, M. and Zaharoglou, F., Wait-Free k-Set Agreement is Impossible: The Topology of Public Knowledge. SIAM Journal on Computing, 29(5): 1449–1483, 2000.
[30] Stojmenovic, I., Recursive algorithms in computer science courses: Fibonacci numbers and binomial coefficients. IEEE Trans. on Education, 43(3): 273–276, Aug. 2000.
[31] Hanan Weisman, Implementing shared memory overwriting objects. Master’s thesis, Tel Aviv University, May 1994.
A Non-recursive immediate snapshots algorithm
A wait-free algorithm solving the immediate snapshot task was described in [6]. We describe it here for comparison with the recursive version presented in Section 4. The algorithm is in Figure 13. It uses two arrays of registers denoted REG[1..n] and LEVEL[1..n] (only $p_i$ can write REG[i] and LEVEL[i]). A process $p_i$ first writes its value in REG[i]. The array LEVEL[1..n], initialized to [n + 1, ..., n + 1], can be thought of as a ladder, where initially a process is at the top of the ladder, namely, at level n + 1. Then it descends the ladder, one step after the other, according to predefined rules until it stops at some level (or crashes). While descending the ladder, a process $p_i$ registers its current position in the ladder in the register LEVEL[i]. After it has stepped down from one ladder level to the next one, a process $p_i$ computes a local view, denoted $view_i$, of the progress of the other processes in their descent of the ladder. That view contains the processes $p_j$ seen by $p_i$ at the same or a lower ladder level (i.e., such that $level_i[j] \le$ LEVEL[i]). Then, if the current level $\ell$ of $p_i$ is such that $p_i$ sees at least $\ell$ processes in its view (i.e., processes that are at its level or a lower level), it stops at level $\ell$ of the ladder. Finally, $p_i$ returns the set of indexes determined from the values of $view_i$.

Algorithm Immediate Snapshot(i);
  REG[i] ← i;
  repeat
    LEVEL[i] ← LEVEL[i] − 1;
    for j ∈ {1, ..., n} do level_i[j] ← LEVEL[j] end for;
    view_i ← {j : level_i[j] ≤ LEVEL[i]};
  until (|view_i| ≥ LEVEL[i]) end repeat;
  return({j such that j ∈ view_i})
Figure 13: Non-recursive one-shot immediate snapshot algorithm (code for $p_i$)

It is argued in [6] that this algorithm solves the immediate snapshot task, with $O(n^3)$ complexity.
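The ladder algorithm of Figure 13 can be tried out with a small simulation. The sketch below is an illustration under assumptions that are not part of the paper: each process is a Python generator that yields after every single read or write of a shared register, a random scheduler interleaves these steps, and no process crashes. The names `process`, `run` and `is_immediate_snapshot` are hypothetical. The last function checks the three usual immediate snapshot properties on the returned views: self-inclusion, containment, and immediacy.

```python
import random


def process(i, n, REG, LEVEL):
    """Code of p_i from Figure 13; yields after each shared-register operation."""
    REG[i] = i
    yield
    while True:
        LEVEL[i] -= 1                      # step down one level of the ladder
        yield
        local = {}
        for j in range(1, n + 1):          # read LEVEL[1..n] one register at a time
            local[j] = LEVEL[j]
            yield
        view = {j for j in local if local[j] <= LEVEL[i]}
        if len(view) >= LEVEL[i]:          # enough processes at this level or below
            return view


def run(n, participants, seed=0):
    """Interleave the participants' steps at random; return i -> view_i."""
    rng = random.Random(seed)
    REG = {j: None for j in range(1, n + 1)}
    LEVEL = {j: n + 1 for j in range(1, n + 1)}
    procs = {i: process(i, n, REG, LEVEL) for i in participants}
    views = {}
    while procs:
        i = rng.choice(sorted(procs))
        try:
            next(procs[i])                 # let p_i take one step
        except StopIteration as stop:      # p_i returned its view
            views[i] = stop.value
            del procs[i]
    return views


def is_immediate_snapshot(views):
    for i, vi in views.items():
        if i not in vi:                               # self-inclusion
            return False
        for j, vj in views.items():
            if not (vi <= vj or vj <= vi):            # containment
                return False
            if j in vi and not vj <= vi:              # immediacy
                return False
    return True


if __name__ == "__main__":
    views = run(4, [1, 2, 3, 4], seed=3)
    print(views, is_immediate_snapshot(views))
```

Processes that do not participate stay at level n + 1 and therefore never appear in any view, which is why `run` accepts a subset of `{1, ..., n}` as participants.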