Iterated Shared Memory Models* (Invited Talk)

Sergio Rajsbaum
Instituto de Matemáticas, Universidad Nacional Autónoma de México, D.F. 04510, Mexico
[email protected]

Abstract. In centralized computing we can compute a function by composing a sequence of elementary functions, where the output of the i-th function in the sequence is the input to the (i+1)-st function in the sequence. This computation is done without persistent registers that could store the outcomes of these function invocations. In distributed computing, a task is the analogue of a function. An iterated model is defined by some base set of tasks. Processes invoke a sequence of tasks from this set. Each process invokes the (i+1)-st task with its output from the i-th task. Processes access the sequence of tasks, one by one, in the same order, and asynchronously. Any number of processes can crash. In the most basic iterated model the base tasks are read/write registers. Previous papers have studied this and other iterated models with more powerful base tasks, or enriched with failure detectors; these models have been useful to prove impossibility results and to design algorithms, due to the elegant recursive structure of their runs. This talk surveys results in this area, contributed mainly by Borowsky, Gafni, Herlihy, Raynal, Travers and the author.

1 Introduction

A distributed model of computation consists of a set of n processes communicating through some medium, satisfying specific timing and failure assumptions. The communication medium can be message passing or some form of shared memory. The processes can run synchronously or at arbitrarily varying speeds. The failure assumptions describe how many processes may fail, and in what way. Along each of these three dimensions there are many variants, which, when combined, give rise to a wide variety of distributed computing models, e.g. [6,29,37]. And for each model we would like to know which distributed tasks can be solved, and at what cost, in terms of time and communication. Thus, work on developing a theory of distributed computing has been concerned with finding ways of unifying results, impossibility techniques, and algorithm design paradigms of different models.

* Partially supported by PAPIIT and PAPIME UNAM projects.

A. López-Ortiz (Ed.): LATIN 2010, LNCS 6034, pp. 407–416, 2010. Springer-Verlag Berlin Heidelberg 2010

1.1 In Search of a Fundamental Model

In the early stages of distributed computing theory, similar results needed different proofs in different models. Consider for example the consensus task, where processes need to agree on one of their input values. In an asynchronous system the problem is impossible to solve even if only one process may crash. This was proved in [18] for the case where processes communicate by sending messages to one another. Read/write shared memory is in principle a more powerful communication medium than message passing, so proving that consensus is also impossible there is a stronger result. Indeed, the same impossibility holds if processes communicate through read/write memory, as proved in [28]. A first approach towards unifying distributed computing theory was to derive direct simulations from one model to another, e.g., [2,5,8]. In particular, [2] shows how to emulate read/write shared memory in an asynchronous message passing system, so that a protocol designed for the shared memory model can be transformed into one for the message passing model. This implies that it suffices to prove the consensus impossibility result in the message passing model to obtain the impossibility in the shared memory model. Later on, the approach of devising models at a higher level of abstraction, from which results about various more specific models can be derived, was explored, e.g., [19,24,30]. For instance, [30] described a generic layered model of computation in which a consensus impossibility result is proved; as special cases, consensus impossibilities can be derived for both a message passing and a shared memory model, and even for several synchronous models. Recently, the approach has also proved useful for randomized algorithms, e.g. [3].

1.2 From Graph Connectivity to Topology

The 1-failure asynchronous consensus impossibility results [18,28] mentioned above led to a characterization of the tasks that can be solved in an asynchronous system where at most 1 process can crash [11]. Thus, the case of 1 failure seemed to be the base case from which one should generalize to any number of failures. After all, dealing with many failures is more complicated than dealing with few failures, isn't it? A major step in the development of the theory was taken in 1993, by three works presented in the ACM STOC conference [8,25,36], which uncovered a deep relationship between distributed computing and topology. This led to the realization that the read/write wait-free case, rather than the case of 1 failure, is fundamental. In a system where any number of processes can crash, each process must complete the protocol in a finite number of its own steps, and "wait statements" to hear from another process are not useful. Roughly speaking, when we want to study the complexity and solvability of a task, we should first study it in the wait-free model, and then generalize the results to stronger models (either with more powerful communication primitives or stronger synchrony assumptions). For example, reductions from the case where at most t processes can crash, to the wait-free model, have been presented in [8,10,20]. In the topology approach one considers the simplicial complex of global states of the system after a finite number of steps, and then proves topological invariants


Fig. 1. Three simplexes

about the structure of such a complex, to derive impossibility results. The notion of indistinguishability, which has played a fundamental role in nearly every lower bound in distributed computing, is hence generalized from graph connectivity to higher dimensions. Two global states are indistinguishable to a set of processes if the processes in the set have the same local states in both. Figure 1 shows a complex with three triangles, each one a simplex representing a global state; the corners of a simplex represent the local states of the processes in the global state. The center simplex and the rightmost simplex represent global states that are indistinguishable to processes p1 and p2, which is why the two triangles share an edge. Only process p3 can distinguish between the two global states.

1.3 The Importance of the Wait-Free Snapshot Model

We have seen that the read/write wait-free model is fundamental. However, there are several ways of defining read/write registers, like single-writer/multi-reader, multi-writer/multi-reader, and others. A research branch in distributed computing theory has been concerned with finding the simplest read/write communication abstraction. Several variants of read/write registers were studied early on [6,27] and proved to be equivalent. We now use snapshots [1], or even immediate snapshots [7], as such abstractions: they are wait-free equivalent to read/write registers (although at a complexity cost), but give rise to cleaner and more structured models. In a snapshot object each process has a component where it can write, and a process can read all components with a single atomic read operation that returns an instantaneous snapshot of their contents. An immediate snapshot object provides a single write-snapshot operation, guaranteeing that the snapshot is executed immediately after the write. The complex corresponding to a snapshot object for three processes is in the first part of Figure 2 and will be explained below.
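As an illustration of these two interfaces, here is a minimal sketch. The class names are illustrative, and the lock stands in only for atomicity; the actual wait-free constructions of [1,7] achieve atomic scans without any locking.

```python
import threading

class SnapshotObject:
    """Sketch of a snapshot object: one writable component per
    process, plus an atomic scan of all components."""

    def __init__(self, n):
        self._mem = [None] * n          # component i belongs to process i
        self._lock = threading.Lock()   # illustrative atomicity only

    def update(self, i, value):
        """Process i writes to its own component."""
        with self._lock:
            self._mem[i] = value

    def scan(self):
        """Return an instantaneous view of all components."""
        with self._lock:
            return tuple(self._mem)

class ImmediateSnapshotObject(SnapshotObject):
    """Sketch of an immediate snapshot object: a single operation
    whose snapshot happens immediately after the write."""

    def write_snapshot(self, i, value):
        with self._lock:
            self._mem[i] = value
            return tuple(self._mem)
```

A process using the snapshot object calls `update` and then `scan`; with the immediate snapshot object, the two steps are fused into one `write_snapshot` call, which is the abstraction used in the iterated model below.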

2 The Basic Iterated Model

We have explained above that the snapshot wait-free model is fundamental. However, in this paper we argue that a specific variant of this model is especially suitable for a central role in distributed computing theory. Most attempts at unifying models of various degrees of asynchrony restrict attention to a subset of well-behaved, round-based executions. The approach in [9] goes beyond that and defines an iterated model, where each communication object can be accessed only once by each process. In the paper only the basic case of snapshot objects is


considered. The sequence of snapshot objects is accessed asynchronously, and one after the other, by each process. It is shown in [9] that this iterated model is equivalent (for bounded wait-free task solvability) to the usual read/write shared memory model.

2.1 Recursive Structure

The iterated model has an elegant recursive structure. In each iteration, the only information transmitted is the local state of the processes after each snapshot: the snapshot objects are not persistent; we may think of them as existing only during an iteration. The result in [9] can be thought of as a variant of the result in [2], which shows that shared memory can be emulated over message passing; in message passing, too, there are no persistent objects. The recursive structure is clearly expressed in the complex of global states of a protocol. The complex of global states after i+1 rounds is obtained by replacing each simplex by a one-round complex, see Figure 2. Indeed, this iterated model was the basis for the proof in [9] of the main characterization theorem of [25]. In more detail, the properties of an immediate snapshot are represented in the first image of Figure 2, for the case of three processes. The image represents a simplicial complex, i.e. a family of sets closed under containment; each set is called a simplex, and it represents the views of the processes after accessing the IS object. The vertices are the 0-simplexes, of size one; edges are 1-simplexes, of size two; triangles are 2-simplexes, of size three; and so on. Each vertex is associated with a process pi, and is labeled with smi (the view pi obtains from the object). In the first complex of Figure 2, the highlighted 2-dimensional simplex represents a run where p1 and p3 access the object concurrently: both get the same view, seeing each other but not p2, which accesses the object later and gets back a view with the 3 values written to the object. But p2 cannot tell the order in which p1 and p3 access the object: the run where p1 accesses the object before p3 (and hence gets back only its own value) and the run where the order is reversed are indistinguishable to p2. These two runs are represented by the corner 2-simplexes.
Recall that in the iterated immediate snapshot model the objects are accessed sequentially and asynchronously by each process. In Figure 2 one can see that the complex after round i+1 is constructed recursively by replacing each simplex of the round-i complex by the one-round complex. Thus, the highlighted 2-simplex in the second complex of the figure represents global states after two snapshots, given that in the first snapshot p1 and p3 saw each other, but did not see p2.
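One way to make the one-round structure concrete is to enumerate its runs. A run of a one-shot immediate snapshot corresponds to an ordered partition of the processes into concurrency classes: each class writes and snapshots together, and classes execute one after the other. The sketch below (illustrative code, not from the paper) generates all such runs and the resulting views; for three processes there are 13 ordered partitions, one per triangle of the first complex in Figure 2.

```python
from itertools import combinations

def ordered_partitions(procs):
    """Yield every ordered partition of procs into nonempty
    concurrency classes; each corresponds to one IS run."""
    procs = list(procs)
    if not procs:
        yield []
        return
    for k in range(1, len(procs) + 1):
        for first in combinations(procs, k):
            rest = [p for p in procs if p not in first]
            for tail in ordered_partitions(rest):
                yield [set(first)] + tail

def views(partition):
    """View of each process: the processes scheduled in its own
    concurrency class or in an earlier one."""
    seen, out = set(), {}
    for block in partition:
        seen |= block
        for p in block:
            out[p] = frozenset(seen)
    return out
```

For the run highlighted in the text, `views([{'p1', 'p3'}, {'p2'}])` gives p1 and p3 the view {p1, p3} while p2, accessing later, sees all three.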

2.2 On the Meaning of Failures

Notice that the runs of the iterated model are not a subset of the runs of a standard (non-iterated) model. Consider a run where the processes p1, p2, p3 execute an infinite number of rounds, but p1 is scheduled before p2 and p3 in every round. The triangles at the left-bottom corners of the complexes in Figure 2 represent such a situation; p1, at the corner, never hears from the two other processes. Of course, in the usual (non-iterated read/write shared memory) asynchronous model, two correct processes can always eventually communicate with each other. Thus, in an iterated model, the set of correct processes of a run may be defined as the set of processes that observe each other, directly or indirectly, infinitely often (a formal definition is given in [33]).

Fig. 2. One, two and three rounds in the IIS model

2.3 Equivalence with the Standard Model

Recall that in the k-set agreement task each of the n processes in the system starts with an input value from some domain of at least n values, and must decide one of the input values, such that at most k distinct values are decided overall. It was proved in [9] that if a task (with a finite number of inputs) is solvable wait-free in the read/write memory model then it is solvable in the iterated snapshot model (the other direction is trivial), using an algorithm that simulates the read/write model in the iterated model. Recently a somewhat simpler simulation was described in [21]. As can be seen in Figure 2, the complex of global states at any round of this model is a subdivided simplex, and hence Sperner's Lemma implies that k-set agreement is not solvable in this model if k < n. Thus, it is also unsolvable in the wait-free read/write memory model.
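The k-set agreement specification just recalled can be stated as a small predicate over a single run. The helper below is a hypothetical illustration, not from the paper: it checks validity (every decision is some process's input) and k-agreement (at most k distinct decided values).

```python
def is_k_set_agreement(inputs, decisions, k):
    """Check one run against the k-set agreement specification:
    every decided value is some process's input, and at most k
    distinct values are decided."""
    return (all(d in inputs for d in decisions)
            and len(set(decisions)) <= k)

# Three processes, 2-set agreement: two distinct decided values
# satisfy the task, three do not.
assert is_k_set_agreement([1, 2, 3], [1, 3, 3], k=2)
assert not is_k_set_agreement([1, 2, 3], [1, 2, 3], k=2)
```

With k = 1 the predicate is exactly consensus; the impossibility above says no wait-free protocol can guarantee it for any k < n.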

3 General Iterated Models

We can consider iterated models where, instead of snapshots, in each round processes communicate through some other task. The advantages of programming in an iterated model are twofold. First, as there are no "side effects" during a run, we can logically imagine all processes going in lockstep from task to task, just varying the order in which they invoke each task; they all return from task S1 before any of them invokes task S2. This structured set of executions facilitates inductive reasoning and greatly simplifies the understanding of distributed algorithms. Second, if we have a description of S1 as a topological complex X1, and of S2 as X2, the iterated executions accessing S1 and then S2 have a simple topological description: replace each simplex of X1 by X2. This iterated reasoning style has been studied and proved useful in works such as [15,16,19,24,30,34,35].
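The lockstep composition can be sketched directly: each round is a function from per-process inputs and a schedule to per-process outputs, and rounds are chained by feeding the outputs of round i into round i+1. The code below is an illustrative sketch (the function names are not from the paper), using an immediate-snapshot round as the base task; any other base task with the same shape could be plugged in.

```python
def snapshot_round(views, schedule):
    """One immediate-snapshot round. 'schedule' is an ordered list
    of concurrency classes (sets of process names); each class
    writes its round input and snapshots together, so its members
    see everything written so far, including each other."""
    written, out = {}, {}
    for block in schedule:
        for p in block:
            written[p] = views[p]
        snap = dict(written)        # instantaneous view for this class
        for p in block:
            out[p] = snap
    return out

def run_iterated(tasks, inputs, schedules):
    """Iterated composition: each process invokes task i+1 with its
    output from task i; the objects themselves keep no state."""
    views = inputs
    for task, schedule in zip(tasks, schedules):
        views = task(views, schedule)
    return views
```

Note that nothing persists between rounds except the views themselves, which is exactly the "no side effects" property exploited above.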

3.1 The Moebius Task

In [22], the situation where processes communicate using a Moebius task is considered. It is possible to construct a complex that is a subdivision of a Moebius band out of 2-dimensional simplexes, as in Figure 3. But to define a task specification, the complex should satisfy two properties. Consider the case of 3 processes. First, it must be chromatic: each of the three vertices of its 2-simplexes has to be labeled with a different process id. Each vertex is also labeled with an output value of the task. Second, the complex should have a span structure identified: a specification stating what the output of the task is when each process runs solo, what the outputs are when only a pair of processes participates, and what the possible outputs are when all three run concurrently. Figure 4, from [22], is the task specification for three processes that corresponds to a Moebius band. Notice that its boundary is identical to the boundary of a chromatic subdivided simplex. The Moebius task was introduced in [22] because it is a manifold: any edge of its complex belongs to either one or two triangles. The one-round Moebius task is a manifold task, so composing the Moebius task with itself in an iterated model, with read/write rounds, or with any other manifold task, yields a manifold task. Thus, any protocol in such an iterated model yields a protocol complex that is also a manifold. And as mentioned above, we can apply Sperner's lemma to a manifold to prove that k-set agreement is not solvable in the model if k < n. Furthermore, [22] shows that the Moebius task can be used to prove that set agreement is strictly more difficult than renaming [4] in the iterated model.


Fig. 3. Construction of a Moebius band

Fig. 4. One-round Moebius task protocol complex for 3 processes

3.2 Equivalence of More General Models with the Standard Model

Recall that the standard read/write memory model is equivalent to the iterated snapshot model (for bounded task solvability) [9]. Recently another simulation, described in [21], shows that both models remain equivalent when enriched with tasks more powerful than read/write registers: for any task T solvable by set agreement, the power of the standard and the iterated model coincide. This implies that set agreement is strictly more difficult than renaming also in the standard non-iterated model.

4 Iterated Models and Failure Detectors

In the construction of a distributed computing theory, a central question has been understanding how the degree of synchrony of a system affects its power to solve distributed tasks. The degree of synchrony has been expressed in various ways: typically by a bound t on the number of processes that can crash, by bounds on delays and process steps [17], or by a failure detector [12]. It has been shown multiple times that systems with more synchrony can solve more tasks. Previous works in this direction have mainly considered an asynchronous system enriched with a failure detector that can solve consensus. Some works have identified this type of synchrony in terms of fairness properties [38]. Other works have considered round-based models with no failure detectors [19]. Some other works [26] focused on performance issues, mainly about consensus. Also, in some cases, the least amount of synchrony required to solve some task has been identified, within some paradigm; notable examples are the weakest failure detector to solve consensus [13] or k-set agreement [40]. Set agreement [14] represents a desired degree of coordination to be achieved in the system, and hence it is natural to use it as a measure of the degree of synchrony in the system. A clear view of what exactly "degree of synchrony" means is still lacking. For example, the same power as far as solving k-set agreement can be achieved in various ways, such as via different failure detectors [31] or t-resilience assumptions.

4.1 A Restriction of the Snapshot Iterated Model

The paper [35] introduces the IRIS model, which consists of a subset of runs of the iterated immediate snapshot model of [9], to obtain the benefits of the round-by-round and wait-freedom approaches in one model, where processes run wait-free but the executions represent those of a partially synchronous model. As an application, new, simple impossibility results for set agreement in several partially synchronous systems are derived. The IRIS model provides a means of precisely representing the degree of synchrony of a system, by considering particular subsets of runs of the iterated snapshot model. A failure detector [12] is a distributed oracle that provides each process with hints about process failures. According to the type and the quality of the hints, several classes of failure detectors have been defined (e.g., [31,40]). Introducing a failure detector directly into an iterated model is not useful [34]. Instead, the IRIS model of [35] represents a failure detector as a restriction on the set of possible runs of the iterated system. As an example, the paper considers the family of limited scope accuracy failure detectors, denoted ◊Sx [23,39]. They are a generalization of the class ◊S introduced in [12]. Consider the read/write computation model enriched with a failure detector C of the class ◊Sx. An IRIS model that precisely captures the synchrony provided by the asynchronous system equipped with C is described in [35]. To show that the synchrony is indeed captured, the paper presents two simulations. The first is a simulation from the shared memory model with C to the IRIS model. The second shows how to extract C from the IRIS model, and then simulate the read/write model with C. For this, a generalization of the wait-free simulation of [9] is described, one that preserves consistency with the simulated failure detector.

4.2 Equivalence of Failure Detector Enriched Models with Iterated Models

As a consequence of these simulations, we get that an agreement task is wait-free solvable in the read/write model enriched with C if and only if it is wait-free solvable in the corresponding IRIS model. Then, using a simple topological observation, it is easy to derive the lower bound of [23] for solving k-set agreement in a system enriched with C. In the approach presented in this paper, the technically difficult proofs are encapsulated in algorithmic reductions between the shared memory model and the IRIS model, whereas the proof of [23] uses combinatorial topology techniques introduced in [24] to derive the topological properties of the runs of the system enriched with C directly. A companion technical report [32] extends the equivalence presented in [35] to other failure detector classes.

References

1. Afek, Y., Attiya, H., Dolev, D., Gafni, E., Merritt, M., Shavit, N.: Atomic Snapshots of Shared Memory. J. ACM 40(4), 873–890 (1993)
2. Attiya, H., Bar-Noy, A., Dolev, D.: Sharing Memory Robustly in Message Passing Systems. J. ACM 42(1), 124–142 (1995)
3. Attiya, H., Censor, K.: Tight Bounds for Asynchronous Randomized Consensus. J. ACM 55(5) (2008)
4. Attiya, H., Bar-Noy, A., Dolev, D., Peleg, D., Reischuk, R.: Renaming in an Asynchronous Environment. J. ACM 37(3), 524–548 (1990)
5. Awerbuch, B.: Complexity of Network Synchronization. J. ACM 32, 804–823 (1985)
6. Attiya, H., Welch, J.: Distributed Computing: Fundamentals, Simulations, and Advanced Topics. Wiley, Chichester (2004)
7. Borowsky, E., Gafni, E.: Immediate Atomic Snapshots and Fast Renaming. In: Proc. 12th ACM Symp. on Principles of Distributed Computing (PODC 1993), pp. 41–51 (1993)
8. Borowsky, E., Gafni, E.: Generalized FLP Impossibility Results for t-Resilient Asynchronous Computations. In: Proc. 25th ACM STOC, pp. 91–100 (1993)
9. Borowsky, E., Gafni, E.: A Simple Algorithmically Reasoned Characterization of Wait-free Computations. In: Proc. 16th ACM PODC, pp. 189–198 (1997)
10. Borowsky, E., Gafni, E., Lynch, N., Rajsbaum, S.: The BG Distributed Simulation Algorithm. Distributed Computing 14(3), 127–146 (2001)
11. Biran, O., Moran, S., Zaks, S.: A Combinatorial Characterization of the Distributed 1-solvable Tasks. J. Algorithms 11, 420–440 (1990)
12. Chandra, T., Toueg, S.: Unreliable Failure Detectors for Reliable Distributed Systems. J. ACM 43(2), 225–267 (1996)
13. Chandra, T., Hadzilacos, V., Toueg, S.: The Weakest Failure Detector for Solving Consensus. J. ACM 43(4), 685–722 (1996)
14. Chaudhuri, S.: More Choices Allow More Faults: Set Consensus Problems in Totally Asynchronous Systems. Information and Computation 105, 132–158 (1993)
15. Elrad, T., Francez, N.: Decomposition of Distributed Programs into Communication-Closed Layers. Sci. Comput. Program. 2(3), 155–173 (1982)
16. Chou, C.-T., Gafni, E.: Understanding and Verifying Distributed Algorithms Using Stratified Decomposition. In: Proc. PODC 1988, pp. 44–65 (1988)
17. Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the Presence of Partial Synchrony. J. ACM 35(2), 288–323 (1988)


18. Fischer, M., Lynch, N., Paterson, M.: Impossibility of Distributed Consensus with One Faulty Process. J. ACM 32(2), 374–382 (1985)
19. Gafni, E.: Round-by-round Fault Detectors: Unifying Synchrony and Asynchrony (Extended Abstract). In: Proc. 17th ACM Symp. on Principles of Distributed Computing (PODC), pp. 143–152 (1998)
20. Gafni, E.: The Extended BG-simulation and the Characterization of t-Resiliency. In: Proc. 41st ACM Symp. on Theory of Computing (STOC), pp. 85–92 (2009)
21. Gafni, E., Rajsbaum, S.: Loopless Programming with Tasks. Manuscript, November 5, 2009 (submitted for publication)
22. Gafni, E., Rajsbaum, S., Herlihy, M.: Subconsensus Tasks: Renaming is Weaker than Set Agreement. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 329–338. Springer, Heidelberg (2006)
23. Herlihy, M., Penso, L.D.: Tight Bounds for k-Set Agreement with Limited Scope Accuracy Failure Detectors. Distributed Computing 18(2), 157–166 (2005)
24. Herlihy, M.P., Rajsbaum, S., Tuttle, M.: Unifying Synchronous and Asynchronous Message-Passing Models. In: Proc. 17th ACM PODC, pp. 133–142 (1998)
25. Herlihy, M., Shavit, N.: The Topological Structure of Asynchronous Computability. J. ACM 46(6), 858–923 (1999)
26. Keidar, I., Shraer, A.: Timeliness, Failure-detectors, and Consensus Performance. In: Proc. 25th ACM PODC, pp. 169–178 (2006)
27. Lamport, L.: On Interprocess Communication. Distributed Computing 1(2), 77–101 (1986)
28. Loui, M.C., Abu-Amara, H.H.: Memory Requirements for Agreement among Unreliable Asynchronous Processes. Advances in Computing Research 4, 163–183 (1987)
29. Lynch, N.A.: Distributed Algorithms. Morgan Kaufmann, San Francisco (1997)
30. Moses, Y., Rajsbaum, S.: A Layered Analysis of Consensus. SIAM J. Comput. 31(4), 989–1021 (2002)
31. Mostefaoui, A., Rajsbaum, S., Raynal, M., Travers, C.: Irreducibility and Additivity of Set Agreement-oriented Failure Detector Classes. In: Proc. PODC 2006, pp. 153–162. ACM Press, New York (2006)
32. Rajsbaum, S., Raynal, M., Travers, C.: Failure Detectors as Schedulers. Tech. Report #1838, IRISA, Université de Rennes, France (2007)
33. Rajsbaum, S., Raynal, M., Travers, C.: The Iterated Restricted Immediate Snapshot Model. Tech. Report #1874, IRISA, Université de Rennes, France (2007)
34. Rajsbaum, S., Raynal, M., Travers, C.: An Impossibility about Failure Detectors in the Iterated Immediate Snapshot Model. Inf. Process. Lett. 108(3), 160–164 (2008)
35. Rajsbaum, S., Raynal, M., Travers, C.: The Iterated Restricted Immediate Snapshot Model. In: Hu, X., Wang, J. (eds.) COCOON 2008. LNCS, vol. 5092, pp. 487–497. Springer, Heidelberg (2008)
36. Saks, M., Zaharoglou, F.: Wait-Free k-Set Agreement is Impossible: The Topology of Public Knowledge. SIAM J. Comput. 29(5), 1449–1483 (2000)
37. Santoro, N.: Design and Analysis of Distributed Algorithms. Wiley Interscience, Hoboken (2006)
38. Völzer, H.: On Conspiracies and Hyperfairness in Distributed Computing. In: Fraigniaud, P. (ed.) DISC 2005. LNCS, vol. 3724, pp. 33–47. Springer, Heidelberg (2005)
39. Yang, J., Neiger, G., Gafni, E.: Structured Derivations of Consensus Algorithms for Failure Detectors. In: Proc. 17th ACM PODC, pp. 297–308 (1998)
40. Zieliński, P.: Anti-Omega: the Weakest Failure Detector for Set Agreement. Tech. Rep. #694, University of Cambridge