Fault-Tolerant Shared Memory Simulations¹ (Extended Abstract)

Petra Berenbrink, Friedhelm Meyer auf der Heide, Volker Stemann
Heinz Nixdorf Institute and Dept. of Computer Science, University of Paderborn, D-33102 Paderborn, Germany
Email: {pebe, fmadh, [email protected]}

Abstract. We consider the problem of simulating a PRAM on a faulty distributed memory machine (DMM). We focus on dynamic faults, i.e. each processor or memory module independently fails during the simulation of a PRAM step with a fixed probability and remains faulty for the rest of the simulation. We build upon randomized hashing-based simulations on non-faulty DMMs from [14], which achieve delay O(log log n), with high probability. We design and analyze routines for handling faults occurring during the simulation. Based on these routines we present simulations on faulty DMMs with the same delay O(log log n) as in the non-faulty case, provided that the failure probability of processors and modules is small enough to guarantee that an expected linear number of processors and modules survives the simulation. Thus the facility of being resilient to memory or processor faults increases the delay of the simulation by at most a constant factor.
1 Introduction

Parallel machines that communicate via shared memory, so-called Parallel Random Access Machines (PRAMs), represent the most powerful model considered in the theory of parallel computation. This model is relatively comfortable to program, because the user does not have to specify interprocessor communication or allocate storage in a distributed memory, and does not have to worry about synchronization, locality of data, communication capacity, or delay effects of memory contention. On the other hand, PRAMs are unrealistic from the technological point of view, as large machines with shared memory can only be built at the cost of very slow shared memory access. A more realistic model is the Distributed Memory Machine (DMM), where the memory is partitioned into modules, one per processor. In this case, a parallel memory access is restricted to allow only one access to each module per parallel step. Thus, memory contention occurs if a PRAM algorithm is run on a DMM; parallel accesses to cells stored in one module are sequentialized. Many authors have already investigated methods for simulating PRAMs on DMMs.

¹ Supported in part by DFG-Graduiertenkolleg "Parallele Rechnernetzwerke in der Produktionstechnik", ME 872/4-1, by DFG-SFB 376 "Massive Parallelität", by the Esprit Basic Research Action Nr. 7141 (ALCOM II), and by the SICMA Project funded by the European Community within the program on "Advanced Communication Technologies and Services". The third author's current address is: International Computer Science Institute, Berkeley, CA 94704-1105.
In this paper we consider the execution of PRAM simulations on a DMM with faulty memory modules and processors. Normally, if a hardware component (memory module or processor) becomes faulty during a computation, one has to restart the whole computation after repairing the faulty components or reconfiguring the system. If we use fault-tolerant algorithms for simulating PRAMs on DMMs, however, there is a good chance that the computation can be continued without great loss of time despite faults occurring during the simulation. We focus on dynamic faults, i.e. each memory module or processor independently fails during the simulation of a PRAM with a fixed probability and remains faulty for the rest of the simulation. Nevertheless, the simulation is guaranteed to work correctly, with high probability (w.h.p.)², even if a large fraction of all memory modules or processors is expected to become faulty during the simulation.

1.1 Computation Models
A Parallel Random Access Machine (PRAM) consists of n processors Q_1, ..., Q_n and a shared memory with cells U = {1, ..., m}, each capable of storing one integer. The processors work synchronously and have random access to the shared memory cells. In this paper we mostly consider the exclusive-read exclusive-write PRAM (EREW-PRAM) model, that is, no two processors are allowed to access one shared memory cell at the same time. The CRCW-PRAM allows several processors to read and write the same cell concurrently. In case of concurrent write we apply the arbitrary write conflict resolution rule: an arbitrary one of the write requests to a cell is chosen to be successful. Without loss of generality we assume that the processors of the PRAM do not have private memory (up to internal registers like the program counter); all information is stored in the shared memory.

A Distributed Memory Machine (DMM) consists of n processors P_1, ..., P_n and n memory modules M_1, ..., M_n. Each module has a communication window, i.e. a register, into which all processors can write and from which all processors can read. In a basic communication step of a DMM the processors send read or write requests for certain memory cells to the memory modules, at most one request per processor. Each module processes some of the requests directed to it, i.e. it executes the corresponding update in case of a write request or sends back the contents of a memory cell in case of a read request. Additionally it sends an acknowledgment to each processor whose request was chosen to be processed. If more than one processor wants to access a memory module, a collision occurs. There exist several rules for handling these collisions; for a discussion see [7] or [13]. We are going to focus on the c-collision rule: if at most c requests arrive at a module, all of them are processed, otherwise none is processed. An answer is only accessible by the issuing processor. For c = 1 this model corresponds to a communication mechanism based on optical crossbars. It is also called the Optical Communication Parallel Computer (OCPC), see [1] and [9]. In the more general Arbitrary-DMM, concurrent access to the same module yields an answer to an arbitrarily chosen access; in case of a read access, the result can be read concurrently by all processors reading at the module. Thus, an Arbitrary-DMM can be seen as a CRCW-PRAM with linear size shared memory.

² A property is said to hold with high probability (w.h.p.) if, for arbitrary l > 0, it holds with probability at least 1 − 1/n^l for sufficiently large n.
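As an illustration of the c-collision rule, the following is a minimal sequential sketch of one DMM communication step; the function name and data layout are our own and not part of the model description.

    from collections import defaultdict

    def dmm_step(requests, c):
        """One DMM communication step under the c-collision rule.

        requests: dict mapping processor id -> module id it wants to access
                  (at most one request per processor).
        A module answers all requests directed to it if there are at most c
        of them, otherwise none; the set of acknowledged processors is returned.
        """
        per_module = defaultdict(list)
        for proc, module in requests.items():
            per_module[module].append(proc)

        acknowledged = set()
        for module, procs in per_module.items():
            if len(procs) <= c:          # c-collision rule
                acknowledged.update(procs)
        return acknowledged

    # Example: processors 1 and 2 collide on module 7 under the 1-collision rule.
    print(dmm_step({1: 7, 2: 7, 3: 5}, c=1))   # {3}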
1.2 Known Results

Shared memory simulations using only one hash function to distribute the shared memory cells over the modules of the DMM have an inherent delay of Ω(log n / log log n), even if the hash functions behave like random functions (see [8]). Karp et al. [11] were the first to consider shared memory simulations using two hash functions. They also present a fast implementation of write steps. The simulation runs on an Arbitrary-DMM with delay O(log log n) and can be made time-processor optimal. Dietzfelbinger and Meyer auf der Heide [7] achieve the same delay with a very simple scheme using the majority trick introduced in [17] with three hash functions. It can be executed on the weaker c-collision-DMM with c ≥ 3. For a survey of shared memory simulations see [13]. MacKenzie et al. [12] showed that an EREW-PRAM can even be simulated on a 1-collision-DMM with 5 hash functions. This result was finally extended to a time-processor optimal simulation on an n-processor DMM by Goldberg et al. [10]. Their simulation uses only three hash functions. In [14] Meyer auf der Heide et al. presented a simulation of an n-processor PRAM on an n-processor 1-collision-DMM with delay O(log log n), and a simulation of an n-processor PRAM on an n-processor Arbitrary-DMM with delay O(log log n / log log log n), which uses log log n hash functions. More involved techniques from [6] even yield delay O(log log log n · log* n).
In [4] Chlebus et al. present a simulation of one step of an n-processor PRAM with shared memory size O(n) on a faulty n-processor PRAM with memory size O(n). They achieve a delay of O(log n) for simulating one step, tolerating a constant fraction of the memory becoming faulty. In that paper they also give an extension of the simulation to PRAMs with super-linear memory size.

1.3 Fault Model
In this paper we focus on dynamic faults, i.e. each memory module and processor fails during the simulation of a PRAM with a fixed probability and remains faulty for the rest of the simulation. In the following we only handle faulty modules; the last section contains the extension to faulty processors. We assume that, in each step t of the computation of the DMM, a module becomes faulty with failure probability p, for some p ∈ (0, 1), and it remains faulty up to the end of the computation. The failures of different modules are independent of each other; hence, in a T-step computation of a DMM with failure probability p, a module becomes faulty with probability 1 − (1 − p)^T. Thus the expected number of faulty modules is n · (1 − (1 − p)^T). Our simulation will tolerate failure probabilities which are small enough relative to the running time T such that a sufficiently large fraction of the DMM's modules survives the computation of length T, w.h.p.

Definition 1.1 ((s, T)-tolerable) Let s ≤ n, T ≥ 1. p ∈ (0, 1) is (s, T)-tolerable if the expected number of faulty modules at the end of a T-step DMM computation with failure probability p is at most s.
Remark 1.2 p is (s, T)-tolerable if and only if p ≤ 1 − (1 − s/n)^{1/T}. It is easy to check (using e.g. Chernoff bounds) that, if p is (s, T)-tolerable, at most (1 + γ) · s modules become faulty, γ > 0 arbitrary, with exponentially high probability³.

³ A property is said to hold with exponentially high probability if there is a constant δ > 0 such that it holds with probability at least 1 − 2^{−n^δ} for sufficiently large n.
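A small numeric check of Remark 1.2 (the function names are ours): given n, T and a target s, the largest (s, T)-tolerable failure probability is p = 1 − (1 − s/n)^{1/T}, and the expected number of faulty modules after T steps is n · (1 − (1 − p)^T).

    def max_tolerable_p(n, s, T):
        """Largest per-step failure probability p that is (s, T)-tolerable,
        i.e. keeps the expected number of faulty modules after T steps at most s."""
        return 1.0 - (1.0 - s / n) ** (1.0 / T)

    def expected_faulty_modules(n, p, T):
        """Expected number of modules that fail at least once within T steps."""
        return n * (1.0 - (1.0 - p) ** T)

    n, T = 1024, 10_000
    s = n / 4                                    # tolerate an expected quarter of the modules
    p = max_tolerable_p(n, s, T)
    print(p, expected_faulty_modules(n, p, T))   # the second value is (up to rounding) s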
A faulty component stops working at once. We are not going to consider models in which faulty components keep working and eventually perform incorrect operations. Such models are discussed in [5] and [2], often using coding theory. Our model can be extended using the same techniques in order to handle these faults, too. For simplicity, we assume in our fault-tolerant simulations that the DMM has a mechanism to broadcast the failure of a module to all intact modules in constant time. The algorithms of this paper can easily be extended so that at first only one processor recognizes the faulty module and then informs the other processors via broadcast. In this case we just have to add the time needed by this broadcast to the time needed to handle a fault, so that similar results can be achieved.

1.4 New Results
We design simulations of t-step computations of n-processor EREW-PRAM algorithms using m ≤ n^κ memory cells, for an arbitrary constant κ, on a faulty n-processor DMM. We may further assume w.l.o.g. that m ≤ t · n. Using the O(log log n)-delay PRAM simulation on a non-faulty DMM designed by Meyer auf der Heide et al. [14] we obtain two PRAM simulations for different running times of the PRAM algorithms to be simulated. They also have delay bound O(log log n), w.h.p., i.e. the facility of being resilient to memory faults does not increase the delay of the simulation by more than a constant factor.

Theorem 1.3 (Main Theorem) A faulty 1-collision-DMM with n processors and failure probability p can simulate Ω(T / log log n) steps of an EREW-PRAM within T steps, w.h.p. (i.e. the simulation has delay O(log log n)), if the failure probability p and the running time T fulfill one of the following conditions:
1. T ≥ log log n, and p is (n^ε, T)-tolerable for an arbitrary constant ε, 0 < ε < 1.
2. n^ε ≤ T ≤ n · (log log n)² for arbitrary ε > 0, and p is (T / log n, T)-tolerable.
3. n · (log log n)² < T < n · log n · log log n, and p is (T / (log n · log log n), T)-tolerable.
4. n · log n · log log n ≤ T, and p is (n/d, T)-tolerable for a constant d > 2.

For parts 2, 3, and 4 each module of the DMM needs size O(m/n + n) if the PRAM has shared memory size m. The simulation in part 1 needs size O(m/n + log n) per module.
The Main Theorem can be generalized to simulate a CRCW-PRAM with the same performance bounds, if an Arbitrary-DMM is used instead of a 1-collision-DMM, using techniques from [6]. Alternatively, using delay O(log log n), the restriction on m (of order n²) can be improved. The Main Theorem then implies a simulation of a CRCW-PRAM on a faulty Arbitrary-DMM with delay O(log log n) using O(m/n + log n) memory cells per module and O(m + n) memory cells altogether.

1.5 Organization of the Paper

We only discuss the case of faulty modules; the extension to faulty processors is sketched at the end of this paper. Section 2 presents the simulation on a non-faulty DMM from [14]. Section 3 shows that the fault tolerance described in part 1 of the Main Theorem is already implicitly given by the simulation presented in [14]. For the other results we introduce simulations using fault-handling routines in Section 4. There we prove part 4 of the Main Theorem
and sketch the modifications necessary to prove parts 2 and 3, too. Finally, Section 5 sketches the extension to faulty processors. Due to space limitations, most algorithms and proofs are only sketched. A full version will be available, see [3].
2 Simulation on a Non-Faulty DMM

We are going to discuss fault-tolerant simulations based on the (n, ε, a, b)-process from [14], which we review in this section. The process uses a hash functions h_1, ..., h_a to distribute a copies of each shared memory cell of the PRAM, or key for short, among the memory modules of the 1-collision-DMM. For technical reasons the (n, ε, a, b)-process starts accessing only εn of the keys. It performs the task of accessing b of the a copies of each key. The hash functions are randomly and independently chosen from a log³(n/a)-universal class of hash functions which is described in [16]. A majority technique due to Upfal and Wigderson [17] is used to simulate a PRAM step. It ensures that it is sufficient for a DMM processor to access arbitrary b > a/2 of the a possible copies to guarantee a correct simulation. To write a memory cell, a processor of the DMM adds a time stamp indicating the PRAM time. To read a memory cell, a processor chooses from the b copies one with the latest time stamp. We assume that a divides n and denote the processors and modules by P_{i,k} and M_{i,k} respectively, i ∈ {1, ..., n/a} and k ∈ {1, ..., a}. The processors P_{1,k}, ..., P_{n/a,k} form the (processor) cluster PC_k, and M_{1,k}, ..., M_{n/a,k} the (module) cluster MC_k. Each hash function h_k has range {1, ..., n/a} and maps U into MC_k. M_{h_k(x),k} is said to contain the k-th copy of x. The following process accesses b > a/2 of the a copies for εn given keys, for a constant ε > 0. It runs on a 1-collision-DMM. Note that in this paper the number of hash functions, a, is always a constant.
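A minimal sketch of the time-stamp/majority idea (the encoding of the copies is our own): each of the a copies of a cell stores a (value, time) pair; a write updates b > a/2 copies with the current PRAM time, and a read takes the value with the latest time stamp among any b copies it manages to reach. Since two sets of b > a/2 copies always intersect, the latest write is seen.

    def write_key(copies, reached, value, pram_time):
        """Update the reached copies (|reached| >= b > a/2) with (value, time stamp)."""
        for k in reached:
            copies[k] = (value, pram_time)

    def read_key(copies, reached):
        """Return the value with the latest time stamp among the reached copies."""
        return max((copies[k] for k in reached), key=lambda vt: vt[1])[0]

    # a = 5 copies, b = 3; the writer reaches copies {0,1,2}, the reader reaches {2,3,4}.
    copies = [(0, 0)] * 5
    write_key(copies, {0, 1, 2}, value=42, pram_time=7)
    print(read_key(copies, {2, 3, 4}))   # 42, because the two sets of b copies intersect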
(n, ε, a, b)-Process:
/* Given: εn active processors. P_{i,k} wants to access cell x_{i,k};
   I_{i,k} = ∅ for all i ∈ {1, ..., n/a}, k ∈ {1, ..., a}. */
Repeat
  For j = 1 to a do
    For all active P_{i,k} do in parallel
      If j ∉ I_{i,k} then
        P_{i,k} tries to access the ((k + j − 1) mod a)-th copy of x_{i,k}
        /* Each M_{i,k} accepts the access if at most one access is directed to M_{i,k} */
        If P_{i,k}'s access is accepted then I_{i,k} := I_{i,k} ∪ {j}
        If |I_{i,k}| ≥ b then P_{i,k} becomes inactive
    End
Until all processors are inactive

In each round of this process each active processor tries to access all a copies of its key; the processor P_{i,k} starts in cluster MC_k. Thus, a round takes time O(a). If a processor knows b of the a copies of its key, it becomes inactive. If we want to simulate one step of the PRAM, we have to repeat the (n, ε, a, b)-process 1/ε times. We call the resulting algorithm the (n, ε, a, b)-simulation.
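The following is a directly runnable sequential rendering of the pseudocode above; the data layout, the choice of copy to try, and the termination cap are ours, and on the DMM all active processors act in parallel.

    def process_sketch(keys, copy_module, a, b, max_rounds=100):
        """Sequential sketch of the (n, eps, a, b)-process.

        keys        : list of keys, one per active processor.
        copy_module : copy_module[x][j] = module holding the j-th copy of key x.
        A processor becomes inactive once it has reached b of the a copies of its key.
        """
        obtained = {p: set() for p in range(len(keys))}
        active = set(obtained)
        for _ in range(max_rounds):
            if not active:
                break
            for j in range(a):                           # the a sub-steps of one round
                wants = {}                               # module -> list of (processor, copy index)
                for p in active:
                    k = (p + j) % a                      # stand-in for the ((k + j - 1) mod a) rule
                    if k not in obtained[p]:
                        wants.setdefault(copy_module[keys[p]][k], []).append((p, k))
                for module, reqs in wants.items():       # 1-collision rule
                    if len(reqs) == 1:
                        p, k = reqs[0]
                        obtained[p].add(k)
                active = {p for p in active if len(obtained[p]) < b}
        return obtained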
The following theorem is proved in [14] (in a somewhat more general form). It shows that a 1-collision-DMM can simulate an EREW-PRAM with delay O(log log n), with high probability. The memory bound is implicitly given in the construction.

Theorem 2.1 Let h_1, ..., h_a : U → {1, ..., n/a} be randomly and independently chosen from a log³(n)-universal class of hash functions. Let a ≥ 7, a − 3 ≥ b > a/2, and ε = (ea)^{−a}. Then the (n, ε, a, b)-simulation has delay O(log log n), w.h.p. The simulation needs a · m memory cells, O(m/n + log n) per module, w.h.p.
3 A Simple Fault-Tolerant Simulation

In this section we show that the (n, ε, a, b)-simulation is inherently fault-tolerant for PRAM algorithms with running time Ω(T / log log n), T ≥ log log n, b = ⌊a/2⌋ + 1, a sufficiently large, and p (n^ε, T)-tolerable for an arbitrary constant ε, 0 < ε < 1. This will prove part 1 of the Main Theorem. Theorem 2.1 shows that an access to b > a/2 copies can be done on a 1-collision-DMM with delay O(log log n), if a ≥ b + 3 and ε = (ea)^{−a}. We use the (n, ε, a, b)-simulation with b = ⌊a/2⌋ + 1 introduced above and ε = (ea)^{−a}, with the modification that, in the case of failure of a module in step t, no request directed to that module will be answered from step t on. Thus it suffices to show that, after T steps, each x ∈ U still has at least b + 3 copies in intact modules, w.h.p. Because p is (n^ε, T)-tolerable, at most 2n^ε modules fail during T steps, w.h.p. The worst case of a key becoming unreadable arises if the faulty modules are distributed evenly over all a clusters. Hence we get:
P[there is x ∈ U with fewer than b + 3 copies in intact modules]
  ≤ m · P[a fixed PRAM cell has at least a − b − 2 copies in faulty modules]
  ≤ m · (a choose a − b − 2) · ((2n^ε / a) / (n/a))^{a−b−2}
  ≤ 1/n^l    for a ≥ 2(κ + l + 1)/(1 − ε) + 3 and m ≤ n^κ.

Hence, for a constant a chosen large enough, the (n, ε, a, ⌊a/2⌋ + 1)-simulation simulates Ω(T / log log n) PRAM steps within T steps, w.h.p. The memory bounds follow from Theorem 2.1. This proves part 1 of the Main Theorem.
4 Simulation Using Fault Handling

This simulation will use a variation of the (n, ε, a, b)-simulation mentioned above. In case of a module becoming faulty, the simulation is interrupted by a fault-handling routine that reconstructs the contents of the faulty module in some intact module. In the next subsection we describe a static fault-tolerant simulation, i.e. we assume that initially some modules of the DMM are faulty but no further modules become faulty during the simulation. This will be the basis for further simulations that dynamically react to faults.
4.1 Static Fault Tolerant Simulation
In this part, i, i', i'' ∈ {1, ..., n/a}, k, k', k'' ∈ {1, ..., a}, and x ∈ U. Let V_{i,k} := h_k^{−1}(i) be the set of keys x ∈ U mapped to M_{i,k} by h_k (note: only h_k maps keys to MC_k). We call V_{i,k} the virtual contents of M_{i,k}. Assume that a set of modules {M_{i,k} : (i, k) ∈ I} for an I ⊆ {1, ..., n/a} × {1, ..., a} is faulty after a certain number of steps. Our goal is to maintain the virtual contents of each faulty module M_{i,k}, i.e. V_{i,k}, in an intact module. We will guarantee that the virtual contents of the faulty modules are distributed evenly among the intact modules of the same cluster. Let ℳ_{i,k} denote the set of virtual contents maintained by M_{i,k} (we say M_{i,k} simulates the corresponding modules), and let 𝒫_{i,k} be the set of processors P_{i',k} with V_{i',k} ∈ ℳ_{i,k}. We further assume that P_{i,k} executes the memory accesses for the processors from the set 𝒫_{i,k}. In order to make the keys of V_{i,k} accessible, we maintain an address array A[i', k'] in each module: A[i', k'] = (i'', k') if V_{i',k'} ∈ ℳ_{i'',k'}. First we note that no intact module is assigned the contents of too many faulty modules.

Remark 4.1 If p is (s, T)-tolerable and s ≥ log n, then at most 3s/(2a) modules per cluster become faulty during a T-step computation of the DMM with failure probability p, w.h.p.

Proof: Easy application of the Chernoff bound.
Our Main Theorem uses only (s, T)-tolerable values of p with s ≤ n/2; thus each intact module has to maintain the virtual contents of at most four modules, and each processor has to simulate at most four processors, w.h.p.
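A sketch of the bookkeeping just described (the data layout is our own): the address array A redirects an access for the k-th copy of x from the original module M_{h_k(x),k} to the intact module that currently maintains its virtual contents, and the virtual contents of a faulty module are reassigned to the intact module of the same cluster with the smallest index that maintains fewer than four virtual contents, which is the rule used later in Phase 2 of the fault-handling routine (Section 4.2).

    def lookup(A, i, k):
        """Index of the module that currently maintains the virtual contents V_{i,k}."""
        return A[(i, k)]                      # A[(i, k)] = (i'', k) with V_{i,k} maintained in M_{i'',k}

    def reassign(A, load, faulty_i, k, intact):
        """Move V_{faulty_i,k} to the first intact module of cluster k with load < 4."""
        for i2 in sorted(intact[k]):
            if load[(i2, k)] < 4:
                A[(faulty_i, k)] = (i2, k)
                load[(i2, k)] += 1
                return (i2, k)
        raise RuntimeError("no intact module with spare capacity in cluster %d" % k)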
Our static simulation works as follows: We modify the (n, ε, a, b)-simulation such that each processor P_{i,k} simulates the memory accesses of all processors (at most four) from 𝒫_{i,k}. The (n, ε, a, b)-simulation is then run four times, the j-th time in order to fulfill the memory access request of the j-th simulated processor from the corresponding set 𝒫_{i,k} of each intact processor P_{i,k}. In order to perform the access to the k-th copy of x, the processors first look into the address array A stored in M_{i,k} in order to find out where the virtual contents of M_{h_k(x),k} are maintained. It is easy to organize the memory accesses in a way that the virtual contents of different faulty modules that are maintained in the same intact module are never accessed concurrently (see full version). This makes sure that we get an answer at least whenever we would get one in the corresponding fault-free simulation. Theorem 2.1 now implies the following result:

Lemma 4.2 (Static fault-tolerant simulation) Assume that at most 3n/(4a) modules per cluster are faulty, and no further faults occur during the simulation. If ε, a, b are chosen as in Theorem 2.1, then the modification of the (n, ε, a, b)-simulation described above has delay O(log log n), w.h.p.

4.2 Dynamic Fault Tolerant Simulation
In this section we assume that ε, a, b are chosen as in the Main Theorem, and refer to the modification of the static (n, ε, a, b)-simulation from Lemma 4.2 as the static simulation. We now extend the static simulation such that it can dynamically react to faults occurring during the simulation. To do so, we try to run the static simulation. But if a fault occurs during the simulation, the assumption that the contents of each faulty module, V_{i,k}, are stored in an intact module is no longer true. Therefore we define a fault-handling routine. This routine is invoked as soon as a module has become faulty. Its task is to reorganize the memory of the DMM in a way that the virtual contents of the faulty modules are distributed evenly among the intact modules of the same cluster. We assume that a pending list F is stored in each intact module. It contains all (i, k) such that V_{i,k} is maintained in a faulty module and has not yet been handled by the fault-handling routine described below. For (i, k) ∈ F, V_{i,k} is called critical. Note that the DMM's capability of broadcasting module failures in constant time (see Section 1.3) ensures that the pending lists are always up to date in each intact module. Our fault-tolerant simulation works like the static simulation as long as the pending list is empty. As soon as it is non-empty, the simulation is interrupted by a fault-handling routine, until the pending list is empty again. There are two problems we have to solve: how to realize the fault handling efficiently, and how to ensure that the simulation is finished correctly. Both are addressed in the remainder of this subsection.

The Fault Handling Routine
An easy approach to the first problem would be to reconstruct all keys of a critical V_{i,k} in a module M_{i',k}. As |V_{i,k}| = Θ(m/n), this would need time Ω(m/n). Now, consider the simulation of a t-step PRAM computation in which Θ(n) faults occur, as in part 4 of the Main Theorem. As m may be as large as n·t, the fault handlings alone would need time Ω(n·t), in contrast to the O(t · log log n) bound necessary for part 4 of the Main Theorem.
Therefore, we need a way to reconstruct all keys of V_{i,k} such that M_{i',k} receives much fewer than m/n of V_{i,k}'s keys. The approach is to update ⌊a/2⌋ + 1 arbitrary copies of each key from V_{i,k}. Note that the copy of x in M_{i',k} is not necessarily updated; thus we may have far fewer than m/n accesses to M_{i',k}. It will turn out that each module only gets O(m/n² + log n) updates, w.h.p., instead of the Θ(m/n) in the simple approach mentioned above.
In order to allow the above fault handling to be efficient, we attach an array D_{i,k} of disjunction lists to each V_{i,k}. The list D_{i,k}[i', k'] contains all those keys from V_{i,k} ∩ V_{i',k'} that have been accessed in V_{i,k} so far. It is easy to incorporate the maintenance of the disjunction lists into the simulation from Lemma 4.2: whenever a key x is accessed in V_{i,k} for the first time, all a hash functions are evaluated on x, and x is added to D_{i,k}[h_{k''}(x), k''] for all k'' ≠ k. This only takes O(a) (i.e. constant) time. So we conclude:

Remark 4.3 Lemma 4.2 also holds if, in addition, the disjunction lists are maintained.
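A sketch of the maintenance step (hash functions and containers are placeholders of our own): when a key x of V_{i,k} is accessed for the first time, it is appended to the list D_{i,k}[h_{k''}(x), k''] for every other hash function index k''.

    def record_first_access(D, h, x, i, k, a):
        """Append key x, just accessed for the first time in V_{i,k}, to the
        disjunction lists D[(i, k)][(h[k2](x), k2)] for all k2 != k.
        D is a dict: D[(i, k)][(i2, k2)] lists the keys of V_{i,k} ∩ V_{i2,k2}
        that have already been accessed; h is the list of hash functions."""
        lists = D.setdefault((i, k), {})
        for k2 in range(a):
            if k2 == k:
                continue
            lists.setdefault((h[k2](x), k2), []).append(x)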
Now we are ready to describe the fault-handling routine. Let (i, k) be the head of the pending list F, i.e. the module that maintained V_{i,k} is faulty. We describe how to reconstruct V_{i,k} in an intact module.

Phase 1: Each processor computes the current values of all x ∈ V_{i,k} that have a copy in its module. For this, it runs the static simulation for reading the keys contained in the disjunction lists D_{·,·}[i, k] stored in its module (the dots stand for the indices of the virtual contents maintained in its module).

Phase 2: Each processor computes the intact module M_{i',k} that will be used to maintain V_{i,k} by choosing i' minimal such that M_{i',k} is intact and |ℳ_{i',k}| < 4. Then A[i, k] := (i', k). (Note that Phase 2 can be done in constant time by each processor if a suitable data structure is used.)

Phase 3: Each processor writes the values of the variables read in Phase 1, again using the static simulation. (Note that between Phase 1 and Phase 3 the module maintaining the k-th copies has changed: it is now M_{i',k}.)

Note that there are several implementation details to be taken care of (e.g. avoiding concurrent access to the same key). They will be explained in the full paper [3].

Lemma 4.4 If at least b + 3 copies of each of the keys accessed in Phases 1 and 3 are in intact modules during the fault-handling routine, then it needs time O((m/n² + log n) · log log n).

Proof: (Sketch) If b + 3 copies of each of the keys accessed are in intact modules, Lemma 4.2 yields that the time is bounded by O(maximum length of the disjunction lists · log log n). The following proposition implies the lemma.

Proposition 4.5 The maximum length of the disjunction lists is O(m/n² + log n), w.h.p.
Proof: (Sketch) We only sketch a proof under the assumption that the hash functions are truly random (not only drawn from a log³(n)-universal class of hash functions). Fix (i, k), (i', k') with k ≠ k'. For a key x ∈ U, P[h_k(x) = i and h_{k'}(x) = i'] = a²/n². Therefore E[|D_{i,k}[i', k']|] = a²·m/n². Using Chernoff bounds [16] we get that |D_{i,k}[i', k']| = O(m/n² + log n), w.h.p. Thus max_{(i,k),(i',k')} |D_{i,k}[i', k']| = O(m/n² + log n), w.h.p. The extension to log³(n)-universal classes of hash functions is done by replacing Chernoff bounds by a more involved tail estimate from [15].

The Simulation

We run the static simulation which additionally maintains the disjunction lists (Remark 4.3). Whenever a module becomes faulty, for each virtual contents V_{i,k} maintained by it, (i, k) is appended to the pending list F stored in each intact module. As soon as the pending list is non-empty, the simulation is interrupted and the fault-handling routine is invoked. The simulation is continued only when the fault-handling routine has finished and the pending list is empty. It then restarts the full simulation of the PRAM step previously interrupted by the fault handling. In order to analyze this algorithm, we first describe a property of the copies of the keys that ensures that the simulation does not get stuck because of too many destroyed copies of a key.

Definition 4.6 A key is readable if at least b + 3 of its copies are maintained in intact modules during the whole simulation.

Lemma 4.7 If all keys remain readable during T ≥ n·log log n steps of the faulty 1-collision-DMM and no intact module ever has to maintain more than four modules, then the DMM simulates Ω(T / log log n) PRAM steps, w.h.p.

Proof: (Sketch) As b + 3 copies of each key are in intact modules, the time for simulating one PRAM step is still bounded by O(log log n), w.h.p., by Lemma 4.2, and each round of the fault-handling routine needs time O((m/n²)·log log n), w.h.p., by Lemma
4.4. If t PRAM steps are simulated, m ≤ t·n holds. For sufficiently small ε > 0 we may assume that t ≤ εT/log log n (otherwise the lemma is proven). Then, as T ≥ n·log log n, the total time for fault handling is bounded by T/2, if ε is chosen suitably. Thus T/2 steps are left for the simulation, and at most O(n·log log n) = o(T) of them are spent for completing the simulation of steps that are interrupted by fault handlings. Thus Ω(T / log log n) PRAM steps can be simulated, w.h.p.

Now, part 4 of the Main Theorem follows from Remark 4.1 and the two lemmata below.

Lemma 4.8 (Readability of the keys) All keys remain readable during the T steps of the simulation on the DMM, w.h.p.

Proof: Let α be chosen such that the fault-handling routine needs time at most α·(m/n²)·log log n. We consider the simulation of at most t = T/(12α·log log n) PRAM steps; thus m ≤ nT/(12α·log log n). Let I_1, ..., I_{12αn} be a partition of the time interval [1, T] into intervals of length T/(12αn) each. For r ≥ 0 let Y_{r,j} be the event that at least r modules become faulty during the r intervals I_{j+1}, ..., I_{j+r}. We want to bound P[Y_{r,j}]. The expected number of faults occurring during these intervals is at most r/12. Thus Chernoff bounds ([16]) yield that P[Y_{r,j}] ≤ (1/3)^r.

Now we determine the probability that at some point of time at least r modules are faulty and not handled yet; then at least r virtual contents are in the pending list. Let X_r be the event that we have at least r faulty, not yet handled modules at the same time during the simulation. Then

P[X_r] ≤ Σ_{l=1}^{12αn} P[at least r modules are faulty at the same time in I_l]
       ≤ Σ_{l=1}^{12αn} P[there is 1 ≤ j ≤ l − r + 1 such that at least l − j + 1 modules failed during I_j, ..., I_l]
       ≤ Σ_{l=1}^{12αn} Σ_{j=1}^{l−r+1} P[Y_{l−j+1, j−1}]
       ≤ Σ_{l=1}^{12αn} Σ_{j=1}^{l−r+1} (1/3)^{l−j+1}
       ≤ 12αn · 2 · (1/3)^r.

Now we may conclude

P[there is an unreadable key]
  ≤ m · P[key x is unreadable]
  ≤ m · Σ_{r=1}^{n/2} P[at some point of time F contains the virtual contents of at least r modules such that a − b − 2 copies of x are in V_{i,k}'s with (i, k) ∈ F]
  ≤ m · Σ_{r=1}^{n/2} P[X_r] · (a choose a − b − 2) · (r / (n/a))^{a−b−2}
  ≤ m · 12αn · 2 · (a choose a − b − 2) · (a/n)^{a−b−2} · Σ_{r=1}^{n/2} r^{a−b−2} · (1/3)^r
  ≤ 1/n^l,   for m ≤ n^κ and a sufficiently large,

since Σ_{r≥1} r^{a−b−2} · (1/3)^r = O(1).

Lemma 4.9 (Memory size) The simulation needs memory of size O(m/n + n) per module, w.h.p.
Proof: The address array and the data structure necessary to update it (Phase 2 of the fault-handling routine) need O(n) memory per module. Each module holds O(m/n + log n) copies of keys, w.h.p. (the proof will be given in the full version), and each key is stored in at most a disjunction lists. So the number of memory cells is O(m/n + n) per module, w.h.p.
Parts 2 and 3 of the Main Theorem can be shown using the same simulation but a stronger bound on the length of the disjunction lists (to be shown in the full version). If n·(log log n)² < T < n·log n·log log n, the memory size m of the PRAM is O(n²·log n); each disjunction list can now be shown to have length O(log n). This yields part 3 of the Main Theorem. If n < T ≤ n·(log log n)², the memory size m of the PRAM is O(n²·log log n); each disjunction list can now be shown to have length O(log n / log log n). This yields part 2 of the Main Theorem.
5 Extensions

The algorithms in this paper can be changed in a way that faulty processors can be tolerated.

Definition 5.1 ((s, r, T)-tolerable) Let p ∈ (0, 1), s, r ≤ n and T ≥ 1. p is (s, r, T)-tolerable if the expected number of faulty modules at the end of a T-step DMM computation with failure probability p is at most s, and the expected number of faulty processors at the end of such a computation is at most r.

A complete processor-module pair is taken out of the simulation if the processor or the module fails. In the full paper we describe techniques based on a redundant distribution of PRAM processors to DMM processors which yield fault-tolerant simulations for the case of faulty processor-module pairs. We achieve the same bounds as in the Main Theorem.
References

[1] J.R. Anderson and G.L. Miller: Optical communication for pointer based algorithms. Technical Report CRI 88-14, Computer Science Department, University of Southern California, Los Angeles, CA 90089-0782, USA, 1988.
[2] Ö. Babaoğlu, R. Drummond and P. Stephenson: The impact of communication network properties on reliable broadcast protocols. Technical Report, Department of Computer Science, Cornell University, Ithaca, New York, 1988.
[3] P. Berenbrink, F. Meyer auf der Heide and V. Stemann: Fault-tolerant shared memory simulations. Technical Report, to appear.
[4] B.S. Chlebus, A. Gambin and P. Indyk: PRAM computations resilient to memory faults. In Proc. of the 2nd Annual European Symposium on Algorithms, pp 401-412, 1994.
[5] F. Cristian, H. Aghili, D. Dolev and R. Strong: Atomic broadcast: from simple message diffusion to Byzantine agreement. Computer Science, 1984.
[6] A. Czumaj, F. Meyer auf der Heide and V. Stemann: Shared memory simulations with triple logarithmic delay. In Proc. of the 3rd Annual European Symposium on Algorithms, pp 46-59, 1995.
[7] M. Dietzfelbinger and F. Meyer auf der Heide: Simple, efficient shared memory simulations. In Proc. of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures, pp 110-119, 1993.
[8] M. Dietzfelbinger and F. Meyer auf der Heide: How to distribute a hash table in a complete network. In Proc. of the 22nd ACM Symposium on Theory of Computing, pp 117-127, 1990.
[9] L.A. Goldberg, M. Jerrum and T. Leighton: A doubly logarithmic communication algorithm for the completely connected optical communication parallel computer. In Proc. of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures, pp 300-309, 1993.
[10] L.A. Goldberg, Y. Matias and S. Rao: An optical simulation of shared memory. In Proc. of the 6th Annual ACM Symposium on Parallel Algorithms and Architectures, pp 257-267, 1994.
[11] R. Karp, M. Luby and F. Meyer auf der Heide: Efficient PRAM simulation on a distributed memory machine. In Proc. of the 24th Annual ACM Symposium on Theory of Computing, pp 318-326, 1992.
[12] P.D. MacKenzie, C.G. Plaxton and R. Rajaraman: On contention resolution protocols and associated phenomena. Technical Report 94-06, University of Texas at Austin, 1994.
[13] F. Meyer auf der Heide: Hashing strategies for simulating shared memory on distributed memory machines. In Proc. of the 1st Heinz Nixdorf Symposium "Parallel Architectures and their Efficient Use", F. Meyer auf der Heide, B. Monien, A.L. Rosenberg, eds., pp 20-29, 1992.
[14] F. Meyer auf der Heide, C. Scheideler and V. Stemann: Exploiting storage redundancy to speed up randomized shared memory simulations. In Proc. of the 12th Annual Symposium on Theoretical Aspects of Computer Science, pp 267-278, 1995.
[15] J.P. Schmidt, A. Siegel and A. Srinivasan: Chernoff-Hoeffding bounds for applications with limited independence. In Proc. of the 4th ACM-SIAM Symposium on Discrete Algorithms, pp 331-340, 1993.
[16] A. Siegel: On universal classes of fast high performance hash functions, their time-space tradeoff and their applications. In Proc. of the 30th IEEE Annual Symposium on Foundations of Computer Science, pp 20-25, 1989.
[17] E. Upfal and A. Wigderson: How to share memory in a distributed system. J. Assoc. Comput. Mach. 34, pp 116-127, 1987.