Improved Optimal Shared Memory Simulations, and the Power of Reconfiguration

A. Czumaj, F. Meyer auf der Heide, V. Stemann
Heinz Nixdorf Institute and Department of Computer Science, University of Paderborn, D-33095 Paderborn, Germany
[email protected], [email protected], [email protected]
Abstract. We present time-processor optimal randomized algorithms for simulating a shared memory machine (EREW PRAM) on a distributed memory machine (DMM). The first algorithm simulates each step of an n-processor EREW PRAM on an n-processor DMM with O(log log n / log log log n) delay, with high probability. This simulation is work optimal and can be made time-processor optimal. The best previous optimal simulations require O(log log n) delay. We also study reconfigurable DMMs, which are a "complete network version" of the well-studied reconfigurable meshes. We show an algorithm that simulates each step of an n-processor EREW PRAM on an n-processor reconfigurable DMM with only O(log* n) delay, with high probability. We further show how to make this simulation time-processor optimal.
1 Introduction

Parallel machines that communicate via a shared memory (Parallel Random Access Machines, PRAMs) are the most commonly used machine model for describing parallel algorithms [J92]. The PRAM is relatively comfortable to program, because the programmer does not have to allocate storage within a distributed memory or specify interprocessor communication.

* Supported in part by DFG-Graduiertenkolleg "Parallele Rechnernetzwerke in der Produktionstechnik", ME 872/4-1, by DFG-Forschergruppe "Effiziente Nutzung massiv paralleler Systeme, Teilprojekt 4", by the Esprit Basic Research Action Nr. 7141 (ALCOM II), and by the Volkswagen Foundation.

On the other hand, shared memory machines
are very unrealistic from the technological point of view, because on large machines a parallel shared memory access can only be realized at the cost of a significant time delay. A more realistic model is the Distributed Memory Machine (DMM), in which the memory is divided into a limited number of memory modules, one module per processor. The processors and modules of the DMM are connected by a complete network. Each module can respond to only one access at a time. Thus DMMs exhibit the phenomenon of memory contention, in which an access request is delayed because of a concurrent request to the same module. In an effort to understand the effects of memory contention on the performance of parallel computers, several authors have investigated the simulation of shared memory machines on DMMs. Often the authors assumed that processors and modules are connected by a bounded degree network, and packet routing is used to access the modules [R91, L92a, L92b, U84, KU86]. In this paper we focus on DMMs with a complete interconnection between processors and modules. Additionally we introduce a new model of the DMM, called the reconfigurable DMM, abbreviated RDMM. This model can be viewed as an ordinary DMM with the additional facility of combining links into buses. In each step of the RDMM each processor can combine two adjacent links into one and then read from or write to this new link. This defines edge-disjoint paths and cycles that form buses, which can be used for broadcasting. The capability of reconfiguration is used in our simulation to find the leaders of the cycles of a distributively given permutation. In the light of current efforts to realize optical crossbars, and the (feasible) technology of reconfiguration, our model does not seem too unrealistic. Various models based on reconfigurable architectures have recently been intensively studied, e.g. in [WC90, BS91, OSZ91, BLPS94].
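To make the bus-formation rule concrete, the following toy simulation (our own sketch; `form_buses` and `broadcast` are hypothetical names, and the timing and arbitration details of a real RDMM are ignored) fuses the links chosen by each processor into connected components and broadcasts along them:

```python
import random
from collections import defaultdict

def form_buses(fused):
    """fused[i] is either None or a pair (a, b): processor i fuses its
    links to processors a and b into one.  Globally the fused links form
    disjoint paths and cycles; each connected component acts as one bus."""
    adj = defaultdict(set)
    for i, pair in enumerate(fused):
        if pair is None:
            continue
        a, b = pair
        adj[i].update((a, b))
        adj[a].add(i)
        adj[b].add(i)
    seen, buses = set(), []
    for v in adj:                      # connected components = buses
        if v in seen:
            continue
        stack, comp = [v], []
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            stack.extend(adj[u] - seen)
        buses.append(sorted(comp))
    return buses

def broadcast(buses, writers, rng=random):
    """writers maps a processor to the message it tries to send.  On each
    bus an arbitrary writer succeeds and every processor on the bus reads
    its message (the Arbitrary rule, modelled here by a random pick)."""
    read = {}
    for bus in buses:
        candidates = [p for p in bus if p in writers]
        if candidates:
            msg = writers[rng.choice(candidates)]
            for p in bus:
                read[p] = msg
    return read
```

For example, if every processor i on a ring fuses its links to i−1 and i+1, all processors end up on a single bus and one write reaches all of them.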
1.1 Previous Work

We focus here on simulations based on universal hashing. In such simulations we consider n memory modules, and the shared memory cells U are distributed using one or more hash functions h_i : U → [n]¹, 1 ≤ i ≤ a; cell u ∈ U is stored in the modules M_{h_1(u)}, ..., M_{h_a(u)}. All such simulations assume that h_1, ..., h_a are randomly chosen from a high performance universal class of hash functions, as e.g. presented by Dietzfelbinger and Meyer auf der Heide [DM90] or Siegel [S89]. It is easily seen that a simulation of an n-processor EREW PRAM on an n-processor DMM using one hash function has contention Ω(log n / log log n), even if the hash function behaves like a random function. Mehlhorn and Vishkin [MV84] use a (log n / log log n)-universal class of hash functions to achieve a simple simulation of a CRCW PRAM on a DMM with expected delay O(log n / log log n). Karp et al. [KLM93] break this log n / log log n bound by using two or three hash functions. With this approach an n-processor PRAM can be simulated on an n-processor DMM with expected delay O(log log n). Thus, at the expense of increasing the total storage requirement by a constant factor, the running time of the simulation is exponentially decreased. Additionally, they obtain a time-processor optimal simulation of an EREW PRAM with delay O(log log n · log* n). Dietzfelbinger and Meyer auf der Heide [DM93] extend this result to a much simpler schedule for an O(log log n) simulation on the weaker c-collision DMM, where a module can only answer if it gets fewer than c requests; otherwise it sends a collision symbol. They use the majority technique due to Upfal and Wigderson [UW87]. Meyer auf der Heide et al. [MSS94] extend this result to more hash functions. They show that the use of (log log n)^ε hash functions, for a constant ε > 0, yields delay O(log log n / log log log n). As it uses a non-constant number of hash functions, it cannot be turned into a time-processor optimal simulation.
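For intuition, the contention of a single hash function is just the maximum load when n balls are thrown into n bins, which grows like log n / log log n. The following sketch (illustrative only; a uniformly random placement stands in for the random hash function) estimates it empirically:

```python
import math
import random
from collections import Counter

def max_contention(n, trials=20, seed=1):
    """Throw n requests into n modules via a uniformly random placement
    (standing in for one random hash function) and report the worst
    module load observed over `trials` experiments."""
    rng = random.Random(seed)
    worst = 0
    for _ in range(trials):
        load = Counter(rng.randrange(n) for _ in range(n))
        worst = max(worst, max(load.values()))
    return worst

# Compare the estimate with the balls-into-bins bound log n / log log n:
n = 1 << 14
print(max_contention(n), math.log(n) / math.log(math.log(n)))
```

The observed maximum load is a small constant multiple of log n / log log n, matching the contention bound quoted above.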
Very recently Goldberg et al. [GMR94] show that one can also perform a time-processor optimal simulation with expected delay O(log log n), even on a 1-collision model (also called OCPC). These results look very close to time-optimality.

1 In this paper [n] will always denote the set {1, 2, ..., n}.
In particular, MacKenzie et al. [MPR94] prove that the direct strategy cannot give a faster simulation than Ω(log log n), and independently Meyer auf der Heide et al. [MSS94] show that the running time of O(log log n) is a lower bound even for a much wider class of simulations with constant memory redundancy, as long as the communication is oblivious, except for the accesses to copies of requested keys. The reconfigurable network is a widely considered model of parallel computation [BS91, OSZ91, WC90, BLPS94]. Many fundamental operations and problems have been considered on this model, especially on the reconfigurable mesh, namely data reduction, ranking, sorting, and parity. Wang et al. [WC90] simulate a PRAM with n processors and m shared memory cells on an m × n processor array with a reconfigurable bus system with constant delay. Ben-Asher et al. [BLPS94] consider the complexity of reconfigurable network models. They evaluate the computational power by focusing on the set of problems computable in constant time on some variants of the model.
1.2 New Results

In the present paper we design shared memory simulations using three hash functions; thus we obtain constant memory redundancy. Hence, to break the Ω(log log n) lower bound of Meyer auf der Heide et al. [MSS94] we have to use non-oblivious techniques. The algorithms we present are more involved than the direct algorithms or "balls into bins" games as in [DM93, MPR94, MSS94]. Our first result is a randomized simulation of an EREW PRAM with delay O(log log n / log log log n), with high probability (w.h.p.).² It improves all previously known simulations of an n-processor EREW PRAM on an n-processor DMM,³ either in the running time [KLM93, DM93, GMR94] or significantly in the number of hash functions used [MSS94]. Its basic routine works with two hash functions. The high-level description of the protocol is as follows. In each step, each module chooses among its incoming requests the one with highest contention, i.e., the one for which the memory module storing the other copy gets most requests. For implementing the access protocol we use sophisticated log-star techniques to get an O(log* n)-time preprocessing for computing the number of requests directed to each module. We also show how one can extend this simulation to a time-processor optimal simulation of an
2 W.h.p. means "with probability at least 1 − n^{−α} for an arbitrary constant α."
3 Also denoted briefly by n-EREW PRAM or n-DMM.
(n log log n log* n / log log log n)-EREW PRAM on an n-DMM with delay O(log log n log* n / log log log n), w.h.p., using techniques due to Karp et al. [KLM93]. There exists a constant-time off-line schedule for shared memory simulations, as observed e.g. by Meyer auf der Heide [M92]. Our second simulation is executed on a reconfigurable DMM (RDMM). In addition to the capabilities of a DMM, a processor can connect two of its links. Globally this operation forms disjoint cycles and paths, connecting groups of processors. These cycles can be used as a broadcast medium. The capability of executing such broadcasts enables us to design an algorithm that computes an off-line schedule as mentioned above in time O(log* n), w.h.p. This yields a shared memory simulation on an RDMM with delay O(log* n), w.h.p.; it can be made time-processor optimal. The paper is organized as follows. In Section 2 we define the computation models used in this paper and state some lemmas that we need for our analysis and our algorithms. In Section 3 we establish some properties of the distribution of the sizes of connected components in a random graph, which are essential for our proofs of the running time of the simulations. In Section 4 we present the O(log log n / log log log n) simulation and show how it can be extended to an optimal simulation. Finally, Section 5 presents the O(log* n) simulation on the reconfigurable DMM.
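The use of buses as a broadcast medium for finding cycle leaders can be conveyed by a toy round-based election (our sketch, not the paper's protocol: it needs O(log n) rounds w.h.p., whereas the schedule computation developed in Section 5 relies on considerably more involved techniques):

```python
import random

def elect_leaders(buses, rng=random):
    """Elect one leader per bus by repeated coin flipping: each surviving
    candidate flips a coin, and the bus broadcasts whether at least one
    candidate flipped heads; if so, the tails-flippers withdraw.  A round
    halves the candidates in expectation, so O(log n) rounds suffice
    w.h.p. for all buses simultaneously."""
    leaders = []
    for bus in buses:
        candidates = list(bus)
        while len(candidates) > 1:
            heads = [p for p in candidates if rng.random() < 0.5]
            if heads:              # one bit of bus traffic per round
                candidates = heads
        leaders.append(candidates[0])
    return leaders
```

Each round uses exactly the broadcast capability the RDMM provides: the bus carries the single bit "some candidate flipped heads".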
2 Preliminaries

A parallel random access machine (PRAM) consists of n processors P_1, ..., P_n and a shared memory with cells U = [m]. The processors work synchronously and have random access to the shared memory cells, each of which can store an integer. In this paper we only distinguish between exclusive read exclusive write (EREW) PRAMs and concurrent read concurrent write (CRCW) PRAMs. We deal with two write conflict resolution rules. In the Arbitrary CRCW PRAM we require that if several processors want to write to the same memory cell simultaneously, an arbitrary one of them succeeds. In the Tolerant CRCW PRAM we require that if several processors want to write to the same memory cell simultaneously, then its content remains unchanged.

A distributed memory machine (DMM) has n processors Q_1, ..., Q_n which communicate via a distributed memory consisting of n memory modules M_1, ..., M_n. Each module has a communication window; a module can read from or write into its window. From the point of view of the processors, a window acts like a shared memory cell where concurrent access is allowed. Using the same write conflict resolution rules as in the PRAM, we can define the Arbitrary DMM and the Tolerant DMM. The following fact is easily obtained.

Fact 2.1 An n-processor Arbitrary or Tolerant DMM can be simulated with constant delay on an n-processor Arbitrary or, respectively, Tolerant CRCW PRAM with O(n) shared memory cells, and vice versa.

We also consider a new model of the DMM, called the reconfigurable DMM, abbreviated RDMM. This model can be viewed as an ordinary DMM with the additional facility of reconfiguring the complete network between the processors, such that each processor can combine two links to other processors into a bus. Hence, the links can be viewed as building blocks for larger bus components. The RDMM dynamically reconfigures itself at each time step. Each processor of the RDMM acts locally in each step, combining two adjacent links into one; it can then read from and write into this new link. This defines edge-disjoint paths and cycles that form buses. In each step of the RDMM each processor connected to a given bus can try to send a message; if more than one do, an arbitrary one succeeds. The message sent goes through the bus, so that all processors connected to the given path or cycle can read it. This means, in particular, that it is possible to broadcast information to more than one processor in one step.

In the paper we will use the following consequence of Azuma's martingale inequality, given by McDiarmid [M89].
Theorem 2.2 Let x_1, ..., x_n be independent random variables, where x_i takes values from a finite set A_i, for i = 1, ..., n. Suppose that the (measurable) function f : ∏_{i=1}^n A_i → R satisfies |f(x) − f(x′)| ≤ c_i whenever the vectors x and x′ differ only in the i-th coordinate. Let Y be the random variable f(x_1, ..., x_n). Then, for any t > 0,

  P(|Y − E[Y]| ≥ t) ≤ 2 exp(−2t² / Σ_{i=1}^n c_i²).

Our simulations will also make use of the following results for the strong semisorting and the chaining problem.
Definition 2.1 Given n integers x_1, x_2, ..., x_n, x_i ∈ [n], the semisorting problem is to store them in an array of size dn, for some constant d > 0, such that all variables with the same value occur together, separated only by empty cells. The strong semisorting problem is defined as above, but additionally all occurrences of an integer of multiplicity k must appear in a subarray of the output array of size at most dk.

The following lemma was proved by Bast and Hagerup [BH93].

Lemma 2.3 The strong semisorting problem can be solved for d = 2 on a Tolerant CRCW PRAM in O(log* n) time with linear total work and linear space, with probability at least 1 − 2^{−n^ε} for some constant ε > 0.

Definition 2.2 Given n bits x_1, x_2, ..., x_n, the chaining problem is to find, for each bit x_i, the nearest 1's both to its left and to its right.
The following lemma was shown by Ragde [R93] and Berkman and Vishkin [BV93].⁴

Lemma 2.4 The chaining problem can be solved deterministically in O(α(n)) time with linear total work and linear space on a Tolerant CRCW PRAM, where α denotes the inverse Ackermann function.
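Both primitives are easy to pin down with sequential reference implementations (our sketches; they only fix the input/output conventions and make no attempt at the parallel bounds of Lemmas 2.3 and 2.4):

```python
from collections import Counter

def strong_semisort(xs, d=2):
    """Sequential reference for strong semisorting: place the values of
    xs into an array of size d*len(xs), padded with None, so that all
    occurrences of a value are contiguous; a value of multiplicity k is
    confined to a window of size d*k."""
    out = [None] * (d * len(xs))
    pos = 0
    for val, k in Counter(xs).items():
        out[pos:pos + k] = [val] * k
        pos += d * k                   # reserve a window of size d*k
    return out

def chain(bits):
    """Sequential reference for chaining: for each position i return the
    index of the nearest 1 strictly to its left and strictly to its
    right (None if there is none), in two linear sweeps."""
    n = len(bits)
    left, right = [None] * n, [None] * n
    last = None
    for i in range(n):
        left[i] = last
        if bits[i] == 1:
            last = i
    last = None
    for i in reversed(range(n)):
        right[i] = last
        if bits[i] == 1:
            last = i
    return left, right
```

For example, `chain([0, 1, 0, 0, 1])` pairs every position with the indices of the surrounding ones, exactly the information the parallel algorithms of [R93, BV93] compute.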
3 Distribution of Connected Components

Our PRAM simulations follow the ideas of Upfal and Wigderson [UW87], and Dietzfelbinger and Meyer auf der Heide [DM93] (see also [M92]). The memory of the PRAM is hashed using three hash functions h_1, h_2, and h_3. That means, each memory cell u ∈ U of the PRAM will be stored in the modules M_{h_1(u)}, M_{h_2(u)}, and M_{h_3(u)} of the DMM. We will call the representations of u in the M_{h_i(u)}'s the copies of u. In this extended abstract we shall assume that all the hash functions used are random functions, which simplifies the presentation and the analysis of the algorithm. Universal classes of functions developed by Siegel [S89] are, however, sufficient for our purposes.⁵ For the simulation of a PRAM step we use the following technique of Upfal and Wigderson [UW87], which ensures that it suffices to access an arbitrary two out of the three copies of a shared memory cell to guarantee a correct simulation.

4 These papers contain only algorithms that run on a Common CRCW PRAM, but as observed by Bast and Hagerup [BH93], one can extend these results to Tolerant CRCW PRAMs.
5 See also [KLM93, MSS94, GMR94].
To write to a memory cell, a processor of the DMM accesses at least two of the copies and adds a time stamp to them indicating the (PRAM-)time of the update. To read a memory cell, a processor has to access two of the copies and takes the one with the latest time stamp. We modify this two-out-of-three idea and split this schedule into three steps of trying to access one out of two copies, with a different pair of hash functions in each step. (This approach was also observed by MacKenzie et al. [MPR94].) Clearly, in this way we always access at least two copies. Therefore, in the following we will only analyse how to simulate the access to the shared memory using two hash functions h_1 and h_2. For technical reasons, we do not perform all n accesses to the shared memory simultaneously but split the requests into batches of size n/2^{4+c}, for some constant c ≥ 1 to be specified later. Since we only have a constant number of batches, this slows down our algorithm only by a constant factor. Let S denote such a batch. Let us call a schedule in which, for each u ∈ S, one has to access at least one of the two possible copies a one-out-of-two schedule. Let G = ([n], E) be the labeled directed graph defined by h_1, h_2 and the set of requests S, which has an edge (h_1(u), h_2(u)) labeled u for each u ∈ S. Note that parallel edges and self-loops are allowed in G. Let H be the graph obtained from G by removing all directions from the edges. The algorithms we present rely on the properties of this random graph. The following lemma is an extension of a result of Karp et al. [KLM93]. Define the size of a connected component C, denoted by |C|, to be the number of nodes it contains.
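The time-stamp discipline just described can be sketched as a sequential toy model (our illustration; the class, its interface, and the use of Python's built-in hash as a stand-in for the universal hash functions are all assumptions, not the paper's protocol):

```python
import random

class TwoOfThreeMemory:
    """Sketch of the two-out-of-three scheme: every cell u has copies in
    modules h1(u), h2(u), h3(u); a write updates two copies with a time
    stamp, a read inspects two copies and trusts the fresher one.  Any
    two 2-subsets of the three copies intersect, so a read always meets
    at least one copy touched by the latest write."""

    def __init__(self, n, seed=0):
        rng = random.Random(seed)
        # three "hash functions" (illustrative, not a universal class)
        self.h = [lambda u, r=rng.randrange(1 << 30): hash((u, r)) % n
                  for _ in range(3)]
        self.modules = [dict() for _ in range(n)]  # cell -> (time, value)
        self.time = 0

    def write(self, u, value, copies=(0, 1)):
        self.time += 1
        for i in copies:               # update two of the three copies
            self.modules[self.h[i](u)][u] = (self.time, value)

    def read(self, u, copies=(1, 2)):
        stamped = [self.modules[self.h[i](u)].get(u, (0, None))
                   for i in copies]
        return max(stamped)[1]         # value with the latest time stamp
```

Even when consecutive writes use different pairs of copies, a read through any pair recovers the newest value, because the pairs overlap and the time stamp identifies the fresher copy.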
Lemma 3.1 For each pair of positive constants l and c there is an s ≥ 1 such that
(a) Prob(H has a connected component of size at least cl log n) ≤ n^{−(l−1)},
(b) Prob(H has a connected component C with at least |C| + s − 1 edges) ≤ n^{−(l−1)},
(c) Prob(H has at least 4^{c+1} k² n / 2^{ck} connected components of size at least k) ≤ 2^{−c n^{1/3}}, for k ≤ (1/(4c)) log n.
Proof: The proof of the lemma relies on the following claim.

Claim 3.2 Let k ≥ 2, s ≥ 0. The probability that there is a subgraph G′ ⊆ G such that G′ contains k vertices and at least k + s − 1 edges is at most n^{−s+1} (k + 2)^{s−1} 2^{−c(k+s−1)}.

Proof: Let G_{k,s} be the set of all directed labelled graphs on node set [n] with k vertices and k + s − 1 edges. Choosing the k vertices, the k + s − 1 edges among the k² ordered pairs (with repetition), and the edge labels from the batch S of size n/2^{4+c}, we get

  |G_{k,s}| ≤ \binom{n}{k} \binom{k²+k+s−2}{k+s−1} (n / 2^{4+c})^{k+s−1}
          ≤ (ne/k)^k (e(k + 2))^{k+s−1} (n / 2^{4+c})^{k+s−1}
          ≤ (ne)^{2k+s−1} (k + 2)^{s−1} 2^{−(4+c)(k+s−1)}
          ≤ n^{2k+s−1} (k + 2)^{s−1} (2^{−c})^{k+s−1},

using \binom{a}{b} ≤ (ea/b)^b and (k² + k + s − 2)/(k + s − 1) ≤ k + 2. For a fixed G′ ∈ G_{k,s} and randomly chosen h_1 and h_2, the probability that the directions and labels with respect to h_1 and h_2 coincide with G′, i.e., the probability that G′ is a subgraph of G, is at most n^{−2(k+s−1)}. Therefore, the probability that there is some G′ ∈ G_{k,s} such that G′ is a subgraph of G is at most

  n^{−s+1} (k + 2)^{s−1} (2^{−c})^{k+s−1}.  □
To prove part (a) of the lemma we use Claim 3.2 with k = cl log n and s = 0; this yields an upper bound on the probability of the existence of a connected component of size at least k = cl log n. Part (b) follows easily from the claim, using part (a) and summing over all k, for a sufficiently large constant s.

In order to prove part (c) we assign a binary random variable to each G′ ∈ G_{k,s}:

  X_{G′} = 1 if G′ ⊆ G, and X_{G′} = 0 otherwise.

For a fixed k define Σ_k = Σ_s Σ_{G′ ∈ G_{k,s}} X_{G′}. From Claim 3.2 we get:

  E[Σ_k] ≤ Σ_{s=0}^{n} n^{−s+1} (k + 2)^{s−1} 2^{−c(k+s−1)}
        = (n / ((k + 2) 2^{c(k−1)})) Σ_{s=0}^{n} ((k + 2)/(n 2^c))^s
        ≤ (k + 2)² n / 2^{c(k−1)}
        ≤ 4^c k² n / 2^{ck}.

Clearly Σ_k is an upper bound for the random variable Σ̃_k denoting the number of connected components in H of size at least k. We can view Σ̃_k as a function f of the random choices of the hash functions h_1 and h_2:

  Σ̃_k = f(h_i(x_j) : i ∈ {1, 2}, j ∈ [n/2^{4+c}]).

If the value of h_i(x_j) changes for any (i, j), then Σ̃_k changes by at most one. Therefore, since E[Σ̃_k] ≤ E[Σ_k], Theorem 2.2 (applied to the 2n/2^{4+c} variables h_i(x_j), with c_i = 1) gives

  Pr(Σ̃_k ≥ 4^{c+1} k² n / 2^{ck}) ≤ 2 exp(−2^{4+c} (4^c k² n / 2^{ck})² / n).

For k ≤ (1/(4c)) log n we have 2^{2ck} ≤ √n, so this probability is at most 2 exp(−2^{4+c} 16^c k⁴ √n) ≤ 2^{−c n^{1/3}} for large enough n. □

This lemma is the basis of the proof of the following lemma, which is essential for the analysis of our simulations.

Lemma 3.3 Let H be the random graph with n nodes and n/2^{4+c} edges, for some constant c ≥ 1, defined as in Lemma 3.1. For all constants b and l such that c > 4bl, with probability at least 1 − 2(1/n)^{l−1},

  Σ_{connected components C} |C| 2^{|C|/b} = O(n).
Proof: We split the sum into three parts:

  Σ_{|C| ≤ (1/(4c)) log n} |C| 2^{|C|/b}  +  Σ_{(1/(4c)) log n < |C|}