
Implementing Shared Memory on Multi-Dimensional Meshes and on the Fat-Tree (Extended Abstract)

Kieran T. Herley†

Andrea Pietracaprina‡

Geppino Pucci‡

Abstract

We present deterministic upper and lower bounds on the slowdown required to simulate an $(n, m)$-PRAM on a variety of networks. The upper bounds are based on a novel scheme that exploits the splitting and combining of messages. Such a scheme can be implemented on an $n$-node $d$-dimensional mesh ($d = O(1)$) with $O(n^{1/d} (\log(m/n))^{1-1/d})$ slowdown, and on an $n$-leaf pruned butterfly, a variant of Leiserson's fat-tree, with $O(\sqrt{n \log(m/n)})$ slowdown. These simulations attain the best worst-case slowdowns to date for such interconnections. Moreover, the one for the pruned butterfly is the first PRAM simulation scheme on an area-universal network, and employs novel sorting and routing primitives. Furthermore, the simulations are space-efficient and require a total amount of storage that is within a polylogarithmic factor of the size of the PRAM memory. Finally, under the standard point-to-point assumption, we prove a bandwidth-based lower bound on the slowdown of any deterministic PRAM simulation on an arbitrary network, formulated in terms of its decomposition tree. The general result yields lower bounds of $\Omega(n^{1/d} (\log(m/2n^2)/\log\log(m/2n^2))^{1-1/d})$ and $\Omega(\sqrt{n \log(m/2n^2)/\log\log(m/2n^2)})$ when specialized to the $d$-dimensional mesh and the pruned butterfly, respectively.

Contact Author: Geppino Pucci

Dipartimento di Elettronica e Informatica, Università di Padova, Via Gradenigo 6/A, I-35131 Padova, Italy. Tel: +39 49 8287640. Fax: +39 49 8287699. e-mail: [email protected]

This research was supported, in part, by the ESPRIT III Basic Research Programme of the EC under contract No. 9072 (project GEPPCOM).

† Department of Computer Science, University College Cork, Cork, Ireland. email: [email protected]

‡ Dipartimento di Elettronica e Informatica, Università di Padova, Padova, Italy. email: fandrea,[email protected]

1 Introduction

The problem of implementing shared memory on distributed-memory architectures has been intensively studied over the last decade. Generally, this problem has been referred to as the PRAM simulation problem and involves representing the $m$ cells of the PRAM shared memory (called variables) among the $n$ processor-memory nodes of the simulating machine, in such a way that a PRAM step, in which any $n$-tuple of variables is read or written, can be efficiently executed. The time required to simulate a PRAM step is known as the slowdown of the simulation. A number of approaches to this problem, both probabilistic and deterministic, have been investigated for a variety of well-known architectures such as the complete interconnection, the mesh of trees, the butterfly, as well as a variety of expander-based architectures, among others. We will not attempt to summarize the entire literature on this problem here but only quote those results that directly relate to our work, and refer the interested reader to [PPS94] for a recent and comprehensive summary of previous work on this topic.

In [AHMP87], Alt et al., building on the work of [UW87], presented a deterministic scheme to simulate a PRAM with $n$ processors and $m$ variables (called an $(n, m)$-PRAM) on an $n$-node Module Parallel Computer (MPC), an architecture in which each node includes both a processor and a private memory module accessible only to that processor, and in which the nodes are connected by a crossbar that allows each node to transmit one message to, and receive one message from, any other node per step. Their scheme employs the following copy-based method for the representation of the PRAM variables, which most of the deterministic simulation algorithms, including the present work, adopt. Specifically, each variable is represented by $2c - 1 = O(\log m)$ copies, each consisting of a value and a timestamp, and distributed carefully among the memory modules of the simulating machine.
To write a variable, at least $c$ of its copies are overwritten to reflect the intended value and the time of writing. To read a value, at least $c$ copies are inspected. This set of $c$ copies read must contain at least one of the copies most recently written, which is readily identifiable by virtue of its timestamp. They show that for a suitable distribution of the copies among the nodes of the machine, any $n$-tuple of variables may be accessed (read or written) in $O(\log m)$ time. The above scheme can be ported to an arbitrary network by simulating each MPC step through sorting. In particular, this approach yields a simulation with slowdown $O(n^{1/d} \log m)$ on an $n$-node $d$-dimensional mesh, $d = O(1)$, and a simulation with slowdown $O(\sqrt{n} \log m)$ on an $n$-leaf pruned butterfly, which are the interconnections that we consider in this paper.

Alt et al. have observed that a simple PRAM simulation for the 2-dimensional mesh with an optimal slowdown of $O(\sqrt{n})$ is indeed possible [AHMP87]. Unfortunately, this simulation requires between $\sqrt{n}$ and $n$ copies per variable, resulting in an unacceptable memory blow-up. Moreover, the method does not appear to extend easily to higher dimensional meshes or to the pruned butterfly.

More recently, Pietracaprina et al. [PPS94, PP95] have studied PRAM simulations on the mesh from a different perspective. Their work focuses on devising explicitly constructible memory organizations that permit efficient access to the variables. Their scheme achieves $O(\sqrt{n} \log n)$ slowdown for memories of at most $O(n^{1.5})$ variables using $O(\log^{1.59} n)$ copies. The result also extends to any memory size polynomial in the number of processors, but requires the use of nonconstructive expanders of constant degree. In contrast, the simulations in this paper (as well as the one in [AHMP87]) rely on nonconstructive expanders of logarithmic degree. Such powerful graphs allow us to achieve a better slowdown using lower redundancy.

It might appear, at least at first glance, that updating $c$ copies apiece for $n$ variables would require the physical movement of $cn$ distinct packets across the entire network, resulting in an expensive access protocol that requires $\Omega(c n^{1/d})$ time on a $d$-dimensional mesh, for example. In order to circumvent this difficulty, we devise a novel splitting and combining technique based on the following idea. If a processor $p$ wishes to send the same packet to two nodes $x$ and $y$ that are distant from $p$ but close to one another, then rather than dispatch a separate packet for each, it may be more efficient to dispatch a single message to some node close to both $x$ and $y$ that then splits into two packets which are forwarded to $x$ and $y$. A careful implementation of this idea yields the result stated in the following theorem, which constitutes the fastest deterministic PRAM simulation on multi-dimensional meshes to date.

Theorem 1 There exists a scheme to simulate an $(n, m)$-PRAM on an $n$-node $d$-dimensional mesh, $d = O(1)$, with worst-case slowdown $O(n^{1/d} (\log(m/n))^{1-1/d})$, using $O(\log(m/n))$ copies per variable and $O((m/n) \log^3(m/n))$ storage per node.
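The splitting idea described before Theorem 1 can be illustrated by counting link traversals on a 2-dimensional mesh. The Python sketch below is only an illustration under simplifying assumptions: traffic is measured by Manhattan hop counts, and the split node `z` is picked by hand. The function names and coordinates are hypothetical, and the sketch is not the routing algorithm used by the simulation.

```python
def manhattan(a, b):
    """Hop distance between two nodes (row, column) of a 2-dimensional mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def separate_cost(p, x, y):
    """Link traversals when p dispatches one packet to x and another to y."""
    return manhattan(p, x) + manhattan(p, y)


def split_cost(p, x, y, z):
    """Link traversals when p sends a single packet to z, which splits it in two."""
    return manhattan(p, z) + manhattan(z, x) + manhattan(z, y)


# Assumed example: x and y are far from p but adjacent to the split node z.
p, x, y = (0, 0), (30, 30), (31, 29)
z = (30, 29)
print(separate_cost(p, x, y), split_cost(p, x, y, z))
```

In this example the long haul from `p` is paid once instead of twice, which is the effect that the splitting and combining technique exploits at scale.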

In order to implement the strategy outlined above, the scheme relies on a recursive decomposition of the mesh and on efficient algorithms for sorting $kn$ packets and for $(k, k)$-routing, where each processor sends and receives at most $k$ packets.

The $n$-leaf pruned butterfly [BB90] is an area-universal network that is a variant of Leiserson's fat-tree [Lei85]. Although quite different from the 2-dimensional mesh in terms of the details of its structure, it is sufficiently similar in its bandwidth characteristics to allow it to support the key operations upon which our simulations rely, with comparable efficiency. By providing novel sorting and routing primitives for this network, and by using its natural decomposition into subtrees, we are able to implement the above simulation scheme with the same slowdown achieved for the 2-dimensional mesh, thereby obtaining the first PRAM simulation on an area-universal network. The result is stated in the following theorem.

Theorem 2 There exists a scheme to simulate an $(n, m)$-PRAM on an $n$-leaf pruned butterfly with worst-case slowdown $O(\sqrt{n \log(m/n)})$, using $O(\log(m/n))$ copies per variable and $O((m/n) \log^3(m/n))$ storage per node.

Lower bounds on the slowdown of PRAM simulations on bounded degree networks have been presented in a number of studies [AHMP87, KU88, HB94]. All such bounds, however, apply to the entire class of bounded degree networks, among which there are networks with low diameter and high bandwidth. For example, in [HB94] the authors show an $\Omega(\log^2(m/n)/\log\log(m/n))$ lower bound, which is too weak for our purposes, since a trivial $\Omega(n^{1/d})$ (resp., $\Omega(\sqrt{n})$) lower bound holds for $d$-dimensional meshes (resp., pruned butterfly), due to diameter and bandwidth limitations. In this paper, we present the first lower bound argument that takes into account the individual characteristics of the network. To capture the properties of the network topology, the bound exploits the notion of decomposition tree [BL84, Lei85], which provides a partition of the network into disjoint regions of limited bandwidth.

We consider only on-line simulation in which the simulation of each step does not commence until the simulation of all preceding read steps has concluded. Furthermore, as in all previous works, our lower bound is proved under the point-to-point assumption, which requires that a processor seeking to update a number of copies of a variable must dispatch a separate message for each copy, and that these copies must be transported separately by the network. This discipline precludes the combining of messages that share a portion of their paths, which is a crucial feature of the simulations presented in this paper. In fact, the lower bound indirectly vindicates the usefulness of our splitting and combining techniques, by suggesting that any simulation that achieves a slowdown significantly smaller than those presented here must exploit similar techniques.

Assume that the simulating network has a $[w_0, w_1, \ldots, w_{\log n}]$ decomposition tree [Lei85], that is, for any $i$, $0 \le i \le \log n$, the network can be partitioned into $2^i$ disjoint regions, each connected to the rest of the network by at most $w_i$ edges. The following theorem states our general lower bound.

Theorem 3 Let $m \ge 2n^2$ and let $T = \Omega(m/n)$. The worst-case running time of any on-line, point-to-point simulation of a $T$-step $(n, m)$-PRAM computation on an $n$-processor network with a $[w_0, w_1, \ldots, w_{\log n}]$ decomposition tree, is

$$\Omega\left( T \cdot \min_{r \ge 0} \max_{1 \le h,k \le \log n} \left\{ g_{h,k}(r) + \frac{r n}{2^h w_h} \right\} \right), \qquad (1)$$

where $g_{h,k}(r) = \min\left\{ n/w_h,\ n/(r w_k),\ n (m/2n^2)^{1/2r} / (2^k w_k) \right\}$.
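To get a feel for expression (1), the quantity inside the $\Omega$ can be evaluated numerically for a given list of widths. The Python sketch below is an illustration only: constant factors are ignored, the minimum over $r$ is taken over a finite grid of positive values supplied by the caller, and the names `g` and `per_step_bound` are hypothetical. In the usage lines the widths $w_i = (n/2^i)^{1/2}$ are an assumed stand-in for a 2-dimensional mesh.

```python
import math


def g(h, k, r, n, m, w):
    """g_{h,k}(r) = min{ n/w_h, n/(r w_k), n (m/2n^2)^(1/2r) / (2^k w_k) }."""
    factor = (m / (2 * n * n)) ** (1.0 / (2 * r))
    return min(n / w[h], n / (r * w[k]), n * factor / (2 ** k * w[k]))


def per_step_bound(n, m, w, r_values):
    """Min over the given r > 0 of max over h, k of g_{h,k}(r) + r n / (2^h w_h)."""
    levels = range(1, int(math.log2(n)) + 1)
    best = math.inf
    for r in r_values:
        worst = max(
            g(h, k, r, n, m, w) + r * n / (2 ** h * w[h])
            for h in levels
            for k in levels
        )
        best = min(best, worst)
    return best


# Assumed widths for a 2-dimensional mesh with n = 1024 nodes (constants dropped).
n, m = 1024, 2 ** 40
w = [(n / 2 ** i) ** 0.5 for i in range(int(math.log2(n)) + 1)]
print(per_step_bound(n, m, w, [0.5, 1, 2, 4, 8, 16]))
```

Multiplying the returned value by $T$ gives the shape of the bound in Theorem 3 for the chosen widths, up to the constants hidden by the $\Omega$.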

By specializing the lower bound to $d$-dimensional meshes, with $d$ constant, and to the pruned butterfly, we obtain the following result.

Corollary 1 Under the same hypotheses as Theorem 3, any simulation of a $T$-step $(n, m)$-PRAM computation requires time

$$\Omega\left( T\, n^{1/d} \left( \frac{\log(m/2n^2)}{\log\log(m/2n^2)} \right)^{1-1/d} \right) \quad \left[ \text{resp., } \Omega\left( T \sqrt{\frac{n \log(m/2n^2)}{\log\log(m/2n^2)}} \right) \right]$$

on an $n$-node $d$-dimensional mesh [resp., an $n$-leaf pruned butterfly].

The rest of the paper is organized as follows. Section 2 discusses the memory map that describes the distribution of the copies among the memory modules. In Section 3, the mesh simulation algorithm is presented. The algorithm consists of two phases, copy-selection and routing, which are described in Subsections 3.2 and 3.1, respectively. In Section 4, the scheme is ported to the pruned butterfly. The lower bound is presented in Section 5.

2 Memory Organization

Consider the simulation of an $(n, m)$-PRAM on an $n$-node machine and suppose that each variable is replicated into $2c - 1$ copies, for a suitable integer $c$. It is convenient to model the distribution of the copies of the variables among the nodes of the machine by means of a bipartite graph $G = (U, V; E)$, where $U$ represents the set of variables, $V$ the set of nodes, and $2c - 1$ edges connect each variable to the distinct nodes storing its copies. Note that there is a one-to-one correspondence between $E$ and the set of all copies. We say that such a graph is smooth if the degree of every vertex in $V$ is $O((|U|/|V|)(2c - 1))$. Let $S \subseteq U$ and let $F \subseteq E$ contain exactly $k$ edges incident on every $s \in S$. We call $F$ a $k$-bundle for $S$. We say that $F$ has degree $q$ if at most $q$ edges of $F$ are incident on any node of $\Gamma_F(S)$, where $\Gamma_F(S)$ denotes the subset of $V$ touched by the edges in $F$.

Our simulations adopt the standard majority protocol requiring that the access to a set of variables $S$ be performed by accessing the copies corresponding to a $c$-bundle for $S$. In order to achieve fast simulation times, it is desirable to access a $c$-bundle of low degree, since the degree of a $c$-bundle is a measure of the congestion at the nodes of the underlying machine, that is, the number of physical copies that ultimately have to be accessed from some individual node. The existence of $c$-bundles of low degree is intimately related to the expansion properties of the graph $G$. This motivates the following definition [HB94] that characterizes a class of graphs that make good memory organizations.

Definition 1 A bipartite graph $G = (U, V; E)$ with $|U| = m$, $|V| = n$, and input degree $d$ is an $(\epsilon, d, c, \alpha)$-generalized expander if, for every $S \subseteq U$ such that $|S| \le \alpha n$ and for every $c$-bundle $F$ of $S$, $|\Gamma_F(S)| \ge \epsilon c |S|$.

The memory organization of our simulations is governed by a smooth $(\epsilon, 2c - 1, c, 1/gc)$-generalized expander $G = (U, V; E)$, where $\epsilon < 1$ is some arbitrary constant, $g = (4e^{1+\epsilon})^{2/(1-\epsilon)}$ and $c = (2/(1 - \epsilon)) \log(3m/n) = \Theta(\log(m/n))$. The existence of such a graph can be established with a minor modification to a result presented in [HB94], but no efficient construction for graphs of this type is yet known.
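The majority protocol adopted in this section can be illustrated for a single variable. The Python sketch below is a minimal model, assuming the copies are held in one list rather than spread over network nodes; the class name `ReplicatedVariable` and its methods are hypothetical. It relies only on the fact that any two sets of $c$ copies out of $2c - 1$ must intersect.

```python
class ReplicatedVariable:
    """One PRAM variable stored as 2c - 1 (value, timestamp) copies."""

    def __init__(self, c, initial=0):
        self.c = c
        self.copies = [(initial, 0)] * (2 * c - 1)

    def write(self, indices, value, time):
        """Overwrite the copies at `indices`; at least c distinct copies are required."""
        assert len(set(indices)) >= self.c
        for i in indices:
            self.copies[i] = (value, time)

    def read(self, indices):
        """Inspect at least c distinct copies and return the most recently written value."""
        assert len(set(indices)) >= self.c
        newest = max((self.copies[i] for i in indices), key=lambda vt: vt[1])
        return newest[0]


v = ReplicatedVariable(c=3)          # 5 copies in total
v.write([0, 1, 2], "a", time=1)      # update copies 0, 1, 2
print(v.read([2, 3, 4]))             # copy 2 carries the newest timestamp
```

The reader never needs to know which copies the writer chose; the timestamp of the shared copy settles the value.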
It is assumed for the moment that each node in the network holds a copy of a read-only table that encodes the structure of the memory organization. In the full paper, we will show how this table may be represented in a distributed fashion, requiring $O((m/n) \log^3(m/n))$ storage per node. Note that the total storage is only a polylogarithmic factor away from the size of the PRAM memory.

Let $S \subseteq U$ be a set of at most $n$ variables. Conceptually, the construction of a $c$-bundle $F$ for $S$ is accomplished by means of a sequence of whittling steps. Each whittling selects $c$ edges apiece for some of the variables in $S$ to include in $F$. The choice is made at each step to ensure that the degree of $F$ never exceeds a fixed threshold $q$, whose value will be specified later. During the execution of the whittling steps, a variable is said to be alive if $c$ edges for the variable have not been selected yet, and dead otherwise. Let $S_i \subseteq S$ denote the set of live variables at the beginning of the $i$-th whittling, where $S_1 = S$. Intuitively, the $i$-th step will identify a set $W_i$ of "difficult-to-access" nodes and attempt to select $c$ edges apiece for as many live variables as possible without involving $W_i$. In fact $W_i$ ($i \ge 1$) must include the set of all nodes in $V$ with more than $q$ incident edges from $S_i$. Each $W_i$ with $i \ge 2$ will contain only these "congested" nodes, while $W_1$ may include an additional small number of nodes but contains fewer than $n/g$ nodes in total. We say that $x \in S_i$ is confined to $W_i$ if $x$ has $c$ or more copies stored in nodes of $W_i$. In the $i$-th whittling step, for each $x \in S_i$ which is not confined to $W_i$ we select $c$ edges not incident on $W_i$ and add them to $F$. Thus $x$ becomes dead. The variables that remain alive, that is, those which are confined to $W_i$, will form the set $S_{i+1}$ which is guaranteed, by our choice of memory organization, to be significantly smaller than $S_i$.

A sequence of whittling steps that kills all the variables in $S$ is called a whittling sequence for $S$. Since the number of live variables diminishes rapidly with each passing whittling step, only a number of steps that is a logarithmic function of $n$ is required to kill all the variables. At the conclusion of this whittling sequence, the set $F$ obtained constitutes a $c$-bundle for $S$ of degree at most $q$.

Lemma 1 If $S$ is a set of $n$ variables, then a whittling sequence of length $k = \lceil \log(n/gc) \rceil + 2$ suffices to construct a $c$-bundle for $S$ of degree at most $q = (2c - 1)gn/(n - g) = \Theta(\log(m/n))$. Moreover, the number of variables left alive at the start of the $i$-th whittling step satisfies $|S_i| < n/(gc\,2^{i-2})$, for $2 \le i \le k$.
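The whittling process of this section can be written down as a short sequential procedure. The Python sketch below is an illustration under stated assumptions, not the distributed algorithm of Section 3: every $W_i$ consists only of the congested nodes (the extra nodes allowed in $W_1$ are ignored), and the function name `whittle` and the dictionary representation of the memory map are hypothetical. On a memory map without the required expansion the loop may stall, which the sketch reports as an error.

```python
from collections import Counter


def whittle(copies_of, c, q):
    """Select c copies per variable so that no node serves more than q selections.

    copies_of maps each variable to the list of 2c - 1 distinct nodes
    holding its copies. Returns {variable: [c selected nodes]}.
    """
    bundle = {}
    live = set(copies_of)
    while live:
        # Congested nodes: more than q incident edges from live variables.
        load = Counter(v for x in live for v in copies_of[x])
        congested = {v for v, d in load.items() if d > q}
        survivors = set()
        for x in live:
            free = [v for v in copies_of[x] if v not in congested]
            if len(free) >= c:
                bundle[x] = free[:c]      # x is not confined, so it dies here
            else:
                survivors.add(x)          # x is confined and stays alive
        if survivors == live:
            raise ValueError("no progress: the memory map lacks the needed expansion")
        live = survivors
    return bundle
```

A node receives selections only from the first step in which it is uncongested onward, and at that point it has at most $q$ incident edges from live variables, so the resulting bundle has degree at most $q$.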

3 Simulation on Meshes

We restrict ourselves to describing the simulation of an $(n, m)$-PRAM step on a $\sqrt{n} \times \sqrt{n}$ 2-dimensional mesh. The extension to higher dimensions requires minor technical modifications which will be provided in the full version of this abstract. Moreover, we only consider the simulation of EREW PRAMs, since the techniques needed to extend the result to more general variants of the model are standard.

Suppose that each processor-memory node is connected to its immediate neighbours by means of bidirectional wires capable of transmitting a single $O(\log m)$-bit packet per step, and that the PRAM variables are distributed among the mesh nodes according to the memory organization described in the previous section. Consider the simulation of a write step, and let $S$ be the set of $n$ variables to be written. The set $S$ will be represented by means of a set of variable packets, each bearing the name of a referenced variable, the value to be written to that variable, and space for a bitvector of length $2c - 1$ to be explained later. We will use the symbol $S$ to refer both to the set of variables and the set of packets that represent it.

The simulation consists of two phases: Copy Selection, during which a $c$-bundle of degree $q$ for $S$ is determined and encoded in the bitvectors of the variable packets; and Access, where the actual updates on the selected copies take place. (The simulation of a read step is virtually identical, except that the Access phase will also involve routing a value back to the reading processors.) Our algorithms will also employ a set of copy packets representing the set of copies of $S$, with one packet per copy. Note that there are up to $(2c - 1)|S|$ distinct copy packets, and any direct manipulation of such a large set is too expensive for our purposes. A crucial feature of our simulation is to use this explicit representation sparingly.

The simulation is based on two algorithmic primitives: $k$-sorting and $(k, k)$-routing.
Given a set of packets distributed among the $n$ nodes so that each node holds at most $k$ packets, $k$-sorting is used to rearrange the packets so that node 1 holds the packets with the $k$ smallest keys, node 2 holds the next $k$ smallest keys, and so on. For any value of $k$, this can be accomplished in $O(k(\sqrt{n} + \log k))$ time on the mesh. $(k, k)$-routing involves routing a set of packets subject to the constraint that no node is the source (resp., destination) of more than $k$ packets. This can be done on the mesh in $O(k\sqrt{n})$ time. Both algorithms can be found in [KSS94].

For ease of presentation, we will discuss the implementation of Access first and then discuss that of Copy Selection, which embodies some of the same ideas.
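The input and output conventions of the two primitives can be pinned down with a sequential reference model. The Python sketch below only specifies what $k$-sorting must produce and which packet sets qualify as $(k, k)$-routing instances; it says nothing about the mesh algorithms of [KSS94]. The function names are hypothetical, and nodes are numbered from 0 rather than from 1.

```python
from collections import Counter


def k_sort(nodes, k):
    """Reference result of k-sorting: node 0 gets the k smallest keys, node 1 the next k, ..."""
    assert all(len(held) <= k for held in nodes)
    keys = sorted(key for held in nodes for key in held)
    return [keys[i * k:(i + 1) * k] for i in range(len(nodes))]


def is_kk_routing_instance(packets, k):
    """packets is a list of (source, destination) pairs; check the (k, k) constraint."""
    sources = Counter(s for s, _ in packets)
    destinations = Counter(d for _, d in packets)
    return (all(v <= k for v in sources.values())
            and all(v <= k for v in destinations.values()))


print(k_sort([[5, 1], [4], [3, 2]], k=2))
print(is_kk_routing_instance([(0, 1), (0, 2), (1, 2)], k=2))
```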

3.1 Access Phase

We assume that following the completion of Copy Selection, the bitvectors of $S$ encode a $c$-bundle for $S$. (The $i$-th bit of the bitvector for $u \in S$ indicates whether or not the $i$-th copy of $u$ belongs to the $c$-bundle.) The mesh is conceptually partitioned into submeshes of size $\sqrt{n/s} \times \sqrt{n/s}$ called cells. Updating the selected copies of $u$ will involve routing a copy of $u$'s variable packet to each node holding a selected copy of $u$. This is accomplished in three steps. Firstly, the set $S$ is replicated in each cell of the mesh. Secondly, within each cell a copy packet is generated for each copy of a referenced variable that resides in that cell. (A copy packet is effectively a copy of the appropriate variable packet labelled with the location of the copy in question (destination).) Thirdly, each copy packet is routed locally within its cell to its destination. Note that delaying the generation of the copy packets to the second step is the crucial feature of the algorithm, since in this way all the copy packets within a cell will "share" the cost of the routing from the requesting processor to the cell.

For brevity, we omit the discussion of the first step and only mention that it can be accomplished in $O(\sqrt{ns})$ time using an appropriate combination of routing and splitting.

The second step involves the generation of the appropriate copy packets from the copy of $S$ held locally within each individual cell. Although the copy selection guarantees that there will be at most $q$ copy packets to be sent to each destination (see Lemma 1), a node of the cell may initially hold up to $s$ variable packets, each of which may generate from 0 to $c$ copy packets. In order to avoid congestion in routing the copy packets, great care has to be exercised to ensure that each node of the cell generates approximately the same number of such packets. Within each cell $C$, we partition $S$ into degree classes $S_C^{(i)}$, $0 \le i \le \lceil \log(2c - 1) \rceil$, where $S_C^{(i)}$ contains those variables with between $2^{i-1}$ and $2^i$ copies in $C$. The variables of each class are distributed (using $\lceil \log(2c - 1) \rceil$ prefix computations and one execution of $(s, s)$-routing), so that each node receives at most $\lceil |S_C^{(i)}|/(n/s) \rceil$ of them, which ensures that each node generates $O(q)$ copy packets. The complexity of this step is $O(\log c \sqrt{n/s} + \sqrt{ns})$.

The third step is effectively an instance of $(q, q)$-routing within each cell and can be completed in $O(q\sqrt{n/s})$ time. Summing the contributions of the three steps and selecting $s = q = \Theta(\log(m/n))$, we see that Access can be completed in $O(\sqrt{n \log(m/n)})$ time. We have:

Lemma 2 The copies corresponding to a $c$-bundle for $S$ of degree at most $q = O(\log(m/n))$ can be accessed in $O(\sqrt{n \log(m/n)})$ time.
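The bound of Lemma 2 follows from adding the three step costs of the Access phase with the cell parameter set to $s = q$. The display below is a worked restatement of that arithmetic rather than an additional result; it uses only the costs quoted in this subsection and the fact that $c$ and $q$ are both $\Theta(\log(m/n))$, so that $\log c \le q$ for large enough $m/n$.

```latex
\begin{align*}
\underbrace{O(\sqrt{ns})}_{\text{step 1}}
 + \underbrace{O(\log c\,\sqrt{n/s} + \sqrt{ns})}_{\text{step 2}}
 + \underbrace{O(q\sqrt{n/s})}_{\text{step 3}}
 &= O\!\left(\sqrt{nq} + \log c\,\sqrt{n/q} + q\sqrt{n/q}\right)
   && (s = q) \\
 &= O(\sqrt{nq})
   && (q\sqrt{n/q} = \sqrt{nq},\ \log c \le q) \\
 &= O\!\left(\sqrt{n \log(m/n)}\right)
   && (q = \Theta(\log(m/n))).
\end{align*}
```

A larger $s$ makes the replication step more expensive, while a smaller $s$ makes the local routing within each cell more expensive; the choice $s = q$ equalizes the two.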

3.2 Copy Selection Phase

In this subsection we describe how we select a $c$-bundle of low degree for $S$ by performing a sequence of whittling steps as described in Section 2. We break the algorithm into two stages for efficiency reasons. The first stage performs the first whittling step and applies to $n$ variables, while the second stage performs the rest of the whittling sequence and applies to the $O(n/(2c - 1))$ variables that are left alive by the first stage.

Stage 1 Conceptually, the first stage whittles $S$ with respect to a set $W_1$ of nodes that contains (i) all nodes with degree greater than $q$, and (ii) all nodes belonging to cells whose total degree is at least $q(n/s) + 1$. (We measure degree with respect to the edges incident on $S$.) Our choice of memory organization and of the parameter $q$ ensures that the size of $W_1$ is less than $n/g$ and that this stage selects $c$ edges apiece that do not touch $W_1$ for all but $O(n/(2c - 1))$ variables in $S$. The implementation of this idea, included in the full version of this paper but omitted here, involves identifying the nodes in $W_1$ and determining for each variable in $S$ which of its copies are incident on this set. In essence, the algorithm is similar to that for the Access phase, though more involved, and has the same running time.

Stage 2 Lemma 1 implies that at the end of Stage 1 we have determined a $c$-bundle for all but $O(n/(2c - 1))$ variables. The $i$-th whittling step ($i \ge 2$) involves whittling $S_i$ with respect to the set $W_i$ of nodes that have degree greater than $q$ in terms of the edges of $S_i$. By representing the edges of $S_i$ by a set of $r_i = (2c - 1)|S_i|$ copy packets, this whittling may be completed efficiently within a submesh of size $\sqrt{r_i} \times \sqrt{r_i}$ in time $O(\sqrt{r_i})$ by means of sorting and prefix operations. (The outcome of the whittling, namely the identity of the selected copies for the newly dead variables, is recorded in the bitvectors of packets in $S_i$.) By Lemma 1, $r_{i+1} \le r_i/2$, for $i \ge 2$, therefore the cost of the sequence of whittling steps performed during the second stage is

$$O\left( \sum_{i=1}^{\infty} \sqrt{r_i} \right) = O\left( \sum_{i=1}^{\infty} \sqrt{n/2^i} \right) = O(\sqrt{n}),$$

which is always dominated by the cost of Stage 1. Combining the costs of the two stages, we obtain the following.

Lemma 3 Given a set $S$ of $n$ distinct variables, a $c$-bundle for $S$ of congestion at most $q = \Theta(\log(m/n))$ can be constructed in $O(\sqrt{n \log(m/n)})$ time.
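The $O(\sqrt{n})$ cost of the second stage rests on a geometric series with ratio $1/\sqrt{2}$. The display below spells out that sum; it is a worked restatement, with the explicit constant added only for illustration.

```latex
\[
\sum_{i=1}^{\infty} \sqrt{\frac{n}{2^i}}
  = \sqrt{n} \sum_{i=1}^{\infty} \left(\frac{1}{\sqrt{2}}\right)^{i}
  = \sqrt{n} \cdot \frac{1/\sqrt{2}}{1 - 1/\sqrt{2}}
  = \frac{\sqrt{n}}{\sqrt{2} - 1}
  = (1 + \sqrt{2})\,\sqrt{n}
  = O(\sqrt{n}).
\]
```

Because the packet counts halve from one whittling step to the next, the whole second stage costs only a constant factor more than its first step.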

The combination of Lemma 3 with Lemma 2 shows that our simulation achieves the running time stipulated in Theorem 1.

4 Simulation on the Pruned Butterfly

A closer look at the PRAM simulation devised for the mesh reveals that the copy selection and access phases rely upon the recursive decomposition of the network into subnetworks, and upon the primitives of sorting and routing. In fact, the scheme can be ported to any network topology with an appropriate decomposition into subnetworks (cells) and for which an implementation of the above primitives is available. In this section, we provide efficient algorithms for $k$-sorting and $(k, k)$-routing on the pruned butterfly, a universal network belonging to the class of fat-trees. As a consequence, we obtain the first efficient PRAM simulation on a universal network.

The $n$-leaf pruned butterfly, introduced in [BB90], is a variant of Leiserson's fat-tree [Lei85]. Its coarse structure may be interpreted as an $n$-leaf complete binary tree where the leaves represent the processor-memory nodes of the machine, the internal nodes represent clusters of routing switches, and where the edges represent channels of bandwidth that doubles every other level from the leaves to the root. More precisely, each subtree of $n'$ leaves is connected to its parent through a channel of capacity $\sqrt{n'}$. A 16-leaf pruned butterfly is illustrated in Figure 1, where each ellipse indicates a cluster of routing switches that collectively constitute a single internal node of the underlying binary tree. A formal definition of the network can be found in [BB90]. The pruned butterfly is area-universal in the sense that it can route any set of messages almost as efficiently as any circuit of similar area. The following lemma summarizes the results on sorting and routing for the pruned butterfly. More details will be provided in a full version of this abstract.

Figure 1: A 16-leaf pruned butterfly

Lemma 4 Any instance of $k$-sorting and $(k, k)$-routing can be executed on the $n$-leaf pruned butterfly in time $O(k(\sqrt{n} + \log k))$.

Proof (sketch): For the sorting, we will employ an adaptation of Batcher's bitonic sorting to handle the case where $k = 1$ and invoke the techniques of Baudet and Stevenson [BS78] to provide the generalization to larger values of $k$. As for the routing, we adapt the structure of the routing algorithm of Peleg and Upfal [PU89], in which the $(k, k)$-routing is accomplished by using balancing, sorting, and parallel prefix techniques. We develop an algorithm for balancing that evenly redistributes any set of packets among the leaves of the pruned butterfly in $O(k\sqrt{n})$ time, provided that no leaf holds more than $k$ packets initially. □

By using the above primitives we can port the PRAM simulation scheme devised for the mesh to the pruned butterfly, thereby obtaining the result stated in Theorem 2.
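The bandwidth profile of the pruned butterfly described in this section, a channel of capacity $\sqrt{n'}$ above each subtree of $n'$ leaves, can be tabulated directly. The Python sketch below is an illustration only: it lists the idealized capacities without rounding, assumes $n$ is a power of two, and uses the hypothetical name `channel_capacities`. The exact switch-level structure is the one defined in [BB90].

```python
import math


def channel_capacities(n):
    """Capacity sqrt(n') of the channel joining a subtree of n' leaves to its
    parent, listed for n' = 1, 2, 4, ..., n/2 (n must be a power of two)."""
    levels = int(math.log2(n))
    return [math.sqrt(2 ** j) for j in range(levels)]


caps = channel_capacities(16)
print(caps)
# Moving up two levels quadruples n' and therefore doubles the capacity.
print([caps[j + 2] / caps[j] for j in range(len(caps) - 2)])
```

The same square-root growth is what the mesh offers across the boundary of a submesh, which is why the two networks support the sorting and routing primitives with comparable efficiency.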

5 Lower Bound

In this section, we develop a novel lower bound for deterministic PRAM simulation on processor networks. The lower bound argument is similar in spirit to the one used in [AHMP87, KU88, HB94], but embodies a number of critical enhancements that make it sensitive to the bandwidth characteristics of the interconnection, whereas previous bounds were significant only for highly expanding networks. The bound relies on the notion of decomposition tree [BL84, Lei85], which provides a partition of the network into disjoint regions of limited bandwidth. We first formulate the general lower bound in terms of such a decomposition, and then specialize it to the case of multi-dimensional meshes and the pruned butterfly.

Consider the simulation of an $(n, m)$-PRAM on an $n$-processor network $N$. Each PRAM step involves the reading (read step) or writing (write step) of some $n$-tuple of variables. It is assumed that the simulation is on-line, in the sense that the simulation of a PRAM step begins only after the simulation of previous read steps has been completed. Furthermore, we only consider simulations that are point-to-point, by which we mean that a processor that wants to write a variable must dispatch a distinct message for each copy of the variable it wants to update. This is not the case in the simulations presented in this paper. However, at the end of the section we observe that part of the argument does not rely on the point-to-point assumption. This enables us to derive a non point-to-point lower bound formulated in terms of the global space used to represent the PRAM variables, which applies to our upper bounds.

We assume that the simulating network $N$ has a $[w_0, w_1, \ldots, w_{\log n}]$ decomposition tree, as defined in [Lei85], that is, for any $i$, $0 \le i \le \log n$, $N$ can be partitioned into $2^i$ disjoint $i$-regions, $R^{(i)}_1, \ldots, R^{(i)}_{2^i}$, where each $i$-region is connected to the rest of the network by at most $w_i$ edges. For any two indices $h$ and $k$, with $1 \le h, k \le \log n$, and any variable $u$, let $r^t_{h,k}(u)$ be the minimum, taken over all $h$-regions $R^{(h)}_j$, of the number of $k$-regions containing valid (i.e., most recently updated) copies of $u$ that lie outside $R^{(h)}_j$. The average redundancy at time $t$ is $r^t_{h,k} = \sum_{u \in U} r^t_{h,k}(u)/m$. Intuitively, the lower bound hinges on a tradeoff between the cost of the write and the cost of the read steps.
Unless the average redundancy is high, and therefore writing steps are expensive, an adversary is always able to generate read steps that will be relatively expensive to simulate.

Lemma 5 Let $m \ge 2n^2$. If $r = r^t_{h,k}$ is the average redundancy at time $t$, then there exists an $n$-tuple of variables which requires time $\Omega(g_{h,k}(r))$ to be read, where

$$g_{h,k}(r) = \min\left\{ \frac{n}{w_h},\ \frac{n}{r w_k},\ \frac{n\,(m/2n^2)^{1/2r}}{2^k w_k} \right\}.$$

Proof (sketch): We identify a set of $n$ variables all of whose valid copies are confined within a low-bandwidth portion of the network. Let $U' = \{u \in U : r^t_{h,k}(u) \le 2r\}$. Clearly, $|U'| \ge m/2$ and there exists an $h$-region $R^{(h)}_{j_0}$ for which there are at least $m/2^{h+1} \ge m/2n \ge n$ variables in $U'$ achieving their minimum redundancy with respect to $R^{(h)}_{j_0}$. We distinguish between two cases. If $r < 1/2$, then there exists an $n$-tuple of variables whose valid copies are all within $R^{(h)}_{j_0}$, and reading them therefore requires time $\Omega(n/w_h)$. If instead $r \ge 1/2$, a combinatorial argument [PP94] shows that there exist $n$ variables whose updated copies are contained in the union of $R^{(h)}_{j_0}$ and at most

$$\max\left\{ 4r,\ 2^k \left( \frac{2n^2}{m} \right)^{1/2r} \right\}$$

$k$-regions. Reading these variables would take time

$$\Omega\left( \frac{n}{w_h + w_k \left( 4r + 2^k (2n^2/m)^{1/2r} \right)} \right).$$

The lemma is proved by taking the minimum between the two cases. □
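The two counting claims at the start of the proof sketch of Lemma 5 follow from an averaging argument and the pigeonhole principle. The display below fills in these steps as a worked restatement; it writes $r(u)$ for $r^t_{h,k}(u)$ and uses only the definitions of this section, the fact that there are $2^h \le n$ regions at level $h$, and the hypothesis $m \ge 2n^2$.

```latex
\begin{align*}
&\text{Averaging: } r(u) \ge 0 \text{ and } \sum_{u \in U} r(u) = m r, \text{ so }
  2r \cdot |U \setminus U'| \le \sum_{u \notin U'} r(u) \le m r
  \;\Longrightarrow\; |U \setminus U'| \le m/2
  \;\Longrightarrow\; |U'| \ge m/2. \\
&\text{Pigeonhole: each } u \in U' \text{ attains its minimum at one of the } 2^h
  \text{ regions of level } h, \text{ so some region accounts for at least }
  \frac{m/2}{2^h} = \frac{m}{2^{h+1}} \text{ of them.} \\
&\text{Size: } \frac{m}{2^{h+1}} \ge \frac{m}{2n} \ge \frac{2n^2}{2n} = n
  \quad (\text{since } 2^h \le n \text{ and } m \ge 2n^2).
\end{align*}
```

When $r = 0$ the first line is trivial, since then every $r(u)$ is zero and $U' = U$.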

In order to prove Theorem 3, we construct a PRAM program made of batches of m=n write steps that update all the variables, followed by m=n read steps suitably chosen by the adversary. Consider the simulation of one such batch. If at time t the average redundancy is r, then mr packets have crossed boundaries of h-regions. In particular, there must be an h-region R(jh) such that at least mr=2h packets have crossed its boundary, requiring time at least mr=(2h wh ) = (m=n)rn=(2h wh ). The theorem follows by combining the cost of the writes with the cost of the reads established in Lemma 5, by maximizing over all choices of h and k, and by minimizing over all possible values of r. The lower bound specializes to the d-dimensional meshes and the pruned butter y as follows. For d-dimensional meshes, we choose the natural decomposition into submeshes, which yields a ? [w0; w1; : : :; wlog n ] balanced decomposition tree with wi = (n=2i)1?1=d . For the pruned butter y, the natural decomposition into subtrees yields a [w0; w1; : : :; wlog n ] balanced decomposition ? tree with wi = (n=2i)1=2 . We choose h and k such that 2h = log(m=2n2 )= loglog(m=2n2 ) and 2k = r(m=2n2 )1=2r . It can be shown that for both networks, the minimum in Equation (1) is ? attained for r = log(m=2n2 )= loglog(m=2n2 ) , yielding the result of Corollary 1. Finally, we observe that Lemma 5 is proved without resorting to the point-to-point assumption, therefore it can be used to obtain an unrestricted lower bound (based only on the diculty of the read operations), parametric in the total space used by the simulation. More precisely, suppose we want to simulate an (n; m)-PRAM using at total of at most mr copies of the variables, on the simulating machine. Then, r provides a bound on the average redundancy at any time during the simulation, and therefore gh;k (r) is a lower bound on the slowdown of any such simulation. 
By specializing this argument for the case of multi-dimensional meshes and the case of the pruned butterfly we obtain the following result.

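Theorem 4 For any constant $\alpha \geq 1$ there exists a constant $\beta > 0$ such that any simulation of an $(n, m)$-PRAM that uses a total of at most $\alpha m \log(m/2n^2)/\log\log(m/2n^2)$ copies for the variables requires slowdown

$$\Omega\left( n^{1/d} \left( \frac{\log(m/2n^2)}{\log\log(m/2n^2)} \right)^{\beta(1-1/d)} \right) \qquad \left[ \text{resp., } \Omega\left( \sqrt{n} \left( \frac{\log(m/2n^2)}{\log\log(m/2n^2)} \right)^{\beta/2} \right) \right]$$

on an $n$-node $d$-dimensional mesh [resp., an $n$-leaf pruned butterfly].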

The above result implies that the simulations presented in this paper use an amount of redundancy which is only a factor $\log\log(m/2n^2)$ away from the best redundancy required to attain the same running times.

References

[AHMP87] H. Alt, T. Hagerup, K. Mehlhorn, and F. P. Preparata. Deterministic simulation of idealized parallel computers on more realistic ones. SIAM Journal on Computing, 16(5):808-835, Oct 1987.

[BB90] P. Bay and G. Bilardi. Deterministic on-line routing on area-universal networks. In Proceedings of the 31st Annual Symposium on Foundations of Computer Science, pages 297-306, 1990.

[BL84] S. N. Bhatt and F. T. Leighton. A framework for solving VLSI graph layout problems. Journal of Computer and System Sciences, 28(2):300-343, Apr 1984.

[BS78] G. Baudet and D. Stevenson. Optimal sorting algorithms for parallel computers. IEEE Transactions on Computers, C-27(1):84-87, Jan 1978.

[HB94] K. T. Herley and G. Bilardi. Deterministic simulations of PRAMs on bounded-degree networks. SIAM Journal on Computing, 23(2):276-292, Apr 1994.

[Her94] K. T. Herley. Representing shared data on distributed-memory parallel computers. Mathematical Systems Theory, 1994. To appear.

[KSS94] M. Kaufmann, J. F. Sibeyn, and T. Suel. Derandomizing algorithms for routing and sorting on meshes. In Proceedings of the 5th ACM-SIAM Symposium on Discrete Algorithms, pages 669-679, 1994.

[KU88] A. R. Karlin and E. Upfal. Parallel hashing: An efficient implementation of shared memory. Journal of the ACM, 35(4):876-892, Oct 1988.

[Lei85] C. E. Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Transactions on Computers, C-34(10):892-900, Oct 1985.

[PP94] A. Pietracaprina and G. Pucci. Work-efficient deterministic PRAM simulations are impossible. GEPPCOM Report 17-94, Dipartimento di Elettronica e Informatica, Università di Padova, Italy, Oct 1994.

[PPS94] A. Pietracaprina, G. Pucci, and J. F. Sibeyn. Constructive deterministic PRAM simulations on a mesh-connected computer. In Proceedings of the 6th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 248-256, Jun 1994.

[PP95] A. Pietracaprina and G. Pucci. Improved deterministic PRAM simulation on the mesh. In Proceedings of the 22nd International Colloquium on Automata, Languages and Programming, Jul 1995. To appear.

[PU89] D. Peleg and E. Upfal. The token distribution problem. SIAM Journal on Computing, 18(2):229-243, Apr 1989.

[UW87] E. Upfal and A. Wigderson. How to share memory in a distributed system. Journal of the ACM, 34(1):116-127, Jan 1987.
