Exploiting Locality for Data Management in Systems of Limited Bandwidth

Bruce M. Maggs¹, Friedhelm Meyer auf der Heide², Berthold Vöcking², Matthias Westermann²

¹ School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 (email: [email protected]). Supported in part by the Air Force Materiel Command (AFMC) and ARPA under Contract F196828-93-C-0193, by ARPA Contracts F33615-93-1-1330 and N00014-95-1-1246, and by an NSF National Young Investigator Award, No. CCR-94-57766, with matching funds provided by NEC Research Institute and Sun Microsystems. This research was conducted in part while he was visiting the Heinz Nixdorf Institute, with support provided by DFG-Sonderforschungsbereich 376 "Massive Parallelität: Algorithmen, Entwurfsmethoden, Anwendungen".

² Department of Mathematics and Computer Science, and Heinz Nixdorf Institute, University of Paderborn, 33095 Paderborn, Germany (email: {fmadh, voecking, [email protected]). Supported in part by DFG-Sonderforschungsbereich 376, by EU ESPRIT Long Term Research Project 20244 (ALCOM-IT), and by DFG Leibniz Grant Me872/6-1.

Abstract

This paper deals with data management in computer systems in which the computing nodes are connected by a relatively sparse network. We consider the problem of placing and accessing a set of shared objects that are read and written from the nodes in the network. These objects are, e.g., global variables in a parallel program, pages or cache lines in a virtual shared memory system, shared files in a distributed file system, or pages in the World Wide Web. A data management strategy consists of a placement strategy that maps the objects (possibly dynamically and with redundancy) to the nodes, and an access strategy that describes how reads and writes are handled by the system (including the routing).

We investigate static and dynamic data management strategies. In the static model, we assume that we are given an application for which the rates of read and write accesses for all node-object pairs are known. The goal is to calculate a static placement of the objects to the nodes in the network and to specify the routing such that the network congestion is minimized. We introduce efficient algorithms that calculate optimal or close-to-optimal solutions for tree-connected networks, meshes of arbitrary dimension, and internet-like clustered networks. These algorithms take time only linear in the input size.

In the dynamic model, we assume no knowledge about the access pattern. An adversary specifies accesses at runtime. Here we develop dynamic caching strategies that also aim to minimize the congestion on trees, meshes, and clustered networks. These strategies are investigated in a competitive model. For example, we achieve competitive ratio 3 for tree-connected networks and competitive ratio O(d · log n) for d-dimensional meshes of size n. Further, we present an Ω(log n/d) lower bound on the competitive ratio for on-line routing in meshes, which implies that the achieved upper bound on the competitive ratio for meshes of constant dimension is optimal.


1 Introduction

Large parallel and distributed systems, such as massively parallel processor systems (MPPs) and networks of workstations (NOWs), consist of a set of nodes, each having its own local memory module. These nodes are usually connected by a relatively sparse network of limited bandwidth. In this paper, we consider the problem of placing and accessing shared objects that are read and written from the nodes in the network. The objects are, e.g., global variables in a parallel program, pages or cache lines in a virtual shared memory system, shared files in a distributed file system, or pages in the World Wide Web (WWW).

The performance of MPPs and NOWs depends on a number of parameters, including processor speed, memory capacity, network topology, bandwidths, and latencies. Usually, the buses and links are the bottleneck in these systems, since improving communication bandwidth and latency is often more expensive or more difficult than increasing processor speed and memory capacity. But whereas several standard methods are known for hiding latency, e.g., pipelined routing (see, e.g., [7, 8]), redundant computation (see, e.g., [1, 2, 15, 20, 21, 23]), or slackness (see, e.g., [29]), the only way to bypass the bandwidth bottleneck is to reduce the communication load by exploiting locality.

The principle of locality is already known from sequential computation. Two kinds of locality are usually distinguished: temporal and spatial locality (see, e.g., [10]). Temporal locality means that in the near future, a program is more likely to reference those data objects that have been referenced in the recent past. This locality can be due to instruction references in program loops, or data references in working stacks. Spatial locality means that in the near future, a program is more likely to reference those data objects that have addresses close to the last reference. This is caused, e.g., by the traversal of data structures such as arrays.
A further kind of locality is specific to parallel and distributed systems: topological locality. Topological locality means that processors that are close together according to the topology of the interconnection network are likely to be interested in the same data objects. This locality can be due to a communication-sensitive mapping of processes to processors. In this paper, we investigate data management strategies exploiting temporal and topological locality in order to increase the network utilization and therefore the performance of the system.

A data management strategy consists of a placement and an access strategy. The placement strategy specifies the distribution of the objects among the memory modules. In particular, it has to answer the following questions.

- How many copies of an object should be made?
- On which nodes should these copies be placed?

The access strategy specifies how read and write accesses are handled by the system. This includes answers to the following questions.

- How should consistency among the copies be maintained?
- How should access messages be routed through the network?

In this paper, we neglect the scheduling aspect of routing, because we want to concentrate on reducing the communication load, which depends only on the routing paths. The following parameters are important measures for this communication load. For a given application, we define the (absolute) load of a bus or link to be the amount of data that must be transferred over this connection during the execution of the application. We define the relative load of a connection to be its load divided by its bandwidth. Finally, we define the congestion to be the maximum over the relative loads of all buses and links in the system.

Of course, it is useful to economize the total communication load, i.e., the sum over all bus and link loads. However, developing data management strategies that just aim to minimize the total load can produce communication bottlenecks in the network. Minimizing the congestion overcomes this drawback, because the congestion measure captures bottlenecks in the system. Therefore, it is a lower bound on the execution time of a given application. Moreover, several results on store-and-forward and wormhole routing (see, e.g., [8, 17, 22, 24, 27]) indicate that the congestion yields a good estimate for the execution time taken by coarse-grained applications with high communication load. Hence, minimizing the congestion seems to be a good heuristic for efficient data management.
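As a small numeric illustration of these three measures (all load and bandwidth values below are invented for the example), the congestion is simply the maximum relative load over all connections:

```python
# Toy illustration of the load measures defined above
# (all numbers are invented for this example):
# absolute load = data a connection must transfer,
# relative load = absolute load / bandwidth,
# congestion    = maximum relative load over all buses and links.

loads = {"e1": 120, "e2": 300, "e3": 80}      # absolute load per connection
bandwidths = {"e1": 40, "e2": 100, "e3": 10}  # bandwidth per connection

relative_load = {e: loads[e] / bandwidths[e] for e in loads}
congestion = max(relative_load.values())

print(relative_load)  # e3 is the bottleneck although e2 carries more data
print(congestion)
```

Note how the narrow connection e3, not the most heavily loaded one e2, determines the congestion; this is exactly the bottleneck effect that minimizing only the total load would miss.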
Our aim is to develop strategies for placing and accessing shared objects such that the locality inherent in an application is exploited in order to minimize the congestion. We model the network by a (hyper-)graph G = (V, E) with node set V and edge set E such that the (hyper-)edges represent the links (or buses). The bandwidths of these connections are described by a function b : E → ℕ. The set of shared objects is denoted by X. We use two models to formalize information about the pattern of accesses to the shared objects of an application.

1.1 Description of the static model

In this model, we assume that we are given an application A for which the access rates from the nodes to the objects are known, i.e., the numbers of read and write accesses in A are described by the functions h_r : V × X → ℕ and h_w : V × X → ℕ, respectively. We are interested in efficient algorithms calculating a static placement of the objects and specifying the routing paths such that the congestion is small.

The placement of the objects is allowed to be redundant, i.e., each object can be placed on several nodes. In order to maintain consistency among the different copies of an object, we assume that a write access to an object has to update all of its copies, whereas a read access to an object can be satisfied by any one of its copies. Updates of copies are allowed to be done by multicasts, i.e., the node that wants to update a set of copies sends out one message that is transmitted along the branches of a tree to all the copies. The goal is to place the copies and to determine the routing paths (or trees in the case of multicasts) in such a way that the congestion is minimized. For simplicity, we assume that the additional load due to a write or read access on each edge belonging to the respective routing path (or tree) is one. Alternatively, it is possible to weight the access rate functions h_r and h_w according to different cost measures; e.g., the h_r-values could be multiplied by some factor larger than one, because reads require sending additional request messages.

Note that the model described above is slightly restrictive in the sense that it fixes the update policy to a certain range. In particular, it does not include strategies that allow only a fraction of the copies to be updated in the case of a write, which, e.g., is implemented in strategies using the majority trick introduced in [28]. However, all strategies using such techniques add time stamps to the copies.
This requires that there is some definition of uniform time among different nodes. Since it is not clear how to realize this in an asynchronous setting, we initially restrict ourselves to strategies that update all copies in the case of a write. At the end of this paper, in Section 5, we consider a more general model, including, e.g., strategies using the majority trick.


1.2 Description of the dynamic model

In the dynamic model, there is no knowledge about the access patterns of an application in advance. We assume that an adversary specifies a parallel application running on the nodes of the network, i.e., the adversary initiates read and write requests on the nodes of the network. These accesses should be served by a dynamic data management strategy. If this strategy is randomized, then the adversary is assumed to be oblivious, i.e., the accesses to shared objects are not allowed to depend on the random decisions of the data management strategy.

We restrict the class of allowed applications specified by the adversary to data-race free programs, i.e., a write access to an object is not allowed to overlap with other accesses to the same object, and there is some order among the accesses to the same object such that, for each read and write access, there is a unique most recent write. Note that this still allows arbitrary concurrent accesses to different objects, and concurrent read accesses to the same object. A consistent data management strategy ensures that a read always returns the value of the most recent write. Write accesses are assumed to be object alterations rather than overwrites. Thus, we demand that none of them is ignored, not even in the case of immediately consecutive write accesses.

We are interested in developing on-line distributed data management strategies that minimize the congestion. These strategies are allowed to migrate, create, and invalidate copies of an object during execution time. Migration means that a copy is moved along a path through the network; creation means that a copy is duplicated, and the new copy is migrated; and invalidation means that a copy is deleted. Initially, one copy of each object is placed somewhere in the network. Arbitrary communication between neighboring nodes in the network is allowed. However, each communication increases the load on the involved edge.
For simplicity, we assume that each object fits into one routing packet, so that each migration of a copy along an edge increases the load of this edge by one. Request, update, and invalidation messages are also assumed to have size one. More complicated models, including non-uniform message sizes and slice-wise accesses to larger objects, are discussed at the end of this paper in Section 5.

We use the competitive ratio as a measure for the efficiency of a dynamic data management strategy. For a given application A, let C_opt^dyn(A) denote the minimum congestion expended by an optimal dynamic strategy having full knowledge of the parallel program specified by the adversary (including knowledge of all future accesses). A dynamic strategy is said to be k-competitive if it achieves congestion at most k · C_opt^dyn(A), for any application A. A randomized strategy has to satisfy this bound with high probability¹. Of course, the optimal strategy has an advantage over an on-line strategy, because it has full knowledge of the dynamic access pattern. Without the restriction to data-race free programs, the optimal strategy could, e.g., defer write accesses to later time steps in order to save repeated read accesses from the same node to the same object. This illustrates that the restriction to this class of programs is necessary in order to allow a fairer comparison of on-line against off-line strategies.
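As a toy numeric check of this definition (all congestion values below are invented), a strategy is k-competitive when its congestion is at most k times the clairvoyant optimum on every application:

```python
def is_k_competitive(k, runs):
    """runs: list of (online_congestion, optimal_congestion) pairs, one per
    application A. The strategy is k-competitive if C_online <= k * C_opt
    holds for every application."""
    return all(c_on <= k * c_opt for c_on, c_opt in runs)

# Hypothetical measured congestions for three applications:
runs = [(9, 3), (4, 2), (12, 4)]
print(is_k_competitive(3, runs))  # holds: 9 <= 9, 4 <= 6, 12 <= 12
print(is_k_competitive(2, runs))  # fails already on the first application
```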


1.3 Our Results

We introduce new static and dynamic data management strategies for tree-connected networks, for meshes of arbitrary dimension and side length, and for internet-like clustered networks. The strategies for trees are deterministic, and the others are randomized. All strategies aim to minimize the congestion by exploiting locality. To our knowledge, this is the first analytic treatment of this problem.

We start by developing strategies for tree-connected networks. Here we introduce an efficient algorithm that calculates an optimal placement of the objects. We show that the placement minimizes the load on any edge of the tree and, therefore, also the total load and the congestion, regardless of the bandwidths. The sequential running time of the algorithm is linear in the input size, i.e., O(|X| · |V|). Moreover, the algorithm can be efficiently calculated in a distributed fashion by the processors of the underlying tree network. We also develop a 3-competitive dynamic caching strategy. Note that both results are applicable to trees with arbitrary bus connections having non-uniform bandwidths. Hence, they are well suited, e.g., for Ethernet-connected NOWs.

¹ Throughout the paper, the terms "congestion C, with high probability" and "congestion C, w.h.p." mean "congestion C + x, with probability at least 1 − 2^{−Ω(x)}". Note that this implies that the expected congestion is C + O(1).

The situation on mesh-connected networks is much more complicated than the one on trees, because here there are several possible routing paths between every pair of nodes. In fact, we show by a straightforward reduction from PARTITION that the static problem is NP-hard already on a 3 × 3 mesh. The dynamic problem is even more complicated, since we have to


solve an on-line distributed routing problem, including multicasting and data tracking. We describe a simulation strategy that solves the static and the dynamic problem simultaneously. The strategy is based on a randomized but locality-preserving embedding of access trees into the mesh. Hence, it is called the access tree strategy. On the access trees, the static or dynamic strategy for trees is simulated.

Consider an arbitrary mesh of size n and dimension d. Here the access tree strategy yields an efficient algorithm for static data placement on arbitrary meshes achieving optimal congestion up to an O(d · log n) factor, w.h.p., that can be calculated sequentially in time linear in the input size. In the dynamic model, the access tree strategy achieves competitive ratio O(d · log n). We give a corresponding Ω(log n/d) lower bound on the competitive ratio for on-line routing in these networks, which implies that the upper bound is optimal for meshes of constant dimension.

Finally, we investigate internet-like clustered NOWs. We show that the access tree strategy is also suitable for this kind of topology. In particular, we show that a static placement with close-to-optimal congestion can be calculated in linear time. Further, the access tree strategy can be used on these networks for efficient dynamic data management, e.g., for WWW pages. The characteristics of this topology and the results for it are described in more detail in Section 4. In Section 5, we show that most of the static results hold even in more general models capturing, e.g., the majority trick. Further, we show that the dynamic access tree strategy can be extended to handle non-uniform object sizes and slice-wise accesses to large objects.

1.4 Related Work

The problem of distributing and accessing shared objects in networks has been investigated in several papers. We give a brief overview of the most important related work in this area. The first approaches to the problem concentrated on modeling it by mixed integer programs and solving these programs efficiently using heuristics. Several models with different cost functions and constraints have been developed. The 1981 survey paper by Dowdy and Foster [9] gives an overview of this work.

Most theoretical work in the area of distributed data management concerns PRAM simulations. In [13], Karlin and Upfal present a probabilistic emulation on an N-node butterfly. Their algorithm emulates one step of an


N-processor EREW PRAM in time O(log N), w.h.p. Ranade [26] showed how combining could be used to improve this result, i.e., he showed that a CRCW PRAM step can be emulated in the same time. Both strategies use random hash functions to distribute the memory cells uniformly among the processors. This scheme can also be adapted to other networks. For instance, it yields N-processor PRAM simulations for the √N × √N mesh with slowdown Θ(√N). Although this bound cannot be improved for general PRAM simulations, it is not satisfactory for applications exhibiting locality.

Awerbuch et al. investigate dynamic data management in arbitrary networks. In [4] they present a dynamic caching strategy that minimizes the total communication load up to a polylogarithmic factor. In [5] they adapt their scheme to systems with limited memory capacities. However, their strategies use the concept of a global leader that is eventually involved in nearly every access issued by one of the processors. This shows the importance of considering the congestion rather than the total communication load.

Better results are known for dynamic data management on tree-connected networks. Here Bartal et al. [6] describe a randomized 3-competitive dynamic strategy for trees, and Lund et al. [19] describe a deterministic strategy with the same competitive ratio. Both strategies allow accesses to slices of objects. However, the underlying competitive model does not account for the request and invalidation messages required for the distributed execution of the algorithms. In particular, the competitive ratio for the deterministic algorithm in [19] cannot be translated into our model, since the approach used is centralized. The ratio for the randomized algorithm in [6] increases by a constant factor if it is "translated" into our model.

Several recent papers deal with the distribution of pages in the WWW.
Plaxton and Rajaraman [25] show how to balance the pages among several caches by embedding a random cache tree for each page into the network. This balances the load well and ensures fast responses even for popular pages. Karger et al. [12] use a similar technique to relieve hot spots in the WWW. Note that the technique of embedding a random tree for each object is similar to our access tree strategy. The main differences from our approach are the following. The strategy in [25] uses a uniform embedding of the tree nodes onto the nodes in the Internet, which completely dissolves topological locality. The strategy in [12] pays attention to topological locality. In fact, they use a model similar to our Internet model. However, they consider the latencies, rather than the bandwidths, to be the main problem for data transmission in the Internet. Further, they consider only read accesses to the pages.


2 Data management on tree-connected networks

In this section, we describe static and dynamic data management strategies for tree-connected networks. The advantage of these networks is that there is only one simple path between any pair of nodes. Thus, the placement strategy automatically defines the routing paths from the accessing processor to the respective copies. In particular, this means that the congestion is fixed as soon as the placement is specified. This makes the analysis for trees much easier than that for networks containing cycles. In the following, the network is modeled by a (hyper-)graph T = (V, E). The edges are allowed to have arbitrary bandwidths. Let diam(T) denote the diameter of this graph and degree(T) its maximum node degree.
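Because removing a tree edge e splits T into two components, the load that a placement puts on e follows directly from the model: a read is served by a copy on the reader's own side whenever one exists, and a write crosses e exactly when some copy lies on the far side. The following sketch (node names and rates are hypothetical; single object, unit-cost accesses) computes this per-edge load:

```python
def edge_load(side_a, side_b, copies, r, w):
    """Load on a tree edge e for one object, where removing e splits the
    node set into side_a and side_b. copies: nodes holding a copy;
    r, w: per-node read and write rates. A read is served by a copy on
    the reader's side if one exists; a write from v crosses e iff some
    copy lies on the opposite side."""
    has_a = any(v in copies for v in side_a)
    has_b = any(v in copies for v in side_b)
    load = 0
    for side, own, far in ((side_a, has_a, has_b), (side_b, has_b, has_a)):
        for v in side:
            if not own:        # no copy on v's side: reads and writes cross e
                load += r[v] + w[v]
            elif far:          # only writes cross, to update far-side copies
                load += w[v]
    return load

# Path a - b - c, edge e = (b, c): side_a = {a, b}, side_b = {c}.
r = {"a": 0, "b": 0, "c": 3}
w = {"a": 0, "b": 0, "c": 1}
print(edge_load({"a", "b"}, {"c"}, {"a"}, r, w))       # all of c's accesses cross
print(edge_load({"a", "b"}, {"c"}, {"a", "c"}, r, w))  # only c's write crosses
```

With the only copy on a, all of c's accesses cross the edge; with copies on both a and c, only c's write crosses it, to keep the far copy consistent.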

2.1 Static placement on trees

2.1.1 Trees with point-to-point connections

For simplicity, we initially assume that all connections in the network are point-to-point connections. This means that the tree is a "normal" graph, i.e., each edge in T is incident to exactly 2 nodes. We have some knowledge about the access frequencies from the nodes to the objects, i.e., we are given two functions h_r : V × X → ℕ and h_w : V × X → ℕ that describe the rate of read and write accesses, respectively, from the nodes in V to the objects in X. Each object can be placed statically on one or more nodes. A read access to an object can be satisfied by one of its copies, and a write access has to update all copies.

The following strategy, which we call the nibble strategy, places each object x ∈ X independently of the other objects. Fix an object x. We use the following notations and definitions concerning the access rates to x. For a node v ∈ V, define r(v) = h_r(v, x), w(v) = h_w(v, x), and h(v) = r(v) + w(v), and for a subtree T' = (V', E'), define r(T') = Σ_{v ∈ V'} r(v), w(T') = Σ_{v ∈ V'} w(v), and h(T') = r(T') + w(T').

The nibble strategy places copies of x on a set of nodes that form a connected component including the gravity center of T. This gravity center depends on the access rates for x. The gravity center of our fixed object x is denoted by g(T), and it is defined as follows. For v ∈ V, let T(v) denote the set of subtrees into which T is partitioned if v is removed from it. Consider



the set of nodes v that satisfy the following condition:

∀ T' = (V', E') ∈ T(v) : h(T') ≤ h(T)/2.

It is easy to check that this set is not empty. Suppose each node knows the h(T') values. Then it can locally decide whether or not it belongs to the above set. We choose one arbitrary node from this set to be the gravity center g(T), e.g., the one with the smallest index.

In the following, g(T) is assumed to be the root of the tree T, which defines the parent and the children of each node. The subtree T(v) rooted at v ∈ V is defined to be the maximal subtree including v but not the parent of v. Having fixed all these notations, the rest of the description of the nibble strategy is very simple: a node v gets a copy of x if and only if h(T(v)) > w(T) or v is the gravity center g(T).

The nibble strategy can be calculated in time linear in the input size, which is O(|V|) for each object. The most difficult part is to compute the h(T') values for each node (with T' denoting the subtrees into which the tree is partitioned when the node is removed). However, this can be done by a depth-first search algorithm taking time O(|E|) = O(|V|) for each object. Moreover, the placement can be calculated easily by the processors of the tree network. Here the h(T') values can be computed in 2 · diam(T) rounds, each of which takes time O(degree(T)). Further, the computation for several objects can be pipelined, which gives time O((|X| + diam(T)) · degree(T)) for the placement of all objects in X.

The following theorem shows that the nibble strategy calculates a mapping that yields minimum load on each edge. Of course, this also proves that it minimizes the total load and the congestion, regardless of the bandwidths of the edges.
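The placement rule just described can be sketched in a few lines (an illustrative implementation under the paper's model — a single object x with per-node rates r and w — not the authors' code): find the gravity center with one depth-first search, re-root the tree there, and keep exactly the nodes v with h(T(v)) > w(T), plus g(T):

```python
def nibble_placement(adj, r, w):
    """Nibble placement for a single object x on a tree (illustrative sketch).
    adj: node -> list of neighbours; r, w: per-node read/write rates for x.
    Returns the set of nodes on which a copy of x is placed."""
    h = {v: r[v] + w[v] for v in adj}
    H = sum(h.values())                       # h(T)
    W = sum(w.values())                       # w(T)

    def rooted(root):
        # Iterative DFS: visit order, parent map, and subtree sums h(T(v)).
        parent, order, stack = {root: None}, [], [root]
        while stack:
            v = stack.pop()
            order.append(v)
            for u in adj[v]:
                if u != parent[v]:
                    parent[u] = v
                    stack.append(u)
        h_sub = dict(h)
        for v in reversed(order):
            if parent[v] is not None:
                h_sub[parent[v]] += h_sub[v]
        return order, parent, h_sub

    # Gravity center g(T): removing it leaves only components T'
    # with h(T') <= h(T)/2.
    order, parent, h_sub = rooted(next(iter(adj)))

    def is_gravity(v):
        comps = [h_sub[u] for u in adj[v] if parent[u] == v]
        if parent[v] is not None:
            comps.append(H - h_sub[v])        # the component holding v's parent
        return all(c <= H / 2 for c in comps)

    g = next(v for v in order if is_gravity(v))

    # Re-root at g; node v gets a copy iff h(T(v)) > w(T) or v = g(T).
    order, parent, h_sub = rooted(g)
    return {v for v in order if h_sub[v] > W or v == g}
```

Read-heavy rates tend to replicate the object widely, whereas write-heavy rates collapse the placement toward a single copy at the gravity center.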

Theorem 2.1 The nibble placement strategy achieves minimum load for each edge of the tree.

Proof. The nibble strategy always places a copy on the gravity center. The following lemma shows that this does not increase the load.


Lemma 2.2 Any mapping S for x can be transformed, without increasing the load on any edge, into a mapping S* such that the nodes that hold a copy form a connected component including g(T).

Proof. Consider the minimum Steiner tree connecting all copies of S. Let S' be the mapping in which every node of this Steiner tree holds a copy. Then transforming S into S' does not increase the load on any edge, because each write access has to update all copies of S, which means that each write access crosses each edge in the Steiner tree. If S' includes g(T), then we are finished. Otherwise, we have to transform S' into S* as follows.

The nodes holding the copies of S' form a connected component not including g(T). Thus, one of the subtrees T(v_1), T(v_2), ... includes all of the copies, where v_1, v_2, ... denote the nodes incident to g(T). Let T_A = (V_A, E_A) denote this subtree, and let T_B = (V_B, E_B) denote the remaining subtree, i.e., the subtree of T induced by the set of nodes in T \ T_A. Further, let T_C = (V_C, E_C) denote the subtree of T_A induced by the set of nodes holding a copy of S', let u denote the node in V_C that is closest to g(T), and let P denote the simple path connecting u with g(T). Figure 1 illustrates the situation.

[Figure 1: The subtrees T_A, T_B, and T_C, the nodes g(T) and u, and the path P.]

Suppose r(T_B) > w(T_A). Then placing additional copies onto all the nodes of the path P influences only the load on the edges of this path. On the one hand, each write access issued in T_A now has to traverse this path. On the other hand, the read accesses issued in T_B no longer have to traverse this path. Thus, the load on every edge on the path is increased by

w(T_A) − r(T_B) < 0, which means that adding copies to all nodes on P yields the strategy we are looking for, S*.

Now assume r(T_B) ≤ w(T_A). Because g(T) is the gravity center, h(T_A) ≤ h(T)/2 ≤ h(T_B). Combining both inequalities yields

w(T_B) = h(T_B) − r(T_B) ≥ h(T_A) − w(T_A) = r(T_A).

Let S'' denote the mapping having only one copy, which is placed on u. Then transforming S' into S'' only influences the load on edges in E_C. The load on these edges is increased by at most r(T_A) and decreased by at least w(T_B). This gives an increase of r(T_A) − w(T_B) ≤ 0 on these edges. Therefore, changing S' into S'' does not increase the load on any edge. Finally, we change S'' into S* by moving the copy from u to g(T). This affects only the load on edges of P, which is increased by at most h(T_A) due to accesses issued in T_A and decreased by h(T_B) ≥ h(T_A) due to accesses issued in T_B. As a consequence, S can be transformed via S' and S'' into S* without increasing the load on any edge.

We now show that an arbitrary mapping S of the copies can be transformed into the mapping computed by the nibble strategy without increasing the load on any edge. This yields the theorem. First, transform S into a mapping S* such that the set of nodes U holding a copy forms a connected component including the gravity center g(T). According to Lemma 2.2, this transformation can be done without increasing the load. The nibble strategy places a copy on a node v ∈ V if and only if h(T(v)) > w(T) or v = g(T). We have to show that S* can be transformed into the mapping computed by the nibble strategy without increasing the load on any edge. In particular, we have to show that

- one can add a copy to all nodes v ∈ V \ U with h(T(v)) > w(T), and
- one can remove the copies from all nodes v ∈ U \ {g(T)} with h(T(v)) ≤ w(T),

without increasing the load on any edge.

Consider v ∈ V \ U with h(T(v)) > w(T).
All nodes v' on the simple path from v to U (including v) satisfy h(T(v')) ≥ h(T(v)) > w(T), because U includes the gravity center. Adding a copy to all these nodes influences only the load on the edges of this path. In particular, the load on these edges is increased by at most w(T) − w(T(v)) because of write accesses

issued by nodes not in T(v) that have to traverse these edges after adding the copies. However, the load on these edges is decreased by at least r(T(v)) because of read accesses issued by nodes in T(v), which can now be satisfied by the copy on v. Combining both effects, the load is increased by at most w(T) − w(T(v)) − r(T(v)) = w(T) − h(T(v)) < 0. This means adding a copy to all nodes v ∈ V \ U with h(T(v)) > w(T) does not increase the load on any edge.

Now consider v ∈ U \ {g(T)} with h(T(v)) ≤ w(T). Let T' denote the subtree of T(v) induced by the nodes holding a copy. Then each node v' in the subtree T' satisfies h(T(v')) ≤ h(T(v)) ≤ w(T). Removing all copies from all nodes in T' influences only the load on edges in T'. On the one hand, the load on these edges is increased by at most r(T(v)) because of reads issued by nodes in T(v). On the other hand, the load is decreased by at least w(T) − w(T(v)) because of writes issued by nodes not in T(v). Thus, the load on these edges is increased by at most r(T(v)) − (w(T) − w(T(v))) = h(T(v)) − w(T) ≤ 0. Hence, removing the copies from all nodes v ∈ U \ {g(T)} with h(T(v)) ≤ w(T) does not increase the load on any edge, which completes the proof of Theorem 2.1.
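Theorem 2.1 makes a strong claim: a single placement minimizes the load on every edge simultaneously. On a small instance this can be verified exhaustively (a sketch with invented rates on a four-node path; reads are routed to a copy on the reader's side of an edge whenever one exists):

```python
from itertools import combinations

# Path network a - b - c - d with invented access rates for one object.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
r = {"a": 4, "b": 0, "c": 1, "d": 2}
w = {"a": 1, "b": 0, "c": 0, "d": 3}

def sides(i):
    # Removing the i-th path edge splits the node set into prefix and suffix.
    return set(nodes[: i + 1]), set(nodes[i + 1 :])

def load(i, copies):
    left, right = sides(i)
    cl, cr = copies & left, copies & right
    total = 0
    for side, own, far in ((left, cl, cr), (right, cr, cl)):
        for v in side:
            if not own:          # no copy on v's side: every access crosses
                total += r[v] + w[v]
            elif far:            # writes cross to update the far-side copies
                total += w[v]
    return total

placements = [set(c) for k in range(1, len(nodes) + 1)
              for c in combinations(nodes, k)]
best = [min(load(i, p) for p in placements) for i in range(len(edges))]

# Theorem 2.1 predicts a placement achieving the per-edge minimum everywhere.
winners = [p for p in placements
           if all(load(i, p) == best[i] for i in range(len(edges)))]
assert winners, "some placement minimizes every edge load simultaneously"
```

With these rates every placement holding copies on both endpoints a and d attains the per-edge minimum w(T) on all three edges at once, as the theorem predicts.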

2.1.2 Trees with arbitrary bus connections

The nibble strategy can be adapted to tree-connected networks with arbitrary bus connections. In this case T = (V, E) is a hypergraph rather than a "normal" graph. Each hyperedge represents a bus that connects all nodes incident to the hyperedge. The buses are allowed to have arbitrary bandwidths.

The strategy is adapted in the following way. First, the hypergraph T is transformed into a "normal" tree T' = (V', E'), i.e., each edge in T' is incident to exactly 2 nodes. Then the nibble strategy is applied to T'. Finally, the calculated optimal placement on T' is transformed into an optimal placement on the hypergraph T.

The transformation from T to T' is done by local substitutions. Each hyperedge is simulated by a star, i.e., a hyperedge e_H = (v_1, ..., v_k) incident to the nodes v_1, ..., v_k is replaced by a node v(e_H) that is connected by k "normal" edges to the nodes v_1, ..., v_k. Further, r(v(e_H)) and w(v(e_H)) are set to 0.

After the nibble strategy is applied to T', the calculated placement must be transformed into a placement for the hypergraph T. This means that we

1

1

2 DATA MANAGEMENT ON TREE{CONNECTED NETWORKS 13 have to remove the copies from the nodes in T 0 representing the hyperedges in T . These copies are replaced according to the following rule: Consider a hyperedge eH = (v ; : : : ; vk ) from T represented by the node v(eH ) in T 0. Let T ; : : : ; Tk denote the subtrees resulting from the deletion of v(eH ), and let m be the index satisfying h(Tm ) = maxfh(Ti) j 1  i  kg. If h(Tm) > r(T ) then the copy on v(eH ) is replaced by a copy on vm . Otherwise, the copy is replaced by k copies on v ; : : : ; vk . The above strategy can be calculated sequentially in time O(jX j  jV j) for each object. Further, the placement can be calculated in a distributed fashion by the nodes of T in time O((jX j +diam(T ))  (degree(T )+rank(T ))) with rank(T ) denoting the maximum number of nodes incident to an edge. The following theorem shows that the strategy yields minimal congestion regardless of the bandwidths of the busses. 1

1

1

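The replacement rule for the copy held by a star center v(e_H) can be sketched in a few lines. This is an illustrative fragment, not the paper's implementation; the function name and the input encoding (the list of subtree rates h(T_1), ..., h(T_k) and the total read rate r(T)) are assumptions for the sketch:

```python
def place_for_hyperedge(h_subtrees, r_total):
    """Decide where the copy on the star center v(e_H) should go.

    h_subtrees: list of h(T_1), ..., h(T_k), the combined read/write
                rates of the subtrees hanging off the hyperedge.
    r_total:    r(T), the total read rate of the whole tree.

    Returns the list of subtree indices that receive a copy.
    """
    m = max(range(len(h_subtrees)), key=lambda i: h_subtrees[i])
    if h_subtrees[m] > r_total:
        # One subtree dominates all reads: move the copy to its node v_m.
        return [m]
    # Otherwise replicate: every node v_1, ..., v_k gets a copy.
    return list(range(len(h_subtrees)))
```

For example, with subtree rates (5, 1, 1) and total read rate 3 the copy migrates to the dominating subtree, while with rates (2, 1) and read rate 5 it is replicated to all incident nodes.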
Theorem 2.3 The nibble placement strategy achieves minimum load for each hyperedge of the tree.

Proof. The transformation from T to T' consists of several local transformations, one for each hyperedge. Each of these transformations does not affect the minimum load on any edge not involved in the transformation. Also the transformation from the optimal placement for T' to a valid placement for T consists only of local transformations on the hyperedges that do not change the load on any edge not involved in the transformation. As a consequence, we can restrict ourselves to investigating the effects of each of these local transformations independently. In particular, we have to show for each hyperedge e_H that the load on e_H is minimal after the local transformation on e_H, under the assumption that the congestion in the respective star in T' is minimal.

Suppose that v(e_H) gets no copy by applying the nibble strategy in T'. Then the final transformation does not add a copy to any of the nodes v_1, ..., v_k. In this case the load on the hyperedge is equal to the load of the edge leading to the connected component of the nodes holding the copies. Hence, the load of the hyperedge is not larger than the congestion in the star. Because a star can simulate a hyperedge such that the congestion in the star is not larger than the load of the hyperedge, it follows that the load of e_H is optimal.

Now, suppose that v(e_H) has got a copy and h(T_m) > r(T). Then the nibble strategy on T' places no copy in any subtree T_j with j ≠ m, because placing a copy there gives higher load on the edge (v(e_H), v_j) than placing no copy in this subtree. In the first case this edge has load at least

w(T_j) + w(T_m) = w(T_j) + h(T_m) − r(T_m) > w(T_j) + r(T) − r(T_m) ≥ h(T_j),

because all writes issued in T_m and in T_j have to traverse this edge, whereas in the second case the edge has load at most h(T_j), because only the accesses from T_j have to pass this edge. In the transformation to a valid placement for the hypergraph T we place a copy on v_m. Since there is no copy outside of T_m after the transformation, none of the accesses issued in T_m have to cross e_H. This yields load h(T) − h(T_m) = r(T) + w(T) − h(T_m) < w(T) on the hyperedge. This is the optimal solution for the hyperedge, because placing copies in more than one of the subtrees yields load w(T), and placing copies in only one of the subtrees T_j with j ≠ m yields load at least h(T) − h(T_j) ≥ h(T) − h(T_m) on the hyperedge.

Finally, suppose that v(e_H) has got a copy in T' and that h(T_m) ≤ r(T). Then placing a copy onto all nodes, as done by the above strategy, yields load w(T) on the hyperedge. Placing a copy onto at least two of the nodes also yields load at least w(T). Further, placing a copy onto only one of the nodes v_1, ..., v_k, or copies in only one of the subtrees, yields load at least

h(T) − h(T_j) ≥ h(T) − h(T_m) = r(T) + w(T) − h(T_m) ≥ w(T).

Hence, the nibble strategy yields minimum load on e_H.

2.2 Dynamic data management on trees

Here all placement decisions have to be made on-line, i.e., we have no knowledge about the access patterns beforehand. It is assumed that an adversary initiates the read and write accesses arbitrarily at execution time. A dynamic strategy is allowed to migrate, to create, and to invalidate copies during the execution of the application. Initially, one copy of each object is placed somewhere in the network. Arbitrary communication between neighboring nodes in the network is allowed. However, each communication increases the load on the involved edge by one. We describe a very simple caching protocol that invalidates all but one of the old copies in case of a write. Initially we assume that all connections in the tree are point-to-point connections. Then the accesses from node v to object x are handled in the following way:


• v wants to read x: v sends a request message to the nearest node u holding a copy of x. u sends the value of x to v, and a copy of x is created on each node on the path from u to v.

• v wants to write x: v sends a message including the new value to the nearest node u holding a copy of x. u starts an invalidation multicast to all other nodes holding a copy of x, modifies its own copy, and sends this copy to v. The modified copy of x is stored on each node on the path from u to v.

The adaptation of this strategy to networks with arbitrary bus connections is very simple. When the value of x is sent from u to v in case of a read or write, a copy of x is created on each node incident to an edge on the path from u to v (rather than placing a copy only on each node on the path). Note that the data tracking for the above algorithm is very simple, because the nodes holding the copies of an object always form a connected component in the tree. For each object x, a signpost is attached to each node, pointing to the last node that has updated x. (Initially, this signpost points to the only copy of x.) Whenever x is updated, the signposts are redirected to the node that has issued the write. Note that this mechanism does not require extra communication, because only the signposts on nodes involved in the invalidation multicast have to be redirected. The number of signposts can be reduced by defining a root of the tree and canceling all signposts directed to the root. Then diam(T) signposts for each object are sufficient.

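The read and write handling above can be modeled in a short sketch. This is a simplified illustration on an unweighted tree, assuming BFS stands in for following the unique tree paths; the class and function names are invented for the sketch, not taken from the paper:

```python
from collections import deque

def tree_path(adj, src, dst):
    """Unique path between two nodes of a tree, found by BFS."""
    parent = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                q.append(w)
    path, u = [], dst
    while u is not None:
        path.append(u)
        u = parent[u]
    return path[::-1]

class TreeCache:
    """Copies of one object; the holders always form a connected subtree."""
    def __init__(self, adj, initial_holder):
        self.adj = adj
        self.holders = {initial_holder}

    def nearest_holder(self, v):
        # BFS from v until a node holding a copy is found.
        seen, q = {v}, deque([v])
        while q:
            u = q.popleft()
            if u in self.holders:
                return u
            for w in self.adj[u]:
                if w not in seen:
                    seen.add(w)
                    q.append(w)

    def read(self, v):
        u = self.nearest_holder(v)
        # A copy is created on every node on the path from u to v.
        self.holders.update(tree_path(self.adj, u, v))

    def write(self, v):
        u = self.nearest_holder(v)
        # Invalidate all old copies; the new value is stored on the
        # path from u to v only.
        self.holders = set(tree_path(self.adj, u, v))
```

On the path tree 0-1-2-3 with the initial copy on node 3, a read from node 0 replicates the object onto all four nodes, and a subsequent write at node 1 shrinks the holder set back to {1}.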
Theorem 2.4 The dynamic caching strategy is 3-competitive for trees with arbitrary bus connections and non-uniform bandwidths.

Remark: In order to achieve a sequentially consistent strategy according to the definition of Lamport [16], it is necessary to acknowledge the invalidations before updating a copy. Then the strategy becomes 4-competitive.

Proof. The accesses from the nodes to each object x can be ordered by their occurrence. (Concurrent reads are ordered arbitrarily.) If the sequence of accesses to x starts with a read access, then we add a write access issued by the processor holding the initial copy of x at the beginning of the sequence. Obviously, this causes no extra communication for an optimal strategy. We divide the sequence of accesses into phases such that the first access in each phase is a write and each phase includes only one write.

For a phase t, let v_t denote the node issuing the write access at the beginning of t, and let U_t denote the connected component induced by v_t and the nodes issuing read accesses in this phase. Any consistent strategy has to send at least one message along each edge in U_t, for each phase t. Further, for any two consecutive phases t − 1 and t with node-disjoint components U_{t−1} and U_t, an additional message has to be sent along the unique path p_t leading from U_{t−1} to U_t.

Now we consider the load induced by our strategy. All messages induced by the read and write accesses of phase t, except for the invalidation messages, are routed along edges in U_t or p_t. In particular, each of these edges is traversed by exactly two of these messages: either by the two messages for the write issued by v_t, or by a read request and the corresponding answer message. Further, the invalidation multicast for the write in the next phase sends exactly one message along the edges in U_t and p_t. Hence, our strategy sends at most three times as many messages along any edge as any other consistent strategy does.
3 Data management on meshes

In this section, we consider data management strategies for the mesh M = M(m_1, ..., m_d), i.e., the d-dimensional mesh-connected network with side length m_i in dimension i. The number of processors is denoted by n, i.e., n = m_1 · ... · m_d. Each edge is assumed to have bandwidth 1. Thus, the relative and the absolute load of an edge are identical. We investigate static and dynamic strategies that aim to minimize the congestion.
3.1 Static placement on meshes

First, we prove that calculating an optimal placement with respect to minimizing the congestion is NP-hard on meshes. Then we describe a static placement strategy that can be calculated in linear time and achieves minimum congestion up to a small factor.

3.1.1 NP-hardness of static placement

The static placement problem is defined as follows. We are given a graph, a set of shared objects, access rates from the nodes of the graph to the objects, and an integer k. Each object has to be placed on one node, and for each access a routing path from the accessing node to the respective object has to be specified. The problem is to decide whether there is a mapping such that the congestion of the routing paths is not larger than k.

The above problem is restricted to non-redundant placement. Note, however, that non-redundant placement is a subproblem of a redundant placement problem in which all accesses are write accesses. Hence, NP-hardness of non-redundant placement implies NP-hardness of redundant placement as well.

Theorem 3.1 The static placement problem on 3 × 3 meshes is NP-hard.

Proof. We describe a reduction from PARTITION. The input for this problem are integers k_1, ..., k_n and k with Σ_{i=1}^n k_i = 2k and k_i ≤ k, for each i ∈ {1, ..., n}. The problem is to decide whether there exists a subset S ⊆ {1, ..., n} such that Σ_{i∈S} k_i = k.

[Figure 2 (image): the 3 × 3 mesh with labeled nodes v_1, s, s', v_2, h, h', and v_3, and with solid and dashed edges.]

Figure 2: The labeling of the 3 × 3 mesh.

We code an instance of PARTITION into a static placement problem on the 3 × 3 mesh. Figure 2 describes the labeling of the mesh used in the following. The shared objects in the placement problem are x_1, ..., x_n and y. The access rates are defined as follows:

∀i ∈ {1, 2, 3} ∀j ∈ {1, ..., n}: h(v_i, x_j) := k_j,
h(v_2, y) := 4k + 1,  h(h, y) := k,  h(h', y) := k,

with h(v, x) denoting the number of accesses from node v to object x. All other rates are 0. We have to prove that the placement can be realized with congestion at most k if and only if there exists a subset S ⊆ {1, ..., n} with Σ_{i∈S} k_i = k.

Suppose there exists a subset S fulfilling the PARTITION constraint. Then the following placement achieves congestion k. Each object x_i with i ∈ S is placed onto node s. The routing paths from v_1, v_2, and v_3 to s are defined to be the unique shortest paths using only solid edges. Each object x_i with i ∉ S is placed onto node s'. The routing paths from v_1, v_2, and v_3 to s' are also defined to be the unique shortest paths using only solid edges. Object y is placed onto node v_2. For the routing paths from h and h' to v_2 we choose the dashed edges. Counting the load for each edge yields that the congestion of this placement is k.

Now suppose there is no subset S fulfilling the PARTITION constraint. Assume for contradiction that there is a placement with congestion k. Then object y is placed on node v_2 according to this placement, since otherwise one of the edges leaving v_2 has congestion larger than k. According to the selection of the routing paths from h and h' to object y on node v_2, we distinguish the following two cases:

• Case 1: Suppose each of the 2k accesses to y traverses one of the dashed edges. Then these edges are blocked for the other accesses. As a consequence, all routing paths have to be shortest paths, and all x_i's have to be placed either on s or s', since otherwise the total load due to accesses to the x_i's is larger than 10k, which means that the average load per solid edge is larger than k. Now, in order to achieve congestion k on the edges (s, v_1) and (s', v_1), the objects must be distributed among s and s' in such a way that each of the two edges gets the same load. However, this is not possible, because according to our assumption there is no appropriate partition.

• Case 2: Now suppose that ℓ ≥ 1 of the routing paths for the accesses to y do not traverse the dashed edges. Then the total load due to accesses to y sums up to 2k + 2ℓ. Further, only ℓ accesses to the x_i's can cross the dashed edges. The total load of accesses to the x_i's is minimized if the objects accessed via a dashed edge are placed on v_2. This is because an object x_i causes load 4k_i if it is placed on v_2 and the accesses from v_3 to the object use a dashed edge. In all other cases the accesses to x_i cause load at least 5k_i. This means that the total load over all accesses to the x_i's is at least 4ℓ + 5(2k − ℓ) = 10k − ℓ. Then the total load over all accesses to all objects is 10k − ℓ + 2k + 2ℓ = 12k + ℓ, and, hence, the average load per edge is larger than k, which contradicts the assumption that the considered placement has congestion k.
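The encoding of a PARTITION instance as access rates is mechanical and can be sketched as follows. Node names follow the labeling of Figure 2 (with s' and h' written `s2` and `h2`); the function name is illustrative, not from the paper:

```python
def partition_to_placement(ks, k):
    """Encode a PARTITION instance (k_1, ..., k_n, k) as the access
    rates h(node, object) used in the reduction above."""
    assert sum(ks) == 2 * k and all(ki <= k for ki in ks)
    rates = {}
    # Every object x_j is accessed with rate k_j from v_1, v_2, and v_3.
    for node in ("v1", "v2", "v3"):
        for j, kj in enumerate(ks):
            rates[(node, f"x{j+1}")] = kj
    # Object y is accessed heavily from v_2 and with rate k from h, h'.
    rates[("v2", "y")] = 4 * k + 1
    rates[("h", "y")] = k
    rates[("h2", "y")] = k
    return rates
```

All rates not listed in the returned dictionary are 0, matching the construction in the proof.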

3.1.2 The static access tree strategy

Now we describe a static placement strategy that achieves minimal congestion up to a small factor and can be calculated efficiently. It is based on a hierarchical decomposition of M, which we describe recursively. Let i be the smallest index such that m_i = max{m_1, ..., m_d}. If m_i = 1, then we have reached the end of the recursion. Otherwise, we partition M into two non-overlapping submeshes M_1 = M(m_1, ..., ⌈m_i/2⌉, ..., m_d) and M_2 = M(m_1, ..., ⌊m_i/2⌋, ..., m_d). M_1 and M_2 are then decomposed recursively according to the same rules. Figure 3 gives an example of this decomposition.

[Figure 3 (image): the partitions of M(4, 3) on levels 0 through 5, with the resulting unit submeshes labeled a through m.]

Figure 3: The partitions of M(4, 3).

The hierarchical decomposition has associated with it a decomposition tree T(M), in which each node corresponds to one of the submeshes, i.e., the root of T(M) corresponds to M itself, and the children of a node v in the tree correspond to the two submeshes into which the submesh corresponding to v is divided. Thus, T(M) is a binary tree of height O(log n) in which the leaves correspond to submeshes of size one, i.e., to the processors of M. We define the root to be on level 0 of this tree, and all nodes whose parents are on level i are defined to be on level i + 1. For each node v in T(M),

let M(v) denote the corresponding submesh. Further, each edge e of T(M) connecting a level i node u with a level i + 1 node v is defined to be on level i + 1, and M(e) = M(v). We interpret T(M) as a virtual network that we want to simulate on M. In order to compare the congestion in both networks, we define bandwidths for the edges in T(M), i.e., for an edge e of T(M), define the bandwidth of e to be the number of edges leaving the submesh M(e). Figure 4 gives an example of a decomposition tree with bandwidths.

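The recursive decomposition and the bandwidth labels can be sketched as follows. The bandwidth attached to a tree node here is the number of mesh edges leaving its submesh, i.e., the bandwidth of the edge to its parent; function names and the box encoding are illustrative assumptions:

```python
def boundary_edges(box, sides):
    """Number of mesh edges leaving the submesh `box` of the full mesh.

    box:   list of (lo, hi) index ranges, one per dimension (inclusive).
    sides: side lengths m_1, ..., m_d of the full mesh.
    """
    size = 1
    for lo, hi in box:
        size *= hi - lo + 1
    total = 0
    for j, (lo, hi) in enumerate(box):
        # One boundary face per side that does not touch the mesh border;
        # each face contains size / (length in dimension j) edges.
        faces = (1 if lo > 0 else 0) + (1 if hi < sides[j] - 1 else 0)
        total += faces * size // (hi - lo + 1)
    return total

def decompose(box, sides):
    """Decomposition tree as nested tuples (box, bandwidth, children).

    Splits along the first dimension of maximum side length, taking
    ceil(m_i / 2) nodes into the first half, as described above.
    """
    lengths = [hi - lo + 1 for lo, hi in box]
    i = lengths.index(max(lengths))
    bw = boundary_edges(box, sides)
    if lengths[i] == 1:        # all side lengths are 1: a processor
        return (box, bw, [])
    mid = box[i][0] + (lengths[i] + 1) // 2
    left, right = list(box), list(box)
    left[i] = (box[i][0], mid - 1)
    right[i] = (mid, box[i][1])
    return (box, bw, [decompose(left, sides), decompose(right, sides)])
```

For M(4, 3), the root has bandwidth 0 (no edges leave the full mesh), and both level-1 submeshes are 2 × 3 blocks with 3 outgoing edges each, matching the labels in Figure 4.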
[Figure 4 (image): the decomposition tree of M(4, 3), with edge labels giving the bandwidths on levels 0 through 4.]

Figure 4: The decomposition tree T(M(4, 3)). The node labels are equivalent to the ones in Figure 3. The edge labels indicate the bandwidth of the respective edge.

For each object x ∈ X, define an access tree T_x(M) to be a copy of the decomposition tree T(M). We embed the access trees randomly into M, i.e., for each x ∈ X, each interior node v of T_x(M) is mapped by a random hash function h(x, v) to one of the processors in M(v), and each leaf v of T_x(M) is mapped onto the processor in M(v). (For simplicity we assume that the hash functions map in a truly random fashion, i.e., uniformly and independently.)

The remaining description of our data management strategy is very simple: For each object x ∈ X, we simulate the optimal static strategy for trees described in Section 2.1 on the access tree T_x(M). All messages that should be sent between neighboring nodes in the access trees are sent along the dimension-by-dimension order paths between the associated nodes in the mesh.

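The random embedding and the dimension-by-dimension routing can be sketched as follows. `random.Random(hash((x, v)))` merely stands in for the hash function h(x, v); the names and the node encoding as coordinate tuples are assumptions of the sketch:

```python
import random

def home(x, v, submesh_nodes):
    """Map node v of access tree T_x(M) to a pseudo-random processor
    of its submesh, playing the role of h(x, v)."""
    rng = random.Random(hash((x, v)))
    return rng.choice(sorted(submesh_nodes))

def dim_by_dim_path(src, dst):
    """Dimension-by-dimension order path: correct coordinate 0 first,
    then coordinate 1, and so on."""
    path = [src]
    cur = list(src)
    for j in range(len(src)):
        step = 1 if dst[j] > cur[j] else -1
        while cur[j] != dst[j]:
            cur[j] += step
            path.append(tuple(cur))
    return path
```

A message between two embedded tree nodes is then routed along `dim_by_dim_path(home(x, u, ...), home(x, v, ...))`.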
The static access tree strategy yields a static placement of the objects onto the nodes and specifies the routing paths. This placement can be calculated in time O(|X| · |V|), because the optimal static placement of an object on its access tree can be calculated in time linear in the number of nodes of the tree, and this number is smaller than 2 times the number of nodes in the mesh. The following theorem shows that the access tree strategy achieves small congestion.

Theorem 3.2 For any application on the mesh M of size n and dimension d, the static access tree strategy achieves congestion O(C_opt^stat(M) · d · log n), w.h.p., where C_opt^stat(M) denotes the optimal congestion for the application in the static model.

Proof. In order to prove the above result we require a lower bound and an upper bound on the congestion of the optimal strategy and the access tree strategy, respectively. Define C_opt^stat(T(M)) to be the best achievable congestion for the application when it is executed on the binary tree T(M), under the assumption that each processor of M is simulated by its counterpart in T(M). We give a lower bound that relates C_opt^stat(T(M)) to the optimal congestion on the mesh, and an upper bound that relates this value to the congestion achieved by our strategy. We start with the lower bound.

Lemma 3.3 C_opt^stat(T(M)) ≤ C_opt^stat(M).

Proof. For a given strategy on M with congestion C we have to describe a strategy on T(M) with congestion at most C. We simulate the strategy for M on T(M), except for the routing. Instead, for the routing paths in T(M) we use the unique shortest paths between the respective nodes. Let C' denote the congestion for a given application with the above strategy on T(M). Let e denote an edge of T(M) with relative load C'. Then the absolute load of e is C' · b(e) (recall that b(e) is the bandwidth of e). Now consider the same application on M. Any message that crosses e in T(M) has either to leave or to enter the submesh M(e) in M. The number of edges leaving M(e) is b(e). Thus, the load on one of these edges is at least C' · b(e)/b(e) = C', and hence, C ≥ C'.

The following lemma gives an upper bound on the expected load of the access tree strategy. For an edge e of M, let L(e) denote the load of e and E[L(e)] the expectation of this value.

Lemma 3.4 For any edge e of M, E[L(e)] = O(log n · d · C_opt^stat(T(M))).

Proof. Let h denote the height of T(M), and let L_i(e) denote the load of e due to the simulation of edges on level i of T(M), for 1 ≤ i ≤ h. We show that E[L_i(e)] = O(d · C_opt^stat(T(M))), which yields the lemma since h = O(log n).

Fix i with 1 ≤ i ≤ h. Let v be a node of T(M) on level i − 1, and let v_1 and v_2 be the children of v. Then the mesh M(v) is partitioned by the hierarchical decomposition into the submeshes M(v_1) and M(v_2). The nodes v, v_1, and v_2 are each mapped randomly to a node in M(v), M(v_1), and M(v_2), respectively. M(v_1) and M(v_2) are submeshes of M(v) that have at least 1/3 the size of M(v). As a consequence, the expected load on an arbitrary edge e of M(v) for sending a message between two nodes chosen randomly from M(v) is at most 3 times larger than for sending a message from a node chosen randomly either from M(v_1) or M(v_2) to a node chosen randomly from M(v). Therefore, in the following we assume, for simplicity, that we choose v_1 and v_2 randomly from M(v), rather than choosing v_1 from M(v_1) and v_2 from M(v_2).

Let k_j denote the side length of M(v) in dimension j. W.l.o.g. we assume k_1 ≤ k_2 ≤ ... ≤ k_d. Let size(M(v)) denote the number of nodes in M(v). Further, let out(M(v)) denote the number of edges leaving M(v), and let out_j(M(v)) denote the number of edges of dimension j leaving M(v). Then the order of the partitions in the hierarchical decomposition ensures that we have not divided the mesh according to dimension j if k_j < k_d/2. Thus, out_j(M(v)) = 0 if k_j < k_d/2. Otherwise,

out_j(M(v)) ≤ 2 · size(M(v))/k_j ≤ 4 · size(M(v))/k_d.

Thus, size(M(v)) ≥ k_d · out(M(v))/(4d).

Consider an edge e from M(v). Let j denote the dimension of e. The probability that a dimension-by-dimension order path between two nodes chosen randomly from M(v) crosses e is at most

k_j/(2 · size(M(v))) ≤ k_d/(2 · size(M(v))) ≤ 2d/out(M(v)).

(This can be seen easily as follows. Add wrap-around edges to M(v) in each dimension. Then M(v) is a torus, and all edges in dimension j look the same. The number of these edges is size(M(v)), and the expected number of these edges that are traversed by the random path is ⌊k_j/4⌋. Therefore, the expected load on one of these torus edges is at most k_j/(4 · size(M(v))). The expected load for the mesh edges is at most twice this value.)

Now consider the tree edge {v_1, v}. The bandwidth of this edge is at most out(M(v)). Thus, the maximum number of messages that are transmitted along this edge is at most C_opt^stat(T(M)) · out(M(v)). The edge {v_1, v} is simulated by a dimension-by-dimension order path in M(v). As a consequence, the expected load on edge e for simulating the edge {v_1, v} is at most

C_opt^stat(T(M)) · out(M(v)) · 2d/out(M(v)) = 2d · C_opt^stat(T(M)),

under the assumption that v_1 and v are chosen randomly from M(v). Of course, the same holds for the simulation of the edge {v_2, v}. Thus, E[L_i(e)] = O(C_opt^stat(T(M)) · d), and hence, E[L(e)] = O(log n · d · C_opt^stat(T(M))).

In order to complete the proof of Theorem 3.2, we have to show that the maximum load over all edges does not deviate too much from the expected load of a fixed edge. Consider an edge e of the mesh M, and let L denote the load of e. We color the edges of the access trees with three colors {1, 2, 3} such that all incident edges have different colors. For 1 ≤ j ≤ 3, let E_{x,j} be the set of tree edges of T_x(M) with color j. For ℓ ∈ E_{x,j}, let A_j(ℓ) be a random variable that is 1 if the dimension-by-dimension order path that simulates ℓ crosses e, and that is 0 otherwise. Further, for ℓ ∈ E_{x,j}, let K(ℓ) denote the load of ℓ in the access tree T_x(M). Then define

L_j := Σ_{x∈X} Σ_{ℓ∈E_{x,j}} A_j(ℓ) · K(ℓ),   (1)

for 1 ≤ j ≤ 3, i.e., L = L_1 + L_2 + L_3. The coloring yields that the A_j(ℓ)'s are independent, for fixed j. Therefore, L_j is a sum of weighted independent random variables. Let λ denote the write contention, i.e., the maximum number of write accesses to the same object. Then the static strategy for trees guarantees that K(ℓ) = O(λ) for every ℓ ∈ E_{x,j}. As a consequence, the maximum weight in the sum of random variables in equation (1) is at most O(λ). Applying a Chernoff-Hoeffding bound [11] to this sum yields that L_j deviates by at most O(λ · log n) from E[L_j], w.h.p., for 1 ≤ j ≤ 3. Since L = L_1 + L_2 + L_3, it follows that L = E[L] + O(λ · log n), w.h.p. Now, applying Lemma 3.3 and Lemma 3.4 yields L = O(log n · (d · C_opt^stat(M) + λ)), w.h.p. Further, we have C_opt^stat(M) ≥ λ/(2d), because each write access has to update all copies, and each copy is placed statically at a node of degree at most 2d. Thus, L = O(log n · d · C_opt^stat(M)), w.h.p. Summing over all edges yields that the same bound holds for the congestion. This completes the proof of Theorem 3.2.
3.2 Dynamic data management on meshes

In this section, we show how the access tree strategy described above has to be modi ed such that it achieves minimum congestion up to a factor of O(d  log n) also in the dynamic model. Further, we give a lower bound for online routing on meshes showing that the above competitive ratio is optimal up to a factor (d ). 2

3.2.1 The dynamic access tree strategy The access tree strategy described above can be easily adapted to the dynamic model by simulating the dynamic instead of the static tree strategy on the access trees. This gives congestion O(C (M )  d  log n +   log n), w.h.p., where C (M ) denotes the optimal congestion for a given application in the dynamic model, and  denotes the write contention. In the static model, the optimal congestion was at least =d. From this we could deduce O(d  log n){competitiveness. Unfortunately, in the dynamic model we cannot apply this bound. In fact, it is easy to construct counterexamples in which the optimal congestion is much smaller than =d. In order to achieve O(d  log n){competitiveness in the dynamic model the access trees have to be remapped dynamically when too many write accesses are directed to the same variable. The remapping is done as follows. For every object x, and every node v of the access tree Tx(M ) we add a counter  (x; v). Initially, this counter is set to 0. Every time a message for object x traverses node v the counter is increased by 1. Let K be a constant integer of suitable size. When the counter  (x; v) reaches K the node v is remapped randomly to another node in M (v), and  (x; v) is set to 0. Remapping the home of v means that we eventually have to send a transport message that moves the copy of x from the new to the old home. Further, we have to send noti cation messages including information about the new home to the three mesh nodes that hold the access tree neighbors of v. This noti cation messages also increase the counters at the traversed tree nodes, dyn opt

dyn opt

3 DATA MANAGEMENT ON MESHES

25

i.e., the counters for x at the three neighbors of v. (Alternatively, to providing counters for the tree nodes a node can be remapped with probability 1=K whenever a message for the respective object traverses the node.) The following theorem describes the e ect of the remapping. Theorem 3.5 The dynamic access tree strategy is O(d  log n){competitive, for meshes of size n and dimension d. Proof. The load of the mesh edges is increased by the transport and the noti cation messages. The impact of the transport messages on the expected load of the edges is relatively small. This can be shown as follows. Consider an object x 2 X and a node v of Tx(M ). Of course, if v is a leave then it is not remapped. Thus, we assume v is not a leave. Let v and v denote the two children of v in Tx(M ). Suppose v is not the root, and let u denote be the parent of v. The transport messages for (x; v) are send from the old home of v to the new home of v. For every transport message, at least K other messages for read or write accesses or noti cations are send from (or to) the old home of v to (or from) one of the three homes of u, v , and v . These homes are chosen randomly from M (v), M (u), M (v ), and M (v ), respectively. M (v) is a submesh of M (u) that has at least 1=3 the size of M (u). M (v ) and M (v ) are submeshes of M (v) that have at least 1=3 the size of M (v). As a consequence, the expected load on an arbitrary edge e from M (v) for sending a message between two nodes chosen randomly from M (v) is at most 3 times larger than sending a message from a random node in M (v) to a node chosen randomly either from M (u), M (v ), or M (v ). Thus, if K is chosen suciently large, then the expected load for an edge e in M (v) due to the transport messages is not larger than the expected load for the K other messages. It is easy to check that the same holds if v is the root. Now we consider the impact of the noti cation messages. 
We show that, if K is chosen suciently large, the congestion of these messages in the decomposition tree T (M ) is not larger than the congestion of the access messages. For simplicity, we argue with fractional message sizes, i.e., we assume every time a node receives an access message then it sends out a noti cation message of size 1=K to all of its neighbors in the tree, and if it receives a noti cation of size r from one of its neighbors then it sends out a noti cation message of size r=K to the other neighbors. The load for each edge in this fractional model is not smaller than the original load. 1

2

1

1

2

2

1

1

2

2

3 DATA MANAGEMENT ON MESHES

26

Let C denote the congestion in T(M) due to access messages. Consider a node v of T(M) and an edge e of T(M). Let d(v) denote the distance of node v to edge e, i.e., the number of edges on the shortest path from v to one of the two nodes incident to e. Further, let b(v) denote the maximum bandwidth of one of the three edges incident to v. Then the bandwidth of edge e is at least b(v)/2^{d(v)+1}, because the decomposition ensures that the bandwidths of two incident edges differ at most by a factor of 2. Further, the absolute load on edge e for a notification message initiated by an access message traversing a node v is at most (1/K)^{d(v)+1}. Thus, the relative load on edge e for a notification message initiated by an access message traversing node v is at most

(1/K)^{d(v)+1} / (b(v)/2^{d(v)+1}) = 2^{d(v)+1} / (K^{d(v)+1} · b(v)).

The maximum number of access messages traversing v is at most 3C · b(v), because each of these messages has to cross one of the three edges incident to v. As a consequence, the relative load on e for notification messages due to accesses to v is at most

3C · b(v) · 2^{d(v)+1} / (K^{d(v)+1} · b(v)) = 3C · (2/K)^{d(v)+1}.

Since the number of nodes at distance i is at most 2^i, the relative load on e is at most

Σ_{i=0}^∞ (3C/2) · (4/K)^{i+1},

which is at most C, for K ≥ 7.

Now we can bound the congestion for the mesh. Let e be an edge of M, and let L denote the load of e. It is easy to check that the results of Lemma 3.3 and Lemma 3.4 hold analogously in the dynamic model. Thus, we have E[L] = O(log n · d · C_opt^dyn(M)). Further, similar to the proof for the static model, we can decompose L into three parts such that L = L_1 + L_2 + L_3, and L_j is a sum of independent weighted random variables, for 1 ≤ j ≤ 3. Due to the remapping, the maximum weight in each of these sums is K. Hence, applying a Chernoff-Hoeffding bound gives that the deviation of L from E[L] is at most O(K · log n) = O(log n), w.h.p., and therefore, the congestion is at most O(log n · d · C_opt^dyn(M)), w.h.p.
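The counter-based remapping at the heart of the dynamic strategy can be sketched as follows. This is a hypothetical fragment: the class name, the choice of `K`, and the use of `random.choice` in place of the strategy's hash functions are all assumptions of the sketch:

```python
import random

K = 8  # remapping threshold, a "constant integer of suitable size"

class TreeNodeState:
    """Per-object, per-tree-node bookkeeping for the dynamic strategy."""
    def __init__(self, submesh_nodes):
        self.submesh = list(submesh_nodes)  # nodes of M(v)
        self.home = random.choice(self.submesh)
        self.counter = 0                    # the counter gamma(x, v)

    def on_message(self):
        """Count a traversing message; remap the home after K of them.

        Returns the old home if a remap happened (a transport message
        must then move the copy, and the three tree neighbors must be
        notified of the new home), else None.
        """
        self.counter += 1
        if self.counter >= K:
            old = self.home
            self.home = random.choice(self.submesh)
            self.counter = 0
            return old
        return None
```

Each notification delivered to a neighbor is itself counted by calling that neighbor's `on_message`, mirroring the rule that notifications also increase the counters at the traversed tree nodes.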
3.2.2 A lower bound for on-line routing and dynamic data management

The results in Theorem 3.5 compare the congestion of the dynamic access tree strategy with the congestion of an optimal off-line strategy. The theorem shows that the access tree strategy achieves minimum congestion up to a factor of O(d · log n). In the following, we show that this ratio is nearly optimal. In particular, we prove that the best possible competitive ratio for dynamic data management is Ω(log n/d).

Dynamic data management includes the problem of on-line routing. Here an adversary specifies a sequence of T routing requests, i.e., pairs r_t = (s_t, d_t) of source and destination nodes, for 1 ≤ t ≤ T. An on-line routing algorithm must assign a routing path connecting s_t and d_t, for 1 ≤ t ≤ T, without knowing the future requests, i.e., the requests r_{t'} with t' > t. The goal is to minimize the congestion. The competitive ratio of the on-line algorithm is defined as the worst case ratio, over all request sequences, between the congestion achieved by the on-line algorithm and the minimum congestion achievable for the sequence. The following theorem gives a lower bound on the competitive ratio for on-line routing on meshes.

Theorem 3.6 Any on-line routing algorithm for the mesh of dimension d ≥ 2 and side length m has competitive ratio Ω(log m).

Proof. We show that for any C, d ≥ 2, and m ≥ 4 being a power of 2, there is a random on-line routing problem R_d(m, C) for which the minimum congestion is C whereas the expected congestion achieved by any on-line routing algorithm is Ω(C · log m).

We start with proving a lower bound for the 2-dimensional case. First, we describe a random on-line routing problem R_2(m, C) on the m × m mesh M_2(m), for which the minimum off-line congestion is C. Then we show that the expected congestion of any on-line strategy is Ω(C · log m).

R_2(m, C) is defined as follows. Let (k, ℓ) denote the k-th node in the ℓ-th row of M_2(m), for 1 ≤ k, ℓ ≤ m. The adversary starts by specifying m/2 pairs of source and destination nodes, each of which should be connected by C routing paths. The pairs are ((m/2, ℓ), (m/2, m/2 + ℓ)), for 1 ≤ ℓ ≤ m/2. Further requests are described recursively: The mesh M_2(m) can be partitioned into four m/2 × m/2 submeshes. If m/2 ≥ 4, then the adversary selects randomly one of these submeshes and specifies routing requests in this submesh according to R_2(m/2, C).

Altogether, this gives log m − 1 batches of routing requests of size C · m/2, C · m/4, ..., C · 2. For 1 ≤ i ≤ log m − 1, the routing batch specified in stage i is denoted by R_i and the submesh considered in this stage is denoted by S_i, such that S_1 = M_2(m). It is easy to check that, for 1 ≤ i ≤ log m − 2, there exists an off-line schedule that routes the requests of batch R_i with congestion C through mesh S_i without using any edge of mesh S_{i+1}. Thus, all requests can be routed off-line with congestion C.

It remains to show that the expected congestion of any on-line strategy is Ω(C · log m). The mesh M_2(m) consists of m − 1 rows of edges and m − 1 columns of edges. We number the rows and columns from 1 to m − 1, respectively. In the following, we only consider the edges in odd rows and odd columns. These edges are called odd edges. Each routing path connecting a source and a destination node of a request in batch R_i has to traverse at least m/2^i odd edges of mesh S_i. Note that this bound holds even if a path leaves the submesh S_i. Hence, if one chooses randomly and uniformly an edge from the odd edges in S_i, then the expected number of paths connecting two nodes of batch R_i using this edge is at least

  (|R_i| · m/2^i) / |E_odd(S_i)| = ((C · m/2^i) · m/2^i) / (m^2 / 2^{2i−3}) = C/8,

with E_odd(S_i) denoting the set of odd edges in S_i, for 1 ≤ i ≤ log m − 1. Choosing a random odd edge from S_{log m − 1} rather than from S_i yields the same bound, because the selection of the submeshes by the adversary corresponds to random selections of subsets of odd edges. (Note that this holds only for the odd edges.) Hence, the expected congestion in S_{log m − 1} due to requests from batch R_i is C/8, for 1 ≤ i ≤ log m − 1. Summing over all batches yields that the expected congestion of R_2(m, C) is C · (log m − 1)/8, which completes the proof for the 2-dimensional case.

Now we describe the on-line routing problem R_d(m, C) for the mesh M_d(m) with dimension d ≥ 3 and side length m. M_d(m) can be partitioned into J = m^{d−2} two-dimensional m × m submeshes M_1, ..., M_J, each of which consists only of edges of dimension 1 and 2. Let (k, ℓ, j) denote the node in the k-th row and the ℓ-th column of submesh M_j, for 1 ≤ k, ℓ ≤ m and 1 ≤ j ≤ J. R_d(m, C) is defined as follows. The adversary specifies the routing requests in each submesh M_j according to R_2(m, C), for 1 ≤ j ≤ J. In each submesh M_j it uses the same random bits, which means that it specifies exactly the same routing problem in each M_j, for 1 ≤ j ≤ J. We have already shown that all requests can be routed off-line with congestion C inside the respective 2-dimensional mesh M_j. Hence, it remains to show that the expected congestion of any on-line strategy on this routing problem is Ω(C · log m).

This we do by contradiction. Suppose an on-line routing strategy exists for which the expected congestion on R_d(m, C) is smaller than C · (log m − 1)/8. Then this routing strategy can be simulated on the m × m mesh M_2(m) for R_2(m, m^{d−2} · C). In this simulation, a node in row k and column ℓ of M_2(m) simulates all nodes (k, ℓ, j) with 1 ≤ j ≤ J of M_d(m). Further, each edge ((k_1, ℓ_1), (k_2, ℓ_2)) of M_2(m) simulates all edges ((k_1, ℓ_1, j), (k_2, ℓ_2, j)) with 1 ≤ j ≤ J. This yields congestion smaller than m^{d−2} · C · (log m − 1)/8, since each edge of M_2(m) has to simulate m^{d−2} edges of M_d(m). This contradicts the above result for R_2(m, m^{d−2} · C), and hence the expected congestion of any on-line strategy on R_d(m, C) is at least C · (log m − 1)/8.

The lower bound on the competitive ratio for on-line routing can be adapted to dynamic data management. Consider a parallel application that includes a sequence of 2T consecutive accesses a_1, ..., a_{2T} to the shared objects x_1, ..., x_T such that, for 1 ≤ t ≤ T, a_{2t−1} is a write access from node v_{2t−1} to object x_t and a_{2t} is a read access from processor v_{2t} to the same object. A dynamic data management strategy for this application has to specify a routing path from v_{2t−1} to v_{2t}, for 1 ≤ t ≤ T. This shows that data management includes on-line routing, and hence, we can deduce the following corollary.
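The reduction in the preceding paragraph can be written out mechanically: every routing request (s_t, d_t) becomes a write to a fresh object issued at s_t, followed by a read of the same object issued at d_t. The object names x1, x2, ... and the tuple encoding below are illustrative.

```python
def routing_to_accesses(requests):
    """Encode routing requests (s_t, d_t) as the access sequence of the
    reduction: a_{2t-1} writes object x_t at s_t, a_{2t} reads x_t at d_t."""
    accesses = []
    for t, (s, d) in enumerate(requests, start=1):
        accesses.append(("write", s, f"x{t}"))
        accesses.append(("read", d, f"x{t}"))
    return accesses
```

Serving the read a_{2t} requires a copy updated by the write a_{2t−1}, so any data management strategy implicitly fixes a path from v_{2t−1} to v_{2t}, which is exactly an on-line routing decision.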

Corollary 3.7 Any dynamic data management strategy for the mesh of dimension d ≥ 2 and side length m has competitive ratio Ω(log m).

4 Data management on clustered networks

A clustered network G = (V, E) is a network that consists of several small subnetworks, i.e., clusters, that are organized hierarchically. The cluster tree T(G) describes this hierarchical structure. The internal nodes of T(G) correspond to the clusters of G, and the leaves correspond to the user processors. (In the following, we interpret these processors as clusters of size 1.) Each pair of clusters that is connected by one or more edges in G is also connected by an edge in T(G). The bandwidth of this edge in T(G) is the sum of the bandwidths of the corresponding edges in G. For instance, NOWs are usually organized as clustered networks. Figure 5 depicts a possible topology for a wide-area NOW.

[Figure 5 omitted: a backbone network connecting three regional networks.]

Figure 5: The topology of a wide-area network of workstations.

The nodes inside the clusters are connected in an arbitrary fashion. However, we assume that communication between nodes of the same cluster is less expensive than communication between nodes of different clusters, which is just the basic idea behind any kind of clustering. This means we assume that the links connecting nodes in distinct clusters are the bandwidth bottlenecks of the system. This property can be formalized as follows. Consider a cluster K = (V_K, E_K). Define the weight w(v) of a node v ∈ V_K to be the sum of the bandwidths of its incident edges leaving K, and the weight w(U) of U ⊆ V_K by w(U) = Σ_{v ∈ U} w(v). Further, for U ⊆ V_K, let cut(U) denote the sum of the bandwidths of the edges connecting the processors in U with the processors in V_K \ U. Then the flux (cross flux), defined by

  φ(K) = min_{U ⊆ V_K} cut(U) / min{w(U), w(V_K \ U)},

is a good measure for the bandwidth bottleneck of K. If φ(K) ≪ 1 then there is a bottleneck inside cluster K. We assume that φ(K) = Ω(1), for any cluster K.

For each object x ∈ X, define the access tree T_x(G) to be a copy of the cluster tree T(G). Each interior node a of T_x(G) is mapped randomly to one

of the processors in the associated cluster K = (V_K, E_K), i.e., a is mapped with probability w(v)/w(V_K) onto node v ∈ V_K. Analogously to the mesh strategy, we simulate the optimal static or dynamic strategy for trees on these access trees. The static placement for each object can be calculated in time O(|V|). However, the selection of the routing paths is more complicated than the one for meshes. It requires an initialization of each cluster, which takes time polynomial in the size of the cluster. The routing will be described in the proof of the following theorem.
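For small clusters, the flux defined above can be computed by brute force over all node subsets. The following sketch uses names of our choosing, runs in exponential time, and is meant only to make the definition concrete.

```python
from itertools import combinations

def flux(nodes, inner_edges, weight):
    """Brute-force phi(K) = min over nonempty proper U of
    cut(U) / min(w(U), w(V_K \\ U)).  `inner_edges` maps intra-cluster
    edges (u, v) to bandwidths; `weight[v]` is the total bandwidth of
    v's edges leaving the cluster.  A sketch for tiny clusters only."""
    best = float("inf")
    node_list = list(nodes)
    for r in range(1, len(node_list)):
        for U in combinations(node_list, r):
            U = set(U)
            cut = sum(b for (u, v), b in inner_edges.items()
                      if (u in U) != (v in U))  # edges crossing the cut
            w_U = sum(weight[v] for v in U)
            w_rest = sum(weight[v] for v in nodes if v not in U)
            denom = min(w_U, w_rest)
            if denom > 0:
                best = min(best, cut / denom)
    return best
```

For a two-node cluster joined by a unit-bandwidth edge, with one unit of outside bandwidth per node, the flux is 1; doubling the outside bandwidths halves it to 0.5, signalling an internal bottleneck.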

Theorem 4.1 Consider an application running on the user processors in a clustered network G of size n. Let δ denote the maximum number of edges that leave the same cluster, let Δ denote the maximum degree in the cluster tree T(G), and let λ denote the write contention, i.e., the maximum number of write accesses to the same object. Further, let C_opt(G) denote the optimal congestion for the application when it is executed on G. Then the access tree strategy achieves congestion

  O(log δ · C_opt(G) + Δ · λ · log n),

w.h.p. If the clusters in G can be represented by planar graphs or by constant genus graphs, then this result improves to

  O(C_opt(G) + Δ · λ · log n),

w.h.p.

Proof. Consider a cluster K = (V_K, E_K). Let φ = φ(K) denote the flux of K and let δ = δ(K) denote the number of edges that leave cluster K. We define the congestion of K to be the maximum relative load over all edges adjacent to nodes in K. Let C_opt = C_opt(K) denote the optimal congestion of K. We show, for cluster K, that the strategy achieves congestion O(log δ · C_opt/φ + Δ · λ · log n), w.h.p. If K is planar or of constant genus, then this result improves to O(C_opt/φ + Δ · λ · log n), w.h.p. This yields the above theorem. (Furthermore, it shows that the influence of a "bad" flux is only small and local to the respective cluster.)

First, we describe the path selection strategy for the routing along the edges that leave cluster K. Suppose K' is a cluster neighboring K. Let u and u' be the cluster tree nodes that represent K and K', respectively, and let e_T be the edge that connects u and u'. Furthermore, let e_1, ..., e_k denote the edges in G represented by e_T. We send a message that traverses e_T in T(G) along edge e_i with probability p(e_i) = b(e_i)/b(e_T). The following lemma gives an upper bound on the expected relative load of edges that leave cluster K.

Lemma 4.2 Let e be an edge of G that connects a node in cluster K with a node in one of the neighboring clusters. Then the expected relative load on e is at most C_opt.

Proof. Let e_T denote the edge in T(G) that corresponds to e. Our strategy simulates the optimal strategy on the access trees. This strategy is equivalent to the optimal strategy on the cluster tree T(G). Thus, the load on e_T is the optimal load achievable in T(G). This load is at most C_opt · b(e_T). Therefore, the expected absolute load for edge e is at most C_opt · b(e_T) · p(e) = C_opt · b(e). As a consequence, the expected relative load on e is at most C_opt.

Now we have to describe the path selection strategy for the routing inside cluster K. Consider the following multicommodity flow problem. Let v_1, ..., v_δ denote the nodes incident to the edges that leave cluster K = (V_K, E_K). We have δ² commodities ℓ_{i,j} with 1 ≤ i, j ≤ δ. The source of commodity ℓ_{i,j} is v_i, its sink is v_j, and its demand is w(v_i) · w(v_j)/W with W := Σ_{k=1}^δ w(v_k). We solve this multicommodity flow problem on K with respect to the capacities, i.e., the bandwidths, of the edges in K. The result is a multicommodity flow in which the demand of each commodity is satisfied up to a factor q. That is, for each commodity ℓ_{i,j}, there is a flow of size q · w(v_i) · w(v_j)/W from v_i to v_j.

Lemma 4.3 In the general case, q = Ω(φ/log δ). If K is a planar graph or a constant genus graph, then q = Ω(φ).

Proof. First we consider the general case. According to [3], any multicommodity flow problem can be satisfied up to a factor Ω(S/log k), with S denoting the minimum cut ratio and k denoting the number of commodities. It is easy to check that the minimum cut ratio of our multicommodity flow problem is φ. Further, the number of commodities is δ². Therefore, the maximum flow can be satisfied up to a factor q = Ω(φ/log δ).

Now suppose K is a planar graph or of constant genus. Then we can translate the above multicommodity flow problem into a uniform multicommodity flow problem, in which each node sends the same amount of data to each other node, such that the flow and the cut ratio in both problems are nearly equivalent. According to [14], the flow in a uniform multicommodity flow problem can be satisfied up to a factor q = Ω(S), with S denoting the minimum cut ratio. Applying S = Ω(φ) yields the lemma.

The maximum flow for cluster K can be calculated efficiently by a randomized approximation scheme in time O(δ⁴ · |V_K| · |E_K| · log |V_K|), see [18]. Alternatively, it can be calculated deterministically based on linear programming. Let f_{i,j} : E_K → ℝ represent the flow of commodity ℓ_{i,j} on the respective edges. If the flow on edge e = (x, y) is oriented from x to y, then f_{i,j}(x, y) is positive; otherwise it is negative. We use these flow values to determine the routing paths. Let v be a node of K, and let F_{i,j}(v) be the sum of the flows of commodity ℓ_{i,j} that leave v. Let (v, y) ∈ E_K be an edge with f_{i,j}(v, y) > 0. Then each message with source v_i and destination v_j that arrives at node v is sent with probability f_{i,j}(v, y)/F_{i,j}(v) along edge (v, y).
Lemma 4.4 Let e be an edge of G that connects two nodes of cluster K. Then the expected relative load on e is at most 2 · C_opt/q.

Proof. Suppose each node v_i wants to send q · w(v_i) · w(v_j)/W data units to node v_j, for 1 ≤ i, j ≤ δ. Then the expected load on each edge e is equal to its bandwidth. This means, the expected congestion in K is at most 1. Now consider the messages that arrive in the cluster. According to Lemma 4.2, the number of messages that arrive from outside at node v_i, with 1 ≤ i ≤ δ, is at most C_opt · w(v_i), since w(v_i) is equivalent to the sum of the bandwidths of the incident edges leaving K. Each message is sent to a random destination chosen from v_1, ..., v_δ such that the probability for v_j to be the destination is w(v_j)/W. As a consequence, for 1 ≤ i, j ≤ δ, the expected amount of data for incoming messages sent from node v_i to node v_j is at most C_opt · w(v_i) · w(v_j)/W. Analogously, we can prove that the same bound holds for the messages that leave the cluster. As a consequence, the expected relative load on each edge is at most 2 · C_opt/q.

Combining the results of Lemmas 4.2, 4.3, and 4.4 yields that the expected load for each edge in K is O(log δ · C_opt/φ) in the general case, and O(C_opt/φ) if K is planar or of constant genus. In order to complete the proof of Theorem 4.1, we have to show that the maximum relative load over all edges in K does not deviate too much from the expected load of a fixed edge.
Consider an edge e of the cluster K. Let L denote the load of e. Then we have to show that L = E[L] + O(Δ · λ · log n), w.h.p. Let L_x denote the load of e due to the accesses to object x ∈ X. The data management strategy for trees ensures that the load on each edge of the cluster tree due to accesses to x is O(λ). Further, each edge e is involved in the simulation of at most Δ different cluster tree edges. Therefore, L_x = O(λ · Δ). This means that L_x is the sum of O(λ · Δ) not necessarily independent 0-1 random variables. We add some dummy variables so that L_x is the sum of exactly σ = O(λ · Δ) random variables A_1(x), ..., A_σ(x), for every x ∈ X. Then

  L = Σ_{x ∈ X} Σ_{i=1}^σ A_i(x) = Σ_{i=1}^σ Σ_{x ∈ X} A_i(x).

The variables A_i(x) in the sum S_i = Σ_{x ∈ X} A_i(x) are independent, for 1 ≤ i ≤ σ. Applying a Chernoff-Hoeffding bound [11] to this sum yields that it deviates by at most O(log n) from E[S_i], w.h.p., for 1 ≤ i ≤ σ. Since L = Σ_{i=1}^σ S_i, it follows that L = E[L] + O(Δ · λ · log n), w.h.p., which completes the proof of Theorem 4.1.

5 More general models

5.1 Non-uniform object sizes and slice-wise accesses

For simplicity, we assumed until now that all objects and also all messages for requesting, invalidating, and updating copies have unit size. The static strategies, in particular the nibble strategy, can easily be adapted to arbitrary object and message sizes. This can be done by weighting the access rates h_r and h_w according to the costs of reads and writes, respectively. For instance, if a read access to an object requires sending a request message of size r_1 to the nearest node holding a copy, and this node sends back a reply message of size r_2, then the access is weighted with r_1 + r_2. Analogously, if a write requires sending an update message of size w_1 along a multicast tree, and this update is acknowledged by a message of size w_2, then this access is weighted with w_1 + w_2. Applying the nibble strategy with respect to the weighted rates yields an optimal placement.

The dynamic strategies profit from large objects. Let |x| denote the size of an object and let us assume that the size of request and invalidation messages is 1. Then the dynamic strategy for trees achieves competitive ratio 2 + 1/|x| rather than ratio 3.

It is more complicated to adapt the dynamic strategies to slice-wise accesses, i.e., accesses in which a node wants to read or write only a relatively small part of an object. These slice-wise accesses are typical for distributed file systems. For this scenario, Bartal et al. [6] describe a randomized algorithm for trees that achieves competitive ratio O(1). If this algorithm is used in the access tree strategies for meshes and clustered networks instead of our deterministic tree algorithm, then the results for dynamic data management on meshes and clustered networks in Theorem 3.5 and Theorem 4.1, respectively, hold also for applications using slice-wise accesses.

5.2 Other update policies

The static model restricts the class of allowed update policies since it is assumed that a write has to update all copies. For instance, the majority trick, introduced in [28] for PRAM simulations, does not fit into this model. Here only more than half of the copies of an object are updated in case of a write, and more than half of the copies are accessed in case of a read. This ensures that every read access gets at least one copy updated by the last write. However, this technique requires adding time stamps to each copy in order to figure out which of the copies accessed by a read is the current one. Since it is not clear how to realize this in an asynchronous setting, we restricted ourselves until now to strategies that update all copies in case of a write.

In the following, we consider a more general model, maybe the most general model one can think of. We just assume that every read access has to see at least one of the copies updated by the last write. In this model, a static placement strategy has to specify for every read and write access in advance which copies to read and which to update, respectively. For instance, the inverse nibble strategy is an allowed strategy in this model: A read always accesses all copies, whereas a write updates only the copy closest to the writing node. The copies are placed according to the nibble strategy; however, the rates for reads and writes are exchanged, i.e., w(v) is set to h_r(v, x) and r(v) is set to h_w(v, x), for every node v ∈ V and every object x ∈ X. The following theorem shows that a combination of the nibble and the inverse nibble strategy is optimal in the general model.
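The majority trick can be sketched as follows. This is an illustrative, centralized toy, not the distributed PRAM-simulation setting of [28]; in particular, the global write counter stands in for the timestamp mechanism that is hard to realize asynchronously.

```python
import random

class MajorityRegister:
    """Toy version of the majority trick: a write updates a majority of
    the copies and stamps them; a read queries a majority and returns the
    value with the newest stamp.  Any two majorities intersect, so a read
    always sees a copy touched by the last write."""
    def __init__(self, num_copies, rng=random):
        self.copies = [(0, None)] * num_copies  # (timestamp, value)
        self.clock = 0                          # stand-in for real timestamps
        self.rng = rng

    def _majority(self):
        n = len(self.copies)
        return self.rng.sample(range(n), n // 2 + 1)

    def write(self, value):
        self.clock += 1
        for i in self._majority():
            self.copies[i] = (self.clock, value)

    def read(self):
        # the copy with the highest timestamp in the quorum is the current one
        return max(self.copies[i] for i in self._majority())[1]
```

Whatever majorities the random choices produce, the read quorum intersects the quorum of the last write, so the newest timestamp in the read quorum always belongs to the last written value.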

Theorem 5.1 Consider a static placement problem on a tree-connected network T = (V, E) in the general model. If the total number of reads R to an object x is not smaller than the total number of writes W, then placing x according to the nibble strategy yields minimum load on any edge. Otherwise, placing x according to the inverse nibble strategy yields minimum load on any edge.

Remark 5.2 The above theorem has consequences for the static result on meshes. It shows that simulating the combination of the nibble and the inverse nibble strategy on meshes of dimension d and size n, rather than the nibble strategy alone, yields congestion O(C_opt^stat · d · log n + λ · log n) also in the more general model. Here λ denotes the write contention, i.e., the maximum number of write accesses to the same object. Further, it shows that also the result for static placement on clustered networks in Theorem 4.1 holds in the general model if the above strategy is simulated on the access trees.

Proof. Suppose R ≥ W. We show that any strategy S that does not update all copies of x in case of a write can be transformed into a strategy S' that updates all copies without increasing the load on any edge. Then, according to Theorem 2.1, S' can be transformed into the nibble strategy without increasing the load on any edge. Hence, the nibble strategy yields minimum load on any edge.

Consider an arbitrary strategy S that does not update all copies in case of a write. Then there exist a (hyper-)edge e and a node v incident to e such that any write issued in T(v) touches v but not e, and there is at least one copy in T \ T(v), where T(v) denotes the maximal subtree including v but not e. We distinguish two cases:

Case 1: Any write access issued in T \ T(v) touches v. (W.l.o.g. we assume that whenever a write access touches a node holding a copy, then it updates the copy.) In this case we remove all copies in T \ T(v). This does not increase the load due to read accesses issued in T \ T(v), because all of them have to visit node v since some of the write accesses issued in T(v) do not pass e.

Case 2: Some of the write accesses issued in T \ T(v) do not touch v. Then one can find a subtree T* = (V*, E*) of T such that each node u ∈ V* is touched by any write access issued in T'(u), and at least one write access issued in T'(u) does not update a copy outside of T'(u), with T'(u) denoting the maximal subtree including u but no edge from E*. T* can easily be constructed by initially setting E* = {e} and then extending E* until each node u in V* is updated by all writes in subtree T'(u). Figure 6 gives an example.

[Figure 6 omitted.]

Figure 6: The subtrees T* and T'(u). Any of the write accesses issued in T'(u) updates u, but at least one of them does not update a copy outside of T'(u).

In this case we add copies to all nodes in V*, which are updated by every write access. This increases only the load in T*, because each write already visits one node in T*. However, after the modification none of the read accesses has to cross an edge of T* anymore, whereas before the modification any read access had to visit each node in T* in order to ensure that it gets at least one copy updated by the last write access. Hence, this modification saves load R − W ≥ 0 on every edge in T*.

In both cases the number of copies not updated by every write access is decreased by at least one. Thus, applying the above transformation repeatedly yields a strategy in which every write access updates all copies.

Now suppose R < W. We have to prove that the inverse nibble strategy minimizes the load on every edge in this case. This can be shown analogously to the previous case just by treating writes as reads and reads as writes.

References

[1] M. Andrews, T. Leighton, P. T. Metaxas, and L. Zhang. Automatic methods for hiding latency in high bandwidth networks. In Proc. of the 28th ACM Symp. on Theory of Computing (STOC), pages 257-265, 1996.

[2] M. Andrews, T. Leighton, P. T. Metaxas, and L. Zhang. Improved methods for hiding latency in high bandwidth networks. In Proc. of the 8th ACM Symp. on Parallel Algorithms and Architectures (SPAA), pages 52-61, 1996.

[3] Y. Aumann and Y. Rabani. An O(log k) approximate min-cut max-flow theorem and approximation algorithm. SIAM Journal on Computing, to appear, 1997.

[4] B. Awerbuch, Y. Bartal, and A. Fiat. Competitive distributed file allocation. In Proc. of the 25th ACM Symp. on Theory of Computing (STOC), pages 164-173, 1993.

[5] B. Awerbuch, Y. Bartal, and A. Fiat. Distributed paging for general networks. In Proc. of the 7th ACM Symp. on Discrete Algorithms (SODA), pages 574-583, 1996.

[6] Y. Bartal, A. Fiat, and Y. Rabani. Competitive algorithms for distributed data management. In Proc. of the 24th ACM Symp. on Theory of Computing (STOC), pages 39-50, 1992.

[7] R. J. Cole, B. M. Maggs, and R. K. Sitaraman. On the benefit of supporting virtual channels in wormhole routers. In Proc. of the 8th ACM Symp. on Parallel Algorithms and Architectures (SPAA), pages 131-141, 1996.

[8] R. Cypher, F. Meyer auf der Heide, C. Scheideler, and B. Vöcking. Universal algorithms for store-and-forward and wormhole routing. In Proc. of the 28th ACM Symp. on Theory of Computing (STOC), pages 356-365, 1996.

[9] D. Dowdy and D. Foster. Comparative models of the file assignment problem. Computing Surveys, 14(2):287-313, 1982.

[10] A. J. v. d. Goor. Computer Architecture and Design. Addison-Wesley, 1994.

[11] T. Hagerup and C. Rüb. A guided tour of Chernoff bounds. Information Processing Letters, 33:305-308, 1989/90.

[12] D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, and R. Panigrahy. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proc. of the 29th ACM Symp. on Theory of Computing (STOC), pages 654-663, 1997.

[13] A. Karlin and E. Upfal. Parallel hashing - an efficient implementation of shared memory. In Proc. of the 18th ACM Symp. on Theory of Computing (STOC), pages 160-168, 1986.

[14] P. Klein, S. A. Plotkin, and S. Rao. Excluded minors, network decomposition, and multicommodity flow. In Proc. of the 25th ACM Symp. on Theory of Computing (STOC), pages 682-690, 1993.

[15] R. R. Koch, F. T. Leighton, B. M. Maggs, S. B. Rao, A. L. Rosenberg, and E. J. Schwabe. Work-preserving emulations of fixed-connection networks. Journal of the ACM, 44(1):104-147, Jan. 1997.

[16] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690-691, 1979.

[17] F. T. Leighton, B. M. Maggs, A. G. Ranade, and S. B. Rao. Randomized routing and sorting on fixed-connection networks. Journal of Algorithms, 17:157-205, 1994.

[18] T. Leighton, F. Makedon, S. Plotkin, C. Stein, E. Tardos, and S. Tragoudas. Fast approximation algorithms for multicommodity flow problems. Journal of Computer and System Sciences, 50:228-243, 1995.

[19] C. Lund, N. Reingold, J. Westbrook, and D. Yan. On-line distributed data management. In Proc. of the 2nd European Symposium on Algorithms (ESA), 1994.

[20] F. Meyer auf der Heide. Efficiency of universal parallel computers. Acta Informatica, 19:269-296, 1983.

[21] F. Meyer auf der Heide. Efficient simulations among several models of parallel computers. SIAM Journal on Computing, 15(1):106-119, Feb. 1986.

[22] F. Meyer auf der Heide and B. Vöcking. A packet routing protocol for arbitrary networks. In Proc. of the 12th Symp. on Theoretical Aspects of Computer Science (STACS), pages 291-302, 1995.

[23] F. Meyer auf der Heide and R. Wanka. Time-optimal simulations of networks by universal parallel computers. In Proc. of the 6th Symp. on Theoretical Aspects of Computer Science (STACS), pages 120-131, 1989.

[24] R. Ostrovsky and Y. Rabani. Universal O(congestion + dilation + log^{1+ε} n) local control packet switching algorithms. In Proc. of the 29th ACM Symp. on Theory of Computing (STOC), pages 644-653, 1997.

[25] C. G. Plaxton and R. Rajaraman. Fast fault-tolerant concurrent access to shared objects. In Proc. of the 37th IEEE Symp. on Foundations of Computer Science (FOCS), pages 570-579, 1996.

[26] A. G. Ranade. How to emulate shared memory. Journal of Computer and System Sciences, 42:307-326, 1991.

[27] C. Scheideler and B. Vöcking. Universal continuous routing strategies. In Proc. of the 8th ACM Symp. on Parallel Algorithms and Architectures (SPAA), pages 142-151, 1996.

[28] E. Upfal and A. Wigderson. How to share memory in a distributed system. Journal of the ACM, 34:116-127, 1987.

[29] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, 1990.
