Packet Routing and PRAM Emulation on Star Graphs and Leveled Networks¹ ²
Michael A. Palis, Sanguthevar Rajasekaran, and David S. L. Wei³
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104.
¹ Preliminary versions of portions of this paper were presented in [12] and [13].
² Acknowledgements: This research was supported in part by DARPA grant N00014-85-K-0018, NSF grants MCS-82-07294, DCR-84-10413, MCS-83-05221, MCS-8219196-CER, IRI-84-10413-A02, and U.S. Army grants DAA29-84-K-0061 and DAA29-84-9-0027.
³ Author's current address: Dept. of CS, Radford University, Radford, Virginia 24142.
Sanguthevar Rajasekaran, 334 C, 3401 Walnut Street, Department of CIS, Univ. of Pennsylvania, Philadelphia, PA 19104, (215) 898 0375
Abstract: We consider the problem of permutation routing on a star graph, an interconnection network which has better properties than the hypercube. In particular, its degree and diameter are sublogarithmic in the network size. We present optimal randomized routing algorithms that run in O(D) steps (where D is the network diameter) for the worst-case input with high probability. We also show that for the n-way shuffle network with N = n^n nodes, there exists a randomized routing algorithm which runs in O(n) time with high probability. Another contribution of this paper is a universal randomized routing algorithm that could do optimal routing for a large class of networks (called leveled networks) which includes the star graph. The associated analysis is also network-independent. In addition, we present a deterministic routing algorithm for the star graph which is near-optimal. All the algorithms we give are oblivious. As an application of our routing algorithms, we also show how to emulate a PRAM optimally on this class of networks.
1 Introduction
In parallel computation, it is usually the communication cost rather than the computation cost that dominates the time complexity. A parallel algorithm designer is thus normally forced to focus on the task of minimizing the communication cost. An ideal shared-memory abstract parallel model, the parallel random access machine (PRAM), has been proposed that avoids the communication problem and is also simple to program. Unfortunately, the PRAM does not seem to be realizable with present or even foreseeable technologies. On the other hand, packet routing techniques can be employed to simulate the PRAM on a feasible parallel architecture without significant loss of efficiency. The problem of routing is also important due to its intrinsic significance in distributed processing and its important role in simulations among parallel models.

The routing problem is defined as follows: Given a specific network and a set of packets of information (a packet being an (origin, destination) pair), the packets must be routed in parallel to their destinations such that at most one packet passes through any link of the network at any time and all the packets arrive at their destinations as quickly as possible. To start with, the packets are placed at their origins, one per node. We are interested in a special case of the general routing problem called permutation routing, in which the destinations form some permutation of the origins.

A routing algorithm is said to be oblivious if the path taken by each packet depends only on its source and destination. An oblivious routing strategy is preferable since it leads to a simple control structure for the individual processing elements. Also, oblivious routing algorithms can be used in a distributed environment. In this paper we are concerned only with oblivious routing strategies.
Both deterministic and randomized schemes have been studied for solving routing problems ([27], [25], [23], [14], [20], [10], [4], [5], [19], [8], [9], [11], [16], [17], [15]). However, most of the past work has focused on bounded degree networks, such as cube-connected cycles (CCC), the butterfly, the shuffle-exchange, the mesh, etc. Some research has also been done on the binary n-cube (hypercube), which is not a bounded degree network. All of these networks (except the mesh) have logarithmic diameter and have randomized routing algorithms that run in logarithmic time. Clearly, these algorithms are optimal. An interesting open question is: 'Can we do optimal routing on a network with sublogarithmic diameter?' In this paper we settle this question in the affirmative. In particular, we present optimal randomized oblivious routing algorithms for the star graph ([1, 2]), which has sublogarithmic diameter.

The picture is quite different for the case of oblivious deterministic routing strategies. Borodin and Hopcroft [5] have shown that for any graph of N nodes with degree d, the maximum delay, in the worst case, of any oblivious deterministic routing scheme is Ω(√N / d^{3/2}). We present an oblivious deterministic routing algorithm for the star graph. This algorithm runs in O(√N) time with O(√N) queue size, where N is the number of nodes in the graph. We also give a universal randomized routing algorithm that could do optimal routing for a class of constant and non-constant degree leveled networks. The analysis for this algorithm is also network-independent. Leighton, Maggs and Rao have already given an O(1) queue universal routing algorithm for constant degree leveled networks [10]. Finally, we show that a CRCW PRAM can be optimally emulated on leveled networks with non-constant degree (this class of networks includes the star graph as well as the n-way shuffle), thus extending the work of [10].

Figure 1: 3-star graph and 4-star graph.
2 An oblivious deterministic routing algorithm for the n-star graph

2.1 The star graph
Definition 1 Let d1 d2 . . . dn be a permutation of n symbols, e.g., 1 . . . n. For 1 < j ≤ n, we define SWAP_j(d1 d2 . . . dn) = dj d2 . . . d{j−1} d1 d{j+1} . . . dn.

Definition 2 An n-star graph is a graph G = (V, E) with |V| = n! nodes, where V = {d1 d2 . . . dn | d1 d2 . . . dn is a permutation of 1 . . . n}, and E = {(u, v) | u, v ∈ V and v = SWAP_j(u) for some j, 1 < j ≤ n}.

The 3-star and 4-star graphs are depicted in Figure 1. In [1], Akers, Harel, and Krishnamurthy have shown that the star graph is superior to the n-cube with respect to degree and diameter. An n-star graph has n! nodes, degree n − 1, and diameter ⌊3(n − 1)/2⌋. On the other hand, an n-cube has 2^n nodes, degree n, and diameter n. Thus, the degree and diameter of the star graph grow more slowly as functions of the network size than those of the n-cube. Moreover, the star graph is both vertex symmetric and edge symmetric (just like
the n-cube.) Oftentimes, these properties lead to a simpler analysis of the routing algorithm. In [2, 1], an algorithm was presented for routing a single packet from a source to an arbitrary destination. The more general problem of permutation routing was not considered. In the next two sections, we present efficient deterministic and randomized algorithms for permutation routing on the star graph. Both these algorithms are oblivious.

Definition 3 A subgraph of an n-star graph G is said to be an i-th stage subgraph, denoted G^i, iff G^i is itself an (n − i)-star graph, 0 ≤ i < n, and the last i symbols of the labels of all nodes in it are identical. The G^i's of any G^{i−1} partition it into n − i + 1 identical subgraphs.

Let's define the stage of the network during a run of the routing algorithm to be simply the collection of the nodes together with the packets each node holds in its queue. Hence the routing algorithm can be thought of as a sequence of stage transitions S_1, . . . , S_f, where in S_1 each node has a single packet that originated in that node, and in S_f each node has a single packet that is destined for it. Look at all the G^i's that constitute any G^{i−1}. It is easy to see that for any node u in any one of these G^i's, there is exactly one other node v adjacent to u such that v is contained in some other G^i. We call v the critical point to u and vice-versa, at stage i. For example, in the 4-star graph of Figure 1, node BACD is a critical point to node DACB at stage 1. (Throughout this paper we use the terms 'point' and 'node' interchangeably.)

Definition 4 A stage S_i is said to be i-th stage stable, denoted S^i_stable, iff for every i-th stage subgraph G^i, the destination of each packet in the subgraph is in the subgraph itself.
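Definitions 1-3 can be made concrete with a few lines of Python (a sketch of ours, not the paper's code; the function names are illustrative): build the n-star graph from SWAP_j and verify the degree and diameter claims for small n.

```python
from collections import deque
from itertools import permutations

def swap_j(node, j):
    """SWAP_j exchanges the first symbol with the j-th symbol (1 < j <= n)."""
    d = list(node)
    d[0], d[j - 1] = d[j - 1], d[0]
    return tuple(d)

def star_graph(n):
    """Adjacency lists of the n-star graph on the n! permutations of 1..n."""
    nodes = permutations(range(1, n + 1))
    return {u: [swap_j(u, j) for j in range(2, n + 1)] for u in nodes}

def diameter(adj):
    """Brute-force BFS diameter; fine for the small n used here."""
    best = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best
```

For n = 4 this yields 24 nodes of degree 3 with diameter 4 = ⌊3(4 − 1)/2⌋, matching the figures quoted from [1].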
2.2 An oblivious deterministic routing algorithm
The routing scheme is based on divide-and-conquer. The algorithm runs in stages. In the first stage each packet is sent to the G^1 (refer to Definition 3) it belongs to. In the second stage each packet is sent to the G^2 it belongs to, and so on. Finally, in stage n − 1, each packet is sent to the G^{n−1} it belongs to (which is the single-node destination of the packet). Thus our routing scheme can be viewed as a sequence of stage transitions S^0_stable, S^1_stable, . . . , S^i_stable, S^{i+1}_stable, . . . , S^{n−1}_stable. Also our algorithm is such that once it enters S^i_stable (for any i), it will also be in S^j_stable for j ≤ i. Once the routing reaches S^{n−1}_stable, the task will be complete. The formal description of the routing scheme is shown in Algorithm A. We assume that all the links are bidirectional and in one step each node can send a packet along each of its outgoing edges and receive a packet along each of its incoming edges. Each node has two queues Q1 and Q2. At any given time each node looks at the packet at the
head of Q1 and sends it along the shortest path to the packet's appropriate G^i. This path could be of length 0, 1, or 2. If the path is of length 0 (i.e., the packet is already in its G^i), the packet will be placed in queue Q2 so that it can be processed again in the next stage. In the same time unit, each node receives packets along its incoming edges and stores them in queue Q1. More details of the algorithm follow.

Algorithm A
{Each node has two queues Q1 and Q2. Initially, Q1 has a single packet that originates from the node, and Q2 is empty. The second for loop stands for the transition from S^{i−1}_stable to S^i_stable (for i = 1, . . . , n − 1).}
for every node π = d1 d2 . . . dn in parallel do
  for each 1 ≤ i < n do
    Append Q2 to the tail of Q1;
    for j := 1 to min(∏_{s=n−i+1}^{n} s, ∏_{s=1}^{n−i} s) do
      Let x be the packet at the head of Q1 and let d′1 d′2 . . . d′n be the address of this packet's destination.
      {The algorithm is now in S^{i−1}_stable. From Definition 3, we know that d′_{n−i+2} d′_{n−i+3} . . . d′_n is identical to d_{n−i+2} d_{n−i+3} . . . d_n.}
      if d′_{n−i+1} = d_{n−i+1} then
        Put x at the tail of Q2; {So it can be processed in the next stage. Notice that x is already in the correct G^i.}
      else if d1 = d′_{n−i+1} then
        Send x to node SWAP_{n−i+1}(π) to be appended to queue Q1; {x will be in its correct G^i when it gets there.}
      else
        Choose the unique j such that dj = d′_{n−i+1};
        Send x to node SWAP_j(π) to be appended to Q1; {When x reaches that node, it has to traverse one more link before it is in its correct G^i.}
end Algorithm A.
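The per-packet behavior of Algorithm A can be sketched as the oblivious path a packet follows (a minimal illustrative translation of ours; queues and timing are not modeled, and the function name is an assumption):

```python
def algorithm_a_path(src, dst):
    """Oblivious path of a packet under Algorithm A: stage i places the symbol
    destined for position n-i+1 (1-indexed), entering the packet's G^i."""
    n, cur, path = len(src), list(src), [src]
    for i in range(1, n):
        pos = n - i                       # 0-indexed position n-i+1
        want = dst[pos]
        if cur[pos] == want:
            continue                      # already in the correct G^i
        if cur[0] != want:                # bring the needed symbol to the front
            j = cur.index(want)
            cur[0], cur[j] = cur[j], cur[0]
            path.append(tuple(cur))
        cur[0], cur[pos] = cur[pos], cur[0]  # drop it into position n-i+1
        path.append(tuple(cur))
    return path
```

Each stage costs at most two SWAP moves, so the path has at most 2(n − 1) links, as the text's "length 0, 1, or 2" case analysis says.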
2.3 Performance analysis of Algorithm A
We will show that (1) min(∏_{s=1}^{n−i} s, ∏_{s=n−i+1}^{n} s) time is sufficient to make the transition from S^{i−1}_stable to S^i_stable, (2) the queue size for the algorithm is O(√(n!)), and (3) the run time of the whole algorithm is O(√(n!)). Let M^i_q be the maximum number of packets queued in any node during the transition from S^{i−1}_stable to S^i_stable. Clearly, the time needed for the transition from S^{i−1}_stable to S^i_stable is O(M^i_q) since each packet needs only a constant amount of time to process.

Figure 2: A pair of critical points.
Lemma 2.1 M^i_q ≤ min(2 ∏_{s=1}^{n−i} s, 2 ∏_{s=n−i+1}^{n} s). Also, in S^i_stable, the value of M^i_q is no more than half the above minimum.
Proof: (By induction.) Base case: When i = 1, we have M^i_q ≤ 2n. This follows from the following fact. Suppose that we have two G^1's, say α and β, such that α and β are connected through the pair of critical nodes (a1, b1) (recall that (a, b) is a pair of critical nodes if a is the critical point to b and vice-versa; see Figure 2). In the transition from S^0_stable to S^1_stable, the worst case of queuing for b1 occurs when each node adjacent to a1 wants to send its packet through a1 to β and also a1 wants to send its own packet through b1 to β. Hence, including the packet that originally resided in b1, we have a total of (n − 2) + 1 + 1 = n packets that will pass through b1. This also means that (n − 1) packets may have to be queued in a1. Also the packet from b1 and packets from nodes that are one distance apart from b1 can reach a1, and therefore in the worst case a1 may have to queue 2(n − 1) packets. But notice that at the
end of this stage, the queue size of a1 is at most n. The same holds for the critical points of the other G^1's. But these are independent events, i.e., they will never affect each other.

Induction step: Suppose that Lemma 2.1 is true for i = k. We will prove it for i = k + 1, i.e., we will prove that M^{k+1}_q ≤ min(2 ∏_{s=1}^{n−k−1} s, 2 ∏_{s=n−k}^{n} s).

Case A: M^k_q ≤ min(2 ∏_{s=1}^{n−k} s, 2 ∏_{s=n−k+1}^{n} s) = 2 ∏_{s=n−k+1}^{n} s.
Fix any node b, and let a be the critical point to b at stage k + 1. The only packets that will ever contribute to the queue size of b during the transition from S^k_stable to S^{k+1}_stable are those that ever reached node a or nodes adjacent to a which are in G^k. Since G^k is an (n − k)-star graph, a has n − k − 1 other nodes adjacent to it (including b) in G^k. It follows, using the induction hypothesis, that the total number of packets that will reach b during the transition from S^k_stable to S^{k+1}_stable is at most ((n − k − 1) + 1) × ∏_{s=n−k+1}^{n} s, which is equal to ∏_{s=n−k}^{n} s. Notice that b is in a G^{k+1}. The queue size of b cannot be greater than ∏_{s=1}^{n−k−1} s because only these many packets are destined for the G^{k+1} that b is in. (Figure 3 might help the reader better understand the proof.) Thus, we have M^{k+1}_q ≤ min(∏_{s=1}^{n−k−1} s, ∏_{s=n−k}^{n} s). Realize that at most twice this number of packets will have to be queued in node a, but at the end of this stage the number of packets queued at a is no more than min(∏_{s=1}^{n−k−1} s, ∏_{s=n−k}^{n} s).

Case B: M^k_q ≤ min(2 ∏_{s=1}^{n−k} s, 2 ∏_{s=n−k+1}^{n} s) = 2 ∏_{s=1}^{n−k} s.
Clearly M^{k+1}_q ≤ ∏_{s=1}^{n−k−1} s, since there are only these many nodes in any G^{k+1} and hence only these many packets are destined for any G^{k+1}. Also (similar to Case A), M^{k+1}_q ≤ 2 ∏_{s=1}^{n−k−1} s ≤ 2 ∏_{s=1}^{n−k} s ≤ 2 ∏_{s=n−k+1}^{n} s ≤ 2 ∏_{s=n−k}^{n} s. Thus M^{k+1}_q ≤ min(2 ∏_{s=1}^{n−k−1} s, 2 ∏_{s=n−k}^{n} s). Again, at the end of this stage the queue size will be at most half this value. ✷
Theorem 2.1 The maximum queue size needed in Algorithm A is max_i min(2 ∏_{s=1}^{n−i} s, 2 ∏_{s=n−i+1}^{n} s) = O(√(n!)).
Proof: Follows from Lemma 2.1 and the following fact. Given any integer N, let Z = {(X, Y) : X and Y are integers and X · Y = N}. Then max_{(X,Y)∈Z} min(X, Y) ≤ O(√N). ✷
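Lemma 2.1's stage bounds and the √(n!) claims can be checked numerically; the small sketch below (ours, not the paper's) uses min(2 ∏_{s=1}^{n−i} s, 2 ∏_{s=n−i+1}^{n} s) = min(2 (n−i)!, 2 n!/(n−i)!) and verifies that the maximum stage bound is at most 2√(n!) and the sum over stages is below 8√(n!) for small n.

```python
from math import factorial

def stage_bound(n, i):
    """min(2*prod_{s=1}^{n-i} s, 2*prod_{s=n-i+1}^{n} s) from Lemma 2.1."""
    return min(2 * factorial(n - i), 2 * factorial(n) // factorial(n - i))

def max_queue(n):
    """Worst stage bound over i = 1..n-1 (Theorem 2.1): O(sqrt(n!))."""
    return max(stage_bound(n, i) for i in range(1, n))

def total_time(n):
    """Sum of the stage bounds (the quantity bounded in Theorem 2.2)."""
    return sum(stage_bound(n, i) for i in range(1, n))
```

Since the two products multiply to (n!)², each min is at most 2√(n!), which is exactly the X · Y = N fact used in the proof.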
Theorem 2.2 A permutation routing in an n-star graph can be performed by an oblivious deterministic routing scheme in O(√(n!)) time steps.
Proof: Let T(n) be the number of time steps needed for Algorithm A. From Lemma 2.1, we have T(n) = Σ_{i=1}^{n−1} min(2 ∏_{s=1}^{n−i} s, 2 ∏_{s=n−i+1}^{n} s) < 8 √(n!) = O(√(n!)). ✷

Figure 3: Nodes a and b and their G^{k+1} subgraphs within a G^k (referred to in the proof of Lemma 2.1).

In the randomized algorithms to be given in the rest of the paper, we will make use of a slightly different version of Algorithm A which we call Algorithm A′. In this new version, specified time is a parameter known in advance and will be specified as and when Algorithm A′ is invoked. For many invocations, specified time will just be cn for some constant c.
Algorithm A′
{To begin with, each node has a single packet that originates in the node, and there is only one queue. Now each packet is in its G^0.}
for every node π = d1 d2 . . . dn do the following in parallel for a specified time
  Let x be the packet at the head of π's queue and let d′1 d′2 . . . d′n be the address of this packet's destination. Also let the packet x be in its G^i (realize that x carries i along with it).
  {From Definition 3, we know that d′_{n−i+1} d′_{n−i+2} . . . d′_n is identical to d_{n−i+1} d_{n−i+2} . . . d_n.}
  if d′_{n−i} = d_{n−i} then
    Set i = i + 1 and put x at the tail of π's queue; {So it can be processed in the next stage. Notice that x is already in the correct G^{i+1}.}
  else if d1 = d′_{n−i} then
    Set i = i + 1 and send x to node SWAP_{n−i}(π); {x will be in its correct G^{i+1} when it gets there.}
  else
    Choose the unique j such that dj = d′_{n−i};
    Send x to node SWAP_j(π); {When x reaches that node, it has to traverse one more link before it is in its correct G^{i+1}.}
end Algorithm A′.
3 Optimal randomized routing algorithms for the n-star graph
The large worst-case delay of oblivious deterministic routing makes such schemes uninteresting from a practical point of view. But efficient routing algorithms that employ randomization have been discovered. In their pioneering paper, Valiant and Brebner [27, 25] have given an O(log N) time oblivious randomized routing scheme for the n-cube network, with N = 2^n nodes. They use a two-phase strategy in which packets are sent obliviously, first to random intermediate nodes and then to their correct destinations. They showed that there is a constant c such that every packet will reach its own destination in ≤ c log N steps with high probability (i.e., with probability ≥ 1 − N^{−c}). We use Õ to represent the complexity bounds of randomized algorithms (see e.g., [18]). We say a randomized algorithm has a resource (time, space, etc.) bound of Õ(g(n)) if there exists a constant c such that the amount of resource used by the algorithm (on any input of size n) is no more than cαg(n) with probability ≥ 1 − 1/n^α. Under this notation Valiant's algorithm runs in time Õ(log N).

After Valiant's work, a lot of research on randomized routing ([3] [23] [20] [10] [19] [8] [17] [15]) has been done. But all these employ bounded degree networks such as the butterfly, shuffle-exchange, d-way shuffle, the mesh, etc. The randomized routing lower bound for a bounded degree network is obviously Ω(log N) because the diameter of a constant degree network is at least log N. Thus, we won't be able to perform permutation routing on these networks in sublogarithmic time. An interesting question is: For unbounded degree networks with sublogarithmic diameter, can we route (using randomization) a permutation request in sublogarithmic steps with high probability?

Valiant [27] has shown that permutation routing can be done on the d-way shuffle graph (which has N = d^n nodes and diameter n) in Õ(n log d / log log d) steps. For the n-way shuffle, Valiant's algorithm runs in time Õ(n log n / log log n) and hence is not optimal. In this section, we present randomized routing algorithms for the n-star graph that run in time of the order of the diameter with high probability. The same algorithm also runs in Õ(n) time on the n-way shuffle graph.

The algorithms presented in the next subsection assume that all the links are bidirectional and also that for each node there is a queue corresponding to each incoming and outgoing link. Furthermore, a node can receive a packet from each incoming link and send a packet along each outgoing link in one unit of time (this assumption has been made in [27] also).
Algorithm B
Phase 1
Step 1: for each packet x do in parallel: select a random intermediate node.
Step 2: Use Algorithm A′ to send the packets to their intermediate random destinations. {The queuing discipline is first-in first-out (FIFO). Specified time in Algorithm A′, applied here, means c′n (for some constant c′ that depends on the failure probability).}
Phase 2
Use Algorithm A′ to send each packet x from its intermediate node to its correct destination.

Analysis
Fact 3.1 The number of steps a packet x is delayed is less than or equal to the number of packets that overlap with x. (Two packets are said to overlap if there is ≥ 1 common link in their paths.)
Figure 4: A logical network for the 3-star graph.
Proof: Refer to [27]. ✷

Fact 3.2 For any n > 0, there exists an i such that min(∏_{s=1}^{n−i} s, ∏_{s=n−i+1}^{n} s) = ∏_{s=1}^{n−i} s and n − i > n/2.

We can represent the stage transitions in our algorithm in the form of a logical network. A logical network is the following. Each column is simply the nodes in the network. The links from column i − 1 to column i are the links (in the network) that can be used during the transition from S^{i−1}_stable to S^i_stable (in our algorithm). So a logical network represents the stage of the network at each time unit (where 'stage' means the same as in the context of Algorithm A). Our proof will be simplified if it is given using the logical network. A logical network for the 3-star graph is shown in Figure 4. Since n = 3, we have only two stages (levels). Each node in column i has n − i + 1 incoming and n − i outgoing links. Packets are delayed only in the case that more than one incoming link contains a packet and more than one of these packets must be forwarded along the same outgoing link. Note that, as an example, if a packet x moving from node 123 to node 312 has to pass through node 213, it will never cause a delay to the packets in node 213 if the destinations of those packets are not node 312. Also note that each link corresponds to at most 2 steps.

Theorem 3.1 For the n-star graph with N = n! nodes, any permutation routing can be completed by a randomized routing algorithm (using Algorithm B) in Õ(n) steps. (We will prove Theorem 3.1 only for Phase 1; it will be clear how the proof can be modified to apply to the second phase as a mirror image of the first phase.)
Proof: (A similar proof technique has been used by Rivest [21].) Based on Fact 3.1, to determine the expected delay of a packet x, we only need to determine how many packets are expected to overlap with x. To simplify the discussion, let us first determine the probability that d packets overlap x's path for the first time in stage i. Consider a link, say L, in stage i. Based on Lemma 2.1, we know that these d packets can possibly originate from min(∏_{s=1}^{n−i} s, ∏_{s=n−i+1}^{n} s) nodes. Thus, there are C(min(∏_{s=1}^{n−i} s, ∏_{s=n−i+1}^{n} s), d) ways to choose the origins of these d packets. For each packet, there are ∏_{s=n−(i+1)}^{n} s possible paths for the packet to take before it reaches stage i + 1. Thus, the probability that all these d packets pass through link L is (1/∏_{s=n−i−1}^{n} s)^d. Besides, the likelihood for the remaining min(∏_{s=1}^{n−i} s, ∏_{s=n−i+1}^{n} s) − d packets not to pass through link L is (1 − 1/∏_{s=n−i−1}^{n} s)^{min(∏_{s=1}^{n−i} s, ∏_{s=n−i+1}^{n} s) − d}. Hence, we have an upper bound for the probability that the number of packets whose paths overlap a given path through link L for the first time at stage i equals d. Let d_i be the number of packets that delay a given packet for the first time in stage i. In the following derivation, R stands for min(∏_{s=1}^{n−i} s, ∏_{s=n−i+1}^{n} s):

Prob(d_i = d) ≤ C(R, d) (1/∏_{s=n−i−1}^{n} s)^d (1 − 1/∏_{s=n−i−1}^{n} s)^{R−d}
  ≤ (R^d / d!) (1/∏_{s=n−i−1}^{n} s)^d
  ≤ (1/d!) (4/n²)^d, by Fact 3.2.

But we are interested in the probability of a total delay d rather than the delay due to packets that meet the given packet for the first time in stage i. The total delay for the given packet is Σ_i d_i. This can be computed using generating functions. The generating function for Prob(d_i = d) is

G_i(x) = Σ_{d=0}^{∞} (4/n²)^d x^d / d! = e^{4x/n²}.

Therefore the generating function for Prob(Σ_i d_i = d) is given by

G(x) = ∏_{i=1}^{k} G_i(x) = e^{4kx/n²} = Σ_{d=0}^{∞} (4k/n²)^d (1/d!) x^d,

where k is the number of stages in the algorithm. Then the probability that the total delay is greater than a given amount, say δ, is:

Prob(Σ_i d_i ≥ δ) ≤ Σ_{d=δ}^{∞} (4k/n²)^d (1/d!)
  ≤ 2 (4k/n²)^δ (1/δ!)
  ≤ 2 (4/n)^δ (1/δ!), since k = n − 1.

If we let δ equal cn for some constant c > 1,

Prob(Σ_i d_i ≥ cn) ≤ 2 (4/n)^{cn} (1/(cn)!) ≤ 1/(n!)^c. ✷
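As a concrete illustration of Algorithm B (a sketch of ours, not from the paper), the code below routes a random permutation on the 4-star through random intermediate nodes, using an Algorithm A-style oblivious path for each phase. The link model is simplified: one packet may cross each directed link per step, with ties broken by scan order, which only approximates the FIFO discipline; all names are illustrative.

```python
import random
from itertools import permutations

def oblivious_path(src, dst):
    """Algorithm A-style path: fix symbol positions n, n-1, ..., 2 in turn."""
    n, cur, path = len(src), list(src), [src]
    for i in range(1, n):
        pos = n - i
        want = dst[pos]
        if cur[pos] == want:
            continue
        if cur[0] != want:                      # bring the needed symbol to the front
            j = cur.index(want)
            cur[0], cur[j] = cur[j], cur[0]
            path.append(tuple(cur))
        cur[0], cur[pos] = cur[pos], cur[0]     # place it at position n-i+1
        path.append(tuple(cur))
    return path

def simulate_two_phase(n, seed=0):
    """Phase 1: route to a random intermediate node; Phase 2: on to the destination."""
    rng = random.Random(seed)
    nodes = list(permutations(range(1, n + 1)))
    dests = nodes[:]
    rng.shuffle(dests)                          # a random permutation request
    paths = []
    for src, dst in zip(nodes, dests):
        mid = rng.choice(nodes)
        paths.append(oblivious_path(src, mid) + oblivious_path(mid, dst)[1:])
    pos, steps = [0] * len(paths), 0
    while any(pos[k] < len(paths[k]) - 1 for k in range(len(paths))):
        used = set()
        for k, p in enumerate(paths):
            if pos[k] < len(p) - 1:
                edge = (p[pos[k]], p[pos[k] + 1])
                if edge not in used:            # this directed link is free this step
                    used.add(edge)
                    pos[k] += 1
        steps += 1
    assert all(paths[k][-1] == dests[k] for k in range(len(paths)))
    return steps
```

Since every packet's two-phase path has at most 4(n − 1) links and at least one packet advances per step, delivery always terminates; the observed step count can then be compared against the Õ(n) bound of Theorem 3.1.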
A similar proof technique can also be used to analyze the behavior of a simple but efficient randomized routing algorithm for the d-way shuffle. Our routing algorithm for the n-way shuffle achieves a better (in fact, optimal) time bound than that of [27]. A d-way shuffle network has N = d^n nodes. Each node can be labelled d_n d_{n−1} . . . d_1, where each d_i is a d-ary digit. A node labelled d_n d_{n−1} . . . d_1 is connected to the nodes labelled l d_n d_{n−1} . . . d_2, where l is an arbitrary d-ary digit. Therefore, the network has diameter n and a unique path of exactly n links between any pair of nodes. If we choose d = n, then the network is an n-way shuffle. The following algorithm can be used to perform permutation routing on the n-way shuffle.
Algorithm C
Phase 1
Step 1: for each packet x do in parallel: select a random intermediate node.
Step 2: Send the packets along the unique path to their intermediate random destinations. {The queuing discipline is FIFO.}
Phase 2
Send each packet x from its intermediate node to its correct destination along the unique path.
Theorem 3.2 For the n-way shuffle network of N = n^n nodes, any permutation routing can be performed by a randomized routing algorithm (using Algorithm C) in Õ(n) steps.
Proof: Similar to Theorem 3.1. ✷
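The unique-path structure Algorithm C relies on can be sketched in a few lines (an illustration of ours, with illustrative names): in the d-way shuffle, shifting in the destination's digits, lowest digit first, reaches any node in exactly n links.

```python
def shuffle_neighbors(node, d):
    """In the d-way shuffle, (d_n, ..., d_1) is joined to (l, d_n, ..., d_2)."""
    return [(l,) + node[:-1] for l in range(d)]

def shuffle_path(u, v):
    """The unique n-link path from u to v: shift in v's digits, v_1 first."""
    path, cur = [u], u
    for digit in reversed(v):       # v_1, then v_2, ..., v_n
        cur = (digit,) + cur[:-1]
        path.append(cur)
    return path
```

After n shifts the label is exactly v, regardless of u, which is why the path between any pair of nodes is unique and of length exactly n.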
4 A universal optimal randomized routing algorithm
A deficiency with the state of the art in packet routing is that the algorithms presented and their analyses are network-specific. An important open question is: Is there a network-independent routing algorithm that works for a large class of networks, rather than a specific network? A significant contribution in this direction has been reported by Leighton, Maggs and Rao [10]. They give a proof that any set of paths with distance d and congestion c can be routed off-line in O(c + d) steps using constant-size queues. (The congestion is the largest number of packets that must traverse a single edge during the entire course of the routing.) They also show that for a leveled network with N leveled paths spanning ℓ levels with congestion c, their algorithm could complete any permutation routing on it in O(c + ℓ + log N) steps. However, their analysis only works for constant degree leveled networks. We provide a universal routing algorithm and a network-independent analysis (a modified version of the proof given in section 3) which works for both constant degree and non-constant degree leveled networks (although the algorithm doesn't guarantee a constant queue size).

An (N, ℓ) leveled network consists of ℓ + 1 groups of nodes such that each group has N nodes and these groups form a sequence of ℓ + 1 columns, say c_1, c_2, . . . , c_{ℓ+1}. Column c_1 and column c_{ℓ+1} are identified; thus, although there are ℓ + 1 columns of N nodes each, the total number of nodes is ℓN. The only links in the network are between nodes in c_i and nodes in either c_{i−1} or c_{i+1} (provided these columns exist). Every node in each column has at most d incoming and outgoing links, where d is the degree of the network. For each node
in the first column, there exists a unique path of length ℓ connecting it to any node in the last column. Clearly, the diameter of the network is ℓ. See Figure 5. A leveled network is called nonrepeating if it satisfies the following property: if any two distinct paths from the first column to the last column share some links and then diverge, these two paths will never share a link again.

Figure 5: A leveled network of ℓ levels and degree d.

Leveled networks are interesting because the problem of packet routing in various single-stage interconnection networks (such as the n-cube) can be reduced to an equivalent packet routing problem on a leveled network. Given an N-node single-stage network 𝒩 with diameter D, the first step is to select a path of length at most D for every (source, destination) pair. The collection C of all such paths is then represented as an (N, D) leveled network whose links are defined as follows: (1) there is a link from node u of column c_i to node v of column c_{i+1} if and only if there is some path p ∈ C whose i-th edge connects nodes u and v in 𝒩; (2) for every node u, there is a link from node u in column c_i to node u in column c_{i+1}, 1 ≤ i ≤ D. The set of links defined by (2) takes care of paths in C which are less than D in length. For such a path p, the corresponding path in the leveled network will follow the same sequence of nodes (in increasing columns), and is then extended to the last column by following the links in (2). The n-cube, d-way shuffle [27], star graph [1], mesh, and a host of other single-stage networks can all be represented as nonrepeating leveled networks. For instance, the leveled network representation of the n-cube is the butterfly, which is easily seen to be nonrepeating.
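Rules (1) and (2) above can be sketched for the n-cube with bit-fixing paths (a minimal illustration of ours, not the paper's construction): correcting one bit per level, and padding with the self-links of rule (2) when a bit already agrees, yields a column-by-column path of length exactly D = n.

```python
def leveled_path(u, v, n):
    """Column-by-column path of length exactly n in the leveled representation
    of the n-cube (nodes are n-bit integers)."""
    path, cur = [u], u
    for i in range(n):
        if (cur ^ v) >> i & 1:
            cur ^= 1 << i          # rule (1): a cube edge correcting bit i
        path.append(cur)           # rule (2): otherwise stay at the same node
    return path
```

Every pair of columns is crossed exactly once, so each consecutive pair of nodes differs in at most one bit, which is the butterfly-like structure described in the text.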
We refer the reader to [28] for the leveled network representations of the d-way shuffle, star graph, and other single-stage networks. The next theorem pertains to routing on a leveled network.

Theorem 4.1 For a leveled network of ℓN nodes with ℓ levels, any permutation routing of N packets (from the first column to the last column) can be completed in Õ(ℓ) steps provided that d ≥ 2, where d is the degree of the network. (The result can easily be extended for a permutation of ℓN packets.)

To prove this theorem, we first present the routing algorithm.

Algorithm D {A universal routing algorithm}
Phase 1
for each packet x do in parallel: select a random link as a bridge to go to the next level by flipping a d-sided coin, where d is the number of outgoing links of the node at which the packet is residing. Do this until the packet reaches the last column. {Each packet will reach a node in the last column, which is a random intermediate node. The queuing discipline is FIFO.}
Phase 2
Send each packet x from its intermediate node to its correct destination along the unique path.

Proof of Theorem 4.1: Without loss of generality, suppose that the degree of the leveled network is d. Let δ_i be the number of packets that delay a given packet for the first time in level i. Then, by an argument similar to that used in the proof of Theorem 3.1, we have

Prob(δ_i = δ) ≤ (1/d²)^δ / δ!.

The generating function for Prob(δ_i = δ) is

G_i(x) = Σ_{δ=0}^{∞} (1/d²)^δ x^δ / δ! = e^{x/d²}.

The generating function for Prob(Σ_i δ_i = δ) is thus

G(x) = ∏_{i=1}^{ℓ} G_i(x) = e^{ℓx/d²} = Σ_{δ=0}^{∞} (ℓ/d²)^δ (1/δ!) x^δ.

Hence, the probability that the total delay is greater than a given amount, say q, is:

Prob(Σ_i δ_i ≥ q) ≤ Σ_{δ=q}^{∞} (ℓ/d²)^δ (1/δ!).

Let q equal ℓ. If ℓ = O(d), or if ℓ = Ω(log N), then the above probability is ≤ c′/N^{1+c″}, where c′ and c″ are constants. ✷
5 Emulation of a PRAM on leveled networks
The parallel random-access machine (PRAM) has become a popular vehicle for investigating parallel algorithms for a wide variety of problems such as sorting, graph and matrix problems, computational geometry, etc. (see e.g., [7] [22] [24] [26]). It is an abstract parallel computer model consisting of an arbitrary number of processors that communicate via a shared global memory. Each access to the shared memory is assumed to take unit time. This unit-time memory access property simplifies programming because it permits parallel algorithms to be designed and analyzed solely on the basis of their computational requirements, divorced from issues of interprocessor communication. In this section, we consider the problem of emulating a PRAM on leveled networks. Ranade [20] has earlier shown that one step of a concurrent-read concurrent-write (CRCW) N-processor PRAM can be emulated in O(log N) time on an N-node butterfly (whose degree is constant). This paper presents, for the first time, optimal emulations of the CRCW PRAM on the star graph and the n-way shuffle, which have sub-logarithmic diameter. These results are special cases of a more general result that gives an optimal emulation of the CRCW PRAM on a large class of non-constant degree leveled networks.
5.1
PRAM emulation on any Interconnection Network (ICN)
We consider the problem of emulating a PRAM with N processors and shared memory of size M on an N-node ICN. For simplicity, we assume that the PRAM is exclusive read, exclusive write (EREW); the emulation result can be extended to the more general concurrent read, concurrent write (CRCW) PRAM using ‘message combining’ (see [20] [28]). Our emulation algorithm is based on Karlin and Upfal’s technique called parallel hashing [6]. The idea is to map the M shared memory cells of the PRAM onto the local memory modules of the N processors of the ICN. The mapping is obtained by randomly choosing a hash function h from the following class of hash functions:
$$H = \Big\{\, h \;\Big|\; h(x) = \Big(\Big(\sum_{0 \le i \le k} a_i x^i\Big) \bmod P\Big) \bmod N \,\Big\},$$

where P is a prime, the coefficients a_i are chosen at random from {0, 1, …, P − 1}, and the degree k is O(log N). If we could show that, with probability ≥ 1 − N^{−α} (for some α > 0), no more than O(1) items from S will be mapped onto the same memory module, then the routing algorithm in section 4, together with its analysis, could be used directly to prove the desired performance of the emulation. Unfortunately, with probability N^{−β} (for some β > 0), at least one node will get cℓ items (for some constant c). However, even if cℓ items are mapped into each memory module, the desired performance can still be obtained. The same routing algorithm will be used, but the analysis is different. We will first prove that the algorithm in section 4 can perform a partial ℓ-relation
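A sketch of this style of hash family in code may help; the function names, the specific prime, and the parameter values below are our illustrative choices, not the paper's (see Karlin and Upfal [6] for the actual construction):

```python
import random

def make_hash(P, N, degree, rng=random):
    """Draw a random polynomial of the given degree over Z_P and reduce
    mod N -- one member h of a Karlin-Upfal style family H."""
    coeffs = [rng.randrange(P) for _ in range(degree + 1)]

    def h(x):
        # Horner evaluation of the polynomial mod P, then fold onto N modules.
        acc = 0
        for a in reversed(coeffs):
            acc = (acc * x + a) % P
        return acc % N

    return h

random.seed(0)
h = make_hash(P=10007, N=64, degree=10)   # P prime >= M; degree = O(log N)
module = h(12345)                          # module holding shared cell 12345
assert 0 <= module < 64
assert h(12345) == module                  # a fixed h is deterministic
```

Drawing a fresh h amounts to drawing fresh coefficients; the emulation's probabilistic guarantees are over this random choice, not over the access pattern.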
routing in Õ(ℓ) time, and then, in the next subsection, we will prove that with high probability at most O(ℓ) items from S will be mapped into the same memory module.

Theorem 5.1 For a leveled network of ℓ levels with degree d, ℓ = O(d), any partial ℓ-relation routing can be completed by a randomized routing algorithm (using the algorithm in section 4) in Õ(ℓ) steps.

(By a partial ℓ-relation we mean the routing problem in which at most ℓ packets originate from any node and at most ℓ packets are destined for any node.)

We need the following lemma in the proof of Theorem 5.1.

Lemma 5.1 If a routing algorithm X can realize any permutation in c₁f(N) steps with probability ≥ 1 − 1/N, then we can make use of this algorithm to perform any permutation routing in c₁c₂f(N) steps with probability ≥ 1 − 1/N^{c₂}.

Proof: We simply repeat algorithm X a constant number of times, say c₂. In each run of algorithm X, those packets that have not reached their destinations within c₁f(N) steps trace back their paths and reach their sources in c₁f(N) steps or less, and these packets then repeat algorithm X. Clearly, the probability that at least one packet is unsuccessful in one trial is ≤ 1/N, and the probability of failure in all c₂ trials is thus ≤ 1/N^{c₂}. The probability of success within c₂ trials is therefore ≥ 1 − 1/N^{c₂}, and the total running time of the algorithm is c₁c₂f(N) with probability ≥ 1 − 1/N^{c₂}. ✷
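The repetition argument of Lemma 5.1 can be sketched as follows, with one run of algorithm X stubbed out by a coin flip; all names and constants here are illustrative:

```python
import random

def route_with_retries(run_once, trials):
    """Model of Lemma 5.1: repeat a randomized routing algorithm X up to
    `trials` (= c2) times.  Packets that miss the c1*f(N) deadline trace
    their paths back to their sources (another <= c1*f(N) steps) and retry,
    so the failure probability drops from 1/N to (1/N)**c2 because the
    trials are independent."""
    for attempt in range(1, trials + 1):
        if run_once():          # True iff every packet made the deadline
            return attempt      # number of attempts actually used
    return None                 # all c2 trials failed

random.seed(1)
N = 1024
one_run = lambda: random.random() >= 1.0 / N   # one run fails w.p. 1/N
assert route_with_retries(one_run, trials=3) == 1
assert route_with_retries(lambda: False, trials=5) is None
```

The point of the lemma is that c₂ is a constant, so the retries cost only a constant factor in time while boosting the success probability polynomially.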
Proof of Theorem 5.1: (The proof is similar to that of Theorem 4.1, but with different parameters.) Based on Fact 3.1, to determine the expected delay of a packet x, we only need to determine how many packets are expected to overlap with x's path. We first determine the probability that ρ packets overlap x's path for the first time in level i. Consider a link, say L, in level i. These ρ packets can possibly originate only from the d^{i−1} nodes that can reach L, which hold ℓd^{i−1} packets in all (because, initially, we have at most ℓ packets in each processor). Thus, there are $\binom{\ell d^{i-1}}{\rho}$ ways to choose the origins of these ρ packets. For each packet, there are d^{i+1} possible paths the packet can take before it reaches level i + 1. Thus, the probability that any one of these ρ packets passes through link L is 1/d^{i+1}. Moreover, the probability that the remaining ℓd^{i−1} − ρ packets do not pass through link L is (1 − 1/d^{i+1})^{ℓd^{i−1} − ρ}. Hence, we have an upper bound on the probability that the number of packets whose paths overlap a given path through link L for the first time at level i equals ρ. Let d_i be the number of packets that delay a given packet for the first time in level i. Then,

$$\begin{aligned}
\mathrm{Prob}(d_i = \rho) &\le \binom{\ell d^{i-1}}{\rho} \Big(\frac{1}{d^{i+1}}\Big)^{\rho} \Big(1 - \frac{1}{d^{i+1}}\Big)^{\ell d^{i-1} - \rho} \\
&\le \binom{\ell d^{i-1}}{\rho} \Big(\frac{1}{d^{i+1}}\Big)^{\rho} \le \frac{(\ell d^{i-1})^{\rho}}{\rho!}\, \frac{1}{(d^{i+1})^{\rho}} \le \frac{1}{\rho!} \Big(\frac{\ell}{d^2}\Big)^{\rho}.
\end{aligned}$$

But we are interested in the probability of a total delay rather than the delay due to packets that meet the given packet for the first time in level i. The total delay for the given packet is $\sum_i d_i$. This can be computed using generating functions.
The generating function for $\mathrm{Prob}(d_i = \rho)$ is

$$G_i(x) = \sum_{\rho=0}^{\infty} \frac{(\ell/d^2)^{\rho}}{\rho!}\, x^{\rho} = e^{\ell x/d^2}.$$

Thus the generating function for $\mathrm{Prob}(\sum_i d_i = \rho)$ is given by

$$G(x) = \prod_{i=1}^{\ell} G_i(x) = e^{\ell^2 x/d^2} = \sum_{\rho=0}^{\infty} \frac{(\ell^2/d^2)^{\rho}}{\rho!}\, x^{\rho},$$

where ℓ is the number of levels of the network. Then the probability that the total delay is greater than a given amount, say ζ, is:
$$\begin{aligned}
\mathrm{Prob}\Big(\sum_i d_i \ge \zeta\Big) &\le \sum_{\rho=\zeta}^{\infty} \Big(\frac{\ell^2}{d^2}\Big)^{\rho} \frac{1}{\rho!} \\
&\le 2\,\Big(\frac{\ell^2}{d^2}\Big)^{\zeta} \frac{1}{\zeta!} \\
&\le 2\,c_1^{2\zeta}\,\frac{1}{\zeta!}, \quad \text{if } \ell = c_1 d \\
&\le 2\,c_1^{2 c_2 \ell}\,\frac{1}{c_2!\,\ell!}, \quad \text{letting } \zeta = c_2 \ell \\
&\le c_3\, c_4^{\ell}\,\frac{1}{\ell!}, \quad \text{where } c_3 = \frac{2}{c_2!} \text{ and } c_4 = c_1^{2 c_2} \\
&\le c_6\, c_7^{d}\,\frac{1}{d^{\ell}}, \quad \text{since } \ell! = (c_1 d)! \ge c_5^{d}\, d^{\ell} \text{ with } c_5 = (c_1/e)^{c_1}, \text{ and letting } c_6 = c_3 \text{ and } c_7 = c_4^{c_1}/c_5 \\
&\le c_6\,\frac{1}{(d^{\ell})^{c}}, \quad \text{where } 0 < c < 1 \text{ (for d sufficiently large)}.
\end{aligned}$$

Then it follows from Lemma 5.1 that any partial ℓ-relation routing can be finished on a leveled network of ℓ levels in c′c″ℓ steps with probability at least 1 − 1/(d^{cℓ})^{c′}, cc′ > 1. ✷
Corollary 5.1 For the n-star graph with N = n! nodes, any partial n-relation routing can
be performed by a randomized routing algorithm (using the algorithm of section 4) in O(n) steps. Corollary 5.2 For the n-way shuffle with N = nn nodes, any partial n-relation routing can
be performed by a randomized routing algorithm (using the algorithm of section 4) in O(n) steps.
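As a toy illustration of why random intermediate destinations keep links uncongested in these networks, one can count how often the busiest outgoing link is chosen when packets flip d-sided coins, as in Phase 1 of the universal algorithm; the miniature model below is ours, not the paper's machinery:

```python
import random
from collections import Counter

def busiest_link_per_level(num_levels, degree, packets_per_node, rng):
    """Phase 1 in miniature: at each level, every packet at a node flips a
    d-sided coin to pick its outgoing link.  Returns the occupancy of the
    busiest link at each level, a proxy for the worst FIFO queue there."""
    worst = []
    for _ in range(num_levels):
        picks = Counter(rng.randrange(degree) for _ in range(packets_per_node))
        worst.append(max(picks.values()))
    return worst

rng = random.Random(42)
worst = busiest_link_per_level(num_levels=8, degree=64, packets_per_node=64, rng=rng)
# 64 packets spread over 64 random links: the busiest link sees only a
# handful of packets, far below the trivial bound of 64.
assert len(worst) == 8 and max(worst) < 16
```

This is exactly the balls-into-bins behavior that the generating-function analysis quantifies: the worst link load stays a small multiple of the average.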
5.3
Performance analysis of emulation
We know that each item of the PRAM has been mapped to a location in the distributed memory modules of the emulating network according to a hash function h randomly chosen from H. To prove that each step of the PRAM can be emulated in the desired time, say Õ(ℓ), we need to prove that each read/write instruction of the PRAM can be performed by the emulating network in Õ(ℓ) time. First, on the way to access the items, read/write request packets are sent from the processors to the destinations defined by h. Then, on the way back (in the case of a read instruction), each item (a return packet) is sent from its location (the destination of the request packet and the source of the return packet) to the processor that sent the request packet. The communication algorithm has been analyzed in the previous section. We have proven that if initially there are at most cℓ packets at any node and no more than cℓ packets have the same destination, the communication can be completed in Õ(ℓ) time. Hence, if we can prove that with high probability no more than cℓ items in S will be mapped into any memory module, then together with the result of Theorem 5.1 the desired emulation performance immediately follows. Let X_S be the number of items in S assigned by the hash function h to a memory module; then we have:
Lemma 5.2 For γ > δ,

$$\mathrm{Prob}\big(\max X_S \ge \gamma\big) \le N \Big(\frac{1}{\gamma - \delta}\Big)^{\delta},$$

where the maximum is taken over the memory modules.
Proof: See Karlin and Upfal [6]. ✷

Theorem 5.2 Each step of the EREW PRAM can be emulated by a leveled network of ℓ levels with degree d, ℓ = O(d), in Õ(ℓ) steps.

Proof: Using Lemma 5.2 and fixing δ to be cℓ, the probability that more than cℓ elements from any S are assigned to a single memory module is bounded by 1/N^c. Together with Theorem 5.1, the theorem is proven. ✷
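The module-load claim at the heart of this proof is a balls-into-bins statement, and a quick simulation sanity-checks it; the parameters below (and the uniform stand-in for h) are illustrative choices of ours:

```python
import random
from collections import Counter

def max_module_load(num_items, num_modules, rng):
    """Throw num_items requests into num_modules memory modules uniformly
    at random (a stand-in for a randomly chosen h in H) and report the
    load of the fullest module."""
    loads = Counter(rng.randrange(num_modules) for _ in range(num_items))
    return max(loads.values())

rng = random.Random(7)
N, ell = 5040, 7          # N = 7! models the 7-star graph, so ell is about 7
worst = max(max_module_load(N, N, rng) for _ in range(20))
# Across 20 random hash choices, the fullest module stays within a small
# constant multiple of ell, as the emulation analysis requires.
assert worst <= 4 * ell
```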
Corollary 5.3 Each step of the EREW PRAM can be emulated in Õ(n) steps (1) by the n-star graph with N = n! nodes, and (2) by the n-way shuffle graph with N = n^n nodes.
Theorem 5.3 Each step of the CRCW PRAM can be emulated by a leveled network of ℓ levels with degree d, ℓ = O(d), in Õ(ℓ) steps.

Proof: Each processor combines all incoming packets having the same destination into one packet⁹. In [20], packets are routed in sorted order, and thus a FIFO queue of destination bits for each link suffices to guarantee that each requesting processor receives a reply. In our algorithm, however, sorting is not possible because of the non-constant degree of the network, and hence the technique used in [20] cannot be applied directly. We use a simpler method. To make sure that each requesting processor receives a reply in the case of a read instruction, each link is associated with a FILO queue (stack) of direction bits. Before a combined packet is sent out, the processor pushes d direction bits onto the stack of the link on which the packet will be sent; these bits record the edges along which the packets being combined arrived. Using techniques similar to those of section 4, we can show that all the (combined) packets arrive at their destinations within cℓ steps (for some constant c) with high probability. After a packet reaches its destination, it traces back its original path to its source in the case of a read request. To make sure that the packets obtain their correct direction bits for replication (at any node where a combine operation took place), a packet starts its backward journey at time step cℓ + t_d if it arrived at its destination at time cℓ − t_d. The extra hardware needed is O(dℓ) bits of storage per link, which is smaller than the queue size at any node, which can grow as large as Ω(ℓ² log d) bits (notice that the queue at any node can hold as many as ℓ packets, and the address of each packet is ℓ log d bits). One can imagine that the snapshot of the routing of the packets at time t_f (on the way to the destinations) is the same as the snapshot of the routing of the packets at time 2cℓ − t_f (on the way back to the sources). Together with the proof of Theorem 5.2, the theorem is proven. An alternative is for each packet to carry O(dℓ) direction bits with it. ✷

⁹It is assumed that any number of incoming packets with the same destination, arriving on different links, can be combined into one packet in one unit of time.
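The stack-of-direction-bits mechanism used in this proof can be sketched as follows; the node model and packet representation are our invention, purely for illustration:

```python
class CombiningNode:
    """A switch that combines same-destination read requests into one
    packet and later replicates the reply, using a FILO stack of
    d direction bits per outgoing link (as in the proof sketch)."""

    def __init__(self, degree):
        self.degree = degree
        self.stack = []                     # FILO queue of direction bits

    def combine(self, arrival_links):
        """Record which incoming links carried requests for the same cell,
        pushing d direction bits, and forward a single combined request."""
        bits = [1 if link in arrival_links else 0 for link in range(self.degree)]
        self.stack.append(bits)
        return "combined-request"

    def replicate(self):
        """On the backward journey, pop the matching bits -- FILO order
        mirrors the forward snapshot -- and fan the reply back out."""
        bits = self.stack.pop()
        return [link for link, bit in enumerate(bits) if bit]

node = CombiningNode(degree=4)
node.combine({0, 2})                  # requests arrived on links 0 and 2
node.combine({1})                     # a later combined packet
assert node.replicate() == [1]        # last combined, first replicated
assert node.replicate() == [0, 2]
assert node.stack == []               # O(d * ell) bits suffice per link
```

Because the backward journey replays the forward journey in reverse, the last set of bits pushed is exactly the first one needed — which is why a stack, rather than a FIFO queue, matches the snapshot symmetry argued in the proof.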
6
An optimal routing algorithm for the star graph under the sequential model
The assumption made in section 3 that each node can receive and send a packet along each incident edge may not be realistic, because in practice a node can process only one packet at a time. From here on, the former model is referred to as the parallel model and the latter as the sequential model. Clearly, the routing time under the sequential model is upper bounded by the degree times the routing time under the parallel model. For example, the routing time in Theorem 3.1 is thus Õ(n²). This fact has been indicated by Upfal in [23]. Fortunately, by slightly modifying our randomized algorithm and using a different analysis, we are still able to make use of Algorithm B to realize any permutation on star graphs in Õ(n) time even under the sequential model. This is shown in Theorem 6.1.

Theorem 6.1 For the n-star graph (sequential model) of N = n! nodes, any permutation routing can be completed by a randomized routing algorithm (using Algorithm B) in Õ(n) steps.

Proof: Using a proof similar to that of Theorem 5.1, it can be shown that any n-relation routing on star graphs can be realized in c′n steps with probability at least 1 − 1/N^c, for some constant 0 < c < 1 and a constant c′ that depends on c. Then it follows from Lemma 5.1 that any permutation routing can be finished on the n-star graph in c′c″n steps with probability at least 1 − 1/N^{cc″}, cc″ > 1. ✷

Using the same idea, the following can also be proved.

Theorem 6.2 For the n-way shuffle network (sequential model) of N = n^n nodes, there exists a randomized routing algorithm with routing time Õ(n). ✷
7
Conclusions
Valiant’s two-phase scheme has proved to be a powerful technique for packet routing. Sections 3, 4, 5, and 6 demonstrate that using generating functions to handle random variables can simplify the analysis of the behavior of a routing algorithm and can also lead to tighter upper bounds. In particular, optimal randomized algorithms have been derived in these sections for packet routing on networks with sub-logarithmic diameter. We have also presented optimal algorithms for emulating a PRAM on leveled networks with non-constant degree.
References

[1] Akers, S., Harel, D., and Krishnamurthy, B., ‘The Star Graph: An Attractive Alternative to the n-Cube,’ Proc. International Conference on Parallel Processing, 1987, pp. 393-400.

[2] Akers, S. B., and Krishnamurthy, B., ‘A group theoretic model for symmetric interconnection networks,’ Proc. International Conference on Parallel Processing, 1986, pp. 216-223.

[3] Aleliunas, R., ‘Randomized parallel communication,’ Proc. Symposium on Principles of Distributed Computing, 1982, pp. 60-72.

[4] Alt, H., Hagerup, T., Mehlhorn, K., and Preparata, F., ‘Deterministic Simulation of Idealized Parallel Computers on More Realistic Ones,’ SIAM Journal on Computing, vol. 16(5), 1987, pp. 808-835.

[5] Borodin, A., and Hopcroft, J. E., ‘Routing, merging and sorting on parallel models of computation,’ Proc. Symposium on Theory of Computing, 1982, pp. 338-344.

[6] Karlin, A., and Upfal, E., ‘Parallel Hashing–An Efficient Implementation of Shared Memory,’ Proc. Symposium on Theory of Computing, 1986, pp. 160-168.

[7] Karp, R., and Ramachandran, V., ‘Parallel Algorithms for Shared-Memory Machines,’ in Handbook of Theoretical Computer Science, North-Holland, 1990.

[8] Krizanc, D., Rajasekaran, S., and Tsantilas, T., ‘Optimal Routing Algorithms for Mesh-Connected Processor Arrays,’ Proc. Aegean Workshop on Computing, 1988. Springer-Verlag Lecture Notes in Computer Science # 319, pp. 411-422. To appear in Algorithmica.
[9] Kunde, M., ‘Routing and Sorting on Mesh-Connected Arrays,’ Proc. Aegean Workshop on Computing, 1988. Springer-Verlag Lecture Notes in Computer Science # 319, pp. 423-433. [10] Leighton, T., Maggs, B., and Rao, S., ‘Universal packet routing algorithms,’ Proc. Symposium on Foundations of Computer Science, 1988, pp. 256-269. [11] Leighton, T., Makedon, F., and Tollis, I.G., ‘A 2n − 2 Step Algorithm for Routing in an n × n Array With Constant Size Queues,’ Proc. Symposium on Parallel Algorithms and Architectures, 1989, pp. 328-335. [12] Palis, M., Rajasekaran, S., and Wei, D., ‘General Routing Algorithms for Star graphs,’ Proc. International Parallel Processing Symposium, 1990, pp. 597-611. [13] Palis, M., Rajasekaran, S., and Wei, D., ‘Emulation of a PRAM on Leveled Networks,’ Proc. International Conference on Parallel Processing, 1991. [14] Pippenger, N., ‘Parallel communication with limited buffers,’ Proc. Symposium on Foundations of Computer Science, 1984, pp.127-136. [15] Rajasekaran, S., ‘Randomized Algorithms for Packet Routing on the Mesh,’ to appear in Advances in Parallel Algorithms, Blackwell Scientific Publications, 1991. [16] Rajasekaran, S., and Overholt, R., ‘Constant Queue Routing on a Mesh,’ Proc. Symposium on Theoretical Aspects of Computer Science, Hamburg, Germany, Feb. 1990. Springer-Verlag Lecture Notes in Computer Science # 480, pp. 444-455. To appear in Journal of Parallel and Distributed Computing. [17] Rajasekaran, S., and Raghavachari, M., ‘Optimal Randomized Algorithms for Multipacket and Cut Through Routing on the Mesh,’ Proc. Symposium on Parallel and Distributed Processing, Dec. 1991. [18] Rajasekaran, S., and Reif, J.H., ‘Optimal and Sub-Logarithmic Time Randomized Parallel Sorting Algorithms,’ SIAM Journal on Computing, 18(3), 1989, pp. 594-607. [19] Rajasekaran, S., and Tsantilas, Th., ‘An Optimal Randomized Routing Algorithm for the Mesh and a Class of Efficient Mesh-Like Routing Networks,’ Proc. 
Conference on Foundations of Software Technology and Theoretical Computer Science, 1987. Springer-Verlag Lecture Notes in Computer Science # 287, pp. 226-241. [20] Ranade, A.G., ‘How to Emulate Shared Memory,’ Proc. Symposium on Foundations of Computer Science, 1987, pp. 185-194.
[21] Rivest, R., Lecture Notes on Parallel Algorithms, MIT, 1987. [22] Snyder, L., ‘Type architectures, shared memory, and the corollary of modest potential,’ Annu. Rev. Comput. Sci. 1, 1986, pp. 289-317. [23] Upfal, E., ‘Efficient schemes for parallel communication,’ Journal of the ACM, vol. 31, no. 3, 1984, pp. 507-517. [24] Upfal, E., and Wigderson, A., ‘How to Share Memory in a Distributed System,’ Proc. Symposium on Foundations of Computer Science, 1984, pp. 171-180. [25] Valiant, L.G., ‘A Scheme for Fast Parallel Communication,’ SIAM Journal on Computing, 11(2), 1982, pp. 350-361. [26] Valiant, L.G., ‘A bridging model for Parallel Computation,’ Communications of the ACM, Vol. 33, No. 8, 1990, pp. 103-111. [27] Valiant, L.G., and Brebner, G.J., ‘Universal Schemes for Parallel Communication,’ Proc. Symposium on Theory of Computing, 1981, pp. 263-277. [28] Wei, D.S.L., ‘Fast Parallel Routing and Computation on Interconnection Networks,’ Ph.D. Thesis, Univ. of Pennsylvania, Jan. 1991.
BIOGRAPHIES

Michael A. Palis is Assistant Professor of Computer and Information Science at the University of Pennsylvania. He received the B.S. degree in Electrical Engineering (cum laude) in 1979 and the B.S. degree in Physics (magna cum laude) in 1980, both from the University of the Philippines. He received the Ph.D. degree in Computer Science from the University of Minnesota in 1985. Dr. Palis’ main research interests are the design and analysis of algorithms, computational complexity, and parallel processing. He is a member of the IEEE Computer Society, the Association for Computing Machinery, the European Association for Theoretical Computer Science, the Mathematical Association of America, and the Phi Kappa Phi Honor Society.

Sanguthevar Rajasekaran received his M.E. degree in Automation from the Indian Institute of Science (Bangalore) in 1983, and his Ph.D. degree in Computer Science from Harvard University in 1988. Since August 1988 he has been an Assistant Professor in the Computer and Information Science Department of the University of Pennsylvania. His research interests include parallel algorithms, randomized computing, combinatorial optimization, learning theory, animation, and simulation.

David S. L. Wei received his B.E. in computer science from Feng Chia University, Taiwan, R.O.C., in 1978, his M.S. in computer science from National Tsing Hua University, Taiwan, R.O.C., in 1980, and his Ph.D. in computer science from the University of Pennsylvania in 1991. His research interests include parallel processing, parallel programming, VLSI architectures, and their applications in artificial intelligence. During 1980-1985, he was a computer science instructor at Feng Chia University and Soochow University. He is currently an assistant professor of computer science at Radford University. Dr. Wei is a member of the IEEE, the IEEE Computer Society, the ACM, and the SIAM.