Adaptive Multicast Wormhole Routing in 2D Mesh Multicomputers Xiaola Lin, Philip K. McKinley, and Abdol-Hossein Esfahanian Technical Report MSU-CPS-93-14 May 1993

A short version of this report appeared in Proc. PARLE'93, Munich, Germany, June 1993.

Adaptive Multicast Wormhole Routing in 2D Mesh Multicomputers  Xiaola Lin, Philip K. McKinley, and Abdol-Hossein Esfahanian Department of Computer Science Michigan State University East Lansing, Michigan 48824 flinx, mckinley, esfahani [email protected] May 1993

Abstract

The issues of adaptive multicast wormhole routing in 2D mesh multicomputers are studied. Three adaptive multicast wormhole routing strategies are proposed and evaluated: minimal partially-adaptive, minimal fully-adaptive, and nonminimal adaptive routing. All three algorithms are shown to be deadlock-free; they are the first deadlock-free adaptive multicast wormhole routing algorithms ever proposed. A study has been conducted that compares the performance of these multicast algorithms. The results show that the minimal fully-adaptive routing method creates the least traffic; however, double vertical channels are required in order to avoid deadlock. The nonminimal routing algorithm exhibits the best adaptivity, although it creates more network traffic than the other methods.

 This work was supported in part by the NSF grants MIP-9204066, CDA-9121641, and CDA-9222901, by DOE grant DE-FG02-93ER25167, and by an Ameritech Faculty Fellowship.


1 Introduction

Massively parallel computers (MPCs) are seen as a viable platform on which to solve the so-called grand-challenge problems. Most such systems are characterized by the distribution of memory among an ensemble of processor nodes, which communicate by sending messages through a network. These systems are often said to be scalable because, as the number of nodes in the system increases, the total communication bandwidth, memory bandwidth, and processing capability of the system also increase.

Efficient communication among nodes is critical to the performance of MPCs. Communication operations include not only point-to-point operations, in which two processes communicate with one another, but also collective operations, which involve more than two processes. Multicast is a collective communication service in which the same message is delivered from a source node to an arbitrary number of destination nodes. Both unicast, which involves a single destination, and broadcast, which involves all nodes in the network, are special cases of multicast. Multicast communication has several uses in large-scale multiprocessors [21], including direct use in various parallel algorithms [6, 14], implementation of data parallel programming operations, such as replication and barrier synchronization [25], and support of shared-data invalidation and updating in systems using a distributed shared-memory paradigm [17].

Efficient implementation of multicast communication services depends on the particular system architecture, which includes the network topology and the underlying switching technique used to transfer messages across the network. The two-dimensional (2D) mesh topology has become popular in the construction of large-scale distributed-memory multiprocessors. Networks with mesh topologies offer massive parallelism and are more scalable than many other approaches to multiprocessor interconnection [23].
Formally, an $m \times n$ 2D mesh consists of $N = m \times n$ nodes; each node has an associated integer coordinate pair $(x, y)$, $0 \le x < n$ and $0 \le y < m$. Two nodes with coordinates $(x_i, y_i)$ and $(x_j, y_j)$ are connected by a communication channel if and only if $|x_i - x_j| + |y_i - y_j| = 1$. Figure 1 illustrates a $4 \times 5$ 2D mesh; notice that adjacent nodes are connected by two unidirectional channels in opposite directions. As shown in Figure 1, directions are used to describe the relative positions of nodes. For example, node (1,2) is said to be northwest of node (3,0). The 2D mesh topology is used in the Symult 2010 [24], the Intel Touchstone DELTA [10], and the commercial successor to the latter, the Intel Paragon [9].
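The adjacency rule above can be illustrated with a minimal Python sketch. The function names `mesh_channels` and `direction` are hypothetical and introduced only for this illustration; channels are modeled as ordered pairs of node coordinates, which is an assumption about representation rather than anything prescribed by the report.

```python
def mesh_channels(m, n):
    """Unidirectional channels (u, v) of an m x n mesh with 0 <= x < n, 0 <= y < m."""
    channels = set()
    for x in range(n):
        for y in range(m):
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                vx, vy = x + dx, y + dy
                # Neighbors differ by exactly 1 in exactly one coordinate.
                if 0 <= vx < n and 0 <= vy < m:
                    channels.add(((x, y), (vx, vy)))
    return channels


def direction(a, b):
    """Compass position of node a relative to node b, e.g. 'northwest'."""
    ns = "north" if a[1] > b[1] else "south" if a[1] < b[1] else ""
    ew = "east" if a[0] > b[0] else "west" if a[0] < b[0] else ""
    return ns + ew
```

For the $4 \times 5$ mesh of Figure 1 this yields two opposite channels per pair of adjacent nodes, and `direction((1, 2), (3, 0))` reproduces the "northwest" example in the text.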

[Figure 1 diagram: a grid of nodes labeled (x, y) with $0 \le x \le 4$ and $0 \le y \le 3$, annotated with the compass directions N, W, E, and S. Legend: unidirectional communication channel; node (including router).]

Figure 1. Example $4 \times 5$ 2D mesh

The predominant switching technique used in new generation parallel machines is wormhole routing [4]. In this approach, a message is divided into a number of flits for transmission. The header flit(s) of a message governs the route, and the remaining flits follow in a pipeline fashion. The two salient features of wormhole routing are that (1) only minimal buffers are required and (2) the network latency is distance-insensitive when there is no channel contention [23].

In wormhole-routed systems, each node contains a separate router to handle such communication-related tasks. As shown in Figure 2, several pairs of external channels connect the router to neighboring routers; the pattern in which the external channels are connected defines the network topology. Usually, the router can relay multiple messages simultaneously, provided that each incoming message requires a unique outgoing channel. A router is connected to the local processor/memory by one or more pairs of internal channels. One channel of each pair is for input, the other for output. If each node possesses exactly one pair of internal channels, a so-called one-port communication architecture [12] results, and the local processor must transmit (receive) messages sequentially. A multi-port architecture reduces this bottleneck. In the case of an all-port system, every external channel has a corresponding internal channel, allowing the node to send to and receive from all its ports simultaneously. For a survey of wormhole routing in direct networks, please refer to [23].

[Figure 2 diagram: a router connected to the local processor/memory by internal input and output channels, and to neighboring routers by external input and output channels.]

Figure 2. Generic MPC node architecture

Because messages may hold some channels while waiting for others, wormhole routing is particularly susceptible to deadlock. Typically, deadlock is avoided in the routing algorithm, which determines the path followed by a message in order to reach its destination(s). Routing can be classified as deterministic or adaptive. In deterministic routing, the path followed by a message is completely determined by the source and destination addresses. A routing technique is adaptive if, for a given source and destination, the path taken by a particular message depends on dynamic network conditions, such as the presence of faulty or congested channels. By accounting for current conditions, adaptive routing can be used to improve system performance [3, 20]. An adaptive routing algorithm is said to be minimal if the path selected is one of the shortest paths between the source and destination pair. Using a minimal routing algorithm, every channel visited will bring the message closer to the destination. A nonminimal routing algorithm allows messages to follow a longer path, usually in response to current network conditions.

Several related routing problems have been studied previously. For example, adaptive routing algorithms for unicast communication [20, 11, 8] and deterministic routing algorithms for multicast communication [18] have been proposed for wormhole-routed networks. In addition, adaptive multicast routing algorithms have been proposed for networks using store-and-forward switching [15, 16]. As will be explained in detail in the next section, none of these methods alone can be extended to provide deadlock-free adaptive multicast wormhole routing.

In this paper, three deadlock-free adaptive multicast routing algorithms for wormhole-routed 2D mesh networks are presented. Section 2 discusses the issues involved in designing such algorithms so that they are both adaptive and deadlock-free. A partially-adaptive minimal multicast routing algorithm is presented in Section 3; each message is routed deterministically to at most one destination. In Section 4, a fully-adaptive minimal multicast routing method is given that routes the message adaptively to all destinations at the expense of additional channels in the network. In Section 5, a nonminimal adaptive multicast algorithm is proposed that is based on a node-labeling assignment used in earlier deterministic multicast algorithms [18]. Section 6 describes variations of the algorithms presented in Sections 4 and 5 in which up to four independent worms may be used to instantiate a multicast operation. Comparisons of all the proposed adaptive routing algorithms in terms of several metrics are presented in Section 7. Section 8 contains concluding remarks.

2 Deadlock Problems in Adaptive Multicast Wormhole Routing

One of the most important issues in designing an adaptive routing algorithm is how to guarantee freedom from deadlock. A deadlock occurs when two or more messages are delayed forever due to a cyclic dependency among their requested resources. In wormhole-routed networks, the critical resources are channels. Since blocked messages are not buffered at intermediate nodes and, therefore, not removed from the network, one way to avoid deadlock is to guarantee that cyclic dependencies in channel usage cannot arise.

This strategy has been used in the design of numerous deadlock-free routing algorithms for wormhole-routed networks. For example, deadlock-free deterministic unicast communication can be implemented by simply assigning to each channel a unique number and allocating channels to messages in strictly ascending (alternatively, descending) order [5]. A channel numbering scheme often used in n-dimensional meshes is based on the dimension of channels. In such dimension-ordered routing, each message is routed in one dimension at a time, arriving at the proper coordinate in each dimension before proceeding to the next dimension. By enforcing a strictly monotonic order on the dimensions traversed, deadlock-free routing is guaranteed. Examples of dimension-ordered routing include XY routing for the 2D mesh and E-cube routing for the hypercube [23].

Avoiding cyclic dependencies has also been used to develop deadlock-free adaptive unicast routing algorithms. One such approach uses virtual channels multiplexed on each physical channel. Each virtual channel has its own flit buffer, control, and data path [2]. In the virtual network model [11], there are two virtual channels for each physical channel in a 2D mesh. The network is divided into four acyclic subnetworks used to reach nodes to the northeast, southeast, southwest, and northwest, respectively, of the source node. This method produces a fully-adaptive minimal deadlock-free routing algorithm. Actually, adding double channels in only one dimension of a 2D mesh is sufficient to produce such an algorithm [20]. Providing fully-adaptive minimal deadlock-free routing algorithms for the hypercube, 2D-torus, and more general k-ary n-cube topologies requires more additional channels [20]. Nonminimal adaptive routing algorithms based on the use of additional channels have also been proposed [3].

Recently, another approach to adaptive unicast wormhole routing has been proposed which does not require additional channels. The turn model [8] provides a systematic approach to the development of both minimal and nonminimal adaptive routing algorithms for a given network. The fundamental concept behind the turn model is to prohibit the smallest number of turns such that cyclic dependencies among channels are prevented. In fact, for a 2D mesh, only two turns need to be prohibited. Figure 3(c) shows the six turns allowed, suggesting the corresponding west-first routing algorithm: route a message first west, if necessary, and then adaptively south, east, and north [8]. Because cycles are avoided, west-first routing is deadlock-free.

Figure 3. An illustration of the turn model in a 2D mesh: (a) abstract cycles in a 2D mesh; (b) four turns (solid arrows) allowed in XY routing; (c) six turns (solid arrows) allowed in west-first routing.
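To make the contrast between XY routing and west-first routing concrete, the following Python sketch computes the single XY route and, for west-first routing, the set of next nodes a message may choose from. It is a simplified illustration under stated assumptions: only the minimal variant of west-first routing is modeled (the turn model also admits nonminimal paths), and the names `xy_route` and `west_first_minimal_hops` are hypothetical.

```python
def xy_route(src, dst):
    """Deterministic XY routing: correct x first, then y. Returns the nodes visited."""
    x, y = src
    path = []
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def west_first_minimal_hops(cur, dst):
    """Next nodes permitted by minimal west-first routing from cur toward dst."""
    x, y = cur
    if dst[0] < x:
        # All westward travel must be completed before any other direction.
        return [(x - 1, y)]
    hops = []
    if dst[0] > x:
        hops.append((x + 1, y))   # east
    if dst[1] > y:
        hops.append((x, y + 1))   # north
    if dst[1] < y:
        hops.append((x, y - 1))   # south
    return hops
```

When the destination lies to the west, the choice collapses to a single hop; otherwise up to two productive directions are available, which is where the adaptivity comes from.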

Before discussing adaptive multicast routing, a brief review of deterministic multicast is in order. Currently, most multicomputers support only unicast communication in hardware. In these environments, multicast must be implemented in software by sending multiple unicast messages. One method is to send a separate copy of the message from the source to every destination. Depending on the number of destinations, this separate addressing strategy may require excessive time because many systems allow a local processor to send only one message at a time. Although efficient algorithms to support multicast in software have been developed previously [21], performance can be further improved by implementing multicast communication in hardware. This paper concerns only multicast communication that is supported in hardware.

[Figure 4 diagram: a grid of nodes labeled (x, y) with $0 \le x \le 5$ and $0 \le y \le 5$. Legend: source node; destination node; channel selected by routing algorithm.]

Figure 4. Example of tree-based deterministic multicast routing in a $6 \times 6$ mesh

Hardware-supported wormhole multicast can be either tree-based or path-based. In tree-based routing, the destination set is partitioned at the source, and separate copies are sent on one or more outgoing channels. A message may be replicated at intermediate nodes and forwarded along multiple outgoing channels toward disjoint subsets of destinations. Figure 4 shows an example of tree-based deterministic multicast routing in a $6 \times 6$ mesh. As flits enter routers at branch points (the source (2,1) and nodes (1,1) and (3,1)), they are duplicated and forwarded on multiple outgoing links.

Unfortunately, tree-based routing, which is actually used to support a restricted form of multicast in the nCUBE-2 [22], suffers from several drawbacks in multicomputers that use wormhole routing. Since there is no message buffering at routers, if one branch of the tree is blocked, all are blocked. Branches must proceed forward in lock step, which may cause a message to hold many channels for extended periods, thereby increasing network contention. More importantly, it has been shown that tree-based routing is not deadlock-free in hypercubes or 2D meshes without using multiple channels per unidirectional channel [18].

Lin et al. [18] have developed a new approach to hardware-supported multicast, called path-based routing. A multicast path for a set of destinations consists of a set of consecutive channels, starting from the source node and traversing each destination in the set. Path-based multicasting may be implemented by sorting the destination addresses according to the order in which they are to be visited and placing the resulting list in the header of the message; each destination address occupies one or more flits of the message header. When the flit(s) containing the first destination address, $d_1$, arrives at that node's router, the address $d_1$ is removed from the message header and the subsequent flits are forwarded both to the local host and to destination $d_2$. Eventually, the data component of the message will arrive at all the destinations. Path-based routing is applicable to many topologies, including hypercubes and meshes. Most importantly, path-based routing is deadlock-free. Because multiple worms may proceed independently, path-based routing also avoids the branch-dependency problem of multicast trees. Due to their advantages over tree-based routing, only path-based approaches to adaptive multicast communication are considered in this paper.

Two important issues must be accounted for in developing a path-based adaptive multicast routing algorithm. First, as with deterministic multicast communication [18], the degenerate cases of unicast and broadcast must use the same algorithm in order to guarantee freedom from deadlock. That is, the multicast algorithm is the only routing algorithm used in the network, thereby offering a comprehensive routing solution. An important property of a multicast path algorithm is that a unicast message routed according to the algorithm should always follow a shortest path; this property holds for the deterministic multicast routing algorithms proposed in [18].

The second issue involves the ordering of destinations in the path. Because of the pipelining characteristic of wormhole routing, it is not sufficient to simply order the destinations randomly and perform deadlock-free adaptive unicast routing between each pair. Specifically, the destinations along the path will not buffer the entire message before it is forwarded to the next node.
Hence, the entire routing path, which is usually not a shortest path from a source to each of the destinations, must be considered. As an example, consider Figure 5, where the source node is (0,2) and the destinations (1,4), (3,2), and (4,1) occupy positions along the same routing path. Neither node (1,4) nor node (3,2) buffers the entire message before sending out the flits of the message. The routing path is ((0,2), (1,2), (1,3), (1,4), (2,4), (2,3), (2,2), (3,2), (3,1), (4,1)), which is clearly not a shortest path from the source node (0,2) to node (4,1). The problem addressed in this paper is how to order destinations in such a way as to allow adaptive routing between the source and the first destination and between successive pairs of destinations while avoiding deadlock.

[Figure 5 diagram: a grid of nodes labeled (x, y) with $0 \le x \le 5$ and $0 \le y \le 5$. Legend: source node; destination node; channel selected by routing algorithm; busy channel.]

Figure 5. An example of adaptive routing in a $6 \times 6$ mesh

In order to describe and compare the three algorithms presented in this paper, terms describing the adaptivity of multicast routing must be defined. A path-based multicast routing algorithm is minimal if it always follows a shortest path between each pair of nodes in the multicast path; otherwise, it is nonminimal. A multicast routing algorithm is defined to be fully-adaptive if the message can take any path between each pair of nodes in the multicast path, that is, from $s$ to $d_1$ and from $d_j$ to $d_{j+1}$, for $1 \le j \le k - 1$. A multicast algorithm is partially adaptive if it can route messages adaptively between only some pairs of nodes in the multicast path.

Finally, a distinction should be drawn between the adaptive multicast routing problem in wormhole-routed networks and the same problem in store-and-forward networks. In the latter, deadlock is usually avoided by providing adequate buffer space at routers or, in their absence, at local processors, which are necessarily required to forward messages. With the deadlock problem solved outside of the routing algorithm, adaptive or fault-tolerant multicast routing algorithms for store-and-forward networks [15, 16] are invariably based on multicast trees, and would risk deadlock in wormhole-routed networks. Further, in evaluating the performance of such algorithms, latency is assumed to be linear in the path length; therefore, such algorithms do not exploit the distance-insensitivity of wormhole routing.
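The definition of a minimal multicast path can be checked against the Figure 5 example with a short Python sketch. The helper names `manhattan` and `is_minimal_multicast_path` are hypothetical; the sketch assumes that the destinations are listed in visiting order and that the path visits each node at most once.

```python
def manhattan(a, b):
    """Length of a shortest path between two mesh nodes."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def is_minimal_multicast_path(path, destinations):
    """True if each leg (source to d1, d1 to d2, ...) of path is a shortest path."""
    # Every hop must join two adjacent nodes.
    if any(manhattan(u, v) != 1 for u, v in zip(path, path[1:])):
        return False
    stops = [0] + [path.index(d) for d in destinations]
    return all(j - i == manhattan(path[i], path[j])
               for i, j in zip(stops, stops[1:]))


# Routing path and destinations from the Figure 5 example.
path = [(0, 2), (1, 2), (1, 3), (1, 4), (2, 4),
        (2, 3), (2, 2), (3, 2), (3, 1), (4, 1)]
dests = [(1, 4), (3, 2), (4, 1)]
```

Each leg of this path is a shortest path, so the multicast path is minimal in the sense defined above, even though the path as a whole uses 9 channels while the last destination is only 5 hops from the source.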


3 Partially-Adaptive Minimal Multicast Routing

The first algorithm studied is a single-path partially-adaptive minimal routing algorithm, denoted 1-PM, that does not require the use of virtual channels. The algorithm first selects the westmost (leftmost) destination $d$; if several destinations share that smallest x-coordinate, the one with the greatest y-coordinate is chosen. If $d$ has a smaller x-coordinate than that of the source node $s$, then the message is routed from $s$ to $d$ deterministically using XY routing. Any destination nodes that lie along the path from $s$ to $d$ will be placed before $d$ in the message header according to their positions in the path, so that these destinations can receive the message as it is forwarded from $s$ to $d$. The message is then adaptively routed east (right) towards the remaining destination nodes, that is, the destination nodes are visited in ascending order according to their x-coordinates. At each step, if more than one node has the same lowest x-coordinate, the message will be sent first north (in increasing y-coordinate order) and then south (in decreasing y-coordinate order). It will be shown later that this routing strategy is deadlock-free.

Figure 6 gives the algorithm for constructing the message header at the source node. The distributed routing algorithm that is executed at each node, including the source node, is given in Figure 7; the algorithm is a general path-based multicast routing algorithm.

Figure 8 shows an example of a multicast path that may be created by the 1-PM routing algorithm. Consider a multicast with source (2,1) and destinations (0,2), (1,4), (3,2), (4,1), and (5,4). At the source node (2,1), the algorithm in Figure 6 is executed to order the destinations and construct the message header $MH$. In Step 3, $(x_p, y_p) = (0, 2)$, and Step 4 places this address into $MH$. In the first iteration of Step 5, $(x_f, y_f) = (1, 4)$, $y_c = 2$, and the sublist $H$ is equal to the single node (1,4), which is placed into $MH$ next.
The procedure continues for each of the remaining three destinations. After the execution of the algorithm, the message header $MH$ is complete with $MH$ = ((0,2), (1,4), (3,2), (4,1), (5,4)). Using the routing algorithm in Figure 7, the message will be sent to (0,2) deterministically; it can then be routed adaptively between each of (0,2) and (1,4), (1,4) and (3,2), (3,2) and (4,1), and (4,1) and (5,4). Note that if more than one destination has the same x-coordinate, the message will always be routed first north and then south. For example, if node (4,3) were also a destination in this example, it would be placed between (3,2) and (4,1) in the message header.
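The per-node decision just described (deterministic when the next destination lies to the west, adaptive otherwise) can be sketched in Python as follows. This is an illustrative reading of the routing rule, not the report's implementation: the function name `pm_route_step` and its return convention are assumptions, and the choice among several candidate next nodes (for example, avoiding a busy channel) is left to the caller.

```python
def pm_route_step(u, mh):
    """One routing decision at node u for the ordered destination list mh.

    Returns (deliver_locally, remaining_list, candidate_next_nodes).
    """
    x, y = u
    deliver = bool(mh) and mh[0] == u
    rest = list(mh[1:]) if deliver else list(mh)
    if not rest:
        return deliver, rest, []      # last destination: stop forwarding
    xd, yd = rest[0]
    if xd < x:
        # Next destination is to the west: only the west neighbor is allowed.
        return deliver, rest, [(x - 1, y)]
    # Otherwise any neighbor on a shortest path to the next destination.
    candidates = []
    if xd > x:
        candidates.append((x + 1, y))
    if yd > y:
        candidates.append((x, y + 1))
    if yd < y:
        candidates.append((x, y - 1))
    return deliver, rest, candidates
```

With the example header ((0,2), (1,4), (3,2), (4,1), (5,4)), the source (2,1) has a single choice (west), whereas node (0,2) delivers the message locally and may forward it either east or north toward (1,4).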


Algorithm: 1-PM Message Header Algorithm (1-PMH)

Input: Destination set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)\}$ and source address $u_0 = (x_0, y_0)$.

Output: Ordered list of destinations, $MH$, placed in the message header.

Procedure:

1. Assign a label $((x + 1) \times n) - y - 1$ to each address $(x, y)$ in $D$.
2. Sort the destinations in increasing order using their labels as keys; call the sorted list $S$.
3. Let $d_p = (x_p, y_p)$ be the first destination in the list.
4. Find those addresses of destinations that lie on the X-first, Y-next path from the source $u_0$ to destination $d_p$. Remove those addresses from $S$ and place them in $MH$ in the order visited on that path, that is, with $d_p$ last.
5. While $S$ is not empty do the following:
   (a) If $MH$ is empty, then set $y_c = y_0$; otherwise, set $y_c = y_r$, where $(x_r, y_r)$ is the destination address most recently placed in $MH$.
   (b) Let $d_f = (x_f, y_f)$ be the first address in $S$.
   (c) Let $H$ be the sublist of addresses $d_h = (x_h, y_h)$ in $S$, beginning with $d_f$, such that $x_h = x_f$ and $y_h \ge y_c$. ($H$ is a possibly empty sublist at the front of $S$.)
   (d) If $H$ is not empty, reverse the order of the addresses in $H$, place them in $MH$, and remove them from $S$.
   (e) Let $L$ be the list of addresses $d_l = (x_l, y_l)$ in $S$, beginning with $d_f$, such that $x_l = x_f$. Necessarily, $y_l \le y_c$. ($L$ is a possibly empty sublist at the front of $S$.)
   (f) If $L$ is not empty, place the addresses of $L$, in order, into $MH$ and remove them from $S$.
6. Place $MH$ in the message header.

Figure 6. Message header construction for 1-PM routing.
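The header construction procedure of Figure 6 can be sketched in Python as shown below. This is one possible reading of the steps, not the authors' code: the names `xy_path` and `build_header_1pm` are hypothetical, the source is assumed not to be a destination, and the parameter `n` is assumed to be at least the number of rows so that the labels of Step 1 are unique.

```python
def xy_path(src, dst):
    """Nodes visited on the X-first, Y-next path from src to dst (src excluded)."""
    x, y = src
    path = []
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def build_header_1pm(dests, source, n):
    """Order the destinations as in the 1-PMH procedure; returns the list MH."""
    label = lambda d: (d[0] + 1) * n - d[1] - 1           # Step 1
    s = sorted(set(dests), key=label)                      # Step 2
    mh = []
    if not s:
        return mh
    dp = s[0]                                              # Step 3
    on_path = [v for v in xy_path(source, dp) if v in s]   # Step 4 (dp is last)
    mh.extend(on_path)
    s = [d for d in s if d not in on_path]
    while s:                                               # Step 5
        yc = mh[-1][1] if mh else source[1]                # (a)
        xf = s[0][0]                                       # (b)
        column = [d for d in s if d[0] == xf]              # front of s, y decreasing
        h = [d for d in column if d[1] >= yc]              # (c)
        l = [d for d in column if d[1] < yc]               # (e)
        mh.extend(reversed(h))                             # (d) visit northward
        mh.extend(l)                                       # (f) then southward
        s = s[len(column):]
    return mh                                              # Step 6
```

For the example with source (2,1), calling `build_header_1pm` with the five destinations and `n = 6` reproduces the header given in the text, and adding (4,3) places it between (3,2) and (4,1).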

Next, the deadlock-free property of the 1-PM algorithm is discussed. As mentioned earlier, avoiding deadlock in routing algorithms can be accomplished by ordering network resources and requiring that messages request and use these resources in strictly monotonic order. In this manner, circular wait, a necessary condition for deadlock, cannot occur. In wormhole-routed networks, a channel dependence graph (CDG) has been used to develop deadlock-free routing algorithms [5].

Algorithm: PM Algorithm for Message Routing

Input: A message with ordered destination list $MH = (d_1, \ldots, d_k)$, and a local address $u = (x, y)$.

Procedure:

1. If $u = d_1$, then $MH' = MH - \{d_1\}$ and the message is sent to the local host; otherwise, $MH' = MH$.
2. If $MH' = \emptyset$, then terminate the message forwarding, but continue to deliver the remaining flits of the message to the local host (the last destination).
3. Let $d$ be the first node in $MH'$, $d = (x_d, y_d)$.
   (a) If $x_d \ge x$, select any channel $(u, u')$ ($u'$ is a neighboring node of $u$) that is along any one of the shortest paths from $u$ to $d$;
   (b) If $x_d < x$, select the neighboring node $u'$, $u' = (x - 1, y)$.
4. The message is forwarded to node $u'$ with address destination list $MH'$ in its header.

Figure 7. PM algorithm for message routing.

[Figure 8 diagram: a grid of nodes labeled (x, y) with $0 \le x \le 5$ and $0 \le y \le 5$. Legend: source node; destination node; channel selected by adaptive routing; channel selected by deterministic routing.]

Figure 8. An example of 1-PM routing in a $6 \times 6$ mesh.

The CDG for a directed network and a routing algorithm is a directed graph $G(V, E)$, where the vertex set $V(G)$ corresponds to all the unidirectional channels in the network, and the edge set $E(G)$ to all the pairs of connected channels, as defined by the routing algorithm. A path from one vertex to another in the CDG indicates a channel dependency from the first channel to the second, that is, a message may hold the first channel while waiting for the second. A deterministic routing algorithm is deadlock-free if and only if its CDG is acyclic. However, the CDG for an adaptive routing algorithm may contain cycles, even though the routing algorithm is deadlock-free. Therefore, the CDG alone is of limited use in the development of deadlock-free adaptive routing algorithms. Alternative methods have been developed to prove that certain adaptive unicast routing algorithms are deadlock-free even though their CDGs contain cycles [7, 1].

A new method, the message flow model [19], is used to prove that the deadlock-free property holds for 1-PM routing and the algorithms described in later sections. It is assumed that every destination node can consume any incoming message. For a given routing algorithm, a channel $(u, v)$ is deadlock-immune if and only if, for any message arriving at $v$ from channel $(u, v)$, the last flit of the message can eventually be sent out towards its destination(s), thus releasing the channel $(u, v)$. A routing algorithm is deadlock-free if and only if every channel in the network is deadlock-immune when using the algorithm.

In order to show that all channels are deadlock-immune when using the 1-PM algorithm, the channels in an $m \times n$ mesh are first partitioned into $4m - 2$ disjoint sets according to their directions as follows:

$N_i = \{((i, j), (i, j + 1)) \mid 0 \le j < n - 1\}$, for $0 \le i < m$,
$S_i = \{((i, j), (i, j - 1)) \mid 0 < j < n\}$, for $0 \le i < m$,
$W_i = \{((i, j), (i - 1, j)) \mid 0 \le j < n\}$, for $0 < i < m$,
$E_i = \{((i, j), (i + 1, j)) \mid 0 \le j < n\}$, for $0 \le i < m - 1$.

Figure 9 shows these subsets for a $3 \times 3$ mesh. The following lemmas are required to show the deadlock-free property of the 1-PM routing algorithm.
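As a quick sanity check of the partition just defined, the following Python sketch builds the sets $N_i$, $S_i$, $W_i$, and $E_i$ and confirms that there are $4m - 2$ of them and that they are disjoint. The function name `partition_channels` and the string keys such as `"N0"` are hypothetical conveniences; as in the definitions, `i` indexes columns and `j` indexes rows.

```python
def partition_channels(m, n):
    """Channel sets N_i, S_i, W_i, E_i for columns 0..m-1 and rows 0..n-1."""
    sets = {}
    for i in range(m):
        sets[f"N{i}"] = {((i, j), (i, j + 1)) for j in range(n - 1)}
        sets[f"S{i}"] = {((i, j), (i, j - 1)) for j in range(1, n)}
    for i in range(1, m):
        sets[f"W{i}"] = {((i, j), (i - 1, j)) for j in range(n)}
    for i in range(m - 1):
        sets[f"E{i}"] = {((i, j), (i + 1, j)) for j in range(n)}
    return sets


sets_3x3 = partition_channels(3, 3)
all_channels = set().union(*sets_3x3.values())
```

For the $3 \times 3$ mesh of Figure 9 this gives 10 sets covering 24 unidirectional channels, with no channel appearing in two sets.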

Lemma 1 Let $N_i \cup S_j$ denote the union of the two channel sets $N_i$ and $S_j$. Under 1-PM routing, there does not exist a channel dependency from any channel in $N_i \cup S_i$ to any channel in $W_i$, for $0 < i < m$.

Proof: The 1-PM algorithm routes the message first to a west-most destination node (a node with smallest x-coordinate), and then sends the message from west to east, that is, in increasing order of the x-coordinates. According to the 1-PM algorithm, the channels must be used in the following order: $W_{m-1}, W_{m-2}, \ldots, W_1, N_0 \cup S_0, E_0, N_1 \cup S_1, E_1, \ldots, E_{m-2}, N_{m-1} \cup S_{m-1}$, that is, zero or one channel from set $W_{m-1}$, followed by zero or one channel from set $W_{m-2}$, $\ldots$, followed by zero or one channel from set $W_1$, followed by zero or more channels from set $N_0 \cup S_0$, $\ldots$, and so on.

[Figure 9 diagram: four copies of a $3 \times 3$ grid of nodes labeled (x, y), showing the channel sets $N_0$, $N_1$, $N_2$; $S_0$, $S_1$, $S_2$; $W_1$, $W_2$; and $E_0$, $E_1$.]

Figure 9. The channel set partitioning in a $3 \times 3$ mesh.

Since a message cannot hold a channel in $N_i \cup S_i$ while waiting for a channel in $W_i$, it follows that no channel dependency can exist from any channel in $N_i \cup S_i$ to any channel in $W_i$, for $0 < i < m$. □
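The channel ordering used in the proof of Lemma 1 can be expressed as a rank function and checked on a concrete path. The Python sketch below is only an illustration: `channel_rank` and `uses_channels_in_order` are hypothetical names, and the sample path is one of several that 1-PM routing could produce for the running example (source (2,1), destinations (0,2), (1,4), (3,2), (4,1), (5,4)) in a mesh with $m = 6$ columns.

```python
def channel_rank(channel, m):
    """Position of a channel's set in the order W_{m-1},...,W_1, N_0 u S_0, E_0, ..."""
    (x1, y1), (x2, y2) = channel
    if x2 == x1 - 1:                  # channel in W_i with i = x1
        return m - 1 - x1
    if x2 == x1 + 1:                  # channel in E_i with i = x1
        return (m - 1) + 2 * x1 + 1
    return (m - 1) + 2 * x1           # channel in N_i or S_i with i = x1


def uses_channels_in_order(path, m):
    """True if the channels along path never go back to an earlier set."""
    ranks = [channel_rank(c, m) for c in zip(path, path[1:])]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


sample_path = [(2, 1), (1, 1), (0, 1), (0, 2), (0, 3), (1, 3), (1, 4), (2, 4),
               (2, 3), (3, 3), (3, 2), (3, 1), (4, 1), (4, 2), (5, 2), (5, 3), (5, 4)]
```

The sample path respects the ordering, whereas a path that travels north in column 1 and then turns west, such as (1,1), (1,2), (0,2), does not; this is exactly the dependency from $N_1 \cup S_1$ to $W_1$ that the lemma rules out.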

Lemma 2 If a message requires the use of one or more channels in $N_i$ and one or more channels in $S_i$ in order to reach destinations in column $i$, it must first use channels in $N_i$ and then those in $S_i$.

Proof: By the definition of the 1-PM algorithm, the destinations in the message header are arranged in such a way that, if there are two or more destinations with the same x-coordinate, the message will be sent first north and then south to the destinations. □

Theorem 1 The PM routing algorithm is deadlock-free.

Proof: It is first proved by induction that every channel in $N_i \cup S_i$, $m > i \ge 0$, as well as every channel in $E_i$, $m - 1 > i \ge 0$, is deadlock-immune. It is then shown that every channel in $W_i$, $m > i > 0$, is also deadlock-immune.

The induction variable is $i$. When $i = 1$, $N_{m-i} \cup S_{m-i}$ contains the channels connecting the nodes in column $m - 1$. By Lemma 1, there is no channel dependency from any channel in $N_{m-1} \cup S_{m-1}$ to any channel in $W_{m-1}$, so after a message arrives at a node in column $m - 1$ from a channel in $E_{m-2}$, all of the remaining destinations in the message header for a multicast must also be located in column $m - 1$. By Lemma 2, there is no channel dependency from any channel in $S_{m-1}$ to any channel in $N_{m-1}$, thus the message can eventually be sent to the remaining destinations after it has arrived at any node in column $m - 1$. Hence, all the channels in $N_{m-1} \cup S_{m-1} \cup E_{m-2}$ are deadlock-immune.

Suppose that the assumption is true for $i = p$, $p \ge 1$, and consider $i = p + 1$. By Lemma 1, there is no channel dependency from any channel in $N_{m-(p+1)} \cup S_{m-(p+1)}$ to any channel in $W_{m-(p+1)}$, and by induction, the channels in $E_{m-(p+1)}$ are deadlock-immune. If there were a channel in $N_{m-(p+1)} \cup S_{m-(p+1)}$ involving a deadlock, then only among those channels in $N_{m-(p+1)} \cup S_{m-(p+1)}$ could a cyclic channel dependency be formed that could not be broken. By Lemma 2, if a message first uses one or more channels in $S_{m-(p+1)}$, then requests a channel in $N_{m-(p+1)}$, the first destination in the message header must be a node in some column $\ell$, $\ell \ge m - p$. Because adaptive routing is used, such a dependency can always be removed by selecting an alternative channel in $E_{m-(p+1)}$, which is deadlock-immune by the induction assumption. Therefore, no unbreakable cyclic dependency can be formed among the channels in $N_{m-(p+1)} \cup S_{m-(p+1)}$, and all the channels in $N_{m-(p+1)} \cup S_{m-(p+1)} \cup E_{m-(p+2)}$ are deadlock-immune.

Finally, by Lemma 1, if a message requires channels in $W_i$, for $0 < i < m$, such channels can be used only before using any channels in $N_p \cup S_p$, for $0 \le p < m$, and before any channels in $E_p$, for $0 \le p < m - 1$. It is easy to see that the channels in $W_1$ are deadlock-immune, as is also true for the channels in $W_2$, $W_3$, and so on. Because every channel in the mesh network is deadlock-immune under the 1-PM routing, the routing algorithm is deadlock-free. □


Given a multicast with $k$ destination nodes, Step 1 of the 1-PMH algorithm in Figure 6 takes at most $O(k)$ time. Step 2 can be executed in $O(k \log k)$ time, since it involves simply sorting the destinations based on their labels. Steps 3 and 4 together take at most $O(k)$ time, as does the while loop in Step 5, since it processes each destination a constant number of times. Thus, the time complexity of the 1-PMH algorithm is $O(k \log k)$. Such an algorithm may need to be executed only once for a given set of destinations; in fact, it may be possible to order the destinations at compile time. The routing algorithm in Figure 7 requires $O(1)$ time, since each node in a 2D mesh has at most four outgoing channels to choose from.

The 1-PM algorithm sends a message using a single multicast path. Although the route from the source to the last destination node may be relatively long, this strategy is particularly well-suited for one-port architectures, such as the Intel Paragon [9]. If more than one worm were used to implement a multicast operation in a one-port system, then the worms would have to be transmitted sequentially from the source node, reducing the benefits of pipelining provided by wormhole routing. However, if the architecture offers multiple internal channels, then the use of more than one worm is effective in reducing the lengths of the constituent multicast paths and the total number of channels required to implement a particular multicast operation. In this approach, the destination set is partitioned into subsets. A separate worm is transmitted for each subset, following a multicast path that visits each destination node in the subset.

Figure 10 gives the message header construction algorithm for the 2-PM algorithm, a partially-adaptive multicast routing algorithm that uses up to two worms; the worms can proceed in parallel as long as the source node has at least two internal channels.
In this algorithm, the set of destination nodes is divided into two subsets, and copies of the message are routed independently along two multicast paths. The 2-PM message routing algorithm itself, which is executed at each intermediate node, is identical to that of the PM algorithm given in Figure 7. Figure 11 shows an example of two multicast paths that may be created by the above algorithm. At the source node (2,1), the set of destinations $\{(0,2), (1,4), (3,2), (4,1), (5,4)\}$ is divided into two subsets: $\{(0,2), (1,4)\}$ and $\{(3,2), (4,1), (5,4)\}$. In the first path, the message is routed deterministically from the source node (2,1) to node (0,2) and then adaptively from (0,2) to (1,4).


Algorithm: 2-PM Message Header Algorithm (2-PMH)
Input: Destination set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)\}$ and local address $u_0 = (x_0, y_0)$.
Output: Two ordered lists of destination nodes, $MH_W$ and $MH_E$, placed in the message headers.
Procedure:
1. Divide $D$ into $D_W$ and $D_E$, where $D_W = \{(x, y) \mid x < x_0, \text{ or } x = x_0 \text{ and } y < y_0\}$ and $D_E = D - D_W = \{(x, y) \mid x > x_0, \text{ or } x = x_0 \text{ and } y > y_0\}$.
2. Call 1-PMH($D_W$, $u_0$) and 1-PMH($D_E$, $u_0$) (see Figure 6) to produce $MH_W$ and $MH_E$, respectively.
3. Place $MH_W$ in one message header and $MH_E$ in the other message header.

Figure 10. The 2-PM algorithm for constructing the message header.
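Step 1 of the 2-PMH algorithm can be illustrated with a short Python sketch. The function name `split_west_east` and the list-of-tuples representation are assumptions made for this illustration; the ordering performed by 1-PMH (Figure 6) is not reproduced here.

```python
def split_west_east(dests, source):
    """Split a destination set into D_W and D_E relative to the source.

    D_W holds nodes with x < x0, or x == x0 and y < y0;
    D_E holds nodes with x > x0, or x == x0 and y > y0.
    """
    x0, y0 = source
    d_w = [(x, y) for (x, y) in dests if x < x0 or (x == x0 and y < y0)]
    d_e = [(x, y) for (x, y) in dests if x > x0 or (x == x0 and y > y0)]
    return d_w, d_e


# Destination set and source taken from the example of Figure 11.
dests = [(0, 2), (1, 4), (3, 2), (4, 1), (5, 4)]
d_w, d_e = split_west_east(dests, (2, 1))
print(d_w)  # west subset
print(d_e)  # east subset
```

With the source at (2,1), the two lists agree with the subsets given in the text for Figure 11.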

Figure 11. An example of the 2-PM routing algorithm in a 6 × 6 mesh. [Diagram: nodes (0,0) through (5,5), with the source node, the destination nodes, and the channels selected by deterministic and by adaptive routing marked.]


In the second path, the message is forwarded from (2,1) to (3,2), from (3,2) to (4,1), and from (4,1) to (5,4), all adaptively. The following corollary follows directly from Theorem 1.

Corollary 1 The 2-PM algorithm is deadlock-free.

4 Fully-Adaptive Minimal Multicast Routing

In the 1-PM algorithm, a message is routed deterministically to one west-most destination node; destinations along that path are also reached deterministically. For unicast communication, in which there is a single destination, this property implies that the message has to be delivered deterministically to any destination node that lies west of the source node. Hence, only about half of the unicast messages can be routed adaptively. It can be easily shown that, without introducing virtual channels, it is impossible to implement fully-adaptive deadlock-free routing even for unicast communication [23]. In order to support fully-adaptive multicast routing, the technique used here is to double the vertical channels. Suppose that one set of the vertical channels is identified as $C_{V1}$ and the other as $C_{V2}$. The channels of the network are further partitioned into two disjoint sets $C_W$ and $C_E$ as follows:

$$C_W = \{c \mid c \text{ is a horizontal channel from east to west, or } c \in C_{V1}\}$$
$$C_E = \{c \mid c \text{ is a horizontal channel from west to east, or } c \in C_{V2}\}$$

Like the 2-PM algorithm, the proposed fully-adaptive minimal routing algorithm, 2-FM, is a dual-path multicast algorithm. The destination set is divided into two subsets, $MH_W$ and $MH_E$. The set $MH_W$ contains the destination nodes that are to the west of the source, and $MH_E$ the nodes to the east of the source. The message will be delivered to $MH_W$ using the channels in $C_W$ and to $MH_E$ using the channels in $C_E$. The multicast routing for $MH_E$ is the same as that of the PM routing algorithm described in the previous section; since there is no destination node to the west of the source node in $MH_E$, the message can be forwarded fully-adaptively. The routing in $C_W$ is similar to the routing in $C_E$ but in the opposite direction.

Figure 12 gives the 2-FM algorithm for constructing the message header. The algorithm first divides the set of destination addresses into two subsets, $D_W$ and $D_E$. The addresses in $D_E$ are ordered in exactly the same way as in the 1-PMH algorithm in Figure 6. The nodes in $D_W$ are ordered from east to west, that is, in descending order of their x-coordinates. Two independent multicast paths result, each of which uses a different set of channels to deliver the message to one of the subsets of the destinations. Figure 13 describes the FM routing algorithm for a 2D mesh with double vertical channels. The FM routing algorithm is similar to the one in Figure 7 except that deterministic routing is not required.

Algorithm: 2-FM Message Header Algorithm (2-FMH)
Input: Destination set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)\}$ and source address $u_0 = (x_0, y_0)$.
Output: Two ordered lists of destination nodes, $MH_E$ and $MH_W$, placed in the message headers.
Procedure:
1. Divide $D$ into $D_W$ and $D_E$ such that $D_W = \{(x, y) \mid x < x_0, \text{ or } x = x_0 \text{ and } y < y_0\}$ and $D_E = D - D_W$.
2. Call 1-PMH($D_E$, $u_0$) (see Figure 6), which returns $MH_E$.
3. Assign a label $(x \times n) + y$ to each address $(x, y)$ in $D_W$.
4. Sort the destinations in $D_W$ in decreasing order using their labels as keys; call the sorted list $S_W$.
5. While $S_W$ is not empty, do the following:
   (a) If $MH_W$ is empty, then set $y_c = y_0$; otherwise, set $y_c = y_r$, where $(x_r, y_r)$ is the destination address most recently placed in $MH_W$.
   (b) Let $d_f = (x_f, y_f)$ be the first address in $S_W$.
   (c) Let $H$ be the sublist of addresses $d_h = (x_h, y_h)$ in $S_W$, beginning with $d_f$, such that $x_h = x_f$ and $y_h \ge y_c$. ($H$ is a possibly empty sublist at the front of $S_W$.)
   (d) If $H$ is not empty, reverse the order of the addresses in $H$, place them in $MH_W$, and remove them from $S_W$.
   (e) Let $L$ be the list of addresses $d_l = (x_l, y_l)$ remaining at the front of $S_W$ such that $x_l = x_f$; necessarily, $y_l \le y_c$. ($L$ is a possibly empty sublist at the front of $S_W$.)
   (f) If $L$ is not empty, place the addresses of $L$ into $MH_W$ and remove them from $S_W$.
6. Construct two messages, one with message header $MH_E$ and the other with message header $MH_W$.

Figure 12. 2-FM algorithm for constructing the message header.
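Steps 3 through 5 of the 2-FMH algorithm can be sketched in Python as follows. This is an illustrative reading of Figure 12, not the authors' implementation: the function name `order_west_destinations` is hypothetical, and the sketch assumes that within a column the destinations at or above the current row are visited first (nearest first), followed by those below it.

```python
def order_west_destinations(d_w, source, n):
    """Order the west subset D_W into the header list MH_W.

    d_w    -- list of (x, y) destinations lying west of the source
    source -- (x0, y0)
    n      -- number of rows, so that the label of (x, y) is x * n + y
    """
    _, y0 = source
    # Steps 3-4: label each address and sort in decreasing label order.
    s_w = sorted(d_w, key=lambda d: d[0] * n + d[1], reverse=True)
    mh_w = []
    while s_w:
        y_c = mh_w[-1][1] if mh_w else y0           # Step 5(a)
        x_f = s_w[0][0]                             # Step 5(b)
        # All addresses in column x_f form a prefix of the sorted list.
        column = [d for d in s_w if d[0] == x_f]
        high = [d for d in column if d[1] >= y_c]   # Step 5(c): sublist H
        low = [d for d in column if d[1] < y_c]     # Step 5(e): sublist L
        mh_w.extend(reversed(high))                 # Step 5(d)
        mh_w.extend(low)                            # Step 5(f)
        s_w = s_w[len(column):]
    return mh_w


# West subset of the example in Figure 14 (source (2,1), 6 x 6 mesh).
print(order_west_destinations([(0, 2), (1, 4)], (2, 1), 6))
```

Reversing the sublist H means the path first climbs through the destinations above the current row and only then descends, so each column is traversed with at most one change of vertical direction.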

Figure 14 shows an example of the multicast paths that may be created by the 2-FM routing algorithm. At the source node (2,1), the 2-FMH algorithm first divides the set of the destinations into two subsets: $\{(0,2), (1,4)\}$ and $\{(3,2), (4,1), (5,4)\}$. One copy of the message is forwarded to (0,2) and (1,4) adaptively using channels in $C_W$; another copy of the same message is sent adaptively to (3,2), (4,1) and (5,4), using channels in $C_E$.

Algorithm: FM Algorithm for Message Routing
Input: A message with ordered destination list $MH = (d_1, \ldots, d_k)$, where $MH$ is $MH_W$ or $MH_E$, and a local address $u = (x, y)$.
Procedure:
1. If $u = d_1$, then set $MH' = MH - \{d_1\}$ and deliver the message to the local node; otherwise, set $MH' = MH$.
2. If $MH' = \emptyset$, then terminate the message forwarding, but continue to deliver the remaining flits of the message to the local host.
3. Let $d$ be the first node in $MH'$. Select any channel $(u, u')$ in $C_W$ if the header is $MH_W$ (or in $C_E$ if the header is $MH_E$) that lies on one of the shortest paths from $u$ to $d$.
4. Send the message to node $u'$ with destination address list $MH'$ in its header.

Figure 13. FM algorithm for message routing.

Figure 14. An example of 2-FM routing algorithm in a 6 × 6 mesh. [Diagram: nodes (0,0) through (5,5), with the source node, the destination nodes, and the channels selected adaptively by the routing algorithm marked.]

Theorem 2 The FM routing algorithm is deadlock-free.

Proof: The message routing in the subnetwork with channels in $C_E$ is the same as the message routing in the PM algorithm. Hence, the message routing is from west to east, is fully-adaptive, and is deadlock-free according to Theorem 1. For a message routed in the other subnetwork, with channels in $C_W$, the proof is similar to the proof of Theorem 1; however, the induction starts from the west-most column. $\Box$

As with the 1-PMH algorithm, the 2-FMH algorithm in Figure 12 requires $O(k \log k)$ time for a multicast with $k$ destination nodes, but again may be executed once at compile time. Step 1 requires $O(k)$ time, and Step 2 calls the 1-PMH algorithm, which takes $O(|D_E| \log |D_E|)$ time, $|D_E| \le k$. Step 3 requires $O(k)$ time and Step 4, a sort, requires $O(|D_W| \log |D_W|)$ time, $|D_W| \le k$. The while loop in Step 5 requires $O(|D_W|)$ time, since each destination is processed a constant number of times. Step 6 requires $O(k)$ time. Thus, the time complexity of the 2-FMH algorithm is $O(k \log k)$. The routing algorithm in Figure 13 requires $O(1)$ time, since at most two candidate channels need to be examined at each node.
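The constant-time channel selection in Step 3 of the FM routing algorithm (Figure 13) can be sketched as below. The function name `fm_candidate_hops`, the string tags for the channel classes, and the `"W"`/`"E"` header flag are assumptions introduced for this sketch only.

```python
def fm_candidate_hops(u, d, header):
    """List the (next_node, channel_class) pairs on shortest paths from u to d.

    header is "W" for a worm carrying MH_W (uses C_W: westward horizontal
    channels plus the vertical set C_V1) or "E" for MH_E (uses C_E:
    eastward horizontal channels plus the vertical set C_V2).
    """
    (x, y), (xd, yd) = u, d
    x_step = -1 if header == "W" else 1
    vertical_class = "C_V1" if header == "W" else "C_V2"
    hops = []
    # A horizontal hop is useful only if d lies further in the worm's direction.
    if (xd - x) * x_step > 0:
        hops.append(((x + x_step, y), "horizontal"))
    # A vertical hop is useful whenever the rows differ.
    if yd != y:
        y_step = 1 if yd > y else -1
        hops.append(((x, y + y_step), vertical_class))
    return hops


print(fm_candidate_hops((2, 1), (3, 2), "E"))  # east worm leaving the source
print(fm_candidate_hops((2, 1), (1, 4), "W"))  # west worm leaving the source
```

At most two candidates are ever returned, which is consistent with the $O(1)$ routing cost stated above; an actual router would pick among them based on which output channels are currently free.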

5 Nonminimal Adaptive Multicast Routing

In nonminimal adaptive routing, a message can be derouted, that is, routed along a non-shortest path, in the presence of a blocked channel. As shown in [13], nonminimal adaptive routing has the potential to provide lower latency and higher throughput than minimal adaptive routing, especially under non-uniform traffic. In addition to handling the deadlock problem, a nonminimal multicast algorithm must also avoid livelock, in which a message fails to be delivered to its destinations due to its repeated derouting.

The solution presented here is to modify a deterministic dual-path method [18] so as to support nonminimal adaptive routing. The method is based on a node-labeling scheme. First, a Hamiltonian path in the 2D mesh is selected and integer numbers are assigned to the nodes according to their positions in the path. The 2D mesh is next divided into two subnetworks, one called the high-channel network, containing all of the channels from lower labeled nodes to higher labeled nodes; the other called the low-channel network, containing all of the channels from higher labeled nodes to lower labeled nodes. Figure 17 shows a label assignment for the nodes in a 6 × 6 mesh. The label assignment function $\ell$ for an $m \times n$ mesh can be expressed as

$$\ell(x, y) = \begin{cases} x \cdot n + y + 1 & \text{if } x \text{ is even} \\ x \cdot n + n - y & \text{if } x \text{ is odd} \end{cases}$$

The destination node set is divided into two subsets, one containing destination nodes with labels higher than that of the source node, the other containing destinations with labels lower than that of the source node. The message is delivered to these two sets of the destinations using the high-channel and low-channel subnetworks, respectively. Figure 15 and Figure 16, respectively, give the message header construction algorithm and the routing algorithm for the label-based, dual-path approach, collectively called the 2-LD routing algorithm. The algorithm in Figure 15 divides the destinations into the two subsets described above, and sorts the two subsets in ascending and descending order, respectively, using their labels as keys. The routing algorithm in Figure 16 is a general, path-like routing algorithm [18], which uses different sets of the channels for different subsets of destinations.
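The label assignment can be checked with a few lines of Python. The function name `label` and the small self-check are illustrative assumptions; the formula itself is the one given above, and the sample values are those quoted for Figure 17.

```python
def label(x, y, n):
    """Label of node (x, y) in an m x n mesh (n nodes per column)."""
    if x % 2 == 0:
        return x * n + y + 1      # even columns are numbered bottom to top
    return x * n + n - y          # odd columns are numbered top to bottom


# Labels quoted in the text for the 6 x 6 mesh of Figure 17.
for node in [(2, 1), (0, 2), (1, 4), (3, 2), (4, 1), (5, 4)]:
    print(node, label(*node, 6))

# Sanity check: consecutive labels belong to neighboring nodes, so the
# labeling does trace a Hamiltonian path through the mesh.
by_label = {label(x, y, 6): (x, y) for x in range(6) for y in range(6)}
is_path = all(
    abs(by_label[i][0] - by_label[i + 1][0]) + abs(by_label[i][1] - by_label[i + 1][1]) == 1
    for i in range(1, 36)
)
print(is_path)
```

The boustrophedon numbering is what makes both the high-channel and low-channel subnetworks acyclic: every channel strictly increases or strictly decreases the label.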

Algorithm: LD Message Header Algorithm
Input: Destination set $D$, local address $u_0$, and node label assignment function $\ell$.
Output: Two sorted lists of addresses, $D_H$ and $D_L$, placed in the message header.
Procedure:
1. Divide $D$ into two sets $D_H$ and $D_L$ such that $D_H$ contains all the destination nodes with $\ell$ values higher than $\ell(u_0)$, and $D_L$ contains the nodes with $\ell$ values lower than $\ell(u_0)$.
2. Sort the destination nodes in $D_H$ in ascending order and those in $D_L$ in descending order, using the $\ell$ value as the key.
3. Construct two messages, one containing $D_H$ as part of the header and the other containing $D_L$ as part of the header.

Figure 15. 2-LD algorithm for constructing the message header.
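The header construction of Figure 15 amounts to a partition followed by two sorts, as the following Python sketch shows. The names `label` and `ld_headers` are hypothetical, and the label function is repeated so that the block runs on its own.

```python
def label(x, y, n):
    """Hamiltonian-path label of node (x, y) in a mesh with n nodes per column."""
    return x * n + y + 1 if x % 2 == 0 else x * n + n - y


def ld_headers(dests, source, n):
    """Return (D_H, D_L): high-label destinations in ascending label order,
    low-label destinations in descending label order."""
    src = label(*source, n)
    key = lambda d: label(*d, n)
    d_h = sorted((d for d in dests if key(d) > src), key=key)
    d_l = sorted((d for d in dests if key(d) < src), key=key, reverse=True)
    return d_h, d_l


# Source and destinations of the example in Figure 17.
d_h, d_l = ld_headers([(0, 2), (1, 4), (3, 2), (4, 1), (5, 4)], (2, 1), 6)
print(d_h)  # visited through the high-channel network
print(d_l)  # visited through the low-channel network
```

Sorting each subset so that labels move monotonically away from the source label is what allows a single worm to visit all of its destinations without ever reversing direction along the Hamiltonian path.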

Algorithm: LD Algorithm for Message Routing
Input: A message with sorted destination list $MH = (d_1, \ldots, d_k)$, a local address $u$, and node label assignment function $\ell$.
Procedure:
1. If $u = d_1$, then $MH' = MH - \{d_1\}$ and the message is sent to the local node; otherwise, $MH' = MH$.
2. If $MH' = \emptyset$, then terminate the message forwarding, but continue to deliver the remainder of the message to the local host.
3. Let $d$ be the first node in $MH'$. Select any channel $(u, u')$ such that $\ell(u) < \ell(u') \le \ell(d)$ (in the low-channel network, such that $\ell(d) \le \ell(u') < \ell(u)$).
4. The message is sent to node $u'$ with destination address list $MH'$ in its header.

Figure 16. LD algorithm for message routing.

Figure 17. An example of 2-LD routing algorithm in a 6 × 6 mesh. [Diagram: each node (x, y) annotated with its label, from 1 at (0,0) to 36 at (5,0), with the source node, the destination nodes, and the channels selected by the routing algorithm marked.]

Figure 17 shows an example of the two multicast paths that may be created by the 2-LD algorithm. The source node (2,1) has label $\ell(2,1) = 14$; the destination nodes (0,2), (1,4), (3,2), (4,1), (5,4) have the labels 3, 8, 22, 26, and 32, respectively. At the source node, the destination node set is divided into two subsets: $\{(0,2), (1,4)\}$, whose labels are lower than that of the source node, and $\{(3,2), (4,1), (5,4)\}$, whose labels are higher than that of the source node. The message will be delivered to the two sets of destinations using the low-channel network and high-channel network, respectively. For example, in the high-channel network, the message can be forwarded from the source node (2,1) with label 14 to destination (3,2) with label 22 by different paths, such as (14, 15, 22), (14, 15, 16, 21, 22), or (14, 15, 16, 17, 20, 21, 22). The message can be sent from node (3,2) with label 22 to destination (4,1) with label 26 by different paths, such as (22, 23, 26) or (22, 23, 24, 25, 26).

Theorem 3 The LD routing algorithm is deadlock-free.

Proof: Since the two multicast paths are independent, and the message is always forwarded by a multicast path in either of the two disjoint acyclic subnetworks, it is straightforward to see that no cyclic dependency can arise in the routing. Hence, the algorithm is deadlock-free. $\Box$

Because the message is routed in an increasing (or decreasing) order of labels, no destination will be visited more than once in a multicast.

Theorem 4 The LD routing algorithm is livelock-free.

Proof: Suppose that a message has arrived at node $u$ and that $d$ is the next destination in the message header. Without loss of generality, assume that $\ell(u) < \ell(d)$. The message can only be routed in the high-channel network, that is, in increasing order of the labels. Since every hop increases the label by at least one and never exceeds $\ell(d)$, the message will arrive at node $d$ after traveling at most $\ell(d) - \ell(u)$ channels. Therefore, the routing algorithm is livelock-free. $\Box$

The time complexity of the algorithm in Figure 15 is $O(k \log k)$ for a multicast with $k$ destination nodes since it is, in fact, a sorting algorithm. The routing algorithm in Figure 16 requires only $O(1)$ time.


6 Four-path Adaptive Multicast Routing

Depending on the port model of the system, it may be possible to use more than two paths to implement a multicast operation. In a 2D mesh, most nodes have outgoing degree four, so up to four paths can be used to deliver a message. However, the worms can be transmitted in parallel only if each node is equipped with as many internal channels as external channels. In this section, we describe deadlock-free extensions of the FM and LD algorithms in which up to four paths are used to implement each multicast. The PM algorithm cannot be extended to four paths without risking deadlock.

Extending the FM and LD algorithms can be accomplished by simply modifying their respective message header construction components; the actual routing algorithms (Figures 13 and 16) remain unchanged. Suppose that the source node of a multicast message is $d_0 = (x_0, y_0)$. In the case of the four-path FM algorithm, denoted 4-FM, the destination set $D$ is partitioned into at most four subsets $D_{NE}$, $D_{NW}$, $D_{SW}$, and $D_{SE}$ as follows:

$$D_{NE} = \{(x_i, y_i) \mid x_i > x_0,\ y_i \ge y_0, \text{ and } (x_i, y_i) \in D\}$$
$$D_{NW} = \{(x_i, y_i) \mid x_i \le x_0,\ y_i > y_0, \text{ and } (x_i, y_i) \in D\}$$
$$D_{SW} = \{(x_i, y_i) \mid x_i < x_0,\ y_i \le y_0, \text{ and } (x_i, y_i) \in D\}$$
$$D_{SE} = \{(x_i, y_i) \mid x_i \ge x_0,\ y_i < y_0, \text{ and } (x_i, y_i) \in D\}$$

Figure 18 shows an example of the four-path FM algorithm, 4-FM. In this example, $D = \{(0,0), (0,5), (1,1), (1,4), (3,1), (4,0), (4,4), (5,5)\}$. The set $D$ is partitioned into four subsets as follows: $D_{NE} = \{(4,4), (5,5)\}$; $D_{NW} = \{(0,5), (1,4)\}$; $D_{SW} = \{(0,0), (1,1)\}$; and $D_{SE} = \{(3,1), (4,0)\}$. The source message will be sent to the destinations by four multicast paths, each with one of the subsets as its destination set.

Figure 18. An example of the 4-FM routing algorithm. [Diagram: nodes (0,0) through (5,5), with the source node, the destination nodes, and the channels selected adaptively by the routing algorithm marked.]

Similarly, the LD algorithm can also be modified to use up to four paths for a given multicast. However, in this case, the manner in which the destination set is partitioned depends on the location of the source node in the network and the particular labeling method used. For example, given the labeling shown in Figure 19 and a source node $d_0 = (x_0, y_0)$, if $x_0$ is even then the destination set is partitioned as follows:

$$D_{NE} = \{(x_i, y_i) \mid x_i \ge x_0,\ y_i > y_0, \text{ and } (x_i, y_i) \in D\}$$
$$D_{NW} = \{(x_i, y_i) \mid x_i < x_0,\ y_i \ge y_0, \text{ and } (x_i, y_i) \in D\}$$
$$D_{SW} = \{(x_i, y_i) \mid x_i \le x_0,\ y_i < y_0, \text{ and } (x_i, y_i) \in D\}$$
$$D_{SE} = \{(x_i, y_i) \mid x_i > x_0,\ y_i \le y_0, \text{ and } (x_i, y_i) \in D\}$$

Figure 19 shows an example of such a partitioning and the four resultant paths of the 4-LD algorithm for a given multicast in a 32 × 32 mesh. If $x_0$ were odd, the destination set would be partitioned in the same manner as for the 4-FM algorithm given earlier.

Because the four-path versions of the FM and LD algorithms use the same routing algorithms and, in the case of LD, the same labeling method as their dual-path counterparts described in Sections 4 and 5, they are also deadlock-free by Theorems 2 and 3. As will be described in the next section, using more than two paths can further reduce both the traffic generated by multicast operations and the lengths of their constituent paths.
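The 4-FM partition defined earlier in this section can be expressed directly in Python. The function name `partition_4fm` is hypothetical, and the source position (2,2) used in the example is an assumption: the text does not state the source of Figure 18, but (2,2) is consistent with the subsets it lists.

```python
def partition_4fm(dests, source):
    """Partition a destination set into the four 4-FM subsets."""
    x0, y0 = source
    parts = {"NE": [], "NW": [], "SW": [], "SE": []}
    for (x, y) in dests:
        if x > x0 and y >= y0:
            parts["NE"].append((x, y))
        elif x <= x0 and y > y0:
            parts["NW"].append((x, y))
        elif x < x0 and y <= y0:
            parts["SW"].append((x, y))
        elif x >= x0 and y < y0:
            parts["SE"].append((x, y))
        # The only node matching no case is the source itself.
    return parts


dests = [(0, 0), (0, 5), (1, 1), (1, 4), (3, 1), (4, 0), (4, 4), (5, 5)]
parts = partition_4fm(dests, (2, 2))   # assumed source position
for quadrant, nodes in parts.items():
    print(quadrant, nodes)
```

Each boundary row or column is assigned to exactly one quadrant, so the four subsets are disjoint and together cover every node other than the source.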

Figure 19. An example of the multi-path 4-LD routing algorithm. [Diagram: each node (x, y) annotated with its label, from 1 at (0,0) to 36 at (5,0), with the source node, the destination nodes, and the channels selected by the routing algorithm marked.]

7 Performance Evaluation

A study of the relative performance of the proposed adaptive routing algorithms was conducted. The study evaluates multicasts to different numbers of destination nodes in a 32 × 32 mesh. The performance metrics include the average traffic created by each routing algorithm, the average and maximum distances from the source to the destinations along the path(s), and the number of alternative paths available to each routing algorithm. Randomly generated multicast sets (at least 1000 sets per plotted point) with different numbers of destinations were tested. The number of destination nodes, $k$, was selected from 1 to 100.

In comparing the traffic created by the algorithms, each unit of traffic represents the transmission of a message over a channel. The traffic is defined as the total number of channels used for a given multicast communication. A multicast with $k$ destinations requires at least $k$ units of traffic. The additional traffic is defined as the total amount of traffic minus $k$.

Figure 20 plots the amount of additional traffic generated by the six routing algorithms in a 32 × 32 mesh. Among the six routing algorithms, the 4-FM algorithm creates the least traffic. Interestingly, the 2-PM and 2-FM algorithms produce almost the same amount of traffic as 1-PM, even though the latter uses only one path. Although 2-FM does not show a very significant improvement over the 1-PM algorithm in terms of traffic, it uses a double-vertical channel network and would therefore be expected to achieve a lower network latency than the 1-PM algorithm.
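The traffic metric can be made concrete with a small Python sketch. It assumes minimal routing, so that the number of channels used between consecutive nodes of a path equals their Manhattan distance; the function name `path_traffic` is hypothetical, and the two paths are those of the 2-PM example in Figure 11 rather than data from the simulation study.

```python
def path_traffic(source, header):
    """Channels used by one worm that visits the header's nodes in order,
    assuming every leg follows a shortest path."""
    total, current = 0, source
    for d in header:
        total += abs(d[0] - current[0]) + abs(d[1] - current[1])
        current = d
    return total


source = (2, 1)
paths = [[(0, 2), (1, 4)], [(3, 2), (4, 1), (5, 4)]]  # the two 2-PM worms
k = sum(len(p) for p in paths)                        # number of destinations
traffic = sum(path_traffic(source, p) for p in paths)
additional = traffic - k
print(traffic, additional)
```

For a nonminimal algorithm such as 2-LD this computation gives only a lower bound, since derouted legs use more channels than the Manhattan distance.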


Figure 20. Generated multicast traffic in a 32 × 32 mesh. [Plot: average additional traffic versus number of destination nodes (0 to 100) for the 1-PM, 2-PM, 2-FM, 2-LD, 4-FM, and 4-LD algorithms.]

Figure 21 compares the algorithms in terms of the average length of the path between the source and a destination. As expected, the average path length depends heavily on the number of paths used. The average path lengths of the four-path algorithms (4-FM and 4-LD) are approximately half those of the three dual-path algorithms (2-LD, 2-PM, and 2-FM), which are in turn approximately half that of 1-PM. Among algorithms using the same number of paths, a slight advantage goes to the minimal algorithms.

Figure 21. Average multicast path length in a 32 × 32 mesh. [Plot: average path length versus number of destination nodes (0 to 100) for the six algorithms.]

Figure 22 compares the algorithms in terms of the maximum length of the path between the source and a destination. The maximum path length is important for cases in which the performance of the algorithm depends on when the last destination receives the message; an example is the use of multicast in distributed barrier synchronization [25]. Again, the 2-LD, 2-PM and 2-FM methods have similar maximum path lengths, although the 2-LD algorithm produces slightly longer maximum path lengths because it is nonminimal. The three methods also produce paths whose maximum lengths are approximately 3/4 those of paths produced by the 1-PM algorithm. The maximum lengths of paths produced by the four-path algorithms are approximately 1/2 those of paths produced by the 1-PM algorithm.

Figure 22. Maximum multicast path length in a 32 × 32 mesh. [Plot: maximum path length versus number of destination nodes (0 to 100) for the six algorithms.]

The adaptivity of the proposed algorithms was measured by computing the average number of available paths from one node (initially from the source node) to the next destination in the multicast path. For a multicast with message header $(s, d_1, d_2, \ldots, d_k)$, suppose that the number of available paths from $s$ to $d_1$ is $n_1$, and from $d_i$ to $d_{i+1}$ is $n_{i+1}$, $1 \le i < k$. The average number of available paths is defined as $(\sum_{i=1}^{k} n_i)/k$. Clearly, the total number of available paths for the multicast is $\prod_{i=1}^{k} n_i$.

As shown in Figure 23, the 1-PM, 2-PM, 2-FM, and 4-FM algorithms have similar average numbers of available paths across different numbers of destinations. This result occurs because all these algorithms are minimal; therefore, the average number of paths between consecutive nodes along a path is approximately the same across the algorithms. Although 2-FM may not significantly outperform 1-PM in terms of adaptivity, as was explained before, only about half of the unicast communications can be performed adaptively in 1-PM, while all of the unicast messages are adaptively routed in 2-FM. Hence, 2-FM would be expected to exhibit better performance if the percentage of unicast communication were relatively high.

Figure 23. Available number of paths in a 32 × 32 mesh. [Plot: average number of available paths (logarithmic scale, 1 to 1e+07) versus number of destination nodes (10 to 100) for the six algorithms.]

The 2-LD algorithm exhibits better adaptivity than any of the minimal algorithms. Since 2-LD can select nonminimal routes between consecutive nodes along the path, the average number of alternative paths is higher. However, when the number of destinations becomes large, 2-LD and the minimal algorithms perform very similarly in terms of adaptivity, because 2-LD is decreasingly likely to choose routes that are nonminimal. Finally, the 4-LD algorithm exhibits the best adaptivity because the nonminimal routes for destinations in one quadrant may range far into other quadrants. In fact, the number of paths between certain pairs of destinations can be extremely large. However, this result is somewhat misleading, since the worms for destinations in different quadrants may contend for channels. In Figure 19, for example, if node (5,1) were also a destination, then the worm for the southeast quadrant could potentially deroute from node (4,0) all the way to the channel from (4,5) to (5,5), which is used by the worm in the northeast quadrant. If the message were long, one of the worms would be delayed.
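The adaptivity metric defined in this section can be sketched for the fully-adaptive minimal case. The sketch assumes that every shortest path between consecutive nodes is available, in which case the number of paths between two nodes separated by $\Delta x$ columns and $\Delta y$ rows is the binomial coefficient $\binom{\Delta x + \Delta y}{\Delta x}$. The function names are hypothetical and the header is the east path of the earlier 2-FM example, not data from the study.

```python
from math import comb, prod


def minimal_path_count(a, b):
    """Number of shortest paths between mesh nodes a and b."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return comb(dx + dy, dx)


def adaptivity(source, header):
    """Return (average, total) number of available paths for one worm."""
    nodes = [source] + list(header)
    counts = [minimal_path_count(u, v) for u, v in zip(nodes, nodes[1:])]
    return sum(counts) / len(counts), prod(counts)


average, total = adaptivity((2, 1), [(3, 2), (4, 1), (5, 4)])
print(average, total)
```

Under 1-PM or 2-PM, a deterministically routed leg would contribute a count of 1 instead of the binomial value, and a nonminimal algorithm such as 2-LD would contribute more; this sketch does not model either case.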

8 Concluding Remarks

Three adaptive multicast wormhole routing algorithms for 2D mesh multicomputers have been proposed. The algorithms include partially-adaptive minimal routing, fully-adaptive minimal routing, and nonminimal adaptive routing methods. The proposed adaptive routing algorithms are simple and deadlock-free. The minimal routing methods are inherently livelock-free, and the nonminimal routing method has been shown to be livelock-free. Two versions of each, using different numbers of paths, were studied; the number of paths should match the port model of the architecture. These routing strategies are the first adaptive multicast wormhole routing algorithms ever proposed.


A study has been conducted to compare the performance of the proposed adaptive routing algorithms. The results indicate that the four-path approaches create the least traffic and the shortest paths. The nonminimal routing algorithms offer the best adaptivity, but require more channels for message transmission. The minimal algorithms are close to one another in adaptivity; however, the PM algorithms have a simpler control structure and, unlike the FM algorithms, do not require virtual channels. When the number of destinations is relatively large, the 2-LD routing algorithm does not offer more adaptivity than the minimal algorithms. Since the PM algorithms use deterministic routing for about half the cases of unicast communication, they may be preferred only if the percentage of unicast communication is relatively low; for unicast-intensive traffic, the FM routing algorithms are likely to be better choices because they offer full adaptivity for all unicast messages. Finally, it should be noted that all these routing methods for 2D meshes can be extended to mesh topologies of any dimension.

Acknowledgements

The authors would like to express their sincere appreciation to Professor Lionel M. Ni for his contributions to this work. This work was supported in part by the NSF grants CDA-9121641, CDA-9222901, and MIP-9204066, by DOE grant DE-FG02-93ER25167, and by an Ameritech Faculty Fellowship.

References

[1] Berman, P., Gravano, L., Sanz, J., and Pifarre, G. Adaptive deadlock- and livelock-free routing with all minimal paths in torus networks. In Proc. 4th ACM Symposium on Parallel Algorithms and Architectures (June 1992), pp. 3–12.
[2] Dally, W. J. Virtual channel flow control. IEEE Transactions on Computers 3, 2 (Mar. 1992), 194–205.
[3] Dally, W. J., and Aoki, H. Adaptive routing using virtual channels. Tech. rep., Massachusetts Institute of Technology, Laboratory for Computer Science, Sept. 1990.
[4] Dally, W. J., and Seitz, C. L. The torus routing chip. Journal of Distributed Computing 1, 3 (1986), 187–196.
[5] Dally, W. J., and Seitz, C. L. Deadlock-free message routing in multiprocessor interconnection networks. IEEE Transactions on Computers C-36, 5 (May 1987), 547–553.
[6] DeMara, R. F., and Moldovan, D. I. Performance indices for parallel marker-propagation. In Proceedings of the 1991 International Conference on Parallel Processing (St. Charles, Illinois, Aug. 12-17, 1991), pp. 658–659.
[7] Duato, J. On the design of deadlock-free adaptive routing algorithms for multicomputers: design methodologies. In Proceedings of 1991 Parallel Architectures and Languages Europe Conference (PARLE'91) (1991).
[8] Glass, C. J., and Ni, L. M. The turn model for adaptive routing. In Proc. of the 19th Annual International Symposium on Computer Architecture (May 1992), pp. 278–287.
[9] Intel Corporation. Paragon XP/S Product Overview, 1991.
[10] Intel Corporation. A Touchstone DELTA System Description, 1991.
[11] Jesshope, C. R., Miller, P. R., and Yantchev, J. T. High performance communications in processor networks. In Proceedings of IEEE 16th Annual International Symposium on Computer Architecture (1989), pp. 150–157.
[12] Johnsson, S. L., and Ho, C.-T. Optimum broadcasting and personalized communication in hypercubes. IEEE Transactions on Computers C-38, 9 (Sept. 1989), 1249–1268.
[13] Konstantinidou, S., and Snyder, L. Chaos router: Architecture and performance. In Proceedings of the 18th Annual Symposium on Computer Architecture (1991), pp. 222–231.
[14] Kumar, V., and Singh, V. Scalability of parallel algorithms for the all-pairs shortest path problem. Tech. Rep. ACT-OODS-058-90, Rev. 1, MCC, Jan. 1991.
[15] Lan, Y. Fault-tolerant multi-destination routing in hypercube multicomputers. In Proceedings of the 12th International Conference on Distributed Computing Systems (June 1992), pp. 632–639.
[16] Lan, Y. Multicast in faulty hypercubes. In Proc. of the 1992 International Conference on Parallel Processing (Aug. 1992), vol. I, pp. 58–61.
[17] Li, K., and Schaefer, R. A hypercube shared virtual memory. In Proc. of the 1989 International Conference on Parallel Processing (Aug. 1989), vol. I, pp. 125–132.
[18] Lin, X., McKinley, P. K., and Ni, L. M. Deadlock-free multicast wormhole routing in 2D mesh multicomputers. Accepted to appear in IEEE Transactions on Parallel and Distributed Systems.
[19] Lin, X., McKinley, P. K., and Ni, L. M. The message flow model for routing in wormhole-routed networks. In Proc. of the 1993 International Conference on Parallel Processing (1993), vol. I, pp. 294–297.
[20] Linder, D. H., and Harden, J. C. An adaptive and fault tolerant wormhole routing strategy for k-ary n-cubes. IEEE Transactions on Computers 40, 1 (Jan. 1991), 2–12.
[21] McKinley, P. K., Xu, H., Esfahanian, A.-H., and Ni, L. M. Unicast-based multicast communication in wormhole-routed networks. In Proc. of the 1992 International Conference on Parallel Processing (Aug. 1992), vol. II, pp. 10–19.
[22] NCUBE Company. NCUBE 6400 Processor Manual, 1990.
[23] Ni, L. M., and McKinley, P. K. A survey of wormhole routing techniques in direct networks. IEEE Computer 26, 2 (Feb. 1993), 62–76.
[24] Seitz, C. L., Athas, W. C., Flaig, C. M., Martin, A. J., Seizovic, J., Steele, C. S., and Su, W.-K. The architecture and programming of the Ametek Series 2010 multicomputer. In Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, Volume I (Pasadena, CA, Jan. 1988), ACM, pp. 33–36.
[25] Xu, H., McKinley, P. K., and Ni, L. M. Efficient implementation of barrier synchronization in wormhole-routed hypercube multicomputers. Journal of Parallel and Distributed Computing 16 (1992), 172–184.
