An Euler-Path-Based Multicasting Model for Wormhole-Routed Networks with Multi-Destination Capability
Yu-Chee Tseng†, Ming-Hour Yang†, and Tong-Ying Juang‡
†Department of Computer Science and Information Engineering, National Central University, Chung-Li, 32054, Taiwan. Tel: 886-3-4227151 ext. 4512, Fax: 886-3-4222681. Email: {yctseng, [email protected]
‡Department of Computer Science, Chung-Hua University, Hsin-Chu, 30067, Taiwan. Email: [email protected]
Abstract
This research is supported by the National Science Council of the Republic of China under Grant # NSC86-2213-E-008-029 and Grant # NSC86-2213-E-216-021.
Recently, wormhole routers with multi-destination capability have been proposed to support fast multicast in a multi-computer network. In this paper, we develop a new multicasting model for such networks based on the concept of Euler path/circuit in graph theory. The model can support multiple concurrent multicasts free from deadlock and can be applied to any network that is Eulerian or becomes Eulerian after some links are removed. No virtual channels are needed. In particular, we demonstrate the potential of this model by showing its fault-tolerant capability in supporting multicasting in the currently popular torus/mesh topology of any dimension with regular fault patterns (such as single node, block, L-shape, +-shape, U-shape, and H-shape) and even irregular fault patterns. It is the first multicasting model known to us in the literature with such a strong fault-tolerant capability. The result improves over existing fault-tolerant routing algorithms for meshes/tori in at least one of the following aspects: the number of faults tolerable, the shape of fault patterns, the number of deactivated healthy nodes, the need for virtual channels, and the range of network topologies acceptable. Simulation results based on torus networks are also presented.
Keywords: Collective communication, Euler path, fault tolerance, k-ary n-cube, multicast, multi-computer network, torus, virtual channel, wormhole routing.
Contents
1 Introduction
2 The Euler-Path-Based Multicasting Model
  2.1 Basic Idea
  2.2 Extension to Non-Eulerian Networks
  2.3 Making Use of Short-cuts
  2.4 An Adaptive Multicasting Algorithm
  2.5 Implementation Considerations
3 Applying to 2-D Tori with Faults
  3.1 Single Faulty Cluster
    3.1.1 Stage 1: Making All Nodes' Degrees Even
    3.1.2 Stage 2: Making the Network Connected
  3.2 Extension to Multiple Faulty Clusters
  3.3 What Our Model Can and Can't Do?
4 Extensions to Tori of Higher Dimensions with Faults
  4.1 3-D Torus with Faulty Blocks
  4.2 3-D Torus with Faulty Clusters
  4.3 n-D Torus with Faulty Clusters
5 Applying to Meshes with Faults
6 Simulation Results
7 Conclusions
1 Introduction
In a multicomputer network, processors often need to communicate with each other for various reasons, such as data exchange and event synchronization. Efficient communication has been recognized to be critical for high-performance computing. The communication problem can be characterized according to network topologies (e.g., mesh, torus, hypercube, or star graph), communication patterns (e.g., one-to-one, permutation, broadcast, or multicast), and switching technologies (e.g., packet-switching, circuit-switching, or wormhole-routing). Packet- and circuit-switching were used by earlier parallel computers, and much routing work can be found in [2, 6, 15, 27, 28]. More recent parallel computers (such as the Intel Touchstone DELTA, Intel Paragon, MIT J-machine, IBM SP2, and Cray T3D) have adopted wormhole routing [1, 9, 22], which is known to be quite insensitive to routing distance and can offer fast inter-processor communication.

This paper studies the multiple multicast problem in a wormhole-routed network, where any node in the network may intend to multicast to any set of destination nodes at any time. Multicast has many applications, such as parallel graph algorithms, barrier synchronization, and memory update/invalidation for cache coherence in distributed shared-memory systems. Both one-to-one communication and broadcast are special cases of multicast.

Solutions to the multicast problem can be categorized as unicast-based, tree-based, and path-based. The unicast-based solutions (e.g., [8, 21] for meshes and [31] for tori) make use of 1-to-1 communication to achieve multicast. The disadvantages of this approach are that intermediate nodes must take part in message propagation and that a start-up latency is incurred at each intermediate node. The tree-based solutions (e.g., [18, 27, 29] for hypercubes) rely on finding a spanning tree rooted at the source node. A multicast message is then propagated along the tree. Such solutions are adequate for packet-switched networks, but incur long latency in wormhole-routed networks [20]. More seriously, as shown in [20], such an approach is prone to communication deadlock if multiple multicasts are performed simultaneously.

One major advance in solving this problem is the recently proposed path-based solutions, which enhance routers with multi-destination capability [11, 20]. Examples include [11, 20] for meshes, [19, 12, 17, 23] for k-ary n-cubes, and [19, 24, 30] for arbitrary networks. Simple hardware is added to the router to enable it to copy the content of a worm while forwarding the worm to the next router. The header of such a worm can carry a number of destination addresses for the worm to visit, so such worms are also termed multidestination worms [12, 23]. When seeing its address in the header, a router removes its address from the header and forwards the worm (according to the routing function) to the next router. As the worm passes by, the router also makes a copy of the worm body for itself. As a worm can deliver a message to multiple destinations with only one startup cost, the deficiency of high startup cost associated with the unicast-based approach is eliminated. In [20], to avoid the deadlock problem, a Hamiltonian path is first constructed from the network to restrict the order in which destinations are visited by the multidestination worms. A similar approach is proposed in [24], but an Euler path/cycle is constructed instead to restrict the routing order. Observing that both [20, 24] are too restrictive in requiring the network to be Hamiltonian or Eulerian, a more relaxed solution is proposed in [19], which only requires that a pseudo-Hamiltonian path exist in the network (a pseudo-Hamiltonian path in an undirected graph is a path which visits each vertex at least once and each edge at most once). In [30], it is suggested to use a trip instead of a Hamiltonian path to enforce such an ordering. As a trip always exists in any network, this extends the applicability of the path-based model to arbitrary network topologies.
To save hardware cost, [17, 23] propose not to modify the routing function in the routers, but only to route a worm in conformance with the routing function originally provided for point-to-point communication. Several levels of worms may be used to complete a multicast. This scheme is tested on the k-ary n-cube architecture to demonstrate its efficiency. Following a similar line of thinking, [12] derives a multi-destination multicasting scheme on meshes based on the turn model [13].

Each of the above results has its strengths and weaknesses. While most commonly used networks are Hamiltonian/Eulerian, the approaches in [20, 24] have little fault-tolerant capability, as any faulty node in the network may destroy its Hamiltonian/Eulerian property. The same problem also exists in [12, 17, 23], as the given base routing function may not be fault-tolerant. The pseudo-Hamiltonian model of [19] is somewhat more fault-tolerant. Through somewhat complicated methods, [19] shows how to construct a pseudo-Hamiltonian path in a mesh network with some faulty blocks such that no two faulty blocks are at a distance of 3. But we conjecture that finding a pseudo-Hamiltonian path in an arbitrary graph is just as difficult as determining whether the graph is Hamiltonian, which is known to be NP-complete. In contrast, it is known in graph theory that determining whether a graph is Eulerian can easily be done in linear time. The solution of [30] can be used in any network topology, and is thus the most fault-tolerant. However, it requires two virtual channels per physical channel, as opposed to the earlier results, which require no virtual channels.

To relieve these problems, in this paper we propose a new multicast model, called the Euler-path-based model, which can be applied to any network that (i) is Eulerian, or (ii) is Eulerian after some links are removed. This is more flexible than requiring the network to be Eulerian (as in [24]), and easier to satisfy than requiring the network to be Hamiltonian or pseudo-Hamiltonian (as in [20, 19]). The model does not rely on the existence of virtual channels and is thus more hardware-efficient than [30]. It applies to both regular and irregular networks. Irregular networks are receiving increasing attention with the appearance of workstation clusters, which also use wormhole routing. Even with regular networks, fault-tolerant routing is important, especially in large-scale networks, which are more vulnerable to faults (see the footnote).
Footnote: We exemplify such a situation with a $5 \times 5$ mesh. The mesh is a bipartite graph and we can partition its nodes into two groups, one with 12 and the other with 13 nodes. If any node in the former group is damaged, then the mesh becomes non-Hamiltonian. The reason is that a Hamiltonian path must visit nodes of these two groups alternately. Thus, a bipartite graph whose node groups differ in size by 2 or more must be non-Hamiltonian. It is easy to extend such arguments to meshes of other sizes and even tori. So the Hamiltonian property of a mesh/torus can be easily destroyed by one or two faulty nodes.

Also, given any network, the proposed scheme can be used as a framework to define a deadlock-free base routing function, which can be extended to support behavior similar to [12, 17, 23].

Another major emphasis of this paper is to demonstrate the potential and practical value of our model when applied to the currently popular torus/mesh topology of any dimension with faults. Many schemes have been proposed for fault-tolerant routing on mesh/torus-like networks [3, 4, 7, 10, 14]. However, most results can only tolerate a limited number of faults, usually proportional to the dimension of the mesh/torus [10, 14]. To tolerate more faults, a typical approach is to "deactivate" some healthy nodes (i.e., regard them as faulty) so that faults are in rectangular shapes [3, 4, 7, 19]. Deactivating healthy nodes is certainly undesirable. The only scheme known to us that can handle non-rectangular fault patterns (such as T and L fault patterns) is [5]. However, this is achieved using four virtual channels per physical channel. Furthermore, it should be noted that all these results are only for point-to-point communication and thus cannot enjoy the benefits of multi-destination routing capability. A survey can be found in [1].

In this paper, we show that many regular fault patterns (such as single node, block, L-shape, +-shape, U-shape, and H-shape) and even irregular fault patterns are tolerable by our model. In most cases there is no need to deactivate healthy nodes, in contrast to the deficiency in [3, 4, 7]. The result is for multi-destination networks and does not require the support of virtual channels. It is the first multicasting model known to us in the literature with such strong fault-tolerant capability. The result significantly improves over the result in the earlier version of this paper by the same authors [16].
Simulations have also been conducted to study the performance of our model when applied to 2-D tori from several aspects, such as communication parameters of the networks, nature of traffic, distribution of source and destination nodes, numbers of faults, and resilience of our model.

The rest of this paper is organized as follows. Section 2 develops the Euler-path-based model. In Section 3 we show how to apply our model to a 2-D torus with faults. In Section 4 we further extend the application to faulty tori of higher dimensions. Applying the model to meshes is quite straightforward based on the above results and is discussed in Section 5. Section 6 presents our simulation results. Conclusions are drawn in Section 7.
2 The Euler-Path-Based Multicasting Model
Now we present our model. Sections 2.1 to 2.3 give the basic idea and motivation behind the design of the model. Formal definitions are given in Section 2.4.
2.1 Basic Idea
A multicomputer network is represented by an undirected system graph $G = (V, E)$ with vertex set $V$ corresponding to processors (nodes) and edge set $E$ to communication links. Each undirected edge $(u, v)$ consists of two directed links $\langle u, v \rangle$ and $\langle v, u \rangle$, which can transmit data independently. Throughout, we assume $G$ to be connected. An Euler path in $G$ (if any) is an undirected path that traverses each edge of $G$ exactly once (and thus each node once or more). A graph is said to be Eulerian if it contains an Euler path. The following lemma is well known in graph theory.
Lemma 1 [25] A graph is Eulerian if and only if one of the following conditions holds: (a) all nodes have even degrees, or (b) all nodes, except exactly two nodes, have even degrees.
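The degree test of Lemma 1 can be sketched in a few lines of Python. This is only an illustration: the function name `euler_endpoints` and the edge-list representation are assumptions made here, and connectivity is taken for granted, as in the text. The example edges are read off the Euler path listed in the caption of Fig. 1(a).

```python
from collections import defaultdict

def euler_endpoints(edges):
    """Apply the degree test of Lemma 1 to a connected undirected graph.

    Returns [] if condition (a) holds (the Euler path is a circuit),
    the two odd-degree nodes if condition (b) holds, and None if the
    graph is not Eulerian. Connectivity is assumed, not checked.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    odd = sorted(node for node, d in degree.items() if d % 2 == 1)
    if len(odd) in (0, 2):
        return odd
    return None

# Edges taken from the Euler path given in the caption of Fig. 1(a).
FIG_1A_PATH = list("fabfgbcghcdhied")
FIG_1A_EDGES = list(zip(FIG_1A_PATH, FIG_1A_PATH[1:]))
endpoints = euler_endpoints(FIG_1A_EDGES)
```

For this edge list the two odd-degree nodes are `d` and `f`, which agrees with the Euler path from f to d mentioned for Fig. 1(a).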
In the above lemma, when condition (a) holds, the path in fact forms a circuit. Otherwise, the Euler path must start from one of the two odd-degree nodes and end at the other. For instance, Fig. 1(a) shows a system graph with an Euler path from $f$ to $d$. The graph in Fig. 1(b) is not Eulerian, as there are more than two odd-degree nodes. As a convention, we denote an Euler path by a sequence of nodes $\pi = [\pi_1, \pi_2, \ldots, \pi_n]$, where each $\pi_i$, $i = 1..n$, is a node. Finding an Euler path is a simple job in graph theory, while finding a Hamiltonian path is NP-complete [25].

Figure 1: (a) A system graph containing an Euler path $\pi = [f, a, b, f, g, b, c, g, h, c, d, h, i, e, d]$, and (b) a system graph which is not Eulerian. Paths are shown in grey lines.

Given any system graph $G$ containing an Euler path $\pi = [\pi_1, \pi_2, \ldots, \pi_n]$, suppose a node $s$ wants to multicast a message $M$ to a set of destination nodes $D \subseteq V$. We can develop, based on the Euler path, a simple multicast algorithm as follows.
Procedure Basic();
1) Find any index $i$, $1 \le i \le n$, such that $\pi_i = s$.
2) Send a worm which carries $M$, starts from $\pi_i$, and sequentially traverses nodes $\pi_{i+1}, \pi_{i+2}, \ldots, \pi_n$.
3) Send a worm which carries $M$, starts from $\pi_i$, and sequentially traverses nodes $\pi_{i-1}, \pi_{i-2}, \ldots, \pi_1$.
4) Each node in $D$ makes a copy of the message $M$ when the worms pass by.
Let us call the worm in step 2 an f-worm (read as forward-worm), and the one in step 3 a b-worm (read as backward-worm). The basic idea is to send the f-worm and b-worm in the forward and backward directions of the Euler path, respectively. An f-worm (b-worm) never travels in the backward (forward) direction of the Euler path. Such a routing algorithm is clearly deadlock-free, even if multiple nodes perform multicasting concurrently (a simple proof following the channel dependency analysis [9] suffices). However, the algorithm may not be efficient enough, as the routing has no adaptivity at all. Later we will introduce the idea of a "short-cut" to increase its adaptivity.
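As a minimal sketch of procedure Basic(), the two worm routes are simply the suffix and the reversed prefix of the Euler path at the chosen occurrence of the source. The function names `basic_routes` and `served` are hypothetical, and the first occurrence of $s$ is used although the procedure allows any occurrence.

```python
def basic_routes(path, s):
    """Routes of the f-worm and b-worm of procedure Basic().

    `path` is the Euler path as a list of nodes; the first occurrence
    of the source `s` is chosen here (any occurrence would do).
    """
    i = path.index(s)
    f_route = path[i:]       # pi_i, pi_{i+1}, ..., pi_n
    b_route = path[i::-1]    # pi_i, pi_{i-1}, ..., pi_1
    return f_route, b_route

def served(route, dests):
    """Destinations that copy the message as a worm on `route` passes by."""
    return set(route) & set(dests)

# Euler path from the caption of Fig. 1(a); node c multicasts to {a, e}.
EULER_PATH = list("fabfgbcghcdhied")
f_route, b_route = basic_routes(EULER_PATH, "c")
```

Here the f-worm travels c, g, h, c, d, h, i, e, d and serves destination e, while the b-worm travels c, b, g, f, b, a, f and serves destination a. Neither worm has any choice of route, which is the lack of adaptivity noted above.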
2.2 Extension to Non-Eulerian Networks
If a network $G$ is not Eulerian, the above model is not applicable. One possible solution is to remove some links from $G$ to obtain a new graph which is Eulerian. If this can be achieved, we can directly apply procedure Basic() on the new graph to perform multicasting. For instance, given the graph in Fig. 1(b), we can remove links $(a, c)$, $(g, i)$, and $(d, e)$ to make it Eulerian (the Euler path is shown in grey). Note that this by no means implies that the removed links cannot be used for delivering worms. In fact, as shown later, these links can even increase the routing adaptivity. Thus, one general problem is: given a network $G$, how do we remove a set of links from $G$ such that the new graph is Eulerian, so as to apply our model? One central issue of this paper is to deal with this problem when $G$ is a torus/mesh with some faulty components (see Sections 3, 4, and 5).
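Once a suitable set of links has been removed, an Euler path of the remaining graph is still needed. The text does not prescribe a construction; the sketch below uses Hierholzer's algorithm, a standard linear-time technique chosen here as an assumption. The function name `euler_path` is hypothetical, and the edge list is the remaining graph of Fig. 1(b) as inferred from the node indices used later in Example 1.

```python
from collections import defaultdict

def euler_path(edges):
    """Hierholzer's algorithm on an undirected, non-empty edge list.

    Returns an Euler path as a list of nodes, or None if the graph has
    the wrong number of odd-degree nodes or is disconnected.
    """
    adj = defaultdict(list)
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    odd = [v for v in adj if len(adj[v]) % 2 == 1]
    if len(odd) not in (0, 2):
        return None
    start = odd[0] if odd else edges[0][0]
    used = [False] * len(edges)
    stack, out = [start], []
    while stack:
        u = stack[-1]
        # Lazily discard edges already traversed from the other endpoint.
        while adj[u] and used[adj[u][-1][1]]:
            adj[u].pop()
        if adj[u]:
            v, idx = adj[u].pop()
            used[idx] = True
            stack.append(v)
        else:
            out.append(stack.pop())
    if len(out) != len(edges) + 1:
        return None  # some edges were unreachable from the start node
    return out[::-1]

# Remaining links after removing (a,c), (g,i), (d,e), as inferred from Example 1.
REMAINING = list(zip("fabfgbcghcdhi", "abfgbcghcdhie"))
found = euler_path(REMAINING)
```

The path returned need not coincide with the one drawn in the figure; any Euler path of the remaining graph can serve as the basis of the model.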
2.3 Making Use of Short-cuts
We make two observations to increase the adaptivity of Basic(). First, suppose we have an Euler path $\pi = [\ldots, \pi_i, \ldots, \pi_j, \ldots]$ such that $\pi_i = \pi_j$. When an f-worm arrives at $\pi_i$, if there is no destination between $\pi_i$ and $\pi_j$ for the worm to visit, we can assume that the f-worm has arrived at $\pi_j$ and directly forward the worm to node $\pi_{j+1}$. This is one possible "short-cut". A similar short-cut also exists for b-worms. Second, in $\pi = [\ldots, \pi_i, \ldots, \pi_j, \ldots]$, if there is a "removed" link between $\pi_i$ and $\pi_j$ (refer to Section 2.2), the link can be used to forward an f-worm from $\pi_i$ to $\pi_j$. A similar short-cut from $\pi_j$ to $\pi_i$ also exists for b-worms.
2.4 An Adaptive Multicasting Algorithm
Now we formally develop our multicast model. The model can be applied to any system graph $G = (V, E)$ which contains an Eulerian subgraph $G' = (V, E_1)$. We will use any Euler path in $G'$ as a basis to avoid communication deadlock, and use links in $E_2 = E - E_1$ (i.e., the set of links being "removed") as short-cuts to accelerate the multicasting.
Figure 2: The f-graph and b-graph corresponding to the example in Fig. 1(b).
Definition 1 Let $\pi = [\pi_1, \pi_2, \ldots, \pi_n]$ be an Euler path in $G'$. Associated with each $\pi_i$, $1 \le i \le n$, define $\hat{\pi}_i$ to be the 2-tuple $(\pi_i, i)$. Also, let $\hat{\pi}$ be the sequence $[\hat{\pi}_1, \hat{\pi}_2, \ldots, \hat{\pi}_n]$.

Note that each $\hat{\pi}_i$ is associated with $\pi_i$'s location, $i$, in the Euler path. So $\hat{\pi}_i \ne \hat{\pi}_j$ if and only if $i \ne j$ (but not necessarily $\pi_i \ne \pi_j$).
Definition 2 The f-graph (read as forward graph) with respect to $\pi$, denoted as $G_f(\pi) = (V_f, E_f)$, is a directed graph such that
$V_f = \{\hat{\pi}_1, \hat{\pi}_2, \ldots, \hat{\pi}_n\}$
$E_f = \{\langle \hat{\pi}_i, \hat{\pi}_{i+1} \rangle \mid i = 1..n-1\} \cup \{\langle \hat{\pi}_i, \hat{\pi}_j \rangle \mid (i < j) \wedge (\pi_i, \pi_j) \in E_2\}.$

Using the system graph in Fig. 1(b) as an example, we can let $E_2 = \{(a, c), (g, i), (d, e)\}$ and draw the corresponding f-graph $G_f(\pi)$ as shown in Fig. 2. Intuitively, we regard $\hat{\pi}$ (i.e., the first set in the equation of $E_f$) as the "backbone" of $G_f(\pi)$. All removed links $(\pi_i, \pi_j) \in E_2$ are added between all occurrences of $\hat{\pi}_i$ and $\hat{\pi}_j$ from left to right (i.e., the second set in the equation of $E_f$) as short-cuts. In a similar way, we define the backward graph as follows.
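Definition 3 The b-graph (read as backward graph) with respect to $\pi$, denoted as $G_b(\pi) = (V_b, E_b)$, is a directed graph such that
$V_b = \{\hat{\pi}_1, \hat{\pi}_2, \ldots, \hat{\pi}_n\}$
$E_b = \{\langle \hat{\pi}_i, \hat{\pi}_{i-1} \rangle \mid i = 2..n\} \cup \{\langle \hat{\pi}_i, \hat{\pi}_j \rangle \mid (i > j) \wedge (\pi_i, \pi_j) \in E_2\}.$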
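The edge sets of Definitions 2 and 3 can be built directly from an Euler path and the removed link set. In the sketch below, positions are 1-based as in the text, each directed channel is represented by a pair of positions, and the function name `build_tool_graphs` is an assumption made for illustration. The path used is the one inferred from the node indices that appear in Example 1.

```python
def build_tool_graphs(path, removed):
    """Return (E_f, E_b) of Definitions 2 and 3 as sets of (i, j) pairs.

    A pair (i, j) stands for the directed channel from position i to
    position j of the Euler path (positions are 1-based).
    """
    n = len(path)
    rem = {frozenset(link) for link in removed}
    # Removed links between every pair of occurrences, left to right.
    shortcuts = {(i, j)
                 for i in range(1, n + 1)
                 for j in range(i + 1, n + 1)
                 if frozenset((path[i - 1], path[j - 1])) in rem}
    ef = {(i, i + 1) for i in range(1, n)} | shortcuts
    eb = {(i, i - 1) for i in range(2, n + 1)} | {(j, i) for i, j in shortcuts}
    return ef, eb

# Euler path of G' for Fig. 1(b), inferred from the indices in Example 1.
PATH = list("fabfgbcghcdhie")
E2 = [("a", "c"), ("g", "i"), ("d", "e")]
EF, EB = build_tool_graphs(PATH, E2)
```

With this path, the short-cuts in $E_f$ are the position pairs (2, 7), (2, 10), (5, 13), (8, 13), and (11, 14), in addition to the 13 backbone channels.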
Suppose a source node $s$ wants to multicast a message $M$ to a set $D$ of destinations. We will develop our multicast algorithm based on the two tool graphs $G_f(\pi)$ and $G_b(\pi)$. A multicast request will generate one worm (called the f-worm) on $G_f(\pi)$ and another (called the b-worm) on $G_b(\pi)$. A worm, once initiated, will remain in the same graph until it is retrieved from the network. The algorithm is presented below.

Step 1: From $D$, we construct two sequences $D_f$ and $D_b$, which are the nodes to be visited by the f-worm and b-worm injected into $G_f(\pi)$ and $G_b(\pi)$, respectively. The f-worm will be injected from node $\hat{\pi}_f = (\pi_f, f)$ such that $\pi_f = s$ and $f$ is minimal (i.e., the first occurrence of $s$ in $\pi$). The b-worm, however, will be injected from node $\hat{\pi}_b = (\pi_b, b)$ such that $\pi_b = s$ and $b$ is maximal (i.e., the last occurrence of $s$ in $\pi$). To calculate $D_f$ and $D_b$, for each $x \in D$, we randomly select a $\hat{\pi}_i$ such that $\pi_i = x$. The following rules are used to add $\hat{\pi}_i$ to one of $D_f$ and $D_b$.
(a) If $i < f$, add $\hat{\pi}_i$ to $D_b$ (since $\hat{\pi}_i$ is not reachable from $\hat{\pi}_f$ in $G_f(\pi)$).
(b) If $b < i$, add $\hat{\pi}_i$ to $D_f$ (since $\hat{\pi}_i$ is not reachable from $\hat{\pi}_b$ in $G_b(\pi)$).
(c) Otherwise, $\hat{\pi}_i$ is reachable from both $\hat{\pi}_f$ in $G_f(\pi)$ and $\hat{\pi}_b$ in $G_b(\pi)$; we randomly add $\hat{\pi}_i$ to either $D_b$ or $D_f$.
The above is repeated for all $x$ in $D$. Note that one of the rules must be applicable because of the way $\hat{\pi}_f$ and $\hat{\pi}_b$ are defined.

Step 2: Sort $D_f$ in ascending order (based on the node indices) and then inject an f-worm carrying $D_f$ and $M$ into $G_f(\pi)$ starting from $\hat{\pi}_f$. Also, sort $D_b$ in descending order and inject a b-worm carrying $D_b$ and $M$ into $G_b(\pi)$ starting from $\hat{\pi}_b$.

Step 3: A node $\hat{\pi}_i$ receiving an f-worm or b-worm carrying a sequence, say $D'$, and a message $M$ examines $head(D')$, where $head()$ is a function which returns the first element of $D'$. If $head(D') = \hat{\pi}_l$ such that $\pi_i = \pi_l$, then $\hat{\pi}_i$ makes a copy of $M$ and removes $\hat{\pi}_l$ from $D'$. If the new $D'$ becomes an empty sequence, retrieve the worm from the network and stop; otherwise, go to Step 4.

Step 4: A node $\hat{\pi}_i$ owning an f-worm (resp., b-worm) carrying a sequence $D'$ and a message $M$ needs to select a channel $\langle \hat{\pi}_j, \hat{\pi}_k \rangle \in E_f$ (resp., $\in E_b$) to forward the worm. If this is an f-worm, then channel $\langle \hat{\pi}_j, \hat{\pi}_k \rangle$ must satisfy the following conditions: (i) $\pi_i = \pi_j$, and (ii) $i \le j < k \le l$, where $l$ is the index of $head(D')$ (i.e., $\hat{\pi}_l = head(D')$). The first condition states that the channel is an outgoing channel from $\pi_i$. The second condition guarantees that the f-worm always moves in the forward direction of the Euler path. It also ensures that the f-worm never arrives at a node $\hat{\pi}_k$ that cannot reach the next destination $\hat{\pi}_l$. These conditions can always be satisfied, since the channel $\langle \hat{\pi}_i, \hat{\pi}_{i+1} \rangle$ always exists as a candidate (which is the one used by the earlier algorithm Basic()). Therefore, the f-worm will sequentially visit the nodes in $D_f$ and the correctness of the multicast is ensured. If this is a b-worm, the same condition (i) should hold, but condition (ii) should be changed to $i \ge j > k \ge l$. Similarly, it is guaranteed that the b-worm will progress in the backward direction of the Euler path and sequentially visit the nodes in $D_b$.

Note that although the algorithm is presented in terms of the tool graphs $G_f(\pi)$ and $G_b(\pi)$, a message routed on $\langle \hat{\pi}_j, \hat{\pi}_k \rangle$ is actually delivered on the link $\langle \pi_j, \pi_k \rangle$. The router needs to perform such a mapping. The mapping is 1-to-1 if $\langle \pi_j, \pi_k \rangle \in E_1$, but could be many-to-1 if $\langle \pi_j, \pi_k \rangle \in E_2$, since nodes $\pi_j$ and $\pi_k$ may appear more than once in $\pi$. The implementation details will be discussed later.
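Steps 1, 2, and the f-worm case of Step 4 can be sketched as follows. The helper names (`positions`, `forward_channels`, `split_destinations`, `f_worm_candidates`) and the representation of $\hat{\pi}_i$ by its 1-based position $i$ are assumptions made for illustration. The path and $E_2$ are those of Fig. 2, with the path inferred from the indices used in Example 1.

```python
import random

def positions(path, node):
    """1-based positions at which a physical node occurs in the Euler path."""
    return [k for k, v in enumerate(path, start=1) if v == node]

def forward_channels(path, removed):
    """E_f of Definition 2 as a set of (j, k) position pairs."""
    n = len(path)
    rem = {frozenset(e) for e in removed}
    ef = {(j, j + 1) for j in range(1, n)}
    ef |= {(j, k) for j in range(1, n + 1) for k in range(j + 1, n + 1)
           if frozenset((path[j - 1], path[k - 1])) in rem}
    return ef

def split_destinations(path, s, dests, rng=random):
    """Step 1 (rules a-c) followed by the sorting of Step 2."""
    f, b = positions(path, s)[0], positions(path, s)[-1]
    df, db = [], []
    for x in dests:
        i = rng.choice(positions(path, x))   # pick one occurrence of x
        if i < f:
            db.append(i)                     # rule (a)
        elif i > b:
            df.append(i)                     # rule (b)
        else:
            rng.choice((df, db)).append(i)   # rule (c)
    return f, sorted(df), b, sorted(db, reverse=True)

def f_worm_candidates(path, ef, i, l):
    """Step 4 for an f-worm at position i whose next destination is l."""
    return sorted((j, k) for j, k in ef
                  if path[j - 1] == path[i - 1] and i <= j < k <= l)

PATH = list("fabfgbcghcdhie")
E2 = [("a", "c"), ("g", "i"), ("d", "e")]
EF = forward_channels(PATH, E2)
at_b3 = f_worm_candidates(PATH, EF, i=3, l=8)
```

For an f-worm at $\hat{b}_3$ heading for $\hat{g}_8$, the candidates are the channels (3, 4) and (6, 7), i.e., the worm may move on to f or to c. The b-worm case is symmetric, with condition (ii) replaced by $i \ge j > k \ge l$.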
Example 1 Using Fig. 2 as an example, suppose node $b$ wants to multicast a message to nodes $f$, $g$, and $e$. We may initiate an f-worm with $D_f = [\hat{g}_8, \hat{e}_{14}]$ into $G_f(\pi)$ from $\hat{b}_3$. From $\hat{b}_3$, the f-worm may follow the route $\hat{b}_3 \to \hat{f}_4 \to \hat{g}_5$ or $\hat{b}_3 \to \hat{c}_7 \to \hat{g}_8$. After $g$ receives the f-worm, it removes itself from $D_f$ and makes a copy of $M$. Several routes are then possible for the f-worm to proceed: $\hat{g}_5$ (or $\hat{g}_8$) $\to \hat{i}_{13} \to \hat{e}_{14}$, $\hat{g}_5$ (or $\hat{g}_8$) $\to \hat{h}_{12} \to \hat{i}_{13} \to \hat{e}_{14}$, and $\hat{g}_5 \to \hat{h}_9 \to \hat{c}_{10} \to \hat{d}_{11} \to \hat{e}_{14}$.

Similarly, a b-worm with $D_b = [\hat{f}_1]$ will be injected into $G_b(\pi)$ from $\hat{b}_6$. There are two possible routes for the b-worm: $\hat{b}_6 \to \hat{g}_5 \to \hat{f}_4$ and $\hat{b}_6 \to \hat{a}_2 \to \hat{f}_1$. These possibilities are shown in Fig. 3. □
Figure 3: The possible routing paths in Example 1 for (a) f-worm and (b) b-worm.

This example shows that our multicasting scheme has the potential to provide a lot of adaptivity during routing. One question raised is how a channel should be selected if there is more than one candidate channel. We will use the following channel selection policy:
(a) The selection prefers a channel that is free.
(b) The selection prefers a channel that leads to a shortest path to $head(D')$.
(c) If all candidate channels are busy, we never wait on a channel in the link set $E_2$.
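One way to read the channel selection policy is sketched below. The text does not rank rules (a) and (b) against each other, so the sketch assumes that freeness is tested first and distance breaks ties; the callbacks `is_free`, `hops_left`, and `in_e2` are hypothetical hooks that a router would supply, and the index-difference distance used in the example is only a crude stand-in for a real shortest-path estimate.

```python
def select_channel(candidates, is_free, hops_left, in_e2):
    """Pick a channel among the Step 4 candidates.

    Returns (channel, must_wait). Assumed ordering: rule (a) first,
    then rule (b) among the free channels. If every candidate is busy,
    rule (c) restricts waiting to channels outside E_2; at least one
    such channel exists because the backbone channel is always a candidate.
    """
    free = [c for c in candidates if is_free(c)]
    if free:
        return min(free, key=hops_left), False
    backbone = [c for c in candidates if not in_e2(c)]
    return min(backbone, key=hops_left), True

# Candidates of an f-worm at g (position 5) heading for position 14 in Fig. 2.
CANDIDATES = [(5, 6), (5, 13), (8, 9), (8, 13)]
E2_CHANNELS = {(5, 13), (8, 13)}

def choose(busy):
    """Run the policy with a given set of busy channels."""
    return select_channel(
        CANDIDATES,
        is_free=lambda c: c not in busy,
        hops_left=lambda c: 14 - c[1],   # crude proxy for remaining distance
        in_e2=lambda c: c in E2_CHANNELS,
    )

all_free = choose(busy=set())
shortcuts_busy = choose(busy=E2_CHANNELS)
all_busy = choose(busy=set(CANDIDATES))
```

When everything is free the short-cut (5, 13) is taken; when both short-cuts are busy the worm moves on along (8, 9) without waiting; when every channel is busy the worm waits on (8, 9), which belongs to $E_1$.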
Theorem 1 The multicasting algorithm accompanied with the channel selection policy can correctly perform multiple multicasts and is free from communication deadlock.
Proof. The undirected link set $E_1$ can be partitioned into two directed link sets, one in the forward direction of $\pi$ and the other in the backward direction. The former set is used only by f-worms, and the latter only by b-worms. Clearly, there is no cyclic dependency within either of these link sets. However, adding the link set $E_2$ may create new dependencies, and perhaps cyclic ones. Fortunately, the channel selection policy states that a worm never waits for a busy link in $E_2$. This guarantees that a worm either proceeds along a channel in $E_2$, or proceeds or waits on a channel in $E_1$. If an f-worm is waiting on a channel $\langle \hat{\pi}_{i-1}, \hat{\pi}_i \rangle$ in $E_1$, the channel will eventually be unblocked, by the following induction. First, in $G_f(\pi)$, channel $\langle \hat{\pi}_{n-1}, \hat{\pi}_n \rangle$ cannot be blocked forever. The release of this channel will unblock $\langle \hat{\pi}_{n-2}, \hat{\pi}_{n-1} \rangle$, and so on, which will eventually unblock channel $\langle \hat{\pi}_{i-1}, \hat{\pi}_i \rangle$. A similar argument applies to a b-worm on $G_b(\pi)$. So the liveness of the algorithm is proved. There is no livelock either, as both $G_f(\pi)$ and $G_b(\pi)$ are acyclic. □
2.5 Implementation Considerations
The implementation of multidestination worms has been discussed in several works [17, 20, 23, 30]. Below we discuss some implementation considerations specific to our model. We also address the complexity of implementing the model.

A) Worm Format: A worm should carry the following fields: (i) the previous position $\hat{\pi}_i$, (ii) the sequence $D'$ ($= D_f$ or $D_b$), and (iii) the message $M$. It is preferable to arrange these fields in that order, for the following reason. A wormhole router should make fast routing decisions after reading only the first few flits of a worm. When a node receives a worm, it needs to know the previous position $\hat{\pi}_i$ and the first one or two addresses of $D'$ (depending on whether it is a destination) to make routing decisions.

B) Routing Tables: Each router $x$ needs to keep three tables: (i) one containing all $\hat{\pi}_i$ such that $\pi_i = x$, (ii) one containing all entries $\langle \hat{\pi}_i, \hat{\pi}_{i+1} \rangle$ and $\langle \hat{\pi}_i, \hat{\pi}_{i-1} \rangle$ in link set $E_1$ such that $\pi_i = x$, and (iii) one containing all entries $\langle \hat{\pi}_i, \hat{\pi}_j \rangle$ in link set $E_2$ such that $\pi_i = x$. The sizes of the tables depend on the degree of the node. For networks of small constant degree (such as 2D/3D tori/meshes), the cost would be quite low.

C) Router Design and Worm Propagation: When a node $\hat{\pi}_i$ receives a worm, a table lookup is required to check whether $head(D')$ is in its first table. If so, the node is a destination and $head(D')$ is removed from $D'$. Then another lookup on the second and third tables is needed to select a channel to forward the worm. The node should also update the first field (previous position) to itself ($\hat{\pi}_i$).

D) Startup Overheads: Given a multicast request, a cost to prepare $D_f$ and $D_b$ will be incurred. The processor of the node should do this job and inject the f-worm/b-worm into the router. So the processor should keep the tool graphs $G_f(\pi)$ and $G_b(\pi)$. Note that this should not be confused with the function of the routers, which are only responsible for propagating worms.

E) Reconfiguration of Lookup Tables: The proposed model can be used with any Euler path in $G'$, and there may exist many Euler paths in $G'$. Any change to $G'$ or to the Euler path requires reconfiguring the lookup tables. Dynamically reconfiguring the system is desirable for several reasons, such as fault tolerance and performance optimization for particular applications.
3 Applying to 2-D Tori with Faults
A fault-free 2-D torus is already Eulerian, as each node has a degree of 4. However, when there are some faulty nodes in the network, the Eulerian property may be destroyed. One possible solution, as suggested by our model, is to find a link set $E_2$ whose removal from the network makes the network Eulerian again. In the following, we formulate this problem by viewing the torus as containing multiple faulty clusters, which are defined as follows. Two nodes are regarded as neighbors if their x-indices differ by at most 1 and their y-indices differ by at most 1 (thus a node has 8 neighbors). A faulty cluster is a maximal set of faulty nodes that forms a connected component, in the sense of the neighborhood relationship defined above, in the torus. In graph-theoretical terms, a simple cycle is one that does not have any sub-cycles. A faulty cluster is simple if there is a simple cycle that is fault-free and directly wraps around the cluster and only the cluster. We call this simple cycle the perimeter of the faulty cluster. For instance, the faulty clusters in Fig. 4(a) and (b) are simple, while that in Fig. 4(c) is non-simple.

Figure 4: Simple and non-simple faulty clusters. Perimeters are shown in gray lines.

In the following discussion, we only consider faulty clusters that are simple (the reason will be discussed later). We first present our solution to deal with a single faulty cluster. The result will then be extended to handle multiple faulty clusters.
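The faulty-cluster definition above amounts to finding connected components of faulty nodes under the 8-neighbor relation. The sketch below is an illustration only: the function name `faulty_clusters` is hypothetical, and it assumes that index differences are taken modulo the torus dimensions, so that nodes on opposite borders count as neighbors.

```python
def faulty_clusters(faulty, width, height):
    """Group faulty nodes of a width x height torus into faulty clusters.

    Two nodes are neighbors if their x- and y-indices each differ by at
    most 1; differences are taken modulo the torus size (an assumption).
    """
    faulty = set(faulty)
    seen, clusters = set(), []
    for start in sorted(faulty):
        if start in seen:
            continue
        seen.add(start)
        cluster, stack = [], [start]
        while stack:                       # depth-first search
            x, y = stack.pop()
            cluster.append((x, y))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = ((x + dx) % width, (y + dy) % height)
                    if nb in faulty and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(sorted(cluster))
    return clusters

# Hypothetical fault set on an 8 x 8 torus.
FAULTS = [(1, 1), (2, 2), (5, 5), (0, 7), (7, 0)]
clusters = faulty_clusters(FAULTS, 8, 8)
```

In this example (1, 1) and (2, 2) form one cluster because they are diagonal neighbors, (0, 7) and (7, 0) form another through the wraparound links, and (5, 5) is a cluster by itself. Deciding whether a cluster is simple requires the perimeter and is not covered by this sketch.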
3.1 Single Faulty Cluster We rst point out several important properties of perimeters.
Lemma 2 The length of the perimeter of any simple faulty cluster is even.

Proof. As defined above, the perimeter is a simple cycle. We traverse the cycle starting from any node. When returning to the starting node, we must have traversed the same number of links in the positive-x and negative-x directions, and similarly in the positive-y and negative-y directions. So the lemma follows. □

A node on a perimeter is called a corner node if it falls on a position of the perimeter where a 90-degree turn is made; otherwise it is a non-corner node. For instance, nodes $a$ and $b$ in Fig. 4(a) are corner nodes. The following lemma can be easily observed.
Lemma 3 On the perimeter of a simple faulty cluster, every corner node has an even degree, while every non-corner node has an odd degree.
Lemma 4 On the perimeter of a simple faulty cluster, there must be an even number of corner nodes and an even number of non-corner nodes.
Proof. We can traverse the perimeter starting from any node. The directions of the traversal can be {east, west, north, south}. For any two consecutive links traversed, the direction is either unchanged, or is switched from {east, west} to {north, south} or vice versa. When returning to the starting node, we must switch back to the initial direction, so the number of corner nodes is even. As there is an even number of perimeter nodes (Lemma 2), the number of non-corner nodes is even too. □

In the following, suppose there is only one simple faulty cluster $C$ in the network. We discuss how to construct the link set $E_2$ in two stages.
3.1.1 Stage 1: Making All Nodes' Degrees Even
We run the procedure CF() in Fig. 5 using $C$ as the input. Essentially, CF() traverses the perimeter of $C$ and moves some links to $E_2$ so that all perimeter nodes' degrees become even. Lemma 3 suggests two guidelines to do so: (i) when a non-corner node $x$ is traversed, move exactly one of the perimeter links incident to $x$ to $E_2$, and (ii) when a corner node $x$ is traversed, either move both perimeter links incident to $x$ to $E_2$, or move neither of them. In the procedure, this is reflected by the binary flag $f$; whenever $f = 0$, the corresponding link is moved to $E_2$.

Procedure CF($C$); /* input $C$ = a simple faulty cluster */
begin
  $f := 1$; $E_2 := \emptyset$;
  Let the sequence of links $[e_1, e_2, \ldots, e_p]$ be $C$'s perimeter;
  for $i := 2$ to $p$ do
    if (the node incident to $e_{i-1}$ and $e_i$ is a non-corner node) then $f := \bar{f}$;
    if ($f = 0$) then $E_2 := E_2 \cup \{e_i\}$;
  end for;
end.

Figure 5: Procedure CF() to deal with a single faulty cluster.

To prove that procedure CF() does follow rules (i) and (ii), observe that $f$ toggles between 0 and 1 when a non-corner node is traversed; otherwise, it remains unchanged. This implies that each node incident to $e_i$ and $e_{i+1}$ will have an even degree, $i = 1..p-1$. It remains to prove that the node (called $x$ below) incident to $e_1$ and $e_p$ also has an even degree. If $x$ is a non-corner node, then by Lemma 4 there must be an even number of corners and an odd number of non-corners that have been traversed excluding $x$. This implies that flag $f$ (for $e_p$) must be 0 at the end, so $x$ observes rule (i). Otherwise, if $x$ is a corner node, then there must be an odd number of corners and an even number of non-corners that have been traversed excluding $x$. So flag $f$ must be 1 at the end, implying that $x$ observes rule (ii).
Lemma 5 Given a simple faulty cluster, procedure CF() can construct a link set E2 whose removal from the torus induces a network in which all nodes have even degrees.
For instance, if we execute procedure CF() on the faulty clusters in Fig. 4(a) and (b), the possible results are shown in Fig. 6(a) and (b), respectively. The removed links are shown as dotted lines. In these examples, the perimeters are traversed counter-clockwise and the first links traversed (i.e., $e_1$) are marked (however, any perimeter link can serve as $e_1$).
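The flag mechanism of CF() can be sketched in a few lines of Python. This is only an illustration under an assumed encoding that is not part of the paper: the perimeter is given as a list `is_corner`, where entry `i` describes the node shared by links `i-1` and `i`, and link 0 plays the role of $e_1$. The function names `cf` and `removed_at_node` are hypothetical.

```python
def cf(is_corner):
    """Return the indices of the perimeter links that CF() moves to E2.

    Assumed encoding: perimeter nodes v_0 .. v_{p-1} in traversal order,
    link i joins v_i and v_{(i+1) % p}, so link 0 plays the role of e_1
    and v_i is the node shared by links i-1 and i.
    """
    p = len(is_corner)
    f = 1            # flag for link 0 (e_1), which is never removed
    removed = set()  # E2, stored as a set of link indices
    for i in range(1, p):
        if not is_corner[i]:   # a non-corner node toggles the flag
            f = 1 - f
        if f == 0:             # flag 0 means "move this link to E2"
            removed.add(i)
    return removed


def removed_at_node(removed, p, i):
    """Number of removed perimeter links incident to node v_i."""
    return ((i - 1) % p in removed) + (i in removed)


# Perimeter of a single-node fault: 8 nodes, alternating corner / non-corner.
ring = [True, False] * 4
e2 = cf(ring)
for i, corner in enumerate(ring):
    print(i, "corner" if corner else "non-corner", removed_at_node(e2, len(ring), i))
```

In this run every non-corner node loses exactly one perimeter link and every corner node loses none or both, which is what rules (i) and (ii) require.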
Figure 6: Parts (a) and (b) show the results after executing CF() on the examples in Fig. 4(a) and (b), respectively; part (c) shows the result after executing the Connection-Scheme on (b).
3.1.2 Stage 2: Making the Network Connected

Although procedure CF() generates a network with only even-degree nodes, the removal of $E_2$ may disconnect the network. This is due to at least two reasons: (i) a corner node originally having a degree of 2 is isolated because both links incident to it are removed (e.g., node $b$ in Fig. 6(a) and nodes $s$, $t$, and $u$ in Fig. 6(b)), or (ii) two segments of the perimeter are adjacent and parallel to each other (e.g., the block containing nodes $n_1, \ldots, n_4$ in Fig. 6(b)). In the following, we propose a general solution to the isolation problem. Suppose $G'$ (obtained from $G$ by removing $E_2$) is disconnected and $H_1$ and $H_2$ are two connected components of $G'$. The scheme works in two steps:
Algorithm: Connection-Scheme
1. Find a simple cycle in $G$ which contains at least one node in $H_1$ and at least one node in $H_2$.
2. For each edge $e$ in the cycle, if $e \in E_2$, then delete $e$ from $E_2$; otherwise, add $e$ into $E_2$.

Intuitively, we try to use the cycle to join the two components $H_1$ and $H_2$ together by reversing the $E_2$-membership of its links. After the adjustment, every node still has an even degree. To prove this, for any node $v$ on the cycle, consider the two cycle edges incident to $v$. There are three cases. First, if both edges are in $E_2$, then the degree of $v$ will be increased by 2 after the adjustment. Second, if exactly one of the edges is in $E_2$, then the degree of $v$ will be unchanged. Last, if neither edge is in $E_2$, then the degree of $v$ will be decreased by 2.

For instance, in Fig. 6(b), to connect the isolated node $s$, we can use the cycle consisting of edges $\{e_1, e_2, e_3, e_4\}$. This will result in $e_1$ and $e_4$ being deleted from $E_2$, and $e_2$ and $e_3$ being added into $E_2$. The isolated node $t$ can be treated similarly. The scheme can even be used to connect multiple components together. For instance, in Fig. 6(b), the isolated node $u$ and the isolated block formed by $n_1, \ldots, n_4$ are joined with the rest of the network using only one cycle. The cycles used and the final adjustment result are shown in Fig. 6(c).

One can easily observe that the application of these rules may introduce new isolated components due to the edges newly added into $E_2$. While this is true, our experiments and experience indicate that this approach is general enough to solve most of the isolation problems.
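Step 2 of the Connection-Scheme is a symmetric difference between $E_2$ and the edge set of the chosen cycle. The sketch below illustrates this on a fault-free 4 × 4 torus; the graph encoding (links as `frozenset` pairs), the function names, and the two unit-square cycles are assumptions made for the example, not taken from the paper. It checks only that degree parities are preserved; it does not search for a suitable cycle or test connectivity.

```python
def connection_scheme(e2, cycle_nodes):
    """Flip the E2-membership of every edge on a simple cycle.

    cycle_nodes lists the cycle's nodes in order; edges are frozensets.
    """
    k = len(cycle_nodes)
    cycle_edges = {frozenset((cycle_nodes[i], cycle_nodes[(i + 1) % k]))
                   for i in range(k)}
    return e2 ^ cycle_edges  # symmetric difference


def degree_parities(nodes, edges, e2):
    """Parity of each node's degree in the network with E2 removed."""
    parity = {v: 0 for v in nodes}
    for e in edges - e2:
        for v in e:
            parity[v] ^= 1
    return parity


# A fault-free 4 x 4 torus (hypothetical test network).
n = 4
nodes = [(r, c) for r in range(n) for c in range(n)]
edges = set()
for r, c in nodes:
    edges.add(frozenset(((r, c), (r, (c + 1) % n))))
    edges.add(frozenset(((r, c), ((r + 1) % n, c))))

# Start with E2 = the edges of one unit square, then flip an overlapping square.
square_a = [(0, 0), (0, 1), (1, 1), (1, 0)]
square_b = [(0, 1), (0, 2), (1, 2), (1, 1)]
e2 = connection_scheme(set(), square_a)
new_e2 = connection_scheme(e2, square_b)
print(len(e2), len(new_e2))
print(set(degree_parities(nodes, edges, new_e2).values()))
```

The shared link between the two squares leaves $E_2$ while the other three links of the second square enter it, and all degree parities stay even, matching the three-case argument above.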
3.2 Extension to Multiple Faulty Clusters

In procedure CF(), to deal with one faulty cluster, we only move some links on the perimeter to $E_2$. All other parts of the torus are unaffected. Thus, if no two perimeters overlap, CF() can be applied directly to each of the faulty clusters to make all nodes' degrees even. Below we discuss some problems that may need to be taken care of after executing CF().

First, an isolation problem similar to the one discussed earlier may occur if two faulty clusters are too close. For instance, Fig. 7(a) shows two faulty clusters in a network. After executing CF() on each of them, a $2 \times 2$ block between them is disconnected from the rest of the network. One remedy is to use the Connection-Scheme in Section 3.1.2 to modify the link set $E_2$; one such possibility is shown in Fig. 7(b) using the cycle drawn in gray.

Figure 7: (a) execution of CF() on two faulty clusters, and (b) adjustment after using the Connection-Scheme.

When the perimeters of two faulty clusters overlap, directly applying CF() will not work, as inconsistent decisions may be made about which links to move to $E_2$. Fig. 8(a) demonstrates such a dilemma. Below we propose a general approach to solve this problem, given two simple faulty clusters $C_1$ and $C_2$ whose perimeters overlap.
1. Consider the path, say $P$, that is common to the perimeters of both $C_1$ and $C_2$. We call the healthy nodes in $P$, excluding the two endpoints, transiently faulty nodes.
2. Join the transiently faulty nodes and the faulty clusters $C_1$ and $C_2$ together into a larger faulty cluster (which must be simple) and run procedure CF() on it.
3. Construct a simple cycle which contains $P$ and run the Connection-Scheme in Section 3.1.2 to make the network connected.

For instance, Fig. 8(a) shows the result after executing CF() by combining all transiently and permanently faulty nodes into one large faulty cluster. Fig. 8(b) shows how to join the transiently faulty nodes with the rest of the network using a cycle.
Figure 8: (a) execution of CF() by combining transiently and permanently faulty nodes into one large cluster, and (b) execution of Connection-Scheme to join the transiently faulty nodes with the rest of the network.

Finally, we comment that although the above scheme is presented for two faulty clusters, it is not hard to extend it to deal with multiple faulty clusters.
3.3 What Our Model Can and Can't Do?

The above discussion has shown that our model can be applied to a torus with one, or even more, simple faulty clusters. Our formulation requires that the perimeters of faulty clusters be simple cycles. To see why our approach cannot handle perimeters which are non-simple cycles, observe the example in Fig. 4(c); it is impossible to construct an Eulerian subnetwork because there are at least three nodes, $n$, $p$, and $q$, having degrees of 1, thus violating Lemma 1.

Even with such a limitation, simple faulty clusters are still powerful enough to represent a large group of common fault patterns. For instance, the most frequently seen failure is probably the single-node fault (e.g., the one on the top of Fig. 4(a)), which is obviously simple under our definition. Another commonly seen fault pattern which is also simple is the block fault, where the faulty nodes form a rectangle. Block faults are very likely to happen because the typical layout of a torus partitions it into sub-meshes, each implemented on a printed-circuit board, and a board or several consecutive boards tend to fail at the same time.

Many faulty clusters with regular shapes are also simple. Examples include the L-shape, T-shape, and +-shape clusters (this can be easily proved by observing their perimeters; the one on the bottom of Fig. 4(a) is an L-shape faulty cluster). Other regular patterns, such as U-shape and H-shape faulty clusters, are not guaranteed to be simple, since they may have one or more "dead-ends" (i.e., a path of healthy nodes surrounded by faults, such as nodes $n$, $p$, and $q$ in Fig. 4(c)). Excluding these cases, a U- or H-shape faulty cluster is very likely to be simple. In addition to faulty clusters with regular shapes, many irregular fault patterns are also solvable under our approach (e.g., the cluster in Fig. 4(b)). So our model can deal with a very broad range of fault patterns, thus significantly improving over the approaches in [3, 4, 7], which restrict fault patterns to be rectangular.

If unfortunately a faulty cluster is non-simple, one remedy is to sacrifice (deactivate) some healthy nodes neighboring the cluster (by regarding them as faulty) to make the perimeter a simple cycle. For example, by regarding nodes $n$, $p$, and $q$ in Fig. 4(c) as faulty, the faulty cluster will become simple. Although this is somewhat undesirable, the result is still better than restricting faulty clusters to be rectangular.
4 Extensions to Tori of Higher Dimensions with Faults

An $n$-D torus is Eulerian as each node has an even degree, $2n$. So our model can be directly applied to it. When some nodes are down, it is very likely that our model can still be applied to keep the network alive. To demonstrate such possibilities, we present a link removal strategy for a torus of any dimension with faults. The following lemma is a generalization of what the earlier procedure CF() has done.

Lemma 6 Given a sequence of nodes $S = (x_0, x_1, \ldots, x_{p-1})$ in a network such that
(a) $p$ is even, and
(b) there are an even number of even-degree nodes (and thus an even number of odd-degree nodes) in $S$,
it is possible to add and/or delete some links $(x_i, x_{(i+1) \bmod p})$, $0 \le i \le p-1$, to and from the network to make all nodes in $S$ of even degrees.

Proof. As in CF(), we still traverse $S$ sequentially. However, there are some modifications. First, each pair $(x_i, x_{i+1})$ is associated with a binary flag $f$ (we omit saying "mod $p$"). The meaning of $f$ is as follows:

\[
f = \begin{cases}
1 & \text{leave the link } (x_i, x_{i+1}) \text{, if any, unchanged} \\
0 & \text{reverse the link } (x_i, x_{i+1}) \text{ (see the note below)}
\end{cases}
\qquad (1)
\]

By "reverse", we mean that if $(x_i, x_{i+1})$ is a link in the network, then delete it; otherwise, add such a link to the network. Second, while traversing $S$, flag $f$ is set according to the following rules:
(i) If $x_i$ is an even-degree node, then the $f$'s associated with $(x_{i-1}, x_i)$ and $(x_i, x_{i+1})$ should be the same.
(ii) If $x_i$ is an odd-degree node, then the $f$'s associated with $(x_{i-1}, x_i)$ and $(x_i, x_{i+1})$ should be distinct.

Rule (i) guarantees that the degree of an even-degree node changes by an even amount ($0$ or $\pm 2$), while rule (ii) guarantees that the degree of an odd-degree node changes by exactly 1 (up or down). The proof then follows similarly to that of Lemma 5. We leave the details to the reader. An example is shown in Fig. 9 with a sequence $S$ of 10 nodes. The initial flag $f$ for $(x_0, x_1)$ is arbitrarily set to 1. □

We comment that condition (a) in Lemma 6 is always true in an $n$-D torus. Below, for readability, we first present our solution for a 3-D torus with only faulty blocks. Then we will consider faulty clusters (of irregular shapes) and further discuss in general the solution for any $n$-D torus.
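The flag rules in the proof of Lemma 6 translate directly into code. The following is a minimal sketch, assuming the network is stored as a set of `frozenset` links and that the nodes of $S$ are distinct; the names `degree` and `make_even` and the two tiny test graphs are hypothetical and chosen only to exercise both deletion and addition of links.

```python
def degree(edges, v):
    """Degree of node v in an undirected graph stored as a set of frozensets."""
    return sum(1 for e in edges if v in e)


def make_even(edges, seq):
    """Apply the flag rules of Lemma 6 to the cyclic node sequence seq.

    Returns a new edge set in which every pair flagged 0 is reversed
    (deleted if present, added otherwise). Assumes the nodes of seq are
    distinct and that seq contains an even number of odd-degree nodes.
    """
    p = len(seq)
    deg = {v: degree(edges, v) for v in seq}  # degrees before any change
    f = 1                                     # flag of the pair (x_0, x_1)
    to_reverse = set()
    for i in range(1, p):
        if deg[seq[i]] % 2 == 1:   # rule (ii): flags around an odd node differ
            f = 1 - f              # rule (i): otherwise the flag is kept
        if f == 0:
            to_reverse.add(frozenset((seq[i], seq[(i + 1) % p])))
    return edges ^ to_reverse      # symmetric difference = "reverse"


def links(*pairs):
    return {frozenset(p) for p in pairs}


# A 4-cycle with one chord: nodes 0 and 2 have odd degree.
g1 = links((0, 1), (1, 2), (2, 3), (3, 0), (0, 2))
r1 = make_even(g1, [0, 1, 2, 3])
print(sorted(degree(r1, v) for v in range(4)))

# Two disjoint links: all four nodes are odd, so links must be added.
g2 = links((0, 1), (2, 3))
r2 = make_even(g2, [0, 1, 2, 3])
print(sorted(degree(r2, v) for v in range(4)))
```

Degrees are read once before any link is reversed, which mirrors the proof: the flags depend only on the original parities of the nodes in $S$.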
Figure 9: An illustrative example for the proof of Lemma 6.
4.1 3-D Torus with Faulty Blocks

Without loss of generality, we denote a faulty block $B$ in a 3-D torus by identifying its two antipodal nodes, $(x, y, z)$ and $(x', y', z')$. That is, all nodes $(i, j, k)$ such that $x \le i \le x'$, $y \le j \le y'$, and $z \le k \le z'$ are inside the block (for notational simplicity, we omit saying "mod", which is necessary whenever wrapping around occurs). We will remove some links from the surface of $B$, which is defined to be the block $B'$ excluding the block $B$, where block $B'$ is identified by the antipodal nodes $(x-1, y-1, z-1)$ and $(x'+1, y'+1, z'+1)$. Intuitively, the surface contains the healthy nodes and links directly wrapping around $B$. Observing that only nodes on $B$'s surface may have odd degrees, we propose the following link removal procedure:

Step 1: For each $i = z, \ldots, z'$, consider the rectangle formed by the four corners $(x-1, y-1, i)$, $(x'+1, y-1, i)$, $(x'+1, y'+1, i)$, and $(x-1, y'+1, i)$. The rectangle forms a simple cycle on the surface of $B$. The cycle satisfies the pre-conditions of Lemma 6 (the proof is trivial), so it is possible to use Lemma 6 to remove some links from the cycle to make all its nodes' degrees even. See the example in Fig. 10. After this step, all surface nodes, except those on the top and bottom, have even degrees.

Step 2: For each $i = x, \ldots, x'$, consider the rectangle formed by the four nodes $(i, y-1, z-1)$, $(i, y'+1, z-1)$, $(i, y'+1, z'+1)$, and $(i, y-1, z'+1)$. Again, the rectangle forms a simple cycle, which satisfies the pre-conditions of Lemma 6 (we leave the proof to the reader). Apply Lemma 6 on each of these cycles. See the example in Fig. 10.

Figure 10: (a) A 3-D torus containing a $2 \times 1 \times 3$ faulty block, (b) link removal after step 1, and (c) link removal after step 2.
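The two steps can be checked on a small instance. The sketch below builds a 6 × 6 × 6 torus with a 2 × 1 × 3 faulty block, applies the flag rule of Lemma 6 to each rectangle of Step 1 and Step 2, and lists the healthy nodes that still have odd degree. The torus size, the block position, and all function names are assumptions made for this illustration; the torus is chosen large enough that the surface does not wrap onto itself, so the "mod" arithmetic omitted in the text is also omitted here.

```python
from itertools import product

K = 6  # torus side length (assumed; large enough to avoid wrap-around on the surface)


def torus_edges(faulty):
    """Links of a K x K x K torus that join two healthy nodes."""
    edges = set()
    for v in product(range(K), repeat=3):
        for axis in range(3):
            w = list(v)
            w[axis] = (w[axis] + 1) % K
            w = tuple(w)
            if v not in faulty and w not in faulty:
                edges.add(frozenset((v, w)))
    return edges


def rectangle(lo_a, hi_a, lo_b, hi_b):
    """Boundary of [lo_a, hi_a] x [lo_b, hi_b] as a cyclic list of (a, b)."""
    bottom = [(a, lo_b) for a in range(lo_a, hi_a)]
    right = [(hi_a, b) for b in range(lo_b, hi_b)]
    top = [(a, hi_b) for a in range(hi_a, lo_a, -1)]
    left = [(lo_a, b) for b in range(hi_b, lo_b, -1)]
    return bottom + right + top + left


def make_even(edges, seq):
    """Flag rule of Lemma 6: the flag toggles at odd-degree nodes; flag 0 reverses a link."""
    deg = {v: sum(1 for e in edges if v in e) for v in seq}
    f, flip = 1, set()
    for i in range(1, len(seq)):
        if deg[seq[i]] % 2:
            f = 1 - f
        if f == 0:
            flip.add(frozenset((seq[i], seq[(i + 1) % len(seq)])))
    return edges ^ flip


(x0, x1), (y0, y1), (z0, z1) = (2, 3), (2, 2), (1, 3)  # a 2 x 1 x 3 block
faulty = set(product(range(x0, x1 + 1), range(y0, y1 + 1), range(z0, z1 + 1)))
edges = torus_edges(faulty)

# Step 1: one rectangle per xy-plane at height i, for i = z .. z'.
for i in range(z0, z1 + 1):
    cyc = [(a, b, i) for a, b in rectangle(x0 - 1, x1 + 1, y0 - 1, y1 + 1)]
    edges = make_even(edges, cyc)

# Step 2: one rectangle per yz-plane at position i, for i = x .. x'.
for i in range(x0, x1 + 1):
    cyc = [(i, a, b) for a, b in rectangle(y0 - 1, y1 + 1, z0 - 1, z1 + 1)]
    edges = make_even(edges, cyc)

healthy = [v for v in product(range(K), repeat=3) if v not in faulty]
odd = [v for v in healthy if sum(1 for e in edges if v in e) % 2]
print("odd-degree healthy nodes:", odd)
```

The check covers degree parity only; whether the resulting network stays connected is a separate question that this sketch does not address.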
4.2 3-D Torus with Faulty Clusters

It is helpful to summarize what has been done above: we remove some links from the surface of a faulty block, first along $xy$-planes and then along $yz$-planes. Similarly, for a faulty cluster of any shape, we define its surface to be the healthy nodes and links directly wrapping around the cluster.

Definition 4 In a 3-D torus, a faulty cluster is said to be simple if its surface satisfies the following: for each $xy$-plane and each $yz$-plane, the intersection (if any) between the plane and the surface consists of only simple cycle(s).

For instance, the faulty cluster in Fig. 11(a) is simple. Now suppose there is a simple faulty cluster $C$ in a 3-D torus. We remove links from $C$'s surface as follows:
Figure 11: (a) A 3-D torus containing an irregular faulty cluster, (b) link removal after step 1, and (c) link removal after step 2.
Step 1: For each $xy$-plane, consider the simple cycles (if any) obtained from the intersection between the plane and the surface of $C$. For each node in each simple cycle, we consider only its degree summed over the $x$ and $y$ axes. Apply Lemma 6 on the cycle to make all its nodes' degrees, summed over only the $x$ and $y$ axes, even. An example is shown in Fig. 11(b). After Step 1, the surface has the following property.

Lemma 7 After step 1, on each simple cycle (if any) obtained from the intersection between a $yz$-plane and the surface of $C$, there are an even number of odd-degree nodes and thus an even number of even-degree nodes.

Proof. (Sketch) This can be proved by considering the simple cycles on each $yz$-plane and observing how each line parallel to the $z$-axis intersects these cycles. Each such line must intersect a cycle at 2, or a multiple of 2, nodes. □

Step 2: For each $yz$-plane, consider the simple cycles (if any) obtained from the intersection between the plane and the surface of $C$. For each node in a simple cycle, consider its total degree summed over all axes. As guaranteed by Lemma 7, we can apply Lemma 6 on each cycle to fix the nodes' degrees. (See the example in Fig. 11(c).)
4.3 $n$-D Torus with Faulty Clusters

First, we need the concept of planes. In an $n$-D torus, an $(i, j)$-plane, $1 \le i, j \le n$, is a hyperplane consisting of nodes which have common indices along each dimension $k$ such that $k \ne i$ and $k \ne j$. Still, we define the surface of a faulty cluster in an $n$-D torus to be the nodes and links directly wrapping around the cluster. We only consider faulty clusters that are simple in the sense that for each $(i, i+1)$-plane, $i = 1, \ldots, n-1$, the intersection (if any) between the plane and the surface of the faulty cluster consists of only simple cycles.

Suppose there is a simple faulty cluster $C$ in an $n$-D torus. The link removal procedure consists of $n-1$ steps as follows ($i = 1, \ldots, n-1$):

Step $i$: For each $(i, i+1)$-plane, consider the simple cycles (if any) obtained from the intersection between the plane and $C$'s surface. For each node on each simple cycle, we consider its degree summed over only dimensions $1, 2, \ldots, i+1$. Apply Lemma 6 on the cycle to fix the nodes' degrees.

Two important properties hold after step $i$. First, all surface nodes will have even degrees summed over dimensions $1, 2, \ldots, i+1$. Second, on each $(i+1, i+2)$-plane, for each simple cycle (if any) obtained from the intersection between the plane and $C$'s surface, there are an even number of odd-degree nodes, and likewise an even number of even-degree nodes, where degrees are summed over dimensions $1, 2, \ldots, i+2$. The proof is similar to that of Lemma 7. This inductively guarantees the applicability of Lemma 6 in step $i+1$. At the end, all surface nodes will have even degrees. Fig. 12 shows an example in a 4-D torus containing a $1 \times 1 \times 2 \times 1$ faulty block; after three steps, which remove links on (1,2)-, (2,3)-, and (3,4)-planes respectively, all surface nodes will have even degrees.
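To make the notion of an $(i, j)$-plane concrete, the following sketch enumerates the nodes of one such plane and lists the plane types visited by steps 1 to $n-1$. The function names and the dictionary `fixed`, which supplies the common index for every dimension other than $i$ and $j$, are hypothetical conventions introduced for this example; dimensions are numbered from 1 as in the text.

```python
from itertools import product


def plane_nodes(dims, i, j, fixed):
    """Nodes of the (i, j)-plane of a torus with side lengths dims.

    Dimensions are numbered from 1. `fixed` maps every dimension k other
    than i and j to the index shared by all nodes of the plane.
    """
    ranges = [range(dims[k - 1]) if k in (i, j) else [fixed[k]]
              for k in range(1, len(dims) + 1)]
    return list(product(*ranges))


def step_planes(n):
    """The (i, i+1) pairs used by steps 1 .. n-1 of the link removal procedure."""
    return [(i, i + 1) for i in range(1, n)]


dims = (4, 4, 4, 4)                        # a 4-D torus with 4 nodes per side
plane = plane_nodes(dims, 2, 3, {1: 0, 4: 3})
print(len(plane), plane[:3])
print(step_planes(len(dims)))
```

For a 4-D torus this yields the (1,2)-, (2,3)-, and (3,4)-planes, in the order used in the Fig. 12 example.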
5 Applying to Meshes with Faults

Although meshes are generally considered close relatives of tori, they are not node-symmetric. The main problem limiting the applicability of the Euler-path-based model to meshes is that some boundary nodes may have odd degrees. Below, we show how to remove some links on, or close to, the boundary of a mesh so that the new network becomes Eulerian. One nice, direct implication of doing so is that all techniques presented earlier for tori can be easily applied to meshes.

Figure 12: (a) a 4-D torus containing a $1 \times 1 \times 2 \times 1$ faulty block, and (b)-(d) the link removal after steps 1 to 3, respectively.

Given a 2-D mesh, consider the cycle on its boundary. One easily observes that, on this cycle, only the four corner nodes have even degrees. Thus, we can run procedure CF() on the cycle to remove some links (see the example in Fig. 13(a)). Note that the isolation problem may still occur (e.g., at the four corner nodes). We can apply the Connection-Scheme to solve this problem (the result is in Fig. 13(b)).

Figure 13: Applying to an $8 \times 8$ mesh: (a) after executing CF() on the boundary, and (b) after applying the Connection-Scheme.

For an $n$-D mesh, consider the surface of the mesh. Clearly, the surface is simple (recall the definition in Section 4). It is easy to extend the link removal scheme for $n$-D tori to this case: (1) apply Lemma 6 on the cycle obtained from the intersection of each (1,2)-plane and the surface, by considering nodes' degrees summed over only dimensions 1 and 2, (2) apply Lemma 6 on the cycle obtained from the intersection of each (2,3)-plane and the surface, by considering nodes' degrees summed over only dimensions 1 to 3, and so on until the $(n-1, n)$-planes.

Lemma 8 Given any $n$-D mesh in which each side has at least four nodes, it is possible to construct a set of links whose removal from the mesh will make the network Eulerian.
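The 2-D case can be tried out directly. The sketch below builds an 8 × 8 mesh, walks its boundary cycle with a CF()-style flag that toggles at odd-degree nodes, and checks the resulting degree parities. The starting corner, the traversal direction, and the function names are assumptions of this example, so the set of removed links need not coincide with the one drawn in Fig. 13(a); connectivity is not examined.

```python
def mesh_edges(n):
    """Links of an n x n mesh (no wrap-around)."""
    edges = set()
    for r in range(n):
        for c in range(n):
            if c + 1 < n:
                edges.add(frozenset(((r, c), (r, c + 1))))
            if r + 1 < n:
                edges.add(frozenset(((r, c), (r + 1, c))))
    return edges


def boundary_cycle(n):
    """Boundary nodes of the mesh in traversal order, starting at a corner."""
    top = [(0, c) for c in range(n - 1)]
    right = [(r, n - 1) for r in range(n - 1)]
    bottom = [(n - 1, c) for c in range(n - 1, 0, -1)]
    left = [(r, 0) for r in range(n - 1, 0, -1)]
    return top + right + bottom + left


def cf_on_cycle(edges, cycle):
    """CF()-style traversal: the flag toggles at odd-degree (non-corner) nodes."""
    deg = {v: sum(1 for e in edges if v in e) for v in cycle}
    f, e2 = 1, set()
    for i in range(1, len(cycle)):
        if deg[cycle[i]] % 2 == 1:
            f = 1 - f
        if f == 0:
            e2.add(frozenset((cycle[i], cycle[(i + 1) % len(cycle)])))
    return e2


n = 8
edges = mesh_edges(n)
cycle = boundary_cycle(n)
e2 = cf_on_cycle(edges, cycle)
remaining = edges - e2
all_nodes = {u for e in edges for u in e}
odd = [v for v in all_nodes if sum(1 for e in remaining if v in e) % 2]
print(len(cycle), len(e2), odd)
```

Every non-corner boundary node loses exactly one boundary link, so its degree drops from 3 to 2, while interior nodes keep degree 4.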
6 Simulation Results

We have developed a simulator based on the process-based CSIM library [26] to evaluate the effectiveness of our Euler-path-based model. The simulated platform is a $16 \times 16$ torus. No virtual channels are used. Each router has only one flit buffer per outgoing link. The latency to transmit a message in a wormhole-routed network typically consists of two costs [22]: startup time and transmission time. We assume the startup time to initiate a multicast to be $t_s = 5$ µs, and the transmission time to deliver a flit on a link to be $t_c = 0.02$ µs. Source nodes, destination nodes, and faulty nodes (if any) are all generated randomly. A node issues multicast requests at a given arrival rate per unit time (here we use $t_c = 0.02$ µs as a unit). The latency of a multicast is measured from the time it is issued until all destination nodes have received the multicast message. We use the algorithm in [25] to construct an Euler path in the network. Below we present our observations on the performance behavior from several aspects. All results presented are averages over 100 simulations.
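For readers who want to experiment, an Euler circuit of a connected network with all-even degrees can be built with Hierholzer's algorithm, a standard method that may differ in detail from the one in [25]. The sketch below is an illustration only: the adjacency-list encoding, the function names, and the fault-free 4 × 4 torus used as input are assumptions of this example and are unrelated to the simulator described above.

```python
def euler_circuit(adj, start):
    """Hierholzer's algorithm on a connected undirected graph whose nodes
    all have even degree. adj maps each node to a list of neighbours."""
    remaining = {v: list(ns) for v, ns in adj.items()}  # unused links per node
    stack, circuit = [start], []
    while stack:
        v = stack[-1]
        if remaining[v]:
            w = remaining[v].pop()
            remaining[w].remove(v)   # mark the link v-w as used at both ends
            stack.append(w)
        else:
            circuit.append(stack.pop())  # v has no unused links left
    return circuit


def torus_adj(n):
    """Adjacency lists of a fault-free n x n torus (n >= 3)."""
    return {(r, c): [(r, (c + 1) % n), (r, (c - 1) % n),
                     ((r + 1) % n, c), ((r - 1) % n, c)]
            for r in range(n) for c in range(n)}


circuit = euler_circuit(torus_adj(4), (0, 0))
print(len(circuit), circuit[0], circuit[-1])
```

The returned list visits every link exactly once and starts and ends at the chosen node; different neighbour orderings give different circuits, which is one reason the choice of Euler path can influence performance.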
A) Effects of Numbers of Sources and Destinations: Fig. 14 shows the average multicast latency at various combinations of sources and destinations. The network is assumed to be fault-free. Each source node uses an arrival rate of 0.01. We observe the latency by varying the number of destinations and the number of sources in Fig. 14(a) and (b), respectively. As can be observed from the curves, with a fixed number of sources, the latency increases only sub-linearly with respect to the number of destinations, while with a fixed number of destinations, the latency increases closer to linearly with respect to the number of sources. As a result, the latency is more sensitive to the number of sources than to the number of destinations.

Figure 14: The communication latency at various numbers of sources and destinations. [Plots of latency (µs) versus (a) the number of destinations, for 4 to 64 sources, and (b) the number of sources, for 2 to 64 destinations; arrival rate = 0.01, worm length = 6 flits, t_s = 5 µs, t_c = 0.02 µs.]

B) Effects of Worm Length: We also adjust the message length to observe how it affects the communication latency. The result is shown in Fig. 15 with the number of sources = 1 and 16, and the worm length = 4 to 64 flits. In Fig. 15(a), with a single source, congestion never occurs, so the pipeline effect of wormhole routing is properly reflected by the linear increase of latency with respect to worm length. In Fig. 15(b), with 16 sources, congestion could occur, especially when the multicast messages are long. Longer worms will worsen the situation. This is reflected by the points where the latency increases sharply. So networks with multi-destination capability are more appropriate for multicasting shorter messages than longer messages.

Figure 15: Communication latency vs. worm length: (a) single source and (b) 16 sources. [Plots of latency (µs) versus worm length (flits), for 2 to 32 destinations; arrival rate = 0.005, t_s = 5 µs, t_c = 0.02 µs.]

C) Throughput: In a $16 \times 16$ torus, there are 256 nodes, so the maximum traffic that might be injected into the network is 256 flits per unit time. In Fig. 16, by adjusting the arrival rate, we show the latency at different traffic loads (in terms of the total number of flits injected into the network per unit time). In the simulation, all nodes are source nodes and the number of destinations is fixed at 8 for each multicast. At about 130 flits injected per unit time (which is about half of the maximum traffic), the network starts to become saturated.

Figure 16: Latency vs. injected traffic per unit time. [Plot of latency (µs) versus total number of flits injected per unit time; number of destinations = 8, number of sources = 256, t_s = 5 µs, t_c = 0.02 µs.]

D) Effects of Numbers of Faults: One important application of our model is to deal with faults in a torus network. We randomly generate faults in the torus and observe the effect of faults on communication latency. The results are shown in Fig. 17, with the number of sources = 1 and 64. Generally speaking, the number of faults does not have much effect on the communication latency. This nice property indicates the appropriateness of our model for fault-tolerant multicasting.

E) Fault-Tolerant Capability: Recall our algorithm for constructing link set $E_2$. Our model always works as long as the removal of $E_2$ does not partition the network. It is interesting to see how resilient our model is without the necessity of deactivating healthy nodes. In this simulation, given an integer $f$, we randomly generate $f$ faulty nodes in the network and observe the probability, $p(f)$, that an Eulerian subnetwork cannot be found using the proposed technique. Fig. 18 shows the probability at different values of $f$. Each data point is obtained from 30,000 tests. Surprisingly, when $f \le 4$, the failure rate is 0%. When $f \le 14$, the rate is still below 1%. Suppose the probability that the system has $f$ faulty nodes is $q(f)$. Then the probability that the system has $f$ faulty nodes and our model is unavailable (unable to find a good link set $E_2$) is $p(f) \times q(f)$, which should be pretty small. As in reality the value of $f$ tends to be small, our model should be resilient enough for use in most practical situations.
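The product $p(f)\,q(f)$ covers one particular fault count. As a worked step, assuming the events "exactly $f$ nodes are faulty" are mutually exclusive and exhaustive for $f = 0, \ldots, N$ (where $N$, the number of nodes, is a symbol introduced here for illustration), the law of total probability gives the overall unavailability:

```latex
\[
  P_{\text{unavailable}} \;=\; \sum_{f=0}^{N} p(f)\, q(f)
\]
```

Since the simulation reports $p(f) = 0$ for $f \le 4$, only the terms with $f \ge 5$ contribute to this sum, and those are weighted by the presumably small probabilities $q(f)$ of having many simultaneous faults.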
Figure 17: Communication latency vs. the number of faults in the torus: (a) 1 source and (b) 64 sources. [Plots of latency (µs) versus number of faults, for 2 to 64 destinations; arrival rate = 0.01, worm length = 6 flits, t_s = 5 µs, t_c = 0.02 µs.]

Figure 18: The probability that our scheme fails to work at various numbers of randomly generated faults in a $16 \times 16$ torus. [Plot of failure rate p(f), from 30,000 tests, versus number of faults f.]
7 Conclusions

We have presented a new multicasting model that can be applied to any network that is Eulerian or becomes Eulerian after some links are removed. The model does not rely on the existence of virtual channels. We have shown the strength of this model by applying it to tori/meshes of any dimension with faults. Many regular and irregular fault patterns are shown to be tolerable by our model. The result improves over existing fault-tolerant routing algorithms for meshes/tori in at least one of the following aspects: the number of faults tolerable, the shape of fault patterns, the number of deactivated healthy nodes, the requirement of support of virtual channels, and the range of network topologies acceptable.

This paper has placed more emphasis on the development of the Euler-path-based model, and thus many issues are left for future research. First, we have not discussed how to select an Euler path, which could significantly affect the performance behavior, such as congestion and hot-spot factors. Second, the length of the Euler path will affect the distance a worm travels. Intuitively, shorter Euler paths would be better, as more links will be left for use as shortcuts. However, minimizing the path length is equivalent to finding a Hamiltonian path in a general graph, which is known to be NP-complete. Last, given a multicast request, we have limited our attention to injecting only two worms (f-worm and b-worm) into the network. It would be interesting to study the possibility of injecting more worms as in [20] or even multiple levels of worms as in [12, 17, 23].
References

[1] K. M. Al-Tawil, M. Abd-El-Barr, and F. Ashraf. A survey and comparison of wormhole routing techniques in mesh networks. IEEE Network, 11(2):38-45, 1997.
[2] D. P. Bertsekas, C. Ozveren, G. D. Stamoulis, P. Tseng, and J. N. Tsitsiklis. Optimal communication algorithms for hypercubes. J. of Parallel and Distrib. Comput., 11:263-275, 1991.
[3] R. V. Boppana and S. Chalasani. Fault-tolerant wormhole routing algorithms for mesh networks. IEEE Trans. on Comput., 44(7):848-864, July 1995.
[4] Y. M. Boura and C. R. Das. Fault-tolerant routing in mesh networks. In Int'l Conf. on Parallel Processing, pages I-106-109, 1995.
[5] S. Chalasani and R. V. Boppana. Communication in multicomputers with nonconvex faults. In EURO-PAR Conf., pages 673-684, 1995. (Also in IEEE Trans. on Comput., 46(5):616-622, May 1997.)
[6] G. I. Chen and T. H. Lai. Constructing parallel paths between two subcubes. IEEE Trans. on Comput., 41(1):118-123, Jan. 1992.
[7] K.-H. Chen and G.-M. Chiu. Fault-tolerant routing algorithm for meshes without using virtual channels. Technical report, Dept. Elec. Engr. and Tech., Nat'l Taiwan Inst. of Tech., 1997.
[8] L. D. Coster, N. Dewulf, and C.-T. Ho. Efficient multi-packet multicast algorithms on meshes with wormhole and dimension-ordered routing. In Int'l Conf. on Parallel Processing, pages 137-141, 1995.
[9] W. J. Dally and C. L. Seitz. The torus routing chip. J. of Parallel and Distrib. Comput., 1(3):187-196, 1986.
[10] J. Duato. A theory of fault-tolerant routing in wormhole networks. In Int'l Conf. on Paral. and Distrib. Sys., pages 600-607, 1994.
[11] J. Duato. A theory of deadlock-free adaptive multicast routing in wormhole networks. IEEE Trans. on Paral. and Distrib. Sys., 6(9):976-987, Sep. 1995.
[12] K.-P. Fan and C.-T. King. Turn grouping for efficient multicast in wormhole mesh networks. In Symp. of Frontiers of Massively Parallel Computation, pages 50-57, 1996.
[13] C. J. Glass and L. M. Ni. Maximally fully adaptive routing in 2D meshes. In Int'l Conf. on Parallel Processing, pages 101-104, 1992.
[14] C. J. Glass and L. M. Ni. Fault-tolerant wormhole routing in meshes without virtual channels. IEEE Trans. on Paral. and Distrib. Sys., 7(6):620-636, June 1996.
[15] S. L. Johnsson and C. T. Ho. Optimal broadcasting and personalized communication in hypercubes. IEEE Trans. on Comput., 38(9):1249-68, Sep. 1989.
[16] T.-Y. Juang, Y.-C. Tseng, and M.-H. Yang. An Euler-path-based multicasting model for wormhole-routed networks: Its applications to damaged 2D tori and meshes. In Int'l Performance, Computing, and Communications Conf., pages 444-450, 1997.
[17] R. Kesavan and D. K. Panda. Minimizing node contention in multiple multicast on wormhole k-ary n-cube networks. In Int'l Conf. on Parallel Processing, 1996.
[18] Y. Lan, A.-H. Esfahanian, and L. M. Ni. Multicast in hypercube multiprocessors. J. of Parallel and Distrib. Comput., 8:30-41, 1990.
[19] R. Libeskind-Hadas, K. Watkins, and T. Hehre. Fault-tolerant multicast routing in the mesh with no virtual channels. In High-Performance Computer Arch. Conf., pages 180-190, 1996.
[20] X. Lin, P. K. McKinley, and L. M. Ni. Deadlock-free multicast wormhole routing in 2D mesh multicomputers. IEEE Trans. on Paral. and Distrib. Sys., 5(8):793-804, Aug. 1994.
[21] P. K. McKinley, H. Xu, A.-H. Esfahanian, and L. M. Ni. Unicast-based multicast communication in wormhole-routed networks. IEEE Trans. on Paral. and Distrib. Sys., 5(12):1252-65, Dec. 1994.
[22] L. M. Ni and P. K. McKinley. A survey of wormhole routing techniques in direct networks. IEEE Computer, 26:62-76, Feb. 1993.
[23] D. K. Panda, S. Singal, and P. Prabhakaran. Multidestination message passing mechanism conforming to base wormhole routing scheme. In Parallel Computer Routing and Communication Workshop, pages 131-145, 1994. LNCS, No. 853.
[24] W. Qiao and L. M. Ni. Adaptive routing in irregular networks using cut-through switches. In Int'l Conf. on Parallel Processing, pages I-52-60, 1996.
[25] K. H. Rosen. Discrete Mathematics and its Applications. McGraw-Hill, New York, 1995.
[26] H. Schwetman. CSIM users' guide. Technical Report ACT126-90, MCC.
[27] J.-P. Sheu and M.-Y. Su. A multicast algorithm for hypercube multiprocessors. In Int'l Conf. on Parallel Processing, pages III-18-22, 1992.
[28] G. D. Stamoulis and J. N. Tsitsiklis. Efficient routing schemes for multiple broadcasts in hypercubes. IEEE Trans. on Paral. and Distrib. Sys., 4(7):725-739, July 1993.
[29] Y.-C. Tseng, T.-Y. Juang, and M.-C. Du. Some heuristics and experiments for building a multicasting tree in a high-speed network. In High Performance Computing - Asia, pages 248-253, 1997.
[30] Y.-C. Tseng, D. K. Panda, and T.-H. Lai. A trip-based multicasting model in wormhole-routed networks with virtual channels. IEEE Trans. on Paral. and Distrib. Sys., 7(2):138-150, Feb. 1996.
[31] S.-Y. Wang, Y.-C. Tseng, and C.-W. Ho. Efficient multicast in wormhole-routed 2D mesh/torus multicomputers: A network-partitioning approach. In Symp. of Frontiers of Massively Parallel Computation, pages 42-49, 1996.