Network Performance under Physical Constraints

Fabrizio Petrini and Marco Vanneschi
Dipartimento di Informatica, Universita di Pisa
Corso Italia 40, 56125 Pisa, Italy
tel +39 50 887228, fax +39 50 887226
e-mail: fpetrini,[email protected]

Abstract

The performance of an interconnection network in a massively parallel architecture is subject to physical constraints whose impact needs to be re-evaluated from time to time. Fat-trees and low-dimensional cubes have raised great interest in the scientific community in the last few years and are emerging standards in the design of interconnection networks for massively parallel computers. In this paper we compare the communication performance of these two classes of interconnection networks using a detailed simulation model. The comparison is made using a set of synthetic benchmarks, taking into account physical constraints, such as pin and bandwidth limitations, and the router complexity. In our experiments we consider two networks with 256 nodes, a 16-ary 2-cube and a 4-ary 4-tree.

1 Introduction

Fat-trees and low-dimensional cubes are emerging standards in the design of interconnection networks for parallel machines. Fat-trees have been adopted by many research prototypes and commercial machines [1]. The data network of the Connection Machine CM-5 uses two distinct fat-trees [2] and is composed of routing chips that have either two or four parent connections. The Data Diffusion Machine (DDM) is a virtual shared memory architecture that implements a hierarchical COMA cache coherence protocol in the internal switches of a fat-tree [3]. The communication chip Elite is the basic building block of the Meiko CS-2 network [4]. This network takes the form of a quaternary fat-tree. Its design is based on a multistage network and has the property that the overall communication bandwidth remains constant at each level. Other references to fat-trees include [5] [6]. Unfortunately, not much is known about the communication performance of fat-trees. Most of the literature deals with the CM-5 and focuses on raw network performance [7] [8] [9].

Thanks to their simplicity and expandability, low-dimensional cubes have been adopted as interconnection networks by many massively parallel machines. In the Stanford DASH there are two distinct cubes that support the cache coherence mechanisms [10]: one is dedicated to the requests and the other to the replies, in order to avoid deadlocks caused by the coherency protocols. Other important academic prototypes that use low-dimensional cubes are Alewife [11] and the J-Machine [12]. This list also includes many of the most popular commercial machines. The Cray T3D [13] and T3E [14] use a three-dimensional cube, and the topology of both the Intel Delta and Paragon is a bi-dimensional cube [15].

A fair comparison of the communication performance of these machines is not an easy task because they all have different technological characteristics. On the other hand, theoretical models of the interconnection network often prove overly simplistic and are not able to capture important performance aspects [16] [17]. In this paper we try to face this problem with a detailed simulation model, using a set of synthetic benchmarks representative of shared memory computations and common parallel algorithms. Our experiments are conducted on a quaternary fat-tree and a bi-dimensional cube, whose communication performance is properly equalized taking into account physical limitations such as the router complexity, wire delay and density [18]. This paper is an attempt to compare apples with apples: with our simulation model we try to eliminate all implementation-dependent details and to compare the essential features of the two interconnection networks.

The remainder of this paper is organized as follows. Sections 2 and 3 overview the two families of interconnection networks, the k-ary n-trees and the k-ary n-cubes. Section 4 describes the relevant details of the simulation model and Section 5 presents the methodology that we use to normalize the physical characteristics of the two interconnection networks in order to make a fair comparison. Sections 6 and 7 introduce the main characteristics of the experimental results and the traffic patterns used as benchmarks. The performance of a quaternary fat-tree and a bi-dimensional cube, both with 256 nodes, is displayed separately in Sections 8 and 9 and compared in Section 10. An overview of the experimental results and some concluding remarks are given in Section 11.

2 k-ary n-trees

Figure 1: The structure of a fat-tree. Processors are located at the leaves, while internal nodes contain switches. At the root there are some external connections available to recursively build a bigger network or to interface it to the external world.

The fat-tree is an indirect interconnection network based on a complete binary tree. Unlike traditional trees in computer science, fat-trees resemble real trees, because they get thicker near the root. A set of processors is located at the leaves of the fat-tree and each edge of the underlying tree corresponds to a bidirectional channel between parent and child. The arity of the internal switches of the fat-tree increases as we get closer to the root: this makes the physical implementation of these switches unfeasible. For this reason some alternative constructions have been proposed that use building blocks with fixed arity [19]. These solutions trade connectivity for simplicity: incoming messages at a given switch in a "full" fat-tree may have more choices in the routing decision than in a corresponding network with fixed-arity switches.

k-ary n-trees [20] are a particular subclass of the fat-trees and borrow from the k-ary n-butterflies [21] the topology of the internal switches. A k-ary n-tree has k^n leaf nodes and n levels of k^(n-1) switches. Each switch has 2k links. A 4-ary 2-tree is shown in Figure 2.

Figure 2: A 4-ary 2-tree. A packet can follow any minimal path passing through a nearest common ancestor of source and destination.

Minimal adaptive routing between a pair of nodes on a k-ary n-tree can be easily accomplished by sending the packet to one of the common roots or nearest common ancestors (NCA) of source and destination, and from there to the destination. That is, each packet experiences two phases: an ascending adaptive phase to get to one of the NCAs, followed by a descending deterministic phase.

The performance of a wormhole network can be enhanced by mapping two or more virtual channels on each physical channel [22]. In our experiments we examine three variants of the adaptive algorithm, with one, two and four virtual channels. They simply pick the least loaded link in the ascending and descending phases, that is, the link that has the maximum number of free virtual channels (a fair choice is made when several links are in a similar state).
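The two-phase routing above can be illustrated with a small sketch. The snippet below, a hedged illustration and not the simulator's code, computes how far up the tree a packet must ascend before it can turn down, assuming leaves are labelled 0..k^n-1 and the label is read as n base-k digits, most significant digit first; the function name and labelling convention are ours.

def nca_level(src: int, dst: int, k: int, n: int) -> int:
    """Number of upward hops before the packet can start its descending
    phase: the first (most significant) base-k digit position where the
    source and destination labels differ, counted from the root side."""
    src_digits = [(src // k ** (n - 1 - i)) % k for i in range(n)]
    dst_digits = [(dst // k ** (n - 1 - i)) % k for i in range(n)]
    for i in range(n):
        if src_digits[i] != dst_digits[i]:
            # the labels agree above position i, so the NCAs sit n - i levels up
            return n - i
    return 0  # src == dst, no network traversal needed


if __name__ == "__main__":
    # 4-ary 2-tree (16 leaves): leaves 0 and 1 share a first-level switch,
    # while leaves 0 and 15 must ascend to the top level.
    assert nca_level(0, 1, k=4, n=2) == 1
    assert nca_level(0, 15, k=4, n=2) == 2
    print("minimal path length between 0 and 15:", 2 * nca_level(0, 15, 4, 2))

The minimal path length between two distinct leaves is then twice the NCA level, one ascending and one descending hop per level.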

3 k-ary n-cubes

Figure 3: A 5-ary 2-cube.

A k-ary n-cube is characterized by its dimension n and radix k, and has a total of k^n nodes. The k^n nodes are organized in an n-dimensional grid, with k nodes in each dimension and wrap-around connections. The binary hypercube is a special case of k-ary n-cube with k = 2. Also, the two-dimensional torus is another special case with n = 2. Figure 3 shows an example of a k-ary n-cube.

Routing algorithms on the k-ary n-cubes are deadlock-prone and require sophisticated strategies for deadlock avoidance. In this paper we compare two algorithms, each offering a different degree of adaptivity: deterministic [23] and minimal adaptive based on Duato's methodology [24] [25]. The deterministic algorithm is a dimension-order routing based on a static channel dependency graph. Packets are sent to their destination along a unique minimal path. The potential deadlocks caused by the wrap-around connections are avoided by doubling the number of virtual channels and creating two distinct virtual networks. Packets enter the first virtual network and switch to the second virtual network upon crossing a wrap-around connection. Our version of the deterministic algorithm uses four virtual channels for each physical link (two channels for each virtual network).

Rather than using a static channel dependency graph, Duato's methodology only requires the absence of cyclic dependencies on a connected channel subset. In our adaptive algorithm, based on this methodology, we associate four virtual channels with each link: on two of these channels, called adaptive channels, packets can be routed along any minimal path between source and destination. In the remaining two channels, called deterministic or escape channels, packets are routed deterministically when the adaptive choice is limited by network contention [26]. An interesting characteristic of this algorithm is that, once in the escape channels, packets can re-enter the adaptive channels, that is, the channel allocation policy is non-monotonic.

A central point of this algorithm is the interface between the processor and the router. We assume that packets can enter the network through a single injection or memory channel placed between the processor and the router. This limitation, known as source throttling, makes the network throughput stable when the network operates above saturation [27] [28].
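The following minimal sketch illustrates the dimension-order rule with the two virtual networks described above. It is an assumption-laden illustration, not the simulator's implementation: node coordinates, the tie-breaking rule and the virtual-network flag convention are ours.

def dimension_order_step(current, dest, k, vnet):
    """Return (dimension, direction, new_vnet) for the next hop, or None if
    the packet has arrived. Packets start in virtual network 0 and move to
    virtual network 1 after crossing a wrap-around link."""
    for dim, (c, d) in enumerate(zip(current, dest)):
        if c == d:
            continue  # this dimension is already resolved
        # take the shorter way around the ring of size k
        forward = (d - c) % k
        backward = (c - d) % k
        direction = +1 if forward <= backward else -1
        # crossing the wrap-around (e.g. from k-1 to 0) switches virtual network
        crosses_wrap = (c == k - 1 and direction == +1) or (c == 0 and direction == -1)
        return dim, direction, 1 if (vnet == 1 or crosses_wrap) else 0
    return None  # packet has arrived


if __name__ == "__main__":
    # 16-ary 2-cube: routing from (15, 3) towards (1, 3); the first hop wraps
    # around in dimension 0 and moves the packet to virtual network 1.
    print(dimension_order_step((15, 3), (1, 3), k=16, vnet=0))  # (0, 1, 1)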

4 Relevant details of the network model

This section presents a router model and a simulation environment, which are used in the following sections to analyze the performance of the k-ary n-trees and the k-ary n-cubes under various traffic loads and flow control strategies. This model is evaluated in the SMART (Simulator of Massive ARchitectures and Topologies) environment [29].

Figure 4: The internal structure of a routing switch.

Figure 4 outlines the internal structure of a routing switch. We can distinguish the external channels or links, the input and output buffers or lanes that implement the buffer space of the virtual channels, and an internal crossbar. The switch has bidirectional channels and each channel, in a single direction, is logically
composed of three interfaces: a data path that transmits messages at the flit level; the ready lines, which flag the presence of a flit on the data path and specify the virtual channel where the flit is to be stored; and the ack lines in the reverse direction, which send an acknowledgment every time buffer space is released in the input lanes. The processing nodes have a compatible interface with the same number of virtual channels.

A flit is moved from an output lane to the corresponding input lane in a neighboring node in Tlink cycles, when there is at least one free buffer position. Each output lane has an associated counter that is initialized with the total number of buffers in the input lane; it is decremented after sending a flit and incremented upon receiving an acknowledgment. When multiple lanes are enabled, an arbiter picks one of them according to a fair policy.

When a header flit reaches the top of an input lane, the routing algorithm tries to establish a path in the crossbar with a suitable output lane, that is, one that is neither full nor bound to another input lane. This path remains active until the transmission of the tail flit of the packet. Our model allows the routing of a single header every Trouting cycles. Although a physical link services in each direction at most one virtual channel every Tlink cycles, multiple virtual channels can be active at the input and output ports of the crossbar. The internal flit propagation takes Tcrossbar cycles. Every time a flit is moved from an input lane to the corresponding output lane, a feedback is sent back to the neighboring switch or node to update the counter of free positions.

Each node generates packets of 64 bytes and the destinations are distributed uniformly or according to a static communication pattern, as explained in more detail in the following sections. The simulator collects performance data only after 2000 cycles, to allow the network to reach steady state, and each simulation is halted after 20000 cycles.
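The per-lane counter described above is a credit-based flow control mechanism; the sketch below illustrates it under stated assumptions (class and method names are ours, and the default buffer depth of four flits follows Section 5), not as the SMART simulator's code.

class OutputLane:
    def __init__(self, buffer_depth: int = 4):
        self.credits = buffer_depth  # free positions in the remote input lane

    def can_send(self) -> bool:
        return self.credits > 0

    def send_flit(self) -> None:
        assert self.can_send(), "no buffer space at the neighbour"
        self.credits -= 1  # one more flit buffered (or in flight) downstream

    def receive_ack(self) -> None:
        self.credits += 1  # the neighbour released one buffer position


if __name__ == "__main__":
    lane = OutputLane(buffer_depth=4)
    for _ in range(4):
        lane.send_flit()
    print(lane.can_send())   # False: the remote input lane is full
    lane.receive_ack()
    print(lane.can_send())   # True: one position was released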

5 Performance Normalization

In this study we would like to compare networks with the same number of processing nodes and routing chips. If we consider a k-ary n-tree with parameters (k1, n1) and a k-ary n-cube with parameters (k2, n2), we require that k1^n1 = k2^n2 (same number of processors) and n1 k1^(n1-1) = k2^n2 (same number of routing chips). These equations imply k1 = n1, and the total number of processing nodes is N = k1^k1. A 4-ary 4-tree and a 16-ary 2-cube satisfy these conditions, so we will consider these two networks in the experimental evaluation.

A fair comparison of interconnection networks should also take into account physical constraints such as the pin count, wire delay, bisection width [18] and the router complexity [30]. In our experiments we normalize the communication performance by setting the flit and data path size to two bytes on the fat-tree and to four bytes on the cube. If we consider a 4-ary 4-tree and a 16-ary 2-cube, this normalization can be interpreted in the following ways. Technological constraints limit the number of pins on a given chip. In a quaternary fat-tree the arity of the routing switches is eight, while the arity of the routing chip in a bi-dimensional cube is four, if we do not consider the connection with the local processing node. By doubling the data paths on the cube we obtain the same pin count on both routing chips. Both k-ary n-trees and k-ary n-cubes have n k^n links. The quaternary fat-tree has twice as many links as the bi-dimensional cube, and our normalization equalizes the overall (peak) communication bandwidth.

There is another important consideration: with this normalization the two networks have the same theoretical upper bound under uniform traffic. For the k-ary n-cubes this upper bound corresponds to twice the bisection bandwidth¹. k-ary n-trees are not bisection-bandwidth limited and the upper bound is simply the unidirectional bandwidth of the links connecting the processing nodes to the network switches.

¹ The network capacity can be determined by considering that 50% of the uniform random traffic crosses the bisection of the network. Thus, if a cube has bisection bandwidth B, each of the N nodes can inject 2B/N traffic at the maximum load.
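As a hedged sanity check of the normalization above, the counting below recomputes nodes, routers, links and peak bandwidth for the two 256-node networks used in the paper; the 2-byte and 4-byte flit widths come from the text, everything else is elementary arithmetic rather than simulator code.

k1, n1 = 4, 4      # k-ary n-tree
k2, n2 = 16, 2     # k-ary n-cube

tree_nodes = k1 ** n1                 # k1^n1 leaf processors
cube_nodes = k2 ** n2                 # k2^n2 processors
tree_routers = n1 * k1 ** (n1 - 1)    # n1 levels of k1^(n1-1) switches
cube_routers = k2 ** n2               # one router per node

tree_links = n1 * k1 ** n1            # n*k^n links in a k-ary n-tree
cube_links = n2 * k2 ** n2            # n*k^n links in a k-ary n-cube

tree_bw = tree_links * 2              # 2-byte data paths on the fat-tree
cube_bw = cube_links * 4              # 4-byte data paths on the cube

assert tree_nodes == cube_nodes == 256
assert tree_routers == cube_routers == 256
assert tree_links == 2 * cube_links   # the fat-tree has twice as many links
assert tree_bw == cube_bw             # equal peak bandwidth after normalization
print("normalization holds for 256 nodes")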

Other important parameters are the router complexity and the wire delays. Adaptive algorithms have more degrees of freedom but require larger crossbars and more complex arbitration, so these advantages are often offset by increased clock cycles. Chien [30] has proposed a cost model to make fair comparisons between routing algorithms. It can be applied to evaluate the router delays of the deterministic and the minimal adaptive algorithms for the cubes and of the adaptive algorithms for the fat-trees outlined in Section 2. This model has gained consideration in several performance studies [31] [32]. It assumes a 0.8 micron CMOS gate array technology for the implementation of the routing chip. The three delays Trouting, Tcrossbar and Tlink are computed as follows.

Routing a message involves address decoding, routing decision and header selection. According to [30], the routing decision has a delay that grows logarithmically with the number of alternatives, or degrees of freedom, offered by the routing algorithm. Denoting by F the degrees of freedom, the model estimates the routing delay as

    Trouting = 4.7 + 1.2 log F ns.    (1)

The time required to transfer a flit from an input channel to the corresponding output channel is the sum of the delay of the internal flow control unit, the delay of the crossbar and the set-up time of the output channel latch. The crossbar delay grows logarithmically with the number of ports P. Therefore the crossbar time is

    Tcrossbar = 3.4 + 0.6 log P ns.    (2)

The time required to transmit a flit across a physical link includes the wire delay and the time required to latch it at destination. Low-dimensional cubes such as the two-dimensional ones can be easily embedded in three-dimensional space with constant-length wires. If virtual channels are used, the virtual channel controller has a delay logarithmic in the number of virtual channels V. The delay of links with short wires is estimated by the model as

    Tlink^s = 5.14 + 0.6 log V ns.    (3)

In our experiments the deterministic algorithm uses four virtual channels, as does the minimal adaptive algorithm based on Duato's methodology, in which there are two adaptive and two escape channels on each link. Both input and output buffers can contain four flits. Thus both algorithms map four virtual channels on each physical channel (V = 4) and the internal crossbar has four inputs from each link plus an injection channel from the local node (P = 17). The only difference is the routing delay, which is influenced by the degree of adaptivity. In the deterministic routing we have only two virtual channels available in a single direction (F = 2). With the adaptive algorithm the number increases to six (F = 6): four adaptive channels in two directions plus two deterministic channels.

         Trouting   Tcrossbar   Tlink^s   Tclock
Det.        5.9        5.85       6.34     6.34
Duato       7.8        5.85       6.34     7.8

Table 1: Delays of the two routing algorithms for the cube, expressed in nanoseconds.

         Trouting   Tcrossbar   Tlink^m   Tclock
1 vc        8.06       5.2        9.64     9.64
2 vc        9.26       5.8       10.24    10.24
4 vc       10.46       6.4       10.84    10.84

Table 2: Delays of the three variants of the adaptive algorithm for the fat-tree, expressed in nanoseconds.

We can now apply this model to find the clock cycles induced by the three routing algorithms. These results are summarized in Table 1. In the deterministic algorithm the limiting factor is the link delay, while the routing delay is the bottleneck of the minimal adaptive algorithm. In both cases the clock cycle is set to the maximum of the three delays. When we embed a quaternary fat-tree with 256 nodes in three-dimensional space, some wires are inevitably longer than others. The delay of links with medium-length wires [32] is estimated by the model as

    Tlink^m = 9.64 + 0.6 log V ns.    (4)

In our experiments we consider three variants of the adaptive algorithm, with one, two and four virtual channels. As in the cubes, the input and output buffers can contain up to four flits. The values of P and F can be directly computed from the number of virtual channels. The degree of freedom F of a packet in the ascending phase is (2k - 1) V, because it can take any of the ascending or descending links, and the crossbar size P is 2k V. Table 2 reports the delays of the three flow control strategies. From these results we can see that 4-ary 4-trees with one and two virtual channels are wire-limited, and the virtual channels have no impact on the clock cycle. With four virtual channels the gap between the routing and the link delays is narrow; with more virtual channels the routing complexity becomes the limiting factor. In our experiments the delays Trouting, Tcrossbar and Tlink are equalized to a single clock cycle, which is set to the maximum of the three delays, the link delay.
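The small sketch below recomputes Tables 1 and 2 from equations (1)-(4). Base-2 logarithms are assumed, and the two-decimal rounding is ours, so the last digit may differ slightly from the tables; it is an illustrative check, not part of the simulator.

from math import log2

t_routing  = lambda F: 4.7  + 1.2 * log2(F)   # equation (1)
t_crossbar = lambda P: 3.4  + 0.6 * log2(P)   # equation (2)
t_link_s   = lambda V: 5.14 + 0.6 * log2(V)   # equation (3), short wires
t_link_m   = lambda V: 9.64 + 0.6 * log2(V)   # equation (4), medium wires

# 16-ary 2-cube (Table 1): V = 4, P = 17; F = 2 (deterministic), F = 6 (Duato)
for name, F in [("Det.", 2), ("Duato", 6)]:
    delays = (t_routing(F), t_crossbar(17), t_link_s(4))
    print(name, [round(d, 2) for d in delays], "Tclock =", round(max(delays), 2))

# 4-ary 4-tree (Table 2): F = (2k-1)*V and P = 2k*V with k = 4
for V in (1, 2, 4):
    delays = (t_routing(7 * V), t_crossbar(8 * V), t_link_m(V))
    print(f"{V} vc", [round(d, 2) for d in delays], "Tclock =", round(max(delays), 2))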

6 Experimental Results

The performance of an interconnection network under dynamic load is usually assessed by two quantitative parameters, the accepted bandwidth or throughput and the latency. Accepted bandwidth is defined as the sustained data delivery rate given some offered bandwidth at the network input. Two important characteristics are the saturation point and the sustained rate after saturation. Saturation is defined as the minimum offered bandwidth at which the accepted bandwidth is lower than the global packet creation rate at the source nodes. It is worth noting that, before saturation, offered and accepted bandwidth are the same. The behavior above saturation is important because the network and/or the routing algorithm can become unstable, with a consequent performance degradation. We usually expect the accepted bandwidth to remain stable after saturation, both in the presence of bursty applications that require peak performance for a short period of time and of applications that operate above saturation in normal conditions, e.g. when executing a global permutation pattern.

The network latency is the average delay spent by a packet in the network, from the insertion of the header flit in the injection lane until the reception of the tail flit at the destination. It does not include the source queuing delay. The end-to-end latency rises to infinity above saturation, and it is impossible to gain any information in this case. For this reason, the network latency is often preferred to analyze the network performance.

The first sets of experimental results of each traffic pattern are presented according to the Chaos Normal Form² (CNF). The CNF uses two graphs, one to display the accepted bandwidth and the other to display the network latency. In both graphs the x-axis corresponds to the offered bandwidth normalized with respect to the maximum bandwidth that can be accepted under uniform traffic. In the final comparison we use absolute measurement units because we compare routing algorithms that involve different implementations and, consequently, different clock cycles.
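To make the saturation definition concrete, the snippet below finds the first offered load whose accepted bandwidth falls below the packet creation rate. It is only an illustration of the definition; the sample numbers are invented for the example and are not simulation results.

def saturation_point(offered, accepted, tolerance=1e-3):
    """Return the first offered load whose accepted bandwidth is lower than
    the offered load itself (i.e. lower than the global creation rate)."""
    for o, a in zip(offered, accepted):
        if a < o - tolerance:
            return o
    return None  # the network never saturates in the measured range


offered  = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
accepted = [0.1, 0.2, 0.3, 0.38, 0.39, 0.39]   # made-up curve that flattens out
print(saturation_point(offered, accepted))      # 0.4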

7 Message Generation

In our model each node generates packets according to the following traffic patterns. To describe these patterns, let each node (p_0, p_1, ..., p_{n-1}) of the k-ary n-cube or k-ary n-tree also be labelled with a number in base k resulting from the concatenation of the p_i. The binary representation of p_0 p_1 ... p_{n-1} is a_0 a_1 ... a_{(n log2 k)-1}³.

² See http://www.cs.washington.edu/research/projects/lis/chaos/www/presentation.html for more details on the presentation of simulation results of network routing studies.

³ We will assume that k is a power of two and n is even.

Also, let ā_i denote the complement of bit a_i, that is, the complement of 0 is 1 and the complement of 1 is 0.

Uniform traffic. Destinations are chosen at random with equal probability among the processing nodes.

Complement traffic. Each node sends only to the destination given by ā_0 ā_1 ... ā_{(n log2 k)-1}.

Bit reversal. Each node sends only to the destination given by a_{(n log2 k)-1} a_{(n log2 k)-2} ... a_0.

Transpose. Each node sends only to the destination given by a_{(n/2 log2 k)} a_{(n/2 log2 k)+1} ... a_{(n log2 k)-1} a_0 a_1 ... a_{(n/2 log2 k)-1}.

These traffic patterns illustrate different features. The uniform one is a standard benchmark used in network routing studies. This generation pattern can be considered representative of well-balanced shared memory computations. In the complement traffic all the packets cross the bisection of the network. Bit reversal and transpose are important because they occur in practical computations [21].
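The four destination mappings can be written down compactly. The sketch below is a hedged illustration for a network with 2^B nodes, where B = n log2 k and a_0 is the most significant address bit; the function names are ours, not the simulator's.

import random

def uniform(node: int, B: int) -> int:
    return random.randrange(2 ** B)                  # any node, independent of the source

def complement(node: int, B: int) -> int:
    return (2 ** B - 1) ^ node                       # flip every address bit

def bit_reversal(node: int, B: int) -> int:
    return int(format(node, f"0{B}b")[::-1], 2)      # a_{B-1} ... a_0

def transpose(node: int, B: int) -> int:
    bits = format(node, f"0{B}b")
    return int(bits[B // 2:] + bits[:B // 2], 2)     # swap the two halves of the address


if __name__ == "__main__":
    B = 8                                            # 256 nodes, as in the paper
    src = 0b00000001
    print(complement(src, B), bit_reversal(src, B), transpose(src, B))
    # complement: 254, bit reversal: 128, transpose: 16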

8 Fat-Tree

Figure 5: Communication performance of a 4-ary 4-tree with adaptive routing and one, two and four virtual channels.

Under uniform traffic the adaptive routing algorithm saturates at 36% of the capacity with 1 virtual channel, 55% with 2 virtual channels and 72% with 4 virtual channels, as shown in Figure 5 a). In all cases the post-saturation behavior is stable, with a constant throughput for any offered bandwidth. These results confirm the importance of the flow control strategy. Wormhole-routed fat-trees with a single virtual channel do not achieve good throughput, due to blocking problems. When a packet is stopped at an intermediate switch on the descending phase⁴, all the links on the path from the node/switch where the tail flit is stored to the current switch are blocked. Other packets could profitably use these links. In fact, with 4 virtual channels the accepted bandwidth doubles, reaching a considerable 72%. This comes at a price: the sharing of the links between two or more packets slightly increases the network latency at moderate loads.

The CNF of the complement traffic shows an interesting behavior. As can be seen in Figure 5 c), the saturation point is around 95% of the capacity for all flow control strategies. This permutation pattern does not create any congestion in the descending phase. The use of more than one virtual channel is counterproductive in terms of network latency (Figure 5 d): this is mainly due to the link multiplexing, which increases the tail latency. At steady state, there are as many packets in progress as the number of virtual channels in each link. The network latency with 1 virtual channel remains stable until the offered load is 70% of the capacity and experiences a minor increase of the head latency after this point.

⁴ The adaptive routing algorithm used in the experiments can have conflicts on the descending phase only.

With 2 virtual channels the head latency has a similar behavior, while the tail latency converges to the upper bound after 70% of the capacity. The complement traffic belongs to a wide class of permutations that map a k-ary n-tree onto itself. These permutations do not generate any congestion on the descending phase and are called congestion-free [33].

Bit reversal and transpose permutations have a similar distribution of the destinations in terms of distance. It can be easily noted, looking at the numerical representation shown in Section 7, that in both cases there are k^(n/2) nodes at distance 0 (that is, source and destination are on the same node) and (k-1) k^(n/2+i-1) nodes at distance n+2i, for i in {1, ..., n/2}. The average distance d_m is given by

    d_m = ((k-1)/k^(n/2+1)) * sum_{i=1}^{n/2} (n+2i) k^i.    (5)

For a 4-ary 4-tree d_m = 7.125, which is very close to the network diameter. The performance results of these communication patterns are very similar. From the CNF of the transpose shown in Figure 5 e) we can see that the saturation points are at 33%, 60% and 78% of the capacity with 1, 2 and 4 virtual channels. An analogous behavior for the bit reversal can be seen in Figure 5 g).
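As a hedged cross-check of the distance distribution and of the reconstructed closed form (5), the brute-force count below recomputes the distances of the bit reversal pattern on a 4-ary 4-tree. The labelling convention (leaves under the same low-level switch share their most significant base-k digits) is an assumption of this illustration.

from math import log2
from collections import Counter

def tree_distance(src: int, dst: int, k: int, n: int) -> int:
    """Hops in a k-ary n-tree: twice the level of the nearest common ancestor."""
    level = 0
    while src != dst:
        src //= k          # strip one least significant base-k digit per level
        dst //= k
        level += 1
    return 2 * level

def bit_reversal(node: int, bits: int) -> int:
    return int(format(node, f"0{bits}b")[::-1], 2)

k, n = 4, 4
bits = n * int(log2(k))
N = k ** n

dists = [tree_distance(s, bit_reversal(s, bits), k, n) for s in range(N)]
print(Counter(dists))      # 16 nodes at distance 0, 48 at distance 6, 192 at distance 8
print(sum(dists) / N)      # 7.125
closed = (k - 1) / k ** (n // 2 + 1) * sum((n + 2 * i) * k ** i for i in range(1, n // 2 + 1))
print(closed)              # 7.125, matching equation (5)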

8.1 Discussion

The congestion-free communication patterns are an important characteristic of the k-ary n-trees. They can be routed with optimal performance using a simple routing algorithm and flow control strategy. They are analogous to local communication in direct topologies such as the k-ary n-cubes. The results obtained on the complement traffic generalize to the whole class of congestion-free patterns and are expected to scale with the number of nodes, with an accepted bandwidth that approximates the network capacity. Message latency is only influenced by the flow control overhead and can be deterministically estimated with tight upper bounds.

The remaining communication patterns, uniform, bit reversal and transpose, generate congestion in the descending phase and are very sensitive to the flow control strategy. They all saturate at about 35-40% of the capacity with 1 virtual channel, 55-60% with 2 virtual channels and around 75% with 4 virtual channels. From these results we can argue that the expected performance of different permutation patterns is mainly influenced by the flow control strategy. Also, in all these cases, switching from 1 to 4 virtual channels doubles the accepted bandwidth.

9 The Cube

Figure 6: Communication performance of a 16-ary 2-cube with deterministic and minimal adaptive routing.

We can now compare the performance of the deterministic and the minimal adaptive algorithm. In Figure 6 a) we can see that the adaptive algorithm saturates at 80% of the capacity while the deterministic one stops at 60%. The network latency is low for both algorithms: before saturation it is stable at about 70 cycles, and at saturation packets spend on average 150 and 130 cycles in the network with the deterministic and the adaptive algorithm, respectively.

In the complement traffic all packets are reflected across the logical center of the network. Each packet traverses the network bisection, halving the theoretical upper bound. Looking at Figure 6 c), we can see that the throughput of the deterministic algorithm is very close to optimality at 47% of the capacity, while the minimal adaptive algorithm experiences an early saturation at 35% of the capacity. This phenomenon has been observed in [34] too. The complement is unusual in that dimension-order routing helps prevent conflicts. This behavior is confirmed in Figure 6 d), where we can see a wide gap between the network latencies at medium loads.

In the transpose traffic the destination of each packet is a reflection of the source along the diagonal. This causes a continuous area of congestion along this diagonal and on the opposite corners of the logically flattened torus. The adaptive algorithm provides better performance in this case, with 50% of the capacity, more than twice that of the deterministic one.

Unlike the previous traffic patterns, bit reversal has no easy geometric interpretation. There are 16 nodes that have a palindrome bit string and do not inject any packet into the network. They generate some underloaded areas that are located along or near the two main diagonals according to a symmetric layout. As in the previous case, the adaptive algorithm provides better performance both in terms of throughput and network latency. The saturation points are 60% and 20% of the capacity.

Figure 7: Normalized communication performance of a 16-ary 2-cube and a 4-ary 4-tree.

10 The Two Networks Compared

We are now ready to compare the two interconnection networks using the cost model introduced in Section 5. The raw data already shown in Sections 8 and 9 are filtered to take into account the router complexity and the wire delay.


In Figure 7 a) we can see that the bi-dimensional cube outperforms the quaternary fat-tree under uniform traffic. Duato's algorithm has the highest saturation throughput, about 440 bits/nsec, and is followed by the deterministic algorithm with 350 bits/nsec. The minimal adaptive algorithm has a performance advantage over the deterministic one even when the router complexity is taken into account. It is surprising, at least at first glance, to see that the best throughput of the fat-tree is only 280 bits/nsec, achieved with four virtual channels. The version with one virtual channel stops at 150 bits/nsec, about one third of the best throughput on the cube. In the cube the latency of both algorithms before saturation is stable at about half a microsecond, and the network latency in a saturated network is only one microsecond. In the fat-tree we pay the penalties of the narrow data paths, the longer wires and the router complexity. The first factor increases the tail latency, because worms of the same size require more flits. The other two factors increase the multiplicative factor represented by the clock cycle. As a result, the network latency on the fat-tree is much higher and reaches 4 microseconds at saturation when we use four virtual channels. The network latency under normal traffic conditions is about one microsecond, twice the latency on the cube.

The complement traffic is a very particular permutation pattern for both topologies. On the one hand it is a difficult pattern for the cube, because it stresses the topological limitation of the bisection bandwidth. On the other hand, it generates no form of contention on the fat-tree. The saturation points on the fat-tree are all around 400 bits/nsec, while the best result on the cube is provided by the deterministic algorithm with 280 bits/nsec. The network latencies on the cube are around 0.5 and 0.7 microseconds before saturation, with the deterministic and the adaptive algorithm respectively. When we use more than one virtual channel on the fat-tree we must pay the routing overhead, and the network latency with four virtual channels reaches 1.5 microseconds when the network is close to saturation.

For the transpose and bit reversal traffic patterns we can distinguish two classes of algorithms. The first class includes the adaptive algorithm on the cube and the versions with two and four virtual channels on the fat-tree. The saturation points of these algorithms are grouped in a short interval between 250 and 300 bits/nsec. In the second class there are the deterministic algorithm of the cube and the adaptive algorithm of the fat-tree with a single virtual channel, whose saturation points are between 100 and 150 bits/nsec. We also note the low average latency of the adaptive algorithm based on Duato's methodology, only half a microsecond. In the presence of non-uniform traffic it pays to have adaptive routing on the cube and more virtual channels on the fat-tree.

11 Conclusion

In this paper we have analyzed two popular interconnection networks, a bi-dimensional cube and a quaternary fat-tree. We have compared the communication performance of these interconnection networks using a detailed simulation model, taking into account important parameters such as the bisection width, the router complexity and the wire delay. In the experimental evaluation we have considered a 4-ary 4-tree and a 16-ary 2-cube, two networks with the same number of processing nodes and routers, whose communication performance has been properly normalized. From the body of experimental results we have gathered so far we can draw some considerations.

The first important result is that the bi-dimensional cube outperforms the quaternary fat-tree under uniform traffic, both in terms of network throughput and latency. The highest saturation throughput is reached by the adaptive algorithm based on Duato's methodology with 440 bits/nsec, followed by the deterministic algorithm with 350 bits/nsec, while the best throughput on the fat-tree is 280 bits/nsec with 4 virtual channels. The network latency on the cube is only 0.5 microseconds below saturation, about half the latency on the fat-tree. This is mainly due to the pin count limitation, which allows larger data paths on the cube, and to the wire delay, which significantly increases the clock cycle.

The fat-tree provides a slightly better throughput in the presence of non-uniform traffic patterns. The complement is a difficult pattern for the cube because it stresses the topological limitation of the bisection bandwidth: in this case the best saturation points are 400 bits/nsec for the fat-tree and around 250 bits/nsec for the cube. With the transpose and bit reversal traffic patterns the throughput with two and four virtual channels on the fat-tree is tantamount to that of the adaptive algorithm on the cube. On the other hand the cube provides lower latency with all the communication patterns.

An important characteristic of the fat-tree is that its communication performance is not sensitive to the permutation pattern, because the bisection bandwidth is not a topological limitation in this network, so we have a predictable performance. In the cube the performance depends on how the communication patterns relate to the bisection bandwidth. This problem is alleviated by the adaptive algorithm. It is also worth noting that the adaptive algorithm maintains a performance advantage over the deterministic one even when we consider the router complexity.

The performance of the fat-tree is mainly sensitive to

the flow control strategy in use: for all traffic patterns but the complement one, the saturation point with one virtual channel is 150 bits/nsec, between 200 and 250 bits/nsec with two, and around 300 bits/nsec with four virtual channels. The network latency in a non-saturated network is nearly the same for all three variants, about one microsecond. Virtual channels have a modest impact on the router complexity of the fat-tree, whose limiting factor is the wire delay. When we use four virtual channels the routing delay is equalized with the wire delay, so we expect diminishing returns with more virtual channels.

As the performance of interconnection networks becomes increasingly limited by physical constraints such as the wire delay, we expect that low-dimensional cubes will widen the gap over the fat-trees, because they can be easily mapped onto three-dimensional space.

Acknowledgments

We thank all the reviewers for their insightful comments.

References

[1] C. E. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing," IEEE Transactions on Computers, vol. C-34, pp. 892-901, October 1985.
[2] C. E. Leiserson et al., "The Network Architecture of the Connection Machine CM-5," in Proceedings of the 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 272-285, June 1992.
[3] H. L. Muller, P. W. A. Stallard, and D. H. D. Warren, "An Evaluation Study of a Link-Based Data Diffusion Machine," in Proceedings of the 8th International Parallel Processing Symposium, IPPS'94, (Cancun, Mexico), pp. 115-128, April 1994.
[4] Meiko World Incorporated, Computing Surface 2 reference manuals, preliminary ed., 1993.
[5] S. Haridi and E. Hagersten, "The Cache Coherence Protocol of the Data Diffusion Machine," in PARLE'89, Parallel Architectures and Languages Europe, vol. I, pp. 1-18, June 1989.
[6] Kendall Square Research, Technical Summary, 1st ed., 1991.
[7] T. T. Kwan, B. K. Totty, and D. A. Reed, "Communication and Computation Performance of the CM-5," in Supercomputing'93, pp. 192-201, November 1993.
[8] M. Lin, R. Tsang, D. H. C. Du, A. E. Klietz, and S. Saroff, "Performance Evaluation of the CM-5 Interconnection Network," Tech. Rep. AHPCRC Preprint 92-111, University of Minnesota AHPCRC, October 1992.
[9] A. Martin and D. Bader, "Performance of the CM-5 ENEE 646." Unpublished, January 1994.
[10] D. Lenoski, J. Laudon, et al., "The Stanford DASH Multiprocessor," IEEE Computer, pp. 63-79, March 1992.
[11] A. Agarwal, B. H. Lim, D. Kranz, and J. Kubiatowicz, "APRIL: A Processor Architecture for Multiprocessing," in Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 104-114, May 1990.
[12] R. Suaya and G. Birtwistle, eds., VLSI and Parallel Computation, ch. Network and Processor Architecture for Message-Driven Computers. Morgan Kaufmann Publishers, 1990.
[13] Cray Research Inc., Cray T3D System Architecture Overview, 1st ed., September 1993.
[14] S. L. Scott and G. M. Thorson, "The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus," in HOT Interconnects IV, (Stanford University), August 1996.
[15] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, Inc., 1993.
[16] Z. G. Mou, "Comparison of Multiprocessor Networks with the Same Cost," in International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'96), vol. I, (Sunnyvale, CA), pp. 539-548, August 1996.
[17] D. C. Burger and D. A. Wood, "Accuracy vs. Performance in Parallel Simulation of Interconnection Networks," in International Symposium on Parallel Processing, April 1995.
[18] A. Agarwal, "Limits on Interconnection Network Performance," IEEE Transactions on Parallel and Distributed Systems, vol. 2, pp. 398-412, October 1991.
[19] C. E. Leiserson and B. M. Maggs, "Communication-Efficient Parallel Algorithms for Distributed Random Access Machines," Algorithmica, vol. 3, pp. 53-77, 1988.
[20] F. Petrini and M. Vanneschi, "k-ary n-trees: High Performance Networks for Massively Parallel Architectures," in Proceedings of the 11th International Parallel Processing Symposium, IPPS'97, (Geneva, Switzerland), pp. 87-93, April 1997.
[21] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. San Mateo, CA, USA: Morgan Kaufmann Publishers, 1992.
[22] W. J. Dally, "Virtual Channel Flow Control," IEEE Transactions on Parallel and Distributed Systems, vol. 3, pp. 194-205, March 1992.
[23] W. J. Dally and C. L. Seitz, "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks," IEEE Transactions on Computers, vol. C-36, pp. 547-553, May 1987.
[24] J. Duato, "A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 4, pp. 1320-1331, December 1993.
[25] J. Duato, "A Necessary and Sufficient Condition for Deadlock-Free Adaptive Routing in Wormhole Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 6, pp. 1055-1067, October 1995.
[26] J. Duato, "A Necessary and Sufficient Condition for Deadlock-Free Adaptive Routing in Wormhole Networks," in International Conference on Parallel Processing, vol. I - Architecture, pp. I-142-I-149, 1994.
[27] W. J. Dally and H. Aoki, "Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels," IEEE Transactions on Parallel and Distributed Systems, vol. 4, pp. 466-475, April 1993.
[28] F. Petrini and M. Vanneschi, "Minimal Adaptive Routing with Limited Injection on Toroidal k-ary n-cubes," in Supercomputing '96, (Pittsburgh, PA), November 1996.
[29] F. Petrini and M. Vanneschi, "SMART: a Simulator of Massive ARchitectures and Topologies," in International Conference on Parallel and Distributed Systems Euro-PDS'97, (Barcelona, Spain), June 1997.
[30] A. A. Chien, "A Cost and Speed Model for k-ary n-cube Wormhole Routers," in Hot Interconnects '93, (Palo Alto, California), August 1993.
[31] J. Duato and P. Lopez, "Performance Evaluation of Adaptive Routing Algorithms for k-ary n-cubes," in First International Workshop, PCRCW'94 (K. Bolding and L. Snyder, eds.), vol. 853 of LNCS, (Seattle, Washington, USA), pp. 45-59, May 1994.
[32] J. Duato and M. P. Malumbres, "Optimal Topology for Distributed Shared-Memory Multiprocessors: Hypercubes Again?," in Second International Euro-Par Conference, Volume I, no. 1123 in LNCS, (Lyon, France), pp. 205-212, August 1996.
[33] S. Heller, "Congestion-Free Routing on the CM-5 Data Router," in First International Workshop, PCRCW'94 (K. Bolding and L. Snyder, eds.), vol. 853 of LNCS, (Seattle, Washington, USA), pp. 176-184, May 1994.
[34] M. L. Fulgham and L. Snyder, "A Comparison of Input and Output Driven Routers," in Second International Euro-Par Conference, Volume I, no. 1123 in LNCS, (Lyon, France), pp. 195-204, August 1996.
[35] D. R. Helman, D. A. Bader, and J. JaJa, "Parallel Algorithms for Personalized Communication and Sorting with an Experimental Study," in Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures, (Padova, Italy), June 1996.
