Proc. of MFCS '94, Springer LNCS
Communication Throughput of Interconnection Networks*

Burkhard Monien, Ralf Diekmann, and Reinhard Luling
Department of Mathematics and Computer Science, University of Paderborn, Germany
e-mail: {bm, diek, rl}@uni-paderborn.de

* This work was partly supported by the German Federal Department of Science and Technology (BMFT), PARAWAN project 413-5839-ITR 9007 BO, the EC Esprit Basic Research Action Nr. 7141 (ALCOM II), and the EC Human Capital and Mobility Project "Efficient Use of Parallel Computers: Architecture, Mapping and Communication".
Abstract. Modern flow control techniques used for massively parallel computers have made network capacity a more important parameter for application performance than network latency. Network latency is usually rather low as long as the injection rate stays below a specific value. Nowadays the maximal injection rate is usually approximated by the bisection bandwidth of the network. We describe the state of the art in determining the bisection bandwidth of interconnection systems. Unfortunately, the bisection bandwidth leads only to very vague approximations of the communication capacity of a network. We therefore describe some methods that aim at modeling the maximal network capacity using probabilistic models. In particular, we present results for the multistage interconnection network, which is often used in parallel computing and in more general communication applications. The presented results show a rather close relation to results gained by simulation and therefore have the potential to replace it. We argue that theoretical investigations, leading to closed-form expressions or polynomial algorithms computing the exact network throughput, can be of great help for engineers, who usually determine the throughput of a given network by time-consuming simulations.
1 Introduction

"The problem of building computer systems with increasing useful computational power has been the diamond in the crown of computer science, similar to the way that fundamental science problems such as sequencing the human genome are at the heart of the life sciences" [18]. This citation, made in a call for proposals for the US High Performance Computing project (HPC), reflects the demand for high performance computers designed to solve highly relevant problems in science and engineering. A large number of projects have been and will be started to develop high performance computers, mainly based on parallel processing, and to make efficient use of them. Popular parallel systems are the T3D (Cray), CM-5 (Thinking Machines), CS-2 (Meiko), GC (Parsytec), Paragon (Intel), SP2 (IBM) and KSR2 (Kendall Square Research). In addition to these commercial systems, there exist a large number of experimental systems, realized by universities and research institutes to validate new concepts in parallel computing. These systems often differ in their programming models and hardware realizations (full custom design vs. standardized building blocks), but most of them, as well as the commercial systems, use a point-to-point communication network of asynchronously working processors communicating by message passing.
Interconnection Networks: Today's parallel computing systems use a large variety of interconnection networks. Popular examples are two- or three-dimensional grids (Parsytec GC, Paragon) and tori (Cray T3D) as well as multistage interconnection networks (CM-5, CS-2). In some sense, meshes and multistage networks represent two extremes in the spectrum of all possible interconnection networks. Multistage networks like the Clos network have a much better communication performance than meshes, even if their large number of switches is taken into account by measuring capacity per switch. However, due to their long wires and their large number of wire crossings, these networks are hard to realize for larger systems. Usually the VLSI area complexity [40] is used to estimate the overall costs of an architecture in terms of physical space, wiring amount and related aspects. Comparing communication capacity and VLSI area, the mesh makes much better use of its area than the Clos network. The so-called Fat Mesh of Clos network (FMOC(n, r), [32]) combines advantages of the mesh and the Clos by substituting the r top stages of a complete Clos network of height n by a mesh structure. This partitions the entire Clos into a number of smaller networks, connected by a large number of independent meshes. Simulations show that the FMOC(n, r) with some r between 0 and n - 1 is an attractive network structure if capacity and costs both have to be considered [32].

Efficiency of Interconnection Networks: Parallel computer architectures are usually evaluated according to network throughput and latency times. The network throughput is defined as the average number of packets reaching their destination per time step, if each processor sends in each step a packet to a random destination with probability λ. For a given routing function R we define λ_max(R) as the maximum injection rate such that the system remains stable using routing R, i.e. no backlogs occur at the processors when inserting messages into the network. The network capacity λ_max is defined to be the maximum injection rate over all routings such that the system remains stable. In the stable case, the average number of packets inserted into the network equals the average number of packets reaching their destination at each time step. λ_max is hard to determine by simulation, since backlogs occur at processors temporarily even for small injection rates. The message latency as a function of the injection rate is defined to be the average number of time steps it takes until a message reaches its destination. The latency time of a network is defined to be the message latency at injection rate λ_max. Another factor which determines the practical usability of a parallel computing system is the message startup time, i.e. the time it takes a processor to insert a message into the communication network.
System    | Network               | Cycle [ns] | Startup Time [Cycles] | Latency [Cycles]
nCUBE/2   | Hypercube             | 25         | 6400                  | 360
CM-5      | Fat Tree (Multistage) | 25         | 3600                  | 114
Paragon   | 2-d Mesh              | 12         | 2500                  | 83
Dash      | Torus                 | 30         | 30                    | 13
J-Machine | 3-d Mesh              | 31         | 16                    | 24
Monsoon   | Butterfly             | 20         | 10                    | 10

Table 1. Startup time and communication latency.

Table 1 presents network latency times for lightly loaded networks (message injection rate λ < λ_max) as well as startup times for three commercial and three experimental parallel computer systems [7]. As can be seen, the startup time dominates the overall communication time (the latencies refer to a 160-bit communication to a random processor on a lightly loaded network). This is mainly due to the use of sophisticated routing strategies like wormhole routing or cut-through routing, which parallelize the message transmission and the routing decision by pipelining the communication through the network [8, 13]. The numbers in Table 1 are only valid if the message injection rate is below the network capacity. If the message traffic caused by an application increases, the latency time becomes unpredictably large. This is shown in Figure 1 for the Fat Mesh of Clos network FMOC(h, r).
Fig. 1. Latency (in time steps) as a function of the injection rate for the Fat Mesh of Clos networks FMOC(5,0) through FMOC(5,4).

Therefore, we argue that the exact determination of the network capacity is of greatest interest for the investigation of a network's efficiency. Today this is usually done by time consuming simulations [9, 17].
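To make the role of such simulations concrete, the following sketch (our own illustration, not the simulator used for Figure 1) measures the average latency of store-and-forward routing on an arbitrary topology under the dynamic model, with unbounded FIFO link queues. The routing table routes, mapping an ordered pair (u, d) to the next hop on a fixed shortest path, is assumed to be precomputed, e.g. from BFS trees; all names here are our own.

    import random
    from collections import deque

    def avg_latency(n, routes, lam, steps=10000, seed=0):
        """Average delivery time at injection rate lam (dynamic model).
        Packets still in flight when the run ends are ignored."""
        rng = random.Random(seed)
        queues = {}                  # (node, next_hop) -> FIFO of (dest, birth time)
        total, done = 0, 0
        for t in range(steps):
            # every processor injects a packet with probability lam
            for u in range(n):
                if rng.random() < lam:
                    d = rng.randrange(n)
                    if d != u:
                        queues.setdefault((u, routes[(u, d)]), deque()).append((d, t))
            # every link forwards at most one packet per time step
            moves = [(v, q.popleft()) for (u, v), q in queues.items() if q]
            for v, (d, birth) in moves:
                if v == d:
                    total += t + 1 - birth
                    done += 1
                else:
                    queues.setdefault((v, routes[(v, d)]), deque()).append((d, birth))
        return total / max(done, 1)

Driving lam towards the capacity makes the measured latency blow up exactly as in Figure 1, which is why λ_max itself is so hard to read off from such experiments.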
Theoretical Models: Theoretical work on the efficiency of interconnection networks is based on a number of different models. A large number of results have been achieved for the simplified case that every processor holds exactly one message and sends it to a randomly chosen destination. Using this model, denoted as the static model, results mainly concerning the message latency have been obtained for multistage interconnection networks (like the butterfly), hypercubes and grids. A model which is more closely related to practical situations arising in communication networks is the so-called dynamic model. In this model, messages are generated dynamically at run time and routed to uniformly random chosen destinations.

For the static model, first results are due to Batcher [3], who showed that an n-input butterfly network is able to sort, and hence route, n packets in O(log^2 n) time. Randomization techniques, introduced by Valiant [42, 43], split the routing task into two random problems by routing the packets first to a random intermediate destination and afterwards to their final destination. It was shown that for the butterfly network, n packets can be routed in O(log n) time using queues of size O(log n) with high probability. This result was improved in terms of the number of packets which can be routed and the queue size which arises during runtime [2, 31, 36, 37, 41]. Recently, similar results were also shown for the popular wormhole routing scheme which is used in most of today's parallel computing systems [13]. Most of these results are based on routing algorithms that are more sophisticated than the simple algorithms used in practical applications. In general it seems to be much harder to analyze these simple routing schemes [31]. Results for the mesh using the static model were presented in [22, 23, 25, 28]. Leighton's average case analysis [27] of this model showed that it is possible to route every packet with O(log N) additional delay and a maximum queue size of 4 with high probability.

Considering the dynamic model, it was shown that for the two-dimensional grid every packet encounters an additional delay of O(log N) with high probability if the injection rate is less than the network capacity [27]. The queue size for this routing algorithm over a time span of t steps was shown to be O(log t / log N) with high probability. Results for the butterfly and hypercube networks with unbounded queues and greedy routing algorithms were presented in [38]. It was shown that if the probability of generating a message per time step is less than 1/2, the network behavior converges to a steady state in which it is stable in the sense described above. The average delay per message tends to O(log N) and the average queue size is O(1).

Contents of the Paper: The theoretical results presented above show that the communication capacity of a network with unbounded queues is only limited by the bottlenecks induced by the network structure and the routing scheme. In Section 2 we focus on the determination of these communication bottlenecks. Usually the bisection bandwidth is the first candidate responsible for limiting throughput, so we present some bounds on the bisection bandwidth of a number of important interconnection networks. The bisection width does not consider characteristics of routing functions and therefore often gives only weak upper bounds on the communication capacity. The edge-forwarding index accounts for the edge congestion caused by routing algorithms and can therefore often give better bounds. We describe the relation between the bisection bound and the forwarding index and give throughput bounds derived from both measures. Practical systems usually have only a limited amount of space per buffer. In Section 3 we present different models leading to a more accurate approximation of the network throughput for the case of a single buffer per communication link.
2 Approximating Network Throughput using Graph Theoretic Measures

In this chapter we describe upper bounds on the network capacity λ_max under uniformly randomized communication for various networks. These bounds are based on two measures: the bisection width of the network and the edge-forwarding index of arbitrary routing schemes. We present an overview of results on the bisection bandwidth of a variety of networks, including some lower and upper bounds. For regular networks, and especially for networks of degree 4, improved bisection results are shown that were motivated by our work on Transputer architectures [19]. Finally we present experimental results on upper and lower bounds and a comparison of the λ_max bounds derived from the bisection width and from the edge-forwarding index for the De Bruijn network.

Let bw(G) denote the bisection width of a given graph G = (V, E), |V| = N, which is defined to be the minimal number of edges that have to be removed in order to partition the graph into two equal sized sets of nodes. The idea of bounding the capacity by the bisection width is based on the following observation: if the N/2 processors in one part each generate a message with probability λ at each time step, and each message has to cross to the other part with probability 1/2, then the expected traffic between the two parts is (N/2) * λ * (1/2) = Nλ/4. Since at most bw(G) messages can cross the cut per time step, we obtain

    λ_max <= min{ 1, 4 * bw(G) / N }.

This bound can be applied to arbitrary graphs, but gives good results only if three conditions are fulfilled: the routing should use shortest paths to minimize the overall contention; it must provide locality in the sense that the shortest path between two nodes lying on the same side of the cut does not cross the cut; and the bisection bound is only sharp if the most congested edges lie on the cut. Thus edge congestion is a very important quantity for the evaluation of λ_max. The three conditions are fulfilled for most popular networks and routing algorithms, like the dimension ordering routing schemes used for grids and hypercubes [9]. If the conditions do not hold, i.e. the bisection does not give a good bound, the so-called edge-forwarding index serves as a better upper bound on the communication capacity. The edge-forwarding index π(G, R) for a graph G and an associated routing function R is defined as the maximum number of paths from R passing through any edge of G [16]. The edge-forwarding index π(G) of G is defined to be the minimum edge-forwarding index of G over all possible routing functions R. As each edge can route only one message per time step, the communication capacity is bounded by

    λ_max <= N / π(G).

Notice that neither the bisection bound nor the edge-forwarding bound makes any assumptions on the buffer sizes of the network. Thus both are upper bounds independent of the details of the network and the routing algorithm.
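Both bounds are easy to evaluate mechanically for small graphs. The following sketch (our own helpers edge_loads and capacity_bounds, not tooling from this paper) fixes one shortest-path routing R via a BFS tree per source and charges every routed path to its edges. The maximum load is π(G, R), giving the bound λ_max(R) <= N/π(G, R); note that π(G, R) >= π(G), so this only bounds the capacity of that particular routing. The mean load per edge corresponds to the quantity x' used in Section 2.4.

    from collections import deque

    def edge_loads(adj):
        """Max and mean number of fixed BFS shortest paths per undirected edge.
        adj is an adjacency list of a connected undirected graph."""
        n = len(adj)
        load = {}
        for s in range(n):
            parent = [-1] * n          # BFS tree rooted at s defines the routing R
            parent[s] = s
            q = deque([s])
            while q:
                u = q.popleft()
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        q.append(v)
            for t in range(n):         # charge the path s -> t to its edges
                u = t
                while u != s:
                    e = (min(u, parent[u]), max(u, parent[u]))
                    load[e] = load.get(e, 0) + 1
                    u = parent[u]
        m = sum(len(a) for a in adj) // 2          # number of undirected edges
        return max(load.values()), sum(load.values()) / m

    def capacity_bounds(adj, bw):
        """Bisection bound and forwarding bound for the BFS routing above."""
        n = len(adj)
        pi_r, _ = edge_loads(adj)
        return min(1.0, 4.0 * bw / n), n / pi_r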
2.1 General Bounds on the Bisection Width

In the past, much work has been done on finding bounds on the size of edge separators of certain kinds of graphs, i.e. bounding the number of edges that have to be removed to split the graph into two sets. The separator theorem of Lipton and Tarjan [30] has motivated a lot of work on node and edge separators for different kinds of graphs. Most of these results split a graph into roughly equal sized parts V1, V2 with |Vi| <= (2/3)|V| and are therefore not directly applicable to bound the bisection width. In [12], Djidjev finds a vertex separator of size O(sqrt(N)) for arbitrary planar graphs such that |Vi| <= (1/2)|V|. If d is the maximal degree of a planar graph, this leads to an edge separator of size bw(G) = O(d * sqrt(N)).

Network                   | #nodes (N)    | #edges          | bw(G)       | π(G)          | source
k-ary n-cube (wrapped)    | k^n           | n*k^n           | 2*k^(n-1)   | k^(n+1)/4     | [9]
Hypercube, Q(n)           | 2^n           | n*2^(n-1)       | 2^(n-1)     | 2^n           |
2D Mesh, M[n,n]           | n^2           | 2n^2 - 2n       | n           | n^3/2         |
3D Mesh, M[n,n,n]         | n^3           | 3n^3 - 3n^2     | n^2         | n^4/2         |
X-tree, XT(h)             | 2^(h+1) - 1   | 2^(h+2) - h - 4 | Θ(h)        | --            | [26]
Butterfly, BF(k)          | k*2^k         | k*2^(k+1)       | 2^k         | k^2*2^(k-1)   | [26]
Cube-Con. Cycles, CCC(k)  | k*2^k         | 3k*2^(k-1)      | 2^(k-1)     | --            | [26]
Shuffle-Exchange, SE(k)   | 2^k           | 3*2^(k-1)       | Θ(2^k/k)    | --            | [26]
De Bruijn Graph, DB(k)    | 2^k           | 2^(k+1)         | Θ(2^k/k)    | --            | [26]

Table 2. Upper bounds for the communication capacity of different networks.

Table 2 shows the bisection width and upper bounds on the communication capacity of some common interconnection networks. The proofs of the bisection widths use a technique that is related to the forwarding index [26]. The complete directed graph K_N on N nodes obviously has a bisection width of bw(K_N) = N^2/2. To show a lower bound on the bisection width of an arbitrary graph G, it is sufficient to define an embedding of K_N into G and to determine the maximal congestion c caused by the embedding on any edge of G (the forwarding index). A bisection of G has to cut at least bw(K_N)/c edges of G: if it cut fewer edges, it would define a cut in K_N of size smaller than N^2/2.
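As a worked instance of this technique (our illustration, consistent with the hypercube row of Table 2): embed K_N into the hypercube Q(n), N = 2^n, via dimension-order shortest path routing. A standard symmetry argument shows that this embedding puts the same load c = 2^n on every edge, so any bisection of Q(n) must cut at least

    bw(K_N)/c = (2^(2n)/2) / 2^n = 2^(n-1)

edges, matching the exact bisection width bw(Q(n)) = 2^(n-1) given in Table 2.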
2.2 Partitioning Regular Graphs

In [6], upper bounds on the bisection width of 3-regular graphs are shown. In [19] these bounds are improved and extended to arbitrary r-regular graphs. Using a new technique, bisections of size bw(G) <= (r - 2)*N/4 + O(sqrt(N)) are constructed. For graphs of degree 4 the bounds are improved even further: it is shown that bw(G) <= N/2 + 1 for N >= 350 and bw(G) <= N/2 + 4 for N >= 60. Both results lead to efficient bisection algorithms. Small graphs can be partitioned using a simple greedy strategy, whereas larger graphs require an optimization with the use of helpful sets. The latter was the basis for the development of a new bisection heuristic [11].
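To illustrate the algorithmic side, the simplest local-improvement strategy starts from an arbitrary balanced partition and keeps swapping a pair of nodes across the cut as long as this reduces the cut size. The sketch below is our own naive version; whether it matches the greedy strategy of [19] is not specified here, and the helpful-sets heuristic of [11] is considerably more elaborate.

    def greedy_bisection(adj):
        """Naive swap-based local improvement; returns (cut size, side of each node)."""
        n = len(adj)
        side = [i % 2 for i in range(n)]                 # arbitrary balanced start
        def cut():
            return sum(1 for u in range(n) for v in adj[u]
                       if u < v and side[u] != side[v])
        best, improved = cut(), True
        while improved:
            improved = False
            for u in range(n):
                for v in range(u + 1, n):
                    if side[u] == side[v]:
                        continue
                    side[u], side[v] = side[v], side[u]  # try a swap across the cut
                    c = cut()
                    if c < best:
                        best, improved = c, True
                    else:
                        side[u], side[v] = side[v], side[u]  # undo
        return best, side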
2.3 Bisecting De Bruijn Graphs

The De Bruijn graph is a very interesting network for building processor interconnections. Because of its low constant degree (every node has degree at most 4) it can easily be built from standard components. Nevertheless, it provides a logarithmic diameter and a large bisection width of Θ(N/log N).

k  | N    | N/log N | LoB | LoB(SA) | UpB
3  | 8    | 2.66    | 3   | 3       | 3
4  | 16   | 4.00    | 5   | 6       | 6
5  | 32   | 6.40    | 8   | 9       | 10
6  | 64   | 10.66   | 13  | 16      | 18
7  | 128  | 18.29   | 22  | 26      | 30
8  | 256  | 32.00   | 38  | 44      | 54
9  | 512  | 56.88   | 66  | 78      | 93
10 | 1024 | 102.40  | 118 | 133     | 162

Table 3. Lower and upper bounds on the bisection width of De Bruijn graphs.

Exact values for the bisection width of De Bruijn graphs are not known. As can be seen from Table 3, the gap between upper and lower bounds is still very large. The lower bound uses the technique of embedding a complete graph described above; because the edge congestion is not uniform, this bound is weak. An upper bound can be found using the complex plane embedding of the De Bruijn graph, which defines an O(N/log N) bisection [26]. Optimizing the embedding in terms of edge congestion can improve the lower bound: the column LoB in Table 3 gives the lower bound obtained with standard shortest path routing (note that an embedding of a complete graph defines a routing scheme), LoB(SA) is an improved bound derived from optimized shortest path routing using simulated annealing [10], and UpB is an upper bound found with the bisection heuristic from [11]. As can be seen, there is still a large gap between the lower and upper bounds. This results mainly from the weakness of the lower bound. The upper bounds are not very likely to be improved, because they were obtained from numerous runs of an efficient bisection heuristic [11].
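For experiments of this kind, the binary De Bruijn graph is easy to generate. The following sketch (ours) builds DB(k) as an undirected graph on the k-bit strings, where node u is joined to its shift successors (2u) mod 2^k and (2u+1) mod 2^k:

    def de_bruijn(k):
        """Undirected binary De Bruijn graph DB(k); degree at most 4."""
        n = 1 << k
        adj = [set() for _ in range(n)]
        for u in range(n):
            for b in (0, 1):
                v = ((u << 1) | b) & (n - 1)
                if u != v:                 # skip the two self-loops at 00..0 and 11..1
                    adj[u].add(v)
                    adj[v].add(u)
        return [sorted(s) for s in adj]

Feeding the result to edge_loads from Section 2 and taking ceil((N^2/2)/c) for the maximal load c gives a lower bound in the spirit of the LoB column; the exact numbers depend on how the BFS breaks ties between shortest paths.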
2.4 Quality of Throughput Approximation

If the bisection width is used to derive upper bounds on the communication capacity, the question is whether the approximation is close to the real capacity. Unfortunately, there is no general answer to this question; the quality of the approximation usually depends on the structure of the network.

k  | N    | π(G,R) = x | λ_max(R) <= N/x | x'      | λ_max <= N/x' | x'' | λ_max <= 4x''/N
4  | 16   | 27         | 0.5926          | 17.72   | 0.9029        | 6   | 1.5000
5  | 32   | 75         | 0.4267          | 44.79   | 0.7144        | 10  | 1.2500
6  | 64   | 176        | 0.3636          | 111.39  | 0.5746        | 18  | 1.1250
7  | 128  | 427        | 0.2998          | 270.81  | 0.4727        | 30  | 0.9375
8  | 256  | 978        | 0.2617          | 644.85  | 0.3970        | 54  | 0.8438
9  | 512  | 2226       | 0.2300          | 1507.89 | 0.3359        | 93  | 0.7266
10 | 1024 | 5023       | 0.2039          | 3469.81 | 0.2951        | 162 | 0.6328

Table 4. Comparison of the bounds on λ_max derived from π and from bw for De Bruijn graphs DB(k). It is π(G) >= x' and thus λ_max <= N/x', and bw(G) <= x'' and thus λ_max <= 4x''/N.
Table 4 shows bounds on the forwarding index π and the bisection width bw of De Bruijn networks for different dimensions k. The values x' reflect the average edge congestion resulting from any shortest-path routing in the De Bruijn network (i.e. x' is the sum of the lengths of all shortest paths in DB(k) divided by the number of edges). Clearly, x' is a lower bound on the edge-forwarding index π(G), as it is independent of a specific routing function; thus it leads to an upper bound on the network capacity λ_max. The upper bounds x'' on the bisection width bw were obtained using the bisection heuristic described in [11]; x'' also leads to an upper bound on the network capacity λ_max. Additionally, the column x shows the values of the edge-forwarding index if locally optimal shortest path routing is used, i.e. the routing function tries to distribute the paths equally over all edges incident to a particular node; λ_max(R) is the corresponding capacity bound. The three different values for λ_max show that the bisection width is only a weak bound on the capacity, at least for De Bruijn networks of practical size. There is also a large gap between the forwarding index π(G, R) obtained by this specific routing and the lower bound x' on π(G). This shows on the one hand that the bound on π(G) is weak, but also that the locally optimal routing R is far from globally optimal.

Table 5 presents the bisection upper bound on the communication capacity together with real values obtained by simulation for 1024-node Fat Mesh of Clos networks, ranging from the pure 2-d mesh to the pure Clos. Simulations of such large networks are very time consuming. The large gap between the simulated values and the bisection bound shows, however, that upper bounds are often not sufficient to model the behavior of an interconnection network. Thus there is a strong need for better theoretical models that express the network capacity without resorting to time consuming simulations. The next section presents some such models.
h | r | proc. | mesh  | λ_max(sim) | λ_max(sim)/λ_max(bisec) | λ_max(bisec)
5 | 0 | 1024  | 1x1   | 0.38770    | 0.38770                 | 1
5 | 1 | 1024  | 2x2   | 0.31665    | 0.63330                 | 0.50
5 | 2 | 1024  | 4x4   | 0.17264    | 0.69056                 | 0.25
5 | 3 | 1024  | 8x8   | 0.07889    | 0.63112                 | 0.125
5 | 4 | 1024  | 16x16 | 0.03285    | 0.52560                 | 0.0625

Table 5. Comparison of upper bound and simulation, Fat Mesh of Clos.
3 Throughput of Multistage Interconnection Networks

Multistage interconnection networks (MINs) in their most general sense have been widely used to realize the interconnection networks of parallel computer systems and as switches for high speed communication networks. Examples of realizations in parallel computers can be found in [4, 15, 35]; the Thinking Machines CM-5, as well as the Meiko CS-2, also use a multistage interconnection network as their main communication system. An architecture based on Transputers using a multistage network is described in [14]. Multistage networks used for the realization of high speed communication systems for telecommunication purposes are presented in [5].
Fig. 2. MIN of height 4 (stages 0 to 3, with sources injecting at rate λ on the left and sinks on the right). Messages are buffered at the links; each link carries a single message buffer, which may be empty, occupied, or blocking.

In the following we describe some models which were developed to analyze the network throughput of MINs under the dynamic generation of messages presented in Section 1. All models are based on uniformly randomized communication and can be distinguished according to their complexity and their accuracy in describing the network throughput. The networks we consider have a fixed buffer space per link. Most of the presented models can be generalized to any number of buffers per link, but we focus on the so-called single buffer model, which means that each link has an associated buffer which can store exactly one message (cf. Fig. 2). The methods described in [27, 31] cannot be applied here. This is mainly due to the fact that in the case of limited buffer space the random variables describing the flow in the network are no longer independent. Therefore one has to develop models capturing as many of the dependencies arising at runtime as possible in order to get a good approximation of the network throughput. Results for the case of unbuffered edges are much easier to achieve; see [24, 44] for an overview of this work. The routing strategy is straightforward for all models: every message is routed along its unique path from its origin to its destination.
3.1 The Basic Model

A basic model for determining the throughput of MINs was developed by Jenq in 1983 [21]. This model mainly uses two relaxations to reduce its complexity. First, the status of a complete stage of the network is described by the state of exactly one buffer. This is only valid if the states of buffers within a stage (and switch) are independent of each other; since messages within one switch can compete for a single outgoing link, it is clear that the buffer states are not independent. Second, blocked messages are not precisely considered, which is, as we will see later, the main reason why the model gives only very vague approximation results.

Only the probability that a buffer contains a message is used to describe the state of a stage. For switches of size 2, the model for a MIN containing n stages can be described as follows. Let p0(k, t) denote the probability of a buffer at stage k being empty at time step t, and p1(k, t) = 1 - p0(k, t). To describe the flow of messages, let in(k, t) denote the probability that some message is ready to enter a buffer of a switch at stage k during time step t, and out(k, t) the probability that a message residing in a buffer at stage k leaves this buffer at time step t.

The probability that a message is able to enter a buffer at stage k depends only on the state of the buffers at stage k - 1. If exactly one of the two buffers at stage k - 1 holds a message (this case arises with probability 2 * p0(k-1, t) * p1(k-1, t)), this message is put into a specific buffer at stage k with probability 1/2. If both buffers contain a message (probability p1(k-1, t)^2), the targeted buffer at stage k receives a message with probability 3/4:

    in(k, t) = p0(k-1, t) * p1(k-1, t) + (3/4) * p1(k-1, t)^2

To describe the probability that a message is able to leave a buffer towards the next stage, the two possibilities of blocking a message have to be considered. First, a message can be blocked because its target buffer in the next stage is not empty. The second source of blocking is the competition between messages residing in the two buffers of one switch and having the same destination buffer; in this case only one message is able to move, the other one is blocked. Therefore a message is able to leave its buffer if it wins the competition at the switch (with probability p0(k, t) + (3/4) * p1(k, t)) and if the target buffer at the next stage is either empty (probability p0(k+1, t)) or vacated during this step (probability p1(k+1, t) * out(k+1, t)):

    out(k, t) = (p0(k, t) + (3/4) * p1(k, t)) * (p0(k+1, t) + p1(k+1, t) * out(k+1, t))

The probability that a buffer is empty at time step t + 1 is the probability that no new message arrives at the buffer and that the buffer is already empty or becomes empty during time step t:

    p0(k, t+1) = (1 - in(k, t)) * (p0(k, t) + p1(k, t) * out(k, t))

The throughput of the network is now computed by iteratively evaluating the above equations for k = 1, ..., n and t -> infinity until a steady state is reached. (Actually, there is no proof that such a steady state exists.) The throughput is given by p1(n, t) * out(n, t). A model that generalizes this approach to larger switching elements was presented by Yoon et al. in [45].
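The fixed-point iteration is easy to carry out numerically. The sketch below is our own; in particular, the boundary handling is our assumption, since the text above does not spell it out: we treat the sources as a virtual stage with occupancy λ, so in(1, t) = λ(1-λ) + (3/4)λ^2, and assume the sinks always accept, so out(n, t) = p0(n, t) + (3/4) * p1(n, t).

    def jenq_throughput(n, lam, iters=10000, tol=1e-12):
        """Iterate Jenq's single-buffer equations until a steady state is
        (hopefully) reached; returns the throughput p1(n) * out(n)."""
        p1 = [0.0] * (n + 2)          # p1[k]: P(buffer at stage k occupied), k = 1..n
        for _ in range(iters):
            out = [0.0] * (n + 2)
            out[n] = 1.0 - 0.25 * p1[n]            # sinks always accept (assumption)
            for k in range(n - 1, 0, -1):
                win = (1.0 - p1[k]) + 0.75 * p1[k]           # wins the competition
                free = (1.0 - p1[k + 1]) + p1[k + 1] * out[k + 1]
                out[k] = win * free
            inp = [0.0] * (n + 1)
            inp[1] = lam * (1.0 - lam) + 0.75 * lam * lam    # assumed source behaviour
            for k in range(2, n + 1):
                q = p1[k - 1]
                inp[k] = q * (1.0 - q) + 0.75 * q * q
            # p0(k, t+1) = (1 - in(k, t)) * (p0(k, t) + p1(k, t) * out(k, t))
            new_p1, delta = p1[:], 0.0
            for k in range(1, n + 1):
                p0_next = (1.0 - inp[k]) * ((1.0 - p1[k]) + p1[k] * out[k])
                new_p1[k] = 1.0 - p0_next
                delta = max(delta, abs(new_p1[k] - p1[k]))
            p1 = new_p1
            if delta < tol:
                break
        return p1[n] * out[n]

Sweeping lam from 0 to 1 with this routine yields curves of the shape labeled "Jenq 1" in Figure 3 below.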
3.2 Considering Dependencies Between Buffers at one Switch

A second model, also described by Jenq in [21], is more complex in the sense that the relations between the buffers belonging to one switch are precisely considered. The model was described for switching elements of size 2, but can be generalized to other sizes, leading to an exponential number of states per switch.
3.3 Considering Blocked Messages

The main reason why the previous models do not accurately describe the throughput of multistage networks is that they do not consider blocked messages explicitly. To our knowledge there are three papers describing models that take the blocking of messages into account. The models presented in [20] and [34] either make some assumptions which were found to be unrealistic, or assume (as the first model of Jenq) that buffers belonging to the same switch are independent of each other. In the following we describe the model introduced in [39]. A buffer can be in one of the states "o" (empty), "n" (normal) or "b" (blocked). It is defined to be in the normal state if a message entered the buffer in the last time step; it is blocked if it contains a message which was not delivered to the next stage in the last time step. A clock cycle is logically divided into two phases: in the first phase, a buffer is able to deliver its message to the next stage; in the second phase, a buffer may receive a message from the previous stage of the network.
Let p_{x,y}(k, t) denote the probability that the left buffer of a switch at stage k is in state x and the right buffer in state y (x, y in {o, n, b}) at the beginning of phase one of clock cycle t, and p^_{x,y}(k, t) the probability of being in state (x, y) after performing the first phase. The first phase of a clock cycle is explicitly described by introducing the probability of a message being sent to the next stage: let out^{i,j}_{a,b}(k, t) be the probability that the buffers are in state (a, b) and i messages of the left buffer and j messages of the right buffer are able to move to stage k + 1 in phase one of clock cycle t. Using these variables, the contribution of a state (a, b) to the probability p^_{x,y}(k, t+1) of reaching state (x, y) after phase one can be computed as (where #x := 0 if x = o and #x := 1 if x in {n, b}):

    p^_{x,y}(k, t+1)(p_{a,b}(k, t)) = out^{#a-#x, #b-#y}_{a,b}(k, t)  if #a >= #x and #b >= #y,  and 0 otherwise.

To describe phase two of a clock cycle t, in(k, t) is defined to be the probability that a buffer at stage k is offered a new packet during clock cycle t. This probability is assumed to be independent of the state of the switch; this simplifying assumption is mainly responsible for the remaining inaccuracy in the description of the network throughput. Using it, one can easily compute the probabilities p_{x,y}(k, t) depending on p^_{x,y}(k, t) and in(k, t). The associated equations are straightforward and omitted here. Solving the equations iteratively, the probabilities p_{x,y}(k, t) for k = 1, ..., n and t -> infinity can be computed.

The question which remains to be answered is how to compute the transition probabilities in(k, t) and out^{i,j}_{a,b}(k, t). The variables out^{i,j}_{x,y}(k, t) are computed using the values of out^{i,j}_{x,y}(k+1, t) and the message flow at time step t - 1. The probability that a message is able to move to the next stage depends only on the probability that the target buffer is free and on the probability that the messages in the two buffers do not conflict in their destinations for the next stage. Regarding the case that the messages of the two buffers do not conflict, one can distinguish two cases:

1. If the buffer at stage k is in state n, the message can move to the next stage if the buffer at stage k + 1 is already empty or changes from {n, b} to the empty state. The probability of this is already known.

2. If the buffer at stage k is blocked, the target buffer in the next stage must be in one of the states n or b. The probability of its being in state n can be computed using the insertion probability in(k+1, t-1); otherwise the buffer is in state b. In both cases one can use the specific values of out^{i,j}_{x,y}(k+1, t) to compute the probability that the buffer changes to state o.

The main idea of this model is therefore that, through the blocked state, one knows that the target buffer in the next stage is also occupied, and using the insertion probability of the previous time step one can further differentiate between blocked and normal states in the next buffer. Therefore one can argue that this model gives much better approximations of the network throughput.
3.4 Comparison of Models

To validate the three models presented in the previous sections, we compare them to the exact throughput of multistage networks as determined by simulation. Figure 3 presents the network throughput as a function of the injection rate for MINs with 4 and 12 stages. It is interesting to notice that all three models overestimate the throughput. As already stated in the previous sections, the two models introduced by Jenq lead to a very large overestimation, whereas the model introduced by Theimer obtains much better results, also for large networks.
Fig. 3. Throughput as a function of the injection rate for multistage networks of height 4 and 12: the models "Jenq 1", "Jenq 2" and "Theimer" compared with simulation results ("Sim").
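For reference, a simulation of the kind behind the "Sim" curves can be sketched as follows. This is our own illustration: the interstage wiring (a perfect shuffle), the random tie-breaking at the switches, and the unbounded source queues are modeling choices of this sketch, not taken from the paper.

    import random

    def simulate_min(n, lam, steps=20000, seed=1):
        """Monte Carlo throughput of an n-stage MIN of 2x2 switches with a
        single buffer per link, under uniformly random destinations."""
        rng = random.Random(seed)
        width = 2 ** n                             # links per stage
        shuffle = lambda x: ((x << 1) | (x >> (n - 1))) & (width - 1)
        buf = [[None] * width for _ in range(n)]   # buf[k][i]: packet dest or None
        backlog = [0] * width                      # unbounded source queues
        delivered = 0
        for _ in range(steps):
            for k in range(n - 1, -1, -1):         # drain later stages first
                order = list(range(width))
                rng.shuffle(order)                 # random competition tie-breaking
                for i in order:
                    pkt = buf[k][i]
                    if pkt is None:
                        continue
                    bit = (pkt >> (n - 1 - k)) & 1 # self-routing by destination bit
                    out = (i & ~1) | bit           # output port of the 2x2 switch
                    if k == n - 1:
                        delivered += 1             # sinks always accept
                        buf[k][i] = None
                    else:
                        nxt = shuffle(out)         # wire to the next stage
                        if buf[k + 1][nxt] is None:
                            buf[k + 1][nxt] = pkt
                            buf[k][i] = None       # blocked otherwise
            for i in range(width):                 # dynamic message generation
                if rng.random() < lam:
                    backlog[i] += 1
                if backlog[i] and buf[0][i] is None:
                    buf[0][i] = rng.randrange(width)
                    backlog[i] -= 1
        return delivered / (steps * width)         # throughput per processor per step

Sweeping lam from 0 to 1 and plotting the returned throughput reproduces curves of the shape shown in Figure 3.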
4 Conclusions

The determination of the communication throughput of interconnection networks is of great importance for the design of parallel computing systems and of other, more general communication systems (e.g. in telecommunications). This paper studied two methods for the determination of the communication capacity. One is based on results gained for the bipartitioning of graphs and on estimates of the edge-forwarding index. The other, more accurate method is based on explicitly modeling the routing in the network. We argue that approximative models of communication networks can be of great benefit when designing communication systems, since they can replace time consuming simulations and provide results also for very large systems which cannot be simulated in reasonable time. Future work should therefore focus on developing models which take more state dependencies into account. It would also be very helpful to obtain closed-form expressions for the network throughput, describing the model used exactly. For practical reasons it would also be interesting to generalize the models to larger switches and to multiple buffers.
References

1. A. Agarwal, Limits on Interconnection Network Performance, IEEE Transactions on Parallel and Distributed Systems, vol. 2, 1991, pp. 398-412
2. R. Aleliunas, Randomized parallel communication, Proc. of ACM Symp. on Principles of Distributed Computing (PODC), 1982, pp. 60-72
3. K. Batcher, Sorting networks and their applications, Proc. of the AFIPS Spring Joint Computing Conference, vol. 32, 1968, pp. 307-314
4. BBN, Butterfly Parallel Processor Overview, BBN Report No. 6148, Version 1, Cambridge, Mass., 1986
5. CCITT Recommendation I.121, Broadband aspects of ISDN, Blue Book, vol. III.7, Geneva, Switzerland, 1989
6. L.H. Clark, R.C. Entringer, The Bisection Width of Cubic Graphs, Bull. Austral. Math. Soc., vol. 39, 1988, pp. 389-396
7. D.E. Culler, R.M. Karp, D.A. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, T. von Eicken, LogP: Towards a Realistic Model of Parallel Computation, Proc. of 4th ACM Symp. on Principles and Practice of Parallel Programming, 1993
8. W.J. Dally, C.L. Seitz, Deadlock-Free Message Routing in Multiprocessor Interconnection Networks, IEEE Transactions on Computers, vol. C-36, no. 5, 1987, pp. 547-553
9. W. Dally, Performance Analysis of k-ary n-cube Interconnection Networks, IEEE Transactions on Computers, vol. 39, 1990, pp. 775-785
10. R. Diekmann, R. Luling, J. Simon, Problem Independent Distributed Simulated Annealing and its Applications, in: R.V.V. Vidal (ed.), Applied Simulated Annealing, Springer LNEMS 396, 1993, pp. 17-44
11. R. Diekmann, B. Monien, R. Preis, Using Helpful Sets to Improve Graph Bisections, Technical Report TR-RF-8-94, University of Paderborn, 1994
12. H.N. Djidjev, On the Problem of Partitioning Planar Graphs, SIAM J. Alg. Disc. Meth., vol. 3(2), 1982, pp. 229-240
13. S. Felperin, P. Raghavan, E. Upfal, A Theory of Wormhole Routing in Parallel Computers, ACM Symposium on Foundations of Computer Science, 1992, pp. 563-572
14. R. Funke, R. Luling, B. Monien, F. Lucking, H. Blanke-Bohne, An optimized reconfigurable Architecture for Transputer Networks, Proc. of the 25th Hawaii Int. Conf. on System Sciences (HICSS), 1992, vol. 1, pp. 237-245
15. A. Gottlieb, An overview of the NYU Ultracomputer Project, in: J.J. Dongarra (ed.), Experimental Parallel Computing Architectures, Elsevier, Amsterdam, 1987, pp. 25-95
16. M.C. Heydemann, J.C. Meyer, D. Sotteau, On Forwarding Indices of Networks, Discrete Applied Mathematics, vol. 23, 1989, pp. 103-123
17. H. Hofestadt, A. Klein, E. Reyzl, Performance Benefits from Locally Adaptive Interval Routing in Dynamically Switched Interconnection Networks, Proc. of 2nd European Distributed Memory Computing Conference, Springer LNCS 487, pp. 193-202
18. HPC Project, Project for Suggesting Computer Science Agenda(s) for High-Performance Computing, April 1994
19. J. Hromkovic, B. Monien, The Bisection Problem for Graphs of Degree 4 (Configuring Transputer Systems), Proc. of 16th Math. Foundations of Computer Science (MFCS), Springer LNCS 520, 1991, pp. 211-220
20. S.H. Hsiao, C.Y.R. Chen, Performance analysis of single-buffered multistage interconnection networks, 3rd IEEE Symp. on Parallel and Distributed Processing, 1991, pp. 864-867
21. Y.C. Jenq, Performance analysis of a packet switch based on a single-buffered banyan network, IEEE J. on Selected Areas in Communications, vol. SAC-3, 1983, pp. 1014-1021
22. M. Kaufmann, J. Sibeyn, T. Suel, Derandomizing Algorithms for Routing and Sorting on Meshes, Proc. 5th ACM-SIAM Symp. on Discrete Algorithms (SODA), 1994, pp. 669-679
23. D. Krizanc, S. Rajasekaran, T. Tsantilis, Optimal routing algorithms for mesh-connected processor arrays, Proc. of Aegean Workshop on Computing: VLSI Algorithms and Architectures, Springer LNCS 319, 1988, pp. 411-422
24. C.P. Kruskal, M. Snir, The performance of multistage interconnection networks for multiprocessors, IEEE Trans. on Computers, vol. C-32, 1983, pp. 1091-1098
25. M. Kunde, Routing and Sorting on Grids, Habilitationsschrift, Technical University of Munich, June 1991
26. F.T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann Publishers, 1992
27. F.T. Leighton, Average Case Analysis of Greedy Routing Algorithms on Arrays, Proc. ACM Symp. on Parallel Algorithms and Architectures (SPAA), 1990, pp. 2-10
28. F.T. Leighton, F. Makedon, I. Tollis, A 2N-2 step algorithm for routing in an N x N mesh, Proc. 1st ACM Symp. on Parallel Algorithms and Architectures (SPAA), 1989, pp. 328-335
29. C.E. Leiserson et al., The Network Architecture of the Connection Machine CM-5, Proc. 4th ACM Symp. on Parallel Algorithms and Architectures (SPAA), 1992, pp. 272-285
30. R.J. Lipton, R.E. Tarjan, A Separator Theorem for Planar Graphs, SIAM J. Appl. Math., vol. 36(2), 1979, pp. 177-189
31. B.M. Maggs, R.K. Sitaraman, Simple Algorithms for Routing on Butterfly Networks with Bounded Queues, Proc. 24th ACM Symp. on Theory of Computing (STOC), 1992, pp. 150-161
32. B. Monien, R. Luling, F. Langhammer, A Realizable Efficient Parallel Architecture, 1st Int. Heinz Nixdorf Symposium: Parallel Architectures and their Efficient Use, Paderborn, 1992, Springer LNCS 678, pp. 93-109
33. B. Monien, R. Feldmann, R. Klasing, R. Luling, Parallel Architectures: Design and Efficient Use, Proc. STACS '93, Springer LNCS 665, pp. 247-269
34. Y. Mun, H. Yong, Performance Analysis of Finite Buffered Multistage Interconnection Networks, IEEE Trans. on Computers, vol. C-43, 1994, pp. 153-162
35. G.F. Pfister, W.C. Brantley, D.A. George, S.L. Harvey, W.J. Kleinfelder, K.P. McAuliffe, E.A. Melton, V.A. Norton, J. Weiss, An introduction to the IBM Research Parallel Processor Prototype (RP3), in: J.J. Dongarra (ed.), Experimental Parallel Computing Architectures, Elsevier Science Publishers, Amsterdam, 1987, pp. 123-140
36. N. Pippenger, Parallel communication with limited buffers, Proc. of 25th Symposium on Foundations of Computer Science (FOCS), 1984, pp. 127-136
37. A.G. Ranade, How to emulate shared memory, Proc. of 28th Symposium on Foundations of Computer Science (FOCS), 1987, pp. 185-194
38. G.D. Stamoulis, J.N. Tsitsiklis, The Efficiency of Greedy Routing in Hypercubes and Butterflies, Proc. ACM Symp. on Parallel Algorithms and Architectures (SPAA), 1991, pp. 248-259
39. T.H. Theimer, E.P. Rathgeb, M.N. Huber, Performance analysis of buffered banyan networks, IEEE Trans. on Communications, vol. C-39, 1991, pp. 269-277
40. J.D. Ullman, Computational Aspects of VLSI, Computer Science Press, 1984
41. E. Upfal, Efficient schemes for parallel communication, Journal of the ACM, vol. 31, no. 3, 1984, pp. 507-517
42. L.G. Valiant, A scheme for fast parallel communication, SIAM Journal on Computing, vol. 11, no. 2, 1982, pp. 350-361
43. L.G. Valiant, G.J. Brebner, Universal schemes for parallel communications, Proc. of 13th ACM Symposium on Theory of Computing (STOC), 1981, pp. 263-277
44. A. Varma, C.S. Raghavendra, Performance analysis of a redundant-path interconnection network, Proc. of Int. Conf. on Parallel Processing, 1985, pp. 474-479
45. H.Y. Yoon, K.Y. Lee, M.T. Liu, Performance analysis of multibuffered packet-switching networks in multiprocessor systems, IEEE Trans. on Computers, vol. C-39, 1990, pp. 319-327