Multilayer Multistage Interconnection Networks

Dietmar Tutsch and Günter Hommel
Technische Universität Berlin
Real-Time Systems and Robotics
D-10587 Berlin, Germany
{dietmart,hommel}@cs.tu-berlin.de

Keywords: network architecture, costs, multicasting, multistage interconnection networks, performance

Abstract

Multistage interconnection networks are frequently proposed as connections in multiprocessor systems or network switches. In this paper, a new network architecture called the multilayer multistage interconnection network is introduced. Multilayer multistage interconnection networks are established by defining the main parameters of such an architecture. Performance and cost of the new structure are determined and compared to regular and to replicated multistage interconnection networks. It is shown that multicast traffic in particular benefits from the new network type.

1 INTRODUCTION

Multistage interconnection networks (MINs) with the banyan property are proposed to connect a large number of processors in order to establish a multiprocessor system [1]. They are also used as interconnection networks in Gigabit Ethernet [2] and ATM switches [3]. Such systems require high network performance. MINs were first introduced as circuit switching networks [4]. To increase the performance of a MIN, buffered MINs were established as packet switching networks. For instance, Dias and Jump [5] inserted a buffer at each input of the switching elements (SEs). Buffers at each SE allow the packets of a message to be stored until they can be forwarded to the next stage of the network. Patel [6] defined delta networks, a subset of banyan networks (MINs with just one path between a given input and output). Delta networks additionally require that packets can use the same routing tag to reach a certain network output independently of the input at which they enter the network. Many variations of delta networks were introduced. Most of them result in MINs that give up the unique path property (and therefore the delta property) in order to reduce blocking. Clos [7] presented a MIN consisting of three stages and non-square SEs. Beneš [8] suggested MINs consisting of a banyan network followed by its inverse. Tandem-banyan networks [9, 10] are also based on a series of banyan networks: packets that arrive at a banyan network output leave the tandem-banyan network if they have already managed to reach the right output; otherwise, they enter the following banyan network and try again. Turnaround MINs [11] are established by bidirectional links between the SEs; network inputs and SE inputs operate as outputs as well. Dilated banyan networks [12] arise by multiplying the links between the SEs, which enhances the link bandwidth. Replicated banyan networks [12] originate from multiplying the whole banyan network: arriving packets are distributed to the banyan networks by a demultiplexer at the corresponding network input, and corresponding outputs are connected via a multiplexer.

In this paper, a new architecture is introduced that applies especially to multicasting in multistage interconnection networks. Those MINs are called multilayer multistage interconnection networks (MLMINs). The paper is organized as follows. The architecture of MLMINs is described in Section 2. Section 3 investigates the cost of such a structure. Its performance is determined in Section 4. Section 5 summarizes and gives conclusions.

2 ARCHITECTURE OF MLMIN

The architecture of multilayer multistage interconnection networks presented in this paper is based on MINs with the banyan property and on replicated MINs.

2.1 MIN with Banyan Property

Multistage interconnection networks with the banyan property are networks in which a unique path exists from each input to each output. Such MINs of size N×N consist of c×c switching elements arranged in n = log_c N stages (Figure 1). The network shown also belongs to the class of delta networks: it is a banyan network in which all packets can use the same routing tag to reach a certain network output, independently of the input at which they enter the network. To achieve synchronously operating switches, the network is internally clocked. In each stage k (0 ≤ k ≤ n−1), there is a FIFO buffer of size m_max(k) in front of each switch input. Packets are forwarded from a stage to its successor by store-and-forward routing or cut-through switching, controlled by a backpressure mechanism.

[Figure 1: 3-stage delta network consisting of c×c SEs]

Multicasting is performed by copying the packets within the c×c switches. In the ATM context, this scheme is called cell replication while routing (CRWR). Figure 2 shows such a scenario for an 8×8 MIN consisting of 2×2 SEs. A packet is received by Input 3 and destined to Output 5 and Output 7. The packet enters the network and is not copied until it reaches the middle stage. Then, two copies of the packet proceed on their way through the remaining stages.

[Figure 2: Multicast while routing]

Packet replication before routing would, in the above example, copy the packet and send it twice into the network. Packet replication while routing therefore reduces the number of packets in the first stages. Comparing the packet density across the stages in the case of replication while routing shows that the higher the stage number, the larger the number of packets: due to replication, there are many more packets in the last stages than in the first ones. The only exception occurs if the traffic pattern results in a destination distribution that forces packet replication to take place already at the first stage; then the number of packets is equal in all stages. Such a distribution is, however, very unlikely in general. To set up multistage interconnection networks that are appropriate for multicasting, these different traffic densities of the stages must be considered. MLMINs, which are presented in this paper, belong to this kind of network. Their roots lie in replicated MINs.
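To make the routing and replication scheme concrete, the following minimal sketch (C++, like the authors' simulator, but purely illustrative; the function names and the prefix-counting shortcut are assumptions of this sketch, not code from the paper) computes how many copies of a multicast packet exist after each stage under CRWR.

    // Sketch: destination-tag routing with cell replication while
    // routing (CRWR) in a delta network of c x c SEs.
    #include <cstdio>
    #include <set>

    // Delta property: digit k of the destination address (base c, most
    // significant digit first) selects the SE output port in stage k,
    // independently of the input at which the packet entered.
    int sePort(int dest, int stage, int n, int c) {
        for (int s = n - 1; s > stage; --s) dest /= c;
        return dest % c;
    }

    // Copies present after stage k: one per distinct (k+1)-digit
    // destination prefix, since a copy is created only where the
    // required SE output ports of the multicast set diverge.
    int copiesAfterStage(const std::set<int>& dests, int k, int n, int c) {
        std::set<int> prefixes;
        for (int d : dests) {
            for (int s = n - 1; s > k; --s) d /= c;
            prefixes.insert(d);
        }
        return (int)prefixes.size();
    }

    int main() {
        const int c = 2, n = 3;        // 8x8 MIN of 2x2 SEs, as in Figure 2
        std::set<int> dests = {5, 7};  // packet destined to Outputs 5 and 7
        for (int k = 0; k < n; ++k)
            std::printf("after stage %d: %d copies\n", k,
                        copiesAfterStage(dests, k, n, c));
        return 0;  // prints 1, 2, 2: the copy is made in the middle stage
    }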

2.2 Replicated MIN

Replicated MINs enlarge regular multistage interconnection networks by replicating them L times. The resulting MINs are arranged in L layers. Corresponding input ports are connected, as are corresponding output ports. Figure 3 shows the architecture of an 8×8 replicated MIN consisting of two layers in a three-dimensional view. A lateral view of the same network is given in Figure 4. Such a concept was introduced by Kruskal and Snir [12].

[Figure 3: Replicated multistage interconnection network (L = 2, 3D view)]

[Figure 4: Replicated multistage interconnection network (L = 2, lateral view)]

Packets are received by the inputs of the network and distributed to the layers. Layers may be chosen at random, by round robin, depending on the layer loads, or by any other scheduling algorithm. The distribution is performed by a 1:L demultiplexer. At each network output, an L:1 multiplexer collects the packets from the corresponding layer outputs and forwards them to the network output. Two different output schemes are distinguished: single acceptance (SA) and multiple acceptance (MA). Single acceptance means that just one packet is accepted by the network output per clock cycle. If there are packets at more than one corresponding layer output, one of them is chosen; all others are blocked at the last stage of their layer. The multiplexer decides according to its scheduling algorithm which packet to choose. Multiple acceptance means that more than one packet may be accepted by the network output per clock cycle: either all packets are accepted, or just an upper limit R of them. If an upper limit is given, R packets are chosen to be forwarded to the network output and all others are blocked at the last stage of their layer. As a result, single acceptance is a special case of multiple acceptance with R = 1. In contrast to regular multistage interconnection networks, replicated MINs may cause out-of-order packet sequences. Sending packets that belong to the same connection to the same layer avoids destroying the packet order.
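The output acceptance rule can be stated compactly in code. The following sketch (illustrative C++; random selection is just one of the scheduling algorithms the text allows, and all names are assumptions) accepts up to R packets per clock cycle at one network output; R = 1 yields single acceptance.

    // Multiple acceptance with upper limit R at one L:1 output
    // multiplexer, assuming random selection among competing layers.
    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <vector>

    // 'layersWithPacket' lists the layers whose last stage holds a
    // packet for this output; up to R of them are accepted this cycle,
    // the rest stay blocked at the last stage of their layer.
    std::vector<int> acceptPackets(std::vector<int> layersWithPacket,
                                   int R, std::mt19937& rng) {
        std::shuffle(layersWithPacket.begin(), layersWithPacket.end(), rng);
        if ((int)layersWithPacket.size() > R)
            layersWithPacket.resize(R);   // all others are blocked
        return layersWithPacket;
    }

    int main() {
        std::mt19937 rng(1);
        std::vector<int> competing = {0, 1, 3};  // three layers compete
        for (int layer : acceptPackets(competing, 2, rng))  // MA, R = 2
            std::printf("accepted packet from layer %d\n", layer);
        return 0;  // with R = 1 this reduces to single acceptance
    }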

2.3 Multilayer MIN

Multilayer multistage interconnection networks (MLMINs) take the characteristics of multicast traffic into account. As mentioned above, the number of packets increases from stage to stage due to packet replication. Thus, more switching power is needed in the last stages than in the first stages of a network. To supply the network with the required switching power, the new network structure presented in this paper replicates the number of layers in each stage. The factor by which the number of layers is increased is called the growth factor G_F (G_F ∈ N\{0}). Figure 5 shows an 8×8 MLMIN (3 stages) with growth factor G_F = 2 in lateral view. That means the number of layers is doubled in each stage, and each switching element has twice as many outputs as inputs.

[Figure 5: Multilayer multistage interconnection network (G_F = 2)]

Consider for instance that 2×2 SEs are used. Such an architecture ensures that even in the case of two broadcast packets at the inputs, all packets can be sent to the outputs (if buffer space is available at the succeeding stage). On the other hand, unnecessary layer replications in the first stages are avoided. Choosing G_F = c ensures that no internal blocking occurs in an SE, even if all SE inputs broadcast their packets to all SE outputs. Nevertheless, blocking may still occur at the network output, depending on R.

A drawback of the new architecture arises from the exponentially growing number of layers with each further stage: the more network inputs are established, the more stages and therefore the more layers result. To limit the number of layers, and therefore the amount of hardware, two options are considered: starting the replication at a later stage and/or stopping further layer replication once a given number of layers is reached.

The first option is demonstrated in Figure 6 in lateral view. The example presents an 8×8 MLMIN in which replication does not start before Stage 2 (the last stage), with G_F = 2. A 3D view is given in Figure 7. The stage number at which replication starts is denoted by G_S (G_S ∈ N); Figures 6 and 7 thus show an MLMIN with G_S = 2. Of course, moving the start of the layer replication some stages to the rear not only reduces the number of layers; it also reduces the network performance, because fewer SEs and therefore fewer paths through the network remain. Section 4 presents some performance results.

[Figure 6: MLMIN in which replication starts at Stage 2 (lateral view)]

[Figure 7: MLMIN in which replication starts at Stage 2 (3D view)]

Stopping further layer replication when a given number G_L of layers is reached also reduces the network complexity (G_L ∈ N\{0}). It prevents exponential growth in the case of large networks. Figure 8 shows such an MLMIN with a limited number of layers in lateral view; a 3D view is presented in Figure 9. The number of layers of this 8×8 MLMIN is limited to an upper number of G_L = 2. Layers are replicated with a growth factor of G_F = 2. As in the previous option, the reduced number of SEs decreases the network performance as well (see Section 4).

[Figure 8: MLMIN with limited number of layers (lateral view)]

[Figure 9: MLMIN with limited number of layers (3D view)]

Both presented options can be combined to reduce the network complexity further. Such a network is determined by the parameters G_S (start of replication), G_F (growth factor), and G_L (layer limit). For instance, Figure 9 shows an MLMIN with G_S = 1, G_F = 2, and G_L = 2. Regular MINs and replicated MINs can be considered special cases of MLMINs: regular MINs are equivalent to MLMINs with G_F = 1 (in this case, G_S and G_L have no effect), and replicated MINs are equivalent to MLMINs with G_S = 0, G_F = L, and G_L = L.
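The resulting layer structure follows directly from the three parameters. The following sketch (illustrative C++, not from the paper; the function name and the use of a very large G_L to express "no limit" are assumptions) computes the number of layers in each stage.

    // Number of layers in stage k of an MLMIN with parameters G_S
    // (start of replication), G_F (growth factor) and G_L (layer limit),
    // transcribed directly from the description in Section 2.3.
    #include <algorithm>
    #include <cstdio>

    int layersAtStage(int k, int GS, int GF, int GL) {
        if (k < GS) return 1;                    // replication not started
        int layers = 1;
        for (int s = GS; s <= k; ++s)            // grow by G_F per stage,
            layers = std::min(layers * GF, GL);  // but never beyond G_L
        return layers;
    }

    int main() {
        // 16x16 MLMIN of 2x2 SEs: G_S = 1, G_F = 2, no limit.
        for (int k = 0; k < 4; ++k)
            std::printf("%d ", layersAtStage(k, 1, 2, 1 << 30));
        std::printf("\n");  // prints 1 2 4 8: Network "1248" of Section 4
        return 0;
    }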

3 COST

The cost of a multistage interconnection network is assumed to be determined by the number of crosspoints within the network and by the number of buffers.

3.1 Crosspoint Cost

The number of crosspoints within a network is given by the number of crosspoints within a switching element and by the number of switching elements within the network. SEs of regular MINs consist of c² crosspoints if c×c SEs are used. That means the crosspoint cost P_MIN of an N×N MIN, which consists of n = log_c N stages with N/c SEs in each stage, results in

    P_{MIN} = n \cdot \frac{N}{c} \cdot c^2 = n \cdot c^{n+1}    (1)

Replicated MINs are established by L layers of MINs, a demultiplexer at each input, and a multiplexer at each output. Each demultiplexer is composed of L crosspoints to distribute the incoming packets among the L layers. A multiplexer is also composed of L crosspoints to collect the packets from the L layers. Thus, the crosspoint cost P_RepMIN of a replicated MIN is given by

    P_{RepMIN} = L \cdot n \cdot c^{n+1} + L \cdot c^n + L \cdot c^n
               = L \cdot c^n \cdot (n \cdot c + 2)    (2)

The crosspoint cost of multilayer multistage interconnection networks is also mainly determined by the number and the crosspoint cost of the switching elements. The structure of the SEs that allow an increasing number of layers in the succeeding stage is slightly changed compared to regular SEs: the outputs are replicated by the growth factor G_F. For instance, a 2×2 SE with G_F = 2 is shown in Figure 10; each output is doubled. c×c SEs of such a structure consist of c inputs and G_F · c outputs, resulting in

    P_{MLSE} = c^2 \cdot G_F    (3)

crosspoints.

[Figure 10: 2×2 SE with G_F = 2]

An MLMIN in which layer growing starts at Stage G_S (G_S > 0) and in which the number of layers is not limited consists of G_S − 1 stages of regular SEs, n − G_S stages in which SEs according to Figure 10 are used (the number of layers is multiplied in each stage by G_F), and a last stage of G_F^{n−G_S} layers of regular SEs (no output replication is needed in the last stage; see Figure 5). Additionally, a multiplexer connects the G_F^{n−G_S} layers at each network output. The MLMIN crosspoint cost adds up to

    P_{MLMIN_{nolimit}} = \frac{N}{c} \cdot \left( (G_S - 1) \cdot c^2
        + c^2 \cdot G_F \cdot \sum_{i=0}^{n-1-G_S} G_F^i
        + G_F^{n-G_S} \cdot c^2 \right) + N \cdot G_F^{n-G_S}
      = c^n \cdot \left( c \cdot \left( G_S - 1
        + G_F \cdot \frac{G_F^{n-G_S} - 1}{G_F - 1}
        + G_F^{n-G_S} \right) + G_F^{n-G_S} \right)    (4)

where the sum represents a geometric series and is replaced by its closed form.

The crosspoint cost of an MLMIN with a limited number G_L of layers can be obtained by adapting Equation 4. First, the multiplexer cost is removed (it is added again later on). Then, the equation gives the crosspoint cost of the first part of the network, up to the stage x at which the growing of the layers stops (n within the brackets is replaced by x). Here, x denotes the first stage that has the same number of layers as the previous one. The remaining n − x network stages consist of G_L layers of regular SEs, followed by the multiplexers connecting the layers:

    P_{MLMIN_{limit}} = N \cdot c \cdot \left( G_S - 1
        + G_F \cdot \frac{G_F^{x-G_S} - 1}{G_F - 1}
        + G_F^{x-G_S} \right)
        + N \cdot (n - x) \cdot G_L \cdot c + N \cdot G_L    (5)

The stage x at which the growing stops is determined by

    x = G_S + \log_{G_F} G_L    (6)

with G_L ∈ { G_F^i | i ∈ N\{0} ∧ i ≤ n − G_S }. Equation 5 then results in

    P_{MLMIN_{limit}} = c^n \cdot \left( c \cdot \left( G_S - 1
        + G_F \cdot \frac{G_L - 1}{G_F - 1}
        + (n + 1 - G_S - \log_{G_F} G_L) \cdot G_L \right) + G_L \right)    (7)

Equation 7 is also valid for MLMINs with no limited number of layers if G_L is set to G_L = G_F^{n−G_S}.

3.2 Buffer Cost

The buffer cost of a network is determined by the number of buffers and by the size m_max of each buffer (assuming that all buffers are of equal size). Furthermore, all following equations include a constant H which represents the hardware cost ratio between a crosspoint and a buffer for one packet. For instance, assume that a crosspoint is realized by one transistor. To store one bit, usually two transistors are necessary (dynamic RAM). If the packet size is 53 bytes (e.g. ATM) and 8 bits are transferred in parallel, 8 crosspoint transistors relate to 8 · 53 · 2 = 848 memory transistors. The hardware factor of this example results in H = 848/8 = 106. That means the hardware cost of a buffer of size m_max = 1 is 106 times higher than that of a crosspoint.

Replicated MINs consist of L layers of MINs. Each N×N MIN is established by n stages of SEs. All N SE inputs of a stage are connected to a buffer. Therefore, the buffer cost of a replicated MIN is given by

    B_{RepMIN} = H \cdot m_{max} \cdot L \cdot N \cdot n    (8)

This equation is also valid for regular MINs if the number of layers is set to L = 1.

The number of buffers of an MLMIN with no layer limit is determined by G_S − 1 stages in which no growing occurs (resulting in just one layer) and by the remaining n + 1 − G_S stages in which the layers grow with factor G_F. Each layer of a stage consists of N buffers (due to the N SE inputs at each stage and layer), leading to a buffer cost of an MLMIN with no layer limit of

    B_{MLMIN_{nolimit}} = H \cdot m_{max} \cdot N \cdot (G_S - 1)
        + H \cdot m_{max} \cdot N \cdot \sum_{i=0}^{n-G_S} G_F^i
      = H \cdot m_{max} \cdot N \cdot \left( G_S - 1
        + \frac{G_F^{n+1-G_S} - 1}{G_F - 1} \right)    (9)

MLMINs with a layer limit of G_L consist, similar to unlimited MLMINs, of G_S − 1 stages in which no growing occurs. x + 1 − G_S stages deal with layer growing, with x defined as in Equation 6. The remaining n − x stages keep the maximum number of layers G_L. Thus, the buffer cost of an MLMIN with a layer limit results in

    B_{MLMIN_{limit}} = H \cdot m_{max} \cdot N \cdot \left( G_S - 1
        + \sum_{i=0}^{x-G_S} G_F^i + (n - x) \cdot G_L \right)
      = H \cdot m_{max} \cdot N \cdot \left( G_S - 1
        + \frac{G_L \cdot G_F - 1}{G_F - 1}
        + (n - G_S - \log_{G_F} G_L) \cdot G_L \right)    (10)

Equation 10 is also valid for MLMINs with no limited number of layers if G_L is set to G_L = G_F^{n−G_S}.
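As a sanity check, the cost formulas can be transcribed directly into code. The following sketch (illustrative C++; the function names and the use of G_L = G_F^{n−G_S} to express "no limit" are assumptions of this sketch) evaluates Equations 2, 7, 8 and 10 and reproduces the Table 1 entries for Networks 1248 and 8888 below.

    // Cost formulas of Section 3, transcribed for checking.
    #include <cmath>
    #include <cstdio>

    // Crosspoint cost of a replicated MIN, Equation (2).
    long repMinCrosspoints(long L, long c, long n) {
        return L * (long)std::pow(c, n) * (n * c + 2);
    }

    // Crosspoint cost of an MLMIN with layer limit GL, Equation (7);
    // setting GL = GF^(n-GS) recovers the unlimited case, Equation (4).
    long mlminCrosspoints(long c, long n, long GS, long GF, long GL) {
        long cn = (long)std::pow(c, n);
        long logGL = (long)std::llround(std::log(GL) / std::log(GF));
        return cn * (c * (GS - 1 + GF * (GL - 1) / (GF - 1)
                          + (n + 1 - GS - logGL) * GL) + GL);
    }

    // Buffer cost of an MLMIN in units of H * m_max, Equation (10);
    // GL = GF^(n-GS) recovers Equation (9).
    long mlminBuffers(long N, long n, long GS, long GF, long GL) {
        long logGL = (long)std::llround(std::log(GL) / std::log(GF));
        return N * (GS - 1 + (GL * GF - 1) / (GF - 1)
                    + (n - GS - logGL) * GL);
    }

    int main() {
        // Network 1248: 16x16, c = 2, n = 4, GS = 1, GF = 2, GL = 2^3 = 8.
        std::printf("1248: %ld crosspoints, %ld*H buffers\n",
                    mlminCrosspoints(2, 4, 1, 2, 8),
                    mlminBuffers(16, 4, 1, 2, 8));
        // Replicated MIN 8888 (L = 8); buffer cost via Equation (8).
        std::printf("8888: %ld crosspoints, %ld*H buffers\n",
                    repMinCrosspoints(8, 2, 4), 8L * 16 * 4);
        return 0;  // 832 / 240*H and 1280 / 512*H, matching Table 1
    }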

4 PERFORMANCE

This section compares the performance of various MIN architectures. As performance measures, the normalized throughput S_i at the inputs of the network, the normalized throughput S_o at the outputs of the network, the mean delay time d(k) of the packets in each stage, the mean delay time d_tot of the packets in the whole network, and the mean queue length m̄(k) of the buffers in each stage can be considered. This paper concentrates on the normalized throughput S_o at the outputs and the mean delay time d_tot of the packets, the two most important performance measures. As multicasting is considered, many different assumptions about the shape of the network traffic are possible [13]. The simplest case is to assume that all possible combinations of destination addresses for each packet entering the network are equally distributed; this traffic pattern is used in this paper. Performance results are obtained by simulation. The simulator is written in the C++ programming language [14] and is based on Akaroa [15]. As simulation termination criteria, the maximum relative error was set to 2% and the confidence level to 95%.
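The traffic assumption can be made precise with a small sketch (illustrative C++, not the authors' simulator): drawing each output independently with probability 1/2 and rejecting the empty set samples uniformly from all 2^N − 1 possible nonempty destination combinations.

    // Uniformly distributed multicast destinations: every nonempty
    // combination of the N network outputs is equally likely.
    #include <cstdio>
    #include <random>
    #include <vector>

    std::vector<int> randomDestinationSet(int N, std::mt19937& rng) {
        std::bernoulli_distribution coin(0.5);
        std::vector<int> dests;
        do {
            dests.clear();
            for (int out = 0; out < N; ++out)
                if (coin(rng)) dests.push_back(out);  // include this output
        } while (dests.empty());     // a packet needs at least one target
        return dests;
    }

    int main() {
        std::mt19937 rng(42);
        for (int d : randomDestinationSet(16, rng)) std::printf("%d ", d);
        std::printf("\n");           // one destination set for a 16x16 MIN
        return 0;
    }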

Figures 11 to 14 show the performance of 16×16 networks consisting of 2×2 SEs with a buffer size of m_max = 1 at each stage and SE input. Multiple acceptance of R = 4 packets per network output and clock cycle is assumed. The layer a packet is sent to is chosen randomly. Conflicts among packets competing for resources (SE outputs, buffers, layers) are also resolved at random. The networks operate with store-and-forward routing. Because 2×2 SEs are used, the number of packets can at most double at each stage, even if packets are broadcast to all network outputs. Therefore, doubling the number of layers in each stage seems to be sufficient growth. That leads to a growth factor of G_F = 2, with growing started at Stage G_S = 1 and no layer limit. The resulting 16×16 MLMIN consists of one layer at Stage 0, two layers at Stage 1, four layers at Stage 2, and eight layers at Stage 3. Such a network is named 1248 in the following figures (the digits of the legend refer to the number of layers at Stage 0, Stage 1, etc.).

First, network architectures with 8 layers in the last stage are compared (Figures 11 and 12). The throughput of Network 1248 is close to that of Network 1888, which is established with a higher number of layers starting at Stage 1. It is also close to Network 8888, representing a replicated MIN with eight layers. If layer growing starts later than at Stage 1, the throughput breaks down (Network 1188). The delay of Network 1248 turns out to be the lowest of all compared networks (Figure 12). Table 1 compares the network costs. The lowest costs are incurred by Network 1248, especially if a hardware factor of H = 106 (as mentioned in Section 3) is assumed. Network 1248 thus combines the lowest costs, the lowest delay, and high throughput.

[Figure 11: 8 layers in last stage (throughput)]

[Figure 12: 8 layers in last stage (delay)]

Table 1: 8 layers in last stage (costs)

    Network   G_S  G_F  G_L    Crosspoints  Buffers
    1248       1    2    -         832      240 · H
    1888       1    8    8        1152      400 · H
    1188       2    8    8         928      288 · H
    8888      rep. MIN: L = 8     1280      512 · H

Figures 13 and 14 compare Network 1248 with less expensive networks that consist of just four layers in the last stage, and additionally with replicated MINs. Network 1248 performs best in terms of throughput in relation to the networks with four layers in the last stage. The networks that catch up with its throughput (replicated MINs with five and six layers) suffer from much higher delays and from higher costs if H > 1 is assumed (e.g. H = 106). The costs of the networks are given in Table 2. Again, Network 1248 is the best solution.

Table 2: Costs of various architectures

    Network   G_S  G_F  G_L    Crosspoints  Buffers
    1248       1    2    -         832      240 · H
    1244       1    2    4         512      176 · H
    1124       2    2    -         416      128 · H
    4444      rep. MIN: L = 4      640      256 · H
    5555      rep. MIN: L = 5      800      320 · H
    6666      rep. MIN: L = 6      960      384 · H

[Figure 13: Throughput of various architectures]

[Figure 14: Delay of various architectures]

Figures 15 and 16 investigate SEs larger than 2×2. The performance of 64×64 networks consisting of 4×4 SEs with a buffer size of m_max = 1 at each stage and SE input is examined. Multiple acceptance of R = 8 packets per network output and clock cycle is assumed. This time, 4×4 SEs are used, which means that packets can be destined to at most four SE outputs at each stage. As a result, multiplying the number of layers by four in each stage seems to be the optimal growth. Consequently, a growth factor of G_F = 4, with growing started at Stage G_S = 1 and no layer limit, is chosen. A 64×64 MLMIN emerges with one layer at Stage 0, four layers at Stage 1, and 16 layers at Stage 2. Such a network is named 1_4_16 in Figures 15 and 16 (again, the digits of the legend refer to the number of layers at Stage 0, Stage 1, etc.). Networks 7_7_7, 8_8_8, 1_2_4, and 1_1_16 cannot compete with Network 1_4_16 concerning throughput. The throughput of Network 1_16_16 is comparable to that of the investigated network, but it suffers from its high costs (Table 3) and a higher delay. Due to the costs and the performance, Network 1_4_16 is the best solution, as predicted before.

[Figure 15: 64×64 networks with 4×4 SEs (throughput)]

[Figure 16: 64×64 networks with 4×4 SEs (delay)]

Table 3: 64×64 networks with 4×4 SEs (costs)

    Network    G_S  G_F  G_L    Crosspoints  Buffers
    1_4_16      1    4    -        10240     1344 · H
    1_1_16      2   16    -         9472     1152 · H
    1_16_16     1   16   16        13312     2112 · H
    1_2_4       1    2    -         2816      448 · H
    7_7_7      rep. MIN: L = 7      6272     1344 · H
    8_8_8      rep. MIN: L = 8      7168     1536 · H

The previous investigations indicate that multilayer multistage interconnection networks show the best performance-to-delay relation. This holds especially if a growth factor of G_F = c is chosen, i.e. if the growth factor equals the switching element size. Furthermore, layer growing should start at Stage 1 (G_S = 1) and should have no limit. Of course, no layer limit is only reasonable for networks with a small number of stages.

5 CONCLUSIONS

With multilayer multistage interconnection networks, a very efficient network type for multicasting is introduced. This paper describes how MLMINs are established and which parameters determine such an architecture. Performance and cost of MLMINs are calculated and compared to regular MINs and to replicated MINs. It turned out that the MLMIN architecture shows better, or at least similar, performance in terms of throughput if the network costs are equal. The delay times of MLMINs heavily undercut those of replicated MINs. Comparing various MLMIN architectures shows that the most powerful architectures use a layer growth factor equal to the switching element size. A structure in which growing starts at Stage 1 should be preferred. If an exponential layer explosion occurs due to many network stages, an upper layer limit helps to reduce the hardware cost. In future work, the influence of the buffer size on the optimal number of network layers must be investigated. Dealing with various kinds of network traffic will also help to characterize MLMINs in more detail.

References

[1] Gheith A. Abandah and Edward S. Davidson. Modeling the communication performance of the IBM SP2. In Proceedings of the 10th International Parallel Processing Symposium (IPPS'96); Hawaii. IEEE Computer Society Press, 1996.

[2] Toshio Soumiya, Koji Nakamichi, Satoshi Kakuma, Takashi Hatano, and Akira Hakata. The large capacity ATM backbone switch "FETEX-150 ESP". Computer Networks, 31(6):603-615, 1999.

[3] Ra'ed Y. Awdeh and H. T. Mouftah. Survey of ATM switch architectures. Computer Networks and ISDN Systems, 27:1567-1613, 1995.

[4] V. E. Beneš. Mathematical Theory of Connecting Networks and Telephone Traffic, volume 17 of Mathematics in Science and Engineering. Academic Press, New York, 1965.

[5] Daniel M. Dias and J. Robert Jump. Analysis and simulation of buffered delta networks. IEEE Transactions on Computers, C-30(4):273-282, April 1981.

[6] Janak H. Patel. Performance of processor-memory interconnections for multiprocessors. IEEE Transactions on Computers, C-30(10):771-780, October 1981.

[7] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32:406-424, March 1953.

[8] V. E. Beneš. Optimal rearrangeable multistage connecting networks. Bell System Technical Journal, 43:1641-1656, March 1964.

[9] F. A. Tobagi, T. Kwok, and F. M. Chiussi. Architecture, performance, and implementation of the tandem-banyan fast packet switch. IEEE Journal on Selected Areas in Communications, 9(8):1173-1193, October 1991.

[10] Sandeep Sibal and Ji Zhang. On a class of banyan networks and tandem banyan switching fabrics. IEEE Transactions on Communications, 43(7):2231-2240, July 1995.

[11] Hong Xu, Yadong Gui, and Lionel M. Ni. Optimal software multicast in wormhole-routed multistage networks. IEEE Transactions on Parallel and Distributed Systems, 8(6):597-606, June 1997.

[12] Clyde P. Kruskal and Marc Snir. The performance of multistage interconnection networks for multiprocessors. IEEE Transactions on Computers, C-32(12):1091-1098, 1983.

[13] Dietmar Tutsch and Marcus Brenner. Multicast probabilities of multistage interconnection networks. In Proceedings of the 12th European Simulation Symposium 2000 (ESS'00); Hamburg, pages 554-558. SCS, September 2000.

[14] Dietmar Tutsch, Matthias Hendler, and Günter Hommel. Multicast performance of multistage interconnection networks with shared buffering. In Proceedings of the IEEE International Conference on Networking (ICN 2001); Colmar, pages 478-487. IEEE, July 2001.

[15] Krzysztof Pawlikowski, Victor W. C. Yau, and Don McNickle. Distributed stochastic discrete-event simulation in parallel time streams. In Proceedings of the 1994 Winter Simulation Conference; Lake Buena Vista, pages 723-730, December 1994.

Biographies

Dietmar Tutsch received the degree Dipl.-Ing. in Electrical Engineering from the University of Saarbrücken (Germany) in 1993. In 1998, he received his Ph.D. in computer science from TU Berlin, where he is now a senior research assistant. From fall 2000 to spring 2001, he joined the ICSI at Berkeley. His research interests include Petri nets, performance evaluation, communication networks, and high performance computing.

Günter Hommel has been a professor in the computer science department at TU Berlin since 1984. He studied Electrical Engineering and received a Ph.D. in computer science from TU Berlin. After several years of practical experience at the Nuclear Research Center in Karlsruhe and the GMD National Research Center in Bonn, he became a professor in the computer science department of TU München in 1982, where he worked in the fields of real-time programming and robotics. His current research interests include Petri nets, performance evaluation, communication networks, real-time systems, and robotics.