Multicast Performance of Multistage Interconnection Networks with Shared Buffering

Dietmar Tutsch (1), Matthias Hendler (2), and Günter Hommel (2)

(1) International Computer Science Institute, 1947 Center St., Suite 600, Berkeley, CA 94704-1198, USA, [email protected]
(2) Technische Universität Berlin, Sekr. FR 2-2, Franklinstr. 28/29, D-10587 Berlin, Germany, {guppy,hommel}@cs.tu-berlin.de

Abstract. Multistage interconnection networks (MINs) are often proposed for establishing multiprocessor systems, ATM switches, or Ethernet switches. Various MIN structures exist to improve the performance. This paper investigates shared and non-shared buffer structures in the case of packet multicast. Shared buffers perform a dynamic buffer allocation but require a more complex switch management. The different behavior under uniform and non-uniform network traffic is examined. The simulation model copes with networks of arbitrary size, arbitrary switching element sizes, arbitrary buffer lengths in each network stage, and an arbitrarily chosen network load. Additionally, arbitrary multicast traffic patterns can be handled.

1 Introduction

Multistage interconnection networks (MINs) with the Banyan property are proposed to connect a large number of processors to establish a multiprocessor system [1]. They are also used as interconnection networks in Ethernet [18] and ATM switches [3]. Such systems require high performance of the network. To increase the performance of a MIN, Dias and Jump [4] inserted a buffer at each input of the switching elements (SEs) and developed an analytical model to predict its performance. Buffers at each SE allow the packets of a message to be stored until they can be forwarded to the next stage in the network. In their model, Dias and Jump reduced each stage in the network to one SE of this stage so that it could be mapped to a Markov chain.

Dietmar Tutsch is supported by the German Academic Exchange Service (DAAD) within the ICSI Postdoc Program.



Jenq [8] introduced a model with lower complexity than that of Dias and Jump by considering only one input port of an SE per stage to model the complete stage. Yoon, Lee and Liu [17] extended Jenq's model by allowing arbitrary buffer lengths in the network and arbitrary SE sizes. Atiquzzaman and Akhtar [2] and Zhou and Atiquzzaman [19] examined non-uniform traffic like hot spot traffic. There are few investigations on multicast routing in MINs [9,10,12] and on the structure of multicast ATM switches [6,11]. An analysis of multicasting in MINs is presented by Yang [16]; in contrast to the other models, however, it is not able to deal with the backpressure mechanism that handles full buffers. Tutsch and Hommel [14,13] extended Jenq's model such that the analytical model additionally copes with performance analysis of networks with multicasting. Multicasting includes the two special cases of unicasting and broadcasting of messages. Furthermore, the performance of MINs consisting of switching elements larger than 2×2 can be evaluated. In the case of store-and-forward routing, a transient performance evaluation is available [15]. In this paper, a simulation model is used to investigate a shared-memory approach in contrast to the previously mentioned network architectures. Packet multicast is taken into account.

The paper is organized as follows. In Section 2, the architecture of a multistage interconnection network with shared buffers is introduced. The simulation model of such a network is developed in Section 3. This model is used to investigate the performance of shared and non-shared network buffers. Section 4 summarizes the research.

2 MIN Architecture

Simulation models of MINs allow a QoS (quality of service) comparison of various network architectures. Of special interest are MINs that connect multiprocessor systems or establish ATM and Ethernet switches. These internally clocked N×N MINs consist of c×c switches arranged in n = log_c N stages (Figure 1). Internal clocking results in synchronously operating switches. In each stage k (0 ≤ k ≤ n−1) of non-shared buffer networks, there is a FIFO buffer of size m_max(k) in front of each switch input. Packets are routed from one stage to the next by store-and-forward routing or cut-through switching, with a backpressure mechanism handling full buffers. Multicasting is performed by copying the packets within the c×c switches while routing (cell replication while routing, CRWR). Each packet copy is sent to the desired switch output independently of the other copies, even if another copy is blocked; blocked copies are sent in the following clock cycles.

Networks with shared buffers are established by replacing the c FIFO input buffers of size m_max(k) of a c×c switch by one common buffer of size c · m_max(k) (Fig. 2). This shared buffer is organized as follows. Each switch input owns at least enough buffer space to store one packet, avoiding the isolation of inputs (see below). The remaining buffer space of c · m_max(k) − c packets is available to all inputs. Each input forms a FIFO input queue of packets.


Fig. 1. 3-stage non-shared buffer MIN consisting of c×c SEs

Fig. 2. 4×4 switch consisting of a shared buffer (left) and non-shared buffer (right)

If an input receives a new packet from the previous stage that has to be stored, the input allocates buffer space from the commonly used buffer part, if available. If no further buffer space is available, the packet is blocked at the previous stage. An input with a queue of more than one packet deallocates buffer space when it sends a packet to the next network stage; this space is returned to the pool of the commonly available buffer. Guaranteeing at least one buffer space to each input prevents an input without any buffer from being excluded from the switch routing process because it cannot receive a packet that has to be forwarded. For example, assume that one input (the hot spot input) receives many more packets than the others. This input could allocate all of the shared buffer space. Packets of the previous stage directed to the other inputs would then be blocked at the previous stage, even if their final destinations differ from that of the first packet queued at the hot spot input. Only the hot spot input would contribute to the switch traffic; all other inputs would remain idle.
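These allocation and deallocation rules can be sketched as follows (a minimal sketch only; class and member names are invented for illustration and are not taken from the authors' simulator):

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Sketch of the shared buffer policy described above: each of the c
// inputs keeps one reserved slot; the remaining c*(m_max - 1) slots
// form a common pool available to all inputs.
class SharedBuffer {
public:
    SharedBuffer(std::size_t c, std::size_t mMax)
        : queues_(c), pool_(c * (mMax - 1)) {}

    // Called when input 'in' receives a packet from the previous stage.
    // Returns false if the packet must be blocked there (backpressure).
    bool store(std::size_t in, int packet) {
        if (!queues_[in].empty()) {      // reserved slot already occupied
            if (pool_ == 0) return false; // no common space left: block
            --pool_;                      // allocate one slot from the pool
        }
        queues_[in].push(packet);
        return true;
    }

    // Called when input 'in' forwards its head-of-line packet.
    void forward(std::size_t in) {
        queues_[in].pop();
        if (!queues_[in].empty()) ++pool_; // queue had more than one packet:
                                           // a pool slot is returned; the
                                           // reserved slot stays occupied
    }

private:
    std::vector<std::queue<int>> queues_;  // FIFO queue per input
    std::size_t pool_;                     // free slots in the common part
};
```

Each non-empty queue thus consumes exactly one reserved slot plus one pool slot per additional queued packet, so a hot spot input can exhaust the pool but never another input's reserved slot.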


Additionally, the following assumptions hold for the presented simulation model. However, most of them can be changed to model further interesting network realizations with little effort, thanks to object-oriented modeling: the desired components of the simulation model are simply replaced or subclassed. The network structure is described by one class, while routing, queuing, and various other behaviors are encapsulated in separate classes, so the effort to change the network is minimized. The assumptions of the presented model are:

– All packets have the same size (as in ATM).
– Destination outputs are distributed uniformly; that means every output of the network is, with equal probability, one of the destinations of a packet.
– Conflicts between packets are solved randomly with equal probabilities.
– Packets are removed from their destinations immediately upon arrival.
– Routing is performed in a pipelined manner; that means the routing process occurs in every stage in parallel.

3 Buffer Structure Comparison

Previously mentioned MINs are simulated for performance evaluation. Networks consisting of switches with shared and non-shared buffers are compared. The simulation model is implemented in C++ [7]. It handles most kinds of network structures that are based on c×c switches but is optimized to model MINs. The network is represented as a directed graph starting at the sources (network inputs) and ending at the destinations (network outputs). Packets are generated at the sources. Each packet is provided with a tag determining the destination. Due to multicasting, this tag is modeled by a vector of N binary elements, each representing a network output. The elements of the desired outputs are set to "true". When a packet arrives at a c×c switch, the tag is divided into c subtags of equal size. Each subtag belongs to one switch output; the first subtag (lower indices) to the first output, etc. If a subtag contains at least one "true" value, a copy of the packet is sent to the corresponding output, carrying the subtag as its new tag. To keep the amount of allocated memory as small as possible, only representations of the packets, referred to as containers, are routed along the network paths. The containers are replaced by the packets at the network outputs, allowing evaluations. Figure 3 gives a short sketch of the simulation model. So-called ContainerMultiputs (CM) receive the containers and store them in the queues. At the first network stage, FirstContainerMultiputs (FCM) additionally perform the replacement of the packets by containers. So-called ContainerOutputs (CO) send the containers to the next network stage. At the last stage, LastContainerOutputs (LCO) additionally replace the containers by the corresponding packets. Each operation of a switch is coordinated by its Crossbar Manager. The clocks sequence the parallel actions within the computer simulation. The Deadlock Manager is only needed in case of multicast combined with wormhole routing; such a scenario, which is not the subject of this paper, may cause deadlocks.
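The tag-splitting step can be sketched as follows (a minimal illustration in the simulator's language, C++; the function name and types are invented for this sketch and are not the simulator's actual interface):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Tag = std::vector<bool>;  // one element per network output

// Split the incoming tag at a cxc switch into c subtags of equal size
// and emit a packet copy for every output whose subtag contains at
// least one "true" (cell replication while routing, CRWR).
std::vector<std::pair<std::size_t, Tag>> splitTag(const Tag& tag, std::size_t c) {
    std::vector<std::pair<std::size_t, Tag>> copies;
    const std::size_t len = tag.size() / c;  // length of each subtag
    for (std::size_t out = 0; out < c; ++out) {
        Tag subtag(tag.begin() + out * len, tag.begin() + (out + 1) * len);
        bool wanted = false;
        for (bool b : subtag) wanted = wanted || b;
        if (wanted)
            copies.emplace_back(out, subtag);  // copy carries the subtag
    }                                          // as its new tag
    return copies;
}
```

In a 16×16 MIN of 2×2 switches, for instance, a 16-element tag is halved at each of the four stages until one-element subtags identify individual network outputs.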

Fig. 3. Sketch of the simulation model (a MainClock drives the network clocks of stages 0–2; the eight sources feed FCMs, containers travel through the stages via CO/CM pairs to LCOs and the destinations; each switch is coordinated by a Crossbar Manager, and the container area is supervised by the Deadlock Manager)

Simulations are performed by starting multiple simulation runs in parallel, using a confidence level of 0.95 and a relative error of 0.02 as termination criteria. The simulation is observed and managed by the tool Akaroa [5]. All following figures identify the results of non-shared buffer networks by a legend “non-shared buffer: x, total z”, where x represents the buffer size of each FIFO buffer and z = c · x gives the overall buffer size of the switch. Shared buffer networks are identified by a legend “shared buffer: min v, max w, total z”, where v represents the minimal buffer size of each input, w the maximal buffer size of each input, and z the overall buffer size of the switch. The figures show the average throughput and delay times of 16×16 MINs consisting of four stages of 2×2 switches. The packets are routed by cut-through switching.
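The termination criterion can be illustrated with a small sketch. This is not Akaroa's actual algorithm (which also handles correlated observations and transient deletion); it merely shows the relative-error test under an independence assumption:

```cpp
#include <cmath>
#include <cstddef>

// Illustration of the stopping rule only: terminate once the
// half-width of the 95% confidence interval of the mean is at most
// 2% of the estimated mean.
bool preciseEnough(double sum, double sumSq, std::size_t n) {
    const double z = 1.96;       // normal quantile for 95% confidence
    const double relErr = 0.02;  // required relative error
    if (n < 2) return false;
    const double mean = sum / n;
    const double var = (sumSq - n * mean * mean) / (n - 1);  // sample variance
    const double halfWidth = z * std::sqrt(var / n);
    return halfWidth <= relErr * std::fabs(mean);
}
```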


First, a completely uniform network traffic is investigated: the offered load to all inputs is equal. Concerning multicasting, all output combinations occur with equal probability as the destination of a packet entering the network. Figure 4 shows the dependence between offered load and average throughput at the outputs in case of uniform network traffic.
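Such a pattern can be generated as sketched below (an assumption about the mechanism: drawing each output bit fairly and rejecting the empty combination yields a uniform distribution over all 2^N − 1 non-empty destination sets):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draw a destination tag so that every non-empty combination of the
// n network outputs is equally likely. The all-false tag is simply
// redrawn, which preserves uniformity over the remaining combinations.
std::vector<bool> uniformMulticastTag(std::size_t n, std::mt19937& rng) {
    std::bernoulli_distribution bit(0.5);
    std::vector<bool> tag(n);
    bool any = false;
    do {
        any = false;
        for (std::size_t i = 0; i < n; ++i) {
            tag[i] = bit(rng);
            any = any || tag[i];
        }
    } while (!any);  // reject the empty destination set
    return tag;
}
```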

Fig. 4. Uniform network traffic: normalized throughput at the output vs. offered load (curves: non-shared buffer: 1, total 2; non-shared buffer: 4, total 8; shared buffer: min 1, max 7, total 8)

Increasing the offered load from 0.01 to 1.0, the network reaches congestion for an offered load greater than approximately 0.14, due to the large number of packets caused by multicasting. Comparing the networks with shared and non-shared buffers of an overall buffer size of 8, no observable difference in throughput occurs: in case of uniform network traffic, the buffer structure does not affect the throughput. However, larger buffers result in higher throughput.

In the following, uniform traffic is replaced by merging sources sending unicast traffic with sources sending broadcast traffic. First, traffic established by one broadcast source is investigated. The offered load of this source is varied from 0.01 to 1.0, while all other sources send unicast traffic to the network with a fixed offered load of 0.2. Figure 5 shows the throughput of the merged traffic at the outputs for various buffer sizes and structures. Shared buffers achieve a higher throughput than non-shared buffers of the same size: the buffer space is used more efficiently. On the other hand, more efficiently used buffers result in larger packet queues at the switch inputs, and hence in higher delay times (Figure 6).

A further investigated traffic pattern is similar to the previous one, except that two broadcast sources are used. These sources may be located at various network inputs; depending on their locations, the first conflicts between their packets occur in different network stages. For example, if they are located at the first two inputs, they are already in conflict at the first network stage because they are located at the same switch.

Fig. 5. One broadcast source (throughput): normalized throughput at the output vs. offered load (curves: non-shared buffer: 1, total 2; non-shared buffer: 2, total 4; shared buffer: min 1, max 3, total 4; non-shared buffer: 4, total 8; shared buffer: min 1, max 7, total 8)

Fig. 6. One broadcast source (delay): average delay vs. offered load (same buffer configurations as Fig. 5)

If they are located at the first and last input, the first crossing of their packets' network paths occurs soonest at the last stage. Figure 7 shows the throughput taking the stage of the first conflict into account; a non-shared switch input buffer of size 1 is chosen. The sooner the network paths of both broadcast sources cross, the lower the network throughput. For example, if the crossing, and therefore the first conflict, occurs at the first stage (stage 0), the output of this switch equals the output in case of just one broadcast source at the inputs: both sources send a packet to both outputs, but no more than one packet can pass an output per cycle. A single broadcast source would likewise result in one packet passing each of the two outputs. If MINs are fed by more than one high-load source, the sources should therefore be placed in such a way that their paths cross as late as possible.
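Under a butterfly-style labeling in which the stage-k switches pair rows that differ exactly in address bit k (an assumption about the concrete Banyan topology, consistent with the two examples above), the stage of the earliest possible conflict is simply the most significant bit in which the two input labels differ:

```cpp
#include <cstddef>

// Sketch under the assumed butterfly-style labeling: for two distinct
// inputs a and b, broadcasts can first meet at the stage given by
// their most significant differing address bit.
std::size_t firstConflictStage(unsigned a, unsigned b) {
    unsigned diff = a ^ b;       // bits in which the labels differ (a != b)
    std::size_t stage = 0;
    while (diff >>= 1) ++stage;  // index of the most significant set bit
    return stage;
}
// Examples for the 16x16, 4-stage MIN:
//   firstConflictStage(0, 1)  == 0  (same first-stage switch)
//   firstConflictStage(0, 15) == 3  (first crossing at the last stage)
```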


Fig. 7. Two broadcast sources: normalized throughput at the output vs. offered load (curves: first conflict at stage 3, 2, 1, and 0)

Fig. 8. First conflict at last stage: normalized throughput at the output vs. offered load (curves: non-shared buffer: 1, total 2; non-shared buffer: 4, total 8; shared buffer: min 1, max 7, total 8; dotted: first conflict at stage 0 with non-shared buffer: 1, total 2)

A comparison of shared and non-shared buffer structures depending on the stage of the first conflict is presented in Figures 8 and 9. Figure 8 demonstrates the throughput behavior of various buffer structures for two broadcast sources that cause their first conflict at the last stage. Additionally, the dotted line allows a comparison with a source distribution resulting in a first conflict at the first stage. Broadcast sources that cause their first conflict at the first stage are evaluated in Figure 9, again for various buffer structures; here the dotted line shows the throughput of a source distribution resulting in a first conflict at the last stage. All figures show a higher throughput for network switches with a shared buffer compared to switches with non-shared buffers.

Fig. 9. First conflict at first stage: normalized throughput at the output vs. offered load (curves: non-shared buffer: 1, total 2; non-shared buffer: 4, total 8; shared buffer: min 1, max 7, total 8; dotted: first conflict at stage 3 with non-shared buffer: 1, total 2)

However, the throughput increase from non-shared to shared buffers is small and occurs only in case of network congestion. This gain is paid for with more complex switch hardware for managing the shared buffer.

4 Conclusion

Multistage interconnection networks are often proposed for establishing multiprocessor systems, ATM switches, or Ethernet switches. Various MIN structures exist to improve the performance. This paper compares the two buffer structures of shared and non-shared buffers in the case of packet multicast. Shared buffers perform a dynamic buffer allocation; at least space for one packet is reserved for each input, avoiding the isolation of inputs. In case of uniform network traffic, both buffer structures show identical behavior. Non-uniform traffic causes a slightly higher network throughput if shared buffers are used. On the other hand, shared buffers require a more complex switch management. If the network operates at high traffic load, e.g. caused by multicasting, the network stage of the first crossing of high-load paths heavily influences the network behavior.

References

1. Gheith A. Abandah and Edward S. Davidson. Modeling the communication performance of the IBM SP2. In Proceedings of the 10th International Parallel Processing Symposium (IPPS'96), Hawaii. IEEE Computer Society Press, 1996.
2. M. Atiquzzaman and M. S. Akhtar. Performance of buffered multistage interconnection networks in a non uniform traffic environment. Journal of Parallel and Distributed Computing, 30(1):52–63, October 1995.


3. Ra'ed Y. Awdeh and H. T. Mouftah. Survey of ATM switch architectures. Computer Networks and ISDN Systems, 27:1567–1613, 1995.
4. Daniel M. Dias and J. Robert Jump. Analysis and simulation of buffered delta networks. IEEE Transactions on Computers, C-30(4):273–282, April 1981.
5. Greg Ewing. Akaroa II. Version 1.2. User's Manual. New Zealand, July 1996.
6. Ming-Huang Guo and Ruay-Shiung Chang. Multicast ATM switches: Survey and performance evaluation. ACM Sigcomm: Computer Communication Review, 28(2):98–131, April 1998.
7. Matthias Hendler. Simulative Leistungsbewertung von gepufferten mehrstufigen Kommunikationsnetzen bei Cut-Through-Switching (in German). Master's thesis, Technische Universität Berlin, Germany, 2000.
8. Yih-Chyun Jenq. Performance analysis of a packet switch based on single-buffered banyan network. IEEE Journal on Selected Areas in Communications, SAC-1(6):1014–1021, December 1983.
9. Jaehyung Park and Hyunsoo Yoon. Cost-effective algorithms for multicast connection in ATM switches based on self-routing multistage networks. Computer Communications, 21:54–64, 1998.
10. Wenge Ren, Kai-Yeung Siu, Hiroshi Suzuki, and Masayuki Shinohara. Multipoint-to-multipoint ABR service in ATM. Computer Networks and ISDN Systems, 30:1793–1810, 1998.
11. Neeraj K. Sharma. Review of recent shared memory based ATM switches. Computer Communications, 22:297–316, 1999.
12. Rajeev Sivaram, Dhabaleswar K. Panda, and Craig B. Stunkel. Efficient broadcast and multicast on multistage interconnection networks using multiport encoding. IEEE Transactions on Parallel and Distributed Systems, 9(10):1004–1028, October 1998.
13. Dietmar Tutsch. Generating systems of equations for performance evaluation of buffered multistage interconnection networks. Technical Report FB Informatik 2000-07, Technische Universität Berlin, 2000.
14. Dietmar Tutsch and Günter Hommel. Multicasting in buffered multistage interconnection networks: an analytical algorithm. In 12th European Simulation Multiconference: Simulation – Past, Present and Future (ESM'98), Manchester, pages 736–740. SCS, June 1998.
15. Dietmar Tutsch and Günter Hommel. Multifractal multicast traffic in multistage interconnection networks. In Proceedings of the High Performance Computing Symposium 2001 (HPC 2001), Seattle, accepted for publication, April 2001.
16. Yuanyuan Yang. An analytical model for the performance of buffered multicast banyan networks. Computer Communications, 22:598–607, 1999.
17. Hyunsoo Yoon, Kyungsook Y. Lee, and Ming T. Liu. Performance analysis of multibuffered packet-switching networks in multiprocessor systems. IEEE Transactions on Computers, 39(3):319–327, March 1990.
18. B. Y. Yu. Analysis of a dual-receiver node with high fault tolerance for ultrafast OTDM packet-switched shuffle networks. Technical paper, 3COM, 1998.
19. Bin Zhou and M. Atiquzzaman. Efficient analysis of multistage interconnection networks using finite output-buffered switching elements. Computer Networks and ISDN Systems, 28:1809–1829, 1996.
