Improved Performance of Bidirectional Multistage Interconnection Networks by Reconfiguration∗

Daniel Lüdtke, Dietmar Tutsch, Arvid Walter, Günter Hommel
Technische Universität Berlin
Real-Time Systems and Robotics
D-10587 Berlin, Germany
{dluedtke, dietmart, asas, hommel}@cs.tu-berlin.de

Keywords: Reconfiguration, multistage interconnection networks, performance, simulation

Abstract

Multistage interconnection networks (MINs) are widely used to interconnect nodes in multiprocessor systems and to build switching fabrics for broadband communication networks. A common subclass of MINs is formed by bidirectional MINs (BMINs), which support turnaround routing. The special characteristic of BMINs is that messages between neighboring inputs/outputs profit from short paths through the network. A high ratio of such local communication increases the overall network performance compared to, for example, uniformly distributed traffic. This paper investigates the influence of local traffic on the performance of BMINs and introduces the concept of reconfiguration to support local traffic.

1 INTRODUCTION

Multistage interconnection networks (MINs) are proposed to connect a large number of processors to establish a multiprocessor system [1]. They are also used as interconnection networks in Gigabit Ethernet and ATM switches [2, 3]. Recently, the same topologies were also involved when networks on chip (NoC) were developed. Such systems require high performance of the network. MINs were first introduced by Beneš [4] for circuit switching networks. To increase the performance of a MIN, buffered MINs were established as packet switching networks. For instance, Dias and Jump [5] inserted a buffer at each switching element (SE). Buffers at each SE allow storing the packets of a message until they can be forwarded to the next stage in the network. Patel [6] defined delta networks. Delta networks are a subset of banyan-like networks (MINs with just one path between a given input and output). It is additionally required that packets can use the same routing tag to reach a certain network output independently of the input at which they enter the network.

Many variations of MINs were introduced. Most of them result in networks that lose the unique path property (and therefore the delta property) in order to reduce blocking. For instance, Beneš [7] suggested MINs consisting of a banyan-like network followed by its inverse. A common subclass of MINs is formed by the bidirectional multistage interconnection networks (BMINs), which support turnaround routing. Many commercial scalable parallel computers implement BMINs, for example the IBM SP2 interconnection network [8]. Due to their topology, BMINs support communication locality very well [9]. Local communication means that source and destination of a message are placed topologically close.

Congestion in an interconnection network influences the performance heavily. With more packets in the network, the probability of congestion rises. Congestion increases the latency and decreases the throughput of the network. Several techniques have been proposed to reduce or avoid congestion. One technique, for example, is to prevent packets from entering the network when congestion is detected [10]. Another technique to improve the network performance is to replicate parts of the network to avoid conflicts in certain SEs [11]. Other approaches are based on the dynamic reconfiguration of the network: the goal is to increase the performance by minimizing the length of the paths of packets in the network. These reconfigurable networks consist of topologies that can be dynamically reconfigured to meet the actual communication requirements. Sánchez et al. [12] described a reconfigurable direct network in which, in a two-dimensional torus topology, a node is able to exchange its position with a neighboring node.

This paper introduces the concept of reconfiguration of the BMIN topology to improve the network performance. Inputs/outputs that communicate a lot are brought into topologically closer positions: the interconnection at the ingress of the network is changed to group communication partners as much as possible.

The paper is organized as follows. The architecture of BMINs is briefly described in Section 2. Section 3 introduces the reconfiguration of the network topology. The performance of the reconfigurable network is determined in Section 4. Section 5 summarizes and gives conclusions, and Section 6 discusses future work.

∗ This research is supported by Deutsche Forschungsgemeinschaft (DFG) under Grant Ho 1257/22-1 within the Priority Programme 1148 “Rekonfigurierbare Rechensysteme”.

2 ARCHITECTURE OF BMIN

2.1 Topology and Routing

The architecture of the bidirectional multistage interconnection networks used in this paper is based on the BMIN model of Ni et al. [9]. Like all MINs, a bidirectional multistage interconnection network of size N × N (N inputs/outputs) consists of c × c switching elements (SEs) arranged in n = log_c N stages. SEs are usually realized as crossbars with buffers. To achieve synchronously operating switches, the network is internally clocked. There are several possibilities for the buffer configuration, like input or output buffering as well as separate or shared buffers. The further discussion refers to input buffering and to one buffer with space for m_max packets for each input at an SE. This model uses the global backpressure scheme [13]. That means that a packet is eventually transmitted to its destination output; no packet loss occurs inside the BMIN. Each port of an SE consists of two channels to allow bidirectional communication. Figure 1 illustrates an 8 × 8 butterfly BMIN with 2 × 2 bidirectional switching elements and three stages. In order to simplify the explanation, it is assumed that the input/output ports of the BMIN are located at the left-hand side. Available ports on the right-hand side (not shown) can be used to build larger networks [8].

The routing scheme for BMINs is known as turnaround routing [9]. The routing takes place in two phases. First, the packets are moved from the inputs in the right (forward) direction up to their turnaround stage k_turn. This denotes the stage that must be reached at least in order to be able to reach the destination output. The path to k_turn and the particular switch at k_turn do not matter; the decision can be made randomly. At k_turn the packet is turned around and sent in the opposite (backward) direction to its target output. In the backward direction only one path exists from the turnaround SE at k_turn to the target, like in unidirectional delta networks. To ensure shortest-path routing, k_turn is the leftmost of all possible turnaround stages, as mentioned above. Turnaround routing is a distributed routing scheme in which each SE determines locally the output channel based on the address header of each packet. An advantage is that with increasing network size, the number of redundant paths also increases. That means that the blocking probability for larger networks with the same traffic volume is lower.

Figure 1. Three-stage bidirectional butterfly MIN with eight inputs/outputs and 2 × 2 SEs

2.2 Local Traffic

Local traffic describes the fact that topologically neighboring inputs/outputs communicate. Local traffic can be characterized by the turnaround stage k_turn. Extreme local traffic arises when all packets turn in the first stage (k = 0). In other words, the probability that a packet turns in stage k = 0 under extreme local traffic is P_maxloc(k_turn = 0) = 1 and, respectively, P_maxloc(k_turn = 1) = 0, P_maxloc(k_turn = 2) = 0, . . . In the case of a network with 2 × 2 SEs and unicast traffic, there will then be no conflicts in the network; therefore, the throughput is equal to the offered load and the delay corresponds to one network cycle when store and forward switching is used. In contrast, for uniformly distributed traffic the turnaround probability increases in the rear stages. In the example of Figure 1:

P_unif(k_turn = 0) = 1/7,  P_unif(k_turn = 1) = 2/7,  P_unif(k_turn = 2) = 4/7.

Generally, for networks with c × c (bidirectional) SEs and n stages, the turnaround probability at stage k_turn in case of uniform traffic is:

P_unif(k_turn = k) = (c − 1) · c^k / (c^n − 1)
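This distribution can be cross-checked by counting destinations per turnaround stage. The following sketch is illustrative only (the helper names are ours, not from the paper's simulator); it computes the leftmost possible turnaround stage as the position of the most significant differing base-c address digit of source and destination:

```python
def min_turnaround_stage(src: int, dst: int, c: int) -> int:
    """Leftmost stage at which a packet from src can turn and still reach dst:
    the position of the most significant differing base-c address digit."""
    k, stage = 0, 0
    while src != dst:
        if src % c != dst % c:
            k = stage
        src, dst, stage = src // c, dst // c, stage + 1
    return k

def p_unif(k: int, c: int, n: int) -> float:
    """Turnaround probability at stage k under uniform traffic."""
    return (c - 1) * c**k / (c**n - 1)

# Cross-check the closed form against direct counting for the 8 x 8 example.
c, n = 2, 3
counts = [0] * n
for dst in range(1, c**n):                      # all destinations of source 0
    counts[min_turnaround_stage(0, dst, c)] += 1
probs = [cnt / (c**n - 1) for cnt in counts]
print(probs)                                    # the probabilities 1/7, 2/7, 4/7
assert all(abs(probs[k] - p_unif(k, c, n)) < 1e-12 for k in range(n))
```

The same counting argument yields the general formula: turning exactly at stage k reaches c^k · (c − 1) outputs, out of c^n − 1 possible destinations.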

3 RECONFIGURATION

The goal of reconfiguration is to increase the ratio of local traffic in the network by changing the topology. In detail, the connections from the inputs/outputs to the first stage are rewired to bring communication partners together. With uniformly distributed traffic, no network configuration exists that results in better performance. Admittedly, uniformly distributed traffic is very rare in practice. Consider a typical multiprocessor system with a bidirectional multistage interconnection network and several applications running: each application allocates an exclusive subset of processors, called a processor cluster [9].
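How local a given allocation is can be quantified, for example, as the mean leftmost turnaround stage over all intra-cluster source/destination pairs. This metric and the port assignments below are our own illustration (assuming 2 × 2 SEs and an 8-port network), not taken from the paper:

```python
def k_turn(a: int, b: int, c: int = 2) -> int:
    """Leftmost possible turnaround stage for ports a -> b:
    position of their most significant differing base-c digit."""
    k, stage = 0, 0
    while a != b:
        if a % c != b % c:
            k = stage
        a, b, stage = a // c, b // c, stage + 1
    return k

def mean_locality(clusters, c: int = 2) -> float:
    """Average k_turn over all ordered intra-cluster pairs (lower = more local)."""
    pairs = [(a, b) for cl in clusters for a in cl for b in cl if a != b]
    return sum(k_turn(a, b, c) for a, b in pairs) / len(pairs)

scattered = [[0, 1, 6, 7], [2, 3], [4, 5]]   # one cluster split across the network
grouped   = [[0, 1, 2, 3], [4, 5], [6, 7]]   # the same cluster made contiguous
print(mean_locality(scattered), mean_locality(grouped))   # 1.0 0.5
```

Regrouping the scattered cluster halves the mean turnaround stage, which is exactly the effect the rewiring aims at.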

Figure 2. Example of a reconfigurable BMIN: (a) shows the original network and (b) the reconfigured network

Communication in this system takes place predominantly inside these clusters. So, the processors within a cluster should become neighbors if possible. Often it is not possible to allocate applications to adjacent processors at runtime. Particularly if the applications have different starting times, a favorable allocation of the clusters is mostly not possible. Migration of an application context from one processor to another to localize the clusters is very costly [14]. The approach in this paper, on the other hand, reconfigures the network topology to localize application clusters and meet the communication requirements. A simple example clarifies the concept: Figure 2(a) depicts a small multiprocessor system with eight processors running three jobs or applications. Processors 1, 2, 7, 8 are performing the first job, 3 and 4 the second, and 5 and 6 the third. For the second and third job there are no optimization possibilities; they already communicate via the shortest paths (the possible routing paths are shown in Figure 2). The allocation of the cluster for Job 1, however, is inappropriate: some of the messages of Job 1 have to be sent to the last stage (k_turn = 2), which is not optimal. At a specific point in time (trec) the reconfiguration starts: the network inputs are closed during the entire reconfiguration process (Trec). No further packet is allowed to enter the network. Packets currently in the network have to leave it (Tpdp, packet drain phase); otherwise, the destination tags of the packets would be invalid after the reconfiguration. Afterwards, rewiring takes place. The inputs/outputs of the network are reconnected to the first stage (Figure 2(b)). This reconfiguration of the hardware needs the time interval Tcon. Simultaneously, the new target addresses are distributed among the communication adapters (the interfaces between processors and network).
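The rewiring can be thought of as choosing a new input/output permutation that places each cluster in a contiguous, size-aligned block of ports, so that intra-cluster packets turn in the earliest possible stages. The following greedy sketch is our own illustration of this idea (the paper does not specify a placement algorithm), using the clusters of the Figure 2 scenario:

```python
def regroup(clusters):
    """Greedily assign each cluster (largest first) to a contiguous block of
    ports aligned to the next power of two, so the whole cluster fits under
    a common subtree of the BMIN and turns early. Returns processor -> port."""
    mapping, port = {}, 0
    for cluster in sorted(clusters, key=len, reverse=True):
        block = 1
        while block < len(cluster):          # next power of two >= cluster size
            block *= 2
        port = -(-port // block) * block     # round up to an aligned boundary
        for proc in cluster:
            mapping[proc] = port
            port += 1
    return mapping

jobs = [[1, 2, 7, 8], [3, 4], [5, 6]]        # clusters of Figure 2(a)
print(regroup(jobs))
# {1: 0, 2: 1, 7: 2, 8: 3, 3: 4, 4: 5, 5: 6, 6: 7}
```

With this placement, all of Job 1's traffic turns no later than the second stage, matching the effect illustrated in Figure 2(b).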
The total reconfiguration time contains the packet drain

phase and the reconnection time:

Trec = Tpdp + Tcon

Tcon is determined by the underlying hardware; that could be, for example, a (partially reconfigurable) FPGA. Tpdp depends on the number of buffer places in the BMIN. To be sure that all packets have left the network at tcon (the starting point of the rewiring process), the worst case must be considered. This case occurs if all buffers of the network are filled and all packets are routed to one single output. It is assumed that packets are immediately consumed at the outputs of the network, thus no blocking occurs at the outputs. With this assumption it is guaranteed that at least one packet per clock cycle moves within the network, as long as a packet is present. Thus, the packet drain time is equal to the number of buffer places in the network. However, there are buffers which cannot route their packets to this specific output because of the turnaround routing (see Figure 3 for an example).

Figure 3. Determining the packet drain phase: only packets in the black buffers can be destined for Output 2

According to this, not all buffers in the network have to be considered for the packet drain phase. For the example above, Tpdp = 23 · m_max, where m_max is the storage volume of every single buffer. Generalized, the maximum packet drain phase is determined by

Tpdp = m_max · ( Σ_{k=0}^{n−1} (c^n − c^k + (k + 1) · c) − n · c )
     = m_max · ( n · c^n + Σ_{k=0}^{n−1} (c · k − c^k) ),

where Σ_{k=0}^{n−1} (c^n − c^k) counts the buffers in the forward direction, Σ_{k=0}^{n−1} (k + 1) · c the buffers in the backward direction, and −n · c represents the missing buffers in the last stage (k = n − 1). Of course, a reconfiguration of the network only makes sense if the traffic pattern remains much longer than Trec.

4 PERFORMANCE

This section shows the performance of a BMIN with different degrees of local traffic and compares the results with a unidirectional delta MIN. The time-dependent performance of the network during a reconfiguration is also presented. As performance measures, the normalized throughput Si at the inputs of the network, the normalized throughput So at the outputs of the network, the mean delay time d(k) of the packets in each stage, the mean delay time dtot of the packets in the whole network, and the mean queue length m̄(k) of the buffers in each stage can be considered. This paper focuses on the normalized throughput So at the outputs and the mean delay time dtot of the packets as the most important performance measures. With only unicast traffic present, the normalized throughput at the outputs is equal to the normalized throughput at the inputs. Performance results are obtained by simulation. The simulator is written in the C++ programming language [15]. The simulation is observed by the tool Akaroa [16]. As simulation termination criterion, the estimated precision was set to 2% and the confidence level to 98%. Cheemalavagu and Malek [17] investigated different SE sizes and concluded that 4 × 4 switching elements perform favorably in general and proved to be cost-effective. Also, in existing interconnection networks 4 × 4 switches are frequently used (e.g., in the Vulcan switch chip of the IBM SP2 interconnection network [8, 1]). For these reasons the investigated networks consist of 4 × 4 SEs. The following results emerge from a 256 × 256 bidirectional MIN and a unidirectional MIN with four stages each. Every buffer can store three packets (m_max = 3). The networks operate with store and forward switching and use the global backpressure mechanism. Conflicts among packets for resources (SE outputs, buffers) are solved randomly.
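The drain-phase bound from Section 3 can be evaluated numerically. This short check is our own (not part of the MINSimulate tool); it confirms that the two algebraic forms of the bound agree and reproduces Tpdp = 23 · m_max for the 8 × 8 example as well as Tpdp = 2889 for the 256 × 256 network simulated here (c = 4, n = 4, m_max = 3):

```python
def t_pdp(c: int, n: int, m_max: int) -> int:
    """Maximum packet drain time of a c-ary, n-stage BMIN with m_max-deep buffers."""
    full = sum(c**n - c**k + (k + 1) * c for k in range(n)) - n * c
    simplified = n * c**n + sum(c * k - c**k for k in range(n))
    assert full == simplified           # both closed forms must agree
    return m_max * full

print(t_pdp(2, 3, 1))   # 23: the 8 x 8 example, per buffer place (m_max = 1)
print(t_pdp(4, 4, 3))   # 2889: the 256 x 256 network of this section
```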

Figure 4. Different traffic patterns defined by the turnaround probability distributions of the stages

4.1 Steady-state Comparisons

First, the effect of three different traffic distributions in space on the normalized throughput So and the mean delay dtot is investigated. The traffic patterns are defined by the probabilities that a packet turns in a specific stage k. The predefinition of a turnaround stage for a packet constrains the possible targets. In the 8 × 8 BMIN of Figure 1, for example, a packet generated by source 5 with k_turn = 1 can reach the destinations 6 and 7 only. The turnaround stage thus defines a target cluster. Within these clusters the destination outputs are uniformly distributed. Figure 4 depicts the turnaround probabilities in the four stages of the 256 × 256 networks for the following traffic patterns:

uniform traffic: uniformly distributed packet destinations

non-local traffic: 95% of the packets turn in the last stage; this is a mirrored distribution of the local traffic pattern

local traffic: 95% of the packets turn in the first stage; the remaining 5% are uniformly distributed over the remaining destination outputs

The generation of succeeding packets is assumed to be mutually independent; no correlation between packets exists in the simulation. Figure 5 depicts the simulation results for the BMIN and the unidirectional MIN under the three traffic shapes. The offered load to the network was varied from 5% to 100%. The results show that the BMIN outperforms the unidirectional MIN for every traffic shape if the throughput is considered (Figure 5(a)). The BMIN reaches saturation at a higher offered load than the unidirectional MIN. Because the BMIN has several redundant paths, fewer conflicts occur in the network. There is also twice the number of buffers available in the network. Looking at the delay (Figure 5(b)), an inverse picture shows up: the delay is higher in a BMIN compared to the

unidirectional MIN, except for the local traffic shape, which will be discussed later. The higher delay of the BMIN is explained on the one hand by the longer paths a packet has to traverse through the network under certain conditions, and on the other hand by the higher number of buffers. Investigating the different traffic shapes, it is worth mentioning that there are only minor differences between the non-local and the uniform traffic shape. Even though the turnaround probabilities for the last stage differ (P(k_turn = 3) = 0.95 and P(k_turn = 3) ≈ 0.75), this has only little influence on the throughput performance. The delay is more sensitive to changes in the traffic distributions. This sensitivity of the mean delay is often observed in investigations of MINs.

Figure 5. Comparison of BMIN and MIN with three different target distributions: (a) throughput, (b) delay

Figure 6. Performance measures of a reconfiguration procedure: (a) throughput, (b) delay

The local traffic shape leads to the opposite behavior of the two architectures: BMINs give their best performance under such local traffic. Most packets can turn in the first stage; thus, the total number of packets in the network is reduced and fewer conflicts occur. In contrast, the unidirectional MIN gives a poor performance under those conditions. That is because, under extreme local traffic, the packets compete at the first stage for one output in the SEs: there are no redundant paths. In the stages behind, conflicts occur rarely and the buffer occupation is low. The network reaches saturation as soon as the buffers in the first stage are filled. For 4 × 4 SEs this happens starting from an offered load of approx. 25%, because at this load more than one packet on average requests the only SE output. Unidirectional MINs show their best performance with uniformly distributed traffic, since all paths and buffers are then equally utilized.
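The traffic shapes used above can be generated by first drawing a turnaround stage from the given distribution and then picking a uniform destination from the corresponding target cluster. The following generator is a hedged sketch of this procedure (our own code; the function names, the seed, and the small 8 × 8 parameters are illustrative):

```python
import random

def target_cluster(src: int, k: int, c: int):
    """Outputs reachable by turning exactly at stage k: the aligned c^(k+1)
    block around src minus its inner c^k block (reachable by an earlier turn)."""
    hi, lo = c ** (k + 1), c ** k
    inner = {src // lo * lo + i for i in range(lo)}
    block = [src // hi * hi + i for i in range(hi)]
    return [d for d in block if d not in inner]

def draw_destination(src: int, p_turn, c: int, rng: random.Random) -> int:
    """Pick k_turn according to the distribution p_turn, then a uniform
    member of the resulting target cluster."""
    k = rng.choices(range(len(p_turn)), weights=p_turn)[0]
    return rng.choice(target_cluster(src, k, c))

# Source 5 of the 8 x 8 BMIN: turning at stage 1 reaches outputs 6 and 7 only.
assert target_cluster(5, 1, 2) == [6, 7]

# local traffic shape: 95% turn in stage 0, the rest uniform over what remains
rng = random.Random(1)
p_local = [0.95, 0.05 * 2 / 6, 0.05 * 4 / 6]
samples = [draw_destination(5, p_local, 2, rng) for _ in range(1000)]
print(sum(d == 4 for d in samples) / len(samples))   # close to 0.95
```

Choosing the weights of the non-first stages proportionally to their cluster sizes (2 and 4 of the 6 remaining outputs) realizes the "remaining 5% uniformly distributed" part of the local pattern.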

4.2 Performance Measurements during Reconfiguration

Figure 6 shows the behavior of the 256 × 256 BMIN before, during, and after a reconfiguration procedure. The initial network configuration reflects the non-local traffic shape. Through the reconfiguration, the turnaround probabilities are changed to the local traffic shape. The input load is set to 70%. At time trec all inputs are closed, so that no further packet can enter the network. The theoretical drain-out phase is Tpdp = 2889 for this network. After that period the reconnection process takes place (Tcon). At time tnew the reconfiguration is completed and the input ports accept new packets again. In order to be able to terminate the simulation during the drain-out phase, the confidence level was decreased after trec + 20 cycles. Figure 6(a) shows the normalized throughput. It is noticeable that the network empties much faster than assumed. This can be explained by the fact that the disadvantageous cases (e.g., all packets in the network destined for one single output) are too unlikely to have an effect on the determination of the mean throughput. To investigate such rare events, other simulation approaches have to be considered [18]. The throughput rises so fast after tnew because of the short paths of most of the packets under the local target distribution. The transient phase is very short; the network reaches steady state after 25 cycles. The steep rise of the mean delay during the packet drain phase (Figure 6(b)) results from the fact that the majority of packets leave the network relatively fast; then only repeatedly blocked packets arrive at the outputs. The investigated reconfiguration only improves the performance if the traffic pattern lasts long enough (Tnew denotes the time interval in which the traffic assumptions are still valid after the reconfiguration):

So(trec − 1) ≤ ( Σ_{x=trec}^{trec+Trec+Tnew} So(x) ) / (Trec + Tnew)

dtot(trec − 1) ≥ ( Σ_{x=trec}^{trec+Trec+Tnew} dtot(x) ) / (Trec + Tnew)

5 CONCLUSION

This paper investigated the concept of improving the performance of bidirectional multistage interconnection networks by reconfiguration. These networks are used to establish broadband communication switches and multiprocessor systems, and are investigated for network-on-chip applications. The influence of different traffic distributions in space was examined. It was shown that the BMIN provides good performance if the ratio of local traffic is high. The goal of the reconfiguration by topology changes is to increase the ratio of local traffic. Therefore, the interconnections from the network inputs/outputs to the first stage are changed. Before the rewiring process can start, all packets have to leave the network to avoid packet loss. This packet drain time depends on the number of buffers in the network. It turned out that in the average case the network is empty much earlier than the worst-case assumption predicts. The idle time of the network reduces the benefit of this approach. On the other hand, an advantage of this concept is that no changes have to be made to the routing scheme and no additional hardware is needed within the switching elements. Only an underlying switching layer for the rewiring process is needed. This additional layer is inherently present if the network is implemented, for example, in a (partially reconfigurable) FPGA.

6 FUTURE WORK

Multistage interconnection networks are suitable to support hardware multicast (cell replication while routing, CRWR) [19, 20]. Future investigations will focus on the influence of multicast traffic on the performance of reconfigurable BMINs. In particular, the effect of local traffic in combination with different multicast traffic shapes will be considered. The time-dependent performance measures above identified the task of emptying the network prior to the rewiring process as not very efficient. Two approaches are possible: first, the packet drain phase can be shortened at the risk of packet loss. Second, a better approach seems to be to let the packets stay in their buffers during the rewiring process. For this, the turnaround routing must be modified. A simple modification is to send all packets to the last stage of the network after the rewiring process; from the last stage all targets are reachable. A further improvement is to develop an adaptive version of the turnaround routing to ensure that all packets reach their targets via the shortest paths. Two other aspects are not investigated yet: determining the reconfiguration point in time and choosing the best topology for a given traffic shape. To address these questions, research has to be performed in the area of traffic analysis and traffic prediction. A reconfiguration only makes sense if the actual traffic pattern lasts for a while. The choice of a suitable topology for the network is a typical optimization problem. Computation at runtime does not seem to be feasible, but the preselection of several wiring patterns and a choice from these patterns at the reconfiguration point in time seems to be possible.

References

[1] Abandah, A. G.; Davidson, E. S. 1996, “Modeling the Communication Performance of the IBM SP2.” In Proceedings of the 10th International Parallel Processing Symposium (IPPS’96), Honolulu, Hawaii, USA, 249–257.

[2] Soumiya, T.; Nakamichi, K.; Kakuma, S.; Hatano, T.; Hakata, A. 1999, “The large capacity ATM backbone switch FETEX-150 ESP.” Computer Networks, 31, no. 6: 603–615.

[3] Awdeh, R. Y.; Mouftah, H. T. 1995, “Survey of ATM switch architectures.” Computer Networks and ISDN Systems, 27, no. 12: 1567–1613.

[4] Beneš, V. E. 1964, “Optimal Rearrangeable Multistage Connecting Networks.” Bell System Technical Journal, 43, no. 4: 1641–1656.

[5] Dias, D. M.; Jump, R. J. 1981, “Analysis and Simulation of Buffered Delta Networks.” IEEE Transactions on Computers, C–30, no. 4: 273–282.

[6] Patel, J. H. 1981, “Performance of Processor–Memory Interconnections for Multiprocessors.” IEEE Transactions on Computers, C–30, no. 10: 771–780.

[7] Beneš, V. E. 1965, Mathematical Theory of Connecting Networks and Telephone Traffic, vol. 17 of Mathematics in Science and Engineering. Academic Press, Orlando, Florida, USA.

[8] Stunkel, C. B.; Shea, D. G.; Abali, B.; Atkins, M. G.; Bender, C. A.; Grice, D. G.; Hochschild, P.; Joseph, D. J.; Nathanson, B. J.; Swetz, R. A.; Stucke, R. F.; Tsao, M.; Varker, P. R. 1995, “The SP2 High-Performance Switch.” IBM Systems Journal, 34, no. 2: 185–204.

[9] Ni, L. M.; Gui, Y.; Moore, S. 1997, “Performance Evaluation of Switch-Based Wormhole Networks.” IEEE Transactions on Parallel and Distributed Systems, 8, no. 5: 462–474.

[10] Liu, J.-C.; Shin, K. G.; Chang, C. C. 1995, “Prevention of Congestion in Packet-Switched Multistage Interconnection Networks.” IEEE Transactions on Parallel and Distributed Systems, 6, no. 5: 535–541.

[11] Tutsch, D.; Hommel, G. 2003, “Multilayer Multistage Interconnection Networks.” In Proceedings of 2003 Design, Analysis, and Simulation of Distributed Systems (DASD’03), Orlando, USA, 155–162.

[12] Sánchez, J. L.; García, J. M. 2000, “Dynamic reconfiguration of node location in wormhole networks.” Journal of Systems Architecture, 46, no. 10: 873–888.

[13] Gianatti, S.; Pattavina, A. 1994, “Performance Analysis of ATM Banyan Networks with Shared Queueing – Part I: Random Offered Traffic.” IEEE/ACM Transactions on Networking, 2, no. 4: 398–410.

[14] Lu, C.; Lau, S.-M. 1994, “A Performance Study on Load Balancing Algorithms with Task Migration.” In Proceedings of 1994 IEEE Region 10’s Ninth Annual International Conference (TENCON ’94), Theme: ’Frontiers of Computer Technology’, IEEE, Singapore, vol. 1, 357–364.

[15] Tutsch, D.; Brenner, M. 2003, “MINSimulate – A Multistage Interconnection Network Simulator.” In Proceedings of the 17th European Simulation Multiconference: Foundations for Successful Modelling & Simulation (ESM’03), Nottingham, UK, 211–216.

[16] Ewing, G.; Pawlikowski, K.; McNickle, D. 1999, “Akaroa2: Exploiting Network Computing by Distributing Stochastic Simulation.” In Proceedings of the European Simulation Multiconference (ESM’99), International Society for Computer Simulation, Warsaw, 175–181.

[17] Cheemalavagu, S.; Malek, M. 1982, “Analysis and Simulation of Banyan Interconnection Networks with 2 × 2, 4 × 4 and 8 × 8 Switching Elements.” In Proceedings of the Real-Time Systems Symposium (RTSS 1982), IEEE Computer Society, Los Angeles, California, USA, 83–89.

[18] Kelling, C. 1996, “A Framework for Rare Event Simulation of Stochastic Petri Nets using RESTART.” In Proceedings of the 1996 Winter Simulation Conference, Coronado, CA, USA, 317–324.

[19] Xiong, Y.; Mason, L. 1998, “Analysis of Multicast ATM switching networks using CRWR scheme.” Computer Networks and ISDN Systems, 30, no. 9: 835–854.

[20] Tutsch, D.; Hommel, G. 1997, “Performance of Buffered Multistage Interconnection Networks in Case of Packet Multicasting.” In Proceedings of the 1997 Conference on Advances in Parallel and Distributed Computing (APDC ’97), IEEE Computer Society Press, Shanghai, 50–57.
