Exploring partitioning methods for 3D Networks-on-Chip utilizing

Exploring Partitioning Methods for 3D Networks-on-Chip Utilizing Adaptive Routing Model

Masoumeh Ebrahimi, Masoud Daneshtalab, Pasi Liljeberg, Juha Plosila, Hannu Tenhunen
Department of Information Technology, University of Turku
Joukahaisenkatu 3-5 B, 20520 Turku, Finland

{masebr,masdan,pakrli,juplos,hanten}@utu.fi

ABSTRACT Three-Dimensional (3D) integration is a solution to the interconnect bottleneck in Two-Dimensional (2D) Multi-Processor Systems-on-Chip (MPSoCs). 3D IC design improves performance and decreases power consumption by replacing long horizontal interconnects with shorter vertical ones. Because multicast communication is commonly used in parallel applications, performance can be significantly improved by supporting multicast operations at the hardware level. In this paper, we propose a set of partitioning approaches, each with a different level of efficiency. In addition, we present an advantageous method named Recursive Partitioning (RP), in which the network is recursively partitioned until all partitions contain a comparable number of nodes. With this approach, the multicast traffic is distributed among several subsets and the network latency is considerably decreased. We also present the Minimal Adaptive Routing (MAR) algorithm for unicast and multicast traffic in 3D-mesh Networks-on-Chip (NoCs). The idea behind the MAR algorithm is to use the Hamiltonian path to provide a set of alternative paths.

Categories and Subject Descriptors C.2.1 [COMPUTER-COMMUNICATION NETWORKS]: Network Architecture and Design - Packet-switching networks.

General Terms Algorithms, Performance, Design.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. NOCS'11, May 1-4, 2011 Pittsburgh, PA, USA Copyright 2011 ACM 978-1-4503-0720-8… $10.00

1. INTRODUCTION Technology trends toward more processing elements, higher levels of integration, and higher performance require a scalable and efficient communication infrastructure. The Network-on-Chip (NoC) architecture paradigm, based on a modular packet-switched mechanism, can address many on-chip communication design issues such as wiring complexity and the integration of a large number of Intellectual Property (IP) cores into a 2D chip [1][2][3]. 3D technology can overcome the limited floor-planning choices of 2D designs and allows each layer to use a specific technology [6]. The major advantages of 3D NoCs are the considerable reduction in average wire length and wire delay, resulting in lower power consumption and higher performance [5][7][8][9]. The routing protocols in NoCs and MPSoCs can be unicast or multicast [10]. In unicast communication a message is sent from a source node to a single destination node, while in multicast communication a message is delivered from one source node to an arbitrary number of destination nodes. Unicast is a special case of multicast, but unicast routing cannot support multicast messages efficiently [11]. This inefficiency arises for several reasons: (1) sending multiple copies of the same message into the network not only imposes a significant amount of traffic on the network but also increases the overall power consumption, and (2) the multiple unicast messages must access the local link connected to the router sequentially, which introduces additional latency. As the vast majority of traffic in MPSoCs is unicast and most studies have assumed only unicast traffic, unicast communication has been studied extensively in the literature. The proposed unicast protocols are efficient when all injected messages are unicast. However, even if only a small percentage of the total traffic is multicast, the efficiency of the overall system is considerably reduced. Multicast communication has a large impact on CMP system performance and is frequently employed in coherence protocols such as directory-based protocols, token coherence protocols, and the Intel QPI protocol [12][13]. For instance, in the SGI-Origin protocol, a directory-based protocol, around 5% of the total traffic is generated by multicast messages. In this protocol, the network latency can be reduced by 50% if multicast is supported in hardware, which highlights the importance of hardware-level multicast support. To determine the percentage of multicast messages in cache coherence protocols, we used a synthetic benchmark (TPC-H [14]) and analyzed application traces (i.e. SPLASH-2 [15], PARSEC [16][17]) in two popular cache coherence protocols, MESI [18] and token-based MOESI [19][20].
According to our analysis, on average, 6% of MESI traffic and 99% of token-based MOESI traffic are multicast. A 3D NoC can use a different topology for each of its layers, such as mesh [7][21], torus [7][21], and ring [7]. In this work we limit our considerations to 3D-mesh NoCs, in which each layer consists of a 2D mesh. Routing algorithms can be classified into deterministic and adaptive. A deterministic routing algorithm uses a fixed path for each pair of nodes, resulting in increased network latency, especially in congested networks [22][23]. In contrast, in adaptive routing algorithms, a packet is not restricted to a single path when traveling from a source node to its destination(s). Therefore, adaptive routing algorithms can obtain better performance in a congested network by using alternative routing paths [22][23]. The rest of this paper is organized as follows: Section 2 reviews related work. The proposed partitioning methods are discussed in Section 3. The minimal adaptive routing is presented in Section 4. The results are given in Section 5, and we summarize and conclude in the last section.

2. RELATED WORK Some research has been conducted to evaluate the performance metrics of 3D NoCs. The authors in [5][21] demonstrate that, besides reducing the footprint of a fabricated design, 3D designs provide better performance than traditional 2D designs. They have also demonstrated that both mesh and tree topologies for 3D systems achieve better performance than traditional 2D systems. However, the mesh topology shows significant performance gains in terms of throughput, average latency, and energy dissipation with a small area overhead [5]. In [9], different 3D-mesh based architectures are compared in terms of zero-load latency in order to compare the speed and power consumption of 3D NoCs with 2D NoCs. Because multicast communication is commonly used in parallel applications, there have been several attempts to improve the performance of multicast communication in 2D NoCs. “Virtual Circuit Tree Multicasting” (VCTM) [12], “Recursive Partitioning Multicast” (RPM) [24] and “Hamiltonian path multicast algorithm for NoCs” [25] are three recent works in the realm of 2D NoCs; RPM and VCTM are tree-based methods, while the algorithm proposed in [25] is path-based. In VCTM, a setup message is sent from a source node to all destinations in order to build a virtual circuit tree, and then the multicast message is sent down the tree. RPM assumes that the network is divided into several partitions. This method minimizes the message replication time by defining priority rules to reach each partition. The authors in [25] presented a deadlock-free adaptation of the dual-path multicast algorithm for 2D-mesh NoCs and then evaluated the impact of the proposed method on the performance of the network, demonstrating the efficiency of the proposed multicast algorithm. An adaptive multicast communication scheme for 3D-mesh networks is discussed in [26].
The algorithm is based on an extension of a theory defined in [27] from 2D to 3D-mesh networks. The algorithm utilizes the Hamiltonian path and prevents deadlocks by using virtual channels. However, adding virtual channels is costly in NoCs due to increased arbitration complexity and buffering requirements [28]. Two other methods of unicast/multicast communication in 3D-mesh NoCs are presented in [29]. The proposed methods are guaranteed to be deadlock-free because they use the Hamiltonian path. However, the presented algorithms suffer from low performance and an inability to partition the network efficiently. In this paper, we present several partitioning methods for 3D-mesh NoCs in order to improve the performance of unicast/multicast communication. In addition, we propose an advantageous partitioning method, named the recursive partitioning method, which outperforms the other presented methods, and finally we propose an adaptive routing algorithm for all proposed partitioning methods.

3. PARTITIONING METHODS The performance of multicast communication is measured in terms of its latency in delivering a message to all destinations. Multicast latency consists of two parts: startup latency and network latency. The startup latency is the time required to break a message into several packets (each with different destinations), prepare the packets, and deliver them completely to the network. The network latency is defined as the time from the injection of the first flit into the network until the tail flits of all packets have reached their corresponding destinations. Partitioning methods reduce network latency by dividing the network into several partitions and reducing the overall path length. Nevertheless, breaking the network into partitions is subject to the following constraints: 1- Increasing the number of network partitions leads to additional startup latency due to the preparation time of more packets at the source node. 2- Breaking the network into unbalanced partitions creates long paths in the network, which increases the latency to reach the last destination and therefore the network latency for multicast messages. We call this factor “fairness”. The Hamiltonian path strategy [11] guarantees that the network will be free of deadlocks for unicast and multicast traffic. The Hamiltonian path visits each node exactly once along the path. As shown in Fig. 1(a), each node is assigned a label from 0 to N-1, where N is the number of nodes in the network. Several Hamiltonian paths can be considered in the mesh topology. In a 3D a×b×c mesh, each node is represented by the ordered triple (x,y,z), where x is the X-coordinate, y is the Y-coordinate and z is the Z-coordinate. The following equations show one possibility of assigning the labels, which we utilize in this paper:

$$
L(x,y,z) =
\begin{cases}
(a \times b \times z) + (a \times y) + x & \text{where } z \text{ even},\ y \text{ even} \\
(a \times b \times z) + (a \times y) + (a - x - 1) & \text{where } z \text{ even},\ y \text{ odd} \\
(a \times b \times z) + (a \times (b - y - 1)) + (a - x - 1) & \text{where } z \text{ odd},\ y \text{ even} \\
(a \times b \times z) + (a \times (b - y - 1)) + x & \text{where } z \text{ odd},\ y \text{ odd}
\end{cases}
$$

As exhibited in Fig. 1, two directed Hamiltonian paths (or two subnetworks) are constructed by the labeling. The high channel subnetwork (Fig. 1(b)) starts at node 0, and the low channel subnetwork (Fig. 1(c)) ends at node 0.
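The four labeling cases can be sketched in a few lines of Python. This is an illustrative sketch, not code from the paper: the function names `hamiltonian_label` and `is_hamiltonian` are hypothetical, and the second function only checks that consecutive labels land on physically adjacent nodes, which is the property that makes the labeling a Hamiltonian path.

```python
def hamiltonian_label(x, y, z, a, b):
    """Label of node (x, y, z) in an a x b x c mesh, following the four cases above."""
    # Even layers number their rows upward; odd layers number them back down.
    row = y if z % 2 == 0 else b - y - 1
    # The sweep direction along X alternates with the parity of y,
    # and the alternation is mirrored on odd layers.
    forward = (y % 2 == 0) if z % 2 == 0 else (y % 2 == 1)
    col = x if forward else a - x - 1
    return a * b * z + a * row + col


def is_hamiltonian(a, b, c):
    """Check that labels 0..N-1 are all used and consecutive labels are mesh neighbors."""
    pos = {hamiltonian_label(x, y, z, a, b): (x, y, z)
           for z in range(c) for y in range(b) for x in range(a)}
    if sorted(pos) != list(range(a * b * c)):
        return False
    # Neighbors in a mesh differ by exactly 1 in exactly one coordinate.
    return all(sum(abs(p - q) for p, q in zip(pos[i], pos[i + 1])) == 1
               for i in range(a * b * c - 1))
```

For instance, `is_hamiltonian(3, 3, 3)` checks the 3×3×3 mesh of Fig. 1, and in a mesh with a = b = 4 the node at (1,1,0) receives label 6.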

3.1 Two-Block Partitioning (TBP) The TBP is a base scheme in which the network is partitioned into high and low channel subnetworks. The high channel subnetwork contains all directional channels with nodes labeled in ascending order, and the low channel subnetwork contains all directional channels with nodes labeled in descending order. In this method, all destination nodes are split into at most two disjoint groups: a high group and a low group. The high group consists of all destination nodes with higher labels than the source node, and the low group contains all destination nodes with lower labels. With the label assignment described in the Hamiltonian path strategy, the destination nodes located in the same layer as the source node are divided into at most a high and a low group, while all destinations in higher (lower) layers are put in the high (low) group (see Fig. 3). In addition, one packet is created for each group, and the destinations within each packet must be sorted in the order in which they are visited along the path. Therefore, destinations in the high group are sorted in ascending order and the other destinations in descending order. The created packets are routed via the high and low channel subnetworks. The pseudo code for the TBP method is illustrated in Fig. 2. Fig. 3 shows an example of the partitioning policy; the share of the network covered by each partition depends on the position of the source node. As illustrated in Fig. 3(a), if the source node is located in the middle layer, the two partitions cover a comparable number of nodes, although both partitions remain large. However, in Fig. 3(b), one partition contains considerably more nodes than the other. Now, suppose that the multicast message m=(6,{1,2,19,25,44}) is generated by the core at source node 6. The destinations are split into two groups according to their labels: GH={19,25,44} and GL={1,2}.
The packet created for GH uses the following path: {6,9,10,11,12,19,20,21,22,25,38,41,42,43,44}, where 14 hops are needed to reach the last destination. The packet path for GL is {6,5,2,1}, where 3 hops are required to deliver the packet to all destinations.
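The grouping and sorting step of TBP can be sketched as follows. This is a minimal illustration under the assumption that nodes are identified only by their Hamiltonian labels; the function name `two_block_partition` is hypothetical, and the sketch covers only the destination split, not the routing of the resulting packets.

```python
def two_block_partition(source, destinations):
    """Split destination labels into a high group and a low group (TBP)."""
    # High group: labels above the source, visited in ascending order.
    high = sorted(d for d in destinations if d > source)
    # Low group: labels below the source, visited in descending order.
    low = sorted((d for d in destinations if d < source), reverse=True)
    return high, low


# The multicast message m = (6, {1, 2, 19, 25, 44}) from the text.
high, low = two_block_partition(6, {1, 2, 19, 25, 44})
print(high)  # [19, 25, 44]
print(low)   # [2, 1]
```

The low group comes out as [2, 1] rather than [1, 2] because the packet on the low channel subnetwork reaches node 2 before node 1, matching the path {6,5,2,1}.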

Fig. 1. (a) A 3×3×3 mesh physical network with the label assignment (b) high channel and (c) low channel subnetworks. The solid lines indicate the Hamiltonian path and dashed lines indicate the links that could be used to reduce path length in routing.

Algorithm: Two-Block Partitioning (TBP)
Inputs: a×b×c network; destinations labels; source label;
Begin
  For “i: 0 to number of destinations” loop
    If (destinationLabel(i) > sourceLabel) then
      -- sort GH in ascending order
      GH ...

      ... n) then
        --Divide the given P into two new partitions(Gi,Gi+1)
        G=>Gi→((0:[(x_p)/2]),y_p,z_p), Gi+1→(([(x_p)/2]:x_p-1),y_p,z_p);
        Partitioning(Gi,Num_Pi);
        Partitioning(Gi+1,Num_Pi+1);
      Else
        Return (G,Num_P);
      End if;
    End Partitioning;
    --Construct a msg. for each group
End RP;

Fig. 4. The pseudo code of the VBP method.

Fig. 5. The VBP method (a) balanced partitions (b) unbalanced partitions

Fig. 6. The pseudo code of the RP method.

Fig. 7. The RP method when the source node is at (a) node 25 (b) node 6

P(u) = {v | v ∈ V and (u, v) ∈ E and v ≠ u}. The multicast message can be represented by m=(u,D), where u ∈ V is the source node, D = {d1,d2, . . . ,dx} is the set of ordered destination nodes, and x is the number of destination nodes. Each node in the graph has a label (L) determined by the Hamiltonian path labeling mechanism. For a given node u and a destination d, the MAR algorithm finds the possible neighbors of the current node that can be selected to deliver a packet, so:

Step3: Since the MAR algorithm provides several choices at each node, the goal of Step3 is to route a packet through the least congested neighboring node. When a packet can be forwarded through multiple neighboring nodes, the stress values of the candidate neighbors are checked and the packet is sent to the neighbor with the smallest stress value.
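The selection rule in Step3 amounts to a minimum search over the candidate neighbors. The sketch below is illustrative only: `select_next_hop` is a hypothetical name, the candidate list is assumed to come from the earlier steps of MAR, and the stress value is modeled as a plain number per neighbor, since its exact definition is not part of this excerpt.

```python
def select_next_hop(candidates, stress):
    """Return the candidate neighbor with the smallest stress value.

    candidates: labels of the neighbors admitted by the earlier MAR steps (assumed given).
    stress: mapping from neighbor label to a numeric congestion measure (assumed).
    """
    if not candidates:
        raise ValueError("no admissible neighbor")
    # min() keeps the first candidate when several share the smallest stress value.
    return min(candidates, key=lambda n: stress[n])


# Two admissible neighbors; the one with the lower stress value is chosen.
print(select_next_hop([9, 25], {9: 3, 25: 1}))  # 25
```

Ties are resolved here by list order, which is an arbitrary choice of this sketch and not something the text specifies.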

If L(u)
