2008 IEEE Region 10 Colloquium and the Third ICIIS, Kharagpur, INDIA December 8-10. PAPER IDENTIFICATION NUMBER: 247

Mesh-of-Tree Based Scalable Network-on-Chip Architecture

Santanu Kundu, Radha Purnima Dasari, Santanu Chattopadhyay
Dept. of Electronics and Electrical Communication Engg.
Indian Institute of Technology, Kharagpur
Kharagpur, India
Email: [email protected], [email protected], [email protected]

Abstract— Scalability has become an important consideration in Network-on-Chip (NoC) design. The term scalability is widely used in the parallel processing community: for massively parallel computing, a scalable system is one whose performance increases linearly with system size. Scalability analysis can be used to select the best architecture for a problem under different constraints on the growth of the problem size and the number of processors. In this paper, we analyze the scalability of a Mesh-of-Tree topology based network.

Keywords— Mesh-of-Tree (MoT), Network-on-Chip (NoC), Scalability, System-on-Chip (SoC), Cycle Accurate Simulator.

I. INTRODUCTION

From the hardware point of view, the main trend in the processor industry is to achieve better performance by exploiting parallelism, in the form of multiple cores on a single silicon die. Network-on-Chip (NoC) is a new paradigm for designing future Systems-on-Chip (SoC) [1], in which multiple cores are connected to the communication fabric (a router based network) through network interfaces.

Scalability is a common objective in NoC design, yet it has no commonly accepted, precise definition. There is, however, some consensus that as the size of a scalable system increases, a corresponding increase in performance is obtained. For designers of new architectures, small experimental prototypes can then demonstrate the viability of ideas in much larger systems: the known performance on a few processors can be used to predict the performance of the system with many processors. Most importantly, once a specific architecture is shown to be scalable, systems whose sizes vary over a wide range can use that same architecture. The approaches employed in multiprocessor systems for computing can be classified into two large groups [2]:
• Scale-Up: adding resources to a single node in a system, such as adding CPUs or memory to a single server in a network.
• Scale-Out: adding more nodes to a system, such as adding new servers to a network.

Kanchan Manna
School of Information Technology
Indian Institute of Technology, Kharagpur
Kharagpur, India
Email: [email protected]

Scale-up solutions have represented the mainstream of commercial computing for the past several years. Recently, scale-out solutions, in the form of clusters of smaller systems, have gained increased acceptance for commercial computing as they offer better performance. In [2], it was concluded that a pure scale-up approach is not very effective in using all the processors of a large node; on the other hand, in a scale-out approach, larger numbers of nodes increase management complexity. A trade-off between the two models is therefore needed. As NoC is used to design large systems with many cores, it is essential to address the scalability of such a network.

II. RELATED WORKS

For NoC, topologies like mesh [3], torus [4], folded torus [5], octagon [6], fully binary tree [7], fat-tree [8], and butterfly fat tree [9] have already been proposed. Mesh-of-Tree (MoT) is another potential candidate for NoC architecture. Balkan et al. [10] proposed a MoT interconnection structure for single-chip parallel processing; in their model, the node degree of each root level router is 18, which is not feasible for NoC due to the large area overhead. We have already proposed a MoT based deterministic routing scheme which ensures that packets always reach their destination through a shortest path and is deadlock free [12]. Because of limited space, readers are referred to [13] for a detailed comparative study of a set of recently proposed NoC topologies and their performance evaluation under realistic traffic models.

An M × N MoT (both M and N powers of 2) has M row trees and N column trees. Fig. 1 shows a 4 × 4 MoT structure. In [12], we connected two cores only at the leaf level (L), and none at the stem (S) and root (R) levels, to avoid deadlock. However, we did not discuss the scalability of our proposed network topology in [12]. In this paper, our contribution is to analyze the scalability of the MoT topology based network proposed in [12] and to find the optimum number of cores that can be connected at the leaf level. We also discuss the implementation details of the MoT based wormhole router architecture and evaluate the performance of different sized MoT based networks under

IEEE Kharagpur Section & IEEE Sri Lanka Section

978-1-4244-2806-9/08/$25.00© 2008 IEEE


different traffic conditions. The rest of the paper is organized as follows. Section III discusses the basic terminology of interconnection networks. Section IV presents the implementation details of the single-buffer based wormhole router. Section V briefly describes the simulation environment used to evaluate the performance of the MoT topology based network. Section VI optimizes the number of cores to be attached to each leaf level router. Section VII evaluates the performance for varying network sizes and discusses the scalability issues. Finally, Section VIII concludes the present work and points to future work.

Figure 1. 4 × 4 Mesh-of-Tree Topology (a), and its simplified graph (b).

III. BASIC DEFINITIONS

Diameter: The diameter [11] is the maximum distance between any two nodes in the network. A smaller diameter is preferable when building a network.
Bisection Width: The bisection width [11] is the minimum number of wires that must be removed in order to bisect the network. A larger bisection width enables faster information exchange and is therefore preferable.
Node Degree: The node degree [11] is the number of channels connecting a node to its neighbors. A lower node degree makes the network easier to build, whereas a higher node degree requires larger silicon area.
Bandwidth: Bandwidth refers to the number of bits that can be sent successfully to the destination through the network per second, expressed in bits per second (bps).
Throughput: Throughput (TP) is defined as [13],

TP = (Total messages completed × Message length) / (Number of IP blocks × Total time)

Here we assume message length and packet length to be equal. "Total messages completed" refers to the number of whole messages that successfully arrive at their destination IPs; "Message length" is measured in flits; "Number of IP blocks" is the number of functional IP blocks involved in the communication; and "Total time" is the time (in clock cycles) that elapses between the generation of the first message and the reception of the last. A flit (flow control unit) is a smaller unit over which flow control is performed.
Latency: Transport latency [13] is defined as the time (in clock cycles) that elapses between the injection of the message header into the network at the source node and the reception of the tail flit at the destination node. To reach the destination node from a source node, flits must travel through a path consisting of a set of switches and interconnects, called stages. Depending on the source/destination pair and the routing algorithm, each message may have a different latency. There is also some overhead at the source and destination that contributes to the overall latency. Therefore, for a given message i, the overall latency Li is

Li = sender overhead + transport latency + receiver overhead

We use the average overall latency as a performance metric in our evaluation methodology. Let P be the total number of messages reaching their destination IPs and let Li be the overall latency of message i, where i ranges from 1 to P. The average latency, Lavg, is then calculated as follows:

Lavg = (1/P) × Σ_{i=1 to P} Li
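The two performance metrics above can be sketched in Python; the numeric values below are hypothetical, chosen only to exercise the formulas:

```python
def throughput(total_messages, message_length_flits, num_ip_blocks, total_cycles):
    """Throughput in flits/cycle/IP, following the definition from [13]:
    (total messages completed * message length) / (number of IP blocks * total time)."""
    return (total_messages * message_length_flits) / (num_ip_blocks * total_cycles)

def overall_latency(sender_overhead, transport_latency, receiver_overhead):
    """L_i = sender overhead + transport latency + receiver overhead (in cycles)."""
    return sender_overhead + transport_latency + receiver_overhead

def average_latency(latencies):
    """L_avg = (1/P) * sum of L_i over the P delivered messages."""
    return sum(latencies) / len(latencies)

# Hypothetical run: 3000 messages of 64 flits each, 32 IPs, 20000 cycles.
print(throughput(3000, 64, 32, 20000))  # → 0.3 flits/cycle/IP
print(average_latency([overall_latency(2, t, 2) for t in (30, 42, 50)]))
```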

IV. WORMHOLE ROUTER ARCHITECTURE DESIGN

The router is composed of instances of two kinds of modules: (A) the input channel and (B) the output channel.

A. Input Channel Module

The input channel module (Fig. 2a) is composed of four architectural blocks: (i) the Input Flow Controller (IFC), (ii) a dual-clock FIFO buffer, (iii) the Decision Maker, and (iv) the Input Read Switch (IRS). The IFC block implements the logic that translates between the handshake and FIFO flow control protocols. The IFC sends the 'in-req' signal to the previous router when its FIFO is not FULL. If the incoming flit is valid (in-val = 1) and the FIFO is not FULL, the flit is written into the FIFO buffer synchronously with 'wr-clk'. The incoming flit is acknowledged by the IFC block (in-ack)


after successful writing. The 'in-ack' and 'in-req' ports are connected to the 'out-ack' and 'out-req' ports of the output channel of the previous router, respectively, and the 'in-val' signal comes from the 'out-val' port of the output channel of the previous router.
In this NoC design, each router is locally synchronous with its own clock, and each core may operate at its own clock; there is no global restriction on the clock domains. We therefore follow the GALS (globally asynchronous, locally synchronous) style of communication, and have designed a binary-counter based dual-clock FIFO buffer to support it [15].
The Decision Maker block performs the deadlock-free deterministic routing function [12]. It detects the header among the flits leaving the FIFO and runs the routing algorithm to select an output channel. After detecting the header, it sends a request to one of the output channels; the selected output channel returns a grant signal when it is free. This grant signal goes to the IRS block of the input channel module. The input channel module also sends a 'read-ok' signal to the selected output channel; 'read-ok' is high if the outgoing flit is valid. The request and 'read-ok' signals remain high until the tail flit leaves the FIFO. Flits are written into the buffer with the write clock supplied by the previous router and leave the buffer synchronously with the router's own (local) clock.
The Input Read Switch (IRS) block receives three 'x-RD' signals and three 'x-gnt' signals from the output channel modules (Fig. 2a) and generates a granted read enable signal ('rd-en') for reading from the FIFO when it is not EMPTY. A clock signal ('tr-clk') also leaves the input channel module; it identifies the frequency at which data leaves the input channel module and serves as the write clock of the next router.

B. Output Channel Module

Each output channel consists of three architectural blocks: (i) the Arbiter, (ii) the Output Flow Controller (OFC), and (iii) the Output Read Switch (ORS).
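The FIFO handshake described for the input channel (accept and acknowledge a flit only when the buffer is not FULL, allow reads only when it is not EMPTY) can be sketched as a minimal single-clock model. The class name is our own; the real design is a dual-clock binary-counter FIFO [15], and the clock-domain crossing is deliberately omitted here:

```python
from collections import deque

class HandshakeFIFO:
    """Minimal single-clock sketch of the buffer behaviour described above:
    writes are accepted (acknowledged) only while the FIFO is not FULL,
    and reads succeed only while it is not EMPTY."""

    def __init__(self, depth=6):      # depth 6, as used in the paper
        self.depth = depth
        self.buf = deque()

    @property
    def full(self):
        return len(self.buf) == self.depth

    @property
    def empty(self):
        return len(self.buf) == 0

    def write(self, flit):
        """Returns True (the 'in-ack') iff the flit was stored."""
        if self.full:
            return False
        self.buf.append(flit)
        return True

    def read(self):
        """Returns the oldest flit, or None when EMPTY."""
        return self.buf.popleft() if not self.empty else None
```

A depth-2 instance, for example, acknowledges two writes, refuses the third, and then drains in FIFO order.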

When more than one input channel module sends a request signal to a particular output channel, the arbiter block selects one request using round-robin scheduling [14], to avoid starvation, and sends a grant signal ('x-gnt') to that requester (Fig. 2b). The input channel accepts the grant signal and passes flits in pipelined fashion for as long as 'x-gnt' is high. The grant signal is also an input to the ORS block, along with the 'read-ok' and 'data-out' signals, as shown in Fig. 2b. The ORS block passes the selected data and 'read-ok' signal according to the 'x-gnt' signal. The OFC (Output Flow Controller) block generates the 'out-val' (valid output) signal from the 'read-ok' signals. The 'x-RD' signal is wired to the 'out-req' port.

C. Synthesis Results of the Router

We have designed all three types of routers (L, S, and R) separately. For FIFO depth = 6 and width = 32, the synthesis results are:
• Leaf (L) level router: (i) number of slices = 935, (ii) number of equivalent 2-input NAND gates = 3285, (iii) number of slice flip-flops = 1296, (iv) maximum router frequency = 115.8 MHz, (v) number of 4-input LUTs = 1504, (vi) maximum core frequency = 123.8 MHz.
• Stem (S) level router: (i) number of slices = 838, (ii) number of equivalent 2-input NAND gates = 2447, (iii) number of slice flip-flops = 966, (iv) maximum router frequency = 176.5 MHz, (v) number of 4-input LUTs = 1066.
• Root (R) level router: the root level router needs no routing decision or arbitration logic, and thus consists only of a binary-counter based FIFO in each input channel.
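The round-robin policy described above can be sketched in a few lines; the class name and interface are our own, and the actual arbiter is the hardware design of [14]:

```python
class RoundRobinArbiter:
    """Sketch of round-robin arbitration: among the requesting input
    channels, grant the one closest (cyclically) after the last granted
    index, so that no requester starves."""

    def __init__(self, num_inputs):
        self.n = num_inputs
        self.last = self.n - 1   # start so that input 0 has first priority

    def grant(self, requests):
        """requests: one bool per input channel.
        Returns the granted index, or None if nobody is requesting."""
        for off in range(1, self.n + 1):
            i = (self.last + off) % self.n
            if requests[i]:
                self.last = i
                return i
        return None

arb = RoundRobinArbiter(3)
print(arb.grant([True, True, False]))  # → 0
print(arb.grant([True, True, False]))  # → 1
print(arb.grant([True, True, False]))  # → 0 (priority wraps around)
```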

Figure 2. Router Architecture: (a) Input Channel Module, (b) Output Channel Module.


V. SIMULATION ENVIRONMENT

Owing to the area limitation of the FPGA, we have designed a cycle-accurate on-chip network simulator (written in C++) to conduct detailed performance evaluation of large networks based on the single-channel router. We have modeled the 2 × 2, 2 × 4, and 4 × 4 MoT architectures in the simulator. The simulator operates at the granularity of individual architectural components. The simulation test bench models the pipelined routers and the interconnection links; a single link traversal is assumed to complete within one clock cycle. Since many factors (e.g., routing path selection delay, FIFO delay) have a significant impact on NoC performance, the simulator models them accurately, with their actual values taken from the prototype router designs.

Here we replace each core with a Traffic Generator (TG) and a Traffic Receptor (TR). The traffic follows a homogeneous Poisson distribution [16] in which all TGs have the same rate parameter (λ = number of packets per 100 microseconds). In this simulator, the user may choose between uniform and localized traffic patterns. The locality factor, denoted α, is defined as the ratio of local traffic to total traffic. For example, in a 4 × 4 MoT the distances (d) of the destinations from any source are d = 0, 2, 4, 6, and 8. If α = 0.5, then 50 percent of the traffic goes to the cluster at d = 0. The rest of the traffic is distributed according to distance, such that a destination in a nearer cluster receives more traffic than a core in a farther cluster: 40, 30, 20, and 10 percent of the remaining traffic go to the clusters at d = 2, 4, 6, and 8, respectively.
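The locality-based split described above can be expressed as a small helper; the function name is our own, and the 40/30/20/10 division of the non-local traffic follows the text:

```python
def locality_split(alpha, weights=(0.4, 0.3, 0.2, 0.1)):
    """Traffic fractions per distance cluster (d = 0, 2, 4, 6, 8) for a
    4 x 4 MoT, following the scheme described above: a fraction alpha goes
    to the local cluster (d = 0), and the remainder is split across the
    other clusters with nearer clusters favoured."""
    rest = 1.0 - alpha
    return [alpha] + [w * rest for w in weights]

print(locality_split(0.5))  # → [0.5, 0.2, 0.15, 0.1, 0.05]
```

With α = 0.5, half the traffic stays local and the other half is divided 40/30/20/10 across d = 2, 4, 6, and 8, reproducing the percentages quoted above.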

Figure 3. Performance variation with number of cores under different traffic (a) α = 0, (b) α = 0.3, (c) α = 0.5, and (d) α = 0.8 in a 4 × 4 MoT based network.


If there is more than one core in a cluster, the traffic is randomly distributed among them. In our experiments we fixed the packet length at 64 flits and, for wormhole switching, the FIFO depth at 6. We supply non-coherent clocks of the same phase and frequency (f = 100 MHz) to all the routers, while the cores operate at different clocks. The simulator keeps injecting packets into the network for 200,000 router clock cycles, including 10,000 warm-up cycles.
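The fixed simulation parameters above can be collected in one place; this sketch is ours, and the field names are not taken from the simulator's actual code:

```python
from dataclasses import dataclass

@dataclass
class SimConfig:
    """Simulation parameters as stated in the text."""
    packet_length_flits: int = 64
    fifo_depth: int = 6
    router_clock_mhz: float = 100.0
    total_cycles: int = 200_000
    warmup_cycles: int = 10_000

    @property
    def measured_cycles(self):
        # cycles actually used for measurement, after warm-up
        return self.total_cycles - self.warmup_cycles

cfg = SimConfig()
print(cfg.measured_cycles)  # → 190000
```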

VI. OPTIMUM NUMBER OF CORES IN LEAF ROUTER

In this section we discuss the optimum number of cores that can be connected to each leaf level router. One can increase the number of cores attached, but the router size grows with the node degree and, since all cores transmit packets at the same rate parameter (λ), the average overall latency of the network also increases due to greater traffic congestion (Fig. 3). Here we focus on the trade-off between the number of cores attached to each leaf level router and the performance of the network.

In Fig. 3, we have plotted the throughput and average overall latency of a 4 × 4 MoT based network while varying the number of cores attached to each leaf level router. In our case, each flit is 32 bits and the router clock cycle is 10 ns. For n cores attached to the network, the bandwidth is calculated as

Bandwidth = (Max Throughput × n × 32) / 10 … (1)

In Fig. 4, we have plotted the bandwidth for varying numbers of cores attached to each leaf level router of the 4 × 4 MoT based network, with the maximum throughput values taken from Fig. 3. From Fig. 4, as the number of cores attached to each leaf level router increases from 2 to 3, there is no significant increase in bandwidth, while Fig. 3 shows that the average overall latency increases drastically. Thus the optimum number of cores attached to each leaf level router is 2.

VII. PERFORMANCE EVALUATION AND SCALABILITY ANALYSIS OF THE NETWORK

We fix the number of cores attached to each leaf level router at 2, connected via the local channels; no core is attached to the stem and root level routers. Therefore the node degrees of the leaf, stem, and root level routers are 4, 3, and 2, respectively, and in an M × N MoT the total number of cores connected is 2 × (M × N). We have plotted the network performance for varying offered loads of Poisson traffic in Fig. 5.
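The core-count and node-degree figures above reduce to a one-liner; the names below are ours:

```python
def mot_cores(m, n):
    """Total cores in an M x N MoT with 2 cores attached to each leaf
    level router, as fixed in the text: 2 * (M * N)."""
    return 2 * m * n

# Node degrees of the three router types, as stated in the text.
NODE_DEGREE = {"leaf": 4, "stem": 3, "root": 2}

print(mot_cores(4, 4))  # → 32
```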
The load of homogeneous Poisson traffic in the network is varied through the Poisson rate parameter (λ = number of packets per 100 microseconds); λ takes the values 10, 20, 30, 40, and so on, to increase the load injected into the network. Fig. 5 shows that the throughput initially increases with the offered load but saturates beyond a certain value. The latency also increases with the offered load, and grows exponentially beyond that saturation point.
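Homogeneous Poisson injection with rate λ packets per 100 μs can be sketched by drawing exponential inter-arrival gaps with mean 100/λ μs; this helper is illustrative, not the simulator's actual code:

```python
import random

def injection_times(lam, horizon_us, rng=None):
    """Packet injection instants (in microseconds) for one traffic
    generator under homogeneous Poisson traffic: lam packets per 100 us,
    i.e. exponential inter-arrival gaps with mean 100/lam us."""
    rng = rng or random.Random()
    mean_gap = 100.0 / lam
    t, times = 0.0, []
    while True:
        t += rng.expovariate(1.0 / mean_gap)
        if t >= horizon_us:
            return times
        times.append(t)

# With lam = 20, expect on the order of 20 injections per 100 us window.
print(len(injection_times(20, 100, random.Random(1))))
```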

Scalability is the property of exhibiting performance proportional to the number of processors employed. Our performance metric is the bandwidth (in Gbps) of the network when all source cores transmit packets at the same rate (λ). From Fig. 5, the maximum throughputs of the 2 × 2, 2 × 4, and 4 × 4 networks are 0.48, 0.32, and 0.28 flits/cycle/IP, respectively, under uniform traffic. Therefore, according to Equation (1), the bandwidths obtained under uniform traffic for the 2 × 2, 2 × 4, and 4 × 4 networks are 12.5, 16.38, and 28.67 Gbps, respectively. Fig. 6 shows that the bandwidth of the network increases with network size under both uniform and localized traffic. Thus the MoT network is scalable.
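The bandwidth figures above follow from Equation (1); a quick sketch (the 2 × 4 and 4 × 4 cases reproduce the quoted values to two decimals):

```python
def bandwidth_gbps(max_throughput, num_cores, flit_bits=32, cycle_ns=10):
    """Equation (1): max throughput (flits/cycle/IP) * cores * 32 bits
    per flit, delivered every 10 ns cycle; bits per ns equals Gbps."""
    return max_throughput * num_cores * flit_bits / cycle_ns

# 2 x 4 MoT: 16 cores at 0.32 flits/cycle/IP → ≈16.38 Gbps
# 4 x 4 MoT: 32 cores at 0.28 flits/cycle/IP → ≈28.67 Gbps
print(bandwidth_gbps(0.32, 16))
print(bandwidth_gbps(0.28, 32))
```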

Figure 4. Bandwidth versus number of cores attached to each leaf level router.

VIII. CONCLUSION

This paper presents a generic MoT architecture with good scalability. We have modeled a cycle-accurate simulator for evaluating the performance of large networks based on the single-channel router. Since scalability has no commonly accepted, precise definition, we consider the bandwidth of the network as the performance metric, with all cores transmitting packets at the same rate (λ). In our simulations we recorded the throughput and average overall latency for different values of the rate parameter (λ). We thus fixed the simulation time and emphasized how many more packets can reach their destinations through parallel processing within the same time.

REFERENCES

[1] L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm", IEEE Computer, Vol. 35, No. 1, pp. 70-78, 2002.
[2] M. Michael, J. E. Moreira, D. Shiloach, and R. W. Wisniewski, "Scale-up x Scale-out: A Case Study using Nutch/Lucene", IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2007.
[3] S. Kumar, A. Jantsch, J. P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, and A. Hemani, "A Network on Chip Architecture and Design Methodology", Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 117-124, 2002.
[4] W. J. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks", Proceedings of the 38th Design Automation Conference, pp. 684-689, 2001.
[5] W. J. Dally and C. L. Seitz, "The Torus Routing Chip", Journal of Distributed Computing, Vol. 1, No. 4, pp. 187-196, 1986.
[6] F. Karim, A. Nguyen, and S. Dey, "An Interconnect Architecture for Networking Systems on Chips", IEEE Micro, Vol. 22, No. 5, pp. 36-45, 2002.
[7] Y. L. Jeang, W. H. Huang, and W. F. Fang, "A Binary Tree Architecture for Application Specific Network on Chip (ASNOC) Design", IEEE Asia-Pacific Conference on Circuits and Systems, pp. 877-880, 2004.


Figure 5. Performance variation with network size (a) 2 × 2, (b) 2 × 4, and (c) 4 × 4 MoT for uniform and locality traffic using homogeneous Poisson distribution.

Figure 6. Bandwidth with network size variation for different α values.

[8] P. Guerrier and A. Greiner, "A Generic Architecture for On-Chip Packet-Switched Interconnections", Proceedings of Design, Automation and Test in Europe (DATE), pp. 250-256, 2000.
[9] P. P. Pande, C. Grecu, A. Ivanov, and R. Saleh, "High-Throughput Switch-based Interconnect for Future SoCs", Proceedings of the 3rd IEEE International Workshop on System-on-Chip for Real-Time Applications, pp. 304-310, 2003.
[10] A. O. Balkan, Q. Gang, and U. Vishkin, "A Mesh-of-Trees Interconnection Network for Single-Chip Parallel Processing", IEEE 17th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pp. 73-80, 2006.
[11] Interconnection Network Architectures, pp. 26-49, 2001. www.wellesley.edu/cs/courses/cs331/notes/notesnetworks.pdf
[12] S. Kundu and S. Chattopadhyay, "Mesh-of-Tree Deterministic Routing for Network-on-Chip Architecture", ACM Great Lakes Symposium on VLSI (GLSVLSI), pp. 343-346, 2008.
[13] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, "Performance Evaluation and Design Trade-offs for MP-SoC Interconnect Architectures", IEEE Transactions on Computers, Vol. 54, No. 8, pp. 1025-1040, 2005.
[14] E. S. Shin, V. J. Mooney III, and G. F. Riley, "Round Robin Arbiter Design and Generation", International Symposium on System Synthesis (ISSS), pp. 243-248, 2002.
[15] S. Kundu and S. Chattopadhyay, "Interfacing Cores and Routers in Network-on-Chip Using GALS", IEEE International Symposium on Integrated Circuits (ISIC), Singapore, 2007.
[16] S. M. Ross, Simulation (3rd edition), Academic Press, San Diego, CA, USA, 2002.
