A Fast Custom Network Topology Generation with Floorplanning for NoC-based Systems Katherine Shu-Min Li, Member, IEEE, Shu-Yu Chen, Student Member, IEEE, Liang-Bi Chen, Member, IEEE, and Ruei-Ting Gu Abstract—This paper proposes a fast full-chip synthesis methodology which can be built a custom Network-on-Chip (NoC) topology for NoC-based systems. The processors and their communications are synthesized simultaneously in the system-level floorplanning process. The proposed method leads to accurate area estimation, which makes an algorithm much more efficient than previous approaches. Moreover, the wirelength-aware floorplanning is carried out to optimize circuit size as well as wire length. As a result, experimental results show that the proposed approach produces custom NoCs with better performance than previous methods while the computation time is significantly shorter. This method is also more scalable, which makes it ideal for complicated NoC-based systems. Index Terms—Custom NoC, Floorplanning, Network-on-Chip (NoC), Topology Generation.
T
I. INTRODUCTION
HE advance of process technology keeps on reducing device size. As a result, a system consists of multiple processing elements (PEs), which can be integrated into a single chip, called System-on-a-Chip (SoC). Compared with the traditional approach, the SoC technology offers many advantages such as higher performance and lower power consumption. The requirement for communication would be very complicated in an SoC with a large number of PEs. The multiple-to-multiple connection may also be required in complicated SoC designs. The conventional on-chip communication architecture, which consists of point-to-point connection and bus infrastructure, may not be able to provide sufficient communication support for an SoC. In the point-to-point connection structure, a direct link is assigned to any two modules that need communication. As a result, a large number of links are required, which dramatically inflates chip size. Besides, long communication wires may be required under this approach, which leads to excessive signal delay. The bus-based communication infrastructure is both Manuscript received February 13, 2011; revised April 7, 2011. This work was supported in part by the National Science Council (NSC), Taiwan, R.O.C. under Grants NSC-97-2220-E-110-009 and NSC-98-2221-E-110-090-MY2, and by the Ministry of Economic Affairs, Taiwan, R.O.C. under Grant 99-EC-17-A-01-S1-104. The authors are with Department of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung 80424, Taiwan, R.O.C. (e-mail:
[email protected];
[email protected]).
978-1-4244-9021-9/11/$26.00 ©2011 IEEE
efficient and flexible. On the flip side, the communication channel can only be utilized by a pair of modules at a time; thus, the bus may become a bottleneck of overall system performance in applications requiring heavy communication. A promising solution to these problems is to deploy the packet-switched networking infrastructure for on-chip communication. This approach is referred to as the Network-on-Chip (NoC) [1]-[4]. The NoC greatly improves the scalability of SoCs and achieves higher power efficiency compared to other types of communication structures. In an NoC-based system, which consists of network components including data links, network interfaces (NI), and routers. Links provide communication media on which data are transmitted. An NI is the interface between a PE and a router, and it is responsible for transforming data into packets or vice versa. The physical transmission path for a packet is set up by routers. The topology of the NoC is determined by the physical connections of links and routers, while connection paths for packets are selected by the routing algorithms. NoC topologies can be classified into two categories: regular, and irregular. If each PE is mapped a router, the PE and its assigned router are combined as a tile. A router is usually connected to other routers in neighboring tiles in addition to the PE in the same tile. There are two types of irregular topologies. The first type is constructed mainly to deal with the placement of PEs with diversified sizes and aspect ratios. In the second type, a custom topology is synthesized specially for given design goals. A custom NoC no longer consists of tiles, as routers are allocated according to the communication demand. In this paper, we will focus on the problem of irregular topology generation for custom NoC-based systems. II. RELATED WORK Regular topologies are better choices for general-purpose systems where traffic conditions are not predictable, and they are also easy to design. On the flip side, they involve more network components than necessary for ASICs with special communication demands. As a result, the chip becomes larger with higher power consumption and longer latency. These problems become more prominent as chips become larger. The custom NoC provides a better solution to the above problems in ASICs, as this approach produces smaller circuits. Specially, it was reported that power consumption in an NoC is highly
correlated to the link wirelength. Many custom NoC design methods have been discussed and developed in [5]-[10]. Previous studies on custom NoC designs treat topology generation and message routing separately, and route allocation has to be carried out for every demand on the given topology. Although this approach can generate good network topology, the computation time for route allocation can be a problem, especially when the number of PEs and communication data volume increase. Besides, network components are usually added after floorplan for PEs being decided, which may render the estimation of overall chip size inaccurate. Therefore, this paper proposes a custom NoC synthesis methodology for ASICs with special communication demands. The goal is to synthesize an irregular custom NoC floorplan that requires fewer network components and consumes less power. The proposed algorithm is also very fast, and can achieve accurate area estimation for the synthesized circuits. The proposed method consists of two steps: a topology generation based on communication analysis and co-floorplanning for PEs and routers. This scheme provides some advantages over previous methods. Since the proposed methodology can take the communication demands into account in the topology synthesis procedure so it is not necessary to allocate a route for each message. As a result, routing in NoCs synthesized by the proposed methodology is more efficient. The area estimation is more accurate with the proposed method, as PEs and network components are floorplanned at the same time. Besides, it provides a complete system-level floorplan where circuit size and overall network link length are optimized. III. THE PROPOSED METHODOLOGY The proposed custom NoC design flow is shown in Fig. 1. The goal is to develop a fast synthesis scheme that minimizes power consumption and overall wire length in a custom NoC-based system. A Communication Trace Graph (CTG) is a graph CTG = (VG, EG), where each v VG represents a block and each e EG is an edge in the corresponding block diagram. In a CTG, w(e) stands for the communication demand of edge e, and ηG is an edgemap showing the source node u and target node v where ηG(e) = (u, v), u, v VG , e EG. Given a CTG and the input/output port limits of routers, our synthesis method generates a custom NoC topology with the corresponding system-level floorplan. The custom NoC topology is generated in the first two phases, where the communication requirement is analyzed (Phase I) and then Router Sharing Groups (RSGs) are formed (Phase II) accordingly. In Phase III, a wire-length driven floorplanning that takes into account PEs and shapes of the routers is carried out, while routing paths are allocated as well. A. Phase I: Communication Analysis The optimal NoC structure is mainly determined by the communication requirement. In order to facilitate the computation, first a Communication Trace Bipartite Graphics
978-1-4244-9021-9/11/$26.00 ©2011 IEEE
Fig. 1. Flow chart of the proposed methodology.
(CTB) is pruned by removing trivial arcs and vertices. The simplified bipartite graph is then analyzed so that the RSGs can be constructed in the next phase. 1) Isolated Arcs Removal (IAR) Let din(v) and dout(v) be the in-degree and out-degree of a vertex v, respectively. In a bipartite graph, din(v)=0 if v is a source vertex while dout(v)=0 if v is a target vertex. An arc is said to be isolated if it is the only communication link connected to the source and target vertices. 2) Isolated Vertices Removal (IVR) A source vertex s is isolated if dout(s)=0; similarly, a target vertex t is isolated if din(t)=0. In a CTB, such vertices are not connected to any arc. 3) Successors Subset Tree Construction (STC) In order to improve the router utility, it is imperative to put a set of vertices with heavy communication traffic in the same RSG. In other words, it is desirable to put target vertices sharing the same source vertex in the same group. Similarly, source vertices sharing the same target vertices can be put in the same group. Definition 1: Successors of a source vertex v are those target vertices that are connected to v through arcs in the CTB. The set of vertices are denoted as Suc(v). Definition 2: Predecessors of a target vertex v are those source vertices that connect to v through arcs in the CTB. The set of vertices are denoted as Pre(v). Source vertices sharing the same successors are partitioned into the same group, and this partition is achieved through a subset tree construction. The root of the tree contains all source vertices (SB) in the reduced CTB (RCTB), while each of the remaining tree nodes contains a subset of SB. Each source vertex si is contained in exactly one non-root tree node. This tree is constructed according to the following rules. (1) If Suc(si) = Suc(sj), source vertices si and si belong to the same tree node. (2) Suc(si) Suc(sj), then the tree node containing si is a descendant of the tree node containing sj. (3) If Suc(si) Suc(sj) and Suc(sj) Suc(si), then the only ancestor of tree nodes containing si and sj is the root. The tree construction algorithm is given in Fig. 2.
Algorithm: STC (Subset Tree Construction) Input: Reduced CTB’’(S’’, T’’, A’’) Output: Subset Tree 1. L ← sorted S’’ by |Suc(s)| in descending order , s 2. set root node of ST as S’’ 3. for (all v L) do 4. insert_subset_tree (v, root node of ST) 5. end for
Cost(fp) = 0.5Area(fp) + 0.5WL(fp), (1) where fp is a floorplan, while Area(fp) and WL(fp) are the estimated area and wirelength of floorplan fp, respectively.
S’’
function insert_subset_tree (v, root node of ST) 1. q.enqueue(root node of ST) 2. while (q is not empty) do 3. node current ← q.front 4. if (Suc(v) = Suc(current)) then 5. insert v to current 6. else if (Suc(v) Suc(current)) then 7. if (current is leaf node) then 8. insert v as child of current 9. else if 10. counter ← 0 11. while (current is not null) then 12. if (Suc(v) = Suc(current) or Suc(v) Suc(current)) then 13. q.enqueue(current) 14. counter ← counter + 1 15. end if 16. node prev ← current 17. current ← current.sibling 18. end while 19. if (counter = 0) 20. insert v as sibling of prev 21. end if 22. else 23. insert v as sibling of current 24. end if 25. end while end function Fig. 2. Pseudo code of STC.
4) Target Vertices Partitioning (TVP) When Step 3 (STC) in Phase I is finished, the set of all source vertices (SB) has been partitioned into several disjoint subsets, with vertices in the same subset sharing exactly the same subset of target vertices (TB). Each of these disjoint subsets of source vertices is called a Source Component (SC), and it is a minimal set of source vertices that can be put into an RSG. B. Phase II: RSG Clustering Once we have all the source and target components, the next step is to search for components that can use the same router. A set of such components become an RSG. The goal is to construct larger RSGs that maximize router utility under the given I/O port constraints. The rationale for larger routers is that fewer routers are created, so that the number of hops for each communication is reduced. As a result, communication delay can be improved, so is the power consumption. C. Wirelength-Aware Floorplanning A fast simulated-annealing based algorithm [11] is used to generate a floorplan in the proposed scheme. In order to optimize circuit size as well as overall wire length, the cost function is defined as follows:
978-1-4244-9021-9/11/$26.00 ©2011 IEEE
IV. EXPERIMENTAL RESULTS We conduct the experiments on Ubuntu 8.0.4, with Xeon 2.00GHz CPU, 3GB memory. The benchmark circuits are application-specific SoC designs, including video applications: MPEG-4 decoder, Multi-Window Display (MWD), Picture-in-Picture (PIP), and Video Object Plane Decoder (VOPD); and multimedia applications: H.263 video encoder, H.263 video decoder, MP3 audio encoder, and MP3 audio decoder. In order to make a fair comparison to results obtained from the mesh topology optimized for performance, the upper bounds on router inputs (ri) and outputs (ro) are set to 5, and the bus width is limited to 128. The custom NoC designs are synthesized with the 70 nm technology file provided by CosiNoC, while the area and power consumption of routers are estimated by Orion [12]. The statistics of benchmark SoCs are shown in Table I. Also shown in Table I are the NoCs synthesized by the proposed method as well as results from mesh-based implementation. The indices and circuit descriptions are given in Columns One and Two, while the numbers of cores (#Core) and communications (#Comm.) of each circuit are listed in Columns Three and Four. The next column gives the average amount of communication in each benchmark (#Comm./#Core), which gives an indication of the communication traffic in each example. Columns Six and Seven give the number of routers required in mesh and the proposed method, while the last two columns are the number of links for mesh and the proposed method. Synthesis results of the proposed methodology will be compared to a custom NoC synthesis scheme: CosiNoC [10]. Since the source code of CosiNoC is available, a complete comparison between CosiNoC and this work will be provided in Table II. V. CONCLUSIONS A custom NoC topology generation method with floorplanning is proposed in this paper. The topology generation is based on communication analysis, which leads to efficient designs in shorter computation time. Experimental results show that the proposed methodology can achieve lower wirelength and power consumption than previous methods while the runtime is significantly shorter. Since both network components and PEs are planned simultaneously, an accurate estimation of the synthesized circuit can be made. Besides, this approach is also helpful to restrict runtime, as the floorplanning process can be done in one pass. In contrast, previous methods try to add the on-chip network in a floorplan where PEs have been processed, which may cause design violation and require more iterative loops of the process. The proposed method is also scalable, as it is capable of providing good results in reasonable time for larger circuits. Hence, the proposed method is also particularly useful for large
SoC G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11
TABLE I BENCHMARKS STATISTICS WITH SYNTHESIS RESULTS OF THIS WORK AND MESH # Router # Link Avg. Comm. Description # Core # Comm. (# Comm. / # Core) n × n MESH This Work n × n MESH This Work PIP 8 8 1.00 9 6 30 14 H.263 enc MP3 dec 12 12 1.00 16 7 56 21 MWD 12 12 1.00 16 9 56 23 MP3 enc MP3 dec 13 13 1.00 16 8 56 21 H.263 dec MP3 dec 14 15 1.07 16 8 56 24 VOPD 12 14 1.17 16 9 56 24 VOPD + MPEG-4 + MWD 36 52 1.44 36 24 240 81 VOPD + MPEG-4 24 40 1.67 25 15 182 58 MPEG-4 + PIP 20 34 1.70 25 12 132 46 MPEG-4 12 26 2.17 16 6 132 32 IMP 27 96 3.56 36 12 380 74 Data Normalized
Average
SoC
Method
CosiNoC This Work CosiNoC G2 The Work CosiNoC G3 This Work CosiNoC G4 This Work CosiNoC G5 This Work CosiNoC G6 This Work CosiNoC G7 This Work CosiNoC G8 This Work CosiNoC G9 This Work CosiNoC G10 This Work CosiNoC G11 This Work CosiNoC Avg. This Work Improvement G1
2.473 2.7 2.912 2.934 2.706 2.816 2.32 2.436 4.666 4.54 8.025 8.26 12.906 13.737 10.252 10.834 5.17 4.98 2.511 2.499 6.298 6.29 5.476 5.639 -2.98%
2 2 2.17 2.17 2.17 2.17 2 2 2.13 2.07 2.43 2.07 3.69 2.25 2.98 2.28 2.5 2.24 2.81 2.31 4.74 2.68 2.69 2.20 18.22%
6 6 7 7 7 9 8 8 8 8 9 9 26 24 14 15 10 12 6 6 22 12 11.18 10.55 5.69%
0.007 0.007 0.014 0.014 0.013 0.013 0.012 0.012 0.017 0.015 0.019 0.015 0.074 0.054 0.058 0.041 0.037 0.032 0.028 0.024 0.079 0.073 0.0325 0.0273 16.20%
REFERENCES [2] [3] [4]
[5] [6]
10.55 1.00
125.09 3.29
38.00 1.00
TABLE II. THE PROPOSED METHODOLOGY SYNTHESIS RESULTS AND COMPARISON TO COSINOC [10] Network Size Power Dynamic #Hop (mm2) (mW) Energy (μJ) #Router AR (mm2) #Link WL (mm)
systems where a complicated NoC-based system is required.
[1]
20.64 1.96
W. J. Dally and B. P. Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann, Jan. 2004. J. Nurmi, “Network-on-chip:a new paradigm for system-on-chip design,” in Proc. IEEE System-on-Chip Conf., pp. 2–6, 2005. B. S. Feero and P. P. Pande, “Networks-on-chip in a three-dimensional environment: A performance evaluation,” IEEE Trans. Computers, vol. 58, pp. 32–45, Jan. 2009. R. Marculescu, U. Y. Ogras, L.-S. Peh, N. E. Jerger, and Y. Hoskote, “Outstanding research problems in noc design: System, microarchitecture, and circuit perspectives,” IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 28, pp. 3–21, Jan. 2009. K. Srinivasan, K. S. Chatha, and G. Konjevod, “Linear-programming based techniques for synthesis of network-on-chip architectures,” IEEE Trans. VLSI Systems, vol. 14, pp. 407–420, Apr. 2006. K. Srinivasan, K. S. Chatha, and G. Konjevod, “Linear-programmingbased techniques for synthesis of network-on-chip architectures,” IEEE Trans. VLSI Systems, vol.14, pp. 407-420, 2006.
978-1-4244-9021-9/11/$26.00 ©2011 IEEE
14 14 19 21 18 23 21 21 24 24 25 24 81 81 56 58 42 46 29 32 78 74 37.0 38.0 -2.70% [7]
7.786 6.167 15.686 6.381 16.46 8.293 16.358 7.24 26.029 11.264 34.006 15.773 97.519 49.708 52.959 33.276 56.596 20.278 19.517 12.48 66.277 50.43 37.199 20.117 45.92%
5.06 4.79 8.92 7.35 8.9 7.67 8.41 6.89 12.59 8.95 15.8 10.23 58.5 35.89 39.78 26.31 30.12 19.58 18.51 14.45 50.84 46.8 23.40 17.17 26.62%
4.18 3.3 10.42 0.12 0.2 39.81 253.18 242.77 199.37 195.18 370.2
CPU Time (sec) 0.04 0.001 0.08 0.001 0.06 0.001 0.09 0 0.12 0.001 0.1 0.001 5.37 0.01 0.59 0.001 1.25 0.001 0.37 0.001 16.74 0.01 2.2555 0.0025 886.1X
K. Srinivasan and K. S. Chatha, “A methodology for layout aware design and optimization of custom network-on-chip architectures,” in Proc. International Symp. Quality Electronic Design, pp. 352–357, 2006. [8] K. Srinivasan, K. S. Chatha, and G. Konjevod, “An automated technique for topology and route generation of application specific on-chip interconnection networks,” in Proc. IEEE/ACM International Conf. Computer-Aided Design, pp. 231–237, 2005 [9] L. Benini, “Application specific NoC design,” in Proc. IEEE/ACM Design, Automation, and Test in Europe, pp. 491–495, 2006 [10] J. A. Roy, S. N. Adya, D. A. Papa, and I. L. Markov, “Min-cut floorplacement,” IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 25, pp. 1313–1326, Jul. 2006. [11] T.-C. Chen, Y.-W. Chang, and S.-C. Lin, “A new multilevel framework for large-scale interconnect-driven floorplanning,” IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 27, pp. 286–294, Feb. 2008. [12] A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi, “Orion 2.0: A fast and accurate NoC power and area model for early-stage design space exploration,” in Proc. IEEE/ACM Design, Automation and Test in Europe, pp. 423–428, 2009.