Integrated Through-Silicon Via Placement and Application Mapping for 3D Mesh-Based NoC Design

KANCHAN MANNA, SHIVAM SWAMI, SANTANU CHATTOPADHYAY, and INDRANIL SENGUPTA, Indian Institute of Technology (IIT) Kharagpur

This article proposes a solution to the integrated problem of Through-Silicon Via (TSV) placement and mapping of cores to routers in a three-dimensional (3D) mesh-based Network-on-Chip (NoC) system. TSV geometry restricts the number of TSVs in 3D ICs. As a result, only about 25% of the routers in a 3D NoC can possess vertical connections. Mapping plays an important role in arriving at good system solutions in such a situation. In this work, TSVs are placed in close consultation with the application mapping process. The integrated problem is first solved exactly using Integer Linear Programming (ILP). Next, a solution is obtained via a Particle Swarm Optimization (PSO) formulation. Several augmentations to the basic PSO strategy are proposed to generate good-quality solutions. The results obtained are better than those of many contemporary approaches and close to the theoretical situation in which all routers are 3D in nature.

CCS Concepts: • Computer systems organization → Embedded systems; Network-on-Chip; 3D IC; Application Mapping;

Additional Key Words and Phrases: Network-on-chip (NoC), TSV placement, application mapping, 3D NoC

ACM Reference Format:
Kanchan Manna, Shivam Swami, Santanu Chattopadhyay, and Indranil Sengupta. 2016. Integrated through-silicon via placement and application mapping for 3D mesh-based NoC design. ACM Trans. Embed. Comput. Syst. 16, 1, Article 24 (November 2016), 25 pages. DOI: http://dx.doi.org/10.1145/2968446

This work is partially supported by the Department of Science and Technology, Govt. of India (SB/S3/EECE/058/2013; dt. 26-8-13).
Authors’ addresses: K. Manna and I. Sengupta, Department of Computer Science & Engineering; emails: [email protected], [email protected]; S. Swami and S. Chattopadhyay, Department of Electronics & Electrical Communication Engineering, Indian Institute of Technology (IIT) Kharagpur, West Bengal 721 302, India; emails: [email protected], [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2016 ACM 1539-9087/2016/11-ART24 $15.00
DOI: http://dx.doi.org/10.1145/2968446

ACM Transactions on Embedded Computing Systems, Vol. 16, No. 1, Article 24, Publication date: November 2016.

1. INTRODUCTION

A major concern of the emerging multicore System-on-Chip (SoC) design is ensuring fast communication between Intellectual Property (IP) cores. Communication constrains the performance, power consumption, and reliability of the system [Ho et al. 2001; Magarshack and Paulin 2003; Dally and Towles 2001; Benini and Micheli 2002]. In a traditional SoC environment, IP cores communicate with each other using either a bus-based or a point-to-point network. While a point-to-point network has large wiring overhead, bus-based communication suffers from bandwidth limitations. Compared to these architectures, Network-on-Chip (NoC) can provide better scalability, lower power consumption, and higher predictability [Dally and Towles 2001; Benini and Micheli 2002]. With an increasing number of IP cores integrated into an NoC, the two-dimensional IC-based implementation of the system suffers from performance
degradation. Such degradation is caused by the increase in network diameter, which plays an important role in deciding system performance. This performance limitation of 2D-NoC has led to the emergence of 3D-NoC. Three-dimensional NoC has a large potential for building complex, highly interconnected systems. In this methodology, different tiers or dies are connected by vertical interconnects. Hence, it reduces the latency of critical interconnect structures (due to the reduction in wire length, proportional to as much as the square root of the number of stacked dies [Xu et al. 2011], and the incorporation of pipelined stages), providing high performance, high throughput, and lower power consumption [Topol et al. 2006]. Moreover, it provides the additional benefit of heterogeneous system integration. In this direction, a 3D-NoC with twelve x86 processors and two dies has been introduced by AMD [2010].

Several approaches have been reported in the literature for vertical interconnects, such as wire bonding, micro-bump, contactless interconnection, and Through-Silicon Via (TSV) [Davis et al. 2005]. Among them, TSV is the most viable solution due to its low-latency and low-power characteristics [Hwang et al. 2011]. Unfortunately, the TSV process does not scale well with CMOS technology. TSV diameters (4 μm) and pitches (8 μm) are two to three orders of magnitude larger than transistor gate lengths [Pasricha 2012; Kim et al. 2009]. The manufacturing cost of TSVs is also high. The associated defect probability introduces yield loss and thus extra cost. Yield loss increases exponentially with defect frequency and the number of TSVs. However, system performance also depends on the number of available TSVs, since routing congestion and blockage in a 3D network are proportional to the number of TSVs [Hwang et al. 2011]. The expected number of TSVs in high-performance 3D chips in 2012 was 1,000 (as per the ITRS [Semiconductor Industry Association et al. 2007]).
This number is predicted to increase at a rate of another 1,000 per year. Considering the negative impact of TSV limitation on the performance of the system, a balance between the performance of the system and its cost is crucial [Xu et al. 2011].

Three-dimensional topologies have been suggested for the NoC domain as well [Pavlidis and Friedman 2007]. Three-dimensional network topologies [Feero and Pande 2009] have shown improvement in performance compared to the topologies in the 2D environment. Among the different topologies proposed for such interconnection networks, mesh is the most widely used. Mesh provides a regular structure with short interconnects and a high bisection width. It also provides a modular architecture for the NoC with equal-sized links. As a result, many industrial NoCs [SCC 2010; Vangal et al. 2008] have been designed around the mesh topology.

The three-dimensional NoC structure can be categorized into three different types: symmetric, hybrid, and true 3D fabric [Liu et al. 2011]. In a symmetric 3D-NoC, each router is a simple extension of a 2D router having two extra ports: upward and downward. The number of vertical channels can be $2(N - \sqrt[3]{N^2})$ [Dubois et al. 2013], $N$ being the number of nodes (routers) in the network. Each channel in the NoC consists of tens or even hundreds of physical wire links. Such a network with a large number of nodes/routers needs a huge number of physical channels. As a result, it demands more silicon area in the die. Therefore, the 3D-NoC design introduces new issues, such as the technology constraint on the number of TSVs that can be supported and the placement of routers in the 3D-NoC. A reduction in TSVs in the 3D-NoC reduces the bisection width of the topology, which in turn reduces the possible concurrent communication in the vertical direction. Thus, the performance of a 3D-NoC-based system depends on a proper placement of this limited number of TSVs.
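The vertical-channel count quoted above for a symmetric 3D mesh can be checked with a short sketch. It assumes a cubic mesh with the same number of routers along each axis and counts each direction (up, down) as a separate channel; the function names are illustrative and do not come from the article.

```python
def vertical_channels(side: int) -> int:
    """Count unidirectional vertical channels in a side x side x side mesh.

    Assumption: every router has an upward channel unless it sits on the
    top layer and a downward channel unless it sits on the bottom layer.
    """
    count = 0
    for z in range(side):
        for _y in range(side):
            for _x in range(side):
                if z + 1 < side:
                    count += 1  # upward channel
                if z > 0:
                    count += 1  # downward channel
    return count


def closed_form(side: int) -> int:
    """Evaluate 2(N - N^(2/3)) with N = side**3, so N^(2/3) = side**2."""
    n = side ** 3
    return 2 * (n - side ** 2)


if __name__ == "__main__":
    for side in (2, 3, 4):
        # The explicit count and the closed form should agree.
        assert vertical_channels(side) == closed_form(side)
    print(closed_form(4))  # 64 routers in a 4 x 4 x 4 mesh -> 96
```

For a 4 x 4 x 4 mesh this gives 96 vertical channels, each of which would need its own bundle of TSVs, which is why a partially connected topology is attractive.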
Other router design strategies for 3D-NoCs, such as hybrid and complete 3D fabric, may not be suitable because of communication contention, scalability, and complex design issues [Xie et al. 2009]. In contrast, the present work is based on a partially connected symmetric 3D-mesh-based NoC topology. In this topology, fully connected 2D planar meshes are partially connected together via TSVs only. The structure uses
two types of routers: 2D routers with five ports and 3D routers with seven ports. The 3D routers utilize the TSVs for vertical connections [Ahmed and Abdallah 2012]. Thus, the 3D-NoC architecture consists of 2D as well as 3D routers [Tatas et al. 2014]. Therefore, finding the proper places for such routers in the 3D-NoC plays a crucial role in determining the system performance while meeting the technology constraints.

Another major challenge in NoC-based system design is to determine the association of routers with the IP cores. This is commonly known as the process of application mapping. In this work, it has been assumed that the core responsible for carrying out a particular task has already been decided. Thus, the application can be represented in the form of a core graph [Murali and Micheli 2004]. The core graph of an application is a directed graph, $G(C, E)$, comprising a set $C$ of vertices (or cores) together with a set $E$ of directed edges. An edge $e_{i,j} \in E$ represents the communication between cores $c_i$ and $c_j$. The weight of the edge $e_{i,j}$ represents the bandwidth requirement between the cores, $comm_{i,j}$. The NoC topology can also be represented in the form of a topology graph [Murali and Micheli 2004]. It is a directed graph $T(R, F)$ comprising a set $R$ of vertices (or routers in the topology) together with a set $F$ of directed edges. An edge $f_{i,j} \in F$ represents the actual link between the routers $r_i$ and $r_j$. The weight of the edge $f_{i,j}$, represented as $bw_{i,j}$, denotes the bandwidth available across that edge. A mapping of the core graph $G(C, E)$ onto the topology graph $T(R, F)$ is a function $M : C \to R$, such that $\forall c_i \in C$, $\exists r_k \in R$ and $M(c_i) = r_k$. The total communication cost (defined in the following) of an application under this mapping is a measure of the quality of mapping. Let a single commodity, $d_k$, $k = 1, 2, \ldots, |E|$, represent the communication between the corresponding pair of cores, $c_i$ and $c_j$ (i.e., $comm_{i,j}$, the bandwidth requirement).
Let the quantity $x_{i,j}^{k}$ indicate the commodity $d_k$ following the link $(r_i, r_j)$. It is given by

$$x_{i,j}^{k} = \begin{cases} \mathrm{value}(d_k), & \text{if } \mathrm{link}(r_i, r_j) \in P(\mathrm{source}(d_k), \mathrm{sink}(d_k)) \\ 0, & \text{otherwise,} \end{cases} \qquad (1)$$

where $P(m, n)$ indicates the routing path between the nodes $m$ and $n$ in the topology. To ensure the bandwidth limitation of individual links, the following constraint must be satisfied:

$$\sum_{k=1}^{|E|} x_{i,j}^{k} \le bw_{i,j}, \quad \forall i, j \in \{1, 2, \ldots, |R|\}. \qquad (2)$$

The communication cost $T$ of a mapping solution is given by Murali and Micheli [2004]:

$$T = \sum_{k=1}^{|E|} \mathrm{value}(d_k) \times \mathrm{hopcount}(\mathrm{source}(d_k), \mathrm{sink}(d_k)), \qquad (3)$$

where hopcount(a, b) is the total number of hops between nodes a and b in the topology. The overall mapping problem is to minimize the communication cost while satisfying the bandwidth constraints of individual links. The mapping of cores to routers plays a crucial role in determining the overall system performance. The approach corresponds to a design-time decision on the attachment of cores to routers. From this discussion, it can be noted that the performance of a 3D-NoC-based system will depend significantly on the locations of 3D routers (TSVs) and the core-to-router association (mapping). This work proposes a TSV placement strategy for a symmetric 3D-mesh-NoC design by detailed consultation with the mapping of an application,
taking care of TSV geometry constraints. The salient features of the approach are as follows:
- An exact solution to the integrated core mapping and TSV placement problem based on Integer Linear Programming (ILP) has been proposed for NoCs with two vertical layers.
- A multistage Particle Swarm Optimization (PSO)-based formulation for the combined problem of TSV placement and mapping has been reported that can work for any number of layers.
- Several innovative augmentations have been proposed to improve the solution quality of PSO.

The rest of the article is organized as follows. Section 2 surveys the related works. The architecture and its corresponding routing algorithm are described in Section 3. Section 4 presents a formal description of the problem. An ILP-based solution is presented in Section 5. Section 6 describes the PSO-based solution for the problem. Complexity analysis of the proposed method is discussed in Section 7. Experimental results are presented in Section 8. Section 9 concludes the work.

2. RELATED WORKS

The impact of the number of TSVs and their placement on the performance of 3D-NoC has been analyzed in Xu et al. [2010, 2011]. Here, the authors have considered three types of TSV configurations (full, quarter, and one-eighth connection) and compared the performance among them. They have shown a tradeoff between performance and manufacturing cost. The TSV squeezing scheme to share TSVs among neighboring routers in 3D-NoC has been studied in Liu et al. [2011]. In this approach, four neighboring routers share a TSV in a time-division-multiplexed fashion. In Hwang et al. [2011], a new communication technique has been proposed between TSVs. Here, four cores are taken to be in a single virtual group and one TSV is dedicated to the group. A core transfers traffic in the vertical direction by using the TSV either in its own group or in a neighboring group, based on the current load of the TSV in the group. Serialization of TSVs can help reduce their number in 3D-ICs [Pasricha 2009]. TSV virtualization also reduces the number of TSVs used in 3D-ICs [Miller et al. 2012]. Here, the authors have proposed a TSV virtualization scheme for multiprotocol interconnect in 3D-ICs. In this approach, TSVs are clocked at a much higher rate than conventional intralayer links. To utilize the full bandwidth of a TSV-based vertical interconnect, it uses multiple TSVs in a multiplexed and shared manner. Moreover, each layer can contain different types of interconnection architectures, such as buses, crossbars, or NoCs. An application-specific 3D-NoC synthesis procedure has been presented in Yan and Lin [2008], based on a greedy rip-up-and-reroute technique. In this rerouting technique, smaller flows are routed first, followed by larger flows. To reduce power consumption, a router merging scheme has been used to further optimize the topology. Seiculescu et al. [2009], Murali et al. [2009], and Seiculescu et al.
[2010] have proposed a tool for NoC topology synthesis in a 3D environment. It determines the custom topology for an application and the paths for communication flows, assigns network components to the 3D layers, and decides their placement within each layer. Here, TSVs are iteratively added during the synthesis process. To design a custom 3D-NoC for an application, a Genetic Algorithm (GA)-based synthesis procedure has been presented in Jiang and Watanabe [2010]. The proposed procedure reduces the topology cost and optimizes the floorplan to reduce power consumption under software and hardware constraints. A floorplan-aware 3D-NoC synthesis technique has been described in Zhou et al. [2012]. Here, the authors have used Simulated Allocation (SAL), a stochastic method (multicommodity flow technique) for traffic flow routing, and also
proposed a power and delay (queuing) model for network components. They have also presented a study of various factors that affect network performance in 3D-NoC, such as the number of TSVs and 3D tiers. An application-specific 3D-NoC design using Integer Linear Programming (ILP) has been proposed in Xu et al. [2009]. Here, the authors have considered a low-radix router and many long links. A framework called MORPHEUS for TSV serialization-aware synthesis of an application-specific 3D-NoC has been proposed in Pasricha [2012]. It incorporates 3D topology, route generation, and thermal-aware core layout. It also places the network interfaces (NIs), routers, and serialized TSVs in the die. A high-level 3D-NoC synthesis mechanism has been reported in Ying et al. [2012b] to improve system performance and to reduce the link heat by distributing the communication evenly in the system. Here, a Simulated Annealing (SA)-based algorithm has been used to optimize the overall system. Zhong et al. [2011] have presented a four-stage application-specific synthesis approach for 3D-NoC. This approach attempts to generate a power-performance-efficient topology for an application. Core-to-core communication is analyzed from the task graph, and the approach tries to place the most-communicating pairs in the same cluster. A TSV assignment procedure has also been incorporated to reduce the link power consumption. A GA-based optimization technique has been proposed in Ying et al. [2012a] to design a 3D-NoC with low vertical link density. It optimizes topology, routing algorithm, task mapping, and tile placement. Rahmani et al. [2012] have proposed a bus-hybrid-symmetric-mesh-based architecture for a 3D-NoC design. It combines a packet-switched network with bus-based communication. It enhances the system performance, thermal safety, fault tolerance, and power efficiency.
Kapadia and Pasricha [2012] have proposed a framework for power delivery network (PDN)-aware 3D-mesh-NoC synthesis with multiple voltage islands. They have proposed an ILP as well as a heuristic to synthesize the PDN. A branch-and-bound method has been used to generate multiple mappings to optimize NoC power. A cosynthesis methodology for a PDN-3D-mesh-NoC has been reported in Kapadia and Pasricha [2013]. To optimize the cost of the PDN and the NoC design, they have used a biobjective SA approach. An application-specific 3D-mesh-NoC has been designed with a traffic-aware selection strategy in Azampanah et al. [2013]. The selection strategy can significantly balance the traffic load to reach a better performance.

As far as application mapping is concerned, a detailed recent review of 2D-NoC application mapping techniques can be found in Sahu and Chattopadhyay [2013]. The possible solutions to this problem range over exact mapping using ILP, heuristics, and metasearch techniques. In Sahu et al. [2014], a PSO-based approach has been used to solve the mapping problem in 2D-NoC. A latency-aware mapping scheme has been presented for regular 3D-mesh-NoC in Wang et al. [2011]. It uses a rank-based multiobjective GA to solve the mapping problem. Here, packet latency has been calculated in both congested and congestion-free environments. A thermal-aware mapping of 3D-mesh-NoC has been presented in Hamedani et al. [2012]. Here, one ILP-based approach and two heuristic-based static thermal-aware mapping algorithms have been proposed to study the thermal constraints and their effects on temperature and performance. A mapping scheme has been proposed in Ding et al. [2013] to optimize the number of TSVs and the peak temperature of a 3D-symmetric-mesh-NoC. Here, the unused TSVs have been kept as thermal TSVs for heat dissipation. Iterative techniques have been proposed in Manna et al.
[2014, 2015] for application mapping together with intelligent placement of 25% TSVs in 3D-NoC. Since high performance can be achieved in the symmetric 3D-mesh-NoC topology, the present work has chosen a symmetric partially connected 3D mesh as the target architecture for mapping an application together with TSV placement. It uses two deadlock-free Elevator First [Dubois et al. 2013; Lee and Choi 2013] routing algorithms (with and without virtual channels) for communication between cores.
From a theoretical viewpoint, the application mapping problem attempts to embed one graph (the core graph) into another (the topology graph). This is intractable [Garey and Johnson 1979] and is a specific case of the Quadratic Assignment Problem (QAP) [Congying et al. 2011]. Furthermore, proper placement of TSVs in the die is also intractable, being an instance of the Uncapacitated Facility Location Problem [Guner and Sevkli 2008]. Meta-heuristic-based search algorithms, such as GA, SA, and PSO, often produce good results for such optimization problems. Further, a PSO can converge faster than similar techniques, such as GA, and can work with a relatively small population size. This has motivated us to develop an enriched PSO formulation for the application mapping problem that also intelligently places the TSVs in a 3D-NoC environment. Many augmentations over the basic PSO have been incorporated. The solutions produced are encouraging in terms of reduction in communication cost, latency, and energy consumption.

3. ARCHITECTURE AND ROUTING ALGORITHMS

A simple way to extend a 2D-mesh-based NoC to 3D is to incorporate vertical connections in each and every router. The resulting structure is a fully connected 3D-mesh architecture. The router ports in the vertical direction can be implemented using TSVs. In a fully connected mesh architecture, where all the routers are vertically connected by TSVs to the routers above and below them, the hop count and consequently the communication cost reduce significantly compared to a 2D-NoC with an equal number of routers. However, it is not feasible to provide a vertical connection to each and every router because of the manufacturing cost and chip area consumed by the TSVs. It is important to note that the TSV process does not scale with CMOS technology. TSV diameters (4 μm) and pitches (8 μm) are two to three orders of magnitude larger than transistor gate lengths [Kim et al. 2009]. Therefore, a fully connected 3D mesh is more of a theoretical topology. To circumvent this problem, different types of partially connected 3D-mesh topologies have been proposed in the literature. In Liu et al. [2011], four adjacent routers share one TSV. Eight adjacent routers can also share one TSV, as reported in Xu et al. [2010]. Another vertically partially connected 3D-mesh-NoC architecture has been presented in Dubois et al. [2013]. In this architecture, the number, position, and data flow direction of TSVs can be varied from die to die. The current work uses the vertically partially connected 3D-mesh-NoC architecture and, furthermore, the tile-based NoC design methodology. It is assumed that the data flow in each TSV can be in both directions. A typical structure of this kind is presented in Figure 1. Here, routers and cores are represented as r and c, respectively. The number, data flow direction, and location of TSVs are design-time constraints.
These parameters are chosen by the designer by taking care of technological constraints, required system performance, and overall system budget. The number of TSVs that can be afforded in the system depends on the area constraint. Moreover, Liu et al. [2011] have observed that adjacent routers rarely use their vertical links at the same time. Based on this observation, Liu et al. [2011] have suggested that a 25% vertical connection between two layers can be a good choice for designing a 3D-NoC. Hence, this work restricts the vertical links to only 25% of the routers per layer. This number can vary based on the system requirements, such as performance and cost of the system. Furthermore, a spread-out arrangement of TSVs that produces the lowest communication cost compared to any other architecture with the same number of TSVs has been reported in Xu et al. [2011]. This has prompted us to place the TSVs in a spread-out fashion as well. Hence, the present work includes a constraint on the placement of the TSVs: no two TSVs can be within one-hop distance of each other. This implies that a minimum of a two-hop distance should be maintained between the routers having TSVs while placing them. Such distance can vary based on the requirement.

Fig. 1. A two-layer partially connected 3D-NoC.

In such a scenario, the traditional “dimension order routing” may not be applicable. For the vertically partially connected 3D-mesh-based NoC architecture, two deadlock-free “elevator first” routing algorithms have been proposed in Bahmani et al. [2012] and Lee and Choi [2013]. The work of Bahmani et al. [2012] assumes the presence of virtual channels to resolve the deadlock issues, while Lee and Choi [2013] do not use any virtual channels. The present work has used both routing algorithms while measuring the performance of the system.

4. PROBLEM STATEMENT

The overall problem addressed in this work can be stated as follows: determine (1) the suitable positions (routers) of TSVs and (2) the association between routers and cores for a particular application in a vertically partially connected 3D-mesh-based NoC such that the total communication cost (bandwidth × hop count) is minimized while satisfying the following constraints:
- One router can have at most one core attached to it.
- Routing should follow a given routing algorithm.
- The number of layers in the 3D-NoC is restricted to a given value l.
- At most k percent of the routers per layer can have vertical connections.
- The number of hops between two adjacent routers having TSVs must be at least d, a given value.
- The link bandwidth must be honored.

This work considers elevator-first routing algorithms [Lee and Choi 2013; Bahmani et al. 2012], k = 25 and d = 2. We have experimented with two values of l, l = 2 and l = 4. The problem has been solved using the following two strategies:
- ILP
- PSO and its variants

The ILP formulation to solve the TSV placement and mapping problem is presented in Section 5. The PSO-based approach is described in Section 6.
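The objective stated above (bandwidth × hop count, summed over all core-graph edges) can be sketched in a few lines. This is only an illustration under stated assumptions: routers are written as (x, y, layer) tuples, every TSV position is assumed to form an elevator spanning all layers, and the elevator for an inter-layer flow is taken to be the one giving the shortest total path. The cited elevator-first algorithms may select elevators differently, and the link-bandwidth constraint is not checked here. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Router = Tuple[int, int, int]   # (x, y, layer): an assumed encoding
Flow = Tuple[str, str, float]   # (source core, sink core, bandwidth)


def elevator_first_hops(src: Router, dst: Router,
                        tsvs: List[Tuple[int, int]]) -> int:
    """Hop count when inter-layer traffic must detour through a TSV."""
    sx, sy, sz = src
    dx, dy, dz = dst
    if sz == dz:
        # Same layer: plain Manhattan distance, no elevator needed.
        return abs(sx - dx) + abs(sy - dy)
    # Assumption: choose the elevator that minimizes the planar detour.
    planar = min(
        abs(sx - ex) + abs(sy - ey) + abs(ex - dx) + abs(ey - dy)
        for ex, ey in tsvs
    )
    return planar + abs(sz - dz)


def communication_cost(flows: List[Flow], mapping: Dict[str, Router],
                       tsvs: List[Tuple[int, int]]) -> float:
    """Sum of bandwidth x hop count over all core-graph edges."""
    return sum(
        bw * elevator_first_hops(mapping[s], mapping[t], tsvs)
        for s, t, bw in flows
    )


if __name__ == "__main__":
    # Two layers of 2 x 2 routers; 25% of four routers allows one TSV.
    tsvs = [(0, 0)]
    mapping = {"a": (0, 0, 0), "b": (1, 1, 0), "c": (1, 0, 1)}
    flows = [("a", "b", 10), ("a", "c", 5), ("b", "c", 2)]
    print(communication_cost(flows, mapping, tsvs))  # 10*2 + 5*2 + 2*4 = 38
```

Moving the single TSV or remapping a core changes the detour lengths, which is why the two decisions are treated as one integrated problem.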

Table I. Variables for ILP Formulation

Variable                   Definition
$m_{c_i}^{\mu_s}$          = 1, if the $i$th core $c_i$ is mapped to the $s$th router $\mu_s$; = 0, otherwise
$P_l^{\mu_s \mu_t}$        = 1, if a communication path exists between routers $\mu_s$ and $\mu_t$ for a mapped edge $l$ of the core graph; = 0, otherwise
$T_k$                      = 1, if the $k$th router has a vertical link; = 0, otherwise
$D_{\mu_s \mu_t}^{T_k}$    Precomputed distance between routers $\mu_s$ and $\mu_t$, given that the $k$th router has a TSV, that is, $T_k = 1$
$Y_l^k$                    = 1, if $T_k = 1$ and the $k$th router comes in the path for a mapped edge $l$ (this depends on the routing algorithm followed); = 0, otherwise
$Bw_l$                     Bandwidth requirement of the edge $l$ of the core graph
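To make the role of the $T_k$ variables concrete, the following sketch checks whether a candidate 0/1 vector of TSV positions for one layer respects the limits stated in Section 4 (at most k = 25 percent of the routers, at least d = 2 hops between TSVs). The row-major numbering of routers starting at 0 and the function name are assumptions made for illustration only.

```python
from typing import Sequence


def tsv_placement_ok(t: Sequence[int], rows: int, cols: int,
                     percent: int = 25, min_hops: int = 2) -> bool:
    """Check a per-layer TSV vector against the density and spacing limits.

    t[k] is 1 when router k carries a TSV; router k is assumed to sit at
    (row, col) = divmod(k, cols), i.e. row-major numbering from 0.
    """
    n = rows * cols
    if len(t) != n:
        raise ValueError("t must have one entry per router in the layer")
    # Density limit: at most `percent` percent of the routers have a TSV.
    if sum(t) * 100 > percent * n:
        return False
    cells = [divmod(k, cols) for k, has_tsv in enumerate(t) if has_tsv]
    # Spacing limit: every pair of TSVs is at least `min_hops` apart.
    for a in range(len(cells)):
        for b in range(a + 1, len(cells)):
            (r1, c1), (r2, c2) = cells[a], cells[b]
            if abs(r1 - r2) + abs(c1 - c2) < min_hops:
                return False
    return True


if __name__ == "__main__":
    spread = [1 if k in (0, 2, 8, 10) else 0 for k in range(16)]
    adjacent = [1 if k in (0, 1) else 0 for k in range(16)]
    print(tsv_placement_ok(spread, 4, 4))    # spread-out placement
    print(tsv_placement_ok(adjacent, 4, 4))  # two TSVs one hop apart
```

A check of this kind is useful for validating any candidate solution, whether it comes from the ILP or from the PSO search.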

5. ILP FORMULATION FOR TSV PLACEMENT AND APPLICATION MAPPING

An ILP formulation for the TSV placement and mapping problem is described here.

5.1. Parameters and Variables

The parameters and variables used in the ILP formulation are noted in Table I.

5.2. Objective Function

The objective of the work is to minimize the communication cost by placing the TSVs at suitable positions of the 3D-NoC and performing a mapping of cores to routers. The objective function can be written as

$$\text{Minimize: } \sum_{\forall l \in E} Bw_l \times \left( \sum_{\forall (\mu_s, \mu_t) \in U,\ k \in \{\text{valid TSV}\}} D_{\mu_s \mu_t}^{T_k} \times Y_l^k \right) \times P_l^{\mu_s \mu_t},$$

where the $k$th TSV refers to the vertical link on the $k$th router. If the $k$th TSV is 1, the $k$th router has a vertical connection. This work considers two layers (i.e., l = 2) and all vertical connections to be bidirectional in nature. The value of $k$ can range from 1 to the number of routers present per layer. For example, if we have 12 cores and the number of routers per layer is six (the cores being equally distributed in different layers), $k$ can take any value between 1 and 6. The distance between routers present in different layers can change depending on the choice of TSV used. Hence, for every pair of routers, all possible distances have been precomputed in $D_{\mu_s \mu_t}^{T_k}$ and provided to the ILP formulation. The appropriate distance will get selected based on the value of the variable $T_k$. Suppose, for an edge $l = (c_i, c_j)$ in the core graph, core $c_i$ is mapped to router $\mu_s$ and $c_j$ to $\mu_t$, located in different layers. Now, let the $k$th router have a vertical connection (i.e., $T_k = 1$) and let it come in the path between the routers $\mu_s$ and $\mu_t$ following the routing algorithm. This notion is captured by a binary variable $Y_l^k$. The value $Y_l^k = 1$ signifies that the $k$th router has a vertical connection and it comes in the routing path for the mapped edge $l$, and 0 otherwise.

5.3. Constraints

The following set of constraints has been framed to solve the TSV placement and mapping problem:

i. TSV usage constraints:
(1) $Y_l^k \le T_k$, for all edges $l$ of the core graph and all TSV positions $k$
(2) $\sum_{k=1}^{\text{No. of routers per layer}} Y_l^k = 1$

The first constraint ensures that the path distance with a chosen TSV $k$ is considered for a core graph edge $l$ only when that TSV is present ($T_k = 1$). The second constraint ensures that an edge $l$ is mapped considering only one TSV. For example, suppose we have TSVs at both the first and third locations (i.e., $T_1 = T_3 = 1$). While mapping an edge $l$, its distance will include either $T_1$ or $T_3$. However, if an edge $l$ in the core graph is mapped to routers in the same layer, neither $T_1$ nor $T_3$ will come in the path between the routers. In this case, any of $Y_l^1$ or $Y_l^3$ can become 1.

ii. Mapping constraints:
(1) $\forall \mu_s \in U,\ \sum_{c_i \in C} m_{c_i}^{\mu_s} \le 1$
(2) $\forall c_i \in C,\ \sum_{\mu_s \in U} m_{c_i}^{\mu_s} = 1$

The first inequality implies that any router has at most one core mapped onto it. Constraint (2) ensures that each core is mapped onto exactly one router. Together, these constraints guarantee that no two cores are mapped to one router and each core gets mapped onto a single router.

iii. Constraints for core graph edges:
(1) $\forall l \in E,\ \forall (\mu_s, \mu_t) \in U\ [P_l^{\mu_s \mu_t} \ge m_{c_i}^{\mu_s} + m_{c_j}^{\mu_t} - 1]$, where $c_i$ is the source of the $l$th edge and is mapped to router $\mu_s$, and $c_j$ is the destination of the $l$th edge and is mapped to router $\mu_t$.
(2) $\forall l \in E,\ \forall (\mu_s, \mu_t) \in U\ [P_l^{\mu_s \mu_t} \le (m_{c_i}^{\mu_s} + m_{c_j}^{\mu_t}) \div 2]$, where $c_i$ is the source of the $l$th edge and is mapped to router $\mu_s$, and $c_j$ is the destination of the $l$th edge and is mapped to router $\mu_t$.

Each edge present in the core graph must be mapped onto a path in the topology graph. This is ensured by inequalities (1) and (2). The variable $P_l^{\mu_s \mu_t}$ will be 1 when edge $l$ of the core graph is mapped to a path between routers $\mu_s$ and $\mu_t$.

iv. Technological constraints:
(1) $\sum_k T_k = n/4$, where $n$ is the number of routers per layer
(2) $T_k + T_{k+1} \le 1$, for $k$ not being a router in the last column
(3) $T_k + T_{k+n} \le 1$, for $k$ not being a router in the last row, $n$ being the number of routers per layer

Due to the technological constraints, we assume that the number of vertical links (TSVs) in each layer can be less than or equal to 25% of the total routers present per layer. This criterion is ensured by constraint (1). Constraints (2) and (3) ensure that TSVs are not placed within one-hop distance of each other.

The tool Cplex [2013] has been used to solve the formulated ILP and get the optimum solution. However, except for very small NoCs, it takes a huge amount of CPU time to arrive at the solution. Hence, in the following, a PSO-based technique and its variants are proposed to find solutions for bigger NoCs within a reasonable CPU time.

6. PSO FORMULATION FOR TSV PLACEMENT AND MAPPING

PSO [Kennedy and Eberhart 1995] is a population-based stochastic optimization technique developed by Eberhart and Kennedy, inspired by the social behavior of bird flocking and fish schooling. In this technique, multiple solutions are present at any instant of the optimization phase. These solutions help one another evolve by sharing their experiences, in order to reach a close-to-optimum solution. Each of these solutions is called a particle. A particle moves through the problem space according to its own experience as well as the experience of its fellow particles. The proficiency of a particle is measured by its fitness. PSO has been applied successfully to solve many optimization problems in both continuous and discrete domains [Wang et al. 2003]. This has motivated us to look for a discrete PSO (DPSO) formulation of the integrated TSV placement and core mapping problem for the 3D-mesh-based NoC with limited vertical connections. Apart from developing a PSO, we have augmented it in several ways as discussed in Section 6.2.

ACM Transactions on Embedded Computing Systems, Vol. 16, No. 1, Article 24, Publication date: November 2016.

The position of a particle in an n-dimensional search space at the kth iteration can be represented as $p_k = \langle p_{k,1}, p_{k,2}, \ldots, p_{k,n} \rangle$. Let $p_k^i$ denote the position of the ith particle. For the ith particle, let its local best position be represented by $pbest_i$, corresponding to the best-fit position that the particle has seen so far over the generations. Similarly, the global best particle of the kth generation may be represented by $gbest_k$. The particles can “fly” over the solution space through the generations. The new position can be calculated as

$p_{k+1}^i = (k_1 * I \oplus k_2 * (p_k \to pbest_i) \oplus k_3 * (p_k \to gbest_k))\, p_k^i$. (4)

In this equation, m → n denotes the minimum-length sequence of swaps to be applied on m to transform it into n. For example, for m = <1, 3, 4, 2> and n = <2, 1, 3, 4>, m → n = <swap(1, 4), swap(2, 4), swap(3, 4)>. The operator ⊕ denotes the fusion operator: for the operation m ⊕ n, the two swap sequences are applied one after another, that is, m followed by n. Each particle evolves over generations based on inertia, self-confidence, and swarm confidence. In Equation (4), the corresponding confidence factors are denoted by the constants $k_1$, $k_2$, and $k_3$, respectively. The quantity $k_i * (m \to n)$ means that the swaps in the sequence (m → n) are to be applied with probability $k_i$. The identity swap sequence is represented by I = <swap(1, 1), swap(2, 2), . . . , swap(n, n)>. It corresponds to the inertia of the particle to maintain its initial configuration. The final sequence of swaps equivalent to $(k_1 * I \oplus k_2 * (p_k \to pbest_i) \oplus k_3 * (p_k \to gbest_k))$ is applied on particle $p_k^i$ to create $p_{k+1}^i$. The convergence criterion for the DPSO is given by Guilan et al. [2008]:

$(1 - k_1)^2 \le k_2 + k_3 \le (1 + k_1)^2$. (5)

In PSO, the momentum factor of the particle is decided by the parameter $k_1$. The particle is also attracted by the previously found local and global best particles. The corresponding strengths of attraction are determined by the factors $k_2$ and $k_3$, respectively. Accordingly, we have tried various values of $k_1$, $k_2$, and $k_3$. The results reported in this work use the values $k_1 = 1$, $k_2 = 0.04$, and $k_3 = 0.02$, which satisfy Equation (5) since $0 \le 0.06 \le 4$. The convergence rate of the PSO, in terms of CPU time, changes if we vary the values of these factors. To enhance the performance of the PSO engine, proper values must be set for these factors; they may be determined through some trial runs. Next, we present the particle structure used in the integrated TSV position selection and core mapping problem.

6.1. Particle Formulation and Fitness Function

6.1.1. Particle Structure. To solve the integrated problem of core mapping and TSV position selection using PSO, the particle has been given a two-part structure. The first part corresponds to the mapping, while the second part indicates the routers having a 3D connection via TSV. For this, the routers are numbered in increasing order from the lowest to the highest layer. Within a layer, router numbers are assigned row-wise, from the top-left corner to the bottom-right corner. Figure 2(a) shows the numbering scheme for a 3 × 3 × 2, two-tier NoC having a total of 18 routers. The core mapping part of the particle is a permutation of core numbers, identifying the core mapped to each router. For example, Figure 2(b) shows the particle structure corresponding to the NoC mapping in Figure 2(a). Core 16 gets mapped to router 1, core 13 to router 2, and so on.

Fig. 2. Particle structure for mapping and TSV placement problem.

For the TSV part, it has been assumed that the routers at similar positions are of the same type in each layer. That is, if router r in layer 1 is a 3D router with a TSV connection to the corresponding router ṙ in layer 2, then ṙ also has a TSV connection to router r̈ in layer 3, and so on. Such an assumption is justified because TSV geometry does not allow two neighboring routers in any layer to have TSV connections. Thus, it is better to put 3D routers at the same positions in each layer. However, after mapping the cores and placing TSVs onto the routers, we have inspected the traffic flow through each TSV. Any TSV with zero traffic flow has been removed from the topology. The TSV part of the particle has been formulated as a bit array of size equal to the number of routers in a layer. A bit being “1” indicates that the corresponding routers in each layer are 3D in nature with TSV connections. A “0” indicates that the router is 2D, without a TSV connection. The TSV placement constraints (that x% of the routers in each layer can have TSVs and that a minimum d-hop distance must be maintained between neighboring routers having TSVs) are not incorporated into the particle. Instead, the constraints are considered while determining the 3D-NoC structure corresponding to the particle. For example, assume that the first bit of the TSV part of a particle is “1.” This router will have a TSV. During placement of TSVs for the other routers having a “1” bit in the TSV part of the particle, the constraints are checked; such a router can have a TSV only if it satisfies both constraints. Figure 2(b) shows a full particle structure. For the sake of implementation, a particle has been considered to be a single array, with appropriate care taken regarding the values contained in its cells.
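The decoding step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `split_particle` and `decode_tsv_bits`, the 0-based row-major indexing, and the choice to scan bits in index order are assumptions made for illustration. The 25% cap and the two-hop minimum spacing are taken from the text.

```python
def split_particle(particle, n_routers, routers_per_layer):
    """Split the flat particle array into its core-mapping and TSV parts."""
    assert len(particle) == n_routers + routers_per_layer
    return particle[:n_routers], particle[n_routers:]


def decode_tsv_bits(bits, rows, cols, max_fraction=0.25, min_hops=2):
    """Return per-layer router indices (0-based, row-major) that keep a TSV.

    Assumption: bits are scanned in index order. A "1" bit is honored only
    while the TSV budget is not exhausted and every TSV granted so far is at
    least `min_hops` mesh hops (Manhattan distance) away.
    """
    assert len(bits) == rows * cols
    budget = int(max_fraction * rows * cols)
    granted = []  # (row, col) positions that received a TSV
    for idx, bit in enumerate(bits):
        if bit != 1 or len(granted) >= budget:
            continue
        r, c = divmod(idx, cols)
        if all(abs(r - gr) + abs(c - gc) >= min_hops for gr, gc in granted):
            granted.append((r, c))
    return [r * cols + c for r, c in granted]


# 3 x 3 layer: the second bit is rejected (one hop from the first TSV),
# and the last bit is rejected because the 25% budget (2 TSVs) is used up.
print(decode_tsv_bits([1, 1, 0, 0, 1, 0, 0, 0, 1], rows=3, cols=3))  # [0, 4]
```

Keeping the constraints out of the particle, as the text describes, means that swaps on the bit array never have to be repaired; infeasible "1" bits are simply ignored when the topology is built.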
The fitness of each particle can be evaluated using Equation (6):

$CommCost = \sum_i \sum_j (BW(c_i, c_j) \times Dist(r_i, r_j))$, (6)

where $BW(c_i, c_j)$ denotes the bandwidth requirement between cores $c_i$ and $c_j$, and $Dist(r_i, r_j)$ is the hop count of a shortest path between routers $r_i$ and $r_j$, to which the cores $c_i$ and $c_j$ have been mapped.

6.1.2. Local and Global Best. Each particle has an associated local best (pbest), which is the configuration with the minimum communication cost (defined by Equation (6)) among all configurations that the particle has seen so far in the evolution process. The global best (gbest), on the other hand, is the particle having the best (minimum) communication cost for a generation, calculated from the set of local bests. The local as well as the global best particles control the evolution of each particle. The local and global best particles are updated if the corresponding fitness values in the current generation are less than the values up to the previous generation.

6.1.3. Evolution of Generation. Particles evolve over generations to create new particles that are expected to produce better solutions. The initial population is created randomly and the fitness values of the individual particles are determined. The local best (pbest) of each particle is initialized to itself. For a new generation, particles are created through a series of swap operations, explained next.

6.1.4. Swap Operator. The swap operation takes two indices, say i and j, of the particle p as input and creates a new particle p1. The particles p and p1 are the same except that the values at positions i and j of p are exchanged. Care has been taken to disallow swapping between the core part and the TSV part of a particle. Let p be a particle as shown in Figure 3(a). The swap operator SO(3,5) exchanges the values at positions 3 and 5 in p to generate a new particle, as shown in Figure 3(b).

Fig. 3. An example of a swap operation.

6.1.5. Swap Sequence. This is a sequence of swap operators. For example, a swap sequence SS = <SO(7, 1), SO(4, 3)> creates particle Pnew by applying the operations on particle P in two steps. Figure 4(a) represents the particle P.
Applying SO(7,1) on P creates an intermediate particle, Pmod, shown in Figure 4(b). The swap SO(4,3) on particle Pmod results in the new particle Pnew shown in Figure 4(c). For the evolution of a particle, the swap sequences that align it to its local best and to the global best are identified first. The sequences are then applied with probabilities given by the corresponding confidence factors. In our formulation, we have used confidence factors of 0.04 and 0.02 for local and global best alignment, respectively.

6.2. Augmentations to the Basic PSO
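As a compact recap of the basic PSO step that the augmentations build on, the following Python sketch shows one way to compute a minimum-length swap sequence and apply Equation (4) to the core-mapping (permutation) part of a particle. The function names are hypothetical, the greedy left-to-right construction is one standard way to obtain a minimum-length sequence for a permutation, and the inertia term $k_1 * I$ is omitted because identity swaps leave the particle unchanged. It is a sketch under these assumptions, not the authors' code.

```python
import random


def swap_sequence(src, dst):
    """Minimum-length list of 1-based position swaps turning src into dst.

    Assumes src and dst are permutations of the same distinct values.
    """
    work = list(src)
    swaps = []
    for i in range(len(work)):
        if work[i] != dst[i]:
            j = work.index(dst[i], i + 1)  # the wanted value lies to the right
            work[i], work[j] = work[j], work[i]
            swaps.append((i + 1, j + 1))
    return swaps


def apply_swaps(particle, swaps, prob=1.0, rng=random):
    """Apply each swap independently with probability prob; returns a new list."""
    out = list(particle)
    for i, j in swaps:
        if rng.random() < prob:
            out[i - 1], out[j - 1] = out[j - 1], out[i - 1]
    return out


def next_position(p, pbest, gbest, k2=0.04, k3=0.02, rng=random):
    """Equation (4): both sequences are derived from p, then applied in turn."""
    to_pbest = swap_sequence(p, pbest)
    to_gbest = swap_sequence(p, gbest)
    return apply_swaps(apply_swaps(p, to_pbest, k2, rng), to_gbest, k3, rng)


def satisfies_convergence(k1, k2, k3):
    """Equation (5): (1 - k1)^2 <= k2 + k3 <= (1 + k1)^2."""
    return (1 - k1) ** 2 <= k2 + k3 <= (1 + k1) ** 2


# The example from the text: m = <1, 3, 4, 2>, n = <2, 1, 3, 4>.
print(swap_sequence([1, 3, 4, 2], [2, 1, 3, 4]))  # [(1, 4), (2, 4), (3, 4)]
```

Because every step is a swap, the result is always a valid permutation, so no repair of the core mapping is needed after an update.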

To achieve a better solution from the PSO technique presented so far, the following augmentations have been incorporated.

6.2.1. Inversion Mutation (IM). PSO generally performs better than GA [Guilan et al. 2008]. However, one important drawback of PSO, as compared to GA, is the absence of a mutation operator that can bring sudden changes into a solution, thus possibly exploring a promising unexplored part of the search space. When PSO is found not to be improving over a fixed, predefined number of generations, a mutation operation may take it out of a probable local optimum. In this light, we have introduced an inversion mutation operator. To apply it on a particle, we follow the procedure noted next. First, a breakpoint is randomly generated for the core part of the particle. The portion from this breakpoint to the end is inverted and joined at the end of the part before the breakpoint. Next, the same is performed for the TSV part. Figure 5 shows the operation of such an inversion mutation. Its impact on solution quality is shown in Section 8.2.

Fig. 4. An example of a swap sequence operation.

Fig. 5. An example of the inversion mutation operation.

6.2.2. Usage of Better Random Number Generator. Like any other stochastic method, PSO depends to a great extent on the quality of the random number generator. In this work, we have used the thread-safe, single-instruction, multiple data-oriented, fast Mersenne twister (SIMT) pseudorandom number generation technique [Saito and Matsumoto 2008]. It has a period of up to $2^{216091} - 1$, better equidistribution, and quick recovery from a 0-excess initial state. The effect of using SIMT is demonstrated in Section 8.2.

6.2.3. Initial Population Generation. For a core graph with n cores mapped onto a 3D mesh having n routers distributed over m layers, the total numbers of possible mappings and TSV positions can be n! and $2^{n/m}$, respectively. Exploration of the promising region of this enormous search space depends to a great extent on the initial set of particles. We have used the deterministic initial population generation technique of Manna et al. [2014] to help in the process. The strategy proposed in Manna et al. [2014] can quickly generate a number of solutions equal to the number of routers in the NoC. However, the algorithm works only for a fully connected 3D mesh. To restrict the number of TSVs, we have generated a number of 3D-mesh architectures with randomly placed TSVs. The total number of TSVs has been restricted to 25% of the total available positions. Also, TSVs are placed at least two hops away from each other. Next, the algorithm suggested in Manna et al. [2014] has been utilized to get a number of mapping solutions. For each router in the NoC, a solution is generated by starting the mapping process at this router node. The best one among them, along with the associated TSV positions, contributes one intelligent particle to the initial population. The process has been repeated to create a certain number of intelligent particles. The rest of the particles are generated randomly.

6.2.4. Multiple PSO. In any population-based search technique, exploration and exploitation are the two properties that can be used to control the quality of the solution. They can also be used in PSO. In the exploration phase, different regions of the search space are explored, whereas the exploitation process checks for local optima around the globally explored points. In the initial portion of a PSO run, exploration dominates. However, as the evolution proceeds, the particles start converging and exploitation takes over.
To balance the exploration and exploitation process in a multiple swarm-based optimization technique, several strategies can be found in the literature [Röhler and Chen 2011; Chen and Montgomery 2011]. Among these strategies, locust swarm [Röhler and Chen 2011] is based on the “devour and move on” strategy. In this strategy, if a subswarm has found a local optimum, a set of scouts is deployed to explore new potential regions. Furthermore, the scouts are guided by the intelligence accumulated by the earlier subswarm. In our work, we have utilized a similar technique for better exploration of the search space, as detailed next. In this proposed augmentation, PSO has been run several times to improve upon the global best solution. Let, at the end of the nth run of PSO, the local best for the kth particle be $pbest_k^n$ and the global best be $gbest^n$. The (n + 1)th pass of PSO starts with a new set of particles. However, the local and global best information is transferred from the nth to the (n + 1)th PSO. We have worked with various values of $k_1$, $k_2$, and $k_3$ satisfying Equation (5). The values $k_1 = 1.0$, $k_2 = 0.04$, and $k_3 = 0.02$ have been observed to produce good results for most of the applications we have experimented with. A typical trace of the evolution of the best particle for application G18 with these parameter settings is shown in Figure 6. It is reasonable to assume that the particle has converged to its final value if there is no significant improvement in the solution quality over the generations. From Figure 6, it may be noted that there is only a small variation in the fitness values between 100 and 200 generations. In this example, the cost has decreased by only 50. After 200 generations, the communication cost does not change. Hence, the maximum number of generations in a PSO run is set to 200. Figure 7 shows a typical improvement in communication cost achieved through the multiple PSO structure.
Typically, communication cost does not improve after 20 such PSO runs. Hence, the maximum number of multiple PSO runs is set to 20. Thus, the maximum number of PSO runs to be executed has been set as follows:

(1) A user-defined value for the maximum number of PSO runs. In our experimentation, it has been taken as 200 PSO runs.
(2) The global best fitness does not change in the last 20 PSO runs.
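The two stopping rules listed above can be sketched as a small driver loop. This is an illustrative outline only: `multi_pso` and the callback `run_once` are hypothetical names, `run_once` stands for one complete PSO pass that starts from fresh particles, and only the global best is carried between passes here, whereas the text also transfers the local bests.

```python
def multi_pso(run_once, max_runs=200, patience=20):
    """Repeat PSO passes until the run budget or the stagnation limit is hit.

    run_once(seed) is assumed to execute one full PSO pass seeded with the
    best solution found so far (None on the first pass) and to return a
    (cost, solution) pair.
    """
    best_cost, best_solution = float("inf"), None
    runs = stale = 0
    while runs < max_runs and stale < patience:
        cost, solution = run_once(best_solution)
        runs += 1
        if cost < best_cost:
            best_cost, best_solution, stale = cost, solution, 0
        else:
            stale += 1  # global best unchanged in this pass
    return best_cost, best_solution, runs
```

With the default arguments, the loop stops after 200 passes or after 20 consecutive passes without improvement, whichever comes first.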

Fig. 6. Trace of the evaluation of PSO over generations.

Fig. 7. Effect of multiple PSOs.
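Among the augmentations of Section 6.2, the inversion mutation of Section 6.2.1 is the simplest to express in code. The sketch below follows the textual description (pick a random breakpoint, reverse everything from the breakpoint to the end, and do so separately for the core part and the TSV part). The names `invert_tail` and `inversion_mutation` are hypothetical, and Python's standard `random` module stands in for the generator used in the paper.

```python
import random


def invert_tail(seq, breakpoint):
    """Keep seq[:breakpoint] and append the remainder in reverse order."""
    return list(seq[:breakpoint]) + list(reversed(seq[breakpoint:]))


def inversion_mutation(core_part, tsv_part, rng=random):
    """Mutate both parts independently so that they never mix."""
    core_bp = rng.randrange(len(core_part))
    tsv_bp = rng.randrange(len(tsv_part))
    return invert_tail(core_part, core_bp), invert_tail(tsv_part, tsv_bp)


print(invert_tail([1, 2, 3, 4, 5], 2))  # [1, 2, 5, 4, 3]
```

Since the operator only reorders entries within each part, the core part remains a permutation and the number of "1" bits in the TSV part is preserved.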

7. COMPLEXITY ANALYSIS OF PSO- AND ILP-BASED APPROACHES

In order to determine the complexity of PSO and ILP, let us consider the following:
(1) The application is specified as a core graph G(C, E), with |C| cores and |E| edges.
(2) The topology graph has |R| routers and L layers.
(3) The PSO contains K particles.
(4) The specified maximum number of generations is g.

In PSO, the fitness evaluation of a particle is preceded by determining, for each edge of G, the source and destination routers and the nearest router to be used to send the traffic vertically. Assuming that the mapping part of the particle is presorted, this determination takes O(log|C|) time. Thus, the overall fitness evaluation takes O(|E|(log|C| + |R|/L)) time. Hence, the initialization phase has a complexity of O(K|E|(log|C| + |R|/L)). For the evolution part, for each particle, identification of each swap sequence for the mapping part of the particle takes O(|C|log|C|) time, whereas the TSV part takes O(1) time. The modification of the mapping part takes O(|C|) time, whereas the TSV part takes O(|R|/L) time. The fitness evaluation takes O(|E|(log|C| + |R|/L)) time. Thus, the overall complexity of the algorithm becomes O(gK|E|(log|C| + |R|/L)). If $|E| = O(|C|^2)$, that is, the cores are highly communicating in nature, then the time complexity becomes $O(gK|C|^2(\log|C| + |R|/L))$. The variable complexity of the ILP formulation proposed in Section 5 is proportional to the product of the square of the number of cores in the core graph and the cube of the number of routers in the topology graph, that is, $O(|C|^2 |R|^3)$. Assigning values to these variables and reaching the solution takes a long CPU time compared to the PSO method.

8. EXPERIMENTAL RESULTS

The results of experimentation on NoC benchmarks are presented in this section. The benchmarks, with the number of IP cores/tasks, the dimensions of the 3D-mesh-based NoC, and the number of edges/links, are noted in Table II. The well-known NoC benchmarks, such as PIP, 263ENC-MP3DEC, MWD, MPEG-4, VOPD, and DVOPD, can be found in the literature [Murali and Micheli 2004; Sahu and Chattopadhyay 2013]. Other than these, some synthetic benchmarks, G17-G22 and G25-G29, have been generated with a greater number of IP cores using the task graph generator tool TGFF [Dick et al. 1998].

Table II. Benchmark Applications and Their Mesh Sizes
Benchmarks | No. of Cores/Tasks | 3D Mesh Dimensions | No. of Edges/Links
PIP | 8 | 2 × 2 × 2, 1 × 2 × 4 | 8
263ENC-MP3DEC | 12 | 2 × 3 × 2, 1 × 3 × 4 | 12
MWD | 12 | 2 × 3 × 2, 1 × 3 × 4 | 12
MPEG4 | 12 | 2 × 3 × 2, 1 × 3 × 4 | 13
VOPD | 16 | 2 × 4 × 2, 2 × 2 × 4 | 20
DVOPD | 32 | 4 × 4 × 2, 2 × 4 × 4 | 42
G17 | 64 | 4 × 8 × 2, 4 × 4 × 4 | 95
G18 | 64 | 4 × 8 × 2, 4 × 4 × 4 | 59
G19 | 64 | 4 × 8 × 2, 4 × 4 × 4 | 62
G20 | 64 | 4 × 8 × 2, 4 × 4 × 4 | 91
G21 | 64 | 4 × 8 × 2, 4 × 4 × 4 | 98
G22 | 64 | 4 × 8 × 2, 4 × 4 × 4 | 89
G25 | 128 | 8 × 8 × 2, 4 × 8 × 4 | 207
G26 | 128 | 8 × 8 × 2, 4 × 8 × 4 | 127
G27 | 128 | 8 × 8 × 2, 4 × 8 × 4 | 180
G28 | 128 | 8 × 8 × 2, 4 × 8 × 4 | 236
G29 | 128 | 8 × 8 × 2, 4 × 8 × 4 | 192

The TGFF parameters used are as follows. The bandwidths have been varied from 10MB/s to 1,500MB/s for some graphs and from 50MB/s to 150MB/s for others. The in-out degrees of nodes are varied from 1 to 8 to generate both low and high communication graphs. The number of start nodes is also varied to generate different graphs and to see the effect of the mapping and TSV placement strategy upon them. The bandwidth values for the edges are generated randomly to obtain heterogeneous communication behavior among the cores. In this work, communication cost has been minimized by placing heavily communicating cores close to each other, which in turn reduces the hop count (communication distance) for the packets between source and destination pairs. As a result, packets traverse a shorter distance, which reduces the communication energy. No other energy model has been used for the analytical evaluation (Sections 8.1 to 8.3) of the proposed method. However, in Section 8.5, for dynamic evaluation of a mapping solution, we have used the tool Noxim for the 3D environment. In particular, we have used Access Noxim, which is available by request to the developer (http://access.ee.ntu.edu.tw/noxim/index.html). It includes energy values from the Intel 80-core processor model [Hoskote et al. 2007]. Thus, in Section 8.5, the Intel 80-core processor energy model is used to calculate the communication energy.

8.1. Comparison Between ILP- and PSO-Based Approaches

To check the optimality of the proposed approach for the TSV placement and mapping problem, we first compare the ILP results with the PSO results. Table III shows the TSV placement and mapping results using ILP and PSO. For applications PIP, S1, and S2, PSO could obtain the same solution as reported by ILP. For other applications, ILP could not start or complete due to the creation of a large number of constraints. However, PSO could produce satisfactory results within a reasonable time. Table III notes the number of variables and constraints required for the ILP formulation and the number of cores/tasks and edges/links for each application. The CPLEX [Cplex 2013] tool has been used to solve the formulated ILP. Both ILP and PSO have been implemented on a Dell PowerEdge T410 system with eight cores (Intel Xeon processor, [email protected] GHz), 64GB RAM. The capacity of PSO to reach the optimal results found by ILP gives us confidence in the quality of the PSO.

Table III. Comparison of Communication Cost Between ILP and Proposed PSO
Application | Mapping Algorithm | Comm. Cost (Hops × BW) | CPU Time
PIP (8 cores/tasks, 8 edges/links) | Proposed PSO | 768 | 0.59 sec
PIP (8 cores/tasks, 8 edges/links) | ILP (548 variables and 485 constraints) | 768 | almost 10 days
S1 (6 cores/tasks, 7 edges/links) | Proposed PSO | 384 | 0.32 sec
S1 (6 cores/tasks, 7 edges/links) | ILP (412 variables and 367 constraints) | 384 | almost 7 days
S2 (6 cores/tasks, 6 edges/links) | Proposed PSO | 1,224 | 0.47 sec
S2 (6 cores/tasks, 6 edges/links) | ILP (472 variables and 425 constraints) | 1,224 | almost 7 days

8.2. Effect of Inversion Mutation and Randomness Into the Basic PSO

Table IV. Communication Cost Comparison of Different Augmentation Techniques
Benchmarks | Basic PSO | Basic PSO w-IM Only | Basic PSO w-SIMT Only
G28 | 499,896.19 | 493,796.14 | 486,559.15
G29 | 398,451 | 397,123 | 386,714

The effects of augmenting the basic PSO with IM and with a better random number generator are presented in this section. The corresponding results are noted in Table IV for the applications G28 and G29. The second column notes the results of the basic PSO without any augmentation. The third and fourth columns show the results of incorporating inversion mutation and SIMT, respectively, into this basic PSO. As can be observed from the table, better results have been achieved using the IM and the SIMT random generator. We have experimented similarly with other applications. For applications like PIP, 263ENC-MP3DEC, MWD, and MPEG4, no improvement in communication cost is achieved. However, other applications show improvement. This improvement in communication cost is due to the ability of PSO to escape local optima. It has been observed that the maximum, average, and minimum reductions in communication cost are 36,022.18, 6,082.769, and 0, respectively, when only the IM is used as augmentation. The corresponding improvements are 51,161.58, 9,816.84, and 0 when only SIMT is used. Thus, we can conclude that such improvements have been obtained due to the usage of a mutation operator and SIMT with the basic PSO technique. This establishes the suitability of our proposed augmentation strategies for improving the solution quality.

8.3. Impact of Initial Population Generation

To improve the solution quality, the proposed method augments the initial population generation process of the basic PSO. That is, some percentage of the total number of particles has been taken from a fast deterministic heuristic, as explained earlier (Section 6.2.3). The impact of this augmentation on the solution quality for a partially connected 3D-mesh NoC with two layers is shown in Table V. The last row of the table notes the average percentage improvements achieved by incorporating different percentages of intelligent particles in the initial population, over a completely random initial population. It shows that 5% intelligent particles in the initial population can be a good augmentation. The solution quality does not improve significantly as the percentage of intelligent particles is increased from 5% to 20%. To assess the overall enhancement due to this augmentation procedure, we have performed a dynamic evaluation (the experimental setup is explained in Section 8.5) of the mapping solution generated by our proposed PSO. Table VI compares the throughput (Th.), latency (Lat.), and communication energy (Energy) considering 0% and 5% intelligent particles in the initial stage of PSO. The experimental results show improvement in terms of throughput, latency, and energy.

Table V. Communication Cost for Intelligent Initial Population
Benchmarks | Intelligent Particles 0% | 1% | 5% | 10% | 20%
G17 | 46,412.4 | 37,730 | 37,565 | 37,565 | 37,565
G18 | 9,620.64 | 6,679 | 6,193 | 6,153 | 6,153
G20 | 117,978.8 | 105,584 | 105,468 | 105,468 | 105,468
G25 | 160,396.69 | 104,719.25 | 103,819.67 | 103,819.67 | 103,819.67
G26 | 28,517.03 | 15,217.35 | 13,279.89 | 13,269.79 | 13,269.79
G27 | 74,391.23 | 48,749.25 | 48,464.99 | 48,460.58 | 48,460.58
Avg. Imp. | – | 29% | 31.47% | 31.55% | 31.55%

Table VI. Dynamic Performance Evaluation for Intelligent Initial Population
Benchmarks | Intelligent Particles | Th. | Lat. | Energy
G17 | 0% | 0.62 | 99,789 | 15.67
G17 | 5% | 0.64 | 98,321 | 14.07
G18 | 0% | 0.65 | 99,567 | 12.03
G18 | 5% | 0.69 | 98,612 | 11.23
G20 | 0% | 0.69 | 99,785 | 13.12
G20 | 5% | 0.71 | 98,874 | 12.88
G25 | 0% | 0.75 | 85,234 | 12.23
G25 | 5% | 0.77 | 82,789 | 11.67
G26 | 0% | 0.71 | 84,129 | 12.67
G26 | 5% | 0.74 | 82,673 | 11.80
G27 | 0% | 0.61 | 85,895 | 15.23
G27 | 5% | 0.64 | 84,385 | 14.31

8.4. Comparison with Existing Works

This section compares the experimental results of the current approach with some of the recent approaches reported in the literature. The corresponding results are presented in Table VII. From a performance viewpoint, in the best possible 3D mesh architecture, each router in the mesh can have a vertical connection (i.e., 100% TSVs) [Feero and Pande 2009]. We have considered such a fully connected 3D-mesh-NoC as an instance and mapped each benchmark onto that architecture using our proposed augmented multi-PSO (AMPSO)-based technique to evaluate the performance. The corresponding results are noted in the third column of Table VII, labeled as “AMPSO.” The TSV footprint in such case is very high. “Rand.” in column 4 corresponds to random distribution of 25% TSVs with the restriction that no two adjacent routers are having TSVs. The scheme suggested in Liu et al. [2011] makes four adjacent routers share one TSV located at the center of the cluster, which has been labeled as “Squeezing” in Table VII. Thus, it needs 25% of routers to have a 3D connection. We have extended the NMAP [Murali and Micheli 2004] and extended-KL [Manna et al. 2015] algorithms to work with 25% intelligently placed TSVs, which are marked as “Extnd.-NMAP” and “Extnd.-KL,” respectively, in Table VII. The approach reported in Manna et al. [2014] and Manna et al. [2015] are iterative techniques. In these approaches, first a mapping is carried out assuming the existence of 100% TSVs. Then, the 25% most-used TSVs are retained, while the other routers are converted to 2D, and the mapping process is repeated. The corresponding communication cost is noted in the column marked “Constructive.” The columns marked as “Single PSO” and “Multiple PSO” correspond to the case in which PSO has been run only once or several times, respectively, as ACM Transactions on Embedded Computing Systems, Vol. 16, No. 1, Article 24, Publication date: November 2016.

Layers

Two

Four

PIP 640 263ENC-MP3DEC 230.43 MWD 1216 MPEG4 3633 VOPD 4119 DVOPD 9560 G17 32922.35 G18 6153.1 G19 6247.98 G20 96679.95 G21 93863.07 G22 37903.21 G25 89079.84 G26 12953.47 G27 42210.77 G28 32909.72 G29 210267.34 Rank 1

896 230.51 1699 3912 4310 9912 42189.9 6422.15 6912.1 153118.15 131213.15 49172.5 142140.12 15311.38 49231.12 381215.01 272510.91 1.852

Benchmarks AMPSO Rand. PIP 640 896 263ENC-MP3DEC 230.41 268 MWD 1216 1260 MPEG4 3567 3768 VOPD 4119 4157 DVOPD 9528 10384 G17 36295.6 46620.8 G18 6094.11 10159.9 G19 6102.65 9625 G20 96187.35 118456 G21 94681.89 134817 G22 39600 54366.2 G25 95315.77 171234.15 G26 12988.05 29987.09 G27 46028.05 75848.18 G28 340893.63 556321.12 G29 218147.19 397620.18 Rank 1 1.442

100%

25%

896 230.47 2015 4025 4415 11257 54251.15 9912.07 9985.12 123450.19 131721.01 55129.03 173242.61 30124.17 77451.6 56715.31 423506.18 1.544

896 230.51 1965 3940.21 4900 11668 52507.96 8459.07 8984.38 188913.95 159312.89 58671.28 152142.61 22972.12 57361.02 47174.12 35821.72 1.386

896 230.51 1845 3960.12 4773 10004 45622.87 6847.07 7266.12 177260.21 157012.13 51989.62 149812.12 18321.26 53241.18 42891.12 293158.12 1.342

896 230.45 1664 3713 4237 9800 40565 6522.23 6745.89 125737.3 121292.33 49280.46 121129.99 15468.77 49380.91 41014.01 255441.25 1.194

896 230.45 1728 3752 4237 11085 44305.92 9804.02 9950.26 19531.92 128547.79 55307.56 148357.69 27823.64 79732.18 44306.13 336384.47 1.368

896 230.45 1664 3752 4237 9832 36566.99 6222.23 6575.89 116290.52 111292.33 45259.43 114208.68 14520.25 54027.32 39324.78 260493.25 1.161

896 230.45 1664 3752 4237 10496 42224.91 8385.52 6690.79 111666.7 121444.17 51738.26 122315.89 20684.07 57585.57 40250.19 275643.72 1.222

896 230.45 1664 3713 4237 9768 36565 6222.23 6545.89 105737.3 101035.16 42280.46 99126.77 13402.08 46380.91 35014.01 225441.25 1.099

Proposed PSO and its variants Extnd.-NMAP Squeezing [Murali and Extnd.-KL Constructive Single PSO Multi PSO [Liu et al. 2011] [Micheli 2004] [Manna et al. 2015] [Manna et al. 2014] w/o augmntn w augmntn w/o augmntn w augmntn 768 896 768 768 768 768 768 768 230.21 230.21 230.99 230.42 230.43 230.43 230.41 230.43 1248 1255 1252 1248 1248 1248 1248 1216 3714 3672 3706 3632 3632 3632 3632 3632 4157 4279 4189 4119 4157 4141 4151 4119 10307 10404 9726 9592 10032 9592 10016 9554 50626 39387.91 38576.63 40198.9 46412.4 37565 39974 35375.93 9814.02 7579.85 6847.48 6280.91 9620.64 6279 8184 6094.11 9930.1 8886.84 7231.05 6717.12 9215 6689 8837.5 6430.65 117380.8 117768.57 114297.94 110831 117978.8 105584 111986 103727.15 129985 111384.42 116985.25 107280 123884.3 113489 109516 99511.17 54354 51509.32 45223.15 48271.4 52905.8 50323 45943 42167.82 166271.15 145123.25 134161.25 112315.11 160396.69 103819.67 120021.79 99815.93 29231.51 19201.61 20613.78 14725.12 28517.03 13279.89 20097.92 13118.97 75128.27 53489.71 52374.61 48238.17 74391.23 48464.99 55859.67 47121.62 551751.31 442131.51 432371.81 381112.11 492880.19 391165.13 406676.23 376313 396121.45 312261.48 305166.55 242312.91 363630 229716.31 258872.25 222481 1.413 1.234 1.179 1.093 1.358 1.079 1.185 1.038

TSV Used

Table VII. Comparison of Communication Cost with Existing Works

Integrated Through-Silicon Via Placement and Application Mapping 24:19

ACM Transactions on Embedded Computing Systems, Vol. 16, No. 1, Article 24, Publication date: November 2016.

24:20

Table VIII. CPU Time Comparison with Existing Works (CPU Time in Sec.)

Two Layers

| Benchmarks | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| PIP | 0.030 | 0.030 | 0.030 | 0.030 | 0.030 | 0.030 |
| 263ENC-MP3DEC | 0.041 | 0.042 | 0.043 | 0.045 | 0.041 | 0.041 |
| MWD | 0.042 | 0.041 | 0.044 | 0.047 | 0.041 | 0.043 |
| MPEG4 | 0.043 | 0.041 | 0.044 | 0.049 | 0.043 | 0.044 |
| VOPD | 0.047 | 0.045 | 0.045 | 0.052 | 0.044 | 0.048 |
| DVOPD | 0.45 | 0.313 | 0.34 | 0.423 | 0.413 | 0.462 |
| G17 | 8.512 | 4.312 | 6.521 | 7.396 | 6.396 | 8.621 |
| G18 | 10.231 | 6.936 | 6.82 | 9.658 | 8.206 | 10.620 |
| G19 | 12.391 | 5.361 | 6.629 | 10.876 | 9.021 | 12.062 |
| G20 | 13.126 | 7.256 | 5.321 | 11.238 | 8.237 | 13.026 |
| G21 | 11.215 | 6.673 | 4.346 | 10.027 | 8.321 | 12.062 |
| G22 | 11.312 | 5.562 | 6.462 | 9.937 | 8.621 | 11.329 |
| G25 | 379.931 | 208.321 | 210.356 | 280.316 | 198.752 | 382.125 |
| G26 | 472.387 | 210.721 | 198.505 | 358.562 | 187.351 | 407.375 |
| G27 | 405.672 | 219.965 | 180.321 | 342.126 | 201.376 | 402.167 |
| G28 | 318.371 | 197.318 | 170.702 | 220.130 | 188.312 | 325.123 |
| G29 | 421.382 | 212.378 | 182.153 | 310.158 | 256.351 | 410.375 |
| Rank | 1.67 | 1.04 | 1.00 | 1.46 | 1.17 | 1.66 |

Four Layers

| Benchmarks | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| PIP | 0.030 | 0.030 | 0.030 | 0.030 | 0.030 | 0.030 |
| 263ENC-MP3DEC | 0.041 | 0.043 | 0.042 | 0.044 | 0.042 | 0.041 |
| MWD | 0.043 | 0.042 | 0.043 | 0.048 | 0.042 | 0.044 |
| MPEG4 | 0.045 | 0.043 | 0.045 | 0.048 | 0.044 | 0.045 |
| VOPD | 0.048 | 0.045 | 0.045 | 0.050 | 0.045 | 0.049 |
| DVOPD | 0.462 | 0.332 | 0.361 | 0.442 | 0.421 | 0.474 |
| G17 | 8.437 | 4.136 | 5.739 | 7.651 | 6.235 | 8.792 |
| G18 | 10.378 | 5.931 | 6.209 | 9.678 | 8.375 | 11.023 |
| G19 | 12.486 | 6.021 | 6.389 | 11.329 | 9.156 | 12.261 |
| G20 | 13.421 | 6.832 | 5.325 | 12.385 | 8.317 | 13.267 |
| G21 | 11.316 | 5.916 | 4.781 | 11.376 | 8.107 | 12.151 |
| G22 | 12.061 | 6.201 | 5.926 | 10.071 | 8.523 | 11.296 |
| G25 | 382.756 | 210.316 | 215.321 | 352.71 | 220.162 | 386.729 |
| G26 | 432.295 | 220.189 | 201.156 | 376.425 | 235.965 | 420.731 |
| G27 | 410.279 | 210.531 | 187.312 | 320.256 | 289.302 | 402.329 |
| G28 | 308.725 | 182.315 | 172.325 | 281.756 | 176.123 | 317.392 |
| G29 | 430.218 | 230.519 | 201.251 | 378.621 | 208.672 | 410.628 |
| Rank | 1.67 | 1.03 | 1.00 | 1.55 | 1.20 | 1.68 |

A=AMPSO; B=Squeezing [Liu et al. 2011]; C=Extnd.-NMAP [Murali and Micheli 2004]; D=Extnd.-KL [Manna et al. 2015]; E=Constructive [Manna et al. 2014]; F=Multi PSO (Proposed).

noted in Section 6.2. Furthermore, under those columns, “w/o augmntn” marks the results obtained without the augmentations suggested in Section 6.2, whereas “w augmntn” marks the results obtained with all of the augmentations applied. Taking the “AMPSO” results as unity, the proposed approach in the last column requires only 3.8% more communication cost, on average, compared to 44.2%, 41.3%, 23.4%, 17.9%, and 9.3% more for random placement, Liu et al. [2011], Murali and Micheli [2004], Manna et al. [2015], and Manna et al. [2014], respectively, for a partially connected 3D-mesh-based NoC having two layers. For four layers, it requires 9.9% more communication cost, on average, compared to 85.2%, 54.4%, 38.6%, 34.2%, and 19.4% more for random placement, Liu et al. [2011], Murali and Micheli [2004], Manna et al. [2015], and Manna et al. [2014], respectively. In Table VII, unity in the rank field indicates the best communication cost achievable by our proposed technique when every router has a vertical connection. As noted earlier, the area and routing congestion increase significantly in this case due to the large TSV footprint. Any method that employs only 25% TSVs and produces results close to those of AMPSO is therefore a good one, so a rank value close to 1 is desirable. By this measure, the proposed strategy with augmentations produces better solutions than the other contemporary approaches available in the literature.

The CPU times for the state-of-the-art methods and the proposed strategy are noted in Table VIII. All the strategies have been implemented on a Dell PowerEdge T410 system with eight cores (Intel Xeon processor, [email protected]) and 64GB RAM. The proposed method takes somewhat more CPU time than the non-PSO-based methods.
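The rank values used in these comparisons are normalized averages. The following is a minimal sketch, assuming that the rank of a method is the arithmetic mean of its per-benchmark ratios to a reference method. That definition is an assumption, not stated in the text, although it reproduces the rank of 1.67 listed for AMPSO against Extnd.-NMAP in the two-layer part of Table VIII. The function name `rank` and the variable names are illustrative.

```python
from statistics import mean


def rank(method, reference):
    """Mean of per-benchmark ratios method/reference (assumed definition of rank)."""
    if len(method) != len(reference):
        raise ValueError("both methods need one value per benchmark")
    return mean(m / r for m, r in zip(method, reference))


# Two-layer CPU times in seconds from Table VIII, benchmark order PIP ... G29.
ampso = [0.030, 0.041, 0.042, 0.043, 0.047, 0.45, 8.512, 10.231, 12.391,
         13.126, 11.215, 11.312, 379.931, 472.387, 405.672, 318.371, 421.382]
extnd_nmap = [0.030, 0.043, 0.044, 0.044, 0.045, 0.34, 6.521, 6.82, 6.629,
              5.321, 4.346, 6.462, 210.356, 198.505, 180.321, 170.702, 182.153]

# Extnd.-NMAP is the reference (rank 1.00); AMPSO is compared against it.
ampso_rank = rank(ampso, extnd_nmap)
print(f"AMPSO rank relative to Extnd.-NMAP: {ampso_rank:.2f}")
```

Under the same assumption, using the AMPSO column as the reference would give ranks of the kind reported for communication cost in Table VII.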
More precisely, taking the “Extnd.-NMAP” results as unity, the proposed approach in the seventh column requires about 66% more CPU time, on average, compared to 67%, 4%, 46%, and 17% more for AMPSO, Liu et al. [2011], Manna et al. [2015], and Manna et al. [2014], respectively, for a partially connected 3D-mesh-based NoC having two layers. For four layers, it requires 68% more CPU time, on average, compared to 67%, 3%, 55%, and 20% more for AMPSO, Liu et al. [2011], Manna et al. [2015], and Manna et al. [2014], respectively. However, it produces better-quality solutions than the other state-of-the-art techniques.

Table IX. Access Noxim Settings

| Parameters | Values |
|---|---|
| Buffer depth | 6 |
| Minimum and maximum packet size | 64 flits (32 bits per flit) |
| Routing | XYZ, Elevator-first, and modified Elevator-first |
| Selection logic | Random |
| Warmup time | 10,000 clk cycles |
| Simulation time | 200,000 clk cycles |
| Traffic | Table based |

8.5. Dynamic Evaluation of Proposed Solutions

Earlier, we minimized the communication cost for 3D-NoC by placing the TSVs at proper positions. The communication cost of a source-destination core pair is the product of the hop count and the bandwidth requirement between the two cores. Theoretically, the total communication cost over all source-destination core pairs is a good indicator of the throughput and latency of the network. However, the latency it reflects is that of a contention-free, zero-load environment. The proposed PSO technique uses this communication cost to improve the system performance. To understand the impact of TSV placement and mapping more clearly, we next simulate each of the NoC systems.

Synthetic self-similar traffic was generated, guided by the communication requirements of the cores in an application. Self-similar traffic has been observed in the bursty traffic between on-chip modules in typical video and networking applications [Varatkar and Marculescu 2004]. Our traffic generator produces self-similar traffic by aggregating a large number of ON-OFF message sources following a Pareto distribution with a Hurst parameter of H = 0.75 and shape parameters of αON = 1.5 and αOFF = 1.17 [Kundu et al. 2012].

The Noxim (more precisely, Access Noxim) [Jheng et al. 2013] simulator was used to simulate the NoCs. The configuration of the Access Noxim simulator is presented in Table IX. We incorporated both routing algorithms, elevator-first [Bahmani et al. 2012; Dubois et al. 2013] and modified elevator-first [Lee and Choi 2013], in Access Noxim, and extended the simulator to accept the TSV positions generated by the proposed methodology. The Access Noxim environment was modeled following the Intel 80-core architecture [Hoskote et al. 2007], in which the maximum link capacity is 20GB/s and the maximum router frequency is 5.1GHz.
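The ON-OFF aggregation used by the traffic generator can be sketched as follows. This is a minimal illustration rather than the generator used in the experiments: the slot-based time model, the number of sources, the random initial state, and the names `on_off_source` and `aggregate_traffic` are all assumptions. Only the Pareto shape parameters (1.5 for ON periods, 1.17 for OFF periods) come from the text, and the sketch neither estimates nor enforces the Hurst parameter.

```python
import random


def on_off_source(n_slots, alpha_on, alpha_off, rng):
    """Return a 0/1 activity list of length n_slots for one ON-OFF source.

    ON and OFF period lengths are Pareto distributed with a minimum of one slot.
    """
    activity = []
    is_on = rng.random() < 0.5  # random initial state (illustrative choice)
    while len(activity) < n_slots:
        alpha = alpha_on if is_on else alpha_off
        # paretovariate returns a value >= 1; cap the period so that a single
        # heavy-tailed draw cannot exceed the remaining trace length.
        period = min(n_slots - len(activity), round(rng.paretovariate(alpha)))
        activity.extend([1 if is_on else 0] * period)
        is_on = not is_on
    return activity


def aggregate_traffic(n_sources, n_slots, alpha_on=1.5, alpha_off=1.17, seed=0):
    """Count, for every time slot, how many of n_sources sources are ON."""
    rng = random.Random(seed)
    load = [0] * n_slots
    for _ in range(n_sources):
        source = on_off_source(n_slots, alpha_on, alpha_off, rng)
        for slot, active in enumerate(source):
            load[slot] += active
    return load


load = aggregate_traffic(n_sources=64, n_slots=1000)
print("min/max/mean active sources per slot:",
      min(load), max(load), sum(load) / len(load))
```

In a simulator, the per-slot load would then be scaled to the bandwidth requirement of each core pair before packets are injected; that scaling step is left out of the sketch.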
Whereas the link capacity of that architecture is in GB/s, the bandwidth requirements in the benchmarks are in MB/s. The simulated NoC architecture uses the data rates provided in the benchmarks (https://drive.google.com/folderview?id=0Byxe1YCU_wU9U19nVmpxZmhqTDg&usp=sharing). In the dynamic simulation environment, congestion can occur. The network throughput (Th.), average latency (Lat.(cycle)), and average packet energy for the benchmarks, obtained by running the simulation for 200,000 clock cycles, are noted in Table X. It may be observed that the proposed approach of intelligent TSV placement and mapping produces results close to those for the situation with 100% TSVs.

Table X. Comparison of Throughput, Latency, and Energy Consumption of Different Strategies for Benchmarks

Each benchmark has three values per parameter, listed in the printed column order. The columns cover 25% and 100% TSVs used with XYZ, elevator-first [Bahmani et al. 2012; Dubois et al. 2013], and modified elevator-first [Lee and Choi 2013] routing. The two blocks cover the two-layer and four-layer configurations.

| Benchmarks | Th. | Lat.(cycle) | Avg. pkt. enrgy. (in uJ) |
|---|---|---|---|
| 263ENC-MP3DEC | 0.80 / 0.80 / 0.71 | 77,461 / 76,480 / 77,843 | 9.73 / 9.83 / 11.22 |
| MWD | 0.78 / 0.78 / 0.75 | 98,259 / 98,260 / 98,304 | 10.84 / 10.84 / 11.06 |
| MPEG4 | 0.65 / 0.64 / 0.60 | 96,342 / 96,260 / 97,014 | 8.54 / 11.67 / 12.21 |
| VOPD | 0.75 / 0.67 / 0.65 | 93,456 / 92,340 / 93,550 | 12.24 / 12.29 / 12.98 |
| DVOPD | 0.70 / 0.69 / 0.65 | 94,375 / 93,378 / 94,520 | 11.03 / 11.26 / 11.35 |
| G17 | 0.66 / 0.65 / 0.62 | 97,942 / 97,679 / 98,768 | 13.10 / 13.49 / 13.55 |
| G18 | 0.72 / 0.71 / 0.69 | 98,850 / 97,860 / 98,860 | 10.79 / 10.83 / 10.83 |
| G19 | 0.70 / 0.69 / 0.68 | 98,974 / 98,175 / 98,974 | 11.08 / 11.17 / 11.18 |
| G20 | 0.74 / 0.73 / 0.71 | 98,970 / 98,424 / 99,433 | 12.07 / 12.48 / 13.65 |
| G21 | 0.86 / 0.85 / 0.82 | 99,802 / 98,253 / 99,300 | 12.24 / 12.29 / 12.09 |
| G22 | 0.68 / 0.67 / 0.66 | 98,954 / 98,558 / 99,566 | 12.08 / 12.81 / 13.38 |
| G25 | 0.81 / 0.79 / 0.76 | 81,923 / 80,512 / 82,261 | 10.25 / 10.92 / 11.25 |
| G26 | 0.78 / 0.76 / 0.74 | 81,235 / 80,111 / 80,971 | 10.75 / 10.99 / 11.10 |
| G27 | 0.68 / 0.65 / 0.62 | 84,231 / 83,192 / 83,912 | 13.25 / 13.65 / 13.68 |
| G28 | 0.70 / 0.68 / 0.65 | 85,427 / 83,129 / 83,997 | 11.59 / 12.01 / 12.26 |
| G29 | 0.75 / 0.73 / 0.70 | 83,125 / 80,123 / 81,239 | 12.35 / 11.87 / 12.01 |

| Benchmarks | Th. | Lat.(cycle) | Avg. pkt. enrgy. (in uJ) |
|---|---|---|---|
| 263ENC-MP3DEC | 0.78 / 0.78 / 0.68 | 78,461 / 77,450 / 77,643 | 10.73 / 10.93 / 13.21 |
| MWD | 0.75 / 0.75 / 0.72 | 99,259 / 98,262 / 98,504 | 9.84 / 9.84 / 10.68 |
| MPEG4 | 0.62 / 0.61 / 0.58 | 96,942 / 96,261 / 97,214 | 9.54 / 10.76 / 12.41 |
| VOPD | 0.73 / 0.67 / 0.65 | 92,856 / 92,342 / 93,555 | 11.24 / 11.38 / 12.95 |
| DVOPD | 0.68 / 0.67 / 0.62 | 77,961 / 77,485 / 77,846 | 9.73 / 9.83 / 11.22 |
| G17 | 0.65 / 0.63 / 0.60 | 98,959 / 98,561 / 98,624 | 9.84 / 9.84 / 10.66 |
| G18 | 0.64 / 0.63 / 0.59 | 97,242 / 96,253 / 97,114 | 7.84 / 10.97 / 11.25 |
| G19 | 0.70 / 0.67 / 0.65 | 94,356 / 92,356 / 93,558 | 11.42 / 11.10 / 11.98 |
| G20 | 0.72 / 0.70 / 0.68 | 78,466 / 77,410 / 77,856 | 8.75 / 8.89 / 10.11 |
| G21 | 0.82 / 0.80 / 0.75 | 96,252 / 95,265 / 95,305 | 11.04 / 11.85 / 11.96 |
| G22 | 0.65 / 0.64 / 0.60 | 95,245 / 94,261 / 95,014 | 10.54 / 10.67 / 10.21 |
| G25 | 0.79 / 0.76 / 0.71 | 72,456 / 71,340 / 72,550 | 9.24 / 9.29 / 9.98 |
| G26 | 0.75 / 0.73 / 0.71 | 87,461 / 86,480 / 87,843 | 9.73 / 9.83 / 11.22 |
| G27 | 0.65 / 0.62 / 0.59 | 83,259 / 82,260 / 83,304 | 11.04 / 11.54 / 12.06 |
| G28 | 0.69 / 0.67 / 0.59 | 87,242 / 86,260 / 87,014 | 10.54 / 10.97 / 11.21 |
| G29 | 0.72 / 0.71 / 0.69 | 84,356 / 83,340 / 83,550 | 11.24 / 11.29 / 11.98 |

9. CONCLUSION

Since TSVs occupy significant area and are not expected to shrink at the same rate as the gates do, reducing the number of TSVs in a 3D-IC is very important. This work attempts to map a given application onto a 3D-NoC while keeping the TSV count low. First, it proposes solving the problem using ILP, an exact method that produces optimal results. However, ILP turned out to be highly CPU intensive and could not be used for large NoCs (more than 16 nodes). This prompted us to look for another solution strategy. Accordingly, a PSO was formulated to perform the TSV placement and core mapping together. The basic PSO was augmented in several ways to improve the solution quality. The results in Table VII indicate that reducing the TSV count leads to an increase in the communication cost. Using the proposed PSO, it is possible to bring down the communication cost compared to other proposed techniques for 3D-NoCs with 25% vertical connectivity. The communication cost was sharply reduced after incorporating the augmentations. Future work involves exploring other exact solution approaches, such as constraint programming. Congestion- and contention-aware mapping may also be considered to improve the latency and power consumption of the resulting 3D-NoC, extending the proposed work along the lines of Chou and Marculescu [2008].

REFERENCES

2010. Single-Chip Cloud Computer. http://techresearch.intel.com/UserFiles/en-us/File/SCC_Symposium_Mar162010_GML_final.pdf.
2010. The AMD Opteron 6000 series platform. http://www.amd.com.
A. B. Ahmed and A. B. Abdallah. 2012. Low-overhead routing algorithm for 3d network-on-chip. In 3rd Int. Conf. on Networking and Computing (ICNC’12). 5–7.
S. Azampanah, A. Eskandari, A. Khademzadeh, and F. Karimi. 2013. Traffic-aware selection strategy for application-specific 3d NoC. Advances in Computer Science: An Intl. Journal 2, 5 (2013), 107–114.
M. Bahmani, A. Sheibanyrad, F. Petrot, F. Dubois, and P. Durante. 2012. A 3d-NoC router implementation exploiting vertically-partially-connected topologies. In Proc. of Computer Society Annual Symposium on VLSI (ISVLSI’12). 9–14.
L. Benini and G. De Micheli. 2002. Networks-on-chips: A new SoC paradigm. Computer 35 (2002), 70–78.
S. Chen and J. Montgomery. 2011. Selection strategies for initial positions and initial velocities in multi-optima particle swarms. In Proc. of the 13th Annual Conference on Genetic and Evolutionary Computation. 53–60.
C. L. Chou and R. Marculescu. 2008. Contention-aware application mapping for network-on-chip communication architectures. In Proc. of Int. Conference on Computer Design (ICCD’08). 164–169.
L. Congying, Z. Huanping, and Y. Xinfeng. 2011. Particle swarm optimization algorithm for quadratic assignment problem. In Proc. of IEEE Int. Conference Computer Science Networking Technology. 1728–1731.
Cplex. 2013. http://www.ibm.com/software/in/integration/optimization/cplex.
W. J. Dally and B. Towles. 2001. Route packets, not wires: On-chip interconnection networks. In Proc. of Design Automation Conf. (DAC’01). 683–689.
W. R. Davis, J. Wilson, S. Mick, J. Xu, H. Hua, C. Mineo, A. M. Sule, M. Steer, and P. D. Franzon. 2005. Demystifying 3d ICs: The pros and cons of going vertical. IEEE Design Test 22 (2005), 498–510.
R. P. Dick, D. L. Rhodes, and W. Wolf. 1998. TGFF: Task graphs for free. In Proc. of Int. Workshop on Hardware/Software Codesign.
H. Ding, H. Gu, Y. Yang, and D. Fan. 2013. Traffic-aware selection strategy for application-specific 3d NoC. IEICE Electronics Express (2013), 1–6.
F. Dubois, A. Sheibanyrad, F. Petrot, and M. Bahmani. 2013. Elevator-first: A deadlock-free distributed routing algorithm for vertically partially connected 3d-NoCs. IEEE Transactions on Computers 62, 3 (2013), 609–615.
B. S. Feero and P. P. Pande. 2009. Networks-on-chip in a three dimensional environment: A performance evaluation. IEEE Transactions on Computers 58, 1 (2009), 32–45.
M. R. Garey and D. S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman Publisher.
L. Guilan, Z. Hai, and S. Chunhe. 2008. Convergence analysis of a dynamic discrete PSO algorithm. In International Conference on Intelligent Networks and Intelligent Systems. 89–92.
A. R. Guner and M. Sevkli. 2008. A discrete particle swarm optimization algorithm for uncapacitated facility location problem. Journal of Artificial Evolution and Application 2008 (2008), 1–9.


P. K. Hamedani, S. Hessabi, H. Sarbazi-Azad, and N. E. Jerger. 2012. Exploration of temperature constraints for thermal aware mapping of 3d networks-on-chip. In Proc. of Euromicro Int. Conference on Parallel, Distributed and Network-Based Processing (PDP’12). 499–506.
R. Ho, K. W. Mai, and M. A. Horowitz. 2001. The future of wires. In Proc. of IEEE. 490–504.
Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar. 2007. A 5-GHz mesh interconnect for a teraflops processor. IEEE Micro 27, 5 (2007), 51–61.
Y. J. Hwang, J. H. Lee, and T. H. Han. 2011. 3D network-on-chip system communication using minimum number of TSVs. In Proc. of ICT Convergence (ICTC’11). 517–522.
K. Jheng, C. Chao, H. Wang, and A. Wu. 2013. http://access.ee.ntu.edu.tw/noxim/index.html.
X. Jiang and T. Watanabe. 2010. An efficient 3d NoC synthesis by using genetic algorithms. In Proc. of IEEE Region 10 Conference (TENCON’10). 1207–1212.
N. Kapadia and S. Pasricha. 2012. A power delivery network aware framework for synthesis of 3d networks-on-chip with multiple voltage islands. In Proc. of Int. Conference on VLSI Design (VLSID’12). 262–267.
N. Kapadia and S. Pasricha. 2013. A co-synthesis methodology for power delivery and data interconnection networks in 3d ICs. In Proc. of Int. Symposium on Quality Electronic Design (ISQED’13). 73–79.
J. Kennedy and R. Eberhart. 1995. Particle swarm optimization. In Proc. of Int. Conference on Neural Networks. 1942–1948.
D. H. Kim, K. Athikulwongse, and S. K. Lim. 2009. A study of through-silicon-via impact on the 3D stacked IC layout. In Proc. of Int. Conference on Computer-Aided Design (ICCAD’09). 674–680.
S. Kundu, J. Soumya, and S. Chattopadhyay. 2012. Design and evaluation of mesh-of-tree based network-on-chip using virtual channel router. Microprocessors and Microsystems 36, 6 (2012), 471–488.
J. Lee and K. Choi. 2013. A deadlock-free routing algorithm requiring no virtual channel on 3d-NoCs with partial vertical connections. In Proc. of Int. Symposium on Networks on Chip (NoCS’13). 1–2.
C. Liu, L. Zhang, Y. Han, and X. Li. 2011. Vertical interconnects squeezing in symmetric 3d mesh network-on-chip. In Proc. of Asia and South Pacific Design Automation Conference (ASP-DAC’11). 357–362.
P. Magarshack and P. G. Paulin. 2003. System-on-chip beyond the nanometer wall. In Proc. of 40th Design Automation Conf. (DAC’03). 419–424.
K. Manna, S. Chattopadhyay, and I. Sengupta. 2014. Through silicon via placement and mapping strategy for 3d mesh based network-on-chip. In Proc. of the IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC’14). 1–6.
K. Manna, V. S. Teja, S. Chattopadhyay, and I. Sengupta. 2015. TSV placement and mapping strategy for 3d mesh based network-on-chip using extended Kernighan-Lin partitioning technique. In Proc. of the IEEE Intl. Symposium on VLSI (ISVLSI’15). 392–397.
F. Miller, T. Wild, and A. Herkersdorf. 2012. TSV-virtualization for multi-protocol-interconnect in 3d-ICs. In Proc. of Euromicro Conference on Digital System Design (DSD’12). 374–381.
S. Murali and G. De Micheli. 2004. Bandwidth constrained mapping of cores onto NoC architectures. In Proc. of Design, Automation and Test in Europe (DATE’04). 896–901.
S. Murali, C. Seiculescu, L. Benini, and G. De Micheli. 2009. Synthesis of networks on chips for 3d systems on chips. In Proc. of Asia and South Pacific Design Automation Conference (ASP-DAC’09). 242–247. DOI: http://dx.doi.org/10.1109/ASPDAC.2009.4796487
S. Pasricha. 2009. Exploring serial vertical interconnects for 3d ICs. In Proc. of Design Automation Conference (DAC’09). 581–586.
S. Pasricha. 2012. A framework for TSV serialization-aware synthesis of application specific 3d networks-on-chip. In Proc. of Int. Conference on VLSI Design (VLSID’12). 268–273.
V. F. Pavlidis and E. G. Friedman. 2007. 3-d topologies for networks-on-chip. IEEE Transactions on Very Large Scale Integration Systems 15 (2007), 1081–1090.
A.-M. Rahmani, K. R. Vaddina, K. Latif, P. Liljeberg, J. Plosila, and H. Tenhunen. 2012. Design and management of high-performance, reliable and thermal-aware 3d networks-on-chip. IET Circuits, Devices Systems 6, 5 (2012), 308–321.
A. B. Röhler and S. Chen. 2011. An analysis of sub-swarms in multi-swarm systems. In Proc. of Joint Austral. Conf. on Artificial Intelligence. 271–280.
P. K. Sahu and S. Chattopadhyay. 2013. A survey on application mapping strategies for network-on-chip design. Journal of Systems Architecture 59, 1 (2013), 60–76.
P. K. Sahu, T. Shah, K. Manna, and S. Chattopadhyay. 2014. Application mapping onto mesh based network-on-chip using discrete particle swarm optimization. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 22, 2 (2014), 300–312.


M. Saito and M. Matsumoto. 2008. SIMD-Oriented Fast Mersenne Twister: A 128-Bit Pseudorandom Number Generator. Monte Carlo and Quasi-Monte Carlo Methods, A. Keller, S. Heinrich, and H. Niederreiter, Eds. Berlin, Germany: Springer-Verlag. pp. 607–622.
C. Seiculescu, S. Murali, L. Benini, and G. D. Micheli. 2009. SunFloor 3d: A tool for networks on chip topology synthesis for 3d systems on chips. In Proc. of Design Automation Test in Europe (DATE’09). 9–14.
C. Seiculescu, S. Murali, L. Benini, and G. D. Micheli. 2010. SunFloor 3d: A tool for networks on chip topology synthesis for 3-d systems on chips. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 29, 12 (2010), 1987–2000.
Semiconductor Industry Association et al. 2007. The International Technology Roadmap for Semiconductors (ITRS).
K. Tatas, K. Siozios, D. Soudris, and A. Jantsch. 2014. Designing 2D and 3D Network-on-Chip Architectures. Springer, New York.
A. W. Topol, D. C. La Tulipe, Jr., L. Shi, D. J. Frank, K. Bernstein, S. E. Steen, A. Kumar, G. U. Singco, A. M. Young, K. W. Guarini, and M. Ieong. 2006. Three-dimensional integrated circuits. IBM Journal on Research and Development 50, 4/5 (2006), 491–506.
S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar. 2008. An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS. IEEE Journal on Solid-State Circuits 43 (2008), 29–41.
G. V. Varatkar and R. Marculescu. 2004. On-chip traffic modelling and synthesis for MPEG-2 video applications. IEEE Transactions on Very Large Scale Integration (VLSI) System 12, 1 (2004), 108–119.
J. Wang, L. Li, H. Pan, S. He, and R. Zhang. 2011. Latency-aware mapping for 3d NoC using rank-based multi-objective genetic algorithm. In Proc. of Int. Conference on ASIC (ASICON’11). 413–416.
P. K. Wang, L. Huang, C. G. Zhou, and W. Pang. 2003. Particle swarm optimization for traveling salesman problem. In Proc. of Int. Conf. on Machine Learning and Cybernetics. 1583–1585.
T. C. Xu, P. Liljeberg, and H. Tenhunen. 2010. A study of through silicon via impact to 3d network-on-chip design. In Proc. of Int. Conf. on Electronics and Information Engineering (ICEIE’10). 333–337.
T. C. Xu, P. Liljeberg, and H. Tenhunen. 2011. Optimal number and placement of through silicon vias in 3D network-on-chip. In Proc. of Int. Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS’11). 105–110.
Y. Xu, Y. Du, B. Zhao, X. Zhou, Y. Zhang, and J. Yang. 2009. A low-radix and low-diameter 3d interconnection network design. In Proc. of Symposium on High Performance Computer Architecture (HPCA’09). 30–42.
Y. Xie, J. Cong, and S. Sapatnekar. 2009. Three-Dimensional Integrated Circuit Design: EDA, Design and Microarchitectures. Springer Publishing Company.
S. Yan and B. Lin. 2008. Design of application-specific 3d networks-on-chip architectures. In Proc. of Int. Conf. on Computer Design (ICCD’08). 142–149.
H. Ying, K. Heid, T. Hollstein, and K. Hofmann. 2012a. A genetic algorithm based optimization method for low vertical link density 3-dimensional networks-on-chip many core systems. In Proc. of NORCHIP. 1–4.
H. Ying, T. Hollstein, and K. Hofmann. 2012b. Communication-centric high level synthesis metrics for low vertical channel density 3-dimensional networks-on-chip. In Proc. of Int. Workshop on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC’12). 1–6.
W. Zhong, S. Chen, F. Ma, T. Yoshimura, and S. Goto. 2011. Floorplanning driven network-on-chip synthesis for 3-d SoCs. In Proc. of Int. Symposium on Circuits and Systems (ISCAS’11). 1203–1206.
P. Zhou, P.-H. Yuh, and S. Sapatnekar. 2012. Optimized 3d network-on-chip design using simulated allocation. ACM Transactions on Design Automation of Electronic Systems 17, 2, Article 12 (April 2012), 19 pages.
Received August 2015; revised May 2016; accepted July 2016
