Physical mapping and performance study of a multi-clock 3-Dimensional Network-on-Chip mesh

Matt Grange+, Awet Yemane Weldezion*, Dinesh Pamunuwa+, Roshan Weerasekera+, Zhonghai Lu*, Axel Jantsch*, and Dave Shippen+

+ Centre for Microsystems Engineering, Department of Engineering, Lancaster University, Lancaster LA1 4YW, UK
{m.grange, d.pamunuwa, r.weerasekera}@lancaster.ac.uk

* School of Information and Communication Technologies, Department of Electronics, Computer, and Software Systems, KTH Royal Institute of Technology, Electrum 229, Kista SE 16440, Sweden
{aywe, zhonghai, axel, hannu}@kth.se

Abstract—The physical performance of a 3-Dimensional Network-on-Chip (NoC) mesh architecture employing Through Silicon Vias (TSVs) for vertical connectivity is investigated with a cycle-accurate RTL simulator. The physical latency and area impact of the TSVs, switches and on-chip interconnect are evaluated to extract the maximum signaling speeds through the switches and through the horizontal and vertical network links. The relatively low parasitics of TSVs compared to the on-chip 2-D interconnect allow for higher signaling speeds between chip layers. The system-level impact on overall network performance of clocking vertical packets at a higher rate through the TSV interconnect is simulated and reported.

I. INTRODUCTION

3-D integration technology in the next generation of chip manufacturing processes is a promising solution to the rapidly approaching interconnect bottlenecks [21] in the deep sub-micron regime of process technology and may provide a route towards the realization of massively integrated, high-performance chips. 3-D chip stacks are based on a variety of vertical interconnection techniques [2], [22]. 3-D integration provides opportunities for cost reduction and yield improvement in the integration of different technologies such as CMOS, DRAM and MEMS circuits, through the ability to implement them over multiple die layers on the same chip. It can also reduce form factor in applications where size is critical, although effective heat dissipation and temperature control can be a challenge. The use of Through Silicon Vias (TSVs) allows dense vertical interconnects to be realized between chip layers. The relatively low parasitics of a TSV [24] can result in fast vertical inter-die communication compared to the highly parasitic on-chip global wires. It is envisioned that a massively integrated 3-D IC may employ the standardized Network-on-Chip (NoC) architecture to implement communication between its functional blocks, avoiding bus-based interconnect bottlenecks and wiring complexity as the number of resources on a chip increases.


For 2-D multiprocessor systems the NoC is a convincing choice, and most recently an 80-tile, 100M-transistor system with 1.28 TFLOPS peak performance has been demonstrated in a 65nm, 1V technology [23]. The suitability of the NoC for 3-D integrated systems has been widely documented in the literature, and the 3-D NoC mesh has been shown to outperform its 2-D counterpart with a globally synchronous clock for horizontal and vertical communication [27], [19].

This article examines the performance of a 7-port 3-D NoC mesh employing higher clock speeds on the vertical (layer-to-layer) links between switches. Physical properties and models of the TSVs, on-chip wires and synthesized switches are studied to understand the physical limitations on clocking and to provide appropriate boundary conditions for the simulations. Extensive simulations are then carried out to explore the performance benefits of clocking the vertical links at a higher rate than the horizontal links. Cycle-accurate RTL simulations are conducted with Uniform Random Traffic (URT) on a 7-port switch using the Nostrum network communication protocol, based on a bufferless hot-potato routing algorithm [12], for symmetrical n×n×n configurations. The performance of the NoC in terms of throughput, latency and area is extracted from the simulations in order to quantify the variation of network performance with different vertical and horizontal clock speeds for a varying number of network nodes, and to derive key design guidelines.

The main contribution of this paper is a novel and extensive analysis of the performance of 3-D NoC architectures under realistic physical constraints of 3-D integrated circuits employing TSVs. This study quantifies performance to aid the design of the communication architecture of future massively integrated 3-D systems.

The rest of the paper is organized in the following manner. Section II discusses related work, while Section III discusses the physical properties and modeling of the communication hardware of a 3-D integrated circuit and NoC.

Section IV defines network implementation issues, including the protocols, network topology and architecture of the switches. Sections V and VI describe the exploration space, the simulation methodology and the results. We end with a discussion of the results and our conclusions.

II. RELATED WORK

A comprehensive overview of processing and technological issues in 3-D integration can be found in [22], while a discussion of the physical-level cost and performance trade-offs associated with the different techniques is provided in [26]. The different NoC-based system architectures for 3-D systems have been exhaustively enumerated in [19], depending on whether a die housed in a tile has one layer or several layers, and whether the NoC itself is 2-D or 3-D. The average unloaded latency per bit has also been derived, showing the potential improvements in latency and power consumption obtainable through a 3-D architecture, but the performance of the network for different traffic patterns under a packet-switched protocol is necessary to evaluate its scalability. An investigation into different router structures was conducted in [9], which concludes with a novel design of a 3-D crossbar-style NoC. Cycle-accurate simulations are performed and energy and latency metrics are extracted. In [18] a multi-layer NoC router architecture is explored for a number of traffic patterns.

A well-known paper that describes the performance of communication networks of varying dimension for wormhole routing is [3], which generalizes the interconnection network as a k-ary n-cube torus, with n being the dimension of the cube and k the radix, or number of switches in a given dimension. In it, Dally points out that VLSI circuits are wire-limited and that any growth in dimension has to be accompanied by a related reduction in the parallelism of each link; therefore the appropriate analysis of the performance of interconnection networks for VLSI circuits has to be under the constraint of constant bisection bandwidth. It is shown that under various wire delay models, including transmission-line-like and RC-like behavior, for relatively large networks with 1M nodes the best performance is delivered by switches of relatively low dimension, around 5 or 6. Due to layout restrictions, the most common implementation of the NoC for both 2-D and 3-D is a mesh, which is not directly comparable to the tori considered in [3]. In a 2-D mesh a switch has links to four other switches and to its resource, while the straightforward extension for 3-D is a switch with two additional inter-layer links (with the seventh port being connected to the resource). Therefore it is of interest to investigate the scalability (i.e. the variation of key metrics related to latency, energy and area with the size of the network) of the NoC for this topology. Further, two key assumptions in [3] are likely to have an effect: firstly, the study is carried out under the assumption of uniform links, but as explained below the vertical links have a significantly different delay model than the horizontal links; secondly, the assumption of constant bisection bandwidth needs to be revisited, as the footprint of the vertical interconnects is in general different from that of the horizontal links.

The electrical behavior of the relatively short and wide TSV is much better than that of long on-chip interconnects, primarily due to its low resistance, and it can support much higher signaling speeds, but a relatively high area penalty due to via blockage may impose constraints on the number of TSVs that can be used for communication. This has led the authors of [11] to propose a dynamic time-division-multiple-access (dTDMA) bus with a centralized arbiter for the vertical communication link, which allows single-hop latency for packets between any number of vertical layers. Most recently, [6] describes a detailed study using different traffic patterns under a realistic protocol for both architectures to derive figures of merit related to throughput, latency, energy and area overhead to characterize various topologies. The work in [14] describes a working 3-layer, 27-node prototype that provides proof of concept of the 3-D NoC, but identifies the need for a 3-D network simulator and system-level explorations of the kind discussed here. Finally, the authors in [27] perform system-level simulations to compare a TDMA bus and a 7-port switch under various traffic patterns for a range of network sizes to examine the scalability of a 3-D NoC mesh. The authors highlight the drawbacks of a bus-based vertical interconnect and favor the 7-port switch design. To the authors' best knowledge, no simulations or performance evaluations have been conducted to examine the possibility of clocking the vertical router links at a higher rate than the horizontal 2-D network.

III. PHYSICAL LAYER AND DESIGN TRADEOFFS

The technology that allows the highest density of vertical interconnects, and is the most promising for tackling interconnect-related communication woes in very large designs, is based on through-silicon or through-hole vias that are fabricated either as part of the front-end wafer processing (front-end-of-line: FEOL) [20] or in a back-end process (back-end-of-line: BEOL) at the die or wafer level [8]. A pitch on the order of 0.14μm is currently state-of-the-art in TSV technology [10]. Physical models for the equivalent circuit representation of TSVs are reported in [24], which recommends a lumped capacitive model with inter-via coupling for representative TSV lengths. The resistance of the TSV is of the order of mΩ, which results in a very low signal transmission latency compared to on-chip wires, which run for considerably longer lengths. The authors of [12] come to a similar conclusion. Because of this relatively low resistance, the effect of capacitive crosstalk within a TSV bundle can be overcome by a sufficiently strong driver, which leads to the result that the highest bandwidth can actually be obtained by having the highest possible TSV density [28], unlike the on-chip case, where the highly resistive wires dictate an optimal point that does not correspond to the highest density [17]. Even though it is desirable from a performance point of view to have the highest density, there are associated cost penalties. TSVs with different material composites and widely varying densities have been reported in the literature, from structures with pitches over 100μm and a 'U'-filled cross-section to 0.14μm-pitch copper-filled structures [10].

Fig. 1. Area utilization efficiency on a single layer as a function of the bit width of the vertical bus (4 to 256 bits), for the exclusively 2-D case and for TSV pitches ranging from 1.5μm to 55.5μm.

The higher densities are usually only enabled by costly BEOL processing. It is important to recognize that the via density is governed by cost constraints (dictated by poor yield for the more advanced processes) as well as by performance requirements. Another important factor is that TSVs take away logic area and reduce the area utilization efficiency. As an example, the die, resource and switch sizes for an aggressive design in a nanometer-scale CMOS technology are approximately 20mm, 2mm and 1mm on a side respectively [23]. Based on some reference designs we have synthesized, we estimate the sizes of a simpler 5-port 2-D and 7-port 3-D switch to be 0.6mm and 0.7mm on a side respectively in a 180nm UMC process. The area utilization is quantified in Figure 1, with the exclusively 2-D NoC serving as a baseline; a sketch of the underlying arithmetic is given below. Under the assumption that every switch has vertical connections, the extra area consumed by the TSV bundle may present an unacceptably high overhead for certain bit-width and pitch combinations. Hence the main design choices related to implementation in 3-D integration revolve around trade-offs associated with TSV density (i.e. parallelism), area overhead and cost. Potential solutions to minimize cost and area overhead without imposing bandwidth penalties are to reduce the number of switches with vertical connectivity, and to reduce the bundle size, with parallel-to-serial and serial-to-parallel conversion when a packet crosses a layer. This is made feasible by the fact that the TSVs have considerably better electrical performance than the horizontal inter-switch link, and support a higher signaling speed.
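As a rough illustration of the trade-off quantified in Figure 1, the following sketch estimates how much of a tile remains for logic once a TSV bundle is placed. It assumes the vertical bus is laid out as a near-square array at the given pitch inside a fixed tile (the 624μm tile dimension from the switch analysis below is used as a default); these layout assumptions are ours and not necessarily those behind Figure 1.

```python
import math

def area_utilisation(bit_width, pitch_um, tile_um=624.0):
    """Fraction of a tile left for logic after a square TSV bundle is placed.

    Assumes 'bit_width' TSVs arranged in a near-square array at 'pitch_um';
    the tile dimension defaults to the 624 um tile used later in the text.
    Illustrative sketch only, not the exact methodology behind Fig. 1.
    """
    side = math.ceil(math.sqrt(bit_width))        # TSVs per row/column
    bundle_area = (side * pitch_um) ** 2          # keep-out area of the bundle
    tile_area = tile_um ** 2
    return max(0.0, 1.0 - bundle_area / tile_area)

for pitch in (1.5, 25.5, 55.5):                   # um, as in the Fig. 1 legend
    for bits in (32, 128, 256):
        print(f"pitch {pitch:5.1f} um, {bits:3d} bits: "
              f"{100 * area_utilisation(bits, pitch):5.1f}% of tile left for logic")
```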

A. Switch properties

The aim of this section is to obtain representative performance limitations of the physical interconnect and hardware of a 3-D NoC, which will provide a performance indicator and reality check for the system-level simulations conducted in this study. A maximum die size of 2cm×2cm in a 32nm CMOS process technology is considered for all physical aspects of this investigation.

To determine the switch, interconnect and resource areas, and the subsequent switch-to-switch wire lengths, a maximum network size of 32×32 was selected. This represents the maximum number of nodes simulated for 2-D and 3-D networks within this exploration. This upper limit on the number of nodes allows a resource, switch and interconnect to occupy equally divided tiles of 624μm×624μm on a single die. The area allocated for each tile in the network is kept constant, but the number of nodes reduces along with the footprint of the die. The area allocated for the 2-D and 3-D nodes is identical to provide a fair comparison. As this study intends to map the physical capabilities of the communication hardware of the network, the resources are not considered in depth.

The switch analysis establishes the physical length of the 2-D network links and the capability of a bufferless switch to be pipelined sufficiently to route packets through the vertical TSVs at higher frequencies. The switches used in this study were designed for functionality and thus were not optimized for speed or layout. The area and speed of both the 2-D and 3-D switches can be considerably improved with customized layout tools and pipelining stages, which has been widely documented in academia and industry, but falls outside the scope of this investigation. Using the methodology applied in [16], we have estimated the area and interconnect density of a 2-D 5-port and a 3-D 7-port switch in 32nm technology. Analyzing the switch speeds and sizes establishes where the bottleneck in the network performance lies.

The single-stage 5-port 2-D switch is reported to occupy an area of 600μm×600μm and to consist of 40,029 gates in 0.18μm technology, from a synthesized VHDL model. By applying equation (1), given by the authors in [16], the size of the switch in 32nm technology is estimated to occupy an area of 106μm×106μm, where A denotes the area, γ the feature size and α_s a constant that accounts for integration losses or gains in scaling. The layout of a tile is shown in Figure 2. The 2-D switch occupies roughly 3% of the area allocated for each tile.

A_{new} = \frac{A_{old}}{\left(\gamma_{old}/\gamma_{new}\right)^{2} \times \alpha_{s}}    (1)
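As a sanity check of (1), the following sketch reproduces the 106μm estimate from the 600μm switch synthesized in 0.18μm technology, under the assumption that α_s = 1 (the text does not state the value actually used).

```python
def scale_side(side_old_um, gamma_old_nm, gamma_new_nm, alpha_s=1.0):
    """Scale a block's side length with feature size per equation (1).

    Area scales as A_new = A_old / ((gamma_old / gamma_new)**2 * alpha_s),
    so the side length scales with the square root of that factor.
    alpha_s = 1 is an assumption; the paper treats it as a fitting constant.
    """
    area_old = side_old_um ** 2
    area_new = area_old / ((gamma_old_nm / gamma_new_nm) ** 2 * alpha_s)
    return area_new ** 0.5

print(scale_side(600, 180, 32))   # 5-port 2-D switch: ~106.7 um, matching the text
print(scale_side(700, 180, 32))   # 7-port 3-D switch: ~124.4 um (0.18 um area given later)
```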

The gate depth from the synthesis report was used to find the critical path delay of the switch. Using the fan-out-of-4 (FO4) delay allows the maximum operating frequency of the design to be roughly estimated. In this case, the FO4 delay for 32nm technology is approximately 16ps and the critical path of our synthesized 2-D switch is approximately 3.52ns, which results in a maximum operating frequency of 284MHz at 32nm. As the implementation is currently a single-stage design, it is assumed that the switch can be pipelined to improve upon the critical path delay and maximize the clock frequency, bringing the switch designs into the GHz realm.
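The frequency estimate follows directly from the critical path expressed in FO4 delays. The sketch below reproduces the 284MHz figure and illustrates how ideal pipelining would scale it; the equal-split, zero-overhead pipelining model is an assumption for illustration, not a synthesis result.

```python
FO4_32NM_PS = 16.0          # approximate FO4 delay at 32 nm, as quoted in the text

def f_max_mhz(critical_path_ns, pipeline_stages=1):
    """Rough maximum clock frequency for a given critical path.

    Assumes ideal pipelining: each stage gets an equal share of the
    combinational delay and register overhead is neglected (an
    illustrative simplification, not the authors' methodology).
    """
    stage_delay_ns = critical_path_ns / pipeline_stages
    return 1e3 / stage_delay_ns

crit_path_ns = 3.52                               # synthesized 2-D switch at 32 nm
print(crit_path_ns * 1e3 / FO4_32NM_PS)           # ~220 FO4 delays on the critical path
print(f_max_mhz(crit_path_ns))                    # ~284 MHz, single-stage
print(f_max_mhz(crit_path_ns, pipeline_stages=4)) # ~1.1 GHz with 4 ideal pipeline stages
```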

The single-stage 7-port 3-D switch design was also synthesized in 0.18μm technology and is reported to occupy an area of 700μm×700μm, consisting of 50K gates. Applying the same methodology as in the 2-D case, we estimate the switch area to be 124μm×124μm and the operating frequency to be 250MHz at 32nm.

B. Network links

Since the resistance and inductance of TSVs across a wide range of dimensions are negligible [7] compared to the parasitics of an appropriately sized driver and of on-chip global wires, the delay through a TSV is much smaller than that of signals on each 2-D network layer. It is therefore feasible to attain faster signaling speeds through the vertical interconnect than through the 2-D horizontal communication grid. The physical parameters of future 32nm technology are obtained from the International Technology Roadmap for Semiconductors (ITRS) 2007, and rail-to-rail CMOS signaling is considered. The insertion of repeaters is beneficial when the line delay is at least 7 times greater than the buffer delay [24]. Given the allocated size of each functional block in the network and the results from the estimated switch areas, a maximum switch-to-switch wire length of 518μm is obtained. In order to determine appropriate driver sizes and the number of repeaters required, Bakoglu's equation (2), derived in [1], was used:

t_{2D} = k\left[0.7\,\frac{R_d}{h}\left(\frac{C_s}{k} + hC_d + 4.4\,\frac{C_c}{k}\right) + \frac{R_w}{k}\left(0.4\,\frac{C_s}{k} + 1.51\,\frac{C_c}{k} + 0.7\,hC_d\right)\right]    (2)

Using projected parasitic values for 32nm technology on a 518μm length of semi-global wire, as detailed in Table I, the ideal driver size and repeater count were calculated as h = 115 and k = 3 respectively for the worst-case coupled switching scenario. This results in an analytically determined 50% rise time of 58.6ps. This calculated figure was confirmed to be accurate with Spectre simulations of a coupled wire with the same parasitic properties. The delay through a TSV in a bundle is estimated with the model presented in [7]:

t_{3D} = 0.69 \times R_d\left(C_s + K_1 C_{lat} + K_2 C_{diag}\right)    (3)
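The following sketch evaluates (2) and (3) with the Table I values. It reproduces the 58.6ps figure for the horizontal link; for the TSV it assumes a driver with the same effective resistance R_d/h as the repeated 2-D link (the text does not state the TSV driver size), which places the vertical delay in the same few-picosecond range as the 4.08ps quoted below.

```python
# Table I values (per-mm quantities multiplied by the 0.518 mm link length)
LEN_MM  = 0.518
C_S_2D  = 20e-15 * LEN_MM       # self capacitance of the wire, F
C_C_2D  = 61e-15 * LEN_MM       # coupling capacitance, F
R_W_2D  = 1870.0 * LEN_MM       # wire resistance, ohm
R_D_MIN = 30.48e3               # minimum-sized driver resistance, ohm
C_D     = 0.10e-15              # driver load capacitance, F

def t_2d(h=115, k=3):
    """Worst-case coupled delay of the repeated horizontal link, equation (2)."""
    rd, rw, cs, cc = R_D_MIN, R_W_2D / k, C_S_2D / k, C_C_2D / k
    stage = (0.7 * (rd / h) * (cs + h * C_D + 4.4 * cc)
             + rw * (0.4 * cs + 1.51 * cc + 0.7 * h * C_D))
    return k * stage

def t_3d(r_drv, c_s=7.37e-15, c_lat=1.31e-15, c_diag=0.45e-15, k1=9.0, k2=10.6):
    """TSV delay in a bundle, equation (3); r_drv is an assumed driver resistance."""
    return 0.69 * r_drv * (c_s + k1 * c_lat + k2 * c_diag)

print(t_2d() * 1e12)                  # ~58.6 ps, matching the analytical figure in the text
print(t_3d(R_D_MIN / 115) * 1e12)     # ~4.4 ps with the assumed ~265 ohm driver
```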

A 16×16 bundle of TSVs serves as the vertical connectivity for each switch between chip layers. The parasitic elements of a TSV, shown in Table I, were calculated from the compact models developed in [25] to determine the R, L and C values for a 10μm-diameter TSV with a pitch of 30μm between adjacent TSVs.

TABLE I. Parasitic values for 2-D on-chip wires, drivers and Through Silicon Vias

  On-chip parameters             TSV parameters
  Width     = 70 nm              Diameter   = 10 μm
  Length_w  = 0.518 mm           Length_TSV = 20 μm
  C_s       = 20 fF/mm           C_s        = 7.37 fF
  C_d       = 0.10 fF            C_lat      = 1.31 fF
  C_c       = 61 fF/mm           C_diag     = 0.45 fF
  R_w       = 1870 Ω/mm          R_TSV      = 4.5 Ω
  R_d       = 30.48 kΩ           L_TSV      = 6.18 pH

Fig. 2. A 2-D tile with area allocations for switch, interconnect and resource

Circuit simulations in Spectre of a 3×3 bundle under the worst-case switching scenario were compared with the delay model presented in [7] to evaluate the rise times of TSVs in a bundle. The delay in an n×n bundle of TSVs can be determined quickly and with reasonable accuracy in this fashion. A 50% rise time of 4.08ps was predicted by equation (3) with K_1 and K_2 set to 9.0 and 10.6 respectively. Running a SPICE simulation with the same values produced a 5.77ps rise time, which confirms the accuracy of the model. Compared to the horizontal link delay, the vertical packets arrive some 50ps before the horizontal packets. This margin allows the vertical links to be clocked at up to eight times the speed of the horizontal links.

IV. LINK AND NETWORK LAYER PROTOCOLS

The cycle-accurate RTL simulator has been designed so that packets entering or exiting the vertical ports (Up/Down) of the 7-port switch can be clocked at integer multiples of the rate of the horizontal ports (N, E, S, W, R), faster or slower. For this study, the vertical links are clocked at up to four cycles for every one cycle of the horizontal clock to observe the effect of faster vertical packet routing. The switch runs on the faster vertical clock and the routing algorithm is not modified in any way to take advantage of the speed of the vertical links. Packets may only be injected into the network from the resources on the horizontal clock edges.

The NoC protocol employed in this study is a hot-potato implementation with the switch architecture described in [15]. The routing strategy is based on non-minimal, load-dependent deflection packet switching with adaptive per-hop routing. A relative addressing scheme is implemented, which simplifies the duplication of identical switches when network structures of varying sizes are designed. The switches instantiated in all of the network configurations are bufferless and directly connected to their associated resource with a 1:1 resource-to-switch ratio. A packet cannot be stored in a switch, and thus in each cycle the packets must be moved from switch to switch or deflected back to a resource. To reduce the complexity of the switches, deterministic routing is favored over adaptive routing in buffered networks [13], [5], [2]-[4]. In bufferless networks, deflection routing is advocated [22] because it is possible to design fast and small switches with simplified control circuitry.

The implemented switch uses 128-bit input and output channels for each switch-to-switch Physical Channel (PC) and for the resource-to-switch connection. Vertical connectivity is implemented with two simplex channels having the same switch-to-switch bit-width as the horizontal channels. Packets are generated by each resource and injected into the network with an injection rate of up to 0.9 packets per node per cycle. For this study, a packet is considered to be one flit long. Each resource has a FIFO buffer to temporarily store packets if they cannot enter the network due to congestion. These packets are queued and re-injected into the network with a higher priority than newly generated packets. The implemented network does not drop packets. The packet headers generated by the resources contain final destination addresses, and the switches make routing decisions on the fly based on this information.

Any NoC structure is comprised of switches located at the corners, edges, surfaces and center of the physical structure. For example, a 2-D switch has 4 links to other switches in 4 directions, namely North, South, East and West. A 3-D switch has two additional links, Up and Down, giving a total of 6 bi-directional links. There is also an additional bi-directional port from switch to resource in each case. For a given NoC structure, not all of the links can be used for routing packets. For example, only half of the switch-to-switch links of a switch in a corner of the network are connected, halving the available switch bandwidth due to the unconnected edge ports.
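To make the bufferless hot-potato behaviour concrete, the sketch below shows one way a 7-port deflection switch could arbitrate within a cycle: older packets choose their preferred output first, and a packet whose productive ports are all taken is deflected to any free port. This is an illustrative toy, not the Nostrum switch of [15]; the priority rule, coordinate handling and port naming are our assumptions.

```python
from dataclasses import dataclass

PORTS = ["N", "E", "S", "W", "Up", "Down", "R"]      # six network ports plus the resource port

@dataclass
class Packet:
    dest: tuple   # (x, y, z) destination coordinates
    age: int      # hops travelled so far; older packets win arbitration here

def preferred_ports(pkt, here):
    """Productive output ports, i.e. those that reduce the distance to the destination."""
    dx, dy, dz = (d - h for d, h in zip(pkt.dest, here))
    prefs = []
    if dx:
        prefs.append("E" if dx > 0 else "W")
    if dy:
        prefs.append("N" if dy > 0 else "S")
    if dz:
        prefs.append("Up" if dz > 0 else "Down")
    return prefs or ["R"]                            # arrived: eject to the resource

def route_cycle(packets, here):
    """Assign each incoming packet to a distinct output port (deflection routing)."""
    assignment, taken = [], set()
    for pkt in sorted(packets, key=lambda p: -p.age):         # oldest packet chooses first
        free_prefs = [p for p in preferred_ports(pkt, here) if p not in taken]
        port = free_prefs[0] if free_prefs else next(p for p in PORTS if p not in taken)
        assignment.append((pkt, port))               # deflected if no productive port was free
        taken.add(port)
    return assignment

pkts = [Packet(dest=(1, 0, 2), age=3), Packet(dest=(1, 0, 2), age=1)]
print(route_cycle(pkts, here=(0, 0, 0)))             # oldest takes "E"; the other falls back to "Up"
```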

V. TRAFFIC PATTERNS AND SIMULATION SETUP

The data samples are extracted from the output files following the warm-up phase of the network and preceding the cool-down phase to ensure reliable results. For example, Figure 3 shows the average latency growth for a 5×5×5 3-D mesh network with an injection rate of 0.5. The warm-up phase occupies the first 200 clock cycles. After 500 clock cycles the network is fully stable until the simulation ends. In these experiments samples were obtained after the network reached stability, but for certain combinations of network size and injection rate the network never becomes stable.

Fig. 3. Average normalized latency growth over time (in clock cycles) for a 5×5×5 3-D mesh with a 0.5 injection rate

Simulations were performed for an n×n×n 3-D network for n = 2, 4, 6, 8 and 10, for a symmetrical vertical and horizontal clock and for a vertical clock with twice the frequency. Injection rates for these cases were simulated at 0.1 and 0.2 packets per node per horizontal cycle. Additionally, simulations were performed for a 10×10×10 network with a vertical clock of 2, 3 and 4 times the speed of the horizontal network links. The injection rate for these cases was fixed at 0.3 packets per node per horizontal cycle.

The metrics used to quantify the performance of the network in each case are raw latency, normalized latency and throughput, defined in (4), (5) and (7) respectively.

T_{Raw} = \frac{C_{Final} - C_{Init}}{H_C}    (4)

The raw latency, T_{Raw}, relates the time a packet spends in the network to the distance traveled from the source to the destination address in terms of hop counts, denoted by H_C; C_{Final} and C_{Init} represent the final and initial clock cycles respectively. When the network is at zero load, the raw latency corresponds to a packet traveling the minimum distance. The normalized latency, T_{Norm}, is the ratio of the raw latency (i.e. the latency over the actual distance traveled by a packet) to the minimum distance at zero load, T_{Zero\,load}, defined as:

T_{Norm} = \frac{T_{Raw}}{T_{Zero\,load}}    (5)

The average normalized latency is the mean of the normalized latencies over the collected samples, defined as:

T_{Norm\,Avg} = \mathrm{mean}\left(T_{Norm}\right)    (6)

In each cycle, packets arrive at all nodes at a rate based on the injection rate and the congestion level of the network. The throughput per node per cycle, λ, is defined in (7), where P_{Total} is the total number of packets received over the simulated range, N is the number of nodes in the network and C is the number of cycles in the sampling region.

\lambda = \frac{P_{Total}}{N \times C}    (7)

The generated traffic pattern attempts to load the network to provide a clear indicator of the impact of altering the clock frequencies of the vertical network links. A Uniform Random Traffic (URT) pattern is employed in all of the simulations conducted in this paper. The URT model stipulates that each node in the network is equally likely to become the destination address of any packet emitted from a resource. Random addresses are generated by each resource according to the current network size and configuration and are then assigned to outgoing packets at a definable injection rate.

From a design perspective, it is customary to place frequently communicating resources close to each other to maximize efficiency. This increases the performance of the communication in terms of power, timing and resource management. Although the simulator is capable of modeling localized traffic patterns, the simulations in this paper were designed to characterize the performance variation of the vertical clocking schemes without complicating the design space, and the localization protocol was therefore disabled.
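The traffic generation and throughput bookkeeping described above can be sketched as follows. The node indexing, random number generation and the per-cycle application of the injection rate are illustrative assumptions rather than the simulator's actual implementation; the throughput helper simply evaluates (7).

```python
import random

def urt_destinations(src, n, injection_rate, cycles, rng=random.Random(0)):
    """Yield (cycle, destination) pairs for one resource under uniform random traffic.

    Every node of the n x n x n mesh is equally likely to be chosen (the source
    is excluded here for illustration); a packet is emitted in a given horizontal
    cycle with probability 'injection_rate'. Illustrative sketch only.
    """
    nodes = [(x, y, z) for x in range(n) for y in range(n) for z in range(n)]
    candidates = [d for d in nodes if d != src]
    for cycle in range(cycles):
        if rng.random() < injection_rate:
            yield cycle, rng.choice(candidates)

def throughput_per_node_per_cycle(packets_received, n_nodes, cycles):
    """Equation (7): lambda = P_Total / (N * C) over the sampling region."""
    return packets_received / (n_nodes * cycles)

traffic = list(urt_destinations(src=(0, 0, 0), n=10, injection_rate=0.3, cycles=1000))
print(len(traffic) / 1000)                                   # offered load, ~0.3 packets/cycle
print(throughput_per_node_per_cycle(300_000, 10 ** 3, 1000)) # 0.3 if no packets are lost
```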

VI. RESULTS

The results detail the performance contrast in clocking the vertical TSV links at a higher rate than the on-chip horizontal links. Figures 4(a) and (b) plot the average raw latency as the number of nodes and layers is increased, for injection rates of 0.1 and 0.2 packets per node per horizontal cycle respectively. In these cases the vertical links are fixed at twice the speed of the horizontal links and compared against uniformly clocked 3-D and 2-D networks of comparable sizes. The 2-D results were omitted from Figure 4(b) for clarity, as its performance was degraded.

Fig. 4. Raw latency (cycles) as the number of network nodes increases from 2×2×2 to 10×10×10, for injection rates of 0.1 (a) and 0.2 (b). Each plot compares a uniform clock against a vertical clock at twice the horizontal clock rate (plot (a) also includes the 2-D uniform-clock case); the annotated deviations between the clocking schemes are 5.3% and 9.51% in (a) and 6.1% and 10.2% in (b).

For smaller networks it is clear that the faster vertical links do not have a large effect on the delivery time of the packets in the 3-D case. As the data packets are equally likely to head in any given direction, and the vertical links represent only two out of the seven output ports of the switch, the likelihood of a packet traveling vertically rather than horizontally is lower in a small network, due to the physical structure of the switch and the limited number of network layers. In these cases, the effect of clocking the vertical links twice as fast is less pronounced but still advantageous in terms of latency. As the network becomes larger, consisting of more chip layers, the raw latency curves begin to separate, with the faster clocked network proving less latent than the uniformly clocked network by 9.5 and 10 percent for injection rates of 0.1 and 0.2 respectively. It is clear that the 2-D network performs considerably worse than the 3-D variants and that its latency grows exponentially as the number of nodes in the network grows. The throughput per node per cycle for both 3-D cases remains constant, matching the injection rate. In the 2-D case, however, the throughput declines above 225 nodes. It is clear that the faster clocked network performs best in all cases in these simulation results.

In the next simulation set, the network size, horizontal clock speed and injection rate are fixed, whilst the speed of the vertical links is increased by integer multiples. This simulation shows the effect of higher clock rates on a large network. Figure 5 plots the average raw latency against increasing clock speeds for the vertical communication links. It can be seen from this figure that there is a distinct improvement in latency when the clock rates are increased for a large network under URT. We see an improvement of roughly 70% in latency by increasing the clock rate four times. This demonstrates a significant saving in packet delivery time for a large chip stack. Eventually there will be a limit to the savings provided by clocking the vertical links faster, as the speed of the horizontal communication grid will eventually become a bottleneck for the network. This can be seen by the levelling off of latency with higher clock speeds in the curve.

Fig. 5. Average raw latency (cycles) against the ratio of vertical to horizontal clock speed (1 to 4) for a 10×10×10 network with an injection rate of 0.3

VII. DISCUSSION AND CONCLUSIONS

This investigation provides a first-order analysis of the effect of utilizing faster vertical links on the overall performance of a bufferless 3-D NoC mesh. The physical analysis of the switches, interconnect and TSVs provides realistic boundary conditions for the system-level simulations, demonstrating that signaling speeds through the vertical network links can be up to eight times the speed of the horizontal communication wires. The ability to send packets vertically at a faster rate was investigated by modifying a cycle-accurate RTL simulator to allow clocking of vertical packets at integer multiples of the horizontal clock. A URT traffic pattern was employed in a large-scale 3-D NoC, and the results show that packets arrive significantly faster than in a NoC with a uniform clock. The savings in average raw latency were quantified in the results.

Further performance improvements, reserved for future work, are expected when the switch architecture and routing algorithm are modified to favor vertical routing in times of deflection or heavy traffic at the desired destination port. The ability to route packets vertically with a lower latency penalty can significantly improve the communication performance of multi-layer chip stacks, whether in a bus configuration or a NoC. Additionally, the faster clocking of TSVs can be used to reduce the area footprint of a TSV bundle by providing fast serialization and de-serialization hardware to lessen the impact of the relatively large TSV structures. The investigations in this study reveal the performance benefits of exploiting the low-parasitic electrical properties of through silicon vias in a 3-D Network-on-Chip and open up further avenues of research into high-speed, massively integrated 3-D ICs.

VIII. ACKNOWLEDGMENTS

We would like to acknowledge the contribution of Dave Shippen to the work carried out in the Department, whose untimely passing is a tragic loss to everybody who knew him. Funding support from the European Commission under grant FP7-ICT-215030 (ELITE) of the 7th Framework Programme is gratefully acknowledged.

REFERENCES

[1] H. B. Bakoglu and J. D. Meindl. Optimal interconnection circuits for VLSI. IEEE Transactions on Electron Devices, 32(5):903-909, 1985.
[2] K. Banerjee, S. Souri, P. Kapur, and K. Saraswat. 3-D ICs: A novel chip design for improving deep-submicrometer interconnect performance and systems-on-chip integration. Proc. IEEE, 89(5):602-633, May 2001.
[3] W. J. Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 39(6):775-785, 1990.
[4] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Proc. DAC, pages 684-689, 2001.
[5] W. J. Dally. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2004.
[6] B. Feero and P. P. Pande. Networks-on-Chip in a three-dimensional environment: A performance evaluation. IEEE Transactions on Computers, 6, 2008.
[7] M. Grange, R. Weerasekera, and D. Pamunuwa. Examination of delay and signal integrity metrics in through silicon vias. In Workshop Notes, Design Automation and Test in Europe Conference (DATE), 2009.

[8] D. M. Jang, C. Ryu, K. Y. Lee, B. H. Cho, J. Kim, T. S. Oh, W. J. Lee, and J. Yu. Development and evaluation of 3-D SiP with vertically interconnected through silicon vias (TSV). In Proceedings of the 57th Electronic Components and Technology Conference, pages 847-852, 2007.
[9] J. Kim et al. A novel dimensionally-decomposed router for on-chip communication in 3D architectures. 2007.
[10] J. U. Knickerbocker, P. S. Andry, B. Dang, R. R. Horton, M. J. Interrante, C. S. Patel, R. J. Polastre, K. Sakuma, R. Sirdeshmukh, E. J. Sprogis, et al. Three-dimensional silicon integration. IBM Journal of Research and Development, 52(6):553-567, 2008.
[11] F. Li et al. Design and management of 3D chip multiprocessors using network-in-memory. ACM SIGARCH Computer Architecture News, 34(2):130-141, 2006.
[12] I. Loi, F. Angiolini, and L. Benini. Supporting vertical links for 3D networks-on-chip: Toward an automated design and analysis flow. In Proceedings of the Nano-Net Conference, 2007.
[13] Z. Lu and A. Jantsch. Admitting and ejecting flits in wormhole-switched networks on chip. IET Computers and Digital Techniques, 1(5):546-556, Sept. 2007.
[14] C. Mineo, R. Jenkal, S. Melamed, and W. R. Davis. Inter-die signaling in three dimensional integrated circuits.
[15] E. Nilsson. Design and Implementation of a Hot-Potato Switch in a Network on Chip. Master's thesis, Department of Microelectronics and Information Technology, Royal Institute of Technology, 2002.
[16] D. Pamunuwa, J. Oberg, L.-R. Zheng, M. Millberg, A. Jantsch, and H. Tenhunen. A study on the implementation of 2-D mesh-based networks-on-chip in the nanometre regime. Integration, the VLSI Journal, 38(1):3-17, 2004.
[17] D. Pamunuwa, L.-R. Zheng, and H. Tenhunen. Maximizing throughput over parallel wire structures in the deep submicrometer regime. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 11(2):224-243, 2003.
[18] D. Park, S. Eachempati, R. Das, A. K. Mishra, Y. Xie, N. Vijaykrishnan, and C. R. Das. MIRA: A multi-layered on-chip interconnect router architecture. 2008.
[19] V. F. Pavlidis and E. G. Friedman. 3-D topologies for networks-on-chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 15(10):1081, 2007.
[20] K. Snoeckx, E. Beyne, and B. Swinnen. Copper-nail TSV technology for 3D-stacked IC integration. Solid State Technology, 50(5):53, 2007.
[21] D. Sylvester and K. Keutzer. Getting to the bottom of deep submicron. In Proc. ICCAD, pages 203-211, 1998.
[22] A. W. Topol et al. Three-dimensional integrated circuits. IBM Journal of Research and Development, 50(4):491-506, 2006.
[23] S. Vangal et al. An 80-tile 1.28 TFLOPS Network-on-Chip in 65nm CMOS. In IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, pages 98-589, 2007.
[24] R. Weerasekera. System Interconnection Design Trade-offs in Three-Dimensional Integrated Circuits. PhD thesis, The Royal Institute of Technology (KTH), Stockholm, Sweden, 2008.
[25] R. Weerasekera, D. Pamunuwa, M. Grange, H. Tenhunen, and L.-R. Zheng. Closed-form equations for Through-Silicon Via (TSV) parasitics in 3-D integrated circuits (ICs). In Workshop Notes, Design Automation and Test in Europe Conference (DATE), 2009.
[26] R. Weerasekera, L. Zheng, D. Pamunuwa, and H. Tenhunen. Extending systems-on-chip to the third dimension: performance, cost and technological tradeoffs. In Proc. IEEE/ACM Int. Conf. on Computer-Aided Design (ICCAD), pages 212-219, 2007.
[27] A. Y. Weldezion, M. Grange, D. Pamunuwa, Z. Lu, A. Jantsch, R. Weerasekera, and H. Tenhunen. Scalability of the Network-on-Chip communication architecture for 3-D meshes. In International Symposium on Networks-on-Chip (NoCS), 2009.
[28] A. Y. Weldezion, R. Weerasekera, D. Pamunuwa, L.-R. Zheng, and H. Tenhunen. Bandwidth optimization for through silicon via (TSV) bundles in 3D integrated circuits. In Workshop Notes, Design Automation and Test in Europe Conference (DATE), 2009.
