2010 International Conference on Embedded System and Microprocessors (ICESM 2010)

Exploring DRAM Last Level Cache for 3D Network-on-Chip Architecture
Thomas Canhao Xu, Pasi Liljeberg, Hannu Tenhunen
Department of Information Technology, University of Turku, 20014, Turku, Finland
Turku Center for Computer Science (TUCS), Joukahaisenkatu 3-5 B, 20520, Turku, Finland
canxu, pasi.liljeberg, [email protected]

Abstract—In this paper, we implement and analyze different Network-on-Chip (NoC) designs with Static Random Access Memory (SRAM) and Dynamic Random Access Memory (DRAM) Last Level Caches (LLCs). Different 2D/3D NoCs with SRAM/DRAM are modeled based on state-of-the-art chips. The impact of integrating a DRAM cache into a NoC platform is discussed. We explore the advantages and disadvantages of DRAM caches for NoCs in terms of access latency, cache size, area and power consumption. We present benchmark results using a cycle-accurate full system simulator with realistic workloads. Experiments show that, under different workloads, the average cache hit latency of the two DRAM-based designs is increased by 12.53% (2D) and reduced by 27.97% (3D), respectively, compared with the SRAM design. It is also shown that power consumption is a tradeoff in improving the cache hit latency of a DRAM LLC. Overall, the power consumption of the 3D NoC design with DRAM LLC is reduced by 25.78% compared with the SRAM design. Our analysis and experimental results provide a guideline for designing efficient 3D NoCs with DRAM LLC.

Index Terms—Network-on-Chip; Chip Multiprocessor; 3D IC; DRAM; SRAM; NUCA

I. INTRODUCTION

The limitation of clock frequencies has led to the concept of the Chip Multiprocessor (CMP), which enables more than one core to be integrated on a single physical chip. AMD has announced its Opteron 6100-series¹, with twelve x86 cores on a chip in a two-die configuration, each die containing six cores and occupying 346mm² [1]. It is predictable that in the near future, more and more cores will be integrated on a chip. However, the current communication schemes in CMPs are mainly based on the shared bus architecture, which suffers from high delay and low scalability. Therefore, the NoC has been proposed as a promising approach to integrating a large number of components on a single chip by leveraging well-developed computer network concepts [2]. Intel demonstrated an 80-tile, 100M-transistor, 275mm² 2D NoC prototype in 65nm processing technology in 2007 [3]. An experimental CMP containing 48 x86 cores on a chip has been manufactured for research, using a network-based 4×6 2D mesh topology with 2 cores per tile [4].

¹AMD and Opteron are trademarks or registered trademarks of AMD or its subsidiaries. Other names and brands may be claimed as the property of others.

SRAM is widely used as the LLC for CMPs. SRAM uses bistable latching circuitry to store data bits, avoiding the need for periodic refresh. The circuit used to store one bit in SRAM typically contains six MOSFETs (Figure 1a).

Fig. 1: Comparison of a cell in SRAM and DRAM.

The size of an SRAM LLC is around 12MB in modern chips (e.g. AMD Opteron 6176 SE and Intel Core i7-980X). Unlike SRAM, DRAM stores each bit in a circuit of one MOSFET and one capacitor (Figure 1b). Hence the storage density is much higher than that of SRAM. It has been shown that, under the same processing technology, the density of DRAM is about 8 times higher than SRAM [5]. The main disadvantage of DRAM is that it requires periodic refresh to compensate for the charge leaking from the capacitors. Furthermore, the operating principle of DRAM is different. Due to these factors, the performance of DRAM is lower than that of SRAM. More details are given in Section III.

Overall system performance is influenced by cache capacity, since larger caches buffer more data and the cache miss rate is lower with a larger cache. However, larger caches usually bring higher access latency. It is therefore important to find a balance between cache size, die size, power consumption and performance, since latency is traded off against a larger cache size.

There has been research on replacing on-chip SRAM caches with DRAM caches. Intel has demonstrated a dual-core CMP with stacked DRAM cache [5]. Li Zhao et al. explored DRAM caches for an eight-core CMP server platform [6]. Gabriel H. Loh proposed several 3D stacked memory architectures for multicore processors [7]. However, these studies focused on traditional on-chip interconnects instead of on-chip networks. Implementing a cache in a 3D fashion to reduce total wire length is shown in [8], with a 21.5% latency reduction.


Taeho Kgil et al. have shown a CMP with multiple DRAM dies specially designed for Tier 1 server applications [9]. It is found that, for a similar logic die area, a 12-CPU system with 3D stacking and no L2 cache outperforms an 8-CPU system with a large on-chip L2 cache by about 14% while consuming 55% less power [9]. These studies also focus on DRAM caches for conventional CMPs. Magnetic RAM is proposed and compared with SRAM in [10]; however, the evaluation is based on 2D NoC architectures and a comparison with DRAM is missing. Niti Madan et al. proposed a 3D stacked reconfigurable heterogeneous cache hierarchy for NoC architectures [11], but they did not evaluate DRAM and SRAM individually. In this paper, in contrast, we investigate the architecture of a 3D NoC with DRAM LLC. By replacing the 2D SRAM LLC using 3D IC technology, overall system performance is expected to improve due to the reduced cache miss rate and wire length. We model a 16-core NoC with both an SRAM LLC and a DRAM LLC, analyze the advantages and disadvantages of the different implementations, and present the performance of these systems using a full system simulator. To the best of our knowledge, this is the first paper that analyzes a DRAM LLC architecture for 2D/3D NoCs.

II. MODELING OF THE NOCS WITH SRAM/DRAM

NoC brings network communication methodologies into on-chip communication. Figure 2 shows a CMP with a 4×4 mesh topology. The underlying network is comprised of network links and routers (R), each of which is connected to a processing element (PE) via a network interface (NI). Each PE is a core in the CMP. The basic architectural unit of a NoC is the tile/node (N), which consists of a router, its attached NI and PE, and the corresponding links. Communication among PEs is achieved via the transmission of packets through the network.

Fig. 2: An example of a 4×4 NoC using mesh topology.

In this paper, a baseline 2D NoC with SRAM LLC is modeled for comparison with the DRAM LLC based NoCs. Both NoCs have 16 PEs. The floorplans of modern multi-core chips such as the third-generation Sun SPARC [12], IBM Power 7 [13] and AMD Istanbul [1] guide the design choices for the PE. The total area of the Sun SPARC chip is 396mm² in 65nm fabrication technology; scaled to 32nm technology, each core has an area of 3.4mm². We simulate the characteristics of a 16MB, 16-bank, 64-byte line, 16-way associative, 32nm SRAM cache with CACTI [14]. Results show that the total area of the cache banks is 45.5mm²; each cache bank, including data and tag, occupies 2.84mm². A 5-port router scaled to 32nm is estimated, by our calculation, to be 0.054mm². The number of transistors required for a memory controller is quite small compared with the billions of transistors of a whole chip. It has been reported that a DDR2 memory controller requires about 13,700 gates in an application-specific integrated circuit (ASIC) implementation and 920 slices on a Xilinx Virtex-5 field-programmable gate array (FPGA) [15]. We estimate that the total area of the baseline 2D NoC with SRAM LLC is around (3.4 + 2.84 + 0.054)×16 = 100.704mm² (as shown in Figure 2, each PE has a core with a private L1 cache and a bank of the shared L2 cache).

By replacing the SRAM with DRAM directly, we explore a 2D NoC with DRAM LLC. We simulate the characteristics of a 128MB, 16-bank, 64-byte line, 16-way associative, 32nm DRAM cache with CACTI [14]. Results show that the total area of the cache banks is 52.8mm². Therefore the total area of the 2D NoC with DRAM LLC is around (3.4 + 3.3 + 0.054)×16 = 108.064mm², slightly larger than the SRAM implementation.

The interconnects of a traditional 2D NoC result in long global wires, causing high delay, high power consumption and low performance [16]. Besides, 2D NoCs have larger die sizes in multiprocessor implementations. To solve this problem, 3D integration, the technique of stacking multiple dies vertically, is introduced. Layers with different functions, e.g. a processor layer, cache layer, controller layer and memory layer, can be implemented in a 3D NoC. Since the processors consume the overwhelming majority of the power in a chip, stacking multiple processor layers is expected to be unwise for heat dissipation. According to [17], heat dissipation is a major problem when stacking multiple processor layers, even if the processors are interlaced vertically. Without direct contact with the heatsink, the peak chip temperature of the 3D design rises by 29°C compared with the 2D design, which is unfeasible for some applications [17]. However, by stacking additional cache layers instead of processor layers, the thermal constraint is expected to be alleviated. Gian Luca Loi et al. show that, even for 18 stacked layers (1 of processors, 1 of cache and 16 of memory), the maximum temperature of a 3D chip increases by only 10°C compared with a 2D chip [18]. It is estimated that a 15% lower core frequency in the 3D chip could compensate for this thermal drawback [18].

On account of the aforementioned analysis, we use a 3D NoC model with one processor layer and one DRAM LLC layer, as shown in Figure 3. In consideration of heat dissipation, the processor layer should be on top of the chip (near the heatsink). The processor layer is a 4×4 mesh of Sun SPARC cores with private L1 caches. We adopt a 7-port router for our 3D NoC model. It is noteworthy that routers are quite small compared with processors and cache banks; scaled to 32nm, a 7-port 3D router is estimated, by our calculation, to be only 0.096mm². Furthermore, not all routers in a 3D NoC require seven ports, e.g. the router of P4 in Figure 3 has only East, North, Local PE and Down ports. It is estimated that the total area of the processor layer is around (3.4 + 0.096)×16 = 55.936mm². The DRAM LLC layer has a 4×4 mesh of cache banks, with a total area of around (3.3 + 0.096)×16 = 54.336mm².


As we expected, the die size of the 3D NoC is much smaller: just 55.54% of the 2D NoC with SRAM LLC and 51.76% of the 2D NoC with DRAM LLC, respectively.
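As a quick sanity check on the arithmetic above, the short script below reproduces the die-area estimates from the per-component figures quoted in this section (core 3.4mm², SRAM bank 2.84mm², DRAM bank 3.3mm², 5-port router 0.054mm², 7-port router 0.096mm²); it is only a restatement of the numbers in the text, not an independent area model.

```python
# Sketch of the die-area estimates from Section II (all values in mm^2,
# taken from the CACTI-based figures quoted in the text, scaled to 32 nm).
CORE         = 3.4    # Sun SPARC core with private L1, scaled to 32 nm
SRAM_BANK    = 2.84   # 1 MB SRAM LLC bank (data + tag)
DRAM_BANK    = 3.3    # 8 MB DRAM LLC bank
ROUTER_5PORT = 0.054  # 2D NoC router
ROUTER_7PORT = 0.096  # 3D NoC router (two extra vertical ports)
TILES        = 16     # 4x4 mesh

area_2d_sram = (CORE + SRAM_BANK + ROUTER_5PORT) * TILES   # ~100.704
area_2d_dram = (CORE + DRAM_BANK + ROUTER_5PORT) * TILES   # ~108.064
area_3d_cpu  = (CORE + ROUTER_7PORT) * TILES               # ~55.936 (processor layer)
area_3d_llc  = (DRAM_BANK + ROUTER_7PORT) * TILES          # ~54.336 (DRAM LLC layer)

print(f"2D SRAM LLC NoC : {area_2d_sram:.3f} mm^2")
print(f"2D DRAM LLC NoC : {area_2d_dram:.3f} mm^2")
print(f"3D footprint    : {area_3d_cpu:.3f} mm^2 "
      f"({area_3d_cpu / area_2d_sram:.2%} of 2D SRAM, "
      f"{area_3d_cpu / area_2d_dram:.2%} of 2D DRAM)")
```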

Fig. 3: Schematic diagram of a 3D NoC with one processor layer (Px) and one cache layer (Cx); the layers are fully connected by Through Silicon Vias (TSVs). The heatsink is attached to the processor layer.

III. ANALYSIS OF THE IMPACT OF ON-CHIP DRAM LLC ON NOC DESIGN

As mentioned above, there are advantages and disadvantages to integrating DRAM as the LLC. We analyze these in terms of access latency, cache size, area and power consumption.

A. Access latency

The access latency of the LLC plays a critical role in determining the overall system performance. The access latency of an SRAM LLC is composed of the following factors:
• H-tree input
• Decoder and word-line
• Bit-line
• Sense amplifier
• Comparator
• H-tree output
For the aforementioned SRAM, we calculate the access latency to be 5.403ns, which is around 10 cycles at a 2GHz clock frequency.

Compared with an SRAM LLC, a DRAM LLC suffers from longer access latency due to refreshing and precharging, and therefore might significantly degrade system performance. DRAM is organized into a grid of single-transistor bit-cells, and the grid is divided into rows and columns. At a higher level, a DRAM bank consists of the grid and the accompanying logic. When accessing data in a DRAM, tRCD (the number of clock cycles needed between a row address strobe and a column address strobe), tCAS (the number of clock cycles needed to access a column) and tRP (the number of clock cycles needed to precharge a row) are the major factors. It is expected that the access latency of a DRAM LLC will be much higher than that of an SRAM LLC. We calculated the latency of a fast 200MHz synchronous DRAM with tRCD-tCAS-tRP of 2-2-2; the times required for tRCD, tCAS and tRP are 10ns, 10ns and 10ns, respectively. Taking the H-tree and decoder delays into consideration and assuming a 2GHz processor, we measure the access latency of the DRAM LLC to be 60 cycles. It is noteworthy that DRAMs with higher frequencies are available nowadays, although their tRCD, tCAS and tRP values are usually higher. We also note that there are other latencies in a NoC, e.g. router latency, link latency, cache response latency and tag latency; the overall system latency is not determined by the LLC access latency alone.

B. Cache size

When a core in the NoC needs to read or write data in the memory, it first checks whether the data is in the cache (L1, L2, etc.). If the data is found in the cache, a cache hit has occurred. Otherwise, it is called a cache miss, and the data usually has to be transferred from the memory via a memory controller, which is much slower due to the latency of the memory subsystem. The cache hit rate is the proportion of accesses that result in a cache hit. Obviously, a higher cache hit rate means higher system performance, and the cache hit rate can be improved by increasing the size of the cache. As analyzed in Section II, based on our simulation, the density of DRAM is about 8 times higher than that of SRAM (52.8mm² for 128MB of DRAM, compared with 45.5mm² for 16MB of SRAM). Another important point is that, despite the fact that the access latency of a single operation is higher in DRAM than in SRAM, the overall cache-memory latency is anticipated to be lower. This is because the smaller SRAM LLC tends to have a higher miss rate, which induces more frequent main memory accesses, compared with the DRAM LLC. Accessing data from off-chip DRAM memory is far less efficient in terms of performance and power consumption.
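To make the trade-off between hit time and capacity concrete, the sketch below converts the DRAM timing above into cycles and compares the average per-access L2-to-memory latency of the two designs using a simple hit-time-plus-miss-penalty model. The SRAM/DRAM hit times and the 260-cycle memory latency are the values used in this paper; the miss rates are purely illustrative assumptions, not measured results.

```python
# Back-of-the-envelope latency sketch. DRAM timing (tRCD = tCAS = tRP = 10 ns),
# the 2 GHz clock, the 10-cycle SRAM hit time, the 60-cycle DRAM LLC hit time
# and the 260-cycle main memory latency are the figures used in this paper.
# The miss rates below are ILLUSTRATIVE assumptions only.
CLOCK_GHZ = 2.0
tRCD_NS, tCAS_NS, tRP_NS = 10, 10, 10

array_cycles = (tRCD_NS + tCAS_NS + tRP_NS) * CLOCK_GHZ
print(f"Raw DRAM array timing: {array_cycles:.0f} cycles at {CLOCK_GHZ} GHz")

def avg_l2_access_latency(hit_cycles: float, miss_rate: float,
                          memory_cycles: float = 260.0) -> float:
    """Average latency of an L2 access: hit time + miss rate x miss penalty."""
    return hit_cycles + miss_rate * memory_cycles

# Hypothetical miss rates: the 16 MB SRAM LLC is assumed to miss far more
# often than the 128 MB DRAM LLC on the same working set.
sram = avg_l2_access_latency(hit_cycles=10, miss_rate=0.25)
dram = avg_l2_access_latency(hit_cycles=60, miss_rate=0.025)
print(f"SRAM LLC, 25.0% assumed miss rate: {sram:.1f} cycles per access")
print(f"DRAM LLC,  2.5% assumed miss rate: {dram:.1f} cycles per access")
```

Under these assumed miss rates the slower DRAM LLC still yields a lower average latency per L2 access, which is the effect argued qualitatively above; with a sufficiently low SRAM miss rate the comparison would of course reverse.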

C. Area

Li Zhao et al. have raised concerns about the area wasted by a DRAM cache due to its increased tag size [6]. It is claimed that the tag space overhead is a key consideration in implementing a DRAM cache. We argue that the tag space required for a large DRAM cache is not necessarily an important issue, since the area required for the tag array is much smaller than that of the data array. To support our standpoint, we simulated a 16MB, 16-bank, 16-way associative, 32nm cache with different cache line sizes using CACTI [14].

TABLE I: Comparison of cache area with different line sizes

Cache line size | Data array | Tag array | Total size
8B   | 33.63mm²   | 11.88mm² | 45.51mm²
16B  | 37.45mm²   | 5.90mm²  | 43.35mm²
32B  | 53.56mm²   | 3.13mm²  | 56.69mm²
64B  | 115.99mm²  | 1.74mm²  | 117.73mm²
128B | 335.40mm²  | 0.96mm²  | 336.36mm²
256B | 1224.42mm² | 0.61mm²  | 1225.03mm²

Results in Table I show that the area of the data array increases with increasing cache line size, while the area of the tag array decreases.


However, although the area of the tag array becomes smaller with larger cache line sizes, the area of the data array increases much more rapidly. Therefore, pursuing a minimal tag array area is not a worthwhile trade-off. Most modern systems use a cache line size of 8 to 64 bytes [19], [20].
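The trend in Table I can also be reproduced with a first-order estimate of the tag storage alone. The sketch below assumes a 40-bit physical address and two status bits per line; these are illustrative choices, and CACTI's areas additionally include peripheral circuitry, so only the trend, not the absolute values, should be compared with Table I.

```python
# First-order tag-storage estimate for a 16 MB, 16-way set-associative cache.
# The 40-bit physical address and 2 status bits per line are assumptions made
# for illustration only.
import math

CACHE_BYTES = 16 * 1024 * 1024
WAYS = 16
ADDR_BITS = 40      # assumed physical address width
STATUS_BITS = 2     # e.g. valid + dirty (assumption)

for line in (8, 16, 32, 64, 128, 256):
    lines = CACHE_BYTES // line
    sets = lines // WAYS
    offset_bits = int(math.log2(line))
    index_bits = int(math.log2(sets))
    tag_bits = ADDR_BITS - index_bits - offset_bits
    tag_kib = lines * (tag_bits + STATUS_BITS) / 8 / 1024
    print(f"{line:4d} B line: tag array ~{tag_kib:8.1f} KiB "
          f"({tag_kib / (CACHE_BYTES / 1024):.1%} of the data array)")
```

Doubling the line size halves the number of tag entries while the tag width stays almost constant, which is why the tag array shrinks roughly by half per step in Table I while the data array grows.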

D. Power consumption

The power consumption of DRAM consists of several parts: standby power, activate power and read/write power. Since the operating principle of DRAM is different from that of SRAM, e.g. precharge and refresh are not required in SRAM, DRAM might be more power hungry than SRAM. However, storing one bit in SRAM requires six MOSFETs, while in DRAM only one MOSFET plus one capacitor are needed, hence the per-bit power consumption of DRAM is much lower. Furthermore, the standby power of an SRAM is mainly leakage, which means the power is consumed continuously. Leakage power is expected to rise to 50% of total chip power for nanoscale chips [21]. We also note that leakage power grows significantly with temperature, while the temperature of a 3D NoC can be quite high. In order to compare the power consumption of SRAM and DRAM, we simulate a 16MB SRAM with CACTI [14] and compare it with a 128MB DRAM evaluated with the Micron System-Power Calculator [22]. We assume a 2GHz operating frequency with Vdd = 0.9V.

TABLE II: Comparison of SRAM and DRAM power consumption for different operations

Operation | 16MB/32nm SRAM | 128MB/32nm DRAM
Standby   | 4.3W           | 1.9W (1.5W + 0.4W)
Read      | 4.1W           | 4.7W
Write     | 4.1W           | 4.2W

Table II shows that the standby power of DRAM is 55.8% lower than that of SRAM. This is because the typical leakage of the SRAM can be very high at the 32nm technology node. It is noteworthy that the standby power of DRAM consists of two parts: the leakage power (1.5W) and the power required for the refresh operation (0.4W). The write power of DRAM is 2.4% higher than SRAM, and the read power is 14.6% higher. We emphasize that the overall power consumption of the on-chip DRAM LLC is not necessarily higher than that of SRAM, since the resistance-capacitance of on-chip TSV connections is much lower than that of off-chip interconnects, and the overall power consumption depends on the access pattern of the applications. Table III shows the L2 transaction distributions for selected workloads from SPLASH-2 [23] and PARSEC [24]. The results reveal that swaptions has the most read transactions (75.77%) of the four workloads, while x264 shows the most write transactions (33.75%). On average, the four workloads show 69.63% read transactions and 30.37% write transactions, respectively. On the basis of these read/write transactions, we deduce that the DRAM itself should be more power hungry than SRAM. However, it is expected that the power consumption of routers and links will be reduced with a DRAM LLC, due to the increased cache hit rate and the decreased total amount of router/link activity. The execution time of an application will be reduced as well. Hence the total power consumption of the whole chip with DRAM LLC running the same application should be lower compared with SRAM LLC.

TABLE III: Comparison of L2 transaction distributions

Transaction    | FFT    | x264   | LU     | swaptions
Read           | 26.12% | 19.04% | 22.09% | 29.27%
Exclusive Read | 43.80% | 47.21% | 44.50% | 46.50%
Clean Write    | 18.83% | 22.01% | 10.12% | 11.27%
Dirty Write    | 11.25% | 11.74% | 23.29% | 12.96%

IV. EXPERIMENTAL EVALUATION

In this section, we present the experimental evaluation under different LLC configurations. Applications are selected from SPLASH-2 and PARSEC. TPC-H [25] is used as a synthetic benchmark.

A. 3D NoC Router and Routing Algorithm

As shown in Figure 2, routers in 2D NoCs have five ports connecting to five directions, namely North, East, West, South and the Local PE. For the vertical communication between different layers, routers in our 3D NoC model have two more ports, with the corresponding virtual channels, buffers and crossbars, to connect to the Up and Down pillars (Figure 3). Adaptive routing is widely used in off-chip networks; however, deterministic routing is favorable for on-chip networks because its implementation is simpler. In this paper, a deterministic routing algorithm based on dimension-ordered routing (DOR) [26] is selected and modified to fit the 3D topology. When a node Nsource sends a flit to a node Ndestination, the flit first travels along the X direction in the source layer until Flitx = Pillarx, and is then routed in the Y direction. Once the flit reaches the pillar, it is routed vertically to the layer of the destination node. X-Y deterministic routing is used when the flit reaches the destination layer, in which a flit is first routed in the X direction and then in the Y direction.
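A minimal sketch of the modified DOR scheme described above is given below. The (x, y, layer) coordinate convention and the pillar-selection policy, here simply the TSV column at the destination's (x, y) position unless another pillar is specified, are our assumptions; the paper does not spell out how a pillar is chosen.

```python
# Sketch of the DOR-based deterministic routing for the 3D mesh: a flit is
# routed X-then-Y toward a vertical pillar, traverses the TSV pillar to the
# destination layer, and finally follows X-Y routing on that layer.
# Coordinate convention and pillar selection are assumptions (see text).
from typing import Optional, Tuple

Coord = Tuple[int, int, int]  # (x, y, layer)

def next_hop(cur: Coord, dst: Coord,
             pillar_xy: Optional[Tuple[int, int]] = None) -> str:
    """Return the output port for one hop: E/W/N/S/U/D or LOCAL (eject)."""
    x, y, z = cur
    dx, dy, dz = dst
    px, py = pillar_xy if pillar_xy is not None else (dx, dy)
    if (x, y, z) == (dx, dy, dz):
        return "LOCAL"
    if z != dz:
        if x != px:                       # phase 1: X toward the pillar
            return "E" if px > x else "W"
        if y != py:                       # phase 1: Y toward the pillar
            return "N" if py > y else "S"
        return "U" if dz > z else "D"     # phase 2: vertical hop on the pillar
    if x != dx:                           # phase 3: X-Y on the destination layer
        return "E" if dx > x else "W"
    return "N" if dy > y else "S"

# Example: a processor-layer node at (1, 1, 0) sends to a cache bank at (3, 2, 1).
STEP = {"E": (1, 0, 0), "W": (-1, 0, 0), "N": (0, 1, 0),
        "S": (0, -1, 0), "U": (0, 0, 1), "D": (0, 0, -1)}
cur, dst, path = (1, 1, 0), (3, 2, 1), []
while True:
    port = next_hop(cur, dst)
    path.append(port)
    if port == "LOCAL":
        break
    cur = tuple(c + s for c, s in zip(cur, STEP[port]))
print(path)   # ['E', 'E', 'N', 'U', 'LOCAL']
```

With the default pillar choice the final X-Y phase on the destination layer is already satisfied when the flit leaves the TSV; supplying a different pillar_xy exercises the full three-phase route.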

B. Experiment Setup

The simulation platform is based on a cycle-accurate 3D NoC simulator which produces detailed evaluation results. The platform models the routers, horizontal links and vertical pillars accurately. The state-of-the-art router in our platform includes a routing computation unit, a virtual channel allocator, a switch allocator, a crossbar switch and four input buffers. A deterministic routing algorithm is used to avoid deadlocks. We use a 16-node network which models a single-chip CMP for our 2D NoCs (SRAM LLC and DRAM LLC). A full system simulation environment with 16 nodes, each with a core and its related caches, has been implemented. The 3D architecture in this paper has one layer of processors and one layer of DRAM LLC (Figure 3); we implement a 32-node network, with 16 nodes in each layer. The simulations run the Solaris 9 operating system on an in-order issue SPARC instruction set architecture.

Each processor core is attached to a wormhole router and has a private write-back L1 cache. The L2 cache, shared by all processors, is split into banks. The size of each SRAM LLC bank is 1MB, hence the total size of the shared SRAM LLC is 16MB; the size of each DRAM LLC bank is 8MB, hence the total size of the shared DRAM LLC is 128MB. The simulated memory/cache architecture mimics SNUCA [27]. A two-level distributed directory cache coherence protocol called MOESI, based on MESI [28], has been implemented in our memory hierarchy, in which each L2 bank has its own directory. The protocol has five types of cache line status: Modified (M), Owned (O), Exclusive (E), Shared (S) and Invalid (I). Orion [29], a power simulator for interconnection networks, is used to evaluate detailed power characteristics. A wormhole router is modeled in Orion, with the corresponding input/output ports, buffers and crossbar, and the power consumption of the routers is analyzed. We use the Simics [30] full system simulator as our simulation platform. The detailed processor, cache and memory configurations can be found in Table IV. The workloads used in this paper are shown in Table V. We do not specially select workloads with high cache miss rates; instead, these workloads consist of common scientific kernels, a video processing program and a decision support application. Information on their cache miss rates can be found in [23], [31] and [25].

TABLE IV: System configuration parameters

Processor configuration
  Instruction set architecture: SPARC
  Number of processors: 16
Cache configuration
  L1 cache: Private, split instruction and data cache, each cache is 16KB, 4-way associative, 64-byte line, 3-cycle access time
  SRAM LLC: Shared, unified 16MB (16 banks, each 1MB), 64-byte line, 10-cycle access time
  DRAM LLC: Shared, unified 128MB (16 banks, each 8MB), 64-byte line, 60-cycle access time
  Cache coherence protocol: MOESI
  Cache hierarchy: SNUCA
Memory configuration
  Size: 4GB DRAM
  Access latency: 260 cycles
  Requests per processor: 16 outstanding
Network configuration
  Router scheme: Wormhole
  Flit size: 128 bits

TABLE V: Benchmark descriptions

SPLASH-2: Stanford's benchmark suite of parallel programs. FFT, FMM, LU and Water-Nsq are used.
PARSEC: Princeton's benchmark suite for shared-memory computers based on CMPs. Swaptions, a workload that employs Monte Carlo simulation to compute prices, and x264, an application that encodes H.264 video, are used.
TPC-H: TPC's ad-hoc decision support benchmark that examines large amounts of data, executes queries and gives answers to critical business questions. MySQL v5.0.67 with a 1GB reference database is used, running query 1.

C. Result Analysis

The normalized full system simulation results are shown in Figures 4 and 5. Figure 4 shows that, on average, the cache hit latency increases by 12.53% when the SRAM LLC is replaced with a DRAM LLC in the 2D NoC platform. Applications with more cache accesses, e.g. FMM, show a much higher cache hit latency (44.93% higher) compared with the original SRAM design. Although the access latency of the DRAM is much higher than that of the SRAM, the overall cache hit latency does not degrade as much, since the NoC has other latencies. By implementing the DRAM cache in a 3D fashion, the average cache hit latency is reduced by 27.97% compared with the SRAM LLC. This is primarily due to the shorter wire lengths in the 3D design compared with its 2D counterpart [32].

Fig. 4: Normalized average LLC hit latency.

In comparison with the 2D NoC with SRAM LLC (SRAM), as shown in Figure 5, both NoCs with DRAM LLC achieve power savings. We calculate the overall power consumption from the cache read/write transactions, main memory access transactions and router/link activities; these data are gathered from the application trace files produced by our simulator. The average power consumption of the 2D NoC with DRAM LLC (DRAM-2D) is reduced by 40.14% compared with SRAM. The power savings are due to the significant reduction of memory accesses (reduced by 90.45%) and router/link activities (reduced by 7.60% and 8.91%, respectively) in DRAM-2D. However, in the 3D NoC with DRAM LLC (DRAM-3D), the power consumption is increased by 23.99% compared with DRAM-2D. This is because, although the reduction of memory accesses is the same as in DRAM-2D, the increased number of routers and links and the increased complexity of the routers make the power consumed by routers and links much higher than in the 2D NoCs. The power consumption of DRAM-3D is thus higher than that of DRAM-2D, albeit still 25.78% lower than SRAM on average. We note that, based on the aforementioned results, power consumption is traded off against cache hit latency when designing a NoC with DRAM LLC.

Fig. 5: Normalized total power consumption.
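The overall power numbers behind Figure 5 are obtained by weighting event counts from the simulation traces; the sketch below illustrates that bookkeeping. The per-event energy values are placeholders rather than the CACTI, Micron calculator and Orion figures used in the paper; only the 1.9W static term reuses the DRAM standby power from Table II.

```python
# Sketch of the power bookkeeping described above: total energy is accumulated
# from LLC read/write transactions, main-memory accesses and router/link
# activities taken from the simulator trace, then divided by execution time.
# All per-event energies below are PLACEHOLDER values for illustration.
from dataclasses import dataclass

@dataclass
class TraceCounts:
    llc_reads: int
    llc_writes: int
    memory_accesses: int
    router_traversals: int
    link_traversals: int
    exec_time_s: float

ENERGY_NJ = {          # hypothetical per-event energies in nanojoules
    "llc_read": 1.0,
    "llc_write": 1.1,
    "memory_access": 35.0,
    "router": 0.25,
    "link": 0.10,
}

def average_power_w(t: TraceCounts, standby_w: float) -> float:
    dynamic_nj = (t.llc_reads * ENERGY_NJ["llc_read"]
                  + t.llc_writes * ENERGY_NJ["llc_write"]
                  + t.memory_accesses * ENERGY_NJ["memory_access"]
                  + t.router_traversals * ENERGY_NJ["router"]
                  + t.link_traversals * ENERGY_NJ["link"])
    return standby_w + dynamic_nj * 1e-9 / t.exec_time_s

# Example with made-up counts for one workload; 1.9 W is the DRAM LLC standby
# power from Table II, used here as the static term.
trace = TraceCounts(llc_reads=4_000_000, llc_writes=1_700_000,
                    memory_accesses=120_000, router_traversals=9_000_000,
                    link_traversals=11_000_000, exec_time_s=0.02)
print(f"{average_power_w(trace, standby_w=1.9):.2f} W")
```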

V. CONCLUSION

The impact of a DRAM Last Level Cache on NoC design was studied in this paper. We studied 2D/3D NoC models based on realistic CMPs. Our analysis showed that, in a NoC, DRAM is a feasible replacement for an SRAM cache. The advantages and disadvantages of implementing a DRAM LLC in a NoC were investigated in terms of access latency, cache size, area and power consumption. A full system simulator running different workloads was used for the performance evaluation. Our experiments show that, compared with the SRAM design, the average cache hit latencies of the 2D DRAM LLC and 3D DRAM LLC designs were increased by 12.53% and reduced by 27.97%, respectively. In addition, our results show that with the 3D stacked design, the improvement in cache hit latency is traded off against higher power consumption. Overall, the results of this paper highlight the importance of balancing performance and power consumption when designing the DRAM LLC of a NoC.

ACKNOWLEDGMENT

This work is supported by the Academy of Finland. The authors would like to thank the anonymous reviewers for their feedback and suggestions.

REFERENCES

[1] AMD, "The AMD Opteron 6000 series platform," May 2010, http://www.amd.com/us/products/server/processors/6000-series-platform/pages/6000-series-platform.aspx.
[2] L. Benini and G. D. Micheli, "Networks on chips: A new SoC paradigm," IEEE Computer, vol. 35, no. 1, pp. 70–78, January 2002.
[3] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar, "An 80-tile 1.28TFLOPS network-on-chip in 65nm CMOS," in IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers, February 2007, pp. 98–589.
[4] Intel, "Single-chip cloud computer," May 2010, http://techresearch.intel.com/articles/Tera-Scale/1826.htm.
[5] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCaule, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb, "Die stacking (3D) microarchitecture," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, December 2006, pp. 469–479.
[6] L. Zhao, R. Iyer, R. Illikkal, and D. Newell, "Exploring DRAM cache architectures for CMP server platforms," in Proceedings of the 25th International Conference on Computer Design (ICCD), October 2007, pp. 55–62.
[7] G. H. Loh, "3D-stacked memory architectures for multi-core processors," in Proceedings of the 35th International Symposium on Computer Architecture (ISCA), June 2008, pp. 453–464.
[8] K. Puttaswamy and G. H. Loh, "Implementing caches in a 3D technology for high performance processors," in Proceedings of the 2005 International Conference on Computer Design (ICCD), 2005, pp. 525–532.
[9] T. Kgil, S. D'Souza, A. Saidi, N. Binkert, R. Dreslinski, T. Mudge, S. Reinhardt, and K. Flautner, "PicoServer: Using 3D stacking technology to enable a compact energy efficient chip multiprocessor," in Proceedings of the 2006 ASPLOS Conference, November 2006, pp. 117–128.
[10] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, "A novel architecture of the 3D stacked MRAM L2 cache for CMPs," in Proceedings of the 15th IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2009, pp. 239–249.
[11] N. Madan, L. Zhao, N. Muralimanohar, A. Udipi, R. Balasubramonian, R. Iyer, S. Makineni, and D. Newell, "Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy," in Proceedings of the 15th IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2009, pp. 262–274.
[12] M. Tremblay and S. Chaudhry, "A third-generation 65nm 16-core 32-thread plus 32-scout-thread CMT SPARC processor," in IEEE International Solid-State Circuits Conference (ISSCC), February 2008, pp. 82–83.
[13] IBM, "IBM Power 7 processor," in Hot Chips 2009, August 2009.
[14] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi, "CACTI 5.1," HP Labs, Tech. Rep. HPL-2008-20, 2008.
[15] HiTech Global, "DDR2 memory controller IP core for FPGA and ASIC," June 2010, http://www.hitechglobal.com/ipcores/ddr2controller.htm.
[16] D. Sylvester and K. Keutzer, "Getting to the bottom of deep submicron," in Proceedings of the 1998 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 1998, pp. 203–211.
[17] "A study of 3D network-on-chip design for data parallel H.264 coding," in Proceedings of the 27th Norchip Conference, November 2009.
[18] G. L. Loi, B. Agrawal, N. Srivastava, S.-C. Lin, T. Sherwood, and K. Banerjee, "A thermally-aware performance analysis of vertically integrated (3-D) processor-memory hierarchy," in Proceedings of the 43rd Annual Design Automation Conference (DAC), 2006, pp. 991–996.
[19] Intel, "Intel Core i7-980X processor extreme edition," May 2010, http://ark.intel.com/Product.aspx?id=47932.
[20] AMD, "Family 10h AMD Phenom processor product data sheet," November 2008, http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/44109.pdf.
[21] Semiconductor Industry Association, "The international technology roadmap for semiconductors (ITRS)," 2007, http://www.itrs.net/Links/2007ITRS/Home2007.htm.
[22] J. Janzen, "Calculating memory system power for DDR SDRAM," Micron Designline, vol. 10, no. 2, pp. 1–12, 2Q 2001.
[23] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: Characterization and methodological considerations," in Proceedings of the 22nd International Symposium on Computer Architecture, June 1995, pp. 24–36.
[24] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, October 2008, pp. 72–81.
[25] TPC, "TPC-H decision support benchmark," http://www.tpc.org/tpch/.
[26] H. Sullivan and T. R. Bashkow, "A large scale, homogeneous, fully distributed parallel machine," in Proceedings of the 4th Annual Symposium on Computer Architecture, March 1977, pp. 105–117.
[27] C. Kim, D. Burger, and S. W. Keckler, "An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches," in Proceedings of ASPLOS-X, October 2002, pp. 211–222.
[28] A. Patel and K. Ghose, "Energy-efficient MESI cache coherence with pro-active snoop filtering for multicore microprocessors," in Proceedings of the Thirteenth International Symposium on Low Power Electronics and Design, August 2008, pp. 247–252.
[29] H.-S. Wang, X. Zhu, L.-S. Peh, and S. Malik, "Orion: A power-performance simulator for interconnection networks," in Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture, November 2002, pp. 294–305.
[30] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A full system simulation platform," Computer, vol. 35, no. 2, pp. 50–58, February 2002.
[31] C. Bienia, S. Kumar, and K. Li, "PARSEC vs. SPLASH-2: A quantitative comparison of two multithreaded benchmark suites on chip-multiprocessors," in IEEE International Symposium on Workload Characterization, September 2008, pp. 47–56.
[32] T. Xu, P. Liljeberg, and H. Tenhunen, "An analysis of designing 2D/3D chip multiprocessors with different cache architectures," in NORCHIP 2010, November 2010.