3D Implemented SRAM/DRAM Hybrid Cache Architecture for High-Performance and Low Power Consumption

Koji Inoue¹, Shinya Hashiguchi², Shinya Ueno², Naoto Fukumoto², and Kazuaki Murakami¹

¹ Faculty of Information Science and Electrical Engineering, Kyushu University
² Graduate School of Information Science and Electrical Engineering, Kyushu University
744 Motooka, Nishi-ku, Fukuoka 819-0395, JAPAN
{inoue, s-hashiguchi, ueno, fukumoto, murakami}@soc.ait.kyushu-u.ac.jp

Abstract— This paper introduces our research on 3D-implemented microprocessors. 3D-IC is one of the most promising techniques for achieving high-performance, low-power VLSI systems. Stacking multiple dies makes it possible to implement microprocessor cores and large caches (or DRAM) in the same chip. Although this kind of integration has great potential to bring a breakthrough in computer systems, its efficiency strongly depends on the characteristics of the target application programs. Unfortunately, die-stacked implementations degrade performance for some programs. To tackle this issue, we introduce a novel cache architecture consisting of a small but fast SRAM and a large stacked DRAM. The cache adapts to the varying behavior of application programs in order to compensate for the negative impact of the die-stacking approach.

I. INTRODUCTION

Cache memories have long played an important role in bridging the performance gap between high-speed processors and slow off-chip main memory. Confining memory accesses on-chip also reduces the energy consumed by memory accesses, because external I/O pins are driven and off-chip DRAM devices are activated less frequently. These positive impacts make on-chip caches indispensable in state-of-the-art microprocessor designs. However, memory performance is still insufficient for a wide range of advanced computing systems even with relatively large caches, e.g., a 2 MB last-level cache. Emerging applications such as high-quality video processing and data mining require much larger working sets and higher memory bandwidth. In addition, integrating multiple processor cores into a single chip, or multi-core design, has become the de facto standard in recent processors, and increasing the number of cores makes the memory wall problem more serious.

One straightforward way to address this memory issue is to aggressively invest the transistor budget in on-chip caches. Three-dimensional integration, or 3D-IC, is one of the most promising approaches to satisfy this requirement. Stacking a high-density DRAM die on the processor cores makes it possible to implement a large last-level cache with high on-chip memory bandwidth [1][2][3].

Unlike embedded DRAM, 3D-stacked DRAM can be fabricated in its own process. The top and bottom dies, or layers, are connected by TSVs (Through-Silicon Vias) instead of bonding wires. This die-to-die direct connection enables low-latency, low-energy communication. Although 3D-IC has the potential to achieve high performance and low energy at the same time, it does not always work well. This comes from the fact that accesses to the DRAM die are still slow and energy consuming compared with accesses to a small, fast SRAM cache. Also, since TSVs have large load capacitance compared with short two-dimensional wires, activating a large number of TSVs, e.g., for a 256-byte transfer, consumes a large amount of energy.

This paper introduces an architectural technique for 3D microprocessors called the SRAM/DRAM hybrid cache architecture. The concept is to optimize the architectural parameters based on the characteristics of the target applications in order to effectively exploit the stacked hardware resources. The hybrid cache combines a small SRAM cache and a large DRAM cache. Based on the demanded cache capacity, the cache attempts to select an appropriate operation mode. This kind of adaptive optimization makes 3D microprocessors practical.

The organization of this paper is as follows. Section II analyzes the conventional 3D stacking implementation and shows the negative impacts of the conventional approach. Section III explains the details of the SRAM/DRAM hybrid cache. Section IV reports the evaluation results, and finally we conclude in Section V.

II. MOTIVATION

Stacking high-density DRAM as the last-level cache is one of the most promising approaches to alleviate the memory wall problem. Here, we assume a two-level on-chip memory hierarchy. Since DRAM usually has four to eight times higher density than SRAM, we can expect a large reduction in capacity misses. Black et al. [1] evaluated the performance impact of 3D DRAM caches.

[Figure 1: 2D / 3D cache organizations. (A) 2D SRAM L2 cache (base): a 2 MB L2 cache (SRAM) on the same die as the core(s) and L1(s). (B) 3D L2 DRAM cache: a 32 MB L2 cache (DRAM) stacked on a die holding the core(s), L1(s), and L2 tags (SRAM). (C) Proposed 3D hybrid cache: in the small-but-fast SRAM L2 cache mode the 32 MB DRAM is unused and the 2 MB SRAM serves as the L2 cache; in the slow-but-large DRAM L2 cache mode the stacked 32 MB DRAM serves as the L2 cache and the SRAM holds its tags.]

In conventional 2D designs, an L2 cache implemented in SRAM is placed on the same die as the processor cores, as shown in Figure 1 (A), whereas in the 3D design a DRAM die is stacked on the processor cores to store the cache lines and the SRAM is used for the tags, as depicted in Figure 1 (B). Generally, capacitor-based DRAM tends to be slow compared with SRAM devices. Although a large DRAM L2 has great potential to reduce cache miss rates, it makes the L2 access time longer. This means that we need to pay attention to the tradeoff between the reduction in cache miss rates and the increase in access latency: performance degrades if the negative impact of the longer latency outweighs the benefit of the cache miss reduction.

[Figure 2: Impact of L2 misses and hit time. Normalized memory performance plotted against the L2 miss reduction [%] and the L2 access time increase [cc], with the 2 MB SRAM L2 (2D) and 32 MB DRAM L2 (3D) operating points marked for Ocean and Cholesky.]

Figure 2 shows an analysis of memory performance. The z-axis is the inverse of the average memory access time, and all results are normalized to the baseline organization shown in Figure 1 (A). "L2 access time increase" is the access time overhead, in clock cycles, caused by replacing the 2 MB SRAM last-level cache with the stacked 32 MB DRAM. "L2 miss reduction" represents by how many points the L2 miss rate is reduced. Ocean and Cholesky are benchmark programs from Splash-2. From this figure we can see that exploiting the 3D DRAM cache improves performance dramatically for Ocean, while it worsens memory performance for Cholesky, if we assume that the increase in L2 access time is 50 clock cycles.

Figure 3 shows L2 miss rates for varying L2 cache sizes. Clearly, for Ocean we can eliminate a large number of L2 misses by increasing the cache size from 2 MB to 32 MB, a reduction of about 30 points; LU and FMM show a similar tendency. For Cholesky, on the other hand, only a small cache miss reduction can be achieved. If we can obtain a large enough reduction in L2 misses, as for Ocean, LU, and FMM in Figure 3, DRAM stacking works very well. Otherwise, we may lose performance, as for Cholesky and FFT. This means that the efficiency of DRAM stacking strongly depends on the characteristics of the target application programs.

[Figure 3: L2 cache size and miss rates. L2 miss rates [%] for L2 sizes from 2 MB to 128 MB for the Splash-2 programs LU, FFT, Ocean, Cholesky, FMM, Barnes, Raytrace, and Water-Spatial.]
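To make this tradeoff concrete, the following back-of-the-envelope sketch compares the average memory access time of the two organizations. The L2 and main-memory latencies are taken from the evaluation parameters in Figure 6 (6, 28, and 181 clock cycles); the miss rates are illustrative stand-ins for Ocean-like and Cholesky-like behavior, not measured Splash-2 values.

```python
# Back-of-the-envelope AMAT comparison: 2D SRAM L2 vs. 3D stacked-DRAM L2.
# Latencies follow the evaluation parameters (Figure 6); the miss rates
# below are illustrative stand-ins, not measured Splash-2 numbers.

def amat(l2_hit_cc, l2_miss_rate, mem_lat_cc):
    """Average memory access time at the L2 level, in clock cycles."""
    return l2_hit_cc + l2_miss_rate * mem_lat_cc

MEM_LAT  = 181   # main-memory latency [cc]
SRAM_HIT = 6     # 2 MB SRAM L2 hit time [cc]
DRAM_HIT = 28    # 32 MB stacked-DRAM L2 hit time [cc]

workloads = {
    # name: (L2 miss rate @ 2 MB, L2 miss rate @ 32 MB) -- illustrative
    "Ocean-like":    (0.45, 0.15),   # large miss reduction from capacity
    "Cholesky-like": (0.10, 0.08),   # almost no miss reduction
}

for name, (mr_2mb, mr_32mb) in workloads.items():
    t_sram = amat(SRAM_HIT, mr_2mb, MEM_LAT)
    t_dram = amat(DRAM_HIT, mr_32mb, MEM_LAT)
    winner = "3D DRAM L2" if t_dram < t_sram else "2D SRAM L2"
    print(f"{name}: SRAM {t_sram:.1f} cc, DRAM {t_dram:.1f} cc -> {winner}")
```

With these numbers the Ocean-like profile favors the stacked DRAM (about 55 vs. 87 cycles) while the Cholesky-like profile favors the small SRAM (about 24 vs. 42 cycles), mirroring the two outcomes in Figure 2.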

III. SRAM/DRAM HYBRID CACHE

As we discussed in Section II, conventional DRAM stacking does not always contribute to performance. To tackle this issue and expand the versatility of 3D microprocessors, we propose an SRAM/DRAM hybrid cache architecture.

As shown in Figure 1 (C), the cache supports the following two operation modes.

- SRAM cache mode: the SRAM implemented on the lower layer works as a conventional fast 2 MB SRAM L2 cache, and the stacked DRAM is not used (its power supply is gated).

- DRAM cache mode: the stacked 32 MB DRAM works as the data memory of the L2 cache, i.e., cache lines are stored in the DRAM and the SRAM is used as the tag memory. Both the SRAM and the stacked DRAM are activated.
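The two modes are two configurations of the same physical arrays. The following descriptive sketch (an illustrative abstraction of ours, not an actual hardware interface) summarizes what each mode stores and powers, using the capacities and latencies from Figures 1 and 6:

```python
# Descriptive model of the two hybrid-cache operation modes. The field
# values follow the paper's parameters; the class itself is only an
# illustrative abstraction, not a real hardware interface.

from dataclasses import dataclass

@dataclass(frozen=True)
class HybridCacheMode:
    name: str
    data_store: str       # where the L2 cache lines live
    sram_role: str        # how the lower-layer SRAM is used
    capacity_mb: int      # effective L2 capacity
    hit_latency_cc: int   # L2 hit time in clock cycles (Figure 6)
    dram_active: bool     # whether the stacked DRAM is powered

SRAM_MODE = HybridCacheMode(
    name="SRAM cache mode",
    data_store="SRAM",
    sram_role="2 MB L2 data (and its tags)",
    capacity_mb=2,
    hit_latency_cc=6,
    dram_active=False,    # stacked DRAM is power-gated
)

DRAM_MODE = HybridCacheMode(
    name="DRAM cache mode",
    data_store="stacked DRAM",
    sram_role="tag memory for the 32 MB DRAM cache",
    capacity_mb=32,
    hit_latency_cc=28,
    dram_active=True,     # SRAM (tags) and DRAM (data) both active
)
```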

As shown in Figure 1 (C), the SRAM part can work either as a 2 MB SRAM L2 cache or as the tag memory of the 32 MB DRAM cache. Therefore, we maintain fast L2 accesses in the SRAM cache mode.

Figure 4 shows the tag mapping in DRAM cache mode. The tags are stored in the SRAM data arrays, and the cache lines are stored in the DRAM. Of course, the SRAM data arrays need enough capacity to store all of the tags of the 32 MB DRAM cache. When an L2 access takes place, the associated tag in the SRAM is read out and compared with the physical address. On a hit, the DRAM arrays are accessed to read the referenced data. A straightforward implementation of this architecture would add a dedicated tag memory for DRAM cache operations, but this causes a serious area overhead: for instance, a 64 MB, 16-way DRAM cache with a 64 B cache line size requires about 5 MB of tag memory. Our hybrid architecture avoids this problem by storing the tags in the SRAM data arrays. Figure 5 shows the proposed hybrid cache in detail. Although several multiplexers must be added to select the outputs of the tag and data arrays, their area overhead is trivial.

[Figure 4: Tag mapping in DRAM cache mode. A 64-bit physical address is split into tag, index, and offset fields for the SRAM (capacity C_S, line size L_S, associativity W_S) and for the DRAM (capacity C_D, line size L_D, associativity W_D); the SRAM data array holds cache lines in SRAM cache mode and the DRAM-cache tags in DRAM cache mode.]

[Figure 5: Microarchitecture. The SRAM tag array, SRAM data array, and DRAM data array are accessed through shared decoders; multiplexers (MUX1-MUX3) steer the tag and data outputs so that the structure works either as a 2-way set-associative SRAM L2 cache or as the tag memory of the 2-way set-associative DRAM cache, producing separate SRAM and DRAM hit/miss signals.]
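The tag-overhead estimate quoted above (about 5 MB for a dedicated tag store) can be checked with a short calculation. The sketch below assumes a 64-bit physical address, as in Figure 4, and ignores status bits such as valid and dirty, so it is a rough lower bound rather than an exact tag-array size.

```python
# Rough check of the ~5 MB tag-overhead claim for a dedicated tag store:
# 64 MB, 16-way DRAM cache with 64 B lines and 64-bit physical addresses.
# Status bits (valid/dirty/replacement state) are ignored.

from math import log2

ADDR_BITS = 64
CAPACITY  = 64 * 2**20   # 64 MB
LINE_SIZE = 64           # 64 B
WAYS      = 16

lines    = CAPACITY // LINE_SIZE            # 2^20 cache lines
sets     = lines // WAYS                    # 2^16 sets
idx_bits = int(log2(sets))                  # 16 index bits
off_bits = int(log2(LINE_SIZE))             # 6 offset bits
tag_bits = ADDR_BITS - idx_bits - off_bits  # 42 tag bits per line

tag_mb = lines * tag_bits / 8 / 2**20
print(f"dedicated tag store: {tag_mb:.2f} MB")   # ~5.25 MB
```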

Another interesting design point in the proposed hybrid cache is how to decide the operation mode. If the target application program fits within the SRAM capacity, or if even 32 MB of L2 cache is nowhere near enough, we should choose the SRAM cache mode to provide fast accesses. On the other hand, the DRAM cache mode should be selected when the 2 MB L2 capacity is too small but 32 MB is sufficient. In this paper, we assume that the appropriate operation mode can be decided before execution. For instance, profiling is a well-known approach to statically optimizing application programs: we can determine the appropriate mode by performing pre-executions, and the OS sets the cache mode when the program starts. Another attractive strategy is dynamic, or run-time, mode selection, in which special hardware monitors memory access behavior at run time and decides the appropriate operation mode.
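The dynamic variant could be sketched as follows. Since this paper leaves the selection algorithm as future work and only notes that special hardware would monitor memory accesses, the interval-based heuristic below, including the idea of estimating the 32 MB miss count with shadow tags, is purely our hypothetical illustration, not the authors' mechanism.

```python
# Hypothetical run-time mode monitor (not the authors' mechanism).
# Every INTERVAL L2 accesses it compares the estimated L2-level cycle
# cost of each mode and switches to the cheaper one. The 32 MB miss
# count is assumed to come from shadow tags or a similar estimator.

SRAM_HIT, DRAM_HIT, MEM_LAT = 6, 28, 181  # cycles (Figure 6 parameters)
INTERVAL = 1_000_000                      # L2 accesses per decision epoch

class ModeMonitor:
    def __init__(self):
        self.mode = "SRAM"       # start in the fast, low-power mode
        self.accesses = 0
        self.misses_2mb = 0      # misses a 2 MB L2 would incur
        self.misses_32mb = 0     # misses a 32 MB L2 would incur

    def record(self, hit_2mb: bool, hit_32mb: bool):
        self.accesses += 1
        self.misses_2mb += not hit_2mb
        self.misses_32mb += not hit_32mb
        if self.accesses == INTERVAL:
            self._decide()

    def _decide(self):
        # Estimated total L2-level cycles spent in each mode.
        t_sram = SRAM_HIT * self.accesses + MEM_LAT * self.misses_2mb
        t_dram = DRAM_HIT * self.accesses + MEM_LAT * self.misses_32mb
        self.mode = "SRAM" if t_sram <= t_dram else "DRAM"
        self.accesses = self.misses_2mb = self.misses_32mb = 0
```

A real implementation would also have to account for the cost of flushing or migrating cache contents on a mode switch, which this sketch ignores.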

IV. EVALUATION

In this paper we assume that the hybrid cache knows the appropriate operation mode in advance; developing a mode selection algorithm is our ongoing work. We compare the performance and energy consumption of the following cache models.

- 2D-BASE: the conventional 2D-implemented L2 SRAM cache, as shown in Figure 1 (A).

- 3D-CONV: the conventional 3D L2 cache with stacked DRAM, as depicted in Figure 1 (B). Tags are stored in the SRAM on the bottom layer.

- 3D-HYBRID: the proposed SRAM/DRAM hybrid cache, as presented in Figure 1 (C). The cache mode is set by the OS at the beginning of execution and is maintained until the end of the execution.

[Figure 6: Parameters for performance evaluation. Cores: 3 GHz; L1 D/I caches: 32 KB, 2-clock-cycle access latency. 2D SRAM L2 cache: 2 MB, 64 B blocks, 8-way, 6-clock-cycle latency. 3D DRAM L2 cache: 32 MB, 64 B blocks, 8-way, 28-clock-cycle latency. Main memory latency: 181 clock cycles.]

Figure 6 shows the parameters used in our evaluation. For 3D-HYBRID, the 2D and 3D parameters in Figure 6 are used in the SRAM and DRAM cache modes, respectively. We measured the average memory access time for each cache model by performing trace-based simulations; the M5 processor simulator was used to capture memory access traces and hit/miss information [5]. We also evaluated the energy efficiency of the proposed hybrid approach. The energy consumed in the memory subsystem, which includes the L1 cache, the L2 cache, and main memory, can be expressed as follows:

    E_memory = E_L1 + E_L2 + E_MM                                      (1)
    E_L1 = AC_L1 · AE_L1                                               (2)
    E_L2 = AC_L2SRAM · AE_L2SRAM + AC_L2DRAM · AE_L2DRAM
           + RC_L2DRAM · AE_L2DRAMref                                  (3)
    E_MM = AC_MM · AE_MM + RC_MMDRAM · AE_MMref                        (4)

Here, AE and AC represent the average access energy and the access count of each memory module, respectively. AE_L2DRAMref and AE_MMref are the energies dissipated per DRAM refresh on the stacked DRAM and on main memory, and RC_L2DRAM and RC_MMDRAM are the corresponding refresh counts during program execution.
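Written out in code, the model of Eqs. (1)-(4) is straightforward to evaluate. In the sketch below the per-access and per-refresh energies (AE_*) would come from CACTI and the access/refresh counts (AC_*, RC_*) from the simulation traces; all numeric values shown are placeholders, not measured data.

```python
# Memory-subsystem energy model of Eqs. (1)-(4). AE_* values would come
# from CACTI and AC_*/RC_* counts from trace-based simulation; the
# numbers used below are placeholders, not measured results.

def e_l1(ac_l1, ae_l1):
    """Eq. (2): L1 energy."""
    return ac_l1 * ae_l1

def e_l2(ac_sram, ae_sram, ac_dram, ae_dram, rc_dram, ae_dram_ref):
    """Eq. (3): SRAM accesses + DRAM accesses + DRAM refreshes."""
    return ac_sram * ae_sram + ac_dram * ae_dram + rc_dram * ae_dram_ref

def e_mm(ac_mm, ae_mm, rc_mm, ae_mm_ref):
    """Eq. (4): main-memory accesses + main-memory refreshes."""
    return ac_mm * ae_mm + rc_mm * ae_mm_ref

# Eq. (1): total memory-subsystem energy (counts x joules-per-event).
total = (e_l1(ac_l1=1e9, ae_l1=0.05e-9)
         + e_l2(ac_sram=2e8, ae_sram=0.5e-9,
                ac_dram=1e8, ae_dram=2.0e-9,
                rc_dram=1e6, ae_dram_ref=5.0e-9)
         + e_mm(ac_mm=5e6, ae_mm=20e-9,
                rc_mm=1e6, ae_mm_ref=10e-9))
print(f"memory-subsystem energy: {total:.3f} J")
```

Note that in the SRAM cache mode the DRAM access and refresh terms of Eq. (3) are zero, which is exactly where the hybrid's energy advantage over 3D-CONV in Figure 7 comes from.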

We used CACTI [4] to determine the value of each energy parameter, and assumed that driving the external I/O pins for an access consumes the same amount of energy as a main-memory DRAM array access. We also assume a 3 GHz in-order high-end processor core.

[Figure 7: Memory performance and energy. Normalized memory performance and normalized energy of 2D-BASE, 3D-CONV, and 3D-HYBRID for the SPEC CPU2000 benchmark programs.]

Figure 7 reports the memory performance improvements and energy consumption for the SPEC CPU2000 benchmark programs. First, we compare performance. 3D-CONV, the conventional 3D implementation, achieves outstanding improvements for three benchmarks: 171.swim, 172.mgrid, and 173.applu. However, for some programs, such as 164.gzip, 179.art, 188.ammp, 300.twolf, and 301.apsi, the conventional stacking does not work well, because the L2 miss reduction is not enough to compensate for the negative impact of the slow stacked-DRAM accesses. Since 3D-HYBRID supports not only stacked DRAM cache operation but also fast SRAM operation, it maintains the performance of 2D-BASE for these programs. On average, the conventional 3D implementation does not improve performance, whereas the hybrid approach achieves about a 25% improvement.

Next, we discuss energy consumption. As shown in Figure 7, 3D-HYBRID either maintains almost the same energy as the conventional 2D implementation (2D-BASE) or achieves outstanding energy reductions, as 3D-CONV does. The proposed hybrid cache has at least two positive effects from the energy point of view. First, activating the stacked DRAM is still energy consuming compared with the small SRAM, which means that 3D-CONV wastes a lot of energy if the working set is small. For such programs, 3D-HYBRID selects the SRAM cache mode, so the L2 access energy is reduced. The second effect is the energy reduction for DRAM refreshes. 3D-CONV has to issue refresh operations to the stacked DRAM at all times to maintain the stored data, resulting in a large amount of energy dissipation. The hybrid cache requires DRAM refreshes in the DRAM cache mode, just as 3D-CONV does, but not in the SRAM cache mode. Compared with 2D-BASE, the hybrid cache has a small energy overhead for 164.gzip, 188.ammp, 300.twolf, and 301.apsi, because the total SRAM size of 3D-HYBRID is slightly larger than that of 2D-BASE, as explained in Section III. However, this energy overhead is trivial.

V. CONCLUSIONS

This paper proposed an SRAM/DRAM hybrid cache architecture for future 3D-implemented microprocessors. By adapting to the characteristics of application programs, the proposed approach selects an appropriate operation mode. If the program requires a large cache capacity, the hybrid cache exploits the stacked DRAM as the data memory of the last-level cache; otherwise, the small SRAM is used. One of the key challenges of the hybrid cache is how to decide the operation mode, and this is our ongoing work.

ACKNOWLEDGMENT

This research was supported in part by the New Energy and Industrial Technology Development Organization and the Grant-in-Aid for Young Scientists (A), 21680005. The computation was mainly carried out using the computer facilities at the Research Institute for Information Technology, Kyushu University.

REFERENCES

[1] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCauley, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb, "Die Stacking (3D) Microarchitecture," Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 469-479, 2006.
[2] D. H. Woo, N. H. Seong, D. L. Lewis, and H.-H. S. Lee, "An Optimized 3D-Stacked Memory Architecture by Exploiting Excessive, High-Density TSV Bandwidth," Proceedings of the 16th International Symposium on High-Performance Computer Architecture, 2010.
[3] G. H. Loh, "3D-Stacked Memory Architectures for Multi-Core Processors," Proceedings of the International Symposium on Computer Architecture, pp. 453-464, June 2008.
[4] CACTI 5.3, http://quid.hpl.hp.com:9081/cacti/.
[5] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt, "The M5 Simulator: Modeling Networked Systems," IEEE Micro, Vol. 26, No. 4, pp. 52-60, 2006.
