Adaptive Cache-Line Size Management on 3D Integrated Microprocessors

Takatsugu Ono, Koji Inoue, Kazuaki Murakami

Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan
{ono, inoue, murakami}@soc.ait.kyushu-u.ac.jp

Abstract—Memory bandwidth can be dramatically improved by stacking the main memory (DRAM) on processor cores and connecting them with wide on-chip buses composed of through-silicon vias (TSVs). 3D stacking makes it possible to reduce the cache miss penalty because a large amount of data can be transferred from the main memory to the cache at a time. If a large cache line size is employed, we can expect a prefetching effect. However, it might worsen system performance if programs do not have enough spatial locality of memory references. To solve this problem, we introduce a software-controllable variable line-size cache scheme. In this paper, we apply it to an L1 data cache with a 3D stacked DRAM organization. In our evaluation, we observe that our approach reduces the L1 data cache and stacked DRAM energy consumption by up to 75% compared to a conventional cache.

Keywords: low power, variable line-size, 3D stacked DRAM

I. INTRODUCTION

Three-dimensional die stacking is one of the most promising approaches to achieving high performance and low power consumption in microprocessor systems [1][2]. Memory bandwidth can be dramatically improved by stacking the main memory (DRAM) on processor cores and connecting them with wide on-chip buses composed of through-silicon vias (TSVs). Here, we assume a memory hierarchy that consists of a level-1 data cache implemented on the same die as the processor core and a stacked main memory, as shown in Fig. 1. The high bandwidth achieved by 3D stacking makes it possible to reduce the cache miss penalty because a large amount of data can be transferred from the main memory to the cache at a time. If a large cache line size is employed, we can expect a prefetching effect. However, it might worsen system performance if programs do not have enough spatial locality of memory references, because larger cache line sizes tend to cause frequent replacements due to conflict misses. This negative impact also wastes a large amount of energy. To solve the above issues, we introduce a software-controllable variable line-size cache scheme. In this approach, a dedicated compiler inserts special instructions to change the cache line size. The appropriate cache line sizes are decided at compile time

Figure 1. A schematic of the assumed memory hierarchy: a processor core with L1 instruction and data caches ($IL1, $DL1) connected to the stacked main memory through a wide on-chip bus.

based on the behavior of memory references. This technique was originally proposed to improve DRAM buffer performance [3]. In this paper, we apply it to an L1 data cache with a 3D stacked DRAM organization and analyze its performance-energy efficiency. This paper is organized as follows. Section II presents the software-controllable variable line-size cache. Section III shows evaluation results of our approach. Section IV describes related work, and Section V summarizes our work.

II. SOFTWARE-CONTROLLABLE VARIABLE LINE-SIZE CACHE

A. Architecture

As explained in Section I, it is very important to optimize the L1 data cache block size in order to improve the execution efficiency of 3D integrated microprocessors. To meet this requirement, we propose to apply a run-time line-size optimization technique called the Software-Controllable Variable Line-Size cache (SC-VLS cache for short). Fig. 2 illustrates the block diagram of a direct-mapped SC-VLS cache. The SRAM cell array and the DRAM cell array are divided into several sub-arrays. In this paper, we assume that the width of each sub-array is 32 bytes and that the total number of SRAM (and also DRAM) sub-arrays is eight. Tags are stored at 32-byte line-size granularity. TSVs are exploited to directly connect each SRAM-DRAM sub-array pair. In advanced 3D stacking technologies, the TSV pitch can be less than 1.0 um [2]. Therefore, although our implementation requires 2,048 TSVs, its area overhead is negligible. Cache data replacements are performed on the corresponding SRAM-DRAM sub-array pairs, and the number

Figure 2. The SC-VLS cache architecture: the processor address is split into tag, index, and offset; a status register and per-line minimum-line-size, valid, and tag bits select an adequate line size, and MUXes with TSVs connect each SRAM sub-array to its DRAM sub-array.

Figure 3. An example of adequate line-size analysis. With a 32-byte line, the average miss rates are 2.0% for foo1() (4/200), 4.0% for foo2() (16/400), and 1.0% for foo3() (1/100), giving an overall MR32B of 3.0%. With a 64-byte line, they are 5.0% for foo1() (10/200), 2.0% for foo2() (8/400), and 2.0% for foo3() (2/100), giving an overall MR64B of about 2.9%. The adequate line sizes are therefore 32 bytes for foo1(), 64 bytes for foo2(), and 32 bytes for foo3(), for an overall MRadequate of about 1.9%.

of sub-array pairs to be activated on each cache miss is specified by the status register implemented in the SC-VLS cache. With the design parameters depicted in Fig. 2, the cache supports four line sizes: 32 bytes, 64 bytes, 128 bytes, and 256 bytes. When the minimum 32-byte line size is selected, a data replacement is performed only on the associated sub-array pair. When the line size indicated by the status register is 64 bytes (or 128 bytes), two (or four) contiguous sub-array pairs are activated. Similarly, for the maximum 256-byte line size, all of the sub-array pairs are targeted for the data replacement. Regardless of the selected line size, we can complete the data replacement in a constant time by exploiting the high memory bandwidth provided by the TSVs. Except for data replacements, the cache works in the same manner as a conventional direct-mapped cache. When a cache access takes place, one of the SRAM sub-arrays is directly selected by the memory reference address. Therefore, unlike set-associative organizations, there is no negative impact on the cache access time. The status register is I/O mapped, so we can manage the line size by executing store instructions. Note that although executing the store instructions that update the SC-VLS status register requires extra execution clock cycles, the performance overhead is trivial (the details are evaluated in Section III).

B. Adequate Line Size Analysis and Code Generation

For the SC-VLS cache, it is very important to determine appropriate cache line sizes at compile time. Our approach attempts to find the best line size at function-level granularity, i.e., an appropriate line size is decided function by function in the target application program. First, we obtain profile information by pre-executing the target program with fixed line sizes. This kind of profile-based compiler optimization is very common in the high-performance and embedded systems fields. Usually, the best line size of a function depends on its memory-reference behavior. This means that the best line size
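As a concrete illustration of the replacement mechanism described above, the following sketch (ours, not the authors' hardware description; the function name and interface are assumptions) computes which sub-array pairs a miss activates under the Fig. 2 parameters: eight 32-byte sub-arrays and line sizes of 32, 64, 128, or 256 bytes, with the activated group aligned to the current line size.

```python
# Hypothetical software model of SC-VLS sub-array activation on a miss.
SUBARRAY_BYTES = 32
NUM_SUBARRAYS = 8

def activated_subarrays(block_addr: int, line_size: int) -> list:
    """Indices of the contiguous SRAM-DRAM sub-array pairs replaced on a
    miss for the 32-byte block at byte address block_addr."""
    assert line_size in (32, 64, 128, 256)
    n = line_size // SUBARRAY_BYTES              # pairs activated per miss
    first = (block_addr // SUBARRAY_BYTES) % NUM_SUBARRAYS
    first -= first % n                           # align to the line-size group
    return list(range(first, first + n))

# With a 64-byte line, a miss on the block at byte address 96 activates
# pairs 2 and 3; with a 256-byte line, every miss activates all eight pairs.
```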


may vary between execution instances of that function. However, we need to decide on one line size per function, because we insert a store instruction that updates the status register in the SC-VLS cache at the beginning of the function code. To solve this issue, we first calculate the average cache miss rate of each function assuming fixed 32-byte, 64-byte, 128-byte, and 256-byte line sizes. Then, the line size that gives the smallest cache miss rate is picked. We explain the analysis algorithm using Fig. 3. Here, it is assumed that the SC-VLS cache supports only 32-byte and 64-byte line sizes. First, we measure the cache miss rate of each function. The function foo1() causes 10 cache misses out of 200 accesses with the 64-byte line size, for an average miss rate of 5.0%, versus 2.0% with the 32-byte line size. In this scenario, we choose the 32-byte line size for that function. We decide the adequate line sizes of the other functions in the same way. After analyzing the appropriate line size for every function, store instructions that update the status register of the SC-VLS cache are inserted into the assembly code. Before a function is executed, such an instruction sets the status register to indicate the adequate line size.

C. Performance/Energy Characteristics

The average memory access time (AMAT), a well-known memory performance metric, can be expressed by the following equation:

AMAT = T_L1 + MR_L1 × (T_DRAM + LineSize / MemoryBW),  (1)

where T_L1 and T_DRAM are the access times of the L1 data cache and the DRAM main memory, respectively, and MR_L1 is the cache miss rate. Even if we increase the cache line size (LineSize) within the range of the memory bandwidth (MemoryBW), assuming a constant DRAM access time, the miss penalty does not increase. This is the fundamental reason why we can employ a large cache line size in DRAM-stacking microprocessors. There are two reasons why the proposed SC-VLS cache can improve memory performance.
When the target application program has enough spatial locality, the cache employs a larger cache line size. In this scenario, we can expect prefetching effects, resulting in a higher cache hit rate, i.e., a small value of MR_L1. Conversely, if the memory accesses of the target program have poor locality, the cache decreases the cache line size in

order to avoid conflict misses. Thus, we can alleviate the negative impact of larger cache line sizes.
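The adequate line-size analysis of Section II-B can be sketched in a few lines. This is our illustrative reconstruction, not the authors' compiler pass; the profile layout is an assumption, and the numbers are taken from the Fig. 3 example.

```python
# Compile-time selection of the adequate line size per function,
# given profiled (misses, accesses) counts for each candidate line size.
def adequate_line_size(profile):
    """profile maps line_size -> (misses, accesses); pick the line size
    with the smallest average miss rate."""
    return min(profile, key=lambda ls: profile[ls][0] / profile[ls][1])

# Profile numbers from Fig. 3 (32-byte and 64-byte candidates only).
profiles = {
    "foo1": {32: (4, 200), 64: (10, 200)},   # 2.0% vs 5.0%
    "foo2": {32: (16, 400), 64: (8, 400)},   # 4.0% vs 2.0%
    "foo3": {32: (1, 100), 64: (2, 100)},    # 1.0% vs 2.0%
}
chosen = {f: adequate_line_size(p) for f, p in profiles.items()}
# chosen == {"foo1": 32, "foo2": 64, "foo3": 32}, matching Fig. 3
```

A store instruction setting the status register to the chosen size would then be emitted at the entry of each function.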

The SC-VLS cache has another advantage from the energy point of view. The total energy consumed in the memory hierarchy, E_mem, can be expressed by the following equations:

E_mem = E_L1 + E_MainMemory,  (2)
E_L1 = AC_L1 × E_L1/access,  (3)
E_MainMemory = Σ_{i=1}^{AC_mm} (N_i × E_DRAM-SA/access),  (4)

where E_L1 and E_MainMemory are the total energies consumed by the L1 and stacked DRAM accesses, respectively; AC_L1 and AC_mm are the L1 and main memory access counts; E_L1/access is the average energy of a cache access; and E_DRAM-SA/access is the energy dissipated in accessing one DRAM sub-array. N_i denotes the number of DRAM sub-arrays activated on the i-th DRAM access. The value of N_i in Equation (4) is always eight in a conventional cache with a fixed 256-byte line size, resulting in large energy consumption. The SC-VLS cache, on the other hand, can employ smaller cache line sizes, so N_i becomes smaller whenever a smaller line size is selected. For instance, when the SC-VLS cache employs a 32-byte line size, only one sub-array is activated on that access (N_i = 1). Another positive effect of optimizing the cache line size is that we can improve the cache hit rate, as explained above. This means that the total number of DRAM accesses, AC_mm, can be reduced, resulting in further energy reduction.
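The energy model of Equations (2)-(4) is straightforward to evaluate numerically. The sketch below is ours; the per-access energy constants are made-up placeholder values, and only the structure follows the paper.

```python
# Numerical sketch of Equations (2)-(4).
E_L1_PER_ACCESS = 1.0        # E_L1/access, arbitrary units (placeholder)
E_DRAM_SA_PER_ACCESS = 10.0  # E_DRAM-SA/access, arbitrary units (placeholder)

def e_mem(ac_l1, subarrays_per_access):
    """E_mem = E_L1 + E_MainMemory; subarrays_per_access[i] is N_i,
    the number of DRAM sub-arrays activated on the i-th DRAM access."""
    e_l1 = ac_l1 * E_L1_PER_ACCESS                                    # Eq. (3)
    e_mm = sum(n * E_DRAM_SA_PER_ACCESS for n in subarrays_per_access)  # Eq. (4)
    return e_l1 + e_mm                                                # Eq. (2)

# A conventional 256-byte-line cache always has N_i = 8; an SC-VLS cache
# running with a 32-byte line has N_i = 1, an 8x cut in per-access DRAM energy.
conventional = e_mem(1000, [8] * 50)   # 1000 + 4000 = 5000.0
sc_vls = e_mem(1000, [1] * 50)         # 1000 + 500  = 1500.0
```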

III. EVALUATION

A. Experimental Setup

In this section, we quantitatively evaluate the efficiency of the SC-VLS cache. We assume a simple single-issue in-order microprocessor with a 16 KB direct-mapped L1 data cache, and a stacked DRAM main memory. As defined in Section II-A, the SC-VLS cache can choose one of four line sizes: 32 bytes, 64 bytes, 128 bytes, and 256 bytes. We modified the SimpleScalar simulation tool set [4] to measure the execution time, and calculated the total energy consumption based on Equations (2)-(4). We ignore the energy consumed in updating the status register because it is much smaller than the energy dissipated in the SRAM and stacked DRAM accesses; moreover, since the cache optimizes the line size function by function, the status register is accessed infrequently. We measured the dynamic access energy of the 16 KB data caches using CACTI 5.3 [5]. We use 10 programs from MiBench [6]. Since the adequate line sizes depend on the inputs, we used two input data sets: small input sets to analyze the adequate line sizes (analysis phase) and large input sets to evaluate our approach (execution phase).

B. Performance

Fig. 4 shows the execution time of each cache model. FIX32B, FIX64B, FIX128B, and FIX256B are conventional caches with 32-byte, 64-byte, 128-byte, and 256-byte fixed line sizes, respectively. The cache denoted SC-VLS is the proposed approach. The x-axis shows

Figure 4. Execution time of conventional caches (FIX32B, FIX64B, FIX128B, FIX256B) and the SC-VLS cache, normalized (y-axis from 0.94 to 1.08).

benchmark programs, and the y-axis shows the execution time. All results are normalized to the best execution time among the conventional caches. Note that the SC-VLS results include the performance overhead of updating the status register to change the current cache line size. Comparing the conventional fixed-line-size caches, FIX32B, FIX64B, FIX128B, and FIX256B, it is obvious that the best line size is strongly application dependent. For instance, FIX32B produces the highest performance for rijndael_dec, but FIX256B is the best for tiff2bw. Similarly, a 64-byte line size is appropriate for rijndael_enc, and a 128-byte line size is preferred for mad, dijkstra, and lame. Turning to the proposed approach, we observe that for two benchmarks, mad and tiff2bw, the SC-VLS cache achieves almost the same results as the best conventional configurations. Furthermore, for dijkstra and lame, the performance of SC-VLS is higher than that of the best conventional configurations. These improvements come from the fact that the SC-VLS cache optimizes the cache line size both across and within programs, which cannot be achieved with conventional fixed line sizes. Unfortunately, the SC-VLS cache introduces a slight performance degradation for bitcount. Since our approach predicts the appropriate line size at compile time, there are two reasons why the execution time can increase. First, if the memory-reference behavior observed in the profiling run does not match that appearing at run time, the SC-VLS cache may work with inappropriate line sizes, resulting in higher cache miss rates. Second, the cache needs to execute extra instructions to set the status register, so the total number of executed instructions increases. However, this overhead is only 1% of the total execution time. Therefore, we believe these negative impacts are negligible.
C. Energy

In this section, we discuss the energy efficiency of the proposed SC-VLS cache. Fig. 5 shows the evaluation results. All results are normalized to the energy consumed by the conventional line size that performed best in Fig. 4. The SC-VLS cache reduces the energy consumption for all benchmarks except mad, rijndael_dec, and lame. In particular, our approach reduces the energy consumption by 75% for tiff2bw and sha. These results come from the fact that the SC-VLS cache can decrease the cache line size based on the

Figure 5. Normalized energy of the L1 data caches and the stacked DRAM (FIX32B, FIX64B, FIX128B, FIX256B, and SC-VLS).

TABLE I. AVERAGE SC-VLS CACHE LINE SIZE

Benchmark     | Average SC-VLS cache line size (B)
bitcount      | 81.94
mad           | 233.60
tiff2bw       | 255.99
dijkstra      | 223.04
rijndael_enc  | 64.82
rijndael_dec  | 33.01
sha           | 141.90
adpcm_enc     | 233.40
adpcm_dec     | 255.67
lame          | 254.78

characteristics of memory references. Table I reports the average cache line size for each benchmark program. We see that the applications do not always require the maximum cache line size. However, for mad, rijndael_dec, and lame, the SC-VLS cache increases the energy consumption. This is because the cache sometimes chooses inappropriate line sizes, as explained in Section III-B.

IV. RELATED WORK

Van Vleet et al. [7] proposed using off-line profiling to determine whether the line size should be normal or large. The line size is determined with a cost model that captures the trade-off between improving the cache hit ratio and minimizing the additional bytes transferred into the cache. These studies aim at improving cache performance and do not describe any mechanism to reduce energy consumption. Grun et al. [8] customize the local memory architecture to suit the diverse access patterns and locality types present in an application program; they reduce the main memory bandwidth, thus generating power savings. We also exploit memory access behavior, but our approach can dynamically change the cache line size. Witchel et al. propose a software-controlled cache line size [9]: the compiler specifies how much data to fetch on a miss, allowing greater cache utilization and reducing bandwidth requirements. Zhang et al. present a configurable line-size cache [10] with a counter in the cache controller that specifies how many words to read from the off-chip memory; they determine the best line size for a whole program. Inoue et al. [11] propose determining an adequate line size based on cache simulation, again for a whole program. Our approach, on the other hand, is able to choose an adequate line size for each function of a program.

V. CONCLUSIONS

We have applied a Software-Controllable Variable Line-Size cache to an L1 data cache with a 3D stacked DRAM organization. The line sizes of the L1 data cache are dynamically changed during program execution with small hardware and performance overheads. In our evaluation, we observed that the SC-VLS cache reduces energy consumption by up to 75% compared to a conventional cache, without sacrificing performance.

ACKNOWLEDGMENT

This research was supported in part by the New Energy and Industrial Technology Development Organization and the Grant-in-Aid for Young Scientists (A), 21680005. The computation was mainly carried out using the computer facilities at the Research Institute for Information Technology, Kyushu University.

REFERENCES

[1] G. H. Loh, "3D-Stacked Memory Architectures for Multi-Core Processors," Proc. of the International Symposium on Computer Architecture, pp. 453-464, June 2008.
[2] T. Kgil, S. D'Souza, A. Saidi, N. Binkert, R. Dreslinski, T. Mudge, S. Reinhardt, and K. Flautner, "PicoServer: Using 3D Stacking Technology to Enable a Compact Energy Efficient Chip Multiprocessor," Proc. of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 117-128, October 2006.
[3] T. Ono, K. Inoue, K. Murakami, and K. Yoshida, "Reducing On-Chip DRAM Energy via Data Transfer Size," IEICE Transactions on Electronics, vol. E92-C, no. 4, pp. 433-443, April 2009.
[4] T. Austin, E. Larson, and D. Ernst, "SimpleScalar: An Infrastructure for Computer System Modeling," IEEE Computer, vol. 35, no. 2, pp. 59-67, 2002.
[5] CACTI 5.3, http://quid.hpl.hp.com:9081/cacti/.
[6] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A Free, Commercially Representative Embedded Benchmark Suite," IEEE 4th Annual Workshop on Workload Characterization, December 2001.
[7] P. van Vleet, E. J. Anderson, L. Brown, J. L. Baer, and A. R. Karlin, "Pursuing the Performance Potential of Dynamic Cache Line Sizes," Proc. of the IEEE International Conference on Computer Design, pp. 528-537, October 1999.
[8] P. Grun, N. Dutt, and A. Nicolau, "Access Pattern Based Local Memory Customization for Low Power Embedded Systems," Proc. of the Conference on Design, Automation and Test in Europe, pp. 778-784, March 2001.
[9] E. Witchel and K. Asanovic, "The Span Cache: Software Controlled Tag Checks and Cache Line Size," Workshop on Complexity-Effective Design, 2001.
[10] C. Zhang, F. Vahid, and W. Najjar, "Energy Benefits of a Configurable Line Size Cache for Embedded Systems," Proc. of the IEEE Computer Society Annual Symposium on VLSI, pp. 87-91, February 2003.
[11] K. Inoue, K. Kai, and K. Murakami, "High Bandwidth, Variable Line-Size Cache Architecture for Merged DRAM/Logic LSIs," IEICE Transactions on Electronics, vol. E81-C, no. 9, pp. 1438-1447, 1998.


2009 ISOCC
