A Coldness Metric for Cache Optimization

Raj Parihar

Chen Ding†

Michael C. Huang

Dept. of Electrical & Computer Engineering, †Dept. of Computer Science
University of Rochester, Rochester, NY 14627, USA
{parihar@ece. cding@cs. michael.huang@}rochester.edu

Abstract

A “hot” concept in program optimization is hotness. For example, program optimization targets hot paths, and register allocation targets hot variables. Cache optimization, however, has to target cold data, which are less frequently used and tend to cause cache misses whenever they are accessed. Hot data, in contrast, being small and frequently used, tend to stay in cache. In this paper, we define a new metric called “coldness” and show how coldness varies across programs and how much colder the data we have to optimize becomes as the cache size on modern machines increases.

Categories and Subject Descriptors D.2.8 [Metrics]: Performance measures

General Terms measurement, performance

Keywords reuse distance, miss rate, coldness

1. Why Cache Optimization

On modern machines, there is not enough memory bandwidth to feed data fast enough to the processor. The limitation can be quantified by comparing program and machine balances [1]. For a set of scientific programs on an SGI Origin machine, a study in 2000 found that the program balances, ranging from 2.7 to 8.4 bytes per flop (except 0.04 for optimized matrix multiplication), were 3.4 to 10.5 times higher than the machine balance of 0.8 bytes per flop. The maximal CPU utilization was as low as 9.5%, and a program spent over 90% of its time waiting for memory [6].

The problem of insufficient bandwidth cannot be ameliorated by latency tolerance techniques such as data prefetching or multithreading. The solution has to come from better caching, which in hardware means larger caches and better caching algorithms. However, just increasing the cache size is not sufficient. A rule of thumb is that the miss rate halves when the cache size quadruples [2]. Even this estimate is optimistic considering the simulation data compiled by Cantin and Hill for SPEC 2000 programs, whose miss rate was reduced only from 3.9% to 2.6% when quadrupling the cache size from 256KB to 1MB [3].

The imbalance is worse on multicore. On the one hand, the peak speed multiplies. A current high-end server (Intel E5-2690) has about 400 GFlops of peak performance from 16 cores. Despite its 80GB/s memory bandwidth, the machine balance is 0.2 bytes per flop, lower than the 0.8 bytes per flop of the previous SGI machine. At the same time, the cache per core is smaller. On the Intel E5-2690, the last-level cache per core is 1.5MB, smaller than the 4MB cache of the SGI machine 13 years ago.
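To spell out the balance arithmetic, using the numbers above:

    machine balance = memory bandwidth / peak compute rate
                    = (80 GB/s) / (400 GFlop/s)
                    = 0.2 bytes per flop,

a quarter of the 0.8 bytes per flop of the earlier SGI machine.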

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MSPC'13, June 2013, Seattle, Washington. Copyright © 2013 ACM 978-1-4503-1219-6/12/06...$10.00

To reduce the miss ratio enough to compensate for the worsening machine balance, we must improve the software solution of program optimization. Cache optimization is not a new problem, but the difficulty increases for larger caches. In the next section, we quantify this difficulty using a new metric.

2. Coldness in Cache Optimization

As the cache size increases into the megabyte range, locality optimization becomes difficult because much of the missed data is less frequently accessed, that is, cold. We describe a new metric to measure the coldness of the missed data. For a program p and cache size c, we count the minimal number of distinct data addresses for which complete caching can obtain a target relative reduction r in the miss ratio. In other words, program optimization has to improve the locality of at least this many data blocks to reduce the miss ratio by r in a size-c cache. We call the metric coldness and define it as follows:

    coldness(c, r) = (−1) × (#uniq_addr)
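The following is a minimal sketch of how the metric can be computed; it is our illustration rather than the authors' tool. It assumes a trace of block addresses and a fully associative LRU cache of cacheBlocks blocks (as in the experiments described later); names such as coldness and missCount are illustrative.

    #include <algorithm>
    #include <cstdint>
    #include <deque>
    #include <unordered_map>
    #include <vector>

    // Sketch: coldness(c, r) for a fully associative LRU cache of `cacheBlocks`
    // blocks. Returns -(minimal number of distinct addresses whose misses cover
    // at least fraction `r` of all misses).
    long coldness(const std::vector<uint64_t>& trace, size_t cacheBlocks, double r) {
        std::deque<uint64_t> lruStack;                 // most recently used at the front
        std::unordered_map<uint64_t, long> missCount;  // misses per address
        long totalMisses = 0;

        for (uint64_t addr : trace) {
            auto pos = std::find(lruStack.begin(), lruStack.end(), addr);
            // Miss if never seen before or if its LRU stack depth exceeds the cache
            // size (i.e., its reuse distance is larger than the cache).
            bool miss = (pos == lruStack.end()) ||
                        (static_cast<size_t>(pos - lruStack.begin()) >= cacheBlocks);
            if (pos != lruStack.end()) lruStack.erase(pos);
            lruStack.push_front(addr);
            if (miss) { ++missCount[addr]; ++totalMisses; }
        }

        // Greedily pick the most-missed addresses until fraction r of misses is covered.
        std::vector<long> counts;
        for (const auto& kv : missCount) counts.push_back(kv.second);
        std::sort(counts.rbegin(), counts.rend());     // descending
        long covered = 0, picked = 0;
        for (long n : counts) {
            if (covered >= r * totalMisses) break;
            covered += n;
            ++picked;
        }
        return -picked;  // coldness(c, r) = (-1) * #uniq_addr
    }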

Figure 1 shows the quantitative measurement of coldness as a function of the cache size and the level of optimization.

[Figure 1 here: the coldness metric, on a log scale from −1 down to about −1,000,000, plotted against cache size / reuse distance from 1KB to 4MB, with curves for the top 10%, 50%, and 90% of misses in the integer (INT) and floating-point (FP) applications.]

Figure 1: Coldness metric of SPEC 2006 applications for the top 10%, 50%, and 90% of misses.

As the cache size increases, the coldness of the data that optimization must target decreases. When the coldness is x, an optimization must improve the access to at least |x| distinct data addresses. For a 10% miss reduction, the coldness drops from -15 for a 1KB cache to -4630 for a 4MB cache in the integer applications; in the floating-point applications, it drops from -4 for a 1KB cache to -63,229 for a 4MB cache. Similarly, for a 90% miss reduction, the coldness drops from -11,509 to -50,476 in the integer applications and from -562,747 to -718,639 in the floating-point applications. This shows that a program optimization that targets a small number of memory addresses may be effective only for small cache sizes and cannot be as effective for large ones. Even for small cache sizes, such an optimization cannot reduce the miss ratio by 90% or more.

As the optimization level increases, the coldness also decreases. For the 4MB cache, we must optimize the access to at least 344KB, 2.4MB, and 5.4MB of data to reduce the miss ratio by 10%, 50%, and 90%, respectively. In the last case, the coldness metric shows that it is necessary to optimize a data size larger than the cache itself to obtain the needed reduction.

Next we describe the experiments that produced these coldness data. Our simulation framework is based on SimpleScalar, and we model a POWER7-like microarchitecture. On average, we fast-forward each application about 5 billion instructions before collecting statistics over a 200 million instruction window. Data caches in this study are fully associative with an LRU replacement policy. This ensures that all misses are capacity and compulsory misses, incurred by data whose reuse distance is larger than the cache size, and that there are no conflict misses.

In Figure 2, we present the minimal number of distinct addresses that account for a given percentage of cache misses; the coldness metric is the negation of this number. The average number of most-missed addresses increases by about 100x for the top 10% and 50% of misses as the cache size increases.
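The fully associative LRU setting is what ties misses directly to reuse distance. The following small sketch (ours, with illustrative names) shows the relationship the methodology relies on: given a reuse-distance histogram, the miss ratio for any cache size can be read off directly.

    #include <cstdint>
    #include <map>

    // Sketch: with a fully associative LRU cache of `cacheBlocks` blocks, an access
    // misses exactly when its reuse (stack) distance exceeds the cache size; the
    // remaining misses are compulsory (first-time) accesses, which we assume are
    // recorded in the histogram with a distance larger than any cache size.
    // `histogram` maps reuse distance -> number of accesses at that distance.
    double missRatio(const std::map<uint64_t, uint64_t>& histogram, uint64_t cacheBlocks) {
        uint64_t misses = 0, accesses = 0;
        for (const auto& [distance, count] : histogram) {
            accesses += count;
            if (distance > cacheBlocks) misses += count;  // capacity or compulsory miss
        }
        return accesses ? static_cast<double>(misses) / accesses : 0.0;
    }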

[Figure 2 here: two panels plotting, on a log scale, the number of distinct addresses against cache size / reuse distance from 1KB to 4MB, one curve per SPEC 2006 benchmark (400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk, 410.bwaves, 433.milc, 434.zeusmp, 435.gromacs, 444.namd, 447.dealII, 450.soplex, 470.lbm, 482.sphinx3) plus the median. (a) Distinct addresses accounting for the top 10% of misses. (b) Distinct addresses accounting for the top 50% of misses.]

Figure 2: Distinct addresses accounting for the top 10% and 50% of misses at various reuse distances.

Based on the individual results, we classify the applications into two groups. Applications that are consistently colder than the median are called below median cold, and applications that are consistently not as cold as the median are called above median cold. The following table shows the two coldness categories.

Temperature zones                        SPEC 2006 Applications
above median (less widespread misses)    h264ref, sphinx3, astar, xalancbmk, gobmk, hmmer, dealII, namd
below median (more widespread misses)    lbm, bwaves, libquantum, perlbench, zeusmp, gromacs, mcf, soplex, sjeng
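The text does not spell out how “consistently” is judged. The following sketch (ours, with illustrative names) uses a simple majority across cache sizes as a stand-in: an application goes above the median if its coldness is at or above the per-cache-size median for most cache sizes, and below the median otherwise. The input is assumed to be a non-empty map with one coldness value per cache size, in the same order for every application.

    #include <algorithm>
    #include <map>
    #include <string>
    #include <vector>

    // Sketch: classify each application as "above median" (less widespread misses)
    // or "below median" (more widespread misses) relative to the per-cache-size
    // median coldness, using a majority vote across cache sizes.
    std::map<std::string, std::string>
    classify(const std::map<std::string, std::vector<long>>& coldness) {
        size_t nSizes = coldness.begin()->second.size();
        std::vector<long> medians(nSizes);
        for (size_t i = 0; i < nSizes; ++i) {
            std::vector<long> column;
            for (const auto& [app, values] : coldness) column.push_back(values[i]);
            std::nth_element(column.begin(), column.begin() + column.size() / 2, column.end());
            medians[i] = column[column.size() / 2];
        }
        std::map<std::string, std::string> category;
        for (const auto& [app, values] : coldness) {
            size_t above = 0;
            for (size_t i = 0; i < nSizes; ++i)
                if (values[i] >= medians[i]) ++above;
            // Mostly at or above the median coldness => less widespread misses.
            category[app] = (above > nSizes / 2) ? "above median" : "below median";
        }
        return category;
    }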

3. Discussion and Future Work

The new metric augments the growing set of data-centric metrics. Reuse distance has been used to show the temporal and spatial locality of individual data or collections of data. Hot data streams show the regularity in consecutive data loads [5]. A recent tool, HPCToolkit, shows the most-missed data to help program tuning [8]. Unlike coldness, these metrics do not quantify the minimal number of distinct data addresses that an optimization must target.

As future work, we plan to calibrate the coldness metric more thoroughly, using full program traces and measuring the effect of program input and cache associativity.

We will study hardware solutions. From Figure 2, it is evident that a large number of distinct addresses account for the top misses, and that this number increases rapidly (note the log scale) as the cache size increases. In a similar study, we also observed that these misses are incurred by a large number of distinct static instructions, not just a few delinquent instructions. A possible solution is more effective prefetching. A specific implementation of look-ahead, which we call decoupled look-ahead [4], is able to reduce primary misses by 88x and secondary misses by 38x for a 4MB cache that incurs about 100,000 distinct miss addresses in the top 90% of misses.

We will study the new metric as a guide to program optimization. The program mcf in Figure 2 shows two distinguishing characteristics. First, it is one of the programs that are below median cold, which means that its misses are more widespread. Second, for a 50% miss-ratio reduction, its coldness varies among the least across cache sizes. An effective optimization has been found for mcf: up to 35% improvement by structure splitting [7] (a generic sketch of the transformation appears at the end of this section). The coldness characteristics suggest that structure splitting is effective at removing widespread misses but may be applicable only to certain types of programs. We will also use the coldness metric in program tuning to estimate the difficulty and suggest data targets.

In summary, we hope to identify effective techniques to deal with cold data, and to generalize and improve these techniques. As we face increasingly severe memory problems, we must understand and expand the ways to optimize for extremely cold data.
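To make the structure-splitting transformation cited for mcf concrete, here is a generic, hypothetical illustration (not the Open64 implementation and not the actual mcf data structures): cold fields are separated from the hot fields so that a traversal touching only the hot fields pulls far fewer cache blocks.

    #include <cstdint>
    #include <vector>

    // Before splitting: hot and cold fields share each cache block, so a traversal
    // that touches only `weight` and `next` still drags the cold fields into cache.
    struct NodeAoS {
        int32_t weight;      // hot: read on every traversal
        int32_t next;        // hot: index of the next node, -1 terminates
        char    name[48];    // cold: touched only when reporting
        int64_t timestamp;   // cold
    };

    // After splitting: hot traversal data is packed densely; cold fields live in a
    // parallel array indexed the same way.
    struct NodeHot  { int32_t weight; int32_t next; };
    struct NodeCold { char name[48]; int64_t timestamp; };

    // Traversal over the hot part only; many more nodes now fit per cache block.
    int64_t sumWeights(const std::vector<NodeHot>& hot, int32_t start) {
        int64_t sum = 0;
        for (int32_t i = start; i >= 0; i = hot[i].next)
            sum += hot[i].weight;
        return sum;
    }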


References

[1] D. Callahan, J. Cocke, and K. Kennedy. Estimating interlock and improving balance for pipelined architectures. JPDC, 5(4), 1988.
[2] D. Callahan and J. Gray. Design considerations for parallel programming, 2008. http://msdn.microsoft.com/en-us/magazine/cc872852.aspx.
[3] J. F. Cantin and M. D. Hill. Cache performance for SPEC CPU2000 benchmarks. http://www.cs.wisc.edu/multifacet/misc/spec2000cachedata.
[4] A. Garg and M. Huang. A performance-correctness explicitly decoupled architecture. In Proc. Int'l Symp. on Microarchitecture, pages 306–317, November 2008.
[5] T. M. Chilimbi and M. Hirzel. Dynamic hot data stream prefetching for general-purpose programs. In PLDI, pages 199–209, 2002.
[6] C. Ding and K. Kennedy. The memory bandwidth bottleneck and its amelioration by a compiler. In IPDPS, pages 181–190, 2000.
[7] G. Chakrabarti and F. Chow. Structure layout optimizations in the Open64 compiler. In Open64 Workshop, 2008.
[8] X. Liu and J. M. Mellor-Crummey. Pinpointing data locality problems using data-centric analysis. In CGO, pages 171–180, 2011.
