Performance Impacts of Non-blocking Caches in Out-of-order Processors
Sheng Li; Ke Chen; Jay B. Brockman; Norman P. Jouppi
HP Laboratories HPL-2011-65
Keyword(s): Non-blocking cache; MSHR; Out-of-order Processors
External Posting Date: July 06, 2011
Internal Posting Date: July 06, 2011
Approved for External Publication
Copyright 2011 Hewlett-Packard Development Company, L.P.
Performance Impacts of Non-blocking Caches in Out-of-order Processors

Sheng Li†, Ke Chen‡, Jay B. Brockman‡, Norman P. Jouppi†
† Hewlett-Packard Labs, ‡ University of Notre Dame
† {sheng.li4, norm.jouppi}@hp.com, ‡ {kchen2, jbb}@nd.edu

Abstract

Non-blocking caches are an effective technique for tolerating cache-miss latency. They can reduce miss-induced processor stalls by buffering the misses and continuing to serve other independent access requests. Previous research on the complexity and performance of non-blocking caches supporting non-blocking loads showed that they could achieve significant performance gains in comparison to blocking caches. However, those experiments were performed with benchmarks that are now over a decade old. Furthermore, the processor that was simulated was a single-issue processor with unlimited run-ahead capability, a perfect branch predictor, fixed 16-cycle memory latency, single-cycle latency for floating-point operations, and write-through, write-no-allocate caches. These assumptions are very different from today's high performance out-of-order processors such as the Intel Nehalem. Thus, it is time to re-evaluate the performance impact of non-blocking caches on practical out-of-order processors using up-to-date benchmarks. In this study, we evaluate the impacts of non-blocking data caches on practical high performance out-of-order (OOO) processors using the latest SPECCPU2006 benchmark suite. Simulations show that a data cache that supports hit-under-2-misses can provide a 17.76% performance gain for a typical high performance OOO processor running the SPECCPU2006 benchmarks, in comparison to a similar machine with a blocking cache.
1. Introduction and Motivation

Non-blocking caches can eliminate miss-induced processor stalls by buffering misses and continuing to serve access requests. To leverage the benefits of non-blocking caches, there must be a pool of cache/memory operations that can be serviced out of order and used effectively by the processor. Processors with this ability include OOO processors such as the Intel Nehalem [8], multithreaded processors such as the Sun Niagara [9], and processors with run-ahead capability such as the Sun Rock [10].

Previous research [1] demonstrated that data caches with non-blocking loads could achieve significant performance gains in comparison to blocking caches. The architecture assumed in that work [1] was a single-issue processor with unlimited run-ahead capability, a perfect branch predictor, fixed 16-cycle memory latency, and single-cycle latency for floating-point operations. Thus, the only stalls that could occur were those attributable to true data dependencies related to memory loads or to cache lockup. This idealized architecture model effectively isolates the impact of a data cache with non-blocking loads from the other parts of the processor. However, these assumptions are very different from today's high performance out-of-order processors such as the Intel Nehalem. In addition, the previous work is based on write-through, write-no-allocate caches, while current caches are mostly write-back. Moreover, those experiments were done with the SPECCPU92 benchmarks, which are now over a decade old. Thus, it is time to re-evaluate the performance impact of non-blocking caches on practical out-of-order processors using up-to-date benchmarks. In this study, we evaluate the performance impacts of non-blocking data caches using the latest SPECCPU2006 benchmark suite on high performance out-of-order (OOO) Intel Nehalem-like processors.
2. Methodology and Experiment Setup

The modeled Nehalem-like architecture has 4-issue OOO cores. Each core has a 32KB 4-way set-associative L1 instruction cache, a 32KB 8-way set-associative L1 data cache, a 256KB 8-way set-associative L2 cache, and a shared multi-banked 16-way set-associative L3 cache with 2MB per core. Following the Nehalem architecture, all caches are write-back and write-allocate. The load and store buffers have 48 and 32 entries, respectively, and support load forwarding within the same core. The re-order buffer (ROB) contains 128 entries, enabling the OOO core to maintain 128 in-flight instructions. The core also has a 36-entry instruction window, which enables it to pick 4 instructions from the 36 candidates every cycle.

We use M5 [3] to evaluate this architecture. We set the access time for each level of cache based on the timing specification of the Nehalem processor. The L1 Icache, L2, and L3 caches are assumed to be fully pipelined and non-blocking. We use CACTI [6] to estimate the memory latency, which is around 90 cycles. The non-blocking Dcache is assumed to be implemented using inverted MSHRs [1] so that an unconstrained non-blocking cache can be achieved. Since the L1 Dcache is write-back and write-allocate, we model the MSHR architecture as in [2] and [11], which extends the MSHR in [1] to support both read and write misses. This extension is necessary because a write-back cache must buffer the data to be written to the cache line until the miss is serviced and space has been allocated for the received line. This advanced MSHR implementation can handle multiple request targets per MSHR entry, eliminating the stalls due to secondary misses: a secondary miss is handled by the same MSHR entry as its primary miss. The L1 Dcache is also modeled with a write-back buffer that helps the MSHRs achieve non-blocking stores. Our experiments show that a 16-deep write-back buffer together with the enhanced MSHRs is sufficient to achieve non-blocking stores for the L1 Dcache.

A fully non-blocking cache with inverted MSHRs would require more than 128 entries, so that each renamed register in the ROB could have an entry as the target of a non-blocking load. However, a 128-entry MSHR would be very expensive to implement, and our simulations show that a 64-entry MSHR is sufficient to eliminate all cache lockup cycles when running SPECCPU2006. Thus, in order to evaluate caches with different levels of non-blocking capability, we set the number of MSHR entries to 0, 1, 2, and 64. The cache with no MSHR (0 entries) is a lockup (blocking) cache. The cache with a 64-entry MSHR can support up to 64 in-flight misses while still servicing requests (hit-under-64-misses) and is effectively an unconstrained non-blocking cache. The detailed architecture used in this study is shown in Table 1.
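Below is a minimal C++ sketch of the miss-handling behavior just described. The class and function names (MSHRFile, handleMiss, fillLine) are ours for illustration, not M5's or the cited MSHR designs' actual interfaces: a primary miss allocates an entry, a secondary miss merges as an extra target on the matching entry, and a full file forces the cache to lock up.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative miss-status holding register (MSHR) file in the spirit
// of [1], [2], and [11]; all names here are ours, chosen for clarity.
struct Target {
    std::uint64_t requesterId; // e.g., load/store queue slot waiting on the data
    bool isWrite;              // write misses buffer their data until the line arrives
};

struct MSHREntry {
    std::uint64_t lineAddr = 0;  // block-aligned address of the in-flight miss
    bool valid = false;          // entry currently tracks a miss
    std::vector<Target> targets; // primary miss plus merged secondary misses
};

class MSHRFile {
public:
    explicit MSHRFile(std::size_t numEntries) : entries_(numEntries) {}

    // Returns false when no entry can track the miss: the cache must
    // lock up until an entry frees (hit-under-N-misses behavior).
    bool handleMiss(std::uint64_t lineAddr, const Target& t) {
        for (auto& e : entries_) {                    // secondary miss:
            if (e.valid && e.lineAddr == lineAddr) {  // merge into entry
                e.targets.push_back(t);
                return true;
            }
        }
        for (auto& e : entries_) {                    // primary miss:
            if (!e.valid) {                           // allocate a free entry
                e.lineAddr = lineAddr;
                e.valid = true;
                e.targets = {t};
                return true;
            }
        }
        return false;                                 // file full: lockup
    }

    // Called when the missing line returns from the next level:
    // wake every merged target and free the entry.
    std::vector<Target> fillLine(std::uint64_t lineAddr) {
        for (auto& e : entries_) {
            if (e.valid && e.lineAddr == lineAddr) {
                std::vector<Target> woken = std::move(e.targets);
                e = MSHREntry{};
                return woken;
            }
        }
        return {};                                    // no entry matched this fill
    }

private:
    std::vector<MSHREntry> entries_;
};
```

With this structure, constructing MSHRFile with 2 entries corresponds to the hit-under-2-misses configuration and 64 entries to the effectively unconstrained one; every false returned by handleMiss corresponds to cache lockup until a fill frees an entry.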
Table 1. Architecture configuration for the Nehalem-like architecture used in this study.

Clock: 2.5GHz
L1 I cache: 32KB, 4-way, 64B line size
L1 Dcache: 32KB, 8-way, 64B line size, 4-cycle access latency; write-back, write-allocate; MSHR with 0 (lockup cache), 1, 2, or 64 (unconstrained non-blocking cache) entries; 16-entry write-back buffer
L2 cache: 256KB, 8-way, 64B line size, 10-cycle access latency
L3 cache: 2MB per core, 16-way, 64B line size, 36-cycle access latency
Memory: DDR3-1600, 90-cycle access latency
Issue width: 4
Instruction window size: 36 entries
ROB size: 128 entries
Load buffer size: 48 entries
Store buffer size: 32 entries
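The four simulated cache configurations differ only in the MSHR entry count. A small sketch of how such a sweep might be parameterized, continuing the illustrative C++ above (DcacheConfig and its fields are hypothetical names, not simulator options):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical parameter block mirroring the Dcache row of Table 1.
struct DcacheConfig {
    std::size_t mshrEntries;        // 0 = lockup cache, 64 = unconstrained
    std::size_t writeBackBufDepth;  // 16-entry write-back buffer in all runs
};

int main() {
    // The four evaluated points: hit-under-0/1/2/64-misses.
    const std::vector<DcacheConfig> sweep = {
        {0, 16}, {1, 16}, {2, 16}, {64, 16},
    };
    for (const DcacheConfig& cfg : sweep) {
        // ... configure the simulated L1 Dcache and run the workload ...
        (void)cfg;
    }
    return 0;
}
```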
The SPEC CPU2006 [4] benchmark suite was used for all our experiments. This benchmark suite consists of integer (CINT) and floating-point (CFP) benchmarks, all of which are single-threaded. We selected 9 of the 12 integer benchmarks and 14 of the 17 floating-point benchmarks, as shown in Table 2. Because of the long simulation times (more than a month for benchmarks such as omnetpp and soplex, according to our own experiments and as in [7]), we used SimPoint to reduce simulation time while maintaining accuracy. We found the representative simulation phases of each application and their weights using SimPoint 3.0 [5]. We then simulated all simulation points and computed the final results from the simulation outputs and the weights of the simulation points, as illustrated in the sketch after Table 2. All SPEC CPU2006 benchmarks were compiled with -O3 optimization using gcc 4.2 (no further optimizations are allowed according to the SPEC CPU2006 specification).

Table 2. SPECCPU2006 benchmarks used in the experiments.

SPECINT: bzip2, gcc, mcf, hmmer, sjeng, libquantum, h264ref, omnetpp, astar
SPECFP: gamess, zeusmp, milc, gromacs, cactusADM, namd, soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf, sphinx3
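The phase-weighted aggregation described above amounts to a weighted sum over the SimPoint phases. A minimal sketch follows; the phase weights and CPI values in main are made-up placeholders, not measured results:

```cpp
#include <cstdio>
#include <vector>

// One representative SimPoint phase: the fraction of execution it
// stands for (weights across phases sum to 1.0) and the CPI
// simulated over its interval.
struct Phase {
    double weight;
    double cpi;
};

// Whole-program estimate: the weight-sum of the per-phase CPIs.
double weightedCPI(const std::vector<Phase>& phases) {
    double cpi = 0.0;
    for (const Phase& p : phases) cpi += p.weight * p.cpi;
    return cpi;
}

int main() {
    // Placeholder weights/CPIs for illustration only.
    const std::vector<Phase> phases = {{0.25, 1.8}, {0.50, 0.9}, {0.25, 1.2}};
    std::printf("estimated CPI = %.3f\n", weightedCPI(phases));
    return 0;
}
```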
3. Results

[Figure 1 chart: "Average Cache Block Cycle Ratio" per SPECINT and SPECFP benchmark; series: Hit-under-1-miss, Hit-under-2-misses, Hit-under-64-misses; y-axis: 0% to 60%.]
Figure 1. The ratio of the average Dcache/memory block cycles as the cache goes from lockup to fully non-blocking, as the number of supported outstanding misses varies, for the SPECCPU2006 benchmarks. The average cache block cycles are measured as cache block cycles per memory (Dcache) access. All numbers are normalized against the lockup cache setup (hit-under-0-miss). There are 9 integer and 14 floating-point benchmarks. For the integer programs, the average ratio of Dcache/memory block cycles is 15.72% for hit-under-1-miss and 5.11% for hit-under-2-misses. For the floating-point programs, the two averages are 23.89% and 7.00%, respectively. Hit-under-64-misses eliminates all block cycles for both the CINT and CFP benchmarks, which means that all remaining machine stall cycles are due to other causes, for example a data dependency on an outstanding miss.

A non-blocking cache can reduce the lockup time of the cache/memory subsystem, which in turn reduces the processor stall cycles incurred when the cache cannot service accesses after lockup. Figure 1 shows the ratio of the average Dcache/memory block cycles as the cache goes from lockup to fully non-blocking. The average cache block cycles are measured as cache block cycles per memory (Dcache) access, and are dictated by both the non-blocking level of the cache and the behavior of the benchmarks (i.e., the cache miss ratio and the clustering pattern of the memory instructions). All numbers in Figure 1 are normalized against the lockup cache setup (hit-under-0-miss), so that the effectiveness of the different levels of non-blocking is clearly demonstrated. On average, hit-under-2-misses reduces the memory block cycles by 94.89% and 93% for SPECINT and SPECFP, respectively. This demonstrates that a two-entry MSHR is enough for the SPECCPU2006 workloads to achieve non-blocking cache behavior.

The average cache/memory block cycles affect both the Dcache access latency and the miss latency, since once the cache is locked up, neither hits nor misses can be served (although it is the misses that cause the cache lockup). Figures 2 and 3 show the impact of non-blocking caches on the Dcache access latency and the miss latency, respectively. The impact of non-blocking caches is larger on the access latency than on the miss latency of the Dcache, since the former is smaller than the latter. Figure 4 shows the miss rates of all caches. Combining Figures 2, 3, and 4, it can be seen that non-blocking caches have greater benefits for benchmarks with higher miss rates, such as mcf and lbm.
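All of the normalized ratios in Figures 1, 2, 3, and 5 have the same form; in our notation (not the report's),

\[
  R_X(n) \;=\; \frac{X_{\text{hit-under-}n\text{-misses}}}{X_{\text{hit-under-0-miss}}},
\]

where \(X\) is the average block cycles per Dcache access (Figure 1), the Dcache access latency (Figure 2), the Dcache miss latency (Figure 3), or the CPI (Figure 5). For example, the 5.11% SPECINT ratio for hit-under-2-misses is exactly the quoted 94.89% reduction, since \(1 - 0.0511 = 0.9489\).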
[Figure 2 chart: "Dcache Latency Ratio" per SPECINT and SPECFP benchmark; series: Hit-under-1-miss, Hit-under-2-misses, Hit-under-64-misses; y-axis: 40% to 100%.]
Figure 2. The ratio of the Dcache access latency as the cache goes from lockup to fully non-blocking, as the number of supported outstanding misses varies, for the SPECCPU2006 benchmarks. All numbers are normalized against the lockup cache setup (hit-under-0-miss).
[Figure 3 chart: "Dcache Miss Latency Ratio" per SPECINT and SPECFP benchmark; series: Hit-under-1-miss, Hit-under-2-misses, Hit-under-64-misses; y-axis: 40% to 100%.]
Figure 3. The ratio of the Dcache miss latency as the cache goes from lockup to fully non-blocking, as the number of supported outstanding misses varies, for the SPECCPU2006 benchmarks. All numbers are normalized against the lockup cache setup (hit-under-0-miss).
[Figure 4 chart: "Miss Rate" per SPECINT and SPECFP benchmark; series: L1 Miss, L2 Miss, L3 Miss; y-axis: 0% to 100%.]
Figure 4. Miss rates of all caches for the SPECCPU2006 benchmarks.

Average memory stall cycles cannot tell the whole story of the performance impact of non-blocking caches on this architecture, because the practical Nehalem-like machine we evaluate can still stall for various reasons besides cache lockup, including a full ROB, a full instruction window, busy functional units, branch mispredictions, etc. For example, if 10% of cycles are memory lockup stall cycles in the blocking case and the non-blocking cache eliminates all of them, then the average memory lockup stall cycles are reduced by 100%; however, this reduction translates into a different reduction ratio for the cache/memory access latency and miss latency, which in turn affect overall processor performance. Thus, it is important to evaluate the performance impact of the non-blocking cache on the overall CPI of the processor. Figure 5 shows the impact of the non-blocking Dcache on the overall CPI. Since all other architecture parameters are kept unchanged, the CPI ratio is directly caused by the use of the non-blocking Dcache. All numbers in Figure 5 are normalized against the lockup cache setup (hit-under-0-miss), so that the effectiveness of caches with different levels of non-blocking can be clearly evaluated. On average, the cache with hit-under-2-misses reduces the CPI by 8.36% for SPECINT and 16.22% for SPECFP, while the fully non-blocking Dcache reduces the CPI by 9.02% for SPECINT and 17.76% for SPECFP. This shows that a non-blocking Dcache supporting two in-flight misses achieves performance benefits comparable to a fully non-blocking cache, but at a much lower implementation cost.
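To make the caveat above explicit, a simple decomposition (ours, not the report's) bounds how much eliminating lockup cycles can help:

\[
  \mathrm{CPI} \;=\; \mathrm{CPI}_{\mathrm{base}} \;+\; \mathrm{CPI}_{\mathrm{lockup}} \;+\; \mathrm{CPI}_{\mathrm{other}},
  \qquad
  \frac{\Delta\,\mathrm{CPI}}{\mathrm{CPI}} \;\le\; \frac{\mathrm{CPI}_{\mathrm{lockup}}}{\mathrm{CPI}}.
\]

Even if the non-blocking cache removes every lockup cycle, the relative CPI gain is bounded by the lockup share of the total CPI, and some lockup cycles overlap other stalls; this is why a 94.89% reduction in SPECINT block cycles translates into only an 8.36% CPI improvement.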
[Figure 5 chart: "CPI Ratio" per SPECINT and SPECFP benchmark; series: Hit-under-1-miss, Hit-under-2-misses, Hit-under-64-misses; y-axis: 40% to 100%.]
Figure 5. The ratio of the CPI as the cache goes from lockup to fully non-blocking, as the number of supported outstanding misses varies, for the SPECCPU2006 benchmarks. All numbers are normalized against the lockup cache setup (hit-under-0-miss). For the integer programs, the average performance (measured as CPI) improvement over the lockup cache is 7.08% for hit-under-1-miss, 8.36% for hit-under-2-misses, and 9.02% for hit-under-64-misses (essentially the unconstrained non-blocking cache). For the floating-point programs, the three numbers are 12.69%, 16.22%, and 17.76%, respectively.
4. Conclusions

In this report, we studied the performance impact of a non-blocking data cache on practical high performance OOO processors using the latest representative applications: the SPECCPU2006 benchmark suite. Overall, the non-blocking cache can improve performance by 17.76% over a lockup cache. We found that a cache supporting 2 in-flight misses is sufficient to eliminate the majority of the memory stall cycles, and the processor stall cycles they induce, for most of the SPECCPU2006 benchmarks. This is the design sweet spot that achieves a balanced trade-off between performance gain and implementation complexity. Finally, our study shows trends similar to, but of smaller magnitude than, the earlier study [1] that assumed a perfect single-issue processor. This is because stalls from other causes, such as hardware resource conflicts, branch mispredictions, and long floating-point operation latencies, attenuate the performance dependence on the non-blocking cache.
5. References

[1] Keith I. Farkas and Norman P. Jouppi, "Complexity/Performance Tradeoffs with Non-Blocking Loads," ISCA, 1994.
[2] J. Tuck et al., "Scalable Cache Miss Handling for High Memory-Level Parallelism," MICRO 39, 2006.
[3] N. L. Binkert et al., "The M5 Simulator: Modeling Networked Systems," IEEE Micro, vol. 26, no. 4, pp. 52-60, 2006.
[4] J. L. Henning, "Performance Counters and Development of SPEC CPU2006," Computer Architecture News, vol. 35, no. 1, 2007.
[5] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically Characterizing Large Scale Program Behavior," ASPLOS, Oct. 2002.
[6] CACTI 6.5, http://www.hpl.hp.com/research/cacti/
[7] Karthik Ganesan, Deepak Panwar, and Lizy John, "Generation, Validation and Analysis of SPEC CPU2006 Simulation Points Based on Branch, Memory, and TLB Characteristics," 2009 SPEC Benchmark Workshop, Austin.
[8] R. Kumar and G. Hinton, "A Family of 45nm IA Processors," ISSCC, 2009.
[9] P. Kongetira, K. Aingaran, and K. Olukotun, "Niagara: A 32-Way Multithreaded Sparc Processor," IEEE Micro, vol. 25, no. 2, 2005.
[10] Marc Tremblay and Shailender Chaudhry, "A Third-Generation 65nm 16-Core 32-Thread Plus 32-Scout-Thread CMT SPARC Processor," ISSCC, 2008.
[11] M. Jahre and L. Natvig, "Performance Effects of a Cache Miss Handling Architecture in a Multicore Processor," NIK-2007 Conference, 2007.