Optimal Bypass Monitor for High Performance Last-level Caches Lingda Li, Dong Tong, Zichao Xie, Junlin Lu, Xu Cheng Microprocessor Research and Development Center, Peking University, Beijing, China
{lilingda, tongdong, xiezichao, lujunlin, chengxu}@mprc.pku.edu.cn
ABSTRACT
In the last-level cache, a large fraction of blocks have reuse distances greater than the available cache capacity. Cache performance and efficiency can be improved if some subset of these distant reuse blocks can reside in the cache longer. Bypassing is an effective and attractive solution that prevents the insertion of harmful blocks. Our analysis shows that bypass can contribute significant performance improvement, and that the optimal bypass achieves performance similar to OPT+B, the theoretical optimal replacement policy. We therefore propose a bypass technique called the Optimal Bypass Monitor (OBM), which makes bypass decisions by learning and predicting the behavior of the optimal bypass. OBM keeps a short global history of incoming-victim block pairs; by detecting the first reused block in each pair, the behavior of the optimal bypass on that history can be asserted and used to guide bypass decisions. Any existing replacement policy can be extended with OBM with negligible design modification. Our experimental results show that, using less than 1.5KB of extra storage, OBM with the NRU replacement policy outperforms LRU by 9.7% and 8.9% for single-thread and multi-programmed workloads respectively. Compared with other state-of-the-art proposals such as DRRIP and SDBP, it achieves superior performance with less storage overhead.
Categories and Subject Descriptors
B.3.2 [Memory Structures]: Design Styles—cache memories

General Terms
Design, Performance

Keywords
Optimal Bypass, Replacement, Last-level Cache

1. INTRODUCTION

Energy efficiency is a critical metric in modern processor design, and future processors tend to integrate larger last-level caches (LLCs) because of their high energy efficiency [3]. The cache management policy is therefore critical to processor performance and efficiency. The commonly used Least Recently Used (LRU) policy and its approximations perform poorly for LLCs because most temporal locality is filtered by the inner-level caches. This paper focuses on improving LLC performance and efficiency at very low hardware cost¹.

As shown in Figure 1, the reuse distances of numerous LLC blocks (cache lines) are greater than the cache size, which leads to the poor performance of LRU. We call such blocks distant reuse blocks. A good LLC management policy should first retain cache blocks with high temporal locality and then avoid the thrashing caused by distant reuse blocks. To achieve these goals, various replacement policies [12, 27, 38] insert distant reuse blocks at the LRU position; dead block prediction techniques [10, 19, 20, 21, 22] try to identify distant reuse blocks and evict them earlier; and several adaptive methods [12, 27, 28, 32] dynamically switch between replacement policies to accommodate changing access patterns. However, these proposals either achieve limited performance improvement or require significant hardware overhead and large modifications to the cache structure. There remains considerable room for exploring LLC management policies.

Bypass is an effective and attractive technique. Figure 2 shows that on average 81.2% of blocks are not reused before eviction in a 2MB LLC. Among them, 25.6% of blocks are never accessed again and should be bypassed rather than inserted into the LLC. For the remaining 55.6% of blocks, Belady's OPT [2] with bypass (OPT+B), the theoretical optimal replacement policy, first inserts the blocks with minimal reuse distances to fill the cache and then bypasses the others to avoid thrashing. Moreover, our experiments show that the optimal bypass, which bypasses the incoming block if its reuse distance is larger than or equal to that of the victim block selected by the baseline replacement policy, achieves performance similar to OPT+B. Thus, bypass can contribute significantly to LLC performance. Bypass also saves energy by reducing replacements and writebacks.
¹ Unless stated otherwise, cache refers to the last-level cache in this paper.
[Figure 1 plot omitted: normalized hits (y-axis, 0–8%) vs. LRU stack distance (x-axis, 0MB–8MB) for perlbench, mcf, and sphinx3. Figure 2 plot omitted: percentage of non-reused blocks (y-axis, 0–100%) per benchmark.]
Figure 1: Normalized hit distribution in the stack of a 64-way 8MB LRU LLC. Many blocks have reuse distances larger than 2MB.
Figure 2: Percentage of distant reuse blocks in a 2MB LRU LLC. Over a memory-intensive subset of the SPEC CPU2006 benchmarks, 81.2% of blocks are not reused before eviction; among them, 25.6% are never accessed again and should be bypassed to improve cache performance and efficiency.
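To make the reuse distances of Figure 1 concrete, the following is a minimal sketch (ours, not from the paper) of how LRU stack distances can be measured over a block-address trace; the trace format and the byte conversion are assumptions.

#include <cstdint>
#include <iterator>
#include <list>
#include <unordered_map>

// Measures the LRU stack distance of each access in a block-address trace:
// the number of distinct blocks touched since the previous access to the
// same block. Multiplying by the 64B block size gives distances in bytes,
// comparable to the x-axis of Figure 1.
class StackDistance {
  std::list<uint64_t> stack_;  // MRU block at the front
  std::unordered_map<uint64_t, std::list<uint64_t>::iterator> pos_;
public:
  // Returns the stack distance in blocks, or -1 for a first-time access.
  long access(uint64_t block_addr) {
    long depth = -1;
    auto it = pos_.find(block_addr);
    if (it != pos_.end()) {
      depth = static_cast<long>(std::distance(stack_.begin(), it->second));
      stack_.erase(it->second);  // remove the old position before re-inserting
    }
    stack_.push_front(block_addr);
    pos_[block_addr] = stack_.begin();
    return depth;
  }
};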
In this paper, we propose a bypass technique called the Optimal Bypass Monitor (OBM), which learns and predicts the behavior of the optimal bypass to make bypass decisions. OBM uses a small Replacement History Table (RHT) to keep track of recent incoming-victim block pairs. On each cache access, the current incoming block and victim candidate are compared against the RHT contents to assert the behavior of the optimal bypass on a recorded pair. A PC-indexed Bypass Decision Counter Table (BDCT) of saturating counters learns whether bypass or replacement has recently been the dominant action of the optimal bypass, and determines whether bypass should be used. OBM can be applied to a current LLC design with any replacement policy, requires negligible modification to the existing cache design, and is both thread-aware and prefetch-aware.

We evaluate OBM with NRU, LRU, and SRRIP [12]. Our evaluation shows that all three combinations improve cache performance significantly while requiring less than 1.5KB of extra storage. Among them, OBM with NRU performs well enough while requiring the minimal hardware cost: on average it outperforms LRU by 9.7% and 8.9% for single-thread and multi-programmed workloads respectively in the absence of prefetching, and it also improves performance significantly in the presence of prefetching. Compared to other state-of-the-art proposals including DRRIP [12], SDBP [19], and DSB [6], OBM with NRU delivers superior performance with less storage overhead.

The rest of this paper is organized as follows. Section 2 presents the motivation for OBM. Section 3 describes the design and implementation of OBM. We describe the experimental methodology and analyze the results in Sections 4 and 5. Section 6 discusses related work. Finally, the paper is concluded in Section 7.
2. MOTIVATION

Bypass is a promising technique for LLCs due to its high performance and efficiency, so we aim to design an LLC bypass technique in this paper. To understand how bypass should work, we first study OPT [2] with bypass (OPT+B), the theoretical optimal replacement policy for minimizing misses. On a miss, among all victim candidates, OPT selects the block with the largest reuse distance for replacement. When enhanced with bypass, OPT+B does not replace any block if the reuse distance of the incoming block is larger than or equal to that of every victim candidate.

Figure 3 illustrates the behavior of OPT+B for four representative cache access patterns [12]. Let a_i denote a cache block, (a1, a2, a3, ..., an) denote an access sequence from block a1 to block an, and (a1, a2, a3, ..., an)^I denote an access sequence repeated I times. For cache-friendly access patterns, OPT+B behaves like a normal replacement policy and inserts all incoming blocks because of their good locality. For streaming access patterns, since no block has temporal locality, OPT+B bypasses all incoming blocks; although these bypasses cannot improve performance, the power used for replacement is saved, improving efficiency. While the performance of LRU and its approximations is close to that of OPT+B for cache-friendly and streaming patterns, they perform poorly for the remaining two patterns. For thrashing access patterns, OPT+B first places a subset of the working set into the cache and then bypasses the rest of the blocks to avoid thrashing. In mixed access patterns, blocks with different locality are interleaved; the accesses to blocks with poor locality are called scans because their insertion evicts useful blocks from the cache. For the example in Figure 3, OPT+B retains d1 and d2, which have the best locality, and a subset of the e_i, while bypassing the rest.

Our analysis of OPT+B shows that bypass plays an important role in improving LLC performance and efficiency. OPT+B uses both the optimal replacement and the optimal bypass, but the optimal bypass by itself achieves similar performance. The optimal bypass compares the reuse distances of the incoming block and the victim block selected by the baseline replacement policy: if the incoming block has a larger or equal reuse distance, it is bypassed; otherwise it replaces the victim. Figure 4 shows that for a 16-way 2MB LLC, the optimal bypass with LRU (LRU+OB) reduces average misses by 23.5% compared to LRU and bridges roughly four-fifths of the gap between LRU and OPT+B. It also dramatically outperforms recent proposals such as DRRIP [12] and SDBP [19].
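The optimal bypass decision defined above fits in one line of code. The following is a minimal offline sketch of ours, assuming a trace analysis in which each block's next-use time is known; in hardware this future information is unavailable, which is exactly why OBM learns the outcome after the fact.

#include <cstdint>

// Offline oracle form of the optimal bypass decision described above.
// Each argument is the timestamp of that block's next access, with
// UINT64_MAX standing for "never accessed again" (infinite reuse distance).
// With a common clock, comparing next-use times is equivalent to comparing
// forward reuse distances measured from the current miss.
bool optimal_bypass(uint64_t incoming_next_use, uint64_t victim_next_use) {
  // Bypass if the incoming block is reused no sooner than the victim
  // selected by the baseline replacement policy.
  return incoming_next_use >= victim_next_use;
}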
[Figure 3 diagram omitted: per-access cache contents and outcomes for the four patterns — cache-friendly (a1, a2, a3, a4)^I, streaming (b1, b2, b3, b4, ...), thrashing (c1, c2, c3, c4, c5, c6)^J, and mixed [(d1, d2)^2, e1, e2, e3, e4]^K — on a 4-entry cache. Legend: H: hit; R: replacement; B: bypass.]
Figure 3: The behavior of OPT+B for representative cache access patterns using a 4-entry cache.
[Figure 4 plot omitted: normalized MPKI (y-axis, 20%–120%) for DRRIP, SDBP, LRU+OB, and OPT+B across benchmarks.]
Figure 4: Normalized LLC MPKI for LRU+OB and OPT+B.

Although it remains impractical to implement the optimal bypass itself, because it requires future information, we can learn its past behavior to predict its future behavior. Therefore, to achieve high performance, the proposed bypass technique should behave similarly to the optimal bypass.

3. OPTIMAL BYPASS MONITOR

Our goal is to design a new bypass technique that makes proper bypass decisions by dynamically learning and predicting the behavior of the optimal bypass. Thus, we propose the Optimal Bypass Monitor (OBM).

3.1 Overview

On a cache miss, the optimal bypass decides whether or not to bypass based on the reuse distances of the incoming block (IB) and its corresponding victim block (VB), which is selected by the baseline replacement policy. Therefore, to learn the behavior of the optimal bypass, we propose the Replacement History Table (RHT) to keep track of IB-VB pairs on cache misses. Then, according to the relative reuse order of a recorded pair, the behavior of the optimal bypass on that pair can be asserted:

• If IB is accessed earlier, the reuse distance of IB is smaller than that of VB. For that IB-VB pair, the optimal bypass should replace VB with IB.
• If VB is accessed earlier, the reuse distance of IB is larger than that of VB. For that IB-VB pair, the optimal bypass should bypass IB.
• If neither IB nor VB is accessed in the future, the reuse distances of IB and VB are both infinite, and the optimal bypass should also bypass IB.

These three conditions are called Optimal Bypass Assertions, because we can assert the behavior of the optimal bypass by detecting the occurrence of one of them. Table 1 lists the three assertions. Since Assertion 3 is impossible to detect directly, we consider it satisfied when IB and VB both have reuse distances larger than the cache size. This conversion is reasonable because if neither block is reused before eviction, IB should be bypassed to improve efficiency. Our experiments show that under LRU the three assertions occur with probabilities of 12.9%, 78.7%, and 8.4% respectively; therefore, all three assertions are essential to learn the behavior of the optimal bypass accurately.

Using the Optimal Bypass Assertions, we can learn the behavior of the optimal bypass on the RHT contents, and saturating counters called Bypass Decision Counters (BDCs) record the learning results. All BDCs are kept in the Bypass Decision Counter Table (BDCT) and initialized to -1. When the behavior of the optimal bypass on a recorded pair is detected, the signature of the IB in that pair indexes the BDCT to update the corresponding BDC: if Assertion 1 is satisfied, the BDC is decremented; if Assertion 2 or 3 is satisfied, it is incremented. On a miss, the incoming block consults the BDCT to find out whether it should be bypassed. If the BDC indexed by its signature is greater than or equal to 0, the optimal bypass has recently favored bypassing blocks with that signature, so OBM predicts that the incoming block should be bypassed. Otherwise, replacement has recently been the dominant behavior of the optimal bypass for blocks with that signature, and OBM predicts that the incoming block should be placed in the cache.
Table 1: Optimal Bypass Assertions to detect the behavior of the optimal bypass.

            | Action on IB-VB pair             | Reuse distance relationship                     | Optimal bypass behavior
Assertion 1 | Current incoming block hits IB.  | IB's reuse distance < VB's reuse distance       | Replace VB with IB
Assertion 2 | Current incoming block hits VB.  | IB's reuse distance > VB's reuse distance       | Bypass IB
Assertion 3 | Current victim block hits IB.    | IB's and VB's reuse distances both > cache size | Bypass IB
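The counter logic behind these assertions is compact. Below is a sketch of ours assuming the 4-bit BDCs of Table 2; the signed range [-8, +7] is our assumed encoding, while the 1024-entry table, the -1 initialization, and the update directions come from the text above.

#include <algorithm>
#include <array>

// Bypass Decision Counter Table: 4-bit saturating counters indexed by a
// 10-bit PC signature, initialized to -1 (a slight initial bias to insert).
struct BDCT {
  std::array<int, 1024> bdc;
  BDCT() { bdc.fill(-1); }

  // Assertion 1 (IB reused first): replacement was right, so count down.
  void train_replace(unsigned sig) { bdc[sig] = std::max(bdc[sig] - 1, -8); }
  // Assertion 2 or 3 (VB reused first, or neither reused): bypass was right.
  void train_bypass(unsigned sig)  { bdc[sig] = std::min(bdc[sig] + 1,  7); }
  // Prediction: bypass when bypasses have dominated recently.
  bool should_bypass(unsigned sig) const { return bdc[sig] >= 0; }
};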
[Figure 5 diagram omitted: the Optimal Bypass Monitor (RHT and BDCT) sits alongside the existing LLC design. The L2 cache and prefetcher issue demand/prefetch accesses; on a miss, the victim candidate address is checked against the RHT, assertion detections update the BDCT, and the indexed BDC produces the bypass decision. Each RHT entry holds the fields V, RP, P, SI, IT, and VT.]
On an access to block x:
  if x is a demand access
    for each valid entry A in the corresponding set of RHT
      if x.tag == A.IT            // Assertion 1
        BDCT[A.SI]--; Invalidate A;
      else if x.tag == A.VT       // Assertion 2
        BDCT[A.SI]++; Invalidate A;
  if x misses in the cache
    y = Select_Victim_Candidate();
    for each valid entry A in the corresponding set of RHT
      if y.tag == A.IT            // Assertion 3
        BDCT[A.SI]++; Invalidate A;
    if RHT.Record(x) == true
      B = RHT.Select_Victim(x);   // Select an entry to record x
      B.SI = x.signature; B.IT = x.tag; B.VT = y.tag;
    if BDCT[x.signature] >= 0
      Bypass x;
    else
      Replace y with x;
Entry of RHT: SI: IB signature. V: valid bit. RP: replacement policy bits. IT: IB tag. VT: VB tag. P: prefetch bit.
RHT: Replacement History Table. BDC: Bypass Decision Counter. BDCT: Bypass Decision Counter Table.
Figure 5: The structure of OBM.
Figure 6: The algorithm of OBM.
Figures 5 and 6 illustrate the structure and algorithm of OBM respectively. OBM requires only a small amount of information from the baseline LLC, and it uses a single signal to inform the LLC whether to bypass the current miss. Therefore, the existing LLC design does not need to be changed. OBM can be used with any deterministic replacement policy. It can also potentially cooperate with non-deterministic policies such as random replacement; however, because random replacement chooses victim blocks randomly, the behavior of the optimal bypass becomes less predictable, and OBM with the random policy does not work as well as with other policies, although it still outperforms LRU. Thus, we evaluate OBM only with deterministic replacement policies: the Not Recently Used policy (NRU), LRU, and SRRIP [12].
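Since NRU is the cheapest of these policies and features throughout the evaluation, a sketch is given below; this is the textbook one-bit NRU formulation (some variants also set the bit on fill), not code from the paper.

#include <vector>

// Textbook one-bit NRU: each block has a "recently used" bit. On a hit the
// bit is set; the victim is the first block whose bit is clear. If every
// bit is set, all bits are cleared first, so a victim always exists.
class NRUSet {
  std::vector<bool> ru_;  // one bit per way
public:
  explicit NRUSet(int ways) : ru_(ways, false) {}
  void on_hit(int way) { ru_[way] = true; }
  int pick_victim() {
    for (int pass = 0; pass < 2; ++pass) {
      for (int w = 0; w < static_cast<int>(ru_.size()); ++w)
        if (!ru_[w]) return w;
      ru_.assign(ru_.size(), false);  // all bits set: reset and retry
    }
    return 0;  // unreachable
  }
};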
3.2 Implementation Details

The RHT can be organized as fully associative, set-associative, or direct-mapped; in practice we use a 16-way set-associative RHT. Each RHT entry contains six fields: a valid bit indicating whether the entry is valid (an entry is invalidated once the behavior of the optimal bypass on it is detected); the RP bits implementing the RHT replacement policy, which is similar to LRU; the P bit introducing prefetch awareness; the SI bits keeping the signature of IB; and IT and VT storing the tags of IB and VB respectively. We use partial tags to reduce the RHT storage overhead: IT and VT store the lower 21 bits of the tags instead of the whole tags.

To further reduce the storage overhead, it is not necessary to record all misses in the RHT. A miss is recorded only if there is an invalid entry in the corresponding RHT set or a low-probability condition is satisfied. Our experiments show that, for a 16-way 128-entry RHT and a 16-way 2MB LLC, OBM performs well when 1/512 of misses are recorded.

Various kinds of block signatures can be used to index the BDCT, such as the memory address, the instruction program counter (PC), or others [35]. Previous studies have shown that PC-based methods can be more effective, so we use the PC of the instruction that causes the miss as the signature to update and consult the BDCT. For a 1024-entry BDCT, the lower 10 bits of the PC index the table. Like all PC-based methods [17, 19, 20, 21, 23, 35, 37], the shortened PC is delivered along with the request through the cache hierarchy.

3.3 Thread-awareness

OBM is naturally thread-aware because of its PC-based design. Since OBM makes bypass decisions based on the behavior of the optimal bypass, it can implicitly partition the shared LLC to minimize total misses: a program with poor locality has more of its blocks bypassed, releasing cache space to improve the performance of programs with good locality.

3.4 Prefetch-awareness

Prefetching is an important feature of modern high-performance processors, and a simple extension makes OBM prefetch-aware. Demand and prefetch accesses usually behave differently. For instance, in a thrashing access pattern, demand accesses have large reuse distances and should be bypassed, while prefetched blocks are expected to be used soon and should be inserted. Therefore, it is reasonable to make bypass decisions for demand and prefetch accesses separately. In the prefetch-aware OBM, we assign an additional BDC to each core, dedicated to prefetch accesses. The prefetch bit (P in Figure 5) in an RHT entry indicates whether the entry records a prefetch access, and for such entries the signature field stores the core number of the incoming block instead of the shortened PC. A sketch of the signature selection and miss sampling follows.
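In this sketch of ours, the 10-bit PC signature, the per-core prefetch BDC, and the 1/512 recording probability come from the text above; the software PRNG is an assumption (a hardware design would more likely use an LFSR).

#include <cstdint>
#include <random>

// Signature selection and miss sampling for the OBM front end.
struct ObmFrontEnd {
  std::mt19937 rng{12345};  // stand-in for a hardware pseudo-random source

  // Lower 10 bits of the miss-causing PC index the 1024-entry BDCT.
  static unsigned signature(uint64_t pc) { return pc & 0x3FF; }

  // For prefetch-aware OBM, prefetches are tracked per core rather than
  // per PC, using one dedicated BDC per core.
  static unsigned prefetch_signature(unsigned core_id) { return core_id; }

  // Record a miss in the RHT if its set has a free entry, or otherwise
  // with probability 1/512 (0x1FF masks 9 bits, i.e. 512 values).
  bool should_record(bool set_has_invalid_entry) {
    return set_has_invalid_entry || (rng() & 0x1FF) == 0;
  }
};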
[Figure 7 plot omitted: MPKI normalized to LRU (y-axis, 0.2–1.2) for NRU, DIP, DRRIP, SDBP, DSB, and NRU+OBM across benchmarks.]
Figure 7: Reduction in MPKI normalized to LRU.
4. EXPERIMENTAL METHODOLOGY

4.1 Simulator

We use CMP$im, a Pin-based trace-driven x86 simulator, in the modified version provided for the 1st JWAC Cache Replacement Championship. It models an out-of-order, 4-wide, 8-stage pipeline with a 128-entry instruction window. The microarchitectural parameters of the memory hierarchy, shown in Table 2, are similar to those of the Intel Core i7 [11]. The L1 and L2 caches are private to each core. In the single-core configuration, the LLC (L3 cache) is a 16-way 2MB cache for single-thread workloads; in the 4-core configuration, it is a 16-way 8MB cache for multi-programmed workloads. The simulator also models a stream hardware prefetcher in each L2 cache, and prefetched blocks are inserted into both the L2 and L3 caches.

4.2 Benchmarks

We use the SPEC CPU2006 benchmarks [9] with the first reference inputs, compiled using GCC 4.5.2 with -O2 optimizations. We use PinPoints [26] to obtain a single representative one-billion-instruction slice for each benchmark. Among these benchmarks, gamess, namd, povray, sjeng, tonto, and specrand are not evaluated because their working sets are so small that their misses are mostly compulsory misses, and their performance improves by less than 1% when the cache size increases from 2MB to 32MB under LRU. The remaining 23 benchmarks are used in our experiments². For multi-programmed workloads, we randomly choose four benchmarks from our memory-intensive subset of SPEC CPU2006 to form each mix, creating 15 mix workloads in total. Simulations run until all benchmarks have executed one billion instructions; if a benchmark finishes its one billion instructions early, it restarts from the beginning to continue modeling the contention among the four cores. Our experimental methodology is similar to other recent work.

5. RESULTS AND ANALYSIS

5.1 OBM with NRU

We first evaluate the performance of the Optimal Bypass Monitor in the absence of hardware prefetching. Figure 7 shows misses per thousand instructions (MPKI) normalized to LRU for OBM with NRU (NRU+OBM) and other techniques. Besides the baseline NRU and NRU+OBM, we also investigate the recently proposed DIP [27], DRRIP³ [12], sampling dead block prediction (SDBP) [19], and dueling segmented LRU with adaptive bypassing (DSB)⁴ [6]. While the MPKI of NRU is similar to that of LRU, NRU+OBM reduces MPKI by 15.5% on average compared to LRU, with a best reduction of 65.1% for sphinx3. NRU+OBM also outperforms the other proposals: DIP reduces MPKI by 8.2% on average, DRRIP by 9.7%, SDBP by 10.4%, and DSB by 13.0%. Figure 8 shows the speedup over LRU, computed by dividing each proposal's IPC by the IPC of LRU. The geometric mean speedup of NRU+OBM is 9.7%, versus 4.9% for DIP, 6.0% for DRRIP, 6.3% for SDBP, and 7.5% for DSB. Moreover, the performance degradation relative to the baseline NRU is less than 0.5% for all benchmarks. These results show that NRU+OBM delivers significantly better performance than other recent work.
² Our infrastructure cannot address gobmk, and previous work has reported that gobmk is not memory-intensive.
³ We use 2-bit SRRIP and DRRIP in this paper because they perform slightly better in our experiments.
⁴ We use its second configuration, which is reported to perform best in their paper.
Table 2: Parameters of the memory hierarchy.

Parameter        | Configuration
L1 ICache        | 64B blocks, 32KB, 4-way, 1 cycle, LRU
L1 DCache        | 64B blocks, 32KB, 8-way, 1 cycle, LRU
L2 Cache         | 64B blocks, 256KB, 8-way, 10 cycles, LRU
Last-level Cache | 64B blocks, 2MB per core, 16-way, 30 cycles
Memory Latency   | 200 cycles
RHT              | 128-entry, 16-way
BDCT             | 4-bit BDCs, 1 entry per core + 1024 entries
[Figure 8 plot omitted: IPC normalized to LRU (y-axis, 0.9–1.4) for NRU, DIP, DRRIP, SDBP, DSB, and NRU+OBM; off-scale bars are labeled 1.42–2.10.]
Figure 8: Speedup for various policies.
[Figure 9 plot omitted: percentage of bypassed blocks (y-axis, 0–100%) per benchmark.]
Figure 9: Percentage of bypassed blocks.

Figure 9 shows the fraction of incoming blocks bypassed by NRU+OBM: 75.2% on average. For benchmarks with a large fraction of non-reused blocks in Figure 2, such as cactusADM, libquantum, and sphinx3, most incoming blocks are bypassed and OBM performs well. Even for benchmarks where LRU already performs well, such as zeusmp, dealII, and astar, roughly half of the misses are bypassed. Although these bypasses do not improve performance, they reduce the number of writebacks of dirty victim blocks; as a result, contention on shared system buses is reduced and the power used to write back dirty victims is saved as well.
5.2 OBM with Other Replacement Policies

Besides NRU, we also investigate the performance of OBM with LRU and SRRIP. The results are similar to those of NRU+OBM: LRU+OBM outperforms LRU by 9.5% with an average MPKI reduction of 14.6%, and SRRIP+OBM outperforms LRU by 9.4% while reducing MPKI by 15.5%. Compared to plain SRRIP, SRRIP+OBM is 7.7% faster. These results show that OBM cooperates well with various replacement policies. Among them, NRU needs the least storage and is the simplest, while the performance of NRU+OBM is comparable to that of LRU+OBM and SRRIP+OBM. Therefore, we focus on NRU+OBM in the following experiments.

5.3 Sensitivity to the Sizes of the RHT and BDCT

Figure 10 studies the sensitivity of NRU+OBM performance to the RHT and BDC sizes using a 1024-entry BDCT. The number of RHT entries is varied from 16 to 1024 while the associativity is fixed at 16, the probability of recording misses in the RHT is adjusted accordingly, and the number of BDC bits is varied from 2 to 5. Figure 10 shows that a 128-entry RHT with 4-bit BDCs performs sufficiently well. A small RHT with large BDCs performs poorly because the learning process is slower; conversely, a large RHT with small BDCs also degrades performance because the BDCs are easily affected by accidental events and become overly sensitive. Figure 11 studies the sensitivity to the BDCT size when using a 128-entry RHT and 4-bit BDCs; the results show that a 1024-entry BDCT is sufficient and more entries are unnecessary. We also note that even a 16-entry RHT with a 128-entry BDCT and 3-bit BDCs achieves a significant speedup of 8.2% over LRU while requiring only 0.17KB of extra storage.

5.4 Sensitivity to the Cache Size

Figure 12 shows the speedup of NRU+OBM for different cache sizes. We vary the LLC size from 512KB to 8MB with the associativity fixed at 16; the speedup is normalized to that of LRU at each cache size. We show five representative benchmarks and the geometric mean speedup over all 23 benchmarks. For small caches, the performance gain of OBM is limited because there is less wasted space holding distant reuse blocks (e.g., soplex and libquantum); for large caches, the gain decreases because there are fewer distant reuse blocks and the working set is more likely to fit into the cache (e.g., perlbench and hmmer). Nevertheless, NRU+OBM still achieves a geometric mean speedup of 5.1% for an 8MB LLC. Therefore, we conclude that OBM is scalable across cache sizes.
[Figure plots omitted: Figure 10 sweeps the RHT size (16–1024 entries) for 2-, 3-, 4-, and 5-bit BDCs; Figure 11 sweeps the BDCT size (64–4096 entries); Figure 12 sweeps the cache size (512KB–8MB); the y-axis of each is IPC normalized to LRU.]

Figure 10: Sensitivity to the sizes of the RHT and BDCs.

Figure 11: Sensitivity to the size of the BDCT.

Figure 12: NRU+OBM speedup for different cache sizes.
[Figure 13 plot omitted: IPC normalized to LRU (y-axis, 0.9–1.3) for NRU, DIP, DRRIP, SDBP, DSB, and NRU+OBM with prefetching enabled.]
Figure 13: Speedup in the presence of prefetching.
5.5 LRU Insertion Instead of Bypass

It is straightforward to apply OBM to non-inclusive and exclusive LLCs, but bypass cannot be used when inclusion must be maintained. To apply OBM to inclusive LLCs, a reasonable alternative is to insert the incoming block at the LRU position instead of bypassing it [27]: when bypass is not allowed and the reuse distance of the incoming block is larger than or equal to that of every victim candidate, OPT inserts it at the LRU position so that it can be evicted on the next miss. We therefore modify OBM into the Optimal LRU-insertion Monitor (OLM), which uses the penultimate victim block instead of the real victim block to train the RHT. Our experiments show that, with the same configuration, the speedup of NRU+OLM is 8.9%. Although it slightly underperforms NRU+OBM, it still outperforms other recent work.

5.6 Results in the Presence of Prefetching

Next, we evaluate the performance of OBM with the hardware prefetcher enabled. Figure 13 shows that NRU+OBM reduces average MPKI by 16.2% and achieves a performance gain of 5.3% over LRU in the presence of prefetching, doubling the gain of other recent proposals thanks to its ability to make separate bypass decisions for demand and prefetch accesses. Compared to LRU without prefetching, it is 66.1% faster. SRRIP+OBM and LRU+OBM show similar results in the presence of prefetching.

5.7 Results for Multi-programmed Workloads

Figure 14(a) shows the weighted speedup normalized to LRU for various techniques on 15 multi-programmed workloads in the absence of prefetching. The weighted speedup is computed as $\sum_{i=1}^{4} \mathrm{IPC}_i / \mathrm{SingleIPC}_i$, where $\mathrm{SingleIPC}_i$ is obtained when program $i$ runs alone on an 8MB LRU-managed LLC; a sketch of this metric follows below. NRU+OBM achieves a geometric mean normalized weighted speedup of 8.9%, versus 6.8% for SDBP, which performs best among the other state-of-the-art techniques. NRU+OBM improves 14 of the 15 mix workloads by more than 1% and degrades performance for none. In the presence of prefetching, the performance gain of NRU+OBM is 6.2% over LRU with prefetching, twice the gain of the other proposals, among which SDBP performs best at 3.1%. Figure 14(b) summarizes the results for OBM with other replacement policies: LRU+OBM achieves weighted speedups of 9.3% and 6.1% without and with prefetching respectively, and SRRIP+OBM achieves 9.6% and 6.3%. SRRIP+OBM performs best for multi-programmed workloads because SRRIP reduces conflicts between the accesses of different cores by evicting never-reused blocks earlier. These results show that OBM is also effective in multi-core environments.
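As a minimal sketch of the weighted speedup metric from Section 5.7 (ours, with hypothetical input vectors):

#include <vector>

// Weighted speedup: sum over the programs in a mix of IPC_i (in the mix)
// divided by SingleIPC_i (the same program running alone on an 8MB
// LRU-managed LLC). Comparing policies means normalizing each policy's
// value by the LRU baseline's value for the same mix.
double weighted_speedup(const std::vector<double>& ipc_mix,
                        const std::vector<double>& ipc_alone) {
  double ws = 0.0;
  for (std::size_t i = 0; i < ipc_mix.size(); ++i)
    ws += ipc_mix[i] / ipc_alone[i];
  return ws;
}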
5.8 Storage, Latency, and Power
Table 3 compares the storage overhead of various techniques for the 2MB LLC used in the single-thread experiments. Each RHT entry consists of 1 valid bit, 4 bits for the RHT replacement policy, 1 prefetch bit, a 10-bit signature, a 21-bit IT, and a 21-bit VT, and each BDC needs 4 bits.
[Figure 14 plots omitted: (a) weighted speedup normalized to LRU (1–1.4) for TADIP, TADRRIP, SDBP, DSB, and NRU+OBM on mix1–mix15 and their geometric mean, without prefetching; (b) weighted speedup (1–1.1) for NRU+OBM, LRU+OBM, and SRRIP+OBM without and with prefetching.]

(a) NRU+OBM and other proposals without prefetching. (b) Others.
Figure 14: Normalized weighted speedup for multi-programmed workloads.

Table 3: Storage overhead of various techniques for a 16-way 2MB LLC. Speedups are normalized to LRU without prefetching.

Technique | Speedup | Speedup with prefetching | Storage per block | Extra storage | Total
LRU       | 1       | 1.577                    | 4 bits            | 0             | 16KB
NRU       | 1.001   | 1.575                    | 1 bit             | 0             | 4KB
DIP       | 1.049   | 1.593                    | 4 bits            | 10 bits       | 16KB
DRRIP     | 1.060   | 1.601                    | 2 bits            | 10 bits       | 8KB
SDBP      | 1.063   | 1.582                    | 5 bits            | 9.75KB        | 29.75KB
DSB       | 1.075   | 1.618                    | 5 bits            | 11.35KB       | 31.35KB
NRU+OBM   | 1.097   | 1.661                    | 1 bit             | 1.41KB        | 5.41KB
In total, OBM consumes ((1 + 4 + 1 + 10 + 21 + 21) × 128 + 4 × (1024 + 1)) bits = 1.41KB of extra storage in the single-core configuration, less than 0.1% of the total storage of a 2MB LLC; the arithmetic is spelled out below. For the 4-core configuration, only 3 extra BDCs are needed. Compared to other recent proposals, NRU+OBM requires the lowest storage overhead.

We use CACTI 6.5 [25] to estimate latency and power under a 32nm process technology, modeling the RHT as the tag array of a 32-way associative cache with 256 blocks. The access time of the 2MB LLC is 1.30ns, while the access times of the RHT and BDCT are 0.14ns and 0.19ns respectively, fast enough to fit within the LLC access time; consequently, OBM does not affect the access latency of the LLC. The dynamic access energy of the LLC is 0.72nJ and its leakage power is 678.4mW, while OBM has a dynamic access energy of 0.003nJ and a leakage power of 3.5mW. Assuming a 3GHz CPU frequency, our experiments show that the LLC is accessed every 60.5ns on average; consequently, OBM consumes only 0.51% of the power budget of the LLC. Moreover, since OBM reduces LLC misses and thus program execution time, it actually reduces the energy consumption of the memory system and the processor.
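For completeness, the storage figure checks out directly (taking 1KB = 1024 bytes):

\[
\underbrace{(1+4+1+10+21+21)}_{58\text{ bits per RHT entry}} \times 128 \;+\; 4 \times (1024+1)
\;=\; 7424 + 4100 \;=\; 11524 \text{ bits} \;=\; 1440.5 \text{ bytes} \;\approx\; 1.41\,\text{KB}.
\]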
6. RELATED WORK

Extensive research has been done to improve LLC performance; we discuss the primary studies most relevant to our work.

Replacement: Many LLC replacement policies try to retain some fraction of the working set in the cache to avoid thrashing. DIP [27] inserts most distant reuse blocks at the LRU position when the working set is larger than the cache. Pseudo-LIFO [4] preferentially evicts blocks near the top of the fill stack to keep blocks at the bottom longer. A recent proposal guides replacement by explicitly predicting reuse distances with a PC-based predictor [17]. RRIP [12] further identifies frequently accessed blocks and retains them longer in the cache. Using signatures based on PCs or instruction sequences, SHiP [35] proposes a re-reference interval predictor to improve performance, and PACMan [36] extends RRIP to be prefetch-aware. The shepherd cache [29] uses two separate caches to record the relative access order of blocks and then emulates the replacement and bypass decisions of OPT+B.

Dead Block Prediction: Dead block prediction tries to identify distant reuse blocks, also known as dead blocks. By preferentially evicting or bypassing dead blocks, the remaining blocks can reside longer in the cache. Dead block predictors fall into three categories based on how they identify dead blocks: trace based [18, 19, 21], time based [10], and counter based [20]. The cache burst predictor [22] makes predictions over continuous access sequences rather than individual accesses to improve accuracy. SDBP [19] samples a subset of sets to reduce conflicts in the predictor and achieve high accuracy.

Bypass: A few researchers have proposed bypass techniques for cache management. Based on how they predict distant reuse blocks, these studies can be classified as PC based [5, 8, 33] or address based [13, 15, 30, 31]. LRF [37] combines PC-based and address-based methods to improve performance. Annex caches [14] and PCC [34] filter never-reused blocks before they are brought into the main cache. The dead block prediction and bypass techniques above all assume that every distant reuse block is useless; however, as we have shown, many of these blocks are actually useful and should be retained. DSB [6] records the incoming and victim blocks for each set on a miss and adjusts its bypass probability based on which one is accessed first. In contrast, OBM does not bypass based on a probability; the three Optimal Bypass Assertions are used to learn the behavior of the optimal bypass.
Since we collect replacement pairs globally rather than per set, our storage overhead is much lower; moreover, OBM makes PC-based predictions to improve performance. A bypass and insertion algorithm for exclusive LLCs was presented recently [7]; it classifies blocks by their access counts in the L2 cache and their hit counts in the LLC. NUcache [23] dedicates a part of the LLC to distant reuse blocks: only blocks accessed by selected PCs are inserted into the dedicated part, and the others are bypassed. Some researchers have proposed similar policies based on incoming-victim block relationships, studying access characteristics with metrics similar to Assertions 1 and 2 [6] or to Assertions 1 and 3 [19, 35, 37]. However, since these proposals do not assert the behavior of the optimal bypass and use only a fraction of the three Optimal Bypass Assertions, they are not as accurate as OBM.

Others: Victim caches [16] use a small fully-associative buffer to improve direct-mapped cache performance. However, as Figure 1 shows, LLC blocks usually have rather large reuse distances, which makes victim caches ineffective for LLCs [18].
[2] L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78–101, 1966.
[3] S. Borkar and A. A. Chien. The future of microprocessors. Commun. ACM, 54:67–77, 2011.
[4] M. Chaudhuri. Pseudo-LIFO: The foundation of a new family of replacement policies for last-level caches. In MICRO-42, 2009.
[5] C.-H. Chi and H. Dietz. Improving cache performance by selective cache bypass. In HICSS-22, 1989.
[6] H. Gao and C. Wilkerson. A dueling segmented LRU replacement algorithm with adaptive bypassing. In JWAC-1, 2010.
[7] J. Gaur, M. Chaudhuri, and S. Subramoney. Bypass and insertion algorithms for exclusive last-level caches. In ISCA-38, 2011.
[8] A. González, C. Aliagas, and M. Valero. A data cache with multiple caching strategies tuned to different types of locality. In ICS-9, 1995.
[9] J. L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34:1–17, 2006.
[10] Z. Hu, S. Kaxiras, and M. Martonosi. Timekeeping in the memory system: Predicting and optimizing memory behavior. In ISCA-29, 2002.
[11] Intel. Intel Core i7 processor. http://www.intel.com/products/processor/corei7/.
[12] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA-37, 2010.
[13] J. Jalminger and P. Stenstrom. A novel approach to cache block reuse predictions. In ICPP '03, 2003.
[14] L. John and A. Subramanian. Design and performance evaluation of a cache assist to implement selective caching. In ICCD '97, 1997.
[15] T. Johnson, D. Connors, M. Merten, and W.-M. Hwu. Run-time cache bypassing. IEEE Transactions on Computers, 48(12):1338–1354, 1999.
[16] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In ISCA-17, 1990.
[17] G. Keramidas, P. Petoumenos, and S. Kaxiras. Cache replacement based on reuse-distance prediction. In ICCD-25, 2007.
[18] S. M. Khan, D. A. Jiménez, D. Burger, and B. Falsafi. Using dead blocks as a virtual victim cache. In PACT-19, 2010.
[19] S. M. Khan, Y. Tian, and D. A. Jiménez. Sampling dead block prediction for last-level caches. In MICRO-43, 2010.
[20] M. Kharbutli and Y. Solihin. Counter-based cache replacement and bypassing algorithms. IEEE Transactions on Computers, 57(4):433–447, 2008.
[21] A.-C. Lai, C. Fide, and B. Falsafi. Dead-block prediction & dead-block correlating prefetchers. In ISCA-28, 2001.
[22] H. Liu, M. Ferdman, J. Huh, and D. Burger. Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. In MICRO-41, 2008.
[23] R. Manikantan, K. Rajan, and R. Govindarajan. NUcache: An efficient multicore cache organization based on next-use distance. In HPCA-17, 2011.
7. CONCLUSION

Since the reuse distances of numerous blocks are larger than the cache capacity, the commonly used LRU policy and its approximations perform poorly for LLCs. Previous proposals that attempt to address this problem either deliver limited performance or require significant hardware overhead and large modifications to the existing cache design. Based on an analysis of the optimal bypass, this paper proposes the Optimal Bypass Monitor (OBM). By keeping track of a short replacement history, OBM accurately learns the recent behavior of the optimal bypass to make bypass decisions. Our experiments show that OBM cooperates well with various replacement policies, including NRU, LRU, and SRRIP. In particular, OBM with NRU achieves a significant speedup for both single-thread and multi-programmed workloads, whether prefetching is enabled or not, and outperforms other state-of-the-art proposals including DIP, DRRIP, SDBP, and DSB. OBM consumes less than 1.5KB of extra storage and does not require changes to the original LLC design. To the best of our knowledge, this is the first proposal to learn the behavior of the optimal bypass to guide LLC management. We believe OBM can also be applied in related research areas, such as memory and storage management [1, 24].
8. ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for their helpful comments. This work is supported by the National Science and Technology Major Project of the Ministry of Science and Technology of China under grant 2009ZX01029001-002-2.
9. REFERENCES

[1] S. Bansal and D. S. Modha. CAR: Clock with adaptive replacement. In FAST-3, 2004.
[24] N. Megiddo and D. S. Modha. ARC: A self-tuning, low overhead replacement cache. In FAST-2, 2003.
[25] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. CACTI 6.0: A tool to understand large caches. HP Research Report, 2007.
[26] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi. Pinpointing representative portions of large Intel Itanium programs with dynamic instrumentation. In MICRO-37, 2004.
[27] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In ISCA-34, 2007.
[28] M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt. A case for MLP-aware cache replacement. In ISCA-33, 2006.
[29] K. Rajan and G. Ramaswamy. Emulating optimal replacement with a shepherd cache. In MICRO-40, 2007.
[30] J. Rivers and E. Davidson. Reducing conflicts in direct-mapped caches with a temporality-based design. In ICPP '96, 1996.
[31] J. A. Rivers, E. S. Tam, G. S. Tyson, E. S. Davidson, and M. Farrens. Utilizing reuse information in data cache management. In ICS-12, 1998.
[32] R. Subramanian, Y. Smaragdakis, and G. H. Loh. Adaptive caches: Effective shaping of cache behavior to workloads. In MICRO-39, 2006.
[33] G. Tyson, M. Farrens, J. Matthews, and A. R. Pleszkun. A modified approach to data cache management. In MICRO-28, 1995.
[34] S. Walsh and J. Board. Pollution control caching. In ICCD '95, 1995.
[35] C. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. Steely Jr., and J. Emer. SHiP: Signature-based hit predictor for high performance caching. In MICRO-44, 2011.
[36] C. Wu, A. Jaleel, M. Martonosi, S. Steely Jr., and J. Emer. PACMan: Prefetch-aware cache management for high performance caching. In MICRO-44, 2011.
[37] L. Xiang, T. Chen, Q. Shi, and W. Hu. Less reused filter: Improving L2 cache performance via filtering less reused lines. In ICS-23, 2009.
[38] Y. Xie and G. H. Loh. PIPP: Promotion/insertion pseudo-partitioning of multi-core shared caches. In ISCA-36, 2009.
and M. Farrens. Utilizing reuse information in data cache management. In ICS-12, 1998. R. Subramanian, Y. Smaragdakis, and G. H. Loh. Adaptive caches: Effective shaping of cache behavior to workloads. In MICRO-39, 2006. G. Tyson, M. Farrens, J. Matthews, and A. R. Pleszkun. A modified approach to data cache management. In MICRO-28, 1995. S. Walsh and J. Board. Pollution control caching. In ICCD ’95, 1995. C. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. Steely Jr, and J. Emer. Ship: Signature-based hit predictor for high performance caching. In MICRO-44, 2011. C. Wu, A. Jaleel, M. Martonosi, S. Steely Jr, and J. Emer. Pacman: Prefetch-aware cache management for high performance caching. In MICRO-44, 2011. L. Xiang, T. Chen, Q. Shi, and W. Hu. Less reused filter: improving l2 cache performance via filtering less reused lines. In ICS-23, 2009. Y. Xie and G. H. Loh. Pipp: Promotion/insertion pseudo-partitioning of multi-core shared caches. In ISCA-36, 2009.