LACS: A Locality-Aware Cost-Sensitive Cache Replacement Algorithm

Mazen Kharbutli and Rami Sheikh (Submitted to IEEE Transactions on Computers)

Mazen Kharbutli is with Jordan University of Science and Technology. Rami Sheikh is with North Carolina State University.

Digital Object Identifier 10.1109/TC.2013.61


Abstract

The design of an effective last-level cache (LLC) in general, and an effective cache replacement/partitioning algorithm in particular, is critical to overall system performance. The processor's ability to hide the LLC miss penalty differs widely from one miss to another: the more instructions the processor manages to issue during a miss, the better it hides the miss penalty and the lower the cost of that miss. This non-uniformity in the processor's ability to hide LLC miss latencies, and the resultant non-uniformity in the performance impact of LLC misses, opens up an opportunity for a new cost-sensitive cache replacement algorithm. This paper makes two key contributions. First, it proposes a framework for estimating the cost of a cache block at run-time based on the processor's ability to (partially) hide its miss latency. Second, it proposes a simple, low-hardware-overhead, yet effective cache replacement algorithm that is Locality-Aware and Cost-Sensitive (LACS). LACS is thoroughly evaluated using a detailed simulation environment. It speeds up 12 LLC-performance-constrained SPEC CPU2006 benchmarks by up to 51%, and by 11% on average. When evaluated using a dual/quad-core CMP with a shared LLC, LACS significantly outperforms LRU in terms of performance and fairness, achieving improvements of up to 54%.

Index Terms: Cache memories, cache replacement algorithms, cost-sensitive cache replacement, shared caches

I. INTRODUCTION

As the performance gap between the processor and main memory continues to widen, the design of an effective cache hierarchy becomes ever more critical to reducing the average memory access time perceived by the processor. The design of an effective last-level cache (LLC) continues to be the center of substantial research for several reasons. First, while a processor may be able to hide a miss in the higher-level (L1 and L2) caches followed by an LLC (L3 cache)¹ hit by exploiting ILP, out-of-order execution, and non-blocking caches, it is almost impossible to fully hide the long LLC miss penalty. Second, as multi-core processors sharing the LLC become the dominant computing platform, new cache design constraints arise with the goal of maximizing performance and throughput while ensuring thread fairness [1], [2].

¹ Without loss of generality, we assume throughout this paper a 3-level cache hierarchy where the LLC is the L3 cache. The concepts and algorithms developed in this paper are also applicable to a 2-level cache hierarchy.


A crucial design aspect of LLCs continues to be the cache replacement and partitioning algorithm. This is evident in the many papers proposing intelligent LLC replacement and partitioning algorithms in the recent literature. Examples include dead block predictors [3]–[8], re-reference interval predictors and adaptive insertion algorithms [9]–[14], and CMP cache partitioning algorithms [15], [16], among others. Unfortunately, most of these algorithms target only the cache's miss rate while ignoring the aggregate miss cost. Only a few proposed replacement algorithms attempt to reduce the aggregate miss cost or penalty [17]–[21].

In modern superscalar processors, the processor attempts to hide cache misses by exploiting ILP, issuing and executing independent instructions in parallel and out-of-order. Unfortunately, even in the most aggressive superscalar processors, it is nearly impossible to hide the large LLC miss penalty. During this long miss, the reorder buffer (ROB) and the other processor queues fill up, which eventually stalls the whole processor while it waits on the LLC miss. Yet, depending on the dependency chain, miss bursts, and other factors, the processor's ability to partially hide the LLC miss penalty differs widely from one miss to another [22]. Figure 1 illustrates this point by showing the histogram of the number of instructions issued during the service of an LLC miss for several SPEC CPU2006 benchmarks [23]. The vertical axis represents the number of misses, while the horizontal axis shows the number of instructions issued during the service of the miss (plotted using intervals of 20 instructions)². For example, in the sub-figure for the mcf benchmark, the leftmost bar indicates that for about 64 million of its LLC misses, the processor managed to issue only 0-19 instructions per miss. The number of issued instructions is counted from the time the instruction suffering the LLC miss is placed in the LLC MSHR until the requested data is received. The figure clearly shows that for most benchmarks, the number of instructions issued during an LLC miss is not uniform and varies widely, asserting the statement "Not All Misses are Created Equal" [21]. The more instructions the processor manages to issue during a miss, the better it is capable of hiding the miss penalty and the lower the cost of that miss. This non-uniformity in the processor's ability to hide the latencies of LLC misses, and the resultant non-uniformity in the performance impact of LLC misses, opens up an opportunity to develop new cost-sensitive cache replacement algorithms.

² Although the ROB used in our evaluation has 128 entries, the number of instructions issued during an LLC miss may be larger than 128. There are 255 different instructions that may be in the ROB from the time the instruction suffering the miss is added to the tail of the ROB until it retires at the ROB's head. Some or all of these instructions may issue during the miss.
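The 255-instruction figure in this footnote follows from a simple count (a reconstruction of the arithmetic, assuming the stated 128-entry ROB): when the missing instruction enters the ROB's tail, up to R - 1 = 127 older instructions already occupy the ROB, and as they retire, up to another 127 newer instructions can enter before the missing instruction itself retires at the head:

\[
N_{\max} = \underbrace{(R-1)}_{\text{older}} + \underbrace{1}_{\text{the miss itself}} + \underbrace{(R-1)}_{\text{newer}} = 2R - 1 = 255 \qquad (R = 128).
\]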


Fig. 1. Issued Instructions Per LLC Miss Histogram. Simulation environment details are in Section V. [Figure: one histogram panel per benchmark (astar, bwaves, bzip2, dealII, gcc, gobmk, gromacs, hmmer, lbm, libquantum, mcf, milc, omnetpp, sjeng, soplex, sphinx3, xalancbmk, zeusmp); horizontal axis: number of instructions issued during the miss, in intervals of 20 from 0 to 160+; vertical axis: number of misses (x10^6).]

We define cache blocks for which the processor manages to issue a small/large number of instructions during a miss as high-/low-cost blocks, respectively. Substituting high-cost misses with low-cost misses reduces the aggregate miss penalty and thus enhances overall cache performance.

This paper proposes a novel, simple, yet effective cache replacement algorithm called LACS: Locality-Aware Cost-Sensitive cache replacement. LACS estimates the cost of a cache block by counting the number of instructions issued during the block's LLC miss, which reflects the processor's ability to (partially) hide the miss penalty. Cache blocks are classified as low-cost or high-cost based on whether the number of issued instructions is larger or smaller than a threshold. On a cache miss, when a victim block needs to be found, LACS chooses a low-cost block, keeping high-cost blocks in the cache. This is referred to as high-cost block reservation [17]–[20]. However, since a high-cost block cannot be reserved forever, a mechanism must exist to relinquish the reservation once the block is dead (no longer needed). To achieve this, LACS implements a simple locality-based algorithm that ages a block while it is not being accessed, eventually inverting its cost from high to low. As a result, LACS attempts to reserve high-cost blocks in the cache, but only while their locality is still high (i.e., they have been accessed recently).
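In outline, and deferring the actual mechanism (history table, 2-bit counters, threshold adaptation) to Section IV, the policy just described can be sketched as follows; the field names and the two-valued cost are illustrative only:

    /* Conceptual sketch only; the actual mechanism is given in Section IV.  */
    typedef struct { int cost; } block_t;   /* 0 = low-cost, >0 = high-cost  */

    /* When a miss completes: few issued instructions => hard to hide the
     * penalty => high-cost block.                                           */
    void set_cost(block_t *b, int num_issued, int thresh)
    {
        b->cost = (num_issued < thresh) ? 1 : 0;
    }

    /* On replacement: victimize a low-cost block.  If none exists, age every
     * block in the set so that an unreferenced high-cost block eventually
     * gives up its reservation.                                             */
    block_t *pick_victim(block_t set[], int ways)
    {
        for (;;) {
            for (int w = 0; w < ways; w++)
                if (set[w].cost == 0)
                    return &set[w];
            for (int w = 0; w < ways; w++)
                set[w].cost--;
        }
    }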


The underlying locality-based algorithm employed by LACS can also be a dead block predictor. Although not shown in this paper, we integrated LACS with two dead-block predictors [3], [6] and found that, even though the integrated dead-block predictors outperform our simple locality-based algorithm as standalone cache replacement algorithms, the two approaches perform almost equally when their role is limited to providing locality hints to LACS. Moreover, our locality-based algorithm has a much smaller storage overhead than other dead-block predictors.

The fact that LACS reserves a small subset of high-cost blocks in the cache makes it thrash-resistant. In addition, LACS is scan-resistant, since its locality-aware component increases the costs of frequently accessed blocks while they are in the cache and decreases the costs of blocks that do not get re-accessed, leading to their early eviction. Both thrash-resistance and scan-resistance are key traits of an efficient cache replacement algorithm [9]–[12]. Consequently, while LACS reduces the miss penalty by substituting high-cost misses with low-cost misses, it also reduces the miss count by being both thrash- and scan-resistant.

This paper has four main contributions:
• Miss Costs: The non-uniformity in the performance impact (cost) of LLC misses, caused by the non-uniformity in the processor's ability to hide LLC miss latencies, is asserted.
• Cost Estimation: A novel, simple, yet effective run-time cost estimation method for in-flight misses is presented. The cost is estimated from the number of instructions the processor manages to issue during the miss, which reflects the miss's performance impact and how well the processor can hide the miss penalty.
• LACS: A cost-sensitive and locality-aware cache replacement algorithm that utilizes the devised cost estimation method is proposed. LACS is simple, has low hardware overhead, and is effective for private and shared LLCs.
• LACS Optimizations: The performance of LACS is further improved by novel and effective run-time optimizations: 1) a mechanism to dynamically and periodically update the threshold value, which allows LACS to better adapt to different applications and execution phases; and 2) a mechanism to turn the cost-sensitive component of LACS on and off based on the predictability of block costs.


LACS is thoroughly evaluated using a detailed simulation environment. When evaluated using a uniprocessor architecture model, LACS speeds up 12 LLC-performance-constrained SPEC CPU2006 benchmarks by up to 51%, and by 11% on average, relative to the base LRU, without slowing down any of the 23 SPEC CPU2006 benchmarks used in the study. This performance improvement is comparable to that achieved with a 50%-100% larger cache using LRU, yet it comes from a simple implementation with low hardware overhead. In addition, LACS's effectiveness is demonstrated over a wide range of LLC sizes. Moreover, LACS is compared to and shown to outperform both a state-of-the-art cost-based replacement algorithm (MLP-SBAR) [21] and a state-of-the-art locality-based algorithm (SHiP) [9]. When evaluated using a dual-core CMP architecture model with a shared LLC, LACS improves 36 SPEC CPU2006 benchmark pairs by up to 54%, and by 10% on average³. When evaluated using a quad-core CMP architecture model with a shared LLC, LACS improves 100 SPEC CPU2006 benchmark quadruples by up to 38%, and by 10% on average³.

³ The metric reported here is the harmonic mean of weighted IPCs normalized to the base LRU. It is a measure of both performance and fairness improvement.

The rest of the paper is organized as follows. Section II presents related work and compares LACS to other replacement algorithms. Section III develops the foundations for LACS. Section IV discusses LACS and its optimizations in detail for both private and shared caches. Section V describes the evaluation environment, and Section VI discusses the experimental evaluation in detail. Finally, Section VII concludes the paper.

II. RELATED WORK

Traditionally, cache replacement algorithms were developed with the goal of reducing the aggregate miss count and thus assumed that misses were uniform in cost. Belady's optimal (OPT) replacement algorithm [24] victimizes the block in the set with the largest future-usage distance. It guarantees a minimal miss count but requires future knowledge, and thus remains theoretical and can only be approximated. The LRU replacement algorithm, and its approximations, rely on the principle of temporal locality by victimizing the least recently used block in the set. However, studies have shown that the performance gap between LRU and OPT is wide for high-associativity L2 caches [7]. One factor that works against LRU is that locality is usually filtered by the L1 cache and thus is inverted in the lower cache levels [3]. To bridge the gap between OPT and LRU, many intelligent replacement algorithms have been proposed for LLCs, including but not limited to dead block predictors [3]–[8] and re-reference interval predictors and adaptive insertion algorithms [9]–[14].


Dead block predictors aim to predict dead blocks (blocks that will no longer be used during their current generation times⁴) and evict them early while preserving live blocks. Re-reference interval predictors and adaptive insertion algorithms aim to predict the time interval between consecutive accesses to a cache block, which determines the insertion position of the block in the LRU or re-use stack. Moreover, cache replacement algorithms in shared CMP caches have been studied in the context of partitioning the shared cache among the concurrently-running threads [15], [16].

⁴ A block's generation time starts when it is placed in the cache after a miss and ends when it is evicted.

Replacement algorithms such as OPT, LRU, and dead block predictors distinguish between cache blocks only in terms of liveness, not in terms of miss costs. However, in modern systems, cache misses are not uniform and have different costs [18], [21], [22], [25]. It is therefore wiser for the replacement algorithm to take miss costs into consideration, in addition to access locality, in order to improve the cache's overall performance. This is exactly what LACS is designed to achieve.

In Section VI, we compare LACS against SHiP (Signature-based Hit Predictor) [9], a state-of-the-art locality-based cache replacement algorithm. SHiP associates each cache reference with a signature and attempts to predict the re-reference interval for that signature. A Signature History Counter Table (SHCT) of saturating counters is used to learn and predict the re-reference behavior of the signatures; the table is updated on cache hits and evictions. On a cache fill, SHiP indexes into the SHCT with the new block's signature to obtain a prediction of its re-reference interval. SHiP only tracks whether a signature is re-referenced or not, not the actual re-reference timing. For block promotion and eviction decisions, SHiP utilizes SRRIP [10]. In our evaluation, LACS is found to outperform SHiP in terms of performance improvement in both a private and a shared LLC while requiring about 20% less storage overhead.

Srinivasan and Lebeck [22] explore load latency tolerance in dynamically scheduled processors and show that load instructions are not equal in terms of the processor's tolerance for their latencies. They also show that load latency tolerance is a function of the number and types of dependent instructions, especially mispredicted branches. Moreover, Puzak et al. [25] also assert that misses have variable costs, and they present a simulation-based technique for calculating the cost of a miss at different cache levels.


The observation of the non-uniform impact of cache misses led to a new class of replacement algorithms called cost-sensitive cache replacement algorithms. These algorithms assign different costs to cache blocks according to well-defined criteria and rely on these costs to select which block to evict on a cache miss (the least-cost block is evicted first). The miss cost may be latency, penalty, power consumption, bandwidth consumption, or any other property attached to a miss [17]–[21], [26]. LACS assigns costs based on the processor's ability to (partially) hide the miss latency, measured by counting the number of instructions issued during the miss.

Some of the earliest cost-sensitive cache replacement algorithms were proposed by Jeong and Dubois [17]–[19] in the context of CC-NUMA multiprocessors, in which a miss mapping to remote memory, as opposed to local memory, has a higher cost in terms of latency, bandwidth, and power consumption. A cost-sensitive optimal replacement algorithm (CSOPT) for CC-NUMA multiprocessors with static miss costs is evaluated and found to outperform a traditional OPT algorithm in terms of overall miss-cost savings, although the miss count increases. In addition, several realizable algorithms with costs based on miss latency are evaluated. In comparison, LACS estimates a block's cost based on the processor's ability to tolerate and hide the miss, not on the miss latency itself. Moreover, LACS is applicable to both uniprocessors and multiprocessors.

Jeong et al. [20] also proposed a cost-sensitive cache replacement algorithm for uniprocessors. The algorithm assigns cost based on whether a block's next access is predicted to be a load (high-cost) or a store (low-cost), since processors can better tolerate store misses than load misses. In their implementation, all loads are equal and are considered high-cost. In comparison, LACS does not treat load misses equally but distinguishes between them in terms of cost, based on the processor's ability to tolerate and hide each load miss. Our study and the studies of others [22], [25] show that load miss costs are not uniform and thus should not be treated equally. Moreover, some store misses may be critical and can stall the processor. This can happen, for example, if the LLC MSHR or write buffers fill up after a long sequence of consecutive store misses, such as when initializing or copying an array. An increase in the number of store misses can also put pressure on the memory bandwidth.

Srinivasan et al. [26] proposed a hardware scheme in which critical blocks are either preserved in a special critical cache or used to initiate prefetching. Criticality is estimated by keeping track of a load's dependence chain and the processor's ability to execute independent instructions following the load. Although they demonstrate the effectiveness of their approach when all critical loads are guaranteed to hit in the cache, no significant improvement is achieved under a realistic configuration, due to the large working set of critical loads and the inefficient way of identifying critical loads. In comparison, LACS does not need to keep track of a load's dependence chain; instead, it uses a simpler, more effective approach for cost estimation. Moreover, LACS achieves considerable performance improvement under a realistic configuration because (a) high-cost blocks are preserved in the LLC itself instead of in a smaller critical cache, and (b) LACS includes a mechanism to relinquish high-cost blocks that may no longer be needed by the processor, making room for other useful blocks.


Qureshi et al. [21] proposed a cost-sensitive cache replacement algorithm based on Memory Level Parallelism (MLP). The MLP-aware cache replacement algorithm relies on the parallelism of miss occurrences: some cache misses occur in isolation (these are classified as high-cost and are thus preserved in the cache), while others occur and get serviced concurrently (these are classified as low-cost). Because of its significant performance degradation in pathological cases, the MLP-aware algorithm is used in conjunction with a tournament-predictor-like Sampling Based Adaptive Replacement (SBAR) mechanism that chooses between it and traditional LRU depending on which provides better performance. In comparison, LACS estimates a block's cost based on the processor's ability to tolerate the miss. Moreover, LACS's performance is demonstrated to be more robust, with negligible pathological behavior. In Section VI, LACS is compared against and shown to outperform the MLP-aware algorithm with SBAR.

Finally, other replacement algorithms that utilize cost have been proposed outside the context of processor caches, such as in disk paging [27] and Web proxy caching [28].

III. LACS'S FOUNDATIONS AND UNDERLYING PRINCIPLES

This section lays the foundations and underlying principles for LACS. First, we discuss the impact of LLC misses in modern processors, thus establishing our cost heuristic. Second, we examine the predictability/consistency of block costs, a vital property for LACS's performance.


A. Anatomy of the Impact of LLC Misses in Modern Processors

Modern dynamically-scheduled superscalar processors improve performance by issuing and executing independent instructions in parallel and out-of-order. Multiple instructions are fetched, issued, and executed every clock cycle. After an instruction completes execution, it writes back its result and waits to be retired (updating the architectural state) in program order. Although instructions are issued and executed out-of-order, program order is preserved at retirement using the ROB [1].

Dynamically-scheduled superscalar processors can tolerate the high latency of some instructions (e.g., loads that suffer L1 cache misses) by issuing and executing independent instructions. However, there is a limit to the delay a processor can tolerate, and it may eventually stall. This happens in particular when a load instruction suffers an LLC miss and has to be serviced by the long-latency main memory. The main reason a processor may stall after an LLC load miss is that the ROB fills up and dependent instructions clog it [29]. Even if there are few dependent instructions, the load instruction reaches the head of the ROB and prevents the retirement of completed instructions following it, again filling up the ROB and preventing the dispatch of new instructions as no free ROB entries are available. The clogging of the ROB has a domino effect on other pipeline queues, such as the instruction queue and the issue queues, preventing the fetch and dispatch of new instructions. Moreover, even store instructions can stall the processor despite the fact that they can be retired under an LLC miss: a long sequence of LLC store misses can fill up the cache's MSHR or write buffer, preventing new instructions from being added and thus stalling the processor. Such a scenario could happen when a large array is being initialized or copied. This is why LACS treats store misses similarly to load misses and assigns costs at block granularity.

Yet the processor's ability to tolerate the miss latency differs from one miss to another [22], [25]. Consider the following two extreme scenarios. In the first, a load instruction is followed by a long chain of instructions that directly or indirectly depend on it. If the load suffers an LLC miss, the processor may stall immediately, since none of the instructions following it can issue. After the load completes, the dependent instructions still need to be issued and executed, so their execution times are added to the miss latency. In the second scenario, a load instruction is followed by independent instructions. If the load suffers an LLC miss, the processor can still remain busy by issuing and executing the independent instructions. It will only stall once the ROB is clogged by the load instruction and the completed (but not retired) independent instructions following it.


However, once the load instruction retires, all the following completed instructions can retire relatively quickly; their execution times are overlapped with the miss latency and thus saved. Most load instructions exhibit scenarios between these two extremes. Figure 1 asserts this observation and demonstrates that the number of instructions a processor manages to issue during an LLC miss differs widely from one miss to another. On the one hand, the processor manages to issue only 0-19 instructions while servicing some misses (leftmost bars); on the other hand, it manages to issue more than 160 instructions while servicing other misses (rightmost bars). Therefore, the impact and cost of a miss can be effectively estimated from the number of instructions issued during the miss. Cache misses during which the processor fails to issue many instructions are considered high-cost, while cache misses during which the processor manages to remain busy and issue many instructions are considered low-cost, since the processor can tolerate them well.

LACS uses the above heuristic to estimate the cost of a cache block based on whether the miss on the block is a low- or a high-cost miss. If the number of instructions issued during the miss is larger than a certain threshold value, the block is considered a low-cost block; otherwise, it is considered a high-cost block. LACS attempts to reserve high-cost blocks in the cache at the expense of low-cost blocks: when a victim block needs to be found, a low-cost block is chosen. At the same time, LACS is aware that high-cost blocks should not be reserved forever and must be evicted once they are no longer needed. Therefore, LACS relinquishes high-cost blocks if they have not been accessed for some time.

B. Cost Consistency and Predictability

LACS attempts to reserve blocks that suffered high-cost misses in the past, with the assumption that future misses on the same blocks will also be high-cost misses. These high-cost blocks are reserved at the expense of blocks that suffered low-cost misses in the past, with the assumption that future misses on those same blocks will also be low-cost misses. LACS thus substitutes high-cost misses with low-cost misses. Two factors determine a block's cost: the number of instructions issued during an LLC miss on the block (numIssued) and the threshold value (thresh). In order for LACS to perform effectively, the cost of a block must be consistent and repetitive across consecutive generations. In other words, the numIssued values for an individual block must be repetitive and consistent. Fortunately, our studies and profiling assert that this is true.


Fig. 2. Profile of numIssued Values. Simulation environment details are in Section V. [Figure: for each SPEC CPU2006 benchmark (astar, bwaves, bzip2, calculix, dealII, gcc, gobmk, gromacs, h264ref, hmmer, lbm, libquantum, mcf, milc, namd, omnetpp, perlbench, povray, sjeng, soplex, sphinx3, xalancbmk, zeusmp) and the overall average, two bars: the average of all numIssued values (numIssued Avg., left) and the average absolute difference between numIssued values of the same block over consecutive misses (numIssued Abs. Diff. Avg., right); vertical axis: 0 to 160.]

Figure 2 shows profiling statistics for the values of numIssued. During profiling, numIssued values are simply recorded but are not used for cache replacement decisions; the base LRU replacement is used instead. While profiling, the absolute difference between the numIssued values for the same cache block over two consecutive misses is recorded. The right bar for each benchmark shows the average of these absolute differences. LACS does not use the exact numIssued value to estimate cost; instead, cost is estimated based on whether the numIssued value is larger or smaller than a threshold value. The left bar for each benchmark shows the average of all numIssued values over all misses (numIssued_avg). Comparing the two bars for each benchmark shows that the absolute differences between the numIssued values for the same cache block over consecutive misses (right bar) are much smaller than the overall numIssued average (left bar). In other words, because the numIssued values for the same block over consecutive misses are close to each other relative to the overall average, a block will most likely have the same cost across its generations. The povray benchmark only suffers from cold misses, so its absolute-differences bar is not shown.

Moreover, numIssued_avg values must not change dramatically across periods. Otherwise, a block's cost relative to other blocks would not be consistent across periods: a block that was marked high-cost in one period could be a low-cost block had its cost been evaluated in a subsequent period, and vice versa. For example, assume that a block has a numIssued value of 30. On the one hand, if the value of numIssued_avg in a period was 100, then the block will most likely be considered a high-cost block.


On the other hand, if the value of numIssued_avg drops to 20 in a subsequent period, then the same high-cost block should become a low-cost block. Fortunately, our studies and profiling of numIssued_avg values across periods indicate that these averages are consistent and repetitive most of the time for most benchmarks.
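The statistics behind Figures 2 and 3 can be produced by a profiling pass along the following lines. This is a sketch: the per-block record and all function names are illustrative; only the metrics themselves (per-block absolute differences between consecutive numIssued values, overall and per-period averages) and the 16K-miss period size come from the text.

    #include <stdint.h>
    #include <stdio.h>

    #define PERIOD_MISSES 16384u          /* 16K misses per period (Fig. 3)  */

    struct blk_prof {                     /* one record per cache block      */
        uint32_t last_num_issued;         /* numIssued on the previous miss  */
        int      seen_before;
    };

    static uint64_t total_issued, total_misses;    /* numIssued Avg. (Fig. 2) */
    static uint64_t total_abs_diff, diff_samples;  /* Abs. Diff. Avg. (Fig. 2)*/
    static uint64_t period_issued, period_misses;  /* one point of Fig. 3     */

    void profile_llc_miss(struct blk_prof *b, uint32_t num_issued)
    {
        total_issued += num_issued;  total_misses++;
        period_issued += num_issued;

        if (b->seen_before) {             /* consecutive misses, same block  */
            uint32_t d = num_issued > b->last_num_issued
                       ? num_issued - b->last_num_issued
                       : b->last_num_issued - num_issued;
            total_abs_diff += d;  diff_samples++;
        }
        b->last_num_issued = num_issued;
        b->seen_before = 1;

        if (++period_misses == PERIOD_MISSES) {    /* close a Fig. 3 period  */
            printf("numIssued_avg for this period: %llu\n",
                   (unsigned long long)(period_issued / PERIOD_MISSES));
            period_issued = 0;  period_misses = 0;
        }
    }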

Fig. 3. Plot of numIssued_avg Values Over 100 (16K-Miss) Periods. Simulation environment details are in Section V. [Figure: three panels (bwaves, milc, gcc); horizontal axis: period number, 0 to 100; vertical axis: numIssued_avg, 0 to 160.]

Figure 3 shows the plots of numIssued_avg values over 100 consecutive periods (of 16K misses each) for three representative benchmarks. The horizontal axis covers the 100 periods, while the vertical axis shows the numIssued_avg value in each period. The figure shows three different patterns. In the first plot (bwaves), the average values are equal over all periods. In the second plot (milc), the average values are equal over long stretches of periods (10-30 periods), with different stretches having different averages. Finally, in the third plot (gcc), the average values experience large fluctuations. For the first two patterns, block costs are expected to have high consistency and predictability. In the third pattern, however, block costs are expected to suffer low consistency. As we will describe in the next section, LACS employs a simple mechanism to detect large fluctuations in period averages and turn its cost-sensitive (CS) component on and off accordingly, as illustrated in the sketch below. Overall, the above discussion asserts that block costs are mostly repetitive and predictable, a necessary characteristic for LACS to perform effectively.
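The details of the CS on/off mechanism are deferred to the next section. Purely as an illustration of the idea, a detector could compare consecutive period averages and disable the cost-sensitive component when they diverge sharply; everything below, including the one-quarter tolerance, is an assumption for illustration, not the paper's mechanism:

    /* Illustrative only: disable the CS component when numIssued_avg swings
     * sharply between consecutive 16K-miss periods (cf. the gcc pattern).
     * The one-quarter tolerance is a made-up parameter.                     */
    static int cs_enabled = 1;

    void on_period_end(unsigned avg_prev, unsigned avg_cur)
    {
        unsigned diff = avg_cur > avg_prev ? avg_cur - avg_prev
                                           : avg_prev - avg_cur;
        cs_enabled = (diff <= avg_prev / 4);
    }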


IV. LACS'S IMPLEMENTATION

This section explains the implementation of LACS and its optimizations, in addition to its hardware and storage organization, for both private and shared LLCs.

A. LACS's Implementation Details

Table I summarizes the different LACS operations explained in this section. Throughout the discussion, the corresponding steps in the table will be pointed out.

TABLE I
SUMMARY OF LACS'S OPERATIONS

On an LLC miss on block B:
    Step A: missCount++;
    Step B: MSHR[B].IIR = IIC;
    Step C: Find a victim block with cost_current = 0; if none exists, decrement cost_current for all blocks in the set and repeat the search.
When the miss on block B returns:
    Step D: numIssued = IIC - MSHR[B].IIR;
    Step E: Update the history table:
        Step E1: if (numIssued < thresh) { Table[B].cost_stored++; totalHigh++; }
        Step E2: else Table[B].cost_stored--;
    Step F: Initialize block B: B.cost_current = Table[B].cost_stored;
On an LLC hit on block B:
    Step G: Update block B: B.cost_current++;
Threshold calculation:
    Step Op1: if (totalHigh < minHigh) thresh_new = thresh + 8;
              else if (totalHigh > maxHigh) thresh_new = thresh - 8;
              else thresh_new = thresh;
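For concreteness, the table's operations can be rendered as C-style code. The sketch below is a transcription of Table I's steps; the global state declarations are illustrative shapes, not the paper's hardware organization, and sat_inc/sat_dec are the 2-bit saturating-counter helpers sketched after the next paragraph.

    /* C-style transcription of Table I.  Declarations are illustrative.     */
    extern unsigned missCount, IIC, thresh, totalHigh, minHigh, maxHigh;
    struct mshr_entry { unsigned IIR; };
    struct hist_entry { unsigned char cost_stored; };
    struct blk        { int cost_current; };
    extern struct mshr_entry MSHR[];      /* LLC MSHR entries                */
    extern struct hist_entry Table[];     /* per-block history table         */
    extern struct blk set[];              /* the accessed cache set          */
    #define WAYS 16                       /* illustrative associativity      */
    void evict(struct blk *b);            /* provided elsewhere              */
    struct blk *block_of(int B);
    void sat_inc(unsigned char *c), sat_dec(unsigned char *c);

    void on_llc_miss(int B)               /* LLC miss on block B             */
    {
        missCount++;                      /* Step A                          */
        MSHR[B].IIR = IIC;                /* Step B: snapshot issue counter  */
        for (;;) {                        /* Step C: victim selection        */
            for (int w = 0; w < WAYS; w++)
                if (set[w].cost_current == 0) { evict(&set[w]); return; }
            for (int w = 0; w < WAYS; w++)
                set[w].cost_current--;    /* no low-cost block: age the set  */
        }
    }

    void on_miss_return(int B)            /* data for block B has arrived    */
    {
        unsigned numIssued = IIC - MSHR[B].IIR;            /* Step D          */
        if (numIssued < thresh) {         /* Step E1: high-cost miss         */
            sat_inc(&Table[B].cost_stored);
            totalHigh++;
        } else {                          /* Step E2: low-cost miss          */
            sat_dec(&Table[B].cost_stored);
        }
        block_of(B)->cost_current = Table[B].cost_stored;  /* Step F          */
    }

    void on_llc_hit(int B)
    {
        block_of(B)->cost_current++;      /* Step G                          */
    }

    void update_threshold(void)           /* Step Op1                        */
    {
        if      (totalHigh < minHigh) thresh += 8;
        else if (totalHigh > maxHigh) thresh -= 8;
        /* otherwise thresh is left unchanged */
    }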

In the previous section, we established that block costs are highly repetitive and predictable. However, there remain some blocks whose costs swing across consecutive misses. This is usually the case when the program is transitioning from one execution phase to another or when the block's numIssued values hover around the threshold value. Therefore, a confidence mechanism is needed to distinguish between high- and low-confidence cost values. LACS uses a small history table to store the costs of individual blocks (whether they are currently cached or not) observed on their previous misses. A 2-bit saturating counter (cost_stored) is stored per cache block, representing both the cost of the block and the confidence in that cost. A cost_stored value of 3 or 2 designates a high-cost block, while a value of 1 or 0 designates a low-cost block. Moreover, a cost_stored value of 3 or 0 reflects high confidence in the cost, whereas a value of 2 or 1 reflects low confidence.

In order for LACS to calculate the number of instructions issued during a miss, a performance counter that tracks the number of issued instructions in the pipeline is needed. This counter is incremented for every issued instruction; such a performance counter (or a similar one) is readily available in most modern processors. We call this counter the Issued Instructions Counter (IIC). In addition, every LLC MSHR entry is augmented with a field, the Issued Instructions Register (IIR), that stores the value of the IIC when the load/store instruction is added to the MSHR (Step B).
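The 2-bit cost/confidence encoding just described can be captured with a few helpers (a sketch of the stated semantics):

    /* cost_stored is a 2-bit saturating counter:
     *   3 or 2 -> high-cost block;   1 or 0 -> low-cost block;
     *   3 or 0 -> high confidence;   2 or 1 -> low confidence.              */
    void sat_inc(unsigned char *c) { if (*c < 3) (*c)++; }  /* saturate at 3 */
    void sat_dec(unsigned char *c) { if (*c > 0) (*c)--; }  /* saturate at 0 */

    int is_high_cost(unsigned char c)       { return c >= 2; }
    int is_high_confidence(unsigned char c) { return c == 3 || c == 0; }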


Once the miss is serviced and the data is returned from main memory, the number of instructions issued during the miss (numIssued) is calculated as the difference between the current IIC value and the IIR value stored with the missed request (Step D). Once numIssued is calculated, it is compared to the threshold value (thresh). If numIssued is smaller than thresh, the current miss is considered high-cost, and the corresponding cost_stored counter in the history table is incremented (Step E1). Otherwise, the current miss is considered low-cost, and the corresponding cost_stored counter is decremented (Step E2). In addition, a totalHigh counter is incremented for every high-cost miss observed. Figure 4 illustrates LACS's cost estimation and assignment logic.
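As a worked numeric example of Steps B, D, and E1 (all values hypothetical): suppose thresh = 60, the IIC reads 1,000,000 when the missing load is placed in the MSHR (so its IIR is set to 1,000,000), and the IIC reads 1,000,042 when the data returns. Then

\[
\mathit{numIssued} = \mathit{IIC} - \mathit{IIR} = 1\,000\,042 - 1\,000\,000 = 42 < \mathit{thresh} = 60,
\]

so the miss is classified as high-cost: the block's cost_stored counter and the totalHigh counter are both incremented (Step E1).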

[Fig. 4. LACS's cost estimation and assignment logic: the IIC performance counter and the per-MSHR-entry IIR fields.]
