The Non-Critical Buffer: Using Load Latency Tolerance to Improve Data Cache Efficiency

Brian R. Fisk and R. Iris Bahar
Division of Engineering, Brown University, Providence, RI 02912
Email: [email protected], [email protected]

Abstract

Data cache performance is critical to overall processor performance as the latency gap between the CPU core and main memory increases. Studies have shown that some loads have latency demands that allow them to be serviced from slower portions of memory, thus allowing more critical data to be kept in higher levels of the cache. We provide a strategy for identifying this latency-tolerant data at runtime and, using simple heuristics, keeping it out of the main cache and placing it instead in a small, parallel, associative buffer. Using such a "Non-Critical Buffer" dramatically improves the hit rate for more critical data and leads to a performance improvement comparable to or better than other traditional cache improvement schemes. IPC improvements of over 4% are seen for some benchmarks.

1. Introduction

The performance increase of today's high-end microprocessors is due to many factors, among them the use of speculative, out-of-order execution with highly accurate branch prediction. Branch prediction has increased instruction-level parallelism (ILP) by allowing programs to speculatively execute beyond control boundaries, while out-of-order execution has increased ILP by allowing more flexibility in instruction issue and execution. The combination of these techniques has increased processor performance in part by hiding the memory latency penalty in the case of a cache miss; instructions without data dependencies on the cache miss instruction may execute while the miss is being serviced, thus sustaining higher processor throughput. Hiding memory latency has become particularly critical in the past few years as CPU performance has been increasing faster than memory access technologies have been improving.

(This work is supported in part by NSF CAREER grant MIP-9734247.)

This large and increasing gap between the CPU and memory means that a larger number of instructions will have to stall while waiting for data values coming from accesses to memory. Local caches help to improve performance, but aliasing and capacity problems can lead to inefficient use of the cache. Techniques to reduce conflicts include increasing cache associativity, alternative cache indexing schemes [1, 2, 12], use of victim caches [10], or cache bypassing with and without the aid of a buffer [9, 11, 14, 8, 3]. However, the prevailing technique used in existing processors is to improve local cache hit rates by increasing the size and/or associativity of caches. Although including a larger or more highly associative data cache on chip may improve hit rates (and thereby avoid accesses to slower, lower-level memory), increasing the size or associativity of the cache is often not possible without also increasing the cycle latency. Unfortunately, this can cause an overall decrease in processor performance even if the cache miss rate goes down. This was shown in the Alpha 21264 microprocessor, where processor performance decreased by about 4% when going to a 2-cycle pipelined cache configuration [7]. One potential problem with the methods proposed above is that they all try to improve cache miss rate universally, without regard to the fact that some data cache misses may be more critical than others. That is, if we reduce the data cache miss rate by allowing requested data to remain in the first-level cache longer, this miss rate improvement will only lead to an improvement in performance if the load requests that now hit in the cache were formerly causing many dependent instructions to stall. Alternatively, if few instructions depend on the load accesses that now hit in the first-level cache, performance will improve little, since these independent instructions would have been able to execute out of order anyway. In fact, it may be possible to improve cache miss rate, but hurt performance, if critical data ends up being replaced by more non-critical data. This theory is supported by the idea that some data loads exhibit load latency tolerance. It has been quantitatively shown [13] that not all data accesses need to occur immediately if there are enough ready instructions for the processor to execute. This data is presumably not on any critical path.

Similarly, if a data access is initiated far enough in advance, it may be serviced by lower levels of the memory hierarchy without affecting performance. Relegating this "latency-tolerant" data to slower portions of memory mitigates the problem of having a limited-size first-level cache by reserving precious cache entries for the less tolerant and/or more frequently accessed data. High-speed caches are often direct-mapped, despite the fact that direct-mapped caches often suffer from conflict misses. To alleviate this problem, a small associative buffer may be used in parallel with the first-level data cache (such as the victim cache [10] or non-temporal buffer [11] cited earlier). These strategies often use this buffer as a "trash buffer" for data that is deemed less useful than some other competing data, but may still be required by the processor at some point. In this work, we propose using this buffer to hold data for non-critical, latency-tolerant loads while leaving the more critical data in the high-speed main cache. The non-critical data is identified during execution, when the data access misses in the first-level cache, so that the fill data may be prevented from being written into the main cache. In this study we use a cycle-level simulator of an 8-issue, speculative, out-of-order processor to evaluate the effectiveness of our Non-Critical Buffer (NCB) scheme compared to other more traditional caching strategies. We make the following contributions:

- We propose various strategies for detecting non-critical data in real time and develop policies to keep this data out of the main first-level data cache.

- We show that using the Non-Critical Buffer results in a performance improvement that is usually better than using traditional caching schemes.

- We show that when the Non-Critical Buffer is used, overall first-level cache miss rates may actually increase while overall performance remains the same or improves. This lends support to the idea that the cache is being used more efficiently.

2. Background and prior work

2.1. Associative caches

As mentioned in Section 1, cache performance may be improved by the use of a buffer alongside the first-level caches [10, 9, 11, 14, 8, 3]. The buffer is a small cache (e.g., between 8–16 entries) located between the first-level and second-level caches. The buffer may be used to hold specific data (e.g., non-temporal or speculative data), or may be used for general data (e.g., "victim" data). One side-effect of the buffer is that it may prevent useful data from being displaced by less useful data, thereby reducing "cache pollution" effects.

Figure 1 shows the design of the memory hierarchy when using a buffer alongside the first-level data cache. Also included in the figure is a representation of the five main stages found in a speculative out-of-order execution processor. Note that the instruction cache access occurs in the fetch stage, while the data cache access occurs in the issue stage for a load operation and the commit stage for a store operation. An instruction may spend more than one cycle in any of these five stages depending on the type of instruction, data dependencies, and cache hit outcome.

Figure 1. Memory hierarchy design using a buffer alongside the first-level data cache.

In the case of the victim cache [10], on a main cache miss, the victim cache is accessed. If the address hits in the victim cache, then the data is returned to the CPU, and the block is simultaneously swapped with an appropriate block in the main cache. If the address also misses in the victim cache, the data is retrieved from the second-level cache (L2) and placed in the main first-level data cache. The displaced block in the main cache is then placed in the victim cache; the victim cache block it displaces is written back to L2 if dirty and then discarded. Jouppi showed that a small victim buffer provided performance comparable to a 2-way set-associative cache. However, his analysis did not use a full simulation model to measure impact on processor performance, nor did it account for the impact of the swap tying up the caches.

Other work explores cache bypassing techniques to reduce pollution, where the L1 cache is bypassed on some load misses. In [14], Tyson proposed a method for selectively allocating space in the first-level data cache (DL1) based on the characteristics of individual load instructions. They showed that there was a marked reduction in memory bandwidth requirements due to a reduction in DL1 pollution. A more rigorous experimental method was used in subsequent papers on bypassing for cache pollution to show overall performance improvement. Johnson [9] used a full system simulator/emulator to measure the system-level effects of bypassing the DL1 with the aid of a buffer. The buffer stored data which was deemed "short term" in terms of its temporal characteristics. Johnson showed up to a 10% improvement in overall system performance. Similarly, Rivers and Davidson implemented a small, fully-associative "non-temporal streaming (NTS) cache" in parallel with the DL1 [11]. This cache was used for blocks with a history of non-temporal behavior, keeping often-reused data in the regular DL1 cache. The NTS cache usually provided a 2-3% performance improvement. John and Subramanian [8] used a different strategy to determine locality in their annex cache, putting all new fills (due to load misses) into the annex cache and promoting them to the main cache upon reads. Finally, the work of Bahar et al. proposed using a buffer to hold "speculative data" that was determined to have a high probability of being from a mis-speculated path [3]. The main cache was then targeted to hold only those blocks of data determined to be non-speculative, thus reducing pollution in the main cache.

An alternative to using an associative cache is to use hashing functions on the index bits. This technique is used in the hash-rehash cache [1], the column-associative cache [2], and the skewed-associative cache [12].
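To make the alternative-indexing idea concrete, here is a minimal sketch contrasting conventional direct-mapped indexing with one possible XOR-based hash of the index bits. It is an illustration only, under assumed parameters (32-byte lines and 256 sets, i.e., the 8KB direct-mapped DL1 used later in this paper); the actual functions used in [1, 2, 12] differ.

```c
#include <stdint.h>

/* Illustrative only: a conventional direct-mapped index and one possible
 * XOR-based alternative, in the spirit of the hashed/skewed indexing
 * schemes cited above. Constants assume 32-byte lines and 256 sets. */

#define LINE_BITS 5                          /* 32-byte lines */
#define SET_BITS  8                          /* 256 sets      */
#define SET_MASK  ((1u << SET_BITS) - 1)

/* Conventional indexing: low-order address bits above the line offset. */
static inline uint32_t index_conventional(uint32_t addr)
{
    return (addr >> LINE_BITS) & SET_MASK;
}

/* Alternative indexing: XOR a second slice of tag bits into the index so
 * that addresses mapping to the same conventional set are spread out. */
static inline uint32_t index_hashed(uint32_t addr)
{
    uint32_t low  = (addr >> LINE_BITS) & SET_MASK;
    uint32_t high = (addr >> (LINE_BITS + SET_BITS)) & SET_MASK;
    return low ^ high;
}
```

Addresses that conflict under the conventional index generally map to different sets under the hashed index, which is the effect these schemes exploit.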

2.2. Measuring latency tolerance

The main inspiration for this project comes from the research of Srinivasan and Lebeck [13], who showed that a large portion of loads do not need to be serviced immediately, and some may be delayed up to 32 cycles or more before they are needed by other instructions. They also showed that up to 36% of loads miss in the L1 cache, even though they have shorter latency requirements than L2 access times. Furthermore, up to 37% of loads are serviced in the first-level cache, although they have enough latency tolerance to be satisfied by lower levels of the memory hierarchy. There is an implication here that, if the data for these latency-tolerant loads were instead stored in a separate "non-critical" buffer, then more critical data would remain in the cache, perhaps providing a solution to each of these problems. To quantify load latency tolerance, Srinivasan and Lebeck equipped their simulated processor with "rollback" capabilities in order to arbitrarily complete loads when they were needed. Loads were allowed to remain outstanding until their simulator determined that a load result was needed by another instruction. At this point, the state of the processor was rolled back, the load was allowed to complete at the required time, and execution resumed.

This allowed the authors to determine just how long any particular load could be allowed to remain outstanding. A large portion of their research was devoted to determining when a load should be completed. For instance, they discovered that if a branch is (directly or indirectly) dependent on the load, then this load should be completed as soon as possible. Alternatively, if overall processor performance is degrading (as measured by, for example, the number of functional units in use, or the number of new instructions ready for issue per cycle), then there are probably several instructions dependent on an outstanding load, which needs to be completed. Obviously these rollback facilities would be impossible to implement in a real processor. Instead, our experiments attempt to use their observations in implementing a new cache configuration scheme that adapts to load latency tolerance (or alternatively, to data "criticality"). We do not attempt to determine a load's latency tolerance ahead of time. Instead we try to determine its criticality over the course of any cache miss that occurs. We have two methods of measuring and adapting to load data criticality:

1. Keeping track of the overall performance of the processor, and

2. Counting the number of dependencies added to the load's dependency chain over the course of the miss.

In the first method, we issue loads as usual and, if they miss in the cache, we measure performance degradation while the miss is being serviced by monitoring issue rate or functional unit usage. If processor performance degradation is detected, the load is marked critical and placed in the main cache when the fill is received from lower-level memory (allowing for fast access of this data the next time it is requested). If the load is determined not to be critical, it is placed in the Non-Critical Buffer (NCB). We make the assumption that the next access to the data will be made with a similar processor state. If the data is still in the NCB at the next request, then the fast access may be taken advantage of. However, since the data is (theoretically) latency-tolerant, if it has been replaced in the NCB, little harm should be done. Our second strategy for measuring criticality involves following the dependency graph for each load while it is outstanding. We track the number of dependencies on the load's dependency graph over the course of its time in the Load/Store Queue (LSQ) as well as over the course of the miss. A load is considered critical if more than a given number of instructions are attached to the load during the time the miss is being serviced. We consider changes in the dependency chain only during the time of the miss, since we only have control over the cache fill strategy and not over the LSQ. This strategy tends to perform better than the first strategy outlined above, but would require more hardware to implement. More details on the use and implementation of the Non-Critical Buffer are given in Section 3.

3. Experimental methodology


This section presents our experimental environment. First, the CPU simulator is briefly introduced; then we describe the Non-Critical Buffer (NCB) in more detail: how it is accessed as part of the memory hierarchy, and the various schemes we use to determine which data is latency tolerant (and therefore earmarked for the NCB).

3.1. Full simulator model

We use an extension of the SimpleScalar [5] tool suite. SimpleScalar is an execution-driven simulator that uses binaries compiled to a MIPS-like target. SimpleScalar can accurately model a high-performance, dynamically-scheduled, multi-issue processor. We use an extended version of the simulator that more accurately models the entire memory hierarchy, implementing non-blocking caches and more precise modeling of cache fills. The model implements out-of-order execution using a Register Update Unit (RUU) and a Load/Store Queue (LSQ). The RUU acts as a combined instruction window, array of reservation stations, and reorder buffer. The LSQ holds all pending load and store instructions until they are ready to be sent to the data cache. Among other things, the LSQ must check for load/store dependencies and prevent loads from executing before stores in certain cases. The simulated processor features five pipeline stages:

- Fetch: Fetch new instructions from the instruction cache and prepare them for decoding.

- Dispatch: Decode instructions and allocate RUU and LSQ entries.

- Issue/Execute: Insert ready instructions into a ready queue, and issue instructions from the queue to available functional units.

- Writeback: Forward results back to dependent instructions in the RUU.

- Commit: Commit results to the register file in program order, and free RUU and LSQ entries.

Table 1 shows the configuration of the processor modeled. Note that the first-level caches are on-chip, while the unified second-level cache is off-chip (thus having a much higher latency). In addition, we have a 16-entry fully-associative buffer associated with the first-level data cache. Note that the ALU resources listed in Table 1 may incur different latency and occupancy values depending on the type of operation being performed by the unit. Although an 8KB L1 cache may seem small by today's standards for a high-performance processor, the SPEC95 benchmarks we use for our experiments tend to use a small data set. Therefore, we use a smaller cache to obtain reasonable hit/miss rates. Our baseline processor featured mostly unlimited resources, aside from the memory subsystem, in order to isolate the effects of the cache on overall performance. Our simulations are executed on SPECint95 and SPECfp95 benchmarks [6]; they were compiled using a retargeted version of the GNU gcc compiler, with full optimization. This compiler generates SimpleScalar machine instructions. Since we are executing a full model on a very detailed simulator, the benchmarks take several hours to complete; due to time constraints we feed the simulator with a small set of inputs. Integer benchmarks (except for go) are executed to completion. Benchmark go and the floating point benchmarks are simulated only for the first 600 million instructions decoded.

Since we want to delay the decision to write fill data to the main data cache or the NCB until after the effects of the miss are known, a method of keeping track of the outstanding misses is needed. To provide this service, each of the caches may be equipped with a set of "Miss Status Holding Registers" (MSHRs). Upon a miss, the MSHRs are updated with the address that missed. On each cycle, the MSHRs are checked to see if a fill has been received from lower-level memory. If so, a cache is chosen (NCB or first-level data cache) using the desired fill strategy and the fill is placed in the appropriate block of that cache. Note that it is now possible for an access to hit a block that has a miss outstanding. While these accesses have a latency greater than the cache access time, in our experiments this is counted as a hit for compatibility with the original SimpleScalar code.
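A rough sketch of this MSHR bookkeeping is given below. The register count, the helper routines (dl1_fill, ncb_fill, fill_arrived), and the per-entry critical flag are our own assumptions for illustration, not the simulator's actual interface; the flag would be set by one of the criticality heuristics described in Section 3.3.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_MSHRS 8   /* illustrative; the paper does not state a count */

/* One Miss Status Holding Register: tracks an outstanding DL1 miss so the
 * DL1-vs-NCB placement decision can be deferred until the fill returns. */
typedef struct {
    bool     valid;
    uint32_t block_addr;   /* address of the missing block              */
    bool     critical;     /* placement decision, updated while pending */
} mshr_t;

static mshr_t mshrs[NUM_MSHRS];

/* Hypothetical hooks standing in for the real cache-fill machinery. */
extern void dl1_fill(uint32_t block_addr);     /* place block in main DL1 */
extern void ncb_fill(uint32_t block_addr);     /* place block in the NCB  */
extern bool fill_arrived(uint32_t block_addr); /* fill back from L2/mem?  */

/* Called once per simulated cycle: check each outstanding miss and, when
 * its fill arrives, route the block according to the criticality flag. */
void mshr_cycle(void)
{
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (!mshrs[i].valid || !fill_arrived(mshrs[i].block_addr))
            continue;
        if (mshrs[i].critical)
            dl1_fill(mshrs[i].block_addr);
        else
            ncb_fill(mshrs[i].block_addr);
        mshrs[i].valid = false;   /* MSHR entry is freed */
    }
}
```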

3.2. Non-swapping victim cache

In order to compare the NCB performance to established caching strategies, we also implemented a variant of Jouppi's victim cache with some slight changes. First, we access both the main and victim cache in parallel rather than sequentially. Second, our victim buffer is non-swapping, meaning that hits in the victim buffer are not promoted back to the main cache. Prior work [4] has shown that a non-swapping victim buffer performs as well as or better than a swapping victim cache, since the caches are never tied up for extra cycles in order to swap the data. In [10], the data cache of a single-issue processor was considered, where a memory access occurs approximately one out of every four cycles; thus the victim cache had ample time to perform swapping. A modern 4-issue processor has an average of one data memory access per cycle, and tying up the caches to swap is detrimental to performance. In addition, this scheme should be simpler to implement in hardware than one that implements swapping.
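The sketch below illustrates the non-swapping lookup path, assuming hypothetical probe and miss-handling helpers rather than the simulator's real interface. The key point is that the main cache and the buffer are probed in the same cycle, and a buffer hit is served in place rather than promoted, so neither cache is tied up performing a swap.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed helper routines (not the simulator's actual interface). */
extern bool dl1_probe(uint32_t addr);    /* hit in the main DL1?          */
extern bool buf_probe(uint32_t addr);    /* hit in the parallel buffer?   */
extern void start_miss(uint32_t addr);   /* allocate an MSHR, go to L2    */

typedef enum { HIT_DL1, HIT_BUFFER, MISS } access_result_t;

access_result_t dl1_access(uint32_t addr)
{
    bool in_dl1 = dl1_probe(addr);   /* both lookups happen in parallel  */
    bool in_buf = buf_probe(addr);

    if (in_dl1)
        return HIT_DL1;
    if (in_buf)
        return HIT_BUFFER;           /* served from the buffer; no swap  */

    start_miss(addr);                /* missed in both => counted as a   */
    return MISS;                     /* DL1 miss for the reported rates  */
}
```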

Table 1. Machine configuration and processor resources.

L1 Icache:                   8KB direct-mapped; 32B line; 1 cycle
L1 Dcache:                   8KB direct-mapped; 32B line; 1 cycle
L2 Cache:                    256KB 4-way; 64B line; 12 cycles
Memory:                      128 bit-wide; 20 cycles on hit, 50 cycles on page miss
Branch Pred.:                2k gshare + 2k bimodal + 2k meta
BTB:                         1024 entries, 4-way set assoc.
RAS:                         32-entry queue
ITLB:                        32 entries, fully assoc.
DTLB:                        64 entries, fully assoc.
Fetch/Issue/Commit Width:    8
Integer ALU:                 8
Integer Mult/Div:            4
FP ALU:                      8
FP Mult/Div/Sqrt:            4
Memory Ports:                2
Instr. Window (RUU) Entries: 64
LSQ Entries:                 128
Fetch Queue:                 128
Min. Mispred. Penalty:       6 cycles

3.3. Implementation of the non-critical buffer

Similar to a victim cache, our Non-Critical Buffer is a small (16-entry), fully-associative buffer that is accessed in parallel with the main first-level data cache. Access time of the NCB is the same as the main cache (i.e., 1 cycle). Unlike the victim cache, the NCB uses an active fill mechanism to dynamically determine whether to place new fills into the main cache ("critical data") or the NCB ("non-critical data") during program execution.

Our first scheme for using the NCB tracks processor performance over the past few cycles and uses this information to determine criticality. To record the processor performance, a simple shift register is used. A particular performance metric is chosen, such as functional unit utilization or instruction issue rate. In this context, "issue rate" represents the number of new instructions ready to be issued to functional units. On each cycle, if the processor is "busy" enough (for example, more than some user-defined threshold of functional units are being used), then a value of 1 is shifted into the register. Upon a cache miss, when a decision needs to be made about where to place the fill, the number of 1's in the register is counted and, if it is greater than another user-defined threshold, the data is placed in the NCB. In our experiments we varied the threshold for shifting a 1 into the register, as well as the history length and the threshold on the number of 1's in the shift register for a load to be considered critical.
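A minimal sketch of this history-based heuristic follows, using the parameters of the configuration reported in Table 4 (6-cycle history, busy threshold of 6, critical threshold of 3). The helper names and the exact form of the per-cycle busy comparison are our own assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define HISTORY_LEN        6   /* cycles of history kept                   */
#define BUSY_THRESHOLD     6   /* e.g., functional units in use this cycle */
#define CRITICAL_THRESHOLD 3   /* <= this many busy cycles => critical     */

static uint8_t busy_history;   /* low HISTORY_LEN bits hold the history    */

/* Called once per cycle with the chosen performance metric (functional-
 * unit usage or new instructions ready to issue). Shifts in a 1 when the
 * processor looks "busy" this cycle; the exact comparison is assumed. */
void history_update(int metric_this_cycle)
{
    uint8_t busy = (metric_this_cycle >= BUSY_THRESHOLD) ? 1 : 0;
    busy_history = ((busy_history << 1) | busy) & ((1u << HISTORY_LEN) - 1);
}

/* Consulted when a fill returns for an outstanding DL1 miss. A mostly-busy
 * recent history suggests the load is latency tolerant, so the fill goes
 * to the NCB; otherwise it is treated as critical and goes to the DL1. */
bool fill_is_critical(void)
{
    int ones = 0;
    for (int i = 0; i < HISTORY_LEN; i++)
        ones += (busy_history >> i) & 1;
    return ones <= CRITICAL_THRESHOLD;   /* few busy cycles => critical */
}
```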


Our second scheme for using the NCB tracks the number of instructions added to the instruction window that are dependent on a given memory operation. The number of data dependencies is referenced at two different points over the lifetime of a memory operation: immediately before a cache access and during a cache miss. These statistics can be tabulated separately for hits and misses. In a hardware implementation, this scheme might involve a counter for each LSQ entry that is incremented with each new dependency, as well as an identical counter as part of each MSHR (for counting during misses). The dependency-based NCB scheme uses the cache-miss dependency information to determine criticality: if the number of dependencies added during the miss is greater than a user-defined threshold, the data is deemed critical and placed in the main cache. Otherwise, the data is placed in the NCB. In addition, Srinivasan and Lebeck [13] showed that having a branch dependent on a load should mark the data as highly latency intolerant, or very critical. If this occurs, the data is marked critical regardless of the number of dependencies added during the cache miss (since it is possible that the branch was added before the actual cache access).
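The sketch below shows one way the per-MSHR dependency counter and the branch rule could fit together, using the best-performing threshold of one dependency. The structure and function names are ours, not actual simulator code.

```c
#include <stdbool.h>

#define DEP_THRESHOLD 1   /* any dependent instruction during the miss => critical */

typedef struct {
    int  deps_added_during_miss;  /* incremented as dependents enter the window */
    bool dependent_branch_seen;   /* set if any (in)direct dependent is a branch */
} miss_dep_info_t;

/* Called when a new instruction is found to depend (directly or
 * indirectly) on the outstanding load. The branch flag may be set even
 * for dependents added before the cache access itself. */
void note_dependent(miss_dep_info_t *m, bool is_branch, bool during_miss)
{
    if (during_miss)
        m->deps_added_during_miss++;
    if (is_branch)
        m->dependent_branch_seen = true;
}

/* Consulted when the fill returns: a dependent branch always forces the
 * block into the main DL1; otherwise the threshold decides. */
bool dep_fill_is_critical(const miss_dep_info_t *m)
{
    if (m->dependent_branch_seen)
        return true;
    return m->deps_added_during_miss >= DEP_THRESHOLD;
}
```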

4. Experimental results

4.1. Base case

In this section we will describe the base case we used for our initial experiments and the baseline results. All other experiments are compared to this one.

Table 2. Base case results using an 8KB direct-mapped first-level data cache.

Test        IPC      LSQocc    MR (%)   Dmiss    Dhit
applu       1.9244   18.5181   5.97     0.8614   3.6596
compress    1.9532   24.8755   7.44     6.3769   4.0806
go          0.5565    3.5942   9.95     3.3705   3.2394
ijpeg       2.5738   31.2164   3.36     3.2156   5.6111
li          1.3508   11.7190   3.83     0.7225   2.5561
mgrid       4.6990   91.4849   4.93     2.2478   6.2080

As stated before, our base case uses 8K direct-mapped on-chip first-level caches (referred to as DL1 for data and IL1 for instruction), and a unified 256K 4-way off-chip second-level cache (UL2). Table 2 shows the instructions per cycle (IPC), the average LSQ occupancy (the average number of load or store instructions in the LSQ per cycle), the DL1 miss rate (MR), the average number of dependencies added to a load's dependency graph during a cache miss (Dmiss), and the average number added during a load's time in the LSQ on a cache hit (Dhit). We see from this table that the benchmarks vary widely in cache miss rate, LSQ occupancy, and the average number of dependencies. Of particular note is the relatively high number of dependencies added during the time the memory access is in the LSQ (prior to the actual cache access), Dhit, compared to the average number added during a cache miss, Dmiss. Note that Dhit only represents dependencies added during the time in the LSQ, prior to a cache hit. These values give an indication of the overhead of making a memory access, and indicate that, in general, many dependencies are added to a load's dependency chain before the actual cache access is even attempted. Our caching technique has no control over this LSQ behavior, but it still shows an improvement in performance. Further enhancement may better address this LSQ problem for an even greater performance improvement.


4.2. Traditional techniques

Traditional approaches for improving processor performance (particularly IPC) aim to reduce the number of cache misses. Aside from increasing the size of the cache (which is not examined here), associativity may be added to the cache or a traditional victim buffer may be used. These techniques often offer good performance improvements by reducing first-level cache miss rates. As these methods have actually been implemented in industry designs for some time now, any new caching strategy should at least be comparable with these tried-and-true methods. Table 3 presents results obtained by either changing the cache associativity to a 2-way associative cache, or by adding a 16-entry, fully-associative, non-swapping victim buffer to the base case configuration. Each cache retains its one-cycle access latency. As expected, IPC improves across all benchmarks, varying from a 0.5% improvement for applu to 2.5% for compress.

Table 3. Traditional techniques: columns labeled %IPC and %MR give the percent change from the base case; negative percentages indicate a decrease in IPC or an increase in cache miss rate. Columns labeled ΔLSQ, ΔDmiss, and ΔDhit give raw changes.

8KB 2-way associative cache:
Test        %IPC    ΔLSQ     %MR      ΔDmiss   ΔDhit
applu       0.468   -0.105   20.436   -0.263    0.013
compress    2.473   -0.666   12.769   -0.537   -0.074
go          2.084   -0.751   46.030   -0.229   -0.017
ijpeg       1.628   -0.256   28.571   -0.703   -0.019
li          1.658   -0.476   26.893   -0.196   -0.041
mgrid       1.698   -0.099   18.864   -0.451   -0.133

16-entry victim cache:
Test        %IPC    ΔLSQ     %MR      ΔDmiss   ΔDhit
applu       0.473   -0.095   18.090   -0.186    0.085
compress    2.196   -0.623   11.559   -0.434    1.154
go          1.671   -0.597   36.784   -0.174   -0.090
ijpeg       1.496   -0.305   33.036   -0.650   -0.079
li          1.518   -0.482   24.804   -0.220    0.024
mgrid       1.671   -0.089   20.487   -0.446   -0.370

Overall DL1 miss rates also improve, by up to 46% for go using a 2-way associative cache and up to 37% using a victim cache (note that, for the purpose of computing DL1 miss rate with an extra associative buffer, a "miss" occurs when data misses in both the main cache and the associative buffer). Since a load or store remains in the LSQ until the memory access is completely serviced, it is not surprising that the average LSQ occupancy decreases with improved cache hit rate. Changes in dependencies are relatively small because these strategies do not attempt to directly address any dependency issues.

4.3. Non-critical buffer

The Non-Critical Buffer is used as part of an active cache management strategy that does not necessarily attempt to reduce cache miss rate, but instead makes sure that the most critical data is retained in the cache. Less critical data may miss more often, resulting in higher overall miss rates, but the performance gains from keeping the critical data in the cache outweigh these penalties. Table 4 shows results when using an NCB with processor performance history as the criticality metric. We used several different configurations in our experiments; the best-performing configuration is shown in the table. Table 5 presents results obtained when using the dependency-counting heuristic for measuring criticality. Several conclusions can be immediately drawn from these results. First, both NCB schemes improve IPC values over the base case. IPC improvements are better than or comparable to improvements seen when using a 2-way associative or victim cache, and substantially better in the case of compress. In general, using the dependency counting scheme provides better performance improvements than the history scheme. The best results (shown in Table 5) were obtained for a criticality threshold of 1 or more dependencies added during a miss. That is, if any instructions are dependent on the load, it should be considered critical.

Table 4. NCB with history measurement: percent improvement compared to the base case when using a Non-Critical Buffer with recent performance history as the criticality metric. History length = 6 cycles, busy threshold = 6, critical threshold = 3 or fewer.

Functional unit history:
Test        %IPC    ΔLSQ     %MR       ΔDmiss   ΔDhit
applu       0.535    0.303   -11.558    0.036   -0.002
compress    2.038   -0.605    10.215   -0.617    1.996
go          1.420   -0.465    28.744    0.063   -0.009
ijpeg       1.403   -0.141    23.512   -0.493   -0.561
li          1.007   -0.083    18.277   -0.173    0.068
mgrid       1.211   -0.079   -32.049    0.001    0.201

Instruction issue rate history:
Test        %IPC    ΔLSQ     %MR       ΔDmiss   ΔDhit
applu       0.556    0.266    -5.863    0.147    0.116
compress    2.053   -0.604    10.484   -0.599    1.218
go          1.420   -0.466    28.945    0.057   -0.016
ijpeg       1.352   -0.132    24.405   -0.462   -0.354
li          1.066   -0.094    18.538   -0.170    0.073
mgrid       1.292   -0.089   -18.458   -0.021   -0.067

Table 5. NCB with dependency measurement: percent improvement compared to the base case when using a Non-Critical Buffer with the number of dependencies added during a cache miss as the criticality metric. The threshold for criticality is one or more dependencies added during a miss.

Test        %IPC    ΔLSQ     %MR       ΔDmiss   ΔDhit
applu       0.899    0.347   -37.688   -0.385    0.718
compress    4.413   -0.341     2.554   -1.254    2.186
go          1.294   -0.440    25.427   -0.177    0.175
ijpeg       1.671   -0.034   -20.238   -1.387   -0.164
li          0.851   -0.341    10.183   -0.186    0.267
mgrid       1.594   -0.041   -16.633   -0.440   -0.065

We will discuss the dependency-based strategy first. As mentioned above, compress sees a particularly impressive performance improvement when using the NCB (4.4%) compared to the base case. This is because the compress benchmark has a significant portion of loads with a high number of dependencies. Referring back to the base case (Table 2), we see that compress features a large average number of dependencies added during a miss (6.37). This implies that a large number of misses in the base case are very critical. Our fill strategy tries to keep data with dependencies in the DL1 cache, and succeeds, as the average number of dependencies added during a miss drops by 1.3. This is also reflected in the increase in dependencies during a hit by 2.2, implying that critical data is hitting more often in the cache, and that non-critical data has been replaced with this critical data. This phenomenon is also exhibited by the other benchmarks, especially those with large IPC improvements. For example, ijpeg also has miss dependencies decreased by 1.4. Other benchmarks decrease by somewhat smaller amounts.


Overall IPC improvement may be related more to the absolute reduction in miss dependencies than to the relative (percentage) improvement. For example, compress and ijpeg exhibit large reductions in dependencies (1.3 and 1.4, respectively) for the largest improvements in IPC for the group. However, the reduction in dependencies for the compress benchmark is 20% of its base case average. Compare this to li, where the miss dependencies are reduced by a comparable percentage (26%, or 0.19 dependencies), but for a performance improvement of only 0.85%. A directly related observation is that benchmarks with longer average dependency chains in the base case have the most to gain from even a small improvement. For example, compress, go and ijpeg each have a large average number of miss dependencies, and all respond well to our NCB strategy. On the other hand, li and mgrid are not well suited to this fill strategy, since they have a relatively small number of dependencies in the base case. As such, they see very small improvements, since most data is placed in the NCB and the main cache is underutilized. A final observation is that in several benchmarks the overall DL1 cache miss rate increases, sometimes very substantially, while performance is also improving. This seems counterintuitive at first, but strongly supports the notion that the data that is still hitting in the cache is much more critical and of greater use in keeping the processor busy. For instance, ijpeg sees a large increase in miss rate, but has the second largest gain in IPC. The performance history strategy did not generally perform as well as the dependency length strategy. Though it did outperform it on some benchmarks (li, go), it was not as impressive as the 2-way associative or victim caches, nor was it as impressive as the dependency length strategy on the other benchmarks. With the exception of ijpeg, measuring instruction issue rate history met with better success than functional unit usage. The performance history strategy also does not appear to be performing its prescribed function nearly as clearly.

Miss dependencies are not substantially reduced in most cases. Coupled with significant cache miss rate improvements, this may suggest that the history algorithm is too random, and the NCB is effectively only adding associativity to the cache. The algorithm can only count 1's in the shift register, and cannot distinguish between different patterns of 1's and 0's. This may be an issue because "111000" has different implications for performance trends than "101010". However, the performance history strategy is almost certainly easier to implement in hardware than tracking dependencies, as it simply consists of a shift register and logic to count the number of high values in the register. The dependency method would require more complex hardware to count new dependencies added during a cache miss. However, the clearer performance gains using this strategy may make it worthwhile.

5. Conclusion

This paper explores several ways to exploit load latency tolerance in a dynamically scheduled processor. We show that dependency information gathered during a cache miss can be used to determine a load's criticality. A small, associative Non-Critical Buffer in parallel with the main data cache may be used as an insurance policy against a future cache miss if the data is deemed non-critical, rather than bypassing the data cache entirely. Using the NCB results in a performance improvement comparable to or better than traditional caching techniques. A simple shift register containing a local history of processor performance does not appear to be the best way to determine criticality. Counting the number of instructions dependent on a load is a better metric, and using this information as a criticality measurement with the NCB scheme may lead to more (but less critical) cache misses. The shift register history method may not perform as well, but is attractive due to its comparatively low hardware implementation cost. A more intelligent method for analyzing data in the history register may lead to improved results.

There are several areas open to future work. First, more research needs to be done to investigate the growth of the dependency chain during an instruction's lifetime in the LSQ, prior to cache access, and how it may be affecting this NCB technique. Second, there may be further improvements to the dependency-based NCB scheme to increase performance. Finally, while the SimpleScalar simulator features a very reasonable approximation of a memory subsystem, implementing the NCB scheme in a simulator with a more robust memory model could yield substantially different results.

Acknowledgements: The authors would like to thank Dirk Grunwald for his generosity in donating spare CPU cycles for many of our simulations. We would also like to thank the anonymous reviewers for their valuable comments.

References

[1] A. Agarwal, J. Hennessy, and M. Horowitz. Cache performance of operating systems and multiprogramming workloads. ACM Transactions on Computer Systems, 6:393-431, November 1988.
[2] A. Agarwal and S. D. Pudar. Column-associative caches: A technique for reducing the miss rate of direct-mapped caches. In 26th Annual International Symposium on Microarchitecture, pages 179-190. IEEE/ACM, December 1993.
[3] R. I. Bahar, G. Albera, and S. Manne. Using confidence to reduce energy consumption in high-performance microprocessors. In Int. Symposium on Low Power Electronics and Design, pages 64-69. IEEE/ACM, August 1998.
[4] R. I. Bahar, D. Grunwald, and B. Calder. A comparison of software code reordering and victim buffers. In Computer Architecture News. ACM SIGARCH, March 1999.
[5] D. C. Burger, T. M. Austin, and S. Bennett. Evaluating future microprocessors - the SimpleScalar toolset. Technical Report 1308, University of Wisconsin-Madison, Computer Sciences Department, July 1996.
[6] J. Gee, M. Hill, D. Pnevmatikatos, and A. J. Smith. Cache performance of the SPEC benchmark suite. IEEE Micro, 13(4):17-27, August 1993.
[7] L. Gwennap. Digital 21264 sets new standard. In Microprocessor Report. MicroDesign Resources, October 1996. http://www.digital.com/semiconductor/microrep/digital2.htm.
[8] L. K. John and A. Subramanian. Design and performance evaluation of a cache assist to implement selective caching. In International Conference on Computer Design, pages 510-518. IEEE, October 1997.
[9] T. L. Johnson and W.-m. W. Hwu. Run-time adaptive cache hierarchy management via reference analysis. In 24th Annual International Symposium on Computer Architecture, pages 315-326. IEEE/ACM, June 1997.
[10] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In 17th Annual International Symposium on Computer Architecture. IEEE/ACM, June 1990.
[11] J. A. Rivers and E. S. Davidson. Reducing conflicts in direct-mapped caches with a temporality-based design. In International Conference on Parallel Processing, pages 154-163, August 1996.
[12] A. Seznec. A case for two-way skewed-associative caches. In 20th Annual International Symposium on Computer Architecture, pages 169-178. IEEE/ACM, May 1993.
[13] S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. In 31st Annual International Symposium on Microarchitecture. IEEE/ACM, December 1998.
[14] G. Tyson, M. Farrens, J. Matthews, and A. R. Pleszkun. Managing data caches using selective cache line replacement. Journal of Parallel Programming, 25(3):213-242, June 1997.
