Predictability of Load/Store Instruction Latencies

Santosh G. Abraham, Rabin A. Sugumar†, Daniel Windheiser
EECS Department, The University of Michigan, Ann Arbor, MI 48109-2122

B. R. Rau, Rajiv Gupta
Hewlett Packard Laboratories, 1501 Page Mill Road, Bldg. 3U-7, Palo Alto, CA 94304

Abstract

Due to increasing cache-miss latencies, cache control instructions are being implemented for future systems. We study the memory referencing behavior of individual machine-level instructions using simulations of fully-associative caches under MIN replacement. Our objective is to obtain a deeper understanding of useful program behavior that can eventually be employed in optimizing programs and to motivate architectural features aimed at improving the efficacy of memory hierarchies. Our simulation results show that a very small number of load/store instructions account for a majority of data cache misses. Specifically, fewer than 10 instructions account for half the misses for six out of nine SPEC89 benchmarks. Selectively prefetching data referenced by a small number of instructions identified through profiling can reduce the overall miss ratio significantly while incurring only a small number of unnecessary prefetches.

1 Introduction

Processor performance has been increasing at 50% per year, but memory access times have been improving at only 5-10% per year. As a result, the latency of cache misses in processor cycles is increasing rapidly. Besides the increasing cache-miss latencies, another trend is towards processors that issue more than one instruction/operation per cycle, such as VLIW or superscalar processors. In these systems, the effective cache-miss penalty is proportional to the issue width. Therefore, it is important to develop techniques to reduce the cache-miss ratio and to tolerate the latency of cache misses.

* Research supported in part by ONR contract N00014-93-10163.
† Currently at Cray Research, Eau Claire, WI 54703.

Current compilers schedule all loads using the cache-hit latency. Even though a lockup-free cache prevents a stall on a cache miss, the processor often stalls soon afterwards on an instruction that requires the data that missed the cache. The dynamic scheduling capabilities needed to cope with the highly variable latencies associated with cache misses are expensive and could degrade the processor cycle time. The alternative of scheduling all loads with the cache-miss latency is not feasible for most programs because it requires considerable instruction-level parallelism (which may not be present) if it is to be successful, and it increases register pressure. Studies investigating load delay slots in RISC processors support this observation. Scheduling with the miss latency eliminates a large portion of the value of having a cache, viz., the short average latency, and retains only the reduction in main memory traffic. In this paper, we investigate the potential improvements of a third alternative, selective scheduling, where some loads are selected for scheduling with the cache-miss latency while other loads are scheduled with the cache-hit latency.

In order to effectively support selective scheduling, systems should have instructions that permit the software to manage the cache, e.g., the DEC Alpha [1]. We describe a set of source-cache or latency specifiers for loads that specify the highest level in the memory hierarchy where the data is likely to be found [13]. After an earlier phase of compilation has inserted the latency specifiers, the instruction scheduler schedules long-latency loads, which tend to miss the cache, with the cache-miss latency and all other short-latency loads with the cache-hit latency. In a VLIW processor, the architectural latency of a load is determined using its latency specifier. If the behavior of loads is consistent with their latency specification, a major source of nondeterministic, high-variance latencies which greatly impacts the performance of VLIW processors is eliminated. Superscalar processors can also use the latency specifiers for dynamic scheduling. We describe a set of target-cache or level-direction specifiers for load/store instructions that install data at specific levels in the memory hierarchy [13]. Through these instructions, the software can manage the contents of caches at various levels in the memory hierarchy and potentially improve cache-hit ratios and/or improve the accuracy of latency specification.

In selective scheduling, the compiler must tag each load with a latency specifier. For regular scientific programs, compile-time analysis can identify the loads to schedule with the cache-miss latency. But for irregular programs such as gcc or spice, compile-time techniques have not been developed. A key issue that will determine the effectiveness of latency specification is the predictability of load latencies for such irregular programs. In this paper, we study the behavior of individual load/store instructions with respect to their data cache accessing behavior using cache simulations. Unlike earlier research that reported overall miss ratios, we determine the miss ratios of individual load/store instructions. Though a relatively small number of loads/stores account for a large majority of memory references, a much smaller number of loads/stores account for most cache misses. We show that a large fraction of memory references are made by instructions that miss extremely infrequently. We use the profiling information obtained by these cache simulations to label loads as short or long latency instructions. Using performance metrics, such as cache misses and bus traffic, we compare various schemes. Our results suggest that profiling of loads/stores can potentially be used profitably by the compiler to specify the latency of loads/stores.

There are some important factors that make the usage of load profiling information more difficult than the more established branch profiling. Firstly, load profiling is inherently more expensive than branch profiling by one or two orders of magnitude. However, there are several methods to reduce this cost. Secondly, unlike branches whose targets are usually known, the address of a load is often not known sufficiently in advance. If the address is not known, the long-latency load cannot be moved sufficiently far ahead in the schedule. Even if load profiling is not feasible for production compiler use, the load profiling results may lead to compiler techniques that generate similar information.

Related Work

Several researchers have investigated prefetching to reduce cache miss stalls. Callahan and Porterfield [4] investigate the data cache behavior of scientific code using source-level instrumentation. For a suite of regular scientific applications, they show that a few source-level references contribute to most of the misses. In a later paper, Callahan and Porterfield [3] present techniques for prefetching to tolerate the latency of instructions that miss. They investigate both complete prefetching, where all array accesses are prefetched, and selective prefetching, where the compiler decides, based on the size of the cache, what data to prefetch. McNiven and Davidson [19] evaluate the reductions in memory traffic obtainable through compiler-directed cache management. Klaiber and Levy [15] and Mowry et al. [20] also investigate selective prefetching using compiler techniques for scientific code. Mowry et al. [20] combine prefetching with software pipelining and further investigate the interaction between blocking and prefetching. Chen and Baer [6] investigate hardware-based prefetching and statically moving loads within basic blocks to tolerate cache-miss latencies for some SPEC benchmarks. Chen et al. [7] reduce the effect of primary cache latency by preloading all memory accesses for the SPEC benchmarks. In this paper, we describe a framework for software cache control that subsumes prefetching. We illustrate the performance improvements obtainable through software cache control using the matrix multiply program. Unlike Callahan and Porterfield [4], we investigate the caching behavior of machine-level instructions for irregular floating-point programs such as spice and integer programs such as gcc. We also evaluate the costs and benefits of prefetching using load/store profiling.

Profile-guided optimizations play an important role in code optimization. Branch profiling has been used in scheduling multiple basic blocks together to increase instruction-level parallelism in VLIW and superscalar machines by Fisher [10] and Hwu et al. [12]. Profiling has also been used in instruction cache management by McFarling [18] and by Hwu and Chang [11], in register allocation by Chow and Hennessy [8], and in inlining by Chang et al. [5]. This work proposes using load/store instruction profiling for instruction scheduling and cache management. MemSpy [16] is a profiling tool that can be used to identify source-level code segments and data structures with poor memory system performance. Our work focuses on opportunities for optimization of cache performance through profiling of machine-level load/store instructions.

The rest of the paper is structured as follows. Section 2 reviews the instruction extensions for software cache control. Section 3 illustrates several concepts underlying our approach, and develops the machine model, using a matrix multiply program. Section 4 describes the profiling experiments and the results. Section 5 is the conclusion.

2 Instruction Extensions for Software Cache Control

In this section, we review a comprehensive, orthogonal set of modifiers to loads/stores that can be used to specify the latency of loads as well as to control the placement of data in the memory hierarchy. In most systems, caches are entirely managed by the hardware. But in order to control the placement of data in hardware-managed caches, the compiler¹ must second-guess the hardware. Furthermore, since the cache management decisions are not directly evident to the software, the load/store latencies are nondeterministic. At the other extreme, local or secondary memory is entirely managed by the program. The main disadvantage of an explicitly managed local memory is that the software is forced to make all management decisions. We describe a mixed implicit-explicit managed cache that captures most of the advantages of the two extremes [13]. A mixed cache provides software mechanisms to explicitly control the cache as well as simple default hardware policies. The software mechanisms are utilized when the compiler has sufficient knowledge of the program's memory accessing behavior and when significant performance improvements are obtainable through software control. Otherwise, the cache is managed using the default hardware policies.

We now describe the load/store instructions in the HPL PlayDoh architecture² in the remainder of this section [13]. In addition to the standard load/store operations, the PlayDoh architecture provides explicit control over the memory hierarchy and supports prefetching of data to any level in the hierarchy. The PlayDoh memory system consists of the following: main memory, second-level cache, first-level cache and a first-level data prefetch cache. The data prefetch cache is used to prefetch large amounts of data having little or no temporal locality without disturbing the conventional first-level cache. The first- and second-level caches may either be a combined cache or be split into separate instruction and data caches. This memory hierarchy can be extended to several levels. For the purposes of this paper, we are primarily interested in managing the first-level cache and the data prefetch cache.

There are two modifiers associated with each load operation, which are used to specify the load latency and to control the movement of data up and down the cache hierarchy. The first modifier, the latency and source cache specifier, serves two purposes. First, it is used to specify the load latency assumed by the compiler. Second, it is used by the compiler to indicate its view of where the data is likely to be found. The second modifier, the target cache specifier, is used by the compiler to indicate its view of the highest level in the cache hierarchy to which the loaded data should be promoted for use by subsequent memory operations. (Note that the first-level cache is the highest level in the hierarchy.) The second specifier is used by the compiler to control the cache contents and manage cache replacement. Each of these specifiers can be one of the following: V for the data prefetch cache, C1 for the first-level data cache, C2 for the second-level data cache, and C3 for main memory. Non-binding loads cause the specified promotion of data in the memory hierarchy without altering the state of the register file. In a non-binding load, the destination of the load is specified as register 0. For convenience, we will use the terms pretouch and prefetch to refer to non-binding loads that specify the target cache as C1 and V respectively. In contrast to regular loads, prefetches and pretouches bring the data closer to the processor without tying up registers and thereby increasing register pressure. Also, long-latency loads are binding loads that specify C2 as the source cache. Other binding loads are short-latency loads. A store operation has only a target cache specifier. As with the load operations, it is used by the compiler to specify the highest level in the cache hierarchy at which the stored data should be installed to be available for use by subsequent memory operations.

¹ In this section, when we say compiler we mean programmer/compiler.
² HPL PlayDoh is a parametric architecture that has been defined to support research in instruction-level parallel computing.
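As a concrete illustration of the two modifiers, the sketch below encodes a load operation with its source-cache and target-cache specifiers and classifies it the way an annotation tool might. This is a minimal C sketch for exposition only; the type and field names are our own and are not the actual HPL PlayDoh encoding or mnemonics.

/* Illustrative encoding of the two load modifiers described above.
 * Names are our own; this is not the HPL PlayDoh instruction format. */
#include <stdio.h>

typedef enum { V, C1, C2, C3 } CacheLevel;  /* V = data prefetch cache, C3 = main memory */

typedef struct {
    int        dest_reg;      /* destination register; 0 marks a non-binding load     */
    CacheLevel source_cache;  /* latency/source-cache specifier: where data is expected */
    CacheLevel target_cache;  /* target cache specifier: highest level to install data  */
} LoadOp;

/* Non-binding loads with target C1 are pretouches; with target V, prefetches.
 * Binding loads with a C2 source cache are long-latency loads. */
static const char *classify(const LoadOp *op) {
    if (op->dest_reg == 0)
        return op->target_cache == C1 ? "pretouch" :
               op->target_cache == V  ? "prefetch" : "non-binding load";
    return op->source_cache == C2 ? "long-latency load" : "short-latency load";
}

int main(void) {
    LoadOp miss_load = { .dest_reg = 4, .source_cache = C2, .target_cache = C1 };
    LoadOp touch     = { .dest_reg = 0, .source_cache = C2, .target_cache = C1 };
    printf("%s\n%s\n", classify(&miss_load), classify(&touch));
    return 0;
}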

3 Programmatic control of memory hierarchy: An example

Although our interest extends beyond regular programs, we illustrate the concepts underlying this paper using the familiar blocked matrix multiply example. The usefulness of the cache control instructions can be demonstrated concisely and effectively on the matrix multiply example. Furthermore, the optimized blocked matrix multiply with the cache control instructions exhibits ideal behavior in the sense that a few instructions consistently miss the cache and the other instructions always hit the cache. This information can be exploited to improve program performance by choosing the right load instruction from the repertoire described in the previous section. In the experimental results section, the other benchmarks can be better understood by contrasting their behavior with that of blocked matrix multiply.

3.1 LRU Statistics for Blocked Matrix Multiply

For reasonable cache sizes, the regular unblocked matrix multiply incurs approximately 2n^3 cache misses out of a total of 3n^3 data references. Fig. 1 shows a blocked version of the matrix multiply program with a block size of m = 32. In general, the block size m in the program is chosen so that m^2 + 3m <= cache size. Satisfying this inequality ensures that a square block of size m x m of B can be kept resident in cache throughout its use. Compared to the regular matrix multiply, the misses due to rereferences to elements in the B array are eliminated and the overall number of misses is reduced to approximately 2n^3/m. Blocking has been widely used by programmers for regular scientific applications and has recently been incorporated as an automatic transformation in some research and commercial compilers. Our objective is to illustrate the regularity of memory references of load/store instructions, and this objective can be realized using either the regular or the blocked matrix multiply. Since the validity of new optimization approaches is best illustrated on programs that have already been optimized using existing techniques, we choose the blocked version of matrix multiply for further consideration.

C     blocked matrix multiply
C     constraint: n has to be an integer multiple of
C     the block size m
      parameter (n = 128, m = 32)
      real A(n, n), B(n, n), C(n, n)

      do 10 jj = 1, n, m
      do 10 kk = 1, n, m
      do 10 i = 1, n
      do 10 j = jj, jj+m-1
      do 10 k = kk, kk+m-1
        C(i, j) = C(i, j) + A(i, k)*B(k, j)
10    continue

Figure 1: Blocked matrix multiply

Fig. 2 shows the LRU stack distributions for the blocked matrix multiply. Each entry in the LRU (Least Recently Used) stack is a data address. Each entry e_j at depth j occupies a position in the LRU stack below all other entries that have been referenced after e_j. In a fully-associative LRU data cache of size C lines, references that access entries below depth C in the LRU stack are misses and the rest are hits. The LRU stack distribution shows the number of times a static program access referenced a data item within a range of depths in the LRU stack. Besides showing the actual LRU stack distribution when n = 128 and m = 32, Fig. 2 also shows a general expression for the stack depth distribution and the values of the loop indices in the program for which these stack depths are encountered. Of the n^3 references made by each static source-level reference, n^2 are the initial references to the array of size n^2. These initial references are characterized as compulsory misses. The remaining n^3 - n^2 references are made at different depths for different references.
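The following is a minimal C sketch of the LRU stack depth computation described above. It uses a simple list-based stack rather than the tree-based simulator of Mattson et al. used in the paper, and the trace and cache size are illustrative.

/* Naive LRU stack distance computation: depths greater than the cache size
 * in lines are misses in a fully-associative LRU cache. Illustrative only. */
#include <stdio.h>

#define MAX_LINES 4096

static long stack[MAX_LINES];   /* stack[0] is the most recently used line */
static int  top = 0;

/* Returns the 1-based LRU stack depth of `line` (-1 on first reference)
 * and moves the line to the top of the stack. */
static int lru_depth(long line) {
    int pos = -1;
    for (int i = 0; i < top; i++)
        if (stack[i] == line) { pos = i; break; }
    int depth = (pos < 0) ? -1 : pos + 1;
    /* how far down to shift: new entry pushes everything (or evicts the bottom) */
    int limit = (pos < 0) ? ((top < MAX_LINES) ? top++ : top - 1) : pos;
    for (int i = limit; i > 0; i--)
        stack[i] = stack[i - 1];
    stack[0] = line;
    return depth;
}

int main(void) {
    long trace[] = { 1, 2, 3, 1, 4, 2 };   /* illustrative line addresses */
    int cache_lines = 2, misses = 0;
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        int d = lru_depth(trace[i]);
        if (d < 0 || d > cache_lines) misses++;
        printf("ref %ld: depth %d\n", trace[i], d);
    }
    printf("misses in a %d-line cache: %d\n", cache_lines, misses);
    return 0;
}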

3.2 Multi-issue machine model

In the remaining subsections of this section, we illustrate the performance improvements obtainable through the use of level-directed, latency-specifying loads/stores on a blocked matrix multiply. In order to quantify the performance improvements, we use the following multi-issue machine model. Except for this section, the rest of the paper is independent of the following machine model. Our multi-issue model issues a total of seven operations per cycle: two memory loads/stores, two address computations, one add, one multiply, and one branch. Table 1 lists the types of operations and their latencies. The latency of a load is 1 if the data is in the first-level cache or buffer and 20 otherwise.

3.3 No Data Cache

Now, we examine the performance of the blocked program in Fig. 1 on a range of architectural alternatives. Table 2 shows four metrics of interest for five different architectural alternatives. The ideal execution time is the execution time on our sample machine, ignoring loop and memory overheads, of a matrix multiply code scheduled with a load latency of either 1 or 20 cycles.

A(i,k) Load:
  depth 65-1K:    2.03M   (n^3(m-1)/m at depth 2m+2)        [j != jj]
  depth 16K-64K:  49,152  (n^2(n/m-1) at depth n^2+3nm+m^2) [j != 1, j = jj]
  compulsory misses: 16,384  (n^2 misses)                   [j = 1]

B(k,j) Load:
  depth 1K-4K:    2.1M    (n^3-n^2 at depth m^2+3m)         [i != 1]
  compulsory misses: 16,384  (n^2 misses)                   [i = 1]

C(i,j) Load:
  depth 1-16:     2.03M   (n^3(m-1)/m at depth 3)           [k != kk]
  depth 4K-16K:   49,152  (n^2(n/m-1) at depth 2m^2+2nm)    [k != 1, k = kk]
  compulsory misses: 16,384  (n^2 misses)                   [k = 1]

C(i,j) Store:
  depth 1-16:     2.1M    (n^3 at depth 1)

(All depth ranges 1-16, 17-64, 65-1K, 1K-4K, 4K-16K and 16K-64K not listed above contain zero references.)

Figure 2: LRU stack depth distribution for blocked matrix multiply

Pipeline       Number   Operations                     Latency
Memory port    2        Load                           1/20
                        Store                          1
Address ALU    2        Address add/subtract           1
Adder          1        Floating-point add/subtract    2
                        Integer add/subtract           1
Multiplier     1        Floating-point multiply        4
                        Integer multiply               2
Instruction    1        Branch                         2

Table 1: Instruction latencies

The cache miss penalty is the number of cycles spent stalling for cache misses. The actual execution time is the sum of the first two times. The memory traffic is the total number of non-cached accesses. In each case, we present a general expression and a number for the particular choice of n = 128 and m = 32.

First, consider a machine without a data cache, representative of VLIW machines such as the Cydrome Cydra 5 [2, 21] or the Multiflow TRACE family [9]. The ideal execution time assumes a schedule with a memory latency of 20 cycles. Since there is no cache, the cache miss penalty is zero and every reference is satisfied by main memory. This example illustrates that in regular scientific applications there is usually sufficient parallelism to schedule loads to hide the memory latency. However, one memory access is required per floating-point operation, and this level of memory traffic can be sustained only by expensive memory systems.

Alternative                       Ideal exec. time      Cache miss penalty        Actual exec. time              Memory traffic
No data cache                     n^3(m+26)/m = 3.80M   -                         n^3(m+26)/m = 3.80M            2n^3 + n^3/m = 4.26M
Data cache                        n^3(m+7)/m  = 2.56M   20(2n^3/m + n^2) = 2.95M  n^3(m+47)/m + 20n^2 = 5.51M    2n^3/m + n^2 = 0.147M
Lockup-free cache + dyn. sched.   n^3(m+7)/m  = 2.56M   -                         n^3(m+7)/m = 2.56M             2n^3/m + n^2 = 0.147M
Labeled loads                     n^3(m+8)/m  = 2.60M   -                         n^3(m+8)/m = 2.60M             2n^3/m + n^2 = 0.147M
Ideal data cache                  n^3(m+7)/m  = 2.56M   -                         n^3(m+7)/m = 2.56M             3n^2 = 0.049M

Table 2: Performance metrics on architectural alternatives

3.4 Conventional data cache

Secondly, consider a machine with a first-level data cache which is larger than 2n+1 and m^2+3m words and is smaller than n^2+2n words. The cache stalls the processor on a cache miss until the miss is serviced, and the cache-miss latency is 20 cycles. Loads are scheduled by the compiler with the cache-hit latency. Most current RISC systems have a cache organization and scheduling approach similar to this machine. Compared to the earlier alternative, the combination of caching and blocking transformations reduces memory traffic by a factor of m. This machine has a lower ideal execution time and a significantly less expensive memory system. However, the overall execution time is significantly degraded by the cache miss penalty.

3.5 Lockup-free cache with out-of-order execution

Consider a lockup-free cache in conjunction with a dynamically-scheduled processor that supports out-of-order execution and is similar to our model processor in issuing capabilities. Assume that the blocked matrix multiply is scheduled for such a machine with a nominal load latency of one. Assuming the system is effective at hiding the cache miss latency, the cache miss penalty is zero. Compared to the "no data cache" system, the lockup-free cache system has greatly reduced memory traffic requirements, and compared to the previous cache alternative, the current system appears to have a better overall execution time. But in order to hide the cache miss latency, the hardware must be capable of dynamically scheduling from a window of 140 instructions (7 instrs./cycle x 20 cycles), which may impact the processor cycle time. Though dynamically-scheduled systems with lockup-free caches have the potential for eliminating the cache miss penalty, the large-scale schedule reorganization required for overlapping the relatively high cache miss latency makes their practicality somewhat suspect.

3.6 Conventional system scheduled using miss latencies

A solution to the above problem is to statically schedule the code with a nominal load latency equal to the cache-miss latency instead of the cache-hit latency. This solution is effective for the matrix multiply example but is not likely to be as effective in general. For programs with a modest amount of instruction-level parallelism, scheduling loads with the cache-miss latency increases the critical path, creates empty slots in the schedule and reduces overall performance.

3.7 Labeled loads/stores

We consider the use of the level-directed, latency-specifying loads/stores that were described in Section 2. For each load instruction, the latency specifier is determined through compile-time analysis (possible for dense matrix computations) or through cache-simulation based profiling. Assuming that each source program access is translated into one machine-level load instruction, there are three loads, one each for the arrays A, B and C. Since the miss ratios for these loads are fairly low (0.031, 0.008, and 0.031 respectively for A, B and C), any general scheme would label all these loads as short-latency loads and schedule the entire program using cache-hit latencies. Thus, there is no difference between this scheme and the "Conventional data cache" scheme. Though it is clear from Fig. 2 that each load instruction has two or three distinct cache behaviors, the simple labeling scheme above does not have sufficient resolution to discriminate between these behavioral patterns.

i      j      Description           A      B      C
1      jj     Overall miss ratio    1.0    1.0    1.0
              Short/long latency    Long   Long   Long
              Cache install         Yes    Yes    No
1      !=jj   Overall miss ratio    0.0    1.0    1.0
              Short/long latency    Short  Long   Long
              Cache install         Yes    Yes    No
!=1    jj     Overall miss ratio    1.0    0.0    1.0
              Short/long latency    Long   Short  Long
              Cache install         Yes    Yes    No
!=1    !=jj   Overall miss ratio    0.0    0.0    1.0
              Short/long latency    Short  Short  Long
              Cache install         Yes    Yes    No

Table 3: Misses of individual loads/stores in peeled program

3.8 Labeled loads/stores after peeling

Loop peeling is a simple compiler transformation where a specific iteration of the loop is "peeled off" from the rest of the loop and replicated separately. In many situations, the accesses in the first iteration bring data into the cache, tending to miss, and the last iteration references data for the last time without subsequent useful reuse. Therefore, peeling the first and last iterations of the loop enables the identification of loads/stores with a high miss rate or without reuse. Consider peeling the first iterations of the i and j loops from the program in Fig. 1, so that the loop body is replicated a total of four times. Table 3 gives the miss ratio (when n = 128, m = 32) for the four distinct accesses to each of the arrays in the peeled code. From Table 3, each of the references in the peeled program either always hits or always misses in an LRU fully-associative cache and can be appropriately labeled as a short or long latency load. Peeling also enables us to identify that there is no reuse of the data accessed in the last iterations of the i and j loops by the arrays B and A respectively, and we can bypass the cache on those references.

In the case of regular scientific programs such as matrix multiply, extensions to current compiler technology are sufficient to characterize loads as short or long latency and to label them as cache install or bypass. Current compiler techniques are unlikely to be successful at characterizing load instructions in a more general class of programs. We propose the use of profiling to study the regularity of the memory referencing behavior of load instructions in such programs. We study the effectiveness of using this profiling information to specify the latency of load instructions in the next section.
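For illustration, the sketch below applies the peeling idea to the j loop of the blocked kernel, written in C rather than Fortran; peeling the first i iteration as well yields the four body copies summarized in Table 3. This is an illustrative sketch, not the exact peeled code evaluated in the paper.

/* Blocked matrix multiply with the first j iteration of each block peeled.
 * Illustrative only; C is assumed to be zero-initialized by the caller. */
#define N 128
#define M 32

void blocked_mm_peeled(float A[N][N], float B[N][N], float C[N][N]) {
    for (int jj = 0; jj < N; jj += M)
        for (int kk = 0; kk < N; kk += M)
            for (int i = 0; i < N; i++) {
                /* Peeled first iteration, j == jj: the A(i,k) loads tend to
                 * miss here, so they would carry a long-latency (C2 source
                 * cache) specifier. */
                for (int k = kk; k < kk + M; k++)
                    C[i][jj] += A[i][k] * B[k][jj];
                /* Remaining iterations, j != jj: A(i,k) now hits and would be
                 * scheduled with the cache-hit latency. */
                for (int j = jj + 1; j < jj + M; j++)
                    for (int k = kk; k < kk + M; k++)
                        C[i][j] += A[i][k] * B[k][j];
            }
}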

4 Simulation Experiments


In this section, we characterize the memory referencing behavior of loads/stores by performing cache simulations. We first describe the simulation environment and the benchmarks. We then present dynamic annotation statistics on the complete reference trace, showing the number of prefetches, pretouches and replaces. We investigate the regularity of load/store caching behavior. We statically annotate the load/store instructions with prefetches using the profiling information and present two performance metrics related to the cost and benefit of profile-based annotation.


4.1 Environment

We use four-word line size, fully-associative caches in our program characterization, because the results are obtained at a higher level of abstraction without the artifacts introduced by limited associativity. We have confirmed that the general nature of the results applies for two-way set-associative caches [22]. We simulated the caches either for 100 million addresses or till completion of the program. We use the stack processing technique of Mattson et al. [17] to simulate fully-associative caches under LRU and MIN replacement, and we use tree implementations of the stack for efficiency. The stack represents the state of a range of cache sizes, with the top C lines being contained in a cache of size C lines. The initial program characterization in the next two subsections is based on normal LRU replacement because it is commonly used in practical caches. In the remainder of the section, the objective is to do better than hardware schemes by utilizing future program accessing information. Therefore, we use MIN in the profiling step to capture future program behavior, and we then use software control and default LRU replacement to measure performance improvements. A cache of size C using the MIN replacement scheme prefetches data into the cache so as to always maintain the next C distinct lines that will be referenced. Since the depth at which a reference is inserted in the MIN stack is the depth at which it is rereferenced in the LRU stack, the distribution of insertion depths in the MIN stack is actually obtained using a modified LRU stack simulator.


4.2 Benchmarks

Since our objective was to do a preliminary study with detailed exploratory manual analysis, we used a limited number of programs consisting of nine SPEC89 benchmarks and matrix, which is the peeled blocked matrix multiply example that was described in detail in the previous section. We present detailed results on four benchmarks for our experiments: matrix, spice, gcc and xlisp. spice is an irregular floating-point application from the SPEC suite. gcc is the Gnu C compiler, also from the SPEC suite, and represents an irregular integer program. xlisp is an integer-oriented lisp program. We present tabular results for all the benchmark programs.

4.3 Miss characterization

For the 4KB and 16KB cache sizes typical of first-level caches, the miss ratios for some of these benchmarks are 0.089 and 0.005 for matrix; 0.170 and 0.112 for spice; and 0.061 and 0.036 for gcc. The average numbers of instructions between cache misses for a 16KB cache are 454, 23.2 and 89.4 instructions respectively for these benchmarks. Thus, a multi-issue processor issuing a modest two instructions per cycle and having a cache-miss latency of 10 cycles may spend half its time stalling on first-level cache misses for the spice benchmark.

The graphs in Fig. 3 plot the fraction of dynamic load/store references made by a certain number of load/store instructions, where the load/store instructions are sorted by their referencing frequency. The x-axis is on a log scale in order to magnify the left-hand region showing the behavior of instructions with high reference coverage. For gcc, approximately 3000 out of 25395 instructions account for 90% of the dynamic references. The graphs also plot the fraction of misses in a 16KB data cache accounted for by a certain number of load/store instructions. In this case, the load/store instructions are sorted by the number of misses. In spice, just four instructions account for half the misses and nine instructions account for 90% of the misses. Table 4 shows similar data for the nine SPEC89 benchmarks investigated. The pair of numbers in each entry in the table are the number of sorted instructions accounting for a certain fraction of the dynamic references and misses respectively. The first row of the table shows that fewer than ten instructions account for a majority of data cache misses for six of the nine SPEC89 benchmarks. The fact that a certain number of instructions account for a much larger fraction of cache misses than of overall references demonstrates that the concentration of cache misses among a few instructions is not merely due to a concentration of references among a few instructions. Because a few instructions account for most cache misses, it is more likely that manual or compiler intervention techniques can be developed to handle cache misses.
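A minimal sketch of the coverage computation behind Fig. 3 and Table 4 is shown below: static loads/stores are sorted by their profiled miss counts and the number of instructions needed to cover a given fraction of all misses is reported. The miss counts in the example are hypothetical, not taken from any benchmark.

/* Coverage computation: how many instructions cover a fraction of all misses. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x < y) - (x > y);               /* sort in descending order */
}

static int instrs_for_coverage(long *misses, int n, double frac) {
    long total = 0, covered = 0;
    for (int i = 0; i < n; i++) total += misses[i];
    qsort(misses, n, sizeof misses[0], cmp_desc);
    for (int i = 0; i < n; i++) {
        covered += misses[i];
        if (covered >= frac * total) return i + 1;
    }
    return n;
}

int main(void) {
    long misses[] = { 9000, 4200, 800, 700, 55, 30, 9, 4, 1, 1 };  /* hypothetical counts */
    int n = sizeof misses / sizeof misses[0];
    printf("instructions covering 50%% of misses: %d\n", instrs_for_coverage(misses, n, 0.5));
    printf("instructions covering 90%% of misses: %d\n", instrs_for_coverage(misses, n, 0.9));
    return 0;
}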

4.4 Dynamic annotation of references

In this section, we annotate each reference using fully-associative cache simulation under MIN. For a cache with C lines, a reference is annotated with 'was replaced' if it is the first reference to access a cache line after it has crossed a depth of C in the MIN stack, i.e. after it has been replaced; otherwise, it is annotated with 'was retained'. A reference is annotated with 'replace' if it is the last reference to access a cache line before it drops below a depth of C; otherwise, it is annotated with 'retain'. The primitive annotations are used to obtain the derived annotations using the table below.

Primitive annotation        Derived annotation
was retained & retain       do nothing
was replaced & retain       pretouch
was replaced & replace      prefetch
was retained & replace      replace

Table 5 shows the derived annotations on the reference traces of the four benchmarks for a 4KB cache and a 16KB cache. Note that the pretouch ratio is equal to the replace ratio because pretouch brings data into the cache while replace removes data from the cache. Once the cache is full, every pretouch must be preceded by a matching replace. The table also shows that approximately 3-6% of references are misses that have no subsequent reuse and can be handled by a small prefetch buffer.
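The derived annotations can be expressed directly as a function of the two primitive annotations, as in the C sketch below; the enum names are our own shorthand for the annotations defined above.

/* Mapping of the two primitive annotations to the derived annotation. */
#include <stdio.h>

typedef enum { WAS_RETAINED, WAS_REPLACED } Past;     /* state on this access     */
typedef enum { RETAIN, REPLACE } Future;              /* fate before next access  */
typedef enum { DO_NOTHING, PRETOUCH, PREFETCH, DO_REPLACE } Derived;

static Derived derive(Past p, Future f) {
    if (p == WAS_RETAINED && f == RETAIN)  return DO_NOTHING;
    if (p == WAS_REPLACED && f == RETAIN)  return PRETOUCH;    /* bring line into cache   */
    if (p == WAS_REPLACED && f == REPLACE) return PREFETCH;    /* use the prefetch buffer */
    return DO_REPLACE;                                         /* retained but not reused */
}

int main(void) {
    printf("%d\n", derive(WAS_REPLACED, RETAIN));  /* prints the PRETOUCH code */
    return 0;
}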

4.5 Distribution of dynamic trace annotations

In the previous subsection, we examined the global numbers of prefetches, pretouches and replaces. In this subsection, we examine the distribution of annotations on references made by individual instructions. For each static instruction, we maintain a count of the number of times it accesses references annotated with 'was replaced', i.e. a count of misses per instruction. Branch profiling is useful because certain branches tend to be taken and others tend to be not taken, and profiling classifies branches into these two types. Each branch type can be optimized for its predicted behavior. In this section, we examine a similar issue with respect to the cache behavior of loads/stores using the annotation counts per instruction.

            SPEC89 int                                SPEC89 float                                                matrix
Fraction    eqn      esp       gcc          xlisp     doduc      dnasa    mat300   spice     tomcatv              blocked
0.5         2/1      54/5      405/61       38/2      864/63     21/2     9/2      6/4       46/26                5/2
0.6         3/2      78/6      653/90       51/3      1333/81    25/3     10/3     9/5       56/38                6/3
0.7         4/3      108/8     1067/133     65/4      1843/126   29/4     12/4     13/6      94/51                7/4
0.8         5/4      157/10    1698/206     84/5      2419/225   33/5     13/5     18/7      143/63               8/5
0.9         11/8     273/18    2982/365     117/8     3032/339   37/6     15/6     27/9      193/75               9/8
0.95        18/13    500/41    4469/587     169/21    3345/396   39/7     16/7     65/11     218/82               11/10
1.0         517/176  3682/742  17874/2623   1045/258  6309/579   499/62   474/53   2096/361  937/188              66/29

Table 4: Instruction counts for range of reference and miss coverages (each entry: instructions covering the references / instructions covering the misses)

Benchmark          Annotation   Fraction of references
                                4KB        16KB
matrix (blocked)   Nothing      0.978341   0.978518
                   Pretouch     0.010829   0.010740
                   Replace      0.010829   0.010740
                   Prefetch     0.0        0.0
spice              Nothing      0.686814   0.798497
                   Pretouch     0.143264   0.088724
                   Replace      0.143264   0.088724
                   Prefetch     0.026658   0.024054
gcc                Nothing      0.888081   0.931977
                   Pretouch     0.050780   0.032277
                   Replace      0.050780   0.032277
                   Prefetch     0.010360   0.003470
xlisp              Nothing      0.930977   0.967902
                   Pretouch     0.033379   0.015957
                   Replace      0.033379   0.015957
                   Prefetch     0.002265   0.000183

Table 5: Annotation ratios

We can classify loads/stores into two categories through profiling based on cache simulation: hit-instructions that tend to hit and miss-instructions that tend to miss while accessing the data cache. As in branch profiling, the value of such a classification depends on how closely the observed behavior matches the predicted behavior. To determine how often the observed behavior differs from the predicted behavior, we sort the individual loads/stores by their miss ratios. In Fig. 4, the reference coverage on the x-axis is the fraction of the total references made by some number of instructions starting from the head of the sorted list. Again, the x-axis is on a log scale to magnify the interesting regions of the four graphs. The y-axis shows the miss ratio of the last instruction removed from the sorted list to reach a certain reference coverage. The shape of the curves, for instance for spice, suggests that the miss ratio of an instruction is either large or close to zero. A large fraction of references are accounted for by instructions with a miss ratio of zero, and relatively few references are accounted for by instructions in the transitional region. Table 6 shows similar data for the entire set of benchmarks we investigated.³ The table presents instruction counts and miss ratios for a range of reference coverages and benchmarks. The data for mat300 shows that four instructions accounting for 5% of the references have a miss ratio of 1.0 and the rest have a miss ratio of zero.

³ There are some minor discrepancies between the table and the figure due to interpolation between data points in the figure.

[Figure 3: Reference and miss coverage of instructions. Four panels (matrix, spice, gcc, xlisp) plot the cumulative fraction of dynamic references and of 16KB-cache misses (y-axis, 0 to 1) against the number of static load/store instructions (x-axis, log scale).]

For seven of the nine SPEC89 benchmarks, the miss ratio of instructions accounting for 50% or more of the references is 0.0. Fig. 4 and Table 6 show that even if instructions that account for 50% of the dynamic references are labeled as hit-instructions, the miss ratio of the hit-instructions is zero for these seven SPEC89 benchmarks. Even if the reference coverage is increased to 80%, all the instructions scheduled with the hit latency have a miss ratio of less than 1% for four of the SPEC89 programs (see the row starting with 0.2 in the table).

Though the instructions that hit tend to have a hit ratio close to one, the instructions that miss do not tend to have a miss ratio of one. Thus, the predictability of miss-instructions is not as good as that of hit-instructions. Firstly, there are many more hits than misses globally, and secondly, our cache model assumes a line size of four words. Unit-stride reference streams that always miss in a cache with a line size of one word will tend to hit with 0.75 probability in our cache. For instance, the twenty-six instructions that miss in the blocked matrix program have a miss ratio of 0.25. The predictability of these types of instructions can be improved by loop unrolling by the cache line size parameter. Regardless, instructions that account for a small fraction of dynamic references are responsible for a large fraction of the overall misses and have a hit ratio significantly less than the global hit ratio.
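As a sketch of the unrolling remedy mentioned above, consider a unit-stride reduction unrolled by the four-word line size: the single static load with a 0.25 miss ratio becomes four static loads, one that predictably misses and three that predictably hit. The C code below is illustrative only and assumes the array is aligned to a line boundary and that n is a multiple of four.

/* Unrolling a unit-stride stream by the line size separates the miss-carrying
 * load from the hit-carrying loads. Illustrative sketch only. */
float sum_unrolled(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s += a[i];       /* first word of each line: predictably misses, label long latency */
        s += a[i + 1];   /* remaining words of the line: predictably hit, label short latency */
        s += a[i + 2];
        s += a[i + 3];
    }
    return s;
}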

4.6 Performance comparisons after static annotation

In the previous subsection, we examined the distribution of dynamic annotations associated with particular instructions. We found that loads/stores generally exhibit dichotomous behavior, with some instructions largely making references with hit or 'was retained' annotations and other instructions accounting for a large fraction of miss or 'was replaced' annotations.

Fraction   eqn          esp           gcc            xlisp        doduc        dnasa        mat300       spice         tomcatv      matrix blocked
0.001      80  0.7277   204  0.9499   305   0.8566   95   0.4917  104  1.0000  1    1.0000  1    1.0000  191  0.9335   1    1.0000  24  0.2500
0.01       81  0.6929   259  0.6940   457   0.6103   96   0.4764  387  0.5391  2    1.0000  2    1.0000  192  0.9324   130  0.5098  25  0.2500
0.02       128 0.2383   295  0.5073   744   0.3760   97   0.4751  456  0.5385  3    1.0000  3    1.0000  193  0.8571   144  0.5088  26  0.2500
0.05       144 0.1729   552  0.0996   1076  0.2499   202  0.0221  888  0.0763  133  0.0156  4    1.0000  221  0.6678   152  0.5020  38  0.0000
0.1        166 0.1082   770  0.0148   1613  0.1319   250  0.0086  1155 0.0245  135  0.0156  142  0.0000  338  0.2677   161  0.5000  39  0.0000
0.2        167 0.1069   1028 0.0019   3020  0.0118   329  0.0020  1258 0.0022  140  0.0156  144  0.0000  384  0.1896   224  0.4990  40  0.0000
0.5        211 0.0296   1189 0.0000   4088  0.0000   445  0.0000  1922 0.0000  158  0.0000  152  0.0000  612  0.0019   364  0.0000  42  0.0000
1.0        878 0.0000   5559 0.0000   25450 0.0000   2026 0.0000  9598 0.0000  883  0.0000  837  0.0000  3422 0.0000   1615 0.0000  97  0.0000

Table 6: Instruction count and miss ratio for range of reference coverages (each entry: instruction count, miss ratio)

Since distinct annotations for each reference in the trace are not feasible, we investigate the performance improvements possible through a static annotation of instructions. Load/store profiling information is obtained through cache simulations and is used to annotate the static instructions in the program based on a threshold. All instructions that have a miss ratio greater than the threshold are labeled as long-latency or prefetch/pretouch instructions, and the rest are labeled short latency. Actually, the prefetch/pretouch associated with a reference annotated with 'was replaced' should be issued (1) after the address moves above depth C in the MIN stack (otherwise, it will displace some other line that is rereferenced sooner) and (2) far enough in advance to overlap the memory access latency. The cache size is usually sufficiently large that the second factor determines how early a line is prefetched, and we associate the prefetch/pretouch with the dynamic reference to the data. This scheme is successful if it reduces the miss ratio sufficiently without greatly increasing the number of prefetches. We characterize these two factors by two metrics representing the benefit and cost of prefetching: the residual miss ratio, which is the ratio of unprefetched cache misses to total references, and the prefetch ratio, which is the ratio of prefetches to the total references in the original unannotated program. In Fig. 5, we plot these two metrics versus threshold for the four benchmark programs. In Table 7, we show the residual miss ratio followed by the prefetch ratio for the entire set of benchmarks investigated.

At a threshold of 0.2, the residual miss ratios are 0.0, 0.03, 0.01 and 0.002 for the matrix, spice, gcc and xlisp programs. Compared to the unannotated program, there are no unprefetched misses in matrix, and the residual miss ratios for spice, gcc and xlisp are approximately 12-20% of the original miss ratios. In the dynamically annotated trace, there are no unnecessary prefetches and the reduction in miss ratio due to prefetching is exactly equal to the prefetch ratio. In contrast, the statically annotated program issues some unnecessary prefetches and consequently, an increase in prefetch ratio may not lead to a corresponding reduction in miss ratio. However, Table 7 shows that, even for a very low threshold of 0.05, the prefetch ratio does not exceed 0.1 for five out of nine SPEC89 benchmarks. Thus, the static annotation scheme has the potential for reducing miss ratios appreciably at the expense of issuing only a small number of prefetches relative to total references. In order to prefetch successfully, we also need to know the addresses generated by these instructions sufficiently in advance. We have not addressed this issue in this paper.

The choice of a threshold should be governed by a tradeoff between the cost of prefetches and the benefit arising from a lower miss ratio [14]. Lower residual miss ratios reduce cache miss penalties. The cost of prefetches is a function of both the number of prefetches and the amount of instruction-level parallelism in the program. In a program with low instruction-level parallelism, unnecessary prefetches constrain scheduling decisions and increase critical path lengths. On the other hand, in a program such as matrix, the instruction-level parallelism is very large and a large prefetch ratio does not affect the schedule length significantly. Therefore, a low threshold is desirable because it can potentially reduce the residual miss ratio while the costs due to prefetches are not significant.
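A minimal sketch of the thresholding scheme and the two metrics is given below, assuming per-instruction reference and miss counts from a profiling run; the counts in the example are hypothetical, and the misses of prefetch-annotated instructions are assumed to be fully covered by their prefetches.

/* Threshold-based annotation and the residual miss ratio / prefetch ratio metrics. */
#include <stdio.h>

typedef struct { long refs, misses; } InstrProfile;

static void annotate(const InstrProfile *p, int n, double threshold,
                     double *prefetch_ratio, double *residual_miss_ratio) {
    long total_refs = 0, prefetches = 0, residual_misses = 0;
    for (int i = 0; i < n; i++) {
        total_refs += p[i].refs;
        double miss_ratio = p[i].refs ? (double)p[i].misses / p[i].refs : 0.0;
        if (miss_ratio > threshold)
            prefetches += p[i].refs;        /* every reference issues a prefetch */
        else
            residual_misses += p[i].misses; /* its misses are left unprefetched  */
    }
    *prefetch_ratio      = (double)prefetches / total_refs;
    *residual_miss_ratio = (double)residual_misses / total_refs;
}

int main(void) {
    InstrProfile prof[] = { { 500000, 400000 }, { 900000, 9000 }, { 600000, 300 } }; /* hypothetical */
    double pr, rmr;
    annotate(prof, 3, 0.2, &pr, &rmr);
    printf("prefetch ratio %.4f, residual miss ratio %.4f\n", pr, rmr);
    return 0;
}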

[Figure 4: Miss-ratio vs. reference coverage. Four panels (matrix, spice, gcc, xlisp) plot the miss ratio of the last instruction removed from the miss-ratio-sorted list (y-axis, 0 to 1) against the cumulative reference coverage (x-axis, log scale from 0.001 to 1).]

Threshold
Benchmark          0.05              0.10              0.20              0.30              0.40              0.60              0.80
eqn                0.01318/0.43154   0.01466/0.40585   0.05602/0.03606   0.06043/0.01749   0.06062/0.01696   0.06227/0.01331   0.07117/0.00051
esp                0.00128/0.07492   0.00359/0.04860   0.00403/0.04559   0.00526/0.04095   0.00805/0.03283   0.01693/0.01494   0.02296/0.00592
gcc                0.00205/0.14092   0.00406/0.11167   0.00987/0.07475   0.02164/0.02713   0.02500/0.01766   0.02857/0.01012   0.03265/0.00420
xlisp              0.00148/0.03734   0.00203/0.02856   0.00208/0.02828   0.00630/0.01139   0.00630/0.01139   0.01171/0.00024   0.01185/0.00005
doduc              0.00186/0.08066   0.00500/0.03511   0.00520/0.03351   0.00578/0.03127   0.00586/0.03104   0.02143/0.00189   0.02150/0.00179
dnasa              0.00258/0.04366   0.00258/0.04366   0.00258/0.04366   0.00258/0.04366   0.00258/0.04366   0.00258/0.04366   0.00258/0.04365
mat300             0.00001/0.08892   0.00001/0.08892   0.00001/0.08892   0.00001/0.08892   0.00001/0.08892   0.00077/0.08740   0.00077/0.08740
spice              0.00073/0.43181   0.00798/0.32513   0.03029/0.18565   0.05876/0.07461   0.05913/0.07350   0.06715/0.05507   0.07999/0.03585
tomcatv            0.00017/0.22526   0.00018/0.22514   0.00019/0.22512   0.00019/0.22511   0.00019/0.22511   0.11216/0.00163   0.11216/0.00163
matrix (blocked)   0.00000/0.04231   0.00000/0.04231   0.00000/0.04231   0.01046/0.00046   0.01058/0.00016   0.01058/0.00016   0.01058/0.00016

Table 7: Residual miss ratio and prefetch ratio (each entry: residual miss ratio / prefetch ratio)

5 Conclusion

Cache control instructions are being implemented in most future systems. However, there has been little attempt at quantifying the gains that these instructions can provide for irregular applications. In this paper, we motivate cache control instructions using a blocked matrix multiply example. A system with cache-control instructions is superior to other common architectural alternatives for this example program. We characterize the misses for four benchmark programs. For typical first-level cache sizes, the miss ratio is significant, around 0.08, and typically these programs miss every 16-80 instructions, motivating the need for better schemes to reduce cache miss penalties. We present annotation statistics for the dynamic trace. We demonstrate that hit-instructions accounting for a large fraction of total references almost never miss and that the miss-instructions have a hit ratio less than half the global hit ratio. We employ a thresholding scheme to partition loads into short and long latency. The residual miss ratio can be kept low while not greatly increasing the number of unnecessary prefetches.

Further experimentation is necessary to determine the sensitivity of the partitioning of loads into short and long latency to, firstly, changes in input data sets and, secondly, changes in cache configuration. The labeling scheme needs to be integrated with a compiler which can use the latency specification to perform better instruction scheduling. Cache replacement can be managed using the target cache specifiers described in Section 2 to reduce traffic between levels of the memory hierarchy. Our initial results on using profiling to manage cache contents show a reduction in traffic of less than 10% [22]. Compiler transformations such as peeling and loop unrolling can greatly improve the resolution of the profiling data. Intuition gathered from profiling may also be useful in developing compiler heuristics so that, thereafter, profiling is not needed while compiling.

[Figure 5: Effectiveness of prefetch annotations. Four panels (matrix, spice, gcc, xlisp) plot the residual miss ratio and the prefetch ratio (y-axes) against the annotation threshold (x-axis, 0 to 0.8).]


Acknowledgements

We thank Lapyip Tsang for conducting the final set of experiments and plotting the figures in this paper.

References

[1] Alpha Architecture Handbook - Preliminary Edition. Digital Equipment Corporation, Maynard, MA, 1992.
[2] G. R. Beck, D. W. L. Yen, and T. L. Anderson. The Cydra 5 mini-supercomputer: architecture and implementation. The Journal of Supercomputing, 7(1/2):143-180, 1992.
[3] D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In Proc. of ASPLOS IV, pages 40-52, 1991.
[4] D. Callahan and A. Porterfield. Data cache performance of supercomputer applications. In Supercomputing '90, pages 564-572, 1990.
[5] P. Chang, S. Mahlke, W. Chen, and W. W. Hwu. Profile-guided automatic inline expansion for C programs. Software-Practice and Experience, 22(5):349-369, 1992.
[6] T.-F. Chen and J.-L. Baer. Reducing memory latency via non-blocking and prefetching caches. In Proc. of ASPLOS V, pages 51-61, 1992.
[7] W. Y. Chen, S. A. Mahlke, and W. W. Hwu. Tolerating first level memory access latency in high-performance systems. In Intl. Conf. on Parallel Processing, pages I-36 - I-43, 1992.


[8] F. Chow and J. Hennessy. Register allocation by priority-based coloring. In Proc. of the 1984 Symp. on Compiler Construction, pages 222-232, 1984.
[9] R. P. Colwell, R. P. Nix, J. J. O'Donnell, D. B. Papworth, and P. K. Rodman. A VLIW architecture for a trace scheduling compiler. IEEE Transactions on Computers, C-37(8):967-979, 1988.
[10] J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Transactions on Computers, C-30(7):478-490, July 1981.
[11] W. W. Hwu and P. P. Chang. Achieving high instruction cache performance with an optimizing compiler. In Proc. of 16th Intl. Symp. on Computer Architecture, pages 242-251, 1989.
[12] W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery. The superblock: an effective technique for VLIW and superscalar compilation. J. of Supercomputing, 7(1/2):229-248, 1993.
[13] V. Kathail, M. S. Schlansker, and B. R. Rau. HPL PlayDoh architecture specification: Version 1.0. Technical Report HPL-93-80, Hewlett-Packard Laboratories, 1993. In preparation.
[14] D. R. Kerns and S. J. Eggers. Balanced scheduling: Instruction scheduling when memory latency is uncertain. In Proc. of the SIGPLAN '93 Conf. on Prog. Lang. Design and Implementation, pages 278-289, 1993.
[15] A. C. Klaiber and H. M. Levy. Architecture for software controlled data prefetching. In Proc. of 18th Intl. Symp. on Computer Architecture, pages 43-63, 1991.

[16] M. Martonosi, A. Gupta, and T. Anderson. MemSpy: Analyzing memory system bottlenecks in programs. In Proc. ACM SIGMETRICS Conf., pages 1-12, 1992.
[17] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2):78-117, 1970.
[18] S. McFarling. Program optimization for instruction caches. In Proc. of ASPLOS III, 1989.
[19] D. M. McNiven. Reduction in Main Memory Traffic through the Efficient use of Local Memory. PhD thesis, University of Illinois, 1988.
[20] T. C. Mowry, M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In Proc. of ASPLOS V, pages 62-73, 1992.
[21] B. R. Rau, D. W. L. Yen, W. Yen, and R. A. Towle. The Cydra 5 departmental supercomputer: Design philosophies, decisions and trade-offs. IEEE Computer, 22(1):12-35, 1989.
[22] R. A. Sugumar and S. G. Abraham. Multiconfiguration simulation algorithms for the evaluation of computer architecture designs. Technical Report CSE-TR-173-93, CSE Division, University of Michigan, 1992.