Efficient Simulation of Caches under Optimal Replacement with Applications to Miss Characterization

Rabin A. Sugumar and Santosh G. Abraham
Advanced Computer Architecture Laboratory
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor, MI 48109-2122
email: [email protected], [email protected]

Abstract

Cache miss characterization models such as the three Cs model are useful in developing schemes to reduce cache misses and their penalty. In this paper we propose the OPT model, which uses cache simulation under optimal (OPT) replacement to obtain a finer and more accurate characterization of misses than the three Cs model. However, current methods for optimal cache simulation are slow and difficult to use. We present three new techniques for optimal cache simulation. First, we propose a limited lookahead strategy with error fixing, which allows one-pass simulation of multiple optimal caches. Second, we propose a scheme to group entries in the OPT stack, which allows efficient tree-based fully-associative cache simulation under OPT. Third, we propose a scheme for exploiting partial inclusion in set-associative cache simulation under OPT. Simulators based on these algorithms were used to obtain cache miss characterizations using the OPT model for nine SPEC benchmarks. The results indicate that miss ratios under OPT are substantially lower than those under LRU replacement, by up to 70% in fully-associative caches, and up to 32% in two-way set-associative caches.

1 Introduction

Caches have an increasing impact on overall performance because of the growing gap between CPU cycle times and memory access times. Cache misses contribute significantly to the cycles per instruction (CPI) performance metric [15]. The effect that cache misses have on performance has led to research on methods to reduce cache misses or their penalty.

(This research is supported in part by ONR contract N00014-931-0163. This paper appeared in the Proceedings of the 1993 ACM SIGMETRICS Conference.)

Several strategies have been proposed to improve cache performance, such as remapping basic blocks in memory [8, 14], blocking algorithms [4], using two or more levels of caching, miss caches [10] and shadow directories [18]. Thiebaut and Stone [21], Agarwal [1], Hill [6, 5] and others have proposed models which can explain or classify misses. These models are useful both for gaining insight in developing caching strategies and for evaluating these strategies. In Thiebaut and Stone's model and in Agarwal's model, expressions for miss rates are derived in terms of a few trace-dependent parameters. In Hill's three Cs model, misses are classified into three components: compulsory misses, which occur on first-time references to lines; capacity misses, the additional misses resulting from the limited capacity of a cache; and conflict misses, the additional misses due to the mapping constraints in set-associative caches.

In this paper we propose the OPT model, which provides a finer characterization of misses than the three Cs model. The OPT model differs from the three Cs model in two respects. First, it defines capacity misses as the non-compulsory misses from a fully-associative cache with OPT replacement rather than LRU replacement. (More precisely, Hill [6] defines capacity misses as the non-compulsory misses from a fully-associative cache using the same replacement strategy as the cache being characterized; empirical characterizations have used LRU [6], and practical caches usually use LRU.) Since the performance of LRU is non-optimal, the misses in a fully-associative LRU cache are not a true measure of the misses caused by limited capacity. For instance, our experiments indicate that for some cache configurations, the miss ratio of a fully-associative LRU cache on the SPEC benchmarks is greater than that of the corresponding direct-mapped or set-associative cache. In these cases, the conflict miss component of the three Cs model is negative, which contradicts the intuitive notion that conflicts increase the total number of misses. Second, the OPT model separately identifies mapping misses, which occur as a result of the set-mapping strategy, and replacement misses, which occur as a result of sub-optimal replacement within a set.

In the three Cs model this distinction is not made. The replacement strategy is thus included in the OPT model. The miss characterizations are for a fixed line size in the OPT model, as in the three Cs model.

The OPT model is practical only if caches can be simulated efficiently under optimal replacement. There are two constraints in simulating optimal caches. First, the current optimal cache simulation algorithm that can simulate a range of caches requires two passes over the address trace, where the first pass is a reverse pass that gathers "future information". The reverse pass requires that the trace be stored, and is not compatible with the on-the-fly simulation that is commonly used to avoid storing long traces. Second, the current algorithms for simulating optimal caches are much slower than those for simulating LRU caches. In this paper we address both issues. We describe an algorithm which simulates optimal caches without a reverse pass, using a limited lookahead instead, and we then present schemes to do optimal cache simulation more efficiently. We have implemented the algorithms and we present miss characterizations for the SPEC benchmarks using our simulators.

The following section contains a brief overview of earlier relevant work on cache simulation. The limited lookahead algorithm is described in Section 3. In Section 4, a scheme for tree-based simulation is described and an extension of the limited lookahead algorithm for set-associative caches is presented. Results of empirical performance evaluations of the various algorithms are presented in Section 5. The OPT model for characterizing misses is described, and a miss characterization for nine of the SPEC benchmarks is presented, in Section 6. Section 7 concludes the paper.

2 Related Work

On a miss, optimal replacement [12, 2] replaces the block (we use block and cache line interchangeably in this paper) that is re-referenced furthest in the future. Optimal replacement minimizes bus traffic over the class of all replacement algorithms when write-backs are ignored, and minimizes the cache miss ratio as well over the class of demand fetching algorithms. Since it requires future information, it cannot be used in a real cache, and is mainly used in performance analysis. Belady, in his paper introducing optimal replacement [2], describes an algorithm for simulating a single cache under optimal replacement, where the decision on which address is replaced is delayed until there is no uncertainty.


In 1970, Mattson et al. [12] developed the notion of stack simulation, which allows single-pass trace-driven simulation of many alternative caches, subject to several restrictions. The major restriction is that all caches use the same stack replacement algorithm. Under any stack replacement algorithm, the following inclusion property holds: at any instant, a cache contains all the blocks in all smaller caches. After each reference, a stack replacement algorithm logically constructs a total order of all previously referenced blocks and always replaces lower-priority blocks. Mattson et al. proved that OPT is a stack algorithm and showed how to compute all priorities using a reverse pass over a trace. Single-pass simulation as described by Mattson et al. is superior to single-cache simulation algorithms when a range of cache sizes needs to be evaluated. However, the need for a reverse pass makes the single-pass simulation difficult. In this paper we propose a limited lookahead scheme to do single-pass optimal cache simulation without the reverse pass.

The sequential list-based search of the stack is a bottleneck in stack simulation. This problem has been addressed for LRU replacement [3, 16, 22, 11]. These algorithms make use of the fact that the stack update is simple in LRU: the referenced entry is just moved from its current position to the top of the stack. Under OPT replacement, however, the relative positions of many entries other than the referenced entry can potentially change in the stack. In this paper we introduce a scheme for grouping stack entries so that the relative position of at most one of the entries in a group changes. This grouping enables efficient tree or multi-level list implementations for optimal replacement too.

Single-pass set-associative cache simulation where the number of sets and the associativity are varied has been considered before for LRU replacement. Three algorithms have been proposed: all-associativity simulation [12, 7], generalized binomial forest simulation [19] and generalized forest simulation [7]. In this paper we describe an adaptation of the generalized forest simulation algorithm, which greatly increases the efficiency of set-associative cache simulation under OPT.

When cache bypass is permitted the referenced line need not be put in the cache, and is itself one of the candidates for replacement. McFarling [13] discusses OPT replacement with cache bypass. Cache bypass has a significant effect in set-associative caches with low associativities. After some minor modifications, the algorithms described in this paper can work with bypass. The miss characterizations in Section 6 are with bypass permitted.
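To make the OPT policy concrete, the following is a minimal single-cache simulation sketch in Python. It is not Belady's delayed-decision algorithm and not the stack method developed in this paper; it simply computes next-reference times with a reverse pass, in the spirit of Mattson et al.'s priority computation, and evicts the line re-referenced furthest in the future. Function and variable names are illustrative only, and cache bypass is not modeled.

def opt_misses(trace, capacity):
    """Single-cache OPT simulation sketch (demand fetch, no bypass)."""
    INF = float("inf")
    # next_use[i] = index of the next reference to trace[i] after i (INF if none)
    next_use = [INF] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], INF)
        last_seen[trace[i]] = i

    cache = {}      # resident line -> time of its next reference
    misses = 0
    for i, line in enumerate(trace):
        if line in cache:
            cache[line] = next_use[i]          # refresh the line's priority
            continue
        misses += 1
        if len(cache) >= capacity:
            victim = max(cache, key=cache.get)  # re-referenced furthest in the future
            del cache[victim]
        cache[line] = next_use[i]
    return misses

A single pass of this kind must be repeated per cache size; the stack techniques of Sections 3 and 4 avoid both the repetition and the reverse pass.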

3 Finite Lookahead Simulation

In this section we describe the simulation of optimal caches using finite lookahead. The simulation consists of two co-routines that are executed alternately: a lookahead routine and a stack processing routine. These two routines communicate primarily through a buffer consisting of a list of (Address, Time of next reference) tuples. (The priority of an address under OPT is a simple function of its time of next reference, e.g., MAXINT − TNR.) The lookahead routine runs ahead of the stack processing routine and determines the true priorities of most entries in the buffer. The stack processing routine uses a speculative priority for the remaining entries and marks these as unknowns in the stack. When the lookahead routine eventually determines the true priorities of the unknown references, it calls a stack repair routine which repairs the stack using the correct priority. Algorithm 1 is the overall simulation algorithm. Here a lookahead of at least LA_DIST is always maintained.

Main ()
    Lookahead (LA_DIST)
    while (! END_OF_TRACE)
        Lookahead (X)
        Stack_process (X)
    end while
    Stack_process (LA_DIST)

Lookahead (N)
    for N trace addresses
        append (address, TNR)-tuple to buffer with TNR set to unknown
        if (address has been referenced previously)
            if (previous reference tuple not processed by Stack_process)
                set TNR of tuple to current time
            else
                Stack_repair (address, current time)
            end if
        end if
    end for

Stack_process (N)
    for first N tuples in buffer
        assign priority to address based on TNR
        lookup and update stack
    end for

Algorithm 1: Overall Simulation
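The lookahead side of Algorithm 1 can be rendered in Python roughly as follows. This is a sketch under our own assumptions: buffered tuples are mutable [address, TNR, processed] lists, the stack-processing routine marks each tuple it consumes, and the class, field, and callback names are illustrative rather than the paper's.

from collections import deque

UNKNOWN = None

class LookaheadBuffer:
    """Sketch of the lookahead co-routine of Algorithm 1 (illustrative names)."""

    def __init__(self, trace_iter, stack_repair):
        self.trace = trace_iter
        self.stack_repair = stack_repair   # called when an unknown in the stack
        self.buffer = deque()              #   acquires its true priority
        self.last_ref = {}                 # address -> most recent tuple for it
        self.now = 0

    def lookahead(self, n):
        """Read n more references; returns False when the trace is exhausted."""
        for _ in range(n):
            addr = next(self.trace, None)
            if addr is None:
                return False
            self.now += 1
            tup = [addr, UNKNOWN, False]   # [address, TNR, processed-by-stack?]
            prev = self.last_ref.get(addr)
            if prev is not None:
                if prev[2]:
                    # previous reference already processed: it is an unknown
                    # in the stack, so repair the stack with its true priority
                    self.stack_repair(addr, self.now)
                else:
                    prev[1] = self.now     # resolve TNR of the buffered tuple
            self.buffer.append(tup)
            self.last_ref[addr] = tup
        return True

The stack-processing co-routine would pop tuples from the left of `buffer`, set their processed flag, and treat an UNKNOWN TNR as the lowest priority, as described in Section 3.1.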

In Section 3.1 we describe the lookahead and stack processing routines, and justify the overall algorithm assuming that the stack repair routine works. In Section 3.2 and Section 3.3, we describe stack repair routines.

3.1 Lookahead and Stack Processing

The lookahead routine reads in address references from the trace. For each reference it appends a tuple

to the communication buffer with the "Time of next reference" (TNR) attribute set to "unknown". Then a hash lookup is done to check the status of the previous reference to the address. If the previous reference has not yet been processed by the stack processing routine, its TNR is changed from "unknown" to the current time. If the previous reference has been processed, it is an unknown in the stack, i.e., a stack entry whose true priority is not known. The true priority is now available for this reference, and the lookahead routine calls the stack repair routine to convert the unknown to a known (a stack entry whose priority is known). If there is no previous reference, nothing is done.

The stack processing routine removes tuples from the head of the list in the communication buffer, and performs a stack lookup and update. The TNR attribute is used to assign priorities to each address reference; addresses whose TNR is unknown are treated as having lower priority than all knowns. The stack processing routine also maintains information required for stack repair. Below we briefly review the stack lookup and update procedure described in [12].

The objective of stack processing is to obtain the hit ratios of a range of fully-associative caches efficiently. A single stack maintains the contents of fully-associative caches of sizes ranging from one line to the maximum number of lines in the stack. The contents of a fully-associative cache with j lines are the top j lines in the stack. Therefore, the line at depth j in the stack is not in the cache with j − 1 lines, but is in the cache with j lines. During lookup, the stack is searched for the referenced line from the top, and the stack hit depth, say d, at which the line is found (or infinity if it is not found) is recorded. The reference is a miss in caches with fewer than d lines and a hit in all other caches. During update (usually done along with the lookup), the line at the top of the stack is displaced by the referenced line. For each depth from two up to the hit depth, the priority of the deleted line, y, from the previous depth is compared against the priority of the line, z, at the current depth, and the line of lower priority becomes the deleted line from the current depth. We say an interaction occurs between y and z in such a case, and that the deleted entry has been displaced by the other.

Assuming that the stack repair routine always places unknowns that become knowns at the correct stack position, the hit depths determined are accurate. Consider unknown-known interactions; the unknown is always displaced because it is treated as having a lower priority than knowns. But unknowns are of lower priority than knowns in reality, i.e., even when infinite future information is available; there is hence no error in

unknown-known interactions. Since there is no error in known-known interactions either, addresses that enter the stack as knowns are in their true position, i.e., the position they would have had if infinite future information had been available. However, potential errors are made when unknowns interact, since no information is available on their relative priorities, and unknown addresses may not necessarily be at their true stack position. Once an unknown becomes known it is moved to its true position by the stack repair routine. Since by the time an address reference is processed by the stack processing routine it is a known in the stack (unless it is a first-time reference, in which case it is not in the stack at all), the hit depths determined are accurate.
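The lookup/update step reviewed above can be sketched in Python as follows. This is a minimal list-based rendering, not the paper's optimized implementation; `priority` would hold MAXINT − TNR for knowns and still-lower dummy values for unknowns, and all names are illustrative.

def process_reference(stack, priority, addr):
    """One step of OPT stack lookup and update (after Mattson et al.).

    stack    : list of addresses, index 0 is the top of the stack
    priority : dict address -> current priority (higher = referenced sooner;
               unknowns carry dummy priorities lower than every known)
    Returns the hit depth (1-based), or None on a miss.
    """
    try:
        hit_depth = stack.index(addr) + 1
    except ValueError:
        hit_depth = None                       # miss: the line is not in the stack

    if not stack:
        stack.append(addr)
        return None

    deleted = stack[0]                         # the top entry is displaced ...
    stack[0] = addr                            # ... by the referenced line
    last = hit_depth if hit_depth is not None else len(stack) + 1
    for depth in range(2, last + 1):
        if depth > len(stack):                 # miss: the stack grows by one line
            stack.append(deleted)
            break
        if depth == hit_depth:
            stack[depth - 1] = deleted         # the referenced line vacated this slot
            break
        resident = stack[depth - 1]
        # interaction: the line of lower priority is pushed one level deeper
        if priority[deleted] > priority[resident]:
            stack[depth - 1] = deleted
            deleted = resident
    return hit_depth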

3.2 Stack Repair: Interaction Graph

As noted in the last section, errors potentially occur on unknown-unknown interactions. In this section we describe a stack repair algorithm where unknown-unknown interactions are arbitrarily decided, and information on all unknown-unknown interactions is maintained in a directed graph called the interaction graph. When an unknown becomes known, a set of interactions now known to have been wrongly decided are identified using the interaction graph, and are redone in the correct order. When an interaction is redone, a corresponding change is made to the stack as well, so that at the end of the procedure the unknown that became known is in its true position. In Section 3.3 we describe a more efficient, but less intuitive, stack repair scheme where dummy priorities are assigned to unknowns.

Let S(t, τ) denote the set of stacks for which lookahead has been done up to time τ and stack processing has been done up to time t (< τ). In the stacks in S(t, τ), the unknowns are those addresses that have been referenced before t, but not in the interval (t, τ). The positions of knowns are identical in all the stacks in S(t, τ), but those of one or more unknowns differ. With any stack, s(t, τ), an interaction graph is associated. Each directed edge represents an interaction where the unknown at the head displaces the unknown at the tail. The edges are numbered in their order of creation.

The large table on the left in Fig. 1 shows an example where the stack processing routine processes nine references. The first nine tuples in the buffer after the lookahead routine has processed 14 references (τ = 14) are shown in the second and third rows of the table. A u in the third row indicates an unknown. The stack and the interaction graph at each step are shown in the rest of the table. The unknowns in the stack are underlined. Consider Reference 9 to Address k. It is not in

the stack, and the search goes to the end of the stack. During the update, k displaces g. Then, a displaces g because g is an unknown. The f-g interaction is an unknown-unknown interaction, and it is arbitrarily decided that g displaces f. An edge, fg, is added to the interaction graph for this interaction. Subsequently, d displaces f, f displaces e and c displaces e. The last two are also unknown-unknown interactions, and are arbitrarily decided. Edges ef and ec are added to the interaction graph to represent these interactions. The final interaction graph is shown.

When the stack repair routine is called on a reference at time τ + δ, it takes the stack s(t, τ) (∈ S(t, τ)) formed in the course of the simulation, and forms a stack s(t, τ + δ) (∈ S(t, τ + δ)), with the unknown that became known in its true position. The stack repair routine accomplishes this task using the interaction graph as follows. When an unknown (say U) becomes known, we know that its true priority is greater than the true priority of all other unknowns. Therefore, any interaction in which it has been displaced by an unknown in the interaction graph is an incorrectly decided interaction (i.d.i.). The first such i.d.i. (if any) is identified in the interaction graph by examining edge numbers, and the decision is reversed. As a result of the reversal, whatever was done with U after the wrong interaction is now associated with the other node (say X), and vice versa. In the interaction graph after the reversal, the first i.d.i. (if any) is identified similarly (this interaction was originally associated with X), and is reversed as before. This procedure is repeated until a state is reached where U has not been displaced by any other unknown. Now U is in its true position in the stack, since it has not been involved in any i.d.i. Algorithm 2 is a high-level statement of the interaction-graph-based stack repair algorithm.

Fig. 1 also shows an example illustrating the stack repair procedure. The stack processing routine stops after processing 9 references, and the lookahead routine takes over. Reference 15 is to c, which now goes from unknown to known. The stack repair routine is called. The first displacement of c occurred on edge 1 to f, and this interaction is reversed. Since all the other edges have higher numbers, c gets all the edges of f and vice versa. c and f are swapped in the stack. The stack and the interaction graph after this reversal are shown in the first column of the table on the right. Note that the effect of this transformation is to reverse the action of the first interaction between c and f; the subsequent movements of f are associated with c and vice versa. In the modified graph, the first displacement of c occurred on edge 3 by g; this interaction is similarly reversed. In the resulting graph c has not been displaced, and the

Figure 1: Finite lookahead simulation (example)

(n(vXY) denotes the number of edge vXY)
U is the unknown that has become known
while U has been displaced by some unknown
    let V be the first unknown to have displaced U,
        and let vUV be the edge created for that interaction
    for each edge vUX (X ≠ V) s.t. n(vUX) > n(vUV)
        delete vUX
        add edge vVX with n(vVX) = n(vUX)
    end for
    for each edge vXU (X ≠ V) s.t. n(vXU) > n(vUV)
        delete vXU
        add edge vXV with n(vXV) = n(vXU)
    end for
    (same for edges vVX and vXV)
    flip all edges between U and V with number greater than or equal to n(vUV)
    swap the positions of U and V in the stack
end while
delete U and all edges adjacent to U

Algorithm 2: Stack repair using interaction graph

current position of c in the stack is its true position. c and all the edges adjacent to it are deleted. The final stack and the interaction graph are shown in the second column.

3.3 Stack Repair: Dummy Priorities

The stack repair scheme using the interaction graph, described in the previous section, works for arbitrary unknown-unknown interaction policies.

However, maintaining and traversing the interaction graph is time and space intensive. In this section we present a simpler scheme, where dummy priorities are assigned to unknowns, and the stack processing routine takes replacement decisions based on these dummy or actual priorities. The dummy priority assignment forces an automatic encoding of information which may be used to do the stack repair. We just describe the scheme in this section; for a proof see [20]. The dummy priorities are assigned following the two rules below:

1. Each unknown is given a dummy priority lower than the priority of all knowns.

2. The dummy priorities are given in decreasing order, i.e., the dummy priority of an unknown is lower than the dummy priority of all earlier unknowns.

The first rule ensures that on a known-unknown interaction the unknown is always displaced, and the second rule ensures that on an unknown-unknown interaction, the unknown that entered the stack later is always displaced. When an unknown U1 is below another unknown U2 with a higher dummy priority, we know that the two unknowns have interacted, since U1 entered the stack after U2 and was therefore above U2 initially but is now below U2. In contrast, when an unknown is above another unknown of higher dummy priority, we know that the two have not interacted. As a result, when an unknown U becomes known, its true position is that of the highest unknown in the stack with a dummy priority greater than or equal to that of U.

Algorithm 3 uses these properties to move U to its true position, and to rearrange the other unknowns so that the next unknown-to-known conversion may be performed in the same way. In contrast to the previous approach using the interaction graph, this scheme maintains information on which unknowns have interacted, but does not have precise information on the order in which the interactions occurred.

(The stack is maintained as a linked list, with each entry having an address field and a priority field.)

Input: uaddr,  /* Address of U */
       tprty   /* True priority of U */
Procedure:
    dpu = dummy priority of U  /* from hash lookup */
    de->addr = uaddr
    de->prty = tprty
    /* entry_ptr: pointer to current stack entry */
    entry_ptr = head of stack
    while (entry_ptr->addr != uaddr)
        /* knowns and later unknowns are skipped */
        if ((current entry is unknown) && (entry_ptr->prty > dpu))
            /* other unknowns are rearranged */
            if (entry_ptr->prty < de->prty)
                exchange current entry with de
            end if
        end if
        entry_ptr = pointer to next stack entry
    end while
    replace current entry with de

Algorithm 3: Stack repair by dummy priority assignment

Fig. 2 shows an example illustrating the actions of the unknown handler. Unknowns are indicated by underlining as before, and their dummy priorities are shown in parentheses. The table on the top shows stack processing from Reference 8, with τ = 20. Reference 21 is to g, and the results of the stack repair operation are shown. The table below shows stack processing with τ = 21. Note that the final stack in the table below is the same as the stack after stack repair above.
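For concreteness, a direct Python transcription of Algorithm 3, under the assumption that the stack is a list of mutable [address, priority, is_unknown] entries (rather than the linked list of the paper) and that dummy priorities are negative as in Fig. 2; helper and parameter names are ours.

def repair_with_dummy_priorities(stack, uaddr, true_prty, dummy_prty):
    """Move the unknown `uaddr`, which has just become known, to its true
    position (Algorithm 3).  `dummy_prty` is the dummy priority it carried
    as an unknown (obtained via a hash lookup in the paper's implementation)."""
    de = [uaddr, true_prty, False]        # the entry to be dropped into place
    for entry in stack:
        if entry[0] == uaddr:
            entry[:] = de                 # U's old slot receives the carried entry
            return
        # knowns and unknowns that entered the stack after U are skipped
        if entry[2] and entry[1] > dummy_prty:
            # an earlier unknown: exchange so that such unknowns stay ordered
            if entry[1] < de[1]:
                entry[:], de[:] = de[:], entry[:]
    raise ValueError("uaddr not found in the stack")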

4 Efficient OPT Simulation

In this section we describe two techniques for making OPT simulation faster, in the context of the limited lookahead scheme described above. First, we describe a technique for breaking the OPT stack into groups, such that during the update process at most one entry may enter or leave a group. This grouping limits the number of stack entries that need to be examined during the update process. It speeds up list implementations, and also makes possible tree implementations of OPT stack simulation. Second, we describe a technique for doing single-pass set-associative cache simulation under OPT which exploits partial inclusion.

Figure 2: Stack repair by dummy priority assignment

4.1 Grouping Algorithm

In Mattson et al.'s algorithm the stack is implemented as a list and the execution time is dominated by the list lookup. The limited lookahead scheme described in the previous section enables single-pass simulation, but the complexity of stack processing remains unchanged. An approach to speeding up the computation is to convert the list lookup into a tree lookup. The basic requirement for a tree-based simulation algorithm is a technique for updating the stack without having to examine all stack entries up to the hit depth. In this section we describe one approach to doing this for OPT.

Definition 1: A group is defined to be a contiguous region of the stack in which entries are in decreasing priority order.

The OPT stack is partitioned into groups based on this definition. Note that there is flexibility in partitioning the stack, in that a section of the stack which forms a group may be split into two or more smaller groups

without violating the definition. It is convenient to regard the top entry in the stack (the line just referenced) as a special case, not belonging to any group.

Lemma 1: When optimal replacement is followed, hits always occur at the top entry in the stack or at the first entry of a group.

Proof: In OPT the highest-priority line is always the one that is referenced next. Since, by definition, the lines inside a group are of lower priority than the line at the top of the group, these lines cannot be referenced. The lemma follows.

for each (address, TNR)-tuple
    assign priority to address based on its TNR
    if (address == address at top of stack)
        update priority of top of stack and record hit
    else
        delete first stack entry and make it de
        insert address at the top of the stack
        for (each group in the group list starting at the top)
            if (address == address at top of group)
                record depth of hit
                delete top of group
                if (current group is the first group)
                    create a new group before current group
                    put de in the new group
                else
                    add de to end of previous group
                end if
                break
            else if (priority of de > priority of last entry)
                insert de in group
                delete last entry and make it de
            end if
        end for
        if (address not found)
            add de to end of last group
        end if
    end if
end for

Algorithm 4: Stack processing with grouping

Algorithm 4 shows stack processing with grouping. Here lookup is done by examining the top of the stack, and then the head of each group successively. Update is done by displacing a line (denoted as de in the algorithm) from each group in which the referenced address is not found. The de from a group is either the de of the previous group or the last entry of the current group, whichever is of lower priority. When the last entry is of lower priority, the current de is inserted in the current group at a position determined by its priority, and the last entry is made the de; otherwise the de is left unchanged. When the referenced address is found, it is deleted, and the current de is inserted at the end of the previous group. As addresses are found and deleted, groups become empty and are removed. Also, as entries are deleted from groups, adjacent groups may be coalesced. Group deletion and coalescence are not shown in the algorithm.

Figure 3: Grouping algorithm (example)

An example is shown in Fig. 3. The state of the stack just before the arrival of address l is shown on the left. Each stack entry consists of an address and its priority number (the greater the number, the greater the priority). On the arrival of l, d at the top of the stack is first examined. The addresses do not match. d is removed and made the de, and l is put in at the top. The first entry in the first group is then examined (c), and again there is no match. Since the priority of d, the current de, is greater than the priority of the last entry of the group (e), e is made the de and d is inserted into the first group. There is no match at the second group either. In this case, e continues to be the de, since it is of lower priority than the last entry b. l is the first entry of the next group, and a hit occurs at depth 9 in the stack. l is deleted, and e is inserted at the end of the previous group. The resulting stack is shown on the right in Fig. 3.

The complexity of the algorithm is

O(mean number of groups passed × cost of an insert-delete).

We show empirically in the next section that the mean number of groups passed is around one to four. The cost of inserts and deletes depends on how the groups are implemented. Both may be done in O(log(number of entries in group)) time when the groups are implemented as trees.

Since the number of groups passed is small, and the group operations are done in logarithmic time, stack processing with groups is much faster than regular stack processing. A stack repair scheme similar to the one described in the previous section may be used with grouping. The tree-based group implementation helps speed up the stack repair procedure too. We omit the details because of space constraints.
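For illustration, the grouped lookup and update of Algorithm 4 can be sketched in Python with each group kept as a plain list in decreasing priority order; the paper's implementation keeps groups in a splay tree, and group removal and coalescence are omitted here as in Algorithm 4. All names are illustrative.

import bisect

def grouped_process(top, groups, prio, addr):
    """One OPT stack step with grouping (a sketch of Algorithm 4).

    top    : the most recently referenced address (not part of any group)
    groups : list of groups; each group is a list of addresses in
             decreasing priority order (Definition 1)
    prio   : dict address -> priority (higher = referenced sooner)
    Returns (new_top, hit_depth or None for a miss).
    """
    if addr == top:
        return top, 1
    de = top                        # displaced entry carried down the stack
    depth = 1
    for gi, g in enumerate(groups):
        if g and addr == g[0]:      # by Lemma 1 a hit can only be at a group head
            hit_depth = depth + 1
            g.pop(0)
            if gi == 0:
                groups.insert(0, [de])        # de opens a new group above
            else:
                groups[gi - 1].append(de)     # de is lowest priority in that group
            return addr, hit_depth
        depth += len(g)
        if g and prio[de] > prio[g[-1]]:
            # insert de at its priority position; the old last entry of the
            # group becomes the displaced entry carried further down
            keys = [-prio[x] for x in g]
            g.insert(bisect.bisect_left(keys, -prio[de]), de)
            de = g.pop()
        # otherwise de simply moves past this group unchanged
    if groups:
        groups[-1].append(de)       # miss: de ends up at the bottom of the stack
    else:
        groups.append([de])
    return addr, None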

4.2 Set-Associative Cache Simulation

In this section we briefly describe the simulation of a range of set-associative caches of varying associativity and degree of mapping (number of sets) under OPT replacement. The same simulation mechanism is employed, with a lookahead routine, a stack processing routine, and a stack repair routine. The lookahead routine works exactly as described earlier. For stack processing, an adaptation of generalized forest simulation is used. Generalized forest simulation (FS+) is a single-pass cache simulation algorithm for set-associative caches under LRU replacement [7]. FS+ exploits the property that with LRU replacement, when a line is found at the stack top for one degree of mapping (number of sets), it is found at the stack top for other degrees as well. Although this property is not guaranteed to hold with OPT replacement, in practice it holds most of the time, and may be exploited when it holds as follows. With each stack a flag is maintained, which when set indicates that the entry at the stack top is at the stack top for subsequent mapping degrees as well. Therefore, when a hit occurs at the stack top for a stack whose flag is set, stacks of subsequent mapping degrees need not be examined. Flags may be set and reset along with the lookup. Stack repair is done as described earlier, treating each stack as an individual fully-associative stack.
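A rough sketch of how the flag shortcut might be used per reference; the per-set stack step and the set-indexing function are assumed helpers (not the paper's code), and the flag maintenance that the paper performs during lookup is omitted.

def simulate_reference(addr, degrees, stacks, flags, set_index, set_stack_step):
    """Partial-inclusion shortcut for set-associative OPT simulation (sketch).

    degrees        : mapping degrees (numbers of sets), in increasing order
    stacks[d][s]   : OPT stack for set s at mapping degree d
    flags[d]       : True if a top-of-stack hit at degree d implies a
                     top-of-stack hit at every larger degree
    set_index      : assumed helper, set_index(addr, d) -> set number
    set_stack_step : assumed helper, one per-set OPT lookup/update returning
                     the hit depth (1 = top of stack, None = miss)
    Returns {degree: hit depth or None}.
    """
    depths = {}
    for i, d in enumerate(degrees):
        depth = set_stack_step(stacks[d][set_index(addr, d)], addr)
        depths[d] = depth
        if depth == 1 and flags[d]:
            # the entry is at the stack top for all remaining (larger) degrees
            # as well, so those stacks need not be examined for this reference
            for d2 in degrees[i + 1:]:
                depths[d2] = 1
            break
    return depths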

5 Performance Evaluations

In this section we present empirical performance evaluations of the tree- and list-based fully-associative cache simulation algorithms. The traces are from the SPEC89 benchmarks, and were generated using the pixie utility. The pixified executable generated the trace, a pre-processor converted the trace into another format, and the cache simulator read this converted trace. All three programs were piped in sequence and executed on a DEC 5000. The trace generation and preprocessing times are not included in the times reported below; these times are typically less than 10% of the simulation time.

For each benchmark, the initial unified trace segment of length 100 million references was used in the simulation. All simulations were performed with a cache line size of 16 bytes. The results are shown in Table 1. In the table, Dist. addr denotes the number of distinct addresses in the trace, Entries denotes the mean number of stack entries passed in the list algorithms (mean stack depth), and Groups denotes the mean number of groups that are inspected per address reference. All execution times are in seconds and all other metrics are unitless.

The tree-based algorithms use splay trees [17], which are self-balancing and well-suited for cache simulation. In the LRU tree algorithm, the time of previous reference of an address is obtained by a hash lookup. The hit depth is then determined by a tree lookup that uses this time as the key [16, 22]. In the OPT tree implementation, a list of groups is maintained, numbered in order. A descriptor is associated with each group that contains the first and last addresses of the group, and the number of entries in the group. All the addresses in the stack are maintained in in-order form in one splay tree, with the current group number of the address as the primary key, and the priority of the address as the secondary key. Trace addresses are located by successively examining the first addresses of groups. The tree is used for insertions, deletions and other stack operations. Stack repair is done using the dummy priority scheme described in Section 3.3.

1. LRU list vs. OPT list: The OPT list algorithm is about a factor of 1.5 faster than LRU list for the spice and doduc benchmarks because the mean stack depth in OPT is significantly lower than in LRU. LRU list is faster than OPT list for the other two benchmarks. The stack can be grouped in the OPT list implementation, and this optimization is estimated to improve the OPT list execution time by a factor of 1.5 to 2.

2. OPT tree vs. OPT list: The OPT tree algorithm runs about 1.5 to 10 times faster than the OPT list algorithm. The speedup obtained using the OPT tree algorithm has a strong correlation with the mean stack depth for a trace.

3. OPT tree vs. LRU tree: The OPT tree algorithm runs about 2.5 to 3.0 times slower than the LRU tree algorithm. The OPT tree algorithm is slower primarily because more than one tree operation might be required per address. In addition, pre-processing and stack repair steps are required in OPT.

4. Effect of grouping: The mean number of groups examined is quite low and ranges from 1.56 to 3.44. This range of values appears to be typical even in longer simulations. More experimentation and analytical modeling are necessary to confirm and explain this behavior.

Benchmark      Dist. addr   OPT (list)         LRU (list)         OPT (tree)        LRU (tree)
                            Time      Entries  Time      Entries  Time     Groups   Time
spice i&d      31362        20625.5   113.6    29272.9   201.0    3737.7   2.52     1481.6
gcc i&d        >65536       62764     222.0    39737.2   263.5    5599.1   3.31     2015.1
espresso i&d   13185        5102.1    21.87    3830.2    30.26    3342.3   2.16     1396.7
doduc i&d      11995        16240.9   86.62    20056.0   159.5    5322.7   3.52     1647.6

Table 1: Execution times and performance metrics of simulation algorithms


Figure 4: Variations in calls to UH and run time with lookahead buffer space

5. Stack repair: The variation in run time and in the number of calls to the stack repair routine as the lookahead buffer size (and hence lookahead distance) is varied is shown in Fig. 4 for spice. When the lookahead routine is at least 100,000 addresses ahead of the stack processing routine, the stack repair routine is called about 1.5 to 13 times for every 100,000 addresses, and the effect of stack repair on overall run time becomes negligible. In the current implementation, this extent of lookahead requires about 1.6 MB of memory space for the buffer.

6. Earlier algorithms: We did not do comparisons with the reverse-pass algorithm owing to the difficulty of doing a reverse pass on a 100-million-reference trace. However, we believe it will run slower than our limited lookahead algorithm, because the overhead of the extra reverse pass will be greater than that of lookahead and stack repair. We did not do comparisons with Belady's algorithm either, since clearly doing a separate pass for each cache is very inefficient. However, when miss ratios of only a few caches are required, Belady's algorithm might be competitive.

OPT Model
Miss Type      Number of Misses
Compulsory     M(L, ∞, ∞, any)
Capacity       M(L, C, C/L, OPT) − M(L, ∞, ∞, any)
Mapping        M(L, C, k, OPT) − M(L, C, C/L, OPT)
Replacement    M(L, C, k, LRU) − M(L, C, k, OPT)

3 Cs Model
Miss Type      Number of Misses
Compulsory     M(L, ∞, ∞, any)
Capacity       M(L, C, C/L, LRU) − M(L, ∞, ∞, any)
Conflict       M(L, C, k, LRU) − M(L, C, C/L, LRU)

Table 2: OPT model and the three Cs model

6 Characterization of Misses

In this section we describe and contrast our miss characterization scheme, the OPT model, with the three Cs model, a previously proposed scheme that is similar in many respects. We describe qualitatively why our miss characterization provides better insight into possible approaches that can be used to reduce cache misses, and present a miss characterization averaged over nine SPEC89 benchmarks.

Let M(L, C, k, repl) represent the number of cache misses in a cache where L, C, k and repl are the line size, cache size, associativity, and replacement strategy. Table 2 shows the characterization of cache misses, M(L, C, k, LRU), using the OPT and three Cs models. In the table, a fully-associative cache is represented using an associativity of C/L. OPT replacement indicates OPT replacement with bypass. To illustrate the characterization, consider the mapping misses in the third row. The expression M(L, C, k, OPT) − M(L, C, C/L, OPT) represents the number of extra misses in a cache with a small associativity k as compared to a fully-associative cache of the same cache size and line size, both using the OPT replacement strategy. Since these misses arise from the mapping strategy and are affected by the choice of the mapping strategy (e.g., hash-based mapping instead of bit-selection mapping), they are characterized as mapping misses.

The OPT model is an extension of the three Cs model and gives a finer and more accurate characterization of cache misses.

Misses are characterized into four categories instead of three. The compulsory misses are the same in both models. In the OPT model, the capacity misses are the extra misses occurring in a fully-associative cache of the same size simulated using the OPT replacement strategy, and therefore accurately model the misses due to the finite capacity of the cache. Since the three Cs model uses a fully-associative LRU cache to obtain capacity misses, the capacity misses obtained under the three Cs model exceed the actual capacity misses by M(L, C, C/L, LRU) − M(L, C, C/L, OPT). This overcount in the three Cs model can be quite significant, as indicated by the results at the end of this section. In the OPT model, conflict misses are more finely characterized into mapping misses and replacement misses. Mapping misses arise because of the small degree of associativity of CPU caches, whereas replacement misses are due to the sub-optimal replacement strategy. A finer characterization enables a designer to narrow down the set of approaches to reduce cache misses. For instance, if the mapping misses are significant, increasing the associativity can improve the miss ratio, while modifying the replacement strategy will have little effect on the overall miss ratio.

The OPT cache simulators were used to obtain the miss ratio characterization using the OPT model for direct-mapped and two-way set-associative caches of varying sizes for instruction, data and instruction & data (unified) traces for nine of the SPEC89 benchmarks (all except fpppp). All simulations were run to completion or for one billion addresses. Read and write accesses were handled identically. The line size was fixed at 16 bytes.

Fig. 5 shows the mean cache miss components averaged over the nine benchmarks for a range of two-way set-associative caches using the OPT model for the three types of traces. Superimposed on this figure is the miss ratio of a fully-associative cache with LRU replacement. This miss ratio represents the sum of the compulsory and capacity misses in the three Cs model. The main observation is that the capacity miss ratio under the three Cs model is close to 200% higher than the true capacity miss ratio obtained by the OPT model in some cases, and is on average about 92%, 36% and 48% higher for instruction, data and unified traces respectively. In other words, our results indicate that LRU replacement in fully-associative caches is considerably worse than optimal replacement. The anomalous and contradictory characterization occasionally obtained using the three Cs model (under LRU simulation) is illustrated by the results obtained with a cache size of 128 KB on data and unified traces, and 1K and 8K on instruction traces.

In these cases, the compulsory and capacity miss components under the three Cs model are greater than the actual miss ratio in the two-way set-associative cache. Therefore, under the three Cs model, the conflict miss ratio is negative. The OPT model does not lead to such contradictory characterizations. The replacement miss component is about 20% of the miss ratio on average for the two-way set-associative caches, and is about the same magnitude as the mapping component, indicating that the miss ratio may be reduced by up to 32% using replacement schemes more sophisticated than LRU for two-way set-associative caches.

Some other comments and caveats about the miss components follow. First, the compulsory miss component is negligible in most cases and much smaller than that reported, for instance, in [5]. Since we simulate much longer (or complete) traces, the cold-start effects that contribute toward the compulsory miss component are amortized and are negligible. Furthermore, we simulate a single program at a time and do not account for multiprogramming effects [9]. Multiprogramming tends to increase the number of compulsory misses. Second, mapping and replacement misses are quite large for the instruction traces. Software remapping of the basic blocks of the programs was not performed, and would likely decrease the magnitude of these components. For the instruction caches, the LRU fully-associative miss ratio is quite close to the LRU two-way associative miss ratio. This is probably because of the sequential access patterns in instruction traces. For the unified and data traces, the LRU fully-associative miss ratio is close to the OPT two-way associative miss ratio. The above results indicate that improving the replacement strategy could lead to greater reductions in miss ratio than increasing the associativity beyond two.
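The decomposition of Table 2 is simple arithmetic once the four miss counts have been simulated; a small Python helper illustrating it (the function and parameter names are ours, not the paper's):

def opt_model(m_inf, m_fa_opt, m_sa_opt, m_sa_lru):
    """OPT-model decomposition of the misses of an (L, C, k, LRU) cache.

    m_inf    : M(L, inf, inf, any) -- misses of an unbounded cache (compulsory)
    m_fa_opt : M(L, C, C/L, OPT)   -- fully-associative cache, OPT replacement
    m_sa_opt : M(L, C, k, OPT)     -- k-way set-associative cache, OPT (with bypass)
    m_sa_lru : M(L, C, k, LRU)     -- the cache being characterized
    """
    return {
        "compulsory":  m_inf,
        "capacity":    m_fa_opt - m_inf,
        "mapping":     m_sa_opt - m_fa_opt,
        "replacement": m_sa_lru - m_sa_opt,
    }

def three_cs_model(m_inf, m_fa_lru, m_sa_lru):
    """Three Cs decomposition of the same cache, for comparison."""
    return {
        "compulsory": m_inf,
        "capacity":   m_fa_lru - m_inf,
        "conflict":   m_sa_lru - m_fa_lru,
    }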

7 Conclusion

In this paper we propose a method for simulating multiple cache configurations in a single pass under OPT replacement. The method uses limited lookahead, and fixes errors as more information becomes available. We also propose schemes for making fully-associative and set-associative cache simulation faster under OPT replacement. With these algorithms, fully-associative cache simulation under OPT takes about 1.25 hours on average for an address trace of 100 million references on a DEC 5000.

We propose a new OPT model for characterizing cache misses that extends the three Cs model. The OPT model is more accurate because its capacity misses represent a true lower bound on misses in any cache of a given size, unlike the three Cs model.


Figure 5: Two-way set-associative cache miss ratio components for the SPEC benchmarks

The OPT model further characterizes the conflict misses of the three Cs model into mapping and replacement misses. Miss characterizations for instruction, data, and unified traces from the SPEC89 benchmarks using the OPT and three Cs models are presented. The capacity miss component estimated by the three Cs model exceeds the OPT model's estimate by about 200% in some cases. The replacement miss component is about 20% on average for two-way set-associative caches.

Acknowledgements We thank Eric Hao, Daniel Windheiser and Alex Eichenberger for reading and critiquing drafts of the paper, and the reviewers for several helpful suggestions.

References

[1] A. Agarwal. Analysis of Cache Performance for Operating Systems and Multiprogramming. PhD thesis, Stanford University, 1988. Available as Technical Report CSL-TR-87-332.

[2] L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78-101, 1966.

[3] B. T. Bennett and V. J. Kruskal. LRU stack processing. IBM J. of Research and Development, pages 353-357, July 1975.

[4] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. J. of Par. and Dist. Comp., 5:587-616, 1988.

[5] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., 1990.

[6] M. D. Hill. Aspects of Cache Memory and Instruction Buffer Performance. PhD thesis, University of California, Berkeley, 1987. Available as Technical Report UCB/CSD 87/381.

[7] M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Trans. on Computers, 38(12):1612-1630, December 1989.

[8] W. W. Hwu and P. P. Chang. Achieving high instruction cache performance with an optimizing compiler. In Proc. of 16th Intl. Symp. on Computer Architecture, pages 242-251, 1989.

[9] W. W. Hwu and T. M. Conte. The susceptibility of programs to context switching. Technical Report CRHC-91-14, Center for Reliable and High-Performance Computing, University of Illinois, Urbana, April 1991.

[10] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proc. of 17th Intl. Symp. on Computer Architecture, pages 364-373, 1990.

[11] Y. H. Kim, M. D. Hill, and D. A. Wood. Implementing stack simulation for highly-associative memories. In Proc. ACM SIGMETRICS Conf., pages 212-213, 1991.

[12] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2):78-117, 1970.

[13] S. McFarling. Program Analysis and Optimization for Machines with Instruction Cache. PhD thesis, Stanford University, 1988. Technical Report No. CSL-TR-91-493.

[14] S. McFarling. Program optimization for instruction caches. In Proceedings of ASPLOS III, 1989.

[15] T. N. Mudge et al. The design of a microsupercomputer. IEEE Computer, pages 57-64, January 1991.

[16] F. Olken. Efficient methods for calculating the success function of fixed space replacement policies. Technical Report LBL-12370, Lawrence Berkeley Laboratory, 1981.

[17] D. D. Sleator and R. E. Tarjan. Self-adjusting binary search trees. J. of the ACM, 32(3):652-686, 1985.

[18] H. S. Stone. High-Performance Computer Architecture. Addison-Wesley, 2nd edition, 1987.

[19] R. A. Sugumar and S. G. Abraham. Efficient simulation of multiple cache configurations using binomial trees. Technical Report CSE-TR-111-91, CSE Division, University of Michigan, 1991.

[20] R. A. Sugumar and S. G. Abraham. Efficient simulation of caches under optimal replacement with applications to miss characterization. Technical Report CSE-TR-143-92, CSE Division, University of Michigan, 1992.

[21] D. Thiebaut. On the fractal dimension of computer programs and its application to the prediction of the cache miss ratio. IEEE Trans. on Computers, 38(7):1012-1026, July 1989.

[22] J. G. Thompson. Efficient Analysis of Caching Systems. PhD thesis, University of California, Berkeley, 1987.