
McGill University School of Computer Science ACAPS Laboratory

Advanced Compilers, Architectures and Parallel Systems

A Unified Framework for Hybrid Access Cache Design and Its Applications

Kevin B. Theobald    Herbert H. J. Hum†    Guang R. Gao

ACAPS Technical Memo 65 December 1, 1993

[email protected], [email protected], [email protected]

†Concordia University, Dept. of Electrical and Computer Engineering, Montreal, Quebec

ACAPS · School of Computer Science · 3480 University St. · Montreal · Canada · H3A 2A7

Abstract

High-speed microprocessors for high-performance computing demand fast on-chip caches to keep the processor usefully busy. Direct-mapped caches have been utilized for their superior access times, but suffer from higher miss rates due to address conflicts. Recently, this has led to several proposals for hybrid access caches combining the superior speed of direct-mapped caches with the higher hit rates of associative caches. These include the MRU cache [3], the victim cache [6], and the column-associative cache [2]. Contributions of this paper include:

- A model which unifies these emerging hybrid designs for on-chip caches within a single framework called the Hybrid Access Cache model (HAC model). In this model, a hybrid access cache is conceptually viewed as two caches: a fast direct-mapped primary section, and a slower but associative secondary section, with an efficient mechanism for exchanging data between the two. Under this model, a rich design space is structured in a way which helps cache designers systematically explore, experiment with, and compare various combinations of hybrid cache designs in order to meet area and performance goals.

- Existing hybrid cache designs are found to be near the extreme points of the design space as revealed by the framework, which suggests the possibility of other designs. In this paper, we develop a novel hybrid access cache scheme, the half-and-half cache, which has half of the cache lines reserved for direct-mapped accesses and the other half for associative accesses. Lying in the center of the design space, it tries to exploit equally the advantages of both direct mapping and associativity.

- We compare a total of twelve cache designs using trace-driven simulation, and illustrate their performance differences under our model. Simulation results are presented for single-process SPEC traces, multiple-process SPEC traces, and ATUM traces. The results demonstrate that the half-and-half cache fares better than other hybrid cache schemes for certain types of benchmarks.

Keywords: cache, half-and-half cache, hybrid access cache framework, microprocessors, trace simulation.


Contents

1 Introduction
  1.1 Development of Hybrid Caches
  1.2 A Unified Framework
  1.3 Synopsis

2 The Hybrid Access Cache Model
  2.1 Basic Cache Design
  2.2 Adding Associativity to Direct-Mapped Caches
  2.3 Viewing HAC's as Dual Caches
    2.3.1 HAC's and Two-Level Caches
    2.3.2 Implementation Details
  2.4 Performance and Cost Measurements for HAC's

3 Using the Hybrid Access Cache Model
  3.1 The HAC Design Space
  3.2 Restricting the HAC Design Space
  3.3 Exploring the Design Space
  3.4 The Half-and-Half Cache

4 Simulation Experiments
  4.1 Cache Designs Simulated
  4.2 Trace Sources
  4.3 Results

5 Conclusions

A Experimental Data

List of Figures

1 Typical Cache Structure
2 Structure of a HAC
3 Problems with Bad Hashing Functions
4 Design space of the HAC model
5 Design space of more efficient HAC's
6 Combined Results for 8K Cache
7 Combined Results for 32K Cache
8 Combined Results for 128K Cache
9 Baseline Results for Multitasking Traces
10 Multitasking Results for 8K Cache
11 Multitasking Results for 32K Cache
12 Multitasking Results for 128K Cache

List of Tables

1 Size Parameters for Simulated Caches
2 Symbolic SPEC89 Benchmarks
3 Multitasking SPEC89 Benchmarks
4 ATUM Traces
5 Data for Symbolic Benchmarks on 8K Caches
6 Data for Symbolic Benchmarks on 32K Caches
7 Data for Symbolic Benchmarks on 128K Caches
8 Data for Floating-Point Benchmarks and SPEC89 Aggregates on 8K Caches
9 Data for Floating-Point Benchmarks and SPEC89 Aggregates on 32K Caches
10 Data for Floating-Point Benchmarks and SPEC89 Aggregates on 128K Caches
11 Data for ATUM Traces on 8K Caches
12 Data for ATUM Traces on 32K Caches
13 Data for ATUM Traces on 128K Caches
14 Data for Multitasking Traces on 8K Caches
15 Data for Multitasking Traces on 32K Caches
16 Data for Multitasking Traces on 128K Caches

1 Introduction

In the past decade, the operating frequencies of high-performance microprocessors have increased dramatically, but memory speeds have not kept pace. Processor designers have had to turn to on-chip caches to overcome this gap. Because of the great differential in access time between an on-chip cache and an off-chip cache or main memory, an effective on-chip cache design is becoming more and more crucial to maximizing processor performance. To get the most throughput from an on-chip cache, it is best to have a cache with a hit time of one clock cycle, i.e., the cache can retrieve the data in one cycle provided it is stored in the cache. In summary, on-chip cache design is now becoming a central issue in high-performance microprocessor architectures.

Selecting an on-chip cache design is non-trivial because there is such a variety of cache configurations from which to choose. Conventional cache designs have at least four different design parameters: the line size (L); the number of sets (N); the number of lines per set, or associativity (K); and the replacement policy for choosing which line to remove from a set when a new line must be inserted. This parameter space can be divided into three general categories. A fully-associative cache has only one set, i.e., N = 1. Thus, there must be a tag comparator for every tag stored in memory, which generally makes this design impractical for large caches. A direct-mapped cache has only one line per set, i.e., K = 1. Thus, each memory address can only be stored in one unique place in the cache. A set-associative cache is any cache in which both N and K are greater than 1.

Tempering the varied choices of cache designs, studies have shown tradeoffs between the associativity of a cache, its access time, and its miss rate (the percentage of accesses which are misses). Because direct-mapped caches can only store a memory block in one specific line, they are more prone to conflicts between addresses (memory blocks) that map to the same set, and thus tend to have more misses than set-associative caches. However, due to a shorter critical circuit path, the access time of a direct-mapped cache is lower than the access time of a set-associative cache storing the same amount of data [5, 7]. In a set-associative cache, the data line from the cache cannot be written into the memory output buffer until the tag comparisons are complete, since the cache has to know which of the data lines to select. On the other hand, since an address can only be stored in one line in a direct-mapped cache, the direct-mapped cache can write the selected line from the cache to the output buffer at the same time as it is comparing the tags. The data can then be invalidated if the tags don't match.

For moderate-speed microprocessors, it is usually possible to build a cache with a small amount of associativity (such as K ≤ 4) and still attain a one-cycle hit time. However, for high-speed processors with shorter clock cycles, even a cache with 2-way associativity may be too slow to return its data in one cycle; the designer has to use a direct-mapped cache to meet the clock speed requirement.

1.1 Development of Hybrid Caches

Because direct-mapped caches tend to have higher miss rates, researchers have developed various hybrid schemes in an attempt to combine the fast access of a direct-mapped cache with the better hit rate of a set-associative cache.

The simulation results of So and Rechtschaffen [10] showed that in a set-associative cache, the vast majority of hits within a given set involve the MRU (most recently used) member of that set. This observation led to the MRU cache [3], a set-associative cache which is biased toward the MRU member of each set. When the cache is accessed, all tags in the selected set are compared simultaneously. However, even before the outcomes of the comparisons are known, the line containing the MRU of the set is immediately latched into the output buffer. If the tag check should subsequently show that the MRU tag does not match the address tag, then the data is invalidated. If one of the other tags matches, then that line is latched into the output buffer in the next cycle, and that line becomes the new MRU. Thus, an MRU hit has an access time of one cycle, and a non-MRU hit takes two cycles.

Another way to boost the hit rate of a direct-mapped cache is to add a victim cache [6]. A victim cache is a small fully-associative cache, typically of no more than sixteen lines. Whenever a line is replaced in the direct-mapped cache, the victim (the line that is replaced) is transferred to the victim cache. The direct-mapped cache is accessed in a single cycle, as before, and the output buffer is invalidated if the tag mismatches. If the desired address is in the victim cache, then the line is fetched from the victim cache (in the second cycle), and that line is swapped with the appropriate line in the direct-mapped cache.

Since some accesses require two cycles to perform, a simple way to combine fast access with better associativity is to read a direct-mapped cache twice. In the hash-rehash cache [1], if a read misses on the first cycle, a second read is done using a different hashing function to select an alternate line. If the second try is a hit, the first and second lines are swapped. An improved version called the column-associative cache [2] keeps with each tag a rehash bit which indicates whether the corresponding line represents a first-cycle hit or a rehash (second-cycle) hit. This bit is used to limit the swapping in order to reduce thrashing.

1.2 A Unified Framework

It can be seen that these hybrid cache designs represent very different solutions to the same problem. It would be quite useful if the alternatives in designing a hybrid cache could be characterized in a way such that useful insights into their relations could be readily obtained, and new promising design points could be quickly identified and explored. In this paper, we show that most of these existing hybrid caches can be viewed as variations of a general cache design framework, which we call the Hybrid Access Cache (HAC) model. Under the HAC model, a hybrid cache can be viewed as consisting of a direct-mapped primary section, a slower secondary section (usually associative), and an efficient mechanism for exchanging data between the two. The direct-mapped section provides fast, single-cycle access to frequently-accessed data, while the slower section provides extra associativity in order to eliminate some of the conflict misses that would occur in a purely direct-mapped cache.

Unlike conventional two-level caches (a first-level and a second-level cache), the two sections in a hybrid access cache are organized as one level in the memory hierarchy. (A more detailed comparison of HAC's vs. two-level caches can be found in Section 2.3.1.) The purpose of the HAC model is to allow the user to think about a whole spectrum of values for the cache design parameters, in order to balance the benefits of fast access and greater associativity. To illustrate, we will show that the existing hybrid cache designs, such as the MRU, victim, and column-associative caches, are all near extreme points in the design space. We also demonstrate how the use of this framework allowed us to conceive of a novel cache organization, which we call the "half-and-half" cache. In this cache, half of the memory storage is allocated to a direct-mapped section, and the other half is allocated to a K-way set-associative section. In this manner, the associative section can take advantage of higher associativity without negatively impacting the performance of the direct-mapped portion.

1.3 Synopsis

In the next section, we present the HAC model in greater detail, discussing its design framework, typical operations of a HAC, and how to measure its costs and performance. That section also discusses implementation details for HAC's, and contrasts them with two-level caches. In Section 3, we show how the HAC model can be helpful in exploring the different schemes. We then use the model to design the half-and-half cache. We have performed trace-driven cache simulation to compare various hybrid access cache designs and illustrate their performance differences under our model. Simulations have been performed on both single-process and multiple-process traces derived from SPEC benchmarks, as well as on ATUM traces [1]. In Section 4, we present the results of the simulations and compare the performance of the half-and-half cache to those of the MRU cache, the victim cache, and the column-associative cache. The results demonstrate that the performance of the half-and-half cache is better than that of other hybrid cache schemes for many benchmark programs, particularly at intermediate cache sizes. Finally, we present our conclusions in Section 5. Experimental results are listed in greater detail in an appendix.

2 The Hybrid Access Cache Model

This section reviews basic cache design and terminology, and then develops the Hybrid Access Cache as a model for caches which combine features of direct-mapped and associative caches. We then discuss the operation, implementation, and performance and cost measurements for HAC's.

Figure 1: Typical Cache Structure (an N × K array of lines organized as N sets of K lines each; every line holds L bytes of data plus a tag)

2.1 Basic Cache Design

A typical cache (on-chip or off-chip) is shown in Figure 1. A cache is divided into an N × K array of lines. Each line contains L bytes of data (L a power of 2), corresponding to L contiguous bytes in main memory, and a tag which indicates which block of main memory is being stored in that line. The lowest log2 L bits of a memory address are used as an offset into a particular line. A set-selection hashing function s is applied to the remaining bits of the address (call these bits a) in order to select one of the N sets. Some or all of these address bits form the tag.

Upon receiving a request, the tags of the K member(s) of the selected set s(a) are (simultaneously) compared with the tag of the requested address. If a match (termed a hit) occurs, the line corresponding to the tag that matched is read from the cache, and the portion of the line specified by the line offset is either sent to the processor (if read), or updated (if written). If none of the K tag(s) match (termed a miss), the cache must choose one line of the set to be replaced by the new line.

This general description covers many different conventional cache designs. A typical cache design requires five different design parameters and choices: the line size (L); the number of sets (N); the number of lines per set, or associativity (K); the replacement policy for choosing which line to remove from a set when a new line must be inserted; and the hashing function. The simplest and most common implementation is to make N a power of 2, and make s(a) be the identity of the lowest log2 N bits of a. Selecting a line in this manner is called bit selection. With this scheme, these bits do not need to be stored in the tag fields. In this paper, we generally assume bit selection is used, but our model is applicable to other implementations as well. (For instance, Seznec proposes different hashing functions for the two banks of a 2-way set-associative cache [8].)
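As a concrete illustration of bit selection (our sketch; the report itself contains no code), the following C fragment splits an address into the line offset, the set index s(a), and the tag, assuming both L and N are powers of two.

    #include <stdint.h>

    /* Illustrative address decomposition for a cache with N sets of
     * L-byte lines (both powers of two): the lowest log2(L) bits are
     * the line offset, the next log2(N) bits select the set (bit
     * selection), and the remaining upper bits form the tag. */
    typedef struct {
        uint32_t offset;   /* byte offset within the line */
        uint32_t set;      /* selected set, s(a)          */
        uint32_t tag;      /* bits kept as the tag        */
    } addr_fields;

    static addr_fields split_address(uint32_t addr, uint32_t L, uint32_t N)
    {
        addr_fields f;
        f.offset = addr & (L - 1);
        f.set    = (addr / L) & (N - 1);
        f.tag    = addr / (L * N);
        return f;
    }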

If only size is being considered, then for a given replacement policy, the size of the cache may be represented by the triple (N, K, L). The three main types of cache organizations are:

1. Fully-associative: N = 1;
2. Direct-mapped: K = 1;
3. Set-associative: N > 1, K > 1.

For an associative cache (full or set), studies [9] have shown that a Least Recently Used (LRU) replacement policy exhibits generally good performance. Thus, many associative caches employ some form of this policy.
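For concreteness, a minimal trace-driven model of a conventional (N, K, L) cache with LRU replacement might look like the sketch below (ours, with illustrative N and K values; addresses are assumed to have the offset bits already stripped, i.e., a = addr / L).

    #include <stdint.h>

    #define N_SETS 256   /* N: number of sets (illustrative)   */
    #define K_WAYS 4     /* K: lines per set                   */

    typedef struct {
        uint32_t tag[K_WAYS];
        int      valid[K_WAYS];
        uint64_t last_use[K_WAYS];   /* for LRU replacement     */
    } cache_set;

    static cache_set sets[N_SETS];
    static uint64_t  now;

    /* Returns 1 on a hit, 0 on a miss; 'a' is the line address. */
    static int access_cache(uint32_t a)
    {
        cache_set *s   = &sets[a % N_SETS];   /* bit selection   */
        uint32_t   tag = a / N_SETS;
        int i, victim = 0;

        now++;
        for (i = 0; i < K_WAYS; i++) {
            if (s->valid[i] && s->tag[i] == tag) {
                s->last_use[i] = now;        /* hit: refresh LRU */
                return 1;
            }
            /* prefer an invalid way, otherwise the least recent */
            if (!s->valid[i])
                victim = i;
            else if (s->valid[victim] && s->last_use[i] < s->last_use[victim])
                victim = i;
        }
        s->tag[victim]      = tag;           /* miss: replace    */
        s->valid[victim]    = 1;
        s->last_use[victim] = now;
        return 0;
    }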

2.2 Adding Associativity to Direct-Mapped Caches

Inherently, direct-mapped caches have a speed advantage over associative caches in that the access time of an associative cache must include time for the associative tag matches, an overhead not incurred in a direct-mapped cache [5, 7]. However, a direct-mapped cache generally suffers a higher miss rate as compared to an associative cache of the same size due to the restriction that a memory block can only map to a unique cache line. Interference (conflicts) due to multiple blocks mapping to the same line results in increased miss rates, whereas an associative cache would offer multiple cache lines to accommodate those blocks.

To get the speed advantage of direct-mapped caches, as well as the lower miss rates offered by associative caches, researchers have proposed caches incorporating associativity in direct-mapped designs. We term these new cache designs Hybrid Access Caches (HAC's) due to their hybrid nature: they are part direct-mapped and part associative, and they have dual accessing modes. In a HAC, first the direct-mapped part of the cache is probed. If that is not successful, a further probe is performed, this time in the associative section; thus the term 'hybrid access.' Cache designs within this category include the MRU cache [3], the victim cache [6], the hash-rehash cache [1], and the column-associative cache [2].

2.3 Viewing HAC's as Dual Caches

A HAC can be conceptually viewed as consisting of two (dual) caches: a fast primary cache which is direct-mapped, and a secondary cache, which may be direct-mapped, set-associative, or fully-associative. An example configuration is shown in Figure 2, where the secondary cache is a two-way set-associative cache. Any memory block may be cached in either the primary or secondary section, but not in both at the same time. Data in the primary cache may be accessed in a single cycle, while data in the secondary cache requires two or more cycles to access. The goal of the hybrid access cache organization is to try to keep most of the frequently-used data in the primary cache, while using the secondary cache to store data which would otherwise be swapped out of the cache due to address conflicts in the direct-mapped primary cache.


Figure 2: Structure of a HAC

Even though accessing the secondary cache may require two or more cycles, this is far better than the standard miss penalty. (Considering that high-speed microprocessors can operate at more than ten times the access time of off-chip memory, the need to go off-chip can be very costly.) The primary and secondary caches may be different sizes, and may have different hashing functions s1 and s2 to access the sets in the two caches (s1 is used to access the primary while s2 is for the secondary). However, since data is to be exchanged between the primary and secondary caches, the line sizes must be the same. Therefore, the sizes of the two caches can be represented by the triples (N1, 1, L) and (N2, K, L).

Conceptually, a HAC performs a read in the following manner (refer to Figure 2). Assume a is the address with the line-offset bits removed:

1. In the first cycle, the cache reads the tag and data in line s1(a) of the primary cache. (In the figure, a number beside an arc indicates the cycle in which the output will appear. Moreover, the figure assumes s1 and s2 use bit selection.) The designated data is latched into the output buffer of the cache, becoming available at the end of the first cycle. In the same cycle, the HAC reads the tags and data of all members of set s2(a) of the secondary cache.

2. The single tag from the primary cache and the K tags from the secondary cache are compared against the appropriate bits of the address.

3. If the tag from the primary cache matches the address, then a primary hit occurs. The cache immediately cancels the probe of the secondary cache, so that it may be used immediately in the next cycle.

4. If the tag from the primary cache does not match the address, then the output buffer is invalidated, and the cache must operate for at least one more cycle.

5. In the second cycle (if there was no primary hit), the tag comparators from the secondary cache are examined. If one of them matches (a secondary hit), then the appropriate data line is latched into the output buffer. The cache must now move the selected data (and tag) from the secondary cache into the primary cache, while simultaneously moving the displaced line from the primary cache into the secondary cache (this swapping is not shown in Figure 2 for clarity). (In Jouppi's terminology, this displaced primary-cache line is called the victim [6].) If K > 1 and the secondary cache uses an LRU replacement policy, then it would make sense for the LRU status bits in the set to be changed so that the victim becomes the new MRU element, since it just came from the primary cache.

6. If all tag comparators mismatch, then the cache must fetch the line from main memory. This data will go into the primary cache in the line that was just checked. The resulting primary-cache victim moves over to the secondary cache, which will cause the LRU element of the secondary-cache set to be flushed. When the data returns from main memory, the cache copies it into the primary cache.

The swapping done in steps 5 and 6 imposes a restriction on the hashing functions:

    s1(a1) = s1(a2)  ==>  s2(a1) = s2(a2)    (1)

In other words, if two addresses map to the same line in the primary cache, then they should map to the same set in the secondary cache. The need for this can be seen in the following example, in which the primary cache has size (4, 1, 1), the secondary cache has size (8, 1, 1), and bit selection is used (see Figure 3). Assume that the memory blocks at addresses 00001 and 00101 are in the secondary cache (in sets 001 and 101, respectively), and the memory block at address 01001 is in the primary cache (in set 01). What happens if the processor reads address 00101? The data corresponding to address 00101 should be moved from set 101 of the secondary cache to set 01 of the primary cache, displacing address 01001. Yet this victim must go into set 001 of the secondary cache, not set 101. The former set must flush its other address, while set 101 is left without a victim, and must be marked invalid. This results in a complex accessing scheme and wastes precious cache space.

Figure 3: Problems with Bad Hashing Functions

Assuming that the co-domain of s2 is the entire set space in the secondary cache, this restriction means that N2 ≤ N1. If bit selection is used, then N1 and N2 must be powers of 2; typically, the rightmost log2 N1 bits of a are used to select the primary cache line, and the rightmost log2 N2 bits of that field are used to select the secondary cache set.
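The six-step read operation can be condensed into a short trace-simulation sketch (ours, not the authors' simulator). It assumes bit selection with N1 and N2 powers of two, N2 ≤ N1, stores full line addresses instead of tags for simplicity, and charges 1 cycle for a primary hit, 2 for a secondary hit, and 1 + T_MAIN cycles for a miss.

    #include <stdint.h>

    #define N1     1024   /* primary (direct-mapped) lines -- illustrative */
    #define N2      256   /* secondary sets; N2 <= N1, both powers of two  */
    #define K         2   /* associativity of the secondary section        */
    #define T_MAIN   20   /* main-memory penalty in cycles                 */

    /* The model keeps the full line address 'a' rather than just the
     * tag bits a real cache would store. */
    typedef struct { uint32_t a; int valid; } line_t;

    static line_t   primary[N1];
    static line_t   sec[N2][K];
    static uint64_t stamp[N2][K], now;   /* LRU bookkeeping for sec[][]    */

    /* One HAC read; 'a' is the address with the line-offset bits removed.
     * Returns the cycles charged to the access. */
    static int hac_read(uint32_t a)
    {
        uint32_t p = a % N1, s = a % N2;  /* s1(a) and s2(a): bit selection */
        line_t victim = primary[p];       /* displaced on any primary miss  */
        int i, lru = 0;

        now++;
        if (primary[p].valid && primary[p].a == a)
            return 1;                               /* steps 1-3             */

        primary[p].a = a;                           /* the requested line    */
        primary[p].valid = 1;                       /* will end up here      */

        for (i = 0; i < K; i++) {
            if (sec[s][i].valid && sec[s][i].a == a) {
                sec[s][i]   = victim;               /* step 5: swap; victim  */
                stamp[s][i] = now;                  /* becomes the set's MRU */
                return 2;
            }
            if (stamp[s][i] < stamp[s][lru])
                lru = i;                            /* track the LRU member  */
        }

        /* Step 6: miss.  The victim replaces the LRU secondary line and the
         * requested line is fetched from memory into the primary cache.     */
        sec[s][lru]   = victim;
        stamp[s][lru] = now;
        return 1 + T_MAIN;
    }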

2.3.1 HAC's and Two-Level Caches

At first glance, our conceptual view of a HAC superficially resembles a two-level cache, because both systems have a slower secondary cache which attempts to cache things which are missed by the primary cache. However, two-level cache systems usually have substantially different hardware implementations for the two caches (e.g., off-chip SRAM chips for the second-level cache of a two-level cache system versus on-chip memory for the HAC's). Since HAC's would be implemented on the same chip with the CPU and the chip area is limited, there is a direct tradeoff between the sizes of the two caches. This distinction leads to the following important differences between HAC's and two-level caches:

- In a two-level cache system, the second-level cache is usually much larger than the first-level cache. In a HAC, on the other hand, the size of the secondary cache is limited by the requirement that it have no more sets than the primary cache. Moreover, in many cases, the secondary cache is smaller than the primary cache.

- As a result of this difference, the main purpose of a second-level cache is to alleviate capacity misses [4], those due to the small size of the first-level cache. In a HAC, the main purpose of the secondary cache is to alleviate conflict misses, those due to address interference caused by direct mapping in the primary cache.

- Typically, two-level caches have the multilevel inclusion property [4], meaning that a copy of everything in the first-level cache is also in the second-level cache, for increased efficiency in consistency maintenance between I/O and the processor. In a HAC, no line can be in both caches at the same time.

- Lastly, there is no restriction as to where a line is mapped in the second-level cache in relation to the first-level cache. In a HAC, two addresses mapping to the same line in the primary cache must map to the same set in the secondary cache (eq. 1).

2.3.2 Implementation Details

A few words should be said about implementation details. We assume that the swapping can occur within the second clock cycle. To do this, both subcaches need to write back during the second cycle. At the end of the first cycle, all data from both the primary line and the secondary set are latched in a buffer big enough to hold all K + 1 lines, plus tags and LRU status bits. (These must be read in one cycle, else the cache cannot perform one-cycle hits.) Since the results of the tag matches are known at the end of one cycle (again, because it does one-cycle hits), these results can be used to control write-backs. The same mux that selects one of the secondary lines for the output can be used to route that output to the primary cache. In addition, each column of the secondary cache needs a multiplexor so that its input can come either from its own line in the buffer or from the line in the buffer linked to the primary cache. LRU status bits are updated in the usual way.

To minimize storage costs, tags can be limited to only the bits not used in set selection (assuming bit selection is used). This means that the tags in the primary cache will be shorter than the tags in the secondary if N1 > N2. When a primary-cache victim is moved to the secondary, the extra bits can simply be read from the upper bits of the primary line number.

For cache writes, various HAC's can implement different strategies. One such strategy is as follows: a hit in the primary cache simply alters the contents of the primary cache, and a hit in the secondary cache likewise results in the modification of the corresponding line in the secondary cache. If a write request misses in both, the write can be directed to the next level in the memory hierarchy. There are many other strategies a HAC can adopt, such as swapping lines with the primary cache if there was a hit in the secondary, fetching the line from memory if there was a miss in both caches and storing it in either cache, etc.
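The first write strategy mentioned above can be read as a simple classification rule. A sketch (ours), reusing the declarations from the read example in Section 2.3: a write hit updates whichever section holds the line, with no swapping, and a write that misses in both sections is passed on to the next level of the memory hierarchy.

    enum wr_target { WR_PRIMARY, WR_SECONDARY, WR_NEXT_LEVEL };

    /* Classify a write under the simple policy described above. */
    static enum wr_target hac_write(uint32_t a)
    {
        uint32_t p = a % N1, s = a % N2;
        int i;

        if (primary[p].valid && primary[p].a == a)
            return WR_PRIMARY;            /* hit: modify the primary line   */
        for (i = 0; i < K; i++)
            if (sec[s][i].valid && sec[s][i].a == a)
                return WR_SECONDARY;      /* hit: modify the secondary line */
        return WR_NEXT_LEVEL;             /* miss in both: send downstream  */
    }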

2.4 Performance and Cost Measurements for HAC's

When comparing simple caches, the miss rate (ratio of misses to total accesses) is usually an adequate measure of performance. However, it is not sufficient to describe the performance of HAC's, because there are two caches, and hence two separate miss rates. The relation between the miss penalties must also be considered, because a HAC could have few main-memory accesses, yet still have relatively poor performance because many of the hits occur in the secondary cache and take longer to access.

If a single parameter is needed to compare different cache alternatives, then the average effective memory-cycle time (teff) can be used. This parameter averages the costs of all types of misses. If h1 and h2 are the fractions of accesses which hit in the primary cache and secondary cache, respectively, and if t1 and t2 are the access times of the two caches, then

    teff = t1 h1 + t2 h2 + (1 - h1 - h2)(t1 + tmain)    (2)

This assumes that the main memory access begins at the end of the first access, once all the comparator results are compared. This may be simplified to

    teff = t1 + (t2 - t1) h2 + (1 - h1 - h2) tmain    (3)

If we assume t1 is 1 cycle and t2 is 2 cycles, then this becomes

    teff = 1 + h2 + (1 - h1 - h2) tmain    (4)

To compute the hardware overhead costs of a HAC, one should consider the number of tag comparisons and the extra hardware needed to perform the swapping. The number of comparators needed is K + 1, which is about what is needed if a conventional K-way set-associative cache is used. (In the column-associative cache, only one tag comparator is required. However, it can be reorganized to take advantage of an extra comparator and data paths for faster access times.) All other hardware costs are roughly equivalent, except for the extra data paths needed to move a primary-cache victim over to the secondary cache. These costs will probably be small since that operation is the inverse of the column selection used to choose one secondary line (effectively a demultiplexer). The two can be laid out and routed in parallel, so the overall area will not increase by much.
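As a worked example of equation (4), with illustrative numbers of our own choosing: if 90% of accesses hit in the primary section (h1 = 0.90), 5% hit in the secondary section (h2 = 0.05), and tmain = 20 cycles, then teff = 1 + 0.05 + (1 - 0.90 - 0.05) * 20 = 2.05 cycles. The same computation in code:

    #include <stdio.h>

    /* Average effective memory-cycle time, equation (3):
     * teff = t1 + (t2 - t1)*h2 + (1 - h1 - h2)*tmain */
    static double t_eff(double t1, double t2, double tmain,
                        double h1, double h2)
    {
        return t1 + (t2 - t1) * h2 + (1.0 - h1 - h2) * tmain;
    }

    int main(void)
    {
        /* t1 = 1, t2 = 2, tmain = 20, h1 = 0.90, h2 = 0.05 -> 2.05 */
        printf("teff = %.2f cycles\n", t_eff(1.0, 2.0, 20.0, 0.90, 0.05));
        return 0;
    }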

3 Using the Hybrid Access Cache Model

The Hybrid Access Cache framework can be used to represent the design space of possible cache configurations and to gain insights into how well they might be expected to perform. In this section, we show how to do this in general, and show that specific hybrid schemes are equivalent to specific cases in the HAC model. Using the parameter space created by the HAC model, we develop a new cache configuration called the "half-and-half" cache. The half-and-half cache can often give better access times than the other cache configurations, especially for a middle range of cache sizes, as shown by our experiments in the next section.

3.1 The HAC Design Space

Section 2.3 presented a few of the restrictions of the HAC model:

1. The line sizes are the same;
2. N2 ≤ N1.

Figure 4: Design space of the HAC model (horizontal axis: percentage of cache lines used for associative access; vertical axis: associativity K of the associative portion)

This still leaves us with a large design space to explore. Figure 4 shows the HAC design space. On the horizontal axis, we have the percentage of cache lines allocated to associative accesses, i.e., the number of lines in the secondary cache. Obviously, as we increase the number of lines allocated to one cache, the number of lines allocated to the other is decreased accordingly. On the vertical axis is the associativity of the secondary cache, K, where an associativity of one indicates that the secondary cache has a direct-mapped organization. Note that we make no mention of line size nor of the total cache size; they are omitted for clarity. In the HAC design space, we assume that the total area of a HAC could be allocated to a primary cache with a one-CPU-cycle access time; thus allocating lines to the associative secondary cache only decreases the amount of HAC memory which can respond in one cycle, without adding to the size of the HAC. Moving to the right in the design space indicates that the HAC has more and more lines dedicated to associative accesses so that conflicts from the direct-mapped accesses can be better resolved, whereas moving to the left indicates that more and more lines are used in a direct-mapped manner for increased speed.

Thus there is a trade-off when a cache design is moved left or right in the space. Moving towards the top of the design space indicates that the cache has more associativity in the associative portion to resolve more conflicts, but at the cost of more tag-matching hardware. Again, there is another trade-off when moving a cache design towards the top or bottom of the design space.

In Figure 4, design points for previously proposed HAC's are shown (the "half-and-half" will be explained in a later section). The victim cache (i.e., a direct-mapped cache augmented by a victim cache) occupies design points on the far left. The reason is that almost all the cache lines are devoted to direct-mapped access and only a very small portion is devoted to the victim cache. Jouppi suggests a victim cache which has 4 to 16 lines and is fully associative. Thus the design region for the DM & victim combination is a steeply-sloped straight line starting from a K value of 1 and a percentage of DM lines of 100%, rising to the right.

The MRU cache occupies the design points which start at 50% associative lines and a K value of one, forming a hyperbolic curve towards the upper right-hand corner of the HAC design space. In the MRU cache, the structure is an overall set-associative cache with one line in each set dedicated to the most recently used member. Therefore, a four-way set-associative cache converted to an MRU cache would have 25% of the total cache lines dedicated to direct-mapped access. As for the cache lines in the associative portion (75% of the total), the associativity of that secondary cache would be three. The MRU caches mark the other extreme of the design space; the region to the right of the MRU curve cannot correspond to any legal designs given that N2 ≤ N1 (see Section 2.3).

The column-associative (CA) cache (and its hash-rehash predecessor) is a HAC which does not have a single point in the design space, but a horizontal line. In terms of the mapping of addresses to cache lines, the CA cache is equivalent to a two-way set-associative cache, because the bit-flipping rehash function of the CA cache effectively partitions the cache line space into 2-way sets. In terms of access, however, the CA sometimes acts like a direct-mapped cache, because the distribution of cache lines to direct-mapped accesses is dependent on the access patterns of the application. If accesses do not cause conflicts in the CA cache, then the CA cache functions as if it devoted 100% of its lines to direct-mapped access. However, as conflicts arise, lines begin to be used for associative (rehashed) accesses. Eventually, if all accesses cause conflicts, then up to half of the lines can be used for associative accesses. Thus the horizontal line extends from 100% DM access to 50%. The line representing the CA cache is placed at K = 1 in the design space because a line can only be rehashed to one other location.

3.2 Restricting the HAC Design Space

In the HAC design space as described above, all design points within the space are possible, since we can have a secondary cache which has x% of the total lines and use a multiple-rehashing scheme to access it, thus increasing the associativity of the secondary cache. (This can be seen by examining what the hash-rehash scheme does to a direct-mapped cache.) However, such odd line distributions, along with cumbersome access mechanisms, lead to inefficient designs. So for now, we only consider what happens when the hashing functions use bit selection. Since this means that N1 and N2 are powers of 2, it also means that N1/N2 is also a power of 2.

Even with this restriction, we are still presented with a large range of sizes for the secondary cache. Given a primary cache of size N1, we can set N2 to any power of 2 up to N1, and set K to any positive integer (although in practice, K would be limited by the high costs of high associativity). We may view this as another two-dimensional design space, but one for more efficient HAC designs (Figure 5). In this design space, the vertical axis remains the same; the only difference is that the horizontal axis is now the ratio of the number of sets for direct-mapped access to the number of sets for associative accesses. A drawback of this design space is the lack of an indication of how much storage space is devoted to the primary and secondary caches.

Figure 5: Design space of more efficient HAC's (horizontal axis: the ratio N1/N2; vertical axis: associativity K of the associative portion)

To interpret the design points in this space, we can adopt a few observations. Moving towards the right, the number of sets for direct-mapped access increases or the number of sets for associative accesses decreases. In a HAC of constant size, for every increase in the number of sets allocated to associative accesses (moving to the left), there is a corresponding drop in the number of lines for DM access. Increasing the number of lines for associative access alleviates the conflict misses arising from direct-mapped accesses. Moving up and down the design space has the same meaning as in the previous HAC design space.

The various HAC designs are shown in the space. The design region of the victim cache is a vertical line where N2 is one (a victim cache is fully associative). The design region of the MRU cache is a vertical line where N1 = N2. If the total cache size is fixed, then as the design point for an MRU cache moves upwards, there is a corresponding decrease in the number of sets in both the direct-mapped and associative-access portions. All values of K are possible for an MRU cache; however, if the total number of lines in the cache is to be a power of 2, then K is restricted to the set {1, 3, 7, 15, ...}. (This is not an absolute requirement. One can have a 5-way MRU cache, just as one can have a 5-way set-associative cache.) The design point of a column-associative cache is again a horizontal line at K = 1, running from N1 = N2 to N2 = 0.

3.3 Exploring the Design Space

The preceding sections gave the characteristics of the HAC design space, and illustrated where existing designs fit into the space. For a primary cache of a given size (N1, 1, L), there are many choices for the size of the secondary cache (N2, K, L). A designer may explore the design space of Figure 5 to look for a good hybrid design which combines the benefits of fast access and associativity, assuming that at least part of the cache must be direct-mapped to get the benefits of single-cycle access. One can choose any powers of 2 for N1 and N2, provided N2 ≤ N1, and any number for K. If we write

    N1 = 2^r N2    (5)

where r is the log of the ratio between the number of sets in the primary and secondary caches, then the total number of lines N in the combined cache is given by

    N = (2^r + K) N2    (6)

Even with a reasonable upper bound for K (e.g., 16), this gives the chip architect a lot of flexibility in choosing an appropriate cache size. For instance, suppose that a chip floorplan has enough room for roughly 1300-1400 cache lines. If simple direct-mapped caches were used, a 2K-line cache would not fit, while the next smaller size, a 1K-line cache, would not use all the available space. One could build a 1K-line primary cache and a 256-line secondary cache, with enough space left over to hold the extra circuitry necessitated by the dual configuration. Of course, other configurations are possible.

This design space is still large. Does the HAC framework provide any insights into which regions of the space are most likely to yield the best caches? We may answer this question by considering some basic properties of caches and the way they are used in real applications.

One observation about real workloads is that as the cache size increases, the number of conflict misses goes down, so that associativity becomes less of a win. Larger caches can get by with lower rates of associativity. Therefore, as one increases the cache size, the region of greatest interest would tend to move toward the bottom of the design space in Figure 4.

Also, if the cache size is small, then capacity misses are a problem, and space is precious. Therefore, one would want to take most of what little data can be cached and keep it in the fast cache, so that at least the hits that do occur can occur quickly. In other words, a designer would want to focus attention on the left side of Figure 4. As cache size increases relative to the program working set, capacity misses are less problematic, so one can afford to devote more memory to the secondary cache to alleviate conflict misses, and one's attention would move to the right.

Combining these two trends, one would expect that for a small on-chip cache size, the best candidates for cache configurations would be in the upper left corner of Figure 4. In this region, we see the victim caches. The victim cache is a small, fully-associative cache containing perhaps 16 lines. Slight variations are possible, such as replacing the fully-associative cache with a 4-way set-associative cache in order to reduce costs. We call these "modified" victim caches. As cache size increases, the region of interest moves downward and to the right, so that one would expect something like a low-K MRU cache or a CA cache to best fit the bill.

For the intermediate range of cache sizes, the middle region of Figure 4 looks like the best region to explore. No previous cache designs have used this middle region of the space. We can try keeping the overall number of lines constant, dividing the cache into primary (fast-access) and secondary (slower-access) sections, then checking whether any design points fall within the region of interest. If the cache size remains constant, then N in equation 6 must be a power of two. There are only three sets of valid solutions for K and r:

1. If K = 0, this is the degenerate case of no secondary cache;
2. If K is non-zero and odd, then r = 0 and K = 2^i - 1, i ≥ 1;
3. If K is non-zero and even, then r ≥ 1 and K = 2^r.

The solutions where K is odd correspond to a hybrid cache consisting of a primary cache with N/2^i lines and a secondary cache with N/2^i lines and associativity K = 2^i - 1. Since the primary cache is accessed in one cycle and the secondary cache is accessed in two, this hybrid is functionally identical to the MRU cache of Chang, Chao, and So [3].
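To make the two non-degenerate solution families concrete, the sketch below (ours) takes an illustrative total line budget N and prints the resulting primary/secondary splits: the MRU-style solutions (r = 0, K = 2^i - 1) and the K = 2^r solutions that the next section develops into the half-and-half cache.

    #include <stdio.h>

    int main(void)
    {
        const int N = 1024;   /* total line budget (illustrative power of 2) */

        /* K odd: r = 0 and K = 2^i - 1 (MRU-style).  The primary and
         * secondary sections each hold N / 2^i and K * (N / 2^i) lines. */
        for (int i = 1; (1 << i) - 1 <= 16; i++) {
            int K  = (1 << i) - 1;
            int N1 = N >> i;                     /* N1 = N2 = N / 2^i        */
            printf("MRU-style:     K = %2d, primary lines = %4d, "
                   "secondary lines = %4d\n", K, N1, K * N1);
        }

        /* K even: K = 2^r.  Half the lines are direct-mapped (N1 = N/2) and
         * the other half form a 2^r-way associative section.               */
        for (int r = 1; (1 << r) <= 16; r++) {
            int K  = 1 << r;
            int N1 = N / 2;
            int N2 = N1 / K;                     /* (N/2) / 2^r sets         */
            printf("half-and-half: K = %2d, primary lines = %4d, "
                   "secondary lines = %4d\n", K, N1, K * N2);
        }
        return 0;
    }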

3.4 The Half-and-Half Cache

The solutions where K is odd have been implemented in real cache designs, but what about the other solutions, where K = 2^r? These seem to represent a new design option. If the total number of lines in the cache is N, then a cache corresponding to the solution K = 2^r has the following structure:

1. The primary cache contains N/2 lines (half of the total storage);
2. The secondary cache is 2^r-way set-associative with (N/2)/2^r sets, for a total of N/2 lines.

Because both the primary and secondary caches contain half of the lines, we call this the "half-and-half" cache.

The half-and-half cache superficially resembles the MRU cache (indeed, a half-and-half cache with r = 0 is identical to a 2-way MRU cache). However, there are several differences (when r > 0). A half-and-half cache may have more address conflicts than an MRU cache with the same associativity, because only half of the storage of the half-and-half is associative. However, the benefits of associativity in the MRU cache come at a price: as associativity increases, the size of the primary cache decreases, meaning that fewer and fewer lines can be accessed in one cycle. One would expect that increasing the associativity of the secondary cache portion of the MRU does not necessarily decrease the effective access time. In the half-and-half, on the other hand, half of the cache is always fast regardless of the degree of associativity. Thus, while the two may have similar average access times for small values of K, larger K values should, in most cases, improve the access time of the half-and-half relative to the MRU cache.

Given that the half-and-half cache occupies such points in the HAC design space, one would expect that increasing the associativity of the secondary cache would not penalize the performance. In fact, one would expect a half-and-half with a given associativity to perform better than one with a lower associativity. In the next section, these performance expectations are confirmed by experimental data.

4 Simulation Experiments

In order to validate our hypothesis, and compare the half-and-half cache to the other cache designs, we ran a series of trace-driven simulation experiments. We programmed a cache simulator to emulate twelve different cache designs with three different sizes, feeding it sixty-two traces. This section describes the cache comparisons, the simulations, the performance metrics, and the results.

4.1 Cache Designs Simulated

We simulated eleven hybrid cache designs in addition to a baseline direct-mapped cache. For each HAC, we assume that the direct-mapped cache is to be improved through the addition of associativity.

The baseline cache has one of three sizes: 8K, 32K, or 128K. These represent a range from an average value for present-day on-chip caches to what is likely to be possible in the future. All line sizes are 32 bytes. Thus, if we let N be the number of lines in the baseline cache, then N ∈ {256, 1024, 4096}. The caches are compared in Table 1. The table gives the parameters N1 and N2 in terms of N. The total number of lines in the cache is N1 + N2 K. All caches have a total of N, except for VCF and VC4 (acronyms explained below), which have a full-sized direct-mapped cache plus a 16-line victim cache.

    Param.   DM   VCF   VC4   RVF      RV4      MC2   MC4   MC8   CA†         HH2   HH4   HH8
    N1       N    N     N     N - 16   N - 16   N/2   N/4   N/8   N to N/2    N/2   N/2   N/2
    N2       -    1     4     1        4        N/2   N/4   N/8   N - N1      N/4   N/8   N/16
    K        -    16    4     16       4        1     3     7     1           2     4     8

    † The CA cache is somewhat different from the others, as explained in the list.

    Table 1: Size Parameters for Simulated Caches

The following describes the cache configurations:

DM: This is the baseline direct-mapped cache.

VCF: This represents a baseline direct-mapped cache with an additional victim cache. The victim cache is fully-associative and has 16 lines, regardless of the size of the primary cache.

VC4: This is a variation of VCF in which the secondary cache is structured as a 4-way set-associative cache with four sets. The total number of lines (16) is the same as in VCF, but the hardware costs will be lower.

RVF: This is a DM/victim cache combination in which the size of the direct-mapped section is reduced by the same number of lines as in the victim cache, so that the total number of lines is N. The set-selection function s is implemented using modulo arithmetic. This is not suggested as a practical cache design, but is included because the VCF and VC4 caches have more lines than the others. If these caches perform well, the improvement could be due merely to the increase in the storage capacity of the cache, rather than its organization, so the RVF makes the comparison more "fair" by modeling a cache with the same number of lines.

RV4: This is a 4-way reduced victim cache, the corresponding variant of VC4.

MC2, MC4, MC8: These are MRU caches with three different associativities. For example, an associativity of four implies that one line in each set is devoted to direct-mapped access (the primary cache) and the other three lines are used for associative accesses in the secondary cache (thus K = 3).

CA: This is a column-associative cache. As explained at the end of Section 3.1, the CA cache does not fit into the HAC design space at an exact point. We simulated this cache using the state diagrams from the original paper [2], except that we assume that a secondary hit (e.g., a hit after a rehash) requires only two cycles. (The published design for the column-associative cache has a secondary hit time of four cycles, but this is because it uses a low-cost implementation which assumes only one access path into the cache; thus, extra cycles are needed whenever it needs to swap data. However, the column-associative cache can be viewed as a specialized 2-way set-associative cache, so operations can complete in two cycles instead of four if the cache is augmented with hardware similar to that used by the half-and-half.)

HH2, HH4, HH8: These are half-and-half caches. The primary cache takes half of the lines, and the secondary cache is a set-associative cache with an associativity of 2, 4, or 8. (There is no HH1 because that would be equivalent to MC2.) It is assumed that all hits in the primary cache take one cycle and all hits in the secondary cache take two cycles.

For this study, we assumed a miss penalty of 20 cycles in eq. 4. (Of course, this value is arbitrary, and varying it can affect the differences between HAC's. For instance, the relative performance of a direct-mapped cache (with a higher miss rate but no secondary hits) would increase relative to associative caches (with more secondary hits but fewer misses) if the miss penalty were to drop. Nevertheless, this value is similar to that used in other cache comparisons, and represents a middle ground between high-end machines with fast secondary caches and low-end machines with on-chip caches only.)

We have chosen to limit this study to data caches only. We have chosen not to simulate instruction caches at this time, since we believe that associativity is less of an issue for instruction caches. Whether the half-and-half cache is beneficial for instruction or unified caches is a topic for further research. Moreover, we have limited ourselves to examining only the miss rates due to read accesses. The reason is that there are many write policies applicable in a HAC, and we decided not to examine these issues, so misses on writes are not included. Examining write policies in a HAC is another topic for research.
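For reference, the parameters of Table 1 can be instantiated directly; the sketch below (ours) lists the twelve simulated configurations for the 32K baseline, i.e., N = 1024 lines of 32 bytes. The CA entry is only nominal, since its primary/secondary split varies at run time, and the DM entry has no secondary section at all.

    /* The twelve simulated configurations at N = 1024 (32K cache, 32-byte
     * lines), with N1, N2, and K taken from Table 1. */
    struct cache_cfg { const char *name; int n1, n2, k; };

    static const struct cache_cfg cfg_32k[] = {
        { "DM",  1024,   0,  0 },  /* baseline direct-mapped                */
        { "VCF", 1024,   1, 16 },  /* plus a 16-line fully-assoc. victim    */
        { "VC4", 1024,   4,  4 },  /* plus a 16-line 4-way victim           */
        { "RVF", 1008,   1, 16 },  /* reduced DM section, total of N lines  */
        { "RV4", 1008,   4,  4 },
        { "MC2",  512, 512,  1 },  /* MRU caches                            */
        { "MC4",  256, 256,  3 },
        { "MC8",  128, 128,  7 },
        { "CA",   512, 512,  1 },  /* column-assoc.: split shown at the     */
                                   /* 50/50 extreme (it varies dynamically) */
        { "HH2",  512, 256,  2 },  /* half-and-half caches                  */
        { "HH4",  512, 128,  4 },
        { "HH8",  512,  64,  8 },
    };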

4.2 Trace Sources

In our simulations we used three groups of traces: single-process traces from SPEC89 benchmarks, synthetic multi-process traces from interleaved SPEC89 runs, and ATUM traces representing complete traces captured from a processor by special hardware [1].

The first set of traces (Table 2) were single-process traces generated from the SPEC89 benchmarks. These were run on standard input data as indicated in the table. Because gcc and eqntott create subprocesses, we only simulated the 19 calls to the C compiler in gcc, and the main process (after pre-processing) in eqntott. Two versions of eqntott were compiled: the standard version and a "fast" version in which the qsort routine supplied with the eqntott code was replaced with the qsort routine from the espresso benchmark. All programs were run to completion, in order to account for the variations in memory use patterns throughout a program's execution.

Table 2: Symbolic SPEC89 Benchmarks (per-input trace statistics for each benchmark run: operations × 10^6, percentage of loads, percentage of stores, and compulsory misses, with per-group and SPEC89 totals)

All simulations were performed on Sparc computers, and the benchmarks were compiled with the standard Sparc Fortran compiler version 2.0.1 or with the GNU C Compiler version 2.4.5. The benchmarks have been divided into three groups: the four symbolic benchmarks (SYMB), two numerical benchmarks (doduc and fpppp) which showed relatively good cache performance (GOOD), and the remaining numerical benchmarks (spice2g6, nasa7, matrix, and tomcatv) which had relatively poor cache performance (BAD). The total for each group includes every input run on that benchmark, and is shown in the table.

There were two major weaknesses with these traces. First, they only involved single processes, and did not measure the effects of multitasking. Second, our trace simulator was not able to trace within system calls, removing the effects of system calls from consideration.

To measure the effects of multitasking, we decided to generate a few extra traces by merging together several SPEC89 traces. Three multiprocess streams were generated (see Table 3). The first (SYM4) combined the four symbolic SPEC89 benchmarks: gcc (input insn-recog.i), espresso (input cps.in), li, and eqntott (original version). The second (FP4) interleaved four numerically-intensive SPEC89 benchmark programs: spice2g6, doduc (input ref), the GMT subsection of the nasa7 benchmark, and fpppp (NATOMS=4). The last (MIX6) is a mixture of numerical and symbolic code: gcc (input stmt.i), the EMI subsection of the nasa7 benchmark, eqntott (fast version), fpppp (NATOMS=8), espresso (input tial.in), and tomcatv (N=129). When forming these mixtures, we looked for benchmarks which did not substantially favor any particular cache organization in the single-process traces. We halted each trace after 100 million operations.

    Trace   Input                                         Ops (10^6)   % Loads   % Stores   Comp. miss.
    SYM4    gcc, espresso, li, eqntott                    400.0        20.3      6.1        87,451
    FP4     spice2g6, doduc, gmt, fpppp                   400.0        24.7      7.6        15,287
    MIX6    gcc, emi, eqntott, fpppp, espresso, tomcatv   600.0        24.1      5.9        169,495

    Table 3: Multitasking SPEC89 Benchmarks

Each stream was generated by assuming round-robin scheduling. For each of the three streams, at each of the three cache sizes, one multitrace was run in which each trace was run to completion before switching to another, i.e., there was no interleaving. Then five additional traces were run, in which the number of operations executed before switching to another trace was one of {1,000,000, 100,000, 10,000, 1,000, 100}. It is important to consider different levels of granularity, since as processor speeds continue to increase, there can be a concomitant increase in the number of instructions executed before doing a process switch. Moreover, simulating different levels of granularity can provide insights into the cache performance of systems which are interrupted more or less frequently.

The multitasking traces still have the problem of lacking system calls, so we used ATUM traces [1] for our third series (see Table 4). These traces are based on actual VAX execution traces, and include system calls. Some of them are also multitasked. However, not as much weight should be placed on these traces, since they were generated from a CISC processor (VAX), and may not accurately mimic access patterns found on today's prevalent RISC processors. (Of course, a CISC designer may also wish to use the HAC framework to design an on-chip cache.) These traces were included primarily to show the possible effects of system calls.
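The multiprocess streams above can be produced by a straightforward round-robin merge of the per-process reference streams at a chosen granularity; a self-contained sketch of that merge (ours, operating on in-memory arrays rather than the actual trace files):

    #include <stddef.h>
    #include <stdint.h>

    /* Round-robin merge of up to 16 per-process reference streams into one
     * multitasking trace, switching processes every 'granularity' operations
     * (e.g., 1,000,000 down to 100, as in the experiments above). */
    static size_t merge_round_robin(const uint32_t *const traces[],
                                    const size_t lengths[], int nproc,
                                    size_t granularity,
                                    uint32_t *out, size_t out_cap)
    {
        size_t pos[16] = { 0 };   /* per-process read positions */
        size_t emitted = 0;
        int live = nproc;

        while (live > 0 && emitted < out_cap) {
            live = 0;
            for (int p = 0; p < nproc; p++) {
                size_t burst = 0;
                while (pos[p] < lengths[p] && burst < granularity &&
                       emitted < out_cap) {
                    out[emitted++] = traces[p][pos[p]++];
                    burst++;
                }
                if (pos[p] < lengths[p])
                    live++;
            }
        }
        return emitted;   /* number of references written to 'out' */
    }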

Input    Total ops   % Loads   % Stores   Comp. misses
dec0     178,959     58.2      40.5       2,356
fora     188,135     56.5      42.1       2,647
forf     177,297     58.9      39.1       3,561
fsxzz    116,105     65.9      32.6       1,726
ivex     138,458     67.6      29.7       3,806
linp     202,426     89.8      9.5        1,528
lisp     121,604     80.8      18.5       841
macr     154,126     60.8      37.1       3,121
memxx    225,799     55.5      43.9       1,430
mul2     163,670     55.8      43.3       1,455
mul8     229,977     60.1      38.6       3,009
pasc     229,065     53.5      46.0       1,222
savec    88,877      81.5      17.6       772
spic     222,995     60.7      38.9       991
ue02     157,837     60.2      37.7       3,298
TOTAL    2,595,330   63.1      35.6       31,763

Table 4: ATUM Traces
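The round-robin merging described above can be sketched in a few lines of C. This is only an illustration of the method, not our actual trace tools: the trace file format, the next_ref() reader, and the command-line handling are all assumptions made for the example.

```c
/* Minimal sketch of round-robin trace interleaving (illustrative only). */
#include <stdio.h>
#include <stdlib.h>

typedef struct { FILE *fp; int live; } Trace;

/* Hypothetical reader: one hexadecimal address per line of the trace file. */
static int next_ref(Trace *t, unsigned long *addr)
{
    return fscanf(t->fp, "%lx", addr) == 1;
}

/* Emit a merged stream, switching traces every `quantum` operations in
 * round-robin order, until every trace is exhausted. */
static void interleave(Trace *t, int ntraces, long quantum)
{
    int alive = 0, cur = 0, i;
    long n;
    unsigned long addr;

    for (i = 0; i < ntraces; i++)
        alive += t[i].live;
    while (alive > 0) {
        if (t[cur].live) {
            for (n = 0; n < quantum; n++) {
                if (!next_ref(&t[cur], &addr)) {
                    t[cur].live = 0;
                    alive--;
                    break;
                }
                printf("%d %lx\n", cur, addr);   /* process id, address */
            }
        }
        cur = (cur + 1) % ntraces;               /* round-robin switch */
    }
}

int main(int argc, char **argv)
{
    /* Hypothetical usage: interleave <quantum> <trace1> <trace2> ... */
    int n = argc - 2, i;
    Trace *t;

    if (n < 1)
        return 1;
    t = malloc(n * sizeof *t);
    for (i = 0; i < n; i++) {
        t[i].fp = fopen(argv[i + 2], "r");
        t[i].live = (t[i].fp != NULL);
    }
    interleave(t, n, atol(argv[1]));
    return 0;
}
```

Setting the quantum to 1,000,000, 100,000, and so on reproduces the different context-switch granularities simulated here.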

4.3 Results

Results are shown in the graphs in Figures 6-12. We use the average cache access time, as defined in equation 4 of Section 2.4, as our primary metric for comparing caches. The first three figures show the results for the SPEC89 benchmarks. Due to space limitations, we combined the results for all inputs on a particular benchmark into one set of cost values. The results were combined by adding the counts for each type of hit or miss (as if the different cases were run in sequence and the cache were flushed between jobs). The two different versions of eqntott were combined as well. Figures 6-8 also show the aggregate results for the three categories of SPEC89 benchmarks, the aggregate results for the entire SPEC89 suite, and the combined results for the ATUM traces.

A few observations can be made by comparing the caches at the different sizes. Figure 6 shows the greatest performance variations between hybrid caches. These results demonstrate the benefits of keeping most of the cache lines in the fast-access section. For the SPEC89 traces, the MRU caches have the worst performance, especially those with high associativities (at the right side of the design space). Victim caches and half-and-half caches perform much better. For the ATUM traces, however, high associativity clearly improves cache performance; even the MC8 performs reasonably well. Thus, it would seem that adding associativity can help to improve cache performance, even when it comes at a higher hit cost.

It is apparent from the graphs that associativity is less of a benefit for large caches; this is a well-known observation. For large caches, column-associativity often produces equal or better results, which shows the advantage of allowing the cache to become effectively direct-mapped when it is large enough to eliminate most conflicts. Other low-associativity caches such as the MC2 and the HH2 also perform well.
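As a concrete sketch of how the combined cost values above are formed, the fragment below sums per-input hit and miss counts and converts them into an average access time. The cycle costs used (1 for a primary hit, 2 for a secondary hit, a flat miss penalty) and the sample counts are assumptions made for the example; the figures in this paper use equation 4 of Section 2.4.

```c
/* Sketch of combining per-input simulation counts into one cost value.
 * Cycle costs here are assumptions standing in for equation 4, Sec. 2.4. */
#include <stdio.h>

typedef struct {
    unsigned long prim_hits;   /* hits in the fast (direct-mapped) section */
    unsigned long sec_hits;    /* hits in the slower associative section   */
    unsigned long misses;
} Counts;

/* Combine several runs as if executed in sequence with the cache flushed
 * between jobs: just add the counts of each access class. */
static Counts combine(const Counts *runs, int n)
{
    Counts total = {0, 0, 0};
    int i;
    for (i = 0; i < n; i++) {
        total.prim_hits += runs[i].prim_hits;
        total.sec_hits  += runs[i].sec_hits;
        total.misses    += runs[i].misses;
    }
    return total;
}

/* Assumed costs: 1 cycle per primary hit, 2 per secondary hit, plus a
 * flat miss penalty supplied by the caller. */
static double avg_access_time(Counts c, double miss_penalty)
{
    double refs = (double)(c.prim_hits + c.sec_hits + c.misses);
    return (c.prim_hits * 1.0 + c.sec_hits * 2.0 + c.misses * miss_penalty)
           / refs;
}

int main(void)
{
    /* Two made-up runs of the same benchmark with different inputs. */
    Counts runs[2] = { {900000, 50000, 50000}, {450000, 30000, 20000} };
    printf("combined average access time: %.3f cycles\n",
           avg_access_time(combine(runs, 2), 20.0));
    return 0;
}
```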

[Figure: average access cost (in cycles) of the DM, VCF, VC4, RVF, RV4, MC2, MC4, MC8, CA, HH2, HH4, and HH8 caches for gcc, espresso, xlisp, eqntott, SYMB, doduc, fpppp, GOOD, spice, nasa7, matrix, tomcatv, BAD, SPEC89, and ATUM.]

Figure 6: Combined Results for 8K Cache

[Figure: the same comparison for 32K caches.]

Figure 7: Combined Results for 32K Cache

[Figure: the same comparison for 128K caches.]

Figure 8: Combined Results for 128K Cache

[Figure: average access cost of each cache design for the SYM4, FP4, and MIX6 traces, run without interleaving, at 8K, 32K, and 128K.]

Figure 9: Baseline Results for Multitasking Traces

For intermediate cache sizes, how much associativity is needed? In most cases, adding more associativity seems to improve performance, since HH8 usually performs best. However, the improvement from K = 2 to K = 8 is usually slight, and in the traces with large differences, most of the improvement comes from increasing K from 2 to 4. Unlike the half-and-half cache, the MRU cache usually does not benefit from greater associativity, even though the tables in the appendix show miss rates declining with increasing K. This is because the benefits of greater associativity are usually canceled by the extra costs of secondary hits: as K increases, the amount of the cache dedicated to fast accesses decreases.

Reducing the size of the direct-mapped cache, as in the reduced victim caches (RVF and RV4), generally increases the average access cost, as expected, especially for smaller caches. There are some anomalies, particularly in the larger caches, in which the reduced caches have better performance. This shows that the data access patterns of a particular program have an important impact on cache performance.
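To make the secondary-hit trade-off noted above concrete, the toy calculation below compares two hypothetical breakdowns of reads under assumed cycle costs (1 cycle for a primary hit, 2 for a secondary hit, 20 for a miss). The percentages and the penalty are invented for illustration and are not taken from our simulations; they simply show how a small miss-rate gain can be outweighed when fewer reads hit the fast section.

```c
/* Toy illustration: a larger associative (secondary) share can cancel a
 * small miss-rate gain.  All numbers below are invented for the example. */
#include <stdio.h>

/* Average read cost for a breakdown of reads given in percent. */
static double cost(double prim, double sec, double miss, double penalty)
{
    return (prim * 1.0 + sec * 2.0 + miss * penalty) / 100.0;
}

int main(void)
{
    /* Mostly-fast split: 90% primary hits, 5% secondary hits, 5% misses. */
    printf("mostly fast        : %.2f cycles\n", cost(90.0, 5.0, 5.0, 20.0));
    /* Mostly-associative split: a slightly lower miss rate (4%), but only
     * 70% of reads hit the fast section, so the average ends up higher.  */
    printf("mostly associative : %.2f cycles\n", cost(70.0, 26.0, 4.0, 20.0));
    return 0;
}
```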

[Figure: increase in average read cost over the non-interleaved baseline for each cache design on SYM4, FP4, and MIX6, at context-switch intervals of 1M, 100K, 10K, 1K, and 100 operations.]

Figure 10: Multitasking Results for 8K Cache

[Figure: the same multitasking comparison for 32K caches.]

Figure 11: Multitasking Results for 32K Cache

[Figure: the same multitasking comparison for 128K caches.]

Figure 12: Multitasking Results for 128K Cache

In each of the VC and RV cache sets, there is very little difference in performance between the fully-associative victim cache and the set-associative victim cache, suggesting that a fully-associative victim cache is probably not worth the extra hardware cost.

Figure 9 shows the results of simulating the non-interleaved versions of the multitasking traces. Figures 10-12 show what happens when context-switching is introduced. These figures show the relative performance of each cache under multitasking: each bar shows the difference between the average read cost for the interleaved trace and for the non-interleaved trace on the same cache.

5 Conclusions

In this paper, we have proposed a unified framework for the design of Hybrid Access Caches (HACs) as on-chip caches. Under our model, a rich design space is structured in a way which helps cache designers systematically explore, experiment with, and compare various combinations of hybrid cache designs in order to meet performance and cost requirements. Using this framework, existing hybrid access caches can be viewed as points within the design space, and much insight can be obtained by comparing their tradeoffs in that space.

By exploring this framework, we conceived a novel HAC organization called the Half-and-Half cache. It achieves a good balance between the speed of direct-mapped caches and the tolerance of access interference offered by set-associative caches. We presented simulation results which illustrate the use of this framework, and the results confirm that the half-and-half cache performs better than other hybrid access cache designs for many applications.

A Experimental Data

This appendix gives tabular results for the cache simulations. Tables are divided into groups of three, each group containing one table for each cache size ({8K, 32K, 128K}). Each table lists, for each cache design, the percentages of reads which are primary hits and misses. (A dash means there were no misses; an entry of 0.000 means there were misses, but fewer than 0.0005%.) The percentage of secondary hits can be found by subtracting these two percentages from 100%. Each table also lists the actual cost values, and underlines the lowest value found.

These tables show one additional cache design which was left out of Section 4 for space reasons. The simulations included a fully associative (FA) cache. To be consistent with our model, we represented the FA cache as a 1-line primary cache and an (N-1)-line secondary cache.

Input

gcc: insn-emit gcc: insn-recog gcc: TOTAL espresso: bca espresso: cps espresso: ti espresso: tial xlisp: (queens 9) eqntott: int3 eqn-fast: int3

Rate Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost

FA 26.802 2.910 2.285 29.688 5.768 2.799 27.589 4.694 2.616 8.065 7.659 3.375 18.946 3.632 2.501 27.171 3.551 2.403 17.818 2.521 2.301 34.598 2.563 2.141 3.626 5.557 3.020 12.821 3.313 2.501

DM 93.014 6.986 2.397 90.570 9.430 2.886 90.991 9.009 2.802 91.710 8.290 2.658 94.841 5.159 2.032 95.199 4.801 1.960 96.105 3.895 1.779 94.014 5.986 2.197 93.674 6.326 2.265 95.664 4.336 1.867

VCF 93.014 4.453 1.916 90.570 7.070 2.438 90.991 6.310 2.289 91.710 7.376 2.484 94.841 3.715 1.757 95.199 3.495 1.712 96.105 2.239 1.464 94.014 3.152 1.659 93.674 5.567 2.121 95.664 3.417 1.693

VC4 93.014 4.547 1.934 90.570 7.130 2.449 90.991 6.382 2.303 91.710 7.379 2.485 94.841 3.720 1.758 95.199 3.512 1.715 96.105 2.254 1.467 94.014 3.176 1.663 93.674 5.567 2.121 95.664 3.421 1.693

RVF 92.626 4.582 1.944 90.686 6.996 2.422 90.937 6.383 2.303 91.591 7.499 2.509 94.898 3.903 1.793 94.297 3.584 1.738 96.526 2.241 1.461 95.324 3.366 1.686 93.500 5.596 2.128 95.397 3.550 1.721

RV4 92.626 4.632 1.954 90.686 7.050 2.433 90.937 6.472 2.320 91.591 7.500 2.509 94.898 3.908 1.793 94.297 3.596 1.740 96.526 2.248 1.462 95.324 3.379 1.689 93.500 5.597 2.128 95.397 3.555 1.722

MC2 88.784 4.521 1.971 86.695 6.949 2.453 86.499 6.288 2.330 89.597 7.550 2.538 93.099 3.795 1.790 93.365 3.700 1.769 93.734 2.536 1.545 91.320 3.221 1.699 92.787 5.598 2.136 94.381 3.470 1.715

MC4 82.922 3.442 1.825 81.622 6.120 2.347 81.068 5.257 2.188 86.429 7.447 2.551 90.242 3.647 1.790 90.418 3.504 1.762 90.937 2.532 1.572 85.779 2.859 1.685 91.211 5.564 2.145 90.373 3.343 1.731

MC8 75.028 3.154 1.849 75.435 5.891 2.365 73.960 4.981 2.207 81.836 7.493 2.605 86.309 3.592 1.819 84.340 3.490 1.820 86.002 2.528 1.620 82.084 2.707 1.694 87.727 5.562 2.180 85.039 3.334 1.783

CA 92.015 4.652 1.964 89.414 7.111 2.457 89.772 6.470 2.332 91.334 7.444 2.501 94.547 3.761 1.769 94.750 3.609 1.738 95.473 2.342 1.490 93.407 3.324 1.698 93.522 5.674 2.143 95.400 3.510 1.713

HH2 88.784 3.613 1.799 86.695 6.221 2.315 86.499 5.402 2.161 89.597 7.371 2.505 93.099 3.616 1.756 93.365 3.461 1.724 93.734 2.405 1.520 91.320 2.867 1.631 92.787 5.571 2.131 94.381 3.367 1.696

HH4 88.784 3.295 1.738 86.695 5.940 2.262 86.499 5.082 2.101 89.597 7.351 2.501 93.099 3.533 1.740 93.365 3.405 1.713 93.734 2.444 1.527 91.320 2.739 1.607 92.787 5.572 2.131 94.381 3.355 1.694

HH8 88.784 3.156 1.712 86.695 5.855 2.245 86.499 4.949 2.075 89.597 7.378 2.506 93.099 3.514 1.737 93.365 3.404 1.713 93.734 2.453 1.529 91.320 2.673 1.595 92.787 5.573 2.131 94.381 3.352 1.693

HH2 95.484 0.836 1.204 93.113 2.978 1.635 93.880 1.970 1.436 92.981 5.948 2.200 97.227 0.271 1.079 97.416 0.617 1.143 97.563 0.016 1.027 95.307 0.973 1.232 94.232 5.031 2.014 96.516 2.265 1.465

HH4 95.484 0.690 1.176 93.113 2.807 1.602 93.880 1.844 1.411 92.981 5.940 2.199 97.227 0.286 1.082 97.416 0.635 1.146 97.563 0.014 1.027 95.307 1.035 1.244 94.232 5.031 2.014 96.516 2.257 1.464

HH8 95.484 0.592 1.158 93.113 2.769 1.595 93.880 1.806 1.404 92.981 5.933 2.197 97.227 0.316 1.088 97.416 0.623 1.144 97.563 0.014 1.027 95.307 1.082 1.253 94.232 5.037 2.015 96.516 2.255 1.463

Table 5: Data for Symbolic Benchmarks on 8K Caches

Input

gcc: insn-emit gcc: insn-recog gcc: TOTAL espresso: bca espresso: cps espresso: ti espresso: tial xlisp: (queens 9) eqntott: int3 eqn-fast: int3

Rate Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost

FA 27.242 0.426 1.808 29.605 2.838 2.243 27.427 1.837 2.075 8.065 5.929 3.046 18.946 0.674 1.939 27.171 0.686 1.859 17.818 0.013 1.824 34.598 1.071 1.858 3.626 5.027 2.919 12.821 2.231 2.296

DM 97.702 2.298 1.460 95.118 4.882 1.976 96.295 3.705 1.741 94.904 5.096 2.019 99.155 0.845 1.169 98.615 1.385 1.277 99.069 0.931 1.186 98.476 1.524 1.305 94.754 5.246 2.049 97.195 2.805 1.561

VCF 97.702 1.491 1.306 95.118 4.094 1.827 96.295 2.815 1.572 94.904 4.873 1.977 99.155 0.423 1.089 98.615 1.125 1.228 99.069 0.781 1.158 98.476 1.082 1.221 94.754 5.055 2.013 97.195 2.469 1.497

VC4 97.702 1.519 1.312 95.118 4.104 1.829 96.295 2.840 1.577 94.904 4.876 1.977 99.155 0.427 1.090 98.615 1.127 1.228 99.069 0.783 1.158 98.476 1.087 1.222 94.754 5.055 2.013 97.195 2.469 1.497

RVF 97.442 1.607 1.331 95.464 3.731 1.754 96.249 2.758 1.562 94.746 4.997 2.002 99.292 0.416 1.086 98.558 1.172 1.237 98.927 0.841 1.170 97.983 1.610 1.326 94.715 5.076 2.017 97.253 2.487 1.500

RV4 97.442 1.627 1.335 95.464 3.751 1.758 96.249 2.780 1.566 94.746 4.997 2.002 99.292 0.419 1.087 98.558 1.172 1.237 98.927 0.840 1.170 97.983 1.619 1.328 94.715 5.075 2.017 97.253 2.487 1.500

MC2 95.484 1.049 1.244 93.113 3.255 1.687 93.880 2.256 1.490 92.981 5.970 2.204 97.227 0.307 1.086 97.416 0.690 1.157 97.563 0.123 1.048 95.307 0.984 1.234 94.232 5.037 2.015 96.516 2.322 1.476

MC4 92.959 0.858 1.234 90.573 2.997 1.664 91.007 1.985 1.467 91.710 5.949 2.213 94.841 0.275 1.104 95.199 0.667 1.175 96.105 0.015 1.042 94.014 0.938 1.238 93.674 5.024 2.018 95.664 2.248 1.470

MC8 88.753 0.707 1.247 86.716 2.820 1.669 86.530 1.873 1.491 89.597 5.941 2.233 93.099 0.308 1.128 93.365 0.700 1.199 93.734 0.013 1.065 91.320 1.008 1.278 92.787 5.022 2.026 94.381 2.235 1.481

CA 97.353 1.080 1.232 94.714 3.274 1.675 95.832 2.278 1.475 93.656 4.875 1.990 99.064 0.314 1.069 98.391 0.659 1.141 99.023 0.106 1.030 97.998 1.145 1.238 94.548 5.065 2.017 96.977 2.341 1.475

Table 6: Data for Symbolic Benchmarks on 32K Caches

Input

gcc: insn-emit gcc: insn-recog gcc: TOTAL espresso: bca espresso: cps espresso: ti espresso: tial xlisp: (queens 9) eqntott: int3 eqn-fast: int3

Rate Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost

FA 27.242 0.018 1.731 29.605 1.348 1.960 27.427 0.347 1.792 8.065 0.004 1.920 18.946 0.009 1.812 27.172 0.078 1.743 17.818 | 1.822 34.598 | 1.654 3.626 3.493 2.627 12.821 1.100 2.081

DM 99.618 0.382 1.076 98.061 1.939 1.388 99.053 0.947 1.189 99.182 0.818 1.164 99.972 0.028 1.006 99.861 0.139 1.028 100.000 0.000 1.000 99.586 0.414 1.083 96.376 3.624 1.725 98.609 1.391 1.278

VCF 99.618 0.251 1.051 98.061 1.760 1.354 99.053 0.747 1.151 99.182 0.787 1.158 99.972 0.020 1.004 99.861 0.110 1.022 100.000 0.000 1.000 99.586 0.323 1.066 96.376 3.580 1.716 98.609 1.336 1.268

VC4 99.618 0.251 1.051 98.061 1.773 1.356 99.053 0.755 1.153 99.182 0.787 1.158 99.972 0.020 1.004 99.861 0.110 1.022 100.000 0.000 1.000 99.586 0.323 1.065 96.376 3.579 1.716 98.609 1.336 1.268

RVF 99.292 0.380 1.079 97.894 1.715 1.347 98.739 0.835 1.171 99.169 0.791 1.159 99.973 0.021 1.004 99.869 0.109 1.022 99.999 0.000 1.000 100.000 | 1.000 96.417 3.540 1.709 98.612 1.330 1.267

RV4 99.292 0.388 1.081 97.894 1.714 1.347 98.739 0.842 1.173 99.169 0.791 1.159 99.973 0.021 1.004 99.869 0.109 1.022 99.999 0.000 1.000 100.000 | 1.000 96.417 3.540 1.708 98.612 1.330 1.267

MC2 98.881 0.105 1.031 96.709 1.418 1.302 97.897 0.513 1.119 99.006 0.015 1.013 99.541 0.014 1.007 99.317 0.088 1.024 99.987 0.000 1.000 99.480 0.000 1.005 95.395 3.513 1.714 97.977 1.205 1.249

MC4 97.702 0.028 1.028 95.118 1.335 1.302 96.295 0.393 1.112 94.904 0.004 1.052 99.155 0.008 1.010 98.641 0.079 1.029 99.069 | 1.009 98.476 0.000 1.015 94.754 3.506 1.719 97.195 1.139 1.244

MC8 95.484 0.019 1.049 93.113 1.300 1.316 93.880 0.353 1.128 92.981 0.004 1.071 97.227 0.008 1.029 97.431 0.078 1.040 97.563 | 1.024 95.307 | 1.047 94.232 3.492 1.721 96.516 1.100 1.244

CA 99.571 0.112 1.025 97.774 1.383 1.285 98.895 0.523 1.110 99.174 0.015 1.011 99.962 0.019 1.004 99.842 0.093 1.019 100.000 0.000 1.000 99.586 0.000 1.004 95.906 3.451 1.697 98.422 1.225 1.249

HH2 98.881 0.031 1.017 96.709 1.276 1.275 97.897 0.391 1.095 99.006 0.004 1.011 99.541 0.007 1.006 99.317 0.080 1.022 99.987 | 1.000 99.480 0.000 1.005 95.395 3.464 1.704 97.977 1.147 1.238

HH4 98.881 0.021 1.015 96.709 1.230 1.267 97.897 0.348 1.087 99.006 0.003 1.011 99.541 0.008 1.006 99.317 0.078 1.022 99.987 | 1.000 99.480 | 1.005 95.395 3.419 1.696 97.977 1.110 1.231

HH8 98.881 0.019 1.015 96.709 1.221 1.265 97.897 0.335 1.085 99.006 0.003 1.011 99.541 0.008 1.006 99.317 0.077 1.022 99.987 | 1.000 99.480 | 1.005 95.395 3.402 1.692 97.977 1.103 1.230

Table 7: Data for Symbolic Benchmarks on 128K Caches

Because so little of the FA cache is primary, the primary hit rates tend to be low, and the access costs correspondingly high, since most successful reads cost two cycles. We included this simulation because, when comparing caches with different levels of associativity for a given benchmark, it is useful to see how many hits occur when there are no conflict misses. In a few cases, the FA cache performs worse than the DM, generally in problems with large working sets that have poor cache performance due to capacity misses.

Because there were so many runs of gcc, we decided to leave out most of the data for the individual input cases and simply present the aggregate figure. However, we show the individual results for two of the cases, insn-emit and insn-recog, because these produced, for all cache sizes, the lowest and highest best-cache access cost, respectively.

In the final three tables, we list the results for the multitasking simulations for all six context-switching times. In each group, the first row of data (100,000,000) is the baseline non-interleaved case (see Figure 9).


Input

Rate Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost GOOD: Prim. Misses Cost BAD: Prim. Misses Cost SPEC89: Prim. Misses Cost

doduc: tiny doduc: small doduc: ref fpppp: 4 atoms fpppp: 8 atoms spice2g6: short nasa7: mxm nasa7: t nasa7: cho nasa7: btr nasa7: gmt nasa7: emi nasa7: vpe matrix: 300 tomcatv: N=33 tomcatv: N=65 tomcatv: N=129 tomcatv: N=257 SYMB:

FA 27.784 3.263 2.342 27.978 3.335 2.354 28.020 3.358 2.358 7.283 0.912 2.100 6.769 0.714 2.068 10.311 24.086 6.473 10.497 25.240 6.691 23.831 45.556 10.417 12.658 56.382 12.586 7.659 31.984 8.000 6.641 55.423 12.464 42.361 5.633 2.647 27.361 50.608 11.342 3.239 59.329 13.240 11.224 6.777 3.175 10.777 6.927 3.208 10.584 8.119 3.437 11.933 10.376 3.852 26.082 3.400 2.385 14.238 1.649 2.171 12.498 35.356 8.593 17.004 20.075 5.644

DM 90.655 9.345 2.869 90.560 9.440 2.888 90.554 9.446 2.889 96.134 3.866 1.773 96.797 3.203 1.641 70.064 29.936 6.987 88.953 11.047 3.209 37.019 62.981 13.596 45.475 54.525 11.905 66.109 33.891 7.778 45.300 54.700 11.940 93.122 6.878 2.376 43.649 56.351 12.270 50.211 49.789 10.958 86.358 13.642 3.728 85.884 14.116 3.823 85.835 14.165 3.833 74.770 25.230 6.046 94.044 5.956 2.191 94.576 5.424 2.085 62.236 37.764 8.553 77.254 22.746 5.549

VCF 90.655 5.376 2.115 90.560 5.508 2.141 90.554 5.517 2.143 96.134 2.834 1.577 96.797 2.327 1.474 70.064 25.974 6.234 88.953 6.208 2.290 37.019 45.469 10.269 45.475 51.728 11.374 66.109 32.442 7.503 45.300 53.946 11.797 93.122 5.980 2.205 43.649 50.531 11.164 50.211 48.541 10.721 86.358 7.039 2.474 85.884 7.398 2.547 85.835 7.410 2.549 74.770 9.202 3.001 94.044 3.782 1.778 94.576 3.469 1.713 62.236 32.288 7.512 77.254 18.854 4.810

VC4 90.655 5.456 2.130 90.560 5.564 2.152 90.554 5.600 2.158 96.134 2.727 1.557 96.797 2.239 1.458 70.064 25.992 6.238 88.953 6.212 2.291 37.019 45.469 10.269 45.475 51.837 11.394 66.109 32.409 7.497 45.300 53.942 11.796 93.122 5.972 2.203 43.649 50.910 11.236 50.211 48.561 10.725 86.358 7.035 2.473 85.884 7.399 2.547 85.835 7.410 2.550 74.770 9.217 3.004 94.044 3.803 1.782 94.576 3.439 1.708 62.236 32.329 7.520 77.254 18.878 4.814

RVF 91.909 4.979 2.027 91.747 5.097 2.051 91.889 4.868 2.006 90.966 5.246 2.087 92.241 3.930 1.824 69.629 26.475 6.334 86.882 11.971 3.406 54.249 45.545 10.111 45.021 54.790 11.960 68.076 30.636 7.140 41.916 56.255 12.269 93.022 6.006 2.211 49.216 50.535 11.109 48.026 51.573 11.319 92.229 6.829 2.375 91.591 7.398 2.490 87.691 7.789 2.603 89.012 10.079 3.025 94.685 3.928 1.799 92.044 4.340 1.904 64.880 33.605 7.736 78.457 19.736 4.965

RV4 91.909 5.012 2.033 91.747 5.142 2.059 91.889 5.032 2.037 90.966 5.162 2.071 92.241 3.921 1.823 69.629 26.467 6.332 86.882 11.990 3.409 54.249 45.545 10.111 45.021 54.780 11.958 68.076 30.658 7.144 41.916 56.255 12.269 93.022 5.992 2.208 49.216 50.644 11.130 48.026 51.580 11.320 92.229 6.830 2.375 91.591 7.397 2.489 87.691 7.790 2.603 89.012 10.079 3.025 94.685 3.943 1.802 92.044 4.380 1.912 64.880 33.614 7.738 78.457 19.752 4.968

MC2 86.019 5.502 2.185 85.900 5.653 2.215 85.868 5.622 2.210 90.967 2.421 1.550 91.765 2.082 1.478 58.960 26.739 6.491 15.603 16.033 4.890 33.481 45.523 10.315 42.235 55.378 12.099 50.671 29.441 7.087 42.975 54.527 11.930 91.838 5.706 2.166 43.541 53.118 11.657 38.222 54.727 12.016 85.126 7.075 2.493 84.988 7.247 2.527 73.710 7.535 2.695 48.651 25.622 6.382 91.601 3.882 1.821 89.660 3.340 1.738 47.240 36.121 8.391 67.762 20.899 5.293

MC4 79.941 3.980 1.957 79.773 4.099 1.981 79.764 4.127 1.987 80.816 1.669 1.509 81.849 1.386 1.445 50.479 24.718 6.192 15.565 25.237 6.639 29.253 45.459 10.345 41.016 56.181 12.264 44.450 30.318 7.316 40.971 55.092 12.058 83.646 5.679 2.242 43.355 52.126 11.470 36.649 59.336 12.907 83.329 6.767 2.453 72.486 6.929 2.592 47.267 7.577 2.967 46.116 24.143 6.126 87.250 3.585 1.809 81.067 2.358 1.637 42.733 36.959 8.595 62.657 21.095 5.381

MC8 71.712 3.334 1.916 71.481 3.451 1.941 71.436 3.418 1.935 70.368 1.351 1.553 71.540 1.094 1.492 44.042 24.020 6.123 15.564 25.240 6.640 28.865 45.441 10.345 40.346 56.317 12.297 41.459 32.745 7.807 37.966 55.265 12.121 80.959 5.645 2.263 43.167 50.653 11.192 35.994 59.329 12.913 70.175 6.764 2.583 46.281 6.952 2.858 44.566 8.314 3.134 45.342 10.375 3.518 82.938 3.483 1.832 71.448 1.923 1.651 40.257 35.404 8.324 58.480 20.170 5.247

CA 89.457 5.780 2.204 89.357 5.921 2.231 89.347 5.900 2.228 95.276 2.530 1.528 96.035 2.197 1.457 65.833 26.367 6.351 81.477 16.061 4.237 36.883 45.597 10.295 42.973 55.152 12.049 59.849 29.150 6.940 44.158 54.155 11.848 92.776 6.013 2.215 43.556 53.151 11.663 43.140 52.075 11.463 85.906 7.099 2.490 85.747 7.270 2.524 85.156 7.973 2.663 71.343 24.486 5.939 93.481 3.921 1.810 93.653 3.512 1.731 58.539 35.535 8.166 74.972 20.627 5.169

HH2 86.019 4.186 1.935 85.900 4.308 1.960 85.868 4.328 1.964 90.967 1.655 1.405 91.765 1.354 1.340 58.960 24.650 6.094 15.603 20.596 5.757 33.481 45.425 10.296 42.235 55.518 12.126 50.671 28.970 6.998 42.975 54.720 11.967 91.838 5.738 2.172 43.541 52.937 11.623 38.222 55.948 12.248 85.126 6.831 2.447 84.988 6.924 2.466 73.710 7.600 2.707 48.651 23.962 6.066 91.601 3.577 1.764 89.660 2.409 1.561 47.240 35.977 8.363 67.762 20.580 5.233

HH4 86.019 3.485 1.802 85.900 3.591 1.823 85.868 3.604 1.826 90.967 1.297 1.337 91.765 1.042 1.280 58.960 23.582 5.891 15.603 15.609 4.810 33.481 45.419 10.295 42.235 55.773 12.174 50.671 29.325 7.065 42.975 54.807 11.984 91.838 5.716 2.168 43.541 50.969 11.249 38.222 56.161 12.288 85.126 6.831 2.447 84.988 6.936 2.468 73.710 7.911 2.766 48.651 11.952 3.784 91.601 3.482 1.746 89.660 1.951 1.474 47.240 33.949 7.978 67.762 19.402 5.009

HH8 86.019 3.295 1.766 85.900 3.375 1.782 85.868 3.408 1.789 90.967 1.199 1.318 91.765 0.956 1.264 58.960 23.404 5.857 15.603 15.590 4.806 33.481 45.419 10.295 42.235 55.845 12.188 50.671 26.878 6.600 42.975 54.836 11.989 91.838 5.720 2.168 43.541 50.562 11.171 38.222 56.350 12.324 85.126 6.828 2.446 84.988 6.928 2.466 73.710 8.037 2.790 48.651 9.604 3.338 91.601 3.439 1.737 89.660 1.825 1.450 47.240 33.394 7.872 67.762 19.075 4.947

Table 8: Data for Floating-Point Benchmarks and SPEC89 Aggregates on 8K Caches

Input

Rate Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost GOOD: Prim. Misses Cost BAD: Prim. Misses Cost SPEC89: Prim. Misses Cost

doduc: tiny doduc: small doduc: ref fpppp: 4 atoms fpppp: 8 atoms spice2g6: short nasa7: mxm nasa7: t nasa7: cho nasa7: btr nasa7: gmt nasa7: emi nasa7: vpe matrix: 300 tomcatv: N=33 tomcatv: N=65 tomcatv: N=129 tomcatv: N=257 SYMB:

FA 27.784 1.266 1.963 27.978 1.279 1.963 28.020 1.283 1.964 7.283 0.152 1.956 6.769 0.142 1.959 10.311 5.863 3.011 10.497 5.402 2.921 23.831 41.959 9.734 12.658 32.499 8.048 7.659 5.441 2.957 6.641 7.051 3.273 42.361 5.619 2.644 27.361 41.850 9.678 3.239 12.427 4.329 11.224 4.107 2.668 10.777 6.586 3.144 10.584 6.831 3.192 11.933 6.746 3.162 26.070 1.600 2.043 14.238 0.542 1.961 12.498 14.131 4.560 17.000 8.080 3.365

DM 96.688 3.312 1.662 96.647 3.353 1.671 96.634 3.366 1.673 98.755 1.245 1.249 99.101 0.899 1.180 81.694 18.306 4.661 93.178 6.822 2.364 43.229 56.771 12.354 76.223 23.777 5.755 90.944 9.056 2.811 75.691 24.309 5.862 95.616 4.384 1.877 43.731 56.269 12.254 72.767 27.233 6.447 96.607 3.393 1.679 87.092 12.908 3.582 86.587 13.413 3.683 86.714 13.286 3.657 97.876 2.124 1.425 98.220 1.780 1.356 77.489 22.511 5.502 87.116 12.884 3.577

VCF 96.688 1.888 1.392 96.647 1.937 1.402 96.634 1.950 1.404 98.755 0.442 1.096 99.101 0.383 1.082 81.694 16.105 4.243 93.178 5.600 2.132 43.229 45.402 10.194 76.223 23.571 5.716 90.944 8.050 2.620 75.691 24.163 5.834 95.616 4.317 1.864 43.731 50.531 11.164 72.767 26.152 6.241 96.607 3.104 1.624 87.092 6.735 2.409 86.587 7.203 2.503 86.714 7.162 2.494 97.876 1.734 1.351 98.220 0.934 1.195 77.489 19.629 4.955 87.116 11.100 3.238

VC4 96.688 2.031 1.419 96.647 2.077 1.428 96.634 2.090 1.431 98.755 0.510 1.109 99.101 0.398 1.085 81.694 16.061 4.235 93.178 5.601 2.132 43.229 45.401 10.194 76.223 23.548 5.712 90.944 8.075 2.625 75.691 24.156 5.833 95.616 4.316 1.864 43.731 50.866 11.227 72.767 26.006 6.213 96.607 3.104 1.624 87.092 6.735 2.409 86.587 7.203 2.503 86.714 7.162 2.494 97.876 1.738 1.352 98.220 0.995 1.207 77.489 19.625 4.954 87.116 11.110 3.240

RVF 96.136 1.966 1.412 95.997 2.050 1.429 95.987 2.069 1.433 97.154 1.169 1.251 96.752 1.274 1.275 83.146 14.418 3.908 93.155 6.505 2.304 54.494 45.420 10.085 75.887 21.739 5.372 85.832 13.230 3.655 74.809 25.034 6.008 95.369 4.519 1.905 49.390 50.535 11.108 77.721 20.896 5.193 96.645 3.278 1.656 93.356 6.373 2.277 92.939 6.836 2.370 92.799 6.981 2.398 97.593 2.024 1.409 96.507 1.545 1.329 79.826 19.125 4.835 87.996 11.020 3.214

RV4 96.136 2.009 1.420 95.997 2.095 1.438 95.987 2.113 1.442 97.154 1.194 1.255 96.752 1.291 1.278 83.146 14.491 3.922 93.155 6.510 2.305 54.494 45.420 10.085 75.887 21.721 5.368 85.832 12.960 3.604 74.809 25.028 6.007 95.369 4.515 1.904 49.390 50.560 11.113 77.721 20.834 5.181 96.645 3.280 1.657 93.356 6.372 2.277 92.939 6.836 2.370 92.799 6.981 2.398 97.593 2.030 1.410 96.507 1.572 1.334 79.826 19.104 4.831 87.996 11.015 3.213

MC2 95.225 1.545 1.341 95.185 1.615 1.355 95.153 1.639 1.360 97.722 0.311 1.082 98.294 0.251 1.065 77.008 10.678 3.259 91.783 5.450 2.118 40.215 45.412 10.226 56.114 26.831 6.537 82.078 5.982 2.316 58.045 8.216 2.981 93.886 3.850 1.793 43.704 53.029 11.638 64.913 16.074 4.405 94.686 3.115 1.645 86.488 6.477 2.366 86.286 6.903 2.449 86.494 6.826 2.432 95.521 1.580 1.345 97.166 0.738 1.169 70.969 16.140 4.357 82.757 9.171 2.915

MC4 90.655 1.035 1.290 90.560 1.082 1.300 90.554 1.094 1.302 96.134 0.181 1.073 96.797 0.160 1.062 70.064 6.834 2.598 88.953 5.402 2.137 37.019 45.379 10.252 45.475 31.336 7.499 66.109 5.382 2.361 45.300 7.129 2.902 93.122 5.165 2.050 43.649 52.013 11.446 50.211 12.468 3.867 86.358 3.283 1.760 85.884 6.588 2.393 85.835 6.825 2.438 74.770 6.747 2.534 94.045 1.517 1.348 94.576 0.487 1.147 62.236 15.146 4.255 77.255 8.584 2.858

MC8 86.019 1.044 1.338 85.900 1.097 1.349 85.868 1.109 1.352 90.967 0.161 1.121 91.765 0.150 1.111 58.960 6.226 2.593 15.603 5.402 2.870 33.481 45.355 10.283 42.235 32.484 7.750 50.671 5.225 2.486 42.975 7.166 2.932 91.838 5.619 2.149 43.541 50.530 11.165 38.222 12.427 3.979 85.126 3.718 1.855 84.988 6.580 2.400 73.710 6.825 2.560 48.651 6.753 2.796 91.603 1.549 1.378 89.660 0.486 1.196 47.240 15.023 4.382 67.762 8.529 2.943

CA 96.326 1.564 1.334 96.271 1.625 1.346 96.252 1.645 1.350 98.632 0.326 1.076 99.008 0.265 1.060 80.233 10.764 3.243 93.125 5.481 2.110 43.186 45.440 10.202 66.929 29.372 6.911 90.093 6.407 2.316 74.495 8.812 2.929 94.411 3.120 1.649 43.707 53.037 11.640 72.111 15.564 4.236 95.579 3.246 1.661 86.688 6.479 2.364 86.476 6.906 2.447 86.681 6.831 2.431 97.482 1.611 1.331 98.025 0.750 1.162 76.032 16.396 4.355 86.189 9.318 2.909

HH2 95.225 1.126 1.262 95.185 1.168 1.270 95.153 1.187 1.274 97.722 0.189 1.059 98.294 0.168 1.049 77.008 7.565 2.667 91.783 5.402 2.109 40.215 45.391 10.222 56.114 27.650 6.692 82.078 5.582 2.240 58.045 7.667 2.876 93.886 3.589 1.743 43.704 52.863 11.607 64.913 12.675 3.759 94.686 3.196 1.660 86.488 6.550 2.380 86.286 6.841 2.437 86.494 6.745 2.417 95.521 1.532 1.336 97.166 0.525 1.128 70.969 15.054 4.151 82.757 8.546 2.796

HH4 95.225 1.139 1.264 95.185 1.209 1.278 95.153 1.212 1.279 97.722 0.172 1.055 98.294 0.156 1.047 77.008 6.781 2.518 91.783 5.402 2.109 40.215 45.364 10.217 56.114 30.695 7.271 82.078 5.478 2.220 58.045 7.944 2.929 93.886 3.908 1.804 43.704 50.866 11.227 64.913 12.471 3.720 94.686 3.563 1.730 86.488 6.584 2.386 86.286 6.841 2.437 86.494 6.748 2.417 95.521 1.557 1.341 97.166 0.526 1.128 70.969 15.031 4.146 82.757 8.542 2.795

HH8 95.225 1.158 1.268 95.185 1.219 1.280 95.153 1.238 1.284 97.722 0.159 1.053 98.294 0.151 1.046 77.008 6.408 2.448 91.783 5.402 2.109 40.215 45.362 10.217 56.114 32.501 7.614 82.078 5.510 2.226 58.045 8.091 2.957 93.886 5.105 2.031 43.704 50.530 11.164 64.913 12.428 3.712 94.686 3.694 1.755 86.488 6.580 2.385 86.286 6.838 2.436 86.494 6.747 2.417 95.521 1.580 1.345 97.166 0.531 1.129 70.969 15.149 4.169 82.757 8.612 2.809

Table 9: Data for Floating-Point Benchmarks and SPEC89 Aggregates on 32K Caches

Input

Rate Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost GOOD: Prim. Misses Cost BAD: Prim. Misses Cost SPEC89: Prim. Misses Cost

doduc: tiny doduc: small doduc: ref fpppp: 4 atoms fpppp: 8 atoms spice2g6: short nasa7: mxm nasa7: t nasa7: cho nasa7: btr nasa7: gmt nasa7: emi nasa7: vpe matrix: 300 tomcatv: N=33 tomcatv: N=65 tomcatv: N=129 tomcatv: N=257 SYMB:

FA 27.784 | 1.722 27.978 | 1.720 28.020 | 1.720 7.283 | 1.927 6.769 | 1.932 10.311 2.153 2.306 10.497 5.402 2.921 23.831 35.864 8.576 12.658 12.001 4.154 7.659 4.422 2.764 6.641 6.952 3.254 42.361 0.481 1.668 27.361 11.837 3.975 3.239 12.427 4.329 14.338 | 1.857 10.777 3.861 2.626 10.584 6.483 3.126 11.933 6.660 3.146 26.070 0.391 1.814 14.238 | 1.858 12.504 8.696 3.527 17.003 4.735 2.730

DM 99.540 0.460 1.092 99.515 0.485 1.097 99.512 0.488 1.098 99.886 0.114 1.023 99.945 0.055 1.011 97.284 2.716 1.543 98.422 1.578 1.316 49.282 50.718 11.144 86.079 13.921 3.784 94.372 5.628 2.126 88.909 11.091 3.218 99.444 0.556 1.111 43.751 56.249 12.250 78.764 21.236 5.247 99.978 0.022 1.004 97.282 2.718 1.544 87.346 12.654 3.531 87.010 12.990 3.598 99.287 0.713 1.143 99.791 0.209 1.042 84.716 15.284 4.057 91.637 8.363 2.673

VCF 99.540 0.292 1.060 99.515 0.310 1.064 99.512 0.311 1.064 99.886 0.030 1.007 99.945 0.025 1.005 97.284 2.495 1.501 98.422 1.218 1.247 49.282 45.020 10.061 86.079 13.846 3.770 94.372 5.195 2.043 88.909 11.076 3.215 99.444 0.501 1.101 43.751 44.009 9.924 78.764 20.653 5.136 99.978 0.000 1.000 97.282 2.712 1.543 87.346 6.577 2.376 87.010 6.999 2.460 99.287 0.642 1.129 99.791 0.126 1.026 84.716 12.976 3.618 91.637 7.103 2.433

VC4 99.540 0.291 1.060 99.515 0.308 1.063 99.512 0.310 1.064 99.886 0.032 1.007 99.945 0.027 1.006 97.284 2.500 1.502 98.422 1.223 1.248 49.282 45.020 10.061 86.079 13.836 3.768 94.372 5.194 2.043 88.909 11.075 3.215 99.444 0.501 1.101 43.751 46.272 10.354 78.764 20.636 5.133 99.978 0.000 1.000 97.282 2.712 1.543 87.346 6.577 2.376 87.010 6.999 2.460 99.287 0.642 1.129 99.791 0.126 1.026 84.716 13.134 3.648 91.637 7.187 2.449

RVF 99.629 0.234 1.048 99.605 0.246 1.051 99.600 0.249 1.051 98.700 0.075 1.027 99.939 0.033 1.007 97.280 2.497 1.502 98.385 1.559 1.312 55.814 44.183 9.837 85.997 13.917 3.784 94.414 5.145 2.033 88.858 11.129 3.226 99.097 0.710 1.144 50.726 43.473 9.753 81.929 17.476 4.501 99.900 0.070 1.014 97.168 2.753 1.551 93.766 6.179 2.236 93.339 6.609 2.322 99.485 0.475 1.095 99.756 0.110 1.023 86.930 12.468 3.500 92.868 6.779 2.359

RV4 99.629 0.230 1.047 99.605 0.243 1.050 99.600 0.246 1.051 98.700 0.101 1.032 99.939 0.034 1.007 97.280 2.505 1.503 98.385 1.564 1.313 55.814 44.181 9.836 85.997 13.904 3.782 94.414 5.141 2.033 88.858 11.128 3.226 99.097 0.716 1.145 50.726 47.971 10.607 81.929 17.454 4.497 99.900 0.062 1.013 97.168 2.753 1.551 93.766 6.179 2.236 93.339 6.609 2.322 99.485 0.476 1.096 99.756 0.111 1.024 86.930 12.783 3.560 92.868 6.947 2.391

MC2 98.658 0.105 1.033 98.628 0.108 1.034 98.628 0.108 1.034 99.251 0.001 1.008 99.497 0.002 1.005 88.417 2.517 1.594 93.885 1.940 1.430 46.077 45.386 10.162 83.304 11.859 3.420 93.120 4.561 1.935 84.510 7.191 2.521 99.429 0.481 1.097 43.744 53.006 11.634 76.753 13.289 3.757 99.757 0.000 1.002 95.700 2.712 1.558 86.887 6.230 2.315 86.861 6.637 2.392 98.968 0.408 1.088 99.180 0.039 1.016 81.379 12.148 3.494 89.671 6.577 2.353

MC4 96.688 0.009 1.035 96.647 0.009 1.035 96.634 0.009 1.035 98.755 0.001 1.013 99.101 0.001 1.009 81.694 2.126 1.587 93.178 3.130 1.663 43.229 45.361 10.186 76.223 11.835 3.486 90.944 4.413 1.929 75.691 6.932 2.560 95.616 0.481 1.135 43.731 51.982 11.439 72.767 12.427 3.633 96.642 | 1.034 87.092 2.741 1.650 86.587 6.458 2.361 86.714 6.664 2.399 97.878 0.396 1.096 98.220 0.004 1.018 77.489 11.950 3.496 87.116 6.463 2.357

MC8 95.225 | 1.048 95.185 | 1.048 95.153 | 1.048 97.722 | 1.023 98.294 | 1.017 77.008 2.109 1.631 91.783 5.367 2.102 40.215 45.338 10.212 56.114 11.934 3.706 82.078 4.423 2.020 58.045 6.941 2.738 93.886 0.481 1.153 43.704 50.530 11.164 64.913 12.427 3.712 94.729 | 1.053 86.488 2.903 1.687 86.286 6.477 2.368 86.494 6.662 2.401 95.522 0.391 1.119 97.166 | 1.028 70.969 12.024 3.575 82.757 6.500 2.407

CA 99.493 0.094 1.023 99.466 0.097 1.024 99.463 0.097 1.024 99.885 0.002 1.001 99.943 0.003 1.001 96.671 2.580 1.523 97.550 1.832 1.373 48.914 45.388 10.135 85.899 11.888 3.400 94.271 4.583 1.928 88.486 7.388 2.519 99.441 0.484 1.098 43.745 53.009 11.634 78.616 13.173 3.717 99.978 0.000 1.000 96.371 2.771 1.563 86.947 6.231 2.314 86.912 6.638 2.392 99.224 0.403 1.084 99.773 0.036 1.009 84.395 12.160 3.466 91.444 6.582 2.336

HH2 98.658 0.007 1.015 98.628 0.007 1.015 98.628 0.007 1.015 99.251 0.001 1.008 99.497 0.001 1.005 88.417 2.131 1.521 93.885 2.253 1.489 46.077 45.209 10.129 83.304 12.160 3.477 93.120 4.453 1.915 84.510 6.920 2.470 99.429 0.481 1.097 43.744 52.845 11.603 76.753 12.432 3.595 99.757 | 1.002 95.700 2.715 1.559 86.887 6.388 2.345 86.861 6.655 2.396 98.968 0.392 1.085 99.180 0.003 1.009 81.379 11.966 3.460 89.671 6.470 2.333

HH4 98.658 | 1.013 98.628 | 1.014 98.628 | 1.014 99.251 | 1.007 99.497 | 1.005 88.417 2.111 1.517 93.885 2.832 1.599 46.077 45.184 10.124 83.304 12.047 3.456 93.120 4.452 1.915 84.510 6.915 2.469 99.429 0.481 1.097 43.744 50.854 11.225 76.753 12.427 3.594 99.757 | 1.002 95.700 2.883 1.591 86.887 6.417 2.350 86.861 6.662 2.397 98.968 0.384 1.083 99.180 | 1.008 81.379 11.855 3.439 89.671 6.408 2.321

HH8 98.658 | 1.013 98.628 | 1.014 98.628 | 1.014 99.251 | 1.007 99.497 | 1.005 88.417 2.092 1.513 93.885 3.670 1.758 46.077 44.621 10.017 83.304 12.015 3.450 93.120 4.458 1.916 84.510 6.919 2.470 99.429 0.481 1.097 43.744 50.530 11.163 76.753 12.427 3.594 99.757 | 1.002 95.700 3.050 1.623 86.887 6.433 2.353 86.861 6.662 2.397 98.968 0.381 1.083 99.180 | 1.008 81.379 11.854 3.438 89.671 6.407 2.321

Table 10: Data for Floating-Point Benchmarks and SPEC89 Aggregates on 128K Caches

Input

Rate Prim. Misses Cost fora: Prim. Misses Cost forf: Prim. Misses Cost fsxzz: Prim. Misses Cost ivex: Prim. Misses Cost linp: Prim. Misses Cost lisp: Prim. Misses Cost macr: Prim. Misses Cost memxx: Prim. Misses Cost mul2: Prim. Misses Cost mul8: Prim. Misses Cost pasc: Prim. Misses Cost savec: Prim. Misses Cost spic: Prim. Misses Cost ue02: Prim. Misses Cost total: Prim. Misses Cost

dec0:

FA 43.307 3.008 2.138 36.042 3.086 2.226 35.754 2.718 2.159 54.278 1.018 1.651 27.195 3.099 2.317 4.220 5.244 2.954 25.486 1.352 2.002 29.856 3.242 2.317 50.302 1.284 1.741 52.570 2.538 1.956 44.638 1.046 1.752 43.261 2.534 2.049 21.328 0.701 1.920 46.432 2.580 2.026 37.277 4.167 2.419 35.870 2.640 2.143

DM 92.903 7.097 2.419 93.906 6.094 2.219 94.930 5.070 2.014 96.894 3.106 1.621 93.168 6.832 2.366 93.387 6.613 2.323 94.813 5.187 2.037 93.541 6.459 2.292 96.142 3.858 1.772 95.240 4.760 1.952 95.547 4.453 1.891 88.690 11.310 3.262 98.135 1.865 1.373 96.368 3.632 1.726 92.063 7.937 2.587 94.258 5.742 2.148

VCF 92.903 5.159 2.051 93.906 4.468 1.910 94.930 3.579 1.731 96.894 1.986 1.408 93.168 4.303 1.886 93.387 5.511 2.113 94.813 2.002 1.432 93.541 3.753 1.778 96.142 1.956 1.410 95.240 3.249 1.665 95.547 2.166 1.456 88.690 7.718 2.579 98.135 1.071 1.222 96.368 2.674 1.544 92.063 5.660 2.155 94.258 3.817 1.783

VC4 92.903 5.160 2.051 93.906 4.499 1.916 94.930 3.595 1.734 96.894 1.986 1.408 93.168 4.343 1.893 93.387 5.547 2.120 94.813 2.107 1.452 93.541 3.786 1.784 96.142 1.964 1.412 95.240 3.213 1.658 95.547 2.260 1.474 88.690 7.799 2.595 98.135 1.086 1.225 96.368 2.682 1.546 92.063 5.747 2.171 94.258 3.853 1.790

RVF 92.147 5.415 2.107 93.592 4.560 1.931 94.980 3.523 1.720 96.939 1.925 1.396 93.770 4.162 1.853 92.903 5.786 2.170 94.675 1.900 1.414 94.030 3.773 1.776 96.978 1.750 1.363 95.010 3.530 1.721 95.114 2.186 1.464 91.055 6.919 2.404 98.345 1.030 1.212 93.984 3.450 1.716 92.351 5.477 2.117 94.216 3.843 1.788

RV4 92.147 5.474 2.119 93.592 4.548 1.928 94.980 3.557 1.726 96.939 1.925 1.396 93.770 4.195 1.859 92.903 5.841 2.181 94.675 1.945 1.423 94.030 3.809 1.783 96.978 1.755 1.364 95.010 3.560 1.726 95.114 2.322 1.490 91.055 6.909 2.402 98.345 1.017 1.210 93.984 3.521 1.729 92.351 5.586 2.138 94.216 3.886 1.796

MC2 89.321 4.894 2.037 90.968 4.161 1.881 92.596 3.403 1.721 95.281 1.951 1.418 89.494 4.300 1.922 92.270 5.426 2.108 91.581 1.845 1.435 91.496 3.618 1.772 94.885 1.995 1.430 93.120 3.118 1.661 93.246 2.089 1.464 83.601 6.135 2.330 97.554 1.155 1.244 94.023 2.776 1.587 89.350 5.271 2.108 91.806 3.601 1.766

MC4 83.967 4.066 1.933 85.922 3.468 1.800 88.085 2.949 1.679 93.716 1.278 1.306 86.010 3.546 1.814 89.108 5.260 2.108 86.543 1.431 1.406 85.474 3.278 1.768 92.291 1.546 1.371 88.595 2.827 1.651 89.450 1.362 1.364 77.916 4.354 2.048 96.897 0.892 1.200 91.185 2.514 1.566 83.993 4.662 2.046 87.812 3.024 1.696

MC8 77.475 3.741 1.936 79.322 3.277 1.829 80.932 2.815 1.726 91.382 1.097 1.295 80.626 3.267 1.815 85.142 5.254 2.147 75.601 1.369 1.504 78.703 3.233 1.827 89.405 1.345 1.361 86.021 2.654 1.644 85.349 1.188 1.372 70.507 3.352 1.932 96.105 0.698 1.172 86.263 2.498 1.612 77.005 4.355 2.057 82.506 2.808 1.709

CA 91.637 5.143 2.061 92.950 4.444 1.915 94.177 3.613 1.745 96.438 2.020 1.419 92.384 4.463 1.924 93.226 5.508 2.114 94.272 1.927 1.423 92.843 3.843 1.802 95.884 2.009 1.423 94.639 3.181 1.658 95.137 2.132 1.454 87.058 5.911 2.252 97.912 1.157 1.241 95.679 2.740 1.564 91.230 5.542 2.141 93.586 3.692 1.766

HH2 89.321 4.184 1.902 90.968 3.581 1.771 92.596 3.028 1.649 95.281 1.463 1.325 89.494 3.712 1.810 92.270 5.266 2.078 91.581 1.453 1.360 91.496 3.351 1.722 94.885 1.593 1.354 93.120 2.898 1.619 93.246 1.531 1.358 83.601 4.417 2.003 97.554 0.951 1.205 94.023 2.471 1.529 89.350 4.755 2.010 91.806 3.099 1.671

HH4 89.321 3.954 1.858 90.968 3.416 1.739 92.596 2.915 1.628 95.281 1.207 1.277 89.494 3.441 1.759 92.270 5.257 2.076 91.581 1.321 1.335 91.496 3.295 1.711 94.885 1.397 1.317 93.120 2.716 1.585 93.246 1.288 1.312 83.601 3.729 1.872 97.554 0.814 1.179 94.023 2.401 1.516 89.350 4.507 1.963 91.806 2.903 1.634

HH8 89.321 3.551 1.782 90.968 3.306 1.718 92.596 2.856 1.617 95.281 1.113 1.259 89.494 3.379 1.747 92.270 5.254 2.076 91.581 1.329 1.337 91.496 3.243 1.701 94.885 1.334 1.305 93.120 2.672 1.577 93.246 1.224 1.300 83.601 3.259 1.783 97.554 0.740 1.165 94.023 2.366 1.509 89.350 4.451 1.952 91.806 2.799 1.614

Table 11: Data for ATUM Traces on 8K Caches

References

[1] Anant Agarwal, John Hennessy, and Mark Horowitz, "Cache Performance of Operating Systems and Multiprogramming," ACM Transactions on Computer Systems, 6(4):393-431, November 1988.

[2] Anant Agarwal and Steven D. Pudar, "Column-Associative Caches: A Technique for Reducing the Miss Rate of Direct-Mapped Caches," in Proceedings of the 20th Annual International Symposium on Computer Architecture, San Diego, California, pp. 179-190, ACM SIGARCH and IEEE Computer Society, May 17-19, 1993. Computer Architecture News, 21(2), May 1993.

Input

Rate Prim. Misses Cost fora: Prim. Misses Cost forf: Prim. Misses Cost fsxzz: Prim. Misses Cost ivex: Prim. Misses Cost linp: Prim. Misses Cost lisp: Prim. Misses Cost macr: Prim. Misses Cost memxx: Prim. Misses Cost mul2: Prim. Misses Cost mul8: Prim. Misses Cost pasc: Prim. Misses Cost savec: Prim. Misses Cost spic: Prim. Misses Cost ue02: Prim. Misses Cost total: Prim. Misses Cost

dec0:

FA 43.307 0.578 1.677 36.042 0.497 1.734 35.754 0.597 1.756 54.278 0.161 1.488 27.195 1.114 1.940 4.220 5.156 2.937 25.486 | 1.745 29.856 1.219 1.933 50.302 0.273 1.549 52.570 0.098 1.493 44.638 0.812 1.708 43.261 0.001 1.568 21.328 | 1.787 46.432 | 1.536 37.277 1.589 1.929 35.870 1.007 1.833

DM 97.686 2.314 1.463 97.849 2.151 1.430 97.954 2.046 1.409 99.063 0.937 1.187 97.541 2.459 1.492 96.523 3.477 1.695 97.764 2.236 1.447 97.944 2.056 1.411 98.753 1.247 1.249 98.324 1.676 1.335 98.526 1.474 1.295 94.590 5.410 2.082 99.550 0.450 1.090 98.890 1.110 1.222 96.817 3.183 1.637 97.748 2.252 1.450

VCF 97.686 1.477 1.304 97.849 1.414 1.290 97.954 1.570 1.319 99.063 0.687 1.140 97.541 1.857 1.377 96.523 3.227 1.648 97.764 0.423 1.103 97.944 1.827 1.368 98.753 0.846 1.173 98.324 1.388 1.281 98.526 1.013 1.207 94.590 2.267 1.485 99.550 0.264 1.055 98.890 0.694 1.143 96.817 2.637 1.533 97.748 1.530 1.313

VC4 97.686 1.489 1.306 97.849 1.451 1.297 97.954 1.571 1.319 99.063 0.696 1.142 97.541 1.876 1.381 96.523 3.225 1.647 97.764 0.460 1.110 97.944 1.829 1.368 98.753 0.850 1.174 98.324 1.383 1.279 98.526 1.015 1.208 94.590 2.351 1.501 99.550 0.262 1.054 98.890 0.704 1.145 96.817 2.631 1.532 97.748 1.544 1.316

RVF 97.488 1.723 1.353 97.446 1.721 1.353 95.767 1.632 1.352 99.183 0.530 1.109 97.455 1.942 1.394 95.304 4.237 1.852 99.197 0.356 1.076 97.528 1.896 1.385 99.039 0.650 1.133 98.158 1.233 1.253 98.519 0.977 1.201 97.281 1.545 1.321 99.687 0.221 1.045 98.744 0.905 1.185 96.851 2.545 1.515 97.707 1.609 1.329

RV4 97.488 1.739 1.355 97.446 1.722 1.353 95.767 1.655 1.357 99.183 0.549 1.112 97.455 1.962 1.398 95.304 4.250 1.855 99.197 0.350 1.075 97.528 1.888 1.384 99.039 0.652 1.134 98.158 1.230 1.252 98.519 0.979 1.201 97.281 1.722 1.354 99.687 0.226 1.046 98.744 0.903 1.184 96.851 2.559 1.518 97.707 1.628 1.332

MC2 95.833 1.050 1.241 96.254 0.927 1.214 96.890 1.073 1.235 98.381 0.474 1.106 96.185 1.504 1.324 94.461 3.138 1.652 96.837 0.188 1.067 96.916 1.471 1.310 97.561 0.707 1.159 97.257 0.799 1.179 97.785 0.867 1.187 92.792 1.221 1.304 99.194 0.166 1.040 98.035 0.344 1.085 94.966 2.123 1.454 96.466 1.169 1.257

MC4 92.903 0.773 1.218 93.906 0.782 1.210 94.930 0.859 1.214 96.894 0.384 1.104 93.168 1.258 1.307 93.387 3.854 1.798 94.813 0.040 1.059 93.541 1.296 1.311 96.142 0.657 1.163 95.240 0.420 1.127 95.547 0.801 1.197 88.690 0.183 1.148 98.135 0.104 1.038 96.368 0.140 1.063 92.063 1.808 1.423 94.258 1.024 1.252

MC8 89.321 0.637 1.228 90.968 0.674 1.218 92.596 0.728 1.212 95.281 0.280 1.100 89.494 1.149 1.323 92.270 4.425 1.918 91.581 | 1.084 91.496 1.258 1.324 94.885 0.531 1.152 93.120 0.287 1.123 93.246 0.792 1.218 83.601 0.042 1.172 97.554 0.057 1.035 94.023 0.011 1.062 89.350 1.699 1.429 91.806 1.001 1.272

CA 97.236 1.206 1.257 97.430 1.065 1.228 97.624 1.174 1.247 98.895 0.511 1.108 97.116 1.628 1.338 95.534 2.389 1.499 97.675 0.204 1.062 97.550 1.603 1.329 98.583 0.763 1.159 98.064 0.874 1.185 98.372 0.889 1.185 94.325 1.323 1.308 99.478 0.182 1.040 98.722 0.389 1.087 96.196 2.315 1.478 97.388 1.162 1.247

HH2 95.833 0.813 1.196 96.254 0.813 1.192 96.890 0.917 1.205 98.381 0.366 1.086 96.185 1.316 1.288 94.461 3.409 1.703 96.837 0.051 1.041 96.916 1.345 1.286 97.561 0.645 1.147 97.257 0.483 1.119 97.785 0.775 1.169 92.792 0.244 1.118 99.194 0.101 1.027 98.035 0.156 1.049 94.966 1.881 1.408 96.466 0.999 1.225

HH4 95.833 0.695 1.174 96.254 0.656 1.162 96.890 0.796 1.182 98.381 0.282 1.070 96.185 1.258 1.277 94.461 3.655 1.750 96.837 0.001 1.032 96.916 1.309 1.280 97.561 0.557 1.130 97.257 0.353 1.094 97.785 0.772 1.169 92.792 0.091 1.089 99.194 0.057 1.019 98.035 0.023 1.024 94.966 1.793 1.391 96.466 0.945 1.215

HH8 95.833 0.676 1.170 96.254 0.577 1.147 96.890 0.733 1.170 98.381 0.243 1.062 96.185 1.211 1.268 94.461 4.147 1.843 96.837 | 1.032 96.916 1.253 1.269 97.561 0.487 1.117 97.257 0.288 1.082 97.785 0.770 1.168 92.792 0.017 1.075 99.194 0.019 1.012 98.035 0.006 1.021 94.966 1.736 1.380 96.466 0.961 1.218

Table 12: Data for ATUM Traces on 32K Caches

[3] J. H. Chang, H. Chao, and K. So, "Cache Design of A Sub-Micron CMOS System/370," in Proceedings of the 14th Annual International Symposium on Computer Architecture, Pittsburgh, Pennsylvania, pp. 208-213, IEEE Computer Society and ACM SIGARCH, June 2-5, 1987. Computer Architecture News, 15(2), June 1987.

[4] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., 1990.

[5] Mark D. Hill, "A Case for Direct-Mapped Caches," Computer, 21(12):25-40, December 1988.

Input

Rate Prim. Misses Cost fora: Prim. Misses Cost forf: Prim. Misses Cost fsxzz: Prim. Misses Cost ivex: Prim. Misses Cost linp: Prim. Misses Cost lisp: Prim. Misses Cost macr: Prim. Misses Cost memxx: Prim. Misses Cost mul2: Prim. Misses Cost mul8: Prim. Misses Cost pasc: Prim. Misses Cost savec: Prim. Misses Cost spic: Prim. Misses Cost ue02: Prim. Misses Cost total: Prim. Misses Cost

dec0:

FA 43.307 | 1.567 36.042 | 1.640 35.754 | 1.642 54.278 | 1.457 27.195 | 1.728 4.220 | 1.958 25.486 | 1.745 29.856 | 1.701 50.302 | 1.497 52.570 | 1.474 44.638 | 1.554 43.261 | 1.567 21.328 | 1.787 46.432 | 1.536 37.277 | 1.627 35.870 | 1.641

DM 99.345 0.655 1.131 99.270 0.730 1.146 99.394 0.606 1.121 99.728 0.272 1.054 99.338 0.662 1.132 99.873 0.127 1.025 99.660 0.340 1.068 99.277 0.723 1.145 99.380 0.620 1.124 99.417 0.583 1.117 99.520 0.480 1.096 99.097 0.903 1.181 99.786 0.214 1.043 99.883 0.117 1.023 98.823 1.177 1.235 99.471 0.529 1.106

VCF 99.345 0.346 1.072 99.270 0.334 1.071 99.394 0.461 1.094 99.728 0.183 1.037 99.338 0.528 1.107 99.873 0.043 1.010 99.660 | 1.003 99.277 0.568 1.115 99.380 0.541 1.109 99.417 0.447 1.091 99.520 0.403 1.081 99.097 0.176 1.042 99.786 0.119 1.025 99.883 0.011 1.003 98.823 0.990 1.200 99.471 0.326 1.067

VC4 99.345 0.364 1.076 99.270 0.345 1.073 99.394 0.468 1.095 99.728 0.183 1.037 99.338 0.531 1.108 99.873 0.052 1.011 99.660 0.005 1.004 99.277 0.574 1.116 99.380 0.540 1.109 99.417 0.438 1.089 99.520 0.405 1.082 99.097 0.185 1.044 99.786 0.119 1.025 99.883 0.014 1.004 98.823 0.997 1.201 99.471 0.331 1.068

RVF 99.588 0.282 1.058 99.671 0.206 1.042 99.392 0.419 1.086 99.779 0.133 1.028 99.167 0.669 1.135 99.827 0.072 1.015 99.850 0.014 1.004 99.368 0.518 1.105 99.691 0.216 1.044 99.637 0.242 1.050 99.127 0.332 1.072 99.137 0.127 1.033 99.833 0.079 1.017 99.769 0.048 1.011 99.110 0.748 1.151 99.531 0.259 1.054

RV4 99.588 0.281 1.058 99.671 0.207 1.043 99.392 0.415 1.085 99.779 0.135 1.028 99.167 0.668 1.135 99.827 0.079 1.017 99.850 0.013 1.004 99.368 0.517 1.105 99.691 0.220 1.045 99.637 0.240 1.049 99.127 0.337 1.073 99.137 0.148 1.037 99.833 0.083 1.017 99.769 0.084 1.018 99.110 0.750 1.151 99.531 0.265 1.055

MC2 98.754 0.129 1.037 98.843 0.098 1.030 98.837 0.182 1.046 99.425 0.106 1.026 98.658 0.281 1.067 99.623 0.004 1.005 99.108 0.005 1.010 98.720 0.272 1.064 98.939 0.449 1.096 99.071 0.116 1.031 99.134 0.298 1.065 98.446 0.039 1.023 99.615 0.052 1.014 99.467 0.024 1.010 98.025 0.473 1.110 99.002 0.164 1.041

MC4 97.686 0.047 1.032 97.849 0.011 1.024 97.954 0.079 1.036 99.063 0.047 1.018 97.541 0.119 1.047 96.523 0.001 1.035 97.764 | 1.022 97.944 0.124 1.044 98.753 0.343 1.078 98.324 0.009 1.018 98.526 0.191 1.051 94.590 0.020 1.058 99.550 0.003 1.005 98.890 | 1.011 96.817 0.178 1.066 97.748 0.080 1.038

MC8 95.833 0.011 1.044 96.254 0.002 1.038 96.890 0.024 1.036 98.381 0.026 1.021 96.185 0.053 1.048 94.461 | 1.055 96.837 | 1.032 96.916 0.045 1.039 97.561 0.245 1.071 97.257 | 1.027 97.785 0.095 1.040 92.792 | 1.072 99.194 | 1.008 98.035 | 1.020 94.966 0.052 1.060 96.466 0.039 1.043

CA 99.285 0.151 1.036 99.216 0.123 1.031 99.305 0.205 1.046 99.684 0.124 1.027 99.186 0.329 1.071 99.870 0.006 1.002 99.654 0.009 1.005 99.164 0.299 1.065 99.347 0.453 1.093 99.374 0.138 1.032 99.457 0.300 1.062 99.075 0.044 1.018 99.769 0.051 1.012 99.869 0.025 1.006 98.654 0.509 1.110 99.416 0.178 1.040

HH2 98.754 0.045 1.021 98.843 0.020 1.015 98.837 0.085 1.028 99.425 0.051 1.015 98.658 0.123 1.037 99.623 | 1.004 99.108 | 1.009 98.720 0.134 1.038 98.939 0.392 1.085 99.071 0.016 1.012 99.134 0.192 1.045 98.446 0.017 1.019 99.615 0.010 1.006 99.467 0.001 1.005 98.025 0.220 1.062 99.002 0.088 1.027

HH4 98.754 0.016 1.016 98.843 0.006 1.013 98.837 0.026 1.017 99.425 0.035 1.012 98.658 0.069 1.027 99.623 | 1.004 99.108 | 1.009 98.720 0.064 1.025 98.939 0.303 1.068 99.071 0.001 1.009 99.134 0.119 1.031 98.446 | 1.016 99.615 | 1.004 99.467 | 1.005 98.025 0.080 1.035 99.002 0.050 1.020

HH8 98.754 | 1.012 98.843 0.002 1.012 98.837 0.012 1.014 99.425 0.018 1.009 98.658 0.018 1.017 99.623 | 1.004 99.108 | 1.009 98.720 0.025 1.017 98.939 0.208 1.050 99.071 | 1.009 99.134 0.065 1.021 98.446 | 1.016 99.615 | 1.004 99.467 | 1.005 98.025 0.047 1.029 99.002 0.028 1.015

Table 13: Data for ATUM Traces on 128K Caches

[6] Norman P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," in Proceedings of the 17th Annual International Symposium on Computer Architecture, Seattle, Washington, pp. 364-373, IEEE Computer Society and ACM SIGARCH, May 28-31, 1990. Computer Architecture News, 18(2), June 1990.

[7] S. Przybylski, M. Horowitz, and J. Hennessy, "Performance Tradeoffs in Cache Design," in Proceedings of the 15th Annual International Symposium on Computer Architecture, Honolulu, Hawaii, pp. 290-298, IEEE Computer Society and ACM SIGARCH, May 30-June 2, 1988. Computer Architecture News, 16(2), May 1988.

Input

SYM4: 100,000,000 SYM4: 1,000,000 SYM4: 100,000 SYM4: 10,000 SYM4: 1,000 SYM4: 100 FP4: 100,000,000 FP4: 1,000,000 FP4: 100,000 FP4: 10,000 FP4: 1,000 FP4: 100 MIX6: 100,000,000 MIX6: 1,000,000 MIX6: 100,000 MIX6: 10,000 MIX6: 1,000 MIX6: 100

Rate Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost Prim. Misses Cost

FA 19.066 4.474 2.659 19.066 4.514 2.667 19.065 4.869 2.734 19.055 7.368 3.209 18.959 7.295 3.196 18.024 7.278 3.203 25.871 2.726 2.259 25.871 2.784 2.270 25.870 3.308 2.370 25.860 7.081 3.087 25.760 10.378 3.714 24.783 10.325 3.714 14.739 4.225 2.655 14.739 4.265 2.663 14.738 4.615 2.729 14.733 7.310 3.242 14.677 10.420 3.833 14.137 10.052 3.769

DM 93.483 6.517 2.303 93.441 6.559 2.312 93.080 6.920 2.384 91.270 8.730 2.746 89.373 10.627 3.125 87.715 12.285 3.457 92.418 7.582 2.516 92.360 7.640 2.528 91.884 8.116 2.623 89.199 10.801 3.160 83.595 16.405 4.281 79.883 20.117 5.023 92.912 7.088 2.418 92.873 7.127 2.425 92.536 7.464 2.493 90.409 9.591 2.918 86.420 13.580 3.716 83.037 16.963 4.393

VCF 93.483 4.727 1.963 93.441 4.774 1.973 93.080 5.180 2.053 91.270 7.205 2.456 89.373 9.158 2.846 87.715 8.989 2.831 92.418 4.044 1.844 92.360 4.111 1.857 91.884 4.653 1.965 89.199 7.670 2.565 83.595 13.903 3.806 79.883 15.883 4.219 92.912 4.673 1.959 92.873 4.717 1.968 92.536 5.096 2.043 90.409 7.485 2.518 86.420 11.682 3.355 83.037 12.437 3.533

VC4 93.483 4.738 1.965 93.441 4.785 1.975 93.080 5.192 2.056 91.270 7.221 2.459 89.373 9.155 2.846 87.715 9.123 2.856 92.418 4.022 1.840 92.360 4.089 1.853 91.884 4.634 1.962 89.199 7.676 2.566 83.595 13.914 3.808 79.883 16.034 4.248 92.912 4.648 1.954 92.873 4.693 1.963 92.536 5.074 2.039 90.409 7.473 2.516 86.420 11.677 3.354 83.037 12.755 3.593

RVF 93.757 4.888 1.991 93.718 4.932 2.000 93.313 5.317 2.077 91.629 7.265 2.464 89.758 9.105 2.832 88.416 8.813 2.790 92.720 4.656 1.957 92.667 4.719 1.970 91.766 5.591 2.145 89.389 8.310 2.685 84.081 14.249 3.866 80.125 15.947 4.229 91.915 5.431 2.113 91.879 5.473 2.121 91.566 5.822 2.190 89.595 8.030 2.630 85.828 12.154 3.451 82.097 13.168 3.681

RV4 93.757 4.905 1.994 93.718 4.950 2.003 93.313 5.327 2.079 91.629 7.276 2.466 89.758 9.096 2.831 88.416 8.913 2.809 92.720 4.660 1.958 92.667 4.721 1.970 91.766 5.607 2.148 89.389 8.323 2.687 84.081 14.206 3.858 80.125 16.042 4.247 91.915 5.444 2.115 91.879 5.484 2.123 91.566 5.830 2.192 89.595 8.039 2.631 85.828 12.143 3.449 82.097 13.410 3.727

MC2 91.044 4.857 2.012 91.025 4.898 2.020 90.848 5.260 2.091 89.513 7.362 2.504 86.595 8.733 2.793 83.815 8.850 2.843 87.965 6.072 2.274 87.940 6.129 2.285 87.707 6.612 2.379 85.761 9.670 2.980 78.676 14.501 3.968 72.953 15.199 4.158 87.799 4.706 2.016 87.782 4.745 2.024 87.635 5.087 2.090 86.370 7.544 2.570 81.398 11.774 3.423 74.896 12.879 3.698

MC4 87.422 4.502 1.981 87.413 4.544 1.989 87.329 4.910 2.060 86.548 7.233 2.509 82.946 7.980 2.687 78.098 7.794 2.700 80.482 3.623 1.884 80.471 3.678 1.894 80.368 4.161 1.987 79.389 7.560 2.643 73.161 12.183 3.583 64.146 12.328 3.701 77.378 4.376 2.058 77.371 4.415 2.065 77.307 4.762 2.132 76.696 7.352 2.630 72.762 11.505 3.458 63.077 11.546 3.563

MC8 83.432 4.440 2.009 83.427 4.481 2.017 83.382 4.845 2.087 82.940 7.270 2.552 79.552 7.522 2.634 71.339 7.452 2.702 71.134 2.988 1.856 71.129 3.044 1.867 71.087 3.547 1.963 70.666 7.167 2.655 66.970 11.708 3.555 55.815 10.846 3.503 71.915 4.406 2.118 71.911 4.445 2.125 71.878 4.786 2.191 71.548 7.361 2.683 68.674 11.159 3.433 55.436 10.822 3.502

CA 92.919 4.901 2.002 92.888 4.944 2.011 92.609 5.321 2.085 90.799 7.449 2.507 88.294 8.895 2.807 86.474 9.146 2.873 91.295 5.925 2.213 91.247 5.984 2.225 90.843 6.490 2.325 88.152 9.559 2.935 81.383 14.358 3.914 77.235 15.382 4.150 92.223 4.845 1.998 92.192 4.889 2.007 91.915 5.276 2.083 89.902 7.791 2.581 84.818 11.963 3.425 80.608 13.389 3.738

HH2 91.044 4.511 1.947 91.025 4.555 1.955 90.848 4.936 2.029 89.513 7.222 2.477 86.595 8.177 2.688 83.815 7.991 2.680 87.965 4.580 1.990 87.940 4.638 2.002 87.707 5.140 2.100 85.761 8.500 2.757 78.676 13.091 3.701 72.953 13.197 3.778 87.799 4.395 1.957 87.782 4.435 1.965 87.635 4.790 2.034 86.370 7.389 2.540 81.398 11.562 3.383 74.896 11.804 3.494

HH4 91.044 4.417 1.929 91.025 4.460 1.937 90.848 4.848 2.013 89.513 7.221 2.477 86.595 7.807 2.617 83.815 7.629 2.611 87.965 3.023 1.695 87.940 3.084 1.707 87.707 3.615 1.810 85.761 7.192 2.509 78.676 11.814 3.458 72.953 11.391 3.435 87.799 4.327 1.944 87.782 4.367 1.952 87.635 4.722 2.021 86.370 7.338 2.531 81.398 11.322 3.337 74.896 10.976 3.336

HH8 91.044 4.391 1.924 91.025 4.435 1.932 90.848 4.822 2.008 89.513 7.230 2.479 86.595 7.594 2.577 83.815 7.495 2.586 87.965 2.892 1.670 87.940 2.953 1.682 87.707 3.496 1.787 85.761 7.123 2.496 78.676 11.595 3.416 72.953 10.728 3.309 87.799 4.313 1.942 87.782 4.355 1.950 87.635 4.719 2.020 86.370 7.334 2.530 81.398 11.262 3.326 74.896 10.298 3.208

Table 14: Data for Multitasking Traces on 8K Caches

[8] Andre Seznec, "A Case for Two-Way Skewed-Associative Caches," in Proceedings of the 20th Annual International Symposium on Computer Architecture, San Diego, California, pp. 169-178, ACM SIGARCH and IEEE Computer Society, May 17-19, 1993. Computer Architecture News, 21(2), May 1993.

[9] Alan Jay Smith, "Cache Memories," ACM Computing Surveys, 14(3):473-530, September 1982.

Input (column groups, left to right): SYM4, FP4, and MIX6, each paired with six values: 100,000,000; 1,000,000; 100,000; 10,000; 1,000; and 100.

For each of these eighteen inputs, every data row below lists three figures in order: Prim. (primary hit rate, in percent), Misses (miss rate, in percent), and Cost (average access cost).

FA 19.066 2.095 2.207 19.066 2.263 2.239 19.065 3.443 2.464 19.055 4.043 2.578 18.959 4.002 2.571 18.024 4.005 2.581 25.871 1.505 2.027 25.871 1.673 2.059 25.870 2.775 2.269 25.860 2.974 2.307 25.760 2.957 2.304 24.783 2.978 2.318 14.739 2.719 2.369 14.739 2.845 2.393 14.738 3.690 2.554 14.733 5.334 2.866 14.677 4.803 2.766 14.137 4.809 2.772

DM 97.609 2.391 1.478 97.375 2.625 1.525 96.203 3.797 1.759 94.872 5.128 2.026 94.337 5.663 2.133 93.915 6.085 2.217 97.490 2.510 1.502 97.322 2.678 1.536 96.449 3.551 1.710 94.603 5.397 2.079 92.266 7.734 2.547 90.702 9.298 2.860 95.460 4.540 1.908 95.308 4.692 1.938 94.452 5.548 2.110 92.788 7.212 2.442 91.673 8.327 2.665 90.733 9.267 2.853

VCF 97.609 1.949 1.394 97.375 2.191 1.443 96.203 3.409 1.686 94.872 4.807 1.965 94.337 5.159 2.037 93.915 5.031 2.017 97.490 1.755 1.359 97.322 1.954 1.398 96.449 2.930 1.592 94.603 4.923 1.989 92.266 7.015 2.410 90.702 7.074 2.437 95.460 2.862 1.589 95.308 3.033 1.623 94.452 3.949 1.806 92.788 5.746 2.164 91.673 6.811 2.377 90.733 6.681 2.362

VC4 97.609 1.951 1.395 97.375 2.193 1.443 96.203 3.412 1.686 94.872 4.810 1.965 94.337 5.164 2.038 93.915 5.061 2.022 97.490 1.816 1.370 97.322 1.997 1.406 96.449 2.934 1.593 94.603 4.917 1.988 92.266 7.000 2.407 90.702 7.116 2.445 95.460 2.873 1.591 95.308 3.035 1.623 94.452 3.941 1.804 92.788 5.739 2.163 91.673 6.808 2.377 90.733 6.814 2.387

RVF 97.627 1.999 1.404 97.423 2.226 1.449 96.287 3.405 1.684 94.940 4.815 1.965 94.464 5.093 2.023 94.154 4.980 2.005 96.407 2.110 1.437 96.725 2.215 1.454 95.896 3.111 1.632 94.428 4.698 1.948 92.612 6.339 2.278 91.463 6.300 2.282 96.274 3.024 1.612 96.129 3.179 1.643 95.299 4.041 1.815 93.443 5.975 2.201 92.233 7.142 2.435 90.925 6.922 2.406

RV4 97.627 2.003 1.404 97.423 2.230 1.450 96.287 3.410 1.685 94.940 4.818 1.966 94.464 5.096 2.024 94.154 5.000 2.008 96.407 2.150 1.444 96.725 2.241 1.458 95.896 3.127 1.635 94.428 4.715 1.952 92.612 6.326 2.276 91.463 6.357 2.293 96.274 3.031 1.613 96.129 3.183 1.643 95.299 4.046 1.816 93.443 5.979 2.202 92.233 7.133 2.433 90.925 7.021 2.425

MC2 95.243 1.805 1.391 95.146 2.037 1.436 94.448 3.362 1.694 92.665 4.592 1.946 91.605 4.644 1.966 90.572 4.671 1.982 96.285 1.661 1.353 96.190 1.834 1.387 95.527 2.792 1.575 92.958 4.466 1.919 88.822 5.251 2.109 86.461 5.287 2.140 94.178 2.628 1.558 94.110 2.783 1.588 93.615 3.769 1.780 91.469 5.714 2.171 89.114 5.922 2.234 87.288 5.985 2.264

MC4 93.483 1.685 1.385 93.441 1.908 1.428 93.080 3.297 1.696 91.270 4.312 1.907 89.373 4.198 1.904 87.717 4.198 1.920 92.418 1.473 1.356 92.360 1.662 1.392 91.884 2.752 1.604 89.199 3.705 1.812 83.595 3.547 1.838 79.883 3.559 1.877 92.912 2.690 1.582 92.873 2.828 1.609 92.536 3.704 1.778 90.409 5.509 2.143 86.419 5.059 2.097 83.038 5.052 2.130

MC8 91.044 1.719 1.416 91.025 1.928 1.456 90.847 3.321 1.722 89.513 4.159 1.895 86.596 4.087 1.911 83.817 4.090 1.939 87.965 1.462 1.398 87.940 1.648 1.434 87.707 2.764 1.648 85.761 3.315 1.772 78.676 3.138 1.809 72.953 3.162 1.871 87.799 2.727 1.640 87.782 2.856 1.665 87.636 3.694 1.825 86.370 5.481 2.178 81.397 4.898 2.117 74.897 4.904 2.183

CA 97.264 1.904 1.389 97.076 2.149 1.438 95.867 3.499 1.706 94.117 4.699 1.952 93.463 4.850 1.987 93.013 4.896 2.000 97.234 1.670 1.345 97.091 1.851 1.381 96.248 2.827 1.575 93.942 4.407 1.898 91.231 5.359 2.106 89.596 5.510 2.151 95.155 2.556 1.534 95.041 2.731 1.568 94.300 3.845 1.788 92.146 5.748 2.171 90.682 6.190 2.269 89.685 6.318 2.303

HH2 95.243 1.692 1.369 95.146 1.925 1.414 94.448 3.288 1.680 92.665 4.380 1.905 91.605 4.278 1.897 90.572 4.274 1.906 96.285 1.488 1.320 96.190 1.680 1.357 95.527 2.730 1.563 92.958 3.894 1.810 88.822 3.847 1.843 86.461 3.855 1.868 94.178 2.508 1.535 94.110 2.668 1.566 93.615 3.708 1.768 91.469 5.557 2.141 89.114 5.276 2.111 87.288 5.261 2.127

HH4 95.243 1.705 1.371 95.146 1.928 1.415 94.448 3.294 1.681 92.665 4.258 1.882 91.605 4.151 1.873 90.572 4.154 1.883 96.285 1.483 1.319 96.190 1.674 1.356 95.527 2.732 1.564 92.958 3.584 1.751 88.822 3.351 1.748 86.461 3.380 1.778 94.178 2.528 1.538 94.110 2.683 1.569 93.615 3.697 1.766 91.469 5.531 2.136 89.114 5.069 2.072 87.288 5.077 2.092

HH8 95.243 1.741 1.378 95.146 1.955 1.420 94.448 3.309 1.684 92.665 4.181 1.868 91.605 4.103 1.863 90.572 4.106 1.874 96.285 1.486 1.319 96.190 1.672 1.356 95.527 2.731 1.564 92.958 3.463 1.728 88.822 3.208 1.721 86.461 3.239 1.751 94.178 2.665 1.565 94.110 2.805 1.592 93.615 3.694 1.766 91.469 5.529 2.136 89.114 5.003 2.059 87.288 5.013 2.080

Table 15: Data for Multitasking Traces on 32K Caches

Input (column groups, left to right): SYM4, FP4, and MIX6, each paired with six values: 100,000,000; 1,000,000; 100,000; 10,000; 1,000; and 100. As in Table 15, each data row below lists Prim. (primary hit rate), Misses (miss rate), and Cost (average access cost) for each of the eighteen inputs, in that order.

FA 19.066 0.629 1.929 19.066 1.196 2.037 19.065 1.507 2.096 19.055 1.434 2.082 18.959 1.433 2.083 18.025 1.434 2.092 25.871 0.012 1.744 25.871 0.311 1.800 25.870 0.625 1.860 25.860 0.682 1.871 25.760 0.689 1.873 24.783 0.701 1.885 14.735 1.594 2.156 14.735 1.987 2.230 14.738 3.025 2.427 14.733 2.803 2.385 14.677 2.801 2.385 14.137 2.809 2.392

DM 99.099 0.901 1.180 98.661 1.339 1.268 97.863 2.137 1.427 97.426 2.574 1.515 97.280 2.720 1.544 97.165 2.835 1.567 99.645 0.355 1.071 99.424 0.576 1.115 98.621 1.379 1.276 98.115 1.885 1.377 97.433 2.567 1.513 97.000 3.000 1.600 96.930 3.070 1.614 96.550 3.450 1.690 95.684 4.316 1.863 95.229 4.771 1.954 94.936 5.064 2.013 94.651 5.349 2.070

VCF 99.099 0.834 1.167 98.661 1.280 1.257 97.863 2.089 1.418 97.426 2.520 1.505 97.280 2.522 1.506 97.165 2.507 1.505 99.645 0.273 1.055 99.424 0.501 1.101 98.621 1.318 1.264 98.115 1.800 1.361 97.433 2.021 1.410 97.000 1.978 1.406 96.930 1.713 1.356 96.550 2.103 1.434 95.684 2.986 1.611 95.229 3.458 1.705 94.936 3.686 1.751 94.651 3.602 1.738

VC4 99.099 0.835 1.168 98.661 1.282 1.257 97.863 2.091 1.419 97.426 2.519 1.504 97.280 2.526 1.507 97.165 2.510 1.505 99.645 0.273 1.055 99.424 0.501 1.101 98.621 1.319 1.264 98.115 1.797 1.360 97.433 2.031 1.412 97.000 2.006 1.411 96.930 1.715 1.356 96.550 2.105 1.434 95.684 2.988 1.611 95.229 3.459 1.705 94.936 3.688 1.751 94.651 3.630 1.743

RVF 99.126 0.777 1.156 98.659 1.251 1.251 97.897 2.026 1.406 97.555 2.362 1.473 97.460 2.359 1.474 97.394 2.346 1.472 99.643 0.270 1.055 99.440 0.482 1.097 98.821 1.103 1.221 98.456 1.449 1.291 98.116 1.492 1.302 97.975 1.447 1.295 98.217 1.685 1.338 97.845 2.067 1.414 96.926 2.997 1.600 96.215 3.714 1.743 95.826 4.009 1.803 95.629 3.940 1.792

RV4 99.126 0.777 1.156 98.659 1.251 1.251 97.897 2.026 1.406 97.555 2.362 1.473 97.460 2.363 1.474 97.394 2.350 1.472 99.643 0.271 1.055 99.440 0.484 1.098 98.821 1.107 1.222 98.456 1.448 1.290 98.116 1.501 1.304 97.975 1.452 1.296 98.217 1.685 1.338 97.845 2.068 1.415 96.926 3.000 1.601 96.215 3.715 1.744 95.826 4.006 1.803 95.629 3.951 1.794

MC2 98.531 0.678 1.143 98.158 1.177 1.242 97.097 1.795 1.370 96.344 1.828 1.384 96.037 1.820 1.386 95.781 1.821 1.388 98.621 0.114 1.036 98.379 0.357 1.084 97.436 1.121 1.239 96.540 1.262 1.274 95.144 1.603 1.353 93.895 1.640 1.373 96.436 1.581 1.336 96.149 1.973 1.413 94.954 2.972 1.615 94.014 3.185 1.665 93.447 3.177 1.669 92.926 3.205 1.680

MC4 97.609 0.643 1.146 97.375 1.173 1.249 96.203 1.626 1.347 94.872 1.545 1.345 94.337 1.544 1.350 93.916 1.545 1.354 97.490 0.053 1.035 97.322 0.304 1.084 96.449 0.892 1.205 94.603 0.811 1.208 92.266 0.807 1.231 90.702 0.816 1.248 95.461 1.601 1.350 95.309 1.991 1.425 94.452 3.089 1.642 92.788 2.957 1.634 91.672 2.939 1.642 90.733 2.944 1.652

MC8 95.243 0.624 1.166 95.146 1.178 1.272 94.448 1.545 1.349 92.665 1.462 1.351 91.605 1.461 1.362 90.572 1.463 1.372 96.285 0.015 1.040 96.190 0.281 1.091 95.528 0.869 1.210 92.958 0.777 1.218 88.822 0.781 1.260 86.461 0.789 1.285 94.184 1.595 1.361 94.117 1.987 1.436 93.615 3.071 1.647 91.470 2.848 1.626 89.113 2.846 1.650 87.288 2.852 1.669

CA 98.978 0.677 1.139 98.522 1.184 1.240 97.571 1.800 1.366 97.073 1.875 1.385 96.923 1.878 1.388 96.808 1.879 1.389 99.599 0.096 1.022 99.321 0.342 1.072 98.314 1.071 1.220 97.686 1.295 1.269 96.873 1.587 1.333 96.393 1.684 1.356 96.794 1.589 1.334 96.452 1.986 1.413 95.338 2.976 1.612 94.686 3.286 1.677 94.355 3.320 1.687 94.059 3.352 1.696


HH2 98.531 0.633 1.135 98.158 1.158 1.239 97.097 1.663 1.345 96.344 1.597 1.340 96.037 1.593 1.342 95.781 1.595 1.345 98.621 0.047 1.023 98.379 0.291 1.072 97.436 0.891 1.195 96.540 0.867 1.199 95.144 0.891 1.218 93.895 0.899 1.232 96.436 1.590 1.338 96.149 1.983 1.415 94.954 2.973 1.615 94.014 2.953 1.621 93.447 2.924 1.621 92.926 2.930 1.628

HH4 98.531 0.612 1.131 98.158 1.154 1.238 97.097 1.593 1.332 96.344 1.510 1.323 96.037 1.510 1.326 95.781 1.511 1.329 98.621 0.018 1.017 98.379 0.265 1.067 97.436 0.840 1.185 96.540 0.762 1.179 95.144 0.764 1.194 93.895 0.772 1.208 96.436 1.585 1.337 96.149 1.982 1.415 94.954 2.929 1.607 94.014 2.819 1.596 93.447 2.813 1.600 92.926 2.820 1.607

HH8 98.531 0.607 1.130 98.158 1.155 1.238 97.097 1.556 1.325 96.344 1.476 1.317 96.037 1.476 1.320 95.781 1.478 1.323 98.621 0.014 1.016 98.379 0.262 1.066 97.436 0.708 1.160 96.540 0.669 1.162 95.144 0.673 1.176 93.895 0.681 1.190 96.436 1.586 1.337 96.149 1.983 1.415 94.954 2.903 1.602 94.014 2.770 1.586 93.447 2.768 1.592 92.926 2.776 1.598

Table 16: Data for Multitasking Traces on 128K Caches
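
Tables 14 through 16 all share the same flat row layout, which is awkward to work with directly. The following is a minimal Python sketch, not part of the original report, for turning one such row into per-input records; the helper name parse_row, and the assumption that every row carries a design label followed by exactly eighteen (Prim., Misses, Cost) groups in the order described in the table headers, are ours.

# Minimal sketch (not from the report): parse one flattened data row from
# Tables 14-16 into per-input records.  Assumes the layout described in the
# table headers: a design label followed by eighteen (Prim., Misses, Cost)
# groups, ordered SYM4, FP4, MIX6, each at the six listed values.

TRACES = ["SYM4", "FP4", "MIX6"]
SETTINGS = [100_000_000, 1_000_000, 100_000, 10_000, 1_000, 100]

def parse_row(line):
    """Return (design, {(trace, setting): (prim, misses, cost)}) for one row."""
    fields = line.split()
    design = fields[0]
    values = [float(x) for x in fields[1:]]
    expected = 3 * len(TRACES) * len(SETTINGS)
    if len(values) != expected:
        raise ValueError(f"expected {expected} numbers, got {len(values)}")
    records = {}
    it = iter(values)
    for trace in TRACES:
        for setting in SETTINGS:
            prim, misses, cost = next(it), next(it), next(it)
            records[(trace, setting)] = (prim, misses, cost)
    return design, records

For example, feeding the DM row of Table 15 into parse_row would yield records[("MIX6", 100)] == (90.733, 9.267, 2.853); the same function applies unchanged to Tables 14 and 16, since all three tables share the row format.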