In the Proceedings of the First International Symposium on High-Performance Computer Architecture, Raleigh, North Carolina, January 22-25, 1995, pp. 144-153. © 1995 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
A Design Framework for Hybrid-Access Caches

Kevin B. Theobald, School of Computer Science, McGill University, Montreal, Quebec H3A 2A7
Herbert H. J. Hum, Dept. of Elec. and Comp. Eng., Concordia University, Montreal, Quebec H3G 1M8
Guang R. Gao, School of Computer Science, McGill University, Montreal, Quebec H3A 2A7
Abstract

High-speed microprocessors need fast on-chip caches in order to keep busy. Direct-mapped caches have better access times than set-associative caches, but poorer miss rates. This has led to several hybrid on-chip caches combining the speed of direct-mapped caches with the hit rates of associative caches. In this paper, we unify these hybrids within a single framework which we call the Hybrid Access Cache (HAC) model. Existing hybrid caches lie near the edges of the HAC design space, leaving the middle untouched. We study a group of caches in this middle region, a group we call Half-and-Half Caches, which are half direct-mapped and half set-associative. Simulations confirm the predictive value of the HAC model, and demonstrate that, for medium to large caches, this middle region yields more efficient cache designs.

Keywords: Cache, Cache simulation, Half-and-half cache, Hybrid access cache, On-chip cache.

1 Introduction

In the past decade, the operating frequencies of high-performance microprocessors have far outpaced memory speeds. Processor designers have turned to on-chip caches to overcome this gap. To get the most benefit from an on-chip cache, it is best if the data can be read from the cache in only one cycle, and if the miss rate is low.

Set-associative caches have traditionally been favored for their lower miss rates. Because direct-mapped caches can only store a memory block in one specific line, they are more prone to conflicts between blocks that map to the same set, and thus tend to have more misses than set-associative caches. However, for faster processors, the one-cycle-read requirement can only be met by a direct-mapped cache, because such a cache has a shorter critical circuit path than a set-associative cache storing the same amount of data [5, 9]. In a set-associative cache, the data line from the cache can't be sent to its destination until the tag comparisons are complete, since the cache has to know which data line to select. On the other hand, since an address can map to only one line in a direct-mapped cache, the direct-mapped cache can write the selected line from the cache to the output at the same time as it is comparing the tags. The data can then be invalidated if the tags don't match.

1.1 Previous hybrid caches

Because direct-mapped caches tend to have higher miss rates, designers have developed various hybrid schemes that combine the fast access of a direct-mapped cache with the better hit rate of a set-associative cache.

So and Rechtschaffen [10] observed that in a set-associative cache, the vast majority of hits within a given set involve the MRU (most recently used) member of that set. This led to the MRU cache [3]. When an MRU cache is read, all tags in the selected set are compared simultaneously, but the MRU line in that set is immediately sent out. If the tag for that line doesn't match the address tag, then the data is invalidated. If one of the other tags matches, then that line is latched into the output buffer in the next cycle, and that line becomes the new MRU. Thus, an MRU hit has an access time of one cycle, and a non-MRU hit takes two cycles.

Another way to boost the hit rate of a direct-mapped cache is to add a victim cache [7]. A victim cache is a small fully-associative cache of typically no more than 16 lines. If a line in the direct-mapped cache is replaced, that line (called the "victim") is transferred to the victim cache. If the desired block is in the victim cache, then that block exchanges places with the victim during the second cycle.

Since some accesses require two cycles to perform, a simple way to combine fast access with better associativity is to read a direct-mapped cache twice. In the hash-rehash cache [1], if a read misses on the first cycle, a second read is done using a different hashing function. If the second try is a hit, the first and second lines are swapped. An improved version called the column-associative cache [2] keeps with each tag a rehash bit which indicates whether the corresponding line represents a first-cycle hit or a rehash (second-cycle) hit. This bit is used to limit the swapping in order to reduce thrashing.
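To make this access sequence concrete, here is a minimal Python sketch of a column-associative lookup. This is our illustration, not the authors' hardware: the rehash bit that limits swapping is omitted, the rehash function is one plausible bit-flipping choice, and the miss path is simplified.

```python
# Toy column-associative lookup (our illustration, not the authors' code).
NUM_LINES = 8            # illustrative size; must be a power of 2
MISS_PENALTY = 20        # cycles (the miss penalty assumed later, in Sec. 4.1)

lines = [None] * NUM_LINES   # each entry records the block address it holds

def f(a):                    # first-try hash: bit selection on the index bits
    return a % NUM_LINES

def g(a):                    # rehash: flip the high-order index bit
    return f(a) ^ (NUM_LINES // 2)

def read(a):
    """Return the access time, in cycles, of a read of block address a."""
    i, j = f(a), g(a)
    if lines[i] == a:                  # first-cycle hit
        return 1
    if lines[j] == a:                  # rehash (second-cycle) hit: swap lines
        lines[i], lines[j] = lines[j], lines[i]
        return 2
    lines[i], lines[j] = a, lines[i]   # miss: fill f(a), displace the old line
    return 2 + MISS_PENALTY
```

The swap on a rehash hit is the key design choice: a block that is re-referenced soon afterwards will then hit in a single cycle.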
1.2 Unifying the hybrid caches

These hybrid caches represent very different solutions to the same problem. In this paper, we show that most of these designs can be viewed within a general cache design taxonomy, which we call the Hybrid Access Cache (HAC) model. Under the HAC model, a hybrid cache consists of a direct-mapped primary section, a slower secondary section (usually associative), and an efficient mechanism for exchanging data between them. The purpose of the HAC model is to allow the user to think about a whole spectrum of values for the cache parameters, in order to balance the benefits of fast access and greater associativity.
[Figure 1: Structure of a HAC. Block diagram: the address's tag and set fields (log2 N1 and log2 N2 bits) index a direct-mapped primary section and a two-way set-associative secondary section; tag comparators drive select and inhibit signals, and the numbers on the paths mark the cycle (1 or 2) in which each output appears.]
1.3 Synopsis and summary

In the next section, we present the HAC model in greater detail. In Section 3, we describe the HAC design space and show how points in the design space are related to the cache behavior of programs. Based on this, we develop the half-and-half cache, whose design points fill the middle of the HAC space. In this cache, half of the cache lines are direct-mapped, and the other half are set-associative.

In Section 4, we present the results of cache simulations of both single-process and multiple-process traces derived from SPEC benchmarks and ATUM traces. The results confirm our predictions from Section 3. They show that for small caches, fast access is most important, and the victim cache gives the best overall performance. However, the victim cache shows large variations in performance, and the half-and-half cache tends to be more consistent. For moderate caches, associativity becomes more important, and the middle of the HAC space yields better results. Large caches show few performance variations, because the primary caches have become large enough to handle most accesses without interference. Finally, we present our conclusions in Section 5.

2 The Hybrid Access Cache model

This section develops the Hybrid Access Cache as a model for caches which combine features of direct-mapped and associative caches. We then discuss the operation of HACs and their performance metrics.
2.1 Viewing HACs as dual caches

A HAC can be conceptually viewed as a combination of two caches: a fast primary cache which is direct-mapped, and a secondary cache, which may be direct-mapped, set-associative, or fully-associative. An example configuration is shown in Figure 1, where the secondary cache is a two-way set-associative cache. Any memory block may be cached in either the primary or secondary section, but not in both at the same time. Data in the primary cache may be read in a single cycle, while data in the secondary cache requires two or more cycles to read.

The goal of the hybrid access cache organization is to keep most of the frequently-used data in the primary cache, while using the secondary cache to store data which would otherwise be swapped out of the cache due to address conflicts in the direct-mapped section. Even though reading the secondary cache may require two or more cycles, this is far better than the standard off-chip miss penalty.
Although the two caches are organized in a hierarchy, a HAC is substantially different from a conventional two-level cache. Two-level cache systems typically place the second level off-chip, and make it much larger in order to reduce capacity misses [4]. In a HAC, the secondary cache is on-chip and integrated with the primary cache, and is primarily used to alleviate conflict misses. Also, most two-level caches have the multilevel inclusion property [4], meaning that a copy of everything in the first-level cache is also in the second-level cache. In a HAC, no line can be in both caches at the same time (this is called two-level exclusive caching [8]).

The primary and secondary caches may be different sizes, and their sets may be accessed using different hashing functions s1 and s2. However, since data is exchanged between the two caches, their line sizes must be the same. Therefore, the two caches can be represented by the triples (N1, 1, L) and (N2, K, L), where N1 and N2 are the number of sets in the primary and secondary caches, respectively, K is the associativity, and L is the number of words in a line.

Conceptually, a HAC performs a read in the following manner (refer to Figure 1). Assume a is the address with the line-offset bits removed:

1. In the first cycle, the cache reads the tags and data in line s1(a) of the primary cache and in set s2(a) of the secondary cache. (In the figure, a number beside an arc indicates in which cycle the output will appear. Moreover, the figure assumes s1 and s2 use bit selection.) The data from the primary cache is latched into the output buffer, becoming available at the end of the first cycle.

2. The single tag from the primary cache and the K tags from the secondary cache are compared against the appropriate bits of the address.

3. If the tag from the primary cache matches the address, then a primary hit occurs. The cache immediately cancels the probe of the secondary cache, so that it can be used in the next cycle. If the tag from the primary cache doesn't match, then the output buffer is invalidated, and the cache must operate for at least one more cycle.

4. In the second cycle (if there was no primary hit), the tag comparators from the secondary cache are examined. If one of them matches (a secondary hit), then the appropriate data line is latched into the output buffer. The cache must now move the selected data (and tag) from the secondary cache into the primary cache, while simultaneously moving the displaced line from the primary cache into the secondary cache. (This swapping is not shown in Figure 1 for clarity; an implementation is discussed in [11].) If K > 1 and the secondary cache uses an LRU replacement policy, then it makes sense for the LRU status bits in the set to be changed so that the "victim" becomes the new MRU element, since it just came from the primary cache.

5. If all tag comparators mismatch, then the cache must go to main memory. This data will go into the primary cache in line s1(a). Thus, the block currently in line s1(a) becomes a victim, and is moved over to the secondary cache, causing the LRU element of the secondary-cache set to be flushed. When the data returns from main memory, the cache places it in the primary cache.

The swapping done in steps 4 and 5 imposes a restriction on the hashing functions:

    s1(a1) = s1(a2) ⇒ s2(a1) = s2(a2)    (1)

In other words, if two addresses map to the same line in the primary cache, then they should map to the same set in the secondary cache. The need for this restriction can be seen in the counterexample shown in Figure 2. Assume that memory blocks 00001 and 00101 are in the secondary cache (in sets 001 and 101, respectively), and memory block 01001 is in the primary cache (in set 01). What happens if the processor reads block 00101? That block should be moved from set 101 of the secondary cache to set 01 of the primary cache, displacing block 01001. But this victim must go into set 001 of the secondary cache, not set 101. The former set must flush its other address, while set 101 is left without a victim, and must be marked invalid. This results in a complex accessing scheme and wastes precious cache space.

[Figure 2: Problems with bad hashing functions. The diagram shows blocks 00001, 01001, and 00101 mapping into the primary and secondary sections under mismatched hashing functions.]

Assuming that the codomain of s2 is the entire set space in the secondary cache, this restriction means
that N2 ≤ N1. If bit selection is used, then N1 and N2 must be powers of 2; typically, the rightmost log2(N1) bits of a are used to select the primary cache line, and the rightmost log2(N2) bits of that field are used to select the secondary cache set.
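The five-step read operation can be summarized in a short Python sketch. This is our own illustration under the bit-selection restriction just described, not the hardware implementation discussed in [11]; block addresses stand in for tags, and each secondary set is a list kept in LRU order (least recently used first, MRU last).

```python
# Minimal sketch of a HAC read (steps 1-5 above); our illustration only.
N1, N2, K = 8, 4, 2          # illustrative sizes; N2 <= N1, both powers of 2
T_MAIN = 20                  # miss penalty in cycles (see Section 4.1)

primary = [None] * N1                   # direct-mapped lines
secondary = [[] for _ in range(N2)]     # K-way sets, kept in LRU order

def s1(a): return a % N1                # bit selection for the primary
def s2(a): return a % N2                # bit selection for the secondary
# Since N2 divides N1, s1(a1) == s1(a2) implies s2(a1) == s2(a2),
# satisfying restriction (1).

def read(a):
    """Return the access time, in cycles, of a read of block address a."""
    line, s = s1(a), secondary[s2(a)]
    if primary[line] == a:              # primary hit: one cycle
        return 1
    victim = primary[line]
    if a in s:                          # secondary hit: swap with the victim
        s.remove(a)
        primary[line] = a
        if victim is not None:
            s.append(victim)            # victim becomes the set's MRU element
        return 2
    # Miss: the victim moves to the secondary cache, flushing the set's LRU
    # element if the set is full; the fetched block fills the primary line.
    if victim is not None:
        if len(s) == K:
            s.pop(0)                    # flush the LRU element
        s.append(victim)
    primary[line] = a
    return 2 + T_MAIN
```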
2.2 Performance measurements

When comparing simple caches, the miss rate (ratio of misses to total accesses) is usually an adequate measure of performance. However, it is not sufficient to describe the performance of HACs, because the two caches in a HAC have different hit times. A HAC can have a low miss rate, yet still have relatively poor performance, if many of the hits occur in the secondary cache and take longer to access.

If a single parameter is needed to compare HACs, then the average effective memory-cycle time (teff) can be used. If h1 and h2 are the fractions of accesses which hit in the primary cache and secondary cache, respectively (note that h1 + h2 ≤ 1), and if t1 and t2 are the access times of the two caches, then

    teff = t1 + h2 (t2 − t1) + (1 − h1 − h2) tmain    (2)

This assumes that the main memory access begins after the first cycle, once all the comparator results are known. If t1 is 1 cycle and t2 is 2 cycles, then

    teff = 1 + h2 + (1 − h1 − h2) tmain    (3)
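As a quick numerical check of these formulas (the hit fractions below are example values of our own choosing, not measurements):

```python
def t_eff(h1, h2, t_main, t1=1, t2=2):
    """Average effective memory-cycle time, eq. 2 (eq. 3 when t1=1, t2=2)."""
    return t1 + h2 * (t2 - t1) + (1 - h1 - h2) * t_main

# Example: 90% primary hits, 7% secondary hits, 20-cycle miss penalty.
print(t_eff(0.90, 0.07, 20))   # 1 + 0.07 + 0.03 * 20 = 1.67 cycles
```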
3 Using the HAC model

The Hybrid Access Cache framework can be used to examine the design space of possible cache configurations, and to gain insights into how well they might be expected to perform. In this section, we show how to do this in general, and show that specific hybrid schemes are equivalent to specific cases in the HAC model. Using the parameter space created by the HAC model, we explore a new cache configuration, the "half-and-half" cache, as another possible HAC design.

3.1 The HAC design space

Figure 3 shows the HAC design space. The horizontal axis shows the percentage of cache lines allocated to associative accesses, i.e., the number of lines in the secondary cache, and the vertical axis shows the associativity of the secondary cache, K. (Line size and total cache size are omitted for clarity.) Moving to the right in the design space indicates that the HAC has more and more lines dedicated to associative accesses, so that conflicts in the direct-mapped section can be better resolved. Moving to the left indicates that more and more lines are used in a direct-mapped manner for increased speed. Thus, there is a trade-off when a cache design is moved left or right in the space. Moving towards the top of the design space indicates that the cache has more associativity in the associative portion to resolve more conflicts, but at the cost of more tag-matching hardware and a possibly longer access time for the secondary cache. (For simplicity, we assume that the associativity does not increase the access time of the secondary cache beyond two cycles. In the future, we intend to explore a more elaborate HAC model.) Again, there is another trade-off when moving a cache design towards the top or bottom of the design space.

[Figure 3: Design space of the HAC model. The horizontal axis (lines used for associative access) runs from 0% through 50%, 75%, and 87.5% to 100%; the vertical axis (associativity of the associative portion, K) runs from 1 through 2, 4, 8, and 16 up to the maximum K. Victim caches, MRU caches, the column-associative cache, and half-and-half caches are plotted in this space.]

Existing HAC designs are plotted in Figure 3. The left edge of the space (x = 0) corresponds to the degenerate case of a direct-mapped cache, because no lines are used for associative storage. Victim caches lie near this edge, because only a very small portion of the cache is allocated for associative storage, e.g., 16 lines. They lie on a line starting just to the right of (0, 1) (x is not quite 0, because there is one line in the secondary cache) and rising nearly vertically. The MRU caches occupy the design points which start at (50%, 1) and form an exponential curve towards the upper right-hand corner of the HAC design space. As K is increased, the secondary cache occupies a greater percentage of the total cache lines. For instance, if a 4-way set-associative cache is converted to an MRU
cache, then 3 lines out of every set of 4 will be part of the secondary cache (75%), and the associativity would be 3. The limit is reached when K = N, i.e., the cache is fully-associative. The MRU curve represents the other extreme of the design space; the region to the right of the MRU curve cannot correspond to any legal designs, given that N2 ≤ N1 (see Section 2.1).

The column-associative (CA) cache (as well as its hash-rehash predecessor) is a HAC which does not occupy a single point in the design space, but a horizontal line. In terms of the mapping of addresses to cache lines, the CA cache is equivalent to a 2-way set-associative cache, because the bit-flipping rehash function of the CA cache effectively partitions the cache line space into 2-way sets. In terms of access, however, the CA sometimes acts like a direct-mapped cache, because the distribution of cache lines to direct-mapped accesses is dependent on the access patterns of the application. If accesses do not cause conflicts in the CA cache, then the CA cache functions as if it devoted 100% of its lines to direct-mapped access. However, as conflicts arise, lines begin to be used for associative (rehashed) accesses. Eventually, if all accesses cause conflicts, then up to half of the lines can be used for associative accesses. Thus the horizontal line extends from 0% associative access to 50%. This line is placed in the design space at K = 1, because a cache line can only be rehashed to one other location.

3.2 Restricting the HAC design space

Most points in Figure 3 would be difficult to implement, because they would require set selection functions more complicated than simple bit selection. (Imagine, for instance, trying to build a HAC with a direct-mapped secondary cache containing 49% of the lines.) If we restrict the hashing functions to bit selection, then both N1 and N2 must be powers of 2. Even with this restriction, we are still presented with a large range of sizes for the secondary cache. Given a primary cache of size N1, we can set N2 to any power of 2 up to N1, and set K to any positive integer (although in practice, K would be limited by the costs of high associativity).

We may view this as another two-dimensional design space, but one for more efficient HAC designs (see Figure 4). In this design space, the vertical axis remains the same, but the horizontal axis is now the ratio of the number of sets in the primary cache to the number of sets in the secondary cache. The various HAC designs are also shown in the new space. The design space of the victim cache is a vertical line where N2 is 1 (the victim cache is fully associative). The design space of the MRU cache is a vertical line where N1 = N2. The design point of a column-associative cache is again a horizontal line at K = 1, running from N1 = N2 to N2 = 0.

[Figure 4: Design space of more efficient HACs. The horizontal axis (N1/N2) runs from 1 through 2, 4, and 8 up to N1 and infinity (the DM cache); the vertical axis (associativity of the associative portion, K) runs from 1 through 2, 4, and 8 to 16. MRU caches, victim caches, half-and-half caches, and the column-associative cache are plotted in this space.]

3.3 Exploring the HAC design space

The HAC design space is useful for illustrating the various design possibilities for a hybrid access cache, but it is also important to understand the relationship between this space and the behavior of real applications. Since the purpose of adding associativity to a cache is to reduce conflict misses, it is the interferences between addresses that should be considered. Sometimes, the interferences will be "broad," meaning that conflicts occur in many places. For instance, if two large arrays are mapped to the same region of the cache, there will be conflicts over many cache lines. Sometimes, interferences can be "deep," meaning that many addresses are competing for the same set. Such a situation may occur, for instance, with randomly-allocated pointer structures.

The horizontal axis of Figure 3 corresponds to the "breadth" of interferences, while the vertical axis corresponds to the "depth." Thus, the types of interferences prevalent in a given program correspond to a region of the HAC space, and can be used to indicate which kinds of HAC designs would be most appropriate for that application. For instance, the example given above, of two large arrays mapping to the same sets in the cache, would correspond to the bottom of Figure 3, and would best be served by a column-associative or a 2-way MRU cache. A victim
cache would not be a good choice for this application, because the interference of the two arrays would generate too many "victims"; these would be driven out of the small secondary cache before they could be reused. On the other hand, if there is not much "breadth" to the interferences in an application, then a victim cache would serve the application effectively. In some cases, an equally-sized HAC with a larger secondary cache, such as an MRU cache, could perform just as well. But this would depend on the size of the cache relative to the size of the program's working set. If the cache were relatively small, then most of the lines in the cache could be "active" at any given time, and they could be accessed more quickly if these lines were in the primary cache. Thus, while a high-K MRU cache could attain the same hit rate as a victim cache, it might cost more to use, because more accesses would come from the slower secondary cache.

Thus, it is important to think about the interference patterns that occur at a given cache size and the region to which they correspond in the HAC design space, and to choose a cache design which is close to that region. Caches at one edge may be unsuitable for applications lying near the other side of the space. For instance, a direct-mapped cache, lying at the extreme left side of the HAC space, would probably give poor performance for any application with either depth or breadth in its interference.

Different applications can exhibit different types of interference patterns. Since most machines are expected to perform over a wide range of applications, it may be impossible to find a HAC that would be optimal for all applications. But from Figures 3 and 4, it can be seen that the existing designs are all near the edges of the HAC space. A cache that is closer to the middle region of the space might be a better choice for a general-purpose cache than one at an edge of the space, because it would be likely to provide a more balanced cache design which may better accommodate both deep and broad interference patterns.

3.4 The half-and-half cache

To exploit the "middle region" of the HAC design space, as suggested in the previous section, we have chosen to make a case study of the "half-and-half" cache [11]. Half-and-half caches lie on a straight line running through the middle of the HAC design space (in both Figures 3 and 4). An N-line half-and-half cache has the following properties:

1. Half of the lines (N/2) are in the primary cache.

2. The other half of the lines form a 2^r-way set-associative secondary cache with N/2^(r+1) sets. (A detailed derivation can be found in [11].)

The half-and-half cache superficially resembles the MRU cache (indeed, a half-and-half cache with r = 0 is identical to a 2-way MRU cache). However, there are several differences (if r > 0). A half-and-half cache may have more address conflicts than an MRU cache with the same associativity, because only half of the storage of the half-and-half cache is associative. However, the benefits of associativity in the MRU cache come at a price: as associativity increases, the size of the primary cache decreases, meaning that fewer and fewer lines can be accessed in one cycle.
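To make properties 1 and 2 concrete, here is a small sketch (ours, under the bit-selection restriction of Section 2.1) of the half-and-half geometry and how a block address indexes both sections:

```python
def half_and_half_geometry(n_lines, r):
    """Split an n_lines half-and-half cache: the primary gets n_lines/2
    direct-mapped lines; the secondary gets n_lines/2**(r+1) sets of
    associativity 2**r (properties 1 and 2 above)."""
    n1 = n_lines // 2            # sets (= lines) in the primary
    k = 2 ** r                   # secondary associativity
    n2 = (n_lines // 2) // k     # sets in the secondary
    return n1, n2, k

def indexes(a, n1, n2):
    """Bit selection: the rightmost log2(n1) bits of block address a pick
    the primary line; the rightmost log2(n2) bits pick the secondary set."""
    return a % n1, a % n2

n1, n2, k = half_and_half_geometry(1024, 2)   # an HH4-style 1024-line cache
print(n1, n2, k)    # 512 primary lines; 128 four-way secondary sets
```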
4 Simulation experiments

In order to validate our hypotheses about the relationship between application programs and the HAC design space, and to compare the half-and-half cache to other caches, we ran a series of trace-driven simulations. We programmed a cache simulator to emulate various hybrid access caches of three different sizes, feeding it 49 traces. In brief, we found the following:

- All hybrid cache designs performed considerably better than one-level direct-mapped caches.
- For small (8K) caches, the victim caches performed best overall, especially on floating-point applications with large working sets, confirming that the left side of Figure 3 is most relevant when cache sizes are small.
- For medium (32K) caches, the half-and-half caches performed best, demonstrating that the middle of the HAC design space is a good area to explore for this size.
- For large (128K) caches, the half-and-half caches were also the best, but the differences in performance among all the caches were small.

These findings confirm the predictions we made while analyzing the HAC design space (see Section 3.3). We also examined the effects of multitasking, and found that this did not significantly change the results.

4.1 Cache designs simulated

We simulated a fully-associative cache, a direct-mapped cache, three MRU caches, a standard victim cache with a 16-line fully-associative secondary, a variation of the victim cache in which the secondary is 4-way set-associative, a column-associative cache, and three half-and-half caches. Each cache came in three sizes: 8K, 32K and 128K, representing a range from an average value for present-day on-chip caches to what will likely be possible in a few years. We used 32-byte lines, so if N is the number of lines in the cache, then N ∈ {256, 1024, 4096}.
The caches are compared in Table 1. The table gives the parameters N1 and N2 in terms of N, and the associativity K. Each cache has N lines total, except for VC16 and VC4, which have a full-sized direct-mapped cache plus a 16-line victim cache.

Name | N1       | N2     | K  | Structure
FA   | N        | --     | -- | Fully associative
DM   | N        | --     | -- | Direct-mapped
MRU2 | N/2      | N/2    | 1  | MRU cache
MRU4 | N/4      | N/4    | 3  | MRU cache
MRU8 | N/8      | N/8    | 7  | MRU cache
VC16 | N        | 1      | 16 | DM plus victim cache
VC4  | N        | 4      | 4  | DM plus victim cache
CA   | N/2 to N | N - N1 | 1  | Column-associative
HH2  | N/2      | N/4    | 2  | Half-and-half cache
HH4  | N/2      | N/8    | 4  | Half-and-half cache
HH8  | N/2      | N/16   | 8  | Half-and-half cache

Table 1: Size parameters for simulated caches

In this study, we only measured the effects of non-compulsory reads, and we only considered data caches. It was assumed that all hits in the primary cache take one cycle, and all secondary-cache hits take two cycles. We also made this assumption for CA, even though the original publication rated it at 4 cycles [2], because it can be equipped with the same switching hardware as the other HACs. For this study, we assumed a miss penalty of 20 cycles in eq. 3. This value is arbitrary, and can affect the comparison of the HACs (for instance, increasing the miss penalty would tend to benefit MRU caches more than others). Nevertheless, this value is similar to that used in other cache comparisons, and represents a middle ground between high-end machines with fast secondary caches and low-end machines with on-chip caches only.
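For reference, the Table 1 parameters can be written down mechanically. The encoding below is our own (FA is omitted because it has no primary/secondary split, and CA's split actually varies at run time):

```python
def cache_configs(n):
    """Size parameters from Table 1 as (name, N1, N2, K), for n total lines.
    VC16 and VC4 add their 16 victim lines on top of the n-line DM array."""
    return [
        ("DM",   n,      None,    None),
        ("MRU2", n // 2, n // 2,  1),
        ("MRU4", n // 4, n // 4,  3),
        ("MRU8", n // 8, n // 8,  7),
        ("VC16", n,      1,       16),
        ("VC4",  n,      4,       4),
        ("CA",   n // 2, n // 2,  1),   # N1 actually varies from N/2 to N
        ("HH2",  n // 2, n // 4,  2),
        ("HH4",  n // 2, n // 8,  4),
        ("HH8",  n // 2, n // 16, 8),
    ]

# The three simulated sizes: 8K, 32K, and 128K bytes with 32-byte lines.
for n in (256, 1024, 4096):
    print(n, cache_configs(n))
```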
4.2 Trace sources

In our simulations we used three groups of traces, which are summarized in Table 2.

The first set of traces were single-process traces generated from the SPEC89 benchmarks. They are grouped into three sets: symbolic programs, numerical programs with relatively good cache behavior, and numerical programs with bad cache behavior. Each benchmark was run on the reference input case, except that spice2g6 was run on the short case due to its size. All four of the regular espresso inputs and all 19 regular gcc inputs were run. All programs within a group were run independently, and their miss counts were combined before applying eq. 3. The benchmarks were compiled statically, using the sun4 makefiles, with Sparc Fortran SC2.0.1 (-cg89 switch) and GNU CC 2.4.5. All programs were run to completion on Sparc workstations, and the traces were generated using the Spy tool [6].

To measure the effects of multitasking, we generated three synthetic traces by merging together SPEC89 traces. As can be seen in Table 2, we sometimes used smaller input cases or subsections of programs. The eqntott-fast program used in Mix-6 was a faster version of eqntott, formed by recompiling eqntott with espresso's qsort module. A multitrace was generated by running each program in the group for 100 million instructions and performing round-robin scheduling. Five traces were generated, in which the merger would switch to another trace after every 100, 1,000, 10,000, 100,000, or 1,000,000 instructions. A sixth trace was generated by running the constituent programs separately, without context-switching.

For our final series, we used 15 ATUM traces [1] from the DLX software release. These traces are based on actual VAX execution traces, and include system calls. Some of them are also multitasked. (Of course, coming from a CISC processor, they may not accurately mimic access patterns found on today's prevalent RISC processors.)

Name     | Benchmarks                                                               | Total ops      | Reads (%) | Writes (%) | Compulsory misses
Symbolic | gcc(all), espresso(all), li, eqntott                                     | 10,167,146,068 | 19.97     | 6.73       | 353,786
Good FP  | doduc(ref), fpppp(8)                                                     | 3,037,963,140  | 30.54     | 6.91       | 4,731
Bad FP   | spice(short), nasa7, matrix, tomcatv                                     | 11,616,044,389 | 29.08     | 9.57       | 545,891
ATUM     | DLX traces (15)                                                          | 2,595,330      | 64.36     | 35.64      | 31,763
Sym-4    | gcc(insn-recog), espresso(cps), li, eqntott                              | 400,000,000    | 20.29     | 6.07       | 87,451
FP-4     | spice(short), doduc(ref), nasa7(GMT), fpppp(4)                           | 400,000,000    | 24.74     | 7.63       | 15,287
Mix-6    | gcc(stmt), nasa7(EMI), eqntott-fast, fpppp(8), espresso(tial), tomcatv(129) | 600,000,000 | 24.12     | 5.95       | 169,495
Table 2: Benchmarks used
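The multitrace construction is a straightforward round-robin interleave. A sketch (ours, operating on traces as Python lists of references rather than the actual trace files):

```python
def merge_round_robin(traces, grain):
    """Interleave per-process traces round-robin, switching to the next
    trace after every `grain` references, as described in Section 4.2."""
    merged, live = [], [iter(t) for t in traces]
    while live:
        for it in list(live):          # visit each still-running trace in turn
            for _ in range(grain):
                try:
                    merged.append(next(it))
                except StopIteration:  # this trace is exhausted; drop it
                    live.remove(it)
                    break
    return merged

# Example: three toy "traces" switched after every 2 references.
print(merge_round_robin([[1, 2, 3], [10, 20], [100]], 2))
```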
4.3 Results

Results for the single-process SPEC runs, the ATUM traces, and the non-interleaved versions of the multitasking traces are shown in Figures 5-7. For each cache size and trace group, we plot the performance of the HACs and the DM cache. Each bar shows the ratio of a cache's average access time (as defined in eq. 3 in Section 2.2) to the average access time of the FA (fully associative) cache for the same trace. (Generally, the FA represents the "ideal" limit, in which conflict misses are minimized, but in some experiments the FA performs worse than some of the HACs, due to anomalous access patterns.)

[Figure 5: teff (relative to FA) for 8K cache. For each trace group (Symbolic, Good FP, Bad FP, ATUM, Sym-4, FP-4, Mix-6, and Total), bars compare DM, MRU2/4/8, VC16/4, CA, and HH2/4/8.]

[Figure 6: teff (relative to FA) for 32K cache. Same format as Figure 5.]

A few observations can be made by comparing the caches at different sizes. Figure 5 shows the greatest performance variations between hybrid caches. This shows that finding a HAC which is in a region suitable to the application is most important for smaller caches. In a small cache, it is desirable to keep the primary cache dominant, so that most hits take only one cycle. If the cache is large enough, more associativity can be added, removing some conflict misses, while keeping the primary cache large enough to satisfy most of the read requests. Figure 7 shows the least performance variations.

For the 8K cache, the victim caches have the lowest average access time, probably because they have more one-cycle-access lines than the others. (This is also partially because they actually have more total lines; experiments [11] on victim caches whose primary sections were reduced by the same number of lines as contained in the secondary showed worse results, with average access times for all traces between the HH4 and HH8 values.) This is especially true for the Bad FP traces (which make up over 40% of the total), due to their large working sets. In other cases, though, the victim caches did worse than the others, even the MRU caches with their small primary caches. The half-and-half caches (especially HH4 and HH8) performed more consistently, showing that the middle of the HAC design space may be a better place for all-around good performance.
[Figure 7: teff (relative to FA) for 128K cache. Same format as Figures 5 and 6.]

[Figure 8: Effects of multitasking on 8K cache. For each multitasking trace (Sym-4, FP-4, Mix-6) and each context-switch interval (100 to 1,000,000 instructions), bars show the increase in access time over the non-interleaved trace for DM, MRU2/4/8, VC16/4, CA, and HH2/4/8.]

[Figure 9: Effects of multitasking on 32K cache. Same format as Figure 8.]

[Figure 10: Effects of multitasking on 128K cache. Same format as Figure 8.]

For the 32K cache, the victim caches' relative performances have dropped. It would appear that once most of the working set is in the primary cache, it becomes more important to provide enough associativity to handle broad interferences, such as overlapping arrays. The other HACs do this fairly well. CA can provide breadth, but not much depth, and the strong
performance of MRU4 compared to MRU2 suggests that more depth is needed. Both MRU and HH can provide this depth, but HH has more fast cache at its disposal. HH2 wastes 4% fewer cycles, on average, than MRU4, 6% fewer than CA, and 20% fewer than VC16. Again, it appears that the middle of the HAC space gives the best overall results.

For the largest cache, all the HACs except the victim caches perform roughly equally. They all provide enough breadth, and the size of the cache makes depth less of an issue. The half-and-half caches are slightly better, but given the costs of tag-matching circuits, it would probably be more cost-effective to use a column-associative cache. Of course, a sufficiently large problem could completely change this assessment. There is very little difference between VC16 and VC4, suggesting that a fully-associative victim cache is probably not worth the extra hardware cost, and a set-associative victim cache may perform just as well at a lower cost.

The multitasking traces are shown in Figures 8-10. Each shows, for each of the multitasking traces and each of the 5 grain sizes, the absolute difference between the access time for a cache and the access time for the same cache on the non-interleaved version of the same trace. From these graphs, it can be seen that multitasking has a big impact on cache usage. The effects increase as tasks are switched more frequently, as could be expected. At the rapid-switching end of the spectrum, it can be seen that MRU8 and HH8 give good results, showing the importance of depth of associativity when there is frequent context-switching.
5 Conclusions

In this paper, we have proposed a unified framework for the design of Hybrid Access Caches (HACs) as on-chip caches. Under our model, a rich design space is structured in a way which helps cache designers systematically explore, experiment with, and compare various hybrid cache designs in order to meet performance and cost requirements. With the help of this framework, we conceived a novel HAC organization called the half-and-half cache. Our simulation results demonstrated that for moderate-sized caches, the half-and-half cache has better performance than other hybrid caches for most applications, as predicted by the HAC design space.

Other points in the HAC design space are possible, and may be even more desirable from a performance or implementation standpoint. The HAC design space can help the designer to see all the possibilities as part of a continuum, and to understand how these design decisions can affect performance.

Acknowledgements

This research was supported by the Natural Sciences and Engineering Research Council (NSERC) and MICRONET, both funded by the Canadian government, and by a Concordia FRDP Grant.

References

[1] Anant Agarwal, John Hennessy, and Mark Horowitz, "Cache Performance of Operating Systems and Multiprogramming," ACM Trans. on Computer Systems, 6(4):393-431, Nov. 1988.
[2] Anant Agarwal and Steven D. Pudar, "Column-Associative Caches: A Technique for Reducing the Miss Rate of Direct-Mapped Caches," in Proc. of the 20th Ann. Intl. Symp. on Computer Architecture, San Diego, Calif., pp. 179-190, May 1993.
[3] J. H. Chang, H. Chao, and K. So, "Cache Design of a Sub-Micron CMOS System/370," in Proc. of the 14th Ann. Intl. Symp. on Computer Architecture, Pittsburgh, Penn., pp. 208-213, Jun. 1987.
[4] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann Pub., Inc., 1990.
[5] Mark D. Hill, "A Case for Direct-Mapped Caches," Computer, 21(12):25-40, Dec. 1988.
[6] Gordon Irlam, "Spa." World-Wide Web page, URL: http://www.base.com/gordoni/spa.html.
[7] Norman P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," in Proc. of the 17th Ann. Intl. Symp. on Computer Architecture, Seattle, Wash., pp. 364-373, May 1990.
[8] Norman P. Jouppi and Steven J. E. Wilton, "Tradeoffs in Two-Level On-Chip Caching," in Proc. of the 21st Ann. Intl. Symp. on Computer Architecture, Chicago, Ill., pp. 34-45, Apr. 1994.
[9] S. Przybylski, M. Horowitz, and J. Hennessy, "Performance Tradeoffs in Cache Design," in Proc. of the 15th Ann. Intl. Symp. on Computer Architecture, Honolulu, Hawaii, pp. 290-298, May-Jun. 1988.
[10] Kimming So and Rudolph N. Rechtschaffen, "Cache Operations by MRU-Change," in Proc. of the IEEE Intl. Conf. on Computer Design, Port Chester, N.Y., pp. 584-586, Oct. 1986.
[11] Kevin B. Theobald, Herbert H. J. Hum, and Guang R. Gao, "A Unified Framework for Hybrid Access Cache Design and Its Applications," ACAPS Tech. Memo 65, Sch. of Comp. Sci., McGill U., Montreal, Que., Oct. 1993.