A Design Framework for Hybrid-Access Caches
Kevin B. Theobald, School of Computer Science, McGill University, Montreal, Quebec H3A 2A7
Herbert H.J. Hum, Dept. of Elec. and Comp. Eng., Concordia University, Montreal, Quebec H3G 1M8
Guang R. Gao, School of Computer Science, McGill University, Montreal, Quebec H3A 2A7
data line from the cache can't be sent to its destination until the tag comparisons are complete, since the cache has to know which data line to select. On the other hand, since an address can map to only one line in a direct-mapped cache, the direct-mapped cache can write the selected line from the cache to the output at the same time as it is comparing the tags. The data can then be invalidated if the tags don't match.
High-speed microprocessors need fast on-chip caches in order to keep busy. Direct-mapped caches have better access times than set-associative caches, but poorer miss rates. This has led to several hybrid on-chip caches combining the speed of direct-mapped caches with the hit rates of associative caches. In this paper, we unify these hybrids within a single framework which we call the Hybrid Access Cache (HAC) model. Existing hybrid caches lie near the edges of the HAC design space, leaving the middle untouched. We study a group of caches in this middle region, a group we call Half-and-Half Caches, which are half direct-mapped and half set-associative. Simulations confirm the predictive value of the HAC model, and demonstrate that, for medium to large caches, this middle region yields more efficient cache designs.
Keywords: Cache, Cache simulation, Half-and-half cache, Hybrid access cache, On-chip cache.
1.1 Previous hybrid caches
Because direct-mapped caches tend to have higher miss rates, designers have developed various hybrid schemes that combine the fast access of a direct-mapped cache with the better hit rate of a set-associative cache.
So and Rechtschaffen [10] observed that in a set-associative cache, the vast majority of hits within a given set involve the MRU (most recently used) member of that set. This led to the MRU cache [3]. When an MRU cache is read, all tags in the selected set are compared simultaneously, but the MRU line in that set is immediately sent out. If the tag for that line doesn't match the address tag, then the data is invalidated. If one of the other tags matches, then that line is latched into the output buffer in the next cycle, and that line becomes the new MRU. Thus, an MRU hit has an access time of one cycle, and a non-MRU hit takes two cycles.
Another way to boost the hit rate of a direct-mapped cache is to add a victim cache [7]. A victim cache is a small fully-associative cache of typically no more than 16 lines. If a line in the direct-mapped cache is replaced, that line (called the "victim") is transferred to the victim cache. If the desired block is in the victim cache, then that block exchanges places with the victim during the second cycle.
Since some accesses require two cycles to perform, a simple way to combine fast access with better associativity is to read a direct-mapped cache twice. In the hash-rehash cache [1], if a read misses on the first cycle, a second read is done using a different hashing
1 Introduction
In the past decade, the operating frequencies of high-performance microprocessors have far outpaced memory speeds. Processor designers have turned to on-chip caches to overcome this gap. To get the most benefit from an on-chip cache, it is best if the data can be read from the cache in only one cycle, and if the miss rate is low.
Set-associative caches have traditionally been favored for their lower miss rates. Because direct-mapped caches can only store a memory block in one specific line, they are more prone to conflicts between blocks that map to the same set, and thus tend to have more misses than set-associative caches. However, for faster processors, the one-cycle-read requirement can only be met by a direct-mapped cache, because such a cache has a shorter critical circuit path than a set-associative cache storing the same amount of data [5, 9]. In a set-associative cache, the
function. If the second try is a hit, the first and second lines are swapped. An improved version called the column-associative cache [2] keeps with each tag a rehash bit which indicates whether the corresponding line represents a first-cycle hit or a rehash (second-cycle) hit. This bit is used to limit the swapping in order to reduce thrashing.
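The two-probe read of the plain hash-rehash cache (without the rehash bit) can be sketched as follows. This is a minimal illustration rather than the design from [1]: the class name, the rehash function (flipping the top index bit), and the placement policy on a miss are all assumptions made for the example.

```python
class HashRehashCache:
    """Toy direct-mapped array that is probed up to twice per read."""

    def __init__(self, num_lines):
        # num_lines must be a power of 2 so that bit selection works.
        assert num_lines >= 2 and num_lines & (num_lines - 1) == 0
        self.num_lines = num_lines
        self.lines = [None] * num_lines  # each slot holds a block address

    def first_index(self, block):
        return block % self.num_lines

    def second_index(self, block):
        # Assumed rehash function: flip the top bit of the first index.
        return self.first_index(block) ^ (self.num_lines >> 1)

    def read(self, block):
        """Return 1 or 2 (cycles for a hit), or None on a miss."""
        i, j = self.first_index(block), self.second_index(block)
        if self.lines[i] == block:
            return 1                      # first-cycle hit
        if self.lines[j] == block:
            # Second-try hit: swap so the block is found first next time.
            self.lines[i], self.lines[j] = self.lines[j], self.lines[i]
            return 2
        # Miss (assumed policy): the old occupant of the first location
        # moves to the rehash location, and the new block takes its place.
        if self.lines[i] is not None:
            self.lines[j] = self.lines[i]
        self.lines[i] = block
        return None
```

With eight lines, blocks 1 and 9 collide in a direct-mapped cache, but here they can coexist: after both have been loaded, alternating reads of 1 and 9 each hit on the second probe instead of missing.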
1.2 Unifying the hybrid caches
These hybrid caches represent very different solutions to the same problem. In this paper, we show that most of these designs can be viewed within a general cache design taxonomy, which we call the Hybrid Access Cache (HAC) model. Under the HAC model, a hybrid cache consists of a direct-mapped primary section, a slower secondary section (usually associative), and an efficient mechanism for exchanging data between them. The purpose of the HAC model is to allow the user to think about a whole spectrum of values for the cache parameters, in order to balance the benefits of fast access and greater associativity.
2 The Hybrid Access Cache model
1.3 Synopsis and summary
This section develops the Hybrid Access Cache as a model for caches which combine features of direct-mapped and associative caches. We then discuss the operation of HACs and their performance metrics.
In the next section, we present the HAC model in greater detail. In Section 3, we describe the HAC design space and show how points in the design space are related to the cache behavior of programs. Based on this, we develop the half-and-half cache, whose design points fill the middle of the HAC space. In this cache, half of the cache lines are direct-mapped, and the other half are set-associative. In Section 4, we present the results of cache simulations of both single-process and multiple-process traces derived from SPEC benchmarks and ATUM traces. The results confirm our predictions from Section 3. They show that for small caches, fast access is most important, and the victim cache gives the best overall performance. However, the victim cache shows large variations in performance, and the half-and-half cache tends to be more consistent. For moderate caches, associativity becomes more important, and the middle of the HAC space yields better results. Large caches show few performance variations because the primary caches have become large enough to handle most accesses without interference. Finally, we present our conclusions in Section 5.
2.1 Viewing HACs as dual caches
A HAC can be conceptually viewed as a combination of two caches: a fast primary cache which is direct-mapped, and a secondary cache, which may be direct-mapped, set-associative, or fully-associative. An example configuration is shown in Figure 1, where the secondary cache is a two-way set-associative cache. Any memory block may be cached in either the primary or secondary section, but not in both at the same time. Data in the primary cache may be read in a single cycle, while data in the secondary cache requires two or more cycles to read.
The goal of the hybrid access cache organization is to keep most of the frequently-used data in the primary cache, while using the secondary cache to store data which would otherwise be swapped out of the cache due to address conflicts in the direct-mapped section. Even though reading the secondary cache may require two or more cycles, this is far better than the standard off-chip miss penalty.
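As an illustration of this dual-cache view, the following sketch models a HAC at the level of block addresses. It is a toy model under stated assumptions, not the hardware design: the class and method names are hypothetical, set selection uses the low-order bits (modulo a power of 2), the secondary cache replaces its least recently inserted line when full, and a secondary hit is charged exactly two cycles.

```python
class HybridAccessCache:
    """Toy HAC: direct-mapped primary plus K-way secondary, exclusive contents."""

    def __init__(self, n1, n2, k):
        # Assumption: n2 divides n1, so blocks that share a primary line
        # also share a secondary set and can be swapped in place.
        assert n1 % n2 == 0
        self.n1, self.n2, self.k = n1, n2, k
        self.primary = [None] * n1                # one block per line
        self.secondary = [[] for _ in range(n2)]  # oldest first, newest last

    def _demote(self, victim):
        """Move a block displaced from the primary cache into the secondary."""
        if victim is None:
            return
        sset = self.secondary[victim % self.n2]
        if len(sset) == self.k:
            sset.pop(0)  # assumed policy: flush the oldest line in the set
        sset.append(victim)

    def read(self, a):
        """Return 1 (primary hit), 2 (secondary hit), or None (miss)."""
        line = a % self.n1
        sset = self.secondary[a % self.n2]
        if self.primary[line] == a:
            return 1
        # Either way the requested block ends up in the primary cache and
        # the previous occupant of that line moves to the secondary cache.
        victim = self.primary[line]
        self.primary[line] = a
        if a in sset:
            sset.remove(a)
            self._demote(victim)
            return 2
        self._demote(victim)
        return None
```

For example, with `HybridAccessCache(4, 2, 2)`, blocks 1 and 5 compete for primary line 1; once both are loaded, each read of the block currently held in the secondary cache costs two cycles and swaps the pair, and at no point is a block present in both sections.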
Although the two caches are organized in a hierarchy, a HAC is substantially different from a conventional two-level cache. Two-level cache systems typically place the second level off-chip, and make it much larger in order to reduce capacity misses [4]. In a HAC, the secondary cache is on-chip and integrated with the primary cache, and is primarily used to alleviate conflict misses. Also, most two-level caches have the multilevel inclusion property [4], meaning that a copy of everything in the first-level cache is also in the second-level cache. In a HAC, no line can be in both caches at the same time.¹
The primary and secondary caches may be different sizes, and their sets may be accessed using different hashing functions s1 and s2. However, since data is exchanged between the two caches, their line sizes must be the same. Therefore, the two caches can be represented by the triples (N1, 1, L) and (N2, K, L), where N1 and N2 are the number of sets in the primary and secondary caches, respectively, K is the associativity, and L is the number of words in a line.
Conceptually, a HAC performs a read in the following manner (refer to Figure 1). Assume a is the address with the line-offset bits removed:
1. In the first cycle, the cache reads the tags and data in line s1(a) of the primary cache and in set s2(a) of the secondary cache. (In the figure, a number beside an arc indicates in which cycle the output will appear. Moreover, the figure assumes s1 and s2 use bit selection.) The data from the primary cache is latched into the output buffer, becoming available at the end of the first cycle.
2. The single tag from the primary cache and the K tags from the secondary cache are compared against the appropriate bits of the address.
3. If the tag from the primary cache matches the address, then a primary hit occurs. The cache immediately cancels the probe of the secondary cache, so that it can be used in the next cycle. If the tag from the primary cache doesn't match, then the output buffer is invalidated, and the cache must operate for at least one more cycle.
4. In the second cycle (if there was no primary hit), the tag comparators from the secondary cache are examined. If one of them matches (a secondary hit), then the appropriate data line is latched into the output buffer. The cache must now move the selected data (and tag) from the secondary
flushed. When the data returns from main memory, the cache places it in the primary cache.
The swapping done in steps 4 and 5 imposes a restriction on the hashing functions:
$$ s_1(a_1) = s_1(a_2) \;\Rightarrow\; s_2(a_1) = s_2(a_2) \tag{1} $$
In other words, if two addresses map to the same line in the primary cache, then they should map to the same set in the secondary cache. The need for this restriction can be seen in the counterexample shown in Figure 2. Assume that memory blocks 00001 and 00101 are in the secondary cache (in sets 001 and 101, respectively), and memory block 01001 is in the primary cache (in set 01). What happens if the processor reads block 00101? That block should be moved from set 101 of the secondary cache to set 01 of the primary cache, displacing block 01001. But this victim must go into set 001 of the secondary cache, not set 101. The former set must flush its other address, while set 101 is left without a victim, and must be marked invalid. This results in a complex accessing scheme and wastes precious cache space.
Assuming that the codomain of s2 is the entire set space in the secondary cache, this restriction means
2 This swapping is not shown in Fig. 1 for clarity. Implementation is discussed in [11].
1 This is called two-level exclusive caching [8].
that N2 ≤ N1. If bit selection is used, then N1 and N2 must be powers of 2; typically, the rightmost log2 N1 bits of a are used to select the primary cache line, and the rightmost log2 N2 bits of that field are used to select the secondary cache set.
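Bit selection and the restriction on the hashing functions can be checked mechanically. The sketch below is illustrative only; the function names are hypothetical, and the check is exhaustive only over the addresses it is given.

```python
def bit_select(a, num_sets):
    """Select a set using the rightmost log2(num_sets) bits of address a."""
    assert num_sets > 0 and num_sets & (num_sets - 1) == 0  # power of 2
    return a & (num_sets - 1)


def satisfies_restriction(n1, n2, addresses):
    """True if addresses sharing a primary line always share a secondary set."""
    secondary_set_of_line = {}
    for a in addresses:
        line = bit_select(a, n1)
        sset = bit_select(a, n2)
        # Remember the first secondary set seen for this primary line and
        # report a violation if a later address disagrees with it.
        if secondary_set_of_line.setdefault(line, sset) != sset:
            return False
    return True
```

With N1 = 4 and N2 = 2 the restriction holds for all 5-bit block addresses. With N1 = 4 and N2 = 8 it fails: blocks 00001 and 00101 both select primary line 01 but select secondary sets 001 and 101, which is the situation described for Figure 2.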
3 Using the HAC model
The Hybrid Access Cache framework can be used to examine the design space of possible cache configurations and to gain insights into how well they might be expected to perform. In this section, we show how to do this in general, and show that specific hybrid schemes are equivalent to specific cases in the HAC model. Using the parameter space created by the HAC model, we explore a new cache configuration called the "half-and-half" cache as another possible HAC design.
3.1 The HAC design space
Figure 3 shows the HAC design space. The horizontal axis shows the percentage of cache lines allocated to associative accesses, i.e., the number of lines in the secondary cache, and the vertical axis shows the associativity of the secondary cache, K. (Line size and total cache size are omitted for clarity.) Moving to the right in the design space indicates that the HAC has more and more lines dedicated to
3 For simplicity, we assume that the associativity does not increase the access time of the secondary cache beyond two cycles. In the future, we intend to explore a more elaborate HAC model.
3.3 Exploring the HAC design space
The HAC design space is useful for illustrating the various design possibilities for a hybrid access cache, but it is also important to understand the relationship between this space and the behavior of real applications. Since the purpose of adding associativity to a cache is to reduce conflict misses, it is the interferences between addresses that should be considered. Sometimes, the interferences will be "broad," meaning that conflicts occur in many places. For instance, if two large arrays are mapped to the same region of the cache, there will be conflicts over many cache lines. Sometimes, interferences can be "deep," meaning that many addresses are competing for the same set. Such a situation may occur, for instance, with randomly-allocated pointer structures.
The horizontal axis of Figure 3 corresponds to the "breadth" of interferences, while the vertical axis corresponds to the "depth." Thus, the types of interferences prevalent in a given program correspond to a region of the HAC space, and can be used to indicate which kinds of HAC designs would be most appropriate for that application. For instance, the example, given above, of two large arrays mapping to the same sets in the cache would correspond to the bottom of Figure 3, and would best be served by a column-associative or a 2-way MRU cache. A victim
3.2 Restricting the HAC design space
Most points in Figure 3 would be difficult to implement because they would require set selection functions more difficult than simple bit selection. (Imagine, for instance, trying to build a HAC with a direct-mapped secondary cache containing 49% of the lines.) If we restrict the hashing functions to bit selection, then both N1 and N2 must be powers of 2.
Even with this restriction, we are still presented with a large range of sizes for the secondary cache. Given a primary cache of size N1, we can set N2 to any power of 2 up to N1, and set K to any positive integer (although in practice, K would be limited by the costs of high associativity). We may view this as another two-dimensional design space, but one for more efficient HAC designs (see Figure 4). In this design space, the vertical axis remains the same, but the horizontal axis is now the ratio of the number of sets for the primary cache to the number of sets for the secondary cache.
The various HAC designs have also been shown in the new space. The design space of the victim cache is a vertical line where N2 is 1 (the victim cache is fully
2. The other half of the lines form a 2^r-way set-associative secondary cache with ½(N/2^r) sets.⁴
The half-and-half cache superficially resembles the MRU cache (indeed, a half-and-half with r = 0 is identical to a 2-way MRU cache). However, there are several differences (if r > 0). A half-and-half cache may have more address conflicts than an MRU cache with the same associativity, because only half of the storage of the half-and-half is associative. However, the benefits of associativity in the MRU cache come at a price: as associativity increases, the size of the primary cache decreases, meaning that fewer and fewer lines can be accessed in one cycle.
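The size parameters of a half-and-half cache follow directly from N and r. As a small worked sketch (the function name is hypothetical), the triple (N1, N2, K) used in the HAC model can be computed as:

```python
def half_and_half_params(n_lines, r):
    """Return (N1, N2, K) for an N-line half-and-half cache with parameter r."""
    k = 2 ** r          # associativity of the secondary cache
    n1 = n_lines // 2   # half of the lines are direct-mapped (primary)
    n2 = n1 // k        # the other half is split into N/2 / 2^r sets
    # Sanity check: the two sections together account for every line.
    assert n2 >= 1 and n1 + n2 * k == n_lines
    return n1, n2, k
```

For a 1024-line cache, r = 0 gives (512, 512, 1), matching the remark that this case coincides with a 2-way MRU cache, while r = 1 gives (512, 256, 2). In every case N2 is a power of 2 no larger than N1, so bit selection can be used for both sections.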
cache would not be a good choice for this application, because the interference of the two arrays would generate too many "victims"; these would be driven out of the small secondary cache before they could be reused.
On the other hand, if there is not much "breadth" to the interferences in an application, then a victim cache would serve the application effectively. In some cases, an equally-sized HAC with a larger secondary cache, such as an MRU cache, could perform just as well. But this would depend on the size of the cache relative to the size of the program's working set. If the cache were relatively small, then most of the lines in the cache could be "active" at any given time, and they could be accessed more quickly if these lines were in the primary cache. Thus, while a high-K MRU cache could attain the same hit rate as a victim cache, it might cost more to use because more accesses come from the slower, secondary memory.
Thus, it is important to think about the interference patterns that occur at a given cache size and the region to which they correspond in the HAC design space, and to choose a cache design which is close to that region. Caches at one edge may be unsuitable for applications lying near the other side of the space. For instance, a direct-mapped cache, lying at the extreme left side of the HAC space, would probably give poor performance for any application with either depth or breadth in its interference.
Different applications can exhibit different types of interference patterns. Since most machines are expected to perform over a wide range of applications, it may be impossible to find a HAC that would be optimal for all applications. But from Figures 3 and 4, it can be seen that the existing designs are all near the edges of the HAC space.
A cache that is closer to the middle region of the space might be a better choice for a general-purpose cache than one at one edge of the space, because it would be likely to provide a more balanced cache design which may better accommodate both deep and broad interference patterns.
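One rough way to make "breadth" and "depth" concrete is to group the distinct blocks a program touches by the direct-mapped set they select. The sketch below is only an illustrative proxy: the function name and the two summary numbers are assumptions introduced here, not metrics defined in this paper, and the measure ignores the order and timing of accesses.

```python
from collections import defaultdict


def interference_profile(block_addresses, num_sets):
    """Summarize static conflicts among blocks in a direct-mapped cache.

    Returns (breadth, depth): the number of sets with competing blocks,
    and the largest number of distinct blocks competing for one set.
    """
    blocks_per_set = defaultdict(set)
    for block in block_addresses:
        blocks_per_set[block % num_sets].add(block)
    contested = [len(b) for b in blocks_per_set.values() if len(b) > 1]
    breadth = len(contested)
    depth = max(contested, default=1)  # 1 means no set is contested
    return breadth, depth
```

Two 64-block arrays that map to the same region of a 256-set cache give a broad but shallow profile, (64, 2), whereas six scattered blocks that all select the same set give a narrow but deep one, (1, 6).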
4 Simulation experiments
In order to validate our hypotheses about the relationship between application programs and the HAC design space, and to compare the half-and-half cache to other caches, we ran a series of trace-driven simulations. We programmed a cache simulator to emulate various hybrid access caches with three different sizes, feeding it 49 traces. In brief, we found the following: All hybrid cache designs performed considerably better than one-level direct-mapped caches. For small (8K) caches, the victim caches performed best overall, especially on floating-point applications with large working sets, confirming that the left side of Figure 3 is most relevant when cache sizes are small. For medium (32K) caches, the half-and-half caches performed best, demonstrating that the middle of the HAC design space is a good area to explore for this size. For large (128K) caches, the half-and-half caches also were the best, but the differences in performance among all the caches were small.
These findings confirm the predictions we made while analyzing the HAC design space (see Section 3.3). We also examined the effects of multitasking, and found that this did not significantly change the results.
3.4 The half-and-half cache
To exploit the "middle region" of the HAC design space as suggested in the previous section, we have chosen to make a case study of the "half-and-half" cache [11]. Half-and-half caches lie on a straight line running through the middle of the HAC design space (in both Figures 3 and 4). An N-line half-and-half cache has the following properties:
4.1 Cache designs simulated
We simulated a fully-associative cache, a direct-mapped cache, three MRU caches, a standard victim
1. Half of the lines (½N) are in the primary cache.
4 A detailed derivation can be found in [11].
Table 1: Size parameters for simulated caches
cache with a 16-line fully-associative secondary, a variation of the victim cache in which the secondary is 4-way set-associative, a column-associative cache, and three half-and-half caches. Each cache came in three sizes: 8K, 32K and 128K, representing a range from an average value for present-day on-chip caches to what will likely be possible in a few years. We used 32-byte lines, so if N is the number of lines in the cache, then N ∈ {256, 1024, 4096}.
The caches are compared in Table 1. The table gives the parameters N1 and N2 in terms of N, and the associativity K. Each cache has N lines total, except for VC16 and VC4, which have a full-sized direct-mapped cache plus a 16-line victim cache.
In this study, we only measured the effects of non-compulsory reads, and we only considered data caches. It was assumed that all hits in the primary cache take one cycle and all secondary cache hits take two cycles. We also made this assumption for CA, even though the original publication rated it at 4 [2], because it can be equipped with the same switching hardware as the other HAC's.
For this study, we assumed a miss penalty of 20 in eq. 3. This value is arbitrary and can affect the comparison of the HACs (for instance, increasing the miss penalty would tend to benefit MRU caches more than others). Nevertheless, this value is similar to that used in other cache comparisons, and represents a middle ground between high-end machines with fast secondary caches and low-end machines with on-chip caches only.
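The arithmetic behind these parameters can be sketched as follows. The line counts follow directly from the cache sizes and the 32-byte line. The access-time function, however, is only an assumed form: eq. 3 is not reproduced in this section, so the weighted average below (one cycle per primary hit, two per secondary hit, and the miss penalty per miss, with no extra probing cycles charged to a miss) and all names are illustrative assumptions.

```python
LINE_BYTES = 32  # line size used in the simulations


def num_lines(cache_bytes, line_bytes=LINE_BYTES):
    """Number of lines N in a cache of the given capacity."""
    return cache_bytes // line_bytes


def average_access_time(primary_hits, secondary_hits, misses, miss_penalty=20):
    """Assumed weighted-average access time in cycles (not the paper's eq. 3)."""
    total = primary_hits + secondary_hits + misses
    cycles = 1 * primary_hits + 2 * secondary_hits + miss_penalty * misses
    return cycles / total
```

For instance, `num_lines(8 * 1024)` gives 256 lines, and under the assumed formula a trace with 90 primary hits, 5 secondary hits and 5 misses averages 2.0 cycles per read, half of which comes from the misses alone.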
4.3 Results
Results for the single-process SPEC runs, the ATUM traces, and the non-interleaved versions of the multitasking traces are shown in Figures 5–7. For each cache size and trace group, we plot the performance of the HAC's and the DM cache. Each bar shows the ratio of a cache's average access time (as defined in eq. 3 in Section 2.2) to the average access time of the FA (fully associative) cache for the same trace. (Generally, the FA represents the "ideal" limit, in which conflict misses are minimized, but in some experiments, the FA performs worse than some of the HAC's, due to anomalous access patterns.)
A few observations can be made by comparing the caches for different sizes. Figure 5 shows the greatest performance variations between hybrid caches. This
4.2 Trace sources
In our simulations we used three groups of traces, which are summarized in Table 2. The first set of traces were single-process traces generated from the SPEC89 benchmarks. They are
performance of MRU4 compared to MRU2 suggests that more depth is needed. Both MRU and HH can provide this depth, but HH has more fast cache at its disposal. HH2 wastes 4% fewer cycles, on average, than MRU4, 6% fewer than CA, and 20% fewer than VC16. Again, it appears that the middle of the HAC space gives the best overall results.
For the largest cache, all the HAC's except the victim caches perform roughly equally. They all provide enough breadth, and the size of the cache makes depth less of an issue. The half-and-half caches are slightly better, but given the costs of tag matching circuits, it would probably be more cost-effective to use a column-associative cache. Of course, a sufficiently large problem could completely change this assessment.
There is very little difference between VC16 and VC4, suggesting that a fully-associative victim cache is probably not worth the extra hardware cost, and a set-associative victim cache may perform just as well at a lower cost.
The multitasking traces are shown in Figures 8–10. Each shows, for each of the multitasking traces and each of the 5 grain sizes, the absolute difference between the access time for a cache and the access time for the same cache on the non-interleaving version of the same trace. From these graphs, it can be seen that multitasking has a big impact on cache usage. The effects increase as tasks are switched more frequently, as could be expected. At the rapid-switching end of the spectrum it can be seen that MRU8 and HH8 give good results, showing the importance of depth of associativity when there is frequent context-switching.
can help the designer to see all the possibilities as part of a continuum and to understand how these design decisions can affect performance.
Acknowledgements
This research was supported by the Natural Sciences and Engineering Research Council (NSERC) and MICRONET, both funded by the Canadian government, and by a Concordia FRDP Grant.
5 Conclusions
In this paper, we have proposed a unified framework for the design of Hybrid Access Caches (HACs) as on-chip caches. Under our model, a rich design space is structured in a way which helps cache designers systematically explore, experiment with, and compare various combinations of hybrid cache designs in order to meet performance and cost requirements. With the help of this framework, we conceived a novel HAC organization called the half-and-half cache. Our simulation results demonstrated that for moderate-sized caches, the half-and-half cache has better performance than other hybrid caches for most applications, as predicted by the HAC design space.
Other points in the HAC design space are possible, and may be even more desirable from a performance or implementation standpoint. The HAC design space