A Comparison of Locality-Based and Recency-Based Replacement Policies

Hans Vandierendonck and Koen De Bosschere
Ghent University, Sint-Pietersnieuwstraat 41, B-9000 Gent, Belgium
{hvdieren,kdb}@elis.rug.ac.be
WWW home page: http://elis.rug.ac.be/~{hvdieren,kdb}/

Abstract. Caches do not grow in size at the speed of main memory or raw processor performance. Therefore, optimal use of the limited cache resources is of paramount importance to obtain good system performance. Instead of a recency-based replacement policy (such as LRU), we can also make use of a locality-based policy, based on the temporal reuse of data. These replacement policies have usually been constructed to operate in a cache with multiple modules, some of them dedicated to data showing high temporal reuse, and some of them dedicated to data showing low temporal reuse. In this paper, we show how locality-based replacement policies can be adapted to operate in set-associative and skewed-associative [8] caches. In order to understand the benefits of locality-based replacement policies, they are compared to recency-based replacement policies, something that has not been done before.

1 Introduction

Trends in microprocessor development indicate that microprocessors gain in speed much faster than main memory. This discrepancy is called the memory gap. The memory gap can be hidden using multiple levels of cache memories, but even then the delays introduced by the caches and main memory are becoming so large that the memory hierarchy remains a bottleneck in processor performance.

An important part of the cache design is the replacement policy, which decides what data may be evicted from the cache. A recent approach to better replacement policies uses the locality properties of the memory reference stream. Studies of such replacement policies use cache organisations consisting of multiple cache modules. Each module is a conventional cache and is dedicated to data with a specific locality type. A typical organisation is a direct mapped cache dedicated to data exhibiting temporal locality, combined with a smaller, fully associative cache for data with non-temporal or highly spatial locality.

Such a cache organisation poses serious design problems. Since data can be found in either module, a multiplexer is needed to select the data from one of the modules, increasing the cache lookup time. Furthermore, a direct mapped cache has an inherently shorter access time than a fully associative cache, so it is not always possible to find two modules with the same access time. This results in an unbalanced design with one module in the critical path.

Because of these difficulties, we propose to use locality-sensitive replacement policies in simple cache organisations, including the set-associative and the skewed-associative cache. However, applying locality-sensitivity to set-associative caches is not straightforward, because the operation of these replacement policies is closely interwoven with the organisation of a multi-module cache: the locality type of a block is derived from the module it is stored in. Therefore, we propose to label the blocks with their locality type, so that the replacement policy can make use of this information.

This paper is organised as follows. In section 2, we describe the various cache organisations. Section 3 describes replacement policies and their extension to set-associative and skewed-associative caches. Section 4 presents the experimental setup and section 5 discusses the simulation results. Section 6 discusses related work and section 7 summarises the main conclusions.

2 Cache Organisations

The most widespread cache organisation is the set-associative cache. In a set-associative cache, memory blocks are mapped to cache sets by extracting bits from the block number (Figure 1(a)). An n-way set-associative cache can contain n blocks from the same set. If n is 1, the cache is called direct mapped. When there is only one set in the cache, the cache is called fully associative.
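As a concrete illustration, the following minimal Python sketch shows how the set index is derived from an address. The parameters correspond to the 8 kB two-way cache with 32-byte blocks evaluated later in the paper; the code itself is only an illustration, not part of the original designs.

BLOCK_SIZE = 32   # bytes per block, as used in this paper
NUM_SETS = 128    # e.g. an 8 kB, 2-way cache: 8192 / (32 * 2) sets

def set_index(address: int) -> int:
    block_number = address // BLOCK_SIZE   # strip the block-offset bits
    return block_number % NUM_SETS         # extract the set-index bits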

Fig. 1. Three cache organisations: (a) set-associative cache, (b) multi-module cache, (c) skewed-associative cache.

In a multi-module cache, multiple cache modules operate in parallel (Figure 1(b)). Each cache module can be thought of as a conventional cache, possibly having a different associativity, block size, etc. A memory request is sent to all cache modules simultaneously, so extra combining logic is needed to obtain the result from the correct module. If data can only be cached in one module, then the combining logic consists of a multiplexer which selects the data from the module with a cache hit. We study multi-module caches with two modules having the same block size. In the remainder of this paper, we call the cache modules A and B. The sets in module A and module B are called A-sets and B-sets, respectively.

A skewed-associative cache is a multi-bank cache. Each bank is indexed by a different set index function [8]. Furthermore, the index functions are designed in such a way that blocks are dispersed over the cache: when two blocks map to the same frame in one bank, they do not necessarily map to the same frames in the other banks. An n-way skewed-associative cache can be modelled as a multi-module cache, where each module is direct mapped and corresponds to a bank of the skewed-associative cache.
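To make the skewing idea concrete, here is a small Python sketch of two per-bank index functions. The XOR-based hash for the second bank is an illustrative stand-in; the actual index functions used in our experiments are those defined in [10].

NUM_FRAMES = 128  # frames per bank, e.g. a 4 kB bank of 32-byte blocks

def index_bank0(block_number: int) -> int:
    # Bank 0: conventional bit extraction.
    return block_number % NUM_FRAMES

def index_bank1(block_number: int) -> int:
    # Bank 1: XOR two bit fields, so blocks that conflict in bank 0
    # are dispersed over different frames here.
    low = block_number % NUM_FRAMES
    high = (block_number // NUM_FRAMES) % NUM_FRAMES
    return low ^ high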

3 Replacement Policies

In this section we describe the recency-based replacement policies and the locality-based replacement policies.

3.1 Recency-Based Replacement Policies

The least recently used (LRU) policy is commonly used in set-associative caches. Since the complexity of the LRU algorithm scales as the square of the number of blocks involved [11], it would be impractical to implement it for multi-module caches; other recency-based algorithms are thus needed. We discuss the not recently used and the enhanced not recently used policies, which have been constructed for skewed-associative caches [8–10].

In the not recently used (NRU) policy, a one-bit tag is associated with every block in the cache. The tag bit is asserted every time the block is accessed and signals that the block is young. The NRU policy also requires a (global) counter, which keeps track of the number of young blocks in the cache. When the counter reaches a certain threshold (a good threshold is reported to be half the size of the cache, expressed as a number of blocks [9]), all blocks in the cache have their tag bit reset and the counter is reset as well. The NRU policy selects its victim randomly among all old blocks in the cache. If there are no old blocks, it selects a block randomly among the young blocks.

The enhanced not recently used (ENRU) policy is an improvement of the NRU policy. It uses two tag bits for each cache block and divides the cache blocks into three categories: very young, young and old. The ENRU policy selects a victim block at random, first among the old blocks, then the young blocks and, if necessary, among the very young blocks.
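The following Python sketch illustrates the NRU bookkeeping described above. The class layout and names are ours; a hardware implementation would of course operate on per-block tag bits and a counter rather than Python lists.

import random

class NRUCache:
    """Sketch of NRU state for a cache of `size` blocks."""

    def __init__(self, size: int):
        self.young = [False] * size        # one tag bit per block
        self.young_count = 0               # global counter of young blocks
        self.threshold = size // 2         # half the cache size in blocks [9]

    def touch(self, frame: int) -> None:
        # On every access, mark the block young.
        if not self.young[frame]:
            self.young[frame] = True
            self.young_count += 1
        if self.young_count >= self.threshold:
            # Global reset: all blocks become old again.
            self.young = [False] * len(self.young)
            self.young_count = 0

    def victim(self, candidates: list[int]) -> int:
        # Prefer a random old block; fall back to a random young one.
        old = [f for f in candidates if not self.young[f]]
        return random.choice(old if old else candidates)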

3.2 Locality-Based Replacement Policies

Several locality-sensitive replacement policies have been proposed in the literature. We focus on the non-temporal streaming cache and the allocation by conflict policy. In these cache organisations, each module is dedicated to data exposing a specific type of locality. Hence, the replacement policy can be decomposed into two consecutive steps: (1) determining the locality type of the data and thus the module and (2) selecting a victim from that module, e.g. using LRU.

The non-temporal streaming (NTS) cache [5] consists of three units: a temporal module A, a non-temporal module B and a locality detection unit. When a block is placed in the cache, it is placed in either module A or module B, depending on the locality properties reported by the locality detection unit. Blocks exposing mainly temporal locality are placed in module A, the other blocks in module B.

A block exposes temporal locality when at least one word in the block is used at least twice between loading the block and evicting it from the cache. Each block in the cache has one locality bit associated with it and each word in the cache has a reference bit. When a block is loaded into the cache, the block's locality bit is set to zero, indicating non-temporal use, and the reference bit of the requested word is set to one. If the reference bit of the requested word is one on a cache hit, then the locality bit is also set to one, indicating temporal use. The detection unit is a small fully-associative cache, where the locality bits of evicted blocks are saved. In our implementation, only non-temporal blocks are stored in the detection unit's cache, since missing blocks have temporal locality.

In the allocation by conflict (ABC) policy [13], blocks are locked in a direct mapped cache until the number of misses exceeds the number of references to the block. The ABC policy adds a conflict bit to each block in module A. The conflict bit is set to zero when a reference is made to the associated block; it is set to one on each cache miss which maps into the same A-set. In each module, there is a candidate block for replacement, selected by the module's LRU policy. One of these blocks is then selected using the conflict bit of the block in module A.

3.3 Locality-Sensitive Replacement Policies for Skewed- and Set-Associative Caches

A locality-sensitive replacement policy has to know the locality properties of the data in the cache, so we label each block with its type of locality. We use the not recently used (NRU) policy to account for an aging effect. The policies work similarly to NRU: they define several categories of blocks and search them in the listed order.

The temporality-based NRU policy (TNRU) is a combination of NRU and NTS. The locality properties of the data are defined and detected in exactly the same way as in NTS. The recency ordering between cache blocks is maintained using the NRU policy. The TNRU policy distinguishes between four categories of blocks: old and non-temporal, old and temporal, young and non-temporal, and young and temporal. The temporal properties of a block are decided using the locality information at the time the block is loaded into the cache.

The second replacement policy we propose is the conflict-based NRU policy (CNRU), based on the ABC policy. Each block in the cache has a conflict bit, which is managed in the same way as in ABC. The CNRU policy distinguishes between four categories of cache blocks, namely those that are old and not proven, young and not proven, old and proven, and young and proven. A block is proven when its conflict bit is one.
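The category search used by TNRU can be sketched as follows in Python. The block representation (a dict with young and temporal flags) is an illustrative assumption.

import random

def tnru_victim(blocks: list[dict]) -> dict:
    # Categories in order of decreasing eviction preference.
    categories = [
        lambda b: not b["young"] and not b["temporal"],  # old, non-temporal
        lambda b: not b["young"] and b["temporal"],      # old, temporal
        lambda b: b["young"] and not b["temporal"],      # young, non-temporal
        lambda b: b["young"] and b["temporal"],          # young, temporal
    ]
    for in_category in categories:
        matches = [b for b in blocks if in_category(b)]
        if matches:
            return random.choice(matches)  # random victim within a category
    raise ValueError("no candidate blocks")

CNRU follows the same pattern, with the temporal bit replaced by the conflict ("proven") bit and the categories searched in the order given above: old and not proven, young and not proven, old and proven, young and proven.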

4 Experimental Evaluation

We evaluated the performance of the replacement policies in three cache organisations: a multi-module cache, a skewed-associative cache and a set-associative cache. All caches have 32-byte blocks and use demand fetching. The skewed-associative and set-associative caches are both 8 kB large and have associativity two. The skewed-associative cache uses the index functions defined in [10].

The multi-module cache was chosen such that the modules have approximately the same cycle time. We used the CACTI model [14] to obtain access times of several cache organisations in a 0.13 µm technology. A fully associative 1 kB module has a 1.3 ns cycle time. To match this cycle time, the other module should be 2-way set-associative and either 8 kB (1.23 ns) or 16 kB (1.32 ns) large. Another possibility is to use a very large direct mapped cache, e.g. a 64 kB cache (1.36 ns). We chose to combine an 8 kB 2-way set-associative cache with a 1 kB fully associative cache.

SPECfp     MM     SA     SK      SPECint    MM     SA     SK
applu      0.071  0.074  0.074   compress   0.045  0.051  0.049
apsi       0.048  0.085  0.040   gcc        0.044  0.063  0.048
fpppp      0.012  0.024  0.023   go         0.010  0.032  0.020
hydro2d    0.179  0.178  0.180   ijpeg      0.021  0.043  0.024
mgrid      0.054  0.054  0.055   li         0.038  0.045  0.041
su2cor     0.049  0.051  0.051   m88ksim    0.011  0.021  0.013
swim       0.080  0.609  0.091   perl       0.015  0.035  0.027
tomcatv    0.153  0.453  0.151   vortex     0.022  0.045  0.028
turb3d     0.061  0.083  0.050
wave5      0.164  0.316  0.167

Table 1. Miss ratios of the LRU policy in the different cache organisations.

We collected traces of all SPEC95 benchmarks using ATOM [12]. The number of memory references in each trace was limited to 300 million, taken from the middle of the program. Miss ratios are used as the performance measure. However, since miss ratios vary greatly from program to program, the miss ratios are divided by the miss ratio of the LRU policy in the same cache organisation. Table 1 contains the miss ratios of the LRU policy in the different cache organisations for reference; MM is the multi-module cache, SA the set-associative cache and SK the skewed-associative cache.
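A minimal sketch of this normalisation, assuming the H-mean reported in the figures is the harmonic mean of the per-benchmark relative miss ratios:

def relative_miss_ratios(policy: dict, lru: dict) -> dict:
    # Divide each benchmark's miss ratio by LRU's in the same organisation.
    return {bench: policy[bench] / lru[bench] for bench in policy}

def h_mean(values: list[float]) -> float:
    # Harmonic mean, as used for the H-mean bars in the figures (assumed).
    return len(values) / sum(1.0 / v for v in values)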

5 Discussion of the Results

For the multi-module cache, the most eye-catching result is that the locality-based replacement policies perform very badly for some benchmarks (Figure 2). In one case, the miss ratio increases by 175% (ABC for the benchmark tomcatv). In other cases, the miss ratio of a benchmark can increase by as much as 5 or 10%. The replacement policies usually have a miss ratio that is worse than that of the LRU policy. For the SPECfp benchmarks, the ENRU and CNRU policies perform best; these policies also closely follow the miss ratio of the LRU policy. For the SPECint benchmarks, the CNRU and ABC policies provide the best results. In contrast to what happens for the SPECfp benchmarks, all replacement policies sometimes perform 10 to 20% worse than the LRU policy.

Fig. 2. Relative performance of replacement policies in the 8 kB two-way set-associative and 1 kB fully-associative multi-module cache for SPECfp (left) and SPECint (right).

However, this unexpected behaviour is not as bad as it is for the SPECfp benchmarks. Furthermore, the CNRU policy generally works better than the ABC policy, on which it is based. The same relation holds between TNRU and NTS.

Figure 3 shows the results for the set-associative cache. The ENRU policy has about the same performance as the LRU policy, while it has a larger cost. For some benchmarks, the TNRU policy provides a big improvement with respect to the LRU policy (e.g. 13% for fpppp and 8.6% for perl). On average, TNRU performs about 1% better than LRU for both SPECint and SPECfp.

In the skewed-associative cache, the ENRU policy usually performs worse than the LRU policy, except for fpppp (Figure 4). This was also reported in [10]. The locality-sensitive policies perform very well for the benchmark fpppp and they also work better than the ENRU policy for most benchmarks. Overall, the miss ratio of TNRU is about 2% lower than that of ENRU, both for SPECint and SPECfp. However, neither CNRU nor TNRU performs better on average than the LRU policy.

Fig. 3. Relative performance of replacement policies in the 8 kB two-way set-associative cache, for SPECfp (left) and SPECint (right).

Fig. 4. Relative performance of replacement policies in the 8 kB skewed-associative cache, for SPECfp (left) and SPECint (right).

6 Related Work

Many different locality-sensitive replacement policies and accompanying cache organisations have been proposed. The NTS cache was introduced in [4] and was slightly changed in [5]; the main difference between the two versions is the way the locality properties of evicted blocks are remembered. Our implementation is based on [5].

The dual data cache [2] dedicates one module to data with high temporal locality, while the other module caches data with high spatial locality. The locality properties are detected by treating all fetched data as vectors. The stride and vector length are measured and used to define three types of locality: non-vector data, short vectors and self-interfering vectors. The latter type is not cached at all. The speedup of the dual data cache is largely due to this selective caching [2]. Alternatively, a compiler can detect the stride and vector length, as well as self- and group-reuse [6, 7].

Several processors have implemented multi-module caches. The data cache of the HP PA-RISC 7200 consists of a large direct mapped cache and a small fully associative cache [1]; the purpose of the fully associative cache is to decrease the number of conflicts in the direct mapped cache. Another approach is taken in the UltraSPARC III, which has multi-module L1 and L2 data caches [3]. These caches are managed by splitting the reference stream, not on the basis of locality properties, but on the origin of the transfers between the caches.

7 Conclusions

We discussed the problems associated with applying locality-sensitive replacement policies to set-associative and skewed-associative caches. We extended two replacement policies from the literature by labelling each block with its locality type. We compared the locality-sensitive replacement policies to recency-based replacement policies like LRU and ENRU.

Overall, we find that the locality-sensitive replacement policies have approximately the same performance as recency-based policies. Recency-based replacement policies can manage a multi-module cache as well as or better than locality-sensitive policies. Furthermore, the locality-based policies are mostly suited to the SPECfp benchmarks, although they show very poor behaviour for some benchmarks; in contrast, the recency-based policies are more well-behaved. For set-associative caches, the locality-based replacement policy TNRU decreases the miss ratio slightly (by about 1%) with respect to LRU. In a skewed-associative cache, the TNRU policy provides a 2% improvement over the ENRU policy.

Acknowledgements

The authors thank Lieven Eeckhout for his helpful comments in proof-reading this manuscript. Hans Vandierendonck is supported by a grant from the Flemish Institute for the Promotion of Scientific-Technological Research in the Industry (IWT). Koen De Bosschere is a research associate with the Fund for Scientific Research-Flanders.

References

1. Kenneth K. Chan, Cyrus C. Hay, John R. Keller, Gordon P. Kurpanek, Francis X. Schumacher, and Jason Zheng. Design of the HP PA 7200 CPU. Hewlett-Packard Journal, 47(1), February 1996.
2. A. González, C. Aliagas, and M. Valero. A data cache with multiple caching strategies tuned to different types of locality. In ICS'95: Proceedings of the 9th ACM International Conference on Supercomputing, pages 338–347, 1995.
3. T. Horel and G. Lauterbach. UltraSPARC-III: Designing third-generation 64-bit performance. IEEE Micro, 19(3):73–85, May 1999.
4. J. A. Rivers and E. S. Davidson. Reducing conflicts in direct-mapped caches with a temporality-based design. In Proceedings of the 1996 International Conference on Parallel Processing, volume 1, pages 154–163, August 1996.
5. J. A. Rivers, E. S. Tam, G. S. Tyson, E. S. Davidson, and M. Farrens. Utilizing reuse information in data cache management. In ICS'98: Proceedings of the 1998 International Conference on Supercomputing, pages 449–456, 1998.
6. F. Jesús Sánchez, Antonio González, and Mateo Valero. Software management of selective and dual data caches. IEEE Technical Committee on Computer Architecture Newsletter, pages 3–10, March 1997.
7. Jesús Sánchez and Antonio González. A locality sensitive multi-module cache with explicit management. In ICS'99: Proceedings of the 1999 International Conference on Supercomputing, pages 51–59, Rhodes, Greece, June 1999.
8. A. Seznec. A case for two-way skewed associative caches. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 169–178, May 1993.
9. A. Seznec. A new case for skewed-associativity. Technical Report PI-1114, IRISA, July 1997.
10. A. Seznec and François Bodin. Skewed-associative caches. In PARLE'93: Parallel Architectures and Languages Europe, pages 305–316, Munich, Germany, June 1993.
11. Alan Jay Smith. Cache memories. ACM Computing Surveys, 14(3):473–530, September 1982.
12. Amitabh Srivastava and Alan Eustace. ATOM: A system for building customized program analysis tools. Technical Report 94/2, Western Research Laboratory, March 1994.
13. Edward S. Tam. Improving Cache Performance Via Active Management. PhD thesis, University of Michigan, 1999.
14. Steven J. E. Wilton and Norman E. Jouppi. An enhanced access and cycle time model for on-chip caches. Technical Report 93/5, Western Research Laboratory, July 1994.
