Cache With Adaptive Fetch Size

Technical Report ICS-00-16
April 22, 2000

Weiyu Tang, Alexander V. Veidenbaum, Alexandru Nicolau, Rajesh Gupta
Department of Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425, USA
(949) 824-8168

Abstract
Current cache designs support only a fixed line size, which limits the cache's ability to utilize spatial/temporal locality. Previously, we proposed an ALS cache, where each line's behavior is monitored and its future size is predicted. In this report, we present a cache design with adaptive fetch size. In this cache, the fetch size is stable for a period of time (an interval); at the end of an interval, the fetch size for the next interval is predicted based on the memory access behavior in the previous interval. Two strategies for changing the fetch size are proposed, sampling-based and locality-based. In sampling-based fetch size change, several possible fetch sizes are tested for a short period, and the fetch size with the minimal miss rate is then chosen for the rest of the interval. In locality-based fetch size change, the fetch size is based on the observed spatial utilization of the cache during the previous interval. Overall, better performance is achieved by this novel cache design. Simulations of SPEC95 benchmarks show that this cache design can significantly reduce cache miss rates: as much as a 24 percent miss reduction for the L1 cache and a 57 percent miss reduction for the L2 cache. At the same time, the increase in traffic is small: as little as 22 percent between the L2 and L1 caches and 32 percent between L2 and the main memory.
Contents

1 Introduction
2 Motivation
3 Related Work
4 Experimental Setup and Benchmarks
  4.1 Setup
  4.2 Benchmarks
5 Algorithms to adapt fetch size
  5.1 Sampling-based fetch size prediction
    5.1.1 Intuition
    5.1.2 Sampling-based algorithm design
    5.1.3 Hardware cost
  5.2 Locality-based fetch size adaptation
    5.2.1 Spatial Locality detection
    5.2.2 Next fetch size prediction with fixed thresholds
  5.3 Next fetch size prediction with adaptive thresholds
    5.3.1 Hardware cost
6 Performance
  6.1 Parameters in locality-based algorithms
    6.1.1 Choice of adaptation interval length
    6.1.2 Limitation of fixed threshold values
    6.1.3 Comparison of fixed and aging threshold algorithms
  6.2 Performance with different adapting algorithms
    6.2.1 Miss rate reduction
    6.2.2 Normalized traffic
    6.2.3 Comparison with adaptive line size cache
7 Conclusion
References
List of Figures

1  Optimal line size for SPEC95 benchmarks
2  Optimal line size for IJPEG with 32KB fixed line size cache
3  Relationship between Physical Cache Line and Virtual Cache Lines
4  Sampling overview
5  Next fetch size prediction with fixed thresholds
6  Next fetch size prediction with aging thresholds
7  Miss rates with adaptation interval of 1M and 100K memory accesses
8  Miss rates with different threshold values for fixed threshold algorithm
9  Miss rates for fixed threshold and aging threshold algorithms
10 Miss rate reduction for L1-32KB cache
11 Miss rate reduction for L2-256KB cache
12 Normalized traffic for L1-32KB cache
13 Normalized traffic for L2-256KB cache
14 Miss rate reduction for L1-32KB cache
15 Miss rate reduction for L2-256KB cache
1 Introduction

In current cache designs, a cache consists of multiple lines of equal size. This size is used in the cache mapping function to determine the cache line to which an address is mapped. For a cache with a fixed line size, the choice of line size is based on the spatial and temporal locality of average benchmarks. This fixed line size limits the cache's ability to exploit locality. A cache with multiple line sizes has the advantage of changing the line size based on the locality inherent in an application.
This work was supported in part by the DARPA ITO under Grant DABT63-98-C-0045.
In our previous research [4], we proposed a cache with adaptive cache line size (the ALS cache), where each individual cache line can have a different size. For most applications, an adaptive line size cache can achieve a better miss rate than the optimal miss rate of a fixed line size cache. In this research, we take a different approach to adapting to variations in locality across applications. For a traditional cache, the fetch size is equal to the cache line size and one miss-fetch fills one cache line. The fetch size can also differ from the cache line size, so that one miss-fetch fills multiple cache lines. We conjectured that adapting the fetch size could achieve the same benefits as adapting the cache line size.

The rest of this report is organized as follows. Section 2 discusses the motivation for adapting the fetch size dynamically. Section 3 describes related work. Section 4 describes the environment for the cache simulations. Section 5 presents the algorithms to adapt the fetch size. Section 6 shows the performance of the different fetch size adapting algorithms. Section 7 presents the conclusion and directions for future research.
2 Motivation
Figure 1: Optimal line size for SPEC95 benchmarks

Previous research [4] has shown that the optimal cache line size changes across applications and across regions of the same application. The reason is that a cache exploits the spatial and temporal locality of applications, and the locality inherent in an application may change frequently. Figure 1 shows how the optimal cache line size changes across the SPEC95 benchmarks for a 32KB cache. Of all 16 benchmarks, a 256B line size is optimal for 4 benchmarks, 128B is optimal for 7, 32B is optimal for 3, and 16B is optimal for 2.

Figure 2: Optimal line size for IJPEG with 32KB fixed line size cache

Figure 2 shows how the optimal line size changes over time in IJPEG. The cache size is 32KB, and each point in the figure is the miss rate during an interval of one million memory accesses. We obtain the miss rate for each interval using different line sizes during the cache simulation. The optimal line size is the line size with the minimal miss rate; line sizes of 32B, 64B and 128B are optimal for 25, 47 and 28 percent of the time, respectively.

For a traditional cache, the line size and the fetch size are the same; we call this a fixed fetch line cache (FFL cache). The fetch size can also differ from the line size, so that multiple cache lines are filled on one cache miss; we call this an adaptive fetch line cache (AFL cache). In this research, we are interested only in fetch sizes equal to or larger than the cache line size; the fetch size is a power-of-two multiple of the cache line size. Suppose the cache line size is 2^p bytes and the fetch size is 2^(p+v) bytes. We need the following definitions.
physical cache line (PCL): A PCL is an actual cache line in an AFL cache. The PCL line size is used in the tag computation that determines whether an access hits or misses in the cache. A PCL can be represented as a pair

    (line_num, tag)

where line_num is the position of the PCL in the cache and tag is the memory address of the start of the line of data held in the PCL.

virtual cache line (VCL): A VCL consists of multiple contiguous PCLs. The VCL line size is equal to the fetch size, and one miss-fetch fills one VCL. A VCL can be represented as a triple

    (i·2^v, (i+1)·2^v - 1, tag)

where i·2^v is the first PCL of the VCL, (i+1)·2^v - 1 is the last PCL of the VCL, and tag is the starting memory address for the first PCL.
neighboring virtual cache lines: Two VCLs (i·2^v, (i+1)·2^v - 1, tag1) and (j·2^v, (j+1)·2^v - 1, tag2) are "neighboring" if they are part of a larger VCL of double line size. That is, they are "neighboring" if there exists an integer k such that one of the following conditions holds:
1. i == 2k AND j == 2k + 1
2. i == 2k + 1 AND j == 2k

neighboring tags: Tags tag1 and tag2 of two neighboring VCLs (2k·2^v, (2k+1)·2^v - 1, tag1) and ((2k+1)·2^v, (2k+2)·2^v - 1, tag2) are neighboring if they satisfy the condition

    (tag2 - tag1) == 2^(p+v)
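To make the definitions concrete, the following minimal C++ sketch captures the index and tag arithmetic; the helper names and the sample values of p and v are our own illustration, not part of the proposed hardware.

    #include <cstdint>

    const unsigned p = 5; // assumed: 2^p = 32B physical cache line
    const unsigned v = 2; // assumed: fetch size of 2^v = 4 PCLs, i.e. 2^(p+v) = 128B

    // Index of the VCL that contains a given PCL.
    uint64_t vcl_index(uint64_t pcl_index) { return pcl_index >> v; }

    // VCLs i and j are neighboring iff {i, j} = {2k, 2k+1} for some k,
    // i.e. their indices differ only in the lowest bit.
    bool neighboring_vcls(uint64_t i, uint64_t j) { return (i ^ j) == 1; }

    // Tags of two neighboring VCLs (lower tag first) are neighboring iff
    // the second VCL's data directly follows the first VCL's data in memory.
    bool neighboring_tags(uint64_t tag1, uint64_t tag2) {
        return tag2 - tag1 == (1ULL << (p + v));
    }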
Figure 3: Relationship between Physical Cache Line and Virtual Cache Lines

Figure 3 shows the relationship between PCLs and VCLs. In this example, the fetch size is four times the cache line size. VCL (0, 3, tag0) consists of PCLs 0, 1, 2 and 3; VCL (4, 7, tag1) consists of PCLs 4, 5, 6 and 7. (0, 3, tag0) and (4, 7, tag1) are neighboring VCLs because they would be part of the single VCL (0, 7, tag0) if the fetch size were eight times the cache line size.

Suppose there are two caches A and B, where cache A is an FFL cache, cache B is an AFL cache, and the line size of cache A is equal to the fetch size of cache B. There is a one-to-one mapping between one PCL in cache A and one VCL in cache B, and it can be proven by induction that if both caches are initially empty and see the same sequence of memory accesses, then both caches will always hold the same data. As both caches always hold the same data, a memory access will always hit or miss in both caches. Thus the performance of an AFL cache with fetch size l is the same as the performance of an FFL cache with line size l. By changing the fetch size dynamically, we effectively obtain a cache with multiple physical line sizes.
To be flexible, the line size of an AFL cache should be as small as possible so that a larger number of fetch sizes can be used. This flexibility comes at the cost of a larger tag store. However, the cache hit time, which is determined by the access time of the data array, is unchanged because the size of the data array remains the same and the mapping function complexity is the same.
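For example, with the L1 parameters used later in this report, a 32KB data array holds 32KB/16B = 2048 PCLs with 16B lines but only 32KB/256B = 128 lines with 256B lines, so the finest granularity requires sixteen times as many tag entries while the data array, and hence the hit time, is unchanged.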
3 Related Work

For a traditional cache, the fetch size is equal to the cache line size, but it can also differ from it. In previous research [13], a fetch size smaller than the cache line size has been proposed: only the words of a cache line that are predicted to be accessed in the future are fetched. This technique is useful for traffic reduction between the cache and the lower levels of the memory hierarchy. The average miss latency can be reduced, but the cache may be underutilized. A fetch size larger than the cache line size can also be used. This often reduces the miss rate, although sometimes at the cost of increased traffic between the processor and the cache.

In [3, 4], we proposed adapting the cache line size based on spatial locality, where each individual cache line can have a different size. Additional spatial locality exists if the addresses of two neighboring cache lines are also neighbors in memory; a cache line of double size is then used in the future to exploit that spatial locality. When a cache line is to be replaced and half of the line was not used, either there is not enough spatial locality or a conflict miss exists; a cache line of half size is then used in the future to prevent cache pollution.

[19] also investigates the use of different cache line sizes for different load instructions. Loads are classified into loads and superloads; a superload is used for data with additional spatial locality and uses a line size equal to four times the physical cache line size. The authors use two approaches to classify loads: an offline approach based on profiling and an online approach based on dynamic line size prediction. The online approach is similar to our approach in [3, 4] because neighboring is used to find additional spatial locality. However, our approach is simpler and easily extends to multiple line sizes, ranging from 16B to 256B in our experiments, whereas the online approach in [19] allows only two line sizes.

In a sector cache [20, 21], a cache sector consists of several contiguous cache lines; each cache line has its own coherency and valid tags, but all the cache lines in a sector share a single address tag. The tag array is significantly smaller than the tag array of a traditional cache, and the transfer granularity from the memory to the processor is a cache block, which limits bus traffic. But the cache is underutilized when the optimal line size is smaller than the sector size, and only a fraction of the cache holds valid data.

Adaptivity has also been applied in other forms. Selected examples of its use are:
- Adaptive routing, pioneered by the ARPANET in computer networks and, more recently, applied to multiprocessor interconnection networks [5], [7] to avoid congestion and route messages faster to their destination.
- Adaptive throttling for interconnection networks [7]. [18] shows that the "optimal" limit varies and suggests admitting messages into the network adaptively based on current network behavior.
- Adaptive cache control or coherence protocol choice, proposed and investigated in the FLASH and JUMP-1 projects [8], [14].
- Adapting the branch history length in branch predictors, proposed in [12] since the optimal history length was shown to vary significantly among programs.
- Adaptive page size, proposed in [16] to improve the page management overhead and used in [15] to reduce TLB and memory overhead.
- Adaptive adjustment of the data prefetch length in hardware, shown to be advantageous in [6], while in [9] the prefetch lookahead distance was adjusted dynamically either purely in hardware or with compiler assistance.
- A cache with a fixed large cache line, used in [13] in association with a predictor to fetch only the parts of the cache line that are likely to be used.

Program    Input                   Instr (M)   Memory refs (M)
APPLU      applu.in                1609        500
APSI       apsi                    1472        500
HYDRO2D    hydro2d.in              2109        500
MGRID      mgrid.in                1542        500
SU2COR     su2cor.in               4233        500
SWIM       swim.in                 1324        500
TOMCATV    tomcatv.in              1384        500
WAVE       wave5.in.ref            1686        500
COMPRESS   bigtest.in              1255        500
FPPPP      natoms.in               1472        500
GO         null.in                 2109        500
IJPEG      specmun.ppm             1542        500
LI         au.lsp, boyer.lsp,...   1094        500
M88KSIM    ctl.in                  1449        500
PERL       scrabbl.pl              1384        500
TURB3D     turb3d.in               1255        500

Table 1: SPEC95 Benchmarks
4 Experimental Setup and Benchmarks
4.1 Setup
The system architecture used in this study consists of a processor, an L1 cache, an L2 cache and memory. Both caches block on a cache miss. Each instruction takes one cycle except memory instructions that cause a cache miss. The L1 miss penalty is 5 cycles and the L2 miss penalty is 100 cycles. We have studied both a direct-mapped cache and a 2-way set-associative cache. Based on current processor implementations, L1 cache sizes of 16KB, 32KB and 64KB are studied; L1 line sizes range from 8 to 256 bytes. L2 cache sizes of 128KB, 256KB and 512KB are studied; L2 line sizes range from 64B to 512B. The caches use a write-back policy. The primary performance metrics used in this report are cache miss rate and data traffic volume.
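As a sketch, the simulated hierarchy can be summarized by the following C++ structure; the type and field names are ours, not the simulator's, and only one of the studied size combinations is shown.

    struct LevelConfig {
        unsigned size_kb;      // capacity (16/32/64KB studied for L1; 128/256/512KB for L2)
        unsigned assoc;        // 1 = direct-mapped, 2 = 2-way set-associative
        unsigned min_line_b;   // smallest line/fetch size studied
        unsigned max_line_b;   // largest line/fetch size studied
        unsigned miss_penalty; // cycles, charged on a blocking miss
    };

    const LevelConfig l1 = {32, 1, 8, 256, 5};
    const LevelConfig l2 = {256, 1, 64, 512, 100};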
4.2 Benchmarks
SPEC95 benchmarks, except for GCC and VORTEX, are used in the performance evaluation. These benchmarks exhibit sufficiently varied memory behavior to thoroughly exercise our architecture. Only the first 500 million memory references per benchmark are simulated. The benchmark statistics are shown in Table 1. All of these benchmarks are used in the study of the L1 cache. Only the first 8 benchmarks have significant L2 miss rates, and they are used in the study of the L2 cache and of processor speedup with a 2-level cache architecture. The benchmarks are compiled on an SGI system for an R3000 processor using the MIPS and MIPSPro compilers with the following flags: -n32 (MIPS-III instruction set, 32b executable) and -O2. They are used for execution-driven cache and memory simulation of the architecture described in this report. The cache simulator is invoked and driven via MINT-3 [2], which models a single-issue, statically-scheduled processor.
5 Algorithms to adapt fetch size

Determining the future fetch size is a prediction process. In this section, we present two approaches to adapting the fetch size. The sampling-based approach predicts the optimal fetch size for a long interval by finding the optimal fetch size over several small intervals. The locality-based approach observes the spatial locality during an interval and uses it to predict the fetch size for the next interval.
5.1 Sampling-based fetch size prediction
5.1.1 Intuition
Although the optimal cache line size changes over time, a line size may stay optimal for an extended period. As can be seen in Figure 2, a 64B line size is optimal for the first 60 million memory accesses. Thus we conjecture that the optimal line size over a short interval can be used to predict the optimal line size for a long interval.
The optimal fetch size is the fetch size that results in the minimal miss rate, but it is impossible to measure miss rates for multiple fetch sizes at the same time. This problem can be solved if the following assumption holds: the locality in a program does not change dramatically during a short period. The optimal fetch size can then be obtained in the following way (a code sketch of the selection step follows the list):
1. select a list of fetch sizes
2. use each fetch size for a short period and compute the miss rate
3. select the fetch size with the minimal miss rate as the optimal fetch size
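A minimal C++ sketch of step 3, assuming miss counts have been recorded per candidate size; the function name and signature are ours, and the helper is reused by the sketch in Section 5.1.2.

    #include <cstddef>

    // Return the candidate fetch size whose sampling interval saw the
    // fewest misses; with equal-length sampling intervals this is the
    // size with the minimal miss rate.
    unsigned pick_optimal(const unsigned sizes[], const unsigned misses[], size_t n) {
        size_t best = 0;
        for (size_t k = 1; k < n; ++k)
            if (misses[k] < misses[best]) best = k;
        return sizes[best];
    }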
5.1.2 Sampling-based algorithm design
Figure 4: Sampling overview

Figure 4 shows how the fetch size adapts over time. Memory accesses are divided into adaptation intervals (AI). Each AI has two phases, a sampling phase and a stable phase. The sampling phase consists of several sampling intervals (SI), and a different fetch size is used in each SI. The optimal fetch size found during the sampling phase is used as the fetch size for the stable phase.

As two types of intervals are used in the algorithm design, their lengths and their relationship should be chosen carefully. The sampling interval should not be too short, so that the effect of the fetch size on the cache miss rate can be observed. On the other hand, it should not be too long, otherwise the locality may change between sampling intervals and the selected fetch size may not be optimal. In our experiments, the sampling interval is an order of magnitude larger than the number of entries in the cache.

There are performance penalties in the sampling phase: several fetch sizes are used and only one of them is optimal, so a sampling interval with a non-optimal fetch size may see a higher miss rate than one with the optimal fetch size. The adaptation interval should be much larger than the sampling interval to amortize these penalties. In our experiments, the adaptation interval is two orders of magnitude larger than the sampling interval. We propose two sampling algorithms:
- all sampling (samp-a): all possible fetch sizes are used in the sampling phase
- neighbor sampling (samp-n): the current fetch size, the immediately larger fetch size and the immediately smaller fetch size are used in the sampling phase

There are tradeoffs between the two algorithms in terms of miss rate and traffic. samp-a can adapt to the optimal fetch size faster because all possible fetch sizes are tested, but at the penalty of using a larger number of non-optimal fetch sizes. samp-n adapts to the optimal fetch size more slowly because at most three fetch sizes are tested, but fewer non-optimal fetch sizes are used and the penalty may be smaller. Experiments in the next section study these tradeoffs. A sketch of the sampling controller follows.
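A hedged C++ sketch of the sampling controller described above, reusing pick_optimal from Section 5.1.1; all identifiers are ours, and the interval lengths are the ones used in Section 6.2.

    #include <cstdint>
    #include <vector>

    struct Sampler {
        std::vector<unsigned> all_sizes{16, 32, 64, 128, 256}; // L1 candidates
        bool neighbor_only = false;   // false: samp-a, true: samp-n
        unsigned fetch_size = 32;     // fetch size currently in effect
        uint64_t si_len = 10000;      // sampling interval (memory accesses)
        uint64_t ai_len = 1000000;    // adaptation interval (memory accesses)
        uint64_t pos = 0;             // accesses seen in the current AI
        std::vector<unsigned> cand, misses;

        void start_ai() {             // call once at startup and at each AI end
            cand.clear();
            if (neighbor_only) {      // samp-n: at most three neighboring sizes
                if (fetch_size > all_sizes.front()) cand.push_back(fetch_size / 2);
                cand.push_back(fetch_size);
                if (fetch_size < all_sizes.back()) cand.push_back(fetch_size * 2);
            } else {
                cand = all_sizes;     // samp-a: every possible size
            }
            misses.assign(cand.size(), 0);
            pos = 0;
            fetch_size = cand[0];
        }

        void on_access(bool miss) {   // called once per memory access
            const uint64_t sampling = si_len * cand.size();
            if (pos < sampling) {                     // sampling phase
                if (miss) ++misses[pos / si_len];
                ++pos;
                if (pos < sampling)
                    fetch_size = cand[pos / si_len];  // next SI's candidate
                else                                  // enter the stable phase
                    fetch_size = pick_optimal(cand.data(), misses.data(), cand.size());
            } else if (++pos >= ai_len) {
                start_ai();                           // AI over: sample again
            }
        }
    };

A caller would construct the Sampler, call start_ai() once, and then invoke on_access for every memory reference, reading fetch_size on each miss.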
5.1.3 Hardware cost
The additional hardware needed for the sampling algorithms is:
- a register to store the current fetch size
- two registers to store interval lengths: one for the sampling interval, the other for the adaptation interval
- two counters, one for the sampling interval and the other for the adaptation interval
- several registers to record the performance (miss rate) statistics for each fetch size
5.2 Locality-based fetch size adaptation
The spatial locality in a program has a direct impact on cache performance. For a program with good spatial locality, a large fetch size is preferred; for a program with poor spatial locality, a small fetch size is preferred. Thus an AFL cache should allow both small and large fetch sizes.

As in the sampling-based fetch size adaptation, memory accesses are divided into adaptation intervals. The fetch size is kept unchanged during an adaptation interval, and the locality utilization during the interval is measured. At the end of the adaptation interval, the next fetch size is predicted using the locality information. To simplify the mechanism, the candidates for the next fetch size are the current fetch size, the immediately larger fetch size and the immediately smaller fetch size.
5.2.1 Spatial Locality detection
Spatial locality detection is proposed here. The following two rules are used:

- Increased spatial locality: for a VCL, if its tag and the tag of its neighboring VCL are neighboring tags, this indicates that there is more spatial locality. A larger fetch size could bring the data of both VCLs into the cache and potentially eliminate one cache miss.
- Decreased spatial locality: for a VCL, if half of the PCLs in it are unused, the VCL does not have much spatial locality and the unused PCLs are potentially polluting the cache. A smaller fetch size can reduce cache pollution and keep more useful data in the cache.

Spatial locality detection can be done on a miss-fetch for each VCL; thus it does not affect the cache hit time, one of the critical issues in processor design. A sketch of the detection step follows.
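A hedged C++ sketch of the two detection rules, reusing neighboring_vcls and neighboring_tags from the sketch in Section 2; the names are ours, and we assume rule 2 is checked when a victim VCL is replaced, which coincides with a miss-fetch.

    #include <cstdint>

    struct VclState {
        uint64_t index;          // VCL index i (PCL index >> v)
        uint64_t tag;            // starting memory address of the VCL's data
        unsigned pcl_used_mask;  // one bit per PCL touched since the fetch
        bool     valid;
    };

    unsigned inc_count = 0, dec_count = 0, fetched = 0; // reset each adaptation interval

    // Rule 1, checked when a VCL is miss-fetched: its neighbor is already
    // cached with a neighboring tag, so a larger fetch would have merged them.
    void on_miss_fetch(const VclState& incoming, const VclState& neighbor) {
        ++fetched;
        uint64_t lo = neighbor.tag < incoming.tag ? neighbor.tag : incoming.tag;
        uint64_t hi = neighbor.tag < incoming.tag ? incoming.tag : neighbor.tag;
        if (neighbor.valid && neighboring_vcls(incoming.index, neighbor.index) &&
            neighboring_tags(lo, hi))
            ++inc_count;
    }

    // Rule 2, checked when a VCL is replaced: half or more of its PCLs were
    // never touched, so a smaller fetch would have polluted the cache less.
    void on_replace(const VclState& victim, unsigned pcls_per_vcl) {
        unsigned used = 0;
        for (unsigned m = victim.pcl_used_mask; m != 0; m >>= 1) used += m & 1u;
        if (2 * used <= pcls_per_vcl) ++dec_count;
    }

At the end of an adaptation interval, the inc% and dec% parameters of the next subsection would be inc_count and dec_count divided by the number of VCLs fetched during the interval.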
5.2.2 Next fetch size prediction with fixed thresholds

At the end of each adaptation interval, the collective spatial locality of all the lines seen in the cache during the interval is used to predict the next fetch size. The collective spatial locality can be characterized using the following two parameters:
- inc%: the percentage of VCLs that needed more spatial locality
- dec%: the percentage of VCLs that needed less spatial locality

Given: inc_thresh, dec_thresh, max_fetch_size, min_fetch_size

fixed_fetch_line_predict(fetch_size, inc%, dec%)
BEGIN
1. If inc% > inc_thresh Then
2.   If fetch_size < max_fetch_size Then
3.     fetch_size = fetch_size * 2
4.   Endif
5. Else if dec% > dec_thresh Then
6.   If fetch_size > min_fetch_size Then
7.     fetch_size = fetch_size / 2
8.   Endif
9. Endif
END

Figure 5: Next fetch size prediction with fixed thresholds

The algorithm shown in Figure 5 predicts the next fetch size. When the percentage of VCLs that need more spatial locality exceeds a threshold value, the fetch size is doubled for the next adaptation interval. When the percentage of VCLs that need less spatial locality exceeds a threshold value, the fetch size is halved for the next adaptation interval. The parameters inc_thresh and dec_thresh are fixed in this algorithm, with threshold values ranging from 0.5 to 0.7. The larger a threshold value, the more difficult it is to increase or decrease the fetch size; the smaller the value, the easier.
5.3 Next fetch size prediction with adaptive thresholds
As will be shown in Section 6, different applications prefer different threshold values; it would be best if the threshold values were adaptable. Thus an algorithm based on an aging mechanism, which adaptively changes the threshold values, is described in Figure 6. In aging_fetch_line_predict, inc_thresh and dec_thresh are initialized to mid_thresh and can range from max_thresh down to min_thresh. The values of max_thresh, min_thresh and mid_thresh used in our simulations are 0.7, 0.4 and 0.55, respectively; the aging_rate is 0.01.

There are three stages in this algorithm: a validation stage (lines 1 to 10), a fetch size prediction stage (lines 11 to 23), and a threshold aging stage (lines 24 to 29). In the fetch size prediction stage, the next fetch size is predicted in the same way as in fixed_fetch_line_predict. In the threshold aging stage, if there was no fetch size change in the prediction stage, inc_thresh and dec_thresh decrease by aging_rate. Because of aging, inc_thresh and dec_thresh can fall as low as min_thresh, which has the advantage of making more fetch size changes to adapt to the underlying changes in locality. A disadvantage of threshold aging is that it is easier to make a wrong fetch size prediction, which may degrade cache performance. Thus, in the validation stage, if there was a fetch size change at the end of the last adaptation interval and the cache performance in the current interval is worse than in the previous interval, the previous fetch size change was not beneficial and the fetch size reverts to its value in the previous adaptation interval. The corresponding inc_thresh or dec_thresh is then set to max_thresh, so that it takes more time to make a fetch size change in the same direction again.
5.3.1 Hardware cost
The additional hardware needed for an AFL cache using locality detection is as follows:
- a register to store the fetch size
- a register to store the length of the adaptation interval
- three registers to store threshold values
- one register to store the aging rate
- two counters to store the number of VCLs that need more and less spatial locality, respectively
- hardware to detect VCL usage
- hardware to determine whether the tags of two VCLs are neighboring
Given: max_thresh, min_thresh, mid_thresh, aging_rate, max_fetch_size, min_fetch_size

aging_fetch_line_predict(inc_thresh, dec_thresh, cur_miss_rate, prev_miss_rate)
BEGIN
1.  If (fetch_size increased last time) AND (cur_miss_rate > prev_miss_rate) Then
2.    fetch_size = fetch_size / 2
3.    inc_thresh = max_thresh
4.    return
5.  Endif
6.  If (fetch_size decreased last time) AND (cur_miss_rate > prev_miss_rate) Then
7.    fetch_size = fetch_size * 2
8.    dec_thresh = max_thresh
9.    return
10. Endif
11. If inc% > inc_thresh Then
12.   If fetch_size < max_fetch_size Then
13.     fetch_size = fetch_size * 2
14.     inc_thresh = mid_thresh
15.     return
16.   Endif
17. Else if dec% > dec_thresh Then
18.   If fetch_size > min_fetch_size Then
19.     fetch_size = fetch_size / 2
20.     dec_thresh = mid_thresh
21.     return
22.   Endif
23. Endif
24. If inc_thresh > min_thresh Then
25.   inc_thresh = inc_thresh - aging_rate
26. Endif
27. If dec_thresh > min_thresh Then
28.   dec_thresh = dec_thresh - aging_rate
29. Endif
END

Figure 6: Next fetch size prediction with aging thresholds
6 Performance

In this section, we first discuss the parameters that affect the design of the locality-based adaptive algorithms. We then show the effect of both the sampling-based and the locality-based algorithms on miss rate and traffic.
6.1 Parameters in locality-based algorithms
In this subsection, we evaluate the effect of several parameters on cache performance: the adaptation interval to use, the thresholds to use for the fixed threshold algorithm, and the comparison of the fixed and aging threshold algorithms.
6.1.1 Choice of adaptation interval length
Figure 7: Miss rates with adaptation intervals of 1M and 100K memory accesses

Figure 7 compares the miss rates of two adaptation intervals for the algorithm with fixed thresholds. The cache size is 32KB, and both inc_thresh and dec_thresh are equal to 0.7. As can be seen in the figure, for eight benchmarks (APPLU, SWIM, WAVE, COMPRESS, FPPPP, GO, IJPEG and LI) the miss rates are almost the same for both intervals. SU2COR gets a slightly better miss rate with the 1M interval. For the other six benchmarks, the miss rates obtained with the 100K interval are slightly better. We believe a small interval is better, as it can adapt to changes in the underlying locality faster.
6.1.2 Limitation of fixed threshold values

Figure 8: Miss rates with different threshold values for the fixed threshold algorithm

Figure 8 compares the miss rates for the fixed threshold algorithm with different thresholds. The label on each bar is a pair (inc_thresh, dec_thresh). As can be seen in the figure, for APSI, SU2COR, SWIM, COMPRESS, M88KSIM and TURB3D, the miss rates are almost the same with different sets of threshold values. For the other ten benchmarks, there are noticeable differences in the miss rates. For example, the miss rate for GO with threshold values (0.7, 0.5) is half the miss rate obtained with the other threshold values. The optimal line size for GO is 16B, as shown in Figure 1, and it is easier to adapt to a small fetch size when a large inc_thresh and a small dec_thresh are used. Things are different for WAVE, whose optimal line size is 128B. In the figure, the second and third bars are shorter; for these two bars dec_thresh is high, so it is difficult to adapt to a small fetch size, the fetch size is large most of the time, and the miss rates are smaller. In general, a high inc_thresh and a low dec_thresh make it easy for the algorithm to adapt to a small fetch size, while a low inc_thresh and a high dec_thresh make it easy to adapt to a large fetch size.
6.1.3 Comparison of fixed and aging threshold algorithms

Figure 9: Miss rates for fixed threshold and aging threshold algorithms

Figure 9 shows the miss rates for both the fixed threshold and the aging threshold algorithms. The first two bars are obtained using the fixed threshold algorithm. The first bar has a low inc_thresh and a high dec_thresh, which makes it easier to adapt to large fetch sizes. The second bar has a high inc_thresh and a low dec_thresh, which makes it easier to adapt to small fetch sizes. The third bar is obtained using the aging threshold algorithm, with max_thresh, mid_thresh and min_thresh of 0.7, 0.55 and 0.4 respectively and an aging_rate of 0.01. Except for APSI, WAVE and TURB3D, the miss rate obtained by the aging threshold algorithm is close to or better than the best miss rate obtained by the fixed threshold algorithms. This demonstrates that the aging threshold algorithm is less biased toward either large or small line sizes and can adapt the fetch size faster.
6.2 Performance with different adapting algorithms
In this section, we compare the performance of the different adapting algorithms in terms of miss rate reduction and normalized traffic. In each figure in this section, there are four bars per benchmark. The first bar corresponds to the all sampling algorithm (samp-a) and the second to the neighbor sampling algorithm (samp-n); for the sampling-based algorithms, the sampling interval is 10,000 memory accesses and the adaptation interval is 1,000,000 memory accesses. The third bar corresponds to the fixed threshold algorithm (loc-f), with inc_thresh and dec_thresh both 0.7. The fourth bar corresponds to the adaptive threshold algorithm (loc-a), with max_thresh, mid_thresh and min_thresh of 0.7, 0.55 and 0.4 respectively and an aging_rate of 0.01.

The L1 cache is 32KB and direct-mapped; possible fetch sizes are 16B, 32B, 64B, 128B and 256B. The L2 cache is 256KB and direct-mapped; possible fetch sizes are 64B, 128B, 256B and 512B.
6.2.1 Miss rate reduction
Figure 10: Miss rate reduction for L1-32KB cache

The miss rate reduction shows the effect of adaptivity and is calculated using the following formula:

    reduction_alg = (rate_fixed_baseline - rate_adapt_alg) / rate_fixed_baseline * 100
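As a worked example with hypothetical numbers: if the fixed-line baseline misses 10.0 percent of accesses and an adaptive algorithm misses 7.6 percent, the reduction is (10.0 - 7.6) / 10.0 * 100 = 24 percent, which happens to match the samp-a average reported below.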
Figure 10 shows the miss rate reduction for the 32KB L1 cache. The baseline is a fixed line size cache with a 32B line size. For five benchmarks (APPLU, HYDRO2D, MGRID, SWIM and TOMCATV), all algorithms reduce the miss rate by at least 50 percent. For WAVE, COMPRESS, FPPPP, IJPEG, LI and M88KSIM, the miss rate reductions range from 1 to 30 percent. For APSI, PERL and TURB3D, the algorithms samp-a, samp-n and loc-a result in an increase in the miss rate. We can see from Figure 1 that the optimal line size for APSI and TURB3D is 256B and the optimal for PERL is 16B, so these benchmarks have either ample spatial locality or little spatial locality; adapting the fetch size is not necessary for them and adds overhead. loc-f still improves the miss rates for APSI and TURB3D: since its threshold values are high, it is difficult to make wrong fetch size decisions, so the overhead due to adapting the fetch size is smaller. The average miss rate reduction for samp-a, samp-n, loc-f and loc-a is 24, 25, 20 and 24 percent respectively; loc-f is the worst.
Figure 11: Miss rate reduction for L2-256KB cache

Figure 11 shows the miss rate reduction for the 256KB L2 cache. The baseline is a fixed line size cache with a 64B line size. All algorithms improve the miss rate for every benchmark. For APPLU, SU2COR and SWIM, the miss rate reduction is as high as 80 percent. The miss reduction for APSI is small because there is little variation in spatial locality in this benchmark and the largest fetch size is always better. For all other benchmarks, the miss rate reduction ranges from 10 to 70 percent. The average miss reduction over all benchmarks is 57 percent, so all of these algorithms are good at reducing the miss rate in an L2 cache.
6.2.2 Normalized traffic
The normalized traffic is calculated using the following formula:

    traffic_normalized_alg = traffic_adapt_alg / traffic_fixed_baseline
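As a worked example with hypothetical numbers: if an adaptive algorithm moves 1.22GB between L2 and L1 where the fixed baseline moves 1.00GB, the normalized traffic is 1.22 / 1.00 = 1.22, i.e. a 22 percent traffic increase, which happens to match the samp-a average reported below.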
Figure 12 shows the normalized traffic between the L1 AFL cache and the L2 cache. The baseline is the traffic between an L1 fixed line size (FLS) cache with 32B lines and the L2 cache. For some benchmarks, there are big differences in the normalized traffic across algorithms; for example, in GO the normalized traffic is 0.76 for samp-n but 5.5 for loc-f. The average normalized traffic over all benchmarks is 1.22, 2.3, 2.42 and 1.74 for samp-a, samp-n, loc-f and loc-a respectively. Algorithm samp-a is effective in traffic control; on the contrary, loc-f is the worst. Figure 13 shows the normalized traffic between the memory and the L2 AFL cache. The baseline is the traffic between the memory and an L2 FLS cache with 64B lines.
Figure 12: Normalized traffic for L1-32KB cache

Figure 13: Normalized traffic for L2-256KB cache
Compared with the normalized traffic between the L2 cache and the L1 AFL cache, the differences in traffic for a benchmark across algorithms are much smaller here. The average normalized traffic over all benchmarks is 1.32, 2.2, 1.98 and 1.56 for samp-a, samp-n, loc-f and loc-a respectively. Algorithm samp-a is still the best in traffic control, and samp-n is the worst here.
6.2.3 Comparison with adaptive line size cache
In this section, we compare the miss rate reduction of the AFL cache with that of the ALS cache. For the AFL cache, the samp-a algorithm is used for fetch size adaptation; for the ALS cache, the NEPL-IF-DF algorithm is used for line size adaptation. The optimal miss rate of the FLS cache is used in computing the miss rate reduction.
Figure 14: Miss rate reduction for L1-32KB cache

Figure 14 shows the miss rate reduction for the 32KB L1 cache. Both the AFL cache and the ALS cache show no improvement over the optimal FLS cache for four benchmarks. The ALS cache achieves a better miss rate than the optimal FLS cache in eleven benchmarks; the AFL cache does so in only three. However, the FLS cache with a 32B line size results in a 73 percent increase in miss rate, and both the ALS cache and the AFL cache are much better than it. Figure 15 shows the miss rate reduction for the 256KB L2 cache. The ALS cache is better than the AFL cache in seven out of eight benchmarks. On average, the ALS cache and the AFL cache result in 15.6 and 3.18 percent miss reduction over the optimal FLS cache; both perform much better than the FLS cache with a 64B line size, which on average increases the miss rate by 242 percent. Overall, the ALS cache performs better than the AFL cache, but the AFL cache can be implemented with much less complexity. Both caches perform much better than the FLS cache.
Figure 15: Miss rate reduction for L2-256KB cache
7 Conclusion

In this study, we have proposed and investigated a cache design with adaptive fetch size. With few modifications to a conventional cache, this cache can achieve the same benefits as a cache with multiple cache line sizes, suiting the changing spatial and temporal locality in applications. We have proposed four algorithms to adapt the fetch size dynamically. The sampling-based algorithm samp-a is the best among them: it achieves on average a 24 percent miss rate reduction with only a 22 percent traffic increase over the FLS L1 cache with 32B lines. The algorithms are more effective for the L2 cache than for the L1 cache; in L2, samp-a on average yields a 57 percent miss rate reduction with only a 32 percent traffic increase over the FLS L2 cache with 64B lines.

In this research, our focus was to optimize cache performance by fetch size adaptation. Fetch size adaptation can also be used to optimize traffic and to reduce power consumption; different adaptation algorithms are needed in these cases, and this is the focus of our future research. The simulations in this report are trace-driven and no speedups are obtained. We have integrated this cache simulator with the SimpleScalar processor simulator, and evaluation with detailed timing will be performed in the near future.
References

[1] MIPS R3000 Hardware Manual, MIPS Corporation.

[2] Jack E. Veenstra and Robert J. Fowler. MINT: A front end for efficient simulation of shared-memory multiprocessors. In Intl. Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pages 201-207, Jan. 1994.

[3] Alexander V. Veidenbaum, Weiyu Tang, Rajesh Gupta, Alexandru Nicolau and Xiaomei Ji. Adapting cache line size to application behavior. In Intl. Conference on Supercomputing, pages 145-154, June 1999.

[4] Weiyu Tang, Alexander V. Veidenbaum, Alexandru Nicolau and Rajesh Gupta. Adapting line size cache. Technical Report 99-56, Department of Information and Computer Science, University of California, Irvine, Nov. 1999.

[5] Andrew A. Chien and Jae H. Kim. Planar-adaptive routing: Low-cost adaptive networks for multiprocessors. In Proc. 19th Annual Symposium on Computer Architecture, pages 268-277, 1992.

[6] Fredrik Dahlgren, Michel Dubois, and Per Stenstrom. Fixed and adaptive sequential prefetching in shared memory multiprocessors. In Intl. Conference on Parallel Processing, 1993.

[7] W. J. Dally and H. Aoki. Deadlock-free adaptive routing in multicomputer networks using virtual channels. IEEE Transactions on Parallel and Distributed Systems, pages 466-475, 1993.

[8] Jeffrey Kuskin et al. The Stanford FLASH multiprocessor. In Proc. 21st Annual Symposium on Computer Architecture, pages 302-313, 1994.

[9] Edward H. Gornish and Alexander Veidenbaum. An integrated hardware/software data prefetching scheme for shared-memory multiprocessors. In Intl. Conference on Parallel Processing, pages 247-254, 1994.

[10] Norman P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffer. In Proc. 17th Annual Symposium on Computer Architecture, 1990.

[11] Norman P. Jouppi and Steven J. E. Wilton. Tradeoffs in two-level on-chip caching. In Proc. 21st Annual Symposium on Computer Architecture, 1994.

[12] Toni Juan, Sanji Sanjeevan, and Juan J. Navarro. Dynamic history-length fitting: A third level of adaptivity for branch prediction. In Proc. 25th Annual Symposium on Computer Architecture, pages 155-166, 1998.

[13] Sanjeev Kumar and Christopher Wilkerson. Exploiting spatial locality in data caches using spatial footprints. In Proc. 25th Annual Symposium on Computer Architecture, pages 357-368, 1998.

[14] T. Matsumoto, K. Nishimura, T. Kudoh, K. Hiraki, H. Amano, and H. Tanaka. Distributed shared memory architecture for JUMP-1. In Intl. Symposium on Parallel Architectures, Algorithms, and Networks, pages 131-137, 1996.

[15] Ted Romer, Wayne Ohlrich, Anna Karlin, and Brian Bershad. Reducing TLB and memory overhead using on-line superpage promotion. 1996.

[16] Madhusudhan Talluri and Mark D. Hill. Surpassing the TLB performance of superpages with less operating system support. 1996.

[17] O. Temam and N. Drach. Software-assistance for data caches. In Proc. IEEE High Performance Computer Architecture, 1995.

[18] Steve Turner and Alexander Veidenbaum. Scalability of the Cedar system. In Supercomputing, pages 247-254, 1994.

[19] Peter Van Vleet, Eric Anderson, Lindsay Brown, Jean-Loup Baer, and Anna Karlin. Pursuing the performance potential of dynamic cache line sizes. In Proc. 1999 International Conference on Computer Design, 1999.

[20] PowerPC 601 RISC Microprocessor User's Manual, Motorola, 1993.

[21] TMS390Z55 Cache Controller, Data Sheet, Texas Instruments, 1992.