SF-LRU Cache Replacement Algorithm

Jaafar Alghazo, Adil Akaaboune, Nazeih Botros
Southern Illinois University at Carbondale
Department of Electrical and Computer Engineering
Carbondale, IL 62901
[email protected], [email protected], [email protected]

ABSTRACT

In this paper we propose a novel replacement algorithm, SF-LRU (Second Chance-Frequency Least Recently Used), that combines the LRU (Least Recently Used) and LFU (Least Frequently Used) algorithms using the second-chance concept. A comprehensive comparison is made between our algorithm and both the LRU and LFU algorithms. Experimental results show that SF-LRU significantly reduces the number of cache misses compared to the other two algorithms. Simulation results show that our algorithm provides up to approximately 6.3% improvement in miss ratio over the LRU algorithm in the data cache and approximately 9.3% improvement in miss ratio in the instruction cache. This performance improvement is attributed to the fact that our algorithm gives a second chance to a block that would otherwise be deleted according to LRU's rules. This is done by comparing the frequency of the block with that of the block next to it in the set.

Keywords
LRU, LFU, Replacement, Low Power Cache.

1. INTRODUCTION

One of the most critical design issues in high-performance processors is power consumption. A microprocessor's chip area is dominated mostly by on-chip cache memories, and thus the need arises for power-efficient cache memories. The main idea behind our research is to improve the efficiency of the first level of the memory hierarchy by utilizing an efficient novel replacement algorithm (SF-LRU). Accesses to the larger and more power-consuming levels of the memory hierarchy are thereby avoided, which reduces both the execution time of applications and the power consumption.

A sophisticated high-speed memory hierarchy is also essential to bridge the growing speed gap between main memory and the processing elements. To improve processor performance, small and fast memories (cache memories) are utilized to temporarily hold the contents of the segments of main memory that are currently in use. Cache memory is the simplest cost-effective way to achieve a high-speed memory hierarchy, and for this reason it has been studied extensively since its introduction. Three common cache organizations have been used since then: fully associative, set associative, and direct mapped. Organizations from direct mapped to fully associative are in fact increasing levels of set associativity: direct mapped is simply one-way set associative, and a fully associative cache with n blocks can be called n-way set associative; conversely, fully associative can be thought of as having one set, and direct mapped as having n sets. Generally, the set-associative organization offers a good balance between hit/miss ratios and implementation cost. The choice of a block replacement algorithm in set-associative caches can have a great effect on overall system performance [1-3].

The aptitude of caches to eliminate the performance gap is determined by two main factors: the access time and the hit/miss ratio. The access time, the time needed to get data from the cache, is critical because a longer access time implies a slower processor clock rate. The hit ratio is the fraction of memory references that can be satisfied by the cache. The hit and miss ratios are critical both because misses impose delays and because off-chip buses, especially shared ones, are a limited resource that imposes a delay while the processor waits for a block to be moved into the cache (the miss penalty). The locality of reference exhibited by most programs allows the cache to supply the instructions and data required by the CPU at a rate more in alignment with the CPU's demand rate. Conventional replacement algorithms are LRU (least recently used), LFU (least frequently used), and FIFO (first in, first out).
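To make the set-mapping arithmetic above concrete, the short Python sketch below computes which set an address falls into for a small set-associative cache; the cache size, block size, and associativity are hypothetical values chosen for illustration, not parameters from this paper.

    # Minimal sketch: mapping an address to a set in a set-associative
    # cache. All parameters are hypothetical illustration values.
    CACHE_SIZE = 32 * 1024   # 32 KB cache
    BLOCK_SIZE = 64          # 64-byte blocks
    WAYS = 4                 # 4-way set associative

    NUM_SETS = CACHE_SIZE // (BLOCK_SIZE * WAYS)  # 128 sets here

    def set_index(address):
        """Set an address maps to: block number modulo number of sets.
        WAYS = 1 makes this direct mapped (n sets); a single set holding
        all blocks (NUM_SETS = 1) makes it fully associative."""
        block_number = address // BLOCK_SIZE
        return block_number % NUM_SETS

    print(set_index(0x1F2C0))  # -> 75

Within each set, the replacement algorithm chooses which of the WAYS resident blocks to evict on a miss, which is where the policies discussed next come in.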


Other improved algorithms include the LRFU (Least Recently/Frequently Used) algorithm [1-8]. All of these algorithms share one common function: reducing the miss rate. The cost of misses includes the miss penalty, power consumption, and bandwidth consumption. The relative performance of these replacement algorithms depends primarily on the span of the history they review. The LRU algorithm reviews a longer history of past address patterns than FIFO, while LFU employs a different type of history than either LRU or FIFO: LRU reviews the history to determine which block has gone unused for the longest time and deletes it from the cache, whereas LFU checks the history to determine which block has been referenced least frequently and deletes that one. The LRFU combines both the LRU and LFU algorithms and provides a wide spectrum of replacement policies between them. Researchers are continually proposing techniques to optimize the performance of cache replacement algorithms [1-8].

In order to reduce power consumption we concentrate on reducing the miss rate. On every miss, the cache controller fetches the requested cache line from main memory and stores it in the cache; this is the most power-consuming operation. In fact, in many systems it has been shown that the majority of the power cost is due not to the data path or controllers but to global communication and memory interactions. Embedded signal-processing applications consume 50-80% of their total power in memory traffic alone, due to the communication between the processor and off-chip memory. It is therefore critical to focus on design strategies that reduce the power consumption due to off-chip memory traffic [10].
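To make the contrast between these histories concrete, the following minimal Python sketch (ours, not from the paper) shows the different victims LRU and LFU pick from the same set:

    # Minimal sketch: the different 'history' LRU and LFU consult when
    # choosing a victim among the blocks of one set.
    def lru_victim(blocks, last_used):
        # LRU: evict the block with the oldest last-reference time.
        return min(blocks, key=lambda b: last_used[b])

    def lfu_victim(blocks, freq):
        # LFU: evict the block with the fewest references overall.
        return min(blocks, key=lambda b: freq[b])

    # Block A: referenced often, but long ago. Block B: once, recently.
    last_used = {'A': 3, 'B': 9}
    freq = {'A': 5, 'B': 1}
    print(lru_victim(['A', 'B'], last_used))  # 'A' (stale despite hot)
    print(lfu_victim(['A', 'B'], freq))       # 'B' (cold despite recent)

Neither choice is right for every workload, which is what motivates hybrids such as LRFU and the SF-LRU proposed here.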

2. PREVIOUS WORK

Effective cache replacement algorithms are essential in reducing the number of cache misses. The focus of this paper is on reducing power consumption by reducing cache misses through a novel, cost-effective replacement algorithm. We therefore start with related work on replacement algorithms and then introduce related work on power consumption.

The most common replacement algorithms used in cache memory are first in first out (FIFO), most recently used (MRU), least recently used (LRU), and least frequently used (LFU). As the name implies, the FIFO algorithm replaces the block in the cache that was referenced first. The MRU algorithm replaces the block in the cache that was referenced most recently.

The LRU algorithm replaces the block in the cache that has been unused for the longest time. The LFU algorithm replaces the block in the cache that was least frequently referenced. LRU and LFU are the two extreme points between which all other variations of replacement algorithms fall [1-8].

Many different methodologies have been proposed to improve the performance of replacement algorithms. Some of these are frequency-based replacement (FBR), least recently/frequently used (LRFU), and segmented least recently used (sLRU). Another proposed algorithm is LRU-K, which makes its replacement decision based on the time of the Kth-to-last reference to the block. The FBR algorithm is a hybrid replacement policy that combines both LRU and LFU: it maintains the LRU ordering of all blocks in the cache but replaces the block that is least frequently used. The LRFU algorithm associates with each block in the cache a value called Combined Recency and Frequency (CRF) and replaces the block with the minimum CRF value [3-9].

There are advantages and drawbacks to each of these algorithms. One drawback of LRU is that it uses only the time of the most recent reference to each block and so cannot differentiate between frequently and infrequently referenced blocks. LFU, on the other hand, cannot differentiate between references that occurred far in the past and more recent ones. LRU-K considers only the Kth-to-last reference while ignoring the K-1 more recent references; it can differentiate between frequently and infrequently referenced blocks but still does not combine recency and frequency in a unified manner. Other algorithms, such as LRFU, do combine recency and frequency in a unified manner but carry considerable implementation overhead. These drawbacks show that no replacement algorithm is perfect and that there remains room for modification and performance improvement [1-9].
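As an illustration of the unified metric LRFU maintains, here is a hedged sketch of its CRF bookkeeping; the exponential weighting function and the decay parameter are assumptions drawn from the general LRFU literature, not from this paper:

    # Sketch of LRFU-style CRF bookkeeping (an assumption based on the
    # LRFU literature): every past reference contributes an amount that
    # decays with its age, so the sum blends recency with frequency.
    LAMBDA = 0.5  # hypothetical decay; small LAMBDA behaves like LFU,
                  # large LAMBDA weights recent references (like LRU)

    def crf(reference_times, now):
        """Combined Recency and Frequency value of one block."""
        return sum(0.5 ** (LAMBDA * (now - t)) for t in reference_times)

    # Block X: four old references. Block Y: one very recent reference.
    print(crf([1, 2, 3, 4], now=10))  # ~0.32
    print(crf([9], now=10))           # ~0.71; X has the minimum CRF
                                      # and would be evicted

The per-reference decay is exactly the implementation overhead mentioned above: a CRF value must be stored and updated for every block in the cache.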


Since our main concern is power consumption, we also review related work on reducing the power consumption of cache memories, a topic that has received much research attention. Among the proposed techniques are the method of Memik et al. [11] and the technique of Nicolaescu et al. [12]. The former proposes a victim cache structure to reduce the number of accesses to more power-consuming structures, while the latter proposes a technique that utilizes cache line address locality to determine the cache way prior to the cache access.

3. ENERGY MODEL

Several energy models have been proposed for caches. We base our energy model on the model developed in [13], where energy is given by


Energy = hit_rate * Energy_hit + miss_rate * Energy_miss


where

Energy_hit = Energy_Decoder + Energy_Cell_Arrays
Energy_miss = Energy_hit + Energy_access_memory

Energy_Cell_Arrays is the energy consumed in the cell arrays, Energy_access_memory is the energy required to access data in main memory, and Energy_Decoder is the energy consumed in the decoder. As previously stated, the energy required to access data from main memory accounts for the majority of the overall power cost; thus it is clear that Energy_miss >> Energy_hit. It is also apparent from the equations above that if a miss-rate reduction is achieved, then energy consumption is reduced. More information on the complete model can be found in [13]. This confirms that our approach to the energy consumption problem is sound: by achieving a better hit rate and reducing the number of misses, we are in fact reducing the power consumption of the system.
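A small numeric sketch of the model makes the point; the per-event energy figures below are invented for illustration, and only the relation Energy_miss >> Energy_hit is assumed:

    # Numeric sketch of the energy model above. The unit energies are
    # hypothetical; only Energy_miss >> Energy_hit is assumed.
    ENERGY_HIT = 1.0             # decoder + cell arrays (arbitrary units)
    ENERGY_ACCESS_MEMORY = 50.0  # off-chip main-memory access
    ENERGY_MISS = ENERGY_HIT + ENERGY_ACCESS_MEMORY

    def energy_per_access(miss_rate):
        hit_rate = 1.0 - miss_rate
        return hit_rate * ENERGY_HIT + miss_rate * ENERGY_MISS

    # Halving the miss rate cuts energy per access by far more than the
    # hit energy saved, because each avoided miss also avoids the
    # main-memory access.
    print(energy_per_access(0.10))  # 6.0
    print(energy_per_access(0.05))  # 3.5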

4. THE SF-LRU ALGORITHM

In this paper we introduce a cache replacement algorithm based on LRU and LFU. The algorithm combines both the frequency and the recency of blocks when making the replacement decision. The basic philosophy is to modify LRU so that not only is there a counter recording how many times a memory block has been referenced, but blocks that have been referenced frequently are also given a second chance before being thrown away. The combined value we compute for this purpose is called the RFCV (Recency-Frequency Control Value).

Figure 1: Hardware representation of the RFCV calculation (RFCV = Y = (y + 1) * Operation; Operation = 1 for Read, 0 for Write/Swap)

Figure 1 shows the hardware architecture that implements the calculation of the RFCV value. Using this implementation, the algorithm determines whether to evict the least recently used block or a different one: it compares the RFCV values of all the blocks in the set and deletes the one with the lowest value.

Definition A: Assume that each block has a value, called the RFCV, that combines the recency and frequency of the block. It is defined as:

RFCV = F(x) + G(y, R)    (1)

where F(x) is a decreasing function that represents the LRU component and is defined as:

F(x) = (1/ρ)^x    (2)

with x = t_actual − t_last and ρ > 1, and G(y, R) is an increasing function that represents the LFU component with respect to R and is defined as:

G(y, R) = (y + α) * R    (3)

where y is the frequency of the block, R is the operation (read or write), and α > 1. Substituting (2) and (3) into (1) yields:

RFCV = (1/ρ)^x + (y + α) * R    (4)

On closer inspection we find F(x)
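To make the selection rule concrete, the following minimal Python sketch implements the RFCV computation of equation (4) and the lowest-RFCV eviction described above; the values of ρ and α are hypothetical choices (the definition only requires ρ > 1 and α > 1), and the per-block bookkeeping is our assumption:

    # Minimal sketch of SF-LRU victim selection per equations (1)-(4).
    # RHO and ALPHA are hypothetical; the definition requires only
    # RHO > 1 and ALPHA > 1.
    RHO = 2.0
    ALPHA = 1.5

    def rfcv(t_now, t_last, frequency, operation):
        """RFCV = (1/RHO)**x + (frequency + ALPHA) * operation, where
        x is the time since the block's last reference and operation
        is 1 for a read, 0 for a write/swap (per Figure 1)."""
        x = t_now - t_last
        return (1.0 / RHO) ** x + (frequency + ALPHA) * operation

    def choose_victim(blocks, t_now):
        """Evict the block with the lowest RFCV in the set. A frequently
        referenced block keeps a high RFCV even when it is the least
        recently used one: the 'second chance' LRU alone cannot give."""
        return min(blocks, key=lambda b: rfcv(t_now, b['t_last'],
                                              b['freq'], b['op']))

    set_blocks = [
        {'tag': 'A', 't_last': 2, 'freq': 6, 'op': 1},  # LRU victim, hot
        {'tag': 'B', 't_last': 8, 'freq': 1, 'op': 1},  # recent but cold
    ]
    print(choose_victim(set_blocks, t_now=10)['tag'])  # 'B': A survives

Under plain LRU, block A would have been evicted; under SF-LRU its frequency term keeps its RFCV above B's, which is precisely the second-chance behavior described in this section.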