Application Adaptive Intelligent Cache Memory System

JUNG-HOON LEE, Yonsei University
SHIN-DUG KIM, Yonsei University
CHARLES WEEMS, University of Massachusetts

________________________________________________________________________

This paper presents the design of a simple hardware-controlled, high performance cache system. The design supports fast access time, optimal utilization of temporal and spatial locality adaptive to given applications, and a simple dynamic fetching mechanism with different fetch sizes. Support for dynamically varying the fetch size makes the cache equally effective for general-purpose and multimedia applications. Our cache organization and operational mechanism are especially designed to exploit temporal locality and spatial locality, selectively and adaptively. Simulation shows that the average memory access time of the proposed cache is equal to that of a conventional direct-mapped cache with eight times as much space. In addition, the simulations show that our cache achieves better performance than a 2-way or 4-way set associative cache with twice as much space. The average miss ratio, compared with a victim cache with a 32-byte block size, is improved by about 41% and 60% for general applications and multimedia applications respectively. It is also shown that the power consumption of the proposed cache is around 10% to 60% lower than that of the other cache systems we examine. Our cache system thus offers high performance with low power consumption and low hardware cost.

Categories and Subject Descriptors: B.3 [Hardware]: Memory Structures - Design Styles, Performance Analysis and Design Aids; C.3 [Computer Systems Organization]: Special-Purpose and Application-Based Systems - Real-time and embedded systems; I.6 [Computing Methodologies]: Simulation and Modeling - Simulation Output Analysis

General Terms: Design, Experimentation, Performance

Additional Key Words and Phrases: Memory hierarchy, temporal locality, spatial locality, general application, media application, dynamic block fetching, cache memory

________________________________________________________________________

1. INTRODUCTION

As the demand for multimedia applications has increased, embedded processor designs have evolved to provide high performance on media processing algorithms, such as image processing, video compression and decompression, voice processing, and wireless communication. Cache memory is a fundamental component of modern computer systems, and its major design parameters are cache size, block size, and associativity.

This work was supported by the Korea Research Foundation under contract number EA0081 and partially by NSF ITR grant CCR-0085792, DARPA grant 5-21425, and NSF grant ACI-9982028. Authors' addresses: J.H. Lee and S.D. Kim, Department of Computer Science, Yonsei University, 134 Shinchon-dong, Sudaemoon-Ku, Seoul, 120-749, Korea; C. Weems, Department of Computer Science, University of Massachusetts, Amherst, MA 01003-4610, USA.
© 2001 ACM 1073-0516/01/0300-0034 $5.00

For a given cache space, improved performance can be achieved by choosing an optimal set-associativity to reduce conflict misses. However, adding set-associativity to a cache requires inserting a multiplexer in the data path, as well as increased complexity in timing and control logic [Przybylski et al. 1988]. Embedded processors typically do not employ this organization because of its high power consumption.

Caches exploit temporal locality by retaining recently accessed data, and spatial locality by fetching multiple words as a cache block. However, temporal locality cannot be exploited as effectively when the cache block size becomes too large. A 32-byte block size is widely used as an effective compromise between improving spatial locality and retaining most of the benefit of temporal locality. However, as noted in [Slingerland et al. 2000], multimedia applications show better performance when larger cache block sizes, e.g., 64-bytes or 128-bytes, are chosen. Cache systems built around a fixed block size therefore tend to favor only one application group, i.e., either general-purpose applications or multimedia applications.

This research proposes a new intelligent cache system that can serve general-purpose applications as well as multimedia applications by adjusting the fetch size dynamically. The proposed cache system supports three fetch sizes: 32-byte, 64-byte, and 96-byte blocks. For general applications, the size of a fetch is typically 32-bytes, but 64-byte or 96-byte fetch sizes can be used when a data access pattern with high spatial locality is encountered. In multimedia applications, a large percentage of block fetches can take advantage of the 64-byte or 96-byte fetch sizes.

Previous approaches to varying the fetch size have focused on prefetching. However, most prefetching mechanisms generate a prefetch signal on either a cache hit or a miss; an excessive number of prefetch signals are therefore generated, leading to high power consumption and cache pollution [Chen et al. 1992]. Additional drawbacks of adding dynamic fetch size to a prefetch mechanism are the extra hardware cost and the difficulty of predicting memory reference patterns, especially in the data cache. Instead, the proposed cache system initiates variation of the fetch size as part of its miss handling mechanism.

The proposed cache system, called a dynamically aggressive spatial and adaptive temporal (DASAT) cache system, is an extended combination of a dual direct-mapped cache module with a small block size and a fully associative cache module with a large block size at the same cache level. Several architectural and operational features are integrated to optimize the exploitation of locality. The improvement in performance is primarily achieved by exploiting the inherent characteristics of spatial locality through dynamically varying the fetch size, and by reducing conflict misses through a dual direct-mapped cache organization.

Temporal locality is further enhanced by selectively retaining blocks with a high probability of repeated reference. To accomplish this, our system monitors the behavior of a block over some time interval before storing it into the direct-mapped cache.

The rest of the paper is organized as follows. Related work is surveyed in Section 2. Section 3 describes the proposed cache organization and its operation. Section 4 presents our performance evaluation. Finally, conclusions are given in Section 5.

2. RELATED WORK

Dual cache structures are a popular organizational approach in high performance processors; the victim cache, the selective cache, the STS cache, and the assist cache are typical examples. The victim cache [Jouppi 1990] is a small buffer that holds data recently evicted from the main cache. The main cache and the buffer are accessed at the same time. If a memory address generated by the CPU hits in the victim buffer, the data are returned to the CPU and simultaneously promoted to the main cache, while the replaced block from the main cache is moved to the victim cache, thus performing a "content swap".

The selective cache [Gonzalez et al. 1995] consists of a spatial cache with a large block size, a temporal cache with a small block size, and a locality prediction table. The data may be placed in just one of the two subcaches, or not cached at all, depending on the predicted type of locality for a given memory access.

The split temporal/spatial (STS) cache [Milutinovic et al. 1996] is organized as two parts, i.e., a spatial cache with a prefetch mechanism and a temporal cache. The temporal cache is organized as a two-level hierarchy with a one-word block size at each level.

The HP PA-7200 assist cache [Kurpanchek et al. 1994] places the primary direct-mapped cache in parallel with a small fully associative buffer, guaranteeing single-cycle lookup at both units. Blocks requested by the cache controller, due to a cache miss or a prefetch, are first loaded into the assist buffer, which is managed with a FIFO replacement algorithm, and are promoted into the direct-mapped cache only if they exhibit temporal locality. Data with no temporal reuse bypass the direct-mapped cache and are moved directly back to memory.

There have been a limited number of multimedia caching studies. The data cache behavior of MPEG-2 video decoding was studied in [Soderquist et al. 1997], but the data cache failed to achieve high performance for video data. Cox et al. [1998] and Hakura et al. [1997] evaluate the effectiveness of caching for texture mapping used in 3D rendering.

A texture cache with a capacity as small as 16KB was found to reduce the required memory bandwidth by a factor of three to fifteen relative to a non-cached design, and exhibited miss ratios of around 1%. Igehy et al. [1998] described and analyzed a prefetching architecture for texture caching that tolerates an arbitrarily high and variable memory latency by separating the tag part of the cache from the data part; this method was able to reduce the average memory access time by 90%. These texture-mapping cache studies provided tremendous improvements for specific media applications. In addition to the prefetching of texture maps, various prefetching techniques have been introduced for MPEG-1 and MPEG-2 [Zucker et al. 1996] and for image processing [Cucchiara et al. 1999].

These cache structures and mechanisms have been applicable only to specific application areas, in part because they are restricted to one block size. Thus, for multimedia applications with high spatial locality, several studies have relied on prefetching to overcome these structural drawbacks.

3. INTELLIGENT CACHE SYSTEM WITH DYNAMIC BLOCK FETCH SIZE

In this section, our design motivation is presented along with the architectural model for the proposed intelligent cache system. Its operational model is then explained in the context of the dynamic fetching mechanism.

3.1 Motivation and Overview

One way to adaptively exploit the two types of locality is to design a cache system consisting of two caches with different configurations, each tuned for one type of locality [Gonzalez et al. 1995; Milutinovic et al. 1996; Kurpanchek et al. 1994]. Our previous work used this strategy, employing a direct-mapped cache and a fully associative spatial buffer at the same cache level [Lee et al. 2000]. The new organization proposed in this work is also based on this model.

In this research, our main objective is to design a simple, hardware-controlled, low power but high performance cache system that supports fast access time and improved utilization of temporal and spatial locality. Our hypothesis is that this can be achieved by making the cache adaptive to the different levels of spatial and temporal locality in applications, and through a fetching mechanism that dynamically varies the fetch size. The DASAT cache organization achieves this goal as follows.

For data with temporal locality, we use a direct-mapped cache module with a small block size to guarantee low power consumption and fast access time. The block size is minimized so that a specific group of data with high temporal locality can be retained longer and more selectively, by maximizing the number of entries for a given cache space.

To reduce conflict misses, which are a drawback of the conventional direct-mapped cache, we use two direct-mapped caches configured as a main cache and a shadow cache. A new control mechanism manages the roles of the main and shadow caches; the effect is to increase the overall time that a particular small block with high temporal locality can stay in the direct-mapped cache.

In addition to the direct-mapped caches, a fully associative cache module with a large block size is used to exploit spatial locality. Each large block in this cache is configured as multiple small blocks. Thus, when a cache miss occurs, a large block is fetched from the next level of the memory hierarchy, enabling exploitation of potential spatial locality. To improve spatial locality in a manner that is adaptive to different application characteristics, we have designed a fetching mechanism that supports different fetch sizes.

3.2 Cache Organization

The DASAT cache system consists of two major components, as shown in Fig. 1: a dynamically aggressive spatial cache and an adaptive temporal cache. The adaptive temporal (AT) cache contains the two direct-mapped caches described above, which we refer to as T0 and T1. Because their blocks are minimal in size, we refer to the T0 and T1 blocks as small blocks. There are n small block entries in each of T0 and T1, for a total of 2n small blocks. As mentioned previously, T0 and T1 cooperate logically as a main and a shadow cache to mitigate conflict misses. A block evicted from the main cache is stored into the shadow cache at the corresponding block index. If a hit occurs in the shadow cache, the corresponding main and shadow block entries are swapped between the two caches. In our approach, this operation is performed logically, without any physical block movement.

The logical transfer is effected with a driving bit assigned to each block index across T0 and T1, as shown in Fig. 1; thus there are n driving bits. This bit determines whether T0 or T1 acts as the main cache for a specific block index: the value one means T0 is the main cache and T1 the shadow cache, and the value zero means the reverse. For example, given a particular block index i whose driving bit is set to one, T0 is the main cache for this block and is enabled to be checked first. If a miss occurs at T0, the corresponding block of T1 is enabled and checked for a match. If a hit occurs at T1 for block index i, its driving bit is reset to zero, meaning that T0 is now the shadow and T1 the main cache for this block. If a miss occurs at both caches when the driving bit is one, a new block is loaded into T1 and the driving bit is toggled, so that T1 becomes the main cache for this block.
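As an illustrative sketch of the driving-bit bookkeeping just described (our own C rendering of the hardware behavior; the array size and function names are hypothetical, not from the paper):

#include <stdint.h>

#define N_INDICES 512  /* hypothetical number of block indices per bank */

/* One driving bit per block index: 1 means T0 is the main cache and T1
 * the shadow for that index; 0 means the reverse. */
static uint8_t driving_bit[N_INDICES];

/* Which bank (0 = T0, 1 = T1) currently acts as the main cache. */
static int main_bank(unsigned index) {
    return driving_bit[index] ? 0 : 1;
}

/* A hit in the shadow bank performs the "content swap" logically:
 * no data moves; the driving bit is flipped so that bank becomes main. */
static void on_shadow_hit(unsigned index) {
    driving_bit[index] ^= 1;
}

/* On a miss in both banks, the new block is loaded into the current
 * shadow bank and the bit is toggled so that bank becomes the main cache. */
static int on_double_miss(unsigned index) {
    int fill_bank = driving_bit[index] ? 1 : 0;  /* the shadow bank */
    driving_bit[index] ^= 1;
    return fill_bank;  /* caller loads the fetched block into this bank */
}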

When the CPU generates an address, only one of the two direct-mapped caches (the main cache for that entry) is enabled by the driving bit, and it can be accessed within a single cycle. This mechanism helps reduce conflict misses, a major drawback of the direct-mapped organization, especially when the cache space is limited. We describe an integrated operational algorithm for managing the AT-cache and the dynamically aggressive spatial cache together in more detail later.

The dynamically aggressive spatial (DAS) cache is designed to improve spatial locality. As mentioned earlier, it is a conventional fully associative cache, where each block, called a large block, is organized as a collection of k small blocks. Each large block is constructed as k banks, so that k consecutive small blocks can be stored as a single fetch unit. Given m large blocks in this cache, there are m*k small blocks. Each large block has a tag field, a valid bit, a group of k dirty bits, and a group of k hit bits, where one dirty bit and one hit bit are assigned to each small block bank within the large block. The hit bit is used to distinguish between referenced and unreferenced small blocks.

The dynamic fetch mechanism that manages the DAS-cache consists of an address generator (AG) and a dynamic fetch controller. Because the DAS-cache is fully associative, its tag space is constructed as a content addressable memory (CAM). The dynamic fetch controller is organized as two fetch prediction buffers and one adder. If a hit occurs at the DAS-cache, the hit bits belonging to the accessed large block are copied into one of the fetch prediction buffers; the fetch prediction buffers therefore contain the hit bit information of the two most recently accessed large blocks. The adder is used to derive a fetch signal by summing all of the hit bits stored in these two buffers. On a cache miss, the result from the adder determines the size of the fetch, and the AG then generates the necessary fetch addresses onto the address bus.

For example, if the size of a small block is 8-bytes and that of a large block is 32-bytes, there are four banks for each block entry in the DAS-cache. The minimum size of a fetch unit is 32-bytes, and it can be extended to 64-bytes or 96-bytes dynamically, depending upon the hit bit information.
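A minimal C sketch of the DAS-cache bookkeeping described above, using the paper's example parameters (k = 4 small blocks per large block, m = 32 large blocks); all type and field names here are our own, not the authors':

#include <stdint.h>

#define K_SMALL 4   /* small-block banks per large block (k) */
#define M_LARGE 32  /* large blocks in the fully associative DAS-cache (m) */

typedef struct {
    uint32_t tag;             /* matched through the CAM tag store */
    uint8_t  valid;
    uint8_t  dirty[K_SMALL];  /* one dirty bit per small-block bank */
    uint8_t  hit[K_SMALL];    /* one hit bit per small-block bank */
    uint64_t data[K_SMALL];   /* four 8-byte small blocks */
} das_block_t;

typedef struct {
    das_block_t block[M_LARGE];
    /* Hit bits of the two most recently accessed large blocks; the
     * adder sums these eight bits to choose the next fetch size. */
    uint8_t fetch_pred_buf[2][K_SMALL];
} das_cache_t;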

Fig. 1. The DASAT cache organization.

If the sum produced by the adder is 0, 1, or 2, the fetch unit size is 32-bytes. If the value is 3 or 4, the fetch unit size is extended to 64-bytes and two consecutive large blocks are fetched. If the output of the adder is 5, 6, 7, or 8, the fetch unit size becomes 96-bytes and three consecutive large blocks are fetched.

For general applications, such as the Spec95 benchmarks, the size of a fetch unit is typically 32-bytes, but 64-byte or 96-byte fetch sizes are selected for some working sets with high spatial locality. For multimedia applications, such as the Media benchmarks, a very large fetch size is useful because of their streaming nature [Conte et al. 1997; Lee et al. 1996]. Because the proposed mechanism responds to application characteristics dynamically and adaptively, fetch unit sizes of 64-bytes and 96-bytes are used proportionately more often in most media applications.
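The threshold rule above can be expressed as the following sketch (the real mechanism is an adder plus comparison logic in the dynamic fetch controller, not software; the function name is ours):

/* Sum the eight hit bits held in the two fetch prediction buffers
 * (four bits each) and map the sum to a fetch size, following the
 * thresholds in the text: 0-2 -> 32B, 3-4 -> 64B, 5-8 -> 96B. */
static unsigned fetch_size_bytes(const unsigned char pred_buf[2][4]) {
    unsigned sum = 0;
    for (int b = 0; b < 2; b++)
        for (int i = 0; i < 4; i++)
            sum += pred_buf[b][i];
    if (sum <= 2) return 32;
    if (sum <= 4) return 64;
    return 96;   /* sum in 5..8 */
}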

When the CPU performs a memory reference, both the DAS-cache and one of the two direct-mapped caches in the AT-cache are searched in parallel within one cycle. A hit in the direct-mapped cache is processed in the same way as in a conventional direct-mapped cache. If a reference misses in the direct-mapped cache but hits in the DAS-cache, the corresponding data are fetched from the DAS-cache, and the hit bit for that small block is set simultaneously. At the same time, the hit bits for this DAS-cache entry are copied into the least recently loaded fetch prediction buffer in the dynamic fetch controller. If a miss occurs in both the main cache bank of the AT-cache and the DAS-cache, the shadow bank of the AT-cache is accessed during the next cycle.

If a hit occurs in the shadow cache, the driving bit of the entry is reversed and the access proceeds in the same way as a normal hit. However, if the shadow cache also misses, the event is called a global miss, and the dynamic fetch controller initiates the fetch of a new large block into the DAS-cache from the next level of the memory hierarchy. On a global miss, the dynamic fetch controller generates one of three fetch signals, depending on the hit bit information in the fetch prediction buffers; thus a global miss may result in a fetch signal for a 32-, 64-, or 96-byte block. The required word from memory is sent to the CPU concurrently with being stored in the DAS-cache. Details are covered in a later subsection.

3.3 Operational Model of the DASAT Cache System

Here, we describe the algorithm for managing the DASAT cache in detail. Its conceptual operational flow is as follows. When a global cache miss occurs, a large block is fetched into the DAS-cache. Small blocks belonging to this large block are marked as they are subsequently referenced. When the large block is replaced, its small blocks that are marked as having been accessed are moved into their corresponding block entries in the AT-cache; unmarked small blocks are evicted from the entire cache. The different cases of the operational model are explained as follows.

3.3.1 Cache hits

On every memory access, a hit may occur at either the AT-cache or the DAS-cache. First the DAS-cache and the main bank of the AT-cache are checked in parallel. If there is a hit in the DAS-cache, the hit bit for the corresponding small block is set at the same time that it is accessed. If both miss, the shadow bank of the AT-cache is checked on the next cycle. A hit in either bank of the AT-cache proceeds normally.

First, consider the case of a hit in the AT-cache. If a read access is a hit, the requested data item is transmitted to the CPU without any delay. If a write access is a hit, the write operation is performed and the dirty bit for the block entry is set. Second, consider the case of a hit at the fully associative DAS-cache. The data belonging to the matching small block are sent to the CPU, and its hit bit is set to mark it as referenced. The hit bits for this DAS-cache entry are also copied into the fetch prediction buffers. Note that DAS-cache blocks are not promoted to the AT-cache on a hit; promotion occurs only on replacement in the DAS-cache.

Average memory access time tends to increase slightly due to the one cycle penalty for the second access to the AT-cache. We refer to this case as a two-cycle hit.
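Putting the lookup steps together, the access flow described in this subsection can be summarized as follows (pseudocode-style C; every helper is a hypothetical stand-in for a parallel hardware probe):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical probes standing in for the hardware lookups. */
bool at_main_hit(uint32_t addr);    /* main AT-cache bank, cycle 1 */
bool das_hit(uint32_t addr);        /* DAS-cache, cycle 1; sets the hit bit
                                       and updates a fetch prediction buffer */
bool at_shadow_hit(uint32_t addr);  /* shadow bank, cycle 2; flips driving bit */
void start_dynamic_fetch(uint32_t addr);  /* 32/64/96-byte fetch into DAS */

typedef enum { HIT_ONE_CYCLE, HIT_TWO_CYCLE, GLOBAL_MISS } access_result_t;

access_result_t dasat_access(uint32_t addr) {
    /* Cycle 1: the DAS-cache and the main AT-cache bank are probed in
     * parallel (sequential here only because this is software). */
    bool at  = at_main_hit(addr);
    bool das = das_hit(addr);
    if (at || das)
        return HIT_ONE_CYCLE;
    /* Cycle 2: only on a double miss is the shadow bank probed. */
    if (at_shadow_hit(addr))
        return HIT_TWO_CYCLE;
    /* Global miss: the dynamic fetch controller loads one to three
     * consecutive large blocks into the DAS-cache. */
    start_dynamic_fetch(addr);
    return GLOBAL_MISS;
}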

The overhead is negligible, however, because the two-cycle hit ratio is only 0.02%~0.03%, and the overall miss ratio is only about 1.2%~1.4% of the total number of addresses generated by the CPU. Because we mark the shadow cache so that it becomes the main cache on a two-cycle hit, and because we move small blocks from the DAS-cache into the shadow bank of the AT-cache and then reverse the roles of the banks, the most recently accessed data always end up in the main bank. Therefore, we always check the bank with the most recently used data (the main bank for the block index) first, which results in a higher percentage of single cycle accesses. Thus the two-bank AT-cache can reduce conflict misses effectively.

Power consumption can also be reduced in the DAS-cache by using the most significant two bits of the large block offset address to activate just one of the k banks of the fully associative cache. For example, if the size of a small block is 8-bytes and that of a large block is 32-bytes, there are four banks in the fully associative DAS-cache, and power can be reduced proportionally.

3.3.2 Cache misses

If a global miss occurs, a large block including the missed small block is brought into the DAS-cache from the next level of memory. Depending on the hit data in the prediction buffers, the dynamic fetch controller may perform additional fetch operations for larger fetch sizes on subsequent cycles. However, it is possible that the two subsequent large blocks are already present in the DAS-cache, and we do not want to replace them. Not only would this waste memory bandwidth, but it would also lose the hit-bit values of those blocks and prematurely promote their referenced small blocks to the AT-cache. Four cases are possible: first, the (i+1)-th large block exists in the DAS-cache and the (i+2)-th does not; second, the (i+1)-th does not exist and the (i+2)-th does; third, both exist; and finally, neither exists.

For 64-byte or 96-byte block fetches, the cache controller always first initiates a fetch signal for the missing 32-byte block, via the AG, onto the address bus. The AG then creates the (i+1)-th block address and sends it to the CAM to search the tags of the DAS-cache and detect whether the (i+1)-th large block already exists. If it does, the fetch signal is not generated; if it does not, the fetch signal is generated on the next cycle.

If a 96-byte fetch has been indicated, then while the dynamic fetch controller is handling the second large block, it prepares the third fetch signal and sends it to the AG during the next cycle. The CAM then searches the tags to detect whether the (i+2)-th large block is already present, and the dynamic fetch controller acts accordingly, as before.

We assume that the time to fetch a 32-byte block is 19 clock cycles; this value is based on common 32-bit embedded processors (e.g., Hitachi SH4 or ARM920T). To it we add the one cycle delay incurred to search the shadow bank of the AT-cache, so the time to fetch a 32-byte large block is assumed to be 20 clock cycles. An 8-byte block can be transferred during each cycle via a 64-bit data bus, so the times to fetch 64-byte and 96-byte block sizes are assumed to be 24 and 28 clock cycles respectively. However, there are variations on these times due to special cases, described next.

When the (i+1)-th large block exists and the (i+2)-th does not, the time to fetch the 64-bytes that make up the i-th and (i+2)-th blocks is 25 cycles. To see why, consider that if the first fetch signal for the missed i-th block occurs at time t, the fetch signal for the (i+2)-th large block is not generated until time t+2, i.e., after a one cycle delay, because of the time required to detect that the (i+1)-th large block is already in the DAS-cache. When the (i+1)-th large block does not exist and the (i+2)-th does, the (i+1)-th large block is fetched continuously as part of a 64-byte unit, i.e., its fetch signal issues during the next cycle at time t+1; thus it takes 24 cycles, as would a normal 64-byte fetch. When both the (i+1)-th and the (i+2)-th large blocks are already present, only 20 clock cycles are needed, which is simply the time for the 32-byte fetch. Finally, when both additional large blocks must be fetched, 28 clock cycles are needed, as explained previously.
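These cycle counts can be collected into one small function (a sketch under the stated assumptions: 20 cycles for a lone 32-byte fetch, 4 additional cycles per extra 32-byte block over the 64-bit bus, and the special cases above; the 64-byte case with the (i+1)-th block already present is our own inference from the text):

#include <stdbool.h>

/* Fetch latency in CPU cycles for a global miss on large block i,
 * given which of the following blocks already reside in the DAS-cache. */
unsigned fetch_cycles(unsigned fetch_bytes, bool next1_present, bool next2_present) {
    if (fetch_bytes == 32)
        return 20;                       /* 19 cycles + 1 shadow-bank probe */
    if (fetch_bytes == 64)               /* blocks i and i+1 */
        return next1_present ? 20 : 24;  /* skip i+1 if already cached */
    /* 96-byte fetch: blocks i, i+1, and i+2 */
    if (next1_present && next2_present) return 20;  /* only i is fetched */
    if (next1_present)  return 25;  /* one-cycle gap while skipping i+1 */
    if (next2_present)  return 24;  /* i and i+1 issue back-to-back */
    return 28;                      /* all three blocks fetched */
}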

In what follows, we use an example configuration in which each 4KB bank of the AT-cache has a small block size of 8-bytes, and the DAS-cache holds 1KB with a large block size of 32-bytes; thus four sequential small blocks can be aggregated into a 32-byte large block. This configuration has a total of 9KB of space. The AT-cache then has 1K small blocks, and the DAS-cache has 32 large blocks. When a replacement must occur because of a global miss, two cases can be considered, depending on whether the DAS-cache is full or not.

1. The DAS-cache is not full: If at least one entry in the DAS-cache is in the invalid state, a large block can be fetched and stored into the DAS-cache without replacement. When the CPU accesses a particular small block within a large block, its corresponding hit bit is set to one to mark it as a referenced block. For a 32-byte fetch, one large block is fetched and stored in the DAS-cache; for 64-byte or 96-byte fetches, two or three large blocks are fetched, and there must be that many invalid entries in the cache to store them without replacement. If there are fewer open entries than blocks to be loaded, the controller switches at the necessary point to the replacement process that we describe next.

2. The DAS-cache is full: If the DAS-cache is full, the oldest entry is replaced according to the FIFO policy. At that time, the small blocks whose hit bits are set are moved into the AT-cache. Because these actions are accomplished while the cache controller is handling a miss, this operation does not introduce any additional delay.

The move operations between the two caches proceed as follows. In our example configuration, when a 32-bit memory address is generated, the tag field is 19-bits, the index field is 10-bits, and the offset field is 3-bits in the AT-cache; in the DAS-cache, the tag field is 27-bits and the offset field is 5-bits. The high order two bits of the large block offset are therefore used to select one of the four banks. The address bits for the AT-cache can be created by combining the tag bits of the large block being replaced in the DAS-cache with the indices of its hit bits: in the hardware, hit bits H0, H1, H2, and H3 indicate that the appended low order bits are 00, 01, 10, and 11 respectively. For example, when the hit bit (H0) of the first small block is the only one set, the bits 00 corresponding to the first small block are appended to the tag value of the DAS-cache. A new memory address can therefore be formed easily. The tag and data values are stored in one of the AT-cache banks depending on the current value of the corresponding driving bit, as described previously. When the fetch size is 64-bytes or 96-bytes, the same mechanism is repeated a second or third time.

Write-back cannot occur directly from the DAS-cache, because any dirty or referenced small block is always moved to the AT-cache before its corresponding large block is replaced. Thus, write-back occurs via the AT-cache, and only the dirty small blocks are actually written. In contrast, a write-back operation for a conventional cache with the same block size would need to write back the entire large block, unless sub-block writing is specifically supported. Our design therefore has the side benefit of reducing write traffic to memory.
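The address reconstruction just described might look as follows in C (a sketch using the example field widths: a 27-bit DAS-cache tag, and a 19-bit tag, 10-bit index, and 3-bit offset in the AT-cache; the helper name is ours, and in this decomposition the 2-bit hit-bit index lands in the low-order bits of the rebuilt block address):

#include <stdint.h>

typedef struct {
    uint32_t tag;    /* 19-bit AT-cache tag */
    uint16_t index;  /* 10-bit AT-cache index */
} at_addr_t;

/* Rebuild the AT-cache tag and index for the small block in bank
 * hit_idx (0..3, i.e., H0..H3) of a large block whose DAS-cache tag
 * is das_tag (27 bits). */
at_addr_t promote_addr(uint32_t das_tag, unsigned hit_idx) {
    /* Bits [31:3] of the small block's address: the 27-bit DAS tag
     * followed by the 2-bit bank index (00, 01, 10, or 11). */
    uint32_t small_block_no = (das_tag << 2) | (hit_idx & 0x3u);
    at_addr_t a;
    a.tag   = small_block_no >> 10;     /* upper 19 bits */
    a.index = small_block_no & 0x3FFu;  /* lower 10 bits */
    return a;
}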

Finally, when a global miss occurs, small blocks corresponding to the (i+1)-th or (i+2)-th large block may already be present in the AT-cache. The cache controller therefore searches the tags of the AT-cache for the four small blocks belonging to each large block being prefetched, and any matching small block is invalidated. If its dirty bit is set, the invalidated small block is used to update the corresponding large block as it is fetched into the DAS-cache. The search operation is accomplished while the cache controller is handling a miss, so it adds no overhead. The power consumption overhead is also negligible, because the prefetching ratio is only about 0.5%~0.7% of the total number of addresses generated by the CPU. If an invalidated small block is referenced in the DAS-cache, then when its corresponding large block is replaced, the small block is stored in the AT-cache once again; therefore, there is almost no performance loss. Of course, if three or four invalidated small blocks are all present in the AT-cache, the effective cache space decreases, but according to our simulation results this case accounts for less than 1% of the total number of prefetches generated.

4. PERFORMANCE EVALUATION

The details of the simulation environment and performance measurements are presented in this section. Benchmarks used in the trace-driven simulation include six of the SPECint95 benchmarks and two from SPECfp95 (applu and tomcatv), representing general-purpose applications, and ten of the Media benchmarks [Lee, Potkonjak, and Mangione-Smith 1997], representing embedded multimedia and communications applications. The Media benchmarks are representative of image compression, voice processing, video transmission, 3D texture mapping, cryptography, and so forth. The executables of these benchmarks were processed by QPT2 to generate traces [Ball et al. 1994]; only data references are collected and used for the simulation. The DineroIV cache simulator [Edler et al. 1997] was modified to simulate the proposed cache system. We have chosen commonly used cache structures for our performance comparisons.

4.1 Distribution of Fetch Block Sizes

We used simulation to determine the best combination of prediction buffer hit bits to use as triggers for the different fetch sizes. A simple and effective result is obtained by thresholding the sum of the hit bits. For our example configuration, there are 8 hit bits in the fetch prediction buffers; depending upon the number of hit bits set to one, one of three fetch unit sizes is chosen, as explained previously. This mechanism enables the cache to adapt to the types of locality present in an application.

Fig. 2 and Fig. 3 show the distribution of the different fetch sizes that are actually initiated, for the Spec95 and the Media benchmarks respectively. The figures distinguish between the number of fetch signals generated for each size and the number of actual fetches that occurred for each size.

The discrepancy is due to the fact that fetches of large blocks are cancelled when the blocks are already present in the DAS-cache.

Fig. 2. Spec95 benchmarks: Ratio of various block signal generations and actual fetching operations.

Fig. 3. Media benchmarks: Ratio of various block signal generations and actual fetching operations.

Table I. The Proportion of Actual Block Fetches.

Benchmarks    32-bytes    64-bytes    96-bytes
Spec95        55.5%       20.7%       23.8%
Media         48.8%       18.2%       33.0%

In the Spec95 benchmarks, with the exception of m88ksim and tomcatv, the 32-byte fetch size dominates. In the Media benchmarks, the 96-byte fetch size is more prominent, although the 32-byte size is also selected frequently. When only the 32-byte fetch size is used, there is no significant performance gain attributable to our features that support greater spatial locality. Table I shows the average percentages of the actual fetch sizes that occurred for the Spec95 and Media benchmarks.

4.2 Comparison with Conventional Caches

Two common performance metrics, the miss ratio and the average memory access time, are used to evaluate and compare the DASAT cache system with other approaches. The miss ratio is the fraction of cache references not found in the cache. The average memory access time captures the latency from the beginning of an access until the requested data are retrieved.

The amount of cache space has a significant impact on the miss ratio, and thus it is one of the most important cache design parameters. But the larger the cache, the slower it is, due to increased signaling time across critical address and data lines; larger caches also consume more power. For our simulations, we assume a DASAT cache system with two 4KB banks in the AT-cache and a 1KB DAS-cache, i.e., a total of 9KB of cache space. Conventional direct-mapped caches with 32KB and 64KB cache sizes, as well as 2-way and 4-way set associative caches with a 16KB cache size, are compared with the DASAT cache.

4.2.1 Miss ratio and average memory access time

Several experiments were performed in advance to determine the optimum block sizes for the AT-cache and the DAS-cache in the DASAT cache system; the combination of an 8-byte small block and a 32-byte large block showed the best performance in most cases. Cache miss ratios for the conventional caches and the DASAT cache are shown in Fig. 4 (Spec95 benchmarks) and Fig. 5 (Media benchmarks). For a direct-mapped cache, denoted DM, the notations "32KB-32byte" and "64KB-32byte" denote 32KB and 64KB direct-mapped caches with a 32-byte block size respectively. The 2-way and 4-way set associative caches with a 16KB cache size and a 32-byte block size are denoted "16KB-32byte (2-way)" and "16KB-32byte (4-way)" respectively.

Notice that the average miss ratio of the DASAT cache equals that of a conventional direct-mapped cache with roughly eight times as much space (e.g., 64KB) for general applications, and is better than that of the larger direct-mapped cache for media applications. A 2-way or 4-way set associative cache greatly reduces the miss ratio, but because of its slower access time and higher power consumption, embedded processors typically do not employ this organization. The simulation results show that the DASAT cache can achieve better performance than a 2-way or 4-way set associative cache with twice as much space.

Fig. 4. Spec95 benchmarks: miss ratios of the conventional caches and the DASAT cache.


Fig. 5. Media benchmarks: miss ratios of the conventional caches and the DASAT cache.

In general, a more meaningful measure for evaluating the performance of any given memory hierarchy is the average memory access time:

Average memory access time = Hit time + Miss rate * Miss penalty.    (1)

Here the hit time is the time to process a hit in the cache, and the miss penalty is the additional time needed to service the miss. Basic parameters for the simulation are presented in Table II. These parameters are based on the values used for common 32-bit embedded processors (e.g., Hitachi SH4 or ARM920T). The hit times of a direct-mapped cache and a fully associative buffer are both assumed to be one cycle. We assume 15 CPU cycles are needed to initiate a miss, so each 8-byte block is transferred from off-chip memory after a 15 CPU-cycle penalty. The conventional cache system with a 32-byte block size and the DASAT cache system based on a 32-byte block fetch take 19- and 20-cycle penalties, respectively.

Table II. Simulation parameters.

System parameters    Values
CPU clock            200 MHz
Memory clock         133 MHz
Memory latency       70 ns
Memory bandwidth     1.6 Gbytes/sec
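As a rough illustrative check of Eq. (1) (our own arithmetic, not a figure from the paper), take a one-cycle hit time, the overall miss ratio of about 1.3% quoted in Section 3.3.1, and the 20-cycle penalty of a plain 32-byte DASAT fetch:

\[
\text{AMAT} \approx 1 + 0.013 \times 20 \approx 1.26 \text{ cycles},
\]

which is consistent in magnitude with the DASAT averages reported later in Table IV (1.30 and 1.32 cycles); the actual penalty varies between 20 and 28 cycles with the dynamic fetch size.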

The average memory access times for the conventional caches and the DASAT cache are compared in Fig. 6 and Fig. 7. Our analysis shows that general applications with a high degree of spatial locality, such as tomcatv and m88ksim, show an especially strong performance improvement with the DASAT cache. We also find that media applications with large block fetch sizes (e.g., 96-bytes), such as unepic, show an especially high performance improvement with the DASAT cache.


Fig. 6. Spec95 benchmarks: average memory access time of the conventional caches and the DASAT cache.


Fig. 7. Media benchmarks: average memory access time of the conventional caches and the DASAT cache.

4.3 Comparison with the Victim Cache and Some Commercial Caches

In this section we compare the DASAT cache with several previously proposed cache designs (e.g., the STS cache, victim cache, selective cache, and assist cache). Our analysis of the performance improvement achieved by each of these designs shows that one of the most effective is the victim cache [Albera et al. 1999; Srinivasan 1998]. The victim cache structure has been used in, for example, the DEC Alpha and the Samsung Calm-RISC. Thus, the victim cache was chosen for comparison with the DASAT cache. A configuration with 8KB of space, a 1KB victim buffer, and a 32-byte block size is used for a direct comparison with the DASAT cache.

Several cache systems in currently available embedded processors are also compared with the DASAT cache. All of these cache systems are configured for general-purpose applications. The block size is set to 32-bytes for all the caches, but different cache sizes and associativities are used. Table III shows the design parameters for the three cache configurations of the chosen embedded processors [Burd 2001].

The victim cache can significantly reduce conflict misses and can provide a low overall miss ratio using only a simple hardware mechanism. However, it incurs a large number of content swaps between the main cache and the victim buffer, which increases power consumption.

Table III. Cache Configurations of Currently Available Embedded Processors.

Processors           Data cache size   Data block size   Data associativity
PowerPC 603e         16KB              32-byte           4-way
AMD mobile K6-2-P    32KB              32-byte           2-way
SH7750 (SH4)         16KB              32-byte           1-way

Figures 8 and 9 show the resulting miss ratio and average memory access time for the general applications (Spec95), respectively. The simulations show that in some cases the DASAT cache achieves better performance than other cache systems with two or four times as much space. The victim cache with a 32-byte block size shows better performance than one with an 8-byte or a 16-byte block size, but increasing the block size often increases write traffic to memory. As shown in Figures 8 and 9, the DASAT cache outperforms the victim cache with a 32-byte block size. Note that the victim cache employs the same 32-byte block size for both the main cache and the victim buffer, while 8-byte small blocks are used in the AT-cache; this also yields a significant reduction in power consumption, because write traffic to memory can be reduced.

Fig. 8. Spec95 benchmarks: miss ratios of the various caches and the DASAT cache.


Fig. 9. Spec95 benchmarks: average memory access time of the various caches and the DASAT cache.

Figures 10 and 11 show the miss ratio and the average memory access time, respectively, for the multimedia applications. The performance gap tends to be more significant than for the general applications. In particular, for the epic and unepic benchmarks, the proposed DASAT cache system achieves an exceptional performance improvement.

Fig. 10. Media benchmarks: miss ratios of the various caches and the DASAT cache.


Fig. 11. Media benchmarks: average memory access time of the various caches and the DASAT cache.

4.4 Comparison of Area and Power Consumption

For our performance/cost and power consumption analysis, we evaluated various cache configurations using the CACTI 3.0 simulator [Reinman et al. 2001], which can calculate access times, cycle times, area, and power consumption for many types of hardware caches. Our results are based on 0.18 µm technology with a 1.7 V supply voltage.

Table IV shows the performance/cost for the various cache configurations. The DASAT cache (dual 4KB direct-mapped caches with an 8-byte block size and a 1KB fully associative cache with a 32-byte block size) shows about a 75% area reduction compared to the conventional direct-mapped cache (64KB-32byte), while providing similar or better performance.

Table IV. Performance and Cost of the Various Caches. Area is computed with CACTI 3.0.

Caches                      Area (cm2)   Avg miss ratio (%)     Avg memory access time (cycles)
                                         Spec95     Media       Spec95     Media
32KB-32B (DM)               0.035449     1.89       1.92        1.36       1.37
64KB-32B (DM)               0.065937     1.45       1.46        1.28       1.28
16KB-32B (2-way)            0.019984     2.19       1.62        1.52       1.39
16KB-32B (4-way)            0.021075     2.05       1.54        1.59       1.47
DASAT (dual 4KB8B-1KB32B)   0.01704      1.42       1.21        1.32       1.30

Table V shows the power consumption for various cache configurations. Each entry shows the power dissipated by a cache access and by a cache update on a miss. In the victim cache and the DASAT cache system, the direct-mapped cache and the fully associative cache are searched in parallel at the same level. According to the results of CACTI 3.0, the access times for the dual direct-mapped cache (e.g., 8KB-8byte) and the fully associative cache (e.g., 1KB-32byte) of the DASAT cache are 0.953 ns and 1.934 ns respectively, while the access time for the tag part of the fully associative cache is 1.372 ns. If a hit occurs at the direct-mapped cache, the data part of the fully associative cache does not need to be driven; that is, the requested data item is transmitted to the CPU without checking for a hit/miss in the fully associative cache. This mechanism offers the fast access time of a direct-mapped cache and low power consumption, using a simple additional unit and an asynchronous SRAM.

Paccess for the victim and DASAT caches can be divided into two cases, depending on a hit or a miss at the direct-mapped cache. If a hit occurs at the direct-mapped cache, power is consumed to access the tag and data parts of the direct-mapped cache and the tag part of the fully associative cache (the "DM hit" case in Table V). If a miss occurs at the direct-mapped cache, the power to access the data part of the fully associative cache is added to that of the hit case (the "DM miss" case in Table V). "DM write" for the victim cache covers two cases: the power to update the direct-mapped cache when a global miss occurs, and the content swap when a victim-buffer hit occurs. "FA write" for the victim cache denotes the power for updating the associative cache when a block replaced from the direct-mapped cache is moved into the victim buffer. "FA write" for the DASAT cache denotes the power for updating the associative cache when a global miss occurs, and "DM write" for the DASAT cache denotes the power consumed in the direct-mapped cache module when a large block is replaced, that is, when its small blocks that are marked as having been accessed are moved into their corresponding block entries in the AT-cache.

Table V. Power Consumption per Access for Various Cache Configurations.

Cache configuration                Paccess (nJ)                       Pcache_write (nJ)
16KB-32B (DM)                      0.4734                             0.2220
32KB-32B (DM)                      0.6205                             0.3726
64KB-32B (DM)                      0.9358                             0.6909
16KB-32B (2-way)                   0.6335                             0.2237
16KB-32B (4-way)                   0.9349                             0.2260
32KB-32B (2-way)                   0.7586                             0.3402
Victim cache (8KB32B-1KB32B)       DM miss: 0.6499, DM hit: 0.4402    DM write: 0.1302, FA write: 0.0779
DASAT cache (dual 4KB8B-1KB32B)    DM miss: 0.5772, DM hit: 0.4092    DM write: 0.1065, FA write: 0.0779

From these values, the average power consumption of the cache system is given by:

Avg. power = Nhit * Paccess + Nmiss * Pmiss,    (2)

where Nhit and Nmiss are the ratios of hits and misses in the cache respectively, Paccess is the power used to access a cache block, and Pmiss is the power required to process a miss. Pmiss can be calculated as follows:

Pmiss = Paccess + Pcache_write + Ppad,    (3)

where Pcache_write is the power for a cache write operation on a cache miss, and Ppad is the power dissipated at the on-chip pad slot. Ppad can be calculated as follows [Wilton et al. 1994; Kamble et al. 1997]:

Ppad = 0.5 * Vdd^2 * (0.5 * (Wdata + Waddr)) * 20pF,    (4)

where Wdata and Waddr are the numbers of bits for the data sent/returned and the address sent to the lower level memory on a miss request. The capacitive load for off-chip destinations is assumed to be 20pF [Wilton et al. 1994]. A data cache with a 32-byte block size is assumed, where the values of Wdata and Waddr are both 32 bits.

Figures 12 and 13 present the average power consumption of the different cache structures compared to the DASAT cache for the benchmarks used earlier. As shown in these figures, the DASAT cache shows the lowest power consumption of all the approaches. Overall, the DASAT cache shows the best result in terms of both performance and power among all of the approaches.
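As an arithmetic check of Eq. (4) with the stated values (Vdd = 1.7 V, Wdata = Waddr = 32 bits, 20 pF per off-chip destination):

\[
P_{pad} = 0.5 \times (1.7)^2 \times \bigl(0.5 \times (32 + 32)\bigr) \times 20\,\text{pF} \approx 0.92\ \text{nJ}
\]

per miss request. Weighted by a miss ratio on the order of 1~2% in Eq. (2), this pad energy contributes only about 0.01~0.02 nJ to the average power per access shown in Figures 12 and 13.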

Fig. 12. Spec95 benchmarks: power consumption of the conventional caches and the DASAT cache.

Fig. 13. Media benchmarks: power consumption of the conventional caches and the DASAT cache.

5. CONCLUSION

We have presented the design of a simple, hardware-controlled, low power, high performance cache system with low cost that is applicable to embedded processors. The DASAT cache is an extended combination of a direct-mapped AT-cache and a fully associative DAS-cache, and a new caching mechanism for exploiting two types of locality effectively and adaptively was presented. The AT-cache comprises two identical conventional direct-mapped caches, which we refer to as the T0 and T1 caches; its organization and operational mechanism are specially designed to improve temporal locality and reduce conflict misses. The DAS-cache is designed to exploit spatial locality dynamically and is constructed as a fully associative structure with a large block size.

The DASAT cache also improves performance by dynamically adjusting the fetch size in response to varying levels of spatial locality in different types of applications. When a miss occurs, the dynamic fetch controller generates fetch signals for one of three block sizes, depending on information kept on recent block access patterns.

Simulation results show that the average memory access time of the DASAT cache equals that of a conventional direct-mapped cache with eight times as much space, for both general and multimedia applications. In addition, the simulations show that the DASAT cache can achieve better performance than a 2-way or 4-way set associative cache with twice as much space. Compared with a victim cache with a 32-byte block size, the average miss ratio is improved by about 41% and 60% for general applications and multimedia applications respectively, and the average memory access time is reduced by about 10% for both types of applications. The performance of the DASAT cache is also comparable to that of a 2-way set associative cache with LRU replacement, such as the AMD mobile K6-2-P cache system, with four times as much space, for both types of applications. Finally, we have shown that power consumption in the DASAT cache is around 10% to 60% lower than that of the various cache systems examined.

REFERENCES

ALBERA, G., AND BAHAR, R.I. 1999. Power/Performance Advantages of Victim Buffer in High-Performance Processors. In Proceedings of the IEEE Alessandro Volta Memorial Workshop, Mar. 1999, 43-51.
BALL, T., AND LARUS, J.R. 1994. Optimally profiling and tracing programs. ACM Transactions on Programming Languages and Systems 16, 4 (July 1994), 1319-1360.
BURD, T. 2001. CPU Info Center: General Processor Info. http://bwrc.eecs.berkeley.edu/CIC/summary/local/, retrieved Dec. 2001.
CHEN, W.Y., BRINGMANN, R.A., MAHLKE, S.A., HANK, R.E., AND SICOLO, J.E. 1992. An Efficient Architecture for Loop Based Data Preloading. In Proceedings of the 25th Int'l Symposium on Microarchitecture, Dec. 1992, 92-101.
CONTE, T.M. ET AL. 1997. Challenges to Combining General-Purpose and Multimedia Processors. IEEE Computer 30, 12 (Dec. 1997), 33-37.
COX, M., BHANDARI, N., AND SHANTZ, M. 1998. Multi-Level Texture Caching for 3D Graphics Hardware. In Proceedings of the 25th ISCA, June 1998, 86-97.

CUCCHIARA, R., PICCARDI, M., AND PRATI, A. 1999. Exploiting Cache in Multimedia. In Proceedings of IEEE Multimedia Systems '99, July 1999, 345-350.
EDLER, J., AND HILL, M.D. 1997. Dinero IV Trace-Driven Uniprocessor Cache Simulator. Available from Univ. of Wisconsin; ftp://ftp.nj.nec.com/pub/edler/d4/, 1997.
GONZALEZ, A., ALIAGAS, C., AND MATEO, M. 1995. Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality. In Proceedings of the International Conference on Supercomputing '95, July 1995, 338-347.
HAKURA, Z.S., AND GUPTA, A. 1997. The Design and Analysis of a Cache Architecture for Texture Mapping. In Proceedings of the 24th ISCA, June 1997, 108-120.
IGEHY, H., ELDRIDGE, M., AND PROUDFOOT, K. 1998. Prefetching in a Texture Cache Architecture. In Proceedings of the Eurographics/SIGGRAPH Workshop on Graphics Hardware '98, Aug. 1998.
JOUPPI, N.P. 1990. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully Associative Cache and Prefetch Buffers. In Proceedings of the 17th ISCA, May 1990, 364-373.
KAMBLE, M.B., AND GHOSE, K. 1997. Analytical Energy Dissipation Models for Low Power Caches. In Proceedings of the ACM/IEEE Int'l Symp. on Low-Power Electronics and Design, Aug. 1997.
KURPANCHEK, G. ET AL. 1994. PA-7200: A PA-RISC Processor with Integrated High Performance MP Bus Interface. In Proceedings of IEEE COMPCON, Feb. 1994, 375-382.
LEE, C., POTKONJAK, M., AND MANGIONE-SMITH, W.H. 1997. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communication Systems. In Proceedings of the 30th Int'l Symposium on Microarchitecture, Dec. 1997, 330-335.
LEE, J.H., LEE, J.S., AND KIM, S.D. 2000. A Selective Temporal and Aggressive Spatial Cache System Based on Time Interval. In Proceedings of the Int'l Conference on Computer Design '00, Sep. 2000, 287-293.
LEE, R.B., AND SMITH, M.D. 1996. Media Processing: A New Design Target. IEEE Micro 16, 4 (Aug. 1996), 6-9.
MILUTINOVIC, V., TOMASEVIC, M., MARKOVIC, B., AND TREMBLAY, M. 1996. The Split Temporal/Spatial Cache: Initial Performance Analysis. In Proceedings of SCIzzL-5, Mar. 1996, 63-69.
PRZYBYLSKI, S., HOROWITZ, M., AND HENNESSY, J. 1988. Performance Tradeoffs in Cache Design. In Proceedings of the 15th Annual Int'l Symp. on Computer Architecture, May 1988, 290-298.
REINMAN, G., AND JOUPPI, N.P. 2001. CACTI 3.0: An Integrated Cache Timing, Power, and Area Model. Compaq WRL Report, Aug. 2001.
SLINGERLAND, N.T., AND SMITH, A.J. 2000. Cache Performance for Multimedia Applications. CSD-00-1123, University of California, Berkeley, Dec. 2000.
SODERQUIST, P., AND LEESER, M. 1997. Optimizing the Data Cache Performance of a Software MPEG-2 Video Decoder. In Proceedings of ACM Multimedia '97, Nov. 1997, 291-301.
SRINIVASAN, V. 1998. Improving Performance of an L1 Cache With an Associated Buffer. CSE-TR-361-98, University of Michigan, Feb. 1998.
WILTON, S.J.E., AND JOUPPI, N.P. 1994. An Enhanced Access and Cycle Time Model for On-Chip Caches. Digital WRL Research Report 93/5, July 1994.
ZUCKER, D.F., FLYNN, M.J., AND LEE, R.B. 1996. A Comparison of Hardware Prefetching Techniques for Multimedia Benchmarks. In Proceedings of the 3rd IEEE Int'l Conference on Multimedia Computing and Systems, June 1996, 236-244.

Received January 2002; revised April 2002; accepted July 2002.