Cache Write Policy for Streaming Output Data

Jason Fritts
Dept. of Computer Science, Washington University, St. Louis, MO
[email protected]

Abstract

This paper presents the results of a cache memory hierarchy evaluation demonstrating that multimedia applications are significantly influenced by the cache write policy of the L1 data cache. As part of an initial investigation of the memory characteristics of multimedia applications, we have been exploring multi-level cache memory hierarchies to evaluate the memory bottlenecks in media processing. The existence of streaming data in multimedia, and the benefits of streaming memory prefetch structures such as stream buffers for supporting streaming input data, have been well studied. However, support for streaming output data has been largely ignored. This study targets streaming output data by evaluating various write policies and write buffer configurations at the L1 cache level. It was found that in cache memory hierarchies the write-allocate cache write policy provides much better performance than no-write-allocate for memory-intensive applications. In alternative memory hierarchies, memory structures for streaming data will need to support significant amounts of write storage and enable temporary output data streams to be available near the processor when they are next needed by the program.

1. Introduction

This paper presents the results of a cache memory hierarchy evaluation that examines the effectiveness of various cache parameters on streaming output data in media processing. As part of on-going research to evaluate the memory characteristics of multimedia applications, we have been experimenting with multi-level cache memory hierarchies to evaluate the memory bottlenecks in media processing. One aspect of that research has been to examine the streaming output data characteristics of multimedia applications. While evaluating various cache parameters, it was found that the cache write policy significantly influences the performance of memory-intensive applications in cache memory hierarchies. In alternative memory hierarchies, streaming memory structures will need to support significant write storage and provide the processor quick access to temporary output data streams.

The design of memory hierarchies for media processing remains an ill-defined area. Early solutions to media processing ranged from application-specific hardware or digital signal processors (DSPs) in embedded applications to general-purpose processors (GPPs) for desktop applications. For DSPs, the memory hierarchy has usually been composed of software-controlled memory with DMA prefetching, while GPPs have typically relied upon multi-level cache memory hierarchies for automated (non-software-controlled) memory support.

In recent years, multimedia applications have come to occupy a significant portion of the computing industry, so much effort has been devoted to optimizing processing for multimedia. This has resulted in the gradual merging of the realms of digital signal processing and general-purpose processing. With the emergence of media processors
(MMPs), DSPs are moving towards more general-purpose architectures, such as VLIW architectures, that are better compiler targets but still provide efficient media processing. Likewise, GPPs now include multimedia extensions offering subword-parallel instructions for more parallelism in multimedia processing, as well as software-controlled prefetch instructions for prefetching multimedia data streams into the cache.

The merging markets, however, have not been able to agree on an effective memory hierarchy design. GPPs still primarily use cache memory hierarchies, but are now beginning to support software-controlled prefetching. Similarly, MMPs are extending their memory hierarchies to also support cache memory. Consequently, the question remains: what is the optimal memory hierarchy design for multimedia?

A number of memory hierarchy options are available. On-chip, there is the option of either a flat or a multi-level memory hierarchy. While a multi-level memory hierarchy may initially sound far-fetched for media processing, such hierarchies already exist on GPPs, and current media processors are now moving in that direction, as exhibited by the reconfigurable 128 KB L2 cache/memory on the TI C64x [1]. With respect to the type of memory, there are the obvious options of multi-level cache hierarchies, as used in GPPs, or local memory with prefetching, as used in DSPs. There are also hybrid alternatives combining cache and prefetching.

One memory aspect we can certainly expect is some form of memory prefetching support. Studies on multimedia applications have shown the existence of significant amounts of streaming memory [2][3], which is most effectively supported using some form of memory prefetching, whether software-controlled prefetching, such as DMA or memory touches, or automatic prefetching, as provided by stream buffers, stride prediction tables, or stream caches. Outside of that single answer, however, many questions remain: Should the memory hierarchy be multi-level like those of GPPs? At what level(s) in the memory hierarchy should prefetching occur? Should the memory hierarchy use memory, cache, or both? And so on. This paper provides an initial investigation of streaming output data and how it may best be supported in memory hierarchies.

The remainder of the paper is organized as follows. Section 2 provides the motivation and goals in studying memory hierarchies for media processing and describes some early results. Related studies in multimedia are discussed in Section 3. In Section 4, we describe our evaluation tools and the base architecture model used for these experiments. The various experiments and the implications of the results are presented in Section 5. Sections 6 and 7 summarize our findings and discuss our future plans for designing memory architectures for media processing.
2. Memory Hierarchy Exploration

On-going research at Washington University has begun to evaluate memory hierarchy options for multimedia. We have been approaching this problem from the perspective of ease of programmability. Development of embedded and multimedia systems must deal with increasingly shorter "time to market". To accommodate the shorter development times, it is necessary to simplify the development process as much as possible. Application design for DSP processors typically incurs higher software development costs because of the limited effectiveness of their compilers. Our goal has been to find memory hierarchy designs that minimize the degree of software control required from the compiler or programmer. Consequently, automated memory hierarchies such as
caches and automatic streaming memory structures are preferable to software-controlled memory and prefetching. We are not ruling those options out, but are simply giving precedence to automated (non-software-controlled) memory structures.

The research began by assuming a multi-level cache-only memory hierarchy. This is a well-studied memory hierarchy, so a large body of existing research can be brought to bear in analyzing the findings. Results from the study are being used to identify the bottlenecks of cache-based memory hierarchies and to determine how those bottlenecks can be eliminated using alternative memory support. An initial investigation of the multi-level cache memory hierarchy [4] found that external memory latency is the foremost problem for multimedia. While prefetching can effectively hide memory latency, it must contend with the second major problem, external memory bandwidth: prefetching must be careful not to saturate the limited memory bandwidth. These results are being used to explore the design of multi-level memory prefetching.

As part of that research, we evaluated various cache parameters to determine their impact on the streaming data inherent to multimedia applications. While researchers have been well aware of the streaming nature of multimedia data for some time, they have primarily concerned themselves with streaming input data. However, multimedia applications exhibit streaming output data as well. Consequently, this evaluation focuses on the impact of various cache parameters on streaming output data. In particular, the cache write policy of the L1 data cache was found to significantly influence the memory-intensive applications in media processing.

3. Related Work

Early investigations of memory hierarchies for media processing have analyzed both cache and memory organizations, but have primarily analyzed only the first level of the memory hierarchy. An initial investigation of cache memory performance for media processing was performed by Lee et al. with their introduction of MediaBench [5]. Zucker et al. [2] examined the impact of streaming memory structures such as stream buffers, stride prediction tables, and stream caches, and found considerable benefit from these structures for an MPEG video codec. Studies by Wu and Wolf on video application traces [3] also found that memory architectures combining cache with stream buffers provide better performance than either cache-only or streaming-memory-only hierarchies. Consequently, we expect some hybrid of cache and prefetching will likely provide the best performance at the L1 level.

Other studies have examined the impact of large data rates on external memory. The enormous data sets of multimedia can place significant burdens on external memory, so it often becomes a bottleneck. To support the high data rates, prior research indicates that either memory hierarchies must be designed to reduce external memory traffic, or scheduling methods must be used to restructure the program for improved data locality [6][7].
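To illustrate the kind of automatic stream prefetching discussed above, the following is a minimal stride-prediction-table sketch in Python. It is our own illustrative model, not the design evaluated in [2] or [3]; the class name, table layout, and training rule (two consecutive identical strides) are simplifying assumptions, and real tables also track confidence state.

```python
# Minimal stride-prediction table: on each load, remember the last address
# seen for that PC slot and the stride between its last two accesses. When
# the same stride is observed twice in a row, issue a prefetch one stride
# ahead. (Illustrative sketch only.)

class StridePredictionTable:
    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}  # slot -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Record a load at (pc, addr); return a prefetch address or None."""
        slot = pc % self.entries
        last_addr, last_stride = self.table.get(slot, (None, None))
        prefetch = None
        if last_addr is not None:
            stride = addr - last_addr
            # Two consecutive identical strides suggest a streaming pattern.
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride
            self.table[slot] = (addr, stride)
        else:
            self.table[slot] = (addr, None)
        return prefetch

# A unit-stride stream (e.g., reading a video frame) quickly trains the table:
spt = StridePredictionTable()
for i in range(4):
    pf = spt.access(pc=0x400, addr=0x1000 + 64 * i)
    print(hex(0x1000 + 64 * i), "->", hex(pf) if pf else None)
```

After two accesses with the same stride, the table begins issuing prefetches one line ahead of the stream, which is the behavior that benefits streaming input data.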
4. Evaluation Environment

Accurate architecture evaluation for media processing requires a benchmark suite representative of the multimedia industry and an aggressive compilation and simulation environment. This evaluation uses an augmented version of the MediaBench benchmark suite defined by Lee et al. [5] at UCLA. The suite provides applications covering six major areas of media processing: video, graphics, audio, images, speech, and security. We also augmented MediaBench with additional video applications, MPEG-4 and H.263. MediaBench already contains MPEG-2, but H.263 and MPEG-4 are distinct in that H.263 targets very low bit-rate video and MPEG-4 uses an object-based representation. These applications are believed to make MediaBench more representative of the current and future multimedia industry.

The compilation and simulation tools for this evaluation were provided by the IMPACT compilation/simulation environment, produced by Wen-mei Hwu's group at UIUC [9]. The IMPACT environment includes a trace-driven simulator and an ILP compiler. The simulator enables both statistical and cycle-accurate trace-driven simulation of a variety of parameterizable architecture models, including both VLIW and in-order superscalar datapaths. We recently extended the simulator to also support an out-of-order superscalar datapath. The compiler supports many aggressive optimizations, including procedure inlining, loop unrolling, speculation, and predication. Our experiments found similar results across IMPACT's many optimization methods, but this paper presents results using the Superscalar optimization level, which includes procedure inlining, loop unrolling, and control and data speculation, but no predication.

Systematic evaluation of different architectures requires a base processor model for performance comparisons. We defined an 8-issue base media processor targeting the frequency range from 500 MHz to 1 GHz, with instruction latencies modeled after the Alpha 21264, as shown in Table 1. The processor uses a cache memory hierarchy with separate L1 instruction and data caches, a 256 KB unified on-chip L2 cache, and an external memory bus that operates at 1/6 the processor frequency. The L1 instruction cache is a 16 KB direct-mapped cache with 256-byte lines and a 20-cycle miss latency (assuming a hit in L2). The L1 data cache is a 32 KB direct-mapped cache with 64-byte lines and a 15-cycle miss latency; it is non-blocking with an 8-entry miss buffer, and uses a no-write-allocate/write-through policy with an 8-entry write buffer. The L2 cache is a 256 KB 4-way set-associative cache with 64-byte lines and a 50-cycle miss latency; it is non-blocking with an 8-entry miss buffer, and uses a write-allocate policy with an 8-entry write buffer. Further details are available in [8].
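To make the cost of each level of this hierarchy concrete, the short sketch below computes average memory access time (AMAT) from the latencies just listed. The latency values come from the base model above; the miss rates in the loop are illustrative placeholders, not measured values from our simulations.

```python
# Average memory access time for the base model's L1 data cache path:
# 3-cycle L1 access, 15-cycle L1 miss latency (hit in L2), and 50-cycle
# L2 miss latency. Miss rates below are assumed for illustration only.

L1_HIT, L1_MISS_PENALTY, L2_MISS_PENALTY = 3, 15, 50

def amat(l1_miss_rate, l2_miss_rate):
    """AMAT in cycles for the two-level hierarchy described above."""
    return L1_HIT + l1_miss_rate * (L1_MISS_PENALTY + l2_miss_rate * L2_MISS_PENALTY)

for l1_mr, l2_mr in [(0.02, 0.10), (0.05, 0.20), (0.10, 0.30)]:
    print(f"L1 miss {l1_mr:.0%}, L2 miss {l2_mr:.0%}: "
          f"AMAT = {amat(l1_mr, l2_mr):.2f} cycles")
```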
5. Streaming Output Data

In designing the cache memory hierarchy for media processing, we initially aimed for a more efficient L1 data cache by making it no-write-allocate/write-through. By avoiding allocation of stored data in the L1 data cache, the entire cache could be reserved for read-only data, which is typically much more critical in the processing dependence chain. Stores are less critical since they do not stall the pipeline on a cache miss. By providing a sufficiently sized write buffer (8 entries) in the L1 data cache, it was expected that stores could be handled efficiently by the on-chip L2 cache. However, this expectation turned out to be far from the truth.
Operation        | Number of Units | Latency
-----------------|-----------------|------------------------
ALU              | 8               | 1
Branches         | 1               | 1
Load/Store       | 4               | loads: 3, stores: 2
Floating-Point   | 2               | 4
Multiply/Divide  | 2               | multiply: 5, divide: 20

Table 1 – Number of parallel functional units and their operation latencies.
[Figure: block diagram of data output flow from the datapath through the L1 data cache (3-cycle access) and L1 write buffers, the L2 cache (15-cycle access) and L2 write buffers, and out to off-chip cache/memory (50-cycle latency), under (a) the L1 write-allocate policy and (b) the L1 no-write-allocate policy.]

Figure 5.1 – Data output flow through the on-chip cache hierarchy using (a) the write-allocate policy in the L1 data cache, and (b) the no-write-allocate policy in the L1 data cache.

Impact of Cache Write Policies

The initial round of simulation experiments indicated that many of the applications were stalled by memory stores for a substantial fraction of their execution time. A subsequent experiment compared the performance of the L1 data cache under (a) an L1 write-allocate policy and (b) an L1 no-write-allocate policy. Figure 5.1 illustrates the data flow of memory stores in both configurations.

From the results of these simulations, it was readily apparent that using a no-write-allocate cache policy instead of a write-allocate policy causes severe degradations in many of the media applications. As shown in Figure 5.2, performance with a write-allocate L1 cache policy was 2x better than performance with the no-write-allocate policy on the impacted benchmarks.

Further investigation of the affected benchmarks revealed two important points. First, all the heavily impacted benchmarks were non-encoding benchmarks. Because encoding benchmarks must first search for redundancy in the media stream and then encode the data to minimize that redundancy, they are much more compute-intensive than decoding and other non-encoding benchmarks. Additionally, the encoding benchmarks receive (load) large amounts of data but produce (store) much smaller amounts of data. As a result, the encoding benchmarks show only a small degradation (about 8-10%) with the no-write-allocate cache policy. Conversely, the decoding and representation-transformation (e.g. graphics and PostScript processing) benchmarks are much more memory-intensive. Instead of spending significant amounts of computation searching for redundancy, they simply transform the compressed stream back into an uncompressed stream. Furthermore, they receive (load) small amounts of data and produce (store) much larger amounts of data. Consequently, the decoding and transformational benchmarks produce extensive streaming output data.
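The behavioral difference between the two policies in Figure 5.1 can be captured in a few lines. The toy model below is our own sketch, not the simulator used for the measurements: it ignores timing and capacity entirely and tracks only where a store's data ends up, which is enough to show why a consumer of the stored data later hits or misses in L1.

```python
# Where does a store land? Under write-allocate/write-back, a store miss
# allocates the line in L1, so later loads of that line hit in L1. Under
# no-write-allocate/write-through, the data bypasses L1 (via the write
# buffer) into L2, so a later load of the same line misses in L1.
# Toy model: caches are just sets of resident line addresses.

LINE = 64  # bytes per line, as in the base model

def line_of(addr):
    return addr // LINE

def store(addr, l1, l2, write_allocate):
    if write_allocate:
        l1.add(line_of(addr))   # allocate/update the line in L1 (write-back)
    else:
        l2.add(line_of(addr))   # bypass L1; write buffer drains into L2

def load_hits_l1(addr, l1):
    return line_of(addr) in l1

for policy in (True, False):
    l1, l2 = set(), set()
    # One function writes a temporary output stream...
    for addr in range(0x8000, 0x8000 + 1024, 4):
        store(addr, l1, l2, write_allocate=policy)
    # ...and the next function immediately reads it back.
    hits = sum(load_hits_l1(a, l1) for a in range(0x8000, 0x8000 + 1024, 4))
    name = "write-allocate" if policy else "no-write-allocate"
    print(f"{name}: {hits}/256 loads hit in L1")
```

Under write-allocate all 256 read-back loads hit in L1; under no-write-allocate all of them miss, which previews the temporary-stream effect analyzed later in this section.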
[Figure: bar chart of IPC (0 to 2) under the no-write-allocate and write-allocate policies for djpeg, h263dec, mipmap, mpeg2dec, mpeg4dec, osdemo, rawdaudio, texgen, and unepic, plus ENCODE and NON-ENCODE group averages.]
Figure 5.2 – Comparison of the write-allocate/write-back and no-write-allocate/write-through L1 data cache write policies. IPCs are given for the 9 most heavily impacted benchmarks, and average IPCs are given for the grouped encoding and non-encoding benchmarks.

The second point was that the write buffers were the cause of the memory stalls. The large amounts of streaming output data were filling all 8 entries in the L1 data cache's write buffer, so all subsequent stores were forced to stall the entire pipeline. This revelation was rather surprising considering that the dynamic frequency of stores in these benchmarks is only approximately 10%, yet they degrade performance by 50%. Such enormous degradation is possible because stores in media processing are often performed in "burst" fashion. For example, a large number of stores occur during run-length decoding, which receives an encoded input stream and decodes it into an uncompressed stream. During this stage the frequency of stores is quite high, so the 8-entry write buffer quickly fills and all further stores stall. Hence, streaming output data is the primary culprit in degrading processing performance. A write-allocate/write-back policy is necessary in the L1 data cache to effectively support streaming output data.
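A back-of-the-envelope queue model shows how a roughly 10% store frequency can still stall the pipeline this badly. The sketch below is our own simplification, not the measured configuration: it assumes one instruction per cycle, an 8-entry write buffer, and a drain rate of one entry per 6 cycles (an assumed value chosen so that the average store rate alone does not saturate the buffer).

```python
# Toy write-buffer model: bursty stores vs. a slow drain to L2.
# Assumptions (ours, for illustration): 1 instruction issues per cycle,
# a store enters the write buffer instantly unless it is full, and the
# buffer retires one entry every DRAIN cycles into the L2 cache.

DRAIN = 6           # assumed cycles to retire one write-buffer entry
ENTRIES = 8         # write-buffer capacity, as in the base model

def run(trace, entries=ENTRIES):
    occupancy, cycle, stalls = 0, 0, 0
    next_drain = DRAIN
    for is_store in trace:
        if is_store:
            while occupancy >= entries:     # buffer full: stall until a drain
                cycle += 1
                stalls += 1
                if cycle >= next_drain:
                    occupancy -= 1
                    next_drain = cycle + DRAIN
            occupancy += 1
        cycle += 1
        if occupancy > 0 and cycle >= next_drain:
            occupancy -= 1
            next_drain = cycle + DRAIN
    return cycle, stalls

# Same 10% store frequency, two arrival patterns:
uniform = [i % 10 == 0 for i in range(10_000)]        # evenly spaced stores
bursty = ([True] * 100 + [False] * 900) * 10          # run-length-decode style
for name, trace in [("uniform", uniform), ("bursty", bursty)]:
    cycles, stalls = run(trace)
    print(f"{name}: {cycles} cycles, {stalls} stall cycles "
          f"({stalls / cycles:.0%})")
```

With the same overall store frequency, the uniform trace never fills the buffer, while the bursty trace stalls for a large fraction of its cycles, mirroring the burst behavior observed in the simulations.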
Impact of Write Buffers

While the experiment comparing performance under different cache write policies indicates the need for write-allocate when using a cache memory hierarchy, it provides little additional information about streaming output data in multimedia beyond its bursty nature. Further characterization of streaming output data is necessary in order to understand how it may be supported under alternative memory hierarchies. Ideally, in designing memory hierarchies for media processing, we would like to separate the memory structures needed for streaming and non-streaming data. Streaming data typically has high spatial locality and low temporal locality, while non-streaming data, such as the look-up tables used for run-length coding, quantization, and DCT coefficients, has low spatial locality and high temporal locality [10]. Separating the memory structures would avoid cross-pollution between the two data types, so the separate memory structures would operate
more efficiently. This would also enable two smaller structures, which can operate at higher frequencies than one larger structure.

To further evaluate the characteristics of streaming output data, an additional experiment measured the performance improvement from increasing numbers of write buffer entries when using the no-write-allocate/write-through policy. Since the prior experiment indicated that stores typically occur in "burst" fashion, increasing the number of write buffers should help relieve the bottleneck caused by bursty stores. This indicates both the relative degree of burstiness of stores in multimedia, and whether additional problems exist from not allocating space in the L1 data cache for streaming output data.

Again using the no-write-allocate/write-through policy, we evaluated the performance of the MediaBench applications while varying the number of L1 write buffers from 1 to 32,768. Across all the benchmarks, the results exhibited substantial performance variations only on the non-encoding benchmarks, i.e. those benchmarks that originally experience poor performance with only 8 write buffer entries under the no-write-allocate policy. All other benchmarks showed only marginal improvement from increasing the number of write buffers. The results for some of the non-encoding benchmarks are shown in Figure 5.3.

As expected, increasing the number of write buffers did reduce the performance degradation from bursty stores, but it also indicated a higher degree of burstiness than expected. As illustrated, the benchmarks did not begin to display relief from the saturated write buffers until there were at least 1K write buffers (each capable of holding one outstanding store). At 1-2K write buffer entries, the performance of rawdaudio and djpeg began to improve significantly, but the other benchmarks showed much less improvement as the number of write buffers was further increased. Also note that performance under the no-write-allocate policy with any number of write buffers is always less than performance using the write-allocate policy.

[Figure: plot of IPC (0 to 1.5) versus the number of write-buffer entries (1 to 16K, log scale) for djpeg, h263dec, rawdaudio, texgen, and unepic.]
Figure 5.3 – Performance versus number of write-buffer entries for the no-write-allocate/write-through L1 data cache write policy.
From these results we are able to draw two important conclusions. First, the non-encoding multimedia applications typically display a high degree of store burstiness, so any memory structures for supporting streaming output data will require a significant amount of storage for memory writes. Second, it is important to keep streaming output data relatively close to the processor. This second conclusion may be derived from the fact that at the point in the simulation regression where there are 8K+ write buffer entries, there is as much or more storage available in the write buffers as in the complete 32 KB L1 data cache (for example, assuming 4-byte entries, 8K entries already amount to 8192 x 4 B = 32 KB). Consequently, the problem resides not in the lack of storage, but in the distance of the data from the processor. Because the data is stored into the L2 cache and not the L1 cache, the additional latency in accessing data residing in the L2 cache generates the performance degradation.

The reason the distance of stored data from the processor is so important is that much of the streaming output data generated consists of temporary data streams. These streams are the blocks of data that are generated between major functional blocks in multimedia applications. Temporary streams do not need to be written out to a file or to the display; they are simply a means of storing temporary data until the application begins processing it in the next function of the program. Consequently, when a temporary stream is written into the L2 cache instead of the L1 cache, it takes much longer to access that stream when it is next needed by the program.

An example of this phenomenon is illustrated in Figure 5.4. Figure 5.4(a) shows the major functional blocks of an MPEG-2 decoder. While performing each of these major functions, the decoder generates a stream of data. The four major streams generated by the decoder are labeled A, B, C, and D in the figure. Of these four streams, only the last, stream D, will be saved to a file or sent to the display. The other three are all temporary streams.
[Figure: (a) the MPEG-2 decoder pipeline (run-length decoder, inverse quantization, IDCT, motion compensation), producing temporary streams A, B, and C between stages and final stream D; (b) under the L1 write-allocate policy, the temporary streams flow through the L1 data cache; (c) under the L1 no-write-allocate policy, streams A, B, and C are written into the L2 cache.]
Figure 5.4 – Example indicating the destination of streaming output data in an MPEG-2 decoder.
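To make the temporary-stream pattern concrete before discussing the figure further, the sketch below mimics the decoder's structure: each stage writes a stream that only the next stage reads. The stage names follow Figure 5.4, but the function bodies and data are placeholders, not actual MPEG-2 algorithms.

```python
# Temporary output streams between major functions of an MPEG-2-style
# decoder (cf. Figure 5.4). Streams A, B, and C live only long enough to
# be consumed by the next stage; only D leaves the decoder. Under a
# no-write-allocate L1, each stage's output lands in L2, so the next
# stage begins with a burst of L1 misses when reading its input back.

def run_length_decode(bitstream):
    return [sym for sym, count in bitstream for _ in range(count)]  # stream A

def inverse_quantize(a, scale=2):
    return [coeff * scale for coeff in a]                           # stream B

def idct(b):
    return [coeff / 8 for coeff in b]   # placeholder transform     # stream C

def motion_compensate(c, pred=0):
    return [x + pred for x in c]                                    # stream D

bitstream = [(3, 4), (0, 8), (7, 2)]     # (symbol, run length) pairs
a = run_length_decode(bitstream)         # temporary stream A
b = inverse_quantize(a)                  # temporary stream B
c = idct(b)                              # temporary stream C
d = motion_compensate(c)                 # final stream D (to file/display)
print(len(a), "samples flow through each temporary stream; output:", d[:4])
```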
Also shown in Figure 5.4 is the data flow for the temporary streams A, B, and C as they are processed by the system. Part (b) of the figure displays the data flow when using the write-allocate policy, while part (c) displays the flow when using the no-write-allocate policy. As indicated, the temporary streams are stored in the L2 cache when using the no-write-allocate policy, so when they are needed by the next major functional block in the decoder, they generate a series of L1 cache misses.

The implication of the temporary streams for designing alternative memory structures is that not only must substantial write storage be provided for streaming output data, the data must also be available close to the processor by the time it is needed for the next major function in the application. Consequently, the output data must be either (a) initially stored near the processor (i.e. at the L1 memory level), or (b) prefetched back to the processor before it is needed again. Our future research will explore the advantages and disadvantages of both alternatives.

6. Conclusions

This paper presents the results of a cache memory hierarchy evaluation that compared cache write policies for the L1 data cache. Comparison of the no-write-allocate/write-through and write-allocate/write-back policies indicated significantly lower performance for the decoding and representation-transformation benchmarks when using the no-write-allocate policy. Further investigation revealed that the problem was most severe on these benchmarks because they are memory-intensive benchmarks that produce large amounts of streaming output data. Due to the bursty nature of this output data, the write buffers quickly become saturated and generate significant processor stall penalties.

The characteristics of streaming output data were further investigated by an experiment that simulated the no-write-allocate cache write policy with an increasing number of write buffer entries. The results indicate not only that streaming output data generates a relatively high degree of bursty writes, but also that the output streams must oftentimes be kept close to the processor. Many output data streams are temporary streams that move temporary results between major functions in the application. These streams must reside near the processor when they are needed by the next major function, or significant miss penalties will result.

Ideally, the design of memory hierarchies for media processing will separate the memory into separate structures for streaming and non-streaming data. Because these two types of data demonstrate significantly different characteristics, separate memory structures would prevent cross-pollution of the data types in the memory. To enable separate memory structures for supporting streaming data, this study concludes that the streaming memory structure will need to (a) provide significant amounts of write storage for bursty writes, and (b) enable temporary output data streams to be available near the processor when they are next needed by the program.
7. Future Work

We are now beginning to research multi-level memory hierarchies for media processing that combine memory prefetching with cache memory. This work is studying two aspects of the memory hierarchy. The first is the design of memory prefetch units for both streaming input and streaming output data. Previously, automatic memory prefetch structures, such as stream buffers and stride prediction tables, have been designed nearly exclusively for streaming input data. We are examining
modifications to the existing prefetch structures to support streaming output data. The second area of research is investigating a multi-level memory prefetching method that will not overload the external memory bandwidth. This method will use a conservative on-chip memory prefetch unit to prefetch into the near future, while a more aggressive off-chip memory prefetch unit prefetches further into the future. The combination should provide aggressive prefetching without saturating the chip's memory bandwidth.
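As a sketch of this two-level scheme: a conservative on-chip unit prefetches a short distance ahead of a detected stream, while an aggressive off-chip unit runs farther ahead subject to a bandwidth budget. The distances, budget, and function shape below are illustrative assumptions, not a proposed design.

```python
# Two-level prefetching sketch: the on-chip unit prefetches NEAR lines
# ahead of a detected stream; the off-chip unit prefetches up to FAR lines
# ahead, but only while the external bus has spare capacity. All
# parameters are illustrative placeholders.

LINE = 64
NEAR, FAR = 2, 16            # prefetch distances, in cache lines
BUS_BUDGET = 4               # max outstanding off-chip prefetches

def prefetch_candidates(stream_addr, stride, outstanding):
    """Return (on_chip, off_chip) prefetch addresses for one stream."""
    on_chip = [stream_addr + stride * i for i in range(1, NEAR + 1)]
    spare = max(0, BUS_BUDGET - outstanding)   # throttle to spare bandwidth
    off_chip = [stream_addr + stride * i
                for i in range(NEAR + 1, FAR + 1)][:spare]
    return on_chip, off_chip

on, off = prefetch_candidates(stream_addr=0x2000, stride=LINE, outstanding=2)
print("on-chip :", [hex(a) for a in on])
print("off-chip:", [hex(a) for a in off])
```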
Bibliography

[1] Texas Instruments C6x: http://www.ti.com/sc/docs/dsps/products/c6x/index.htm
[2] Daniel F. Zucker, Michael J. Flynn, and Ruby B. Lee, "Improving Performance of Software MPEG Players," Proceedings of COMPCON '96, 1996.
[3] Z. Wu and W. Wolf, "Study of Cache Systems in Video Signal Processors," IEEE Workshop on Signal Processing Systems, October 1998, pp. 23-32.
[4] Jason Fritts and Wayne Wolf, "Multi-Level Cache Hierarchy Evaluation for Programmable Media Processors," IEEE Workshop on Signal Processing Systems, October 2000.
[5] Chunho Lee, Miodrag Potkonjak, and William H. Mangione-Smith, "MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems," Proceedings of the 30th Annual International Symposium on Microarchitecture, December 1997.
[6] Y. K. Chen and S. Y. Kung, "Multimedia Signal Processors: An Architectural Platform with Algorithmic Compilation," Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 20, 1998.
[7] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, and A. Vandecappelle, Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design, Kluwer Academic Publishers, 1998.
[8] J. Fritts, "Architecture and Compiler Design Issues in Programmable Media Processors," Ph.D. Thesis, Dept. of Electrical Engineering, Princeton University, 2000.
[9] IMPACT compiler group: http://www.crhc.uiuc.edu/Impact
[10] Paolo Faraboschi, Giuseppe Desoli, and Joseph A. Fisher, "The Latest Word in Digital and Media Processing," IEEE Signal Processing Magazine, pp. 59-85, March 1998.