Improving Performance for Software MPEG Players

Daniel F. Zucker, Michael J. Flynn, and Ruby B. Lee
Computer Systems Laboratory, Stanford University, Stanford, CA 94305
[email protected] [email protected] [email protected]
Abstract
In this paper, we present a technique for improving cache memory performance for software MPEG players. We motivate this technique by first presenting a characterization of cache behavior for the mpeg_play and mpeg2play applications. We then apply two hardware-based prefetching techniques to improve cache memory performance. Previously published work has applied prefetching only to scientific applications. The prefetching technique presented here eliminates approximately 80% of cache misses and potentially reduces execution time by a factor of two.
1 Introduction
As multimedia datatypes and applications become ubiquitous, new processors and architectures will have to support video compression as a matter of course. Our objective is to find low-cost enhancements for standard processor architectures to support MPEG and MPEG2 applications. This paper focuses on the memory hierarchy as a means of providing this kind of performance enhancement. Our earlier work [9] focused on arithmetic enhancements aimed at improving the computational aspect of compression. Our current work targets the memory hierarchy, which is of fundamental importance to multimedia system performance. While there has been much work studying memory performance for scientific and general purpose applications, there has been little work on the needs of multimedia applications. Our results show that relatively simple prefetching techniques can significantly improve memory hit rates for MPEG. As processors become faster and exploit increasing instruction-level parallelism, memory performance will have a dominating effect on overall processor performance. Improvements in memory performance can eventually result in performance increases of up to two times with relatively little additional hardware.
Figure 1. Baseline miss rates for a direct mapped cache. (Miss rate vs. cache size, 2KB to 1MB, associativity 1, for mpeg_play-hula, mpeg_play-easter, and mpeg2play.)
2 Methodology

2.1 Applications

The applications we have chosen to measure performance are mpeg_play and mpeg2play. The MPEG implementation used in this paper is mpeg_play from the Berkeley MPEG Group [7]. The MPEG2 implementation is mpeg2play from the MPEG Software Simulation Group's MPEG2 release [3]. Specific characteristics of the images used in our benchmarks are shown in table 1. Two images were used for MPEG to illustrate the effect of different frame types: hula_2.mpg uses only I and P frames, while easter.mpg also includes B frames. Memory references refer to both data loads and data stores. Instruction accesses were not simulated. Although the number of frames for each application might seem small, the miss rate rapidly converges to a stable average after only a few frames, so the applications perform similarly to movies with many more frames.

application  image        frame size  number of frames  frame pattern  data memory references
MPEG         hula_2.mpg   352x240     40                IPPIPPI        6e+07
MPEG         easter.mpg   240x176     49                IPBBIBBPBBI    6e+07
MPEG2        tennis.m2v   576x704     7                 IBBPBBPP       8e+07

Table 1. Characteristics of the movies used in the benchmark applications.
2.2 Simulation Techniques
Trace-driven simulations are used to model memory behavior and determine cache misses. Application code was compiled to assembly language with the commercially available HP C Compiler version A.09.75, with maximum optimization set by the +O3 option. This assembly code was then instrumented using the RYO instrumentation tool for the PA-RISC architecture [10]. The instrumentation adds assembly code so that external library functions are called for every memory access instruction.

To save both execution time and disk space, distinct traces are not written to disk files. Instead, the cache simulator is executed concurrently with the instrumented executable so that address references are simulated dynamically. Because new traces are generated dynamically at every execution, variables returned from system calls may cause slightly different traces and some run-to-run variation.

The simulator provides data over a wide range of data cache sizes and associativities. The data presented here is for a direct mapped primary data cache with a line size of 16 bytes; this line size was chosen to better expose the potential benefits of prefetching. Hereafter, this first level data cache is referred to as the main cache. Because only a single process was simulated for each cache configuration, the performance reported for a given cache size is expected to correspond to a larger cache size in a real system. Instruction memory accesses are not modeled. When possible, performance data is reported in terms of miss rate so that the different schemes can be compared independent of specific implementation parameters.
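To make the simulation setup concrete, the sketch below shows the core of a trace-driven direct mapped cache model of the kind described above. It is our illustration, not the authors' simulator: only the 16-byte line size, the fill-on-miss behavior, and the miss-rate metric come from the text, while the driver loop stands in for the address stream that the RYO-instrumented executable would supply.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Minimal trace-driven model of a direct mapped data cache with
// 16-byte lines, as used for the baseline measurements in this paper.
struct DirectMappedCache {
    static constexpr uint64_t kLineSize = 16;  // bytes per line
    std::vector<uint64_t> tags;                // one tag per set
    std::vector<bool> valid;
    uint64_t accesses = 0, misses = 0;

    explicit DirectMappedCache(uint64_t cacheBytes)
        : tags(cacheBytes / kLineSize, 0),
          valid(cacheBytes / kLineSize, false) {}

    // Returns true on a miss and installs the line (no write policy
    // distinction is needed just to count misses).
    bool access(uint64_t addr) {
        ++accesses;
        uint64_t line = addr / kLineSize;
        uint64_t set  = line % tags.size();
        if (valid[set] && tags[set] == line) return false;  // hit
        valid[set] = true;
        tags[set]  = line;                                   // fill on miss
        ++misses;
        return true;
    }

    double missRate() const {
        return accesses ? double(misses) / double(accesses) : 0.0;
    }
};

int main() {
    DirectMappedCache cache(32 * 1024);  // 32KB, as in figure 2
    // Synthetic linear stream: 4-byte references, so one compulsory
    // miss per 16-byte line, i.e. a miss rate of about 0.25.
    for (uint64_t addr = 0; addr < (1u << 20); addr += 4)
        cache.access(addr);
    std::printf("miss rate: %.4f\n", cache.missRate());
    return 0;
}
```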
3 Cache Memory Behavior

Baseline miss rates are shown in figure 1. For large cache sizes, the MPEG1 images settle to very low miss rates, while the MPEG2 image goes only slightly below 2%, even for the 1MB cache. This is due to the difference in frame size: tennis.m2v has a much larger frame size than either easter.mpg or hula_2.mpg. For smaller caches, the miss rates get progressively worse and can account for a significant fraction of execution time.

A graph showing the sequence of misses is shown in figure 2. This graph numbers each miss sequentially starting with 0 and plots miss number vs. memory address for a 32KB cache. The range of addresses was chosen to illustrate the stream-like nature of the data: a large fraction of the misses occurs in linear streams. Because of this extremely predictable access pattern, an intelligent prefetching scheme should be able to predict what data will be needed well in advance, and thereby greatly improve miss rates.

The large rectangular blocks of misses are an artifact of the specific dithering algorithm used in mpeg_play. Because a very large table look-up is performed, the cache does not have the capacity to hold the entire table, so misses result frequently. Figure 3 shows the miss behavior when the -dither ordered option is used. This implementation of dithering substitutes 3 smaller table look-ups and two additions for the single large table look-up of the first implementation. Using this dithering option reduces the overall miss rate from 3.5% to 2.3%. While the execution time will depend on the trade-off of arithmetic speed for memory performance in a given system, the ordered dither option performs better on the machine used in this study.

Looking at a vertical slice where only three data streams are present, the three streams represent the luminance and two chrominance components of the frame being displayed. The slope of the luminance component is twice that of the chrominance components due to the 2:1:1 sampling scheme employed. Vertical slices with six streams present illustrate a P frame being constructed from references to the corresponding I frame. The IPP frame cycle is repeated for the length of the movie. Figure 4 shows the same kind of graph for MPEG2.
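The dithering trade-off described above can be sketched as follows. This is our reconstruction, not mpeg_play's actual code; the table sizes and index packing are assumptions chosen only to show why one variant thrashes a small cache while the other does not.

```cpp
#include <cstdint>

// Default style: one large table indexed by a packed (Y, Cb, Cr)
// triple -- far larger than a small cache, so look-ups miss often.
static uint8_t bigTable[1 << 18];                  // one 256KB look-up
// -dither ordered style: three small per-component tables combined
// by two additions -- extra arithmetic for a tiny cache footprint.
static int16_t yTab[256], cbTab[256], crTab[256];  // three 512B tables

uint8_t convertBig(uint8_t y, uint8_t cb, uint8_t cr) {
    // Single large look-up: 6 bits of each component form the index
    // (assumed packing, for illustration).
    uint32_t idx = (uint32_t(y >> 2) << 12) | (uint32_t(cb >> 2) << 6)
                 | uint32_t(cr >> 2);
    return bigTable[idx];
}

uint8_t convertSmall(uint8_t y, uint8_t cb, uint8_t cr) {
    // Three small look-ups and two additions; all tables stay cached.
    int v = yTab[y] + cbTab[cb] + crTab[cr];
    if (v < 0) v = 0; else if (v > 255) v = 255;  // clamp to pixel range
    return uint8_t(v);
}
```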
Figure 2. Memory stream behavior for mpeg_play. (Miss address vs. miss number, 32KB main cache.)

Figure 3. Memory stream behavior for mpeg_play -dither ordered. (Miss address vs. miss number.)

Figure 4. Memory stream behavior for mpeg2play. (Miss address vs. miss number.)
The streaming behavior looks almost identical to that for MPEG1. In this case, however, vertical slices with 9 streams present show a B frame being constructed from two reference frames.
4 Prefetching Techniques

Hardware data prefetching is a well-known technique for improving data cache behavior. While much published work has studied the effect of prefetching on scientific applications, there has been little work on prefetching for multimedia applications. The first published prefetching scheme was the one-block-lookahead (OBL) scheme for prefetching cache lines reported by Smith [8]. Jouppi [5] expanded this idea with his proposal for stream buffers. Palacharla and Kessler [6] later proposed several enhancements to the stream buffer. Another approach to hardware prefetching relies on an external table to keep track of past memory operations and predict future requirements for prefetching. This type of prefetching has been studied extensively by Chen and Baer [1] [2]. The stride prediction used in this paper is similar to the technique proposed by Fu and Patel [4].
The structure of the stride prediction table (SPT) simulated is shown in figure 5. A table indexed by instruction address is maintained for all memory operations and holds the address of the last access. When a subsequent memory access is made by an instruction already contained in the stride prediction table, the current memory access address is subtracted from the previously stored address to calculate a data stride value. If this value is non-zero, the next predicted memory item, calculated by adding the stride value to the current memory address, is prefetched into the main cache. When the current instruction does not match an instruction stored in the stride prediction table, an SPT miss occurs, and a new entry is added to the SPT, replacing the least recently used (LRU) entry.
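As a concrete restatement of the SPT operation just described, the sketch below implements the table lookup, stride computation, prefetch prediction, and LRU replacement. It is an illustrative software realization, not the simulated hardware itself; the map-plus-list LRU bookkeeping is our assumption about one reasonable implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// Stride prediction table (SPT): indexed by instruction address, each
// entry holds the last data address that instruction touched.  A hit
// with a non-zero stride predicts a prefetch of (current + stride);
// a miss installs a new entry, evicting the least recently used one.
class StridePredictionTable {
    struct Entry {
        uint64_t lastAddr;
        std::list<uint64_t>::iterator lruPos;  // position in LRU list
    };

public:
    explicit StridePredictionTable(size_t entries) : capacity_(entries) {}

    // Called on every memory access; returns true and sets *prefetchAddr
    // when a prefetch should be issued.
    bool access(uint64_t instrAddr, uint64_t memAddr, uint64_t* prefetchAddr) {
        auto it = table_.find(instrAddr);
        if (it != table_.end()) {                        // SPT hit
            int64_t stride = int64_t(memAddr) - int64_t(it->second.lastAddr);
            it->second.lastAddr = memAddr;
            lru_.erase(it->second.lruPos);               // move to MRU
            lru_.push_front(instrAddr);
            it->second.lruPos = lru_.begin();
            if (stride == 0) return false;               // no prediction
            *prefetchAddr = uint64_t(int64_t(memAddr) + stride);
            return true;
        }
        if (table_.size() >= capacity_) {                // SPT miss: evict LRU
            table_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(instrAddr);
        table_[instrAddr] = Entry{memAddr, lru_.begin()};
        return false;
    }

private:
    size_t capacity_;
    std::unordered_map<uint64_t, Entry> table_;
    std::list<uint64_t> lru_;
};
```

On the linear streams visible in figure 2, each load instruction quickly locks onto its stride, so nearly every subsequent reference in a stream is covered by a prediction.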
Data obtained from simulations using a 128 entry stride table is shown in figure 6. The metric used to report performance is the fraction of misses eliminated. This metric judges a given prefetch scheme independent of the memory system implementation; a perfect prefetching scheme would eliminate all memory misses and score 1.0. In the case of a second level cache, the fraction of misses eliminated is identical to the hit rate of the second level cache: of all the misses that occur in the first level cache, the fraction that hit in the second level cache is, by definition, the fraction of misses eliminated. The reason for using fraction of misses eliminated instead of second level hit rate is that there is no discrete second level cache here, since prefetches are fetched directly into the main cache. Furthermore, performance improvement can be compared independently of other cache design considerations such as main cache size and associativity. The size of the main cache has a dominating effect on miss rate, so if results were compared simply in terms of absolute miss rates, the variation due to cache size would tend to mask the variation due to prefetching scheme. Finally, performance can be judged independently of memory implementation parameters such as time to access main memory; otherwise, varying parameters such as the cycles needed to fill a main cache line could have a significant impact on results.

For large main cache sizes, between 70% and 90% of misses are eliminated relative to a cache of the same size and associativity with no stride prediction mechanism. A knee appears in the curve, however, at a cache size of approximately 32KB, below which the stride prediction table rapidly becomes less effective. For very small cache sizes, performance can actually be degraded by the SPT: the large number of non-useful prefetches begins to evict useful data from the cache. This problem could potentially be solved by filtering techniques.

The stride prediction table works very well for middle and large cache sizes. Indeed, it would be difficult to do better than eliminating 90% of the misses; in this range, the SPT is an effective means of prefetching. However, the stride prediction table has two significant problems at the smaller cache sizes. First, a large stride prediction table paired with a small main cache results in unpredictable performance and may be undesirable in a real system: certain applications may improve, while others would be unpredictably degraded. Second, even when the performance of smaller cache sizes is not degraded, it is hardly improved. It is the smaller cache range, where the main cache exhibits a high miss rate, that has the most to gain from successful prefetching. The larger caches, where the stride prediction table eliminates most of the misses, actually have fewer misses to eliminate, so total execution time is less impacted. Thus, it is precisely the range with the most potential performance benefit where the stride prediction table is least effective.
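Stated compactly (in our notation, not the paper's):

\[
\text{fraction of misses eliminated} \;=\; \frac{M_{\text{base}} - M_{\text{pf}}}{M_{\text{base}}}
\]

where \(M_{\text{base}}\) counts main cache misses without prefetching and \(M_{\text{pf}}\) counts the misses remaining with prefetching enabled, for the same cache size and associativity.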
Figure 5. Stride prediction table architecture. (Each entry, indexed by instruction address, holds the last memory address and a valid bit; a compare detects SPT hits, a subtract computes the stride, and an add forms the prefetch address.)
Figure 6. Hit rates for a 128 entry stride prediction table. (Fraction of misses eliminated vs. cache size, 2KB to 1MB; associativity 1.)
5 Stream Cache
Figure 7. Stream cache architecture. (The stream cache sits between the processor and memory alongside the main cache and is searched in parallel with it.)
The stream cache potentially overcomes the problems of the stride prediction table by improving performance for the small cache sizes. The stride prediction table does a good job of predicting which data to prefetch, but fails for smaller cache sizes because it prefetches a large amount of unnecessary data. The stream cache uses the stride prediction table to prefetch data not to the main cache, but to a small stream cache that is accessed in parallel with the main cache. Because the data is not prefetched directly to the main cache, polluting the main cache is not a problem.

The stream cache is based on the hypothesis that MPEG applications tend to operate on a relatively small workspace of data that marches linearly through the image. The data in this workspace is operated on for a short time, but is not frequently reused. The stream cache isolates this stream data from the main cache. Prefetched data is brought into the stream cache, but is not copied into the main cache. A cache access must search both the main cache and the stream cache in parallel. An LRU replacement scheme is employed in the stream cache.

Miss rate data for a 128 entry stream cache with a 128 entry SPT is shown in figure 8. The smaller caches show a greater enhancement than mid-sized caches, since there is a greater benefit from keeping less frequently used data out of the main cache. For small cache sizes, performance is significantly better than with the 128 entry SPT described previously. Miss rates for the middle and large cache sizes are approximately equal to SPT performance. Easter.mpg seems to perform much worse for a 1MB cache than with the SPT alone, but at this cache size the main cache already has so few misses that the loss of performance is not significant. The region on the left part of the graph matters most, since this is where the smaller main caches perform least efficiently and memory performance is a much higher percentage of execution time. This is the region where the main cache suffers from high miss rates, so the improvement is particularly beneficial.
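The sketch below restates the stream cache lookup path, reusing the DirectMappedCache and StridePredictionTable sketches given earlier. It is illustrative rather than the simulated hardware: the key properties taken from the text are the parallel search, the LRU-managed stream cache, prefetching into the stream cache only, and the rule that stream data is never copied into the main cache.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <list>

// Small fully associative stream cache with LRU replacement.
struct StreamCache {
    static constexpr uint64_t kLineSize = 16;
    size_t capacity;             // lines held (128 in the configuration above)
    std::list<uint64_t> lines;   // resident line numbers, MRU first

    explicit StreamCache(size_t entries) : capacity(entries) {}

    bool contains(uint64_t addr) {
        uint64_t line = addr / kLineSize;
        auto it = std::find(lines.begin(), lines.end(), line);
        if (it == lines.end()) return false;
        lines.splice(lines.begin(), lines, it);  // LRU update on hit
        return true;
    }

    void insert(uint64_t addr) {                 // prefetch fill
        if (contains(addr)) return;              // already resident
        lines.push_front(addr / kLineSize);
        if (lines.size() > capacity) lines.pop_back();  // evict LRU
    }
};

// One demand reference.  In hardware the two caches are probed in
// parallel; a stream cache hit does NOT copy the line into the main
// cache, and SPT predictions are prefetched only into the stream
// cache, so speculative stream data cannot pollute the main cache.
bool demandAccess(DirectMappedCache& main, StreamCache& stream,
                  StridePredictionTable& spt,
                  uint64_t instrAddr, uint64_t memAddr) {
    uint64_t prefetchAddr;
    bool predicted = spt.access(instrAddr, memAddr, &prefetchAddr);
    bool hit = stream.contains(memAddr) || !main.access(memAddr);
    if (predicted) stream.insert(prefetchAddr);
    return hit;  // true = overall hit in either cache
}
```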
Figure 8. Hit rates for 128 entry stream caches using a 128 entry stride prediction table. (Fraction of misses eliminated vs. cache size, 2KB to 1MB; associativity 1.)

5.1 Execution Time
Execution times for the MPEG1 image hula_2.mpg are shown in figure 9. To calculate these execution times, an aggressive superscalar processor model is assumed, such that the memory system is the performance limiting component. Therefore, a load or store instruction is assumed to occur every cycle, and all other operations are assumed to occur in parallel. Execution time is calculated assuming a main memory latency of 25 cycles for both the main and stream caches. If data is needed while it is in the process of being fetched to the cache, the balance of cycles remaining is counted in total execution time. Memory bandwidth is assumed to be large enough that it is not a limiting factor in performance. For the enhanced case, an extra 2.5KB is added to the cache size to account for the extra area required by the SPT and stream cache. For very small cache sizes, the stream cache can cut the execution time in half. For cache sizes up to about 256KB, less than 80% of the original time is required for execution.
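The execution-time model reduces to a few lines. The helper below is our reading of the stated assumptions (one load/store per cycle, 25-cycle fill latency, partial stalls for references that arrive while their line is in flight); the partial-stall accounting is collapsed into an average for brevity.

```cpp
#include <cstdint>

// With an aggressive superscalar core, time is one cycle per memory
// reference plus stall cycles.  A demand miss pays the full 25-cycle
// latency; a reference to a line already being prefetched pays only
// the cycles still outstanding.
constexpr uint64_t kMissLatency = 25;

uint64_t executionCycles(uint64_t references,
                         uint64_t demandMisses,
                         uint64_t inFlightHits,        // hits on lines in flight
                         uint64_t avgCyclesRemaining) {  // balance still owed
    return references                        // one load/store per cycle
         + demandMisses * kMissLatency       // full-latency stalls
         + inFlightHits * avgCyclesRemaining;  // partial stalls
}
```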
6 Conclusion
In this paper, we first presented the cache usage characteristics of MPEG1 and MPEG2 software player implementations. Motivated by the stream-like miss behavior, we proposed two hardware-based prefetching techniques to improve memory performance. The stride prediction table showed extremely good memory performance for middle and large cache sizes, but actually degraded performance for small cache sizes. The stream cache, in which the stride prediction table is used to prefetch data into a small parallel cache, maintained the performance improvement for large caches while also greatly improving small cache performance. We showed that approximately 80% of total misses can be eliminated using a relatively small amount of additional hardware, and that execution times can potentially be cut by a factor of two. Hardware prefetching is one method for enhancing a traditional processor to efficiently support multimedia applications.
Figure 9. Absolute execution times for mpeg_play-hula with a 128 entry modified stream cache and 128 entry stride table, adjusted for the extra area required. (Execution time in millions of cycles vs. cache size; associativity 1.)
References

[1] J.-L. Baer and T.-F. Chen. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of Supercomputing '91, pages 176-186, November 1991.
[2] T.-F. Chen and J.-L. Baer. Effective hardware-based data prefetching for high-performance processors. IEEE Transactions on Computers, 44:318-328, May 1995.
[3] S. Eckart, C. Fogg, et al. mpeg2 codec. ftp:ftp.netcom.com:/pub/cf/cfogg/mpeg2, MPEG Software Simulation Group, 1994.
[4] J. Fu, J. Patel, and B. Janssens. Stride directed prefetching in scalar processors. In Proc. of the 25th International Symposium on Microarchitecture, pages 102-110, December 1992.
[5] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proc. of the 17th Annual International Symposium on Computer Architecture, pages 364-373, May 1990.
[6] S. Palacharla and R. Kessler. Evaluating stream buffers as a secondary cache replacement. In Proc. of the 21st Annual International Symposium on Computer Architecture, pages 24-33, April 1994.
[7] K. Patel, B. Smith, and L. Rowe. Performance of a software MPEG video decoder. In Proceedings ACM Multimedia 93, pages 75-82, August 1993.
[8] A. J. Smith. Cache memories. ACM Computing Surveys, 14:473-530, September 1982.
[9] D. Zucker and R. Lee. Reuse of high precision arithmetic hardware to perform multiple concurrent low precision calculations. Technical Report CSL-TR-94-616, Computer Systems Laboratory, Stanford University, April 1994.
[10] D. F. Zucker and A. H. Karp. RYO: A versatile instruction instrumentation tool for PA-RISC. Technical Report CSL-TR-95-658, Computer Systems Laboratory, Stanford University, January 1995.