A Study of Memory System Performance of Multimedia Applications

Sohum Sohoni, Zhiyong Xu, Rui Min, Yiming Hu
Operating Systems and Computer Architecture Laboratory
Department of Electrical & Computer Engineering and Computer Science
P.O. Box 210030, University of Cincinnati, Cincinnati, OH 45221-0030
e-mail: {ssohoni,zxu,rmin,yhu}@ececs.uc.edu

ABSTRACT
Multimedia applications are fast becoming one of the dominating workloads for modern computer systems. Since these applications normally have large data sets and little data-reuse, many researchers believe that they have poor memory behavior compared to traditional programs, and that current cache architectures cannot handle them well. It is therefore important to quantitatively characterize the memory behavior of these applications in order to provide insights for future design and research of memory systems. However, very few results on this topic have been published. This paper presents a comprehensive study of the memory requirements of a group of programs that are representative of multimedia applications. These programs include a sub-set of the popular MediaBench suite and several large multimedia programs running on the Linux, Windows NT and Tru64 UNIX operating systems. We performed extensive measurement and trace-driven simulation experiments. We then compared the memory utilization of these programs to that of SPECint95 applications. We found that multimedia applications actually have better memory behavior than SPECint95 programs. The high cache hit rates of multimedia applications can be attributed to the following three factors. First, most multimedia applications apply block partitioning algorithms to the input data, and work on small blocks of data that easily fit into the cache. Secondly, within these blocks, there is significant data reuse as well as spatial locality. The third reason is that a large number of references generated by multimedia applications are to their internal data structures, which are relatively small and can also easily fit into reasonably-sized caches.

1. INTRODUCTION

The last few years have seen a tremendous increase in the use of multimedia applications [1, 2, 3, 4, 5]. This, however, is just the tip of the iceberg. We believe that emerging applications, such as video conferencing, video-on-demand, games with high-quality and high-resolution graphics and sound, virtual reality applications, etc., will be commonly used on general purpose processors in the near future. Multimedia applications will become one of the dominating workloads.

Since typical media processing applications, especially 3D animation graphics, have very large data sets and little data-reuse, many researchers believe that existing cache schemes cannot handle these applications efficiently [2, 3, 6, 7] when compared to traditional programs. It is therefore important to quantitatively characterize the memory system behavior of these applications in order to provide insights for future design and research of memory systems. Recent papers studying architectural support for multimedia applications [1, 2, 3, 4, 8] focus on the impact of Instruction Set Architecture (ISA) on the performance of multimedia applications. Very few results have been published that quantitatively study the memory behavior of multimedia applications.

This paper makes the following two contributions. First, we present a clear picture of the memory reference characteristics of multimedia applications, as regards their cache and TLB performance. The impact of different cache and TLB configurations, including different cache sizes, block sizes and associativities, is studied. Secondly, we compare the memory system performance of these programs with that of conventional workloads represented by the SPECint95 benchmark suite to see if multimedia programs have higher memory demands than conventional workloads. We conducted extensive measurement and trace-driven simulation experiments for a group of programs representative of multimedia applications, including a sub-set of the popular MediaBench suite and several large multimedia programs. Our key observations are:
Multimedia applications generate, on average, a similar or smaller number of data memory references per instruction compared to SPECint95 programs.
To be presented in SIGMETRICS 2001/ Performance 2001, Cambridge, Massachusetts, USA Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 2001 ACM 0-89791-88-6/97/05 ...$5.00.
For L1 data caches, multimedia applications actually have lower cache miss rates than SPECint95 programs. This fact, together with the previous observation, implies that most multimedia applications do not necessarily place a more stringent requirement on memory systems than SPECint95 programs. For multimedia applications, a larger input data size does not necessarily result in a higher cache miss rate.
2.5 TLB simulator
Multimedia applications have similar TLB behavior compared to SPECint95 applications.
Our TLB simulator is built on the ACS (Acme Cache Simulator) [11], developed by Bryan Hunt of New Mexico State University. We enhanced the simulator to support Dinero III trace files as well as the PDT file format. We also changed the address splitting, cache configuration and replacement algorithm parts so that it can simulate both Cache and TLB performance.
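As a rough illustration of what a trace-driven TLB simulation involves (this is not the modified ACS code), the following minimal sketch models a fully associative TLB driven by a stream of virtual addresses. The LRU replacement policy, the names `TLB` and `simulate_tlb`, and the default page size are assumptions made for the example.

```python
from collections import OrderedDict


class TLB:
    """Fully associative TLB with LRU replacement (LRU is an assumption)."""

    def __init__(self, entries, page_size):
        self.entries = entries
        self.page_size = page_size
        self.pages = OrderedDict()  # virtual page numbers, least recent first
        self.accesses = 0
        self.misses = 0

    def access(self, vaddr):
        """Look up one virtual address; return True on a hit."""
        vpn = vaddr // self.page_size  # address splitting: drop the page offset
        self.accesses += 1
        if vpn in self.pages:
            self.pages.move_to_end(vpn)  # mark as most recently used
            return True
        self.misses += 1
        if len(self.pages) >= self.entries:
            self.pages.popitem(last=False)  # evict the least recently used page
        self.pages[vpn] = None
        return False

    def miss_rate(self):
        return self.misses / self.accesses if self.accesses else 0.0


def simulate_tlb(addresses, entries, page_size=8192):
    """Run an address stream through a TLB and return its miss rate."""
    tlb = TLB(entries, page_size)
    for addr in addresses:
        tlb.access(addr)
    return tlb.miss_rate()
```

Calling `simulate_tlb` repeatedly with different `entries` and `page_size` values on the same address stream gives the kind of size sweep used in a configuration study.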
The remainder of the paper is organized as follows. Section 2 describes our methodology. Section 3 describes the application programs. Section 4 presents the results for memory system behavior. Section 5 discusses related work and section 6 summarizes our findings.
3. APPLICATION CHARACTERISTICS

To represent a wide spectrum of multimedia workloads, we used a subset of the MediaBench [4] benchmark suite. Since most MediaBench programs are relatively small, and known to be compute-bound, we also chose to use a set of large multimedia applications, including 2 Mesa 3-D graphics library applications and 6 streaming media players with very large input data. To represent traditional workloads, we used some of the SPECint95 applications.

2. METHODOLOGY
We used both hardware monitoring and trace-driven simulation in this study. In order to study cache and TLB behavior under different configurations, we performed extensive trace-driven simulation experiments, using ATOM [9], Dinero [10] and the ACS (Acme Cache Simulator) [11] modified to work as a TLB simulator. To validate our simulation results, we ran the applications on an Intel Pentium III 600 MHz machine, and used the Rabbit PMC [12] tool, which monitors the Pentium hardware counters, to profile program memory behavior. In addition, some large multimedia applications running on Windows NT were tested using the P6Perf [13] tool, which also monitors the Pentium hardware counters.
3.1 Multimedia Applications

3.1.1 MediaBench Applications
MediaBench consists of complete applications coded in high level languages. All the applications are publicly available and widely used on general purpose processors. The applications we used are listed below. They were compiled statically with an optimization level O2.
2.1 Rabbit PMC

CPU hardware counters are typically used to measure performance of an application program [14, 15, 16]. Rabbit PMC is a performance counters library for Intel Pentium processors running the Linux operating system. We used Rabbit PMC to sample the Pentium hardware event counters, which enable measurement of a broad range of processor events like instruction counts and on-chip cache miss counts in a non-intrusive way.
JPEG: JPEG is a standardized compression method for full-color and gray-scale images. We used two applications from the JPEG source: cjpeg performs image compression, and djpeg performs decompression. We used the file testimg.ppm, a 100 KB input file provided with the benchmark, for the compression program, and the corresponding testimg.jpg for decompression. A compression ratio of 20:1 was achieved.
2.2 P6Perf

P6Perf is a tool that augments the capabilities of the Windows NT utility, Perfmon [13]. It can measure the performance of the Pentium processor using the hardware counters of the CPU and is similar to Rabbit PMC [12] for Linux. We have used the P6Perf utility to measure the performance of multimedia applications on the Windows platform.
MPEG: MPEG is the standard for digital video transmission. The important computing kernel is a discrete cosine transform for coding and the inverse transform for decoding. The two applications used are mpeg2enc and mpeg2dec, for encoding and decoding high-quality video respectively. The input data file used for mpeg2enc was out.m2v, a 32K file provided with the benchmark. For decoding, a couple of files obtained from http://www.mpeg2.de/video/streams were used.
2.3 ATOM

In order to generate memory traces for simulation, we used ATOM (Analysis Tools with OM) [9] to insert instrumentation code into the program binaries on Compaq Alpha workstations. ATOM is actually a system for building customized tools for program analysis and is available on the Alpha platform. It uses OM link-time technology to organize the final executable such that the application program and the user's analysis routines run in the same address space. The analysis routines do not interfere with the program's execution, and precise information is presented to the analysis routines at all times. A trace file is generated when the instrumented application is executed. Information in the trace file consists of address traces with tags that identify whether a reference is a load, a store or an instruction fetch. All addresses are virtual.
EPIC: It is an experimental image compression utility. The compression algorithms are based on a bi-orthogonal critically sampled dyadic wavelet decomposition and a combined run-length/Huffman entropy coder. They have been designed to allow extremely fast decoding without floating-point hardware. A 65 KB pgm file was tested for compression using the epic program, and the resulting file was decompressed with the unepic program.
3.1.2 Large Multimedia Applications
2.4 Dinero
One of the well-known problems of the MediaBench suite is its small program and data sizes compared to real-world multimedia programs. To avoid potentially biased results because of the small sizes, we also studied the following 6 large, complete multimedia applications with very large input data sizes.
Dinero III [10] is a trace driven cache simulator developed by Mark Hill of the University of Wisconsin. Dinero’s output is derived from the input trace file and the cache parameters specified. Various parameters such as cache size, block size, associativity etc. can be varied.
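To make the idea of trace-driven cache simulation concrete, here is a minimal sketch of a set-associative cache fed by a text trace. It is not Dinero's implementation: the LRU policy and the names `parse_din`, `Cache` and `data_miss_rate` are assumptions, and the trace is assumed to use the traditional "din" text layout of one `<label> <hex address>` pair per line, with labels 0 (data read), 1 (data write) and 2 (instruction fetch).

```python
def parse_din(lines):
    """Yield (label, address) pairs from din-style text lines.

    Assumed labels: 0 = data read, 1 = data write, 2 = instruction fetch.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        yield int(fields[0]), int(fields[1], 16)


class Cache:
    """Set-associative cache with LRU replacement (LRU is an assumption)."""

    def __init__(self, size, line_size, assoc):
        self.line_size = line_size
        self.assoc = assoc
        self.num_sets = size // (line_size * assoc)
        # One list per set, ordered from least to most recently used block.
        self.sets = [[] for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0

    def access(self, addr):
        block = addr // self.line_size
        ways = self.sets[block % self.num_sets]
        self.accesses += 1
        if block in ways:
            ways.remove(block)  # hit: re-appended below as most recent
        else:
            self.misses += 1
            if len(ways) >= self.assoc:
                ways.pop(0)  # evict the least recently used block
        ways.append(block)


def data_miss_rate(trace_lines, size, line_size, assoc):
    """Miss rate over data reads and writes only."""
    cache = Cache(size, line_size, assoc)
    for label, addr in parse_din(trace_lines):
        if label in (0, 1):
            cache.access(addr)
    return cache.misses / cache.accesses if cache.accesses else 0.0
```

Varying `size`, `line_size` and `assoc` over the same trace mirrors the parameter sweeps described above; setting `assoc=1` gives a direct-mapped cache.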
realplayer: It is a very popular streaming media player, which plays audio as well as video files. In our experiments, we used realplayer version 7.0 to play a 1.9 MB, 9-minute audio/video clip in .rm format.
mpg123: This is another very popular program that plays audio MPEG 1.0/2.0 files. We used an MP3 (MPEG-1 Audio Layer 3) song file (4.18 MB, 4 minutes, in high-quality 44.1 kHz format) as its input file.
aKtion!: It is a freely available video player for Linux. It plays many different video formats such as DL, AVI and QuickTime. We used an MPEG-1 video file as our input. The file size was 15.2 MB, in 1.5 Mbps compressed format. The video length was 81 seconds.
MpegTV: Also known as mtv, it is a real-time software MPEG video player (with audio/sync) for Linux and other Unix platforms. We used it to play a Video CD. The file size was 407 MB. The display window size was full screen.
Power DVD: It is a media player that can play both audio and video data from an optical storage device. A single DVD can hold a large amount of data, 4.5 to 17 GB. We used the DVD player only on Windows NT, because a DVD player for Linux was not available.
XingMp3: It is an MPEG player that can play MPEG-1 and MPEG-2 audio files. It uses the XingMP3 Encoder, which can encode sounds above 16 kHz without loss of speed or quality. We used an MP3 (MPEG-1 Audio Layer 3) song file (4.18 MB, 4 minutes, in high-quality 44.1 kHz format) as its input file.
Figure 1: Number of Data Memory References Per Instruction. The average number of memory references per instruction is 0.433 for SPECint95 programs and 0.407 for multimedia applications.
Mesa applications: Mesa [17] is a clone of OpenGL, a commonly used 3-D graphics library. We used two Mesa applications, Mesa-star and Mesa-wave, which were part of the samples provided with the library. The two programs display large, animated, 3D images on a local X-Window display.
multimedia applications do not generate more memory references per instruction (0.41) than SPECint95 applications (0.43). In fact, two multimedia applications, cjpeg and unepic, have much lower references/instruction ratios than the SPECint95 programs and the other applications.
3.2 SPECint95 Applications

Six of the applications of the SPECint95 integer benchmark suite were used: go, m88ksim, compress, gcc, perl, and vortex. The input files used were the ones provided with the benchmark suites as test inputs. The SPECint95 applications were also compiled statically with an optimization level O2.
4.2 Cache Performance

Our measurement results show that multimedia applications generate similar or less memory traffic per instruction than the SPECint95 programs. However, an important question is how well the cache will perform under this traffic. To answer this question, we used ATOM to generate memory traces for both SPECint95 and multimedia applications. We then used Dinero III to simulate the cache behavior of these traces.

4. EXPERIMENTAL RESULTS
The performance evaluation was carried out for caches as well as TLBs. The measurement of the miss rate for various cache sizes and parameters constitutes the major part of the simulation process. The results were confirmed using extensive measurements on Linux and Windows platforms.
We were unable to collect memory traces for the 6 streaming media applications, namely realplayer, mpg123, MpegTV, aKtion!, Power DVD, and XingMp3, since their binaries are not available on the Compaq Tru64 UNIX systems on which ATOM runs. Also, these applications interact with special hardware such as sound cards, making them difficult to trace. Instead, we measured the cache behavior of these applications using Pentium counters. The results are explained in Section 4.2.3.
4.1 Memory References Per Instruction

Because many multimedia applications have very large data sets, it is expected that they place a high demand on the memory system. In order to compare the memory requirement of multimedia applications with that of SPECint95 applications, our first experiment was to measure the number of data memory references per instruction for these programs. A higher value indicates that a program generates more memory traffic and hence places a higher demand on the memory system.
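The metric itself is a simple ratio of two counter totals. The short sketch below shows the arithmetic; the program names and counter values are hypothetical and were chosen only for illustration, not taken from the measurements.

```python
def refs_per_instruction(data_refs, instructions):
    """Data memory references per executed instruction."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return data_refs / instructions


# Hypothetical counter totals: (data memory references, instructions executed).
counts = {
    "program_a": (430_000, 1_000_000),
    "program_b": (250_000, 1_000_000),
}

ratios = {name: refs_per_instruction(refs, instrs)
          for name, (refs, instrs) in counts.items()}

# Unweighted mean across programs, so every program counts equally
# regardless of how many instructions it executes.
average = sum(ratios.values()) / len(ratios)
```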
4.2.1 L1 Data Caches
Figures 2 through 5 show the simulation results for all the applications, with different cache configurations. The size of the cache is varied from 8 KB to 1 MB in each of the figures.
We ran these applications on an Intel Pentium III 600 MHz system. Using the Rabbit PMC tool, we measured the total number of data memory references and the total number of instructions executed for each application. Figure 1 shows the references/instruction ratio (the number of data memory references per instruction). The five bars on the left are for SPECint95 applications, while the remaining are for multimedia programs. The figure clearly indicates that
It is very interesting to note that the average miss rate for the media processing applications is much lower than that of the SPECint95 applications for all the plots. Table 1 lists the average miss rates for some configurations (we could not show numbers for all configurations because of the space limitation). This observation is in direct
Figure 2: L1 D-Cache Performance (Direct Mapped, Line Size 16 Bytes)

Figure 3: L1 D-Cache Performance (Direct Mapped, Line Size 32 Bytes)

Figure 4: L1 D-Cache Performance (2-way Associative, Line Size 16 Bytes)

Figure 5: L1 D-Cache Performance (2-way Associative, Line Size 32 Bytes)

(Figures 2 through 5 are bar charts. Y-axis: Miss Rate %; X-axis: application.)
Cache Size            8 KB     32 KB    128 KB   512 KB
Line Size             16 B     16 B     32 B     32 B
Associativity         1-way    1-way    2-way    2-way
SPECint95 Average     12.24%   8.14%    0.89%    0.83%
Multimedia Average    4.89%    2.72%    0.78%    0.37%
The primary observation from the above discussion is that neither the SPECint95 applications nor the multimedia applications collectively show a bias towards a change in either associativity or cache line size. On the whole the 2 sets of applications have similar characteristics.
Table 1: Average Miss Rates of SPECint95 and Multimedia Applications for Several Cache Configurations
4.2.3 Data Cache Performance of Large Multimedia Applications
contradiction with the views found in recent literature, that multimedia applications have poor cache behavior. In fact, the highest miss rates are observed for a SPECint95 program, viz. compress, for cache sizes less than 32 KB.
One MediaBench program, unepic, shows higher miss rates than other multimedia applications, especially for larger cache sizes. Its miss rates are still close to or lower than those of SPECint95 programs. Moreover, this is compensated for by the fact that unepic has an unusually low memory references/instruction ratio, as seen in Figure 1, and a low I-Cache miss rate as well, as seen in Figure 10.
4.2.2 Impact of Different Line Sizes and Associativities

Varying the cache line sizes and associativities has similar effects on SPECint95 and multimedia applications. For example, one of the SPECint95 applications, m88ksim, is very sensitive to the change of cache line sizes: doubling the line size from 16 bytes to 32 bytes significantly reduces the miss rates, suggesting that it has very good spatial locality. Others, such as vortex and perl, are immune to changes of cache line sizes. Similarly, doubling the cache line size has a significant impact on unepic, but has little effect on the other multimedia applications. Behavior for different cache associativities is quite similar. For example, perl, vortex and go from the SPECint95 suite, and cjpeg, djpeg and the Mesa programs from the multimedia applications show a linear reduction in the miss rate with increasing line size. Applications like gcc and epic do not show much variation with either associativity or cache line size.
Figure 6: Measured L1 Cache Performance of SPECint95, Small and Large Multimedia Programs. Results were measured using the Rabbit tool on an Intel Pentium III with a 16 KB L1 cache with a 32 Byte line size.

Our simulation results have shown that, on average, multimedia applications have lower cache miss rates compared to SPECint95 applications. However, many of the multimedia applications that we simulated, except for the two large Mesa programs, are from the MediaBench suite, and are relatively small. To verify whether the
low miss rates of multimedia applications were caused by the small program sizes, we re-ran these applications, as well as 4 large multimedia applications, realplayer (an audio/video player), mpg123 (an MP3 audio player), aKtion! (an MPEG video player), and MpegTV (a VCD player), on an Intel Pentium III 600 MHz system which has a 16 KB L1 data cache with a 32 Byte line size. Using the Rabbit tool, we measured the cache performance of these programs using the Pentium hardware counters (as stated before, due to hardware and software limitations, we could not simulate the 4 large streaming media applications). It is worth noting that a direct comparison between the simulation results and the measurement results cannot be made. This is because the simulations were done for the Compaq Alpha platform, while the measurements were taken on the Intel Pentium III architecture. Thus, the results obtained are for different instruction set architectures and different compiler optimization levels. However, the general trend of the 2 sets of results can be compared. Figure 6 shows the measurement results.

The results of the actual measurements clearly support our simulation results. The multimedia applications (including both small and large programs) have an average miss rate of only 1.2%, compared to 2.48% for SPECint95 programs. While the two audio stream players, mpg123 and realplayer, and the VCD player MpegTV have higher miss rates than other multimedia applications, they nevertheless have similar or better cache performance compared to SPECint95 programs. It is also worth noting that aKtion!, an MPEG-1 video player with a very large (15.2 MB) input file, has a miss rate much lower than most other applications.

4.2.4 Memory Traffic Beyond the L1 Cache

The number of memory references per instruction that cannot be serviced by the L1 cache is also a good measure of memory system performance. Figure 7 shows the results for this measurement on the Pentium III. The data was obtained by multiplying the number of memory references per instruction by the L1 data-cache miss rate. This is the effective memory bandwidth of the application program as seen by the L2 cache. Here too, the multimedia applications show comparable or better performance than the SPECint95 programs.

Figure 7: Memory references per instruction beyond the L1 Cache for SPECint95, Small and Large Multimedia Programs running on an Intel Pentium III CPU.

4.2.5 Data Cache Performance for Multimedia Applications Running on Windows NT

Multimedia applications are used extensively on PCs running some variant of the Windows operating system. This section presents the data cache performance results for some popular media applications on an Intel Pentium III running Windows NT. As seen in Figure 8, the miss rates are all under 4%, with an average miss rate of 2.02%. This is lower than the average miss rate of 2.48% for the SPECint95 programs measured on the same machine running Linux (Figure 6).

Figure 8: L1 Data cache performance of Large Multimedia Programs on an Intel Pentium III CPU running Windows NT. Average miss rate is 2.02%. (Applications shown: DVD player, Realplayer streaming video, Realplayer streaming audio, Realplayer audio (disk), Windows media player, Xing Mp3 player.)

4.2.6 Impact of Different Input Data Sizes on Multimedia Applications

The input data sizes of many multimedia applications can vary significantly. For example, a JPEG viewer program can be used to display a small graphic "icon" or a large, full-screen picture. This may have a significant impact on their cache behavior. To study the effect of different input data sizes, we ran cjpeg and djpeg with three different input file sizes and performed trace-driven simulations. Figure 9 shows the L1 D-cache miss rates. Cjpeg-s, cjpeg-m and cjpeg-l stand for cjpeg with a small (100 KB), medium (920 KB) and large (3700 KB) input file, respectively. Similarly, djpeg-s, djpeg-m and djpeg-l stand for djpeg with a small (5.6 KB), medium (76 KB) and large (434 KB) input file, respectively.

Figure 9 shows that a larger input file does not necessarily result in a higher cache miss rate. For example, for a cache of 8 KB, while the miss rate of cjpeg-m is about twice as high as that of cjpeg-s, cjpeg-l has a miss rate of only one-third of that of cjpeg-m. The cache behavior of these two programs is affected not only by the
Figure 9: Impact of Different Input Data Sizes on JPEG Compression and Decompression
Application Name    1 minute    5 minutes    10 minutes
realplayer          4.37%       4.02%        3.98%
mpg123              2.97%       2.68%        2.67%
MpegTV              3.35%       2.3%         2.02%
aKtion!             1.03%       0.92%        0.87%
PowerDVD            2.46%       2.57%        2.58%
Figure 10: L1 I-Cache Performance (Direct Mapped, Line Size 32 Bytes)
for SPECint95 programs. Based on our simulation and measurement results, we can conclude that I-caches are not a performance bottleneck for most multimedia applications. However, this might change in the future. As more complex multimedia programs such as video conferencing and virtual reality applications become common, the instruction footprints of many multimedia applications might increase, which may result in higher I-Cache miss rates.
Table 2: Effect of Input File Size on the Miss Rate of some Multimedia Programs on Windows NT
4.3 Analysis

The good cache performance of most multimedia applications can be explained by the algorithms used by these applications. First, most of these applications use block-partitioning algorithms, dividing large input files into small blocks. The programs work on one or several small blocks at a time, so the data can fit into a moderately-sized cache. Second, though multimedia applications, especially streaming media programs, seem to have little data reuse, this is not true at the block level. In fact, there is enough data reuse within each block to achieve high cache hit rates.
data size, but also by the contents of the input (e.g., the smoothness of a picture). Table 2 lists the miss rates for different input file sizes for several multimedia applications. Even as the file size increases, the miss rate stays almost constant, indicating that the memory system performance is independent of the size of the input. In fact, the miss rate decreases fractionally as the file size increases for four out of the five applications. We will discuss possible reasons in Section 4.3.
Figure 10 shows the performance of L1 instruction caches. It is clear that MediaBench programs (cjpeg, djpeg, epic, unepic and mpeg2enc) are very small programs with excellent locality. One Mesa application, namely Mesa-Star, shows much higher I-Cache miss rates. This is because Mesa applications have to call many libraries and have much larger instruction footprints. However, as the I-Cache size increases to 32 KB or higher, its instruction miss rate quickly falls in line with other applications.
For example, the JPEG compression algorithm extracts an 8×8 pixel block from the input picture. The block then goes through a four-step process: (1) conversion from normal RGB color space to a luminance/chrominance color space; (2) Discrete Cosine Transform (DCT) to an 8×8 array of coefficients for the corresponding frequency representation; (3) quantization; (4) encoding. During these steps the block is accessed repeatedly (for example, the complexity of the DCT phase is [...]), yielding a high cache hit ratio. This also explains why the cache performance of JPEG compression/decompression is independent of the input file size, as large files will be divided into small blocks and the algorithm works sequentially on these blocks.
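The effect of block partitioning on cache behavior can be sketched with a toy model. The code below is not the JPEG implementation: it assumes a row-major image with one byte per pixel, a naive transform in which each of the 64 output coefficients reads all 64 pixels of its 8×8 block, and a small fully associative LRU cache. The function names and all parameter values are hypothetical.

```python
from collections import OrderedDict

BLOCK = 8  # JPEG works on 8x8 pixel blocks


def block_addresses(width, bx, by):
    """Byte addresses of one 8x8 block in a row-major, 1-byte-per-pixel image."""
    return [(by * BLOCK + y) * width + (bx * BLOCK + x)
            for y in range(BLOCK) for x in range(BLOCK)]


def dct_like_stream(width, height):
    """Address stream of a naive per-block transform.

    Assumption: every one of the 64 output coefficients reads all 64 input
    pixels of the block, as a direct (non-fast) 2-D DCT would.
    """
    for by in range(height // BLOCK):
        for bx in range(width // BLOCK):
            pixels = block_addresses(width, bx, by)
            for _ in range(BLOCK * BLOCK):  # one pass per output coefficient
                yield from pixels


def hit_rate(addresses, cache_bytes=1024, line_bytes=32):
    """Hit rate of a small fully associative LRU cache (an assumed model)."""
    capacity = cache_bytes // line_bytes
    lines = OrderedDict()  # cached line numbers, least recent first
    hits = total = 0
    for addr in addresses:
        line = addr // line_bytes
        total += 1
        if line in lines:
            hits += 1
            lines.move_to_end(line)
        else:
            if len(lines) >= capacity:
                lines.popitem(last=False)  # evict least recently used line
            lines[line] = None
    return hits / total
```

For a 64×64 image (4 KB) and the default 1 KB cache, `hit_rate(dct_like_stream(64, 64))` is above 99.9%, even though the image is four times larger than the cache: each block touches only 8 cache lines, so only the first touch of each line misses.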
Our measured results of other large multimedia applications (realplayer, mpg123 and aKtion!) indicate that they all have very low instruction miss rates. The average I-Cache miss rate of the three programs on an Intel Pentium III is 0.15%, compared to 0.19%
In MPEG Video decoding, frames are spatially compressed with algorithms similar to JPEG. MPEG also uses block-based motion compensation to reduce the temporal redundancy between frames. Each frame is divided into macroblocks of 16×16 pixels. Each
4.2.7 Instruction Cache Performance
macroblock can then be predicted from the previous or future frame, by estimating the amount of the motion in the macroblock during the frame time interval. Although there is little data-reuse between frames during compression/decompression, the current macroblock and the reference frame are accessed repeatedly, resulting in good cache performance.
Figure 11 shows the data TLB behavior of different applications with different TLB sizes. The solid lines are for multimedia applications, while the dotted lines are for SPECint95 applications. It is clear that multimedia applications have similar or better data TLB performance compared to SPECint95 programs. Most applications achieve very low TLB miss rates with 48 entries. Increasing the TLB size beyond 48 entries results in diminishing returns.
MPEG Audio Layer III encoding employs a similar scheme of dividing the input into sample frames, and working on 3 frames at a time. A number of operations such as masking, run-length encoding and Huffman encoding are applied to each frame to reduce redundancy in the data.
5. RELATED WORK Papers studying memory system performance have focused on the SPEC benchmarks and more recently on commercial workloads [15, 16, 18, 19, 20] or desktop applications [21]. For example, Lee et al. [21] studied the execution characteristics of desktop applications on Windows NT. They found that while many desktop applications are graphics intensive, many of their characteristics are similar to those of the integer SPECint95 benchmarks. They also found that desktop applications have large working sets.
Finally, a large percentage of memory accesses generated by multimedia applications are for the internal data structures of programs. Since these data structures are normally small and have very good locality, they can be handled well by reasonably-sized caches.
4.4 TLB Performance
Many multimedia architecture papers [2, 3] pointed out the potential importance of memory systems for multimedia applications. They also suggested that memory systems may be a performance bottleneck. However, most of the papers discussing architectural support for multimedia applications focused on the impact of Instruction Set Architecture (ISA) on the performance of multimedia applications [1, 4, 8, 22]. Very few of them studied the impact of multimedia workloads on memory system performance. Soderquist and Leeser [23] proposed several techniques to improve the data cache performance of an MPEG-2 Video decoder. Ranganathan et al. [1] very briefly discussed the performance of caches for the 6 image processing kernels and the cjpeg-np and djpeg-np applications. They found that cache sizes have no impact on most of these kernels and applications.
We considered both a unified TLB and a split TLB scheme. We varied the TLB size from 4 to 64 entries, and the page size from 4 KB to 8 KB. Fully associative TLBs were assumed.
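The kind of experiment described here can be sketched as a small trace-driven TLB model. This is not the simulator used in the study: the LRU replacement policy, the function names, and the intermediate TLB sizes in `sweep` are assumptions made for illustration (the text only states that TLBs were fully associative, with 4 to 64 entries and 4 KB or 8 KB pages).

```python
from collections import OrderedDict


def tlb_miss_rate(addresses, num_entries, page_size):
    """Miss rate of a fully associative TLB (LRU replacement is assumed)."""
    tlb = OrderedDict()  # virtual page numbers, ordered from least to most recent
    misses = 0
    total = 0
    for addr in addresses:
        total += 1
        page = addr // page_size
        if page in tlb:
            tlb.move_to_end(page)  # hit: mark as most recently used
        else:
            misses += 1
            if len(tlb) >= num_entries:
                tlb.popitem(last=False)  # evict the least recently used page
            tlb[page] = True
    return misses / total if total else 0.0


def sweep(addresses, sizes=(4, 8, 16, 32, 48, 64), page_sizes=(4096, 8192)):
    """Return {(page_size, num_entries): miss_rate} for every configuration."""
    addresses = list(addresses)  # allow the trace to be replayed
    return {
        (page_size, entries): tlb_miss_rate(addresses, entries, page_size)
        for page_size in page_sizes
        for entries in sizes
    }
```

Under this sketch, a split scheme corresponds to calling `tlb_miss_rate` separately on the instruction-fetch and data address streams, while a unified scheme corresponds to one call on the merged stream.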
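4.4.1 Instruction TLB
[Figure 11 plot (data TLB, 8 KB pages): logarithmic vertical axis from 1 down to 1e-06; horizontal axis: TLB entry (4 to 64). Curves: Djpeg, Cjpeg, Epic, Unepic, Mesa-Star, Mesa-wave, Mpeg Player, Mpeg Encoder, vortex, go, compress, perl, m88ksim.]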
6. CONCLUSIONS
Multimedia applications are becoming one of the dominant workloads for computer systems. Since these applications normally have large data sets with apparently little data-reuse, it is generally assumed that they have poor memory behavior compared to traditional programs such as the SPECint95 benchmark suite.
In this paper we performed a comprehensive study of the memory requirements of a group of programs representative of multimedia applications. These programs include a subset of the popular MediaBench suite and several large multimedia programs with very large input data. We performed extensive measurement and trace-driven simulation experiments on these applications. We then compared the memory utilization of the programs to that of SPECint95 applications.
Figure 11: Data TLB Behavior (8 KB Page Size). Average miss rates: at 16 entries, 0.4% for multimedia and 0.7% for SPECint95; at 32 entries, 0.02% for multimedia and 0.08% for SPECint95; at 48 entries, 0.005% for multimedia and 0.04% for SPECint95; at 64 entries, 0.003% for multimedia and 0.03% for SPECint95.
The experiments were performed on two different architectures (Alpha and x86) running three different operating systems (Tru64 UNIX, Linux and Windows NT). The Compaq C compiler was used for the trace-driven simulations and gcc for the measurements on x86, while some applications were pre-compiled by their vendors. It is therefore unlikely that the observed memory behavior is an artifact of a particular architecture or compiler.
Table 3 lists the maximum number of instruction TLB entries used by each program. In other words, if the instruction TLB size is larger than the numbers listed in the table, the programs will suffer only from compulsory TLB misses. Again, Table 3 shows that MediaBench applications have very small instruction footprints. The two large Mesa applications, Mesa-wave and Mesa-star, given their large program sizes, have a larger number of TLB entries used than their SPECint95 counterparts.
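One way to read the "maximum number of instruction TLB entries used" is as the number of distinct pages touched by a program's instruction fetches: once a fully associative TLB holds at least that many entries, each page misses only on its first reference. The following is a minimal sketch under that interpretation; the function name and the synthetic trace are hypothetical and do not reproduce the measurements in Table 3.

```python
def max_tlb_entries_used(instruction_addresses, page_size):
    """Count the distinct pages touched by an instruction-fetch trace."""
    return len({addr // page_size for addr in instruction_addresses})


if __name__ == "__main__":
    # Synthetic trace: fetches spread over 40 consecutive 4 KB pages of code.
    trace = range(0, 40 * 4096, 1024)
    for page_size in (4096, 8192):
        used = max_tlb_entries_used(trace, page_size)
        print(f"{page_size // 1024} KB pages: {used} entries used")
```

For contiguous code, doubling the page size roughly halves the count, which matches the general trend between the 4 KB and 8 KB figures reported for most programs.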
4.4.2 Data and Unified TLB
Our observations are summarized as follows:
We found that multimedia applications generate a similar or smaller number of data memory references per instruction compared to SPECint95 programs.
Program              4 KB pages   8 KB pages
SPECint95 Programs
147.vortex           64#          42
134.perl             64#          45
099.go               64#          39
129.compress         16           11
124.m88ksim          40           24
Multimedia Programs
cjpeg                19           11
djpeg                28           15
epic                 8            4
unepic               6            4
mpeg2enc             20           11
mpeg2dec             15           9
Mesa Star            64#          53
Mesa Wave            64#          64#
Table 3: Maximum Numbers of Instruction TLB Entries Used
[4] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, “MediaBench: A tool for evaluating and synthesizing multimedia and communications systems,” in Proceedings of the 30th Annual International Symposium on Microarchitecture, (Research Triangle Park, North Carolina), pp. 330–335, Dec. 1–3, 1997.
For L1 data caches, multimedia applications actually have lower cache miss rates than SPECint95 programs. This fact, together with the previous observation, implies that most multimedia applications do not necessarily place more stringent requirements on memory systems than traditional programs such as the SPECint95 benchmark suite.
[5] T. M. Conte, P. K. Dubey, M. D. Jennings, R. B. Lee, A. Peleg, S. Rathnam, M. Schlansker, P. Song, and A. Wolfe, “Theme feature: Challenges to combining general-purpose and multimedia processors,” Computer, vol. 30, pp. 33–37, Dec. 1997.
For multimedia applications, a larger input data size does not necessarily result in a higher cache miss rate.
The TLB behavior of multimedia applications is similar to that of the SPECint95 applications.
[6] P. Ranganathan, S. Adve, and N. P. Jouppi, “Reconfigurable caches and their application to media processing,” in Proceedings of the 27th Annual International Symposium on Computer Architecture, (Vancouver, British Columbia), pp. 214–224, IEEE Computer Society and ACM SIGARCH, June 12–14, 2000.
Our analyses show that the two most important features of multimedia applications, namely large input data sizes and little data-reuse, hold true only at the macro-level. Most applications use block partitioning algorithms and divide the large input files into small blocks, reducing the effect of large input sizes. Also, while it is true that there is little data reuse between frames in streaming media data, there exists enough locality within blocks and frames to achieve good cache performance. Moreover, a large percentage of memory accesses go to the program’s internal data structures, which are normally small and have very good locality.
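The effect of block partitioning on locality can be illustrated with a toy cache model. This is a sketch under stated assumptions, not the simulation setup of the study: the fully associative LRU cache, the 32-byte line size, the trace generator and all sizes below are hypothetical choices made for illustration.

```python
from collections import OrderedDict


def lru_miss_rate(addresses, cache_bytes, line_bytes=32):
    """Miss rate of a fully associative LRU cache (a deliberately simple model)."""
    lines = OrderedDict()  # cached line numbers, least recently used first
    capacity = cache_bytes // line_bytes
    misses = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // line_bytes
        if line in lines:
            lines.move_to_end(line)
        else:
            misses += 1
            if len(lines) >= capacity:
                lines.popitem(last=False)
            lines[line] = True
    return misses / total if total else 0.0


def blocked_trace(total_bytes, block_bytes, passes):
    """Yield byte addresses; each block is swept `passes` times before moving on."""
    for base in range(0, total_bytes, block_bytes):
        for _ in range(passes):
            for offset in range(block_bytes):
                yield base + offset


if __name__ == "__main__":
    input_bytes, cache_bytes = 64 * 1024, 8 * 1024
    # Four passes over each 1 KB block versus four passes over the whole input.
    blocked = lru_miss_rate(blocked_trace(input_bytes, 1024, 4), cache_bytes)
    whole = lru_miss_rate(blocked_trace(input_bytes, input_bytes, 4), cache_bytes)
    print(f"blocked: {blocked:.4%}  whole-input sweeps: {whole:.4%}")
```

In this toy setup the input is eight times larger than the cache, yet the blocked trace misses only on the first pass over each block, while repeated sweeps over the whole input miss on every line in every pass.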
[7] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory access scheduling,” in Proceedings of the 27th Annual International Symposium on Computer Architecture, (Vancouver, British Columbia), pp. 128–138, IEEE Computer Society and ACM SIGARCH, June 12–14, 2000.
[8] A. Peleg and U. Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro, pp. 42–50, Jul/Aug 1996.
While improving memory system performance is still very important for both SPECint95 and multimedia applications, our observations suggest that most current multimedia programs will continue to be CPU-bound rather than memory-bound. Hence we believe that future research on architectural support for multimedia applications should concentrate more on increasing processing power rather than improving memory performance.
[9] A. Srivastava and A. Eustace, “ATOM: A system for building customized program analysis tools,” in Proceedings of the SIGPLAN ’94 Conference on Programming Language Design and Implementation, pp. 196–205, June 1994.
[10] M. Hill, “Dinero III: A uniprocessor cache simulator.” http://www.cs.wisc.edu/~larus/warts.html.
Acknowledgments
This work is supported in part by the National Science Foundation Career Award CCR-9984852, and an Ohio Board of Regents Computer Science Collaboration Grant. Venkatesh Raman provided a lot of help during the experiments and many suggestions during the writing of this paper.
[11] E. Johnson and J. Ha, “PDATS: Lossless address trace compression for reducing file size and access time,” in Proceedings of 1994 IEEE International Conference on Computers and Communications, April 1994.
[12] D. Heller, “Rabbit: A performance counters library for Intel processors and Linux.” http://www.scl.ameslab.gov/Projects/Rabbit/index.html.
7. REFERENCES
[1] P. Ranganathan, S. Adve, and N. Jouppi, “Performance of image and video processing with general-purpose processors and media ISA extensions,” in Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA’99), pp. 124–135, May 1–5 1999.
[13] Intel Corporation, http://developer.intel.com/software/idap/resources/ technical collateral/pentiumii/p6perfnt.htm, P6 Family Processor Performance Measurement Utility for Windows NT, 1997.
[2] K. Diefendorff and P. K. Dubey, “How multimedia workloads will change processor design,” Computer, vol. 30, pp. 43–45, Sept. 1997.
[14] J. B. Chen, Y. Endo, K. Chan, D. Mazières, A. Dias, M. I. Seltzer, and M. D. Smith, “The measured performance of personal computer operating systems,” ACM Transactions on Computer Systems, vol. 14, pp. 3–40, Feb. 1996.
[3] R. B. Lee and M. D. Smith, “Media processing: A new design target,” IEEE Micro, vol. 16, pp. 6–9, Aug. 1996.
[15] L. Barroso, K. Gharachorloo, and E. Bugnion, “Memory system characterization of commercial workloads,” in Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), pp. 3–14, June 27–July 1 1998.
[16] K. Keeton, D. Patterson, Y. He, R. Raphael, and W. Baker, “Performance characterization of a Quad Pentium Pro SMP using OLTP Workloads,” in Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), pp. 15–26, June 27–July 1 1998.
[17] “The Mesa 3D Graphics Library.” http://www.mesa3d.org.
[18] J. Lo, L. Barroso, S. Eggers, K. Gharachorloo, H. Levy, and S. Parekh, “An analysis of database workload performance on simultaneous multithreaded processors,” in Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), pp. 39–51, June 27–July 1 1998.
[19] P. Trancoso, J.-L. Larriba-Pey, Z. Zhang, and J. Torrellas, “The memory performance of DSS commercial workloads in shared-memory multiprocessors,” in Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture (HPCA-3), Feb. 1997.
[20] P. Ranganathan, K. Gharachorloo, S. V. Adve, and L. A. Barroso, “Performance of database workloads on shared-memory systems with out-of-order processors,” in Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 307–318, Oct. 1998.
[21] D. Lee, P. Crowley, J. Baer, T. Anderson, and B. Bershad, “Execution characteristics of desktop applications on Windows NT,” in Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), pp. 27–38, June 27–July 1 1998.
[22] R. Bhargava, L. K. John, B. L. Evans, and R. Radhakrishnan, “Evaluating MMX technology using DSP and multimedia applications,” in Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-98), (Los Alamitos), pp. 37–48, IEEE Computer Society, Nov. 30–Dec. 2 1998.
[23] P. Soderquist and M. Leeser, “Optimizing the data cache performance of a software MPEG-2 video decoder,” in ACM Multimedia’97, pp. 291–301, Nov. 1997.