Memory Characterization of Workloads Using Instrumentation-Driven Simulation: A Pin-based Memory Characterization of the SPEC CPU2000 and SPEC CPU2006 Benchmark Suites
Aamer Jaleel, Intel Corporation, VSSAD
aamer.jaleel@intel.com
Abstract
There is a growing need for simulation methodologies that can characterize the memory system requirements of emerging workloads in a reasonable amount of time. This paper presents binary instrumentation-driven simulation as an alternative to conventional execution-driven and trace-driven simulation methodologies. We illustrate the use of instrumentation-driven simulation (IDS) with Pin to determine the memory system requirements of workloads from the SPEC CPU2000 and SPEC CPU2006 benchmark suites. Compared to SPEC CPU2000, SPEC CPU2006 workloads are an order of magnitude longer in terms of instruction count. Additionally, SPEC CPU2006 comprises many more memory-intensive workloads, most of which require more than 4MB of cache for good cache performance.
1. Introduction
Processor architects continue to face key decisions in designing the memory hierarchy. With several emerging application domains, understanding the memory system requirements of new applications is essential to designing an efficient memory hierarchy. Such characterization and exploration studies require fast simulation techniques that can compare and contrast the performance of alternative design policies. For memory system studies, this paper proposes instrumentation-driven simulation as an alternative to existing execution-driven and trace-driven methodologies.
Simulation is commonly used both to identify performance bottlenecks in existing systems and to explore the design space. Many simulators and software tools exist to investigate the memory system performance of applications. In general, memory system simulators fall into two categories: trace-driven and execution-driven. With trace-driven memory system simulation, address traces are fed off-line to a memory system simulator (e.g., Dinero IV [3]). Such simulators rely on existing tools to collect memory address traces and log them to a file for later use. Execution-driven cache simulators, on the other hand, rely on functional/performance models to execute application binaries; the memory addresses generated by the functional/performance model are fed in real time to a memory system simulator modeled within the functional/performance model. In general, functional models of modern ISAs are slow and can be complex to build. As a result, trace-driven simulation has become popular for conducting memory system studies [15]. The usefulness of trace-driven simulation, however, hinges on the continued availability of address traces for the workloads of interest. With several emerging application domains, understanding the memory behavior and working-set requirements of new applications requires the ability to generate address traces. Address trace generation for a target ISA can require sophisticated hardware tools (e.g., a bus tracer) or a functional model that supports the target ISA and the requirements of the workload (e.g., the functional model must support multiple contexts to execute a multi-threaded workload). Both mechanisms have drawbacks. Hardware tools such as bus tracers only capture address traces that reach the system bus; consequently, the collected trace is incomplete, since requests that hit in the processor caches are filtered out. Address traces generated with functional models, on the other hand, typically represent only a small region of the application. A further practical problem with either mechanism is that memory address traces can become very large, occupying several gigabytes of disk space even in compressed form.
To address the drawbacks of current simulation techniques, this paper proposes the use of instrumentation-driven simulation (IDS) for memory system studies. Since instrumentation-driven memory system simulation is fast, robust, and simple, users can write small tools that characterize the memory system requirements of almost any application at MIPS (as opposed to KIPS) speed. In doing so, IDS supports full-run application studies. This paper presents the use of Pin [7] based IDS to conduct full-run memory system studies of workloads from the SPEC CPU2000 and SPEC CPU2006 [4] benchmark suites. Our studies reveal that, on average, workloads in SPEC CPU2006 have an instruction count that is an order of magnitude larger than the SPEC CPU2000 workloads. Additionally, SPEC CPU2006 workloads have larger memory working-set sizes, with most memory-intensive workloads requiring more than 4MB of cache.
The rest of this paper is organized as follows. Section 2 discusses the benefits and caveats of using IDS. Section 3 describes the simulation methodology and then presents a memory performance comparison of the SPEC CPU2000 and SPEC CPU2006 workloads. Finally, Section 4 summarizes our work.
2. Instrumentation-Driven Simulation (IDS)
Binary instrumentation is a technique for inserting extra code into an existing application in order to collect run-time information from it; the instrumentation tool determines what information is collected. Binary instrumentation tools have traditionally been used for performance analysis of applications, but they can also serve as an attractive vehicle for computer architecture research. In particular, binary instrumentation tools can drive simple simulators that are instrumentation-driven. In addition to being simple, binary instrumentation systems are fast and robust. Since binary instrumentation normally occurs at near-native execution speed, IDS can run at MIPS speed. Unlike existing execution-driven and trace-driven simulators, IDS supports quick exploratory studies by simulating applications run to completion. Additionally, since binary instrumentation systems are robust across all kinds of applications, users can conduct instrumentation-driven simulation with any application, no matter its complexity; for example, users can study complex applications such as Oracle or Java.
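To make the mechanics concrete, the following sketch shows the shape of a minimal Pin tool, modeled on the instruction-counting example that ships with the Pin kit; the routine names are illustrative and this is not the tool used in the paper. An instrumentation routine decides where calls are inserted as code is discovered, and an analysis routine executes at those points while the application runs.

```cpp
// Minimal Pin-tool sketch (modeled on Pin's public instruction-count example).
// Names are illustrative; the paper's tools (e.g., CMP$im) are more elaborate.
#include <iostream>
#include "pin.H"

static UINT64 icount = 0;

// Analysis routine: runs once per dynamic instruction.
VOID CountInstruction() { icount++; }

// Instrumentation routine: called once per static instruction as Pin discovers it.
VOID Instruction(INS ins, VOID *v)
{
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountInstruction, IARG_END);
}

// Fini routine: runs when the application exits.
VOID Fini(INT32 code, VOID *v)
{
    std::cerr << "Dynamic instruction count: " << icount << std::endl;
}

int main(int argc, char *argv[])
{
    if (PIN_Init(argc, argv)) return 1;        // parse Pin/tool command line
    INS_AddInstrumentFunction(Instruction, 0); // register instrumentation callback
    PIN_AddFiniFunction(Fini, 0);              // register exit callback
    PIN_StartProgram();                        // hand control to the application; never returns
    return 0;
}
```

A tool of this form is typically launched by running the application under Pin (roughly pin -t tool.so -- application); the exact command line depends on the Pin kit being used.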
2.1. Caveats with Instrumentation-Driven Simulation
As with any simulation methodology, there are trade-offs with the instrumentation-driven approach.
• Instrumentation Overhead: Instrumentation injects extra code, dynamically or statically, into the target application, causing it to spend additional time in execution. If an application changes its behavior based on the amount of time spent executing, its native and instrumented executions can differ; if the differences are significant, IDS is unsuitable for that class of applications. Additionally, for multi-threaded applications, instrumentation can perturb the ordering of instructions executed by different threads, so IDS with multi-threaded applications comes at the cost of some fidelity.
• Lack of Speculation: Instrumentation only observes instructions executed on the correct path, so IDS cannot directly model wrong-path execution. If speculation is desired, special-purpose tools that emulate a branch predictor can be implemented [11].
• Repeatability: Since instrumentation runs on a real operating system, multiple runs of the same workload on the same system are not identical. If repeatability is desired, special-purpose tools can be implemented [10].
• User-Level Traffic Only: Current binary instrumentation tools only support user-level instrumentation, so kernel-intensive applications are unsuitable for user-level IDS. For such applications, system-level simulators [6, 8] or instrumentation systems [2] are better suited.
• Native System Limited: Since IDS runs on a native system, the fidelity of the simulation model can depend on the characteristics of that system. For example, simulating a system with four active hardware threads on a host that supports only two may incorrectly interleave the instructions of the different hardware threads, because the native system time-slices the simulated threads while the real simulated system would execute them simultaneously. To ensure correct timing, this issue can be resolved by modeling a thread scheduler [11].
Despite these caveats, IDS is an excellent methodology for initial exploratory studies, as it is fast, robust, and simple.
3. Memory Behavior of SPEC CPU2000 and SPEC CPU2006
For the purposes of this study, we use IDS to characterize the memory system behavior of both the SPEC CPU2000 and SPEC CPU2006 benchmark suites. The SPEC workloads are attractive for IDS because they are unaffected by the caveats mentioned in the previous section. Existing work has studied the memory behavior of SPEC CPU2000 [13]. For uniformity in ISA, compilers, and simulation methodology, we present a comprehensive cache performance study of all workloads from both SPEC CPU2000 and SPEC CPU2006 run to completion. We first describe the experimental methodology and then present our results.
3.1. Experimental Methodology
We use Pin, a dynamic binary instrumentation system for Linux, MacOS, and Windows executables on Intel® IA-32 (x86 32-bit), IA-32E (x86 64-bit), and Itanium® processors. Pin is similar to the ATOM [14] toolkit for Alpha processors; like ATOM, it provides an infrastructure for writing program analysis tools called pin tools. For our workload characterization studies we use CMP$im [5], an instrumentation-driven simulator for CMPs. CMP$im can characterize application instruction profiles and conduct full-run cache performance studies of applications at speeds of 4-10 MIPS.
[Figure 1 charts: per-workload dynamic instruction counts (in billions) for SPEC CPU2000 INT/FP and SPEC CPU2006 INT/FP, with each bar broken down into MEM none, MEM read, MEM write, and MEM read/write instruction categories.]
Figure 1: Dynamic Instruction Count and Instruction Profile of SPEC CPU2000 and SPEC CPU2006 workloads. The figure shows the instruction count and instruction mix of the applications when run to completion using the reference input sets.
We use CMP$im in single-core mode to conduct detailed characterization studies of the SPEC workloads. All workloads from the SPEC CPU2000 and SPEC CPU2006 benchmark suites are run to completion using their reference input sets. We also collect IPC numbers for the two suites using performance counters on an Intel® Xeon® 2.8 GHz system with a two-level cache hierarchy: a 32KB L1 cache and a 512KB L2 cache. All workloads are compiled with the -O3 optimization flag on a 32-bit Red Hat Linux system using Intel's C/C++ and Fortran compilers.
For each workload, we conduct a cache sensitivity analysis using the stack distance approach [9] to simulate multiple cache sizes in one run. We model a 256-way 8MB instruction cache and a 2048-way 128MB data cache; both use a 64B line size and true LRU replacement. The simulated configurations therefore cover cache sizes that are direct-mapped 32KB, 2-way 64KB, 4-way 128KB, and so on. After the cache sensitivity study, we also present the cache performance of each workload for a three-level cache hierarchy: 4-way 32KB separate L1 instruction and data caches, a unified 8-way 256KB L2 cache, and a unified 16-way 2MB L3 cache. All caches in the simulated hierarchy use a 64B line size, are non-inclusive, allocate on writes, and use writeback and true LRU replacement.
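The stack-distance method referenced above can be sketched compactly: with the number of sets fixed, recording the LRU depth at which each reference hits is enough to recover miss counts for every associativity, and therefore every cache size, in a single pass. The code below is our own illustrative version rather than CMP$im; the geometry (512 sets of 64B lines, so an A-way hit depth corresponds to an A x 32KB cache) follows the instruction-cache configuration described above.

```cpp
#include <cstdint>
#include <list>
#include <vector>

static const uint64_t kSets = 512;        // 512 sets x 64-byte lines => 32 KB per way
static const size_t   kMaxWays = 256;     // largest simulated cache: 256 x 32 KB = 8 MB
static const int      kLineShift = 6;     // 64-byte lines

static std::list<uint64_t> lru[kSets];            // per-set LRU stacks, MRU at the front
static std::vector<uint64_t> hit_depth(kMaxWays, 0);
static uint64_t cold_misses = 0;

// Record one reference; the LRU depth of a hit tells us the smallest
// associativity (and hence cache size) at which this access would hit.
void Access(uint64_t addr)
{
    uint64_t line = addr >> kLineShift;
    std::list<uint64_t> &stack = lru[line % kSets];
    size_t depth = 0;
    for (auto it = stack.begin(); it != stack.end(); ++it, ++depth) {
        if (*it == line) {
            hit_depth[depth]++;           // hit in any cache with >= depth+1 ways
            stack.erase(it);
            stack.push_front(line);       // move to MRU position
            return;
        }
    }
    cold_misses++;                        // first touch (or pushed out of the big cache)
    stack.push_front(line);
    if (stack.size() > kMaxWays) stack.pop_back();
}

// MPKI for a cache of 'ways' x 32 KB: hits deeper than 'ways' become misses.
double MissesPerKiloInst(size_t ways, uint64_t instructions)
{
    uint64_t misses = cold_misses;
    for (size_t d = ways; d < kMaxWays; ++d) misses += hit_depth[d];
    return 1000.0 * misses / instructions;
}
```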
3.2. Application Instruction Profile
Figure 1 shows the total instruction count for each workload in the two suites. The average instruction count of the SPEC CPU2000 benchmark suite is 131 billion, while that of SPEC CPU2006 is 1276 billion, an order of magnitude higher. SPEC increased the instruction counts in response to the significant run-time differences that small fluctuations in system state or measurement conditions could otherwise cause [1]. However, the long run lengths of SPEC CPU2006 now present interesting challenges in choosing representative regions of the workloads for experiments with detailed performance simulators.
To understand the contribution of instructions that reference memory, Figure 1 also presents the instruction profile of the floating-point and integer benchmarks, divided into four categories: instructions that do not reference memory (ALU operations only), instructions with one or more source operands in memory (MEM read), instructions whose destination operand is in memory (MEM write), and instructions whose source and destination operands are both in memory (MEM read and write). On average, roughly half of the instructions reference memory. Of the instructions that reference memory, very few (less than 1%) both read from and write to memory, while about 20% of memory instructions write to memory. This behavior is consistent across SPEC CPU2000 and SPEC CPU2006. The large proportion of instructions that reference memory is indicative of register spills to the stack due to the limited number of registers available on 32-bit x86 systems.
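The four-way instruction split in Figure 1 can be gathered with a small instrumentation callback; the sketch below uses the memory-operand predicates from Pin's public API, but the tool itself is illustrative and is not the profiler used for the paper's data.

```cpp
// Sketch of the four-way instruction-mix split, written as a Pin tool.
// Illustrative only; counts are reported at exit as in the Section 2 sketch.
#include <iostream>
#include "pin.H"

static UINT64 mem_none = 0, mem_read = 0, mem_write = 0, mem_rw = 0;

VOID CountNone()  { mem_none++;  }
VOID CountRead()  { mem_read++;  }
VOID CountWrite() { mem_write++; }
VOID CountRW()    { mem_rw++;    }

VOID Instruction(INS ins, VOID *v)
{
    bool rd = INS_IsMemoryRead(ins);    // one or more source operands in memory
    bool wr = INS_IsMemoryWrite(ins);   // destination operand in memory
    AFUNPTR fn = (rd && wr) ? (AFUNPTR)CountRW
               : rd         ? (AFUNPTR)CountRead
               : wr         ? (AFUNPTR)CountWrite
               :              (AFUNPTR)CountNone;
    INS_InsertCall(ins, IPOINT_BEFORE, fn, IARG_END);
}

VOID Fini(INT32 code, VOID *v)
{
    std::cerr << "MEM none " << mem_none << "  MEM read " << mem_read
              << "  MEM write " << mem_write << "  MEM read/write " << mem_rw << std::endl;
}

int main(int argc, char *argv[])
{
    if (PIN_Init(argc, argv)) return 1;
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();
    return 0;
}
```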
3.3. Processor Performance
To identify the memory-bound workloads, Figure 4 presents the CPI values of the two suites, collected using performance counters on the 2.8 GHz Xeon system. The average CPI of the SPEC CPU2000 workloads is 1.97, while that of SPEC CPU2006 is 2.16. The higher CPI of SPEC CPU2006 is a result of its larger problem sizes [1]. Table 1 categorizes workloads by CPI: values greater than 8, values between 4 and 8, and values between 2 and 4. The table shows that SPEC CPU2006 contains more memory-bound applications (higher CPI values) than SPEC CPU2000. The primary memory-bound SPEC CPU2000 workloads (CPI greater than 4) are art, equake, mcf, swim, and vpr (route input). Comparatively, the memory-bound SPEC CPU2006 workloads are GemsFDTD, astar (BigLakes2048 input), cactusADM, lbm, leslie3d, libquantum, mcf, milc, omnetpp, and soplex. To understand the memory system requirements of these workloads, the next section presents workload sensitivity to different cache sizes.
3.4. Cache Performance
We study the working sets of the workloads by examining their cache performance at different cache sizes. We then present the cache performance of the SPEC workloads for a three-level cache hierarchy that is representative of modern high-performance processors.
3.4.1. Working Set Size
Figures 2-3 and 5-6 present the instruction and data cache sensitivity studies for all workloads in the two suites. The figures plot cache size in megabytes (MB) on a logarithmic x-axis and misses per one thousand instructions (MPKI) on the y-axis; note that the y-axis scale varies significantly between workloads. We use the stack distance approach to simulate the performance of multiple cache sizes in a single run with one large cache. Besides providing cache performance for multiple cache sizes, the stack distance approach also indicates the total memory footprint of each workload: if a workload's total instruction and data footprint is smaller than the simulated large cache, the amount of valid data in the cache (assuming the cache starts empty) equals the total memory footprint of the workload. This information appears in the cache sensitivity graphs as curves whose last x-axis coordinate ends before the simulated large cache size (8MB for instructions, 128MB for data). For example, the last data-cache point for art (Figure 2d) occurs at 4MB, implying that art has a total data memory footprint of approximately 4MB; similarly, art has an instruction memory footprint of approximately 100KB.
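Under the assumption of a cache that starts empty, the footprint measurement reduces to counting the distinct 64B lines ever touched. A minimal illustrative helper (ours, not CMP$im) is:

```cpp
// Full-run memory footprint as the number of distinct 64-byte lines ever touched
// (illustrative helper, not CMP$im).
#include <cstdint>
#include <unordered_set>

static std::unordered_set<uint64_t> touched_lines;

void RecordAccess(uint64_t addr)
{
    touched_lines.insert(addr >> 6);   // 64-byte line granularity
}

double FootprintMB()
{
    return touched_lines.size() * 64.0 / (1024.0 * 1024.0);
}
```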
[Figure 2 panels (a)-(u): instruction-stream and data-stream MPKI versus cache size for the SPEC CPU2000 workloads ammp, applu, apsi, art (ref1, ref2), bzip2 (graphic, program, source), crafty, eon (cook, kajiya, rushmeier), equake, facerec, fma3d, galgel, gap, gcc (166, 200, expr, integrate, scilab), gzip (graphic, log, program, random, source), lucas, mcf, mesa, mgrid, parser, perlbmk (535, 704, 850, 957, diffmail, makerand, perfect), and sixtrack.]
Figure 2: SPEC CPU2000 Cache Sensitivity: The figure shows the cache performance of each workload as a function of cache size. The y-axis presents workload MPKI and the x-axis presents the cache size in megabytes (MB). All workloads were run to completion using the reference input sets.
[Figure 3 panels (v)-(z), (x1)-(x3), (y1)-(y2): instruction-stream and data-stream MPKI versus cache size for the remaining SPEC CPU2000 workloads swim, twolf, vortex (lendian1, lendian2, lendian3), vpr (place, route), and wupwise.]
Figure 3: SPEC CPU2000 Cache Sensitivity: The figure shows the cache performance of each workload as a function of cache size. The y-axis presents workload MPKI and the x-axis presents the cache size in megabytes (MB). All workloads were run to completion using the reference input sets.
[Figure 4 charts: per-workload CPI for SPEC CPU2000 (average 1.97) and SPEC CPU2006 (average 2.16) measured with performance counters.]
Figure 4: Performance of SPEC CPU2000 and SPEC CPU2006 Workloads: The figure shows the CPI values for the two suites gathered using performance counters on a 2.8GHz Xeon system. The table below presents the memory-bound workloads separated into three categories.
Table 1: Memory Bound Workloads
CPI | SPEC CPU2000 | SPEC CPU2006
>8  | art, mcf | mcf
4-8 | equake, swim, vpr.route | GemsFDTD, astar.BigLakes2048, cactusADM, lbm, milc, omnetpp, soplex
2-4 | ammp, applu, apsi, fma3d, lucas, mgrid, twolf | astar.rivers, bwaves, bzip2.liberty, gcc, gobmk, hmmer.nph3, leslie3d, libquantum, sphinx3, xalancbmk, zeusmp
[Figure 5 panels (a)-(q): instruction-stream and data-stream MPKI versus cache size for the SPEC CPU2006 workloads GemsFDTD, astar (BigLakes2048, rivers), bwaves, bzip2 (chicken, combined, liberty, program, source, text), cactusADM, calculix, dealII, gamess (cytosine, h2ocu2+, triazolium), gcc (166, 200, c-typeck, cp-decl, expr, expr2, g23, s04, scilab), gobmk (13x13, nngs, score2, trevorc, trevord), gromacs, h264ref (ref_baseline, ref_main, sss_main), hmmer (nph3, retro), lbm, leslie3d, libquantum, and mcf.]
Figure 5: SPEC CPU2006 Cache Sensitivity: The figure shows the cache performance of each workload as a function of cache size. The y-axis presents workload MPKI and the x-axis presents the cache size in megabytes (MB). All workloads were run to completion using the reference input sets.
[Figure 6 panels (r)-(δ): instruction-stream and data-stream MPKI versus cache size for the remaining SPEC CPU2006 workloads milc, namd, omnetpp, perlbench (checkspam, diffmail, splitmail), povray, sjeng, soplex (pds-50, ref), specrand, sphinx3, tonto, wrf, xalancbmk, and zeusmp.]
Figure 6: SPEC CPU2006 Cache Sensitivity: The figure shows the cache performance of each workload as a function of cache size. The y-axis presents workload MPKI and the x-axis presents the cache size in megabytes (MB). All workloads were run to completion using the reference input sets.
[Figure 7 chart: average MPKI versus cache size for six curves: ISTREAM and DSTREAM for SPEC CPU2000 (all workloads), DSTREAM for SPEC CPU2000 excluding art and mcf, ISTREAM and DSTREAM for SPEC CPU2006 (all workloads), and DSTREAM for SPEC CPU2006 excluding mcf.]
Figure 7: Average SPEC CPU2000 and SPEC CPU2006 Cache Sensitivity: The figure presents the average cache behavior of the SPEC CPU2000 and SPEC CPU2006 suites. Since SPEC CPU2000 is heavily biased by art and mcf, we also present the average behavior of the workloads excluding art and mcf. Similarly, for SPEC CPU2006 we present the average behavior excluding mcf.
However, for workloads such as applu (Figure 2b), gap (Figure 2l), and others where misses occur even in a 128MB cache, one can conclude that the total data memory footprint is greater than or equal to 128MB. From the figures, 20% of the 48 SPEC CPU2000 workloads have a total memory footprint greater than 128MB, while more than 75% of the 56 SPEC CPU2006 workloads do.
The instruction cache sensitivity study reveals that a 64KB instruction cache suffices for most workloads in the two suites. However, some outliers, such as crafty, gap, gcc, eon, perl, and vortex from SPEC CPU2000 and gcc, gobmk, omnetpp, perl, and xalancbmk from SPEC CPU2006, require a 128KB instruction cache to perform as well as the other workloads in the suite. These workloads have code footprints greater than 1MB. This is to be expected from compilers (gcc), interpreters (perl), XML parsers (xalancbmk), and artificial intelligence programs (gobmk and sjeng), which spend their time over a wide range of functions.
The data cache sensitivity study reveals that most workloads exhibit cache-friendly behavior: incremental increases in cache size yield incremental improvements in cache performance. Such behavior appears as a smooth, exponentially decreasing miss rate with increasing cache size (for example, Figure 2e). Both suites, however, also contain workloads for which increasing the cache size provides no improvement until a particular size is reached, at which point cache performance improves dramatically. Such behavior is evident as sudden, cliff-shaped drops in the cache miss rate as a function of cache size; examples are art (Figure 2d) and equake (Figure 2h) from SPEC CPU2000 and libquantum (Figure 5p) and sphinx3 (Figure 6z) from SPEC CPU2006. This behavior implies that the workload has a working set whose size matches the cache size at which the drop occurs. For example, libquantum from SPEC CPU2006 achieves its best cache performance with a 32MB cache; since the number of misses drops to essentially zero at 32MB, the workload re-references a 32MB working set in a circular fashion. Some workloads, such as gcc in both SPEC CPU2000 and SPEC CPU2006, show multiple drops in the miss rate function, implying multiple working-set sizes.
Figure 7 presents the average cache behavior of both the instruction and data streams. Since the average behavior of SPEC CPU2000 is heavily biased by art and mcf, we also present the average of all workloads excluding art and mcf; similarly, for SPEC CPU2006 we present the average excluding mcf. The figure shows that SPEC CPU2000 and SPEC CPU2006 have similar instruction cache requirements, with a 64KB instruction cache being sufficient. On the other hand, the knee of the data-stream cache miss rate curve for SPEC CPU2000 occurs at a cache size of 1-2MB, while that of SPEC CPU2006 occurs at 16-32MB, a factor of 16 higher. The larger cache requirements of these workloads continue to put pressure on improving the performance of the memory subsystem.
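As a rough illustration of how such knees can be read off the MPKI-versus-size curves automatically (a helper of our own, not part of the paper's toolchain), one can flag cache sizes at which MPKI collapses relative to the next smaller size:

```cpp
// Flag "knee" points on an MPKI-versus-cache-size curve: sizes where MPKI drops
// sharply relative to the next smaller size (illustrative helper only).
#include <cstdio>
#include <utility>
#include <vector>

void PrintKnees(const std::vector<std::pair<double, double>> &curve,  // {size_mb, mpki}
                double drop_ratio = 0.5)
{
    for (size_t i = 1; i < curve.size(); ++i) {
        double prev = curve[i - 1].second;
        double cur  = curve[i].second;
        if (prev > 0.0 && cur < drop_ratio * prev)
            std::printf("knee near %.3f MB (MPKI %.2f -> %.2f)\n",
                        curve[i].first, prev, cur);
    }
}
```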
3.4.2. Sensitivity to Different Input Sets
Among the SPEC CPU2006 workloads, the benchmarks with multiple input sets are astar (2 input sets), bzip2 (6), gamess (3), gcc (9), gobmk (5), h264ref (3), hmmer (2), perlbench (3), and soplex (2). We compare the different input sets using MPKI, treating the input sets with the highest MPKI as the most memory intensive. For astar, the BigLakes2048 input is the more data-memory-intensive of the two. For bzip2, the input sets have very similar instruction and data cache requirements, with the liberty input being the most data-memory intensive. For gamess, all input sets have very similar working sets and fit well into a 512KB cache (the triazolium input is slightly more data-memory intensive at 256KB). For gcc, the 166 and g23 inputs are more data-memory intensive than the others. For gobmk and h264ref, all input sets have very similar working-set sizes. For hmmer, the retro input is the more data-memory-intensive of the two. For perlbench, the splitmail input is the most data-memory intensive. Finally, for soplex, the pds-50 input is more data-memory intensive for cache sizes below 4MB, while the ref input is more data-memory intensive above 4MB.
3.4.3. Performance of a Three-Level Cache Hierarchy
Figure 8 presents the cache performance of the workloads for a three-level cache hierarchy: 32KB separate instruction and data caches, a 256KB unified L2 cache, and a 2MB unified L3 cache. For each cache we present four metrics: accesses per 1000 instructions, misses per 1000 instructions, miss rate, and evictions unused. The evictions-unused metric measures the number of cache lines evicted that were never re-referenced after being inserted into the cache. There are three main reasons for unused evictions. First, the referenced data has no temporal locality, which is characteristic of streaming data. Second, the data has short temporal locality that is fully captured by smaller caches; the filtering effect of the small caches then prevents the data from being re-referenced in later levels of the hierarchy. Finally, unused evictions also occur when the referenced data has a working set that is larger than the available cache size, so cache lines are evicted by capacity misses before they can ever be reused.
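The bookkeeping behind the evictions-unused metric is straightforward; the sketch below (our own illustration, not CMP$im's code) tags each line at fill time and counts evictions whose line was never hit again:

```cpp
// Bookkeeping for the "evictions unused" metric: tag a line at fill time, set the
// tag on any later hit, and count evictions whose tag was never set (illustrative).
#include <cstdint>

struct Line {
    uint64_t tag    = 0;
    bool     valid  = false;
    bool     reused = false;           // set on any hit after the fill
};

static uint64_t evictions = 0, unused_evictions = 0;

void OnHit(Line &l) { l.reused = true; }

void OnFill(Line &victim, uint64_t new_tag)
{
    if (victim.valid) {                // the victim is being evicted
        evictions++;
        if (!victim.reused) unused_evictions++;   // inserted but never re-referenced
    }
    victim.tag    = new_tag;
    victim.valid  = true;
    victim.reused = false;
}

double UnusedEvictionPercent()
{
    return evictions ? 100.0 * unused_evictions / evictions : 0.0;
}
```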
The cache performance numbers in Figure 8 support the same conclusions about instruction and data cache performance as Section 3.4.1. The key observation in these figures is the evictions-unused metric: on average, roughly 15%, 20%, 80%, and 75% of the lines evicted from the L1 instruction, L1 data, unified L2, and unified L3 caches, respectively, are never re-referenced after insertion into the cache. This represents a tremendous waste of cache resources, especially in the large caches: every line that is not re-referenced before eviction is cache space that could potentially have been used by a more useful line. For applications with frequent cache misses, this suggests an opportunity to improve cache performance by utilizing cache resources more effectively.
[Figure 8 charts: for each workload in both suites, accesses per 1000 instructions, misses per 1000 instructions, miss rate, and percentage of evictions unused for the 32KB L1 instruction cache, 32KB L1 data cache, 256KB unified L2 cache, and 2MB unified L3 cache.]
Figure 8: Three Level Cache Hierarchy Performance: The figure shows the cache performance of the different SPEC CPU2000 and SPEC CPU2006 workloads for a three-level non-inclusive cache hierarchy.
[Figure 9 panels: full-run L3 miss rate versus instruction count (in billions) for bzip2.liberty, mcf.ref, tonto.tonto, and gcc.c-typeck, showing both the cumulative miss rate and the miss rate over 10-million-instruction phases.]
Figure 9: Full Run L3 Cache Behavior of Applications: The figure shows the full-run L3 cache miss-rate behavior of workloads from SPEC CPU2006. The solid line presents cumulative behavior while the dots represent application behavior on a 10 million instruction granularity.
3.5. Full Run Behavior of Applications
Figure 9 presents the time-varying L3 cache performance of selected benchmarks from the SPEC CPU2006 suite. The x-axis shows the total number of instructions executed (in billions) and the y-axis shows the L3 cache miss rate. For each benchmark we present the cumulative L3 miss rate of the application (the solid line) and its instantaneous behavior over 10 million instruction intervals (the dots). The figure shows that applications have several distinct phases of execution. For example, bzip2 shows a step function, with the L3 miss rate increasing over time; mcf has several phases in which none, some, or all cache references result in misses; tonto exhibits loop behavior with cache performance varying significantly within each loop; and gcc's L3 cache performance varies significantly throughout its full run. This phase behavior makes it extremely important that representative regions chosen for detailed simulation accurately reflect actual program behavior. For example, choosing a 100 million instruction interval starting at 525 billion instructions for tonto would be misleading for performance studies, since the L3 cache performance in that phase does not reflect the application's overall L3 cache performance. Choosing representative regions of long-running programs for detailed simulation becomes a challenging task as newer applications continue to emerge. Since IDS allows for quick full-run exploratory studies, it serves as an excellent methodology not only for identifying representative regions of a program but also for validating existing region selection algorithms [12].
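The two views plotted in Figure 9 amount to keeping two sets of counters, one cumulative and one reset every 10 million instructions; a hedged sketch of that bookkeeping is:

```cpp
// Cumulative versus per-interval (10 million instruction) L3 miss rate, as plotted
// in Figure 9 (illustrative bookkeeping only).
#include <cstdint>
#include <cstdio>

static const uint64_t kInterval = 10000000;   // 10M-instruction phases
static uint64_t tot_acc = 0, tot_miss = 0;    // cumulative L3 counters
static uint64_t int_acc = 0, int_miss = 0;    // current-interval counters
static uint64_t next_dump = kInterval;

void OnL3Access(bool miss)
{
    tot_acc++; int_acc++;
    if (miss) { tot_miss++; int_miss++; }
}

void OnInstructionRetired(uint64_t icount)
{
    if (icount < next_dump) return;
    std::printf("%llu instrs: interval miss rate %.1f%%, cumulative %.1f%%\n",
                (unsigned long long)icount,
                int_acc ? 100.0 * int_miss / int_acc : 0.0,
                tot_acc ? 100.0 * tot_miss / tot_acc : 0.0);
    int_acc = int_miss = 0;
    next_dump += kInterval;
}
```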
4. Summary
There is a growing need for simulation methodologies that can characterize the memory system requirements of emerging workloads in a reasonable amount of time. This paper presents binary instrumentation-driven simulation (IDS) as an alternative to conventional execution-driven and trace-driven simulation methodologies. Since instrumentation-driven simulation is fast, robust, and simple, users can characterize applications at MIPS speed; rather than studying only small regions of an application, users can study applications run to completion. We illustrate the benefits of IDS by using Pin to characterize the SPEC CPU2000 and SPEC CPU2006 benchmark suites. Workloads in SPEC CPU2006 have an instruction count that is an order of magnitude larger than the SPEC CPU2000 workloads: the average instruction count of SPEC CPU2000 is 131 billion, while that of SPEC CPU2006 is 1.3 trillion. These large instruction counts present interesting challenges in choosing representative regions for detailed simulation. A full memory characterization of both suites reveals that SPEC CPU2006 workloads have larger memory working-set sizes, with most memory-intensive workloads requiring more than 4MB of cache. The larger cache requirements of these workloads continue to put pressure on improving the performance of the memory subsystem.
5. References
[1] SPEC CPU2006. http://www.spec.org/cpu2006/Docs/readme1st.html
[2] P. Bungale and C. K. Luk. "PinOS."
[3] J. Edler and M. D. Hill. "Dinero IV Trace-Driven Uniprocessor Cache Simulator."
[4] J. L. Henning. "SPEC CPU2006 Benchmark Descriptions." ACM SIGARCH Computer Architecture News, Vol. 34, No. 4, September 2006.
[5] A. Jaleel, R. S. Cohn, C. K. Luk, and B. L. Jacob. "CMP$im: Using Pin to Characterize Memory Behavior of Emerging Workloads on CMPs." Technical Report UMD-SCA-2006-01.
[6] K. Lawton. Bochs. http://bochs.sourceforge.net
[7] C. K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. "Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation." In Proceedings of Programming Language Design and Implementation (PLDI), Chicago, Illinois, 2005.
[8] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, and G. Hallberg. "Simics: A Full System Simulation Platform." IEEE Computer, 35(2):50-58, February 2002.
[9] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. "Evaluation Techniques for Storage Hierarchies." IBM Systems Journal, 9(2), 1970.
[10] S. Narayanasamy, C. Pereira, H. Patil, R. Cohn, and B. Calder. "Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation." In SIGMETRICS, Saint Malo, France, 2006.
[11] H. Pan, K. Asanovic, R. Cohn, and C. K. Luk. "Controlling Program Execution through Binary Instrumentation." In Workshop on Binary Instrumentation and Applications (WBIA), St. Louis, MO, September 2005.
[12] E. Perelman, G. Hamerly, M. V. Biesbrouck, T. Sherwood, and B. Calder. "Using SimPoint for Accurate and Efficient Simulation." In SIGMETRICS, San Diego, CA, 2003.
[13] S. Sair and M. Charney. "Memory Behavior of the SPEC CPU2000 Benchmark Suite." IBM Thomas J. Watson Research Center Technical Report RC-21852, October 2000.
[14] A. Srivastava and A. Eustace. "ATOM: A System for Building Customized Program Analysis Tools." In Programming Language Design and Implementation (PLDI), 1994, pp. 196-205.
[15] R. A. Uhlig and T. N. Mudge. "Trace-Driven Memory Simulation: A Survey." ACM Computing Surveys, Vol. 29, 1997.