
Please cite: Web Copy: http://www.glue.umd.edu/~ajaleel/workload/

Memory Characterization of Workloads Using Instrumentation-Driven Simulation
A Pin-based Memory Characterization of the SPEC CPU2000 and SPEC CPU2006 Benchmark Suites
Aamer Jaleel
Intel Corporation, VSSAD
{aamer.jaleel}@intel.com

Abstract

There is a growing need for simulation methodologies to understand the memory system requirements of emerging workloads in a reasonable amount of time. This paper presents binary instrumentation-driven simulation as an alternative to conventional execution-driven and trace-driven simulation methodologies. We illustrate the use of instrumentation-driven simulation (IDS) using Pin to determine the memory system requirements of workloads from the SPEC CPU2000 and SPEC CPU2006 benchmark suites. In comparison to SPEC CPU2000, SPEC CPU2006 workloads are an order of magnitude longer (in terms of instruction count). Additionally, SPEC CPU2006 comprises many more memory-intensive workloads that require more than 4MB of cache for good cache performance.

1. Introduction

Processor architects continue to face key design decisions when designing the memory hierarchy. With several emerging application domains, understanding the memory system requirements of new applications is essential to designing an efficient memory hierarchy. Such characterization and exploratory studies require fast simulation techniques that can compare and contrast the performance of alternative design policies. For memory system studies, this paper proposes instrumentation-driven simulation as an alternative to existing execution-driven and trace-driven methodologies.
Simulation is a common methodology used to identify performance bottlenecks in existing systems as well as to explore the design space. There exist many simulators and software tools for investigating the memory system performance of applications. In general, memory system simulators fall into two main categories: trace-driven or execution-driven. With trace-driven memory system simulation, address traces are fed off-line to a memory system simulator (e.g. Dinero IV [3]). Such simulators rely on existing tools to collect memory address traces and log them to a file for later use. Execution-driven cache simulators, on the other hand, rely on functional/performance models to execute application binaries. The memory addresses generated by the functional/performance model are fed in real time to a memory system simulator modeled within the functional/performance model.

In general, functional models of modern ISAs are slow and can be complex to build. As a result, trace-driven simulation has become popular for conducting memory system studies [15]. The usefulness of trace-driven simulation, however, lies in the continued availability of address traces to study the memory system requirements of different workloads. With several emerging application domains, understanding the memory behavior and working set requirements of new applications requires the ability to generate address traces. Address trace generation for a target ISA can require sophisticated hardware tools (e.g. a bus tracer) or a functional model that supports the target ISA and the requirements of the workload (e.g. the functional model must provide support for multiple contexts if executing a multi-threaded workload). However, there are drawbacks with these mechanisms. Hardware tools such as bus tracers only capture address traces that reach the system bus. Consequently, the collected address trace is incomplete since address requests that hit in the processor caches are filtered out. On the other hand, address traces generated using functional models represent only a small region of the application. Additionally, a practical problem with collecting memory address traces using either mechanism is that they can become very large, potentially occupying several gigabytes of disk space even in their compressed formats.
To address the drawbacks of current simulation techniques, this paper proposes the use of instrumentation-driven simulation (IDS) to conduct memory system studies. Since instrumentation-driven memory system simulation is fast, robust, and simple, users can write simple tools to characterize the memory system requirements of almost any application at MIPS (as opposed to KIPS) speed. In doing so, IDS can support full-run application studies. This paper presents the use of Pin [7]-based IDS to conduct full-run memory system studies of workloads from the SPEC CPU2000 and SPEC CPU2006 [4] benchmark suites. Our studies reveal that, on average, workloads in SPEC CPU2006 have an instruction count that is an order of magnitude larger than that of the SPEC CPU2000 workloads. Additionally, SPEC CPU2006 workloads have larger memory working-set sizes, with most memory-intensive workloads requiring more than 4MB of cache.

The rest of this paper is organized as follows. Section 2 describes the benefits and caveats of using IDS. Section 3 describes the simulation methodology and then presents a memory performance comparison of the SPEC CPU2000 and SPEC CPU2006 workloads. Finally, Section 4 provides a summary of our work.


2. Instrumentation-Driven Simulation (IDS)

Binary instrumentation is a technique for inserting extra code into an existing application in order to collect run-time information from it. The binary instrumentation tool determines the type of run-time information to be collected. Binary instrumentation tools have commonly been used for the performance analysis of applications. However, binary instrumentation systems can also serve as an attractive vehicle for conducting computer architecture research. For example, binary instrumentation tools can implement simple simulators that are instrumentation-driven. In addition to being simple, binary instrumentation systems are fast and robust. Since binary instrumentation normally occurs at near-native execution speeds, IDS can also run at MIPS speeds. Unlike existing execution-driven and trace-driven simulators, IDS supports quick exploratory studies by simulating applications run to completion. Additionally, since binary instrumentation systems are robust for all kinds of applications, users can conduct instrumentation-driven simulation with any kind of application, no matter its complexity. For example, users can study complex applications such as Oracle or Java.
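To make the mechanics concrete, the sketch below shows the general shape of a Pin analysis tool that feeds every memory operand it observes into a cache model. It is a minimal illustration in the spirit of Pin's publicly distributed examples, not the CMP$im tool used in this paper; the direct-mapped cache, its parameters, and the counter names are assumptions chosen for brevity, and exact Pin API details may differ across Pin releases.

```cpp
// Minimal Pin tool sketch (illustrative only, not CMP$im): every memory
// operand's effective address is passed to a tiny direct-mapped cache model.
#include "pin.H"
#include <iostream>

static const UINT32 kLineShift = 6;    // 64B cache lines
static const UINT32 kNumSets   = 512;  // 512 sets x 64B = 32KB, direct mapped
static ADDRINT      tags[kNumSets];
static UINT64       accesses = 0, misses = 0;

// Analysis routine: called at run time for each memory operand.
VOID Access(ADDRINT ea)
{
    ADDRINT line = ea >> kLineShift;
    UINT32  set  = line % kNumSets;
    accesses++;
    if (tags[set] != line) { misses++; tags[set] = line; }
}

// Instrumentation routine: called once per static instruction.
VOID Instruction(INS ins, VOID* v)
{
    UINT32 ops = INS_MemoryOperandCount(ins);
    for (UINT32 i = 0; i < ops; i++)
        INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)Access,
                                 IARG_MEMORYOP_EA, i, IARG_END);
}

VOID Fini(INT32 code, VOID* v)
{
    std::cerr << "accesses=" << accesses << " misses=" << misses << std::endl;
}

int main(int argc, char* argv[])
{
    PIN_Init(argc, argv);
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();   // never returns
    return 0;
}
```

The same structure scales to a full cache hierarchy by replacing the direct-mapped array with a richer model, which is essentially what an instrumentation-driven simulator such as CMP$im does.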

2.1. Caveats with Instrumentation-Driven Simulation

As with any simulation methodology, there are some tradeoffs with the instrumentation-driven approach.
• Instrumentation Overhead: Instrumentation involves injecting extra code dynamically or statically into the target application. The additional code increases the time the application spends in execution. If the application changes its behavior based on the amount of time spent in execution, its native and instrumented executions can differ. If the differences are significant, IDS is unsuitable for such classes of applications. Additionally, for multi-threaded applications, instrumentation can modify the ordering of instructions executed between different threads of the application. As a result, IDS with multi-threaded applications comes with some loss of fidelity.
• Lack of Speculation: Instrumentation only observes instructions executed on the correct path of execution. As a result, IDS may not be able to support wrong-path execution. If speculation is desired, special-purpose tools that emulate a branch predictor can be implemented [11].

• Repeatability: Since instrumentation occurs on real operating systems, multiple runs of the same workload on the same system are not identical. If repeatability is desired, special-purpose tools can be implemented [10].
• User-level Traffic Only: Current binary instrumentation tools only support user-level instrumentation. Thus, applications that are kernel intensive are unsuitable for user-level IDS. For such applications, system-level simulators [6, 8] or instrumentation systems [2] are ideal.
• Native System Limited: Since IDS is conducted on a native system, the fidelity of the simulation model can depend on the characteristics of the native system. For example, simulating a system that has four active hardware threads on a system that supports only two active hardware threads may incorrectly interleave the order in which instructions are executed by the different hardware threads. This is because the native system time-slices the simulated threads while the simulated system would execute the threads simultaneously. To ensure correct timing, this issue can be resolved by modeling a thread scheduler [11].
Despite the above caveats, IDS is an excellent methodology for initial exploratory studies as it is fast, robust, and simple.

3. Memory Behavior of SPEC CPU2000 and SPEC CPU2006

For the purposes of this study, we use IDS to characterize the memory system behavior of both the SPEC CPU2000 and SPEC CPU2006 benchmark suites. The workloads in the SPEC suites are attractive candidates for IDS as they are unaffected by the caveats mentioned in the previous section. Existing work has studied the memory behavior of SPEC CPU2000 [13]. For uniformity in ISA, compilers, and simulation methodology, we present a comprehensive cache performance study of all workloads from both SPEC CPU2000 and SPEC CPU2006 run to completion. We first describe the experimental methodology before presenting our results.


3.1. Experimental Methodology

We use Pin, a dynamic binary instrumentation system for Linux, MacOS, and Windows binary executables on Intel® IA-32 (x86 32-bit), IA-32E (x86 64-bit), and Itanium® processors. Pin is similar to the ATOM [14] toolkit for Alpha processors. Like ATOM, Pin provides an infrastructure for writing program analysis tools called pin tools. For our workload characterization studies we use CMP$im [5], an instrumentation-driven simulator for CMPs. CMP$im can characterize application instruction profiles and conduct full-run cache performance studies of applications at speeds of 4-10 MIPS.


[Figure 1 plot data omitted: per-benchmark dynamic instruction counts (in billions) and the instruction-mix breakdown (MEM none, MEM read, MEM write, MEM read/write) for the SPEC CPU2000 INT/FP and SPEC CPU2006 INT/FP workloads.]

Figure 1: Dynamic Instruction Count and Instruction Profile of SPEC CPU2000 and SPEC CPU2006 workloads. The figure shows the instruction count and instruction mix of the applications when run to completion using the reference input sets.


We use CMP$im in single-core mode to conduct detailed characterization studies of the SPEC workloads. All workloads from the SPEC CPU2000 and SPEC CPU2006 benchmark suites are run to completion using their reference input sets. We also collect IPC numbers for the two suites using performance counters on a 2.8 GHz Intel® Xeon® system with a two-level cache hierarchy: a 32KB L1 cache and a 512KB L2 cache. All workloads are compiled with the -O3 optimization flag on a 32-bit Red Hat Linux system using Intel's C/C++ and Fortran compilers.
For each workload, we conduct a cache sensitivity analysis using the stack distance approach [9] to simulate multiple cache sizes in a single run. We model a 256-way 8MB instruction cache and a 2048-way 128MB data cache. Both use a 64B line size and a true LRU replacement policy. The simulated cache configurations therefore cover cache sizes of 32KB (direct mapped), 64KB (2-way), 128KB (4-way), and so on. After presenting the cache sensitivity study, we also present the cache performance of each workload for a three-level cache hierarchy: 4-way 32KB separate L1 instruction and data caches, a unified 8-way 256KB L2 cache, and a unified 16-way 2MB L3 cache. All caches in the simulated hierarchy use a 64B line size, are non-inclusive, allocate on writes, and use a writeback policy with true LRU replacement.
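The sketch below illustrates the single-pass stack-distance technique described above: a single large, highly associative cache with true LRU is simulated, and because a reference that hits at LRU depth d also hits in any cache with the same number of sets and more than d ways, a histogram of hit depths yields the miss counts of every smaller configuration at once. The class name and parameters are illustrative assumptions, not the CMP$im implementation.

```cpp
// Single-pass stack-distance cache model (illustrative sketch).
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

class StackDistanceCache {
public:
    StackDistanceCache(uint32_t sets, uint32_t maxWays)
        : sets_(sets), maxWays_(maxWays), lru_(sets), depthHist_(maxWays + 1, 0) {}

    void Access(uint64_t addr) {
        uint64_t line = addr >> 6;                 // 64B line size
        auto& stack   = lru_[line % sets_];        // per-set LRU stack, MRU at front
        auto it = std::find(stack.begin(), stack.end(), line);
        if (it == stack.end()) {                   // miss even in the largest cache
            depthHist_[maxWays_]++;
            if (stack.size() == maxWays_) stack.pop_back();
        } else {                                   // hit at this LRU depth
            depthHist_[it - stack.begin()]++;
            stack.erase(it);
        }
        stack.push_front(line);                    // move (or insert) to MRU
    }

    // Misses of a cache with 'ways' ways and the same number of sets.
    uint64_t MissesForWays(uint32_t ways) const {
        uint64_t m = 0;
        for (uint32_t d = ways; d <= maxWays_; d++) m += depthHist_[d];
        return m;
    }

private:
    uint32_t sets_, maxWays_;
    std::vector<std::deque<uint64_t>> lru_;        // one LRU stack per set
    std::vector<uint64_t> depthHist_;              // hits by depth; last bucket = misses
};
```

With 512 sets and 64B lines, for example, MissesForWays(1) corresponds to a 32KB direct-mapped cache and MissesForWays(256) to the 8MB, 256-way instruction cache swept above.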

3.2. Application Instruction Profile

Figure 1 shows the total instruction count for each workload in the two suites. The average instruction count for the SPEC CPU2000 benchmark suite is 131 billion while that of SPEC CPU2006 is 1276 billion, an order of magnitude higher. The increase in instruction count by SPEC has been in response to significant differences observed in run time as a result of small fluctuations in the system state or measurement conditions [1]. However, the large run lengths of SPEC CPU2006 now present interesting challenges in choosing representative regions of the workloads for conducting experiments using detailed performance simulators.
To understand the contribution of instructions that reference memory, Figure 1 also presents the instruction profile of the floating-point and integer benchmarks distributed into four categories: instructions that do not reference memory (ALU operations only), instructions that have one or more source operands in memory (MEM read), instructions whose destination operand is in memory (MEM write), and instructions whose source and destination operands are both in memory (MEM read and write). On average, roughly half of the instructions reference memory. Of the instructions that reference memory, very few (< 1%) both read from and write to memory, while roughly 20% of memory instructions write to memory. This behavior is consistent for both SPEC CPU2000 and SPEC CPU2006. The large proportion of instructions that reference memory is indicative of register spills to the stack due to the limited number of registers available on 32-bit x86 systems.
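The four categories above map directly onto Pin's instruction-inspection API. The sketch below is a hedged illustration of how such a profile could be gathered at MIPS speed; it is not the tool that produced Figure 1, and the counter layout is an assumption.

```cpp
// Instruction-mix profiling sketch: classify each dynamic instruction as
// MEM none / MEM read / MEM write / MEM read+write (illustrative only).
#include "pin.H"
#include <iostream>

static UINT64 cnt[4] = {0, 0, 0, 0};   // index: bit0 = reads memory, bit1 = writes memory

VOID Count(UINT32 cls) { cnt[cls]++; }

VOID Instruction(INS ins, VOID* v)
{
    UINT32 cls = 0;
    if (INS_IsMemoryRead(ins))  cls |= 1;
    if (INS_IsMemoryWrite(ins)) cls |= 2;
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)Count, IARG_UINT32, cls, IARG_END);
}

VOID Fini(INT32 code, VOID* v)
{
    std::cerr << "MEM none=" << cnt[0] << " read=" << cnt[1]
              << " write="   << cnt[2] << " read/write=" << cnt[3] << std::endl;
}

int main(int argc, char* argv[])
{
    PIN_Init(argc, argv);
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();
    return 0;
}
```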

3.3. Processor Performance

To identify the memory-bound workloads, Figure 4 presents the CPI values of the two suites collected using performance counters on the 2.8 GHz Xeon system. The average CPI of the SPEC CPU2000 workloads is 1.97 while that of SPEC CPU2006 is 2.16. The higher CPI of SPEC CPU2006 is a result of larger problem set sizes [1]. Table 1 categorizes workloads based on their CPI values: CPI greater than 8, CPI between 4 and 8, and CPI between 2 and 4. The table shows that SPEC CPU2006 contains more memory-bound applications (higher CPI values) than SPEC CPU2000. The primary memory-bound SPEC CPU2000 workloads (with CPIs greater than 4) were art, equake, mcf, swim, and vpr (route input). Comparatively, the memory-bound SPEC CPU2006 workloads are GemsFDTD, astar (BigLakes2048 input), cactusADM, lbm, leslie3d, libquantum, mcf, milc, omnetpp, and soplex. To understand the memory system requirements of these workloads, the next section presents workload sensitivity to different cache sizes.

3.4. Cache Performance

We analyze the working sets of the workloads by studying their cache performance at different cache sizes. We then present the cache performance of the SPEC workloads for a three-level cache hierarchy that is representative of modern high-performance processors.


3.4.1. Working Set Size

Figures 2-3 and 5-6 present the instruction and data cache sensitivity studies for all workloads in the two suites. The figures plot cache size in megabytes (MB) on a logarithmic x-axis and misses per one thousand instructions (MPKI) on the y-axis. Note that the y-axis scale varies significantly between workloads. We use the stack distance approach to simulate the performance of multiple cache sizes in one simulation run using a single large cache. In doing so, the stack distance approach not only provides cache performance for multiple cache sizes but also provides an indication of the total memory footprint of each workload. For example, if a workload has a total instruction and data footprint that is smaller than the simulated large cache, the amount of valid data in the cache (assuming the cache starts empty at the beginning of simulation) represents the total memory footprint of the workload. This information is shown on the cache sensitivity graphs for workloads whose last x-axis coordinate ends before the simulated large cache size (8MB for instructions and 128MB for data). For example, the last data cache point for art (Figure 2d) occurs at 4MB, implying that it has a total data memory footprint of approximately 4MB. Similarly, art has an instruction memory footprint of approximately 100KB.
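As a rough illustration of how the MPKI-versus-size curves in Figures 2-6 fall out of the stack-distance data, the standalone helper below converts a histogram of hits bucketed by LRU depth (with the final bucket holding misses in the largest simulated cache) into (size, MPKI) points. The function name, the 64B line size, and the power-of-two sweep are assumptions for illustration.

```cpp
// Derive an MPKI-versus-cache-size curve from a stack-distance histogram.
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <vector>

void PrintMpkiCurve(const std::vector<uint64_t>& depthHist,  // size = maxWays + 1
                    uint64_t instructions, uint32_t sets)
{
    const uint32_t maxWays = static_cast<uint32_t>(depthHist.size()) - 1;
    for (uint32_t ways = 1; ways <= maxWays; ways *= 2) {
        // Misses of a 'ways'-way cache = references whose LRU depth >= ways.
        uint64_t misses = std::accumulate(depthHist.begin() + ways,
                                          depthHist.end(), uint64_t(0));
        double sizeMB = double(sets) * ways * 64 / (1024.0 * 1024.0);
        double mpki   = 1000.0 * misses / instructions;
        std::printf("%9.3f MB  %8.2f MPKI\n", sizeMB, mpki);
    }
}
```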


[Figure 2 plot data omitted: per-workload instruction-stream and data-stream MPKI versus cache size curves for the SPEC CPU2000 workloads, panels (a) ammp.ref through (u) sixtrack.ref.]

Figure 2: SPEC CPU2000 Cache Sensitivity: The figure shows the cache performance of each workload as a function of cache size. The y-axis presents workload MPKI and the x-axis presents the cache size in megabytes (MB). All workloads were run to completion using the reference input sets.


[Figure 3 and Figure 4 plot data omitted: remaining SPEC CPU2000 cache sensitivity panels ((v) swim.ref through (z) wupwise.ref) and the per-workload CPI values for SPEC CPU2000 and SPEC CPU2006.]

Table 1: Memory Bound Workloads

CPI | SPEC CPU2000 | SPEC CPU2006
>8  | art, mcf | mcf
4-8 | equake, swim, vpr.route | GemsFDTD, astar.BigLakes2048, cactusADM, lbm, milc, omnetpp, soplex
2-4 | ammp, applu, apsi, fma3d, lucas, mgrid, twolf | astar.rivers, bwaves, bzip2.liberty, gcc, gobmk, hmmer.nph3, leslie3d, libquantum, sphinx3, xalancbmk, zeusmp


Figure 3: SPEC CPU2000 Cache Sensitivity (continued): The figure shows the cache performance of each workload as a function of cache size. The y-axis presents workload MPKI and the x-axis presents the cache size in megabytes (MB). All workloads were run to completion using the reference input sets.

Figure 4: Performance of SPEC CPU2000 and SPEC CPU2006 Workloads: The figure shows the CPI values for the two suites gathered using performance counters on a 2.8GHz Xeon system. Table 1 presents the memory-bound workloads separated into three categories.

[Figure 5 plot data omitted: SPEC CPU2006 instruction-stream and data-stream MPKI versus cache size panels, (a) GemsFDTD.ref through (q) mcf.ref.]

Figure 5: SPEC CPU2006 Cache Sensitivity: The figure shows the cache performance of each workload as a function of cache size. The y-axis presents workload MPKI and the x-axis presents the cache size in megabytes (MB). All workloads were run to completion using the reference input sets.



[Figure 6 plot data omitted: remaining SPEC CPU2006 cache sensitivity panels, (r) milc.su3imp through (δ) zeusmp.zeusmp.]

Figure 6: SPEC CPU2006 Cache Sensitivity (continued): The figure shows the cache performance of each workload as a function of cache size. The y-axis presents workload MPKI and the x-axis presents the cache size in megabytes (MB). All workloads were run to completion using the reference input sets.

[Figure 7 plot data omitted: average MPKI versus cache size curves for ISTREAM SPEC CPU2000 (all), DSTREAM SPEC CPU2000 (all), DSTREAM SPEC CPU2000 (no art, mcf), ISTREAM SPEC CPU2006 (all), DSTREAM SPEC CPU2006 (all), and DSTREAM SPEC CPU2006 (no mcf).]

Figure 7: Average SPEC CPU2000 and SPEC CPU2006 Cache Sensitivity: The figure presents the average cache behavior of the SPEC CPU2000 and SPEC CPU2006 suites. Since SPEC CPU2000 is heavily biased by art and mcf, we also present the average behavior of the workloads excluding art and mcf. Similarly, for SPEC CPU2006 we present the average behavior excluding mcf.



However, for workloads such as applu (Figure 2b), gap (Figure 2l), and others where misses occur even in a 128MB cache, one can conclude that the total data memory footprint of these workloads is greater than or equal to 128MB. From the figures, 20% of the 48 SPEC CPU2000 workloads have a total memory footprint greater than 128MB, while more than 75% of the 56 SPEC CPU2006 workloads have a total memory footprint greater than 128MB.
The instruction cache sensitivity study reveals that a 64KB instruction cache suffices for most workloads in the two suites. However, there are some outlier workloads, such as crafty, gap, gcc, eon, perl and vortex from SPEC CPU2000 and gcc, gobmk, omnetpp, perl, and xalancbmk from SPEC CPU2006, which require a 128KB instruction cache to perform as well as the other workloads in the suite. These workloads have code footprints that are greater than 1MB. This is to be expected from workloads such as compilers (gcc), interpreters (perl), XML parsers (xalancbmk) and artificial intelligence programs (gobmk and sjeng), which spend their execution time over a wide range of functions.
The data cache sensitivity study reveals that most of the workloads exhibit cache-friendly behavior: incremental increases in cache size yield incremental improvements in cache performance. Such behavior is observed when the cache miss rate decreases smoothly and roughly exponentially with increasing cache size (for example, Figure 2e). However, both suites also comprise workloads where incrementally increasing the cache size provides no improvement in cache performance until a particular cache size is reached, at which point a sudden and significant improvement in cache performance is observed. Such behavior is evident as sudden drops (cliff-shaped curves) in the cache miss rate as a function of cache size. Some examples are art (Figure 2d) and equake (Figure 2h) from SPEC CPU2000 and libquantum (Figure 5p) and sphinx (Figure 6z) from SPEC CPU2006. This behavior implies that the respective workloads have a working set roughly equal to the cache size at which the significant drop in cache miss rate occurs. For example, libquantum from the SPEC CPU2006 suite has its best cache performance with a 32MB cache. Since the total number of misses drops to zero with a 32MB cache, this implies that the workload re-references a 32MB working set in a circular fashion. Some workloads, like gcc in both SPEC CPU2000 and SPEC CPU2006, exhibit multiple drops in the cache miss rate function, implying that they have multiple working set sizes.
Figure 7 presents the average cache behavior of both the instruction and data streams. Since the average behavior of SPEC CPU2000 is heavily biased by art and mcf, we also present the average of all workloads excluding art and mcf. Similarly, for SPEC CPU2006 we also present the average of all workloads excluding mcf. The figure shows that both SPEC CPU2000 and SPEC CPU2006 have similar instruction cache requirements, with 64KB being optimal. On the other hand, the knee of the data stream cache miss rate curve for SPEC CPU2000 occurs at a cache size of 1-2MB, while that of SPEC CPU2006 occurs at a cache size of 16-32MB, a factor of 16 higher. The larger cache requirements of these workloads continue to put pressure on improving the performance of the memory subsystem.
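For workloads with the cliff-shaped curves discussed above, the dominant working-set size can be estimated mechanically from a sensitivity sweep. The helper below, whose name and heuristic are assumptions for illustration only, reports the cache size at which the largest relative MPKI drop occurs.

```cpp
// Estimate the dominant working-set size from (cache size MB, MPKI) points.
#include <cstddef>
#include <utility>
#include <vector>

double EstimateWorkingSetMB(const std::vector<std::pair<double, double>>& curve)
{
    double bestSize = 0.0, bestDrop = 0.0;
    for (std::size_t i = 1; i < curve.size(); i++) {
        double prev = curve[i - 1].second, cur = curve[i].second;
        if (prev <= 0.0) continue;
        double drop = (prev - cur) / prev;   // relative MPKI reduction at this step
        if (drop > bestDrop) { bestDrop = drop; bestSize = curve[i].first; }
    }
    return bestSize;                         // size (MB) at the sharpest drop
}
```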

3.4.2. Sensitivity to Different Input Sets

For the SPEC CPU2006 workloads, the benchmarks with multiple input sets are: astar (2 input sets), bzip2 (6 input sets), gamess (3 input sets), gcc (9 input sets), gobmk (5 input sets), h264ref (3 input sets), hmmer (2 input sets), perlbench (3 input sets), and soplex (2 input sets). We compare the different input sets using MPKI as the metric of comparison; input sets with higher MPKI are considered more memory intensive. For the astar benchmark, the BigLakes2048 input is the more data-memory intensive of the two. For the bzip2 benchmark, the different input sets have very similar instruction and data cache requirements, with the liberty input set being more data-memory intensive than the others. For the gamess benchmark, all input sets have very similar working sets and fit well into a 512KB cache (the triazolium input set is slightly more data-memory intensive for a 256KB cache). For the gcc benchmark, the 166 and g23 input sets are more data-memory intensive than the others. For the gobmk and h264ref benchmarks, all input sets have very similar working set sizes. For the hmmer benchmark, the retro input set is the more data-memory intensive of the two. For the perlbench benchmark, the splitmail input set is the most data-memory intensive. Finally, for the soplex benchmark, the pds-50 input set is more data-memory intensive when the cache size is less than 4MB, while the ref input set is more data-memory intensive when the cache size is greater than 4MB.

3.4.3. Performance of a Three-Level Cache Hierarchy

Figure 8 presents the cache performance of the workloads for a three-level cache hierarchy: 32KB separate instruction and data caches, a 256KB unified L2 cache, and a 2MB unified L3 cache. For each cache we present four metrics: accesses per 1000 instructions, misses per 1000 instructions, miss rate, and evictions unused. The evictions unused metric measures the number of cache lines evicted that were never re-referenced after they were inserted into the cache. There are three main reasons for unused cache evictions. First, the referenced data has no temporal locality; this behavior is characteristic of references to streaming data. Second, the data has short temporal locality that is handled by the smaller caches; in such scenarios, the filtering effect of the small caches causes the data to never be re-referenced in successive levels of the cache hierarchy. Finally, unused evictions also occur when the referenced data has a working set that is larger than the available cache size. As a result, cache lines get evicted due to capacity misses before they ever get a chance to be reused.
The cache performance numbers presented in Figure 8 lead to conclusions about instruction and data cache performance similar to those discussed in Section 3.4.1. The key observation in these figures is the unused evictions metric. The figures show that on average, roughly 15%, 20%, 80%, and 75% of the lines evicted from the IL1, DL1, UL2, and UL3 caches, respectively, are never re-referenced after insertion into the cache. This represents a tremendous waste of cache resources, especially in the large caches. Every line that is not re-referenced before eviction is wasted cache space that could potentially have been used by a more useful cache line. For applications with frequent cache misses, this suggests an opportunity to improve cache performance by utilizing cache resources more effectively.
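A minimal sketch of the bookkeeping behind the evictions-unused metric is shown below, assuming a set-associative cache in which every line carries a reuse flag that is cleared on fill and set on any later hit; an eviction with the flag still clear counts as unused. The class, its simple round-robin victim selection (true LRU is elided for brevity), and the 64B line size are illustrative assumptions rather than the simulator's actual code.

```cpp
// Tracking "evictions unused": lines evicted without ever being re-referenced.
#include <cstdint>
#include <vector>

struct Line { uint64_t tag; bool valid; bool reused; };

class SetAssocCache {
public:
    SetAssocCache(uint32_t sets, uint32_t ways)
        : sets_(sets), ways_(ways), lines_(sets * ways) {}

    // Returns true on a hit. Victim selection is simple round-robin here.
    bool Access(uint64_t addr) {
        uint64_t line = addr >> 6;                  // 64B line number
        Line* set = &lines_[(line % sets_) * ways_];
        for (uint32_t w = 0; w < ways_; w++) {
            if (set[w].valid && set[w].tag == line) {
                set[w].reused = true;               // re-referenced after fill
                return true;
            }
        }
        Line& victim = set[nextVictim_++ % ways_];
        if (victim.valid) {
            evictions_++;
            if (!victim.reused) evictedUnused_++;   // never touched since fill
        }
        victim = Line{line, true, false};           // fill; reuse flag cleared
        return false;
    }

    uint64_t evictions_ = 0, evictedUnused_ = 0;

private:
    uint32_t sets_, ways_, nextVictim_ = 0;
    std::vector<Line> lines_;
};
```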


[Figure 8 data omitted: per-workload accesses per 1000 instructions, misses per 1000 instructions, miss rate, and percent evictions unused for the 32KB L1 instruction cache, 32KB L1 data cache, 256KB unified L2 cache, and 2MB unified L3 cache, for both SPEC CPU2000 and SPEC CPU2006.]

Figure 8: Three Level Cache Hierarchy Performance: The figure shows the cache performance of the different SPEC CPU2000 and SPEC CPU2006 workloads for a three-level non-inclusive cache hierarchy.
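The hierarchy evaluated in Figure 8 is non-inclusive. As a rough illustration of what that means operationally (this is a sketch, not the simulator used for the paper, and each level is modeled as direct-mapped although the evaluated caches are set-associative), the C++ fragment below probes L1, L2, and L3 in turn, fills the line into every level that missed, and never invalidates an inner level when an outer level replaces a line.

#include <cstdint>
#include <vector>

// One cache level, direct-mapped for brevity; tags start out invalid.
struct Level {
    std::vector<uint64_t> tags;
    uint64_t hits = 0, misses = 0;
    explicit Level(uint64_t sets) : tags(sets, ~0ULL) {}
    bool Lookup(uint64_t line) {
        uint64_t set = line % tags.size();
        if (tags[set] == line) { ++hits; return true; }
        ++misses;
        return false;
    }
    void Fill(uint64_t line) { tags[line % tags.size()] = line; }
};

// Non-inclusive access path: probe L1 -> L2 -> L3 and fill each level that
// missed. No back-invalidation is performed when L2 or L3 replaces a line,
// so L1 may keep lines that the outer levels have already dropped.
void Access(Level &l1, Level &l2, Level &l3, uint64_t addr) {
    uint64_t line = addr >> 6;                    // 64B cache lines
    if (l1.Lookup(line)) return;
    if (!l2.Lookup(line)) {
        if (!l3.Lookup(line)) l3.Fill(line);      // miss all the way to memory
        l2.Fill(line);
    }
    l1.Fill(line);
}

With 512, 4,096, and 32,768 sets of 64B lines, the three levels correspond to the 32KB, 256KB, and 2MB capacities in the figure; the per-level miss rate is simply misses / (hits + misses).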


[Figure 9 plot data omitted. The four panels plot the L3 cache miss rate of bzip2.liberty, mcf.ref, tonto.tonto, and gcc.c-typeck against instruction count (billions), showing both the cumulative miss rate and the miss rate of each 10 million instruction phase.]

Figure 9: Full Run L3 Cache Behavior of Applications: The figure shows the full-run L3 cache miss-rate behavior of workloads from SPEC CPU2006. The solid line presents cumulative behavior while the dots represent application behavior at a 10 million instruction granularity.

available cache size. As a result, cache lines are evicted due to capacity misses before they ever get a chance to be reused. The cache performance numbers presented in Figure 8 lead to conclusions similar to those for the instruction and data cache performance discussed in Section 3.4.1. The key observation in these figures is the unused-evictions metric: on average, roughly 15%, 20%, 80%, and 75% of the lines evicted from the IL1, DL1, UL2, and UL3 caches are never re-referenced after insertion into the cache. This represents a tremendous waste of cache resources, especially in the large caches, since every line that is not re-referenced before eviction occupies space that could have been used by a more useful cache line. For applications with frequent cache misses, this suggests an opportunity to improve cache performance by utilizing the cache resources more effectively.
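A minimal sketch of how the unused-evictions metric can be gathered follows; it is an illustration under assumed structure, not the instrumentation tool used for this paper. Every cache line carries a reused flag that is set on any hit after insertion, and a victim whose flag is still clear at eviction time is counted as an unused eviction.

#include <cstdint>
#include <vector>

struct Line { uint64_t tag = 0; bool valid = false; bool reused = false; };

// One LRU set; position 0 is the MRU way and the last position is the victim.
class LruSet {
public:
    explicit LruSet(unsigned ways) : lines_(ways) {}
    // Returns true on a hit; on a miss, fills the line and updates the
    // eviction counters for the replaced victim.
    bool Access(uint64_t tag, uint64_t &evictions, uint64_t &unusedEvictions) {
        for (unsigned i = 0; i < lines_.size(); ++i) {
            if (lines_[i].valid && lines_[i].tag == tag) {
                lines_[i].reused = true;            // re-referenced after insertion
                MoveToFront(i);
                return true;
            }
        }
        Line &victim = lines_.back();               // LRU way
        if (victim.valid) {
            ++evictions;
            if (!victim.reused) ++unusedEvictions;  // never touched after its fill
        }
        victim.tag = tag;
        victim.valid = true;
        victim.reused = false;
        MoveToFront(lines_.size() - 1);
        return false;
    }
private:
    void MoveToFront(unsigned i) {
        Line l = lines_[i];
        lines_.erase(lines_.begin() + i);
        lines_.insert(lines_.begin(), l);
    }
    std::vector<Line> lines_;
};

At the end of a run, the percentage of unused evictions for a cache level is 100 * unusedEvictions / evictions.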

3.5. Full Run Behavior of Applications
Figure 9 presents the time-varying L3 cache performance of selected benchmarks from the SPEC CPU2006 suite. The x-axis presents the total number of instructions executed (in billions) and the y-axis presents the L3 cache miss rate. For each benchmark we present the cumulative L3 cache miss rate of the application (represented by the solid line) and the instantaneous behavior of the application over 10 million instruction intervals (represented by the dots). The figure shows that applications have several distinct phases of execution. For example, bzip2 exhibits a step function, with the L3 cache miss rate increasing over time; mcf has several phases of execution where none, some, or all cache references result in misses; tonto exhibits loop behavior, with the cache performance varying significantly within each loop; finally, gcc shows L3 cache performance that varies significantly over its full execution run. The phase behavior of these applications shows that it is extremely important that representative regions chosen for detailed simulation accurately reflect actual program behavior. For example, choosing a 100 million instruction interval starting at 525 billion instructions for tonto would be misleading for performance studies, since the L3 cache performance in this phase inaccurately reflects the overall L3 cache performance of the application. Choosing representative regions of long-running programs for detailed execution becomes a challenging task as newer applications continue to emerge. Since IDS allows for quick full-run exploratory studies of applications, it serves as an excellent methodology not only to identify representative regions of a program but also to validate existing region selection algorithms [12].
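As a concrete, hypothetical example of the kind of Pin tool behind Figure 9, the sketch below instruments every instruction and memory operand, feeds references to a deliberately simplified stand-in for the L3 model (a direct-mapped tag array; the hierarchy evaluated in this paper is three-level and set-associative), and prints both the per-10-million-instruction and the cumulative miss rate as the program runs to completion. The Pin calls are the standard instrumentation API; everything else (the L3Miss stand-in, the reporting format) is assumed for illustration.

#include "pin.H"
#include <cstdio>

static UINT64 icount = 0, accesses = 0, misses = 0;
static UINT64 winAccesses = 0, winMisses = 0;
static const UINT64 WINDOW = 10000000;               // 10M-instruction phase

// Stand-in L3 model: 2MB of 64B lines, direct-mapped (assumption, not the
// paper's hierarchy). Returns true on a miss and installs the line.
static ADDRINT l3tag[32768];
static bool L3Miss(ADDRINT addr) {
    ADDRINT line = addr >> 6;
    UINT32 set = line & 32767;
    if (l3tag[set] == line) return false;
    l3tag[set] = line;
    return true;
}

static VOID CountIns() {
    if (++icount % WINDOW != 0) return;
    double phase = winAccesses ? 100.0 * winMisses / winAccesses : 0.0;
    double cum   = accesses    ? 100.0 * misses    / accesses    : 0.0;
    printf("%llu M instrs: phase miss rate %.1f%%, cumulative %.1f%%\n",
           (unsigned long long)(icount / 1000000), phase, cum);
    winAccesses = winMisses = 0;
}

static VOID MemRef(ADDRINT addr) {
    ++accesses; ++winAccesses;
    if (L3Miss(addr)) { ++misses; ++winMisses; }
}

static VOID Instruction(INS ins, VOID *v) {
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountIns, IARG_END);
    if (INS_IsMemoryRead(ins))
        INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)MemRef,
                                 IARG_MEMORYREAD_EA, IARG_END);
    if (INS_IsMemoryWrite(ins))
        INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)MemRef,
                                 IARG_MEMORYWRITE_EA, IARG_END);
}

int main(int argc, char *argv[]) {
    if (PIN_Init(argc, argv)) return 1;              // parse Pin arguments
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_StartProgram();                              // never returns
    return 0;
}

The same MemRef() hook could instead drive a full multi-level cache model such as CMP$im [5]; the 10-million-instruction window matches the phase granularity plotted in Figure 9.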

4. Summary
There is a growing need for simulation methodologies that help us understand the memory system requirements of emerging workloads in a reasonable amount of time. This paper presents binary instrumentation-driven simulation (IDS) as an alternative to conventional execution-driven and trace-driven simulation methodologies. Since instrumentation-driven simulation is fast, robust, and simple, users can characterize applications at MIPS speed. As a result, rather than studying only small regions of an application, users can now study applications run to completion. We illustrate the benefits of IDS by using Pin to characterize the SPEC CPU2000 and SPEC CPU2006 benchmark suites. Workloads in SPEC CPU2006 have an instruction count that is an order of magnitude larger than that of the SPEC CPU2000 workloads: the average instruction count of SPEC CPU2000 is 131 billion, while that of SPEC CPU2006 is 1.3 trillion. The large instruction counts of SPEC CPU2006 present interesting challenges in choosing representative regions for detailed simulation. A full memory characterization study of both SPEC CPU2000 and SPEC CPU2006 reveals that SPEC CPU2006 workloads have larger memory working-set sizes, with most memory-intensive workloads requiring more than 4MB of cache. The larger cache requirements of these workloads continue to put pressure on improving the performance of the memory subsystem.



5. References

[1] SPEC CPU2006: http://www.spec.org/cpu2006/Docs/readme1st.html
[2] P. Bungale and C. K. Luk. "PinOS: A Programmable Framework for Whole-System Dynamic Instrumentation." In VEE, 2007.
[3] J. Edler and M. D. Hill. "Dinero IV Trace-Driven Uniprocessor Cache Simulator."
[4] J. L. Henning. "SPEC CPU2006 Benchmark Descriptions." In ACM SIGARCH Computer Architecture News, Volume 34, No. 4, September 2006.
[5] A. Jaleel, R. S. Cohn, C. K. Luk, and B. L. Jacob. "CMP$im: Using PIN to Characterize Memory Behavior of Emerging Workloads on CMPs." Technical Report UMD-SCA-2006-01, 2006.
[6] K. Lawton. Bochs. http://bochs.sourceforge.net.
[7] C. K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. "Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation." In Proceedings of Programming Language Design and Implementation (PLDI), Chicago, IL, 2005.
[8] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, and G. Hallberg. "Simics: A Full System Simulation Platform." IEEE Computer, 35(2):50-58, February 2002.
[9] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. "Evaluation Techniques for Storage Hierarchies." IBM Systems Journal, 9(2), 1970.
[10] S. Narayanasamy, C. Pereira, H. Patil, R. Cohn, and B. Calder. "Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation." In SIGMETRICS, Saint Malo, France, 2006.
[11] H. Pan, K. Asanovic, R. Cohn, and C. K. Luk. "Controlling Program Execution through Binary Instrumentation." In Workshop on Binary Instrumentation and Applications (WBIA), St. Louis, MO, September 2005.
[12] E. Perelman, G. Hamerly, M. V. Biesbrouck, T. Sherwood, and B. Calder. "Using SimPoint for Accurate and Efficient Simulation." In SIGMETRICS, San Diego, CA, 2003.
[13] S. Sair and M. Charney. "Memory Behavior of the SPEC CPU2000 Benchmark Suite." IBM Thomas J. Watson Research Center Technical Report RC-21852, October 2000.
[14] A. Srivastava and A. Eustace. "ATOM: A System for Building Customized Program Analysis Tools." In Programming Language Design and Implementation (PLDI), pp. 196-205, 1994.
[15] R. A. Uhlig and T. N. Mudge. "Trace-Driven Memory Simulation: A Survey." ACM Computing Surveys, Vol. 29, 1997.

