This paper is the author's private version of the paper published as: Proc. 39th Annual Simulation Symp., IEEE Computer Society, April 2006. Copyright (C) 2006 IEEE.

Design and Implementation of a Workload Specific Simulator

Takashi Nakada    Tomoaki Tsumura    Hiroshi Nakashima
Toyohashi University of Technology
{nakada, tsumura, nakasima}@para.tutics.tut.ac.jp

Abstract


This paper proposes a simple but efficient technique for instruction set simulators. Our simulator is made workload specific by a simple process that generates a set of C functions from a workload binary. It is as portable and retargetable as ordinary instruction emulators because the translation targets C code and works well with well-abstracted instruction definitions. The translation is also easy to implement, requiring neither complicated analysis nor profiling. We also propose a set of simple optimization techniques for cache simulation which cooperate with the workload specific technique. A SimpleScalar-based implementation of these techniques results in a significantly large performance improvement. Our evaluations with SPEC CPU95 exhibit maximum speedups over sim-fast, sim-cache and sim-outorder of 38-fold, 14-fold and 9.7-fold respectively, while the averages are 19-fold, 8.3-fold and 3.8-fold.

1 Introduction

Steady and rapid progress of VLSI technology allows us to integrate a highly complicated system with a processor (or sometimes multiple processors) on a chip. To design and develop such a system-on-chip quickly, various types of processor simulators are indispensable for both hardware and software system architects to verify and/or evaluate the functionality and performance of their system. The fundamental processor simulator is the instruction set simulator (ISS), by which we may observe the ISA-level behavior of the target processor and, optionally, of other hardware components if we have a well-abstracted behavioral model of them. This type of simulator is also useful for software people, especially when the target system has hardware components under development and/or lacks a comfortable programming/debugging environment.

ISS implementations are roughly categorized into two types from the viewpoint of their execution mechanisms: instruction emulators and binary translators. An instruction emulator simply simulates the fetch-decode-execute cycle in software. This simplicity gives emulators many good features such as portability, retargetability and developability. Their one important drawback is inefficiency: their slowdown (SD) is usually in the range of 50 to 100, or sometimes worse. The performance problem does not arise in binary translators, because they execute native code of the simulator host obtained from workload binaries by static or dynamic translation. However, although their slowdown is quite small, approximately 10 or less, implementing them is quite hard work. For example, you must be acquainted with both the target and host ISAs. Moreover, in order to link the native code with the other modules of your simulator, such as that for statistics, you need knowledge of the compiler/linker conventions of every host and software environment on which the simulator will be implemented.

Adding a hardware component to be simulated makes the story more complicated. For example, cache simulation is inevitable for obtaining a good performance approximation of modern processors, but its efficient implementation is not easy for instruction emulators or binary translators. Attaching a cache to an emulator-type simulator is easy, but its performance simply worsens, resulting in 300 to 1000 SD in typical cases such as sim-cache of SimpleScalar[2]. A binary-translated simulator may have inline cache simulation code in its native code sequence, but this technique brings not only further complication but also a performance problem due to the large code footprint causing cache misses on the simulator host.

More detailed simulators, such as those with out-of-order instruction scheduling capability, have a different story. Since most of their work is spent on the complicated instruction scheduling, the performance of their modules for ISS and cache simulation was not a serious issue. For example, the well-known out-of-order simulator sim-outorder of SimpleScalar has a large SD of 1000 to 3000. Therefore, its ISS module based on sim-fast, with 100-300 SD, and the slow sim-cache, both built into sim-outorder, were not considered important targets of performance improvement.

However, as research on improving instruction scheduling has progressed, the performance of these modules has become a serious bottleneck. For example, even if we devise a technique to accelerate the instruction scheduling 10-fold, the whole simulator of 1000 SD will not improve much, resulting in 370 SD if its cache simulator still has 300 SD. Moreover, even if we had a magic trick to speed up the cache simulator 10-fold as well, the instruction emulator of 100 SD would make our effort insignificant, resulting in 200 SD total performance. These observations lead us to the necessity of techniques for ISS and cache simulation that achieve performance high enough to be comparable with binary translators, while remaining as portable, retargetable and developable as emulators.

Our proposal in this paper for this almost contradictory requirement is the workload specific simulator (WSS). The key idea is to translate a workload binary into C source code rather than into the native code of the simulator host. Since the translated C code is compiled and linked with the other simulator modules in the usual way, the system is as portable as other C programs, including instruction emulators. The translation mechanism is easily retargeted because it is a simple one-to-one conversion from a target machine instruction to a C macro that simulates it. Moreover, the translator is easily implemented on top of existing emulator-type simulators because its workload binary analyzer is quite simple. As for cache simulation, we found that optimization techniques to check cache hits quickly are effective when combined with WSS. Finally, we confirmed that the performance of our out-of-order simulator BurstScalar[10] was greatly improved by replacing its sim-fast and sim-cache based modules with the counterparts we propose in this paper.

The rest of this paper is organized as follows. First we overview related work in Section 2 to clarify the relationship and differences between previous research and our own. Then, in the following two sections, we discuss our WSS and optimized cache simulation. After showing a few important details of our implementation with SimpleScalar and BurstScalar in Section 5, performance numbers are presented in Section 6, comparing them with those of the original versions. Finally, we conclude in Section 7.

2 Related Work

As briefly discussed in the introductory section, instruction set simulators (ISS) fall into two categories: instruction emulators and binary translators. Although the former are significantly slower than the latter, they have many more users, mainly because of their portability. For example, the most popular architecture simulator, SimpleScalar[2], adopts the emulation method for its ISS engine sim-fast. Another proof of the popularity of emulators is found in the fact that there are a number of commercial and free simulators for x86[15], ARM[1], MIPS[11] and other RISCs, and even for virtual machines such as DLX[6].

Emulators have another advantage in that they are easily retargeted to various ISAs. For example, SimpleScalar provides a retargetability mechanism through its ISA definition, in which the behavior of each instruction is specified in the form of a C macro. Since it also defines a standard set of C macros for instruction field extraction, defining the behavior of an instruction is fairly simple, such as:

  #define ADDI_IMPL {SET_GPR(RT,GPR(RS)+IMM);}

where RT, RS and IMM are macros extracting the fields that specify the destination and source general purpose registers (GPR) and the immediate value of the ADDI instruction.

Binary translators, on the other hand, have a great advantage over emulators in their simulation speed. Well-tuned translators such as Shade[4] and Embra[18] exhibit small slowdowns of about 10 through their dynamic translation mechanisms. An obvious drawback of this type of simulator is poor portability and retargetability. If you have a binary translation type simulator for the MIPS architecture working on x86 PCs and need to develop another one for SPARC on IA-64 servers, for example, it will be a nightmare even if you are well acquainted with all four machines. Moreover, the knowledge you need covers not only their ISAs but also software issues such as the compiler conventions of each host machine, because you need to call functions written in C, C++ or whatever you chose for the other simulator modules from the native code translated from a workload binary.

Our approach of translating a workload binary to C source code may be considered a variation of binary translation aimed at a portable and retargetable implementation. Since this approach is natural, it has been employed in many projects including SuperSim[19] and SyntSim[3]. The essential difference between these simulators and ours is that they are much more sophisticated, because they aim to squeeze out the last drop of performance gain. For example, SuperSim generates a huge C function containing a huge switch block. Each case branch corresponds to a code block translated from each instruction in the workload binary. The code block for an instruction also has a goto label if it is the target of a direct branch. Thus the code for a branch or jump instruction jumps directly to its target goto label if the target address is known at translation time. Otherwise, i.e. if the branch is an indirect one, it breaks out of the switch block, setting the PC variable to the target address, which is examined by the switch statement again to reach the code for the target instruction. This inline simulation code expansion minimizes the overhead incurred in the execution control of the translated code. However, this inlining has significant defects.
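For concreteness, a minimal compilable sketch of the translated-code shape just described (the addresses, registers and instruction sequence are illustrative, not SuperSim's actual output):

  /* One big function, one switch; every instruction gets a case label
     (a potential control flow entry), direct branches become gotos,
     and indirect branches re-enter the switch with a new PC. */
  static int regs[8];

  static void run(unsigned int PC)
  {
      for (;;) {
          switch (PC) {
          case 0x100:                      /* addi r1,r0,3 */
              regs[1] = regs[0] + 3;
              /* fall through: sequential execution */
          case 0x104: L_0x104:             /* addi r2,r2,1 */
              regs[2] = regs[2] + 1;
          case 0x108:                      /* bne r2,r1,0x104: direct
                                              target, jump to its label */
              if (regs[2] != regs[1]) goto L_0x104;
          case 0x10C:                      /* jr r6: indirect target,
                                              examined by the switch again */
              PC = regs[6];
              continue;
          default:
              return;                      /* target outside translated code */
          }
      }
  }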



First, attaching a case label to each code block of an instruction spoils the chance of compiler optimization spanning block boundaries, because the compiler must assume each code block is a potential entry of control flow (see footnote 1). Second, the huge switch block and huge function make the compiler's work hard, resulting in long compilation times. Finally, besides these compiler-unfriendly characteristics, this kind of optimization severely degrades the flexibility of the execution control mechanism. For example, it is quite hard to stop the simulation at a specific instruction address (i.e. a breakpoint) or after the execution of a specific number of instructions (i.e. fast-forwarding).

SyntSim has a more sophisticated mechanism to achieve a certain level of compiler-friendliness. It carefully analyzes a workload binary, exploiting the hint fields of the jump instructions of Alpha, which SyntSim targets in [3], in order to attach case labels only to the code blocks of instructions assessed as indirect branch targets. If an indirect branch is executed targeting an instruction omitted from the target candidates, the switch statement transfers control to the default label, where an emulator is called to handle the unexpected indirect branch target and the following instructions. Moreover, SyntSim has a profile-based analysis that attaches case labels only to the real targets observed in the profile. The profile is also exploited to reduce the translated code size, and thus the compilation time, by inhibiting code translation of instruction sequences executed infrequently.

Our claim is that, although we agree that such sophisticated optimizations should gain a certain performance improvement, such sophistication is not necessary to achieve sufficiently high performance, i.e. a slowdown of about 10, as exhibited in Section 6. As we will discuss in detail in the next section, our system is quite simple, compiler-friendly and flexible: our workload binary analyzer is quite simple because it merely finds apparent basic blocks; the translated code is also simple and compiler-friendly because each basic block is translated into a C function; and the execution mechanism is flexible enough because we can freely switch from the fast execution of translated code to the flexible execution by the emulator and vice versa.

1 The SuperSim translator has an option flag to indicate that there are no indirect branches to unknown targets in the workload binary, in which case case labels can be removed from the code blocks of instructions unless they are known targets (e.g. the return point of a procedure call).

A natural implementation of cache simulation is to combine a cache module with an ISS, as we do. Many other simulators, including sim-cache of SimpleScalar and SimICS[8], apply this execution-driven method, resulting in 300 or larger SD for the former, while the latter obtains better performance of 200 SD or less through carefully designed memory system simulation. Another implementation method is trace-driven[16], which first profiles a memory access trace by executing instrumented code and then simulates cache behavior using the profile[7]. A variation of trace-driven simulation that avoids huge trace files is on-the-fly profiling, in which a front-end executes the workload binary to produce a subset of the trace, which is transferred to a back-end for cache simulation. The front-end executes an instrumented binary[12] or emulates the original binary[9], optionally combined with a dynamic binary translation mechanism[17]. The performance of trace-driven simulators is usually better than that of execution-driven ones, especially when the front-end executes instrumented or translated binaries. This advantage, however, is offset by their poor portability and retargetability. Therefore the best choice should be performance improvement of the execution-driven method, as we propose in Section 5.

Finally, the performance of ISS becomes a significant factor in out-of-order cycle accurate simulators (CAS) as the performance of their obvious bottleneck, instruction scheduling simulation, is improved by various techniques. As stated in the introductory section, SimpleScalar's sim-outorder is so slow that the time consumed by its ISS engine based on sim-fast is negligible, 10 % or less. However, fast scheduling mechanisms, such as the computation reuse (or memoization) applied in FastSIM[13] and BurstScalar[10], changed the story. For example, BurstScalar's performance for 'swim' in SPEC CPU95 is four times that of sim-outorder, but more than 40 % of its execution time is spent in its ISS, sim-fast. FastSIM avoids this new bottleneck by employing binary translation, which makes it less portable and retargetable. Another CAS issue requiring a fast ISS is the fast-forwarding mechanism. As many architectural papers state, architects evaluate their proposed systems with slow CAS "skipping the first n × 10^6 instructions" by ISS. A more analytical approach such as [14] also requires fast-forwarding to bridge CAS execution of representative parts of a workload. Finally, DiST[5] parallelizes CAS to obtain an almost accurate result by performing fast-forwarding on each parallel node before it performs CAS for its responsible time segment.

3 Workload Specific Simulator

Our workload specific simulator (WSS) is built from a workload binary by the following four steps.

1. The workload binary is analyzed to extract the basic blocks in it.

2. For each basic block found in Step 1, each instruction in it is translated to a C source code sequence that simulates it. Then a C function for the basic block is composed from the code sequences for the instructions in the block.

3. The set of basic block functions is compiled by an ordinary C compiler.

4. The compiled object file is linked with the other simulator modules and executed.



  while(1){
    PC=nPC; nPC=PC+4;
    inst=fetch(PC);
    switch(OP(inst)) {
    case ADD:
      regs[RD(inst)]=regs[RS(inst)]+regs[RT(inst)];
      break;
    case ADDI:
      regs[RT(inst)]=regs[RS(inst)]+IMM(inst);
      break;
    case BNE:
      if (regs[RS(inst)]!=regs[RT(inst)])
        nPC=PC+4+(OFS(inst)<<2);
      break;
    ...
    }
  }

Figure 1. Kernel Loop of an Instruction Emulator

Translation  Each instruction of a basic block is translated into C code in which an instruction field is extracted by an expression of the form "(inst>>n)&(2^m − 1)" if the field's LSB is bit n and its width is m. The basic mechanism of the translation of an instruction is to emit this code with its field extractions partially evaluated, since the instruction word inst is a known constant at translation time. For example, consider the small basic block shown in Figure 2.

  0x100: add  r5,r3,r2
  0x104: add  r6,r5,r4
  0x108: addi r7,r6,$10
  0x10C: bne  r1,r7,0x124

Figure 2. A Small Basic Block

Its first add instruction is translated to:


regs[5]=regs[3]+regs[2];
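The generic extraction and its partial evaluation can be sketched as follows (the FIELD name and the bit positions are illustrative, not SimpleScalar's actual macros):

  /* Extract the m-bit field whose LSB is bit n of the instruction word. */
  #define FIELD(inst, n, m)  (((inst) >> (n)) & ((1u << (m)) - 1u))

  /* inst is a known constant at translation time, so an operand such as
     regs[FIELD(inst,11,5)] folds into the constant-indexed regs[5] above. */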


The whole block is then composed into the C function shown in Figure 3.

  BB_0x0000100(){
    caddr_t PC, nPC;
    PC=0x100; nPC=PC+4;
    regs[5]=regs[3]+regs[2];
    PC=0x104; nPC=PC+4;
    regs[6]=regs[5]+regs[4];
    PC=0x108; nPC=PC+4;
    regs[7]=regs[6]+10;
    PC=0x10C; nPC=PC+4;
    if (regs[1]!=regs[7]) nPC=PC+4+20;
    return(nPC);
  }


Figure 3. Generated Basic Block Function

Compilation  The set of basic block functions is compiled by an ordinary compiler such as gcc, or whatever is suitable for the simulator host environment. Each function is usually short and has code blocks for at most N instructions. More importantly, it does not have any complicated control flow such as a huge switch block or a goto network inside the block. Thus the functions are compiler-friendly enough that the compiler quickly completes its job with a sufficient level of optimization, eliminating redundant loads, stores and computations. For example, the basic block function shown in Figure 3 is optimized to code equivalent to the function shown in Figure 4, provided your compiler is as capable as gcc.

  BB_0x0000100(){
    caddr_t nPC;
    reg_t r5,r6,r7;
    r5=regs[5]=regs[3]+regs[2];
    r6=regs[6]=r5+regs[4];
    r7=regs[7]=r6+10;
    nPC=0x110;
    if (regs[1]!=r7) nPC=0x124;
    return(nPC);
  }

Figure 4. Optimized Basic Block Function

As shown in the figure, the values stored in regs[5..7] are not loaded again; instead, their copies in local variables, which usually become host registers, are used. Unnecessary assignments to PC and nPC are eliminated, and the essential nPC calculations for bne are replaced with constant assignments.


Execution  The compiled object file is linked with the simulator library modules. An important module is that of an ordinary emulator, which is invoked when we meet an exceptional case such as an indirect jump/branch to an unexpected target. The mechanism for this execution control, in which the translated code and the emulator cooperate, is fairly simple:

  while(!terminate_condition()) {
    if (BB[PC]!=NULL)
      PC=BB[PC]();
    else {
      nPC=PC+4;
      emulate();
    }
  }

In the kernel loop of WSS shown above, a table named BB is examined to look up the function pointer for the basic block starting from the address contained in PC. If it is found, i.e. the table entry has a non-null pointer, the basic block function is simply invoked to execute the block. Otherwise, control is transferred to the function emulate() for the emulation of the instruction at PC. Thus, when the simulator encounters an indirect jump/branch to a target which our simple analyzer could not find, emulate() is repeatedly invoked until a known basic block entry is reached.

This simple mechanism also works well for flexible execution control of our simulator. For example, when a breakpoint is set on an instruction, the first non-null entry of BB which precedes the instruction is temporarily nullified so that the basic block containing it is emulated with a breakpoint check. Since the other blocks are executed normally, the execution with breakpoints has almost no overhead.

Another example is accurate and efficient fast-forwarding. Since a basic block has at most N instructions, we may execute basic block functions until the number of executed instructions reaches M − N or more, where M is the instruction count to be fast-forwarded. Then, when the kernel loop detects that this watermark has been exceeded, the following instructions are unconditionally emulated with a check of the executed instruction counter against M. Note that the number of instructions executed in a basic block function is easily summed up because the workload analyzer can obviously report the size of each block. Alternatively, as we do in our implementation, each basic block function may maintain the instruction counter by incrementing it at every instruction in the block; this straightforward implementation is optimized by the compiler into a single addition of the block size to the counter.
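A minimal sketch of this fast-forwarding control, assuming the globals of the kernel loop above plus an instruction counter icount maintained by the basic block functions; we also assume emulate() advances PC and increments icount, and that M >= N:

  extern unsigned int PC, nPC;
  extern unsigned long icount;           /* executed instruction count */
  extern unsigned int (*BB[])(void);     /* basic block function table */
  extern void emulate(void);             /* emulates one instruction at PC */

  void fast_forward(unsigned long M, unsigned long N)
  {
      /* Fast path: whole blocks, each adding its size to icount,
         until at most N instructions remain before the watermark. */
      while (icount < M - N) {
          if (BB[PC] != NULL)
              PC = BB[PC]();
          else {
              nPC = PC + 4;
              emulate();
          }
      }
      /* Last few instructions: emulate one by one with an exact check. */
      while (icount < M) {
          nPC = PC + 4;
          emulate();
      }
  }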

4 Cache Simulation

A simple cache simulator for a cache of 2^L × 2^S × W bytes, where 2^L is the line size in bytes, 2^S is the number of sets and W is the number of ways, may be designed as shown in Figure 5. In the function named cache_access for an access to the address addr with a given width (A), the alignment of the access is checked as the first operation (B). Then the set number (C) and the tag part (D) are extracted from the address to check, for each of the W ways (E), whether the tag matches the way-th tag-array entry for the set (F). If matched, i.e. on a cache hit, a set of hit-case operations, such as updating the LRU status, is performed by the function hit() (G). Otherwise, i.e. on a cache miss, another set of operations, such as updating the tag array, is performed by miss() (H).

Starting from this basic design, our optimized cache simulator is designed conceptually by the following three improvement steps.

1. The cache parameters L, S and W may be fixed, if the user is interested in the behavior of a cache with a specific configuration, to simplify the operations for field extraction and the hit/miss check. This ready-made cache assumption may be partly relaxed without significant degradation of performance.

2. The hit/miss check is accelerated in the typical case in which the access targets the most recently used (MRU) line.

3. WSS is combined with the techniques above to eliminate redundant operations for an instruction cache access, whose address and size are predetermined.

  #define set_shift (L)
  #define set_mask  ((1<<S)-1)
  #define tag_shift (L+S)

  cache_access(addr,width)                 /* (A) */
  {
    check_alignment(addr,width);           /* (B) */
    set=(addr>>set_shift)&set_mask;        /* (C) */
    tag=addr>>tag_shift;                   /* (D) */
    for (way=0;way<W;way++)                /* (E) */
      if (tag==cache[set].tag[way]) {      /* (F) */
        hit(set,way);                      /* (G) */
        return;
      }
    miss(set,tag);                         /* (H) */
  }

Figure 5. Basic Cache Simulator

Fixing the parameters to a specific configuration such as 16 KB with 32-byte lines, i.e. L = 5 and S = 7, turns the field extractions into constant-operand operations, as shown in Figure 6.

  set=(addr>>5)&127;
  tag=addr>>12;
  for(way=0;way<W;way++)
    ...

Figure 6. Cache Simulator for a Fixed Configuration

The MRU optimization then checks the most recently used line first, as shown in Figure 7; in the MRU hit case hit() does nothing, so the check finishes immediately.

  set=(addr>>5)&127;
  tag=addr>>12;
  if (tag==cache[set].mru_tag) {
    /* hit() does nothing */
    return;
  }
  for (way=0;way<W;way++)
    ...

Figure 7. Cache Simulator with MRU Optimization

When WSS is combined with these techniques, the address and size of an instruction cache access are known at translation time, so the alignment check is eliminated, and the set number and the tag computations are replaced with the constants 0x1A = (0x12340>>5)&127 and 0x12 = 0x12340>>12 respectively, by partial evaluation with addr equal to 0x12340. Finally, the transformed cache_access() is inlined in the basic block function containing the instruction.2 These eliminations and replacements drastically reduce the number of operations required for MRU hit cases. Since the address of cache[0x1A].mru_tag is known at compile time, a compiler as well tuned as gcc-3.3.2 will generate only two instructions, cmp and jne, for x86 hosts.

2 This inline expansion may be suppressed for the statement and its successors by defining a function, say ordinary_cache_access(addr), if the resulting code is too large. This suppression may cause a small performance degradation because the tag check of the non-MRU hit case cannot be fully optimized. However, the performance may be better than with full expansion because of the smaller code footprint.
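Putting the pieces together, a compilable sketch of the fully specialized check that step 3 produces for the 0x12340 fetch (simulate_miss stands in for the non-MRU path and is our illustrative name):

  struct cache_set { unsigned int mru_tag; /* plus tag array, LRU, ... */ };
  extern struct cache_set cache[128];
  extern void simulate_miss(unsigned int set, unsigned int tag);

  /* Inlined into the basic block function: set 0x1A and tag 0x12 are
     compile-time constants, so an MRU hit costs one cmp and one jne. */
  static inline void icache_access_0x12340(void)
  {
      if (cache[0x1A].mru_tag == 0x12)
          return;                      /* MRU hit: hit() does nothing */
      simulate_miss(0x1A, 0x12);       /* non-MRU hit or miss path */
  }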

5 Implementation

We implemented a WSS version of ISS and an optimized cache simulator using SimpleScalar's sim-fast and sim-cache as their bases. We also attached them to our out-of-order simulator BurstScalar. The implementations heavily exploit SimpleScalar's ISA definition scheme, in which a decoded instruction is executed by a switch statement such as:

  switch(op) {
  case ADD:  ADD_IMPL;  break;
  case ADDI: ADDI_IMPL; break;
  case BNE:  BNE_IMPL;  break;
  ...
  }

where the case labels are decoded results and XXX_IMPL is the macro defining the behavior of the instruction XXX. For example, as stated in Section 2, the behavior of ADDI is defined as:

  #define ADDI_IMPL {SET_GPR(RT,GPR(RS)+IMM);}

using SimpleScalar’s predefined macros3 . This scheme is greatly useful for our basic block extractor and source code translator. For example, the basic block extractor simply gets an instruction from a workload binary, decodes it by the SimpleScalar’s macro, and then checks if it is a direct/indirect jump/branch by another macro for classification. Moreover, the target address of a direct jump/ branch, say BNE, is calculated by the execution of the macro BNE_IMPL with setting PC to its address, because the macro

Implementation

We implemented a WSS version of ISS and an optimized cache simulator using SimpleScalar’s sim-fast and sim-cache as their bases. We also attached them to our 2 This

inline expansion may be suppressed for the statement and its successors by defining a function, say ordinary_ cache_access(addr), if the resulting code is too large. This suppression may cause small performance degradation because tag check of non-MRU hit case cannot be fully optimized. However, the performance may be better than the full expansion because of smaller code footprint.

(E )

3 The real definition has another macro for optional integer overflow check but it is nullified in usual usage. Thus we omit it in the explanation for the sake of simplicity.

7

contains a function to give us the target address regardless of the branch condition!4 . The source code translator also utilizes this scheme in a little bit complicated manner. First we generate a header file converted from that for the definitions of XXX_IMPL to define our own macros XXX_IMPL_TEXT like the following.


#define ADDI_IMPL_TEXT \ "{SET_GPR(RT,GPR(RS)+IMM);}"
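One plausible way to mechanize this conversion with the C preprocessor itself, shown as a sketch only (the authors convert the definition file; this variant assumes the field macros such as SET_GPR, RT, RS and IMM are still undefined when the stringification happens):

  #define STR_(x)  #x
  #define XSTR(x)  STR_(x)         /* expands x first, then stringifies */

  #define ADDI_IMPL {SET_GPR(RT,GPR(RS)+IMM);}
  /* XSTR(ADDI_IMPL) now yields "{SET_GPR(RT,GPR(RS)+IMM);}", which can
     be emitted verbatim as the body of ADDI_IMPL_TEXT. */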

The translator gets an instruction, decodes it, and then executes the field extraction macros for it to emit the following:


printf("#define RS %d\n", RS); printf("#define RT %d\n", RT); ...


It also prints "#define PC 0x108" for the instruction at 0x108, to give the value of PC in case it is referenced by the instruction. Then, according to the decoded result, it performs the following to print the translation result.


  switch(op) {
  case ADD:  printf("%s\n", ADD_IMPL_TEXT);  break;
  case ADDI: printf("%s\n", ADDI_IMPL_TEXT); break;
  case BNE:  printf("%s\n", BNE_IMPL_TEXT);  break;
  ...
  }

Since it also prints the two macros PRE_INST() and POST_INST() before and after printing XXX_IMPL_TEXT, the generated source code for "addi r7,r6,$10" will be the following:5


  #define RS 6
  #define RT 7
  #define IMM 10
  #define PC 0x108
  PRE_INST();
  SET_GPR(7,GPR(6)+10);
  POST_INST();


Then the generated source file is compiled with a header file prepared for each simulation mode. For sim-fast, POST_INST() is null and PRE_INST() is expanded to the assignment of nPC explained in Section 3,6 so that the C preprocessor will generate the following for the example above:

  nPC=0x10C;
  regs[7]=regs[6]+10;

which is equivalent to the fifth line of Figure 3.7
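For illustration, one plausible shape of the sim-fast mode header (an assumption on our part; the actual header differs, as footnote 7 notes):

  /* sim-fast mode: no cache access, just the nPC bookkeeping.
     PC is the per-instruction #define emitted by the translator. */
  #define PRE_INST()   (nPC = PC + 4)
  #define POST_INST()  /* empty for sim-fast */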

On the other hand, the header file for sim-cache has its own definition of PRE_INST(), from which the code shown in Figure 7 is expanded for an instruction cache access. It also has a few codes for the data cache accesses of load/store instructions, and POST_INST() is defined to perform a post-process for them. These codes for the data cache, however, are eliminated by the compiler unless the instruction is a load/store.

BurstScalar  BurstScalar is a SimpleScalar-based high speed out-of-order cycle accurate simulator. Its most important feature is computation reuse applied to the simulation of instruction scheduling in frequently executed loop iterations. BurstScalar has the following four major components, to which our optimization techniques are applied individually.

• Pre-Executor performs ISS to capture iterations executed frequently enough to apply the reuse technique.

• Instruction Emulator performs ISS again to produce short instruction and address traces for out-of-order scheduling simulation.

• Instruction Scheduler simulates out-of-order instruction scheduling. When it fetches the first instruction of a reuse candidate iteration, it looks up its state in the state transition table. If the state is found, Reuse Engine takes its place. Otherwise, it registers the state in the table and attaches the records of the interactions with the modules for memory simulation and branch prediction.

• Reuse Engine checks whether the interactions with the memory simulator and branch predictor are consistent with those recorded in the transition table. If consistent, it continues the work for the next iteration, setting the scheduler state according to the transition table entry. Otherwise, Instruction Scheduler takes its place.

The first two components were sim-fast based ISS and thus are replaced with WSS. The last two share the memory simulator module in which the cache simulators are contained. Our optimization for cache simulation, however, is not fully applied to these cache simulators because they have interfaces and functions significantly different from those of the simple sim-cache modules. In fact, we just eliminate the alignment checks in them, which Pre-Executor performs in the first execution.

Peephole Optimization  In addition to the optimization techniques discussed above, we applied two SimpleScalar-specific peephole optimizations. One is to fix a strict-aliasing violation in a module of SimpleScalar, in which a memory object is referenced through both integer and floating point type pointers. Since this violation discourages modern C compilers such as gcc-3.x from full-power optimization, we fixed the problem by defining the object as a union.
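A hedged sketch of this fix (mem_word_t is our illustrative name, not SimpleScalar's): the same storage is accessed through union members instead of through casted pointers, which C's aliasing rules permit.

  /* Before: *(float *)&word and *(int *)&word aliased one object,
     violating strict aliasing. After: one union, two views. */
  typedef union {
      unsigned int as_int;
      float        as_float;
  } mem_word_t;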

4 The real aim of this function is to update the branch target buffer if the branch is predicted as taken.
5 To simplify our implementation, definitions for all possible field extraction macros are emitted, but they are omitted in this explanation.
6 The assignment to PC is replaced with the #define.
7 In the real implementation, the resulting code differs from that shown here as follows: non-essential code which the compiler finally eliminates is omitted; variable names are different; and the increment of nPC is 8 for the PISA virtual machine which we used for the experiments.


The other optimization is for memory space management. SimpleScalar has a simple one-level page table to map the virtual memory space of a workload onto its own memory space on the host. This mapping is inevitable if both spaces have the same size, 2^32 bytes for example, but it causes a serious performance bottleneck because every memory reference needs mapping and a check that the resulting address exists. A large portion of the references, however, could be freed from the mapping because their targets are known a priori to exist. For example, if the workload code space is allocated contiguously, an instruction fetch address may be obtained by just one addition to the base of the space, and the result is assured to be correctly mapped except in the case of an indirect jump/branch. Alternatively, a simpler and more efficient solution is available if the host has a larger memory space, 2^64 bytes for example, than the simulation target. In this case, the whole address space of the workload may be allocated contiguously, leaving its management, such as on-demand real page allocation, to the host's operating system. We chose this solution using a 64-bit address machine and an operating system, Linux 2.4.21, which does what we need.
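A minimal sketch of the chosen solution, assuming a 64-bit Linux host; MAP_NORESERVE leaves real page allocation to the OS on first touch (all names other than the mmap API are illustrative):

  #include <sys/mman.h>

  static unsigned char *mem_base;   /* host base of the 2^32-byte space */

  int init_workload_space(void)
  {
      void *p = mmap(NULL, 1ULL << 32, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
      if (p == MAP_FAILED)
          return -1;
      mem_base = p;
      return 0;
  }

  /* A workload reference then needs no page table walk or check: */
  #define MEM(vaddr)  (mem_base + (unsigned int)(vaddr))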

6 Performance Evaluation

This section discusses the performance of the improved sim-fast and sim-cache, and their application to BurstScalar. The performance of the base and improved implementations is measured on a 2 GHz Opteron based PC with 8 GB of memory running the Linux 2.4.21 operating system. Both implementations are compiled by gcc-3.3.2 with the -O2 option, while the base implementation requires an additional -fno-strict-aliasing option to cope with the strict-aliasing violation. Workloads are 18 benchmarks in the SPEC CPU95 suite, executed with the "train" dataset.

6.1 Workload Specific ISS

Coverage of Translated Code  One important issue of WSS is how large a portion of the executed instructions is covered by translated code. That is, if a significant number of instructions had to be executed by emulation, our simple translation mechanism would have to be redesigned more sophisticatedly. Our evaluation result, however, strongly supports the sufficiency of our simple implementation: only 0.05 % or fewer instructions are left to the emulator due to unexpected jump/branch targets, except for two of the 18 benchmarks. One exception is 'fpppp', in which about 1 % of instructions are emulated, while the other exception, 'li', leaves 0.1 % of executed instructions to the emulator. Therefore we may conclude that our translation mechanism passes the Amdahl's law test.

Time for make  Another important issue is how long the translation, compilation and linkage take to build a WSS. We measured the time to make simulators specific to all 18 benchmarks and found that it takes 42 minutes. Since a large portion is occupied by the 'gcc' benchmark, which needs 18 minutes, the other 17 benchmarks require 85 seconds on average. This time should be acceptable even for a relatively small benchmark suite such as SPEC CPU95, and should be negligible for real-world workloads which run repeatedly with various datasets.

Simulation Speed  The most important performance issue is obviously how large a gain is obtained by our optimization technique. We measured the simulation speed of each benchmark running on the original sim-fast, on that with the basic peephole optimizations (basic) discussed in Section 5, and on that also with WSS (wss). Figure 8 shows the simulation speed of basic and wss relative to the baseline, i.e. the original sim-fast. The graph clearly shows that the WSS technique drastically improves the performance, achieving up to 34-fold speedup ('tomcatv') and 19-fold on average.

Figure 8. Speedup of ISS

A closer look at the graph gives us interesting insights. First, the effect of the basic optimization is not small, although less impressive than wss, and is relatively steady across all benchmarks; its speedup ratio is in the range of 3.3 to 4.2, and 3.7 on average. On the other hand, the speedup by wss varies in a wide range from 10- to 34-fold. This suggests that each workload has its own characteristics that affect the effectiveness of WSS. Indirect evidence for this observation is found when we compare the simulation speed with the execution speed of the native code run directly on the host. Figure 9 shows the MIPS values of the workloads executed by sim-fast, WSS, and the host machine directly (native). In the graph, the MIPS values of sim-fast and wss are multiplied by 100 and 10 respectively to make them recognizable and comparable to the large native values. As observed from the graph, the MIPS of sim-fast is almost stable at about 15. On the other hand, the heights of the bars of wss and native vary depending on the workload, almost synchronously. This result strongly suggests that sim-fast spends most of its computation time on fetch, decode and other jobs rather than on the essential part of an instruction (e.g. the addition for an add instruction) or the workload itself. Since these overheads are eliminated in WSS, its performance tends to follow that of the real execution, reflecting ILP, branch predictability, memory access locality and so on. As a result, the slowdown of WSS is relatively stable around 10, as shown in the figure (line graph labeled SD). One remarkable exception is the 17.7 SD of 'fpppp', which is discussed later. Besides the stability, the absolute slowdown of 9.9 on average is sufficiently small and comparable to binary translation and sophisticated source code translation. For example, the average slowdown of SyntSim[3] is about 9 without profiling and varies from 7 to 11 depending on the ratio of emulated instructions.

Figure 9. MIPS of Simulation and Native Execution




6.2 Cache Simulation

To clarify the effect of each optimization technique, we measured simulation speed with the following five configurations, using SimpleScalar's default cache and TLB parameters shown in Table 1.

• Original sim-cache as the baseline.
• Basic peephole optimization only (basic).
• basic and WSS (basic+wss).
• basic and our cache optimization without WSS (basic+copt).
• All optimizations, i.e. basic, wss and copt (all).

Table 1. Cache and TLB Configuration

  cache  L1 inst.    16KB, 32B/line, 1 way
         L1 data     16KB, 32B/line, 4 way
         L2 unified  256KB, 64B/line, 4 way
  TLB    inst.       64 entry, 4KB/page, 4 way
         data        128 entry, 4KB/page, 4 way

Figure 10 shows the result in the form of speedup over the baseline for the last four configurations listed above.

Figure 10. Speedup of Cache Simulation

As the bars of all clearly indicate, our optimization techniques gain a significantly large speedup, up to 14-fold ('ijpeg') and 8.3-fold on average. An important observation from the graph is that WSS and the other cache optimizations are not very effective when applied alone. First, the speedup by WSS is only 1.6-fold on average. This is explained by the fact that the portion of ISS in sim-cache is estimated at about 40 %, because sim-cache is 2.5 times as slow as sim-fast, and a 20-fold speedup of that portion merely results in 1.6-fold overall, following Amdahl's law. Next, the speedup by the other cache optimizations is 2.5-fold on average, larger than WSS's contribution. However, taking into account the fact that both configurations include the basic optimization, which by itself accelerates the simulation 1.4-fold, the multiplication of the speedups cannot explain the 8.3-fold total. The reason why one plus one is greater than two is a synergetic effect of WSS and cache optimization. For example, the calculation of the tag part of an address for an instruction cache access has two operands: one is the instruction address itself and the other depends on the cache configuration. Thus, even when one of them is made constant by either WSS or cache optimization exclusively, it is still necessary to perform a certain number of operations to obtain the tag. However, when both operands are fixed to constants by applying both techniques, the tag turns into a constant which is embedded in the operand of a cmp instruction on the x86 host.

Another remark should be made on 'fpppp', whose performance is exceptionally low and is also exceptionally degraded by WSS. The reason for this exception is the large code working set due to its exceptionally large innermost loop. As stated in Section 6.1, the slowdown of ISS for 'fpppp' is significantly larger than for the other benchmarks. This is due to the fact that the innermost loop is translated into a huge code segment of about 210 KB, larger than the host's L1 instruction cache. In cache simulation, the bad effect is twofold. First, since the innermost loop cannot reside in the target machine's L1 instruction cache, our MRU optimization has no effect. Second, since the size of the host's code for an instruction in cache simulation is larger than that in ISS, the code size slightly exceeds 1 MB and overflows the host's L2 cache! This problem could be partially solved by inhibiting the inline cache code expansion for a large loop code segment, at the cost of a certain amount of simplicity in our translation mechanism.
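For reference, the Amdahl's law arithmetic behind the 1.6-fold figure above, taking the ISS fraction f = 0.4 and its speedup s = 20 from the text:

\[
S = \frac{1}{(1-f) + f/s} = \frac{1}{0.6 + 0.4/20} = \frac{1}{0.62} \approx 1.6
\]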


6.3 Out-of-Order Simulation

Finally, we evaluated the performance of our out-of-order simulator BurstScalar with and without WSS and cache optimization. The microarchitecture of the target machine is the SimpleScalar default shown in Table 2. The cache configuration is the same as that shown in Table 1, and the access latencies in cycles are 1 for the L1 caches, 6 for the L2 cache, and 16 + 2 × B for memory when a B-byte block is transferred.

Table 2. Simulated Microarchitecture

  I-Fetch & ILP degree
    I-fetch width             8
    issue/commit width        8
    RUU depth                 256
    load/store queue depth    128
  Function Units
    integer ALU               8
    integer mult/div          3
    floating point ALU        8
    floating point mult/div   3
  Branch Prediction
    cond. branch pred.        2 K entry, 2-bit counter
    branch target buffer      2 K entry, 4-way
    return address stack      8 entry

The result shown in Figure 11 has two types of bars representing speedup over the baseline sim-outorder performance. The bars orig are for the original BurstScalar without the optimizations discussed in this paper, while all of them are applied to obtain the result of opt.

Figure 11. Speedup of Out-of-Order Simulation

The graph shows that the optimization is especially effective for benchmarks whose performance was already improved by the original BurstScalar. For example, the speedup of 'mgrid' is the highest in all respects: 4.2-fold in the original version, 9.7-fold in the optimized version, and 2.3-fold from the original to the optimized. These values are significantly larger than the averages, which are 2.2, 3.8 and 1.7 respectively. This is explained by the observation that the more BurstScalar reduces the instruction scheduling cost, the larger the relative cost of the other parts becomes. Similar results are obtained for 'tomcatv', 'swim' and 'hydro2d', and the significance of WSS and cache optimization is more clearly observed from the execution time breakdown of 'swim' shown in Figure 12.

Figure 12. Breakdown of BurstScalar's Execution Time of 'swim'

The figure shows the execution times of the following modules for orig and opt: Pre-Executor (PE), Instruction Emulator (IE), Instruction Scheduler (IS) and Reuse Engine (RE), which are explained in Section 5; Cache Simulator (CS) for cycle accurate cache simulation; Address Conflict Detector (AC) for load/store-queue maintenance; and other components (OT) including the branch predictor and the interfaces between modules. The effect of WSS is clearly shown by the bars for PE and IE, which are accelerated 4.4-fold and 5.5-fold respectively. The cache simulation (CS) is also accelerated 4.2-fold by our optimization. In total, these three modules, which occupied 55 % of the whole in orig, are now less significant than the other modules essential for out-of-order simulation, their share having been reduced to 22 %.

7 Conclusions

In this paper, we proposed a high speed ISS technique that makes the simulator workload specific. A workload specific simulator (WSS) consists of a set of C functions, each of which corresponds to a basic block of the workload binary, together with the other simulator functions including an ordinary emulator. The process of WSS generation is quite simple because it does not involve any sophisticated analysis, and it is easy to implement because it may exploit instruction definitions if they are available, as in SimpleScalar. The generated C functions are compiler-friendly because they are short enough and do not have any complicated control structures.

Although the WSS generation process is simple, the resulting simulator is quite efficient. Our SimpleScalar based implementation exhibited up to 34-fold speedup, and 19-fold on average, relative to sim-fast with the SPEC CPU95 benchmarks. The average slowdown of 9.9 is comparable with less portable/retargetable binary translators and with more complicated source code translators. We also proposed a set of optimization techniques for cache simulation, which exhibits a significantly large performance improvement when combined with WSS. The evaluation with SPEC CPU95 proves its efficiency, resulting in up to 14-fold speedup and 8.3-fold on average relative to sim-cache. Finally, we applied WSS and the cache optimizations to our fast out-of-order simulator BurstScalar to accelerate it 1.7-fold, making it 9.7 times as fast as sim-outorder at maximum and 3.8 times on average.

We plan to apply our WSS and cache optimizations to other types of simulators. For example, a distributed out-of-order simulation will be a good target, since it needs really fast fast-forwarding. Another target will be a distributed multiprocessor simulator, which also needs a fast distributed ISS for its front-end.

[8] P. Magnusson and B. Werner. Efficient memory simulation in SimICS. In Proc. 28th Annual Simulation Symp., 1995. [9] H. Matsuo, S. Imafuku, K. Ohno, and H. Nakashima. Shaman: A distributed simulator for shared memory multiprocessors. In Proc. 10th IEEE/ACM Intl. Symp. Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pages 347–355, Oct. 2002. [10] T. Nakada and H. Nakashima. Design and implementation of a high speed microprocessor simulator BurstScalar. In Proc. 12th IEEE/ACM Intl. Symp. Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pages 364–372, Oct. 2004. [11] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta. Complete computer system simulation: The SimOS approach. IEEE Parallel & Distributed Technology, 3(4):34– 43, 1995. [12] J. B. Rothman and A. J. Smith. Multiprocessor memory reference generator using Cerberus. In Proc. 7th IEEE/ACM Intl. Symp. Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pages 278–287, Oct. 1999. [13] E. Schnarr and J. Larus. Fast out-of-order processor simulation using memoization. In Proc. 8th Intl. Conf. Architectural Support for Programming Languages and Operating Systems, pages 283–294, 1998.

References

[14] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In Proc. 10th Intl. Conf. Architectural Support for Programming Languages and Operating Systems, pages 45–57, Oct. 2002.

[1] ARM Ltd. RealView developer suite. http://www.arm.com/products/DevTools/RealViewDevSuite.html.
[2] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system modeling. Computer, 35(2):59–67, Feb. 2002.

[15] The Bochs Project. Bochs: The cross platform IA-32 emulator. http://bochs.sourceforge.net/, Aug. 2005.

[3] M. Burtscher and I. Ganusov. Automatic synthesis of high-speed processor simulators. In Proc. 37th Annual Symp. Microarchitecture, pages 55–66, Dec. 2004.

[16] R. A. Uhlig and T. N. Mudge. Trace-driven memory simulation: A survey. ACM Computing Surveys, 29(2):128–170, June 1997.

[4] B. Cmelik and D. Keppel. Shade: A fast instruction-set simulator for execution profiling. ACM SIGMETRICS Performance Evaluation Review, 22(1):128–137, May 1994.

[17] J. E. Veenstra and R. J. Fowler. MINT: A front end for efficient simulation of shared-memory multiprocessors. In Proc. 2nd IEEE/ACM Intl. Symp. Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pages 201–207, Feb. 1994.

[5] S. Girbal, G. Mouchard, A. Cohen, and O. Temam. DiST: A simple, reliable and scalable method to significantly reduce processor architecture simulation time. In Proc. ACM SIGMETRICS Conf. Measurement and Modeling of Computer Systems, pages 1–12, June 2003.
[6] L. B. Hostetler and B. Mirtich. DLXsim—a simulator for DLX. http://heather.cs.ucdavis.edu/~matloff/DLX/Report.html.

[18] E. Witchel and M. Rosenblum. Embra: Fast and flexible machine simulation. In Proc. ACM SIGMETRICS Conf. Measurement and Modeling of Computer Systems, pages 68–79, May 1996.

[7] D. K. Keppel, E. J. Koldinger, S. J. Eggers, and H. M. Levy. Techniques for efficient inline tracing on a shared-memory multiprocessor. In Proc. ACM SIGMETRICS Conf. Measurement and Modeling of Computer Systems, 1990.

[19] V. Živojnović and H. Meyr. Compiled HW/SW co-simulation. In Proc. 33rd ACM/IEEE Design Automation Conf., pages 690–695, June 1996.
