Dynamic Profile Driven Code Version Selection

Peng-fei Chuang∗, Howard Chen†, Gerolf F. Hoflehner†, Daniel M. Lavery†, and Wei-Chung Hsu‡

∗ Electrical and Computer Engineering Dept., University of Minnesota, Minneapolis, MN 55455
† Intel Corporation, Santa Clara, CA 95052
‡ Computer Science Dept., University of Minnesota, Minneapolis, MN 55455
[email protected], {howard.h.chen, gerolf.f.hoflehner, daniel.m.lavery}@intel.com, [email protected]

Abstract— In this paper, we study the effectiveness of dynamic code version selection on Itanium® 2 processors. Code version selection can improve the effectiveness of optimizations, adapting them to multiple input sets. In this performance potential study, we conduct experiments on dual-core Itanium® 2 processors and examine the effectiveness of dynamic code version selection for loop scheduling, load prefetching, and control speculation aggressiveness adjustment. We apply multiple versions of these optimizations to selected SPEC CPU2006 benchmarks at the base optimization levels used for SPEC reporting. The experimental results demonstrate 2.0% to 15.7% performance improvement from versioning optimizations on the tested benchmarks. This paper also discusses issues and solutions concerning the implementation of dynamic code version selection.

I. INTRODUCTION

Profile feedback improves program performance by exploiting knowledge of run-time program behavior to generate more efficient binaries. Dynamic profile-feedback optimization has been proposed as a technique to transparently apply profile-feedback optimizations, hiding the process from the user. This addresses a major barrier to profile-feedback adoption: the extra profile collection and recompilation steps required of the user. However, dynamic optimization can incur additional compilation and optimization overheads over static compilation. Since optimizations are detected and applied at run-time, the memory and computational costs count toward overall execution time. As a result, dynamic optimizers traditionally apply light-weight optimizations that can be quickly detected and applied, such as cache-miss prefetching or trace-based optimizations.

In addition, application binaries do not contain information that exists in the original source code, such as type and data-structure information. The resulting loss of memory-disambiguation information inhibits the ability of dynamic optimizers to re-schedule code or perform code motion: many memory accesses take on spurious data dependencies with other memory accesses, limiting the optimizer's ability to apply optimizations involving code motion.

In order to exploit the knowledge from both worlds, compilation can be deferred until run-time (JIT compilation), or multiple versions of the code can be generated and the selection performed at program run-time (code versioning). Both approaches have been shown to be effective, and a comparison of them is given in the related work section.

In this paper, we propose to generate multi-versioned binaries at static compile time. During program execution, we measure the potential performance improvement from dynamically selecting the most efficient code version across multiple invocations, and use this information to select the best-performing code version at run-time. We compare the results to the performance of the original code. The goal is to apply optimizations only when they improve code efficiency. This approach can improve on static compiler performance in situations where the best optimization is input dependent, or where optimizations selected using heuristic estimates turn out to be very detrimental at run-time. We can also identify dynamic code version selection opportunities that are not exploited statically.

In the next section, we compare the effectiveness of dynamic code version selection to static selection of a single optimization, and present the potential performance improvement from dynamically applying three optimizations. Section III describes the implementation considerations of static code version generation and dynamic code version selection. Section IV discusses related work, and Section V concludes.

II. DYNAMIC CODE VERSION SELECTION

The effectiveness of many compiler optimizations is influenced by input data, run-time program behavior, hardware configuration, micro-architectural behavior, or a combination of factors that are not known at static compile time. These optimizations can degrade performance, especially when applied aggressively, if run-time conditions differ from compile-time assumptions. In general, compilers have been conservative in applying optimizations to ensure no significant performance degradation at run-time, in addition to meeting stricter correctness requirements. However, this approach leaves considerable room for performance improvement.

Compilers often exploit heuristics and run-time profile feedback to help decide where an optimization should be used and at what level of aggressiveness it should be applied. Even though these techniques capture average run-time behavior, there still exist hard-to-catch but performance-critical cases that result in a sub-optimal selection of optimizations. Furthermore, a program may behave differently in reaction to different input data. In this case, neither compiler heuristics nor profile feedback can provide useful information.

In order to generate code that performs well in various environments and on multiple data inputs, we propose dynamic code version selection using performance counter data. Dynamic code version selection creates multiple versions of code suited to different environments. With run-time behavior measurement, optimizations can be selectively turned on when they are determined to provide performance benefits. This enables optimizations to be applied more aggressively than with static heuristics. Furthermore, it can mitigate the performance degradation from an optimization misguided by static profiles.

In this section, we present an evaluation of applying dynamic code version selection to Intel® C++ Compiler [1] [2] generated code for the Itanium® architecture. Three optimization techniques are selected for the versioning study: loop scheduling using software pipelining [3] [4] vs. global code scheduling [5], prefetching aggressiveness on loads, and control speculation aggressiveness on loads and uses. The experimental results demonstrate the potential performance improvement before version selection overhead.
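The potential-improvement numbers reported below are obtained by combining, for each hot function, the fastest of the measured versions. A minimal sketch of that bookkeeping, using hypothetical per-function IP-sample counts rather than measured data:

```c
/* Sketch: estimating the potential gain of dynamic code version
 * selection from per-function IP-sample counts.  The sample counts
 * below are hypothetical illustrations, not measured data. */
#include <assert.h>

#define NFUNCS    3
#define NVERSIONS 3   /* baseline plus two alternatives */

/* IP samples attributed to each hot function under each binary
 * version; version 0 is the baseline. */
static const long samples[NFUNCS][NVERSIONS] = {
    { 1000,  900, 1100 },   /* function A: version 1 is fastest */
    {  500,  650,  480 },   /* function B: version 2 is fastest */
    {  300,  300,  300 },   /* function C: no version helps     */
};

/* Compare total baseline samples with the sum of per-function minima:
 * combining the fastest version of every function bounds the gain, in
 * percent, that a zero-overhead dynamic selector could achieve. */
double potential_improvement(void)
{
    long base = 0, best = 0;
    for (int f = 0; f < NFUNCS; f++) {
        long min = samples[f][0];
        base += samples[f][0];
        for (int v = 1; v < NVERSIONS; v++)
            if (samples[f][v] < min)
                min = samples[f][v];
        best += min;
    }
    return 100.0 * (double)(base - best) / (double)base;
}
```

For the hypothetical counts above this yields a bound of roughly 6.7%; the per-benchmark figures in this section are computed analogously from Caliper's function-level sample breakdown.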

A. Evaluation Methodology and Experimental Setup

To evaluate the performance of dynamic code version selection, we use the compiler to generate three binary versions of each application with different optimizations, including the baseline version. We then measure the execution time of each versioned binary at the function level. The execution times of the top ten hot functions (the functions that are most costly in terms of execution time) are compared against those of their versioned counterparts. By combining the fastest version of each function, we can estimate the potential performance improvement of dynamic code version selection, before the versioning overhead.

In this study, we used the HP Caliper [6] performance analyzer to monitor the execution of the benchmarks and collect Instruction Pointer (IP) samples. Caliper cross-references these IP samples with the function boundaries to generate a breakdown of IP samples at the function level. Since the execution time of the hot functions significantly exceeds the sample period, the relative change in execution time has a negligible effect on the IP sample counts. Therefore, IP sample information provides a good estimate of the execution time difference between the various optimization techniques.

We conducted our experiments on 4-socket, dual-core Itanium® 2-based servers. A set of SPEC CPU2006 [7] benchmarks was chosen for this study, running the complete reference input sets. The Intel® C++ Compiler version 9.1 is used for compilation, and the baseline compilation is done using the base optimization levels used for SPEC CPU2006 reporting. This is a high level of optimization, including -fast, which implies -O3 and interprocedural optimizations. Table I lists the details of the experimental setup. Note that for SPEC CPU2006, the new submission rules forbid the use of profile-feedback-directed optimization in the base build. However, dynamic optimization is allowed as long as no data is saved from previous runs.

TABLE I
EXPERIMENTAL SETUP

Configuration                          Description
Hardware configuration                 Dual-core Itanium® 2 processor, 12MB L3 cache per core, 4GB RAM
Operating system                       Red Hat Enterprise Linux 4 (EL4U2)
C/C++ compiler                         Intel® C++ Compiler ver. 9.1
Baseline compilation flags             -fast -ansi_alias -IPF-fp-relaxed
Performance counter data collection    HP Caliper
Benchmarks                             Selected SPEC CPU2006 benchmarks

B. Loop scheduling with software pipelining vs. global code scheduling

The first experiment studies the performance improvement from switching between different loop scheduling methods. In this experiment, the baseline version, in which either software pipelining or global code scheduling is selected for each loop based on compiler heuristics, is compared against a version with software pipelining applied to all loops and a version generated exclusively using global code scheduling. Software pipelining can achieve higher loop throughput, but may increase loop latency. It is, therefore, better suited to frequently executed loops with longer execution times. The Intel® C++ compiler uses heuristic profile information to estimate the total execution time of a software-pipelined loop vs. global code scheduling, and then uses this estimate to select the better scheduling method. However, if the dynamic loop behavior differs from the static estimates, performance can be improved by selecting an alternately scheduled version of the loop.

Figure 1 shows the speedup of version selection among loop scheduling methods. Our evaluation showed that 401.bzip2, 416.gamess, 450.soplex, 458.sjeng, and 459.GemsFDTD can benefit from dynamic loop scheduling selection. The maximum speedups for these benchmarks are 2.5% (input set 3), 2.7% (input set 2), 6.6% (input set 2), 3.5%, and 1.0%, respectively. While static compiler heuristics make good optimization choices in most cases, overall performance could be further improved if the heuristic were able to catch the special cases residing in the five benchmarks named above. However, the information required by the heuristic is often not available at static compile time.

Benchmark 458.sjeng, for example, contains a frequently executed loop whose execution time is very difficult to estimate without knowledge of the actual input data (Figure 2). This loop has an input-sensitive continue statement which, if taken, significantly reduces the execution time of a loop iteration. As a result, the compiler estimates that the average execution time of a non-pipelined loop is relatively lower, and it schedules the loop using global code scheduling. However, the continue statement is almost never taken in practice, and software pipelining is in fact the more efficient loop scheduling method. An offline study shows that tuning the frequency assumption for taken continue statements may address the issue in this case, but brings negative performance impacts to other benchmarks. By dynamically switching to the software-pipelined version, we can select the more efficient code implementation and reduce the execution time of the routine by 8.1%.

Note that, in this study, loop schedule versioning is applied at the function level. However, different loops within one function could be scheduled with different methods to achieve optimal performance. Thus, it would be more accurate to evaluate the effectiveness of versioning by loop scheduling at the loop level. Future work may add compiler support to generate a table of loop boundaries and use this information to map the IP samples to loops and compare loop execution times.

By using the performance counters, we can identify hot loops as well as measure the efficiency of different code implementations, which can help us select the faster loop scheduling method. In addition, performance counters provide insight into bottlenecks in the loop schedule, which can lead to more effective version selection. One option is to analyze cache miss patterns: if we detect frequent data cache misses inside a loop, we may want the software-pipelined version (perhaps with cache prefetches), which is scheduled to tolerate longer latencies.

Fig. 1. Potential performance improvement of dynamically selecting the loop scheduling method at the function level.

    for (j = 1, a = 1; a <= piece_count; j++) {
        i = pieces[j];
        if (!i)
            continue;
        else
            a++;
        /* ...(many lines deleted) */
    }

Fig. 2. Code example for neval.c:432 in function "std_eval()" of the 458.sjeng benchmark.

C. Prefetching aggressiveness

The second experiment evaluates the effectiveness of dynamically adjusting the aggressiveness of prefetch insertion. Compilers commonly use software prefetching [8] [9] to minimize the cache miss penalty. We compare the default prefetching scheme used by the compiler against code versions with less aggressive prefetching and with all prefetching turned off. Prefetching anticipates cache misses and inserts instructions to fetch the data in advance of the actual memory references. If the fetched data is useful, prefetching can effectively hide the latency resulting from poor reference locality. Although software prefetching gives significant performance gains [10] for most of the SPEC CPU2006 benchmarks, it has trade-offs and can incur losses in some cases. Including prefetch instructions in the code may negatively affect performance in four ways: 1) increased code size and dynamically executed instruction count, 2) increased schedule length, 3) cache pollution, and 4) increased memory bandwidth demand, which can be critical in a chip multiprocessor (CMP) environment where the pin bandwidth is shared among multiple threads. Therefore, applying prefetching extravagantly may harm performance.

In Figure 3, we present the potential performance speedup achieved by adjusting prefetching aggressiveness. The results demonstrate that dynamic code version selection is beneficial for a number of benchmarks, with significant performance improvement. The measured improvement is 15.7% for 450.soplex (input set 1), 10.0% for 403.gcc (input set 5), 8.5% for 444.namd, 6.2% for 458.sjeng, 6.0% for 416.gamess (input set 1), and 5.6% for 401.bzip2 (input set 3). Figure 4 further compares the performance impact of different levels of prefetching aggressiveness, focusing on the benchmarks that gain more from code version selection. For each benchmark, the three bars represent the speedup of turning off prefetching, the speedup of applying prefetches less aggressively than the compiler baseline, and the speedup achieved by versioning, all with respect to the baseline. The observation is that for the majority of the benchmarks in Figure 4, higher performance can be achieved by disabling

Fig. 3. Potential performance improvement of dynamic version selection on prefetch aggressiveness adjustment.

Fig. 4. Version speedup comparison: relative performance (%) of no prefetching, less aggressive prefetching, and versioned selection, compared to the baseline aggressive prefetching.
prefetching completely (when the first bar is close to the third bar). However, for 401.bzip2 (input sets 2 and 3) and 450.soplex (input set 1), the ability to choose between different levels of prefetching aggressiveness can significantly improve performance. Further investigation of the improvement for 450.soplex reveals that some of its hot functions are data-sensitive, and different input sets have different requirements for prefetching aggressiveness (Figure 5). With no prefetching, for example, the SSVector::assign2product4setup() function of 450.soplex is 20% slower than the version with prefetching on input set 1, yet 251% faster on input set 2. These behaviors illustrate the ability of code version selection to improve the performance of applications across a wide range of program behaviors and data input sets.
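The behavior above also suggests the shape of the mechanism itself: generate versions that differ only in prefetching aggressiveness and pick one per input set by measurement. A minimal sketch, in which GCC/ICC's `__builtin_prefetch` stands in for the compiler-inserted Itanium prefetch instructions, and the prefetch distance is an illustrative assumption rather than a tuned value:

```c
/* Sketch: two versions of the same reduction differing only in
 * prefetching aggressiveness, plus the argmin rule used to pick
 * between them at run-time.  PF_DIST is an illustrative, untuned
 * prefetch distance. */
#include <assert.h>
#include <stddef.h>

#define PF_DIST 16   /* elements ahead; hypothetical */

/* Version 0: no software prefetching. */
long sum_noprefetch(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Version 1: aggressive prefetching, touching data PF_DIST elements
 * ahead of the current access. */
long sum_prefetch(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}

/* Selection: given per-version costs measured over an invocation
 * window (e.g. PMU cycle counts), commit to the cheapest version.
 * For input-sensitive functions such as those of 450.soplex, this
 * must be re-run when the input set changes. */
int select_version(const double cost[], int nvers)
{
    int best = 0;
    for (int v = 1; v < nvers; v++)
        if (cost[v] < cost[best])
            best = v;
    return best;
}
```

Both versions compute the same result; only their memory behavior, and hence their measured cost on a given input, differs.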

Fig. 5. The behavior of two hot functions of 450.soplex, (1) SSVector::assign2product4setup() and (2) SoPlex::setupPupdate(), under input sets 1 and 2; both exhibit opposite behavior in reacting to different input data.

In dynamic optimization frameworks, we can potentially improve memory efficiency by dynamically inserting prefetches and/or deleting useless prefetches by changing them into NOPs, but these NOPs often introduce idle processor cycles. Our approach further improves performance by completely rescheduling the code. This additional step is especially influential for in-order-execution processors such as the Itanium® processor.

Back in Figure 4, we showed that for benchmark 444.namd, more efficient code can be produced by generating code with no prefetches. In Figure 6, we further measure the role of scheduling in this performance gain by comparing the effectiveness of replacing the prefetches with NOPs against the effectiveness of both removing the prefetches and rescheduling the code. For 444.namd, which is not memory intensive, replacing useless prefetches with NOP instructions results in only a 1.20% performance improvement. However, by removing the prefetches that are not useful and rescheduling the code, we increase performance by 7.82%. The inclusion of the prefetch instructions assumed that memory stalls would hide any extra schedule length caused by the increased instruction count. We found that versioning can commonly gain performance by removing prefetches and improving the schedule when memory latency is not a limiting factor. Because of performance considerations, dynamic optimizations must be fast and lightweight. Code scheduling, however, is computationally expensive, making it a good candidate for dynamic code version selection.

Fig. 6. Performance comparison of prefetch removal methods applied to the 444.namd benchmark: replacing prefetches with NOPs improves the top ten functions by 1.20%, while removing prefetches and rescheduling the code improves them by 7.82%. The baseline is compiled using the flags -fast -IPF-fp-relaxed -ansi_alias.

D. Control speculation aggressiveness on loads

On the Itanium® architecture, a load can be speculatively moved across branch boundaries. This improves the flexibility of code scheduling and can hide long-latency cache misses. However, if the load faults or raises an exception, control branches to recovery code when execution reaches a special check instruction. The compiler-generated recovery code repeats the load and the dependent instructions and then branches back to resume normal execution. In order to reduce the overhead of recovery code, the Intel® C++ compiler's heuristic evaluates the execution probabilities of the paths following a branch: only loads on a path whose probability exceeds a pre-defined threshold are considered for scheduling across the branch boundary. The threshold limits the costs that speculation may incur. If we speculate both the load and its use, we may suffer a stall on a cache miss for a load from a path that is rarely executed. Also, the hardware may not defer TLB misses immediately, so we may spend time processing memory operations for loads that do not need to be executed.

Our results in Figure 7 show that the performance of benchmarks 400.perlbench (input set 1), 401.bzip2 (input set 3), 403.gcc (input set 1), 444.namd, and 471.omnetpp can potentially improve by up to 2.2%, 1.4%, 2.6%, 1.2%, and 1.3%, respectively, from tuning the aggressiveness of control speculation on loads. Although this technique has a lower performance impact than the other techniques we explored, it may be applied on top of the other mechanisms and yield additional performance improvement.

E. Summary

Dynamic code version selection can extract more optimization opportunities than static heuristics. By monitoring run-time behavior, optimization techniques can be selectively turned on when they are determined to be actually beneficial for the run-time environment and the input sets. This enables optimizations to be applied more aggressively than with static heuristic-driven optimization selection. Table II summarizes the optimization techniques that we studied for dynamic code version selection and the trade-offs associated with them. We found that dynamically adjusting data prefetching aggressiveness and selecting the loop scheduling method have the more significant impact on performance. The performance improvement comes from the combination of better code scheduling and applying the better-optimized code for the working data and the micro-architectural behavior.

TABLE II
COMPILATION OPTIMIZATION TECHNIQUES EXAMINED IN THE STUDY OF POTENTIAL PERFORMANCE IMPROVEMENT, AND THEIR TRADE-OFFS

Optimization technique                                   Trade-offs
Loop scheduling using software pipelining                Higher loop throughput vs. longer loop latency
in place of global code scheduling
Aggressiveness of data prefetching                       Cache efficiency vs. higher resource consumption
                                                         (potential memory access contention), cache
                                                         pollution, and an increase in dynamically
                                                         executed instructions
Control speculation aggressiveness on loads              Load latency reduction vs. overhead of failed
                                                         speculation recovery

Note that the experimental results do not reflect phase transitions in applications. There may exist situations where different program phases work better with different code versions. Dynamic code version selection could provide even better performance improvement if we could correlate program phases with code versions. Re-adaptation mechanisms are needed to exploit program phases; this is an interesting topic for future investigation.

III. IMPLEMENTATION CONSIDERATIONS

Dynamic code version selection defers the selection of the optimized version to run-time. Its implementation consists of two primary components: 1) compile-time support for generating multiple code versions and storing them for run-time analysis and selection, and 2) run-time support for code version switching, run-time performance analysis, and version selection.


Fig. 7. Potential performance improvement of version selection on tuning the aggressiveness of control speculation on loads.

A. Compiler support for code version generation

Effectively controlling code expansion is one of the major issues in generating multiple code versions. Potentially, many different permutations of optimizations may be applied to a single loop to create multiple versions, each well suited to a different situation. To prevent an explosion in code size, code version generation should be applied selectively, based on the potential impact of an optimization. From the experimental results presented in Section II, we learn that there can be potential performance improvement from performing dynamic code version selection on a few key compiler optimization mechanisms. For example, tuning prefetching aggressiveness is more effective than adjusting control speculation aggressiveness, so it could be given higher priority when choosing which compiler optimizations to version under a target code size limit. The process of generating multiple versions of one code sequence and selecting the best-performing one is in fact similar to iterative compilation [11] [12] [13] [14] [15]. By applying the optimization configuration space pruning methodologies of Optimization-Space Exploration (OSE) [12], we can perform detailed analysis to identify the candidate optimization techniques and configurations for dynamic code version selection. Conversely, the complexity of configuration space pruning can be reduced by incorporating dynamic code version generation into OSE.

A second approach is to have the compiler generate multiple versions of the code only when its heuristics do not have enough run-time information to select optimizations effectively, or when it is using high-risk optimization techniques such as prefetch insertion. For example, when compiling the hot function of 458.sjeng, static analysis can enumerate the possible paths through the loop and determine the best, worst, and expected performance of the different loop versions. Using this information, the compiler can quantify the possible benefit of each loop version, choose to version only the loops that provide the greatest benefit, and generate multiple code versions based on the estimated potential gain.

In addition to generating multiple code implementations, the compiler should specify the availability of the different code versions. This information can be stored in the program binary header. Run-time environments that support dynamic code version selection scan the program headers to locate all alternative code implementations, and then prepare for monitored execution.
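One possible shape for such a header table is sketched below; the layout, field names, and lookup helper are illustrative assumptions, not the format of any existing toolchain:

```c
/* Sketch: a version-availability table a compiler could emit into the
 * binary header for the run-time to scan.  All names and the layout
 * here are hypothetical. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_VERSIONS 4

struct version_entry {
    char     func_name[32];             /* versioned function           */
    uint32_t nversions;                 /* how many implementations     */
    uint64_t entry_addr[MAX_VERSIONS];  /* entry point of each version  */
    uint32_t opt_kind[MAX_VERSIONS];    /* 0=baseline, 1=SWP-all,
                                           2=no-prefetch, ...           */
};

/* Run-time side: after scanning the header, look up the versions of a
 * function; NULL means the function was not versioned. */
const struct version_entry *
find_versions(const struct version_entry *tab, size_t n, const char *fn)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].func_name, fn) == 0)
            return &tab[i];
    return NULL;
}

/* Small self-check over a two-entry table with made-up addresses. */
int demo_lookup(void)
{
    static const struct version_entry tab[] = {
        { "std_eval", 2, { 0x400000, 0x400800 }, { 0, 1 } },
        { "assign2p", 3, { 0x401000, 0x401400, 0x401800 }, { 0, 2, 2 } },
    };
    const struct version_entry *e = find_versions(tab, 2, "assign2p");
    if (!e || find_versions(tab, 2, "missing") != NULL)
        return -1;
    return (int)e->nversions;
}
```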

B. Runtime support for code version selection

The basic tasks of run-time support for dynamic code version selection include:
• Detecting the existence of versioned code and switching between versions.
• Collecting run-time performance information and storing the collected data.
• Analyzing the collected performance data, selecting a code version, and applying the selection.

Dynamic code version selection can be implemented on current micro-architectures with minor hardware modifications. The performance counters available on modern microprocessors, such as the performance monitoring unit (PMU) of the Itanium® architecture, can provide an accurate picture of the run-time behavior of an application with negligible overhead. In a managed runtime environment (MRTE), the compiler can generate a wrapper function for each function that has multiple implementations. This wrapper function is in charge of turning on code version selection, invoking and measuring the execution time of each code version using the PMU, analyzing the PMU readings, and selecting the best code version. The wrapper function also ensures that only one performance monitor measures at a time, so that nested functions and recursive function calls are handled properly.

State-of-the-art multi-core processors can be exploited to build a less intrusive, lower-overhead dynamic code selection framework. One possible implementation is to use the second core to conduct the code version selection analysis. An overview of the design is given in Figure 8. In this design, we execute the versioned binary on core 1. The PMU monitors the execution and stores the performance data in a PMU history table. Instead of measuring execution times directly, we store IP samples; using IP samples gives us the flexibility of estimating execution time at different granularities, which can be as fine-grained as the loop level, provided that loop boundary information is available. Core 2 reads the PMU history periodically, or when core 1 signals that new data is ready to be processed. It then compares the performance data and informs core 1 to invoke the better-performing code version.
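The wrapper scheme described above can be sketched as follows. `read_cycles()` stands in for a PMU read, and the names, the round-robin measurement order, and the single-monitor flag are illustrative assumptions, not the Intel compiler's actual implementation:

```c
/* Sketch of the per-function wrapper for MRTE-style version
 * selection.  read_cycles() is a stand-in for reading a PMU cycle
 * counter; every name here is hypothetical. */
#include <assert.h>
#include <stdint.h>

#define NVERS 2
typedef long (*body_fn)(long);

static int      monitoring_active = 0;  /* one monitor at a time      */
static uint64_t fake_clock = 0;         /* stand-in PMU cycle counter */
static uint64_t read_cycles(void) { return fake_clock; }

struct versioned_func {
    body_fn  vers[NVERS];   /* compiler-generated implementations */
    uint64_t cost[NVERS];   /* measured cycles per version        */
    int      calls;         /* invocations seen so far            */
    int      chosen;        /* -1 while still measuring           */
};

/* The wrapper: for the first NVERS calls, run each version once under
 * measurement (skipping measurement when another monitor is already
 * active, which keeps nested and recursive calls correct); afterwards
 * always dispatch to the winner. */
long wrapper(struct versioned_func *f, long arg)
{
    if (f->chosen >= 0)
        return f->vers[f->chosen](arg);

    int v = f->calls % NVERS;
    long r;
    if (!monitoring_active) {
        monitoring_active = 1;
        uint64_t t0 = read_cycles();
        r = f->vers[v](arg);
        f->cost[v] += read_cycles() - t0;
        monitoring_active = 0;
    } else {
        r = f->vers[v](arg);   /* nested call: run unmeasured */
    }
    if (++f->calls >= NVERS)
        f->chosen = f->cost[1] < f->cost[0] ? 1 : 0;
    return r;
}

/* Self-check with two dummy bodies whose "cycle" costs differ. */
static long body_slow(long x) { fake_clock += 10; return x + 1; }
static long body_fast(long x) { fake_clock += 3;  return x + 1; }

int demo_select(void)
{
    struct versioned_func f = { { body_slow, body_fast }, { 0, 0 }, 0, -1 };
    wrapper(&f, 0);   /* measures version 0 */
    wrapper(&f, 0);   /* measures version 1, then commits */
    return f.chosen;  /* index of the cheaper body */
}
```

A production wrapper would measure over many invocations rather than one, but the control flow is the same.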

Fig. 8. Implementation of dynamic code version selection: core 1 runs the monitored execution of the versioned binary while the PMU deposits IP samples into a PMU history table; core 2 reads the PMU history from core 1, performs the performance analysis, and informs core 1 of the version selection.

Invoking the code versions in turn, however, has the potential drawback that the code versions are not applied to comparable working data. Alternatively, we may assign code versions to different cores and have them race the execution. The execution on the second core is used only for verification purposes; it does not write back data in any way that changes program state. The collected performance data is again processed by the second core while it is not executing alternative versions. While this method has the merit of having the code versions work on the same inputs, its accuracy is affected by the locality characteristics of the local data cache, and contention for shared resources may distort the performance measurement. Therefore, if performance monitoring is done for a sufficient amount of time and the performance readings do not show high variation, the first multi-core design is preferable.

Dynamic code version selection can also be an on-demand service, minimizing the version selection overhead. The performance counters not only help us measure code version efficiency; they can also serve as a mechanism to trigger code version selection and to identify the hot functions. In this case, the version selection runtime initially starts in observation mode. It continuously monitors the program execution until an optimization opportunity arises, for example, when many long-latency load stalls occur in a hot function. It can then explore the use of other code versions, such as aggressive prefetching or software pipelining, to hide the load latency.

IV. RELATED WORK

The DynamoRIO project [17] extends the Dynamo framework, ports it to the IA-32 platform, and provides an API for creating customized modules for optimization, instrumentation, and profiling. In [18], Bruening et al. implemented a dynamic optimization infrastructure using the DynamoRIO API, which applies compiler optimization techniques such as redundant load removal and architecture-specific optimization (instruction selection). The ADORE framework [19] [20] monitors and optimizes unmodified program binaries on Itanium® processors. It continuously samples the performance monitoring registers of the Itanium® architecture throughout the program's execution. In order to keep monitoring and optimization overhead low, dynamic optimization frameworks usually apply only light-weight and coarse-grained optimization techniques, such as prefetch insertion or removal (without rescheduling the code) and code trace layout improvement. Dynamic code version selection can apply optimizations to finer-grained program regions, such as the function or loop level, and can therefore exploit more opportunities than the above methods.

Another paradigm of dynamic optimization applies JIT compilation/optimization to source code or intermediate code. SELF-93 [21], HotSpot [22], Jikes RVM [23], and the Jalapeño adaptive optimization system [24] are examples of dynamic optimization using a JIT compiler. With knowledge of both the source code and run-time application behavior, JIT compilers can perform most of the classical optimizations found in static compilers. However, JIT compilers compete with the application for computing resources, including processor time and memory bandwidth. Our profiles show that, for SPEC CPU2006, one third of the benchmarks have compilation times that are as long as 30% of their execution times when aggressive optimizations are in use. In contrast, our method generates code versions at compile time and limits the optimization overhead to code monitoring and version selection.

In [24], Arnold et al. present an implementation of a framework that selects statically generated code versions using dynamic feedback. In that study, the authors use the compiler to generate parallel code versions that use various algorithms to reduce synchronization overhead. In [25], Voss and Eigenmann propose a generic compiler-supported framework for adaptive program optimization, called ADAPT. In addition to switching between statically generated code versions, the ADAPT runtime system can dynamically generate optimized code with the help of a remote optimizer, which resides on a remote machine and generates code based on user-specified heuristics written in a domain-specific language. The main focus of the ADAPT project is on loop transformation and parallelization. While its authors also study the effects of compiler flag selection at the -O2 level, our approach is based on exploring scheduling optimizations in a production compiler, and attempts to improve performance at the base optimization level used to report SPEC CPU2006 results.

IV. R ELATED W ORK There are a number of studies on dynamic optimization framework. HP’s Dynamo [16] framework interprets the native program binary of HP PA-8000’s instruction set. It processes the native program binary, locates the hot code region and performs runtime optimization on it. The DynamoRIO
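The on-demand selection policy sketched in the implementation discussion can be made concrete in a few lines. The following is a minimal, hypothetical sketch, not the actual runtime: the names `sample_pmu`-style inputs, `CODE_VERSIONS`, and `STALL_THRESHOLD` are illustrative assumptions, wall-clock timing stands in for the Itanium® PMU cycle counters, and plain Python functions stand in for compiler-emitted code versions. The monitor stays in observation mode until load-stall samples cross a threshold, then invokes each version of the hot function in turn and keeps the fastest.

```python
import time

# Hypothetical stand-ins for statically generated code versions of one
# hot function; in the real system these would be compiler-emitted
# versions reached through a dispatch table in the binary.
def version_swp(data):       # software-pipelined schedule (stand-in)
    return sum(data)

def version_prefetch(data):  # aggressive prefetching (stand-in)
    return sum(data)

CODE_VERSIONS = {"swp": version_swp, "prefetch": version_prefetch}
STALL_THRESHOLD = 1000  # illustrative: long-latency load stalls per interval

def select_version(stall_samples, data):
    """Observation mode: do nothing until the stall counters indicate an
    optimization opportunity; then time each version in turn and keep the
    fastest."""
    if max(stall_samples) < STALL_THRESHOLD:
        return None  # stay in observation mode
    timings = {}
    for name, fn in CODE_VERSIONS.items():
        start = time.perf_counter()
        fn(data)                                  # monitored trial invocation
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)          # best-performing version

# Example: stall counts exceed the threshold, so a version is chosen.
best = select_version([120, 4500, 300], list(range(10000)))
```

In the real runtime, the measurements would come from the Itanium® performance monitoring registers (cycle and load-stall event counters) across successive dynamic invocations of the hot region, rather than from explicit timed calls.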


V. CONCLUSION

Static code versioning improves program performance by statically compiling multiple versions of code specialized for different situations and selecting a version based on runtime information. Traditionally, version selection is based on data that is available before the code executes. However, many programs are dominated by code whose performance is mainly determined by dynamic behavior such as branch mispredictions and cache/TLB misses. In this work, we examine the effectiveness of selecting code versions based on previous code behavior measured with hardware performance counters. Three optimization techniques are studied: loop scheduling (software pipelining vs. global code scheduling), prefetching aggressiveness on loads, and control speculation aggressiveness on loads and their uses. The experimental data show that static profiles are insufficient for the compiler to make the best optimization decisions at compile time for a subset of the SPEC CPU2006 programs. For loop scheduling, one third of the studied benchmarks show performance improvements ranging from 2.5% to 6.6%. For prefetching aggressiveness tuning, we see significant performance gains in more than half of the benchmarks, with speedups of 2.0% to 15.7%. While adjusting control speculation aggressiveness has a relatively moderate performance impact, this optimization can be used in conjunction with other mechanisms to yield even higher performance improvements. Furthermore, versioning can react to different input data, changes in hardware and/or software configuration, and changes in the runtime environment. Case studies on the benchmarks 458.sjeng, 450.soplex, and 444.namd show that the best optimizations for a program change with different inputs. For 458.sjeng, the best loop scheduling method depends on how often a particular situation occurs in the input set. For 450.soplex, two of its hot functions are data-sensitive, and different input sets have different requirements for prefetching aggressiveness. For 444.namd, misses are infrequent for the input data set, and the extra instructions significantly increased the schedule length of the performance-critical loop. Our future plans include 1) implementation of dynamic code version selection, 2) loop-level versioning analysis, 3) program phase study, 4) versioned code emission with code explosion prevention mechanisms, and 5) reducing the overhead of code selection.

REFERENCES

[1] Intel Corporation, Intel® Itanium® Architecture Software Developer's Manuals, Volumes 1-3, Revision 2.2, Jan. 2006.
[2] J. Bharadwaj, W. Chen, W. Chuang, G. Hoflehner, K. Menezes, K. Muthukumar, and J. Pierce, "The Intel IA-64 compiler code generator," IEEE Micro, 2000, pp. 44–53.
[3] M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines," in Proc. of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, June 1988, pp. 318–328.
[4] B. Rau, "Iterative modulo scheduling: An algorithm for software pipelining loops," in Proc. of the 27th International Symposium on Microarchitecture, December 1994, pp. 63–74.
[5] J. Bharadwaj, K. Menezes, and C. McKinsey, "Wavefront scheduling: Path based data representation and scheduling of subgraphs," in Proc. of the 32nd Annual International Symposium on Microarchitecture, Haifa, Israel, November 1999, pp. 262–271.
[6] HP Caliper performance analyzer. [Online]. Available: http://h21007.www2.hp.com/dspp/tech/tech_TechSoftwareDetailPage_IDX/1,1703,1174,00.html
[7] SPEC CPU2006 benchmark suite. [Online]. Available: http://www.spec.org/cpu2006/
[8] R. Krishnaiyer, D. Kulkarni, D. Lavery, W. Li, C.-C. Lim, J. Ng, and D. Sehr, "An advanced optimizer for the IA-64 architecture," IEEE Micro, 2000, pp. 60–68.
[9] V. Santhanam, E. Gornish, and W. Hsu, "Data prefetching on the HP PA-8000," in Proc. of the 24th International Symposium on Computer Architecture, Denver, Colorado, United States, June 1997, pp. 264–273.
[10] S. Ghosh, A. Kanhere, R. Krishnaiyer, D. Kulkarni, W. Li, C.-C. Lim, and J. Ng, "Integrating high-level optimizations in a production compiler: Design and implementation experience," in Proc. of the 12th International Conference on Compiler Construction, 2003, pp. 303–319.
[11] P. Knijnenburg, T. Kisuki, and M. O'Boyle, "Iterative compilation," in Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation (SAMOS), 2002, pp. 171–187.
[12] S. Triantafyllis, M. Vachharajani, and D. August, "Compiler optimization-space exploration," Journal of Instruction-Level Parallelism, vol. 7, 2005.
[13] K. Cooper, D. Subramanian, and L. Torczon, "Adaptive optimizing compilers for the 21st century," Journal of Supercomputing, vol. 23, no. 1, pp. 7–22, 2002.
[14] B. Aarts, M. Barreteau, F. Bodin, P. Brinkhaus, Z. Chamski, H.-P. Charles, C. Eisenbeis, J. Gurd, J. Hoggerbrugge, P. Hu, W. Jalby, P. Knijnenburg, M. O'Boyle, E. Rohou, R. Sakellariou, H. Schepers, A. Seznec, E. Stohr, M. Verhoeven, and H. Wijshoff, "OCEANS: Optimizing compilers for embedded applications," in Proc. of the European Conference on Parallel Processing, 1997, pp. 1351–1356.
[15] A. P. Nisbet, "GAPS: Iterative feedback directed parallelisation using genetic algorithms," in Proc. of the Workshop on Profile and Feedback-Directed Compilation, 1998.
[16] V. Bala, E. Duesterwald, and S. Banerjia, "Dynamo: A transparent dynamic optimization system," in Proc. of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '00), 2000.
[17] D. Bruening, E. Duesterwald, and S. Amarasinghe, "Design and implementation of a dynamic optimization framework for Windows," in 4th ACM Workshop on Feedback-Directed and Dynamic Optimization (FDDO-4), 2001.
[18] D. Bruening, T. Garnett, and S. Amarasinghe, "An infrastructure for adaptive dynamic optimization," in Proc. of the International Symposium on Code Generation and Optimization (CGO '03), 2003.
[19] H. Chen, J. Lu, W. Hsu, and P. Yew, "Continuous adaptive object-code re-optimization framework," in Ninth Asia-Pacific Computer Systems Architecture Conference (ACSAC 2004), 2004, pp. 241–255.
[20] J. Lu, H. Chen, P. Yew, and W. Hsu, "Design and implementation of a lightweight dynamic optimization system," Journal of Instruction-Level Parallelism, vol. 6, 2004.
[21] U. Holzle and D. Ungar, "Reconciling responsiveness with performance in pure object-oriented languages," ACM Trans. on Programming Languages and Systems, vol. 18, no. 4, 1996, pp. 355–400.
[22] M. Paleczny, C. Vick, and C. Click, "The Java HotSpot server compiler," in Proc. of the USENIX Java Virtual Machine Research and Technology Symposium (JVM '01), 2001, pp. 1–12.
[23] S. Fink and F. Qian, "Design, implementation and evaluation of adaptive recompilation with on-stack replacement," in Proc. of the International Symposium on Code Generation and Optimization, 2003, pp. 241–252.
[24] M. Arnold, S. Fink, D. Grove, M. Hind, and P. Sweeney, "Adaptive optimization in the Jalapeno JVM," in Proc. of the ACM SIGPLAN 2000 Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 2000.
[25] M. Voss and R. Eigenmann, "High-level adaptive program optimization with ADAPT," ACM SIGPLAN Notices, vol. 36, no. 7, July 2001, pp. 93–102.
