ECO: an Empirical-based Compilation and Optimization System

Nastaran Baradaran, Jacqueline Chame, Chun Chen, Pedro Diniz, Mary Hall, Yoon-Ju Lee, Bing Liu and Robert Lucas
University of Southern California / Information Sciences Institute
Marina del Rey, California 90292-6695
{nastaran, jchame, chunchen, pedro, mhall, yoonju, bliu, rflucas}@isi.edu
Abstract

In this paper, we describe a compilation system that automates much of the process of performance tuning that is currently done manually by application programmers interested in high performance. Because accurately predicting performance is increasingly difficult, our system incorporates empirical techniques that execute variants of code segments with representative data on the target architecture. In this paper, we discuss how empirical techniques and performance modeling can be effectively combined. We also discuss the role of historical information from prior runs, and programmer specifications supporting run-time adaptation. These techniques can be employed to alleviate some of the performance problems that lead to inefficiencies in key applications today: register pressure, cache conflict misses, and the trade-off between synchronization, parallelism and locality in SMPs.
1 Introduction

The last decade has witnessed dramatic improvements in the effectiveness of compiler technology for high-performance computing. In spite of these achievements, there is plenty of anecdotal evidence that important science and engineering applications are nevertheless executing at only a small fraction of peak performance on systems ranging from PCs to ASCI supercomputers and computational grids. The sources of this inefficiency stem from trends in architecture, compilers and applications. On the architecture side, there is a growing gap between peak computation rate and memory and communication bandwidth across this whole range of systems. Thus, computational units are often stalled waiting on local memory accesses or communication with remote processors. Architectural and compiler solutions to combat
Work sponsored by the National Science Foundation (NSF) under award ACI-0204040.
this performance gap have resulted in systems of enormous complexity. As a result, statically predicting the impact of individual compiler optimizations and the aggregate impact of a collection of optimizations is becoming increasingly difficult for both compiler and application developers. Application complexity is also increasing dramatically. Scientific programs with several hundred thousand lines of code are now fairly common, problem sizes are scaling up dramatically, and applications are becoming increasingly irregular in terms of memory access behavior and parallelism. The net effect of this increased hardware and software complexity is that programmers interested in high performance are spending more and more of their time on performance tuning. The process of manual performance tuning involves focusing on a few key computational components of an application. For each component, the programmer derives a sequence of variants with different implementations for performing the computation. Each variant is first debugged, and then its performance characteristics are evaluated. This lengthy process continues until the programmer either arrives at a suitable variant or, as is often the case, decides to give up on further performance tuning. New variants may be required for characteristically different input data sets, or as the application is ported to other platforms. In this paper, we describe an application-compiler-runtime system we are developing that seeks to automate this performance tuning process as much as possible. Our approach is partially motivated by the successes of several recent self-tuning libraries and domain-specific systems that employ empirical techniques to systematically evaluate a collection of automatically-generated variants [48, 6, 20, 51]. Rather than trying to predict performance properties through analysis, variants are executed with representative input data sets so that performance can be measured and compared. In our system, we will generalize the notion of empirical optimization so that it can be applied to full scientific applications. The key considerations in our system are how to make empirical optimization of full applications both practical
and effective. Consider the ATLAS system, which automatically derives highly tuned implementations of matrix multiply; ATLAS takes a few hours to self-tune on a new platform. It is simply not feasible to spend several hours per loop nest tuning a several-hundred-thousand-line program. Further, ATLAS uses a code generation strategy for deriving variants that is specifically tailored to matrix multiply, and is the result of decades of study of the performance properties of matrix multiply on modern architectures. As part of a strategy for empirical optimization of application programs, we must generalize the derivation and testing of code segment variants. The remainder of the paper is organized as follows. The next section presents an overview of the system we are developing, and the four research areas that will be the focus of our work. Each of these areas is described in more detail in Sections 3 through 6. Section 7 describes related work in this area, followed by the summary in Section 8.
2 Overview

We now provide the rationale for our approach, and describe the system we are developing.
2.1 Motivation

The fundamental restructuring of the compilation process in our system is motivated by both requirements and opportunities in high-end computing. First, it probably goes without saying that big science and engineering problems require high performance. Their users tend to solve whatever problem size their computing platforms can support, and increased efficiency enables bigger problems to be solved, thus accelerating the pace of discovery in their field. Portability is also essential. A grid application code may be distributed across different platforms, even within a single execution. Hand optimizations are often not portable from one platform to another. The need to support so many different platforms makes machine-dependent hand tuning an intractable requirement. A compiler or run-time empirical-based approach could facilitate porting, with high performance, from one platform to another. We consider strategies for both off-line (i.e., compile time) and on-line (i.e., run time) optimization. In big science and engineering applications, there is an opportunity for carefully studied optimization solutions. High-end applications often have execution times that range from days to months (e.g., Grid application runs have taken over a month, and ASCI milepost runs have taken up to 45 days). For such long-running jobs, we can afford, even at run time, to perform experiments to improve application efficiency. If a job runs for a month, spending a few hours tuning its key computational kernels, if it yields even a modest performance improvement, is well worth the time spent. For an overnight job, even an hour spent optimizing the code may be reasonable. Thus, this is a very different problem from traditional compilation and modern dynamic optimization frameworks, where optimization speed is a high priority. Optimization speed is nevertheless a consideration, and we have developed strategies for trading off optimization quality and optimization speed. These strategies are particularly important in on-line optimization.

2.2 System Description

Figure 1 depicts the overall flow of our system. The goal of the system is to automate and enhance the manual performance tuning process currently performed by programmers. The centerpiece of the system is a suite of sophisticated program analyses. The goal of these analyses is to derive performance models from application code and, for compiler-directed derivation of program variants, to guide the application of program transformations. The system derives and manages program variants and guides the selection of the variants that lead to high-performance application codes. To accomplish this, the system includes two different feedback loops.
[Figure 1 appears here. Its components include: Application Source Code, Application Inputs, Target Architecture Description, Compiler Analysis, Eligible Transformations, Performance Models, Performance Models Database, Experiments Engine ("Good Enough?" decision), Performance Results Database, Instrumentation, Tuned Executable, Model Evaluation Code, Skeleton, Alternate Code Variants, Performance Models Evaluation, Performance Predictions, Executable, Performance Metrics, Run-time System, and Execution Environment.]
Figure 1. Overall System Flow.

The experiments loop uses either an application specification or the results of program analysis to derive a series of experiments, as defined by an experiments engine. For each experiment, the system derives variants of individual code segments that are expected to address performance problems of the current implementation. The selection of the appropriate
variant can occur either on-line inside each run, by having the compiler generate code that dynamically jumps between a set of compiled versions (as developed in [17]), or off-line prior to execution. The modeling loop employs performance modeling techniques to assess the expected performance of the code variants without necessarily executing them completely. The main goal of this loop is to aid the experiments engine in pruning the potentially immense search space of available program transformations for each code section. This modeling loop uses a skeleton of the application code, along with the application's inputs, to extract information about the values of particular variables that impact the performance of the code section. Overall, the objective of the system is to combine a set of techniques, including on-line experiments, off-line experiments and performance modeling, to guide the optimization of application programs. The results of experiments and performance models are retained to improve optimization across program runs. Given the large scale of the applications we are targeting, we believe a single general heuristic for the application of program transformations will not lead to higher overall performance. Also, given the vast search spaces for program transformations, we believe that on-line and off-line experiments, combined with performance modeling and historical information to prune these search spaces, will lead to an effective strategy for tackling these complex applications.
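To illustrate the on-line path, the following is a minimal sketch in the spirit of the dynamic selection of compiled versions developed in [17]. The variant names, the sampling period, and the timing-based cost are all assumptions for illustration, not the system's actual generated code.

#include <chrono>

// Illustrative stand-ins for two compiler-generated variants of the
// same code segment (e.g., two tilings of a loop nest).
void segment_v0(double* a, int n) { /* variant 0 body (stub) */ }
void segment_v1(double* a, int n) { /* variant 1 body (stub) */ }

// Minimal on-line driver: periodically time each variant (sampling
// phase), then dispatch to the winner (production phase).
void run_segment(double* a, int n) {
    using clock = std::chrono::steady_clock;
    static void (*const variants[])(double*, int) = { segment_v0, segment_v1 };
    static double sample_time[2] = { 1e300, 1e300 };
    static int best = 0;
    static long calls = 0;
    const long kSamplePeriod = 1000;  // assumed length of a production phase

    long c = calls++;
    if (c % kSamplePeriod < 2) {
        // Sampling phase: time one variant per call, then update the winner.
        int v = (int)(c % 2);
        auto t0 = clock::now();
        variants[v](a, n);
        sample_time[v] = std::chrono::duration<double>(clock::now() - t0).count();
        best = (sample_time[0] <= sample_time[1]) ? 0 : 1;
    } else {
        variants[best](a, n);  // production phase: run the current winner
    }
}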
2.3 Research Themes

Although other researchers have been developing compiler support for empirical optimization and related techniques, we believe four key research themes distinguish our approach:

- Combining empirical techniques with performance models.
- Utilizing search space properties for individual and combined optimizations.
- Exploiting historical information to improve optimization over time.
- Engaging the programmer to derive program variants.

The next four sections describe the relationship between the system design and these four key research themes.
3 Combining Empirical Techniques with Performance Models

Modeling the attainable performance of a code segment to which the compiler has applied a set of program transformations would allow the compiler to determine which
transformations are clearly unprofitable and thereby substantially prune the search space of program transformations. This modeling approach is also valuable in scenarios where the application can take an extremely long time to execute, making on-line optimization impractical, or when attempting to predict the performance of applications on future architectures. We now describe an approach we are exploring that combines empirical data with static analysis to model the attainable performance of applications. Our modeling approach relies on the collaboration between static compiler analyses and run-time data to derive more accurate performance models and corresponding parameters. This approach consists of three phases. During the first phase, the compiler identifies the application's control-flow graph (CFG) and isolates the program's basic blocks of instructions by analysis of the binary code. The goal of this phase is to map instruction sequences (whenever possible) to high-level programming constructs using tools such as HPCView [31]. In this phase the compiler also collects static information for each basic block, such as the number of memory references and address calculations (whenever possible), along with floating-point and integer operations. Collecting such static metrics by looking at the source is bound to lead to terribly inflated metrics, as compilers apply constant propagation and common-subexpression elimination (CSE), which are pervasive in many forms of array addressing. In this phase the compiler also defines, for each basic block, a set of possible execution scenarios to be used for the derivation of performance estimates and bounds. The scenarios are characterized by particular execution states of selected instructions that are bound to affect the overall performance of the basic block's instruction sequence. For example, if a given load instruction misses in the cache, or, worse, in the TLB, it might create a bubble in the floating-point pipeline, causing other instructions not to retire. On the other hand, if the instructions are perfectly scheduled, the code could reach maximal performance constrained only by the true data dependences of the program. This phase, therefore, precomputes a set of these scenarios so that the compiler can use real data to determine which scenarios apply, and hence what performance is to be expected for each code section. The compiler must therefore engage in an on-line experiment to extract the values of the program variables that can impact the choice of the performance model for each code section. The selection of input program variables and their corresponding values is accomplished by a combination of static analysis and a skeleton code generation scheme. The static analysis determines, for each code section, which of the input program variables are of interest. With this knowledge the compiler then automatically generates a skeleton code that uncovers, for each section, the values of these variables. Using the resulting output
values of interest, the compiler can then determine which performance model is most appropriate for each section of the code and, in some cases, obtain much more detailed information about the number of times each code section should execute. Using the model, the compiler can then plug in the actual values for the model parameters and derive, based on internal analytical expressions for each model, more accurate performance predictions. For this modeling approach to be effective in delivering meaningful and accurate performance bounds to the compilation system, various challenges need to be addressed. The first and most important challenge is the development of accurate performance models. As a starting point we will use simple memory hierarchy models and then, over time or possibly through the course of the off-line experiments, determine when the models should be refined to include other architectural features. A simple approach would use a set of pre-simulated models (using a cycle-accurate simulator) to extract performance parameters for observed instruction sequences and then feed that information back in future runs. The second challenge is the development of program analyses that extract relevant values to be observed by the skeleton code. In the general case, all values are interdependent, forcing such an approach to execute the complete code. In practice, there are variables that are easily detected and extracted, such as loop bounds that are only known at run time, which dictate the number of times loops execute and the size and shape of the iteration space. These are fairly simple to identify for the class of applications we are focusing on. Finally, from a practical viewpoint, a substantial challenge will be to develop a compilation system that scales and can effectively handle large scientific codes. One possible approach is to rely on memoization techniques. The compiler can build a library of known basic-block instruction sequences with their corresponding performance models and reuse them in the same application or even across applications. For compactness, the compiler may even sacrifice some accuracy by deeming equivalent instruction sequences that are not exactly the same but lead to very similar performance metrics. While there are a number of challenges to deriving accurate and efficient performance models, we believe it is possible to build an effective compilation system that uses performance modeling estimates to effectively prune the overall design space exploration of program transformations. In our recent research in the context of automatic mapping of high-level programs to hardware, we have built a compilation system that successfully explores a very tiny fraction of the search space while reaching an optimal hardware design in a small fraction of the time a human programmer would
take. The key observation is that performance models need not be precise to be useful if they guide optimization in the right direction.
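To make the skeleton idea concrete, here is a minimal sketch of what a generated skeleton might look like for a hypothetical code section whose performance model is parameterized by run-time loop bounds. The input format, section name, and variable names are illustrative assumptions.

#include <cstdio>

// Hypothetical compiler-generated skeleton for one code section. It
// executes only the statements needed to compute the variables that
// parameterize the section's performance model (here, loop bounds
// read from the application's input), not the computation itself.
int main(int argc, char** argv) {
    if (argc < 2) return 1;
    FILE* in = std::fopen(argv[1], "r");
    if (!in) return 1;

    int nx = 0, ny = 0, nz = 0;  // loop bounds known only at run time
    std::fscanf(in, "%d %d %d", &nx, &ny, &nz);
    std::fclose(in);

    // Report the iteration-space size of the (assumed) triply nested
    // loop so the modeling loop can instantiate its analytical model.
    long trips = (long)nx * ny * nz;
    std::printf("section_12 trip_count=%ld\n", trips);
    return 0;
}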
4 Navigating the Optimization Search Space

The search space of possible optimizations for an application and target architecture is very large, as the application performance depends on the application characteristics, the architectural features of the target machine, the input data size, the set of optimizations and optimization parameters, and on how optimizations interact with each other. Several research projects have recently addressed the problem of navigating this huge optimization space using techniques such as simulated annealing and genetic algorithms [13, 14, 30]. Although these techniques have been shown to find good solutions, they do not exploit any of the knowledge that is available to the compiler about the properties of the optimizations, how optimization parameters affect performance, and how optimizations interact with each other. Our methodology for efficiently navigating this large search space is based on guiding the search and pruning the search space by utilizing the knowledge available
to the compiler from several sources:
- Compiler analysis and performance models
- Optimization properties
- Interactions among optimizations
The remainder of this section discusses the items above in more detail.
4.1 Compiler analysis and performance models

Whenever possible, the compiler should take advantage of available static (or run-time) information to constrain the search, or to direct the search towards optimizations that are likely to be successful. Static analysis and performance models can derive estimates of performance metrics that can be used to guide the search, prune the search space, and decide when to stop searching. As one example, estimates of cache misses based on static analysis and cache models (parameterized by architectural parameters) can be used to guide the search for tile sizes when loop tiling is one of the optimizations being considered by the compiler. Fairly accurate estimates of cache capacity misses can be derived using reuse analysis [49]; the total number of cache misses, including conflict misses, can also be predicted using more complex, and costly, cache models [21, 22, 45]. Such estimates of capacity misses can be used to guide the search for tile sizes, prune tiles from the search space, and decide when to stop
the search. For example, if the search finds a tile size T such that the difference between the number of cache misses observed in the experiments and the estimated capacity misses is smaller than a predetermined threshold, the number of cache conflict misses must be small. Moreover, all tiles whose size in each dimension is smaller than the size of the corresponding dimension of T must result in more capacity misses (but fewer conflict misses). If T satisfies other search criteria (such as few TLB misses), the subspace of tiles with a higher number of capacity misses may not be searched at this point (though it may be searched later if the choice of tile size needs to be re-evaluated). Static analysis and performance models can also be used to derive stop criteria for the search. Timing execution indicates to the compiler that it is improving the code, but it is not a sufficient guide to tell the compiler it has met its goal. The compiler can derive predicted results from static models, and verify during the experiment phase, via performance counters, whether the actual results meet predictions. Once the result is within some threshold of expectations, the compiler can declare success on the code segment and move on to the next.
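As a sketch of this stop criterion, the loop below sweeps candidate tile sizes and stops as soon as the observed misses come within a threshold of the predicted capacity misses. Both measured_misses and predicted_capacity_misses are assumed hooks into the experiments engine and the static cache model (stubbed here), and the power-of-two sweep is an arbitrary choice.

// Assumed hooks (stubs here): a hardware-counter measurement from the
// experiments engine and a capacity-miss estimate from reuse analysis.
double measured_misses(int tile)           { return 0.0; /* stub */ }
double predicted_capacity_misses(int tile) { return 0.0; /* stub */ }

// Sweep tile sizes; stop once conflict misses are provably small,
// i.e., observed misses are within `threshold` of capacity misses.
int search_tile_size(int min_tile, int max_tile, double threshold) {
    int best = min_tile;
    double best_observed = 1e300;
    for (int t = min_tile; t <= max_tile; t *= 2) {  // coarse sweep
        double observed = measured_misses(t);
        if (observed < best_observed) { best_observed = observed; best = t; }
        if (observed - predicted_capacity_misses(t) < threshold)
            return t;  // conflict misses negligible: stop searching
    }
    return best;
}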
4.2 Optimization properties

Some optimizations have known properties or behavior with respect to their parameters that can be used to guide and prune the search. For example, if the search objective function is monotonic with respect to an optimization parameter, the subspaces in which only that parameter varies can be pruned. One particular case of known behavior with respect to an optimization parameter is the relationship between unroll factors and the amount of reuse exposed by unroll-and-jam. When searching for unroll factors for unroll-and-jam, the compiler can use the knowledge that reuse increases monotonically with the unroll amount of a loop (while the unroll factors of other loops and other parameters are fixed), and avoid evaluating all possible unroll factors for each loop [9].
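The following sketch shows how such a monotonicity property can prune the search: since both reuse and register pressure grow monotonically with the unroll factor, the scan can stop at the first factor whose estimated register pressure exceeds the register file, rather than evaluating every factor. The estimator is an assumed stand-in for the compiler's analysis, stubbed here.

// Assumed stand-in for compiler analysis (stub here).
int estimated_register_pressure(int unroll) { return unroll * 4; /* stub */ }

// Monotonic search: reuse and register pressure both increase with the
// unroll factor, so the largest feasible factor exposes the most reuse,
// and the scan stops at the first infeasible factor.
int choose_unroll_factor(int max_unroll, int num_registers) {
    int best = 1;
    for (int u = 1; u <= max_unroll; ++u) {
        if (estimated_register_pressure(u) > num_registers)
            break;  // monotone: all larger factors are infeasible too
        best = u;   // monotone: a larger feasible factor exposes more reuse
    }
    return best;
}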
4.3 Interactions among optimizations

Optimizations targeting a specific goal or architecture component often affect the application performance with respect to other architecture components, sometimes with unexpected or undesirable results. Furthermore, two or more optimizations targeting distinct goals may interfere with each other, sometimes offsetting the benefits of one or more of the optimizations. One classic example is the interaction among optimizations targeting the memory hierarchy by exploiting data reuse at several levels (registers, caches and TLB). For example, unroll-and-jam is a code transformation often used
to expose data reuse at the register level. Applying unroll-and-jam changes the iteration order of a loop nest, and consequently changes the data access pattern, which may affect cache and TLB behavior. The same is true when loop tiling is used to increase locality in cache; that is, the change in data access patterns affects other levels of the memory hierarchy. For example, tiling in multiple dimensions and accessing multiple dimensions of an array within a tile may cause TLB misses and offset the benefits of increased locality in cache. In this case the optimization process may choose not to tile in multiple dimensions, or to use small tile sizes in the higher dimensions. Therefore, when optimizing for the memory hierarchy at several levels, the compiler must consider how the optimizations interact with each other, and select optimization parameters taking into account their effects on the overall memory system. When combining several optimizations, the search space for optimization parameters can be pruned when one optimization imposes constraints on another. Using the same example of unroll-and-jam for register reuse and loop tiling for cache locality, the compiler must decide which loops should be unrolled and which loops should be tiled, and find unroll factors and tile sizes for each transformed loop. When a loop is selected for both unroll-and-jam and tiling, the search for the unroll factor of the loop is constrained by the properties of unroll-and-jam in isolation (the unroll factor is limited by register pressure, and both reuse and register pressure increase monotonically with the unroll factor), and it is also bounded by the tile size of the loop. Therefore, for each candidate unroll factor, tile sizes that are smaller than the unroll factor do not need to be searched.
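A sketch of this pruning follows, assuming a hypothetical evaluate() call into the experiments engine (stubbed here): tile sizes smaller than the candidate unroll factor are simply never generated.

#include <utility>

// Assumed hook (stub here): run or model one (unroll, tile) variant
// and return its cost, e.g., execution time.
double evaluate(int unroll, int tile) { return 0.0; /* stub */ }

// Joint search over unroll factors and tile sizes, pruned by the
// constraint above: for each unroll factor u, tiles smaller than u
// need not be searched, so the tile sweep starts at u.
std::pair<int, int> search_unroll_and_tile(int max_unroll, int max_tile) {
    std::pair<int, int> best{1, 1};
    double best_cost = 1e300;
    for (int u = 1; u <= max_unroll; ++u) {
        for (int t = u; t <= max_tile; t *= 2) {  // t >= u by construction
            double c = evaluate(u, t);
            if (c < best_cost) { best_cost = c; best = {u, t}; }
        }
    }
    return best;
}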
5 Exploiting Historical Information

An additional strategy for amortizing the overhead of empirical optimization is to use the results of previous executions to guide optimization. This approach is distinct from the profile-directed optimization now in common use, whereby a single sample run collects a profile that is then used in optimization. Instead, we will develop a strategy that promises to improve a program's performance over time, with each execution. It will be particularly effective in today's internet computing environments, such as a computational grid, because the results of many users' runs can be exploited to increase the effectiveness of optimization. We maintain this historical information in an experience base. As a starting point, an experience base can be used to retrieve, at a level of detail useful to the compilation tools, the results of prior optimizations of the same code segment executed under very similar conditions. Then, the compiler can simply choose the best optimization strategy from the previous runs, or if more optimization is needed, can use prior
results as a starting point for continuing the optimization of the code. In either case, retrieving data from the experience base would involve certification that the conditions from the previous run are met: the same version of the code segment is being executed on the same architecture under the same system load, the data set size is similar, and the versions of the native compiler, operating system, communication libraries, and any other software that affects performance are also the same. Beyond information used for certification, the experience base stores, at the very least, the source code for the optimized code segment and a script for the compiler to repeat its previous optimizations. If there is no prior run that matches the current conditions, then there is the more challenging problem of generalizing the results of prior optimization under different conditions. We can assume that certain differences, such as different versions of system software, should not have much effect, and then verify this by running the code. Other differences, such as system load, cache parameters and data set size, will have a significant effect, and a challenging research question is whether it is possible to generalize for such circumstances. In any event, for applications that execute frequently without change to the source code of the optimized code segments, over time the experience base will contain a significant set of prior runs, and thus a rich supply of information upon which to base optimization decisions. As described in Section 2, this experience base can also be used to store prior performance models. Thus, not only will the optimization results improve over time, but so will the performance models from Section 3. In addition to application-specific data, the experience base should include more detailed architectural descriptions. These, too, can be collected using empirical mechanisms such as microbenchmarking [40]. Previously, microbenchmarking has been shown to accurately determine cache size, associativity, line size and latencies, TLB size and memory latencies.
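As a sketch of what the certification key and stored record might look like, assuming illustrative field names (a real experience base would persist this across runs and users):

#include <map>
#include <string>
#include <tuple>

// Certification key: a lookup hits only if the conditions above match.
struct ExperienceKey {
    std::string code_segment_hash;  // same version of the code segment
    std::string architecture;       // same target architecture
    std::string software_stack;     // native compiler, OS, libraries
    long        data_set_size;      // similar data set size (bucketed)
    bool operator<(const ExperienceKey& o) const {
        return std::tie(code_segment_hash, architecture, software_stack,
                        data_set_size)
             < std::tie(o.code_segment_hash, o.architecture,
                        o.software_stack, o.data_set_size);
    }
};

// What is stored per key: at the very least, the optimized source and
// a script for the compiler to repeat its previous optimizations.
struct ExperienceRecord {
    std::string optimized_source;
    std::string optimization_script;
    double      best_observed_time;
};

using ExperienceBase = std::map<ExperienceKey, ExperienceRecord>;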
6 Engaging Application Programmers

In some instances the programmer might wish to exploit knowledge about the application domain or the input data sets by selecting among various distinct, but "functionally" equivalent, implementations of the same computation. For example, the programmer might want to apply a conjugate-gradient iterative method for solving sparse systems of linear equations when the input matrix exhibits a given property, or default to a Gaussian elimination method in other situations. Unfortunately, the notion of algorithmic equivalence is beyond the reach of current compiler technology. To support selection among algorithm variants, we propose a user-directed mechanism, a programming construct called the selector [16], to specify functionally equivalent alternate code
variants. Using the selector, the programmer specifies both the implementation variants and the performance metrics the compiler and the run-time system should observe to select the most efficient variant. By inspection of the selector, the compiler generates code with the various code variants instrumented with calls to performance monitoring library functions. The proposed approach is a valuable implementation technique that helps mitigate most of the complexity of tuning and optimizing high-performance applications, thereby reducing the effort and programmer involvement in this process for both newly developed applications and applications using multiple libraries.

selector Solver(Matrix A, Vector x, Vector b, int neq, int vol) is {
  // The list of alternative code variants; each entry pairs a variant
  // with a probe and a cost function.
  entry FactorMMD(A, x, b) with probe orderingMMD(A)
                           with cost costMetric1(env nonzeros, env clock);
  entry FactorWND(A, x, b) with probe orderingWND(A)
                           with cost costMetric1(env nonzeros, env clock);
  entry FactorMS(A, x, b)  with probe orderingMS(A, nonzeros)
                           with cost costMetric2(neq, nonzeros, env clock);

  // One of many possible policies - typically one, although more
  // than one is possible for distinct call sites of the selector.
  // The default policy is to choose the variant with the lowest cost.
  int policy MinCost() { ... }

  // The default trigger function is { return TRUE; }
  boolean trigger Every(int block) {
    if ((invocation_number % block) == 0) return TRUE;
    return FALSE;
  }
} // end of the selector construct.
Figure 2. Selector Programming Concept.

The selector construct, illustrated in Figure 2, allows the programmer to define several components of the selection process. We determine the dynamic behavior of a selector via a policy function, which defines the code variant to select according to certain metrics. This policy function is supported by two other user-defined functions: a trigger function for the selector and a cost function. A cost function allows the programmer to specify the particular cost metrics used when evaluating the benefits of each of the code variants. In the simplest case, the cost function can be execution time; in a more sophisticated case, it can include synchronization overhead in a parallel implementation. The trigger function allows the programmer to specify when a sampling phase should occur. In a typical scientific application, the programmer invokes the selector inside a loop construct and might want to evaluate each of the available implementations for a set of consecutive iterations and then use the best available implementation for a longer range of the loop iterations. Alternatively, the programmer can exploit monotonic cost behavior and specify that if the last sampled code variant had a very low cost, there is no need to sample additional code variants. This approach allows the programmer to control all aspects of the dynamic selection process. A wide variety of default actions will allow programmers to perform simple tuning. For example, an absent trigger function implies that the sampling phase is performed only once, until a reset of the selector is explicitly executed. The absence
of a trigger function thus covers situations in which the relative performance of the code variants does not change over time but is unknown at compile time. All of these decisions and transformations can easily be done by a compiler, relieving the programmer of the tedious and error-prone task of managing the instrumentation and ensuring the correctness of the code generation they imply.
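To illustrate what the compiler-generated code for the Solver selector might look like, here is a minimal sketch using execution time as the cost and the Every trigger; the variant stubs, the sampling bookkeeping, and the block size are assumptions, not the actual generated code.

#include <cfloat>
#include <chrono>

// Stand-ins for the selector's code variants (stubs here).
void factor_mmd() { /* FactorMMD body */ }
void factor_wnd() { /* FactorWND body */ }
void factor_ms()  { /* FactorMS body */ }

void solver_call() {
    using clock = std::chrono::steady_clock;
    static void (*const variants[])() = { factor_mmd, factor_wnd, factor_ms };
    static double cost[3] = { DBL_MAX, DBL_MAX, DBL_MAX };
    static long invocation_number = 0;
    static int selected = 0;
    const int kBlock = 100;  // trigger Every(100): assumed sampling period

    long n = invocation_number++;
    if (n % kBlock < 3) {
        // Sampling phase: run one variant per invocation, record its cost.
        int v = (int)(n % 3);
        auto t0 = clock::now();
        variants[v]();
        cost[v] = std::chrono::duration<double>(clock::now() - t0).count();
        // MinCost policy: select the variant with the lowest observed cost.
        selected = 0;
        for (int i = 1; i < 3; ++i)
            if (cost[i] < cost[selected]) selected = i;
    } else {
        variants[selected]();  // production phase: dispatch the winner
    }
}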
7 Related Work

We now briefly describe related work in the areas of empirical optimization for scientific libraries, feedback-directed optimization approaches and metacomputing environments. For brevity, we only briefly summarize related work regarding memory hierarchy optimization using loop transformations.
7.1 Empirical Optimization for Libraries

There are several on-going research projects in empirical optimization of scientific libraries, such as ATLAS [48] and PHiPAC [6], and of domain-specific libraries, such as UHFFT [32], FFTW [20] and SPIRAL [51]. ATLAS generates high-performance implementations of the Basic Linear Algebra Subroutines (BLAS) based on empirical performance data. PHiPAC generates highly tuned implementations of matrix multiply, with performance comparable to hand-tuned vendor libraries, by searching a large space of potential optimizations. FFTW uses a combination of static models and empirical techniques to optimize FFTs. SPIRAL generates optimized digital signal processing (DSP) libraries by searching a large space of implementation choices and evaluating their performance empirically.
7.2 Memory Hierarchy Optimizations

There has been extensive research on improving the cache performance of scientific applications, most of it targeting loop nests with regular data access patterns [8, 49, 50]. Loop optimizations for improving data locality, such as tiling, interchange and skewing, focus on reducing cache capacity misses. More recent research also addresses conflict misses, which play an important role in the performance of tiled loop nests. Some of the research targeting conflict misses proposes algorithms for tuning tile sizes to the problem and cache sizes [10, 12, 37, 27]. Others use data transformations [3] or combine data transformations with tiling [38] to prevent cache conflicts. Copying data tiles to contiguous memory locations at run time [46] is yet another technique used in combination with tiling to eliminate conflict misses. Most of the research on cache conflicts is based on models of cache misses. Some of the initial work in reducing cache conflicts proposes approximate
models of cache misses that can be used for guiding optimizations [45]. More recent research proposes precise, although more complex and costly, models of cache misses [11, 21].
7.3 Feedback-Directed Optimizations

In many instances, the lack of statically available information may prevent the compiler from applying program transformations aggressively, or from applying them at all. Several systems address this problem with a combination of static information and run-time testing. The inspector/executor approach dynamically analyzes the values in index arrays to automatically parallelize computations that access irregular meshes [29, 42]; compiler-based approaches automatically derive the corresponding run-time tests [35]. Speculative approaches optimistically execute loops in parallel, rolling back the computation if the parallel execution violates the data dependences [36]. Dynamic compilation systems enable code generation at run time [2, 18, 28]. Because delaying compilation until run time provides information about concrete values of input parameters, the compiler may be able to generate more efficient code. Other researchers have recognized the need to use dynamic performance data to optimize execution based on a set of control variables that parameterize an algorithm in the implementation [4, 15, 39, 41, 52]. There are also a few compilation systems that employ empirical optimization techniques in more restricted ways than the proposed research. In [34] and [7], a combination of static analysis and run-time experimentation is used to make run-time locality decisions. SPARSITY allows users to automatically build the most efficient sparse matrix kernels for the input data and target architecture [26].
7.4 Languages and Compiler Support for Adaptive Applications

Other researchers have also recognized the lack of adequate programming language support for applications that must react to changing environments. Currently, programmers must intermix their application code with run-time-system calls that implement the desired adaptation or optimization policies. The resulting code is complex and virtually impossible to port and maintain. In ADAPT [47], researchers have defined new languages used exclusively to specify adaptation policies, triggering events and performance metrics. Programmers develop their base application textually independent of their adaptation specification. While attractive from the viewpoint of separation of concerns, this approach suffers from problems of naming and scoping as well as coherency. An alternative approach
extends existing languages to provide hints or directives to the compiler about the dynamic nature of the application. In [1], the authors describe an extension to the class hierarchy of an object-oriented model of computation with three basic concepts: adaptors, metrics and events. Our approach explicitly defines the concepts of adaptation, metrics and events via functions, in a language-independent fashion that is not necessarily tied to an object-based model, and allows the definition of call-site-specific policy selection.
8 Summary

This research program aims to dramatically increase the efficiency of application performance through resource-aware optimization, a combined application-compiler-runtime methodology that fundamentally restructures the compilation process to yield high performance across the many levels in an individual computer or even a computational grid. We have presented an overview of off-line and on-line empirical optimization techniques that can be used to alleviate some of the performance problems that lead to inefficiencies in key applications today: register pressure, cache conflict misses, and the trade-off between synchronization, parallelism and locality in SMPs.
References

[1] V. Adve, R. Bagrodia, E. Deelman, T. Phan and R. Sakellariou. Compiler-Supported Simulation of Very Large Parallel Applications. In Proceedings of SC99, Nov. 1999.
[2] J. Auslander, M. Philipose, C. Chambers, S. Eggers, and B. Bershad. Fast, effective dynamic compilation. In Proc. of the SIGPLAN '96 Conference on Program Language Design and Implementation, ACM Press, May 1996.
[3] D. Bacon et al. A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness. In Proc. CASCON'94 conf., Nov. 1994.
[4] V. Bala, E. Duesterwald, S. Banerjia. Dynamo: A Transparent Run-Time Optimization System. In Proc. of the ACM Conference on Programming Languages Design and Implementation (PLDI'00), ACM Press, 2000, pp. 1-12.
[5] F. Berman, A. Chien, K. Cooper, J. Dongarra, I. Foster, D. Gannon, L. Johnsson, K. Kennedy, C. Kesselman, J. Mellor-Crummey, D. Reed, L. Torczon and R. Wolski. The GrADS Project: Software Support for High-Level Grid Application Development. To appear in International Journal of High Performance Computing Applications, 15(4), 2001.
[6] J. Bilmes, K. Asanovic, C.-W. Chen, J. Demmel. Optimizing Matrix Multiply using PHiPAC: a Portable High-Performance ANSI-C Coding Methodology. In Proc. of the ACM International Conference on Supercomputing, 1997.
[7] F. Bodin, T. Kisuki, P. M. W. Knijnenburg, M. F. P. O'Boyle and E. Rohou. Iterative Compilation in a Non-Linear Optimization Space. In Proceedings of the Workshop on Profile and Feedback-Directed Optimization, Oct. 1998.
[8] S. Carr, K. McKinley and K. Kennedy. Compiler Optimizations for Improving Cache Locality. In Proc. of Supercomputing '92, Nov. 1992.
[9] S. Carr and K. Kennedy. Improving the Ratio of Memory Operations to Floating-Point Operations in Loops. In ACM Transactions on Programming Languages and Systems (TOPLAS), 15(3), pp. 400-462, July 1994.
[10] J. Chame and S. Moon. A tile selection algorithm for data locality and cache interference. In Proc. of the 1999 ACM International Conference on Supercomputing, pp. 492-499, ACM Press, June 1999.
[11] S. Chatterjee, E. Parker, P. J. Hanlon, A. R. Lebeck. Exact Analysis of the Cache Behavior of Nested Loops. In Proc. of the 2001 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'01), ACM Press, pp. 286-297, June 2001.
[12] S. Coleman and K. McKinley. Tile size selection using cache organization and data layout. In Proc. of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation (PLDI'95), pp. 279-290, ACM Press, June 1995.
[13] K. Cooper, P. J. Schielke and D. Subramanian. Optimizing for reduced code space using genetic algorithms. In Proceedings of the 1999 Workshop on Languages, Compilers and Tools for Embedded Systems (LCTES), May 1999.
[14] K. Cooper, D. Subramanian and L. Torczon. Adaptive Optimizing Compilers for the 21st Century. Journal of Supercomputing, 2002.
[15] A. Cox and R. Fowler. Adaptive cache coherency for detecting migratory shared data. In Proc. of the 20th International Symposium on Computer Architecture (ISCA'93), ACM Press, May 1993.
[16] P. Diniz and B. Liu. Selector: An Effective Technique for Adaptive Computing. To appear in Proc. of the Fourteenth Workshop on Languages and Compilers for Parallel Computing (LCPC'01).
[17] P. Diniz and M. Rinard. Dynamic Feedback: An Effective Technique for Adaptive Computing. In Proc. of the SIGPLAN '97 Conference on Program Language Design and Implementation (PLDI'97), ACM Press, June 1997.
[18] D. Engler. VCODE: A retargetable, extensible, very fast dynamic code generation system. In Proc. of the ACM Conference on Program Language Design and Implementation (PLDI'96), ACM Press, June 1996.
[19] I. Foster and C. Kesselman, eds. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1999.
[20] M. Frigo. A Fast Fourier Transform Compiler. In Proc. of the 1999 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'99), ACM Press, June 1999.
[21] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical representation of cache misses. In Proc. of the 1997 ACM International Conference on Supercomputing (ICS'97), Vienna, Austria, July 1997.
[22] S. Ghosh, M. Martonosi, and S. Malik. Precise miss analysis for program transformations with caches of arbitrary associativity. In Proc. of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'98), pp. 228-239, ACM Press, New York, Oct. 1998.
[23] S. Graham, P. Kessler, and M. McKusick. gprof: a call graph execution profiler. In Proc. of the SIGPLAN '82 Symposium on Compiler Construction, Boston, MA, June 1982.
[24] A. Grimshaw and A. Wulf. Legion – A View From 50,000 Feet. In Proc. of the Fifth IEEE International Symposium on High Performance Distributed Computing, IEEE Computer Society Press, Aug. 1996.
[25] A. Grimshaw, J. Weissman, E. West, and E. Loyot. Metasystems: An Approach Combining Parallel Processing and Heterogeneous Distributed Computing Systems. Journal of Parallel and Distributed Computing, 21(3), pp. 257-270, June 1994.
[26] E.-J. Im and K. Yelick. Optimizing Sparse Matrix Computation for Register Reuse in SPARSITY. In International Conference on Computational Science, special session on Architecture-Specific Performance Tuning, May 2001. To be published in Springer LNCS series.
[27] M. Lam, E. Rothberg, and M. Wolf. The cache performance and optimization of blocked algorithms. In Proc. of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'91), pp. 63-74, ACM Press, Apr. 1991.
[28] P. Lee and M. Leone. Optimizing ML with run-time code generation. In Proc. of the ACM Conference on Program Language Design and Implementation (PLDI'96), ACM Press, June 1996.
[29] S. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In Proc. of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'93), pp. 83-91, ACM Press, May 1993.
[30] H. Massalin. Superoptimizer: A Look at the Smallest Program. In Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'87), ACM Press, pp. 122-126, Apr. 1987.
[31] J. Mellor-Crummey, R. Fowler, G. Marin and N. Tallent. HPCView: A Tool for Top-Down Analysis of Node Performance. In Journal of Supercomputing, 23, pp. 81-104, 2002.
[32] D. Mirkovic and S. L. Johnsson. Automatic Performance Tuning in the UHFFT Library. In Proceedings of the International Conference on Computational Science (ICCS'01), May 2001.
[33] N. Mitchell, L. Carter, and J. Ferrante. A Modal Model of Memory. In International Conference on Computational Science, special session on Architecture-Specific Performance Tuning, May 2001. To be published in Springer LNCS series.
[34] N. Mitchell, L. Carter, J. Ferrante, and K. Högstedt. Quantifying the Multi-Level Nature of Tiling Interactions. In Proc. of the 10th International Workshop on Languages and Compilers for Parallel Computing (LCPC'97), Aug. 1997.
[35] S. Moon and M. Hall. Evaluation of Predicated Array Data-Flow Analysis for Automatic Parallelization. In Proc. of the ACM Symposium on Principles and Practice of Parallel Programming (PPoPP'99), ACM Press, May 1999.
[36] L. Rauchwerger and D. Padua. The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization. In SIGPLAN '95 Conference on Programming Language Design and Implementation (PLDI'95), ACM Press, June 1995.
[37] G. Rivera and C.-W. Tseng. Data transformations for eliminating conflict misses. In Proc. of the ACM Conference on Programming Language Design and Implementation (PLDI'98), pp. 34-49, ACM Press, June 1998.
[38] G. Rivera and C.-W. Tseng. Eliminating conflict misses for high performance architectures. In Proc. of the 1998 International Conference on Supercomputing, pp. 353-360, ACM Press, July 1998.
[39] T. Romer, D. Lee, B. Bershad and J. Chen. Dynamic page mapping policies for cache conflict resolution on standard hardware. In Proc. of the First USENIX Symp. on Operating Systems Design and Implementation, ACM Press, Nov. 1994, pp. 255-266.
[40] R. Saavedra, S. Gaines, and M. Carlton. Characterizing the Performance Space of Shared Memory Computers Using Micro-Benchmarks. In Proc. of Hot Interconnects 1993, Aug. 1993.
[41] R. Saavedra and D. Park. Improving the effectiveness of software prefetching with adaptive execution. In Proc. of the 1996 Conference on Parallel Architectures and Compilation Techniques (PACT'96), ACM Press, Oct. 1996.
[42] J. Saltz, H. Berryman and J. Wu. Multiprocessors and runtime compilation. Concurrency: Practice & Experience, 3(6):573-592, Dec. 1991.
[43] J. Shalf. Lawrence Berkeley National Lab., Private Communication, Oct. 2001.
[44] The SUIF Compiler Infrastructure. Public domain software (www.suif.stanford.edu).
[45] O. Temam, C. Fricker, E. Granston, and W. Jalby. Cache performance analysis. APPARC deliverable PME2, Leiden University, July 1993.
[46] O. Temam, E. Granston, and W. Jalby. To copy or not to copy: A compile-time technique for assessing when data copying should be used to eliminate cache conflicts. In Proc. of Supercomputing '93, pp. 401-419, Portland, Oregon, Nov. 1993.
[47] M. Voss and R. Eigenmann. High-Level Adaptive Program Optimization with ADAPT. In Proc. of the ACM SIGPLAN Conference on Principles and Practice of Parallel Programming (PPoPP'01), ACM Press, June 2001.
[48] C. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proc. of Supercomputing (SC'98), 1998.
[49] M. Wolf and M. Lam. A Data Locality Optimizing Algorithm. In Proc. of the 1991 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'91), ACM Press, June 1991.
[50] M. Wolfe. More iteration space tiling. In Proc. of Supercomputing '89, pp. 655-664, Reno, Nevada, Nov. 1989.
[51] J. Xiong, J. Johnson, R. Johnson and D. Padua. SPL: A Language and Compiler for DSP Algorithms. In Proc. of the ACM 2001 Conference on Programming Language Design and Implementation (PLDI'01), ACM Press, June 2001.
[52] H. Yu and L. Rauchwerger. Adaptive Reduction Parallelization. In Proc. of the ACM 14th International Conference on Supercomputing (ICS'00), ACM Press, May 2000.