A Survey of General and Architecture-Specific Compiler Optimization Techniques

Armando Fox
Michael Hsiao
James Reed
Brent Whitlock
Abstract
Experience with commercial and research high-performance architectures has indicated that the compiler plays an increasingly important role in real application performance. In particular, the difficulty in programming some of the so-called "hardware first" machines underscores the need for integrating architecture design and compilation strategy. In addition, architectures featuring novel hardware optimizations require compilers that can take advantage of them in order to be commercially viable. We survey a variety of compiler optimization techniques of current interest: general techniques, vectorizing compiler techniques, and fine-grained parallelism techniques. For architecture-specific techniques, we analyze what features of the architecture require special attention from the compiler in order to achieve the best performance, and summarize implementation complexity and observed performance for a variety of past approaches. We pay particular attention to hardware/software trade-offs and limits on the compiler's ability to enhance performance on a particular hardware architecture. We also discuss some hardware optimizations which currently do not rely on smart compilers, and compare their performance to compiler schemes which try to achieve the same effects.
1 Architecture-Independent Optimizations

The techniques discussed in this section optimize for features common to all architectures: register allocation, combining adjacent simple instructions into more efficient single instructions (peephole optimization), and interprocedural analysis. Instruction cache optimization is presented as a novel and effective technique worthy of future attention.
1.1 Optimal Register Allocation
Since most transactions can take place much more quickly in or between registers than they can from main memory, the compiler should allocate program variables to registers in a way that minimizes the number of main memory accesses required. Furthermore, since most standard compiler optimizations increase register usage and spill code is typically expensive in terms of performance degradation, good register allocator performance is essential.
1.1.1 Register Allocation as WDAG Traversal

Hsu, Fischer and Goodman [17] present one method for doing this. Horwitz's original work (1966) reduces the problem of register allocation to that of finding a shortest path through a weighted DAG (WDAG) representing the data dependencies among the variables. Each node in the WDAG corresponds to a configuration (a mapping of a set of M pseudo-registers to a set of N < M physical registers at a given point in time); thus each instruction may generate several nodes, corresponding to the various possible configurations at the time that instruction is reached. Arcs connect nodes associated with successive configurations; the arc weight is the cost of getting from one configuration to the next, as measured by any necessary register loads or stores that must be done to move between them. These costs can be computed using live-range analysis. Since successive configurations can differ only in the contents of one register, the cost of a given configuration is the sum of the costs of the sequence of configurations preceding it, and therefore the shortest path through the WDAG represents the optimal sequence of configurations (register allocations). Unfortunately, the size of the WDAG is bounded by the number of possible register configurations, which grows exponentially with M and N. Hsu, Fischer and Goodman [17] present an algorithm which incorporates rules for pruning the WDAG and heuristics for handling cases not covered by the rules.
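To make the shortest-path formulation concrete, the sketch below (a toy, not the pruned algorithm of [17]) treats allocation for a flat list of variable uses as a shortest-path search over (instruction index, resident-register-set) configurations; the cost model, which charges one unit per load and ignores dirty-register write-backs, is an assumption made for brevity.

from heapq import heappush, heappop

def min_spill_cost(uses, num_regs):
    # uses: the variable referenced by each successive instruction.
    # num_regs: number of physical registers (N).
    # Returns the minimum number of loads needed to satisfy every reference.
    best = {(0, frozenset()): 0}
    heap = [(0, 0, frozenset())]
    while heap:
        cost, i, resident = heappop(heap)
        if cost > best.get((i, resident), float("inf")):
            continue                                  # stale heap entry
        if i == len(uses):
            return cost                               # all references satisfied
        v = uses[i]
        if v in resident:
            succs = [(resident, 0)]                   # already resident: free
        else:
            succs = [(resident - {victim} | {v}, 1) for victim in resident]
            if len(resident) < num_regs:
                succs.append((resident | {v}, 1))     # load into a free register
        for nxt, step in succs:
            state, c = (i + 1, nxt), cost + step
            if c < best.get(state, float("inf")):
                best[state] = c
                heappush(heap, (c, i + 1, nxt))

print(min_spill_cost(list("abcabdab"), 3))            # prints 4 (d evicts c)

Even on this toy version, the number of reachable configurations grows quickly with the register count, which is exactly the blow-up the pruning rules of [17] are designed to contain.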
1.1.2 Pruning the WDAG

The rules for pruning the WDAG are based on some observations which can be drawn about the relative costs of different configurations for a given instruction. For example, replacing a dead register has no cost; replacing a clean distant register (a register's distance is the number of instructions until it is next referenced) is cheaper than replacing a near dirty register; and so on. A complete list of the rules and formal proofs are in [17]. These rules can be used to reduce the number of branches from each node in the configuration WDAG using a standard alpha-beta pruning strategy.
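As an illustration only (the attribute names dead, dirty, and distance are assumptions, not the notation of [17]), the observations above translate naturally into a victim-ordering heuristic: a dead register costs nothing, a clean register costs one reload, a dirty register costs a store plus a reload, and ties are broken toward the most distant next reference.

from dataclasses import dataclass

@dataclass
class Reg:
    name: str
    dead: bool        # value never referenced again
    dirty: bool       # value modified since it was loaded
    distance: int     # instructions until the next reference

def replacement_cost(reg):
    if reg.dead:
        return 0                      # no reload and no store will ever be needed
    return 2 if reg.dirty else 1      # dirty: store back now, reload later

def choose_victim(registers):
    # Cheapest candidate first; among equals prefer the most distant reference.
    return min(registers, key=lambda r: (replacement_cost(r), -r.distance))

print(choose_victim([Reg("r1", False, True, 2), Reg("r2", False, False, 9)]).name)   # r2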
1.1.3 Heuristics

The rules above do not cover all possible cases; simple heuristics are introduced for the cases not covered (e.g., more than one candidate for replacement that is dirty and dead), and are found to be within 10% of the effectiveness of the optimal algorithm when run on various blocks of code from Unix utilities such as sort and grep as well as the Livermore Loops. Trace scheduling was used to generate larger basic blocks than would be exposed by the assembly code; this is primarily because the techniques developed here apply more readily to large basic blocks, for which the large number of variables makes computing the optimal allocation infeasible. The authors also argue that common optimization techniques such as inline expansion and loop unrolling tend to enlarge basic blocks; in such cases, Chaitin's graph-coloring allocator [4] relies too heavily on spilling to make the interference graph colorable, and coloring heuristics become less reliable as the blocks become larger.
1.1.4 Evaluation

The authors' heuristic algorithm is found to perform better overall than existing heuristic algorithms (usage count, LRU, Least Recently Loaded) with minimal added complexity, and may be adaptable for virtual memory allocation and cache replacement. Since good register allocation is a problem faced by every compiler targeted at a general-register machine, and most other effective optimizations increase register usage, research into fast and effective allocation strategies will continue to be important.
1.2 Peephole Optimizations
Peephole optimizers typically examine intermediate-level code output by the compiler front end before it is translated to object code by the back end. Having a narrow common interface (the intermediate language) makes it easier to rewrite the compiler for a new source language or retarget it for a new machine instruction set. Because many code generators employ some form of syntax-directed translation to build the intermediate-language program, there is opportunity to optimize across the boundaries of intermediate-code blocks generated from adjacent expressions. Peephole optimizers examine a sequence of two or three successive intermediate-language instructions (or, less frequently, target-code instructions) and attempt to replace them with a smaller number of more efficient instructions. For example, successive load and add instructions might be replaced with a memory-to-memory add if supported by the intermediate language or target machine language. Such replacements result in both smaller code size and faster execution. It has been shown by Davidson and Whalley [10] and in other work that a good peephole optimizer operating on naive code yields results comparable to, and in many cases better than, those from a sophisticated machine-specific code generator. Allowing the code-generator output to be naive makes new code generators much easier to write.
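The following sketch shows the basic mechanism on a toy three-address intermediate form; the tuple encoding of instructions and the single load/add rule (mirroring the example above) are assumptions for illustration, and a real rule would also verify that the loaded register is dead afterwards.

def peephole(code):
    # Repeatedly slide a two-instruction window over the code, rewriting
    # load rX, M ; add rY, rX  into  add rY, M  until no rule fires.
    changed = True
    while changed:
        changed, out, i = False, [], 0
        while i < len(code):
            window = code[i:i + 2]
            if (len(window) == 2
                    and window[0][0] == "load" and window[1][0] == "add"
                    and window[1][2] == window[0][1]):   # add consumes the loaded reg
                out.append(("add", window[1][1], window[0][2]))
                i, changed = i + 2, True
            else:
                out.append(code[i])
                i += 1
        code = out
    return code

print(peephole([("load", "r1", "M"), ("add", "r2", "r1")]))
# [('add', 'r2', 'M')]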
1.2.1 Rule-Based Peephole Optimizers

Rule-based optimizers maintain a database of matching templates (rules) which are checked against each peephole of instructions. McKenzie [22] describes such a peephole optimizer written for the Amsterdam Compiler Kit, which uses hash tables to speed the rule matching process against a database of 580 hand-coded optimization patterns. Since multiple passes are required to get the greatest benefit from peephole optimizers, the optimization phase of the compiler is very slow, and McKenzie also describes an improved algorithm which keeps track between passes of which sequences are unlikely to yield further opportunity for optimization.
1.2.2 Faster Peephole Optimizers

Davidson and Whalley [10] go further by identifying a small number (typically 40) of peephole rules responsible for most of the optimization; using the reduced database greatly speeds up the optimization phase while still producing code comparable to that produced by a dedicated code generator. The simple-minded intermediate language they describe is loosely based on a generic load/store RISC architecture, allowing flexibility in retargeting the back end to both RISC and CISC machines. Davidson and Whalley's rules can also have semantic, context-sensitive components, allowing a single general rule to handle many different instructions; for example, a single rule handles all types of conditional branches. The context-sensitive component can also be used to prevent the optimizer from combining instructions in such a way that unusual interactions might result in poor code. For example, given the sequence mov #0,r3; mov r3,5(r2), the optimizer might replace the move of 0 with a clear, and subsequently be unable to optimize the resulting two instructions into the single instruction mov #0,5(r2), which is obviously what is desired. By delaying the optimization until more context is available, such cases can be avoided.
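A sketch of the idea, with an assumed predicate form rather than Davidson and Whalley's actual rule syntax: the mov #0,rX -> clear rX rewrite is suppressed whenever the next instruction stores rX, leaving the pair intact so it can later be merged into a single mov #0,mem.

def is_zero_move(insn):
    return insn[0] == "mov" and insn[1] == "#0"

def rewrite_zero_moves(code):
    out = []
    for i, insn in enumerate(code):
        if is_zero_move(insn):
            # Context-sensitive guard: look one instruction ahead.
            nxt = code[i + 1] if i + 1 < len(code) else None
            stores_reg = nxt is not None and nxt[0] == "mov" and nxt[1] == insn[2]
            out.append(insn if stores_reg else ("clear", insn[2]))
        else:
            out.append(insn)
    return out

print(rewrite_zero_moves([("mov", "#0", "r3"), ("mov", "r3", "5(r2)")]))
# pair left intact, so a later pass can still emit mov #0,5(r2)
print(rewrite_zero_moves([("mov", "#0", "r4"), ("add", "r1", "r2")]))
# [('clear', 'r4'), ('add', 'r1', 'r2')]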
1.2.3 Automatic Rule Inference

Machine-directed peephole optimizers typically produce more efficient code than their rule-based counterparts, but embedding a machine description into the optimizer destroys the modular nature that makes this approach desirable in the first place. Davidson and Fraser [9] describe an integrated system that uses training sets to generate optimization rules based on the behavior of a machine-directed optimizer called PO. As PO makes replacements, textual traces of the replacements are formed and the assembly-time constants replaced with symbolic variables to generate rules. In addition, instruction sequences that fail to combine are also recorded, so that the new rule-based optimizer will not try to fall back on PO for instruction sequences it has never seen. The authors found that 10-20% of the training examples result in the construction of over 95% of the useful rules. Thus the rule-based optimizer can be quickly generated for any particular machine architecture, and in most cases it matches or outperforms large handwritten rule databases such as McKenzie's [22].
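The core of the inference step can be pictured as generalizing an observed before/after replacement by renaming concrete registers and literals to symbolic variables; the textual instruction format below is an assumption for illustration, not PO's actual trace format.

import re

def generalize(before, after):
    # Map each distinct register or literal to a fresh symbolic variable
    # ($1, $2, ...), turning one concrete replacement into a reusable rule.
    symbols = {}
    def sym(tok):
        if re.fullmatch(r"r\d+|#-?\d+", tok):
            symbols.setdefault(tok, "$%d" % (len(symbols) + 1))
            return symbols[tok]
        return tok
    abstract = lambda insns: [" ".join(sym(t) for t in insn.split()) for insn in insns]
    return abstract(before), abstract(after)

print(generalize(["load r1 #40", "add r2 r1"], ["addi r2 #40"]))
# (['load $1 $2', 'add $3 $1'], ['addi $3 $2'])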
1.2.4 Evaluation

Good peephole optimizers make it possible to write compilers in a modular and retargetable fashion. In particular, if a rule-directed optimizer that yields code comparable to that of a sophisticated code generator can be produced quickly using methods like those of Davidson and Fraser [9], compiler development time is reduced while high-quality object code is maintained. In an industry where both compilers and product timing are critical to the success of an architecture, such a situation is desirable if not necessary.
1.3 Interprocedural Analysis and Procedure Inlining

The complex process of interprocedural dataflow analysis (IPA) can be used to mitigate the cost of procedure call/return and of register save/restore around calls. Such analysis allows register allocation to take place across procedure call boundaries and exposes other optimization opportunities: loop-invariant code may be moved outside a procedure call boundary (and therefore outside a loop), interprocedural constant folding can take place, and redundant assignments and dead variables may be eliminated.

An alternative to IPA is to expand procedures in place (procedure inlining). In cases where a procedure is called with a parameter whose value is known at compile time, a separate procedure can even be created and optimized for the particular value of the argument (linkage tailoring). Inlining is constrained by the allowable amount of code expansion. As with interprocedural analysis, once a procedure is expanded it can potentially be optimized (register allocation, code motion, common subexpression elimination) in the context of its call. The main disadvantage of inlining is the increase in code size; one suggested technique is to inline only leaf procedures, since past work has shown that programs spend much of their time at the leaf level.

Richardson and Ganapathi [27] compare the relative benefits and costs of inlining vs. IPA by comparing the performance of IPA in practice with that of a "perfect" algorithm which inlines all procedure calls, thereby obviating the need for IPA. Their estimate is based on comparing the speedups gained by optimizing an unmerged program and a merged program. (A merged program is one in which procedures have been inlined.) In [26], the same authors introduce live range interference as a measure of the potential usefulness of interprocedural optimization. Let i_v be the number of references to a variable v whose live range includes calls to procedures that can access v, or zero if v's live range does not include any such calls; and let n_v be the total number of references to variable v. The ratio LiveInt is calculated as (Σ_v i_v) / (Σ_v n_v). The metric conservatively assumes that if a variable is available to the callee's scope, its value will be changed by the callee, which more often than not turns out to be the case. (In Pascal this is easy to detect, since only variables passed by reference can be changed by the callee; in other languages like C, the analyzer must conservatively assume that any variable passed by pointer may be modified.) For some representative Pascal benchmarks, the additional benefit from IPA was predicted to be an insignificant 1.57%.

Another problem with IPA is that dependency analysis is hampered by pointer aliasing. In C, the unrestricted use of pointers makes the problem severe; other languages such as Pascal and FORTRAN exhibit aliasing only under restricted conditions. Nevertheless, the authors present experimental evidence that even when the analyzer is supplied with alias information, the incremental benefit derived from the expensive analysis is less than 1%.
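A direct transcription of the LiveInt ratio defined above, assuming each variable is summarized by its pair (i_v, n_v):

def live_int(per_variable_counts):
    # per_variable_counts: iterable of (i_v, n_v) pairs as defined above.
    total_i = sum(i for i, _ in per_variable_counts)
    total_n = sum(n for _, n in per_variable_counts)
    return total_i / total_n if total_n else 0.0

# Hypothetical example: two variables never live across a call that can access
# them, plus one variable with 3 of its 10 references inside such a live range.
print(live_int([(0, 5), (0, 4), (3, 10)]))   # 3/19, about 0.16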
1.3.1 Evaluation

The authors derive a number of interesting conclusions from their experiments.

Some cases where inlining was used extensively exhibited "negative speedup" because the optimizer actually lost efficiency due to having too much information available; register coloring in particular is NP-complete, and the heuristics used by coloring algorithms tend to become less reliable as the size of a procedure grows. In cases where this does not occur, the major benefit from inlining is due to the savings in call/return overhead and improved interprocedural register allocation; very little improvement is attributable to increased opportunities from in-context optimizations.

System issues such as a potential increase in cache misses have yet to be explored. For example, a procedure that is heavily used by many others retains almost no spatial locality when it is replicated and inlined.

Future architecture designs must be influenced by methods of code optimization. For example, if optimizers become proficient at eliminating large numbers of loads and stores, the demand for very fast memory is alleviated.

Although IPA is generally less effective than selective inlining for increasing sequential performance, it may play a more important role in exposing more parallelism in sequential programs, as would be needed for vectorizing or parallelizing compilers.
1.4 Optimizing for Instruction Caches
Although direct-mapped caches tend to have higher miss rates than set-associative caches, their faster access time usually results in better overall performance and the reduced hardware complexity makes them a favorable choice for on-chip cache implementation. Direct-mapped caches are also controllable in that the compiler can arrange instructions in such a way as to minimize replacement of useful instructions in the cache; for example, two instructions inside a tight loop that may compete for the same cache line can be rearranged so that they will both be cached during the duration of the loop, avoiding excessive thrashing caused by replacement twice per iteration. McFarling [23] describes an algorithm for determining which blocks of instructions should be added to the cache.
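The conflict is easy to see numerically; in the toy sketch below (cache geometry chosen arbitrarily), two instruction blocks whose addresses lie exactly one cache size apart map to the same line of a direct-mapped cache and evict each other every iteration, while shifting one block by a single line removes the conflict.

LINE_SIZE = 32            # bytes per cache line (assumed)
NUM_LINES = 256           # direct-mapped: one resident block per line (assumed)

def cache_line(addr):
    return (addr // LINE_SIZE) % NUM_LINES

a = 0x1000                            # block A inside the loop
b = a + NUM_LINES * LINE_SIZE         # block B, exactly one cache size away
print(cache_line(a), cache_line(b))               # 128 128: thrash every iteration
print(cache_line(a), cache_line(b + LINE_SIZE))   # 128 129: both stay cached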
1.4.1 Block Labeling

McFarling uses the standard definition of an optimal cache: the current instruction is added to the cache if and only if some instruction already in the cache will next be referenced later than the current instruction, so that the current instruction will not be replaced before being referenced again. If the current instruction is added, it replaces the instruction in the cache referenced farthest in the future. This is similar to an optimal page replacement algorithm.

Labeling leaf conditionals introduces problems. Suppose a branch occurs inside a loop of 100 iterations, with equal probability .5 of block A or B being the target of the branch. We cannot tell from branch profiling information whether A is taken for the first 50 iterations and B for the second 50, or whether A and B alternate every other iteration. The labeling algorithm must conservatively assume the latter, which would be the worst case.

Labeling loops inside conditionals is similar. If a conditional contains a loop B that is only executed a fraction q of the time, but competes for space with a non-loop block A which is executed p of the time, we need to know what percentage of the time A will be executed twice without any intervening B. But with a direct-mapped cache, where the caching pattern is known at compile time, we have a choice of caching A only, B only, or both A and B, and the choice must be made statically. If we assume the worst case, A and B should both be cached if p > 2q, and only the loop B should be cached if p < 2q.
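For reference, here is a minimal sketch of that optimal-replacement baseline (a Belady-style policy over a fully associative cache with the whole reference trace known in advance, which is a simplification of the instruction-cache setting described above).

def optimal_misses(trace, capacity):
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        future = trace[i + 1:]
        next_use = lambda b: future.index(b) if b in future else float("inf")
        if len(cache) < capacity:
            cache.add(block)
        else:
            victim = max(cache, key=next_use)        # referenced farthest in the future
            if next_use(block) < next_use(victim):   # cache only if needed again sooner
                cache.remove(victim)
                cache.add(block)
    return misses

print(optimal_misses(list("ABABCABAB"), 2))   # 3: C is fetched but never cached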