Spill code minimization techniques for optimizing compilers
David Bernstein 1, Dina Q. Goldin 2, Martin C. Golumbic 3, Hugo Krawczyk 4, Yishay Mansour 5, Itai Nahshon 6, Ron Y. Pinter 7
IBM Israel Science and Technology, Technion City, Haifa, Israel
Abstract
Global register allocation and spilling is commonly performed by solving a graph coloring problem. In this paper we present a new coherent set of heuristic methods for reducing the amount of spill code generated. This results in more efficient (and shorter) compiled code. Our approach has been compared to both standard and priority-based coloring algorithms, universally outperforming them.
1. Introduction
Global register allocation and spilling using graph coloring techniques has been a topic of practical interest to compiler designers for a number of years [C81, C82, CH84, LH86, W86]. In all compilers that use such a technique, some sort of conflict graph is built whose vertices correspond to the variables and whose edges represent the interference between the live areas of variables. The coloring of the vertices of this graph corresponds to an assignment of the variables to real machine registers. When the number of colors (i.e., real registers) is not sufficient, additional LOAD and STORE statements must be inserted, and these extra statements are referred to as spill code.
In our approach, we extend the capability of the existing algorithms in several ways. First, we use multiple heuristic functions to increase the likelihood that less spill code will be inserted. We have found three complementary heuristic functions which together appear to span a large proportion of good spill decisions. Second, we use a specially tuned greedy heuristic for determining the order of deleting (and hence coloring) the unconstrained vertices. Third, we have developed a “cleaning” technique which avoids some of the insertion of spill code in non-busy regions.
Both combinatorial problems, that of deciding whether a graph can be colored with the given number of registers and that of minimizing the amount of spill code, are computationally intractable. The major findings of our research consist of a coherent set of heuristic methods for reducing the amount of spill code generated. This results in more efficient (and shorter) compiled code. Our approach is compared to both standard and priority-based coloring algorithms, universally outperforming them. These algorithmic improvements have been developed in the context of the IBM PL.8 compiler [AH82], a highly optimizing compiler with multiple front- and back-ends.
1 Currently with the IBM T.J. Watson Research Center
2 Currently at Parametric Technology, Waltham, Mass.
3 Dual affiliation with the Department of Mathematics and Computer Science, Bar-Ilan University
4 Dual affiliation with Department of Computer Science, Technion - Israel Institute of Technology
5 Laboratory for Computer Science, Massachusetts Institute of Technology
6 Dual affiliation with Department of Electrical Engineering, Technion - Israel Institute of Technology
7 Currently on sabbatical leave at the Department of Computer Science, Yale University
2. Register allocation using best-of-three graph coloring
The problem of global (intra-procedural) register allocation may be faithfully represented as the problem of coloring the vertices of a graph using a specified number of colors. For each procedure in a program, an interference graph G = (V,E) is constructed as follows:
1. Every vertex v ∈ V corresponds to a distinct variable name in the intermediate code representation of the program.
2. There exists an edge (v,u) ∈ E joining two vertices v and u if there is a statement in the program where one of them is defined (assigned a value) and the other is alive (holds a value needed subsequently); hence, they are not allowed to reside in the same machine register throughout the program.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1989 ACM 0-89791-306-X/89/0006/0258 $1.50
Step 1.
Repeatedly delete from G all unconstrained vertices v (and all their incident edges) until none remain (see footnote 2).
Although theoretically any arbitrary graph can arise in this manner, in practice, the interference graphs that are obtained from actual programs are quite sparse. A good (experimentally sound) estimate is that the number e of edges will be approximately 20 times the number n of vertices. A typical large graph that we may encounter will be of the order of magnitude of 1000 vertices and 20000 edges. They do not seem to be structurally similar to random sparse graphs.
Step 2.
Let G’ be the graph resulting from G by the repeated removal of all unconstrained vertices. If G’ is empty, then STOP; if G’ is not empty, try to color it with r colors.
Step 3.
If the procedure succeeds in coloring G’ in Step 2, then STOP; Otherwise, insert some spill code, modify G accordingly, and restart the entire procedure.
Given r machine registers, the problem is how to assign them to the variable names in a program in order to satisfy all the interferences. This problem is equivalent to coloring the interference graph using r different colors. The coloring problem is NP-complete for arbitrary graphs (see footnote 1), and therefore different heuristic coloring algorithms are usually employed.
All algorithms known to us, including the new ones presented here and those presented in [C81, C82, CH84, LH86], are similar in their implementation of Step 1; however, Steps 2 and 3 are performed differently and give very different spill code. In general, whatever heuristic method is used in Step 2 to try to color G’, it should have complexity O(n) or O(n log n), where n is the number of vertices, and it should be similar to and coordinated with the spill decision method of Step 3.
When a coloring algorithm fails to color G, there is a need to spill some variable, meaning to assign this variable to a temporary memory location, instead of a register, for a certain duration of the program. This raises the obvious question of which variables to spill, and where to insert the spill code. The quality of the whole process is judged both by the amount of the spill code produced and by an estimate of the time required to execute this code.
The process outlined in [C82] is the following: When the algorithm fails to color G in Step 2, it chooses, in Step 3, a vertex with the lowest value of cost(v)/deg(v) and makes a decision to spill it everywhere. This vertex is then deleted from the interference graph, and Step 1 is applied again, reducing the graph further. Then, if necessary, a second vertex is chosen to be spilled and so on, until the interference graph is reduced to the empty graph. The formal algorithm is given in Figure 1.
This estimate is usually obtained by assigning a weight representing the expected relative frequency that each spill instruction is to be executed. Let cost(v) be the cost of keeping the variable which corresponds to v in the memory throughout the program as compared to that of keeping it in a register. In [C82], this is estimated by the formula

\[ \mathrm{cost}(v) = \sum_{\substack{\text{instructions } I \\ v \text{ is defined or used in } I}} 5^{\mathrm{depth}(I)} \]

where depth(I) is the nested level (within loops) of I. The weighted value of spill code is the sum of the cost(v) of all variables v which are spilled. Hopefully, this weighted value roughly reflects how the execution time of the compiled program will be affected by the spill code.

Let deg(v) be the degree (i.e., the number of neighbors) of v in G. A vertex v is unconstrained if deg(v) < r, since such a vertex can always be colored after all other vertices have been successfully colored. The process of coloring the interference graph usually proceeds in three steps, labeled Step 1 through Step 3.

    while G not empty do
        if there is an x with degree < r then
            delete x
        else
            choose x with MIN cost/deg
            add x to spill_list
            delete x
    end
    if no vertex has been spilled then
        color the vertices in reverse order of deleting
    else
        spill each x ∈ spill_list everywhere
        rebuild the interference graph and repeat the procedure

Figure 1. The original Chaitin color/spill algorithm.
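The loop of Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PL.8 implementation: the graph is assumed to be a dictionary mapping each variable name to the set of its neighbors, `cost` is assumed to be precomputed, r is assumed to be at least 1, and the function name `chaitin_pass` is hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Graph = Dict[str, Set[str]]  # assumed representation: vertex -> set of neighbors


def chaitin_pass(graph: Graph, cost: Dict[str, float], r: int
                 ) -> Tuple[Optional[Dict[str, int]], List[str]]:
    """One pass of the Figure 1 loop (assumes r >= 1).

    Returns (coloring, spill_list). The coloring is None when something was
    spilled, because the caller must then insert spill code, rebuild the
    interference graph and run the pass again.
    """
    # Work on a copy so that the caller's graph is left intact.
    g = {v: set(nbrs) for v, nbrs in graph.items()}
    removed: List[str] = []     # unconstrained vertices, in deletion order
    spill_list: List[str] = []

    while g:
        unconstrained = [v for v in g if len(g[v]) < r]
        if unconstrained:
            x = unconstrained[0]
            removed.append(x)
        else:
            # Every remaining vertex has degree >= r >= 1, so the division is
            # safe. Chaitin's choice: the minimum of cost(v) / deg(v).
            x = min(g, key=lambda v: cost[v] / len(g[v]))
            spill_list.append(x)
        # Deleting x lowers the degree of all its neighbors.
        for y in g[x]:
            g[y].discard(x)
        del g[x]

    if spill_list:
        return None, spill_list

    # Color in reverse order of deletion. Each vertex had fewer than r
    # neighbors left when it was deleted, so a free color always exists.
    coloring: Dict[str, int] = {}
    for v in reversed(removed):
        used = {coloring[u] for u in graph[v] if u in coloring}
        coloring[v] = next(c for c in range(r) if c not in used)
    return coloring, []
```

On a 4-cycle with r = 3 every vertex is unconstrained and the pass returns a coloring. With r = 2 no vertex has degree below r, so the pass spills the vertex with the lowest cost/deg ratio even though a 4-cycle is 2-colorable.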
1 For certain families of graphs, like interval graphs [G80, G85] which arise from straight line code, the coloring problem can be solved efficiently.
2 Notice that deleting v lowers the degree of all its neighbors, so a constrained neighbor may become unconstrained and will be removed from G subsequently.
There are many advantages to such a global approach as noted in [C81]; we focus here on the three main drawbacks of this approach and how they can be addressed.
1. There is a chance that an efficient but better heuristic version of Step 2 might succeed in coloring the graph or part of it, and thus avoid inserting some unnecessary spill code.
2. When a vertex is spilled, it is spilled everywhere, despite the fact that usually it should be kept in memory only for certain “busy” regions of the program.
3. Spilling v everywhere does not correspond exactly to deleting v from the graph. Rather, it splits the live area of v into tiny segments. Therefore, once the spill code is actually inserted, the entire process of building and coloring the interference graph must be repeated.
In our approach, we extend the capability of Chaitin’s algorithm in several ways. First, we use multiple heuristic functions h_i rather than just cost(v)/deg(v), to increase the likelihood that less spill code will be inserted. We call this the “best-of-three” algorithm since we have found three complementary heuristic functions which together appear to span a large proportion of good spill decisions. Second, we use a specially tuned greedy heuristic for determining the order of deleting (and hence coloring) the unconstrained vertices. Third, we have developed a “cleaning” technique which avoids some of the insertion of spill code in non-busy regions. We have retained the rebuilding and coloring of the interference graph after inserting spill code for reasons of efficiency which we will discuss later. Our algorithm is given in Figure 2.

    for each heuristic h_i do
        while G not empty do
            if there is an x with degree < r then
                choose that x with largest degree (among those < r)
                delete x
            else
                choose x with MIN h_i(x)
                add x to spill_list_i
                delete x
        end
        restore G
    end for
    choose heuristic h_i with smallest COST(spill_list_i)
    if no vertex has been spilled then
        color the vertices in reverse order of deleting
    else
        spill each x ∈ spill_list_i everywhere
        perform “CLEANING” in basic blocks
        rebuild the interference graph and repeat the procedure

Figure 2. The Haifa color/spill algorithm.

A somewhat different approach is advocated by [LH86]. Their algorithm uses priority coloring [CH84] in which Step 2 and Step 3 are performed together; the vertices are colored in decreasing order of cost(v), with the vertex having the highest value of cost(v) colored first. When the algorithm runs out of colors, the next vertex u is split into u' and u''. Such splitting requires a certain amount of spill code to be inserted, and it requires that the graph be updated exactly to reflect the split, but it allows proceeding with coloring of the graph without several iterations of rebuilding the graph. In order to update the graph efficiently, however, they do not use the interference graph itself, but rather a supergraph of it, namely, the basic block intersection graph which has the same set of vertices but additional edges, in order to do a minimum amount of reanalyzing of live areas.

For comparison, we have also implemented a version of priority coloring without splitting for the PL.8 compiler. Our priority-based approach is similar to that just described, but we color the vertex v with the highest value of cost(v)/deg(v) first. This heuristic tends to reduce spill code more than the heuristic cost(v). When our algorithm fails to color a vertex u, we currently delete u and backtrack, spilling all occurrences of u without updating the graph but applying cleaning (see section 4).

3. Area-based heuristics
For coloring and spilling algorithms, one cannot make an absolute statement about the performance of one simple heuristic function over another for all programs. Rather, it is necessary to compare the average performance of how such algorithms make their spill decisions on the actual kinds of programs expected by the compiler. We further advocate that it is desirable to select a small number of heuristic functions which tend to “span” the different behaviors causing register pressure, using the best of which to decide the actual spilling. There are several reasonable heuristics for choosing the next variable to spill. As we mentioned earlier, Chaitin’s algorithm uses the heuristic:

\[ h_0(v) = \frac{\mathrm{cost}(v)}{\mathrm{deg}(v)} \]

The strategy behind this heuristic function is to spill something with low cost and high degree. Spilling a variable with high degree reduces the degree of many other vertices in the interference graph, making it more likely that other vertices will become unconstrained. This strategy leads us to our first of three alternate heuristics:

\[ h_1(v) = \frac{\mathrm{cost}(v)}{\mathrm{deg}(v)^2} \]

Next, we define the area of a variable v by the formula

\[ \mathrm{area}(v) = \sum_{\substack{\text{instructions } I \\ v \text{ is alive at } I}} 5^{\mathrm{depth}(I)} \cdot \mathrm{width}(I) \]
where width(I) is the number of live variables at I and where depth(I) is as above. Intuitively, area(v) represents the global contribution by v to register pressure. Spilling a variable with high area releases register pressure (1) along much of the program and (2) where it hurts the most, thus making it easier to color. This motivates our second and third alternate heuristics:

\[ h_2(v) = \frac{\mathrm{cost}(v)}{\mathrm{area}(v) \cdot \mathrm{deg}(v)} \]

\[ h_3(v) = \frac{\mathrm{cost}(v)}{\mathrm{area}(v) \cdot \mathrm{deg}(v)^2} \]

block (in which it has a use) and then it lives only inside the corresponding basic block.

The best results are achieved by activating cleaning for the first two iterations of color/spills, and when no restriction is put on either the depth of basic blocks in which the cleaning is performed or the registers for which the cleaning is applied. Cleaning tends to reduce the total spill cost (number of spill statements) more than the weighted spill cost, since it succeeds more often in basic blocks of low depth than in the deeper blocks. It is evident that limiting the action of the cleaning routine to the scope of single basic blocks is an artificial constraint, and that it could be extended to larger regions of the program, for example, the elimination of loads and stores from loops and/or moving them out of a loop.
An ideal strategy would be to spill variables having low cost, high area and high degree. We, therefore, have sought to find heuristic functions which incorporate these three values and yield good spill decisions. Our three functions h_1, h_2, h_3 have proven to be an effective combination; after extensive analysis and experimentation, it was found that none of them strictly dominates the other two, but together (taking the best of the three) they outperformed all other techniques. Table 1 in section 6 demonstrates this non-dominance. We stress that in the best-of-three implementation the compile time is only slightly longer than if using the standard approach, since most of the overhead, including building G, is “factored out”: only the computation using h_0, which is a fast linear computation, is replaced by three such computations, and the minimum is taken.
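A minimal Python sketch of the best-of-three selection may help make the factoring concrete. It assumes the heuristics h_1 = cost/deg^2, h_2 = cost/(area·deg) and h_3 = cost/(area·deg^2), a graph stored as a dictionary of neighbor sets, precomputed and strictly positive `cost` and `area` values, and r at least 1. The names `HEURISTICS`, `spill_list_for` and `best_of_three` are hypothetical and are not taken from the PL.8 compiler.

```python
from typing import Callable, Dict, List, Set, Tuple

Graph = Dict[str, Set[str]]                       # vertex -> set of neighbors
Heuristic = Callable[[float, float, int], float]  # (cost, area, deg) -> score

# The three alternate heuristics; lower scores are spilled first.
HEURISTICS: Dict[str, Heuristic] = {
    "h1": lambda cost, area, deg: cost / deg ** 2,
    "h2": lambda cost, area, deg: cost / (area * deg),
    "h3": lambda cost, area, deg: cost / (area * deg ** 2),
}


def spill_list_for(graph: Graph, cost: Dict[str, float],
                   area: Dict[str, float], r: int, h: Heuristic) -> List[str]:
    """Run the inner loop of Figure 2 for one heuristic on a copy of G."""
    g = {v: set(nbrs) for v, nbrs in graph.items()}   # "restore G" via copying
    spills: List[str] = []
    while g:
        unconstrained = [v for v in g if len(g[v]) < r]
        if unconstrained:
            # Secondary criterion: largest degree among the unconstrained.
            x = max(unconstrained, key=lambda v: len(g[v]))
        else:
            # All remaining degrees are >= r >= 1, so h never divides by zero.
            x = min(g, key=lambda v: h(cost[v], area[v], len(g[v])))
            spills.append(x)
        for y in g[x]:
            g[y].discard(x)
        del g[x]
    return spills


def best_of_three(graph: Graph, cost: Dict[str, float],
                  area: Dict[str, float], r: int) -> Tuple[str, List[str]]:
    """Return the heuristic whose spill list has the smallest total cost."""
    results = {name: spill_list_for(graph, cost, area, r, h)
               for name, h in HEURISTICS.items()}
    best = min(results, key=lambda name: sum(cost[v] for v in results[name]))
    return best, results[best]
```

The graph is built once and only the cheap removal loop is repeated per heuristic, which mirrors the observation that the extra compile time stays small.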
5. Local coloring considerations
Graph coloring based register allocation techniques are global in nature, and hence - by and large - they ignore the control structure of the source program. We believe, however, that this can be remedied by trying to use the structure of the code to guide coloring decisions so as to yield better results. In particular, different coloring methods may be applied to different portions of a program and then the results can be combined. In some sense, this is already reflected in the fact that traditionally register allocation is performed separately for each procedure. When a subroutine which requires k registers is called, the contents of the first k registers are temporarily stored in the calling sequence, while the remaining r - k registers remain untouched.
4. Cleaning
The “spill everywhere” approach is extremely conservative in the sense that a “spilled” register is reloaded at every use and stored at every definition of it. This is somewhat wasteful because it is based on global information about the program and not on the local situation surrounding the place where the particular load or store is introduced.
For certain control structures, such as straight line code or nested loops, the structure of the conflict graph that arises is of a special type, such as an interval graph, a circular arc graph, or one of several types of path graphs. For some but not all of these classes of graphs a truly optimal coloring can be attained in polynomial time, whereas heuristics with guaranteed performance bounds have been developed for others in cases where the problem remains intractable [DGP88, G80, G85]. The results of coloring such fragments may be regarded as partial results that can then be extended.
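For the interval graph case, the optimal coloring mentioned above can be obtained by a simple sweep over the live ranges in order of their start points. The sketch below is an illustration only: it assumes that each live range in a piece of straight line code is given as a half-open interval [start, end) of instruction indices, so that a range ending at index t does not interfere with one starting at t. The function name `color_intervals` is hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def color_intervals(ranges: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Assign register numbers to live ranges [start, end).

    Processing intervals by increasing start point and reusing any released
    register uses as many registers as the maximum number of simultaneously
    live ranges, which is the minimum possible for an interval graph.
    """
    free: List[int] = []                 # min-heap of released registers
    active: List[Tuple[int, int]] = []   # min-heap of (end, register)
    assignment: Dict[str, int] = {}
    next_reg = 0
    for name, (start, end) in sorted(ranges.items(), key=lambda kv: kv[1][0]):
        # Release every register whose range ended at or before `start`.
        while active and active[0][0] <= start:
            _, reg = heapq.heappop(active)
            heapq.heappush(free, reg)
        if free:
            reg = heapq.heappop(free)
        else:
            reg = next_reg               # open a fresh register
            next_reg += 1
        assignment[name] = reg
        heapq.heappush(active, (end, reg))
    return assignment
```

If the number of registers used exceeds r, some range must still be spilled; the sketch only covers the coloring step.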
Cleaning, as presented by us in this paper, adds only one such load or store per basic block. Thus, our idea retains most of the current methodology, and is best described as a “spill almost everywhere” approach.
In the original PL.8 compiler there already exist certain local optimizations of spill code. Namely, if between two consecutive uses of the same spilled register in a single basic block there is no register which goes dead, then only one reload is introduced, for the first use (none for the second). Our cleaning algorithm takes a “more aggressive” approach to saving spill code.
In addition, certain hardware configurations imply a preference order on register assignment, for example, due to varying performance characteristics of real registers or the use of special purpose registers which can double as general purpose when needed.
When scanning a basic block, looking for spilled computations, a load (resp., a store) is introduced only for the first use (resp., definition) of the register in the block. That is, only one load or store is introduced per “spilled” register in a single basic block. The register is renamed in every basic
The algorithms that are employed in order to handle specific control structures are greedy in nature, and so are the extensions that pertain to special architectural demands.
3 In practice, most programs do not require more than two iterations and in the majority of remaining cases, the number of color/spill iterations does not increase relative to the original compiler.
Also, comparisons between a variety of greedy sequential coloring algorithms applied to randomly generated graphs have been reported in the literature [B85, B79, M72]. But, as noted above, the interference graphs of real life programs are neither random nor structurally special, but rather comprise a loose combination of structured fragments. The computational complexity of the minimum coloring problem indicates that we look for greedy heuristic algorithms which will generally give a good solution and whose average performance is confirmed experimentally. This is indicated by our secondary criterion for choosing the order of removal of unconstrained vertices, and hence the order of the greedy sequential coloring.

Thus, after spilling has accomplished its mission of ensuring r-colorability of the interference graph G, the secondary criterion is used to color G with generally fewer colors than a random order. There are several situations when this can improve the resulting code.
• Global register allocation has “over spilled”, and we can now color G with k colors, where k is less than r. In this case we can undo the r - k most expensive spill decisions of the preceding iteration.
• Most small subroutines never spill, but if they are called very frequently, then being thrifty with the number of registers allocated will reduce the overhead of the subroutine call.
• When the register usage in a small procedure is extremely low, this may influence our decision on whether to in-line expand it or not.

6. Experimental results
As evidence of the improvement that we obtain, the new Haifa register allocation algorithm was compared with that of Chaitin on the original PL.8 compiler [AH82] on a large test bucket of programs consisting of hundreds of routines, mostly from the compiler itself, and written by dozens of different people. This was carried out for several target machines. The average savings in weighted spill code were 6% for the IBM System 370, 10% for the IBM PC/RT, and 12% for the Motorola 68000. (The corresponding reductions in total spill statements, which affect the length of the code, are about double.)

A second approach for testing the heuristics is to compare the actual running time of the compiled programs. The means for measuring the running times, e.g., the PER option of CMS which counts instructions, or statistical execution analyzers, are more computation intensive. Therefore, we tested dozens rather than hundreds of routines in this manner, and we concluded that the run-time improvement is, on average, about half the weighted spill code improvement. One of the programs tested using PER was PUZZLE, and its results are representative of the sample. Its path length was 3920000 when compiled with the original version, and the path length was reduced to 3794247 (improvement of about 3%) when the Haifa algorithm was employed. This may be compared to a 5% reduction in the weighted spill code.

As an illustration, Table 1 shows the weighted spill statistics for a sample of 15 procedures compiled for S370, including 9 of the most heavily spilling routines in the PL.8 compiler, 5 lightly spilling routines and one puzzle program. The stars (*) indicate the best-of-three results and demonstrate the non-dominance of the three heuristics.

Our newly devised register allocation algorithms have been integrated into the current PL.8 code generator. Running best-of-three heuristics produces compiled code which is shorter and on average 2-3% faster. It adds 1% to the compile time over the single original heuristic. There is also an especially nice side benefit -- since the PL.8 compiler itself is written in PL.8, when the previous version is used to recompile the new version source code, the result is a new PL.8 compiler which is 10% shorter in length and 3% faster.
Table 1. Weighted spill statistics (* marks the best-of-three result)

Program name   Original h_0   h_1         h_2         h_3         Improvement (best-of-three)
COALESCE       143            138         138         127 *       11.2%
PRT-PROCS      346            346         337         336 *       2.9%
PHASE5         1321           839 *       1345        2319        36.5%
RSI            1487           1247        1128        1089 *      22.6%
MR             9625           7328        8326        7327 *      23.9%
LIVE           25872          25872       24964 *     24964 *     3.5%
68             33475          31967       31929 *     33378       4.6%
PATH           35573          35573       35167 *     35167 *     1.1%
PUZZLE         41567          46347       39967       39537 *     4.9%
UD             49396          48916       46335 *     49766       6.2%
RS             88290          84666 *     85177       85877       4.1%
PHASESA        119468         118925      110468 *    110466 *    7.5%
OPEN           191189         151389 *    159618      161534      20.8%
ITF            256528         198335 *    275198      271898      22.7%
FLOW           493868         458762      454432 *    455432      8.8%
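The improvement column of Table 1 can be recomputed from the other columns, which also makes the non-dominance visible. The short Python sketch below copies five rows from the table, assuming the column order h_0, h_1, h_2, h_3; the names `ROWS` and `summarize_row` are hypothetical.

```python
from typing import Dict, Tuple

# (h_0, h_1, h_2, h_3) weighted spill costs copied from five rows of Table 1.
ROWS: Dict[str, Tuple[int, int, int, int]] = {
    "COALESCE": (143, 138, 138, 127),
    "PHASE5": (1321, 839, 1345, 2319),
    "MR": (9625, 7328, 8326, 7327),
    "UD": (49396, 48916, 46335, 49766),
    "ITF": (256528, 198335, 275198, 271898),
}


def summarize_row(row: Tuple[int, int, int, int]) -> Tuple[str, int, float]:
    """Return (winning heuristic, its weighted spill cost, improvement in %)."""
    h0, *alternatives = row
    best = min(alternatives)
    winner = "h%d" % (alternatives.index(best) + 1)
    # Improvement of the best-of-three result relative to the original h_0.
    improvement = round(100.0 * (h0 - best) / h0, 1)
    return winner, best, improvement


if __name__ == "__main__":
    for name, row in ROWS.items():
        print(name, *summarize_row(row))
```

In this sample each of h_1, h_2 and h_3 wins at least once (PHASE5 and ITF, UD, and COALESCE and MR respectively), so no single heuristic could replace the other two on these rows.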
ACKNOWLEDGEMENTS
The authors would like to express their thanks to Martin Hopkins, Peter Markstein and Robert Schloss of the IBM T.J. Watson Research Center for many fruitful discussions.

References
[AH82] Auslander, M.A., and Hopkins, M.E., “An overview of the PL.8 compiler”, Proceedings of the ACM Symposium on Compiler Construction (June 1982), 22-31.
[B79] Brelaz, D., “New methods to color the vertices of a graph”, Commun. ACM 22 (1979), 251-256.
[B85] Bollobas, B., Random Graphs, Academic Press, London, 1985.
[C81] Chaitin, G.J., Auslander, M.A., Chandra, A.K., Cocke, J., Hopkins, M.E., and Markstein, P.W., “Register allocation via coloring”, Computer Languages 6 (1981), 47-57.
[C82] Chaitin, G.J., “Register allocation and spilling via graph coloring”, Proceedings of the ACM Symposium on Compiler Construction (June 1982), 98-105.
[CH84] Chow, F., and Hennessy, J., “Register allocation by priority-based coloring”, Proceedings of the ACM Symposium on Compiler Construction (June 1984), 222-232.
[DGP88] Dagan, I., Golumbic, M.C., and Pinter, R.Y., “Trapezoid graphs and their coloring”, Discrete Applied Math. 21 (1988), 35-46.
[G80] Golumbic, M.C., Algorithmic Graph Theory and Perfect Graphs, Academic Press, New York, 1980.
[G85] Golumbic, M.C., ed., “Interval Graphs and Related Topics”, a special issue of Discrete Math. 55 (1985), 113-243.
[LH86] Larus, J.R., and Hilfinger, P.N., “Register allocation in the SPUR Lisp compiler”, Proceedings of the ACM Symposium on Compiler Construction (June 1986), 255-263.
[M72] Matula, D.W., Marble, G., and Isaacson, J.D., “Graph coloring algorithms”, in Graph Theory and Computing (Read, R.C., ed.), Academic Press, New York, 1972.
[W86] Wall, D.W., “Global register allocation at link time”, Proceedings of the ACM Symposium on Compiler Construction (June 1986), 264-275.