Hewlett-Packard
Register Allocation for Modulo Scheduled Loops: Strategies, Algorithms and Heuristics

B. R. Rau, M. Lee, P. P. Tirumalai, M. S. Schlansker
Computer Systems Laboratory
HPL-92-48
April, 1992
register allocation, modulo scheduling, software pipelining, instruction scheduling, code generation, instruction-level parallelism, multiple operation issue, VLIW processors, Very Long Instruction Word processors
Software pipelining is an important instruction scheduling technique for efficiently overlapping successive iterations of loops and executing them in parallel. This technical report studies the task of register allocation for software pipelined loops, both with and without hardware features that are specifically aimed at supporting software pipelines. Register allocation for software pipelines presents certain novel problems leading to unconventional solutions, especially in the presence of hardware support. This technical report formulates these novel problems and presents a number of alternative solution strategies. These alternatives are comprehensively tested against over one thousand loops to determine the best register allocation strategy, both with and without the hardware support for software pipelining.
Internal Accession Date Only To be published in an abridged form, as "Register Allocation for Software Pipelined Loops", in the
Proceedings of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, San Francisco, June 1992 © Copyright Hewlett-Packard Company 1992
1 Introduction

1.1 Software pipelining

Software pipelining [1] is a loop scheduling technique which yields highly optimized loop schedules. Algorithms for achieving software pipelining fall into two broad classes:
• modulo scheduling as formulated by Rau and Glaeser [2] and,
• algorithms in which the loop is continuously unrolled and scheduled until a situation is reached which allows the schedule to wrap back on itself without draining the pipelines [3].
Although, to the best of our knowledge, there have been no published measurements on this issue, it is our belief that the second class of software pipelining algorithms can cause unacceptably large code size expansion. Consequently, our interest is in modulo scheduling. In general, this is an NP-complete problem and subsequent work has focused on various heuristic strategies for performing modulo scheduling ([4-7] and the as yet unpublished heuristics in the Cydra 5 compiler [8]). Modulo scheduling of loops with early exits is described by Tirumalai, et al. [9]. Modulo scheduling is applicable to RISC, CISC, superscalar, superpipelined, and VLIW processors, and is useful whenever a processor implementation has parallelism either by virtue of having pipelined operations or by allowing multiple operations to be issued per cycle.

This technical report describes methods for register allocation of modulo scheduled loops that were developed at Cydrome over the period 1984-1988, but which have not as yet been published. These techniques are applicable to VLIW processors such as the Cydra 5 [10]. The processor model supports the initiation of multiple operations in a single cycle, where each operation may have latency greater than one cycle. The use of hardware features that specifically support the efficient execution of modulo scheduled loops is assumed. In addition to the conventional general-purpose register file (GPR), these features include rotating register files (register files supporting compiler-managed hardware renaming, which were termed the MultiConnect in the Cydra 5), predicated execution, and the Iteration Control Register (ICR) file (a boolean register file that holds the predicates) [8, 10]. This technical report also considers the register allocation of modulo scheduled loops on processors that have no special support for modulo scheduling.

Our discussion is limited to register allocation following modulo scheduling.
Register allocation prior to modulo scheduling would place unacceptable constraints on the schedule and would, therefore, result in poor performance. Concurrent scheduling and register allocation is preferable, but how this is to be achieved, in the context of modulo scheduling, is not understood. This technical report does not attempt to describe a complete scheduling-allocation strategy. For instance, little will be said on the important issue of what to do if the number of registers required by the register allocation exceeds the number available. Instead, the focus is on studying the relative performance of various allocation algorithms with respect to the number of registers that they end up using and their computational complexity. In the rest of this section, we provide a brief overview of certain terms associated with modulo scheduling, descriptions of predicated execution and rotating register files, and of modulo scheduled code structure in the presence of this hardware support. We also examine the nature of the lifetimes that the register allocator must deal with for modulo scheduled loops. Section 2 discusses the various register allocation strategies and code schemas that can be employed, and the
interdependencies between them. In the context of these alternatives, the register allocation problem is formulated in Section 3 and the candidate register allocation algorithms are laid out in Section 4. Section 5 describes the experiments performed and examines the data gathered from those experiments. Section 6 places this work in a broader perspective and, finally, Section 7 states our conclusions.
1.2 A quick overview of modulo scheduling

A rotating register file is addressed by adding the register specifier in the instruction to the contents of the Iteration Control Pointer (ICP) modulo the number of registers in the rotating register file. A special loop control operation, brtop, decrements the ICP amongst other actions. As a result of the brtop operation, a register that was previously specified as ri would have to be specified as ri+1, and a different register now corresponds to the specifier ri. This allows the lifetime of a value generated by an operation in one iteration to co-exist with the corresponding values generated in previous and subsequent iterations. Newly generated values are written to successive locations in the rotating register file and do not overwrite previously generated values even though exactly the same code is being executed repeatedly. This also introduces the need for the compiler to perform value tracking; the same value, each time it is used, may have to be referred to by a different register specifier depending on the number of brtop operations that lie between that use and the definition [8].

The Iteration Control Register file, ICR, is a rotating register file that stores boolean values called predicates. Predicated execution allows an operation to be conditionally executed based on the value of the predicate associated with it. Each predicated operation has an additional register specifier that specifies a register in the predicate register file. For example, the operation

    a = op(b,c) if pi
executes if the predicate pi is one, and is nullified if pi is zero. The primary motivation for predicated execution is to achieve effective modulo scheduling of loops containing conditional branches [8, 10]. Predicates permit the if-conversion [11] of the loop body, thereby eliminating all branches from the loop body. The resulting branch-free loop body can now be modulo scheduled. In the absence of predicated execution, other techniques must be used which require either multiple versions of code corresponding to the various combinations of branch conditions [12, 3] or restrictions on the extent of overlap between successive iterations [5]. A secondary benefit of predicated execution, the one relevant to this study, is in controlling the filling and draining of the software pipeline in a highly compact form of modulo scheduled code known as kernel-only code.

The number of cycles between the initiation of successive iterations in a modulo schedule is termed the initiation interval (II). The schedule for an iteration can be divided into stages consisting of II cycles each. The number of stages in one iteration is termed the stage count (SC). If the schedule length is SL cycles, the number of stages is given by SC = ceil(SL / II).
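The rotating-register addressing described above can be sketched in a few lines. This is a simplified model of our own (class and method names are illustrative, not the Cydra 5's); the hardware details are in [8, 10]:

```python
class RotatingRegisterFile:
    """Toy model of a rotating register file addressed relative to the ICP."""

    def __init__(self, size):
        self.size = size
        self.regs = [None] * size
        self.icp = 0  # Iteration Control Pointer

    def _physical(self, specifier):
        # specifier + ICP, modulo the number of registers in the file
        return (self.icp + specifier) % self.size

    def write(self, specifier, value):
        self.regs[self._physical(specifier)] = value

    def read(self, specifier):
        return self.regs[self._physical(specifier)]

    def brtop(self):
        # brtop decrements the ICP: the value written as r0 in this
        # iteration must be referred to as r1 after the branch.
        self.icp = (self.icp - 1) % self.size

rf = RotatingRegisterFile(64)
rf.write(0, "value from iteration i")
rf.brtop()
assert rf.read(1) == "value from iteration i"   # same value, new specifier
rf.write(0, "value from iteration i+1")         # does not overwrite it
assert rf.read(1) == "value from iteration i"
```

The final two lines illustrate the value-tracking burden placed on the compiler: after each brtop, every live value must be renamed by one specifier position.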
Figure 1 shows the record of execution of five iterations of the modulo scheduled loop with a stage count of 4. The execution of the loop can be divided into three phases: ramp up, steady state, and ramp down. In Figure 1, the first 3*II cycles, when not all stages of the software pipeline execute,
constitute the ramp up phase. The steady state portion begins when the fourth and last stage of the first iteration coincides with the first stage of the fourth iteration. During the steady state phase, one iteration completes for every one that starts. The steady state phase ends when the first stage of the last iteration has completed at time 5*II. Thereafter, in the ramp down phase, one iteration completes every II cycles until the execution of the loop completes at time 8*II.
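The phase boundaries above are simple arithmetic on II and SC. The helper names below are ours, checked against the values in Figure 1:

```python
from math import ceil

def stage_count(sl, ii):
    return ceil(sl / ii)          # SC = ceil(SL / II)

def ramp_up_end(sc, ii):
    return (sc - 1) * ii          # the first SC-1 stages fill the pipeline

def steady_state_end(n, ii):
    return n * ii                 # first stage of the last iteration done

def total_time(n, sc, ii):
    return (n + sc - 1) * ii      # the whole loop completes here

ii, sc, n = 2, 4, 5               # SC and iteration count from Figure 1;
                                  # II is symbolic there, any value works
assert ramp_up_end(sc, ii) == 3 * ii        # ramp up ends at 3*II
assert steady_state_end(n, ii) == 5 * ii    # steady state ends at 5*II
assert total_time(n, sc, ii) == 8 * ii      # execution completes at 8*II
```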
Figure 1. Software pipelined loop execution. (The figure plots time, in multiples of II, against iterations 1 through 5, each consisting of stages S0 through S3; the prologue code spans cycles 0 to 3*II, the kernel code 3*II to 5*II, and the epilogue code 5*II to 8*II.)
In generating code for a modulo schedule, one can take advantage of the fact that exactly the same pattern of operations is executed in each stage of the steady state portion of the modulo schedule's execution. This behavior can be achieved by looping on a piece of code that corresponds to one stage of the steady state portion of the record of execution. This code is termed the kernel. The record of execution leading up to the steady state is implemented with a piece of code called the prologue. A third piece of code, the epilogue, implements the record of execution following the steady state. In Figure 1, the number of stages in the prologue and the epilogue are each equal to SC-1. In general, depending on whether or not hardware support is provided, the register allocation strategy employed and the nature of the loop, the number of stages in the prologue and epilogue can be more or less than SC-1.

For DO-loops, with hardware support in the form of the rotating register file and predicated execution, it is not necessary to have explicit code for the prologue and the epilogue. Instead a single copy of the kernel is sufficient to execute the entire modulo scheduled loop. This is called kernel-only (KO) code. Consider the KO code depicted in Figure 2a. All the operations from the same stage of the same iteration are logically grouped by attaching them to the same predicate. The KO code is repeatedly executed every II cycles. The predicates take on values as shown in Figure 2b. Operations in a stage Si are executed when the corresponding predicate pi is one. Five iterations each consisting of four stages are swept out diagonally in Figure 2b. p0 is the predicate pointed to by the ICP. This predicate is set to 1 by the brtop operation during the ramp up
and steady state phases, and is set to 0 during the ramp down phase. Because brtop decrements the ICP, a different physical register p0 is written into every II cycles. A more detailed description of the operation of kernel-only code is provided in [8, 13].
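The way brtop sweeps the predicate values through the ICR can be simulated in a few lines (a sketch of ours; the rotating-file mechanics are simplified to a shift):

```python
def ko_predicates(n_iters, sc):
    """For each execution of the kernel, yield the predicate values
    p0..p{sc-1}. brtop writes 1 into the new p0 while iterations remain
    to be started, and 0 thereafter; older values shift toward p{sc-1}."""
    picture = []
    preds = [0] * sc
    for t in range(n_iters + sc - 1):
        preds = [1 if t < n_iters else 0] + preds[:-1]
        picture.append(list(preds))
    return picture

# Five iterations of four stages, as in Figure 2b: stage Si executes in a
# given II exactly when pi is 1, so the 1s sweep out the iterations
# diagonally through ramp up, steady state, and ramp down.
for row in ko_predicates(5, 4):
    print(row)
```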
(a) The kernel-only code: [ S3 if p3 | S2 if p2 | S1 if p1 | S0 if p0 ]

(b) The predicate values as controlled by brtop, and the stages as enabled by the predicates: five iterations of four stages each are swept out diagonally, with p0 set to 1 for the first five kernel executions and 0 thereafter, and each value shifting toward p3 every II cycles.

Figure 2. Kernel-only code
1.3 Lifetimes of values within loops

Each value produced in the loop has one producer and one or more consumers. The lifetime of a value starts when the producer is issued and ends when all of the consumers have finished1. Lifetimes correspond either to loop-invariant variables or to loop-variant variables. Loop-invariants are repeatedly used but never modified during loop execution. Loop-invariants, which are
1 This definition of lifetime is required if the VLIW code is to be interruptible and re-startable with a hardware model in which operations that have been issued always go to completion before the interrupt is handled. Otherwise, the lifetime could start when the producer finishes and end when the last consumer begins.
referenced in the loop, are assumed to have already been allocated in the (non-rotating) GPR file using conventional register allocation techniques. This is not the topic of this technical report. A new value is generated in each iteration for a loop-variant and, consequently, there is a different lifetime corresponding to each iteration. Loop-variants can be further categorized based on whether or not the value defined in one iteration is used in a subsequent one, and on whether or not a value defined in one of the last few iterations is used after the loop. That a loop-variant is used by a subsequent iteration is equivalent to saying that each iteration uses a value defined by a previous iteration. In the case of the first few iterations, these previous iterations do not exist, and the expected values must be generated before the loop is entered and, therefore, are live-in to the loop. Likewise, if a value that is defined in one of the last few iterations is used after the loop, it will be live-out from the loop.
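A consequence worth making explicit (our formulation, not a quotation from the report): because iteration i's copy of a loop-variant lifetime spans [start + i*II, end + i*II), a lifetime longer than II cycles has several copies from distinct iterations live at once, which is precisely why one register per loop-variant does not suffice:

```python
from math import ceil

def live_copies(start, end, ii):
    """Number of simultaneously live copies of a loop-variant whose
    lifetime in iteration i spans [start + i*II, end + i*II)."""
    return ceil((end - start) / ii)

# Illustrative numbers (not the report's data): a value defined at
# cycle 13 whose last use finishes at cycle 22, in a loop with II = 2,
# has five copies from distinct iterations live at the same time.
assert live_copies(13, 22, 2) == 5
```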
FORTRAN example:

      subroutine foo(a,s)
      real a(10), s
      do i = 1,35
        s = s + a(i)
        a(i) = s * s * a(i)
      enddo
      stop
      end

Modulo scheduled code:

    time   index   operation
      0    OP0     vr34 := m2read(vr33[1] lit: 0)
     13    OP1     vr35 := fadd(vr34 vr35[1])
     15    OP2     vr36 := fmpy(vr35 vr35)
     18    OP3     vr37 := fmpy(vr36 vr34)
     20    OP4     $vr38 := m1write(vr37 vr33[1] lit: 0)
      0    OP5     vr33 := a1add(vr33[1] vr32)
      0    OP6     $vr39 := brtop(slit: 'L.4-foo')
     21            ** schedule for iteration completes **

Figure 3. Modulo scheduled loop example
Figure 3 shows a FORTRAN DO-loop example, and the corresponding pseudo-code after scheduling. The latencies of the operations are given in Table 3. The schedule has an initiation interval, II, of 2 cycles. The issue time for each operation is indicated in the left hand column. A single iteration completes in 21 cycles.

In the loop shown in Figure 3, virtual register vr32, which is used in the address add (operation OP5), is a loop-invariant. It is allocated to the GPR file and has the value 4 written into it prior to entering the loop. The lifetime corresponding to a loop-invariant such as vr32 extends through all iterations of the loop. Virtual register vr37 is a loop-variant. It represents a new value of a(i) computed in each iteration of the loop. This value is produced by operation OP3 and has only one consumer, operation OP4, which belongs to the same iteration. Virtual register vr35 is also a loop-variant. It is produced by operation OP1 and represents a new value of s computed in each iteration. This value has two consumers, operations OP2 and OP1. Operation OP2 uses the value produced in the same iteration whereas operation OP1 uses the value produced one iteration earlier, indicated by the "[1]" after the vr35 operand. For loop variants like vr35 and vr37, there is a different lifetime corresponding to each iteration.

Recurrences always result in live-in loop-variants, and it is common for the last definition of a loop-variant to be live-out. Furthermore, compiler optimizations that eliminate loads and/or stores can also result in live-in and live-out values. When a load from (store to) a particular memory location in one iteration is followed by a load from that same memory location in a subsequent iteration, the latter load is redundant; the value may be used directly from the register to (from) which it was loaded (stored) in the prior iteration.
The first few iterations require that the appropriate registers be pre-loaded with the proper live-in values prior to loop entry. Likewise, when multiple stores from separate iterations can be proven to access the same memory location, all the stores, except for the
last one, are redundant and can be eliminated. This assumes that the last store actually occurs. In the case of the last few iterations, the supposedly redundant store, that was supposed to be covered by a store in a subsequent iteration, is, in fact, not redundant since the iteration containing the covering store is not executed. The last value that should have been stored must be stored after exiting the loop and, so, must be live-out. Such load-store optimizations, of scalar as well as subscripted references, were performed by the Cydra 5 compiler [8, 14] and have also been studied by Callahan, et al. [15].

(A table here summarizes the lifetimes of the loop in Figure 3 with II = 2: the number of loop invariants is 1, and each of the five loop variants is listed with its start time, end time, omega, and alpha.)
≠ j and, for scalar lifetimes, CONFLICT[i,j] = DIST[i,j] = DIST[j,i]. The total remaining conflict for lifetime i at any point in time is the sum of all CONFLICT[i,j] such that i ≠ j and j has not yet been set aside.

Each ordering heuristic takes the incoming list of lifetimes and reorders them in accordance with the heuristic. Multiple heuristics may be employed in conjunction with one another. When this is done, it is necessary to decide which heuristic is the primary one, which one is secondary (i.e., is used to break ties when the primary one cannot discriminate between two lifetimes), and so on. When two heuristics are used together, there are two possible prioritizations and, when all three are used together, there are six possible prioritizations. The success of the register allocator is sensitive to the specific prioritization selected. For each combination of allocation strategy and allocation algorithm, all possible prioritizations were evaluated using the input data set described in Section 5. The best prioritization in each case is listed in Table 2. In the context of the taxonomy that we have defined, the Cydra 5 compiler generates kernel-only code and performs BA allocation using the end fit allocation algorithm along with adjacency (primary) and start-time (secondary) ordering.

Table 2. Prioritization of ordering heuristics when used in conjunction with one another. The order in which the heuristics are listed is primary heuristic first, then secondary heuristic and tertiary heuristic last.

    Allocation   start &           start &            conflict &           start, conflict
    Strategy     conflict          adjacency          adjacency            & adjacency
    MVE          conflict, start   start, adjacency   conflict, adjacency  conflict, start, adjacency
    WO           conflict, start   adjacency, start   adjacency, conflict  adjacency, start, conflict
    BA           start, conflict   adjacency, start   adjacency, conflict  adjacency, start, conflict
4.3 Computational Complexity

The worst-case computational complexity for register allocation depends upon the specific combination of ordering heuristics and allocation algorithm that is employed. From an inspection of the algorithms (Figures 7-12) one can see that the worst-case complexity of the three ordering heuristics and the three register allocation algorithms is:

    start time    O(n^2)
    conflict      O(n^2)
    adjacency     O(n^2)
    best fit      max(O(n^2), O(n^2 r))
    first fit     max(O(n^2), O(n r))
    end fit       max(O(n^2), O(n))
where n is the number of lifetimes (either scalar or vector) to be allocated1 and r is the number of registers required to do so. Empirically, we found that r was a linear function of n (approximately 0.61*n + 14.5) over the range of values of n for which there was statistical significance. So, effectively, best fit, first fit and end fit are O(n^3), O(n^2) and O(n^2), respectively. Since the ordering heuristics are O(n^2), it is the selected allocation algorithm that dominates the complexity of the register allocation phase. Note that in the conflict ordering algorithm, updating RemainingTotalConflict[j] for all j that are still in NotSetAside is an O(n) operation.

When multiple ordering heuristics are used together, the lowest priority ordering is performed first, and the highest priority one last. It is necessary, therefore, that each sorting algorithm have the property that it maintains the relative ordering of all lifetimes that are equal under the sorting criterion that is currently being applied. Consequently, a simplistic sort of O(n^2) complexity was used for start-time ordering instead of using heapsort which is O(n log n).

The complexity expressions for the allocation algorithms contain two terms, the greater one of which determines the complexity. The O(n^2) term is caused by having to update the disallowed locations for each as yet unallocated lifetime each time a lifetime is allocated (Figure 7). The updating process is O(n) in complexity, and it is invoked n times. The other term is dependent on the allocation algorithm selected. In understanding these, one should note that each of the procedures in Figures 8-10 is invoked O(n) times and that the loops in Figures 8 and 9 are executed O(r) times on each invocation. Lastly, the function FOM that is called from within the loop in the best fit algorithm is itself O(n) in complexity.
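The composition rule just described (lowest priority ordering first, each sort order-preserving) can be sketched directly, since a stable sort is exactly what preserves the earlier orderings as tie-breakers. The key functions below are illustrative stand-ins for the report's start-time and conflict heuristics:

```python
def order_lifetimes(lifetimes, keys):
    """Apply ordering heuristics given as (key_fn, reverse) pairs,
    listed from primary to tertiary. The lowest-priority ordering is
    performed first; list.sort is stable, so lifetimes that are equal
    under a later (higher-priority) criterion keep the earlier order."""
    result = list(lifetimes)
    for key_fn, reverse in reversed(keys):
        result.sort(key=key_fn, reverse=reverse)
    return result

# Lifetimes as (name, start, conflict) tuples -- an assumed shape.
lts = [("a", 5, 9), ("b", 2, 9), ("c", 2, 4)]
by_conflict = (lambda lt: lt[2], True)    # most remaining conflict first
by_start = (lambda lt: lt[1], False)      # earliest start time first

# Conflict primary, start-time secondary: a and b tie on conflict,
# and the earlier start of b breaks the tie.
ordered = order_lifetimes(lts, [by_conflict, by_start])
assert [lt[0] for lt in ordered] == ["b", "a", "c"]
```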
The worst-case complexity is generally quite pessimistic compared to what is actually experienced because, in many cases, the iterations of the loops are guarded by IF-statements that are often not executed. So, we have also measured the empirical computational complexity for each combination. This is presented in Section 5. Counters were placed at the entry point as well as in every innermost loop of every relevant procedure. These counters count the total number of times that that point in the program is visited. In the spirit of worst-case complexity analysis, we used the largest of the counts as the indicator of empirical computational complexity.

1 Note that for the MVE strategy, the number of lifetimes will be Kmin times greater than for the other strategies.
5 Experimental Results

Measurements were taken to examine the effectiveness of the allocation algorithms described in the previous section. To this end, 1358 FORTRAN DO-loops1 from the SPEC [18] and PERFECT Club [19] benchmark suites were modulo scheduled on a hypothetical machine similar to the Cydra 5 [10]. The relevant details of this hypothetical machine are shown in Table 3. The machine model permits the issue of up to seven operations per cycle. All the functional unit pipelines read their inputs and write their results to a common rotating register file. The 13 cycle load latency reflects an access that bypasses the first-level cache. (Our experience has been that locality of reference is often missing within innermost loops. Until compiler algorithms are devised which reliably ascertain whether or not a sequence of references in a loop has locality, it is preferable to have the deterministic albeit longer latency that comes from consistently bypassing the first-level cache.)

The Cydra 5 compiler was used to generate a modulo schedule for each loop and then output a set of lifetime descriptions for all the loop-variants. The lifetime descriptions were used as input to a program that applied all combinations of ordering heuristics and allocation algorithms to each input set in turn. The register allocation program outputs various data for each input data set and for each combination of algorithms and heuristics. The data include the number of registers needed by the loop-variants, a lower bound on the number required, the code size, and the empirical computational complexity.

Table 3. Relevant details of the machine model used by the compiler for this study.

    Pipeline               Number   Operations                       Latency
    Memory port            2        Load                             13
                                    Store                            1
    Address ALU            2        Address add / subtract           1
    Adder                  1        Floating point add / subtract    1
                                    Integer add / subtract           1
    Multiplier             1        Floating point multiply          2
                                    Integer multiply                 2
    Instruction pipeline   1        Branch                           2
The characteristics of the loops that were analyzed in our experiment are shown in Figure 13a-f. Each graph is a cumulative distribution. From Figure 13 it can be seen that most loops in our sample set had fewer than 80 loop-variants, although there were a few that had more than 200. In contrast, most loops had fewer than 10 loop-invariants with a few having more than 30. Every loop
1 Only measurements on DO-loops were performed, even though all the register allocation algorithms discussed are fully applicable to WHILE-loops and loops with early exits. This is due to the limitation of the Cydra 5 compiler which was used to generate the input to the register allocator and which is unable to recognize loops which are not DO-loops. Also, only those DO-loops with no conditional branching were selected so as to permit a meaningful comparison between MVE (which assumes no predicated execution) and the other two allocation strategies. Apart from the relative code sizes for the three allocation strategies, we do not expect any of the conclusions regarding the relative merits of the various allocation algorithms and ordering heuristics to be different across the broader set of loops.
had at least one live-in variable, most having less than 15, and a few having over 30. Live-out variables were less frequent; not shown in Figure 13 is that 80% had no live-out variables and, of the remaining, about half had one live-out and the rest had two live-outs. The graph for the initiation interval indicates that most loops had an II less than 40 cycles. Nearly 60% of the loops had an II less than 8 cycles. Schedule lengths are generally less than 90 cycles, but a few loops had schedules several hundred cycles long. The number of stages in the software pipeline was 10 or less 90% of the time.
(Each plot shows the cumulative percentage of loops on the vertical axis, against: (a) number of loop variants, (b) number of loop invariants, (c) initiation interval in cycles, (d) number of live-in variables, (e) scheduled loop length in cycles, (f) number of stages.)

Figure 13. Cumulative distribution plots for input loops
Table 4a. Comparison of the optimality of strategies and heuristics. (Rows: allocation algorithm, with adjacency ordering off ("as is") or on ("adj"); columns: allocation strategy, with the other ordering heuristics applied: none ("as is"), start-time ("strt"), conflict ("conf"), or both ("st/cnf").)

                          MVE Strategy                WO Strategy                 BA Strategy
                    as is  strt   conf  st/cnf   as is  strt  conf  st/cnf   as is  strt  conf  st/cnf
    best fit  as is  3.21  1.71   1.32   1.30    0.89   0.52  0.52   0.51    1.23   0.58  0.62   0.59
              adj    2.40  1.75   1.30   1.30    0.27   0.08  0.14   0.08    0.49   0.24  0.27   0.23
    first fit as is  3.36  1.77   1.36   1.36    2.46   0.99  1.16   0.99    3.86   2.20  2.31   2.17
              adj    2.55  1.82   1.36   1.36    0.18   0.08  0.14   0.08    1.38   1.49  1.27   1.40
    end fit   as is 16.42 35.79  27.75  28.80    5.65   6.52  6.59   6.79    7.08   7.72  8.34   7.98
              adj    3.30  2.51   3.00   2.37    0.18   0.08  0.14   0.08    1.40   1.56  1.37   1.48

Table 4b. Comparison of the empirical computational complexity of strategies and heuristics.

                          MVE Strategy                WO Strategy                 BA Strategy
                    as is  strt   conf  st/cnf   as is  strt  conf  st/cnf   as is  strt  conf  st/cnf
    best fit  as is  4.44  5.39   6.58   6.69    1.21   1.14  1.10   1.12    1.24   1.18  1.13   1.17
              adj    4.35  5.39   6.65   6.68    1.01   1.01  1.01   1.01    1.05   1.04  1.04   1.04
    first fit as is  3.58  3.58   3.58   3.58    1.00   1.00  1.00   1.00    1.03   1.03  1.03   1.03
              adj    3.58  3.58   3.58   3.58    1.00   1.00  1.00   1.00    1.03   1.03  1.03   1.03
    end fit   as is  3.58  3.58   3.58   3.58    1.00   1.00  1.00   1.00    1.03   1.03  1.03   1.03
              adj    3.58  3.58   3.58   3.58    1.00   1.00  1.00   1.00    1.03   1.03  1.03   1.03

Table 4c. Comparison of the code size of strategies.

    MVE - multiple epilogues    258.51
    MVE - preconditioning       142.43
    WO                           42.30
    BA                           18.21
Table 4 contains data on the various heuristics and strategies. All combinations were studied. For each combination Table 4 shows
• the optimality, i.e., the average of the difference between the achieved number of registers and the lower bound,
• the average measured empirical computational complexity relative to the lowest average empirical computational complexity across all combinations (that for WO with end fit), and
• the average code size in operations1. For MVE, two code sizes are shown: with multiple epilogues and with preconditioning.
1 In the context of VLIW machines, a distinction needs to be made between an operation and an instruction. An instruction consists of multiple operations, whereas an operation is a unit of computation equivalent to a RISC instruction. The code size is proportional to the number of operations for both VLIW and RISC processors.
For the MVE strategy, fewest excess registers were used on the average by best fit allocation, using conflict ordering in conjunction with one or both of start-time and adjacency ordering. If first fit allocation were to be employed instead of best fit, but using the same ordering heuristics, marginally poorer allocation is obtained (< 5% worse) but with almost a halving of the empirical computational complexity. The WO strategy achieves near-optimal results when it uses adjacency and start-time heuristics (with or without the use of the conflict heuristic) and regardless of which allocation algorithm is used. For the BA strategy, best results were obtained with best fit allocation using adjacency, start-time and conflict ordering. This only had slightly higher empirical computational complexity (by 1%) than first fit or end fit.
Figure 14. Cumulative distribution plot for the optimality of strategies and heuristics
Figure 14 shows the cumulative distribution of the difference between the actual number of registers used and the lower bound, i.e., the extent of the deviation from the lower bound1. A separate plot is shown for each of the nine combinations of allocation strategy and allocation algorithm. In each case, the results corresponding to the best combination of ordering heuristics were selected. The three plots for WO are indistinguishable. Over 90% of the loops yield optimal allocation, and almost all of the rest require only one register over the lower bound. BA with best fit is almost as good. 80% of the loops are optimal, another 15% require an additional register, and very few need more than two registers over the lower bound. First fit and end fit for BA are very similar to each other, and significantly different from best fit. Especially interesting is the "plateau" between 1 and 10 registers; over 80% of the loops need at most one additional register, and almost 10% require 11 additional registers, with the rest of the loops requiring some intermediate number of registers. This suggests an interesting, hybrid allocation algorithm in which one first attempts first fit allocation and then, if too many additional registers end up being required, performs best fit
1 It should be noted that the deviation from optimality may be less since even the optimal allocation may require more registers than the lower bound.
allocation. For MVE, best fit and first fit are almost indistinguishable. 40% of the loops are optimal, and rarely are more than 6 additional registers required. End fit for MVE is significantly worse. On the average, all three strategies, when they employ the best combination of allocation algorithm and ordering heuristics, are able to achieve extremely good register allocation. On the average, WO is near-optimal, BA requires 0.24 registers more than the lower bound, and MVE requires 1.3 registers over the lower bound. The differences lie in their empirical complexity and the resulting code size. If we limit our discussion to only the best combinations for each strategy, WO has the lowest empirical complexity, BA is about 4% worse, and MVE is worse by a factor of 6.6. This factor drops to 3.6 if first fit, instead of best fit, is used for MVE. With respect to code size, BA is clearly the best. WO and MVE require code that is 2.3 and 14.2 times larger, respectively. If preconditioned code is used with MVE, the factor drops to 7.8 (but at the cost of reduced performance, especially for small trip counts).
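The hybrid suggested above (attempt first fit, fall back to best fit only when the result strays too far from the lower bound) can be sketched abstractly. Here first_fit, best_fit, the slack threshold, and the mapping representation are all placeholders of ours, not the report's Figures 8-10:

```python
def num_registers(alloc):
    """alloc is assumed to map each lifetime to a register index."""
    return max(alloc.values()) + 1 if alloc else 0

def hybrid_allocate(lifetimes, first_fit, best_fit, lower_bound, slack=2):
    """Run the cheap O(n^2) first fit allocator; redo the allocation
    with the O(n^3) best fit allocator only if first fit needs more
    than `slack` registers beyond the lower bound."""
    alloc = first_fit(lifetimes)
    if num_registers(alloc) > lower_bound + slack:
        alloc = best_fit(lifetimes)   # pay the higher cost only when needed
    return alloc
```

With the distribution reported above (over 80% of loops within one register of the bound under first fit), such a hybrid would invoke best fit for only the minority of loops sitting on the plateau.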
For each combination of allocation strategy and allocation algorithm, Table 5 lists the set of preferred ordering heuristics. The combination of ordering heuristics selected is the one that yields the best register allocation, on the average. When different combinations of heuristics yield results that are very close in terms of optimality, the computationally least expensive combination is selected as the preferred one. In comparing the best fit and first fit algorithms for MVE, one sees that the two are very close in terms of optimality (a difference of 0.06 registers on the average) but that first fit has about half the empirical complexity of best fit. Consequently, in Table 5, first fit is marked as the algorithm preferred over best fit. With WO, all three allocation algorithms are equally optimal, so we prefer end fit since it is the least expensive. With BA, best fit is significantly better than the other two for very little added complexity. Accordingly, best fit is the preferred allocation algorithm for BA.

Table 5. The preferred allocation algorithm (marked *) for each allocation strategy and the preferred ordering heuristics for each allocation algorithm.

    Allocation Strategy   Allocation Algorithm   Preferred Ordering Heuristics
    PKE/MVE               Best Fit               conflict
                          First Fit (*)          conflict
                          End Fit                adjacency, start, conflict
    PKE/WO                Best Fit               adjacency, start
                          First Fit              adjacency, start
                          End Fit (*)            adjacency, start
    KO/BA                 Best Fit (*)           adjacency, start
                          First Fit              adjacency, conflict
                          End Fit                adjacency, conflict
Comparing the WO and BA strategies (Table 4), it can be seen that the number of registers used and the measured computational complexity are not very different, but the average code size for strategy WO is 2.3 times larger. If one is concerned about instruction cache performance, it may be preferable to use BA register allocation and generate KO code. This also allows one to avoid the
complexities involved with WO in generating the correct code schema for PKE code [13]. The performance degradation of BA relative to WO is limited to the lost opportunity of percolating outer loop computation into the prologue and epilogue. (In both cases, it is assumed that hardware support is available in the form of the rotating register file and predicated execution.)

With the MVE strategy, degrees of unroll greater than Kmin were attempted. Larger degrees of unroll resulted in higher empirical computational complexity and code size, and provided little to no reduction in the number of registers used. For over 80% of the loops, unrolling by more than Kmin saved at most one register, and very few cases were found where the savings were more than three registers. Furthermore, the number of registers needed does not decrease monotonically as the degree of unroll is increased. Additional unrolling beyond Kmin is not worthwhile.

Figure 15 shows the cumulative distributions of the number of registers used, the code size and the empirical computational complexity for the preferred allocation algorithm and ordering heuristics for each strategy. For the machine model used, 98% of the loops required fewer than 64 registers for loop-variants, and fewer than 12 registers for loop-invariants. The difference between the strategies is hardly noticeable in Figure 15a, though the MVE strategy performs slightly worse than the other two. Figure 15b and Figure 15c show the cumulative distributions of the code size and empirical computational complexity, respectively. The relative ordering of MVE, WO and BA is the same for the variance and maximum of the code size distribution as it is for the mean: MVE is larger than WO, which is larger than BA. The same is true for the empirical computational complexity; the mean, variance and maximum for MVE are all greater than the corresponding values for BA which, in turn, are marginally greater than those for WO.
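The role of Kmin can be made concrete with a small sketch. In one common formulation of modulo variable expansion (following Lam [5]), a loop-variant whose lifetime spans more than II cycles would be overwritten by the next iteration's definition of the same variable, so it needs ceil(lifetime/II) register copies, and the kernel must be unrolled at least that many times. The helper names below are illustrative, not this report's:

```python
from math import ceil

def copies_needed(lifetime_len, ii):
    # Copies required so that successive iterations' values of one
    # loop-variant do not overwrite each other.
    return max(1, ceil(lifetime_len / ii))

def k_min(lifetime_lens, ii):
    # Minimum kernel unroll: enough copies for the longest-lived variant.
    return max(copies_needed(l, ii) for l in lifetime_lens)
```

For example, with II = 4 and lifetimes of 3, 7 and 12 cycles, the variants need 1, 2 and 3 copies respectively, so Kmin = 3.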
6 Discussion

The results of the previous section show that with MVE there is significant code expansion as well as an increase in compile time. It must be noted, however, that these results correspond only to the modulo scheduled innermost loops. The relative code size and compilation time increases for the complete program would be less, depending on what fraction of the code consists of innermost loops. Furthermore, the increase in the static code size is relatively unimportant; what matters is the effect of code size on instruction cache performance. With MVE, when preconditioning is not done, there is a significant code size expansion because there are multiple epilogues. However, if the loop is always executed with the same trip count, or with varying trip counts that result in the same epilogue being executed, then the effective, dynamic code size increase is small. The remaining epilogues do not affect instruction cache performance even though they increase the static code size. If, on the other hand, the loop is executed repeatedly with a variety of trip counts requiring different epilogues, then this will affect the instruction cache's performance. The extent to which this is a problem requires further study.

When the loop is preconditioned to avoid the multiple epilogue problem with MVE, the execution time of the loop varies as a sawtooth (Figure 16), depending on the trip count modulo the degree of kernel unroll [13]. When the trip count is of the form (PS+ES-SC+1)+i*K (as discussed in Section 2.2), the preconditioning code is not executed, and all loop iterations are executed in a software pipelined manner, yielding high performance. In the worst case, a number of iterations (one less than max{K, (PS+ES-SC+1)+K}) have to be executed sequentially, at reduced performance, in the preconditioning code.
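The sawtooth can be illustrated with a toy calculation. This is an assumption-laden sketch, not the code schema of [13]: it takes the statement above literally, treating trip counts of the form (PS+ES-SC+1)+i*K as needing no preconditioning and peeling the remainder modulo K into the sequential preconditioning loop.

```python
def residual_iterations(n, K, PS, ES, SC):
    """Toy model of the iterations run sequentially in the preconditioning
    loop for trip count n. The formula is an assumption for illustration,
    not the exact schema of the report."""
    base = PS + ES - SC + 1
    if n < base:
        return n               # assumed: too short to pipeline at all
    return (n - base) % K      # remainder peeled before the pipelined kernel
```

With the Figure 16 parameters (SC = 7, K = Kmin = 5) and hypothetical values PS = ES = 6, base = 6: trip counts 6, 11, 16, ... incur no preconditioning, while a trip count of 18 peels 2 iterations, producing the sawtooth.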
[Figure 15. Cumulative distribution plots of the results for the preferred allocation algorithm and ordering heuristics for each strategy: (a) number of registers, (b) code size in number of machine operations, (c) measured computational complexity. The vertical axis of each plot is the cumulative percentage of loops.]
[Figure 16. Execution time as a function of trip count, in the case when II = 1, SC = 7, Kmin = 5, for various code schema: preconditioned, kernel-only and prologue-kernel-epilogue.]
The issue of phase-ordering between instruction scheduling and register allocation has been studied in the context of single basic blocks. There are advocates both for performing register allocation before scheduling [20, 21, 7] and for performing it after scheduling [22-25]. Each phase-ordering has its advantages, and neither one is completely satisfactory. The major problem with performing register allocation first is that it introduces anti-dependences and output dependences that can constrain parallelism and the ability to construct a good schedule. The drawback of performing register allocation after scheduling is that, in the event that a viable allocation cannot be found, spill code must be inserted; at this point, the schedule just constructed may no longer be viable. Simultaneous scheduling and register allocation is the most attractive, but currently it is understood only in the context of a linear trace of basic blocks [26]. If a loop is unrolled some number of times and then treated as a linear trace of basic blocks, simultaneous trace scheduling and register allocation can be accomplished, but with some loss of performance due to the emptying of pipelines across the back-edge [27]. In the case of modulo scheduling, no approach has yet been advanced for simultaneous register allocation. Since doing register allocation in advance is unacceptably constraining on the schedule, it must be performed following modulo scheduling [8].

In this technical report, we have studied the optimality of various allocation algorithms and heuristics. This serves as a useful guide in selecting the best algorithm for each situation, with the objective of minimizing the number of registers needed. A different objective, and one that is equally
important, is the issue of how one performs register allocation so as to make do with the number of registers that are available. The best algorithm from the first viewpoint is best from the second viewpoint, too. However, once modulo scheduling has been performed, if the register allocator fails to find a solution requiring no more registers than are available, some additional action must be taken. Various options exist.

One approach is to reschedule the loop with an increased initiation interval, on the presumption that register pressure is proportional to the number of concurrently executing iterations. A feasible allocation may then be found at the expense of reduced performance. However, to benefit from this strategy, the modulo scheduling algorithm must incorporate heuristics that take advantage of it by attempting to minimize the number of lifetimes that are simultaneously live at any point. Further work is needed to identify and evaluate such heuristics.

Another option is to select the appropriate lifetimes to spill, add the required spill code, perform modulo scheduling again, and then repeat the register allocation. Rescheduling is necessary because the additional loads and stores will, in general, increase the initiation interval. Unfortunately, the increased initiation interval yields a different schedule, in which a different set of lifetimes might need to be spilled than those that have already been spilled. The heuristics which specify the correct set of lifetimes to spill are not understood at all. A further complication is that we now require a rotating memory buffer to hold the spilled values, for the same reasons that motivate the use of rotating register files; alternatively, we need to unroll the kernel in a manner similar to MVE. A detailed discussion of this topic is beyond the scope of this paper.
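The first option can be sketched as a retry loop. The scheduler and allocator below are caller-supplied callables; they, and the toy pressure model in the example, are placeholders made up for illustration, not the algorithms measured in this report.

```python
def schedule_with_register_limit(loop, mii, max_ii, num_regs,
                                 modulo_schedule, allocate_registers):
    """Retry modulo scheduling at successively larger initiation intervals
    (II) until the allocator fits within num_regs. modulo_schedule(loop, ii)
    returns a schedule or None; allocate_registers(schedule) returns the
    register count needed. Both are caller-supplied placeholders."""
    for ii in range(mii, max_ii + 1):
        schedule = modulo_schedule(loop, ii)
        if schedule is None:            # no feasible schedule at this II
            continue
        if allocate_registers(schedule) <= num_regs:
            return schedule, ii         # lowest II with a feasible allocation
    return None, None                   # fall back: spill or split the loop
```

For instance, with a toy allocator whose pressure is ceil(20/II) and a budget of 5 registers, the retry loop starting at MII = 2 first succeeds at II = 4, trading performance for a feasible allocation.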
The third possibility, when register allocation fails, is to split the loop into two or more loops, each one with fewer operations than the original. The underlying premise is that loops with fewer operations tend to generate less register pressure. It is not entirely clear that this is so; in fact, the smaller loops will generally have a lower II, with the net result that the register pressure could potentially even increase. Also unclear is how many loops the splitting should yield, and the manner in which the operations should be divided between the smaller loops. Again, much more investigation is required to understand the merits of this approach.

Although a satisfactory solution to this problem remains a difficult open issue, one observation is of help. Our experiments show that the achieved allocation is usually very close to the easily calculated lower bound. A promising approach might be to keep track of the number of simultaneously live values during the modulo scheduling phase and to employ a heuristic which keeps this lower bound below the number of available registers, perhaps with a small safety margin. This would almost surely guarantee success in the subsequent register allocation phase.
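The observation above suggests a scheduling-time guard. A minimal sketch, assuming lifetimes are represented as (start cycle, length) pairs: count the values live in each of the II kernel cycles (a lifetime longer than II wraps around and contributes to some cycles more than once, since it needs multiple overlapping copies), and reject a tentative placement that would push the peak above the register budget minus a safety margin.

```python
def kernel_max_live(lifetimes, ii):
    """Peak number of simultaneously live values across the II kernel
    cycles; lifetimes are (start cycle, length) pairs and wrap modulo II."""
    live = [0] * ii
    for start, length in lifetimes:
        for c in range(length):
            live[(start + c) % ii] += 1
    return max(live) if live else 0

def placement_ok(lifetimes, ii, num_regs, margin=2):
    """Guard used during scheduling: keep the MaxLive lower bound below
    the available registers, with a small safety margin."""
    return kernel_max_live(lifetimes, ii) <= num_regs - margin
```

Since MaxLive is only a lower bound on the registers the allocator will actually need, the safety margin absorbs the small deviation (typically 0 to 2 registers in our measurements) between the bound and the achieved allocation.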
7 Conclusion

When hardware support is present, in the form of predicated execution and the rotating register file, the kernel-only, blades allocation strategy is to be preferred over the wands-only strategy, which requires prologue-kernel-epilogue code. It results in about 55% less code, with insignificant increases in the average number of registers used or in the empirical computational complexity. The best combination of heuristics to use with best fit allocation is the adjacency and start-time ordering heuristics.

The absence of hardware support for modulo scheduling necessitates prologue and epilogue code, kernel unrolling, modulo variable expansion, and either a sequential preconditioning loop or
multiple epilogues. Our experimental data indicate that first fit allocation with conflict ordering is the best choice. Unrolling the kernel more than the minimum number of times does not significantly reduce the number of registers required; instead, it increases both the empirical computational complexity and the average code size. Our recommendation is to employ the minimum degree of unroll.

The measurements of computational complexity and average code size lend some weight to arguments in favor of hardware support for modulo scheduling, in the form of the rotating register file and predicated execution. Without it, the increased empirical computational complexity (by a factor of 3 to 6) results in longer compile times. The increased average code size (by a factor of 7 to 14) can have a negative effect on the instruction cache hit rate and, hence, on performance. The sequential execution of the residual iterations in preconditioned code can further degrade performance for realistic trip counts.
Acknowledgements The end fit allocation algorithm and the adjacency ordering heuristic, described in this technical report, were conceived of by Ross Towle who, along with Warren Ristow and Jim Dehnert, implemented register allocation for vector lifetimes in the Cydra 5 compiler.
References

1. A. E. Charlesworth. An Approach to Scientific Array Processing: The Architectural Design of the AP-120B/FPS-164 Family. IEEE Computer 14, 9 (1981), 18-27.
2. B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. Proceedings of the Fourteenth Annual Workshop on Microprogramming (1981), 183-198.
3. A. Nicolau and R. Potasman. Realistic scheduling: compaction for pipelined architectures. Proceedings of the 23rd Annual Workshop on Microprogramming and Microarchitecture (Orlando, Florida, 1990), 69-79.
4. P. Y. T. Hsu. Highly Concurrent Scalar Processing. Technical Report CSG-49. Coordinated Science Lab., University of Illinois, Urbana, Illinois, 1986.
5. M. Lam. Software pipelining: an effective scheduling technique for VLIW machines. Proceedings of the ACM SIGPLAN '88 Conference on Programming Language Design and Implementation (1988), 318-327.
6. R. L. Lee, A. Y. Kwok, and F. A. Briggs. The floating point performance of a superscalar SPARC processor. Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (Santa Clara, California, 1991), 28-37.
7. S. Jain. Circular scheduling: a new technique to perform software pipelining. Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation (1991), 219-228.
8. J. C. Dehnert, P. Y.-T. Hsu, and J. P. Bratt. Overlapped loop support in the Cydra 5. Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (Boston, Mass., 1989), 26-38.
9. P. Tirumalai, M. Lee, and M. S. Schlansker. Parallelization of loops with exits on pipelined architectures. Proceedings of Supercomputing '90 (1990), 200-212.
10. B. R. Rau, et al. The Cydra 5 departmental supercomputer: design philosophies, decisions and trade-offs. IEEE Computer 22, 1 (1989).
11. J. R. Allen, et al. Conversion of control dependence to data dependence. Proceedings of the Tenth Annual ACM Symposium on Principles of Programming Languages (1983).
12. K. Ebcioglu and T. Nakatani. A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture. Proceedings of the Second Workshop on Programming Languages and Compilers for Parallel Computing (Urbana-Champaign, 1989), 213-229.
13. B. R. Rau, et al. Code Generation Schema for Modulo Scheduled DO-Loops and WHILE-Loops. Technical Report HPL-92-47. Hewlett-Packard Laboratories, Palo Alto, California, 1992.
14. B. R. Rau. Data flow and dependence analysis for instruction level parallelism. Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing (Santa Clara, 1991).
15. D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation (1990), 53-65.
16. L. J. Hendren, et al. Register Allocation using Cyclic Interval Graphs: A New Approach to an Old Problem. Technical Report ACAPS Technical Memo 33. Advanced Computer Architecture and Program Structures Group, McGill University, Montreal, Canada, 1992.
17. G. J. Chaitin. Register allocation and spilling via graph coloring. Proceedings of the SIGPLAN '82 Symposium on Compiler Construction (1982), 201-207.
18. J. Uniejewski. SPEC Benchmark Suite: designed for today's advanced systems. SPEC Newsletter 1, 1 (1989).
19. M. Berry, et al. The Perfect Club benchmarks: effective performance evaluation of supercomputers. The International Journal of Supercomputer Applications 3 (1989), 5-40.
20. J. L. Hennessy and T. R. Gross. Postpass code optimization of pipeline constraints. ACM Transactions on Programming Languages and Systems 5, 3 (1983), 422-448.
21. P. B. Gibbons and S. S. Muchnick. Efficient instruction scheduling for a pipelined architecture. Proceedings of the ACM SIGPLAN Symposium on Compiler Construction (1986), 11-16.
22. J. R. Goodman and W. C. Hsu. Code scheduling and register allocation in large basic blocks. Proceedings of the 1988 International Conference on Supercomputing (1988), 442-452.
23. P. Sweany and S. Beaty. Post-compaction register assignment in a retargetable compiler. Proceedings of the 23rd Annual Workshop on Microprogramming and Microarchitecture (Orlando, Florida, 1990), 107-116.
24. D. G. Bradlee, S. J. Eggers, and R. R. Henry. Integrating register allocation and instruction scheduling for RISCs. Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (Santa Clara, California, 1991), 122-131.
25. P. P. Chang, D. M. Lavery, and W. W. Hwu. The Importance of Prepass Code Scheduling for Superscalar and Superpipelined Processors. Technical Report CRHC-91-18. Center for Reliable and High Performance Computing, University of Illinois, Urbana, IL, 1991.
26. J. R. Ellis. Bulldog: A Compiler for VLIW Architectures. The MIT Press, Cambridge, Mass., 1985.
27. J. A. Fisher. Trace scheduling: a technique for global microcode compaction. IEEE Transactions on Computers 30, 7 (1981), 478-490.