An Integer Linear Programming Model of Software Pipelining for the MIPS R8000 Processor

Artour Stoutchinin
EE Department, University of Delaware, USA*

* An important part of this work was done while the author was at McGill University in Montreal.

Abstract. In parallelizing code for high-performance processors, software pipelining of innermost loops is of fundamental importance. To benefit from software pipelining, two separate tasks must be performed: (i) software pipelining proper (finding a rate-optimal legal schedule), and (ii) register allocation (allocating registers to the schedule found). Software pipelining and register allocation can be formulated together as an integer linear programming (ILP) problem, aiming to produce optimal schedules. In this paper, we discuss the application of integer linear programming to software pipelining on the MIPS R8000 superscalar microprocessor. Some of the results were presented at PLDI96 [14], where they were compared to the MIPSpro software pipeliner. In this paper we further extend the ILP model for the MIPS R8000 by including memory optimization, and present the entire model in detail.

1 Introduction

In recent years, the concept of instruction-level parallelism has played a central role in microprocessor design at all the major CPU manufacturers. Several processors, such as the DEC Alpha 21064, the IBM PowerPC, the MIPS R8000 and R10000, the Intel i860 and i960, the Sun Microsystems SPARC, etc., derive their performance benefit from instruction-level parallelism. Instruction-level parallel processors take advantage of the parallelism in programs by performing multiple machine-level operations simultaneously. A typical such processor (a superscalar or VLIW processor) provides multiple pipelined functional units in parallel, thus allowing the simultaneous issue of multiple operations per clock cycle [12]. In order to take advantage of instruction-level parallelism, compilation techniques are needed that expose parallelism in programs written in a high-level language.

1.1 Exact Modulo Scheduling Approach

The subject of this paper, software pipelining, and more precisely modulo scheduling, is a compiler parallelization technique which has recently received a lot of


attention and has been successfully implemented in production compilers [4, 14]. Because of its computational complexity, heuristic algorithms are used for modulo scheduling. One question that heuristic approaches leave unanswered is "How well do these methods do their job?" Indeed, is there any room for improvement? Questions of this type led to the development of exact scheduling methods that guarantee the optimality of their results. The main idea behind the exact methods is to represent the scheduling problem as an optimization problem with a set of linear scheduling constraints and an objective function minimizing some cost criterion. A number of interesting results on using the linear and integer linear programming approach for software pipelining have been published recently [10, 7, 2, 1, 5]. This paper concentrates on the development of an integer linear programming software pipeliner for the MIPS R8000 microprocessor, based on the previous work by E. Altman [1]. In developing this software pipeliner, our main interest was to study how well the ILP approach would work when targeted to a real processor. After the publication of the first measurements of runtime performance for ILP-generated software pipelines [14], one of the important questions left unanswered was how much uncertainty had been introduced by the memory system in our experiments. In this paper we analyze this question and present a new set of experimental results with the memory effects taken care of by an improved ILP model.

1.2 MIPS R8000 Main Features

The MIPS R8000 is a pipelined superscalar processor which allows multiple issue of up to four instructions per clock, chosen from two integer, two memory, and four floating-point instruction types. The R8000 contains two arithmetic logic units (ALUs), one shifter, one multiply-divide unit and two address generation units. Two independent memory instructions are supported per cycle. The floating-point coprocessor (FPU) performs all the floating-point functions. It contains two execution datapaths, each capable of multiply-adds, simple multiplies, adds, divides, square roots and conversions. The two datapaths are completely symmetric and indistinguishable to software: the compiler simply knows that it can schedule two floating-point operations per cycle. Floating-point loads and stores go directly to/from the off-chip external cache. External cache latency is hidden by decoupling the coprocessor from the R8000 pipeline. Floating-point instructions are dispatched into a queue where they can wait for resource contention and data dependencies to clear without holding up integer dispatching. If a floating-point load instruction is immediately followed by a compute instruction that uses the result of the load, the queue allows both instructions to be dispatched together as if the load had no latency at all. The integer pipeline is immediately free to continue with other instructions. The load instruction proceeds down the external cache pipeline; in the meantime, the compute instruction waits in the floating-point instruction queue until the load data is available. By decoupling floating-point operations, a limited form of out-of-order execution of floating-point instructions is achieved.

The external cache consists of two interleaved cache banks. Two simultaneous accesses to the same cache bank cause a one-cycle stall. The hardware is designed to mitigate the problem by adding a one-deep queue called the "address bellow". If the hardware cannot keep up with a stream of references, the compiler must deal with bank conflicts by careful code generation.

The rest of this paper is organized as follows. In Sect. 2 we introduce the integer linear programming based model for modulo scheduling on the R8000. In Sect. 3 we present the memory system optimizations. Our performance evaluation results using the SPEC92 floating-point benchmark suite are presented in Sect. 4. Finally, we put forward some conclusions and give suggestions for future work.

2 ILP Model for Modulo Scheduling

Modulo scheduling can be formulated as an integer linear programming (ILP) problem. As such, it defines a set of linear constraints imposed on a legal solution by the dependencies in the program and by the resource constraints of the target architecture. Out of the many legal schedules that satisfy these constraints, the best is chosen according to some optimality criterion.

2.1 Loop Representation

When performing modulo scheduling, the loop body is modeled as a directed graph $G = (V, E)$, where $V = \{x_1, x_2, \ldots, x_n\}$ is the set of nodes that correspond to the operations of the loop body, and $E$ is the set of arcs that correspond to data dependencies between these nodes. Such a graph is called the Data Dependence Graph (DDG). For each directed arc $(i, j) \in E$ we call $x_i$ its tail and $x_j$ its head. The dependence relationship is specified by assigning to each arc $(i, j)$ in the DDG a pair of attributes $(d_{ij}, \Omega_{ij})$:

- $d_{ij}$, or delay, is the time in clock cycles required for the tail operation $x_i$ to produce its result (in the case of a flow- or output-dependence), or the time in clock cycles required for the tail operation $x_i$ to read its operand (in the case of an anti-dependence).
- $\Omega_{ij}$, or iteration distance, is the number of iterations that separates an instance of the tail operation $x_i$ from an instance of the head operation $x_j$.
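For concreteness, the DDG can be captured in a few lines; the following minimal sketch (our own naming, not from the paper) represents arcs with their $(d_{ij}, \Omega_{ij})$ attributes:

```python
# A minimal DDG sketch for Sect. 2.1 (names are ours, not the paper's).
from typing import NamedTuple

class Arc(NamedTuple):
    tail: int       # x_i, the operation that produces (or reads) the value
    head: int       # x_j, the dependent operation
    delay: int      # d_ij, in clock cycles
    distance: int   # Omega_ij, in loop iterations (0 = same iteration)

# One intra-iteration flow dependence and one loop-carried dependence:
ddg = [Arc(tail=0, head=1, delay=2, distance=0),
       Arc(tail=1, head=0, delay=1, distance=1)]
```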

2.2 Periodic Scheduling Constraints

If the loop iterations are initiated at a constant rate (the computation rate), the schedule is said to be periodic.

Definition 1. A schedule $\sigma$ is periodic with period $II > 0$ if there exist integer numbers $t_i$ such that:

$\sigma_i(k) = t_i + (k - 1) \cdot II, \quad k = 1, 2, \ldots$   (1)

In (1), $\sigma_i(k)$ corresponds to the time when the $k$-th iteration of operation $x_i$ is issued, and $t_i$ corresponds to the time of the initiation of the operation $x_i$ in the first iteration of the loop. For each $t_i$ let a pair of values $k_i$ and $o_i$ be defined, where:

$k_i = \lfloor t_i / II \rfloor \quad \text{and} \quad o_i = t_i \bmod II$   (2)

If we consider a time window of size $II$, corresponding to the repetitive pattern, then we can say that the operation $x_i$ is initiated at the $o_i$-th clock cycle of the repetitive pattern, during the $k_i$-th time window from the beginning of the execution. Each $t_i$ can be written in the following way:

$II \cdot K + O = T$   (3)

where

$K = [k_1, k_2, \ldots, k_N]^T, \quad O = [o_1, o_2, \ldots, o_N]^T, \quad T = [t_1, t_2, \ldots, t_N]^T$   (4)

Since each $o_i$ lies in $[0, II - 1]$:

$O = A^T \cdot [0, 1, \ldots, II - 1]^T$   (5)

where $A = [a_{t,i}]$ is a 0-1 matrix with $II$ rows and $N$ columns:

$a_{t,i} = \begin{cases} 1, & \text{if } o_i = t \\ 0, & \text{otherwise} \end{cases}$

In other words, $a_{t,i} = 1$ if operation $x_i$ is issued at clock cycle $t$ from the beginning of the repetitive pattern. Substituting (5) into (3), we obtain the matrix form of a periodic schedule:

$II \cdot K + A^T \cdot [0, 1, \ldots, II - 1]^T = T$   (6)

Because each operation is allowed to execute only once within the repetitive pattern, the following condition applies:

$\sum_{t=0}^{II-1} a_{t,i} = 1, \quad a_{t,i} \in \{0, 1\}, \quad \forall x_i \in \{x_1, x_2, \ldots, x_N\}$
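The decomposition (2)-(6) is easy to sanity-check in code; a minimal sketch with hypothetical issue times (not from the paper):

```python
# Decompose issue times T into window indices K and pattern offsets O,
# build the 0-1 matrix A, and rebuild T from the matrix form (6).
II = 3
T = [0, 2, 5, 7]                       # hypothetical issue times t_i, N = 4 ops
N = len(T)

K = [t // II for t in T]               # k_i = floor(t_i / II), equation (2)
O = [t % II for t in T]                # o_i = t_i mod II

# A[t][i] = 1 iff operation x_i is issued at cycle t of the repetitive pattern
A = [[1 if O[i] == t else 0 for i in range(N)] for t in range(II)]

# Matrix form (6): T = II*K + A^T [0, 1, ..., II-1]^T
T_rebuilt = [II * K[i] + sum(A[t][i] * t for t in range(II)) for i in range(N)]
assert T_rebuilt == T

# Each operation appears exactly once per repetitive pattern:
assert all(sum(A[t][i] for t in range(II)) == 1 for i in range(N))
```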

2.3 Precedence Constraints

A schedule $\sigma$ is legal if it satisfies the linear precedence constraints (see [2]):

$t_j - t_i \ge d_{ij} - II \cdot \Omega_{ij}, \quad \forall (i, j) \in E$   (7)

[Figure 1: (a) the reservation table of an operation that uses 3 pipeline stages over 6 cycles; (b) its circular reservation table folded for II = 3.]

2.4 Resource Constraints

We need to know when each operation is initiated, and also the usage of the various pipeline stages during its execution. In classical pipeline theory, reservation tables represent the resource usage of different operations [8, 13]. In order to determine resource requirements at each clock cycle $t$ for a modulo scheduled loop, we define circular reservation tables (CRTs) for operation types as follows:

- For each operation type whose execution time $d < II$, we extend its reservation table to $II$ columns by adding $(II - d)$ zero column-vectors. Since $d < II$, each entry in the CRT is at most 1.
- For the case $d > II$, we fold the reservation table to $II$ columns such that the $t$-th column of the original reservation table is added to the $(t \bmod II)$-th column of the modified CRT. Entries in the modified CRT can be greater than 1.

Figure 1 shows a reservation table for some operation that uses 3 pipeline stages and takes 6 cycles to execute, and its corresponding CRT for $II = 3$: the reservation table has been folded. Because the second iteration of the loop begins while the operation from the first iteration is still completing, two units of stage 1 are needed in the first cycle of that second iteration.

Let us denote the $s$-th row in the CRT for an instruction type $\tau$ by $CRT^{\tau}_s$, a vector of length $II$. In any clock cycle $t$, each instruction $x_i \in \tau$ requires

$\sum_{l=0}^{II-1} a_{(t-l) \bmod II,\, i} \cdot CRT^{\tau}_s[l]$

units of resource $s$. At each clock cycle $t$ in the repetitive pattern, the total requirement for a particular resource $s$ must not exceed the number of available units $R_s$:

$R_s \ge \sum_{\tau} \sum_{x_i \in \tau} \sum_{l=0}^{II-1} a_{(t-l) \bmod II,\, i} \cdot CRT^{\tau}_s[l], \quad \forall t \in [0, II - 1]$   (8)
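Folding a reservation table into a CRT is a simple modular accumulation; the following sketch assumes a hypothetical 3-stage, 6-cycle busy pattern in the spirit of Figure 1 (the figure's exact pattern is not reproduced here):

```python
# Fold a reservation table into a circular reservation table (CRT), Sect. 2.4.
def fold_crt(reservation_table, II):
    """reservation_table[s][t] = 1 iff stage s is busy at cycle t of the op."""
    crt = [[0] * II for _ in reservation_table]
    for s, row in enumerate(reservation_table):
        for t, busy in enumerate(row):
            crt[s][t % II] += busy     # column t folds onto column t mod II
    return crt

rt = [[1, 1, 0, 1, 0, 0],             # stage 1 busy at cycles 0, 1, 3 (assumed)
      [0, 0, 1, 0, 0, 0],             # stage 2 busy at cycle 2
      [0, 0, 0, 0, 1, 1]]             # stage 3 busy at cycles 4, 5
print(fold_crt(rt, 3))                # stage 1 row folds to [2, 1, 0]:
                                      # two units of stage 1 needed in cycle 0
```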

2.5 Objective Function

The upper bound on the number of registers required by a schedule is given by the sum of the buffer sizes of all values kept in registers [10]. The buffer size of a variable corresponds to the number of iterations this variable's lifetime spans, and can be defined as follows:

$b_i = \lceil L(x_i) / II \rceil$   (9)

$L(x_i)$ denotes the lifetime of the variable into which instruction $x_i$ writes its result (variable $x_i$'s lifetime corresponds to its longest flow dependence). Buffers overestimate register pressure. For example, let the loop's $II$ be 10 cycles. A variable whose lifetime spans 3 loop iterations needs a buffer of size 3, $b_i = 3$. Since a buffer is allocated for a variable for the duration of the entire $II$, such a variable will be kept in the buffer for 30 cycles. However, if the actual lifetime of this variable is less than 30 cycles, say 23 cycles, it does not need to be kept in a register for the entire 30 cycles. It should also be noticed that registers can be shared by different values when their lifetimes do not overlap, which is not reflected by the buffers. The buffer-minimizing objective function is:

$\min \sum_{x_i} b_i$   (10)

and the buffer sizes are approximated by the following constraints [10]:

$II \cdot b_i + t_i - t_j \ge II \cdot (\Omega_{ij} + 1) - 1, \quad \forall (i, j) \in E, \quad b_i \text{ integer}$
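In code, (9) is just a ceiling division; a one-line sketch using the worked example above:

```python
# Equation (9) as code, with the numbers from the example in the text.
import math

def buffer_size(lifetime_cycles, II):
    return math.ceil(lifetime_cycles / II)    # b_i = ceil(L(x_i) / II)

print(buffer_size(23, 10))                    # 23-cycle lifetime, II = 10 -> 3
```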

2.6 ILP Formulation for Modulo Scheduling

minimize

$\sum_{x_i} b_i$

subject to:

1. Precedence constraints:

$T = II \cdot K + A^T \cdot [0, 1, \ldots, II - 1]^T$

$\sum_{t=0}^{II-1} a_{t,i} = 1, \quad \forall i \in [1, N]$

$t_j - t_i \ge d_{ij} - II \cdot \Omega_{ij}, \quad \forall (i, j) \in E$

2. Resource constraints:

$R_s \ge \sum_{\tau \in ISA} \sum_{x_i \in \tau} \sum_{l=0}^{II-1} a_{(t-l) \bmod II,\, i} \cdot CRT^{\tau}_s[l], \quad \forall t \in [0, II - 1], \ \forall s$

$R_s$ is the number of available units of resource $s$.

3. Buffer constraints:

$II \cdot b_i + t_i - t_j \ge II \cdot (\Omega_{ij} + 1) - 1, \quad \forall (i, j) \in E$

$t_i \ge 0$ real; $\quad k_i \ge 0, \ 0 \le a_{t,i} \le 1, \ b_i \ge 0$ integers   (11)
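To make formulation (11) concrete, here is a self-contained sketch using the open-source PuLP modeler (the paper's implementation used CPLEX; the three-operation recurrence and the single-unit, single-cycle resource standing in for the CRT machinery are our assumptions):

```python
# A toy instance of formulation (11); the DDG and machine are assumptions.
import pulp

II, N = 5, 3
edges = {(0, 1): (2, 0), (1, 2): (2, 0), (2, 0): (1, 1)}  # (i,j)->(d, Omega)

prob = pulp.LpProblem("modulo_scheduling", pulp.LpMinimize)
a = pulp.LpVariable.dicts("a", (range(II), range(N)), cat="Binary")
k = pulp.LpVariable.dicts("k", range(N), lowBound=0, cat="Integer")
b = pulp.LpVariable.dicts("b", range(N), lowBound=0, cat="Integer")

# t_i = II*k_i + sum_t t*a[t][i]: the matrix form (6)
t = {i: II * k[i] + pulp.lpSum(tt * a[tt][i] for tt in range(II))
     for i in range(N)}

prob += pulp.lpSum(b[i] for i in range(N))        # objective (10)

for i in range(N):                                 # one issue slot per pattern
    prob += pulp.lpSum(a[tt][i] for tt in range(II)) == 1
for (i, j), (d, om) in edges.items():
    prob += t[j] - t[i] >= d - II * om             # precedence (7)
    prob += II * b[i] + t[i] - t[j] >= II * (om + 1) - 1  # buffer constraints
for tt in range(II):   # one shared single-cycle resource, a stand-in for (8)
    prob += pulp.lpSum(a[tt][i] for i in range(N)) <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status],
      [(i, int(pulp.value(t[i]))) for i in range(N)])
```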

2.7 Resource Modelling on MIPS R8000

Our machine model for the R8000 processor specifies:

1. A collection of processor resources. Instructions in the MIPS IV instruction set [11] are classified into twenty-one types according to their resource usage patterns. The resource usage of each instruction type is modeled using reservation tables.
2. A collection of latencies associated with each instruction type. Instruction latencies are modeled by the integer delays assigned to the data dependence arcs of the loop's DDG.

We first attempt to solve the ILP subject to the resource and precedence constraints only, with no minimization objective. A solution to such a formulation is one of many feasible schedules, not necessarily optimal. This formulation is solved faster than a formulation which minimizes buffers, because the search stops after the first schedule is found. We give the ILP solver 3 minutes; if a solution cannot be found, there is little hope that a solution to a more complex formulation will be obtained, so the scheduling attempt starts from scratch with an increased value of II. A solution to the ILP, on the other hand, is often not register allocatable, in which case the buffer constraints are added to the ILP and the minimum-buffer solution is sought. If this solution cannot be found in 3 minutes, the best schedule found so far is accepted. If the resulting schedule cannot be register allocated, the scheduling attempt restarts with an increased II. This driver strategy is sketched below.
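A schematic sketch of the staged strategy (our reading of this section, not the paper's code); the three callables are hypothetical stand-ins for the ILP feasibility solve, the buffer-minimizing solve, and the register allocator:

```python
# Staged driver loop for the ILP pipeliner (Sect. 2.7), schematic only.
def pipeline_loop(solve_feasible, solve_min_buffers, allocatable,
                  min_II, max_II):
    for II in range(min_II, max_II + 1):
        sched = solve_feasible(II)             # precedence + resources, 3 min
        if sched is None:
            continue                           # no hope at this II: increase II
        if allocatable(sched):
            return sched                       # feasible schedule already fits
        best = solve_min_buffers(II) or sched  # add buffer constraints, 3 min
        if allocatable(best):
            return best                        # minimum-buffer schedule fits
    return None                                # fall back to heuristic pipeliner
```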

3 Memory System Optimization

The MIPS R8000 provides a simple implementation of an architecture supporting more than one memory reference per cycle. The processor can issue two references per cycle, and the memory (specifically, the second-level streaming cache) is divided into two banks of double-words, the even-address bank and the odd-address bank. If two references in the same cycle both address the same bank, one is serviced immediately, and the other is queued for service in a one-element queue called the bellow. If this hardware configuration cannot keep up with the stream of references, the processor stalls. The compiler can avoid unnecessary stalls by carefully scheduling the memory stream. The ILP modulo scheduler attempts to find known pairs of references to opposite memory banks (called even-odd pairs) for scheduling in the same cycle.

3.1 Memory Reference Analysis

Consider the following example. Suppose that, as a result of software pipelining, two references to the array $A$ have been scheduled in one cycle: a reference to $A[aI + b]$ and a reference to $A[c(I + \Delta k) + d]$ from $\Delta k$ iterations later. To be pairable, these two memory references must address opposite memory banks. Because the banks contain double-words, two addresses map into opposite banks if they are separated in memory by a number of bytes divisible by 8, but not by 16. Thus, two references address opposite memory banks when:

$|(c(I + \Delta k) + d) - (aI + b)| = |(c - a)I + c \Delta k + (d - b)| = 8 + 16m, \quad m = 0, 1, 2, \ldots$   (12)

Condition (12) may be satisfied for all $I$ if (sufficiency condition):

$(c - a) \bmod 16 = 0 \quad \text{and} \quad c \cdot \Delta k + |d - b| = 8 + 16m, \quad m = 0, 1, 2, \ldots$   (13)

Thus, in order for two memory references to be pairable, $\Delta k$ must satisfy:

$\Delta k = \frac{8 - |d - b|}{c} + \frac{16}{c} \cdot m = r + q \cdot m, \quad m = 0, 1, 2, \ldots$   (14)
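As a quick illustration of (13)-(14), the following sketch (the array coefficients are hypothetical) enumerates the iteration distances $\Delta k$ that make two references an even-odd pair:

```python
# Even-odd pairability per (13)-(14): references A[a*I + b] and
# A[c*(I + dk) + d], banks of 8-byte double-words (assumed coefficients).
def pairable(a, b, c, d, dk):
    if (c - a) % 16 != 0:              # first sufficiency condition in (13)
        return False
    sep = c * dk + abs(d - b)          # separation term from (13)
    return sep % 16 == 8               # divisible by 8 but not by 16

# With a = c = 8 and b = d = 0, (14) gives dk = r + q*m = 1 + 2*m:
print([dk for dk in range(1, 9) if pairable(8, 0, 8, 0, dk)])  # [1, 3, 5, 7]
```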

Let $M = \{x_i\}$, $i = 1, 2, \ldots, n$, be the set of memory-referencing operations in the loop. For each $x_i \in M$ we can define:

- $M' = \{x_j\} \subseteq M$, such that $x_j \ne x_i$ and $x_i$, $x_j$ satisfy the sufficiency conditions (13);
- $R = \{r_{ij}\}$ and $Q = \{q_{ij}\}$, where $r_{ij}$ and $q_{ij}$ define the set of $\Delta k_{ij}$ that satisfy (14).

3.2 Memory Constraints for the ILP Formulation

Linear memory constraints have been added to the ILP formulation that enforce the following rule:

Two memory references may be scheduled in the same cycle only if they represent a known memory pair, with $\Delta k$ that satisfies condition (14).

1. A memory reference $x_i$ that has no pairable candidates must not share a cycle with any other memory reference $x_j$:

$a_{t,i} + a_{t,j} \le 1, \quad \forall x_i \in M$ without a memory pair candidate, $\forall x_j \in M, \ \forall t \in [0, II - 1]$   (15)

2. For every memory reference $x_i$ and its pair candidate $x_j$, the following must hold:

$\Delta k_{ij} = |k_i - k_j| = r_{ij} + q_{ij} \cdot m_{ij}, \quad m_{ij} = 0, 1, 2, \ldots$   (16)

if the $k_i$-th iteration of $x_i$ and the $k_j$-th iteration of $x_j$ are scheduled in the same cycle.

Condition (16) may be expressed in the following linear form:

$k_i - k_j - (2 w_{ij} - 1) \cdot r_{ij} - q_{ij} \cdot (m_{ij} + (w_{ij} - 1) \cdot m_{max}) = 0, \quad \forall t \in [0, II - 1]$ such that $a_{t,i} = a_{t,j} = 1$   (17)

where $m_{max}$ is the upper limit on $m_{ij}$, and $w_{ij}$ is a 0-1 variable encoding the sign of $k_i - k_j$. Often it is more convenient to rewrite equation (17) as inequalities:

$k_i - k_j - (2 w_{ij} - 1) \cdot r_{ij} - q_{ij} \cdot (m_{ij} + (w_{ij} - 1) \cdot m_{max}) \le 0$   (18)
$k_i - k_j - (2 w_{ij} - 1) \cdot r_{ij} - q_{ij} \cdot (m_{ij} + (w_{ij} - 1) \cdot m_{max}) \ge 0$   (19)
$\forall t \in [0, II - 1]$ such that $a_{t,i} = a_{t,j} = 1$

Since these inequalities should only be enforced whenever, for some $t \in [0, II - 1]$, we have $a_{t,i} = a_{t,j} = 1$, we can write:

$k_i - k_j - (2 w_{ij} - 1) \cdot r_{ij} - q_{ij} \cdot (m_{ij} + (w_{ij} - 1) \cdot m_{max}) \le +\infty \cdot ((1 - a_{t,i}) + (1 - a_{t,j}))$
$k_i - k_j - (2 w_{ij} - 1) \cdot r_{ij} - q_{ij} \cdot (m_{ij} + (w_{ij} - 1) \cdot m_{max}) \ge -\infty \cdot ((1 - a_{t,i}) + (1 - a_{t,j}))$
$\forall t \in [0, II - 1]$

(in practice, $\infty$ is replaced by a sufficiently large constant). The above inequalities are trivially satisfied whenever $a_{t,i}$ and $a_{t,j}$ are not both 1, and therefore impose no constraints on $k_i$ and $k_j$ in that case. When there exists a time $t \in [0, II - 1]$ at which $a_{t,i} = a_{t,j} = 1$, however, they require $k_i$ and $k_j$ to be such that the $I$-th iteration of $x_i$ and the $(I + |k_i - k_j|)$-th iteration of $x_j$ are pairable. Only the value of $m_{max}$ remains to be determined. From (14), we notice that:

$\max |k_i - k_j| = r_{ij} + q_{ij} \cdot m_{max}$   (20)

Consequently,

$m_{max} = \frac{\max |k_i - k_j| - r_{ij}}{q_{ij}}$   (21)

where

$\max |k_i - k_j| = \max\left(\left\lceil \frac{alap(x_j) - asap(x_i)}{II} \right\rceil, \left\lceil \frac{alap(x_i) - asap(x_j)}{II} \right\rceil\right)$   (22)

is the upper bound on the relative distance between the iterations from which the two references come; $asap(x_i)$ and $alap(x_i)$ are the earliest and the latest possible times at which the instruction $x_i$ may be issued.
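The effect of the $\infty$ (big-M) terms can be checked numerically; a toy sketch with hypothetical values of the left-hand side $E$ of (18)-(19):

```python
# Toy check of the big-M relaxation: E stands for the left-hand side of
# (18)-(19); M is the "sufficiently large constant" replacing infinity.
M = 10**6

def pairing_constraints_hold(E, a_ti, a_tj):
    slack = M * ((1 - a_ti) + (1 - a_tj))  # 0 only when both ops share cycle t
    return -slack <= E <= slack

print(pairing_constraints_hold(E=5, a_ti=1, a_tj=0))  # True: deactivated
print(pairing_constraints_hold(E=5, a_ti=1, a_tj=1))  # False: violates (17)
print(pairing_constraints_hold(E=0, a_ti=1, a_tj=1))  # True: equality (17)
```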

[Figure 2: Performance of the ILP Modulo Scheduler. (a) Relative performance of ILP over SGI; (b) Effect of memory system optimization. Benchmarks: fpppp, nasa7, hydro2d, su2cor, swm256, mdljsp2, ear, alvinn, ora, tomcatv, wave5, mdljdp2, doduc, spice2g6, and the geometric mean.]

4 Performance Evaluation

4.1 Experimental Framework

The ILP formulation for modulo scheduling was developed as an alternative to heuristic methods, in the hope that an integrated solution to the software pipelining and register allocation problem would result in code of better quality. In order to measure its performance, the ILP software pipeliner was embedded in Silicon Graphics' MIPSpro compiler (see [14] for details). The SGI pipeliner also served as a backup for our pipeliner when it could not schedule a loop within a reasonable time, instead of falling back to the single-block scheduler. A new set of results is reported here, obtained after the memory system optimization was included in the ILP formulation.

4.2 Highlights of Main Results

In our experiment reported in [14], the ILP pipeliner and the SGI pipeliner were used in turn to compile the SPEC92 benchmark suite, consisting of 14 benchmark programs. The SGI pipeliner scheduled 798 loops in total in this benchmark suite, and the ILP pipeliner scheduled 753 loops. We compared the execution time of programs scheduled by the ILP software pipeliner to the execution time of the same programs scheduled by the MIPSpro software pipeliner, and reported a large difference in some benchmarks that favored the heuristic pipeliner. We reported that the dynamic factor introduced by the memory system of the R8000, whose effects were dealt with in the SGI compiler, prevented us from making direct comparisons of our pipeliner's performance against the best SGI results. After extending our ILP formulation to deal equally with the memory effects, we are reporting new comparison results.

Figure 2 (a) shows the relative improvement of ILP schedules over SGI schedules for each of the fourteen SPEC92 programs. The X-axis measures how much faster the ILP-scheduled code is compared to the SGI code, in percentage points. The new data shows that the SGI pipeliner only slightly outperforms the ILP pipeliner in 6 of the benchmarks, down from 8 reported in [14], and that the geometric mean for the suite as a whole is now equal for both schedulers (previously we reported the MIPSpro geometric mean to be 8% better than the geometric mean of our scheduler). We conclude that, after the addition of the memory formulation to the ILP, the quality of the code produced by the ILP pipeliner is comparable to that produced by the SGI pipeliner. However, the SGI pipeliner still slightly outperforms the ILP pipeliner in some benchmarks. The main reason for this is the time limit imposed on the ILP pipeliner's search for a schedule. Sometimes the ILP pipeliner could not find, within 3 minutes, a schedule at the same II as the heuristic pipeliner. Equally, the ILP pipeliner could not produce schedules superior to the SGI pipeliner in terms of required registers for many important loops (register-optimal schedules could not be found within the 3-minute time limit). One interesting discovery was that using multiple orderings in which the ILP solver traverses the branch-and-bound tree facilitates its search for the optimal solution. When one ordering does not lead to a solution in a certain time, another may be used, and quite often a solution is thus found.

4.3 Memory Stalls

Figure 2 (b) shows the effect of memory stall reduction for the ILP pipeliner. The Y-axis depicts the relative performance of the ILP pipeliner with memory stall optimization enabled over the ILP pipeliner with memory stall optimization disabled, in percentage points, for all 14 benchmarks. The majority of benchmarks benefited from memory optimization. The average improvement due to minimizing memory stalls is 7%. Three programs (mdljdp2, alvinn and mdljsp2) run significantly faster when scheduled with memory optimization. However, some of the programs suffered a slight loss in performance. Specifically, this performance loss was noticeable in two programs, swm256 and fpppp (this also explains the better performance of the SGI-scheduled code for these two benchmarks). Why? There are a couple of reasons. The additional memory stall constraints sometimes allow the ILP formulation to find a register-allocatable schedule at a lower II than it would have found without these constraints. However, it could be that a schedule at a higher II generates fewer stalls (due to reasons other than external cache conflicts) than the one at a lower II, or it could be that, without these additional constraints, the scheduler spills some variables and, ironically, comes up with an even lower II on the spilled loop. These problems can be dealt with at the cost of additional search in the scheduling space, a search that is too expensive for the ILP pipeliner.

4.4 Register Pressure

As mentioned earlier, minimizing register pressure in software pipelined loops is rather costly in terms of compile time. Moreover, as long as the schedule is register allocatable, the number of registers it uses should not really matter. After all, a couple of spills before the loop is entered and a couple of restores after the exit cannot significantly slow down a program. In this respect, minimizing register usage should only be important for loops with high register pressure, where a schedule that uses some extra registers compared to the optimal one may lead to a register allocation failure and spills. Unfortunately, the exponential nature of the ILP formulation prevented the ILP pipeliner from finding minimal-register schedules for many interesting loops with high register pressure. Because neither of the two schedulers outperforms the other in terms of required registers, there is no clear evidence of performance being affected by this factor.

4.5 Branching Order

Surprisingly, for a given loop, one ordering may lead to the optimal solution in a very short time, while another may not lead to any solution at all. Why? The ILP pipeliner calls the CPLEX mixed integer linear solver (CPLEX is a trademark of CPLEX Optimization, Inc.) to solve the ILP formulation. CPLEX is a very powerful and flexible tool, but it does not allow us to fully exploit the problem's structure. Making the integer solver aware of such structure increases the effectiveness of integer problem solving.

The SGI pipeliner uses multiple scheduling orders to facilitate register allocation [14]. The idea behind this is that at least one of the different scheduling orders should produce a schedule which fits into the available allocation registers. Facilitating the search for register-allocatable schedules was also the original motivation for trying, in turn, many different orders in which the ILP solver traverses the branch-and-bound tree. However, we soon discovered that traversing the branch-and-bound tree in a "good" order significantly reduces the time spent searching for the optimal solution. The four orders used by the ILP pipeliner were the ones used by the MIPSpro [14]:

1. Folded depth-first ordering with the final memory sort,
2. Data precedence graph heights with the final memory sort,
3. Reversed heights with the final memory sort,
4. Folded depth-first ordering without the final memory sort.

Table 1 shows how many loops, out of the total 753, were scheduled by the ILP pipeliner with the resulting schedules successfully allocated registers, using each of the four branching orders:

Table 1. Number of loops scheduled by each searching order, out of the total 753 loops

  FDFO w/memory sort        588
  Data precedence heights   613
  Reversed heights          586
  FDFO                      747

No branching order was able to successfully schedule all the loops. The most efficient was the folded depth-first search order. Because of the "folding", placement of the difficult-to-schedule operations is attempted first, and this allows

the ILP solver to detect earlier in the branch-and-bound tree that such operations cannot be scheduled, and not to spend much time exploring their subtrees. Final sorting of the loads and stores, which are easy to schedule in the R8000 architecture, is another way of trying to place operations that are more difficult to schedule ahead in the scheduling order. These results show that further improvements in ILP solving have to come from exploiting the problem's properties, such as the relations between different operations in the data dependence graph and the characteristics of the target architecture.

5 Conclusions and Future Work

The integer linear programming framework offers an optimal solution to the integrated software pipelining and register allocation problem. This paper deals with the design of an ILP-based software pipeliner, and, in particular, with the design and implementation of the software pipeliner for the MIPS R8000 superscalar microprocessor. We developed a complete ILP model of the R8000 processor. This machine has multiple execution pipelines that share certain stages, and an in-order superscalar dispatching mechanism that issues instructions as soon as their operands are ready and there are sufficient resources for their execution. We also developed an ILP formulation that optimizes code for better memory system performance.

In order to evaluate its performance, the ILP software pipeliner was compared to Silicon Graphics' MIPSpro compiler. The ILP pipeliner scheduled 753 out of 798 loops in the SPEC92 floating-point benchmark suite. In the course of our study, 9 loops were scheduled by the ILP pipeliner at a lower II than by the SGI pipeliner. We discovered that, so long as a software pipelined loop actually fits in the machine's registers, the number of registers used by the loop's steady state is not the most important parameter to optimize. In fact, minimizing memory stalls that result from multiple simultaneous references to the same memory bank gave an average performance improvement of 7 percent, and up to 43 percent for one of the benchmarks that we used. Nevertheless, certain software pipelined loops exhibit high register pressure, and there finding the optimal schedule in terms of required registers is important, as it allows us to avoid spilling in those loops.

The overhead of applying the integer linear programming method to modulo scheduling is very high. The largest loop solved to optimality in our experiments contains only 14 instructions. Bigger loops simply could not be scheduled optimally in a reasonable time (whatever reasonable may be). Future work should focus on reducing the ILP solution time. Careful investigation showed that exploitation of the problem structure must be the foundation of such an improvement. The ILP framework has been effective in solving other computationally difficult problems [3, 6]. In these problems, formulating a "good" model is of crucial importance to solving that model [9]. For example, it is well known that there are efficient methods of solving integer linear programming problems formulated in terms of 0-1 variables. Therefore, reformulating modulo scheduling in such terms offers a potentially significant improvement in solution efficiency.

There is also room for improvement in the usage of the machine's memory system. Perhaps an ILP formulation can be made that minimizes processor stalls not only due to simultaneous memory bank accesses, but due to cache misses in general. Cache issues can no longer be ignored, because of the relatively high cache miss penalties and, therefore, the significant performance improvement opportunities that optimal cache utilization may offer. A good solution to the problem of optimizing memory system performance could also have interesting implications for the design of such systems.

Another area of interest lies with loops containing conditional branches. As far as the results of our study are concerned, we assumed regular loops (loops containing no conditional branches). All the irregular loops considered in our work were converted into branchless code using if-conversion. In that case, one cannot be guaranteed the construction of the optimal software pipeline even if software pipelining itself has been successful. Optimal here is meant in a broader sense than in this paper: for a given loop and a given target machine, we cannot do better than the optimal schedule. Perhaps ILP may help us understand the nature of such optimality and derive useful scheduling methods for loops with conditional branches.

Acknowledgements

The author wishes to express his gratitude to a number of people and institutions. Professor G. Gao helped with proofreading and editing. Erik Altman, R. Govindarajan and G. Gao developed the foundation of the integer linear programming formulation of modulo scheduling. Silicon Graphics made it possible to test these ideas in the setting of their production compiler, and Dr. John Ruttenberg, Dr. Woody Lichtenstein and others made invaluable contributions to the design of the experiment.

References

1. Erik R. Altman. Optimal Software Pipelining with Function Unit and Register Constraints. PhD thesis, McGill University, Montreal, Quebec, 1995.
2. Erik R. Altman, R. Govindarajan, and Guang R. Gao. Scheduling and mapping: Software pipelining in the presence of structural hazards. In Conference on Programming Language Design and Implementation, pages 139-150, La Jolla, CA, June 1995. ACM SIGPLAN.
3. R. Bixby, Ken Kennedy, and Uwe Kremer. Automatic data layout using 0-1 integer linear programming. In Conference on Parallel Architectures and Compilation Techniques, pages 111-122, August 1994.
4. James C. Dehnert and Ross A. Towle. Compiling for the Cydra 5. The Journal of Supercomputing (Special Issue on Instruction-Level Parallelism), 7(1/2), July 1993.
5. Alexandre E. Eichenberger, Edward S. Davidson, and Santosh G. Abraham. Optimum modulo schedules for minimum register requirements. In International Conference on Supercomputing, pages 31-40, Barcelona, Spain, July 1995. ACM SIGARCH.
6. David W. Goodwin and Kent D. Wilken. Optimal and near-optimal global register allocation using 0-1 integer programming. Software: Practice and Experience, 1996.
7. R. Govindarajan, Erik R. Altman, and Guang R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. In 27th Annual International Symposium on Microarchitecture, pages 85-94, San Jose, CA, November-December 1994.
8. P. M. Kogge. The Architecture of Pipelined Computers. McGraw-Hill, New York, 1981.
9. G. L. Nemhauser and L. A. Wolsey. Handbooks in Operations Research and Management Science: Optimization, volume 1, ch. 6. Elsevier Science, New York, 1989.
10. Qi Ning and Guang R. Gao. A novel framework of register allocation for software pipelining. In 20th Annual International Symposium on Principles of Programming Languages, pages 29-42, January 1993.
11. Charles Price. MIPS IV Instruction Set. Silicon Graphics Computer Systems, January 1995. Revision 3.1.
12. B. R. Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing (Special Issue on Instruction-Level Parallelism), 7(1/2):9-50, 1993.
13. B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high-performance scientific computing. In IEEE/ACM 14th Annual Microprogramming Workshop, October 1981.
14. John C. Ruttenberg, Guang R. Gao, Artour Stoutchinin, and Woody Lichtenstein. Software pipelining showdown: Optimal vs. heuristic methods in a production compiler. In Conference on Programming Language Design and Implementation, pages 1-11, Philadelphia, PA, May 1996. ACM SIGPLAN.

This article was processed using the LaTeX macro package with the LLNCS style.