In the Proceedings of the 1995 International Conference on High-Performance Computing, New Delhi, India, Dec. 1995.
Instruction Scheduling in the Presence of Structural Hazards:
An Integer Programming Approach to Software Pipelining

R. Govindarajan
Dept. of Computer Science, Memorial Univ. of Newfoundland, St. John's, A1B 3X5, Canada
[email protected]

Erik R. Altman
Dept. of Electrical Engineering, McGill University, Montreal, H3A 2A7, Canada
[email protected]

Guang R. Gao
School of Computer Science, McGill University, Montreal, H3A 2A7, Canada
[email protected]
Abstract

Software pipelining is an efficient instruction scheduling method for exploiting the multiple-instruction-issue capability of modern VLIW architectures. In this paper we develop a precise mathematical formulation, based on ILP (Integer Linear Programming), of the software pipelining problem for architectures involving structural hazards. Compared to heuristic methods as well as an earlier ILP-based method [1], a distinct feature of the proposed formulation is its use of classical pipeline theory, in particular results relating to the forbidden latency set. The proposed software pipelining technique has been successfully applied to derive rate-optimal schedules under resource constraints. The method was used in experiments on 1008 loop kernels taken from various scientific benchmark programs. Performance results of our experiments, and comparison with a leading heuristic method, reveal that the ILP-based exact solution does have a significant performance advantage over the heuristic methods, though at the expense of a larger computation time.
1 Introduction

Software pipelining is an efficient instruction scheduling technique that overlaps operations from different loop iterations in an attempt to fully exploit the instruction-level parallelism of modern VLIW and superscalar architectures. In this paper, we are particularly interested in architectures with function units involving structural hazards. A structural hazard is caused by one or more stages of the pipeline being used multiple times during the execution of an instruction. For example, a shifter may be used to normalize inputs at the start of a floating point divide, and be reused at the end of the divide to normalize the output. (This work was supported by research grants from NSERC (Canada), MICRONET Network Centers of Excellence (Canada), and the President's grant, Memorial University of Newfoundland.)

A variety of software pipelining algorithms have been proposed which operate under resource constraints; see, for example, [11, 9, 7, 2, 3] and the references in [10]. In order to be practical, almost all of these scheduling methods attempt to find efficient (but not necessarily optimal) solutions based on a variety of heuristics. In contrast, integer linear programming (ILP) based approaches for finding an optimal schedule have been proposed in [5, 4, 1]. These methods have a precise mathematical optimality objective for the software pipelining problem and attempt to obtain the optimal (or exact) solution, though at the expense of a longer computation time.

The software pipelining methods proposed in [2, 9, 7] can handle architectures involving structural hazards by modeling the individual stages of the pipeline as different resources. In [1], a similar approach was followed to develop an ILP formulation. By contrast, in this paper we develop an ILP formulation for function units involving structural hazards using well-known classical pipeline theory [8]. In particular, our method precomputes the forbidden latency set (refer to Section 3 for a quick review of classical pipeline theory and the terminology used) for a particular function unit, and formulates resource constraints based on that set. In order to use the forbidden latency set effectively, our method performs instruction scheduling (specifying at what time step an instruction must be initiated) and mapping (the assignment of instructions to function units) simultaneously. The use of the forbidden latency set facilitates modeling a complex multi-stage pipeline with arbitrary structural hazards as a single
resource, thereby making the proposed formulation simpler and more efficient than the one discussed in [1].

We have implemented our scheduling method on a Unix workstation and tested it on 1008 sample loops extracted from various scientific benchmark programs such as SPEC, the Perfect Club, and the NAS kernels. In a large majority of the test cases, our method found a rate-optimal schedule reasonably quickly. In particular, we obtained an optimal schedule for more than 80% of the loops. The geometric mean of execution time was less than 2.5 seconds on a Sparc-20 workstation, and the median was 1 second.

The improved performance (rate-optimality) of our software pipelining method comes at the expense of a larger computation time. This raises the question, "Does the optimality objective pay off?" We address this question by comparing the schedules generated by our ILP method with those generated by a leading heuristic method, viz. the slack scheduling method of Huff [7]. Our experiments indicate that in 216 of the test cases, the ILP schedules are faster by more than 15%. Thus the ILP-based software pipelining method can be useful in performance-critical applications where runtime performance is of utmost concern. Further, we believe that the ILP-based method is also useful to compiler designers for establishing an optimal bound against which to compare and improve heuristic software pipelining methods.

The rest of this paper is organized as follows. In the following section we motivate our approach to software pipelining with the help of an example. In Section 3, we develop the ILP formulation for the software pipelining problem. Experimental results involving 1008 loops are reported in Section 4. We compare our work with related work in Section 5. Concluding remarks are presented in Section 6.
2 A Motivating Example

In this section we motivate the software pipelining problem for function units (FUs) involving structural hazards with the help of a simple example. Consider the sample loop shown in Figure 1, which computes the inner product of two vectors. The low-level code and the data dependence graph of this loop are also shown in Figure 1. We assume that the target architecture consists of 2 Integer ALUs, 1 Floating Point (FP) Unit, and 1 Load/Store Unit. The integer instructions (i0 and i1) are executed in the Integer ALUs. The FP Unit executes the FP multiply (i4) and the add instruction (i5). The load instructions (i2 and i3) are executed in the Load/Store Unit. For the purpose of this example, it is assumed that an operation takes 1 time unit to execute in the Integer ALU. The FP Unit and the Load/Store Unit have structural hazards; the resource usage of these FUs is shown with the help of reservation tables in Figure 1(c).

The performance of a software-pipelined schedule is measured by the initiation interval T, the time elapsed between the initiations of two successive iterations. Assume that we are interested in constructing a schedule for this loop with an initiation interval T = 4. The initiation interval must be greater than or equal to the lower bounds Tdep and Tres imposed by the dependence and resource constraints [10, 9, 7]. In other words, T >= Tlb = max(Tdep, Tres). Due to space restrictions, the reader is referred to [10, 9, 7] for a discussion of how to compute Tdep and Tres. A schedule with an initiation interval T is said to be a resource-constrained rate-optimal schedule if, for any initiation interval less than T, there does not exist a schedule which is compatible with the architecture.

In this paper we investigate periodic linear schedules of the form T * j + ti, where ti >= 0 is an integer offset, j the iteration number, and T the initiation interval. Figure 2(a) depicts a possible schedule (Schedule A) for the example loop. This table shows when each instruction is initiated for execution. In terms of the linear schedule form, T = 4 and ti0 = 0, ti1 = 0, ti2 = 1, ti3 = 2, ti4 = 5, ti5 = 8. We observe a repetitive pattern in the schedule; in particular, consider the pattern from time step 8 to 11. Instructions i0 and i1 can execute in the two available Integer ALUs. The FP instructions are executed one after the other (due to data dependence), and hence 1 FP Unit is sufficient to support the schedule. What about the Load/Store Unit?

The execution of instructions i2 and i3 overlaps in the repetitive pattern. The usage of the three stages of the Load/Store Unit under Schedule A, starting from time step 5, is shown in Figure 2(b). From this table we notice that the schedule requires no more than 1 Load/Store Unit. Thus Schedule A is a resource-constrained rate-optimal schedule.
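The argument above can be checked mechanically: fold every stage use of every initiation into a modulo reservation table of T slots and verify that no (stage, slot) pair is claimed twice. A minimal sketch in Python; the stage offsets encode our reading of the Load/Store reservation table in Figure 1(c), but the check itself is generic:

```python
# Modulo reservation table check for Schedule A (T = 4).
# Reservation table of the Load/Store unit: stage -> cycle offsets used
# (offsets are our reading of Figure 1(c); the check works for any table).
LOAD_STORE = {1: (0, 2), 2: (1,), 3: (2,)}

T = 4
# Initiation slots (one iteration) of the two load instructions i2, i3.
inits = {"i2": 1, "i3": 2}

def collision_free(reservation, inits, T):
    """True iff no stage is needed twice in the same time slot mod T."""
    used = set()  # (stage, time mod T) slots already claimed
    for instr, t0 in inits.items():
        for stage, offsets in reservation.items():
            for off in offsets:
                slot = (stage, (t0 + off) % T)
                if slot in used:
                    return False  # two initiations collide on this stage
                used.add(slot)
    return True

print(collision_free(LOAD_STORE, inits, T))  # -> True: one unit suffices
```

Initiating i2 and i3 one cycle apart avoids the reuse of stage 1 two cycles later, which is why a single Load/Store Unit carries the whole repetitive pattern.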
[Figure 1: A Motivating Example.
 (a) Dependence Graph of the loop body (nodes i0-i5).
 (b) Program Representation:

    for (i = 0; i < n; i++) {
        sum = sum + a[i] * b[i];
    }

    i0: vr33 = vr33 + vr32
    i1: vr34 = vr34 + vr32
    i2: vr35 = load m(vr33)
    i3: vr36 = load m(vr34)
    i4: vr37 = vr35 * vr36
    i5: vr38 = vr38 + vr37
        branch to i0 if i <= n

 (c) Reservation Tables ('x' marks a stage in use):

    FP Unit                        Load/Store Unit
    Time Steps   0   1   2         Time Steps   0   1   2
    Stage 1      x                 Stage 1      x       x
    Stage 2          x             Stage 2          x
    Stage 3      x       x         Stage 3              x  ]
[Figure 2: Schedule A for the Motivating Example.
 (a) The Schedule (entries show instructions initiated at each time step):

    Time Steps   0      1   2   3   4      5   6   7   8      9   10  11  12
    Iter. = 0    i0,i1  i2  i3              i4          i5
    Iter. = 1                       i0,i1  i2  i3          i4              i5
    Iter. = 2                                       i0,i1  i2  i3

 (b) Resource Usage of Load/Store Unit:

    Time Steps   5   6   7   8
    Stage 1      i2  i3  i2  i3
    Stage 2          i2  i3
    Stage 3              i2  i3  ]
3 Software Pipelining in the Presence of Structural Hazards

To construct a resource-constrained schedule for a loop, we first need an exact formulation of the software pipelining problem for a given iteration period T. We obtain a rate-optimal software pipelined schedule by solving the formulation for successive values of T, starting from Tlb.
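The search over periods is a simple linear scan. A hypothetical sketch, where `solve_for_period` stands in for the ILP solve developed in the rest of this section:

```python
def find_rate_optimal_schedule(t_lb, solve_for_period, t_max=None):
    """Try successive initiation intervals T = t_lb, t_lb + 1, ...
    and return the first (T, schedule) for which a schedule exists.

    solve_for_period(T) is assumed to return a schedule, or None
    when the formulation is infeasible for that T."""
    T = t_lb
    while t_max is None or T <= t_max:
        schedule = solve_for_period(T)
        if schedule is not None:
            return T, schedule  # rate-optimal: no smaller T was feasible
        T += 1
    return None  # no schedule found within the allotted range

# Toy use: pretend only periods >= 4 are feasible.
print(find_rate_optimal_schedule(2, lambda T: "ok" if T >= 4 else None))
# -> (4, 'ok')
```

Because the scan starts at Tlb and stops at the first feasible T, the returned schedule is rate-optimal by construction (provided each per-T solve is exact).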
3.1 Forbidden Latency Set for Software Pipelined Schedules

When the FUs of a target architecture have arbitrary structural hazards, it is important to ensure that no two instructions in the schedule require the simultaneous use of any stage of a function unit at any time. Existing heuristic [7, 2] and exact methods [5, 1] satisfy this requirement by selectively scheduling operations, reserving the usage of resources in a modulo reservation table, and unscheduling operations whenever there is a collision [8], i.e., whenever two or more instructions attempt to use the same stage of an FU at the same time step. Such an approach requires the individual stages of the pipeline to be modeled as different resources, which in turn increases the complexity of the scheduling algorithm. In this paper we propose an alternative approach that makes use of classical pipeline theory to model each function unit as a single resource.

In addition to classical pipeline scheduling theory, we also have to take into account the fact that instructions in a software pipelined schedule are executed repeatedly. Thus, all latencies are considered with a wrap-around. For example, if ti and tj are, respectively, the time steps at which instructions i and j are initiated, then the latency between these two instructions is (tj - ti) mod T. Further, the next initiation of i is separated from the current initiation of j by (ti - tj) mod T. Thus, a software pipelined schedule must not require the usage of any resource at two time steps that are congruent modulo T. This constraint is known as the modulo scheduling
constraint [11, 10]. Hence a forbidden latency f >= T for an FU (a forbidden latency being one that causes a collision) translates to a forbidden latency f mod T. Thus, as a first step, we translate all forbidden latencies to the range [0, T - 1]. Let us denote the new forbidden latency set by F'.

Definition 3.1 The forbidden latency set F' for a software pipelined schedule with period T is defined as:

    F' = { f mod T, for all f in F }

We further redefine the forbidden latency set of a software pipelined schedule as:

Definition 3.2 If T is the period of the repetitive pattern, and f is in F', then T - f is also forbidden. Thus

    F'' = F' U { T - f, for all f in F' }

In fact, the above definition follows from Lemma 3.8 in [8] (see page 97). Henceforth, we use the term forbidden latency set to refer to F''.
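Definitions 3.1 and 3.2 translate directly into a few set operations. A small sketch (the example input is the Load/Store unit of Section 2, whose stage 1 is reused 2 cycles after a load starts, giving F = {2}):

```python
def forbidden_latency_set(F, T):
    """Wrap-around forbidden latency set F'' for a software pipelined
    schedule of period T (Definitions 3.1 and 3.2).

    F : forbidden latencies read off the FU's reservation table
    T : initiation interval"""
    F1 = {f % T for f in F}              # Def 3.1: reduce each latency mod T
    F2 = F1 | {(T - f) % T for f in F1}  # Def 3.2: add the converses T - f
    return F2

# Load/Store unit of the example: stage 1 reused 2 cycles apart -> F = {2}.
print(forbidden_latency_set({2}, 4))  # -> {2}, since T - 2 = 2 as well
```

With T = 4 the set is its own mirror image; for an asymmetric input such as F = {1, 5}, Definition 3.2 adds the converse latency 3.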
3.2 ILP Formulation

Let i and j be instructions that execute in the same FU type. We use integer variables ci and cj to represent the FUs to which instructions i and j are assigned in the schedule. If the latency between the initiations of i and j is f in F'', then i and j must execute in different FUs. That is, if (ti - tj) mod T = f, then ci != cj. The modulo operation is non-linear and hence is not directly useful for our integer programming formulation. Fortunately, the ti variables can be related to a two-dimensional 0-1 integer matrix A. The A matrix represents the repetitive pattern and has T rows and N columns, where N is the number of instructions in the loop. The element A[t, i] is 1 or 0 depending on whether or not instruction i is scheduled for execution at time step t in the repetitive pattern. For Schedule A, the A matrix is:

                    | 1 1 0 0 0 1 |
    A = [a_{t,i}] = | 0 0 1 0 1 0 |
                    | 0 0 0 1 0 0 |
                    | 0 0 0 0 0 0 |

In order to ensure that each instruction is scheduled exactly once in the repetitive pattern, we require that the sum of each column of the A matrix be 1. This can be expressed as a linear constraint:

    sum_{t=0}^{T-1} a_{t,i} = 1   for all i in [0, N-1]                (1)
As shown in [5], the A matrix is related to the ti variables as:

    T_bar = K * T + A^T * [0, 1, ..., T-1]^T                           (2)

where T_bar = [t0, t1, ..., t_{N-1}]^T and K = [k0, k1, ..., k_{N-1}]^T. Interested readers can verify that for Schedule A, the K vector is given by K = [0, 0, 0, 0, 1, 2].

The initiations of instructions i and j are separated by a latency f if a_{t,i} = 1 and a_{((t+f) mod T), j} = 1 for some t in [0, T-1]. Thanks to Definition 3.2, considering (t + f) for all f in F'' also covers (t - f). In order to establish the resource constraint for the software pipelined schedule, we need to ensure that every pair of instructions i, j that execute on the same FU type execute in different FUs (i.e., have different ci, cj values) if the initiations of i and j are separated by a forbidden latency f; otherwise ci may be the same as cj. As outlined in [6, 1], this can be formulated as a set of integer constraints:

    ci - cj >= a_{t,i} + a_{((t+f) mod T), j} - 1 - N * w_{i,j}        (3)
    cj - ci >= a_{t,i} + a_{((t+f) mod T), j} - 1 - N * (1 - w_{i,j})  (4)
    1 <= ck <= N   for all k in [0, N-1]                               (5)

for all t in [0, T-1] and all f in F'', where N, the number of nodes in the DDG, is an upper bound on the number of colors. The variables w_{i,j} are 0-1 integer variables, with one such variable for each pair of instructions using the same type of function unit. Roughly speaking, the w_{i,j} variables represent the sign of ci - cj. In [1], it is shown that Equations (3)-(5) require that two instructions i and j be assigned to different function units if and only if their executions overlap. Lastly, to enforce the resource constraints, that is, that the schedule uses no more resources than are available, we require that the value assigned to ci be at most the number of FUs available of that FU type. That is,

    ci <= Fr,   for all i that execute in FU type r                    (6)

where Fr is the number of FUs required by the schedule in type r.
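Relation (2) can be verified on Schedule A in a few lines of code. The times ti and the resulting K vector below are taken from the text; the snippet builds A, checks constraint (1), and recovers each ti from K and A:

```python
# Relation (2) between the schedule times t_i, the repetitive-pattern
# matrix A, and the iteration offsets K, shown on Schedule A (T = 4).
T, N = 4, 6
t = [0, 0, 1, 2, 5, 8]                 # t_i0 .. t_i5 from Section 2

# Build A: A[t mod T][i] = 1 iff instruction i initiates at that slot.
A = [[0] * N for _ in range(T)]
for i, ti in enumerate(t):
    A[ti % T][i] = 1

K = [ti // T for ti in t]              # iteration offsets k_i

# Each column sums to 1 (constraint (1)) ...
assert all(sum(A[row][i] for row in range(T)) == 1 for i in range(N))
# ... and t = K*T + A^T . [0, 1, ..., T-1] (relation (2)).
recovered = [K[i] * T + sum(A[row][i] * row for row in range(T))
             for i in range(N)]
print(K, recovered)  # -> [0, 0, 0, 0, 1, 2] [0, 0, 1, 2, 5, 8]
```

The decomposition ti = ki * T + (ti mod T) is what lets the ILP reason about the pattern (the a_{t,i} variables) separately from the iteration offsets (the ki variables), avoiding the non-linear mod.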
Dependence constraints between the nodes of the DDG can be expressed as a linear constraint on the values of ti:

    tj - ti >= di - T * m_{ij}                                         (7)

where di is the delay of node i, T the period, and m_{ij} the dependence distance for arc (i, j).

The complete ILP formulation, as in [5], is to minimize a weighted sum of the numbers of FUs required. That is,

    minimize  sum_r Cr * Fr

subject to Equations (1)-(7). The weight Cr is chosen based on the cost and availability of FU type r.
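The intent of constraints (3)-(6) can be illustrated with a small brute-force analogue: instructions of the same FU type whose initiations are separated by a forbidden latency must receive different c values, and the number of distinct values needed is the number of FUs the pattern requires. A sketch (exhaustive search, not the ILP itself; the example inputs come from Section 2):

```python
from itertools import product

def min_function_units(offsets, forbidden, T):
    """Smallest number of FUs of one type that supports the pattern.

    offsets   : initiation slot (mod T) of each instruction of this type
    forbidden : wrap-around forbidden latency set F'' for this FU type
    (Brute force over assignments c_i; the paper expresses the same
    requirement with the integer constraints (3)-(5).)"""
    n = len(offsets)
    # Pairs that must go to different FUs: separated by a forbidden latency.
    conflict = [(i, j) for i in range(n) for j in range(i + 1, n)
                if (offsets[j] - offsets[i]) % T in forbidden]
    for k in range(1, n + 1):               # try 1 FU, then 2, ...
        for c in product(range(k), repeat=n):
            if all(c[i] != c[j] for i, j in conflict):
                return k
    return n

# Loads i2 (slot 1) and i3 (slot 2); F'' = {2} from the reservation table.
print(min_function_units([1, 2], {2}, 4))  # -> 1: one Load/Store unit
```

Since Definition 3.2 makes F'' symmetric, checking (offsets[j] - offsets[i]) mod T alone covers both directions. Had i2 and i3 been initiated two slots apart, the latency 2 in F'' would force two Load/Store units.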
4 Experimental Results

We have implemented our ILP-based software pipelining method on a Unix workstation. We have experimented with 1008 single-basic-block inner loops extracted from various scientific benchmark programs such as SPEC92 (integer and floating point), linpack, livermore, and the NAS kernels. The DDGs varied widely in size, with a median of 7 nodes, a geometric mean of 8, and an arithmetic mean of 12. The reservation tables used in our experiments mimic those of the PowerPC-604 execution units, but with some structural hazards introduced. We have assumed 2 Integer ALUs and one each of FP Add, Load, Store, FP Multiply, and FP Divide Units. To solve the ILPs, we used the commercial program CPLEX.

In order to deal with the fact that our ILP approach can take a very long time on some loops, we adopted the following approach. First, we limited CPLEX to 3 minutes in trying to solve any single ILP at a given T. Second, initiation intervals from [Tmin, Tmin + 5] were tried if necessary. In a large majority of cases (more than 80%), the ILP approach did find an optimal schedule, as shown in Table 1. However, due to the 3-minute time limit imposed on the CPLEX solver, in a small fraction of the test cases our method found a schedule at a T greater than a possible Tmin. We say a possible Tmin and possible optimal schedule here since there is no evidence either way: CPLEX's 3-minute time limit expired without indicating whether or not a schedule exists for a lower value of T. We measured the execution time of our scheduling method on a Sun Sparc-20 workstation. The geometric mean was 2.4 seconds while the median was 1 second. A histogram of the execution time is shown in tabular form in Table 2.
    Initiation          # of    # nodes in the DDG
    Interval            Loops   Geo. Mean   Median
    T = Tmin            803     6.0         6
    T = Tmin + 1        57      17.2        19
    T = Tmin + 2        3       20.3        20
    T = Tmin + 3        8       18.2        18
    T = Tmin + 4        6       16.0        18
    T = Tmin + 5        5       20.8        20
    No schedule found   126     34.8        25

Table 1: Schedule Quality in Terms of Tmin

To answer the question of whether the optimality objective and the long computation time of our method pay off, we compared our schedules with an implementation of Huff's slack scheduling algorithm [7]. The comparison reveals that in 216 out of the 882 test cases for which our ILP method found a schedule, the ILP schedule is faster than the heuristic one. In these 216 test cases, the ILP method achieved an average improvement of 15% in the Tmin value. Due to the time limit imposed on the CPLEX solver, our method resulted in schedules with a larger T value for a very small fraction (0.8%) of the test cases. The execution time of the heuristic method remains an attractive feature: it was less than 1 second for 80% of the loops. Our experimental results suggest the possible use of our method for performance-critical applications. Further, compiler designers can also use our method to obtain optimal schedules for test loops so as to compare and improve existing and newly proposed heuristic methods.
5 Related Work

Software pipelining has been extensively studied [2, 3, 7, 9, 11]; Rau and Fisher provide a comprehensive survey of these works in [10]. In [5, 1] we proposed ILP-based software pipelining methods for architectures involving FUs with arbitrary structural hazards. All of the above-mentioned methods that deal with structural hazards consider the individual stages of a pipeline as different resources. Such an approach results in a large number of resources in the architecture, which in turn increases the complexity of the resulting software pipelining method, be it based on a heuristic approach or on ILP techniques. Further, these methods do not make use of well-established classical pipeline theory.
    Execution Time (Sec.)   Number of Test Cases
    < 1                     450
    1-2                     146
    2-5                     90
    5-10                    39
    10-20                   29
    20-30                   4
    30-60                   8
    60-120                  23
    120-240                 2
    240-300                 67
    300-600                 11
    > 600                   13

Table 2: Histogram of Execution Time (in seconds)

By contrast, in this paper we develop a clear mathematical formulation of the software pipelining problem by considering each FU type as a single resource rather than a number of resources. This is facilitated by the use of classical pipeline theory, in particular the use of a forbidden latency set. Further, our method is unique in that it combines scheduling and mapping in a unified framework and attempts to achieve an optimal solution. An advantage of our method is that it can be extended to handle multifunction pipelines as well. It can also incorporate minimizing register requirements [3].
6 Conclusions

In this paper we have proposed a method that formulates the scheduling and mapping problems in a unified framework. Our method can handle execution units with arbitrary structural hazards: it uses classical pipeline theory and considers each FU type as a single resource (rather than a number of stages). We have implemented our scheduling method and run experiments on a set of 1008 loops taken from various benchmark suites. We obtained optimal schedules for more than 80% of the loops. The geometric mean of the computation time for constructing a schedule was 2.4 seconds and the median was 1 second. While the use of integer programming methods in production compilers may not be acceptable, we feel that the proposed formulation is useful for performance-critical applications as well as in a testbed to evaluate and improve other heuristic methods.
Acknowledgments
The authors are thankful to Kemal Ebcioglu, Mayan Moudgill, and Qi Ning for their help and to IBM for its technical support of this work.
References

[1] E. R. Altman, R. Govindarajan, and G. R. Gao. Scheduling and mapping: Software pipelining in the presence of structural hazards. In Proc. of the SIGPLAN '95 Conf. on Programming Language Design and Implementation, La Jolla, CA, Jun. 1995.
[2] J. C. Dehnert and R. A. Towle. Compiling for Cydra 5. J. of Supercomputing, 7:181-227, May 1993.
[3] A. E. Eichenberger, E. S. Davidson, and S. G. Abraham. Minimum register requirements for a modulo schedule. In Proc. of the 27th Ann. Intl. Symp. on Microarchitecture, pages 75-84, San Jose, CA, Nov. 30-Dec. 2, 1994.
[4] P. Feautrier. Fine-grain scheduling under resource constraints. In Seventh Annual Workshop on Languages and Compilers for Parallel Computing, Ithaca, NY, Aug. 1994.
[5] R. Govindarajan, E. R. Altman, and G. R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. In Proc. of the 27th Ann. Intl. Symp. on Microarchitecture, pages 85-94, San Jose, CA, Nov. 30-Dec. 2, 1994.
[6] T. C. Hu. Integer Programming and Network Flows, page 270. Addison-Wesley, 1969.
[7] R. A. Huff. Lifetime-sensitive modulo scheduling. In Proc. of the SIGPLAN '93 Conf. on Programming Language Design and Implementation, pages 258-267, Albuquerque, NM, Jun. 1993.
[8] P. M. Kogge. The Architecture of Pipelined Computers. McGraw-Hill, New York, NY, 1981.
[9] M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proc. of the SIGPLAN '88 Conf. on Programming Language Design and Implementation, pages 318-328, Atlanta, GA, Jun. 1988.
[10] B. R. Rau and J. A. Fisher. Instruction-level parallel processing: History, overview and perspective. J. of Supercomputing, 7:9-50, May 1993.
[11] B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proc. of the 14th Ann. Microprogramming Workshop, pages 183-198, Chatham, MA, Oct. 1981.