Hypernode Reduction Modulo Scheduling

Josep Llosa, Mateo Valero and Eduard Ayguade
Departament d'Arquitectura de Computadors
Universitat Politecnica de Catalunya
Barcelona, Spain
[email protected]

April 18, 1995

This work has been supported by the Ministry of Education of Spain under the contract TIC 880/92, by ESPRIT 6634 Basic Research Action (APPARC) and by CEPBA (European Center for Parallelism of Barcelona).

Abstract

Software pipelining is a loop scheduling technique that extracts parallelism from loops by overlapping the execution of several consecutive iterations. Prior scheduling research has focused on achieving minimal execution time without regard for register requirements. Unidirectional strategies (top-down or bottom-up) tend to stretch operand lifetimes because they schedule some operations too early or too late. This paper presents a novel bidirectional strategy that schedules some operations late and others early, minimizing all the stretchable dependences and therefore reducing the registers required by the loop. The key to this strategy is a pre-ordering phase that selects the order in which the operations will be scheduled. The pre-ordering heuristic guarantees that, when an operation is selected to be scheduled into the current partial schedule, that partial schedule contains only predecessors or only successors of the operation. An operation is scheduled early if the partial schedule contains predecessors of the operation and late if it contains successors. We compare our method against three other leading scheduling methods: two of them are heuristic methods and one is an exact integer linear programming formulation of the scheduling problem. Experimental results show that the method described in this paper performs as well as the linear programming method, but with a scheduling time comparable to the heuristic methods. When compared with the other heuristic methods, our method performs significantly better.

1 Introduction

Software pipelining is an instruction scheduling technique for exploiting the instruction-level parallelism of loops by overlapping successive iterations of the loop and executing them in parallel. Finding an optimal schedule is an NP-complete problem, and several works propose and evaluate different heuristic strategies to perform software pipelining [2, 1, 3, 5, 4].

The drawback of aggressive scheduling techniques such as software pipelining is the high register pressure. Furthermore, the register requirements increase as the concurrency increases [7, 6], whether due to machines with deeper pipelines, wider issue, or a combination of both. Registers, like functional units, are a limited resource. Therefore, if a schedule requires more registers than are available, some action, such as adding spill code, has to be taken. The addition of spill code can degrade performance [6] due to additional cycles in the schedule or due to memory interferences.

The problems introduced by the high register requirements of aggressive scheduling techniques, together with aggressive architectures, have led to scheduling research oriented towards minimizing register requirements (in part due to the limited number of registers of existing architectures, and in part due to the limitations in chip area, and especially access time, that constrain register files with a high number of registers). In this direction there are proposals of alternative register file organizations [8, 9, 10]. In order to achieve maximum performance, scheduling algorithms that reduce register pressure while scheduling for high throughput are of high interest. Huff's Slack Scheduling [11] is a heuristic that attempts to address this concern but does not always achieve optimal schedules. SPILP [12] is an integer linear programming formulation of the scheduling problem that obtains the optimal resource-constrained schedule with minimal buffer requirements. Unfortunately, SPILP requires much more time to construct the schedule than the heuristic approaches.

This paper presents Hypernode Reduction Modulo Scheduling (HRMS), a heuristic strategy that tries to shorten loop-variant lifetimes without sacrificing performance. The main contribution of this paper is the ordering strategy. The ordering phase orders the nodes before scheduling them, so that only predecessors or successors of a node can be scheduled before the node itself. During the scheduling step the nodes are scheduled as early/late as possible if predecessors/successors have been previously scheduled. This strategy has been tested with a set of loops taken from [12] and compared against three leading scheduling strategies: the previously mentioned Slack and SPILP, together with DESP [5], a heuristic strategy that does not attempt to minimize register requirements. Experimental results show that HRMS obtains better schedules than the other heuristic strategies, with a comparable scheduling time. On the other hand, HRMS produces results similar to SPILP, while requiring up to two orders of magnitude less time to produce the schedules.

In the rest of this section we provide a brief overview of software pipelining and some terms associated with it. In Section 2 an example is used to illustrate the problems that unidirectional strategies have and to show how a bidirectional strategy shortens lifetimes

and reduces register pressure. Some definitions and bounds related to software pipelining and register allocation are presented in Section 3. Section 4 describes HRMS. Section 5 describes the experiments performed, and finally Section 6 states our conclusions.

1.1 A Quick Overview of Software Pipelining

The schedule for an iteration can be divided into stages so that the execution of consecutive iterations that are in distinct stages can be overlapped. The number of stages in one iteration is termed the stage count (SC). The number of cycles between the initiation of successive iterations (i.e. the number of cycles per stage) in a software pipelined schedule is termed the initiation interval (II). If the schedule length is SL cycles, the number of stages is

$$SC = \left\lceil \frac{SL}{II} \right\rceil$$

Figure 1 shows the execution of five iterations of a software pipelined loop with a stage count of 3. The execution of a loop can be divided into three phases: a ramp-up phase that fills the software pipeline, a steady-state phase where the software pipeline achieves maximum overlap of iterations, and a ramp-down phase that drains the software pipeline. The code that implements the ramp-up phase is termed the prologue. During the steady-state phase of the execution, the same pattern of operations is executed in each stage. This behavior can be achieved by iterating on a piece of code, termed the kernel, that corresponds to one stage of the steady-state phase. A third piece of code, called the epilogue, is required to drain the software pipeline after the execution of the steady-state phase.
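As a quick illustration of the relation between SL, II and SC, the following minimal Python sketch (our own, not from the paper) computes the stage count for made-up values:

    # Minimal sketch: stage count SC = ceil(SL / II) of a software pipelined loop.
    # SL and II below are illustrative values, not taken from Figure 1.
    import math

    def stage_count(SL: int, II: int) -> int:
        """Number of stages one iteration spans."""
        return math.ceil(SL / II)

    SL, II = 7, 3                 # a 7-cycle iteration initiated every 3 cycles
    SC = stage_count(SL, II)      # -> 3 stages
    # With SC stages, SC - 1 stages of prologue fill the pipeline,
    # the kernel repeats one stage, and SC - 1 stages of epilogue drain it.
    print(SC)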

Lifetimes of Values. Values used in a loop correspond either to loop-invariant variables or to loop-variant variables. Loop-invariants are repeatedly used but never defined during loop execution. A loop-invariant has only one value for all iterations of the loop and therefore requires one register during the whole execution of the loop, independently of the schedule and the machine configuration.

For loop-variants, a new value is generated in each iteration of the loop and, therefore, there is a different lifetime corresponding to each iteration. Because of the nature of software pipelining, lifetimes of values defined in one iteration can overlap with lifetimes of values defined in subsequent iterations. Assume that the loop in Figure 1 has three loop-variants v1, v2 and v3. Notice that, if an isolated iteration of the original loop is considered, the lifetime of v1 overlaps with the lifetime of v3 but not with that of v2, and v3 overlaps with both v1 and v2. It is easy to see that only 2 registers are required (one shared by v1 and v2, and one for v3). If the software pipelined execution of the loop is considered, the lifetime of v1 overlaps with the lifetime of the v2 of the previous iteration. Consequently, they have to be allocated to distinct registers. In addition v3 overlaps with both v1 and v2, so an additional register for v3 is required. Unfortunately, the lifetime of v3 also overlaps with the lifetime of the v3 of the previous iteration: v3 has a lifetime larger than the II, and a new value is generated before the previous one is used, which would overwrite it.
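A rough way to see why v3 needs more than one register is that a value whose lifetime LT exceeds II has about ceil(LT / II) instances live at once in the steady state, since a new instance is created every II cycles. The snippet below illustrates this with invented numbers; it is not derived from the schedule in Figure 1.

    # Hedged illustration: if a loop-variant value lives LT cycles and a new
    # instance is created every II cycles, roughly ceil(LT / II) instances
    # overlap in the steady state, each needing its own register.
    import math

    def overlapping_instances(LT: int, II: int) -> int:
        return math.ceil(LT / II)

    print(overlapping_instances(LT=5, II=2))   # hypothetical numbers -> 3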

[Figure 1: Software pipelined execution of five iterations of a loop with stage count 3, showing the prologue, steady-state and epilogue phases, the kernel code, the initiation interval II, and the lifetimes of the loop variants v1, v2 and v3 across stages S0, S1 and S2.]

2 Motivating Example

In this section we show how top-down and bottom-up scheduling strategies unnecessarily increase the registers required by software pipelined loops. Then, the basic idea of the bidirectional scheduling presented in this paper is introduced using a simple example; the data dependence graph is shown in Figure 2.

[Figure 2: Data dependence graph of the example loop (operations S1-S7 and loop variants V1-V6).]

[Figure 3: Top-down schedule of the example loop. a) Schedule of one iteration, b) kernel, c) lifetimes of the loop variants, d) live values per cycle of the kernel during the steady state.]

Notice that V4 has an unnecessarily large lifetime due to the early placement of S5 during the scheduling. Figure 3d shows the number of live values in the kernel (Figure 3b) during the steady-state phase of execution of the loop. It is constructed by replicating Figure 3c with a shift of II cycles between successive copies until the pattern in II consecutive rows repeats indefinitely. In Figure 3d we see that 8 registers are live in the first row and 7 in the second; the register requirements for this approach are therefore 8 registers.

[Figure 4: Bottom-up schedule of the example loop. a) Schedule of one iteration, b) kernel, c) lifetimes of the loop variants, d) live values per cycle of the kernel during the steady state.]

to be placed at cycle 3, but because all functional units are busy at this cycle, it will be scheduled at cycle 2. Finally, operation S1 will be placed at cycle 0. Figure 4c shows the lifetimes of the loop variants. Notice that V2 has an unnecessarily large lifetime due to the late placement of S3 during the scheduling. Figure 4d shows the number of live values in the kernel (Figure 4b) during the steady-state phase of execution of the loop. There are 7 live registers in the first row and 7 in the second, so the register requirements for this approach are 7 registers.

[Figure 5: Bidirectional (HRMS) schedule of the example loop. a) Schedule of one iteration, b) kernel, c) lifetimes of the loop variants, d) live values per cycle of the kernel during the steady state.]

For instance, consider that the nodes will be scheduled in the order {S1, S2, S3, S4, S6, S5, S7}. Notice that node S6 will be scheduled before nodes S5 and S7, a predecessor and a successor respectively, and that the partial schedule will contain only a predecessor (S4) of S6. With this scheduling order, both S3 and S5 have a reference operation already scheduled when they are placed in the partial schedule. Figure 5a shows the final schedule for one iteration. Operation S1 is scheduled in cycle 0. Operation S2, which depends on S1, is scheduled in cycle 2. Then S3, and later S4, are scheduled in cycle 4. At this point, operation S6 has to be scheduled as soon as possible, i.e. at cycle 6 (because it depends on S4), but there are no available resources at this cycle, so S6 is delayed to cycle 7. Now the scheduler places operation S5 as late as possible in the schedule, because a successor of S5 has previously been placed in the partial schedule; thus S5 is placed at cycle 5. Finally, since operation S7 has a predecessor previously scheduled, it is placed as soon as possible, i.e. at cycle 9. Figure 5c shows the lifetimes of the loop variants. Notice that neither S3 nor S5 has been placed too late or too early, because the scheduler always takes previously scheduled operations as a reference point. Since S6 was scheduled before S5, the scheduler had a late start for S5. Figure 5d shows the number of live values in the kernel (Figure 5b) during the steady-state phase of execution of the loop. There are 6 live registers in the first row and 5 in the second; thus, the loop requires 6 registers.

3 Definitions and Bounds

Consider an inner loop with a set of operations $\{S_1, S_2, \ldots, S_n\}$. $S_v^i$ is a pair $\langle operation, iteration \rangle$ corresponding to the execution of operation $S_v$ in iteration $i$. The data dependences of a loop can be represented by a Data Dependence Graph $G = DDG(V, E, \delta, \lambda)$. $V$ is the set of vertices of the graph $G$, where each vertex $v \in V$ represents an operation $S_v$ of the loop. $E$ is the dependence edge set, where each edge $(u, v) \in E$ represents a data dependence between two operations $S_u, S_v$. The dependence distance $\delta_{u,v}$ is a nonnegative integer associated with each edge $(u, v) \in E$. There is a dependence of distance $\delta_{u,v}$ between two nodes $u, v$ if the execution of operation $S_v^{i+\delta_{u,v}}$ depends on the execution of operation $S_u^i$. The latency $\lambda_u$ is a nonzero positive integer associated with each node $u \in V$, defined as the number of cycles taken by the corresponding operation $S_u$ to produce a result.

A node $v \in V$ is a successor of $u \in V$, $v \in Succ(u)$, if $\exists (u, v) \in E$. A node $v \in V$ is a predecessor of $u \in V$, $v \in Pred(u)$, if $\exists (v, u) \in E$. An edge $e = (v_1, v_2) \in E$ is an adjacent edge of a node $u \in V$, $e \in Adj\_edges(u)$, if $v_1 = u$ or $v_2 = u$. A path $P(v_1, v_n) \subseteq G$ between two nodes $v_1, v_n \in V$ is a data dependence graph $P = DDG(V_p, E_p, \delta, \lambda)$ where $V_p \subseteq V$ is a succession of nodes $\{v_1, v_2, \ldots, v_n\}$ and $E_p \subseteq E$ is the set of edges $\{(v_i, v_{i+1}) \ \forall i \in [1 \ldots n-1]\}$. The weight of a path $P \subseteq G$ is $\delta_p = \sum_{(u,v) \in E_p} \delta_{u,v}$. The latency of a path $P \subseteq G$ is $L_p = \sum_{u \in V_p} \lambda_u$. An elementary recurrence circuit $Rc_i = DDG(V_i, E_i, \delta, \lambda)$ is a path $P(v_1, v_1) \subseteq G$ from a node $v_1$ to itself.
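To make the notation concrete, one possible (hypothetical) in-memory encoding of such a DDG is sketched below; the operation names, latencies and dependence distances are made up for illustration and are not taken from any figure in the paper.

    # A small illustrative data dependence graph G = (V, E, delta, lambda).
    # latency[u] is lambda_u; each edge (u, v) carries its distance delta(u, v).
    latency = {"S1": 2, "S2": 1, "S3": 2, "S4": 1}
    edges = {
        ("S1", "S2"): 0,
        ("S2", "S3"): 0,
        ("S3", "S4"): 0,
        ("S4", "S2"): 1,   # loop-carried dependence -> recurrence circuit S2-S3-S4
    }

    def succ(u):
        return [v for (a, v) in edges if a == u]

    def pred(u):
        return [a for (a, v) in edges if v == u]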

3.1 Minimum Initiation Interval

The initiation interval of a loop is bounded below by two factors: resource constraints and recurrence circuits. Two lower bounds for the initiation interval can be distinguished.

The resource-constrained MII ($MII_{res}$) imposes a limit on the II due to the resource usage requirements of the loop. The $MII_{res}$ lower bound is calculated by totaling, for each resource, the usage requirements imposed by one iteration of the loop. The resource usage of a particular operation is specified as a list of resources and the times at which each of those resources is used by the operation relative to its issue time. This method of modeling resource usage is termed a reservation table [15]. The nature of the reservation tables for the operations determines the complexity of computing $MII_{res}$. A simple reservation table uses a single resource for a single cycle. A block reservation table uses a single resource for multiple, consecutive cycles. A reservation table that uses multiple resources, or that uses a resource for multiple non-consecutive cycles, is termed a complex reservation table.

Let us define the reservation table $RT_u$ of a node $u \in V$ as a matrix, where each entry $RT_u(R_i, k)$ is a nonnegative integer that corresponds to the number of resources of type $R_i$ that the node $u$ uses in cycle $k$ relative to the cycle where $u$ is scheduled. A lower bound of $MII_{res}$ can then be calculated as

$$MII_{res} = \max_{\forall R_i} \frac{\sum_{\forall u \in V} \sum_{k=0}^{\lambda_u - 1} RT_u(R_i, k)}{n_{R_i}}$$

where $n_{R_i}$ is the number of resources of type $R_i$ available in the architecture. The exact $MII_{res}$ can be computed by performing a bin-packing of the reservation tables of all the operations. Bin-packing is a problem of exponential complexity, hence we use the previous lower bound of $MII_{res}$. In fact, if all the operations have a simple reservation table, the lower bound of $MII_{res}$ corresponds to the actual $MII_{res}$.

The recurrence-constrained MII ($MII_{rec}$) imposes a limit on the II due to recurrence circuits in the DDG of the loop. Obviously, if the loop has no recurrences, $MII_{rec}$ is 0. Let $Rc_i = DDG(V_i, E_i, \delta, \lambda)$ be an elementary recurrence circuit in a data dependence graph $G = DDG(V, E, \delta, \lambda)$, with weight $\delta_i$ and latency $L_i$. Then $MII_{rec}$ can be calculated as

$$MII_{rec} = \max_{\forall Rc_i \subseteq DDG} \frac{L_i}{\delta_i}$$

Although a graph can contain exponentially many elementary recurrence circuits, most loop bodies have very few, so we compute $MII_{rec}$ by simply scanning each circuit $Rc_i \subseteq DDG$. In any case, $MII_{rec}$ can be computed in $O(V \cdot E \cdot \log V)$ time by indirectly finding a circuit with the minimum cost-to-time ratio, where a dependence arc $(u, v) \in E$ is viewed as having a cost $-\lambda_u$ and a time $\delta_{u,v}$.

The lower bounds $MII_{res}$ and $MII_{rec}$ impose an absolute lower bound on MII. In practice, almost all loops can achieve their absolute lower bound of MII, but for some loops the minimum feasible II is greater than MII. The lower bound of MII is

$$MII = \max(MII_{res}, MII_{rec})$$

Notice that $MII_{res}$ and $MII_{rec}$ are fractional numbers, so MII can be non-integral. Because it does not make sense to talk about a non-integral II, we redefine MII as

$$MII = \lceil \max(MII_{res}, MII_{rec}) \rceil$$

However, a loop with fractional $MII = \frac{a}{b}$ can be scheduled with an MII of $\frac{a}{b}$ by unrolling it $b$ times and scheduling the unrolled loop with an MII of $a$. The unrolling can be performed as a step previous to the scheduling, and we do not deal with it here.
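Assuming simple reservation tables (each operation uses one resource for one cycle) and an already enumerated list of elementary recurrence circuits, the two bounds can be computed as in the following sketch; the graph data is invented for illustration.

    # Hedged sketch of the two MII lower bounds, under simple reservation tables.
    import math

    latency  = {"S1": 2, "S2": 1, "S3": 2, "S4": 1}           # made-up latencies
    resource = {"S1": "FPAdd", "S2": "FPAdd", "S3": "FPMul", "S4": "FPAdd"}
    n_res    = {"FPAdd": 1, "FPMul": 1}                        # units per resource
    delta    = {("S1","S2"): 0, ("S2","S3"): 0, ("S3","S4"): 0, ("S4","S2"): 1}
    circuits = [["S2", "S3", "S4"]]                            # elementary circuits

    def mii_res():
        use = {r: 0 for r in n_res}
        for op, r in resource.items():
            use[r] += 1                      # simple reservation table: 1 cycle
        return max(use[r] / n_res[r] for r in n_res)

    def mii_rec():
        best = 0.0
        for c in circuits:
            L = sum(latency[u] for u in c)
            d = sum(delta[(c[i], c[(i + 1) % len(c)])] for i in range(len(c)))
            best = max(best, L / d)
        return best

    MII = math.ceil(max(mii_res(), mii_rec()))   # here max(3/1, 4/1) -> 4
    print(MII)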

3.2 Bounds on Register Pressure

Given a fixed II, a schedule-independent lower bound on the length of the lifetime, $MinLT_u$, of a value generated by a node $u$ can be calculated. Let $P_i(u, v)$ be a path between two nodes $u, v$ with total weight $\delta_i$ and total latency $L_i$. Then the minimum separation, $MinSep(u, v)$, between $u$ and $v$ (i.e. $v$ must be scheduled at least $MinSep(u, v)$ cycles later than $u$) is

$$MinSep(u, v) = \max_{\forall P_i(u,v) \subseteq G} \left( L_i - \delta_i \cdot II - \lambda_v \right)$$

Therefore $MinLT_u$ can be calculated as

$$MinLT_u = \max_{\forall v \in Succ(u)} \left( MinSep(u, v) + \delta_{u,v} \cdot II \right)$$

A lower bound on the final register pressure of a loop, $MinAvLive$, can be expressed as the total length of all lifetimes divided by II [11]:

$$MinAvLive = \left\lceil \sum_{\forall u \in V} MinLT_u / II \right\rceil$$

Once a loop has been scheduled, an absolute lower bound on the schedule's register pressure, $MaxLive$, can be found by computing the maximum number of values that are alive at any cycle of the schedule [16]. Let $t_u$ be the cycle where a node $u$ has been scheduled. Then the lifetime, $LT_u$, of a value generated by a node $u$ is

$$LT_u = \max_{\forall v \in Succ(u)} \left( t_v - t_u + \delta_{u,v} \cdot II \right)$$

And, due to the modulo constraint, $MaxLive$ can be calculated as follows:

    for i = 0 to II - 1 do
        LifeVector(i) := 0
    endfor
    for all u in V do
        for i = t_u to t_u + LT_u - 1 do
            LifeVector(i mod II) := LifeVector(i mod II) + 1
        endfor
    endfor
    MaxLive := max over k in [0, II - 1] of LifeVector(k)
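The MaxLive computation above translates almost line by line into code. The schedule times and lifetimes below are invented, purely to exercise the modulo-II folding:

    # Direct translation of the MaxLive computation. Times and lifetimes are
    # hypothetical; only the folding of lifetimes modulo II is the point here.
    II = 2
    t  = {"S1": 0, "S2": 2, "S3": 3, "S4": 5}   # cycle where each op is scheduled
    LT = {"S1": 2, "S2": 1, "S3": 2, "S4": 2}   # lifetime of the value it defines

    life = [0] * II
    for u in t:
        for i in range(t[u], t[u] + LT[u]):
            life[i % II] += 1
    MaxLive = max(life)
    print(life, MaxLive)   # live values per kernel cycle, and their maximum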

4 Hypernode Reduction Modulo Scheduling

Hypernode Reduction Modulo Scheduling tries to minimize the register requirements of the loop by scheduling operations as close as possible to their relatives, i.e. the predecessors and the successors of the operations. Scheduling operations in this way shortens operand lifetimes and therefore reduces the register requirements of the loop.

To software pipeline a loop, the scheduler must handle cyclic data dependences caused by recurrence circuits. A recurrence circuit $Rc_i(V_i, E_i, \delta, \lambda)$ from an operation to an instance of itself $\delta_i$ iterations later must not be stretched beyond $\delta_i \cdot II$ cycles. In addition, placing an operation $u$ at cycle $t_u$ uses $RT_u(R_i, k)$ resources of type $R_i$ at cycles $t_u + k + s \cdot II$ for all $k \in [0, \lambda_u - 1]$ and for all $s$. So, an operation that cannot fit in one cycle might not fit in any later cycle either. Hypernode Reduction Modulo Scheduling addresses these problems by splitting the scheduling into two steps: a pre-ordering step that orders the nodes, and the actual scheduling step, which schedules the nodes (one at a time) in the order given by the ordering step.
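To make the modulo resource constraint concrete, the sketch below shows the kind of check a modulo scheduler performs before placing an operation: the operation occupies its resources at offsets taken modulo II, so a placement conflicts with every iteration, not just its own. The reservation-table encoding is an assumption made for illustration.

    # Hedged sketch of the modulo constraint on resources: an operation placed
    # at cycle t occupies resource R at slot (t + k) mod II for every offset k
    # of its reservation table.
    II = 4
    n_res = {"FPAdd": 1, "Mem": 1}
    usage = {r: [0] * II for r in n_res}     # modulo reservation table

    def fits(res_table, t):
        """res_table: list of (resource, offset k) pairs used by the operation."""
        return all(usage[r][(t + k) % II] < n_res[r] for (r, k) in res_table)

    def place(res_table, t):
        for (r, k) in res_table:
            usage[r][(t + k) % II] += 1

    op = [("FPAdd", 0), ("Mem", 1)]          # made-up reservation table
    if fits(op, 3):
        place(op, 3)                          # occupies FPAdd at slot 3, Mem at slot 0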

4.1 Pre-ordering step

This step orders the nodes of the data dependence graph with the goal of scheduling the loop with an II as close as possible to the MII and using the minimum number of registers. For this purpose, the ordering step:

- Gives priority to recurrence circuits, in order not to stretch any recurrence circuit.

- Ensures that, when a node is going to be scheduled, the current partial schedule contains only predecessors or only successors of the node, but never both (unless the node is the last node of a recurrence circuit being scheduled).

In order to show how the ordering step works, we will assume that the data dependence graph to be ordered, G = (V, E, δ, λ), is a strongly connected component (i.e. all the nodes are reachable from any node of G). If G is not a connected component:

1. G is decomposed into a set of strongly connected components {G_i},

2. For each G_i the ordering function is applied, and

3. Once each G_i has been ordered, all the G_i are concatenated, giving higher priority to the G_i with the most restrictive recurrence circuit. An alternative would be to schedule each connected component as a separate loop; but even if the original loop makes balanced use of the resources, each G_i will not, and in addition the prologue and epilogue code would be executed for each G_i.

Next we show how the nodes of a data dependence graph G that is a strongly connected component are ordered. First we assume that the data dependence graph has no recurrence circuits, and later we introduce modifications in order to allow recurrence circuits.

4.1.1 Pre-ordering Without Recurrence Circuits

function Pre_Ordering(G, L, h)
{ This function returns a list with the nodes of G ordered }
{ It takes as input:                                        }
{   the data dependence graph (G),                          }
{   a list of nodes partially ordered (L), and              }
{   an initial node, i.e. the hypernode (h)                 }
  List := L;
  while (Pred(h) ≠ ∅ or Succ(h) ≠ ∅) do
    V' := Pred(h);
    V' := Search_All_Path(V', G);
    G' := Hypernode_Reduction(V', G, h);
    L' := Sort_PALA(G');
    List := Concatenate(List, L');

    V' := Succ(h);
    V' := Search_All_Path(V', G);
    G' := Hypernode_Reduction(V', G, h);
    L' := Sort_ASAP(G');
    List := Concatenate(List, L')
  endwhile
  return List

Figure 6: Function that pre-orders the nodes of a data dependence graph without recurrence circuits

The pre-ordering step (Figure 6) is an iterative algorithm that, at each iteration, orders the neighbors of the Hypernode and reduces them to the Hypernode. It requires an initial Hypernode and an initial list of ordered nodes. To order a graph we make the first node of the graph (it works for any initial node) the initial Hypernode, and we insert it into the list of ordered nodes. Ordering the neighbors of the Hypernode presents some implementation problems. The algorithm has to deal with all the paths between predecessors, paths between successors, and paths between predecessors and successors.

function Hypernode_Reduction(V', G, h)
{ G = (V, E, δ, λ); V' ⊆ V; h ∈ V }
{ This function creates the maximum subgraph G' = (V', E', δ, λ) ⊆ G }
{ and reduces G' to the node h in the graph G }
  E' := ∅;
  for all u ∈ V' do
    for all e = (v1, v2) ∈ Adj_edges(u) do
      E := E − {e};
      if v1 ∈ V' and v2 ∈ V' then
        E' := E' ∪ {e}
      else
        if v1 = u and v2 ≠ h then E := E ∪ {(h, v2)} endif;
        if v2 = u and v1 ≠ h then E := E ∪ {(v1, h)} endif
      endif
    endfor;
    V := V − {u}
  endfor
  return G'

Figure 7: Function Hypernode Reduction

In order to keep the implementation simple, predecessors and successors of the Hypernode are ordered alternately (instead of ordering them in a single step). At each step, the predecessors (successors) of the Hypernode are obtained. Then the nodes that appear in any path between these predecessors (successors) are also obtained (see function Search_All_Path in Figure 9). Once the predecessors (successors) and all the paths connecting them have been obtained, all these nodes are reduced to the Hypernode (see function Hypernode_Reduction in Figure 7), and the subgraph that contains them is topologically sorted. The topological sort determines the partial order of the predecessors (successors), which is appended to the ordered list of nodes.

As an example, consider the data dependence graph of Figure 8a. Next we illustrate the ordering of the nodes of this graph step by step.

1. Initially, the list of ordered nodes is empty (List = {}). We start by making one node of the graph the Hypernode (H in Figure 8). Since we can start ordering the nodes at any node of the graph, we select the first node, in this case node A. The algorithm makes node A the Hypernode H (resulting in the graph of Figure 8b), and A is appended to the list of ordered nodes (List = {A}).

2. In the next step the predecessors of H would be selected. Since H has no predecessors, the successors are selected instead (i.e. node C). Node C is reduced to H, resulting in the graph of Figure 8c, and C is added to the list of ordered nodes (List = {A, C}).

[Figure 8: Step-by-step ordering example. Graphs a) through h) show the data dependence graph as successive groups of nodes are reduced to the Hypernode H.]

function Search_All_Path(V', G)
  P := V'; list := ∅;
  for all u ∈ V' do list := Append(list, u) endfor
  while (list ≠ ∅) do
    u := Head(list);
    for all v ∈ Pred(u) do
      if v ∉ P then
        list := Append(list, v);
        P := P ∪ {v}
      endif
    endfor
  endwhile;
  S := V'; list := ∅;
  for all u ∈ V' do list := Append(list, u) endfor
  while (list ≠ ∅) do
    u := Head(list);
    for all v ∈ Succ(u) do
      if v ∉ S then
        list := Append(list, v);
        S := S ∪ {v}
      endif
    endfor
  endwhile
  return P ∩ S

Figure 9: Function Search All Path

function Sort_PALA(G)
{ Sorts the graph G topologically using an ALAP algorithm }
{ and returns a list of the nodes in inverted order }

function Sort_ASAP(G)
{ Sorts the graph G topologically using an ASAP algorithm }
{ and returns a list of the nodes in the order given by ASAP }

Figure 10: Some interesting functions
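The closure computation of Figure 9 is perhaps the least obvious of these helpers. A compact Python rendering is sketched below (the graph encoding and the example graph are ours, not from the paper): the nodes lying on some path between members of V' are exactly the intersection of the backward closure and the forward closure of V'.

    # Hedged Python rendering of Search_All_Path (Figure 9).
    def search_all_path(Vp, pred, succ):
        def closure(step):
            seen, work = set(Vp), list(Vp)
            while work:
                u = work.pop()
                for v in step(u):
                    if v not in seen:
                        seen.add(v)
                        work.append(v)
            return seen
        return closure(pred) & closure(succ)

    # Tiny made-up graph: A -> B -> D, A -> C -> D, D -> E
    succ_map = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
    pred_map = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["D"]}
    print(search_all_path({"A", "D"},
                          lambda u: pred_map[u],
                          lambda u: succ_map[u]))   # -> {'A', 'B', 'C', 'D'}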

4.1.2 Extending the Pre-ordering Step to Graphs with Recurrence Circuits

In order to apply the algorithm to loops with recurrence circuits, we store all the recurrence circuits during the calculation of MII_rec. So we will assume that all recurrence circuits have been identified and ordered from the most restrictive to the least restrictive (depending on their MII_rec). We do not consider trivial recurrence circuits, i.e. dependences from an operation to itself.

In order not to degrade performance when there are recurrence circuits, the ordering step gives priority to the recurrence circuits with higher MII_rec. In short, we reduce all the recurrence circuits to the Hypernode while ordering their nodes. After this step, we have a data dependence graph without recurrence circuits, with an initial Hypernode and with a partial ordering of all the nodes that were contained in recurrence circuits. Then we order this data dependence graph as shown in Subsection 4.1.1.

[Figure 11: Example data dependence graphs with recurrence circuits (a-d), with dependence distances annotated on the loop-carried edges.]

procedure Ordering_Recurrences(G, L, List, h)
{ This procedure takes the data dependence graph (G) }
{ and the ordered list of recurrence circuits (L)    }
  Simplify_Recurrence_Subgraphs(G, L);
  Simplify_Redundant_Nodes(L);
  V' := Head(L);
  G' := Generate_Subgraph(V', G);
  h := First(G');
  List := {h};
  Pre_Ordering(G', List, h);
  while L ≠ ∅ do
    V' := Search_All_Path({h, Head(L)}, G);
    G' := Generate_Subgraph(V', G);
    Pre_Ordering(G', List, h);
  endwhile

Figure 12: Procedure to order the nodes in recurrence circuits

For instance, the list of recurrences for the graph of Figure 11b is {{A, D, E}, {A, B, C, E}}. The algorithm performs two simplifications on this list before ordering the nodes in recurrences. The first of them is performed by Simplify_Recurrence_Subgraphs. This procedure fuses all the recurrence circuits that belong to the same recurrence subgraph (i.e. that have the same backward edge). For instance, the list of recurrences associated with Figure 11b is simplified to a single recurrence subgraph {{A, B, C, D, E}}. The second simplification (Simplify_Redundant_Nodes) consists of eliminating redundant nodes from the list of recurrence subgraphs. The nodes that appear in more than one recurrence subgraph are removed from all the sublists except the most restrictive one (i.e. the first one in the list of recurrence subgraphs). For instance, the list of recurrence subgraphs associated with Figure 11c, {{A, C, D}, {B, C, E}}, is simplified to the list {{A, C, D}, {B, E}}.

Once the nodes have been simplified, the actual ordering of the recurrence circuits is performed. It starts by generating the subgraph corresponding to the first recurrence circuit, but without the backward edge that causes the recurrence. Therefore the resulting subgraph has no recurrences and can be ordered using the algorithm without recurrences presented in Subsection 4.1.1. The whole subgraph is then reduced to the Hypernode. Next, we search for all paths between the Hypernode and the next recurrence circuit (in order to properly use the algorithm Search_All_Path it is required that all the backward edges causing recurrences have been removed from the graph). After that, the graph containing the Hypernode, the next recurrence circuit, and all the nodes that are on a path connecting them is ordered applying the algorithm without recurrence circuits and is reduced to the Hypernode.

This process is repeated until there are no more recurrence subgraphs in the list. At this point all the nodes in recurrence circuits, or on paths connecting them, have been ordered and reduced to the Hypernode. Therefore the graph that contains the Hypernode and the remaining nodes is a graph without recurrence circuits, and it can be ordered using the algorithm presented in the previous subsection.

[Figure 13: Example of ordering a data dependence graph with recurrence circuits; graphs a) through e) show the successive reductions to the Hypernode H.]

4.2 Scheduling step

The scheduling step places the operations in the order given by the ordering step. It tries to schedule each operation as close as possible to its neighbors that have already been scheduled. When an operation is going to be scheduled, it is scheduled in different ways depending on which of its neighbors are in the partial schedule.

- If an operation u has only predecessors in the partial schedule, then u is scheduled as soon as possible. In this case the scheduler computes the Early Start of u as

$$Early\_Start_u = \max_{\forall v \in scheduled\ Pred(u)} (t_v + \lambda_v)$$

where $t_v$ is the cycle where v has been scheduled and $\lambda_v$ is the latency of v. The scheduler then scans the partial schedule for a free slot for the node u, starting at cycle Early_Start_u and ending at cycle Early_Start_u + II. Notice that, due to the modulo constraint, it makes no sense to scan more than II cycles.

- If an operation u has only successors in the partial schedule, then u is scheduled as late as possible. In this case the scheduler computes the Late Start of u as

$$Late\_Start_u = \min_{\forall v \in scheduled\ Succ(u)} (t_v - \lambda_u)$$

The scheduler then scans the partial schedule for a free slot for the node u, starting at cycle Late_Start_u and going backwards until cycle Late_Start_u - II.

- If an operation u has both predecessors and successors in the partial schedule, then the scheduler scans the partial schedule starting at cycle Early_Start_u until cycle Late_Start_u.

If no free slot is found for a node, the II is increased by 1. The scheduling step is then repeated with the increased II, which offers more opportunities for finding free slots. One of the advantages of our proposal is that the nodes are ordered only once, even if the scheduling step has to make several tries.
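A condensed sketch of this placement rule follows. It assumes the operations arrive in the precomputed order, a single one-unit resource per cycle, and only intra-iteration dependences (the formulas above likewise omit dependence distances); the names and the resource model are ours, not the paper's.

    # Hedged sketch of the HRMS scheduling step: each operation, taken in the
    # precomputed order, is placed ASAP after its scheduled predecessors or
    # ALAP before its scheduled successors.
    def schedule(order, pred, succ, latency, II):
        t = {}                        # cycle assigned to each scheduled operation
        busy = [False] * II           # modulo reservation table, one unit per slot
        for u in order:
            sp = [v for v in pred(u) if v in t]   # scheduled predecessors
            ss = [v for v in succ(u) if v in t]   # scheduled successors
            if sp and not ss:                     # only predecessors: ASAP
                early = max(t[v] + latency[v] for v in sp)
                slots = range(early, early + II)
            elif ss and not sp:                   # only successors: ALAP
                late = min(t[v] - latency[u] for v in ss)
                slots = range(late, late - II, -1)
            elif sp and ss:                       # both: scan early start to late start
                early = max(t[v] + latency[v] for v in sp)
                late = min(t[v] - latency[u] for v in ss)
                slots = range(early, late + 1)
            else:                                 # first operation of its component
                slots = range(II)
            for c in slots:
                if not busy[c % II]:
                    busy[c % II] = True
                    t[u] = c
                    break
            else:
                return None   # no free slot: the caller increases II and retries
        return t

    # Tiny made-up example: chain S1 -> S2 -> S3, scheduled in that order.
    lat  = {"S1": 2, "S2": 1, "S3": 1}
    pred = lambda u: {"S2": ["S1"], "S3": ["S2"]}.get(u, [])
    succ = lambda u: {"S1": ["S2"], "S2": ["S3"]}.get(u, [])
    print(schedule(["S1", "S2", "S3"], pred, succ, lat, II=3))

A driver would call schedule() with II = MII and, whenever it returns None, retry with II + 1, reusing the same node order each time, as described above.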

5 Experimental Results and Comparison with Other Methods

To demonstrate the effectiveness of our method, we have evaluated how well it performs compared to three leading methods: a heuristic method that does not care about register requirements [5] (DESP), a lifetime-sensitive heuristic method [11] (Slack) and a linear programming method [12] (SPILP). We used 24 data dependence graphs from [12] with a machine configuration with 1 FP adder, 1 FP multiplier, 1 FP divider and 1 load/store unit. We have assumed a unit latency for add, subtract and store instructions, a latency of 2 for multiply and load, and a latency of 17 for divide.

                    |      HRMS       |      SPILP      |      Slack      |      FRLC
Application  Loop   |  II  Buf  Secs  |  II  Buf  Secs  |  II  Buf  Secs  |  II  Buf  Secs
Spice        Loop1  |   1   3   0.01  |   1   3   0.82  |   1   3   0.01  |   2   2   0.02
             Loop2  |   6   9   0.03  |   6   9  12.47  |   7   9   0.03  |   6  16   0.03
             Loop3  |   6   4   0.01  |   6   4   0.72  |   6   4   0.02  |   6   4   0.02
             Loop4  |  11  12   0.20  |  11  12   3.60  |  12  12   0.10  |  12  12   0.03
             Loop5  |   2   2   0.01  |   2   2   0.70  |   2   2   0.02  |   2   2   0.02
             Loop6  |   2  16   0.08  |   2  16   7.67  |   3  11   0.03  |  17   9   0.03
             Loop7  |   3  17   0.08  |   3  17   0.70  |   3  17   0.03  |  17  11   0.01
             Loop8  |   3   6   0.02  |   3   6   3.15  |   5   5   0.03  |   3   8   0.02
             Loop10 |   3   4   0.02  |   3   4   1.88  |   3   4   0.02  |   3   5   0.02
Doduc        Loop1  |  20  12   0.17  |  20  12   4.35  |  20  13   0.03  |  20  15   0.03
             Loop3  |  20  11   0.15  |  20  11   1.03  |  20  11   0.03  |  20  22   0.03
             Loop7  |   2  20   0.10  |   2  20   0.70  |   2  20   0.01  |  18   5   0.03
Fpppp        Loop1  |  20   5   0.13  |  20   5   0.93  |  20   5   0.03  |  20   6   0.02
Liver        Loop1  |   3  10   0.02  |   3  10   1.97  |   5  10   0.05  |   4  15   0.02
             Loop5  |   3   5   0.02  |   3   5   0.73  |   3   5   0.05  |   3   6   0.02
             Loop23 |   9  23   0.10  |   9  23 233.41  |   9  23   0.13  |   9  40   0.12
Linpack      Loop1  |   2   5   0.02  |   2   5   2.62  |   2   5   0.02  |   3   4   0.02
Whets.       Loop1  |  17  16   0.10  |  17  16   4.25  |  18  16   0.17  |  18  16   0.08
             Loop2  |   6   9   0.08  |   6   9   2.05  |   7   9   0.03  |  17   7   0.03
             Loop3  |   5   5   0.02  |   5   5   0.73  |   5   5   0.02  |   5   5   0.02
             Cycle1 |   4   4   0.02  |   4   4   0.75  |   4   4   0.02  |   4   4   0.02
             Cycle2 |   4   5   0.02  |   4   5   1.87  |   4   5   0.02  |   4   5   0.02
             Cycle4 |   4   7   0.02  |   4   7   1.85  |   4   7   0.01  |   4   7   0.03
             Cycle8 |   4  11   0.02  |   4  11   1.77  |   4  11   0.02  |   4  11   0.02

Table 1: Scheduling on benchmark data dependence graphs.

The results for the other three methods have been obtained from [12]. Table 1 compares the initiation interval (II), the number of registers (Buf) and the total execution time of the scheduler on a Sparc-10/40 workstation for the four scheduling methods. Table 2 summarizes the comparison with the other scheduling methods. This table brings out the benefits of the HRMS approach in terms of initiation interval and, when the initiation interval is the same, in terms of buffer requirements. Notice that in none of the cases did the HRMS approach produce a worse schedule than any of the other methods. Finally, Table 3 compares the total compilation time in seconds for the four methods. Notice that HRMS is slightly slower than the two heuristic methods, but those methods perform noticeably worse at finding an optimal schedule. On the other hand, the linear programming method (SPILP) has performance similar to HRMS, but the time to construct the schedules is much higher than with the HRMS approach.

Method |  Better II  |  Equal II, Better Buffers  |  Equal II, Equal Buffers
SPILP  |      0      |             0              |            24
Slack  |      7      |             1              |            16
DESP   |      9      |             8              |             7

Table 2: Comparison of HRMS performance to the other 3 methods.

                         HRMS   SPILP    Slack   DESP
Total Compilation Time   1.45   290.72   0.93    0.71

Table 3: Comparison of HRMS compilation time to the other 3 methods.

6 Conclusions

This paper has presented Hypernode Reduction Modulo Scheduling (HRMS), a novel and effective technique for resource-constrained software pipelining. HRMS can deal with loops containing loop-carried dependences and attempts to optimize the initiation interval while reducing the register requirements of the schedule.

First, HRMS orders the nodes of the data dependence graph. The ordering function gives priority to recurrence circuits, in order not to penalize the initiation interval. In addition, nodes are ordered in such a way that when a node is scheduled, the partial schedule contains at least one reference node (a predecessor or a successor). The ordering step guarantees that (except in the special case of recurrence circuits) only predecessors or only successors of the current node have already been scheduled, but not both. After the nodes have been ordered, HRMS schedules them. The scheduling step schedules a node as soon as possible if it has predecessors already scheduled, and as late as possible if it has successors already scheduled. Scheduling nodes in this way shortens the lifetimes of loop variants and therefore reduces the register requirements of the schedule.

The usefulness of HRMS has been empirically established by applying it to several loops taken from common scientific benchmarks. We have compared our schedules with three leading methods, namely Govindarajan et al.'s SPILP integer programming formulation, Huff's Slack Scheduling and Wang et al.'s DESP scheduling. Our schedules exhibit significant improvement in performance in terms of initiation interval and buffer requirements compared to DESP, and a significant improvement in initiation interval compared to Slack's lifetime-sensitive heuristic. Finally, we obtained results similar to those of SPILP, which required up to two orders of magnitude more scheduling time.

Acknowledgments

We acknowledge R. Govindarajan, Erik R. Altman and Guang R. Gao for supplying us with the data dependence graphs they used in [12], which allowed us to compare our proposal with other methods.

References

[1] Monica Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the SIGPLAN'88 Conference on Programming Language Design and Implementation, pages 318-328, June 1988.

[2] B.R. Rau and C.D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the 14th Annual Microprogramming Workshop, pages 183-197, October 1981.

[3] S. Jain. Circular scheduling: A new technique to perform software pipelining. In Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation, pages 219-228, June 1991.

[4] B. Ramakrishna Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 63-74, November 1994.

[5] J. Wang, C. Eisenbeis, M. Jourdan, and B. Su. Decomposed software pipelining: A new perspective and a new approach. International Journal of Parallel Programming, 22(3):357-379, 1994.

[6] J. Llosa, M. Valero, E. Ayguade, and J. Labarta. Register requirements of pipelined loops and their effect on performance. In 2nd Int. Workshop on Massive Parallelism: Hardware, Software and Applications, October 1994.

[7] William Mangione-Smith, Santosh G. Abraham, and Edward S. Davidson. Register requirements of pipeline processors. In Int. Conference on Supercomputing, pages 260-246, July 1992.

[8] John A. Swensen and Yale N. Patt. Hierarchical registers for scientific computers. In Int. Conference on Supercomputing, 1988.

[9] A. Capitanio, N. Dutt, and A. Nicolau. Partitioned register files for VLIWs: A preliminary analysis of tradeoffs. In MICRO-25, pages 292-300, 1992.

[10] J. Llosa, M. Valero, and E. Ayguade. Non-consistent dual register files to reduce register pressure. In 1st Symposium on High Performance Computer Architecture, January 1995.

[11] Richard A. Huff. Lifetime-sensitive modulo scheduling. In 6th Conference on Programming Language Design and Implementation, pages 258-267, 1993.

[12] R. Govindarajan, Erik R. Altman, and Guang R. Gao. Minimal register requirements under resource-constrained software pipelining. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 85-94, November 1994.

[13] M.S. Lam. A Systolic Array Optimizing Compiler. Kluwer Academic Publishers, 1989.

[14] J.C. Dehnert, P.Y.T. Hsu, and J.P. Bratt. Overlapped loop support in the Cydra 5. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 26-38, 1989.

[15] E.S. Davidson, A.T. Thomas, L.E. Shar, and J.H. Patel. Effective control for pipelined processors. In Proc. COMPCON75, pages 181-184. IEEE, March 1975.

[16] B.R. Rau, M. Lee, P. Tirumalai, and P. Schlansker. Register allocation for software pipelined loops. In Proceedings of the ACM SIGPLAN'92 Conference on Programming Language Design and Implementation, pages 283-299, June 1992.

