Software Pipeliner: Parallelization of Loops



(Draft)

Ping Hu

Christine Eisenbeis

INRIA Rocquencourt Domaine de Voluceau, 78153 Le Chesnay, France

Abstract

Software pipelining, as an important parallelization technique for loops, exploits the parallelism present among the iterations of a loop by overlapping the execution of successive iterations. This paper presents a practical and usable algorithm, Overlapping Modulo Scheduling (OMS), which is capable of modulo scheduling loops subject to recurrence dependences and resource constraints for realistic machine models. The difficulty encountered when software pipelining loops with conditional branches is that arbitrary control flow complicates the construction of an overlapping schedule, since there are multiple possible execution paths. To address this issue, we have extended the OMS algorithm to handle conditional branches by leveraging the individual benefits of complete and partial if-conversion.

Keywords: Instruction level parallelism, VLIW processors, Software pipelining, Modulo scheduling, Optimization of loops

This research was partially supported by the ESPRIT IV reactive LTR project OCEANS, under contract No. 22729. 


1 Introduction

Loops in a program take up much of the execution time, but the parallelism available within a single iteration of a loop is quite limited. The development of effective and practical techniques for extracting the parallelism in loop constructs is thus critical to achieving a large amount of instruction-level parallelism. Software pipelining, as an important parallelization technique for loops, exploits the parallelism present among the iterations of a loop by overlapping the execution of successive iterations. Many efforts [RG81, SDX86, Lam88, DHB89, AN91, EW92, WEJS94, Rau94, RGSL96] have been made to develop heuristics and strategies for performing software pipelining. Among the categories of software pipelining techniques, modulo scheduling [RG81] applies a common model to all loop iterations, i.e. the same schedule for each iteration and a fixed interval for initiating successive iterations. That makes it possible to accurately model resource conflicts as well as register allocation.

In general, modulo scheduling is an iterative procedure. First, the minimum II (Initiation Interval¹) is determined with respect to the resource constraints and the inter-iteration dependences (data dependences between operations from different loop iterations). This value of the minimum II is established as a candidate II. Then, one attempts to find a unified kernel (i.e. a new pipelined loop body) for all iterations, such that this kernel can be initiated at the II rate with no resource conflict. If this attempt fails, the II value is increased in order to provide more opportunities for the scheduler. This procedure is repeated until the II value grows beyond a given bound.

In this paper, we present our algorithm, OMS (Overlapping Modulo Scheduling), derived from the above modulo scheduling scheme to parallelize loops for the Philips Trimedia TM-1000 processor [Cas94]. The main developments with respect to previous work on modulo scheduling are the following. First, we show that computing the minimum II is no longer an NP-complete problem [Lam88]: we can calculate it at a much lower cost. Second, we demonstrate a practical and usable way, with the support of the SALTO [RBS96] environment, to construct the modulo kernel for loops subject to inter-iteration dependences and resource constraints. This algorithm has already been implemented in the SALTO environment.

In addition, software pipelining loops with conditional branches is a difficult problem, because the multiple possible execution paths complicate the construction of an overlapping schedule. To take into account the possible combinations of different execution paths, techniques for software pipelining loops with conditional branches generally have to make multiple copies of an operation (i.e. a RISC instruction), which contributes significant code size expansion. Fortunately, predicated execution² in the TM-1000 processor can eliminate branch constructs through a process known as if-conversion [AKPW83]. The OMS algorithm can then be applied to this if-converted loop body. However, the drawback is a larger kernel, since it contains all operations along all execution paths.

¹ II is the number of instruction issue cycles between the initiation of successive iterations.
² Predicated execution is a hardware feature that supports the execution of predicated operations.


We tend to balance these techniques with and without predicated execution in order to inherit their individual merits and avoid their handicaps. Thus, we propose an algorithm that attempts to improve the resource usage while keeping code size expansion under reasonable control. This algorithm is called Global Overlapping Modulo Scheduling (GOMS).

The remainder of this paper is organized as follows. Section 2 presents an overview of the OMS algorithm. Section 3 describes each step of the OMS algorithm for modulo scheduling loops. Section 4 presents the GOMS algorithm for handling loops containing conditional branches with the support of predicated execution. The last section concludes and discusses our future work.

2 The overview of the OMS algorithm

Figure 1 gives the overview of the OMS algorithm. The entire structure consists of the IL file, the interface with SALTO and the OMS algorithm itself.

• IL (Interface Language) file

The IL file preserves the useful information that is lost during code generation, for instance control flow and memory-access dependences between operations from either the same or distinct loop iterations. We will see that this information is quite essential for low-level optimization and scheduling, especially for software pipelining. In our compiler for the TM-1000 processor [ABB+97], the front-end compiler generates this IL file together with the assembly code for the back-end compiler. However, this file can also be defined by a low-level optimizer.

• Interface with SALTO

Before the OMS algorithm is performed, SALTO is in charge of extracting the loop constructs. Moreover, SALTO also carries out classic optimizations and local scheduling. An important step performed by SALTO is the minimization of anti-dependences and output dependences via register renaming, which reduces the height of the data dependences. As a result, the number of overlapped iterations is decreased, which reduces code size expansion and relaxes the register allocation pressure during software pipelining. The optimized assembly code of the loop is then fed as input to the OMS algorithm. The output of the OMS algorithm is the software pipelined loop code, which is returned to SALTO. Finally, it is SALTO that generates the actual executable VLIW instructions³.

• OMS algorithm

The OMS (Overlapping Modulo Scheduling) algorithm consists of three major components: DDG construction, CreateLoopKernel and SP construction.

³ A VLIW instruction is a list of operations.

[Figure 1: The overview of the OMS algorithm. The assembly code and the IL file are fed to SALTO, which produces the optimized assembly loop code. DDG Construction builds the DDG, CreateLoopKernel iterates until it succeeds in building the new loop kernel, and SP Construction produces the software pipelined loop code, which is returned to SALTO.]


The first step of the OMS algorithm is to construct the Data Dependence Graph (DDG) for the loop body, on which the second step, CreateLoopKernel, is performed to create the new loop kernel based on the modulo scheduling scheme. Finally, the SP construction step generates the software pipelined loop code. The next section details each step as well as our major developments with respect to previous work on modulo scheduling.

3 Description of the OMS algorithm

3.1 DDG construction

The DDG is employed to represent the dependence relations between operations. Each vertex of the DDG represents an operation and each directed edge represents a dependence between the two operations joined by this edge. In our constructed DDG, each vertex is a pointer to a data structure which collects the information about an operation: Id, the position in the loop body; Symbolic, the mnemonic; Cycle, the issue time dynamically determined by the scheduling algorithm; and Iteration, the iteration number in the software pipelined kernel. The data attached to a directed edge contain Latency, the minimum number of cycles needed to satisfy the dependence indicated by this edge; depType, the type of the dependence (RAW, WAR, WAW and NONE); depName, the name of the dependence resource; and Distance, the number of iterations spanned by an inter-iteration dependence, whose value is set to zero for an intra-iteration dependence (a data dependence between operations within one loop iteration).
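As an illustration of these records, the following minimal sketch mirrors the vertex and edge information described above (Python; the field names follow the text, everything else is an assumption of ours and not part of the actual implementation):

from dataclasses import dataclass

@dataclass
class Operation:          # one DDG vertex
    Id: int               # position in the loop body
    Symbolic: str         # mnemonic of the operation
    Cycle: int = 0        # issue time, set dynamically by the scheduler
    Iteration: int = 0    # iteration number in the software pipelined kernel

@dataclass
class Dependence:         # one directed DDG edge
    source: Operation
    target: Operation
    Latency: int          # minimum number of cycles satisfying the dependence
    depType: str          # 'RAW', 'WAR', 'WAW' or 'NONE'
    depName: str          # name of the dependence resource (register or memory)
    Distance: int = 0     # iterations spanned; 0 for an intra-iteration dependence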

[Figure 2: The DDG construction. Loop body: (1) R1=a[i]; (2) R2=b[i-2]; (3) R3=R1+R2; (4) b[i]=R3; (5) R4=R3*2; (6) R5=R4+3; (7) a[i-1]=R5. Local scheduling on three functional units FU1-FU3 yields the DDG nodes #{1,R1=a[i],0,0}, #{2,R2=b[i-2],0,0}, #{3,R3=R1+R2,2,0}, #{4,b[i]=R3,3,0}, #{5,R4=R3*2,3,0}, #{6,R5=R4+3,6,0}, #{7,a[i-1]=R5,7,0}.]

Let us consider the simple DDG example in Figure 2. Suppose a hypothetical processor with three fully-pipelined general-purpose functional units. The latencies of most operations are one cycle, except for the mult and memory-access operations. After local scheduling has been performed on the loop body, the constructed DDG is shown on the right side of Figure 2, in which each node is a pointer to a structure {Id, Symbolic, Cycle, Iteration} and each edge carries a structure {Latency, depType, depName, Distance}. Note that all the Iteration fields are assigned zero, because local scheduling only reorders operations from the same iteration. The register dependence relations can be analyzed without much pain with the help of the SALTO environment; the information about memory-access dependences is obtained from the IL file, as mentioned above.

Some inter-iteration dependences cause dependence circuits in the DDG. For instance, the dependence from Op4 to Op2 leads to the circuit (Op2, Op3, Op4, Op2). This is not the case for the edge from Op1 to Op7: this dependence does not cause a dependence circuit. To distinguish these two types of inter-iteration dependences, we give the two following definitions.

Definition 1 (CIDE) (Circuit Inter-iteration Dependence Edge)

An edge of a given DDG is a CIDE if this edge represents an inter-iteration dependence and there exists a path from the target of the edge to the source of the edge.

Definition 2 (NCIDE) (No-Circuit Inter-iteration Dependence Edge)

An edge of a given DDG is an NCIDE if this edge represents an inter-iteration dependence and there exists no path from the target of the edge to the source of the edge.
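A minimal way to apply these two definitions is a reachability test on the DDG. The sketch below (illustrative only, reusing the Operation/Dependence records sketched above and adjacency lists keyed by operation Id) classifies an inter-iteration edge as CIDE or NCIDE:

def reaches(succ, start, goal):
    """Depth-first search: is `goal` reachable from `start` in the DDG?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(succ.get(node, ()))
    return False

def classify_edge(succ, edge):
    """Return 'CIDE' or 'NCIDE' for an inter-iteration edge (Distance > 0)."""
    assert edge.Distance > 0, "only inter-iteration dependences are classified"
    # CIDE: a path exists from the target back to the source, closing a circuit.
    return 'CIDE' if reaches(succ, edge.target.Id, edge.source.Id) else 'NCIDE'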

3.2 The minimum initiation interval estimation

The initiation interval (II) is the interval at which successive iterations can be initiated; it also determines the number of cycles needed to execute the kernel. Therefore, this value should be as small as possible from the viewpoint of maximizing parallelism. The minimum initiation interval (MII) is a lower bound on the smallest possible II value for which a modulo schedule can be achieved. The MII is known to be determined by the maximum of two constraints, the resource usage constraints (ResMII) and the recurrence constraints caused by inter-iteration dependences (RecMII) [RG81, Lam88, Rau94]. That is,

MII = ⌈max(ResMII, RecMII)⌉

3.2.1 The recurrence constrained MII

Software pipelining overlaps the execution of the operations from successive loop iterations in order to aggressively extract instruction level parallelism. Therefore, a legal software pipelining schedule must respect the inter-iteration dependences in addition to the intra-iteration dependences and the resource constraints. More precisely, for any inter-iteration dependence edge e from operation Op1 to operation Op2, the modulo-scheduled cycles for Op1 and Op2 in the same iteration must satisfy the following formula:

Op2.Cycle + e.Distance × II ≥ Op1.Cycle + e.Latency    (1)

Intuitively, the execution of the instance of Op2 that occurs e.Distance iterations later (each iteration having length II) can start only after the execution of Op1 has completed.

In this formula, e.Distance and e.Latency are statically determined during the construction of the DDG. The modulo scheduler must determine the other three variables: II, Op1.Cycle and Op2.Cycle. In general, the recurrence constrained MII (RecMII) is determined by accounting for the worst-case dependence circuit caused by recurrence dependences:

RecMII = max_{c ∈ dependence circuits} ( Delay(c) / Distance(c) )

where Delay(c) is the sum of the delays⁴ along a circuit c in the graph and Distance(c) is the sum of the distances along that circuit. Computing RecMII this way is an NP-complete problem, since the computational complexity of identifying all circuits in a graph is exponential. To obtain the RecMII at a lower expense, we have developed the following approach in our implementation. After the dependence graph of a loop body has been constructed:

1. List-schedule [ACD74] (a local scheduling algorithm) all the operations, taking into account only the intra-iteration dependences. In other words, we ignore all the inter-iteration dependences and the resource constraints at this point. The scheduled cycle of each operation is recorded in its Cycle field.

2. Compute the MII for each CIDE and assign the maximum to the RecMII. With the source of e denoted Op1 and the target of e denoted Op2,

   RecMII = max_{e ∈ CIDEs} ( ((Op1.Cycle − Op2.Cycle) + e.Latency) / e.Distance )

The motivation of these two steps is that, for any CIDE e, we first use list scheduling to determine the minimum scheduled-time difference from Op2 to Op1 with respect to the intra-iteration dependences, and then we use formula (1) to calculate the RecMII so as to fulfill the inter-iteration dependence from Op1 to Op2 as well. Let N be the number of operations in the given loop and M the number of edges. The computational complexity of list scheduling is O(N × M) and the complexity of the second step is O(M) in the worst case. Hence, the approach presented here for computing the MII is no longer an NP-complete problem.

In the example of Figure 2, the result after list scheduling without resource constraints is the same as shown in Figure 2. There is only one dependence circuit, so the RecMII is calculated as RecMII = ((3 − 0) + 2)/2 = 2.5.

⁴ The delay from one operation to another operation is the difference of the scheduled cycles of these two operations.
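Under these assumptions, step 2 can be sketched as follows; the list scheduling of step 1 (which fills in the Cycle fields) is assumed to have run already and is not shown:

def rec_mii(cides):
    """RecMII = max over all CIDEs e of
       ((e.source.Cycle - e.target.Cycle) + e.Latency) / e.Distance,
       where Cycle was computed by list scheduling with intra-iteration
       dependences only (step 1)."""
    best = 0.0
    for e in cides:
        value = ((e.source.Cycle - e.target.Cycle) + e.Latency) / e.Distance
        best = max(best, value)
    return best

# For the single circuit of Figure 2 (Op4 at cycle 3, Op2 at cycle 0,
# latency 2, distance 2) this yields ((3 - 0) + 2) / 2 = 2.5.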


3.2.2 The resource constrained MII

The resource constrained MII (ResMII) is determined by accounting for the most heavily used resource in one loop iteration and the availability of that resource in the processor:

ResMII = max_{i ∈ FU types} ( (# FUs of type i required for one loop iteration) / (# FUs of type i available in the processor) )

However, for a realistic machine model with complex reservation tables⁵ and multiple alternatives⁶, it is usually impractical to compute the ResMII exactly [Rau94]. The TM-1000 processor falls into this category. For the sake of simplicity, we approximate the ResMII by the ratio of the number of operations in one loop iteration to the number of issue slots of the processor. Two facts support this simplification. First, the TM-1000 is a fully-pipelined processor and most operations can be issued every cycle. Second, we can benefit from the increased II value within the iterative modulo scheduling framework, which relaxes the resource constraints for the given II value. Consequently, the MII of the example in Figure 2 is calculated as:

MII = ⌈max(7/3, ((3 − 0) + 2)/2)⌉ = 3

3.3 Create the modulo scheduled loop kernel

The essence of modulo scheduling is to create a legal modulo scheduled loop kernel such that the resource and dependence constraints are fulfilled. The OMS algorithm is derived from the iterative framework of modulo scheduling and is described below. The candidate II is initially set equal to the value of MII.

1. Extended-list-schedule the loop body with regard to the NCIDEs.

2. Divide the schedule into software pipeline stages based on the value of II.

3. Overlap the software pipeline stages one by one into the modulo reservation table (MRT). If the resource constraints or some CIDEs are no longer fulfilled, then go to step (1) with an increased II.

3.3.1 Extended list scheduling for the loop body

Normally, list scheduling is a local scheduling algorithm which takes into account the resource conflicts and the intra-iteration dependences while scheduling the operations of the loop body. However, the NCIDEs of Definition 2 have no circuits between their sources and their targets.

⁵ Reservation tables record the times at which each resource is used by the operations.
⁶ A particular operation is executable on multiple functional units.


Thus, once the value of II has been decided, we can compute the minimum delay corresponding to an inter-iteration dependence indicated by an NCIDE e as (e.Latency − e.Distance × II), so that the inter-iteration dependences without circuits are also satisfied. This extended list scheduling approach is first employed to schedule the loop body, ignoring the CIDEs at this point. Let us consider our example in Figure 2. According to formula (1), the delay is (4 − 1 × 3) = 1 for the NCIDE from operation 1 to operation 7. Operation 1 is scheduled at cycle 0, so operation 7 could be scheduled at cycle 1. But operation 6 is scheduled at cycle 6 and the latency between operations 6 and 7 is 1. Therefore, the finally scheduled cycle for operation 7 is 7. Figure 3(a) gives the result of extended list scheduling for the example.
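The NCIDE constraint used by this extended list scheduling can be sketched as a small helper (illustrative names; II is the current candidate initiation interval):

def ncide_earliest_cycle(source_cycle, latency, distance, ii):
    """Earliest cycle at which the target of an NCIDE may be scheduled:
    the extra delay is (Latency - Distance * II), clamped at zero.
    E.g. in Figure 2, operation 1 at cycle 0 with latency 4 and distance 1
    constrains operation 7 to cycle >= 0 + (4 - 1*3) = 1 when II = 3."""
    return source_cycle + max(0, latency - distance * ii)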

[Figure 3: Create the modulo scheduled loop kernel. (a) Extended list scheduling and the stage division: the scheduled loop body (cycles 0-8 over FU1-FU3) is divided into three stages of II = 3 cycles each. (b) Overlap of stages 0 and 1 into the MRT. (c) The new loop kernel: an MRT of II = 3 cycles.]

3.3.2 The software pipeline stage division

After extended list scheduling has been performed, the scheduled loop body spans multiple IIs to complete one iteration. Each group of II successive instructions is termed a software pipelining stage. In our example, the loop body spans three IIs. Thus, we can divide it into three stages, each of length 3 (i.e. the II value), as shown in Figure 3(a).
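The stage division itself is just a chunking of the scheduled code by II cycles; a minimal sketch (the schedule is assumed to map each operation to its Cycle):

def divide_into_stages(schedule, ii):
    """Group the extended-list-scheduled operations into stages of II cycles.
    Stage k holds the operations scheduled in cycles [k*II, (k+1)*II).
    In the Figure 3(a) example (II = 3), this yields three stages."""
    stages = {}
    for op, cycle in schedule.items():
        stages.setdefault(cycle // ii, []).append(op)
    return [stages[k] for k in sorted(stages)]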

3.3.3 Overlap the software pipeline stages

Software pipelining exploits the parallelism present among the loop iterations by overlapping the execution of successive iterations. Groups of operations from different iterations are executed simultaneously; in other words, the different stages of successive iterations, rather than of one iteration, execute in parallel. This motivates us to overlap the stages one by one to obtain the modulo scheduled loop kernel, whose number of execution cycles is the value of II.

The modulo reservation table (MRT) is employed to track both dependence constraints and resource conflicts. In an MRT, each row represents a VLIW instruction and there are exactly II rows. Each column represents a functional unit available at every cycle. The operations in the slots of the MRT can be executed on the functional units corresponding to the columns at the given cycles. The procedure for creating the modulo scheduled loop kernel is given in pseudocode in Figure 4.

procedure CreateLoopKernel (II : integer) ;      /* II is initialized to MII */
var array[cycle, operation] : LoopCode, MRT, Stage ;
    int     : No = 1 ;
    boolean : Success = true ;
begin
  LoopCode := extended-list-scheduling(LoopCode) ;
  /* Overlap the stages into the MRT */
  Stage := LoopCode(1..II) ;
  MRT   := Stage ;                               /* MRT is initialized to the first stage */
  while ( No < round(NumOfCycles(LoopCode)/II) ) and Success do
    /* The Stage is not the last one; try to overlap the next stage into the MRT */
    {
      Stage := LoopCode(No*II+1 .. No*II+II) ;
      if NoResourceConflict when overlapping Stage into MRT then
        { Overlap Stage into MRT ; No := No+1 }
      else
        {
          Search the available slots for the conflicting operations ;
          if this search fails then { Success := false ; break ; } ;
          Delay the conflicting operations and all the successors of the delayed operations ;
          if some inter-iteration dependences are no longer satisfied due to the delay
            then { Success := false ; break ; } ;
        }
    }
  if not Success then CreateLoopKernel(II+1) ;
end

Figure 4: The procedure for creating the modulo scheduled loop kernel

The main idea behind this algorithm is to place the stages one by one into the MRT. The problem arising in this process is the conflict⁷ between resources used by operations from different stages. When an operation Op from the stage being placed causes a resource conflict, we attempt to search for an available slot in the MRT for it. This attempt has one of the following two outcomes.

• A slot is available in the MRT

In this case, Op is delayed into this slot with respect to the resource constraints. Meanwhile, all of Op's descendants in the dependence graph are correspondingly delayed with respect to the intra-iteration dependences and the inter-iteration dependences without circuits. On the other hand, if a delayed operation Op1 is the source of a CIDE e with target Op2, the related inter-iteration dependence would no longer be fulfilled because of the increased Op1.Cycle, i.e.

Op2.Cycle + e.Distance × II < Op1.Cycle + e.Latency

According to the definition of the CIDE e, there is a circuit between Op1 and Op2, that is, a dependence path exists from Op2 to Op1. It is therefore useless to increase the value of Op2.Cycle, since that would in turn further delay Op1.

⁷ We utilize SALTO to identify the resource conflicts, which are characterized via the resource reservation tables in the SALTO environment.

Consequently, we can profit from the iterative scheme and increase the value of II in order to respect the inter-iteration dependences with circuits.

• No slot is available in the MRT

As mentioned in Section 3.2.2, the resource constrained MII is only statically estimated; this value of II may not provide sufficient resources during modulo scheduling. When this occurs, we increase the II to relax the resource constraints.

Let us revisit the example in Figure 3(a). The MRT is initially set to stage 0 and then stage 1 is overlapped into the MRT. At this point, operation 4 has to be delayed to the next slot since it conflicts with operation 2 in the MRT. Moreover, operation 4 is the source of a CIDE, as shown in Figure 2. The inter-iteration dependence relation ((0 + 2 × 3) ≥ (3 + 2)) is still satisfied, so we can continue to overlap the next stage into the MRT. In a similar way, operation 6 is delayed. Note that all the descendants of operation 6 in the dependence graph must be correspondingly delayed; hence its successor, operation 7, has also been delayed. This procedure is repeated until no more stages are left to be overlapped. Finally, the modulo scheduled loop kernel of Figure 3(c) is obtained.

3.4 Software pipelining construction

Since the modulo scheduled kernel contains operations from successive iterations, the lifetime of a register can overlap with itself in this kernel. The registers therefore have to be re-allocated or renamed. We adopt the meeting graph approach [ELM95] to rename the registers and unroll the kernel several times when necessary. After the kernel has been determined, the appropriate prologue and epilogue code sequences are generated, depending on whether the loop has early exits [RST92]. Finally, this modulo scheduled loop code is output to SALTO, which is in charge of generating the actual executable VLIW instructions.

4 Global OMS for loops with conditional branches

4.1 The related work

So far, we have supposed that the loops contain no conditional branches. The difficulty encountered when software pipelining loops with conditional branches is that arbitrary control flow complicates the construction of an overlapping schedule, since there are multiple possible execution paths.

It is obviously rather difficult and impractical to find a pipeline schedule that can support all possible combinations of overlapping iteration paths. To address this issue, a number of techniques have been developed over the past ten years. The principal strategies proposed in these techniques can be classified into the following three categories.

One strategy is to convert loops with conditional branches into straight-line code by a process called if-conversion [AKPW83]. If-conversion converts the operations along the alternative paths of each branch into predicated operations (operations with a predicate source operand). An operation whose predicate is true is executed normally; conversely, an operation whose predicate is false is collapsed. After if-conversion has been performed, control dependences on conditional branches have been converted into data dependences, so that the branch constructs are eliminated. As a result, the converted loop body contains only a single iteration path, which is amenable to local software pipelining⁸. This method has been employed to handle loops with conditional branches in [DHB89] and [WHSB92]. [DHB89] relies on a special hardware feature, predicated execution [RYYT89][PS91], which supports the execution of predicated operations by providing boolean registers for their predicate operands. There is no such hardware support in [WHSB92]; hence, after local software pipelining has been applied to the if-converted loop body, Reverse if-conversion [WMHR93] has to convert the predicated operations back to an explicit conditional structure.

Another strategy, proposed in [Lam88], [SW91] and [WEJS94], is to regard the entire conditional construct as an integral part, i.e. a single node in the control flow graph. These nodes representing the conditional branches can then be scheduled like any other simple nodes and are even permitted in parallel with the other operations. This scheme provides the opportunity to extract the parallelism outside the conditional branches, but it ignores the parallelism between the operations along the branches.

The above two strategies handle the alternative paths caused by conditional branches in the same manner. To honor the different operation sets and execution frequencies of different paths, the other strategy is to develop variable IIs and schedules for the different execution paths. [WPP95] has suggested a nested MRT with a multiple-II kernel: a more deeply nested kernel contains a shorter and more frequently executed path. To support such an MRT, besides predicated execution, additional hardware for compound predicates is required to choose different loop back edges for the multiple kernels. Another approach, proposed in [SDWX87] and [SL96], is to first pipeline each iteration path in isolation and then insert the necessary transition code between the paths to guarantee the semantic correctness of the original program.

⁸ Local software pipelining deals with loops without conditional branches; correspondingly, global software pipelining deals with loops with conditional branches.


4.2 Motivation

For software pipelining loops containing conditional branches, the techniques without predicated execution usually lead to exponential code expansion, due to multiple copies of the operations along different control paths or due to complicated transition code with multiple loop back edges. Conversely, the techniques with predicated execution can effectively control code expansion, since they require neither code regeneration nor additional transition code. However, their drawback is a larger kernel, since it contains all operations along all execution paths. In reality, only the operations along the taken execution path are executed and the others are abandoned. This wastes some fraction of the operations in the VLIW instructions. Let us consider the motivating example in Figure 5. The simplified DDG of an if-converted loop body is shown in Figure 5(a). The latency between two operations is marked on the edge joining them. The operations (3, 4, 5, 6) along one branch path are predicated on P and the operations (7, 8, 9, 10) along the other branch path are predicated on P′. Figure 5(c) gives the modulo scheduled kernel obtained by the OMS algorithm, which contains all the operations of the loop body and whose II value is 4.

[Figure 5: Motivation example. (a) The DDG after if-conversion: operations 1-11 with edge latencies marked; operations 3-6 are predicated on P and operations 7-10 on P′. (b) Local scheduling on FU1-FU3. (c) The kernel after OMS: an MRT with II = 4 containing the operations of both paths.]

We tend to balance these two types of techniques (with and without predicated execution) in order to combine their individual merits and avoid their handicaps. The major motivation is to improve the resource usage of predicated execution under reasonable control of code size expansion when software pipelining loops with conditional branches. We thus develop an algorithm with the support of predicated execution, called GOMS (Global Overlapping Modulo Scheduling). This algorithm inherits from Enhanced Modulo Scheduling (EMS) [WHSB92] the idea of overlapping in the MRT only part of the operations from the two edges of a branch, whereas EMS overlaps these operations as much as possible.

The remaining operations along the branches are translated into predicated operations as usual, i.e. partial if-conversion. The following subsection describes this algorithm.

4.3 GOMS - Global Overlapping Modulo Scheduling

The GOMS algorithm is based on the OMS framework for pipelining loops with conditional branches. The main extensions are the following. First, the resource constrained MII (ResMII) is determined by accounting for the path with the greatest number of operations, rather than for all the operations; the recurrence constrained MII (RecMII) is calculated in the same way as in OMS. Hence, the minimum initiation interval (MII) of the example is 3 rather than 4. Secondly, the modulo reservation table (MRT) is established as follows.

1. Extended-list-schedule the loop body, as presented in Section 3.3.1.

2. Divide the stages based on the value of II.

3. Place two operations into the same slot. These two overlapped operations must satisfy the following criteria (a sketch of this overlap test is given after the list).

   • Their predicates are mutually exclusive. This implies that the two operations come from different edges of the branch and never execute at the same time.

   • They have been local-scheduled in the same stage, namely the stage that has the greatest number of operations along both edges of the branch. Suppose that such a stage is named S. We intend to overlap as many operations from S as possible, so as to improve the resource usage. If the overlapped operations were distributed over several stages rather than a single one, that would result in multiple loop back edges, thereby complicating the final code generation. Moreover, the code size would grow exponentially with the number of stages, because all possible combinations of the overlapped operations from different stages would have to be taken into account.

   • The types of the functional units they use are compatible. This permits the two operations to share the same slot in the MRT.

4. Extended-list-schedule the un-overlapped operations in S as well as below S. Repeat steps 3 and 4 until no operation in S can be overlapped with another.

5. Overlap the stages one by one into the MRT, as described in Section 3.3.3. Whenever overlapped operations are delayed to successive stages due to modulo resource conflicts, the two operations are separated into two available slots. If the resource constraints or some CIDEs are no longer fulfilled, then go to step (1) with an increased II.
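As a sketch of the overlap test used in step 3, the check below combines the three criteria. The operation fields (predicate, stage, fu_type), the mutually_exclusive map and the compatible_fus test are illustrative names of ours, not part of the actual implementation:

def can_share_slot(op_a, op_b, mutually_exclusive, compatible_fus):
    """Return True if op_a and op_b may occupy the same MRT slot (GOMS step 3)."""
    # 1. Predicates must be mutually exclusive (one operation per branch edge).
    if mutually_exclusive.get(op_a.predicate) != op_b.predicate:
        return False
    # 2. Both operations were local-scheduled in the same stage S,
    #    the stage with the most operations along both branch edges.
    if op_a.stage != op_b.stage:
        return False
    # 3. The functional-unit types must be compatible so that the two
    #    operations can share one slot of the MRT.
    return compatible_fus(op_a.fu_type, op_b.fu_type)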

Finally, the software pipelining construction step builds the kernels corresponding to the combinations of the overlapped operations, together with the loop back edges, before renaming the registers and generating the prologue and epilogue. Let us revisit the example in Figure 6. Operation 7 can be overlapped into the slot of operation 3, since the two operations satisfy the overlapping criteria. Operation 8 can also be overlapped with operation 5 for the same reason⁹. Operations 9 and 10 are delayed to their next slots due to the delay of operation 8, which is performed in step 4. In the end, two kernels, respectively predicated on P and P′, are constructed based on the MRT, whose II value is 3.

⁹ To demonstrate how the algorithm works, we assume that the functional unit used by operation 8 is not compatible with that used by operation 4; otherwise, operation 8 would be overlapped with operation 4.

[Figure 6: Global Overlapping Modulo Scheduling. (a) Extended list scheduling and the resulting MRT with II = 3, in which operations 3/7 and 5/8 share slots. (b) The new loop body with two kernels: Kernel 1 predicated on P and Kernel 2 predicated on P′.]

5 Conclusion

In this paper, we have presented a software pipelining algorithm, Overlapping Modulo Scheduling (OMS). Benefiting from the local scheduling technique, this algorithm can practically modulo-schedule loops at a low cost. It has been implemented in the SALTO [RBS96] environment. We have furthermore extended OMS to modulo-schedule loops with conditional branches, i.e. Global OMS (GOMS). The GOMS algorithm permits selected operations to share resources in order to improve resource usage while effectively controlling code expansion.

From the above presentation of these two algorithms, we can see that they are independent of the particular machine model.


Thanks to the SALTO environment, we do not need to be concerned with the details of the machines. Hence, the OMS algorithm is retargetable to several machine models, such as the TM-1000 (a VLIW processor), the Sun Sparc (a superscalar), etc. The GOMS algorithm is likewise retargetable to machines with predicated execution. Our future work is to also implement the GOMS algorithm in our compiler for the TM-1000 processor. In addition, we intend to further develop the GOMS algorithm to handle loops with multiple conditional branches, which is not yet completely solved in the present GOMS algorithm.

Acknowledgments

The authors would like to thank Erven Rohou for his help with the implementation of our algorithm in the SALTO environment. The authors would also like to acknowledge Michel Barreteau for his comments and Francois Thomasset for his support.

References

[ABB+97] B. Aarts, M. Barreteau, F. Bodin, P. Brinkhaus, Z. Chamski, H.-P. Charles, C. Eisenbeis, J. R. Gurd, J. Hoogerbrugge, P. Hu, W. Jalby, P. M. W. Knijnenburg, M. F. P. O'Boyle, E. Rohou, R. Sakellariou, H. Schepers, A. Seznec, E. A. Stohr, M. Verhoeven, and H. A. G. Wijshoff. OCEANS: Optimizing compilers for embedded applications. In EuroPar'97, Lecture Notes in Computer Science, Passau, Germany, August 1997.

[ACD74] T.L. Adam, K.M. Chandy, and J.R. Dickson. A comparison of list schedules for parallel processing systems. Communications of the ACM, 17(12):685-690, December 1974.

[AKPW83] J.R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Proceedings of the 10th ACM Symposium on Principles of Programming Languages, pages 177-189, January 1983.

[AN91] A. Aiken and A. Nicolau. A realistic resource-constrained software pipelining algorithm. In Advances in Languages and Compilers for Parallel Processing, pages 274-290, London, 1991. Pitman/MIT Press.

[Cas94] B. Case. Philips hopes to displace DSPs with VLIW. Microprocessor Report, pages 12-15, December 1994.

[DHB89] J.C. Dehnert, P.Y.T. Hsu, and J.P. Bratt. Overlapped loop support in the Cydra 5. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 26-38, April 1989.

[ELM95] C. Eisenbeis, S. Lelait, and B. Marmol. The meeting graph: a new model for loop cyclic register allocation. In Proceedings of PACT'95, June 1995.

[EW92] C. Eisenbeis and D. Windheiser. A new class of algorithms for software pipelining with resource constraints. Technical report, INRIA, 1992. Rapport de Recherche.

[Lam88] M.S. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, pages 318-328, June 1988.

[PS91] J.C.H. Park and M. Schlansker. On predicated execution. Technical Report HPL-91-58, Hewlett Packard Software Systems Laboratory, May 1991.

[Rau94] B.R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 63-74, November 1994.

[RBS96] Erven Rohou, Francois Bodin, and Andre Seznec. SALTO: System for Assembly Language Transformation and Optimization. In Sixth Workshop on Compilers for Parallel Computers, December 1996.

[RG81] B.R. Rau and C.D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the 14th Annual Workshop on Microprogramming and Microarchitecture, pages 183-198, October 1981.

[RGSL96] J. Ruttenberg, G.R. Gao, A. Stoutchinin, and W. Lichtenstein. Software pipelining showdown: Optimal vs. heuristic methods in a production compiler. In Proceedings of the SIGPLAN'96 Conference on Programming Language Design and Implementation, May 1996.

[RST92] B.R. Rau, M. Schlansker, and P. Tirumalai. Code generation schemas for modulo scheduled loops. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 158-169, December 1992.

[RYYT89] B.R. Rau, D.W.L. Yen, W. Yen, and R.A. Towle. The Cydra 5 departmental supercomputer. IEEE Computer, pages 12-35, January 1989.

[SDWX87] B. Su, S. Ding, J. Wang, and J. Xia. GURPR - A method for global software pipelining. In Proceedings of the 20th Annual Workshop on Microprogramming and Microarchitecture, pages 88-95, December 1987.

[SDX86] B. Su, S. Ding, and J. Xia. URPR - An extension of URCR for software pipelining. In Proceedings of the 19th Annual Workshop on Microprogramming and Microarchitecture, pages 104-108, December 1986.

[SL96] M.G. Stoodley and C.G. Lee. Software pipelining loops with conditional branches. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 262-273, December 1996.

[SW91] B. Su and J. Wang. GURPR*: A new global software pipelining algorithm. In Proceedings of the 24th Annual Workshop on Microprogramming and Microarchitecture, pages 212-216, November 1991.

[WEJS94] J. Wang, C. Eisenbeis, M. Jourdan, and B. Su. Decomposed Software Pipelining: A new perspective and a new approach. International Journal on Parallel Processing, 22(3):357-379, 1994. Special issue on Compilers and Architectures for Instruction Level Parallel Processing.

[WHSB92] N.J. Warter, G.E. Haab, K. Subramanian, and J.W. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170-179, December 1992.

[WMHR93] N.J. Warter, S.A. Mahlke, W.W. Hwu, and B.R. Rau. Reverse if-conversion. In Proceedings of the SIGPLAN 1993 Conference on Programming Language Design and Implementation, pages 290-299, June 1993.

[WPP95] N.J. Warter-Perez and N. Partamian. Modulo scheduling with multiple initiation intervals. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 111-117, December 1995.

Appendix: An example

An example is given here to illustrate how the algorithm OMS works. First, let's consider a loop in Fortran

      x = 0.0
      DO i = 1,100
        IF (x.EQ.0) THEN
          A(I) = x
          x = x - .5
        ELSE
          B(I) = x - 2
          C(I) = 3
          x = x * 2
        ENDIF
      END DO

and the assembly code of the loop body generated by MT1 (the compiler for the TM-1000 developed at Leiden University):

_CPT._DT_1:
  uimm    IF r1 (_cpt.x) -> r131;
  ld32d   IF r130 (0) r131 -> r132;
  feql    IF r130 r132 r0 -> r133;
  uimm    IF r1 (_cpt.a) -> r134;
  isub    IF r130 r37 r1 -> r135;
  asli    IF r130 (2) r135 -> r136;
  iadd    IF r130 r134 r136 -> r137;
  h_st32d IF r133 (0) r132 r137;
  iimm    IF r1 (0s5.000000e-01) -> r138;
  fsub    IF r1 r132 r138 -> r139;
  h_st32d IF r133 (0) r139 r131;
  fneq    IF r1 r0 r132 -> r140;
  iimm    IF r1 (0s2.000000e+00) -> r141;
  fsub    IF r1 r132 r141 -> r142;
  uimm    IF r1 (_cpt.b) -> r143;
  iadd    IF r1 r143 r136 -> r144;
  h_st32d IF r140 (0) r142 r144;
  iimm    IF r1 (0s3.000000e+00) -> r145;
  uimm    IF r1 (_cpt.c) -> r146;
  iadd    IF r1 r146 r136 -> r147;
  h_st32d IF r140 (0) r145 r147;
  fmul    IF r1 r132 r141 -> r148;
  h_st32d IF r140 (0) r148 r131;
  iadd    IF r1 r37 r1 -> r37;
  isub    IF r1 r38 r1 -> r38;

Note that there is no longer any conditional branch: it has been converted into predicated operations. OMS accepts this loop body code as input and then generates the software pipelining kernel as follows:

The actual initiation interval: II = 6

--------- PROLOGUE -----------
isub    IF r130 r37 r1 -> r135            (* Cycle 0  Iteration 0 *)
uimm    IF r1 (_cpt.x) -> r131            (* Cycle 0  Iteration 0 *)
iimm    IF r1 (0s5.000000e-01) -> r138    (* Cycle 0  Iteration 0 *)
iimm    IF r1 (0s2.000000e+00) -> r141    (* Cycle 0  Iteration 0 *)
uimm    IF r1 (_cpt.a) -> r134            (* Cycle 0  Iteration 0 *)
iimm    IF r1 (0s3.000000e+00) -> r145    (* Cycle 1  Iteration 0 *)
ld32d   IF r130 (0) r131 -> r132          (* Cycle 1  Iteration 0 *)
uimm    IF r1 (_cpt.b) -> r143            (* Cycle 1  Iteration 0 *)
asli    IF r130 (2) r135 -> r136          (* Cycle 1  Iteration 0 *)
iadd    IF r130 r134 r136 -> r137         (* Cycle 2  Iteration 0 *)
iadd    IF r1 r143 r136 -> r144           (* Cycle 2  Iteration 0 *)
uimm    IF r1 (_cpt.c) -> r146            (* Cycle 2  Iteration 0 *)
iadd    IF r1 r146 r136 -> r147           (* Cycle 3  Iteration 0 *)
iadd    IF r1 r37 r1 -> r37               (* Cycle 3  Iteration 0 *)
isub    IF r1 r38 r1 -> r38               (* Cycle 3  Iteration 0 *)
fsub    IF r1 r132 r138 -> r139           (* Cycle 4  Iteration 0 *)
feql    IF r130 r132 r0 -> r133           (* Cycle 4  Iteration 0 *)
fsub    IF r1 r132 r141 -> r142           (* Cycle 4  Iteration 0 *)
fmul    IF r1 r132 r141 -> r148           (* Cycle 4  Iteration 0 *)
fneq    IF r1 r0 r132 -> r140             (* Cycle 5  Iteration 0 *)
h_st32d IF r133 (0) r132 r137             (* Cycle 5  Iteration 0 *)

--------- STEADY STATE -----------
isub    IF r130 r37 r1 -> r135            (* Cycle 0  Iteration 1 *)
uimm    IF r1 (_cpt.x) -> r131            (* Cycle 0  Iteration 1 *)
iimm    IF r1 (0s5.000000e-01) -> r138    (* Cycle 0  Iteration 1 *)
iimm    IF r1 (0s2.000000e+00) -> r141    (* Cycle 0  Iteration 1 *)
uimm    IF r1 (_cpt.a) -> r134            (* Cycle 0  Iteration 1 *)
h_st32d IF r133 (0) r139 r131             (* Cycle 1  Iteration 0 *)
iimm    IF r1 (0s3.000000e+00) -> r145    (* Cycle 1  Iteration 1 *)
ld32d   IF r130 (0) r131 -> r132          (* Cycle 1  Iteration 1 *)
uimm    IF r1 (_cpt.b) -> r143            (* Cycle 1  Iteration 1 *)
asli    IF r130 (2) r135 -> r136          (* Cycle 1  Iteration 1 *)
iadd    IF r130 r134 r136 -> r137         (* Cycle 2  Iteration 1 *)
iadd    IF r1 r143 r136 -> r144           (* Cycle 2  Iteration 1 *)
h_st32d IF r140 (0) r142 r144             (* Cycle 2  Iteration 0 *)
uimm    IF r1 (_cpt.c) -> r146            (* Cycle 2  Iteration 1 *)
h_st32d IF r140 (0) r145 r147             (* Cycle 2  Iteration 0 *)
iadd    IF r1 r146 r136 -> r147           (* Cycle 3  Iteration 1 *)
h_st32d IF r140 (0) r148 r131             (* Cycle 3  Iteration 0 *)
iadd    IF r1 r37 r1 -> r37               (* Cycle 3  Iteration 1 *)
isub    IF r1 r38 r1 -> r38               (* Cycle 3  Iteration 1 *)
fsub    IF r1 r132 r138 -> r139           (* Cycle 4  Iteration 1 *)
feql    IF r130 r132 r0 -> r133           (* Cycle 4  Iteration 1 *)
fsub    IF r1 r132 r141 -> r142           (* Cycle 4  Iteration 1 *)
fmul    IF r1 r132 r141 -> r148           (* Cycle 4  Iteration 1 *)
fneq    IF r1 r0 r132 -> r140             (* Cycle 5  Iteration 1 *)
h_st32d IF r133 (0) r132 r137             (* Cycle 5  Iteration 1 *)

--------- EPILOGUE -----------
h_st32d IF r133 (0) r139 r131             (* Cycle 1  Iteration 1 *)
h_st32d IF r140 (0) r142 r144             (* Cycle 2  Iteration 1 *)
h_st32d IF r140 (0) r145 r147             (* Cycle 2  Iteration 1 *)
h_st32d IF r140 (0) r148 r131             (* Cycle 3  Iteration 1 *)

This kernel consists of three software pipelining phases: prologue, steady state and epilogue. Register renaming has not yet been carried out on this result. For more details about the implementation, please mail to: [email protected] or [email protected]
