Insertion Scheduling: An Alternative to List Scheduling for Modulo Schedulers

Benoît Dupont de Dinechin
[email protected]
CEA Limeil-Valenton center, department of Applied Mathematics, 94195 Villeneuve-St-Georges cedex, France
Abstract. The list scheduling algorithm is a popular scheduling engine used in most, if not all, industrial instruction schedulers. However, this technique has several drawbacks, especially in the context of modulo scheduling. One such problem is the need to restart scheduling from scratch whenever scheduling fails at the current value of the initiation interval. Another problem of list scheduling is that the order in which the instructions are selected for scheduling is constrained to be a topological sort of the scheduling graph, minus the loop-carried dependencies. We present a new instruction scheduling technique, suitable for block scheduling and modulo scheduling, which addresses these restrictions while allowing efficient implementations. The technique is fully implemented, as part of a software pipeliner we developed for an experimental version of the Cray T3D™ cft77 Fortran compiler.
Introduction

Instruction scheduling problems are a subcase of deterministic scheduling problems. That is, given a set of tasks, whose resource requirements are represented by reservation tables, and a scheduling graph, whose arcs materialize the precedence constraints between the tasks, build a schedule which satisfies the precedence and the resource constraints, while simultaneously minimizing a cost criterion. In the case of instruction scheduling, the tasks correspond to the instructions. The scheduling graph includes the control dependencies, along with the data dependencies which arise from the reads and the updates of the memory cells and of the processor registers. The cost criterion is the schedule length.

The main heuristic available to schedule acyclic scheduling graphs is the list scheduling algorithm [1], which is used under some form in most if not all instruction schedulers. List scheduling however has several drawbacks in the setting of instruction scheduling, which can be summarized as:

– List scheduling does not work on cyclic scheduling graphs unless it is significantly extended. The required extensions, sketched in [11], are cumbersome enough to motivate extensive research, especially in the area of modulo scheduling [13], where cyclic scheduling graphs arise frequently because of the loop-carried dependencies.
– Even after it is extended¹, list scheduling requires that the instructions are issued in a topological sort order of the scheduling graph, minus the loop-carried dependencies. This is necessary to prevent deadlock, because once an instruction is issued, it is never moved at a later step. Being constrained by a topological sort order prevents useful scheduling orders from being used, such as lowest slack first, which performs best in the case of recurrent loops.

– List scheduling is greedy, for the ready tasks are scheduled as early as possible. As a result, the use of a value returned from memory is often separated from the corresponding LOAD by the minimum memory latency. This feature makes the performance of the resulting schedule very sensitive to unpredictable delays, such as cache misses. In addition, it is difficult to direct the scheduling process in order to optimize a second criterion beyond schedule length, such as cumulative register lifetimes, or maximum register pressure.

Our insertion scheduling technique allows the instructions to be issued in any convenient order, and at any date which is within their current margins. Unlike the techniques by Huff [10] or by Rau [13], this capability is achieved without resorting to backtracking. The other advantage of our technique, compared to the various extensions of list scheduling, is that we do not need to restart the modulo scheduling process from scratch whenever scheduling fails at a given value of the initiation interval. In this respect our technique is "faster" than list scheduling in the setting of modulo scheduling. All that is currently required for the technique to apply is the satisfaction of the following conditions:

– The delays associated with each precedence constraint are non-negative.
– The scheduling graph without its loop-carried dependencies is cycle-free.
– The resource constraints of each instruction can be represented as regular reservation tables, or equivalently as reservation vectors [5].

The restrictions on the scheduling graph are not an issue, since they must already hold for list scheduling to succeed. Our restriction on the resource constraints is assumed in [7], and is implicitly carried by the gcc processor description [14]. We use regular reservation tables for approximating the resource requirements of each task because it makes the correctness proofs of section 2.2 simpler. A regular reservation table is a reservation table where the ones in each row start in the leftmost column, and are all adjacent. Of course these reservation tables may also have rows filled with zeroes. The regular reservation tables are in turn compactly represented as reservation vectors. In figure 1, we illustrate the relationships between a reservation table, a regular reservation table, and a reservation vector, for the conditional stores of the DEC Alpha 21064.

The main restriction of regular reservation tables is not that the ones in every row must be left-justified, for collisions between two reservation tables do not change if a given row is justified the same amount in both tables. Rather, it has to do with the fact that the ones must be adjacent in each row.
¹ Plain list schedulers maintain the left margins, because the "ready set" is recomputed every time an instruction is issued. Extended list schedulers maintain the left margins and the right margins, in order to handle cycles in the scheduling graph.
[Figure: a reservation table, a regular reservation table, and a reservation vector for the 21064 conditional stores, over the resource rows bus1, bus2, abox, cond, bbox, ebox, imul, iwrt, fbox, fdiv, fwrt; the reservation vector is (0, 1, 1, 3, 0, 0, 0, 0, 0, 0, 0)ᵀ.]

Fig. 1. A reservation table, a regular reservation table, and a reservation vector.
However these restrictions do not appear to be a problem in practice. For instance, the only instructions of the 21064 where regular reservation tables are slightly inaccurate are the integer multiplications and the floating-point divisions.

The paper is organized as follows: section 1 provides background about block scheduling and modulo scheduling. Section 2 demonstrates insertion scheduling on an example, then exposes the theoretical results upon which the technique is based. Section 3 reports the results we currently achieve on the Livermore loops, for less and less constrained scheduling graphs.
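To make the representation concrete, the sketch below checks a table for regularity and derives its reservation vector. It is only an illustration: the encoding, names, and use of Python are assumptions of this example, not prescriptions of the paper; the sample rows follow figure 1.

```python
# Illustrative sketch (assumed encoding): a reservation table is a list
# of rows, one per resource, each row a list of 0/1 cycle entries.

def is_regular(table):
    """True if, in every row, the ones start in the leftmost column and
    are all adjacent; rows filled with zeroes are allowed."""
    for row in table:
        ones = [c for c, used in enumerate(row) if used]
        if ones and ones != list(range(len(ones))):
            return False
    return True

def reservation_vector(table):
    """Entry l is the number of ones in row l of a regular table."""
    assert is_regular(table)
    return [sum(row) for row in table]

# Rows bus1, bus2, abox, cond of the 21064 conditional store (figure 1);
# the remaining rows of figure 1 are all zeroes and are omitted here.
cond_store = [[0, 0, 0],            # bus1
              [1, 0, 0],            # bus2
              [1, 0, 0],            # abox
              [1, 1, 1]]            # cond
print(reservation_vector(cond_store))   # [0, 1, 1, 3]
```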
1 Instruction Scheduling

1.1 Block Scheduling
We start by introducing some terminology. Formally, a non-preemptive deterministic scheduling problem is defined by (S, C, r⃗, F) where:

– S ≝ {σi}0≤i≤N is a set of N + 1 tasks, including a dummy "start" task σ0 which is always scheduled at date zero;
– C ≝ {tjk − tik ≥ αk}1≤k≤M, called the precedence constraints, is a set of M inequalities involving the start times {ti}0≤i≤N of the tasks;
– r⃗ ≝ (r1, r2, ..., rp)ᵀ is a vector describing the total availabilities of the renewable resources;
– F ≝ {f⃗1, f⃗2, ..., f⃗N} are N reservation functions such that ∀t < 0 : f⃗i(t) = 0⃗ and ∀t ≥ 0 : f⃗i(t) ≥ 0⃗, describing for each task its use of the renewable resources.

Each reservation function f⃗i takes as input the time elapsed from the start of task σi, and returns a vector describing its use of resources at that time. Any solution of the scheduling problem satisfies the precedence constraints, and the resource constraints:

∀t : ∑_{i=1}^{N} f⃗i(t − ti) ≤ r⃗
We call central problem a scheduling problem with no resource constraints. A central problem associated to a given deterministic scheduling problem contains all the precedence constraints of that problem, and possibly some other constraints of the form {ti = Si}. A partial central schedule of a central problem P is a map from n tasks to schedule dates {σij ↦ Sij}1≤j≤n, with 1 ≤ n ≤ N, such that P ∧ {ti1 = Si1} ∧ ... ∧ {tin = Sin} is not empty. We shall denote {Sij}1≤j≤n such a map. A partial schedule, also denoted {Sij}1≤j≤n, is a partial central schedule of the central problem associated with the deterministic scheduling problem, such that the resource constraints are also satisfied for {σij}1≤j≤n.

The margins are another useful notion we shall borrow from operations research. Given a feasible central problem P, the left margin t−i of a task σi is the smallest positive value of τ such that the central problem P ∧ {ti ≤ τ} is feasible. Likewise, assuming that the schedule length is constrained by some kind of upper bound L (which may be equal to +∞), the right margin t+i of task σi is the largest positive value of τ such that the central problem P ∧ {ti ≥ τ} ∧_{1≤j≤N} {tj ≤ L} is feasible. Intuitively, margins indicate that there is no central schedule {Si}1≤i≤N of length below L such that Si < t−i, or Si > t+i. Following Huff [10], we also define the slack of a task σi as t+i − t−i.

An alternate representation of the precedence constraints C of a deterministic scheduling problem is a valued directed graph G ≝ [S, E] : tjk − tik ≥ αk ∈ C ⇔ (σik, σjk, αk) ∈ E, called the scheduling graph. Since equality constraints of the form {ti = Si} can be represented by the pair of arcs ((σ0, σi, Si), (σi, σ0, −Si)), a central problem is equivalent for all practical purposes to a scheduling graph. Margins are easily computed from the scheduling graph by applying a variation of the Bellman shortest path algorithm. This algorithm has an O(MN) running time, where N is the number of nodes and M the number of arcs in the graph.

Block scheduling, also called local code compaction, involves restricted versions of the non-preemptive deterministic scheduling problems, where each instruction is associated to a task. The restrictions are:

– The time is measured in processor cycles and takes integral values.
– The values αk are non-negative integers.
– The total resource availabilities vector r⃗ is 1⃗, so it is not explicated.

A popular representation of the reservation functions f⃗i are the so-called reservation tables [f⃗i(j)], where f⃗i(j) is the boolean vector describing the use of the resources j cycles after the instruction σi has been issued.

However, block scheduling involves more than taking advantage of these restrictions. In particular, on VLIW or superscalar processors outfitted with several ALUs, floating-point operators, and memory ports, an instruction can be associated to one of several reservation tables upon issue. In the following, a task shall refer to an instruction which has been assigned a reservation table. Issuing an instruction means assigning a reservation table to it, and scheduling the corresponding task. Scheduling a task in turn means including it in a partial schedule. A valid instruction schedule, denoted {Si}1≤i≤N, is a partial schedule such that every instruction is associated to a task.
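As an illustration of the margin computation, here is a minimal Bellman-style relaxation sketch, assuming a simple arc-list encoding of the scheduling graph (the encoding and names are illustrative, not the paper's):

```python
# Sketch (assumed encoding): arcs is a list of triples (i, j, alpha)
# encoding the precedence constraint t_j - t_i >= alpha; task 0 is the
# dummy start task sigma_0 of the central problem.

def left_margins(n, arcs):
    """Earliest feasible dates: longest paths from sigma_0, computed by
    Bellman-style relaxation in O(M*N) time."""
    t = [0] * n
    for _ in range(n):
        changed = False
        for i, j, alpha in arcs:
            if t[i] + alpha > t[j]:
                t[j] = t[i] + alpha
                changed = True
        if not changed:
            return t
    raise ValueError("positive-length cycle: central problem infeasible")

def right_margins(n, arcs, L):
    """Latest feasible dates under the schedule length bound L, obtained
    by relaxing the same constraints downward from L."""
    t = [L] * n
    t[0] = 0                  # sigma_0 is always scheduled at date zero
    for _ in range(n):
        changed = False
        for i, j, alpha in arcs:
            if t[j] - alpha < t[i]:
                t[i] = t[j] - alpha
                changed = True
        if not changed:
            return t
    raise ValueError("positive-length cycle: central problem infeasible")
```

The slack of a task is then the difference between its right and left margins.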
do i = 1, n
  sam = sam + x(i)*y
end do

σ0 ≡ enterblk $LL00008
σ1 ≡ ldt f(5), i(32)
σ2 ≡ mult/d F(3), f(5), f(6)
σ3 ≡ addt/d F(4), f(6), F(4)
σ4 ≡ addq i(32), +8, i(32)
σ5 ≡ subq i(33), +1, i(33)
σ6 ≡ ble i(33), $LL00009
σ7 ≡ br izero, $LL00008
Fig. 2. The sample source code and its translation.
1.2 Modulo Scheduling
Modulo scheduling is an advanced cyclic scheduling technique formulated for the purpose of constructing software pipelines [13]. The fundamental idea of modulo scheduling is that the local schedule² [12] of the software pipeline can be created by solving a simple extension of the block scheduling problem, called the modulo scheduling problem.

More precisely, let us denote as T the software pipeline initiation interval. This value, unknown when scheduling starts, represents the number of machine cycles which separate the initiation of two successive loop iterations. Obviously, the lower T, the better the schedule. The scheduling graph of the modulo scheduling problem is derived from the scheduling graph of the corresponding block scheduling problem, by including the loop-carried dependencies. Such extra arcs take the form (σik, σjk, αk − βkT), where αk ≥ 0, and where βk > 0, denoted Ω in [12], is the collision distance of the loop-carried dependency. Likewise, the reservation function f⃗i(t) of each task σi is replaced by the corresponding modulo reservation function ∑_{k=0}^{+∞} f⃗i(t − ti − kT), so the resource constraints now become the modulo resource constraints:

∀t : ∑_{i=1}^{N} ∑_{k=0}^{+∞} f⃗i(t − ti − kT) ≤ r⃗
It is apparent that modulo scheduling is more difficult to implement in practice than block scheduling, for the precedence constraints, as well as the resource constraints, now involve the unknown parameter T. Moreover:

– Because of the loop-carried dependencies, the scheduling graph may contain cycles, which prevent plain list scheduling from being used.
– The modulo reservation functions have an infinite extent, so scheduling may fail for a given value of T even if there are no cycles in the scheduling graph.

Throughout the paper, we shall illustrate our techniques by applying them to the code displayed in figure 2. On the left part, we have the source program, while the translation in pseudo DEC Alpha assembly code by the Cray cft77 MPP compiler appears on the right. The scheduling graph in the case of modulo scheduling is displayed in figure 3. This scheduling graph contains many fake dependencies, related to the lack of accurate information at the back-end level. For instance, arcs (σ4, σ6, 0) and (σ6, σ4, −T) are def-use and use-def dependencies on the register i(32), which would be removed if the back-end could tell that i(32) is dead upon loop exit. We keep these fake dependencies here because they offer the opportunity to expose interesting aspects of our scheduling technique.
² The schedule of any particular loop body execution, required to build the pipeline.
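Because the sum over k wraps every reservation around modulo T, the modulo resource constraints at a fixed T can be checked against a global table indexed by the date modulo T — the modulo issue table used in section 2.1. A minimal sketch, with an assumed encoding:

```python
# Sketch (assumed encoding): mit[r][c] is 1 when resource r is busy at
# cycle c modulo T; 'table' is the candidate task's reservation table
# and 't' its tentative issue date.  Assumes each table row reserves a
# resource for at most T cycles.

def try_issue(mit, table, t, T):
    """OR the reservation table into the modulo issue table at date t;
    return False (and change nothing) on a modulo resource conflict."""
    cells = [(r, (t + c) % T)
             for r, row in enumerate(table)
             for c, used in enumerate(row) if used]
    if any(mit[r][c] for r, c in cells):
        return False        # conflict with an already issued instruction
    for r, c in cells:
        mit[r][c] = 1       # commit the reservations, modulo T
    return True
```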
Source        Sink          Value  Type     Register
σ1 ≡ ldt      σ2 ≡ mult/d   3      def use  f(5)
σ1 ≡ ldt      σ4 ≡ addq     0      use def  i(32)
σ1 ≡ ldt      σ7 ≡ br       0      use def  pc
σ2 ≡ mult/d   σ1 ≡ ldt      −T     use def  f(5)
σ2 ≡ mult/d   σ3 ≡ addt/d   6      def use  f(6)
σ2 ≡ mult/d   σ7 ≡ br       0      use def  pc
σ3 ≡ addt/d   σ2 ≡ mult/d   −T     use def  f(6)
σ3 ≡ addt/d   σ3 ≡ addt/d   6−T    def use  F(4)
σ3 ≡ addt/d   σ6 ≡ ble      0      def use  F(4)
σ3 ≡ addt/d   σ7 ≡ br       0      use def  pc
σ4 ≡ addq     σ1 ≡ ldt      2−T    def use  i(32)
σ4 ≡ addq     σ4 ≡ addq     1−T    def use  i(32)
σ4 ≡ addq     σ6 ≡ ble      0      def use  i(32)
σ4 ≡ addq     σ7 ≡ br       0      use def  pc
σ5 ≡ subq     σ5 ≡ subq     1−T    def use  i(33)
σ5 ≡ subq     σ6 ≡ ble      1      def use  i(33)
σ5 ≡ subq     σ7 ≡ br       0      use def  pc
σ6 ≡ ble      σ3 ≡ addt/d   −T     use def  F(4)
σ6 ≡ ble      σ4 ≡ addq     −T     use def  i(32)
σ6 ≡ ble      σ5 ≡ subq     −T     use def  i(33)
σ6 ≡ ble      σ7 ≡ br       0      def use  pc
σ7 ≡ br       σ6 ≡ ble      −T     def use  pc
Fig. 3. The arcs of the scheduling graph for the sample loop.
Under the traditional approach to modulo scheduling, scheduling is not performed parametrically with T. Rather, lower bounds on the admissible T are computed, and their maximum Tglb is used as the first value of T. The lower bounds usually considered are the bound set by resource usage, denoted here Tresource, and the bound Trecurrence set on the initiation interval by the recurrence cycles. Then, the construction of a valid local schedule is attempted for increasing values of T, until success is achieved. Failure happens whenever there is no date, within the margins of the instruction currently selected for issuing, such that scheduling a corresponding task at that date would not trigger resource conflicts with the already issued instructions.
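For contrast with insertion scheduling, the restart-based process just described has roughly the following shape. This is only a sketch: schedule_at stands for whatever extended list scheduler is plugged in, and the encodings are assumed:

```python
# Sketch of the traditional driver that insertion scheduling avoids:
# every failure at a given T discards all scheduling work done so far.

def t_resource(res_vectors):
    """Lower bound set by resource usage, with availabilities 1 per
    resource: the busiest resource dictates the minimum T."""
    usage = [sum(col) for col in zip(*res_vectors)]
    return max(usage, default=1)

def traditional_driver(schedule_at, graph, res_vectors, t_recurrence, t_max):
    T = max(t_resource(res_vectors), t_recurrence)   # T_glb
    while T <= t_max:
        schedule = schedule_at(graph, T)   # attempt a full local schedule
        if schedule is not None:
            return T, schedule
        T += 1                             # failure: restart from scratch
    return None
```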
2 Insertion Scheduling
In the following, we assume that {Sij}1≤j≤n and {S′ij}1≤j≤n are two maps from the tasks {σij}1≤j≤n to schedule dates, and define {φij, τij, φ′ij, τ′ij, δij}1≤j≤n as:

∀j ∈ [1, n] :
  Sij = φijT + τij ∧ 0 ≤ τij < T
  S′ij = φ′ijT′ + τ′ij ∧ 0 ≤ τ′ij < T′
  δij = τ′ij − τij

2.1 An Intuitive Presentation
Let us denote the frame and the offset of an instruction σij, scheduled at Sij in a partial (local) schedule {Sij}1≤j≤n, the numbers φij and τij of the decomposition Sij = φijT + τij given above. The issued instructions are ordered by the relation ≺, defined as:

σij ≺ σik ⟺ τij < τik ∨ (τij = τik ∧ φij > φik) ∨ (τij = τik ∧ φij = φik ∧ σij ; σik)

In the above formula, σij ; σik denotes the fact that σij precedes σik in the transitive closure of the loop-independent precedence constraints of the scheduling graph. This relation is safely approximated by taking the lexical order of the instructions in the program text. Whenever issuing the current instruction σin at a date within its margins would trigger a resource conflict, the already issued instructions are split by ≺ into the set I− of those which precede σin, and the set I+ of those which follow it. The initiation interval is then increased to T′ = T + ∆− + ∆+, where ∆− and ∆+ are computed as described in section 2.3, the offsets of the instructions of I− are left unchanged (δ = 0), σin is issued with the extra delay ∆−, and the offsets of the instructions of I+ are shifted by δ = ∆− + ∆+.

Application of the insertion scheduling process to our example is best illustrated by displaying the modulo issue table after each issuing step. The issue table is the global reservation table where the individual reservation tables of the already issued instructions are ORed in, at the corresponding issue dates. The modulo issue table displays the issue table modulo the current initiation interval T, and is the only representation needed in modulo scheduling to manage the resource constraints.

On our example, the initial value of T is Trecurrence = 6, because of the critical cycle ((σ2, σ3), (σ3, σ2)). Instruction σ1 is selected first for issuing, and is issued at date S1 = 0. This results in the modulo issue table displayed top, far left, in figure 4. Likewise, σ2 is issued without resource conflicts at S2 = 3 (top, center left in figure 4).
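The order ≺ translates directly into a three-level comparison. In the following sketch (an illustration with assumed encodings), the lexical position of an instruction in the program text stands in for the relation ;:

```python
# Sketch: each issued instruction is summarized by (tau, phi, lexpos),
# where lexpos is its lexical position, approximating the relation ';'.

def precedes(a, b):
    """The total order: smaller offset first, then *larger* frame,
    then lexical program order for equal (tau, phi)."""
    tau_a, phi_a, lex_a = a
    tau_b, phi_b, lex_b = b
    if tau_a != tau_b:
        return tau_a < tau_b
    if phi_a != phi_b:
        return phi_a > phi_b      # note the reversal on frames
    return lex_a < lex_b

# From the example below: sigma_3 with (tau, phi) = (3, 1) precedes
# sigma_2 with (tau, phi) = (3, 0), since tau_3 = tau_2 and phi_3 > phi_2.
assert precedes((3, 1, 3), (3, 0, 2))
```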
[Figure: six successive modulo issue tables over the resources bus1, bus2, abox, cond, bbox, ebox, imul, iwrt, fbox, fdiv, fwrt, showing the placement of σ1 to σ7 as the initiation interval grows from T = 6 to T = 8.]

Fig. 4. Construction of the modulo issue table.
Then σ3 is selected for issuing. The only date currently within its margins is S3 = 9, which yields φ3 = 1, τ3 = 3. However, scheduling σ3 at S3 = 9 would result in a resource conflict with σ2, since τ2 = 3 and because both instructions happen to use the same resources. Here σ3 ≺ σ2, for τ3 = τ2 ∧ φ3 > φ2, hence I− = {σ1}, and I+ = {σ2}. The values are ∆− = 0, ∆+ = 1, hence T′ = T + ∆− + ∆+ = 7, δ1 = δ3 = 0, and δ2 = 1. This yields the modulo issue table displayed top, center right in figure 4. After that, σ6 is selected and issued at S6 = 12 ⇒ φ6 = 1 ∧ τ6 = 5, without resource conflicts (top, far right in figure 4).

Then σ4 is selected for issuing at S4 = 5 ⇒ φ4 = 0 ∧ τ4 = 5. Here we have a perfect illustration that we are not constrained to issue the instructions in a topological sort order of the scheduling graph, for the latter includes the arc (σ4, σ6, 0) (figure 3), while σ6 is already issued. Returning to σ4, it conflicts with σ6. The condition τ4 = τ6 ∧ φ4 < φ6 implies that σ6 ∈ I− = {σ1, σ2, σ3, σ6}, while I+ = ∅. We have ∆− = 1, ∆+ = 0, and this yields T′ = 8, δ1 = δ2 = δ3 = δ6 = 0, δ4 = 1. The resulting modulo issue table is displayed bottom, left in figure 4. After that, σ7 and σ5 are issued without resource conflicts, to yield respectively the bottom, center and bottom, right modulo issue tables in figure 4.

The resulting software pipeline appears in figure 5. This is a while-type software pipeline [2] (no epilog), which speculatively executes σ′1, σ′2, σ′4 of the next iteration. Although a FOR-type software pipeline does not require speculative execution support, it requires knowing which register variable is the loop counter, information that is not easily available in a back-end. Two iterations are overlapped, with a local schedule length L of 15 cycles, and an initiation interval T of 8 cycles.
$PROLOG
  [0]  σ1        ldt f(5), i(32)
  [1]
  [2]
  [3]
  [4]  σ2        mult/d F(3), f(5), f(6)
  [5]
  [6]  σ4        addq i(32), +8, i(32)
$LL00008
  [7]
  [8]  σ′1       ldt f(5), i(32)
  [9]
  [10]
  [11] σ3        addt/d F(4), f(6), F(4)
  [12] σ′2, σ5   mult/d F(3), f(5), f(6)    subq i(33), +1, i(33)
  [13] σ6        ble i(33), $LL00009
  [14] σ′4, σ7   addq i(32), +8, i(32)      br izero, $LL00008
$LL00009
Fig. 5. The resulting while software pipeline.
A block scheduler would schedule the loop body in 12 cycles, so pipelining significantly improves the performance of our sample loop, even though we did not remove the fake dependencies from the scheduling graph.

2.2 The Main Results
The following result states precisely the conditions that must be met by the δij in order to preserve the precedence constraints of the scheduling graph.

Theorem 1. Let {Sij}1≤j≤n be a partial central schedule of a central problem P at initiation interval T. Let {S′ij}1≤j≤n be n integers such that:

∀j, k ∈ [1, n] :
  φij = φ′ij
  0 ≤ δij ≤ ∆
  τij < τik ⟹ δij ≤ δik
  τij = τik ∧ φij > φik ⟹ δij ≤ δik
  τij = τik ∧ φij = φik ∧ σij ; σik ⟹ δij ≤ δik

Then {S′ij}1≤j≤n is a partial central schedule of P at initiation interval T′ ≝ T + ∆.
Proof: Let (σi, σj, αk − βkT) be a precedence constraint of P. From the definition of a precedence constraint, Sj − Si ≥ αk − βkT ⇔ φjT + τj − φiT − τi ≥ αk − βkT. Given the hypothesis, our aim is to show that φjT′ + τ′j − φiT′ − τ′i ≥ αk − βkT′. Dividing the former inequality by T and taking the floor yields φj − φi + ⌊(τj − τi)/T⌋ ≥ −βk, since all αk values are non-negative. We have 0 ≤ τi < T and 0 ≤ τj < T, hence 0 ≤ |τj − τi| < T and the value of ⌊(τj − τi)/T⌋ is −1 or 0. Therefore φj − φi ≥ −βk.

φj − φi = −βk : We only need to show that τ′j − τ′i ≥ αk. Since αk ≥ 0, we have τj ≥ τi. Several subcases need to be distinguished:
– τi < τj : We have δj ≥ δi ⇔ τ′j − τj ≥ τ′i − τi ⇔ τ′j − τ′i ≥ τj − τi ≥ αk.
– τi = τj ∧ φi ≠ φj : Either φi > φj, or φi < φj. The latter is impossible, for βk = φi − φj, and since all βk are non-negative. From the hypothesis, τi = τj ∧ φi > φj yields δj ≥ δi, so the conclusion is the same as above.
– τi = τj ∧ φi = φj : Since βk = φi − φj = 0, there is no precedence constraint unless σi ; σj. In this case taking δj ≥ δi works like in the cases above.

φj − φi > −βk : Let us show that (φj − φi + βk)T′ + τ′j − τ′i − αk ≥ 0. We have φj − φi + βk ≥ 1, so (φj − φi + βk)T′ ≥ (φj − φi + βk)T + ∆. By hypothesis we also have τi ≤ τ′i ≤ τi + ∆, and τj ≤ τ′j ≤ τj + ∆, so τ′j − τ′i ≥ τj − τi − ∆. Hence (φj − φi + βk)T′ + τ′j − τ′i − αk ≥ (φj − φi + βk)T + ∆ + τj − τi − ∆ − αk = (φj − φi + βk)T + τj − τi − αk ≥ 0.

The conditions involving the φij and ; may seem awkward, but are in fact mandatory for the theorem to be useful in an instruction scheduler. Consider for instance the more obvious condition τij ≤ τik ⇒ τ′ij − τij ≤ τ′ik − τik as a replacement for the three last conditions of theorem 1. Then τij = τik implies τ′ij = τ′ik, by exchanging ij and ik. Such a constraint makes scheduling impossible if σij and σik happen to use the same resource.

A result similar to theorem 1 holds for the modulo resource constraints of a partial schedule, assuming reservation vectors {ρ⃗i}1≤i≤N can be used. By definition, the reservation vector ρ⃗i associated to task σi is such that ρil equals the number of ones in the l-th row of the (regular) reservation table of σi.

Theorem 2. Let {Sik}1≤k≤n be a partial schedule satisfying the modulo resource constraints at T, assuming reservation vectors. Let {S′ik}1≤k≤n be such that:

∀j, k ∈ [1, n] :
  φij = φ′ij
  0 ≤ δij ≤ ∆
  τij < τik ⟹ δij ≤ δik

Then {S′ik}1≤k≤n taken as a partial schedule satisfies the modulo resource constraints at initiation interval T′ ≝ T + ∆.

Proof: Thanks to the reservation vectors, the satisfaction of the modulo resource constraints at T by the partial schedule {Sik}1≤k≤n is equivalent to [5]:

∀i, j ∈ {ik}1≤k≤n, i ≠ j, ∀l :
  tj − ti ≥ ρil − (⌊(ti − tj)/T⌋ + 1)T ∧ ti − tj ≥ ρjl + ⌊(ti − tj)/T⌋T

These constraints look exactly like precedence constraints of the scheduling graph, save the fact that the β values are now of arbitrary sign. Since the sign of the β values is only used in the demonstration of theorem 1 for the cases where τi = τj, which need not be considered here because they imply no resource collisions between σi and σj, we deduce from the demonstration of theorem 1 that the modulo resource constraints at T′ are satisfied by {S′ik}1≤k≤n taken as a partial schedule.
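Operationally, theorems 1 and 2 license a very cheap rescheduling step: keep every frame and shift every offset by its δ. Below is a sketch of this adjustment (names and encodings are mine), checked against the first insertion step of section 2.1:

```python
# Sketch: applying theorems 1 and 2.  Frames are preserved and offsets
# shift by delta, so S' = phi * T' + (tau + delta) with T' = T + Delta.

def reschedule(dates, deltas, T, Delta):
    """dates: {task: S}; deltas: {task: delta with 0 <= delta <= Delta},
    chosen monotone in the order given by the theorems.  Returns the new
    dates, valid at T' = T + Delta."""
    T_new = T + Delta
    out = {}
    for task, S in dates.items():
        phi, tau = divmod(S, T)          # frame and offset at the old T
        out[task] = phi * T_new + tau + deltas[task]
    return out

# First insertion of section 2.1: T = 6 -> T' = 7, I- = {sigma_1} keeps
# delta = 0 while I+ = {sigma_2} gets delta = 1:
print(reschedule({"s1": 0, "s2": 3}, {"s1": 0, "s2": 1}, 6, 1))
# -> {'s1': 0, 's2': 4}
```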
2.3 A Simple Implementation
To compute the values ∆− and ∆+, we need to define an operation ⊙ between two reservation vectors, and the function issuedelay, as:

ρ⃗i ⊙ ρ⃗j ≝ max_l (if ρil ≠ 0 ∧ ρjl ≠ 0 then ρil else 0)

issuedelay(σi, σj, d) ≝ max((ρ⃗i ⊙ ρ⃗j) − d, 0)

It is apparent that the function issuedelay computes the minimum value δ such that issuing σi at date t, and issuing σj at date t + d + δ, does not trigger resource conflicts between σi and σj. In fact issuedelay emulates the behavior of a scoreboard in the target processor. Now, computing ∆− and ∆+ is a simple matter given the following formulas:

∆− ≝ max(max_{σj∈I−} issuedelay(σj, σin, τin − τj), max_{σj∈I+} issuedelay(σj, σin, τin − τj + T))

∆+ ≝ max(max_{σj∈I−} issuedelay(σin, σj, τj + T − τin), max_{σj∈I+} issuedelay(σin, σj, τj − τin))
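These definitions transcribe directly into code. In the sketch below (assumed encodings again), I− and I+ are given as lists of (ρ⃗j, τj) pairs, and the maximum over an empty set is taken to be zero:

```python
# Sketch: reservation-vector collision arithmetic and the Delta
# computations.  rv_* are reservation vectors (lists of row lengths).

def odot(rv_i, rv_j):
    """Maximum over rows l of rv_i[l], restricted to the rows where
    both vectors use the resource; zero if they share no resource."""
    return max((a for a, b in zip(rv_i, rv_j) if a and b), default=0)

def issuedelay(rv_i, rv_j, d):
    """Minimum extra delay so that issuing sigma_j d (+ delta) cycles
    after sigma_i causes no resource conflict (a scoreboard, in effect)."""
    return max(odot(rv_i, rv_j) - d, 0)

def deltas(rv_n, tau_n, I_minus, I_plus, T):
    """Delta- and Delta+ for the instruction being issued, per the
    formulas above."""
    d_minus = max(
        [issuedelay(rv_j, rv_n, tau_n - tau_j) for rv_j, tau_j in I_minus] +
        [issuedelay(rv_j, rv_n, tau_n - tau_j + T) for rv_j, tau_j in I_plus],
        default=0)
    d_plus = max(
        [issuedelay(rv_n, rv_j, tau_j + T - tau_n) for rv_j, tau_j in I_minus] +
        [issuedelay(rv_n, rv_j, tau_j - tau_n) for rv_j, tau_j in I_plus],
        default=0)
    return d_minus, d_plus
```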
That is, we take for ∆− the minimum value such that σin scheduled at τin + ∆− would not conflict on a resource basis with the tasks σj in I−, if they were scheduled at the respective dates τj, nor with the tasks σj in I+, if they were scheduled at the respective dates τj − T. Likewise, ∆+ is the minimum value such that σin scheduled at τin − ∆+ would not conflict on a resource basis with the tasks σj in I−, if they were scheduled at the respective dates τj + T, nor with the tasks σj in I+, if they were scheduled at the respective dates τj. Intuitively, ∆− is meant to be the number of wait cycles needed by σin in order to avoid resource conflicts with the instructions issued before it in an actual software pipeline. And the value ∆+ is meant to be the number of wait cycles needed by the instructions issued after σin in an actual software pipeline.

Theorem 3. Let {Sij}1≤j