Rate-Optimality of Cyclo-Static Schedules

Liang-Fang Chao
Dept. of Elec. & Computer Engr., Iowa State University
[email protected]

Edwin Hsing-Mean Sha
Dept. of Computer Sci. & Eng., University of Notre Dame
[email protected]
1 Definitions and Preliminaries

A data-flow graph (DFG) is represented by a directed weighted graph G = (V, E, d, t), where V is the set of computation nodes, E is the set of directed edges (the precedence relations) between nodes in V, and d(e) is the delay count of edge e ∈ E. Each node v in V is associated with a positive integer t(v), its computation time. The graph in Figure 1 is an example of a DFG, where the number attached to a node is its computation time. The delay count, say i, on an edge (u, v) represents the sequencing relation between computation nodes u and v. For a meaningful data-flow graph, the total delay count of every loop is nonzero. One execution of all computation nodes in V is called an iteration. An edge e from u to v with delay count d(e) means that the computation of node v at iteration j depends on the computation of node u at iteration j − d(e).

Time and processor schedules of a DFG on a parallel system are formally modeled as follows. A time schedule is a function S : V × N → R; the starting time of node v in the i-th iteration is S(v, i). A time schedule is legal if for every edge e from u to v and every iteration i, we have S(u, i) + t(u) ≤ S(v, i + d(e)). A time schedule is said to have unfolding factor f and cycle period c if S(v, i + f) = S(v, i) + c for every v in V and iteration i. Such a time schedule can be represented by the partial schedule of the first f iterations: a new instance of this partial schedule is initiated every interval of length c to form a legal complete schedule. A processor schedule is a function P : V × N → L, where L is the set of processors, numbered 1, 2, ..., |L|. Node v in the i-th iteration is assigned to processor P(v, i). A processor schedule with unfolding factor f is static if P(v, i + f) = P(v, i) for every iteration i.
Such a processor schedule can be specified by a partial processor schedule of the first f iterations and a permuting function Q. A processor schedule is periodic if there exists a permuting function Q : L → L such that P(v, i + f) = Q(P(v, i)) for every v in V and iteration i; nodes scheduled on processor j in the i-th iteration are scheduled on processor Q(j) in the (i + f)-th iteration.¹ A simpler form of periodic schedule, the cyclo-static schedule, can be specified with a single processor displacement Δ such that Q(p) = (p + Δ) mod |L| for every processor p, as generated by the algorithms in [2, 3, 4].

[Figure 1: A simple DFG. (a) the DFG G, with B(G) = 3/2; (b) a static schedule with cycle period 2 and unfolding factor 1.]

[Figure 2: Integral-grid model — a static schedule with cycle period 3 and unfolding factor 2.]

The iteration period of a repeating schedule is the average computation time per iteration; a repeating schedule with cycle period c and unfolding factor f has iteration period c/f. We are interested in finding static schedules with minimum iteration period. The iteration bound B(G) is a lower bound on the iteration period. If c/f equals the iteration bound B(G), we call such an f a rate-optimal unfolding factor.

Under different timing or architecture models, there are several ways of implementing a static schedule. We classify the implementation styles below, and present one unified algorithm to solve the scheduling problems under the different models in Section ??. The minimum rate-optimal unfolding factors under the different models are characterized in Sections ?? and ??. Computation time and cycle period are measured in a predefined time unit, which may be a machine cycle or a clock cycle depending on the model. Without loss of generality, we assume that every node has an integral computation time. There are two models for the timing style of scheduling.
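The iteration bound B(G) just introduced is the maximum, over all cycles of the DFG, of the cycle's total computation time divided by its total delay count. A minimal brute-force sketch (our own code, on a small hypothetical DFG with B(G) = 3/2, not the one in the figures; the enumeration is exponential in general, but polynomial algorithms exist):

```python
from fractions import Fraction

# Hypothetical DFG: edges are (u, v, delay); t maps nodes to times.
edges = [("A", "B", 0), ("B", "C", 0), ("C", "A", 2)]
t = {"A": 1, "B": 1, "C": 1}

def iteration_bound(edges, t):
    """Max over simple cycles C of (total time of C) / (total delay of C)."""
    adj = {}
    for u, v, d in edges:
        adj.setdefault(u, []).append((v, d))
    best = Fraction(0)

    def dfs(start, node, visited, time_sum, delay_sum):
        nonlocal best
        for v, d in adj.get(node, []):
            if v == start:
                if delay_sum + d > 0:          # meaningful DFG: delays > 0
                    best = max(best, Fraction(time_sum, delay_sum + d))
            elif v not in visited:
                dfs(start, v, visited | {v}, time_sum + t[v], delay_sum + d)

    for s in t:
        dfs(s, s, {s}, t[s], 0)
    return best

print(iteration_bound(edges, t))   # 3/2
```

The single cycle A→B→C→A has total time 3 and total delay 2, giving the bound 3/2 used in the examples that follow.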
Integral-grid model: We imagine an integral grid (of time units) for the schedule. An operation can only be issued at the beginning of a time unit. Under this model, a schedule has an integral cycle period.

Fractional model (gridless model): The starting time of a node can be any fractional number, i.e., the schedule may have an infinitely fine grid, and an operation can be issued at any time. Under this model, a schedule may have a fractional cycle period. In fact, we will show that a grid of size 1/β gives schedules as good as gridless schedules, where β is the denominator of B(G) in its irreducible form.
The fractional model usually achieves a better execution rate than the integral model, or requires a smaller unfolding factor, because the next instance of a static schedule can be issued as early as possible, unconstrained by the integral grid. The fractional model thus gives the designer insight into trading a faster clock, with a smaller clock period, for a smaller unfolding factor. Consider the DFG in Figure 1 with iteration bound B(G) = 3/2. Figures 2 and 3 show a schedule under the integral-grid model and a schedule under the fractional model, respectively, where the first instance of each static schedule is highlighted by a box with a thicker boundary. The two static schedules have the same iteration period, 3/2. To achieve rate-optimality, the static schedule under the integral-grid model consists of 2 iterations, while that under the fractional model contains one iteration.

Usually, a static schedule is stored as program code in which each instruction consists of a set of nodes to be issued; an instruction is executed at each clock tick. The cost of storing a static schedule can be measured by the number of instructions in the program code. Under the fractional model as shown, the clock ticks every 1/2 time unit; thus, the storage costs of the integral-grid and fractional schedules are the same.

¹The cycling vector used in [1] equals (Q⁻¹(1), Q⁻¹(2), ..., Q⁻¹(|L|))ᵀ.

[Figure 3: Fractional-grid model — a static schedule with cycle period 1.5 and unfolding factor 1.]

Under every timing model, we can implement a static schedule in either of two design styles. Without loss of generality, we assume there is only one copy of each node in a static schedule, i.e., the unfolding factor is one; the description generalizes easily to larger unfolding factors.
Pipelined implementation: There is no restriction on the scheduling of copies of the same node beyond the precedence relations. If the second copy of a node starts executing on the same processor as the first copy before the first copy has finished, we can use pipelined hardware to implement that processor.

Non-pipelined implementation: Since the hardware or processors are non-pipelined, the next copy of a node cannot start executing before the previous copy has finished. This constraint induces an implicit precedence relation between the two copies: for every iteration i, the i-th copy of a node precedes the (i + 1)-th copy because they are assigned to the same processor.

Under our scheduling model, we assume that pipelined processors for operations can be synthesized by a back-end synthesis system after schedules are generated. The number of stages of a pipelined processor for an operation v is at most ⌈t(v)/c⌉, where c is the cycle period. Both structural pipelining and functional pipelining are considered here. Structural pipelining appears in the pipelined implementation of processors; functional pipelining refers to the pipelining of instances of a static schedule, which is used in each of the models we consider. The term "software pipelining" is used in the parallel-compiler literature for concepts similar to functional pipelining [5, 6, 7]. In software compilation, however, only the instruction issue time matters, not the actual time occupied by an instruction: each node in their graph representation models the instruction issue time, and the computation time is associated with the outgoing edges of the node to enforce precedence relations. Such a DFG model is therefore equivalent to our notion of a pipelined implementation. Next, we derive the condition for a static schedule to be implementable in the non-pipelined design. The arguments hold for both the integral and fractional timing models.
For a static schedule with unfolding factor f under the non-pipelined design, the (i + f)-th copy of a node cannot start executing before the i-th copy has finished. Thus, a schedule with unfolding factor f can be implemented on non-pipelined processors if for every v in V and positive integer i, we have S(v, i + f) ≥ S(v, i) + t(v); the pipelined design carries no such restriction. In a static schedule of cycle period c, a new copy of a node assigned to the same processor starts execution every c time units. Thus, the cycle period must be no less than the computation time of every node.
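The condition above collapses to a one-line test, since S(v, i + f) = S(v, i) + c: a static schedule is non-pipelined-implementable exactly when c ≥ max_v t(v) (Theorem 3 below). A sketch, using the node times of the exemplary DFG of Figure 4 as read from the figure (B takes 4 units, A and C one unit each):

```python
def nonpipelined_feasible(t, c):
    """A legal static schedule with cycle period c can be realized on
    non-pipelined processors iff c >= max computation time, because
    S(v, i+f) - S(v, i) = c must be at least t(v) for every node v."""
    return c >= max(t.values())

t = {"A": 1, "B": 4, "C": 1}          # times assumed from Figure 4
print(nonpipelined_feasible(t, 3))    # False: copies of B would overlap
print(nonpipelined_feasible(t, 4))    # True
```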
[Figure 4: An exemplary DFG G with B(G) = 3/2; node B has computation time 4, and nodes A and C have computation time 1.]
In order to improve the execution rate, one can execute several copies of a node with long computation time simultaneously. The schedule then has a larger cycle period, but the iteration period can be reduced. The following shows that as long as the cycle period of a schedule is no less than max_v t(v), it can be implemented as a static schedule under the non-pipelined design.

Lemma 1. Given a schedule S with cycle period c ≥ max_v t(v), this schedule can be realized as a static schedule under the non-pipelined design.

Proof: Assume that S has unfolding factor f. We can simply assign every copy of every node to a distinct processor. Since t(v) ≤ c for every v, we have S(v, i + f) = S(v, i) + c ≥ S(v, i) + t(v). Therefore, S is a static schedule under the non-pipelined design. □

In fact, the inequality c ≥ max_v t(v) is also a necessary condition for a static schedule to be implementable under the non-pipelined design.

Lemma 2. Any static schedule S under the non-pipelined design has a cycle period c ≥ max_v t(v).

Proof: Assume that S has unfolding factor f. For every node v and iteration i, we have S(v, i + f) ≥ S(v, i) + t(v). Since S repeats with period c, we know S(v, i + f) = S(v, i) + c. Thus, c is no less than the computation time of every node; therefore, c ≥ max_v t(v). □

The above lemmas give a necessary and sufficient condition for a static schedule to be implementable under the non-pipelined design.

Theorem 3 [8]. Let G be a DFG, and S a legal schedule with cycle period c and unfolding factor f. The static schedule S can be implemented under the non-pipelined design if and only if c ≥ max_v t(v).

Proof: This follows from Lemmas 1 and 2. □

Thus, there are four combinations of implementations of static schedules. The following example, used throughout the rest of this paper, shows that the minimum rate-optimal unfolding factor may differ under these four combinations.
The exemplary DFG is shown in Figure 4; its iteration bound B(G) is 3/2. The minimum rate-optimal unfolding factors derived from the theorems in this paper for the four combinations are as follows; the derivations of the unfolding factors and the corresponding schedules are explained in this paper.

  Min rate-optimal          Timing model
  unfolding factor      Fractional   Integral
  Pipelined design          1           2
  Non-pipelined design      3           4
An algorithm is presented in [8] to find a schedule for a given cycle period c and unfolding factor f under the fractional and integral models. We briefly summarize the results here. For a DFG G = (V, E, d, t) and given c and f, we define a modified graph with different edge weights: the scheduling graph G_s = (V, E, w), where w(e) = d(e) − t(u)·f/c for every edge e from u to v in E. Assume that there is no negative-weight cycle in the scheduling graph G_s. We add a node v0 and zero-weight directed edges from v0 to every other node of G_s. Let sh(v) be the length of the shortest path from v0 to v in the scheduling graph G_s; the values sh(v) can be computed by any single-source shortest-path algorithm. It is easy to observe that sh(v) ≤ 0 for every node v, and that there exists a node u in V with sh(u) = 0. The following theorem obtains legal fractional and integral schedules from the scheduling graph.

Theorem 4 [8]. Let G be a general-time DFG, c a cycle period, and f an unfolding factor. Assuming that there is no negative-weight cycle in the corresponding scheduling graph G_s, let sh(v) be the length of the shortest path from v0 to v.
(a) S^f(v, i) = (i − sh(v))·c/f for every v and i is a legal schedule under the fractional model.
(b) S^i(v, i) = ⌈(i − sh(v))·c/f⌉ for every v and i is a legal schedule under the integral-grid model.

It can be shown that the time schedules S^f and S^i have unfolding factor f and cycle period c, i.e., S^f(v, i + f) = S^f(v, i) + c and S^i(v, i + f) = S^i(v, i) + c. The following theorems provide the necessary and sufficient condition for the existence of a schedule with cycle period c and unfolding factor f.

Theorem 5 [8]. Let G be a general-time DFG, c a (fractional) cycle period, and f an unfolding factor. Then c/f ≥ B(G) if and only if there exists a legal fractional schedule with unfolding factor f and cycle period c.

Using Theorem 4(b) as a legal integral schedule, a similar result is obtained for the integral model.

Theorem 6 [8]. Let G be a general-time DFG, c an (integral) cycle period, and f an unfolding factor. Then c/f ≥ B(G) if and only if there exists a legal integral schedule with unfolding factor f and cycle period c.
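The Theorem 4 construction is directly executable: build the scheduling graph, run a single-source shortest-path algorithm from the virtual source v0, and read off the schedules. A sketch under our own naming (Bellman-Ford over the weight function w(e) = d(e) − t(u)·f/c, exact arithmetic via `Fraction`), on the same hypothetical DFG with B(G) = 3/2 and the rate-optimal choice c = 3, f = 2:

```python
import math
from fractions import Fraction

def static_schedule(V, E, t, c, f):
    """Theorem 4 sketch: returns (S_frac, S_int) giving the start time of
    node v in iteration i, or None if the scheduling graph has a
    negative-weight cycle (i.e. c/f < B(G), by Theorems 5 and 6)."""
    c, f = Fraction(c), Fraction(f)
    # Scheduling-graph weights: w(e) = d(e) - t(u)*f/c for each edge u -> v.
    w = [(u, v, d - t[u] * f / c) for (u, v, d) in E]
    # Bellman-Ford from the virtual source v0, whose zero-weight edges
    # initialize every distance to 0.
    sh = {v: Fraction(0) for v in V}
    for _ in range(len(V) + 1):
        updated = False
        for u, v, wt in w:
            if sh[u] + wt < sh[v]:
                sh[v] = sh[u] + wt
                updated = True
        if not updated:
            break
    if updated:
        return None                       # negative-weight cycle detected
    S_frac = lambda v, i: (i - sh[v]) * c / f
    S_int = lambda v, i: math.ceil((i - sh[v]) * c / f)
    return S_frac, S_int

V = ["A", "B", "C"]
E = [("A", "B", 0), ("B", "C", 0), ("C", "A", 2)]   # hypothetical DFG
t = {"A": 1, "B": 1, "C": 1}
S_frac, S_int = static_schedule(V, E, t, 3, 2)
print(S_frac("B", 0), S_int("B", 0))   # both 1; period: S(v, i+2) = S(v, i) + 3
```

Trying c = 2, f = 2 instead (c/f = 1 < 3/2) makes the cycle weight negative and the function returns None, matching Theorems 5 and 6.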
2 Rate-Optimality of Periodic and Cyclo-Static Schedules

In this section, we show that cyclo-static schedules in the non-pipelined processor model have time schedules equivalent to those of static schedules in the pipelined processor model. Their processor schedules differ because different processor models are used. Without loss of generality, we prove the lemmas and theorems in this section with unfolding factor 1.

Lemma 7. Let G be a DFG and S a time schedule of G with cycle period c. Suppose S is realized on processors M_1, ..., M_|L| in the pipelined model by a static processor schedule P such that M_l has stage(l) pipeline stages. Let ET(l) (resp. LT(l)) be the earliest starting time (resp. the latest finishing time) of the nodes of the first iteration allocated to processor M_l. Then stage(l) = ⌈(LT(l) − ET(l) + 1)/c⌉ and LT(l) < stage(l)·c + ET(l).

Proof: From the time schedule S, the pipeline structure of each processor can be decided. The number of pipeline stages stage(l) needed in processor M_l equals the maximum number of overlapping schedule instances. We can characterize ET(l) and LT(l) as ET(l) = min{S(v, 1) | v in V with P(v, 1) = M_l} and LT(l) = max{S(v, 1) + t(v) − 1 | v in V with P(v, 1) = M_l}. Since the time schedule has cycle period c, the earliest starting time of the i-th iteration on M_l is ET(l) + (i − 1)c. Hence stage(l) is the largest i such that ET(l) + (i − 1)c ≤ LT(l), i.e., stage(l) = ⌈(LT(l) − ET(l) + 1)/c⌉. Since LT(l) − ET(l) + 1 ≤ stage(l)·c, we have LT(l) ≤ stage(l)·c + ET(l) − 1; therefore, LT(l) < stage(l)·c + ET(l). □

From this lemma, a processor M_l need not be pipelined if LT(l) < c + ET(l), i.e., LT(l) − ET(l) + 1 ≤ c. A static processor schedule can be realized in the non-pipelined model if LT(l) < c + ET(l) for every processor M_l.
Lemma 8. Let G be a DFG and S a time schedule of G with cycle period c. Let P be a static processor schedule on processors M_1, M_2, ..., M_|L| realizing S in the pipelined model. Then there exists a periodic processor schedule P′ realizing S in the non-pipelined model.

Proof: From Lemma 7, we can determine the number of pipeline stages, denoted stage(l), of each processor M_l. Each pipelined processor M_l can be implemented by stage(l) non-pipelined processors, denoted M_{l,1}, M_{l,2}, ..., M_{l,stage(l)}. We construct a periodic processor schedule P′ with a permuting function Q on the set of non-pipelined processors L′ = {M_{l,1}, ..., M_{l,stage(l)} | 1 ≤ l ≤ |L|}. In Q, the non-pipelined processors corresponding to the same pipelined processor M_l form a cycle of size stage(l):

  Q(M_{l,j}) = M_{l,j+1} for j = 1, ..., stage(l) − 1,
  Q(M_{l,stage(l)}) = M_{l,1}.

Since each non-pipelined processor corresponds to exactly one pipelined processor, Q is a permutation. The processor schedule is defined by P′(v, 1) = M_{l,1} if P(v, 1) = M_l, and P′(v, i) = Q^{i−1}(P′(v, 1)) for every i ≥ 2. Next, we prove that P′ can be realized on non-pipelined processors, i.e., an iteration cannot be scheduled on a processor before the previous iteration on that processor finishes. Consider a node v allocated to a pipelined processor M_l. From the construction of P′ and Q, P′(v, j) = M_{l,j} for 1 ≤ j ≤ stage(l), and P′(v, stage(l) + j) = M_{l,j} for 1 ≤ j ≤ stage(l). Node v is scheduled on M_{l,1} in the first iteration; after that, no node is scheduled on M_{l,1} until iteration stage(l) + 1. The earliest starting time of any node allocated to M_l in iteration stage(l) + 1 is ET(l) + stage(l)·c. From Lemma 7, the latest finishing time of node v in the first iteration, LT(l), is less than ET(l) + stage(l)·c. Therefore, the constructed periodic processor schedule P′ can be realized in the non-pipelined model with Σ_{l=1}^{|L|} stage(l) processors. □
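The permuting function of Lemma 8 is simple enough to build mechanically. A sketch with our own naming, where processor M_{l,j} is modeled as the pair (l, j) and the stage counts are assumed (M1 with 3 stages, M2 not pipelined):

```python
def lemma8_permutation(stage):
    """Lemma 8 construction: pipelined processor l with stage[l] stages
    becomes non-pipelined processors (l, 1)..(l, stage[l]); Q cycles each
    group, so iteration i+1 advances one position in the cycle."""
    Q = {}
    for l, s in stage.items():
        for j in range(1, s):
            Q[(l, j)] = (l, j + 1)
        Q[(l, s)] = (l, 1)           # close the cycle of size stage[l]
    return Q

Q = lemma8_permutation({1: 3, 2: 1})   # assumed stage counts
p = (1, 1)                             # node first scheduled on M_{1,1}
trace = []
for i in range(1, 7):
    trace.append(p)
    p = Q[p]
print(trace)   # M_{1,1} is reused only every stage(1) = 3 iterations
```

By construction each group of processors forms one cycle of Q, so Q is a permutation, matching the argument in the proof.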
Lemma 9. Let G be a DFG and S a time schedule of G with cycle period c. Let P be a static processor schedule on processors M_1, M_2, ..., M_|L| realizing S in the pipelined model. Then there exists a cyclo-static processor schedule P′ with processor displacement Δ realizing S in the non-pipelined model.

Proof: Let maxstage be the maximum number of pipeline stages of any pipelined processor, i.e., maxstage = max_l stage(l), and let N_pipe (resp. N_nopipe) be the number of pipelined (resp. non-pipelined) processors. We construct the cyclo-static processor schedule P′ as P′(v, 1) = P(v, 1) and P′(v, i) = Q^{i−1}(P(v, 1)) for every node v in V and iteration i, where Q(p) = (p + Δ) mod |L′|, L′ is a set of N_nopipe + N_pipe·maxstage processors, and Δ = N_pipe. We prove that this processor schedule P′ can be realized in the non-pipelined model. From the construction of P′ and Q, the i-th iteration is allocated to the processors {M_{(i−1)Δ+1}, ..., M_{(i−1)Δ+|L|}} if (i − 1)Δ + |L| ≤ N_nopipe + N_pipe·maxstage. Assume that after the first iteration, no node is scheduled on processor M_1 until the X-th iteration. Processor M_1 can be implemented in the non-pipelined model if LT(1) < ET(1) + (X − 1)·c. From Lemma 7, we have LT(1) < stage(1)·c + ET(1) ≤ maxstage·c + ET(1). Therefore, it suffices to prove maxstage ≤ X − 1. From the definition of Q, X is the minimum i such that 1 + (i − 1)Δ > N_nopipe + N_pipe·maxstage. Since Δ = N_pipe, we can derive (X − 1)·N_pipe ≥ N_pipe·maxstage + N_nopipe; thus X − 1 ≥ maxstage. □

From Theorem 6 and Lemma 8, we can derive the following theorem for rate-optimal periodic schedules.

Theorem 10. Let G be a DFG and β the denominator of B(G) in its irreducible form. The minimum rate-optimal unfolding factor for periodic scheduling under the integral and non-pipelined implementation is β.
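The key inequality in the Lemma 9 proof, X − 1 ≥ maxstage, can be checked numerically. A sketch under our own naming (the parameter triples are arbitrary assumed values, not from the paper):

```python
def reuse_gap(n_pipe, n_nopipe, maxstage):
    """X - 1 in the Lemma 9 proof: with |L'| = n_nopipe + n_pipe*maxstage
    processors and displacement delta = n_pipe, the number of iterations
    after the first before processor M_1 can be occupied again."""
    l_prime = n_nopipe + n_pipe * maxstage
    delta = n_pipe
    i = 1
    while 1 + (i - 1) * delta <= l_prime:   # block start has not wrapped yet
        i += 1
    return i - 1

# The gap always covers the deepest pipeline, so every copy has finished
# before its processor is reused:
for n_pipe, n_nopipe, maxstage in [(2, 3, 4), (1, 0, 5), (3, 2, 2)]:
    assert reuse_gap(n_pipe, n_nopipe, maxstage) >= maxstage
```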
Proof: Let α be the numerator of B(G) in its irreducible form. From Theorem 6, there exists an integral time schedule S with cycle period α and a static processor schedule P realizing S in the pipelined model. From Lemma 8, there exists a periodic processor schedule P′ realizing S in the non-pipelined model. Thus, β is a rate-optimal unfolding factor. Since α/β equals B(G) and is irreducible, β is the minimum rate-optimal unfolding factor. □

Similarly, from Theorem 6 and Lemma 9, the following theorem can be derived for rate-optimal cyclo-static schedules.

Theorem 11. Let G be a DFG and β the denominator of B(G) in its irreducible form. The minimum rate-optimal unfolding factor for cyclo-static scheduling under the integral and non-pipelined implementation is β.

The above theorems generalize to the fractional timing model.

Theorem 12. Let G be a DFG. Under the fractional model and the non-pipelined design, there always exists a rate-optimal periodic schedule without unfolding.

Proof: From Theorem 5, a rate-optimal fractional time schedule with cycle period B(G) and unfolding factor 1 can be derived. From Lemma 8, a periodic processor schedule can be found in the non-pipelined model. □
References

[1] P. R. Gelabert and T. P. Barnwell III, "Optimal automatic periodic multiprocessor scheduler for fully specified flow graphs," IEEE Transactions on Signal Processing, vol. 41, pp. 858–888, Feb. 1993.

[2] D. A. Schwartz and T. P. Barnwell III, "A graph theoretical technique for the generation of systolic implementations for shift-invariant flow graphs," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 1384–1387, Mar. 1984.

[3] D. A. Schwartz and T. P. Barnwell III, "Cyclo-static multiprocessor scheduling for the optimal realization of shift-invariant flow graphs," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1384–1387, 1985.

[4] D. A. Schwartz and T. P. Barnwell III, "Cyclo-static solutions: optimal multiprocessor realizations of recursive algorithms," in VLSI Signal Processing II (S.-Y. Kung, R. E. Owen, and J. G. Nash, eds.), New York: IEEE Press, 1986.

[5] M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines," in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, Atlanta, GA, pp. 318–328, June 1988.

[6] A. Aiken and A. Nicolau, "Optimal loop parallelization," in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 308–317, June 1988.

[7] G. R. Gao, Y.-B. Wong, and Q. Ning, "A timed Petri-net model for fine-grain loop scheduling," in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 204–218, June 1991.

[8] L.-F. Chao and E. H.-M. Sha, "Static scheduling for synthesis of DSP algorithms on various models," Journal of VLSI Signal Processing, June 1994. Accepted for publication.