Scheduling for Functional Pipelining and Loop Winding

Cheng-Tsung Hwang, Yu-Chin Hsu and Youn-Long Lin
Department of Computer Science, Tsing Hua University, Hsinchu, Taiwan 30043, R.O.C.

ABSTRACT

We present an algorithm for pipelining loop execution in the presence of loop carried dependences. We optimize both the initiation interval and the turn-around time of a schedule. Given constraints on the number of functional units and buses, we first determine an initiation interval and then incrementally partition the operations into blocks to fit into the execution windows. A refinement procedure is incorporated to improve the turn-around time. The novel feature which distinguishes our approach from others is that the scheduled operations are iteratively moved up and down to accommodate the ready yet unscheduled operations. The algorithm produces very encouraging results.

1. INTRODUCTION

The automatic synthesis of a digital system from a behavioral description to a structural design requires many tasks to be performed. Among them, operation scheduling and hardware allocation are the most important. Scheduling determines the cost and speed tradeoff of the system, while allocation binds the behavioral objects to the hardware resources. In order to implement a high-throughput system, it is very important to exploit the concurrency in the algorithm. Highly concurrent implementations can be obtained by executing several operations within one iteration in parallel and/or by overlapping consecutive iterations in a pipelined fashion. Usually, the performance of a digital system is constrained by both the available hardware resources and the algorithm structure (i.e., the precedence relationships). Thanks to the progress in VLSI processing techniques, using massive hardware has become practical. Consequently, precedence relationships among operations become the primary bottleneck. During the previous decades, the concurrency within one iteration has been extensively studied [1]. It is desirable to explore concurrency beyond the boundary of a single iteration. Recently, much attention has been paid to the exploration of algorithmic concurrency through pipelining. Functional pipelining [2-5] and loop winding [6-13] are two such techniques. Using a functionally pipelined data path, the processing of a data sample can be started before the completion of the previous one. Fig. 1 shows an example where successive iterations are executed simultaneously.

+ This work was supported in part by the National Science Council, R.O.C. under grant nos. NSC80-0404-Em-22 and NSC80-0404-Em-20.



Fig. 1. An example to illustrate a pipelined data path. The latency of the pipelined data path is 2.

An instance is initiated every two steps (this interval, measured in clock cycles, is called the latency, i.e., the initiation interval) and each instance takes four steps (called the delay, i.e., the turn-around time in clock cycles) to finish. The operations in control step 3 (4) of the first instance and step 1' (2') of the second instance are executed simultaneously. The interval between (3, 1') and (4, 2') is the execution window. The idea of loop winding is similar to that of functional pipelining except that it is applied to a loop rather than to a whole algorithm, and there may exist data dependences among operations across different iterations. Therefore, functional pipelining is a special case of loop winding. The rest of the paper is organized as follows. In section 2, the fundamental concepts are discussed. The proposed algorithm is presented in section 3. We first describe an algorithm for pipelining loops without LCD's and then extend it to handle loops with LCD's. Finally, a post-optimization algorithm is introduced. In section 4, we show the experimental results. Finally, concluding remarks are made in section 5.

2. PRELIMINARIES

2.1. The Data Flow Graph

The input to the algorithm is a data flow graph DFG(V, E), where V is the set of n operations and E is the set of data precedences. An operation is denoted as oi. An instance of oi at iteration iA is denoted as oi@iA. A data precedence relationship between two operations oi and oj is denoted as oi → oj. Each edge is associated with a weight w and a degree d, where w specifies the minimum distance between oi and oj and d means that the value generated by oi will be used by oj d iterations later. An edge is a forward edge if its degree is 0. Otherwise, it is a




backward edge. A node is a source (sink) node if it has no predecessors (successors). The depth of oi obtained by a breadth-first traversal starting from the source (sink) nodes on the 0-degree subgraph is denoted as si (li). Finally, a strongly connected component of the DFG is a subgraph in which, for every pair of vertices oi and oj, there exists at least one directed path from oi to oj and another from oj to oi.
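As a purely illustrative reading of these definitions, the Python sketch below stores each edge with its weight w and degree d and computes the source-side depths si by a breadth-first traversal of the 0-degree subgraph; the sink-side depths li would be computed symmetrically over the predecessor lists. The class and method names are assumptions, not part of the paper.

    from collections import defaultdict, deque

    class DFG:
        """Data flow graph; each edge carries a weight w and a degree d."""
        def __init__(self):
            self.succ = defaultdict(list)          # op -> [(op, w, d), ...]
            self.pred = defaultdict(list)
            self.ops = set()

        def add_edge(self, oi, oj, w=1, d=0):
            self.ops.update((oi, oj))
            self.succ[oi].append((oj, w, d))
            self.pred[oj].append((oi, w, d))

        def source_depths(self):
            """Breadth-first depth s_i on the 0-degree (forward-edge) subgraph."""
            indeg = {v: sum(1 for (_, _, d) in self.pred[v] if d == 0) for v in self.ops}
            depth = {v: 0 for v in self.ops}
            queue = deque(v for v in self.ops if indeg[v] == 0)     # source nodes
            while queue:
                u = queue.popleft()
                for v, _, d in self.succ[u]:
                    if d:                                           # skip backward edges
                        continue
                    depth[v] = max(depth[v], depth[u] + 1)
                    indeg[v] -= 1
                    if indeg[v] == 0:
                        queue.append(v)
            return depth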

2.2. The Pipelined Data Flow Graph

To pipeline the execution of a loop, the DFG (loop body) is partitioned horizontally into pieces DFG1, DFG2, ..., DFGI, which are then wound to form a shorter loop. DFGi will be computed during the i-th pipeline stage (logically). In Fig. 1, operations o1, o2, o3, o6, o10 and o11 form the piece DFG1, while o4, o5, o7, o8 and o9 form DFG2. Our algorithm dynamically partitions the DFG during the scheduling of the operations. The graph induced by the not-yet-scheduled operations is called the remaining graph and is denoted as DFGR. During each pass of the algorithm, a set of operations is selected from DFGR. The scheduled graph DFGS is constructed iteratively by adding the operations just selected and projecting the edges of the DFG onto itself. The algorithm terminates when DFGR becomes empty.

ALGORITHM PIPELINE_SCHEDULING
Input: DFG, Resources;
Output: a pipelined schedule with minimal latency and delay;
(1) Calculate the minimum latency L; I = 0; DFGS = empty; DFGR = DFG.
(2) While (DFGR is not empty) do
    (2.1) For Cstep = L downto 1        /* Backward Scheduling */
          - Schedule urgent operations in DFGS
          - Schedule operations in DFGS
    I = I + 1
    (2.2) For Cstep = 1 to L            /* Forward Scheduling */
          - Schedule urgent operations in DFGS
          - Schedule incoming operations in DFGR
          - Schedule operations in DFGS
    (2.3) DFGS = DFGS + DFGI
          DFGR = DFG - DFGS
    End while

Fig. 2. A pseudo-code description of the algorithm for pipelining a loop without loop carried dependences.

3. THE PROPOSED METHOD

We propose a two-phase approach for pipelining the execution of a loop. During the construction phase, a schedule with minimal latency is found. During the refinement phase, the delay of the initial schedule is reduced. In this section, we start by describing the construction algorithm for pipelining a loop without loop carried dependences (LCD's) and then extend it to one with LCD's. Finally, the refinement algorithm is introduced.

3.1. Pipelining a Loop Without LCD's

We first determine an initiation interval (execution window). Then, we incrementally partition the operations of the DFG into blocks. During the scheduling of a block, operations of DFGR (called incoming operations) are scheduled into the execution window from the first to the last step and form an induced subgraph DFGI. Intuitively, a schedule is faster if the operations can be assigned to control steps as early as possible (forward scheduling) during each pass. To help accomplish this goal, we can pull down the scheduled operations (backward scheduling) so that the incoming operations can be scheduled into the earlier steps during the next pass. Our basic idea is to fully utilize the properties of both forward and backward scheduling. A pseudo-code description of the algorithm is shown in Fig. 2. Backward scheduling and forward scheduling are explained in more detail below.

Backward scheduling: During backward scheduling, previously scheduled operations (oi in DFGS) are pulled down as much as possible so that the distribution of operations over the early steps becomes sparse. The priority function used here is Pb(i) = si, for oi in DFGS.

Forward scheduling: During forward scheduling, both the incoming operations and the already scheduled operations are candidates for assignment to the present step Cstep. Incoming operations are given a higher priority because we wish more operations to be scheduled during a pass. Only when there are resources left can a scheduled operation in DFGS be selected and assigned, which in turn makes the resource distribution at the later steps sparse. The priority function used here augments the priority of each incoming operation by a constant a, where a is a large integer that gives the incoming operations a higher priority over the scheduled ones.

As the scheduling proceeds, more operations of DFGR are scheduled into DFGS. These scheduled operations compete with each other for resources, so their freedom (range of movement) becomes smaller and smaller. To stabilize the algorithm, we maintain for each operation a pair of bounds (slow, shigh): slow, which is determined by the forward scheduling, is the earliest step into which an operation can be scheduled, while shigh, which is determined by the backward scheduling, is the latest step into which an operation must be scheduled. An operation is urgent if its shigh equals the current step Cstep during forward scheduling or its slow equals Cstep during backward scheduling. Urgent operations must be scheduled immediately.

Example: Let us illustrate the idea using the DFG depicted in Fig. 3(a). Suppose there are 5 adders and 3 multipliers. An adder takes 40 ns, while a multiplier takes 80 ns. The clock cycle and the latency are set to 100 ns and 3 steps, respectively. During the first pass of the scheduling, the forward scheduling puts as many operations as possible into DFG1 (Fig. 3(b)). Next, we move on to the second pass. The operations just scheduled are pulled down as much as possible by the backward scheduling (Fig. 3(c)). At the beginning of the second forward scheduling, +1, +2 and +3 are scheduled first because they are urgent. Then +c, +d, *7 and *8 of DFGR are scheduled into step 1. Because no adders are left available in control step 1, no addition of DFG1 can be pulled up. We proceed to control step 2. Here, *1, *2, *3, +4, +5 and +6 (urgent operations) are assigned first, then +e and +f (incoming operations) are scheduled. Finally, the urgent operations +a, +b, *4, *5, *6, +7, +8 and the incoming operation +g are assigned to control step 3. The algorithm finishes with a schedule of delay 6, which is optimal (Fig. 3(d)). Notice that Sehwa takes 9 steps to complete using the forward feasible scheduling [2].
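The following Python sketch illustrates one forward pass under the conventions just described; it is not the authors' implementation. The Op fields, the constant ALPHA and the flat resource table are illustrative assumptions, and data-precedence checks are omitted for brevity.

    from dataclasses import dataclass

    ALPHA = 10_000            # assumed large constant favouring incoming operations

    @dataclass(frozen=True)
    class Op:
        name: str
        kind: str             # functional-unit type, e.g. "add" or "mul"
        s_high: int           # latest legal step, set by the last backward pass
        depth: int            # base priority, e.g. longest path to a sink

    def forward_pass(incoming, rescheduled, latency, resources):
        """Assign operations (two sets of Op) to steps 1..latency; return {op: step} or None."""
        placed = {}
        for step in range(1, latency + 1):
            free = dict(resources)                       # units left at this step
            pending = [op for op in incoming | rescheduled if op not in placed]
            urgent = [op for op in pending if op.s_high == step]
            others = sorted((op for op in pending if op.s_high != step),
                            key=lambda op: op.depth + (ALPHA if op in incoming else 0),
                            reverse=True)
            for op in urgent + others:
                if free.get(op.kind, 0) > 0:
                    placed[op] = step
                    free[op.kind] -= 1
                elif op in urgent:
                    return None                          # an urgent op missed its deadline
        return placed

A matching backward pass would mirror this loop from step L down to 1, using slow as the urgency test.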

Complexity: By keeping the ready operations in a priority queue, the operation with the highest priority can be retrieved in O(log n) time, where n is the number of operations. An urgent operation is retrieved from the queue via an extra pointer in constant time. Since the number of passes (I) is small, the complexity of the algorithm is O(n log n).

Fig. 3. Example to illustrate the proposed method. (a) The 16-point digital FIR filter [2]. (b) After forward scheduling of pass 1. (c) After backward scheduling of pass 2. (d) The final schedule (DFGS).

3.2. Pipelining a Loop With LCD's

A loop carried dependence occurs when a value generated during the current iteration will be used by some later iteration. To pipeline the execution of a loop with LCD's, several issues have to be addressed:

1) Precedence constraints: To ensure that a correct version of data is consumed, the following two constraints have to be satisfied:

i) Producer-consumer precedence: The producer should finish before the execution of the consumer begins. This problem is complicated in the presence of LCD's. We solve it by performing a pipeline-schedule iteration which consists of (a) assigning new operations to the current block, (b) pre-assigning operations to future blocks and (c) iteratively refining the construction to find a feasible solution. The detailed algorithm is described in section 3.2.1.

ii) Consumer-producer precedence (anti-dependency): If the data used by a consumer o1 is generated during a previous iteration, an over-write hazard occurs when its producer o2 generates a new value before the old one has been used. An edge o1 → o2 has to be added to prevent o2 from being scheduled before o1. Such an edge becomes superfluous if there exists a 0-degree path from o2 to o1.

2) The minimum achievable latency: Although both a trivial lower bound Lmin and a trivial upper bound Lmax on the achievable latency can be easily obtained, determining the minimum latency is in general an NP-complete problem [12]. We explore the solution space by performing a minimum-latency iteration which repeatedly estimates a new latency and performs a pipeline-schedule iteration until a feasible solution is found. The detailed algorithm is given in section 3.2.3.

3) The priority function: Since the DFG may contain LCD's, the urgency of an operation cannot be calculated by measuring the path length within a single iteration alone. Let theta be the largest degree among all the data dependences, i.e., theta = max over e in the DFG of de, where e is an edge of the DFG and de is its degree. The priority of an operation during forward scheduling equals the length of the longest path from its representative node to any of the sink nodes in the (theta+1)-unrolled DFG. The purpose is to give a higher priority to the operations in a strongly connected component.
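A possible sketch of this priority computation is given below. The edge-list representation is an illustrative assumption, and the DFG is assumed to contain no 0-degree cycle (otherwise the unrolled graph would not be acyclic).

    from collections import defaultdict
    from functools import lru_cache

    def lcd_priority(ops, edges, theta):
        """ops: iterable of node names; edges: [(u, v, degree), ...]."""
        succ = defaultdict(list)                      # (node, copy) -> successors
        for u, v, d in edges:
            for k in range(theta + 1 - d):            # copy k of u feeds copy k+d of v
                succ[(u, k)].append((v, k + d))

        @lru_cache(maxsize=None)
        def longest_to_sink(node):
            outs = succ.get(node, [])
            return 0 if not outs else 1 + max(longest_to_sink(n) for n in outs)

        return {u: longest_to_sink((u, 0)) for u in ops}

    # e.g. a Cytron-style loop B -> E -> H -> B with one loop-carried edge:
    # lcd_priority(["B", "E", "H"], [("B","E",0), ("E","H",0), ("H","B",1)], theta=1)
    # gives {"B": 5, "E": 4, "H": 3}, so the cycle's operations outrank acyclic ones.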

Fig. 4(a) shows the flow chart of our algorithm. A detailed description of the algorithm is given below.

3.2.1. The Pipeline-Schedule Iteration

The pipeline-schedule iteration is the kernel of our algorithm. It repeats itself for I iterations, each performing: scheduling without considering LCD's, pre-assignment and iterative folding. Scheduling without considering LCD's has been described previously.

Pre-assignment: After some operations have been scheduled into DFGI, we locate those operations which produce data for the operations of DFGI. They have to be executed before their consumers, otherwise the producer-consumer precedence would be violated. In general, an operation has to be pre-assigned to DFGI+1 and is called a seed if it has a d-degree data precedence against an operation scheduled in DFGI-d+1. After the seeds are found, their predecessors in DFGR are included into DFGI+1. Notice that the pre-assigned operations and their consumers form strongly connected components of the DFG.
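The pre-assignment rule can be sketched as follows. The data layout (an edge list plus a map from scheduled operations to their block numbers) is an illustrative assumption, and only 0-degree predecessors are pulled in here.

    from collections import defaultdict

    def pre_assign(edges, block_of, current_block):
        """edges: [(u, v, degree)]; block_of: {scheduled op: block index}."""
        seeds = set()
        for u, v, d in edges:
            # unscheduled producer u with a d-degree edge into block I - d + 1 is a seed
            if d > 0 and u not in block_of and block_of.get(v) == current_block - d + 1:
                seeds.add(u)
        # pull in the unscheduled predecessors of the seeds (0-degree edges only)
        preds = defaultdict(list)
        for u, v, d in edges:
            if d == 0:
                preds[v].append(u)
        stack, pre_assigned = list(seeds), set(seeds)
        while stack:
            v = stack.pop()
            for u in preds[v]:
                if u not in block_of and u not in pre_assigned:
                    pre_assigned.add(u)
                    stack.append(u)
        return pre_assigned        # operations to place in block current_block + 1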


Fig. 4. The flow charts of the loop winding algorithm: (a) The construction phase, (b) The refinement phase.

Iterative folding: We say that a schedule is convergent if every operation is assigned to a legal step. After some operations have been pre-assigned, the resulting graph may not be schedulable into the execution window (not convergent). During the iterative folding phase, we re-organize the DFGs until a feasible schedule is found. Three subtasks are performed:

1) Convergency test: If the result of applying a list scheduling to DFGS converges, we proceed to the next iteration. Otherwise, we have to decide either to do folding or to give up (increase the latency).

2) Folding criterion: Once we decide to perform a folding, those operations which have been scheduled beyond the execution window are folded (see the sketch after this list). When an operation is folded, it is moved from DFGi to DFGi+1. An arc is introduced between o1 in DFGi1 and o2 in DFGi2 if there is an (i2-i1)-degree precedence between o1 and o2 in the DFG.

3) Stopping criterion: Theoretically, only if every operation of DFGS had been folded at least once would it be impossible to find a solution with any additional tries. However, it is unrealistic to wait for this, because DFGS has a lot of operations to be checked and the predecessors of the operations in the strongly connected components are rarely folded. A better approach is to test only the operations of the strongly connected components. Since the length of the longest cycle is no more than Lmax, we terminate the folding iteration when the number of iterations exceeds Lmax.
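A sketch of the folding step referred to in criterion 2 above; the dictionaries and the return values are illustrative assumptions, not the authors' data structures.

    def fold(block_of, step_of, latency, edges):
        """Move every op scheduled past the execution window into the next block."""
        folded = [op for op, step in step_of.items() if step > latency]
        for op in folded:
            block_of[op] += 1
            del step_of[op]                    # will be rescheduled in the next block
        # arcs between blocks i1 and i2 stand for (i2 - i1)-degree precedences of the DFG
        arcs = [(u, v) for u, v, d in edges
                if u in block_of and v in block_of
                and block_of[v] - block_of[u] == d]
        return folded, arcs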

3.2.2. Example

We show how our idea works on the example depicted in Fig. 5(a). This example is borrowed from [11] and is used by [8] for optimal loop parallelization. The DFG contains 17 operations. Its solid and dashed arcs denote data precedences of degree 0 and 1, respectively. Suppose there are no resource constraints. The lower bound on the achievable latency is 3 because there exists a loop (B@1 → E@1 → H@1 → B@2) whose execution delay is 3 and whose accumulated degree is 1. The forward scheduling will schedule A, B, ..., and L into DFG1 (Fig. 5(b)). To meet the data precedence constraints, R and P must be scheduled into DFG2. Moreover, P@1 (R@1) must be scheduled one step earlier than J@2 (E@2). Accordingly, M and N also must be scheduled into DFG2 because they are the predecessors of P and R, respectively. By combining DFG1, the graph induced by {A, B, C, D, E, F, G, H, I, J, K, L}, and DFG2, the graph induced by {M, N, P, R}, with the arcs P@1 → J@2 and R@1 → E@2, we get the partial graph DFGS shown in Fig. 5(b). When we perform a list scheduling on the DFGS of Fig. 5(b), we find that the schedule cannot converge within the execution window (Fig. 5(c)). To find a feasible solution, those operations (E, H and J) scheduled beyond the loop window are moved to the next block and new arcs (E → C, H → B and J → F) are introduced to specify the loop carried dependences (Fig. 5(d)). By repeating this process, we find that G also has to be folded to the second block. The iterative adjustment phase then stops because the convergency test succeeds. The next forward scheduling will schedule Q into DFG2. Because there is no unscheduled operation left, the scheduling terminates with a latency of 3 and a delay of 6 (Fig. 5(e)). Fig. 5(f) shows the result by [8], which has a much longer delay.


Fig. 5. Example to illustrate the loop winding problem and the proposed method. (a) A DFG to be scheduled. (b) The DFGS after forward scheduling and the pre-assignment phase. (c) Result of the convergency test. (d) After folding is introduced. (e) The final result. (f) The result of OPT [8].

3.2.3. The Minimum-Latency Iteration

We perform the minimum-latency iteration to find the minimum achievable latency. It consists of three subtasks: 1) estimate Lmin, the lower bound on the latency, 2) estimate Lmax, the upper bound on the latency, and 3) repeatedly choose a latency and perform the pipeline-schedule iteration until a solution is found.

1) Estimate Lmin: Lmin is determined by the resource constraints [2] and the length of the longest cycle in the DFG [12]. The cycles in the DFG can be discovered by performing a breadth-first leveling on the 0-degree subgraph. Let the level of node i be li. Each backward edge oi → oj with degree d must be associated with a cycle whose length is Cl = li - lj + 1. The lower bound is the maximum of Cl / d (rounded up) over all such cycles.

2) Estimate Lmax: We set Lmax to be the number of control steps needed by performing a list scheduling on the un-folded DFG.

3) Find the minimum latency: Obviously, the achievable latency falls within [Lmin, Lmax]. Although in the worst case we have to try every latency within [Lmin, Lmax], the complexity can be tempered by a binary search which requires O(log n) tries. However, because the achievable latency is usually close to Lmin [12], we start with Lmin. If the pipeline-schedule iteration determines that a schedule with this latency is not achievable, the latency is increased by 1. This process continues until a feasible solution is found. Experiments show that this linear search strategy takes only a few iterations in most cases.
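A sketch of subtasks 1 and 3, assuming the max-of-ceil(Cl/d) reading of the cycle bound and an assumed helper pipeline_schedule_iteration() that returns a schedule or None:

    import math

    def cycle_lower_bound(levels, backward_edges):
        """levels: {node: l_i} from the 0-degree leveling; backward_edges: [(i, j, degree)]."""
        return max((math.ceil((levels[i] - levels[j] + 1) / d)
                    for i, j, d in backward_edges), default=1)

    def find_minimum_latency(l_min, l_max, pipeline_schedule_iteration):
        """Linear search from the lower bound, as described above."""
        for latency in range(l_min, l_max + 1):
            schedule = pipeline_schedule_iteration(latency)
            if schedule is not None:
                return latency, schedule
        return l_max, None          # fall back to the upper bound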

3.2.4. Complexity of the Construction Phase

The construction phase consists of three nested loops: 1) the folding iteration, 2) the pipeline-schedule iteration and 3) the minimum-latency iteration. The folding iteration repeats at most L times, where each iteration employs a list scheduling. Therefore, its complexity is O(L n log n). The complexities of the pre-assignment and the basic scheduling are O(n) and O(n log n), respectively. Therefore, the overall complexity of the pipeline-schedule iteration is dominated by the folding iteration, which is O(L n log n). Consequently, the complexity of the minimum-latency iteration is O(L n log^2 n) in the worst case and O(L n log n) in the average case.

3.3. Post Optimization

A schedule with a shorter delay may exist due to two facts about the construction phase: First, the priority is biased toward the incoming operations. Second, the operation distribution in the earlier subgraphs is more crowded than that in the later subgraphs.

The refinement algorithm is depicted in Fig. 4(b). It tries to reduce the delay by one control step at a time by performing the following five subtasks:

1) Computing the shigh's of the operations of the last subgraph (DFGlast): A backward scheduling is performed on the operations of DFGlast, starting from the last control step Slast. The shigh's of the operations of DFGlast are recomputed whenever a change occurs in DFGlast.

2) Convergency test: A forward scheduling is performed on DFGS using a freedom-based priority function. This differs from the construction phase, where preference is given to the incoming operations.

3) Folding criterion: If the forward scheduling fails, one of the sink operations on the longest path of DFGS is folded. Consequently, operations are moved gradually from the earlier subgraphs to the later subgraphs. This strategy reduces the length of the critical paths of DFGS and, hence, results in larger freedom for scheduling.

4) Decreasing the delay: When the convergency test in (2) succeeds, we decrease Slast and proceed to find a shorter delay. Associated with each reduction in Slast, a backward folding is performed to move the source nodes of DFGlast to DFGlast-1. This strategy breaks the restriction imposed by the construction phase. If Slast = 0, which implies that DFGlast is empty, we eliminate the entire block by decreasing I by 1 and setting Slast to L.

5) Stopping rule: The refinement phase stops when the backward scheduling fails. A backward schedule on DFGlast will fail if too many operations have been folded into DFGlast. Since a folding iteration folds at least one operation to the next block and the number of operations folded during the backward folding is very limited, the backward scheduling is assured to fail in no more than I x n iterations. Hence, the complexity of the refinement phase is O(n^2 log n).
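A highly abstracted sketch of the refinement loop of Fig. 4(b). All helpers are assumed callbacks supplied by the caller, the schedule is assumed to be a dictionary mapping operations to steps, and the bound bookkeeping described above is omitted.

    def refine(schedule, delay, backward_ok, forward_ok, fold_sink, fold_back):
        """Shrink the delay one step at a time until the backward scheduling fails."""
        best = (dict(schedule), delay)
        while backward_ok(schedule, delay):              # stopping rule: quit when this fails
            if forward_ok(schedule, delay):              # convergency test succeeded
                best = (dict(schedule), delay)           # remember the feasible schedule
                fold_back(schedule)                      # backward folding of the last block
                delay -= 1                               # and try a shorter delay
            else:
                fold_sink(schedule)                      # fold a sink op on the longest path
        return best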

4. EXPERIMENTAL RESULTS

The proposed method has been implemented on a VAX 11/8550 computer running the ULTRIX operating system. It is called PLS (PipeLining Scheduler) and is integrated into a high-level synthesis system, THEDA.HLS, developed by the Electrical Design Automation group at Tsing Hua University. In addition to PLS, THEDA.HLS contains an optimal scheduler, ALPS [13] (an integer linear programming scheduler). PLS passes the scheduled result to a data path allocator, STAR [14], which constructs the RTL architecture. We compare the results of PLS with those of other systems on the following benchmarks: (1) the fifth order digital wave filter (assuming no LCD's), (2) the fast discrete cosine transform (without LCD's), (3) Cytron's example (with LCD's) and (4) the fifth order digital wave filter (with LCD's). In order to have a fair comparison, we do not restructure the DFG in any way. We assume that multipliers take two cycles while adders and subtractors take one cycle each to complete. The optimum solutions by ALPS are also presented. Experiments indicate that PLS obtains near-optimal solutions in most cases.

4.1. The Fifth Order Digital Wave Filter [15] With No LCD's

We assume that there are no data dependences between the algorithm instances. Under the resource constraints, we have optimally minimized both the latency and the delay for all the cases. The results by ILP, PLS and Sehwa are shown in the first part of Table I. PLS consumed less than 10 CPU seconds for all the cases. The second part of Table I shows the results by a time-constrained functional pipelining scheduler [5]. The clock cycle is lengthened so that a multiplication takes only one cycle while two additions can be executed within a cycle (chaining). The schedules are targeted towards a delay time of 10 cycles.


TABLE I. The Pipelined Synthesis of the Fifth Order Filter With Pipelined Multipliers.

Although the design styles are different, PLS's schedules have a better throughput-area ratio.

4.2. Fast Discrete Cosine Transform (FDCT)

This example was found in [3]. It contains 42 operations: 16 multiplications, 13 additions and 13 subtractions. The length of the critical path is 8. Therefore, the maximal degree of parallelism is 5.3. Since the number of 0-1 variables contributed by each operation in an ILP formulation is (li - si + 1), the large number of operations with large mobility results in a very large solution space. This prevents a solution from being found optimally by the ILP method. Table II shows the results by PLS and Sehwa.


In this example, buses are also specified as part of the resource constraints. We employ a two-phase clocking scheme where the read and write phases are interleaved. The entries marked with a "-" indicate that Sehwa fails to find a schedule of minimum latency for those cases. Mallon and Denyer [3] use the case with 2 adders, 2 multipliers and 2 subtractors to illustrate their algorithm. Starting from an initial solution obtained by ASAP scheduling, they apply an optimization strategy that tries to compress the initial solution by swapping folded operations with others of the same type that have a lower folding index. The latency and delay of their initial solution are 8 and 14, respectively. The delay of their final solution is 12. However, this result is based on the assumption that a multiplier and an adder (or a subtractor) can be chained within a cycle. PLS obtains a result with delay 11 under the assumption that the multiplier takes one cycle to complete.

TABLE II. The Pipelined Synthesis of the Fast Discrete Cosine Transform With Pipelined Multipliers.

4.3. Cytron's Example

Cytron's example [11] is shown in Fig. 5(a). Given the constraint on the number of operators, we compare the results with those by the percolation based scheduler (PBS) [7] and ATOMICS [10] (Table III). We have achieved both minimum latency and minimum delay for all the cases except the one marked with a "*". The PBS begins with finding the maximum parallelism pattern (Fig. 5(f)) and then applies a heuristic to adjust the optimal schedule to fit the constraints. Although the minimum latency is achieved, it scatters the code across 5 iterations. So, it has a longer delay and needs a larger overhead for the controller.

4.4. The Fifth Order Digital Wave Filter With LCD's

This example is particularly interesting because it has a relatively large number of LCD's as well as intra-loop dependences. It has 34 operations. The critical path length of the 0-degree subgraph is 17 and the inherent latency (Lmin) is 16. So, the maximal degree of parallelism is 2.1. The operations on the critical paths are also on the critical loops, so the performance is dictated by the precedence constraints. Given the constraints on the number of adders, multipliers and buses, Tables IV and V show the results of Spaid, PLS and ALPS using a non-pipelined and a pipelined multiplier, respectively. Note that Spaid re-organizes (retimes) the DFG before performing the scheduling, while we perform retiming and scheduling simultaneously.

5. CONCLUSIONS

In this paper, we have presented a scheduling algorithm for pipelining the execution of a loop in the presence of loop carried dependences. Our method optimizes both the initiation interval and the turn-around time of the pipeline. Given the constraints on the number of functional units and buses, we first determine an initiation interval and then incrementally partition the operations into several blocks. A refinement algorithm is incorporated to improve the turn-around time.

TABLE III. Cytron's Example.

Experiments on benchmark examples show that the new approach gains a considerable improvement over previous approaches.

REFERENCES

[1] M.C. McFarland, A.C. Parker and R. Camposano, "The High-Level Synthesis of Digital Systems", Proceedings of the IEEE, pp. 301-318, February 1990.
[2] N. Park and A.C. Parker, "Sehwa: A Software Package for Synthesis of Pipelines from Behavioral Specifications", IEEE Trans. on Computer-Aided Design, pp. 356-370, March 1988.
[3] D.J. Mallon and P.B. Denyer, "A New Approach To Pipeline Optimisation", Proc. of the European Design Automation Conference, March 1990.
[4] P.G. Paulin and J.P. Knight, "Force-Directed Scheduling for the Behavioral Synthesis of ASICs", IEEE Trans. on Computer-Aided Design, pp. 661-679, June 1989.

[5] Ki Soo Hwang, Albert E. Casavant, Ching-Tang Chang and Manuel A. d'Abreu, "Scheduling and Hardware Sharing in Pipelined Data Paths", Proc. of ICCAD-89, pp. 24-27, November 1989.
[6] E.M. Girczyc, "Loop Winding - A Data Flow Approach to Functional Pipelining", Proceedings of the IEEE ISCAS, pp. 382-385, May 1987.
[7] R. Potasman, J. Lis, A. Aiken and A. Nicolau, "Percolation Based Synthesis", Proc. of the 27th Design Automation Conference, pp. 444-449, 1990.
[8] A. Aiken and A. Nicolau, "Optimal Loop Parallelization", in Proc. of the 1988 ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, 1988.
[9] B.S. Haroun and M.I. Elmasry, "Architectural Synthesis for DSP Silicon Compilers", IEEE Trans. on Computer-Aided Design, pp. 431-447, April 1989.
[10] G. Goossens, J. Rabaey, J. Vandewalle and H. De Man, "An Efficient Microcode Compiler for Application Specific DSP Processors", IEEE Trans. on Computer-Aided Design, pp. 925-937, June 1990.

TABLE IV Fifth Order Digital Wave Filter With non-Pipelined Multiplier.

[11] R. Cytron, "Compile-time Scheduling and Optimization for Asynchronous Machines", Ph.D. thesis, Univ. of Illinois at Urbana-Champaign, 1984.
[12] M.S. Lam, "A Systolic Array Optimizing Compiler", Ph.D. thesis, Carnegie Mellon University, 1989.
[13] C.T. Hwang, Y.C. Hsu and Y.L. Lin, "Optimum and Heuristic Data Path Scheduling under Resource Constraints", Proc. of the 27th Design Automation Conference, pp. 65-70, July 1990.

TABLE V Fifth Order Digital Wave Filter with Pipelined Multiplier.

[14] F.S. Tsai and Y.C. Hsu, "Data Path Construction and Refinement", Proc. of ICCAD-90.
[15] S.Y. Kung, H.J. Whitehouse and T. Kailath, "VLSI and Modern Signal Processing", Prentice Hall, pp. 258-264, 1985.

