Improving Nested Loops’ ILP on a Parallel ASIC Design
Robert Light, Wayne Maxfield, Bryan Reed, Nelson Passos
Department of Computer Science
Midwestern State University
Wichita Falls, TX, 76308

Edwin H.-M. Sha
Dept of Computer Science & Eng
University of Notre Dame
Notre Dame, IN, 46556
Abstract - Multi-dimensional applications such as satellite image processing, fluid mechanics, and medical imaging require high computer performance. VLSI implementations of parallel Application Specific Integrated Circuits (ASICs) with a limited number of processing units are commonly used to improve the performance of such computation-intensive applications. The critical sections of such applications consist of nested loops with the possibility of embedded conditional branch instructions. Branch predication techniques utilize predicate registers to control the validity of speculatively computed results. The use of such registers is a significant factor in the performance gain achievable by the overlap of successive iterations of a loop. This paper shows a process of dimensioning and scheduling those registers while scheduling the operations of the loop in a constrained parallel resource environment.

1. Introduction
Multi-dimensional applications such as satellite image processing, fluid mechanics, and medical imaging require high computer performance. The critical sections of such applications consist of nested loops with the possibility of embedded conditional branch instructions. Current commercial systems use branch predication and software pipelining techniques, which can also be applied in the design of ASIC systems. Branch predication techniques utilize predicate registers to control the validity of computed results [16]. The optimized design and allocation of such registers then becomes a significant factor in the expected performance gain. Software pipelining is used to overlap consecutive loop iterations in order to reduce the overall execution time [10]. This paper utilizes loop pipelining across different dimensions in order to schedule the execution of the repetitive operations in a short schedule length, while considering the resource constraints and the allocation of the predicate registers. A modified multi-dimensional retiming scheduling technique is used to improve the performance of the system and to specify the characteristics of the predicate registers, reducing the overall execution time of the nested loops found in the critical sections of those applications.
Most of the previous results on scheduling loops are solely based on one-dimensional uniform problems, which do not consider the effects of conditional branches [2, 3, 6, 11, 21]. This study focuses on the parallelism inherent to multi-dimensional applications involving branch decisions, which is ignored by the one-dimensional methods [9, 10]. Some recent research has been conducted in the scheduling of multi-dimensional applications, such as the affine-by-statement technique [4] and the index shift method [14]. However, these methods are targeted to uniform loops. Other methods focus on multi-processor scheduling and are not applicable to the problems covered in this paper [7, 8, 12, 15, 17].

Software pipelining involves the overlapped execution of consecutive loop iterations in order to minimize the computational execution time. The existence of branches or conditional statements in the code can be handled by different techniques [23, 24]. However, the overlapped iterations must come from successive iterations defined by the original loop code. Multi-dimensional retiming produces overlapped iterations just as those found in software pipelining solutions, but it considers the overlap of iterations through different dimensions of the iteration space [19]. This paper uses multi-dimensional retiming in order to solve a broader range of problems involving nested loops.

The concept of multi-dimensional retiming was described in [18, 19]. That technique allows the restructuring of the loop body, represented by a general form of multi-dimensional data flow graph (MDFG), while preserving data dependencies. In an MDFG, the nodes represent the operations and the edges represent the data dependencies between operations. Branch predication techniques use if-conversion methods to transform the conditional code into straight-line code, which uses Boolean guards to implement the conditional execution of individual operations [1, 16, 23, 24]. Such a transformation reduces existing control dependencies to more common data dependencies. However, it requires the use of special registers, commonly named predicate registers, to store the Boolean values.

This paper develops a new method for scheduling cyclic MDFGs with a finite number of non-pipelined processing resources available [20]. Under the assumption that each functional unit can execute in one time unit, the schedule length is associated with the number of control steps.
Figure 1. MDFG representing an edge detector filter

As an example, a two-dimensional problem consisting of an edge detection algorithm based on the computation of a Laplace edge enhancement is represented in figure 1 [13]. While the example, for simplicity, shows an acyclic graph, the solution proposed in this paper was developed to work for either acyclic or cyclic graphs. In a processor with one arithmetic logic unit, one can easily conclude that the minimum schedule length would be equivalent to 8 control steps. The schedule length is reduced to 6 control steps, assuming that two computational units are available and are able to execute one operation in one control step. This reduction could be obtained by the application of a traditional list scheduling method [5], as shown in figure 2(a).

(a)
Control Step   1      2      3      4      5         6
Tasks          A, B   C, D   E      F      G or H    X or Y

(b)
Control Step   1      2      3      4         5
Tasks          A, B   C, D   E, F   G or H    X or Y

Figure 2. (a) Initial schedule (b) Schedule after retiming

Figure 2(b) shows a more compact schedule in which node F has been rescheduled on the third control step. This new schedule can be associated with the retimed graph presented in figure 3, where the two-dimensional delay (0,1) was pushed through nodes A, B, C, D and E of the original MDFG. (Note that (0,1) is arbitrarily chosen here; the vector (1,0) is also possible, as are many others.) Intuitively, the retimed nodes no longer precede the execution of F within the same iteration.

Figure 3. Retimed graph
Such a technique is implemented through a polynomial-time scheduling algorithm, resulting from a modification to the push-up scheduling algorithm presented in [20]. In this algorithm, nodes are selected to be assigned to functional units, and pushed up to earlier control steps if the required functional unit is available. The push-up operation will activate an implicit multi-dimensional retiming if necessary in order to do the earliest assignment. A significant problem in this mechanism is the handling of the required predicate register. The modified version of the algorithm allows the designer to dimension the predicate registers in such a way that overlapped iterations do not interfere with each other. It also schedules the predicated instructions in such a way that an entire line of the schedule can be invalidated and ignored during the execution of the program.

The next section establishes some of the basic concepts to be used in this paper, including an overview of multi-dimensional retiming. Section 3 introduces the fundamentals of the modified multi-dimensional push-up scheduling technique, designated in this paper as predicated push-up scheduling, and presents the implementing algorithm. Section 4 shows an example of applying this new technique. A final section summarizes the concepts presented.

2. Basic Principles

As mentioned in the previous section, we use a multi-dimensional data flow graph (MDFG) to model the problems to be scheduled. An MDFG G consists of a tuple (V,E,d,t), where V is the set of computation nodes, E represents the set of dependence edges, d is a function representing the multi-dimensional delays between two nodes, and t is a function representing the computation time of each node, assumed to be one time unit in this paper.

A multi-dimensional retiming r is a function from V to Z^n that redistributes the nodes in a dependence graph created by the replication of an MDFG G. A new MDFG Gr is created, such that each iteration still has one execution of each node in G. The retiming vector r(u) of a node u ∈ G represents the offset between the original iteration containing u and the one after retiming. After retiming, the execution of node u in iteration i is moved to the iteration i - r(u). The chained multi-dimensional retiming technique is one of the possible methods able to compute a legal multi-dimensional retiming for an MDFG. That method, however, targets the optimization of a system with infinite resources.

Predication - Predicated execution allows the speculative execution of instructions based on Boolean guards implemented as predicate registers. Using predication, conditional instructions can be substituted by Boolean operations whose result is stored in the predicate registers in order to validate speculated instructions. This transformation of the code also translates control dependencies into more regular data dependencies. Such a transformation is accomplished by the use of a technique called if-conversion. Instructions associated with a predicate that is TRUE are executed normally, while those whose predicate is FALSE are discarded. Predication allows separate control flow paths to be overlapped and simultaneously executed in a multiprocessing architecture. The predicate for each operation in our example is represented by the result of the comparison instruction (node F, stored in some predicate register f). The absence of any predicate information implies that the instruction does not depend on a branch instruction. The predicated code is different from the original code, implying that all instructions should be understood as straight-line code.
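To make the if-conversion step concrete, the following is a minimal sketch in Python. The arithmetic performed by G, H, X and Y is purely hypothetical (the paper does not specify the operations of the edge detector), and the names `branchy` and `if_converted` are illustrative. The sketch only shows how a branch decided by F becomes a predicate f that guards straight-line operations.

```python
def branchy(e: int, threshold: int) -> int:
    """Original form: the comparison controls which path is executed."""
    if e > threshold:       # node F: the branch decision
        g = e + 1           # path p, hypothetical operation G
        return g * 2        # hypothetical operation X
    h = e - 1               # path q, hypothetical operation H
    return h * 3            # hypothetical operation Y


def if_converted(e: int, threshold: int) -> int:
    """If-converted form: straight-line code guarded by predicate f."""
    f = e > threshold       # node F writes the predicate register f
    g = e + 1               # G | f   (computed speculatively)
    h = e - 1               # H | !f  (computed speculatively)
    x = g * 2               # X | f
    y = h * 3               # Y | !f
    # Results whose predicate is FALSE are discarded.
    return x if f else y
```

Both functions return the same value for every input; the second form has no control dependence on F, only a data dependence on f.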
By using retiming or any other software pipelining technique that overlaps multiple iterations of a loop, the designer introduces a new hazard in the execution of the code, caused by the possibility of a redefinition of the contents of the predicate registers before they can be used. Such registers must therefore be designed in a way that prevents the hazard. Section 3 shows how to avoid this problem by using shift registers correctly dimensioned according to the schedule being executed.

The Push-Up Scheduling Technique

Consider the example presented in figure 1. Assume that the target processor has only two functional units, each able to execute an operation in the same amount of time, hereafter designated one control step. The first control step of the execution schedule will have those two functional units available, and nodes A and B can be scheduled to them. In the second control step, nodes C and D are ready for execution. We say that C and D are schedulable nodes if they satisfy one of the conditions below:

• they have no incoming edges
• all their incoming edges have a non-zero multi-dimensional delay
• all their predecessors connected by a zero-delay edge have been scheduled to earlier control steps
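The three conditions above can be checked with a small predicate. The sketch below is only illustrative: the function name, the edge representation (source, target, delay vector) and the two edges A → D and B → D are assumptions drawn from the surrounding description, not from the original figure.

```python
from typing import Dict, List, Tuple

Delay = Tuple[int, ...]
Edge = Tuple[str, str, Delay]   # (source, target, delay vector), hypothetical layout


def is_schedulable(
    node: str,
    edges: List[Edge],
    scheduled_step: Dict[str, int],
    current_step: int,
) -> bool:
    """True if `node` may be placed at `current_step`."""
    # Only zero-delay incoming edges constrain the current iteration.
    zero_preds = [u for (u, v, d) in edges if v == node and not any(d)]
    # Conditions 1 and 2: no incoming edges, or only non-zero-delay edges.
    if not zero_preds:
        return True
    # Condition 3: every zero-delay predecessor ran in an earlier control step.
    return all(
        u in scheduled_step and scheduled_step[u] < current_step
        for u in zero_preds
    )


# Assumed fragment of figure 1: D depends on A and B, C has no predecessors.
EXAMPLE_EDGES: List[Edge] = [("A", "D", (0, 0)), ("B", "D", (0, 0))]
```

With A and B placed at control step 1, both C and D pass the check for control step 2, which matches the discussion of the example.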
It is easy to verify that C and D are schedulable nodes by checking their incoming edges. The non-existence of incoming edges for C and the previous schedule assignment of nodes A and B, predecessors of D, imply that their required input data has been produced in some previous step and is available from some storage mechanism, such as a register file or memory. C and D are then assigned to control step 2, as shown in figure 2. After successive retimings, the resulting graph will allow node F to be scheduled at control step 3. Two functions are used to provide the information required by this algorithm:

• ES(u), which returns the first control step following the end of the execution of all predecessors of u connected by a zero-delay edge, mathematically represented as: ES(u) = max { 1, ES(vi) + t(vi) }, for all vi preceding u by an edge ei such that d(ei) = (0,0,...,0).
• AVAIL(fu), which returns the earliest control step cs at which no node has been assigned to functional unit fu.
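As an illustration of the ES definition, the sketch below evaluates ES(u) = max{1, ES(vi) + t(vi)} over the zero-delay predecessors of each node. The edge list is an assumption: the text does not enumerate the edges of figure 1, so the list was chosen to be consistent with the dependencies the paper mentions (D after A and B, F after E, G and H after F, X and Y after G and H).

```python
from typing import Dict, List, Tuple


def earliest_start(
    nodes: List[str],
    zero_delay_edges: List[Tuple[str, str]],
    t: Dict[str, int],
) -> Dict[str, int]:
    """ES(u) = max{1, ES(v) + t(v)} over zero-delay predecessors v of u."""
    preds: Dict[str, List[str]] = {u: [] for u in nodes}
    for src, dst in zero_delay_edges:
        preds[dst].append(src)

    es: Dict[str, int] = {}

    def visit(u: str) -> int:
        # Assumes the zero-delay subgraph is acyclic.
        if u not in es:
            es[u] = max([1] + [visit(p) + t[p] for p in preds[u]])
        return es[u]

    for u in nodes:
        visit(u)
    return es


# Assumed zero-delay edges for the graph of figure 1 (not given in the text).
NODES = list("ABCDEFGHXY")
EDGES = [("A", "D"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F"),
         ("F", "G"), ("F", "H"), ("G", "X"), ("H", "Y")]
ES = earliest_start(NODES, EDGES, {u: 1 for u in NODES})
print(ES["F"], ES["X"])
```

Under these assumed edges the sketch gives ES(F) = 4 and ES(X) = 6, in agreement with the six-step initial schedule.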
Considering such definitions, the need for a multi-dimensional retiming is expressed by the lemma below:

Lemma 1. Given an MDFG G = (V,E,d,t) and an edge e: u → v, such that v can be scheduled to ES(v) and d(e) = (0,0,...,0), then a multi-dimensional retiming of u is required if ES(v) > AVAIL(fu), where fu is any functional unit able to execute the operation represented by v.

We already know that when retiming a node, we may need to retime predecessors of this node. Two possible reasons support that need: either the incoming edges of the node to be retimed have zero delays, or they have a delay produced by a previous retiming, required to schedule some other node, which cannot be pushed to the outgoing edges. Therefore, we need to be sure that all nodes in the graph are correctly retimed such that delays are placed in the required edges. An efficient way to accomplish such a task was used in the push-up scheduling technique through a multi-dimensional delay counting function MC. The function MC(v) gives the upper bound on the number of extra non-zero delays required by v along any path p leading to node v with d(p) = (0,...,0).
Let us re-examine our initial example in figure 1. If we traverse the graph starting from the initial schedulable nodes, we must begin with nodes A and B. We also know that the initial value of the function MC is zero for any node in the graph. Nodes C, D and E are scheduled at control steps 2 and 3. When trying to schedule node F we are already at control step 4, i.e., ES(F) = 4. However, functional unit 2 is available at an earlier control step, i.e., AVAIL(fu) = 3. Selecting to schedule F at control step 3, with ES(F) > AVAIL(fu), implies according to lemma 1 that E must be retimed. This results in an extra delay on the edge E → F, changing MC(F) to 1. Later, instructions G and H, which depend on the predicate computed by F, are scheduled at control step 4, while X and Y are scheduled at control step 5. However, during the execution of the application, only one of the branches will be processed and one of the functional units will remain idle during the execution of steps 4 and 5.

3. Predicated Push-Up Scheduling

The original push-up scheduling technique is applicable to uniform loops with no conditional statements [20]. The introduction of conditionals in the loop prevents the regular use of the push-up technique in if-converted code, because of two types of execution hazards caused by the new data dependencies created from the control flow dependencies. These two significant performance hazards, named in this paper control anticipation and idle resources, appear in situations where the loop iterations are overlapped and predicate registers are associated with a single instance of a control instruction. Given a branch statement S controlling the execution of two alternate paths p and q within a loop, and denoting by Xi the occurrence of event X at iteration i, the new hazards can be defined as below.

Definition 1. A control anticipation hazard occurs when an operation in pi or qi is scheduled before Si.

Definition 2. An idle resource hazard occurs when two operations belonging to the same path p (or q) are, due to existing data dependencies, scheduled in successive control steps in parallel with operations from q (or p). In this case, after S has been computed, the processor responsible for one of those paths will remain idle.

In order to avoid control anticipation hazards, the decisions taken at every instance of instruction S need to be stored along multiple iterations in a queue-like structure. This can be easily implemented by using shift registers dimensioned according to the number of overlapped iterations. The use of these registers, however, does not follow a strict queue discipline, since some instructions may depend on the contents of the head of the queue while others may depend on intermediate elements. The shift operation occurs at the end of each iteration, when the head of the queue is discarded and all other elements are moved forward in preparation for the next iteration.

The second group of hazards, idle resources, can be handled by scheduling in parallel (same control step) operations that belong to the same branch path and iteration instance, or that are not dependent on a previous decision. This paper does not present the solution for this problem, which can be tackled by using resource-sharing techniques as described in [22].

The algorithm MDRPRED, used to implement the new predicated push-up scheduling, combines the multi-dimensional retiming technique and a predicate register allocation method. The algorithm can be summarized in the pseudo-code below:

Algorithm MDRPRED (G = (V,E,d,t))
  Choose s = (s1,s2,…,sn) | s • d(e) > 0 for any e ∈ E
  Choose r such that r ⊥ s
  ES(∀ u ∈ V) ← 0
  MC(∀ u ∈ V) ← 0
  MCmax ← 0
  QueueV ← ϕ
  /* remove original edges with non-zero delays */
  ∀ e ∈ E, E ← E - {e, s.t. d(e) ≠ (0,…,0)}
  /* queue schedulable nodes */
  QueueV ← QueueV ∪ {u ∈ V, s.t. indegree(u) = 0}
  while QueueV ≠ ϕ
    get(u, QueueV)
    /* does u need a non-zero delay in its incoming edges */
    if AVAIL(fu) < ES(u)
      /* adjust the MC(u) value */
      MC(u) ← MC(u) + 1
      MCmax ← max{MC(u), MCmax}
    endif
    /* adjust ES(u), schedule u on the first possible control step */
    ES(u) ← AVAIL(fu)
    schedule(u)
    /* propagate the level values to successor nodes of u */
    ∀ v such that u → v
      indegree(v) ← indegree(v) - 1
      ES(v) ← max{ES(v), ES(u) + t(u)}
      /* assume this edge does not require a new delay */
      MC(v) ← max{MC(v), MC(u)}
      /* check for new schedulable nodes */
      if indegree(v) = 0
        QueueV ← QueueV ∪ {v}
      endif
  endwhile
  /* compute the multi-dimensional retiming */
  ∀ u ∈ V, r(u) = (MCmax - MC(u)) · r
  /* compute the size of the shift registers required for the predicate registers */
  ∀ p ∈ P, and ui depends on S, s(p) = max(MC(ui|p) - MC(S)) + 1
End MDRPRED
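The scheduling loop of the pseudo-code can be sketched in Python as follows. This is not the authors' implementation: it assumes a single type of functional unit (any unit can execute any node), unit execution times, control steps counted from 1, and an input that already contains only the zero-delay edges. The choice of s and r and the final retiming and register sizing are left out. The edge lists are again assumed, since the figures are not reproduced in the text.

```python
from collections import deque
from typing import Dict, List, Tuple


def mdrpred_sketch(
    nodes: List[str],
    zero_edges: List[Tuple[str, str]],
    num_units: int,
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (control step, MC value) per node for unit-time operations."""
    succs: Dict[str, List[str]] = {u: [] for u in nodes}
    indegree = {u: 0 for u in nodes}
    for u, v in zero_edges:
        succs[u].append(v)
        indegree[v] += 1

    es = {u: 1 for u in nodes}    # earliest start, control steps count from 1
    mc = {u: 0 for u in nodes}    # delay-counting function MC
    used: Dict[int, int] = {}     # control step -> number of busy units

    def avail() -> int:
        """First control step that still has a free functional unit."""
        step = 1
        while used.get(step, 0) >= num_units:
            step += 1
        return step

    queue = deque(u for u in nodes if indegree[u] == 0)
    while queue:
        u = queue.popleft()
        step = avail()
        if step < es[u]:
            mc[u] += 1            # push-up: an extra delay on incoming edges
        es[u] = step              # schedule u on the first possible step
        used[step] = used.get(step, 0) + 1
        for v in succs[u]:
            indegree[v] -= 1
            es[v] = max(es[v], es[u] + 1)   # t(u) = 1
            mc[v] = max(mc[v], mc[u])
            if indegree[v] == 0:
                queue.append(v)
    return es, mc


# Assumed zero-delay edges of figure 1 and of the modified graph of figure 4.
FIG1_EDGES = [("A", "D"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F"),
              ("F", "G"), ("F", "H"), ("G", "X"), ("H", "Y")]
FIG4_EDGES = [("A", "D"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F"),
              ("F", "G"), ("G", "X")]
STEPS1, MC1 = mdrpred_sketch(list("ABCDEFGHXY"), FIG1_EDGES, num_units=2)
STEPS4, MC4 = mdrpred_sketch(list("ABCDEFGX"), FIG4_EDGES, num_units=2)
```

With two units, the sketch places F at control step 3 and yields a five-step schedule for the first graph and a four-step schedule for the second, matching the schedules discussed in the paper.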
In the algorithm MDRPRED, a queue structure (QueueV) is used to store the schedulable nodes. The edges that require additional delays are found on-the-fly, and are controlled by the MC value. By analysis of the algorithm it is simple to show that, given G = (V,E,d,t), a realizable MDFG representing an n-dimensional loop, and F, a target set of functional units, the multi-dimensional scheduling algorithm MDRPRED transforms G to Gr in O(n|E|) time, while assigning its nodes to available resources of F and allocating the predicate registers. The algorithm is applied to the if-converted code and produces the necessary retiming of the graph representing the loop and the dimensions of the shift register used to store the predicate results. Applying the algorithm to the example of figure 1, scheduled in a two-processor system, the MC values computed are MC(A) = MC(B) = MC(C) = MC(D) = MC(E) = 0, MC(F) = 1, and MC(G) = MC(H) = MC(X) = MC(Y) = 1. Therefore, the dimension of the shift register is 1 (MC(Y) - MC(F) + 1), and the instructions G, H, X and Y depend on that single element of the shift register.
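The last two steps of the algorithm can be reproduced numerically from the MC values. The sketch below assumes that the retiming vector of a node is the base vector r scaled by MCmax - MC(u), and it uses the usual retiming identity d_r(e) = d(e) + r(u) - r(v) for an edge u → v, which the text does not state explicitly. The function names are illustrative.

```python
from typing import Dict, Iterable, Tuple

Vector = Tuple[int, ...]


def retiming_vectors(mc: Dict[str, int], base: Vector) -> Dict[str, Vector]:
    """r(u) = (MCmax - MC(u)) * base, with base the chosen vector r."""
    mc_max = max(mc.values())
    return {u: tuple((mc_max - m) * b for b in base) for u, m in mc.items()}


def retimed_delay(d: Vector, r_u: Vector, r_v: Vector) -> Vector:
    """Assumed identity for an edge u -> v: d_r(e) = d(e) + r(u) - r(v)."""
    return tuple(x + a - b for x, a, b in zip(d, r_u, r_v))


def shift_register_size(
    mc: Dict[str, int], branch: str, dependents: Iterable[str]
) -> int:
    """s(p) = max(MC(ui)) - MC(S) + 1 over the nodes ui guarded by S."""
    return max(mc[u] for u in dependents) - mc[branch] + 1


# MC values reported for the example of figure 1.
MC_FIG1 = {**{u: 0 for u in "ABCDE"}, **{u: 1 for u in "FGHXY"}}
R = retiming_vectors(MC_FIG1, (0, 1))
```

For these values, nodes A to E receive the vector (0,1) and F receives (0,0), so the edge E → F acquires the delay (0,1), and the predicate shift register has size 1.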
4. Example

In the example of figure 1, instructions G, H, X, and Y are used to build the final image of the edge detector. If the programmer had decided on a different coding style, in which the image is initialized to one of the two possible outcomes before executing the loop, then one of those paths would be reduced to zero instructions, as seen in figure 4.

Figure 4. Modified code for the edge detector

(a)
Control Step   1   2   3   4   5     6
Tasks          A   C   E   F   G|f   X|f
               B   D

(b)
Control Step   1   2   3   4
Tasks          A   C   E   G|f
               B   D   F   X|f

Figure 5. Modified edge detector (a) Initial schedule (b) Final schedule

Figure 5(a) shows the standard schedule obtained through the use of a list scheduling technique applied to the if-converted code. Usual loop predication methods, using modulo scheduling for example, would allow instruction F to be scheduled at control step 3, but would not be able to optimize instructions G and X, due to the existing data dependence between them and the pending decision on the predicate register. Using the MDRPRED algorithm, the computed MC values are MC(A) = MC(B) = MC(C) = MC(D) = MC(E) = 0, MC(F) = MC(G) = 1, and MC(X) = 2. These values imply that the dimension of the shift register is 2 (MC(X) - MC(F) + 1), and that instruction G depends on the second element of the shift register, while X depends on the head of the queue (shift register). A simple way of looking at this solution is shown in figure 6, where a diagram shows the occupation of each unit and of the shift register for some iteration k.

Processing unit 1   Processing unit 2   Shift register
Ak                  Bk                  fk-2  ?
Ck                  Dk                  fk-2  ?
Ek                  Fk-1                fk-2  fk-1
G|fk-1              X|fk-2              (shift)
Ak+1                Bk+1                fk-1  ?

Figure 6. Diagram showing the use of the predicate shift register

As can be seen, the shift operation in the predicate register discards the obsolete decisions and opens space for the insertion of a new value. It is also important that the shift does not occur at the same time a new predicate is computed. Performing the shift at the end of the iteration allows instructions depending on a branch statement to be scheduled in parallel with, or earlier than, the decision instruction.
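The behaviour of the two-element predicate shift register can be simulated with a few lines of Python. The class and function names below are hypothetical, and the sketch assumes that a newly computed predicate is written to the last slot, that position 0 is the head of the queue, and that the shift happens once at the end of each iteration, as described above.

```python
from typing import List, Optional, Tuple


class PredicateShiftRegister:
    """Queue-like predicate store; position 0 is the head (oldest decision)."""

    def __init__(self, size: int) -> None:
        self.slots: List[Optional[bool]] = [None] * size

    def write(self, value: bool) -> None:
        """Store a newly computed predicate in the last (newest) slot."""
        self.slots[-1] = value

    def read(self, position: int) -> Optional[bool]:
        """Read any element, not only the head."""
        return self.slots[position]

    def shift(self) -> None:
        """End of iteration: drop the head and move the rest forward."""
        self.slots = self.slots[1:] + [None]


def run(decisions: List[bool]) -> List[Tuple[Optional[bool], Optional[bool]]]:
    """Per iteration, record the predicates seen by G (second slot) and X (head)."""
    reg = PredicateShiftRegister(2)
    trace = []
    for decision in decisions:
        reg.write(decision)                      # F computes a new predicate
        trace.append((reg.read(1), reg.read(0)))  # G reads slot 1, X reads head
        reg.shift()                              # shift only at iteration end
    return trace


TRACE = run([True, False, True])
```

In this trace X always sees the decision that G saw one iteration earlier, and the first entry for X is empty because the pipeline is still filling (the "?" entries of figure 6).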
5. Conclusion

Most of the earlier scheduling methods for synthesizing MD systems do not explore loop pipelining across different dimensions. More recent techniques that accomplish that task are restricted to uniform loops without conditional statements. If-conversion and branch predication can be used to adapt the code to those techniques. The problem of allocating the predicate registers to overlapped iterations of the loop becomes a significant parameter in the optimization method. This paper presented a novel technique for scheduling a multi-dimensional data flow graph through the use of a multi-dimensional retiming function combined with predication techniques. This new approach, named the predicated push-up scheduling technique, shows that it is possible to improve the performance of conditional loops by using shift registers as the predicate hosts. The time complexity of the new algorithm is O(n|E|), where n is the number of dimensions and E is the set of edges of the multi-dimensional data flow graph representing the problem. The algorithm was presented in detail with an example showing its application.
References

[1] D. I. August, W. W. Hwu, and S. A. Mahlke, "A Framework for Balancing Control Flow and Predication," in the Proceedings of the 30th International Symposium on Microarchitecture, December 1997.
[2] L.-F. Chao, A. LaPaugh, and E. H.-M. Sha, "Rotation Scheduling: A Loop Pipelining Algorithm," in the Proceedings 30th ACM/IEEE Design Automation Conference, Dallas, TX, June, 1993, pp. 566-572.
[3] L.-F. Chao and E. H.-M. Sha, "Static Schedulings of Uniform Nested Loops," in the Proceedings of 7th International Parallel Processing Symposium, Newport Beach, CA, April, 1993, pp. 1421-1424.
[4] A. Darte and Y. Robert, "Constructive Methods for Scheduling Uniform Loop Nests," in the IEEE Transactions on Parallel and Distributed Systems, August, 1994, Vol. 5, no. 8, pp. 814-822.
[5] S. Davidson, D. Landskov, B. D. Shriver, and P. W. Mallett, "Some Experiments in Local Microcode Compaction for Horizontal Machines," in the IEEE Trans. on Computers, July, 1981, C-30, 7, pp. 460-477.
[6] G. Goosens, J. Wandewalle, and H. de Man, "Loop Optimization in Register Transfer Scheduling for DSP Systems," in the Proceedings ACM/IEEE Design Automation Conference, 1989, pp. 826-831.
[7] R. Gupta, "Loop Displacement: An Approach for Transforming and Scheduling Loops for Parallel Execution," in the ICS 1990, 1990, pp. 388-397.
[8] P. Held, P. Dewilde, E. Deprettere, and P. Wielage, "HIFI: From Parallel Algorithm to Fixed-Size VLSI Processor Array," in the Application-Driven Architecture Synthesis, ed. F. Catthoor and L. Svensson, Norwell, Massachusetts: Kluwer Academic Publishers, 1993, pp. 71-94.
[9] L. Lamport, "The Parallel Execution of DO Loops," in the Communications of the ACM SIGPLAN, 17(2), February, 1974, pp. 82-93.
[10] M. Lam, "Software Pipelining: An Effective Scheduling for VLIW Machines," in the ACM SIGPLAN Conference on Prog. Lang. Design and Implementation, 1988, pp. 318-328.
[11] T.-F. Lee, A. C.-H. Wu, D. D. Gajski, and Y.-L. Lin, "An Effective Methodology for Functional Pipelining," in the Proceedings of the International Conference on Computer Aided Design, December, 1992, pp. 230-233.
[12] H. Li, S. Tandri, M. Stumm and K. C. Sevcik, "Locality and Loop Scheduling on NUMA Multiprocessors," in the Proceedings of the 1993 International Conference on Parallel Processing, 1993, Vol. II, pp. 140-147.
[13] C. A. Lindley, Practical Image Processing in C, New York, NY: John Wiley & Sons, 1991.
[14] L.-S. Liu, C.-W. Ho and J.-P. Sheu, "On the Parallelism of Nested For-Loops Using Index Shift Method," in the Proceedings of the 1990 International Conference on Parallel Processing, 1990, Vol. II, pp. 119-123.
[15] L. E. Lucke and K. K. Parhi, "Generalized ILP Scheduling and Allocation for High-Level DSP Synthesis," in the Proceedings Custom Integrated Circuits Conference, 1993, pp. 5.4.1-5.4.4.
[16] S. A. Mahlke, R. E. Hank, R. A. Bringmann, J. C. Gyllenhaal, D. M. Gallagher, and W. W. Hwu, "Characterizing the Impact of Predicated Execution on Branch Prediction," in the Proceedings of the 27th International Symposium on Microarchitecture, December 1994, pp. 217-227.
[17] K. K. Parhi and D. G. Messerschmitt, "Fully-Static Rate-Optimal Scheduling of Iterative Data-Flow Programs Via Optimum Unfolding," in the Proceedings of the International Conference on Parallel Processing, 1989, Vol. I, pp. 209-216.
[18] N. L. Passos, E. H.-M. Sha, and S. C. Bass, "Loop Pipelining for Scheduling Multi-Dimensional Systems via Rotation," in the Proceedings of 31st Design Automation Conference, 1994, pp. 485-490.
[19] N. L. Passos and E. H.-M. Sha, "Achieving Full Parallelism using Multi-Dimensional Retiming," in the IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 11, November 1996, pp. 1150-1163.
[20] N. L. Passos and E. H.-M. Sha, "Scheduling of Uniform Multi-Dimensional Systems under Resource Constraints," in the IEEE Transactions on VLSI Systems, December, 1998, Volume 6, Number 4, pages 719-730.
[21] R. Potasman, J. Lis, A. Nicolau, and D. Gajski, "Percolation Based Scheduling," in the Proc. ACM/IEEE Design Automation Conference, 1990, pp. 444-449.
[22] J. Siddhiwala and L. F. Chao, "Scheduling Conditional Data-Flow Graphs with Resource Sharing," in the 5th Great Lakes Symposium on VLSI, 1995, pp. 94-97.
[23] N. J. Warter, D. M. Lavery, and W. W. Hwu, "The Benefit of Predicated Execution for Software Pipelining," in the Proceedings of the 26th HICSS, January, 1993, Vol. 1, pp. 497-506.
[24] N. J. Warter, G. E. Haab, K. Subramanian, and J. W. Backhaus, "Enhanced Modulo Scheduling for Loops with Conditional Branches," in the Proc. of 25th Annual ACM/IEEE Int'l Symposium on Microarchitecture, December, 1992, pp. 170-179.