Algorithm and Hardware Support for Branch Anticipation†

Ted Zhihong Yu ‡    Edwin H.-M. Sha ‡    Nelson Passos §    Roy Dz-ching Ju ¶
Abstract. Multi-dimensional systems containing nested loops are widely used to model scientific applications such as image processing, geophysical signal processing and fluid dynamics. However, branches within these loops may degrade the performance of pipelined architectures. This paper presents the theory, supporting hardware and experiments of a novel technique, based on multi-dimensional retiming, for reducing pipeline hazards caused by branches within nested loops. This technique, called Multi-Dimensional Branch Anticipation Scheduling, is able to achieve near-optimal schedule length for nested loops containing branch instructions.

1. Introduction

The time-critical sections of computation-intensive applications, such as satellite image processing, fluid mechanics, multimedia applications and DSP systems, usually consist of nested loops of instructions. By exploiting the parallelism embedded in the repetitive patterns of the loops, the schedule of the operations comprising the loops can be optimized and pipelined. It has been shown that uniform nested loops represented by multi-dimensional data flow graphs can be scheduled optimally by the push-up scheduling technique, based on the concept of multi-dimensional retiming [6]. However, conditional statements in the loops introduce a significant overhead, especially on pipelined architectures. We therefore propose an optimization method called MD branch anticipation. This technique uses loop pipelining across different dimensions (MD retiming) and/or resource sharing for nodes on mutually exclusive paths. We show that, within certain hardware constraints, pipeline stalls caused by branches within loops can be eliminated completely by moving branch instructions sufficiently far ahead. If resource sharing is not allowed, our algorithm achieves the optimal schedule length in polynomial time; if resource sharing is allowed, our algorithm seeks the best sharing of the resources to minimize the schedule length.

† This work was partially supported by the Mensch Graduate Fellowship and the NSF CAREER grant MIP 95-01006. ‡ Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556. § Department of Computer Science, Midwestern State University, Wichita Falls, TX 76308. ¶ California Language Lab, Hewlett-Packard Company, Cupertino, California 95014-9804.

Studies concerning conditional branches have proposed several different solutions. Branch prediction reduces pipeline stalls by using branch history, or simply a guess, to predict the outcome of a condition test; however, such methods cannot guarantee 100% accuracy. In instruction-level parallelism research, branch predication [9] attaches a flag to each predicated instruction in order to remove conditional branches via conditional execution of individual instructions. It usually requires a significant amount of effort in redesigning instruction formats, and since resource sharing cannot be applied, the final schedule is less efficient than ours. Software pipelining [4, 7, 11] is also an approach used to overlap instructions by moving them across iteration boundaries. Most existing techniques, such as modulo scheduling [8], do not explore pipelining across multiple dimensions, and are therefore not as efficient as multi-dimensional retiming in pipelining multi-dimensional applications. The method called GPMB [11] attempts different initiation intervals via exhaustive search to reach the final schedule; it is hard to control code expansion when considering all possible combinations of operations and overlapping the program flow graphs. Previous research on resource sharing has also proposed scheduling techniques for conditional blocks [3, 10]: in [3], scheduling on directed acyclic graphs is considered, and in [10], one-dimensional data-flow graphs are studied. These results also do not consider the hardware required to propagate the branch decisions along the final schedule. In the direction of multi-dimensional retiming, [5] does not take resource constraints into account. Push-up scheduling [6] presented an optimal resource-constrained scheduling method for multi-dimensional applications, but the problem of dealing with conditional constructs with hardware support for nested loops remained open.
This paper focuses on the parallelism inherent in uniform nested loops represented by multi-dimensional conditional data flow graphs (MD CdDFGs). The application of a multi-dimensional retiming to an MD CdDFG overlaps the iterations of the original loop body, thus exposing the inherent parallelism. Figure 1 (a) presents the MD CdDFG for the two-dimensional Floyd-Steinberg algorithm [1]. Node c represents the condition (also known as a fork node [3]) and the triangle is a dummy node, representing the join node corresponding to node c. The circled nodes are multiplications, while all the other ones are additions. Note the dashed arrows starting from the fork node c, which clarify the control dependence of w0, w1, z0 and z1 on c. Also, since z0 and z1 are not used by succeeding nodes, there are no outgoing data dependence edges from them. Our method deals with a uni-processor system with multiple pipelined functional units and works well with operations that need several clock cycles to complete execution. We adopt the same leading instruction issue and instruction decode stages, and the trailing memory access and write back stages, for each pipelined functional unit. The difference is in the execution stage. For clarity of explanation, we assume one cycle for addition and three cycles for multiplication in our pipelined design. The data forwarding technique between two functional units in the same ASIC chip is also considered.

The static schedule obtained by our technique is reduced from 15 control steps to 6, a performance gain of about 150%. Using our branch anticipation algorithm, multiple condition signals may overlap if there are enough buffers, called branch anticipation bits (babits), available. The capability of anticipating branch decision outcomes, by carefully scheduling fork nodes, eliminates hardware pipeline stalls completely. The Branch Anticipation control logic adopts a flexible yet simple way to allocate babits for active fork nodes. Since branch control signals from different global conditional blocks do not interfere with each other, an allocation scheme partitioned by these global conditional blocks is adopted for minimum babit consumption.


2. Basic Principles


A multi-dimensional conditional data flow graph (MD CdDFG) G = (V, E, d, t, k) is a node-weighted and edge-weighted directed graph, where V is the set of operations, E ⊆ V × V is the set of dependence edges, d: E → Z^n gives the multi-dimensional delay between two adjacent nodes, with n the number of dimensions, t: V → Z+ gives the computation time of each node, and k maps each node in V to a type from the set {fork, join, alu, mult, div}. The example MD CdDFG for the Floyd-Steinberg algorithm was shown earlier in Figure 1 (a), where circled nodes represent variables obtained from three-cycle multiplications, node c is a fork node, and all other nodes are unit-cycle ALU operations except for the triangular join node. An iteration is equivalent to the execution of each node in V exactly once. Iterations are identified by a multi-dimensional index equivalent to integer coordinates of a Cartesian space, known as the iteration space. Inter-iteration dependencies are represented by delay vectors. A multi-dimensional retiming r is a function from V to Z^n that redistributes the nodes in the iteration space created by replication of an MD CdDFG G. Each iteration, identified by its loop index, contains a copy of G. A new MD CdDFG Gr is created, such that the sum of the delay vectors along any cycle remains unchanged. The retiming vector r(u) of a node u ∈ G represents the offset between the original iteration containing u and the one after retiming. The delay vectors change accordingly to preserve dependencies, i.e., r(u) represents delay components pushed into the edges u → v and subtracted from the edges w → u, where u, v, w ∈ G. After retiming, the execution of node u in iteration i is moved to iteration i − r(u). Traditional retiming techniques [5] only retime data dependence edges.
In this paper, in order to achieve the shortest possible schedule by removing branch hazards, our algorithm also retimes control dependence edges; further detail is covered in Section 3. A legal multi-dimensional retiming on an MD CdDFG G = (V, E, d, t, k) requires that the retimed MD CdDFG be executable. This constraint is enforced through the use of a schedule vector: a feasible linear schedule s is a vector in the normal direction of the hyperplanes that represent the sequence of execution of the iterations in the iteration space. The chained multi-dimensional retiming technique [5] is one possible method for computing a legal multi-dimensional retiming for some MD CdDFGs. It addresses the optimization process aimed at a system with infinite resources, based on the following properties.
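The edge-delay update performed by a retiming can be sketched as follows. This is a minimal illustration with hypothetical names, not the authors' implementation; it applies the standard rule d_r(u → v) = d(u → v) + r(u) − r(v) and checks that the total delay around a cycle is preserved:

```python
from typing import Dict, Tuple

Vec = Tuple[int, int]      # a 2-D delay or retiming vector
Edge = Tuple[str, str]     # directed edge (u, v)

def add(a: Vec, b: Vec) -> Vec:
    return (a[0] + b[0], a[1] + b[1])

def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1])

def retime(d: Dict[Edge, Vec], r: Dict[str, Vec]) -> Dict[Edge, Vec]:
    """Apply d_r(u -> v) = d(u -> v) + r(u) - r(v) to every edge."""
    return {(u, v): add(sub(delay, r.get(v, (0, 0))), r.get(u, (0, 0)))
            for (u, v), delay in d.items()}

# A toy 2-node cycle: u -> v with delay (0, 0), v -> u with delay (1, 1).
d = {("u", "v"): (0, 0), ("v", "u"): (1, 1)}
r = {"u": (1, 0), "v": (0, 0)}     # retime only u, by (1, 0)
dr = retime(d, r)
assert dr[("u", "v")] == (1, 0)    # delay pushed onto u's outgoing edge
assert dr[("v", "u")] == (0, 1)    # and removed from u's incoming edge
# The cycle's total delay is unchanged: (1, 1) before and after.
assert add(dr[("u", "v")], dr[("v", "u")]) == (1, 1)
```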

Figure 1. Floyd-Steinberg MD CdDFG


The static schedule for one iteration of the loop of the Floyd-Steinberg algorithm, obtained by list scheduling, is presented in Figure 1 (b). Since each pipeline has several stages, placing node u at control step i means that its execution stage starts at control step i. Notice the two empty control steps before control step 10; they are due to the 3 clock cycles required to finish a multiplication and forward the result to the respective ALU in our design. The Branch Anticipation method is implemented through a scheduling algorithm, called MDBA. In this algorithm, nodes are selected to be assigned to functional units and are moved to the earliest control steps at which the required functional unit is available. This scheduling operation is recorded in a multi-dimensional retiming function. Skipping the prologue part, the static schedule computed by our Branch Anticipation method for two successive iterations is given in Figure 1 (d), where nodes w0 and w1 share the same ALU because they are mutually exclusive in the execution path. Figure 1 (c) shows the corresponding retimed graph. The multi-dimensional delays between adjacent multiplications and additions eliminate the empty control steps. The nodes with apostrophes represent the ones designated in the succeeding iteration, while the over-lined ones are from other iterations.
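The greedy placement that MDBA builds on can be sketched roughly as below. This is a hypothetical simplification (unit-time nodes, a precomputed priority order, no retiming and no resource sharing), not the paper's algorithm:

```python
def list_schedule(order, fu_of, num_fus, preds):
    """Assign each node the earliest control step at which (a) all of its
    predecessors have finished and (b) a functional unit of its kind is
    free.  Returns {node: (control_step, fu_index)}; nodes take one step."""
    busy = {}                                  # (kind, fu, cs) -> node
    placed = {}
    for u in order:
        cs = 1 + max((placed[p][0] for p in preds.get(u, [])), default=0)
        kind = fu_of[u]
        while True:
            free = [f for f in range(num_fus[kind]) if (kind, f, cs) not in busy]
            if free:
                busy[(kind, free[0], cs)] = u
                placed[u] = (cs, free[0])
                break
            cs += 1                            # unit busy: try the next step
    return placed

# Toy example: p (a multiplication) feeds q (an addition), one unit of each kind.
sched = list_schedule(["p", "q"], {"p": "mult", "q": "alu"},
                      {"mult": 1, "alu": 1}, {"q": ["p"]})
assert sched == {"p": (1, 0), "q": (2, 0)}
```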


Property 2.1 Any vector orthogonal to a schedule vector s that realizes an MD CdDFG G = (V, E, d, t, k) is a legal multi-dimensional retiming of any node of G all of whose incoming edges have non-zero delays.

Property 2.2 If r is a multi-dimensional retiming function orthogonal to a schedule vector s that realizes an MD CdDFG G = (V, E, d, t, k), and u ∈ V, then (m · r)(u) is also a legal multi-dimensional retiming, for m ∈ Z.
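Property 2.2 rests on the fact that a retiming vector orthogonal to s leaves the scalar product s · d(e) of every edge unchanged, so any integer multiple of it stays legal. A small numeric check of this invariant (the vectors here are hypothetical, not taken from the paper's examples):

```python
def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

s = (0, 1)                       # schedule vector
r = (1, 0)                       # orthogonal base retiming: dot(s, r) == 0
assert dot(s, r) == 0

d_e = (2, 3)                     # some edge delay
for m in range(1, 5):            # apply m*r to the edge's source node
    d_r = (d_e[0] + m * r[0], d_e[1] + m * r[1])
    # the s-component of the delay is preserved for every multiple m
    assert dot(s, d_r) == dot(s, d_e)
```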


As we will see later, the MDBA algorithm produces the necessary retiming to obtain the minimal schedule length for a specific resource-constrained system.

3. Multi-Dimensional Branch Anticipation

Figure 2. Supporting routines for Branch Anticipation

Let us begin the discussion of this technique by tracing the scheduling process of the operations represented in the MD CdDFG of Figure 1 (a) and seeing how this graph is transformed into that of Figure 1 (c). We begin by defining the earliest starting time of a node. Given an MD CdDFG G = (V, E, d, t, k) and a node u ∈ V, the earliest starting time for the execution of node u, ES(u), is the first control step following the execution of all predecessors of u reached through a zero-delay edge. This can be written as ES(u) = max{1, ES(v_i) + t'(v_i)} over all v_i preceding u by an edge e_i such that d(e_i) = (0, 0, ..., 0), where t' is the computation time of v_i modified to capture the forwarding characteristics between functional units. We can see that nodes p, r and t are schedulable nodes as defined below:

Definition 3.1 (Scheduling Conditions) Given an MD CdDFG G = (V, E, d, t, k) and a node u ∈ V, u is a schedulable node at control step cs if it satisfies one of the conditions: (1) u has no incoming edges, (2) all incoming edges of u have a non-zero multi-dimensional delay, (3) cs ≥ ES(u).

It is easy to verify that p is a schedulable node and is hence assigned to the multiplier at control step 1, as shown in Figure 1 (d). In the same way, r and t are assigned to control steps 2 and 3. At this point, node q becomes schedulable at control step 4, according to scheduling condition 3 and the two-cycle interval needed for data forwarding. However, the required ALU is available at control step 1. In order to schedule this node to that control step, it is necessary to change the graph in such a way that node q can satisfy scheduling condition 1 or 2. Intuitively, we know that it is impossible to make this node comply with condition 1. Therefore, we must try to retime the graph such that edge p → q will have a multi-dimensional delay. This is feasible by Property 2.2 because a multiple of r is also a legal retiming. This implies pushing multi-dimensional delays from the incoming edge of p to its outgoing edge. In case the incoming edges of a node have zero delays, we need to propagate the retiming to its predecessor nodes. The resulting graph allows scheduling node q at control step 1, as shown in Figure 1 (d). In order to identify earlier scheduling opportunities, one more function is used, providing information about the availability of functional units. Given an operation u and the set of functional units FU = {u_i} on each of which u can be executed, a functional unit fu_j is available at control step cs if no node has been assigned to fu_j at that control step, or node u is mutually exclusive to the node(s) already scheduled on fu_j. The availability function AVAIL(FU, u) returns both the first control step, AVAIL(FU, u).cs, at which one of FU is available, and the selected functional unit, AVAIL(FU, u).fu.

Lemma 3.1 Given an MD CdDFG G = (V, E, d, t, k) and an edge u →e v, such that v can be scheduled at ES(v) and d(e) = (0, 0, ..., 0), a multi-dimensional retiming of u is required if ES(v) > AVAIL(FU, v).cs.

Therefore, we need to be certain that all nodes in the graph are correctly retimed, such that delays are placed on the required edges. In order to accomplish this task efficiently, we use a multi-dimensional delay counting function RC(v), which records the number of times the predecessors of v, including v itself, need to be retimed. Let G = (V, E, d, t, k) be an MD CdDFG, v a node in V, and X a set of edges in E that require an extra multi-dimensional delay. The function RC(v) gives the upper bound on the number of extra non-zero delays required by X along any path p = a_0 →e1 a_1 →e2 ... →ek v, k ≥ 1, with d(e_i) = (0, ..., 0), 1 ≤ i ≤ k. We give two supporting routines for Branch Anticipation in Figure 2, namely PROPAGATE and PROMOTE, in order to explain how RC is computed. The procedure PROMOTE increases the RC value of u if the resource for u is available before ES(u), according to Lemma 3.1. After u is scheduled, its RC value is propagated to its successors by the procedure PROPAGATE. Other variables and operations are explained later. We use the theorem below to compute a multi-dimensional retiming function that places the extra delays on the proper edges while keeping the original non-zero delays non-zero.

Theorem 3.2 Given an MD CdDFG G = (V, E, d, t, k), a legal multi-dimensional retiming r for G, a set of edges X which need non-zero delays, and the values of the function RC computed for the nodes in G based on X, if the multi-dimensional retiming r(u) = (max_V {RC} − RC(u)) · r is applied to every u ∈ V, creating an MD CdDFG Gr = (V, E, dr, t, k), then for every edge e ∈ X with d(e) = (0, ..., 0) we have dr(e) ≠ (0, ..., 0), and dr(e) ≠ (0, ..., 0) whenever d(e) ≠ (0, ..., 0).
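The earliest-start computation over zero-delay edges amounts to a longest-path pass in topological order. A compact sketch (hypothetical names; t_prime stands in for the forwarding-adjusted computation time t'):

```python
from graphlib import TopologicalSorter

def earliest_start(zero_delay_preds, t_prime):
    """ES(u) = max(1, max over zero-delay predecessors v of ES(v) + t'(v)).
    `zero_delay_preds` maps each node to its zero-delay predecessors."""
    es = {}
    for u in TopologicalSorter(zero_delay_preds).static_order():
        es[u] = max([1] + [es[v] + t_prime[v] for v in zero_delay_preds.get(u, [])])
    return es

# p --(0,...,0)--> q, with a 3-cycle multiplication at p:
es = earliest_start({"q": ["p"], "p": []}, {"p": 3, "q": 1})
assert es == {"p": 1, "q": 4}
```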

However, during the execution of the scheduling algorithm, we do not know beforehand which edges should be considered members of the set X. In the Branch Anticipation scheduling algorithm, the value of RC is therefore calculated on the fly, as we find out which edges belong to the subset X. In order to state the issues associated with conditional resource sharing under Branch Anticipation, we explain the mechanism of resource sharing involved in computing AVAIL(u) in the algorithm. Each node is associated with an array of two-bit flags, the conditional flags. Each fork node corresponds to one conditional flag, with (10)₂ tagging a false condition, (11)₂ tagging a true condition, and (0x)₂ tagging no further possibility of resource sharing, where x is a "don't care" bit. The usage flag, the counterpart of a node's conditional flag, is associated with each functional unit for every control step and indicates which nodes can share that functional unit. The array of usage flags of a functional unit at a control step is initialized to the array of conditional flags of the first node scheduled there. In Figure 3, nodes a1 and a2 can share one ALU, and the usage flag for the ALU becomes (0x)₂. Observe that all nodes sharing the same functional unit must come from the same iteration: if they came from different iterations, two or more of them could be enabled at the same time.

Figure 3. Sharing-prevention cycle that might lead to infinite increment of RC values
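The two-bit flag test for conditional sharing can be sketched as follows. This is an interpretation of the scheme described above, with hypothetical helper names: two nodes may share a unit if, for some fork, both flags still have their high ("sharable") bit set and record opposite branch outcomes:

```python
FALSE_BR, TRUE_BR, NO_SHARE = 0b10, 0b11, 0b00   # per-fork two-bit flags
# NO_SHARE models (0x)2 with the don't-care bit fixed to 0.

def mutually_exclusive(flags_a, flags_b):
    """Two nodes are mutually exclusive if some fork drives them on
    opposite branches and neither flag is already (0x)2."""
    return any(fa >> 1 and fb >> 1 and fa != fb
               for fa, fb in zip(flags_a, flags_b))

def merge_usage(flags_a, flags_b):
    """Usage flag of the shared unit: opposite outcomes collapse to (0x)2."""
    return [fa if fa == fb else NO_SHARE for fa, fb in zip(flags_a, flags_b)]

# w0 (false path of fork c) and w1 (true path of fork c) may share an ALU:
w0, w1 = [FALSE_BR], [TRUE_BR]
assert mutually_exclusive(w0, w1)
assert merge_usage(w0, w1) == [NO_SHARE]   # no further sharing on this unit
```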

Figure 4. Multi-Dimensional Branch Anticipation algorithm

Property 3.1 In an acyclic conditional block, all the nodes sharing the same functional unit must have the same RC value.

We now investigate the situation where a node with a larger RC value is to share a functional unit at a specific control step at which another node with a smaller RC value has already been scheduled. In such a situation, we need to raise the RC values of the scheduled nodes to that of the new node, according to Property 3.1. The dotted edges in Figure 3 (a) are sharing-indication edges. However, the butterfly-shaped dependence relation shown in Figure 3 (a) would lead to an infinite increment of RC values along the cycle a1 → m1 → m2 → a2 → a1, because of the adjustment and propagation of RC values. Such a cycle is termed a sharing-prevention cycle. The graph consisting of G and the sharing-indication edges is denoted Gs. This conflict is resolved by preventing such cycles beforehand in the function AVAIL-PREVENT. In the presentation of the algorithms, all statements in a loop are left-aligned with the first such statement after do; similar rules apply to statements in conditional branches.

In AVAIL-PREVENT, earliest_cs records the earliest possible control step at which u can be scheduled. In a while loop, when resource sharing is possible, we first try to detect any sharing-prevention cycle by a DFS traversal of the graph Gs. This graph is obtained by adding bidirectional edges among any group of nodes that share the same functional unit. Thus, if u is not marked at the end of the traversal, we know that it is safe to let it share fu with the nodes already scheduled there; otherwise, we try the next cell in the schedule table. With AVAIL-PREVENT resolving sharing-prevention cycles, the Branch Anticipation algorithm is given in Figure 4. The algorithm MDBA combines the multi-dimensional retiming technique with limited conditional resource sharing, in the sense that not all seemingly mutually exclusive nodes share a functional unit, due to sharing-prevention cycles. In MDBA, one queue structure (QueueV) stores the schedulable nodes and another queue (Queue) maintains V, the set of all nodes. In line 2 of PROMOTE, before a node is scheduled at an earlier control step, the availability of babits is checked so that the hardware constraints are not exceeded. In order to shorten the overlap of the lifetimes of different babits, we schedule the nodes in the same global conditional block contiguously. This is achieved through the conditional flag (cflag) and the conditional flag mask (cflagmask). With the conditional flag computed in line 12 of MDBA, the top fork node in a conditional block sets the conditional flag mask in line 21. This mask is used in line 6 of PROPAGATE to confine the schedulable nodes under this top fork node, so that they are scheduled contiguously. The next theorem demonstrates the correctness of our algorithm.
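One way to model the cycle check is to contract each proposed sharing group to a single super-node and look for a cycle in the contracted dependence graph. This is only a sketch of the idea, with hypothetical names and assumed dependence edges; the paper's AVAIL-PREVENT routine works on Gs directly and interleaves RC bookkeeping:

```python
def sharing_creates_cycle(dep, share_groups):
    """Contract each sharing group to one super-node; a cycle in the
    contracted dependence graph signals a sharing-prevention cycle."""
    rep = {}
    for i, group in enumerate(share_groups):
        for u in group:
            rep[u] = i
    # Build the contracted graph (singleton nodes keep their own name).
    g = {}
    for u, vs in dep.items():
        for v in vs:
            ru, rv = rep.get(u, u), rep.get(v, v)
            if ru != rv:
                g.setdefault(ru, set()).add(rv)
    # Standard white/grey/black DFS cycle detection.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    def dfs(u):
        color[u] = GREY
        for v in g.get(u, ()):
            c = color.get(v, WHITE)
            if c == GREY or (c == WHITE and dfs(v)):
                return True
        color[u] = BLACK
        return False
    return any(color.get(u, WHITE) == WHITE and dfs(u) for u in list(g))

# Butterfly of Figure 3 (assumed dependences): a1 feeds m1, m2 feeds a2,
# with {m1, m2} and {a1, a2} proposed as sharing pairs -> cycle.
dep = {"a1": {"m1"}, "m2": {"a2"}}
assert sharing_creates_cycle(dep, [{"m1", "m2"}, {"a1", "a2"}])
assert not sharing_creates_cycle(dep, [{"m1", "m2"}])   # one pair alone is safe
```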

Theorem 3.3 Let G = (V, E, d, t, k) be a realizable MD CdDFG and FU a target set of functional units. The multi-dimensional scheduling algorithm MDBA transforms G into Gr in O(|V|² + n|E|) time, while assigning its nodes to the available resources of FU and utilizing conditional resource sharing. Here n is the number of dimensions.

The following theorem shows the optimality of our algorithm when resource sharing is not allowed. Note that this is not a bin-packing problem because we have a fixed number of functional units.

Theorem 3.4 Let G = (V, E, d, t, k) be a realizable MD CdDFG and FU a target set of functional units. When conditional resource sharing is not allowed, the multi-dimensional scheduling algorithm MDBA transforms G into Gr in O(n|E|) time and achieves the shortest schedule length.

4. Experiments

We further analyze our method by going through one MD CdDFG from [10]. Other experiments are not described in detail but are reported in our comparison with other methods in Section 6. The MD CdDFG shown in Figure 5 has a multi-dimensional delay of (1,1) on its feedback edge, so we can select (0,1) as the schedule vector and hence (1,0) as the base retiming function for Branch Anticipation. This MD CdDFG is typical in that, along the paths of the nested conditional block, there are different combinations of sharing-prevention cycles. For example, under fork node c1, the aggressive sharing of m8, m2 and m3 would block a7 from sharing with a2, or a8 from sharing with a6. Given two ALUs and one multiplier, the multiplications are critical because there are roughly as many multiplications as additions (comparisons). In this context the application of MDBA yields the optimal schedule table. If we schedule the nodes in the order a1, m1, c3, m4, a5, m5, a4, m7, m6, c1, c2, m3, a2, m2, a3, a8, m9, a6, m8, a7, a9, the resultant schedule table is shown in Figure 6 (a). Figure 6 (b) gives the final RC value for each node for reference.

Figure 5. An MD CdDFG with conditional branches

Figure 6. Optimal schedule table (a) and RC values (b) for Figure 5

5. Description of Supporting Hardware

In this section, we introduce an architecture which provides hardware support for the Branch Anticipation approach. In an architecture that supports predicated execution, such as the HPL PlayDoh architecture [2], the propagation of condition signals across iterations of a modulo-scheduled loop is implemented through rotating predicate registers. In our approach, the lifetime of each branch control signal may span several iterations. Inside each iteration, the interval during which the branch control signal is active is called its active interval. We use Branch Anticipation Bits (babits) to store and propagate such branch control signals. At any instant, instruction selection is based on the active intervals stored in the babits. A babit is reused by branch control signals whose lifetimes are disjoint. As the basis for the hardware implementation of the babits, we have the following theorem for babit usage calculation.

Figure 7. System architecture

Theorem 5.1 Given a resource-sharing schedule S for an MD CdDFG G = (V, E, d, t, k) with acyclic conditional blocks, the number of hardware babits needed to execute S is determined by the formula below:

    Σ over conditional blocks i of ( RC_i^max − RC_i^min + 1 )    (1)

In the formula above, each pair of fork and join nodes specifies a conditional block in the summation; RC_i^max refers to the maximum RC value among the nodes in conditional block i, and RC_i^min is the minimum RC value in the same conditional block. Figure 7 illustrates the architecture of a 32-bit floating-point ASIC processor, which includes two ALUs, one floating-point multiplier, a Load/Store unit, an Address Controller, instruction/data memory, and the Branch Anticipation (BA) Control Logic.
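Formula (1) can be evaluated directly from the per-node RC values. A small sketch, using a hypothetical RC assignment rather than values from the paper's examples:

```python
def babits_needed(rc, block_of):
    """Sum of (RC_max - RC_min + 1) over conditional blocks, Formula (1).
    `rc` maps node -> RC value, `block_of` maps node -> conditional block."""
    blocks = {}
    for node, b in block_of.items():
        blocks.setdefault(b, []).append(rc[node])
    return sum(max(vals) - min(vals) + 1 for vals in blocks.values())

# Two conditional blocks; block 0 spans RC values 1..3, block 1 only RC = 2.
rc       = {"w0": 1, "w1": 3, "z0": 2, "z1": 2}
block_of = {"w0": 0, "w1": 0, "z0": 1, "z1": 1}
assert babits_needed(rc, block_of) == (3 - 1 + 1) + (2 - 2 + 1)   # == 4
```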

The focus of the BA Control Logic is to implement a multi-way branching mechanism during active intervals. This mechanism is realized by selecting one instruction from several candidates that are stored as a group in instruction memory. Its functionality is directly controlled by the microprogrammed Address Controller. Figure 8 shows the connection between the babits, which behave as a partitioned rotating register file. The following description refers to an instruction array with a maximum capacity of 8 instructions. The output from the babits is used as an offset into this instruction array. The starting and ending cycles from the microinstruction are passed through the BABIT CONTROLLER to the babits. The BABIT CONTROLLER partitions the babits into as many partitions as there are global conditional blocks. Each partition is dedicated to a specific global conditional block and has a rotating index indicating the currently available babit. The "LDA", "LDB", through "LDH" signals of the BABIT CONTROLLER enable the selected babit in each partition.

Figure 8. Topological connection of the babits

6. Comparisons

Considering the different complexities of the algorithms, we show in Table 1 a comparison, across different practical experiments, among those methods which are able to find a schedule. The experiments presented in Table 1 are the Floyd-Steinberg algorithm, the MD CdDFG from [10], its modified form, two MD CdDFGs from [3], and the three-value median filter from [11]. These algorithms come from diverse areas and are typical of multi-dimensional applications. In all experiments except Kim's MD CdDFG from [3], we use two unit-cycle ALUs and one three-cycle multiplier. The methods listed are list scheduling, the GPMB approach from [11], Multi-Dimensional Branch Predication [9] (which utilizes the OPTIMUS algorithm from [6]), the polynomial-time version of Branch Anticipation, MDBA, and its backtracking version, MDBA-BACK.

In Figure 9, the parenthesized numbers in the GPMB column give the respective number of PEs GPMB would require to software-pipeline the corresponding nested loop. We can see from this perspective that GPMB is not a universal method for dealing with various uniform nested loops with conditional constructs. In all cases, our Branch Anticipation achieves the shortest schedule table length. The only tie with another method occurs on the Median Filter used by GPMB; this is because the structure of the filter's graph is relatively simple and well suited to the GPMB method. MD branch predication makes use of push-up scheduling [6] and controls the execution of nodes on branch paths via predicate registers; thus, the schedule length it achieves is longer than branch anticipation's. These results demonstrate the efficiency of the MDBA algorithm as well as its near-optimality.

Figure 9. Comparison to other methods

References

[1] P. Held, P. Dewilde, E. Deprettere, and P. Wielage, "HIFI: From Parallel Algorithm to Fixed-Size VLSI Processor Array", in Application-Driven Architecture Synthesis, F. Catthoor and L. Svensson, Eds., 1993, pp. 71-95.
[2] V. Kathail, M. Schlansker, and R. Rau, "HPL PlayDoh Architecture Specification: Version 1.0", Hewlett-Packard Laboratories Technical Report HPL-93-80, Feb. 1993.
[3] T. Kim, N. Yonezawa, W. S. Liu, and C. L. Liu, "A Scheduling Algorithm for Conditional Resource Sharing - A Hierarchical Reduction Approach", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 13, No. 4, April 1994, pp. 425-438.
[4] M. Lam, "Software Pipelining: An Effective Scheduling Technique for VLIW Machines", ACM SIGPLAN '88 Conference on Programming Language Design and Implementation, June 1988, pp. 318-328.
[5] N. L. Passos and E. H.-M. Sha, "Full Parallelism in Uniform Nested Loops using Multi-Dimensional Retiming", Proceedings of the 23rd International Conference on Parallel Processing, Vol. II, pp. 130-133, Aug. 1994.
[6] N. L. Passos and E. H.-M. Sha, "Push-up Scheduling: Optimal Polynomial-time Resource Constrained Scheduling for Multi-Dimensional Applications", Proceedings of the International Conference on Computer-Aided Design, San Jose, California, Nov. 1995.
[7] R. Potasman, J. Lis, A. Nicolau, and D. Gajski, "Percolation Based Scheduling", Proc. ACM/IEEE Design Automation Conference, 1990, pp. 444-449.
[8] B. R. Rau, "Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops", 27th International Symposium on Microarchitecture, Nov. 1994, pp. 63-74.
[9] M. Schlansker, V. Kathail, and S. Anik, "Height Reduction of Control Recurrences for ILP Processors", 27th International Symposium on Microarchitecture, Nov. 1994, pp. 40-51.
[10] J. Siddhiwala and L.-F. Chao, "Scheduling Conditional Data-Flow Graphs with Resource Sharing", Fifth Great Lakes Symposium on VLSI, 1995, pp. 94-97.
[11] Z. Tang, B. Su, S. Habib, et al., "GPMB - Software Pipelining Branch-Intensive Loops", 26th International Symposium on Microarchitecture, 1993, pp. 21-29.
