Nested Loop Transformation for Full Parallelism

Nelson Luiz Passos and Edwin Hsing-Mean Sha

Research Report CSE-TR-94-011 April 1994

Abstract

Transformation techniques are usually applied to obtain optimal execution rates in parallel and/or pipelined systems. The retiming technique is a common and valuable tool for one-dimensional problems. Most scientific and DSP applications are recursive or iterative, and uniform nested loops can be modeled as multi-dimensional data flow graphs (MDFGs). Achieving full parallelism of the loop body, i.e., executing all computational nodes in parallel, substantially decreases the overall computation time. It is well known that for one-dimensional DFGs retiming cannot always achieve full parallelism. This paper shows an important and counter-intuitive result, proving that full parallelism can always be obtained for DFGs with more than one dimension. It also presents two novel multi-dimensional retiming techniques that obtain full parallelism. Examples, a description of the algorithms, and their correctness proofs are presented in the paper.

1 Introduction

Applications such as image processing, fluid mechanics, and weather forecasting require high computer performance. Researchers and designers in those areas are looking for solutions to multi-dimensional problems through the use of parallel computers and/or specialized hardware, which justifies the study of multi-dimensional optimization algorithms. Efficient transformation mechanisms for parallel and/or pipelined processing of one-dimensional problems represented by data flow graphs have been proposed in previous studies using retiming techniques [16]. The retiming technique regroups operations among iterations in order to produce a new iteration structure with higher embedded parallelism [4, 21]. An equivalent approach for the multi-dimensional case will benefit parallel compiler design for VLIW, data-flow, and superscalar architectures [1, 10, 14], as well as high-level synthesis for VLSI design [3, 18]. Computation-intensive applications usually depend on time-critical sections consisting of a loop of instructions. To optimize the execution rate of such applications, the designer needs to explore the parallelism embedded in the repetitive patterns of a loop. Some research has been done on uniform nested loop scheduling, a similar view of the multi-dimensional problem; examples include unimodular transformations [25], loop skewing [24], and loop quantization [2]. These techniques differ from our method in that they do not change the structure of the iterations, only the sequence in which the instructions are executed. Other techniques that search beyond loop boundaries include perfect pipelining [1] and Doacross loops [7]. Perfect pipelining looks for a repeating pattern for a single loop, with the disadvantage that the size of such a pattern and the performance gain are unpredictable. Doacross loops overlap consecutive iterations, which contributes to improving the execution rate. These two methods can be regarded as special cases of our technique.
In our study, loop bodies are represented by multi-dimensional data flow graphs (MDFGs). Instead of simply overlapping iterations, we restructure the loop body, i.e., the existing dependence delays, while preserving the original data dependencies. We model this restructuring process as a multi-dimensional retiming: the new MDFG is constructed from the original graph through the retiming function. The retiming process and loop pipelining are, in essence, the same concept [5]. Previous research in loop pipelining for loops with cyclic dependencies produced methods that focus on one-dimensional problems; such methods appear in several systems [11, 15, 20, 23]. In the one-dimensional retiming method, negative delays represent the existence of a cycle in the execution sequence and cannot be allowed. In the multi-dimensional case, however, the existence of negative delays is natural, provided there exists a schedule s that makes the graph realizable. Chao and Sha introduced the concept of a restricted multi-dimensional retiming [4]. Their algorithm cannot guarantee full parallelism and is only applicable to a specific class of MDFGs, in which every cycle has a strictly non-negative total multi-dimensional delay. In this paper, two new algorithms are presented:

      DO 10 n1 = 1, N1
        DO 20 n2 = 1, N2
          y(n1,n2) = x(n1,n2) + c(0,1)*y(n1,n2-1) + c(0,2)*y(n1,n2-2)
     &             + c(1,0)*y(n1-1,n2) + c(1,1)*y(n1-1,n2-1) + c(1,2)*y(n1-1,n2-2)
     &             + c(2,0)*y(n1-2,n2) + c(2,1)*y(n1-2,n2-1) + c(2,2)*y(n1-2,n2-2)
 20     CONTINUE
 10   CONTINUE

Figure 1: (a) MDFG representing an IIR filter (b) equivalent Fortran code

incremental multi-dimensional retiming and chained multi-dimensional retiming. The first uses a legal retiming to successively restructure the loop body represented by a general form of MDFG. The second improves the efficiency of the first by obtaining the final solution in a single pass. Both techniques achieve full parallelism.

Full parallelism is the simultaneous execution of all operations (nodes) in an MDFG. Therefore, obtaining full parallelism is equivalent to obtaining a non-zero delay on every edge of the MDFG. In the one-dimensional case, such a condition cannot always be achieved through retiming, because the sum of the delays in a cycle is constant. For simplicity, we use two-dimensional problems without instructions between loops as examples; the two dimensions are generically referred to as x and y. The multi-dimensional case is a straightforward extension of the concepts presented. The main example consists of an IIR filter (Infinite Extent Impulse Response) [8], represented by the transfer function

H(z1, z2) = 1 / (1 - Σ_{n1=0}^{2} Σ_{n2=0}^{2} c(n1,n2) z1^{-n1} z2^{-n2})

which can be translated into

y(n1,n2) = x(n1,n2) + Σ c(k1,k2) · y(n1-k1, n2-k2)

where the summation ranges over 0 ≤ k1, k2 ≤ 2 with (k1,k2) ≠ (0,0).
The MDFG derived from the equation above is shown in figure 1(a); an equivalent Fortran code is presented in figure 1(b). The retimed graph is presented in figure 2(a), where the two-dimensional delay (-1,1) is pushed through all nodes labeled M in the original MDFG. Intuitively, the retimed nodes no longer precede the additions A1 to A4 within the same iteration, which allows those operations to be executed simultaneously. This new characteristic reduces the critical path of the graph, and consequently the overall execution time. Figure 2(b) shows the result of a new retiming operation, where the two-dimensional


Figure 2: (a) MDFG after retiming all nodes M (b) MDFG after retiming A1 to A4

delay (-3,2) was pushed through nodes A1 to A4, improving the parallelism and further reducing the critical path. Full parallelism is achieved when all edges have non-zero delays. In section 2, we present some basic concepts, such as mathematical models for dependence graphs and data flow graphs, and the basic ideas of multi-dimensional retiming and the cycle period. Those concepts are used in the following section when developing the algorithm for incremental retiming. Section 3 presents the properties required of a legal retiming function; the incremental multi-dimensional retiming algorithm is described, giving the foundations for the main theorem, which proves that full parallelism can always be achieved. Section 4 presents a more efficient algorithm, based on the concept of a chain reaction. Section 5 shows examples of the application of our techniques, followed by a conclusion that summarizes the concepts introduced in this paper.

2 Basic Principles

2.1 Background

In this section we present some concepts related to the interpretation of multi-dimensional data flow graphs, such as mathematical models, dependence vectors, iterations, and vector operations. A multi-dimensional data flow graph (MDFG) G = (V, E, d, t) is a node-weighted and edge-weighted directed graph, where V is the set of computation nodes, E ⊆ V × V is the set of dependence edges, d is a function from E to Z^n representing the multi-dimensional delay between two nodes, where n is the number of dimensions, and t is a function from V to the positive integers representing the computation time of each node. A two-dimensional data flow graph (2DFG) G = (V, E, d, t) is an MDFG where d is a function from E to Z^2. We use d(e) = (d.x, d.y) as a general formulation of any delay shown in a two-dimensional DFG. An example of a two-dimensional DFG and its equivalent Fortran code is shown in figure 3. For this example, V = {A, B, C, D} and E = {e1: (A,B), e2: (A,C), e3: (D,A), e4: (B,D), e5: (C,D)}, where d(e1) = d(e2) = d(e3) = (0,0), d(e4) = (-1,1), and d(e5) = (1,1). For simplicity, each operation is assumed to be executed in one time unit; therefore, t(A) = t(B) = t(C) = t(D) = 1.

      DO 10 j = 0, n
        DO 11 k = 0, m
          D: d(k,j) = b(k+1,j-1) * c(k-1,j-1)
          A: a(k,j) = d(k,j) * .5
          B: b(k,j) = a(k,j) + 1.
          C: c(k,j) = a(k,j) + 2.
 11     CONTINUE
 10   CONTINUE

Figure 3: (a) MDFG extracted from a Wave Digital Filter (b) equivalent Fortran code
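The definitions above can be made concrete with a small sketch. Assuming a plain-dictionary encoding in Python (an illustrative choice, not the paper's notation), the 2DFG of figure 3(a) and a check of its cycle delays look like this:

```python
# Sketch of the 2DFG from figure 3(a): nodes V, edges E, delay vectors d,
# and computation times t, following the definitions in the text.
V = {"A", "B", "C", "D"}
E = [("A", "B"), ("A", "C"), ("D", "A"), ("B", "D"), ("C", "D")]
d = {
    ("A", "B"): (0, 0),
    ("A", "C"): (0, 0),
    ("D", "A"): (0, 0),
    ("B", "D"): (-1, 1),
    ("C", "D"): (1, 1),
}
t = {v: 1 for v in V}  # each operation takes one time unit

def cycle_delay(cycle):
    """Component-wise sum of the delay vectors along a list of edges."""
    return tuple(map(sum, zip(*(d[e] for e in cycle))))

# A legal MDFG has no zero-delay cycle: the cycle D -> A -> B -> D has
# total delay (0,0) + (0,0) + (-1,1) = (-1,1), which is not (0,0).
print(cycle_delay([("D", "A"), ("A", "B"), ("B", "D")]))  # (-1, 1)
```

The same helper confirms the other cycle, D -> A -> C -> D, has total delay (1,1).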

An iteration is the execution of the loop body exactly once, i.e., the execution of each node in V exactly once. Iterations are identified by a vector i, equivalent to a multi-dimensional index, starting from (0, 0, ..., 0). Inter-iteration dependencies are represented by vector-weighted edges. For any iteration j, an edge e from u to v with delay vector d(e) means that the computation of node v at iteration j depends on the execution of node u at iteration j - d(e). An edge with delay (0, 0, ..., 0) represents a data dependence within the same iteration. A legal MDFG must have no zero-delay cycle, i.e., the summation of the delay vectors along any cycle cannot be (0, 0, ..., 0). Several techniques are available to verify that an MDFG has no such cycle [6, 12]. An equivalent cell dependence graph of an MDFG G is the directed acyclic graph showing the dependencies between copies of nodes representing the MDFG. Figure 4(a) shows the replication of the MDFG in figure 3(a), and figure 4(b) shows the cell dependence graph with each node representing a copy of the MDFG. A computational cell is the cell dependence graph (DG) node that represents a copy of an MDFG, excluding the edges with delay vectors different from (0, 0, ..., 0), i.e., a complete iteration. A cell is considered an atomic execution unit. A mathematical model for a cell dependence graph expanded from an MDFG is presented in [17]: a tuple (I, D), where I is the index set of the cell structure, equivalent to the vectors


Figure 4: (a) DG based on the replication of an MDFG, showing iterations starting at (0,0). (b) DG represented by computational cells

used to identify each iteration, and D is a matrix containing all dependence vectors (delay vectors). Consider, for example, the DG shown in figure 4; its model is described as:

I = {(i, j) : 0 ≤ i ≤ 3, 0 ≤ j ≤ 3}, and D = [ (-1,1), (1,1) ]

Uniform loops are those that have constant dependence vectors, i.e., data dependencies lie at a constant distance in the iteration space.

The notation u -e-> v means that e is an edge from node u to node v, and the notation u ~p~> v means that p is a path from u to v. The delay vector of a path p = v0 -e0-> v1 -e1-> v2 ... -e(k-1)-> vk is d(p) = Σ_{i=0}^{k-1} d(ei), and the total computation time of the path is Σ_{i=0}^{k} t(vi).

To manipulate MDFG characteristics represented in vector notation, such as the delay vectors, we make use of component-wise vector operations. Considering two two-dimensional vectors P and Q, represented by their coordinates (P.x, P.y) and (Q.x, Q.y), an example of an arithmetic operation is P + Q = (P.x + Q.x, P.y + Q.y). The notation P · Q indicates the inner product of P and Q, i.e., P · Q = P.x · Q.x + P.y · Q.y. A schedule vector s is the normal vector of a set of parallel equitemporal hyperplanes that define the sequence of execution of a cell dependence graph. The existence of a schedule vector prevents the existence of a cycle. We say that an MDFG G = (V, E, d, t) is realizable if there exists a schedule vector s for the cell dependence graph with respect to G, i.e., s · d ≥ 0 for every d in G [13].

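The realizability condition is a set of inner-product tests. A minimal sketch (the helper names `inner` and `realizes` are ours), using the delay vectors of the 2DFG in figure 3(a):

```python
# Sketch of the realizability test: an MDFG is realizable if some schedule
# vector s satisfies s . d >= 0 for every delay vector d in the graph.
delays = [(0, 0), (0, 0), (0, 0), (-1, 1), (1, 1)]  # figure 3(a)

def inner(p, q):
    """Inner product P . Q = P.x * Q.x + P.y * Q.y."""
    return p[0] * q[0] + p[1] * q[1]

def realizes(s, ds):
    """True if s is a schedule vector for the given delay vectors."""
    return all(inner(s, dv) >= 0 for dv in ds)

print(realizes((1, 1), delays))   # True: s = (1,1) schedules the graph
print(realizes((1, -2), delays))  # False: (1,-2) . (-1,1) = -3 < 0
```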

      DO 10 j = 0, n
        d(0,j) = b(1,j-1) * c(-1,j-1)               } prologue
        DO 11 k = 0, m-1
          d(k+1,j) = b(k+2,j-1) * c(k,j-1)
          a(k,j) = d(k,j) * .5
          b(k,j) = a(k,j) + 1.
          c(k,j) = a(k,j) + 2.
 11     CONTINUE
        a(m,j) = d(m,j) * .5                        } epilogue
        b(m,j) = a(m,j) + 1.
        c(m,j) = a(m,j) + 2.
 10   CONTINUE

Figure 5: (a) MDFG after being retimed by r(D) = (1,0) (b) equivalent Fortran code

2.2 Retiming a Multi-Dimensional Data Flow Graph

In this subsection, we present the basic ideas of multi-dimensional retiming, including the cycle period, the critical path, and the characteristics of a legal retiming. The period during which all computation nodes in an iteration are executed, according to the existing data dependencies and without resource constraints, is called the cycle period. The cycle period C(G) of an MDFG G = (V, E, d, t) is the maximum computational time among paths that have no delay. For example, the MDFG in figure 3(a) has C(G) = 3, which can be measured through the paths p = D -> A -> B or p = D -> A -> C. A multi-dimensional retiming r is a function from V to Z^n that redistributes the nodes in the original dependence graph created by the replication of an MDFG G. A new MDFG G_r is created such that each iteration still has one execution of each node in G. The retiming vector r(u) of a node u ∈ G represents the offset between the original iteration containing u and the one after retiming. The delay vectors change accordingly to preserve the dependencies, i.e., r(u) represents delay components pushed into the edges u -> v and subtracted from the edges w -> u, where u, v, w ∈ G. Therefore, we have d_r(e) = d(e) + r(u) - r(v) for every edge u -e-> v, and d_r(l) = d(l) for every cycle l ∈ G. After retiming, the execution of node u in iteration i is moved to iteration i - r(u). A two-dimensional retiming r is a function from V to Z^2 that redistributes the nodes in the original dependence graph created by the replication of a 2DFG G, resulting in a new 2DFG G_r such that each iteration still has one execution of each node in G, reducing the cycle period as in the one-dimensional case. For example, figure 5(a) shows the MDFG from figure 3(a) retimed by the function r, and figure 5(b) shows the modified Fortran code. The critical paths of this graph are the edges A -> B and A -> C, with an execution time of 2 time units.

The retimed cell DG for the example in figure 3(a) is shown in figures 6(a) and (b), where


Figure 6: (a) DG based on the replication of an MDFG, after retiming. (b) same DG represented by computational cells.

Figure 7: (a) Example of an illegal retiming: r(D) = (0,1), r(A) = r(B) = r(C) = (0,0). (b) DG showing cycles in the x-direction.

the nodes originally belonging to iteration (0,0) are marked. A possible schedule vector for the retimed graph is s = (1,3). Figure 7(a) shows an illegal retiming function applied to the same example: by simple inspection of the cell dependence graph in figure 7(b), we notice the existence of a cycle created by the dependencies (1,0) and (-1,0). A prologue is the set of instructions that are moved along the x and y directions by a two-dimensional retiming and that must be executed to provide the necessary data for the iterative process. In the example shown in figure 6(a), instruction D becomes the prologue for the problem. The epilogue is the complementary set of instructions, at the other extreme of the DG, that must be executed to complete the process. In the example, the reduction of the critical path seen in figure 5 results in a saving of approximately one third of the total execution time.
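The retiming rule d_r(e) = d(e) + r(u) - r(v) can be exercised mechanically. A sketch (the function name `retime` is ours) that applies r(D) = (1,0) to the graph of figure 3(a) and recovers the delays of the retimed graph in figure 5(a):

```python
# Sketch: apply a two-dimensional retiming r to the 2DFG of figure 3(a)
# using d_r(e) = d(e) + r(u) - r(v) for every edge e = (u, v).
d = {
    ("A", "B"): (0, 0), ("A", "C"): (0, 0), ("D", "A"): (0, 0),
    ("B", "D"): (-1, 1), ("C", "D"): (1, 1),
}
r = {"A": (0, 0), "B": (0, 0), "C": (0, 0), "D": (1, 0)}  # r(D) = (1,0)

def retime(d, r):
    """d_r(e) = d(e) + r(u) - r(v) for every edge e = (u, v)."""
    return {(u, v): (dv[0] + r[u][0] - r[v][0], dv[1] + r[u][1] - r[v][1])
            for (u, v), dv in d.items()}

dr = retime(d, r)
print(dr[("D", "A")])  # (1, 0): the delay pushed into D's outgoing edge
print(dr[("B", "D")])  # (-2, 1)
print(dr[("C", "D")])  # (0, 1)
```

Note that the delays of the cycle D -> A -> B -> D still sum to (-1,1), as required by the invariance of cycle delays under retiming.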


3 Incremental Multi-Dimensional Retiming

3.1 Basic Concepts

The technique of multi-dimensional retiming allows changes in more than one direction [4, 19]. Our method uses this property to achieve full parallelism by incrementally applying the retiming function to multi-dimensional problems. The properties, the algorithm, and the supporting theorems for incremental multi-dimensional retiming are presented below. We begin by defining a legal retiming function.

Definition 3.1 Given a realizable MDFG G = (V, E, d, t), a legal retiming for G is a multi-dimensional retiming r that transforms G into G_r such that G_r is still realizable.

Finding a legal retiming of an MDFG such that all of its nodes can be executed in parallel is equivalent to obtaining a cycle period equal to one, when we assume the execution time of each node to be one time unit. This problem is more complex than the use of traditional retiming in the one-dimensional case, where the positive delays after the transformation guarantee that the retiming is legal. For the multi-dimensional problem, positive delay vectors are too restrictive, since a dependence graph may still be realizable even if it has negative delays. Therefore, the following theorem gives a more general set of constraints for a legal retiming that produces full parallelism among the nodes of an MDFG.

Theorem 3.1 Let G = (V, E, d, t) be an MDFG with t(v) = 1 for every v ∈ V. r is a legal multi-dimensional retiming for G such that C(G_r) = 1 if and only if:

1. the cell dependence graph of the retimed MDFG G_r = (V, E, d_r, t) does not contain any cycle;

2. d_r(e) ≠ (0, 0, ..., 0) for any e ∈ E;

3. if the execution time of a path p in G is greater than one, then d_r(p) ≠ (0, 0, ..., 0).

Proof: The first condition results from the realizability of the final cell dependence graph. If a cycle were allowed, we would find cells depending on their own results while waiting to be executed, which is a contradiction. The second and third conditions result from the properties associated with the cycle period. □

Our technique enforces these constraints through the use of an incremental retiming that guarantees the realization of each new retimed graph until all edges become non-zero delay edges.


Figure 8: An example of a scheduling subspace S, containing delay vectors d, d', and d'' in the (x, y) plane

3.2 The Scheduling Subspace

We know that the existence of a linear schedule vector for the execution of the MDFG is a necessary and sufficient condition for its realizability [13]. We define the space domain for schedule vectors of an MDFG as follows:

Definition 3.2 A scheduling subspace S for a realizable MDFG G = (V, E, d, t) is the space region containing the schedule vectors that realize G, i.e., if a schedule s ∈ S then s · d(e) ≥ 0 for every e ∈ E.

An example of a scheduling subspace for a two-dimensional case is shown in figure 8. Using these concepts, we define the basic conditions for a legal multi-dimensional retiming through the following lemma.

Lemma 3.2 Let G = (V, E, d, t) be an MDFG, r a multi-dimensional retiming, and s a schedule vector for the retimed graph G_r = (V, E, d_r, t). Then:

(a) for any path u ~p~> v, we have d_r(p) = d(p) + r(u) - r(v);

(b) for any cycle l ∈ G, we have d_r(l) = d(l);

(c) for any edge u -e-> v, if d_r(e) ≠ (0, 0, ..., 0) then d_r(e) · s ≥ 0.

3.3 The Selection of the Retiming Function

In this section, we show how to select a legal retiming function for a realizable MDFG. The selection is based on properties of the scheduling subspace associated with the MDFG. We begin by defining a strictly positive scheduling subspace.

Definition 3.3 Given a realizable MDFG G = (V, E, d, t) with a scheduling subspace S, the strictly positive scheduling subspace S+ is the set of all vectors s ∈ S such that d(e) · s > 0 for every d(e) ≠ (0, 0, ..., 0).

It is obvious that if a scheduling subspace S is not empty, then the strictly positive scheduling subspace S+ associated with S is also not empty. Using the strictly positive scheduling subspace concept, we introduce, in the next theorem, a method for predicting a legal multi-dimensional retiming function.

Theorem 3.3 Let G = (V, E, d, t) be a realizable MDFG, S+ a strictly positive scheduling subspace for G, s a schedule vector in S+, and u ∈ V a node with all incoming edges having non-zero delays. A legal retiming r(u) is any vector orthogonal to s.

Proof: From the definition of retiming, we have d_r(e1) = d(e1) - r(u) and d_r(e2) = d(e2) + r(u), for edges -e1-> u -e2->, with d(e1) ≠ (0, 0, ..., 0). To verify that the resulting MDFG is realizable, we compute the inner product of s and each of the retimed dependence vectors. Then d_r(e1) · s = d(e1) · s - r(u) · s = d(e1) · s > 0, since e1 is an incoming edge and d(e1) ≠ (0, 0, ..., 0). For e2, d_r(e2) · s = d(e2) · s + r(u) · s = d(e2) · s ≥ 0. We now have two cases:

Case 1: if d(e2) ≠ (0, 0, ..., 0), then d(e2) · s > 0. By lemma 3.2, the resulting graph is realizable.

Case 2: if d(e2) = (0, 0, ..., 0), then d_r(e2) = r(u). Since d_r(e1) · s > 0 and, for every edge e with non-zero delay, d(e) · s > 0, it is impossible to have a linear combination of these delay vectors orthogonal to s, i.e., parallel to d_r(e2) = r(u). Therefore no cycle can be found and the resulting graph is realizable. □

Therefore, given a set of dependence vectors, after defining its strictly positive scheduling subspace we can predict a legal retiming function. These results are used in our incremental retiming algorithm. We now introduce two important corollaries of theorem 3.3.

Corollary 3.4 Given a realizable MDFG G = (V, E, d, t), S+ a strictly positive scheduling subspace for G, s a schedule vector in S+, and a vector r orthogonal to s, if a set X ⊆ V has all incoming edges with non-zero delays, then r(X) is a legal retiming.

Proof: For edges -e1-> X -e2->, we know that d(e1) ≠ (0, 0, ..., 0). Considering theorem 3.3, we conclude that r is a legal retiming for G. □

Corollary 3.5 Given a realizable MDFG G = (V, E, d, t), S+ a strictly positive scheduling subspace for G, s a schedule vector in S+, a vector r orthogonal to s, a set X ⊆ V with all incoming edges having non-zero delays, and an integer k > 1, then (k · r)(X) is a legal retiming.

Proof: Follows immediately from theorem 3.3 and corollary 3.4. □
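In two dimensions, the orthogonality requirement of theorem 3.3 and its corollaries is simple to satisfy: for s = (s.x, s.y), both (s.y, -s.x) and (-s.y, s.x) are orthogonal to s. A minimal illustration (the helper name is ours):

```python
# Sketch: picking a retiming vector orthogonal to a 2-D schedule vector s,
# as permitted by theorem 3.3; either orientation is a legal choice.
def orthogonal(s):
    """One of the two vectors orthogonal to s in two dimensions."""
    return (s[1], -s[0])

s = (1, 3)         # schedule vector used in the example of section 3.4
r = orthogonal(s)  # (3, -1); the opposite orientation (-3, 1) is equally legal
print(r)                           # (3, -1)
print(r[0] * s[0] + r[1] * s[1])   # 0, i.e., r . s = 0 as required
```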


3.4 The Incremental Multi-Dimensional Retiming Algorithm

The ability to predict a legal retiming for any realizable MDFG allows us to define the incremental multi-dimensional retiming algorithm using the following steps:

1. Given a realizable MDFG G = (V, E, d, t), we use the idea of theorem 3.3 to find a legal retiming: solving the inequalities s · d(e) > 0 for every e ∈ E with d(e) ≠ (0, 0, ..., 0), where s is the unknown, we choose a retiming function from the hyperplane with s as its normal vector.

2. We apply the selected retiming function to any node that has all incoming edges with non-zero delays and at least one outgoing edge with zero delay.

3. Since the resulting MDFG is still realizable, if there are still zero-delay edges, we go back to step 1.

As a consequence of the concepts presented, we state the main theorem as follows:
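The three steps can be sketched as follows for the two-dimensional case. This is an illustrative sketch, not the paper's implementation: the brute-force search for s stands in for solving the inequalities symbolically, and the node-selection order is an arbitrary deterministic choice. Run on the graph of figure 3(a), it reaches full parallelism in two passes.

```python
from itertools import product

def inner(p, q):
    return p[0] * q[0] + p[1] * q[1]

def find_schedule(delays, bound=8):
    """Step 1: a small integer s with s . d > 0 for every non-zero delay."""
    for cand in sorted(product(range(-bound, bound + 1), repeat=2),
                       key=lambda s: (abs(s[0]) + abs(s[1]), s)):
        if cand != (0, 0) and all(inner(cand, dv) > 0
                                  for dv in delays if dv != (0, 0)):
            return cand
    raise ValueError("no schedule vector found: graph not realizable?")

def incremental_retiming(V, d):
    d = dict(d)
    while any(dv == (0, 0) for dv in d.values()):
        s = find_schedule(d.values())
        r = (s[1], -s[0])  # a vector orthogonal to s (theorem 3.3)
        # step 2: pick a node with all incoming delays non-zero and at
        # least one zero-delay outgoing edge, then push r through it
        u = next(v for v in sorted(V)
                 if all(d[e] != (0, 0) for e in d if e[1] == v)
                 and any(d[e] == (0, 0) for e in d if e[0] == v))
        for (a, b), dv in list(d.items()):
            if a == u:
                d[(a, b)] = (dv[0] + r[0], dv[1] + r[1])
            elif b == u:
                d[(a, b)] = (dv[0] - r[0], dv[1] - r[1])
    return d  # step 3: repeat until no zero-delay edge remains

# Figure 3(a): the sketch retimes D and then A, leaving every edge with
# a non-zero delay, i.e., a fully parallel loop body.
d0 = {("A", "B"): (0, 0), ("A", "C"): (0, 0), ("D", "A"): (0, 0),
      ("B", "D"): (-1, 1), ("C", "D"): (1, 1)}
dr = incremental_retiming({"A", "B", "C", "D"}, d0)
print(all(dv != (0, 0) for dv in dr.values()))  # True
```

The orientation of r (here (s.y, -s.x)) is a free choice; the text notes that for s = (1,3) both (-3,1) and (3,-1) are legal retimings for node A.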

Theorem 3.6 Let G = (V, E, d, t) be a realizable MDFG. The incremental multi-dimensional retiming algorithm transforms G into G_r, in at most |V| iterations, such that G_r is fully parallel.

Proof: After each iteration of the incremental multi-dimensional retiming algorithm, the resulting MDFG is still realizable. Successive iterations of finding s and applying the respective incremental retiming function allow us to turn all zero-delay edges into non-zero ones, obtaining a fully parallel MDFG. After each iteration, all outgoing edges of at least one new node have no zero delay. Hence, after at most |V| iterations, full parallelism is achieved. □

Let us examine the example in figure 3. The first incremental retiming is shown in figure 5, where s has the value (0,1). The next retiming must consider the new set of dependencies: {(-2,1), (0,1), (1,0)}. We choose s = (1,3); the retiming function for node A can then be either (-3,1) or (3,-1). Choosing r(A) = (-3,1), we obtain the graph in figure 9(a). The equivalent Fortran code is not a simple transcription of the graph, because it requires a recursion schedule vector [13], i.e., a schedule vector parallel to one of the axes of the index space. We use loop skewing [24, 25] to adjust the schedule vector. The code equivalent to the final retimed graph is shown in figure 9(b).

4 Chained Multi-Dimensional Retiming

The previous section described a method that finds a fully parallel solution in several iterations of an incremental retiming operation. To increase the efficiency of our algorithm, we now introduce the concepts that allow us to obtain the fully parallel solution in a single pass.

Definition 4.1 A chain is a path p = v0 -e0-> v1 -e1-> v2 ... -e(k-1)-> vk, where all incoming edges of node v0 are non-zero delay edges and there exists at least one zero-delay edge preceding


      DO 10 jp = 3, 4*n+m-1
        --- prologue ---
        DO 11 kp = max(3,jp-n), min(3*n+m-1,jp-3)
          d(4*kp-3*jp+1,jp-kp) = b(4*kp-3*jp+2,jp-kp-1) * c(4*kp-3*jp,jp-kp-1)
          a(4*kp-3*jp-3,jp-kp+1) = d(4*kp-3*jp-3,jp-kp+1) * .5
          b(4*kp-3*jp,jp-kp) = a(4*kp-3*jp,jp-kp) + 1.
          c(4*kp-3*jp,jp-kp) = a(4*kp-3*jp,jp-kp) + 2.
 11     CONTINUE
        --- epilogue ---
 10   CONTINUE

Figure 9: (a) fully parallel MDFG (b) equivalent Fortran code after loop skewing

each of the nodes v_i, 1 ≤ i ≤ k. The nodes are numbered in monotonically increasing order, and k is said to be the length of the chain.

From the definition above we derive a method for identifying all chains belonging to an MDFG. We call this construction a multi-chain structure; it is built according to the following steps:

1. All non-zero delay edges are removed from the MDFG. Since a realizable MDFG has no zero-delay cycle, the result is a directed acyclic graph (DAG).

2. A modified topological sort algorithm [22] is used to order the nodes in levels from the end to the beginning. Each node is labeled with its level number, which produces the monotonically increasing characteristic of the node indices in a chain. (Every node is assigned a unique level number, even when it belongs to multiple chains.)

We call the highest level number obtained in this construction the multi-chain maximum length. An example of a chain can be obtained from figure 3(a): the existing chains are D-A-B and D-A-C. Node D is at level 0, node A at level 1, and nodes B and C at level 2. These chains have a multi-chain maximum length of 2. To speed up the incremental retiming process, we introduce the idea of chained retiming, according to the algorithm below:

1. Find a legal retiming function r as in the first step of the incremental retiming algorithm.

2. Construct the multi-chain structure, labeling the nodes accordingly and computing the multi-chain maximum length k.

3. Retime each node v_i by (k - i) · r. The result is the fully parallel MDFG.
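The chained algorithm can be sketched directly from these steps, assuming a dictionary encoding of the 2DFG of figure 3(a). The relaxation-based level computation is an illustrative stand-in for the modified topological sort of [22], and r = (1,0) is the first incremental retiming function from section 3.4:

```python
# Sketch of chained multi-dimensional retiming on the 2DFG of figure 3(a):
# label nodes by level in the zero-delay DAG, then retime each node at
# level i by (k - i) * r in a single pass.
d = {("A", "B"): (0, 0), ("A", "C"): (0, 0), ("D", "A"): (0, 0),
     ("B", "D"): (-1, 1), ("C", "D"): (1, 1)}
V = {"A", "B", "C", "D"}
r = (1, 0)  # first legal retiming function (section 3.4)

# steps 1-2: keep only zero-delay edges and compute longest-path levels
zero = [e for e, dv in d.items() if dv == (0, 0)]
level = {v: 0 for v in V}
for _ in V:  # |V| relaxation passes suffice on a DAG
    for u, v in zero:
        level[v] = max(level[v], level[u] + 1)
k = max(level.values())  # multi-chain maximum length (here, 2)

# step 3: retime node v_i at level i by (k - i) * r, all at once
retiming = {v: ((k - i) * r[0], (k - i) * r[1]) for v, i in level.items()}
dr = {(u, v): (dv[0] + retiming[u][0] - retiming[v][0],
               dv[1] + retiming[u][1] - retiming[v][1])
      for (u, v), dv in d.items()}
print(sorted(dr.items()))  # every edge now carries a non-zero delay
```

On this example the sketch reproduces the single-pass result described in the text: D is retimed by 2 · (1,0) = (2,0) and A by 1 · (1,0), yielding the fully parallel graph of figure 10(a).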

      DO 10 j = 0, n
        d(0,j) = b(1,j-1) * c(-1,j-1)               } prologue
        d(1,j) = b(2,j-1) * c(0,j-1)
        a(0,j) = d(0,j) * .5
        DO 11 k = 0, m-2
          d(k+2,j) = b(k+3,j-1) * c(k+1,j-1)
          a(k+1,j) = d(k+1,j) * .5
          b(k,j) = a(k,j) + 1.
          c(k,j) = a(k,j) + 2.
 11     CONTINUE
        a(m,j) = d(m,j) * .5                        } epilogue
        b(m-1,j) = a(m-1,j) + 1.
        c(m-1,j) = a(m-1,j) + 2.
        b(m,j) = a(m,j) + 1.
        c(m,j) = a(m,j) + 2.
 10   CONTINUE

Figure 10: (a) fully parallel MDFG obtained using chained retiming (b) equivalent Fortran code

For this algorithm, only one step of incremental retiming is executed, at the beginning; the rest of the algorithm requires only O(|E|) time. Therefore, this algorithm is more efficient than the incremental retiming. The next theorem proves that the MDFG obtained from the chained retiming algorithm is realizable.

Theorem 4.1 Let G = (V, E, d, t) be a realizable MDFG. The chained multi-dimensional retiming algorithm transforms G into G_r such that G_r is realizable and fully parallel.

Proof: By using the chained multi-dimensional retiming algorithm, we find r, an incremental retiming for G; k, the multi-chain maximum length of G; and i, the level number assigned to each node v_i ∈ V. Retiming each node v_i by (k - i) · r results in final dependence vectors of the form d(e) - j · r, d(e) + j · r, or j · r, where 1 ≤ j ≤ k - 1. By corollary 3.5, the retimed MDFG is realizable, and since there are no zero-delay edges, G_r is fully parallel. □

5 Experiments

In this section we present the application of our method to two different examples. The first example is the IIR filter introduced in section 1. The second example is an MDFG representing a wave digital filter; this MDFG computes the solution of a partial differential equation problem (a transmission line problem), based on the Fettweis method [9].


Figure 11: (a) MDFG after retiming A6 (b) fully parallel MDFG after retiming A5

IIR Filter

We now revisit the example shown in figure 1. Assume both adders and multipliers take one time unit to compute. Using the incremental retiming concept, we begin by retiming all nodes labeled M by r = (-1,1), for a schedule vector s = (1,1). Figure 2(a) shows the resulting MDFG. The new set of dependencies D is now given by:

D = { (3,1), (3,0), (3,-1), (2,1), (2,0), (2,-1), (1,1), (1,0), (-1,1) }

After retiming nodes A1 to A4 by r = (-3,2), the graph presented in figure 2(b) is produced. In the next step, we retime node A6 by r = (-7,4), obtaining the graph shown in figure 11(a). The final graph is presented in figure 11(b), after retiming node A5 by r = (-15,8). Our second approach is to use the chained retiming algorithm. We find the topological ordering of the DAG associated with the graph in figure 1: nodes labeled M5 to M8 are assigned to level 0; nodes M3, M4, A3 and A4 to level 1; M1, M2, A2 and A6 to level 2; A1 and A5 to level 3; and, finally, A7 to level 4. Therefore, the multi-chain maximum length of G is 4. We know that the first incremental retiming function for this set of dependencies is (-1,1). We may then apply the following retiming functions to the graph: r(M_i) = (-4,4) for i = 5 to 8, r(M3) = r(M4) = r(A3) = r(A4) = (-3,3), r(M1) = r(M2) = r(A2) = r(A6) = (-2,2), and r(A1) = r(A5) = (-1,1). The final fully parallel graph is shown in figure 12.

Transmission Line Simulation

Figure 12: Fully parallel MDFG obtained through chained retiming

In this example, we introduce a two-dimensional transmission line problem initially transformed by using the Fettweis method (see [9] as a tutorial). The transmission line is characterized by the equations below:

l ∂i/∂t₁ + ri + ∂u/∂t₂ = f₁
∂i/∂t₂ + c ∂u/∂t₁ + gu = f₂

After applying the Fettweis transformations we obtain the Wave Digital Filter shown in figure 13(a), which is equivalent to the MDFG in figure 13(b), where the inputs e1 and e2 are zero. The two-port adaptors, A and B, are expanded to their internal configuration. For simplicity, one output port and one adder in each adaptor were deleted because of the particular boundary conditions applied. Using the incremental algorithm we would obtain the graph shown in figure 14(a), after incrementally applying the following retiming functions: r(F) = r(E) = (1, 0), r(H) = r(G) = (-3, 1), r(B1) = r(A1) = (-7, 2), r(B2) = r(A2) = (-15, 4), and r(B3) = r(A3) = (-31, 8). Figure 14(b) shows the result of applying the chained retiming method to this case. The retiming functions were r(F) = r(E) = (5, 0), r(H) = r(G) = (4, 0), r(B1) = r(A1) = (3, 0), r(B2) = r(A2) = (2, 0), and r(B3) = r(A3) = (1, 0).
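In both final graphs, full parallelism can be checked mechanically: the loop body executes fully in parallel exactly when no edge of the retimed MDFG carries the zero delay vector, so every dependency crosses an iteration boundary and the longest zero-delay path is a single node. A small sketch of that check (the delay vectors shown are illustrative, not read off figure 14):

```python
def fully_parallel(delays):
    """True when no edge carries the zero delay vector, i.e. every
    dependency crosses an iteration boundary, so all nodes of the
    loop body can execute simultaneously."""
    return all(any(c != 0 for c in d) for d in delays)

# Illustrative retimed delay vectors (not read off figure 14):
assert fully_parallel([(1, 0), (-3, 1), (2, -1)])
assert not fully_parallel([(1, 0), (0, 0)])  # a zero-delay edge remains
```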

Figure 13: (a) Wave Digital Filter graph (b) equivalent MDFG

Figure 14: Final MDFG for the transmission line problem, (a) using incremental retiming (b) using chained retiming

6 Conclusion

We have presented two novel techniques for optimizing a multi-dimensional data flow graph through the use of a multi-dimensional retiming method that we developed. We call these

methods multi-dimensional incremental retiming and chained multi-dimensional retiming. Both techniques are able to achieve full parallelism, i.e., simultaneous execution of all operations (nodes), for either a multi-dimensional data flow graph or a uniform nested loop. The main technique consists of successive retiming operations that reduce the critical path length to one node, which implies a fully parallel solution. A pre-selected retiming function is the core of the algorithm, and the process of selecting such a function was presented. Unlike the one-dimensional case, where the retiming technique cannot guarantee full parallelism, the main theorem of this paper shows that a multi-dimensional problem can always achieve full parallelism. The algorithms were presented in detail, and examples showed the effectiveness of the new methods in optimizing the execution of loop bodies and MD data flow graphs in parallel environments.

References

[1] A. Aiken, "Compaction Based Parallelization," Ph.D. thesis, Technical Report 88922, Cornell University, 1988.
[2] A. Aiken and A. Nicolau, "Loop Quantization: An Analysis and Algorithm," Technical Report 87-821, Department of Computer Science, Cornell University, March 1987.
[3] L.-F. Chao, A. LaPaugh, and E. H.-M. Sha, "Rotation Scheduling: A Loop Pipelining Algorithm," Proc. 30th ACM/IEEE Design Automation Conference, Dallas, TX, June 1993, pp. 566-572.
[4] L.-F. Chao and E. H.-M. Sha, "Static Scheduling of Uniform Nested Loops," Proceedings of the 7th International Parallel Processing Symposium, Newport Beach, CA, April 1993, pp. 1421-1424.
[5] L.-F. Chao, "Scheduling and Behavioral Transformations for Parallel Systems," Ph.D. dissertation, Princeton University, 1993.
[6] E. Cohen and N. Megiddo, "Strongly Polynomial-Time and NC Algorithms for Detecting Cycles in Dynamic Graphs," Proc. 21st ACM Annual Symposium on Theory of Computing, 1989, pp. 523-534.
[7] R. Cytron, "Doacross: Beyond Vectorization for Multiprocessors," Proc. International Conference on Parallel Processing, 1986, pp. 836-844.
[8] D. E. Dudgeon and R. M. Mersereau, Multidimensional Digital Signal Processing, Englewood Cliffs, NJ: Prentice Hall, 1984.
[9] A. Fettweis and G. Nitsche, "Numerical Integration of Partial Differential Equations Using Principles of Multidimensional Wave Digital Filters," Journal of VLSI Signal Processing, 3, pp. 7-24, 1991.
[10] J. A. Fisher and B. R. Rau, "Instruction-Level Parallel Processing," Science, Vol. 253, September 1991, pp. 1233-1241.
[11] G. Goosens, J. Wandewalle, and H. de Man, "Loop Optimization in Register Transfer Scheduling for DSP Systems," Proc. ACM/IEEE Design Automation Conference, 1989, pp. 826-831.
[12] S. R. Kosaraju and G. F. Sullivan, "Detecting Cycles in Dynamic Graphs in Polynomial Time," Proc. 20th ACM Annual Symposium on Theory of Computing, 1988, pp. 398-406.
[13] S. Y. Kung, VLSI Array Processors, Englewood Cliffs, NJ: Prentice Hall, 1988.
[14] M. Lam, "Software Pipelining: An Effective Scheduling for VLIW Machines," ACM SIGPLAN Conference on Programming Language Design and Implementation, 1988, pp. 318-328.
[15] T.-F. Lee, A. C.-H. Wu, D. D. Gajski, and Y.-L. Lin, "An Effective Methodology for Functional Pipelining," Proc. of the International Conference on Computer Aided Design, December 1992, pp. 230-233.
[16] C. E. Leiserson and J. B. Saxe, "Retiming Synchronous Circuitry," Algorithmica, 6, 1991, pp. 5-35.
[17] D. I. Moldovan and J. A. B. Fortes, "Partitioning and Mapping Algorithms into Fixed Size Systolic Arrays," IEEE Transactions on Computers, Vol. C-35, January 1986, pp. 1-12.
[18] N. Park and A. C. Parker, "Sehwa: A Software Package for Synthesis of Pipelines from Behavioral Specifications," IEEE Transactions on CAD, Vol. 7, No. 3, 1988, pp. 356-370.
[19] N. L. Passos, E. H.-M. Sha, and S. C. Bass, "Schedule-Based Multi-dimensional Retiming," to appear in Proceedings of the 8th International Parallel Processing Symposium, Cancun, Mexico, April 1994.
[20] R. Potasman, J. Lis, A. Nicolau, and D. Gajski, "Percolation Based Scheduling," Proc. ACM/IEEE Design Automation Conference, 1990, pp. 444-449.
[21] D. A. Schwartz, "Cyclo-Static Realizations, Loop Unrolling and CPM: Optimal Multiprocessor Scheduling," Technical report, Georgia Institute of Technology, School of Electrical Engineering, 1987.
[22] R. Tarjan, Data Structures and Network Algorithms, SIAM, Philadelphia, Pennsylvania, 1983.
[23] C.-Y. Wang and K. K. Parhi, "High Level DSP Synthesis Using the MARS Design System," Proc. of the International Symposium on Circuits and Systems, 1992, pp. 164-167.
[24] M. Wolfe, "Loop Skewing: the Wavefront Method Revisited," International Journal of Parallel Programming, Vol. 15, No. 4, August 1986, pp. 284-294.
[25] M. E. Wolf and M. S. Lam, "A Loop Transformation Theory and an Algorithm to Maximize Parallelism," IEEE Transactions on Parallel and Distributed Systems, Vol. 2, No. 4, October 1991, pp. 452-471.
