ON THE ALIGNMENT PROBLEM

ALAIN DARTE and YVES ROBERT
Laboratoire LIP-IMAG, Ecole Normale Superieure de Lyon
69364 LYON Cedex 07
e-mail: [darte,yrobert]@lip.ens-lyon.fr

Received (received date)
Revised (revised date)

ABSTRACT
This paper deals with the problem of aligning data and computations when mapping uniform or affine loop nests onto SPMD distributed-memory parallel computers. For affine loop nests we formulate the problem by introducing the communication graph, which can be viewed as the counterpart for the mapping problem of the dependence graph for scheduling. We illustrate the approach with several examples to show the difficulty of the problem. In the simplest case, that of perfect loop nests with uniform dependences, we show that minimizing the number of communications is NP-complete, although we are able to derive a good alignment heuristic in most practical cases.

Keywords: Loop nest; Uniform dependences; Affine dependences; Alignment problem; The owner computes rule; Convex domain; Communication graph; Mapping strategies.
1. Introduction
The automatic parallelization of loop nests targeted for execution onto distributed-memory parallel computers (DMPCs) has motivated a vast amount of research ([11, 9, 18] and references therein). We assume that the reader is familiar with the basic terminology (perfect nest, uniform dependences, affine dependences, scheduling functions) and we refer to [9, 1] for background material. Consider a loop nest to be executed on a DMPC. Several techniques have been developed for dependence analysis (parallelism extraction) and scheduling. Mapping data and computations onto physical processors is not so well understood. However, determining a good mapping is a critical task to reduce the overhead due to communication or non-local access. To simplify the mapping process, it is convenient to break it into two sub-problems: all data arrays and computations are first mapped to a common "virtual" architecture, which is then partitioned across the processors. This paper deals with the first stage of the mapping problem, which is concerned with the relative allocation of data arrays and computations so as to reduce non-local accesses. The aim is therefore to derive a virtual computation "grid" where as many communications as possible are made internal: this is known as the alignment problem.

Alignment strategies rely either on user directives, such as the ALIGN pragma of Fortran D [11] or the TEMPLATE keyword of HPF [8], or on automatic transformations based upon program analysis, such as the one presented in this paper for uniform or affine loop nests. The alignment problem has motivated a vast amount of research, and there are many papers directly related to our program transformation approach, including [14, 13, 7, 16, 15, 17, 12]. The key tool for formulating the alignment problem is the communication graph, which can be viewed for the alignment problem as the counterpart of the dependence graph for the scheduling problem. We use the communication graph to express the alignment problem in a graph-theoretic framework, and we demonstrate the difficulty of the problem by working out some examples with general affine loop nests. We indicate some possible strategies to reduce the communication overhead. In the simplest case, that of perfect loop nests with uniform dependences, we show that minimizing the number of communications is NP-complete, thereby assessing the difficulty of the alignment problem. However, we are able to derive a heuristic that leads to very good allocations in most practical cases.
2. Communication graph
We deal with affine loop nests, a program model that has been carefully defined by Feautrier [4]. In short, affine loop nests are (not necessarily perfect) loop nests where all loop bounds and array references are affine functions of the surrounding loop variables and of structure parameters. Here is an example of such a nest:
Example 1
for k = 1 to P do
  for i = k + 1 to P do
    for j = 1 to k do
      Statement S1: a(i, j) = b(j + 3, i + j + k) * c(i + k - j, i + k - 1)
      Statement S2: b(i, j) = ...
      Statement S3: c(j, i) = ...
    endfor
  endfor
endfor

Here we have a perfect loop nest of depth $n = 3$, with $k = 3$ statements. The structure parameter is $P$. The computation domain is
$$\mathrm{Dom}(P) = \{(k, i, j) \in \mathbb{Z}^3 :\ 1 \le k \le P,\ k + 1 \le i \le P,\ 1 \le j \le k\}.$$
The number of computation points in the nest is $O(P^3)$. If we look for a parallel execution in linear time $O(P)$, we have to search for a mapping onto a virtual $(n - 1) = 2$ dimensional mesh operating in SPMD synchronous mode. More precisely, we will have $k = 3$ superimposed grids, one for each statement, i.e. as many virtual cells as there are distinct projections of computation points $S_i(p)$, $1 \le i \le k$, $p \in \mathrm{Dom}(P)$. Note that for non-perfect loop nests the situation is more complex. Each statement can have a computation domain of a different dimension. Also, it is not always possible to achieve an execution time linear in the structure parameters (see [5, 6]).
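As a quick illustration (a sketch added here, not part of the original text), the computation domain of Example 1 can be enumerated and its cubic growth checked for small values of the structure parameter P:

    def domain(P):
        """Computation points of Example 1: {(k,i,j) : 1<=k<=P, k+1<=i<=P, 1<=j<=k}."""
        return [(k, i, j) for k in range(1, P + 1)
                          for i in range(k + 1, P + 1)
                          for j in range(1, k + 1)]

    for P in (4, 8, 16):
        print(P, len(domain(P)))   # count is (P^3 - P)/6, hence O(P^3) points for an O(P) schedule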
Figure 1: Access graph for statement S1 in Example 1.

2.1. Mapping data and computations

We decide a priori to have many allocation functions, one for each variable and one for each statement, contrary to the case of systolic arrays where all computations obey the same projection rule. For each array we define an affine allocation function. Consider for instance array $a$: we want to assign element $a(i, j)$ to a virtual processor on a 2D array. We use an affine allocation function which is the composition of a $2 \times 2$ matrix $M_a$ and of a translation by a constant $\alpha_a$:
$$\mathrm{alloc}(a)(u, v) = M_a (u, v)^t + \alpha_a.$$
Similarly, we map element $b(u, v)$ onto virtual processor $M_b (u, v)^t + \alpha_b$, and element $c(u, v)$ onto virtual processor $M_c (u, v)^t + \alpha_c$.

Where will we execute each instance of each statement? Instance $(k, i, j)$ of statement $S_1$, i.e. computation $S_1(k, i, j) : a(i, j) = b(j+3, i+j+k) * c(i+k-j, i+k-1)$, will be assigned to virtual processor
$$\mathrm{alloc}(S_1)(k, i, j) = M_{S_1} (k, i, j)^t + \alpha_{S_1}.$$
Here, $M_{S_1}$ is a rectangular $2 \times 3$ matrix. Similarly, computations $S_2(k, i, j)$ and $S_3(k, i, j)$ will be assigned to processors $M_{S_2} (k, i, j)^t + \alpha_{S_2}$ and $M_{S_3} (k, i, j)^t + \alpha_{S_3}$.

To execute computation $S_1(k, i, j)$, processor $\mathrm{alloc}(S_1)(k, i, j)$ needs to receive $b(j+3, i+j+k)$ from processor $\mathrm{alloc}(b)(j+3, i+j+k)$ and $c(i+k-j, i+k-1)$ from processor $\mathrm{alloc}(c)(i+k-j, i+k-1)$. After the computation, processor $\mathrm{alloc}(S_1)(k, i, j)$ has to send $a(i, j)$ to processor $\mathrm{alloc}(a)(i, j)$. Most compiler-parallelizers use the "owner computes" rule [10] and would perform the computation $S_1(k, i, j)$ inside the processor that holds $a(i, j)$. However, this is a potential loss of freedom, and there is no reason to decide a priori that $S_1(k, i, j)$ and $a(i, j)$ be allocated to the same processor. Of course, if such a coupling turns out to be efficient, we can still use it.

If we analyze all the required communications for processing statement $S_1$ in Example 1, we can draw the graph of Figure 1. If we process all statements in a similar fashion, we obtain the communication graph for the loop nest, which summarizes all the "reads" and "writes" that are required.

Consider the communication edge from vertex $b$ to vertex $S_1$, due to the fact that $b(j+3, i+j+k)$ is required for the computation $S_1(k, i, j)$. The element
$b(j+3, i+j+k)$ is available in processor
$$M_b (j+3, i+j+k)^t + \alpha_b = M_b \left[ D_{b,S_1} \begin{pmatrix} k \\ i \\ j \end{pmatrix} - c_{b,S_1} \right] + \alpha_b,$$
where $D_{b,S_1} = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}$ is the access matrix and $c_{b,S_1} = \begin{pmatrix} -3 \\ 0 \end{pmatrix}$ is the access vector. The communication edge from vertex $b$ to vertex $S_1$ is weighted with the expression
$$\delta_{b,S_1} = \mathrm{alloc}(S_1)(k, i, j) - \mathrm{alloc}(b)(j+3, i+j+k) = (M_{S_1} - M_b D_{b,S_1}) \begin{pmatrix} k \\ i \\ j \end{pmatrix} + M_b c_{b,S_1} + \alpha_{S_1} - \alpha_b,$$
which represents the distance between $b(j+3, i+j+k)$ and $S_1(k, i, j)$. Similarly, we obtain the following expressions for the other two communications involving statement $S_1$:
$$\delta_{c,S_1} = \mathrm{alloc}(S_1)(k, i, j) - \mathrm{alloc}(c)(i+k-j, i+k-1) = (M_{S_1} - M_c D_{c,S_1}) \begin{pmatrix} k \\ i \\ j \end{pmatrix} + M_c c_{c,S_1} + \alpha_{S_1} - \alpha_c,$$
$$\delta_{S_1,a} = \mathrm{alloc}(a)(i, j) - \mathrm{alloc}(S_1)(k, i, j) = (M_a D_{S_1,a} - M_{S_1}) \begin{pmatrix} k \\ i \\ j \end{pmatrix} + \alpha_a - \alpha_{S_1},$$
where $D_{c,S_1} = \begin{pmatrix} 1 & 1 & -1 \\ 1 & 1 & 0 \end{pmatrix}$, $c_{c,S_1} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$ and $D_{S_1,a} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$. Of course, we would derive similar expressions for statements $S_2$ and $S_3$.
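To make this bookkeeping concrete, the following sketch (an illustration added here, not part of the original paper) extracts the access matrix $D$ and access vector $c$ of a reference from its affine index expressions with sympy, and forms the distance $\delta_{b,S_1}$ for symbolic mapping matrices; the variable names mirror the notation above.

    import sympy as sp

    k, i, j = sp.symbols('k i j')
    I = sp.Matrix([k, i, j])

    def access(exprs):
        """Return (D, c) with the reference index equal to D*I - c (paper's sign convention for reads)."""
        D = sp.Matrix([[sp.diff(e, v) for v in I] for e in exprs])
        c = D * I - sp.Matrix(exprs)          # constant part
        return D, c

    # Reference b(j+3, i+j+k) read by statement S1 of Example 1
    D_bS1, c_bS1 = access([j + 3, i + j + k])
    assert D_bS1 == sp.Matrix([[0, 0, 1], [1, 1, 1]])
    assert c_bS1 == sp.Matrix([-3, 0])

    # Generic mapping matrices/offsets, to display delta_{b,S1} explicitly
    M_S1 = sp.Matrix(2, 3, lambda r, c: sp.Symbol(f's{r}{c}'))
    M_b  = sp.Matrix(2, 2, lambda r, c: sp.Symbol(f'b{r}{c}'))
    alpha_S1 = sp.Matrix(sp.symbols('aS0 aS1'))
    alpha_b  = sp.Matrix(sp.symbols('ab0 ab1'))

    delta_bS1 = (M_S1 - M_b * D_bS1) * I + M_b * c_bS1 + alpha_S1 - alpha_b
    sp.pprint(delta_bS1.expand())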
2.2. Problem formulation
Given the communication graph, we want to zero out as many edges as possible, so as to internalize as many communications as possible. Note that this is not always possible. Note also that the communication graph is bipartite (between vertices that represent statements and those that represent arrays). Each "write" edge, corresponding to a computation

Statement $S$ : $a(f(I)) = \ldots$

where $f(I) = DI + c$ is an affine function of the index vector $I$ with $n$ components, is weighted by
$$\delta_{S,a} = (M_a D - M_S) I + M_a c + \alpha_a - \alpha_S,$$
while each "read" edge, corresponding to a computation

Statement $S$ : $\ldots = b(g(I))$

where $g(I) = DI - c$ is an affine function of the index vector $I$, is weighted by
$$\delta_{b,S} = (M_S - M_b D) I + M_b c + \alpha_S - \alpha_b.$$
Clearly, given, say, a "read" edge involving statement $S$, the major goal is to achieve the matrix equality $M_S = M_b D$. Indeed, if the two matrices $M_S$ and $M_b D$ cannot be made equal, then the communication distance is not bounded as $I$ grows. Then, if we succeed in imposing $M_S = M_b D$, we can take care of the alignment constant $M_b c + \alpha_S - \alpha_b$: if we can zero this constant out, the communication becomes completely internal; otherwise it occurs at a fixed distance (within a local neighborhood), which is still very attractive for current-generation distributed-memory architectures.

However, there are many situations where achieving such matrix equalities is not possible. There are many potential sources of problems, among which the following:

- The equation $M_S = M_b D$ might have no solution with $M_S$ and $M_b$ satisfying a given rank constraint: in our example we would most likely search for mapping matrices of rank 2, which would render the equation infeasible if $D$ is of rank 1.
- There can be several equations involving the same matrix: in our example we have to fulfil both $M_{S_1} = M_b D_{b,S_1}$ and $M_{S_1} = M_c D_{c,S_1}$.
- Each cycle in the communication graph imposes a constraint on the alignment constants.
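As a quick sanity check on the first obstacle (an illustrative sketch, not from the paper): if $M_b$ is nonsingular then $\mathrm{rank}(M_b D) = \mathrm{rank}(D)$, so a rank-2 $M_S$ satisfying $M_S = M_b D$ can exist only when $D$ itself has rank 2.

    import numpy as np

    # Access matrices of Example 1 (see Section 2.1)
    D_bS1 = np.array([[0, 0, 1], [1, 1, 1]])
    D_cS1 = np.array([[1, 1, -1], [1, 1, 0]])
    D_S1a = np.array([[0, 1, 0], [0, 0, 1]])

    for name, D in [("b,S1", D_bS1), ("c,S1", D_cS1), ("S1,a", D_S1a)]:
        r = np.linalg.matrix_rank(D)
        print(f"rank(D_{name}) = {r}  ->  rank-2 solution to M_S1 = M_x D possible: {r == 2}")

For Example 1 all three access matrices have rank 2, so the obstruction comes not from rank but from the kernel condition of Lemma 1 discussed in Section 3.1.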
3. Affine loop nests
In this section we give some hints on how to process the communication graph so as to reduce the communication overhead as much as possible.

3.1. Matrix equations
We work out the matrix equations for statement $S_1$ in Example 1, to illustrate the main ideas. We have obtained the following three communication weights:
$$\delta_{b,S_1} = (M_{S_1} - M_b D_{b,S_1}) \begin{pmatrix} k \\ i \\ j \end{pmatrix} + M_b c_{b,S_1} + \alpha_{S_1} - \alpha_b \qquad (1)$$
$$\delta_{c,S_1} = (M_{S_1} - M_c D_{c,S_1}) \begin{pmatrix} k \\ i \\ j \end{pmatrix} + M_c c_{c,S_1} + \alpha_{S_1} - \alpha_c \qquad (2)$$
$$\delta_{S_1,a} = (M_a D_{S_1,a} - M_{S_1}) \begin{pmatrix} k \\ i \\ j \end{pmatrix} + \alpha_a - \alpha_{S_1} \qquad (3)$$
with the corresponding matrices:
$$D_{b,S_1} = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}, \quad D_{c,S_1} = \begin{pmatrix} 1 & 1 & -1 \\ 1 & 1 & 0 \end{pmatrix}, \quad D_{S_1,a} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
We aim at synthesizing a 2D virtual grid, hence we impose the condition that the unknown matrices $M_{S_1}$, $M_a$, $M_b$ and $M_c$ be of rank 2. We would like to fulfil simultaneously the three matrix equations $M_{S_1} = M_b D_{b,S_1}$, $M_{S_1} = M_c D_{c,S_1}$ and $M_{S_1} = M_a D_{S_1,a}$. We use the following lemma from elementary linear algebra:

Lemma 1. Given two matrices $D$ and $D'$ of the same dimension, there exists a nonsingular matrix $X$ such that $XD = D'$ if and only if $\mathrm{Ker}(D) = \mathrm{Ker}(D')$.

We have here $\mathrm{Ker}(D_{S_1,a}) = \mathrm{Vect}((1, 0, 0)^t)$ and $\mathrm{Ker}(D_{b,S_1}) = \mathrm{Ker}(D_{c,S_1}) = \mathrm{Vect}((1, -1, 0)^t)$. We see that it is not possible to fulfil the three equations simultaneously, and that the only way to fulfil two equations among the three is to let $M_{S_1} = M_b D_{b,S_1} = M_c D_{c,S_1}$. The condition to get $M_b D_{b,S_1} = M_c D_{c,S_1}$ is
$$M_c^{-1} M_b = \begin{pmatrix} -2 & 1 \\ -1 & 1 \end{pmatrix}.$$
To derive this condition, we simply solve the overdetermined system $M_c^{-1} M_b D_{b,S_1} = D_{c,S_1}$ (owing to the lemma, we know that there exists a solution). Note that "the owner computes" rule would lead us to choose $M_{S_1} = M_a D_{S_1,a}$, which implies that the other two equations cannot be satisfied, hence two non-local communications (reading the values of $b$ and $c$) instead of one (writing the value of $a$).

What can we do when it is not possible to fulfil a matrix equation, such as $M_{S_1} = M_a D_{S_1,a}$? We still have some freedom. We can try to have the rank of $M_{S_1} - M_a D_{S_1,a} = M_b D_{b,S_1} - M_a D_{S_1,a}$ reduced as much as possible, so that conflicts are less likely to occur when implementing the communication. In our example, we can try to have
$$M_a^{-1} M_b D_{b,S_1} - D_{S_1,a} = \begin{pmatrix} \times & \times & \times \\ 0 & 0 & 0 \end{pmatrix},$$
which ensures that $M_{S_1} - M_a D_{S_1,a}$ is of rank 1. In this simple example, we easily find the condition
$$M_a^{-1} M_b = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
Matrices $M_a$, $M_b$ and $M_c$ are defined up to multiplication by a nonsingular matrix. Fixing $M_b$ will finalize all choices, due to the conditions on $M_c^{-1} M_b$ and $M_a^{-1} M_b$. Take
$$M_b = \begin{pmatrix} 1 & 0 \\ -1 & 1 \end{pmatrix}.$$
We compute
$$M_c = \begin{pmatrix} -1 & 1 \\ 0 & 1 \end{pmatrix}, \quad M_a = \begin{pmatrix} 0 & 1 \\ 1 & -1 \end{pmatrix} \quad \text{and} \quad M_{S_1} = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}.$$
Finally, take $\alpha_{S_1} = \alpha_a = 0$, $\alpha_b = M_b \begin{pmatrix} -3 \\ 0 \end{pmatrix} = \begin{pmatrix} -3 \\ 3 \end{pmatrix}$ and $\alpha_c = M_c \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$, so that the constant parts of (1) and (2) vanish.

We can rewrite statement $S_1$ after space/time mapping, to check the alignments. Scheduling techniques [3, 5] would lead to the scheduling vector $\pi = (1, 0, 0)^t$, which means that the space/time transformation matrix is
$$\begin{pmatrix} t \\ p_1 \\ p_2 \end{pmatrix} = \tilde S \begin{pmatrix} k \\ i \\ j \end{pmatrix}, \quad \text{where } \tilde S = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}.$$
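The following numpy sketch (an illustration added here, using the matrix values as reconstructed above) checks the alignment obtained for Example 1: the two read equations hold exactly, the write mismatch has rank 1, and the chosen offsets cancel the constant parts of the read distances.

    import numpy as np

    D_bS1 = np.array([[0, 0, 1], [1, 1, 1]]); c_bS1 = np.array([-3, 0])
    D_cS1 = np.array([[1, 1, -1], [1, 1, 0]]); c_cS1 = np.array([0, 1])
    D_S1a = np.array([[0, 1, 0], [0, 0, 1]])

    M_b  = np.array([[1, 0], [-1, 1]])
    M_c  = np.array([[-1, 1], [0, 1]])
    M_a  = np.array([[0, 1], [1, -1]])
    M_S1 = M_b @ D_bS1                                      # common value of the two read equations

    assert np.array_equal(M_S1, M_c @ D_cS1)                # both read equations fulfilled
    assert np.linalg.matrix_rank(M_a @ D_S1a - M_S1) == 1   # write mismatch reduced to rank 1

    alpha_S1 = np.zeros(2); alpha_a = np.zeros(2)
    alpha_b = M_b @ c_bS1                                   # (-3, 3): zeroes the constant of (1)
    alpha_c = M_c @ c_cS1                                   # ( 1, 1): zeroes the constant of (2)
    assert np.allclose(M_b @ c_bS1 + alpha_S1 - alpha_b, 0)
    assert np.allclose(M_c @ c_cS1 + alpha_S1 - alpha_c, 0)
    print("M_S1 =", M_S1)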
We write $I' = \tilde S I$, where $I = (k, i, j)^t$ is the old iteration vector and $I' = (t, p_1, p_2)^t$ is the new iteration vector. Let $a'$, $b'$ and $c'$ denote the arrays renamed in virtual-processor coordinates, i.e. $a'(M_a I + \alpha_a) = a(I)$, $b'(M_b I + \alpha_b) = b(I)$ and $c'(M_c I + \alpha_c) = c(I)$. We get
$$M_b D_{b,S_1} \tilde S^{-1} = M_c D_{c,S_1} \tilde S^{-1} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \quad \text{and} \quad M_a D_{S_1,a} \tilde S^{-1} = \begin{pmatrix} 0 & 1 & 0 \\ -1 & -1 & 1 \end{pmatrix},$$
which leads to the new nest:

for t = ... do
  for p1, p2 = ... do
    Statement S1: a'(p1, p2 - (t + p1)) = b'(p1, p2) * c'(p1, p2)
    ...
  endfor
endfor

We can check that, for a given $t$, the "writes" for array $a$ occur between processors that have the same first component. The distance function is $(p_1, p_2) - (p_1, p_2 - (t + p_1)) = (0, t + p_1)$, which is a rank-1 expression.
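A short self-contained check (an added sketch, not from the paper) recovers these transformed access matrices by applying $\tilde S^{-1}$:

    import numpy as np

    M_S1  = np.array([[0, 0, 1], [1, 1, 0]])
    M_a   = np.array([[0, 1], [1, -1]])
    D_S1a = np.array([[0, 1, 0], [0, 0, 1]])

    S_tilde = np.vstack([[1, 0, 0], M_S1])      # first row pi^t = (1,0,0), then M_S1
    S_inv = np.linalg.inv(S_tilde)

    print(M_S1 @ S_inv)          # [[0,1,0],[0,0,1]]: S1 runs on processor (p1, p2)
    print(M_a @ D_S1a @ S_inv)   # [[0,1,0],[-1,-1,1]]: a(i,j) lives on (p1, p2 - (t + p1))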
3.2. Efficient communications

Up to now we have not made precise our assumptions regarding communication protocols, because the alignment equations are not related to the scheduling of the loop nest. We assume a model where, at each time step, communications are routed so that each processor responsible for a computation $S(I)$ receives the values read by its computation and stores the written value into the processor that owns it. We can view the abstract operation mode as a succession of synchronous "read-compute-write" parallel steps.

3.2.1. Broadcasting

Broadcasting occurs when many virtual processors access the same data at a given time step. Consider the following situation:

for I = ... do
  Statement S(I): ... = a(DI - c)
  ...
endfor

where $I$ denotes the iteration vector with $n$ components. Assume that for some reason the matrix equation $M_S = M_a D$ cannot be satisfied, so that the communication cannot be made local. Let $\pi$ be the scheduling vector for statement $S$, i.e. computation $S(I)$ is scheduled at time $\pi^t I$. The same index $x$ of array $a$ is read by several processors at the same time step $t$ if there exist distinct computation indices $I_1, I_2 \in \mathrm{Dom}$ such that

1. $t = \pi^t I_1 = \pi^t I_2$ (same time step),
2. $x = D I_1 = D I_2$ (same data read).

This implies that $I_1 - I_2 \in \mathrm{Ker}(\pi^t) \cap \mathrm{Ker}(D)$. $\mathrm{Ker}(\pi^t)$ is a hyperplane of dimension $n - 1$.
If $\mathrm{Ker}(\pi^t) \subseteq \mathrm{Ker}(D)$, the same data will be broadcast to all active processors at a given time step: in other words, the restriction of $D$ to $\mathrm{Ker}(\pi^t)$ is the null map. If it is not possible to achieve the condition $\mathrm{Ker}(\pi^t) \subseteq \mathrm{Ker}(D)$, we will try to choose $\pi$ so that the restriction of $D$ to $\mathrm{Ker}(\pi^t)$ is of minimal rank, i.e. so that $\mathrm{Ker}(\pi^t) \cap \mathrm{Ker}(D)$ is of maximal dimension. Let $p$ be this rank:

- If $p = 0$, we have a full broadcast.
- If $p = n - 1$, we have a point-to-point communication.
- Otherwise, we have a partial broadcast along some direction.

In short, the smaller $p$, the more efficient the communication. Consider the following simple example:
Example 2
for i = 1 to P do
  for j = 1 to P do
    for k = 1 to P do
      ...
      (i)   Statement S: a(i, j) = b(i + j + k, i + j + k)
      (ii)  Statement S: a(i, j) = b(i + j, i + j)
      (iii) Statement S: a(i, j) = b(i + k, -2j + k)
      ...
    endfor
  endfor
endfor

Assume that the scheduling vector for statement $S$ is $\pi = (1, 1, 1)^t$, hence $\mathrm{Ker}(\pi^t) = \mathrm{Vect}((0, 1, -1)^t, (1, -1, 0)^t)$. We have the following cases:

Case (i). $D = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$. We have $\mathrm{Ker}(D) = \mathrm{Ker}(\pi^t)$, hence $p = 0$ and a full broadcast: at a given time step $t$, the same data $b(i+j+k, i+j+k)$, with $i+j+k = t$, is accessed by all the processors.

Case (ii). $D = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \end{pmatrix}$. We have $\mathrm{Ker}(D) = \mathrm{Vect}((0, 0, 1)^t, (1, -1, 0)^t)$, hence $p = 1$ and a partial broadcast: each value $b(i+j, i+j)$ is accessed by all processors $(p_1, p_2)$ of the diagonal $p_1 = p_2 = t - k = i + j$.

Case (iii). $D = \begin{pmatrix} 1 & 0 & 1 \\ 0 & -2 & 1 \end{pmatrix}$. We have $\mathrm{Ker}(D) = \mathrm{Vect}((-2, 1, 2)^t)$, hence $p = 2$ and each processor accesses a distinct value at time $t$.
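As an illustration (a sketch added here, not from the paper), the rank $p$ of the restriction of $D$ to $\mathrm{Ker}(\pi^t)$ can be computed by applying $D$ to a basis of $\mathrm{Ker}(\pi^t)$:

    import numpy as np

    pi_kernel = np.array([[0, 1, -1], [1, -1, 0]]).T   # basis of Ker(pi^t), pi = (1,1,1)

    cases = {
        "i":   np.array([[1, 1, 1], [1, 1, 1]]),
        "ii":  np.array([[1, 1, 0], [1, 1, 0]]),
        "iii": np.array([[1, 0, 1], [0, -2, 1]]),
    }
    for name, D in cases.items():
        p = np.linalg.matrix_rank(D @ pi_kernel)       # rank of D restricted to Ker(pi^t)
        print(f"case ({name}): p = {p}")               # 0 full, 1 partial, 2 point-to-point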
3.2.2. Message vectorization

Message vectorization can take place when a processor accesses data from another processor that remains the same for several consecutive time steps: data items to be sent can then be regrouped into packets sent just in time to reach their destination (the time deadlines are given by the schedule). Consider again the following situation:

for I = ... do
  Statement S(I): ... = a(DI - c)
  ...
endfor

where $I$ denotes the iteration vector with $n$ components. Recall that the space/time matrix is given by
$$\tilde S = \begin{pmatrix} \pi^t \\ M_S \end{pmatrix}.$$
This means that at time step $t$, processor $p$ is responsible for the computations described by the iteration vector $I = \tilde S^{-1} \begin{pmatrix} t \\ p - \alpha_S \end{pmatrix}$. For this, processor $p$ needs some data belonging to processor $M_a (DI - c) + \alpha_a$. This expression does not depend on $t$ if $M_a D \tilde S^{-1} = (0 \mid X)$ for some $(n-1) \times (n-1)$ matrix $X$. This leads to $M_a D = (0 \mid X) \tilde S = X M_S$. This condition is equivalent to $\mathrm{Ker}(M_S) \subseteq \mathrm{Ker}(M_a D)$, and, when $M_a$ is nonsingular, to $\mathrm{Ker}(M_S) \subseteq \mathrm{Ker}(D)$. Consider the following example:
Example 3
for i = ... to ... do
  for j = ... to ... do
    for k = ... to ... do
      ...
      Statement S: ... = a(i + j + k, k)
      ...
    endfor
  endfor
endfor

Here $D = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}$. Its null space is $\mathrm{Vect}((1, -1, 0)^t)$. Suppose that $M_S$ has finally been chosen equal to $\begin{pmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$, with the scheduling vector $\pi = (1, 0, 0)^t$. Then the nest can be rewritten as:

for t = ... to ... do
  for p1 = ... to ... do
    for p2 = ... to ... do
      ...
      Statement S: ... = a(p1 + p2, p2)
      ...
    endfor
  endfor
endfor

and the communications from processor $(p_1 + p_2, p_2)$ to processor $(p_1, p_2)$ can be vectorized.
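A small numerical check (an added sketch; the identity mapping $M_a = I$ for array $a$ is an assumption that matches the processor indices used above) verifies both the kernel condition and the fact that the owning processor does not depend on $t$:

    import numpy as np

    D   = np.array([[1, 1, 1], [0, 0, 1]])   # access a(i+j+k, k)
    M_S = np.array([[1, 1, 0], [0, 0, 1]])
    pi  = np.array([1, 0, 0])

    # Kernel condition: Ker(M_S) included in Ker(D); here both kernels are Vect((1,-1,0)^t)
    v = np.array([1, -1, 0])
    assert np.allclose(M_S @ v, 0) and np.allclose(D @ v, 0)

    # Equivalent formulation: M_a D S^{-1} has a zero first column (no dependence on t)
    S_tilde = np.vstack([pi, M_S])
    col_t = (D @ np.linalg.inv(S_tilde))[:, 0]   # M_a = identity here (assumption)
    assert np.allclose(col_t, 0)
    print("owning processor is (p1 + p2, p2), independent of t")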
4. The alignment problem is NP-complete
In this section, we restrict ourselves to the simplest case and show that, even under strong restrictions, the alignment problem is already NP-complete. We consider a perfect loop nest of depth $n$ in which all access functions to arrays are translations. Therefore, all arrays are $n$-dimensional arrays and the loop nest has the following form:

for I = ... to ... do
  ...
  S(I): a(I + c) = ... b(I - d) ...
  ...
endfor

$I$ represents the $n$-dimensional iteration vector. Computations and data are projected onto an $(n-1)$-dimensional virtual processor space and possibly aligned differently. For each statement or array $x$, the allocation function $\mathrm{alloc}_x$ is then defined by
$$\mathrm{alloc}_x(I) = M_x I + \alpha_x,$$
x
b;S
b;S
S
b;S
S
S
b
a
S
b
b
b
S
b
S;a
S;a
a
S;a
a
S
a
S
a
S
S
b
a
x
b;S
= Md + ? S
10
b
(4)
and
$$\delta_{S,a} = M c + \alpha_a - \alpha_S, \qquad (5)$$
which do not depend upon $I$ anymore. Therefore, our goal is to determine a projection matrix $M$ and constants $\alpha_x$ so as to zero out the maximum number of vectors $\delta$, i.e. so as to find the allocation leading to the maximum number of internal communications (the remaining ones being fixed-distance, local communications). Our main result is that (the decision problem associated with) the alignment problem is NP-complete, even when the projection matrix $M$ is fixed. The proof of this result can be found in [2], together with an efficient heuristic based upon the analysis of the cycles in the communication graph.
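To convey the combinatorial flavour of the reduced problem (this is an illustrative toy sketch, not the reduction or the heuristic of [2]; the edge constants and one-dimensional offsets are made up for readability), note that once $M$ is fixed, each edge of the communication graph becomes internal exactly when the difference of the two endpoint offsets equals a prescribed constant, and we must choose the offsets to satisfy as many edges as possible:

    from itertools import product

    # Edge (u, v) is internal iff alpha[v] - alpha[u] == k_uv (here k_uv plays the role of M*d).
    edges = {("b", "S1"): 1, ("S1", "a"): 0, ("a", "S2"): 2, ("S2", "b"): -1}
    nodes = sorted({x for e in edges for x in e})
    candidates = range(-2, 3)                      # small search space for the offsets

    best = max(
        (sum(alpha[nodes.index(v)] - alpha[nodes.index(u)] == k
             for (u, v), k in edges.items()), alpha)
        for alpha in product(candidates, repeat=len(nodes))
    )
    print("internalized edges:", best[0], "with offsets", dict(zip(nodes, best[1])))

In this toy instance the constants do not sum to zero around the cycle b, S1, a, S2, so at most three of the four edges can be internalized: this is exactly the cycle constraint on the alignment constants mentioned in Section 2.2, and it is the structure exploited by the heuristic of [2].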
5. Conclusion
In this paper we have dealt with the alignment problem. Our main technical contributions are the following:
- The introduction of the communication graph, which captures all the information required to align data and computations (an extension of "the owner computes" rule).
- The formulation of several goals to reduce communication overhead, such as localization (matrix equations), communication rank minimization, broadcast, and message vectorization.
- The proof that the alignment problem is NP-complete, even in the simple case of perfect loop nests with uniform dependences. This result generalizes that of Li and Chen [13] for general affine loop nests and demonstrates the intrinsic complexity of the alignment problem. Fortunately, we have given an efficient heuristic in the case of uniform nests, based upon the analysis of the cycles in the communication graph.
References

1. Utpal Banerjee. Loop Transformations for Restructuring Compilers: The Foundations. Kluwer Academic Publishers, 1993.
2. Alain Darte and Yves Robert. A graph-theoretic approach to the alignment problem. Technical Report 93-20, Laboratoire LIP, Ecole Normale Superieure de Lyon, July 1993.
3. Alain Darte, Tanguy Risset, and Yves Robert. Loop nest scheduling and transformations. In J.J. Dongarra and B. Tourancheau, editors, Environments and Tools for Parallel Scientific Computing, volume 6 of Advances in Parallel Computing, pages 309-332. North Holland, 1993.
4. Paul Feautrier. Dataflow analysis of array and scalar references. Int. J. Parallel Programming, 20(1):23-51, 1991.
5. Paul Feautrier. Some efficient solutions to the affine scheduling problem, part I: one-dimensional time. Int. J. Parallel Programming, 21(5):313-347, 1992.
6. Paul Feautrier. Some efficient solutions to the affine scheduling problem, part II: multi-dimensional time. Technical Report 92-78, Laboratoire MASI, Universite Pierre et Marie Curie, Paris, October 1992. To appear in Int. J. Parallel Programming.
7. Paul Feautrier. Towards automatic distribution. Technical Report 92-95, Laboratoire MASI, Institut Blaise Pascal, Paris, December 1992.
8. High Performance Fortran Forum. High Performance Fortran language specification. Technical report, Rice University, January 1993.
9. Gelernter, Nicolau, and Padua, editors. Languages and Compilers for Parallel Computing. The MIT Press, Cambridge, Massachusetts, 1990.
10. Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng. Compiler optimizations for FORTRAN D on MIMD distributed-memory machines. In Supercomputing 91, pages 86-100. IEEE Computer Society Press, November 1991.
11. Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng. Compiling Fortran D for MIMD distributed-memory machines. Comm. of the ACM, 35(8):66-80, 1992.
12. C.H. Huang and P. Sadayappan. Communication-free hyperplane partitioning of nested loops. In Banerjee, Gelernter, Nicolau, and Padua, editors, Languages and Compilers for Parallel Computing, volume 589 of Lecture Notes in Computer Science, pages 186-200. Springer Verlag, 1991.
13. Jingke Li and Marina Chen. The data alignment phase in compiling programs for distributed memory machines. J. Parallel Distrib. Computing, 13:213-221, 1991.
14. Jason D. Lukas and Kathleen Knobe. Data optimization and its effect on communication costs in MIMD Fortran code. In Dongarra, Kennedy, Messina, Sorensen, and Voigt, editors, Fifth SIAM Conference on Parallel Processing for Scientific Computing, pages 478-483. SIAM Press, 1992.
15. Michael O'Boyle. Program and Data Transformations for Efficient Execution on Distributed Memory Architectures. PhD thesis, University of Manchester, January 1992.
16. Michael O'Boyle and G.A. Hedayat. Data alignment: transformations to reduce communications on distributed memory architectures. In Scalable High-Performance Computing Conference SHPCC-92, pages 366-371. IEEE Computer Society Press, 1992.
17. J. Ramanujam and P. Sadayappan. Compile-time techniques for data distribution in distributed memory machines. IEEE Trans. Parallel Distributed Systems, 2(4):472-482, 1991.
18. Hans Zima and Barbara Chapman. Supercompilers for Parallel and Vector Computers. ACM Press, 1990.