Design of a Meta-Parallelizer for Large Scientific Applications*

Jean-Yves Berthou (1,2)

(1) Institut National de Recherche en Informatique et en Automatique, Domaine de Voluceau-Rocquencourt
(2) Laboratoire PRiSM, Universite de Versailles - St Quentin

* This research was supported in part by Centre de Recherche en Informatique de Montreal, 1801 avenue McGill College, Bureau 800, Montreal (Quebec) H3A 2N4, CANADA. E-mail: [email protected]
Abstract. The "classical parallelizers" integrate more and more sophis-
ticated and costly parallelization techniques. As a result, they are limited by the size of the program they are able to compile and are not well suited to parallelize real scienti c programs whose parallelism detection complexity may dier very much from one code fragment to another. The goal of our work is the construction of a meta-parallelizer for shared memory MIMD machines, LPARAD (Large applications, PARallelizer, ADaptative), able to eciently parallelize large scienti c programs with minimum help from the user in a minimal time. The implementation of LPARAD within the scope of the PAF [PAF90] project validates our approach. Key-words: automatic parallelization, meta-parallelizer, adaptative and interactive parallelization, performance measures, subprograms inlining.
1 Introduction

Designing and writing a parallel program is a very difficult process. Moreover, to do an efficient job, the parallel program has to be tailored to the underlying architecture to such an extent that portability becomes a major problem. A solution to these difficulties is the development of automatic parallelizers. Many "classical parallelizers" [ABC+87, IJT90, PAF90, Pac86] use the following paradigm. In a first pass, dependence relations are computed; dependences inhibit parallelism. In a second pass, constructions (e.g., loops) whose parallelization is not inhibited by dependences are detected and translated to appropriate code for the target architecture. A limited set of transformations, such as loop-splitting and scalar expansion, may be applied to the source program in order to improve the parallelism of the object code. The critical path is the dependence computation. Recent research has led to efficient implementations of fast tests like the Fourier-Motzkin algorithm [IJT90, Pug91], with reasonable performance. However, the results are somewhat disappointing. Successful parallelization may depend on the detection of subtle relations between program variables [Fea88].
To be handled successfully, complicated subscripts may need program restructuring followed by a complex non-linear analysis [Dum92]. Large programs are built from many subroutines, which have to be submitted either to interprocedural dependence analysis [TIF86], a difficult process, or to inlining, which generates enormous programs. Lastly, the set of available transformations is too limited. New methods for improving parallel programs have been suggested [RWF90, WL91, Wol87] which are of much higher complexity than those found in the previous generation of parallelizers; their indiscriminate application to large codes does not seem possible in the near future. Moreover, the complexity of parallelism detection is not uniformly distributed over the source program; the parallelization techniques applied must therefore be adapted to the kinds of parallelism the source program contains.

The main theme of the present paper is the exploration of possible solutions to these difficulties. Our main contention is that complicated parallelization techniques need only be applied to the parts of the source program which account for a significant fraction of the total running time and cannot be satisfactorily parallelized by simpler means. This applies both to the sophisticated transformations alluded to above and to subroutine call processing. The goal of our work is the construction of a meta-parallelizer, LPARAD (Large applications PARallelizer, ADaptative), able to efficiently parallelize large scientific programs with minimum help from the user in a minimal time. We call a program a large scientific program if it is not parallelizable by a classical parallelizer.
1.1 Overview of our Approach
To produce an efficient parallel version of a large program, LPARAD must determine on which parts of the source program the parallelization efforts have to be applied (i.e., which are the kernels, the code fragments that concentrate computation) and which efforts should be applied. To efficiently parallelize a large program in a single pass, we would need to detect the kernels directly, which requires an execution of the program, costly and sometimes even impossible; we would also need the user's help to identify the kind of parallelism contained in each of the kernels, which we believe is too much of a burden for the user to bear. Thus, LPARAD proceeds by successive refinement. First, all the code fragments which are possible kernels are identified and parallelized independently from one another by any classical parallelizer; in this way LPARAD obtains the first parallel version of the source program without the user's help. Performance measures conducted on the parallel program identify the code fragments which are kernels and have been badly parallelized. A diagnosis of the parallelization inefficiency is produced with the user's help, and a set of remedies is proposed for each of these code fragments only. This leads to the construction of a new parallelization plan and then to the generation of a new parallel program. The parallelization process stops when either the performance of the parallelization is satisfactory according to the user or no cause of inefficiency has been detected. Since LPARAD concentrates its efforts on parallelism detection and data communication is not considered, the target architectures are shared memory MIMD machines. We consider the extension to distributed memory machines in the conclusion.
1.2 LPARAD Implementation
LPARAD may be implemented in the scope of any classical parallelizer X; the result is called Meta-X. The PAF [PAF90] parallelizer is a good candidate for LPARAD, since it performs a very efficient but very costly parallelism detection. Moreover, a set of specialized compilers is currently under development in the scope of the PAF project [Dum92, Red92, Col94]. Our current implementation of LPARAD in the scope of the PAF project, Meta-PAF, validates our approach. The source language of PAF and Meta-PAF is a subset of Fortran 77: EQUIVALENCE, ENTRY, and computed and assigned GOTOs are forbidden. The target machine is an Encore Multimax, and EPF (Encore Parallel Fortran) is the target parallel language. The next four sections describe the four steps of LPARAD. Meta-PAF has been validated on two programs; experimental results are presented in Sect. 6.
2 Construction of the First Parallelization Plan

Which code fragments have to be parallelized by X? A code fragment is parallelizable by X if the size of the computations carried out for its parallelization does not exceed the memory space available on the machine on which it is executed. But the parallelization time of a code fragment increases at least quadratically with the number of array references it contains, and so quickly becomes prohibitive. Moreover, in scientific codes, the outermost loops have a strong probability of being sequential. Thus, LPARAD only selects code fragments, the R-parallelizable code fragments, which X can parallelize in a realistic time, i.e., in a few minutes. As we will see in Sect. 5.3, a parallelizable code fragment which is not R-parallelizable will still be parallelized if the performance measures show that it is a kernel and that its efficiency is bad. This way, the execution time of the first parallelization is acceptable and each parallelizable code fragment which is a kernel will always be parallelized.

How does LPARAD construct the first parallelization plan? The kernels may be detected directly [FH92]: the execution time of each R-parallelizable code fragment is measured or estimated, then a kernel selection algorithm (like the one presented in Sect. 4.2) is applied to the whole set of fragments. If the data set does not fully describe the complexity of the source program, some kernels may be forgotten. In addition, whether the kernel detection is performed at run-time [Fah93] or by computing the program complexity, at least one execution of the source program is needed (the complexity of the fragments may indeed be a function of quantities only measurable at run-time, like the probability that a structured "if" condition is met, or the values of the bounds of a do loop). This is costly and may even be impossible.
Now, it is quite conceivable to parallelize a large number of little pieces of code, on a network of workstations for instance. It is the loop nests that concentrate computation and are therefore likely to contain efficiently exploitable parallelism; moreover, the detection of parallelism between loop nests is very costly and its results are hardly usable. Hence, a better idea is to use the loop nests as the initial kernels. By parallelizing the R-parallelizable loop nests with the X classical parallelizer, all the R-parallelizable kernels are parallelized at least once, and the execution of the source program on a single processor is avoided. Thus, LPARAD distributes all the loop nests of the source program among three distinct groups:
- A nest is a large nest if it is not R-parallelizable by X and is not contained in a large nest.
- A nest is a nest to be parallelized by X if it is R-parallelizable by X and not contained in a nest to be parallelized by X.
- The remaining nests, those contained in a nest of one of the two preceding groups, form the third group.
Since a large nest may contain R-parallelizable nests, it may contain nests to be parallelized by X. Interprocedural parallelization is difficult and very costly [IJT90]. In addition, some subprograms may be efficiently parallelized without reference to the calling code, which makes the parallelization of some of their calls useless. Thus, no interprocedural analysis is performed during the first parallelization, even if the X parallelizer can do it.
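To make the grouping concrete, here is a minimal sketch (ours, not LPARAD's actual code) of the classification, assuming a per-nest cost estimate such as the CNSI measure of the next subsection and a hypothetical threshold SIM_TIME below which a nest is R-parallelizable:

```python
# Sketch: distribute the loop nests of a program among the groups of Sect. 2.
# The Nest class, its cost field and the SIM_TIME threshold are hypothetical
# stand-ins for the quantities defined in the text.

class Nest:
    def __init__(self, cost, children=()):
        self.cost = cost          # estimated parallelization cost of the nest
        self.children = children  # loop nests directly contained in this nest

SIM_TIME = 2500  # largest cost X can handle in a "realistic time" (Sect. 2)

def classify(nest, plan, in_selected=False, in_large=False):
    """Recursively assign each nest to a group of the parallelization plan."""
    if nest.cost <= SIM_TIME and not in_selected:
        plan["to_parallelize"].append(nest)  # R-parallelizable, outermost such
        in_selected = True
    elif nest.cost > SIM_TIME and not in_large:
        plan["large"].append(nest)           # not R-parallelizable, outermost
        in_large = True
    for child in nest.children:
        classify(child, plan, in_selected, in_large)

plan = {"to_parallelize": [], "large": []}
# A large outer nest containing two R-parallelizable nests:
classify(Nest(44000, [Nest(1200), Nest(800, [Nest(100)])]), plan)
print(len(plan["to_parallelize"]), "nests to parallelize,",
      len(plan["large"]), "large nest(s)")   # 2 nests to parallelize, 1 large
```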
Construction of the Meta-PAF first parallelization plan
The computation of the dependence graph of a code fragment A constitutes the bottleneck of the PAF parallelizer. More precisely, the bottleneck is the number of systems of inequalities to solve in order to build the dependence graph, CNSI(A, PAF). CNSI(A, PAF) is a function of the number of array references in the fragment and of the depth of the nested loops that contain them. We have designed a simple algorithm which accurately estimates CNSI for PAF; it also gives an estimate of the execution time of PAF. We define SIM_spc(PAF, M) as the maximum CNSI(A, PAF) of a code fragment A such that A is parallelizable by PAF on the machine M, and SIM_time(PAF, M) as the maximum CNSI(A, PAF) of a code fragment A such that A can be parallelized by PAF on the machine M in a realistic time. The experiments we have done with PAF on our workstation, a Sun4, show that SIM_spc(PAF, Sun4) and SIM_time(PAF, Sun4) are equivalent (approximately 2500), which usually represents less than 100 statements.
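The exact estimation algorithm is given in [Ber93] and is not reproduced in the paper; the sketch below only illustrates its general shape, under our own assumption that one system of inequalities is solved per pair of references to the same array (with at least one write) and per common nesting depth, which reproduces the quadratic growth in the number of references noted above:

```python
# Hedged sketch of a CNSI-style estimate: one system of inequalities per
# pair of references to the same array (at least one being a write) and per
# common loop depth. The real PAF estimator may differ; see [Ber93].

def cnsi_estimate(references):
    """references: list of (array_name, loop_depth, is_write) triples."""
    systems = 0
    for i, (arr1, d1, w1) in enumerate(references):
        for arr2, d2, w2 in references[i:]:   # include self-dependences
            if arr1 == arr2 and (w1 or w2):   # a dependence is possible
                systems += min(d1, d2) + 1    # one system per common depth
    return systems

# Two writes and one read of array u, all at depth 2:
print(cnsi_estimate([("u", 2, True), ("u", 2, True), ("u", 2, False)]))  # 15
```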
3 Construction of the Parallel Program

Each time a parallelization plan is produced, a parallel version of the source program is generated in the following way. The loop nests to be parallelized are extracted from the source program, then parallelized independently from one another. The parallel versions of these nests are then used to produce a parallel version of the source program. This phase consists simply of tree pruning and grafting on the intermediate representation of the object code. Its implementation in the scope of the PAF project is described in detail in [Ber93].
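As an illustration of the prune-and-graft phase, the following toy fragment (ours; PAF's intermediate representation is of course richer) substitutes parallelized nests into the program tree:

```python
# Toy prune-and-graft: replace each selected nest of the program tree by its
# parallelized version. Node is a stand-in for PAF's intermediate form.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def graft(node, parallel_versions):
    """Rebuild the tree, grafting parallel versions over selected nests."""
    if node.label in parallel_versions:
        return parallel_versions[node.label]   # graft the parallel nest
    node.children = [graft(c, parallel_versions) for c in node.children]
    return node

program = Node("main", [Node("nest1"), Node("nest2")])
program = graft(program, {"nest1": Node("nest1_par")})
print([c.label for c in program.children])     # ['nest1_par', 'nest2']
```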
4 Performance Measures

The problem posed after the parallelization of a sequential program is the evaluation of the efficiency of this parallelization. If this efficiency is bad, we have to find the causes of the inefficiency, that is to say, the code fragments that concentrate computation and have been badly parallelized; we say they satisfy the inefficiency criteria.
4.1 Defining Criteria to Evaluate the Parallelization Efficiency

Traditionally, the parallelization efficiency is measured by the speed-up and the efficiency (the speed-up is the ratio of the execution time of the best sequential program known to the execution time of the parallel program on p processors; the efficiency is the ratio of the speed-up to p). When a parallel program proves inefficient, these criteria do not explain the inefficiency: the intrinsic parallelism may have been only partially detected, or badly exploited. Memory management, data transfers from one level of the memory hierarchy to another, and memory contention phenomena generate what we call noise. We established that, in most cases, noise increases with the number of processors, which decreases the efficiency of the parallel program. But the efficiency of the parallelism detection does not depend on the noise generated by the target machine. Therefore, we define the effective efficiency of the parallelization of a sequential program as the efficiency of the resulting parallel program on an idealized machine with no noise (the question of optimizing a program to minimize such effects is beyond the scope of this paper; see [Win92] for such an approach). We established in [BK92] a relation that estimates the effective efficiency of the parallelization as a function of the efficiency of the parallel program and of the noise generated by the target machine. The difference between the traditional efficiency and our estimate of the effective efficiency gives an estimate of the drop in efficiency due to machine-generated noise. If it is not possible to execute the parallel program on one processor, LPARAD can execute in a degraded mode: approximate inefficiency criteria are computed from the parallel execution only [Ber93].
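The relation of [BK92] itself is not reproduced here; the following lines only encode the classical criteria recalled above (function names are ours):

```python
# Classical parallelization criteria of Sect. 4.1.

def speed_up(t_seq, t_par):
    """Best known sequential time divided by the parallel time."""
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    """Speed-up divided by the number of processors p."""
    return speed_up(t_seq, t_par) / p

# tmines on 7 processors (Sect. 6.1): efficiency 0.500, i.e. speed-up 3.5.
print(round(efficiency(100.0, 100.0 / 3.5, 7), 3))   # 0.5
```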
4.2 Defining the Inefficiency Criteria

At which weight is a code fragment said to be a kernel? (The weight Weight(A, J) of a code fragment A for a given data set J is the ratio of its execution time to the execution time of the whole sequential program on J.)
A first approach would consist of selecting the code fragments whose weight is greater than a threshold, 5% for example. But a program may contain only code fragments whose weight is lower than this threshold. Another approach is to sort the code fragments according to their weights; the kernels are then the code fragments with the greatest weights, such that the sum of their weights is greater than a certain threshold t. By doing this, we may select code fragments of negligible weight. We therefore designed the following algorithm, which attempts to avoid both pitfalls (forgetting a kernel, and selecting a code fragment which is not a kernel). For each input data set J provided by the user, let C1, C2, ..., Cn be the code fragments of the source program, sorted by decreasing weight:
- C1 is a kernel for J;
- Ci, 2 <= i <= n, is a kernel for J if C_{i-1} is a kernel for J, and sum_{k=1..i-1} Weight(Ck, J) < t, and Ci's weight for J is at least equal to 10% of C_{i-1}'s.
A code fragment is a kernel if it is a kernel for at least one input data set.
According to Amdahl's law, the speed-up is bounded by P / (P(1 - t) + t), where P is the number of processors of the target machine. Thus, t has to be set according to the effective efficiency of the parallelization, E, that the user wants to reach; that is, t is set to (P - 1/E) / (P - 1). For example, in the case of the parallelization by Meta-PAF of the programs tmines and Onde 24 (Sect. 6), the user set E to 60% and 90% respectively. Since our target machine has seven processors available (P = 7), t is 89% for tmines and 98% for Onde 24.
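The selection algorithm and the setting of t translate directly into code; the sketch below (ours) follows the three conditions and the Amdahl-based threshold above:

```python
# Kernel selection for one data set J, and the Amdahl-based threshold t.

def threshold(p, e_target):
    """t = (P - 1/E) / (P - 1): fraction of the weight the kernels must
    cover so that efficiency E remains reachable on P processors."""
    return (p - 1.0 / e_target) / (p - 1)

def kernels_for(weights, t):
    """weights: fragment weights for J, sorted in decreasing order."""
    kernels, covered = [], 0.0
    for i, w in enumerate(weights):
        if i > 0 and not (covered < t and w >= 0.10 * weights[i - 1]):
            break          # conditions are chained through C_{i-1}: stop here
        kernels.append(i)  # C1 is always a kernel for J
        covered += w
    return kernels

print(round(threshold(7, 0.60), 3))   # 0.889 -> t = 89% for tmines
print(round(threshold(7, 0.90), 3))   # 0.981 -> t = 98% for Onde 24
print(kernels_for([0.40, 0.30, 0.02, 0.01], threshold(7, 0.60)))   # [0, 1]
```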
At which efficiency is a kernel said to be badly parallelized?
Given the fact that the bigger the weight of a code fragment, the more important it is to parallelize it correctly, we have established a weight/efficiency correspondence table, interpreted as follows: a code fragment satisfies the inefficiency criteria if it is a kernel and if the relations in one of the columns of the table hold for it for at least one input data set.

weight (%)     | ]0, 5[  | [5, 10[ | [10, 20[ | [20, 30[ | [30, 40[ | [40, 50[ | [50, 60[ | [60, 100]
efficiency (%) | ]0, 10] | [0, 20] | [0, 35]  | [0, 50]  | [0, 65]  | [0, 70]  | [0, 75]  | [0, 80]
We define dE = C(1 - E) as the maximum gain of efficiency that the (re)parallelization of a code fragment may induce, where C is its weight and E its efficiency. The correspondence between efficiency and weight has been chosen such that dE does not vary too much from one column of the table to the next.
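One way to check this design choice numerically (our own reading, not a computation from the paper) is to evaluate dE at the most favorable corner of each column, i.e., at the largest weight and the highest tolerated efficiency; the middle columns then cluster around 13 to 15 points:

```python
# dE = C * (1 - E) at the upper corner of each column of the table
# (weight upper bound w_hi in %, efficiency ceiling e_hi).

columns = [(5, 0.10), (10, 0.20), (20, 0.35), (30, 0.50),
           (40, 0.65), (50, 0.70), (60, 0.75), (100, 0.80)]
for w_hi, e_hi in columns:
    print(f"w_hi = {w_hi:3}%, e_hi = {e_hi:.2f}: dE = {w_hi * (1 - e_hi):.1f}")
# prints 4.5, 8.0, 13.0, 15.0, 14.0, 15.0, 15.0, 20.0
```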
5 Construction of a New Parallelization Plan

The parallelized nests, the subprogram calls and the large nests that satisfy the inefficiency criteria constitute three potential sources of inefficiency which will, when processed, generate new sets of nests to be parallelized and of large nests. This defines a new parallelization plan.
5.1 Processing Parallelized Nests Satisfying the Inefficiency Criteria

There are various reasons for the bad parallelization of a given nest. The nest may contain parallelism that programming habits, or optimizations of memory management, execution time or code size, have hidden. Some information essential to the parallelism detection may be absent from the nest. The nest may also contain parallelism that the X classical parallelizer is not able to detect; where X failed, other parallelizers, specialized in detecting specific forms of parallelism (and generally more costly), might succeed. The following table presents the set of potential causes of parallelization inefficiency currently indexed by LPARAD:
Symptom: While loop.
Cause: The while loop is not parallelizable by a classical parallelizer.
Remedy: Parallelize the while loop [Col94] or manually transform it into a do loop.

Symptom: Non-linear array references.
Cause: No parallelism detectable by a classical parallelizer.
Remedy: Parallelize the nest using [Dum92].

Symptom: Cycle in the dependence graph (DG) of the nest.
Cause: Possible presence of a recurrence.
Remedy: Submit the nest to [Red92].

Symptom: Flow dependence detected.
Cause: No parallelism detectable by a classical parallelizer.
Remedy: Apply tiling to exploit pipeline parallelism [Wol87].

Symptom: Indirect array references.
Cause: No parallelism detectable by a classical parallelizer.
Remedy: Manually replace each indirect array reference by *.

Symptom: Subprogram called within an array subscript.
Cause: No parallelism detectable by a classical parallelizer.
Remedy: Manually replace the subprogram call by *.

Symptom: Unknown variable contained in an array subscript.
Cause: Missing information makes the parallelism detection impossible.
Remedy: Manually replace the variable by *.

*: an affine function of the surrounding loop counters and the size parameters, or a constant.
LPARAD produces a diagnosis for each nest that satisfies the inefficiency criteria and proposes a set of remedies for it. We will improve LPARAD's expertise in recognizing other symptoms of parallelization inefficiency by parallelizing real scientific programs, thanks to implementations such as Meta-PAF. Following the LPARAD diagnosis, the user either decides which automatic or manual transformations should be applied and to which classical or specialized parallelizer the nest should be submitted, or proposes his own parallel version of the nest. If LPARAD detects no cause for the parallelization inefficiency, the user is invited to do nothing if he thinks the nest doesn't contain any parallelism; otherwise, he is invited to submit the nest to a more powerful, thus more costly, parallelizer, or again to propose his own parallel version.
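The diagnosis step can be seen as the application of a rule table; the toy encoding below (ours, not LPARAD's implementation) shows its flavor on a subset of the rows above:

```python
# A subset of the symptom -> (cause, remedy) table of Sect. 5.1 encoded as
# rules; a sketch of the diagnosis step only.

RULES = {
    "while_loop": ("not parallelizable by a classical parallelizer",
                   "parallelize it [Col94] or transform it into a do loop"),
    "cycle_in_dg": ("possible presence of a recurrence",
                    "submit the nest to [Red92]"),
    "indirect_refs": ("no parallelism detectable by a classical parallelizer",
                      "manually replace each indirect reference"),
}

def diagnose(symptoms):
    """Return the (symptom, cause, remedy) triples for known symptoms."""
    return [(s,) + RULES[s] for s in symptoms if s in RULES]

for symptom, cause, remedy in diagnose(["cycle_in_dg", "unknown"]):
    print(f"{symptom}: {cause}; remedy: {remedy}")
```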
5.2 Using Performance Measures to Process Subprogram Calls

Subprogram calls may be processed by techniques based on interprocedural analysis of the source program or by techniques based on inlining the called subprogram. The first technique, developed in [IJT90], rests on the computation of the array sections that are read and modified by each subprogram call; it is very costly. Total inlining consists of replacing each subprogram call by the body of the called subprogram; its major drawback is that it generates very large programs, which are very difficult or even impossible to parallelize.

If the X classical parallelizer is capable of performing interprocedural analysis, LPARAD decides for each subprogram call, with the user's help, whether it should be parallelized. Indeed, only the subprogram calls which are contained in an interprocedural loop nest (a subprogram is called from an interprocedural loop nest if it is called from a loop nest or if the calling subprogram is itself called from an interprocedural loop nest) and whose called subprograms satisfy the inefficiency criteria have to be parallelized. Because of the complexity of the interprocedural analysis, LPARAD determines whether it is feasible and estimates its execution time for each of these subprogram calls. For each subprogram call that is possible and necessary to parallelize, the user finally decides whether it will be parallelized; he might decline for several reasons: the execution time of the interprocedural analysis may not be realistic by his standards, the loop nest that contains the call may not have enough iterations, or the execution time of the called subprogram may vary too much from one iteration to the next, so that the parallelism isn't efficiently exploitable on the target machine.

Inlining has to be applied when the interprocedural analysis cannot or should not be applied. Definition: a subprogram A called from B must be inlined if A is called from an interprocedural loop nest and if, for that call, it satisfies one of the two following conditions: A satisfies the inefficiency criteria, or A contains a subprogram call which has to be inlined. Partial inlining poses some technical problems, such as the generation of several copies of some subprograms, the processing of static variables, and the dummy-argument/actual-argument association. These problems are discussed in [Ber93] for the Meta-PAF parallelizer.
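The inlining definition is recursive on the call graph; a minimal sketch follows (ours; the call-graph encoding and the predicates are hypothetical):

```python
# A call to subprogram A must be inlined if it occurs in an interprocedural
# loop nest and A either satisfies the inefficiency criteria or contains a
# call which itself must be inlined (definition of Sect. 5.2).

def must_inline(callee, in_nest, inefficient, calls, seen=()):
    """calls[s]: list of (callee, called_from_interprocedural_nest) pairs."""
    if not in_nest:
        return False
    if inefficient(callee):
        return True
    if callee in seen:                 # guard against recursive call chains
        return False
    return any(must_inline(c, n, inefficient, calls, seen + (callee,))
               for c, n in calls.get(callee, []))

calls = {"B": [("A", True)], "A": [("LEAF", True)]}
def inefficient(name):
    return name == "LEAF"              # only LEAF satisfies the criteria

print(must_inline("A", True, inefficient, calls))   # True: LEAF forces it
print(must_inline("B", True, inefficient, calls))   # True: via A's inlining
```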
Processing subprogram calls by Meta-PAF
Since PAF doesn't perform any interprocedural analysis, Meta-PAF only tests whether each subprogram call should be inlined.
5.3 Processing Large Nests Satisfying the Inefficiency Criteria

Three transformations may be applied to a large nest which satisfies the inefficiency criteria:
- Do nothing, if the user knows the loop nest is intrinsically sequential.
- If the nest contains parallelism and is parallelizable by X, the execution time of its parallelization is estimated from its complexity; this allows the user to decide whether or not to submit it to X.
- If the nest contains parallelism and is not parallelizable by X, or if the user decides not to submit it to X, it has to be subdivided by the user. If he refuses to do it by hand, one alternative remains: we may design a set of parallelizers of decreasing time and space complexity that would be applied according to the complexity of the large nest to be parallelized, as sketched below. Such a set of parallelizers should also permit LPARAD to attempt a loop-splitting of the large nest before asking the user to do so.
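Such a battery of parallelizers could be driven as follows; this sketch is purely illustrative, since the set of parallelizers is only proposed here, and the names and cost functions are invented:

```python
# Proposed cascade (Sect. 5.3): try parallelizers of decreasing complexity
# and precision until one fits the time budget for the large nest.

PARALLELIZERS = [                              # most precise first
    ("PAF",       lambda cnsi: cnsi ** 2.0),   # precise but expensive
    ("fast-test", lambda cnsi: cnsi ** 1.5),   # cheaper, less precise
    ("gcd-test",  lambda cnsi: cnsi * 1.0),    # cheapest, least precise
]

def pick_parallelizer(cnsi, budget):
    """Most precise parallelizer whose estimated cost fits the budget."""
    for name, cost in PARALLELIZERS:
        if cost(cnsi) <= budget:
            return name
    return None        # nothing fits: ask the user to split the large nest

print(pick_parallelizer(2500, budget=1e7))    # PAF (2500^2 = 6.25e6 fits)
print(pick_parallelizer(44000, budget=1e7))   # fast-test (44000^2 is too big)
```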
Processing large nests by Meta-PAF
Given the current status of our implementation, and as long as SIM_spc(PAF, Sun4) = SIM_time(PAF, Sun4) on the Sun4 machine we use, the user is asked to split the large nests himself or to do nothing.
6 Meta-PAF Validation

We have validated Meta-PAF on two programs, from ONERA (Office National d'Etudes et de Recherches Aerospatiales, the French national office for aerospace studies and research) and IFP (Institut Francais du Petrole, the French institute for petroleum) respectively. These programs are good candidates for the validation of Meta-PAF, since their size is 10 to 50 times the size of a program PAF is able to parallelize and since the code they contain is quite diverse. The target machine is an eight-processor Encore Multimax with 16 Mb of shared memory.
6.1 The tmines Program from the ONERA
The tmines program was written by M. Bredif from ONERA. It contains 1300 lines of Fortran 77 (comments excluded) and includes ten subroutines of various sizes. It was supplied with one input data set. N. Emad [Ema92] has parallelized most of its kernels using the PIPS parallelizer [IJT90], but did not perform performance measurements; this work allowed us to verify the parallelization produced by Meta-PAF. The construction of the first parallelization plan produced three large nests and 58 nests to be parallelized by PAF. The first phase of the parallelization process was executed in 73 min 32 s, 61 min of which were spent on the parallelization of the 58 nests by PAF. We obtained the following performance figures:

Number of proc.      |       2        |       3        |       4        |       5        |       6        |       7
Parallel prog. eff.  |     0.843      |     0.762      |     0.661      |     0.594      |     0.536      |     0.500
Parallelization eff. | [0.856, 0.867] | [0.780, 0.796] | [0.687, 0.707] | [0.626, 0.648] | [0.560, 0.583] | [0.528, 0.551]
The detection of the parallelism was relatively efficient (0.86 on two processors), but the parallelism detected is not well suited to the target machine: the efficiency drops to 0.53 on seven processors. There are two main reasons for this drop in efficiency (the noise generated by the target machine is too small to explain it): many loops have a small number of iterations (fewer than 20), and the execution time of the parallel loops is too small compared to the time spent synchronizing the processors (the ratio of computation time to synchronization time is 4.6 for an execution on two processors, but only 1.3 on seven processors). Only two parallelized nests satisfy the inefficiency criteria, and Meta-PAF didn't find any cause for the inefficiency of their parallelization; examining their parallel versions reveals that the loops that PAF had not parallelized are indeed sequential. No subroutine is called from a do loop, thus no call satisfies the inlining criteria (let us note that N. Emad showed that an efficient exploitation of the interprocedural parallelism of tmines is rather inconceivable, since at most two subroutines may be executed simultaneously and their execution times are very different). No large nest satisfies the inefficiency criteria. Since the new parallelization plan is empty, the parallelization stops here. We conclude that the tmines program contains a lot of parallelism that can only be exploited on a fine-grain target machine, on which data transfer times, process synchronization times and task distribution times are quite small.
6.2 The Program Onde 24 from the IFP
Onde 24 is a program which models a bidimensional wave-propagation phenomenon. It is composed of a unique main program of 600 lines (comments excluded) and was supplied with five equivalent data sets. The construction of the first parallelization plan produced one large nest, the time loop, and 24 nests to be parallelized by PAF. The first phase of the parallelization process was executed in 9 min 30 s, 5 min 14 s of which were spent on the parallelization of the 24 nests by PAF. The performance measures show that Onde 24 is still sequential after the first parallelization phase. Meta-PAF detected four nests that satisfy the inefficiency criteria. Its diagnosis was the following: the parallelism is not detectable because of the presence of two variables in array subscripts that are neither loop counters nor size parameters. Meta-PAF didn't propose any applicable remedy. The four nests are of the following type: u(i,2,kp) = f(u(i,2,km), u(i,2,kp), u(i+1,2,km), u(i-1,2,km)), for i in [3, np-1]. km is initialized to 1 and kp to 2, and after each iteration of the time loop, km is set equal to kp and kp to 3-km (this optimization of memory utilization is known as the leap-frog method). So, we replaced km by 3-kp in the four nests, which became new nests to be parallelized by PAF. The only large nest satisfied the inefficiency criteria; since loop-splitting was impossible according to the authors, it became an anonymous nest.
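The km/kp substitution can be replayed on a toy version of the time loop (ours; the stencil f below is an arbitrary stand-in for the IFP computation):

```python
# Toy version of the Onde 24 time loop. The leap-frog scheme keeps two time
# levels, indexed by km and kp, in the last dimension of u.
import numpy as np

n_p, steps = 10, 4
u = np.random.rand(n_p + 2, 3, 3)      # indices 1 and 2 of the last axis used

def f(a, b, c, d):
    return 0.25 * (a + b + c + d)      # arbitrary stencil, not IFP's

km, kp = 1, 2
for _ in range(steps):
    for i in range(3, n_p):            # i in [3, np - 1]
        # With the substitution km = 3 - kp, every subscript becomes a
        # function of kp alone: the write level kp and the read level 3 - kp
        # are clearly distinct, so the i loop can be parallelized.
        u[i, 2, kp] = f(u[i, 2, 3 - kp], u[i, 2, kp],
                        u[i + 1, 2, 3 - kp], u[i - 1, 2, 3 - kp])
    km = kp                            # original leap-frog swap; km is no
    kp = 3 - km                        # longer used inside the nest
print(u[3, 2, kp])
```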
A new phase of construction of a parallel program was then started; the performance figures of the parallelization of Onde 24 are as follows:

Number of proc.      |       2        |       3        |       4        |       5        |       6        |       7
Parallel prog. eff.  |     0.991      |     0.982      |     0.975      |     0.947      |     0.962      |     0.950
Parallelization eff. | [0.990, 0.991] | [0.982, 0.984] | [0.978, 0.980] | [0.950, 0.954] | [0.968, 0.971] | [0.958, 0.962]
All the parallelism in Onde 24 was detected and efficiently exploited by the Encore Multimax.
7 Conclusion

We have presented LPARAD, an adaptative, interactive, partial meta-parallelizer for large scientific applications directed by performance analysis. The program parallelization is divided into two steps: a completely automatic and relatively fast step, the construction of a new parallel program and the performance measurements, and a semi-automatic one, the construction of the new parallelization plan. Experiments conducted with Meta-PAF have confirmed that the user has to intervene only on a very small part of the source program. More generally, kernels represent only a small part of the source program (typically 20% of the code executes 80% of the computations), and only a few kernels contain parallelism that the X parallelizer invoked by LPARAD is not able to detect efficiently. As an example of the improvement in compilation time due to the present approach, we selected a loop nest in the tmines program with an estimated complexity of 44000. The direct computation of its dependence graph took 3 hours; when the nest was split into its component loop nests, the DG computation took only 6 min 14 s. If we had proceeded with the parallelization of the large nest, we would have found the same parallel program as Meta-PAF did, since the outermost loop was sequential, but at 30 times the Meta-PAF cost. We believe that these figures would still be valid, mutatis mutandis, if we had applied a faster compiler than the present version of PAF to larger programs. This strengthens our opinion that during the first phase of the parallelization we only have to parallelize the nests that are parallelizable in a realistic time, even if it means waiting for the next parallelization phase to try to parallelize the large nests which satisfy the inefficiency criteria and which the user believes are not sequential.
New Research Directions
Carrying out the parallelization of more large scientific programs with Meta-PAF would allow us to complete the LPARAD expertise for identifying the causes of parallelization inefficiency. Extension of LPARAD to distributed-memory architectures: since X would produce a data distribution for each R-parallelizable loop nest, LPARAD would have to answer the following question: which data distribution(s) should be chosen for the whole program, and according to what criteria? In addition, the performance measures would not only have to evaluate the effective efficiency of the parallelization, but also the quality of the data distribution.
8 Acknowledgments We would like to thank P. Feautrier, F. Irigoin, A. Dumay, C. Freehill, P. Klein, and X. Redon for their suggestions and comments.
References

[ABC+87] F. Allen, M. Burke, P. Charles, R. Cytron, and J. Ferrante. An overview of the PTRAN analysis system for multiprocessing. Technical Report RC 13115 (#56866), New York, September 1987.
[Ber93] Jean-Yves Berthou. Construction d'un paralleliseur interactif de logiciels scientifiques de grande taille guide par des mesures de performances. PhD thesis, Universite P. et M. Curie, October 1993.
[BK92] Jean-Yves Berthou and Philippe Klein. Estimating the effective performance of program parallelization on shared memory MIMD multiprocessors. In Parallel Processing: CONPAR 92 - VAPP V, pages 701-706. Springer Verlag, 1992.
[Col94] J.-F. Collard. Space-time transformation of while-loops using speculative execution. In Proc. of the 1994 Scalable High Performance Computing Conf., Knoxville, Tenn., May 1994. To appear. Also available as LIP Report 93-38.
[Dum92] Alain Dumay. Traitement des indexations non lineaires en parallelisation automatique: une methode de linearisation contextuelle. PhD thesis, Universite P. et M. Curie, December 1992.
[Ema92] Nahid Emad. Detection de parallelisme a l'aide de paralleliseurs automatiques. Technical Report MASI 92-54, Institut Blaise Pascal, September 1992.
[Fah93] T. Fahringer. Automatic Performance Prediction for Parallel Programs on Massively Parallel Computers. PhD thesis, Dept. of Computer Science, University of Vienna, November 1993.
[Fea88] Paul Feautrier. Parametric integer programming. RAIRO Recherche Operationnelle, 22:243-268, September 1988.
[FH92] T. Fahringer and C. Huber. The weight finder, a profiler for Fortran 77 programs. User manual. Technical report, Dept. of Computer Science, University of Vienna, September 1992.
[IJT90] Francois Irigoin, Pierre Jouvelot, and Remi Triolet. Overview of the PIPS project. In Paul Feautrier and Francois Irigoin, editors, Procs of the Int. Workshop on Compilers for Parallel Computers, Paris, pages 199-212, December 1990.
[Pac86] Pacific Sierra Research Corporation. VAST2 Reference Manual, 1986.
[PAF90] Manuel de reference de PAF. Groupe "Calcul Parallele" du MASI, January 1990. Available on request from Paul Feautrier.
[Pug91] William Pugh. Uniform techniques for loop optimization. In ACM Conf. on Supercomputing, pages 341-352, January 1991.
[Red92] Xavier Redon. Detection des reductions. Technical Report 92.52, IBP/MASI, September 1992.
[RWF90] Mourad Raji-Werth and Paul Feautrier. Systematic construction of programs for distributed memory systems. In Paul Feautrier and Francois Irigoin, editors, Procs of the Int. Workshop on Compilers for Parallel Computers, Paris, December 1990.
[TIF86] Remi Triolet, Francois Irigoin, and Paul Feautrier. Direct parallelization of call statements. In ACM Symposium on Compiler Construction, 1986.
[Win92] D. Windheiser. Optimisation de la localite des donnees et du parallelisme a grain fin. PhD thesis, Universite de Rennes I, 1992.
[WL91] M. E. Wolf and M. S. Lam. A loop transformation theory and an algorithm to maximize parallelism. Transactions on Parallel and Distributed Systems, 2(4):452-470, October 1991.
[Wol87] M. Wolfe. Iteration space tiling for memory hierarchies. In Parallel Processing for Scientific Computing, pages 357-361. SIAM, 1987.