Automap: A Parallel Coordination-based Programming System [1]

C. van Reeuwijk, H.J. Sips, H.X. Lin, A.J.C. van Gemund

Technical Report No. 1-68340-44(1997)04
Laboratory of Computer Architecture and Digital Technique (CARDIT)
Department of Electrical Engineering
Delft University of Technology
P.O. Box 5031, NL-2600 GA Delft, The Netherlands

April 1997

[1] Excerpt from the research proposal granted in October 1995 by the Netherlands Computer Science Foundation (SION) with financial support from the Netherlands Organisation for Scientific Research (NWO) under grant number SION-2519/612-33-005.

Abstract

High Performance Computing (HPC) has been defined as a key enabling technology for industrial competitiveness. The successful use of parallel computer technology requires the development of techniques that substantially reduce the complexity and cost of software development. In commercial and industrial applications in particular, the required parallel structures tend to be complex, dynamic, or irregular, and in this environment the time to produce and modify software is as important as ultimate execution speed. Therefore, the fairly regular structures often found in the so-called "Grand Challenge" applications (often used as benchmark applications for parallel compilation systems) are not very representative of `real life' applications. In response, AUTOMAP will develop a high-level, application-oriented parallel programming system based on coordination-based structured programming. The parallel language in the system will be aimed at defining coarse-grained tasks and control-flow parallelism, both at compile time and at run time. Tasks themselves may be programmed in existing languages such as Fortran or C. Although the parallelism will be explicit, the programmer will be completely shielded from various implementation details. The system will be enhanced with automated task mappers based on a performance model of the underlying parallel machine. In this way, the program remains fully portable. The implementation will use existing or emerging standards, to ensure portability across current machines and future machine generations. The proposed project combines new insights in language design, compile-time and run-time scheduling, and performance prediction in an integrated approach. AUTOMAP will be validated using core programs drawn from industrial applications.

Contents

1  Introduction ............ 2
2  State of the art ........ 5
3  Research approach ....... 8
4  Conclusion .............. 11

Chapter 1

Introduction

Despite the enormous growth of computer power in the last decades, there are still disciplines that demand substantially more power. This demand is not likely to be met by the further growth of computer power in the near future. The alternative way to satisfy this demand, parallel computation, is therefore as relevant as it has ever been. Traditional applications that require this much computer power (sometimes called the `Grand Challenge' applications) are found, for example, in the aerospace industry (stress calculations on airplane parts), the oil industry (reservoir simulation and the interpretation of seismic data), and the biochemical industry (molecular simulations). Next to these traditional applications, there is a growing interest in the use of parallel computer systems for database, multimedia, and virtual reality applications. These new applications often have more complicated data structures: whereas traditional applications often use a few (huge) arrays, new applications typically use structures such as sparse matrices, trees, linked lists, etc. Many industrial applications are also highly dynamic and irregular, and show substantial control flow parallelism.

An example of such an application is shown in Fig. 1.1. It is drawn from the finite element system DIANA [6]. The input to the system is a geometrical structure, which is split into domains by a domain decomposer (Fig. 1.1a). This results in a block matrix structure as in Fig. 1.1b, which is subsequently solved using a direct method. The shaded blocks contain elements; the non-shaded blocks are empty. The linear solver takes the non-empty blocks and processes them in an elimination tree as shown in Fig. 1.1c. The nodes in this tree are computational tasks, while the arcs denote the dependence relations among the tasks. The tasks themselves are matrix operations on complete blocks of the matrix. The size, shape, and depth of the tree are only known at run time. Hence, data allocation and task scheduling can only be done at run time. Irregular as this application may be, there is still structural information on the computation available, which can be used to describe the computation in a parameterized way [10]. Another observation from this application is that the parallelism is explicit: it is determined by the domain decomposing program.

A problem in tackling these parallel applications is the lack of suitable programming systems in which they can be coded in a portable and maintainable way. In the design of such a programming system, two important problems must be solved:

- The specification problem. In what form, and in how much detail, should the parallelism of an application be specified?


- The transformation problem. How easily can the compiler transform the specified parallel application into (parallel) machine code, and how efficient will this code be?

Figure 1.1: (a) geometrical domain, (b) matrix structure, (c) computation graph

Clearly, there is a trade-off between these two problems: if the interaction between the various parallel parts of an application is described in more detail, it will generally be easier to transform it. But even for a given level of detail there is a wide range of possibilities in the design of a programming system and the transformations it can perform. For parallel applications, there are two new problems that complicate the design of a programming system:

- The synchronization problem. Depending on the type of computer architecture, synchronization manifests itself in two different ways. On shared-memory systems, the results of a computation may depend on the relative speed at which the different tasks of an application are executed. Non-deterministic results must be prevented by `locking out' related tasks when a critical section is encountered. On distributed-memory systems, the processors must explicitly send messages to inform other processors about changes in a variable. (A minimal code sketch after this list illustrates the shared-memory case.)




In both cases, a compiler could automatically ensure that this synchronization is always handled correctly, but this would impose severe and often too conservative restrictions on the parallel execution of the tasks. Alternatively, low-level synchronization could be left to the application programmer, but getting synchronization correct requires a determined effort, and errors in it are difficult to find.

- The load balancing problem. To get any benefit from a parallel computer system, the processing load must be spread as evenly as possible over the available processors. This means that throughout the program, tasks must be identified that can be executed in parallel, the execution time of these tasks must be estimated, and a good load balance must be found. In general, due to the lack of compile-time information about loads and dependencies in many applications, none of these steps can be handled adequately by a compiler. On distributed-memory systems the problem is further complicated by the fact that a wrong distribution of tasks over the available processors may lead to excessive communication overhead.
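As a concrete illustration of the shared-memory case of the synchronization problem, the following minimal C sketch (not part of the proposal; the names and the use of POSIX threads are our own choices for illustration) lets two coarse-grain tasks add their partial results into a shared variable. Without the mutex, the read-modify-write of the shared sum could interleave and silently lose an update, so the result would depend on the relative speed of the threads; with the mutex, the critical section is serialized and the outcome is deterministic.

    #include <pthread.h>
    #include <stdio.h>

    static double sum = 0.0;                        /* shared result          */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* One coarse-grain task: add a partial result into the shared sum.
     * Without the lock, the read-modify-write of 'sum' can interleave
     * with the other task and lose an update (non-deterministic result). */
    static void *task(void *arg)
    {
        double partial = *(double *)arg;
        pthread_mutex_lock(&lock);                  /* enter critical section */
        sum += partial;
        pthread_mutex_unlock(&lock);                /* leave critical section */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        double a = 1.0, b = 2.0;
        pthread_create(&t1, NULL, task, &a);
        pthread_create(&t2, NULL, task, &b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("sum = %f\n", sum);                  /* always 3.0             */
        return 0;
    }

On a distributed-memory machine the same dependence would instead show up as an explicit message from the task that produces a partial result to the task that consumes it.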

If a compiler and its associated run-time system do not offer assistance in solving these problems, the application programmer must solve them by providing all kinds of details on synchronization and the allocation of data and computations. This approach leads to excessive programming cost, time-consuming tuning, and non-portable programs. Parallel compilers will only gain acceptance for industrial applications if parallel programming systems are provided that free application programmers from most of the above-mentioned details and are capable of handling dynamic and irregular computations and data structures. The proposed project provides solutions that contribute to this goal.


Chapter 2

State of the art

There have been many attempts at (semi-)automatic parallelization. Compilers have been developed that are intended to parallelize programs written in ordinary sequential programming languages (e.g., FORTRAN 77) [16, 49, 52]. Unfortunately, there are fundamental problems with the automatic detection of parallelism in sequential programs. Therefore, fully automatic parallelization can only be approximated with heuristic techniques. There is, however, another point. More and more, parallelism in applications is generated at a level higher than the programming language from which the parallel executables are derived. An example is the domain decomposition technique described in the introduction. In these applications the parallelism is already made explicit by (intelligent) problem analysis tools, and bringing it back into sequential form is counter-productive.

At the other end of the spectrum, there are programming languages that require the programmer to explicitly code the required synchronization and to do the load balancing. This can be done in a traditional programming language with a support library, or in a programming language extended with parallel primitives. Examples in the latter category are OCCAM [31] for distributed-memory machines and PCF [8] for shared-memory machines. As explained above, using parallel languages of this kind is complicated and error-prone, and leads to non-portable programs.

To obtain a better balance between ease of specification and ease of transformation, there have been a number of attempts to develop hybrid programming systems. In a hybrid programming system, a user must still specify `hints' on how the program is to be parallelized, but is completely shielded from many of the implementation details. A well-known hybrid programming model is used in data parallel languages [1, 7, 30, 32]. In a data parallel language, the user only specifies how the data is distributed over the processors. In fact, this distribution specification is nothing more than a locality assertion by the programmer. Most compilers for these languages use the convention that the `owner' of the data performs the computations on that data, so the user has implicitly specified a task mapping as well (a small code sketch at the end of this chapter illustrates this convention). A number of data parallel languages have come into widespread use, in particular High Performance Fortran (HPF) [30]. However, data parallel languages are not without problems. Although the generation of message passing code is performed automatically, the user is still responsible for determining an optimal data mapping. This is far from trivial, and is usually machine-dependent. Thus, the data parallel approach still does not fully isolate the user from the underlying machine architecture. Last but not least, the data parallel programming model is not at all engineered towards dynamic programs or programs with less regular data and computation structures. Thus, the user still has difficulties in specifying the algorithm. Moreover, HPF programs are difficult to compile efficiently [9], compared to hand-coded message passing programs.

Another approach uses run-time techniques to reduce the programming burden. An example of such a technique is the use of virtual shared memory, in which shared memory is emulated on a message passing system, either at the hardware level [37] or at the software level [41, 46]. Although this approach is effective for some problems, it may introduce a significant overhead for others. Languages have also been introduced that rely on sophisticated run-time techniques. Examples are Orca [2] and Linda [19]. In these languages the programmer is offered a shared data space, which is distributed over the processors by an underlying run-time system so as to obtain sufficient data locality in computations.

Few systems have facilities for declaring more complex data structures. One such system is LPARX [14, 39]. LPARX is a C++ library that implements a class of sparse matrix structures. It also implements parallel operations on these matrices through standard iterators. The main advantages of LPARX are that it is embedded in an existing, well-known programming language, and that it allows complex sparse matrix structures to be described easily. Its main disadvantage is that, since it is implemented as a C++ library, everything is handled at run time, which is often inefficient. Moreover, it is limited to sparse matrix structures and to the operations that can be applied to these matrices.

Other languages allow explicit declaration of task parallelism [26, 29, 8]. In Fx [29, 45], HPF is extended with language constructs to define tasks explicitly. The communication between the tasks is handled by the compiler, but the user must still specify on which processor each task must be executed. Similarly, in Fortran M [26, 27], HPF is extended with language constructs that are similar to those in OCCAM.

Each approach taken at the language level has a direct implication for the requirements on the underlying transformation mechanism (scheduling) at compile time and run time. Any approach that aims to totally isolate the user from the machine (machine-independent programming through automatic scheduling) is not without problems. It is known that finding an optimal schedule is NP-complete [47], even for the simple case of three or more processors with communication costs ignored. Although task scheduling will be performed at compile time whenever possible, in many cases run-time scheduling will be necessary due to the lack of compile-time information about the computational entities and their mutual dependences (see the example of a general-purpose finite element package [6] discussed earlier), the unpredictable dynamic behavior as a result of conditional control flow, and the contention involved in sharing computation and communication processors. Therefore, the scheduling methods used must be fast (i.e., have a polynomial time complexity) and at the same time near-optimal.

Many scheduling algorithms have been described for shared-memory machines [11, 21, 36]. Although suboptimal, simple heuristics have been shown to yield acceptable results. For instance, if memory contention and cache misses are ignored, list scheduling techniques yield schedules with a guaranteed length of at most two times the optimal schedule [28].
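To make the list-scheduling heuristic mentioned above concrete, the following C sketch (illustrative only; it assumes independent tasks with known execution times and ignores dependences, memory contention, and communication) assigns each task, taken in list order, to the processor that currently finishes earliest. This greedy policy is the one behind the factor-of-two bound on schedule length [28].

    #include <stdio.h>

    #define P 4                                 /* number of processors       */

    /* Greedy list scheduling of independent tasks with known costs:
     * assign each task, in list order, to the processor that becomes
     * free earliest.                                                   */
    static void list_schedule(const double cost[], int n, int owner[])
    {
        double finish[P] = { 0.0 };             /* current load per processor */
        for (int i = 0; i < n; i++) {
            int best = 0;
            for (int p = 1; p < P; p++)
                if (finish[p] < finish[best])
                    best = p;
            owner[i] = best;                    /* map task i onto 'best'     */
            finish[best] += cost[i];
        }
    }

    int main(void)
    {
        double cost[] = { 3.0, 1.0, 4.0, 1.0, 5.0, 2.0 };
        int owner[6];
        list_schedule(cost, 6, owner);
        for (int i = 0; i < 6; i++)
            printf("task %d -> processor %d\n", i, owner[i]);
        return 0;
    }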
Due to the additional data locality problem, task scheduling for distributed-memory machines is more difficult and is still an area of active research. The general task scheduling problem in the presence of communication costs is a so-called min-max (minimal communication overhead and maximal load balance) problem. Finding a (sub)optimal schedule requires resolving the trade-off between the exploitation of parallelism and the reduction of communication overhead. A number of scheduling heuristics have been proposed in the literature [17, 23, 24, 44, 50]. However, some of them do not take communication time into account, while others often have a high time complexity. Furthermore, the performance in terms of schedule length of many scheduling heuristics is unknown for large task graphs or large processor graphs. Most importantly, these heuristics concentrate on compile-time scheduling and cannot deal with dynamic behaviour. Much research is currently pursued in the context of automatic data partitioning, in which the task mapping is determined through the data mapping solution [12, 20, 25, 34, 42, 38]. However, these approaches are aimed at compile-time scheduling and thus have a restricted application domain (e.g., they assume affine array index functions). Hence, they are not amenable to applications in which run-time scheduling is necessary.

Despite all these initiatives, such as HPF, most parallel programming systems currently lack sufficient expressiveness to enable the coding of large industrial applications. There is currently no approach that pairs ease of programming with fully automatic compilation, especially if good performance is required. Also, in many approaches it is difficult to make programs portable.
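As a small illustration of the `owner computes' convention referred to earlier in this chapter, in which the data mapping induces the task mapping, the following C sketch (our own, purely illustrative; the BLOCK formula and names are not taken from any of the cited systems) distributes an array over the processors in contiguous blocks and lets each processor update only the elements it owns.

    #include <stdio.h>

    #define N 100                              /* global array size           */
    #define P 4                                /* number of processors        */

    /* BLOCK distribution: processor p owns indices [p*B, (p+1)*B) with
     * B = ceil(N/P).  Under the owner-computes rule, the data mapping
     * also fixes which processor executes the update of element i.     */
    static int owner_of(int i)
    {
        int block = (N + P - 1) / P;
        return i / block;
    }

    static void update(double a[], int my_rank)
    {
        for (int i = 0; i < N; i++)
            if (owner_of(i) == my_rank)        /* owner computes a[i]         */
                a[i] = 2.0 * a[i] + 1.0;
    }

    int main(void)
    {
        double a[N];
        for (int i = 0; i < N; i++) a[i] = (double)i;
        for (int rank = 0; rank < P; rank++)   /* emulate the P processors    */
            update(a, rank);
        printf("a[0]=%g a[N-1]=%g\n", a[0], a[N - 1]);
        return 0;
    }

A data parallel compiler essentially generates the ownership test and the resulting communication automatically from the distribution declaration; the point here is only to show how a data mapping fixes where each computation runs.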


Chapter 3

Research approach

We propose to develop a novel hybrid parallel task programming system. This system will combine an explicit task-oriented parallel programming language with a compilation strategy based on fully automatic scheduling. In our system, the programmer or an external program defines coarse-grain computational tasks and specifies their mutual dependencies. A compiler will then distribute the tasks over the available processors and handle the communication that results from this distribution. In other words, the compiler must do both task mapping and data mapping. Since it is not possible to handle all task and data mapping at compile time, there will be a fall-back to run-time data and task mapping. This approach makes the tuning of programs in the usual sense unnecessary. The only `tuning' that a programmer or a parallelizing program must do is to define tasks of the `right' granularity. However, this is not very critical.

Although the parallel specification will be explicit, the language will be designed such that any dependences can be specified at a high level of abstraction and in a parameterized way. Thus, a user (or the program providing the specification) will be completely shielded from the details of handling the parallel computations. Therefore, the application remains portable. At first sight it may seem unfriendly to require explicit specification of parallelism and of the dependencies between tasks in a program. However, the notion of dependent tasks is quite common, and is frequently used outside computer science. When modeling and solving a complex problem, it is natural to decompose it into simpler parts, corresponding to coarse-grain tasks. The dependencies between tasks are usually apparent from the problem, or can easily be obtained. As noted before, in many industrial applications the parallelism will be generated by intelligent analysis tools and provided in an explicit way. A complication is that this specification is often only given at run time, as is demonstrated by the example in the introduction. Both task mapping and virtual shared memory support can be handled completely at run time, but the overhead will be considerable. It is therefore important to let the compiler pre-compute as much as possible.

For a parallel programming environment using this approach, the following components are necessary:

- A task and data-structure description language: a language to describe the coarse-grained tasks and the data structures these tasks work on;
- A mapper: a program that (at compile time or run time) assigns tasks and data to processors;

- A cost estimator: internally, the mapper will require performance information. Also, the user will require some performance feedback, and this will go beyond what is required for the mapper.
- Some example applications to validate the parallel programming system. Proper validation requires more than `toy' examples, so at least the core of some industrial applications must be implemented.

The system will be designed such that the internals of tasks can be provided in other languages (Fortran or C). This is in line with the trend that more and more large applications are not written in a single programming language, but use multiple programming languages that are integrated in a larger framework. In this way complexity is reduced, and languages can be chosen for the parts of an application where they are most suited. The parallel programming language must provide a convenient way to specify the coarse-grain tasks, their mutual dependencies, and the data structures these tasks work on. Moreover, the task mapper requires cost estimates of the execution times of tasks; therefore, there must be a cost estimator that can `look into' the code of the tasks, or these estimates must be provided externally by higher-level tools. We expect that the resulting parallel programming language will be somewhat limited in expressiveness, and therefore it should be easy to invoke routines written in existing programming languages such as C and Fortran. The approach to parallel language design will therefore resemble that of so-called coordination languages [19, 22]. This implies that we can restrict language features to only those necessary to perform a `proof of concept', thereby avoiding a lot of detailed implementation work that does not really contribute to the outcome of the proposed research.

As mentioned earlier, automatic scheduling will involve both compile-time and run-time techniques. We propose to employ heuristic methods that combine low scheduling cost with acceptable efficiency. An additional reduction of complexity is achieved by decomposing the scheduling problem in terms of a mapping problem[1]. We will adopt this approach based on the assumption that tasks can be locally scheduled dynamically at run time using multi-threading, regardless of whether the mapping is static or dynamic. Therefore, rather than tackling the total task scheduling problem, we will restrict ourselves to task mapping. This approach is partly inspired by the fact that, due to recent technological developments in the use of lightweight threads (kernel/library or even compiler-driven [43]), there is an increasing interest in using dynamic scheduling techniques at the processor level, both for computation and communication tasks (e.g., non-blocking communication), in order to achieve maximum (data flow) concurrency[2]. The primary research question is to evaluate the properties of this heuristic mapping approach in terms of cost and performance.

For many practical applications, the problem parameters are often symbolic at compile time (e.g., sequential/parallel loop bounds). Usually, these parameters are only instantiated at run time, for example when the size of the data set is known. Rather than deferring mapping completely to run time (when the parameters are instantiated), it is very attractive to investigate whether compile-time methods can be employed that are practical for parameterized problems.
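Because the task graph (like the elimination tree of Fig. 1.1) may only become known when these parameters are instantiated at run time, a run-time system can fall back on a simple data-flow style executor: keep a count of unsatisfied dependences per task and start a task as soon as that count drops to zero. The C sketch below is our own illustration of this idea, not the proposed system; it runs the ready tasks sequentially, where a multi-threaded run-time system would hand them to worker threads.

    #include <stdio.h>

    #define MAXT 16

    /* A coarse-grain task with an unsatisfied-dependence counter and
     * the list of tasks that depend on it.                            */
    struct task {
        int ndeps;                  /* dependences not yet satisfied    */
        int nsucc;                  /* number of successors             */
        int succ[MAXT];             /* indices of dependent tasks       */
    };

    /* Execute a task graph in dependence order using a ready queue.   */
    static void run_graph(struct task t[], int n)
    {
        int ready[MAXT], head = 0, tail = 0;
        for (int i = 0; i < n; i++)
            if (t[i].ndeps == 0)
                ready[tail++] = i;              /* initially ready tasks */
        while (head < tail) {
            int i = ready[head++];
            printf("executing task %d\n", i);   /* the task body         */
            for (int s = 0; s < t[i].nsucc; s++)
                if (--t[t[i].succ[s]].ndeps == 0)
                    ready[tail++] = t[i].succ[s];
        }
    }

    int main(void)
    {
        /* Small diamond-shaped graph: 0 -> {1, 2} -> 3 */
        struct task t[4] = {
            { 0, 2, {1, 2} },
            { 1, 1, {3} },
            { 1, 1, {3} },
            { 2, 0, {0} },
        };
        run_graph(t, 4);
        return 0;
    }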

[1] Task mapping merely determines the allocation of tasks onto processors. Hence it implies a partial schedule.
[2] Moreover, the introduction of parallel slackness (multiple tasks per processor, cf. Valiant's BSP model [48]) typically increases the average utilization of the processing and communication resources.

The quality of the mapping solution will depend critically on the underlying performance model of the target machine. Clearly, with an inaccurate performance model, even the most perfect mapping procedure will still yield inferior results. Apart from the general requirement of accurate task execution time estimates, this especially applies to the communication models needed for mapping onto distributed-memory machines. Although the communication aspect is central to the mapping problem, little attention has been paid in current research to the quality of the associated performance models. The models used are either based on training sets [15, 33], or reflect the state of the art in static performance prediction, in which (for example) communication costs are modeled using linear startup-bandwidth models [13, 18, 35]. However, it has been shown that even under moderate communication densities these models can easily underestimate communication delay by orders of magnitude, because network contention is not accounted for [5]. A new approach to the performance prediction of parallel systems has been introduced [3, 4], most notably providing low-cost models that are capable of predicting the effects of contention without the usual increase in cost. This model will be used for performance estimation in the proposed parallel programming environment. An important question to be answered in the project is the sensitivity of the mapping quality to the presence of contention in the underlying system.
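For reference, a linear `startup plus bandwidth' communication model of the kind cited above can be written as a one-line cost function. The C sketch below uses placeholder constants (not measurements of any machine); as argued above, such a model ignores network contention entirely, which is precisely the shortcoming the contention-aware models of [3, 4] aim to address.

    #include <stdio.h>

    /* Linear point-to-point communication model: the time to send n bytes
     * is a fixed startup latency plus n divided by the asymptotic
     * bandwidth.  The constants are placeholders, not measurements of any
     * machine, and contention on the network is not modeled at all.      */
    static const double T_STARTUP = 50e-6;      /* seconds per message     */
    static const double BANDWIDTH = 10e6;       /* bytes per second        */

    static double comm_time(double nbytes)
    {
        return T_STARTUP + nbytes / BANDWIDTH;
    }

    int main(void)
    {
        printf("1 KB message: %g s\n", comm_time(1024.0));
        printf("1 MB message: %g s\n", comm_time(1024.0 * 1024.0));
        return 0;
    }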


Chapter 4

Conclusion

In this project, we have the following objectives:

- The design and implementation of a suitable coordination-like parallel programming language. The language should allow the easy definition of tasks and the relations between them, both at compile time and at run time.
- A multi-threaded run-time system to provide task mapping and virtual shared memory at run time. It is mainly intended as a fall-back in case compile-time task mapping and synchronization fail.
- Optimizers to implement task mapping and virtual shared memory as much as possible at compile time, based on simple (heuristic) mapping algorithms.
- Some realistic examples to evaluate the performance of the programming system. Since the participants have ample experience in finite element problems and shallow water flow models, the examples will be drawn from these areas.

The proposed project combines new insights in language design, compile-time and run-time scheduling, and performance prediction in an integrated approach. The approach to parallel programming that we have outlined here promises simpler parallel programming of many industrial applications. The project also links research areas that are rarely linked: language design, performance analysis, and scheduling. Moreover, it incorporates feedback from industrial applications that need such systems for execution on parallel machines.

High Performance Computing (HPC) has been defined as a key enabling technology for industrial competitiveness. The successful use of parallel computer technology requires the development of techniques that substantially reduce the complexity and cost of software development, particularly for commercial or industrial applications, where the required parallel structures tend to be complex, dynamic, or irregular rather than the fairly regular structures often found in the so-called "Grand Challenge" applications, and where the time to produce and modify software is as important as ultimate execution speed. AUTOMAP will develop a high-level, application-oriented parallel programming model based on coordination-based structured programming. The implementation will use existing or emerging standards, to ensure portability across current machines and future machine generations.

The main beneficiaries of AUTOMAP will be those whose primary concern is to use parallel machines to solve complex, real-world problems. The second class of beneficiaries are the organizations engaged in providing methods or tools for application developers to use. The work will produce tool kits for the systematic definition, implementation, and optimization of parallel software components. It is important to the proposers that the developed models and tools are practical. It is essential that they address and solve the problems of users who wish to use parallel machines on problems of genuine economic concern, and that they are compatible with current and anticipated developments in software and hardware platforms. To ensure the applicability of the methods and tools developed in the AUTOMAP project, real applications will be modeled and real application data will be used to verify the results. Among these applications are the finite element system DIANA and the shallow water flow models of Rijkswaterstaat. Transfer of results will be done by demonstrating the applicability of the developed system and by using it in real application design. Besides that, we envision public domain availability of the final system.

The AUTOMAP project combines expertise in three different areas of computer science:

- Design and implementation of parallel languages and support systems;
- Performance modeling and prediction;
- Parallel application design.

The three participating groups each have their specific field of expertise. The AUTOMAP project combines this expertise in a unique way. The proposers are confident that this combination will lead to innovative results.


Bibliography

[1] F. Andre, J.L. Pazat, and H. Thomas, "Pandore: A System to Manage Data Distribution," Proceedings 1990 International Conference on Supercomputing, Amsterdam, The Netherlands, ACM Press, 1990.
[2] H. Bal, The shared data-object model as a paradigm for programming distributed systems, PhD thesis, Vrije Universiteit, Amsterdam, 1989.
[3] A.J.C. van Gemund, "Performance prediction of parallel processing systems: The Pamela methodology," in Proc. 7th ACM Int. Conf. on Supercomputing, Tokyo, July 1993, pp. 318-327.
[4] A.J.C. van Gemund, "Compiling performance models from parallel programs," in Proc. 8th ACM Int. Conf. on Supercomputing, Manchester, July 1994, pp. 303-312.
[5] A.J.C. van Gemund, "Predicting contention in distributed-memory machines," in Proc. CRPC Workshop on Automatic Data Layout and Performance Prediction, Houston, Apr. 1995.
[6] H.X. Lin and H.J. Sips, "Parallel direct solution of large sparse systems in finite element computations," 1993 International Conference on Supercomputing, July 1993, Tokyo, Japan, ACM Press, pp. 261-120.
[7] E.M. Paalvast, H.J. Sips, and L.C. Breebaart, "Booster: a high-level language for portable parallel algorithms," Applied Numerical Mathematics, 8, 1991, pp. 177-192.
[8] ANSI X3H5, Fortran 77 binding of X3H5 model for parallel programming constructs, draft version, September 1992.
[9] C. van Reeuwijk, H.J. Sips, W. Denissen, and E.M. Paalvast, "Implementing HPF distributed arrays on a message passing parallel computer system," technical report TU Delft, TN FI-CP CP-TR-9506.
[10] M.R. van Steen, H.J. Sips, and H.X. Lin, "Software engineering for parallelism: an exercise in separating design and implementation," in Proceedings International Conference on Massively Parallel Processing, June 1994, Delft, The Netherlands, North Holland.
[11] T.L. Adam, K.M. Chandy, and J.R. Dickson, "A comparison of list schedules for parallel processing systems," Comm. ACM, vol. 17, pp. 685-690, 1974.
[12] J.M. Anderson and M. Lam, "Global optimization for parallelism and locality on scalable parallel machines," in Proc. ACM SIGPLAN '93 PLDI, June 1993.
[13] M. Annaratone, C. Pommerell, and R. Ruhl, "Interprocessor communication and performance in distributed-memory parallel processors," in Proc. 16th Symp. on Comp. Archit., ACM, May 1989, pp. 315-324.
[14] Scott B. Baden, Scott R. Kohn, and Stephen J. Fink, "Programming with LPARX," in Proc. Intel Supercomputer User's Group Meeting, June 1994, San Diego, CA. (Also CSE Tech. Rept. CS94-377.)
[15] V. Balasundaram, G. Fox, K. Kennedy, and U. Kremer, "A static performance estimator to guide data partitioning decisions," in Proc. 3rd ACM SIGPLAN Symp. on PPoPP, Apr. 1991.
[16] W. Blume and R. Eigenmann, "Performance analysis of parallelizing compilers on the Perfect benchmark programs," IEEE Transactions on Parallel and Distributed Systems, 3(6), pp. 643-656, 1992.
[17] S. Bokhari, "On the mapping problem," IEEE Transactions on Computers, vol. C-30(3), pp. 207-214, 1981.
[18] L. Bomans and D. Roose, "Benchmarking the iPSC/2 hypercube multiprocessor," Concurrency: Practice and Experience, vol. 1, Sept. 1989, pp. 3-18.
[19] N. Carriero and D. Gelernter, "Coordination languages and their significance," Communications of the ACM, vol. 35, Feb. 1992.
[20] S. Chatterjee, J.R. Gilbert, R. Schreiber, and S-H. Teng, "Automatic array alignment in data-parallel programs," in Proc. 20th Annual ACM Symp. Principles of Progr. Lang., Charleston, Jan. 1993.
[21] E.G. Coffman, ed., Computer and Job Scheduling Theory, John Wiley, 1976.
[22] J. Darlington, Y. Guo, H.W. To, and J. Yang, "Skeletons for Structured Parallel Composition," in Proc. 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, California, USA, July 1995.
[23] K. Efe, "Heuristic models of task assignment scheduling in distributed systems," IEEE Computer, pp. 50-56, 1982.
[24] H. El-Rewini and T.G. Lewis, "Scheduling parallel program tasks onto arbitrary target machines," J. of Parallel and Distributed Computing, vol. 9, pp. 138-153, 1990.
[25] P. Feautrier, "Toward automatic partitioning of arrays on distributed memory computers," in Proc. 7th ACM Int. Conf. on Supercomputing, Tokyo, July 1993, pp. 175-184.
[26] I. Foster and K.M. Chandy, "Fortran M: A Language for Modular Parallel Programming," J. Parallel and Dist. Comput., 1994 (to appear).
[27] I. Foster, B. Avalani, A. Choudhary, and M. Xu, "A Compilation System that Integrates High Performance Fortran and Fortran M," in Proc. 1994 Scalable High Performance Computing Conf., 1994.
[28] R.L. Graham, "Bounds on multiprocessing timing anomalies," SIAM J. Appl. Math., vol. 17, no. 2, 1969, pp. 416-429.
[29] T. Gross, D. O'Hallaron, and J. Subhlok, "Task parallelism in a High Performance Fortran framework," IEEE Parallel & Distributed Technology, Fall 1994, vol. 2, no. 2, pp. 16-26.
[30] High Performance Fortran Forum, Draft High Performance Fortran Language Specification, version 1.0. Available as technical report CRPC-TR92225, Rice University, January 1993.
[31] INMOS Limited, OCCAM Programming Manual, Prentice-Hall International, 1984.
[32] C. Koelbel and P. Mehrotra, "Compiling global name space parallel loops for distributed execution," IEEE Transactions on Parallel and Distributed Systems, 2(4), pp. 440-451, 1991.
[33] M. Gupta and P. Banerjee, "Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers," IEEE Transactions on Parallel and Distributed Systems, vol. 3, Mar. 1992, pp. 179-193.
[34] M. Gupta and P. Banerjee, "Paradigm: A compiler for automatic data distribution on multicomputers," in Proc. 7th ACM Int. Conf. on Supercomputing, Tokyo, July 1993, pp. 87-96.
[35] R.W. Hockney, "Performance parameters and benchmarking of supercomputers," Parallel Computing, vol. 17, 1991, pp. 1111-1130.
[36] T.C. Hu, "Parallel sequencing and assembly line problems," Operations Research, vol. 9(6), pp. 841-848, 1961.
[37] Kendall Square Research Corp., Waltham, MA, KSR1 Principles of Operation, 1991.
[38] Ulrich Kremer, "NP-completeness of dynamic remapping," in Proc. 4th Int. Workshop on Compilers for Parallel Computers, Delft, The Netherlands, Dec. 1993, pp. 135-141.
[39] Scott R. Kohn and Scott B. Baden, "A Robust Parallel Programming Model for Dynamic Non-Uniform Scientific Computations," Proc. 1994 Scalable High Performance Computing Conference, May 23-25, 1994, Knoxville, Tennessee. (Also CSE Tech. Rept. CS94-354.)
[40] B. Lee, C. Shin, and A.R. Hurson, "A strategy for scheduling partially ordered program graphs onto multicomputers," in Proc. 28th Hawaii Int. Conf. on System Sciences, Vol. II, IEEE, Jan. 1995, pp. 133-142.
[41] K. Li and R. Schaefer, "A hypercube shared virtual memory system," in Proc. 1989 Int. Conf. Parallel Proc., IEEE, 1989, pp. 125-131.
[42] L. Li and M. Chen, "Compiling communication-efficient programs for massively parallel machines," IEEE Transactions on Parallel and Distributed Systems, vol. 2, July 1991, pp. 361-376.
[43] C.D. Polychronopoulos, "nano-Threads: Compiler-driven multithreading," in Proc. 4th Int. Workshop on Compilers for Parallel Computers, Delft, The Netherlands, Dec. 1993, pp. 190-205.
[44] V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, Pitman, London, 1989.
[45] J. Subhlok, D. O'Hallaron, T. Gross, P. Dinda, and J. Webb, "Communication and memory requirements as the basis for mapping task and data parallel programs," in Proc. Supercomputing '94, Washington, DC, Nov. 1994, pp. 330-339.
[46] M. Stumm and S. Zhou, "Algorithms implementing distributed shared memory," Computer, May 1990, pp. 54-64.
[47] J.D. Ullman, "NP-complete scheduling problems," J. Computer and System Science, pp. 384-393, 1975.
[48] L. Valiant, "A bridging model for parallel computation," Communications of the ACM, vol. 33, Aug. 1990, pp. 103-111.
[49] M. Wolfe, Optimizing Supercompilers for Supercomputers, MIT Press, Cambridge, Mass., 1989.
[50] M.Y. Wu and D.D. Gajski, "A programming aid for hypercube architectures," J. Supercomputing, vol. 2(3), pp. 349-372, 1988.
[51] T. Yang and A. Gerasoulis, "Pyrros: Static task scheduling and code generation for message passing multiprocessors," in Proc. 6th ACM Int. Conf. on Supercomputing, Washington, July 1992, pp. 428-437.
[52] H.P. Zima and B. Chapman, Supercompilers for Parallel and Vector Computers, Addison-Wesley, 1991.
