Parallel Computing in the 1990's: Attacking the Software Problem

J.E. Boillat†, H. Burkhart‡, K.M. Decker†, P.G. Kropf†

† IAM, University of Berne, Switzerland
‡ IFI, University of Basle, Switzerland
(published in: Phys. Rep. 207 (1991) 141 – 165)

Abstract

It is today's general wisdom that the productive use of parallel architectures depends crucially on the availability of powerful development tools and run-time environments. In this paper, we systematically discuss the fundamental software problems encountered in programming parallel architectures, in particular those with distributed resources. All these problems need to be solved, if efficient and convenient use of parallel machines is to be guaranteed. We present a five-phase model of parallel application program development, which describes the required efforts in parallel programming by means of four transformation steps: problem analysis, algorithm design, implementation, and mapping. The major part of the paper is dedicated to the description of three research projects which focus on the last three transformation steps: SKELETON, a tool for providing improved algorithmic support for the application-oriented programmer, SPADE, an integrated development and run-time environment, and MARC, a tool for automatic mapping of parallel programs.
1 Introduction

The past decade has seen the emergence of a new branch of computer industry: systems of multiple processors designed for computation-intensive applications. Research prototypes of such systems were already developed in the 60's and 70's; at that time, however, most work was at the level of hardware interconnections and system software. Projects like Illiac IV, C.mmp, and CM*, among others, are well-known pioneers in the field of parallel processing. No doubt, these systems helped in getting early experience in parallel algorithm design and implementation. But no application problems were really solved on these systems at that time. Today, the situation is quite different: there are numerous research projects and initiatives in the field of parallel processing and there is already a commercial market for such systems [1]. Furthermore, we see first application problems being solved on these systems which had been considered too complex before [2]. Currently available parallel computers belong to the class of control-driven architectures, which can roughly be split into two groups:
- SIMD systems represent the class of massively parallel systems regarding the number of processing elements. Thousands of identical processing elements (arithmetic-logical units) operate under the control of a single control unit. Their application mode has been called data-parallelism, because it is the partitioning of the data structures, not the partitioning of the control flow, that results in efficient solutions. The distributed data sets are simultaneously modified on all these nodes in a synchronous manner.

- MIMD systems: programming them is more general, but also more complicated, because the asynchronous mode of operation demands careful synchronization. From the architectural point of view, three subclasses can be identified:
  - Shared memory multiprocessors are the evolutionary approach of previous mainframe and microcomputer systems. While current vector supercomputers demonstrate highly sophisticated implementations of hardware pipelining techniques, more recent multiprocessor systems also offer scalable interconnection performance.
  - Hypercube systems are a popular class of parallel systems that offer coarse-grain data parallelism in combination with asynchronous operation principles. From the hardware point of view these systems have a common memory that is distributed among the processor nodes. A processor node has access to its own memory module and to the modules of its neighbors.
  - Hardware building blocks like the Transputer approach offer further flexibility. Different interconnection structures (e.g., trees, meshes, cubes) can be realized by proper configuration of the processing node environment.
For about 15 years vector supercomputers have been the driving force in scientific computing. Companies like CRAY, CDC/ETA, NEC, FUJITSU, and HITACHI, among others, offered systems with few, but very powerful processors based upon more or less the same architecture principles. The success of these systems became possible because compiler technology was soon catching up with the improvements of the hardware technology. Regarding the parallelization problem, the situation is different. Automatic parallelization is only possible in very specialized cases (e.g., microtasking of loops). As will be shown in section 2, there are other problems with the software development as well. Nevertheless, there seems to have been a breakthrough for massively parallel systems since 1990 (see also [3]):
- For the first time massively parallel systems were winners in peak performance and, more important, in demonstrated application performance. For example, the 1989 winner of the Gordon Bell Award was an application in the field of seismic data-processing [4]. The program was implemented on a Connection Machine CM-2 and performed at 5.6 GFlops using 64K processing elements. For the 1990 entry, the same program is reported to run at 14 GFlops. Conventional supercomputers, however, revealed saturation phenomena; for instance, in the same year the target for the Cray Gigaflop award was only raised from 1 GFlops to 1.5 GFlops [5].

- Traditional supercomputers were being replaced by massively parallel systems at a much faster rate than expected. For instance, Sandia National Laboratories, Albuquerque, New Mexico, USA, said the 20 major programs that account for 95% of the lab's computing time had already been converted for use on massively parallel machines [6]. Even more important, the parallel computers worked 10 to 100 times faster on these programs.

Extraordinary R&D efforts in the field of parallel processing can be identified in many countries:
- The US Congress is considering the Federal High Performance Computing Program with a budget of 2000 M$ over five years for the installation of a huge fibre optic data superhighway and with 650 M$ each for software and hardware development.
- The largest Japanese computer makers have been slow to adapt to parallel technology and are still building vector machines that continue on the classical CRAY line. But the Japanese government is now also organizing a larger research effort to catch up.
- Other countries in Asia have launched remarkable projects. For instance, India has set up a Centre for Development of Advanced Computing for the development of parallel computer hardware and software [7].
- In Europe several countries have announced initiatives to strengthen their industries: the UK Grand Challenge Collaboration, the German Suprenum project, the European Community supported Genesys-P project. Even a smaller country, like Switzerland, has announced special programs to push research in the direction of massively parallel systems.
The demand for high-performance computers is immense and ever-increasing. The so-called computational sciences with their complex simulation problems are still orders of magnitude removed from the envisaged computing requirements. Today, both the fastest vector supercomputers and the most massively parallel systems offer a peak performance of 20–30 GFlops. But the race towards the first Teraflop machine has already started [8, 9]: a special purpose Teraflops machine for the simulation of quantum chromodynamics (QCD), proposed by the U.S. QCD Teraflop Collaboration, to be jointly built with Thinking Machines Corporation, is planned to achieve one TFlops (32 bit arithmetic) sustained performance as early as 1993. A general purpose machine providing a sustained performance of one TFlops (32 bit arithmetic) on a variety of full-scale applications and 300 GBytes of main memory is proposed by several companies, and is said to become commercially available in 1995. Having access to faster machines is a necessary but not sufficient pre-condition for the future progress of computational science. What we also need is higher efficiency when using parallel computers, and this is a software problem. Without overcoming the software dilemma (the kind of problems described in the next section) parallel computers will remain exotic machines, used only by those highly motivated scientists whose applications are so demanding that their research only stays internationally competitive if they use the most powerful computer systems available. The typical computational scientist, however, will continue his research with more traditional architectures.
2 Parallel Computers: The Software Dilemma

The development in hardware technology has made available a vast variety of powerful parallel computers, ranging from conventional vector supercomputers and novel SIMD architectures like the Connection Machine to completely different architectural concepts like dataflow, graph reduction machines or neural networks, and to many shared and distributed memory architectures based on the principle of control flow [10]. Common to all these computers is the lack of suitable programming environments. Consequently, the user has to spend a lot of time in understanding the details of an architecture and has to spend much effort on specific system aspects like data distribution, routing, load balancing or most efficient vectorization, in order to get at least a considerable fraction of the theoretical performance of a machine [11, 12]. On the other hand, the user would ideally like to concentrate on the details of his application problem and its algorithmic aspects rather than on the technical details of an implementation. There is thus a big disproportion between the effort of developing an application algorithm and that of realizing it on a parallel machine. Therefore one has to focus on programming environments, suitable libraries and on the provision of full-scale application programs, such that parallel computers become easy to use for most people and for most purposes. The challenge for the next years is thus to develop tools for using present and future parallel computers in an easy way without any significant loss of their potential power. The gap between the progress in hardware technology and the tools and methods of programming parallel computers needs to be bridged. To be specific, we identify the following problems:
No common parallel programming model. There are many models for parallel computing. However, they are either too general to be of practical importance, e.g., Petri Nets, PRAM machines [11], or they are too specific and bound to certain machines or types of applications, e.g., CSP [13], CCS [14], temporal logic, or they presumably introduce too much system overhead, e.g., Linda [15].

Parallel programming languages. C, Fortran, and Pascal are not suited to be used directly for parallel machines. This is also true for some of the classical 'parallel' languages like Ada or Chill [16], which were designed for multitasking on single processor machines. There are many programming languages offering possibilities to manage parallelism. On the one hand, there are the classical languages with extensions for parallel processes, communication, and synchronization, e.g., parallel C, Fortran. On the other hand, there are new parallel languages like Occam. The latter are much better suited for parallel machines, because they provide parallelism as a basic idea, while the former introduce parallelism just as an additional notion. A general problem is that the syntax and semantics of synchronization and communication mechanisms for parallel machines at language or operating system call level are different for every language and every machine, which excludes any reasonable portability of parallel programs onto different machines.

Insufficient availability of parallel algorithms. Many of the algorithms used today are just parallelized versions of classical sequential algorithms. This is especially the case for numerical applications. The consequence of this is often a great loss of the potential performance of parallel computers. Even the notion of parallel algorithms itself is still an active research field.
Poor classification schemes for parallel algorithms. A classification of parallel algorithms is missing. Classes of relevant types of parallelism are not identified. Existing classification schemes of parallel algorithm types, for instance the classification of the Southampton group [17] into geometric, algorithmic and event driven parallelism, seem to be too coarse. Moreover, many parallel algorithms fit into more than one or even all of these classes, whereas other algorithms, like information retrieval or many algorithms from the field of artificial intelligence, do not fit into any of them. In addition, such classification schemes tend to mix up control and data flow or to neglect one of them.
No common complexity model. To obtain a realistic estimate of the complexity or the potential performance of a parallel algorithm, all the data distribution, communication, routing and synchronization issues need to be taken into account. In existing models, all these penalties are generally neglected, leading to often completely unrealistic complexity results. There exists a large gap between the theory of parallel algorithms and their implementation.
Completely insufficient design tools for parallel algorithms. Paradigms and methods for the development and formulation of parallel algorithms are missing. There is an urgent need for a framework to analyze problems with respect to their inherent parallelism, such that it may be maximally exploited on a parallel machine.

Performance analysis of parallel algorithms. There are huge difficulties in analyzing the performance and the correctness of parallel algorithms. An analytical investigation of the performance or the complexity of a parallel algorithm is almost impossible to perform, because the available models insufficiently describe real machines, application-relevant algorithms and their implementations. The counting of relevant computational steps, as traditionally used in sequential computing, is problematic. Even a quantitative measure for the performance of parallel algorithms is missing.
Lack of suitable programming environments. Although the user would ideally like to concentrate on the details of his application problem and its algorithmic aspects, currently he has to spend a lot of time in understanding the details of an architecture and much effort on specific system aspects like data distribution, routing, load balancing or most efficient vectorization, in order to get at least a considerable fraction of the theoretical performance of a machine [11, 12]. Tools which allow a parallel machine to be programmed in the same convenient way as conventional machines, without sacrificing performance, are missing. Moreover, for a given scalable architecture, it is very difficult to design scalable applications.
Lack of efficient application libraries. Since existing application libraries installed on parallel architectures are often simply ports of sequential ones, their quality is frequently insufficient. They also in general exhibit insufficient levels of abstraction, and many important application areas are completely uncovered.
Performance monitoring. Existing analysis tools in general disturb the application program significantly, and thus introduce the new problem of how performance results should be interpreted.
Parallel operating systems. Truly distributed multi-user operating systems with a familiar user interface are missing.
3 Solving the Software Problem

3.1 Global Objectives

To the majority of application users, parallel architectures will only become attractive if these systems can be programmed and productively used as easily as modern computers with sequential and vector architecture. This demand leads to several important consequences. It implies that hardware features, e.g., type and number of processing nodes, type and topology of the communication system, need to be completely transparent to the user. It requires programming tools and environments which make it possible to achieve application performance close to the maximum offered. It also requires a standardized set of machine-independent control mechanisms, providing the platform for portability of application programs. Finally, it is important that small low-cost, scalable parallel systems become locally available for program development, and become an integral part of the modern standard working environment in the same way as conventional workstations are already today.
3.2 How to Achieve the Global Objectives

To make parallel architectures readily available in a convenient and efficient way, supporting tools have to be provided for all the distinct transformation steps encountered in the typical application program development path. Our model of application program development involves five phases and is depicted in Fig. 1. Beginning with the abstract description of the scientific problem to be solved, and targeting towards a correct and working program, which solves the desired problem economically on a parallel architecture, we identify the following transformations in order:
Problem analysis. Although many problems have inherent parallel features, conventional problem analysis often implicitly assumes that the problem is to be solved sequentially. As a consequence, parallel features of the problem are already distorted or sometimes even eliminated in this early stage of the solution process. We identify a clear need for problem analysis techniques and tools, specifically designed for parallel target architectures, which force the application user to uncover the underlying parallelism of his application.
[Figure 1 here: the five-step model of parallel application program development. The conceptional levels PROBLEM, MODEL, ALGORITHM, PROGRAM and CONFIGURATION are linked by the transformations problem analysis, algorithm design, implementation and mapping; for each transformation the figure lists the supporting techniques and tools (fundamental algorithmic concepts; programming environment and application libraries; mapper/configurator, operating system and load balancer) and their present status (not available, concepts, prototypes, supporting tools).]
Figure 1: A model of application program development and the involved transformation steps.
Algorithm design. In the next transformation step, the user maps the problem solution to an algorithmic description. This map is often influenced by the large variety of already existing algorithms, which are mostly sequential in nature for historical reasons. As a result, parallel features of the problem again might be obscured. This can only be avoided if research is conducted in the field of parallel algorithms, algorithm design and classification. Section 4 gives further comments about this phase.
Implementation. At this stage, after suitable parallel algorithms have been chosen, the user wishes to start the implementation process. Whereas a program development environment for sequential or vector architectures can consist of as little as a decent text editor, backed up on the operating system level by an appropriate compiler (with auto-vectorizing capabilities for the vector architecture case), the parallel computer case requires considerably more. This is due to the fact that fully parallelizing compilers are not available, and their availability in the near future, and more generally their usefulness, is questionable. To achieve parallel computing in affordable development time, powerful development environments for parallel architectures should provide means which are capable of identifying commonly used parallel algorithmic components from predefined algorithmic classes, and which can automatically translate them into communication skeletons, which in turn can be mapped to different hardware sizes and types in a completely transparent fashion. The provision of such environments ensures that the inherent parallel features of the selected problem, which were carefully maintained in the previous transformation steps, are also retained in the design and implementation steps, avoiding implicit 'sequentialization'. Development environments should be supplemented by application area specific libraries. Modules from these libraries can be considered as highly efficient implementations of corresponding truly parallel algorithms. Their usage guarantees that the inherent parallelism of the problem is preserved and suitably reflected in the actual program, and supports an efficient and reliable coding process. Besides the code which actually implements the selected algorithms, application library modules contain information on how data structures may be distributed across a network of processors. They also keep ready information on time and storage space complexity, translated from the abstract complexity information generated in the algorithm design step, and prepared for later use in the configuration step.
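To make the last point more concrete, the following sketch (not taken from the paper; all names are invented for illustration) shows one possible way a library module could carry such complexity information alongside its code, so that a later configuration or mapping step can query it.

#include <stdio.h>

/* Hypothetical module descriptor: each library routine advertises simple
   cost functions which a configuration step may evaluate for a given
   problem size n and processor count p. */
typedef struct {
    const char *name;
    double (*time_estimate)(long n, int p);   /* expected run time          */
    double (*space_estimate)(long n, int p);  /* memory per processor node  */
} library_module;

/* Example cost model for a distributed matrix-vector product. */
static double mv_time(long n, int p)  { return (double)n * n / p + 2.0 * n; }
static double mv_space(long n, int p) { return (double)n * n / p + n; }

int main(void)
{
    library_module matvec = { "dist_matvec", mv_time, mv_space };
    long n = 4096;
    for (int p = 1; p <= 64; p *= 4)
        printf("%s on %2d nodes: time ~ %.0f, space/node ~ %.0f\n",
               matvec.name, p, matvec.time_estimate(n, p), matvec.space_estimate(n, p));
    return 0;
}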
Mapping. Efficient and portable tools supporting in particular the non-expert in the further optimization and in the production phase of the application life-cycle are very important. Such tools include a parallel debugger, process monitors and a load balancer. These tools may be supported by complexity information generated in the algorithm design step and prepared in the implementation step.

The four transformation steps described above are put into context in Fig. 1. Columns four to six indicate the availability status of transformation supporting tools. The difficulty of the first transformation step is reflected in the fact that we are not able to give any information on the availability status of the corresponding support tool. For the following transformations, algorithm design, implementation and mapping, the third column of the figure describes our subjective view of what kind of support might be required and helpful. In the following three sections, we shall report on the conceptual aspects of three research projects which aim to provide the support indicated in the figure. A common and, from our point of view, important feature of all three projects is that application users are included in the projects from the early stages of the design phases, thus spanning the apparent difference in view between computer scientists and application users from different disciplines.
4 Project SKELETONS: From Algorithm Classes to Reusable Software

Project SKELETONS at the University of Basle attacks the software dilemma in manifold directions. The research goal is the development of techniques and tools that provide improved algorithmic support for the application-oriented programmer. Based on the concept of common algorithmic classes, we want to overcome problems that have been caused by the classical approach of programming multiprocessors.
Problems with the classical approach: In section 3 we have identified the different transformation levels which must be passed when programming a multiprocessor. From the programmer's point of view, the phase Algorithm design is of special interest. Two basic problems need to be solved there:
- Process creation problem: identification of the processes that can execute concurrently.
- Process synchronization problem: specification of mutual dependencies between processes.
The result of this phase is best described as a process graph, a graphical view of the parallel processes and their interrelations. It is this abstraction level of a parallel program which is of main interest here. The process synchronization problem has originally been studied for single processor systems. Concurrency control mechanisms in operating system design have been the driving force for development. In order to make synchronization constructs available to the application programmer, the classical approach has been the transfer of low-level synchronization elements (spin locks, semaphores, Fetch&Add, etc.), originally developed for the operating system level, to the programmer's level by integration into existing sequential languages (an illustrative sketch of one such element is given after the list below). This approach has advantages, but also causes severe problems:
- It is a universal mechanism, because both regularly and irregularly structured process graphs can be constructed. However, it is likely that the application programmer feels lost when confronted with such a plethora of proposed concepts. Problems occur not only because of semantic differences, but also because of syntactic deviations when changing the environment (for case studies refer to [18, 19]).
- It is well-known that synchronization concepts may be hierarchically ordered [20]. Concepts at different abstraction levels and system layers usually have different execution times. Efficiency of a solution may thus be guaranteed by the selection of the appropriate concept. However, this requires considerable technical knowledge that computer engineers have, but typical application programmers do not.
- Writing a parallel program from scratch is not easy at all. Synchronization problems occur and the resulting errors are difficult to debug. The lack of algorithmic support, as well as the high probability of a time-consuming debugging phase, are obstacles to productive software development on parallel systems.
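As an illustration of the low-level elements mentioned above, the following sketch (not part of the original paper) shows a centralized sense-reversing barrier built on an atomic Fetch&Add-style counter, written here in modern C for concreteness; it is exactly the kind of machinery an application programmer would rather not write by hand.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4

/* Centralized sense-reversing barrier based on an atomic counter. */
typedef struct {
    atomic_int count;   /* threads still expected at the barrier */
    atomic_int sense;   /* flips each time the barrier opens     */
    int nthreads;
} fa_barrier;

static fa_barrier bar = { NTHREADS, 0, NTHREADS };

static void barrier_wait(fa_barrier *b, int *local_sense)
{
    int my_sense = 1 - *local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {      /* Fetch&Add(-1): last arrival */
        atomic_store(&b->count, b->nthreads);       /* reset for the next round    */
        atomic_store(&b->sense, my_sense);          /* release the waiting threads */
    } else {
        while (atomic_load(&b->sense) != my_sense)
            ;                                       /* spin until released         */
    }
    *local_sense = my_sense;
}

static void *worker(void *arg)
{
    int id = *(int *)arg, local_sense = 0;
    for (int phase = 0; phase < 3; phase++) {
        printf("thread %d finished phase %d\n", id, phase);
        barrier_wait(&bar, &local_sense);           /* no thread enters phase+1 early */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    int id[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) { id[i] = i; pthread_create(&t[i], NULL, worker, &id[i]); }
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    return 0;
}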
Benefits of the new approach: A pure bottom-up approach (starting from scratch) can only be useful if the process graphs of applications are always different. However, this is not the case: many problems have common, regular process graphs. This was already reported by the authors of project CM* [21]. Algorithmic categories have been identified by their studies as follows: Asynchronous structure, Synchronous structure, Multiphase structure, Partitioning structure, Pipeline structure, Transaction Processing Algorithms. The identification of algorithmic classes is only one step. More recently, some other projects follow similar concepts, but in addition implement software tools that provide programmer support:
- While project vmmp [22] supports only two types of algorithm classes, its strength is the development of distribution primitives for array data.
- At the Edinburgh Parallel Computing Centre predefined program structures called Harnesses are collected in libraries.
- schedule [23] is a package that supports the creation of task graphs and their visualization in a FORTRAN environment.
- force is a package developed for similar purposes.
Project SKELETONS is an integrated approach towards application support. The project is still at the conceptual level; the scenarios described in this section do not imply that the software components already exist. But our research should finally result in a quantifiable improvement of key factors of software design:
- Reduced error rate: A catalog of predefined process graphs for the most important algorithmic categories is consulted at the entry point of phase Algorithm design. If the model matches an entry in the catalog, the user can immediately access the predefined process graph. Thus, ill-structured process graphs are not possible, because the process creation and synchronization problems have already been solved during the design phase of the catalog. Of course, there still remains the danger that the programmer selects the wrong class. A supporting tool is only as clever as the programmer who uses it.
- Improved programmer's efficiency: Software development is speeded up if further application support is given in the form of reusable software components. We envisage libraries of structural skeletons of program code. Starting at this level, the programmer only has to fill in the problem-specific gaps of the skeleton.
- Increased portability: Portability is improved if these libraries are ported to different target systems (hiding hardware and OS dependencies). We target systems that help us to follow both the SIMD and the MIMD line.

Fig. 2 summarizes the present research topics of the project.
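To make the skeleton idea concrete, here is a small sketch (not from the project itself; all names are invented) of what such a reusable component might look like in C: the library fixes the process graph of a task farm, and the programmer fills in only the problem-specific gaps.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical skeleton interface: the process graph is supplied by the
   library, the application programmer supplies only these three gaps. */
typedef struct {
    void *(*make_task)(int index);            /* produce the i-th task   */
    void *(*work)(void *task);                /* solve one task          */
    void  (*collect)(int index, void *result);/* consume the i-th result */
} farm_skeleton;

/* Sequential reference implementation of the skeleton; a parallel version
   would distribute the work() calls over the processor nodes without
   changing the user-visible interface. */
static void farm_run(const farm_skeleton *f, int ntasks)
{
    for (int i = 0; i < ntasks; i++) {
        void *task   = f->make_task(i);
        void *result = f->work(task);
        f->collect(i, result);
        free(task);
    }
}

/* Problem-specific gaps: square the numbers 0..9. */
static void *make_task(int i)        { int *t = malloc(sizeof *t); *t = i; return t; }
static void *work(void *task)        { int *t = task; *t = (*t) * (*t); return t; }
static void  collect(int i, void *r) { printf("result[%d] = %d\n", i, *(int *)r); }

int main(void)
{
    farm_skeleton f = { make_task, work, collect };
    farm_run(&f, 10);
    return 0;
}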
Figure 2: Project overview
Target systems for experimental research: Our study explores both SIMD and MIMD programming styles. Currently, the following systems are available here and are targets of our research:

- The sophisticated programming environment of the shared-memory multiprocessor M3, originally built at ETH Zurich, has been ported to commercially available hardware (VME processor modules) [24].
- A 64-node Transputer system is in the final stage of implementation. Each node consists of a T805 Transputer chip and 8 MB of main memory.
- A Maspar 1204 SIMD system with 4096 processing elements is currently available to our research group.
- A NeXT workstation is used as a test bed for object-oriented techniques in the field of parallel processing.
5 An Integrated Development and Run-Time Environment

5.1 Introduction

Despite several efforts which have been undertaken in the last few years [12, 24, 25, 26, 27, 28], immediate support of sufficient efficiency and versatility for full-scale applications on parallel computer systems is still lacking. Although varied in nature, a common feature of all approaches is that insufficient attention is paid to the aspects of integrity and expandability of the supporting framework by the application user himself. It is the purpose of the project SPADE (Scientific Program and Application Development Environment), developed at the IAM of the University of Berne, to qualitatively increase the benefits of parallel systems for productive usage by the development of an integrated application development, program development and run-time environment. A conceptual foundation of the project is that the specification, design, test and development of SPADE is done in close collaboration with application users. We believe that this approach is essential in order to take into account as many of their requests as possible. The intimate interaction with users from different disciplines simultaneously defines an iterative procedure by means of which a complete system can be developed for a variety of different application areas in a step-wise fashion. This system gives full support to all typical steps of an application: program development, application specification and production.
5.2 Conceptual Structure and Design Goals

SPADE is targeted for distributed memory parallel processor (DMPP) systems, and relies on the model of Communicating Sequential Processes [13]. SPADE especially supports the treatment of problems with inherent data parallelism. Problems which belong to this class are all those which are generally solved by algorithms characterized by simple, local operations applied to large and regular data structures.
The user is supplied with library functions and complete predefined applications which are frequently used. Libraries and predefined applications are built on top of application kernel libraries, which guarantee a high degree of efficiency of all implemented algorithms, whether predefined or supplemented by the application user. SPADE is characterized by its integrity and expandability. Particular emphasis is put on usability by non-computer scientists. This is assured by a uniform user interface to the application development, program development and run-time environment of the system, and by the participation of application users from different disciplines, with different expertise in computer science, in the design and development phases right from the beginning of the project. A substantial component of SPADE is the program development environment, which guarantees its modular expandability. Expandability covers both expansion of modules for already supported application areas and assistance for completely new disciplines. The multilevel software structure of SPADE, together with the interaction of the different components with each other and their relation to the operating system and the hardware, is illustrated in Fig. 3.
[Figure 3 here: the layered SPADE software structure, consisting of the User Interface (UI) on top of the Application Development Environment (ADE), the Program Development Environment (PDE) and the Run-Time Environment (RTE), which in turn rest on the Task Libraries (TL), the Communication Libraries (CL) and Kernel Libraries (KL), the Mini Operating System (MOS) and the Hardware (HW).]
Figure 3: The SPADE software structure. The size of the different boxes is meant as a measure of the relative importance of the component it represents, as well as a measure of the relative amount of effort which has to be invested into its development.
5.3 Description of SPADE Components

5.3.1 Communication Libraries (CL)

Communication libraries provide communication primitives like send and receive of a message between two not necessarily identical types of processor nodes, including data conversion, message routing and buffering. Building on top of these primitives, higher level communication functions are also provided, specifically tailored to the requirements of the different applications. Conceptually, communication libraries provide a portability platform for parallel applications.
5.3.2 Application Kernel Libraries (KL)

Application kernel libraries make available implementations of basic algorithmic components which are either compute-intensive or often used. They are designed for independent usage on single processor nodes. To utilize the theoretical performance in an optimal fashion, application kernel libraries will be designed and implemented separately for each type of processor node supported. They thus guarantee high system utilization. Kernel libraries will be supplemented by information on the time and storage-space complexity of the individual modules. This information can be made available to a parallel distributed operating system for purposes like optimal distribution of processes on processors (load balancing). Application kernel libraries thus serve as a platform to optimally combine SPADE with a distributed parallel operating system.
5.3.3 Application Task Libraries (TL)

The application area specific application task libraries are built on top of the communication and application kernel libraries. Application task library modules make available complete components of applications. One of their principal design properties is that they can be distributed and concurrently executed on an arbitrary number of processor nodes. The size of the data structure (problem size) is arbitrary. Application task library modules are implementations of truly parallel algorithms, and are designed for minimal communication overhead. Where possible, special attention is paid to maximum overlap of communication with application specific calculations.
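The overlap of communication with calculation mentioned above can be illustrated with a small sketch. The paper predates today's message-passing standards, so the example below uses MPI merely as a stand-in for the SPADE communication libraries; the pattern is the point: post the boundary exchange first, compute the interior while the messages are in flight, and only then compute the boundary.

#include <mpi.h>

/* Application specific calculations, assumed to be defined elsewhere. */
void update_interior(double *u, int nlocal);
void update_boundary(double *u, int nlocal);

/* Illustrative one-dimensional domain with one halo cell on each side. */
void overlapped_step(double *u, int nlocal, int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    /* 1. post the halo exchange with the two neighbours */
    MPI_Irecv(&u[0],          1, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(&u[nlocal + 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(&u[1],          1, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(&u[nlocal],     1, MPI_DOUBLE, right, 0, comm, &req[3]);

    /* 2. compute the interior points, which need no remote data,
          while the messages travel through the network */
    update_interior(u, nlocal);

    /* 3. wait for the halos, then compute the two boundary points */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    update_boundary(u, nlocal);
}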
5.3.4 Application Development Environment (ADE)

The application development environment interprets the application and hardware topology specification files ASF and HTSF, which provide abstract, high-level descriptions of the application and the desired hardware. Both the ASF and the HTSF are generated with the help of the user interface UI. The ASF specifies the application as the successive execution of application task and kernel library modules, together with an appropriate driver for the complete application. It also contains information about the problem size and the I/O requirements. The HTSF provides a description of the desired target architecture, i.e., type and number of processors and the processor interconnection scheme. Depending on the
information contained in the ASF and the HTSF, the ADE decomposes the necessary data structures and prepares their distribution onto the requested hardware, which is performed at run-time. In addition, it defines the necessary data paths required for communication. Together with the problem size independence of the application task library modules, the ADE thus provides full scalability with respect to the hardware size.
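The paper does not define the syntax of the ASF and the HTSF, so the following sketch is purely illustrative; it merely writes down, as C initializers with invented field names, the kind of information the two files are said to carry (library modules, driver, problem size and I/O on the ASF side; node type, node count and topology on the HTSF side).

#include <stdio.h>

/* Hypothetical in-memory forms of an ASF and an HTSF; all names invented. */
struct asf {
    const char *driver;          /* driver for the complete application    */
    const char *modules[4];      /* task/kernel library modules to execute */
    int         problem_size[2]; /* size of the distributed data structure */
    const char *output_file;     /* I/O requirements                       */
};

struct htsf {
    const char *node_type;       /* type of processor node                 */
    int         nodes;           /* number of nodes requested              */
    const char *topology;        /* desired interconnection scheme         */
};

int main(void)
{
    struct asf  app = { "mc_driver", { "init_lattice", "mc_sweep", "measure", NULL },
                        { 16, 16 }, "results.dat" };
    struct htsf hw  = { "T800", 64, "torus_8x8" };

    printf("run %s on %d %s nodes (%s)\n", app.driver, hw.nodes, hw.node_type, hw.topology);
    return 0;
}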
5.3.5 Program Development Environment (PDE)

The program development environment is a major part of the SPADE system. Conceptually, a unique feature of the PDE within the SPADE system is that it is designed to keep the inherent parallelism of a problem in all the major transformation steps involved in parallel program development, i.e., problem analysis, algorithm design and implementation. As a result, new modules for the application task and kernel libraries can be created, or libraries for completely new areas of application can be successively added to the already existing ones. It is important to note that, due to the strategy supported by the PDE, the generated modules should be regarded as truly parallel implementations of parallel algorithms, rather than as parallelized implementations of parallelized algorithms, which are created from sequential ones by simply adding parallel features in an ad hoc fashion. To guarantee this ambitious objective, the program development environment features optimization of the communication requirements of the program modules to be added, i.e., it minimizes the communication overhead due to the parallel nature of the algorithm and the implementation, while simultaneously maximizing the overlap of communication with application specific calculations, where possible.
5.3.6 Run-Time Environment (RTE)

The run-time environment verifies whether the requested hardware specified by the hardware topology specification file HTSF is available. Starting from the ADE description of the application, the RTE builds the executable program, boots the hardware, loads and executes the program. It also provides basic support for debugging of parallel programs.
5.3.7 User Interface (UI)

The user interface supports in a uniform way the step-wise access of application users with different expertise to the full functionality of the SPADE system: application development, program development and run-time environment. It assists in the creation of the application and hardware topology specification files ASF and HTSF. The ASF provides an abstract, high-level description of the complete application, including in particular information about problem size, the necessary library modules and the I/O requirements. The HTSF describes the desired number of processor nodes out of the maximum of available nodes, the types of the nodes and the desired topology out of the set of physically realizable topologies.
5.4 SPADE Project Status

The specification of SPADE has been finished, and is described in [29, 30]. The most recent activities include:
- The design of a C-type application specification language. A corresponding parser has been implemented completely; the code generator is currently under development [31].
- A systematic investigation of different distribution techniques of d-dimensional grid-type problems onto multiprocessor networks of various topologies [32] (a small decomposition sketch follows this list).
- The development of a communication skeleton for Monte Carlo simulations of four-dimensional lattice gauge theory [33].
- Based on the communication skeleton, the implementation of predefined application task library modules for the Monte Carlo simulations of lattice gauge theory is under way [33].
- The implementation of an application kernel library prototype for the Monte Carlo simulations of lattice gauge theory [33].
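As a minimal illustration of what a grid distribution technique has to compute, the following sketch (not SPADE code; it shows only the simplest block decomposition) assigns to each node of a P1 x P2 processor mesh a contiguous subgrid of an N1 x N2 grid whose extents differ by at most one row or column.

#include <stdio.h>

/* Block decomposition of n items over p nodes: node 'rank' gets the
   half-open index range [lo, hi), sizes differing by at most one. */
static void block_range(int n, int p, int rank, int *lo, int *hi)
{
    int base = n / p, rest = n % p;
    *lo = rank * base + (rank < rest ? rank : rest);
    *hi = *lo + base + (rank < rest ? 1 : 0);
}

int main(void)
{
    const int N1 = 10, N2 = 7, P1 = 3, P2 = 2;
    for (int p = 0; p < P1; p++)
        for (int q = 0; q < P2; q++) {
            int r0, r1, c0, c1;
            block_range(N1, P1, p, &r0, &r1);
            block_range(N2, P2, q, &c0, &c1);
            printf("node (%d,%d): rows [%d,%d) cols [%d,%d)\n", p, q, r0, r1, c0, c1);
        }
    return 0;
}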
A SPADE prototype has been developed on a Meiko Computing Surface with Intel i860 processing nodes, combined with T800 Transputer communication units, using CSTools as communication library [34]. Details, as well as the SPADE development schedule, can be found in [35].
Acknowledgement. The SPADE project is supported in part by Grant 20.5641.88 from the Swiss National Science Foundation.
6 Optimized Distribution and Automatic Configuration of Parallel Programs

6.1 Introduction

The successful use of parallel architectures depends heavily on suitable development tools and runtime environments. The MARC environment (MApping Routing Configuring) has been developed to provide the user of parallel distributed memory machines with tools for the efficient use of such architectures. The MARC system, developed at the IAM of the University of Berne, analyses the structure of parallel programs and the structure of the available parallel architecture in order to produce a load balanced and communication optimized executable program. It includes a new method for load balancing and communication optimized process distribution onto arbitrary (network) topologies. Moreover, efficient and secure routing strategies are included in the system. A sophisticated performance analyzer provides the system with the necessary load and communication cost information.
The MARC system presented here provides tools for realizing efficient implementations on distributed memory parallel architectures, reducing the gap between the application and the system aspects of implementations. It expects from the user a parallel program in the form of a set of communicating processes and subsequently produces a configuration for this program, which may then be executed on a multiprocessor machine. The MARC system [26], [36], [37] has been realized for programs written in the language Occam [38] and for Transputer networks [39] as target machines. The system analyses the structure of a parallel program and the structure of the available parallel architecture in order to produce a load balanced and communication optimized executable program. The MARC project aims towards a truly distributed operating system and development environment for parallel (MIMD) architectures.
6.2 System overview

Besides the traditional tools of any development environment and operating system, e.g., compilers, assembler, file system, the efficient use of parallel architectures needs more tools, which meet their specific requirements. Provided that a suitable parallel programming paradigm is used, two major tasks or problems can be identified:
- The mapping problem
- Deadlock free routing

The MARC project relies on CSP (Communicating Sequential Processes) [13] as the programming model and on distributed memory architectures with an arbitrary fixed or reconfigurable network topology. Taking these assumptions into account, the parallel language Occam and Transputer networks are suitable objects for a first implementation of the MARC tools. However, the methods developed for the tools are mostly independent of the programming language and the multiprocessor architecture. Much emphasis has been put on the mapping problem, because it is viewed as fundamental for distributed or parallel computing. A mapping of an arbitrary system of communicating sequential processes onto an arbitrary network of parallel processors should deliver a process distribution such that the most efficient execution results. Therefore as much knowledge as possible about a program is required, which can be determined only by a thorough performance analysis of the program [40], [41]. The automatic configuration tool MARC transforms a correct Occam program into a semantically equivalent configured program for a fixed, but arbitrary network of Transputers.¹ Figures 4 and 5 give an overview of the tool, which includes the following successive steps:

1. the analysis of a parallel Occam program, resulting in a software graph (SW-graph)
2. the analysis of the interconnection scheme of a Transputer network, resulting in a hardware graph (HW-graph)
3. the construction of routing tables along Eulerian paths with shortcuts
4. the communication optimized and load balanced mapping of the SW-graph onto the HW-graph
5. the generation of the configured program for the processor network, including a routing system

¹ It runs so far within the standard Transputer Development System.
[Figure 4 here: MARC takes a parallel program and the parallel hardware description, supported by the performance analyser, and produces an optimally distributed program through its mapper, routing and configuring components.]

Figure 4: MARC system overview
[Figure 5 here: the principal MARC steps: SW graph extraction, HW graph extraction, performance analysis, construction of routing tables, load and communication optimized mapping, routing system generation, and program configuration.]

Figure 5: The principal MARC steps

Performance analysis, mapping and routing strategies are described more precisely in appendix A.
6.3 Towards a truly distributed operating system

The ultimate aim of the work done in the MARC project is a truly distributed operating system for massively parallel MIMD architectures. The currently available operating systems for parallel architectures are oriented much more towards multitasking than towards parallelism. Although they allow the use of multiple processor systems, they are either restricted to a certain small number of processors, arranged usually in fixed (maybe statically switchable) topologies like hypercubes, or they allow just the use of dedicated parallel hardware attached to a specific host. Most such operating systems have their roots in UNIX. However, most of them neglect the mapping problem as well as load balancing. Also, their kernels tend to be quite large and computation intensive, thereby diverting much of the parallel computer's power from the applications. Some of the requirements for a truly distributed operating system (besides the classical ones) include:
- fast and efficient mapping strategies, and remapping methods for the case where it is possible to relocate processes on different processors (process migration)
- mapping strategies for dynamically reconfigurable parallel architectures
- runtime performance analysis for remapping
- adaptable routing or communication kernels, possibly supported by the hardware itself
- transparency for the user
It is clear that for future developments some of the routing tasks can be assumed to be performed by the hardware itself, or at least that the routing will be supported much better by the hardware. Investigations have shown that some of the aspects of a distributed operating system, as mentioned before, may already be treated in the compilation phase of program development [42]. Regarding the mapping problem in such an operating system, the Dmapper seems to be a good candidate, because it
- is very efficient
- is very simple and easy to implement
- is distributed over the whole processor network
- does not need any global synchronization: the synchronizations are restricted to local interactions between neighboring processors
- produces good suboptimal solutions

Furthermore, it seems to be particularly well suited for dynamically reconfigurable parallel architectures, because it can adapt itself to every new switching of the processor interconnections.
7 Conclusions

Research in parallel architectures and development tools seems to be one-sided in the sense that the focus is directed very much towards numerical applications. Other fields of application, like information retrieval or transaction systems, are often neglected. Also, the acceptance of parallel architectures outside the academic world is quite low. Reasons for this are the completely underdeveloped programming tools and the lack of application-specific libraries, the necessity to program at a very low level in order to achieve appropriate performance when using classical programming languages with parallel extensions on the one hand, and, on the other hand, the necessity to learn completely new ways of programming when using truly parallel languages. In general, the biggest obstacle preventing many users from using parallel computers is the absolute necessity to change from thinking sequentially to thinking in parallel. It seems to be extremely difficult to overcome this problem, although our world is inherently parallel. Thus parallelism is the most natural way of treating problems; this does not mean that problem solving becomes much easier in general, but problems which are not feasible to solve with the conventional sequential techniques of today may become solvable. Once the software dilemma has been solved, we believe that the usefulness of parallel architectures will significantly increase in three steps. First, applications in science and engineering will benefit from the largely increased availability of computing resources for modeling complex systems. This will be followed by increased attractiveness in the entire university environment, culminating finally in knowledge transfer and important momentum for industrial companies.
A Overview of the Algorithms used in the MARC System

The MARC system² presented in section 6 provides tools for realizing efficient implementations on distributed memory architectures, starting from a parallel algorithm formulated as a collection of communicating processes. It thus reduces the gap between the algorithmic aspects of an application and the technical details of its implementation. In this appendix, the crucial steps of MARC, i.e., the strategies for performance analysis, mapping and routing, are presented. The MARC system has been realized for programs written in the language Occam, targeted for Transputer networks.

² J.E. Boillat, P.G. Kropf, IAM, University of Berne.
A.1 Distributed Mapping and Load Balancing

A.1.1 The Mapping Problem

Because a mapping should deliver a distribution of the processes such that the most efficient execution results, there are two optimization demands to be met:
Load balancing: The processes have to be distributed such that the overall system load is well balanced, i.e., the load caused by the processes is distributed evenly over all the processors.
Communication minimization: The communication capacity between any two processors should be used as optimally as possible, i.e., the overall communication should be distributed as evenly as possible over all communication links.
Figure 6 shows the mapping of a 4 × 4 mesh of processes onto a 2 × 2 mesh of processors. Since mapping a set of communicating processes onto processor networks is known to be NP-complete, it makes no sense to develop exact algorithms for solving the problem; rather, mapping algorithms should be able to produce suboptimal solutions very quickly. We have developed a mapping algorithm (called the Dmapper) which is fully distributed and delivers good optimal or suboptimal solutions in a short time.
Figure 6: Optimal mapping of a 4 × 4 process mesh (SW graph) onto a 2 × 2 processor mesh (HW graph)
A.1.2 Dmapper: the Distributed Mapping Algorithm

The distributed mapping algorithm, Dmapper, runs in parallel on the target hardware itself. It consists of a set of identical communicating processes running on each processor. The basic ideas of the algorithm are:
- fully distributed method based on diffusion
- local nearest neighbor optimization leads to a global optimization
- neighboring processors exchange information about their load and communication costs
- processes are moved according to the optimization demands
- the parallel hardware itself is used to increase the speed of the mapping
The distributed mapping algorithm Dmapper defines a mapping P → T from the set P of processes onto the set T of processors under the following assumptions:

- The communication costs $c^H_{ij}$ between any two processors $T_i$ and $T_j$ are set to
\[
c^H_{ij} =
\begin{cases}
1 & \text{if } T_i \text{ and } T_j \text{ are connected} \\
2\,D_{ij} & \text{else,}
\end{cases}
\]
  where $D_{ij}$ denotes the shortest distance between $T_i$ and $T_j$.
- The communication costs $c^S$ between any two communicating processes are set equal to 1.
These assumptions are realistic provided we consider fine-grain parallelism. The Dmapper follows an iteration scheme consisting of three parts:

1. Information exchange
2. Communication optimization
3. Load balancing

This scheme is run on each processor in parallel. As the optimal solution of a mapping is usually not known (at least not in advance), the number of iterations to use is a matter of experiment. However, considering load balancing only, the number of iterations is known [43, 44]. Figure 7 shows the principal structure of the Dmapper process.

SEQ i = 0 FOR iterations
  SEQ
    ...  exchange information about process location and processes
         with direct neighbors
    ...  choose a process causing high local communication costs and
         send it to neighboring processor
    {{{  balance load locally
    SEQ l = 0 FOR links
      VAL load.diff IS my.load - load[l] :
      VAL to.give IS load.diff / (links + 1) :
      IF
        to.give > 0
          give[l] := to.give
        TRUE
          give[l] := 0
    }}}
Figure 7: General structure of the Dmapper

Each process of the parallel program to be mapped is represented as a token. At the beginning all the processes are placed on an arbitrary processor (except those associated with a restriction). At each iteration step, the neighboring processors exchange their current knowledge about their load and the probable location of all the processes. Processes are then moved to reach a local load equilibrium. If the load of a processor $T_i$ is greater than that of its neighbor $T_j$, a part of the load difference will be given to $T_j$ (see Figure 8).
After a certain number of iteration steps an optimal or suboptimal solution to the mapping problem is found. The algorithm is load balancing driven, i.e., the most optimal load balance is considered first, but the processes which are passed to another processor are selected such that the local communication costs are reduced. Since a probable location of all processes is known, the process to be moved to another processor may be chosen such that communication costs are reduced. To improve the communication optimization, processes causing high local communication costs are chosen randomly for being sent to neighboring processors, thereby reducing the local costs. As far as the load balancing demand is concerned, Dmapper is proven to reach a uniform load distribution in $O(n^2)$ parallel iterations for an arbitrary hardware graph with $n$ processors [43, 44].
Figure 8: Load balancing strategy of the Dmapper. The figure shows the load distribution before (left) and after (right) a load balancing step. The numbers indicate the load of the processors. The load balancing strategy is very efficient and has been used successfully for dynamic load balancing in an application as well [43].
Load balancing strategy: For simplicity we assume in this section that all Transputer links are connected. The load balancing strategy of the Dmapper is based on a discrete diffusion scheme in the HW-graph. At each iteration step, processor $T_i$ gives
\[
\frac{\mathrm{load}(T_i) - \mathrm{load}(T_j)}{\mathrm{links} + 1}
\tag{1}
\]
of its own load to every neighboring processor $T_j$, provided $\mathrm{load}(T_i) > \mathrm{load}(T_j)$.³ Let $A$ be the adjacency matrix of the hardware graph (processor network) and $I$ be the identity matrix. The matrix $B$ associated with the diffusion scheme has the form
\[
B = \frac{1}{\mathrm{links} + 1}\,(I + A)
\]
and is stochastic. It can be shown [43, 44] that
\[
\lim_{t \to \infty} B^t =
\begin{pmatrix}
\frac{1}{n} & \cdots & \frac{1}{n} \\
\vdots & & \vdots \\
\frac{1}{n} & \cdots & \frac{1}{n}
\end{pmatrix}
\]
for all connected HW graphs, i.e., the load diffusion converges towards the uniform distribution. Furthermore, it can be shown [44, 45] that the uniform distribution is reached after $O(n(\mathrm{links} + 1)(\delta + 1))$ steps, where $n$ is the number of processors and $\delta$ the diameter of the HW-graph. Since links is a small constant, and since $\delta \le n - 1$, the worst case complexity of the load balancing is proportional to the square of the number of processors. The worst case is met by a pipeline of processors, because $\delta = n - 1$. As another example, a hypercube of dimension $d$ has a diameter $\delta = d$, but it is proved that the uniform distribution is reached after $O(d)$ steps only (see [43]).

³ The load balancing scheme presented here assumes the load to be a real number (see also [43]). In [44] a similar algorithm is presented with discrete loads. For the practical implementation the floor of (1) is taken, respecting indivisible loads.
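The convergence behaviour can be checked with a few lines of code. The sketch below (not from the paper) simulates the diffusion scheme of equation (1) on a ring of processors, with real-valued loads and all load initially placed on a single node, and prints the maximum deviation from the uniform distribution after each sweep. Increasing N shows how the number of sweeps needed grows with the diameter of the network, as the bound above suggests.

#include <stdio.h>
#include <math.h>

#define N     8      /* number of processors, arranged in a ring  */
#define LINKS 2      /* each node has two neighbours in the ring  */

int main(void)
{
    double load[N] = { 0.0 }, next[N];
    load[0] = (double)N;                      /* all load starts on node 0 */

    for (int step = 1; step <= 40; step++) {
        /* every node gives (load_i - load_j)/(LINKS+1) to each lighter neighbour */
        for (int i = 0; i < N; i++) next[i] = load[i];
        for (int i = 0; i < N; i++) {
            int nb[2] = { (i + 1) % N, (i + N - 1) % N };
            for (int k = 0; k < 2; k++) {
                int j = nb[k];
                if (load[i] > load[j]) {
                    double give = (load[i] - load[j]) / (LINKS + 1);
                    next[i] -= give;
                    next[j] += give;
                }
            }
        }
        double maxdev = 0.0;
        for (int i = 0; i < N; i++) {
            load[i] = next[i];
            double dev = fabs(load[i] - 1.0);  /* uniform load is 1.0 per node */
            if (dev > maxdev) maxdev = dev;
        }
        printf("step %2d: max deviation from uniform load = %.4f\n", step, maxdev);
    }
    return 0;
}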
Communication cost optimization: As defined previously, the communication costs $c^S$ between any two communicating processes are set to $c^S = 1$. During the mapping process three types of communication are considered:
- Internal communication: a communication between two processes currently placed on the same processor. Its value is set to 1.
- Short communication: a communication between processes currently placed on neighboring processors. Its value is set to 1.
- Long communication: a communication between processes currently placed on distant processors, i.e., processors which are not neighbors. Its value is set to $2\,D_{ij}$, where $D_{ij}$ is the shortest distance between processors $T_i$ and $T_j$.
Attraction vector: To select candidate processes to be moved to another processor, an attraction vector is defined as follows:

- each processor knows the shortest distance to every other processor in the network,
- each processor $T_i$ knows all communication links through which a shortest distance to all the other processors is realized,
- each processor knows the probable location of all processes (these locations are exchanged with the neighboring processors at each iteration).
Because only the neighbors of a processor which has moved a process to another one immediately know the exact actual location of this process, the knowledge each processor has of all the process locations is not always up to date. When a process P has been moved, it takes some iteration steps until all the other processors learn about the move. Thus the information on the current location is of a probabilistic nature, not exact. The number of iteration steps needed until a new placement induced by processor $T_i$ becomes known (time-delayed) to processor $T_j$ is proportional to the shortest distance $D_{ij}$. The attraction vector defines whether and which process P currently placed on processor $T_i$ should be sent to a neighboring processor. The components of the attraction vector are the probable communication costs in each link direction (an internal link is added to take care of the internal communications). The largest component of the attraction vector indicates in which direction the process should be moved. At each iteration step Dmapper chooses those processes with the highest attraction values that still fulfil the load balancing demands. If there are different processes with the same attraction value, the processes to be moved are selected at random. The process with the highest attraction value is sent to a neighboring processor with a decreasing probability P[move], thus neglecting the load balancing demands. This ensures that the algorithm does not remain in a communication suboptimum. The probability depends on the total number of iterations and the current iteration number:
\[
P[\mathrm{move}] = 1 - \frac{\text{current iteration}}{\text{total number of iterations}}
\]
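A compact sketch of this selection step (again not taken from MARC itself; the data layout is invented) might look as follows: the attraction vector has one entry per link plus one for the internal "link", the largest entry determines the direction, and the move probability decreases linearly with the iteration number.

#include <stdlib.h>

#define LINKS 4                 /* external Transputer links */

/* attraction[0..LINKS-1] are the probable communication costs per link
   direction, attraction[LINKS] is the internal "link". Returns the link to
   move along, or -1 if the process is best kept on the local processor.  */
static int best_direction(const double attraction[LINKS + 1])
{
    int best = LINKS;
    for (int l = 0; l < LINKS; l++)
        if (attraction[l] > attraction[best])
            best = l;
    return (best == LINKS) ? -1 : best;
}

/* Move with a probability that decreases over the iterations, so that early
   iterations favour communication optimization and later ones respect the
   load balancing demands again. */
static int should_move(int current_iteration, int total_iterations)
{
    double p = 1.0 - (double)current_iteration / (double)total_iterations;
    return ((double)rand() / RAND_MAX) < p;
}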
A.2 Routing
It is assumed that the original program which serves as input for MARC runs on a single processor. Every Occam process may have an unlimited number of channels to other processes. On the other hand, a Transputer network has only a restricted topology, because each Transputer has no more than four links. A set of processes placed on a particular processor by the Dmapper may communicate with another set on a neighboring processor. For load balancing reasons, Dmapper may also decide to place processes which are directly connected through a channel on processors that are not neighbors. This implies that
certain channels have to share the same link;

certain channels have to provide communication across several Transputers.

An ideal routing system also supports communication paths across more than one processor transparently. The MARC routing strategy leaves the semantics of any parallel Occam program unchanged when it is configured for a network of Transputers. This is achieved by preserving the synchronized communication along user channels and by the deadlock freedom of the routing system itself. The routing system is economical in the sense that it makes only little use of memory and CPU resources.
Routing a message across several processor nodes introduces buffering once on each node. Because of the synchronous communication scheme, this buffering could change the semantics of an Occam program significantly: deadlocks may be introduced or removed. Therefore, a pair of special processes, read.with.acknowledge and write.with.acknowledge, has been developed. They allow synchronous communication along a user channel, even if the communication between the .with.acknowledge processes is buffered. Using these CSP-consistent processes, the execution of a sending process is suspended until an acknowledgement from the remote receiver process has been transmitted.
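The idea behind the .with.acknowledge pair can be imitated in a few lines of Python (threads and queues stand in for Occam processes and routed channels; the names are illustrative, not the MARC processes themselves): the user-level send remains synchronous although the underlying channel is buffered.

import queue, threading

data_ch = queue.Queue()   # buffered, routed data channel (stand-in)
ack_ch = queue.Queue()    # acknowledgement channel travelling back

def write_with_acknowledge(msg):
    data_ch.put(msg)      # the message may be buffered on intermediate nodes
    ack_ch.get()          # the sender is suspended until the acknowledgement

def read_with_acknowledge():
    msg = data_ch.get()
    ack_ch.put(None)      # acknowledge reception to release the sender
    return msg

receiver = threading.Thread(target=lambda: print("got", read_with_acknowledge()))
receiver.start()
write_with_acknowledge("hello")   # returns only after the receiver has read
receiver.join()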
Routing strategy  The routing itself must not deadlock. The algorithm used in the MARC system is proven to be deadlock-free [46]. It is based on the idea of superimposing a graph without cycles, an Eulerian path, over the Transputer network. Because Transputers have an even number (4) of links and usually most links are connected, an Eulerian path is very likely to exist. Because most Transputers in a network are traversed twice by the Eulerian path, shortcuts may be used to reduce the length of a path for a message. Although these shortcuts make the routing much more efficient, they must be taken with care, such that they do not introduce cycles into the routes (and therefore the possibility of deadlocks). The routing strategy is defined using the Eulerian path. All the nodes in the Eulerian path are numbered in increasing order. Let s be a node in the Eulerian path and T(s) denote the processor corresponding to that node.
Rule 1: All messages are sent along the Eulerian path in increasing order or in decreasing order, e.g., if s_i and s_j are two nodes in the path and i < j, the message will follow the path [s_i, s_{i+1}, ..., s_{j-1}, s_j].

Rule 2: A message from s_i to s_l in the Eulerian path may shortcut any subpath from s_j to s_k, provided T(s_j) = T(s_k) and i < j < k < l or i > j > k > l.
It is easy to see that these rules define a routing function on the Eulerian path: any two nodes in the path are connected, and the shortcut rule (2) does not introduce any cycle. The routing rules for the HW-graph are chosen such that they are compatible with the rules in the Eulerian path. Let T_i and T_j be two processors. There are at least two nodes s_i and s_j in the Eulerian path such that T(s_i) = T_i and T(s_j) = T_j. To send a message from T_i to T_j it is sufficient to choose those s_i and s_j in the Eulerian path such that the path (with shortcuts) between s_i and s_j is the shortest one possible. Precise definitions of the algorithms for computing the routing function are given in [46].
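A small Python sketch of rules 1 and 2, assuming the Eulerian path is given simply as the list of processor numbers it visits (node k of the path lies on processor T[k]); this illustrates the routing function, not the MARC implementation.

def route(T, src, dst):
    """Shortest route (with shortcuts) from processor src to dst along the
    Eulerian path T, returned as the sequence of processors visited."""
    occurrences = lambda p: [k for k, t in enumerate(T) if t == p]
    best = None
    for i in occurrences(src):
        for l in occurrences(dst):
            step = 1 if l >= i else -1          # rule 1: monotone along the path
            hops, k = [i], i
            while k != l:
                nxt = k + step
                # rule 2: jump to the last node towards l on the same processor
                same = [m for m in range(nxt, l + step, step) if T[m] == T[nxt]]
                nxt = same[-1]
                hops.append(nxt)
                k = nxt
            if best is None or len(hops) < len(best):
                best = hops
    return [T[k] for k in best]

# Example: a ring of three processors whose Eulerian path revisits processor 0.
print(route([0, 1, 2, 0], 2, 0))    # -> [2, 0], i.e. a single hop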
Buffering  The memory needed by the routing system is closely coupled to the routing algorithm. The MARC routing algorithm needs only two message buffers per link for full-duplex communication, because deadlock freedom is not achieved by buffering, but by routing along cycle-free routes.
Process structure of the routing system  The routing processes are completely event-driven. They do not consume any CPU time if there are no messages to transmit, i.e., the routing does not involve any polling. The routing system consists of a set of identical processes, called routers. On each Transputer an instance of the router is running. The router interfaces to the four links and to all application channels having their other end on a remote Transputer. Figure 9 shows the structure of the routing processes present on each processor node.
Figure 9: Router process on a processor node

Every message in the routing system is preceded by a header identifying the destination Transputer and the application channel. With the help of the local routing information, the router can decide where a message has to be forwarded to.
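For illustration only, a hypothetical header and forwarding decision could look as follows in Python; the field names and the routing-table layout are assumptions, the actual MARC message format is not specified here.

from dataclasses import dataclass

@dataclass
class Header:
    dest_transputer: int   # destination processor node
    dest_channel: int      # application channel on that node

def forward(header, local_id, routing_table, local_channels):
    """Deliver locally or look up the outgoing link for the destination node.
    routing_table[node] -> link number, local_channels[ch] -> local endpoint."""
    if header.dest_transputer == local_id:
        return ("deliver", local_channels[header.dest_channel])
    return ("send", routing_table[header.dest_transputer])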
A.3 Performance Analysis
The cost model used for load balancing and communication optimization needs information about the load and communication requirements of all the parallel processes. The determination of the load and communication costs of a process is problematic, because it is not easy to calculate the load a process will produce. There are basically two ways to determine the load. One is to calculate the theoretical time complexity of a process, the other is to determine the time a process uses by actually running it. The first one is not realistic for practical use at the moment, because the automatic determination of the theoretical time complexity of an algorithm is still a research topic in itself, and only few results are known yet. The second is more accurate, but involves the problem of monitoring, i.e., the problem of profiling the time behavior of a parallel program. The load of a process changes in most cases dynamically at run time; this means that both
the computational and the communication costs of individual processes are always time dependent and may moreover be nondeterministic. For a static mapping algorithm, however, the load of a process has to be treated as a constant. As a compromise between all these influences, the actual load information for the mapping may be defined as the mean value of the expected load a process will cause. The same remarks apply, of course, to the determination of the communication costs between processes. To obtain realistic values for the load and communication costs, the monitoring tool Transputer Performance Analyzer (PFY) has been developed [40]. It is based on the principle of simulated execution of a parallel Occam program on a single processor. Any Occam program may be profiled unchanged. The resolution regarding parallelism is the outermost PAR of a program and the channels used by its component processes. The program source code is simply moved into a harness which performs the monitoring and produces on-line graphical output as well as detailed results written to file. The performance analyzer provides
the exact CPU-time of each parallel process,

the number of communications between the processes,

the number of bytes transferred along each communication channel.

It thus delivers accurate load and communication cost information based on exact measurement, i.e., it gives more accurate information than most other analyzers, which are based on idle-time monitoring. The performance analyzer is based on the interception of every context switch and uses the high-priority timer for measuring the CPU-time, resulting in the highest possible resolution. To achieve this, the process queue is manipulated such that after every user process a spy process always becomes ready for execution. The spy subsequently activates a monitor process running at high priority. Each time the monitor process is activated, it performs a
Time analysis: accumulation of the time elapsed during the last period to the process that has been active;

Channel analysis: a search over all channels used during the last period, determining the number of communications and the number of bytes transferred.

A minimal sketch of this bookkeeping is given below.
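The accounting performed by the spy and monitor processes can be pictured with the following Python sketch; real interception of Transputer context switches and the high-priority timer are of course not available here, so the functions only mirror the bookkeeping idea (all names are illustrative).

import time
from collections import defaultdict

cpu_time = defaultdict(float)              # accumulated CPU-time per process
comm_stats = defaultdict(lambda: [0, 0])   # per channel: [communications, bytes]
last_switch = time.perf_counter()
running = None

def on_context_switch(next_process):
    """Time analysis: credit the elapsed period to the process that was running."""
    global last_switch, running
    now = time.perf_counter()
    if running is not None:
        cpu_time[running] += now - last_switch
    last_switch, running = now, next_process

def on_channel_use(channel, nbytes):
    """Channel analysis: count communications and transferred bytes."""
    comm_stats[channel][0] += 1
    comm_stats[channel][1] += nbytes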
Figure 10 demonstrates the basic idea of the performance analysis. The analysis results are delivered at Occam source code level, which is achieved by analyzing the debug information produced by the compiler.
Figure 10: Queue manipulation for profiling a program

References

[1] G. Bell. The Future of High Performance Computers in Science and Engineering. Comm. of the ACM, 32(9):1091-1101, September 1989.
[2] J. Markoff. A technology once considered dubious is now the wave of the future. The New York Times, 1990.

[3] W. Myers. Massively parallel systems break through at Supercomputing 90. IEEE Computer, pages 121-126, January 1991.

[4] J. Dongarra, A. H. Karp, K. Kennedy, and D. Kuck. 1989 Gordon Bell Prize. IEEE Software, pages 100-110, May 1990.

[5] Cray Research, Inc. 1990 Gigaflop Performance Award Program. Call for Entries, March 1990.

[6] W. M. Bulkeley. Parallel Supercomputers are Catching on Rapidly. Wall Street Journal, January 1991.

[7] V. P. Bathkar. Parallel Computing: An Indian Perspective. In H. Burkhart, editor, Proceedings CONPAR 90 - VAPP IV, volume 457 of Lecture Notes in Computer Science, pages 10-25. Springer, 1990.

[8] The QCD Teraflop Project. A Proposal by the QCD Teraflop Collaboration, October 1990. Submitted to the U.S. Department of Energy.

[9] H. Satz. Teraflops for Europe. Europhys. News, 22:64, 1991.

[10] R. W. Hockney and C. R. Jesshope. Parallel Computers 2. A. Hilger, Bristol, second edition, 1988.

[11] M. J. Quinn. Designing efficient algorithms for parallel computers. McGraw-Hill, New York, 1987.

[12] G. C. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors I. Prentice-Hall, Englewood Cliffs, 1988.

[13] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, Englewood Cliffs, 1985.

[14] A. J. R. G. Milner. A Calculus of Communicating Systems. LNCS 92. Springer, Berlin, 1980.

[15] N. Carriero and D. Gelernter. Linda in Context. CACM, 32(4), 1989.
[16] P. G. Kropf. On the Suitability of the CHILL Primitives to Express Parallel Algorithms. In Proceedings of the 3rd CHILL Conference, 1984.

[17] D. Pritchard, C. R. Askew, D. B. Carpenter, I. Glendinning, A. J. G. Hey, and D. A. Nicole. Parallel Architectures and Languages. In R. J. Elliot and C. A. R. Hoare, editors, Lecture Notes in Computer Science, 258. Springer, 1987.

[18] R. G. Babb II, editor. Programming Parallel Processors. Addison-Wesley, 1988.

[19] A. H. Karp. Programming for Parallelism. Computer, 20(5):43-57, May 1987.

[20] G. R. Andrews and F. B. Schneider. Concepts and Notations for Concurrent Programming. Computing Surveys, 15(1):3-43, March 1983.

[21] E. F. Gehringer, D. P. Siewiorek, and Z. Segall. Parallel Processing: The Cm* Experience. Digital Press, 1987.

[22] E. Gabber. VMMP: A Practical Tool for the Development of Portable and Efficient Programs for Multiprocessors. IEEE Trans. Parallel and Distr. Syst., 1(3), July 1990.

[23] J. J. Dongarra and D. C. Sorensen. SCHEDULE: Tools for Developing and Analyzing Parallel Fortran Programs. In L. H. Jamieson, D. B. Gannon, and R. J. Douglass, editors, The Characteristics of Parallel Algorithms, pages 363-394. The MIT Press, 1987.

[24] H. Burkhart and R. Millen. Performance-Measurement Tools in a Multiprocessor Environment. IEEE Trans. on Computers, 38:725-737, 1989.

[25] T. Bemmerl. The TOPSYS Architecture. In H. Burkhart, editor, Proceedings CONPAR 90 - VAPP IV, volume 457 of Lecture Notes in Computer Science, pages 732-743. Springer, 1990.

[26] J. E. Boillat, N. Iselin, P. G. Kropf, S. Messerli, and J. Schneider. MARC: MApping, Routing and Configuring, User Manual. Technical Report IAM-PR-89333, IAM, University of Berne, 1990.

[27] E. Bonomi, M. Fluck, R. Gruber, R. Herbin, S. Merazzi, T. Richner, V. Schmid, and C. T. Tran. ASTRID: A Programming Environment for Scientific Applications on Parallel Vectorcomputers. In J. T. De Vreese and P. E. van Kamp, editors, Scientific Computing on Supercomputers II. Plenum Press, 1990.

[28] A. Kolawa. Hypercube Architectures and Applications. In K. M. Decker, editor, Proceedings of the 3rd Graduate Summer Course on Computational Physics, Parallel Architectures and Applications, Crêt-Bérard (Puidoux), 1991. To be published.

[29] K. M. Decker, C. Jayewardena, and R. Rehmann. Libraries and Development Environments for Monte Carlo Simulations of Lattice Gauge Theories on Parallel Computers. In A. Tenner, editor, Proceedings of the Europhysics Conference on Computational Physics, Amsterdam, The Netherlands, September 10-13, 1990, pages 316-321. World Scientific Publ. Co., 1991.
[30] K. M. Decker and R. Rehmann. An Integrated Development and Run-time Environment System for Distributed Memory Parallel Computers. In U. M. Heller et al., editors, Proceedings of the International Conference on Lattice Field Theory 90, Tallahassee, USA, October 8-12, 1990, Nucl. Phys. B (Proc. Suppl.) 20 (1991), pages 153-156. North-Holland, 1991.

[31] K. M. Decker and R. Rehmann. A C Cross-Compiler for Applications of Master-Slave Type on Distributed Memory Parallel Computers. Technical Report IAM-91-017, IAM, University of Berne, 1991.

[32] K. M. Decker. A Systematic Investigation of Data Decomposition Strategies for Grid-Type Applications. Technical report, IAM, University of Berne, 1991.

[33] K. M. Decker and R. Rehmann. Libraries for Monte Carlo Simulations of Lattice Gauge Theories on Distributed Memory Parallel Computers. Technical report, IAM, University of Berne, 1991.

[34] Meiko Ltd., Bristol. CSTools - Communicating Sequential Tools, 1990.

[35] K. M. Decker and R. Rehmann. SPADE - Scientific Program and Application Development Environment. I. Functionality of the Prototype. Technical report, IAM, University of Berne, 1991.

[36] J. E. Boillat, P. G. Kropf, and K. Wyler. Crossbar-Programmierung in einem Transputer-Netzwerk (Crossbar programming in a Transputer network). Technical Report IAM-PR-89334, IAM, University of Berne, 1990.

[37] J. E. Boillat, P. G. Kropf, D. Chr. Meier, and A. Wespi. An Analysis and Reconfiguration Tool for Mapping Parallel Programs onto Transputer Networks. In T. Muntean, editor, OUG-7: Parallel Programming of Transputer based Machines, Amsterdam, 1988. I.O.S.

[38] INMOS, Englewood Cliffs. OCCAM 2 Reference Manual, 1988.

[39] INMOS, Englewood Cliffs. Transputer Reference Manual, 1988.

[40] TNT - Parallel Computing Support, Berne. TNT - PFY Reference Manual, 1990.

[41] G. De Pietro. An Environment for Transputer CPU Load Measurements. In H. Zedan, editor, OUG-13: Real-Time Systems with Transputers, Amsterdam, 1990. I.O.S.

[42] D. Chr. Meier. Automatische Konfiguration von kommunizierenden sequentiellen Prozessen (Automatic configuration of communicating sequential processes). Master's thesis, IAM, University of Berne, 1990.

[43] J. E. Boillat. Load Balancing and Poisson Equation in a Graph. Concurrency: Practice and Experience, 2(4), 1990.

[44] J. E. Boillat. Distributed Load Balancing and Random Walks in Graphs. Technical Report IAM-90-013, IAM, University of Berne, 1990. Submitted for publication.

[45] H. J. Landau and A. M. Odlyzko. Bounds for Eigenvalues of Certain Stochastic Matrices. Linear Algebra and its Applications, 38:5-18, 1981.
[46] L. Mugwaneza, T. Muntean, and I. Sakho. A Deadlock Free Routing Algorithm with Network Size Independent Buffering Space. In H. Burkhart, editor, CONPAR 90 - VAPP IV. Springer, 1990.