Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995

Software Reuse and Portability of Parallel Programs*

Helmar Burkhart ([email protected]) and Stephan Gutzwiller
University of Basel, Informatics Department, CH-4056 Basel, Switzerland

Abstract

The state-of-the-art of programming parallel computers is far from being successful. The main challenge today is therefore the development of techniques and tools that improve the programmer's situation. Software reuse and software portability are two research areas where further progress is essential. We present an approach that is applicable to compute-intensive programs with regular process topologies and execution patterns. After a short introduction, we summarize the Basel Algorithm Classification Scheme, which is the basis of all our implementation parts, and present three sample algorithms. We refine these concepts towards a formal description language and introduce the prototype skeleton generator, which produces C source code for different parallel virtual machines. We conclude with a description of the state of the project and related work.

1 Introduction

Computer power is a precondition for tackling scientific problems. The term Grand Challenges has been coined for the variety of applications that need to be solved, and the discussion has sometimes been reduced to the question of who will have the first teraflop computer. However, this approach has already been criticized [1]. We believe that the real grand challenge is to develop concepts that enable programmers to use parallel systems in a more productive way. Today, software development for parallel systems is still in its infancy. Programs are written by using low-level concepts, which results in software that is usually neither portable, reusable, nor maintainable [2]. Parallel processing has been neglected by the informatics community for a long time, and de-facto standards have been defined by people from application fields. To overcome this software dilemma, sound software engineering techniques need to be introduced into the field of parallel processing. Structured programming has been discussed by informaticians for many years. It is time to transfer successful software production techniques to parallel processing!

What we need is progress towards the solution of the big P challenges: Programmability - Portability - Performance. We would like to have programmability at a high level of abstraction, to get correct and maintainable software. However, this influences performance, because high-level constructs always carry a danger of performance loss. We would also like to have portable software, to minimize software changes when the computer environment changes. Again, a very high level of portability is a source of inefficiencies. In conclusion: these three attributes can only be optimized together. It is one of the main challenges today to propose methodologies and languages at an abstraction level which still provides the performance that users expect from parallel systems.

The software dilemma mentioned above has been tackled from the system side. High-level interfaces, so-called parallel virtual machines, have been defined in order to hide the existence of the different operating systems and architectures. As we said, such a parallel machine model must be general enough to be implementable on a set of machines, but must not be too abstract, in order to preserve efficiency. Several parallel machine models have been proposed during the past years: PARMACS, EXPRESS, PICL, PVM,

*This research is funded by the Swiss National Science Foundation, research grant SPP IF 5003-34357: “Skeleton-Oriented Programming”.


Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS'95) 1060-3425/95 $10.00 © 1995 IEEE


MPI, and p4 are a subset of the models mentioned in the respective literature ([3] presents an up-to-date list). Most of these models are very similar because they address the same type of parallel computers, so-called message-passing systems. In the near future, certain standards will offer support both in libraries and tools. For that reason, our research work is based on such virtual machines, because software portability is among the most important questions where solutions are overdue.

Figure 1: Portability and Reusability. Our approach: binding algorithmic abstractions (the Basel Algorithm Classification Scheme) to virtual parallel machines.

2 The Basel Approach to Master Aspects of Parallelism

One challenge of software development is the software factory, i.e. the availability of building blocks that may be reused. We cannot expect complete application programs to be reusable, because even in the same problem domain there are always slight differences, e.g., different input/output formats. We can, though, expect to get reusable components for the coordination part of a parallel program. But why not complete reusability? While concurrent programming languages offer operations that allow the construction of rather irregular program structures, real-world applications are quite regular regarding the process structure. Our current research has shown that only a small number of coordination schemes are used in algorithms and applications. On massively parallel systems, this trend is even more pronounced, because hundreds or thousands of different processes cannot possibly be managed individually. The percentage of code for the coordination part is definitely much smaller compared to the computation part; but, as already mentioned, it is the crucial part. It is, therefore, this portion of code that we will make reusable by providing a library of algorithmic skeletons.


Our research work combines aspects of portability and algorithmic reusability. Figure 1 sketches the chosen research direction. Existing architecture and operating-system concepts are hidden by virtual machine layers. At the other end, parallel programs and applications can be described and classified regarding their parallel-processing attributes (the BACS layer). Our research work improves the programmer's situation, because we support the most common parallel computation schemes on the most widely used virtual machine models and computation languages. Many research projects emphasize the need for more "programming tools". There is no doubt that tools are needed to develop parallel processor software; however, we miss methodological aspects, such as common terminology, algorithmic descriptions and classifications, and concepts for addressing the software problems. In fact, tools easily become obsolete when new generations of parallel systems are announced.

Figure 2: The Basel Approach

Fig. 2 shows the overall research strategy. We prepare an environment in which programmers (how to program?), benchmarkers (how to evaluate?), and students (how to learn?) use common system components. Elements identified are:

- BACS, the vocabulary and methodology for describing parallel algorithms and programs.

- BALI, a library of algorithmic cores. Formal descriptions, called scripts, describe all relevant elements of parallel programs.

- TINA, the program generator, which produces source code skeletons for different target systems and programming languages.

Both programmers and benchmarkers benefit from the skeleton approach. By using our framework, skeletons can be completed towards complete application programs (left side) or towards synthetic benchmark suites by adding artificial loads (right side). It is TINA that acts as a portability platform; it is BALI and the script notation that offer reusability of program constructs or complete skeleton code.

3 BACS and Three Sample Algorithms

While several classifications for computer architectures have been proposed, and some of those are really used (for example Flynn's taxonomy [4]), widely accepted classifications for parallel algorithms do not exist. For several months, researchers at the Basel Parallel Processing Laboratory have been compiling the Basel Algorithm Classification Scheme BACS [5], a framework for the description and classification of parallel algorithms, upon which an environment for structured programming of parallel systems can be built.

3.1 BACS Summary

Parallel algorithms can be regarded as a collection of processes consisting of different building blocks:

- Process and Data Management (daemon call): Services that belong to this category are process administration (creation and deletion), as well as data binding. All these operations are managed by an entity we call the daemon; daemon functions can be implemented at the programming level, the operating-system level, and the hardware level, respectively.

- Interaction: Interactions define the coordination of the processes at runtime. Coordination operations are used for data exchange, signalling of events, and for consistency purposes.

- Calculation: These consist of all statements that do computations; for example arithmetic operations, bookkeeping operations, and local data access.

- Control structures: Finally, the three previous items are controlled by sequence, loop, and conditional statements, respectively.

In BACS we refine this list and end up with a generic description tuple which fully characterizes a parallel algorithm (Figure 3). The tuple itself is structured into process attributes, data attributes, and interactions. Most of the tuple elements are described and structured by hierarchical representations that are extensible. As we will see later, concrete algorithms are represented by tuples for which all fields are filled with corresponding items.

Figure 3: Tuple for the classification of parallel algorithms (process properties: structure, topology, execution structure; data attributes: partitioning, placement; interaction)

- Process structure: We distinguish between static and dynamic process structures. An algorithm is called static if the daemon is only called at the beginning (to create all processes) and at the end (to delete them), but not during the execution of the actual algorithm. Today, static algorithms dominate the field of numerical applications.

- Process topology: This component defines the geometric structure of the process set. We concentrate on regular topologies, which split into homogeneous topologies (all processes have identical instructions) and inhomogeneous ones (at least two processes have different operations; e.g., a master-worker relationship).

- Execution structure: While the two previous elements define static process attributes, this component describes the dynamic behaviour. We are only interested in regular execution patterns (multiphase structures).

- Interactions: The purpose of interactions has already been mentioned. BACS identifies direct interactions, which link exactly two process nodes, and global interactions, which involve several processes that coordinate their activities by means of anonymous synchronization elements (e.g., mailbox) or signals to process groups (e.g., broadcast).

- Data partitioning: For distributed data it is important to know how they are partitioned. Arrays, for example, can be partitioned into elements, rows, columns, and blocks, respectively.

- Data placement: This component defines the binding of processes and data. Local data are private to a process, while non-local data can either be global (all processes can have access) or interaction data (data explicitly used for interactions), and can be statically or dynamically placed.

3.2 Warshall Algorithm

A frequent problem in graph theory is to compute the transitive closure of a directed graph. The transitive closure indicates for each vertex (node) which other vertices can be reached. In graph theory such graphs and their transitive closures are represented by square matrices with Boolean values, A and T respectively. A is called the adjacency matrix, T its transitive closure.

The value 1 in the i-th column of the j-th row of A indicates a direct edge from node j to node i. For the transitive closure matrix T, the value 1 indicates just "reachability" but not a direct connection. The Warshall algorithm transforms A to T by the application of the transitivity rule: if there is a link from i to j, and a link from j to k, there is also a link from i to k. This rule is systematically applied by testing all intermediate nodes. In a parallel version, all processes hold a set of rows of A and consecutively send one of those rows, the test row, to the other processes. Then every row with a connection to the test row gets all connections of the test row. In

our example all transformations are done on the single matrix A, which means that A holds the adjacency matrix at the beginning, but the transitive closure at the end. The tuple information of the Warshall algorithm is given in Figure 4.

Figure 4: Classification tuple of the Warshall algorithm (structure: static; topology: Worker; execution structure: Fix_w(Bc^t; C^t); data: A[n,n], (1,p), (1,p), (None, Block))

As the process structure does not change during execution, it is static. The topology used is a set of worker processes, interacting with each other by means of a broadcast facility. With the basic operations Calculation, Interaction, and Daemon Calls identified in section 3.1, and the basic structures Sequence, Loop, and Selection for composition, the execution structure of the Warshall algorithm may be written as

Fix_w(Bc^t; C^t)

i.e. a vector will be sent by one of a set of processes to all the others (broadcast interaction Bc^t), then a calculation will be performed by every process (C^t). (The superscript t refers to the totality of an action, i.e., all processes participate in the action. If an action is only executed by a subset of the processes, a superscript p is used instead.) This happens a fixed number of times, and the processes form a uniform set of workers for this action (Fix_w - the subscript w refers to the set of workers). The data used is the adjacency matrix, which is partitioned into rows. These are placed at the beginning by the daemon (static placement) on the worker processes. Partitioning and placement are specified using several parameters: A[n,n] represents the n x n adjacency matrix. "(1,p),(1,p)" is the partitioning specification, "(None, Block)" the distribution specification. As a result, row blocks are distributed among p processes.

3.3 LU-decomposition

The LU-decomposition algorithm transforms a nonsingular matrix A into a lower triangular matrix L and an upper triangular matrix U, such that

A = L x U

If L and U are known, systems of linear equations can be solved more efficiently. The tuple information of the LU algorithm is given in Figure 5. Like the parallel Warshall case, the LU algorithm is organized as a worker algorithm. If you compare the tuple entries of Figures 4 and 5, you find strong similarities. The only differences are the cyclic distribution of the matrix for LU vs. a fixed blocking for Warshall, and the slightly different execution structures. In fact, such similarities between algorithms are the driving force for our reusability approach, because both algorithms have similar program cores.

Figure 5: Classification tuple of LU-decomposition (structure: static; topology: Worker; data: A[n,n], (1,p), (1,p), (None, Cyclic))

3.4 Systolic matrix multiplication

Systolic algorithms are well-known candidates for massively parallel execution. Our sample algorithm makes use of a blockwise distribution of the two input matrices A and B, and the output matrix C, on a process grid. This time, the execution structure consists of several parts: pre-rotation of the A matrix, pre-rotation of the B matrix, and the systolic part, which is a fixed loop consisting of the calculation and the rotation by one position of both A and B.

The tuple information we have presented provides a first, coarse-grain view of an algorithm. However, it is too informal to be usable as input for a program generator tool. Therefore, in the next section we will refine these concepts towards a formal description language.

4 Skeleton Generator TINA

We have designed and implemented a skeleton generator, called TINA, that supports both reusability

and portability of parallel program components. As a programmer you have to provide TINA with the following information:

- The algorithmic core of the program, by using a script input notation.

- The virtual machine and the programming language for which code shall be generated.

Figure 6: Classification tuple of systolic matrix multiplication (structure: static; topology: Torus(q,q); execution structure: rotations followed by Fix(C^t; Rot, Rol); data: A, B, C[n,n], (q,q), (q,q), (Block, Block))

4.1 Input script

The input script is structured into three parts:

General information: The first part reports on

- Name of the algorithm
- Virtual machine
- Programming language
- Process topology
- Interaction mechanisms

For the Warshall example of section 3.2 we provide the following sample information:

    NAME:             "Warshall algorithm";
    MACHINE:          EXPRESS;
    LANGUAGE:         C;
    PROCESS_TOPOLOGY: STATIC Worker(p);
    INTERACTIONS:     Broadcast;
    %%

Figure 7: Part 1 of script

Declarations: Part 2 of the script consists of several algorithm-dependent declarations:

- Data types
- Variables
- Function prototypes

It is important to notice that for each topology several variables are introduced implicitly. For instance, for a mesh topology the variables p (total number of processes), px/py (number of processes for one row/column), me (private process number), north/east/south/west (number of the process in the relevant direction), etc. are defined.

Part 2 of the Warshall example introduces the adjacency matrix A of type graph, the line buffer line for a matrix line to be broadcast, and some local variables. For the matrix variable, the partitioning and distribution information is given according to the BACS classification tuple. Two functions are declared that will later be filled in by the programmer: Prepareinteraction and Localcalc (the application of the transitivity rule).

    TYPES:                 graphcomp graph(n,n);
                           graphcomp buffer(n);
    GLOBAL_VARIABLES:
    DISTRIBUTED_VARIABLES: graph A STATIC(NONE,BLOCK,1,p,1,p)\INOUT;
    INTERACTION_VARIABLES: buffer line; long who;
    LOCAL_VARIABLES:       buffer best_buffer;
    FUNCTION_PROTOTYPES:   Prepareinteraction(POINTER,POINTER);
                           Localcalc();
    %%

Figure 8: Part 2 of script

Execution structure: Finally, the third part, called the frame, defines the overall execution structure. For Warshall there is a For-loop in which each row is


broadcasted to all other processes such that they can execute their local transitive closure computations in parallel.

    FRAME:
      FOR (i = 0; i < n; i++) DO
        INTERACTION
          Prepareinteraction(best_buffer, who);
          Broadcast(best_buffer, who)
        END;
        CALCULATION
          Localcalc()
        END
      ENDFOR.

Figure 9: Part 3 of script

The availability of language constructs for reusing regular process topologies, data distributions, and execution patterns is the first step towards enhancing programmer productivity. But we also claimed portability. It is the skeleton generator TINA that acts as the portability platform.

4.2 TINA Basics

Templates: The basic design decision has been to build up a support library in which parametrized text pieces (called templates) are collected. TINA concatenates these templates according to the input script specification and substitutes all parameter information. Fig. 10 gives a pictorial view of the processing. In the first step, all parameter information is generated by analyzing the script. In the second step, the relevant templates are imported, filled with the parameter data, and the program skeleton is generated.

Figure 10: TINA execution flow

It is important to notice that TINA has been designed to support several virtual machines (and programming languages). Thus, the template library has a corresponding hierarchical structure. For each programming language supported, there is a subdirectory with interface routines to virtual machines. Again, for each machine supported there are subdirectories that contain the corresponding templates.

TINA generates a skeleton program which is split into several files. The prototype implementation is based on a host-node model, so separate files for the host and the nodes, as well as common data files, are generated.

As the programmer has to provide the algorithm-specific part (e.g., calculations), it is essential to guide him/her during this process. Therefore, a detailed report is generated that tells the programmer where to fill in all missing information. Fig. 11 shows the report for our Warshall example.

Because of space restrictions we cannot reproduce the generated skeleton code here. TINA generates a skeleton file consisting of 430 lines of C code. The programmer has to write just another 24 lines of application-dependent source code in order to get an executable version of the Warshall example. If the programmer wants to switch to a PVM environment, for instance, it is sufficient to replace "EXPRESS" by "PVM" in part 1 of the script. All other modifications can be done automatically by TINA, and C skeleton source files for PVM are generated.

TINA has been built by using several programs of the Unix tool set (lex, yacc, indent, grep, and awk). The size of all TINA modules is around 3000 source lines.

[Figure 11 shows the report TINA generates for the Warshall example: the name, date, and time of the run; the renamings of global, distributed, and local variables (e.g., p -> p_GL, n -> n_GL, A -> A_DB, me -> me_LO, sender -> sender_LO, best_buffer -> best_buffer_LO); and the changes the user has to make, i.e., the files and line numbers where type definitions and the bodies of the procedures Prepareinteraction, Localcalc, graph_Init, and graph_Collect must be filled in (nprocs.c, udata.c, udata.h).]

Figure 11: Generated report

4.3 BALI entries

BALI is a hypercard-like database of sample parallel algorithms. Textual description, BACS classification, and the script program are accessible by different user groups. The set of examples is small but growing.

The TINA script for the LU-decomposition algorithm is very similar to the Warshall example. As the BACS tuples already indicated, only the execution pattern and the data distribution are slightly different. Just to show another example, Fig. 12 shows the systolic matrix multiplication script.

5 Final Remarks

We have reported on a project that is still under development. The TINA prototype described has been successful in delivering early experience regarding the practicability of the approach (for further details see [6]). Within the Swiss Priority Programme for Informatics we have linked our system to an environment where more application-specific descriptions are used [7].

This project puts particular emphasis on structured programming concepts. While the script notation has been a straightforward way to express BACS concepts, it lacks several accepted principles of language design. We are currently implementing a compiler for a MODULA-2 based coordination language we designed, which again provides the functionality of a skeleton generator.

All major High Performance Computing initiatives currently emphasize the importance of cost-effective software production for parallel systems. In fact, we think that this is the real Grand Challenge, because today's techniques and tools are at a rather poor level. We have presented concepts and a research direction towards structured programming, portability, and software reusability in parallel processing environments.

6 Related Work

There are several streams of research that relate to our work:

- Skeleton-oriented programming has been introduced by Cole [8]. In the field of functional programming this approach is already quite popular [9,10,11]. For procedural language environments, the P4 methodology (P3L language and P3M machine model) developed at the University of Pisa addresses portability and abstract machine issues and has similarities with our work [12].

- PCN and Strand are coordination languages that provide compositionality of parallel programs [13,14].

- Scientific modelling languages include constructs for building process topologies and interactions [15,16].

- While most of the virtual machines ignore the data aspect, the VMMP environment [17] as well as the High Performance Fortran Forum discuss data partitioning and data distribution primitives.

- Portability has been addressed in several Esprit projects, such as PPPE, GM-MIMD, GENESIS, and PUMA.

- Reusability is addressed in many object-oriented approaches (see [18], for example).

- Hansen has published a set of examples that serve as model programs for parallel programming [19].

    NAME:             "systolic matrix multiplication";
    MACHINE:          EXPRESS;
    LANGUAGE:         C;
    PROCESS_TOPOLOGY: STATIC Torus(q, q);
    INTERACTIONS:     Rot, Rol;
    %%
    TYPES:                 matrixcomp matrix(n, n);
    GLOBAL_VARIABLES:
    DISTRIBUTED_VARIABLES: matrix A, B STATIC(BLOCK, BLOCK, q, q, q, q)\IN;
                           matrix C STATIC(BLOCK, BLOCK, q, q, q, q)\OUT;
    INTERACTION_VARIABLES: long dist, iter;
    LOCAL_VARIABLES:
    FUNCTION_PROTOTYPES:   systmultiply(POINTER, POINTER, POINTER);
    %%
    FRAME:
      /* preparation of data */
      INTERACTION
        dist = (q - mex) - 1; Rol(A, dist);
        dist = (q - mey) - 1; Rot(B, dist);
      END;
      /* systolic multiplication */
      FOR (iter = 0; iter < q; iter++) DO
        CALCULATION
          systmultiply(A, B, C)
        END;
        INTERACTION
          dist = 1; Rol(A, dist); Rot(B, dist);
        END;
      ENDFOR.

Figure 12: Systolic matrix multiplication script

Acknowledgements

The list of past and present laboratory members includes Helmar Burkhart, Carlos Falcó Korn, Niandong Fang, Robert Frank, Stephan Gutzwiller, Guido Hächler, Walter Kuhn, Peter Ohnacker, Jean-Daniel Pouget, Gerald Pretot, and Stephan Waser.

References

[1] Gordon Bell. Ultracomputers: A Teraflop Before Its Time. Science, April 1992.

[2] J.E. Boillat, H. Burkhart, K.M. Decker, and P.G. Kropf. Parallel Programming in the 1990's: Attacking the Software Problem. Physics Reports (Review Section of Physics Letters), Vol. 207 (3-5), 141-165, 1991.

[3] Parallel Computing. Special Issue on Message Passing Interfaces, Vol. 20, No. 4, April 1994.

[4] M. Flynn. Very high-speed computing systems. Proc. IEEE, 54, No. 12, 1901-1909, 1966.

[5] Helmar Burkhart, Carlos Falcó Korn, Stephan Gutzwiller, Peter Ohnacker, and Stephan Waser. BACS: Basel Algorithm Classification Scheme. Technical Report 93-3, Institut für Informatik, University of Basel, Switzerland, March 1993.

[6] Stephan Gutzwiller. Methoden und Werkzeuge des skelettorientierten Programmierens. PhD thesis, University of Basel, 1994 (in German).

[7] K. Decker, J. Dvorak, and R. Rehmann. A Knowledge-Based Scientific Parallel Programming Environment. IFIP Working Conference WG10.3 on Programming Environments for Massively Parallel Distributed Systems, 1994.

[8] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press, Cambridge, Massachusetts, 1989.

[9] J. Darlington, A. Field, P. Harrison, P. Kelly, D. Sharp, and Q. Wu. Parallel Programming Using Skeleton Functions. Proc. 5th Int. Conf. PARLE, 146-160, Springer, 1993.

[10] Workshop on "Abstract Machine Models for Highly Parallel Computers", Leeds, April 1993. Oxford Univ. Press.

[11] Conference on "Massively Parallel Programming Models", Berlin, September 1993.

[12] B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, and M. Vanneschi. P3L: A Structured High-Level Parallel Language, and its Structured Support. TR-36/93, Dept. of Informatics, University of Pisa, Dec 1993.

[13] Ian Foster and Carl Kesselmann. Language Constructs and Runtime Systems for Compositional Parallel Programming. Proc. of CONPAR 94 - VAPP VI (B. Buchberger and J. Volkert, Eds.), LNCS 854, Springer, 5-16, Sep 1994.

[14] Ian Foster, Robert Olson, and Steven Tuecke. Productive Parallel Programming: The PCN Approach. Scientific Programming, Vol. 1, 51-66, 1992.

[15] R. Francis, I. Mathieson, P. Whiting, M. Dix, H. Davies, and L. Rotstayn. A Data Parallel Scientific Modelling Language. J. of Parallel and Distributed Computing, 21, 46-60, 1994.

[16] J.L. Dekeyser, D. Lazure, and Ph. Marquet. A geometrical data-parallel language. ACM SIGPLAN Notices, Vol. 29, No. 4, 31-40, April 1994.

[17] E. Gabber. VMMP: A Practical Tool for the Development of Portable and Efficient Programs for Multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 1(3), July 1990.

[18] G. Alverson and D. Notkin. Program Structuring for Effective Parallel Portability. IEEE Trans. on Parallel and Distributed Systems, Vol. 4, No. 9, 1041-1059, Sep 1993.

[19] P.B. Hansen. Model Programs for Computational Science: A Programming Methodology for Multicomputers. Concurrency - Practice and Experience, Jan 1993.