AN ENVIRONMENT FOR STRUCTURED PARALLEL PROGRAMMING

B. Bacci, B. Cantalupo, M. Danelutto, S. Orlando, D. Pasetto, S. Pelagatti, M. Vanneschi

Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56125 Pisa, Italy, tel +39 50 887228, fax +39 50 887226
Abstract

P3L is a parallel coordination language based on the emerging research on skeletons (templates). The design of P3L began in 1990, just before skeletons started to interest a large part of the parallel processing community, and since then we have developed a prototype P3L compiler based on an innovative template-based organization. During the last few years we have experimented with the prototype P3L environment and performed a set of tests to verify different results: the suitability of P3L for massively parallel programming, its efficiency in generating “good” parallel code, the performance achieved with respect to traditional parallel programming languages and its portability over different parallel architectures. This paper summarizes the experience gained by our group with the “P3L experiment”.
1. Introduction

The main problem that restricts parallel architectures to a limited number of applications is that programming such machines is very difficult, time consuming and error prone. Moreover, the state of the art in parallel languages is still not able to ensure both programmability and performance for parallel programs. The major differences among existing parallel machines make efficient programming of a parallel application in a portable way even more difficult, and often impossible. Tools should help programmers to write efficient and portable applications, but unfortunately building such tools is very difficult if programs are written according to a general parallel computational
model in which processes can interact according to any arbitrary communication or synchronization pattern [1]. This is mainly due to two key factors. First, the automatic solution of problems such as mapping, scheduling and load balancing turns out to be an NP-complete (and in some cases even non-approximable) problem. Second, the portability of a parallel application demands hardware-specific code to be inserted in the program in order to obtain good efficiency [2, 3, 4].

Among the proposed solutions for effective parallel software development, an attractive one is to adopt a restricted parallel computation model, that is, a framework in which the parallel structure of the computation is built by composing a restricted set of basic forms of parallelism [1]. Adopting a restricted computational model allows us to build tools without paying an excessive fee in terms of performance and/or portability. This approach is feasible because most parallel applications exhibit the same parallel structure, exploiting the same paradigms of parallelism, such as data-parallel, task-farms, pipelines and so on [5, 6]. Moreover, having a restricted set of parallel interaction structures, we can devise good implementation templates for each pattern of parallelism on each different parallel machine. An implementation template is a parametric network of processes along with a performance model. The network records in a parametric way strategies for mapping, load balancing and so on, in such a way as to obtain a tuned implementation of a single basic form of parallelism on a specific architecture. Performance models are formulas which describe the performance of the network on each machine; performance formulas can be used to optimize both the efficiency and the performance of programs [7]. In this way, we can achieve parallel program portability while ensuring good performance over a large set of parallel systems. In fact, a tool designer has a deep knowledge of each target machine and concentrates on the performance of the basic forms of parallelism, building the template library used by the compilers. An application programmer, on the other hand, can use a high level language independently from any specific architecture and can concentrate on algorithms rather than on their low-level implementation. This approach (known as the skeleton/template based approach) has been used by just a few recently proposed libraries and languages; examples are Cole and Darlington’s skeletons for parallel functional programming [2, 3, 8] and the P3L language and compiling tools developed at the University of Pisa [4, 9]. In this article, we report the experience gained in the Pisa Parallel Programming Project during the testing phase of our compiling tools.

2. The P3L Language

The main goal in the whole P3L design and implementation process was to verify whether programmability, portability and performance could be achieved by allowing the user to express parallel computations using a restricted set of parallelism exploitation patterns or skeletons. These patterns, parallel constructs in our terminology, are supplied in the form of primitives of the language. In this sense, P3L belongs to the class of restricted computational models for parallel computation, according to Skillicorn [1]. P3L provides the user with a set of parallel constructs. Each parallel construct models a well known form of parallelism exploitation (farm, loop, map, reduce, etc.). Constructs are primitive statements of the language and can be hierarchically composed.
The peculiar feature of P3L parallel constructs is that they are very high level, in that they abstract from all those features an MPP programmer normally has to deal with, such as communications, process graphs, scheduling, load balancing, etc. Unlike other parallel programming languages/environments, P3L does not allow the programmer to use low level constructs such as send/receive pairs, parallel commands, explicit scheduling primitives, etc.

P3L comes with its own set of compiling tools [9, 10]. The P3L compiler takes a P3L source program and produces an executable that is built out of a set of sequential processes, each containing calls to a communication library. In addition, the compiler solves the process-to-processor mapping. In the P3L compiling tools, each parallel construct in the language is supplied with a set of implementation templates that efficiently implement it onto a given target machine. Like the parallel constructs, implementation templates can be nested, and their representation within the compiler is parametric with respect to the amount of resources needed. The P3L compiling tools navigate the parallel construct tree of the application that has to be compiled, to associate an implementation template with each parallel construct. During this process, a high level, machine independent source program is transformed into a new parallel program, namely a set of processes written in C and containing calls to a communication library. This program can be directly compiled and executed on the target architecture, using state-of-the-art C compilers. The choice of the C programming language has been completely incidental; actually, any other sequential programming language could have been used.

The P3L compiling tools have knowledge of the features of the target machine they are compiling for. They know about the general features of the machine (interconnection topology, processing element configuration, etc.) as well as about the costs (in terms of execution time) of the basic operations of the MPP (local scheduling of a process, communication set-up, communication completion, remote memory access, etc.). This knowledge is used to match the hardware power with the parallel application features, via the instantiation of the “free” parameters of the implementation templates. As an example, a farm template (implementing the well known farm paradigm) is composed of a set of worker processes, each entirely computing part of the input tasks [11, 12, 13]. The number of workers is parametric and there exists an ‘optimal’ number of workers which gives the best performance with the most efficient resource usage [14]. The P3L compiling tools devise the optimal number of worker processes using the available knowledge relative to the target machine features and to the average execution time of the worker code (obtained by profiling).
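As an illustration of the kind of reasoning involved (a minimal sketch, not the actual model used by the P3L compiler), consider a farm whose emitter needs T_e seconds to dispatch a task and whose workers need, on average, T_w seconds each to process one task. A simple service-time model and the resulting optimal number of workers are:

\[
  T_{\mathrm{farm}}(n) \;=\; \max\!\left(T_e,\ \frac{T_w}{n}\right),
  \qquad
  n_{\mathrm{opt}} \;=\; \left\lceil \frac{T_w}{T_e} \right\rceil
\]

Beyond n_opt workers the emitter becomes the bottleneck, so adding further workers wastes resources without improving the service time; in this reading, T_e comes from the machine-specific cost parameters and T_w from profiling the worker code.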
3. P3L Language constructs

A P3L program can be represented by its construct tree, which describes how the available forms of parallelism are instantiated and composed. Each tree node is labeled with a parallel construct, tree leaves are sequential code blocks and tree arcs are labeled with the type of the data that flows between constructs. The language has a macro dataflow semantics: each construct behaves as a stateless pure functional module; this choice makes it simple to transform the construct tree in order to balance the load between the available processing nodes and to obtain a process network that implements the algorithm with better performance. Program input is a stream of tasks, possibly consisting of a single input task, and the function computed by the program is applied to each element of this stream. The current implementation of the language is limited to static data structures: data exchanged between constructs must be defined statically at compile time, so that the compiler can compute their size and communication time. We are extending our compiler to allow constructs to exchange dynamically sized structures. We now introduce all the language constructs available in our prototype compiler, along with their operational meaning and a denotation, using a functional style to write down P3L parallel program structures in a very compact way. In the functional notation we use the symbol Si to denote any P3L construct.

Sequential The sequential construct represents the leaves of the construct tree and contains the sequential code blocks of the application. The construct works in a pure functional way, applying its sequential body to each input data item, which represents a task to be executed, and outputting a result. It is not allowed to keep an internal state. In this document, we denote a sequential module i by Si.
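As a purely illustrative sketch (the concrete P3L syntax for declaring sequential modules is not shown in this paper, so the types and the function signature below are our own assumptions), a sequential module behaves like a pure C function from one task to one result, with no state kept between invocations:

/* Hypothetical shape of a sequential module body: one input task in,
   one result out, no internal state surviving between calls.          */
typedef struct { double sample[64]; } task_t;     /* assumed task type   */
typedef struct { double value; } result_t;        /* assumed result type */

result_t seq_module(const task_t *in)
{
    result_t out;
    double acc = 0.0;
    for (int i = 0; i < 64; i++)
        acc += in->sample[i];
    out.value = acc;              /* the result depends only on the input */
    return out;
}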
Figure 1: An i-stage pipeline construct representation.

Pipe The pipe construct expresses pipeline parallelism [6]: if an algorithm can be statically divided into a sequence of operations, then we can execute different operations in parallel on different input stream elements. The construct contains i stages; each task of the input stream is passed to the first stage, then the result of the computation performed by this stage is passed to the second one while the first one starts computing the subsequent task, and so on. The structure of a program exploiting pipeline parallelism is depicted in Figure 1, where we can see i stages along with their interconnections. A pipe construct with n stages is denoted by Π(S1, ..., Sn).

Farm The farm construct expresses the parallelism among the independent computations of a function f on different data items in the input stream. The structure of a parallel program exploiting farm parallelism is shown in Figure 2. Here, we can see a scheduler process (called the emitter), the farm workers and a collector process that gathers the results. Farm constructs are denoted by F(S), where S is the farm worker module.

Loop The loop construct iterates the parallel computation of a P3L module (the loop body) on each element of the input stream. Each input task is passed to the loop body module until a termination condition is satisfied; then, the result is sent to the next construct. This behavior is shown in Figure 3, where we can see an input process, which arbitrates between new tasks coming from outside the loop and tasks which require further processing inside the loop, the body module and the termination condition check process.
Figure 2: Representation of a farm construct.
Figure 3: Representation of a loop construct.

The loop construct is denoted as L(S; C), where S is the loop body and C is a sequential module that computes the termination function.

Geometric The geometric construct expresses geometric parallelism [5, 15]: each input task is composed of an n-dimensional array and we want to iterate the application of a convolution operator on each single array element. The input task can also contain data items that should be used in the computation of the update of each array element (these are called broadcasted data). The output of a geometric construct is also an n-dimensional array. The computation phase consists of a loop of steps; each step begins with a data gathering from a set of neighboring processors (the communication pattern is called the stencil) and is followed by a local computation. Currently, both the number of iterations and the data access stencil must be known at compile time (we are improving this construct by adding support for a reduce operation every k steps to check for termination of the convolution instead of statically setting the iteration number). The geometric construct is functionally denoted by G(l; N; S; S), where l is the number of iterations, N is the set of input and output array dimensions, the first S is the data access stencil (subcube definition) and the second S is the local computation body, which is a sequential construct.
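To make the operational behavior concrete, the following C fragment is a minimal sequential sketch of what one geometric computation does, assuming (purely for illustration) a one-dimensional array, a fixed 3-point stencil with periodic borders and l iterations; the actual construct is n-dimensional and the stencil and array sizes are declared in the P3L source.

/* Illustrative only: a sequential rendering of one geometric construct,
   under the hypothetical choices stated above.                          */
#include <string.h>

#define N 16

static void geometric_step(const double in[N], double out[N])
{
    for (int i = 0; i < N; i++) {
        double left  = in[(i - 1 + N) % N];      /* gathered via the stencil */
        double right = in[(i + 1) % N];
        out[i] = (left + in[i] + right) / 3.0;   /* local computation        */
    }
}

static void geometric(double a[N], int l)
{
    double tmp[N];
    for (int step = 0; step < l; step++) {       /* l fixed at compile time  */
        geometric_step(a, tmp);
        memcpy(a, tmp, sizeof tmp);
    }
}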
Map The map construct denotes a form of parallelism which is a special case of the geometric one. In map computations, a single function is applied to each element of the input array and the computations are all independent of each other [16, 3]. Thus, we do not need a data access stencil and we do not iterate; this leads to simpler implementation templates, language constructs and performance models. The construct denotation is M(N; S), with N and S having the same meaning as in the geometric case.
Figure 4: Representation of a reduce construct.

Reduce The reduce construct implements a binary tree-like computation using an associative operator [16]. The input data item is an array and the output data item is a scalar of the same type as a single array element. The construct denotation is R(S), where S is a sequential module implementing the binary operation. The structure of a parallel program exploiting reduce parallelism is shown in Figure 4.

4. An optimizing compiler for a structured parallel language

Along with the definition of the P3L language constructs, we developed a set of machine specific implementation templates, each one equipped with an architecture independent analytical model able to predict the performance of the template on a wide range of architectures, and a template-based compiler for the P3L language. Each machine specific template is a process network parametric with respect to the contained construct, the parallelism degree (if applicable) and other internal parameters such as structure decomposition rules, etc. The performance models use a set of machine specific parameters, usually obtained using micro-benchmarking techniques, to evaluate the performance of a given template on a specific machine. The performance model is able to determine template scalability (that is, the maximum feasible parallelism degree) and to optimize internal template parameters (such as grain size) given a specific construct instance. The compiler starts from the construct tree, which describes the program, and from a machine specific template library (currently available for the CS-1, PVM and, in a partially automated way, the Cray T3D [17]). The template libraries contain customizable code of the template process networks and a suitable representation of the performance model. The
compiler produces an executable program highly optimized for the target architecture that fits into the available resources [9]. The compilation process proceeds through the following steps:

1. The source program is analyzed and the construct tree is built.

2. The program is compiled assuming an unbounded number of nodes on the target machine. The compiler examines each construct in the tree and selects an implementation template among those available for the corresponding basic form of parallelism. The selection is done by computing each template's maximum theoretical performance and choosing the one that obtains the best values. To compute the maximum theoretical performance we use the template performance model to analyze the scalability and determine the implementation saturation point, after which more resources are not used effectively (a minimal illustrative sketch of this step is given after this list).

3. The compiler tries to enhance program performance by applying a set of rewriting rules designed to increase the parallelism exploited. All these rewriting rules maintain program semantics but modify its parallel structure in order to exploit more parallelism and obtain better performance. This phase can insert new constructs (for example, farming-out a slow module) and can select different implementation templates that optimize specific construct compositions (for example, a map followed by a reduce over the same dataset can be implemented using a single template that efficiently computes the composition of the two).

4. The compiler balances the load on all the pipelines of the program by applying another set of rewriting rules. For example, it can reduce a construct's performance by reducing the number of workers, collapsing sequential processes, and so on. This phase uses the performance models to estimate the performance of each rewriting of the program. The compiler heuristically searches for a good solution and at each step tries several rewriting rules. To test a rewriting rule, the compiler builds the construct tree resulting from the application of the rule and evaluates the overall program performance. A rule is selected only if the performance vs. resource ratio leads to a better solution than the current one.

5. At this stage, the program structure is well balanced and gets the best performance the compiler is able to obtain on the current target system. Now the actual number of nodes in the target machine is taken into account. The compiler starts removing resources from the program implementation in order to obtain a number of processors small enough to run on the target. The resource removal process keeps the communication and computation load balanced between PEs. It can completely remove redundant parallel constructs (such as a farm with a single worker) or it can collapse a parallel construct into a single sequential stage (such as a map with one worker). This process uses several heuristics and ends in polynomial time [10]. Performance models are used to compare the performance of different solutions, in order to reduce resources while still obtaining good performance figures.

6. When the number of nodes required by the program fits into the target system, an executable program is produced.
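The following C fragment is a minimal, purely illustrative sketch of how a construct tree node and a performance-model-driven template choice could be represented; the names, the types and the dummy farm model below are our own assumptions and are not taken from the actual P3L compiler.

/* Illustrative sketch only: the real data structures and performance
   models of the P3L compiler are not shown in the paper.              */
#include <stdio.h>
#include <float.h>
#include <math.h>

typedef enum { SEQ, PIPE, FARM, LOOP, MAP, REDUCE, GEOM } kind_t;

typedef struct node {                 /* one node of the construct tree   */
    kind_t       kind;
    int          n_children;
    struct node *children[8];         /* hierarchical composition         */
    double       t_seq;               /* profiled sequential time (leaves) */
} node_t;

typedef struct {                      /* template = code + performance model */
    const char *name;
    double (*service_time)(const node_t *n, int workers);
} template_t;

/* Dummy farm model: emitter-bound above the saturation point.          */
static double farm_model(const node_t *n, int w)
{
    const double t_emit = 0.001;              /* assumed machine parameter */
    return fmax(t_emit, n->children[0]->t_seq / w);
}

/* Step 2: pick the template (and parallelism degree) with the best
   predicted service time, assuming unbounded resources up to a cap.    */
static const template_t *select_template(const node_t *n,
                                         const template_t *lib, int n_tpl,
                                         int max_workers, int *best_w)
{
    const template_t *best = NULL;
    double best_t = DBL_MAX;
    for (int i = 0; i < n_tpl; i++)
        for (int w = 1; w <= max_workers; w++) {
            double t = lib[i].service_time(n, w);
            if (t < best_t) { best_t = t; best = &lib[i]; *best_w = w; }
        }
    return best;
}

int main(void)
{
    node_t worker = { SEQ, 0, {0}, 0.05 };            /* 50 ms per task   */
    node_t farm   = { FARM, 1, { &worker }, 0.0 };
    template_t lib[] = { { "farm_tpl", farm_model } };
    int w = 1;
    const template_t *t = select_template(&farm, lib, 1, 128, &w);
    printf("chosen template %s with %d workers\n", t->name, w);
    return 0;
}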
Note that all the algorithms used in the compilation process have a polynomial time complexity. Currently, our compiler can produce code for a Meiko CS-1 Transputer based system [18] and for a network of heterogeneous workstations running PVM [19]. In the latter case, no optimization rule is applied, because it is very difficult to predict the performance of a network of multiuser workstations; nevertheless, this version is useful in the developing phase of an application. A Cray T3D template library using native communication libraries [20, 21] and some of the compiler phases are ready, but part of the compilation process must still be done by hand.

5. Parallel programming in a structured environment

To test our approach to parallel programming, we developed a set of applications using the prototype compiler. The applications were developed by undergraduate students with little or no previous parallel programming experience. The applications we developed are described in the following sections (the code of all these applications can be downloaded via the World Wide Web from the P3L home page: http://www.di.unipi.it/~susanna/p3l.html), while Section 6 discusses the performance achieved by compiling and running the applications on different target machines.

5.1. CIRCUIT FAULT LIST ANALYSIS

Circuit fault list analysis packages (CFL) are used by electronic chip producers: when a circuit is printed onto a wafer it is possible that the building process introduces defects. Due to the complexity of the current chip generation, it can be very time consuming to test all the chip functionalities to search for these defects. To overcome this difficulty, producers introduce a set of testing points in the chip itself and, using these points, they search only for the most probable faults. A possibly small set of test patterns must then be designed, along with the expected output for each pattern, to obtain a fast testing process. The CFL application is aimed at producing these test patterns: input data represent both the circuit layout and the list of the most probable faults the integrated circuit can have due to errors or defects in wafer preparation. Usual approaches involve the use of an automatic test pattern generator (ATPG), which produces a pattern that recognizes one of the faults still to be tested, and a simulator (SIM) that checks whether this pattern also detects other faults. Using this approach a small set of patterns can be found, but the process is very long due to the fact that both the ATPG and the SIM algorithms are complex and time consuming. We developed several different parallel versions of this application; one of the parallelization strategies selected is to exploit stream parallelism on the stream of different sublists of faults. The structure of the program is the following:

Π(S1; L(Π(S4; F(Π(S5; S6))); S2); S3)

The first module (S1) partitions the input fault list into a stream of sublists; each sublist enters the loop, where it remains until the computation generates a number of input sequences able to detect every fault in the list. The test pattern generation starts with a
standard sequential automatic test pattern generator (S5), which computes a pattern for the currently selected fault, and proceeds through a standard sequential simulator (S6), which checks whether the generated pattern also covers other faults in the current sublist. The other modules handle candidate fault selection (S4), termination checking (S2) and test pattern output (S3).

5.2. RADIOLOGICAL IMAGE ANALYSIS

CAT image analysis software is fundamental for examining radiological images such as those produced by CAT machines. The application gets an image where each pixel shows the energy of a part of the body, and joins zones with similar energies to produce a simpler and more understandable image. We have parallelized a sequential CAT application written in C. Several different parallelization strategies have been tested by modifying the parallel structure (i.e., changing the parallel constructs used and their composition) of the starting solution. One of the tested structures is the following:

Π(S1; L(Π(M([X; Y]; S2); S3; M([X; Y]; S2)); S5); S6)

where X is the image width and Y the image height. The CAT parallel program does not exploit stream parallelism, because we examine one image at a time. Instead it uses two map instances. The first one computes the energy for each region of the input image and builds a list of the neighboring pixels. A map form of parallelism is used because the same operation must be done for each pixel. After building the neighboring pixel lists, the sequential stage S3 decides which neighbor to glue to each region and transmits this information to the second map instance. The second map works independently on each image region and simply merges each region with the selected neighbor. The process is iterated until no more regions can be glued.

5.3. 3D DELAUNAY TRIANGULATION

Three dimensional Delaunay triangulation (TDE) is an application aimed at producing a volumetric visualization of sparse data sets. The application starts from a set of points randomly distributed in 3D space and must compute a set of tetrahedra whose vertices are points of the input set. The intersection between tetrahedra in the set must always be null or limited to shared faces or edges, and the set of tetrahedra must cover all the points of the starting dataset. We started from a sequential solution (called InCoDe) for the problem and developed different parallelization strategies [22]. One of the solutions developed has the following structure:

Π(S1; M([X; Y]; S2); S3; M([Y]; S4); S5)

In this solution, stage S1 is the input stage, the first map partitions the input data space using a set of walls (hyperplanes that divide the point space in two) and computes all tetrahedra that cross these walls. Walls divide the input data space into equally sized sections. After a data reorganization, done in stage S3, the data structures proceed towards the second map, which computes the tetrahedra that cover the points inside the walls.
5.4. A SIMPLE RAY-TRACER

This application implements a simple ray-tracer. Scenes are built combining objects like spheres, planes, cones and polygons, and using several light sources. Object bodies can have a color or a texture and separate reflection and refraction coefficients. A sequential ray-tracer program was developed quickly without any optimization: simply, for each pixel of the screen, we send light rays and handle up to 5 levels of ray reflection and refraction. Then, we tried two different parallelization strategies.

5.4.1. Data-parallel ray-tracer

The data-parallel ray-tracer (DPR) parallelizes the ray-tracing algorithm by mapping the sequential algorithm over the set of pixels. The structure of the program is:

Π(S1; M([W; H]; S2); S3)

where stage S1 is the input stage, which reads the description of the scene and sends it out; the map construct partitions the pixel coordinates and broadcasts the scene description to each worker. The map output is a pixel color matrix that contains the computed picture. The map's worker (module S2) is the complete sequential ray-tracing program. The final stage (S3) receives the computed image and writes it to a file. This solution also exploits stream parallelism in order to render an animation.

5.4.2. Dataflow ray-tracer

The dataflow ray-tracer (DFR) is an alternative implementation of the same algorithm exploiting fine grain parallelism. To program it, we performed a high level dataflow analysis of the program, cutting it into macro-blocks and re-building its structure using the parallel patterns made available by the P3L constructs. Using this strategy, we wish to exploit the stream parallelism of the computation of different pixels on different code blocks. The resulting program structure is quite complex:

Π(S1; L(Π(F(S6); S4; F(L(Π(S8; S9; S10); S7)); S5); S3); S2)

Stage S1 reads the scene description and outputs a stream of tasks, one for each pixel (160000 for a 400 by 400 pixel image); each task contains the current pixel coordinates, the scene description and a pixel state, which describes the current pixel color and what we are doing now (i.e., whether we are computing a 2nd ray reflection or whatever). When a pixel leaves stage S1, its state shows that we must compute the 1st ray direct color. Stage S2 gathers the various pixels, which are now bound to their definitive colors, and finally outputs the image. The main loop iteratively computes the direct ray color, the reflected ray color and the refracted ray color; the algorithm is recursive and stops at depth 5 (i.e., we reflect a ray up to 5 times). The inner modules implement the basic code blocks:
S6 computes the position where each object intersects the current ray;

S4 checks which object is first hit by the ray and sets the ray status ready for the light source loop;

S8 prepares the ray status to check for the current light source's color contribution;

S9 computes the current light source's contribution to the ray color;

S10 adds the light source contribution to the ray color;

S5 handles the recursive process: it checks whether we can have one more reflection or a refraction and possibly makes the pixel do another iteration of the outer loop.
6. Experimental results

To validate the P3L approach and the performance models we developed for our template libraries, and to check whether the predicted performance matches real runs, we used our optimizing compiler to run all the application programs described in Section 5 on our Meiko CS-1 Transputer based system. In Table 1 we report the measured times obtained from real runs on our parallel system and we compare them with the predicted values (all times are in seconds). The parallelism degree column reports the number of processes (workers) assigned by the compiler to each parallel construct of the program, either to obtain a good balance between the loads of the PEs or as a result of the resource reduction algorithm applied to fit the program into our small (50 nodes) CS-1.

Table 1: Comparison between predicted and measured service time; all times are in seconds.

program | pred.  | meas.  | error | parallelism degree
DPR     | 23.5   | 26.5   | 11%   | main map: 6x6 workers
DFR     | 0.0012 | 0.0013 | 7%    | intersect farm: 1 worker; light farm: 2 workers
CFL     | 0.068  | 0.07   | 2%    | main farm: 8 workers
CAT     | 23.5   | 25     | 6%    | neighbor map: 4x4 workers; glue map: 2x2 workers
DEL     | 19.2   | 20     | 4%    | over-walls map: 2x2 workers; intra-walls map: 2x2 workers
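The error column is consistent, up to rounding of the reported times, with the relative difference between measured and predicted service time; for example, for DPR:

\[
  \epsilon_{\mathrm{DPR}} \;=\; \frac{|\,26.5 - 23.5\,|}{26.5} \;\approx\; 0.113 \;\approx\; 11\%
\]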
DPR and DFR runs are relative to a 400 by 400 pixel image with 20 objects and 10 lights, CFL has 2800 items in the input fault list and partitions it into 95 sublists, CAT examines a 32 by 32 pixel image and DEL a 1024 point dataset. The absolute performance of the test programs is usually very high; for example, the sequential ray-tracer application, used as a base for both parallel versions, takes 140 seconds of CPU time on an HP 9000/735 workstation with 64 Mbytes of RAM and just 26.5 seconds using 38 T800s. A program that resulted in a bad parallelization strategy is the dataflow version of the ray-tracer, which is too fine grained for this kind of machine. This was detected by the P3L compiler, which did not allocate unused resources.
[Figure 5 plots the service time in seconds of each program (DPR, DFR, DEL, CAT, CFL) on four architectures: Cray T3D, Meiko CS-2, Thinking Machines CM-5 and a T800 mesh.]
Figure 5: Test suite programs' maximum theoretical performance over different parallel architectures.

These tests proved that we can provide both programmability and performance. In fact, all these programs were written in a short period of time, mostly by people with little or no previous parallel programming experience. They were able, often starting from standard sequential code or writing a sequential version of the application, to produce several different parallel versions in a few months. Some of these versions at first exhibited poor performance or scalability, but simple code reorganizations allowed for bigger speedups. Finally, there is the portability question. We developed a PVM backend for the P3L compiler and all the applications described compile and run straightforwardly on PVM on a network of workstations. Unfortunately, our network is severely loaded and we could not measure their effective performance. We developed templates for several other parallel architectures and, thanks to our structured approach, which allows us to predict performance in a simple and precise way, we can estimate the running time of these programs over a large set of machines. Figure 5 shows the predicted performance of the application set on various popular parallel architectures. Please note that the circuit fault list service time is multiplied by 100 and the dataflow ray-tracer service time by 1000 to make the graphs visible. The values reported represent the best performance a program can reach on such a machine if enough resources are available; the resources needed are summarized in Table 2.
Table 2: Test suite programs' maximum parallelism degree over different parallel architectures.

Program             | T3D workers | CS-2 workers | CM-5 workers | T800 mesh workers
DPR main map        | 7x7         | 10x10        | 10x10        | 25x25
DFR intersect farm  | 1           | 1            | 1            | 1
DFR light farm      | 1           | 1            | 1            | 2
DEL over-walls map  | 3x3         | 3x3          | 3x3          | 4x4
DEL intra-walls map | 2x2         | 2x2          | 2x2          | 2x2
CAT neighbor map    | 3x3         | 5x5          | 4x4          | 7x7
CAT glue map        | 1x1         | 1x1          | 1x1          | 3x3
CFL main farm       | 8           | 20           | 13           | 100
This table also gives a measure of the scalability limit of each application when more resources are available: our compiler is able to compute the “best” number of workers, that is, the number beyond which performance cannot increase further.

7. General results

Skeleton/template approach. The parallel constructs have proved very powerful and valuable from the programmer's viewpoint. We taught the P3L programming model in an advanced undergraduate course at our Department. A significant percentage of the students immediately got the basic concepts. Some students were also asked to write large applications in P3L and all of them succeeded in doing so. The templates used to implement the P3L constructs have been separately validated by coding (by hand, using existing parallel libraries) small applications using just one parallel construct. The performance of each single implementation template has been accurately measured and some specific parts of the templates not achieving good performance have been re-designed. Finally, the parallel construct/template matching algorithm embedded in the compiling tools has been validated by comparing the performance of distinct applications written in P3L and in Occam or in C plus CSTools [18, 23] (because the first version of the prototype compiler generated code for a Meiko CS-1 Transputer based machine). The results have been completely satisfactory: the performance of hand written Occam code is comparable to the performance of P3L compiled code and, in a large number of cases, the performance of the P3L code is even better.

Code re-use. Sequential code can be encapsulated in sequential functions and easily reused within P3L applications. Most of the applications shown above, all written by undergraduate students, had previously been coded using different languages and tools. The percentage of code re-use varied from 30% to 90%. Lower re-use percentages were due to code written using Occam libraries. The higher rates were
for code entirely written in C with calls to the CSTools communication library. Most of the code re-used was the code embedded into the sequential modules (actually processes) acting as worker processes in the different P3L constructs.

Rapid prototyping. P3L supports rapid prototyping, as the hierarchical composition of constructs allows “parallel refinement” of applications to be efficiently supported. Programmers start with a rough parallel decomposition of an application, compile and run it, analyze the performance values, locate bottlenecks and further parallelize the bottlenecks by inserting new constructs in the corresponding parts of the application code. Groups of three to four people spent two to three months to write and run a new P3L application from scratch. However, it took just a couple of weeks to develop alternative parallel versions of the same application (i.e., versions using a different construct tree).

Scalability. The scalability of different P3L applications has been measured and, as shown by the experimental results, the compiling tools behave very well.

8. Conclusions

This work summarizes the main results achieved by the P3L project in the past few years. The project has assessed and experimented with various aspects of the skeleton/template approach to massively parallel programming. In particular, working with the P3L language and compiling tools, we have been able to establish the feasibility of a system entirely based on a small set of skeletons (the P3L constructs) which can be hierarchically composed to build complex massively parallel applications, and to quantify the improvements in code reuse, portability and performance prediction due to this innovative organization of systems for MPP. We are now using the experience gained in the P3L project to define a structured parallel language for the PQE2000 parallel machine: this language will integrate both the P3L approach to structured parallelism and the data parallelism typical of Fortran programs.

9. Acknowledgements

The work described in this article would have been impossible without the help of a large number of students, among them: Maria Gabriella Brodi, Fabrizio Petrini, Paola Criscione, Mimi Barresi, Antonio Biso, Paolo Pesciullesi, Giacomo Giunti, Stefano Milana, Gianni De Giorgi, Luca Bordin, Alessia Conserva, Andrea Zavanella (for his bimprolog effort!).

References

[1] D. B. Skillicorn. Models for Practical Parallel Computation. International Journal of Parallel Programming, 20(2):133-158, April 1991.

[2] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. The MIT Press, Cambridge, Massachusetts, 1989.
[3] J. Darlington, A. J. Field, P. G. Harrison, P. H. J. Kelly, R. L. While, and Q. Wu. Parallel Programming using Skeleton Functions. In A. Bode, M. Reeve and G. Wolf, editors, Proc. of PARLE'93, volume 694 of LNCS, pages 146-160. Springer-Verlag, 1993.

[4] M. Danelutto, R. Di Meglio, S. Orlando, S. Pelagatti, and M. Vanneschi. A methodology for the development and the support of massively parallel programs. Future Generation Computer Systems, 8:205-220, August 1992.

[5] H. T. Kung. Computational models for parallel computers. In R. J. Elliott and C. A. R. Hoare, editors, Scientific Applications of Multiprocessors, pages 1-17. Prentice Hall, 1989.

[6] A. J. G. Hey. Experiments in MIMD parallelism. In E. A. M. Odjik and J. C. Syre, editors, PARLE '89, volume 366 of LNCS, pages 28-42. Springer-Verlag, 1989.

[7] S. Pelagatti. A Methodology for the Development and the Support of Massively Parallel Programs. PhD thesis, TD-11/93, Department of Computer Science, University of Pisa (Italy), March 1993.

[8] J. Darlington, Y. K. Guo, H. W. To, and Y. Jing. Skeletons for structured parallel composition. In Proc. of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995.

[9] B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, and M. Vanneschi. P3L: A Structured High level programming language and its structured support. Concurrency Practice and Experience, 7(3):225-255, May 1995.

[10] B. Bacci, M. Danelutto, and S. Pelagatti. Resource optimization via structured parallel programming. In K. M. Decker and R. M. Rehmann, editors, Programming Environments for Massively Parallel Distributed Systems, pages 13-25. Birkhäuser Verlag, Basel, Switzerland, 1994.

[11] A. S. Wagner, H. V. Sreekantaswamy, and S. T. Chanson. Performance models for the processor farm paradigm. To appear in IEEE Trans. on Parallel and Distributed Systems.

[12] D. J. Pritchard. Mathematical models of distributed computation. In T. Muntéan, editor, Parallel Programming of Transputer Based Machines (OUG 7), Amsterdam, 1987. IOS Press.

[13] R. W. S. Tregidgo and A. C. Downton. Processor farm analysis and simulation for embedded parallel processing systems. In S. J. Turner, editor, Tools and Techniques for Transputer Applications (OUG 12), Amsterdam, 1990. IOS Press.

[14] B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, and M. Vanneschi. Unbalanced Computations onto a Transputer Grid. In H. R. Arabnia, editor, Transputer Research and Applications 7, pages 268-282. IOS Press, Amsterdam, 1994.
[15] D. J. Pritchard. Performance analysis and measurements on Transputer arrays. In A. van der Steen, editor, Evaluating Supercomputers. Chapman and Hall, London, 1990.

[16] R. S. Bird. An introduction to the Theory of Lists. In M. Broy, editor, Logic of programming and calculi of discrete design, volume F36 of NATO ASI, pages 5-42. Springer-Verlag, 1987.

[17] D. Pasetto and M. Vanneschi. Design and evaluation of parallel applications using a structured parallel language. In International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'96), Sunnyvale, CA, August 1996.

[18] Meiko Ltd. Computing Surface: technical overview, 1989.

[19] V. S. Sunderam. PVM: a framework for parallel distributed computing. Concurrency Practice and Experience, 2(4):315-339, December 1990.

[20] Cray Research Inc. Cray T3D System Architecture Overview, September 1993.

[21] Cray Research Inc. Cray T3D Software Overview, January 1993.

[22] P. Criscione, S. Orlando, and S. Scopigno. Metodologia di programmazione parallela P3L: un'applicazione a problemi di geometria computazionale. In Proc. of AICA 95, volume 1, pages 221-229, Chia (Cagliari), September 1995. In Italian.

[23] Meiko Ltd. CSTools Communication Library for C Programmers, 1991.