Profess: A Portable System for Prototyping the Performance of Parallel Programs John Darlington, Moustafa M. Ghanem and Shamimabi Paurobally E-mail: fjd, mmg,
[email protected] Department of Computing, Imperial College, 180 Queens Gate, London, SW7 2BZ.
Abstract Profess is a parallel program simulation tool designed to help application programmers in evaluating the performance of candidate implementations of their programs on a parallel machine. Users provide a high-level description of each program as structured compositions of known programming skeletons and program components. The system then uses a mixture of performance models and actual runs on the parallel machine to estimate the total time required for executing each implementation. In this paper we present the underlying language and basic components of the tool, describe an initial implementation on the AP1000 parallel machine and discuss preliminary results of using the system.
1 Introduction

With the diversity of existing parallel machines there is a growing need for tools that help programmers in choosing between alternative implementations of their programs. Writing portable parallel programs requires choosing appropriate algorithms, implementation strategies and resource allocation schemes in order to achieve good performance on one machine while maintaining portability and guaranteed good performance on other machines. Programmers faced with these choices are often required to experiment with different versions of the same program by trying various implementations for certain building blocks or by trying different resource allocation strategies. After each run they observe the generated performance and attempt to adjust the implementation in order to improve that performance. Even after this is achieved there remains the need to ensure that good performance can be maintained when the program is ported to other machines, which may imply repeating the whole process for every machine.
1.1 Performance Simulation and Estimation

This process can be simplified if the programmer is able to predict the performance of a program without fully coding or developing it. One option is for the programmer to develop performance models for the different implementation options and use these to compare the expected performance. Such performance models generally need to be parametrised by both the problem and machine characteristics. This approach requires a good understanding of the different factors that affect the performance of both parallel machines and parallel programs, which is not necessarily an easy task. Another approach, which is followed in this paper, is to provide programmers with a simple performance simulation tool. The basic idea is to allow the user to easily and quickly code the different options under consideration using a simple language. This language can be translated to executable prototypes that can run on the parallel machine at hand. As opposed to the real program, a prototype does not perform any useful computation; it only executes the communication structures of the user program mixed with dummy loops simulating the sequential code of the program. However, since the prototype executes the same message passing primitives as in the original program and generates the same
amount of network traffic, it exhibits, to a reasonable extent, the same performance as would be delivered by the full program. This approach is much simpler and faster for experimenting with different options than either coding and running full implementations of the different options or running full simulations of the program on a sequential machine.
1.2 Skeleton Programming
This prototyping approach is especially well suited for studying parallel programs developed using a structured programming method such as structured skeleton programming [3], [1]. The skeletons approach makes use of both algorithmic skeletons and reusable sequential programming components in achieving program portability across parallel machines. The scheme achieves good performance on different machines since the implementation of the skeletons and components can be tailored to each machine's characteristics. Within the skeletons approach a skeleton system implementor compares the performance of different implementations of the same skeleton and experiments with the different resource allocation strategies only once, when the skeletons are implemented. After this initial selection process only efficient implementations of the skeletons are kept in the system. Application programmers are thus guaranteed good performance for each of the skeletons used in their programs. A performance prototyping tool is of great use within the context of skeleton programming systems. It can be used by application programmers when comparing the performance of alternative skeletons to be used in writing the same program. It can also be used by system implementors at skeleton implementation time in comparing the performance of different implementations of any given skeleton.
1.3 A Performance Prototyping Tool
In this paper, we describe the design of Profess, a portable tool which provides flexible simulation of the performance of parallel programs. An initial implementation of the tool has been developed in [9] and tested on the AP1000 parallel machine [6]. The tool has been designed with the skeletons approach in mind and allows programmers to express their programs as compositions of predefined skeletons together with easy-to-extract program characteristics defined in a high-level prototyping language. Section 2 gives an overview of the tool itself and describes some of the skeletons supported. Section 3 describes some of the features of the prototyping language used by the system and Section 4 describes an example of using Profess in studying the performance of a user program. Finally, Section 5 discusses some related performance prototyping systems and compares them to Profess, while Section 6 describes some of the future work envisaged.
2 Profess

Figure 1: Overview of Profess. [Diagram not reproduced. Box labels: Algorithm 1, Algorithm 2, Algorithm 3; PDL Program (Basic Communication Structure + Skeletons + Message lengths + Sequential Workload); PROFESS TRANSLATOR; Library of High Level Communication Structures + skeletons; Library of C+MPI & Timing Code; Sequential Performance Curves; Run on a Parallel Machine; Timing Results; Performance Modelling Tools; Visualisation Tools; Compare and Choose.]

Figure 1 provides a brief overview of using Profess. A user wishing to investigate the performance of different candidate parallel algorithms for solving the same problem writes a simple prototype describing each option using PDL, the high-level language associated with Profess. Once submitted to the system, the prototypes are translated into C programs with MPI [5] message passing calls. This translation process makes use of a database of existing skeleton
implementations written in terms of MPI. The choice of this compilation route allows for the portability of both the tool and the resulting prototypes to any parallel machine with support for MPI. During the translation process, the system inserts timing calls to gather information about the performance of the different components in each program. Once a prototype is executed on a given machine, the set of timing results reported to the user can be used to study the performance of its different components. Other than the translator, which has been implemented using standard tools, the major components of the system are described below.
2.1 Prototyping Language

Profess has an underlying prototype description language, PDL. The language is used to construct prototypes by composing skeletons together and by specifying descriptions for the expected message sizes and sequential workloads in a program. An example of a PDL program is shown in Figure 2. This is a performance prototype for a conjugate gradient solver program [1]. Some of the main features of the language are described in more detail in Section 3.

    #Conjugate Gradient Prototype
    Work VecAdd(N){
      t1*N microsec;
    }
    Work InnerProd(N){
      t2*N microsec;
      AllReduce(SUM);
    }
    Work MVP(M,N){
      AllGather(N);
      t3*M*N microsec;
    }
    Work CG(N, M){
      InnerProd(N);
      VecAdd(N);
      MVP(N, M);
      InnerProd(N);
      VecAdd(N);
      InnerProd(N);
      MVP(N, M);
      InnerProd(N);
      VecAdd(N);
    }
    Begin
      Spmd (iter, CG(N/P, N), Gather(N) );
    End

Figure 2: PDL Conjugate Gradient Program Prototype

2.2 Library of Skeletons

The system includes a library of prototype implementations for skeletons of widely used parallel program structures and communication primitives. Detailed descriptions of the major skeletons can be found in [3] and [1]. In Profess, however, both the implementation and the interface of the skeletons are much simpler than those of the actual skeletons used within a complete skeleton programming system. In Profess the user needs only to specify the name of the required primitive or skeleton together with the size of the data being transferred and the amount of sequential workload performed. The user does not have to provide the actual data transferred, allocate storage for that data, or provide any of the other parameters required in the full skeleton programming system. Some of the currently supported skeletons are listed below with brief descriptions of their syntax.

The SPMD skeleton abstracts the features of Single Program Multiple Data (SPMD) computation. The prototype definition for this skeleton is Spmd (I, W, GF). I represents the number of iterations for executing a workload W on each processor followed by performing GF, a global function, across the processors.

The PIPE skeleton abstracts the behaviour of pipelined parallelism. The prototype definition for this skeleton is Pipe (N, W, M), where N represents the number of tasks passed through the pipeline, W represents the sequential workload executed on each processor, and M represents the size of messages passed between adjacent processors.
The Farm skeleton abstracts the behaviour of dynamic master-slave parallelism. The prototype for this skeleton is Farm (N, W, M1, M2), where N represents the total number of tasks to be farmed out from a master processor, W represents the amount of work required for each task on a given processor, and M1 and M2 represent the size of the data sent out and received by the master respectively for each task.
The DoAndShift (I, W, M) skeleton specifies, for I iterations, the execution of the workload W on each processor followed by sending a message of size M to its right neighbour and receiving a message of the same size from its left neighbour. This abstracts the behaviour of the LNO skeleton in a one-dimensional configuration. A two-dimensional version of the skeleton, DoAndShift2D, also exists.
Other currently implemented skeletons include the one- and two-dimensional exchange skeletons DoAndExchange and DoAndExchange2D, and implementations of several global communication functions such as Gather, AllGather, Reduce and AllReduce. The library of skeletons currently available was written using MPI. This allows for the portability of these definitions and provides flexibility to users wishing to add or experiment with their own skeletons.
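To make the skeleton parameters more concrete, the sketch below reads two of the prototype definitions against a simple analytic cost model. This is for intuition only and is not how Profess obtains its timings, since Profess executes the communication on the real machine. The linear message cost, the pipeline fill formula, and all function and parameter names are illustrative assumptions that do not come from the paper.

```python
"""Back-of-envelope cost models for two Profess skeleton signatures.

Assumptions (not from the paper): a message of size m costs
latency + m * per_unit seconds, and all processors do identical work.
"""


def message_time(size, latency, per_unit):
    """Hypothetical linear cost of sending one message of `size` units."""
    return latency + size * per_unit


def do_and_shift_time(iterations, work, msg_size, latency, per_unit):
    """DoAndShift(I, W, M): each iteration does W, then one shift of size M."""
    return iterations * (work + message_time(msg_size, latency, per_unit))


def pipe_time(tasks, work, msg_size, stages, latency, per_unit):
    """Pipe(N, W, M) on `stages` processors, using the usual fill-and-drain
    estimate: the last of N tasks leaves after N + stages - 1 stage steps."""
    step = work + message_time(msg_size, latency, per_unit)
    return (tasks + stages - 1) * step


if __name__ == "__main__":
    # Hypothetical machine constants, chosen only to exercise the functions.
    print(do_and_shift_time(64, 0.01, 4096, latency=1e-4, per_unit=1e-7))
    print(pipe_time(100, 0.01, 4096, stages=8, latency=1e-4, per_unit=1e-7))
```

A model of this kind is only as good as its constants, which is one reason the tool measures communication directly rather than modelling it.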
2.3 Library of Sequential Performance Models

In many cases the differences in performance between parallel alternatives for a program arise only from the communication structures used or from using different skeletons, not from using different sequential codes. When comparing the performance of such alternatives, the errors resulting from inexact sequential models generally cancel out, and hence detailed or exact sequential performance models may not be required. In other cases, especially when comparing alternatives with different sequential codes, more detailed performance models are needed. Acquiring accurate estimates for sequential code execution times can be difficult for the user. This task can be made easier if the sequential code is restricted to well-known sequential kernels or library routines. In this case a database of performance models for a variety of sequential kernels is available to the user. The constants for these models can be acquired by benchmarking their implementations on the machine at system install time. A useful tool which automatically benchmarks user-supplied routines and fits constants to performance models is described in [2]; a similar tool could be useful for Profess.
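As an illustration of fitting model constants from benchmark timings, the sketch below performs an ordinary least-squares fit of a linear model t = a + b*n. It is a minimal example under stated assumptions: the linear model form, the function name `fit_linear`, and the sample timings are all hypothetical, and the paper does not say which fitting method Profess or the tool in [2] uses.

```python
def fit_linear(sizes, times):
    """Least-squares fit of times ~ a + b * sizes; returns (a, b)."""
    if len(sizes) != len(times) or len(sizes) < 2:
        raise ValueError("need at least two (size, time) pairs")
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    if sxx == 0:
        raise ValueError("all sizes are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    b = sxy / sxx               # cost per element
    a = mean_y - b * mean_x     # fixed start-up cost
    return a, b


if __name__ == "__main__":
    # Made-up benchmark data for a vector-add style kernel (microseconds).
    sizes = [1000, 2000, 4000, 8000]
    times = [130.0, 248.0, 492.0, 971.0]
    a, b = fit_linear(sizes, times)
    print("start-up cost:", a, "per-element cost:", b)
```

The fitted slope would play the role of a constant such as t1 in a Work definition like `t1*N microsec`.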
3 PDL

One of the design requirements of Profess is to allow easy specification of both the sequential and communication workloads in a program. Whereas communication patterns are captured through the use of the skeletons, both the sizes of messages transferred and the time required for executing sequential code need to be specified. These are defined using PDL.
3.1 Workload Descriptions

Sequential workloads can be specified in terms of constants or mathematical functions of program or machine variables such as the total number of processors or individual processor identifiers. Additionally, the workload can be defined to exhibit user-defined variations around a mean value across different iterations. This last feature is especially useful when studying the effects of load balancing on a program's performance. To allow for extra flexibility, the workload can be defined in terms of absolute time units (e.g. seconds) or relative time units (e.g. the time for executing a basic arithmetic operation). Similar features also exist for the specification of the message sizes in a program.
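The effect of workload variation on load balance can be sketched with a small simulation. The example below assumes, purely for illustration, that each processor's workload is drawn uniformly from mean ± spread and that processors synchronise after every iteration, so each iteration costs as much as its slowest processor. Neither the distribution nor the function name is taken from the paper.

```python
import random


def spmd_time_with_variation(iterations, processors, mean, spread, seed=0):
    """Total time when every processor's workload varies around a mean.

    Assumes a uniform distribution on [mean - spread, mean + spread] and a
    synchronisation point after each iteration (both are assumptions).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        loads = [rng.uniform(mean - spread, mean + spread)
                 for _ in range(processors)]
        total += max(loads)  # the slowest processor sets the pace
    return total


if __name__ == "__main__":
    balanced = spmd_time_with_variation(100, 64, mean=1.0, spread=0.0)
    varied = spmd_time_with_variation(100, 64, mean=1.0, spread=0.2)
    print("no variation:", balanced)
    print("with variation:", varied)
```

Under these assumptions the run with variation is never faster than the balanced run divided by the same mean, because the per-iteration maximum can only rise as the spread grows.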
3.2 Syntax

PDL is a structured language supporting the definition of constants, variables and local procedures. The main body of a PDL program is enclosed between Begin and End statements. The keyword Define is used to include other files and comments are preceded by a # symbol. Variable types in PDL include int, float and double and can be used to specify both message
lengths and sequential workloads. The specification of the workload is taken to be the number of iterations in an empty for-loop by default. To override this, the qualifiers second, millisec and microsec are supported. Additionally, user-defined qualifiers may be specified using the Load keyword. An example is shown in the PDL code below.

    Load AddOp 0.00000002   #time for one
                            #floating point add op
    x = 100 AddOp
PDL does not support conditional statements, but it does support the binary +- operator. An expression x +- y specifies a probabilistic quantity of magnitude x ± y, and can be used to specify variations in both workloads and message sizes.

PDL supports the definition of procedures with local variables as well as special functions for defining workloads and message sizes. The keyword Work is used in defining Work functions. Within such functions, the values resulting from evaluating mathematical expressions are used to calculate the number of iterations of the loops simulating sequential code. The keyword Message, on the other hand, is used to define Message functions. In this case, the last expression calculated is returned as a message count. A NULL keyword also exists to specify null or empty functions. As an example, the PDL code below gives the performance formula for a matrix multiply routine.

    Work SeqMM(I, K, J){
      t0 + I*(t1 + J*(t2 + K*t3))
    }
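The SeqMM formula can be evaluated directly once the four constants are known. The sketch below does so in Python; the reading of t0 to t3 as per-level loop overheads of a triply nested loop is an interpretation rather than something the paper states, and the constant values in the usage example are invented.

```python
def seq_mm_time(i, k, j, t0, t1, t2, t3):
    """Evaluate t0 + I*(t1 + J*(t2 + K*t3)) from the SeqMM Work function.

    One possible reading: t0 is a fixed start-up cost, t1 and t2 are the
    overheads of the outer and middle loops, and t3 is the cost of one
    innermost multiply-add.
    """
    return t0 + i * (t1 + j * (t2 + k * t3))


if __name__ == "__main__":
    # Hypothetical constants in seconds; real values would come from
    # benchmarking the kernel on the target machine.
    print(seq_mm_time(8, 512, 512, t0=1e-5, t1=1e-6, t2=2e-7, t3=4e-8))
```

For large dimensions the I*J*K*t3 term dominates, so errors in t0, t1 and t2 matter mainly for small local blocks.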
The keywords StartTime and GetTime are used to specify timing points in a program. In their absence, only the total execution time and the time for executing the skeletons in the main program body are reported.
4 Experiments and Results

The initial implementation of the tool on the AP1000 has been used in a simple case study for comparing the performance of three matrix multiplication algorithms. The PDL description for these algorithms is shown in Figure 3. A brief description of each algorithm is given below and the reader is referred to [7] for their details.

    #Matrix Multiply Prototypes
    Define "SeqModels"   # include library
                         # of sequential models.
    Begin
      #Simple Program
      Spmd (1, SeqMM (N/P, N, N), NULL );
      #Ring Program
      Spmd (P, DoAndShift( SeqMM(N/P, N, N/P), N^2/P ), NULL );
      #Cannon Program
      R = root(P);
      Spmd (R, DoAndShift2D (SeqMM(N/R, N/R, N/R), N^2/P ), NULL );
    End

Figure 3: Matrix Multiply Programs Prototype

In the first algorithm, simple, each processor holds a complete copy of the matrix B while the matrix A is distributed row-wise across the processors. The sizes of the local arrays are thus N/P by N and N by N. The rows of the result matrix, C, are computed on each processor by multiplying the corresponding local rows of A and the local copy of B. This version requires no communication between the processors, but has a high local data storage requirement.

In the second algorithm, ring, both matrices are distributed across the processors. The matrix A is distributed row-wise while the matrix B is distributed column-wise. The sizes of the local arrays are N/P by N and N by N/P. By multiplying the local rows of A by the local columns of B, each processor computes a portion of the C matrix. Each processor then passes its columns of B to the processor on its right and receives the columns of the processor to its left. This
process is repeated P times. This behaviour is captured using the DoAndShift skeleton.

The third algorithm, cannon, distributes the arrays in square blocks of size N/√P by N/√P elements. Each processor performs a matrix multiplication between its local matrices and then shifts its A matrix up and its B matrix left. The process is repeated √P times. This behaviour is captured using the DoAndShift2D skeleton.
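A quick consistency check on the three prototypes is that they describe the same amount of arithmetic per processor, so that differences in predicted time come from communication and local block shape rather than from the operation count. The sketch below counts innermost multiply-add steps using the iteration counts and SeqMM arguments from Figure 3. It assumes that P is a perfect square and that P divides N; the function names are hypothetical.

```python
import math


def inner_steps(i, k, j):
    """Multiply-add count for an I x K by K x J local matrix product."""
    return i * k * j


def work_per_processor(n, p):
    """Return multiply-adds per processor for (simple, ring, cannon)."""
    r = math.isqrt(p)
    if r * r != p or n % p != 0:
        raise ValueError("sketch assumes P is a perfect square dividing N")
    simple = 1 * inner_steps(n // p, n, n)            # Spmd(1, SeqMM(N/P, N, N))
    ring = p * inner_steps(n // p, n, n // p)         # Spmd(P, SeqMM(N/P, N, N/P))
    cannon = r * inner_steps(n // r, n // r, n // r)  # Spmd(R, SeqMM(N/R, N/R, N/R))
    return simple, ring, cannon


if __name__ == "__main__":
    print(work_per_processor(512, 64))
```

All three counts reduce to N^3/P, which follows directly from the argument lists in Figure 3.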
Figure 4: Predicted vs Measured performance for the Matrix Multiply programs as a function of the matrix dimension on 64 processors. [Plot not reproduced: curves for the simple, ring and cannon programs and their prototypes, for matrix dimensions from 300 to 650.]

The timing results generated from the prototypes on the AP1000 machine were compared to the execution times of fully coded versions of the programs. Figure 4 shows both sets of measurements, and indicates that the simulation results predict, to a reasonable extent, the times measured from the actual runs.

5 Related Work

Several other systems for prototyping the performance of parallel programs exist. Examples include CHIP3S [8] and Lepeb [4]. The CHIP3S system uses a structured language to describe the underlying parallelisation scheme of a program and the performance of its components. It requires the user to provide performance models for both the communication and computation parts of a program. In this respect it does not use the actual parallel machine to execute the communication workload of the program but translates the definitions to Mathematica [10] for performance evaluation and simulation.

Profess shares its basic features with Lepeb. These include the execution of the communication operations of the parallel program on the actual parallel machine while simulating the sequential code of the program. Lepeb allows the definition of arbitrary parallel programs and unrestricted forms of parallelism and interactions between processes. In contrast, Profess encourages structured definitions of parallel programs through the use of predefined skeletons only. Lepeb also has special syntax for defining the cache miss behaviour within a sequential workload; in Profess this can be defined using the sequential workload formulae. The syntax of PDL in general is simpler than that of Lepeb while still allowing for more flexibility in defining both the sequential workload and message sizes in a program.

6 Conclusions and Future Work

A performance estimation tool can be evaluated by both the accuracy of its predictions and its ease of use. Our experiments with Profess indicate that it has both qualities. However, further experience with the tool on other machines and more case studies are essential for the complete evaluation of both the approach and the tool. The syntax of PDL is not finalised yet, and future work envisaged includes support for nested parallelism. The number of skeletons currently supported by the system is also expected to increase, together with the development of a larger library of sequential performance models for commonly used components. Simple modifications to the user interface are also expected, to allow the automatic execution of the prototypes for user-defined ranges of program parameters and numbers of processors.
Acknowledgements

We would like to thank our colleagues in the Advanced Languages and Architectures Section at Imperial College for their assistance and ideas. We would also like to thank Fujitsu for providing the facilities at IFPC which made this work possible. This work has been conducted under a British Council Scholarship to the second author.
References

[1] Peter Au, John Darlington, Moustafa M. Ghanem, Yi-ke Guo, Hing Wing To, and Jin Yang. Co-ordinating heterogeneous parallel computation. In Luc Bouge, Pierre Fraigniaud, Anne Mignotte, and Yves Robert, editors, Euro-Par'96 Parallel Processing, volume I, pages 601-614. Springer-Verlag, August 1996.
[2] Eric A. Brewer. Portable High-Performance Supercomputing: High-Level Platform Dependent Optimization. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 1994.
[3] J. Darlington, M. Ghanem, and H. W. To. Structured parallel programming. In Programming Models for Massively Parallel Computers, pages 160-169. IEEE Computer Society Press, September 1993.
[4] Alistair Dunlop, Emilio Hernandez, Oscar Naim, Tony Hey, and Denis Nicole. Collaborative tools for parallel performance optimisation. Scientific Programming, November 1994.
[5] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.
[6] Hiroaki Ishihata, Takeshi Horie, Satoshi Inano, Toshiyuki Shimizu, Sadayuki Kato, and Morio Ikesaka. Third generation message passing computer AP1000. In International Symposium on Supercomputing, pages 46-55, 1991.
[7] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing. Benjamin/Cummings, 1994.
[8] E. Papaefstathiou, Darren J. Kerbyson, Graham R. Nudd, and T.J. Atherton. An introduction to the CHIP3S language for characterising parallel systems in performance studies. Research Report CS-RR280, Department of Computer Science, University of Warwick, Coventry, UK, January 1995.
[9] Shamimabi Paurobally. Profess: parallel performance tool. BEng project report, Department of Computing, Imperial College, London SW7, June 1996.
[10] Stephen Wolfram. Mathematica, A system for doing mathematics by computer. Addison Wesley, 1991.