Functional Skeletons for Parallel Coordination

John Darlington, Yi-ke Guo, Hing Wing To, Jin Yang
Department of Computing, Imperial College
180 Queen's Gate, London SW7 2BZ, U.K.
E-mail: {jd, yg, hwt, [email protected]
Abstract
In this paper we propose a methodology for structured parallel programming using functional skeletons to compose and coordinate concurrent activities that are themselves defined in a standard imperative language. Skeletons are higher order functional forms with built-in parallel behaviour. We show how such forms can be used uniformly to abstract all aspects of a parallel program's behaviour, including data partitioning, placement and re-arrangement (communication) as well as computation. Skeletons are naturally data parallel and are capable of expressing computation and coordination at a higher level than other process oriented coordination notations. Being functional, skeletons inherit all the desirable properties of this paradigm: abstraction, modularity and transformation, and we show how program transformation can be used to optimise parallel programs. Examples of the application of this methodology are given and an implementation technique is outlined.
Key Words: Programming Language, Parallel Computing, Skeleton, Coordination Language, High Performance Fortran.
1 Introduction

In [14], Gelernter and Carriero proposed the notion of coordination languages for parallel programming. In this article, they wrote:

We can build a complete programming model out of two separate pieces: the computation model and the coordination model. The computation model allows programmers to build a single computational activity: a single-threaded, step-at-a-time computation. The coordination model is the glue that binds separate activities into an ensemble. An ordinary computation language (e.g. Fortran) embodies some computation model. A coordination language embodies a coordination model; it provides operations to create computational activities and to support communication among them.

Applications written in this way have a two-tier structure. The coordination level abstracts all the relevant aspects of a program's parallel behaviour, including partitioning, data placement, data movement and control flow, whilst the computation level expresses sequential computation through procedures written in an imperative base language. Such a separation allows the task of parallel programming to focus on the parallel coordination of sequential components.
This is in contrast to the low level parallel extensions to languages where both tasks must be programmed simultaneously in an unstructured way. The coordination approach provides a promising way to achieve the following important goals:

Reusability of Sequential Code: Parallel programs can be developed by using the coordination language to compose existing modules written in conventional languages.

Generality and Heterogeneity: Coordination languages are independent of any base computational language. Thus, they can be used to compose sequential programs written in any language and can, in principle, coordinate programs written in several different languages.

Portability: Parallel programs can be efficiently implemented on a wide range of parallel machines by specialised implementations of the compositional operators for target architectures.
Various coordination languages have been proposed. In [14], Gelernter and Carriero regarded their Linda system as a coordination language. As a coordination language, Linda abstracts MIMD parallel computation as an asynchronously executing group of processes that interact by means of an associative shared memory, called the tuple space, consisting of a collection of logical tuples. Parallelism is achieved by creating process tuples, which are evaluated by processors needing work. Parallel processes interact by sharing data tuples. After a process tuple has finished execution it returns to the tuple space as a data tuple. Linda characterises a class of coordination languages, regarded as embedded coordination languages in [15], which consist of a set of asynchronous coordination primitives accessed through statements scattered throughout a base computing language program. Another important class of coordination languages can be regarded as embedding coordination languages, where a coordination language specifies a framework for accomplishing a task in parallel; sequential sub-computations are embedded within that framework. The embedding coordination languages include Delirium [15], which uses a simple first order functional language as the coordination system, specifying the parallel frameworks by describing the dataflow of the computation. An important development of a parallel language based on the coordination approach is the well known High Performance Fortran language [5], where a set of compiler directives is adopted for parallel coordination. Another important advance was the development of the Program Composition Notation (PCN) [16], where the issue of composition is emphasised by building up a coordination mechanism based on connecting together explicitly-declared communication ports. However, the lower level process model of the language obstructs the abstraction of the coordination structure. An interesting development of the PCN approach is the P3L system [17]. Rather than using a set of primitive composition operators, in P3L a set of parallel constructs are used as program composition forms. Each parallel construct in P3L abstracts a specific form of commonly used parallelism. For example, the map construct is used to compose programs to form data parallel computation. This approach is based on the integration of the skeleton approach [3] and the PCN model, where the computation structures of commonly used classes of algorithms are abstracted as skeletons (called parallel constructs in P3L) and parallel programming is realised as instantiation of the skeletons. Such an integration, however, is not smooth, since the high level abstraction of the parallel computation structure is compromised by the lower level process model.
Although developing coordination languages has become a significant research topic for parallel programming, there is still no general purpose coordination language designed to meet the requirements of constructing verifiable, portable and structured parallel programs. In this paper, we propose an approach for parallel coordination using functional skeletons to abstract all essential aspects of parallelism, including data distribution, communication and commonly used parallel computation structures. Functional skeletons exploit their Church-Rosser property by clearly separating the definition of their meaning, which is established by functional or axiomatic definitions, from their parallel behaviour, which results from tailored implementations on specific target architectures that are able to exploit the degrees of freedom available to optimise critical implementation decisions. Applying skeletons to coordinate sequential components, we have developed a structured parallel programming framework SPP(X), in which parallel programs are constructed in a structured way. In the SPP framework, an application is constructed in two layers: a higher skeleton coordination level and a lower base language level. Parallel programs are constructed by using a skeleton based coordination language to coordinate fragments of sequential code written in a base language (BL). The fundamental compositional property of functional skeletons naturally supports modularity of such programs. Using skeletons as the uniform means of coordination and composition removes the need to work with the lower level details of computation such as port connection. The uniform mechanism of high level abstraction of parallel behaviour means that all analysis and optimisation required can be confined to the coordination level which, being functional and constructed from pre-defined units, is much more amenable to such analysis and manipulation than the base language components or other coordination mechanisms. This paper is organised into the following sections. In section 2, a skeleton coordination language, SCL, for general purpose parallel coordination is introduced and an example is presented to show its programming style and expressive power. In section 3, we present a concrete SPP programming language, Fortran-S, produced by taking Fortran as the base language for specifying sequential computation, and discuss implementation strategies for such a language. A transformation-based optimisation approach for parallel structures defined in SCL is outlined in section 4. We finally summarise our work in section 5.
2 SCL: A Structured Coordination Language

We propose a structured coordination language SCL as a general purpose coordination language by describing its three components: configuration and configuration skeletons, elementary skeletons and computational skeletons.
2.1 Configuration and Configuration Skeletons
The basic parallel computation model underlying SCL is the data parallel model. In SCL, data parallel computation is abstracted as a set of parallel operators over a distributed data structure. In this paper distributed arrays are used as our underlying parallel data structure; the idea can be generalised to richer and higher level data structures. Each distributed array, called a parallel array, has the type ParArray index α, where each element is of type α and each index is of type index. In this paper we use <<x1, ..., xn>> to represent a ParArray. To take advantage of locality when manipulating such distributed data structures, one of the most important issues is to coordinate the relative distribution of one data structure with respect to that of another, i.e. data alignment. The importance of abstracting this configuration information in parallel programming has been recognised in other languages such as HPF, where a set of compiler directives is proposed to specify parallel configurations [5].

Figure 1: Data Distribution Model (arrays are partitioned and then aligned onto virtual processors, i.e. a configuration).

In SCL, we abstract control over both distribution and alignment through a set of configuration skeletons. A configuration models the logical division and distribution of data objects. Such a distribution has several components: the division of the original data structure into distributable components, the location of these components relative to each other and finally the allocation of these co-located components to processors. In SCL this process is specified by a partition function to divide the initial structure into nested components and an align function to form a collection of tuples representing co-located objects. This model, illustrated in Fig. 1, clearly follows and generalises the data distribution directives of High Performance Fortran (HPF). Applying this general idea to arrays, the following configuration skeleton distribution defines the configuration of two arrays A and B:

distribution (f,p) (g,q) A B = align (p (partition f A)) (q (partition g B))
This skeleton takes two function pairs: f and g specify the required partitioning (or distribution) strategies of A and B respectively, while p and q are bulk data-movement functions specifying any initial data re-arrangement that may be required. The distribution skeleton is defined by composing the functions align and partition. partition divides a sequential array into a parallel array of sequential subarrays:

partition :: Partition_pattern -> SeqArray index α -> ParArray index (SeqArray index α)

where Partition_pattern is a function of type (indexs -> indexp), where indexs is associated with the SeqArray and indexp addresses the ParArray. The type SeqArray is the ordinary sequential array type of our base language. Some commonly occurring partition strategy functions are provided as built-in functions. For example, partitioning an l × n two-dimensional array into row blocks using row_block we get:

partition (row_block p) A = << [ A((k-1)*(l/p)+i, j) | i ∈ [1..l/p], j ∈ [1..n] ] | k ∈ [1..p] >>
Other similar functions for two-dimensional arrays are col_block, row_col_block, row_cyclic and col_cyclic, etc. The align operator:

align :: ParArray index α -> ParArray index β -> ParArray index (α, β)
pairs corresponding subarrays in two distributed arrays together to form a new configuration which is a ParArray of tuples. Objects in each tuple of the configuration are regarded as being allocated on the same processor. A more general configuration skeleton can be defined as:

distribution [(f,p)] [d]       = p (partition f d)
distribution ((f,p):fl) (d:dl) = align (p (partition f d)) (distribution fl dl)

where fl is a list of distribution strategies for the corresponding data objects in the list dl. Applying the distribution skeleton forms a configuration which is an array of tuples. Each element i of the configuration is a tuple of the form (DAi1, ..., DAin), where n is the number of arrays that have been distributed and DAij represents the sub-array of the jth array allocated to the ith processor. As shorthand, rather than writing a configuration as an array of tuples we can also regard it as a tuple of (distributed) arrays and write it as <DA1, ..., DAn>, where DAj stands for the distribution of the array Aj. In particular we can pattern match against this notation to extract a particular distributed array from the configuration. Configuration skeletons are capable of abstracting not only the initial distribution of data structures but also their dynamic redistribution. Data redistribution can be uniformly defined by applying bulk data movement operators to configurations. Given a configuration C: <DA1, ..., DAn>, a new configuration C': <DA1', ..., DAn'> can be formed by applying fj to the distributed structure DAj, where fj is some bulk data movement operator specifying collective communication. This behaviour is abstracted by the following skeleton redistribution:
where fl is a list of distribution strategies for the corresponding data objects in the list dl. Applying the distribution skeleton forms a con guration which is an array of tuples. Each element i of the con guration is a tuple of the form (DAi1; : : :; DAin) where n is the number of arrays that have been distributed and DAij represents the sub-array of the jth array allocated to the ith processor. As short hand rather than writing a con guration as an array of tuples we can also regard it as a tuple of (distributed) arrays and write it as where the DAj stands for the distribution of the array Aj . In particular we can pattern match to this notation to extract a particular distributed array from the con guration. Con guration skeletons are capable of abstracting not only the initial distribution of data structures but also their dynamic redistribution. Data redistribution can be uniformly de ned by applying bulk data movement operators to con gurations. Given a con guration C: , a new con guration C0 : can be formed by applying fj to the distributed structure DAj where fj is some bulk data movement operator de ned specifying collective communication. This behaviour can be abstracted by the following skeleton redistribution: redistribution [f1,
:::
, fn] < DA1,
:::
, DAn > = < f1 DA1,
:::
, fn DAn >
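SCL itself is a coordination notation rather than an executable language, but the configuration skeletons have a straightforward reading in an ordinary functional language. The following Haskell sketch is our own illustration, with a ParArray modelled simply as a list holding one block per processor and the partition strategy passed directly as a function; it mirrors partition, align, gather, distribution and redistribution for the two-array case.

-- A minimal sequential model of SCL configurations, assuming a ParArray is
-- represented as a list holding one block per (virtual) processor.
type ParArray a = [a]       -- one element per processor
type SeqArray a = [a]       -- a local, sequential block

-- a row_block-style partition strategy: p contiguous blocks
rowBlock :: Int -> SeqArray a -> ParArray (SeqArray a)
rowBlock p xs = chunksOf (length xs `div` p) xs
  where chunksOf _ [] = []
        chunksOf n ys = take n ys : chunksOf n (drop n ys)

-- align pairs corresponding blocks of two distributions (co-location)
align :: ParArray a -> ParArray b -> ParArray (a, b)
align = zip

-- gather is the inverse of a block partition
gather :: ParArray (SeqArray a) -> SeqArray a
gather = concat

-- distribution for two arrays: partition each, apply an initial bulk
-- data-movement function, then align the results
distribution (f, p) (g, q) a b = align (p (f a)) (q (g b))

-- redistribution applies one bulk data-movement function per distributed array
redistribution (fa, fb) conf = align (fa (map fst conf)) (fb (map snd conf))

-- e.g. distribution (rowBlock 4, id) (rowBlock 4, id) [1 .. 8 :: Int] "abcdefgh"
--   == [([1,2],"ab"), ([3,4],"cd"), ([5,6],"ef"), ([7,8],"gh")]

The point of the sketch is only that configurations are ordinary values that can be built, inspected and transformed; the distributed implementation is what gives them their parallel reading.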
SCL supports nested parallelism by allowing ParArrays as elements of a ParArray and by permitting a parallel operation to be applied to each of these elements (ParArrays) in parallel. An element of a nested array corresponds to the concept of a group in MPI [20]. The leaves of a nested array contain any valid sequential data structure of the base computing language. The following skeleton gather collects together a distributed array:

gather :: ParArray index (SeqArray index α) -> SeqArray index α

split and combine are another pair of configuration skeletons:

split   :: Partition_pattern -> ParArray index α -> ParArray index (ParArray index α)
combine :: ParArray index (ParArray index α) -> ParArray index α

split divides a configuration into sub-configurations. combine is used to flatten a nested ParArray.
2.2 Elementary Skeletons: Parallel Array Operators
In the following, we introduce functions, regarded as elementary skeletons, abstracting basic operations of the data parallel computation model. The following familiar functions abstract essential data parallel computation patterns:
map :: (α -> β) -> ParArray index α -> ParArray index β
map f <<x1, ..., xn>> = <<f x1, ..., f xn>>

imap :: (index -> α -> β) -> ParArray index α -> ParArray index β
imap f <<x1, ..., xn>> = <<f 1 x1, ..., f n xn>>

fold :: (α -> α -> α) -> ParArray index α -> α
fold (⊕) <<x1, ..., xn>> = x1 ⊕ ... ⊕ xn

scan :: (α -> α -> α) -> ParArray index α -> ParArray index α
scan (⊕) <<x1, ..., xn>> = <<x1, x1 ⊕ x2, ..., x1 ⊕ ... ⊕ xn>>
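In a conventional functional language these four skeletons correspond closely to familiar list functions. The following Haskell sketch is our own approximation (ParArray again modelled as a list, with 1-based integer indices for imap, and p-prefixed names to avoid clashing with the Prelude); it is meant only to make the definitions above concrete.

-- Elementary skeletons modelled on Haskell lists (an approximation of ParArray).
type ParArray a = [a]

pmap :: (a -> b) -> ParArray a -> ParArray b
pmap = map                        -- map f <<x1,...,xn>> = <<f x1,...,f xn>>

imap :: (Int -> a -> b) -> ParArray a -> ParArray b
imap f = zipWith f [1 ..]         -- passes each element's index to f

pfold :: (a -> a -> a) -> ParArray a -> a
pfold = foldr1                    -- a reduction; the operator should be associative

pscan :: (a -> a -> a) -> ParArray a -> ParArray a
pscan = scanl1                    -- partial reductions <<x1, x1+x2, ...>>

-- e.g. pfold (+) (pmap (*2) [1..4]) == 20
--      pscan (+) [1,2,3,4]          == [1,3,6,10]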
The function map abstracts the behaviour of broadcasting a parallel task to all the elements of an array. A variant of map is the function imap, which takes into account the index of an element when mapping a function across an array. The reduction operator fold and the partial reduction operator scan abstract tree-based parallel reduction computations over arrays. Data communication among parallel processors is expressed as the movement of elements in a ParArray. In SCL, a set of bulk data-movement functions are introduced as the data parallel counterpart of sequential loops and element assignments at the structure level. Communication skeletons can be generally divided into two classes: regular and irregular. The following rotate function is a typical example of regular data-movement (recall that we write <<x1, ..., xn>> for the element values of a parallel array):

rotate :: Int -> ParArray Int α -> ParArray Int α
rotate k A = << A (1 + ((i - 1 + k) mod n)) | i ∈ [1..n] >>

where n is the size of A.
For an m × n array, the following rotate_row and rotate_col operators express the rotation of all rows or columns, in which df is a function and (df i) indicates the distance of rotation for the ith row or column to be rotated.

rotate_row :: (Int -> Int) -> ParArray (Int,Int) α -> ParArray (Int,Int) α
rotate_row df A = << A (i, 1 + ((j - 1 + df i) mod n)) | i ∈ [1..m], j ∈ [1..n] >>
Broadcasting can be thought of as a regular data-movement in which a data item is broadcast to all sites and is aligned with the local data. This skeleton is defined as:

brdcast :: α -> ParArray index β -> ParArray index (α, β)
brdcast a A = map (align_pair a) A
where align_pair groups a data item with the local data of a processor. A variant of the brdcast operator is the function applybrdcast:

applybrdcast f i A = brdcast (f A(i)) A
This skeleton applies the function f locally to the data of the ith element and broadcasts the result. For irregular data-movement the destination is a function of the current index. This definition introduces various communication modes. Multiple array elements may arrive at one index (i.e. many to one communication). We model this by accumulating a sequential vector of elements at each index in the new array. Since the underlying implementation is non-deterministic, no ordering of the elements in the vector may be assumed. The index calculating function can specify either the destination of an element or the source of an element. Two functions, send and fetch, are provided to reflect this. Obviously, the fetch operation models only one to one, or one to many, communication. For the one dimensional case, the two functions have the following types:

send  :: (Int -> SeqArray Int Int) -> ParArray Int α -> ParArray Int (SeqArray Int α)
fetch :: (Int -> Int) -> ParArray Int α -> ParArray Int α

send f sends the element at each index j to every index in (f j), collecting the arriving elements at each index into a sequential vector; fetch f builds an array whose ith element is fetched from index (f i).
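Continuing the list-based approximation, the data-movement skeletons can be sketched as below. The 1-based indexing and the rotation direction are our assumptions rather than part of SCL's definition; a real implementation would realise these as collective communication rather than list indexing.

-- Communication skeletons over a list-based ParArray (indices are 1-based here).
type ParArray a = [a]
type SeqArray a = [a]

-- regular movement: rotate every element k places (cyclically)
rotate :: Int -> ParArray a -> ParArray a
rotate k xs = [ xs !! ((i - 1 + k) `mod` n) | i <- [1 .. n] ]
  where n = length xs

-- irregular movement, source-driven: element i of the result comes from index (f i)
fetch :: (Int -> Int) -> ParArray a -> ParArray a
fetch f xs = [ xs !! (f i - 1) | i <- [1 .. length xs] ]

-- irregular movement, destination-driven: element j is sent to every index in (f j);
-- several elements may land at the same index, so each result slot holds a vector
send :: (Int -> SeqArray Int) -> ParArray a -> ParArray (SeqArray a)
send f xs = [ [ x | (j, x) <- zip [1 ..] xs, i `elem` f j ] | i <- [1 .. length xs] ]

-- e.g. rotate 1 "abcd"            == "bcda"
--      fetch (\i -> 5 - i) "abcd" == "dcba"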
The above functions can be used to define more complex and powerful communication skeletons required by practical problems.
2.3 Computational Skeletons: Abstracting Control Flow
A key to achieving proper coordination is to provide the programmer with the flexibility to organise multi-threaded control flow in a parallel environment. In SCL, this flexibility is provided by abstracting commonly used parallel computational patterns as computational skeletons. The control structures of parallel processes can then be organised as the composition of computational skeletons. This structured approach to process coordination means that the behaviour of a parallel program is amenable to proper mathematical rigour and manipulation. Moreover, a fixed set of computational skeletons can be efficiently implemented across various architectures. In this subsection, we present a set of computational skeletons abstracting data parallel computation. The farm skeleton, defined by the following functional specification, captures the simplest form of data parallelism.

farm :: (α -> β -> γ) -> α -> ParArray index β -> ParArray index γ
farm f env = map (f env)

A function is applied to each object of a ParArray. The function also takes an environment which represents data that is common to all processes. Parallelism is achieved by utilising multiple processors to evaluate the jobs (i.e. "farming them out" to multiple processors). The farm skeleton can be defined in terms of the map operator of any underlying parallel data structure. The SPMD skeleton, defined as follows, abstracts the features of SPMD (Single Program Multiple Data) computation:
SPMD []            = id
SPMD ((gf, lf):fs) = (SPMD fs) ∘ gf ∘ (imap lf)
The skeleton takes a list of global-local operation pairs, which are applied over configurations of distributed data objects. The local operations are farmed to each processor and computed in parallel. Flat local operations, which contain no skeleton applications, can be regarded as sequential. The global operations over the whole configuration are parallel operations that require synchronisation and communication. Thus, the composition of gf and imap lf abstracts a single stage of SPMD computation, where the composition operator models the behaviour of barrier synchronisation. The iterUntil skeleton, defined as follows, captures a common form of iteration. The condition con is checked before each iteration. The function iterSolve is applied at each iteration, while the function finalSolve is applied when the condition is satisfied.

iterUntil iterSolve finalSolve con x
    = if con x then finalSolve x
      else iterUntil iterSolve finalSolve con (iterSolve x)
Variants of iterUntil can be used. For example, when an iteration counter is used, an iteration can be captured by the skeleton iterFor, defined as follows:

iterFor terminator iterSolve x = fst (iterUntil iterSolve' id con (x, 1))
    where iterSolve' (x, i) = (iterSolve i x, i+1)
          con (x, j) = j > terminator
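The computational skeletons also have a direct sequential reading. In the sketch below (our own approximation over the list model), each SPMD stage applies the local function to every element together with its index, exactly as imap does, and then applies the global function to the whole configuration; iterUntil and iterFor follow the definitions above.

-- Computational skeletons over the list-based ParArray model.
type ParArray a = [a]

farm :: (env -> a -> b) -> env -> ParArray a -> ParArray b
farm f env = map (f env)

-- a chain of SPMD stages: local work on every element (with its 1-based index),
-- then a global phase over the whole configuration, then the remaining stages
spmd :: [(ParArray a -> ParArray a, Int -> a -> a)] -> ParArray a -> ParArray a
spmd []              xs = xs
spmd ((gf, lf) : fs) xs = spmd fs (gf (zipWith lf [1 ..] xs))

iterUntil :: (a -> a) -> (a -> b) -> (a -> Bool) -> a -> b
iterUntil iterSolve finalSolve cond x
  | cond x    = finalSolve x
  | otherwise = iterUntil iterSolve finalSolve cond (iterSolve x)

iterFor :: Int -> (Int -> a -> a) -> a -> a
iterFor terminator iterSolve x0 = fst (iterUntil step id done (x0, 1))
  where
    step (x, i) = (iterSolve i x, i + 1)
    done (_, j) = j > terminator

-- e.g. iterFor 3 (\_ x -> x * 2) 1 == 8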
2.4 Parallel Matrix Multiplication: a Case Study
To highlight the expressive power of SCL, in this subsection we apply it to define the coordination structure of parallel matrix multiplication algorithms. We consider the following two matrix multiplication algorithms, which are adapted from [2].
Row-Column-Oriented Parallel Matrix Multiplication Consider the problem of multiplying matrices A (l × m) and B (m × n) and placing the result in C (l × n) on p processors. Initially, A is divided into p groups of contiguous rows and B is divided into p groups of contiguous columns. Each processor starts with one segment of A and one segment of B. The overall algorithm structure is an SPMD computation iterated p times. At each step the local phase of the SPMD computation multiplies the segments of the two arrays located locally, using a sequential matrix multiplication, and then the global phase rotates the distribution of B so that each processor passes its portion of B to its predecessor in the ring of processors. When the algorithm is complete, each processor has computed the portion of the result array C corresponding to the rows of A it holds. The computation is shown in Figure 2. The parallel structure of the algorithm is expressed in the following SCL program:
ParMM :: Int -> SeqArray index Float -> SeqArray index Float -> SeqArray index Float
ParMM p A B = gather DC
    where
      C = SeqArray ((1,SIZE(A,1)), (1,SIZE(B,2)))
                   [ (i,j) := 0 | i ∈ [1..SIZE(A,1)], j ∈ [1..SIZE(B,2)] ]
      <DA, DB, DC> = iterFor p step dist
      fl   = [(row_block p, id), (col_block p, id), (row_block p, id)]
      dl   = [A, B, C]
      dist = distribution fl dl
      step i = SPMD [(gf, SEQ_MM i)]
          where newDist = [id, rotate 1, id]
                gf X = redistribution newDist X

Figure 2: Parallel matrix multiplication: row-column-oriented algorithm.
where SEQ_MM is a sequential procedure for matrix multiplication. Data distribution is specified by the distribution skeleton with the partition strategies [(row_block p, id), (col_block p, id), (row_block p, id)] for A, B and C respectively. The data redistribution of B is performed by the rotate operator, which is encapsulated in the redistribution skeleton.
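As an informal cross-check of the coordination pattern just described (not of the SCL code itself), the following Haskell sketch simulates the row-column algorithm sequentially: each simulated processor holds a row block of A, multiplies it with whichever column block of B it currently holds, and the B blocks are then rotated around the ring. All names and the representation are ours; for small examples rowColMM p a b agrees with a direct matMul a b.

import Data.List (sortOn, transpose)

type Matrix = [[Double]]

-- plain sequential matrix multiplication (the role played by SEQ_MM)
matMul :: Matrix -> Matrix -> Matrix
matMul x y = [ [ sum (zipWith (*) row col) | col <- transpose y ] | row <- x ]

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = take n xs : chunksOf n (drop n xs)

-- one simulated processor: its row block of A, the index and contents of the
-- column block of B it currently holds, and the C slices computed so far
data Proc = Proc { aBlk :: Matrix, bIdx :: Int, bBlk :: Matrix, cSlices :: [(Int, Matrix)] }

step :: [Proc] -> [Proc]
step procs = rotateB (map local procs)
  where
    -- local phase: multiply the locally held blocks, remember which columns they fill
    local p = p { cSlices = (bIdx p, matMul (aBlk p) (bBlk p)) : cSlices p }
    -- global phase: each processor passes its B block to its predecessor,
    -- i.e. receives the block held by its successor in the ring
    rotateB ps = zipWith recv ps (tail ps ++ [head ps])
    recv p nxt = p { bIdx = bIdx nxt, bBlk = bBlk nxt }

rowColMM :: Int -> Matrix -> Matrix -> Matrix
rowColMM p a b = concatMap assemble (iterate step procs0 !! p)
  where
    procs0 = zipWith3 (\i ab bb -> Proc ab i bb [])
                      [0 ..] (chunksOf (length a `div` p) a) (colBlocks p b)
    colBlocks q m = map transpose (chunksOf (length (head m) `div` q) (transpose m))
    -- glue a processor's C slices together in column order
    assemble pr = foldr1 (zipWith (++)) (map snd (sortOn fst (cSlices pr)))

-- sanity check: rowColMM 2 a b == matMul a b  for, e.g.,
--   a = [[1,2],[3,4],[5,6],[7,8]] and b = [[1,0,2,1],[0,1,1,2]]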
Block-Oriented Parallel Matrix Multiplication This time we wish to multiply an l × m matrix A by an m × n matrix B on a p × p processor mesh with wraparound connections. Assume that l, m and n are integer multiples of p and that p is an even power of 2. Initially both A and B are partitioned into a mesh of blocks and each processor takes an (l/p) × (m/p) subsection of A and an (m/p) × (n/p) subsection of B (Fig. 3(a)). The parallel algorithm staggers each block row i of A to the left by i block column positions, and each block column i of B upward by i block row positions (Fig. 3(b)), and the data are wrapped around (Fig. 3(c)). The overall algorithm structure is also an SPMD computation iterated p times. At each step the local phase of the SPMD computation multiplies the pair of blocks located locally, using a sequential matrix multiplication, and then the global phase moves the data: each processor passes its portion of A to its left neighbour and passes its portion of B to its north neighbour (Fig. 4). The SCL code for this algorithm is shown below.
Figure 3: Block-oriented algorithm: initial distribution.

matrixMul :: Int -> SeqArray index Float -> SeqArray index Float -> SeqArray index Float
matrixMul p A B = gather DC
    where
      C = SeqArray ((1,SIZE(A,1)), (1,SIZE(B,2)))
                   [ (i,j) := 0 | i ∈ [1..SIZE(A,1)], j ∈ [1..SIZE(B,2)] ]
      <DA, DB, DC> = iterFor p step dist
      fl   = [(row_col_block p p, rotate_row df1),
              (row_col_block p p, rotate_col df1),
              (row_col_block p p, id)]
      dl   = [A, B, C]
      dist = distribution fl dl
      df1 i = i    (* the distance of rotation for the initial stagger *)
      step i = SPMD [(gf, SEQ_MM 0)]
          where newDist = [rotate_row df2, rotate_col df2, id]
                gf X = redistribution newDist X
                df2 i = 1    (* the distance of rotation at each step *)
The above examples show some important features of SCL as a general coordination language for parallel computation:

Abstraction: The parallel structure of a whole class of parallel matrix multiplication algorithms can be defined in the following SCL program:
Generic_matrixMul p distribustrategy redistribustrategy A B = gather DC
    where
      C = SeqArray ((1,SIZE(A,1)), (1,SIZE(B,2)))
                   [ (i,j) := 0 | i ∈ [1..SIZE(A,1)], j ∈ [1..SIZE(B,2)] ]
      dist = distribution distribustrategy [A, B, C]
      <DA, DB, DC> = iterFor p step dist
      step i = SPMD [(gf, SEQ_MM i)]
          where gf X = redistribution redistribustrategy X

Figure 4: Block-oriented algorithm: data movement w.r.t. processor P(1,2).
Thus, the row-column-oriented and the block-oriented parallel matrix multiplication programs become instances of the generic parallel matrix multiplication code, obtained by instantiating the corresponding distribution and redistribution strategies. That is, the SCL code for generic parallel matrix multiplication defines an algorithmic skeleton for parallel matrix multiplication. This example shows how an application oriented parallel computation structure can be systematically defined.
Code Migration: A systematic method of parallelising sequential code is very important for real world parallel programming. Most real parallel applications begin life not as blank sheets of paper, but as serial programs that run too slowly. Therefore, an ideal structured parallel programming framework should provide vital support for migrating sequential code into parallel programs. In SPP(X), such migration can be accomplished naturally. Taking the example of matrix multiplication, although there are variants of parallel matrix multiplication algorithms, they all share the same sequential code for the multiplication of submatrices. Thus, parallelising a sequential matrix multiplication is realised mainly by defining the coordination structure for the parallel invocation of this sequential matrix multiplication code on the distributed submatrices.
Portability: Portability means reusability, or recyclability in a broad sense. We would like to reuse as much of the same parallel code as possible across various parallel architectures. The most important advantage of SCL's higher level abstraction of parallel computation structure is that it provides crucial support for the reusability of parallel code. Consider the two parallel matrix multiplication algorithms again: according to Quinn's analysis [2], both algorithms require the same number of computation steps. However, the block oriented algorithm is uniformly superior to the row-column oriented one when the number of processors is an even power of 2 greater than or equal to 16, as less time is required for communication. Therefore, when we port row-column oriented matrix multiplication code to a large parallel machine on which the block oriented algorithm has the better performance, we would like to update the code to gain that performance. As presented above, the architecture dependent features of parallel matrix multiplication can be abstracted as the distribution/redistribution strategies corresponding to a particular architecture. With this abstraction, the code for generic parallel matrix multiplication can be reused when porting to any parallel machine.
Modularity: Modularity is a direct result of our structured parallel program construction. We can easily identify the generic parallel matrix multiplication as a module. Composition of modules is a direct result of the composability of functions.
3 Fortran-S: Coordinating Fortran Programs with SCL

3.1 The Language
As an exercise in developing a concrete SPP language, we are designing a language, Fortran-S, to act as a powerful front end for Fortran based parallel programming. Conceptually, the language is designed by instantiating the base language in the SPP scheme with Fortran. Thus, to write a parallel program in Fortran-S, we use SCL as a coordination language to define the parallel structure of the program. Local sequential computation for each processor is then programmed in Fortran. For example, the following Fortran-S program performs matrix addition in parallel.

matrixAdd p A B = (gather ∘ map SEQ_ADD) (distribution fl dl)
    where
      C  = SeqArray ((1,SIZE(A,1)), (1,SIZE(A,2)))
      fl = [(row_block p, id), (row_block p, id), (row_block p, id)]
      dl = [A, B, C]
where the Fortran subroutine SEQ_ADD is given by (throughout this paper we adopt the syntax of Fortran 90):

SUBROUTINE SEQ_ADD (X, Y, Z)
  REAL, DIMENSION (:,:), INTENT (IN)    :: X
  REAL, DIMENSION (:,:), INTENT (IN)    :: Y
  REAL, DIMENSION (:,:), INTENT (INOUT) :: Z
  INTEGER :: I, J
  DO I = 1, SIZE(X,1)
    DO J = 1, SIZE(X,2)
      Z (I,J) = X (I,J) + Y (I,J)
    END DO
  END DO
END SUBROUTINE SEQ_ADD
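To make the two-level structure concrete, the coordination expressed by matrixAdd can be mimicked sequentially: distribute row blocks of both operands, add corresponding blocks "locally", and gather the results. The Haskell sketch below only illustrates the shape gather ∘ map SEQ_ADD ∘ distribution; it is not the Fortran-S implementation, and all names are ours.

-- A sequential reading of: matrixAdd p A B = (gather . map seqAdd) (distribution ...)
type Matrix = [[Double]]

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = take n xs : chunksOf n (drop n xs)

matrixAddSim :: Int -> Matrix -> Matrix -> Matrix
matrixAddSim p a b = concat (map seqAdd conf)      -- gather . map seqAdd
  where
    conf = zip (chunksOf blk a) (chunksOf blk b)   -- row_block distribution + align
    blk  = length a `div` p
    seqAdd (x, y) = zipWith (zipWith (+)) x y      -- the local, sequential SEQ_ADD role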
The matrix multiplication examples (section 2) can be coded in Fortran-S by instantiating the sequential local procedure SEQ_MM with the following Fortran subroutine for matrix multiplication:

SUBROUTINE SEQ_MM (IT, IDX, X, Y, Z)
  INTEGER, INTENT (IN) :: IT, IDX
  REAL, DIMENSION (:,:), INTENT (IN)    :: X
  REAL, DIMENSION (:,:), INTENT (IN)    :: Y
  REAL, DIMENSION (:,:), INTENT (INOUT) :: Z
  INTEGER :: I, J, K, START
  START = MOD ((IT + IDX) * SIZE(Y,2), SIZE(Z,2))
  DO I = 1, SIZE(X,1)
    DO J = 1, SIZE(Y,2)
      DO K = 1, SIZE(Y,1)
        Z (I,J+START) = Z (I,J+START) + X (I,K) * Y (K,J)
      END DO
    END DO
  END DO
END SUBROUTINE SEQ_MM
The argument intent of the parameters identifies the intended use of the variables. Variables specified with INTENT(IN) must not be redefined by the procedure. Variables specified with INTENT(INOUT) are expected to be redefined by the procedure, and variables specified with INTENT(OUT) pass information out of the procedure. In Fortran-S, the basic data type for SCL programming is the ParArray, which is regarded as the parallel data structure, whilst the basic data types of Fortran, including arrays, are the sequential data structures. Thus, Fortran subroutines handle only Fortran data objects. A major issue of this two-level language design is the interface between the SCL constructs and Fortran subroutines. Some basic principles have been established:

Call Mechanism: To transfer data from an SCL skeleton to a Fortran subroutine, a Fortran function/subroutine name is called in the SCL and is usually followed by a tuple of elements of basic type. This tuple will be copied and converted into arguments with Fortran data types for the Fortran subroutine.

Type Correctness: When mapping a Fortran subroutine over a parallel data structure, the Fortran subroutine receives arguments whose types are sequential data types. It is a type error if a Fortran subroutine is applied to a parallel structure.

Localising Side Effects: The interface must guarantee that side-effects are localised to each Fortran program segment. That is, Fortran sequential programs should not be able to update global parallel data structures; otherwise, transformation-based optimisation would not be possible. We adopt a simple mechanism to isolate side effects by semantically copying all the inputs to a Fortran program. Possible updating of these copies will not affect the original input data. A Fortran program then behaves like a function returning a value for each call. With Fortran 90's argument intent attributes [6], localising side effects is made easy, since we can simply declare global variables that cannot be updated with the INTENT(IN) attribute.

With such an interface protocol, formal functional reasoning can be fully applied to the top level parallel structure defined in terms of SCL. Fortran-S can be implemented by transforming Fortran-S programs into conventional parallel Fortran programs. The architecture of the Fortran-S implementation is illustrated in Figure 5.

Figure 5: Fortran-S architecture (a Fortran-S program is compiled by the front-end, transformed and optimised into an intermediate form, and then translated to VPP Fortran, HPF or Fortran 90 with a communication library for the target parallel architecture).

The Fortran-S front-end system performs syntax checking, type checking and semantic analysis on Fortran-S programs.
Due to the functional property of the SCL level, source level program transformation can be applied to optimise the parallel behaviour of the program, including granularity adjustment, nested parallelism flattening, optimised data distribution and interprocessor communication. For example, the intermediate code for the matrix multiplication (row-column oriented algorithm) is as follows.

scl_ParMM(p, A, B)
  p: Int;
  A: Real, Dim(l,m);
  B: Real, Dim(m,n);
  C: Real, Dim(l,n);
  I,J: Int;    (loop index)
  Loop (I = 1,l), (J = 1,n)
    C(I,J) = 0;
  EndLoop;
  dl = (A : B : C);
  fl = (((row_blk p),id) : ((col_blk p),id) : ((row_blk p),id));
  dist = distribution fl dl;
  Loop (I = 1,p)
    DA = nth 1 dist;
    DB = nth 2 dist;
    DC = nth 3 dist;
    newdl = (id : (rotate 1) : id);
    gf = Lam(X, (redistribution newdl X));
    dist = iSPMD (gf, F90_SeqMM(I), <DA, DB, DC>);
  EndLoop;
  DC = nth 3 dist;
  Return(Gather DC);
The intermediate form can then be produced by the front-end. The intermediate form of the program describes the parallelism in terms of the SPMD computation model (other forms of parallel computation having been transformed into SPMD computation) and thus it can be supported naturally by parallel languages such as HPF, VPP Fortran [9] and Fortran 90 with standard communication libraries such as MPI or PVM and simple barrier synchronisation mechanisms. The translations from the intermediate form to these Fortran systems are similar. We are currently building a prototype system based on Fortran 77 plus MPI targeted at a Fujitsu AP-1000 machine, and we have tested the matrix multiplication example (translated to Fortran 77 + MPI) on the AP-1000 with an array size of 400 × 400. Due to the richness of the information provided by the Fortran-S code, the performance data is very encouraging. The performance data is listed in the following table and the speedup is shown in Fig. 6.

  nprocs   r-time    speedup
       1   521.414      1.0
       4   133.215      3.9
       8    66.110      7.9
      10    52.734      9.9
      16    32.883     15.9
      20    26.546     19.6
      25    21.459     24.3
      50    12.314     42.3
      80     8.953     58.2
     100     7.435     70.1

Figure 6: Parallel matrix multiplication: speedup.
3.2 Comparison with HPF
HPF supports data parallel programming by adding extensions to Fortran 90, including the FORALL statement and compiler directives for data distribution. Our work has obviously been motivated by HPF. For example, there is a direct correspondence between the distribution specified by the HPF directives and the distribution skeleton. The following SCL expression:
distribution ((row_col_block m n), id) ((row_col_block m n), id) A B

specifies the same data distribution defined by the following HPF directive code:

!HPF$ PROCESSORS Procs(P, Q)
!HPF$ DISTRIBUTE (BLOCK, BLOCK) ONTO Procs :: TEMPLATE
!HPF$ ALIGN A(I,J) WITH TEMPLATE(I,J)
!HPF$ ALIGN B(I,J) WITH TEMPLATE(I,J)
In addition, in HPF, an array may be redistributed at run-time by declaring it to be DYNAMIC. Thus, configuration skeletons can be regarded as functional abstractions of HPF directives. Since SCL configuration skeletons are freely composable, they are much more flexible than the fixed HPF directives. Moreover, the SCL operators provide a more powerful means of expressing data alignment and movement. Take the block-oriented matrix multiplication algorithm in section 2 as an example: when aligning/realigning the subblocks of A, B and C, as the sizes of the three subblocks may be different, we would have to write complex index expressions to specify the relative positions of the three subblocks, as shown in the following HPF code (here we suppose that the indices of the three arrays begin from 0 and that the MOD function can accept negative integer arguments):

!HPF$ PROCESSORS Procs(P, P)
!HPF$ DISTRIBUTE (BLOCK, BLOCK) ONTO Procs :: C
!HPF$ ALIGN A(I, J), DYNAMIC, &
      WITH C (I, ((J / MM) * NN + (J MOD NN) - (I / MM) * NN) MOD N)
!HPF$ ALIGN B(I, J), DYNAMIC, &
      WITH C (((I / MM) * LL + (I MOD LL) - (J / MM) * LL) MOD L, J)

where: L = SIZE (A,1), M = SIZE (A,2), N = SIZE (B,2), LL = L / P, MM = M / P, NN = N / P

The index expression for realigning is even more complex. An alternative way of achieving such aligning/realigning in HPF is to reconstruct the three matrices so that their outer dimensions have the same shape as the processor template, but this still needs complex index computation. In general, HPF and Fortran-S both adopt the coordination approach. Using skeletons as higher level coordination forms gives Fortran-S the power to define parallel computation structures in a more flexible and structured way. In many ways, Fortran-S encourages a more explicit way of programming parallel control. For example, HPF programmers need not specify the communication required, as the compiler is assumed to detect it, whereas Fortran-S requires the programmer to distinguish communication from general computation using the higher level bulk data communication operators. It therefore encourages a precise description of the problem characteristics. Thus, the SCL operators for data communication make high level communication optimisation easy and provide a route to efficient implementation using lower level communication libraries.
4 Reasoning and Optimisation by Transformation

One of the significant advantages of the functional abstraction mechanism of SCL is that meaning preserving transformation techniques can be applied generally to optimise the parallel coordination structure specified uniformly in terms of skeletons. Thus, with such a high level functional specification of parallel behaviour, compile-time optimisation can be systematically realised based on a class of transformation rules. We overview some basic transformation rules in this section.
Map Fusion: A simple but important transformation which reduces parallel overhead is the following map fusion law:

map f ∘ map g = map (f ∘ g)

To maintain the synchronous semantics of the system some form of barrier synchronisation must be performed between the two maps on the left-hand side. This transformation reduces the need to perform a barrier synchronisation and provides for better load balancing. Since the map function abstracts parallel loops (e.g. the FORALL construct in HPF), this transformation law is a functional abstraction of the loop fusion technique [7] of conventional compiler optimisation.
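The fusion law is easy to check in the list model used earlier; a minimal sketch (the function names are ours):

-- Map fusion in the list model: two traversals become one.
fused, unfused :: [Int] -> [Int]
unfused = map (+ 1) . map (* 2)      -- two parallel steps, with a barrier in between
fused   = map ((+ 1) . (* 2))        -- one parallel step, no intermediate barrier

checkFusion :: Bool
checkFusion = and [ unfused xs == fused xs | xs <- [[], [1 .. 10], [5, 3, 9]] ]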
Map Distribution: The following transformation, called map distribution, is intended to increase the level of parallelism in a program:

foldr1 (f ∘ g) = fold f ∘ map g

Imagine that f is associative and that g is applied to an element before f combines the accumulated result and the element. Clearly, the left-hand side is not parallel, as the combined function (f ∘ g) is not associative. However, by splitting the foldr1 into a fold and a map the program becomes parallel. This performs a very similar role to the well known loop distribution optimisation technique presented in [7].
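The intent of the rule can be illustrated as follows: a left-to-right accumulation that applies g to each element as it is combined is equivalent, for associative f, to applying g everywhere (a parallel map) and then reducing with a balanced, parallelisable fold. The sketch below is our own illustration of that reading; treeFold merely stands in for a parallel fold.

-- Sequential accumulation vs. the parallelisable fold-of-map, for associative f.
seqVersion :: (b -> b -> b) -> (a -> b) -> [a] -> b
seqVersion f g (x : xs) = foldl (\acc y -> f acc (g y)) (g x) xs
seqVersion _ _ []       = error "seqVersion: empty array"

treeFold :: (a -> a -> a) -> [a] -> a
treeFold _ []  = error "treeFold: empty array"
treeFold _ [x] = x
treeFold f xs  = f (treeFold f l) (treeFold f r)   -- balanced, tree-shaped reduction
  where (l, r) = splitAt (length xs `div` 2) xs

parVersion :: (b -> b -> b) -> (a -> b) -> [a] -> b
parVersion f g = treeFold f . map g                -- the "fold f . map g" form

-- e.g. seqVersion (+) (^ 2) [1 .. 5] == parVersion (+) (^ 2) [1 .. 5]   -- both 55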
Communication Algebra: As all data-movement patterns are described using functional operators, it is possible to develop a set of rules to optimise communication. For example, the rules:

send f ∘ send g = send (f ∘ g)
fetch f ∘ fetch g = fetch (g ∘ f)

enable us to remove a communication step by combining two communication steps into one. By taking these transformation rules for optimising data communication as algebraic laws, a powerful communication algebra is developed [4].
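Using the list-based fetch from the earlier sketch, the second law can be checked concretely; note the reversal of f and g on the right-hand side. The helper below is ours and assumes the index-mapping functions stay within bounds.

-- Checking  fetch f . fetch g  ==  fetch (g . f)  in the list model (1-based indices).
fetch :: (Int -> Int) -> [a] -> [a]
fetch f xs = [ xs !! (f i - 1) | i <- [1 .. length xs] ]

checkFetchLaw :: (Int -> Int) -> (Int -> Int) -> String -> Bool
checkFetchLaw f g xs = fetch f (fetch g xs) == fetch (g . f) xs

-- e.g. checkFetchLaw (\i -> 1 + (i `mod` 4)) (\i -> 5 - i) "abcd" == True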
Flattening: Transformation can be applied to flatten nested data parallelism. Let

sgf = gf0 ∘ map gf1 ∘ (split P)

then the rule for flattening is:

SPMD [(gf0, SPMD [(gf1, lf1)])] ∘ (split P) = SPMD [(sgf, lf1)]
With this rule, nested SPMD computation can be transformed into a flat data parallel computation with a segmented global function sgf. Thus, sgf provides a similar functionality to the segmented instructions [13] used in the NESL language implementation. More transformation rules can be found in [4].
5 Conclusion

In this paper we have proposed functional skeletons as a new mechanism for developing general purpose parallel coordination systems. The work stems from our original work on functional skeletons capturing recurring patterns of parallel computation. This has been extended so that control of all aspects of parallel computation can now be expressed using skeletons. Therefore, it provides an ideal means for coordinating parallel computation. In the paper, we presented the SCL language for skeleton based coordination and, based on SCL, we achieve a structured parallel programming scheme SPP(X), where SCL is applied to coordinate computation programmed in a base language X. This work presents a significant synthesis of some major developments in the design of parallel programming systems, including the coordination approach, data parallel programming, skeleton-based higher level construction of parallel applications and declarative parallel programming. It provides a promising solution to the engineering problems of developing a practical structured programming paradigm for constructing verifiable, reusable and portable parallel programs.
References

[1] G. C. Fox. Achievements and prospects for parallel computing. Concurrency: Practice and Experience, Vol 3(6), Dec. 1991.
[2] Michael J. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill, second edition, 1994.
[3] J. Darlington, A. J. Field, P. G. Harrison, P. H. J. Kelly, D. W. N. Sharp, and Q. Wu. Parallel programming using skeleton functions. In Parallel Languages And Architectures, Europe: PARLE 93. Springer Verlag, 1993.
[4] J. Darlington, Y. Guo, H. W. To. Structured parallel programming: theory meets practice. To appear in New Directions of Computing.
[5] High Performance Fortran Forum. Draft High Performance Fortran Language Specification, version 1.0. Available as technical report CRPC-TR92225, January 1993.
[6] ANSI X3J3/S8.115. Fortran 90. June 1990.
[7] Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng. Compiling Fortran D for MIMD distributed memory machines. Communications of the ACM, 35(8):66-80, August 1992.
[8] Ian Foster, K. M. Chandy. Fortran M: A Language for Modular Parallel Programming. Technical Report, Argonne National Laboratory, 1992.
[9] H. Iwashita, T. Shindo, S. Okada, M. Nakanishi and H. Nakakura. VPP Fortran and parallel programming. Proc. of 3rd Parallel Computing Workshop, Fujitsu, Japan, 1994.
[10] R. S. Francis et al. A data parallel scientific modelling language. Journal of Parallel and Distributed Computing, 21, 46-60, 1994.
[11] B. Chapman, H. Zima, and P. Mehrotra. Handling distributed data in Vienna Fortran procedures. In Uptal Banerjee, David Gelernter, Alex Nicolau, and David Padua, editors, Languages and Compilers for Parallel Computing, 5th International Workshop Proceedings, pages 248-263. Springer Verlag, August 1992.
[12] G. W. Sabot. The Paralation Model. MIT Press, Cambridge, MA, 1988.
[13] Guy E. Blelloch. Compiling collection-oriented languages onto massively parallel computers. Journal of Parallel and Distributed Computing, 8, 1990, pp 119-134.
[14] N. Carriero and D. Gelernter. Coordination languages and their significance. Communications of the ACM, Vol 35, Feb. 1992.
[15] S. Lucco and O. Sharp. Parallel programming with coordination structures. Proc. of 18th ACM POPL, 1991.
[16] Ian Foster, Robert Olson, and Steven Tuecke. Productive parallel programming: The PCN approach. Scientific Programming, 1(1), 1992.
[17] S. Pelagatti. A Methodology for the Development and the Support of Massively Parallel Programs. PhD thesis, Universita Degli Studi Di Pisa, 1993.
[18] David A. Padua and Michael J. Wolfe. Advanced compiler optimisations for supercomputers. Communications of the ACM, 29(12):1184-1201, December 1986.
[19] Michael Burke and Ron Cytron. Interprocedural dependence analysis and parallelisation. In SIGPLAN Symposium on Compiler Construction, pages 162-175. ACM, July 1986.
[20] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. University of Tennessee, 1994.