Efficient Parallel Programming with Algorithmic Skeletons

George Horatiu Botorog*
Herbert Kuchen
Aachen University of Technology, Lehrstuhl für Informatik II, Ahornstr. 55, D-52074 Aachen, Germany
{botorog, [email protected]
Abstract. Algorithmic skeletons are polymorphic higher-order functions representing common parallelization patterns and implemented in parallel. They can be used as the building blocks of parallel and distributed applications by integrating them into a sequential language. In this paper, we present a new approach to programming with skeletons. We integrate the skeletons into an imperative host language enhanced with higher-order functions and currying, as well as with a polymorphic type system. We thus obtain a high-level programming language which can be implemented very efficiently. After describing a series of skeletons which work with distributed arrays, we give two examples of parallel algorithms implemented in our language, namely matrix multiplication and a statistical numerical algorithm for solving partial differential equations. Run-time measurements show that we approach the efficiency of message-passing C up to a factor between 1 and 1.75.
1 Introduction
Algorithmic skeletons represent an approach to parallel programming which combines the advantages of high-level (declarative) languages and those of efficient (imperative) ones. The aim is to obtain languages that allow easy parallel programming, in which the user does not have to handle low-level features such as communication and synchronization, and does not face problems like deadlocks or non-deterministic program runs. At the same time, skeletons should be efficiently implemented, coming close to the performance of low-level approaches, such as message-passing on distributed memory machines. A skeleton is an algorithmic abstraction common to a series of applications, which can be implemented in parallel [4]. Skeletons are embedded into a sequential host language, thus being the only source of parallelism in a program. Most skeletons, like for instance map, farm and divide&conquer [5], are polymorphic higher-order functions, and can thus be defined in functional languages in a straightforward way. This is why most languages with skeletons build upon a functional host [6, 9]. However, programs written in these languages are still 5 to 10 times slower than their low-level counterparts, e.g. parallel C [9].

* The work of this author is supported by the "Graduiertenkolleg Informatik und Technik" at the Aachen University of Technology.
A series of approaches use an imperative language as a host, like for instance P3L [1], which builds on top of C++, and in which skeletons are internal language constructs. The main drawbacks are the difficulty of adding new skeletons and the fact that only a restricted number of skeletons can be used. A related approach is PCN [8], where templates provide a means to define reusable parallel program structures. The language also contains lower-level features, such as definitional variables, by which processes can communicate, but which can also lead to problems such as deadlocks. In the language ILIAS [12], arithmetic and logic operators are extended by overloading to pointwise operators that can be applied to the elements of matrices. These operations are actually not skeletons in the sense of the former definition, but their functionality resembles that of some skeletons. SPP(X) [6] uses a 'two-layer' language: a high-level, functional language for the application and a low-level language (at present Fortran) for efficient sequential code that is coordinated by the skeletons.

In this paper, we present a new approach to programming with algorithmic skeletons. Firstly, we describe the host language, which contains higher-level features, but can be efficiently implemented. After that, we define the data structure "distributed array" and present some data-parallel skeletons operating on it. We then show how two parallel applications can be programmed in a sequential style by using these skeletons. Run-time measurements show that our results are better than those obtained by using pure functional languages with skeletons, approaching direct low-level (C) implementations.
2 The Approach

An important characteristic of skeletons is their generality, i.e. the possibility to use them in different applications. For this, most skeletons are parameterized by functions and have a polymorphic type. Since we wanted to obtain a high efficiency, we have used the imperative language C as the basis for our host. However, in order to allow skeletons to be integrated in their full generality, we have provided the language with the following features:

- higher-order functions, i.e. functions with functional arguments and/or result.
- partial applications, i.e. the application of an n-ary function to k ≤ n arguments. Partial applications are useful in generating new functions at run-time, as shown in Section 4.
- conversion of operators to functions, which allows passing operators as functional arguments, as well as partial applications of operators. This conversion is done by enclosing the operator between brackets, e.g. (+).
- a polymorphic type system, which allows skeletons to be reused in solving similar or related problems. Polymorphism is achieved by using type variables, which can be instantiated with arbitrary types. Syntactically, a type variable is an identifier which begins with a '$', e.g. $t.
- distributed data structures, which can be defined by means of the 'pardata' construct with the syntax

     pardata name $t1 ... $tn [implem] ;
where $t1, ..., $tn are type variables and implem is a (polymorphic) type, representing the data structure to be created on each processor. Distributed data structures may not be nested; in particular, the type arguments of a pardata construct cannot be instantiated with other pardatas. Moreover, some additional problems appear if dynamic (pointer-based) data types are used for this instantiation. This issue is addressed in [3]. We have called our language Skil, as an acronym for Skeleton Imperative Language. A detailed description of the language and its functional features is given in [2].

After this brief outline, we want to compare Skil with two related approaches. SISAL [7] is a functional language which achieves Fortran performance. However, it does not support partial applications and polymorphism, and thus cannot be used directly as a host for our skeletons. On the other hand, SPP(X)'s functional language [6] contains more features than are strictly needed to integrate the skeletons. We therefore expect our language to be more efficient. A comparison of Skil with further related approaches, like data-parallel languages and C++ extensions, can be found in [3].
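As a small illustration of these features, consider the following sketch of a higher-order function together with a partially applied operator. It is an illustration only; the declaration syntax for functional parameters shown here follows the skeleton signatures given in Section 3.

   /* apply the argument function f twice to x */
   $t twice ($t f ($t), $t x)
   {
      return f (f (x)) ;
   }

   /* (+) converts the operator into a function; (+)(1) partially
      applies it, yielding a unary increment function */
   int add_two (int y)
   {
      return twice ((+)(1), y) ;    /* computes y + 2 */
   }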
3 Skeletons for Arrays

Depending on the kind of parallelism used, skeletons can be classified into process parallel and data parallel ones. In the first case, the skeleton creates a series of processes which run concurrently, whereas in the second case, it acts upon some distributed data structure. Although both categories can be integrated in Skil, we place the emphasis here on data parallelism. We shall describe a series of skeletons which work on the distributed data structure "array". This data structure is defined as:

   pardata array <$t> [ ... implementation ... ] ;

where the type parameter $t denotes the type of the elements of the array. At present, arrays can be distributed only block-wise onto processors. Each processor thus gets one block (partition) of the array, which, apart from its elements, contains the local bounds of the partition. These bounds are accessible via the macro array_part_bounds:

   Bounds array_part_bounds (array <$t> a) ;

where Bounds is a data structure comprising two Index structures (see Subsection 3.1) representing the lowest and highest indices of the current partition, respectively. Array elements can be accessed by using the macros:

   $t array_get_elem (array <$t> a, Index ix) ;
   void array_put_elem (array <$t> a, Index ix, $t newval) ;
which read, respectively overwrite, a given array element. An important aspect is that these macros can only be used to access local elements, i.e. the index ix should be within the bounds of the array partition currently placed on each processor. The reason for this restriction is that remote accessing of single array elements easily leads to inefficient programs. Non-local element accessing is still possible, however only in a coordinated way by means of skeletons.

We shall now present some skeletons for the distributed data structure array. For each skeleton, its syntax, informal semantics and complexity are given. Note that for the higher-order skeletons, this complexity is given as a function of the complexities of its functional arguments, since the use of different customizing functions may lead to a different overall complexity. p denotes the number of processors (and, hence, of array partitions) and d the dimension of the array. We assume for simplicity that our arrays have the same size n in each dimension. For each skeleton, we give the actual computation time t(n), whereas the overall complexity can be derived as c(n) = t(n) · p. For simplicity, we consider that a local operation on a processor and the sending of one array element from one processor to one of its neighbors equally take one time unit.
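As an illustration of purely local element access, the following sketch adds a constant to every element of the local partition of a 2-dimensional array. The function itself is only an illustration; the Bounds field names lowerBd and upperBd are those used in the code in Section 4.

   void add_local (array <int> a, int delta)
   {
      Bounds bds ;
      Index ix ;
      int i, j ;

      bds = array_part_bounds (a) ;
      /* traverse only the indices of the local partition */
      for (i = bds->lowerBd[0] ; i < bds->upperBd[0] ; i++)
         for (j = bds->lowerBd[1] ; j < bds->upperBd[1] ; j++)
         {  ix[0] = i ;  ix[1] = j ;
            array_put_elem (a, ix, array_get_elem (a, ix) + delta) ;
         }
   }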
3.1 Constructor Skeletons
array_create creates a new, block-wise distributed array and initializes it using a given function. The skeleton has the following syntax:

   array <$t> array_create (int dim, Size size, Size blocksize,
                            Index lowerbd, $t init_elem (Index), int distr) ;
where:

- dim is the number of dimensions of the array. At present, only one- and two-dimensional arrays are supported. The types Size and Index are (classical) arrays with dim components.
- size contains the global sizes of the array.
- blocksize contains the sizes of a partition. Passing a zero value for a component lets the skeleton fill in an appropriate value depending on the network topology.
- lowerbd is the lowest index of a partition. Passing a negative value for a component lets the skeleton derive the lower local bound for this dimension.
- init_elem is a user-defined function that initializes each element of the array depending on its index.
- distr gives the virtual (software) topology onto which the array should be mapped (where available). In our implementation based on Parix [13, 15], an array can be mapped directly onto the hardware topology (DISTR_DEFAULT), onto a virtual ring topology (DISTR_RING), which can be useful for 1-dimensional arrays, or onto a virtual 2-dimensional torus topology (DISTR_TORUS2D), which can be useful for 2-dimensional arrays, as shown in the first example in Section 4.
If we denote the complexity of the initialization function by t_i, then the time complexity of the skeleton is t(n) ∈ O(t_i · n^d / p), since we have to call init_elem for each element of a partition and all partitions are processed in parallel.

array_destroy deallocates an existing array. It has constant complexity t(n) ∈ O(1).

   void array_destroy (array <$t> a) ;
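A possible use of array_create is sketched below: an n × n integer array initialized to the unit matrix. The function and names are illustrative only; the {n,n} notation for a classical two-element array is the pseudo-code convention also used in the examples of Section 4.

   /* initialization function: 1 on the diagonal, 0 elsewhere */
   int init_unit (Index ix)
   {
      return (ix[0] == ix[1]) ? 1 : 0 ;
   }

   array <int> create_unit (int n)
   {
      /* partition sizes and lower bounds are chosen by the skeleton;
         the array is mapped onto the default (hardware) topology */
      return array_create (2, {n,n}, {0,0}, {-1,-1},
                           init_unit, DISTR_DEFAULT) ;
   }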
3.2 Computational Skeletons
array_map applies a given function to all elements of an array and puts the results into another array. However, the two arrays can be identical; in this case the skeleton makes an in-situ replacement. The syntax is:

   void array_map ($t2 map_f ($t1, Index), array <$t1> from, array <$t2> to) ;
The source and the target arrays do not necessarily have the same element type. The result is placed in another array rather than returned, since the latter would lead to the creation of a temporary distributed array, whereas our solution avoids this additional memory consumption. The return-solution is however used in array_create, since this skeleton allocates the new array anyway, whereas array_map only 'fills in' new values in an existing array. Note that this efficiency improvement is not directly possible in a functional host language, where side-effects are not allowed¹. The complexity of this skeleton is similar to that of array_create: t(n) ∈ O(t_m · n^d / p), where t_m is the complexity of the applied function map_f.
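A small usage sketch (illustrative only): doubling every element of an integer array in place.

   /* customizing function; the index argument is ignored here */
   int double_elem (int x, Index ix)
   {
      return 2 * x ;
   }

   /* in-situ replacement: source and target array are identical */
   array_map (double_elem, a, a) ;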
array_fold composes ("folds together") all elements of an array.

   $t2 array_fold ($t2 conv_f ($t1, Index), $t2 fold_f ($t2, $t2), array <$t1> a) ;

The skeleton first applies the conversion function conv_f to all array elements in a map-like way². After that, each processor composes all elements of its partition using the folding function fold_f. In the next step, the results from all partitions are folded together. Since the order of composition is not fixed, the user should provide an associative and commutative folding function; otherwise the result is non-deterministic. In our implementation, this step is performed along the edges of a virtual tree topology, with the result finally collected at the root. In order to make the result known to all processors, it is broadcast from the root along the tree edges to all other processors.

¹ A work-around might be deforestation, however this is difficult for higher-order programs.
² This step could also be done by a preliminary array_map, but our solution is more efficient.
If t_c and t_f are the complexities of the conversion and folding function, respectively, then the complexity of the local computations is given by the initial map and the local (sequential) folding and amounts to O((t_c + t_f) · n^d / p). The complexity of communication and non-local computation is given by the folding of the single results from each processor and by the broadcasting of the final result, and is O((t_f + 1) · log₂ p)³. The overall complexity of the fold skeleton is thus t(n) ∈ O(t_c · n^d/p + t_f · (n^d/p + log₂ p)).

³ Actually, if the hardware topology is not a tree, then this complexity becomes O((t_f + 1) · δ · log₂ p), where δ is the dilation of embedding the tree into the hardware topology [15].
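As a usage sketch (illustrative only), the maximum of an integer array can be computed as follows; the conversion function simply passes the element through, and the folding function is associative and commutative, as required.

   int keep_elem (int x, Index ix)   { return x ; }
   int int_max   (int x, int y)      { return (x > y) ? x : y ; }

   int array_max (array <int> a)
   {
      return array_fold (keep_elem, int_max, a) ;
   }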
3.3 Communication Skeletons
array_rotate_parts_horiz works on a hardware mesh topology of √p × √p processors, on which a virtual torus topology has been defined, by cyclically shifting the partitions of an array in horizontal direction. The number of shifts is given by the return value of the argument function, which takes the row coordinate of the current processor as argument. If the result of this function is positive, then a shift to the right is performed, otherwise a shift to the left. A number of shifts s with |s| > √p/2 is reduced to (√p − |s|) mod √p shifts in the opposite direction. The skeleton array_rotate_parts_vert analogously rotates the partitions of an array vertically on the processor topology, where the positive direction is downwards. The syntax of the two skeletons is:

   void array_rotate_parts_horiz (int f (int), array <$t> a, array <$t> b) ;
   void array_rotate_parts_vert  (int f (int), array <$t> a, array <$t> b) ;
The complexity of the two skeletons depends on the result of the argument function f: t(n) ∈ O( (n²/p) · min{ √p/2, max_{i ∈ {0,...,√p−1}} f(i) } ).
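Possible argument functions for these skeletons are sketched below; minone is used (but not shown) in the matrix multiplication example of Subsection 4.1, so its definition here is an assumption.

   /* shift the partitions in row (or column) i by i positions in the
      negative direction, as needed for the initial skew in Gentleman's
      algorithm (Subsection 4.1) */
   int neg_coord (int i)  { return -i ; }

   /* shift every row (or column) by one position in the negative
      direction, regardless of its coordinate */
   int minone (int i)     { return -1 ; }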
4 Sample Applications
We shall now present two sample applications illustrating the way parallel programs can be written using algorithmic skeletons. The first program multiplies two matrices using Gentleman's algorithm [14], the second uses a statistical ("Monte Carlo") numerical method to solve a partial differential equation [11]. We have implemented the skeletons and applications on a Parsytec MC system with 64 T800 transputers connected as a 2-dimensional mesh and running at 20 MHz, under the Parix operating system [13]. Since only 1 MB of memory was available per node, larger problem sizes could only be fitted into larger networks. We have compared our results with those obtained for the same applications using the data-parallel functional language DPFL [9, 10] and the same skeletons. Our run-times are faster than those of DPFL, approaching in most cases those of hand-written C code. On the one hand, this is due to the efficiency of imperative languages, which is higher than that of their functional counterparts; on the other hand, it is due to the translation of the functional features done by the Skil compiler [2]. Detailed results are given throughout this section.
4.1 Matrix Multiplication
The algorithm of Gentleman multiplies two matrices distributed block-wise onto a two-dimensional torus network. It firstly shifts the partitions of the first matrix placed on the i-th row of processors cyclically i times to the left, and the partitions of the second matrix placed on the j-th column of processors j times upwards. After this restructuring, the partitions of the two matrices placed on the same processor can be multiplied, since their pairwise corresponding inner indices are equal (i.e. a_ik and b_kj). After this multiplication, the partitions of the first matrix are shifted one step to the left and those of the second matrix one step upwards. This leads to new index combinations, but these combinations also have the 'appropriate' indices, so that the partitions can again be multiplied with one another and the result added to that of the previous multiplication. After repeating this combination of computation and communication √p times, where √p × √p is the size of the network, we obtain the product matrix of the two initial matrices. The algorithm is presented in detail in [14].

The Straightforward Implementation with Skeletons. The straightforward solution is to use the map skeleton for the computation steps and the rotate_parts skeletons for the communication steps. The program is given below⁴.

   void matmult (int n)
   {
      array <int> a, b, c ;
      int i ;

      a = array_create (2, {n,n}, {0,0}, {-1,-1}, init_f1, DISTR_TORUS2D) ;
      b = array_create (2, {n,n}, {0,0}, {-1,-1}, init_f2, DISTR_TORUS2D) ;
      c = array_create (2, {n,n}, {0,0}, {-1,-1}, zero, DISTR_TORUS2D) ;

      array_rotate_parts_horiz ((-)(0), a, a) ;
      array_rotate_parts_vert ((-)(0), b, b) ;

      for (i = 0 ; i < xPartDim ; i++)
      {  array_map (local_scal_prod (a, b), c, c) ;
         array_rotate_parts_horiz (minone, a, a) ;
         array_rotate_parts_vert (minone, b, b) ;
      }
      /* output array c */

      array_destroy (a) ;
      array_destroy (b) ;
      array_destroy (c) ;
   }

⁴ In order to avoid excessive code details, we have used the pseudo-code notation {a,b} for the 'classic' array with elements a and b.
In the first line, the variables a, b and c are declared as having the type "distributed array with integer elements". The arrays are then created with dimension 2 and total size n × n by the create skeleton. They are distributed onto a 2-dimensional virtual torus topology, since this optimizes the communication inside the rotate_parts skeletons⁵. The dimensions of the torus (here √p), which determine the number of partition rotations, are given by the predefined variables xPartDim and yPartDim. The elements of a and b are initialized by some user-defined functions init_f1 and init_f2, whereas those of the result matrix c are set to zero. The first calls of the rotate_parts skeletons perform the initial restructuring. The partial operator application (-)(0) implements in an efficient way the negation function⁶, whereas minone is the constant function −1⁷. The local partition multiplications are done by mapping the function local_scal_prod to every element of the result matrix c. This function computes the result of the scalar product of the corresponding row of the a-partition and the corresponding column of the b-partition, and adds it to the current element of c (denoted here by v):

   $t local_scal_prod (array <$t> a, array <$t> b, $t v, Index ix)
   {
      int k, xLowerBd, xUpperBd ;
      Bounds bds ;

      bds = array_part_bounds (a) ;
      xLowerBd = bds->lowerBd[0] ;
      xUpperBd = bds->upperBd[0] ;
      for (k = xLowerBd ; k < xUpperBd ; k++)
         v += array_get_elem (a, {ix[0],k}) * array_get_elem (b, {k,ix[1]}) ;
      return (v) ;
   }

Note that this function is called from within array_map and that it gets the current array element v and its index ix from this skeleton (see Subsection 3.2). However, it needs as further arguments the arrays a and b. This can be done without altering the type of the functional argument of the map skeleton by partially applying local_scal_prod to these two arguments in the main procedure, thus creating a function which has the type expected by map.

We have measured the run-times of the Skil program for matrix sizes between 100 × 100 and 800 × 800 on 4 to 64 transputers. The results are given in Table 1, where the first entry of each cell stands for the absolute run-time, the second denotes the speed-up relative to the DPFL implementation, and the third the remaining slow-down factor with respect to the Parix-C⁸ implementation. Note that the Skil program is on the average 4.5 times faster than the DPFL one, while being about 2 times slower than its Parix-C counterpart. This is already a good result, but we shall show in the next subsection how it can be improved further.

⁵ These skeletons can also work on non-torus topologies, however less efficiently.
⁶ Since (-)(0)(x) is the explicitly curried form of (-)(0, x), which is equivalent to 0 − x = −x. This is more efficient than using a new function neg, because it avoids the additional function call.
⁷ The values of these functions are negative, since the rotations are done to the left and upwards, respectively (see Subsection 3.3).
⁸ Parix-C is a parallel C dialect based on message-passing [13].

    n  |        2×2          |        4×4          |        6×6          |        8×8
  100  |  4.22 / 4.53 / 2.11 |  1.12 / 4.44 / 2.04 |  0.56 / 4.34 / 2.00 |  0.36 / 4.33 / 1.89
  200  | 33.46 / 4.51 / 2.10 |  8.57 / 4.48 / 2.07 |  4.16 / 4.47 / 2.03 |  2.26 / 4.47 / 2.00
  300  |113.26 / 4.46 / 2.08 | 28.67 / 4.47 / 2.07 | 12.92 / 4.48 / 2.05 |  7.69 / 4.46 / 2.03
  400  |          -          | 67.54 / 4.47 / 2.07 | 30.92 / 4.46 / 2.06 | 17.26 / 4.47 / 2.04
  500  |          -          |131.52 / 4.48 / 2.07 | 60.55 / 4.47 / 2.06 | 34.20 / 4.49 / 2.05
  600  |          -          |228.00 / 4.46 / 2.07 |101.73 / 4.47 / 2.06 | 57.74 / 4.46 / 2.05
  700  |          -          |          -          |162.38 /      / 2.06 | 92.82 / 4.47 / 2.05
  800  |          -          |          -          |          -          |135.93 / 4.46 / 2.05

Table 1. Run-time results for matrix multiplication (straightforward version); each cell shows absolute run-time / speed-up DPFL vs. Skil / slow-down Skil vs. Parix-C, columns give the network size √p × √p, and a dash marks problem sizes that did not fit into the network.

The Optimized Version. Upon analyzing the performance of the above matrix multiplication program, we have found out that the main cause for the remaining overhead lies in the way array elements are accessed. In order to keep the mapping of the array onto processors transparent to the user, the macro array_get_elem gets the global index of an array element and converts it internally to the local index used for the actual access, by subtracting from it the lower index ('offset') of the current partition. In the above algorithm, this conversion is done in the innermost loop, consequently its overhead is amplified by a factor of O(n³), where n is the size of the array. We have therefore defined a more efficient macro for element accessing, which gets the local index directly and, additionally, for access efficiency, the size of the current partition:

   $t array_get_local_elem (array <$t> a, Index local_ix, Size part_size) ;
where both local_ix and part_size can be derived from the local bounds of the current array partition. Using this feature (which does not collide with the transparency of the array mapping onto the processors), we can perform the conversion from global to local coordinates outside the innermost loop. The function local_scal_prod can thus be rewritten as follows.

   /* optimized version with local array element access */
   $t local_scal_prod (array <$t> a, array <$t> b, $t v, Index ix)
   {
      Index a_lowerix, b_lowerix, a_localix, b_localix ;
      Size a_psize, b_psize ;
      int k ;

      /* derive lower indices (*_lowerix) and sizes (*_psize) for the
         partitions of the matrices a and b from their lower bounds */

      a_localix[0] = ix[0] - a_lowerix[0] ;
      b_localix[1] = ix[1] - b_lowerix[1] ;
      for (k = 0 ; k < a_psize ; k++)
      {  a_localix[1] = b_localix[0] = k ;
         v += array_get_local_elem (a, a_localix, a_psize) *
              array_get_local_elem (b, b_localix, b_psize) ;
      }
      return (v) ;
   }
The run-time results for this optimized version, given in Table 2, show that we have removed the main cause for the remaining efficiency gap. We have thus approached the performance of the C version up to about 20%, while becoming 7 to 8 times faster than DPFL.
    n  |        2×2          |        4×4          |        6×6          |        8×8
  100  |  2.40 / 7.97 / 1.20 |  0.67 / 7.42 / 1.22 |  0.35 / 6.94 / 1.25 |  0.23 / 6.78 / 1.21
  200  | 18.79 / 8.03 / 1.18 |  4.93 / 7.79 / 1.19 |  2.45 / 7.57 / 1.20 |  1.36 / 7.39 / 1.20
  300  | 63.57 / 7.94 / 1.17 | 16.35 / 7.84 / 1.18 |  7.47 / 7.75 / 1.19 |  4.52 / 7.58 / 1.19
  400  |          -          | 38.24 / 7.90 / 1.17 | 17.74 / 7.80 / 1.18 |  9.99 / 7.72 / 1.18
  500  |          -          | 74.13 / 7.95 / 1.16 | 34.51 / 7.84 / 1.17 | 19.62 / 7.82 / 1.17
  600  |          -          |128.80 / 7.89 / 1.17 | 57.70 / 7.88 / 1.17 | 33.04 / 7.80 / 1.17
  700  |          -          |          -          | 91.78 /      / 1.16 | 52.86 / 7.85 / 1.17
  800  |          -          |          -          |          -          | 77.21 / 7.85 / 1.17

Table 2. Run-time results for matrix multiplication (optimized version); each cell shows absolute run-time / speed-up DPFL vs. Skil / slow-down Skil vs. Parix-C, columns give the network size √p × √p, and a dash marks problem sizes that did not fit into the network.
4.2 A Monte Carlo Algorithm for Solving PDEs
The second application we consider is a statistical numerical method for the solution of a partial differential equation (PDE). The idea is to find a random variable ξ such that its expectation E[ξ] coincides with the solution of the problem. Having found such a ξ, we need to generate n realizations of it (ξ_1, ..., ξ_n) and obtain the value of the solution as their arithmetic mean, E[ξ] ≈ (1/n) Σ_{i=1}^{n} ξ_i. An important advantage of this method is that the realizations ξ_i are independent from one another, and can thus be computed in parallel with no communication in between. Communication is only needed in the final phase, for collecting the single realizations. We consider the following partial differential equation:

   ∂²u(x,y)/∂x² + ∂²u(x,y)/∂y² = 0

defined on the domain [0,1] × [0,1] with the boundary conditions

   u(0,y) = y  and  u(1,y) = 1 + y,  for all y ∈ [0,1]
   u(x,0) = x  and  u(x,1) = x + 1,  for all x ∈ [0,1]

The domain is discretized to a rectangular mesh with step h, and the PDE is discretized by finite differences to

   u(x,y) = ( u(x+h,y) + u(x−h,y) + u(x,y+h) + u(x,y−h) ) / 4

If (x₀, y₀) is the point where we want to know the solution of the PDE, then we must find a random variable ξ such that u(x₀, y₀) = E[ξ]. For that, we construct a random trajectory from our initial point to the boundary of the domain using the following Markov process:

   (x, y) := (x₀, y₀)
   while (x, y) is not on the boundary of the domain
      (x, y) := (x', y'), where (x', y') is selected from the set of
      neighboring points {(x−h, y), (x+h, y), (x, y−h), (x, y+h)},
      each with the same probability p = 1/4

Let (x', y') be the final point on the boundary. Then, our random variable is ξ := u(x', y'). It can be proved that u(x₀, y₀) = E[ξ], which validates the result.
The algorithm is described in detail in [10, 11]. The Skil implementation of this Monte Carlo algorithm is very simple, compared to its Parix-C counterpart presented in [11]. Each of the p processors computes n/p random trajectories⁹ and places the results in its elements of a 1-dimensional distributed array of total size n. The trajectories are computed during the initialization phase of the array by the argument function rand_traj. The additional parameters needed by rand_traj (the initial point (x₀, y₀), the discretization step h and the number of trajectories n) are supplied via partial application. After the random trajectories have been computed and the results placed in the distributed array, the single results must be collected and combined, in order to yield the actual result E[ξ]. This is done by the skeleton array_fold, which computes the sum of all elements of the array. Finally, the arithmetic mean is computed by dividing the result of the fold operation by n. The procedure is given below.

   double monte_carlo (array <double> a, double x0, double y0, double h, int n)
   {
      double res ;

      a = array_create (1, {n,0}, {0,0}, {-1,-1},
                        rand_traj (x0, y0, h, n/netSize), DISTR_DEFAULT) ;
      res = array_fold (ident, (+), a) / n ;
      array_destroy (a) ;
      return (res) ;
   }

⁹ The predefined variable netSize contains the number of processors p.
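A possible shape of the customizing function rand_traj is sketched below. This is an illustration rather than the original code; in particular, the use of the C library function rand as random source and the unused per-processor trajectory count parameter are assumptions. With the boundary conditions given above, the value of u on the boundary is simply x + y.

   #include <stdlib.h>

   double rand_traj (double x0, double y0, double h, int loc_n, Index ix)
   {
      double x = x0, y = y0 ;

      /* random walk on the h-grid until the unit square is left;
         each neighboring point is chosen with probability 1/4 */
      while (x > 0.0 && x < 1.0 && y > 0.0 && y < 1.0)
         switch (rand () % 4)
         {  case 0:  x -= h ;  break ;
            case 1:  x += h ;  break ;
            case 2:  y -= h ;  break ;
            default: y += h ;  break ;
         }
      return (x + y) ;   /* value of u on the boundary */
   }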
The run-times obtained for the Skil implementation of this algorithm are given in Table 3. We have compared these results with those obtained for the Parix-C and DPFL programs, respectively. For more clarity, we have plotted the speedups relative to DPFL and the slow-downs relative to Parix-C against the number of processors, for different values of h and n, and obtained the graphics in Fig. 1. Again, the Skil implementation is faster than the DPFL one, with most of the relative speedups grouped around 5-6 (left graphic). On the other hand, the performance of Skil is comparable to that of C, most of the relative slow-downs being clustered around 1 (right graphic).
    p  |          h = 0.05          |          h = 0.02          |          h = 0.005
       |  n=1000   n=5000  n=10000  |  n=1000   n=5000  n=10000  |  n=1000   n=5000  n=10000
    3  |   0.98     4.67     9.31   |   5.31    26.51    52.56   |  87.05   423.02   844.96
    7  |   0.46     2.06     4.08   |   2.38    11.64    22.99   |  38.10   186.29   371.88
   15  |   0.25     1.02     1.96   |   1.20     5.54    11.07   |  19.23    88.80   172.56
   31  |   0.17     0.55     1.04   |   0.72     2.92     5.51   |  10.15    45.26    86.95
   63  |   0.11     0.30     0.55   |   0.39     1.52     2.84   |   5.83    23.30    44.91

Table 3. Run-time results for the Monte Carlo algorithm
Fig. 1. Comparison: Skil vs. DPFL (left) and Skil vs. Parix-C (right) for the Monte Carlo algorithm (relative speedups and slow-downs plotted against the number of processors, for h = 0.05, 0.02, 0.005 and n = 1000, 5000, 10000)

Notice that in some cases the relative slow-downs are slightly below 1, i.e. the Skil run-times even beat the C run-times. The reason is that the C implementation referred to here is an older version, which does not use virtual topologies or asynchronous communication, as our skeleton implementation does. Of course, a Skil program could never beat an equally well optimized C version of that program, since Skil is translated to message-passing C.
4.3 Further Run-Time Results
We have implemented further applications using both the skeletons presented above and some additional skeletons, like for instance partition broadcasting, row permutation and generic matrix multiplication [2]. The latter encapsulates the above Gentleman algorithm, however parameterized with the functions used to compute the "scalar product" of a row and a column. If the actual multiplication and addition are supplied, then we obtain the classical matrix multiplication. However, by using other functions, different algorithms with this communication pattern can be implemented, like for instance shortest paths in graphs¹⁰.

¹⁰ Here, we 'multiply' the distance matrix of a graph with itself, using addition in the role of the generic multiplication and the minimum function in the role of the generic addition [2].
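As a small illustration of this instantiation (the exact parameter list of the generic multiplication skeleton is omitted here), the two customizing functions for the shortest-path computation could look as follows.

   /* 'multiplication' of two path lengths: concatenate the paths */
   int path_mult (int d1, int d2)  { return d1 + d2 ; }

   /* 'addition' of two path lengths: keep the shorter one */
   int path_add  (int d1, int d2)  { return (d1 < d2) ? d1 : d2 ; }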
We have considered the following applications: matrix mult 1 is the optimized matrix multiplication procedure (Subsection 4.1); matrix mult 2 is the implementation of this procedure based on the generic skeleton outlined above; shortest paths is also based on this skeleton; gauss is the Gaussian elimination procedure; and finally monte carlo is the implementation of the algorithm given in Subsection 4.2. For each of these programs, we have compared the average of all run-time measurements for the Skil version with the averages for the DPFL and C versions. The results are depicted in Fig. 2.

Fig. 2. Skil run-times vs. run-times of DPFL and Parix-C (relative speedups DPFL/Skil and Skil/Parix-C for each of the five problems)
Note that the relative speedups of Skil vs. DPFL are between 5.5 and 7.5, whereas the remaining slow-downs with respect to C are relatively small, between 5% and 20%, except for gauss, where the average slow-down is 75%. The reason is that the Gaussian algorithm cannot be implemented with skeletons as straightforwardly as the other applications, so that this overhead has a harder impact on the overall performance. However, a worst-case slow-down factor of 1.75 is still a good result for practical applications.
5 Conclusions and Future Work
We have presented a new approach to parallel programming with algorithmic skeletons. We have designed a language that allows high-level parallel programming and at the same time can be efficiently implemented, namely an imperative language enhanced with functional features and a polymorphic type system. We have then described a series of skeletons for the work with distributed arrays and showed how two parallel applications can be implemented on a parallel system on the basis of skeletons. Run-time results have shown that our implementation was faster than that of a functional language with skeletons, while approaching the performance of hand-written C code based on message-passing.
Future work is necessary in several directions. Firstly, the distributed data type array together with its skeletons is only at a prototype level. It would be interesting to support other distributions onto processors apart from block-wise, like cyclic, block-cyclic etc. Further, in the case of block distributions, it should be possible to define overlapping areas for the single partitions, in order to reduce communication in operations which require more than one element at a time. Such operations are used for instance in solving partial differential equations [10] or in image processing. In order to be able to cope with 'real world' applications, new skeletons, for instance for (parallel) I/O, must be designed and implemented. Moreover, the implementation of skeletons can be optimized, for instance by allowing the application of argument functions only to certain elements of a distributed data structure, since this would be more efficient than testing the condition in the customizing function itself.
References
1. B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, M. Vanneschi: P3L: a Structured High-level Parallel Language and its Structured Support, Technical Report HPL-PSC-93-55, Hewlett-Packard Laboratories, Pisa Science Center, 1993.
2. G. H. Botorog, H. Kuchen: Skil: An Imperative Language with Algorithmic Skeletons for Efficient Distributed Programming, in Proceedings of the Fifth International Symposium on High Performance Distributed Computing, IEEE Computer Society Press, 1996.
3. G. H. Botorog, H. Kuchen: Using Algorithmic Skeletons with Dynamic Data Structures, in Proceedings of IRREGULAR '96, LNCS 1117, Springer, 1996.
4. M. I. Cole: Algorithmic Skeletons: Structured Management of Parallel Computation, MIT Press, 1989.
5. J. Darlington, A. J. Field, P. G. Harrison et al.: Parallel Programming Using Skeleton Functions, in Proceedings of PARLE '93, LNCS 694, Springer, 1993.
6. J. Darlington, Y. Guo, H. W. To, J. Yang: Functional Skeletons for Parallel Coordination, in Proceedings of EURO-PAR '95, LNCS 966, Springer, 1995.
7. J. T. Feo, D. C. Cann, R. R. Oldehoeft: A Report on the Sisal Language Project, in Journal of Parallel and Distributed Computing, Vol. 10, No. 4, 1990.
8. I. Foster, R. Olson, S. Tuecke: Productive Parallel Programming: The PCN Approach, in Scientific Programming, Vol. 1, No. 1, 1992.
9. H. Kuchen, R. Plasmeijer, H. Stoltze: Efficient Distributed Memory Implementation of a Data Parallel Functional Language, in Proceedings of PARLE '94, LNCS 817, Springer, 1994.
10. H. Kuchen: Datenparallele Programmierung von MIMD-Rechnern mit verteiltem Speicher, Thesis (in German), Shaker-Verlag, Aachen, 1996.
11. H. Kuchen, H. Stoltze, I. Dimov, A. Karaivanova: Distributed Memory Implementation of Elliptic Partial Differential Equations in a Dataparallel Functional Language, in Proceedings of MPPM '95, IEEE Computer Society Press, 1995.
12. L. D. J. C. Loyens, J. R. Moonen: ILIAS, a Sequential Language for Parallel Matrix Computations, in Proceedings of PARLE '94, LNCS 817, Springer, 1994.
13. Parsytec Computer GmbH: Parix 1.2, Software Documentation, Aachen, 1993.
14. M. J. Quinn: Parallel Computing: Theory and Practice, McGraw-Hill, 1994.
15. M. Röttger, U. P. Schroeder, J. Simon: Virtual Topology Library for PARIX, Technical Report 148, University of Paderborn, 1994.