Using Algorithmic Skeletons with Dynamic Data Structures

George Horatiu Botorog*
Herbert Kuchen
Aachen University of Technology, Lehrstuhl für Informatik II
Ahornstr. 55, D-52074 Aachen, Germany
[email protected]
Abstract. Algorithmic skeletons are polymorphic higher-order functions representing common parallelization patterns. A special category are data-parallel skeletons, which perform operations on a distributed data structure. In this paper, we consider the case of distributed data structures with dynamic elements. We present the enhancements necessary in order to cope with these data structures, both on the language level and in the implementation of the skeletons. Further, we show that these enhancements practically do not affect the user, who merely has to supply two additional functional arguments to the communication skeletons. We then implement a parallel sorting algorithm using dynamic data with the enhanced skeletons on a MIMD distributed memory machine. Run-time measurements show that the speedups of the skeleton-based implementation are comparable to those obtained for a direct C implementation.
1 Introduction

Algorithmic skeletons represent an approach to parallel programming which combines the advantages of high-level (declarative) languages with those of efficient (imperative) ones. The aim is to obtain languages that allow easy parallel programming, such that the user does not need to explicitly handle low-level features, like communication and synchronization, and does not have to fight with problems like deadlocks or non-deterministic program runs. At the same time, skeleton-based programs should be efficiently implementable, coming close to the performance of low-level approaches, such as message-passing on MIMD machines with distributed memory.

A skeleton is an algorithmic abstraction common to a series of applications, which can be implemented in parallel [9]. Skeletons are embedded into a sequential host language, thus being the only source of parallelism in a program. Classical examples include map, which applies a given function to all elements of a data structure, farm, which models master-slave parallelism, and divide&conquer, which solves a problem by recursive splitting [10].
* The work of this author is supported by the "Graduiertenkolleg Informatik und Technik" at the Aachen University of Technology.
Depending on the kind of parallelism used, skeletons can be classified into process parallel and data parallel ones. In the first case, the skeleton (dynamically) creates processes, which run concurrently. Some examples are pipe, farm and divide&conquer [1, 9, 10]. In the second case, the skeleton works on a distributed data structure, performing the same operations on some or all elements of this structure. Data parallel skeletons, like map, fold or rotate, are used in [1, 5, 6, 7, 10, 11, 13, 17].

In this paper, we place the emphasis on the second category. More exactly, we address here the issue of skeletons working with dynamic distributed data structures. We present the additional features that are necessary in order to cope with dynamic data, both in the language and in the implementation of skeletons. We then show how a parallel algorithm that uses dynamic data can be implemented with skeletons, and that the speedups of this implementation are comparable to those of its direct (low-level) implementation.

The rest of the paper is organized as follows. Section 2 gives an overview of the host language and the features needed to integrate the skeletons. Section 3 addresses the additional problems that arise when using dynamic data structures, as well as how they can be dealt with in the implementation of skeletons. Section 4 then presents some skeletons working on arrays with dynamic components. Section 5 shows how a parallel sorting algorithm (PSRS) can be implemented with these skeletons. It further gives run-time results and a comparison with results obtained for a direct implementation in parallel C. Section 6 compares our approach with some related work and the last section draws conclusions.
2 The Language

An important characteristic of skeletons is their generality, i.e. the possibility to use them in different applications. For this, most skeletons have functional arguments and a polymorphic type. Hence, most languages with skeletons are built upon a functional host [10, 13]. The main drawback is that the efficiency of these implementations lags behind that of low-level languages [13]. To obtain a better performance, we have used an imperative language, namely C¹, as the basis of our host, which we have called Skil, as an acronym for Skeleton Imperative Language. However, in order to allow skeletons to be integrated in their full generality, we have provided Skil with some functional features as well as with a polymorphic type system. These additional features are described in [6]; we give here only a brief overview:
- Since most skeletons are higher-order functions, i.e. functions with functional arguments and/or return type, we need higher-order functions in Skil. For example, the skeleton map applies a given function to all elements of a data structure:

      map (f, [x1, ..., xn]) = [f(x1), ..., f(xn)]
¹ This is however only a pragmatic choice; other imperative languages can equally well be used instead.
- Closely related to higher-order functions is partial function application, i.e. the application of an n-ary function to k ≤ n arguments. Partial applications are useful in generating new functions at run-time and in supplying additional parameters to functions [5, 6, 7] (see the sketch after this list).
- Another feature is the conversion of operators to functions, which allows passing operators as functional arguments to skeletons, as well as partial applications of operators. This conversion is done by enclosing the operator between brackets, e.g. `(+)'.
- Skil has a polymorphic type system, which allows skeletons to be (re-)used in solving similar or related problems. Polymorphism is achieved by using type variables, which can be instantiated with arbitrary types. Syntactically, a type variable is an identifier which begins with a $, e.g. `$t'.
- Skil allows the definition of distributed data structures, as long as they are `homogeneous', in the sense that they are composed of identical data structures placed on each processor. This is done by means of the `pardata' construct [6]. As an example, consider the data structure "distributed array". The `header' of this data structure is:

      pardata array <$t> ;

  where the type parameter $t denotes the type of the elements of the array. A distributed array of double precision reals can then be declared as:

      typedef array <double> realarray ;
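The following fragment illustrates how these features appear to the programmer. It is only a sketch: the function inc_f and its use are hypothetical, and the array_map skeleton it relies on is introduced in Section 4.

    /* Sketch: add a constant c to every element of a distributed array
     * of ints. The extra argument c is bound by partial application;
     * the last two arguments of inc_f are supplied by the array_map
     * skeleton itself.                                                 */
    int inc_f (int c, int x, Index ix)
    {
        return x + c ;               /* new value of the element at ix */
    }

    void increment_all (array <int> a)
    {
        /* inc_f (5) is a unary function on (int, Index), generated at
         * run-time by partial application                             */
        array_map (inc_f (5), a, a) ;
    }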
The type arguments of a pardata can be instantiated with arbitrary types. However, some additional problems appear if dynamic (pointer-based) data types are used. In this case, special care has to be taken in the implementation of skeletons that move elements of the pardata from one processor to another, since one should not move the pointer as such, but the data pointed to by it. This issue is addressed in the next section.
3 Coping with Dynamic Data Structures

A distributed data structure defined by the pardata construct can be dynamic in its entirety, like for instance an adaptively refined grid, or it can have an overall regular structure, but be parameterized by a dynamic data structure, like for example an array with arrays of variable length as elements. The first case was considered in [5]. We showed there that the irregularity of the data structure is dealt with on a high level, by defining special skeletons which implement `irregular' operations such as grid refinement or load balancing. We shall focus here on the second case, where a handling on the language and skeleton implementation level is necessary. Consider the following type definition of an array of integers with variable length:

    typedef struct _VarArr {int *fields; int cnt;} * VarArr ;
where fields is a pointer² to the array elements and cnt gives the number of elements. Based on this data structure, a distributed array with VarArr's as elements can be defined³:

    typedef array <VarArr> vararray ;
While some skeletons working on distributed arrays, such as map, are not affected by this instantiation, a problem arises in the case of communication skeletons, which move array elements from one processor to another. Since moving pointers between processors obviously leads to incorrect results, pointer-based (and hence dynamic) data structures have to be `flattened' before sending and `unflattened' at the destination. The next question is whether this packing/unpacking can be done automatically by the skeleton. The main problem here is the fact that a pointer of type t can point to a memory area containing several data structures of the type t, like in the VarArr example above, where fields points to a block of cnt integers. Void pointers and casts blur the entire issue even more, so that automatic packing and unpacking are hard to achieve. The solution is based on the functional features of Skil, which allow passing argument functions to skeletons. Communication skeletons, like fold or permute (see Section 4), get two additional functional parameters, pack_f and unpack_f, where the user can supply the appropriate routines for the type with which he has instantiated the distributed data structure. These functions have the types:

    void pack_f ($t, Buff) ;
    void unpack_f (Buff, $t *) ;
where $t is the type of the array elements⁴ and Buff is a predefined buffer structure, containing a string and its length.

A further problem in sending an element with variable size is the length of the resulting message. In most message-passing systems, the receiver has to know the size of the incoming message beforehand, in order to allocate the receive buffer. Consequently, the size of the element that will be sent must first be communicated to the destination processor. This is illustrated by the following pseudo-code fragment:

    (Sender)                               (Receiver)
    pack_f (a[i], buf) ;
    Send (dest, buf->len, 1) ;             Recv (src, buf->len, 1) ;
    Send (dest, buf->buf, buf->len) ;      Recv (src, buf->buf, buf->len) ;
                                           unpack_f (buf, &a[j]) ;
² We regard here pointer-based structures, since this is the way dynamic data structures can be defined in C and also in Skil.
³ Such data structures can be used, for instance, to represent sparse matrices.
⁴ In case of other distributed data structures, which are parameterized by a different number of type variables, the arity of pack_f/unpack_f changes appropriately.
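For illustration, packing and unpacking functions for the VarArr type might look as follows. This is only a sketch: it assumes that buf->buf points to a sufficiently large, preallocated byte buffer and that buf->len holds the number of bytes used, as suggested by the pseudo-code above; the actual Buff interface may differ. The names pck_f and upck_f are the ones used later in Fig. 2.

    #include <stdlib.h>
    #include <string.h>

    /* Sketch: flatten a VarArr into the buffer (element count followed
     * by the keys), and rebuild it at the destination.                 */
    void pck_f (VarArr v, Buff buf)
    {
        int bytes = v->cnt * sizeof (int) ;
        memcpy (buf->buf, &v->cnt, sizeof (int)) ;            /* count */
        memcpy (buf->buf + sizeof (int), v->fields, bytes) ;  /* keys  */
        buf->len = sizeof (int) + bytes ;
    }

    void upck_f (Buff buf, VarArr *res)
    {
        VarArr v = (VarArr) malloc (sizeof (struct _VarArr)) ;
        memcpy (&v->cnt, buf->buf, sizeof (int)) ;
        v->fields = (int *) malloc (v->cnt * sizeof (int)) ;
        memcpy (v->fields, buf->buf + sizeof (int), v->cnt * sizeof (int)) ;
        *res = v ;
    }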
However, this packing/unpacking becomes inefficient if the elements of the data structure have a static type. On the one hand, these elements do not have to be `flattened', but can be copied directly into the output buffer. On the other hand, their size is known, thus making the first message superfluous. In order to avoid this overhead, the internal function $$dyn was defined. This function tests if a data type is dynamic and can be used in the implementation of skeletons. Since $$dyn is evaluated by the Skil compiler after the instantiation of polymorphism, it can also be applied to polymorphic types, as shown below.

    skel (array <$t> a, ..., void pack_f (), void unpack_f ())
    {  ...
       if (! $$dyn ($t))
          Send (dest, a[i], sizeof (a[i])) ;
       else
       {  pack_f (a[i], buf) ;
          Send (dest, buf->len, 1) ;
          Send (dest, buf->buf, buf->len) ;
       }
       ...
    }
4 The Skeletons

We shall now briefly present some skeletons for the distributed data structure array. Since most of these skeletons are described in [6] and [7] for arrays with static elements, we shall focus here on the dynamic case. For each skeleton, we give its syntax, informal semantics and complexity (i.e. the actual computation time t(n); the overall complexity can be derived as c(n) = t(n) · p). Note that the complexity of some skeletons increases if the elements of the array have a dynamic type, because of the packing/unpacking and the additional messages. Let p be the number of processors (and, hence, of array partitions) and d the dimension of the array. We assume that our arrays have the same size n in each dimension. This condition is not necessary, but serves to simplify the expressions obtained for the complexity. For simplicity, we assume that both a local operation on a processor and the sending of one array element from one processor to one of its neighbors take one time unit.
4.1 The create and destroy Skeletons

The skeleton array_create creates a new, block-wise distributed array and initializes it using a given function. The skeleton has the following syntax:

    array <$t> array_create (int dim, Size size, Size blocksize,
                             Index lowerbd, $t init_elem (Index),
                             int distr) ;
where dim is the number of dimensions of the array, the types Size and Index are (classical) arrays with dim components, size contains the global sizes of the array, blocksize contains the sizes of a partition, lowerbd is the lowest index of a partition, init_elem is a user-defined function that initializes each element of the array depending on its index, and distr gives the virtual (software) topology onto which the array is mapped [7]. In the example in the following section, the array is mapped directly onto the hardware topology (DISTR_DEFAULT). If we denote the complexity of the initialization function by t_i, then the complexity of the skeleton is t(n) ∈ O(t_i · n^d / p), since we have to call init_elem for each element of a partition and all partitions are processed in parallel.

The skeleton array_destroy deallocates an existing array. The argument function destr_elem is called only in the case of dynamic elements. The complexity is thus t(n) ∈ O(1) in the static case, and t(n) ∈ O(t_d · n^d / p) in the dynamic case, where t_d is the complexity of the element deallocation function.

    void array_destroy (array <$t> a, void destr_elem ($t)) ;
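A usage sketch follows. It relies on the {a,b} pseudo-code notation for classical arrays also used in Fig. 2, assumes n to be divisible by the number of processors netSize, and uses illustrative names (make_array, init_f, release); whether NULL may be passed as the deallocation function for a static element type is an assumption.

    /* Sketch: a one-dimensional, block-wise distributed array of n
     * doubles, each element initialised with its global index.       */
    double init_f (Index ix)
    {
        return (double) ix[0] ;              /* value = global index  */
    }

    array <double> make_array (int n)
    {
        return array_create (1, {n,0}, {n/netSize,0},
                             {procId * (n/netSize), 0},
                             init_f, DISTR_DEFAULT) ;
    }

    void release (array <double> a)
    {
        /* double is a static type, so no element deallocation function
         * is needed; passing NULL here is an assumption.              */
        array_destroy (a, NULL) ;
    }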
4.2 The map Skeleton

array_map applies a given function to all elements of an array and puts the results into another array. However, the two arrays can be identical; in this case the skeleton does an in-situ replacement. The syntax is:

    void array_map ($t2 map_f ($t1, Index), array <$t1> from,
                    array <$t2> to) ;

The complexity of this skeleton is similar to that of array_create: t(n) ∈ O(t_m · n^d / p), where t_m is the complexity of the applied function map_f.
4.3 The fold Skeleton

array_fold composes ("folds together") all elements of an array.

    $t array_fold ($t fold_f ($t, $t), array <$t> a,
                   void pack_f ($t, Buff), void unpack_f (Buff, $t *)) ;

First, each processor composes all elements of its partition using the folding function fold_f. After that, the results from all partitions are folded together. Since the order of composition is non-deterministic, the user should provide an associative and commutative folding function, otherwise the result is non-deterministic. In our implementation, the second step is performed along the edges of a software tree topology, with the result finally collected at the root. In order to make the result known to all processors, it is broadcast from the root along the tree edges to all other processors. If t_f is the complexity of the folding function, then the complexity of the skeleton is given by the local (sequential) folding, by the folding of the single results from each processor, and by the broadcasting of the final result, and amounts to t(n) ∈ O(t_f · (n^d/p + log₂ p))⁵ for static elements, and to t(n) ∈ O(t_f · (n^d/p + log₂ p) + (t_p + t_u) · log₂ p) for dynamic elements, where t_p and t_u are the complexities of the packing and unpacking functions, respectively.

⁵ Actually, if the hardware topology is not a tree, then this complexity becomes t(n) ∈ O(t_f · (n^d/p + δ · log₂ p)), where δ is the dilation of embedding the tree into the hardware topology [16].
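A usage sketch: computing the global maximum of a distributed array of ints. max_f is associative and commutative, as required. Since int is a static type, the packing functions are never called by the skeleton; whether NULL (or dummy routines) may be passed in this case is an assumption of this sketch.

    /* Sketch: fold a distributed int array with the maximum operator. */
    int max_f (int x, int y)
    {
        return x > y ? x : y ;
    }

    int global_max (array <int> a)
    {
        return array_fold (max_f, a, NULL, NULL) ;
    }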
4.4 The permute Skeleton

array_permute_parts switches the partitions of an array using a given permutation function (perm_f). This function takes the index of an element of the partition placed on the source processor and returns the index of an element placed on the target processor. This apparently complicated procedure is done in order to maintain the transparency of the mapping of array partitions onto processors. The user must provide a bijective function on {0, ..., p−1}, otherwise a run-time error occurs.

    void array_permute_parts (array <$t> from, Index perm_f (Index),
                              array <$t> to, void pack_f ($t, Buff),
                              void unpack_f (Buff, $t *)) ;

The complexity of this skeleton is given by the complexity t_f of evaluating the permutation function and by that of sending a partition (comprising n^d/p elements) to another processor. The distance on the network between the source and the target processor is determined by perm_f and is at most equal to the diameter of the topology. In case of a mesh, this diameter is O(√p). The complexity of the skeleton is thus t(n) ∈ O(n^d/p · √p + t_f) = O(n^d/√p + t_f) in the static case, and t(n) ∈ O((t_p + t_u) · n^d/√p + t_f) in the dynamic case, where t_p and t_u are the complexities of the packing and unpacking functions, respectively.
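As an illustration, the permutation argument used in the PSRS program of Section 5 realizes a cyclic shift of the partitions by i positions. The sketch below assumes an array with one element per partition (as is the case there), so that the element index coincides with the partition number; it uses the {a,b} pseudo-code notation for classical arrays from Fig. 2, and i is bound by partial application, so this is not compilable Skil code as such.

    /* Sketch: cyclic shift of the partitions by i positions. */
    Index perm_f (int i, Index ix)
    {
        return {(ix[0] + i) % netSize, 0} ;   /* index on the target processor */
    }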
5 An Application: Parallel Sorting by Regular Sampling

We shall now present the Parallel Sorting by Regular Sampling⁶ (PSRS) algorithm [14], and show how it can be implemented on the basis of skeletons. The algorithm internally uses array blocks of variable size, as well as dynamically growing blocks during an accumulation phase, thus being an appropriate application of the previously discussed features for dynamic data structures⁷. We then give some run-time results and compare the achieved speedups with those obtained by the authors of the algorithm for a direct (C) implementation.

⁶ The term `regular' refers here to the fact that the distance between the indices of the selected samples is constant.
⁷ Of course, variable-sized arrays could be implemented as arrays with fixed size, by estimating the maximal size. We wanted to give a simple example for a technique that can be used with arbitrarily complicated dynamic data structures.

5.1 The PSRS Algorithm

PSRS is a combination of a local (sequential) sort, a load balancing phase, a data exchange and a parallel merge [14]. We assume that we have to sort a distributed array with n keys on p processors (the example in Fig. 1, taken from [15], illustrates this for n = 27 and p = 3). The algorithm consists of four phases:
1. Each processor, in parallel, sorts its partition of n/p elements of the array using a sequential algorithm, for instance quicksort. Each processor then selects its regular samples as the 1st, (w+1)-st, ..., ((p−1)·w+1)-st elements of its partition, where w = n/p². In Fig. 1, each processor selects its 1st, 4th and 7th elements as local samples.
2. One processor gathers all local samples and sorts them. Then, it selects p−1 pivots, as the samples with indices p+ρ, 2p+ρ, ..., (p−1)·p+ρ, where ρ = ⌊p/2⌋. In Fig. 1, the 4th and 7th samples (33 and 69) are chosen as pivots. The pivots are then broadcast to all processors.
3. Upon receiving the pivots, each processor splits its array partition into p disjoint blocks, containing the elements whose values are situated between the values of neighboring pivots. Then, in parallel, each processor i keeps the i-th block for itself and sends the j-th block to processor j. Thus, each processor keeps one block and re-assigns p−1 blocks. In Fig. 1, processor 1 receives the first blocks from processors 2 and 3, and sends the second and third blocks to processors 2 and 3, respectively. At the end of this phase, each processor i holds those elements of the initial array whose value is between that of the (i−1)-st and i-th pivots.
4. Each processor, in parallel, merges its blocks into a single partition. The concatenation of the single partitions is the final sorted array.
5.2 Implementation with Skeletons
We shall now describe the skeleton-based implementation of PSRS. While the first and second phases remain more or less unchanged, the third and fourth ones are partly interleaved. The skeleton program is given in Fig. 2. The pre-defined variables procId and netSize denote the own processor number and the total number of processors (p), respectively.

The first two lines contain the declarations of two distributed arrays, the first having VarArr's as elements (see Section 3), the second MultArr's (arrays of VarArr's), which are used to store the blocks in the third and fourth phase. The initial array a is represented as an array with p elements, each partition thus being one VarArr with n/p elements.

The first phase is implemented by using a map skeleton, which applies a (sequential) sorting function to each element (i.e. to each VarArr), and puts the results back into the initial array. Then, the array s is created to hold the samples. The partitions of this array also consist of one element, which is a VarArr with p components. The selection of the samples is done by again using the map skeleton, this time with the argument function sample_f. This function selects from a VarArr the elements with indices 1, w+1, ..., (p−1)·w+1. For that, sample_f is applied to all elements of a, which it leaves unchanged, storing the result as a side-effect in the array s. Note that the additional parameters s and w are supplied by partial application, whereas the last two arguments come from the map skeleton (see [5, 7] for further details). The type of this function is thus:

    VarArr sample_f (vararray s, int w, VarArr v, Index ix) ;
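A hedged sketch of what such a sampling function might look like is given below. The accessor array_get_elem for the local element of a distributed array is hypothetical; the real Skil interface for this access may differ.

    /* Sketch (PSRS, phase 1): select the regular samples of the local
     * partition v (a sorted VarArr with n/p keys) and store them, as a
     * side effect, in the corresponding element of s; v is returned
     * unchanged, so that the array a is not modified.                  */
    VarArr sample_f (vararray s, int w, VarArr v, Index ix)
    {
        VarArr smp = array_get_elem (s, ix) ;   /* hypothetical accessor */
        int k ;
        for (k = 0 ; k < netSize ; k++)
            smp->fields[k] = v->fields[k * w] ; /* the (k*w + 1)-st key  */
        smp->cnt = netSize ;
        return v ;
    }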
[Fig. 1 shows a worked PSRS run with n = 27 and p = 3: the initial unsorted list; the sorted local partitions and the local regular samples (Phase 1); the gathered and sorted regular samples and the resulting pivots 33 and 69 (Phase 2); the blocks formed on each processor (Phase 3); the re-assigned blocks, their merging and the final sorted list (Phase 4).]

Fig. 1. A PSRS example

In the second phase, the samples from each processor are gathered by the fold skeleton. Since the VarArr holding the samples grows in each inter-processor folding step, the skeleton must apply the techniques described in Section 3 for handling dynamic data. This complication is however hidden from the user, whose only additional task is to supply the packing and unpacking argument functions. After gathering the samples, the `root' processor broadcasts them back to all other processors. Thus, the samples are known on all processors after the folding, so that each processor can compute the pivots by itself⁸.
⁸ This is a slight deviation from the algorithm, but it is more efficient to take advantage of the broadcast done anyway by the skeleton than to first compute the pivots on the root processor and then use an extra broadcast to spread them to all processors.
    typedef array <VarArr> vararray ;
    typedef array <MultArr> multarray ;

    void psrs (vararray a)
    {  vararray s, q, r ;
       multarray m ;
       VarArr v ;
       int i, j, *pivots ;

       /* Phase 1: */
       array_map (sort_f, a, a) ;
       s = array_create (1, {netSize,0}⁹, {1,0}, {procId,0}, init_f,
                         DISTR_DEFAULT) ;
       array_map (sample_f (s, w), a, a) ;

       /* Phase 2: */
       v = array_fold (append_f, s, pck_f, upck_f) ;
       sort_f (v, {0,0}) ;
       <copy each w-th element from the sample array s into the vector pivots> ;

       /* Phases 3 and 4, partly interleaved: */
       <create q, r and m with the same dimensions and distribution as s> ;
       for (i = 0 ; i < netSize ; i++)
       {  array_map (get_ith_block ((i+procId) % netSize, pivots, q), a, a) ;
          array_permute_parts (q, perm_f (i), r, pck_f, upck_f) ;
          array_map (copy_ith_block (i, m), r, r) ;
       }
       array_map (merge_blocks (a), m, m) ;

       array_destroy (s, destroy_f) ;
       <destroy q, r and m> ;
    }
Fig. 2. Skeleton-based implementation of the PSRS algorithm

⁹ For simplicity, we have used the pseudo-code notation {a,b} for the (classic) array with elements a and b.

As already mentioned, phases 3 and 4 are partly interleaved. Thus, instead of first splitting the partitions into blocks and then performing an all-to-all communication, we proceed in p steps (i = 0, ..., p−1) as follows. Each processor procId determines the (i + procId)-th block of its partition. These blocks are stored into a temporary array q, which then undergoes a permutation operation, such that the (i + procId)-th block goes to the (i + procId)-th processor. This permutation is actually a cyclic shift by i positions, whereas the new position (destination partition) is computed by the user-defined function perm_f. Further, since the sizes of the blocks may vary, the skeleton uses the packing and unpacking functions, similarly to fold. After that, the permuted blocks (now stored in the temporary array r) are copied into elements of the multarray m. This is done by mapping a copying function to all elements of r, with a side-effect to m. Finally, after all blocks are on the `right' processors, they can be merged together. This is done by mapping a merging function to all elements of m and putting the result in the array a, which then contains the sorted overall array.
5.3 Run-time Results
We have implemented the PSRS algorithm with skeletons on a Parsytec SC 320 distributed memory machine with T800 transputers and 4 MB per node. The run-time results for 1 to 32 processors and 100,000 to 800,000 array elements are summarized below.

         n     100,000   200,000   400,000   800,000
    p
    1            23.72     50.03    105.36    221.36
    4             7.11     14.81     30.89       -
    8             4.26      8.69     17.89     36.91
    16            3.18      6.02     11.83     23.91
    32            4.04      6.76     11.43     20.46

    Table 1. Run-time results for the skeleton implementation of the PSRS algorithm
Based on these values, we have studied the speedup behavior of our implementation. We have considered speedups rather than absolute run-times, since our parallel computer was different from the ones used by the authors of the algorithm (see below). The results are depicted in the first graph in Fig. 3. One can see that the speedup curves become flat for more than 16 processors. The main reason is that the problem size gets too small for the network size. This is clearly visible in the results presented by the authors of the algorithm, which show good speedups (i.e. an efficiency of more than 50%) for array sizes between 2,000,000 and 8,000,000 elements [14]. Unfortunately, we could not test these cases, since we only had 4 MB memory per node.

We have then compared our results with those presented in [14], for the array sizes we could deal with. In [14], results are given for the Intel iPSC/860, the iPSC/2-386, the BBN TC2000, as well as for a LAN of workstations. Since the TC2000 has both distributed and shared memory, thus allowing a simple and very efficient implementation of operations like broadcasting and folding, and since the workstation cluster has a different architecture than the one we used, we have chosen the iPSC/860 results as reference for our comparison. The speedups for the direct implementation are given in the second graph in Fig. 3.
Notice that the curves in the two graphs resemble each other, whereas the speedups of the direct implementation are slightly better (e.g. 14 compared to 11 on 32 processors). This is no surprise, since the iPSC's hypercube topology is `richer' than the SC's topology of degree 4, thus allowing a more efficient implementation of some operations. For instance, a broadcast can be done on the hypercube in O(log p) steps, since a hypercube has for each node a spanning tree rooted at that node, whereas for a general topology a broadcast requires O(δ · log p) steps, where δ is the dilation of embedding a tree into that topology.
[Fig. 3 contains two speedup plots over the number of processors (up to 32): one for the skeleton-based implementation and one for the direct C implementation, each with one curve per array size (100,000 to 800,000 elements) and the linear speedup line for reference.]
Fig. 3. Speedups for PSRS: skeleton-based implementation (left) and C implementation (right)
6 Related Work

High-level features are the basis of several approaches that aim to simplify the task of parallel programming. Without trying to be exhaustive, we shall compare Skil with some of them.

HPF [12] extends Fortran-90 by constructs for declaring distributed arrays, for mapping them onto processors and for specifying parallel loops. However, neither arbitrary distributed data structures, like those defined by the pardata construct, nor task-parallel operations are supported.

NESL [3] is a first-order functional, polymorphic, data parallel language, which supports nested parallelism. The latter is only partly supported by Skil, which allows nesting of skeleton calls, but not of pardata declarations. On the other hand, due to the fact that they are higher-order, skeletons are more powerful than NESL's parallel constructs. For example, NESL's operations sum, count, max_index and min_index can be implemented in Skil as instances of the fold skeleton.

Numerous parallel languages are based on C++. One could mention here pC++ [4], which supports data parallelism, CC++ [8], supporting task parallelism, or HPC++ [2], which integrates both paradigms. These languages are related to Skil as far as the type system is concerned, i.e. the Skil polymorphism could be expressed by C++ templates. However, the main difference lies in the functional features, which cannot be expressed in any of these languages. This can very well be seen in the case of P3L [1], which is a C++-based language with algorithmic skeletons. In order to cope with higher-order functionality, the language was enhanced by special syntactic constructs [1]. However, enhancing C++ with functional features (yielding "Skil++") would be a further, very promising step. Apart from polymorphism, which would come for free, the more important gain would lie in the use of C++'s object-oriented mechanisms, such as encapsulation of data. This would allow extending the concept of skeleton to that of a parallel abstract data type [5], consisting of a distributed data structure and a set of powerful higher-order functions operating on this data structure.

Finally, we consider some languages with skeletons. Most of these take a functional approach [10, 13], however some use an imperative host, e.g. P3L [1], or even a two-layer language (functional for the application and imperative for the skeletons), as in the case of SPP(X) [11]. We have done some comparisons with DPFL, a data parallel functional language with skeletons [13], for a series of matrix applications. The results showed that the Skil implementation was about 6 times faster than the DPFL one, while approaching the performance of message-passing C [7].
7 Conclusions and Future Work

In this paper, we have described a technique that allows the use of algorithmic skeletons on distributed data structures with dynamic or variable-sized components. For that, we have enhanced the communication inside the skeletons with additional messages, packing/unpacking of data elements and, for reasons of efficiency, with a test on dynamic data types. From the point of view of the user, handling dynamic data structures amounts to the task of specifying two functions for each such structure, which are passed as additional arguments to the skeletons that move data elements between processors. Note that this view is consistent with the philosophy of data parallel skeletons, which aims at reducing the description of a distributed data structure and its functionality to local descriptions of its elements and their functionality.

We have then shown how a parallel algorithm that uses dynamic data can be implemented on the basis of skeletons. The results we have obtained support the idea that the use of skeletons leads to efficient parallel programs, with small losses in performance relative to direct low-level implementations. On the whole, this is a small price to pay for the considerable rise in programming convenience, safety and portability. Future plans include the handling of `overall' dynamic data structures, in particular the implementation of adaptive multigrid methods based on the skeletons presented in [5].
References

1. B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, M. Vanneschi: P3L: a Structured High-level Parallel Language and its Structured Support, Technical Report HPL-PSC-93-55, Pisa Science Center, Hewlett-Packard Laboratories, 1993.
2. P. Beckman, D. Gannon, E. Johnson: Portable Parallel Programming in HPC++, in Proceedings of ICPP '96, 1996.
3. G. Blelloch: NESL: A Nested Data-Parallel Language (3.1), Technical Report CMU-CS-95-170, Carnegie-Mellon University, 1995.
4. F. Bodin, P. Beckman, D. Gannon, S. Narayana, S. X. Yang: Distributed pC++: Basic Ideas for an Object-Oriented Parallel Language, in Scientific Programming, Vol. 2, No. 3, 1993.
5. G. H. Botorog, H. Kuchen: Algorithmic Skeletons for Adaptive Multigrid Methods, in Proceedings of IRREGULAR '95, LNCS 980, Springer, 1995.
6. G. H. Botorog, H. Kuchen: Skil: An Imperative Language with Algorithmic Skeletons for Efficient Distributed Programming, in Proceedings of the Fifth International Symposium on High Performance Distributed Computing, IEEE Computer Society Press, 1996.
7. G. H. Botorog, H. Kuchen: Parallel Programming in an Imperative Language with Algorithmic Skeletons, in Proceedings of EuroPar '96, Vol. 1, LNCS 1123, Springer, 1996.
8. M. Chandi, C. Kesselman: CC++: A Declarative Concurrent Object Oriented Programming Notation, in Research Directions in Concurrent Object-Oriented Programming, MIT Press, 1993.
9. M. I. Cole: Algorithmic Skeletons: Structured Management of Parallel Computation, MIT Press, 1989.
10. J. Darlington, A. J. Field, P. G. Harrison et al.: Parallel Programming Using Skeleton Functions, in Proceedings of PARLE '93, LNCS 694, Springer, 1993.
11. J. Darlington, Y. Guo, H. W. To, J. Yang: Functional Skeletons for Parallel Coordination, in Proceedings of EuroPar '95, LNCS 966, Springer, 1995.
12. High Performance Fortran Language Specification, in Scientific Programming, Vol. 2, No. 1, 1993.
13. H. Kuchen, R. Plasmeijer, H. Stoltze: Efficient Distributed Memory Implementation of a Data Parallel Functional Language, in Proceedings of PARLE '94, LNCS 817, Springer, 1994.
14. X. Li, P. Lu, J. Schaeffer et al.: On the Versatility of Parallel Sorting by Regular Sampling, Parallel Computing, Vol. 19, North-Holland, 1993.
15. M. J. Quinn: Parallel Computing: Theory and Practice, McGraw-Hill, 1994.
16. M. Röttger, U. P. Schroeder, J. Simon: Virtual Topology Library for Parix, Technical Report 148, University of Paderborn, 1994.
17. D. Skillicorn: Foundations of Parallel Programming, Cambridge University Press, 1994.