Algorithmic Skeletons in an Imperative Language for Distributed Programming

George Horatiu Botorog (RWTH Aachen, Informatik II, D-52056 Aachen)
Herbert Kuchen (AG Informatik, J.-L.-Universität Gießen, Arndtstr. 2, D-35392 Gießen)

The work of the first author has been supported by the "Graduiertenkolleg Informatik und Technik" at RWTH Aachen.
Abstract
Algorithmic skeletons are functions representing common parallelization patterns and implemented in parallel. They can be used as the building blocks of parallel and distributed applications by integrating them into a sequential language. In this paper, we present a new approach to programming with skeletons. We integrate the skeletons into an imperative host language, which leads to important gains in efficiency. At the same time, we enhance the host language with some functional features, like higher-order functions and currying, as well as with a polymorphic type system, in order to allow high-level programming. After describing a series of skeletons which work with distributed arrays, we give two examples of parallel algorithms written as sequential programs, namely shortest paths in graphs and Gaussian elimination. Run-time measurements show that we approach the efficiency of message-passing C up to a factor between 1 and 2.5.

Keywords: Algorithmic Skeletons, Data Parallelism, High-Level Imperative Languages, Matrix Applications, MIMD-DM Computers.
1 Introduction

Although parallel and distributed systems gain more and more importance nowadays, the state of the art in the field of parallel software is far from satisfactory. Not only is the programming of such systems a tedious and time-consuming task, but its outcome is usually machine-dependent, and hence its portability is restricted. One of the main reasons for these difficulties is the rather low level of the parallel programming languages available today, in which the user has to account explicitly for all aspects of parallelism, i.e. for communication, synchronization, data and task distribution, load balancing etc. Moreover, program testing is impeded by deadlocks, non-deterministic program runs, non-reproducible errors, and sparse debugging facilities. Finally, programs run only on certain machines or on restricted classes of machines. However, due to the low programming level, a high performance can be achieved.

Attempts were made to design high-level languages that hide the details of parallelism and allow the programmer to concentrate on his problem. Such high-level approaches
would naturally lead to less efficient programs than those written directly in low-level languages, and the key issue is to find a good trade-off between losses in efficiency and gains in ease of programming, as well as in safety and portability of programs. In this paper, we shall present such an approach, namely a language with algorithmic skeletons. Although skeletons can be implemented on every distributed system, we shall consider here only MIMD computers with distributed memory. The concepts presented here can nevertheless be ported to other architectures, such as shared memory systems, SIMD computers, workstation clusters etc., as well.

A skeleton is an algorithmic abstraction common to a series of applications, which can be implemented in parallel [6]. Skeletons are embedded into a sequential host language, thus being the only source of parallelism in a program. Confining parallelism to skeletons makes it possible to control it. Moreover, the user no longer has to cope with parallel implementation details, which are not visible outside the skeletons. Thus, instead of error-prone communication via individual messages, there is a coordinated overall communication, which is guaranteed to work deadlock-free. On the other hand, skeletons are implemented on a low level and are therefore efficient. At the same time, they offer an interface which abstracts from the underlying hardware, thus making programs portable. Classical examples of skeletons include map, farm and divide&conquer [7].

Most skeletons are defined as higher-order functions, i.e. as functions with functional arguments and/or a functional return value. The parameterization with functions leads to a flexibility beyond that of library functions in imperative languages. This is illustrated by the following example. Let us consider the skeleton divide&conquer (d&c), which implements the well-known computation pattern, and which can be expressed in the functional language Haskell [10] as follows (note that this definition of d&c only describes the overall functionality; it does not capture the parallel implementation):

  d&c :: (a -> Bool) -> (a -> b) -> (a -> [a]) -> ([b] -> b) -> a -> b
  d&c is_trivial solve split join problem
    = if (is_trivial problem)
      then (solve problem)
      else (join (map (d&c is_trivial solve split join) (split problem)))
The skeleton gets four functions as arguments: is_trivial tests if a problem is simple enough to be solved directly, solve solves the problem in this case, split divides a problem into a list of subproblems, and join combines a list of sub-solutions into a new (sub-)solution. The last argument is the problem to be solved. The function map applies a given function to all elements of a given list, i.e. map f [x1, ..., xn] = [f x1, ..., f xn]. Given this skeleton, the implementation of an algorithm that has the divide&conquer structure requires only the implementation of the four argument functions and a call of the skeleton. For instance, a quicksort procedure for lists can be implemented as follows:

  quicksort list = d&c is_empty identity (divide (hd list)) append list
where is_empty checks for the empty list, identity is the identity function, divide splits a list into two lists containing the elements that are smaller and larger than the element at the head of the list (hd list), respectively, and finally append concatenates two lists. Other algorithms with the divide&conquer structure, such as Strassen's matrix multiplication, polynomial evaluation, numerical integration, FFT etc., can be implemented similarly, only by using different customizing argument functions. Skeletons can thus be viewed on a conceptual level as common parallelization patterns and on a pragmatic level as higher-order functions. Note that a similar flexibility cannot be achieved by first-order functions, i.e. functions without functional arguments.

Since skeletons can be represented in functional languages in a straightforward way, it would seem appropriate to use a functional language as host. Indeed, nearly all implementations of skeletons rely on functional hosts. There are currently a number of research groups working on the design and implementation of parallel functional languages with algorithmic skeletons, among them Darlington et al. [7], Skillicorn [20], Deldarie et al. [9] and Kuchen et al. [13]. On the other hand, few attempts have been made to use an imperative language as a host. One such attempt is P3L [2], which builds on top of C++, and in which skeletons are internal language constructs. The main drawbacks are the difficulty of adding new skeletons and the fact that only a restricted number of skeletons can be used. Another approach is taken in the language ILIAS [15]. Here, arithmetic and logic operators are extended to pointwise operators that can be applied to the elements of matrices. These operations are not skeletons in the sense of the above definition, but their functionality closely resembles that of some skeletons. Another interesting approach is SPP(X), developed at Imperial College [8], in which a 'two-layer' language is used: a high-level functional language, called the Structured Control Language, in which the application is written, and a low-level Base Language (at present Fortran) for efficient sequential code to be called from within the skeletons.

Depending on the kind of parallelism used, skeletons can roughly be classified into process parallel and data parallel ones. In the first case, the skeleton creates a series of processes which run concurrently; examples include farm, pipe and divide&conquer. Such skeletons are used in [2, 7]. Data parallel skeletons, on the other hand, act upon distributed data structures, performing the same operations on all elements of the data structure. Data parallel skeletons, like map, shift or fold, are used in [2, 7, 13, 20].

In this paper, we present a new imperative language with algorithmic skeletons. Firstly, we describe some functional features with which we enhance our host language in order to cope with the generality expected from skeletons. After that, we define the data structure "distributed array" and present some data-parallel skeletons operating on it. We then show how two parallel applications can be programmed in a sequential style by using these skeletons. Run-time measurements show that our results are better than those obtained by using functional languages with skeletons, approaching direct low-level (C) implementations. The rest of the paper is organized as follows.
Section 2 describes the imperative host language and the additional features that are needed to integrate the skeletons. Section 3 presents some skeletons for distributed arrays. Section 4 presents two applications, shortest paths in graphs and Gaussian elimination, implemented on the basis of skeletons. Section 5 contains some run-time results, as well as comparisons with similar implementations in parallel C and in a skeleton-based functional language. Finally, Section 6 concludes the paper.
2 The Language

As we have mentioned in the introduction, skeletons can easily be embedded into functional host languages as higher-order functions. However, using a functional language leads to a decrease in the efficiency of the resulting programs, which can become critical in applications where performance is an important issue, like for instance in scientific computing. We therefore choose our host language to be an imperative one. This approach has several advantages. On the one hand, the sequential parts of the program are more efficient in this case. Moreover, since host and skeletons are both imperative, the overhead of the context switch, which is considerable in the case of a functional host, no longer has to be taken into account. On the other hand, imperative languages offer mechanisms for locally accessing and manipulating data, which have to be simulated in functional languages [13]. This leads to further gains in efficiency, bringing the performance of our language close to that of low-level parallel imperative languages with message-passing.

We have named the language Skil, as an acronym for Skeleton Imperative Language. In order to be able to integrate algorithmic skeletons, we shall enhance Skil with a series of functional features and with a polymorphic type system. We shall then show how these features can be implemented without efficiency losses by using adequate compiling techniques. We shall use a subset of the language C as the basis of Skil. This is, however, only a pragmatic choice, motivated by the fact that C is a widespread language in the field of parallel programming (by "C" we mean here the series of parallel C dialects, like for instance Parix-C [16]), for which good compilers are available, thus easing the task of translating Skil programs. Nevertheless, other imperative languages could equally well be used instead.
2.1 Functional Features
Consider again the definition of d&c given in the introduction. We shall use it to show by which features C has to be extended in order to support skeletons (although the features are derived here from a single example, we have verified that they are sufficient to implement the entire range of algorithmic skeletons known to us). The skeleton d&c is a higher-order function, since it has functional arguments (solve, split, is_trivial, and join). Thus, higher-order functions are among the functional features we need in Skil. A second feature we need is function currying. Note that we have defined the type of d&c as the curried type

  (a -> Bool) -> (a -> b) -> (a -> [a]) -> ([b] -> b) -> a -> b
instead of the uncurried type

  (a -> Bool) × (a -> b) × (a -> [a]) × ([b] -> b) × a -> b
This indicates that d&c can be applied to its first argument, yielding a new function with the type (a -> b) -> (a -> [a]) -> ([b] -> b) -> a -> b, then to its second etc., until the last application finally returns a value of type b. The underlying idea is to consider the application of an n-ary function as a successive application of unary functions. This procedure is called currying, after the mathematician H. B. Curry, and allows functions to be partially applied to their arguments. See [17] for further details.

Partial applications are useful in a series of situations, for instance in supplying additional parameters to functions. Consider the (recursive) d&c call in the else branch of the skeleton definition. On the one hand, the function map expects a functional argument of type a -> b. On the other hand, we want to call d&c and at the same time provide it with the rest of the arguments it needs, apart from the problem to be solved, i.e. with the functions is_trivial, solve, split and join. This can be done by partially applying d&c to these arguments, which yields a function of type a -> b.

Partial applications are one of the main reasons for the functional extensions in Skil. While functional arguments with no arguments of their own can be simulated in C by passing/returning pointers to functions, this is no longer possible with functional arguments yielded by partial application. We do not consider here the possibility of passing the additional parameters as global variables, since it is bad programming style to simulate variables which actually have local scope by global ones. This would introduce a source of errors which are hard to find. Moreover, this simulation is not always possible: if, for example, different partial applications of the same function are given as parameters to the same skeleton, it is not clear what should be stored in the global variables.

Apart from these two main constructs, some other features of lesser importance are included in our language, like for instance the conversion of operators to functions, which allows passing operators as functional arguments to skeletons, as well as partial applications of operators, as shown in the examples below (starting with these examples, we switch from the functional notation used so far to the C-like Skil syntax, which will be used throughout the rest of the paper):

  fold ((+), lst1)
  map ((*) (2), lst2)
The first call computes the sum of all elements of the list lst1, whereas the second multiplies each element of the list lst2 by 2. The conversion of operators to functions is not an essential feature of Skil, since it can easily be replaced by introducing a new function for each operator needed, as sketched below. However, the use of converted operators leads to efficiency gains [5].
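For instance (a minimal sketch; the function name times_two is not part of the paper and is introduced here only for illustration), the second call above could be rewritten without operator conversion as:

  /* explicit replacement for the partial application (*) (2) */
  int times_two (int x)
  { return 2 * x ; }

  map (times_two, lst2) ;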
2.2 The Type System
If we look back at the type declaration of the skeleton d&c, we see that it is polymorphic, as it contains the type variables a and b. We want Skil to have a polymorphic type system, since we want to define skeletons that depend only on the structure of the problem, and not on particular data types. For instance, the skeleton d&c performs the same operations regardless of the type of the problem and the solution (this generality might be restricted in some cases to certain type classes [10]). The advantage is that the same skeletons can be re-used in solving similar or related problems.
In order to build a polymorphic type system, we have to include type variables in our language. Syntactically, a type variable is an identifier which begins with a $, e.g. '$t'. Type variables can be instantiated with arbitrary types, except types introduced by the pardata construct (see the next subsection). The only condition for this instantiation is that identical type variables must be substituted by identical types if they belong to the same type expression. Polymorphic types are either type variables, or compound types built from other types using the C type constructors array, function, pointer, structure or union [12] and containing at least one type variable. In the case of structured types containing fields with polymorphic type, as well as in the case of polymorphic typedef's, the type variables occurring inside the definitions must appear as parameters of the type declaration, enclosed in angle brackets (< >):

  struct _list <$t> {$t elem ; struct _list <$t> *next ;}
  typedef struct _list <$t> * list <$t> ;
Polymorphism can be simulated in C by using void pointers and casting. This is nevertheless a potential source of errors, since type checking is thus eluded. Our approach leads to safer programs, as polymorphic type checking is performed.

As an illustration of the language enhancements we have presented, we now give the equivalent Skil definition of the skeleton d&c. Note that, apart from some small differences in the syntax, the functional specification can be directly transformed into our language (again, this is only the specification, and not a parallel implementation):

  $b d&c (int is_trivial ($a), $b solve ($a), list <$a> split ($a),
          $b join (list <$b>), $a problem)
  {
    if (is_trivial (problem))
      return solve (problem) ;
    else
      return join (map (d&c (is_trivial, solve, split, join),
                        split (problem))) ;
  }

We have assumed here the existence of a polymorphic type list <$t>, like the one given by the above definition. Further, the return type of is_trivial was changed to int, since C uses this type instead of Bool.
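Using this definition, the quicksort example from the introduction might be written in Skil as follows (a sketch only; the functions is_empty, identity, divide, append and hd are assumed to be user-defined list operations, as in the Haskell version):

  list <$t> quicksort (list <$t> l)
  {
    return d&c (is_empty, identity, divide (hd (l)), append, l) ;
  }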
2.3 Distributed Data Structures
Apart from the need to cope with distributed functionality, a parallel language must allow the definition of distributed data structures. This issue has been addressed in most languages with skeletons by considering these distributed types (mostly arrays) as internally defined [13, 14]. Since one of our aims is flexibility, we want to allow any distributed data structure to be defined, as long as it is 'homogeneous', in the sense that it is composed of identical data structures placed on each processor. This is done by means of the 'pardata' construct, which is similar to the typedef construct, but introduces a distributed ("parallel") data structure. The syntax of this construct is:
  pardata name <$t1, ..., $tn> implem ;
where $t1, ..., $tn are type variables and implem is a (polymorphic) type, representing the data structure to be created on each processor. The distributed data structure is then the entirety of all local structures and is identified by name. Distributed data structures may not be nested, i.e. the type arguments of a pardata construct cannot be instantiated with other pardatas.

A problem that arises when using (visible) distributed data structures in an imperative language is that the programmer is able to access local parts of the structure, for instance via pointers or indices. If this occurs in an uncontrolled way, heavy remote data accessing may cause considerable communication overhead. In order to control the access to data, we keep the implementation ('implem') of the distributed data type hidden, making only its "header" with the type arguments visible. The pardata construct can thus be used similarly to prototypes of library functions given in include files, but whose body is not accessible. For instance, if we have the following pardata "prototype" given in, say, the header file my_pardata.h:

  pardata my_pardata <$t> ;

then we can instantiate it with argument types and use it in variable declarations like:

  my_pardata <int> m ;
The distributed data structure can only be accessed by using the skeletons defined for it, like for instance my_skeleton (m). The examples in Section 4 illustrate this in more detail.
2.4 Implementation Issues
We shall now briefly address some implementation issues of Skil. The main problem here is the translation of the functional features and of polymorphism, as they have a crucial influence on the performance of the resulting programs. Since a classical implementation by closure techniques [17] is too inefficient, we use an instantiation technique related to Wadler's higher-order macro-expansion [21] instead. This technique translates a (polymorphic) higher-order function (HOF), possibly with partial applications, into one or more specialized first-order monomorphic functions [4]. Since this elimination is done at compile-time, a restriction has to be made regarding the functional arguments of HOFs. However, this restriction concerns only special cases of recursively-defined HOFs, which seldom appear in practice [4]. Moreover, since in our case the HOFs are actually the skeletons, this restriction can be regarded as internal.

A Skil program is thus processed by a front-end compiler together with the necessary skeletons and pardata implementations. This compiler translates all functional features in one pass and generates parallel C code based on message passing, which can then be processed by a C compiler used as a back-end. Using the instantiation technique for parallel programs with skeletons leads to intermediate (C) code that differs only little from hand-written (C) versions of these programs, usually containing more function calls. This explains the good results obtained for our test applications, which are presented in Section 5.
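To give an intuition of the effect of instantiation (a simplified sketch, not the actual compiler output), a call like fold ((+), lst1) on a list of integers could be translated into a specialized first-order, monomorphic function along the following lines; the names fold_inst_1 and list_int are hypothetical:

  /* specialized instance of fold for element type int and
     the operator argument (+), which has been inlined */
  typedef struct _list_int {int elem ; struct _list_int *next ;} *list_int ;

  int fold_inst_1 (list_int l)
  {
    int acc = 0 ;
    for ( ; l ; l = l->next)
      acc = acc + l->elem ;
    return acc ;
  }

The call itself is then replaced by fold_inst_1 (lst1).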
3 Skeletons for Arrays

We have seen that skeletons can be classified into data parallel and process parallel ones. Although both types can be integrated in Skil, the emphasis is placed here on the first category, since data parallelism seems to offer better possibilities to exploit large numbers of processors. We shall describe a series of skeletons which work on the distributed data structure "array". This data structure is defined as:

  pardata array <$t> ... implementation ... ;
where the type parameter $t denotes the type of the elements of the array. Although the implementation of the pardata and of the skeletons is not visible to the user, we shall give some of its details in order to make the explanation clearer. At present, arrays can be distributed only block-wise onto the processors. Each processor thus gets one block (partition) of the array, which, apart from its elements, contains the local bounds of the partition. These bounds are accessible via the skeleton array_part_bounds:

  Bounds array_part_bounds (array <$t> a) ;
Array elements can be accessed by using the macros:

  $t   array_get_elem (array <$t> a, Index ix) ;
  void array_put_elem (array <$t> a, Index ix, $t newval) ;
which read, respectively overwrite, the given array element. An important aspect is that these macros can only be used to access local elements, i.e. the index ix should be within the bounds of the array partition currently placed on each processor. The reason for this restriction is that remote accessing of single array elements easily leads to very inefficient programs. Non-local element accessing should thus be done only in a coordinated way by means of skeletons. In our sample applications, remote element access is done by computational and communication skeletons.

We shall now present some skeletons for the distributed data structure array. For each skeleton, its syntax, (informal) semantics and complexity are given. Note that for the higher-order skeletons this complexity can only be given as a function of the complexities of their functional arguments, since the use of different customizing functions may lead to a different overall complexity [9]. We shall denote by p the number of processors (and, hence, of array partitions), by d the dimension of the array, and assume that our arrays have the same size n in each dimension. This condition is not necessary, but serves to simplify the expressions obtained for the complexity. For each skeleton, we shall give the actual computation time, t(n), and the overall complexity, c(n) = t(n) · p. We shall consider that a local operation on a processor and the sending of one array element from one processor to one of its neighbors both take one time unit.
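As an illustration of this local-access discipline (a sketch which is not part of the skeleton library; it assumes a one-dimensional integer array and the representation of Bounds and Index used in the examples of Section 4), each processor can traverse its own partition as follows:

  /* double every element of the local partition of a 1-dimensional array */
  void double_local_part (array <int> a)
  {
    Bounds bds = array_part_bounds (a) ;
    Index ix ;

    for (ix[0] = bds->lowerBd[0] ; ix[0] <= bds->upperBd[0] ; ix[0]++)
      array_put_elem (a, ix, 2 * array_get_elem (a, ix)) ;
  }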
3.1 Constructor Skeletons

array_create creates a new, block-wise distributed array and initializes it using a given function. The skeleton has the following syntax:

  array <$t> array_create (int dim, Size size, Size blocksize, Index lowerbd,
                           $t init_elem (Index), int distr) ;

where dim is the number of dimensions of the array. At present, only one- and two-dimensional arrays are supported, since this can be done more efficiently than for higher-dimensional arrays. The types Size and Index are (classical) arrays with dim components. size contains the global sizes of the array. blocksize contains the sizes of a partition; passing a zero value for a component lets the skeleton fill in an appropriate value depending on the network topology. lowerbd is the lowest index of a partition; passing a negative value for a component lets the skeleton derive the lower local bound for this dimension. init_elem initializes each element of the array depending on its index. distr gives the virtual (software) topology onto which the array should be mapped (where available). In our implementation based on Parix [16], an array can be mapped

  - directly onto the hardware topology (DISTR_DEFAULT)
  - onto a ring virtual topology (DISTR_RING) - this can be useful for 1-dimensional arrays
  - onto a torus virtual topology (DISTR_TORUS2D) - this can be useful for 2-dimensional arrays, as shown in the first example in Section 4.

If we denote the complexity of the initialization function by t_i, then the time complexity of the skeleton is t(n) ∈ O(t_i · n^d / p), since we have to call init_elem for each element of a partition and the partitions are processed in parallel. The overall complexity is c(n) ∈ O(t_i · n^d).
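For instance (a sketch; the initialization function sqr_of_index is hypothetical, and {x} denotes a Size or Index value in the pseudo-code notation also used in Section 4), a one-dimensional array holding the squares 0, 1, 4, ..., distributed block-wise over a ring topology, could be created as:

  int sqr_of_index (Index ix)
  { return ix[0] * ix[0] ; }

  array <int> squares ;
  squares = array_create (1, {n}, {0}, {-1}, sqr_of_index, DISTR_RING) ;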
array_destroy deallocates an existing array. It has constant time complexity t(n) ∈ O(1), and hence c(n) ∈ O(p).

  void array_destroy (array <$t> a) ;
3.2 Computational Skeletons
array_map applies a given function to all elements of an array and puts the results into another array. The two arrays may, however, be identical; in this case the skeleton performs an in-situ replacement. The syntax is:

  void array_map ($t2 map_f ($t1, Index), array <$t1> from, array <$t2> to) ;
The source and the target arrays do not necessarily have the same element type. The result is placed in another array rather than returned, since if an array were returned, it would first have to be allocated by the skeleton, only to be assigned to another array shortly afterwards. Our solution avoids this additional memory consumption. The return-solution is however used in array_create, since that skeleton allocates the new array anyway, whereas array_map only 'fills in' new values in an existing array. Note that this efficiency improvement is not possible in functional host languages, where side-effects are not allowed. The complexity of this skeleton is similar to that of array_create: t(n) ∈ O(t_m · n^d / p), where t_m is the complexity of the applied function map_f. Further, c(n) ∈ O(t_m · n^d).
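For example (inc is a hypothetical user function and a is assumed to be of type array <int>; the index argument is simply ignored here), the following call increments all elements of an integer array in situ:

  int inc (int x, Index ix)
  { return x + 1 ; }

  array_map (inc, a, a) ;   /* source and target are the same array */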
array_fold composes ("folds together") all elements of an array.

  $t2 array_fold ($t2 conv_f ($t1, Index), $t2 fold_f ($t2, $t2), array <$t1> a) ;
The skeleton first applies the conversion function conv_f to all array elements in a map-like way (this step could also be done by a preliminary array_map, but our solution is more efficient). After that, each processor composes all elements of its partition using the folding function fold_f. In the next step, the results from all partitions are folded together. Since the order of composition is non-deterministic, the user should provide an associative and commutative folding function, otherwise the result is non-deterministic. This step is performed along the edges of a virtual tree topology, with the result finally collected at the root. In order to make the result known to all processors, it is broadcast from the root along the tree edges to all other processors.

If t_c and t_f are the complexities of the conversion and folding function, respectively, then the complexity of the local computations is given by the initial map and the local (sequential) folding and amounts to O((t_c + t_f) · n^d / p). The complexity of communication and non-local computation is given by the folding of the single results from each processor and by the broadcasting of the final result, and is O((t_f + 1) · log2 p) (actually, if the hardware topology is not a tree, this complexity is O((t_f + 1) · δ · log2 p), where δ is the dilation of embedding the tree into the hardware topology [19]). The overall complexity of the fold skeleton is thus t(n) ∈ O(t_c · n^d / p + t_f · (n^d / p + log2 p)) and c(n) ∈ O(t_c · n^d + t_f · (n^d + p · log2 p)).
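For instance, the global sum of an array a of type array <int> can be computed as follows (ident is a hypothetical conversion function which simply discards the index; the converted operator (+) plays the role of the folding function):

  int ident (int x, Index ix)
  { return x ; }

  int sum = array_fold (ident, (+), a) ;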
array_gen_mult composes two arrays of dimension 2 using the pattern of matrix multiplication, i.e. for each element of the result matrix it computes the "scalar product" of the corresponding row of the first matrix and the corresponding column of the second one. This composition is parameterized by the function which composes a row element with a column element (here gen_mult) and by the function which folds all partial results together (here gen_add). If the actual multiplication and addition are used, then we obtain the classical matrix multiplication. However, by using other functions, different algorithms with this communication pattern can be implemented, like for instance shortest paths in graphs (see Section 4).

  void array_gen_mult (array <$t> a, array <$t> b, $t gen_add ($t, $t),
                       $t gen_mult ($t, $t), array <$t> c) ;
Here a and b are the arrays that are multiplied and c is the result array. The skeleton uses Gentleman's distributed matrix multiplication algorithm, in which local partition multiplications alternate with partition rotations among the processors [18]. These rotations are done horizontally for the first matrix and vertically for the second one. Since the use of the same matrix in both roles would lead to inconsistencies, we impose the condition that the two matrices must be distinct, i.e. calls of the form array_gen_mult (a, a, ...) are not allowed. This algorithm has the same asymptotic computation complexity as the classical method, but has an improved communication complexity, so that the required time is t(n) ∈ O(n^3 / p + n^2 / √p). The overall complexity is c(n) ∈ O(n^3 + n^2 · √p) [14].
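For example, the classical product of two distributed matrices can be obtained by passing the converted operators (+) and (*) as customizing functions (a, b and c are assumed to be previously created arrays of type array <float>, with a and b distinct):

  array_gen_mult (a, b, (+), (*), c) ;   /* c := a * b */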
array_copy copies one array into another. The second array must have been previously created. Although the theoretical complexity is t(n) ∈ O(n^d / p), this function runs much faster in practice, since the array partitions are internally represented as contiguous memory areas, and thus do not have to be copied element by element. This is the reason why this skeleton was implemented, instead of using a correspondingly parameterized array_map for this purpose.

  void array_copy (array <$t> from, array <$t> to) ;
3.3 Communication Skeletons
array_broadcast_part broadcasts one array partition to all other processors. Each processor overwrites its partition with the broadcast one. The partition to be broadcast is the one containing the element with index ix.

  void array_broadcast_part (array <$t> a, Index ix) ;
As in array_fold, the broadcasting is done along the edges of a tree, only that here entire partitions are sent instead of single elements. The complexity is thus t(n) ∈ O((n^d / p) · log2 p) and c(n) ∈ O(n^d · log2 p).
array_permute_rows applies only to 2-dimensional arrays. It permutes the rows of the array using a given permutation function, which computes the new position of a row based on the old one. The user must provide a function that is bijective on the interval of row indices, otherwise a run-time error occurs.

  void array_permute_rows (array <$t> from, int perm_f (int), array <$t> to) ;
The complexity of this skeleton is hard to estimate. It might be that all row shifts are done locally (on each processor), or that all of them involve communication. In the latter case, it might be that only nearest-neighbor communication is needed, or only communication between processors that are far apart. Moreover, no a priori virtual topology is used; the communication is established dynamically, based on the communication pattern given by perm_f. In the worst case, the complexities are t(n) ∈ O(n^2 · p) and c(n) ∈ O(n^2 · p^2). The counterpart of this skeleton with respect to partitions, array_permute_parts, which can be applied to arrays with arbitrary dimensions, is defined analogously.
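A typical customizing function is one that exchanges two given rows and leaves all others in place (a sketch; it corresponds to the function switch_rows used in the Gaussian elimination example of Section 4.2, whose implementation is not shown there):

  /* new position of row r, if rows r1 and r2 are to be exchanged */
  int switch_rows (int r1, int r2, int r)
  {
    if (r == r1) return r2 ;
    else if (r == r2) return r1 ;
    else return r ;
  }

The call array_permute_rows (a, switch_rows (i, j), b) then writes a into b with rows i and j exchanged; the first two arguments of switch_rows are supplied by partial application.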
4 Sample Applications

We shall now present two sample applications illustrating the way parallel programs can be written using algorithmic skeletons. The first program is a simple application of the generic matrix multiplication skeleton, whereas the second is more complex, combining maps, folds, broadcasts and permutations. Run-time measurements and comparisons with other implementations are given in Section 5.
4.1 Shortest Paths in Graphs
Let G = (V, E), with V = {v1, ..., vn}, be a graph with non-negative integer edge weights and A = (a_ij), 1 ≤ i, j ≤ n, its distance matrix, with a_ii = 0, a_ij = w_ij if there is an edge between the nodes v_i and v_j with weight w_ij, and a_ij = ∞ otherwise. Then the length of the shortest path between the nodes v_i and v_j is equal to the element c_ij of the matrix C = A^n, where A^n is computed on the basis of matrix multiplication in which scalar multiplication has been replaced by addition and scalar addition by the minimum operation [1]. The implementation of the algorithm is given below. By successively computing A^2, A^4, ..., we need only log2 n iterations to compute A^n (we assume for simplicity that n is a power of 2). In order to avoid excessive code details, we use the pseudo-code notation {a, b} for the (classical) array with the elements a and b.

  void shpaths (int n)
  {
    array <unsigned int> a, b, c ;
    int i ;

    a = array_create (2, {n,n}, {0,0}, {-1,-1}, init_f, DISTR_TORUS2D) ;
    b = array_create (2, {n,n}, {0,0}, {-1,-1}, zero, DISTR_TORUS2D) ;
    c = array_create (2, {n,n}, {0,0}, {-1,-1}, int_max, DISTR_TORUS2D) ;

    for (i = 0 ; i < log2 (n) ; i++)
    {
      array_copy (a, b) ;
      array_gen_mult (a, b, min, (+), c) ;
      if (i < log2 (n) - 1)
        array_copy (c, a) ;
    }

    /* output array c */

    array_destroy (a) ;
    array_destroy (b) ;
    array_destroy (c) ;
  }
In the first line of shpaths, the variables a, b and c are declared as having the type "distributed array with elements of type unsigned integer". The arrays are then created with dimension 2 and total size n × n by the appropriate skeleton. They are distributed onto a 2-dimensional torus virtual topology, since this optimizes the communication inside the array_gen_mult skeleton. Moreover, the values of the third and fourth parameters of the calls to array_create tell the skeleton to construct the array partitions with default (i.e. equal) size and default positioning. The elements of array a are initialized by some function init_f, those of array b are set to 0, since b will hold a copy of a, and those of array c are set to the maximal integer value, which represents ∞. In order to avoid an overflow when adding a value to "∞", the type unsigned integer was used for the array elements. This initialization through an argument function is possible due to the fact that skeletons are higher-order functions.

The rest of the program contains the loop which computes A^n. The array_gen_mult skeleton is called with the minimum function in the role of the scalar addition and with the addition function in the role of the scalar multiplication. Since the array arguments of array_gen_mult must be distinct, the array a is first copied into b. Finally, after outputting the results, the arrays are deallocated.
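The customizing functions init_f, zero, int_max and min are not shown in the paper; they could be sketched as follows (assumptions: a globally accessible function edge_weight (i, j) returning the weight of the edge between v_i and v_j, or UINT_MAX if there is none; UINT_MAX plays the role of ∞):

  unsigned int init_f (Index ix)
  {
    if (ix[0] == ix[1]) return 0 ;
    else return edge_weight (ix[0], ix[1]) ;
  }

  unsigned int zero (Index ix)
  { return 0 ; }

  unsigned int int_max (Index ix)
  { return UINT_MAX ; }

  unsigned int min (unsigned int x, unsigned int y)
  { return (x < y) ? x : y ; }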
4.2 Gaussian Elimination
Gaussian elimination can be used to solve linear systems of the form A · x = b or to invert a matrix A, if A is not singular. In the first case, the right hand side vector b is added to the matrix A as (n+1)-st column. Using linear transformations and combinations of the rows of the extended matrix to obtain the identity matrix in the first n columns, the (also transformed) (n+1)-st column then represents the solution vector x. Formally, the transformation is given by:

  for k := 0 to n - 1 do
    for i := 0 to n - 1 do
      if i ≠ k then
        for j := n downto k do
          a_ij := a_ij - a_ik · (a_kj / a_kk)
  for i := 0 to n - 1 do
    x_i := a_in / a_ii

The algorithm given above is not complete, since it fails if the current pivot element (a_kk) is 0. This can be avoided by exchanging the current row with one where the k-th element is non-zero (for reasons of numerical stability, the row with the greatest absolute value of the k-th element is chosen as pivot row).
We shall now parallelize this algorithm. The outer loop is not adequate for parallelization, since it contains inter-loop dependencies, so we take the inner two loops and implement them with a map skeleton. However, the innermost loop must be run sequentially from n downto k in order not to overwrite elements too soon. Since the order in which map applies a function to the elements of an array cannot be imposed, we have to use two different arrays, one as source and one as target. The implementation of the algorithm is given below.

  typedef struct _elemrec {float val ; int row ; int col ;} elemrec ;

  void gauss (int n)
  {
    array <float> a, b, piv ;
    elemrec e ;
    int k ;

    /* creation of arrays a and b (size n x (n+1), default distribution) */
    /* creation of array piv (size p x (n+1), default distribution) */

    for (k = 0 ; k < n ; k++)
    {
      e = array_fold (make_elemrec, max_abs_in_col (k), a) ;
      if (e.val == 0.0)
        error ("Matrix is singular") ;
      if (e.row != k)
        array_permute_rows (a, switch_rows (e.row, k), b) ;
      else
        array_copy (a, b) ;
      array_map (copy_pivot (b, k), piv, piv) ;
      array_broadcast_part (piv, {k / (n/p), 0}) ;
      array_map (eliminate (k, b, piv), b, a) ;
    }
    array_map (normalize (a), a, b) ;

    /* output array b */
    /* deallocation of arrays a, b and piv */
  }
We consider the right hand side vector b to have been appended to the matrix A, yielding an n × (n+1) matrix. This matrix is divided into p parts, each containing n/p rows (we assume for simplicity that p divides n). This partitioning can be obtained by using appropriate block sizes and lower bounds in the call of array_create.

The first problem is to determine and to broadcast the pivot row. The determination part is done by means of the array_fold skeleton. The result we expect from this skeleton is twofold: on the one hand, the number of the row containing the maximal pivot element, and on the other hand, the value of the pivot element (the reason is that if this value is 0, then the matrix is singular). For that, we define the structured type elemrec, which contains for each array element its value, its row and its column. array_fold first converts each array element to an elemrec and then folds them together with the function max_abs_in_col, which computes the maximum only over those elements whose column is k. This is possible due to the polymorphic type of the skeleton array_fold.

Having found the would-be pivot row, we must exchange it with the current row. This is done by calling the skeleton array_permute_rows with an argument function that, for
each of the two rows involved, returns the number of the other one, and is the identity for every other row. Note that since the switching is done by a skeleton, which works globally on all rows, some processors (here all except at most 2) have nothing to do. This is however no overhead, because no processor can continue before it gets the pivot row, so it would be idle anyway.

The next step is broadcasting the pivot row. Unlike row permutation, row broadcasting can be reduced to broadcasting partitions, if each partition consists of exactly one row (this method does not work for row permutation, in case two rows placed on the same processor are switched; such clash cases would have to be handled by the application program, which would need access to single rows). This is why the array piv is created with the size p × (n+1) (p is the number of processors), each processor thus getting one row. Before actually broadcasting the pivot row, we have to get it from our array into piv. Since single rows cannot be accessed directly, this is done by calling the skeleton array_map. This applies the function copy_pivot to all elements of the array piv, overwriting only those which correspond to the pivot row. The function copy_pivot is given below.

  $t copy_pivot (array <$t> b, int k, $t v, Index ix)
  {
    Bounds bds = array_part_bounds (b) ;

    /* does the local partition contain the pivot row k ? */
    if (bds->lowerBd[0] <= k && k <= bds->upperBd[0])
      return (array_get_elem (b, {k, ix[1]}) / array_get_elem (b, {k, k})) ;
    else
      return (v) ;
  }
Upon checking the arity of this function against its call in the gauss procedure and against the arity and type of the functional argument expected by the skeleton array_map (see Subsection 3.2), one can observe that copy_pivot was partially applied to the array b and to the row number k in the procedure gauss. Partial applications thus allow passing additional parameters to functions called from within skeletons. The remaining two arguments, the array element v and its index ix, are supplied by the skeleton itself. The function copy_pivot checks whether the current partition of the array a contains the pivot row, using the skeleton array_part_bounds. If this is true, then it returns the corresponding element, already normalized relative to the pivot element, i.e. a_kj / a_kk. The skeleton array_map then overwrites the corresponding element of the array piv. Otherwise, the old value is returned, thus leaving the element of piv unchanged.

Now the pivot row can finally be broadcast, and each processor can do the actual elimination on the rows of its partition. This elimination is done by mapping the function eliminate to all elements of the old array (b in the procedure gauss) and writing the results into the new array (a in gauss). This function also gets further arguments by partial application, namely the number of the pivot row, the pivot array and the old array. The elimination is done only for elements to the right of the pivot element, except for those in the pivot row itself, by computing a_ij := a_ij - a_ik · (a_kj / a_kk), as shown below (procId is the number of the current processor, and is equal to the number of the row of the array piv mapped to this processor).

  $t eliminate (int k, array <$t> a, array <$t> piv, $t v, Index ix)
  {
    if (ix[0] == k || ix[1] < k)
      return (v) ;
    else
      return (v - array_get_elem (a, {ix[0], k})
                * array_get_elem (piv, {procId, ix[1]})) ;
  }
Finally, since the pivot elements were not normalized to 1, each element of the last column (i.e. of the result x) has to be divided by the pivot (i.e. diagonal) element of that row. This is done by applying the function normalize to the elements of the array.
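For completeness, the remaining customizing functions, which are only mentioned in the text, might be sketched as follows (hypothetical implementations; fabs is the absolute value function from the C math library, and the last column of the extended matrix is determined via array_part_bounds):

  /* conversion function for array_fold: pack value and position */
  elemrec make_elemrec (float v, Index ix)
  {
    elemrec e ;
    e.val = v ; e.row = ix[0] ; e.col = ix[1] ;
    return e ;
  }

  /* folding function: keep the element of column k with the greatest absolute value */
  elemrec max_abs_in_col (int k, elemrec x, elemrec y)
  {
    if (x.col != k) return y ;
    if (y.col != k) return x ;
    return (fabs (x.val) >= fabs (y.val)) ? x : y ;
  }

  /* divide the elements of the last column by the diagonal element of their row */
  float normalize (array <float> a, float v, Index ix)
  {
    Bounds bds = array_part_bounds (a) ;

    if (ix[1] == bds->upperBd[1])
      return v / array_get_elem (a, {ix[0], ix[0]}) ;
    else
      return v ;
  }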
5 Run-Time Results

We have implemented the skeletons and applications presented above on a Parsytec MC system with 64 T800 transputers running at 20 MHz, under the Parix operating system [16]. Since only 1 MB of memory was available, larger problem sizes could only be fitted into larger networks. We have compared our results with those obtained for the same sample applications using the data-parallel functional language DPFL [13, 14] and the same skeletons. Our run-times are on the average 6 times faster than those of DPFL, approaching in most cases those of hand-written C code. This is due on the one hand to the efficiency of imperative languages, which is higher than that of their functional counterparts, and on the other hand to the instantiation of the functional features. Detailed results are given below.
5.1 Shortest Paths
The shortest paths program was run for graphs with n = 200 nodes on √p × √p processor networks (in the cases where √p does not divide n, the next highest value divisible by √p is taken, e.g. n = 201 for √p = 3). The absolute run-time results, as well as a comparison between Skil and DPFL, are given in Table 1. These results show that the Skil program runs more than 6 times faster than the equivalent DPFL program.
  √p × √p     DPFL     Parix-C     Skil     speedup DPFL/Skil
  2 x 2     1524.22     259.49    234.29         6.51
  3 x 3        -           -      107.69          -
  4 x 4      387.23      65.79     60.78         6.37
  5 x 5        -           -       39.56          -
  6 x 6      185.13      31.53     29.70         6.23
  7 x 7        -           -       21.83          -
  8 x 8       98.76      16.92     16.34         6.04

  Table 1: Run-time results for the shortest paths program
Notice that in this table the Skil run-times even beat the run-times of message-passing C. The reason is that the C implementation referred to here is an older version, which does not use virtual topologies or asynchronous communication, as our skeleton implementation does. We have also compared equally optimized C and Skil versions of the matrix multiplication algorithm, and obtained Skil times around 20% slower than the direct C times [5]. Of course, a Skil program could never beat an equally well optimized C version of that program, since Skil is translated to message-passing C.
5.2 Gaussian Elimination
The Gaussian elimination program was run for matrices of size n × n, for different n between 64 and 640, on √p × √p processor networks. The first version of gauss was implemented without the search for and the exchange of the pivot row, i.e. corresponding to the initial version of the algorithm given in Subsection 4.2. The reason was that this is the version that has been implemented in DPFL [14], and we wanted to make a fair comparison. Table 2 shows the absolute run-times of the Skil program in seconds, its speedups relative to the DPFL program, as well as its slow-downs relative to Parix-C.

  network               n =   64     128     256     384     512     640
  2 x 2   Skil               2.06   14.77  113.29  377.62     -       -
          DPFL/Skil          6.17    6.52    6.65    6.69     -       -
          Skil/Parix-C       2.40    2.51    2.60    2.64     -       -
  4 x 4   Skil               0.91    4.83   32.06  102.16  236.13  453.86
          DPFL/Skil          4.82    5.73    6.22    6.40    6.48     -
          Skil/Parix-C       1.57    1.73    2.02    2.20    2.31    2.38
  8 x 4   Skil               0.85    3.49   19.42   58.03  129.89  244.77
          DPFL/Skil          3.87    4.88    5.62    5.96    6.12    6.24
          Skil/Parix-C       1.25    1.24    1.45    1.65    1.78    1.90
  8 x 8   Skil               0.85    2.94   13.57   37.03   78.71  143.28
          DPFL/Skil          3.48    4.17    4.78    5.21    5.47    5.68
          Skil/Parix-C       1.04    0.94    1.03    1.15    1.26    1.37

  Table 2: Run-time results for Gaussian elimination (without search for greatest pivot).
  [Skil: absolute run-times in seconds; DPFL/Skil: speedup relative to DPFL;
   Skil/Parix-C: slow-down relative to Parix-C]

For more clarity, we have plotted the speedups relative to DPFL and the slow-downs relative to Parix-C against the number of processors, for all matrix sizes, and obtained the graphics in Figure 1. Notice that in the left graphic, most of the speedups relative to DPFL are grouped around the factor 6, while only a few go below 5. These small values correspond to the case when arrays with small sizes are distributed onto large networks, yielding small partitions. In this case, it is obvious that the communication overhead gains more importance, leading to a drop in efficiency. The slow-downs relative to C (depicted in the right graphic) are mainly grouped around 2, in some cases (generally, for large networks) going down to 1. On the average, this result is not as good as the one obtained for the shortest paths problem, but there is a great difference between that application, where practically only one skeleton (array_gen_mult) did everything, and Gaussian elimination, where maps, folds, broadcasts and permutes were combined. Further, the communication in shpaths has a regular structure, which is
not the case in gauss. On the whole, the results relative to message-passing C are good, for larger networks even very good.
Figure 1: Relative speedup Skil vs. DPFL (left) and relative slow-down Skil vs. Parix-C (right)

The second version of gauss we tested was the complete one, as given in the code fragment in Subsection 4.2. The run-times here were about twice as long as in the first version, which is satisfactory, since it is apparent from the description of the implementation of the pivot search and exchange that this step is not trivial and that it brings considerable communication overhead.
6 Conclusions and Future Work

We have presented in this paper a new approach to parallel and distributed programming with algorithmic skeletons. The principal aim was to design a language which allows easy (high-level) parallel programming, and at the same time can be efficiently implemented. For that, we have used an imperative language enhanced with functional features and with a polymorphic type system. These features are eliminated at an early stage by a front-end compiler, thus not impairing the performance of the language. We have then described a series of skeletons for working with distributed arrays and showed how two matrix applications can be implemented on a parallel system by writing sequential programs. Run-time results have confirmed our expectations regarding efficiency, since our implementation was much faster than that of functional languages with skeletons, approaching or even reaching in most cases the performance of hand-written C code based on message-passing.

Future work is necessary in several directions. Firstly, the distributed data type array together with its skeletons is only at a prototype level. It would be interesting to allow other distributions onto processors apart from block-wise, like for instance cyclic, block-cyclic etc. [11]. Further, it should be possible to define overlapping areas for the single partitions, in order to reduce communication in operations which require more than one element at a time. Such operations are used for instance in solving partial differential equations or in image processing [14]. In order to be able to cope with 'real world' applications, new skeletons, for instance for (parallel) I/O, must be designed and implemented. Moreover, the implementation of skeletons can be optimized, for instance by allowing the application of argument functions only to certain elements of the array, since this is more efficient than testing the condition in the customizing function itself. Another research direction concerns the use of skeletons in other application areas. We plan to continue the work we have started on the parallel implementation of adaptive multigrid methods based on skeletons [3], and also to extend it to other areas, like for instance computational geometry.
References

[1] A. V. Aho, J. E. Hopcroft, J. D. Ullman: The Design and Analysis of Computer Algorithms, Addison-Wesley, 1974.
[2] B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, M. Vanneschi: P3L: a Structured High-level Parallel Language and its Structured Support, Technical Report HPL-PSC-93-55, Hewlett-Packard Laboratories, 1993.
[3] G. H. Botorog, H. Kuchen: Algorithmic Skeletons for Adaptive Multigrid Methods, in Proceedings of IRREGULAR '95, LNCS 980, Springer, 1995.
[4] G. H. Botorog, H. Kuchen: Translation by Instantiation: Integrating Functional Features into an Imperative Language, to be presented at CC '96, Linköping, Sweden.
[5] G. H. Botorog, H. Kuchen: Parallel Programming in an Imperative Language with Algorithmic Skeletons, in preparation.
[6] M. I. Cole: Algorithmic Skeletons: Structured Management of Parallel Computation, MIT Press, 1989.
[7] J. Darlington, A. J. Field, P. G. Harrison et al.: Parallel Programming Using Skeleton Functions, in Proceedings of PARLE '93, LNCS 694, Springer, 1993.
[8] J. Darlington, Y. Guo, H. W. To, J. Yang: Functional Skeletons for Parallel Coordination, in Proceedings of EURO-PAR '95, LNCS 966, Springer, 1995.
[9] H. Deldarie, J. R. Davy, P. M. Dew: The Performance of Parallel Algorithmic Skeletons, Technical Report 95.6, University of Leeds, 1995.
[10] P. Hudak, S. Peyton Jones, Ph. Wadler (editors): Report on the Programming Language Haskell, a Non-Strict Purely Functional Language (Version 1.2), in ACM SIGPLAN Notices, Vol. 27, No. 5, May 1992.
[11] High Performance Fortran Language Specification, in Scientific Programming, Vol. 2, No. 1, 1993.
[12] B. Kernighan, D. Ritchie: The C Programming Language, Second Edition, Prentice-Hall, 1988.
[13] H. Kuchen, R. Plasmeijer, H. Stoltze: Efficient Distributed Memory Implementation of a Data Parallel Functional Language, in Proceedings of PARLE '94, LNCS 817, Springer, 1994.
[14] H. Kuchen: Datenparallele Programmierung von MIMD-Rechnern mit verteiltem Speicher, Thesis (in German), RWTH Aachen, 1995.
[15] L. D. J. C. Loyens, J. R. Moonen: ILIAS, a Sequential Language for Parallel Matrix Computations, in Proceedings of PARLE '94, LNCS 817, Springer, 1994.
[16] Parsytec Computer GmbH: Parix 1.2, Software Documentation, Aachen, 1993.
[17] S. L. Peyton Jones: The Implementation of Functional Programming Languages, Prentice-Hall, 1987.
[18] M. J. Quinn: Parallel Computing: Theory and Practice, McGraw-Hill, 1994.
[19] M. Röttger, U. P. Schroeder, J. Simon: Virtual Topology Library for PARIX, Technical Report 148, University of Paderborn, 1994.
[20] D. Skillicorn: Foundations of Parallel Programming, Cambridge University Press, 1994.
[21] P. Wadler: Deforestation: Transforming Programs to Eliminate Trees, in Theoretical Computer Science, No. 73, North-Holland, 1990.