Skil: An Imperative Language with Algorithmic Skeletons for Efficient Distributed Programming

George Horatiu Botorog*                Herbert Kuchen

Aachen University of Technology, Lehrstuhl für Informatik II
Ahornstr. 55, D-52074 Aachen, Germany
E-mail: {botorog, [email protected]}

* The work of this author is supported by the "Graduiertenkolleg Informatik und Technik" at the Aachen University of Technology.

Abstract

In this paper we present Skil, an imperative language enhanced with higher-order functions and currying, as well as with a polymorphic type system. The high level of Skil allows the integration of algorithmic skeletons, i.e. of higher-order functions representing parallel computation patterns. At the same time, the language can be efficiently implemented. After describing a series of skeletons which work with distributed arrays, we give two examples of parallel programs implemented on the basis of skeletons, namely shortest paths in graphs and Gaussian elimination. Run-time measurements show that we approach the efficiency of message-passing C up to a factor between 1 and 2.5.
1. Introduction

Although parallel and distributed systems gain more and more importance nowadays, the state of the art in the field of parallel software is far from satisfactory. Not only is the programming of such systems a tedious and time-consuming task, but its outcome is usually machine-dependent, and hence its portability is restricted. One of the main reasons for these difficulties is the rather low level of the parallel programming languages available today, in which the user has to explicitly account for all aspects of parallelism, i.e. for communication, synchronization, data and task distribution, load balancing etc. Moreover, program testing is impeded by deadlocks, non-deterministic program runs, non-reproducible errors, and sparse debugging facilities. Finally, programs run only on certain machines or on restricted classes of machines. However, owing to this closeness to the machine, high performance can be achieved. Attempts were made to design high-level languages that
hide the details of parallelism and allow the programmer to concentrate on his problem. Such high-level approaches naturally lead to less efficient programs than those written directly in low-level languages, and the key issue is to find a good trade-off between efficiency losses and gains in ease of programming, as well as in safety and portability of programs. In this paper, we present such an approach, namely a language with algorithmic skeletons. A skeleton is an algorithmic abstraction common to a series of applications, which can be implemented in parallel [4]. Skeletons are embedded into a sequential host language, thus representing the only way to express parallelism in a program. While offering a high-level interface, which supports program portability, their implementation is close to the machine and therefore efficient. Classical examples of skeletons include map, farm and divide&conquer [5]. Most skeletons are defined as higher-order functions, i.e. as functions with functional arguments and/or result. The parameterization with functions leads to a flexibility beyond that of library functions in imperative languages. This is illustrated by the following example. Consider the skeleton divide&conquer (d&c), which implements the well-known computation pattern, and which can be expressed in a functional language as follows (note that this definition only describes the overall functionality; it does not capture the parallel implementation):

d&c :: (a->Bool) -> (a->b) -> (a->[a]) -> ([b]->b) -> a -> b
d&c is_trivial solve split join problem
    = if (is_trivial problem)
      then (solve problem)
      else (join (map (d&c is_trivial solve split join)
                      (split problem)))
The skeleton has four functional arguments: is_trivial tests if a problem is simple enough to be solved directly, solve solves the problem in this case, split divides a problem into a list of subproblems and join combines a
list of sub-solutions into a new (sub)solution. The last argument is the problem to be solved. The function map applies a given function to all elements of a given list, i.e. map f [x1, ..., xn] = [f x1, ..., f xn]. Given this skeleton, the implementation of an algorithm that has the divide&conquer structure requires only the implementation of the four argument functions and a call of the skeleton. For instance, a quicksort procedure for lists can be implemented as follows:
quicksort lst = d&c is_simple ident divide concat lst

where is_simple checks if a list is empty or a singleton, ident is the identity function and divide splits a list into three lists, containing the elements that are smaller than a given pivot element, the pivot element itself, and the elements greater than or equal to the pivot, respectively. Finally, concat concatenates a list of lists. Moreover, other algorithms with the divide&conquer structure, such as Strassen's matrix multiplication, polynomial evaluation, numerical integration, FFT etc., can be implemented similarly, only by using different customizing argument functions. Skeletons can thus be viewed on a conceptual level as common parallelization patterns and on a pragmatic level as (polymorphic) higher-order functions. Note that a similar flexibility cannot be achieved by first-order functions, i.e. functions without functional arguments. Since skeletons can be represented in functional languages in a straightforward way, it would seem appropriate to use a functional language as host. For a discussion of different languages with skeletons, see [3]. Although skeletons can be implemented on every parallel or distributed system, we consider here only MIMD computers with distributed memory. The concepts presented here can nevertheless be ported to other architectures, such as shared memory systems, SIMD computers, workstation clusters etc., as well.

In this paper, we present a new imperative language with algorithmic skeletons. Firstly, we describe some functional features with which we enhance our host language in order to cope with the generality expected from skeletons. After that, we define the data structure "distributed array" and present some data parallel skeletons operating on it. We then show how two parallel applications can be programmed in a sequential style by using these skeletons. Run-time measurements show that our results are better than those obtained for functional languages with skeletons, approaching those of low-level (C) implementations.

The rest of the paper is organized as follows. Section 2 describes the imperative host language and the additional features that are needed to integrate the skeletons. Section 3 presents some skeletons for distributed arrays. Section 4 presents two applications, shortest paths in graphs and Gaussian elimination, implemented on the basis of skeletons. Section 5 contains some run-time results, as well as comparisons with similar implementations in parallel C and in a functional language with skeletons. Finally, Section 6 concludes the paper.

2. The Language

As we have mentioned in the introduction, skeletons can easily be embedded into functional host languages as higher-order functions. However, using a functional language leads to a decrease in the efficiency of the resulting programs, which can become critical in applications where performance is an important issue, like for instance in scientific computing. We therefore choose our host language to be an imperative one. This approach has several advantages. On the one hand, the sequential parts of the program are more efficient in this case. Moreover, since host and skeletons are both imperative, the overhead of the context switch that arises with a functional host no longer has to be taken into account. On the other hand, imperative languages offer mechanisms for locally accessing and manipulating data, which have to be simulated in functional languages [7], as well as simpler I/O. This leads to further gains in efficiency, bringing the performance of our language close to that of low-level parallel imperative languages with message-passing. We have named the language Skil, as an acronym for Skeleton Imperative Language. In order to be able to integrate algorithmic skeletons, we provide Skil with a series of functional features and with a polymorphic type system. We use a subset of the language C as the basis of Skil. This is, however, only a pragmatic choice, motivated by the fact that C is a widespread language in the field of parallel programming for which good compilers are available, thus easing the task of translating Skil programs. Nevertheless, other imperative languages could equally well be used instead.

2.1. Functional Features

Consider again the definition of d&c given in the introduction. We shall use it to show by which features C has to be extended in order to support skeletons. The skeleton d&c is a higher-order function, since it has functional arguments (is_trivial, solve, split and join). Thus, higher-order functions are among the functional features we need in Skil. A second feature we need is function currying. Note that we have defined the type of d&c as

(a->Bool) -> (a->b) -> (a->[a]) -> ([b]->b) -> a -> b

instead of
(a->Bool) × (a->b) × (a->[a]) × ([b]->b) × a -> b

This notation indicates that d&c can be applied to its first argument, yielding a new function which has the type (a->b)->(a->[a])->([b]->b)->a->b, then to its second etc., until the last application finally returns a value of type b. The underlying idea is to consider the application of an n-ary function as a successive application of unary functions. This procedure is called currying and allows functions to be partially applied to their arguments. See [10] for further details. Partial applications are useful in a series of situations, for instance in generating new functions at run-time or in supplying additional parameters to functions. Consider the recursive call of d&c in the else branch of the skeleton definition. On the one hand, the function map expects a functional argument of type a->b. On the other hand, we want to call d&c and at the same time provide it with the rest of the arguments it needs, apart from the problem to be solved, i.e. with the functions is_trivial, solve, split and join. This can be done by partially applying d&c to these arguments, which yields a function of type a->b. Partial applications are one of the main reasons for the functional extensions in Skil. While functional arguments with no arguments of their own could be simulated in C by passing/returning pointers to functions, this is no longer possible with functional arguments yielded by partial application. A further feature of Skil is the conversion of operators to functions, which is done by enclosing the operator between brackets, i.e. (op). This conversion allows passing operators as functional arguments to skeletons, as well as partial applications of operators. For example, the expression fold((+), lst1) computes the sum of all elements of the list lst1, whereas map((*)(2), lst2) multiplies each element of the list lst2 by 2.

2.2. The Type System

If we look back at the type declaration of the skeleton d&c, we see that it is polymorphic, as it contains the type variables a and b. We want Skil to have a polymorphic type system, since we want to define skeletons that depend only on the structure of the problem, and not on particular data types. For instance, the skeleton d&c has the same overall functionality, regardless of the types of the problem and the solution. The advantage is that the same skeletons can be re-used in solving similar or related problems. In order to build a polymorphic type system, we have to include type variables in our language. Syntactically, a type variable is an identifier which begins with a $, e.g. '$t'. Type variables can be instantiated with arbitrary types; there is however one restriction: type variables appearing as components of other data types may not be instantiated with types introduced by the pardata construct (see next subsection). Polymorphic types are either type variables or compound types built from other types using the C type constructors array, function, pointer, structure or union and containing at least one type variable. In the case of structured types containing fields with polymorphic type, as well as in the case of polymorphic typedefs, the type variables inside the definitions must appear as parameters of the type declaration, included between angle brackets (<>):

struct _list <$t> {$t elem ; struct _list <$t> *next ;} ;
typedef struct _list <$t> * list <$t> ;

Polymorphism can be simulated in C by using void pointers and casting. This is nevertheless a potential source of errors, since type checking is thus eluded. Our approach, however, leads to safer programs, as a polymorphic type checking is performed. As an illustration of the language enhancements we have presented, we now give the equivalent Skil definition of the skeleton d&c. Note that, apart from some small differences in the syntax, the functional specification can be directly transformed into our language:

$b d&c (int is_trivial ($a), $b solve ($a), list <$a> split ($a),
        $b join (list <$b>), $a problem)
{ if (is_trivial (problem))
       return solve (problem) ;
  else return join (map (d&c (is_trivial, solve, split, join),
                         split (problem))) ;
}

We have assumed here the existence of a polymorphic type list, like the one given by the above definition.

2.3. Distributed Data Structures
Apart from the need to cope with distributed functionality, a parallel language should allow the definition of distributed data structures. This issue has been addressed in most languages with skeletons by considering these distributed types (mostly arrays) as internally defined [7, 8]. Since one of our aims is flexibility, we want to allow any distributed data structure to be defined, as long as it is 'homogeneous', in the sense that it is composed of identical data structures placed on each processor. This is done by means of the 'pardata' construct, which is similar to the typedef construct, but introduces a distributed ("parallel") data structure. The syntax of this construct is:
pardata name <$t1, ..., $tn> [implem] ;

where $t1, ..., $tn are type variables and implem is a (polymorphic) type, representing the data structure to be created on each processor. The distributed data structure is then the entirety of all local structures and is identified by name. Distributed data structures may not be nested; in particular, the type arguments of a pardata construct cannot be instantiated with other pardatas. A problem that arises when using (visible) distributed data structures in an imperative language is that the programmer is able to access local parts of the structure, for instance by pointers or indices. If this occurs in an uncontrolled way, heavy remote data accessing may cause considerable communication overhead. In order to control the access to data, we keep the implementation ('implem') of the distributed data type hidden, making only its "header" with the type arguments visible. The pardata construct can thus be used without the 'implem' part, similarly to prototypes of library functions, whose header is visible, but whose body is not. The type arguments of a pardata can be instantiated with arbitrary types. However, some problems may appear if dynamic (i.e. pointer-based) data types are used. In this case, skeletons that move elements of the pardata from one processor to another should not move the pointer as such, but the data pointed to by it. For that, they get additional functional arguments which account for the 'flattening'/'unflattening' of data. This issue is addressed in [2]; the skeletons presented here are given in a simplified syntax, without the additional functional arguments.
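To illustrate the construct, here is a hedged sketch of our own (it does not appear in the paper; the name dlist is hypothetical): a distributed list, whose local part on each processor is an ordinary polymorphic list as defined in Subsection 2.2, together with its "prototype" form as it would appear in an application program:

pardata dlist <$t>  list <$t> ;    /* definition: one local list per processor        */
pardata dlist <$t> ;               /* declaration without 'implem', as used by clients */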
2.4. Implementation Issues

We shall now address some implementation issues of Skil. A Skil program is processed by a front-end compiler together with the necessary skeletons (which contain the parallel code, e.g. based on message-passing) and pardata implementations (which contain the definitions of the distributed data). The task of this compiler is to perform a polymorphic type checking and to translate the functional features and polymorphism. The classical implementation of these features uses closures, i.e. data structures that contain the called function and the arguments that were supplied to it. However, this technique causes important run-time overheads which lead to efficiency losses. We therefore use an instantiation procedure, which translates a (polymorphic) higher-order function (HOF), possibly with partial applications, to one or more specialized first-order monomorphic functions [1], as follows:

- functional arguments of HOFs are inlined into the definitions of these HOFs
- HOFs with functional result are converted to functions with non-functional result by η-expansion, i.e. by supplying additional parameters
- partial applications are translated by inlining and lifting of their arguments
- a polymorphic function is translated to one or more monomorphic functions, as determined by the calls of this function

The instantiation procedure is presented in detail in [1]. Since this transformation is done at compile-time, a restriction has to be made regarding the functional arguments of HOFs. However, this restriction concerns only a special class of recursively-defined HOFs, which seldom appear in practice [1]. Moreover, since our HOFs are mainly the skeletons, this restriction can be regarded as internal. The Skil compiler translates all functional features and inserts the parallel code from the definitions of the skeletons into the application program, which can then be processed by a C compiler used as a back-end. The following example illustrates the transformations performed by the Skil compiler. We consider the map skeleton described in Section 3, which applies a function to all elements of an array (distributed arrays are also described in the next section). We further assume to be working in the SPMD style, with the same code running on each processor, so that each processor must process only its partition of the array. The skeleton map can then be defined as follows:

void array_map ($t2 map_f ($t1, Index), array <$t1> a, array <$t2> b)
{ l = lowest index of the current partition ;
  h = highest index of the current partition ;
  ...
  for (i = l ; i < h ; i++)
      b[i] = map_f (a[i], i) ;
}

Assume we want to compare all elements of an array of floats A with some threshold value t and put the boolean (in C and Skil: integer) results into another array B. This can be done by the following call of the map skeleton:

array_map (above_thresh (t), A, B) ;

where above_thresh is defined as:

int above_thresh (float thresh, float elem, Index ix)
{ return (elem >= thresh) ; }

Upon encountering the above call of array_map, the compiler generates the following instance of this skeleton, in which the functional argument above_thresh has been inlined, its argument t has been lifted, and the polymorphic types $t1 and $t2 have been instantiated:
void array_map_1 (float x, floatarray a, intarray b)
{ ...
  for (i = l ; i < h ; i++)
      b[i] = above_thresh (x, a[i], i) ;
}

where floatarray and intarray stand for the implementations of array <float> and array <int>, respectively. Correspondingly, the skeleton call is transformed to array_map_1 (t, A, B). Using this instantiation technique for parallel programs with skeletons leads to intermediate (C) code that differs only little from the hand-written (C) versions of these programs, usually just containing more function calls. This explains the good results obtained for our test applications, which are presented in Section 5 and in [3].

3. Skeletons for Arrays

Depending on the kind of parallelism used, skeletons can be classified into data parallel and process parallel ones. Although both types can be integrated in Skil, the emphasis is placed here on the first category, since data parallelism seems to offer better possibilities to exploit large numbers of processors. We shall describe a series of skeletons which work on the distributed data structure "array". This data structure is defined as:

pardata array <$t>  ... implementation ... ;

where the type parameter $t denotes the type of the elements of the array. Although the implementation of the pardata and of the skeletons is not visible to the user, we shall give some of its details in order to make the explanation clearer. At present, arrays can be distributed only block-wise onto processors. Each processor thus gets one block (partition) of the array, which, apart from its elements, contains the local bounds of the partition. These bounds are accessible via the macro array_part_bounds:

Bounds array_part_bounds (array <$t> a) ;

Array elements can be accessed by using the macros:

$t   array_get_elem (array <$t> a, Index ix) ;
void array_put_elem (array <$t> a, Index ix, $t newval) ;

which read, respectively overwrite, the given array element. An important aspect is that these macros can only be used to access local elements, i.e. the index ix should be within the bounds of the array partition currently placed on each processor. The reason for this restriction is that remote accessing of single array elements easily leads to very inefficient programs. Non-local element accessing is still possible, however only in a coordinated way by means of skeletons. We shall now present some skeletons for the distributed data structure array.

array_create creates a new, block-wise distributed array and initializes it using a given function. The skeleton has the following syntax:

array <$t> array_create (int dim, Size size, Size blocksize, Index lowerbd,
                         $t init_elem (Index), int distr) ;

where dim is the number of dimensions of the array, the types Index and Size are 'classical' arrays with dim elements, size contains the global sizes of the array, blocksize contains the sizes of a partition (passing a zero value for a component lets the skeleton fill in an appropriate value depending on the network topology), lowerbd is the lowest index of a partition (passing a negative value for a component lets the skeleton derive the lower local bound for this dimension), init_elem is a function which initializes each element of the array depending on its index, and distr gives the virtual (software) topology onto which the array should be mapped (where available). In our implementation based on Parix [9], an array can be mapped directly onto the hardware topology (DISTR_DEFAULT), onto a ring virtual topology (DISTR_RING), or onto a torus virtual topology (DISTR_TORUS2D).

array_destroy deallocates an existing array:

void array_destroy (array <$t> a) ;

array_map applies a given function to all elements of an array and puts the results into another array. The two arrays may also be identical; in this case the skeleton does an in-situ replacement. The syntax is:

void array_map ($t2 map_f ($t1, Index), array <$t1> from, array <$t2> to) ;

The source and the target arrays do not necessarily have the same element type. The result is placed into another array rather than returned, since in the second case a temporary data structure would have to be created. Our solution avoids this additional memory consumption. The return-solution is however used in array_create, since this skeleton allocates the new array anyway, whereas array_map only 'fills in' new values in an existing array. Note that this efficiency improvement is not possible in functional host languages, where side-effects are not allowed.
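As a small usage sketch of the in-situ variant (our own example; the helper square_elem is hypothetical), all elements of an array of floats A can be squared in place:

float square_elem (float x, Index ix)  { return x * x ; }
...
array_map (square_elem, A, A) ;    /* source and target coincide: in-situ replacement */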
array_fold composes ("folds together") all elements of an array:

$t2 array_fold ($t2 conv_f ($t1, Index), $t2 fold_f ($t2, $t2), array <$t1> a) ;

The skeleton first applies the conversion function conv_f to all array elements in a map-like way (this step could also be done by a preliminary array_map, but the integrated solution is more efficient). After that, each processor composes all elements of its partition using the folding function fold_f. Since the order of composition is not fixed, the user should provide an associative and commutative folding function, otherwise the result is non-deterministic. In the next step, the results from all partitions are folded together. This step is performed along the edges of a virtual tree topology, with the result finally collected at the root. In order to make the result known to all processors, it is broadcast from the root along the tree edges to all other processors.
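As a usage sketch (again our own example, not taken from the paper), the global maximum of an array of floats A can be computed with a conversion function that simply keeps the element value and an associative, commutative maximum function:

float keep_val (float v, Index ix)  { return v ; }
float fmax (float x, float y)       { return (x > y) ? x : y ; }
...
float m = array_fold (keep_val, fmax, A) ;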
array_gen_mult composes two arrays of dimension 2 using the pattern of matrix multiplication, i.e. for each element of the result matrix it computes the "dot product" of the corresponding row of the first matrix and the corresponding column of the second one. This composition is parameterized by the function which composes a row element with a column element (here gen_mult) and by the function which folds all partial results together (here gen_add). If the actual multiplication and addition are used, then we obtain the classical matrix multiplication. However, by using other functions, different algorithms with this communication pattern can be implemented, e.g. shortest paths in graphs (see Subsection 4.1):

void array_gen_mult (array <$t> a, array <$t> b, $t gen_add ($t, $t),
                     $t gen_mult ($t, $t), array <$t> c) ;

Here, a and b are the arrays to be multiplied and c is the result array. The skeleton uses Gentleman's distributed matrix multiplication algorithm, in which local partition multiplications alternate with partition rotations among the processors [11]. These rotations are done horizontally for the first matrix and vertically for the second one, while the mapping of the result matrix remains unchanged. Since the use of the same matrix in the role of both arguments, or in that of an argument and the result, would lead to inconsistencies, we impose the condition that the matrices a, b and c are distinct, i.e. calls of the form array_gen_mult (a, a, ...) and array_gen_mult (a, ..., a) are not allowed.

array_copy copies one array into another. The second array must have been previously created. As array partitions are internally represented as contiguous memory areas, copying can be done very efficiently. This is the reason why this skeleton was implemented, instead of using a correspondingly parameterized array_map for this purpose:

void array_copy (array <$t> from, array <$t> to) ;

array_broadcast_part broadcasts one array partition to all other processors. Each processor overwrites its partition with the broadcast one. The partition to be broadcast is the one containing the element with index ix:

void array_broadcast_part (array <$t> a, Index ix) ;

array_permute_rows applies only to 2-dimensional arrays. It permutes the rows of the array using a given permutation function, which computes the new position of a row from its old one. The user must provide a bijective function on {0, 1, ..., n-1}, where n is the number of rows, otherwise a run-time error occurs:

void array_permute_rows (array <$t> from, int perm_f (int), array <$t> to) ;

4. Sample Applications

We shall now present two sample applications illustrating the way parallel programs can be written using algorithmic skeletons. The first program is a simple application of the generic matrix multiplication skeleton, whereas the second program is more complex, containing maps, folds, broadcasts and permutations. Run-time measurements and comparisons with other implementations are given in Section 5.

4.1. Shortest Paths in Graphs

Let G = (V, E), with V = {v_1, ..., v_n}, be a graph with non-negative integer edge weights and let A = (a_ij), 1 <= i, j <= n, be its distance matrix, with a_ii = 0, a_ij = w_ij if there is an edge between the nodes v_i and v_j with weight w_ij, and a_ij = ∞ otherwise. Then the length of the shortest path between the nodes v_i and v_j is equal to the element c_ij of the matrix C = A^n, where A^n is computed on the basis of matrix multiplication in which scalar multiplication has been replaced by addition and scalar addition by the minimum operation (i.e. a single "multiplication" step computes c_ij = min over k of (a_ik + a_kj)). The implementation of the algorithm is given below. By successively computing A^2, A^4, ..., we need only log_2 n iterations to compute A^n (we assume for simplicity that n is a power of 2).
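The shpaths code below uses two argument functions which are not listed in the paper: an initialization function init_f for the distance matrix and a binary minimum function min. The following is only a hedged sketch of what they might look like; the helper edge_weight, assumed to return the weight of an edge or 0 if there is none, is purely hypothetical:

/* initialization of the distance matrix: 0 on the diagonal, the edge
   weight where an edge exists, the maximal unsigned integer value
   (representing "infinity", cf. UINT_MAX in <limits.h>) otherwise     */
unsigned int init_f (Index ix)
{ unsigned int w ;
  if (ix[0] == ix[1]) return 0 ;
  w = edge_weight (ix[0], ix[1]) ;     /* hypothetical access to the input graph */
  return (w > 0) ? w : UINT_MAX ;
}

/* binary minimum, used below in the role of the scalar addition */
unsigned int min (unsigned int x, unsigned int y)
{ return (x < y) ? x : y ; }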
(In order to avoid excessive code details, the code below uses the pseudo-code notation {a,b} for the 'classical' array with elements a and b.)

void shpaths (int n)
{ array <unsigned int> a, b, c ;

  a = array_create (2, {n,n}, {0,0}, {-1,-1}, init_f, DISTR_TORUS2D) ;
  b = array_create (2, {n,n}, {0,0}, {-1,-1}, zero, DISTR_TORUS2D) ;
  c = array_create (2, {n,n}, {0,0}, {-1,-1}, int_max, DISTR_TORUS2D) ;

  for (i = 0 ; i < log2 (n) ; i++)
  { array_copy (a, b) ;
    array_gen_mult (a, b, min, (+), c) ;
    array_copy (c, a) ;
  }

  /* output array c */

  array_destroy (a) ;
  array_destroy (b) ;
  array_destroy (c) ;
}

In the first line, the variables a, b and c are declared as having the type "distributed array with elements of type unsigned integer". The arrays are then created with dimension 2 and total size n × n by the appropriate skeleton. They are distributed onto a 2-dimensional torus virtual topology, since this optimizes the communication inside the skeleton array_gen_mult. Moreover, the values of the third and fourth parameters of the calls to array_create tell the skeleton to construct the array partitions with default (i.e. equal) size and default positioning. The elements of array a are initialized by some function init_f, those of array b are set to 0, since b will hold a copy of a, and those of array c are set to the maximal integer value, which represents ∞. In order to avoid an overflow when adding a value to "∞", the type unsigned integer was used for the array elements. This initialization by an argument function is possible due to the fact that skeletons are higher-order functions. The rest of the program contains the loop which computes A^n. The skeleton array_gen_mult is called with the minimum function in the role of the scalar addition and with the addition function in the role of the scalar multiplication ((+) denotes the function corresponding to the operator '+'). Since the array arguments of array_gen_mult must be distinct, the array a is first copied into b. Finally, after outputting the results, the arrays are deallocated.

4.2. Gaussian Elimination

Gaussian elimination can be used to solve linear systems of the form A·x = b or to invert a matrix A, if A is not singular. In the first case, the right hand side vector b is added to the matrix A as (n+1)-st column. Using linear transformations and combinations on the rows of the extended matrix, yielding the identity matrix in the first n columns, the (also transformed) (n+1)-st column represents the solution vector x. Formally, the transformation is given by:

for k := 0 to n-1 do
    for i := 0 to n-1 do
        if i ≠ k then
            for j := n downto k do
                a_ij := a_ij - a_ik * (a_kj / a_kk)
for i := 0 to n-1 do
    x_i := a_in / a_ii

The algorithm given above is not complete, since it fails if the current pivot element (a_kk) is 0. This can be avoided by exchanging the current row with one where the k-th element is non-zero (for reasons of numerical stability, the row with the greatest absolute value of the k-th element is chosen as pivot row). We shall now parallelize this algorithm. The outer loop is not adequate for parallelization, since it contains inter-loop dependencies, so we take the inner two loops and implement them with a map skeleton. However, the innermost loop must be run sequentially from n downto k in order not to overwrite elements too soon. Since the order in which map applies a function to the elements of an array cannot be imposed, we have to use two different arrays, one as source and one as target. The implementation of the algorithm is given below (the complete program for Gaussian elimination can be found at http://www-i2.informatik.rwth-aachen.de/~botorog).

typedef struct _elemrec {float val ; int row ; int col ;} elemrec ;

void gauss (int n)
{ array <float> a, b, piv ;
  elemrec e ;

  /* create arrays a and b (size n x (n+1)) */
  /* create array piv (size p x (n+1)) */

  for (k = 0 ; k < n ; k++)
  { e = array_fold (make_elemrec, max_abs_in_col (k), a) ;
    if (e.val == 0.0) error ("Matrix is singular") ;
    if (e.row != k)
         array_permute_rows (a, switch_rows (e.row, k), b) ;
    else array_copy (a, b) ;
    array_map (copy_pivot (b, k), piv, piv) ;
    array_broadcast_part (piv, {k/(n/p), 0}) ;
    array_map (eliminate (k, b, piv), b, a) ;
  }

  array_map (normalize (a), a, b) ;

  /* output array b */
  /* deallocate arrays a, b and piv */
}
We consider the right hand side vector b to have been appended to the matrix A, yielding an n × (n+1) matrix. This matrix is divided into p parts (p being the number of processors), each containing n/p rows (we assume for simplicity that p divides n). The first problem is now to determine and to broadcast the pivot row. The determination part is done by means of the array_fold skeleton. The result we expect from this skeleton is twofold: on the one hand, the number of the row containing the maximal pivot element, and on the other hand, the value of the pivot element (if this value is 0, then the matrix is singular). For that, we define the structured type elemrec, which contains for each array element its value, its row and its column. array_fold first converts each array element to an elemrec and then folds them together with the function max_abs_in_col, which computes the maximum only over those elements whose column index is k. This is possible due to the polymorphic type of the skeleton array_fold. Having found the would-be pivot row, we must exchange it with the current row. This is done by calling the skeleton array_permute_rows with an argument function, switch_rows, that for each of the two considered rows returns the number of the other one, and is the identity for every other row. Note that since the switching is done by a skeleton, which globally works on all rows, some processors (here all except at most two) have nothing to do. This is however no overhead, because no processor can continue before it gets the pivot row, so it would be idle anyway.
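The conversion and folding functions make_elemrec and max_abs_in_col and the permutation function switch_rows used above are not listed in the paper. The following are hedged sketches consistent with their use in gauss; the original definitions may differ in detail:

/* convert an array element into an elemrec carrying its value and position */
elemrec make_elemrec (float v, Index ix)
{ elemrec e ;
  e.val = v ;  e.row = ix[0] ;  e.col = ix[1] ;
  return e ;
}

/* among the elements of column k, keep the one with the greatest absolute
   value (elements of other columns are ignored); fabs is from <math.h>     */
elemrec max_abs_in_col (int k, elemrec x, elemrec y)
{ if (x.col != k) return y ;
  if (y.col != k) return x ;
  return (fabs (x.val) >= fabs (y.val)) ? x : y ;
}

/* permutation function: exchange rows r1 and r2, leave all other rows in place */
int switch_rows (int r1, int r2, int row)
{ if (row == r1) return r2 ;
  if (row == r2) return r1 ;
  return row ;
}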
The next step is broadcasting the pivot row. Unlike row permutation, row broadcasting can be reduced to broadcasting partitions, if each partition consists of exactly one row (this does not work for row permutation, in case two rows placed on the same processor are switched; such clash cases would have to be handled by the application program, which would need access to single rows within the same partition). This is why the array piv is created with the size p × (n+1), each processor thus getting one row. Before actually broadcasting the pivot row, we have to get it from our array into piv. Since no single rows can be accessed, this is done by calling the skeleton array_map. This applies the function copy_pivot to all elements of the array piv, overwriting only those which correspond to the pivot row. The function copy_pivot is given below:

$t copy_pivot (array <$t> a, int k, $t v, Index ix)
{ Bounds bds = array_part_bounds (a) ;

  if (bds->lowerBd[0] <= k && k <= bds->upperBd[0])
       return (array_get_elem (a, {k, ix[1]}) /
               array_get_elem (a, {k, k})) ;
  else return (v) ;
}

Upon checking the arity of this function against its call in the gauss procedure, one can observe that copy_pivot was partially applied to the array b and the row number k in the procedure gauss. Partial applications thus allow passing additional parameters to functions called from within skeletons. The remaining two arguments, the array element v and its index ix, are supplied by the skeleton itself. The function copy_pivot checks if the current partition of the array a contains the pivot row, using the macro array_part_bounds. If this is true, then it returns the corresponding element, already normalized relative to the pivot element, i.e. a_kj / a_kk. The skeleton array_map then overwrites the corresponding element of the array piv. Otherwise, the old value is returned, which is equivalent to leaving the element of piv unchanged. Now the pivot row can finally be broadcast, and each processor can do the actual elimination on the rows of its partition. This elimination is done by mapping the function eliminate to all elements of the old array (b in the procedure gauss) and writing the results into the new array (a in gauss). This function also gets some of its arguments by partial application, namely the number of the pivot row, the pivot array and the old array. The elimination is done only for elements at the right of the pivot element, except for those in the pivot row itself, by computing a_ij := a_ij - a_ik * (a_kj / a_kk), as shown below (procId is the number of the current processor, and is equal to the number of the row of the array piv mapped to this processor):

$t eliminate (int k, array <$t> a, array <$t> piv, $t v, Index ix)
{ if (ix[0] == k || ix[1] < k)
       return (v) ;
  else return (v - array_get_elem (a, {ix[0], k}) *
                   array_get_elem (piv, {procId, ix[1]})) ;
}

Finally, since the pivot elements were not normalized to 1, each element of the last column (i.e. of the result x) has to be divided by the pivot (i.e. diagonal) element of its row. This is done by applying the function normalize to the elements of the array.
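The function normalize is not listed in the paper either. A possible sketch, consistent with its use in gauss, divides each element by the diagonal (i.e. pivot) element of its row; since partitions consist of whole rows, this access is local:

/* hedged sketch; only the last column (the solution vector x) is actually needed */
float normalize (array <float> a, float v, Index ix)
{ return v / array_get_elem (a, {ix[0], ix[0]}) ; }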
5. Run-Time Results

We have implemented the skeletons and applications presented above on a Parsytec MC system with 64 T800 transputers connected as a 2-dimensional mesh and running at 20 MHz, under the Parix operating system [9]. Since only 1 MB of memory was available per node, larger problem sizes could only be fitted into larger networks. We have compared our results with those obtained for the same sample applications using the data-parallel functional language DPFL [7, 8] and the same skeletons.
Our run-times are on the average 6 times faster than those of DPFL, approaching in most cases those of hand-written C code. This is due both to the efficiency of imperative languages, which is higher than that of their functional counterparts, and to the implementation of the functional features. Detailed results are given below.

5.1. Shortest Paths

The shortest paths program was run for graphs with n = 200 nodes on a √p × √p processor network (in the cases where √p did not divide n, the next highest value divisible by √p was taken, e.g. n = 201 for √p = 3). The absolute run-time results, as well as a comparison between Skil and DPFL, are given in Table 1. These results show that the Skil program runs more than 6 times faster than the equivalent DPFL program.

  √p × √p    DPFL       Skil        Skil      Parix-C
                        absolute    rel.
  2 × 2      1524.22     234.29     6.51       259.49
  3 × 3         -         107.69      -           -
  4 × 4       387.23       60.78     6.37        65.79
  5 × 5         -          39.56      -           -
  6 × 6       185.13       29.70     6.23        31.53
  7 × 7         -          21.83      -           -
  8 × 8        98.76       16.34     6.04        16.92

  Table 1. Run-time results for the shortest paths program

Notice that in this table the Skil run-times even beat the run-times of message-passing C. The reason is that the C implementation referred to here is an older version, which does not use virtual topologies or asynchronous communication, as our skeleton implementation does. We have done the comparison between equally optimized C and Skil versions of the matrix multiplication algorithm and obtained Skil times around 20% slower than direct C times [3]. Of course, a Skil program could never beat an equally well optimized C version of that program, since Skil is translated to message-passing C.

5.2. Gaussian Elimination

The Gaussian elimination program was run for matrices of size n × n, for different n between 64 and 640, on a p-processor network. The first version of gauss was implemented without the search and the exchange of the pivot row, i.e. corresponding to the initial version of the algorithm given in Subsection 4.2. The reason was that this version had been implemented in DPFL [8] and we wanted to make a fair comparison. Table 2 shows the absolute run-times of the Skil program in seconds, its speedups relative to the DPFL program, as well as its slow-downs relative to Parix-C (the first, second and third row of each group, respectively).

  p                      n = 64    128      256       384       512       640
  2 × 2  Skil [s]          2.06    14.77    113.29    377.62      -         -
         DPFL/Skil         6.17     6.52      6.65      6.69      -         -
         Skil/Parix-C      2.40     2.51      2.60      2.64      -         -
  4 × 4  Skil [s]          0.91     4.83     32.06    102.16    236.13    453.86
         DPFL/Skil         4.82     5.73      6.22      6.40      6.48      -
         Skil/Parix-C      1.57     1.73      2.02      2.20      2.31      2.38
  8 × 4  Skil [s]          0.85     3.49     19.42     58.03    129.89    244.77
         DPFL/Skil         3.87     4.88      5.62      5.96      6.12      6.24
         Skil/Parix-C      1.25     1.24      1.45      1.65      1.78      1.90
  8 × 8  Skil [s]          0.85     2.94     13.57     37.03     78.71    143.28
         DPFL/Skil         3.48     4.17      4.78      5.21      5.47      5.68
         Skil/Parix-C      1.04     0.94      1.03      1.15      1.26      1.37

  Table 2. Run-time results for Gaussian elimination [Skil [s]: absolute Skil times in seconds; DPFL/Skil: quotient DPFL/Skil; Skil/Parix-C: quotient Skil/Parix-C; - : no value available]

For more clarity, we have plotted the speedups relative to DPFL and the slow-downs relative to Parix-C against the number of processors, for all matrix sizes, and obtained the graphics in Figure 1. Notice that in the left graphic most of the speedups relative to DPFL are grouped around the factor 6, while only a few go below 5. These small values correspond to the case when arrays with small sizes are distributed onto large networks, yielding small partitions. In this case, it is obvious that the communication overhead gains more importance, leading to a drop in efficiency. The slow-downs relative to C (depicted in the right graphic) are mainly grouped around 2, in some cases (generally, for large networks) going down to 1. On the average, this result is not as good as the one obtained for the shortest paths problem, but there is a great difference between that application, where practically only one skeleton (array_gen_mult) did everything, and Gaussian elimination, where maps, folds, broadcasts and permutes were combined. Further, the communication in shpaths has a regular structure, which is not the case in gauss. On the whole, the results relative to message-passing C are good, for larger networks even very good. The second version of gauss we tested was the complete one, as given in Subsection 4.2. The run-times were here about twice as long as in the first version, which is satisfactory, since it is visible from the description of the implementation of the pivot search and exchange that this brings considerable communication overhead.
[Figure 1. Comparison: Skil vs. DPFL (left) and Skil vs. Parix-C (right) for Gaussian elimination. The left plot shows the relative speed-ups Skil vs. DPFL, the right plot the relative slow-downs Skil vs. C, both plotted against the number of processors, with one curve per matrix size n = 64, 128, 256, 384, 512, 640.]
6. Conclusions and Future Work
We have presented the language Skil, whose principal aim is to allow high-level parallel programming while being efficiently implementable. Skil is an imperative language, enhanced however with functional features and with a polymorphic type system. We have then described a series of skeletons working with distributed arrays and shown how two matrix applications can be implemented on a parallel system by writing sequential programs. Run-time results show that our implementation is much faster than that of functional languages with skeletons, in most cases approaching or even reaching the performance of hand-written C code based on message-passing. Future work is necessary in several directions. For instance, it should be possible to employ other distributions of arrays onto processors apart from block-wise, like for instance cyclic, block-cyclic etc. [6]. In the case of block distributions, it should be possible to define overlapping areas for the single partitions, in order to reduce communication in operations which require more than one element at a time. Such operations are used for instance in solving partial differential equations [8] or in image processing. Moreover, in order to be able to cope with 'real world' applications, new skeletons, for instance for (parallel) I/O, must be designed and implemented.

References

[1] G. H. Botorog, H. Kuchen: Translation by Instantiation: Integrating Functional Features into an Imperative Language, in Proceedings of the Poster Session of CC '96, Research Report LiTH-IDA-R-96-12, University of Linköping, 1996.

[2] G. H. Botorog, H. Kuchen: Using Algorithmic Skeletons with Dynamic Data Structures, in Proceedings of IRREGULAR '96, LNCS 1117, Springer, 1996.

[3] G. H. Botorog, H. Kuchen: Efficient Parallel Programming with Algorithmic Skeletons, in Proceedings of Euro-Par '96, Vol. 1, LNCS 1123, Springer, 1996.

[4] M. I. Cole: Algorithmic Skeletons: Structured Management of Parallel Computation, MIT Press, 1989.

[5] J. Darlington, A. J. Field, P. G. Harrison et al.: Parallel Programming Using Skeleton Functions, in Proceedings of PARLE '93, LNCS 694, Springer, 1993.

[6] High Performance Fortran Language Specification, in Scientific Programming, Vol. 2, No. 1, 1993.

[7] H. Kuchen, R. Plasmeijer, H. Stoltze: Efficient Distributed Memory Implementation of a Data Parallel Functional Language, in Proceedings of PARLE '94, LNCS 817, Springer, 1994.

[8] H. Kuchen: Datenparallele Programmierung von MIMD-Rechnern mit verteiltem Speicher, Thesis (in German), Shaker-Verlag, Aachen, 1996.

[9] Parsytec Computer GmbH: PARIX 1.2, Software Documentation, Aachen, 1993.

[10] S. L. Peyton Jones: The Implementation of Functional Programming Languages, Prentice-Hall, 1987.
[11] M. J. Quinn: Parallel Computing: Theory and Practice, McGraw Hill, 1994.