Algorithmic Skeletons for Adaptive Multigrid Methods

George Horatiu Botorog* and Herbert Kuchen

RWTH Aachen, Lehrstuhl für Informatik II
Ahornstr. 55, D-52074 Aachen, Germany
{botorog, [email protected]
Abstract. This paper presents a new approach to parallel programming with algorithmic skeletons, i.e. common parallelization patterns. We use an imperative language enhanced by some functional features as host for the embedding of the skeletons. This allows an efficient implementation and at the same time a high level of programming. In particular, low-level communication problems such as deadlocks are avoided. Both data and process parallel skeletons can be used, but the emphasis is placed on the first category. By defining data parallel skeletons for dynamic data structures, we obtain constructs for handling problems with an irregular and/or dynamic character. The implementation of an adaptive multigrid algorithm illustrates how such problems can be solved by using these constructs.

Keywords: Algorithmic Skeletons, Data Parallelism, Adaptive Multigrid, Imperative Host Languages, MIMD-DM Computers.
1 Introduction
Most languages employed today in the programming of MIMD computers with distributed memory are imperative languages, like Occam, parallel C or parallel Fortran. Here, the user must control the parallelism in his program on a rather low level, by explicitly sending and receiving messages, synchronizing processes, distributing data, balancing the load etc. This enables him to write efficient programs with a proper granularity of parallelism, but the price is high. Such programs are mostly non-deterministic, which complicates the task of testing and debugging. Moreover, the use of low-level features can easily lead to deadlocks and restricts portability.

Another possibility is to use declarative, mainly functional languages. Functional programs contain implicit parallelism, since (sub)expressions can be evaluated independently of each other. Thus, a compiler can automatically parallelize a functional program. Moreover, functional programs are high-level and deterministic, hence presenting advantages over their imperative counterparts: they are easy to test and debug, portable, and cannot contain deadlocks. The drawback of the approach is that parallel implementations of functional languages tend to
* The work of this author is supported by the "Graduiertenkolleg Informatik und Technik" at the RWTH Aachen.
be very inefficient. The main reasons lie in the fine granularity of parallelism, as well as in the heuristic task and data distribution, which usually lead to high communication overheads.

Given these two extremes, the next step would be to search for a solution somewhere in between. One such solution is algorithmic skeletons [3]. A skeleton is an algorithmic abstraction common to a series of applications, which can be implemented in parallel. Skeletons are embedded into a sequential host language, thus being the only source of parallelism in a program. Confining parallelism to skeletons makes it possible to structure and control it. Moreover, the user no longer has to cope with parallel implementation details, which are not visible outside the skeletons. On the other hand, skeletons are efficiently implemented on a low level. At the same time, they offer an interface which totally abstracts from the underlying hardware, thus making the programs portable. Classical examples of skeletons include map, farm and divide&conquer [4]. A common feature of most skeletons that have been defined is that, in order to provide the required flexibility, they have functions as arguments, i.e. they are higher-order functions. We shall illustrate this with the definition of divide&conquer (d&c), which can be expressed in a functional language as follows.

d&c :: (a -> Bool) -> (a -> b) -> (a -> [a]) -> ([b] -> b) -> a -> b
d&c is_trivial solve split join problem
  = if (is_trivial problem)
    then (solve problem)
    else (join (map (d&c is_trivial solve split join) (split problem)))
The skeleton gets four functions as arguments: is_trivial tests if a problem is simple enough to be solved directly, solve solves the problem in this case, split divides a problem into a list of subproblems and join combines a list of sub-solutions into a new (sub)solution. Given this skeleton, the implementation of an algorithm that has the structure of divide&conquer requires only the implementation of the four argument functions and a call of the skeleton. For instance, a quicksort procedure can be implemented as follows:

quicksort lst = d&c is_simple ident divide concat lst
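For comparison, the skeleton can be approximated in plain C with function pointers. The sketch below is our own illustration (all names are invented, and the parallel behavior of the real skeleton is of course absent); it instantiates d&c as an in-place quicksort on array slices. Note that plain function pointers cannot express the partial applications and polymorphism discussed later, which is precisely what the host-language extensions provide.

```c
/* Problem and solution type: a contiguous slice of an int array. */
typedef struct { int *data; int len; } slice;

/* First-order simulation of the d&c skeleton: the four functional
   arguments become function pointers. */
slice dc (int (*is_trivial)(slice), slice (*solve)(slice),
          void (*split)(slice, slice *, slice *),
          slice (*join)(slice, slice), slice problem)
{
  slice l, r;
  if (is_trivial (problem)) return solve (problem);
  split (problem, &l, &r);
  return join (dc (is_trivial, solve, split, join, l),
               dc (is_trivial, solve, split, join, r));
}

/* Instantiation: in-place quicksort. */
static int qs_trivial (slice p) { return p.len <= 1; }
static slice qs_solve (slice p) { return p; }

static void qs_split (slice p, slice *l, slice *r)
{ /* partition around the last element (Lomuto scheme) */
  int pivot = p.data[p.len - 1], i = 0, j, t;
  for (j = 0; j < p.len - 1; j++)
    if (p.data[j] < pivot)
    { t = p.data[j]; p.data[j] = p.data[i]; p.data[i] = t; i++; }
  t = p.data[i]; p.data[i] = p.data[p.len - 1]; p.data[p.len - 1] = t;
  l->data = p.data;         l->len = i;
  r->data = p.data + i + 1; r->len = p.len - i - 1;
}

static slice qs_join (slice l, slice r)
{ /* the sub-slices are adjacent in memory, so joining is trivial */
  slice s = { l.data, l.len + 1 + r.len };
  return s;
}

void quicksort (int *a, int n)
{ slice p = { a, n };
  dc (qs_trivial, qs_solve, qs_split, qs_join, p);
}
```

The four argument functions here are fixed and monomorphic; every new instantiation of the pattern requires a new set of such functions.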
Note that a similar flexibility cannot be achieved by first-order functions, i.e. functions without functional arguments. Besides the elegance of this style of programming, it is important that the skeleton is implemented in parallel1, handling communication and synchronization as well as distributed data internally. Thus, instead of error-prone communication via individual messages, there is a coordinated overall communication, which is guaranteed to work deadlock-free. As the example above shows, algorithmic skeletons can be represented in functional languages in a straightforward way as higher-order functions. It would therefore seem appropriate to use a functional language as host. Indeed, nearly all implementations of skeletons rely on functional hosts.

1 The above definition of d&c only describes the overall functionality; it does not capture the parallel implementation.
There are currently a number of research groups working on the design and implementation of parallel functional languages with algorithmic skeletons. One could mention here the works of Darlington et al. [4], Bratvold [2], Rabhi [10], Pepper et al. [8], Skillicorn [11] or Kuchen and Stoltze [6]. On the other hand, few attempts have been made to use an imperative language as a host. One such attempt is P3L [1], which builds on top of C++, and in which skeletons are internal language constructs. The main drawbacks are the difficulty of adding new skeletons and the fact that only a restricted number of skeletons can be used, lest the language should grow too big. Another approach is taken in the language ILIAS [7]. Here, arithmetic and logic operators are extended to pointwise operators that can be applied to the elements of matrices. Furthermore, contiguous subranges of matrices can be specified. These operations are actually not skeletons in the sense of the former definition, but operations obtained through overloading of language constructs. Still, their internal behavior closely resembles that of some skeletons.

Depending on the kind of parallelism used, skeletons can roughly be classified into process parallel and data parallel ones. In the first case, the skeleton creates a series of processes, which run concurrently. Some examples include farm, pipe and divide&conquer. Such skeletons are used in [1, 2, 4, 10]. Data parallel skeletons, on the other hand, act upon some distributed data structures, performing the same operations on all elements of the data structure. Data parallel skeletons, like map, shift or fold, are used in [1, 2, 4, 6, 8, 11]. A common feature of all data parallel approaches listed above is that the underlying data structures are static, mostly arrays. The reason is that arrays are better suited for parallel processing than other data structures, as they can easily be distributed and most of the operations on them are easy to parallelize.
Moreover, since the structure of an array remains unchanged, no dynamic re-mapping or load balancing is required. In this paper, a new approach is presented. On the one hand, an imperative host language is employed. On the other hand, data parallel skeletons are defined for dynamic data structures. It is shown how these data structures and skeletons can be used in solving irregular problems.

The rest of the paper is organized as follows. Section 2 describes the imperative host language and the additional features that are needed to integrate the skeletons. Section 3 presents the types of skeletons we shall use. The emphasis is placed on data parallel skeletons, which are embedded into parallel abstract data types. Section 4 presents an irregular application, a full multigrid algorithm with adaptive grid refinement, implemented on the basis of skeletons. Finally, Section 5 concludes the discussion.
2 The Host Language
As already stated, most implementations today build upon functional hosts, since these allow a straightforward integration of the skeletons. The improvement over direct parallel implementations of functional languages is considerable, rising up to 2 orders of magnitude [6]. Nevertheless, programs written in these languages still run about 5 to 10 times slower than corresponding parallel C programs [6].
We choose our host language to be an imperative, C-like one. This approach has several advantages. On the one hand, the sequential parts of the program are more efficient if written in an imperative language. Moreover, since host and skeletons are both imperative, the overhead of the context switch, which is considerable in the case of a functional host, no longer has to be taken into account. On the other hand, imperative languages offer mechanisms for local accessing and manipulation of data, which have to be simulated in functional languages [6]. This should lead to further gains in efficiency, bringing the performance of our language close to that of low-level parallel imperative languages.

The question arising is: what are the functional features needed for the integration of skeletons? We have already seen that most skeletons have functional arguments, so higher-order functions are surely among these features. Since a direct implementation of these constructs is inefficient, they are eliminated by the compiler at an early stage. This is done by instantiating calls to higher-order functions to equivalent calls to appropriate specialized first-order functions (see [12] for details). A second feature we need are partial applications of functions. Consider again the definition of the skeleton d&c. We had defined the type of this skeleton as

(a -> Bool) -> (a -> b) -> (a -> [a]) -> ([b] -> b) -> a -> b
instead of
((a -> Bool), (a -> b), (a -> [a]), ([b] -> b), a) -> b

This implies that d&c can be applied to its first argument, yielding a new function with the type (a -> b) -> (a -> [a]) -> ([b] -> b) -> a -> b, then to its second etc., until the last application finally returns a value of type b. The underlying idea is to consider the application of an n-ary function as a successive application of unary functions. This procedure is called currying, after the mathematician H. B. Curry. See [9] for further details. Partial applications are useful in a series of situations, for instance in supplying additional parameters to functions. Consider the d&c call in the else branch of the above example. On the one hand, the function map expects a functional argument of the type a -> b. On the other hand, we want to call d&c and at the same time provide it with the rest of the arguments it needs, apart from the problem to be solved, i.e. is_trivial, solve, split and join. This can be done by the partial application of d&c to these arguments, which yields a function of the type a -> b.

Partial applications are one of the main reasons for the functional extensions of the host language. While functional arguments with no arguments of their own could be simulated in C by pointers to functions, this is no longer possible with functional arguments yielded by partial applications. We do not consider here the possibility of passing the additional parameters as global variables, since it is bad programming style to simulate variables which actually have local scope by global ones. This would introduce a source of errors which are hard to find. Moreover, this simulation is not always possible: e.g., if different partial applications of the
same function are given as parameters to the same skeleton, what should be stored in the global variables then?

The third feature we need is polymorphism, since we want to define functions that depend only on the structure of the problem, and not on particular data types. For instance, the skeleton d&c performs the same operations regardless of the type of the problem and the solution. The advantage is that the same skeletons can be employed in solving similar or related problems. Polymorphic types are either type variables2, or compound types built from other types using the C type constructors array, function, pointer, structure or union and containing at least one type variable. Although polymorphism can be simulated in C by casting, this is nevertheless a potential source of errors, since it eludes type checking. Our approach leads, however, to safer programs, since polymorphic type checking is performed3.

To summarize, the functional features needed to facilitate the integration of skeletons are higher-order functions, partial applications and polymorphism. Note that the necessity of these features is not obviated by the possibility of implementing skeletons as library routines. On the contrary, a library of sufficiently flexible skeletons is enabled by these features.

Concluding this section, we will briefly address some implementation issues. A program is processed by a front-end compiler together with the necessary skeletons. The main task of the compiler is to eliminate the functional features of the language. Higher-order functions and partial applications are translated by an instantiation procedure which resembles higher-order macro-expansion [12]. Since polymorphism is translated by instantiation too, a single transformation is employed. The front-end compiler generates parallel C code based on message passing, which can then be processed by a C compiler.
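To give an intuition of what the instantiation procedure produces, consider a map whose functional argument is the partial application of a binary function. The following is a hand-written sketch of our own in plain C, not actual compiler output; all names are invented.

```c
/* Hypothetical source-level call:  map (plus (3), a, n)
   The instantiation procedure expands the higher-order call into a
   specialized first-order function, in which the argument bound by the
   partial application becomes an ordinary extra parameter. */

static int plus (int x, int y) { return x + y; }

/* Specialized first-order instance of "map f" for f = plus (c): */
void map_plus_inst (int c, int *a, int n)
{ int i;
  for (i = 0; i < n; i++)
    a[i] = plus (c, a[i]);
}
```

A plain function pointer could not carry the bound argument c along, which is why partial applications must either be eliminated at compile time, as here, or simulated at run time with closure records.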
3 The Skeletons

We have seen that skeletons can be classified into data parallel and process parallel ones. Although both types can be used in our language, the emphasis is placed here on the first category. There are two main reasons for this. On the one hand, data parallelism seems to offer better possibilities to exploit large numbers of processors. On the other hand, data structures are central to our pursuit, since we aim at using dynamic data structures to solve irregular problems.

A problem that arises when using distributed data structures in an imperative language is that the programmer is able to access local parts of the data structure, for instance by pointers or indices. If this occurs in an uncontrolled way, heavy remote data accessing may cause considerable communication overhead. In order to control the access to data, we use abstract data types, which guarantee a clear interface to the data structures and the skeletons working on them, and hide implementation details. Data parallel skeletons are thus not stand-alone, but always part of an ADT. Since the underlying data structures of these ADTs

2 Type variables are defined using the new keyword typevar.
3 To quote Milner: "Well-typed programs don't go wrong".
are distributed, we will call them parallel abstract data types (PADTs). Apart from controlling data accesses, PADTs have some other important advantages:
– They allow regular and irregular problems to be equally dealt with. Irregularity is supported on the one hand by dynamic data structures and on the other hand by implicit or explicit operations for extending or restricting a data structure, for dynamic data re-mapping and for load balancing.
– They can be generic, thus allowing a systematic instantiation of the data structure and the operations.
– They group skeletons together, making their inclusion easier. This is very important if a large number of skeletons is available.
A similar approach is taken by Skillicorn [11] with his categorical data types. The difference to the approach presented here is that the operations of a categorical data type must be homomorphisms, i.e. they have to respect the structure of the data type, whereas the operations of a PADT need not. Our approach thus seems more flexible, since it allows the use of operations that are not homomorphisms, like for instance neighborhood operations (see Section 4.2).

In order to use PADTs, we need to enhance the language with two more constructs. On the definition level, the construct pardata allows the declaration of a PADT. On the application level, PADTs are included, and if necessary instantiated, by means of the parinst construct. We shall illustrate the way these constructs are used by a simple example. We shall define a generic data type matrix and show how it can be employed to compute the shortest paths in a graph by transitive closure. The PADT matrix is parameterized by the type of its elements. The definition of the PADT, comprising only a selection of the skeletons, is given below.

pardata matrix (typevar elem_t)
  matrix mat_load (FILE *infile);
  void   mat_dump (matrix m, FILE *outfile);
  matrix mat_gen_add (matrix m1, matrix m2,
                      elem_t gen_add (elem_t val1, elem_t val2));
  matrix mat_gen_mult (matrix m1, matrix m2,
                       elem_t gen_add (elem_t val1, elem_t val2),
                       elem_t gen_mult (elem_t val1, elem_t val2));
  ...
end
The skeleton mat_load loads the elements of a matrix from a file and distributes them onto the processors, while mat_dump does the opposite: it collects the elements of a matrix and dumps them to a file. mat_gen_add and mat_gen_mult perform a generic addition and multiplication of two matrices, respectively. These generic skeletons can be instantiated to different operations. As an example, we shall compute the shortest paths between every two vertices of a graph represented by its adjacency matrix. For that, we have to define a matrix of integers and instantiate the functional arguments gen_add and gen_mult by min and +. The following program results:
parinst matrix (int) int_mat;

void shortest_paths ()
{ int_mat a;
  a = mat_load (infile);
  for (i = 0; i < log2 (n); i++)        /* compute the transitive closure */
    a = mat_gen_mult (a, a, min (), plus ());
  mat_dump (a, outfile);
}

int min  (int x, int y) {return (x < y ? x : y);}
int plus (int x, int y) {return (x + y);}
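What mat_gen_mult computes can be pictured with a sequential C sketch (our own illustration: the fixed size N, the INF encoding of "no edge" and all names are assumptions of the example, not part of the skeleton library).

```c
#include <limits.h>

#define N 4
#define INF (INT_MAX / 2)   /* "no edge"; halved so that INF + INF cannot overflow */

/* Generic matrix multiplication: c[i][j] is the gen_add-reduction over k
   of gen_mult (a[i][k], b[k][j]).  With gen_add = min and gen_mult = +,
   repeatedly squaring the adjacency matrix yields all-pairs shortest
   paths, i.e. the transitive closure computed in shortest_paths above. */
void mat_mult_generic (int c[N][N], int a[N][N], int b[N][N],
                       int (*gen_add)(int, int), int (*gen_mult)(int, int))
{
  int t[N][N], i, j, k, acc;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
    { acc = gen_mult (a[i][0], b[0][j]);
      for (k = 1; k < N; k++)
        acc = gen_add (acc, gen_mult (a[i][k], b[k][j]));
      t[i][j] = acc;
    }
  for (i = 0; i < N; i++)               /* copy back, so c may alias a or b */
    for (j = 0; j < N; j++)
      c[i][j] = t[i][j];
}

int min_int  (int x, int y) { return (x < y ? x : y); }
int plus_int (int x, int y) { return (x + y); }
```

After i squarings, entry [u][v] holds the length of the shortest path from u to v using at most 2^i edges, which is why log2(n) iterations suffice.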
4 Adaptive Multigrid
In this section, we will show how skeletons can be used to implement irregular algorithms. As an example, we will use adaptive multigrid methods, since they are both appropriate for parallelization and have a dynamic character. First, we will give a short description of the method4, as well as of the internal representation of the underlying data structure. Then we will define the skeletons that perform the operations on the grid. Based on these operations, the implementation of a full multigrid algorithm will then be sketched. Finally, the process of adaptive refinement and its implementation with skeletons will be regarded more closely.
4.1 The Method
Multigrid methods are iterative solvers for systems of discretized (partial) differential equations on a grid hierarchy. The hierarchy consists of a number of grid levels, where each new level is obtained, depending on the chosen method, either by refining (i.e. by adding more points to), or by coarsening (i.e. by selecting some of the points of) the current level [5]. The main advantage of these methods is their efficiency: they require only O(n) operations5, where n is the number of points on the finest grid [5]. Another advantage of multigrid methods is that they can be combined with adaptive refinement techniques, such that a refinement is performed only in those areas where this leads to relevant improvements in the accuracy of the solution.

The basic idea of multigrid methods is to eliminate the high frequencies of the error in relatively few relaxation steps on a fine grid, while the lower frequencies are eliminated by reducing the problem to coarser grids, on which they appear as higher frequencies and can be attenuated again by relaxation. This process can be continued recursively for a number of steps. Coarser grids are therefore viewed as correction grids. The opposite view, in which the finer grids are the correction grids, can also be employed [5]. This has the advantage that it permits adaptive refinement, leading to finer levels confined to increasingly smaller sub-domains. Although a series of multigrid algorithms has been defined, they all build on the same basic procedures, the differences consisting mainly in how the cycles

4 We present here the general methodology, rather than a specific multigrid method.
5 In the sequential case.
over the grid levels are organized. Regardless of the particular algorithm, the basic operations are:

– refining and coarsening of the current grid level, which can be done adaptively, based on the current approximation of the solution or of the error,
– prolongation and restriction of data when moving to a finer, respectively to a coarser grid level; these operations are usually applied to the solution, error, residual and right-hand side,
– relaxation, which is performed on one grid level, and consists of a number of iterations of a certain solver, like Jacobi, Gauss-Seidel, SOR etc. [5].

Apart from these operations, some additional operations are needed to support parallelism:

– distributing data to all processors and collecting distributed data (for instance, for output),
– environment (or neighborhood) operations, which select for a grid point its neighboring points on one or more levels up to a given depth,
– operations for load balancing, necessary in irregular applications,
– `map'-operations, which evaluate a function for all items of a distributed data structure.

The two categories of operations represent the basis for the multigrid skeletons, which will be presented in the next subsection. Without getting into implementation details, we will shortly describe the internal representation of the grid. The entire grid consists of a series of levels, of which one is always the current level. A grid level is represented as a graph, thus allowing a uniform handling of irregular (locally refined) grids. The points of a level (i.e. the vertices of the graph) are defined by their coordinates in an n-dimensional space, where n is usually 2 or 3. One or more distributed variables, representing the solution, the error etc., can be assigned to these points. Each of these variables is identified by a unique numerical id. Further, due to the distribution of the grid, overlap areas are defined.
They represent extensions of the sub-domains that are placed on different processors and serve only to improve the efficiency of read accesses to the neighborhoods of data items placed on the borders of sub-domains. If data values in the overlap areas are altered, then these areas are updated automatically.
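The automatic update of the overlap areas can be pictured with a small sequential sketch. This is entirely our own illustration: the processors are simulated as rows of an array, and HOVL plays the role of the horizontal overlap factor.

```c
#define P 3      /* number of (simulated) processors */
#define LOCAL 4  /* owned cells per processor */
#define HOVL 1   /* horizontal overlap factor */

/* Row layout per processor: [0..HOVL-1] left halo, [HOVL..HOVL+LOCAL-1]
   owned cells, [HOVL+LOCAL..HOVL+LOCAL+HOVL-1] right halo.  After local
   writes, the halos are refreshed from the neighbours' owned cells;
   cells beyond the domain boundary are set to 0. */
void update_halos (double block[P][LOCAL + 2 * HOVL])
{
  int p, k;
  for (p = 0; p < P; p++)
    for (k = 0; k < HOVL; k++)
    { /* left halo := right edge of the left neighbour's owned cells */
      block[p][k] = (p > 0) ? block[p - 1][LOCAL + k] : 0.0;
      /* right halo := left edge of the right neighbour's owned cells */
      block[p][HOVL + LOCAL + k] = (p < P - 1) ? block[p + 1][HOVL + k] : 0.0;
    }
}
```

In the real implementation this exchange is a message-passing step hidden inside the skeletons; the point is that read accesses within the overlap radius never need extra communication.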
4.2 Skeletons for Multigrid
We shall now present the skeletons needed in multigrid operations. As stated in Section 3, they will be embedded into a parallel abstract data type. We shall restrict the description of the PADT to its interface, as it represents the part of the PADT that is visible to the user. The implementation of the skeleton mg_refine and of a function that performs an adaptive refinement will be described in more detail in Subsection 4.4.

multigrid is a generic PADT, parameterized by dimension, type of the coordinates of the points, type of the distributed variables, and horizontal and vertical overlap factors. The dimension is an integer value that is 2 or 3 for most
applications. Single coordinates and data values at the grid points can be of different types, but mostly single or double precision reals are employed. The overlap areas serve only to improve the efficiency of read accesses to the distributed data. Consider for instance the case of a 2-dimensional regular grid. If an interpolation operator uses the values at the 4 adjacent grid points, then a horizontal overlap factor of 1 is sufficient to avoid additional communication for the computation of the elements on the borders of sub-domains of the grid placed on different processors. The same applies in the vertical direction, if elements of one grid level are needed to compute elements on another grid level, as for instance in the prolongation and restriction procedures. Overlap factors should therefore be chosen at least 1. The interface of the PADT multigrid is given below. Due to lack of space, only a selection of the skeletons is presented.

pardata multigrid (int dim, typevar coords_t, typevar data_t,
                   int hovl, int vovl)
  multigrid mg_load_coords (FILE *infile);
  void mg_dump_coords (multigrid mg, FILE *outfile);
  void mg_destroy (multigrid mg);
  void mg_refine (multigrid mg, coords_t *refine_f (coords_t p));
  void mg_coarsen (multigrid mg, coords_t *coarsen_f (coords_t p));
  void mg_balance (multigrid mg);
  int mg_allocate (multigrid mg);
  void mg_deallocate (multigrid mg, int id);
  void mg_load_data (multigrid mg, int id, FILE *infile);
  void mg_dump_data (multigrid mg, int id, FILE *outfile);
  void mg_prolongate (multigrid mg, int id, data_t prol_f (coords_t p));
  void mg_restrict (multigrid mg, int id, data_t restr_f (coords_t p));
  coords_t *mg_coords_env (multigrid mg, coords_t p, int hrad, int vrad);
  data_t *mg_data_env (multigrid mg, coords_t p, int hrad, int vrad, int id);
  void mg_relax (multigrid mg, int *from_ids, int to_id,
                 data_t relax_f (int *ids, coords_t p));
  void mg_map (multigrid mg, int *from_ids, int to_id,
               data_t appl_f (int *ids, coords_t p));
end
The skeletons can be grouped into grid operations (the first six), data operations (the next six), selection operations (the next two) and computational operations (the last two). Grid operations comprise the constructor mg_load_coords, which loads the coordinates of some points from a file, distributes them and builds upon them a one-level grid; the operation mg_dump_coords, which dumps the coordinates of the current grid level into a file; and the destructor mg_destroy. Note that in the case of the first two skeletons external data support is necessary, since it is possible that the memory of one processor is too small to hold the entire data from one level of the grid. Finer or coarser grid levels can be derived from the current one by using the operations mg_refine, respectively mg_coarsen. The strength of these skeletons lies in their functional arguments, which create the
single points of the new grid. By using partial applications, additional parameters can be supplied to these functions, thus providing the possibility to adaptively create a new grid, depending for instance on the data values on the old grid. An example of an adaptive refinement function is given in Subsection 4.4. In case of an adaptive refinement, the resulting grid is usually unbalanced. This can be evened out by calling the skeleton mg_balance, which tries to achieve a (nearly) equal load on all processors and at the same time to minimize the communication between them. The load balancing procedure is based on a dimension exchange algorithm, in which each processor exchanges load with its neighbors.

Data operations include the constructor mg_allocate, the destructor mg_deallocate, as well as operations for loading data items from a file and distributing them on the current grid level (mg_load_data), respectively collecting data from this level and dumping it to a file (mg_dump_data). The extension of data to a finer level is done by mg_prolongate and to a coarser level by mg_restrict.

The selection operations (mg_coords_env and mg_data_env) compute for a given grid point a list of neighboring points (environment), respectively of data placed at these points. The number of neighboring levels is given by the `vertical radius' (vrad), whereas the depth of the area of adjacency is determined by the `horizontal radius' (hrad). These operations are useful for grid refinement or coarsening, as well as for the prolongation or restriction of data from one level to another and for relaxation. In order to compute the neighboring points efficiently, care has to be taken that the radii do not exceed the corresponding overlap factors given in the declaration of the multigrid type (or, to put it the other way around, that the overlap factors are upper bounds of all radii used in computations).
It is worth noting that the selection operations are not parallel, but bound to a given grid point. Parallelism is achieved by using them inside map-like skeletons, for instance in the argument function of the refinement procedure (see Subsection 4.4 for an example). Finally, computational operations comprise relaxation and mapping on a grid level. mg_map maps a given function to all points of the current grid level. It gets as argument a list with the id's of the variables to be used in the computation (from_ids), which it passes to the function to be applied (appl_f). The return value of this function is written into the variable identified by to_id. mg_relax works similarly to the map skeleton. In the following example, we will use the Jacobi relaxation procedure, but other methods, like Gauss-Seidel or SOR, can be employed as well. The skeletons defined in this PADT represent the basic operations for multigrid [5], so that practically all (structured or unstructured) multigrid algorithms can be implemented with them. An example is given in the next subsection.
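A sequential analogue of mg_map may make the calling convention clearer. The sketch is entirely ours: the distributed variables are simulated as rows of a table indexed by grid point, and minus_f mimics the pointwise subtraction used in the coarse grid correction of the next subsection.

```c
#define NPOINTS 5   /* points of the (simulated) current grid level */
#define NVARS 3     /* distributed variables, identified by their row index */

/* Sequential sketch of mg_map: at every point, gather the values of the
   variables listed in from_ids, apply appl_f to them, and store the
   result in the variable to_id. */
void grid_map (double vars[NVARS][NPOINTS],
               const int *from_ids, int n_from, int to_id,
               double (*appl_f)(const double *vals, int n))
{
  double vals[NVARS];
  int p, k;
  for (p = 0; p < NPOINTS; p++)
  { for (k = 0; k < n_from; k++)
      vals[k] = vars[from_ids[k]][p];
    vars[to_id][p] = appl_f (vals, n_from);
  }
}

/* e.g. the pointwise subtraction u - v of the coarse grid correction */
double minus_f (const double *vals, int n) { (void) n; return vals[0] - vals[1]; }
```

In the parallel skeleton the loop over the points runs concurrently on the owning processors; no communication is needed, since appl_f only sees values stored at the point itself.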
4.3 The Implementation of a Full Multigrid Algorithm
We will now present the implementation of a full multigrid algorithm (FMV) [5]. The algorithm is based on two ideas: coarse grid correction and nested iteration. The first idea was explained in Subsection 4.1; the second will be outlined in the following.
When using an iterative method to solve a linear system, an initial approximation of the solution is needed. This approximation can either be randomly chosen, or generated by some procedure. This procedure could be, for instance, a relaxation performed on the same problem on a coarser grid, followed by a prolongation of the solution. This leads to an improved initial guess for the fine grid problem. The initial guess on the coarse grid can be obtained by a similar procedure, and so on, recursively, up to the coarsest grid. Combining nested iteration with coarse grid correction, we obtain full multigrid algorithms. If we choose the simplest correction cycle, the V-cycle, then the FMV algorithm results. This algorithm is described below.

We start on the coarsest level (0) and refine the problem to level 1. After that, a coarse grid correction is performed in a V-cycle between level 1 and level 0, with pre- and post-smoothing relaxations. The next step consists of a new refinement (to level 2) and a new V-cycle correction (between level 2 and level 0). Continuing this procedure yields a full multigrid V-cycle (FMV) algorithm (sketched in Figure 1). The parameters of a full multigrid algorithm are the number of pre- and post-smoothing steps (ν1 and ν2), the type of the correction cycle (given by γ) and the number of correction cycles performed on each level (δ). The FMV algorithm depicted in Figure 1 is characterized by γ = 1 (V-cycles) and δ = 1 (one correction cycle per level).

Fig. 1. Full multigrid V-cycle (FMV); the sketch shows levels 0 to 3
We will now present the implementation of the FMV algorithm based on the skeletons defined above. Some of the details have been omitted for simplicity; others are given only in pseudo-code. The notation [x, y, z, ...] is an explicit representation of the array or structure built from the listed components. The implementation is given below. The first statement represents an instantiation of the generic PADT multigrid to a 2-dimensional grid whose points have real coordinates and double precision values. The vertical and horizontal overlap factors are both set to 1. The function FMV mainly contains the nested iteration, starting at level 0 and refining until some convergence criterion is fulfilled. Apart from that, this function also handles I/O and the allocation of the grid and of the variables. The function MV performs the coarse grid correction in a V-cycle with some pre- and post-smoothing steps on each level. On the coarsest level, an exact solution is computed, since this is usually cheap enough.
parinst multigrid (2, float[2], double, 1, 1) mgrid2;

void FMV ()
{ mgrid2 mg;
  int l = 0;
  load and distribute coordinates and data;
  while (! converge (id_u)) {                 /* some convergence criterion */
    mg_refine (mg, local_ad_ref (mg, id_u));  /* adaptive refinement */
    mg_balance (mg);                          /* re-distribute the refined grid */
    mg_prolongate (mg, id_u, interpolate (mg, id_u));
    mg_prolongate (mg, id_f, interpolate (mg, id_f));
    for (i = 0, l++; i < μ; i++)              /* μ iterations per level */
      MV (mg, l, id_u, id_f);                 /* V-cycle on level l */
  }
  collect distributed data and output result;
}
void MV (multigrid mg, int l, int id_u, int id_f)
{ if (l == 0)                                        /* coarsest grid */
    compute exact solution of (u, f);
  else {
    for (i = 0; i < ν1; i++)                         /* pre-smoothing */
      mg_relax (mg, [id_u, id_f], id_u, jacobi (mg, ω));
    id_r = mg_allocate (mg);                         /* residual variable */
    mg_map (mg, [id_u, id_f], id_r, residual (mg));  /* compute res. */
    mg_restrict (mg, id_r, inject (mg, id_r));
    id_v = mg_allocate (mg);                         /* error variable, initialized to 0 */
    mg_restrict (mg, id_v, inject (mg, id_v));
    for (i = 0; i < γ; i++)                          /* coarse grid correction */
      MV (mg, l-1, id_v, id_r);
    mg_prolongate (mg, id_v, interpolate (mg, id_v));
    mg_map (mg, [id_u, id_v], id_u, minus (mg));     /* u = u - v */
    for (i = 0; i < ν2; i++)                         /* post-smoothing */
      mg_relax (mg, [id_u, id_f], id_u, jacobi (mg, ω));
    mg_deallocate (mg, id_r);
    mg_deallocate (mg, id_v);
  }
}
4.4 Adaptive Refinement

In this subsection, we will show how the mg_refine skeleton is defined and how a particular adaptive procedure can be implemented on its basis. The grid refinement skeleton is a map-like function which is applied to all points of the current (and at the same time finest) level. For each point of the grid, one or more points of the new grid are generated, depending on the need for local refinement. If the grid needs no refinement in an area, then only the given point is returned. Otherwise, a list with additional points is generated. The definition of the skeleton mg_refine is given below in pseudo-code.
void mg_refine (multigrid mg, coords_t *refine_f (coords_t p))
{ for all points p of the current level of mg do in parallel {
    newpoints_p = refine_f (p);
    build links between parent and children points;
  }
  for all p' ∈ newpoints_p do
    if (∃ q : p' ∈ newpoints_q)                  /* p' was already created */
      eliminate p' from newpoints_p;
  build graph of new level from all newpoints;   /* new grid is distributed */
  construct vertical and horizontal overlap areas;
}
The skeleton uses the local refinement function refine_f to generate one or more new points for each point and at the same time builds the links between parents and children. After that, the local results are merged to form one new grid (not necessarily connected), whereby duplicates produced by different local refinements are removed. Finally, data along the borders of the sub-domains of the grid placed on different processors are replicated to create the overlap areas. We now want to take a closer look at the local refinement procedure. For that, we shall consider a simple locally refined nested grid. Let the coarsest level be a 2-dimensional regular grid, consisting of rectangles or even squares. The refinement is done by orthogonal recursive bisection, but only in those areas where this leads to an improvement of the accuracy of the solution. If, for instance, the solution has a singularity in the upper right corner, then a sequence of grids like the one in Figure 2 is generated.
Fig. 2. Adaptive orthogonal recursive bisection (grids G0 . . . G3)
An example of how the local refinement is done is shown in Figure 3. Here, the grid G^l is refined to yield the grid G^{l+1}, whereby only some areas, like the hatched one, actually undergo a refinement. The local refinement of the hatched square generates 9 new points in the grid G^{l+1}. Nevertheless, since the refinement function is mapped to all points of G^l, only the 6 white points in G^{l+1} have to be generated by the refinement corresponding to P^l_{i,j}, the others being produced by the (trivial) refinement of the points P^l_{i+1,j}, P^l_{i,j+1} respectively P^l_{i+1,j+1}. The white points can be computed for instance according to the equations:

P^{l+1}_{i,j}     = P^l_{i,j}                                                  (1)
P^{l+1}_{i+1,j}   = (P^l_{i,j} + P^l_{i+1,j}) / 2                              (2)
P^{l+1}_{i,j+1}   = (P^l_{i,j} + P^l_{i,j+1}) / 2                              (3)
P^{l+1}_{i+2,j+1} = (P^l_{i+1,j} + P^l_{i+1,j+1}) / 2                          (4)
P^{l+1}_{i+1,j+2} = (P^l_{i,j+1} + P^l_{i+1,j+1}) / 2                          (5)
P^{l+1}_{i+1,j+1} = (P^l_{i,j} + P^l_{i+1,j} + P^l_{i,j+1} + P^l_{i+1,j+1}) / 4  (6)
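The arithmetic of equations (1)-(6) can be checked with a small self-contained sketch. The point type and function names below are illustrative; only the interpolation formulas themselves come from the text:

```c
/* Coordinates of a grid point. */
typedef struct { double x, y; } pt;

static pt mid(pt a, pt b) { return (pt){ (a.x + b.x) / 2, (a.y + b.y) / 2 }; }

/* Given the four corners of the hatched coarse cell, compute the six new
   fine-grid points per equations (1)-(6).  out[0..5] holds P^{l+1}_{i,j},
   P^{l+1}_{i+1,j}, P^{l+1}_{i,j+1}, P^{l+1}_{i+2,j+1}, P^{l+1}_{i+1,j+2}
   and P^{l+1}_{i+1,j+1}. */
static void refine_cell(pt p00, pt p10, pt p01, pt p11, pt out[6]) {
    out[0] = p00;                      /* (1) corner coincides with P^l_{i,j} */
    out[1] = mid(p00, p10);            /* (2) bottom edge midpoint */
    out[2] = mid(p00, p01);            /* (3) left edge midpoint */
    out[3] = mid(p10, p11);            /* (4) right edge midpoint */
    out[4] = mid(p01, p11);            /* (5) top edge midpoint */
    out[5] = (pt){ (p00.x + p10.x + p01.x + p11.x) / 4,   /* (6) centre */
                   (p00.y + p10.y + p01.y + p11.y) / 4 };
}
```

For the unit cell with corners (0,0), (1,0), (0,1), (1,1), this produces the edge midpoints and the centre (0.5, 0.5), as expected.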
Fig. 3. Local adaptive grid refinement
As these equations show, all 4 vertices of the hatched square are needed to generate the white points in the finer grid. The dependencies are depicted in Figure 3 by dotted lines. Since the local refinement function is mapped to a single point of the old grid at a time (here to P^l_{i,j}), an operation is needed to compute the environment for a given point. This operation is mg_coords_env, which returns the 4 direct neighboring points of P^l_{i,j}. Calling this function again for some of the direct neighbors yields the indirect neighbors, too. Moreover, the refinement can be performed depending not only on the coordinates of the points of the old grid, but also on certain values at these points, like for instance the solution approximation. This `data environment' is computed by the operation mg_data_env. We can now define the local refinement function as follows.
coords_t *local_ad_ref (multigrid mg, id_u, coords_t p)
{ ps = mg_coords_env (mg, p, 1, 0);
      /* ps = [P^l_{i-1,j}, P^l_{i+1,j}, P^l_{i,j}, P^l_{i,j-1}, P^l_{i,j+1}] */
  us = mg_data_env (mg, p, 1, 0, id_u);           /* us analogous */
  ps = ps ∪ {P^l_{i+1,j+1}};     /* by calling mg_coords_env for P^l_{i,j+1} */
  us = us ∪ {u^l_{i+1,j+1}};     /* by calling mg_data_env for u^l_{i,j+1} */
  if (eval_sol (u^l_{i,j}, u^l_{i+1,j}, u^l_{i,j+1}, u^l_{i+1,j+1}) < ε)
    return ([P^l_{i,j}]);                          /* no need to refine */
  else {
    compute P^{l+1}_{i,j} . . . P^{l+1}_{i+1,j+1} according to equations (1) . . . (6);
    return ([P^{l+1}_{i,j}, P^{l+1}_{i+1,j}, P^{l+1}_{i,j+1},
             P^{l+1}_{i+2,j+1}, P^{l+1}_{i+1,j+2}, P^{l+1}_{i+1,j+1}]);
  }
}
The examples given in the last two subsections illustrate the way higher-order functions and partial applications can be used to enhance the expressive power of skeletons. The call of the skeleton mg_refine inside the body of the procedure FMV gets as functional argument a partial application of the function local_ad_ref to its first two arguments (mg and id_u). This yields by currying a new function with one argument, which has exactly the type expected by mg_refine. The skeleton supplies in its body the last argument needed by local_ad_ref, namely (the coordinates of) the point p.
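Since plain C lacks partial application, the mechanism can be pictured with an explicit closure: a struct captures the first two arguments, and a one-argument call supplies the rest. All names below are illustrative stand-ins for the actual skeleton API:

```c
/* Stand-in for a three-argument refinement function like local_ad_ref. */
static int local_ref3(int mg, int id_u, int p) {
    return mg * 100 + id_u * 10 + p;    /* arbitrary computation for the demo */
}

/* Closure: the environment created by partially applying the first
   two arguments (mg and id_u). */
typedef struct { int mg; int id_u; } ref_closure;

/* The curried, one-argument view that a map-like skeleton would invoke
   for each grid point p. */
static int apply_ref(const ref_closure *c, int p) {
    return local_ref3(c->mg, c->id_u, p);
}
```

In the host language this closure construction is implicit: passing local_ad_ref (mg, id_u) builds the environment, and the skeleton later supplies p.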
5 Conclusions and Future Work
We have presented a new approach to parallel programming with algorithmic skeletons. The principal aim was to design a language which allows structured parallel programming and at the same time can be efficiently implemented. We have first considered the host language. By employing an imperative host enhanced with higher-order functions, partial applications and polymorphism, an efficient and high-level language was obtained. We have then described the skeletons embedded into this language, using parallel abstract data types as a means to integrate the skeletons. It was argued that this allows both regular and irregular problems to be dealt with. Finally, an adaptive multigrid algorithm was implemented based on a PADT. The main advantage of this approach is that the user no longer has to program global procedures for the whole grid, but only local ones for single grid points. As the grid is distributed, this is a considerable simplification, since it frees the user from the burden of accounting for the explicit aspects of parallelism. We are currently working on a prototype implementation of the language. Future plans include using skeletons in other application areas of parallel computation, like computational geometry or the N-body problem.