
DRAMA: A Library for Parallel Dynamic Load Balancing of Finite Element Applications

Bart Maerten†, Dirk Roose†, Achim Basermann‡, Guy Lonsdale‡, Jochen Fingberg‡

Abstract. We describe a software library for dynamic load balancing of finite element codes at run-time. The application code has to provide the current distributed mesh and information on the calculation and communication requirements, and receives from the library all necessary information to re-allocate the application data. The library computes a new partitioning, either via direct mesh migration or via parallel graph re-partitioning, by interfacing to the ParMetis package. In this paper, we discuss the underlying cost model and the various graph representations of the cost model that are implemented in the library. We describe the functionality of the DRAMA library and we present some first results.

1 Introduction

Parallel finite element applications can have changing work load and communication requirements, due to re-meshing, adaptive mesh refinement (and de-refinement), or due to a changing calculation or communication cost in parts of the mesh. In these cases, dynamic load balancing can be achieved by parallel mesh re-partitioning at run-time. Mesh re-partitioning algorithms must be efficient, operating in parallel on distributed data, and must take into account the current partitioning in order to minimize data migration. Dynamic load balancing is also useful when using non-dedicated parallel machines (e.g. workstation clusters).

The aim of the Esprit project DRAMA, Dynamic Re-Allocation of Meshes for parallel finite element Applications, is to develop a library with dynamic mesh re-partitioning algorithms for unstructured finite element applications. Although the aim is to develop a general purpose library, emphasis lies on two industrially relevant application codes, i.e., PAM-CRASH/PAM-STAMP and FORGE3. These codes have quite different re-partitioning needs. FORGE3 frequently applies refinement/de-refinement techniques together with 3D re-meshing. PAM-CRASH has varying calculation and communication costs due to contacts between different parts of the mesh. The contacts change during the execution and are the main cause of load imbalance. PAM-STAMP uses adaptive (de-)refinement techniques without re-meshing.

DRAMA is supported by the EC under ESPRIT IV, Long Term Research Project No. 24953. More information can be found on WWW at http://www.cs.kuleuven.ac.be/cwis/research/natw/DRAMA.html
† Dept. of Computer Science, K.U.Leuven, Celestijnenlaan 200A, B-3001 Heverlee-Leuven, Belgium.
‡ C&C Research Laboratories, NEC Europe Ltd., Sankt Augustin, Germany.

In the next section, we introduce mesh re-partitioning and the DRAMA library. In section 3, we describe the cost function upon which the DRAMA re-partitioning algorithms are based, and how the minimization of this cost function can be formulated as a graph partitioning problem. The DRAMA library is described in section 4 and first results are presented in section 5.

2 Mesh re-partitioning methods

The aim of a mesh re-partitioning procedure is to restore the load balance during the execution of a parallel application program. Compared to (static) mesh partitioning, which is typically performed as a pre-processing phase, a (dynamic) mesh re-partitioning strategy must satisfy the following additional requirements.

It must interact with the application program. The input from the application program consists of the current distributed mesh, calculation and communication cost parameters and timings. Based on the output of the mesh re-partitioner, i.e., the newly computed partitioning and the relation to the old partitioning, the application must decide whether the gain of the re-distribution will outweigh the cost of the data migration and, if so, migrate the application data according to the new partitioning.

It must be fast and parallel. The re-partitioner must run in parallel and should not require too much communication. Since the mesh may change further during the execution of the application, the quality of the resulting partitioning may be lower than for static load balancing.

It must take the current partitioning into account. The re-partitioner should preferably find a partitioning that is similar to the old one, in order to minimize the data migration needed to restart the application with the new mesh distribution.

Basically, there are two approaches to the mesh re-partitioning problem:

a) The current partitioning is updated by migrating parts of the mesh to another (neighboring) partition. This approach is based on local and distributed decisions. When the load imbalance is not high, such a local procedure performs well and the new partitioning will not differ much from the original partitioning. However, when the load imbalance is high, a local mesh migration procedure may require many iterations (and thus much communication) to converge and/or may lead to a suboptimal partitioning.

b) A new partitioning is computed using a mesh partitioning algorithm, adapted to the re-partitioning problem. By exploiting a global view of the partitioning problem, static mesh partitioners are able to find a (nearly) optimal partitioning, but they may be expensive. Thus adaptations are required to make the algorithm sufficiently fast (sub-optimal partitionings are acceptable), to run in parallel, and to take the current partitioning into account. Global partitioning methods are often based on a graph representation of the mesh.

Practical re-partitioning strategies combine aspects of both approaches. Indeed, mesh migration techniques try to achieve a globally optimal solution via a suitable sequence of local decisions and operations, while a strategy based on global optimization can only be parallelized efficiently if it is based on local operations, which also makes it possible to take the current partitioning into account.

The DRAMA library contains mesh re-partitioning techniques of both types. A mesh migration method is being developed in combination with a parallel re-meshing technique [4]. Global mesh re-partitioning is implemented in DRAMA by interfacing to existing libraries for parallel graph re-partitioning, i.e., ParMetis [7] and Jostle [8]. Both libraries use multilevel techniques in which the re-partitioning of the coarsest graph is based on the diffusion algorithm of Hu & Blake [6], which minimizes the Euclidean norm of the data migration based on the processor graph. This diffusion algorithm is synchronous and is suited for applications whose load does not change very much from one iteration to the next. ParMetis also contains a diffusion scheme that has only a local view [5]. Here vertex migration is based on the relative difference in partition weight between each partition and all its neighboring partitions. The algorithm works asynchronously and is highly parallel in nature. While the local view may lead to a partitioning of lower quality, the coarsening, which speeds up the convergence by moving blocks of vertices, gives the algorithm a more global view.
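The local diffusion idea described above can be illustrated with a first-order diffusion iteration on the processor graph, of the kind analysed in [5]. The following is only a minimal sketch and not the ParMetis or Jostle implementation: the function name `diffuse`, the fixed coefficient `alpha`, the synchronous sweeps and the ring of 4 processors are assumptions made for illustration, and loads are treated as continuous quantities, whereas a real re-partitioner has to move whole graph vertices.

```python
def diffuse(load, neighbors, alpha=0.25, steps=50):
    """First-order diffusion: in every sweep each processor exchanges a
    fraction `alpha` of the load difference with each of its neighbors."""
    load = list(load)
    for _ in range(steps):
        new = load[:]
        for i, nbrs in neighbors.items():
            for j in nbrs:
                new[i] += alpha * (load[j] - load[i])
        load = new
    return load

# Hypothetical ring of 4 processors; processor 0 holds half of the work.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
balanced = diffuse([50.0, 20.0, 15.0, 15.0], ring)
```

Because the exchange between two neighbors is symmetric, the total load is conserved while the differences between neighbors decay from sweep to sweep. The value of `alpha` has to be small enough for the iteration to be stable on the given processor graph.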

3 The DRAMA cost function and its graph representation

3.1 Cost function

The DRAMA cost function [1] models the execution time of parallel finite element calculations, taking into account changing work load and communication requirements and heterogeneous machine architectures. The cost function is given by

$$F^c = \max_{i \in \{0,\dots,p-1\}} F_i^c, \tag{1}$$

where $p$ denotes the number of processors and the cost function $F_i^c$ for processor $i$ consists of computational costs $w_i^{\mathrm{tot}}$ and communication costs $c_i^{\mathrm{tot}}$:

$$F_i^c = w_i^{\mathrm{tot}} + c_i^{\mathrm{tot}}. \tag{2}$$

Part of the operations within an FEM code is performed element-wise, e.g., the assembly phase, while other operations are node-based, e.g., updating the coordinates and solving linear systems. Further, even when the calculations are done element-wise, communication is frequently carried out using node lists. Therefore, the computational cost consists of element-based and node-based contributions:

$$w_i^{\mathrm{tot}} = \alpha^e\, w_i^{\mathrm{tot},e} + \alpha^n\, w_i^{\mathrm{tot},n}, \tag{3}$$

while the communication cost consists of element-element, node-node, and element-node based contributions:

$$c_i^{\mathrm{tot}} = \beta^e\, c_i^{\mathrm{tot},e} + \beta^n\, c_i^{\mathrm{tot},n} + \beta^{en}\, c_i^{\mathrm{tot},en}. \tag{4}$$

In both equations, superscript $e$ refers to elements and superscript $n$ to nodes. Setting the binary parameters $\alpha^e$, $\alpha^n$, $\beta^e$, $\beta^n$, and $\beta^{en}$ to 0 or 1 allows the cost function to be adjusted to the data dependencies in the application. We now describe the DRAMA model for computational and communication costs in more detail.

3.1.1 Computational costs. The computational costs are based on user-specified operation counts and on time measurements in the application program. A finite element mesh may contain elements of various types. For each element type $u^e$, $\mathrm{nop}^e(u^e)$ denotes the (relative) number of operations. Let $N_i^e(u^e)$ be the number of elements of type $u^e$ in subdomain $i$ allocated to processor $i$, and let $s_i^{\mathrm{calc},e}$ be the (relative) speed of element computations on processor $i$. Then the element-based calculation cost on processor $i$ is given by

$$w_i^{\mathrm{tot},e} = \frac{1}{s_i^{\mathrm{calc},e}} \sum_{u^e} N_i^e(u^e)\, \mathrm{nop}^e(u^e). \tag{5}$$

The relative speed $s_i^{\mathrm{calc},e}$ is either constant or it is derived from a previous time measurement. In the latter case, $s_i^{\mathrm{calc},e}$ is computed from

$$s_i^{\mathrm{calc},e} = \frac{\sum_{u^e} N_i^e(u^e)\, \mathrm{nop}^e(u^e)}{t_i^{\mathrm{calc},e}}, \tag{6}$$

where $t_i^{\mathrm{calc},e}$ is the time needed for all element calculations in subdomain $i$ in the previous phase (iteration or time step). In this way, the cost function can take into account heterogeneous and even non-dedicated machines. The node-based costs are expressed equivalently.

3.1.2 Communication costs. Since it is difficult to measure idle and communication time separately, no time measurements are used to determine the communication costs. We assume that for each processor $i$, the latency $L_{ij}$ and the bandwidth $s_{ij}^{\mathrm{comm}}$ (bytes/second) for communication with processor $j$ are known. Element-based communication costs are then given by

$$c_i^{\mathrm{tot},e} = \sum_{\mathrm{neighbors}\ j} \left[ L_{ij} + \frac{1}{s_{ij}^{\mathrm{comm}}} \sum_{u^e} \sum_{v^e} N_{ij}^e(u^e, v^e)\, \mathrm{noc}^e(u^e, v^e) \right], \tag{7}$$

where $\sum_{\mathrm{neighbors}\ j}$ is a summation over the neighbors of processor $i$, i.e., processors holding neighboring subdomains. Here $\mathrm{noc}^e(u^e, v^e)$ denotes the number of bytes to be transferred per element of type $u^e$ with link type $v^e$, and $N_{ij}^e(u^e, v^e)$ denotes the number of such elements which belong to the interface between subdomains $i$ and $j$. The link type $v^e$ characterizes the specific connection of an element of type $u^e$ to another element. For example, in PAM-CRASH, the number of common nodes between elements determines the link type. The node-based communication costs take the form

$$c_i^{\mathrm{tot},n} = \sum_{\mathrm{neighbors}\ j} \left[ L_{ij} + \frac{N_{ij}^n\, \mathrm{noc}^n}{s_{ij}^{\mathrm{comm}}} \right], \tag{8}$$

where $\mathrm{noc}^n$ is the number of bytes to be transferred per node-node dependency. For purely node-based communication the parameter $\mathrm{noc}^n$ does not depend on a specific node and link type since there is only one way to connect two nodes and the amount of data to be transferred is the same for all nodes. This makes node-based communication costs easier to specify and to handle than element-based communication costs. The element-node-based communication costs are formulated as

$$c_i^{\mathrm{tot},en} = \sum_{\mathrm{neighbors}\ j} \left[ L_{ij} + \frac{(N_{ij}^{en} + N_{ij}^{ne})\, \mathrm{noc}^{en}}{s_{ij}^{\mathrm{comm}}} \right], \tag{9}$$

where $\mathrm{noc}^{en}$ denotes the number of bytes to be transferred in case of an element-node dependency.
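As an illustration of how the pieces of the cost model in Eqs. (1)-(9) fit together, the sketch below evaluates the cost function for a small two-processor example with node-based communication only. It is not part of the DRAMA library: the function names, the dictionary-based data layout, the flag names `d_e`, `d_n`, `g_n` (standing for the binary selection parameters of Eqs. (3) and (4)), the value of 8 bytes per node and the mesh sizes are all assumptions made for illustration. For simplicity, one speed, one latency and one bandwidth are used for all processors.

```python
def calc_cost(counts, nop, speed):
    """Eq. (5): operations summed over all types, divided by the relative speed."""
    return sum(n * nop[u] for u, n in counts.items()) / speed

def speed_from_timing(counts, nop, t_calc):
    """Eq. (6): relative speed derived from the time measured in the previous phase."""
    return sum(n * nop[u] for u, n in counts.items()) / t_calc

def comm_cost(bytes_to, latency, bandwidth):
    """Common form of Eqs. (7)-(9): latency plus volume/bandwidth per neighbor.
    bytes_to maps a neighbor j to the bytes exchanged over the interface with j."""
    return sum(latency + b / bandwidth for b in bytes_to.values())

def processor_cost(proc, nop_e, nop_n, noc_n, flags, speed, latency, bandwidth):
    """Eqs. (2)-(4), restricted here to node-based communication, Eq. (8)."""
    w = (flags["d_e"] * calc_cost(proc["elems"], nop_e, speed)
         + flags["d_n"] * calc_cost(proc["nodes"], nop_n, speed))
    bytes_to = {j: n * noc_n for j, n in proc["shared_nodes"].items()}
    c = flags["g_n"] * comm_cost(bytes_to, latency, bandwidth)
    return w + c

# Hypothetical two-processor distribution: counts per type and shared interface nodes.
procs = [
    {"elems": {"tet": 3000}, "nodes": {"n": 700}, "shared_nodes": {1: 100}},
    {"elems": {"tet": 1000}, "nodes": {"n": 300}, "shared_nodes": {0: 100}},
]
flags = {"d_e": 1, "d_n": 1, "g_n": 1}
costs = [processor_cost(p, {"tet": 160}, {"n": 320}, 8, flags,
                        speed=200e6, latency=17.7e-6, bandwidth=80e6)
         for p in procs]
F_c = max(costs)   # Eq. (1): the slowest processor determines the predicted time
```

In this example the first processor dominates the cost function, so a re-partitioner that minimizes `F_c` would move work from processor 0 to processor 1. Switching a flag to 0 removes the corresponding contribution, which is how the model can be adapted to the data dependencies of an application.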


3.2 Graph based re-partitioning techniques

3.2.1 Mapping the mesh onto a graph. The input to the DRAMA library is the distributed mesh, the cost monitoring parameters $\mathrm{nop}^e(u^e)$, $\mathrm{nop}^n(u^n)$, $\mathrm{noc}^e(u^e, v^e)$, $\mathrm{noc}^n$, $\mathrm{noc}^{en}$ and timings done within the application [2]. In order to call one of the graph re-partitioning libraries (ParMetis or Jostle), this information has to be converted into a (distributed) weighted graph representation, where vertex weights correspond to calculation costs and edge weights correspond to potential communication costs. Since graph partitioning aims at minimizing the weight of the cut edges, it can minimize the communication volume, but not (directly) the other aspects of the communication cost in the DRAMA cost model. Hence constructing the graph from the mesh information amounts to a simplification of the cost function.

For finite element meshes, one often uses the dual graph or element graph, where the graph vertices represent the calculation costs of the elements and vertices are connected by an edge if the corresponding elements share an edge (2D) or a face (3D). However, the DRAMA cost model also takes into account node-based calculations and various types of data dependencies (element-element, node-node and element-node dependencies). Therefore, the DRAMA library also supports other graph representations, i.e. node graphs, extended dual graphs and combined node-element graphs. The latter two graphs are defined as follows.

Extended Dual Graph. In the classical dual graph, graph edges only occur when the elements share an edge (2D) or a face (3D). This graph can be extended with edges between graph vertices whose corresponding elements share one or more nodes. The advantage is that all data dependencies between elements are represented, but the disadvantage is that the extended dual graph can become very complex, especially for 3D meshes. The extended dual graph can be useful if the mesh consists of elements of various dimensions. For example, a PAM-CRASH mesh may contain beams (2 nodes), triangles, quadrilaterals, and bricks (8 nodes). For the mesh shown in Fig. 1, the dual graph fails to represent some important data dependencies.


Fig. 1. An example of a part of a PAM-CRASH mesh with 1 quadrilateral and 2 beam elements. The dual graph does not represent the data dependency between the quadrilateral and beam elements.
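The difference between the dual graph and the extended dual graph can be made concrete with a small sketch. The mesh below is a hypothetical 2D mesh in the spirit of Fig. 1 (one quadrilateral and two beams, with node numbers chosen here for illustration); the function name `element_graph` and the use of a shared-node threshold are assumptions, and the pairwise comparison of all elements is only suitable for tiny examples.

```python
from itertools import combinations

def element_graph(elements, min_shared):
    """Connect two elements if they have at least `min_shared` nodes in common."""
    edges = set()
    for (a, nodes_a), (b, nodes_b) in combinations(elements.items(), 2):
        if len(set(nodes_a) & set(nodes_b)) >= min_shared:
            edges.add((a, b))
    return edges

# Hypothetical mesh: a quadrilateral with a chain of two beams attached to node 2.
mesh = {"quad": (0, 1, 2, 3), "beam1": (2, 4), "beam2": (4, 5)}

# In 2D, sharing a mesh edge means having two nodes in common.
dual = element_graph(mesh, min_shared=2)
# The extended dual graph also connects elements that share a single node.
extended = element_graph(mesh, min_shared=1)
```

For this mesh the dual graph has no edges at all, so the dependency between the quadrilateral and the beams is invisible to the partitioner, whereas the extended dual graph contains the edges quad-beam1 and beam1-beam2.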

Combined Graph. In a combined graph, both elements and nodes are represented as vertices. This gives a good representation of all types of calculation costs. Since finite element applications often use node lists for inter-subdomain communication, only data dependencies between elements and nodes are represented as graph edges. This is a simplified but sufficiently good approximation of the general combined graph that has all kinds of element-element, node-node, and element-node connections. Fig. 2 shows an example of a combined graph.

The graph type should be chosen according to the application requirements and the particular cost function model. This gives the application programmer freedom in approximating the DRAMA cost function to various degrees of accuracy and makes it possible to compute a good (new) partitioning, according to this cost function, using existing graph re-partitioners.

Fig. 2. The combined graph for a triangular mesh (dashed lines). Both the elements and the nodes are represented as graph vertices. Edges represent element-node data dependencies.
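A combined graph of the kind described above can be sketched as follows. This is an illustrative construction, not the DRAMA implementation: the function name `combined_graph`, the tagging of vertices as `("e", k)` or `("n", k)`, and the weights are assumptions. Vertex weights stand for calculation costs, and edges are created only for element-node dependencies.

```python
def combined_graph(elements, nop_e, nop_n):
    """Build a graph with one vertex per element and one per node.

    elements maps an element number to (element type, node numbers).
    Returns (vertex_weight, edges), with edges only between an element
    and the nodes it contains.
    """
    vertex_weight = {}
    edges = []
    for e, (etype, nodes) in elements.items():
        vertex_weight[("e", e)] = nop_e[etype]
        for n in nodes:
            vertex_weight.setdefault(("n", n), nop_n)
            edges.append((("e", e), ("n", n)))
    return vertex_weight, edges

# Two triangles sharing the mesh edge between nodes 1 and 2 (illustrative weights).
tri_mesh = {0: ("tri", (0, 1, 2)), 1: ("tri", (1, 3, 2))}
weights, edges = combined_graph(tri_mesh, nop_e={"tri": 160}, nop_n=16)
```

The example yields 6 vertices (2 elements and 4 nodes) and 6 edges. Compared with the dual graph of the same mesh, which has 2 vertices and 1 edge, this shows why the combined graph needs more memory and partitioning time.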

3.2.2 Reconstruction of the mesh information from the graph. The output of graph partitioners like ParMetis and Jostle is just an array indicating for each graph vertex the processor/partition to which it should be migrated. Note that when the dual graph is partitioned, the output of the graph partitioner only gives a new distribution for the elements of the mesh, and a new distribution for the nodes still has to be determined. When the combined graph is partitioned, a new distribution for both elements and nodes is given. In either case it is still necessary to set up relations between the old and the new partitioning of the mesh, to determine a new numbering of elements and nodes, and to perform the actual data migration. These tasks are performed within the DRAMA library, and the new mesh distribution together with the relations between the old and the new mesh are given back to the application program. The latter relations are necessary because the DRAMA library migrates only the mesh description. The application data (temperature, velocity, etc.) associated with the mesh have to be migrated to their new location by the application code, since the DRAMA library has no knowledge about the data structures used in the application.
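To illustrate the reconstruction step for the dual graph case, the sketch below derives a node distribution from an element partition and lists the elements that change partition. The rule used here, assigning each node to the partition that owns most of its adjacent elements with ties going to the lowest partition number, is an assumption chosen for illustration and not necessarily the rule used in the DRAMA library; the function names and the small mesh are likewise hypothetical.

```python
from collections import Counter

def node_owners(connectivity, elem_part):
    """Assign each node to the partition owning most of its adjacent elements."""
    votes = {}
    for e, nodes in connectivity.items():
        for n in nodes:
            votes.setdefault(n, Counter())[elem_part[e]] += 1
    # Ties are broken in favour of the lowest partition number.
    return {n: min(c, key=lambda p: (-c[p], p)) for n, c in votes.items()}

def moved(old_part, new_part):
    """Items whose partition changed, mapped to (old partition, new partition)."""
    return {x: (old_part[x], new_part[x])
            for x in new_part if old_part[x] != new_part[x]}

connectivity = {0: (0, 1, 2), 1: (1, 3, 2), 2: (3, 4, 2)}
old_elem_part = {0: 0, 1: 0, 2: 0}
new_elem_part = {0: 0, 1: 1, 2: 1}   # e.g. the array returned by a graph partitioner
new_node_part = node_owners(connectivity, new_elem_part)
migrating = moved(old_elem_part, new_elem_part)
```

The dictionary `migrating` plays the role of the old/new relation: it tells the application which elements, and hence which associated application data, have to be sent to another processor.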

4 The DRAMA library and interface

The structure of the library is shown in Fig. 3. The routines above the dashed line are directly available to the user. The library further consists of three main modules, corresponding to different approaches for re-partitioning. The graph module uses graph re-partitioning techniques. As explained in the previous section, first a graph is constructed, then a graph partitioning library is called to determine a new distribution for this graph, and finally the resulting information is translated back into mesh information. The mesh module will contain a direct mesh migration technique, operating directly on the mesh connectivity and the inverse connectivity, without using a graph representation. The coordinate module contains a coordinate-based mesh partitioning method.

The (distributed) mesh used as input and output for DRAMA is described as follows. For each processor a list of elements with their corresponding node numbers has to be specified. For each element, the element type, which determines the associated costs, is also given. Table 1 gives an example.

[Figure: boxes labelled DRAMA, DRAMA Costfun, DRAMA Graph, DRAMA Mesh and DRAMA Coord; routines: graph, mesh, coord, shared, ...]

Fig. 3. Structure of the DRAMA library.

Table 1
Input to DRAMA: mesh specified by element number, element connectivity and element type information. The notation $n^{k}_{(i,j)}$ refers to the $k$th node of the element with the index pair $(i,j)$ = (local element index, processor number). $\hat{N}(u^e)$ is the number of nodes for an element of type $u^e$ and $N_j$ is the number of elements owned by processor $j$. The variable $n^{k}_{(i,j)}$ itself is the index pair (local node index, processor number).

$$
\begin{array}{llll}
\text{proc.} & \text{elements} & \text{element connectivity} & \text{element type} \\
\hline
j & e_{(1,j)} & n^{1}_{(1,j)}, \dots, n^{\hat{N}(u^e_{(1,j)})}_{(1,j)} & u^e_{(1,j)} \\
  & e_{(2,j)} & n^{1}_{(2,j)}, \dots, n^{\hat{N}(u^e_{(2,j)})}_{(2,j)} & u^e_{(2,j)} \\
  & \vdots & \vdots & \vdots \\
  & e_{(N_j,j)} & n^{1}_{(N_j,j)}, \dots, n^{\hat{N}(u^e_{(N_j,j)})}_{(N_j,j)} & u^e_{(N_j,j)}
\end{array}
$$

The DRAMA library returns the newly computed partitioning of the mesh in a distributed manner, in the same format as the input. The relation between the old and the new partitioning of the mesh is also returned as an aid for the migration of the data by the application program (see Table 2).

Table 2
DRAMA output: relation between new and old elements and nodes.

$$
\begin{array}{lll}
\text{element} & e^{\mathrm{new}}_{(n,j)} \ \text{was} \ e^{\mathrm{old}}_{(m,k)} & e^{\mathrm{old}}_{(m,k)} \ \text{now} \ e^{\mathrm{new}}_{(n,j)} \\
\text{node} & n^{\mathrm{new}}_{(n,j)} \ \text{was} \ n^{\mathrm{old}}_{(m,k)} & n^{\mathrm{old}}_{(m,k)} \ \text{now} \ n^{\mathrm{new}}_{(n,j)}
\end{array}
$$

All parameters are passed to and from the library as one-dimensional arrays, although some of them are logically multi-dimensional arrays. The DRAMA library can be linked with FORTRAN or C applications, which typically use different numbering schemes. In FORTRAN applications, arrays usually start from index 1, whereas C applications usually start their numbering from 0. This is handled by the library via a flag indicating the numbering format of the application. A detailed description of the interface can be found in [2].
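The idea of passing a mesh as one-dimensional arrays with a numbering flag can be sketched as follows. This is a hypothetical layout for illustration only and not the DRAMA interface, which is defined in [2]: the function name `flatten_mesh`, the offsets array and the parameter `index_base` are assumptions.

```python
def flatten_mesh(elements, index_base=0):
    """Pack ragged element connectivity into one-dimensional arrays.

    elements: list of (element type, node numbers) with 0-based node numbers.
    index_base: 0 for C-style numbering, 1 for FORTRAN-style numbering.
    Returns (types, offsets, nodes); element k uses nodes[offsets[k]:offsets[k+1]].
    The offsets are positions in the Python list and stay 0-based.
    """
    types, offsets, nodes = [], [0], []
    for etype, conn in elements:
        types.append(etype)
        nodes.extend(n + index_base for n in conn)
        offsets.append(len(nodes))
    return types, offsets, nodes

mesh = [("quad", [0, 1, 2, 3]), ("beam", [2, 4]), ("beam", [4, 5])]
types, offsets, nodes_c = flatten_mesh(mesh, index_base=0)
_, _, nodes_f = flatten_mesh(mesh, index_base=1)
```

Storing an offset per element allows elements with different numbers of nodes (here a quadrilateral and two beams) to share a single connectivity array, while the numbering flag shifts only the node numbers that are exchanged with the application.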

5 Results

Currently, the DRAMA library is being integrated into the FORGE3 and PAM-CRASH software codes. Here, we present some preliminary results. For a typical simulation with FORGE3, the finite element mesh consists only of tetrahedral elements, i.e., all elements are of the same type. As a first test to evaluate the DRAMA library we use a mesh consisting of 231846 tetrahedral elements and 49666 nodes, partitioned into 4 subdomains, that has been re-meshed such that the current partitioning is heavily unbalanced: about half of the mesh resides on one processor.

The parameters for the DRAMA cost function are determined as follows. Each implicit time step of FORGE3 consists of about 1000 iterations of a Conjugate Gradient linear system solver. The implicit solver consists of both element-based and node-based operations. The parameter $\mathrm{nop}^e$ is set to 160 operations and $\mathrm{nop}^n(u^n)$ is set to $16\,\mathrm{nel}^n(u^n)$, where $\mathrm{nel}^n(u^n)$ is the number of elements to which a node of type $u^n$ belongs (the average for $\mathrm{nel}^n$ is 20). The calculation speed is set to $200 \times 10^6$ operations per second for both element-based and node-based calculations. We use the node-based communication model, and the bandwidth and latency (for the NEC Cenju-4) are set to $80 \times 10^6$ bytes per second and $17.7 \times 10^{-6}$ seconds respectively.

The dual graph is constructed, taking into account only the element-based calculation costs (neglecting the node-based calculation costs). ParMetis has been called with various values for the allowed load imbalance (1%, 2%, and 5% imbalance). The re-partitioning results are reported in Table 3.
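As a plausibility check of these parameters, the following sketch estimates the calculation cost per processor for a perfectly balanced distribution, using Eq. (5) for elements and its node-based counterpart. It is only a back-of-the-envelope estimate under stated simplifications: communication costs and the duplication of nodes on subdomain interfaces are ignored, and every node is given the average value of 20 adjacent elements. The variable names are chosen here for illustration.

```python
n_elem, n_node, n_proc = 231846, 49666, 4
nop_e = 160            # operations per element
nop_n = 16 * 20        # 16 * nel^n, using the average nel^n = 20
speed = 200e6          # operations per second

# Calculation cost per processor if elements and nodes were spread evenly.
w_elem = n_elem / n_proc * nop_e / speed
w_node = n_node / n_proc * nop_n / speed
w_total = w_elem + w_node
```

The estimate gives roughly 0.046 s for the element-based part and 0.020 s for the node-based part, about 0.066 s in total per solver iteration, which is close to the average value of the cost function reported in Table 3.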

Table 3
Results for a FORGE3 mesh consisting of 231846 tetrahedral elements and 49666 nodes and divided into 4 subdomains. The initial mesh distribution is heavily unbalanced (about half of the mesh resides on one processor). ParMetis has been called with 1%, 2%, and 5% imbalance.

                             initial    after (1%)   after (2%)   after (5%)
F^c = max F_i^c              0.12622    0.067087     0.067234     0.069120
min F_i^c                    0.03091    0.065276     0.064926     0.062965
average F_i^c                0.06554    0.066213     0.065917     0.065844
imbalance                    92.60%     1.32%        2.00%        4.98%
elements moved               -          60039        56725        52331
nodes moved                  -          11967        11276        10398
re-partitioning time (s)     -          11.05        11.20        11.18
memory used                  39 MB was required for the largest domain

Before and after re-partitioning, we have computed the predicted execution time for one iteration of the implicit CG solver by evaluating the cost function $F^c = \max_{i=0,\dots,p-1} F_i^c$. Note that both the element-based and the node-based calculation costs are taken into account when evaluating the cost function. Also $\min_{i=0,\dots,p-1} F_i^c$, the average of $F_i^c$ over $i=0,\dots,p-1$ and the load imbalance, i.e. the ratio $\max_i F_i^c / \mathrm{average}_i F_i^c$ (reported in Table 3 as the percentage by which this ratio exceeds 1), are presented, together with the number of elements and nodes moved and the re-partitioning time on the NEC Cenju-4 (using 4 processors). The re-partitioning time is the time needed by the DRAMA library, of which about 40% is due to the execution time of ParMetis.

The results show that a nearly perfect re-partitioning is obtained (within the allowed imbalance tolerance), despite the fact that the node-based calculation costs and the communication latency are neglected in the dual graph representation. For this (homogeneous) mesh, it is not necessary to use the combined graph, which would cost much more memory and about 50% more execution time (ParMetis then takes about 80% of the re-partitioning time). Further, the re-partitioning time (about 11 s) is reasonable compared to the time between two calls to the DRAMA library. Indeed, one integration time step (about 1000 iterations of the implicit solver) requires about 67 seconds and re-partitioning is typically done every 20 time steps.
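The figures quoted above can be cross-checked with a few lines of arithmetic on the values of Table 3. The sketch below is an illustration only; the function and variable names are chosen here, and the saving per time step assumes that, without re-partitioning, the mesh would have stayed at its initial imbalance.

```python
def imbalance(f_max, f_avg):
    """Load imbalance as the relative excess of the maximum over the average."""
    return f_max / f_avg - 1.0

initial = imbalance(0.12622, 0.06554)        # before re-partitioning
after_1pct = imbalance(0.067087, 0.066213)   # ParMetis with 1% allowed imbalance

iterations_per_step = 1000
step_time = iterations_per_step * 0.067087   # seconds per time step after re-partitioning
overhead = 11.05 / (20 * step_time)          # re-partitioning once every 20 time steps
saved_per_step = iterations_per_step * (0.12622 - 0.067087)
```

Up to the rounding of the printed values, this reproduces the imbalance of about 92.6% before and 1.32% after re-partitioning and the time step of about 67 seconds. The re-partitioning time amounts to less than 1% of the 20 time steps between two calls, while the predicted saving is about 59 seconds per time step.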

References

[1] A. Basermann, J. Fingberg, and G. Lonsdale, Initial DRAMA Cost Model, DRAMA Project Deliverable D1.1a, available at http://www.cs.kuleuven.ac.be/cwis/research/natw/DRAMA.html, 1998.
[2] A. Basermann, J. Fingberg, and G. Lonsdale, The DRAMA Library Interface Definition, DRAMA Project Deliverable D1.2a, available at http://www.cs.kuleuven.ac.be/cwis/research/natw/DRAMA.html, 1998.
[3] R. Biswas, L. Oliker, and A. Sohn, Global load balancing with parallel mesh adaptation on distributed-memory systems, in Proceedings Supercomputing, 1996, available at http://www.supercomp.org/sc96/proceedings/.
[4] T. Coupez, Parallel adaptive remeshing in 3D moving mesh finite element, in Numerical Grid Generation in Computational Field Simulation, B. Soni et al., eds., Mississippi University, 1 (1996), pp. 783-792.
[5] G. Cybenko, Dynamic load balancing for distributed memory multi-processors, Journal of Parallel and Distributed Computing, 7 (1989), pp. 279-301.
[6] Y. Hu and R. Blake, An optimal dynamic load balancing algorithm, Technical Report DL-P95-011, Daresbury Laboratory, Warrington, UK, 1995.
[7] G. Karypis, K. Schloegel, and V. Kumar, ParMetis: Parallel Graph Partitioning and Sparse Matrix Ordering Library, version 2.0, Technical Report, Dept. of Computer Science, University of Minnesota.
[8] C. Walshaw, M. Cross, and M. Everett, Dynamic load-balancing for parallel adaptive unstructured meshes, in Parallel Processing for Scientific Computing, M. Heath et al., eds., SIAM, Philadelphia, 1997.
