Mesh Partitioning Techniques and Domain Decomposition Methods (Ed. F. Magoules), Saxe-Coburg, Stirling, Scotland, 2007, pp. 119-142.
Basics of the Domain Decomposition Method for Finite Element Analysis

G.P. Nikishkov
University of Aizu, Aizu-Wakamatsu City, Fukushima, Japan
Abstract

An introduction to the domain decomposition method for parallel finite element analysis is presented. The domain decomposition method allows decomposition of a large-size problem solution into solutions of several smaller-size problems. Algorithms of domain partitioning with compute load balancing, as well as direct and iterative solution techniques, are considered.

Keywords: Finite element method, domain decomposition, partitioning, parallel.
1 Introduction

Various applications of the domain decomposition method (DDM) have a long history in computational science. The DDM in the form of substructuring was used in finite element analysis soon after the introduction of the finite element method into engineering practice [1]. The reason for employing the substructuring technique was the small memory of computers. To solve large-scale problems, a structure (domain) was divided into substructures (subdomains) that fit into computer memory. Computer memory grows, but the demand for solution of large-scale problems is always ahead of computer capabilities. Large-scale scientific and engineering simulations performed with the finite element method often require very long computing times. While limited progress can be made by improving numerical algorithms, a radical reduction of computing time can be achieved with multiprocessor computations. In order to perform finite element analysis on a parallel computer, computation should be distributed across processors. Element-wise operations, such as calculation of element matrices, are easy to parallelize. It is more difficult to transform the solution of the global equation system into a parallel procedure.
Simple distribution of arithmetic operations across processors leads to fine-grain parallelism with intensive data communication between processors. Such parallel computations are usually inefficient. A coarse-grain parallel finite element algorithm can be based on the DDM [2]. In the DDM, the finite element domain is divided into subdomains along element boundaries. Subdomain operations are carried out by separate processors without any data flow among them. Interprocessor data communications and computations are then necessary to establish proper subdomain connections. With load balancing, the DDM can be an efficient computational procedure for parallel finite element analysis. The necessary first stage of the DDM finite element procedure is domain partitioning into a specified number of subdomains, which is usually equal to the number of processors. Both direct and iterative solution methods are used in finite element programs. Attractive features of direct solution methods in comparison to iterative algorithms are simplicity, the possibility of predicting the computing time, and the absence of convergence problems. For relatively small problems, direct solution methods require less time than iterative methods; for large problems, iterative methods are more efficient. In this chapter, an introduction to the domain decomposition method for finite element analysis is presented. First, domain partitioning is described. Then the DDM with a direct LDU equation solver is considered. After derivation of the general computational procedure, attention is paid to domain partitioning with compute load balancing. Then the DDM algorithm with iterative solution of the equation system is discussed. Results of several parallel finite element applications illustrate the efficiency of the DDM with direct and iterative solution algorithms.
2 Domain partitioning

A domain partitioning algorithm should produce a subdivision that minimizes the total computing time on a multiprocessor computer; thus the computations should be balanced among processors. In order to minimize the total computing time of the finite element analysis, the following objectives should be fulfilled: minimization of the number of interface nodes, which determines the size of the interface equation system (when subdomain condensation is used) and the amount of data communication; and compute load balancing by assigning different numbers of elements to subdomains. The quality of domain partitioning can considerably affect the computing time of parallel finite element analysis [3]. Numerous algorithms for domain partitioning have been reported in the literature. Some authors consider only the partitioning itself [4, 5, 6, 7]. Publications [8, 9, 10] also address the problem of compute load balancing. Graph methods are widely used for domain partitioning, usually in the form of recursive graph bisection (RGB) [5]. The finite element mesh is represented by a dual diagonal graph. The elements of the mesh compose the graph vertex set. Vertices are connected by an edge if the corresponding finite elements have one or more common nodes.
Figure 1: Example of the dual diagonal graph.

An example of the dual diagonal graph is shown in Fig. 1. The RGB algorithm recursively bisects a graph into two subgraphs. During bisection, the graph diameter is estimated, and graph vertices are separated into two groups according to their distances from the end vertices. In many cases the RGB algorithm produces far from optimal subdomains with “fuzzy” interface boundaries, since only distance information for graph vertices is employed. Here we present the recursive graph labeling (RGL) algorithm for domain partitioning [11]. The RGL algorithm is based on the graph labeling scheme for matrix profile reduction [12]. Both global information (distances from the end vertex) and local information (the degree of the current vertex) are used for labeling graph vertices. The algorithm allows partitioning the domain into subdomains with unequal numbers of elements, as necessary for load balancing. The RGL algorithm produces subdomains with smooth boundaries. This leads to fewer interface nodes and reduced data communication between subdomains. The partitioning process consists of: formation of a distance structure for the graph representing the finite element domain; labeling of graph vertices; and graph division into subgraphs related to subdomains.
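As an illustration, the following C sketch builds the dual diagonal graph for a small hypothetical mesh of four-node elements: two elements become adjacent graph vertices if they share at least one node. The mesh data, array sizes and the dense adjacency matrix are illustrative choices, not taken from the text.

```c
/* Sketch: building the dual diagonal graph for a small mesh.
   Two elements are connected by a graph edge if they share at least
   one node. Mesh data and sizes are illustrative only. */
#include <stdio.h>

#define NEL 4      /* number of elements in the toy mesh        */
#define NNE 4      /* nodes per element (four-node quadrilaterals) */

/* element connectivity: global node numbers of each element */
static const int conn[NEL][NNE] = {
    {0, 1, 4, 3}, {1, 2, 5, 4}, {3, 4, 7, 6}, {4, 5, 8, 7}
};

/* return 1 if elements a and b share at least one node */
static int share_node(int a, int b)
{
    for (int i = 0; i < NNE; i++)
        for (int j = 0; j < NNE; j++)
            if (conn[a][i] == conn[b][j]) return 1;
    return 0;
}

int main(void)
{
    int adj[NEL][NEL] = {{0}};  /* adjacency matrix of the dual graph   */
    int degree[NEL]   = {0};    /* vertex degrees used later as labeling input */

    for (int a = 0; a < NEL; a++)
        for (int b = a + 1; b < NEL; b++)
            if (share_node(a, b)) {
                adj[a][b] = adj[b][a] = 1;
                degree[a]++;
                degree[b]++;
            }

    for (int a = 0; a < NEL; a++) {
        printf("element %d (degree %d):", a, degree[a]);
        for (int b = 0; b < NEL; b++)
            if (adj[a][b]) printf(" %d", b);
        printf("\n");
    }
    return 0;
}
```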
2.1 Graph distance structure

The graph distance structure is an ordering of the graph vertices according to their distances from a specified vertex $s$, for example, the lower right vertex in Fig. 1. The distance structure can be represented as a level structure $L(s) = \{l_0, l_1, \ldots, l_h\}$, where level $l_i$ consists of vertices which have distance $i$ from $s$. The depth of the graph is equal to $h$, and the width of the graph $w$ is equal to the maximum number of vertices at any level $l_i$. The diameter of a graph is the maximum distance over all vertex pairs. The degree of a vertex is the number of edges that connect it to neighboring vertices. In order to determine the graph diameter, the vertex $s$ with the smallest degree is selected as a starting vertex. The graph distance structure $L(s)$ for vertex $s$ is compiled. For the last level $l_{h(s)}$, a list of vertices $Q$, containing one vertex of each degree, is generated.
For each vertex $i$ in $Q$, the distance structure $L(Q_i)$ is built. The graph diameter is assigned the maximum depth over all $h(Q_i)$. The starting vertex $s$ and the end vertex $e$ are vertices at opposite ends of the diameter.
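A minimal C sketch of building the distance (level) structure by breadth-first search is given below. The CSR-style graph arrays and the small example graph are hypothetical; the routine returns the depth of the structure and fills in the level of every vertex.

```c
/* Sketch: level structure L(s) of a graph by breadth-first search.
   The graph is stored in CSR form: neighbors of vertex v are
   adjncy[xadj[v]] .. adjncy[xadj[v+1]-1]. Example data is illustrative. */
#include <stdio.h>

/* returns the depth h of the level structure rooted at s;
   level[v] receives the distance of vertex v from s */
int level_structure(int nv, const int *xadj, const int *adjncy,
                    int s, int *level)
{
    int queue[nv], head = 0, tail = 0, depth = 0;

    for (int v = 0; v < nv; v++) level[v] = -1;   /* -1 = not reached yet */
    level[s] = 0;
    queue[tail++] = s;

    while (head < tail) {
        int v = queue[head++];
        if (level[v] > depth) depth = level[v];
        for (int p = xadj[v]; p < xadj[v + 1]; p++) {
            int w = adjncy[p];
            if (level[w] < 0) {                   /* first visit: next level */
                level[w] = level[v] + 1;
                queue[tail++] = w;
            }
        }
    }
    return depth;
}

int main(void)
{
    /* small example graph: 0-1-2-3 with an extra edge 1-3 */
    int xadj[]   = {0, 1, 4, 6, 8};
    int adjncy[] = {1, 0, 2, 3, 1, 3, 1, 2};
    int level[4];

    int h = level_structure(4, xadj, adjncy, 0, level);
    printf("depth h = %d, levels:", h);
    for (int v = 0; v < 4; v++) printf(" %d", level[v]);
    printf("\n");
    return 0;
}
```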
2.2 Vertex labeling

The graph labeling algorithm is based on the vertex priority, which is a combination of the vertex degree and its distance from the end vertex:
$$
p = W_1 h - W_2 (d + 1),  \tag{1}
$$
where $W_1$ and $W_2$ are weights, $h$ is the distance from the end vertex, and $d$ is the vertex degree. The priority of each vertex is assigned during labeling. Vertices which are labeled are excluded from further consideration. Vertices which are adjacent to labeled vertices are called active. Vertices adjacent to an active vertex, but not active themselves, are called preactive. All the other vertices are inactive. The vertex labeling algorithm can be presented as follows:

    Form the distance structure beginning at the end vertex, L(e)
    Compute initial priorities for all vertices p_i = W1 h_i - W2 (d_i + 1)
    Put the starting vertex into the queue with preactive status
    do while queue is not empty
        Select from the queue vertex i with maximum priority p_i
        Label vertex i with the next available number
        Delete vertex i from the queue
        if vertex i is preactive then
            do for all vertices j adjacent to vertex i
                Increment priority p_j by W2
                if vertex j is inactive then put it into the queue with preactive status
            end do
        end if
        do for all vertices j adjacent to vertex i
            if vertex j is preactive then
                Assign vertex j an active status
                Increment priority p_j by W2
                do for all vertices k adjacent to vertex j
                    Increment priority p_k by W2
                    if vertex k is inactive then put it into the queue with preactive status
                end do
            end if
        end do
    end while
2.3 Division into subdomains

Suppose that the finite element domain should be partitioned into $s$ subdomains with equal numbers of elements. The integer $s$ is represented as a product of prime numbers $s = s_1 \cdot s_2 \cdot \ldots \cdot s_q$, and the graph labeling algorithm is applied $q$ times as shown below:

    Represent the number of subdomains as s = s_1 * s_2 * ... * s_q
    Current number of subdomains N = 1
    do i = 1, q
        do j = 1, N
            Partition subdomain j into s_i subdomains using graph labeling
        end do
        Increment the current number of subdomains N = N * s_i
    end do
Partitioning of a subdomain consisting of $E$ elements into $s_i$ new equal subdomains includes the following steps. Using labeling, the elements are sorted according to their priorities. The first $E/s_i$ elements are assigned to the first new subdomain, the next $E/s_i$ elements are assigned to the second subdomain, and so on. It is not difficult to partition the subdomain into unequal new subdomains with specified numbers of elements.
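The following C fragment sketches this division step: given the element labels produced by a labeling pass, the elements are sorted by label and assigned to new subdomains in contiguous groups of prescribed sizes. The array names and target group sizes are hypothetical.

```c
/* Sketch: assign labeled elements of one subdomain to s_i new subdomains.
   label[e] is the ordering number of element e from the graph labeling
   pass; part[e] receives the new subdomain number of element e. */
#include <stdio.h>
#include <stdlib.h>

static const int *label_ref;   /* used by the qsort comparator */

static int by_label(const void *a, const void *b)
{
    int ea = *(const int *)a, eb = *(const int *)b;
    return label_ref[ea] - label_ref[eb];
}

void split_by_label(int ne, const int *label, int si,
                    const int *target_size, int *part)
{
    int *order = malloc(ne * sizeof(int));
    for (int e = 0; e < ne; e++) order[e] = e;

    label_ref = label;
    qsort(order, ne, sizeof(int), by_label);   /* elements in label order */

    int group = 0, filled = 0;
    for (int k = 0; k < ne; k++) {
        part[order[k]] = group;                /* next element joins current group */
        if (++filled == target_size[group] && group < si - 1) {
            group++;                           /* group is full, start the next one */
            filled = 0;
        }
    }
    free(order);
}

int main(void)
{
    /* 8 elements split into 3 new subdomains of sizes 3, 3, 2 (illustrative) */
    int label[8] = {5, 0, 7, 2, 1, 6, 3, 4};
    int sizes[3] = {3, 3, 2};
    int part[8];

    split_by_label(8, label, 3, sizes, part);
    for (int e = 0; e < 8; e++) printf("element %d -> subdomain %d\n", e, part[e]);
    return 0;
}
```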
3 Domain decomposition method with a direct solver

The DDM with subdomain condensation is called the Schur complement method [1, 2]. The finite element domain is divided into subdomains. For each subdomain, elimination of the inner nodes is performed. The condensed subdomain matrices are assembled into an interface equation system, the solution of which produces the interface displacements. Finally, the inner displacements and other results, such as strains and stresses, are computed.
3.1 DDM algorithm

After division of the computational domain into a number of subdomains, a finite element equation system can be assembled for each subdomain:
$$
[k]\{u\} = \{f\},  \tag{2}
$$
where $[k]$ is a subdomain stiffness matrix, $\{u\}$ is a subdomain displacement vector, and $\{f\}$ is a subdomain load vector. The subdomain nodes are grouped into interior nodes, designated by the subscript $i$, and interface boundary nodes, designated by the subscript $b$. If the interior nodes are numbered first and the interface boundary nodes are numbered last, then the subdomain equation system can be written in the following matrix form:
$$
\begin{bmatrix} k_{ii} & k_{ib} \\ k_{bi} & k_{bb} \end{bmatrix}
\begin{Bmatrix} u_i \\ u_b \end{Bmatrix} =
\begin{Bmatrix} f_i \\ f_b \end{Bmatrix}.  \tag{3}
$$
Matrices $[k_{ii}]$ and $[k_{bb}]$ correspond to the interior and interface (boundary) nodes respectively. Matrix $[k_{ib}]$ reflects the interaction between the interior and boundary nodes. A condensation of the subdomain equation system is made by eliminating the unknowns related to the interior nodes:
$$
[\bar{k}_{bb}]\{u_b\} = \{\bar{f}_b\}, \quad
[\bar{k}_{bb}] = [k_{bb}] - [k_{bi}][k_{ii}]^{-1}[k_{ib}], \quad
\{\bar{f}_b\} = \{f_b\} - [k_{bi}][k_{ii}]^{-1}\{f_i\}.  \tag{4}
$$
The subdomain condensed stiffness matrices $[\bar{k}_{bb}]$ and subdomain surface load vectors $\{\bar{f}_b\}$ are assembled into the interface equation system
$$
[K]\{U\} = \{F\}.  \tag{5}
$$
Solution of the interface system gives the unknown interface displacements $\{U\}$. The interface displacements are disassembled and used for the determination of the interior displacements for each subdomain:
$$
[k_{ii}]\{u_i\} = \{f_i\} - [k_{ib}]\{u_b\}.  \tag{6}
$$
Let us adopt the LDU method for subdomain condensation. The DDM numerical procedure with the LDU subdomain condensation has three computational phases.

(a) Subdomain assembly and condensation:
$$
\begin{aligned}
[k] &= \sum_{el} [k_{el}], \qquad \{f\} = \sum_{el} \{f_{el}\}, \\
[k_{ii}] &= [L][D][U] = [U]^T[D][U], \\
[\bar{k}_{ib}] &= [U]^{-T}[k_{ib}], \\
[\bar{k}_{bb}] &= [k_{bb}] - [\bar{k}_{ib}]^T[D]^{-1}[\bar{k}_{ib}], \\
\{\bar{f}_i\} &= [U]^{-T}\{f_i\}, \\
\{\bar{f}_b\} &= \{f_b\} - [\bar{k}_{ib}]^T[D]^{-1}\{\bar{f}_i\}.
\end{aligned}  \tag{7}
$$

(b) Interface assembly and solution:
$$
[K] = \sum_s [\bar{k}_{bb}], \qquad
\{F\} = \sum_s \{\bar{f}_b\}, \qquad
\{U\} = [K]^{-1}\{F\}.  \tag{8}
$$
(c) Determination of interior displacements:
$$
\{\tilde{f}_i\} = \{\bar{f}_i\} - [\bar{k}_{ib}]\{u_b\}, \qquad
\{u_i\} = [U]^{-1}[D]^{-1}\{\tilde{f}_i\}.  \tag{9}
$$
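For illustration, the following C sketch carries out the condensation phase (7) on small dense blocks: a $U^TDU$ (i.e. $LDL^T$) factorization of $k_{ii}$, forward reduction of $k_{ib}$ and $f_i$, and computation of the condensed matrix and load vector. Dense storage, the block sizes and the toy numbers are illustrative assumptions; a production code would use the banded or profile storage discussed in Section 3.2.

```c
/* Sketch: subdomain condensation (7) with small dense blocks.
   kii (NIxNI, symmetric), kib (NIxNB), kbb (NBxNB), fi, fb form a toy
   subdomain system; the numerical values are illustrative only. */
#include <stdio.h>

#define NI 3   /* interior degrees of freedom  */
#define NB 2   /* interface degrees of freedom */

int main(void)
{
    double kii[NI][NI] = {{4, 1, 0}, {1, 4, 1}, {0, 1, 4}};
    double kib[NI][NB] = {{1, 0}, {0, 1}, {1, 1}};
    double kbb[NB][NB] = {{3, 1}, {1, 3}};
    double fi[NI] = {1, 2, 3}, fb[NB] = {1, 1};

    double L[NI][NI] = {{0}}, d[NI];   /* kii = L D L^T, L unit lower triangular */

    /* LDL^T (= U^T D U) factorization of kii */
    for (int j = 0; j < NI; j++) {
        d[j] = kii[j][j];
        for (int k = 0; k < j; k++) d[j] -= L[j][k] * L[j][k] * d[k];
        L[j][j] = 1.0;
        for (int i = j + 1; i < NI; i++) {
            double s = kii[i][j];
            for (int k = 0; k < j; k++) s -= L[i][k] * L[j][k] * d[k];
            L[i][j] = s / d[j];
        }
    }

    /* forward reduction: kib_bar = U^{-T} kib = L^{-1} kib, fi_bar = L^{-1} fi */
    double kib_bar[NI][NB], fi_bar[NI];
    for (int i = 0; i < NI; i++) {
        for (int c = 0; c < NB; c++) {
            double s = kib[i][c];
            for (int k = 0; k < i; k++) s -= L[i][k] * kib_bar[k][c];
            kib_bar[i][c] = s;
        }
        double s = fi[i];
        for (int k = 0; k < i; k++) s -= L[i][k] * fi_bar[k];
        fi_bar[i] = s;
    }

    /* condensed matrix and load: kbb_bar = kbb - kib_bar^T D^{-1} kib_bar,
       fb_bar = fb - kib_bar^T D^{-1} fi_bar */
    double kbb_bar[NB][NB], fb_bar[NB];
    for (int a = 0; a < NB; a++) {
        for (int b = 0; b < NB; b++) {
            double s = kbb[a][b];
            for (int i = 0; i < NI; i++) s -= kib_bar[i][a] * kib_bar[i][b] / d[i];
            kbb_bar[a][b] = s;
        }
        double s = fb[a];
        for (int i = 0; i < NI; i++) s -= kib_bar[i][a] * fi_bar[i] / d[i];
        fb_bar[a] = s;
    }

    for (int a = 0; a < NB; a++)
        printf("kbb_bar[%d] = %8.4f %8.4f   fb_bar[%d] = %8.4f\n",
               a, kbb_bar[a][0], kbb_bar[a][1], a, fb_bar[a]);
    return 0;
}
```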
Different subdomains are assigned to different processors for parallel computing. The subdomain stiffness matrices and subdomain load vectors are assembled from element stiffness matrices and element load vectors. The subdomain condensation procedure
(7) consists of LDU factorization of the matrix $[k_{ii}]$ associated with the interior degrees of freedom, and of matrix-matrix and matrix-vector multiplications. The subdomain operations of assembly and condensation take most of the computing time and can be done in parallel without any data communication. All parallel tasks should be synchronized at the beginning of the solution of the interface equation system; thus the compute load for the subdomain assembly and condensation should be balanced among processors. Optimized node enumeration inside each subdomain can substantially decrease the operation count for the condensation of the subdomain matrices. The graph labeling algorithm of Section 2, with the necessary modifications, can be used for subdomain node renumbering. In this case the dual diagonal graph is generated for nodes. If a subdomain has a part of the boundary which does not belong to the interface, then the starting node is placed on that part of the boundary. Solution of the interface equation system can be performed with direct or iterative methods. If a direct method is used, then factorization of the interface equation system can be done in parallel [13] using a cyclic distribution of matrix columns across processors. Each column, after modification, is broadcast to all processors. Then each parallel task uses the received column for modification of the columns belonging to this task. The forward solve and backsubstitution phase for the interface system is not well suited for parallelization because of its very low computation to communication ratio. Since the backsolve takes a tiny fraction of the total computing time, it is possible to perform this operation in a serial manner. Finite element operations that follow the determination of displacements (usually stress calculations) are element-oriented and not difficult to parallelize. The distribution of elements across processors during the stress calculations can differ from the subdomain partition used during the previous computational phases.
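A minimal MPI sketch of the column-cyclic interface factorization described above is given below, written here as a right-looking $LDL^T$ factorization of a small dense matrix: the owner of column $j$ finalizes it, broadcasts it, and every process then updates only the trailing columns it owns. The matrix size, the toy data and the storage scheme are illustrative assumptions and not the code of [13].

```c
/* Sketch: parallel LDL^T factorization of a dense symmetric interface
   matrix with cyclic column distribution (column j owned by rank j % np).
   Only the lower triangle a[i][j], i >= j, is used; sizes are illustrative. */
#include <stdio.h>
#include <mpi.h>

#define N 6   /* interface matrix order (toy size) */

int main(int argc, char **argv)
{
    int me, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* every rank holds the full array but factorizes only its own columns;
       a diagonally dominant toy matrix keeps the factorization stable */
    double a[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (i == j) ? 10.0 + i : 1.0 / (1.0 + i + j);

    double col[N];   /* broadcast buffer: d_j and the L entries of column j */

    for (int j = 0; j < N; j++) {
        if (me == j % np) {                /* owner finalizes column j        */
            col[j] = a[j][j];              /* d_j (already fully updated)     */
            for (int i = j + 1; i < N; i++) {
                col[i] = a[i][j] / col[j]; /* L(i,j)                          */
                a[i][j] = col[i];          /* store the factor in place       */
            }
        }
        MPI_Bcast(&col[j], N - j, MPI_DOUBLE, j % np, MPI_COMM_WORLD);

        /* each task updates only the trailing columns it owns:
           a(i,k) -= L(i,j) * d_j * L(k,j)  for k > j, i >= k */
        for (int k = j + 1; k < N; k++) {
            if (k % np != me) continue;
            for (int i = k; i < N; i++)
                a[i][k] -= col[i] * col[j] * col[k];
        }
    }

    if (me == 0) printf("factorization finished, d_0 = %g\n", a[0][0]);
    MPI_Finalize();
    return 0;
}
```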
3.2 Compute load for subdomains

Analyzing the DDM computational procedure, it is possible to conclude that it is important to balance the compute load for subdomain assembly and condensation. Operation count estimates for this computational phase [11, 14] are presented below. The operation count for the assembly of the subdomain equation system is mainly determined by element stiffness calculations and can be estimated as
$$
c = c^{el} e,  \tag{10}
$$
where $c^{el}$ is the operation count for computing the element stiffness matrix and $e$ is the number of elements in the subdomain. To estimate the number of arithmetic operations for the element stiffness matrix calculation, it is possible to use the computing time for element stiffness calculation and the average computer Mflops rate. Fig. 2 presents a simplified structure of the subdomain matrix, composed of the symmetric part of the band matrix $[k_{ii}]$, the symmetric part of the interaction matrix $[k_{ib}]$ with triangular shape, and the symmetric part of the matrix $[k_{bb}]$ corresponding to interface nodes. The matrix structure in Fig. 2 corresponds to a subdomain with a topologically regular mesh and regular node enumeration.
Figure 2: Structure of the subdomain matrix.

In the figure, $n$ denotes the order of the banded block $[k_{ii}]$ and $h$ its halfbandwidth, $m$ is the number of interface unknowns (the order of $[k_{bb}]$), and $m_0$ is the number of nonzero columns of the interaction block $[k_{ib}]$. The inner loop of the LDU factorization of the matrix $[k_{ii}]$ can be described as follows:

    do j = 2, n
        do i = 1, h - 1
            do k = 1, i - 1
                Multiply-add
            end do
        end do
    end do
There are two arithmetic operations, a multiplication and an addition, in the inner k-loop. For the first $h$ columns the operation count is calculated in the same way as for a fully populated matrix, and for the remaining $(n - h)$ columns in the same way as for a purely banded matrix:
$$
c_1 = h^2\left(n - \frac{2}{3}h\right).  \tag{11}
$$
Modification of the interaction matrix $[k_{ib}]$ is a multiple forward reduction of its columns:

    do k = 1, m0
        do i = 1, n - (n/m0) k
            do j = 1, h
                Multiply-add
            end do
        end do
    end do
The operation count can be estimated as:
$$
c_2 = (n - h) h m_0.  \tag{12}
$$
The fill factor $f$ of the interaction matrix $[k_{ib}]$ is calculated as
$$
f = \frac{1}{mn}\sum_{i=1}^{m} h_i,  \tag{13}
$$
where $h_i$ is the height of the $i$th column. For our simplified structure of the matrix $[k_{ib}]$ the fill factor is $f = m_0/(2m)$, so the size $m_0$ is equal to $m_0 = 2mf$ and the operation count $c_2$ is expressed as:
$$
c_2 = 2(n - h) m h f.  \tag{14}
$$
Calculation of the symmetric part of the condensed stiffness matrix $[\bar{k}_{bb}]$ is a multiplication of the transpose of the profile matrix $[\bar{k}_{ib}]$ by itself:

    do j = 1, m
        do i = 1, j
            do k = 1, n - (n/m0) max(i, j)
                Multiply-add
            end do
        end do
    end do
The operation count can be estimated by computing an integral:
$$
c_3 = \int_0^{m_0} \frac{1}{2}\left(n m_0 - \frac{n}{m_0}x^2\right)dx
    = \frac{1}{3} n m_0^2 = \frac{4}{3} n m^2 f^2,  \tag{15}
$$
where $x$ is a column number in the matrix $[k_{ib}]$. Summing the operation counts (10), (11), (14) and (15) yields the total compute load for assembly and condensation of the subdomain stiffness matrix:
$$
C = c^{el} e + h^2\left(n - \frac{2}{3}h\right) + 2(n - h) m h f + \frac{4}{3} n m^2 f^2.  \tag{16}
$$
Partitioning of the finite element domain into subdomains with equal numbers of elements does not lead to interprocessor load balancing, since the quantities $n$, $m$, $h$, and $f$ differ among subdomains because of the subdomain position and subdomain node enumeration. In actual calculations, the structure of the subdomain stiffness matrix is more complicated than the structure shown in Fig. 2. The operation count estimate (16) can be used for irregular subdomains provided that the halfbandwidth $h$ is replaced by its root mean square value:
$$
h = \sqrt{\frac{1}{n}\sum_{i=1}^{n} h_i^2},  \tag{17}
$$
where $h_i$ is the height of the $i$th column of the matrix $[k_{ii}]$.
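The operation count estimates (16) and (17) translate directly into a small C routine; the function names, the column-height data and the parameter values in main are illustrative (the element cost 0.00738·10^6 is the value quoted for two-dimensional four-node elements in Section 3.4).

```c
/* Sketch: compute load estimate (16) for one subdomain, with the
   root-mean-square halfbandwidth (17) of k_ii used in place of h. */
#include <math.h>
#include <stdio.h>

/* RMS halfbandwidth (17): column_height[i] is the height of column i of k_ii */
double rms_halfbandwidth(int n, const double *column_height)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += column_height[i] * column_height[i];
    return sqrt(s / n);
}

/* total assembly + condensation operation count (16) */
double subdomain_ops(double cel, double e, double n, double m,
                     double h, double f)
{
    return cel * e
         + h * h * (n - 2.0 / 3.0 * h)      /* LDU factorization of k_ii  */
         + 2.0 * (n - h) * m * h * f        /* forward reduction of k_ib  */
         + 4.0 / 3.0 * n * m * m * f * f;   /* computation of k_bb_bar    */
}

int main(void)
{
    /* illustrative subdomain parameters, not taken from the text */
    double heights[] = {10, 20, 30, 30, 30, 20, 10, 10};
    double h = rms_halfbandwidth(8, heights);
    double ops = subdomain_ops(0.00738e6, 500.0, 1500.0, 120.0, h, 0.5);
    printf("h_rms = %.1f, estimated operations = %.3e\n", h, ops);
    return 0;
}
```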
3.3 Subdomain load balancing

The first phase of the DDM algorithm is assembly and condensation of the subdomain stiffness matrices. Since the computation time for the subdomain assembly and condensation is the largest fraction of the total solution time and the end of this phase is a synchronization point, the compute load of the subdomain assembly and condensation should be balanced among processors. The load balancing may be achieved by partitioning the domain into subdomains with unequal numbers of elements [11]. For subdomains having varying numbers of elements and similar shapes, the quantities of equation (16) are approximately proportional to:
$$
n \sim e; \qquad m \sim \sqrt{e}; \qquad h \sim \sqrt{e}; \qquad f \approx \mathrm{const}.
$$
The operation count for a subdomain can be expressed through the number of elements:
$$
C = c^{el} e + c^{eq} e^2,  \tag{18}
$$
where $c^{eq}$ is the operation count coefficient related to subdomain condensation. The value of the coefficient $c^{eq}$ is determined by
$$
c^{eq} = \left(h^2 n + 2 n m h f + \frac{4}{3} n m^2 f^2\right) / e_0^2  \tag{19}
$$
for some subdomain with known values of $n$, $h$, $m$, $f$ and number of elements $e_0$. A nonlinear equation system for the load balancing problem can be written as:
$$
\begin{gathered}
C_1 - C_2 = 0, \\
C_2 - C_3 = 0, \\
\cdots \\
C_{s-1} - C_s = 0, \\
\textstyle\sum e_i - E = 0,
\end{gathered}  \tag{20}
$$
or
$$
\begin{gathered}
c^{el} e_i + c^{eq}_i e_i^2 - c^{el} e_{i+1} - c^{eq}_{i+1} e_{i+1}^2 = 0, \quad i = 1 \ldots s-1, \\
\textstyle\sum e_i - E = 0,
\end{gathered}  \tag{21}
$$
where $s$ is the number of subdomains, $e_i$ is the number of elements in the $i$th subdomain, and $E$ is the number of elements in the domain. The nonlinear equation system (21),
$$
F_i(e_1, \ldots, e_s) = 0, \quad i = 1 \ldots s,  \tag{22}
$$
can be solved by the Newton-Raphson iterative procedure:
$$
\begin{gathered}
\{e\}^{(0)} = \{E/s,\ E/s, \ldots\}, \\
\{\Delta e\}^{(i)} = -\left([J]^{(i-1)}\right)^{-1} \{F\}^{(i-1)}, \\
\{e\}^{(i)} = \{e\}^{(i-1)} + \{\Delta e\}^{(i)}.
\end{gathered}  \tag{23}
$$
Here $(i)$ is the iteration number, and the coefficients of the matrix $[J]$ are equal to $J_{ij} = \partial F_i / \partial e_j$:
$$
\begin{gathered}
J_{i1}, \ldots, J_{i\,i-1} = 0, \\
J_{ii} = c^{el} + 2 c^{eq}_i e_i, \\
J_{i\,i+1} = -c^{el} - 2 c^{eq}_{i+1} e_{i+1}, \\
J_{i\,i+2}, \ldots, J_{is} = 0, \\
J_{si} = 1.
\end{gathered}  \tag{24}
$$
Solution of the nonlinear load balancing system (21) predicts the new distribution of elements among subdomains. The algorithm of partitioning with load balancing is described by the following pseudo-code:

    Represent the number of subdomains s as a product of prime numbers
    Specify equal numbers of elements in the subdomains: {e}^(0) = {E/s, E/s, ...}
    while load imbalance >= specified value
        Partition the domain into s subdomains with element counts {e}^(i-1)
            using the recursive graph labeling method
        Optimize subdomain node enumeration with minimization of h and f
        Calculate {e}^(i) by solving the algebraic problem of element
            distribution among subdomains
    end while
Usually it is sufficient to perform 1–3 iterations in order to achieve acceptable load balancing across subdomains.
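A minimal C sketch of this load-balancing iteration is given below. It solves the Newton system (23)-(24) exactly by exploiting the near-bidiagonal structure of $[J]$: rows $1 \ldots s-1$ chain each correction to the next one, and the last row (the element-count constraint) fixes the remaining unknown. The coefficient values $c^{el}$ and $c^{eq}_i$ are illustrative assumptions; in a real partitioner the $c^{eq}_i$ would be re-evaluated from each subdomain after repartitioning.

```c
/* Sketch: Newton-Raphson solution of the load balancing system (21).
   Unknowns e[0..S-1] are (real-valued) element counts per subdomain. */
#include <stdio.h>

#define S 4                       /* number of subdomains (illustrative)      */

int main(void)
{
    double E   = 10000.0;         /* total number of elements                 */
    double cel = 7380.0;          /* operations per element matrix (assumed)  */
    double ceq[S] = {2.0, 2.4, 2.4, 2.0};  /* condensation coefficients (assumed) */
    double e[S];

    for (int i = 0; i < S; i++) e[i] = E / S;     /* {e}^(0) = {E/s, ...} */

    for (int it = 0; it < 5; it++) {
        /* residuals: F_i, i < S-1, balance neighboring subdomains;
           F_{S-1} is the total element count constraint */
        double F[S], de[S], a[S], b[S];
        for (int i = 0; i < S - 1; i++)
            F[i] = cel*e[i] + ceq[i]*e[i]*e[i]
                 - cel*e[i+1] - ceq[i+1]*e[i+1]*e[i+1];
        F[S-1] = -E;
        for (int i = 0; i < S; i++) F[S-1] += e[i];

        /* Newton step J*de = -F. Rows i < S-1 give
           de[i+1] = (-F[i] - J_ii*de[i]) / J_i,i+1, so every de[i] is an
           affine function of de[0]: de[i] = a[i] + b[i]*de[0]. */
        a[0] = 0.0; b[0] = 1.0;
        for (int i = 0; i < S - 1; i++) {
            double Jii  = cel + 2.0*ceq[i]*e[i];
            double Jip1 = -cel - 2.0*ceq[i+1]*e[i+1];
            a[i+1] = (-F[i] - Jii*a[i]) / Jip1;
            b[i+1] = (-Jii*b[i]) / Jip1;
        }
        /* last row: sum of corrections equals -F[S-1], which fixes de[0] */
        double sa = 0.0, sb = 0.0;
        for (int i = 0; i < S; i++) { sa += a[i]; sb += b[i]; }
        de[0] = (-F[S-1] - sa) / sb;
        for (int i = 1; i < S; i++) de[i] = a[i] + b[i]*de[0];

        for (int i = 0; i < S; i++) e[i] += de[i];
    }

    for (int i = 0; i < S; i++)
        printf("subdomain %d: %.0f elements\n", i, e[i]);
    return 0;
}
```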
3.4 Examples

The above algorithm of domain partitioning with load balancing has been implemented as a C routine. Although it is not possible to prove the convergence of the iterative load balancing procedure, one can intuitively expect that the iterative procedure converges if the subdomains undergo small changes in shape and position between successive iterations. To provide similarity of the subdomains during load balancing iterations, the following procedure was introduced into the algorithm. During the initial partitioning, the graph diameter ends for all subdomains are stored. Later, each diameter end is selected as the element that is located at the subdomain boundary and has the minimum distance from the diameter end stored during the first partitioning. In order to “sharpen” subdomain boundaries, a large weight $W_2$ for vertex degrees is used ($W_2 \gg W_1$) in the recursive graph labeling algorithm. Degrees for all interior elements are set to the same value, which is equal to the degree of an interior element in the regular mesh. Here, examples of partitioning of both regular and irregular meshes are presented. Partitions are evaluated in terms of the parallel efficiency of the subdomain assembly and condensation computational phase. The parallel efficiency is calculated as
$$
E_p = \frac{C(\text{sequential, equal subdomains})}{s \cdot C(\text{maximum among subdomains})},
$$
where $s$ is the number of subdomains. The operation counts $C$, both for partitions into subdomains with equal numbers of elements and for optimized partitions, are compared to the operation count of the sequential algorithm for the partition with equal subdomains. Because of this, values of the parallel efficiency larger than 1.0 are possible for optimized subdomains. Two-dimensional four-node elements with $c^{el} = 0.00738 \cdot 10^6$ are used in the examples.

Figure 3: Eight-subdomain primary partition into equal (a) and optimized (b) subdomains.

Figures 3 and 4 show topologically general meshes, selected to demonstrate the performance of the proposed algorithm. The primary partition of the domain with two holes into 8 equal subdomains is presented in Fig. 3a. The optimized partition of this mesh is shown in Fig. 3b. Figures 4a and 4b illustrate the primary and optimized partitions of the domain with one hole into 12 subdomains. Partitioning into subdomains with equal numbers of elements leads to parallel efficiency in the range 0.7–0.8 if the direct LDU algorithm is used for subdomain condensation. The algorithm with compute load balancing radically improves the parallel efficiency of the assembly and condensation phase and provides parallel efficiency close to 1.0. Just 2–3 iterations are necessary to reach balanced mesh partitions.
Figure 4: 12-subdomain primary partition into subdomains with equal numbers of elements (a) and optimized subdomains (b).

The ratio of the maximum and minimum
numbers of elements in optimized subdomains is in the range 1.4–1.5. The dependence of the parallel efficiency of the assembly and condensation on the problem size is demonstrated for nearly square regular meshes of quadrilateral elements. An example of an optimized partition of a mesh consisting of 6006 two-dimensional elements into 16 subdomains is shown in Fig. 5. Just one iteration changed the parallel efficiency from 0.65 to 1.06. Parallel efficiency values of equal and optimized 8- and 16-subdomain partitions for meshes containing from 1056 to 10201 nodes are plotted in Figures 6 and 7. The 16-subdomain partitions are characterized by lower values of efficiency after partitioning into equal subdomains. Optimization with 1–2 iterations in most cases yields values of parallel efficiency that are close to 1.0, and even larger than 1.0 for 16-subdomain partitions. The domain decomposition method with the direct LDU solver was used for the development of a parallel version of an industrial sheet metal forming program [15].
4 Domain decomposition method with an iterative solver

The main problem of direct methods on parallel computers is their poor performance for systems with large numbers of processors (the scalability problem). For large finite element problems (and large numbers of processors), iterative methods are more efficient than direct ones. Various iterative methods for the solution of large systems of equations are discussed in the monographs [16, 17]. In many practical applications, the preconditioned conjugate gradient (PCG) method is used because of its simplicity and efficiency. A simple data distribution scheme for the PCG method is a row-wise distribution of the global stiffness matrix [18]. The rows of each processor succeed one another. The distribution of the vector arrays corresponds to the row distribution of the matrix in a component-wise manner. Such partitioning may be simple, but it can lead to long boundaries between the parts of the mesh assigned to processors. A more efficient approach is based on partitioning the finite element mesh into subdomains using graph partitioning schemes and processing subdomain matrices and vectors on different processors with the necessary data communication. Here an efficient implementation of the parallel PCG method with nonoverlapping domain decomposition for the solution of three-dimensional finite element problems [19] is considered. An algorithm for domain partitioning and algorithms for matrix-vector and vector-vector multiplications for partitioned arrays are presented in the next subsection. Then a parallel procedure of the PCG method for the solution of decomposed finite element problems is described. Performing computations for interior and interface data separately allows overlapped communication and computation.
Figure 5: 16-subdomain optimized partition of a regular mesh.
Figure 6: Parallel efficiency of subdomain assembly and condensation for equal and optimized 8-subdomain partitions.
Figure 7: Parallel efficiency of subdomain assembly and condensation for equal and optimized 16-subdomain partitions.
4.1 Problem partitioning

In order to implement a parallel solution of the finite element problem, both the matrices and the vectors of the finite element model should be divided into parts and distributed across processor nodes. The choice of partitioning is tightly related to data communication between subdomains. Since data communication is usually a critical issue in parallel finite element analysis, we consider matrix-vector and vector-vector operations for partitioned arrays. For nonoverlapping domain decomposition, subdomain boundaries coincide with finite element boundaries and each finite element belongs to one subdomain only. A simple domain divided into four subdomains is shown in Fig. 8. Matrices and vectors can be stored on processor nodes in an accumulated form or in a distributed form. An accumulated matrix or vector contains full entries for both interior and interface nodes. The term ‘distributed’ means that a subdomain matrix or vector contains entries assembled from the contributions of the elements belonging to this subdomain. Entries of a distributed matrix or vector contain full values for the interior nodes and only partial values for the interface nodes. It is possible to demonstrate that, in general, matrix-vector products cannot be calculated with accumulated arrays [20]. Because of this, the global matrices (the global stiffness matrix and the preconditioning matrix) are stored in distributed form. The distributed subdomain stiffness matrix is obtained automatically by assembling the element stiffness matrices for the elements belonging to this subdomain. Both distributed and accumulated storage forms should be used for vectors. The reason for this becomes clear by considering the computation of the vector-vector and matrix-vector products employed inside the iteration procedure of the PCG method.
Figure 8: Domain divided into four subdomains. For the left upper subdomain, the following node groups are shown: i, interior nodes; b, interface (boundary) nodes; eb, external interface nodes. Arrows show data communication for the transformation of a vector from distributed to accumulated form.

Consider the computation of an inner product of vectors $a$ and $b$:
$$
\alpha = a^T b = \sum_i a_i b_i,  \tag{25}
$$
where the subscript $i$ denotes the $i$th entry of the vector. When the vectors $a$ and $b$ are distributed across processors as $a_p$ and $b_p$ ($p$ is a processor index), then the inner product is computed as:
$$
\alpha = \sum_p (a^T b)_p = \sum_p \sum_i (a_i b_i)_p.  \tag{26}
$$
If both vectors are stored in accumulated form, then the quantities $a_i b_i$ for the interface nodes are referenced several times and the result is incorrect. The correct result can be obtained if the intermediate results $a_i b_i$ for interface nodes are divided by the multiplicity factor of each node. The multiplicity factor of a node reflects how many times the node appears in all subdomains. A simple approach to correctly performing the inner product is just to have one vector in accumulated form and the other vector in distributed form. A direct check of equation (26) shows that the result will be correct. It is easy to demonstrate that a matrix-vector product should be performed as a multiplication of a distributed matrix by an accumulated vector. The result of such an operation is a distributed vector. During PCG iterations it is necessary to transform vectors from distributed form to accumulated form. This can be done using interprocessor communication of data, as illustrated in Fig. 8. To transform a vector from distributed form $a$ to accumulated form $\bar{a}$, each processor performs the following steps:
Disassemble the boundary entries $a^b$ into array segments corresponding to the external subdomain boundaries;

Send the disassembled $a^b$ to the neighboring subdomains and receive the disassembled external interface entries $a^{eb}$ from the neighboring subdomains;

Assemble the external interface entries $a^{eb}$ into the distributed vector $a$: $\bar{a} = a + a^{eb}$.

The result is the accumulated vector $\bar{a}$. The send-receive operations are indicated by arrows in Fig. 8 for the left upper subdomain. Element connectivities are used in the assembly and disassembly procedures, which are standard operations in the computational procedure of the finite element method. The position of an element entry in a global vector is determined by the corresponding element connectivity number. If the finite element domain is composed of same-type elements, then decomposition into subdomains with equal numbers of elements provides compute load balancing among processors. For minimization of interprocessor data communication it is desirable to produce subdomains with a minimal interface boundary. The RGL algorithm introduced in Section 2 is quite suitable for domain partitioning when a PCG solver is used. The interface boundary provided by the RGL algorithm usually contains fewer interface nodes than that produced by the RGB algorithm. It is also worth noting that the RGL algorithm allows multisections of the current subdomain beyond simple bisection. This means that the total number of produced subdomains can be a product of any prime numbers, not just a power of two.
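The following MPI sketch (in C) illustrates the two operations just described for one subdomain: transforming a distributed vector into accumulated form by exchanging interface entries with neighboring subdomains, and forming a global inner product of a distributed vector with an accumulated one. The communication map (neighbor ranks, interface node lists) and all array names are hypothetical; a real code would build them from the mesh partition.

```c
/* Sketch: distributed -> accumulated transformation and inner product
   for one subdomain. Assumed (hypothetical) communication map:
     nnb           - number of neighboring subdomains
     nbrank[n]     - MPI rank of neighbor n
     nbn[n]        - number of interface nodes shared with neighbor n
     nblist[n][k]  - local indices of those interface nodes              */
#include <mpi.h>
#include <stdio.h>

#define MAXNB 8
#define MAXBN 64

void accumulate(double *a, int nnb, const int *nbrank, const int *nbn,
                int nblist[][MAXBN], MPI_Comm comm)
{
    double sendbuf[MAXNB][MAXBN], recvbuf[MAXNB][MAXBN];
    MPI_Request req[2 * MAXNB];
    int nreq = 0;

    for (int n = 0; n < nnb; n++) {
        for (int k = 0; k < nbn[n]; k++)          /* disassemble a^b */
            sendbuf[n][k] = a[nblist[n][k]];
        MPI_Isend(sendbuf[n], nbn[n], MPI_DOUBLE, nbrank[n], 0, comm, &req[nreq++]);
        MPI_Irecv(recvbuf[n], nbn[n], MPI_DOUBLE, nbrank[n], 0, comm, &req[nreq++]);
    }
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

    for (int n = 0; n < nnb; n++)                 /* assemble a^eb: a_bar = a + a^eb */
        for (int k = 0; k < nbn[n]; k++)
            a[nblist[n][k]] += recvbuf[n][k];
}

/* inner product of a distributed vector r and an accumulated vector w_bar */
double dot_distributed_accumulated(int n, const double *r, const double *w_bar,
                                   MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n; i++) local += r[i] * w_bar[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int me;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    /* toy run on 2 ranks: 4 local entries, entries 2 and 3 are interface
       nodes shared with the other rank */
    double r[4] = {1, 1, 0.5, 0.5};          /* distributed vector        */
    double w[4] = {1, 1, 0.5, 0.5};          /* accumulated below         */
    int nbrank[1] = {1 - me}, nbn[1] = {2};
    int nblist[1][MAXBN] = {{2, 3}};

    accumulate(w, 1, nbrank, nbn, nblist, MPI_COMM_WORLD);
    double dot = dot_distributed_accumulated(4, r, w, MPI_COMM_WORLD);
    if (me == 0) printf("w[2] = %g, r.w_bar = %g\n", w[2], dot);

    MPI_Finalize();
    return 0;
}
```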
4.2 Algorithm for the parallel PCG method

A global finite element equation system
$$
Ku = f  \tag{27}
$$
has a sparse symmetric positive definite matrix $K$, which relates a load vector $f$ and an unknown displacement vector $u$. It is possible to improve the properties of the equation system and the convergence rate of an iterative method by preconditioning, i.e. by multiplying both sides of equation (27) by a matrix $M^{-1}$, which in some sense is an approximation of $K^{-1}$:
$$
M^{-1} K u = M^{-1} f.  \tag{28}
$$
The simplest form of preconditioning is diagonal preconditioning, in which the matrix $M$ contains only the diagonal entries of the matrix $K$.

4.2.1 Parallel PCG algorithm

The iteration procedure of the PCG algorithm contains two matrix-vector products accompanied by the calculation of several inner products. For partitioned data, the results of matrix-vector products are distributed vectors. A parallel implementation of the PCG algorithm requires interprocessor communication of boundary data to perform the inner product calculations, since distributed vectors should be transformed into accumulated form. Reduction operations are necessary for computing scalar quantities.
Using the procedure presented earlier for the distributed-accumulated vector transformation, a parallel implementation of the preconditioned conjugate gradient method can be presented as follows:

    u_0 = 0
    r_0 = f
    Send r_0^b, receive r_0^eb, r_0 = r_0 + r_0^eb
    do i = 0, 1, ...
        w_i = M^{-1} r_i
        γ_i = r_i^T w_i
        Reduce γ_i
        Send w_i^b, receive w_i^eb, w_i = w_i + w_i^eb
        if i = 0
            p_i = w_i
        else
            p_i = w_i + (γ_i / γ_{i-1}) p_{i-1}
        w_i = K p_i                                        (29)
        β_i = p_i^T w_i
        Reduce β_i
        Send w_i^b, receive w_i^eb, w_i = w_i + w_i^eb
        u_i = u_{i-1} + (γ_i / β_i) p_i
        r_i = r_{i-1} - (γ_i / β_i) w_i
        if γ_i / γ_0 < ε exit
    end do

Here $i$ is the iteration number; $K$ is the equation system matrix (the global stiffness matrix); $f$ is the right-hand side (external load); $u$ is the unknown displacement vector; $M$ is the preconditioning matrix; $r$ is a residual vector; $w$ and $p$ are working vectors; and $\varepsilon$ is a specified error tolerance. Accumulated vectors are marked by a bar: for example, $w$ is a distributed vector and $\bar{w}$ is the same vector in accumulated form. The result of multiplying a distributed matrix by an accumulated vector is a distributed vector. In order to transform the distributed vector into its accumulated form, interface data is communicated between neighboring subdomains and the received interface data is assembled into the distributed vector. Two communications of boundary data between neighboring subdomains are required inside each iteration cycle, after the calculations of the vector $w$. Two reduction operations are necessary for obtaining the total values of the scalars $\gamma$ and $\beta$ from their fractions located on the processor nodes.

4.2.2 Efficient data communication in the PCG algorithm

To increase the efficiency of the parallel PCG algorithm, it is possible to use nonblocking communications for interface nodes and to overlap communication with
computation. An approach to economizing computing time during the PCG iteration procedure is as follows: start the communication; do some computation with data independent of the communicated array; wait for completion of the communication; continue the computation. A parallel PCG algorithm with efficient communications can be presented as follows:
    u_0 = 0
    r_0 = f
    Send r_0^b and receive r_0^eb, r_0 = r_0 + r_0^eb
    do i = 0, 1, ...
        w_i^b = M^{-1} r_i^b
        Start send w_i^b and receive w_i^eb
        w_i^i = M^{-1} r_i^i
        if i > 0    u_{i-1} = u_{i-2} + (γ_{i-1} / β_{i-1}) p_{i-1}
        γ_i = r_i^T w_i
        Reduce γ_i
        Wait for receive w_i^eb, w_i = w_i + w_i^eb
        if i = 0
            p_i = w_i
        else
            p_i = w_i + (γ_i / γ_{i-1}) p_{i-1}
        w_i^b = K p_i^b                                    (30)
        Start send w_i^b and receive w_i^eb
        β_i = p_i^T w_i
        Reduce β_i
        w_i^i = K p_i^i
        Wait for receive w_i^eb, w_i = w_i + w_i^eb
        r_i = r_{i-1} - (γ_i / β_i) w_i
        if γ_i / γ_0 < ε { u_i = u_{i-1} + (γ_i / β_i) p_i; exit }
    end do

Since the interior and interface nodes are separated, communication of the data related to the interface nodes $w^b$ can be overlapped with the computation for the interior nodes $w^i$. At first, the interface entries are disassembled and the send and receive operations are started for the interface data. Then the matrix and vector calculations for the interior nodes are performed. After waiting for completion of the receive operation, the interface data is assembled into the subdomain vector. Finally, the other calculations, which involve the interface data, can be continued. Another possibility to increase algorithm efficiency is to overlap the nonblocking communication with the solution update $u_i = u_{i-1} + (\gamma_i/\beta_i) p_i$. Since the solution vector $u$ is not used in any other computation, its update can be done at any time.
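A C/MPI sketch of this overlap pattern is shown below for one matrix-vector-type operation: the interface part of the result is computed and sent first with nonblocking calls, the interior part is computed while the messages are in flight, and the received external interface contributions are assembled at the end. The splitting of nodes into interface and interior blocks and all array names are illustrative assumptions.

```c
/* Sketch: overlapping interface communication with interior computation.
   Assumed local vector layout: entries 0..nb-1 are interface nodes,
   entries nb..nb+ni-1 are interior nodes. compute_part() stands for the
   local matrix-vector (or preconditioning) work on a block of entries. */
#include <mpi.h>
#include <stdio.h>

static void compute_part(double *w, const double *p, int first, int last)
{
    for (int i = first; i < last; i++)   /* placeholder for K*p on these rows */
        w[i] = 2.0 * p[i];
}

void overlapped_update(double *w, const double *p, int nb, int ni,
                       int neighbor, MPI_Comm comm)
{
    double recvbuf[nb];                  /* nb > 0 assumed */
    MPI_Request req[2];

    /* 1. interface (boundary) part first, then start the exchange */
    compute_part(w, p, 0, nb);
    MPI_Isend(w, nb, MPI_DOUBLE, neighbor, 0, comm, &req[0]);
    MPI_Irecv(recvbuf, nb, MPI_DOUBLE, neighbor, 0, comm, &req[1]);

    /* 2. interior part is computed while the messages are in flight */
    compute_part(w, p, nb, nb + ni);

    /* 3. wait and assemble the external interface contributions */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    for (int i = 0; i < nb; i++) w[i] += recvbuf[i];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int me;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    /* toy two-subdomain run: 2 interface and 3 interior entries per rank */
    double p[5] = {1, 1, 1, 1, 1}, w[5];
    overlapped_update(w, p, 2, 3, 1 - me, MPI_COMM_WORLD);
    if (me == 0) printf("w[0] = %g\n", w[0]);   /* own 2.0 + neighbor 2.0 */

    MPI_Finalize();
    return 0;
}
```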
Figure 9: Fixed size speedup of the parallel PCG algorithm for a mesh of 4096 20-node elements.
4.3 Examples

Sequential and parallel finite element routines based on the domain decomposition method with the PCG equation solver have been developed in the C programming language. The programs were run on an IBM SP2, using MPI (the Message Passing Interface) to execute the parallel code. Fixed size speedup was investigated for a three-dimensional mesh consisting of 4096 20-node hexahedral elements (16×16×16) and 56355 degrees of freedom. Four divisions into subdomains were used: 4 subdomains = 2×2×1; 8 subdomains = 2×2×2; 16 subdomains = 4×2×2; 32 subdomains = 4×4×2. Performance results of the PCG method with overlapped communication and computation (30) for the problem of 4096 3D 20-node elements are plotted in Fig. 9. The parallel efficiency varies from 0.92 for the 4-subdomain partitioning to 0.63 for the 32-subdomain partitioning. The lower parallel efficiency for larger numbers of processors is related to the small size of the problem solved on each processor node. For example, just 128 elements belong to each processor in the case of the 32-subdomain partitioning. Individual meshes with element numbers ranging from 1000 to 48000 were used to determine the scaled speedup of the parallel PCG algorithm using up to 48 processors. In addition to the four above-mentioned partitions, a partition into 48 subdomains
(4×4×3) was generated. Scaled speedup values corresponding to 1000 quadratic 20-node elements per processor are presented in Fig. 10. The results on scaled parallel speedup are quite satisfactory: the parallel efficiency equals 0.86 for the 48-subdomain partitioning.

Figure 10: Scaled speedup of the parallel PCG algorithm for meshes with 1000 20-node elements per processor.
5 Conclusion

The domain decomposition method is a useful technique for partitioning a finite element problem into smaller subproblems, which can be executed on separate processor nodes. The efficiency of parallel computations depends on the domain partitioning. A domain partitioning algorithm should produce subdomains with a minimal number of interface nodes and with compute load balancing across processors. When a direct solution method is applied, partitioning of a domain into subdomains with equal numbers of elements leads to significant load imbalance among processors. In this chapter, the recursive graph labeling algorithm is used for the distribution of elements among subdomains. Compute load balancing is achieved by solving an algebraic problem to find subdomains requiring the same operation count. The partitioning algorithm is able to produce balanced subdomain divisions in 1-3 iterations. An algorithm of the preconditioned conjugate gradient method based on domain decomposition with nonoverlapping subdomains is presented. The algorithm formulation contains local vectors in distributed and accumulated forms. Division of a computational domain consisting of same-type elements into subdomains with equal numbers of elements provides compute load balancing across processor nodes. The
efficient implementation of the parallel PCG algorithm uses nonblocking communications for interface nodes and overlapping of communication with computation.
References

[1] J.S. Przemieniecki, “Matrix structural analysis of substructures”, AIAA Journal, 1, 138-147, 1963.

[2] I. Babuška and H.C. Elman, “Some aspects of parallel implementation of the finite-element method on message passing architectures”, Journal of Computational and Applied Mathematics, 27, 157-187, 1989.

[3] K. Schloegel, G. Karypis and V. Kumar, “Graph partitioning for high performance scientific simulations”, CRPC Parallel Computing Handbook, Morgan Kaufmann, 2001.

[4] C. Farhat, “A simple and efficient automatic FEM domain decomposer”, Computers and Structures, 28, 579-602, 1988.

[5] H.D. Simon, “Partitioning of unstructured problems for parallel processing”, Computing Systems in Engineering, 2, 135-148, 1991.

[6] Y.F. Hu and R.J. Blake, “Numerical experiences with partitioning of unstructured meshes”, Parallel Computing, 20, 815-829, 1994.

[7] S. Gupta and M.R. Ramirez, “A mapping algorithm for domain decomposition in massively parallel finite element analysis”, Computing Systems in Engineering, 6, 111-150, 1995.

[8] C. Farhat, N. Maman and G.W. Brown, “Mesh partitioning for implicit computations via iterative domain decomposition: impact and optimization of the subdomain aspect ratio”, International Journal for Numerical Methods in Engineering, 38, 989-1000, 1995.

[9] D. Vanderstraeten and R. Keunings, “Optimized partitioning of unstructured finite element meshes”, International Journal for Numerical Methods in Engineering, 38, 433-450, 1995.

[10] C.H. Walshaw, M. Cross and M.G. Everett, “A localized algorithm for optimizing unstructured mesh partitions”, International Journal for Supercomputer Applications, 9, 280-295, 1995.

[11] G.P. Nikishkov, A. Makinouchi, G. Yagawa and S. Yoshimura, “An algorithm for domain partitioning with load balancing”, Engineering Computations, 16, 120-135, 1999.

[12] S.W. Sloan, “A FORTRAN program for profile and wavefront reduction”, International Journal for Numerical Methods in Engineering, 28, 2651-2679, 1989.

[13] C. Farhat and E. Wilson, “A parallel active column equation solver”, Computers and Structures, 28, 289-304, 1988.

[14] G.P. Nikishkov, A. Makinouchi, G. Yagawa and S. Yoshimura, “Performance study of the domain decomposition method with direct equation solver for parallel finite element analysis”, Computational Mechanics, 19, 84-93, 1996.

[15] G.P. Nikishkov, M. Kawka, A. Makinouchi, G. Yagawa and S. Yoshimura, “Porting an industrial sheet metal forming code to a distributed memory parallel computer”, Computers and Structures, 67, 439-449, 1998.

[16] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing, Boston, 1996, 447 pp.

[17] O. Axelsson, Iterative Solution Methods, Cambridge University Press, 1996, 654 pp.

[18] A. Basermann, B. Reichel and C. Schelthoff, “Preconditioned CG methods for sparse matrices on massively parallel machines”, Parallel Computing, 23, 381-398, 1997.

[19] G.P. Nikishkov and A. Makinouchi, “Parallel implementation of the PCG method with nonoverlapping finite element domain decomposition”, Parallel and Distributed Computing Systems, ISCA 12th Int. Conf., Fort Lauderdale, FL, USA, Aug. 18-20, 1999 (Ed. S. Olariu and J. Wu), ISCA, 540-545, 1999.

[20] G. Haase, “New matrix-by-vector multiplications based on nonoverlapping domain decomposition data distribution”, Lecture Notes in Computer Science, 1300, 726-733, 1997.