Nonlinear Finite Element Problems on Parallel Computers

L. Grosz, C. Roll and W. Schönauer
Numerikforschung für Supercomputer, Computing Center of the University of Karlsruhe, D-76128 Karlsruhe, Germany
e-mail: [email protected]

Abstract
VECFEM is a black-box solver for a large class of nonlinear functional equations by finite element methods. It uses very robust solution methods for the linear FEM problems to compute the Newton-Raphson correction and the error indicator reliably. The kernel algorithms are conjugate gradient methods (CG) for the solution of the linear systems. In this paper we present optimal data structures on parallel computers for the matrix-vector multiplication, which is the key operation of the CG iteration, the principles of the element distribution onto the processors, and the mounting of the global matrix over all processors as a transformation between these optimal data structures. VECFEM is portably implemented for message passing systems. Two examples with structured and unstructured grids show the efficiency of the data structures.
Keywords: parallel computers, finite element method, black-box solver
The solution of nonlinear functional equations, e.g. arising from elliptic and parabolic partial differential equations, allows the simulation of physical phenomena by computers. A black-box solver [13, 4] makes it possible for the user to compute approximate solutions reliably and efficiently without costly program development. To cover a wide range of applications, robust algorithms are required. Additionally, the solution has to be computed in a reasonable computing time, so that in many cases the use of supercomputers is necessary. Therefore the data structures as well as the algorithms have to be optimized for the requirements of parallel and vector computers. In this paper we present a concept for the implementation of a robust solver for nonlinear functional equations on parallel computers using finite element methods (FEM) with unstructured grids. The implementation is based on the FEM black-box solver VECFEM [5, 4], which was originally developed for vector computers [6, 7]. The new version uses optimal data structures for parallel processing and, on every processor, optimal data structures for vector processing. Portability of the VECFEM program package is ensured by using standard FORTRAN 77. For the communication VECFEM uses its own minimal message passing interface, which can be adapted to all message passing systems. It contains six routines: begin and end of communication, send and its corresponding wait, and receive and its corresponding wait. Both locally non-blocking communication (e.g. Intel NX/2 [9]) and locally blocking communication (e.g. PVM [2]) are supported by this interface, but the VECFEM implementation always assumes locally non-blocking communication. The processor topology is assumed to be a nearest-neighbor ring, which can be embedded into many existing processor topologies.
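As a minimal sketch of how such a six-routine layer could look, the following wrapper maps the interface onto a modern message passing system. The routine names (msg_begin, msg_send, ...) and the mpi4py backend are illustrative assumptions only; VECFEM's actual interface is implemented in FORTRAN 77 on top of systems such as NX/2 or PVM.

```python
# Illustrative sketch of a minimal six-routine message passing layer on a
# nearest-neighbor ring.  Names and the mpi4py backend are assumptions.
from mpi4py import MPI


class RingComm:
    """Nearest-neighbor ring on top of MPI (stand-in backend)."""

    def msg_begin(self):                  # 1: begin of communication
        self.comm = MPI.COMM_WORLD
        self.rank = self.comm.Get_rank()
        self.np = self.comm.Get_size()
        self.right = (self.rank + 1) % self.np
        self.left = (self.rank - 1) % self.np

    def msg_end(self):                    # 2: end of communication
        self.comm.Barrier()

    def msg_send(self, buf, dest):        # 3: locally non-blocking send
        return self.comm.Isend(buf, dest=dest)

    def msg_send_wait(self, request):     # 4: wait for send completion
        request.Wait()

    def msg_recv(self, buf, source):      # 5: locally non-blocking receive
        return self.comm.Irecv(buf, source=source)

    def msg_recv_wait(self, request):     # 6: wait for receive completion
        request.Wait()
```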
Figure 1: Distribution of the FEM nodes onto four processors.
1 The Linear Equation Solver

The discretized nonlinear functional equation is solved iteratively by the Newton-Raphson method. Finally the discretization error of the computed approximation is checked by the calculation of an error indicator [5]. Both problems can be interpreted as the solution of a general linear functional equation by finite elements, so we will look at an efficient implementation of this problem on parallel computers. As a black-box solver VECFEM needs highly robust solution methods. In particular the numerical stability and the convergence behavior should not depend on the number of processors, the distribution of the elements onto the processors or the numbering of the global nodes. Therefore VECFEM does not use methods of the domain decomposition type but mounts a global matrix over all processors. The global system of linear equations is very large and very sparse and can only be solved by iterative methods. Generalized conjugate gradient methods (CG) [12] have proved to be the most robust and efficient iterative methods, and therefore they are the solution methods of LINSOL [11], which is the linear equation solver of VECFEM as well as of the finite difference program package FIDISOL/CADSOL [10]. To find the best method in the family of CG methods for a given matrix, LINSOL uses a polyalgorithm, which switches to a more slowly converging but more robust method in the case of non-convergence. A smoothing algorithm removes the heavy oscillations during the CG iteration. In addition to the standard operations 'linked triad' and 'scalar product', the matrix-vector multiplication with the sparse global matrix is the key operation, and therefore optimal data structures for parallel processing have to be used especially for this operation.
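A minimal sketch of a plain CG iteration (not LINSOL's generalized polyalgorithm or its smoothing) makes visible why these three kernels dominate: every step consists of one matrix-vector multiplication, a few scalar products and a few linked triads (vector updates of the form y = y + alpha*x).

```python
# Plain CG for a symmetric positive definite system, shown only to expose
# the kernel operations; LINSOL's generalized CG methods are more involved.
import numpy as np

def cg(matvec, b, x0, tol=1e-8, maxit=1000):
    x = x0.copy()
    r = b - matvec(x)            # kernel 1: matrix-vector multiplication
    p = r.copy()
    rho = np.dot(r, r)           # kernel 2: scalar product
    for _ in range(maxit):
        Ap = matvec(p)           # kernel 1
        alpha = rho / np.dot(p, Ap)
        x += alpha * p           # kernel 3: linked triad
        r -= alpha * Ap          # kernel 3
        rho_new = np.dot(r, r)
        if np.sqrt(rho_new) < tol:
            break
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x
```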
2 The Matrix-Vector Multiplication
The global matrix $A \in \mathbb{R}^{n \times n}$ is distributed to the $np$ processors by the following scheme: The rows of $A$ are split into blocks of $n/np$ consecutive rows (here $n/np$ may be an integer, but this is not necessary). Corresponding to their succession, the row blocks are distributed to the processors. In the example of $np = 4$ processors and $n = 16$ unknowns shown in Figure 2, the rows 1 to 4 are put onto processor P1, the rows 5 to 8 onto processor P2, the rows 9 to 12 onto processor P3 and the rows 13 to 16 onto processor P4. Analogously to the physical subdivision of the rows, the columns are split into column blocks, so that the global matrix is subdivided into submatrices $A_{p,q}$:

$$A = (A_{p,q})_{p=1,\dots,np;\ q=1,\dots,np} , \qquad (1)$$

where every submatrix $A_{p,q}$ is again a sparse matrix or the zero matrix.
Figure 2: Physical row block distribution and logical column block subdivision of the global matrix.
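As a small sketch of the consecutive row-block scheme described above, the row range owned by processor $p$ can be computed as follows. The function name and the convention of giving the remainder rows to the first processors are illustrative choices, not taken from VECFEM.

```python
# Row-block partition of n rows onto np processors (numbered 1..np);
# when n is not divisible by np, the first n % np processors get one
# extra row.  This convention is illustrative only.
def row_block(p, n, np_):
    base, rest = divmod(n, np_)
    first = (p - 1) * base + min(p - 1, rest) + 1
    size = base + (1 if p <= rest else 0)
    return first, first + size - 1          # inclusive row range

# Example of Figure 2: n = 16 unknowns on np = 4 processors.
assert [row_block(p, 16, 4) for p in (1, 2, 3, 4)] == \
       [(1, 4), (5, 8), (9, 12), (13, 16)]
```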
This distribution of the global matrix to the processors corresponds to the conception of a distribution of the nodes of the FEM mesh. Figure 1 illustrates this for the matrix in Figure 2. The block matrix $A_{p,q}$ represents the coupling of the nodes on processor $p$ to the nodes on processor $q$. Therefore, in the example of Figure 1, only the blocks $A_{p,p-1}$, $A_{p,p}$ and $A_{p,p+1}$ are nonzero on processor $p$. This fact is used in the matrix-vector multiplication.

Figure 3: Scheme of the matrix-vector multiplication.

For the matrix-vector multiplication $b = b + Ax$ we assume that the input vector $x$ and the output vector $b$ are distributed to the processors corresponding to the physical distribution of the global matrix, i.e. the first $n/np$ elements of $x$ and $b$ are on processor P1, the next $n/np$ elements are on processor P2, etc. The matrix-vector multiplication runs in $np$ cycles: At the beginning of a cycle every processor starts the sending of the current input vector portion to its right neighbor processor and the receiving of a new input vector portion from its left neighbor processor. Then the multiplication of the matrix block that belongs to the current input vector portion is executed and the result is added to the output vector portion. For the matrix blocks the diagonal storage scheme with packed and unpacked diagonals is used; this is the optimal storage pattern especially for vector processors, but also for scalar processors [6]. A synchronization ensures that the next input vector portion is available for the next cycle. In this scheme the communication is hidden as far as possible behind the multiplication of the matrix blocks, but it is necessary to use an alternating buffer technique. As shown in Figure 3, in every cycle a block diagonal in the upper triangle of $A$ and its extension in the lower triangle is processed: cycle 1 processes the main diagonal $A_{1,1}, A_{2,2}, \dots, A_{np,np}$; cycle 2 processes the first diagonal $A_{1,2}, A_{2,3}, \dots, A_{np-1,np}$ completed by $A_{np,1}$; etc. If all block matrices in an extended diagonal are equal to zero, this diagonal can be skipped in the calculation scheme. Then the current input vector portion is not sent to the right neighbor but to the next but one or an even more distant neighbor. Thereby the actual number of communications can be minimized, but then we no longer have a nearest-neighbor communication. In the following we call this the optimized communication path. A bandwidth-minimizing numbering of the FEM nodes [3] produces a very small number of actual communications; for two-dimensional problems 3 actual cycles, independent of the number of unknowns, are typical. In the example in Figure 1 the third block diagonal $(A_{1,3}, A_{2,4}, A_{3,1}, A_{4,2})$ contains only zero blocks and therefore the third cycle is skipped.

Figure 4: Allowed processors for elements.
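The following is a sketch of the cyclic scheme under simplifying assumptions: dense NumPy blocks of equal size stand in for VECFEM's packed diagonal storage, mpi4py for the FORTRAN 77 message passing layer, a plain nearest-neighbor ring is used (the skipping of zero block diagonals, i.e. the optimized communication path, is omitted), and the cycle/index convention is an illustrative choice.

```python
# Sketch of the cyclic matrix-vector multiplication b = b + A*x on a ring,
# with alternating buffers so that communication overlaps the local block
# multiplication.
import numpy as np
from mpi4py import MPI

def ring_matvec(blocks, x_local, b_local, comm):
    """blocks[q] is the block A_{p,q} of this processor p (None if zero);
    all vector portions are assumed to have equal length."""
    rank, np_ = comm.Get_rank(), comm.Get_size()
    right, left = (rank + 1) % np_, (rank - 1) % np_
    cur = x_local.copy()                     # input vector portion held now
    nxt = np.empty_like(x_local)             # alternating receive buffer
    q = rank                                 # column block matching 'cur'
    for cycle in range(np_):
        if cycle < np_ - 1:                  # start communication early ...
            reqs = [comm.Isend(cur, dest=right),
                    comm.Irecv(nxt, source=left)]
        if blocks[q] is not None:            # ... and hide it behind the
            b_local += blocks[q] @ cur       #     local block multiplication
        if cycle < np_ - 1:
            MPI.Request.Waitall(reqs)        # synchronization for next cycle
            cur, nxt = nxt, cur              # swap alternating buffers
            q = (q - 1) % np_                # portion received from the left
    return b_local
```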
3 Mounting

The mounting of the global matrix marks an important point in the data flow. It transforms the data structures optimized for the calculation of the element matrices (see below) into the data structures optimized for the linear solver. Since storing and processing a single element on many processors costs a lot of overhead, especially for three-dimensional problems, the global matrix is mounted with communication instead of using element overlap. If the matrix block $A_{p,q}$ is nonzero, there is an element which contains an FEM node assigned to processor $p$ and one assigned to processor $q$. So it does not matter whether this element is stored on processor $p$ or on processor $q$. In order to add the element matrix of this element correctly, the matrix stripe of processor $p$ has to pass processor $q$, and the matrix stripe of processor $q$ has to pass processor $p$, during their sending around. Consequently the optimized communication path introduced for the matrix-vector multiplication is also optimal for the sending around of the matrix stripes during the mounting. A suitable element distribution ensures that every element gets the chance to add all its contributions to the appropriate matrix stripes. In Figure 4 the element described by the nodes 6, 9 and 10 may be stored on processor P2 or P3, since this element contributes to the matrix stripes of the processors P2 and P3 and both matrix stripes pass processor P2 as well as processor P3 during the sending around. If this element were stored on processor P1, it could not add its contributions to the matrix stripe of processor P3. Surprisingly, elements with nodes on only one processor have a larger range of allowed processors than elements with nodes on different processors (see, e.g., the element described by the nodes 10, 11 and 12 in Figure 4).

Since the matrix blocks are sparse, the element matrices are added to the packed matrix blocks. This reduces the message lengths in the communication and the storage amount, although an additional integer array for the addresses of the element matrix entries in the packed matrix has to be allocated. Additionally this strategy saves the very expensive creation of the matrix storage pattern in every Newton-Raphson step; it has to be established only once at the beginning of the iteration.

Figure 5: Marking of the matrix shapes.

The mounting procedure runs as follows: In the first step the shapes of the matrix blocks have to be marked. Every processor $p$ initializes an integer array $A_p$ representing its unpacked, nonzero matrix blocks. This mask is sent around following the optimized communication path. Since the masks are bit sequences with a small number of ones, they are transformed into index vectors for the sending to reduce the message lengths. In every cycle the processors mark the additional entries contributed by their elements to the current matrix stripe, see Figure 5. Finally every processor has the shape mask of its matrix blocks and can create the storage patterns without communication [6, 7]. At the end every entry of the integer array $A_p$ contains its address in the packed matrix block. In order to notify all elements of the addresses of their element matrix entries in the packed matrix, these addresses are distributed to the elements by an additional sending around of the integer array $A_p$, which can also be done in indexed form to reduce the message lengths. In every cycle all elements gather the new addresses of their contributions to the current matrix stripe. Now, for every Newton-Raphson step, the arrays of the packed matrix blocks are sent around following the optimized communication path, and every processor adds the contributions of its element matrices to the current stripe using the precomputed addresses.
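A sketch of the per-Newton-step part of this procedure (the actual mounting, after the shape marking and the address pass have already been done) could look as follows. The data layout, the use of a plain nearest-neighbor ring instead of the optimized communication path, the assumption of equal stripe lengths, and mpi4py are illustrative stand-ins for the VECFEM implementation.

```python
# Sketch of mounting the packed matrix stripes per Newton-Raphson step:
# every stripe travels once around the ring and each processor scatters
# its element contributions into the stripe it currently holds, using
# addresses precomputed in the shape-marking phase.
import numpy as np
from mpi4py import MPI

def mount_stripes(my_stripe, contributions, comm):
    """my_stripe: packed matrix block array owned by this processor.
    contributions[owner] = (addresses, values): entries this processor's
    elements add to the stripe of processor 'owner' (precomputed).
    All stripes are assumed to have equal length."""
    rank, np_ = comm.Get_rank(), comm.Get_size()
    right, left = (rank + 1) % np_, (rank - 1) % np_
    stripe = my_stripe.copy()
    owner = rank                              # whose stripe we hold now
    for _ in range(np_):
        if owner in contributions:            # scatter-add this processor's
            addr, vals = contributions[owner] # element matrix entries
            np.add.at(stripe, addr, vals)
        nxt = np.empty_like(stripe)           # pass the stripe on
        reqs = [comm.Isend(stripe, dest=right),
                comm.Irecv(nxt, source=left)]
        MPI.Request.Waitall(reqs)
        stripe, owner = nxt, (owner - 1) % np_
    return stripe                             # back home, fully mounted
```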
Figure 6: Element distribution onto 5 processors; elements colored with the same gray value are on the same processor.
4 Element Distribution
The coordinates of the nodes referred to by any element on processor $p$ and the key list for the description of these elements are made available on processor $p$ before the Newton-Raphson iteration starts. In every Newton-Raphson step the element matrices can then be calculated without any communication, after the current solution approximation has been distributed onto the elements. So optimal parallelization can be reached in this very expensive program part. As discussed above, the first criterion for the element distribution has to ensure that all element matrices can be added to the matrix stripes while the matrix is sent around following the optimized communication path. So we get minimal communication during the mounting procedure. But this criterion leaves enough freedom to introduce a second criterion, which is load balancing in the calculation of the element matrices. The computational amount for one element matrix can be estimated by $q_t g_t^2$, where $q_t$ is the number of integration points in the element matrix calculation and $g_t$ is the number of nodes describing the elements of type $t$. So we can estimate the amount on processor $p$ by $\sum_t e^{(p)}(t)\, q_t\, g_t^2$, where $e^{(p)}(t)$ is the number of elements of type $t$ on processor $p$. The elements are distributed so that the estimated computational amount for every processor is approximately equal to the mean value of the computational amount over all processors (a sketch of such a distribution is given at the end of this section):

$$\sum_t e^{(p)}(t)\, q_t\, g_t^2 \;\approx\; \frac{1}{np} \sum_t e(t)\, q_t\, g_t^2 , \qquad (2)$$

where $e(t)$ is the total number of elements of type $t$. Figure 6 shows an example of the distribution of a mesh onto 5 processors. The gray value represents the id number of the processor. We point out that the regions of elements belonging to the same processor do not form subdomains with borders or small contact regions, but overlap each other extensively.
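A minimal sketch of such a two-criteria distribution, under simplifying assumptions, could look like this: the set of allowed processors per element (first criterion, derived from the optimized communication path) is taken as given, and a simple greedy balancing replaces VECFEM's actual distribution strategy.

```python
# Greedy element distribution: every element may only go to one of its
# allowed processors; among those, it is assigned to the processor with
# the currently smallest estimated load, using the cost estimate
# q_t * g_t**2 of eq. (2).  This greedy rule is illustrative only.
def distribute_elements(elements, np_):
    """elements: list of (allowed_processors, q_t, g_t) per element.
    Returns the chosen processor id (0-based) for every element."""
    load = [0.0] * np_                         # estimated work per processor
    assignment = [None] * len(elements)
    # handle expensive elements first so that the balance stays tight
    order = sorted(range(len(elements)),
                   key=lambda e: -elements[e][1] * elements[e][2] ** 2)
    for e in order:
        allowed, q_t, g_t = elements[e]
        cost = q_t * g_t ** 2
        p = min(allowed, key=lambda proc: load[proc])
        assignment[e] = p
        load[p] += cost
    return assignment

# Tiny usage example with hypothetical data: three elements of order 2
# with g_t = 6 nodes and q_t = 7 integration points each.
elems = [({0, 1}, 7, 6), ({1, 2}, 7, 6), ({0, 2}, 7, 6)]
print(distribute_elements(elems, 3))           # e.g. [0, 1, 2]
```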
Figure 7: Timing of the mounting of the global matrix and of one PRES20 iteration step with the structured grid on the Intel Paragon; the problem size is proportional to the number of processors.
5 Examples

Here we present two examples which show the efficiency of our data structures. We look only at the linear VECFEM kernel, in order to eliminate effects from the optimized stopping criteria, which involve the discretization error. The calculations were run on an Intel Paragon. At the time of the measurements the message processor of each node was not yet in use, since the software was not yet available. Therefore locally non-blocking communication and latency hiding were not possible.

5.1 Structured Grid
The first example is the convection driven diffusion equation on a two-dimensional channel of length 2 (= y-axis) and height 1 (= x-axis):

$$-\Delta u + v \cdot \nabla u = 0 . \qquad (3)$$

$v = (0,\, 4x(x-1))$ is the driving velocity field. On the edges x = 0 and y = 1 the Neumann boundary condition $\partial u / \partial n = 1$ is prescribed, and on the edges x = 2 and y = 0 the Dirichlet boundary condition u = 0. The domain is discretized by a regular grid with 9-point quadrilateral elements of order 2. The global matrix always has 25 nonzero diagonals, so that three actual cycles are used in the optimized communication path.

Figure 7 shows the timing of this example for about 1000 unknowns per processor on the Intel Paragon for various numbers of processors. The left pile gives the elapsed time for the computation of the element matrices and the mounting of the global matrix, and the second pile shows the time per iteration of the LINSOL iteration (PRES20 method). One iteration comprises one matrix-vector multiplication with the global matrix and additionally 12 scalar products. The diagram shows a slow O(log(np)) increase of the elapsed time for one LINSOL step, which is caused by the cascade method in the scalar product. So the run time for one iteration step scales well, but one has to keep in mind that the increase of the total number of unknowns (about 1000 np) increases the number of iteration steps needed to reach the same accuracy for the solution of the linear system, since the condition number increases with the problem size. The increase from one to two processors shows the overhead that has to be paid for the parallelization, mainly due to communication. This overhead will be reduced when locally non-blocking communication becomes possible.

5.2 Nonstructured Grid

Figure 8: Mesh for the structural analysis problem.
As a second example we look at a three-dimensional linear structural analysis problem [1]. The unknown displacements of the loaded body have to fulfil a system of three partial differential equations. Here the body is the unit cube with its origin at the origin of the space coordinate system, which is bitten at the vertex (1, 1, 1) by a sphere of radius 1/3, see Figure 8. At the faces x = 1, y = 1 and z = 1 the body is fixed in the normal direction of the face, and at the point (0, 0, 0) the nodal force (1, 1, 1) is applied. The meshes, using tetrahedrons of order 2, were generated by I-DEAS [8]. The numberings of the FEM nodes were bandwidth optimized.

Figure 9: Timing of the mounting of the global matrix and of one classical CG iteration step with the nonstructured grid on the Intel Paragon; the problem size is approximately proportional to the number of processors.

Figure 9 gives the elapsed time for the mounting of the global system and for one step of the classical CG iteration. The number of processors is varied while the meshes are refined, so that the number of unknowns per processor stays nearly equal to 700. In contrast to the regular case, the timings for the mounting as well as for the CG steps increase strongly. The reason is the increase of the actual number of communications in the optimized communication path, which is given in braces in Figure 9 and which is due to the increasing 'irregularity' of the matrix with increasing problem size. For 96 processors we need roughly eight times the run time of the 'unit problem' for an approximately 24-fold problem size; so seven times the expected amount is spent as overhead. Every FEM node has nearly the same number of neighbors in the FEM mesh, independently of the problem size. So the number of nonzero entries in the global matrix which effect a coupling between the processors is nearly constant if the number of FEM nodes per processor is constant. Therefore the increase of the bandwidth for larger problem sizes produces more actual communication cycles, each with more sparsely populated matrix blocks for the couplings between the processors. So the ratio of communication to calculation becomes worse with increasing problem size, and it becomes much more difficult to hide the communication overhead behind the calculation. A better numbering of the FEM nodes than a bandwidth optimized one could reduce the communication amount; it would have to minimize the number of communications and thus the number of contact faces of the regions with nodes on the same processor (but where is the algorithm?).
6 Conclusion

In this paper we have presented optimal data structures for the calculation of the element matrices as well as for the solution of the linear system on parallel computers. The mounting of the global matrix, which is distributed onto the processors, transforms the optimized data structures between these two tasks. The VECFEM program package, which is based on these data structures, provides an efficient tool for the solution of nonlinear functional equations on supercomputers by finite element methods. For a wide class of problems it relieves the user from searching for suitable algorithms and from their costly implementation on parallel and vector computers. Our further work will focus on the improvement of the stability and reliability of the VECFEM and LINSOL algorithms and on the development of a mesh adaptation procedure for VECFEM on parallel systems.
References

[1] K.-J. Bathe. Finite Element Procedures in Engineering Analysis. Prentice Hall, Inc., Englewood Cliffs, New Jersey, 1982.

[2] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM 3.0 User's Guide and Reference Manual, 1993.

[3] N. Gibbs, W. Poole, and P. Stockmeyer. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM J. Numer. Anal., 13(2), April 1976.

[4] L. Groß, C. Roll, and W. Schönauer. A black-box solver for the solution of general nonlinear functional equations by mixed FEM. In M. Dekker, editor, FEM 50, The Finite Element Method: Fifty Years of the Courant Element, Finland, 1993.

[5] L. Groß, C. Roll, and W. Schönauer. VECFEM for mixed finite elements. Internal report 50/93, University of Karlsruhe, Computing Center, 1993.

[6] L. Groß, P. Sternecker, and W. Schönauer. Optimal data structures for an efficient vectorized finite element code. In H. Burkhardt, editor, CONPAR 90 - VAPP IV, Lecture Notes in Computer Science. Springer Verlag, Berlin, Heidelberg, New York, 1990.

[7] L. Groß, P. Sternecker, and W. Schönauer. The finite element tool package VECFEM (version 1.1). Internal report 45/91, University of Karlsruhe, Computing Center, 1991.

[8] I-DEAS, Solid Modeling, User's Guide. SDRC, 2000 Eastman Drive, Milford, Ohio 45150, USA, 1990.

[9] PARAGON OSF/1 User's Guide, April 1993.

[10] M. Schmauder and W. Schönauer. CADSOL - a fully vectorized black-box solver for 2-D and 3-D partial differential equations. In R. Vichnevetsky, D. Knight, and G. Richter, editors, Advances in Computer Methods for Partial Differential Equations VII, pages 639-645. IMACS, New Brunswick, 1992.

[11] W. Schönauer. Scientific Computing on Vector Computers. North-Holland, Amsterdam, New York, Oxford, Tokyo, 1987.

[12] R. Weiss. Convergence behavior of generalized conjugate gradient methods. Internal report 43/90, University of Karlsruhe, Computing Center, 1990.

[13] R. Weiss and W. Schönauer. Black-box solvers for partial differential equations. In E. Kusters, E. Stein, and W. Werner, editors, Proceedings of the Conference on Mathematical Methods and Supercomputing in Nuclear Applications, Kernforschungszentrum Karlsruhe, Germany, volume 2, pages 29-40, 1993.