Parallel Inexact Constraint Preconditioners for Saddle Point Problems

Luca Bergamaschi and Ángeles Martínez
Department of Mathematical Methods and Models for Scientific Applications, University of Padua, via Trieste 63, 35121 Padova, Italy
e-mail: [email protected], [email protected]

Abstract. In this paper we describe an enhanced parallel implementation of the FSAI preconditioner to accelerate the PCG method in the solution of symmetric positive definite (SPD) linear systems of very large size. This preconditioner is used as a building block for the construction of an indefinite Inexact Constraint Preconditioner (ICP) for saddle point-type linear systems arising from the Finite Element (FE) discretization of 3D coupled consolidation problems. The overall preconditioner, based on an efficient approximation of the inverse of the (1,1) block, proves very effective in the acceleration of the BiCGSTAB iterative solver in parallel environments. Numerical results on a number of test cases of size up to $10^6$ unknowns and $10^8$ nonzeros show the perfect scalability of the overall code up to 128 processors.

Keywords: Parallel computing · Preconditioning · Krylov subspace methods · coupled consolidation

1 Introduction

The time-dependent displacements and fluid pore pressure in porous media are governed by consolidation theory. This was first mathematically described by Biot [7], who coupled the elastic equilibrium equations with a continuity (mass balance) equation to be solved under appropriate boundary and initial flow and loading conditions. The coupled consolidation equations are typically solved numerically using FE in space, thus giving rise to a system of first-order differential equations whose solution is addressed by an appropriate time marching scheme. A major computational issue is the repeated solution in time of the resulting discretized indefinite equations, which can generally be written as
\[
A x = b, \qquad \text{where } A = \begin{pmatrix} K & B^T \\ B & -C \end{pmatrix}. \tag{1}
\]
The sub-matrices $K$ and $C$ are both symmetric and positive definite (SPD). Denoting by $m$ the number of FE nodes, $C \in \mathbb{R}^{m \times m}$, $B \in \mathbb{R}^{m \times n}$, and $K \in \mathbb{R}^{n \times n}$, where $n$ is equal to $2m$ or $3m$ according to the spatial dimension of the problem.

Matrix $A$ in (1) is a classical example of a saddle point problem, which is encountered in other fields as well, including constrained optimization, least squares, and Navier-Stokes equations. Because of the large size of realistic three-dimensional (3D) consolidation models (and particularly so in problems related to fluid withdrawal/injection from/into geological formations), the use of iterative solvers is strongly recommended over direct factorization methods. However, well-established iterative methods such as Krylov subspace methods are very slow or even fail to converge if not conveniently preconditioned. Constraint preconditioners for Krylov solvers in the solution of saddle point problems have been studied by a number of authors [15, 12, 16, 4, 1]. In most of the above references the preconditioner is obtained from $A$ with the (1,1) block $K$ approximated by its diagonal. In the coupled consolidation problem, however, $K$ is not diagonally dominant and a better approximation is required to ensure convergence.

In this work we propose a fully explicit parallel ICP based on the FSAI (Factorized Sparse Approximate Inverse) preconditioner [14] of the matrices $K$ and $S$, where $S$ is an approximate Schur complement of a block matrix $\mathcal{M}$ resembling $A$. The FSAI preconditioner is based on prefiltration and postfiltration techniques and allows the nonzeros in the preconditioner factors to be chosen in the same positions as those of $A^{d_K}$, with $d_K = 1, 2, 4$. We have developed a parallel code which implements the BiCGSTAB solver preconditioned with the parallel ICP preconditioner described above. The code is written in Fortran 90 and uses the MPI standard to perform interprocessor communication. We show numerical results obtained in the solution of a number of large problems arising from the 3D FE discretization of realistic engineering applications. All the experiments have been run on an IBM SP6 cluster of 168 Power6 575 nodes with a peak performance of just over 100 Tflops, located at Cineca (Bologna, Italy).

The paper is organized as follows. Section 2 gives a brief description of the consolidation equations. In Section 3 we describe the ICP and recall the main spectral properties of the block preconditioned matrices. Section 4 describes the parallel preconditioner used in this work and explains in detail how it is implemented and applied during the BiCGSTAB iteration. Section 5 reports the numerical results obtained with PCG accelerated by the FSAI preconditioner on six test cases arising from realistic engineering applications, together with the results of the FSAI-ICP code and a detailed scalability study for the solution of system (1). Finally, some conclusions are stated in Section 6.

2 Finite Element coupled consolidation equations

The system of partial differential equations governing the 3D coupled consolidation process in fully saturated porous media is derived from the classical Biot formulation [7] and successive modifications as:
\[
(\lambda + \mu)\,\frac{\partial \epsilon}{\partial i} + \mu \nabla^2 u_i = \alpha \frac{\partial p}{\partial i}, \qquad i = x, y, z \tag{2}
\]
\[
\frac{1}{\gamma}\,\nabla\cdot(k \nabla p) = \left[\phi\beta + c_{br}(\alpha - \phi)\right]\frac{\partial p}{\partial t} + \alpha \frac{\partial \epsilon}{\partial t} \tag{3}
\]
where $c_{br}$ and $\beta$ are the volumetric compressibility of solid grains and water, respectively, $\phi$ is the porosity, $k$ the medium hydraulic conductivity, $\epsilon$ the medium volumetric dilatation, $\alpha$ the Biot coefficient, $\lambda$ and $\mu$ the Lamé constant and the shear modulus of the porous medium, respectively, $\gamma$ the specific weight of water, $\nabla$ the gradient operator, $x, y, z$ the coordinate directions, $t$ is time, and $p$ and $u_i$ are the incremental pore pressure and the components of incremental displacement along the $i$-direction, respectively.

Use of FE in space yields a system of first-order differential equations which can be integrated by the Crank-Nicolson scheme. The resulting linear system has to be repeatedly solved to obtain the transient displacements and pore pressures. The nonsymmetric matrix controlling the solution scheme reads:
\[
A = \begin{pmatrix} K/2 & -Q/2 \\[2pt] Q^T/\Delta t & H/2 + P/\Delta t \end{pmatrix} \tag{4}
\]
where $K$, $H$, $P$ and $Q$ are the elastic stiffness, flow stiffness, flow capacity and flow-stress coupling matrices, respectively. Matrix $A$ can be readily symmetrized by multiplying the upper set of equations by 2 and the lower set by $-\Delta t$, thus obtaining the sparse $2 \times 2$ block symmetric indefinite matrix (1) with $B = -Q^T$ and $C = \Delta t H/2 + P$. A major difficulty in the repeated solution of system (1) is the likely ill-conditioning of $A$ caused by the large difference in magnitude between the coefficients of the blocks $K$, $B$ and $C$. In long-term simulations a small $\Delta t$ is typically needed in the early stage of the consolidation process, while larger values may be used as the system approaches the steady state. Hence, the initial steps are the most critical ones, with the convergence expected to improve as the simulation proceeds.
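As a concrete illustration of the symmetrization step, the following minimal sketch (our own, not part of the authors' Fortran code) assembles the symmetric indefinite matrix (1) from the blocks of (4); it assumes that the FE blocks K, Q, H, P are already available as SciPy sparse matrices:

```python
import scipy.sparse as sp

def assemble_saddle_point(K, Q, H, P, dt):
    """Build the symmetric indefinite matrix (1) from the blocks of (4).

    Upper block row of (4) is multiplied by 2, the lower block row by -dt:
      B = -Q^T,  C = dt*H/2 + P,  A = [[K, B^T], [B, -C]].
    """
    B = -Q.T.tocsr()
    C = (dt * H / 2 + P).tocsr()
    A = sp.bmat([[K, B.T], [B, -C]], format="csr")
    return A, B, C
```

The assembly is repeated at every time step only through the cheap update of C, since K, Q, H and P do not depend on Δt.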

3 Inexact Constraint Preconditioners

To solve system (1) we look for a preconditioner $\mathcal{M}^{-1}$, where $\mathcal{M}$ is first chosen so as to take into account the block structure of system (1):
\[
\mathcal{M} = \begin{pmatrix} G_1 & B^T \\ B & -C \end{pmatrix},
\]
with $G_1$ an SPD approximation of the (1,1) block $K$. Its inverse, $G_1^{-1}$, which can be viewed as a preconditioner for $K$, is assumed to be explicitly known. To fulfill

such a requirement we compute $G_1^{-1}$ as a Factorized Sparse Approximate Inverse [13, 14], which is readily available in the factorized form $K^{-1} \simeq G_1^{-1} = W_1^T W_1$. The exact preconditioner matrix $\mathcal{M}^{-1}$ can be written as:
\[
\mathcal{M}^{-1} = \begin{pmatrix} I_n & -G_1^{-1}B^T \\ 0 & I_m \end{pmatrix}
\begin{pmatrix} G_1^{-1} & 0 \\ 0 & -S^{-1} \end{pmatrix}
\begin{pmatrix} I_n & 0 \\ -BG_1^{-1} & I_m \end{pmatrix} \tag{5}
\]
where $I_i$ is the $i \times i$ identity matrix and $S = BG_1^{-1}B^T + C$. Clearly, every application of the above preconditioner requires the solution of a linear system with $S$ as the coefficient matrix, which makes every iteration of the chosen iterative solver very costly. To overcome this drawback, a class of Inexact Constraint Preconditioners (ICP) has been developed, based on an approximation of the inverse of the Schur complement matrix $S$. Using again FSAI to produce an explicit approximation of $S^{-1}$, say $G_S^{-1}$, we obtain the following full ICP preconditioner:
\[
\mathcal{M}_F^{-1} = \begin{pmatrix} I_n & -G_1^{-1}B^T \\ 0 & I_m \end{pmatrix}
\begin{pmatrix} G_1^{-1} & 0 \\ 0 & -G_S^{-1} \end{pmatrix}
\begin{pmatrix} I_n & 0 \\ -BG_1^{-1} & I_m \end{pmatrix} \tag{6}
\]
A further approximation can be introduced by simply neglecting the rightmost matrix in the above expression, thus obtaining a Triangular ICP preconditioner:
\[
\mathcal{M}_T^{-1} = \begin{pmatrix} I_n & -G_1^{-1}B^T \\ 0 & I_m \end{pmatrix}
\begin{pmatrix} G_1^{-1} & 0 \\ 0 & -G_S^{-1} \end{pmatrix} \tag{7}
\]
Following the approach in [3], we construct an approximate Schur complement $\widehat{S} = BG_2^{-1}B^T + C$ with the aim of reducing its fill-in, where $G_2^{-1}$ is computed as a further (sparser) FSAI approximation of the inverse of the structural sub-matrix. A third FSAI preconditioner is used to approximate the inverse of $\widehat{S}$, $G_S^{-1} \approx \widehat{S}^{-1}$.
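Expanding the product in (6) (a routine block multiplication, added here for clarity) gives the explicit form of the full ICP that is actually applied to a vector in Section 4.2:
\[
\mathcal{M}_F^{-1} = \begin{pmatrix}
G_1^{-1} - G_1^{-1}B^T G_S^{-1} B G_1^{-1} & \ G_1^{-1}B^T G_S^{-1} \\
G_S^{-1} B G_1^{-1} & -G_S^{-1}
\end{pmatrix}.
\]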

3.1 Eigenvalue distribution of the preconditioned matrices

Let $G_1$ and $G_S$ be SPD approximations of $K$ and $S = C + BG_1^{-1}B^T$, respectively. $G_1^{-1}$ and $G_S^{-1}$ can also be viewed as preconditioners for the corresponding matrices, so that we can define the following SPD preconditioned matrices:
\[
K_P = G_1^{-1/2}\, K\, G_1^{-1/2} \qquad \text{and} \qquad S_P = G_S^{-1/2}\, S\, G_S^{-1/2}.
\]
Let us assume that
\[
0 < \alpha_K = \lambda_{\min}(K_P) < 1 < \lambda_{\max}(K_P) = \beta_K, \qquad
0 < \alpha_S = \lambda_{\min}(S_P) < 1 < \lambda_{\max}(S_P) = \beta_S. \tag{8}
\]

The conditions $1 \in [\alpha_K, \beta_K]$ and $1 \in [\alpha_S, \beta_S]$ are very often fulfilled in practice, since the preconditioners $G_1$ and $G_S$ are expected to cluster the eigenvalues around unity. The following two theorems give bounds on the eigenvalues of the preconditioned matrix in the two cases, ICP and TICP. An exhaustive spectral analysis can be found in [2].

Theorem 1. If $\beta_K < 2$ then the real eigenvalues of $\mathcal{M}_F^{-1}A$ satisfy
\[
\min\left\{\alpha_K,\ \frac{\alpha_S}{\beta_K}\right\} \le \lambda \le \max\{(2 - \alpha_K)\beta_S,\ \beta_K\}.
\]
If $\lambda_I \ne 0$ then
\[
\frac{\alpha_K + \alpha_S(2 - \beta_K)}{2} \le \lambda_R \le \frac{\beta_K + \beta_S(2 - \alpha_K)}{2},
\qquad
|\lambda_I| \le \sqrt{\beta_S}\,\max\{1 - \alpha_K,\ \beta_K - 1\}.
\]

The subsequent theorem bounds the eigenvalues of $\mathcal{M}_T^{-1}A$:

Theorem 2. The eigenvalues of $\mathcal{M}_T^{-1}A$ satisfy the following bounds. If $\lambda_I \ne 0$ then
\[
\frac{\alpha_K}{2} \le \lambda_R \le \min\left\{\frac{1 + \beta_S}{2},\ 2\right\}
\qquad \text{and} \qquad
|\lambda - 1| \le \sqrt{1 - \alpha_K}.
\]
The real eigenvalues satisfy
\[
\min\left\{\alpha_K,\ \frac{\alpha_S}{\beta_K + \alpha_S}\right\} \le \lambda_R \le \max\{\beta_K,\ \beta_S + \beta_K\}.
\]

The results contained in Theorems 1 and 2 point out that the eigenvalues of the preconditioned matrix are clustered around one whenever those of the preconditioned $K$ and of the preconditioned Schur complement are.
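To make the bounds of Theorem 1 concrete, consider an illustrative numerical instance; the spectral bounds $\alpha_K = 0.5$, $\beta_K = 1.5$, $\alpha_S = 0.8$, $\beta_S = 1.2$ used here are invented for illustration only and are not measured values from the test cases. The real eigenvalues then satisfy
\[
\min\{0.5,\ 0.8/1.5\} = 0.5 \le \lambda \le \max\{(2 - 0.5)\cdot 1.2,\ 1.5\} = 1.8,
\]
and, for complex eigenvalues,
\[
\frac{0.5 + 0.8(2 - 1.5)}{2} = 0.45 \le \lambda_R \le \frac{1.5 + 1.2(2 - 0.5)}{2} = 1.65,
\qquad
|\lambda_I| \le \sqrt{1.2}\,\max\{0.5,\ 0.5\} \approx 0.55.
\]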

4 Parallel ICP

4.1 FSAI preconditioner

The FSAI preconditioner, initially proposed in [13] and [14], has been later developed and implemented in parallel by Bergamaschi et al. in [5]. Here we only briefly recall the main features of this preconditioner. Given an SPD matrix $K$, the FSAI preconditioner approximately factorizes its inverse as a product of two sparse triangular matrices,
\[
K^{-1} \approx G^{-1} = W^T W.
\]
The choice of nonzeros in $W$ is based on a sparsity pattern which in our work may be the same as that of $\widetilde{K}^{k}$, where $\widetilde{K}$ is the result of prefiltration [6] of $K$, i.e. the dropping of all elements below a threshold parameter $\delta$ (possibly $\delta = 0$, which means no prefiltration is applied). In the present paper we allow the power $k$ to be equal to 1, 2 or 4. The entries of $W$ are computed by minimizing the Frobenius norm of $I - WL$, where $L$ is the exact Cholesky factor of $K$; this minimization does not require the explicit knowledge of $L$. The resulting triangular factor can in its turn be approximated (postfiltered) by dropping

all the elements which, relative to the diagonal, are below a second tolerance parameter $\varepsilon$. The resulting FSAI preconditioner is therefore governed by the following three parameters (a serial sketch of the construction is given below):

1. the prefiltration threshold $\delta$;
2. the power of $K$ ($d_K = 1, 2, 4$) generating the sparsity pattern;
3. the postfiltration threshold $\varepsilon$.
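The following minimal serial sketch (our own Python/SciPy illustration, not the parallel Fortran code used in the paper; prefiltration and postfiltration are omitted) shows how the lower triangular FSAI factor W can be computed row by row by solving small dense systems over the prescribed pattern:

```python
import numpy as np
import scipy.sparse as sp

def fsai_factor(K, pattern):
    """Lower triangular FSAI factor W with K^{-1} ~= W.T @ W.

    `pattern` is a sparse matrix whose lower triangle prescribes the allowed
    nonzero positions of W (e.g. sp.tril(K) for d_K = 1, sp.tril(K @ K) for d_K = 2).
    """
    K = sp.csr_matrix(K)
    P = sp.csr_matrix(sp.tril(pattern))
    n = K.shape[0]
    rows, cols, vals = [], [], []
    for i in range(n):
        J = P.indices[P.indptr[i]:P.indptr[i + 1]]   # allowed columns of row i
        if i not in J:
            J = np.append(J, i)                      # the diagonal is always kept
        J = np.sort(J)
        KJ = K[J, :][:, J].toarray()                 # small dense SPD sub-block
        e = np.zeros(len(J)); e[-1] = 1.0            # unit vector at the position of i
        y = np.linalg.solve(KJ, e)
        w = y / np.sqrt(y[-1])                       # scale so that (W K W^T)_ii = 1
        rows.extend([i] * len(J)); cols.extend(J.tolist()); vals.extend(w.tolist())
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

# usage sketch: the preconditioner action is r -> W.T @ (W @ r)
# W1 = fsai_factor(K, sp.tril(K))
```

Since every row of W is computed independently of the others, the loop above is what makes the construction embarrassingly parallel in the distributed-memory implementation.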

4.2 ICP application

Recalling equation (6), the full ICP can be written as:
\[
\mathcal{M}_F^{-1} =
\begin{pmatrix} I_n & -W_1^T W_1 B^T \\ 0 & I_m \end{pmatrix}
\begin{pmatrix} W_1^T W_1 & 0 \\ 0 & -W_S^T W_S \end{pmatrix}
\begin{pmatrix} I_n & 0 \\ -BW_1^T W_1 & I_m \end{pmatrix}
=
\begin{pmatrix} W_1^T & -W_1^T W_1 B^T W_S^T \\ 0 & W_S^T \end{pmatrix}
\begin{pmatrix} W_1 & 0 \\ W_S B W_1^T W_1 & -W_S \end{pmatrix}
= UL \tag{9}
\]
where $G_1^{-1} = W_1^T W_1$ and $W_S$ is the FSAI factor of the approximate Schur complement matrix $\widehat{S}$, $\widehat{S}^{-1} \simeq W_S^T W_S$. The Schur complement matrix $\widehat{S}$ is evaluated as $\widehat{S} = BW_2^T W_2 B^T + C = S_0 + C$, $W_2$ being the triangular factor of a sparser FSAI approximation of $K^{-1}$, obtained from $W_1$ by a further postfiltration. Analogously, the Triangular ICP can be written as
\[
\mathcal{M}_T^{-1} =
\begin{pmatrix} W_1^T & -W_1^T W_1 B^T W_S^T \\ 0 & W_S^T \end{pmatrix}
\begin{pmatrix} W_1 & 0 \\ 0 & -W_S \end{pmatrix}
= UL \tag{10}
\]

The application of $\mathcal{M}_F^{-1}$ requires the explicit computation of the Schur complement matrix $\widehat{S}$, whose construction may be time and memory consuming, $\widehat{S}$ being the result of two sparse matrix-matrix products and one sparse matrix sum. However, it should be noted that the evaluation of $S_0 = BW_2^T W_2 B^T$, which involves the main computational burden in building $\widehat{S}$, is independent of the time step $\Delta t$ and can therefore be done just once at the beginning of the simulation. The construction of the preconditioner, whose application is sketched below, is therefore based on the following parameters:

1. $\delta_1$, $d_K$ and $\varepsilon_1$, for the first FSAI preconditioner ($W_1$);
2. $\varepsilon_2$, the postfiltration threshold for $W_2$;
3. $\delta_S$, $d_S$ and $\varepsilon_S$, for the FSAI preconditioner applied to the Schur complement matrix ($W_S$).
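A minimal sketch of how the full ICP of (9) can be applied to a vector, written here by us in Python/SciPy as a LinearOperator purely for illustration; the factors W1, WS and the block B are assumed to be available as sparse matrices (W1 and WS for instance from the fsai_factor sketch above):

```python
import numpy as np
import scipy.sparse.linalg as spla

def icp_operator(W1, WS, B):
    """Apply M_F^{-1} of (9) through sparse triangular-factor products only."""
    n, m = W1.shape[0], WS.shape[0]

    def apply(r):
        r1, r2 = r[:n], r[n:]
        u = W1.T @ (W1 @ r1)              # u  = G1^{-1} r1
        v = WS.T @ (WS @ (B @ u - r2))    # v  = G_S^{-1} (B G1^{-1} r1 - r2)
        z1 = u - W1.T @ (W1 @ (B.T @ v))  # z1 = G1^{-1} r1 - G1^{-1} B^T v
        return np.concatenate([z1, v])

    return spla.LinearOperator((n + m, n + m), matvec=apply)

# e.g. scipy.sparse.linalg.bicgstab(A, b, M=icp_operator(W1, WS, B))
```

Every application therefore costs six sparse matrix-vector products with the triangular FSAI factors plus two products with B and B^T, and no system solve.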

5 Numerical results

5.1 Solution of Kx = b

Since the key to the success of ICP is the availability of a good preconditioner for the matrix $K$ (numerical experience shows that the Schur complement

matrix is instead well-conditioned), we analyze the performance of our FSAI preconditioner when used within the PCG method to solve a linear system $Kx = b$. The test cases, which we briefly describe below, are all realistic examples of large size arising from 2D and 3D FE discretization of geomechanical problems. In detail:

1. FAULT-639: arises from the numerical solution by a linear FE of the inequality-constrained minimization problem governing the mechanical equilibrium of a 3D body with contact surfaces [10]. The contact is solved with the aid of a penalty formulation that gives rise to an SPD ill-conditioned linear system of equations.
2. STOCF-729: arises from the FE integration of the diffusion partial differential equation governing the 3D transient flow of groundwater in saturated porous media. The problem is solved assuming a stochastic distribution of the hydraulic conductivity tensor with a large permeability contrast in adjacent elements.
3. GEO-1438: arises from a regional geomechanical model of the sedimentary basin underlying the Venice lagoon, discretized by a linear FE with randomly heterogeneous properties [17].
4. FLAN-1565: arises from the mechanical equilibrium of a steel flange discretized by a 3D 8-node brick FE [11].
5. HOOK-1498: arises from the mechanical equilibrium of a steel hook discretized by 3D 4-node tetrahedral FE [11].
6. PO-878: arises in the simulation of the consolidation of a real gas reservoir of the Po Valley, Italy, used for underground gas storage purposes (for details, see [8]). The problem is discretized with a 3D tetrahedral grid totaling 292 785 nodes and 1 746 044 elements for 878 355 unknowns.

The size and number of nonzero terms of each matrix are provided in Table 1. The linear system is solved by PCG using as exact solution a vector of all ones. The exit test for the iterative solver is
\[
\frac{\|r_k\|}{\|b\|} \le 10^{-10}.
\]
Each matrix has been preliminarily reordered by a Reverse Cuthill-McKee (RCM) algorithm [9]. All tests are performed on the IBM SP6/5376 cluster at the CINECA Centre for High Performance Computing, equipped with IBM Power6 processors at 4.7 GHz, 168 nodes, 5376 computing cores, and 21 Tbyte of internal network RAM. The code is written in Fortran 90 and compiled with the -O4 -q64 -qarch=pwr5 -qtune=pwr5 -qnoipa -qstrict -bmaxdata:0x70000000 options. In Table 2 we report the results of the PCG runs for the six test cases and a number of combinations of the FSAI parameters. In particular, in Table 2 we give the number of iterations (iter), the density of the FSAI preconditioner computed as $\rho = \mathrm{nnz}(G_1^{-1})/\mathrm{nnz}(K)$, as well as three CPU times referring to the cost

Table 1. Size and number of nonzeros of the test matrices.

                 FAULT-639   STOCF-729   GEO-1438    FLAN-1565    HOOK-1498   PO-878
Size             638 812     729 400     1 437 960   1 564 794    1 498 023   878 355
# of nonzeros    14 626 683  10 765 586  63 156 690  117 406 044  60 917 445  38 896 749

of the FSAI computation ($T_P$), the cost of the iterative solver ($T_{sol}$) and the total time ($T_{tot} = T_P + T_{sol}$). For a fixed test case all the runs have been performed using a fixed number of processors. Inspection of Table 2 reveals that the choice of $d_K = 4$ produces in all tests the smallest number of iterations and (with the only exception of problem FLAN-1565) the smallest $T_{sol}$ CPU time. However, in some instances the large cost of computing the FSAI preconditioner may greatly affect the total CPU time.
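As an illustration of the experimental setup just described, the following sketch (ours; it reuses the fsai_factor helper introduced in Section 4.1 and assumes a recent SciPy where cg accepts the rtol keyword) runs PCG with the FSAI preconditioner and the relative-residual exit test of $10^{-10}$:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_with_fsai(K):
    """PCG on K x = b, with b chosen so that the exact solution is all ones."""
    n = K.shape[0]
    b = K @ np.ones(n)                          # exact solution = vector of all ones
    # RCM reordering, as in the paper, could be applied first via
    # scipy.sparse.csgraph.reverse_cuthill_mckee(K).
    W = fsai_factor(K, sp.tril(K))              # FSAI factor on the pattern of K (d_K = 1)
    M = spla.LinearOperator((n, n), matvec=lambda r: W.T @ (W @ r))
    x, info = spla.cg(K, b, M=M, rtol=1e-10)    # exit test ||r_k|| <= 1e-10 ||b||
    return x, info
```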

5.2 Parallel results and scalability for FSAI-based codes

We use a strong scaling measure to see how the solution time varies with the number of processors for a fixed total problem size. Throughout the whole section we denote by $T_p$ the total CPU elapsed time, expressed in seconds, when running the code on $p$ processors. We define a relative measure of the parallel efficiency achieved by the code. To this aim we denote by $S_p^{(\bar p)}$ the pseudo speedup computed with respect to the smallest number of processors ($\bar p$) used to solve the given problem:
\[
S_p^{(\bar p)} = \frac{T_{\bar p}\,\bar p}{T_p},
\]
and by $E_p^{(\bar p)}$ the corresponding relative efficiency,
\[
E_p^{(\bar p)} = \frac{S_p^{(\bar p)}}{p} = \frac{T_{\bar p}\,\bar p}{T_p\, p}.
\]
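For instance, the pseudo speedups and efficiencies reported in the GEO-1438 scalability table below can be reproduced with a few lines (the timing values are copied from that table):

```python
# Total CPU times T_p (seconds) for GEO-1438, taken from the table below.
Ttot = {2: 471.6, 4: 228.5, 8: 107.7, 16: 47.6, 32: 24.5, 64: 13.1, 128: 5.9, 256: 4.0}
p_bar = 2                                   # smallest processor count used
for p, T in sorted(Ttot.items()):
    S = Ttot[p_bar] * p_bar / T             # pseudo speedup S_p^(2)
    E = S / p                               # relative efficiency E_p^(2)
    print(f"p = {p:3d}   S = {S:6.1f}   E = {E:5.2f}")
```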

In the GEO-1438 scalability table below we report the number of iterations and the timings obtained in solving problem GEO-1438 with a number of processors ranging from $p = 2$ to $p = 256$, together with the pseudo speedups and relative efficiencies for the total CPU time.

5.3 Parallel results and scalability of the ICP preconditioner

We report in this section the results obtained in the solution of our saddle point problem with PO-878 as the test example, whose main features are summarized as:

m        n        N          nnz(K)      nnz(B)      nnz(C)     nnz(A)
292 785  878 355  1 171 140  38 896 749  12 965 583  4 321 861  56 184 193

Table 2. Iteration number (iter), density ρ of the preconditioner, CPU times for the FSAI computation (TP), for the iterative solution (Tsol) and total (Ttot) for each combination of parameters. The best iteration number and the smallest Tsol and Ttot for each test are marked with an asterisk.

name        p   dK  δ     ε     iter    ρ      TP      Tsol      Ttot
FLAN-1565   64  4   0.1   0.1   4546    0.12   12.60    67.62     80.22
                4   0.1   0.01  2785*   1.17   11.79    82.06     93.85
                4   0.1   0.05  3909    0.29   12.47    63.44     75.91
                2   0.1   0.1   5414    0.10    0.81    62.49*    63.30*
                1   0.01  0.1   6064    0.09    0.72    75.55     76.27
FAULT-639   16  4   0.1   0.01   674*   1.32    5.90    21.92     27.82
                4   0.2   0.01   986    0.18    0.35    13.54*    13.89*
                2   0.1   0.01  1511    0.39    0.53    30.20     30.73
                2   0.2   0.01  1667    0.10    0.23    26.35     26.58
                2   0     0.01   938    1.41    8.03    29.64     37.67
                2   0.01  0.01  1252    1.28    5.02    38.07     43.09
                1   0     0.01  1745    0.56    0.83    38.25     39.08
HOOK-1498   16  4   0.1   0.1   3511    0.28   49.29   142.05*   191.34
                4   0.1   0.01  2362*   2.76   46.38   267.64    314.02
                2   0.2   0.01  5195    0.10    0.49   215.56    216.05
                1   0.01  0.1   4164    0.18    1.12   149.00    150.12*
                1   0.01  0.01  3416    0.66    0.96   168.83    169.79
GEO-1438    16  4   0.1   0.1    585    0.34   20.65    27.53*    48.18
                4   0.1   0.01   405    2.13   26.77    42.93     69.70
                4   0.2   0.1   1140    0.12    0.78    45.32     46.10
                4   0.01  0.01   342*   3.56  606.28    48.93    655.21
                2   0.2   0.1    766    0.21    1.24    34.06     35.30*
                2   0.2   0.01   671    0.58    1.42    38.65     40.07
                2   0.01  0.01   555    1.59   11.85    53.69     65.54
                1   0.0   0.01   818    0.65    1.13    45.03     46.16
                1   0.01  0.1    924    0.17    1.23    42.53     43.76
STOCF-729   16  4   0.05  0.1    810    1.06    4.96    10.21     15.17
                4   0.1   0.05   755*   1.61    1.96    17.06     19.02
                4   0.1   0.1    881    0.95    1.51     9.96*    11.47
                2   0.1   0.01  1230    1.11    0.30    11.75     12.05
                2   0.2   0.1   2030    0.24    0.17    11.00     11.17*
                2   0.01  0.01  1056    1.97    0.73    15.15     15.88
                1   0.01  0.01  1699    0.77    0.20    15.67     15.87
PO-878      64  4   0.2   0.1    844    0.14    0.27     4.47      4.74*
                4   0.1   0.1    728    0.26    2.99     3.55*     6.54
                4   0.1   0.01   698*   1.42    2.75     7.30     10.05
                2   0.1   0.1   1414    0.17    0.34     6.27      6.61
                2   0.2   0.1   1534    0.10    0.19     6.19      6.38
                2   0.01  0.1   1426    0.19    1.99     8.03     10.02
                1   0.01  0.1   2297    0.13    0.22     8.31      8.53


p     iter   TP     Tsol   Ttot    Sp(2)   Ep(2)
2     585    195.8  275.8  471.6     --      --
4     585     73.9  114.6  228.5     4.1    1.03
8     585     45.7   62.0  107.7     8.8    1.09
16    585     20.2   27.4   47.6    19.8    1.24
32    585     11.0   13.5   24.5    38.5    1.20
64    585      5.9    7.2   13.1    72.0    1.13
128   585      3.1    2.8    5.9   159.9    1.25
256   585      2.0    2.0    4.0   235.8    0.92

The complexity and heterogeneity of the geological domain in problem PO-878 give rise to a large number of distorted tetrahedra. This produces a very ill-conditioned matrix $A$, especially for small time steps. We solved the symmetrized form of system (4) using $\Delta t = 1$, after an intensive testing campaign to tune the parameters. We chose BiCGSTAB as the iterative solver, with the same exit test as in Section 5.1. In Table 3 we report for each run the parameters related to the three FSAI approximations, as described in the previous sections. We also provide a measure $\rho$ of the density of the preconditioner matrices:
\[
\rho = \rho_1 + \rho_2 = \frac{\mathrm{nnz}(G_1^{-1})}{\mathrm{nnz}(A)} + \frac{\mathrm{nnz}(G_S^{-1})}{\mathrm{nnz}(A)}.
\]

Parameter ρ gives an indication of the additional core memory needed for computing and storing the preconditioner.

Table 3. Combinations of parameters and results for the PO-878 problem on 128 processors.

Run     δ1    dK  ε1    ε2    δS    dS  εS      ρ     iter      TP1  TP2  Tsol     Ttot
ICP 1   0     1   0     0.01  0.01  1   0       1.23  > 10000   1.4  0.2  > 200.0  > 200.0
ICP 2   0.01  2   0.01  0.1   0.01  1   10^-3   1.36  4945      2.9  1.4  127.8    129.2
ICP 3   0.10  4   0.1   0.1   0.01  1   10^-3   0.72  1573      3.7  2.3   32.7     35.0
ICP 4   0.10  4   0.01  0.1   0.01  1   10^-3   1.38  1337      3.7  2.4   53.5     55.9
TICP    0.10  4   0.1   0.1   0.01  1   10^-3   0.72  3669      3.6  2.3   66.1     68.4

We present the following timings, all given in seconds. $T_{P1}$ is the time needed to construct $G_1^{-1}$ and $G_2^{-1}$ and the constant part of the Schur complement matrix; this operation can be regarded as a preprocessing stage and is therefore not included in the total time. The second time, $T_{P2}$, refers to the construction of the FSAI preconditioner $G_S^{-1}$ for the Schur complement matrix. Finally, we report as $T_{sol}$ the CPU time required by the iterative solver, and as $T_{tot} = T_{P2} + T_{sol}$ the total CPU time.

Table 3 collects the results of four ICP and one TICP runs employing the three different patterns for the FSAI approximation of $K$ (with $p = 128$). Using $d_K = 1$ no convergence is attained within 10000 iterations, $d_K = 2$ yields 4945 iterations, while with $d_K = 4$ the iterative method converges after 1337 (1573) iterations. From the table we see that only a sparsity pattern for the block $K$ which uses nonzeros far away from the diagonal ($d_K = 4$) allows for a (relatively) fast convergence. We note in passing that the TICP run with the same parameters as the third ICP run requires more than twice the ICP iterations and a little less than twice the CPU time. This is again a consequence of the ill-conditioning of this problem.

In the sequel we present the results of the scalability study carried out with the FSAI-ICP code described so far when used to solve the PO-878 test problem.

Table 4. Parallel performance of the FSAI-ICP (TICP) code for the problem PO-878 with dK = 4.

run     p     iter   TP1     SP1(2)  TP2    Tsol     Ttot     Sp(2)   Ep(2)
ICP 3   2     1409   190.5     --    74.9   1613.5   1688.4     --      --
        4     1521    82.4    4.6    34.6    982.2   1016.8    3.3    0.83
        8     1518    43.3    8.8    18.8    400.0    418.8    8.1    1.01
        16    1407    23.2   16.4    10.7    198.6    209.2   16.1    1.01
        32    1168    12.9   29.5     6.1     84.9     91.0   37.1    1.16
        64    1548     6.8   56.0     3.6     73.4     77.0   43.9    0.69
        128   1573     3.7  103.0     2.3     32.7     36.0   93.8    0.73
ICP 4   2     1441   278.7     --    83.8   2994.0   3077.8     --      --
        4     1123    81.9    6.8    34.9   1175.5   1210.4    5.1    1.27
        8     1574    42.9   13.0    18.8    781.8    800.6    7.7    0.96
        16    1075    23.9   23.3    10.8    245.7    256.5   24.0    1.50
        32    1339    13.0   42.9     6.2    158.0    164.2   37.5    1.17
        64    1157     6.9   80.8     3.6     83.3     86.9   70.8    1.11
        128   1337     3.7  150.7     2.4     53.5     55.9  110.1    0.86
TICP    4     3916    80.7     --    34.6   2275.4   2310.0     --      --
        8     3754    44.1     --    18.8    896.9    915.7     --      --
        16    1075    23.9     --    10.7    530.6    541.3     --      --
        32    3710    13.2     --     6.1    262.3    268.4     --      --
        64    3873     6.8     --     3.6    142.1    145.7     --      --
        128   3669     3.6     --     2.3     66.1     68.4     --      --

The best combination of the parameters (third row in Table 3) produces the parallel results summarized in Table 4 for p = 2 to p = 128.

6 Conclusions

This paper describes and analyses the performance of a parallel preconditioner to accelerate the convergence of Krylov solvers for saddle point type systems. The preconditioner studied in this work has been implemented along with the BiCGSTAB solver in Fortran 90, using the Message Passing Interface (MPI) for interprocessor communication; this makes the resulting code portable to a whole range of supercomputers. We have presented results on two large-scale applications arising from the 3D FE discretization of coupled consolidation problems. The results point out that our code exhibits perfect scalability, both in the preprocessing stage and in the iterative part, as well as great computational efficiency.

Acknowledgments. This study has been supported by the Italian MIUR project (PRIN) "Advanced numerical methods and models for environmental fluid-dynamics and geomechanics".

References

1. Benzi, M., Golub, G.H., Liesen, J.: Numerical solution of saddle point problems. Acta Numer. 14, 1–137 (2005)
2. Bergamaschi, L.: Eigenvalue distribution of constraint preconditioned saddle point matrices. Numer. Lin. Alg. Appl. (2011), submitted
3. Bergamaschi, L., Ferronato, M., Gambolati, G.: Mixed constraint preconditioners for the solution to FE coupled consolidation equations. J. Comp. Phys. 227(23), 9885–9897 (2008)
4. Bergamaschi, L., Gondzio, J., Venturin, M., Zilli, G.: Inexact constraint preconditioners for linear systems arising in interior point methods. Comput. Optim. Appl. 36(2–3), 136–147 (2007)
5. Bergamaschi, L., Martínez, A.: Parallel acceleration of Krylov solvers by factorized approximate inverse preconditioners. In: Daydé, M., et al. (eds.) VECPAR 2004. Lecture Notes in Computer Science, vol. 3402, pp. 623–636. Springer-Verlag, Heidelberg (2005)
6. Bergamaschi, L., Martínez, A., Pini, G.: An efficient parallel MLPG method for poroelastic models. CMES: Computer Modeling in Engineering & Sciences 49(3), 191–216 (2009)
7. Biot, M.A.: General theory of three-dimensional consolidation. J. Appl. Phys. 12(2), 155–164 (1941)
8. Castelletto, N., Ferronato, M., Gambolati, G., Janna, C., Teatini, P., Marzorati, D., Cairo, E., Colombo, D., Ferretti, A., Bagliani, A., Mantica, S.: 3D geomechanics in UGS projects: a comprehensive study in northern Italy. In: Proceedings of the 44th US Rock Mechanics Symposium, Salt Lake City (UT) (2010)
9. Cuthill, E., McKee, J.: Reducing the bandwidth of sparse symmetric matrices. In: Proceedings of the 1969 24th National Conference, pp. 157–172. ACM, New York, NY, USA (1969)
10. Ferronato, M., Janna, C., Gambolati, G.: Mixed constraint preconditioning in computational contact mechanics. Comp. Methods Appl. Mech. Engrg. 197(45–48), 3922–3931 (2008)
11. Janna, C., Comerlati, A., Gambolati, G.: A comparison of projective and direct solvers for finite elements in elastostatics. Advances in Engineering Software 40(8), 675–685 (2009)
12. Keller, C., Gould, N.I.M., Wathen, A.J.: Constraint preconditioning for indefinite linear systems. SIAM Journal on Matrix Analysis and Applications 21, 1300–1317 (2000)
13. Kolotilina, L.Y., Nikishin, A.A., Yeremin, A.Y.: Factorized sparse approximate inverse preconditionings IV. Simple approaches to rising efficiency. Numer. Lin. Alg. Appl. 6, 515–531 (1999)
14. Kolotilina, L.Y., Yeremin, A.Y.: Factorized sparse approximate inverse preconditionings I. Theory. SIAM J. Matrix Anal. 14, 45–58 (1993)
15. Lukšan, L., Vlček, J.: Indefinitely preconditioned inexact Newton method for large sparse equality constrained nonlinear programming problems. Numerical Linear Algebra with Applications 5, 219–247 (1998)
16. Perugia, I., Simoncini, V.: Block-diagonal and indefinite symmetric preconditioners for mixed finite element formulations. Numerical Linear Algebra with Applications 7, 585–616 (2000)
17. Teatini, P., Ferronato, M., Gambolati, G., Baù, D., Putti, M.: Anthropogenic Venice uplift by seawater pumping into a heterogeneous aquifer system. Water Resour. Res. 46 (2010)