A new parallel sparse direct solver: presentation and numerical experiments in large-scale structural mechanics parallel computing

I. Guèye (a), S. El Arem (b), F. Feyel (a), F.-X. Roux (a), G. Cailletaud (b)

(a) ONERA - Centre de Châtillon, 29, avenue de la Division Leclerc, 92322 Châtillon Cedex, France. Email: [email protected]
(b) Centre des Matériaux P. M. FOURT, MINES ParisTech - UMR CNRS 7633, B.P. 87, 91003 Evry Cedex, France. Email: [email protected]

Preprint submitted to Elsevier, 25 February 2011

Abstract

The main purpose of this work is to present a new parallel direct solver: the Dissection solver. It is based on an LU factorization of the sparse matrix of the linear system and automatically detects and properly handles the zero-energy modes, which is important when dealing with DDM. A performance evaluation and comparisons with other direct solvers (MUMPS, DSCPACK) are also given for both sequential and parallel computations. Results of numerical experiments with a two-level parallelization of large-scale structural analysis problems are also presented: FETI is used for the parallelization of the global problem and Dissection for the local multithreading. In this framework, the largest problem we have solved involves an elastic solid composed of 400 subdomains running on 400 computing nodes (3 200 cores) and containing about 165 million dof. The computation of one single iteration consumes less than 20 minutes of CPU time. Several comparisons with MUMPS are given for the numerical solution of large-scale linear systems on a massively parallel cluster: the strengths and weaknesses of this new solver are highlighted.

Keywords: DDM, linear sparse direct solver, finite element method, FETI, nested dissection, structural mechanics.

1. Introduction

Within the framework of the Finite Element Method (FEM) for the numerical modeling of structural mechanics problems, the solution of the arising linear systems of equations usually represents the single largest part of the overall computing cost (CPU time and memory requirements) in three-dimensional implicit simulations. Moreover, the discretization of the Partial Differential Equations (PDE) often leads to difficult-to-solve systems of equations since, in the modern design and simulation of industrial structures, it has become routinely essential to consider nonlinearities of various origins, strong heterogeneities, tortuous geometries and complex boundary conditions.


The arising linear systems are large, ill-conditioned and may have millions of unknowns, especially when the aim is to model complex industrial structures at full scale.

The resort to Domain Decomposition Methods (DDM) has become natural because of the flexibility they offer to solve both linear and nonlinear large systems of equations by dividing the structure into many sub-structures (subdomains) and using multiple processing elements simultaneously. These methods are based on quite simple and intuitive ideas: a large problem is reduced to a collection of smaller problems that are computationally easier to solve than the undecomposed problem, most or all of which can be solved independently and concurrently. Thus, divide and conquer methods are based on the splitting of the physical domain of the PDE into smaller subdomains forming a partition of the original domain, and by design are suited for implementation on parallel computer architectures. These methods have also shown flexibility in treating complex geometries and heterogeneities in PDE even on serial computers.

Parallelism has been employed for many years, mainly in high-performance computing, but interest in it has grown lately due to the physical constraints preventing frequency scaling. Increasingly, parallel processing is seen as the only cost-effective method for the fast solution of computationally large and data-intensive problems. Although divide and conquer methods such as the sub-structuring method in structural engineering predate DDM, interest in DDM for PDE only grew after the development of parallel computer architectures. The decomposition may enter at the continuous level, where different physical models may be used in different regions, at the discretization level, where it may be convenient to employ different approximation methods in different regions, or in the solution of the algebraic systems arising from the approximation of the PDE. These three aspects are very often interconnected in practice [1]. The reader is referred to more extensive monographs [1, 2] and surveys [3, 4, 5, 6] devoted to DDM, with a strong emphasis on the algebraic and mathematical aspects.

Based on a dual approach to introduce the continuity conditions at the interface between subdomains, the Finite Element Tearing and Interconnecting (FETI) method is the most commonly used non-overlapping DDM [7, 4, 8, 5]. It is a dual Schur complement method in which preconditioned conjugate gradient iterations are applied to find the interface forces satisfying the interface displacement compatibility. FETI is a robust and suitable method for structural mechanics problems. Together with the Balancing Domain Decomposition method [9, 3], FETI is among the first non-overlapping DDM that demonstrated numerical scalability with respect to both the mesh and subdomain sizes. However, studies carried out on large numerical tests showed that its effectiveness decreases beyond a few hundred subdomains. Thus, if we aim to split a large-scale model into a reasonable number of subdomains, the local systems to solve also become huge. Moreover, with the evolution of microprocessor technology, we are

witnessing a rapid development and spread of new multi-core architectures (in terms of number of processors, data exchange bandwidth between processors, parallel library efficiency, processor clock frequencies), which are deeply linked to the growing importance of DDM in intensive scientific computation. The essential interest and strength of multi-core solutions is to enable the simultaneous execution of threads on various cores [10].

The main objective of the current work is to present a new parallel direct solver for large sparse linear systems: the Dissection solver [11]. This solver is based on an LU factorization of the sparse matrix. It also automatically detects and properly handles the zero-energy modes, which is important when dealing with DDM. A performance evaluation and comparisons with other direct solvers (MUMPS, DSCPACK, ...) are also given for both sequential and parallel computations.

We also present the latest results of the parallel computations carried out at the Centre des Matériaux, MINES ParisTech, to measure the performance obtained with the ZéBuLoN Finite Element Analysis (FEA) code on the JADE cluster of CINES (1) (23 040 cores, 237.80 TFlops theoretical peak performance, 267.88 TFlops maximal LINPACK performance achieved, 18th in the TOP500 ranking of June 2010). Many FE numerical experiments have been carried out using a two-level parallelization: the FETI solver for the global problem and the Dissection solver for the local problem.

First, a speed-up study is carried out to determine the optimal machine configuration for a given FE problem size (number of subdomains, number of degrees of freedom (dof) per subdomain, number of processes per machine). A parametric study is thus presented to show how many processes should be run on a single computer (with two processors of four cores each and 32 GB of shared RAM) for an optimized two-level parallelization. In this study, an optimal configuration is to be found, because there is an advantage in splitting the problem into a large number of subdomains (each subdomain becomes smaller), but the number of processes running on the same computing node then increases. The splitting advantage is thus limited by the progressive disappearance of the local parallelism, and also by the loss of efficiency of the FETI method when the size of the elementary problem becomes too small. For a structural problem of a given size (about 3.3 million unknowns in this study), the influence of two parameters is discussed: the number of subdomains (FETI effect) and the number of computing nodes used (multithreading effect).

Finally, we present the results of the scalability study performed to identify the size of the largest problem that can be solved on the JADE cluster. The global problem is a train-like succession of cubes (subdomains). Each subdomain is treated on a separate machine, taking advantage of the maximum possible multithreading. Two cases are considered, depending on the presence or absence of zero-energy modes. The largest problem we have solved involves an elastic solid composed of 400 subdomains running on 400 computing nodes and containing 164 754 603 dof. The computation of one single iteration consumes around 20 minutes of CPU time.

(1) Centre Informatique National de l'Enseignement Supérieur, Montpellier, France.



2. Parallel direct solver for large sparse linear systems

Solving sparse linear systems Mx = b with direct methods is usually based on Gaussian elimination. Rather than handling the system directly, the matrix M = (Mij) is factorized via an LU decomposition, where L = (Lij) is a lower triangular matrix and U = (Uij) is an upper triangular matrix. This factorization relies mainly on three steps:
– an analysis step, which computes a reordering and a symbolic factorization of the matrix. The reordering techniques are used to minimize the fill-in entries of the matrix during the numerical factorization and to exhibit as many independent calculations as possible. This step produces a reordering of the matrix and an elimination tree. The elimination tree is then used to carry out the subsequent steps;
– a numerical step, which determines the lower and upper triangular factors of the matrix according to the elimination tree previously produced;
– a solution step, where the numerical solution of the system is obtained by solving the lower and upper triangular systems resulting from the factorization (forward elimination and backward substitution). For linear problems with a constant coefficient matrix and changing right-hand sides, this step accounts for nearly all of the solution time of the direct method, since the factorized matrix is computed only once (see the sketch below).

In recent years, many parallel sparse direct solvers such as DSCPACK [12] and MUMPS [13] have been developed and have proved to be robust, reliable and efficient for a wide range of practical problems. In this section, we present a new parallel direct solver using a direct method based on the LU factorization of the sparse linear system matrix. The matrix can be symmetric or non-symmetric. The ability of a solver to detect singularities is essential in DDM, for the operator itself or when automatically building optimal preconditioners. When factorizing singular linear systems, a dedicated strategy automatically and properly handles the zero-energy modes.
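To make these three steps concrete, here is a minimal sketch of the factorize-once/solve-many pattern mentioned above, using SciPy's SuperLU wrapper as a stand-in for a generic sparse direct solver (this is only an illustration on an arbitrary test matrix, not the Dissection solver itself):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 10_000
# 1-D Laplacian-like sparse matrix, a simple stand-in for a FE stiffness matrix
M = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")

# Analysis + numerical factorization (reordering, symbolic and numeric LU), done once
lu = spla.splu(M)

# Solution step: forward/backward substitutions, repeated for many right-hand sides
for increment in range(5):          # e.g. successive load increments
    b = np.random.rand(n)
    x = lu.solve(b)                 # cheap compared with the factorization
```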

2.1. Serial implementation

2.1.1. The ordering strategy

The ordering step is based on a nested dissection method [14, 15]. This technique allows more parallelism in the factorization than orderings based on minimum degree techniques [16] and often produces better results. The nested dissection approach is based on a recursive bisection of the graph of the matrix M to be factorized. A first bisection is performed by selecting a set of vertices forming a separator. This separator is then removed from the original graph, which generates a partition into two disconnected subgraphs. The separator is chosen so that its size is as small as possible and so that the resulting subgraphs have equivalent sizes. Each of these subgraphs is then bisected recursively following the same principle until the size of the generated substructures is sufficiently small. The bisection is managed by METIS [17], which is primarily a tool for partitioning graphs or meshes. It aims to split a given graph into subgraphs of similar size while minimizing the separator size, in order to reduce the required communications.

To illustrate the principle of the nested dissection technique, we consider a small sparse linear system whose matrix M has the structure shown in Figure 1(a). The graph (or mesh nodes) representing the matrix M is given in Figure 1(b).

Fig. 1: Example of a sparse linear system: (a) structure of the matrix M; (b) graph (or mesh nodes).

In this example, we first perform a bisection by selecting the set of unknowns {2, 7, 12}. With these nodes, we form a separator Γ_1^2. This separator is removed from the graph, which generates a partition into two disconnected subgraphs that are subsequently split following the same technique. We obtain the separators Γ_1^1 and Γ_2^1, which are respectively formed by the vertices {5, 6} and {8, 9}. Finally, we generate the substructures I_i with the sets of unknowns {0, 1}, {10, 11}, {3, 4} and {13, 14} (Figure 2).

Fig. 2: Recursive sub-structuring by bisection
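To make the recursive bisection concrete, the following purely didactic Python sketch reproduces this ordering on the 3 x 5 grid of Figure 1 with a simple geometric bisection (in the actual solver the bisection is delegated to METIS; the function and parameter names below are illustrative):

```python
def nested_dissection(rows, cols, min_size=2):
    """Return an elimination ordering of a rows x cols grid (row-major numbering):
    substructure interiors are listed first, separators last."""
    def node(i, j):
        return i * cols + j

    def recurse(i0, i1, j0, j1):
        n_rows, n_cols = i1 - i0, j1 - j0
        if n_rows * n_cols <= min_size:                  # small enough: leaf substructure
            return [node(i, j) for i in range(i0, i1) for j in range(j0, j1)]
        if n_cols >= n_rows:                             # cut along the longer dimension
            jm = (j0 + j1) // 2                          # separator column
            left = recurse(i0, i1, j0, jm)
            right = recurse(i0, i1, jm + 1, j1)
            sep = [node(i, jm) for i in range(i0, i1)]
        else:
            im = (i0 + i1) // 2                          # separator row
            left = recurse(i0, im, j0, j1)
            right = recurse(im + 1, i1, j0, j1)
            sep = [node(im, j) for j in range(j0, j1)]
        return left + right + sep                        # separator eliminated last

    return recurse(0, rows, 0, cols)

# 3 x 5 grid of Figure 1: prints
# [0, 1, 10, 11, 5, 6, 3, 4, 13, 14, 8, 9, 2, 7, 12]
# i.e. the interiors {0,1}, {10,11}, {3,4}, {13,14} before the separators
# {5,6}, {8,9}, with the root separator {2,7,12} last.
print(nested_dissection(3, 5))
```

Eliminating the unknowns in this order leads to the supernodal elimination tree of Figure 3 and to the reordered block structure of Figure 4.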


The nested dissection method is an easy-to-parallelize algorithm based on the old algorithmic paradigm of "divide and conquer". It generates balanced supernodal elimination trees whose supernodes are sets of unknowns (see Figure 3). These trees reflect the dependency of the unknowns during elimination and therefore make the parallel numerical factorization easier.

Fig. 3: Supernodal elimination tree

Using this supernodal elimination tree, the system matrix resulting from the original problem can be written as shown in Figure 4:

Fig. 4: Reordered Matrix M


The block matrices $M_{I_i I_i}$ and $M_{\Gamma_j^l \Gamma_j^l}$ ($l > 0$) on the diagonal correspond to the unknowns of the same level. It should be noted that each diagonal block $M_{I_i I_i}$ has a sparse tridiagonal structure, whereas a diagonal block $M_{\Gamma_j^l \Gamma_j^l}$ at level $l$ is a full matrix. All the off-diagonal sub-matrices $M_{I_i \Gamma_j^l}$ and $M_{\Gamma_j^l I_i}$ are sparse and correspond to the connections between the unknowns of subdomain $I_i$ and those belonging to the separator $\Gamma_j^l$. The off-diagonal blocks $M_{\Gamma_i^l \Gamma_j^m}$ ($l \neq m$) are full and describe the connections between the unknowns of separator $\Gamma_i^l$ and those of separator $\Gamma_j^m$. With the overall structure of the matrix M illustrated in Figure 4, the resulting linear system can be written as:

\[
M x =
\begin{pmatrix}
M_{II} & M_{I L_1} & M_{I L_2} \\
M_{L_1 I} & M_{L_1 L_1} & M_{L_1 L_2} \\
M_{L_2 I} & M_{L_2 L_1} & M_{L_2 L_2}
\end{pmatrix}
\begin{pmatrix} x_I \\ x_{L_1} \\ x_{L_2} \end{pmatrix}
=
\begin{pmatrix} b_I \\ b_{L_1} \\ b_{L_2} \end{pmatrix}
= b
\tag{1}
\]

where the sub-matrices $M_{II}$ and $M_{L_k L_k}$ are block diagonal matrices formed respectively by all the blocks $M_{I_i I_i}$ and $M_{\Gamma_j^k \Gamma_j^k}$. All the off-diagonal sub-matrices $M_{I L_k}$, $M_{L_k I}$, $M_{L_k L_l}$ and $M_{L_l L_k}$ are formed respectively by the blocks $M_{I_i \Gamma_j^k}$, $M_{\Gamma_j^k I_i}$, $M_{\Gamma_j^k \Gamma_m^l}$ and $M_{\Gamma_m^l \Gamma_j^k}$.

2.1.2. The block factorization

The solution of the linear system of equations Mx = b can be obtained in the following way: first, the matrix M (Equation (1)) is factorized, and then the solution is found by forward and backward substitutions. This two-step procedure is particularly appropriate for problems involving matrices with constant coefficients. Before performing this factorization, the original matrix M is carefully scaled so that general types of linear systems can be handled. The factorization consists of applying the following block LU decomposition to the matrix:

\[
M = LU =
\begin{pmatrix}
M_{II} & 0 & 0 \\
M_{L_1 I} & S_{L_1 L_1} & 0 \\
M_{L_2 I} & S_{L_2 L_1} & S_{L_2 L_2}
\end{pmatrix}
\begin{pmatrix}
I & M_{II}^{-1} M_{I L_1} & M_{II}^{-1} M_{I L_2} \\
0 & I & S_{L_1 L_1}^{-1} S_{L_1 L_2} \\
0 & 0 & I
\end{pmatrix}
\tag{2}
\]

where I is the identity matrix and $S_{L_1 L_1}$, $S_{L_1 L_2}$, $S_{L_2 L_1}$ and $S_{L_2 L_2}$ are Schur complements defined by:

\[
\begin{aligned}
S_{L_1 L_1} &= M_{L_1 L_1} - M_{L_1 I} M_{II}^{-1} M_{I L_1}, \\
S_{L_1 L_2} &= M_{L_1 L_2} - M_{L_1 I} M_{II}^{-1} M_{I L_2}, \\
S_{L_2 L_1} &= M_{L_2 L_1} - M_{L_2 I} M_{II}^{-1} M_{I L_1}, \\
S_{L_2 L_2} &= \bar{S}_{L_2 L_2} - S_{L_2 L_1} S_{L_1 L_1}^{-1} S_{L_1 L_2},
\qquad \text{with } \bar{S}_{L_2 L_2} = M_{L_2 L_2} - M_{L_2 I} M_{II}^{-1} M_{I L_2}.
\end{aligned}
\tag{3}
\]

Note that the computation of the Schur complements in Equation (3) introduces fill-in in the block diagonal matrices $M_{L_1 L_1}$ and $M_{L_2 L_2}$, as well as in $M_{L_1 L_2}$ and $M_{L_2 L_1}$. Referring to Figure 4 and to the first equation of (3), the matrix $S_{L_1 L_1}$ is obtained by computing

\[
S_{L_1 L_1} =
\begin{pmatrix}
S_{\Gamma_1^1 \Gamma_1^1} & 0 \\
0 & S_{\Gamma_2^1 \Gamma_2^1}
\end{pmatrix}
\tag{4}
\]

where

\[
\begin{aligned}
S_{\Gamma_1^1 \Gamma_1^1} &= M_{\Gamma_1^1 \Gamma_1^1} - M_{\Gamma_1^1 I_1} M_{I_1 I_1}^{-1} M_{I_1 \Gamma_1^1} - M_{\Gamma_1^1 I_2} M_{I_2 I_2}^{-1} M_{I_2 \Gamma_1^1}, \\
S_{\Gamma_2^1 \Gamma_2^1} &= M_{\Gamma_2^1 \Gamma_2^1} - M_{\Gamma_2^1 I_3} M_{I_3 I_3}^{-1} M_{I_3 \Gamma_2^1} - M_{\Gamma_2^1 I_4} M_{I_4 I_4}^{-1} M_{I_4 \Gamma_2^1}.
\end{aligned}
\tag{5}
\]

Likewise, the Schur complements $S_{L_1 L_2}$, $S_{L_2 L_1}$ and $\bar{S}_{L_2 L_2}$ are obtained by computing the blocks $S_{\Gamma_j^1 \Gamma_1^2}$ and $S_{\Gamma_1^2 \Gamma_j^1}$ using

\[
\begin{aligned}
S_{\Gamma_1^1 \Gamma_1^2} &= M_{\Gamma_1^1 \Gamma_1^2} - M_{\Gamma_1^1 I_1} M_{I_1 I_1}^{-1} M_{I_1 \Gamma_1^2} - M_{\Gamma_1^1 I_2} M_{I_2 I_2}^{-1} M_{I_2 \Gamma_1^2}, \\
S_{\Gamma_2^1 \Gamma_1^2} &= M_{\Gamma_2^1 \Gamma_1^2} - M_{\Gamma_2^1 I_3} M_{I_3 I_3}^{-1} M_{I_3 \Gamma_1^2} - M_{\Gamma_2^1 I_4} M_{I_4 I_4}^{-1} M_{I_4 \Gamma_1^2}
\end{aligned}
\tag{6}
\]

and

\[
\begin{aligned}
S_{\Gamma_1^2 \Gamma_1^1} &= M_{\Gamma_1^2 \Gamma_1^1} - M_{\Gamma_1^2 I_1} M_{I_1 I_1}^{-1} M_{I_1 \Gamma_1^1} - M_{\Gamma_1^2 I_2} M_{I_2 I_2}^{-1} M_{I_2 \Gamma_1^1}, \\
S_{\Gamma_1^2 \Gamma_2^1} &= M_{\Gamma_1^2 \Gamma_2^1} - M_{\Gamma_1^2 I_3} M_{I_3 I_3}^{-1} M_{I_3 \Gamma_2^1} - M_{\Gamma_1^2 I_4} M_{I_4 I_4}^{-1} M_{I_4 \Gamma_2^1},
\end{aligned}
\tag{7}
\]

and finally

\[
\bar{S}_{L_2 L_2} = M_{\Gamma_1^2 \Gamma_1^2} - \sum_{i=1}^{4} M_{\Gamma_1^2 I_i} M_{I_i I_i}^{-1} M_{I_i \Gamma_1^2}.
\tag{8}
\]
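As an illustration of these condensation formulas, the following sketch computes a Schur complement of the form S = M_GG − M_GI M_II^{-1} M_IG for a small sparse test matrix with SciPy. It is a toy example under stated assumptions, not the Dissection implementation; SuperLU merely stands in for the factorization of the interior blocks:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def schur_complement(M, interior, separator):
    """Condense the sparse matrix M onto the 'separator' unknowns:
    S = M_GG - M_GI M_II^{-1} M_IG  (cf. Equations (5)-(8))."""
    M = M.tocsr()
    M_II = M[interior, :][:, interior].tocsc()
    M_IG = M[interior, :][:, separator].toarray()
    M_GI = M[separator, :][:, interior]
    M_GG = M[separator, :][:, separator].toarray()   # the separator block fills in
    lu = spla.splu(M_II)                             # factorize the interior block once
    X = lu.solve(M_IG)                               # M_II^{-1} M_IG by forward/backward solves
    return M_GG - M_GI @ X

# 1-D Laplacian on 7 unknowns; unknown 3 plays the role of the separator
A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(7, 7), format="csr")
S = schur_complement(A, interior=[0, 1, 2, 4, 5, 6], separator=[3])
print(S)   # [[0.5]]: the 1 x 1 Schur complement of the middle unknown
```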

In all the operations above, multiplications by the inverse of a matrix block $M_{I_i I_i}$ are replaced by two solution steps using the Crout factorization $M_{I_i I_i} = L_{I_i I_i} D_{I_i I_i} U_{I_i I_i}$ (or $L_{I_i I_i} D_{I_i I_i} L_{I_i I_i}^T$ in the symmetric case). The matrix products with $M_{\Gamma_j^l I_i}$ are performed as sparse matrix operations.

2.1.3. The forward and backward substitutions

The solution of the system Mx = b can be determined by solving the triangular systems Ly = b (forward elimination) and Ux = y (backward substitution), where L and U are obtained from the block factorization of the sparse matrix M (Equation (2)). More explicitly, the solution x is obtained by successively solving the following systems of equations:

– during the forward step:
\[
\begin{cases}
M_{II}\, y_I = b_I, \\
S_{L_1 L_1}\, y_{L_1} = b_{L_1} - M_{L_1 I}\, y_I, \\
S_{L_2 L_2}\, x_{L_2} = b_{L_2} - M_{L_2 I}\, y_I - S_{L_2 L_1}\, y_{L_1},
\end{cases}
\tag{9}
\]

– during the backward step:
\[
\begin{cases}
S_{L_1 L_1}\, x_{L_1} = S_{L_1 L_1}\, y_{L_1} - S_{L_1 L_2}\, x_{L_2}, \\
M_{II}\, x_I = M_{II}\, y_I - M_{I L_1}\, x_{L_1} - M_{I L_2}\, x_{L_2}.
\end{cases}
\tag{10}
\]

The first equation of (9) and the second equation of (10) can readily be solved in parallel, since $M_{II}$ is a block diagonal matrix. Since the full matrix $S_{L_2 L_2}$ is distributed to all processors, the third equation of (9) is solved on every processor. The solutions of the equations involving the block matrix $S_{L_1 L_1}$ are obtained by solving an interface problem on the separators.
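A compact dense sketch of this two-level block solve is given below: it builds the Schur complements of Equation (3) with NumPy for an arbitrary symmetric positive definite test matrix and then applies the forward and backward steps of Equations (9) and (10). It is illustrative only; the actual solver works on sparse blocks distributed over many levels and processors:

```python
import numpy as np

rng = np.random.default_rng(0)
n_I, n_L1, n_L2 = 6, 3, 2
n = n_I + n_L1 + n_L2
A = rng.standard_normal((n, n))
M = A @ A.T + n * np.eye(n)                 # SPD test matrix, hence nonsingular
b = rng.standard_normal(n)

I, L1, L2 = slice(0, n_I), slice(n_I, n_I + n_L1), slice(n_I + n_L1, n)
solve = np.linalg.solve

# Schur complements of Equation (3)
S11 = M[L1, L1] - M[L1, I] @ solve(M[I, I], M[I, L1])
S12 = M[L1, L2] - M[L1, I] @ solve(M[I, I], M[I, L2])
S21 = M[L2, L1] - M[L2, I] @ solve(M[I, I], M[I, L1])
S22 = (M[L2, L2] - M[L2, I] @ solve(M[I, I], M[I, L2])) - S21 @ solve(S11, S12)

# Forward step, Equation (9)
y_I  = solve(M[I, I], b[I])
y_L1 = solve(S11, b[L1] - M[L1, I] @ y_I)
x_L2 = solve(S22, b[L2] - M[L2, I] @ y_I - S21 @ y_L1)

# Backward step, Equation (10)
x_L1 = solve(S11, S11 @ y_L1 - S12 @ x_L2)
x_I  = solve(M[I, I], M[I, I] @ y_I - M[I, L1] @ x_L1 - M[I, L2] @ x_L2)

x = np.concatenate([x_I, x_L1, x_L2])
print(np.allclose(M @ x, b))                # True: the block solve reproduces M x = b
```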

2.1.4. Taking into account singular linear systems

Among all the available sequential and parallel direct solvers, only a few can automatically and properly handle the zero-energy modes associated with singular linear systems. These singularities have either physical or geometrical origins, or may appear when splitting the initial domain into substructures. We aim to implement a parallel direct solver that takes such systems into account.

The approach used to compute the zero-energy modes consists of first detecting local singularities during the block factorization of M. We start the search process on the matrix $M_{II}$ associated with the substructures. Then we progress up to the root by applying the same process to the blocks $S_{L_k L_k}$ at levels $k > 0$. At each level, we check whether near-zero pivots appear when performing the LDU factorization of a block $M_{I_i I_i}$ or $S_{\Gamma_j^l \Gamma_j^l}$. If so, we defer the treatment of the equation in question to the end of the factorization of M. At the end of the factorization, we obtain a list of near-zero pivots, which are candidates to be zero-energy modes of the global matrix M.

To compute the zero-energy modes, we first condense the global system on the local singularities to obtain a small Schur complement $S_s$. Then, we perform a Gaussian elimination with full pivoting on $S_s$ and check whether zero pivots are found. If no zero pivot is found, the matrix M is non-singular. If a number e of zero pivots is found, these pivots correspond to the actual zero-energy modes. A basis N of the null space is built from the e corresponding rows and columns. Finally, the general solution of the sparse linear system Mx = b is written in the form $x = M^{+} b + N\alpha$, where $\alpha \in \mathbb{R}^e$ is a vector of e arbitrary entries and $M^{+} b$ is a particular solution of the linear system.
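The following small dense sketch mimics this null-space construction on a free-free 1-D bar stiffness matrix. All names and tolerances are illustrative assumptions: the list of suspicious equations is chosen by hand instead of being detected during the factorization, and an SVD-based kernel computation stands in for the full-pivoting Gaussian elimination on Ss:

```python
import numpy as np

# Free-free 1-D bar stiffness: singular, with one rigid-body (zero-energy) mode
K = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])

# Suppose the factorization flagged the last equation as a near-zero pivot;
# in Dissection this list is built automatically during the LDU factorization.
s = [3]            # suspicious equations (deferred pivots)
r = [0, 1, 2]      # equations eliminated normally

K_rr, K_rs = K[np.ix_(r, r)], K[np.ix_(r, s)]
K_sr, K_ss = K[np.ix_(s, r)], K[np.ix_(s, s)]

# Condense the global system onto the suspicious equations
Ss = K_ss - K_sr @ np.linalg.solve(K_rr, K_rs)

# Zero pivots of Ss correspond to actual zero-energy modes (SVD with an
# absolute tolerance instead of full-pivoting elimination, for brevity)
_, sigma, Vt = np.linalg.svd(Ss)
zero = sigma < 1e-10 * np.linalg.norm(K, np.inf)
Zs = Vt[zero].T                                   # kernel of Ss (e columns)

# Lift the kernel of Ss to a null-space basis N of the full matrix K
N = np.zeros((K.shape[0], Zs.shape[1]))
N[s, :] = Zs
N[r, :] = -np.linalg.solve(K_rr, K_rs @ Zs)

print(np.allclose(K @ N, 0.0))    # True: N spans the rigid-body mode
# General solution of K x = b (b compatible): x = x_particular + N @ alpha
```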


2.2. Implementation of the parallel solver

In this section, the full algorithm for factorizing the matrix of the problem stated in Equation (1) is described. This step is most profitable when considering, for example, time-dependent simulations with a constant coefficient matrix: the cost of the factorization, which can be a few orders of magnitude higher than that of one solution step, is easily amortized as the number of solution steps increases. Referring to the decomposed matrix shown in Equation (2), the factorization phase can be stated as follows:

1. Local factorization, for i = 1, ..., ndoms:
   – factorize $M_{I_i I_i} = L_{I_i I_i} D_{I_i I_i} U_{I_i I_i}$,
   – compute the local contributions $M_{\Gamma_j^l I_i} M_{I_i I_i}^{-1} M_{I_i \Gamma_k^m}$.
2. Static condensation, for l = 1, ..., L_r − 1:
   – compute and invert the Schur complements $S_{\Gamma_j^l \Gamma_j^l}$, using Equation (5),
   – compute the Schur complements $S_{\Gamma_j^l \Gamma_k^m}$ and $S_{\Gamma_k^m \Gamma_j^l}$, using Equations (6)-(7).
3. Compute the Schur complement $\bar{S}_{L_r L_r}$, using Equation (8).
4. Compute and invert the Schur complement $S_{\Gamma_1^{L_r} \Gamma_1^{L_r}}$.

All these steps can be performed in parallel. We also alternate, step by step, between POSIX threads and OpenMP, since they are compatible and can consequently be combined easily. The strategy adopted at each step is as follows (see the sketch below):
– if the number of supernodes is smaller than the number of available cores, we use BLAS-3 matrix multiplication routines [18] optimized with OpenMP;
– otherwise, POSIX threads are created.
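As a purely schematic illustration of this level-by-level parallelism (the actual implementation relies on POSIX threads and OpenMP-threaded BLAS-3 kernels, not on Python), the supernodes of a given level of the elimination tree of Figure 3 are independent and can be dispatched to a pool of workers, with a synchronization point between levels:

```python
from concurrent.futures import ThreadPoolExecutor

def factorize_supernode(supernode):
    # placeholder for the local LDU factorization and Schur complement updates
    print(f"factorizing {supernode}")

# bottom-up levels of the elimination tree of Figure 3
levels = [
    ["I1", "I2", "I3", "I4"],        # independent substructure interiors
    ["Gamma_1^1", "Gamma_2^1"],      # level-1 separators
    ["Gamma_1^2"],                   # root separator
]

with ThreadPoolExecutor(max_workers=4) as pool:
    for level in levels:
        # supernodes of the same level are processed concurrently; completing
        # the loop body acts as a synchronization point before the next level
        list(pool.map(factorize_supernode, level))
```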

2.3. Performance evaluation

The parallel direct solver described above has been implemented in the ZéBuLoN FEA code. To compare its performance with that of other direct solvers, simulations of 3-D linear elasticity problems have been performed on a bi-processor Intel Quad-Core Xeon X5460 64-bit machine, with 8 cores, 32 GB of memory and a 3.16 GHz clock frequency.

2.3.1. The sequential performance

Figures 5 and 6 provide a comparison of the CPU execution times of the Dissection (the solver implemented here), DSCPACK and MUMPS solvers. Both DSCPACK and MUMPS are based on a multifrontal approach. The main weakness of DSCPACK is that it does not handle singular systems, while MUMPS is not very robust for linear systems arising from very heterogeneous floating substructures. Sparse Direct and Frontal are two additional direct solvers implemented in ZéBuLoN that are able to automatically detect zero-energy modes.

Fig. 5: First phase of the comparison of CPU execution times: CPU time (s) versus number of degrees of freedom for the Dissection, Sparse Direct and Frontal solvers.

We can observe in Figure 5 that the performance of the older solution methods (the Sparse Direct and Frontal solvers) breaks down when the dimension of the linear system exceeds a few tens of thousands of dof, while the results obtained with the Dissection solver remain much more favorable.


Fig. 6: Second phase of the comparison of CPU execution times: CPU time (s) versus number of degrees of freedom for the DSCPACK, MUMPS and Dissection solvers.

In Figure 6, we observe that the performance of DSCPACK is slightly better than that of the Dissection solver. However, we should not neglect the fact that DSCPACK cannot handle the zero-energy modes of the floating substructures. Accordingly, Dissection could become a more profitable solver for FETI methods than the other solvers existing in the ZéBuLoN FEA code.

2.3.2. The multi-threaded performance

In this case, we consider the solution of a linear elasticity problem with 206 763 dof. The multi-threaded performance of Dissection is compared with those of DSCPACK and MUMPS, the BLAS library optimized with OpenMP being selected.

Fig. 7: Speed-up in multi-threading: speed-up versus number of threads for the hybrid Dissection version, the DSCPACK solver and the MUMPS solver.

The results in Figure 7 show that the performance gap between

Dissection and the other solvers observed in sequential computations (using a single thread) is reduced. In part, this shows that the proposed parallelization strategy behaves well. The maximum performance gain achieved is about 2.2 on four cores.

3. Application to large-scale structural analysis problems

After the implementation and validation of the new solver, we sought to determine the conditions of its optimal use. Many possibilities are indeed available for a problem of a given size: we can change the number of subdomains, use more or less multithreading, load the computing nodes more or less heavily, etc. In what follows, two cases are considered since, when solving a problem of size N using p processors, we aim either to:
– reduce the wall time (WT) by increasing p. We expect a quasi-linear reduction of the CPU time; this is the strong scalability, or speed-up, property (a short numerical illustration follows this list):
\[
S_p(N) = \frac{\text{sequential time}}{\text{time on } p \text{ processors}} = \frac{T_1(N)}{T_p(N)}.
\tag{11}
\]
The parallel efficiency is given by:
\[
E_p(N) = \frac{S_p(N)}{p};
\tag{12}
\]
– or increase the problem size by increasing p. This is the (weak) scalability property, which describes how the solution time varies with the number of processors for a fixed problem size per processor (scale-up):
\[
C_p(N) = \frac{\text{sequential time}}{\text{time on } p \text{ processors for the problem of size } pN} = \frac{T_1(N)}{T_p(pN)}.
\tag{13}
\]
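For illustration only, the snippet below evaluates Equations (11) and (12) on made-up timings; the numbers are assumptions, not measurements from this study:

```python
# Hypothetical timings (seconds) used only to illustrate Equations (11)-(12)
T1 = 9_600.0                                  # assumed sequential time for size N
Tp = {8: 1_200.0, 16: 620.0, 32: 330.0}       # assumed times on p processors

for p, t in Tp.items():
    Sp = T1 / t                               # speed-up, Equation (11)
    Ep = Sp / p                               # parallel efficiency, Equation (12)
    print(f"p = {p:3d}   Sp = {Sp:6.2f}   Ep = {Ep:5.2f}")
```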

All the large-scale numerical experiments presented in what follows have been performed on the JADE cluster of CINES. JADE is a parallel scalar supercomputer of 23 040 cores distributed over 2 880 computing nodes. Each node is a bi-processor computer (two SGI Altix ICE 8200EX processors, quad-core Xeon at 3.0 GHz) with 4 GB of RAM per core. The network fabric is a double-plane InfiniBand (IB 4x DDR) network. With its theoretical peak performance of 237.80 TFlops, JADE appears in 18th position in the TOP500 ranking of June 2010.

3.1. Speed-up performance

In this section, we consider the numerical solution of a global problem of constant size: 3 371 544 dof. The structure is fixed at one end (x = 0) and subjected to simple tension at its free end (x = 1). The material is assumed to have a linear elastic behavior, so that the computation involves a single load increment

of one single iteration. Several computations have been performed with Ns subdomains of equal size running on Nc computing nodes (each node has 8 cores). FETI is used as the global solver and Dissection as the local solver. The optimized distribution of the computing tasks is given in Table 1: the multithreading is equal to 8, 4, 2 or 1 when the number of computing nodes Nc equals Ns, Ns/2, Ns/4 or Ns/8, respectively.

  Nc \ Ns |   8    16    32    64   128   256
  --------+-----------------------------------
      8   |   8     4     2     1     -     -
     16   |   -     8     4     2     1     -
     32   |   -     -     8     4     2     1
     64   |   -     -     -     8     4     2
    128   |   -     -     -     -     8     4
    256   |   -     -     -     -     -     8

Tab. 1: Computing tasks distribution (multithreading).

  Nc \ Ns |      8           16          32          64         128         256
  --------+---------------------------------------------------------------------
      8   | 1186 / 1299   367 / 438   179 / 242   172 / 202       -           -
     16   |      -        252 / 311   132 / 189    98 / 160    54 / 93        -
     32   |      -            -        87 / 125    80 / 118    33 / 114    24 / 79
     64   |      -            -           -        55 / 83     27 / 57     20 / 81
    128   |      -            -           -           -        22 / 115    18 / 81
    256   |      -            -           -           -           -        14 / 243

Tab. 2: Solver time Tp (s) / total wall time Tw (s) for the different configurations.

Table 2 gives the FETI solver time Tp and the total wall time Tw for all the considered configurations. We can notice, when reading Table 2 horizontally, that for a given machine configuration (a given number of available processors), Tp decreases with the number of subdomains. This observation (FETI effect) remains valid even for a splitting into 256 subdomains, where the best Tp is obtained (14 s). This effect remains visible even with a total removal of the multithreading effect (when 8 subdomains are considered per computing node: Ns = 8Nc). However, a saturation of this performance can be observed when the number of subdomains (the size of the interface problem) becomes large: for example, with Nc = 128, Tp only goes from 22 s to 18 s although Ns has been doubled (from 128 to 256). Figure 8 shows that the wall time Tw increases when the number of subdomains exceeds 64. In fact, in these cases, the problem loading time becomes significant and even exceeds the solver time. For example, with Ns = 256, we have Tp = 14 s whereas the time spent in


the problem data loading is 162 s. The multithreading effect is visible when reading Table 2 vertically. For a given number of subdomains Ns, the best time is always obtained with one subdomain per computing node (multithreading equal to 8). For example, with a splitting into 64 subdomains, Tp decreases from 172 s to 55 s when the multithreading goes from 1 (minimum) to 8 (maximum). Table 3 summarizes the main speed-up results.

  Nc              |      8        16        32        64       128       256
  ----------------+-----------------------------------------------------------
  dof/subdomain   |  421 443   215 523   110 211    56 355    29 427    15 363
  Niter           |     44        49        62        70        67        68
  Tp (s)          |   1186       252        87        55        22        14
  Sp              |      8      37.62     109.5     171.5     437.2     693.5
  Ep              |      1      2.351     3.422     2.681     3.415     2.709

Tab. 3: Speed-up results: solver time Tp (s), speed-up Sp and parallel efficiency Ep for the different configurations.

It appears that the configuration leading to the optimal parallel performance (342 %) corresponds to a splitting into 32 subdomains distributed over 32 machines. In this case, 256 cores are exploited to achieve a solver time of 87 s.

Fig. 8: Evolution of the FETI time Tp and of the total wall time Tw (s) with the number of subdomains Ns.

3.2. Scalability measurement

To measure this property, we have considered a modular problem composed of identical unit cubes of 421 443 dof each. The problem mesh

is obtained by stacking N unit cubes (N subdomains) one behind the other in a train-like shape (Figure 9).

Fig. 9: Train-like stacking of the subdomains (Subdomain 1, Subdomain 2, ..., Subdomain N along the x axis) with possible zero-energy movements.

  Number of subdomains Ns             |    2      4     10     20     50    100    200    300    400
  Number of dof (millions)            |  0.83   1.66   4.13   8.25  20.60  41.19  82.38 123.56 164.75
  Solver time Tp (s), Dissection      |    -      -    954   1001   1004   1013   1024   1082   1148
  Solver time Tp (s), MUMPS           |    -      -    442    463    465    471    474    483     -
  Total wall time Tw (s), Dissection  |    -      -   1114   1170   1190   1243   1401   1650   2018
  Total wall time Tw (s), MUMPS       |    -      -    561    582    589    614    664    987     -

Tab. 4: Scalability measurements: the number of computing nodes is equal to the number of subdomains. Zero-energy modes are present.

The structure is fixed at its end (x = 0) and subjected to simple tension at its free end (x = N), as shown in Figure 9. Thus, the problem involves (N − 2) floating subdomains. The material is assumed to have a linear elastic behavior, so that the calculation involves a single load increment of one iteration. Each subdomain is assigned to one machine and, consequently, the maximum multithreading is used (which equals 8). A tension test is performed on an increasing mesh size going from 2 to 400 subdomains. As shown by Table 4 and Figure 10, the CPU time of the FETI solver Tp remains below 20 minutes and remarkably constant when the problem size is increased. In Table 4, the increase of the wall time Tw shows the growing influence of the parts of the FEA code (mesh loading for example) that have not yet been parallelized. This has little effect on the nonlinear calculations that we intend to perform (fatigue life, damage and crack propagation simulations in industrial

Fig. 10: Evolution of Tp (s) and Tw (s) with Ns: Dissection versus MUMPS (FETI time Tp and total wall time Tw for both local solvers).

structures) because they will involve tens of loading cycles (hundreds of increments) once the data loading has been carried out, instead of the single increment of the current study. The largest numerical simulation performed in this study, with 400 subdomains, involves a structure of about 165 million dof.

It can be observed from Table 4 that MUMPS is more than twice as fast as Dissection. With MUMPS as the local solver and 300 subdomains, Tp is only 483 s, whereas it is 1148 s when using the Dissection solver. To understand this difference, we have meticulously examined the time needed by each solver (Dissection and MUMPS) per subdomain and have found that:
– in the absence of zero-energy modes (subdomain number 1 in this case study), MUMPS and Dissection require almost the same CPU time, within a maximum difference of 4.3 % (Table 5);
– however, for subdomains with possible zero-energy modes (subdomain 2 for example, which has 6 possible zero-energy modes), the CPU time is more than doubled with Dissection, whereas MUMPS keeps almost the same CPU time in both cases.

Additional development efforts are therefore required to improve the CPU time needed by Dissection to solve singular linear systems where zero-energy modes have to be handled. We have also performed additional numerical experiments (Table 6). In this case, the problem considered is similar to the previous one but with a smaller


  Ns                        |   50    100    300
  --------------------------+-------------------
  Subdomain 1 : Dissection  |  449    451    464
  Subdomain 1 : MUMPS       |  433    438    445
  Subdomain 2 : Dissection  |  979    993    989
  Subdomain 2 : MUMPS       |  449    451    463

Tab. 5: Solver time Tp (s): effect of the detection of floating subdomains.

subdomain size. Four subdomains are assigned to one computing node, but the total number of dof per machine has been kept equal to 421 443. Thus the multithreading is 2 (one subdomain for two cores). We can easily notice that by splitting the global mesh of the problem into smaller subdomains, we have divided the CPU time by up to 4 while using the same number of processors. For example, for the case of 41.19 million dof, the solver CPU time has been divided by 1013/300 = 3.38 with Dissection, and by 474/220 = 2.15 when using MUMPS as the local solver. Also, we can notice from the results presented in Table 6 that the difference in CPU time between Dissection and MUMPS has been reduced to less than 40 %.

  Ns                          |   16     40     80    120    400
  ----------------------------+---------------------------------
  Total dof number (millions) |  1.66   4.13   8.25  12.36  41.19
  Tp (s) : Dissection         |   178    251    283    298    300
  Tp (s) : MUMPS              |   143    207    208    214    220
  Difference (%)              | 24.47  21.26  36.06  39.25  36.36

Tab. 6: Dissection and MUMPS solver times Tp (s) with a multithreading of 2.

4. Conclusion

A new parallel direct solver has been presented in this paper: the Dissection solver. Based on an LU factorization of the sparse matrix of the linear system, Dissection automatically detects and properly handles the zero-energy modes, which is important when dealing with DDM. A performance evaluation and comparisons with other direct solvers (MUMPS, DSCPACK, Sparse Direct, Frontal) have also been given for both sequential and multi-threaded computations.

Results of the two-level parallelization of large-scale structural analysis problems have also been presented. Many numerical experiments have been carried out and have shown that Dissection is more efficient than the older direct solvers (Sparse Direct and Frontal) implemented in ZéBuLoN. We have also noticed that DSCPACK can be slightly more efficient than Dissection. However, given the fact that DSCPACK

is not able to automatically handle the problem of floating subdomains, we think that Dissection could become more profitable for FETI in the ZéBuLoN FEA code.

Some comparisons with MUMPS have also been given for large-scale linear systems arising from FE simulations. In this framework, the largest problem we have solved involves an elastic solid composed of 400 subdomains running on 400 computing nodes and containing 164 754 603 dof. The computation of one single iteration consumes less than 20 minutes of CPU time. We have noticed that MUMPS remains more than twice as fast as Dissection. These numerical experiments have been performed on the JADE cluster of CINES. We think that the main weakness of Dissection is the amount of CPU time it requires to handle the zero-energy modes, and additional work is therefore needed to improve this feature of the Dissection solver. However, since MUMPS is not very robust for linear systems arising from very heterogeneous floating substructures, we think that Dissection could be a good alternative as a local solver for parallel computing with FETI algorithms. Indeed, we plan to perform numerical simulations of fatigue life involving tens of loading cycles (hundreds of loading increments), and of damage and crack propagation in industrial structures, which often consist of several parts made of highly heterogeneous materials.

References

[1] A. Toselli and O. B. Widlund. Domain Decomposition Methods: Algorithms and Theory, volume 34 of Springer Series in Computational Mathematics. Springer, 2005.
[2] T. Mathew. Domain Decomposition Methods for the Numerical Solution of Partial Differential Equations, volume 61 of Springer Series in Computational Mathematics. Springer, 2008.
[3] P. Le Tallec. Domain decomposition methods in computational mechanics. Computational Mechanics Advances, 1(2):121–220, 1994.
[4] C. Farhat and F.-X. Roux. Implicit parallel processing in structural mechanics. Computational Mechanics Advances, 2(1):1–124, 1994.
[5] P. Gosselet and C. Rey. Non-overlapping domain decomposition methods in structural mechanics. Archives of Computational Methods in Engineering, 13(4):515–572, 2006.
[6] T. Chan and T. Mathew. Domain decomposition algorithms. Acta Numerica, 3:61–143, 1994.

[7] C. Farhat and F.-X. Roux. A method of finite element tearing and interconnecting and its parallel solution algorithm. Int. J. Numer. Meth. Engng, 32(6):1205–1227, 1991.
[8] C. Farhat, K. Pierson, and M. Lesoinne. The second generation FETI methods and their application to the parallel solution of large-scale linear and geometrically non-linear structural analysis problems. Computer Methods in Applied Mechanics and Engineering, 184(2-4):333–374, 2000.
[9] J. Mandel. Balancing domain decomposition. Commun. Numer. Meth. Engng., 9(3):233–241, 1993.
[10] I. Guèye. Résolution des grands systèmes linéaires issus de la méthode des éléments finis sur des calculateurs massivement parallèles (in French). PhD thesis, MINES ParisTech, December 2009.
[11] I. Guèye, X. Juvigny, F. Feyel, F.-X. Roux, and G. Cailletaud. A parallel algorithm for direct solution of large sparse linear systems, well suitable to domain decomposition methods. European Journal of Computational Mechanics, 18(7-8):589–605, 2009.
[12] P. Raghavan. DSCPACK home page. www.cse.psu.edu/~raghavan/Dscpack, 2001.
[13] P. R. Amestoy, I. S. Duff, and J.-Y. L'Excellent. Multifrontal parallel distributed symmetric and unsymmetric solvers. Computer Methods in Applied Mechanics and Engineering, 184:501–520, 2000.
[14] A. George. Nested dissection of a regular finite element mesh. SIAM Journal on Numerical Analysis, 10(2):345–363, 1973.
[15] A. George and J. W.-H. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice Hall, 1981.
[16] W. F. Tinney and J. W. Walker. Direct solutions of sparse network equations by optimally ordered triangular factorization. Proceedings of the IEEE, 55(11):1801–1809, 1967.
[17] G. Karypis and V. Kumar. METIS: Unstructured graph partitioning and sparse matrix ordering system. http://www-users.cs.umn.edu/~karypis/metis, 1995.
[18] J. J. Dongarra, I. S. Duff, J. Du Croz, and S. Hammarling. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16:1–17, 1990.
