Parallel implementation of the FETI DDM constraint matrix on top of PETSc for the PermonFLLOP package

Alena Vasatova (1,2), Martin Cermak (1), and Vaclav Hapla (1,2)

(1) IT4Innovations National Supercomputing Center, Ostrava, Czech Republic
(2) Department of Applied Mathematics, FEI, VSB - Technical University of Ostrava, Czech Republic

Preprint submitted on November 15, 2015. The final publication is available at Springer via https://dx.doi.org/10.1007/978-3-319-32149-3_15.

Abstract. This paper deals with the implementation of the FETI non-overlapping domain decomposition method within our new software toolbox PERMON, which combines quadratic programming algorithms and domain decomposition methods. It is built on top of the PETSc framework for numerical computations. In particular, we focus on the parallel implementation of the matrix which manages the connectivity between subdomains within the FETI method. We present the basic idea of our approach, based on processing local and global numberings of the degrees of freedom on subdomain interfaces.

Keywords: PERMON, PermonFLLOP, PETSc, FETI, DDM, constraint matrix, star forest.

1 Introduction

Many real-world problems may be described by partial differential equations (PDEs). To be solved with computers, they have to be discretized, e.g. with the popular Finite Element Method (FEM). This discretization leads to large sparse linear systems of equations. Huge problems not solvable on usual personal computers can be solved only in parallel on supercomputers. Suitable numerical methods, such as domain decomposition methods (DDM) or multigrid, are needed for that. DDM solve the original problem by splitting it into smaller subdomain problems that are independent, allowing natural parallelization. Finite Element Tearing and Interconnecting (FETI) methods [5,4,12,2] form a successful subclass of DDM. They belong to non-overlapping methods and combine iterative and direct solvers [11]. The FETI methods allow highly accurate computations scaling up to tens of thousands of processors and billions of unknowns. In FETI, subdomain stiffness matrices are assembled, factorized and solved independently, whereas continuity of the solution across subdomain interfaces is enforced by separate linear equality constraints.

In our specific FETI subclass called Total FETI (TFETI) [2], Dirichlet boundary conditions are enforced in the same way, too. The goal of this paper is to focus on the parallel implementation of the FETI constraint matrix, whose assembly process is not described in detail elsewhere. Our approach does not need information about neighbouring subdomains. It needs only the local and global numberings of the degrees of freedom (DOFs) on the subdomain interfaces. This approach was implemented as a part of the PermonFLLOP package, which belongs to our set of libraries called PERMON (Parallel, Efficient, Robust, Modular, Object-oriented, Numerical) [10]. The rest of the paper is organized as follows. The TFETI method is briefly described in Section 2, Section 3 describes the structure of the constraint matrix, Section 4 introduces the PERMON toolbox, and Section 5 deals with the implementation of the gluing itself. Finally, Section 6 shows the performance of the proposed approach.

2 TFETI overview

FETI-1 [6,4] is a non-overlapping domain decomposition method based on decomposing the original spatial domain into non-overlapping subdomains. They are "glued together" by Lagrange multipliers which have to satisfy certain equality constraints discussed later. The original FETI-1 method assumes that the Dirichlet conditions are embedded in the usual way into the linear system arising from the FEM discretization. Physically, this means that subdomains whose interfaces intersect the Dirichlet boundary are fixed while the others are kept floating; in linear algebra terms, the corresponding subdomain stiffness matrices are non-singular and singular, respectively. The basic idea of the Total FETI (TFETI) method [2,13,11,15] is to keep all subdomains floating and enforce the Dirichlet boundary conditions by means of the constraint matrix and Lagrange multipliers, similarly to the gluing conditions. This simplifies the implementation of the stiffness matrix pseudoinverse. The key point is that the kernels of the subdomain stiffness matrices are known a priori, have the same dimension, and can be formed without any computation from the mesh data. Furthermore, each local stiffness matrix can be regularized cheaply, and the inverse of the resulting nonsingular matrix is at the same time a pseudoinverse of the original singular one [3]. Let N_p, N_d, N_n, N_c denote the primal dimension, the dual dimension, the null space dimension and the number of processes available for our computation, respectively. The primal dimension is the number of all DOFs, including those resulting from the duplication of interface DOFs due to the non-overlapping domain decomposition. The dual dimension is the total number of constraints. Let us consider a partitioning of the global domain Ω into N_S subdomains Ω^s, s = 1, ..., N_S (N_S ≥ N_c). To each subdomain Ω^s correspond the subdomain stiffness matrix K^s, the subdomain nodal load vector f^s, the matrix R^s whose columns span the null space (kernel) of K^s, and the signed Boolean matrix B^s defining the connectivity of the subdomain s with all its neighbouring subdomains. In case of TFETI, B^s also enforces the Dirichlet boundary conditions. This special matrix is described in more detail in Section 5.

The local objects K^s, f^s, R^s and B^s constitute the global objects

$$
\begin{aligned}
K &= \operatorname{diag}(K^1, \dots, K^{N_S}) \in \mathbb{R}^{N_p \times N_p}, &
R &= \operatorname{diag}(R^1, \dots, R^{N_S}) \in \mathbb{R}^{N_p \times N_n}, \\
B &= [B^1, \dots, B^{N_S}] \in \mathbb{R}^{N_d \times N_p}, &
f &= [(f^1)^T, \dots, (f^{N_S})^T]^T \in \mathbb{R}^{N_p \times 1},
\end{aligned}
$$

where diag means a block-diagonal matrix consisting of the diagonal blocks in parentheses. Note that the columns of R span the kernel of K just as the R^s do for K^s. Let us apply the convex QP duality theory to the primal decomposed problem

$$
\min \tfrac{1}{2} u^T K u - u^T f \quad \text{s.t.} \quad Bu = o, \tag{1}
$$

and let us establish the following notation

$$
F = B K^{\dagger} B^T, \qquad G = R^T B^T, \qquad d = B K^{\dagger} f, \qquad e = R^T f,
$$

where K^† denotes a pseudoinverse of K, satisfying K K^† K = K. We obtain a new QP

$$
\min \tfrac{1}{2} \lambda^T F \lambda - \lambda^T d \quad \text{s.t.} \quad G\lambda = e. \tag{2}
$$

In order to solve problem (2) efficiently, several further reformulations are carried out; they are not presented here due to space limitations, see [2,3,9].
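The passage from (1) to (2) follows the standard duality argument, sketched here for the reader's convenience (see [2,3,9] for a rigorous treatment):

$$
\begin{aligned}
&\mathcal{L}(u,\lambda) = \tfrac{1}{2} u^T K u - u^T f + \lambda^T B u, \qquad
\nabla_u \mathcal{L} = o \;\Rightarrow\; K u = f - B^T \lambda, \\
&\text{solvability of } K u = f - B^T\lambda:\quad R^T (f - B^T \lambda) = o \;\Leftrightarrow\; G\lambda = e, \\
&u = K^{\dagger}(f - B^T \lambda) + R\alpha, \qquad
\text{substitution yields}\quad
\min_{\lambda}\ \tfrac{1}{2}\lambda^T \underbrace{B K^{\dagger} B^T}_{F}\lambda
 - \lambda^T \underbrace{B K^{\dagger} f}_{d}
 \quad\text{s.t.}\quad G\lambda = e.
\end{aligned}
$$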

3 Constraint matrix structure

The FETI method is a non-overlapping DDM, and thus the submeshes of the global mesh are handled completely separately; there are no overlapping cells and no ghost layer. The DOFs on submesh interfaces are duplicated into each intersecting submesh, i.e. each submesh is "complete" and "self-contained". This can be done using just the subdomain surface meshes. Volume meshing and the subsequent FEM matrix assembly can then be done completely separately for each submesh. Let us introduce several DOF numberings, useful for the further discussion:

1. global – a unique global DOF numbering before the DOF duplication connected with the non-overlapping domain decomposition,
2. local – after the decomposition and DOF duplication, the DOFs of each subdomain are numbered starting from 0 independently of other subdomains,
3. interface – similar to local, but only the DOFs residing on subdomain interfaces are numbered.
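As a purely hypothetical illustration of these numberings (it replaces neither Figure 1 nor any PermonCube output), consider a 1D bar with five global DOFs 0-4 decomposed into two subdomains sharing global DOF 2. The corresponding mappings, written as C arrays with names mirroring the l2g, i2l and i2g terminology used in Section 5, could look as follows.

/* Hypothetical 1D example: global DOFs {0,1,2,3,4}, two subdomains sharing
   global DOF 2. Each subdomain has three local DOFs numbered from 0. */

/* subdomain 1: local DOFs {0,1,2} correspond to global DOFs {0,1,2} */
const int l2g_1[] = {0, 1, 2}; /* local -> global */
const int i2l_1[] = {2};       /* interface -> local (the shared DOF only) */
const int i2g_1[] = {2};       /* interface -> global */

/* subdomain 2: local DOFs {0,1,2} correspond to global DOFs {2,3,4} */
const int l2g_2[] = {2, 3, 4};
const int i2l_2[] = {0};
const int i2g_2[] = {2};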

The matrix B mentioned in Section 2 can be split into two parts, the first implementing the Dirichlet conditions and the second implementing the gluing between subdomains,

$$
B = \begin{bmatrix} B_d \\ B_g \end{bmatrix}
  = \begin{bmatrix} B_d^1 & \dots & B_d^{N_S} \\ B_g^1 & \dots & B_g^{N_S} \end{bmatrix}.
$$

To express the connectivity between subdomains, we use the operators described in [7]. The first one is the "local trace" Boolean operator T^s which selects from all DOFs of subdomain Ω^s only those that intersect with the interface. By contrast, (T^s)^T prolongs the interface data to the whole subdomain, setting the values corresponding to the internal DOFs to zero. Data lying on a subdomain interface have to be exchanged with the neighbouring subdomains, leading to the global "assembly" operator A, constructed in the following way. Let two subdomains Ω^i and Ω^j share a common interface, i < j. Let the subdomain Ω^i own a DOF d_i and the subdomain Ω^j own a DOF d_j, both in the interface numbering, while d_i and d_j represent the same DOF d in the global numbering. Then a row r is added into A with all zeros except 1 at position d_i in the block A^i and -1 at position d_j in A^j. The matrix B_g^s can then be represented as a "composed operator", B_g^s = A^s T^s. Implementation of B_d^s can be done in a similar way, except that only one side of the interface is taken into account.
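Continuing the illustrative two-subdomain example above (an illustration of the operators, not taken from the paper's Figure 1), each subdomain has three local DOFs and a single interface DOF, and with T = diag(T^1, T^2) we get

$$
T^1 = \begin{bmatrix} 0 & 0 & 1 \end{bmatrix}, \qquad
T^2 = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}, \qquad
A = \begin{bmatrix} A^1 & A^2 \end{bmatrix} = \begin{bmatrix} 1 & -1 \end{bmatrix},
$$
$$
B_g = A\,T = \begin{bmatrix} A^1 T^1 & A^2 T^2 \end{bmatrix}
    = \begin{bmatrix} 0 & 0 & 1 & -1 & 0 & 0 \end{bmatrix},
$$

so that the single gluing row enforces $u^1_2 - u^2_0 = 0$, i.e. the two copies of the shared DOF coincide.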

4 PERMON toolbox

PERMON [10] is our newly emerging set of tools combining advanced quadratic programming algorithms and domain decomposition methods. It incorporates our own codes and makes use of renowned open source libraries, especially PETSc [14,1]. So far we focus mainly on linear elasticity and contact problems, but we also investigate applications in medical imaging, ice-sheet melting modelling, statistical methods and others. The core of PERMON depends on PETSc and uses its coding style. It consists of the PermonQP and PermonFLLOP modules. PermonQP provides a base for the solution of linear systems and quadratic programming (QP) problems. It includes data structures, transformations, algorithms, and supporting functions for QP. It supports any combination of linear equality, box and general linear inequality constraints, just like the quadprog function in the MATLAB Optimization Toolbox. PermonFLLOP is an extension of PermonQP providing support for DDM of the FETI type. This combination of DDM and QP algorithms is what makes PERMON unique. PermonQP and PermonFLLOP are licensed under the BSD 2-Clause license, and we are currently preparing them for publishing. Other PERMON modules include application-specific ones such as PermonPlasticity or PermonMultiBody, discretization tools such as PermonCube, interfaces with external discretization software, and support tools.

PermonCube can be described as a library for parallel generation of simple finite element meshes and their FEM processing, and serves as a provider of testing data for a massively parallel DDM solver such as PermonFLLOP.

5 Gluing matrix implementation

5.1 PETSc distributed matrices and their transposition

Let us mention some PETSc features concerning distributed matrices. Elements of vectors and matrices are distributed among processes; each process owns only its local part. The local part consists of a contiguous range of rows, see also [14]. Concerning the matrix B^T (see Section 3), (B^s)^T is its local part in the above-mentioned sense. This leads us to store the matrix B as (B^{T_E})^{T_I}, where T_E means an explicit transposition and T_I an implicit transposition. The explicit transposition is implemented in PETSc by the MatTranspose routine or by direct assembling, whereas the implicit one is implemented by the MatCreateTranspose function. This way, a column distribution can be mimicked while using the physical row distribution. MatCreateTranspose returns a new envelope matrix of the MATTRANSPOSE type, wrapping the original matrix and swapping the meanings of its MatMult and MatMultTranspose methods. We have implemented a new convenience function PermonMatTranspose. The demanded type of transpose is specified by an additional enum argument with one of the values EXPLICIT, IMPLICIT, CHEAPEST. The last one stands for the variant which is computationally cheapest for the given matrix: in case of MATTRANSPOSE the inner wrapped matrix is returned, while in other cases a MATTRANSPOSE is created wrapping the current matrix. This function allows transparent handling of all transposes and easy switching of the transpose type.
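The following minimal sketch illustrates the idea using only the public PETSc API. The function name SketchMatTranspose and the enum are ours (hypothetical), and the actual PermonMatTranspose may differ; in particular, the CHEAPEST branch of the real implementation unwraps an existing MATTRANSPOSE envelope, which is not reproduced here.

#include <petscmat.h>

/* Hypothetical stand-in for the enum argument described in the text. */
typedef enum { TRANSPOSE_EXPLICIT, TRANSPOSE_IMPLICIT, TRANSPOSE_CHEAPEST } SketchTransposeType;

static PetscErrorCode SketchMatTranspose(Mat A, SketchTransposeType type, Mat *At)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  if (type == TRANSPOSE_EXPLICIT) {
    /* Physically assemble A^T; its rows become the new locally owned parts. */
    ierr = MatTranspose(A, MAT_INITIAL_MATRIX, At);CHKERRQ(ierr);
  } else {
    /* Envelope matrix of type MATTRANSPOSE: no data are moved, the MatMult
       and MatMultTranspose operations of A are swapped. For CHEAPEST we
       simply fall back to the implicit variant in this sketch. */
    ierr = MatCreateTranspose(A, At);CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}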

5.2 Custom gluing matrix assembly

Previous approaches to assembling the B_g operator are based on the knowledge of the subdomain adjacency [8]. Our latest approach described below needs only the i2g (interface to global) and i2l (interface to local) mappings. One of them can be replaced by the l2g (local to global) mapping, because the replaced one is then easily computed. We physically assemble the matrix B_g^{T_E}, whereas B_g = (B_g^{T_E})^{T_I} in the sense of Subsection 5.1. The matrix B_g is in fact composed as B_g = A T = (T^{T_E} A^{T_E})^{T_I}, see Section 3. T^s is implemented using the PETSc MATSCATTER implicit matrix type. PetscSF is a PETSc class for setting up and managing the communication of certain entries of arrays and vectors between MPI processes. It uses star forest graphs to describe the communication patterns concisely and efficiently. A star is a graph consisting of one root vertex with zero or more leaves; a union of disjoint stars is called a star forest. In the PETSc implementation, all operations are split into matching begin and end phases, which allows overlapping communication with computation. The following list introduces the functions implemented in PermonFLLOP.

Fig. 1. Decomposition, and global, local and interface DOF numbering.

QPFetiAssembleGluing returns the matrix B_g by calling all the functions mentioned below. For illustration, we consider an elementary geometry decomposed into 4 subdomains (Figure 1) and only one DOF per node for the sake of simplicity.

QPFetiGetI2LMapping and QPFetiGetI2GMapping form the i2l or i2g mapping if it is missing.

QPFetiGetAtSF constructs A^{T_E} using the i2g mapping and a PetscSF object; the procedure is described below.

QPFetiConvertAtToBgt forms the matrix T^{T_E} from the i2l mapping. The matrix A^{T_E} is then implicitly pre-multiplied by T^{T_E} (MATCOMPOSITE can be used for that). The resulting product B_g^{T_E} acts as if it had zero rows corresponding to the non-gluing DOFs.

Let us now describe the QPFetiGetAtSF function; the steps below correspond to the domain decomposition and numberings in Figure 1 and to non-redundant connections (a schematic sketch of the corresponding PetscSF calls follows the list).

1. Make the first PetscSF from the i2g mapping (PetscSFSetGraphLayout). Each process has a local part of roots and leaves.


2. Compute the root degrees (PetscSFComputeDegreeBegin/End) and broadcast them to the leaves (PetscSFBcastBegin/End); count the connections.


3. Remove the non-gluing leaves and duplicate the leaves with multiple connections.


4. Create the second PetscSF (with non-contiguous local indices).


5. Scatter the indices of the connections (PetscSFScatterBegin/End).


6. Make the third PetscSF from the connection indices.


7. Send the lowest rank number of each leaf to the roots (PetscSFReduceBegin/End), then send it back from the roots to the leaves (PetscSFBcastBegin/End). For each leaf, set the value to 1 if the broadcast number equals the rank of the leaf, and to −1 otherwise.


8. End up with all the information needed to assemble (A^s)^{T_E}: the column indices (from step 5), the row indices (from step 4), and the values to insert (from step 7).
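The following sketch, written against the public PetscSF API (assuming a recent PETSc, 3.15 or later, where PetscSFBcastBegin/End take an MPI_Op argument), illustrates steps 1, 2 and 7 of the procedure above; the function and variable names are ours, steps 3-6 are omitted for brevity, and the actual QPFetiGetAtSF implementation in PermonFLLOP may differ.

#include <petscsf.h>

/* Schematic sketch of the PetscSF usage behind QPFetiGetAtSF (not the
   PermonFLLOP source). Roots are the globally numbered DOFs, leaves are the
   local interface DOFs given by the i2g mapping. */
static PetscErrorCode SketchGetAtSF(MPI_Comm comm, PetscInt nI, const PetscInt *i2g, PetscInt nGlobalLocal)
{
  PetscErrorCode ierr;
  PetscSF        sf;
  PetscLayout    layout;          /* row layout of the global DOF numbering */
  const PetscInt *degree;         /* owned by the SF, do not free */
  PetscInt       *leafdegree, *rootrank, *leafrank, i;
  PetscMPIInt    rank;

  PetscFunctionBeginUser;
  ierr = MPI_Comm_rank(comm, &rank);CHKERRQ(ierr);

  /* Step 1: first SF from the i2g mapping. */
  ierr = PetscLayoutCreate(comm, &layout);CHKERRQ(ierr);
  ierr = PetscLayoutSetLocalSize(layout, nGlobalLocal);CHKERRQ(ierr);
  ierr = PetscLayoutSetUp(layout);CHKERRQ(ierr);
  ierr = PetscSFCreate(comm, &sf);CHKERRQ(ierr);
  ierr = PetscSFSetGraphLayout(sf, layout, nI, NULL, PETSC_COPY_VALUES, i2g);CHKERRQ(ierr);

  /* Step 2: root degrees = number of subdomain copies of each global DOF;
     broadcast them back so every leaf knows its number of connections. */
  ierr = PetscSFComputeDegreeBegin(sf, &degree);CHKERRQ(ierr);
  ierr = PetscSFComputeDegreeEnd(sf, &degree);CHKERRQ(ierr);
  ierr = PetscMalloc1(nI, &leafdegree);CHKERRQ(ierr);
  ierr = PetscSFBcastBegin(sf, MPIU_INT, degree, leafdegree, MPI_REPLACE);CHKERRQ(ierr);
  ierr = PetscSFBcastEnd(sf, MPIU_INT, degree, leafdegree, MPI_REPLACE);CHKERRQ(ierr);

  /* Steps 3-6 (leaf removal/duplication, second and third SF) omitted here. */

  /* Step 7: the lowest owning rank of each root decides the sign:
     +1 on the owning rank, -1 on the others. */
  ierr = PetscMalloc2(nGlobalLocal, &rootrank, nI, &leafrank);CHKERRQ(ierr);
  for (i = 0; i < nI; i++) leafrank[i] = rank;
  for (i = 0; i < nGlobalLocal; i++) rootrank[i] = PETSC_MAX_INT;
  ierr = PetscSFReduceBegin(sf, MPIU_INT, leafrank, rootrank, MPI_MIN);CHKERRQ(ierr);
  ierr = PetscSFReduceEnd(sf, MPIU_INT, leafrank, rootrank, MPI_MIN);CHKERRQ(ierr);
  ierr = PetscSFBcastBegin(sf, MPIU_INT, rootrank, leafrank, MPI_REPLACE);CHKERRQ(ierr);
  ierr = PetscSFBcastEnd(sf, MPIU_INT, rootrank, leafrank, MPI_REPLACE);CHKERRQ(ierr);
  /* leafrank[i] == rank  =>  entry +1 in A^s, otherwise -1. */

  ierr = PetscFree(leafdegree);CHKERRQ(ierr);
  ierr = PetscFree2(rootrank, leafrank);CHKERRQ(ierr);
  ierr = PetscSFDestroy(&sf);CHKERRQ(ierr);
  ierr = PetscLayoutDestroy(&layout);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}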

5.3 MATNEST, MATBLOCKDIAG and MatMatMultByColumns

The resulting gluing constraint B_g u = o is then passed to the PermonQP solver module using its QPAddEq method, separately from B_d.

The global matrix B = [B_d^T, B_g^T]^T is implemented internally in PermonQP using the composite matrix type MATNEST. It provides implicit nesting of matrices; the nested matrices stay stored separately, but the matrix-vector multiply function (MatMult) of the nesting matrix behaves as if they were stored interleaved in both directions by processes. This design allows decoupled assembly and storage of the constraint matrices related to different types of constraints, in our case B_d and B_g. We had to cope with the fact that MATNEST does not support matrix-matrix multiplication. However, the only place where it is actually needed is the multiplication G = R^T B^T = (BR)^T, where R = diag(R^1, ..., R^{N_S}) is a block-diagonal matrix whose diagonal blocks R^s, s = 1, ..., N_S, are dense with just a few columns, e.g. 1 for the Laplace equation, 3 for 2D elasticity, and 6 for 3D elasticity. Thus this matrix-matrix multiplication can be implemented efficiently using matrix-vector products. The block-diagonal matrix is implemented using the new composite matrix type MATBLOCKDIAG. The new functions PMatMatMultByColumns and PMatTransposeMatMultByColumns were implemented for this reason. To carry out the matrix-matrix product Z = XY, they first allocate Z as a dense matrix, use the newly implemented type-specific functions PMatGetColumnsVectors and PMatRestoreColumnsVectors to extract the column vectors of Y and Z, and iterate over the columns of Y. For the i-th column Y(:,i), the matrix-vector product X * Y(:,i) = Z(:,i) is performed (MATLAB notation). For dense matrices, PMat{Get,Restore}ColumnsVectors cheaply creates vectors sharing the inner array with the original matrix. In our case, X = B, Y = R, and Z = G^T. Thus we have to deal with X being MATNEST and Y being MATBLOCKDIAG. If the type of Y is MATBLOCKDIAG, then PMatMatMultByColumns proceeds as follows. It first gets the explicit transpose X^{T_E}; this is cheap provided X is stored as (X^{T_E})^{T_I}, which is actually our case. PMatGetLocalMat is then called to obtain the local parts (X^s)^{T_E} owned by each process s = 1, ..., N_S. Furthermore, PMatTransposeMatMultByColumns is used to carry out Z^s = ((X^s)^{T_E})^{T_I} Y^s independently on each process. The local matrices (Z^s)^{T_E} are then computed and concatenated with no communication into the global matrix Z^{T_E} using the PMatConcat function. The final result is Z = (Z^{T_E})^{T_I}. We implemented all the aforementioned functions also for the MATNEST type; these type-specific implementations just recursively delegate the operations to the nested blocks. The result of multiplying the MATNEST B with the MATBLOCKDIAG R is then a MATNEST matrix

$$
G^{T_I} = \begin{bmatrix} G_d^{T_I} \\ G_g^{T_I} \end{bmatrix}.
$$
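A minimal generic sketch of the column-by-column idea is given below; it uses only standard PETSc operations (MatGetColumnVector, MatMult) and copies columns instead of sharing dense arrays, so it is far simpler and less efficient than the actual PMatMatMultByColumns with its MATNEST and MATBLOCKDIAG specializations. It assumes that the row layout of Y matches the column layout of X, which holds for X = B and Y = R here.

#include <petscmat.h>

/* Schematic Z = X*Y via matrix-vector products, intended for Y with few
   columns (e.g. the kernel matrix R). Not the PERMON implementation. */
static PetscErrorCode SketchMatMatMultByColumns(Mat X, Mat Y, Mat *Z)
{
  PetscErrorCode    ierr;
  Vec               ycol, zcol;
  PetscInt          mX, MX, NY, i, j, rstart, rend;
  const PetscScalar *zarr;

  PetscFunctionBeginUser;
  ierr = MatGetLocalSize(X, &mX, NULL);CHKERRQ(ierr);
  ierr = MatGetSize(X, &MX, NULL);CHKERRQ(ierr);
  ierr = MatGetSize(Y, NULL, &NY);CHKERRQ(ierr);

  /* Z is allocated as a dense matrix with the row distribution of X and one
     column per column of Y. */
  ierr = MatCreateDense(PetscObjectComm((PetscObject)X), mX, PETSC_DECIDE, MX, NY, NULL, Z);CHKERRQ(ierr);

  ierr = MatCreateVecs(X, &ycol, &zcol);CHKERRQ(ierr); /* ycol ~ cols of X, zcol ~ rows of X */
  ierr = VecGetOwnershipRange(zcol, &rstart, &rend);CHKERRQ(ierr);

  for (j = 0; j < NY; j++) {
    ierr = MatGetColumnVector(Y, ycol, j);CHKERRQ(ierr); /* ycol = Y(:,j) */
    ierr = MatMult(X, ycol, zcol);CHKERRQ(ierr);         /* Z(:,j) = X*Y(:,j) */
    ierr = VecGetArrayRead(zcol, &zarr);CHKERRQ(ierr);
    for (i = rstart; i < rend; i++) {
      ierr = MatSetValue(*Z, i, j, zarr[i - rstart], INSERT_VALUES);CHKERRQ(ierr);
    }
    ierr = VecRestoreArrayRead(zcol, &zarr);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(*Z, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*Z, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  ierr = VecDestroy(&ycol);CHKERRQ(ierr);
  ierr = VecDestroy(&zcol);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}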

6 Numerical experiments

Parallel numerical experiments were performed on the Salomon cluster operated by the IT4Innovations National Supercomputing Center, Czech Republic. This cluster consists of 1008 compute nodes, giving a total of 24192 compute cores with 129 TB RAM and over 2 Pflop/s theoretical peak performance. Each node is an x86-64 computer with two Intel Xeon E5-2680v3 12-core processors (24 cores per node) and at least 128 GB RAM. The nodes are interconnected by a 7D Enhanced Hypercube InfiniBand network. Salomon consists of 576 nodes without accelerators and 432 nodes equipped with Intel Xeon Phi MIC accelerators. As a benchmark for the parallel tests, a numerical model of an elastic cube was used. The primal data K, f, R as well as the i2l and l2g mappings were generated by the PermonCube package. Timings of the individual operations introduced in Section 5.2 are shown in Table 1, together with timings of the old approach using information about neighbouring subdomains (QPFetiAssembleGluingNeigh). The timings are comparable; thus we can achieve similar times without the neighbouring information.

N_S                        216         512         1 000       2 197       4 096        8 000
N_p                        6 001 128   14 224 896  27 783 000  61 039 251  113 799 168  222 264 000
N_d                        820 248     2 015 076   4 019 328   9 001 071   16 980 948   33 505 068
QPFetiAssembleGluingNeigh  3.42E-01    3.46E-01    4.03E-01    6.28E-01    4.75E-01     1.17E+00
QPFetiAssembleGluing       3.33E-01    6.20E-01    7.76E-01    1.20E+00    1.42E+00     2.27E+00
QPFetiGetI2Lmapping        2.44E-03    2.13E-02    4.83E-03    2.35E-02    1.72E-02     6.86E-02
QPFetiGetAtSF              3.29E-01    5.94E-01    7.64E-01    1.16E+00    1.36E+00     2.10E+00
QPFetiConvertAtToBgt       2.46E-03    4.44E-03    9.04E-03    1.74E-02    5.30E-02     1.19E-01

Table 1. Timings of the individual operations performed on Salomon for different numbers of subdomains N_S, with a constant number of 8 000 elements and 27 783 DOFs per subdomain, and N_S = N_c.

7 Conclusion

We have presented results related to the parallel implementation of the FETI constraint matrix within our PERMON software toolbox, particularly its "gluing" part responsible for the subdomain connectivity. We have briefly reviewed the TFETI method and the PERMON toolbox modules. We have presented and evaluated our new approach, which needs only the local and global numberings of the subdomain interface DOFs, and the implementation features it requires. The results show that the current implementation scales at least up to thousands of cores.

Acknowledgements. This work was supported by the European Regional Development Fund in the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070); by the Project of major infrastructures for research, development and innovation of the Ministry of Education, Youth and Sports with reg. num. LM2011033; by the EXA2CT project funded from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 610741; by the internal student grant competition project SP2015/186 "PERMON toolbox development"; and by the Grant Agency of the Czech Republic (GACR) project no. 15-18274S.

References

1. Balay, S., Gropp, W.D., McInnes, L.C., Smith, B.F.: Efficient management of parallelism in object oriented numerical software libraries. In: Arge, E., Bruaset, A.M., Langtangen, H.P. (eds.) Modern Software Tools in Scientific Computing, pp. 163-202. Birkhäuser Press (1997)
2. Dostál, Z., Horák, D., Kučera, R.: Total FETI - an easier implementable variant of the FETI method for numerical solution of elliptic PDE. Communications in Numerical Methods in Engineering 22(12), 1155-1162 (2006)
3. Dostál, Z., Kozubek, T., Markopoulos, A., Menšík, M.: Cholesky decomposition of a positive semidefinite matrix with known kernel. Applied Mathematics and Computation 217(13), 6067-6077 (2011)
4. Farhat, C., Mandel, J., Roux, F.X.: Optimal convergence properties of the FETI domain decomposition method. Computer Methods in Applied Mechanics and Engineering 115, 365-385 (1994)
5. Farhat, C., Roux, F.X.: A method of finite element tearing and interconnecting and its parallel solution algorithm. International Journal for Numerical Methods in Engineering 32(6), 1205-1227 (1991)
6. Farhat, C., Roux, F.X.: An unconventional domain decomposition method for an efficient parallel solution of large-scale finite element systems. SIAM J. Sci. Stat. Comput. 13, 379-396 (1992)
7. Gosselet, P., Rey, C.: Non-overlapping domain decomposition methods in structural mechanics. Archives of Computational Methods in Engineering 13(4), 515-572 (2006)
8. Hapla, V., Cermak, M., Markopoulos, A., Horak, D.: FLLOP: A massively parallel solver combining FETI domain decomposition method and quadratic programming. In: 2014 IEEE Intl Conf on High Performance Computing and Communications (HPCC 2014), pp. 320-327 (2014)
9. Hapla, V., Horák, D.: TFETI coarse space projectors parallelization strategies. In: Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science, vol. 7203, pp. 152-162. Springer Berlin Heidelberg (2012)
10. Hapla, V., et al.: PERMON (Parallel, Efficient, Robust, Modular, Object-oriented, Numerical) web pages (2015), http://industry.it4i.cz/en/products/permon/
11. Hapla, V., Horák, D., Merta, M.: Use of direct solvers in TFETI massively parallel implementation. In: Manninen, P., Öster, P. (eds.) Applied Parallel and Scientific Computing, Lecture Notes in Computer Science, vol. 7782, pp. 192-205. Springer Berlin Heidelberg (2013)
12. Klawonn, A., Widlund, O.B.: FETI and Neumann-Neumann iterative substructuring methods: Connections and new results. Communications on Pure and Applied Mathematics 54(1), 57-90 (2001)
13. Merta, M., Vašatová, A., Hapla, V., Horák, D.: Parallel implementation of Total-FETI DDM with application to medical image registration. In: Domain Decomposition Methods in Science and Engineering XXI, Lecture Notes in Computational Science and Engineering, vol. 98, pp. 917-925. Springer International Publishing (2014)
14. Smith, B.F., et al.: PETSc users manual. Tech. Rep. ANL-95/11 - Revision 3.5, Argonne National Laboratory (2014), http://www.mcs.anl.gov/petsc
15. Čermák, M., Hapla, V., Horák, D., Merta, M., Markopoulos, A.: Total-FETI domain decomposition method for solution of elasto-plastic problems. Advances in Engineering Software 84(0), 48-54 (2015)
