Parallel Multilevel Incomplete LU Factorization Preconditioner with Variable-Block Structure

VBARMS

Sequential results

Parallelization

Parallel experiments

Conclusions

Parallel Multilevel Incomplete LU Factorization Preconditioner with Variable-Block Structure

Masha Sosonkina, Old Dominion University
joint work with
Bruno Carpentieri, University of Groningen
Jia Liao, University of Groningen

SIAM LA 2015, Atlanta, GA, October 29, 2015

M. Sosonkina (ODU)

VBARMS preconditioner

1 / 22


Outline

• VBARMS
• Sequential results
• Parallelization
• Parallel experiments
• Concluding remarks



Matrices with natural blocks

• Sparse matrices arising in many applications possess a block structure when several unknown physical quantities are associated with the same grid point.

• In 2D CFD, the density, the energy, two velocity components, and the turbulence transport variable of the fluid may be associated with one grid point.

• A finite element discretization then gives rise to sparse matrices with small dense blocks of size 5 × 5.



Why blocking?

1. Memory. The matrix is stored in variable-block compressed sparse row (VBCSR) format, saving column indices and pointers for the block entries.

2. Stability. Better control of near singularities.

3. Efficiency. Higher-level optimized BLAS routines serve as computational kernels.

4. Cache effects. Better cache reuse is possible for block algorithms.
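The VBCSR idea in item 1 can be sketched in a few lines of Python. This is illustrative only — the function name, layout, and partition format are assumptions, not the actual VBARMS data structure — but it shows the key saving: one column index per dense block instead of one per scalar nonzero.

```python
import numpy as np

def to_vbcsr(A, row_parts, col_parts):
    """Store the dense blocks of A (partitioned by row_parts/col_parts) in a
    variable-block CSR-like layout: one column index per block instead of
    one per scalar nonzero. Illustrative sketch, not the VBARMS format."""
    bptr, bcol, bval = [0], [], []
    for (r0, r1) in row_parts:
        for j, (c0, c1) in enumerate(col_parts):
            blk = A[r0:r1, c0:c1]
            if np.any(blk != 0.0):        # keep any block with a nonzero
                bcol.append(j)
                bval.append(blk.copy())   # small dense block (BLAS-friendly)
        bptr.append(len(bcol))
    return bptr, bcol, bval

# 4x4 matrix with two 2x2 dense blocks on the diagonal
A = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 0.],
              [0., 0., 5., 6.],
              [0., 0., 7., 8.]])
parts = [(0, 2), (2, 4)]
bptr, bcol, bval = to_vbcsr(A, parts, parts)
print(bptr)   # block-row pointers: [0, 1, 2]
print(bcol)   # one column index per 2x2 block: [0, 1]
```

Eight scalar nonzeros are indexed by just two block column indices, which is where the memory saving in item 1 comes from.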



We developed a block variant of the Algebraic Recursive Multilevel Solver (ARMS) [Saad & Suchomel, 02].



Graph compression

Goals:
• Larger block sizes → block density < 100%.
• Block density well controllable by an input parameter.
• Low computational complexity.

Algorithm outline:
1. Use the checksum algorithm [Ashcraft, 95] to group rows with the same nonzero structure into supernodes.
2. Merge pairs of adjacent supernodes so that the resulting average block density remains above the input density parameter [cf. Saad, 02].
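Step 1 of the outline can be illustrated with a short Python sketch — an assumed, simplified version of the checksum idea in [Ashcraft, 95], not the actual implementation: a cheap checksum acts as a first filter, and only rows whose checksums match are compared exactly.

```python
def group_supernodes(pattern_rows):
    """Group consecutive rows with identical nonzero column patterns into
    supernodes, using a cheap checksum as a first filter (illustrative
    version of the idea in [Ashcraft, 95])."""
    def checksum(cols):
        return (len(cols), sum(cols))   # cheap hash; exact check done below
    groups = []
    for i, cols in enumerate(pattern_rows):
        if groups and checksum(pattern_rows[groups[-1][0]]) == checksum(cols) \
                  and pattern_rows[groups[-1][0]] == cols:
            groups[-1].append(i)        # same pattern: extend the supernode
        else:
            groups.append([i])          # different pattern: new supernode
    return groups

# rows 0-1 and rows 2-4 share nonzero patterns -> two supernodes
patterns = [[0, 1, 3], [0, 1, 3], [2, 4], [2, 4], [2, 4]]
print(group_supernodes(patterns))   # [[0, 1], [2, 3, 4]]
```

Step 2 (merging adjacent supernodes while a density threshold holds) would then coarsen these groups further; it is omitted here for brevity.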



The ARMS preconditioner

• ARMS is a Schur-complement-based multilevel ILU factorization for solving Ax = b.

• First, the matrix A is permuted as

$$PAP^T = \begin{pmatrix} D & F \\ E & C \end{pmatrix}, \qquad (1)$$

where D is a block-diagonal matrix. Next, a block LU factorization of the permuted matrix is computed:

$$\begin{pmatrix} D & F \\ E & C \end{pmatrix} = \begin{pmatrix} L & 0 \\ EU^{-1} & I \end{pmatrix} \times \begin{pmatrix} U & L^{-1}F \\ 0 & A_1 \end{pmatrix}, \qquad (2)$$

• where D = LU and A_1 = C - ED^{-1}F is the Schur complement, of the same size as C. The reduction process can be applied recursively to each consecutively reduced system, until the last Schur complement is small enough to be solved with a standard method.
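The block factorization (2) can be checked numerically in a few lines of NumPy. This is a toy sketch under stated assumptions — D is taken diagonal so that L = I and U = D — not the ARMS implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 4, 3
D = np.diag(rng.uniform(1.0, 2.0, n1))    # "block-diagonal" part (diagonal here)
F = rng.standard_normal((n1, n2))
E = rng.standard_normal((n2, n1))
C = np.eye(n2) * 5.0

A1 = C - E @ np.linalg.inv(D) @ F         # Schur complement, size of C

# block LU as in (2): [[D,F],[E,C]] = [[L,0],[E U^-1, I]] @ [[U, L^-1 F],[0, A1]]
# with D = L U; since D is diagonal, take L = I and U = D
L, U = np.eye(n1), D
left  = np.block([[L, np.zeros((n1, n2))], [E @ np.linalg.inv(U), np.eye(n2)]])
right = np.block([[U, np.linalg.inv(L) @ F], [np.zeros((n2, n1)), A1]])
A = np.block([[D, F], [E, C]])
print(np.allclose(left @ right, A))   # True: the factorization reproduces A
```

Multiplying out the factors gives back D, F, E in place and E D^{-1} F + A_1 = C in the bottom-right block, which is exactly the definition of the Schur complement.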



The VBARMS method

The complete pre-processing and factorization process of VBARMS consists of the following steps:

STEP 1 Apply the graph compression algorithm to find the block ordering P_B of A and permute A accordingly:

$$\tilde{A} \equiv P_B A P_B^T = \begin{pmatrix} \tilde{A}_{11} & \tilde{A}_{12} & \cdots & \tilde{A}_{1p} \\ \tilde{A}_{21} & \tilde{A}_{22} & \cdots & \tilde{A}_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \tilde{A}_{p1} & \tilde{A}_{p2} & \cdots & \tilde{A}_{pp} \end{pmatrix}. \qquad (3)$$

STEP 2 Scale \tilde{A} on both sides and form a quotient graph of \tilde{A} + \tilde{A}^T.



The VBARMS method (cont’d)

STEP 3 Permute the matrix obtained in STEP 2 into 2 × 2 block form:

$$P_I S_1 P_B A P_B^T S_2 P_I^T = \begin{pmatrix} D & F \\ E & C \end{pmatrix}. \qquad (4)$$

STEP 4 Compute the partial LU factorization of matrix (4),

$$\begin{pmatrix} D & F \\ E & C \end{pmatrix} = \begin{pmatrix} L & 0 \\ EU^{-1} & I \end{pmatrix} \times \begin{pmatrix} U & L^{-1}F \\ 0 & A_1 \end{pmatrix}, \qquad (5)$$

and form the reduced system with the Schur complement A_1 = C - ED^{-1}F.

STEP 5 Repeat STEPs 3–4 until the Schur complement is sufficiently small.



Matrices tested



Experiments with the graph compression algorithm

              Perfect blocking                 Relaxed blocking
Name          τ      b-size   b-density(%)    τ      b-size   b-density(%)
S3DKQ4M2      1.00   1.25     100.00          0.70   5.93     90.34
NASASRB       1.00   2.20     100.00          0.90   3.31     92.31
OLAFU         1.00   1.54     100.00          0.90   5.10     89.50
CT20STIF      1.00   2.61     100.00          0.90   3.47     96.61

b-size ≡ average block size of Ã; b-density ≡ nnz(A)/nnz(Ã); τ ∈ [0, 1] is the tolerance parameter that adjusts the b-density.

Larger blocks may be found by treating some zero entries as nonzeros (relaxed blocking, controlled by τ).
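The b-density measure can be illustrated with a short Python sketch, under an assumed simplification: every block containing at least one nonzero is stored fully dense, so the stored zeros lower the density below 100%.

```python
import numpy as np

def block_density(A, row_parts, col_parts):
    """b-density = nnz(A) / nnz(A~), where A~ fills every retained block
    fully dense: each block with at least one nonzero is stored whole.
    Illustrative sketch of the measure, not the VBARMS code."""
    nnz_A, nnz_blocked = np.count_nonzero(A), 0
    for (r0, r1) in row_parts:
        for (c0, c1) in col_parts:
            blk = A[r0:r1, c0:c1]
            if np.any(blk != 0.0):
                nnz_blocked += blk.size   # stored dense, zeros included
    return nnz_A / nnz_blocked

A = np.array([[1., 2., 0., 0.],
              [3., 0., 0., 0.],
              [0., 0., 5., 6.],
              [0., 0., 7., 8.]])
parts = [(0, 2), (2, 4)]
print(block_density(A, parts, parts))   # 7/8 = 0.875: one stored zero
```

Here relaxed blocking stores 8 entries for 7 true nonzeros, giving a b-density of 87.5% in exchange for two dense 2 × 2 blocks.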



Experiments with three different preconditioners

Matrix      Method    P-T (B-T)        I-T       M-cost   Its
RAE         VBARMS    4.230 (0.050)    2.470     2.430    46
            ARMS      5.800            46.010    3.750    1000
            ILUT      †                †         †        †
S3DKQ4M2    VBARMS    14.910 (0.150)   7.000     2.667    55
            ARMS      14.850           100.570   2.781    1000
            ILUT      5.080            82.020    2.664    1000
VENKAT01    VBARMS    0.830 (0.040)    0.860     0.493    40
            ARMS      0.340            0.590     0.456    28
            ILUT      0.190            0.510     0.469    32
BMW7ST      VBARMS    43.890 (0.130)   0.240     3.057    1
            ARMS      49.230           172.750   3.112    1000
            ILUT      34.940           †         3.085    †
PWTK        VBARMS    50.420 (0.180)   31.170    2.669    93
            ARMS      39.370           260.880   2.963    1000
            ILUT      44.540           †         3.038    †

Remark Better convergence and speed of VBARMS might be attributed to the better control of near-singularities of block ILU solvers and to the smaller size of the Schur complement.



Experiments with three different preconditioners (cont’d)

Matrix      Method    ℓ-ratio   MFlops
RAE         VBARMS    1.270     209.7684
            ARMS      3.210     67.087
            ILUT      -         †
S3DKQ4M2    VBARMS    1.286     253.5121
            ARMS      1.519     104.323
            ILUT      -         97.7966
VENKAT01    VBARMS    1.197     262.205
            ARMS      1.203     18.4576
            ILUT      -         44.2821
BMW7ST      VBARMS    1.499     240.4245
            ARMS      2.442     59.179
            ILUT      -         99.012
PWTK        VBARMS    1.399     246.0856
            ARMS      2.569     82.735
            ILUT      -         126.085

ℓ-ratio: the ratio of the sum of the number of unknowns at all levels of the factorization to the number of unknowns in the original system.

MFlops is for the preconditioner construction, estimated with the PAPI library on an Intel Core i3 processor at 2.53 GHz with 2 GB of main memory.



Per-processor P_i view of partitioning

1. Load A.
2. Use graph compression to permute A per the block ordering found: A → Ã.
3. Leave contiguous block-rows: Ã → Ã_i.
4. Apply a parallel graph partitioner (Zoltan) on the quotient graph of Ã.



Per-processor P_i view of the linear system

• Local nodes (those coupled only with local variables) and interface nodes (those coupled with local variables and with remote variables stored on other processors).

• The local submatrix is accordingly split into two separate contributions.

Figure: local matrix layout

• The local equations on processor i may be written as

$$A_i x_i + U_{i,\mathrm{ext}}\, y_{i,\mathrm{ext}} = b_i.$$
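The local equation A_i x_i + U_{i,ext} y_{i,ext} = b_i can be checked on a toy two-subdomain example, with plain NumPy standing in for the distributed data; the splitting below is illustrative:

```python
import numpy as np

# Two subdomains of a 4x4 system; processor 0 owns unknowns 0-1.
# A_i acts on local unknowns, U_{i,ext} on copies of remote interface
# unknowns (names follow the slide; the splitting is illustrative).
A_full = np.array([[4., 1., 0., 1.],
                   [1., 4., 1., 0.],
                   [0., 1., 4., 1.],
                   [1., 0., 1., 4.]])
x = np.array([1., 2., 3., 4.])

# processor 0 owns rows/cols 0-1; columns 2-3 are external couplings
A_0,  U_0ext = A_full[:2, :2], A_full[:2, 2:]
x_0,  y_0ext = x[:2], x[2:]

b_0 = A_0 @ x_0 + U_0ext @ y_0ext          # local part of b = A x
print(np.allclose(b_0, (A_full @ x)[:2]))  # True
```

In the parallel code, y_{i,ext} would arrive via communication with neighboring processors before the local matrix-vector product is completed.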



Global solvers

• VBARMS is used as the local solver for three types of global solvers:

1. Block Jacobi (BJ): drop U_i and invert A_i.
2. Restricted Additive Schwarz (RAS): drop U_i and invert an extended A_i.
3. Schur complement method:

• Split x_i and b_i according to the interior and interface nodes:

$$x_i = \begin{pmatrix} u_i \\ y_i \end{pmatrix}, \qquad b_i = \begin{pmatrix} f_i \\ g_i \end{pmatrix}.$$

• The local linear system can then be written as

$$\begin{pmatrix} B_i & F_i \\ E_i & C_i \end{pmatrix} \begin{pmatrix} u_i \\ y_i \end{pmatrix} + \begin{pmatrix} 0 \\ \sum_{j \in N_i} E_{ij} y_j \end{pmatrix} = \begin{pmatrix} f_i \\ g_i \end{pmatrix}, \qquad (6)$$

where N_i is the set of subdomains that are neighbors to subdomain i, and the term E_{ij} y_j accounts for the contribution to the local equation from the jth neighboring subdomain.
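Option 1, block Jacobi, can be sketched as follows. This is a toy version under stated assumptions: exact local solves stand in for the VBARMS factorizations, and the whole vector is held in one process instead of being distributed.

```python
import numpy as np

def block_jacobi_apply(blocks, r):
    """Apply M^{-1} r for block Jacobi: drop the inter-domain couplings U_i
    and solve with each local A_i independently (embarrassingly parallel).
    Each local solve is exact here; VBARMS would use an incomplete
    factorization instead."""
    out, start = [], 0
    for A_i in blocks:
        n_i = A_i.shape[0]
        out.append(np.linalg.solve(A_i, r[start:start + n_i]))
        start += n_i
    return np.concatenate(out)

A = np.array([[4., 1., 0., 1.],
              [1., 4., 1., 0.],
              [0., 1., 4., 1.],
              [1., 0., 1., 4.]])
blocks = [A[:2, :2], A[2:, 2:]]   # diagonal blocks of the two subdomains
r = np.array([1., 1., 1., 1.])
z = block_jacobi_apply(blocks, r)
print(z)   # approximation to A^{-1} r using only local solves
```

Since each subdomain solve ignores the others, the application requires no communication, which is what makes block Jacobi attractive at high processor counts.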



Schur complement method

• In Eqn (6), eliminate the vector of interior unknowns u_i from the first equations to obtain the local Schur complement system

$$S_i y_i + \sum_{j \in N_i} E_{ij} y_j = g_i - E_i B_i^{-1} f_i \equiv g_i',$$

• where S_i = C_i - E_i B_i^{-1} F_i is the local Schur complement matrix.

• Writing all the local equations together results in the global linear system

$$\begin{pmatrix} S_1 & E_{12} & \cdots & E_{1p} \\ E_{21} & S_2 & \cdots & E_{2p} \\ \vdots & & \ddots & \vdots \\ E_{p1} & E_{p2} & \cdots & S_p \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix} = \begin{pmatrix} g_1' \\ g_2' \\ \vdots \\ g_p' \end{pmatrix}. \qquad (7)$$

• One preconditioning step consists of solving the global system (7) approximately and computing the u_i variables from the local equations as u_i = B_i^{-1}[f_i - F_i y_i].
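The whole procedure — forming S_i and g_i', solving the interface system (7), then back-substituting u_i — can be verified on a two-subdomain toy problem, with scalars standing in for the blocks (all values below are illustrative):

```python
import numpy as np

# Two subdomains, each with one interior (u_i) and one interface (y_i)
# unknown; the subdomains couple only through the interfaces (E_12, E_21).
B1, F1, E1, C1 = 4.0, 1.0, 1.0, 4.0
B2, F2, E2, C2 = 5.0, 1.0, 2.0, 5.0
E12, E21 = 1.0, 1.0
f1, g1, f2, g2 = 1.0, 2.0, 3.0, 4.0

# local Schur complements S_i = C_i - E_i B_i^{-1} F_i and reduced rhs g_i'
S1, S2 = C1 - E1 * F1 / B1, C2 - E2 * F2 / B2
g1p, g2p = g1 - E1 * f1 / B1, g2 - E2 * f2 / B2

# global interface system (7): [[S1, E12], [E21, S2]] [y1, y2] = [g1', g2']
y1, y2 = np.linalg.solve([[S1, E12], [E21, S2]], [g1p, g2p])

# back-substitute the interiors: u_i = B_i^{-1} (f_i - F_i y_i)
u1, u2 = (f1 - F1 * y1) / B1, (f2 - F2 * y2) / B2

# check against a direct solve of the assembled system (order u1,y1,u2,y2)
A = np.array([[B1, F1,  0., 0. ],
              [E1, C1,  0., E12],
              [0., 0.,  B2, F2 ],
              [0., E21, E2, C2 ]])
x = np.linalg.solve(A, [f1, g1, f2, g2])
print(np.allclose([u1, y1, u2, y2], x))   # True
```

With exact local solves the reduction is exact; as a preconditioner, system (7) is only solved approximately, which is why its quality is sensitive to the conditioning (and scaling) of the S_i.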



Parallel experiments

AUDIKW_1 (size 943695; τ = 0.80, b-density = 96.40%, b-size = 3.16)
  Method          P-T (B-T)     I-T       Total     Its     M-cost
  BJ+VBARMS       86.77 (3.0)   222.86    309.63    218     3.46
  RAS+VBARMS      77.44         57.45     134.88    32      3.31
  SCHUR+VBARMS    126.39        2545.24   2671.63   63      5.51
  BJ+ARMS         114.69        2318.65   2433.34   1000+   5.24

LDOOR (size 952203; τ = 0.80, b-density = 99.96%, b-size = 7.00)
  Method          P-T (B-T)     I-T       Total     Its     M-cost
  BJ+VBARMS       18.42 (1.3)   109.23    127.66    301     3.90
  RAS+VBARMS      23.75         23.64     47.39     47      4.26
  SCHUR+VBARMS    10.64         63.28     73.91     29      3.76
  BJ+ARMS         47.59         280.55    328.15    465     7.66

STA008 (size 891815; τ = 0.60, b-density = 84.74%, b-size = 3.92)
  Method          P-T (B-T)     I-T       Total     Its     M-cost
  BJ+VBARMS       44.55 (2.5)   164.00    208.55    198     5.27
  RAS+VBARMS      56.40         108.23    164.63    98      5.52
  SCHUR+VBARMS    75.60         1055.34   1130.94   208     9.53
  BJ+ARMS         152.07        7614.69   7766.75   1000+   11.83

Performance comparison of BJ+VBARMS and ARMS on big matrices.

Remark According to our sequential results, VBARMS is more efficient and numerically more stable than ARMS, and the parallel results here on larger problems confirm this.



Performance comparison of three global preconditioners on big matrices (same data as the previous table).

Remark In our experiments block Jacobi was generally more robust than the one-level Schur complement-based preconditioner, which seemed to be very sensitive to scaling.



Strong scalability

BJ+VBARMS (τ = 0.80, b-density = 96.40%, b-size = 3.16)
  P-N   P-T      I-T      Total    Its   M-cost
  8     86.71    169.96   256.66   116   3.55
  16    44.34    85.75    129.91   131   3.50
  32    19.44    98.02    117.46   279   3.29
  64    7.35     32.44    39.79    208   3.08
  128   2.22     18.67    20.89    223   2.88
  256   1.31     49.07    50.37    725   2.72

RAS+VBARMS (τ = 0.80, b-density = 96.40%, b-size = 3.16)
  P-N   P-T      I-T      Total    Its   M-cost
  8     104.64   69.76    174.41   28    3.39
  16    52.19    39.44    91.64    35    3.38
  32    27.92    19.79    47.71    39    3.16
  64    11.47    21.30    32.76    59    3.08
  128   5.82     13.87    19.69    78    2.93
  256   3.82     8.65     12.47    90    2.77

Numerical and parallel scalability experiments on the AUDIKW_1 problem. Notation: P-N is the number of processors.

Remark For higher processor counts, block Jacobi and Restricted Additive Schwarz are both competitive due to their inherent parallelism.



Numerical results in turbulent CFD

• Reynolds-averaged Navier–Stokes equations.
• Newton–Krylov formulation.
• Standard test case of the 3rd AIAA CFD Drag Prediction Workshop: 3D steady incompressible turbulent flow past the DPW3 wing.

Matrix             P-N   Method        Total (sec)   Its     M-cost
DPW3 (4918165)     32    BJ+VBARMS     387.27        51      4.47
                         BJ+VBILUT     1448.37       312     5.36
                         BJ+ARMS       11450.62      1000+   6.39
DPW3 (9032110)     64    BJ+VBARMS     460.39        177     3.80
                         BJ+ARMS       10884.96      1000+   5.00
DPW3 (22384845)    128   BJ+VBARMS     315.57        317     3.88
                         RAS+VBARMS    291.42        235     4.05


Concluding Remarks

• We presented the VBARMS algorithm for preconditioning sparse linear systems, together with its sequential and parallel implementations.

• VBARMS automatically detects any existing block structure in the matrix and exploits it to maximize computational efficiency.

• Experiments showed that VBARMS can be more robust and efficient than ARMS and ILUT for solving block-structured matrices.

• For future work, the implementation of our package will be optimized and made available online.



Relevant papers

B. Carpentieri, J. Liao, and M. Sosonkina. A Variable Block Algebraic Recursive Multilevel Solver for Sparse Linear Systems. Journal of Computational and Applied Mathematics, vol. 259, pp. 164–173, 2014.

B. Carpentieri, J. Liao, M. Sosonkina, and A. Bonfiglioli. Using the VBARMS Method in Parallel Computing. Submitted to Parallel Computing, 2015.
