
Dense Triangular Solvers on Multicore Clusters using UPC

Jorge González-Domínguez*, María J. Martín, Guillermo L. Taboada, Juan Touriño

Computer Architecture Group University of A Coruña (Spain) {jgonzalezd,mariam,taboada,juan}@udc.es

International Conference on Computational Science (ICCS 2011)


1. Introduction
2. BLAS2 Triangular Solver
3. BLAS3 Triangular Solver
4. Experimental Evaluation
5. Conclusions



UPC: a Suitable Alternative for HPC in the Multicore Era

Programming models:
- Traditionally: shared-memory and distributed-memory programming models
- Challenge: hybrid memory architectures
- PGAS (Partitioned Global Address Space): a global address space logically partitioned among threads (see the sketch below)

PGAS languages:
- UPC → C
- Titanium → Java
- Co-Array Fortran → Fortran

UPC compilers:
- Berkeley UPC
- GCC UPC (Intrepid)
- Michigan Tech UPC
- HP, Cray and IBM UPC compilers
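A minimal UPC sketch of the PGAS model (illustrative, not part of the talk): a shared array is dealt out to the threads block-cyclically, and each thread updates only the elements it has affinity to.

#include <stdio.h>
#include <upc.h>

/* Shared array distributed block-cyclically: blocks of 4 consecutive
   elements go to threads 0, 1, 2, ... in round-robin order. */
shared [4] double v[4*THREADS];

int main(void)
{
    int i;
    /* The affinity expression (&v[i]) makes each thread execute only the
       iterations whose element lives in its own partition. */
    upc_forall (i = 0; i < 4*THREADS; i++; &v[i])
        v[i] = MYTHREAD;
    upc_barrier;                /* wait until every thread has finished */
    if (MYTHREAD == 0)          /* the last block belongs to thread THREADS-1 */
        printf("v[last] written by thread %.0f\n", v[4*THREADS-1]);
    return 0;
}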



Studied Numerical Operations

BLAS libraries:
- Basic Linear Algebra Subprograms: specification of a set of numerical functions
- Widely used by scientists and engineers
- SparseBLAS and PBLAS (Parallel BLAS)
- Development of UPCBLAS:
  - gemv: matrix-vector product (y = α ∗ A ∗ x + β ∗ y)
  - gemm: matrix-matrix product (C = α ∗ A ∗ B + β ∗ C)

Studied routines:
- trsv: BLAS2 triangular solver (M ∗ x = b)
- trsm: BLAS3 triangular solver (M ∗ X = B)
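For reference, a minimal sketch of the sequential trsv through the standard CBLAS interface (the UPCBLAS signatures are not shown in the slides, so none are assumed here; solve_lower is a hypothetical name):

#include <cblas.h>

/* Solve M * x = b for a dense lower-triangular n x n matrix M (row-major).
   cblas_dtrsv overwrites the right-hand side b with the solution x. */
void solve_lower(int n, const double *M, double *b)
{
    cblas_dtrsv(CblasRowMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                n, M, n, b, 1);
}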



Parallel BLAS2 Triangular Solver (M ∗ x = b) (I)

Example:
- matrix 8×8
- 2 threads
- 2 rows per block

Types of blocks (Mij):
- i < j → zero matrix
- i = j → triangular matrix
- i > j → square matrix



Parallel BLAS2 Triangular Solver (M ∗ x = b) (II)

THREAD 0 → trsv(M11, x1, b1)



Parallel BLAS2 Triangular Solver (M ∗ x = b) (III)

THREAD 0 → gemv(M31, x1, b3)
THREAD 1 → gemv(M21, x1, b2) → trsv(M22, x2, b2) → gemv(M41, x1, b4)



Parallel BLAS2 Triangular Solver (M ∗ x = b) (IV)

THREAD 0 → gemv(M32, x2, b3) → trsv(M33, x3, b3)
THREAD 1 → gemv(M42, x2, b4)



Parallel BLAS2 Triangular Solver (M ∗ x = b) (V)

THREAD 1 → gemv(M43, x3, b4) → trsv(M44, x4, b4)
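Putting steps (II)-(V) together: a sequential C sketch of the blocked forward substitution (a reconstruction of the dependency structure, not the talk's UPC code; blocked_trsv and its arguments are hypothetical names). Under the block-cyclic distribution, block row i belongs to thread i mod THREADS, and the gemv updates of one step run concurrently with the trsv of the next.

#include <cblas.h>

/* Solve M * x = b for a lower-triangular n x n matrix M (row-major) by
   blocks of bs rows (bs assumed to divide n). b is overwritten with x. */
void blocked_trsv(int n, int bs, const double *M, double *b)
{
    int nb = n / bs;
    for (int k = 0; k < nb; k++) {
        /* The owner of block row k solves the diagonal triangular block:
           Mkk * xk = bk */
        cblas_dtrsv(CblasRowMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                    bs, M + (size_t)k*bs*n + k*bs, n, b + k*bs, 1);
        /* The owners of the remaining block rows update their pieces of b
           (these are the gemv calls that run in parallel):
           bi = bi - Mik * xk */
        for (int i = k + 1; i < nb; i++)
            cblas_dgemv(CblasRowMajor, CblasNoTrans, bs, bs,
                        -1.0, M + (size_t)i*bs*n + k*bs, n,
                        b + k*bs, 1, 1.0, b + i*bs, 1);
    }
}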



Parallel BLAS2 Triangular Solver (M ∗ x = b) (and VI)

Impact of the block size: the more blocks the matrix is divided into, the more...
- computations can be performed simultaneously (↑ performance)
- synchronizations are needed (↓ performance)
The best block size is determined automatically → see the paper.



Parallel BLAS3 Triangular Solver (M ∗ X = B) (I)

Studied distributions:
- Triangular and dense matrices distributed by rows → same approach as in BLAS2, but replacing the sequential routines (gemv → gemm, trsv → trsm); see the sketch below
- Dense matrices distributed by columns
- Triangular and dense matrices with a 2D distribution (multicore-aware)
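A sketch of the row-distributed variant under the same assumptions as the BLAS2 sketch, with gemv → gemm and trsv → trsm (blocked_trsm is again a hypothetical name):

#include <cblas.h>

/* Solve M * X = B for a lower-triangular n x n matrix M and a dense
   n x nrhs right-hand side B (both row-major), by blocks of bs rows
   (bs assumed to divide n). B is overwritten with X. */
void blocked_trsm(int n, int nrhs, int bs, const double *M, double *B)
{
    int nb = n / bs;
    for (int k = 0; k < nb; k++) {
        /* Solve the diagonal block: Mkk * Xk = Bk */
        cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasNonUnit, bs, nrhs, 1.0,
                    M + (size_t)k*bs*n + k*bs, n,
                    B + (size_t)k*bs*nrhs, nrhs);
        /* Update the trailing block rows: Bi = Bi - Mik * Xk */
        for (int i = k + 1; i < nb; i++)
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        bs, nrhs, bs, -1.0,
                        M + (size_t)i*bs*n + k*bs, n,
                        B + (size_t)k*bs*nrhs, nrhs, 1.0,
                        B + (size_t)i*bs*nrhs, nrhs);
    }
}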



Parallel BLAS3 Triangular Solver (M ∗ X = B) (II)

Dense Matrices Distributed by Columns
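The point of this distribution: the column blocks of X are independent of each other, so each thread can solve for its own columns with a single sequential trsm and no synchronization, provided the whole triangular matrix M is accessible to it (e.g. replicated). A minimal sketch with hypothetical names:

#include <cblas.h>

/* Each thread solves M * X_local = B_local for its own column block.
   B_local (n x local_cols, row-major) holds this thread's columns of B
   and is overwritten with the corresponding columns of X. */
void solve_local_columns(int n, int local_cols, const double *M,
                         double *B_local)
{
    cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans,
                CblasNonUnit, n, local_cols, 1.0, M, n,
                B_local, local_cols);
}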



Parallel BLAS3 Triangular Solver (M ∗ X = B) (and III)

Triangular and dense matrices with a 2D distribution (multicore-aware):
- Node 1 → cores 0 & 1
- Node 2 → cores 2 & 3
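One way such a multicore-aware 2D mapping can be written (an illustrative assumption, not necessarily the paper's exact scheme): block rows go to nodes and block columns to the cores within a node, so most of the traffic generated by the trailing updates stays inside a node.

/* Illustrative 2D block-to-thread mapping (assumed scheme, hypothetical
   helper): block row -> node, block column -> core within that node. */
int block_owner(int block_row, int block_col, int nodes, int cores_per_node)
{
    int node = block_row % nodes;
    int core = block_col % cores_per_node;
    return node * cores_per_node + core;   /* flat UPC thread id */
}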




Evaluation of the BLAS2 Triangular Solver

Departmental cluster; InfiniBand; 8 nodes; 2 threads/node

[Figure: speedups of the UPC solver and ScaLAPACK for m = 30000 on 2, 4, 8 and 16 threads]


Evaluation of the BLAS3 Triangular Solver (I)

Finis Terrae supercomputer (Itanium2); InfiniBand; 32 nodes; 4 threads/node

[Figure: speedups of UPC M_dist, UPC M_rep, UPC multi and ScaLAPACK for m = 12000, n = 12000 on 2 to 128 threads]


Evaluation of the BLAS3 Triangular Solver (II)

Finis Terrae supercomputer (Itanium2); InfiniBand; 32 nodes; 4 threads/node

[Figure: speedups of UPC M_dist, UPC M_rep, UPC multi and ScaLAPACK for m = 15000, n = 4000 on 2 to 128 threads]


Evaluation of the BLAS3 Triangular Solver (and III)

Finis Terrae supercomputer (Itanium2); InfiniBand; 32 nodes; 4 threads/node

[Figure: speedups of UPC M_dist, UPC M_rep, UPC multi and ScaLAPACK for m = 8000, n = 25000 on 2 to 128 threads]



Main Conclusions

Summary:
- Implementation of BLAS triangular solvers for UPC
- Several techniques to improve their performance
- Special effort to find the most appropriate data distributions:
  - BLAS2 → block-cyclic distribution by rows, with the block size automatically determined according to the characteristics of the scenario
  - BLAS3 → choice of distribution depending on the memory constraints

Comparison with ScaLAPACK (MPI):
- UPC easier to use
- Similar or better performance
