Dense Triangular Solvers on Multicore Clusters using UPC Jorge González-Domínguez*, María J. Martín, Guillermo L. Taboada, Juan Touriño
Computer Architecture Group University of A Coruña (Spain) {jgonzalezd,mariam,taboada,juan}@udc.es
International Conference on Computational Science, ICCS 2011
Outline
1. Introduction
2. BLAS2 Triangular Solver
3. BLAS3 Triangular Solver
4. Experimental Evaluation
5. Conclusions
1. Introduction
UPC: a Suitable Alternative for HPC in the Multicore Era

Programming models:
- Traditionally: shared-memory and distributed-memory programming models
- Challenge: hybrid memory architectures → PGAS (Partitioned Global Address Space)

PGAS languages:
- UPC → C
- Titanium → Java
- Co-Array Fortran → Fortran

UPC compilers: Berkeley UPC, GCC UPC (Intrepid), Michigan Tech UPC, and the HP, Cray and IBM UPC compilers
Studied Numerical Operations

BLAS libraries:
- Basic Linear Algebra Subprograms: specification of a set of numerical functions
- Widely used by scientists and engineers
- Extensions: SparseBLAS and PBLAS (Parallel BLAS)
- This work: development of UPCBLAS
  - gemv: matrix-vector product (y ← α ∗ A ∗ x + β ∗ y)
  - gemm: matrix-matrix product (C ← α ∗ A ∗ B + β ∗ C)

Studied routines:
- trsv: BLAS2 triangular solver (M ∗ x = b)
- trsm: BLAS3 triangular solver (M ∗ X = B)
A short example of the trsv operation through the standard BLAS interface is sketched below.
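For reference, trsv and trsm follow the standard BLAS interface, which solves the triangular system in place (the right-hand side is overwritten with the solution). A minimal sketch using the C interface to BLAS (CBLAS), assuming a CBLAS implementation such as OpenBLAS or the reference one is installed; the matrix values are made up for illustration:

    #include <cblas.h>

    int main(void) {
        /* Lower triangular 3x3 matrix M, stored row-major */
        double M[9] = { 2.0, 0.0, 0.0,
                        1.0, 3.0, 0.0,
                        4.0, 5.0, 6.0 };
        /* Right-hand side b; trsv overwrites it with the solution x */
        double x[3] = { 2.0, 7.0, 26.0 };

        /* Solve M * x = b (BLAS2 triangular solver) */
        cblas_dtrsv(CblasRowMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                    3, M, 3, x, 1);
        /* x now holds {1.0, 2.0, 2.0} */
        return 0;
    }

cblas_dtrsm solves the analogous BLAS3 problem M ∗ X = B for a full right-hand-side matrix B.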
2. BLAS2 Triangular Solver
Parallel BLAS2 Triangular Solver (M ∗ x = b) (I)

Example:
- 8×8 matrix
- 2 threads
- 2 rows per block

Types of blocks M_ij:
- i < j: zero block
- i = j: triangular block
- i > j: square (dense) block
Parallel BLAS2 Triangular Solver (M ∗ x = b) (II)

Step 1:
- THREAD 0 → trsv(M11, x1, b1)
Parallel BLAS2 Triangular Solver (M ∗ x = b) (III)

Step 2:
- THREAD 0 → gemv(M31, x1, b3)
- THREAD 1 → gemv(M21, x1, b2) → trsv(M22, x2, b2); gemv(M41, x1, b4)
Parallel BLAS2 Triangular Solver (M ∗ x = b) (IV)

Step 3:
- THREAD 0 → gemv(M32, x2, b3) → trsv(M33, x3, b3)
- THREAD 1 → gemv(M42, x2, b4)
Parallel BLAS2 Triangular Solver (M ∗ x = b) (V)

Step 4:
- THREAD 1 → gemv(M43, x3, b4) → trsv(M44, x4, b4)
Parallel BLAS2 Triangular Solver (M ∗ x = b) (and VI)

Impact of the block size: the more blocks the matrix is divided into, the more...
- computations can be performed simultaneously (↑ performance)
- synchronizations are needed (↓ performance)
The best block size is determined automatically; see the paper for details. A sequential sketch of the blocked algorithm follows.
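The steps above correspond to a blocked forward substitution in which the diagonal trsv calls are sequential but the gemv updates of later blocks run in parallel. A minimal sequential sketch in plain C of that blocked structure (block-cyclic distribution, UPC threads and synchronization omitted, so this only illustrates the dependencies, not the UPCBLAS implementation):

    #include <stddef.h>

    /* Blocked forward substitution: solve M * x = b for a lower triangular
     * n x n matrix M (row-major) using square blocks of size bs (n % bs == 0).
     * In the UPC solver each block of rows is owned by a thread in a
     * block-cyclic way; here everything runs in a single thread. */
    void blocked_trsv_lower(size_t n, size_t bs, const double *M,
                            const double *b, double *x)
    {
        size_t nb = n / bs;                 /* number of row/column blocks */
        for (size_t i = 0; i < n; i++)
            x[i] = b[i];                    /* x starts as a copy of b */

        for (size_t k = 0; k < nb; k++) {
            /* trsv(Mkk, xk, bk): solve the triangular diagonal block */
            for (size_t i = k * bs; i < (k + 1) * bs; i++) {
                for (size_t j = k * bs; j < i; j++)
                    x[i] -= M[i * n + j] * x[j];
                x[i] /= M[i * n + i];
            }
            /* gemv(Mik, xk, bi): update the pending right-hand-side blocks.
             * These updates are mutually independent, so in the UPC version
             * they are done in parallel by the threads owning each row block. */
            for (size_t ib = k + 1; ib < nb; ib++)
                for (size_t i = ib * bs; i < (ib + 1) * bs; i++)
                    for (size_t j = k * bs; j < (k + 1) * bs; j++)
                        x[i] -= M[i * n + j] * x[j];
        }
    }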
3. BLAS3 Triangular Solver
Parallel BLAS3 Triangular Solver (M ∗ X = B) (I)

Studied distributions:
- Triangular and dense matrices distributed by rows → similar approach to the BLAS2 solver, replacing the sequential kernels (gemv → gemm, trsv → trsm)
- Dense matrices distributed by columns
- Triangular and dense matrices with a 2D distribution (multicore-aware)
Parallel BLAS3 Triangular Solver (M ∗ X = B) (II)

Dense matrices distributed by columns (see the sketch below).
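Each column x_j of X satisfies its own triangular system M ∗ x_j = b_j, so with the dense matrices distributed by columns every thread can solve the systems of its local columns without inter-thread synchronization, as long as the triangular matrix M is accessible to all threads. A minimal sequential plain-C sketch of this idea (the outer column loop is the one that a UPC version would split among threads; the UPC data layout is omitted):

    #include <stddef.h>

    /* Solve M * X = B for a lower triangular n x n matrix M and an n x m
     * right-hand side B (both row-major). Columns are fully independent:
     * with a column distribution each thread runs its share of the outer
     * loop on its own. */
    void trsm_lower_by_columns(size_t n, size_t m, const double *M,
                               const double *B, double *X)
    {
        for (size_t c = 0; c < m; c++) {        /* independent columns */
            for (size_t i = 0; i < n; i++) {    /* forward substitution */
                double s = B[i * m + c];
                for (size_t j = 0; j < i; j++)
                    s -= M[i * n + j] * X[j * m + c];
                X[i * m + c] = s / M[i * n + i];
            }
        }
    }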
Parallel BLAS3 Triangular Solver (M ∗ X = B) (and III)

Triangular and dense matrices with a 2D distribution (multicore-aware). Example mapping: Node 1 → cores 0 & 1; Node 2 → cores 2 & 3. A generic mapping sketch follows.
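For illustration only: a common way to make a 2D block distribution multicore-aware is to choose the thread grid so that the blocks of one grid row are owned by threads running on the same node, keeping their communication inside the node. The owner function below is a generic block-cyclic sketch under the assumption that consecutive UPC thread IDs are placed on the same node (as in the Node 1 → cores 0 & 1, Node 2 → cores 2 & 3 example above); it is not the exact UPCBLAS mapping.

    #include <stddef.h>

    /* Owner thread of block (bi, bj) in a 2D block-cyclic distribution over a
     * pr x pc thread grid. With pc equal to the number of cores per node and
     * consecutive thread IDs placed on the same node, all blocks of a given
     * grid row stay inside one node. Generic sketch, not the UPCBLAS layout. */
    size_t block_owner_2d(size_t bi, size_t bj, size_t pr, size_t pc)
    {
        size_t ti = bi % pr;    /* row of the thread grid    */
        size_t tj = bj % pc;    /* column of the thread grid */
        return ti * pc + tj;    /* consecutive IDs along a grid row */
    }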
4. Experimental Evaluation
Evaluation of the BLAS2 Triangular Solver
Departmental cluster; InfiniBand; 8 nodes; 2 threads per node

[Figure: speedups of the UPC and ScaLAPACK trsv for m = 30000 on 2 to 16 threads]
Evaluation of the BLAS3 Triangular Solver (I)
Finis Terrae supercomputer (Itanium2); InfiniBand; 32 nodes; 4 threads per node

[Figure: speedups of UPC M_dist, UPC M_rep, UPC multi and ScaLAPACK for m = 12000, n = 12000 on 2 to 128 threads]
Evaluation of the BLAS3 Triangular Solver (II)
Finis Terrae supercomputer (Itanium2); InfiniBand; 32 nodes; 4 threads per node

[Figure: speedups of UPC M_dist, UPC M_rep, UPC multi and ScaLAPACK for m = 15000, n = 4000 on 2 to 128 threads]
Evaluation of the BLAS3 Triangular Solver (and III)
Finis Terrae supercomputer (Itanium2); InfiniBand; 32 nodes; 4 threads per node

[Figure: speedups of UPC M_dist, UPC M_rep, UPC multi and ScaLAPACK for m = 8000, n = 25000 on 2 to 128 threads]
5. Conclusions
Main Conclusions

Summary:
- Implementation of the BLAS triangular solvers for UPC
- Several techniques to improve their performance
- Special effort to find the most appropriate data distributions:
  - BLAS2 → block-cyclic distribution by rows, with the block size automatically determined according to the characteristics of the scenario
  - BLAS3 → distribution chosen depending on the memory constraints

Comparison with ScaLAPACK (MPI):
- UPC easier to use
- Similar or better performance
Dense Triangular Solvers on Multicore Clusters using UPC Jorge González-Domínguez*, María J. Martín, Guillermo L. Taboada, Juan Touriño
Computer Architecture Group University of A Coruña (Spain) {jgonzalezd,mariam,taboada,juan}@udc.es
International Conference on Computational Science, ICCS 2011