Dense Triangular Solvers on Multicore Clusters using UPC Jorge González-Domínguez*, María J. Martín, Guillermo L. Taboada, Juan Touriño
Computer Architecture Group University of A Coruña (Spain) {jgonzalezd,mariam,taboada,juan}@udc.es
International Conference on Computational Science, ICCS 2011
Outline
1. Introduction
2. BLAS2 Triangular Solver
3. BLAS3 Triangular Solver
4. Experimental Evaluation
5. Conclusions
1. Introduction
UPC: a Suitable Alternative for HPC in the Multicore Era

Programming models:
- Traditionally: shared-memory and distributed-memory programming models
- Challenge: hybrid memory architectures → PGAS (Partitioned Global Address Space)

PGAS languages:
- UPC → C
- Titanium → Java
- Co-Array Fortran → Fortran

UPC compilers: Berkeley UPC, GCC UPC (Intrepid), Michigan Tech UPC, and the HP, Cray and IBM UPC compilers
Studied Numerical Operations

BLAS libraries:
- Basic Linear Algebra Subprograms: specification of a set of numerical functions
- Widely used by scientists and engineers
- Extensions: SparseBLAS and PBLAS (Parallel BLAS)
- This work: development of UPCBLAS
  - gemv: matrix-vector product (y ← α ∗ A ∗ x + β ∗ y)
  - gemm: matrix-matrix product (C ← α ∗ A ∗ B + β ∗ C)

Studied routines:
- trsv: BLAS2 triangular solver (M ∗ x = b)
- trsm: BLAS3 triangular solver (M ∗ X = B)
A short example of the trsv operation through the standard BLAS interface is sketched below.
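For reference, trsv and trsm follow the standard BLAS interface, which solves the triangular system in place (the right-hand side is overwritten with the solution). A minimal sketch using the C interface to BLAS (CBLAS), assuming a CBLAS implementation such as OpenBLAS or the reference one is installed; the matrix values are made up for illustration:

    #include <cblas.h>

    int main(void) {
        /* Lower triangular 3x3 matrix M, stored row-major */
        double M[9] = { 2.0, 0.0, 0.0,
                        1.0, 3.0, 0.0,
                        4.0, 5.0, 6.0 };
        /* Right-hand side b; trsv overwrites it with the solution x */
        double x[3] = { 2.0, 7.0, 26.0 };

        /* Solve M * x = b (BLAS2 triangular solver) */
        cblas_dtrsv(CblasRowMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                    3, M, 3, x, 1);
        /* x now holds {1.0, 2.0, 2.0} */
        return 0;
    }

cblas_dtrsm solves the analogous BLAS3 problem M ∗ X = B for a full right-hand-side matrix B.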
2. BLAS2 Triangular Solver
Parallel BLAS2 Triangular Solver (M ∗ x = b) (I)

Example:
- 8×8 matrix
- 2 threads
- 2 rows per block

Types of blocks M_ij:
- i < j: zero block
- i = j: triangular block
- i > j: square (dense) block
Parallel BLAS2 Triangular Solver (M ∗ x = b) (II)

Step 1:
- THREAD 0 → trsv(M11, x1, b1)
Parallel BLAS2 Triangular Solver (M ∗ x = b) (III)

Step 2:
- THREAD 0 → gemv(M31, x1, b3)
- THREAD 1 → gemv(M21, x1, b2) → trsv(M22, x2, b2); gemv(M41, x1, b4)
Parallel BLAS2 Triangular Solver (M ∗ x = b) (IV)

Step 3:
- THREAD 0 → gemv(M32, x2, b3) → trsv(M33, x3, b3)
- THREAD 1 → gemv(M42, x2, b4)
Parallel BLAS2 Triangular Solver (M ∗ x = b) (V)

Step 4:
- THREAD 1 → gemv(M43, x3, b4) → trsv(M44, x4, b4)
Parallel BLAS2 Triangular Solver (M ∗ x = b) (and VI)

Impact of the block size: the more blocks the matrix is divided into, the more...
- computations can be performed simultaneously (↑ performance)
- synchronizations are needed (↓ performance)
The best block size is determined automatically; see the paper for details. A sequential sketch of the blocked algorithm follows.
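The steps above correspond to a blocked forward substitution in which the diagonal trsv calls are sequential but the gemv updates of later blocks run in parallel. A minimal sequential sketch in plain C of that blocked structure (block-cyclic distribution, UPC threads and synchronization omitted, so this only illustrates the dependencies, not the UPCBLAS implementation):

    #include <stddef.h>

    /* Blocked forward substitution: solve M * x = b for a lower triangular
     * n x n matrix M (row-major) using square blocks of size bs (n % bs == 0).
     * In the UPC solver each block of rows is owned by a thread in a
     * block-cyclic way; here everything runs in a single thread. */
    void blocked_trsv_lower(size_t n, size_t bs, const double *M,
                            const double *b, double *x)
    {
        size_t nb = n / bs;                 /* number of row/column blocks */
        for (size_t i = 0; i < n; i++)
            x[i] = b[i];                    /* x starts as a copy of b */

        for (size_t k = 0; k < nb; k++) {
            /* trsv(Mkk, xk, bk): solve the triangular diagonal block */
            for (size_t i = k * bs; i < (k + 1) * bs; i++) {
                for (size_t j = k * bs; j < i; j++)
                    x[i] -= M[i * n + j] * x[j];
                x[i] /= M[i * n + i];
            }
            /* gemv(Mik, xk, bi): update the pending right-hand-side blocks.
             * These updates are mutually independent, so in the UPC version
             * they are done in parallel by the threads owning each row block. */
            for (size_t ib = k + 1; ib < nb; ib++)
                for (size_t i = ib * bs; i < (ib + 1) * bs; i++)
                    for (size_t j = k * bs; j < (k + 1) * bs; j++)
                        x[i] -= M[i * n + j] * x[j];
        }
    }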
3. BLAS3 Triangular Solver
Parallel BLAS3 Triangular Solver (M ∗ X = B) (I)

Studied distributions:
- Triangular and dense matrices distributed by rows → similar approach to the BLAS2 solver, replacing the sequential kernels (gemv → gemm, trsv → trsm)
- Dense matrices distributed by columns
- Triangular and dense matrices with a 2D distribution (multicore-aware)
Parallel BLAS3 Triangular Solver (M ∗ X = B) (II)

Dense matrices distributed by columns (see the sketch below).
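Each column x_j of X satisfies its own triangular system M ∗ x_j = b_j, so with the dense matrices distributed by columns every thread can solve the systems of its local columns without inter-thread synchronization, as long as the triangular matrix M is accessible to all threads. A minimal sequential plain-C sketch of this idea (the outer column loop is the one that a UPC version would split among threads; the UPC data layout is omitted):

    #include <stddef.h>

    /* Solve M * X = B for a lower triangular n x n matrix M and an n x m
     * right-hand side B (both row-major). Columns are fully independent:
     * with a column distribution each thread runs its share of the outer
     * loop on its own. */
    void trsm_lower_by_columns(size_t n, size_t m, const double *M,
                               const double *B, double *X)
    {
        for (size_t c = 0; c < m; c++) {        /* independent columns */
            for (size_t i = 0; i < n; i++) {    /* forward substitution */
                double s = B[i * m + c];
                for (size_t j = 0; j < i; j++)
                    s -= M[i * n + j] * X[j * m + c];
                X[i * m + c] = s / M[i * n + i];
            }
        }
    }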
Parallel BLAS3 Triangular Solver (M ∗ X = B) (and III)

Triangular and dense matrices with a 2D distribution (multicore-aware). Example mapping: Node 1 → cores 0 & 1; Node 2 → cores 2 & 3. A generic mapping sketch follows.
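For illustration only: a common way to make a 2D block distribution multicore-aware is to choose the thread grid so that the blocks of one grid row are owned by threads running on the same node, keeping their communication inside the node. The owner function below is a generic block-cyclic sketch under the assumption that consecutive UPC thread IDs are placed on the same node (as in the Node 1 → cores 0 & 1, Node 2 → cores 2 & 3 example above); it is not the exact UPCBLAS mapping.

    #include <stddef.h>

    /* Owner thread of block (bi, bj) in a 2D block-cyclic distribution over a
     * pr x pc thread grid. With pc equal to the number of cores per node and
     * consecutive thread IDs placed on the same node, all blocks of a given
     * grid row stay inside one node. Generic sketch, not the UPCBLAS layout. */
    size_t block_owner_2d(size_t bi, size_t bj, size_t pr, size_t pc)
    {
        size_t ti = bi % pr;    /* row of the thread grid    */
        size_t tj = bj % pc;    /* column of the thread grid */
        return ti * pc + tj;    /* consecutive IDs along a grid row */
    }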
4. Experimental Evaluation
Evaluation of the BLAS2 Triangular Solver
Departmental cluster; InfiniBand; 8 nodes; 2 threads per node

[Figure: speedups of the UPC and ScaLAPACK trsv for m = 30000 on 2 to 16 threads]
Evaluation of the BLAS3 Triangular Solver (I)
Finis Terrae supercomputer (Itanium2); InfiniBand; 32 nodes; 4 threads per node

[Figure: speedups of UPC M_dist, UPC M_rep, UPC multi and ScaLAPACK for m = 12000, n = 12000 on 2 to 128 threads]
Evaluation of the BLAS3 Triangular Solver (II)
Finis Terrae supercomputer (Itanium2); InfiniBand; 32 nodes; 4 threads per node

[Figure: speedups of UPC M_dist, UPC M_rep, UPC multi and ScaLAPACK for m = 15000, n = 4000 on 2 to 128 threads]
Evaluation of the BLAS3 Triangular Solver (and III)
Finis Terrae supercomputer (Itanium2); InfiniBand; 32 nodes; 4 threads per node

[Figure: speedups of UPC M_dist, UPC M_rep, UPC multi and ScaLAPACK for m = 8000, n = 25000 on 2 to 128 threads]
5. Conclusions
Main Conclusions

Summary:
- Implementation of the BLAS triangular solvers for UPC
- Several techniques to improve their performance
- Special effort to find the most appropriate data distributions:
  - BLAS2 → block-cyclic distribution by rows, with the block size automatically determined according to the characteristics of the scenario
  - BLAS3 → distribution chosen depending on the memory constraints

Comparison with ScaLAPACK (MPI):
- UPC easier to use
- Similar or better performance
Dense Triangular Solvers on Multicore Clusters using UPC Jorge González-Domínguez*, María J. Martín, Guillermo L. Taboada, Juan Touriño
Computer Architecture Group University of A Coruña (Spain) {jgonzalezd,mariam,taboada,juan}@udc.es
International Conference on Computational Science, ICCS 2011