BeBOP
Automatic Performance Tuning of Sparse Matrix Kernels
Consider sparse matrix-vector multiply (SpMxV).

Implementation space:
• Set of r x c block sizes
• Fill in explicit zeros

Search:
• Run-time estimation (when the matrix is known)
• Evaluate code against architecture-specific upper bounds
• Conceptually, the set of “interesting” implementations
• Depends on kernel and input
• May vary: instruction mix and order, memory access patterns, data structures and precisions, mathematical formulation, …
• Dense linear algebra: ATLAS/PHiPAC
• Signal processing: FFTW; SPIRAL
• MPI collective operations (Vadhiyar & Dongarra, 2001)
\[
\begin{pmatrix} L_{11} & \\ L_{21} & L_{\mathrm{dense}} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} =
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
\]
(solved by a sparse solve, a sparse matvec, and a dense solve)
Sparse Triangular Solve: Performance Summary [ultra-solaris]
Cache blocking—Performance on a Pentium 4 for information retrieval (LSI) and linear programming matrices (LP-Italian Railways, LP-osa60, Exp. Design (BIBD)), with up to 5x speedups.
Sparse triangular solve—(Top-left) Triangular factors from sparse LU often have a large dense trailing triangle (here containing about 20% of the non-zeros). (Top-right) The matrix can be partitioned into sparse (L11, L21) and dense (Ldense) parts. (Bottom) Performance improvements from register blocking the sparse part and calling a tuned vendor BLAS routine (TRSM) for the dense solve step. [ICS/POHLL ’02]
1. \( L_{11} x_1 = b_1 \) (sparse solve)
2. \( \hat{b}_2 = b_2 - L_{21} x_1 \) (sparse matvec)
3. \( L_{\mathrm{dense}} x_2 = \hat{b}_2 \) (dense solve)
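In code, the three-step substitution can be sketched as follows. This is a Python sketch under assumed list-of-rows storage for L21 and the dense triangle; a tuned implementation would register-block step 1 and call a vendor BLAS TRSM in step 3.

```python
def hybrid_lower_solve(L11_solve, L21, L_dense, b1, b2):
    """Solve the partitioned lower-triangular system
        [ L11    0      ] [x1]   [b1]
        [ L21  L_dense  ] [x2] = [b2]
    in three steps: sparse solve, sparse matvec, dense solve.
    L11_solve is any solver for L11 x1 = b1; L21 is a (dense-stored)
    list of rows; L_dense is a dense lower triangle as a list of rows.
    """
    # Step 1: sparse triangular solve  L11 x1 = b1
    x1 = L11_solve(b1)
    # Step 2: sparse matvec            b2_hat = b2 - L21 x1
    b2_hat = [b - sum(a * x for a, x in zip(row, x1))
              for row, b in zip(L21, b2)]
    # Step 3: dense triangular solve   L_dense x2 = b2_hat
    # (forward substitution here; a tuned TRSM would be called instead)
    x2 = []
    for i, row in enumerate(L_dense):
        s = b2_hat[i] - sum(row[j] * x2[j] for j in range(i))
        x2.append(s / row[i])
    return x1 + x2
```

The split lets the irregular (sparse) work and the regular (dense) work each run in the storage format that suits it best.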
Experimental results—Performance (Mflop/s) on a set of 44 benchmark matrices from a variety of applications. Speedups of 2.5x are possible. [SC’02]
• AATx, ATAx (1.2x—4.2x)
• RART, Akx, …
Multiple vectors—Significant speedups are possible when multiplying by several vectors (800 MHz Itanium; DGEMM n=k=2000, m=32).
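The multiple-vector optimization amounts to reusing each matrix entry across all k vectors before moving on. A minimal CSR sketch (illustrative Python, not the SPARSITY-generated code; the k vectors are the columns of an n x k array X stored as a list of rows):

```python
def csr_matvec_multi(row_ptr, col_idx, vals, X):
    """Y = A * X for A in CSR format and k vectors stored as the
    columns of X (n x k, list of rows). Each matrix entry vals[j]
    is loaded once and applied to all k vectors, amortizing the
    irregular access to A and to the rows of X."""
    k = len(X[0])
    Y = [[0.0] * k for _ in range(len(row_ptr) - 1)]
    for i in range(len(row_ptr) - 1):
        acc = Y[i]
        for j in range(row_ptr[i], row_ptr[i + 1]):
            a = vals[j]            # one load of the matrix entry...
            xrow = X[col_idx[j]]
            for v in range(k):     # ...reused across the k vectors
                acc[v] += a * xrow[v]
    return Y
```

With k vectors, the matrix is streamed through the memory hierarchy once instead of k times, which is where the reported 2—7x gains come from.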
Higher-level sparse kernels
•Hybrid sparse/dense data structure (1.2x—1.8x)
Splitting and reordering—(Left) Speedup after splitting a matrix (possibly after reordering) into a blocked part and an unblocked part, A = A(r×c) + A(1×1), to avoid fill. (Middle, right) Dense blocks can be created by a judicious reordering: on matrix 17 (17-rim.rua), we used a traveling salesman problem formulation due to Pinar (1997).
Approach to Automatic Tuning
• Either off-line, on-line, or combination
Sparse triangular solve
Architecture dependence—Sixteen r x c blocked compressed sparse row implementations of SpMxV, each color-coded by performance (Mflop/s) and labeled by its speedup over the unblocked (1 x 1) code for the sparse matrix above, on two platforms: 333 MHz Ultra 2i (left) and 800 MHz Itanium (right). The best block size is not always 6x6: the Ultra 2i prefers 6x6, but the Itanium prefers 3x1.
Search using models and experiments
• Register-level blocking (up to 2.5x speedups)
• Symmetry (up to 2x speedup)
• Diagonals, bands (up to 2.2x)
• Splitting for variable block structure (1.3x—1.7x)
• Reordering to create dense blocks + splitting (up to 2x)
• Cache blocking (1.5x—5x)
• Multiple vectors (2—7x)
• And combinations…
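As a concrete illustration of the first technique, register-level blocking, here is a minimal 2x2 block compressed sparse row (BCSR) SpMxV sketch. It is illustrative Python, not the tuned generated C kernels, and the data-structure names (`brow_ptr`, `bcol_idx`, `bvals`) are assumptions.

```python
def bcsr_matvec_2x2(brow_ptr, bcol_idx, bvals, x, y):
    """y += A*x for A stored in 2x2 BCSR format.

    brow_ptr[i]..brow_ptr[i+1] index the blocks of block row i;
    bcol_idx[j] is the block column of block j; bvals[j] is a
    4-tuple (a00, a01, a10, a11), with explicit zeros filled in
    where the sparsity pattern does not fill the block.
    """
    n_brows = len(brow_ptr) - 1
    for i in range(n_brows):
        y0 = y[2 * i]
        y1 = y[2 * i + 1]
        for j in range(brow_ptr[i], brow_ptr[i + 1]):
            c = 2 * bcol_idx[j]
            a00, a01, a10, a11 = bvals[j]
            x0, x1 = x[c], x[c + 1]
            # The 2x2 block multiply is fully unrolled so the block of
            # A and the two x entries can be kept in registers.
            y0 += a00 * x0 + a01 * x1
            y1 += a10 * x0 + a11 * x1
        y[2 * i] = y0
        y[2 * i + 1] = y1
    return y

# Tiny example: A = [[1, 2, 0, 0],
#                    [3, 4, 0, 0],
#                    [0, 0, 5, 0],
#                    [0, 0, 6, 7]]  (second block stores one explicit zero)
brow_ptr = [0, 1, 2]
bcol_idx = [0, 1]
bvals = [(1.0, 2.0, 3.0, 4.0), (5.0, 0.0, 6.0, 7.0)]
y = bcsr_matvec_2x2(brow_ptr, bcol_idx, bvals, [1.0, 1.0, 1.0, 1.0], [0.0] * 4)
```

Note the explicit zero stored in the second block: register blocking trades extra flops (the fill) for unrolled, register-resident inner loops and one column index per block instead of per entry.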
For each kernel, identify and generate a space of implementations, and search for the best one.
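The search itself can be sketched as a simple time-and-pick loop; the harness below is a hypothetical stand-in for the real benchmarking infrastructure, not the project's actual tooling.

```python
import time

def pick_best(implementations, run_kernel, trials=3):
    """Empirical search: time each candidate implementation on the
    actual kernel/matrix/machine and keep the fastest one."""
    best_impl, best_time = None, float("inf")
    for impl in implementations:
        # min over several trials filters out timer noise and cold caches
        elapsed = min(_timed(run_kernel, impl) for _ in range(trials))
        if elapsed < best_time:
            best_impl, best_time = impl, elapsed
    return best_impl, best_time

def _timed(run_kernel, impl):
    t0 = time.perf_counter()
    run_kernel(impl)
    return time.perf_counter() - t0
```

In practice the candidate list (e.g., the r x c block sizes) is pruned first with models, since exhaustively timing every variant on every matrix is too expensive at run time.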
Sparse matrix example—A 6x6 blocked storage format appears to be the most natural choice for this matrix (03-olafu.rua) for a sparse matrix-vector multiply (SpMxV) implementation…
Filling in zeros (3 x 3 register blocking example)—True non-zeros (•) and explicit zeros (◦): (688 true non-zeros) + (383 explicit zeros) = 1071 stored nz; fill ratio = 1071/688 ≈ 1.56.
Sparse matrix-vector multiply
BeBOP: Current and Future Work
Understanding the impact on higher-level kernels, algorithms, and applications.
\[
y = A A^T x = (a_1 \,\cdots\, a_m) \begin{pmatrix} a_1^T \\ \vdots \\ a_m^T \end{pmatrix} x = \sum_{k=1}^{m} a_k \left( a_k^T x \right)
\]
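A direct transcription of the right-hand sum: each column a_k of A is read once, used for a dot product, and immediately reused for an axpy. Python sketch; `cols` is an assumed name holding the columns of A as plain lists.

```python
def aat_x(cols, x):
    """Compute y = A A^T x in one pass over the columns of A:
    for each column a_k, a dot product t = a_k^T x followed by
    an axpy y += t * a_k, so a_k moves through the memory
    hierarchy only once."""
    n = len(cols[0])
    y = [0.0] * n
    for a_k in cols:
        t = sum(a * xi for a, xi in zip(a_k, x))   # dot product a_k^T x
        for i in range(n):                          # axpy: y += t * a_k
            y[i] += t * a_k[i]
    return y
```

Computing A^T x and then A(A^T x) separately would stream A twice; fusing the two loops over the same column halves the memory traffic, which is the point of the cache optimization in the figure.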
• Design and implementation of a library based on the Sparse BLAS; new heuristics for efficiently choosing optimizations
• Study of performance implications for higher-level algorithms (e.g., block Lanczos)
• New sparse kernels (e.g., matrix powers Ak, triple product RART)
• Integrating with applications (e.g., DOE SciDAC codes)
• Further automation: generating implementation generators
• Using bounds to evaluate current and future architectures
Additional techniques for y=Ax, sparse triangular solve, and ATAx.
Exploiting Matrix Structure
Off-line benchmarking—Performance (Mflop/s) for a dense matrix in sparse format on four architectures (clockwise from upper-left): Ultra 2i-333, Pentium III-500, Power3-375, Itanium-800. Performance is a strong function of the hardware platform.
Estimated Performance(r, c) = Dense Performance(r, c) / Fill Ratio(r, c)
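The resulting run-time selection is a small argmax over the candidate block sizes. Python sketch; `dense_perf` and `fill_ratio` are assumed names for the off-line dense profile and the run-time fill estimate.

```python
def choose_block_size(dense_perf, fill_ratio):
    """Pick the (r, c) maximizing
        Estimated Performance(r, c) = Dense Performance(r, c) / Fill Ratio(r, c).
    dense_perf: {(r, c): Mflop/s measured off-line for a dense matrix
    stored in sparse r x c format}. fill_ratio: {(r, c): (# stored
    non-zeros) / (# true non-zeros), estimated at run time from a
    sample of the matrix}."""
    return max(dense_perf, key=lambda rc: dense_perf[rc] / fill_ratio[rc])

# Hypothetical numbers: 2x2 blocking is faster on the dense profile,
# but its fill penalty on this matrix wipes out the gain, so 1x1 wins.
dense_perf = {(1, 1): 50.0, (2, 2): 90.0}
fill_ratio = {(1, 1): 1.0, (2, 2): 2.0}
best = choose_block_size(dense_perf, fill_ratio)  # → (1, 1)
```

The dense profile is measured once per architecture; only the fill-ratio estimate depends on the matrix, so the choice is cheap at run time.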
Performance depends strongly on both the matrix and hardware platform.
Observations
o Estimate Fill Ratio(r, c) = (# stored non-zeros) / (# true non-zeros)
o Choose r x c to maximize Estimated Performance(r, c)
o Matrix known only at run-time (in general)
o Measure Dense Performance (r,c), in Mflop/s, of dense matrix in sparse r x c format
• Choose best data structure and implementation for given kernel, sparse matrix, and machine
• Off-line benchmarking (once per architecture)
Goal: Automatic tuning of sparse kernels
Search
• Performance depends on architecture, kernel, and matrix
Application matrices: Web connectivity matrix—Speeding up SpMxV for a web subgraph (1M x 1M, 3.1M non-zeros) using register blocking and reordering (natural, RCM, and MMD orderings) on a 900 MHz Itanium 2.
o Microprocessor performance difficult to model
o Widening processor-memory gap; deep memory hierarchies
• Hardware complexity is increasing
o Indirect, irregular memory accesses
o High bandwidth requirements, poor instruction mix
• Typical uniprocessor performance < 10% peak
Challenges in tuning sparse code
The SPARSITY system (Im & Yelick, 1999) applies the methodology to y=Ax.
Attila Gyulassy, Chris Hsu, Shoaib Kamil, Benjamin Lee, Hyun-Jin Moon, Rajesh Nishtala
bebop.cs.berkeley.edu
Sparse kernel performance depends on both the matrix and hardware platform.
Berkeley Benchmarking and OPtimization Group
Example: Choosing a Block Size
Problem Context
Richard Vuduc, Eun-Jin Im, James Demmel, Katherine Yelick
Sparse AATx, ATAx—(Left) A can be brought through the memory hierarchy only once: for each column ak of A, compute a dot product followed by a scaled vector addition (“axpy”). (Right) This cache-optimized implementation can be naturally combined with register blocking.