Automatic Performance Tuning of Sparse Matrix Kernels - CiteSeerX

BeBOP

3 x 3 Register Blocking Example

Consider sparse matrix-vector multiply (SpMxV)

Implementation space
• Set of r x c block sizes
• Fill in explicit zeros

• Run-time estimation (when matrix is known)
• Evaluate code against architecture-specific upper bounds
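The r x c blocking idea above can be sketched as a block compressed sparse row (BCSR) kernel: each stored block is a dense r x c tile, with explicit zeros filled in so every tile is full. This is an illustrative Python sketch under assumed names (`bcsr_spmv`, `brow_ptr`, `bcol_idx`, `bvals`), not the tuned C code the poster's system generates.

```python
# Sketch of r x c register-blocked SpMxV in BCSR format. Each block is a
# dense r x c tile stored row-major in bvals (explicit zeros included).
# Names and layout are illustrative assumptions, not the Sparsity code.

def bcsr_spmv(r, c, brow_ptr, bcol_idx, bvals, x, y):
    """y += A*x for A stored as r x c blocks (BCSR)."""
    for I in range(len(brow_ptr) - 1):          # block row I covers rows I*r..I*r+r-1
        for k in range(brow_ptr[I], brow_ptr[I + 1]):
            J = bcol_idx[k]                     # block column index
            blk = bvals[k * r * c:(k + 1) * r * c]
            for i in range(r):                  # in tuned code this r x c loop
                acc = 0.0                       # nest is fully unrolled for a
                for j in range(c):              # fixed (r, c) choice
                    acc += blk[i * c + j] * x[J * c + j]
                y[I * r + i] += acc
```

For example, a 4x4 matrix with two dense 2x2 tiles (one of them padded with explicit zeros) is stored as `brow_ptr=[0,1,2]`, `bcol_idx=[0,1]`, and eight block values; the kernel then touches each tile once per block row.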

[Figure: r x c register blocking profiles, row block size (r) vs. column block size (c); (688 true non−zeros) + (383 explicit zeros) = 1071 nz]

• Conceptually, the set of "interesting" implementations
• Depends on kernel and input
• May vary: instruction mix and order, memory access patterns, data structures and precisions, mathematical formulation, …

• Dense linear algebra: ATLAS/PHiPAC
• Signal processing: FFTW; SPIRAL
• MPI collective operations (Vadhiyar & Dongarra, 2001)


[Figure: Performance Summary [power3−aix]]


[Figure: Matrix Splitting: A = A(r×c) + A(1×1) [ultra−solaris]]


[ L11          ] [ x1 ]   [ b1 ]
[ L21   Ldense ] [ x2 ] = [ b2 ]

(sparse solve, sparse matvec, dense solve)

[Figure: Sparse Triangular Solve: Performance Summary [ultra−solaris]; matrices: LSI, LP−Italian Railways, LP−osa60, Exp. Design (BIBD)]

Cache blocking—Performance on a Pentium 4 for information retrieval and linear programming matrices, with up to 5x speedups.

[Figure: Post−TSP Reordering: 17−rim.rua (1, 1) −− (100, 100)]

[Figure: sparse triangular solve performance on matrices dense, memplus, wang4, ex11, raefsky4, goodwin, lhr10]

Sparse triangular solve—(Top-left) Triangular factors from sparse LU often have a large dense trailing triangle. (Top-right) The matrix can be partitioned into sparse (L11, L21) and dense (Ldense) parts. (Bottom) Performance improvements from register blocking the sparse part, and calling a tuned vendor BLAS routine (TRSM) for the dense solve step. [ICS/POHLL ’02]


L11 x1 = b1
b̂2 = b2 − L21 x1
Ldense x2 = b̂2
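The three steps above can be sketched in plain Python. This is an illustrative sketch only: `hybrid_lower_solve`, the (column, value) adjacency layout for the sparse parts, and the naive dense forward substitution (standing in for the tuned vendor BLAS TRSM the poster actually calls) are all assumptions.

```python
# Sketch of the hybrid sparse/dense lower triangular solve above.
# L11[i] and L21[i] are lists of (j, value) pairs; Ldense is a dense
# lower triangle. Illustrative only; step 3 is a stand-in for TRSM.

def hybrid_lower_solve(L11, L21, Ldense, b1, b2):
    n1, n2 = len(b1), len(b2)
    # Step 1: sparse forward solve  L11 x1 = b1
    x1 = [0.0] * n1
    for i in range(n1):
        s, diag = b1[i], 1.0
        for j, v in L11[i]:
            if j == i:
                diag = v          # diagonal entry of L11
            else:
                s -= v * x1[j]
        x1[i] = s / diag
    # Step 2: sparse matvec  b2_hat = b2 - L21 x1
    b2_hat = list(b2)
    for i in range(n2):
        for j, v in L21[i]:
            b2_hat[i] -= v * x1[j]
    # Step 3: dense forward solve  Ldense x2 = b2_hat
    x2 = [0.0] * n2
    for i in range(n2):
        s = b2_hat[i] - sum(Ldense[i][j] * x2[j] for j in range(i))
        x2[i] = s / Ldense[i][i]
    return x1, x2
```

The payoff is in step 3: because the trailing triangle is dense, a tuned dense TRSM runs far faster there than a generic sparse solver would.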




Experimental results—Performance (Mflop/s) on a set of 44 benchmark matrices from a variety of applications. Speedups of 2.5x are possible. [SC’02]

• AAᵀx, AᵀAx (1.2x—4.2x)
• RARᵀ, Aᵏx, …

Multiple vectors—Significant speedups are possible when multiplying by several vectors (800 MHz Itanium; DGEMM n=k=2000, m=32).
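The multiple-vector optimization can be sketched as a CSR sparse matrix times multi-vector product (SpMM): each matrix entry is loaded once and reused across all nv vectors. An illustrative sketch under assumed names (`spmm`, rows of `X` and `Y` holding one entry per vector); tuned code would unroll the inner vector loop into registers.

```python
# Sketch of SpMxV with multiple right-hand vectors. A is in CSR
# (row_ptr, col_idx, vals); X[j] and Y[i] are length-nv rows of the
# multivectors, so each nonzero of A is reused nv times per load.

def spmm(row_ptr, col_idx, vals, X, Y):
    """Y += A*X for nv vectors stored as rows of X and Y."""
    nv = len(X[0])
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a, j = vals[k], col_idx[k]   # one load of the matrix entry...
            for v in range(nv):          # ...amortized over all nv vectors
                Y[i][v] += a * X[j][v]
```

This is the same reuse argument that makes dense DGEMM so much faster than DGEMV: the memory traffic per flop drops as nv grows.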

1.8

8

80

5 30

0 0

120

20 1

60 10

8

DGEMV

10

12 9

8

200

140

40 D

LP

100

10 11

8 8

160

60 FEM

300

15

18

200 180

80

D

280

16 18

16

220

Cache Blocking Performance [Pentium 4−1.5 GHz, Intel Cv6.0] 340

DGEMM

10

[Figure: Performance Summary [itanium−linux−ecc]; Best r×c, Sparsity, Sparsity + new heuristic, Sparsity 1×1, Naive (NIST)]

Higher-level sparse kernels

Implementation space

Successful examples

• Hybrid sparse/dense data structure (1.2x—1.8x)


Splitting and reordering—(Left) Speedup after splitting a matrix (possibly after reordering) into a blocked part and an unblocked part to avoid fill. (Middle, right) Dense blocks can be created by a judicious reordering—on matrix 17, we used a traveling salesman problem formulation due to Pinar (1997).

Approach to Automatic Tuning

• Either off-line, on-line, or combination


Sparse triangular solve

Architecture dependence—Sixteen r x c blocked compressed sparse row implementations of SpMxV, each color-coded by performance (Mflop/s) and labeled by its speedup over the unblocked (1 x 1) code, for the sparse matrix above on two platforms: 333 MHz Ultra 2i (left) and 800 MHz Itanium (right). The best block size is not always 6x6!

Search using models and experiments

[Figure: Multiple Vector Performance [itanium−ecc]]

• Register-level blocking (up to 2.5x speedups)
• Symmetry (up to 2x speedup)
• Diagonals, bands (up to 2.2x)
• Splitting for variable block structure (1.3x—1.7x)
• Reordering to create dense blocks + splitting (up to 2x)
• Cache blocking (1.5x—5x)
• Multiple vectors (2x—7x)
• And combinations…


For each kernel, identify and generate a space of implementations, and search for the best one.

[Figure: Blocking Performance (Mflop/s) [03−olafu.rua; itanium−linux−ecc]; best block size on Itanium: 3x1]

Sparse matrix example—A 6x6 blocked storage format appears to be the most natural choice for this matrix for a sparse matrix-vector multiply (SpMxV) implementation…

[Figure: Register Blocking Performance (Mflop/s) [Dense (n=1000); itanium−linux−ecc]]

Filling in zeros—True non-zeros (•) and explicit zeros (•); fill ratio=1.5.

Sparse matrix-vector multiply

[Figure: AᵀAx Performance [ultra−solaris]]

BeBOP: Current and Future Work

Understanding the impact on higher-level kernels, algorithms, and applications.


y = AAᵀx = (a1 ⋯ am) [a1ᵀ; …; amᵀ] x = Σ_{k=1..m} ak (akᵀx)
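The identity above leads directly to a single-pass algorithm over the columns of A: a dot product followed by an axpy per column. A minimal sketch, assuming columns stored as (row index, value) pairs; the name `aat_x` is illustrative.

```python
# Sketch of the cache-optimized y = A*A^T*x: each sparse column a_k is
# read once, used for t = a_k^T x (dot product), then y += t * a_k (axpy),
# so A is brought through the memory hierarchy only once.

def aat_x(cols, x, n):
    """cols[k] is column a_k of A as a list of (row_index, value) pairs."""
    y = [0.0] * n
    for a_k in cols:
        t = sum(v * x[i] for i, v in a_k)   # dot product a_k^T x
        for i, v in a_k:
            y[i] += t * v                    # axpy: y += t * a_k
    return y
```

The naive alternative forms z = Aᵀx first and then y = Az, reading A twice; the fused loop halves the memory traffic for A, which is what the 1.2x—4.2x speedups above exploit.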


• Design and implementation of a library based on the Sparse BLAS; new heuristics for efficiently choosing optimizations
• Study of performance implications for higher-level algorithms (e.g., block Lanczos)
• New sparse kernels (e.g., matrix powers Aᵏ, triple product RARᵀ)
• Integrating with applications (e.g., DOE SciDAC codes)
• Further automation: generating implementation generators
• Using bounds to evaluate current and future architectures

[Figure: Performance Summary: Web Subgraph (1M x 1M, 3.1M nz) [Itanium 2−900, Intel C v7.0]; Reference vs. Register blocked]
[Figure: Blocking Performance (Mflop/s) [03−olafu.rua; ultra−solaris]]

Additional techniques for y = Ax, sparse triangular solve, and AᵀAx.


Exploiting Matrix Structure

[Figure: Spy Plot (zoom−in): 03−olafu.rua; best block size on Ultra 2i: 6x6]

Off-line benchmarking—Performance (Mflop/s) for a dense matrix in sparse format on four architectures (clockwise from upper-left): Ultra 2i-333, Pentium III-500, Power3-375, Itanium-800. Performance is a strong function of the hardware platform.


Estimated Performance(r, c) = Dense Performance(r, c) / Fill Ratio(r, c)


Performance depends strongly on both the matrix and hardware platform.


Observations


o Matrix known only at run-time (in general)
o Measure Dense Performance(r, c), in Mflop/s, of a dense matrix in sparse r x c format (off-line)
o Estimate Fill Ratio(r, c): (# stored non-zeros) / (# true non-zeros)
o Choose r x c to maximize Estimated Performance(r, c)
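The block-size selection heuristic can be sketched in a few lines. This is an illustrative sketch, not the Sparsity implementation: `fill_ratio`, `choose_block_size`, and the exact fill computation (counting occupied r x c tiles over coordinates) are assumptions; a real system samples the matrix rather than scanning all of it.

```python
# Sketch of the Sparsity-style heuristic: pick (r, c) maximizing
# Dense Performance(r, c) / Fill Ratio(r, c). dense_mflops comes from the
# off-line benchmark; coords are the (i, j) positions of true nonzeros.
# Names are illustrative assumptions.

def fill_ratio(coords, r, c):
    """(# stored non-zeros) / (# true non-zeros) for r x c blocking."""
    blocks = {(i // r, j // c) for i, j in coords}   # occupied r x c tiles
    return len(blocks) * r * c / len(coords)

def choose_block_size(coords, dense_mflops):
    """dense_mflops maps (r, c) -> Mflop/s measured off-line."""
    return max(dense_mflops,
               key=lambda rc: dense_mflops[rc] / fill_ratio(coords, *rc))
```

For a matrix whose nonzeros form perfect 2x2 tiles, the fill ratio at (2, 2) is 1.0, so the faster blocked profile wins; for scattered nonzeros the fill penalty pushes the choice back toward (1, 1).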

• Choose best data structure and implementation for given kernel, sparse matrix, and machine

[Figure: Register Blocking Performance (Mflop/s) [dense−regprof−2k; power3−aix]]

• Off-line benchmarking (once per architecture)

Goal: Automatic tuning of sparse kernels


Search

• Performance depends on architecture, kernel, and matrix

[Figure: WebBase−1M under natural, RCM, and MMD orderings]

Application matrices: Web connectivity matrix—Speeding up SpMxV for a web subgraph (left) using register blocking and reordering (right) on a 900 MHz Itanium 2.


• Microprocessor performance difficult to model
• Widening processor-memory gap; deep memory hierarchies


• Hardware complexity is increasing


o Indirect, irregular memory accesses
o High bandwidth requirements, poor instruction mix


• Typical uniprocessor performance < 10% peak


Challenges in tuning sparse code

[Figure: Performance Summary [pentium3−linux−icc]]
[Figure: Performance Summary [ultra−solaris]]

The SPARSITY system (Im & Yelick, 1999) applies the methodology to y=Ax.

Attila Gyulassy, Chris Hsu, Shoaib Kamil, Benjamin Lee, Hyun-Jin Moon, Rajesh Nishtala

bebop.cs.berkeley.edu

[Figure: Register Blocking Performance (Mflop/s) [Dense (n=1000); pentium3−linux−icc]]
[Figure: Register Blocking Performance (Mflop/s) [Dense (n=1000); ultra−solaris]]

Sparse kernel performance depends on both the matrix and hardware platform.

Berkeley Benchmarking and OPtimization Group


Example: Choosing a Block Size

Problem Context

Richard Vuduc, Eun-Jin Im, James Demmel, Katherine Yelick


Sparse AAᵀx, AᵀAx—(Left) A can be brought through the memory hierarchy only once: for each column ak of A, compute a dot product followed by a vector scale ("axpy"). (Right) This cache-optimized implementation can be naturally combined with register blocking.
