A MATLAB Interface to the GPU - Brodtkorb

Introduction

A MATLAB Interface to the GPU

Results, conclusions and further work

References

A MATLAB Interface to the GPU

André Rigland Brodtkorb
Department of Informatics, Faculty of Mathematics and Natural Sciences, University of Oslo

June 2007

What is the GPU?

◮ GPU: Graphics Processing Unit.
◮ Transforms input data in the form of geometry into pixels on screen.
◮ Highly efficient processor for computing with homogeneous 3D coordinates ([x y z w]) and the RGBA color model.
◮ Modeled as a pipeline, where all output elements have traversed all stages.

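The Graphics Pipeline

[Figure: the graphics pipeline, with the stages Vertex Stage, Primitive Assembly, Rasterization, Fragment Stage and Buffer Operations, together with Texture Memory, the Frame Buffer and the PCI-e Bus. Bandwidth labels in the figure: 86 GB/s (three times) and 4 GB/s (twice).]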

Why use the GPU?

Comparison of features

                                 CPU (1)    GPU (2)
Theoretical GFLOPS               90         570
Theoretical bandwidth (GB/s)     6.4        100
Folding@home GFLOPS              1          60
Price per GFLOPS (NOK)           90         10
Watts per GFLOPS                 1.4        0.3

(1) Intel Core 2 Extreme QX6700
(2) NVIDIA GeForce 8800 Ultra

Why use MATLAB?

◮ High-level, with mathematical syntax: [U S V] = svd(A).
◮ A standard tool for scientists and engineers, used all over the world (over one million users of MATLAB and SIMULINK worldwide).
◮ Extensible with user-defined MEX files.

Linear Algebra on the GPU

Matrix multiplication timeline

2001  Fixed function implementation (Larsen & McAllister, 2001).
2003  Packing (Moravánszky, 2003; Hall et al., 2003) and blocking (Hall et al., 2003) introduced.
2004  Analysis of previous algorithms. New algorithm faster than ATLAS (Fatahalian et al., 2004).
2005  Automatic tuning to the underlying hardware (Jiang & Snir, 2005).
2006  Analysis of bandwidth and blocking techniques (Govindaraju et al., 2006).

PLU Factorization

◮ Single-component textures.
◮ Intricate algorithm for pivoting, requiring many passes.
◮ 35% speedup over CPU claimed for partial pivoting.
◮ GPU claimed to be an order of magnitude faster than the CPU for full pivoting.
◮ Highly synthetic benchmarks, where all texture reads were restricted to three locations in memory.

Background processing

Overview

◮ Want to utilize the GPU as a coprocessor.
◮ Need an easy-to-use interface (tight integration with existing MATLAB syntax).
◮ Want to execute efficiently on the GPU.

Background processing

◮ Want to utilize the GPU as an extra resource.
◮ Blocking and non-blocking calls: can utilize threads.
◮ Neither OpenGL nor MATLAB is thread-safe.
◮ Need to split logic into two parts: MEX and OpenGL.

Splitting of logic into two threads

[Figure: block diagram with the labels MATLAB, MATLAB toolbox, MEX thread, GPU thread, GPU, Operations and Results.]

◮ A queue of operations, and a map of results.
◮ Similar to RapidMind and PeakStream ideas.
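The queue of operations and the map of results can be sketched as follows. This is only an illustration in Python, not the toolbox's MEX/OpenGL code: the class `BackgroundWorker` and its methods `enqueue` and `read` are hypothetical names, and an ordinary worker thread stands in for the GPU thread.

```python
import queue
import threading


class BackgroundWorker:
    """Hypothetical stand-in for the GPU thread: runs queued operations in order."""

    def __init__(self):
        self._ops = queue.Queue()          # queue of pending operations
        self._results = {}                 # map from operation id to result
        self._done = threading.Condition()
        self._next_id = 0
        threading.Thread(target=self._run, daemon=True).start()

    def enqueue(self, func, *args):
        """Non-blocking: queue an operation and return its id immediately.

        Assumes a single calling thread (the role of the MEX thread).
        """
        op_id = self._next_id
        self._next_id += 1
        self._ops.put((op_id, func, args))
        return op_id

    def read(self, op_id):
        """Blocking: wait until the result is in the map, then remove and return it."""
        with self._done:
            self._done.wait_for(lambda: op_id in self._results)
            return self._results.pop(op_id)

    def _run(self):
        # Worker loop: take the next operation, execute it, publish the result.
        while True:
            op_id, func, args = self._ops.get()
            value = func(*args)
            with self._done:
                self._results[op_id] = value
                self._done.notify_all()


worker = BackgroundWorker()
handle = worker.enqueue(sum, [1, 2, 3])    # returns at once
print(worker.read(handle))                 # prints 6
```

Keeping all execution inside one worker thread mirrors the constraint that the graphics API is only ever touched from a single thread.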

Simultaneous computation

Using non-blocking calls, we can utilize both the CPU and the GPU simultaneously:
1. Enqueue GPU operations.
2. Compute on the CPU while the GPU operates in the background.
3. Retrieve results from the GPU.
This makes computation on the GPU virtually free.
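The three steps map onto a submit / compute / retrieve pattern. The sketch below uses Python's `concurrent.futures` purely to show the control flow; `gpu_task` and `cpu_task` are placeholder functions invented for the illustration, and a background thread plays the part of the GPU.

```python
from concurrent.futures import ThreadPoolExecutor


def gpu_task(n):
    # Placeholder for work that would run on the GPU in the background.
    return sum(i * i for i in range(n))


def cpu_task(n):
    # Placeholder for work the CPU does in the meantime.
    return sum(range(n))


with ThreadPoolExecutor(max_workers=1) as background:
    pending = background.submit(gpu_task, 1000)   # 1. enqueue (returns immediately)
    cpu_result = cpu_task(1000)                   # 2. compute on the CPU
    gpu_result = pending.result()                 # 3. retrieve (blocks only if not finished)

print(cpu_result, gpu_result)   # prints 499500 332833500
```

In CPython the two placeholder tasks do not truly run in parallel; with a real coprocessor the background work overlaps with step 2, which is what makes it appear nearly free when the CPU work takes at least as long.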

Syntax

Standard MATLAB:

    a = rand(n, n);
    b = a*a;
    [l u p] = lu(b);

GPU toolbox in MATLAB:

    a = gpuMatrix(rand(n, n));
    b = a*a;
    [l u p] = lu(b);

Background processing:

    a = gpuMatrix(rand(n, n));
    b = a*a;
    c = lu(b);
    read(c);
    % CPU computations here
    [l u p] = single(c);

Packing

◮ Four-way vectorized arithmetic.
◮ Packing influences computational and memory intensity.
◮ Want to reuse data without having to repack data.
◮ Two-by-two packing is a good compromise.
◮ Possible to extend the toolbox to support other packing algorithms.

[Figure: a 7×7 matrix is padded to 8×8, packed two-by-two into a 4×4 texture, and transferred to the GPU.]
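A minimal sketch of the padding and two-by-two packing step, in Python. Two details are assumptions made for the example, since the slides do not state them: the padding value (zero) and the order in which the four elements of a block are placed in the RGBA channels. The function name `pack_two_by_two` is hypothetical.

```python
def pack_two_by_two(matrix):
    """Pad a matrix to even dimensions and pack each 2x2 block into one 4-tuple."""
    rows, cols = len(matrix), len(matrix[0])
    padded_rows, padded_cols = rows + rows % 2, cols + cols % 2
    # Zero padding (assumed; the padding value is not given in the slides).
    padded = [[matrix[i][j] if i < rows and j < cols else 0.0
               for j in range(padded_cols)]
              for i in range(padded_rows)]
    # Assumed texel layout: (R, G, B, A) = (top-left, top-right, bottom-left, bottom-right).
    return [[(padded[i][j], padded[i][j + 1], padded[i + 1][j], padded[i + 1][j + 1])
             for j in range(0, padded_cols, 2)]
            for i in range(0, padded_rows, 2)]


matrix = [[float(7 * i + j) for j in range(7)] for i in range(7)]   # 7x7 input
texture = pack_two_by_two(matrix)
print(len(texture), len(texture[0]))   # prints 4 4
print(texture[0][0])                   # prints (0.0, 1.0, 7.0, 8.0)
print(texture[3][3])                   # prints (48.0, 0.0, 0.0, 0.0)
```

Each texel then carries four matrix elements, so one four-way vector operation processes a whole block.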

Algorithms

Matrix-matrix multiplication

Definition:

$$(AB)_{i,j} = \sum_{k=1}^{n} a_{i,k}\, b_{k,j}, \qquad A \in \mathbb{R}^{m,n},\ B \in \mathbb{R}^{n,o}. \tag{1}$$

◮ Can be viewed as vector-vector inner products.
◮ Can be viewed as the sum of individual multiplications.

Vector-vector inner product

1. Each fragment contains the texture coordinate of the corresponding row in A and column in B.
2. Each fragment is computed by gathering one two-by-two matrix from A and one from B, computing their inner product, and summing over all elements.

[Figure: schematic of the matrices A and B and the product (AB).]
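The single-pass, gather-based formulation can be sketched on packed data as below. This is a CPU illustration in Python, not shader code. It assumes each texel holds a two-by-two block as `(a, b, c, d)` for `[[a, b], [c, d]]`, and it reads "inner product" of two blocks as their two-by-two matrix product; the function names are hypothetical.

```python
def block_mul(p, q):
    """Product of two 2x2 blocks stored as (a, b, c, d) = [[a, b], [c, d]]."""
    return (p[0] * q[0] + p[1] * q[2], p[0] * q[1] + p[1] * q[3],
            p[2] * q[0] + p[3] * q[2], p[2] * q[1] + p[3] * q[3])


def multiply_gather(a_tex, b_tex):
    """One pass: every output texel gathers a block row of A and a block column of B."""
    inner = len(b_tex)
    result = []
    for i in range(len(a_tex)):            # one "fragment" per output texel (i, j)
        row = []
        for j in range(len(b_tex[0])):
            acc = (0.0, 0.0, 0.0, 0.0)
            for k in range(inner):         # sum over all blocks along the row/column
                partial = block_mul(a_tex[i][k], b_tex[k][j])
                acc = tuple(x + y for x, y in zip(acc, partial))
            row.append(acc)
        result.append(row)
    return result


# [[1, 2], [3, 4]] times [[5, 6], [7, 8]], each packed into a single texel.
a_tex = [[(1.0, 2.0, 3.0, 4.0)]]
b_tex = [[(5.0, 6.0, 7.0, 8.0)]]
print(multiply_gather(a_tex, b_tex))   # prints [[(19.0, 22.0, 43.0, 50.0)]]
```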

Sum of individual multiplications

1. Each fragment contains the texture coordinate of the corresponding element from A and B.
2. Each fragment is computed by gathering the two two-by-two matrices, computing their inner product, and adding to one accumulation buffer.

[Figure: schematic of the matrices A and B and the product (AB).]
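The multi-pass formulation can be sketched like this: one pass per index k, each pass reading one block from A and one from B per output texel and adding the partial product to an accumulation buffer. This is a Python illustration under the same assumptions as the packing (a texel is `(a, b, c, d)` for `[[a, b], [c, d]]`, and the block "inner product" is taken to be the two-by-two matrix product); the function names are hypothetical.

```python
def block_mul(p, q):
    """Product of two 2x2 blocks stored as (a, b, c, d) = [[a, b], [c, d]]."""
    return (p[0] * q[0] + p[1] * q[2], p[0] * q[1] + p[1] * q[3],
            p[2] * q[0] + p[3] * q[2], p[2] * q[1] + p[3] * q[3])


def multiply_accumulate(a_tex, b_tex):
    """Multi-pass product: pass k adds the k-th partial product to every output texel."""
    rows, cols, inner = len(a_tex), len(b_tex[0]), len(b_tex)
    accum = [[(0.0, 0.0, 0.0, 0.0)] * cols for _ in range(rows)]   # accumulation buffer
    for k in range(inner):                 # one rendering pass per k
        for i in range(rows):
            for j in range(cols):
                partial = block_mul(a_tex[i][k], b_tex[k][j])
                accum[i][j] = tuple(x + y for x, y in zip(accum[i][j], partial))
    return accum


# A is 2x4 (blocks I and 2I), B is 4x2: the result is B1 + 2 * B2.
a_tex = [[(1.0, 0.0, 0.0, 1.0), (2.0, 0.0, 0.0, 2.0)]]
b_tex = [[(1.0, 2.0, 3.0, 4.0)], [(5.0, 6.0, 7.0, 8.0)]]
print(multiply_accumulate(a_tex, b_tex))   # prints [[(11.0, 14.0, 17.0, 20.0)]]
```

Compared with a single gather pass, each pass here needs only two texture reads per texel, at the cost of one pass per block along the inner dimension.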

Gauss-Jordan factorization (i)

◮ Direct solver.
◮ Slower than Gaussian elimination, but fewer passes needed.
◮ Need to employ a pivoting strategy for numerical stability:
    ◮ Full: overkill for most problems, and not applicable for the chosen implementation (Doolittle).
    ◮ Rook: not applicable for the chosen implementation (Doolittle).
    ◮ Partial: works well for most cases.
◮ Need to pivot two-by-two sub-matrices.

Gauss-Jordan factorization (ii)

Algorithm:
1. Find the pivoting element by reducing the pivot area to the largest element. Use the quasi-harmonic norm as a measure of suitability for each element.
2. Exchange two-by-two rows.
3. Eliminate the two-by-two column above and below the pivot element.

[Figure: matrix schematic with the indices i, j and k.]
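The three steps can be illustrated with a scalar version of Gauss-Jordan elimination with partial pivoting. The Python sketch below is a simplification, not the toolbox algorithm: it pivots on single elements rather than two-by-two sub-matrices, and it uses the absolute value as the pivot measure because the quasi-harmonic norm is not defined in the slides. The function name is hypothetical.

```python
def gauss_jordan_solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial (row) pivoting."""
    n = len(a)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    aug = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        # 1. Find the pivot: largest magnitude in this column, on or below the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        # 2. Exchange rows.
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # 3. Eliminate the column above and below the pivot.
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    # The matrix is now diagonal, so each unknown follows from one division.
    return [aug[i][n] / aug[i][i] for i in range(n)]


# 2x + y = 3, x + 3y = 5 has the solution x = 0.8, y = 1.4.
print(gauss_jordan_solve([[2, 1], [1, 3]], [3, 5]))
```

Because elimination runs both above and below the pivot, no separate back-substitution pass is needed.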

PLU factorization (i)

◮ Direct solver.
◮ Can use a modification to the Doolittle algorithm.
◮ Can use the same pivoting as for Gauss-Jordan elimination.
◮ Suitable for many right-hand sides (factorization is $O(n^3)$, while substitution is $O(n^2)$).

PLU factorization (ii)

1. Locate the pivoting element.
2. Exchange two-by-two rows, and calculate multipliers.
3. Reduce below the pivot element.

[Figure: matrix schematic showing the L and U regions with the indices i, j and k.]
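A scalar sketch of the same three steps, in Python: a Doolittle-style PLU factorization with partial pivoting, followed by forward and back substitution. It is a simplified illustration (single-element pivots instead of two-by-two sub-matrices, absolute value as the pivot measure), and the names `plu` and `solve` are hypothetical.

```python
def plu(a):
    """Return (perm, lu): row perm[i] of a equals row i of L*U, with L and U packed in lu."""
    n = len(a)
    lu = [list(map(float, row)) for row in a]
    perm = list(range(n))
    for col in range(n):
        # 1. Locate the pivot (largest magnitude on or below the diagonal).
        pivot = max(range(col, n), key=lambda r: abs(lu[r][col]))
        if lu[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        # 2. Exchange rows and calculate the multipliers (stored in place of L).
        lu[col], lu[pivot] = lu[pivot], lu[col]
        perm[col], perm[pivot] = perm[pivot], perm[col]
        for r in range(col + 1, n):
            lu[r][col] /= lu[col][col]
            # 3. Reduce the rows below the pivot.
            for c in range(col + 1, n):
                lu[r][c] -= lu[r][col] * lu[col][c]
    return perm, lu


def solve(perm, lu, b):
    """Solve a x = b from an existing factorization: O(n^2) per right-hand side."""
    n = len(lu)
    y = [0.0] * n
    for i in range(n):                     # forward substitution, unit-diagonal L
        y[i] = b[perm[i]] - sum(lu[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):           # back substitution with U
        x[i] = (y[i] - sum(lu[i][k] * x[k] for k in range(i + 1, n))) / lu[i][i]
    return x


perm, lu = plu([[0, 2], [1, 1]])           # factorize once
print(solve(perm, lu, [4, 3]))             # prints [1.0, 2.0]
print(solve(perm, lu, [2, 2]))             # prints [1.0, 1.0]
```

The factorization is computed once and reused for both right-hand sides, which is where the difference between cubic and quadratic cost pays off.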

Tridiagonal Gaussian elimination

◮ Tridiagonal storage: RGBA.
◮ Can solve many systems in parallel.
◮ Poor performance when solving only one system.
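A sketch of tridiagonal Gaussian elimination over several independent systems, in Python. The mapping of a row to the four channels, `(lower, diagonal, upper, right-hand side)`, is an assumption made for the example, as is the absence of pivoting (the Thomas algorithm); the slides specify neither. The function name is hypothetical.

```python
def solve_tridiagonal(system):
    """Thomas algorithm on one system given as a list of (lower, diag, upper, rhs) rows."""
    n = len(system)
    upper = [0.0] * n
    rhs = [0.0] * n
    # Forward elimination: normalize each row and remove the sub-diagonal entry.
    _, d, u, r = system[0]
    upper[0], rhs[0] = u / d, r / d
    for i in range(1, n):
        l, d, u, r = system[i]
        denom = d - l * upper[i - 1]
        upper[i] = u / denom
        rhs[i] = (r - l * rhs[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = rhs[-1]
    for i in range(n - 2, -1, -1):
        x[i] = rhs[i] - upper[i] * x[i + 1]
    return x


systems = [
    [(0.0, 2.0, 1.0, 3.0), (1.0, 2.0, 0.0, 3.0)],    # 2x + y = 3, x + 2y = 3
    [(0.0, 4.0, 0.0, 8.0), (0.0, 5.0, 0.0, 10.0)],   # 4x = 8, 5y = 10
]
# The systems are independent, so each could be handled by its own fragment.
print([solve_tridiagonal(s) for s in systems])   # prints [[1.0, 1.0], [2.0, 2.0]]
```

Within one system every row depends on the previous one, so the parallelism comes only from the number of systems, which is consistent with the poor performance noted for a single system.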

Accuracy and stability

Test problems

Matrices:
◮ Matrix multiplication: uniform random in [0, 10] (condition number $10^2$ to $5 \times 10^6$).
◮ Gauss-Jordan, PLU: uniform random in [0, 5] with a random integer in [0, 100] added to the diagonal (condition number $10^2$ to $4 \times 10^4$).
◮ Tridiagonal Gaussian: uniform random in [0, 5] with a random integer in [0, 100] added to the diagonal (condition number $2 \times 10^1$ to $5 \times 10^3$).
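The recipes above can be reproduced roughly as follows. This Python sketch only generates matrices of the described form; the function names, the seed and the small size are arbitrary choices for the example, and no condition numbers are computed.

```python
import random


def dense_test_matrix(n, rng, high, diagonal_boost=False):
    """n-by-n matrix, entries uniform in [0, high]; optionally add a random
    integer in [0, 100] to every diagonal element."""
    m = [[rng.uniform(0.0, high) for _ in range(n)] for _ in range(n)]
    if diagonal_boost:
        for i in range(n):
            m[i][i] += rng.randint(0, 100)
    return m


def tridiagonal_test_matrix(n, rng):
    """Same recipe as the solver matrices, restricted to the three central diagonals."""
    m = dense_test_matrix(n, rng, 5.0, diagonal_boost=True)
    return [[m[i][j] if abs(i - j) <= 1 else 0.0 for j in range(n)] for i in range(n)]


rng = random.Random(2007)                                      # fixed seed, arbitrary
mult = dense_test_matrix(4, rng, 10.0)                         # matrix multiplication
solver = dense_test_matrix(4, rng, 5.0, diagonal_boost=True)   # Gauss-Jordan, PLU
tri = tridiagonal_test_matrix(4, rng)                          # tridiagonal Gaussian
print(len(mult), len(solver), len(tri))                        # prints 4 4 4
```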

Measured error

[Figure: absolute and relative error plotted against matrix size (0 to 2000). Panels: (a) Matrix multiplication, (b) Gauss-Jordan elimination.]

Measured error

[Figure: absolute and relative error plotted against matrix size (0 to 2000). Panels: (c) PLU factorization, (d) Tridiagonal Gaussian elimination.]

Accuracy

◮ Accurate storage and computation yields accurate results (e.g. most integral matrices).
◮ Absolute error depends on n and the size of input/output elements.
◮ Relative error depends only on n.
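The difference between the two error measures can be illustrated with emulated single precision. In the Python sketch below the error definitions are assumptions, since the slides do not give them: absolute error is the largest element-wise deviation from a double-precision product, and relative error is that value divided by the largest reference element. Scaling both inputs by 1024 changes the absolute error but leaves the relative error as it was.

```python
import random
import struct


def to_single(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matmul(a, b, rnd=lambda x: x):
    """Matrix product with every stored value and every operation rounded by rnd."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                acc = rnd(acc + rnd(rnd(a[i][k]) * rnd(b[k][j])))
            out[i][j] = acc
    return out


def errors(a, b):
    """Return (absolute, relative) error of the single-precision product (assumed definitions)."""
    exact = matmul(a, b)                    # double-precision reference
    approx = matmul(a, b, to_single)        # emulated single precision
    absolute = max(abs(x - y) for re, ra in zip(exact, approx) for x, y in zip(re, ra))
    scale = max(abs(x) for row in exact for x in row)
    return absolute, absolute / scale


rng = random.Random(0)
a = [[rng.uniform(0.0, 10.0) for _ in range(8)] for _ in range(8)]
b = [[rng.uniform(0.0, 10.0) for _ in range(8)] for _ in range(8)]
big_a = [[1024.0 * x for x in row] for row in a]
big_b = [[1024.0 * x for x in row] for row in b]
print(errors(a, b))           # (absolute, relative) for the original inputs
print(errors(big_a, big_b))   # larger absolute error, same relative error
```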

Speed

Runtime of different algorithms

[Figure: time (seconds) plotted against matrix size (1000 to 4000), with the legend entries CPU, GPU, Lsqr CPU and Lsqr GPU. Panels: (a) Matrix multiplication, (b) Gauss-Jordan elimination.]

Runtime of different algorithms

[Figure: time (seconds) plotted against matrix size (up to 4000). Panels: (c) PLU factorization, with the legend entries GPU, CPU, Lsqr CPU and Lsqr GPU; (d) Tridiagonal Gaussian elimination, with the legend entries GPU and Matlab.]

Background processing

[Figure: plot with the legend entries GPU + CPU, CPU, GPU, Lsqr GPU + CPU, Lsqr GPU and Lsqr CPU; horizontal axis ticks from 1000 to 2000.]

Conclusions, and further research

Conclusions

◮ The GPU is hard to program, but there are great rewards (speedups of 2 to 7 times shown here).
◮ The GPU can be utilized as an efficient coprocessor.
◮ A high-level mathematical interface to efficient algorithms is useful.

Contributions

◮ A high-level mathematical interface to the GPU.
◮ A new pivoting strategy for vectorized operations.
◮ The use of packing for Gauss-Jordan and PLU factorization.

References

Fatahalian, K., Sugerman, J., & Hanrahan, P. 2004. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. Pages 133–137 of: HWWS ’04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware. New York, NY, USA: ACM Press.

Govindaraju, N. K., Larsen, S., Gray, J., & Manocha, D. 2006. A memory model for scientific algorithms on graphics processors. Page 89 of: SC ’06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing. New York, NY, USA: ACM Press.

Hall, J. D., Carr, N. A., & Hart, J. C. 2003. Cache and bandwidth aware matrix multiplication on the GPU.

Jiang, C., & Snir, M. 2005. Automatic Tuning Matrix Multiplication Performance on Graphics Hardware. Pages 185–196 of: PACT ’05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. Washington, DC, USA: IEEE Computer Society.

Larsen, E. S., & McAllister, D. 2001. Fast matrix multiplies using graphics hardware. Pages 55–55 of: Supercomputing ’01: Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM). New York, NY, USA: ACM Press.

Moravánszky, A. 2003. Dense Matrix Algebra on the GPU. Online; http://www.shaderx2.com/shaderx.pdf [accessed 2006-05-11].
