Introduction
A MATLAB Interface to the GPU
Results, conclusions and further work
References
A MATLAB Interface to the GPU
André Rigland Brodtkorb
Department of Informatics, Faculty of Mathematics and Natural Sciences, University of Oslo
June 2007
What is the GPU?
◮ GPU - Graphics Processing Unit.
◮ Transforms input data in the form of geometry into pixels on screen.
◮ Highly efficient processor for computing with homogeneous 3D coordinates ([x y z w]) and the RGBA color model.
◮ Modeled as a pipeline, where all output elements have traversed all stages.
The Graphics Pipeline
[Figure: the graphics pipeline. Stages: Vertex Stage, Primitive Assembly, Rasterization, Fragment Stage, Buffer Operations. Memory: Texture Memory, Frame Buffer. Also labelled: PCI-e Bus. Bandwidth labels: 86 GB/s and 4 GB/s.]
Why use the GPU?
Comparison of features

                                  CPU (1)   GPU (2)
Theoretical GFLOPS                   90       570
Theoretical bandwidth (GB/s)        6.4       100
Folding@home GFLOPS                   1        60
Price per GFLOPS (NOK)               90        10
Watts per GFLOPS                    1.4       0.3

(1) Intel Core 2 Extreme QX6700
(2) NVIDIA GeForce 8800 Ultra
Why use MATLAB?
◮ High-level, with mathematical syntax: [U S V] = svd(A).
◮ A standard tool for scientists and engineers used all over the world (over one million users of MATLAB and SIMULINK worldwide).
◮ Extensible with user-defined MEX files.
Linear Algebra on the GPU
Matrix multiplication timeline
2001 Fixed function implementation (Larsen & McAllister, 2001).
2003 Packing (Moravánszky, 2003; Hall et al., 2003) and blocking (Hall et al., 2003) introduced.
2004 Analysis of previous algorithms. New algorithm faster than ATLAS (Fatahalian et al., 2004).
2005 Automatically tuning to underlying hardware (Jiang & Snir, 2005).
2006 Analysis of bandwidth and blocking techniques (Govindaraju et al., 2006).
PLU Factorization
◮ Single-component textures.
◮ Intricate algorithm for pivoting, requiring many passes.
◮ 35% speedup over CPU claimed for partial pivoting.
◮ GPU claimed to be an order of magnitude faster than the CPU for full pivoting.
◮ Highly synthetic benchmarks, where all texture reads were restricted to three locations in memory.
Background processing
Overview
◮ Want to utilize the GPU as a coprocessor.
◮ Need an easy-to-use interface (tight integration with existing MATLAB syntax).
◮ Want to execute efficiently on the GPU.
Background processing
◮ Want to utilize the GPU as an extra resource.
◮ Blocking and non-blocking calls: can utilize threads.
◮ Neither OpenGL nor MATLAB is thread-safe.
◮ Need to split logic into two parts: MEX and OpenGL.
Splitting of logic into two threads
[Figure: diagram with the labels MATLAB, MATLAB toolbox, MEX thread, GPU thread and GPU, connected by Operations and Results.]
◮ A queue of operations, and a map of results.
◮ Similar to RapidMind and PeakStream ideas.
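The queue of operations and map of results can be sketched with two ordinary threads. This is a minimal illustration in Python, not the toolbox's MEX/OpenGL code: the class `BackgroundWorker` and its methods `enqueue` and `read` are hypothetical names, and a plain worker thread stands in for the GPU thread.

```python
import queue
import threading


class BackgroundWorker:
    """Hypothetical stand-in for the GPU thread: consumes operations, stores results."""

    def __init__(self):
        self.operations = queue.Queue()  # queue of pending operations
        self.results = {}                # map: operation id -> result
        self._done = threading.Condition()
        self._next_id = 0                # only touched by the calling thread
        threading.Thread(target=self._run, daemon=True).start()

    def enqueue(self, func, *args):
        """Non-blocking: queue an operation and return its id immediately."""
        op_id = self._next_id
        self._next_id += 1
        self.operations.put((op_id, func, args))
        return op_id

    def read(self, op_id):
        """Blocking: wait until the result for op_id is in the map, then return it."""
        with self._done:
            self._done.wait_for(lambda: op_id in self.results)
            return self.results.pop(op_id)

    def _run(self):
        # Worker loop: take the next operation, execute it, publish the result.
        while True:
            op_id, func, args = self.operations.get()
            value = func(*args)
            with self._done:
                self.results[op_id] = value
                self._done.notify_all()


worker = BackgroundWorker()
first = worker.enqueue(sum, [1, 2, 3])
second = worker.enqueue(max, [4, 9, 2])
print(worker.read(second), worker.read(first))  # 9 6
```

Results can be read in any order because they are looked up by id in the map, independently of the order in which the queue was processed.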
Simultaneous computation
Using non-blocking calls, we can utilize both the CPU and the GPU simultaneously:
1. Enqueue GPU operations.
2. Compute on the CPU while the GPU operates in the background.
3. Retrieve results from GPU.
This makes computation on the GPU virtually free.
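The three steps follow a common enqueue, compute, retrieve pattern. The sketch below shows only the call pattern, using Python's `concurrent.futures`; the single worker thread is a hypothetical stand-in for the GPU, and `matmul` and `overlap` are illustrative names. Pure-Python threads do not run computations truly in parallel, so no speedup should be read into it.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Plain triple-loop product, standing in for the work sent to the GPU."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def overlap(a, b, cpu_work):
    with ThreadPoolExecutor(max_workers=1) as gpu:  # hypothetical "GPU" worker
        pending = gpu.submit(matmul, a, b)          # 1. enqueue (returns at once)
        cpu_result = cpu_work()                     # 2. compute on the CPU meanwhile
        gpu_result = pending.result()               # 3. retrieve (blocks if not ready)
    return cpu_result, gpu_result


a = [[1, 2], [3, 4]]
cpu, gpu = overlap(a, a, lambda: sum(range(1000)))
print(cpu, gpu)  # 499500 [[7, 10], [15, 22]]
```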
Syntax
Standard MATLAB:
    a = rand(n, n);
    b = a*a;
    [l u p] = lu(b);
GPU toolbox in MATLAB:
    a = gpuMatrix(rand(n, n));
    b = a*a;
    [l u p] = lu(b);
Background processing:
    a = gpuMatrix(rand(n, n));
    b = a*a;
    c = lu(b);
    read(c);
    % CPU computations here
    [l u p] = single(c);
Packing
◮ Four-way vectorized arithmetic.
◮ Packing influences computational and memory intensity.
◮ Want to reuse data without having to repack data.
◮ Two-by-two packing is a good compromise.
◮ Possible to extend the toolbox to support other packing algorithms.
[Figure: packing. A 7×7 matrix is padded to 8×8, then transferred to the GPU as a 4×4 texture.]
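A minimal sketch of two-by-two packing with padding, in Python. The zero padding value and the channel order (top-left, top-right, bottom-left, bottom-right mapped to R, G, B, A) are assumptions made for illustration; the slides do not specify either, and `pack_two_by_two` is a hypothetical name.

```python
def pack_two_by_two(matrix):
    """Pad to even dimensions, then pack each 2x2 block into one 4-tuple (R, G, B, A)."""
    rows, cols = len(matrix), len(matrix[0])
    padded_rows, padded_cols = rows + rows % 2, cols + cols % 2
    # Assumption: padding elements are zero.
    padded = [[matrix[i][j] if i < rows and j < cols else 0.0
               for j in range(padded_cols)] for i in range(padded_rows)]
    # Assumption: channel order is top-left, top-right, bottom-left, bottom-right.
    return [[(padded[i][j], padded[i][j + 1], padded[i + 1][j], padded[i + 1][j + 1])
             for j in range(0, padded_cols, 2)]
            for i in range(0, padded_rows, 2)]


m = [[float(7 * i + j) for j in range(7)] for i in range(7)]  # 7x7 input
texture = pack_two_by_two(m)
print(len(texture), len(texture[0]))  # 4 4
print(texture[3][3])                  # (48.0, 0.0, 0.0, 0.0)
```

The last texel holds the single bottom-right element of the 7×7 matrix together with three padding values.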
Algorithms
Matrix-matrix multiplication
Definition
\[ (AB)_{i,j} = \sum_{k=1}^{n} a_{i,k} b_{k,j}, \qquad A \in \mathbb{R}^{m,n},\ B \in \mathbb{R}^{n,o}. \tag{1} \]
◮ Can be viewed as vector-vector inner products.
◮ Can be viewed as the sum of individual multiplications.
Vector-vector inner product
1. Each fragment contains the texture coordinates of the corresponding row in A and column in B.
2. Each fragment is computed by gathering one two-by-two matrix from A and one from B, computing their inner product, and summing over all elements.
[Figure: a row of A and a column of B combine into one element of (AB).]
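The inner-product formulation can be sketched on packed data as follows. This is an illustrative CPU version in Python, not shader code: each texel is assumed to be a 4-tuple holding a two-by-two block in row-major order, and `block_mul`, `block_add`, `fragment` and `matmul_inner` are hypothetical names.

```python
def block_mul(x, y):
    """Product of two 2x2 blocks stored as (x00, x01, x10, x11)."""
    return (x[0] * y[0] + x[1] * y[2], x[0] * y[1] + x[1] * y[3],
            x[2] * y[0] + x[3] * y[2], x[2] * y[1] + x[3] * y[3])


def block_add(x, y):
    """Element-wise sum of two 2x2 blocks."""
    return tuple(p + q for p, q in zip(x, y))


def fragment(A, B, i, j):
    """One output texel: inner product of block row i of A and block column j of B."""
    acc = (0.0, 0.0, 0.0, 0.0)
    for k in range(len(B)):
        acc = block_add(acc, block_mul(A[i][k], B[k][j]))
    return acc


def matmul_inner(A, B):
    """Compute every output texel independently, as one pass would on the GPU."""
    return [[fragment(A, B, i, j) for j in range(len(B[0]))] for i in range(len(A))]


X = (1.0, 2.0, 3.0, 4.0)  # the 2x2 matrix [[1, 2], [3, 4]] as one texel
print(matmul_inner([[X]], [[X]]))  # [[(7.0, 10.0, 15.0, 22.0)]]
```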
Sum of individual multiplications
1. Each fragment contains the texture coordinates of the corresponding element from A and B.
2. Each fragment is computed by gathering the two two-by-two matrices, computing their inner product, and adding to one accumulation buffer.
[Figure: one element of A and one element of B contribute to each element of (AB).]
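The multi-pass formulation can be sketched as one accumulation pass per value of k. For brevity this Python sketch uses scalars in place of the two-by-two blocks of the actual scheme; `matmul_accumulate` is a hypothetical name.

```python
def matmul_accumulate(A, B):
    """Matrix product built up as a sum of individual multiplications."""
    m, n, o = len(A), len(B), len(B[0])
    acc = [[0.0] * o for _ in range(m)]  # accumulation buffer
    for k in range(n):                   # one pass per k
        for i in range(m):
            for j in range(o):           # every fragment adds a single product
                acc[i][j] += A[i][k] * B[k][j]
    return acc


print(matmul_accumulate([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
# [[19.0, 22.0], [43.0, 50.0]]
```

Compared with the inner-product view, each fragment does less work per pass, but n passes over the accumulation buffer are needed.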
Gauss-Jordan factorization (i)
◮ Direct solver.
◮ Slower than Gaussian elimination, but fewer passes needed.
◮ Need to employ a pivoting strategy for numerical stability.
  ◮ Full - Overkill for most problems and not applicable for the chosen implementation (Doolittle).
  ◮ Rook - Not applicable for the chosen implementation (Doolittle).
  ◮ Partial - Works well for most cases.
◮ Need to pivot two-by-two sub-matrices.
Gauss-Jordan factorization (ii)
Algorithm
1. Find the pivot element by reducing the pivot area to the largest element. Use the quasi-harmonic norm as a measure of suitability for each element.
2. Exchange two-by-two rows.
3. Eliminate the two-by-two column above and below the pivot element.
[Figure: matrix diagram with the indices i, j and k.]
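The three steps can be illustrated with a scalar version in Python. This sketch deliberately simplifies: it pivots on single elements by absolute value, whereas the method above pivots on two-by-two sub-matrices using a quasi-harmonic norm, which is not reproduced here. The name `gauss_jordan_solve` is hypothetical.

```python
def gauss_jordan_solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial (row) pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are left untouched.
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # 1. Locate the pivot: largest magnitude on or below the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        # 2. Exchange rows.
        M[col], M[pivot] = M[pivot], M[col]
        # Scale the pivot row so the pivot becomes 1.
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        # 3. Eliminate the column above and below the pivot.
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [row[n] for row in M]


print(gauss_jordan_solve([[0, 2], [1, 1]], [4, 3]))  # [1.0, 2.0]
```

The example needs the row exchange, since the first diagonal element is zero.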
PLU factorization (i)
◮ Direct solver.
◮ Can use a modification to the Doolittle algorithm.
◮ Can use same pivoting as for Gauss-Jordan elimination.
◮ Suitable for many right-hand sides (factorization is $O(n^3)$, while substitution is $O(n^2)$).
PLU factorization (ii)
1. Locate pivoting element.
2. Exchange two-by-two rows, and calculate multipliers.
3. Reduce below pivot element.
[Figure: matrix diagram showing the L and U parts with the indices i, j and k.]
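A scalar sketch of the same three steps in Python, with the substitution stage that makes the factorization reusable for several right-hand sides. As an assumption for illustration, pivoting is done on single elements by absolute value rather than on two-by-two blocks; `plu` and `solve` are hypothetical names.

```python
def plu(A):
    """Return (perm, L, U) such that row perm[i] of A equals row i of L U."""
    n = len(A)
    U = [list(map(float, row)) for row in A]
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    perm = list(range(n))
    for k in range(n):
        # 1. Locate the pivot element in column k.
        pivot = max(range(k, n), key=lambda r: abs(U[r][k]))
        if U[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        # 2. Exchange rows (including the multipliers already stored in L).
        if pivot != k:
            U[k], U[pivot] = U[pivot], U[k]
            perm[k], perm[pivot] = perm[pivot], perm[k]
            L[k][:k], L[pivot][:k] = L[pivot][:k], L[k][:k]
        # 2-3. Calculate multipliers and reduce the rows below the pivot.
        for r in range(k + 1, n):
            L[r][k] = U[r][k] / U[k][k]
            U[r] = [u - L[r][k] * p for u, p in zip(U[r], U[k])]
    return perm, L, U


def solve(perm, L, U, b):
    """Solve A x = b from the factors: O(n^2) work per right-hand side."""
    n = len(b)
    y = []
    for i in range(n):            # forward substitution: L y = P b
        y.append(b[perm[i]] - sum(L[i][j] * y[j] for j in range(i)))
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: U x = y
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x


factors = plu([[0, 2], [1, 1]])
print(solve(*factors, [4, 3]), solve(*factors, [2, 2]))  # [1.0, 2.0] [1.0, 1.0]
```

The matrix is factorized once and then reused for both right-hand sides.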
Tridiagonal Gaussian elimination
[Figure: texture layout with the channels R, G, B and A.]
◮ Tridiagonal storage - RGBA.
◮ Can solve many systems in parallel.
◮ Poor performance when solving only one system.
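A minimal Python sketch of tridiagonal Gaussian elimination on four-component rows. The channel assignment (R = sub-diagonal, G = diagonal, B = super-diagonal, A = right-hand side) is an assumption, as is the absence of pivoting; `solve_tridiagonal` and `solve_many` are hypothetical names.

```python
def solve_tridiagonal(system):
    """Gaussian elimination without pivoting.

    system is a list of rows (lower, diag, upper, rhs), one row per equation.
    """
    n = len(system)
    diag = [row[1] for row in system]
    rhs = [row[3] for row in system]
    for i in range(1, n):              # forward elimination
        factor = system[i][0] / diag[i - 1]
        diag[i] -= factor * system[i - 1][2]
        rhs[i] -= factor * rhs[i - 1]
    x = [0.0] * n
    x[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):     # back substitution
        x[i] = (rhs[i] - system[i][2] * x[i + 1]) / diag[i]
    return x


def solve_many(systems):
    """Independent systems; each one is inherently sequential, so parallelism comes from solving many at once."""
    return [solve_tridiagonal(s) for s in systems]


one = [(0.0, 2.0, 1.0, 3.0), (1.0, 2.0, 1.0, 4.0), (1.0, 2.0, 0.0, 3.0)]
print(solve_many([one, one]))
```

The example system has the solution x = (1, 1, 1), up to rounding. Each system is a sequential recurrence, which is consistent with the poor performance noted above when only one system is solved.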
Accuracy and stability
Test problems
Matrices
◮ Matrix multiplication - uniform random in [0, 10] (condition number $10^2$ to $5 \times 10^6$).
◮ Gauss-Jordan, PLU - uniform random in [0, 5] with a random integer in [0, 100] added to the diagonal (condition number $10^2$ to $4 \times 10^4$).
◮ Tridiagonal Gaussian - uniform random in [0, 5] with a random integer in [0, 100] added to the diagonal (condition number $2 \times 10^1$ to $5 \times 10^3$).
Measured error
[Figure: absolute and relative error against matrix size (0 to 2000). (a) Matrix multiplication. (b) Gauss-Jordan elimination.]
Measured error
[Figure: absolute and relative error against matrix size (0 to 2000). (c) PLU factorization. (d) Tridiagonal Gaussian elimination.]
Accuracy
◮ Accurate storage and computation yield accurate results (e.g. most integral matrices).
◮ Absolute error depends on n and the size of input/output elements.
◮ Relative error depends only on n.
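The first point can be illustrated by emulating single-precision arithmetic in Python. This sketch assumes the GPU computes in IEEE 754 single precision, which the slides suggest but do not state; `f32` and `dot_single` are hypothetical helpers.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_single(u, v):
    """Inner product with every intermediate result rounded to single precision."""
    acc = 0.0
    for a, b in zip(u, v):
        acc = f32(acc + f32(f32(a) * f32(b)))
    return acc


# Small integers are stored and combined exactly, so the result is exact.
print(dot_single([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0

# 0.1 is not representable, so storage and accumulation both introduce error.
n = 1000
approx = dot_single([0.1] * n, [0.1] * n)
print(abs(approx - 10.0) / 10.0)  # small but non-zero relative error
```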
Speed
Runtime of different algorithms
[Figure: time (seconds) against matrix size (1000 to 4000); legend: CPU, GPU, Lsqr CPU, Lsqr GPU. (a) Matrix multiplication. (b) Gauss-Jordan elimination.]
Runtime of different algorithms
[Figure: time (seconds) against matrix size. (c) PLU factorization, matrix size 1000 to 4000; legend: GPU, CPU, Lsqr CPU, Lsqr GPU. (d) Tridiagonal Gaussian elimination, matrix size 0 to 4000; legend: GPU, Matlab.]
Background processing
[Figure: legend: GPU + CPU, CPU, GPU, Lsqr GPU + CPU, Lsqr GPU, Lsqr CPU; horizontal axis ticks 1000 to 2000, vertical axis ticks 0.25 to 1.]
Conclusions, and further research
Conclusions
◮ The GPU is hard to program, but there are great rewards (speedups of 2-7 times shown here).
◮ The GPU can be utilized as an efficient coprocessor.
◮ A high-level mathematical interface to efficient algorithms is useful.
Contributions
◮ A high-level mathematical interface to the GPU.
◮ A new pivoting strategy for vectorized operations.
◮ The use of packing for Gauss-Jordan and PLU factorization.
References
Fatahalian, K., Sugerman, J., & Hanrahan, P. 2004. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. Pages 133–137 of: HWWS ’04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware. New York, NY, USA: ACM Press.
Govindaraju, N. K., Larsen, S., Gray, J., & Manocha, D. 2006. A memory model for scientific algorithms on graphics processors. Page 89 of: SC ’06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing. New York, NY, USA: ACM Press.
Hall, J. D., Carr, N. A., & Hart, J. C. 2003. Cache and bandwidth aware matrix multiplication on the GPU.
Jiang, C., & Snir, M. 2005. Automatic Tuning Matrix Multiplication Performance on Graphics Hardware. Pages 185–196 of: PACT ’05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. Washington, DC, USA: IEEE Computer Society.
Larsen, E. S., & McAllister, D. 2001. Fast matrix multiplies using graphics hardware. Pages 55–55 of: Supercomputing ’01: Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM). New York, NY, USA: ACM Press.
Moravánszky, A. 2003. Dense Matrix Algebra on the GPU. Online; http://www.shaderx2.com/shaderx.pdf [accessed 2006-05-11].