DATA PARALLELISM IN HASKELL
Manuel M T Chakravarty, University of New South Wales
INCLUDES JOINT WORK WITH Gabriele Keller, Sean Lee, Roman Leshchinskiy, Ben Lippmeier, Trevor McDonell, Simon Peyton Jones
OUR GOAL: SIMPLIFY COMPUTE-INTENSIVE APPLICATIONS
MORE AND MORE PARALLELISM, BETTER POWER EFFICIENCY
[Figure: hardware trend; Intel i7 970 (6 cores, 12 threads), Intel MIC (32 cores, 128 threads), NVIDIA GF100 (512 cores, 24,576 threads)]
SOFTWARE NEEDS TO DEAL WITH PARALLELISM!
The essence of this talk
1. Parallel programming and functional programming are a natural fit
2. Data parallelism is simpler than control parallelism
3. Nested parallelism is more expressive than flat data parallelism
Concurrency ≠ parallelism

Concurrency (HARD!)
‣ Multiple interleaved threads of control
‣ All threads have effects on the world
‣ Non-determinism & concurrency control

Parallelism (CAN BE QUITE EASY!)
‣ Produce the same result, but faster
Don't we need concurrency to implement parallelism?
Sometimes, yes. Sometimes, no.
This should not concern the application programmer!
[Diagram: increasing abstraction, from Concurrency up through Parallelism to Data Parallelism]
We can implement a parallel program using explicit concurrency...
...and we can implement a web server in assembly language.
How does functional programming help?
Concurrency
‣ Multiple interleaved threads of control
‣ All threads have effects on the world
‣ Non-determinism & concurrency control (STM, message passing, locks, and so on)

Purely Functional Programming: NO EFFECTS, LESS CONSTRAINED EXECUTION ORDER
A simple example

processList list = (sort list, maximum list)
(processList is the function, list its argument, and the pair expression the function body)

Performs two tasks: sorting and determining the maximum
The tasks are executed in an arbitrary order
Returns a pair of the results
May even return the pair before sorting and maximum have completed!
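Because the function is pure, the two components can also be evaluated in parallel without changing the result. A minimal sketch of that idea (not from the slides), using the parallel package's Control.Parallel.Strategies; processListPar is a made-up name:

import Control.Parallel.Strategies (parTuple2, rdeepseq, withStrategy)
import Data.List (sort)

-- Evaluate both components of the pair in parallel; purity guarantees
-- the same result as the sequential version.
processListPar :: [Int] -> ([Int], Int)
processListPar list =
  withStrategy (parTuple2 rdeepseq rdeepseq) (sort list, maximum list)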
Typical collective operations (with a natural parallel interpretation)

map f [x1, ..., xn] = [f x1, ..., f xn]
zipWith f [x1, ..., xn] [y1, ..., yn] = [f x1 y1, ..., f xn yn]
foldl1 (⊕) [x1, ..., xn] = ((x1 ⊕ x2) ⊕ ···) ⊕ xn        (aka reduce)

✴ maximum = foldl1 max
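The parallel reading of foldl1 relies on ⊕ being associative: the reduction can then be evaluated as a balanced tree instead of a left-to-right chain. A minimal sketch of that idea (not from the slides; reduceTree and pairwise are made-up names):

-- Tree-shaped reduction of a non-empty list with an associative operator.
reduceTree :: (a -> a -> a) -> [a] -> a
reduceTree _ [x] = x
reduceTree f xs  = reduceTree f (pairwise xs)
  where
    pairwise (x:y:rest) = f x y : pairwise rest
    pairwise rest       = rest

-- e.g.  reduceTree max [3,1,4,1,5]  ==  maximum [3,1,4,1,5]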
Our secret weapons: Purity & Persistence!

Example (function, argument type, result type):
processList list = (sort list, maximum list)
processList :: [Int] -> ([Int], Int)

Purity: a function's result depends solely on its arguments
Persistence: don't mutate data structures in-place (encourages collective operations)
Haskell, specifically
Broad-spectrum, pure and lazy
Strongly typed, but types are optional
Mature language and tools & tons of libraries
Vibrant community: #haskell, [email protected], http://haskell.org/
By default pure: types track purity

Pure (no effects):
  Int
  processList :: [Int] -> ([Int], Int)

Impure (may have effects):
  IO Int
  readFile :: FilePath -> IO String

copyFile fn1 fn2 = do ...
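A minimal sketch of the copyFile example, assuming it simply reads the whole file and writes it back using the Prelude's readFile and writeFile (the bound variable is named contents because data is a Haskell keyword):

copyFile :: FilePath -> FilePath -> IO ()
copyFile fn1 fn2 = do
  contents <- readFile fn1     -- impure: IO String
  writeFile fn2 contents       -- impure: IO ()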
copyFile fn1 fn2 = do (sort list, maximum list) data Array DIM2 e -> Array DIM2 e rr: [[a00, a01, a-> a12],[a21, Array e a22, a23],[a31, a32, a33]] 02],[a 10, a11, DIM2 rr: [[b [b10,b20 ], [b20,b21]] 00,b01], arr mmMult brr rrTran:[[b00,b10,b20],[b01,b11,b21]] = sum (zipWith (*) arrRepl brrRepl) rep 2 rep 2 where rrRepl:[[[atrr 00,a01,a02],[a 01,a02]],[[[a 10,a11,a12],..],............] =00,a force (transpose2D brr) rrRepl:[[[barrRepl 00,b10,b20],[b01,b11,b21]],[[[b00,b10,b20],..],............] = replicate (Z :.All :.colsB :.All) arr brrRepl = replicate (Z :.rowsA :.All :.All) trr rep 4 (Z :.colsA :.rowsA) = extent arr esult :[[a00(Z *b00+a 01*b10+...,a 00*b01+..],[.,.],[.,.],[.,.]] :.colsB :.rowsB) = extent brr sum
trr brrRepl x 4 arr
Thursday, 1 September 11
arrRepl x 2
Matrix-matrix multiplication (size 1024x1024): GHC's code generation still leaves room for improvement
Sobel and Canny edge detection (100 iterations). OpenCV: high-performance computer vision library — uses SSE SIMD instructions!
Summary: Regular first-class arrays
Repa library: just another array API [ICFP 2010]
Multi-dimensional & shape polymorphic
Collective array operations, executed in parallel
But no arrays of arrays; i.e., no nested parallelism
But multicore CPUs only!
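Shape polymorphism in Repa is expressed through index types built from Z and (:.). A minimal sketch of how such indices look in use (the concrete sizes are made up for illustration):

import Data.Array.Repa (Z(..), (:.)(..), DIM2)

-- The extent (shape) of a 3x5 matrix:
matrixShape :: DIM2
matrixShape = Z :. 3 :. 5

-- An index into that matrix: row 2, column 4.
someIndex :: DIM2
someIndex = Z :. 2 :. 4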
General Purpose GPU Programming (GPGPU)
MODERN GPUS ARE FREELY PROGRAMMABLE
But no function pointers & limited recursion
Very Different Programming Model (Compared to multicore CPUs)
REGULAR ARCHITECTURE
Avoids deep pipelines, sophisticated caches, and so on
[Figure: GPU die, 24,576 threads]
✴ SIMD: groups of threads executing in lock step (warps)
✴ Latency hiding: excess parallelism covers main memory latency
✴ Thread divergence is expensive
✴ Memory access patterns need to be regular
[Chart: Dot Product; time (ms) vs. number of elements (million, 2 to 18); series: Accelerate and CUBLAS]
CPU (Xeon E5405 @ 2GHz): 71.3ms for 18M
Computation only, without CPU ⇄ GPU transfer
Challenges
Code must be massively data parallel
Control structures are limited
Limited function pointers
Limited recursion
Software-managed cache, memory-access patterns, etc.
Portability...
[Image: Tesla T10 GPU]
OTHER COMPUTE ACCELERATOR ARCHITECTURES
Goal: portable data parallelism
Data.Array.Accelerate [DAMP 2011]
Collective operations on multi-dimensional regular arrays ✓ massive data parallelism
Embedded DSL ✓ limited control structures
‣ Restricted control flow
‣ First-order GPU code
Generative approach based on combinator templates ✓ hand-tuned access patterns
Multiple backends ✓ portability
import Data.Array.Accelerate

Dot product

dotp :: Vector Float -> Vector Float -> Acc (Scalar Float)
dotp xs ys = let xs' = use xs
                 ys' = use ys
             in
             fold (+) 0 (zipWith (*) xs' ys')

Vector Float: a Haskell array
Acc (Scalar Float): an EDSL array, i.e. a description of array computations
use: lifts Haskell arrays into the EDSL — may trigger a host➙device transfer
fold and zipWith here are EDSL array computations
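To execute the computation, a backend's run function is applied to the Acc term. A minimal usage sketch, assuming the CUDA backend of that era (Data.Array.Accelerate.CUDA); the input values are made up for illustration:

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate (Acc, Vector, Scalar, Z(..), (:.)(..), fromList, use)
import qualified Data.Array.Accelerate.CUDA as CUDA

-- The dotp computation from the slide (unchanged, apart from qualifying
-- the names that clash with the Prelude):
dotp :: Vector Float -> Vector Float -> Acc (Scalar Float)
dotp xs ys = let xs' = use xs
                 ys' = use ys
             in  A.fold (+) 0 (A.zipWith (*) xs' ys')

main :: IO ()
main = do
  let xs = fromList (Z :. 5) [1, 2, 3, 4, 5]           :: Vector Float
      ys = fromList (Z :. 5) [0.5, 0.5, 0.5, 0.5, 0.5] :: Vector Float
  print (CUDA.run (dotp xs ys))   -- compile the EDSL term and run it on the GPU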
import Data.Array.Accelerate

Sparse-matrix vector multiplication

type SparseVector a = Vector (Int, a)             -- [0, 0, 6.0, 0, 7.0] ≈ [(2, 6.0), (4, 7.0)]
type SparseMatrix a = (Segments, SparseVector a)  -- [[10, 20], [], [30]] ≈ ([2, 0, 1], [10, 20, 30])

smvm :: Acc (SparseMatrix Float) -> Acc (Vector Float) -> Acc (Vector Float)
smvm (segd, smat) vec
  = let (inds, vals) = unzip smat
        vecVals  = backpermute (shape inds) (\i -> index1 (inds!i)) vec
        products = zipWith (*) vecVals vals
    in
    foldSeg (+) 0 products segd
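The segment descriptor drives a segmented reduction: foldSeg sums each row's products separately. A minimal list-based sketch of that semantics (not the Accelerate implementation; foldSegList is a made-up name):

-- Segmented fold over plain lists: one result per segment length.
foldSegList :: (a -> a -> a) -> a -> [a] -> [Int] -> [a]
foldSegList f z xs segd = go xs segd
  where
    go _  []     = []
    go ys (n:ns) = let (seg, rest) = splitAt n ys
                   in  foldl f z seg : go rest ns

-- e.g.  foldSegList (+) 0 [10,20,30] [2,0,1]  ==  [30, 0, 30]
-- (the example matrix [[10,20],[],[30]] multiplied with an all-ones vector)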
Architecture of Data.Array.Accelerate

[Diagram]
Frontend:
  Surface language ↓ reify & recover sharing (HOAS → de Bruijn) ↓ optimise (fusion)
Data: non-parametric array representation (unboxed arrays; array of tuples → tuple of arrays)
Multiple backends: CUDA.run, LLVM.run, FPGA.run, ...
CUDA backend, first pass (CPU and GPU work overlap):
  CPU: code generation ↓ compilation ↓ memoisation; allocate memory
  GPU: copy host → device (asynchronously)
Second pass:
  CPU: link & configure kernel
  GPU: parallel execution
map (\x -> x + 1) arr
  ↓ reify & HOAS → de Bruijn
Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr
  ↓ recover sharing (CSE or Observe)
  ↓ optimisation (fusion)
  ↓ code generation
__global__ void kernel (float *arr, int n) {...
  ↓ nvcc (via the cuda package)
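For illustration only (these are not Accelerate's actual internal types), a minimal Haskell sketch of a de Bruijn-indexed term like the one above, where ZeroIdx refers to the innermost lambda-bound variable:

-- Scalar expressions in de Bruijn form (illustrative subset).
data Exp = ZeroIdx                     -- the innermost bound variable
         | Const Float
         | PrimApp PrimFun (Exp, Exp)  -- apply a primitive to a pair of arguments

data PrimFun = Add | Mul

newtype Fun   = Lam Exp                -- one-argument function body
data AccTerm  = Map Fun String         -- the String names the input array, e.g. "arr"

-- The reified form of  map (\x -> x + 1) arr :
example :: AccTerm
example = Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) "arr"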
CUDA skeletons: zipWith.inl

extern "C" __global__ void zipWith
(
    TyOut        *d_out,
    const TyIn1  *d_in1,
    const TyIn0  *d_in0,
    const Int    length
)
{
    Int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const Int grid = blockDim.x * gridDim.x;

    for (; ix < length; ix += grid) {
        d_out[ix] = apply(d_in1[ix], d_in0[ix]);
    }
}
Skeleton instantiation: zipWith (*)

typedef float TyOut;
typedef float TyIn1;
typedef float TyIn0;

static inline __device__ TyOut apply(const TyIn1 x1, const TyIn0 x0)
{
    TyOut r = x1 * x0;
    return r;
}
#include <zipWith.inl>
Shapes

Types:
typedef int32_t               Ix;
typedef Ix                    DIM1;
typedef struct { Ix a1,a0; }  DIM2;

Functions:
int  dim(DIMn sh);                   // dimensionality
int  size(DIMn sh);                  // number of elements
Ix   toIndex(DIMn sh, DIMn ix);      // index into row-major format
DIMn fromIndex(DIMn sh, Ix ix);      // invert toIndex

Single skeleton per function (C++ templates & overloading)
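In Haskell terms, toIndex and fromIndex for DIM2 are plain row-major index arithmetic. A minimal sketch (not from the slides; the names are made up):

-- shape (height, width); index (row, col)
toIndex2 :: (Int, Int) -> (Int, Int) -> Int
toIndex2 (_, width) (row, col) = row * width + col

fromIndex2 :: (Int, Int) -> Int -> (Int, Int)
fromIndex2 (_, width) i = i `divMod` width   -- inverts toIndex2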
Free variables...

Excerpt from SMVM:
  backpermute (shape inds) (\i -> index1 (inds!i)) vec
Free array-valued variable (need to lift computations)

Backpermute skeleton prototype:

__global__ void backpermute
(
    ArrOut        d_out,
    const ArrIn0  d_in0,
    const DimOut  shOut,
    const DimIn0  shIn0
);

Fixed set of arguments! (We don't generate skeletons dynamically.)
... become textures

Backpermute skeleton instantiation:

texture<...> tex0;
...
typedef DIM1 DimOut;
typedef DIM1 DimIn0;

static inline __device__ DimIn0 project(const DimOut x0)
{
    DimIn0 r = tex1Dfetch(tex0, x0);
    return r;
}
#include <...>

Texture access need not be coalesced & is cached
Caching data transfers & skeleton instantiations

Memory table
‣ Associates Haskell arrays with device copies
‣ Transfer arrays only once

Kernel table
‣ Associates skeleton use with compiled binary
‣ Never re-compile a skeleton instance
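A minimal sketch of the kernel-table idea (hypothetical, not Accelerate's actual data structures): key compiled binaries by the generated source so each skeleton instance is compiled at most once.

import qualified Data.Map as Map

data CompiledKernel = CompiledKernel                  -- placeholder for a loaded GPU binary
type KernelTable    = Map.Map String CompiledKernel   -- keyed by generated source code

lookupOrCompile :: (String -> IO CompiledKernel)      -- how to compile (e.g. invoke nvcc)
                -> String -> KernelTable -> IO (CompiledKernel, KernelTable)
lookupOrCompile compile src table =
  case Map.lookup src table of
    Just k  -> return (k, table)                      -- cache hit: reuse the binary
    Nothing -> do k <- compile src                    -- cache miss: compile once...
                  return (k, Map.insert src k table)  -- ...and remember it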
The evaluator — executing array code

First pass (CPU and GPU work overlap):
‣ CPU: code generation ↓ compilation ↓ memoisation; allocate memory
‣ GPU: copy host → device (asynchronously)
Second pass:
‣ CPU: link & configure kernel
‣ GPU: parallel execution
[Chart: Black-Scholes Option Pricing; time (ms) vs. number of options (million, 1 to 9); series: Accelerate (w/o sharing), Accelerate (sharing cnd'), Accelerate (sharing cnd' and d), CUDA SDK]
Sharing can be important
Tesla T10 (compute capability 1.3, 30 x 1.3GHz)
Sharing-sensitive Black-Scholes code

Cumulative normal distribution:

cnd :: Exp Float -> Exp Float
cnd d = let poly = horner coeff
            k    = 1.0 / (1.0 + 0.2316419 * abs d)
            cnd' = rsqrt2 * exp (-0.5*d*d) * poly k
        in
        d >* 0 ? (1 - cnd', cnd')
[Chart: Sparse-matrix vector multiplication; GFLOPS/s (0 to 20) across test matrices (Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbour, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, Webbase, LP); series: Accelerate vs. CUSP formats COO, CSR (scalar), CSR (vector), DIA, ELL, HYB]
Versus highly-optimised CUSP library
Tesla T10 (compute capability 1.3, 30 x 1.3GHz)
Stocktake: flat data parallelism
No nesting — code is not modular (compositional)
No arrays of structured data
Embedded variant (targeting GPUs etc.):
‣ First-order, except for a fixed set of higher-order collective operations
‣ No recursion
Nested data parallelism in Haskell
Data Parallel Haskell: language extension (fully integrated) [EuroPar 2001]
Data type of nested parallel arrays [:e:] — here, e can be any type
Parallel evaluation semantics
Array comprehensions & collective operations (mapP, scanP, etc.)
Parallel Quicksort

qsort :: Ord a => [:a:] -> [:a:]
qsort [::] = [::]
qsort xs   = let p       = xs !: 0
                 smaller = [:x | x <- xs, x < p:]
                 ...
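The rest of the definition is elided above; a minimal sketch of how such a DPH quicksort is typically written (an assumption, not necessarily the slide's exact code). The two recursive calls are themselves expressed as a parallel array comprehension, which is exactly the nested parallelism this talk is about.

qsort :: Ord a => [:a:] -> [:a:]
qsort [::] = [::]
qsort xs   = let p       = xs !: 0
                 smaller = [:x | x <- xs, x < p:]
                 greater = [:x | x <- xs, x > p:]
                 equal   = [:x | x <- xs, x == p:]
                 sorted  = [:qsort ys | ys <- [:smaller, greater:]:]  -- both halves in parallel
             in
             (sorted !: 0) +:+ equal +:+ (sorted !: 1)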
Implementation [FSTTCS 2008]

Extension of the Glasgow Haskell Compiler (GHC)

Stage 1: The Vectoriser
Transforms all nested into flat parallelism
  f  :: a -> b
  f^ :: [:a:] -> [:b:]

Stage 2: Library package DPH
High-performance flat array library
Communication and array fusion
Radical re-ordering of computations (PURITY IS ESSENTIAL)
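Conceptually, the vectoriser derives a lifted version that works on whole parallel arrays at once. A made-up illustration in the slides' [: :] notation (not GHC's actual output; compiling it needs the ParallelArrays extension and the DPH libraries):

-- A scalar function...
f :: Int -> Int
f x = x + 1

-- ...and its lifted counterpart (written f^ on the slide):
fL :: [:Int:] -> [:Int:]
fL xs = [:x + 1 | x <- xs:]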
Current implementation: targeting multicore CPUs
GHC performs the vectorisation transformation on the Core IL
[Charts: 2x Quad-Core Xeon = 8 cores (8 thread contexts); 1x UltraSPARC T2 = 8 cores (64 thread contexts)]
Summary
Purely functional programming simplifies parallel programming
Data parallelism in Haskell is natural and flexible
Nested parallelism is more expressive, but also much harder to implement
Embedded languages for specialised architectures
Accelerate: https://github.com/mchakravarty/accelerate
DPH: http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell
Repa: http://hackage.haskell.org/package/repa
[EuroPar 2001] Nepal -- Nested Data-Parallelism in Haskell. Chakravarty, Keller, Lechtchinsky & Pfannenstiel. In "Euro-Par 2001: Parallel Processing, 7th Intl. Euro-Par Conference", 2001.
[FSTTCS 2008] Harnessing the Multicores: Nested Data Parallelism in Haskell. Peyton Jones, Leshchinskiy, Keller & Chakravarty. In "IARCS Annual Conf. on Foundations of Software Technology & Theoretical Computer Science", 2008.
[ICFP 2010] Regular, Shape-Polymorphic, Parallel Arrays in Haskell. Keller, Chakravarty, Leshchinskiy, Peyton Jones & Lippmeier. In "ICFP 2010: The 15th ACM SIGPLAN Intl. Conf. on Functional Programming", 2010.
[DAMP 2011] Accelerating Haskell Array Codes with Multicore GPUs. Chakravarty, Keller, Lee, McDonell & Grover. In "Declarative Aspects of Multicore Programming", 2011.
Twitter: @TacticalGrace