DATA PARALLELISM IN HASKELL

Manuel M T Chakravarty, University of New South Wales

INCLUDES JOINT WORK WITH Gabriele Keller, Sean Lee, Roman Leshchinskiy, Ben Lippmeier, Trevor McDonell, Simon Peyton Jones

OUR GOAL: SIMPLIFY COMPUTE INTENSIVE APPLICATIONS

MORE AND MORE PARALLELISM
BETTER POWER EFFICIENCY

‣ Intel Core i7 970 — 6 cores (12 threads)
‣ Intel MIC — 32 cores (128 threads)
‣ NVIDIA GF100 — 512 cores (24,576 threads)

SOFTWARE NEEDS TO DEAL WITH PARALLELISM!

The essence of this talk
1. Parallel programming and functional programming are a natural fit
2. Data parallelism is simpler than control parallelism
3. Nested parallelism is more expressive than flat data parallelism

Concurrency ≠ parallelism

Concurrency — HARD!
‣ Multiple interleaved threads of control
‣ All threads have effects on the world
‣ Non-determinism & concurrency control

Parallelism — CAN BE QUITE EASY!
‣ Produce the same result, but faster

Don't we need concurrency to implement parallelism?

Sometimes, yes. Sometimes, no:
this should not concern the application programmer!

Data Parallelism: an ABSTRACTION over Parallelism and Concurrency

We can implement a parallel program using explicit concurrency, just as we can implement a web server in assembly language.

How does functional programming help?


Concurrency
‣ Multiple interleaved threads of control
‣ All threads have effects on the world
‣ Non-determinism & concurrency control
    — STM, MESSAGE PASSING, LOCKS, AND SO ON

Purely Functional Programming
‣ NO EFFECTS
‣ LESS CONSTRAINED EXECUTION ORDER

A simple example

processList list = (sort list, maximum list)
    (function: processList; argument: list; function body: the pair)

‣ Performs two tasks: sorting and determining the maximum
‣ The tasks are executed in an arbitrary order
‣ Returns a pair of the results
‣ May even return the pair before sorting and maximum have completed!
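Purity is what makes that arbitrary order safe to exploit. As a minimal sketch (not from the slides), assuming the parallel package's Control.Parallel.Strategies, both components of the pair can be evaluated in parallel without changing the result:

import Control.Parallel.Strategies (parTuple2, rdeepseq, using)
import Data.List (sort)

-- Evaluate both components of the pair in parallel; purity guarantees
-- the same result as the sequential version.
processListPar :: [Int] -> ([Int], Int)
processListPar list =
  (sort list, maximum list) `using` parTuple2 rdeepseq rdeepseq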

Typical collective operations (with a natural parallel interpretation)

map     f   [x1, ..., xn]               = [f x1, ..., f xn]
zipWith f   [x1, ..., xn] [y1, ..., yn] = [f x1 y1, ..., f xn yn]
foldl1  (⊕) [x1, ..., xn]               = ((x1 ⊕ x2) ⊕ ···) ⊕ xn    (aka reduce)

maximum = foldl1 max
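These operators compose into useful kernels. A quick list-based sketch in plain Haskell (not on the slide): a dot product is just a zipWith followed by a reduction, foreshadowing the Accelerate version later in the talk.

-- Dot product in collective-operation style:
-- elementwise multiplication, then a reduction.
dotp :: [Float] -> [Float] -> Float
dotp xs ys = foldl1 (+) (zipWith (*) xs ys)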

Our secret weapons: Purity & Persistence!

processList list = (sort list, maximum list)
processList :: [Int] -> ([Int], Int)
               (argument type -> result type)

Purity: a function's result depends solely on its arguments
Persistence: don't mutate data structures in-place
             (encourages collective operations)

Haskell, specifically
‣ Broad-spectrum, pure and lazy
‣ Strongly typed, but types are optional (inferred)
‣ Mature language and tools & tons of libraries
‣ Vibrant community: #haskell on freenode & http://haskell.org/

By default pure — types track purity:

‣ Pure = no effects, e.g. Int
    processList :: [Int] -> ([Int], Int)
‣ Impure = may have effects, e.g. IO Int
    readFile :: FilePath -> IO String

Pure expression — data:
  (sort list, maximum list)     :: ([Int], Int)

Impure computation — action:
  copyFile fn1 fn2 = do         -- :: IO ()
    contents <- readFile fn1
    writeFile fn2 contents

Matrix-matrix multiplication in Repa:

mmMult :: (Elt e, Num e)
       => Array DIM2 e -> Array DIM2 e -> Array DIM2 e
mmMult arr brr
  = sum (zipWith (*) arrRepl brrRepl)
  where
    trr     = force (transpose2D brr)
    arrRepl = replicate (Z :.All :.colsB :.All) arr
    brrRepl = replicate (Z :.rowsA :.All :.All) trr
    (Z :.colsA :.rowsA) = extent arr
    (Z :.colsB :.rowsB) = extent brr

[The slide animates a worked example on small matrices: brr is transposed and forced, both operands are replicated so every row of arr meets every column of brr, and the elementwise products are summed along the innermost dimension.]
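mmMult relies on a transpose2D helper that is not shown here; a sketch following the formulation in the Repa paper [ICFP 2010], where transposition is a backpermute that swaps the two innermost dimensions:

transpose2D :: Elt e => Array DIM2 e -> Array DIM2 e
transpose2D arr = backpermute new_extent swap arr
  where
    swap (Z :. i :. j) = Z :. j :. i   -- exchange row and column index
    new_extent         = swap (extent arr)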

Matrix-matrix multiplication (size 1024×1024): [benchmark chart]
GHC's code generation still leaves room for improvement

Sobel and Canny edge detection (100 iterations): [benchmark chart]
OpenCV: high-performance computer vision library — uses SSE SIMD instructions!

Summary: Regular first-class arrays
‣ Repa library: just another array API [ICFP 2010]
‣ Multi-dimensional & shape polymorphic
‣ Collective array operations, executed in parallel
‣ But no arrays of arrays; i.e., no nested parallelism
‣ But multicore CPUs only!

General Purpose GPU Programming (GPGPU)


MODERN GPUS ARE FREELY PROGRAMMABLE
But no function pointers & limited recursion

Very Different Programming Model (compared to multicore CPUs)

24,576 THREADS — REGULAR ARCHITECTURE
Avoids deep pipelines, sophisticated caches, and so on
✴ SIMD: groups of threads executing in lock step (warps)
✴ Latency hiding: excess parallelism covers main memory latency
✴ Thread divergence is expensive
✴ Memory access patterns need to be regular

Dot Product

[Benchmark chart: time in ms, log scale, versus number of elements (2–18 million); series: Accelerate and CUBLAS.]

CPU (Xeon E5405 @ 2GHz): 71.3 ms for 18M elements
Computation only, without CPU ⇄ GPU transfer

Challenges
‣ Code must be massively data parallel
‣ Control structures are limited: limited function pointers, limited recursion
‣ Software-managed cache, memory-access patterns, etc.
‣ Portability...

Tesla T10 GPU
OTHER COMPUTE ACCELERATOR ARCHITECTURES
Goal: portable data parallelism

Data.Array.Accelerate
‣ Collective operations on multi-dimensional regular arrays
    ✓ massive data parallelism
‣ Embedded DSL
    ✓ limited control structures
    • Restricted control flow
    • First-order GPU code
‣ Generative approach based on combinator templates
    ✓ hand-tuned access patterns
‣ Multiple backends
    ✓ portability

[DAMP 2011]

import Data.Array.Accelerate

Dot product

dotp :: Vector Float -> Vector Float -> Acc (Scalar Float)
dotp xs ys = let xs' = use xs
                 ys' = use ys
             in
             fold (+) 0 (zipWith (*) xs' ys')

‣ Vector Float: a Haskell array
‣ Acc (Scalar Float): an EDSL array, i.e. a description of array computations
‣ use: lifts Haskell arrays into the EDSL; may trigger a host➙device transfer
‣ fold, zipWith: EDSL array computations
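Running dotp means handing the Acc computation to a backend. A minimal usage sketch, assuming the CUDA backend module Data.Array.Accelerate.CUDA and the library's fromList, as in the implementation described in this talk:

import Data.Array.Accelerate (Vector, Z(..), (:.)(..), fromList)
import qualified Data.Array.Accelerate.CUDA as CUDA

main :: IO ()
main = do
  let xs = fromList (Z :. 10) [0..9]     :: Vector Float
      ys = fromList (Z :. 10) (repeat 1) :: Vector Float
  print (CUDA.run (dotp xs ys))   -- compiles to CUDA & executes on the GPU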

import Data.Array.Accelerate

Sparse-matrix vector multiplication

type SparseVector a = Vector (Int, a)
     -- [0, 0, 6.0, 0, 7.0] ≈ [(2, 6.0), (4, 7.0)]
type SparseMatrix a = (Segments, SparseVector a)
     -- [[10, 20], [], [30]] ≈ ([2, 0, 1], [10, 20, 30])

smvm :: Acc (SparseMatrix Float) -> Acc (Vector Float) -> Acc (Vector Float)
smvm (segd, smat) vec
  = let (inds, vals) = unzip smat
        vecVals  = backpermute (shape inds) (\i -> inds!i) vec
        products = zipWith (*) vecVals vals
    in
    foldSeg (+) 0 products segd

Architecture of Data.Array.Accelerate


Frontend:
  Surface language
    ↓ reify & recover sharing (HOAS → de Bruijn)
    ↓ optimise (fusion)
  Non-parametric array representation:
    unboxed arrays; arrays of tuples represented as tuples of arrays

Multiple backends: CUDA.run, LLVM.run, FPGA.run, ...

First pass:  code generation → compilation → memoisation
Second pass (CPU and GPU work overlap):
  – CPU – allocate memory; link & configure kernel
  – GPU – copy host → device (asynchronously); parallel execution

map (\x -> x + 1) arr

    ↓ reify & HOAS → de Bruijn

Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr

    ↓ recover sharing (CSE or Observe)
    ↓ optimisation (fusion)
    ↓ code generation

__global__ void kernel (float *arr, int n) {...

    ↓ nvcc (invoked via the cuda package)

CUDA skeletons — zipWith.inl

extern "C" __global__ void zipWith
(
    TyOut        *d_out,
    const TyIn1  *d_in1,
    const TyIn0  *d_in0,
    const Int    length
)
{
          Int ix   = blockDim.x * blockIdx.x + threadIdx.x;
    const Int grid = blockDim.x * gridDim.x;

    for (; ix < length; ix += grid) {
        d_out[ix] = apply(d_in1[ix], d_in0[ix]);
    }
}

Skeleton instantiation — zipWith (*)

typedef float TyOut;
typedef float TyIn1;
typedef float TyIn0;

static inline __device__
TyOut apply(const TyIn1 x1, const TyIn0 x0)
{
    TyOut r = x1 * x0;
    return r;
}

#include "zipWith.inl"

Shapes

Types:

typedef int32_t                Ix;
typedef Ix                     DIM1;
typedef struct { Ix a1,a0; }   DIM2;

Functions:

int  dim(DIMn sh);                 // dimensionality
int  size(DIMn sh);                // number of elements
Ix   toIndex(DIMn sh, DIMn ix);    // index into row-major format
DIMn fromIndex(DIMn sh, Ix ix);    // invert toIndex

Single skeleton per function (C++ templates & overloading)
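The index functions are plain row-major arithmetic. A hedged Haskell analogue of the DIM2 case (illustrative only, not part of the backend):

-- Shape is (rows, cols); an index is (row, col).
-- toIndex2 linearises row-major; fromIndex2 inverts it.
toIndex2 :: (Int, Int) -> (Int, Int) -> Int
toIndex2 (_, cols) (r, c) = r * cols + c

fromIndex2 :: (Int, Int) -> Int -> (Int, Int)
fromIndex2 (_, cols) i = i `divMod` cols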

Free variables...

Excerpt from SMVM:

backpermute (shape inds) (\i -> index1 (inds!i)) vec
    -- inds is a free array-valued variable (need to lift computations)

Backpermute skeleton prototype:

__global__ void backpermute
(
          ArrOut  d_out,
    const ArrIn0  d_in0,
    const DimOut  shOut,
    const DimIn0  shIn0
);

Fixed set of arguments! (We don't generate skeletons dynamically.)

... become textures

texture<...> tex0;
...
typedef DIM1 DimOut;
typedef DIM1 DimIn0;

static inline __device__
DimIn0 project(const DimOut x0)
{
    DimIn0 r = tex1Dfetch(tex0, x0);
    return r;
}

#include "backpermute.inl"

Texture access need not be coalesced & is cached

Caching data transfers & skeleton instantiations

Memory table
‣ Associates Haskell arrays with device copies
‣ Transfer arrays only once

Kernel table
‣ Associates skeleton use with compiled binary
‣ Never re-compile a skeleton instance
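A hedged sketch of the kernel-table idea (not Accelerate's actual code): compiled kernels are keyed by the instantiated skeleton source, so each instance is compiled at most once per program run.

import Data.IORef
import qualified Data.Map as Map

data CompiledKernel = CompiledKernel   -- stands in for a loaded CUDA binary
type KernelTable    = IORef (Map.Map String CompiledKernel)

lookupOrCompile :: KernelTable -> String -> IO CompiledKernel
lookupOrCompile tbl src = do
  cache <- readIORef tbl
  case Map.lookup src cache of
    Just k  -> return k                    -- cache hit: never re-compile
    Nothing -> do
      let k = CompiledKernel               -- here: invoke nvcc & load the binary
      modifyIORef tbl (Map.insert src k)
      return k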

The evaluator — executing array code

First pass:  code generation → compilation → memoisation
Second pass (CPU and GPU work overlap):
  – CPU – allocate memory; link & configure kernel
  – GPU – copy host → device (asynchronously); parallel execution

Black-Scholes Option Pricing

[Benchmark chart: time in ms, log scale, versus number of options (1–9 million); series: Accelerate (w/o sharing), Accelerate (sharing cnd'), Accelerate (sharing cnd' and d), CUDA SDK.]

Sharing can be important
Tesla T10 (compute capability 1.3, 30 x 1.3GHz)

Sharing-sensitive Black-Scholes code

Cumulative normal distribution:

cnd :: Exp Float -> Exp Float
cnd d = let poly = horner coeff
            k    = 1.0 / (1.0 + 0.2316419 * abs d)
            cnd' = rsqrt2 * exp (-0.5*d*d) * poly k
        in
        d >* 0 ? (1 - cnd', cnd')

[The slide also lists the numeric coefficients bound to coeff.]
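cnd assumes a horner helper applied to a coefficient list coeff; one plausible definition (a sketch, not taken from the slide) evaluates the polynomial by Horner's rule, at any Num type and hence also at Exp Float:

horner :: Num a => [a] -> a -> a
horner coeff x = x * foldr1 madd coeff
  where madd a b = a + x * b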

Sparse-matrix vector multiplication

[Benchmark chart: GFLOPS/s (0–20) per test matrix — Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbour, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, Webbase, LP; series: Accelerate versus the CUSP formats COO, CSR (scalar), CSR (vector), DIA, ELL, HYB.]

Versus the highly-optimised CUSP library
Tesla T10 (compute capability 1.3, 30 x 1.3GHz)

Stocktake: flat data parallelism
‣ No nesting — code is not modular (compositional)
‣ No arrays of structured data
‣ Embedded variant (targeting GPUs etc.):
    • First-order, except for a fixed set of higher-order collective operations
    • No recursion

Nested data parallelism in Haskell
‣ Data Parallel Haskell: language extension (fully integrated) [EuroPar 2001]
‣ Data type of nested parallel arrays [:e:] — here, e can be any type
‣ Parallel evaluation semantics
‣ Array comprehensions & collective operations (mapP, scanP, etc.)
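For flavour, a hedged sketch in DPH notation (as in the DPH papers, not verbatim from the slides): a parallel dot product as an array comprehension over parallel arrays.

dotpP :: [:Float:] -> [:Float:] -> Float
dotpP xs ys = sumP [: x * y | x <- xs | y <- ys :]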

Parallel Quicksort

qsort :: Ord a => [:a:] -> [:a:]
qsort [::] = [::]
qsort xs   = let p       = xs !: 0
                 smaller = [: x | x <- xs, x < p :]
                 equal   = [: x | x <- xs, x == p :]
                 greater = [: x | x <- xs, x > p :]
                 sorted  = mapP qsort [: smaller, greater :]
             in
             sorted !: 0 +:+ equal +:+ sorted !: 1

Implementation [FSTTCS 2008]

Extension of the Glasgow Haskell Compiler (GHC)

Stage 1: The Vectoriser
  Transforms all nested into flat parallelism:
  f :: a -> b   ⟹   f^ :: [:a:] -> [:b:]

Stage 2: Library package DPH
  High-performance flat array library
  Communication and array fusion
  Radical re-ordering of computations
  PURITY IS ESSENTIAL
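What lifting produces, conceptually (an illustration, not GHC's actual output): the lifted f^ consumes and produces whole flat arrays, so mapP f xs becomes the single call f^ xs. For f x = x * x + 1, with ordinary lists standing in for flat parallel arrays:

-- Conceptual stand-in for f^: built entirely from flat collective
-- operations, which the DPH library runs in parallel and fuses.
fLifted :: [Float] -> [Float]
fLifted xs = zipWith (+) (zipWith (*) xs xs) (map (const 1) xs)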

Current Implementation
‣ Targeting multicore CPUs:
    2x Quad-Core Xeon = 8 cores (8 thread contexts)
    1x UltraSPARC T2 = 8 cores (64 thread contexts)
‣ GHC performs the vectorisation transformation on its Core intermediate language

Summary
‣ Purely functional programming simplifies parallel programming
‣ Data parallelism in Haskell is natural and flexible
‣ Nested parallelism is more expressive, but also much harder to implement
‣ Embedded languages for specialised architectures

Accelerate: https://github.com/mchakravarty/accelerate
DPH: http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell
Repa: http://hackage.haskell.org/package/repa

References

[EuroPar 2001] Nepal – Nested Data-Parallelism in Haskell. Chakravarty, Keller, Lechtchinsky & Pfannenstiel. In "Euro-Par 2001: Parallel Processing, 7th Intl. Euro-Par Conference", 2001.

[FSTTCS 2008] Harnessing the Multicores: Nested Data Parallelism in Haskell. Peyton Jones, Leshchinskiy, Keller & Chakravarty. In "IARCS Annual Conf. on Foundations of Software Technology & Theoretical Computer Science", 2008.

[ICFP 2010] Regular, Shape-Polymorphic, Parallel Arrays in Haskell. Keller, Chakravarty, Leshchinskiy, Peyton Jones & Lippmeier. In "ICFP 2010: The 15th ACM SIGPLAN Intl. Conf. on Functional Programming", 2010.

[DAMP 2011] Accelerating Haskell Array Codes with Multicore GPUs. Chakravarty, Keller, Lee, McDonell & Grover. In "Declarative Aspects of Multicore Programming", 2011.

Twitter: @TacticalGrace