DATA PARALLELISM IN HASKELL
Manuel M T Chakravarty, University of New South Wales
INCLUDES JOINT WORK WITH Gabriele Keller, Sean Lee, Roman Leshchinskiy, Ben Lippmeier, Trevor McDonell, Simon Peyton Jones
OUR GOAL: SIMPLIFY COMPUTE-INTENSIVE APPLICATIONS
MORE AND MORE PARALLELISM, BETTER POWER EFFICIENCY
[Figure: hardware trend; Intel i7 970 (6 cores, 12 threads), Intel MIC (32 cores, 128 threads), NVIDIA GF100 (512 cores, 24,576 threads)]
SOFTWARE NEEDS TO DEAL WITH PARALLELISM!
The essence of this talk
1. Parallel programming and functional programming are a natural fit
2. Data parallelism is simpler than control parallelism
3. Nested parallelism is more expressive than flat data parallelism
Concurrency ≠ parallelism

Concurrency (HARD!)
‣ Multiple interleaved threads of control
‣ All threads have effects on the world
‣ Non-determinism & concurrency control

Parallelism (CAN BE QUITE EASY!)
‣ Produce the same result, but faster
Don't we need concurrency to implement parallelism?
Sometimes, yes. Sometimes, no.
This should not concern the application programmer!
[Diagram: increasing abstraction, from Concurrency up through Parallelism to Data Parallelism]
We can implement a parallel program using explicit concurrency...
...and we can implement a web server in assembly language.
How does functional programming help?
Concurrency
‣ Multiple interleaved threads of control
‣ All threads have effects on the world
‣ Non-determinism & concurrency control (STM, message passing, locks, and so on)

Purely Functional Programming: NO EFFECTS, LESS CONSTRAINED EXECUTION ORDER
A simple example

processList list = (sort list, maximum list)
(processList is the function, list its argument, and the pair expression the function body)

Performs two tasks: sorting and determining the maximum
The tasks are executed in an arbitrary order
Returns a pair of the results
May even return the pair before sorting and maximum have completed!
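Because the function is pure, the two components can also be evaluated in parallel without changing the result. A minimal sketch of that idea (not from the slides), using the parallel package's Control.Parallel.Strategies; processListPar is a made-up name:

import Control.Parallel.Strategies (parTuple2, rdeepseq, withStrategy)
import Data.List (sort)

-- Evaluate both components of the pair in parallel; purity guarantees
-- the same result as the sequential version.
processListPar :: [Int] -> ([Int], Int)
processListPar list =
  withStrategy (parTuple2 rdeepseq rdeepseq) (sort list, maximum list)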
Typical collective operations (with a natural parallel interpretation)

map f [x1, ..., xn] = [f x1, ..., f xn]
zipWith f [x1, ..., xn] [y1, ..., yn] = [f x1 y1, ..., f xn yn]
foldl1 (⊕) [x1, ..., xn] = ((x1 ⊕ x2) ⊕ ···) ⊕ xn        (aka reduce)

✴ maximum = foldl1 max
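The parallel reading of foldl1 relies on ⊕ being associative: the reduction can then be evaluated as a balanced tree instead of a left-to-right chain. A minimal sketch of that idea (not from the slides; reduceTree and pairwise are made-up names):

-- Tree-shaped reduction of a non-empty list with an associative operator.
reduceTree :: (a -> a -> a) -> [a] -> a
reduceTree _ [x] = x
reduceTree f xs  = reduceTree f (pairwise xs)
  where
    pairwise (x:y:rest) = f x y : pairwise rest
    pairwise rest       = rest

-- e.g.  reduceTree max [3,1,4,1,5]  ==  maximum [3,1,4,1,5]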
Our secret weapons: Purity & Persistence!

Example (function, argument type, result type):
processList list = (sort list, maximum list)
processList :: [Int] -> ([Int], Int)

Purity: a function's result depends solely on its arguments
Persistence: don't mutate data structures in-place (encourages collective operations)
Haskell, specifically
Broad-spectrum, pure and lazy
Strongly typed, but types are optional
Mature language and tools & tons of libraries
Vibrant community: #haskell, [email protected], http://haskell.org/
By default pure: types track purity

Pure (no effects):
  Int
  processList :: [Int] -> ([Int], Int)

Impure (may have effects):
  IO Int
  readFile :: FilePath -> IO String

copyFile fn1 fn2 = do ...
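A minimal sketch of the copyFile example, assuming it simply reads the whole file and writes it back using the Prelude's readFile and writeFile (the bound variable is named contents because data is a Haskell keyword):

copyFile :: FilePath -> FilePath -> IO ()
copyFile fn1 fn2 = do
  contents <- readFile fn1     -- impure: IO String
  writeFile fn2 contents       -- impure: IO ()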
copyFile fn1 fn2 = do (sort list, maximum list) data Array DIM2 e -> Array DIM2 e rr: [[a00, a01, a-> a12],[a21, Array e a22, a23],[a31, a32, a33]] 02],[a 10, a11, DIM2 rr: [[b [b10,b20 ], [b20,b21]] 00,b01], arr mmMult brr rrTran:[[b00,b10,b20],[b01,b11,b21]] = sum (zipWith (*) arrRepl brrRepl) rep 2 rep 2 where rrRepl:[[[atrr 00,a01,a02],[a 01,a02]],[[[a 10,a11,a12],..],............] =00,a force (transpose2D brr) rrRepl:[[[barrRepl 00,b10,b20],[b01,b11,b21]],[[[b00,b10,b20],..],............] = replicate (Z :.All :.colsB :.All) arr brrRepl = replicate (Z :.rowsA :.All :.All) trr rep 4 (Z :.colsA :.rowsA) = extent arr esult :[[a00(Z *b00+a 01*b10+...,a 00*b01+..],[.,.],[.,.],[.,.]] :.colsB :.rowsB) = extent brr sum
trr brrRepl x 4 arr
Thursday, 1 September 11
arrRepl x 2
Matrix-matrix multiplication (size 1024x1024): GHC's code generation still leaves room for improvement
Sobel and Canny edge detection (100 iterations). OpenCV: high-performance computer vision library — uses SSE SIMD instructions!
Summary: Regular first-class arrays
Repa library: just another array API [ICFP 2010]
Multi-dimensional & shape polymorphic
Collective array operations, executed in parallel
But no arrays of arrays; i.e., no nested parallelism
But multicore CPUs only!
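Shape polymorphism in Repa is expressed through index types built from Z and (:.). A minimal sketch of how such indices look in use (the concrete sizes are made up for illustration):

import Data.Array.Repa (Z(..), (:.)(..), DIM2)

-- The extent (shape) of a 3x5 matrix:
matrixShape :: DIM2
matrixShape = Z :. 3 :. 5

-- An index into that matrix: row 2, column 4.
someIndex :: DIM2
someIndex = Z :. 2 :. 4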
General Purpose GPU Programming (GPGPU)
MODERN GPUS ARE FREELY PROGRAMMABLE
But no function pointers & limited recursion
Very Different Programming Model (Compared to multicore CPUs)
REGULAR ARCHITECTURE
Avoids deep pipelines, sophisticated caches, and so on
[Figure: GPU die, 24,576 threads]
✴ SIMD: groups of threads executing in lock step (warps)
✴ Latency hiding: excess parallelism covers main memory latency
✴ Thread divergence is expensive
✴ Memory access patterns need to be regular
[Chart: Dot Product; time (ms) vs. number of elements (million, 2 to 18); series: Accelerate and CUBLAS]
CPU (Xeon E5405 @ 2GHz): 71.3ms for 18M
Computation only, without CPU ⇄ GPU transfer
Challenges
Code must be massively data parallel
Control structures are limited
Limited function pointers
Limited recursion
Software-managed cache, memory-access patterns, etc.
Portability...
[Image: Tesla T10 GPU]
OTHER COMPUTE ACCELERATOR ARCHITECTURES
Goal: portable data parallelism
Data.Array.Accelerate [DAMP 2011]
Collective operations on multi-dimensional regular arrays ✓ massive data parallelism
Embedded DSL ✓ limited control structures
‣ Restricted control flow
‣ First-order GPU code
Generative approach based on combinator templates ✓ hand-tuned access patterns
Multiple backends ✓ portability
import Data.Array.Accelerate

Dot product

dotp :: Vector Float -> Vector Float -> Acc (Scalar Float)
dotp xs ys = let xs' = use xs
                 ys' = use ys
             in
             fold (+) 0 (zipWith (*) xs' ys')

Vector Float: a Haskell array
Acc (Scalar Float): an EDSL array, i.e. a description of array computations
use: lifts Haskell arrays into the EDSL — may trigger a host➙device transfer
fold and zipWith here are EDSL array computations
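To execute the computation, a backend's run function is applied to the Acc term. A minimal usage sketch, assuming the CUDA backend of that era (Data.Array.Accelerate.CUDA); the input values are made up for illustration:

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate (Acc, Vector, Scalar, Z(..), (:.)(..), fromList, use)
import qualified Data.Array.Accelerate.CUDA as CUDA

-- The dotp computation from the slide (unchanged, apart from qualifying
-- the names that clash with the Prelude):
dotp :: Vector Float -> Vector Float -> Acc (Scalar Float)
dotp xs ys = let xs' = use xs
                 ys' = use ys
             in  A.fold (+) 0 (A.zipWith (*) xs' ys')

main :: IO ()
main = do
  let xs = fromList (Z :. 5) [1, 2, 3, 4, 5]           :: Vector Float
      ys = fromList (Z :. 5) [0.5, 0.5, 0.5, 0.5, 0.5] :: Vector Float
  print (CUDA.run (dotp xs ys))   -- compile the EDSL term and run it on the GPU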
import Data.Array.Accelerate

Sparse-matrix vector multiplication

type SparseVector a = Vector (Int, a)             -- [0, 0, 6.0, 0, 7.0] ≈ [(2, 6.0), (4, 7.0)]
type SparseMatrix a = (Segments, SparseVector a)  -- [[10, 20], [], [30]] ≈ ([2, 0, 1], [10, 20, 30])

smvm :: Acc (SparseMatrix Float) -> Acc (Vector Float) -> Acc (Vector Float)
smvm (segd, smat) vec
  = let (inds, vals) = unzip smat
        vecVals  = backpermute (shape inds) (\i -> index1 (inds!i)) vec
        products = zipWith (*) vecVals vals
    in
    foldSeg (+) 0 products segd
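The segment descriptor drives a segmented reduction: foldSeg sums each row's products separately. A minimal list-based sketch of that semantics (not the Accelerate implementation; foldSegList is a made-up name):

-- Segmented fold over plain lists: one result per segment length.
foldSegList :: (a -> a -> a) -> a -> [a] -> [Int] -> [a]
foldSegList f z xs segd = go xs segd
  where
    go _  []     = []
    go ys (n:ns) = let (seg, rest) = splitAt n ys
                   in  foldl f z seg : go rest ns

-- e.g.  foldSegList (+) 0 [10,20,30] [2,0,1]  ==  [30, 0, 30]
-- (the example matrix [[10,20],[],[30]] multiplied with an all-ones vector)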
Architecture of Data.Array.Accelerate

[Diagram]
Frontend:
  Surface language ↓ reify & recover sharing (HOAS → de Bruijn) ↓ optimise (fusion)
Data: non-parametric array representation (unboxed arrays; array of tuples → tuple of arrays)
Multiple backends: CUDA.run, LLVM.run, FPGA.run, ...
CUDA backend, first pass (CPU and GPU work overlap):
  CPU: code generation ↓ compilation ↓ memoisation; allocate memory
  GPU: copy host → device (asynchronously)
Second pass:
  CPU: link & configure kernel
  GPU: parallel execution
map (\x -> x + 1) arr
  ↓ reify & HOAS → de Bruijn
Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) arr
  ↓ recover sharing (CSE or Observe)
  ↓ optimisation (fusion)
  ↓ code generation
__global__ void kernel (float *arr, int n) {...
  ↓ nvcc (via the cuda package)
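For illustration only (these are not Accelerate's actual internal types), a minimal Haskell sketch of a de Bruijn-indexed term like the one above, where ZeroIdx refers to the innermost lambda-bound variable:

-- Scalar expressions in de Bruijn form (illustrative subset).
data Exp = ZeroIdx                     -- the innermost bound variable
         | Const Float
         | PrimApp PrimFun (Exp, Exp)  -- apply a primitive to a pair of arguments

data PrimFun = Add | Mul

newtype Fun   = Lam Exp                -- one-argument function body
data AccTerm  = Map Fun String         -- the String names the input array, e.g. "arr"

-- The reified form of  map (\x -> x + 1) arr :
example :: AccTerm
example = Map (Lam (Add `PrimApp` (ZeroIdx, Const 1))) "arr"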
CUDA skeletons: zipWith.inl

extern "C" __global__ void zipWith
(
    TyOut        *d_out,
    const TyIn1  *d_in1,
    const TyIn0  *d_in0,
    const Int    length
)
{
    Int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const Int grid = blockDim.x * gridDim.x;

    for (; ix < length; ix += grid) {
        d_out[ix] = apply(d_in1[ix], d_in0[ix]);
    }
}
Skeleton instantiation: zipWith (*)

typedef float TyOut;
typedef float TyIn1;
typedef float TyIn0;

static inline __device__ TyOut apply(const TyIn1 x1, const TyIn0 x0)
{
    TyOut r = x1 * x0;
    return r;
}
#include <zipWith.inl>
Shapes

Types:
typedef int32_t               Ix;
typedef Ix                    DIM1;
typedef struct { Ix a1,a0; }  DIM2;

Functions:
int  dim(DIMn sh);                   // dimensionality
int  size(DIMn sh);                  // number of elements
Ix   toIndex(DIMn sh, DIMn ix);      // index into row-major format
DIMn fromIndex(DIMn sh, Ix ix);      // invert toIndex

Single skeleton per function (C++ templates & overloading)
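In Haskell terms, toIndex and fromIndex for DIM2 are plain row-major index arithmetic. A minimal sketch (not from the slides; the names are made up):

-- shape (height, width); index (row, col)
toIndex2 :: (Int, Int) -> (Int, Int) -> Int
toIndex2 (_, width) (row, col) = row * width + col

fromIndex2 :: (Int, Int) -> Int -> (Int, Int)
fromIndex2 (_, width) i = i `divMod` width   -- inverts toIndex2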
Free variables...

Excerpt from SMVM:
  backpermute (shape inds) (\i -> index1 (inds!i)) vec
Free array-valued variable (need to lift computations)

Backpermute skeleton prototype:

__global__ void backpermute
(
    ArrOut        d_out,
    const ArrIn0  d_in0,
    const DimOut  shOut,
    const DimIn0  shIn0
);

Fixed set of arguments! (We don't generate skeletons dynamically.)
... become textures

Backpermute skeleton instantiation:

texture<...> tex0;
...
typedef DIM1 DimOut;
typedef DIM1 DimIn0;

static inline __device__ DimIn0 project(const DimOut x0)
{
    DimIn0 r = tex1Dfetch(tex0, x0);
    return r;
}
#include <...>

Texture access need not be coalesced & is cached
Caching data transfers & skeleton instantiations

Memory table
‣ Associates Haskell arrays with device copies
‣ Transfer arrays only once

Kernel table
‣ Associates skeleton use with compiled binary
‣ Never re-compile a skeleton instance
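A minimal sketch of the kernel-table idea (hypothetical, not Accelerate's actual data structures): key compiled binaries by the generated source so each skeleton instance is compiled at most once.

import qualified Data.Map as Map

data CompiledKernel = CompiledKernel                  -- placeholder for a loaded GPU binary
type KernelTable    = Map.Map String CompiledKernel   -- keyed by generated source code

lookupOrCompile :: (String -> IO CompiledKernel)      -- how to compile (e.g. invoke nvcc)
                -> String -> KernelTable -> IO (CompiledKernel, KernelTable)
lookupOrCompile compile src table =
  case Map.lookup src table of
    Just k  -> return (k, table)                      -- cache hit: reuse the binary
    Nothing -> do k <- compile src                    -- cache miss: compile once...
                  return (k, Map.insert src k table)  -- ...and remember it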
The evaluator — executing array code

First pass (CPU and GPU work overlap):
‣ CPU: code generation ↓ compilation ↓ memoisation; allocate memory
‣ GPU: copy host → device (asynchronously)
Second pass:
‣ CPU: link & configure kernel
‣ GPU: parallel execution
[Chart: Black-Scholes Option Pricing; time (ms) vs. number of options (million, 1 to 9); series: Accelerate (w/o sharing), Accelerate (sharing cnd'), Accelerate (sharing cnd' and d), CUDA SDK]
Sharing can be important
Tesla T10 (compute capability 1.3, 30 x 1.3GHz)
Sharing-sensitive Black-Scholes code

Cumulative normal distribution:

cnd :: Exp Float -> Exp Float
cnd d = let poly = horner coeff
            k    = 1.0 / (1.0 + 0.2316419 * abs d)
            cnd' = rsqrt2 * exp (-0.5*d*d) * poly k
        in
        d >* 0 ? (1 - cnd', cnd')
[Chart: Sparse-matrix vector multiplication; GFLOPS/s (0 to 20) across test matrices (Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbour, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, Webbase, LP); series: Accelerate vs. CUSP formats COO, CSR (scalar), CSR (vector), DIA, ELL, HYB]
Versus highly-optimised CUSP library
Tesla T10 (compute capability 1.3, 30 x 1.3GHz)
Stocktake: flat data parallelism
No nesting — code is not modular (compositional)
No arrays of structured data
Embedded variant (targeting GPUs etc.):
‣ First-order, except for a fixed set of higher-order collective operations
‣ No recursion
Nested data parallelism in Haskell
Data Parallel Haskell: language extension (fully integrated) [EuroPar 2001]
Data type of nested parallel arrays [:e:] — here, e can be any type
Parallel evaluation semantics
Array comprehensions & collective operations (mapP, scanP, etc.)
Parallel Quicksort

qsort :: Ord a => [:a:] -> [:a:]
qsort [::] = [::]
qsort xs   = let p       = xs !: 0
                 smaller = [:x | x <- xs, x < p:]
                 ...
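The rest of the definition is elided above; a minimal sketch of how such a DPH quicksort is typically written (an assumption, not necessarily the slide's exact code). The two recursive calls are themselves expressed as a parallel array comprehension, which is exactly the nested parallelism this talk is about.

qsort :: Ord a => [:a:] -> [:a:]
qsort [::] = [::]
qsort xs   = let p       = xs !: 0
                 smaller = [:x | x <- xs, x < p:]
                 greater = [:x | x <- xs, x > p:]
                 equal   = [:x | x <- xs, x == p:]
                 sorted  = [:qsort ys | ys <- [:smaller, greater:]:]  -- both halves in parallel
             in
             (sorted !: 0) +:+ equal +:+ (sorted !: 1)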
Implementation [FSTTCS 2008]

Extension of the Glasgow Haskell Compiler (GHC)

Stage 1: The Vectoriser
Transforms all nested into flat parallelism
  f  :: a -> b
  f^ :: [:a:] -> [:b:]

Stage 2: Library package DPH
High-performance flat array library
Communication and array fusion
Radical re-ordering of computations (PURITY IS ESSENTIAL)
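Conceptually, the vectoriser derives a lifted version that works on whole parallel arrays at once. A made-up illustration in the slides' [: :] notation (not GHC's actual output; compiling it needs the ParallelArrays extension and the DPH libraries):

-- A scalar function...
f :: Int -> Int
f x = x + 1

-- ...and its lifted counterpart (written f^ on the slide):
fL :: [:Int:] -> [:Int:]
fL xs = [:x + 1 | x <- xs:]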
Current implementation: targeting multicore CPUs
GHC performs the vectorisation transformation on the Core IL
[Charts: 2x Quad-Core Xeon = 8 cores (8 thread contexts); 1x UltraSPARC T2 = 8 cores (64 thread contexts)]
Summary
Purely functional programming simplifies parallel programming
Data parallelism in Haskell is natural and flexible
Nested parallelism is more expressive, but also much harder to implement
Embedded languages for specialised architectures
Accelerate: https://github.com/mchakravarty/accelerate
DPH: http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell
Repa: http://hackage.haskell.org/package/repa
[EuroPar 2001] Nepal -- Nested Data-Parallelism in Haskell. Chakravarty, Keller, Lechtchinsky & Pfannenstiel. In "Euro-Par 2001: Parallel Processing, 7th Intl. Euro-Par Conference", 2001.
[FSTTCS 2008] Harnessing the Multicores: Nested Data Parallelism in Haskell. Peyton Jones, Leshchinskiy, Keller & Chakravarty. In "IARCS Annual Conf. on Foundations of Software Technology & Theoretical Computer Science", 2008.
[ICFP 2010] Regular, Shape-Polymorphic, Parallel Arrays in Haskell. Keller, Chakravarty, Leshchinskiy, Peyton Jones & Lippmeier. In "ICFP 2010: The 15th ACM SIGPLAN Intl. Conf. on Functional Programming", 2010.
[DAMP 2011] Accelerating Haskell Array Codes with Multicore GPUs. Chakravarty, Keller, Lee, McDonell & Grover. In "Declarative Aspects of Multicore Programming", 2011.
Twitter: @TacticalGrace