Introduction to Parallel Algorithms
Cilk+

Dynamic Multithreading
▷ Also known as the fork-join model
▷ Shared memory, multicore
▷ Cormen et al., 3rd edition, Chapter 27
Nested Parallelism
▷ Spawn a subroutine, carry on with other work.
▷ Similar to fork in POSIX.
▷ The multithreaded model is based on Cilk+, available at svn://gcc.gnu.org/svn/gcc/branches/cilkplus
▷ Programmer specifies possible parallelism
▷ Runtime system takes care of mapping to OS threads
▷ Cilk+ contains several more features than our model, e.g. parallel vector and array operations.
▷ Similar primitives are available in java.util.concurrent
Parallel Loop
▷ Iterations of a for loop can execute in parallel.
▷ Like OpenMP
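Since the notes point to java.util.concurrent, here is an illustrative parallel loop using Java's parallel streams (the class name is mine, not from the course examples). Each iteration writes only its own array slot, so the iterations are independent and the runtime is free to run them concurrently:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelLoop {
    public static void main(String[] args) {
        long[] squares = new long[8];
        // Independent iterations: each writes a distinct slot, so the
        // runtime may execute them concurrently (cf. OpenMP's parallel for).
        IntStream.range(0, squares.length)
                 .parallel()
                 .forEach(i -> squares[i] = (long) i * i);
        System.out.println(Arrays.toString(squares)); // [0, 1, 4, 9, 16, 25, 36, 49]
    }
}
```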
Writing parallel (pseudo)-code: Fibonacci Example

Keywords
▷ parallel: run the loop (potentially) concurrently
▷ spawn: run the procedure (potentially) concurrently
▷ sync: wait for all spawned children to complete
Serialization
▷ remove keywords
▷ serialized (correct) parallel code is correct serial code
▷ adding parallel keywords to correct serial code might make it incorrect:
    ▷ missing sync
    ▷ loop iterations not independent
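The second pitfall (loop iterations that are not independent) can be sketched in Java. This illustrative class (not from the course examples) contrasts a correct serial sum, a racy naive parallelization, and a repair using java.util.concurrent:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

public class LoopDependence {
    public static void main(String[] args) {
        int n = 100_000;

        // Serial loop: correct by construction.
        long serial = 0;
        for (int i = 1; i <= n; i++) serial += i;

        // Naive parallelization: every iteration updates the same cell, so
        // iterations are NOT independent and updates can be lost (a data race).
        long[] racy = {0};
        IntStream.rangeClosed(1, n).parallel().forEach(i -> racy[0] += i);
        // racy[0] is nondeterministic; it need not equal the serial sum.

        // One repair: make the shared update atomic (a true reduction such as
        // IntStream.rangeClosed(1, n).parallel().sum() would be better still).
        AtomicLong atomic = new AtomicLong();
        IntStream.rangeClosed(1, n).parallel().forEach(atomic::addAndGet);

        System.out.println(serial + " " + atomic.get()); // both 5000050000
    }
}
```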
function Fib(n)
    if n ≤ 1 then
        return n
    else
        x = spawn Fib(n − 1)
        y = Fib(n − 2)
        sync
        return x + y
    end if
end function
Code in Java, Clojure and Racket available from http://www.cs.unb.ca/~bremner/teaching/cs3383/examples
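The linked examples are not reproduced here; as a sketch, the spawn/sync pseudocode maps directly onto java.util.concurrent's fork-join framework, where fork() plays the role of spawn and join() the role of sync:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class Fib extends RecursiveTask<Long> {
    private final int n;
    Fib(int n) { this.n = n; }

    @Override
    protected Long compute() {
        if (n <= 1) return (long) n;
        Fib x = new Fib(n - 1);
        x.fork();                          // spawn Fib(n-1)
        long y = new Fib(n - 2).compute(); // carry on with Fib(n-2) ourselves
        return x.join() + y;               // sync: wait for the spawned child
    }

    public static void main(String[] args) {
        System.out.println(new ForkJoinPool().invoke(new Fib(10))); // prints 55
    }
}
```

Computing Fib(n−2) with a direct compute() call rather than a second fork keeps the current worker thread busy instead of idling at the join.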
Computation DAG

Strands: sequences of instructions containing no parallel, spawn, return from spawn, or sync. The nodes of the DAG are strands; each call to Fib is split into strands at the spawn and the sync:

function Fib(n)
    if n ≤ 1 then
        return n
    else
        x = spawn Fib(n − 1)
        y = Fib(n − 2)
        sync
        return x + y
    end if
end function

Edges:
▷ down edges: spawn
▷ up edges: return
▷ horizontal edges: sequential
critical path: longest path in DAG
(Figure clrs27_2 in text)

Work and Speedup

T1: work, sequential time.
Tp: time on p processors.

Work Law:
    Tp ≥ T1/p
and therefore
    speedup := T1/Tp ≤ p
For example, if T1 = 17 then on p = 2 processors T2 ≥ 8.5, so the speedup is at most 2.
Parallelism

span: weighted length of the critical path ≡ lower bound on time
T∞: span, time given unlimited processors.
Tp: time on p processors.

We could idle processors, so
    Tp ≥ T∞    (1)

Best possible speedup:
    parallelism := T1/T∞ ≥ T1/Tp = speedup

Span and Parallelism Example
Assume strands are unit cost.
▷ T1 = 17
▷ T∞ = 8
▷ Parallelism = 17/8 = 2.125 for this input size.
(Figure clrs27_2 in text)
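The slide's numbers can be checked mechanically. Assuming unit-cost strands and three strands per internal Fib instance (before the spawn, around the call to Fib(n−2), after the sync), work and span satisfy simple recurrences; the span takes the longer of the two paths through an instance because the spawned call runs alongside the continuation. This strand accounting is my assumption, chosen to match the figure, not code from the course:

```java
public class WorkSpan {
    // Work: both recursive calls are executed somewhere, plus this instance's
    // 3 strands (assumed per-instance cost; a base case is a single strand).
    public static long work(int n) {
        return n <= 1 ? 1 : work(n - 1) + work(n - 2) + 3;
    }

    // Span: the spawned Fib(n-1) runs in parallel with the called Fib(n-2),
    // so take the longer of the two paths through this instance.
    public static long span(int n) {
        return n <= 1 ? 1 : Math.max(2 + span(n - 1), 3 + span(n - 2));
    }

    public static void main(String[] args) {
        System.out.println(work(4) + " " + span(4) + " " + (double) work(4) / span(4));
        // 17 8 2.125, matching T1 = 17, T∞ = 8, parallelism = 2.125
    }
}
```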
Composing span and work

For subcomputations A and B:
▷ series (A + B, run A then B): T∞(A + B) = T∞(A) + T∞(B)
▷ parallel (A ∥ B): T∞(A ∥ B) = max(T∞(A), T∞(B))
▷ series or parallel, the work adds: T1 = T1(A) + T1(B)

Work of Parallel Fibonacci

Write T(n) for T1 on input n:
    T(n) = T(n − 1) + T(n − 2) + Θ(1)
Let φ ≈ 1.62 be the positive solution to
    φ² = φ + 1
We can show by induction that
    T(n) ∈ Θ(φⁿ)
For the upper bound, substitute the inductive hypothesis T(k) ≤ aφᵏ − b:
    T(n) ≤ a(φ^(n−1) + φ^(n−2)) − 2b + Θ(1)
         = aφ^(n−2)(φ + 1) − 2b + Θ(1)
         = aφⁿ − 2b + Θ(1)
         ≤ aφⁿ − b,
for b large enough to absorb the Θ(1) term.
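The Θ(φⁿ) bound can also be sanity-checked numerically: taking the Θ(1) term to be exactly 1, the ratio T(n)/φⁿ should settle to a constant as n grows. A small sketch (the constant 1 and the class name are illustrative choices):

```java
public class FibWork {
    public static void main(String[] args) {
        int n = 40;
        double[] t = new double[n + 1];
        t[0] = t[1] = 1;                    // T(0) = T(1) = 1
        for (int i = 2; i <= n; i++)
            t[i] = t[i - 1] + t[i - 2] + 1; // T(n) = T(n-1) + T(n-2) + 1

        double phi = (1 + Math.sqrt(5)) / 2; // phi^2 = phi + 1, phi ≈ 1.618
        // If T(n) ∈ Θ(phi^n), consecutive ratios T(n)/phi^n converge.
        System.out.printf("%.6f %.6f%n",
                t[n - 1] / Math.pow(phi, n - 1),
                t[n] / Math.pow(phi, n));
    }
}
```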