Introduction to Parallel Algorithms
Cilk+

Dynamic Multithreading
▷ Also known as the fork-join model
▷ Shared memory, multicore
▷ Cormen et al., 3rd edition, Chapter 27
Nested Parallelism
▷ Spawn a subroutine, carry on with other work.
▷ Similar to fork in POSIX.
▷ The multithreaded model is based on Cilk+, available at svn://gcc.gnu.org/svn/gcc/branches/cilkplus
▷ Programmer specifies possible parallelism
▷ Runtime system takes care of mapping to OS threads
▷ Cilk+ contains several more features than our model, e.g. parallel vector and array operations.
▷ Similar primitives are available in java.util.concurrent
Parallel Loop
▷ Iterations of a for loop can execute in parallel.
▷ Like OpenMP
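Since the notes point to java.util.concurrent, here is an illustrative parallel loop using Java's parallel streams (the class name is mine, not from the course examples). Each iteration writes only its own array slot, so the iterations are independent and the runtime is free to run them concurrently:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelLoop {
    public static void main(String[] args) {
        long[] squares = new long[8];
        // Independent iterations: each writes a distinct slot, so the
        // runtime may execute them concurrently (cf. OpenMP's parallel for).
        IntStream.range(0, squares.length)
                 .parallel()
                 .forEach(i -> squares[i] = (long) i * i);
        System.out.println(Arrays.toString(squares)); // [0, 1, 4, 9, 16, 25, 36, 49]
    }
}
```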
Writing parallel (pseudo)-code: Fibonacci Example

Keywords
▷ parallel: run the loop (potentially) concurrently
▷ spawn: run the procedure (potentially) concurrently
▷ sync: wait for all spawned children to complete
Serialization
▷ remove keywords
▷ serialized (correct) parallel code is correct serial code
▷ adding parallel keywords to correct serial code might make it incorrect:
    ▷ missing sync
    ▷ loop iterations not independent
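The second pitfall (loop iterations that are not independent) can be sketched in Java. This illustrative class (not from the course examples) contrasts a correct serial sum, a racy naive parallelization, and a repair using java.util.concurrent:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

public class LoopDependence {
    public static void main(String[] args) {
        int n = 100_000;

        // Serial loop: correct by construction.
        long serial = 0;
        for (int i = 1; i <= n; i++) serial += i;

        // Naive parallelization: every iteration updates the same cell, so
        // iterations are NOT independent and updates can be lost (a data race).
        long[] racy = {0};
        IntStream.rangeClosed(1, n).parallel().forEach(i -> racy[0] += i);
        // racy[0] is nondeterministic; it need not equal the serial sum.

        // One repair: make the shared update atomic (a true reduction such as
        // IntStream.rangeClosed(1, n).parallel().sum() would be better still).
        AtomicLong atomic = new AtomicLong();
        IntStream.rangeClosed(1, n).parallel().forEach(atomic::addAndGet);

        System.out.println(serial + " " + atomic.get()); // both 5000050000
    }
}
```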
function Fib(n)
    if n ≤ 1 then
        return n
    else
        x = spawn Fib(n − 1)
        y = Fib(n − 2)
        sync
        return x + y
    end if
end function
Code in Java, Clojure and Racket available from http://www.cs.unb.ca/~bremner/teaching/cs3383/examples
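The linked examples are not reproduced here; as a sketch, the spawn/sync pseudocode maps directly onto java.util.concurrent's fork-join framework, where fork() plays the role of spawn and join() the role of sync:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class Fib extends RecursiveTask<Long> {
    private final int n;
    Fib(int n) { this.n = n; }

    @Override
    protected Long compute() {
        if (n <= 1) return (long) n;
        Fib x = new Fib(n - 1);
        x.fork();                          // spawn Fib(n-1)
        long y = new Fib(n - 2).compute(); // carry on with Fib(n-2) ourselves
        return x.join() + y;               // sync: wait for the spawned child
    }

    public static void main(String[] args) {
        System.out.println(new ForkJoinPool().invoke(new Fib(10))); // prints 55
    }
}
```

Computing Fib(n−2) with a direct compute() call rather than a second fork keeps the current worker thread busy instead of idling at the join.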
Computation DAG

Strands: sequences of instructions containing no parallel, spawn, return from spawn, or sync. The nodes of the DAG are strands; each call to Fib is split into strands at the spawn and the sync:

function Fib(n)
    if n ≤ 1 then
        return n
    else
        x = spawn Fib(n − 1)
        y = Fib(n − 2)
        sync
        return x + y
    end if
end function

Edges:
▷ down edges: spawn
▷ up edges: return
▷ horizontal edges: sequential
critical path: longest path in DAG
(Figure clrs27_2 in text)

Work and Speedup

T1: work, sequential time.
Tp: time on p processors.

Work Law:
    Tp ≥ T1/p
and therefore
    speedup := T1/Tp ≤ p
For example, if T1 = 17 then on p = 2 processors T2 ≥ 8.5, so the speedup is at most 2.
Parallelism

span: weighted length of the critical path ≡ lower bound on time
T∞: span, time given unlimited processors.
Tp: time on p processors.

We could idle processors, so
    Tp ≥ T∞    (1)

Best possible speedup:
    parallelism := T1/T∞ ≥ T1/Tp = speedup

Span and Parallelism Example
Assume strands are unit cost.
▷ T1 = 17
▷ T∞ = 8
▷ Parallelism = 17/8 = 2.125 for this input size.
(Figure clrs27_2 in text)
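The slide's numbers can be checked mechanically. Assuming unit-cost strands and three strands per internal Fib instance (before the spawn, around the call to Fib(n−2), after the sync), work and span satisfy simple recurrences; the span takes the longer of the two paths through an instance because the spawned call runs alongside the continuation. This strand accounting is my assumption, chosen to match the figure, not code from the course:

```java
public class WorkSpan {
    // Work: both recursive calls are executed somewhere, plus this instance's
    // 3 strands (assumed per-instance cost; a base case is a single strand).
    public static long work(int n) {
        return n <= 1 ? 1 : work(n - 1) + work(n - 2) + 3;
    }

    // Span: the spawned Fib(n-1) runs in parallel with the called Fib(n-2),
    // so take the longer of the two paths through this instance.
    public static long span(int n) {
        return n <= 1 ? 1 : Math.max(2 + span(n - 1), 3 + span(n - 2));
    }

    public static void main(String[] args) {
        System.out.println(work(4) + " " + span(4) + " " + (double) work(4) / span(4));
        // 17 8 2.125, matching T1 = 17, T∞ = 8, parallelism = 2.125
    }
}
```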
Composing span and work

For subcomputations A and B:
▷ series (A + B, run A then B): T∞(A + B) = T∞(A) + T∞(B)
▷ parallel (A ∥ B): T∞(A ∥ B) = max(T∞(A), T∞(B))
▷ series or parallel, the work adds: T1 = T1(A) + T1(B)

Work of Parallel Fibonacci

Write T(n) for T1 on input n:
    T(n) = T(n − 1) + T(n − 2) + Θ(1)
Let φ ≈ 1.62 be the positive solution to
    φ² = φ + 1
We can show by induction that
    T(n) ∈ Θ(φⁿ)
For the upper bound, substitute the inductive hypothesis T(k) ≤ aφᵏ − b:
    T(n) ≤ a(φ^(n−1) + φ^(n−2)) − 2b + Θ(1)
         = aφ^(n−2)(φ + 1) − 2b + Θ(1)
         = aφⁿ − 2b + Θ(1)
         ≤ aφⁿ − b,
for b large enough to absorb the Θ(1) term.
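The Θ(φⁿ) bound can also be sanity-checked numerically: taking the Θ(1) term to be exactly 1, the ratio T(n)/φⁿ should settle to a constant as n grows. A small sketch (the constant 1 and the class name are illustrative choices):

```java
public class FibWork {
    public static void main(String[] args) {
        int n = 40;
        double[] t = new double[n + 1];
        t[0] = t[1] = 1;                    // T(0) = T(1) = 1
        for (int i = 2; i <= n; i++)
            t[i] = t[i - 1] + t[i - 2] + 1; // T(n) = T(n-1) + T(n-2) + 1

        double phi = (1 + Math.sqrt(5)) / 2; // phi^2 = phi + 1, phi ≈ 1.618
        // If T(n) ∈ Θ(phi^n), consecutive ratios T(n)/phi^n converge.
        System.out.printf("%.6f %.6f%n",
                t[n - 1] / Math.pow(phi, n - 1),
                t[n] / Math.pow(phi, n));
    }
}
```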