From Concurrency to Parallelism: an illustrated guide to multi-core parallelism in Clojure

David Edgar Liebke

clojure-conj, Durham, NC, 23 October 2010

summary

Concurrency is commonly mistaken for parallelism, but the two are distinct concepts. Concurrency is concerned with managing access to shared state from different threads, whereas parallelism is concerned with utilizing multiple processors/cores to improve the performance of a computation. Clojure has successfully improved the state of concurrent programming with its many concurrency primitives, and now the goal is to do the same for multi-core parallel programming by introducing new parallel processing features that work with Clojure's existing data structures. Clojure's original parallel processing function, pmap, will soon be joined by pvmap and pvreduce, based on JSR 166 and Doug Lea's Fork/Join Framework. From these building blocks, and the fjvtree function that underlies pvmap and pvreduce, higher-level parallel functions can be developed. This talk will provide an illustrated walkthrough of the algorithms underlying pmap, pvmap, and pvreduce, comparing their strengths, weaknesses, and performance characteristics, and will conclude with an example of using these primitives to write a parallel version of Clojure's filter function.


outline

pmap: algorithm, weaknesses, performance characteristics
chunking: chunked performance characteristics
fork-join: deques, basic algorithm
persistent-vector: overview
fjvtree, pvmap, and pvreduce: algorithm, performance characteristics
pvfilter: implementation, performance characteristics


pmap: lazy meets parallel

pmap

processors: 2, threads: 4

The pmap function is semi-lazy, meaning it tries to keep just enough threads running so that all the processors are utilized, but it doesn't eagerly process its entire input. It does this by using futures to invoke the function being mapped on just enough input values, but no more.

[Animated diagram: four futures run computations (comp) at a time; as each result is consumed, another future is started on the next input value. Legend: completed work, current threads.]
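For reference, a simplified sketch of this scheme, close in spirit to (though not a verbatim copy of) clojure.core/pmap's single-collection case: one future per element is created lazily, and the output sequence stays (+ 2 n-cpus) futures ahead of the element currently being dereferenced.

(defn pmap-sketch [f coll]
  (let [n    (+ 2 (.availableProcessors (Runtime/getRuntime)))
        rets (map #(future (f %)) coll)                   ;; one future per element, created lazily
        step (fn step [[x & xs :as vs] fs]
               (lazy-seq
                 (if (seq fs)
                   (cons (deref x) (step xs (rest fs)))   ;; realizing (rest fs) launches one more future
                   (map deref vs))))]                     ;; tail: no more futures left to launch
    (step rets (drop n rets))))

(pmap-sketch inc (range 10))   ;; => (1 2 3 4 5 6 7 8 9 10)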

pmap: when lazy isn't parallel

pmap

processors: 2, threads: 4

Because pmap is semi-lazy, the process consuming its output must do so at a faster rate than the process producing it, otherwise all the processors won't be utilized.

[Animated diagram: a slow consumer drains results one at a time, so only one or two computations (comp) are in flight and the processors sit mostly idle. Legend: completed work, current threads.]
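A quick way to see this is to consume pmap's output more slowly than the mapped function produces it; a sketch (the work function and the millisecond figures are illustrative, not from the talk):

(defn work [x]
  (Thread/sleep 20)        ;; roughly 20 ms of simulated work per element
  (* x x))

;; fast consumer: wall time is roughly (items * 20 ms) divided by the core count
(time (dorun (pmap work (range 64))))

;; slow consumer: each deref is followed by 50 ms of consumer-side work,
;; so throughput collapses to the consumer's rate regardless of core count
(time (doseq [r (pmap work (range 64))]
        (Thread/sleep 50)))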

pmap: uneven loads

pmap

processors: 2, threads: 4

If one or more processors are returning results at a slower rate than the others, there is no way to redistribute the load. The more responsive processors will complete their work and sit idle until the slowest completes and its result can be consumed.

[Animated diagram: one long-running computation (comp) blocks the output sequence; the other threads finish their items and idle until the straggler's result is consumed. Legend: completed work, current threads.]
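An illustrative straggler workload (the timings are made up for the example): every tenth element takes far longer, and because pmap hands out work in input order and its results are consumed in order, the idle core cannot take over the cheap elements queued behind the expensive one.

(defn uneven-work [x]
  (Thread/sleep (if (zero? (mod x 10)) 500 10))   ;; occasional expensive element
  x)

(time (dorun (pmap uneven-work (range 40))))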

pmap performance characteristics

pmap

Compared to map (2 cores). Results are the median of 50 samples, where a test function was pmapped over a vector of 100 values, broken out by the per-element cost of the test function: Tf < 0.05 msec, 0.05 msec < Tf < 0.1 msec, and Tf > 0.1 msec.

[Chart: relative performance of pmap vs map for each Tf band.]

pmap: chunky style

pmap

processors: 2, threads: 4, chunk size: 32

One strategy for handling situations where the consuming process cannot keep up with the producing process, and hence under-utilizes the processors, is to partition the input data into chunks and pass those to pmap instead of individual values.

pmap over chunks:

(pmap (fn [chunk] (map f chunk)) (partition 32 coll))

[Animated diagram: each future now covers a 32-element chunk, so the producer stays ahead of the consumer and all four threads stay busy. Legend: completed work, current threads.]
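A reusable version of that idiom, as a sketch (chunked-pmap is not a clojure.core function): note the doall, which forces each chunk's work inside its own future; without it, the lazy (map f chunk) would be realized later, on the consuming thread, defeating the point of the chunking.

(defn chunked-pmap [f n coll]
  (apply concat
         (pmap (fn [chunk] (doall (map f chunk)))   ;; force the chunk on the worker thread
               (partition-all n coll))))            ;; partition-all keeps the final partial chunk

(chunked-pmap inc 32 (range 512))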

chunked pmap

Compared to map (2 cores). Results are the median of 500 samples, where a test function (duration < 0.005 msec) was pmapped over chunked vectors of 512 values, broken out by chunk size: chunk < 10, 10 < chunk < 50, and chunk > 50.

[Chart: relative performance of chunked pmap vs map for each chunk-size band.]

fork-join parallelism: divide and conquer

deque: double-ended queue

A deque (pronounced "deck") is a double-ended queue, where values can be pushed onto and popped off the front, or taken from the back. The Fork-Join algorithm systematically divides a job into tasks that are pushed onto the deque, resulting in the largest tasks being located at the back, which in turn improves the efficiency of its work-stealing algorithm.

[Animated diagram: the owning worker repeatedly splits a task and pushes the halves onto the front of its deque, then pops small tasks off the front to compute them, while an idle worker steals (takes) the large task sitting at the back.]

fork-join: basic algorithm

fork-join

workers: 4, branching factor: 2, sequential threshold: 256

Fork-Join is a divide-and-conquer algorithm that iteratively divides a job into smaller and smaller tasks, placing them on a deque, until the size of the current task is below a configured threshold.

Once each worker's current task is smaller than the configured threshold, it will begin the intended computation. Once the computation is completed for the current task, the worker will pop another task off of its deque and perform the computation on it. After the computation has been completed for at least two tasks, the results can be combined with a join function.

As one worker is processing its tasks, another worker can steal a task from the back of its deque. If one of the workers falls behind, another worker can take tasks from its deque. And so on, until the job is complete.

[Animated diagram: a 2048-element job is split into 1024- and 512-element tasks pushed onto the worker deques; idle workers take large tasks from the back of busy workers' deques; each worker computes its 256-element tasks, pops the next, and joins completed results, until a single result for the whole job remains. Legend: worker deques, current tasks, completed tasks.]
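A minimal Clojure sketch of this divide-and-conquer shape, assuming Java 7+'s java.util.concurrent.ForkJoinPool (the talk's references point at Doug Lea's jsr166y preview of essentially the same API); the task sums a vector, splitting until the sequential threshold is reached.

(import '[java.util.concurrent ForkJoinPool RecursiveTask])

(def ^ForkJoinPool fj-pool (ForkJoinPool.))

(defn sum-task [v threshold]
  (proxy [RecursiveTask] []
    (compute []
      (if (<= (count v) threshold)
        (reduce + v)                               ;; below the threshold: compute sequentially
        (let [half  (quot (count v) 2)
              left  (sum-task (subvec v 0 half) threshold)
              right (sum-task (subvec v half) threshold)]
          (.fork left)                             ;; push the left half onto this worker's deque
          (+ (.invoke right)                       ;; work on the right half ourselves
             (.join left)))))))                    ;; join the left half (or steal it back)

(.invoke fj-pool (sum-task (vec (range 1 129)) 32))   ;; => 8256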

persistent vector: parallelism with existing data structures

persistent-vector

A PersistentVector contains a root and a tail; the tail is an array that can contain up to 32 Object references, and the root is a Node that can contain up to 32 child Nodes.

Values are added to the tail until the tail is full; then a new Node is created, containing the 32 Object references from the tail, and inserted as a child of the root. Once the root is full, a new root Node is created and the existing one is added as a child, and so on.

[Diagram sequence showing the tree growing: count 0 (shift 5), count 32, count 33, count 1025, count 1057 (shift 10), count 2049, count 2081, count 3073, count 3105, count 4097. Legend: root node, tail nodes, obj refs, selected.]
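The index arithmetic behind this 32-way trie: each level of the tree consumes 5 bits of the index, starting from the vector's current shift. This sketch ignores the tail optimization, and trie-path is a hypothetical helper for illustration, not part of Clojure's API.

(defn trie-path
  "Returns the sequence of 5-bit child indices used to reach element i."
  [i shift]
  (loop [level shift, path []]
    (if (neg? level)
      path
      (recur (- level 5)
             (conj path (bit-and (bit-shift-right i level) 0x1f))))))

(trie-path 1000 5)    ;; => [31 8]    child 31 under the root, slot 8 in that leaf
(trie-path 2050 10)   ;; => [2 0 2]   one more level once the shift grows to 10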

fork-join on persistent vectors: fjvtree, pvmap, and pvreduce

fjvtree

workers: 4, branching factor: 32, sequential threshold: 32

The core Fork-Join algorithm in Clojure is implemented in the fjvtree function, which uses the underlying tree structure of PersistentVector to break jobs into tasks.

fork-join based map and reduce on vectors:

(pvmap f v)
(pvreduce f v)

implemented with:

(fjvtree v combine-fn leaf-fn)
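Semantically, the intent (as the talk presents it) is that these produce the same results as their sequential counterparts; and because leaf results are themselves combined by a further reduce, the function handed to pvreduce should be associative:

(= (pvmap inc (vec (range 1024)))  (vec (map inc (range 1024))))   ;; intended => true
(= (pvreduce + (vec (range 1024))) (reduce + (range 1024)))        ;; intended => true, + is associative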

[Animated diagram: fjvtree pushes the job onto a worker's deque and splits it along the vector's tree structure, first into the root's 1024-element children and then into 32-element leaf nodes, which are pushed onto the worker deques as tasks.]

leaf-fn for pvmap:

(fn [^PersistentVector$Node node]
  (new-node (amap (.array node) i a
                  (f (aget a i)))))          ;; map f across the node's 32-element array into a new Node

[Animated diagram: each 32-element leaf task is computed (comp) by applying the leaf-fn.]

leaf-fn for pvreduce:

(fn [^PersistentVector$Node node]
  (let [a (.array node)]
    (loop [ret (aget a 0) i (int 1)]         ;; seed with the first element of the leaf
      (if (< i 32)
        (recur (f ret (aget a i)) (inc i))   ;; fold f across the remaining 31 elements
        ret))))

[Animated diagram: each leaf task reduces its 32 elements to a single value.]

[Animated diagram, continued: workers take leaf tasks from the back of each other's deques when idle, pop and compute (comp) their own 32-element tasks, push per-leaf results back, and join adjacent completed leaves; this repeats across all of the vector's leaves. Legend: worker deques, current tasks, completed tasks.]

combine-fn for pvmap:

(fn [coll] (new-node (to-array coll)))    ;; join child results into a new parent Node

[Animated diagram: completed leaf results are joined back into parent nodes.]

combine-fn for pvreduce:

(fn [coll] (reduce f coll))               ;; join child results by reducing them with f again

[Animated diagram: completed leaf results are joined by a further reduce.]

[Animated diagram: joins continue up the tree until a single result for the whole job remains.]
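A plain-Clojure sketch of this leaf/combine decomposition, sequential and purely to show the shape; fjvtree would run the leaf step on fork/join workers, and new-node is internal to PersistentVector, so this sketch uses partition-all instead of the trie's Node objects (vtree-map is a name invented for the example):

(defn vtree-map [f v]
  (let [leaves  (partition-all 32 v)                          ;; stand-in for the trie's 32-wide leaves
        leaf-fn (fn [leaf] (mapv f leaf))                     ;; pvmap's per-leaf work
        combine (fn [results] (vec (apply concat results)))]  ;; pvmap's join, flattened back into a vector
    (combine (map leaf-fn leaves))))

(= (vtree-map inc (vec (range 100))) (mapv inc (range 100)))   ;; => true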

pvmap performance characteristics

pvmap

Compared to map (2 cores). Results are the median of 50 samples, where a test function was pvmapped over a vector of 100 values, broken out by the per-element cost of the test function: Tf < 0.05 msec, 0.05 msec < Tf < 0.1 msec, and Tf > 0.1 msec.

[Chart: relative performance of pvmap vs map for each Tf band.]

pvmap vs pmap

Compared to map (2 cores). Results are the median of 50 samples, where a test function was p*mapped over a vector of 100 values, broken out by the per-element cost of the test function: Tf < 0.05 msec, 0.05 msec < Tf < 0.1 msec, and Tf > 0.1 msec.

[Chart: relative performance of pvmap and pmap vs map for each Tf band.]

pvreduce performance characteristics

pvreduce

Compared to reduce (2 cores). Results are the median of 50 samples, where a test function was pvreduced over a vector of 100 values, broken out by the per-element cost of the test function: Tf < 0.05 msec, 0.05 msec < Tf < 0.1 msec, and Tf > 0.1 msec.

[Chart: relative performance of pvreduce vs reduce for each Tf band.]

pvreduce examples: implementing pvfilter

parallel sum, using pvreduce

(pvreduce + [1 2 3 4 5 6 7 8 9 .. 128])

(reduce + [(reduce + [1 2 3 4 5 .. 32])
           (reduce + [33 34 35 .. 64])
           (reduce + [65 66 67 68 .. 96])
           (reduce + [97 98 99 100 .. 128])])

(reduce + [528 1552 2576 3600])

8256
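The same divide-and-combine arithmetic, written sequentially as a quick check of the numbers above:

(reduce + (map #(reduce + %) (partition 32 (range 1 129))))   ;; => 8256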

parallel filter, using pvreduce

(defn pvfilter [pred v]
  (letfn [(filt [v x]
            (if (pred x) (conj v x) v))]
    (pvreduce filt [] v)))

(pvfilter even? [1 2 3 4 5 6 7 8 9 .. 128])

(reduce filt [(reduce filt [] [1 2 3 4 .. 32])
              (reduce filt [] [33 34 35 36 .. 64])
              (reduce filt [] [65 66 67 68 .. 96])
              (reduce filt [] [97 98 99 100 .. 128])])

(reduce filt [[2 4 6 8 .. 32]
              [34 36 38 40 .. 64]
              [66 68 70 72 .. 96]
              [98 100 102 104 .. 128]])

java.lang.ClassCastException: clojure.lang.PersistentVector cannot be cast to java.lang.Number

The combining step hands whole leaf-result vectors to filt, which passes them to the predicate, hence the ClassCastException.

parallel filter, using pvreduce

(defn pvfilter [pred v]
  (letfn [(par-filt [v x]
            (cond (vector? x) (apply reduce conj v x)   ;; x is a leaf result: splice it in
                  (pred x)    (conj v x)                ;; x is an element: keep it if pred holds
                  :else       v))]
    (pvreduce par-filt [] v)))

(pvfilter even? [1 2 3 4 5 6 7 8 9 .. 128])

(reduce par-filt [(reduce par-filt [] [1 2 3 4 .. 32])
                  (reduce par-filt [] [33 34 35 36 .. 64])
                  (reduce par-filt [] [65 66 67 68 .. 96])
                  (reduce par-filt [] [97 98 99 100 .. 128])])

(apply reduce conj [[2 4 6 8 .. 32]
                    [34 36 38 40 .. 64]
                    [66 68 70 72 .. 96]
                    [98 100 102 104 .. 128]])

[2 4 6 8 .. 128]
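A sequential sketch of the same leaf/combine shape that can be run today; it uses into for the combining case (a choice made for this sketch, not necessarily the talk's exact combine step):

(let [par-filt (fn [acc x]
                 (cond (vector? x) (into acc x)   ;; combining a leaf result
                       (even? x)   (conj acc x)   ;; filtering within a leaf
                       :else       acc))]
  (reduce par-filt []
          (map #(reduce par-filt [] %) (partition 32 (range 1 129)))))
;; => [2 4 6 ... 128]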

pvfilter performance characteristics

pvfilter

Compared to filter (2 cores). Results are the median of 50 samples, where a test predicate was pvfiltered over a vector of 100 values, broken out by the per-element cost of the predicate: Tf < 0.05 msec, 0.05 msec < Tf < 0.1 msec, and Tf > 0.1 msec.

[Chart: relative performance of pvfilter vs filter for each Tf band.]

references

1. Brian Goetz Fork/Join talk: http://www.infoq.com/presentations/brian-goetz-concurrent-parallel
2. Fork/Join API: http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166ydocs/
3. IBM developerWorks article: http://www.ibm.com/developerworks/java/library/j-jtp11137.html
4. Java Concurrency Wiki: http://artisans-serverintellect-com.si-eioswww6.com/default.asp?W1
5. JVM Summit slides: http://wiki.jvmlangsummit.com/images/f/f0/Lea-fj-jul10.pdf
6. Fork/Join paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1918&rep=rep1&type=pdf

questions?


thank you

parallel frequency, using pvreduce

(defn pvfreq [x]
  (letfn [(par-freq [m x]
            (if (map? x)
              (merge-with #(+ %1 %2) m x)             ;; x is a leaf result: merge the count maps
              (update-in m [x] #(if % (inc %) 1))))]  ;; x is an element: bump its count
    (pvreduce par-freq {} x)))

(pvfreq [:foo :bar :baz :bar .. :foo])

(reduce par-freq [(reduce par-freq {} [:foo :foo .. :bar])
                  (reduce par-freq {} [:bar :foo .. :foo])
                  (reduce par-freq {} [:baz :bar .. :foo])
                  (reduce par-freq {} [:bar :baz .. :baz])])

(reduce par-freq [{:foo 15, :bar 10, :baz 7}
                  {:foo 10, :bar 12, :baz 5}
                  {:foo 7, :bar 14, :baz 11}
                  {:foo 12, :bar 10, :baz 10}])

{:foo 44, :bar 46, :baz 33}
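A sequential check of the same shape against clojure.core/frequencies (the sample data below is made up for the example):

(let [data     (vec (take 500 (cycle [:foo :bar :baz :bar :foo])))
      par-freq (fn [m x]
                 (if (map? x)
                   (merge-with + m x)                ;; combining two leaf count maps
                   (update-in m [x] (fnil inc 0))))] ;; counting an element within a leaf
  (= (reduce par-freq {} (map #(reduce par-freq {} %) (partition-all 32 data)))
     (frequencies data)))
;; => true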