From Concurrency to Parallelism: an illustrated guide to multi-core parallelism in Clojure
David Edgar Liebke
Clojure Conj, Durham, NC, 23 October 2010
summary

Concurrency is commonly mistaken for parallelism, but the two are distinct concepts. Concurrency is concerned with managing access to shared state from different threads, whereas parallelism is concerned with utilizing multiple processors/cores to improve the performance of a computation. Clojure has successfully improved the state of concurrent programming with its many concurrency primitives, and now the goal is to do the same for multi-core parallel programming by introducing new parallel processing features that work with Clojure’s existing data structures. Clojure’s original parallel processing function, pmap, will soon be joined by pvmap and pvreduce, based on JSR 166 and Doug Lea’s Fork/Join Framework. From these building blocks, and the fjvtree function that underlies pvmap and pvreduce, higher-level parallel functions can be developed. This talk will provide an illustrated walkthrough of the algorithms underlying pmap, pvmap, and pvreduce, comparing their strengths, weaknesses, and performance characteristics, and will conclude with an example of using these primitives to write a parallel version of Clojure’s filter function.
outline

pmap: algorithm, weaknesses, performance characteristics, chunking, chunked performance characteristics
fork-join: deques, basic algorithm
persistent-vector: overview
fjvtree, pvmap, and pvreduce: algorithm, performance characteristics
pvfilter: implementation, performance characteristics
pmap: lazy meets parallel
pmap
processors: 2, threads: 4

The pmap function is semi-lazy, meaning it tries to keep just enough threads running so that all the processors are utilized, but doesn’t process its entire input. It does this by using futures to invoke the function being mapped on just enough input values, but no more.

[animation: work items are handed to four future-backed threads on two processors; as each computation (“comp”) finishes and its result is consumed, a new future is started, so the current threads stay just ahead of the completed work]
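A sketch of how this semi-laziness is achieved, modeled on clojure.core/pmap (the real function also handles multiple collections; pmap-sketch is an illustrative name, not part of the talk). It keeps roughly (processors + 2) futures in flight ahead of the consumer:

(defn pmap-sketch [f coll]
  (let [n    (+ 2 (.availableProcessors (Runtime/getRuntime)))
        rets (map #(future (f %)) coll)          ; lazily wrap each call in a future
        step (fn step [[x & xs :as vs] fs]
               (lazy-seq
                 (if-let [s (seq fs)]
                   ;; realizing fs keeps the look-ahead window n futures wide
                   (cons (deref x) (step xs (rest s)))
                   (map deref vs))))]
    (step rets (drop n rets))))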
pmap: when lazy isn’t parallel
pmap
processors: 2, threads: 4

Because pmap is semi-lazy, it is necessary that the process consuming its output can do so at a faster rate than the process producing its output, otherwise all the processors won’t be utilized.

[animation: a slow consumer drains results one at a time, so only a single future is outstanding at any moment and one of the two processors sits idle]
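A hypothetical illustration of this effect (timings are made up for the example): the mapped function takes about 1 ms per item but the consumer takes about 10 ms, so pmap’s small look-ahead window stays full and the second core is idle most of the time:

(time
  (doseq [x (pmap (fn [i] (Thread/sleep 1) (inc i)) (range 100))]
    (Thread/sleep 10)))   ; total time is dominated by the consumer, not the cores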
pmap: uneven loads
pmap
processors: 2, threads: 4

If one or more processors are returning results at a slower rate than the others, there is no way to redistribute the load. The more responsive processors will complete their work and sit idle until the slowest completes and its result can be consumed.

[animation: one long-running computation blocks the head of the output sequence; the other threads finish their items and then idle until the slow result is consumed]
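A hypothetical uneven workload (the durations are invented for the example): one element takes far longer than the rest, and pmap has no way to move that work to another core, so the overall time is governed by the single slow element:

(time
  (dorun (pmap (fn [i] (Thread/sleep (if (zero? i) 500 5)) i)
               (range 8))))   ; finishes in roughly the time of the slowest element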
pmap: performance characteristics
pmap compared to map (2 cores)
* results are the median of 50 samples, where a test function was pmapped over a vector of 100 values

[chart: speed-up of pmap relative to map for three per-element durations: Tf < 0.05 msec, 0.05 < Tf < 0.1 msec, Tf > 0.1 msec]
pmap: chunky style
pmap
processors: 2, threads: 4, chunk size: 32

One strategy for handling situations where the consuming process cannot keep up with the producing process, hence underutilizing the processors, is to partition the input data into chunks and pass those to pmap instead of individual values.

pmap over chunks:
(pmap (fn [chunk] (map f chunk)) (partition 32 coll))

[animation: each future now maps f over a 32-element chunk, so both processors stay busy even while the consumer works through earlier results]
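The same chunking strategy wrapped up as a small helper (a sketch; chunked-pmap and expensive-fn are illustrative names, not from the talk). doall forces the per-chunk work inside the future, and the chunk results are concatenated back into one sequence:

(defn chunked-pmap [f chunk-size coll]
  (apply concat
         (pmap (fn [chunk] (doall (map f chunk)))
               (partition-all chunk-size coll))))

;; e.g. (chunked-pmap expensive-fn 32 coll), where expensive-fn is whatever
;; per-element work is being parallelized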
chunked pmap compared to map (2 cores)
* results are the median of 500 samples, where a test function (duration < 0.005 msec) was pmapped over chunked vectors of 512 values

[chart: speed-up of chunked pmap relative to map for three chunk sizes: chunk < 10, 10 < chunk < 50, chunk > 50]
fork-join parallelism: divide and conquer
deque: double-ended queue
A deque (pronounced “deck”) is a double-ended queue: values can be pushed onto and popped off the front, or taken from the back. The Fork-Join algorithm systematically divides a job into tasks that are pushed onto the deque, resulting in the largest tasks being located at the back, which in turn improves the efficiency of its work-stealing algorithm.

[animation: a task is repeatedly split, each piece pushed onto the front of the deque; the owning worker pops and computes from the front, while a work-stealing worker takes the largest task from the back]
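Illustrative only, with java.util.ArrayDeque standing in for a worker’s deque: the owning worker pushes and pops at the front, so the most recently split (smallest) task is on top, while an idle worker steals the oldest (largest) task from the back:

(import 'java.util.ArrayDeque)

(let [dq (ArrayDeque.)]
  (.push dq :large-task)     ; earlier, larger split lands first...
  (.push dq :small-task)     ; ...and is pushed toward the back by later splits
  [(.pop dq)                 ; owner pops the newest/smallest task  => :small-task
   (.pollLast dq)])          ; a thief takes from the back           => :large-task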
fork-join: basic algorithm
fork-join
workers: 4, branching factor: 2, sequential threshold: 256

Fork-Join is a divide-and-conquer algorithm that iteratively divides a job into smaller and smaller tasks, placing them on a deque, until the size of the current task is below a configured threshold.

Once each worker’s current task is smaller than the configured threshold, it will begin the intended computation. Once the computation is completed for the current task, the worker will pop another task off of its deque and perform the computation on it. After the computation has been completed for at least two tasks, the results can be combined with a join function.

As one worker is processing its tasks, another worker can steal a task from the back of its deque. If one of the workers falls behind, another worker can take tasks from its deque. And so on, until the job is complete.

[animation: a 2048-element job is split into 1024-, 512-, and 256-element tasks across four worker deques; workers push, pop, compute, and join, stealing from one another’s deques whenever they run out of work]
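The same divide-and-conquer shape expressed directly against the Fork/Join API. This is a sketch assuming java.util.concurrent.ForkJoinPool and RecursiveTask (standard since JDK 7; at the time of the talk they lived in the jsr166y preview jar), and sum-task is an illustrative example, not the talk’s code:

(import '[java.util.concurrent ForkJoinPool RecursiveTask])

(defn sum-task
  "Build a RecursiveTask that sums a[lo..hi), splitting in half until the
  range is at or below the sequential threshold."
  [^longs a lo hi threshold]
  (proxy [RecursiveTask] []
    (compute []
      (if (<= (- hi lo) threshold)
        ;; small enough: perform the sequential computation
        (loop [i lo s 0]
          (if (< i hi) (recur (inc i) (+ s (aget a i))) s))
        ;; otherwise split, fork one half, run the other, then join
        (let [mid   (quot (+ lo hi) 2)
              left  (sum-task a lo mid threshold)
              right (sum-task a mid hi threshold)]
          (.fork left)
          (+ (.invoke right) (.join left)))))))

;; e.g. (.invoke (ForkJoinPool.) (sum-task (long-array (range 2048)) 0 2048 256))
;; => 2096128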
persistent vector: parallelism with existing data structures
persistent-vector

A PersistentVector contains a root and a tail; the tail is an array that can contain up to 32 Object references, and the root is a Node that can contain up to 32 child Nodes.

Values are added to the tail until the tail is full; then a new Node is created, containing the 32 Object references from the tail, and inserted as a child of the root.

Once the root is full, a new root Node is created, the existing one is added as a child, and the tree’s shift grows from 5 to 10. And so on.

[diagram: snapshots at counts 0, 32, 33, 1025, 1057, 2049, 2081, 3073, 3105, and 4097, showing the 5-bit index fields (e.g. 00000 00000 11111) that select the root node, child nodes, and object references, with the currently selected path highlighted]
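The 5-bit fields in the diagram are how an index is resolved against this tree. A small sketch of the indexing arithmetic (trie-path is an illustrative name; this is not the actual PersistentVector code path):

(defn trie-path
  "Returns the sequence of 5-bit child indices used to walk from the root
  (at the given shift) down to the object reference for index i."
  [i shift]
  (for [level (range shift -1 -5)]
    (bit-and (bit-shift-right i level) 0x1f)))

(trie-path 1000 10)   ; => (0 31 8): root child 0, node child 31, array slot 8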
fork-join on persistent vectors: fjvtree, pvmap, and pvreduce
fjvtree
workers: 4, branching factor: 32, sequential threshold: 32

The core Fork-Join algorithm in Clojure is implemented in the fjvtree function, which uses the underlying tree structure of PersistentVector to break jobs into tasks.

fork-join based map and reduce on vectors:
(pvmap f v)
(pvreduce f v)

implemented with:
(fjvtree v combine-fn leaf-fn)

leaf-fn for pvmap:
(fn [^PersistentVector$Node node]
  (new-node (amap (.array node) i a
                  (f (aget a i)))))

leaf-fn for pvreduce:
(fn [^PersistentVector$Node node]
  (let [a (.array node)]
    (loop [ret (aget a 0) i (int 1)]
      (if (< i 32)
        (recur (f ret (aget a i)) (inc i))
        ret))))

combine-fn for pvmap:
(fn [coll] (new-node (to-array coll)))

combine-fn for pvreduce:
(fn [coll] (reduce f coll))

[animation: the vector’s 1024-element subtrees are pushed onto the worker deques, split into 32-element leaf tasks, computed (“comp”) by the workers (which take tasks from one another’s deques as they run dry), and then joined back up the tree until the job is complete]
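For intuition, the leaf-fn / combine-fn split can be read as a two-level reduction. A runnable sequential sketch of that decomposition (plain map/reduce over 32-element partitions, not fjvtree itself):

(let [v          (vec (range 1024))
      leaf-fn    (fn [node] (reduce + node))        ; per-leaf work, pvreduce-style
      combine-fn (fn [results] (reduce + results))] ; combine the per-leaf results
  (combine-fn (map leaf-fn (partition 32 v))))
;; => 523776, the same answer as (reduce + v); fjvtree runs the leaf-fn calls
;; on the Fork/Join pool instead of sequentially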
pvmap: performance characteristics
pvmap compared to map (2 cores)
* results are the median of 50 samples, where a test function was pvmapped over a vector of 100 values

[chart: speed-up of pvmap relative to map for Tf < 0.05 msec, 0.05 < Tf < 0.1 msec, and Tf > 0.1 msec]
pvmap vs pmap compared to map (2 cores)
* results are the median of 50 samples, where a test function was p*mapped over a vector of 100 values

[chart: speed-up of pvmap and pmap relative to map for Tf < 0.05 msec, 0.05 < Tf < 0.1 msec, and Tf > 0.1 msec]
pvreduce: performance characteristics
pvreduce compared to reduce (2 cores)
* results are the median of 50 samples, where a test function was pvreduced over a vector of 100 values

[chart: speed-up of pvreduce relative to reduce for Tf < 0.05 msec, 0.05 < Tf < 0.1 msec, and Tf > 0.1 msec]
pvreduce examples: implementing pvfilter
parallel sum
using pvreduce

(pvreduce + [1 2 3 4 5 6 7 8 9 .. 128])

(reduce + [(reduce + [1 2 3 4 5 .. 32])
           (reduce + [33 34 35 .. 64])
           (reduce + [65 66 67 68 .. 96])
           (reduce + [97 98 99 100 .. 128])])

(reduce + [528 1552 2576 3600])

8256
parallel filter
using pvreduce

(defn pvfilter [pred v]
  (letfn [(filt [v x] (if (pred x) (conj v x) v))]
    (pvreduce filt [] v)))

(pvfilter even? [1 2 3 4 5 6 7 8 9 .. 128])

(reduce filt [(reduce filt [] [1 2 3 4 .. 32])
              (reduce filt [] [33 34 35 36 .. 64])
              (reduce filt [] [65 66 67 68 .. 96])
              (reduce filt [] [97 98 99 100 .. 128])])

(reduce filt [[2 4 6 8 .. 32] [34 36 38 40 .. 64] [66 68 70 72 .. 96] [98 100 102 104 .. 128]])

java.lang.ClassCastException: clojure.lang.PersistentVector cannot be cast to java.lang.Number
parallel filter
using pvreduce

(defn pvfilter [pred v]
  (letfn [(par-filt [v x]
            (cond (vector? x) (apply reduce conj v x)
                  (pred x)    (conj v x)
                  :else       v))]
    (pvreduce par-filt [] v)))

(pvfilter even? [1 2 3 4 5 6 7 8 9 .. 128])

(reduce par-filt [(reduce par-filt [] [1 2 3 4 .. 32])
                  (reduce par-filt [] [33 34 35 36 .. 64])
                  (reduce par-filt [] [65 66 67 68 .. 96])
                  (reduce par-filt [] [97 98 99 100 .. 128])])

(apply reduce conj [[2 4 6 8 .. 32] [34 36 38 40 .. 64] [66 68 70 72 .. 96] [98 100 102 104 .. 128]])

[2 4 6 8 .. 128]
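A purely sequential emulation of the two-level reduction above, to show why par-filt must accept both single values (leaf level) and vectors (combine level). This is a sketch, not pvreduce; the vector branch uses (reduce conj v x) for the concatenation step so that it runs as plain Clojure:

(let [pred     even?
      par-filt (fn [v x]
                 (cond (vector? x) (reduce conj v x)   ; combine level: splice in a leaf result
                       (pred x)    (conj v x)          ; leaf level: keep matching values
                       :else       v))
      leaves   (map #(reduce par-filt [] %) (partition 32 (range 1 129)))]
  (reduce par-filt [] leaves))
;; => [2 4 6 8 ... 128]  (all 64 even values, in order)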
pvfilter: performance characteristics

pvfilter compared to filter (2 cores)
* results are the median of 50 samples, where a test predicate was pvfiltered over a vector of 100 values

[chart: speed-up of pvfilter relative to filter for Tf < 0.05 msec, 0.05 < Tf < 0.1 msec, and Tf > 0.1 msec]
references

1. Brian Goetz Fork/Join talk: http://www.infoq.com/presentations/brian-goetz-concurrent-parallel
2. Fork/Join API: http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166ydocs/
3. IBM developerWorks article: http://www.ibm.com/developerworks/java/library/j-jtp11137.html
4. Java Concurrency Wiki: http://artisans-serverintellect-com.si-eioswww6.com/default.asp?W1
5. JVM Summit slides: http://wiki.jvmlangsummit.com/images/f/f0/Lea-fj-jul10.pdf
6. Fork/Join paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.1918&rep=rep1&type=pdf
questions?

thank you
parallel frequency
using pvreduce

(defn pvfreq [x]
  (letfn [(par-freq [m x]
            (if (map? x)
              (merge-with #(+ %1 %2) m x)
              (update-in m [x] #(if % (inc %) 1))))]
    (pvreduce par-freq {} x)))

(pvfreq [:foo :bar :baz :bar .. :foo])

(reduce par-freq [(reduce par-freq {} [:foo :foo .. :bar])
                  (reduce par-freq {} [:bar :foo .. :foo])
                  (reduce par-freq {} [:baz :bar .. :foo])
                  (reduce par-freq {} [:bar :baz .. :baz])])

(reduce par-freq [{:foo 15, :bar 10, :baz 7}
                  {:foo 10, :bar 12, :baz 5}
                  {:foo 7, :bar 14, :baz 11}
                  {:foo 12, :bar 10, :baz 10}])

{:foo 44, :bar 46, :baz 33}
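As with pvfilter, a sequential emulation (a sketch, not pvreduce; the input data is invented for the example) shows the two levels: individual keywords are counted at the leaves, and the per-leaf maps are merged at the combine level, matching clojure.core/frequencies:

(let [par-freq (fn [m x]
                 (if (map? x)
                   (merge-with + m x)               ; combine level: merge leaf maps
                   (update-in m [x] (fnil inc 0)))) ; leaf level: count one keyword
      data     (vec (take 128 (cycle [:foo :bar :baz :bar])))
      leaves   (map #(reduce par-freq {} %) (partition 32 data))]
  [(reduce par-freq {} leaves)
   (= (reduce par-freq {} leaves) (frequencies data))])
;; => [{:foo 32, :bar 64, :baz 32} true]  (key order may vary)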