Piccolo: Building fast distributed programs with partitioned tables
Russell Power, Jinyang Li
New York University
Motivating Example: PageRank
Repeat until convergence:
  for each node X in graph:
    for each edge X->Z:
      next[Z] += curr[X] / |out(X)|
[Figure: input graph (A->B,C,D; B->E; C->D; ...) alongside the Curr and Next rank tables (A: 0.25, B: 0.16, C: 0.22, ...)]
Fits in memory!
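As a concrete reference point, here is a minimal single-machine version of that loop in plain Python (the four-node graph and the fixed iteration count are illustrative assumptions, not part of the talk):

graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['A'], 'D': ['A']}
curr = {node: 1.0 / len(graph) for node in graph}   # uniform initial ranks

for _ in range(50):                                  # fixed count for simplicity
    nxt = {node: 0.0 for node in graph}
    for x, out in graph.items():
        for z in out:
            nxt[z] += curr[x] / len(out)             # spread X's rank over its edges
    curr = nxt

print(curr)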
PageRank in MapReduce
Data flow models do not expose global state.
[Figure: each iteration (steps 1, 2, 3) re-streams the graph ("A->B,C, B->D") and the ranks ("A:0.1, B:0.2") through distributed storage]
PageRank with MPI/RPC
The user explicitly programs all communication.
[Figure: three workers, each holding a graph partition ("A->B,C ...", "B->D ...", "C->E,F ...") and the matching rank partition, exchanging messages directly; the input graph lives in distributed storage]
Piccolo's Goal: Distributed Shared State
Distributed in-memory state: workers read and write shared tables, and the Piccolo runtime handles the communication.
[Figure: three workers, each co-located with a graph partition and a rank partition, reading and writing shared state loaded from distributed storage]
This gives both ease of use and performance.
Talk Outline
- Motivation
- Piccolo's Programming Model
- Runtime Scheduling
- Evaluation
Programming Model
Implemented as a library for C++ and Python.
[Figure: a kernel instance accesses shared Ranks and Graph tables through read/write, get/put, and update/iterate operations]
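A toy, single-process illustration of those four operations (this Table class is a dict-backed stand-in written for this example, not Piccolo's API):

class Table(dict):
    """Stand-in for a Piccolo table: one partition, no network."""
    def put(self, key, value):
        self[key] = value                      # overwrite an entry
    def update(self, key, delta):
        self[key] = self.get(key, 0.0) + delta # merge a delta (here: sum)
    def iterate(self):
        return iter(self.items())              # walk the (local) entries

ranks = Table()
ranks.put('A', 0.25)
print(ranks.get('A'))        # get: read an entry -> 0.25
ranks.update('A', 0.05)      # update: merge a delta -> 0.3
for page, rank in ranks.iterate():
    print(page, rank)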
Naïve PageRank with Piccolo

curr = Table(key=PageID, value=double)
next = Table(key=PageID, value=double)

def pr_kernel(graph, curr, next):     # kernel jobs run on many machines
    i = my_instance
    n = len(graph) // NUM_MACHINES
    for s in graph[(i-1)*n : i*n]:
        for t in s.out:
            next[t] += curr[s.id] / len(s.out)

def main():                           # run by a single controller
    for i in range(50):
        # the controller launches kernel jobs in parallel
        launch_jobs(NUM_MACHINES, pr_kernel, graph, curr, next)
        swap(curr, next)
        next.clear()
Naïve PageRank is Slow
[Figure: kernel instances on workers 1, 2, and 3 issue remote gets and puts against rank and graph partitions scattered across the cluster]
PageRank: Exploiting Locality

# Control table partitioning, and co-locate the tables:
curr = Table(..., partitions=100, partition_by=site)
next = Table(..., partitions=100, partition_by=site)
group_tables(curr, next, graph)

def pr_kernel(graph, curr, next):
    for s in graph.get_iterator(my_instance):
        for t in s.out:
            next[t] += curr[s.id] / len(s.out)

def main():
    for i in range(50):
        # locality=curr co-locates execution with the table
        launch_jobs(curr.num_partitions, pr_kernel,
                    graph, curr, next, locality=curr)
        swap(curr, next)
        next.clear()
Exploiting Locality
[Figure: with partitioning and co-location, each worker's gets are served by its local rank and graph partitions; only puts to remote partitions cross the network]
Synchronization
How to handle synchronization?
[Figure: workers 2 and 3 concurrently send put(a=0.2) and put(a=0.3) to the rank partition on worker 1]
Synchronization Primitives
Avoid write conflicts with accumulation functions:
  NewValue = Accum(OldValue, Update)    (e.g. sum, product, min, max)
Global barriers are sufficient: tables provide release consistency.
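A toy sketch of how an accumulation function resolves concurrent updates (this AccumTable class is an illustrative stand-in, not Piccolo's implementation):

class AccumTable:
    """Table that merges concurrent updates with a user-defined accumulator."""
    def __init__(self, accumulate):
        self.accumulate = accumulate     # e.g. lambda old, upd: old + upd
        self.data = {}

    def update(self, key, delta):
        if key in self.data:
            self.data[key] = self.accumulate(self.data[key], delta)
        else:
            self.data[key] = delta

ranks = AccumTable(accumulate=lambda old, upd: old + upd)  # sum accumulator
ranks.update('a', 0.2)   # from worker 2
ranks.update('a', 0.3)   # from worker 3
# ranks.data['a'] == 0.5 in either arrival order, since sum is commutative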
PageRank: Efficient Synchronization

curr = Table(..., partition_by=site, accumulate=sum)   # accumulation via sum
next = Table(..., partition_by=site, accumulate=sum)
group_tables(curr, next, graph)

def pr_kernel(graph, curr, next):
    for s in graph.get_iterator(my_instance):
        for t in s.out:
            # update() invokes the accumulation function
            next.update(t, curr.get(s.id) / len(s.out))

def main():
    for i in range(50):
        handle = launch_jobs(curr.num_partitions, pr_kernel,
                             graph, curr, next, locality=curr)
        barrier(handle)    # explicitly wait between iterations
        swap(curr, next)
        next.clear()
Efficient Synchronization
[Figure: puts become update(a, 0.2) and update(a, 0.3) messages; the runtime computes the sum at the owning partition]
Workers buffer updates locally (release consistency) and flush them at barriers.
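A sketch of what that worker-local buffering might look like (the flush/RPC layer and class name are assumptions made for this illustration):

class BufferedWriter:
    """Buffers updates per key, pre-combining them with the accumulator
    so each key costs at most one network message per flush."""
    def __init__(self, accumulate, send_update):
        self.accumulate = accumulate
        self.send_update = send_update   # e.g. an RPC to the partition's owner
        self.buffer = {}

    def update(self, key, delta):
        if key in self.buffer:
            self.buffer[key] = self.accumulate(self.buffer[key], delta)
        else:
            self.buffer[key] = delta

    def flush(self):                     # called at a barrier (release point)
        for key, combined in self.buffer.items():
            self.send_update(key, combined)
        self.buffer.clear()

buf = BufferedWriter(accumulate=lambda a, b: a + b,
                     send_update=lambda k, v: print('update', k, v))
buf.update('a', 0.2)
buf.update('a', 0.3)
buf.flush()    # sends a single message: update a 0.5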
Table Consistency
[Figure: the same accumulating update flow; because tables provide only release consistency, reads are guaranteed to observe updates only after a barrier]
PageRank with Checkpointing

curr = Table(..., partition_by=site, accumulate=sum)
next = Table(..., partition_by=site, accumulate=sum)
group_tables(curr, next)

def pr_kernel(graph, curr, next):
    for s in graph.get_iterator(my_instance):
        for t in s.out:
            next.update(t, curr.get(s.id) / len(s.out))

def main():
    curr, userdata = restore()          # restore the previous computation
    last = userdata.get('iter', 0)
    for i in range(last, 50):
        handle = launch_jobs(curr.num_partitions, pr_kernel,
                             graph, curr, next, locality=curr)
        # the user decides which tables to checkpoint, and when
        cp_barrier(handle, tables=(next,), userdata={'iter': i})
        swap(curr, next)
        next.clear()
Recovery via Checkpointing
The runtime takes consistent snapshots using the Chandy-Lamport protocol.
[Figure: workers 1, 2, and 3 checkpoint their rank partitions to distributed storage, from which a failed computation can be restored]
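For intuition, a heavily simplified sketch of the marker-based idea behind Chandy-Lamport snapshots (the class, method names, and in-memory "storage" are all invented for this illustration; the actual runtime differs):

class SnapshotWorker:
    """One worker's table partition with marker-based snapshotting."""
    def __init__(self, peers):
        self.peers = set(peers)      # incoming channels (other worker ids)
        self.state = {}              # this worker's table partition
        self.recording = False
        self.pending = set()         # channels we still await a marker from
        self.snapshot = {}           # stands in for durable storage

    def begin_snapshot(self):
        self.snapshot['state'] = dict(self.state)   # record local state first
        self.snapshot['channels'] = []
        self.recording = True
        self.pending = set(self.peers)
        # ...a real worker would now send a marker on every outgoing channel

    def on_update(self, src, key, delta):
        if self.recording and src in self.pending:  # update was in flight
            self.snapshot['channels'].append((src, key, delta))
        self.state[key] = self.state.get(key, 0.0) + delta

    def on_marker(self, src):
        if not self.recording:
            self.begin_snapshot()                   # first marker triggers it
        self.pending.discard(src)
        if not self.pending:
            self.recording = False                  # snapshot is complete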
Talk Outline
- Motivation
- Piccolo's Programming Model
- Runtime Scheduling
- Evaluation
Load Balancing
The master coordinates work stealing: idle workers take pending kernel jobs (e.g. J3, J5) from busy workers' queues, and each stolen job's table partition moves with it.
Other workers may still be updating a migrating partition (P6 here), so the master pauses updates to it during the move.
[Figure: three workers with job queues (J1..J6) and table partitions (P1..P6); the master reassigns jobs and pauses updates to P6]
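A toy sketch of the master's stealing decision (the policy and data layout are invented for illustration; the real scheduler must also pause and flush updates to the stolen job's partition before migrating it):

def steal_candidate(queues):
    """Pick a pending job to move from the most-loaded worker to the
    least-loaded one. queues maps worker id -> list of pending job ids."""
    idle = min(queues, key=lambda w: len(queues[w]))
    busy = max(queues, key=lambda w: len(queues[w]))
    if len(queues[busy]) - len(queues[idle]) < 2:
        return None                    # imbalance too small to be worth it
    job = queues[busy].pop()           # steal from the tail of the queue
    queues[idle].append(job)
    return busy, idle, job

queues = {1: ['J1'], 2: ['J3', 'J4'], 3: ['J5', 'J6', 'J7']}
print(steal_candidate(queues))         # -> (3, 1, 'J7')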
Talk Outline
- Motivation
- Piccolo's Programming Model
- System Design
- Evaluation
Piccolo is Fast
[Chart: PageRank iteration time (seconds, 0-400) vs. number of workers (8, 16, 32, 64) for Hadoop and Piccolo; NYU cluster, 12 nodes, 64 cores; 100M-page graph]
Main Hadoop overheads: sorting, HDFS, serialization.
Piccolo Scales Well
[Chart: PageRank iteration time (seconds, 0-70) vs. number of workers (12, 24, 48, 100, 200) against ideal scaling; EC2 cluster; input graph scaled linearly with workers, up to 1 billion pages]
Other Applications
Iterative applications: n-body simulation, matrix multiply (no straightforward Hadoop implementation).
Asynchronous applications: distributed web crawler.
Related Work
- Data flow: MapReduce, Dryad
- Tuple spaces: Linda, JavaSpaces
- Distributed shared memory: CRL, TreadMarks, Munin, Ivy; UPC, Titanium
Conclusion
Distributed shared table model: user-specified policies provide
- effective use of locality,
- efficient synchronization, and
- robust failure recovery.
Gratuitous Cat Picture
I can haz kwestions?
Try it out: piccolo.news.cs.nyu.edu