Piccolo: Building fast distributed programs with partitioned tables

Russell Power, Jinyang Li
New York University

Motivating Example: PageRank

Repeat until convergence:
    for each node X in graph:
        for each edge X->Z:
            next[Z] += curr[X]

[Figure: input graph (A->B,C,D; B->E; C->D; …) alongside "Curr" and "Next" rank tables updated each iteration]

  The rank tables fit in memory!
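For concreteness, here is a minimal single-machine rendition of this loop in Python. It is a sketch, not the talk's code: the dict-based graph encoding and the damping factor d are illustrative assumptions.

    # Sequential PageRank sketch; graph maps each page to its outgoing links.
    def pagerank(graph, iterations=50, d=0.85):
        n = len(graph)
        curr = {page: 1.0 / n for page in graph}
        for _ in range(iterations):
            next_ranks = {page: (1.0 - d) / n for page in graph}
            for x, out in graph.items():
                for z in out:
                    next_ranks[z] += d * curr[x] / len(out)
            curr = next_ranks
        return curr

    ranks = pagerank({'A': ['B', 'C'], 'B': ['C'], 'C': ['A']})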

PageRank in MapReduce

  Data flow models do not expose global state.

[Figure: each iteration streams the graph (A->B,C; B->D; …) and the ranks (A: 0.1; B: 0.2; …) from distributed storage through workers 1-3, then writes the new ranks back to storage]


PageRank with MPI/RPC

  User explicitly programs communication.

[Figure: workers 1-3 each keep a graph partition (A->B,C; B->D; C->E,F; …) and a rank partition in memory, loading input from distributed storage and exchanging rank updates directly]

Piccolo's Goal: Distributed Shared State

  Distributed in-memory state.

[Figure: workers 1-3 read and write shared Ranks (A: 0; B: 0; …) and Graph (A->B,C; B->D; …) tables loaded from distributed storage]

Piccolo's Goal: Distributed Shared State

  Piccolo runtime handles communication.

[Figure: each worker holds a partition of the Ranks and Graph tables; the runtime routes reads and writes to the right partition]

  Goals: ease of use and performance.

Talk Outline

  Motivation
  Piccolo's Programming Model
  Runtime Scheduling
  Evaluation

Programming Model

  Implemented as a library for C++ and Python.

[Figure: kernel instances on workers 1-3 issue get/put, update, and iterate operations against shared Ranks (A: 0; B: 0; …) and Graph (A->B,C; B->D; …) tables]
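The slides name the table operations (get, put, update, and iteration) but not their implementation. The following single-partition, in-memory stand-in is a sketch of that interface, assuming accumulate is a two-argument merge function; the real tables are partitioned across machines by the runtime.

    # Toy stand-in for one partition of a Piccolo-style table (hypothetical).
    class Table:
        def __init__(self, accumulate=None):
            self.data = {}
            self.accumulate = accumulate  # merge function: f(old_value, update)

        def get(self, key):
            return self.data[key]

        def put(self, key, value):
            self.data[key] = value

        def update(self, key, value):
            # Merge with the existing value via the accumulation function.
            if self.accumulate is not None and key in self.data:
                self.data[key] = self.accumulate(self.data[key], value)
            else:
                self.data[key] = value

        def get_iterator(self):
            return iter(self.data.items())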

Naïve PageRank with Piccolo

    curr = Table(key=PageID, value=double)
    next = Table(key=PageID, value=double)

    # Kernel: run in parallel by many machines.
    def pr_kernel(graph, curr, next):
        i = my_instance
        n = len(graph) / NUM_MACHINES
        for s in graph[(i-1)*n : i*n]:
            for t in s.out:
                next[t] += curr[s.id] / len(s.out)

    # Control function: run by a single controller,
    # which launches the kernel jobs in parallel.
    def main():
        for i in range(50):
            launch_jobs(NUM_MACHINES, pr_kernel, graph, curr, next)
            swap(curr, next)
            next.clear()

Naïve PageRank is Slow

[Figure: kernel instances on workers 1-3 issue remote get and put operations for graph and rank entries scattered across the cluster, so most table accesses cross the network]

PageRank: Exploiting Locality

    # Control table partitioning and co-locate the tables.
    curr = Table(..., partitions=100, partition_by=site)
    next = Table(..., partitions=100, partition_by=site)
    group_tables(curr, next, graph)

    def pr_kernel(graph, curr, next):
        for s in graph.get_iterator(my_instance):
            for t in s.out:
                next[t] += curr[s.id] / len(s.out)

    # Co-locate kernel execution with the table partition it reads.
    def main():
        for i in range(50):
            launch_jobs(curr.num_partitions, pr_kernel,
                        graph, curr, next, locality=curr)
            swap(curr, next)
            next.clear()
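Partitioning by site groups pages on the same site, which link to each other heavily, into the same partition. A hypothetical partition function in that spirit (the name and URL handling are assumptions, not Piccolo's API):

    # Pages on the same site hash to the same partition, so most
    # PageRank updates stay within one machine.
    import zlib
    from urllib.parse import urlparse

    def partition_by_site(page_url, num_partitions):
        # Stable hash of the hostname, so every machine agrees on placement.
        site = urlparse(page_url).netloc
        return zlib.crc32(site.encode()) % num_partitions

    p = partition_by_site('http://example.com/a.html', 100)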

Exploiting Locality

[Figure: after grouping, each worker's gets and puts go to its own Ranks and Graph partitions; only updates to pages on other sites cross the network]

Synchronization

  How to handle synchronization?

[Figure: workers 2 and 3 concurrently issue put(a=0.2) and put(a=0.3) for the same rank entry in worker 1's partition]

Synchronization Primitives

  Avoid write conflicts with accumulation functions:
      NewValue = Accum(OldValue, Update)
      e.g. sum, product, min, max

  Global barriers are sufficient.
  Tables provide release consistency.
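Using the toy Table sketch from the Programming Model section with a sum accumulator, the two conflicting writes from the previous slide merge instead of overwriting each other:

    t = Table(accumulate=lambda old, upd: old + upd)
    t.put('a', 0.0)
    t.update('a', 0.2)   # contribution from worker 2
    t.update('a', 0.3)   # contribution from worker 3
    assert abs(t.get('a') - 0.5) < 1e-12   # runtime computes the sum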

PageRank: Efficient Synchronization

    curr = Table(..., partition_by=site, accumulate=sum)
    next = Table(..., partition_by=site, accumulate=sum)
    group_tables(curr, next, graph)

    def pr_kernel(graph, curr, next):
        for s in graph.get_iterator(my_instance):
            for t in s.out:
                # update() invokes the accumulation function (here, sum).
                next.update(t, curr.get(s.id) / len(s.out))

    def main():
        for i in range(50):
            handle = launch_jobs(curr.num_partitions, pr_kernel,
                                 graph, curr, next, locality=curr)
            # Explicitly wait between iterations.
            barrier(handle)
            swap(curr, next)
            next.clear()

Efficient Synchronization

  Workers buffer updates locally (release consistency).
  The runtime applies the accumulator to compute the sum.

[Figure: workers 2 and 3 replace put(a=0.2) and put(a=0.3) with update(a, 0.2) and update(a, 0.3); worker 1's runtime combines the two updates]
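A toy rendition of the buffering idea (all names here are hypothetical): each worker combines repeated updates to the same remote key locally, and ships one merged update per key at a release point such as a barrier.

    # Locally buffered accumulation; flush() runs at release time.
    class UpdateBuffer:
        def __init__(self, accumulate):
            self.accumulate = accumulate
            self.pending = {}

        def update(self, key, value):
            # Combine with any pending update instead of sending at once.
            if key in self.pending:
                self.pending[key] = self.accumulate(self.pending[key], value)
            else:
                self.pending[key] = value

        def flush(self, send):
            # At a barrier, ship one combined update per key.
            for key, value in self.pending.items():
                send(key, value)
            self.pending.clear()

    buf = UpdateBuffer(lambda old, upd: old + upd)
    buf.update('a', 0.2)
    buf.update('a', 0.3)
    buf.flush(lambda k, v: print(k, v))   # ships a single update for 'a'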

Table Consistency

[Figure: same scenario; because sum is commutative and associative, the merged value of rank a does not depend on the order in which the updates from workers 2 and 3 arrive]

PageRank with Checkpointing

    curr = Table(..., partition_by=site, accumulate=sum)
    next = Table(..., partition_by=site, accumulate=sum)
    group_tables(curr, next)

    def pr_kernel(graph, curr, next):
        for s in graph.get_iterator(my_instance):
            for t in s.out:
                next.update(t, curr.get(s.id) / len(s.out))

    # Restore any previous computation; the user decides
    # which tables to checkpoint and when.
    def main():
        curr, userdata = restore()
        last = userdata.get('iter', 0)
        for i in range(last, 50):
            handle = launch_jobs(curr.num_partitions, pr_kernel,
                                 graph, curr, next, locality=curr)
            cp_barrier(handle, tables=(next,), userdata={'iter': i})
            swap(curr, next)
            next.clear()
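A single-machine analogue of this checkpoint/restore pattern (the file name, format, and function names are assumptions for illustration, not Piccolo's implementation):

    # Persist table contents plus user metadata; recover them on restart.
    import os
    import pickle

    CKPT_PATH = 'checkpoint.pkl'

    def cp_barrier_local(tables, userdata):
        with open(CKPT_PATH, 'wb') as f:
            pickle.dump((tables, userdata), f)

    def restore_local():
        if os.path.exists(CKPT_PATH):
            with open(CKPT_PATH, 'rb') as f:
                return pickle.load(f)
        return {}, {}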

Recovery via Checkpointing

  Runtime uses the Chandy-Lamport snapshot protocol.

[Figure: workers 1-3 write a consistent snapshot of their Ranks and Graph partitions to distributed storage; a failed run restores from it]

Talk Outline

  Motivation
  Piccolo's Programming Model
  Runtime Scheduling
  Evaluation

Load Balancing

  The master coordinates work stealing: an idle worker takes a pending kernel instance from a busy one.
  Other workers may still be updating a stolen partition (e.g. P6), so updates to it are paused while ownership moves.

[Figure: master reassigns jobs J1-J6 and their partitions P1-P6 across workers 1-3]
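A toy sketch of the stealing decision only (all names hypothetical; per the slide, the real master also migrates the stolen partition and pauses in-flight updates to it):

    from collections import deque

    def steal_work(queues):
        """queues: worker id -> deque of pending kernel instances."""
        for idle in [w for w, q in queues.items() if not q]:
            # Steal from the worker with the most pending work.
            victim = max(queues, key=lambda v: len(queues[v]))
            if len(queues[victim]) > 1:
                queues[idle].append(queues[victim].pop())

    # Example: worker 1 is idle, worker 3 still has two jobs queued.
    q = {1: deque(), 2: deque(['J4']), 3: deque(['J5', 'J6'])}
    steal_work(q)   # worker 1 now runs J6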

Talk Outline

  Motivation
  Piccolo's Programming Model
  System Design
  Evaluation

Piccolo is Fast

[Figure: PageRank iteration time (seconds, 0-400) vs. number of workers (8, 16, 32, 64), comparing Hadoop and Piccolo; Piccolo completes iterations far faster at every scale]

  Main Hadoop overheads:
      Sorting
      HDFS
      Serialization

  Setup: NYU cluster, 12 nodes, 64 cores, 100M-page graph.

Piccolo Scales Well

[Figure: PageRank iteration time (seconds, 0-70) vs. number of workers (12, 24, 48, 100, 200) on an EC2 cluster, with the input graph scaled linearly with cluster size up to 1 billion pages; iteration time stays close to the ideal line]

Other Applications

  Iterative applications (no straightforward Hadoop implementation):
      N-body simulation
      Matrix multiply
  Asynchronous applications:
      Distributed web crawler

Related Work

  Data flow: MapReduce, Dryad
  Tuple spaces: Linda, JavaSpaces
  Distributed shared memory: CRL, TreadMarks, Munin, Ivy; UPC, Titanium

Conclusion

  Distributed shared table model.
  User-specified policies provide for:
      Effective use of locality
      Efficient synchronization
      Robust failure recovery

Gratuitous Cat Picture

I can haz kwestions?

Try it out: piccolo.news.cs.nyu.edu
