Building Data-parallel Pipelines in Erlang

Pero Subasic
Erlang Factory, San Francisco, March 29, 2012

Outline
- An introduction to data-parallel
- Requirements
- Map-reduce Flows Recap
- Architecture Overview
- Flow Specification Language
- Iterative Flows: Concept Rank Flow
- Results
- Conclusion and Future


Requirements
Computation models
- map-reduce
- iterative and incremental
Processing models
- in stream, stateful
- in stream, stateless
- batch
Computation platform
- Cloud
- Virtualized, general
- Bare metal


Data Parallel Trend
Scientific computing tools
- R -> snow, multicore, parallel, RHIPE, R+Hadoop, etc.
- Mathematica -> gridMathematica
- MatLab -> Parallel Toolbox
Internet, Big Data
- Yahoo: S4
- Google: FlumeJava
- Cloudera: Crunch, Hadoop MR2
Parallelism granularity
- GPGPU
- Multi-CPU/Multicore
- Cluster


The concept is not really new. In "Data Parallel Algorithms" (Communications of the ACM, December 1986), W. Daniel Hillis and Guy L. Steele, Jr. describe algorithms whose "parallelism comes from simultaneous operations across large sets of data rather than from multiple threads of control." There the idea applies mostly to machines with hundreds or thousands of processors. Erlang lends itself naturally to data parallelism: since functions are first-class objects, they can be dispatched on demand to any node in a distributed system, together with the data, enabling the concept of bringing computation to the data.
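A minimal sketch of that idea (module and message names here are illustrative, not from the talk): a fun is shipped to the node that holds the data, and only the result comes back.

    -module(ship_fun).
    -export([run_on/2]).

    %% Run Fun on RemoteNode and wait for its result. The fun executes
    %% next to the data, so only the (small) result crosses the network.
    %% Note: the calling module must also be loaded on RemoteNode.
    run_on(RemoteNode, Fun) ->
        Self = self(),
        spawn(RemoteNode, fun() -> Self ! {result, Fun()} end),
        receive
            {result, R} -> R
        end.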

Data Parallel

Given an integer vector $x = (x_i),\ i = 1, \dots, n$, apply a function $f$ to the vector elements in parallel:

$$\| f(x) : \quad y_i = f(x_i), \quad i = 1, \dots, n \ \text{(in parallel)}$$
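A common Erlang idiom for this parallel application (a sketch, not code from the talk) spawns one process per element and collects the results in input order:

    %% Parallel map: apply F to each element of Xs in its own process.
    pmap(F, Xs) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- Xs],
        %% Collect results keyed by unique reference, preserving order.
        [receive {Ref, Y} -> Y end || Ref <- Refs].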


Map Reduce*3 Flow

[Diagram: three chained map-reduce stages. Events -> Scanner -> UP -> Aggregator -> User Profiles; User Profiles -> Scanner -> Campaign Compute -> Aggregator -> Campaign Counts (L, local); Campaign Counts (L) -> Scanner -> Aggregator -> Campaign Counts (G, global).]

An Example Flow

[Diagram: Event Sources A, B and C feed ETS staging tables read by an input layer of Scanners A, B and C; a processing layer of Processors 1..n writes to DETS intermediaries; an aggregation layer of Aggregators 1..n writes to a Mnesia sink read by consumers.]

Architecture


Architecture
• Flow supervisor/monitor – layer – worker hierarchy
• ETS/DETS/Mnesia/TCP/UDP tables for sources, sinks and intermediaries
• Synchronous or asynchronous message passing between layers (sketched below)
• Plugins
• Example layer parameters:
  – Layer size
  – Layer identification
  – Layer input/output data, format, and connectivity with adjacent layers
  – Mapping functions between layers: partitioning
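The two inter-layer messaging modes might look like this sketch (message shapes and the timeout are assumptions, not the framework's actual protocol):

    %% Asynchronous: fire-and-forget hand-off to a worker in the next layer.
    send_async(Pid, Record) ->
        Pid ! {record, Record},
        ok.

    %% Synchronous: block until the receiving worker acknowledges.
    send_sync(Pid, Record) ->
        Ref = make_ref(),
        Pid ! {record, Record, self(), Ref},
        receive
            {ack, Ref} -> ok
        after 5000 -> {error, timeout}
        end.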

Layer

[Diagram: a layer process supervising N workers.]
• A process
• Spawns and monitors workers (see the sketch below)
• Elastic (number of workers)
• Workers perform uniform functions
• Connects to other layers, sources, sinks, intermediaries, resources
• Maps to physical nodes
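A minimal sketch of such a layer process (hypothetical code, not the framework's): it spawns N uniform workers, monitors them, and replaces any that die, which also gives a natural hook for growing or shrinking the pool.

    -module(layer).
    -export([start/2]).

    %% Start a layer of N workers, each running WorkerFun.
    start(N, WorkerFun) ->
        spawn(fun() ->
                      Workers = [spawn_monitor(WorkerFun) || _ <- lists:seq(1, N)],
                      loop(Workers, WorkerFun)
              end).

    %% Keep the worker count constant: replace any worker that exits.
    loop(Workers, WorkerFun) ->
        receive
            {'DOWN', Ref, process, _Pid, _Reason} ->
                Rest = lists:keydelete(Ref, 2, Workers),
                loop([spawn_monitor(WorkerFun) | Rest], WorkerFun)
        end.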

Sources, Sinks and Intermediaries

[Diagram: ETS tables serving as sources and intermediaries between layers.]
• Storage for crucial data sets
• Staging used as input for one or more flows
• Intermediary: output from the previous layer, input to the next one; implements the barrier concept (see the sketch below)
• Both can be used for staging input for multiple downstream layers/flows
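The barrier could be as simple as this sketch (message names are assumptions): the intermediary hands its staging table to the next layer only after every upstream worker reports completion.

    %% Wait for Pending upstream workers to finish writing into Tab,
    %% then hand the staged ETS table over to the next layer.
    barrier(Tab, 0, NextLayer) ->
        NextLayer ! {staged, Tab},
        ok;
    barrier(Tab, Pending, NextLayer) ->
        receive
            {done, _WorkerPid} ->
                barrier(Tab, Pending - 1, NextLayer)
        end.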

Workers

[Diagram: worker = input -> UDF -> partitioning/output.]
• Input layer: predefined for ETS scan
• UDF assumes the intermediary/staging table record format
• UDF is defined externally, but still in Erlang
• Partitioning/load-balancing function: predefined or custom
• Output to an intermediary or another layer
• All functionality is local to a node (see the sketch below)
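A hypothetical worker skeleton for that input -> UDF -> partitioning/output pipeline (all names here are illustrative):

    %% Scan the input ETS table, apply the user-defined function to each
    %% record, and route each result to the downstream worker selected by
    %% the partitioning function. Everything runs locally on the node.
    worker(InputTab, UDF, PartitionFun, NextLayerPids) ->
        ets:foldl(fun(Record, ok) ->
                          Out = UDF(Record),
                          N = PartitionFun(Out, length(NextLayerPids)),
                          lists:nth(N, NextLayerPids) ! {record, Out},
                          ok
                  end, ok, InputTab).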

Flow Language


Flow Language: Configuration Hierarchy

Infrastructure
- Cloud/VM
- Hardware

Platform/Framework
- Erlang node configuration
- Code repository: framework, plugins, global libraries (Erlang, C/C++, CUDA)

Application
- application libraries
- flows

Flow
- flow infrastructure (TCP, UDP, ETS, DETS, Mnesia)
- flow structure: flow graph (nodes, communication)
- flow replication
- optimization
- monitoring
- scheduling


Flow Language

    %% Erlang cluster graph and node specs
    {nodes, [
        {pero@cache01, [{pa, "/nas1/dpar/data"},
                        {config, "/nas1/dpar/apps/conf/flow1.cfg"}, ...]},
        {pero@cache02, [{pa, "/nas1/dpar/data"},
                        {config, "/nas1/dpar/apps/conf/flow1.cfg"}, ...]},
        ...
        {pero@cache08, [{pa, "/nas1/dpar/data"},
                        {config, "/nas1/dpar/apps/conf/flow1.cfg"}, ...]}
    ]},

    %% hardware specs
    {hardware, [
        {physical_nodes, [
            {cache01, [{memory, 128G}, {cores, 16}, {disk, 1T}]},
            {cache02, [{memory, 128G}, {cores, 16}, {disk, 1T}]},
            ...
            {cache08, [{memory, 128G}, {cores, 16}, {disk, 1T}]}
        ]}
    ]},

    %% data sources and sinks
    {source, [{type, ETS},  {name, adcom_events},  {size, 12}]},
    {source, [{type, DETS}, {name, adtech_events}, {size, 12}]},
    {source, [{type, DETS}, {name, tacoda_events}, {size, 12}]},
    {sink,   [{type, Mnesia}, {name, user_profiles}]},

    %% processing layers and workers
    {layers, [scanner, processor, aggregator]},

    {layer, adcom_scanner, [
        {type, scanner},
        {predecessor, adcom_events},                 % source specs for the scanner
        {operation_mode, sequential_parallel_scan},  % scan records in order, in parallel
                                                     % fashion across tables
        {function, {worker, worker_fun, []}},        % say, extracts relevant info from the event
                                                     % record and passes it on with key equal
                                                     % to TID or ACID
        {partition, {partitioner, partitioning_fun, [adcom]}},  % user-defined partitioning
                                                     % function acting on adcom event records
        {successor, [processor]},
        {communication, {concurrent, asynchronous}}
    ]},

    {layer, adtech_scanner, [
        {type, scanner},
        {predecessor, adtech_events},                % source specs for the scanner
        {operation_mode, sequential_parallel_scan},  % scan records in order, in parallel
                                                     % fashion across tables
        {partition, {partitioner, partitioning_fun, [adtech]}},
        {successors, [processor]},
        {communication, {concurrent, asynchronous}}
    ]},

    {layer, processor, [
        {type, worker},
        {predecessors, [adcom_scanner, adtech_scanner, tacoda_scanner]},
        {successors, [aggregator]},
        {distribution, local},
        {size, 36},                                  % 36 workers per layer per node
        {communication, {concurrent, asynchronous}}
    ]},

    {layer, aggregator, [
        {type, worker},
        {predecessors, [processor]},
        {successors, [user_profiles]},
        {function, {aggregator, aggregator_fun, []}},  % aggregates all fields with a common
                                                       % ACID/TID into the same Mnesia entry
        {distribution, local},
        {size, 36},                                    % 36 workers per node
        {communication, {concurrent, asynchronous}}
    ]},
    ...
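Because the flow spec is plain Erlang terms, loading it can be very simple; a sketch (assuming each top-level tuple in the file is written as a dot-terminated term, which file:consult/1 requires):

    %% Read a flow spec like the one above and pick out its layer definitions.
    load_flow(Path) ->
        {ok, Spec} = file:consult(Path),
        Layers = [L || L <- Spec, is_tuple(L), element(1, L) =:= layer],
        {Spec, Layers}.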


Iterative Flow - Concept Rank


Wikipedia Concept Rank Problem Space
- 3,091,923 concepts
- 42,205,936 links

$$PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$
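Here $M(p_i)$ is the set of concepts linking to $p_i$, $L(p_j)$ the number of forward links of $p_j$, $N$ the total number of concepts, and $d$ the damping factor. One iteration step could look like this sketch (the map-based data layout is an assumption, not the talk's actual representation):

    %% One Concept Rank iteration: Backlinks maps a concept to the
    %% concepts linking to it, OutDegree to its forward-link count,
    %% Ranks to its current rank.
    rank_step(Concepts, Backlinks, OutDegree, Ranks, N, D) ->
        maps:from_list(
          [{C, (1 - D) / N
                + D * lists:sum([maps:get(B, Ranks) / maps:get(B, OutDegree)
                                 || B <- maps:get(C, Backlinks, [])])}
           || C <- Concepts]).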

Concept Rank Results

[Bar chart: Concept Rank of Programming Languages on Wikipedia, plotted on a log scale from 1 to 1000. Ranks span roughly 3 to 139 across Erlang, LISP, Haskell, Python, Scala, Java, C, C#, Pascal, FORTRAN and Ada.]

CR Flow


CR Compute Worker

[Diagram: a CR worker reads the bMap and fMap ETS tables and the current CR Mnesia table (CR6), and writes the next one (CR7).]
1. If phash2(CID) == Token, the worker owns this concept: compute_cr(CID)
2. Look up backlinks(CID) in the bMap ETS table
3. For each concept B in backlinks(CID): FL = length(forwlinks(B)), from the fMap ETS table; compute sum(CR(B)/FL), reading CR(B) from the current iteration's Mnesia table (CR6)
4. Write the new rank into the next iteration's Mnesia table (CR7)

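That logic might be rendered as follows (a sketch; the bmap/fmap table names and record layouts are assumptions):

    %% A worker owns the concepts whose hash matches its token. For an
    %% owned concept it reads backlinks from the bMap ETS table, forward
    %% link counts from fMap, previous ranks from the current CR Mnesia
    %% table, and writes the new rank into the next one.
    compute_cr(CID, Token, NWorkers, InTab, OutTab, N, D) ->
        case erlang:phash2(CID, NWorkers) of
            Token ->
                [{_, Back}] = ets:lookup(bmap, CID),
                Sum = lists:sum(
                        [begin
                             [{_, FL}] = ets:lookup(fmap, B),
                             [{_, _, CR}] = mnesia:dirty_read(InTab, B),
                             CR / FL
                         end || B <- Back]),
                mnesia:dirty_write({OutTab, CID, (1 - D) / N + D * Sum});
            _Other ->
                not_owned
        end.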

Computation Metrics and Results
- Memory use: 420 MB RAM, 63 MB resident memory
- Mnesia is distributed across the nodes
- fMap and bMap are loaded into ETS tables on each node
- fMap: 126 MB, 3,091,923 records
- bMap: 121 MB, 2,479,969 records
- conceptrankX tables in Mnesia: 3,091,923 records, 31,374,390 words of memory each

Runtimes, 10 iterations:
- 1 worker process, 1 node: 50 min
- 10 processes, 1 node: 18.36 min
- 20 processes, 1 node: 23.48 min
- 5 processes, 1 node: 19.8 min
- 2 nodes, 10 proc/node: 14.83 min
- 3 nodes, 10 proc/node: 11.16 min; low network traffic (10 MB/s), fast response, moderate CPU usage
- 4 nodes, 10 proc/node: 11.83 min; medium-high to high network traffic (30 MB/s)
- 4 nodes, 15 proc/node: 12.67 min; medium-high network traffic (25 MB/s)
- 4 nodes, 5 proc/node: 8.83 min; high network traffic (35 MB/s)
- 4 nodes, 2 proc/node: 22.3 min; medium network traffic (15 MB/s) across all four nodes

- The task is to find a compromise between the network traffic caused by Mnesia table sync and the local computation requirements on each node.
- 5 proc/node appears to minimize response time at the expense of high network traffic. Reducing the number of processes per node to 2 lowers network traffic but cuts computing capacity (CPU utilization is the lowest of all approaches), so in that case the system spends most of its time computing the ranks.

Concept Rank Flow: Tradeoffs

[Bar chart: CR flow run times by configuration (nodes, processes per node): (3,10) 11.16 min, (4,10) 11.83 min, (4,15) 12.67 min, (4,5) 8.83 min, (4,2) 22.30 min.]

Optimization

[Diagram: a Flow Optimizer sets flow parameters; a Monitor writes status into a Monitor DB that feeds back into the optimizer.]

Minimize:
- run time
- network traffic
- CPU utilization
- disk I/O
- cloud expense
- some combination of the above (see the sketch below)


Run Time Local Minima

[Line chart: swipe of 1 to 20 processes per node on 4 nodes. Run time starts at 172 s with 1 process per node, falls to a minimum around 50 s near 4 processes, then climbs back to roughly 100-115 s by 20 processes.]

Swipe Results

[Line charts: the 4-node swipe above, plus the same 1-to-20 swipe on 8 nodes, where run time falls from 162 s at 1 process per node to a minimum near 38 s around 5 processes, then rises again to roughly 90-150 s at the high end. Axes: processes per node vs. time in secs, with an order-4 polynomial trend line.]

Swipe Ganglia Monitoring


Conclusion and Future
- Erlang is a convenient and appropriate language and platform for data-parallel flows
- Building languages and platforms to facilitate easy flow specification makes sense
- A small Erlang team can do wonders


Questions, Comments?
