Building Data-parallel Pipelines in Erlang
Pero Subasic
Erlang Factory, San Francisco, March 29, 2012
Outline
- An introduction to data-parallel
- Requirements
- Map-reduce Flows Recap
- Architecture Overview
- Flow Specification Language
- Iterative Flows - Concept Rank Flow
- Results
- Conclusion and Future
Requirements
- Computation models
  - map-reduce
  - iterative and incremental
- Processing models
  - in-stream, stateful
  - in-stream, stateless
  - batch
- Computation platform
  - Cloud
  - Virtualized, general
  - Bare metal
Data Parallel Trend
- Scientific Computing Tools
  - R -> snow, multicore, parallel, RHIPE, R+Hadoop, etc.
  - Mathematica -> gridMathematica
  - MatLab -> Parallel Toolbox
- Internet, Big Data
  - Yahoo: S4
  - Google: FlumeJava
  - Cloudera: Crunch, Hadoop MR2
- Parallelism granularity
  - GPGPU
  - Multi-CPU/Multicore
  - Cluster
The concept is not really new. In the December 1986 issue of Communications of the ACM, W. Daniel Hillis and Guy L. Steele, Jr. published "Data Parallel Algorithms", describing algorithms whose "parallelism comes from simultaneous operations across large sets of data rather than from multiple threads of control." There the idea applies mostly to machines with hundreds or thousands of processors. Erlang supports data parallelism naturally: because functions are first-class values, they can be dispatched on demand to any node of a distributed system, together with the data, which enables the idea of bringing the computation to the data.
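As a minimal sketch of that idea (the node name and the table used below are illustrative assumptions, not part of the framework), a fun can be shipped to the node that holds the data and evaluated there:

%% Sketch: ship a fun to the node that owns the data and run it there.
%% 'data@cache01' and the table name are placeholders for illustration.
-module(dispatch_example).
-export([remote_count/2]).

remote_count(Node, TableName) ->
    Caller = self(),
    %% The fun is a first-class value; spawning it on Node moves the
    %% computation to the data instead of copying the data to the caller.
    spawn(Node, fun() ->
        Size = ets:info(TableName, size),   % runs on Node, next to the table
        Caller ! {table_size, TableName, Size}
    end),
    receive
        {table_size, TableName, Size} -> Size
    after 5000 ->
        {error, timeout}
    end.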
Data Parallel
Given an integer vector $x = (x_1, x_2, \dots, x_n)$, apply a function $f$ to the vector elements in parallel:
$\|f(x) = (f(x_1), f(x_2), \dots, f(x_n))$, with each $f(x_i)$ evaluated concurrently.
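A minimal Erlang sketch of this operation (a naive parallel map, not the framework's actual implementation) could look like this:

%% Naive parallel map: apply F to every element of the list concurrently
%% and collect the results in the original order.
-module(pmap_example).
-export([pmap/2]).

pmap(F, Xs) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() -> Parent ! {Ref, F(X)} end),
                Ref
            end || X <- Xs],
    [receive {Ref, Result} -> Result end || Ref <- Refs].

%% Example: pmap_example:pmap(fun(I) -> I * I end, lists:seq(1, 8)).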
[Figure: Map Reduce*3 Flow, three chained scanner/aggregator stages: Events are scanned and aggregated into User Profiles (UP); User Profiles are scanned into Campaign Compute and aggregated into local Campaign Counts (L); those are scanned and aggregated into global Campaign Counts (G).]
An Example Flow
[Figure: example flow structure. Event Sources A, B and C are staged into an ETS table and read by an Input Layer of Scanners A, B, C; a Processing Layer of Processors 1..n feeds an Aggregation Layer of Aggregators 1..n, with DETS tables as intermediaries; results land in a Sink (Mnesia) that is read by Consumers.]
Architecture
• Flow supervisor/monitor – layer – worker hierarchy
• ETS/DETS/Mnesia/TCP/UDP tables for sources, sinks and intermediaries
• Synchronous or asynchronous message passing between layers
• Plugins
• Example layer parameters:
  – Layer size
  – Layer identification
  – Layer input, output data/format and connectivity with adjacent layers
  – Mapping functions between layers: partitioning
Layer
[Figure: a Layer process supervising N Worker processes.]
• Process
• Spawns and monitors workers (a minimal sketch follows below)
• Elastic (number of workers)
• Workers perform uniform functions
• Connects to other layers, sources, sinks, intermediaries, resources
• Maps to physical nodes
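A minimal sketch of such a layer process (module, messages and resize protocol are illustrative assumptions, not the framework's actual API), spawning N workers and replacing any that crash:

%% Sketch of a layer process: spawns N uniform workers, monitors them,
%% and replaces any worker that dies. Names here are illustrative only.
-module(layer_example).
-export([start/2]).

start(NumWorkers, WorkerFun) ->
    spawn(fun() -> init(NumWorkers, WorkerFun) end).

init(NumWorkers, WorkerFun) ->
    Pids = [spawn_worker(WorkerFun) || _ <- lists:seq(1, NumWorkers)],
    loop(Pids, WorkerFun).

spawn_worker(WorkerFun) ->
    {Pid, _Ref} = spawn_monitor(WorkerFun),
    Pid.

loop(Pids, WorkerFun) ->
    receive
        {'DOWN', _Ref, process, Dead, _Reason} ->
            %% Fault tolerance: replace the dead worker.
            NewPid = spawn_worker(WorkerFun),
            loop([NewPid | lists:delete(Dead, Pids)], WorkerFun);
        {resize, N} when N > length(Pids) ->
            %% Elasticity: grow the layer on demand.
            Extra = [spawn_worker(WorkerFun) || _ <- lists:seq(1, N - length(Pids))],
            loop(Extra ++ Pids, WorkerFun);
        stop ->
            [exit(P, shutdown) || P <- Pids],
            ok
    end.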
Sources, Sinks and Intermediaries
[Figure: ETS tables acting as a Source and as an Intermediary between layers.]
• Storage for crucial data sets
• Staging used as input for one or more flows
• Intermediary: output from the previous layer, input to the next one; implements the barrier concept (sketch below)
• Both can be used for staging input for multiple downstream layers/flows
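A small sketch of an ETS table used as an intermediary/barrier between two layers (table layout and function names are made up for illustration):

%% Sketch: one layer writes its output into a shared ETS table, the next
%% layer scans it once the barrier is released. Names are illustrative.
-module(intermediary_example).
-export([create/1, stage/3, drain/1]).

create(Name) ->
    %% public + named so workers from both layers on this node can reach it
    ets:new(Name, [named_table, public, set, {write_concurrency, true}]).

stage(Table, Key, Record) ->
    ets:insert(Table, {Key, Record}).

%% Called by the downstream layer after the upstream layer has finished
%% (the "barrier"): returns everything that was staged.
drain(Table) ->
    ets:tab2list(Table).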
Workers
[Figure: worker pipeline: Input -> UDF -> Partitioning/Output.]
• Input layer: predefined for ETS scan
• UDF assumes the intermediary/staging table record format
• UDF is defined externally, but is still Erlang (a worker sketch follows below)
• Partitioning/load-balancing function: predefined or custom
• Output to an intermediary or another layer
• All functionality is local to a node
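A minimal sketch of such a worker (the UDF and partitioner are passed in as funs; the message shapes are assumptions, not the framework's actual protocol):

%% Sketch of a worker: receives records, applies a user-defined function,
%% then routes the result to one of the next layer's workers via a
%% partitioning function. Message shapes are assumptions for illustration.
-module(worker_example).
-export([start/3]).

start(UDF, PartitionFun, NextLayerPids) ->
    spawn(fun() -> loop(UDF, PartitionFun, NextLayerPids) end).

loop(UDF, PartitionFun, NextLayerPids) ->
    receive
        {record, Record} ->
            Result = UDF(Record),                       % user-defined transformation
            Index  = PartitionFun(Result, length(NextLayerPids)),
            Target = lists:nth(Index, NextLayerPids),   % 1-based index into next layer
            Target ! {record, Result},
            loop(UDF, PartitionFun, NextLayerPids);
        eof ->
            [Pid ! eof || Pid <- NextLayerPids],        % propagate end-of-stream
            ok
    end.

%% Example partitioner: hash the record key over the next layer's size.
%% fun(Result, N) -> erlang:phash2(element(1, Result), N) + 1 end.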
Flow Language
Flow Language: Configuration Hierarchy
- Infrastructure
  - Cloud/VM
  - Hardware
  - Platform/Framework
- Application Platform Infrastructure
  - Erlang node configuration
  - Code repository: framework, plugins, global libraries (Erlang, C/C++, CUDA)
- Applications
  - application libraries
  - flow
- Flow
  - flow infrastructure (TCP, UDP, ETS, DETS, Mnesia)
  - flow structure: flow graph (nodes, communication)
  - flow replication
  - optimization
  - monitoring
  - scheduling
Flow Language

%% Erlang cluster node specs and hardware graph
{nodes, [
    {pero@cache01, [{pa, "/nas1/dpar/data"}, {config, "/nas1/dpar/apps/conf/flow1.cfg"}, ...]},
    {pero@cache02, [{pa, "/nas1/dpar/data"}, {config, "/nas1/dpar/apps/conf/flow1.cfg"}, ...]},
    ...
    {pero@cache08, [{pa, "/nas1/dpar/data"}, {config, "/nas1/dpar/apps/conf/flow1.cfg"}, ...]}
]},
{hardware, [
    {physical_nodes, [
        {cache01, [{memory, 128G}, {cores, 16}, {disk, 1T}]},
        {cache02, [{memory, 128G}, {cores, 16}, {disk, 1T}]},
        ...
        {cache08, [{memory, 128G}, {cores, 16}, {disk, 1T}]}
    ]}
]},

%% data sources and sinks specs
{source, [{type, ETS},  {name, adcom_events},  {size, 12}]},
{source, [{type, DETS}, {name, adtech_events}, {size, 12}]},
{source, [{type, DETS}, {name, tacoda_events}, {size, 12}]},
{sink,   [{type, Mnesia}, {name, user_profiles}]},

%% processing layers, workers
{layers, [
    {layer, adcom_scanner, [
        {type, scanner},
        {predecessor, adcom_events},                 % source specs for the scanner
        {operation_mode, sequential_parallel_scan},  % scan records in order, in parallel fashion across tables
        {successor, [processor]},
        {partition, {partitioner, partitioning_fun, [adcom]}}, % user-defined partitioning function acting on adcom event records
        {communication, {concurrent, asynchronous}}
    ]},
    {layer, adtech_scanner, [
        {type, scanner},
        {predecessor, adtech_events},                % source specs for the scanner
        {operation_mode, sequential_parallel_scan},  % scan records in order, in parallel fashion across tables
        {successors, [processor]},
        {partition, {partitioner, partitioning_fun, [adtech]}}, % user-defined partitioning function
        {communication, {concurrent, asynchronous}}
    ]},
    {layer, processor, [
        {type, worker},
        {predecessors, [adcom_scanner, adtech_scanner, tacoda_scanner]},
        {successors, [aggregator]},
        {function, {worker, worker_fun, []}},        % say, extracts relevant info from the event record and passes it on, with key equal to TID or ACID
        {distribution, local},
        {size, 36},                                  % 36 workers per layer per node
        {communication, {concurrent, asynchronous}}
    ]},
    {layer, aggregator, [
        {type, worker},
        {predecessors, [processor]},
        {successors, [user_profiles]},
        {function, {aggregator, aggregator_fun, []}}, % aggregates all fields with a common ACID/TID into the same entry in the Mnesia table
        {distribution, local},
        {size, 36},                                   % 36 workers per node
        {communication, {concurrent, asynchronous}}
    ]}, ...
]},
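Such a specification is plain Erlang terms, so (as a sketch, assuming the spec sits in a file of dot-terminated terms and that a start_layer/1 helper exists) it can be read and turned into running layers with file:consult/1:

%% Sketch: load a flow specification (Erlang terms) and start its layers.
%% The spec file path and start_layer/1 are assumptions for illustration.
-module(flow_loader_example).
-export([load/1]).

load(SpecFile) ->
    {ok, Terms} = file:consult(SpecFile),
    Layers = proplists:get_value(layers, Terms, []),
    [start_layer(LayerSpec) || LayerSpec <- Layers].

start_layer({layer, Name, Props}) ->
    Size = proplists:get_value(size, Props, 1),
    io:format("starting layer ~p with ~p workers per node~n", [Name, Size]),
    %% ... spawn the layer process and its workers here ...
    {Name, Size}.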
Iterative Flow - Concept Rank
Wikipedia Concept Rank Problem Space
- 3,091,923 concepts
- 42,205,936 links
$$PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$

where N is the number of concepts, d is the damping factor, M(p_i) is the set of concepts linking to p_i, and L(p_j) is the number of outbound links of p_j.
Concept Rank Results
[Figure: Concept Rank of Programming Languages on Wikipedia. Bar chart (logarithmic scale, 1 to 1000) of concept rank values for Erlang, LISP, Haskell, Python, Scala, Java, C, C#, Pascal, FORTRAN and Ada.]
CR Flow
CR Compute Worker
[Figure: one CR worker iteration.]
1. If phash2(CID) == Token, this worker owns concept CID and calls compute_cr(CID).
2. Look up backlinks(CID) in the bMap ETS table.
3. For each concept B in backlinks(CID): FL = length(forwlinks(B)) from the fMap ETS table, read CR(B) from the previous Mnesia table (CR6), and compute sum(CR(B)/FL).
4. Write the new rank into the next Mnesia table (CR7).
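A minimal sketch of that per-concept computation (table names bmap, fmap and the conceptrank tables follow the diagram; the damping constant and the record shapes are assumptions):

%% Sketch of one concept-rank update for a single concept CID, following
%% the worker diagram above. Record shapes ({CID, Links} in ETS and
%% {Table, CID, Rank} in Mnesia) are assumptions for illustration.
-module(cr_example).
-export([compute_cr/3]).

-define(DAMPING, 0.85).

compute_cr(CID, NumConcepts, PrevTable) ->          % e.g. PrevTable = conceptrank6
    [{CID, BackLinks}] = ets:lookup(bmap, CID),     % concepts linking to CID
    Sum = lists:sum([prev_rank(B, PrevTable) / out_degree(B) || B <- BackLinks]),
    (1 - ?DAMPING) / NumConcepts + ?DAMPING * Sum.

out_degree(B) ->
    [{B, ForwLinks}] = ets:lookup(fmap, B),         % forward links of B
    length(ForwLinks).

prev_rank(B, PrevTable) ->
    case mnesia:dirty_read(PrevTable, B) of
        [{_Tab, B, Rank}] -> Rank;
        [] -> 0.0
    end.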
Computation Metrics and Results
- memory use: 420m RAM, 63m resident memory
- Mnesia is distributed across nodes
- fMap and bMap are loaded into ETS tables on each node
  - fMap: 126M, 3,091,923 records
  - bMap: 121M, 2,479,969 records
- conceptrankX tables in Mnesia: 3,091,923 records, 31,374,390 words of memory each
Run times for 10 iterations:
- 1 worker process, 1 node: 50 min
- 10 processes, 1 node: 18.36 min
- 20 processes, 1 node: 23.48 min
- 5 processes, 1 node: 19.8 min
- 2 nodes, 10 processes/node: 14.83 min
- 3 nodes, 10 processes/node: 11.16 min (low network traffic, ~10 MB/s; fast response with modest CPU usage)
- 4 nodes, 10 processes/node: 11.83 min (medium-high to high network traffic, ~30 MB/s)
- 4 nodes, 15 processes/node: 12.67 min (medium-high network traffic, ~25 MB/s)
- 4 nodes, 5 processes/node: 8.83 min (high network traffic, ~35 MB/s)
- 4 nodes, 2 processes/node: 22.3 min (medium network traffic, ~15 MB/s across all four nodes)
- The trade-off is between network traffic caused by Mnesia table synchronization and local computation capacity on each node.
- 5 processes/node appears to minimize response time, at the expense of high network traffic. Reducing to 2 processes/node lowers network traffic but limits computing capacity (CPU utilization is the lowest of all configurations), so the system spends most of its time computing the ranks.
Concept Rank Flow: Tradeoffs
[Figure: CR flow run times by configuration (nodes, processes per node): (3,10) 11.16 min, (4,10) 11.83 min, (4,15) 12.67 min, (4,5) 8.83 min, (4,2) 22.30 min.]
Optimization
[Figure: feedback loop in which a Flow Optimizer sets flow parameters and a Monitor reports run status into a Monitor DB.]
Minimize:
- run time
- network traffic
- CPU utilization
- disk I/O
- cloud expense
- or some combination of the above
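As a sketch of how such an optimizer might sweep one parameter (run_flow/1 and the parameter name procs_per_node are hypothetical placeholders), the swipe results that follow were gathered in this spirit:

%% Sketch of a one-dimensional parameter swipe: run the flow with each
%% candidate setting, record the run time, and sort by the fastest.
%% run_flow/1 is a placeholder for actually executing the flow.
-module(optimizer_example).
-export([swipe/3]).

swipe(Param, From, To) ->
    Results = [{N, time_flow([{Param, N}])} || N <- lists:seq(From, To)],
    lists:keysort(2, Results).                 % best (shortest) run time first

time_flow(Settings) ->
    {MicroSecs, _Result} = timer:tc(fun() -> run_flow(Settings) end),
    MicroSecs / 1000000.                       % seconds

run_flow(Settings) ->
    %% placeholder: configure and run the flow with these settings,
    %% e.g. optimizer_example:swipe(procs_per_node, 1, 20).
    timer:sleep(10),
    {ok, Settings}.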
Run Time Local Minima
[Figure: swipe of processes per node from 1 to 20 (step 1) on 4 nodes; run time in seconds per setting, ranging roughly from 50 s to 172 s, with several local minima.]
Swipe Results
[Figure: two swipes of processes per node from 1 to 20 (step 1), each with a 4th-order polynomial fit. On 4 nodes the run times range roughly from 50 s to 172 s; on 8 nodes roughly from 38 s to 162 s. Axes: Processes per node (x), Time in secs (y).]
Swipe Ganglia Monitoring
Conclusion and Future
- Erlang is a very convenient and appropriate language and platform for data-parallel flows
- Building languages and platforms to facilitate easy flow specification makes sense
- A small Erlang team can do wonders
Questions, Comments?