Advanced Computer Networks

[DG04] Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Proc. of the 6th USENIX Conf. on OSDI '04, Dec. 2004
[WNS12] Wang, Ng, and Shaikh, “Programming Your Network at Run-Time for Big Data Applications,” Proc. of the 1st Workshop on HotSDN '12, 103-108, Aug. 2012

MapReduce
A programming model and run-time library for processing and generating large data sets that automates and hides the messy details of
• program parallelization
• input data partitioning
• execution distribution, scheduling, and load balancing across a set of machines
• fault tolerance
• inter-machine communication

Allows programmers inexperienced in parallel and distributed systems to utilize the resources of a large distributed system

Problem
Most data processing applications are
• conceptually straightforward, but
• input data is large (many terabytes), and
• computations must be distributed across 100s-1,000s of machines

Parallel programming made complicated, and the coding complex, by questions as to how to
• parallelize the computation,
• distribute the data,
• balance the load, and
• handle failures

Programming Model
A functional-style programming model, inspired by the map and reduce primitives of Lisp

Users specify a map function that
• extracts a key/value pair from each “logical” record in the input
• to compute a set of intermediate key/value pairs
and a reduce function that
• merges all intermediate values of a given intermediate key
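
The map/reduce split described above can be sketched with a word count. This is a minimal single-process illustration, not the library's API: `map_fn`, `reduce_fn`, `run_sequential`, and the sample records are all hypothetical names and data chosen for the example.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    # Emit one intermediate (word, 1) pair per word in the logical record.
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    # Merge all intermediate values of one intermediate key.
    return word, sum(counts)


def run_sequential(records, map_fn, reduce_fn):
    # Stand-in for the run-time library: group values by intermediate key,
    # then hand each key and an iterator over its values to reduce.
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return dict(reduce_fn(k, iter(vs)) for k, vs in sorted(groups.items()))


records = [("a.txt", "to be or not to be"), ("b.txt", "to do")]
word_counts = run_sequential(records, map_fn, reduce_fn)
```

The user writes only `map_fn` and `reduce_fn`; everything inside `run_sequential` is what a real run-time would distribute across machines.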

Re-execution as the primary mechanism for fault tolerance

Run-Time Library
For each set of map task output, groups together all intermediate values with the same intermediate key
• intermediate key/value pairs are partitioned across R reduce tasks by the intermediate keys (or parts thereof by user-supplied function, e.g., only the hostname of a URL), e.g., hash(key) mod R
• all intermediate values of the same intermediate key across all map tasks are sent to the same reduce task
• within a partition, the intermediate key/value pairs are sorted by intermediate keys

The intermediate values are streamed to the reduce function via an iterator
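
The partition, sort, and iterate steps above can be sketched as follows. The function names are hypothetical, and `zlib.crc32` is an assumption standing in for the unspecified hash (Python's built-in `hash` on strings changes between runs, so it would not give stable partitions).

```python
import zlib
from itertools import groupby
from operator import itemgetter
from urllib.parse import urlparse


def default_partition(key, R):
    # hash(key) mod R, with crc32 as a run-stable stand-in hash.
    return zlib.crc32(key.encode("utf-8")) % R


def hostname_partition(url, R):
    # User-supplied variant: hash only the hostname of a URL, so all
    # pages of one host land in the same reduce partition.
    return default_partition(urlparse(url).hostname or "", R)


def shuffle(pairs, R, partition=default_partition):
    partitions = [[] for _ in range(R)]
    for key, value in pairs:
        partitions[partition(key, R)].append((key, value))
    for part in partitions:
        part.sort(key=itemgetter(0))  # sorted by key within a partition
    return partitions


def grouped(partition):
    # Yield (key, iterator over that key's values) in key order.
    for key, group in groupby(partition, key=itemgetter(0)):
        yield key, (value for _, value in group)


pairs = [("http://a.org/x", 1), ("http://b.org/y", 1), ("http://a.org/z", 1)]
parts = shuffle(pairs, R=4, partition=hostname_partition)
totals = {}
for part in parts:
    for key, values in grouped(part):
        totals[key] = sum(values)
```

Each value iterator must be consumed before moving to the next key, which mirrors the streaming behaviour: reduce never needs all values of a key in memory at once.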

Examples
Distributed sort:
map (doc) → list(key, record)
reduce (list(key, record)) → list(key, record) // identity

The run-time library automatically partitions the intermediate key/record pairs by key before forwarding them to the reduce tasks
Within a partition, the run-time library already sorts by the intermediate keys
So the reduce function simply emits all pairs unchanged
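
A small sketch of why the identity reduce suffices. It assumes an order-preserving (range) partitioner so that concatenating the reduce outputs yields a total order; the slide does not say which partitioner is used, and the record layout, boundary keys, and function names here are hypothetical.

```python
def map_fn(record):
    # Hypothetical record layout: the first 10 characters are the sort key.
    return record[:10], record


def range_partition(key, boundaries):
    # Assumed order-preserving partitioner: partition i holds keys below
    # boundaries[i]; the last partition holds everything else.
    for i, bound in enumerate(boundaries):
        if key < bound:
            return i
    return len(boundaries)


def identity_reduce(pairs):
    # Emits all pairs unchanged.
    return list(pairs)


def distributed_sort(records, boundaries):
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        key, value = map_fn(record)
        partitions[range_partition(key, boundaries)].append((key, value))
    output = []
    for part in partitions:
        part.sort(key=lambda kv: kv[0])  # the run-time's per-partition sort
        output.extend(identity_reduce(part))
    return [value for _, value in output]


records = ["pear......:r1", "apple.....:r2", "zebra.....:r3", "mango.....:r4"]
sorted_records = distributed_sort(records, boundaries=["h", "q"])
```

With a plain hash(key) mod R partitioner each output file would still be sorted internally, but the files would not concatenate into one sorted sequence.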

Examples
Distributed search grep:
map (word, string) → list(string)
reduce (list(string)) → list(string) // identity

Per word/URL frequency:
map (word/url, string) → list(word/url, 1)
reduce (word/url, list(1)) → list(word/url, total)

Reverse web-link graph:
map (page) → list(url, page)
reduce (url, page) → list(url, list(page))

Inverted index:
map (docID) → list(word, docID)
reduce (word, docID) → list(word, list(docID))

Execution Model
[Figure annotations]
• input: M splits of 16-64 MB on Google File System (GFS), which stores data in triplicate, over distributed local disks
• each map worker and its input data (on GFS) co-located on the same host or the same switch whenever possible
• M and R should be >> number of workers, typically M=200K, R=5K, 2K workers
• Master tracks taskID, state, and workerID for both M map and R reduce tasks
• an optional combiner (reduce) function merges intermediate values with the same key
• map output: locations on local disks passed back to Master
• intermediate data partitioned and forwarded to reduce workers
• intermediate values sorted by key within each partition prior to running reduce
• reduce output: filenames specified by user, on globally visible GFS
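
The optional combiner noted above can be illustrated with word counts: each map task merges its own output before anything is forwarded, so fewer pairs cross the network. This is a sketch under the assumption that the reduce operation (here, a sum) is associative and commutative; all names and inputs are hypothetical.

```python
from collections import Counter


def map_fn(text):
    return [(word, 1) for word in text.split()]


def combine(pairs):
    # Combiner: the reduce logic applied to one map task's local output.
    merged = Counter()
    for key, value in pairs:
        merged[key] += value
    return sorted(merged.items())


map_outputs = [map_fn("to be or not to be"), map_fn("to do or not")]
raw_pairs = sum(len(out) for out in map_outputs)

combined = [combine(out) for out in map_outputs]
sent_pairs = sum(len(out) for out in combined)

# Reduce side: merge the already-combined partial counts.
totals = Counter()
for out in combined:
    for key, value in out:
        totals[key] += value
```

Here 10 raw pairs shrink to 8 forwarded pairs; on real inputs with heavily repeated keys the saving is far larger.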

Performance: Distributed Search
Distributed grep searching for 3-character pattern occurring in 92,337 out of 10^10 100-byte records (1 terabyte) takes 150 seconds
peak input scan rate: 30 GB/s
M = 15,000 (64 MB each)
R = 1
1764 workers, each on
• 2 GHz Intel Xeon
• 4 GB RAM
• 160 GB IDE HD
• 1 GigE
50-100 Gbps bisection bandwidth

Performance: Distributed Sort
Distributed sort of 10^10 100-byte records (1 terabyte), with 10-byte key, takes 891 seconds: map tasks done by 200 seconds, shuffling done by 600 seconds, all writes done by 850 seconds
peak input scan rate: 13 GB/s (compare: 30 GB/s for grep)
M = 15,000 (64 MB each)
R = 4,000
2-way replicated GFS (2 TB output)
1700 workers

Execution Model
Master task state: idle | in-progress | completed

Fault Handling
Master pings workers periodically
• if a worker fails, all the worker’s in-progress tasks (both map and reduce) are reset to idle for re-execution
• worker’s completed map tasks are also reset to idle for re-execution (why?)
  • because intermediate results on failed local disks can’t be accessed by reduce tasks
  • reduce workers will be notified of the replacement so that they can grab the data from the new map worker

If Master fails, user’s app would have to restart MapReduce computation
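
The worker-failure rule above can be sketched as a scan over the Master's task table. The `Task` record and `handle_worker_failure` are hypothetical names; the sketch also assumes completed reduce tasks are left alone because their output is already on globally visible GFS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: int
    kind: str                      # "map" or "reduce"
    state: str = "idle"            # idle | in-progress | completed
    worker_id: Optional[str] = None


def handle_worker_failure(tasks, failed_worker):
    """Reset the tasks that must be re-executed; return their IDs."""
    reset = []
    for task in tasks:
        if task.worker_id != failed_worker:
            continue
        lost_in_progress = task.state == "in-progress"
        # Completed map output lives on the failed worker's local disk.
        lost_map_output = task.state == "completed" and task.kind == "map"
        if lost_in_progress or lost_map_output:
            task.state, task.worker_id = "idle", None
            reset.append(task.task_id)
    return reset


tasks = [
    Task(0, "map", "completed", "w1"),
    Task(1, "map", "in-progress", "w2"),
    Task(2, "reduce", "in-progress", "w1"),
    Task(3, "reduce", "completed", "w1"),
]
reset_ids = handle_worker_failure(tasks, "w1")
```

Tasks 0 and 2 go back to idle; task 3 stays completed, and task 1 is untouched because it runs on a different worker.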

Fault Tolerance
Distributed sort: 891 secs
With 200 workers killed: 933 secs (+5%)

Fault Handling
Master ignores a completion message for an already completed map task
• optimization: nearing the completion of a MapReduce computation, the Master schedules backup executions of the remaining in-progress tasks
  • increases the computational resources used by a few percent
  • but reduces completion time by 30%

GFS guarantees atomic file renaming ⇒ each results file contains just the data produced by one execution of the reduce task

MapReduce run-time library detects and skips records that cause deterministic crashes

Optimization: Backup Tasks
With Backup Tasks: 891 secs
Without Backup Tasks: 1283 secs (+44%)
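
As a quick check, the quoted percentages follow from the reported sort times (891, 933, and 1283 seconds). The variable names below are only for illustration.

```python
baseline_secs = 891        # sort with backup tasks, no failures
killed_secs = 933          # sort with 200 workers killed
no_backup_secs = 1283      # sort without backup tasks

# Slowdown relative to the baseline run.
failure_overhead = (killed_secs - baseline_secs) / baseline_secs
no_backup_overhead = (no_backup_secs - baseline_secs) / baseline_secs

# Time saved by backup tasks, relative to the run without them.
backup_saving = (no_backup_secs - baseline_secs) / no_backup_secs
```

The two overheads round to +5% and +44%, and the saving comes to roughly 30%, matching the figure quoted for backup executions.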

Additional I/O
Users can add support for a new input type by providing a reader interface that knows how to split its data into meaningful ranges

Additional outputs from map and/or reduce functions must be atomic (do or do not, there’s no try) and idempotent (one for all and all for one)

Implementation on Data Center
How to configure a software-defined, data-center network to jointly optimize Hadoop (open source MapReduce+GFS) performance and network utilization

In particular, how to configure the network to support various combinations of data aggregation patterns

Parameters:
• network control architecture
• job scheduling
• network topology
• routing configuration

Architecture and Feasibility
[Figure labels: other app controller; multi-app cluster manager; network config + traffic demand; network info]

Feasibility considerations:
• sustainable switch flow table update rate
• scalability of SDN controller and sustainable network state update frequency
• consistent routing update latency (two phase?)
• inter-app network configuration coordination

Set up rack-to-rack flows to improve scalability

Current Shortcomings
Current approaches rely on estimation of traffic demand matrix

Inaccurate estimates can lead to
• optical circuits configured between the wrong source-destination pairs
• circuit flapping due to repeated corrections
• blocking among interdependent applications
• all resulting in poor application performance (e.g., almost 8× longer completion time)

Example
Router 0 has 3 optical ports
App needs to aggregate data from 8 workers
App cannot proceed until all data has been aggregated
Aggregation by 3 costs MapReduce operation 2.16 seconds
Aggregation by constructing a 2-level tree-topology using SDN completes aggregation in 480 ms (160 + 160 × 2, assume all edges are 10 Gbps and hosts can keep up)

Networked for Hadoop
Hadoop has a centralized job tracker to manage jobs and data
• job tracker has accurate info about placement of map and reduce tasks
• it also knows which map and reduce tasks belong to the same MapReduce job

Network configuration must be done twice:
1. based on initial map traffic info
2. based on map-to-reduce shuffle traffic, which is not available until map tasks are done

Hadoop Job Scheduling
Hadoop default job scheduling discipline schedules jobs in FIFO order, by order of arrival
• btw, fault-tolerance provided by Hadoop for free

Given network-layer info, Hadoop can make more informed decisions about job scheduling, e.g., map and reduce tasks can be scheduled in batches, priority given to earlier batches
• ensures that earlier tasks are not starved by later tasks
• aggregate traffic from multiple jobs to create long duration traffic suitable for circuit-switched path
  • doesn’t long-duration traffic make traffic engineering more difficult by increasing flow collision probability and reducing statistical multiplexing flexibility?

Hadoop Job Placement
Hadoop default placement algorithm:
• map tasks are placed by or near input data
• but reduce tasks are placed randomly, without any data locality consideration

Given network-layer info, Hadoop can make more informed decisions about job placement, e.g., by aggregating reduce tasks onto a minimum number of racks, to reduce ToR configuration
• doesn’t aggregating reduce tasks also concentrate the shuffling traffic, causing congestion?
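
One way to picture "aggregating reduce tasks onto a minimum number of racks" is a greedy fill of the racks with the most free slots first. This is only an illustrative sketch, not Hadoop's or the paper's algorithm; `place_reduce_tasks`, the rack names, and the slot counts are hypothetical, and the sketch ignores the congestion concern raised above.

```python
def place_reduce_tasks(num_tasks, free_slots):
    """Return {rack: tasks placed}, using as few racks as possible."""
    placement = {}
    remaining = num_tasks
    # Largest racks first; ties broken by rack name for determinism.
    by_capacity = sorted(free_slots.items(), key=lambda kv: (-kv[1], kv[0]))
    for rack, slots in by_capacity:
        if remaining == 0:
            break
        used = min(slots, remaining)
        if used:
            placement[rack] = used
            remaining -= used
    if remaining:
        raise ValueError("not enough free reduce slots")
    return placement


placement = place_reduce_tasks(10, {"rack1": 4, "rack2": 8, "rack3": 3})
```

Because every reduce task occupies one slot, filling the largest racks first uses the fewest racks; here 10 tasks fit on two racks instead of being spread over all three.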

Topology and Routing for Aggregation Traffic
Many-to-one aggregation uses multi-level tree as before, with high traffic hosts placed near the root

Overlapping aggregation pattern can be broken down into many-to-one (S1’→D1, S3’→D1) and many-to-many (S2’→{D1, D2}) aggregations

Many-to-Many Shuffling
Hypercube and Torus topologies from High-Performance Computing community have balanced structures that avoid hotspots and bottlenecks
⇒ suitable to exploit multi-pathing for shuffling traffic

But requires hosts to act as routers: routing functionalities must compete with apps for CPU, memory, and network bandwidth resources
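
The multi-pathing property of a hypercube can be seen in a few lines: node addresses are bit strings, neighbors differ in exactly one bit, and correcting the differing bits in different orders gives different shortest paths. The function names are hypothetical and the routing shown is a plain bit-fixing sketch, not the scheme from [WNS12].

```python
def neighbors(node, dim):
    # Flip each of the dim address bits in turn.
    return [node ^ (1 << bit) for bit in range(dim)]


def route(src, dst, dim, bit_order=None):
    """Fix differing address bits one at a time, in the given bit order."""
    order = bit_order if bit_order is not None else range(dim)
    path = [src]
    current = src
    for bit in order:
        if (current ^ dst) & (1 << bit):
            current ^= 1 << bit
            path.append(current)
    return path


path_a = route(0b000, 0b011, dim=3)
path_b = route(0b000, 0b011, dim=3, bit_order=[2, 1, 0])
```

Both paths take two hops from node 0 to node 3 but pass through different intermediate hosts (1 and 2), which is what lets shuffle traffic spread out; it also shows why every host ends up forwarding traffic for others.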

Future Work
Flow-level traffic engineering: accurate traffic demand and structural pattern from applications allow for tight and dynamic integration between applications and SDN controller, e.g., to split or re-route management and data flows on different routes