Scalable Runtime Support for Data-Intensive ... - Semantic Scholar

Motivation Background Design Experimental Analysis Conclusions

Scalable Runtime Support for Data-Intensive Applications on the Single-Chip Cloud Computer Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS Institute of Computer Science (ICS) Foundation for Research and Technology – Hellas (FORTH) GR–70013, Heraklion, Crete, GREECE {apapag,dsn}@ics.forth.gr

3rd MARC Symposium, 2011

Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS

Scalable MapReduce on the SCC. MARC3 Symposium.

1 / 29


Outline Motivation Background Intel Single-Chip-Cloud MapReduce Design Outline Implementation Experimental Analysis Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE Conclusions Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS


2 / 29


Motivation and Contributions I

I

We are on the transition from multi-core processors to many-core processors Programmers have to deal with: I I I I

I

many cores many forms of implicit or explicit communication many forms of synchronization potential lack of cache coherence

Contributions of this work: I

I

First implementation of a high-level domain-specific parallel programming model (Google’s MapReduce) on a cache-based many-core processor with no cache coherence, based on explicit communication (SCC) Evaluation showing that the Intel SCC supports effectively: I

I

High-level programming models that hide communication, synchronization, parallelization under the hood Scalable execution of data-intensive applications



3 / 29


Intel Single-Chip-Cloud MapReduce



4 / 29



Intel SCC

I

I

Private L1 instruction cache of 16 KB, private L1 data cache of 16 KB, private unified L2 cache of 256 KB, per core 16 KB message passing buffer (MPB) per tile (only on-chip memory shared between cores)

P54C (16KB each L1)

R

R

R

VRC


CC

P54C FSB

R

Tile

Tile

Tile

Tile

R

R

R

R

Traffic Gen

L2

Message Passing Buffer

L2

Tile

Tile

Tile

R

R

R

R

Tile

Tile

Tile

Tile

To Router

MIU

CC 256KB

Tile

Tile

R

R

R

R

Tile

Tile

Tile

Tile

R

R

R

R

Tile

Tile

Tile

Tile

R

R

R

R

Tile

Tile

DDR MC

Tiles organized in a 4×6 mesh network with 256 GB/s bisection bandwidth

256KB

Tile

Tile

DDR MC

I

P54C (16KB each L1)

DDR MC

Many-core processor with 24 tiles, 2 IA cores per tile

DDR MC

I

System Interface


5 / 29





6 / 29



MapReduce

I

A framework for large-scale data processing I

Programming model (API) and runtime system for a variety of parallel architectures

I

Based of functional programming language primitives

I

I

Used extensively in real applications I

I

Clusters, SMPs, multi-cores, GPUs, among others

Indexing system, distributed grep, document clustering, machine learning, statistical machine translation

Relies heavily on a scalable runtime system I

Fault-tolerance, parallelization, scheduling, synchronization and communication



7 / 29



Example Sally sells

sea

shells

by the

sea

shore

sally,1| sells, 1

sea, 1

shells, 1

by, 1| the, 1

sea, 1

shore, 1

Map

Group By Key

by, 1

sally, 1

sea, 1:1

sells, 1

the, 1

shore, 1

by, 1

sally, 1

sea, 2

sells, 1

the, 1

shore, 1

Reduce

Counting word occurrences in a set of documents Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS


8 / 29


Outline Implementation



9 / 29



Design

Seven-stage runtime system for MapReduce: I

Map

I

Combine (optional)

I

Partition

I

Group

I

Reduce

I

Sort (optional)

I

Merge (optional)



10 / 29





11 / 29



MapReduce Map

Core 0

Core 1

by the the by

by the by the by

Map

by,1 | by,1 I

I

Map

the,1|the,1

by,1|by,1|by,1

the,1|the,1

Each core executes the user-defined map function on chunks of input data, located in local memory Map function emits one or more intermediate key-value pairs



12 / 29



MapReduce Map

Core 0

Core 1

by the the by

by the by the by

Map

by,1 | by,1 I

Map

the,1|the,1

by,1|by,1|by,1

the,1|the,1

Intermediate key-value pairs stored in a contiguous buffer I

I I

Runtime preallocates large chunks of memory (64 MB) for intermediate data buffers More buffering space allocated on demand, if needed Allocation strategy reduces memory management overhead



12 / 29



MapReduce Map

Core 0

Core 1

by the the by

by the by the by

Map

by,1 | by,1

I

Map

the,1|the,1

by,1|by,1|by,1

the,1|the,1

Each core produces as many intermediate data partitions as the total number of cores



12 / 29



MapReduce Combine

Core 0

Core 1

by,1 | by,1

the,1|the,1

by,1|by,1|by,1

the,1|the,1

Combine

Combine

Combine

Combine

by,2

the,2

by,3

the,2

I I

Optional stage executed if user provides a combiner function Reduces locally the size of each partition produced during the map stage



13 / 29



MapReduce Combine

Core 0

Core 1

by,1 | by,1

the,1|the,1

by,1|by,1|by,1

the,1|the,1

Combine

Combine

Combine

Combine

by,2

the,2

by,3

the,2

I I

Takes as input a key and a list of partially aggregated intermediate values associated with that key Produces a new intermediate key-value pair based on intermediate key and its corresponding list of values



13 / 29



MapReduce Partition

I I

Iter. 0

P0

P1

P2

P3

Iter. 1

P0

P1

P2

P3

Iter. 2

P0

P1

P2

P3

Requires an all-to-all exchange between cores Data partitions generated during the map stage may be different in size I I

First execute an all-to-all exchange of the sizes of each partition Knowing the size of each partition, execute a second all-to-all exchange with the actual data



14 / 29



MapReduce Partition

I

Iter. 0

P0

P1

P2

P3

Iter. 1

P0

P1

P2

P3

Iter. 2

P0

P1

P2

P3

Let p be the number of available cores and rank the core ID. This algorithm uses p − 1 steps and in each step k , core rank receives data from core rank − k and sends data to core rank + k .



14 / 29



MapReduce Group

I I

Groups all (key, value) pairs with the same key Use radix sort instead of conventional merge sort I

I

I

Radix sort sorts strings of bytes and can not use a user-defined comparator for sorting If radix sort does not sort native application type, sort the output using a user-specified compare function Conventional sorting algorithms have complexity O(nlogn). Radix sort has complexity O(kn) where k is the size of the key in bytes.



15 / 29



MapReduce Reduce

Core 0 by,2 | by,3

I I

Core 1 the,2 | the,2

Reduce

Reduce

by,2

the,2

Group stage exports distinct keys with a list of corresponding values Reduce stage executes user-defined aggregation function on each key-list(of values) pair



16 / 29



MapReduce Reduce

Core 0 by,2 | by,3

I

Core 1 the,2 | the,2

Reduce

Reduce

by,2

the,2

Reduce function emits one or more output key-value pairs I

I

Total output size known prior to reduction, therefore output buffer is preallocated Minimizes memory management overhead



16 / 29



MapReduce Sort and Merge

Step 0

P0

Step 1

P0

Step 2

P0

P1

P2

P3

P2

Output Buffer Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS


17 / 29


Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE



18 / 29



Benchmarks I

Histogram (partition-dominated) counts the frequency of occurrences of each RGB color component in an image file

I

Word Count (partition-dominated) counts the number of occurrences of each word in a text file

I

Kmeans (map-dominated) creates clusters from a set of data points

I

Linear Regression (map-dominated) computes a line of best fit for a set of points, given their 2D coordinates

Configuration: I

Tiles run at 533MHz

I

Mesh interconnect runs at 800MHz

I

DRAM runs at 800MHz



19 / 29





20 / 29

16

16

12

12

Speedup

Speedup


8 4

8

0 0

5 10 15 20 25 30 35 40 45 50

0

5 10 15 20 25 30 35 40 45 50

Cores Number

Cores Number

Speedup with Combiner

Speedup without Combiner

Combiner function improves scalability I

I

Histogram WordCount KMeans Linear Regression ideal

4

0

I


Kmeans and Linear Regression are map-dominated benchmarks

Superlinear speedup because complexity of the group stage decreases exponentially with the number of cores



21 / 29





22 / 29



8

Merge Sort Reduce Group Partition Combine Map

Seconds

6

4

2

0

Histogram

KMeans

WordCount

LinearRegression

Left bars with combiner, right without combiner I

Using a combiner function reduces execution time I I

Partition stage does not scale Combiner minimizes total partition time and group time



23 / 29





24 / 29



50

Seconds

40

SCC - w combiner SCC - w/o combiner Cell Blade - 1 processor Cell Blade - 2 processors

30 20 10 0 0

5

10

15

20

25

30

35

40

45

50

Cores Number

I

QS22 Blade consists of 2 Cell BE Processors at 3.2 GHz

I

Each processor has 8 SPEs (accelerators)

I

WordCount benchmark with 60MB input size

I

Single-SCC nodes outperforms dual-Cell blade by up to 1.87×



25 / 29


Related Work

I

I

Other ports of MapReduce on clusters, SMPs, multicores and GPUs (HPCA07,PACT08,IISWC09,ICPP10) Shared-memory ports based on shared data structures in cache-coherent address space I

I

SCC port based on distributed data structures and scalable exchange algorithms, while utilizing caches for fast message exchange

Distributed-memory ports based on generic sorting and group algorithms I

SCC port based on combiners and radix sort algorithm



26 / 29


Conclusions

I

Our implementation of MapReduce on the Intel SCC demonstrates: I

I

I

Feasibility of implementing high-level, domain-specific parallel programming models that hide explicit communication SCC chip scalability when using optimized chip-specific global communication algorithms Good adaptivity to diverge workloads: map-dominated, partition-dominated, group-dominated



27 / 29


Thank you! The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] under the I-CORES project, grant agreement no 224759.



28 / 29

Appendix

Radix Sort execution time

4.0 3.5

Seconds

3.0 2.5 2.0 1.5 1.0 0.5 0.0 0

20000

40000

60000

Number of Items



29 / 29

Scalable Runtime Support for Data-Intensive ... - Semantic Scholar

Scalable Runtime Support for Data-Intensive ... - Semantic Scholar

Suggest Documents

Common Runtime Support for Assertions - Semantic Scholar

Network Support for Scalable Distributed ... - Semantic Scholar

Granules: A Lightweight Runtime for Scalable ... - Semantic Scholar

MOLAR: Adaptive Runtime Support for High-End ... - Semantic Scholar

Programming Model and Runtime Support for ... - Semantic Scholar

Adaptive Runtime Support for Direct Simulation ... - Semantic Scholar

E cient Runtime Support for Parallelizing Block ... - Semantic Scholar

E cient Runtime Support for Parallelizing Block ... - Semantic Scholar

Olan: A Language and Runtime Support for ... - Semantic Scholar

Runtime Semantic Interoperability for Gathering ... - Semantic Scholar

Scalable Network Support for 3D Virtual Shopping ... - Semantic Scholar

WebOS: Software Support for Scalable Web Services - Semantic Scholar

Scalable Scheduling Support for Loss and Delay ... - Semantic Scholar

A virtual memory based runtime to support multi ... - Semantic Scholar

Dynamically Scoped Functions for Runtime ... - Semantic Scholar

Speculation for Parallelizing Runtime Checks - Semantic Scholar

Development and Runtime Environment for ... - Semantic Scholar

Dataflow Graph Runtime - Semantic Scholar

Formal Methods @ Runtime - Semantic Scholar

Formal Methods @ Runtime - Semantic Scholar

Open Runtime Platform - Semantic Scholar

Runtime Support for Virtual BSP Computer - CiteSeerX

Runtime Filesystem Support for Reconfigurable FPGA ... - CiteSeerX

Support for Dynamic Trading and Runtime