
Scalable Runtime Support for Data-Intensive Applications on the Single-Chip Cloud Computer

Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Institute of Computer Science (ICS)
Foundation for Research and Technology – Hellas (FORTH)
GR–70013, Heraklion, Crete, GREECE
{apapag,dsn}@ics.forth.gr

3rd MARC Symposium, 2011


Outline
- Motivation
- Background
  - Intel Single-Chip-Cloud
  - MapReduce
- Design
  - Outline
  - Implementation
- Experimental Analysis
  - Benchmarks
  - Speedup
  - Execution Time Breakdowns
  - SCC vs. Cell BE
- Conclusions


Motivation and Contributions

- We are in the transition from multi-core processors to many-core processors
- Programmers have to deal with:
  - many cores
  - many forms of implicit or explicit communication
  - many forms of synchronization
  - potential lack of cache coherence
- Contributions of this work:
  - First implementation of a high-level, domain-specific parallel programming model (Google's MapReduce) on a cache-based many-core processor with no cache coherence, based on explicit communication (SCC)
  - Evaluation showing that the Intel SCC effectively supports:
    - High-level programming models that hide communication, synchronization, and parallelization under the hood
    - Scalable execution of data-intensive applications


Intel SCC

- Many-core processor with 24 tiles, 2 IA (P54C) cores per tile
- Private 16 KB L1 instruction cache, private 16 KB L1 data cache, and private unified 256 KB L2 cache per core
- 16 KB message passing buffer (MPB) per tile, the only on-chip memory shared between cores
- Tiles organized in a 4×6 mesh network with 256 GB/s bisection bandwidth

[Figure: SCC chip layout. Tiles are connected by mesh routers (R); each tile contains two P54C cores with their 16 KB L1 caches, two 256 KB L2 caches with cache controllers (CC), the message passing buffer, a traffic generator, and a mesh interface unit (MIU). Four DDR memory controllers, the VRC, and the system interface sit on the chip periphery.]

MapReduce

- A framework for large-scale data processing
  - Programming model (API) and runtime system for a variety of parallel architectures
    - Clusters, SMPs, multi-cores, and GPUs, among others
- Based on functional programming language primitives
- Used extensively in real applications
  - Indexing systems, distributed grep, document clustering, machine learning, statistical machine translation
- Relies heavily on a scalable runtime system
  - Fault tolerance, parallelization, scheduling, synchronization, and communication


Example

Counting word occurrences in a set of documents:

Input: "Sally sells sea shells by the sea shore"

Map:          (sally,1) (sells,1) (sea,1) (shells,1) | (by,1) (the,1) (sea,1) (shore,1)
Group by key: (by,1) (sally,1) (sea,1:1) (sells,1) (the,1) (shore,1)
Reduce:       (by,1) (sally,1) (sea,2) (sells,1) (the,1) (shore,1)
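To make the example concrete, the following is a minimal word-count map function in C. The slides do not show the runtime's API, so the emit_intermediate() hook, its signature, and the printing stub below are illustrative assumptions rather than the actual interface.

/* Illustrative word-count map function. emit_intermediate() stands in
 * for whatever hook the SCC MapReduce runtime actually exposes; its
 * name, signature, and the printing stub are assumptions. */
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

static void emit_intermediate(const char *key, size_t keylen, int value)
{
    printf("(%.*s, %d)\n", (int)keylen, key, value);  /* stub: just print */
}

/* Map: scan a chunk of text and emit (word, 1) for every word found. */
static void wordcount_map(const char *chunk, size_t len)
{
    size_t i = 0;
    while (i < len) {
        while (i < len && !isalpha((unsigned char)chunk[i]))
            i++;                                      /* skip separators   */
        size_t start = i;
        while (i < len && isalpha((unsigned char)chunk[i]))
            i++;                                      /* consume one word  */
        if (i > start)
            emit_intermediate(chunk + start, i - start, 1);
    }
}

int main(void)
{
    const char text[] = "Sally sells sea shells by the sea shore";
    wordcount_map(text, sizeof(text) - 1);
    return 0;
}

Run on the example sentence, this prints one (word, 1) pair per word; the group and reduce stages would then aggregate them as in the diagram above.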


Design

Seven-stage runtime system for MapReduce:
- Map
- Combine (optional)
- Partition
- Group
- Reduce
- Sort (optional)
- Merge (optional)
(A sketch of a possible user-facing interface follows.)
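The slides list the stages but not the programming interface. As a rough illustration, a seven-stage runtime like this could be driven by a specification structure carrying the user callbacks, with the optional stages left NULL; all names and signatures below are hypothetical, not the actual SCC runtime API.

/* Hypothetical specification for the seven-stage MapReduce runtime.
 * Field and function names are illustrative assumptions; the slides
 * only list the stages, not the concrete interface. */
#include <stddef.h>

typedef struct {
    const char *key;
    size_t      keylen;
    void       *val;
} kv_pair_t;

typedef struct {
    /* input split handed to each core's map function */
    void  *input;
    size_t input_size;

    /* required stages */
    void (*map)(const void *chunk, size_t len);
    void (*reduce)(const char *key, size_t keylen,
                   void **vals, size_t nvals);

    /* optional stages: leave NULL to skip, matching the stage list above */
    void (*combine)(const char *key, size_t keylen,
                    void **vals, size_t nvals);
    int  (*compare)(const kv_pair_t *a, const kv_pair_t *b); /* sort */
    /* the merge of per-core sorted outputs is handled by the runtime */
} mapreduce_spec_t;

/* Hypothetical entry point: runs map, (combine), partition, group,
 * reduce, (sort), (merge) and returns the output buffer. */
int mapreduce_run(const mapreduce_spec_t *spec,
                  kv_pair_t **out, size_t *nout);

The map and reduce callbacks are mandatory; combine enables the optional combine stage, and compare covers the optional sort stage when the byte-wise radix order is not the desired final order.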


MapReduce Map

[Figure: two cores, each running Map on a local chunk of the input text and producing its own intermediate (word, 1) pairs]

- Each core executes the user-defined map function on chunks of input data located in local memory
- The map function emits one or more intermediate key-value pairs


- Intermediate key-value pairs are stored in a contiguous buffer
  - The runtime preallocates large chunks of memory (64 MB) for intermediate data buffers
  - More buffering space is allocated on demand, if needed
  - This allocation strategy reduces memory management overhead


- Each core produces as many intermediate data partitions as the total number of cores (the buffering scheme is sketched below)
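A minimal sketch, in C, of the intermediate-data layout described above: one contiguous, preallocated buffer per destination core, grown on demand. The 64 MB preallocation and the one-partition-per-core layout come from the slides; the structure, the names, and the doubling growth policy are assumptions (error handling omitted).

/* Sketch of the intermediate-data layout: each core keeps one growable
 * buffer per destination core, preallocated in large chunks so that
 * emits rarely hit the allocator. Names and growth policy are
 * illustrative assumptions. */
#include <stdlib.h>
#include <string.h>

#define PREALLOC_BYTES (64u * 1024u * 1024u)   /* 64 MB, as on the slide */

typedef struct {
    char  *data;   /* contiguous buffer of serialized (key, value) pairs */
    size_t used;
    size_t cap;
} partition_t;

/* One partition per destination core. */
static partition_t *partitions_init(int ncores)
{
    partition_t *p = calloc((size_t)ncores, sizeof(*p));
    for (int i = 0; i < ncores; i++) {
        p[i].data = malloc(PREALLOC_BYTES);
        p[i].cap  = PREALLOC_BYTES;
    }
    return p;
}

/* Append one serialized pair to the partition owned by `dest`;
 * grow the buffer on demand, doubling to amortize reallocation. */
static void partition_emit(partition_t *p, int dest,
                           const void *pair, size_t len)
{
    partition_t *dst = &p[dest];
    if (dst->used + len > dst->cap) {
        while (dst->used + len > dst->cap)
            dst->cap *= 2;
        dst->data = realloc(dst->data, dst->cap);
    }
    memcpy(dst->data + dst->used, pair, len);
    dst->used += len;
}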


MapReduce Combine

[Figure: each per-core partition from the map stage is combined locally, e.g. (by,1)(by,1) becomes (by,2) on Core 0 and (by,1)(by,1)(by,1) becomes (by,3) on Core 1]

- Optional stage, executed if the user provides a combiner function
- Locally reduces the size of each partition produced during the map stage


- The combiner takes as input a key and a list of partially aggregated intermediate values associated with that key
- It produces a new intermediate key-value pair from the intermediate key and its corresponding list of values (a word-count combiner is sketched below)
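For word count, the combiner is just a local sum over one key's partial counts, emitting a single pair per key and per core. The callback name and signature below mirror the hypothetical map interface sketched earlier and are likewise assumptions, not the runtime's real API.

/* Illustrative word-count combiner: collapses the list of counts that
 * the map stage produced for one key on this core into a single pair.
 * emit_intermediate() is the same hypothetical hook as before. */
#include <stddef.h>

void emit_intermediate(const char *key, size_t keylen, int value);

void wordcount_combine(const char *key, size_t keylen,
                       const int *vals, size_t nvals)
{
    int sum = 0;
    for (size_t i = 0; i < nvals; i++)
        sum += vals[i];                  /* partial counts from map      */
    emit_intermediate(key, keylen, sum); /* one pair per key, per core   */
}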


MapReduce Partition

[Figure: all-to-all exchange of the per-core partitions P0–P3 over successive iterations (Iter. 0, 1, 2)]

- Requires an all-to-all exchange between cores
- Data partitions generated during the map stage may differ in size
  - First, execute an all-to-all exchange of the sizes of each partition
  - Knowing the size of each partition, execute a second all-to-all exchange with the actual data


- Let p be the number of available cores and rank the core ID. The exchange algorithm uses p − 1 steps; in each step k, core rank receives data from core rank − k and sends data to core rank + k (indices taken modulo p). A sketch of this exchange follows below.
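A sketch of the two-phase exchange in C, assuming wrap-around ranks and placeholder scc_send()/scc_recv() primitives. On the SCC the real implementation would go through the on-chip message passing buffers (e.g. via RCCE) and would have to pair sends and receives carefully to avoid blocking; this sketch glosses over that.

/* Sketch of the two-phase all-to-all exchange: first the partition
 * sizes, then the data, in p-1 steps each. scc_send()/scc_recv() are
 * placeholders, not the actual SCC communication API. */
#include <stddef.h>
#include <stdlib.h>

void scc_send(int dest, const void *buf, size_t len);  /* placeholder */
void scc_recv(int src,  void *buf, size_t len);        /* placeholder */

void exchange_partitions(int rank, int p,
                         char **send_buf, size_t *send_len,   /* [p] */
                         char **recv_buf, size_t *recv_len)   /* [p] */
{
    /* Phase 1: all-to-all exchange of partition sizes. */
    for (int k = 1; k < p; k++) {
        int to   = (rank + k) % p;
        int from = (rank - k + p) % p;
        scc_send(to,   &send_len[to],   sizeof(size_t));
        scc_recv(from, &recv_len[from], sizeof(size_t));
    }

    /* Phase 2: knowing the sizes, exchange the actual partition data. */
    for (int k = 1; k < p; k++) {
        int to   = (rank + k) % p;
        int from = (rank - k + p) % p;
        recv_buf[from] = malloc(recv_len[from]);
        scc_send(to,   send_buf[to],   send_len[to]);
        scc_recv(from, recv_buf[from], recv_len[from]);
    }

    /* The partition destined for this core never leaves it. */
    recv_buf[rank] = send_buf[rank];
    recv_len[rank] = send_len[rank];
}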


MapReduce Group

- Groups all (key, value) pairs with the same key
- Uses radix sort instead of a conventional merge sort (sketched below)
  - Radix sort sorts strings of bytes and cannot use a user-defined comparator
  - If radix sort does not correctly order the application's native key type, the output is re-sorted with a user-specified compare function
  - Conventional comparison-based sorting algorithms have complexity O(n log n); radix sort has complexity O(kn), where k is the size of the key in bytes
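A sketch of the byte-wise sort used for grouping: an LSD radix sort over fixed-width keys, one stable counting-sort pass per key byte, which gives the O(kn) cost quoted above. The record layout and the fixed 8-byte key are assumptions; the slides only say that keys are sorted as strings of bytes.

/* Sketch of an LSD radix sort over fixed-width byte keys, standing in
 * for the group stage's sort. The record format and KEYLEN are
 * illustrative assumptions. */
#include <stdlib.h>
#include <string.h>

#define KEYLEN 8   /* assumed fixed key size in bytes */

typedef struct {
    unsigned char key[KEYLEN];
    int value;
} record_t;

void radix_sort(record_t *a, size_t n)
{
    record_t *tmp = malloc(n * sizeof(*tmp));

    /* One stable counting-sort pass per byte, last byte first: O(KEYLEN * n). */
    for (int b = KEYLEN - 1; b >= 0; b--) {
        size_t count[257] = {0};

        for (size_t i = 0; i < n; i++)
            count[a[i].key[b] + 1]++;       /* histogram of byte values   */
        for (int c = 0; c < 256; c++)
            count[c + 1] += count[c];       /* prefix sums -> start offsets */
        for (size_t i = 0; i < n; i++)
            tmp[count[a[i].key[b]]++] = a[i];  /* stable scatter          */

        memcpy(a, tmp, n * sizeof(*a));
    }
    free(tmp);
}

For fixed-length keys, processing bytes from least to most significant with a stable pass yields the full byte-wise (lexicographic) order after the last pass.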


MapReduce Reduce

[Figure: each core runs Reduce on its grouped keys, e.g. Core 0 reduces the values for "by" and Core 1 the values for "the"]

- The group stage exports distinct keys, each with a list of corresponding values
- The reduce stage executes the user-defined aggregation function on each (key, list of values) pair


- The reduce function emits one or more output key-value pairs (a word-count reduce function is sketched below)
  - The total output size is known prior to reduction, so the output buffer is preallocated
  - This minimizes memory management overhead
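Completing the word-count example, the reduce function sums the grouped counts for one key and emits a single output pair. The emit_output() hook, standing in for the write into the preallocated output buffer, is again an assumed interface.

/* Illustrative word-count reduce function: sums the grouped counts for
 * one key and emits a single output pair. emit_output() is a
 * hypothetical runtime hook, not the real interface. */
#include <stddef.h>

void emit_output(const char *key, size_t keylen, int value);

void wordcount_reduce(const char *key, size_t keylen,
                      const int *vals, size_t nvals)
{
    int total = 0;
    for (size_t i = 0; i < nvals; i++)
        total += vals[i];
    emit_output(key, keylen, total);   /* one final pair per distinct key */
}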


MapReduce Sort and Merge

[Figure: optional sort and merge stages; the sorted partitions P0–P3 are merged pairwise over successive steps (Step 0 through Step 2) into a single output buffer]


Benchmarks

- Histogram (partition-dominated): counts the frequency of occurrence of each RGB color component in an image file
- Word Count (partition-dominated): counts the number of occurrences of each word in a text file
- Kmeans (map-dominated): creates clusters from a set of data points
- Linear Regression (map-dominated): computes a line of best fit for a set of points, given their 2D coordinates

Configuration:
- Tiles run at 533 MHz
- Mesh interconnect runs at 800 MHz
- DRAM runs at 800 MHz


Speedup

[Figure: two plots of speedup vs. number of cores, with and without a combiner, for Histogram, WordCount, KMeans, and Linear Regression against the ideal]

- The combiner function improves scalability
- KMeans and Linear Regression are map-dominated benchmarks
- Superlinear speedup occurs because the complexity of the group stage decreases exponentially with the number of cores


Execution Time Breakdowns

[Figure: execution time in seconds for Histogram, KMeans, WordCount, and LinearRegression, broken down into Map, Combine, Partition, Group, Reduce, Sort, and Merge stages; left bars with combiner, right bars without]

- Using a combiner function reduces execution time
  - The partition stage does not scale
  - The combiner minimizes total partition time and group time


SCC vs. Cell BE

[Figure: WordCount execution time in seconds vs. number of cores, for the SCC with and without a combiner, compared against a Cell blade with 1 and 2 processors]

- The QS22 blade consists of 2 Cell BE processors at 3.2 GHz
- Each processor has 8 SPEs (accelerators)
- WordCount benchmark with 60 MB input size
- A single SCC node outperforms the dual-Cell blade by up to 1.87×


Related Work

- Other ports of MapReduce exist for clusters, SMPs, multicores, and GPUs (HPCA'07, PACT'08, IISWC'09, ICPP'10)
- Shared-memory ports are based on shared data structures in a cache-coherent address space
  - The SCC port is based on distributed data structures and scalable exchange algorithms, while utilizing caches for fast message exchange
- Distributed-memory ports are based on generic sorting and grouping algorithms
  - The SCC port is based on combiners and a radix sort algorithm


Conclusions

- Our implementation of MapReduce on the Intel SCC demonstrates:
  - The feasibility of implementing high-level, domain-specific parallel programming models that hide explicit communication
  - SCC chip scalability when using optimized, chip-specific global communication algorithms
  - Good adaptivity to diverse workloads: map-dominated, partition-dominated, and group-dominated


Thank you! The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] under the I-CORES project, grant agreement no 224759.


Appendix

Radix Sort execution time

[Figure: radix sort execution time in seconds vs. number of items, up to 60,000]
