Foundation for Research and Technology â Hellas (FORTH). GRâ70013 .... Conclusions. Intel Single-Chip-Cloud. MapReduce. Example. Sally sells sea shells.
Motivation Background Design Experimental Analysis Conclusions
Scalable Runtime Support for Data-Intensive Applications on the Single-Chip Cloud Computer Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS Institute of Computer Science (ICS) Foundation for Research and Technology – Hellas (FORTH) GR–70013, Heraklion, Crete, GREECE {apapag,dsn}@ics.forth.gr
3rd MARC Symposium, 2011
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
1 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Motivation Background Intel Single-Chip-Cloud MapReduce Design Outline Implementation Experimental Analysis Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE Conclusions Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
2 / 29
Motivation Background Design Experimental Analysis Conclusions
Motivation and Contributions I
I
We are on the transition from multi-core processors to many-core processors Programmers have to deal with: I I I I
I
many cores many forms of implicit or explicit communication many forms of synchronization potential lack of cache coherence
Contributions of this work: I
I
First implementation of a high-level domain-specific parallel programming model (Google’s MapReduce) on a cache-based many-core processor with no cache coherence, based on explicit communication (SCC) Evaluation showing that the Intel SCC supports effectively: I
I
High-level programming models that hide communication, synchronization, parallelization under the hood Scalable execution of data-intensive applications
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
3 / 29
Motivation Background Design Experimental Analysis Conclusions
Intel Single-Chip-Cloud MapReduce
Outline Motivation Background Intel Single-Chip-Cloud MapReduce Design Outline Implementation Experimental Analysis Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE Conclusions Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
4 / 29
Motivation Background Design Experimental Analysis Conclusions
Intel Single-Chip-Cloud MapReduce
Intel SCC
I
I
Private L1 instruction cache of 16 KB, private L1 data cache of 16 KB, private unified L2 cache of 256 KB, per core 16 KB message passing buffer (MPB) per tile (only on-chip memory shared between cores)
P54C (16KB each L1)
R
R
R
VRC
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
CC
P54C FSB
R
Tile
Tile
Tile
Tile
R
R
R
R
Traffic Gen
L2
Message Passing Buffer
L2
Tile
Tile
Tile
R
R
R
R
Tile
Tile
Tile
Tile
To Router
MIU
CC 256KB
Tile
Tile
R
R
R
R
Tile
Tile
Tile
Tile
R
R
R
R
Tile
Tile
Tile
Tile
R
R
R
R
Tile
Tile
DDR MC
Tiles organized in a 4×6 mesh network with 256 GB/s bisection bandwidth
256KB
Tile
Tile
DDR MC
I
P54C (16KB each L1)
DDR MC
Many-core processor with 24 tiles, 2 IA cores per tile
DDR MC
I
System Interface
Scalable MapReduce on the SCC. MARC3 Symposium.
5 / 29
Motivation Background Design Experimental Analysis Conclusions
Intel Single-Chip-Cloud MapReduce
Outline Motivation Background Intel Single-Chip-Cloud MapReduce Design Outline Implementation Experimental Analysis Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE Conclusions Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
6 / 29
Motivation Background Design Experimental Analysis Conclusions
Intel Single-Chip-Cloud MapReduce
MapReduce
I
A framework for large-scale data processing I
Programming model (API) and runtime system for a variety of parallel architectures
I
Based of functional programming language primitives
I
I
Used extensively in real applications I
I
Clusters, SMPs, multi-cores, GPUs, among others
Indexing system, distributed grep, document clustering, machine learning, statistical machine translation
Relies heavily on a scalable runtime system I
Fault-tolerance, parallelization, scheduling, synchronization and communication
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
7 / 29
Motivation Background Design Experimental Analysis Conclusions
Intel Single-Chip-Cloud MapReduce
Example Sally sells
sea
shells
by the
sea
shore
sally,1| sells, 1
sea, 1
shells, 1
by, 1| the, 1
sea, 1
shore, 1
Map
Group By Key
by, 1
sally, 1
sea, 1:1
sells, 1
the, 1
shore, 1
by, 1
sally, 1
sea, 2
sells, 1
the, 1
shore, 1
Reduce
Counting word occurrences in a set of documents Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
8 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Implementation
Outline Motivation Background Intel Single-Chip-Cloud MapReduce Design Outline Implementation Experimental Analysis Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE Conclusions Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
9 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Implementation
Design
Seven-stage runtime system for MapReduce: I
Map
I
Combine (optional)
I
Partition
I
Group
I
Reduce
I
Sort (optional)
I
Merge (optional)
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
10 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Implementation
Outline Motivation Background Intel Single-Chip-Cloud MapReduce Design Outline Implementation Experimental Analysis Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE Conclusions Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
11 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Implementation
MapReduce Map
Core 0
Core 1
by the the by
by the by the by
Map
by,1 | by,1 I
I
Map
the,1|the,1
by,1|by,1|by,1
the,1|the,1
Each core executes the user-defined map function on chunks of input data, located in local memory Map function emits one or more intermediate key-value pairs
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
12 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Implementation
MapReduce Map
Core 0
Core 1
by the the by
by the by the by
Map
by,1 | by,1 I
Map
the,1|the,1
by,1|by,1|by,1
the,1|the,1
Intermediate key-value pairs stored in a contiguous buffer I
I I
Runtime preallocates large chunks of memory (64 MB) for intermediate data buffers More buffering space allocated on demand, if needed Allocation strategy reduces memory management overhead
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
12 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Implementation
MapReduce Map
Core 0
Core 1
by the the by
by the by the by
Map
by,1 | by,1
I
Map
the,1|the,1
by,1|by,1|by,1
the,1|the,1
Each core produces as many intermediate data partitions as the total number of cores
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
12 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Implementation
MapReduce Combine
Core 0
Core 1
by,1 | by,1
the,1|the,1
by,1|by,1|by,1
the,1|the,1
Combine
Combine
Combine
Combine
by,2
the,2
by,3
the,2
I I
Optional stage executed if user provides a combiner function Reduces locally the size of each partition produced during the map stage
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
13 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Implementation
MapReduce Combine
Core 0
Core 1
by,1 | by,1
the,1|the,1
by,1|by,1|by,1
the,1|the,1
Combine
Combine
Combine
Combine
by,2
the,2
by,3
the,2
I I
Takes as input a key and a list of partially aggregated intermediate values associated with that key Produces a new intermediate key-value pair based on intermediate key and its corresponding list of values
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
13 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Implementation
MapReduce Partition
I I
Iter. 0
P0
P1
P2
P3
Iter. 1
P0
P1
P2
P3
Iter. 2
P0
P1
P2
P3
Requires an all-to-all exchange between cores Data partitions generated during the map stage may be different in size I I
First execute an all-to-all exchange of the sizes of each partition Knowing the size of each partition, execute a second all-to-all exchange with the actual data
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
14 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Implementation
MapReduce Partition
I
Iter. 0
P0
P1
P2
P3
Iter. 1
P0
P1
P2
P3
Iter. 2
P0
P1
P2
P3
Let p be the number of available cores and rank the core ID. This algorithm uses p − 1 steps and in each step k , core rank receives data from core rank − k and sends data to core rank + k .
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
14 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Implementation
MapReduce Group
I I
Groups all (key, value) pairs with the same key Use radix sort instead of conventional merge sort I
I
I
Radix sort sorts strings of bytes and can not use a user-defined comparator for sorting If radix sort does not sort native application type, sort the output using a user-specified compare function Conventional sorting algorithms have complexity O(nlogn). Radix sort has complexity O(kn) where k is the size of the key in bytes.
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
15 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Implementation
MapReduce Reduce
Core 0 by,2 | by,3
I I
Core 1 the,2 | the,2
Reduce
Reduce
by,2
the,2
Group stage exports distinct keys with a list of corresponding values Reduce stage executes user-defined aggregation function on each key-list(of values) pair
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
16 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Implementation
MapReduce Reduce
Core 0 by,2 | by,3
I
Core 1 the,2 | the,2
Reduce
Reduce
by,2
the,2
Reduce function emits one or more output key-value pairs I
I
Total output size known prior to reduction, therefore output buffer is preallocated Minimizes memory management overhead
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
16 / 29
Motivation Background Design Experimental Analysis Conclusions
Outline Implementation
MapReduce Sort and Merge
Step 0
P0
Step 1
P0
Step 2
P0
P1
P2
P3
P2
Output Buffer Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
17 / 29
Motivation Background Design Experimental Analysis Conclusions
Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE
Outline Motivation Background Intel Single-Chip-Cloud MapReduce Design Outline Implementation Experimental Analysis Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE Conclusions Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
18 / 29
Motivation Background Design Experimental Analysis Conclusions
Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE
Benchmarks I
Histogram (partition-dominated) counts the frequency of occurrences of each RGB color component in an image file
I
Word Count (partition-dominated) counts the number of occurrences of each word in a text file
I
Kmeans (map-dominated) creates clusters from a set of data points
I
Linear Regression (map-dominated) computes a line of best fit for a set of points, given their 2D coordinates
Configuration: I
Tiles run at 533MHz
I
Mesh interconnect runs at 800MHz
I
DRAM runs at 800MHz
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
19 / 29
Motivation Background Design Experimental Analysis Conclusions
Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE
Outline Motivation Background Intel Single-Chip-Cloud MapReduce Design Outline Implementation Experimental Analysis Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE Conclusions Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
20 / 29
16
16
12
12
Speedup
Speedup
Motivation Background Design Experimental Analysis Conclusions
8 4
8
0 0
5 10 15 20 25 30 35 40 45 50
0
5 10 15 20 25 30 35 40 45 50
Cores Number
Cores Number
Speedup with Combiner
Speedup without Combiner
Combiner function improves scalability I
I
Histogram WordCount KMeans Linear Regression ideal
4
0
I
Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE
Kmeans and Linear Regression are map-dominated benchmarks
Superlinear speedup because complexity of the group stage decreases exponentially with the number of cores
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
21 / 29
Motivation Background Design Experimental Analysis Conclusions
Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE
Outline Motivation Background Intel Single-Chip-Cloud MapReduce Design Outline Implementation Experimental Analysis Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE Conclusions Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
22 / 29
Motivation Background Design Experimental Analysis Conclusions
Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE
8
Merge Sort Reduce Group Partition Combine Map
Seconds
6
4
2
0
Histogram
KMeans
WordCount
LinearRegression
Left bars with combiner, right without combiner I
Using a combiner function reduces execution time I I
Partition stage does not scale Combiner minimizes total partition time and group time
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
23 / 29
Motivation Background Design Experimental Analysis Conclusions
Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE
Outline Motivation Background Intel Single-Chip-Cloud MapReduce Design Outline Implementation Experimental Analysis Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE Conclusions Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
24 / 29
Motivation Background Design Experimental Analysis Conclusions
Benchmarks Speedup Execution Time Breakdowns SCC vs. Cell BE
50
Seconds
40
SCC - w combiner SCC - w/o combiner Cell Blade - 1 processor Cell Blade - 2 processors
30 20 10 0 0
5
10
15
20
25
30
35
40
45
50
Cores Number
I
QS22 Blade consists of 2 Cell BE Processors at 3.2 GHz
I
Each processor has 8 SPEs (accelerators)
I
WordCount benchmark with 60MB input size
I
Single-SCC nodes outperforms dual-Cell blade by up to 1.87×
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
25 / 29
Motivation Background Design Experimental Analysis Conclusions
Related Work
I
I
Other ports of MapReduce on clusters, SMPs, multicores and GPUs (HPCA07,PACT08,IISWC09,ICPP10) Shared-memory ports based on shared data structures in cache-coherent address space I
I
SCC port based on distributed data structures and scalable exchange algorithms, while utilizing caches for fast message exchange
Distributed-memory ports based on generic sorting and group algorithms I
SCC port based on combiners and radix sort algorithm
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
26 / 29
Motivation Background Design Experimental Analysis Conclusions
Conclusions
I
Our implementation of MapReduce on the Intel SCC demonstrates: I
I
I
Feasibility of implementing high-level, domain-specific parallel programming models that hide explicit communication SCC chip scalability when using optimized chip-specific global communication algorithms Good adaptivity to diverge workloads: map-dominated, partition-dominated, group-dominated
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
27 / 29
Motivation Background Design Experimental Analysis Conclusions
Thank you! The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] under the I-CORES project, grant agreement no 224759.
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
28 / 29
Appendix
Radix Sort execution time
4.0 3.5
Seconds
3.0 2.5 2.0 1.5 1.0 0.5 0.0 0
20000
40000
60000
Number of Items
Anastasios Papagiannis and Dimitrios S. Nikolopoulos, FORTH-ICS
Scalable MapReduce on the SCC. MARC3 Symposium.
29 / 29