Jul 4, 2012 - Mapping the PRAM Model onto the Intel SCC Many-core Processor. Carsten Clauss | Chair for Operating Systems, RWTH Aachen University ...
Mapping the PRAM Model onto the Intel SCC Many-core Processor
2nd International Workshop on New Algorithms and Programming Models for the Many-core Era (APMM 2012)
As part of the International Conference on High Performance Computing & Simulation (HPCS 2012)
Carsten Clauss, Stefan Lankes, and Thomas Bemmerl
Chair for Operating Systems, RWTH Aachen University
July 4th, 2012
Overview
• The Intel SCC Many-core Processor
• The PRAM Model and its Emulation
• Mapping the PRAM Model onto the SCC
• A Parallelized PRAM Emulator
• Conclusion and Outlook
Mapping the PRAM Model onto the Intel SCC Many-core Processor Carsten Clauss | Chair for Operating Systems, RWTH Aachen University | July 4th, 2012
The Intel SCC Many-core Processor
Intel Single-Chip Cloud Computer (SCC): Concept Vehicle for Many-core Software Research
• 48 Pentium-I cores on 24 tiles (tiles arranged in a 6x4 on-die mesh), 2 cores and 1 router per tile
• On-die Message-Passing Buffers (MPB): 16 kByte per tile, accessible as distributed on-die shared memory
• 4 on-die memory controllers (MC1-4)
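[Figure: SCC block diagram. Each tile holds Core0 and Core1 with their L2 caches (L2$0, L2$1), one Router and one MPB. Four memory controllers (MC1 to MC4) connect to at most 64 GByte of DDR3 off-die main memory.]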
The Intel SCC Many-core Processor
Strictly No Cache Coherency: Cluster-on-Chip Architecture
• Private off-die DRAM regions (one per core): caches enabled, one Linux instance per core
• Shared / global off-die DRAM region: caches disabled by default
• Shared on-die MPB regions (cached in L1 / fast invalidation): for message passing or shared data structures
[Figure: SCC memory hierarchy. Core 0 to Core 47 each have their own L1$, L2$ and private DRAM; all cores share the off-die DRAM and the on-die memory (MPB).]
The PRAM Model and its Emulation
PRAM: Parallel Random Access Machine
• Processors work clock-synchronously on a shared main memory
• Additionally, each processor may have a local private memory
Concurrent memory accesses lead to different PRAM models: EREW, CREW, CRCW (Priority, Arbitrary, Combining, Common)
PRAM programming and PRAM-based algorithms are well-explored fields of research and teaching. However, the PRAM is an abstract machine model!
[Figure: PRAM model. Processors Proc 0, Proc 1, …, Proc n, each with its own private memory, are driven by a global clock and access one global shared memory.]
The PRAM Model and its Emulation
The model allows all processors to access any shared memory location simultaneously within one cycle. Therefore, the model cannot be realized one-to-one in hardware.
Thus, practical PRAM applications require emulating the model in software or hardware, and emulation generally implies additional overhead.
Emulating a PRAM by a parallel computer? Emulation Efficiency:
$E_{\mathrm{emu}} = \frac{t^* \cdot p^*}{t \cdot p}$
…when emulating $t^*$ cycles of $p^*$ PRAM processors in $t$ cycles on a parallel computer with $p$ CPUs.
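The efficiency definition above can be evaluated directly. The following is a minimal sketch; the function name and the sample numbers are illustrative assumptions and not measurements from the talk.

```python
def emulation_efficiency(t_star, p_star, t, p):
    """Return (t* * p*) / (t * p) for an emulation run.

    t_star, p_star: emulated PRAM cycles and PRAM processors
    t, p:           cycles and CPUs of the emulating parallel computer
    """
    if t <= 0 or p <= 0:
        raise ValueError("t and p must be positive")
    return (t_star * p_star) / (t * p)


# Hypothetical run: 1000 PRAM cycles of 32 PRAM processors,
# emulated in 8000 cycles on 48 CPUs.
example = emulation_efficiency(1000, 32, 8000, 48)
print(f"{example:.4f}")
```

An efficiency of 1.0 would mean that the emulation adds no overhead at all, which is only reached when t = t* and p = p*.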
Mapping the PRAM Model onto the SCC
• Let p SCC cores emulate p* = p PRAM processors (keeping it simple as a first step)
• Use the SCC's shared off-die memory as PRAM memory
• Map any private PRAM memory into private off-die regions
So far, so good. But a PRAM works clock-synchronously, so the participating SCC cores need to be synchronized:
• Suspend the (local) PRAM emulation at every memory access. Are there concurrent accesses? Which value is to be written?
• Resume the (local) PRAM emulation when there are no more dependencies concerning the memory access order. To decide this, compare the time step (cycle) counters and resume on the minimum.
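The suspend / resume-on-minimum rule can be illustrated with a small sequential simulation. This is only a sketch under simplifying assumptions: each emulated processor is described by a hypothetical, strictly increasing list of PRAM cycles in which it accesses memory, and a core that has finished is treated as having an infinite counter.

```python
INF = float("inf")


def emulate(access_steps):
    """Release memory accesses in PRAM time order.

    access_steps[i] lists the PRAM cycles (strictly increasing) in which
    emulated processor i accesses memory. Returns the release order as
    (cycle, core) pairs.
    """
    pending = [list(steps) for steps in access_steps]
    # Every core runs ahead locally and suspends at its next access.
    counters = [q[0] if q else INF for q in pending]
    released = []
    while min(counters, default=INF) != INF:
        minimum = min(counters)
        for core, counter in enumerate(counters):
            if counter == minimum:
                # No other core can still issue an earlier access: resume.
                released.append((minimum, core))
                pending[core].pop(0)
                counters[core] = pending[core][0] if pending[core] else INF
    return released


order = emulate([[1, 4], [2, 4], [3]])
print(order)
```

Cores whose counters are equal to the minimum at the same time correspond to concurrent accesses in one PRAM cycle; how such conflicts are resolved depends on the PRAM variant being emulated.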
Mapping the PRAM Model onto the SCC
How to exchange the time step counters? Use the fast on-die shared memory (the SCC's MPBs)!
• Strategy #1: Global View. Each core puts its counter into its local MPB and reads the counters in all remote MPBs.
• Strategy #2: Master Core. Each core puts its counter into its local MPB, but only one core (the Master) permanently determines the lowest counter, which the other cores then read at the Master.
• Strategy #3: Multiple Masters. Let 32 cores simulate 32 PRAM processors and use the remaining 16 cores as a hierarchy of Masters.
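To make the difference between the first two strategies concrete, the sketch below models the MPB counters as a plain Python list and counts remote reads per synchronization round. The read-counting model is an assumption for illustration; it is not taken from the slides and ignores real MPB latencies.

```python
def global_view(mpb_counters):
    """Strategy #1: every core scans all remote counters itself."""
    p = len(mpb_counters)
    minima, remote_reads = [], 0
    for core in range(p):
        remote = [mpb_counters[o] for o in range(p) if o != core]
        remote_reads += len(remote)
        minima.append(min([mpb_counters[core]] + remote))
    return minima, remote_reads


def master_core(mpb_counters, master=0):
    """Strategy #2: only the master scans; the others read its result."""
    p = len(mpb_counters)
    published = min(mpb_counters)   # master reads p - 1 remote counters
    remote_reads = p - 1
    minima = []
    for core in range(p):
        minima.append(published)
        if core != master:
            remote_reads += 1       # one read of the minimum at the master
    return minima, remote_reads


counters = [7, 3, 9, 5]
print(global_view(counters))
print(master_core(counters))
```

In this simplified model the Global View needs p(p-1) remote reads per round, while the Master Core needs 2(p-1), at the price of routing every decision through a single core.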
Mapping the PRAM Model onto the SCC
[Figure: Strategy #1, Global View. Core A and Core B (each with L1$, L2$ and private DRAM) put their own counter into the local MPB and read the remote counters from the shared on-die MPB.]
[Figure: Strategy #2, Master Core. The Master reads the remote counters from the shared on-die MPB and determines the minimum; Core B reads the minimum at the Master.]
Mapping the PRAM Model onto the SCC
[Figure: Strategy #3, Multiple Masters. 32 emulating worker cores publish their counters via the shared on-die MPB; 16 hierarchical Master cores (labeled Odd Master and Even Master) determine the global minimum.]
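The idea of a master hierarchy can be sketched as a tree reduction over the worker counters. The grouping below (a fixed, hypothetical `fan_in` per master) is an assumption for illustration only; the slides do not specify how the 16 Master cores are actually arranged, and this sketch does not reproduce that layout.

```python
def hierarchical_minimum(counters, fan_in=4):
    """Reduce counters level by level, one master per group.

    Returns (global_minimum, masters_used).
    """
    if not counters:
        raise ValueError("need at least one counter")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(counters)
    masters_used = 0
    while len(level) > 1:
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        level = [min(group) for group in groups]  # each master scans its group
        masters_used += len(groups)
    return level[0], masters_used


workers = list(range(100, 132))   # 32 hypothetical worker counters
workers[17] = 42
print(hierarchical_minimum(workers))
```

Each master only scans a few counters, so the scan work is spread over several cores, but those cores no longer emulate PRAM processors themselves.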
Mapping the PRAM Model onto the SCC
A Synthetic Benchmark:
• Perform (= emulate) $p^* \cdot T/p^*$ iterations (= PRAM cycles)
• Put the local time step counter into the local MPB in every iteration
• Determine the global minimum (= memory access) every τ iterations, with τ = 100
[Figure: benchmark results with two annotations. First annotation: Speedup 4.2, Efficiency ~53%. Second annotation (32 PRAM processors emulated by 32 + 16 SCC cores): Speedup 5.2, Efficiency ~11%.]
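The per-core structure of such a synthetic benchmark might look like the following sketch. It only counts events on a single emulating core; the MPB write and the minimum determination are represented by counters, and all names are hypothetical.

```python
def synthetic_benchmark(iterations, tau=100):
    """Count the events one emulating core generates.

    Returns (time_step_counter, mpb_writes, minimum_determinations).
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    counter = 0
    mpb_writes = 0
    minimum_determinations = 0
    for _ in range(iterations):
        counter += 1                      # emulate one PRAM cycle
        mpb_writes += 1                   # publish the counter in the local MPB
        if counter % tau == 0:
            minimum_determinations += 1   # emulated memory access: synchronize
    return counter, mpb_writes, minimum_determinations


print(synthetic_benchmark(1000, tau=100))
```

A larger τ means fewer synchronization points per emulated cycle, so the cost of determining the global minimum is amortized over more local work.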
Mapping the PRAM Model onto the SCC
Comparison with a recent multicore architecture:
• OpenMP-based emulation via common shared memory
• Platform: Westmere-EX, 8-socket system with 10-core CPUs
• Tested strategy: “Global View”, τ = 100
A Parallelized PRAM Emulator
SB-PRAM: a PRAM emulator in hardware
• Saarland University (Germany), around the mid-1990s
• 64 physical and 2048 virtual PRAM processors
• 4 GByte main memory, CRCW priority protocol
[Figure: SB-PRAM structure. Processors P0, P1, ..., Pn, each with its own private memory, share a global clock, a program memory and the global shared memory.]
A Parallelized PRAM Emulator
The Fork (*) PRAM programming language
• C-derived language for writing programs for the SB-PRAM
• Compiler and linker are still available
PRAMSIM (*): a software simulation tool for the SB-PRAM
• Still available and runs on common (serial) desktop PCs
• Processes binaries compiled for the SB-PRAM
SCC-PRAMSIM: a parallelized port to the SCC
• Uses p SCC cores to emulate p* SB-PRAM processors
• Strategies: Global View, Master Core or Multiple Masters
(*): see “Practical PRAM Programming” by J. Keller, C.W. Kessler, and J.L. Traeff
A Parallelized PRAM Emulator
A Simple Benchmark: The Laplace Equation
• Boundary value problem on a two-dimensional domain
• Simple JOR solver written in Fork, iteration rule:
$u_{i,j} = \frac{1}{4}\left(u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1}\right)$
Two approaches for the memory layout: Shared-to-Shared vs. Private-to-Private
[Figure: five-point stencil. The value at (i, j) is updated from its four neighbors $u_{i-1,j}$, $u_{i+1,j}$, $u_{i,j-1}$ and $u_{i,j+1}$.]
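The iteration rule for the Laplace benchmark averages the four neighbors of each interior grid point. The sketch below shows one such sweep in plain Python rather than Fork; it is a serial illustration with fixed boundary values and does not model the Shared-to-Shared or Private-to-Private memory layouts of the actual benchmark.

```python
def jacobi_step(u):
    """Return a new grid after one sweep of the four-neighbor average.

    u is a square list of lists; boundary rows and columns stay fixed.
    """
    n = len(u)
    new = [row[:] for row in u]           # copy, so reads use old values only
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1])
    return new


def solve(u, iterations):
    """Apply a fixed number of sweeps."""
    for _ in range(iterations):
        u = jacobi_step(u)
    return u


# Hypothetical 4x4 grid: top boundary held at 1.0, everything else 0.0.
grid = [[1.0] * 4] + [[0.0] * 4 for _ in range(3)]
result = jacobi_step(grid)
print(result[1])
```

Because every new value depends only on values from the previous sweep, all interior points can be updated in parallel, which is what makes the rule a natural PRAM example.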
A Parallelized PRAM Emulator
Benchmark Results for:
• Multiple Masters: p = 48 SCC cores, p* = 32 SB-PRAM processors
• Matrix size: N = 256
Shared-to-Shared:
• Shared memory access frequency: τ = 14.46
• Slowdown: 0.138
Private-to-Private:
• Shared memory access frequency: τ = 2,615.38
• Speedup: 8.876
• Emulation efficiency: 18.49 %
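The reported efficiency figures appear consistent with dividing the speedup by the number of SCC cores in use. The following check assumes that relation and assumes the speedup is measured against a single emulating core; the slides do not state the baseline explicitly.

```python
def efficiency_from_speedup(speedup, cores):
    """Efficiency as speedup per core in use (assumed relation)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


# Private-to-Private Laplace run: speedup 8.876 on p = 48 SCC cores.
private_to_private = efficiency_from_speedup(8.876, 48)
# Synthetic benchmark with Multiple Masters: speedup 5.2 on 32 + 16 cores.
synthetic_multi_master = efficiency_from_speedup(5.2, 48)

print(f"{private_to_private:.2%}")
print(f"{synthetic_multi_master:.1%}")
```

Under this assumption the values come out at roughly 18.49 % and 10.8 %, matching the 18.49 % and ~11% quoted in the talk.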
Conclusion and Outlook
• The synthetic PRAM model scales on the SCC up to 8 cores, with synchronization / minimum determination via the fast on-die MPBs
• The speedup can be improved by applying Multiple Masters; however, this reduces the efficiency dramatically
• Real PRAM algorithms only scale if shared memory accesses can be avoided, which is not the idea of common PRAM algorithms
• If there are mostly shared read accesses, a read-optimistic PRAM emulation might scale better, but checkpointing and rollback mechanisms would be needed
Thank you for your attention! Any Questions?
Carsten Clauss
Lehrstuhl für Betriebssysteme
RWTH Aachen University
Templergraben 55
52056 Aachen
www.lfbs.rwth-aachen.de