Mapping the PRAM Model onto the Intel SCC Many-core Processor

2nd International Workshop on New Algorithms and Programming Models for the Many-core Era (APMM 2012), as part of the International Conference on High Performance Computing & Simulation (HPCS 2012)

Carsten Clauss, Stefan Lankes, and Thomas Bemmerl Chair for Operating Systems, RWTH Aachen University July 4th, 2012

Overview
 The Intel SCC Many-core Processor
 The PRAM Model and its Emulation
 Mapping the PRAM Model onto the SCC
 A Parallelized PRAM Emulator
 Conclusion and Outlook


Mapping the PRAM Model onto the Intel SCC Many-core Processor Carsten Clauss | Chair for Operating Systems, RWTH Aachen University | July 4th, 2012

The Intel SCC Many-core Processor
 Intel Single-Chip Cloud Computer (SCC)
 Concept Vehicle for Many-core Software Research

 48 Pentium-I Cores (arranged in a 6x4 on-die Mesh)
 2 Cores and 1 Router per Tile

 On-die Message-Passing Buffers (MPB), 16 kByte per Tile
 accessible as distributed on-die Shared Memory

 4 on-die Memory Controllers (MC1-4)


 max. 64 GByte DDR3 off-die main memory

[Figure: SCC die layout: a 6x4 mesh of tiles served by four memory controllers (MC1-MC4); each tile holds Core0 and Core1 with their caches L2$0 and L2$1, a shared MPB, and a Router]

The Intel SCC Many-core Processor
 Strictly No Cache Coherency
 Cluster-on-Chip Architecture

 Private off-die DRAM Regions (one per Core)
 Caches enabled! One Linux instance per Core!
 Shared / Global off-die DRAM Region
 Caches disabled by default!
 Shared on-die MPB Regions (cached in L1 / fast invalidation)
 For message-passing or shared data structures

[Figure: SCC memory architecture: each of Core 0 … Core 47 has its own L1$, L2$, and private off-die DRAM region; all cores share the off-die DRAM region and the on-die MPB memory]


The PRAM Model and its Emulation
 PRAM: Parallel Random Access Machine
• Processors work clock-synchronously on a shared main memory
• Additionally, each processor may have a local private memory

 Concurrent Memory Accesses  Different PRAM Models
 EREW, CREW, CRCW (Priority, Arbitrary, Combining, Common)

 PRAM Programming and PRAM-based Algorithms are well-explored fields of research and teaching!
 However, the PRAM is an abstract machine model!
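The four CRCW write-conflict resolutions listed above can be sketched as a small dispatch function (an illustrative Python sketch, not part of the talk; the function name and the summation used for "Combining" are assumptions):

```python
def resolve_crcw(writes, mode):
    """Resolve concurrent writes to one shared cell in one PRAM cycle.

    writes: list of (proc_id, value) pairs targeting the same cell.
    Returns the value actually stored under the given CRCW variant."""
    if mode == "priority":      # the processor with the lowest id wins
        return min(writes)[1]
    if mode == "arbitrary":     # any single write may win; pick the first
        return writes[0][1]
    if mode == "common":        # all writers must agree on one value
        values = {v for _, v in writes}
        assert len(values) == 1, "COMMON CRCW requires identical values"
        return values.pop()
    if mode == "combining":     # combine all values, e.g. by summation
        return sum(v for _, v in writes)
    raise ValueError(mode)
```

EREW and CREW need no such resolution for reads, but forbid (EREW) or allow (CREW) concurrent reads while keeping writes exclusive.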

[Figure: the PRAM model: processors Proc 0 … Proc n, each with a private memory, access a global shared memory and are driven by a global clock]
The PRAM Model and its Emulation
 The model allows for simultaneous access by all Procs to each shared memory location within one cycle!
 The model cannot be realized one-to-one in hardware!

 Thus, practical PRAM application means emulating the model in Software or Hardware!  But emulating generally implies an additional overhead!

 Emulating a PRAM by a parallel computer?  Emulation Efficiency:

E_em = (t* · p*) / (t · p)

…when emulating t* cycles of p* PRAM Procs in t cycles on a parallel computer with p CPUs.
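The efficiency formula translates directly into code (a sketch with a hypothetical function name):

```python
def emulation_efficiency(t_star, p_star, t, p):
    """E_em = (t* * p*) / (t * p): work of the emulated PRAM (t* cycles
    on p* processors) divided by work spent by the emulating machine
    (t cycles on p CPUs)."""
    return (t_star * p_star) / (t * p)
```

For example, emulating one PRAM cycle of 32 PRAM Procs in one cycle on 48 cores would yield E_em = 32/48 ≈ 0.67 at best.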


Mapping the PRAM Model onto the SCC  Let p SCC cores emulate p=p* PRAM processors  Let’s keep it simple for the first instance! 

 Use the SCC’s shared off-die memory as PRAM memory  Map any private PRAM memory into private off-die regions

 So far, so good. But a PRAM works clock-synchronously!  We need to synchronize the participating SCC cores!

 Suspend (local) PRAM emulation at every memory access!  Are there concurrent accesses? Which value is to be written?

 Resume (local) PRAM emulation when there are no more dependencies concerning the memory access order!  Compare Time Step (Cycle) Counters! (Resume on Minimum!)
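The resume-on-minimum rule can be stated compactly (an illustrative sketch, assuming each core publishes one time-step counter):

```python
def may_resume(my_counter, all_counters):
    """A core blocked at a shared-memory access may resume once no other
    core could still issue an access with an earlier time step, i.e. once
    its own counter equals the global minimum of all published counters."""
    return my_counter == min(all_counters)
```

A core at cycle 3 may proceed while cores at cycles 4 and 5 must wait, since the core at cycle 3 could still touch memory before them.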


Mapping the PRAM Model onto the SCC
 How to exchange the Time Step Counters? Use the fast on-die shared memory! ( the SCC’s MPBs)
 Strategy #1: Global View
Put counter to local MPB and read all remote MPBs!
 Strategy #2: Master Core
Put counter to local MPB, but only one core (the Master) permanently determines the lowest counter read!
 Strategy #3: Multiple Masters
Let 32 cores simulate 32 PRAM processors and use the remaining 16 cores to act as a hierarchy of Masters!



Mapping the PRAM Model onto the SCC
 Strategy #1: Global View
[Figure: Core A and Core B each put their own counter to their local MPB and read all remote counters from the shared on-die MPB]
 Strategy #2: Master Core
[Figure: every core puts its counter to its local MPB; the Master reads all remote counters and determines the minimum, which the other cores read back at the Master]


Mapping the PRAM Model onto the SCC
 Strategy #3: Multiple Masters
[Figure: 32 emulating Worker Cores publish their counters via the shared on-die MPB; 16 hierarchical Master Cores (Even and Odd Masters) reduce them stepwise to the global minimum]
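Strategy #3 amounts to a reduction tree over the workers' counters; a minimal sketch (not the authors' code; plain Python lists stand in for MPB reads):

```python
def hierarchical_min(counters):
    """Reduce a list of time-step counters to their minimum via a
    hierarchy of pairwise 'masters', halving the candidates per level."""
    level = list(counters)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(min(level[i], level[i + 1]))  # master combines two children
            else:
                nxt.append(level[i])                     # odd leftover passes through
        level = nxt
    return level[0]
```

With 32 workers this needs 16 + 8 + 4 + 2 + 1 combining steps arranged in five levels, which the 16 master cores can pipeline.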


Mapping the PRAM Model onto the SCC
 A Synthetic Benchmark:
 Perform (= emulate) p*·T/p* iterations (= PRAM cycles)
 Put local time step counter to local MPB in every iteration
 Determine global minimum (= memory access) every τ iterations (τ = 100)

 Speedup: 4.2, Efficiency: ~53 %
 32 PRAM processors emulated by 32 + 16 SCC cores: Speedup: 5.2, Efficiency: ~11 %
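As a plausibility check on these figures (assuming the common definition efficiency = speedup / number of emulating SCC cores):

```python
# With that definition, ~53 % corresponds to speedup 4.2 on 8 cores,
# and ~11 % to speedup 5.2 on the 32 + 16 = 48 cores used by
# the Multiple Masters strategy.
assert abs(4.2 / 8 - 0.53) < 0.01
assert abs(5.2 / 48 - 0.11) < 0.01
```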


Mapping the PRAM Model onto the SCC
 Comparison with a recent Multicore architecture:
 OpenMP-based emulation via common shared memory
 Platform: Westmere-EX, 8-socket system, 10-core CPUs each
 Tested Strategy: “Global View”, τ = 100



A Parallelized PRAM Emulator
 SB-PRAM: a PRAM emulator in hardware
 Saarland University (Germany), ~ mid 90s
 64 physical and 2048 virtual PRAM processors
 4 GByte main memory, CRCW priority protocol

[Figure: SB-PRAM architecture: processors P0 … Pn, each with a private memory, connected to a shared program memory, a global shared memory, and a global clock]


A Parallelized PRAM Emulator  The Fork (*) PRAM programming language  C-derived language for writing programs for the SB-PRAM  Compiler & Linker are still available

 PRAMSIM (*) : a software simulation tool for the SB-PRAM  Still available and runs on common (serial) desktop PCs  Processes binaries compiled for the SB-PRAM

 SCC-PRAMSIM: a parallelized port to the SCC  Uses p SCC cores to emulate p* SB-PRAM processors  Strategies: Global View, Master Core or Multiple Masters

(*): see “Practical PRAM Programming”, J. Keller, C. W. Kessler, and J. L. Träff


A Parallelized PRAM Emulator
 A Simple Benchmark: The Laplace Equation
 Boundary value problem on a two-dimensional domain
 Simple JOR solver written in Fork; iteration rule:

u_{i,j} = 1/4 · (u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1})

 Two Approaches for the Memory Layout:
 Shared-to-Shared vs. Private-to-Private

[Figure: five-point stencil: u_{i,j} is updated from its four neighbours u_{i-1,j}, u_{i+1,j}, u_{i,j-1}, and u_{i,j+1}]
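One sweep of this iteration rule, sketched in Python rather than Fork (illustrative only; this is the plain Jacobi special case of JOR with relaxation factor ω = 1, boundaries held fixed):

```python
def jor_sweep(u):
    """One Jacobi iteration for the 2-D Laplace equation: every interior
    point becomes the average of its four neighbours, computed from the
    old grid; boundary values are left untouched."""
    n = len(u)
    new = [row[:] for row in u]  # copy so all updates read the old values
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1])
    return new
```

In the Shared-to-Shared layout both the old and the new grid live in PRAM shared memory, so every stencil read is a (synchronizing) shared access; Private-to-Private keeps the working copies in private memory and touches shared memory only at the partition boundaries.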

A Parallelized PRAM Emulator
 Benchmark Results for:
 Multiple Masters: p=48 SCC cores, p*=32 SB-PRAM processors
 Matrix Size: N=256

 Shared-to-Shared:
 Shared Memory Access Frequency: τ = 14.46
 Slowdown! (Speedup: 0.138)

 Private-to-Private:
 Shared Memory Access Frequency: τ = 2,615.38
 Speedup: 8.876
 Emulation Efficiency: 18.49 %
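The reported efficiency is consistent with speedup divided by the p = 48 emulating cores (an assumption about the definition used here):

```python
# 8.876 / 48 = 0.18492..., matching the reported 18.49 % emulation efficiency.
assert abs(8.876 / 48 - 0.1849) < 1e-3
```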

Conclusion and Outlook
 The synthetic PRAM Model scales on the SCC up to 8 cores
 Synchronization / minimum determination via fast on-die MPBs

 The Speedup can be improved by applying Multiple Masters  However, this reduces the efficiency dramatically!

 Real PRAM algorithms only scale if shared memory accesses can be avoided  This is not the idea of common PRAM algorithms! 

 In case of just a lot of shared read accesses, a Read-Optimistic PRAM emulation might scale better?!?  Checkpointing and Rollback Mechanisms would be needed!


Thank you for your attention! Any Questions?

Carsten Clauss
Lehrstuhl für Betriebssysteme
RWTH Aachen University
Templergraben 55
52056 Aachen
www.lfbs.rwth-aachen.de
