Task Data-flow Execution on Many-core Systems
Andreas Diavastos
[email protected]
Supervisor: Pedro Moura Trancoso
Department of Computer Science, University of Cyprus
CASPER Group: Computer Architecture, Systems and Performance Evaluation Research Group
Application Performance Scalability
[Figure: TOP500 trends – #nodes/system vs. #cores/node]
• More resources → Higher performance
• Many nodes: Prohibitive cost
• Many cores: Scaling limitations in runtime systems
We need efficient and scalable systems for many-core processors
E. Strohmaier, H. W. Meuer, J. Dongarra and H. D. Simon, "The TOP500 List and Progress in High-Performance Computing," in Computer, vol. 48, no. 11, Nov. 2015.
‡ M. M. Resch, "The end of Moore's law: Moving on in HPC beyond current technology," Keynote Presentation, PDP 2016 Conference, Heraklion, Crete, February 2016.
State-of-the-art
• Limited performance scalability in many-cores (OpenMP, OmpSs, TBB, …)
  – Dynamic scheduling → Runtime overhead
  – Centralized runtimes → Single point of access
  – Runtime dependence resolution → More overhead
  – Programming → Each one has its own API
Existing runtime systems are not well suited for what is coming in future many-cores
The Objectives
• Objective 1: Increase Application Parallelism
• Objective 2: Programming Productivity
• Objective 3: Scalable Architectures
• Objective 4: Scalable Runtime System
• Objective 5: Efficient Resource Utilization
Improving application performance is a joint task between the hardware and the software
Solution Overview
• Task-based Data-flow execution model
  – Exploits large amounts of parallelism
• Performance scaling factors:
  – Programmability
  – Locality-aware execution
  – The degree of parallelism
  – Scalable architecture designs
  – Low-overhead runtime systems
  – Efficient use of the available resources
Outline
• Prologue
• Speculative Parallelism in Data-flow
• DDM on Many-core Systems
• A Scalable Framework for Task-based Data-flow Execution
• Autonomic Mapping of Resources
• Epilogue
Outline (recap) – next: Speculative Parallelism in Data-flow
Contributes to:
• Objective 1: Increase Application Parallelism
Integrate Transactions into the Data-flow Model
SPECULATIVE PARALLELISM IN DATA-FLOW
Andreas Diavastos, Pedro Trancoso, Mikel Lujan and Ian Watson. "Integrating Transactions into the Data-Driven Multi-threading Model using the TFlux Platform", International Journal of Parallel Programming (IJPP), 2015.
Andreas Diavastos, Pedro Trancoso, Mikel Lujan and Ian Watson. "Integrating Transactions into the Data-Driven Multi-threading Model using the TFlux Platform", in Proceedings of the Data-Flow Execution Models for Extreme Scale Computing Workshop (DFM), 2011.
Integrate Transactions in Data-flow
Data-flow:
• Data-flow-based execution
• Synchronization based on data availability
Transactional Memory:
• Speculative execution
• Execute tasks freely
• Abort/restart on conflicts
TFluxTM:
• Exploit runtime parallelism in Data-flow
• Increase the Data-flow application coverage
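As a concrete illustration, the sketch below shows how a task body that updates shared state could be executed speculatively. It is a minimal sketch assuming the STAMP-style TM macros used by benchmarks such as Labyrinth (TM_BEGIN, TM_SHARED_READ, TM_SHARED_WRITE, TM_END); it is not the actual TFluxTM code, and grid, idx and cost are illustrative names.

#include "tm.h"   /* STAMP-style TM macro layer (assumption) */

/* A data-flow task body executed speculatively: conflicting tasks abort
 * and restart transparently instead of being ordered by data-flow arcs. */
void task_update_grid(long *grid, long idx, long cost)
{
    TM_BEGIN();                                   /* start speculative region */
    long cur = (long)TM_SHARED_READ(grid[idx]);   /* transactional read       */
    if (cost < cur)
        TM_SHARED_WRITE(grid[idx], cost);         /* transactional write      */
    TM_END();                                     /* commit (or retry)        */
}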
The TFlux Parallel Processing Platform
• TFlux: provides the DDM model on commodity multi-core systems
[Figure: TFlux software stack – C with DDM directives → TFlux preprocessor → unmodified C compiler → DDM binary; runtime support runs the DDM kernels (Kernel 1…n) on a TSU group (TSU 1…n), over an unmodified operating system, unmodified ISA and hardware]
¤ Stavrou, Kyriakos, et al. "TFlux: A portable platform for data-driven multithreading on commodity multicore systems." 37th International Conference on Parallel Processing (ICPP), IEEE, 2008.
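For flavour, a short sketch of DDM-annotated C in the TFlux style follows. The directive spellings, the depends clause and the loop bodies are illustrative assumptions, not the exact TFlux preprocessor syntax; the preprocessor would turn each annotated block into a DDM thread and emit the corresponding TSU calls.

#define N 1024
double a[N], b[N];

int main(void)
{
    /* Illustrative DDM-style annotations (not the exact TFlux directive set). */
    #pragma ddm startprogram

    #pragma ddm thread 1                  /* producer DThread: fills a[]        */
    for (int i = 0; i < N; i++) a[i] = i * 0.5;
    #pragma ddm endthread

    #pragma ddm thread 2 depends(1)       /* consumer: runs after thread 1 ends */
    for (int i = 0; i < N; i++) b[i] = 2.0 * a[i];
    #pragma ddm endthread

    #pragma ddm endprogram
    return 0;
}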
TFluxTM Implementation
• Integrated the TinySTM* runtime system into TFluxSoft¤
[Figure: Speedup of Lee's algorithm (Labyrinth) for grids 256x256x3-n256, 256x256x5-n256 and 512x512x7-n512 on 2–12 threads; an annotation marks the much lower point where data-flow alone would be]
Exploits runtime parallelism in Data-flow, but the software TM runtime overhead is still too high
* Pascal Felber, Christof Fetzer, and Torvald Riegel. "Dynamic Performance Tuning of Word-Based Software Transactional Memory", Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2008.
¤ Stavrou, Kyriakos, et al. "TFlux: A portable platform for data-driven multithreading on commodity multicore systems." 37th International Conference on Parallel Processing (ICPP), IEEE, 2008.
Outline (recap) – next: DDM on Many-core Systems
Contributes to:
• Objective 3: Support for Scalable Architectures
Acknowledgement: We would like to thank Intel Labs for lending the Intel SCC research processor
TFlux on the Intel SCC
DATA-DRIVEN MULTI-THREADING ON MANY-CORE SYSTEMS
Andreas Diavastos, Giannos Stylianou and Pedro Trancoso. "TFluxSCC: Exploiting Performance on Future Many-core Systems through Data-Flow", in Proceedings of the 23rd Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2015.
Andreas Diavastos, Giannos Stylianou and Pedro Trancoso. "TFluxSCC: A Case Study for Exploiting Performance in Future Many-core Systems" (poster paper), in Proceedings of the 11th ACM Conference on Computing Frontiers, May 2014.
TFlux on the Intel SCC
• Intel Single-chip Cloud Computer
  – 48-core processor
  – Offers a global address space
  – No hardware support for cache coherence
• Data-Driven Multi-threading model
  – Data-flow implementation with task-level granularity
  – Data-flow dependences avoid simultaneous accesses to shared data
  – Cache coherence is not needed
TFluxSCC Runtime System
• Fully decentralized runtime system
• One TSU for every application thread
• Each core has its own instance of the SG
  – No locking protection required
[Figure: Speedup of QSORT*, QSORT, Blackscholes, FFT, MMULT, RK4 and Trapez on up to 48 cores; the best benchmarks reach speedups of roughly 32–48 (maximum speedup of 48 achieved with 48 cores), while others reach only about 5–8]
A DDM implementation for a many-core processor without requiring hardware cache-coherence support
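A minimal sketch of what a per-core TSU could look like is given below, assuming each core keeps a private synchronization-graph (SG) copy with a ready count per DThread; the structure names, constants and the omission of the cross-core MPB update messages are simplifying assumptions, not the TFluxSCC source.

#define NUM_DTHREADS  8
#define MAX_CONSUMERS 4

typedef struct {
    int  ready_count;                 /* producers that have not finished yet */
    int  consumers[MAX_CONSUMERS];    /* SG arcs: DThreads that depend on us  */
    int  num_consumers;
    void (*body)(void);               /* the DThread's code                   */
} sg_entry_t;

static sg_entry_t sg[NUM_DTHREADS];   /* this core's private SG: no locks     */

static void tsu_step(void)            /* one pass of the per-core TSU loop    */
{
    for (int t = 0; t < NUM_DTHREADS; t++) {
        if (sg[t].ready_count == 0) {                 /* all inputs arrived   */
            sg[t].ready_count = -1;                   /* mark as dispatched   */
            sg[t].body();                             /* run to completion    */
            for (int c = 0; c < sg[t].num_consumers; c++)
                sg[sg[t].consumers[c]].ready_count--; /* local consumer update*/
        }
    }
}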
Outline (recap) – next: A Scalable Framework for Task-based Data-flow Execution
Contributes to:
• Objective 1: Increase application parallelism
• Objective 2: Programmability
• Objective 3: Support for scalable architectures
• Objective 4: Low-overhead runtime
• Objective 5: Efficient utilization of resources
Acknowledgement: We would like to thank The Cyprus Institute (CyI) for allowing us to use their Intel Xeon Phi facility
SWITCHES
A SCALABLE FRAMEWORK FOR TASK-BASED DATA-FLOW EXECUTION
Andreas Diavastos and Pedro Trancoso. "SWITCHES: A Lightweight Runtime for Dataflow Execution of Tasks on Many-Cores", ACM Transactions on Architecture and Code Optimization (TACO) 14, 3, Article 31, August 2017.
Andreas Diavastos and Pedro Trancoso. "Auto-tuning Static Schedules for Task Data-flow Applications", in Proceedings of the 1st ACM Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems (ANDARE), September 2017.
SWITCHES: Key Characteristics
• Software implementation
• Implements task-based data-flow execution
• Applies dependences at any level of granularity
• Cross-loop iteration dependences
• Compile-time (static) scheduling/assignment policy
• Explicit task resource allocation
• Fully distributed triggering system
  – No central point of communication
  – No locking protection on any shared runtime data
• Programming API: the OpenMP v4.5 standard
SWITCHES: Execution Model
• Compile time: build the synchronization graph (tasks T1–T4) and add a switch to every task
• Runtime: a task moves through the states Task (T) → Executing (E) → Finished (F); when a task finishes, its switch turns ON
• A task may start executing only when the switches of all of its producers are ON; the program finishes when all tasks have finished
[Figure: the synchronization graph T1–T4 with its switches progressing from OFF to ON as tasks execute and finish]
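A minimal sketch of the switch mechanism, assuming one boolean switch per task and busy-wait polling; the struct layout and names are assumptions for illustration, not the SWITCHES source.

#include <stdbool.h>

typedef struct task {
    volatile bool switch_on;      /* set to true exactly once, when Finished  */
    struct task **producers;      /* tasks whose switches must be ON first    */
    int           num_producers;
    void        (*body)(void);    /* the task's code                          */
} task_t;

static bool task_ready(const task_t *t)
{
    for (int i = 0; i < t->num_producers; i++)
        if (!t->producers[i]->switch_on)      /* a producer has not finished  */
            return false;
    return true;
}

static void run_task(task_t *t)
{
    while (!task_ready(t))        /* Task state: waiting on producer switches */
        ;
    t->body();                    /* Executing                                */
    t->switch_on = true;          /* Finished: enables the consumers          */
}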
SWITCHES: Platform Programming Model
• Input: an OpenMP v4.5 application, fed to the Translator

int main() {
    #pragma omp task depend(out: x)   /* illustrative depend clause */
    { /* do something */ }
    ...
    #pragma omp task depend(in: x)
    { /* do something */ }
}
SWITCHES: The Translator
[Figure: the Translator takes the OpenMP task code (#pragma omp task {…}), the dependency graph and a user-specified scheduling/assignment policy, and produces parallel code of the form "Task1: checkP(task1); … Update(sw1); Task2: checkP(task2); … Update(sw2);" which an unmodified C/C++ compiler turns into the binary]
• Source-to-source tool
  – Command-line tool
• Implements the transitive-reduction optimization
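The generated code roughly follows the checkP()/Update() pattern shown in the figure; the sketch below is an assumption of its shape, with hypothetical task bodies, not the Translator's literal output.

/* One worker's statically assigned sequence of tasks, as emitted
 * by the Translator (shape only; identifiers are hypothetical). */
void worker(void)
{
    checkP(task1);        /* spin until task1's producer switches are ON */
    task1_body();         /* the original #pragma omp task body          */
    Update(sw1);          /* turn task1's switch ON                      */

    checkP(task2);        /* task2 depends on task1 through sw1          */
    task2_body();
    Update(sw2);
}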
SWITCHES: API
• Implemented the OpenMP v4.5 API standard
• Extended the taskloop directive to support:
  – Resource allocation (better utilization)
  – Dependences
  – Task reduction
  – Scheduling policy (cross-loop iteration dependences)

#pragma omp taskloop private(list) firstprivate(list) grainsize(size) \
        num_threads(NUMBER) depend(type : list) \
        reduction(OPERATION : list) schedule(type [, CHUNK])
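A small usage example of the extended directive follows; note that num_threads, depend, reduction and schedule on taskloop are SWITCHES extensions rather than standard OpenMP 4.5 clauses, and the dot-product code itself is only illustrative.

/* Dot product split into tasks of 64 iterations, run on 32 threads,
 * with a task reduction over sum (SWITCHES-extended clauses). */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    #pragma omp taskloop grainsize(64) num_threads(32) \
            depend(in: x, y) reduction(+ : sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}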
Cross-Loop Dependences with Granularity
OpenMP taskloop:
• Variable granularity
• Increases data locality
• Barrier between loops
OpenMP task:
• Dependences across loop iterations
• Asynchronous execution
• No data locality
SWITCHES (see the sketch below):
• Dependences across loop iterations
• Variable granularity
• Asynchronous execution
• Increases data locality
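The sketch below illustrates the SWITCHES case for two dependent loops; the way depend and schedule are combined is an assumption about the extended clause semantics, not a verbatim SWITCHES example, and produce/consume are placeholder functions.

#define N     4096
#define CHUNK   64
double a[N], b[N];

/* Loop B consumes what loop A produces. With the extensions, chunks of B
 * may start as soon as the matching chunks of A finish: no global barrier. */
void pipeline(void)
{
    #pragma omp taskloop grainsize(CHUNK) schedule(static, CHUNK) depend(out: a)
    for (int i = 0; i < N; i++)
        a[i] = produce(i);                /* loop A */

    #pragma omp taskloop grainsize(CHUNK) schedule(static, CHUNK) depend(in: a)
    for (int i = 0; i < N; i++)
        b[i] = consume(a[i]);             /* loop B, overlapped with loop A */
}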
SWITCHES: Runtime System
[Figure: each core runs one or more software threads; every software thread has its own scheduler and a statically assigned workload of tasks; shared memory holds the application data and the runtime data (task switches, producer switches, cross-loop switches, scheduler data)]
• Multiple software threads on each core
• Each software thread has its own scheduler
• Each task holds switch references to its producers
• Task switches are stored in shared memory
• Switch variables are writer-owned: each is written only by its owner task, so no locking is needed
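Continuing the task_t sketch from the execution-model slide, one software thread's scheduler could look as follows, again as an assumption rather than the actual SWITCHES runtime: it only walks its statically assigned tasks and never takes a lock.

/* Per-thread static scheduler: my_tasks is this thread's compile-time
 * assignment; switches live in shared memory, each with a single writer. */
void thread_scheduler(task_t **my_tasks, int my_count)
{
    for (int k = 0; k < my_count; k++) {
        task_t *t = my_tasks[k];
        while (!task_ready(t))    /* poll producer switches, lock-free */
            ;
        t->body();
        t->switch_on = true;      /* written only by this owner thread */
    }
}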
Results: Experimental Setup

Benchmark   Description                          Data set sizes (computation iterations)
                                                 DS1           DS2           DS3
Q12         Nested-loop join from TPC-H          60K × 1.5K    60K × 15K     60K × 150K
MMULT       Matrix multiply                      256 × 256     512 × 512     1024 × 1024
RK4         Differential equation                4800          9600          19200
SU3         Wilson-Dirac equation                1920K         3840K         7680K
Poisson2D   5-point 2D stencil computation       4096          8192          16384
SparseLU    LU factorization                     120 × 32      240 × 32      480 × 32
OCEAN       Red-Black solution (Gauss-Seidel)    4096 × 4096   8192 × 8192   16384 × 16384

Compared implementations:
• Data-parallel benchmarks: OMP-Dynamic-For, OMP-Static-For, OMP-Task, OMP-Taskloop
• Task-parallel benchmarks: OMP-Task, OMP-Task-Dep, OMP-Taskloop

• Benchmarks from: TFlux, BOTS, Kastors
• Intel Xeon Phi 7120P (KNC)
  – 61 cores × 4 threads (244 hardware threads in total)
  – @ 1.2 GHz
  – Intel ICC v17.0.2 (libiomp5 and the -O3 flag)
Results: Strong Scaling
• SWITCHES achieves performance scalability through:
  – A static, decentralized runtime system for low-overhead synchronization
  – Variable-granularity loop tasks for data locality
  – The cross-loop dependence mechanism for increased parallelism
  – A fast dependence-resolution mechanism
Results: Task Resource Allocation
[Figure: iterations i1…iN of Loop A feed iterations j1…jN of Loop B; a 20% improvement is obtained]
• Variable-granularity loop tasks
• Cross-loop iteration dependences
• Task resource allocation
Data-Parallel: Weak Scaling
[Figure: best speedups – SWITCHES: 72× (2), 141×, 92× (4), 81×; OpenMP: 67× Taskloop (4), 103× Static-For, 79× Static-For (4), 57× Dynamic]
26% improvement
• Smaller input size → less work per thread
  – OpenMP scheduling overhead becomes more visible
Task-Parallel: Weak Scaling
[Figure: best speedups – SWITCHES: 80×, 55× (2), 11× (4); OpenMP: 66× Taskloop, 34× Taskloop (4), 9× Taskloop (4)]
35% improvement
• Scheduling overhead + runtime dependence resolution
  – OpenMP: heavy dependence-resolution mechanism
Outline (recap) – next: Autonomic Mapping of Resources
Motivation
• Round-robin & random schedules
  – Produced with no knowledge of the application, the task dependences or the hardware
• "Ninja" hand-coded schedules
  – 15%–30% performance increase
SWITCHES: Autonomic Mapping
• Use machine learning to find better schedules
• Take into account: task dependences, data usage, hardware topology
• Non-dominated Sorting Genetic Algorithm II (NSGA-II)
  – A heuristic procedure that searches for the optimal solution within a population of candidate solutions
  – Multi-objective algorithm
K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, Apr. 2002.
SWITCHES: Autonomic Mapping
• The population is a set of schedules
• Genetic representation: a thread-to-core assignment, i.e. an array indexed by thread ID whose entries are core IDs
[Figure: example individuals mapping thread IDs 0–9 to core IDs]
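A minimal sketch of such an individual and of a mutation step is shown below, assuming the genome is simply an array of core IDs indexed by thread ID; the struct, the constants and the objective fields are illustrative, and the fitness values would come from actually running the application with that schedule.

#include <stdlib.h>

#define NUM_THREADS 240     /* software threads to place   */
#define NUM_CORES    61     /* cores of the target machine */

typedef struct {
    int    core_of[NUM_THREADS];  /* genome: thread-to-core assignment     */
    double time;                  /* objective 1: measured execution time  */
    double power;                 /* objective 2: measured power           */
} individual_t;

static void random_schedule(individual_t *ind)
{
    for (int t = 0; t < NUM_THREADS; t++)
        ind->core_of[t] = rand() % NUM_CORES;
}

static void mutate(individual_t *ind, double prob)
{
    for (int t = 0; t < NUM_THREADS; t++)
        if ((double)rand() / RAND_MAX < prob)
            ind->core_of[t] = rand() % NUM_CORES;  /* move one thread */
}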
SWITCHES: Autonomic Mapping
• Every parameter is configurable
  – Number of threads, population size, number of generations, crossover & mutation probabilities
• Fitness objectives
  – Performance, power, temperature
  – Any combination of them can be used
• The best schedule is stored in a file
  – It can be reloaded at any time
Results: Synthetic Kernels
Setup: Intel Xeon Phi, 50 generations, 64 individuals, crossover probability 0.0001, mutation probability 0.6
• No-dependences kernel
  – 18% performance increase over round-robin
  – Within 15% of the hand-coded schedule
• Dependences kernel
  – 11% performance increase over round-robin
  – Within 10% of the hand-coded schedule
• 30% fewer resources used!
Task Data-flow Execution on Many-core Systems
35
Results: Poisson2D
Setup: Intel Xeon Phi, 10 generations, 64 individuals, crossover probability 0.0001, mutation probability 0.6
10% improvement
• 32 threads:
  – Auto-tuning chooses hardware threads that belong to the same core
• 240 threads:
  – 2× performance improvement
  – 30% fewer resources used!
Outline (recap) – next: Epilogue
Future Directions
• Speculative parallelism on many-cores
• Inter-node scalability using data-flow
• Adaptive resource manager
• Heterogeneous many-cores
Achieved Objectives & Contributions
• Objective 1: Increase Application Parallelism
  – Data-flow + TM: more parallelism → higher performance
• Objective 2: Programming Productivity
  – Implemented the OpenMP v4.5 API and extended the taskloop directive → higher programming productivity
  – Source-to-source translation tool
• Objective 3: Scalable Architectures
  – Minimum hardware support required (no cache-coherence) → reduced hardware complexity
• Objective 4: Scalable Runtime System
  – SWITCHES: lightweight and decentralized → 32% performance increase (240 threads)
• Objective 5: Efficient Resource Utilization
  – Auto-tuning scheduling tool → maximum performance with 30% fewer resources
C.A.S.P.E.R. Group: Computer Architecture, Systems and Performance Evaluation Research
Visit us: www.cs.ucy.ac.cy/carch/casper

List of Publications

Journals:
1. Andreas Diavastos and Pedro Trancoso. "SWITCHES: A Lightweight Runtime for Dataflow Execution of Tasks on Many-Cores", ACM Transactions on Architecture and Code Optimization (TACO) 14, 3, Article 31, August 2017.
2. Andreas Diavastos, Pedro Trancoso, Mikel Lujan and Ian Watson. "Integrating Transactions into the Data-Driven Multi-threading Model using the TFlux Platform", International Journal of Parallel Programming (IJPP), 2015.
3. Andreas Diavastos, Giannos Stylianou and Giannis Koutsou. "Exploiting Very-Wide Vector Processing for Scientific Applications", Computing in Science & Engineering (CiSE), vol. 17, no. 6, Nov/Dec 2015, pp. 83-87.

Conferences & Workshops:
1. G. Karakonstantis, Andreas Diavastos, et al. "An Energy-Efficient and Error-Resilient Server Ecosystem Exceeding Conservative Scaling Limits", in Proceedings of Design, Automation and Test in Europe (DATE) 2018, Dresden, Germany, March 2018.
2. K. Tovletoglou, Andreas Diavastos, et al. "An Energy-Efficient and Error-Resilient Server Ecosystem Exceeding Conservative Scaling Limits", in Proceedings of the Energy-efficient Servers for Cloud and Edge Computing Workshop (ENeSCE 2017), co-located with HiPEAC 2017, Stockholm, Sweden, January 2017.
3. Andreas Diavastos and Pedro Trancoso. "Auto-tuning Static Schedules for Task Data-flow Applications", in Proceedings of the 1st ACM Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems (ANDARE), September 2017.
4. Andreas Diavastos, Giannos Stylianou and Pedro Trancoso. "TFluxSCC: Exploiting Performance on Future Many-core Systems through Data-Flow", in Proceedings of the 23rd Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2015.
5. Andreas Diavastos, Pedro Trancoso, Mikel Lujan and Ian Watson. "Integrating Transactions into the Data-Driven Multi-threading Model using the TFlux Platform", in Proceedings of the Data-Flow Execution Models for Extreme Scale Computing Workshop (DFM), 2011.
6. Andreas Diavastos, Giannos Stylianou and Giannis Koutsou. "Exploiting Very-Wide Vectors on Intel Xeon Phi with Lattice-QCD Kernels", in Proceedings of the 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2016.
7. Panayiotis Petrides, Andreas Diavastos, Constantinos Christofi and Pedro Trancoso. "Scalability and Efficiency of Database Queries on Future Many-core Systems", in Proceedings of the 21st Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2013.
8. Andreas Diavastos, Panayiotis Petrides, Gabriel Falcao and Pedro Trancoso. "LDPC Decoding on the Intel SCC", in Proceedings of the 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2012.

Technical Reports:
1. Andreas Diavastos and Pedro Trancoso. "Unified Data-Flow Platform for General Purpose Many-core Systems", Department of Computer Science, University of Cyprus, Nicosia, Cyprus, Technical Report UCY-CS-TR-17-2, September 2017.
2. Andreas Diavastos, George Matheou, Paraskevas Evripidou and Pedro Trancoso. "Data-Driven Multithreading Programming Tool-chain", Department of Computer Science, University of Cyprus, Nicosia, Cyprus, Technical Report UCY-CS-TR-17-3, September 2017.
Backup Slides
Links: Introduction · Motivation · The Problem · The Solution · SWITCHES · The Future · TFluxTM · TFluxSCC · Conclusions
Introduction
• "The number of nodes in future exascale systems may not change dramatically compared to those of today because of the prohibitive cost" ‡
• It is the number of cores in a single node that will increase
  – Many-core processors, e.g. GPUs and Intel MICs
E. Strohmaier, H. W. Meuer, J. Dongarra and H. D. Simon, "The TOP500 List and Progress in High-Performance Computing," in Computer, vol. 48, no. 11, Nov. 2015.
‡ M. M. Resch, "The end of Moore's law: Moving on in HPC beyond current technology," Keynote Presentation, PDP 2016 Conference, Heraklion, Crete, February 2016.
Data-flow Programming Systems
[Comparison table of data-flow implementations – OmpSs, Triggered Instructions, Serialization Sets, OpenDF, DTT/CDTT, SEED, WaveScalar, SWARM, Intel TBB, CnC and Maxeler – across: implementation (software/hardware), scheduling policy (static/dynamic), memory model (shared/private), need for cache-coherence, number of cores/threads tested, maximum speedup achieved, how dependences are expressed, programming language, main contributions, publication venue/date and notes. Almost all of these systems rely on dynamic scheduling, most of the software ones require hardware cache-coherence, and most were evaluated on at most a few tens of cores or in simulation.]
Comparable Systems (characteristics of data-flow implementations)

OmpSs (2011):
• Implementation: Software | Scheduling: Dynamic | Memory model: Shared / GPU | Needs cache-coherence: Yes
• Cores/threads tested: 24 | Max speedup: depends on the application
• Dependences expressed with directives: in(), out(), inout() | Language: C/C++
• Contributions: a single programming model for homogeneous & heterogeneous architectures
• Notes: based on StarSs and OpenMP; builds the task-dependency graph at runtime; each task is executed once

SWARM (2013):
• Implementation: Software | Scheduling: Dynamic | Memory model: Shared / Distributed | Needs cache-coherence: ?
• Cores/threads tested: 24 | Max speedup: 8
• Dependences expressed with a C-macro API (Codelets) representing explicit task dependences | Language: C
• Contributions: unified single-/multi-node interface, transparent to the programmer
• Notes: difficult programming; work stealing across nodes and threads; one scheduler on each thread/node; runtime overhead observed for fine-grain scheduling

Intel TBB (2007):
• Implementation: Software | Scheduling: Dynamic | Memory model: Shared | Needs cache-coherence: Yes
• Cores/threads tested: 8 | Max speedup: 8
• Language: C++
• Contributions: parallel algorithms and data structures; scalable memory allocation and task scheduling
• Notes: rich feature set for general-purpose parallelism; data-dependency graph; each task can execute multiple times
OpenMP Task Overheads
• Intel Knights Corner, 240 threads
• Synthetic kernel based on a differential equation
DDM Systems (characteristics of DDM implementations)
[Comparison of DDM implementations – D2NOW, TFlux, DDM-VMc, DDM-VMs, DDM-VMd, DDM-FPGA, TFluxSCC and TFluxTM – across: TSU implementation (software/hardware), scheduling policy (static and/or dynamic), memory model, need for cache-coherence, number of cores/threads tested, maximum speedup, how dependences are expressed (macros or directives), main contributions and year (2000–2015). Highlights: D2NOW (2000) introduced CacheFlow with a distributed, simulated hardware TSU; TFlux (2008) was the first complete and portable SMP software implementation with directive-based programming; the DDM-VM family added heterogeneous (Cell), SMP and distributed software implementations with a software CacheFlow; DDM-FPGA implemented the TSU in hardware; TFluxSCC (2014/2015) is the first many-core software DDM implementation, reaching a speedup of 48 on 48 cores with no cache-coherence; TFluxTM (2011/2015) is the first integration of a DDM implementation with another model (TM).]
DDM Systems Bottlenecks
• Centralized runtime (all except TFluxSCC)
  – Single point of communication
  – 30% of the time is spent in the TSU (and this increases with the core count)
• Need hardware cache-coherence (all except TFluxSCC)
• Large TSU structures (SG)
  – High memory footprint
  – 65% of the total execution time is spent on TSU allocation & initialization
• Global SG (all except TFluxSCC)
  – Requires protection (e.g. locking) → not scalable!
  – TFluxSCC uses one SG instance for every core → too memory-expensive!
The DDM Model
• Data-Driven Multi-threading model
  – Data-flow implementation with task-level granularity
• Reduces synchronization overheads and memory latencies
  – No barriers, no memory locks
• Reduces core idle time
  – Non-blocking execution
• Control-flow execution within a task
  – Exploits inherent architecture & compiler optimizations
Example of the Need for State
• The 'classic' example is a program that traverses a tree and wants to build a global histogram from values calculated by parallel threads at the leaves of the tree
• The shared state is an array representing the histogram
• The index into the array is calculated by the computation at the leaves
• The indexed array element is incremented
• Lots of potential for conflicting increments
Barth, Paul S., and Rishiyur S. Nikhil. "M-structures: extending a parallel, non-strict, functional language with state." Conference on Functional Programming Languages and Computer Architecture. Springer, Berlin, Heidelberg, 1991.
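A tiny sketch of the conflicting increment, protected by a transaction in the same STAMP-style macro notation used earlier (again an assumption, not code from the thesis): only increments that actually hit the same bin conflict and retry.

#include "tm.h"   /* STAMP-style TM macro layer (assumption) */

/* A leaf task adds its locally computed value to the shared histogram. */
void leaf_task(long *histogram, int bin)
{
    TM_BEGIN();
    long v = (long)TM_SHARED_READ(histogram[bin]);
    TM_SHARED_WRITE(histogram[bin], v + 1);   /* same-bin increments retry */
    TM_END();
}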
Previous Approaches
• 'Functional' languages like SML & F# have general mutable variables
  – Destroys referential transparency
  – Error-prone
  – Requires locks for concurrency (the usual complexity, and it reduces parallelism)
• M-Structures in Id
  – Attempt to 'hide' the complexity of locks behind variables with implicit locking
  – Actually made things worse: the presence of 'invisible' locks leads to deadlock etc.
Labyrinth Implementation of Lee's Algorithm
• Produces automated interconnection of electronic components
• Finds the shortest interconnection between two points
• Part of the code is non-transactional (stages [3] and [4])
• Random accesses to a shared global grid
• We used the Labyrinth TM implementation from STAMP*
[Figure: Lee's algorithm on a grid – (a) basic grid with source S and destination D, (b)–(d) breadth-first expansion with increasing costs 1, 2, 3, 4, (e) destination reached, (f) backtracking along the shortest path]
* C. C. Minh, J. Chung, C. Kozyrakis, and K. Olukotun. "STAMP: Stanford Transactional Applications for Multi-Processing", 2008.
TFlux SCC Implementations
[Figure: TSU placement options on the SCC – Centralized (a dedicated TSU core serving the application cores), 2-threaded (the TSU runs as a separate thread next to the application) and Inline (TSU code inlined with the application). Charts compare the normalized execution time of the 2-threaded and inline versions (with sleep) for MMULT and RK4, and the execution-time breakdown between application and TSU at 800 MHz and 1600 MHz for MMULT, RK4 and their unrolled variants]
Port TFlux to the Intel SCC
• TFluxSCC memory model
  – The on-chip Message Passing Buffer (MPB, 8 KB/core) is used for TSU updates
  – Shared off-chip DRAM holds the application data
    • Originally uncacheable, to avoid conflicts due to the absence of cache-coherence
• Simultaneous access to shared data is not allowed in DDM → cache-coherence is not needed
  – Enabled caching of global data
  – Flush the caches to ensure write-back
[Figure: SCC memory organization – per-core private L1/L2 caches and DRAM, shared on-chip MPB, and off-chip shared memory configurable as uncacheable or cacheable; speedup chart for MMULT, RK4 and TRAPEZ]
TFlux SCC Results
[Figure: speedup of MMULT, QSORT*, QSORT, RK4, TRAPEZ and FFT with small, medium and large data sets on 2–48 cores; the maximum speedup of 48 is reached with 48 cores]
Genetic Algorithm
• A GA requires:
  – A genetic representation of the solution domain
  – A fitness function to evaluate the solution domain
• Example: f(x) = x², x ∈ [0, 31]
  – Genetic representation of a solution: the bit string 10010 encodes the solution 18
  – Fitness function: f(18) = 18² = 324
Genetic Algorithm
• Worked example with f(x) = x², x ∈ [0, 31]:
  – Initial population (bit strings encoding): 28, 6, 14, 30, 4, 15
  – Fitness evaluation, then selection of the best individuals: 30 and 28
  – Crossover and mutation produce the best child: 31
• The loop repeats: fitness evaluation → selection → crossover → mutation
GA Complexity: What's the Catch?
• The application must execute multiple times
• But is that so bad?
  – In HPC, applications are executed over and over again
  – Gather and store the statistics every time the application is executed
  – Create a better schedule so that the next run is faster
  – Every time you do this, you can produce a faster schedule
• The best schedule is stored in a file
  – Load it every time you run the application in the future
• Run the auto-tuning tool with a small input size and apply the schedule to larger data sets
Task Data-flow Execution on Many-core Systems
57
Results: Autonomic Scheduling
Setup: Intel Xeon Phi, 10 generations, 64 individuals, crossover probability 0.0001, mutation probability 0.6
• Consecutive tasks share data from consecutive memory locations
  – Round-robin is the best schedule
• All resources are used by the auto-tuning tool
Profiling: Memory Usage
[Figure: normalized memory usage of OpenMP, TFlux and SWITCHES for MMULT and RK4 on 1–240 threads]