Task Data-flow Execution on Many-core Systems
Andreas Diavastos
[email protected]
Supervisor: Pedro Moura Trancoso
Department of Computer Science, University of Cyprus
CASPER Group: Computer Architecture, Systems and Performance Evaluation Research Group
Application Performance Scalability
[Figure: TOP500 trends – #nodes/system vs. #cores/node]
• More resources → Higher performance
• Many nodes: Prohibitive cost
• Many cores: Scaling limitations in runtime systems
We need efficient and scalable systems for many-core processors
E. Strohmaier, H. W. Meuer, J. Dongarra and H. D. Simon, "The TOP500 List and Progress in High-Performance Computing," in Computer, vol. 48, no. 11, Nov. 2015.
‡ M. M. Resch, "The end of Moore's law: Moving on in HPC beyond current technology," Keynote Presentation, PDP 2016 Conference, Heraklion, Crete, February 2016.
State-of-the-art
• Limited performance scalability in many-cores (OpenMP, OmpSs, TBB, …)
  – Dynamic scheduling → Runtime overhead
  – Centralized runtimes → Single point of access
  – Runtime dependence resolution → More overhead
  – Programming → Each one has its own API
Existing runtime systems are not well suited for what is coming in future many-cores
The Objectives
• Objective 1: Increase Application Parallelism
• Objective 2: Programming Productivity
• Objective 3: Scalable Architectures
• Objective 4: Scalable Runtime System
• Objective 5: Efficient Resource Utilization
Improving application performance is a joint task between the hardware and the software
Solution Overview
• Task-based Data-flow execution model
  – Exploits large amounts of parallelism
• Performance scaling factors:
  – Programmability
  – Locality-aware execution
  – The degree of parallelism
  – Scalable architecture designs
  – Low-overhead runtime systems
  – Efficient use of the available resources
Outline
• Prologue
• Speculative Parallelism in Data-flow
• DDM on Many-core Systems
• A Scalable Framework for Task-based Data-flow Execution
• Autonomic Mapping of Resources
• Epilogue
Outline (recap) – next: Speculative Parallelism in Data-flow
Contributes to:
• Objective 1: Increase Application Parallelism
Integrate Transactions into the Data-flow Model
SPECULATIVE PARALLELISM IN DATA-FLOW
Andreas Diavastos, Pedro Trancoso, Mikel Lujan and Ian Watson. "Integrating Transactions into the Data-Driven Multi-threading Model using the TFlux Platform", International Journal of Parallel Programming (IJPP), 2015.
Andreas Diavastos, Pedro Trancoso, Mikel Lujan and Ian Watson. "Integrating Transactions into the Data-Driven Multi-threading Model using the TFlux Platform", in Proceedings of the Data-Flow Execution Models for Extreme Scale Computing Workshop (DFM), 2011.
Integrate Transactions in Data-flow
Data-flow:
• Data-flow-based execution
• Synchronization based on data availability
Transactional Memory:
• Speculative execution
• Execute tasks freely
• Abort/restart on conflicts
TFluxTM:
• Exploit runtime parallelism in Data-flow
• Increase the Data-flow application coverage
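As a concrete illustration, the sketch below shows how a task body that updates shared state could be executed speculatively. It is a minimal sketch assuming the STAMP-style TM macros used by benchmarks such as Labyrinth (TM_BEGIN, TM_SHARED_READ, TM_SHARED_WRITE, TM_END); it is not the actual TFluxTM code, and grid, idx and cost are illustrative names.

#include "tm.h"   /* STAMP-style TM macro layer (assumption) */

/* A data-flow task body executed speculatively: conflicting tasks abort
 * and restart transparently instead of being ordered by data-flow arcs. */
void task_update_grid(long *grid, long idx, long cost)
{
    TM_BEGIN();                                   /* start speculative region */
    long cur = (long)TM_SHARED_READ(grid[idx]);   /* transactional read       */
    if (cost < cur)
        TM_SHARED_WRITE(grid[idx], cost);         /* transactional write      */
    TM_END();                                     /* commit (or retry)        */
}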
The TFlux Parallel Processing Platform
• TFlux: provides the DDM model on commodity multi-core systems
[Figure: TFlux software stack – C with DDM directives → TFlux preprocessor → unmodified C compiler → DDM binary; runtime support runs the DDM kernels (Kernel 1…n) on a TSU group (TSU 1…n), over an unmodified operating system, unmodified ISA and hardware]
¤ Stavrou, Kyriakos, et al. "TFlux: A portable platform for data-driven multithreading on commodity multicore systems." 37th International Conference on Parallel Processing (ICPP), IEEE, 2008.
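For flavour, a short sketch of DDM-annotated C in the TFlux style follows. The directive spellings, the depends clause and the loop bodies are illustrative assumptions, not the exact TFlux preprocessor syntax; the preprocessor would turn each annotated block into a DDM thread and emit the corresponding TSU calls.

#define N 1024
double a[N], b[N];

int main(void)
{
    /* Illustrative DDM-style annotations (not the exact TFlux directive set). */
    #pragma ddm startprogram

    #pragma ddm thread 1                  /* producer DThread: fills a[]        */
    for (int i = 0; i < N; i++) a[i] = i * 0.5;
    #pragma ddm endthread

    #pragma ddm thread 2 depends(1)       /* consumer: runs after thread 1 ends */
    for (int i = 0; i < N; i++) b[i] = 2.0 * a[i];
    #pragma ddm endthread

    #pragma ddm endprogram
    return 0;
}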
TFluxTM Implementation
• Integrated the TinySTM* runtime system into TFluxSoft¤
[Figure: Speedup of Lee's algorithm (Labyrinth) for grids 256x256x3-n256, 256x256x5-n256 and 512x512x7-n512 on 2–12 threads; an annotation marks the much lower point where data-flow alone would be]
Exploits runtime parallelism in Data-flow, but the software TM runtime overhead is still too high
* Pascal Felber, Christof Fetzer, and Torvald Riegel. "Dynamic Performance Tuning of Word-Based Software Transactional Memory", Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2008.
¤ Stavrou, Kyriakos, et al. "TFlux: A portable platform for data-driven multithreading on commodity multicore systems." 37th International Conference on Parallel Processing (ICPP), IEEE, 2008.
Outline (recap) – next: DDM on Many-core Systems
Contributes to:
• Objective 3: Support for Scalable Architectures
Acknowledgement: We would like to thank Intel Labs for lending the Intel SCC research processor
TFlux on the Intel SCC
DATA-DRIVEN MULTI-THREADING ON MANY-CORE SYSTEMS
Andreas Diavastos, Giannos Stylianou and Pedro Trancoso. "TFluxSCC: Exploiting Performance on Future Many-core Systems through Data-Flow", in Proceedings of the 23rd Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2015.
Andreas Diavastos, Giannos Stylianou and Pedro Trancoso. "TFluxSCC: A Case Study for Exploiting Performance in Future Many-core Systems" (poster paper), in Proceedings of the 11th ACM Conference on Computing Frontiers, May 2014.
TFlux on the Intel SCC
• Intel Single-chip Cloud Computer
  – 48-core processor
  – Offers a global address space
  – No hardware support for cache coherence
• Data-Driven Multi-threading model
  – Data-flow implementation with task-level granularity
  – Data-flow dependences avoid simultaneous accesses to shared data
  – Cache coherence is not needed
TFluxSCC Runtime System
• Fully decentralized runtime system
• One TSU for every application thread
• Each core has its own instance of the SG
  – No locking protection required
[Figure: Speedup of QSORT*, QSORT, Blackscholes, FFT, MMULT, RK4 and Trapez on up to 48 cores; the best benchmarks reach speedups of roughly 32–48 (maximum speedup of 48 achieved with 48 cores), while others reach only about 5–8]
A DDM implementation for a many-core processor without requiring hardware cache-coherence support
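A minimal sketch of what a per-core TSU could look like is given below, assuming each core keeps a private synchronization-graph (SG) copy with a ready count per DThread; the structure names, constants and the omission of the cross-core MPB update messages are simplifying assumptions, not the TFluxSCC source.

#define NUM_DTHREADS  8
#define MAX_CONSUMERS 4

typedef struct {
    int  ready_count;                 /* producers that have not finished yet */
    int  consumers[MAX_CONSUMERS];    /* SG arcs: DThreads that depend on us  */
    int  num_consumers;
    void (*body)(void);               /* the DThread's code                   */
} sg_entry_t;

static sg_entry_t sg[NUM_DTHREADS];   /* this core's private SG: no locks     */

static void tsu_step(void)            /* one pass of the per-core TSU loop    */
{
    for (int t = 0; t < NUM_DTHREADS; t++) {
        if (sg[t].ready_count == 0) {                 /* all inputs arrived   */
            sg[t].ready_count = -1;                   /* mark as dispatched   */
            sg[t].body();                             /* run to completion    */
            for (int c = 0; c < sg[t].num_consumers; c++)
                sg[sg[t].consumers[c]].ready_count--; /* local consumer update*/
        }
    }
}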
Outline (recap) – next: A Scalable Framework for Task-based Data-flow Execution
Contributes to:
• Objective 1: Increase application parallelism
• Objective 2: Programmability
• Objective 3: Support for scalable architectures
• Objective 4: Low-overhead runtime
• Objective 5: Efficient utilization of resources
Acknowledgement: We would like to thank The Cyprus Institute (CyI) for allowing us to use their Intel Xeon Phi facility
SWITCHES
A SCALABLE FRAMEWORK FOR TASK-BASED DATA-FLOW EXECUTION
Andreas Diavastos and Pedro Trancoso. "SWITCHES: A Lightweight Runtime for Dataflow Execution of Tasks on Many-Cores", ACM Transactions on Architecture and Code Optimization (TACO) 14, 3, Article 31, August 2017.
Andreas Diavastos and Pedro Trancoso. "Auto-tuning Static Schedules for Task Data-flow Applications", in Proceedings of the 1st ACM Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems (ANDARE), September 2017.
SWITCHES: Key Characteristics
• Software implementation
• Implements task-based data-flow execution
• Applies dependences at any level of granularity
• Cross-loop iteration dependences
• Compile-time (static) scheduling/assignment policy
• Explicit task resource allocation
• Fully distributed triggering system
  – No central point of communication
  – No locking protection on any shared runtime data
• Programming API: the OpenMP v4.5 standard
SWITCHES: Execution Model
• Compile time: build the synchronization graph (tasks T1–T4) and add a switch to every task
• Runtime: a task moves through the states Task (T) → Executing (E) → Finished (F); when a task finishes, its switch turns ON
• A task may start executing only when the switches of all of its producers are ON; the program finishes when all tasks have finished
[Figure: the synchronization graph T1–T4 with its switches progressing from OFF to ON as tasks execute and finish]
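A minimal sketch of the switch mechanism, assuming one boolean switch per task and busy-wait polling; the struct layout and names are assumptions for illustration, not the SWITCHES source.

#include <stdbool.h>

typedef struct task {
    volatile bool switch_on;      /* set to true exactly once, when Finished  */
    struct task **producers;      /* tasks whose switches must be ON first    */
    int           num_producers;
    void        (*body)(void);    /* the task's code                          */
} task_t;

static bool task_ready(const task_t *t)
{
    for (int i = 0; i < t->num_producers; i++)
        if (!t->producers[i]->switch_on)      /* a producer has not finished  */
            return false;
    return true;
}

static void run_task(task_t *t)
{
    while (!task_ready(t))        /* Task state: waiting on producer switches */
        ;
    t->body();                    /* Executing                                */
    t->switch_on = true;          /* Finished: enables the consumers          */
}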
SWITCHES: Platform Programming Model
• Input: an OpenMP v4.5 application, fed to the Translator

int main() {
    #pragma omp task depend(out: x)   /* illustrative depend clause */
    { /* do something */ }
    ...
    #pragma omp task depend(in: x)
    { /* do something */ }
}
SWITCHES: The Translator
[Figure: the Translator takes the OpenMP task code (#pragma omp task {…}), the dependency graph and a user-specified scheduling/assignment policy, and produces parallel code of the form "Task1: checkP(task1); … Update(sw1); Task2: checkP(task2); … Update(sw2);" which an unmodified C/C++ compiler turns into the binary]
• Source-to-source tool
  – Command-line tool
• Implements the transitive-reduction optimization
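The generated code roughly follows the checkP()/Update() pattern shown in the figure; the sketch below is an assumption of its shape, with hypothetical task bodies, not the Translator's literal output.

/* One worker's statically assigned sequence of tasks, as emitted
 * by the Translator (shape only; identifiers are hypothetical). */
void worker(void)
{
    checkP(task1);        /* spin until task1's producer switches are ON */
    task1_body();         /* the original #pragma omp task body          */
    Update(sw1);          /* turn task1's switch ON                      */

    checkP(task2);        /* task2 depends on task1 through sw1          */
    task2_body();
    Update(sw2);
}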
SWITCHES: API
• Implemented the OpenMP v4.5 API standard
• Extended the taskloop directive to support:
  – Resource allocation (better utilization)
  – Dependences
  – Task reduction
  – Scheduling policy (cross-loop iteration dependences)

#pragma omp taskloop private(list) firstprivate(list) grainsize(size) \
        num_threads(NUMBER) depend(type : list) \
        reduction(OPERATION : list) schedule(type [, CHUNK])
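A small usage example of the extended directive follows; note that num_threads, depend, reduction and schedule on taskloop are SWITCHES extensions rather than standard OpenMP 4.5 clauses, and the dot-product code itself is only illustrative.

/* Dot product split into tasks of 64 iterations, run on 32 threads,
 * with a task reduction over sum (SWITCHES-extended clauses). */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    #pragma omp taskloop grainsize(64) num_threads(32) \
            depend(in: x, y) reduction(+ : sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}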
Cross-Loop Dependences with Granularity
OpenMP taskloop:
• Variable granularity
• Increases data locality
• Barrier between loops
OpenMP task:
• Dependences across loop iterations
• Asynchronous execution
• No data locality
SWITCHES (see the sketch below):
• Dependences across loop iterations
• Variable granularity
• Asynchronous execution
• Increases data locality
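The sketch below illustrates the SWITCHES case for two dependent loops; the way depend and schedule are combined is an assumption about the extended clause semantics, not a verbatim SWITCHES example, and produce/consume are placeholder functions.

#define N     4096
#define CHUNK   64
double a[N], b[N];

/* Loop B consumes what loop A produces. With the extensions, chunks of B
 * may start as soon as the matching chunks of A finish: no global barrier. */
void pipeline(void)
{
    #pragma omp taskloop grainsize(CHUNK) schedule(static, CHUNK) depend(out: a)
    for (int i = 0; i < N; i++)
        a[i] = produce(i);                /* loop A */

    #pragma omp taskloop grainsize(CHUNK) schedule(static, CHUNK) depend(in: a)
    for (int i = 0; i < N; i++)
        b[i] = consume(a[i]);             /* loop B, overlapped with loop A */
}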
SWITCHES: Runtime System
[Figure: each core runs one or more software threads; every software thread has its own scheduler and a statically assigned workload of tasks; shared memory holds the application data and the runtime data (task switches, producer switches, cross-loop switches, scheduler data)]
• Multiple software threads on each core
• Each software thread has its own scheduler
• Each task holds switch references to its producers
• Task switches are stored in shared memory
• Switch variables are writer-owned: each is written only by its owner task, so no locking is needed
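Continuing the task_t sketch from the execution-model slide, one software thread's scheduler could look as follows, again as an assumption rather than the actual SWITCHES runtime: it only walks its statically assigned tasks and never takes a lock.

/* Per-thread static scheduler: my_tasks is this thread's compile-time
 * assignment; switches live in shared memory, each with a single writer. */
void thread_scheduler(task_t **my_tasks, int my_count)
{
    for (int k = 0; k < my_count; k++) {
        task_t *t = my_tasks[k];
        while (!task_ready(t))    /* poll producer switches, lock-free */
            ;
        t->body();
        t->switch_on = true;      /* written only by this owner thread */
    }
}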
Results: Experimental Setup

Benchmark   Description                          Data set sizes (computation iterations)
                                                 DS1           DS2           DS3
Q12         Nested-loop join from TPC-H          60K × 1.5K    60K × 15K     60K × 150K
MMULT       Matrix multiply                      256 × 256     512 × 512     1024 × 1024
RK4         Differential equation                4800          9600          19200
SU3         Wilson-Dirac equation                1920K         3840K         7680K
Poisson2D   5-point 2D stencil computation       4096          8192          16384
SparseLU    LU factorization                     120 × 32      240 × 32      480 × 32
OCEAN       Red-Black solution (Gauss-Seidel)    4096 × 4096   8192 × 8192   16384 × 16384

Compared implementations:
• Data-parallel benchmarks: OMP-Dynamic-For, OMP-Static-For, OMP-Task, OMP-Taskloop
• Task-parallel benchmarks: OMP-Task, OMP-Task-Dep, OMP-Taskloop

• Benchmarks from: TFlux, BOTS, Kastors
• Intel Xeon Phi 7120P (KNC)
  – 61 cores × 4 threads (244 hardware threads in total)
  – @ 1.2 GHz
  – Intel ICC v17.0.2 (libiomp5 and the -O3 flag)
Results: Strong Scaling
• SWITCHES achieves performance scalability through:
  – A static, decentralized runtime system for low-overhead synchronization
  – Variable-granularity loop tasks for data locality
  – The cross-loop dependence mechanism for increased parallelism
  – A fast dependence-resolution mechanism
Results: Task Resource Allocation
[Figure: iterations i1…iN of Loop A feed iterations j1…jN of Loop B; a 20% improvement is obtained]
• Variable-granularity loop tasks
• Cross-loop iteration dependences
• Task resource allocation
Data-Parallel: Weak Scaling
[Figure: best speedups – SWITCHES: 72× (2), 141×, 92× (4), 81×; OpenMP: 67× Taskloop (4), 103× Static-For, 79× Static-For (4), 57× Dynamic]
26% improvement
• Smaller input size → less work per thread
  – OpenMP scheduling overhead becomes more visible
Task-Parallel: Weak Scaling
[Figure: best speedups – SWITCHES: 80×, 55× (2), 11× (4); OpenMP: 66× Taskloop, 34× Taskloop (4), 9× Taskloop (4)]
35% improvement
• Scheduling overhead + runtime dependence resolution
  – OpenMP: heavy dependence-resolution mechanism
Outline (recap) – next: Autonomic Mapping of Resources
Motivation
• Round-robin & random schedules
  – Produced with no knowledge of the application, the task dependences or the hardware
• "Ninja" hand-coded schedules
  – 15%–30% performance increase
SWITCHES: Autonomic Mapping
• Use machine learning to find better schedules
• Take into account: task dependences, data usage, hardware topology
• Non-dominated Sorting Genetic Algorithm II (NSGA-II)
  – A heuristic procedure that searches for the optimal solution within a population of candidate solutions
  – Multi-objective algorithm
K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, Apr. 2002.
SWITCHES: Autonomic Mapping
• The population is a set of schedules
• Genetic representation: a thread-to-core assignment, i.e. an array indexed by thread ID whose entries are core IDs
[Figure: example individuals mapping thread IDs 0–9 to core IDs]
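A minimal sketch of such an individual and of a mutation step is shown below, assuming the genome is simply an array of core IDs indexed by thread ID; the struct, the constants and the objective fields are illustrative, and the fitness values would come from actually running the application with that schedule.

#include <stdlib.h>

#define NUM_THREADS 240     /* software threads to place   */
#define NUM_CORES    61     /* cores of the target machine */

typedef struct {
    int    core_of[NUM_THREADS];  /* genome: thread-to-core assignment     */
    double time;                  /* objective 1: measured execution time  */
    double power;                 /* objective 2: measured power           */
} individual_t;

static void random_schedule(individual_t *ind)
{
    for (int t = 0; t < NUM_THREADS; t++)
        ind->core_of[t] = rand() % NUM_CORES;
}

static void mutate(individual_t *ind, double prob)
{
    for (int t = 0; t < NUM_THREADS; t++)
        if ((double)rand() / RAND_MAX < prob)
            ind->core_of[t] = rand() % NUM_CORES;  /* move one thread */
}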
SWITCHES: Autonomic Mapping
• Every parameter is configurable
  – Number of threads, population size, number of generations, crossover & mutation probabilities
• Fitness objectives
  – Performance, power, temperature
  – Any combination of them can be used
• The best schedule is stored in a file
  – It can be reloaded at any time
Results: Synthetic Kernels
Setup: Intel Xeon Phi, 50 generations, 64 individuals, crossover probability 0.0001, mutation probability 0.6
• No-dependences kernel
  – 18% performance increase over round-robin
  – Within 15% of the hand-coded schedule
• Dependences kernel
  – 11% performance increase over round-robin
  – Within 10% of the hand-coded schedule
• 30% fewer resources used!
Task Data-flow Execution on Many-core Systems
35
Results: Poisson2D
Setup: Intel Xeon Phi, 10 generations, 64 individuals, crossover probability 0.0001, mutation probability 0.6
10% improvement
• 32 threads:
  – Auto-tuning chooses hardware threads that belong to the same core
• 240 threads:
  – 2× performance improvement
  – 30% fewer resources used!
Outline (recap) – next: Epilogue
Future Directions
• Speculative parallelism on many-cores
• Inter-node scalability using data-flow
• Adaptive resource manager
• Heterogeneous many-cores
Achieved Objectives & Contributions
• Objective 1: Increase Application Parallelism
  – Data-flow + TM: more parallelism → higher performance
• Objective 2: Programming Productivity
  – Implemented the OpenMP v4.5 API and extended the taskloop directive → higher programming productivity
  – Source-to-source translation tool
• Objective 3: Scalable Architectures
  – Minimum hardware support required (no cache-coherence) → reduced hardware complexity
• Objective 4: Scalable Runtime System
  – SWITCHES: lightweight and decentralized → 32% performance increase (240 threads)
• Objective 5: Efficient Resource Utilization
  – Auto-tuning scheduling tool → maximum performance with 30% fewer resources
C.A.S.P.E.R. Group: Computer Architecture, Systems and Performance Evaluation Research
Visit us: www.cs.ucy.ac.cy/carch/casper

List of Publications

Journals:
1. Andreas Diavastos and Pedro Trancoso. "SWITCHES: A Lightweight Runtime for Dataflow Execution of Tasks on Many-Cores", ACM Transactions on Architecture and Code Optimization (TACO) 14, 3, Article 31, August 2017.
2. Andreas Diavastos, Pedro Trancoso, Mikel Lujan and Ian Watson. "Integrating Transactions into the Data-Driven Multi-threading Model using the TFlux Platform", International Journal of Parallel Programming (IJPP), 2015.
3. Andreas Diavastos, Giannos Stylianou and Giannis Koutsou. "Exploiting Very-Wide Vector Processing for Scientific Applications", Computing in Science & Engineering (CiSE), vol. 17, no. 6, Nov/Dec 2015, pp. 83-87.

Conferences & Workshops:
1. G. Karakonstantis, Andreas Diavastos, et al. "An Energy-Efficient and Error-Resilient Server Ecosystem Exceeding Conservative Scaling Limits", in Proceedings of Design, Automation and Test in Europe (DATE) 2018, Dresden, Germany, March 2018.
2. K. Tovletoglou, Andreas Diavastos, et al. "An Energy-Efficient and Error-Resilient Server Ecosystem Exceeding Conservative Scaling Limits", in Proceedings of the Energy-efficient Servers for Cloud and Edge Computing Workshop (ENeSCE 2017), co-located with HiPEAC 2017, Stockholm, Sweden, January 2017.
3. Andreas Diavastos and Pedro Trancoso. "Auto-tuning Static Schedules for Task Data-flow Applications", in Proceedings of the 1st ACM Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems (ANDARE), September 2017.
4. Andreas Diavastos, Giannos Stylianou and Pedro Trancoso. "TFluxSCC: Exploiting Performance on Future Many-core Systems through Data-Flow", in Proceedings of the 23rd Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2015.
5. Andreas Diavastos, Pedro Trancoso, Mikel Lujan and Ian Watson. "Integrating Transactions into the Data-Driven Multi-threading Model using the TFlux Platform", in Proceedings of the Data-Flow Execution Models for Extreme Scale Computing Workshop (DFM), 2011.
6. Andreas Diavastos, Giannos Stylianou and Giannis Koutsou. "Exploiting Very-Wide Vectors on Intel Xeon Phi with Lattice-QCD Kernels", in Proceedings of the 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2016.
7. Panayiotis Petrides, Andreas Diavastos, Constantinos Christofi and Pedro Trancoso. "Scalability and Efficiency of Database Queries on Future Many-core Systems", in Proceedings of the 21st Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2013.
8. Andreas Diavastos, Panayiotis Petrides, Gabriel Falcao and Pedro Trancoso. "LDPC Decoding on the Intel SCC", in Proceedings of the 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2012.

Technical Reports:
1. Andreas Diavastos and Pedro Trancoso. "Unified Data-Flow Platform for General Purpose Many-core Systems", Department of Computer Science, University of Cyprus, Nicosia, Cyprus, Technical Report UCY-CS-TR-17-2, September 2017.
2. Andreas Diavastos, George Matheou, Paraskevas Evripidou and Pedro Trancoso. "Data-Driven Multithreading Programming Tool-chain", Department of Computer Science, University of Cyprus, Nicosia, Cyprus, Technical Report UCY-CS-TR-17-3, September 2017.
Backup Slides
Links: Introduction · Motivation · The Problem · The Solution · SWITCHES · The Future · TFluxTM · TFluxSCC · Conclusions
Introduction
• "The number of nodes in future exascale systems may not change dramatically compared to those of today because of the prohibitive cost" ‡
• It is the number of cores in a single node that will increase
  – Many-core processors, e.g. GPUs and Intel MICs
E. Strohmaier, H. W. Meuer, J. Dongarra and H. D. Simon, "The TOP500 List and Progress in High-Performance Computing," in Computer, vol. 48, no. 11, Nov. 2015.
‡ M. M. Resch, "The end of Moore's law: Moving on in HPC beyond current technology," Keynote Presentation, PDP 2016 Conference, Heraklion, Crete, February 2016.
Data-flow Programming Systems
[Comparison table of data-flow implementations – OmpSs, Triggered Instructions, Serialization Sets, OpenDF, DTT/CDTT, SEED, WaveScalar, SWARM, Intel TBB, CnC and Maxeler – across: implementation (software/hardware), scheduling policy (static/dynamic), memory model (shared/private), need for cache-coherence, number of cores/threads tested, maximum speedup achieved, how dependences are expressed, programming language, main contributions, publication venue/date and notes. Almost all of these systems rely on dynamic scheduling, most of the software ones require hardware cache-coherence, and most were evaluated on at most a few tens of cores or in simulation.]
Comparable Systems (characteristics of data-flow implementations)

OmpSs (2011):
• Implementation: Software | Scheduling: Dynamic | Memory model: Shared / GPU | Needs cache-coherence: Yes
• Cores/threads tested: 24 | Max speedup: depends on the application
• Dependences expressed with directives: in(), out(), inout() | Language: C/C++
• Contributions: a single programming model for homogeneous & heterogeneous architectures
• Notes: based on StarSs and OpenMP; builds the task-dependency graph at runtime; each task is executed once

SWARM (2013):
• Implementation: Software | Scheduling: Dynamic | Memory model: Shared / Distributed | Needs cache-coherence: ?
• Cores/threads tested: 24 | Max speedup: 8
• Dependences expressed with a C-macro API (Codelets) representing explicit task dependences | Language: C
• Contributions: unified single-/multi-node interface, transparent to the programmer
• Notes: difficult programming; work stealing across nodes and threads; one scheduler on each thread/node; runtime overhead observed for fine-grain scheduling

Intel TBB (2007):
• Implementation: Software | Scheduling: Dynamic | Memory model: Shared | Needs cache-coherence: Yes
• Cores/threads tested: 8 | Max speedup: 8
• Language: C++
• Contributions: parallel algorithms and data structures; scalable memory allocation and task scheduling
• Notes: rich feature set for general-purpose parallelism; data-dependency graph; each task can execute multiple times
OpenMP Task Overheads
• Intel Knights Corner, 240 threads
• Synthetic kernel based on a differential equation
DDM Systems (characteristics of DDM implementations)
[Comparison of DDM implementations – D2NOW, TFlux, DDM-VMc, DDM-VMs, DDM-VMd, DDM-FPGA, TFluxSCC and TFluxTM – across: TSU implementation (software/hardware), scheduling policy (static and/or dynamic), memory model, need for cache-coherence, number of cores/threads tested, maximum speedup, how dependences are expressed (macros or directives), main contributions and year (2000–2015). Highlights: D2NOW (2000) introduced CacheFlow with a distributed, simulated hardware TSU; TFlux (2008) was the first complete and portable SMP software implementation with directive-based programming; the DDM-VM family added heterogeneous (Cell), SMP and distributed software implementations with a software CacheFlow; DDM-FPGA implemented the TSU in hardware; TFluxSCC (2014/2015) is the first many-core software DDM implementation, reaching a speedup of 48 on 48 cores with no cache-coherence; TFluxTM (2011/2015) is the first integration of a DDM implementation with another model (TM).]
DDM Systems Bottlenecks
• Centralized runtime (all except TFluxSCC)
  – Single point of communication
  – 30% of the time is spent in the TSU (and this increases with the core count)
• Need hardware cache-coherence (all except TFluxSCC)
• Large TSU structures (SG)
  – High memory footprint
  – 65% of the total execution time is spent on TSU allocation & initialization
• Global SG (all except TFluxSCC)
  – Requires protection (e.g. locking) → not scalable!
  – TFluxSCC uses one SG instance for every core → too memory-expensive!
The DDM Model
• Data-Driven Multi-threading model
  – Data-flow implementation with task-level granularity
• Reduces synchronization overheads and memory latencies
  – No barriers, no memory locks
• Reduces core idle time
  – Non-blocking execution
• Control-flow execution within a task
  – Exploits inherent architecture & compiler optimizations
Example of the Need for State
• The 'classic' example is a program that traverses a tree and wants to build a global histogram from values calculated by parallel threads at the leaves of the tree
• The shared state is an array representing the histogram
• The index into the array is calculated by the computation at the leaves
• The indexed array element is incremented
• Lots of potential for conflicting increments
Barth, Paul S., and Rishiyur S. Nikhil. "M-structures: extending a parallel, non-strict, functional language with state." Conference on Functional Programming Languages and Computer Architecture. Springer, Berlin, Heidelberg, 1991.
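A tiny sketch of the conflicting increment, protected by a transaction in the same STAMP-style macro notation used earlier (again an assumption, not code from the thesis): only increments that actually hit the same bin conflict and retry.

#include "tm.h"   /* STAMP-style TM macro layer (assumption) */

/* A leaf task adds its locally computed value to the shared histogram. */
void leaf_task(long *histogram, int bin)
{
    TM_BEGIN();
    long v = (long)TM_SHARED_READ(histogram[bin]);
    TM_SHARED_WRITE(histogram[bin], v + 1);   /* same-bin increments retry */
    TM_END();
}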
Previous Approaches
• 'Functional' languages like SML & F# have general mutable variables
  – Destroys referential transparency
  – Error-prone
  – Requires locks for concurrency (the usual complexity, and it reduces parallelism)
• M-Structures in Id
  – Attempt to 'hide' the complexity of locks behind variables with implicit locking
  – Actually made things worse: the presence of 'invisible' locks leads to deadlock etc.
Labyrinth Implementation of Lee's Algorithm
• Produces automated interconnection of electronic components
• Finds the shortest interconnection between two points
• Part of the code is non-transactional (stages [3] and [4])
• Random accesses to a shared global grid
• We used the Labyrinth TM implementation from STAMP*
[Figure: Lee's algorithm on a grid – (a) basic grid with source S and destination D, (b)–(d) breadth-first expansion with increasing costs 1, 2, 3, 4, (e) destination reached, (f) backtracking along the shortest path]
* C. C. Minh, J. Chung, C. Kozyrakis, and K. Olukotun. "STAMP: Stanford Transactional Applications for Multi-Processing", 2008.
TFlux SCC Implementations
[Figure: TSU placement options on the SCC – Centralized (a dedicated TSU core serving the application cores), 2-threaded (the TSU runs as a separate thread next to the application) and Inline (TSU code inlined with the application). Charts compare the normalized execution time of the 2-threaded and inline versions (with sleep) for MMULT and RK4, and the execution-time breakdown between application and TSU at 800 MHz and 1600 MHz for MMULT, RK4 and their unrolled variants]
Port TFlux to the Intel SCC
• TFluxSCC memory model
  – The on-chip Message Passing Buffer (MPB, 8 KB/core) is used for TSU updates
  – Shared off-chip DRAM holds the application data
    • Originally uncacheable, to avoid conflicts due to the absence of cache-coherence
• Simultaneous access to shared data is not allowed in DDM → cache-coherence is not needed
  – Enabled caching of global data
  – Flush the caches to ensure write-back
[Figure: SCC memory organization – per-core private L1/L2 caches and DRAM, shared on-chip MPB, and off-chip shared memory configurable as uncacheable or cacheable; speedup chart for MMULT, RK4 and TRAPEZ]
TFlux SCC Results
[Figure: speedup of MMULT, QSORT*, QSORT, RK4, TRAPEZ and FFT with small, medium and large data sets on 2–48 cores; the maximum speedup of 48 is reached with 48 cores]
Genetic Algorithm
• A GA requires:
  – A genetic representation of the solution domain
  – A fitness function to evaluate the solution domain
• Example: f(x) = x², x ∈ [0, 31]
  – Genetic representation of a solution: the bit string 10010 encodes the solution 18
  – Fitness function: f(18) = 18² = 324
Genetic Algorithm
• Worked example with f(x) = x², x ∈ [0, 31]:
  – Initial population (bit strings encoding): 28, 6, 14, 30, 4, 15
  – Fitness evaluation, then selection of the best individuals: 30 and 28
  – Crossover and mutation produce the best child: 31
• The loop repeats: fitness evaluation → selection → crossover → mutation
GA Complexity: What's the Catch?
• The application must execute multiple times
• But is that so bad?
  – In HPC, applications are executed over and over again
  – Gather and store the statistics every time the application is executed
  – Create a better schedule so that the next run is faster
  – Every time you do this, you can produce a faster schedule
• The best schedule is stored in a file
  – Load it every time you run the application in the future
• Run the auto-tuning tool with a small input size and apply the schedule to larger data sets
Task Data-flow Execution on Many-core Systems
57
Results: Autonomic Scheduling
Setup: Intel Xeon Phi, 10 generations, 64 individuals, crossover probability 0.0001, mutation probability 0.6
• Consecutive tasks share data from consecutive memory locations
  – Round-robin is the best schedule
• All resources are used by the auto-tuning tool
Profiling: Memory Usage
[Figure: normalized memory usage of OpenMP, TFlux and SWITCHES for MMULT and RK4 on 1–240 threads]