Hugo De Man/Philips, 2007. Unconventional architectures close to IPE. â How to map to these architectures ? © Re ne .... to MPSoCs, Rheinfels. Castle, 2008 ...
Peter Marwedel TU Dortmund, Informatik 12 Germany
2014年 01 月 17 日
© Springer, 2010
Mapping of Applications to Multi-Processor Systems
These slides use Microsoft clip arts. Microsoft copyright restrictions apply.
Application Knowledge
Structure of this course
2: Specification & Modeling 3: ES-hardware
4: system software (RTOS, middleware, …)
Design repository
6: Application mapping
Design 8: Test
7: Optimization 5: Evaluation & validation (energy, cost, performance, …)
Numbers denote sequence of chapters p. marwedel, informatik 12, 2014
- 2-
The need to support heterogeneous architectures Unconventional architectures close to IPE
© Renesas, MPSoC‘07
Energy efficiency a key constraint, e.g. for mobile systems
© Hugo De Man/Philips, 2007
p. marwedel, informatik 12, 2014
How to map to these architectures ? - 3-
Practical problem in automotive design
Which processor should run the software?
p. marwedel, informatik 12, 2014
- 4-
A Simple Classification
Architecture fixed/ Fixed Architecture Architecture to be Auto-parallelizing designed
Starting from given task graph
Auto-parallelizing
Map to CELL, Hopes,
COOL codesign tool;
Qiang XU (HK) Simunic (UCSD)
SystemCodesigner
EXPO/SPEA2
Mnemee (Dortmund) Daedalus Franke (Edinburgh) MAPS
p. marwedel, informatik 12, 2014
- 5-
Example: System Synthesis
p. marwedel, informatik 12, 2014
© L. Thiele, ETHZ
- 6-
Basic Model – Problem Graph
p. marwedel, informatik 12, 2014
© L. Thiele, ETHZ
- 7-
Basic Model: Specification Graph
p. marwedel, informatik 12, 2014
© L. Thiele, ETHZ
- 8-
Design Space Scheduling/Arbitration
Computation Templates
Communication Templates
Cipher
EDF
FPGA RISC
SDRAM
mE
WFQ
DSP DSP
TDMA proportional share
dynamic fixed priority
LookUp
FCFS static
Which architecture is better suited for our application?
Architecture # 2
Architecture # 1 LookUp
RISC
EDF TDMA
Priority Cipher
DSP
WFQ
mE
mE
mE static
mE
p. marwedel, informatik 12, 2014
mE
mE
© L. Thiele, ETHZ
- 9-
Evolutionary Algorithms for Design Space Exploration (DSE)
p. marwedel, informatik 12, 2014
© L. Thiele, ETHZ
- 10 -
Challenges
p. marwedel, informatik 12, 2014
© L. Thiele, ETHZ
- 11 -
EXPO – Tool architecture (1)
MOSES
EXPO task graph, scenario graph, flows & resources
system architecture performance values
SPEA 2
Exploration Cycle
selection of “good” architectures
p. marwedel, informatik 12, 2014
© L. Thiele, ETHZ
- 12 -
EXPO – Tool architecture (2)
Tool available online: http://www.tik. ee.ethz.ch/ex po/expo.html
© L. Thiele, ETHZ p. marwedel, informatik 12, 2014
- 13 -
EXPO – Tool (3)
p. marwedel, informatik 12, 2014
© L. Thiele, ETHZ
- 14 -
Application Model Example of a simple stream processing task structure:
p. marwedel, informatik 12, 2014
© L. Thiele, ETHZ
- 15 -
Exploration – Case Study (1)
p. marwedel, informatik 12, 2014
© L. Thiele, ETHZ
- 16 -
Exploration – Case Study (2)
p. marwedel, informatik 12, 2014
© L. Thiele, ETHZ
- 17 -
Exploration – Case Study (3)
p. marwedel, informatik 12, 2014
© L. Thiele, ETHZ
- 18 -
More Results
Performance for encryption/decryption
Performance for RT voice processing
p. marwedel, © L. Thiele, ETHZ informatik 12, 2014
- 19 -
Extension: considering memory characteristics with MAMOT Example
p. marwedel, informatik 12, 2014
- 20 -
Gains resulting from the use of memory information
Olivera Jovanovic, Peter Marwedel, Iuliana Bacivarov, Lothar Thiele: MAMOT: Memory-Aware Mapping Tool for MPSoC, 15th Euromicro Conference on Digital System Design (DSD 2012) 2012 p. marwedel, informatik 12, 2014
- 21 -
Design Space Exploration with SystemCoDesigner (Teich et al., Erlangen) System Synthesis comprises: • • • •
Resource allocation Actor binding Channel mapping Transaction modeling
Idea: • Formulate synthesis problem as 0-1 ILP • Use Pseudo-Boolean (PB) solver to find feasible solution • Use multi-objective Evolutionary algorithm (MOEA) to optimize Decision Strategy of the PB solver
© J. Teich, U. Erlangen-Nürnberg p. marwedel, informatik 12, 2014
- 22 -
A 3rd approach based on evolutionary algorithms: SYMTA/S:
[R. Ernst et al.: A framework for modular analysis and exploration of heteterogenous embedded systems, Real-time Systems, 2006, p. 124]
p. marwedel, informatik 12, 2014
- 23 -
A Simple Classification
Architecture fixed/ Fixed Architecture Architecture to be Auto-parallelizing designed
Starting from given task graph
Auto-parallelizing
Map to CELL, Hopes
COOL codesign tool;
Qiang XU (HK) Simunic (UCSD)
SystemCodesigner
EXPO/SPEA2
Mnemee (Dortmund) Daedalus Franke (Edinburgh) MAPS
p. marwedel, informatik 12, 2014
- 24 -
Martino Ruggiero, Luca Benini: Mapping task graphs to the CELL BE processor, 1st Workshop on Mapping of Applications to MPSoCs, Rheinfels Castle, 2008
A fixed architecture approach: Map CELL
p. marwedel, informatik 12, 2014
- 25 -
Partitioning into Allocation and Scheduling R
p. marwedel, informatik 12, 2014
© Ruggiero, Benini, 2008
- 26 -
p. marwedel, informatik 12, 2014
- 27 -
A Simple Classification
Architecture fixed/ Fixed Architecture Architecture to be Auto-parallelizing designed
Starting from given task graph
Auto-parallelizing
Map to CELL, Hopes,
COOL codesign tool;
Qiang XU (HK) Simunic (UCSD)
SystemCodesigner
EXPO/SPEA2
Mnemee (Dortmund) Daedalus Franke (Edinburgh) MAPS
p. marwedel, informatik 12, 2014
- 28 -
Daedalus Design-flow
Sequential application
Explore, modify, select instances High-level Models
Library of IP cores
Sesame System-level design space exploration System-Level Specification Common XML Interface
RTL-level Models
Platform specification
Mapping specification
Automatic KPNgen Parallelization
Parallel application Kahn Process Network specification
ESPAM System-level synthesis RTL-Level Specification
Ed Deprettere et al.: Toward Composable Multimedia MP-SoC Design,1st Workshop on Mapping of Applications to MPSoCs, Rheinfels Castle, 2008
Synthesizable VHDL
MP-SoC
C/C++ code for processors
Multi-processor System on Chip Xilinx Platform Studio (XPS)
(Synthesizable VHDL and C/C++ code for processors) p. marwedel, informatik 12, 2014
© E. Deprettere, U. Leiden
- 29 -
JPEG/JPEG2000 case study Example architecture instances for a single-tile JPEG encoder: 16KB
32KB
32KB
Q,VLE,Vout
Vin,Q,VLE,Vout
4KB
2KB Vin,DCT
2 MicroBlaze processors (50KB) DCT, Q 8KB
DCT, Q
1 MicroBlaze, 1HW DCT (36KB) 2KB
4x2KB 32KB VLE, Vout
Vin
DCT
Q
2KB Vin
2KB
8KB
8KB
32KB VLE, Vout
DCT, Q
8KB
4x2KB
2KB
2KB DCT, Q
DCT
DCT
2KB Q
4x16KB
6 MicroBlaze processors (120KB)
4 MicroBlaze, 2HW DCT (68KB) p. marwedel, informatik 12, 2014
© E. Deprettere, U. Leiden
- 30 -
Sesame DSE results: Single JPEG encoder DSE
p. marwedel, informatik 12, 2014
© E. Deprettere, U. Leiden
- 31 -
A Simple Classification
Architecture fixed/ Fixed Architecture Architecture to be Auto-parallelizing designed
Starting from given task graph
Auto-parallelizing
Map to CELL, Hopes,
COOL codesign tool;
Qiang XU (HK) Simunic (UCSD)
SystemCodesigner
EXPO/SPEA2
Mnemee (Dortmund) Daedalus Franke (Edinburgh) MAPS
p. marwedel, informatik 12, 2014
- 32 -
Auto-Parallelizing Compilers Discipline “High Performance Computing”:
Research on vectorizing compilers for more than 25 years. Traditionally: Fortran compilers. Such vectorizing compilers usually inappropriate for MultiDSPs, since assumptions on memory model unrealistic: • Communication between processors via shared memory • Memory has only one single common address space De Facto no auto-parallelizing compiler for Multi-DSPs! Work of Franke, O’Boyle (Edinburgh) Work of Daniel Cordes, Olaf Neugebauer (TU Dortmund) p. marwedel, informatik 12, 2014
- 33 -
Example: Edge detection benchmark
D. Cordes: Automatic Parallelization for Embedded Multi-Core Systems using HighLevel Cost Models, PhD thesis, TU Dortmund, 2013, https://eldorado.tu-dortmund.de/ bitstream/2003/31796/1/Dissertation.pdf
p. marwedel, informatik 12, 2014
- 34 -
Task Parallelism
p.Parallelization marwedel, D. Cordes: Automatic for Embedded Multi-Core Systems using High-Level Cost Models, PhD thesis, TU Dortmund, 2013, https://eldorado.tu-dortmund.de/ bitstream/2003/31796/1/Dissertation.pdf informatik 12, 2014
- 35 -
MaCC Application: Multi-Objective Task-Level Parallelization Restrictions of embedded architectures Low computational power, small memories, battery-driven, … 120 4 Cores 100
Speedup vs Energy Consumption for Matrix Multiplication on MPARM with MEMSIM
3 Cores Energy [mJ]
80 2 Cores
60 1 Core 40 20 0 1
1.5
2
2.5
3
Time [Cycles] Energy [mJ] Speedup 1 Core 125,256,949 36.433 1.00 2 Cores 74,172,007 58.095 1.69 3 Cores 53,804,696 79.607 2.33 4 Cores 44,389,652 103.655 2.82
Speedup (Based on CPU Cycles)
Good trade-off between different objectives must be found
As fast as necessary instead of as fast as possible p. marwedel, informatik 12, 2014
© D. Cordes, 2013
- 36 -
Communication overhead
Multi-Objective Task-Level Approach (2)
Speedup of Execution Time Edge detect benchmark from UTDSP benchmark suite Target architecture: MPARM with MEMSIM energy model (1-4 cores) p. marwedel, informatik 12, 2014
© D. Cordes, 2013
- 37 -
Pipeline Parallelization 1. Extract pipeline stages (horizontal splits) N1
T1
N1 N2
N2
t1 xn1 = 1, t1 xn2 = 1, t1 xn3 = 0, ...
1. 3. 5. 7.
N3 N4
N3 T2
N4
9.
1
fft(sample_real, sample_imag);
T1
T2
1
3 4
1 2
5 6 7
2 3
8
13.
17.
T1
2
int index = i * DELTA; for (int j = 0; j < SLICE; ++j) { sample_real[j] = input_signal[index + j] * hamming[j]; sample_imag[j] = zero; }
11.
15.
ILP formulation:
0
for (i = 0; i < NUMAV; ++i) { float sample_real[SLICE]; float sample_imag[SLICE];
for (int j = 0; j < SLICE; ++j) { mag[j] = mag[j] + (((sample_real[j] * sample_real[j]) + (sample_imag[j] * sample_imag[j])) / SLICE_2); T2 }
19. }
p. marwedel, informatik 12, 2014
9 10
3 4
11 12
... 4 time=37
- 38 -
Heterogeneous Pipeline Parallelization Heterogeneous MPSoCs can be more efficient than homogeneous Cores behave differently on same parts of application Efficient balancing of tasks very difficult Power
Cortex-A7 Cortex-A15
Highest Cortex-A15 Operating Point
big.LITTLE
Lowest Cortex-A15 Operating Point Lowest Cortex-A7 Operating Point
Highest Cortex-A7 Operating Point Performance
Use ILP as clear mathematical model integrating cost models Combine mapping with task extraction p. marwedel, informatik 12, 2014
© D. Cordes, 2013
- 39 -
Heterogeneous pipeline parallelization results Homogeneous Pipeline
Heterogeneous Pipeline
13.0
Speedup
11.0 9.0 7.0 5.0 3.0
ad pc
m bo en un c. da ry va lu e co m pr es ed s ge de te ct fil te rb an k fir 25 66 4 iir 4 64 jp eg 20 la 00 tn rm 32 64 m ul t1 0 10 sp ec tra l av er ag e
1.0
Cycle accurate simulator: CoMET (Vast) 100, 250, 500, 500 MHz ARM1176 cores Baseline: Sequential on 100 MHz core
Average speedup: 8.9x Max.Speedup: 11.9x
p. marwedel, informatik 12, 2014
© D. Cordes, 2013
- 40 -
p. marwedel, informatik 12, 2014
Rainer Leupers, Weihua Sheng: MAPS: An Integrated Framework for MPSoC Application Parallelization, 1st Workshop on Mapping of Applications to MPSoCs, Rheinfels Castle, 2008
© Leupers, Sheng, 2008
MAPS-TCT Framework
- 41 -
Summary Clear trend toward multi-processor systems for embedded systems, there exists a large design space Using architecture crucially depends on mapping tools Mapping applications onto heterogeneous MP systems needs allocation (if hardware is not fixed), binding of tasks to resources, scheduling Two criteria for classification • Fixed / flexible architecture • Auto parallelizing / non-parallelizing Introduction to proposed Mnemee tool chain Evolutionary algorithms currently the best choice p. marwedel, informatik 12, 2014
- 42 -