Hardware/software partitioning

Peter Marwedel, TU Dortmund, Informatik 12, Germany

January 17, 2014

© Springer, 2010

Mapping of Applications to Multi-Processor Systems

These slides use Microsoft clip arts. Microsoft copyright restrictions apply.

Structure of this course

Application knowledge

2: Specification & modeling
3: ES hardware
4: System software (RTOS, middleware, …)
5: Evaluation & validation (energy, cost, performance, …)
6: Application mapping
7: Optimization
8: Test

Design repository, design

Numbers denote the sequence of chapters.

© P. Marwedel, Informatik 12, 2014

The need to support heterogeneous architectures

Energy efficiency is a key constraint, e.g. for mobile systems.

Unconventional architectures come close to the IPE.

© Renesas, MPSoC'07; © Hugo De Man/Philips, 2007

☞ How to map to these architectures?

Practical problem in automotive design

Which processor should run the software?

A Simple Classification

Architecture fixed / auto-parallelizing | Fixed architecture | Architecture to be designed
Starting from given task graph | Map to CELL, Hopes, Qiang Xu (HK), Simunic (UCSD) | COOL codesign tool; EXPO/SPEA2; SystemCodesigner
Auto-parallelizing | Mnemee (Dortmund), Franke (Edinburgh), MAPS | Daedalus

Example: System Synthesis

© L. Thiele, ETHZ

Basic Model – Problem Graph

© L. Thiele, ETHZ

Basic Model: Specification Graph

© L. Thiele, ETHZ

Design Space

Computation and communication templates shown: RISC, DSP, FPGA, µE, SDRAM, Cipher, LookUp

Scheduling/arbitration policies shown: EDF, WFQ, TDMA, FCFS, proportional share, static, dynamic, fixed priority

Which architecture is better suited for our application?

[Figure: two candidate architectures, #1 and #2, built from components such as RISC, DSP, µE, Cipher, and LookUp, and from policies such as EDF, TDMA, WFQ, priority-based, and static scheduling]

© L. Thiele, ETHZ

Evolutionary Algorithms for Design Space Exploration (DSE)

© L. Thiele, ETHZ
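The idea of evolutionary design space exploration can be sketched in a few lines. The following is only a minimal illustration, not SPEA2 and not the EXPO implementation: it assumes a hypothetical toy problem in which four tasks are bound either to a slow core (0) or a fast core (1), and it minimizes two made-up objectives (latency, cost). All names (`TASK_LOAD`, `evaluate`, `explore`) are assumptions introduced for this example.

```python
import random


def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and better somewhere (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates, evaluate):
    """Keep the candidates whose objective vectors are not dominated by any other candidate."""
    scored = [(c, evaluate(c)) for c in candidates]
    return [c for c, f in scored if not any(dominates(g, f) for _, g in scored)]


# Hypothetical workload: execution time of each task on the slow core.
TASK_LOAD = [4, 2, 6, 3]


def evaluate(binding):
    """Toy objectives: total latency (the fast core halves a task's time) and number of fast-core bindings."""
    latency = sum(load // 2 if core else load for load, core in zip(TASK_LOAD, binding))
    cost = sum(binding)
    return (latency, cost)


def explore(generations=20, population_size=8, seed=0):
    """Mutate bindings, keep the non-dominated ones, repeat."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in TASK_LOAD) for _ in range(population_size)]
    for _ in range(generations):
        offspring = []
        for parent in population:
            child = list(parent)
            child[rng.randrange(len(child))] ^= 1  # flip the binding of one task
            offspring.append(tuple(child))
        archive = pareto_front(sorted(set(population) | set(offspring)), evaluate)
        population = archive[:population_size]
        while len(population) < population_size:  # refill with copies of archive members
            population.append(rng.choice(archive))
    return sorted(set(pareto_front(population, evaluate)))


if __name__ == "__main__":
    for binding in explore():
        print(binding, evaluate(binding))
```

Each printed line is one trade-off point; a real tool would replace `evaluate` with performance estimation of the candidate architecture.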

Challenges

© L. Thiele, ETHZ

EXPO – Tool architecture (1)

[Figure: exploration cycle linking MOSES, EXPO, and SPEA2. Labels: task graph, scenario graph, flows & resources; system architecture; performance values; selection of "good" architectures]

© L. Thiele, ETHZ

EXPO – Tool architecture (2)

Tool available online: http://www.tik.ee.ethz.ch/expo/expo.html

© L. Thiele, ETHZ

EXPO – Tool (3)

© L. Thiele, ETHZ

Application Model

Example of a simple stream processing task structure:

© L. Thiele, ETHZ

Exploration – Case Study (1)

© L. Thiele, ETHZ

Exploration – Case Study (2)

© L. Thiele, ETHZ

Exploration – Case Study (3)

© L. Thiele, ETHZ

More Results

Performance for encryption/decryption

Performance for RT voice processing

© L. Thiele, ETHZ

Extension: considering memory characteristics with MAMOT

Example

Gains resulting from the use of memory information

Olivera Jovanovic, Peter Marwedel, Iuliana Bacivarov, Lothar Thiele: MAMOT: Memory-Aware Mapping Tool for MPSoC, 15th Euromicro Conference on Digital System Design (DSD 2012), 2012

Design Space Exploration with SystemCoDesigner (Teich et al., Erlangen)

System synthesis comprises:
• Resource allocation
• Actor binding
• Channel mapping
• Transaction modeling

Idea:
• Formulate the synthesis problem as a 0-1 ILP
• Use a Pseudo-Boolean (PB) solver to find a feasible solution
• Use a multi-objective evolutionary algorithm (MOEA) to optimize the decision strategy of the PB solver

© J. Teich, U. Erlangen-Nürnberg
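What "formulating the synthesis problem as a 0-1 ILP" means for actor binding can be illustrated with a small sketch. It is not SystemCoDesigner's formulation: the sketch assumes one binary variable x[actor, resource] per possible binding, two constraint families (each actor is bound exactly once, each resource has a capacity), and it replaces the PB solver by exhaustive enumeration, which only works for toy sizes. The actor and resource names are hypothetical.

```python
from itertools import product


def feasible_bindings(allowed, capacity):
    """Enumerate all 0-1 assignments x[actor, resource] that satisfy the binding constraints.

    allowed:  dict actor -> list of resources the actor may run on
    capacity: dict resource -> maximum number of actors it can host
    """
    actors = sorted(allowed)
    resources = sorted(capacity)
    variables = [(a, r) for a in actors for r in resources]
    for bits in product((0, 1), repeat=len(variables)):
        x = dict(zip(variables, bits))
        # x[a, r] may only be 1 if actor a is allowed on resource r
        if any(x[a, r] and r not in allowed[a] for a, r in variables):
            continue
        # each actor is bound to exactly one resource
        if any(sum(x[a, r] for r in resources) != 1 for a in actors):
            continue
        # no resource hosts more actors than its capacity
        if any(sum(x[a, r] for a in actors) > capacity[r] for r in resources):
            continue
        yield {a: r for (a, r), bit in x.items() if bit}


if __name__ == "__main__":
    allowed = {"dct": ["cpu", "hw"], "q": ["cpu"], "vle": ["cpu"]}
    capacity = {"cpu": 2, "hw": 1}
    print(list(feasible_bindings(allowed, capacity)))
```

In this example the CPU can host only two actors, so the only feasible solution moves the DCT to the hardware block. A PB solver would find such solutions without enumerating all assignments.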

A 3rd approach based on evolutionary algorithms: SYMTA/S

[R. Ernst et al.: A framework for modular analysis and exploration of heterogeneous embedded systems, Real-time Systems, 2006, p. 124]


A fixed architecture approach: Map → CELL

Martino Ruggiero, Luca Benini: Mapping task graphs to the CELL BE processor, 1st Workshop on Mapping of Applications to MPSoCs, Rheinfels Castle, 2008

Partitioning into Allocation and Scheduling

© Ruggiero, Benini, 2008


Daedalus Design-flow

• KPNgen (automatic parallelization): sequential application → parallel application (Kahn Process Network specification)
• System-level specification (common XML interface): platform specification, mapping specification, Kahn Process Network specification
• Sesame (system-level design space exploration): uses high-level models; explore, modify, select instances
• Library of IP cores: high-level models and RTL-level models
• ESPAM (system-level synthesis): produces the RTL-level specification (synthesizable VHDL and C/C++ code for processors)
• Xilinx Platform Studio (XPS): produces the MP-SoC (multi-processor system on chip)

Ed Deprettere et al.: Toward Composable Multimedia MP-SoC Design, 1st Workshop on Mapping of Applications to MPSoCs, Rheinfels Castle, 2008

© E. Deprettere, U. Leiden
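A Kahn Process Network, the parallel application model used in this flow, can be pictured as processes that communicate only through unbounded FIFO channels with blocking reads. The following minimal sketch assumes Python threads as processes and `queue.Queue` objects as channels; it is an illustration of the model only and not code generated by KPNgen or ESPAM. The helper names (`process`, `run_network`, `END`) are hypothetical.

```python
import queue
import threading

END = object()  # end-of-stream marker passed along the channels


def process(fn, inbox, outbox):
    """Start a process that applies fn to every token read from inbox and writes the result to outbox."""
    def run():
        while True:
            token = inbox.get()      # blocking read, as required by the KPN model
            if token is END:
                outbox.put(END)      # forward the end marker and terminate
                return
            outbox.put(fn(token))    # writes never block: the FIFO is unbounded
    thread = threading.Thread(target=run)
    thread.start()
    return thread


def run_network(stages, tokens):
    """Connect the stage functions in a chain of FIFO channels and stream the tokens through it."""
    channels = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [process(fn, channels[i], channels[i + 1]) for i, fn in enumerate(stages)]
    for token in tokens:
        channels[0].put(token)
    channels[0].put(END)
    results = []
    while True:
        token = channels[-1].get()
        if token is END:
            break
        results.append(token)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_network([lambda v: v + 1, lambda v: v * 2], [1, 2, 3]))  # [4, 6, 8]
```

Because every process only blocks on reads from a single channel, the output does not depend on how the threads are scheduled, which is the property that makes the model attractive for mapping onto different platforms.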

JPEG/JPEG2000 case study

Example architecture instances for a single-tile JPEG encoder:
• 2 MicroBlaze processors (50KB)
• 1 MicroBlaze, 1 HW DCT (36KB)
• 6 MicroBlaze processors (120KB)
• 4 MicroBlaze, 2 HW DCT (68KB)

[Figure: block diagrams of the four instances, showing how the Vin, DCT, Q, VLE, and Vout tasks are mapped to the processing elements; memory sizes are annotated per component]

© E. Deprettere, U. Leiden

Sesame DSE results: Single JPEG encoder DSE

© E. Deprettere, U. Leiden


Auto-Parallelizing Compilers

Discipline "High Performance Computing":
• Research on vectorizing compilers for more than 25 years.
• Traditionally: Fortran compilers.
• Such vectorizing compilers are usually inappropriate for Multi-DSPs, since their assumptions on the memory model are unrealistic:
  • Communication between processors via shared memory
  • Memory has only one single common address space
• De facto no auto-parallelizing compiler for Multi-DSPs!
• Work of Franke, O'Boyle (Edinburgh)
• Work of Daniel Cordes, Olaf Neugebauer (TU Dortmund)

Example: Edge detection benchmark

D. Cordes: Automatic Parallelization for Embedded Multi-Core Systems using High-Level Cost Models, PhD thesis, TU Dortmund, 2013, https://eldorado.tu-dortmund.de/bitstream/2003/31796/1/Dissertation.pdf

Task Parallelism

D. Cordes: Automatic Parallelization for Embedded Multi-Core Systems using High-Level Cost Models, PhD thesis, TU Dortmund, 2013, https://eldorado.tu-dortmund.de/bitstream/2003/31796/1/Dissertation.pdf

MaCC Application: Multi-Objective Task-Level Parallelization

Restrictions of embedded architectures:
• Low computational power, small memories, battery-driven, …

Speedup vs. energy consumption for matrix multiplication on MPARM with MEMSIM

[Figure: energy [mJ] plotted against speedup (based on CPU cycles) for 1 to 4 cores]

Cores     Time [cycles]    Energy [mJ]    Speedup
1 core    125,256,949      36.433         1.00
2 cores   74,172,007       58.095         1.69
3 cores   53,804,696       79.607         2.33
4 cores   44,389,652       103.655        2.82

• A good trade-off between the different objectives must be found
• As fast as necessary instead of as fast as possible

© D. Cordes, 2013
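The rule "as fast as necessary instead of as fast as possible" can be made concrete with the measurements above. The sketch below recomputes the speedup column from the cycle counts and then picks the configuration with the lowest energy that still meets a deadline. The deadline values and the function names are assumptions chosen for illustration; only the cycle and energy figures come from the slide.

```python
# (cycles, energy in mJ) per number of cores, taken from the matrix multiplication measurements
MEASUREMENTS = {
    1: (125_256_949, 36.433),
    2: (74_172_007, 58.095),
    3: (53_804_696, 79.607),
    4: (44_389_652, 103.655),
}


def speedup(cores):
    """Speedup relative to the single-core run, based on CPU cycles."""
    return MEASUREMENTS[1][0] / MEASUREMENTS[cores][0]


def cheapest_config(deadline_cycles):
    """Return the core count with the lowest energy among those meeting the deadline, or None."""
    feasible = [c for c, (cycles, _) in MEASUREMENTS.items() if cycles <= deadline_cycles]
    if not feasible:
        return None
    return min(feasible, key=lambda c: MEASUREMENTS[c][1])


if __name__ == "__main__":
    for cores in MEASUREMENTS:
        print(cores, round(speedup(cores), 2))
    # Hypothetical deadline of 80 million cycles: two cores suffice.
    print(cheapest_config(80_000_000))  # 2
```

With these numbers, every added core costs more energy, so the cheapest feasible configuration is always the smallest core count that meets the deadline.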

Multi-Objective Task-Level Approach (2)

[Figure: speedup of execution time, with the communication overhead marked]

• Edge detect benchmark from the UTDSP benchmark suite
• Target architecture: MPARM with MEMSIM energy model (1-4 cores)

© D. Cordes, 2013

Pipeline Parallelization

1. Extract pipeline stages (horizontal splits)

[Figure: the nodes N1 to N4 of the loop body are split horizontally into two pipeline stages, T1 (N1, N2) and T2 (N3, N4)]

ILP formulation (the binary variable x_n^t appears to indicate whether node n is assigned to stage t):

    x_n1^t1 = 1, x_n2^t1 = 1, x_n3^t1 = 0, ...

Example loop:

    for (i = 0; i < NUMAV; ++i) {
        float sample_real[SLICE];
        float sample_imag[SLICE];

        /* Window the current slice of the input signal. */
        int index = i * DELTA;
        for (int j = 0; j < SLICE; ++j) {
            sample_real[j] = input_signal[index + j] * hamming[j];
            sample_imag[j] = zero;
        }

        /* Transform the windowed slice. */
        fft(sample_real, sample_imag);

        /* Accumulate the squared magnitudes. */
        for (int j = 0; j < SLICE; ++j) {
            mag[j] = mag[j] + (((sample_real[j] * sample_real[j])
                              + (sample_imag[j] * sample_imag[j])) / SLICE_2);
        }
    }

[Figure: pipelined schedule in which stages T1 and T2 process successive loop iterations in an overlapped fashion; time = 37]
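The effect of a horizontal split can be illustrated with a small sketch. It does not reproduce the ILP of the slide: it assumes hypothetical per-node execution costs, restricts stages to contiguous runs of nodes, and uses exhaustive enumeration instead of an ILP solver to find the split whose slowest stage is as short as possible (the slowest stage limits pipeline throughput).

```python
from itertools import combinations


def best_pipeline_split(node_costs, stages):
    """Split the node sequence into contiguous stages, minimizing the cost of the slowest stage.

    Returns (bounds, loads): stage k covers nodes bounds[k] .. bounds[k+1]-1 and costs loads[k].
    """
    n = len(node_costs)
    best = None
    for cuts in combinations(range(1, n), stages - 1):
        bounds = (0, *cuts, n)
        loads = [sum(node_costs[a:b]) for a, b in zip(bounds, bounds[1:])]
        if best is None or max(loads) < max(best[1]):
            best = (bounds, loads)
    return best


if __name__ == "__main__":
    # Hypothetical costs of nodes N1..N4 in cycles per iteration.
    bounds, loads = best_pipeline_split([3, 2, 4, 1], stages=2)
    print(bounds, loads)  # (0, 2, 4) [5, 5]
```

For these assumed costs the best split puts N1 and N2 into the first stage and N3 and N4 into the second, so both stages take 5 cycles per iteration.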

Heterogeneous Pipeline Parallelization

Heterogeneous MPSoCs can be more efficient than homogeneous ones
• Cores behave differently on the same parts of an application
• Efficient balancing of tasks is very difficult

[Figure: big.LITTLE power vs. performance curves for Cortex-A7 and Cortex-A15, marking the lowest and highest operating point of each core]

• Use ILP as a clear mathematical model integrating cost models
• Combine mapping with task extraction

© D. Cordes, 2013

Heterogeneous pipeline parallelization results

[Figure: bar chart of speedup (axis from 1.0 to 13.0) for the homogeneous and the heterogeneous pipeline. Benchmarks: adpcm enc., boundary value, compress, edge detect, filterbank, fir 256 64, iir 4 64, jpeg2000, latnrm 32 64, mult 10 10, spectral, average]

• Cycle-accurate simulator: CoMET (Vast)
• 100, 250, 500, 500 MHz ARM1176 cores
• Baseline: sequential execution on the 100 MHz core
• Average speedup: 8.9x
• Max. speedup: 11.9x

© D. Cordes, 2013
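To put these speedups in perspective, a rough upper bound can be computed. The sketch assumes, as a simplification that is not part of the original results, that execution speed scales linearly with clock frequency and that the work can be split perfectly with no communication or memory overhead. Under that assumption the four cores together offer 13.5 times the throughput of the 100 MHz baseline core.

```python
def ideal_speedup(core_mhz, baseline_mhz):
    """Upper bound on speedup, assuming perfect load balance and speed proportional to clock frequency."""
    return sum(core_mhz) / baseline_mhz


def balanced_shares(core_mhz):
    """Fraction of the work each core would receive so that all cores finish at the same time."""
    total = sum(core_mhz)
    return [f / total for f in core_mhz]


if __name__ == "__main__":
    cores = [100, 250, 500, 500]          # clock frequencies from the experiment
    print(ideal_speedup(cores, 100))      # 13.5
    print(balanced_shares(cores))
```

Measured against this idealized bound, the reported average of 8.9x corresponds to about two thirds of the bound, and the maximum of 11.9x to close to 90% of it.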


MAPS-TCT Framework

Rainer Leupers, Weihua Sheng: MAPS: An Integrated Framework for MPSoC Application Parallelization, 1st Workshop on Mapping of Applications to MPSoCs, Rheinfels Castle, 2008

© Leupers, Sheng, 2008

Summary

• Clear trend toward multi-processor systems for embedded systems; there exists a large design space
• Using these architectures crucially depends on mapping tools
• Mapping applications onto heterogeneous MP systems needs allocation (if the hardware is not fixed), binding of tasks to resources, and scheduling
• Two criteria for classification
  • Fixed / flexible architecture
  • Auto-parallelizing / non-parallelizing
• Introduction to the proposed Mnemee tool chain
• Evolutionary algorithms are currently the best choice