Coarse Grain Reconfigurable Architectures

[email protected] 22 August 2013

Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de

CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of Mechanical Engineering, Universidade de Brasilia


outline (slightly modified version)

• Introduction
• Speed-ups obtained by Reconfigurable Computing
• Manycore Crisis & von Neumann Syndrome
• The Impact of Reconfigurable Computing
• Programmer education: new roadmap needed
• Conclusions

5 key issues

• climate change faster than predicted: by carbon emission, primarily from power plants?
• very high and growing computer energy cost – and a growing number of power plants needed
• the manycore programming crisis stalls progress (end of the free ride on the Gordon Moore curve)
• technologically stalled Moore's Law [Nick Tredennick (Gilder), 2003] – 2008: 65, 45, 32 nm; Tom Williams (keynote): the 20 nm wall
• Reconfigurable Computing is a promising alternative

History of data processing

The first reconfigurable computer, prototyped in 1884 by Herman Hollerith: datastream-based (a DPU). The 1st Xilinx FPGA came 100 years later.

fine-grained reconfigurable: Field-Programmable Gate Array (FPGA)

Xilinx's old "island architecture": an array of CLBs (Configurable Logic Blocks) embedded in a routing fabric of connect boxes and switch boxes. (Re)configuration sets up a wire forming a connect from one CLB to another, e.g. from CLB A to CLB B.

Configware programming: no instruction streams.

In Hollerith's day, reprogramming meant rewiring manually, or swapping a pre-wired board. 60 years later RAM became available (ferrite cores), motivating the von Neumann paradigm (J. v. Neumann, 1946).

RAM-based reconfiguration: a switch box has "hidden RAM" (about 150 transistors and 150 flipflops, FF); part of the FFs are this "hidden RAM". Configware code is loaded before run time into the hidden RAM of the switch boxes and connect boxes; the stored bits (0s and 1s) form the wires connecting CLB A to CLB B. This allows patches even at the customer's desk.

Conditional swap on a coarse-grained array: if X > Y then swap. A reconfigurable Data Path Unit (rDPU, 32 bits wide – a CFB, not a CPU!) has inputs Xi, Yi and outputs Xo, Yo; the comparator (>) controls multiplexers that either rout thru only, or rout thru and swap. The swap is turned into a wiring pattern; the configware code is the input.

SNN filter on a supersystolic array: mainly a pipe network. Mapped by the KressArray Xplorer [Ulrich Nageldinger], CoDe-X inside [Jürgen Becker]. Array size 10 x 16; legend: rDPU not used / used for routing only / operator and routing / port not used / location marker; backbus connect; 99% placement efficiency.

Platform-FPGA: 56 – 424 fast on-chip Block RAMs (BRAMs), 8 – 32 fast serial I/O channels, 256 – 1704 BGA pins. [courtesy Lattice Semiconductor]

Reconfigurable Supercomputing
• Silicon Graphics: Reconfigurable Application-Specific Computing (RASC™)
• Cray XD1: Xilinx Virtex-II Pro, library by Cray
• Chuck Thacker ... even Microsoft is working at it (lab in Cambridge, UK, etc.)
• Supercomputing 2007, Reno, Nevada, USA: 9600 registered attendees, 440 exhibitors

Another coarse-grained r-Array: a Coarse-grained Reconfigurable Array – not CLBs but CFBs. FPGAs have been mainstream for more than a decade.

Conditional Swap Example: parallelization of the bubble sort algorithm.
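The bubble-sort parallelization named above can be sketched in software. This is a minimal model (the helper names are mine, not from the talk): each conditional-swap unit implements "if X > Y then swap", and a column of such units applied n times sorts n items, as in an odd-even transposition network.

```python
def cond_swap(x, y):
    """One rDPU operator: if X > Y then swap, else route through."""
    return (y, x) if x > y else (x, y)

def odd_even_transposition_sort(data):
    """Bubble sort parallelized into columns of conditional-swap units.

    Each time step applies a whole column of cond_swap units "in parallel"
    (simulated sequentially here); n columns suffice to sort n items.
    """
    a = list(data)
    n = len(a)
    for step in range(n):                 # n pipeline time steps
        start = step % 2                  # alternate odd / even columns
        for i in range(start, n - 1, 2):  # all swap units of one column
            a[i], a[i + 1] = cond_swap(a[i], a[i + 1])
    return a

print(odd_even_transposition_sort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
```

In hardware the inner loop disappears entirely: every column is a physical row of swap units, so one sorting pass costs one time step instead of n.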


what means Configware?

Time domain – Software Compiler: Software Source → Software Code (instruction-procedural).

Space domain – Software to Configware Migration: Configware Source → mapper (placement & routing) and data scheduler → Configware Code (structural: space domain) + Flowware Code (data-procedural).

Many-core: Break-through or Breakdown?

Industry is facing a disruptive turning point that "could reset µP HW & SW roadmaps for the next 30 years" [David Patterson]. Intel's vision: MultiCore – forcing a historic transition to a parallel programming model yet to be invented [David Callahan]. HPC users lack understanding in basic precepts* – it's an education, a qualification, and an R&D problem. The stakes are high: "I would be panicked if I were in industry" [John Hennessy].

*) PRACE consortium (Partnership foR Advanced Computing in Europe), http://www.prace-project.eu/documents/D3.3.1_document_final.pdf

The von Neumann Syndrome

Declining Programmer Productivity

The Law of More: programmer productivity declines disproportionately with increasing parallelism. In particular HPC application domains, massive parallelism requires 10 – 30 professionals in multi-disciplinary, multi-institutional teams for 5 – 10 years [Douglass Post, DoD HPCMP, panelist at SC07]. By the time the software is done, the machine is obsolete.


Massive Overhead Phenomena

The single-core von Neumann CPU suffers overhead – instruction fetch, state address computation, data address computation, "data meet PU" and other overhead, i/o to/from off-chip RAM – each adding instruction streams, piling up to code sizes of astronomic dimensions.

2006, C. V. "RAM" Ramamoorthy: "von Neumann Syndrome". 1986, E.I.S. Projekt: 94% of run time for address computation; total speed-up after migration: x 15000. 2008, David Callahan: "a terrifying number of processes running in parallel, create sequential-processing bottlenecks and losses in data locality".

Early critiques: Dijkstra 1968: The Goto considered harmful. Koch et al. 1975: The universal Bus considered harmful. Backus 1978: Can programming be liberated from the von Neumann style? Arvind et al. 1983: A critique of Multiprocessing the von Neumann Style.

manycore von Neumann: arrays of massive overhead phenomena

A von Neumann many-core replicates the single-core overhead (instruction fetch, state address computation, data address computation, "data meet PU" and other overhead, i/o to/from off-chip RAM) proportionately to the number of processors – and adds overhead that grows disproportionately to the number of processors: inter-PU communication, message passing overhead, transactional memory overhead, multithreading overhead etc., each requiring instruction streams. Fast on-chip memory cannot store such huge instruction code blocks.

Speed-up factors obtained by Software to Configware migration

[Figure: speed-up factors on a log scale (10^2 … 10^6), by application area. DSP and wireless: MAC, FFT, Viterbi decoding, Reed-Solomon decoding. Image processing, pattern matching, multimedia: real-time face detection 6000, video-rate stereo vision 900, pattern recognition 730, SPIHT wavelet-based image compression 457. Crypto: DES breaking. Bioinformatics: Smith-Waterman pattern matching, protein identification, BLAST, molecular dynamics simulation. Astrophysics: GRAPE. Reported factors range from about 20 to 28500.]

Energy saving factors obtained by software to configware migration

Energy saving factors are almost x10 less than the speed-up factors … and could be improved. [Figure: the speed-up chart again, with energy-saving factors at roughly one tenth of the speed-up factors.]

Accelerator card from Bruchsal:
• 16 FPGAs
• 1.5 TeraMAC/s (MAC means Multiply and ACcumulate; Tera means 10^12, i.e. 1 000 000 000 000)
• I/O bandwidth: 50 GByte/s
• manufacturer: SIEMENS Bruchsal

rDPA: reconfigurable datapath array (coarse-grained reconfigurable) – an array of rDPUs.

von Neumann overhead vs. Reconfigurable Computing (a many-core of CPUs vs. an rDPA anti machine):

overhead                        | von Neumann machine | anti machine
instruction fetch               | instruction stream  | none*
state address computation       | instruction stream  | none*
data address computation        | instruction stream  | none*
data meet PU + other overhead   | instruction stream  | none*
i/o to/from off-chip RAM        | instruction stream  | none*
inter-PU communication          | instruction stream  | none*
message passing overhead        | instruction stream  | none*
transactional memory overhead   | instruction stream  | none*
multithreading overhead etc.    | instruction stream  | none*

Data meet the processor (CPU) – illustrating the von Neumann syndrome: inefficient transport over off-chip memory by memory-cycle-hungry instruction streams, by software. This is just one of many von Neumann overhead phenomena.

Data meet the CPU – illustrating acceleration by Flowware: placement of the execution locality (not moving the data) within the pipe network, generated by the Configware compiler* – *) before run time (at compile time).

What did we learn? There are 2 kinds of datastreams:
1) indirectly moved, by an instruction stream machine (von Neumann): extremely inefficient;
2) directly moved, by a datastream machine (from Reconfigurable Computing): very efficient. "Dataflow machine" would be a nice term, but was introduced by a different scene* – *) meanwhile dead: not really a dataflow machine, but it had used compilers accepting a dataflow language.

What else did we learn? There are 2 kinds of parallelism:
1) concurrent processes: instruction stream parallelism (CPU manycores): inefficient;
2) data parallelism by parallel datastreams (in Reconfigurable Computing systems): efficient.
Conclusion: data parallelism brings the performance (we do data processing!).

What parallelism? [Hartenstein's watering can model]: data parallelism on an array of rDPUs has no von Neumann bottleneck; instruction parallelism on an array of CPUs has many von Neumann bottlenecks.


Put old ideas into practice (POIIP)

"We need a complete re-definition of CS" [Burton Smith and other celebrities]. Wrong! I do not agree, finding that ... [Reiner Hartenstein] ... "The biggest payoff will come from putting old ideas into practice and teaching people how to apply them properly." [David Parnas] "We need a complete re-definition of curriculum recommendations – they are missing several key issues." [Reiner Hartenstein]

The Embedded Systems Approach?

The embedded systems communities support their own educational approach: Advanced Real Time Systems; Real-Time Systems (Sweden); Recommendations for Designing new ICT Curricula; Chess – Center for Hybrid and Embedded Software Systems (courses in embedded systems); WESE – Workshop on Embedded Systems Education.

"You can always teach programming to a hardware guy ... but you can never teach hardware to a programmer." It's not the programmer's fault: it's due to obsolete CS curricula.

CS is a Monster – fighting against obsolete curricula? Graduate Curriculum on Embedded Software and Systems (EU).

Fully wrong educational mainstream approaches:
1) the basic mind set is exclusively instruction-stream-oriented – data streams are considered exotic;
2) mapping parallelism into the time domain – abstracting away the space domain is fatal.
We need a dual-rail education.

POIIP for Software to Hardware Migration and for Software to Configware Migration – 2 key rules of thumb, terrifically simple:
1) a loop turns into a pipeline [1979];
2) a decision box turns into a demultiplexer [1967].

Two Dichotomies. Dichotomy = mutual allocation to two opposed domains such that a third domain is excluded. The dichotomy model serves as an educational orientation guide for dual-rail education, to overcome the software/configware chasm & the software/hardware chasm:
1) Machine Paradigm Dichotomy (von Neumann / dataflow machine*): the "twin paradigm" model;
2) Relativity Dichotomy: time domain / space domain – helps parallelization by time-to-space mapping.
*) see definition
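The two rules of thumb can be illustrated with a toy sketch (my own example, not from the talk): the k iterations of a loop become k pipeline stages that a data stream flows through, and a decision box becomes a selection between two branches that both exist in space.

```python
# Rule 1: a loop turns into a pipeline. The k iterations of the loop body
# become k stages; each stage models one physical operator (rDPU) that the
# data stream flows through, instead of one CPU re-executing the body.
def loop_version(x, k):
    for _ in range(k):        # k time steps on one processor
        x = x + 1             # loop body
    return x

def pipeline_version(stream, k):
    """Unroll the loop into k chained stages; items stream through all of them."""
    for _ in range(k):
        stream = (x + 1 for x in stream)   # one pipeline stage
    return list(stream)

# Rule 2: a decision box turns into a (de)multiplexer: both branch results
# are computed in space, and the condition only selects what is routed on.
def decision_as_mux(x):
    return x * 2 if x > 0 else x - 1       # condition drives the mux

print(pipeline_version([0, 10, 20], 3))    # [3, 13, 23]
```

The pipeline version computes the same values as the loop version, but in hardware all stages work simultaneously on successive data items, which is where the speed-up comes from.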

1) Paradigm Dichotomy (procedural dichotomy): The Twin Paradigm Approach (TTPA)

instruction domain: the CPU, driven by a program counter. datastream domain: the (r)DPA, driven by data counter(s). We need data parallelism.

Def.: Dataflow Machine. The old "dataflow machine" research scene is dead – sequential execution: not really a dataflow machine; indeterministic: unpredictable order of execution; it had used compilers accepting a dataflow language. We re-define this term as the counterpart of von Neumann: deterministic, with data counters (no program counter).

Procedural Languages Twins

imperative Software Languages (program counter) | systolic Flowware Languages (data counter(s), for data parallelism)
read next instruction          | read next data item
goto (instruction address)     | goto (data address)
jump to (instruction address)  | jump to (data address)
instruction loop               | data loop
instruction loop nesting       | data loop nesting
instruction loop escape        | data loop escape
instruction stream branching   | data stream branching
no internally parallel loops   | yes: internally parallel loops

Data streams [Kung et al. 1979]: the (r)DPA is fed on all sides by ASMs. ASM: AutoSequencing Memory – RAM with a data counter and a GAG [1990]. The data machine is built from old stuff [1979 - ...]; new is only its generalization [1989].
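The ASM idea (data counters instead of a program counter) can be sketched as follows; this is a hypothetical software model, not Hartenstein's implementation. The address sequence is programmed once before run time, and data items then stream out of RAM with no instruction fetch per item.

```python
class AutoSequencingMemory:
    """Sketch of an ASM: RAM plus a data counter driven by an address generator.

    The address sequence (the Flowware) is set up once; at run time data
    items stream out without any per-item instruction stream.
    """

    def __init__(self, ram):
        self.ram = list(ram)

    def stream(self, start, step, count):
        addr = start                 # data counter, not a program counter
        for _ in range(count):
            yield self.ram[addr]
            addr += step             # auto-sequencing: address generator update

asm = AutoSequencingMemory([10, 11, 12, 13, 14, 15])
print(list(asm.stream(start=1, step=2, count=3)))  # [11, 13, 15]
```

In a real rDPA several such memories feed the array from all sides, so many datastreams advance in lockstep without a single instruction being fetched.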

2) Relativity Dichotomy: time → time/space mapping (systolic array → supersystolic). But there is an asymmetry:

time domain (procedure domain) – von Neumann machine – 2 phases: 1) programming instruction streams; 2) run time.
space domain (structure domain) – anti machine – 3 phases: 1) reconfiguration of structures; 2) programming data streams; 3) run time.

time-iterative to space-iterative

A time-to-space mapping turns n time steps on 1 CPU into 1 time step on n DPUs (n = length of the pipeline). Often the space dimension is limited (e.g. by the chip size); then a time-to-space/time mapping turns k*n time steps on 1 CPU into k time steps on n DPUs: strip mining [D. Loveman, J-ACM, 1977]. Loop transformation methodology: 70ies and later.
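Strip mining can be sketched as follows (a minimal model with assumed names): a loop of k*n iterations is split into k strips of n iterations, and each strip corresponds to one time step of n DPUs working in space.

```python
def strip_mine(total, strip):
    """Split a loop of `total` iterations into strips of `strip` iterations.

    Each strip models one time step in which `strip` DPUs work in parallel
    (simulated sequentially here); total // strip time steps remain in time.
    """
    results = []
    for t in range(0, total, strip):            # k time steps
        indices = range(t, min(t + strip, total))
        results.extend(i * i for i in indices)  # one strip: n DPUs in space
    return results

# 12 iterations, but the chip only has room for 4 DPUs -> 3 time steps of 4
print(strip_mine(12, 4) == [i * i for i in range(12)])  # True
```

The transformation preserves the loop's results; only the schedule changes, trading limited space (n DPUs) against the remaining k steps in time.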

Conclusions (1)

We massively need programmable accelerator co-processors. Established technologies are available, and we can still use standard software and its tools. We need a massive migration of software to configware: to cope with the implementation wall, and to cope with the programmer population's unsustainable skills mismatches. Configware skills and basic hardware knowledge are essential qualifications for programmers.

Conclusions (2)

CS education is a monster! Fully wrong educational mainstream approaches; jaw-dropping sclerosis of curriculum taskforces. We need a complete re-definition of CS education. We urgently need dual-rail education. CS should learn a lot from embedded systems, as in mechanical engineering.

END – thank you for your patience

Backup for discussion: time to space mapping

time domain (procedure domain): program loop, time algorithm – n time steps, 1 CPU. space domain (structure domain): pipeline, space algorithm – 1 time step, n DPUs.

Bubble Sort example: n x k time steps on 1 "conditional swap" unit (time algorithm) – by direct time to space mapping – becomes k time steps on n "conditional swap" units (space/time algorithm).

Shuffle Sort – architecture instead of synchronization: a modification with a shuffle function avoids accessing conflicts.

"Shuffle Sort": a better architecture instead of complex synchronisation – half the number of conditional-swap blocks, with the data moved up and down (shuffle function) – no von Neumann syndrome!

Transformations since the 70ies: a rich loop-transformation methodology has been published [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]. Strip mining transformation: n x k time steps on 1 CPU (time algorithm) → pipeline: k time steps on n DPUs (space/time algorithm).

Education revolution: the microelectronics design revolution [Carver Mead, Lynn Conway, 1980; in Germany: the E.I.S. project]. The traditional division of labor: submission and rejection between application, switching level, circuit level, logic level, RT level and layout level, all in-house – a wide breadth of specialization, and fragmentation. The new Mead-&-Conway division of labor: application down to layout in-house, fabrication by a silicon foundry (external technology) – the breadth of specialization strongly reduced, coherence through the "tall thin man"* – *) or "tall thin woman". Decluttering & intuitive models to resolve the education dilemma; emphasis on "Systems". Analogously, the Reconfigurable Computing revolution must clear out the exclusively von Neumann (instruction-stream-based) mind set at the application and program levels: the tall thin man > Dichotomy.