[email protected] 22 August 2013
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of Mechanical Engineering, Universidade de Brasilia
Speed-ups obtained by Reconfigurable Computing
slightly modified version
Reiner Hartenstein
outline
Introduction
Manycore Crisis & von Neumann Syndrome
The Impact of Reconfigurable Computing
Programmer education: new roadmap needed
Conclusions
© 2008, [email protected]
5 key issues
• climate change faster than predicted: by carbon emission, primarily from power plants?
• very high and growing computer energy cost, and a growing number of power plants needed here
• the manycore programming crisis stalls progress (end of the free ride on the Gordon Moore curve)
• technologically stalled Moore's Law* (2008: 65, 45, 32 nm)
• Reconfigurable Computing is a promising alternative
History of data processing
The first reconfigurable computer
• prototyped: 1884 (Herman Hollerith)
• datastream-based (DPU)
• 1st Xilinx FPGA 100 years later
[Nick Tredennick (Gilder), 2003]
Configware Programming
• no instruction streams
• manually (Configuration), or by swapping a pre-wired board (Reconfiguration)
• 60 years later: RAM available (ferrite cores), motivating the von Neumann paradigm (J. v. N., 1946)
Field-Programmable Gate Array (FPGA)
• fine-grained reconfigurable
• CLB: Configurable Logic Box
[Figure: array of CLBs linked by switch boxes and connect boxes; a switch box forms a wire, a connect box connects it to a CLB]
*) Tom Williams (keynote): the 20 nm wall
Xilinx old “island architecture”
Field-Programmable Gate Array (FPGA)
• RAM-based: this switch box has hidden RAM (150 transistors & 150 flip-flops, FF)
• FF: part of the “hidden RAM”
• configware code is loaded before run time into the switch box “hidden RAM”
• patches even at the customer's desk
• CLB: Configurable Logic Box
[Figure: CLBs with a switch box and a connect box whose hidden RAM bits (0/1) are set; the switch box forms a wire, the connect box connects it to a CLB]
Coarse-grained Reconfigurable Array
• rDPU: reconfigurable Data Path Unit, 32 bits wide
• CFB (cf. CLB)! No CPU
• example: SNN filter on a supersystolic array, mainly a pipe network
• array size: 10 x 16 rDPUs
• mapped by KressArray Xplorer [Ulrich Nageldinger], CoDe-X inside [Jürgen Becker]
• Legend: rDPU not used; used for routing only; operator and routing; port location marker; backbus connect (99% placement efficiency)
Another coarse-grained r-Array
Platform FPGA
• 56 – 424 fast on-chip Block RAMs (BRAMs)
• 8 – 32 fast serial I/O channels
• 256 – 1704 BGA
• DPUs
[courtesy Lattice Semiconductor]
Reconfigurable Supercomputing
• Silicon Graphics: Reconfigurable Application-Specific Computing (RASC™)
• Cray XD1: Xilinx Virtex-II Pro, library by Cray
• Chuck Thacker … (even Microsoft is working on it; lab in Cambridge, UK, etc.)
• Supercomputing 2007, Reno, Nevada, USA: 9600 registered attendees, 440 exhibitors
• FPGAs mainstream for more than a decade
Conditional Swap Example (parallelization of the bubble sort algorithm)
• if X > Y then swap
• rDPU with inputs Xi, Yi, outputs Xo, Yo, and a ConfigwareCode input
• rDPU modes: route-through only; route-through and function (multiplexer)
• the comparator (>) controls the swap
• the swap is turned into a wiring pattern
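The conditional swap example ("if X > Y then swap", inputs Xi, Yi, outputs Xo, Yo) can be sketched in software. The following is only an illustration under stated assumptions: the function names are hypothetical, and the parallel arrangement shown is the well-known odd-even transposition sort, a standard parallel relative of bubble sort, not necessarily the exact wiring on the slide.

```python
def conditional_swap(x_in, y_in):
    """One conditional swap unit: a comparator (>) steering two multiplexers.

    Hypothetical software model of the slide's "if X > Y then swap".
    """
    greater = x_in > y_in                 # the comparator
    x_out = y_in if greater else x_in     # multiplexer for Xo
    y_out = x_in if greater else y_in     # multiplexer for Yo
    return x_out, y_out


def odd_even_transposition_sort(values):
    """Sort with n rounds of conditional swaps on neighbouring pairs.

    Assumption: this is the textbook odd-even transposition scheme.
    Within one round all swaps touch disjoint pairs, so in hardware
    they could all fire in the same time step.
    """
    data = list(values)
    n = len(data)
    for step in range(n):
        start = step % 2                  # alternate even and odd pairs
        for i in range(start, n - 1, 2):
            data[i], data[i + 1] = conditional_swap(data[i], data[i + 1])
    return data
```

In this model the inner loop is what a row of swap units would do simultaneously; only the outer loop remains as time steps.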
What Configware means: Software to Configware Migration
• time domain: Software Source, translated by the Software Compiler into Software Code (instruction-procedural)
• space domain: Configware Source, translated by the mapper (Placement & Routing) into Configware Code (structural: space domain)
• the data scheduler produces Flowware Code (data-procedural)
outline
Introduction
The Manycore Crisis & the von Neumann Syndrome
The Impact of Reconfigurable Computing
Programmer education: new roadmap needed
Conclusions
Many-core: Break-through or Breakdown?
• Industry is facing a disruptive turning point: “could reset µP HW & SW roadmaps for next 30 years” [David Patterson]
• Intel's vision: MultiCore
• forcing a historic transition to a parallel programming model yet to be invented [David Callahan]
• HPC users lack understanding of basic precepts*
• it's an education, qualification, and R&D problem
• The stakes are high ... “I would be panicked if I were in industry” [John Hennessy]
*) PRACE consortium (Partnership foR Advanced Computing in Europe), http://www.prace-project.eu/documents/D3.3.1_document_final.pdf
Declining Programmer Productivity
• The Law of More: programmer productivity declines disproportionately with increasing parallelism
• In particular HPC application domains, massive parallelism requires 10 – 30 professionals in multi-disciplinary, multi-institutional teams for 5 – 10 years [Douglass Post, DoD HPCMP, panelist at SC07]
• Software done: machine obsolete
The von Neumann Syndrome
Massive Overhead Phenomena
overhead piling up to code sizes of astronomic dimensions
von Neumann machine (single-core CPU); each overhead costs an instruction stream:
• instruction fetch
• state address computation
• data address computation
• data meet PU + other overhead
• I/O to/from off-chip RAM
2006, C.V. “RAM” Ramamoorthy: “von Neumann Syndrome”
1986, E.I.S. project: 94% for address computation; total speed-up: x 15000
2008, David Callahan: “a terrifying number of processes running in parallel, create sequential-processing bottlenecks and losses in data locality”
Dijkstra, 1968: The Goto considered harmful
Koch et al., 1975: The universal Bus considered harmful
Backus, 1978: Can programming be liberated from the von Neumann style?
Arvind et al., 1983: A critique of Multiprocessing the von Neumann Style
manycore von Neumann: arrays of massive overhead phenomena
[Figure: a single-core CPU beside an array of von Neumann manycore CPUs]
fast on-chip memory cannot store such huge instruction code blocks
overhead proportionate to the number of processors (each costs an instruction stream):
• instruction fetch
• state address computation
• data address computation
• data meet PU + other overhead
• I/O to/from off-chip RAM
overhead disproportionate to the number of processors (each costs an instruction stream):
• inter-PU communication
• message passing overhead
• transactional memory overhead
• multithreading overhead
• etc.
outline
Introduction
Manycore Crisis & von Neumann Syndrome
The Impact of Reconfigurable Computing
Programmer education: new roadmap needed
Conclusions
Speed-up factors obtained by Software to Configware migration
[Chart: speed-up factor on a logarithmic axis from 10^0 to 10^6]
• application domains: DSP and wireless; Image processing, Pattern matching, Multimedia; Bioinformatics; Astrophysics; crypto
• applications plotted: DES breaking, real-time face detection, Reed-Solomon Decoding, video-rate stereo vision, pattern recognition, SPIHT wavelet-based image compression, protein identification, MAC, FFT, BLAST, Viterbi Decoding, Smith-Waterman pattern matching, molecular dynamics simulation, GRAPE
• factors plotted: 28500, 6000, 3000, 2400, 1000, 1000, 900, 730, 457, 400, 288, 100, 88, 52, 40, 20
Energy saving factors obtained by software to configware migration
• Energy saving: almost x10 less than speed-up …
• … could be improved
[Chart: the speed-up chart (speed-up factor from 10^0 to 10^6, same applications and factors from 20 to 28500) with the energy saving factors about a factor of 10 lower]
Accelerator card from Bruchsal
• 16 FPGAs
• 1.5 TeraMAC/s
• I/O Bandwidth: 50 GByte/s
• Manufacturer: SIEMENS Bruchsal
MAC means Multiply and ACcumulate; Tera means 10^12 or 1 000 000 000 000 (1 trillion)
von Neumann overhead vs. Reconfigurable Computing
[Figure: an array of 16 CPUs (von Neumann machine) beside an rDPA, a reconfigurable datapath array of 16 rDPUs (coarse-grained reconfigurable; anti machine)]
overhead | von Neumann machine | anti machine
instruction fetch | instruction stream | none*
state address computation | instruction stream | none*
data address computation | instruction stream | none*
data meet PU + other overhead | instruction stream | none*
I/O to/from off-chip RAM | instruction stream | none*
inter-PU communication | instruction stream | none*
message passing overhead | instruction stream | none*
transactional memory overhead | instruction stream | none*
multithreading overhead | instruction stream | none*
etc. | instruction stream | none*
Data meet the processor (CPU)
• illustrating the von Neumann syndrome
• by Software
• inefficient transport over off-chip memory by memory-cycle-hungry instruction streams
• This is just one of many von Neumann overhead phenomena
Data meet the CPU
• illustrating acceleration
• by Flowware
• Placement of the execution locality (not moving data) within the pipe network: generated by the Configware Compiler*
*) before run time (at compile time)
What did we learn?
There are 2 kinds of datastreams:
1) indirectly moved by an instruction stream machine (von Neumann): extremely inefficient
2) directly moved by a datastream machine (from Reconfigurable Computing): very efficient
“Dataflow machine” would be a nice term, but it was introduced by a different scene*
*) meanwhile dead: not really a dataflow machine, but it had used compilers accepting a dataflow language
What else did we learn?
There are 2 kinds of parallelism:
1) Concurrent processes: instruction stream parallelism (CPU manycores): inefficient
2) Data parallelism by parallel datastreams (in Reconfigurable Computing Systems): efficient
Conclusion:
- Data parallelism brings the performance (we do data processing!)
What Parallelism? [Hartenstein’s watering can model]
• data parallelism (array of rDPUs): no von Neumann bottleneck
• instruction parallelism (array of CPUs): many von Neumann bottlenecks
outline
Introduction
Manycore Crisis & von Neumann Syndrome
The Impact of Reconfigurable Computing
Programmer education: new roadmap needed
Conclusions
Put old ideas into practice (POIIP)
“We need a complete re-definition of CS” [Burton Smith and other celebrities]
Wrong! I do not agree, finding that ... [Reiner Hartenstein]
... “The biggest payoff will come from putting old ideas into practice and teaching people how to apply them properly.” [David Parnas]
“We need a complete re-definition of curriculum recommendations - missing several key issues.” [Reiner Hartenstein]
The Embedded Systems Approach?
Fighting against obsolete curricula? … support their own educational approach:
• Graduate Curriculum on Embedded Software and Systems (EU)
• Advanced Real Time Systems
• Real-Time Systems (Sweden)
• Recommendations for Designing new ICT Curricula
• Chess – Center for Hybrid and Embedded Software Systems (courses in embedded systems)
• WESE Workshop on Embedded Systems Education
“You can always teach programming to a hardware guy ... but you can never teach hardware to a programmer”
it's not the programmer's fault: it's due to obsolete CS curricula
CS is a Monster
fully wrong educational mainstream approaches:
1) the basic mindset is exclusively instruction-stream-oriented; data streams are considered exotic
2) mapping parallelism into the time domain: abstracting away the space domain is fatal
We need a dual-rail education
Two Dichotomies
Dichotomy = mutual allocation to two opposed domains such that a third domain is excluded.
The dichotomy model serves as an educational orientation guide for dual-rail education, to overcome the software/configware chasm & the software/hardware chasm:
1) Machine Paradigm Dichotomy (von Neumann / Dataflow machine*): the “Twin Paradigm” model
2) Relativity Dichotomy: time domain / space domain; helps parallelization by time to space mapping
*) see definition
We need to put old ideas into practice (POIIP) for Software to Hardware Migration and Software to Configware Migration
2 key rules of thumb, terrifically simple:
1) loop turns into pipeline [1979]
2) decision box turns into demultiplexer [1967]: PvOIIP
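The two rules of thumb (loop turns into pipeline, decision box turns into demultiplexer) can be illustrated by a small software model. This is a minimal sketch under assumptions: the names `demux` and `run_pipeline` are hypothetical, `None` is used as an empty pipeline slot (a bubble), and the stage functions are assumed never to return `None`.

```python
def demux(select, item):
    """Decision box as a demultiplexer: steer the item to one of two outputs.

    Returns (out0, out1); the output that is not selected stays empty (None).
    """
    return (None, item) if select else (item, None)


def run_pipeline(stages, stream):
    """Loop body as a pipeline: one register per stage, all stages fire per tick.

    Returns the results in stream order and the number of ticks used.
    """
    regs = [None] * len(stages)                  # regs[s] holds the output of stage s
    results = []
    ticks = 0
    feed = list(stream) + [None] * len(stages)   # trailing bubbles flush the pipe
    for item in feed:
        if regs[-1] is not None:
            results.append(regs[-1])             # a finished item leaves the pipe
        # Shift from back to front so each stage reads last tick's value.
        for s in range(len(stages) - 1, 0, -1):
            prev = regs[s - 1]
            regs[s] = stages[s](prev) if prev is not None else None
        regs[0] = stages[0](item) if item is not None else None
        ticks += 1
    return results, ticks
```

With m stages and N data items this model needs N + m ticks, whereas running the m-step loop body item by item would take about N x m steps.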
Def.: Dataflow Machine
The old “Dataflow Machine” research scene is dead:
• sequential execution: not really a dataflow machine
• indeterministic: unpredictable order of execution
• had used compilers accepting a dataflow language
We re-define this term as the counterpart of von Neumann:
• deterministic, with data counters (no program counter)
• we need data parallelism
1) Paradigm Dichotomy (procedural dichotomy)
The Twin Paradigm Approach (TTPA)
• instruction domain: CPU, program counter
• datastream domain: (r)DPA, data counter(s)
Procedural Languages Twins
imperative Software Languages (program counter) | systolic Flowware Languages (data counter(s))
read next instruction | read next data item
goto (instruction address) | goto (data address)
jump to (instruction address) | jump to (data address)
instruction loop | data loop
instruction loop nesting | data loop nesting
instruction loop escape | data loop escape
instruction stream branching | data stream branching
no: no internally parallel loops | yes: internally parallel loops (for data parallelism)
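The flowware side of the twins (read next data item, data loop, data loop nesting) can be pictured as an address sequence produced by a data counter instead of a program counter. The sketch below is an assumption-laden illustration: `data_counter`, `nested_data_counter` and `AutoSequencingMemory` are hypothetical names, and the memory is modelled as a plain Python list.

```python
def data_counter(base, stride, count):
    """One data loop: yield the addresses base, base + stride, ..."""
    for i in range(count):
        yield base + i * stride


def nested_data_counter(base, outer_count, outer_stride, inner_count, inner_stride):
    """Data loop nesting: an inner data loop restarted for every outer step."""
    for outer_base in data_counter(base, outer_stride, outer_count):
        yield from data_counter(outer_base, inner_stride, inner_count)


class AutoSequencingMemory:
    """Hypothetical model of a memory that sequences its own addresses."""

    def __init__(self, ram, addresses):
        self.ram = ram
        self._addresses = iter(addresses)

    def read_next(self):
        """Read next data item: no instruction computes the address."""
        return self.ram[next(self._addresses)]


# Usage: stream column 1 of a 4 x 4 array stored row by row.
ram = list(range(100, 116))
asm = AutoSequencingMemory(ram, data_counter(base=1, stride=4, count=4))
column = [asm.read_next() for _ in range(4)]
```

The address pattern is fixed before run time, which is the point of the contrast with address computation by instruction streams.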
Data streams [Kung et al. 1979]
[Figure: an (r)DPA [1995] surrounded by ASM blocks that feed it time-staggered data streams (x x x)]
• ASM: Auto-Sequencing Memory (RAM with data counter; GAG [1990])
• Data Machine: from old stuff [1979 - ...]; new is only its generalization [1989]
Paradigm Dichotomy (procedural dichotomy): The Twin Paradigm Approach
[Figure: systolic array and super-systolic array surrounded by ASM blocks]
But there is the Asymmetry:
• von Neumann Machine, 2 phases: 1) programming instruction streams 2) run time
• Anti Machine, 3 phases: 1) reconfiguration of structures 2) programming data streams 3) run time
Relativity Dichotomy (time/space)
• time domain: procedure domain
• space domain: structure domain
outline
Introduction
Manycore Crisis & von Neumann Syndrome
The Impact of Reconfigurable Computing
Conclusions
time-iterative to space-iterative
• a time to space mapping: n time steps on 1 CPU become 1 time step on n DPUs (n = length of pipeline)
• a time to space/time mapping: k*n time steps on 1 CPU become k time steps on n DPUs
• Often the space dimension is limited (e.g. because of the chip size): Strip mining [D. Loveman, J-ACM, 1977]
• loop transformation methodology: 70ies and later
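The time to space/time mapping (k*n time steps on 1 CPU versus k time steps on n DPUs) can be sketched as strip mining of a loop. This is a rough software model under assumptions: `strip_mine` and its parameters are hypothetical names, and the n DPUs are simulated by processing one strip of up to n items per counted time step.

```python
def strip_mine(data, n_dpus, op):
    """Cut a loop over data into strips of at most n_dpus items.

    Each strip stands for one time step in which all n_dpus units
    would apply op to their item at once (simulated sequentially here).
    Returns the results and the number of time steps (strips).
    """
    if n_dpus < 1:
        raise ValueError("n_dpus must be at least 1")
    results = []
    time_steps = 0
    for start in range(0, len(data), n_dpus):
        strip = data[start:start + n_dpus]      # one strip fits the array
        results.extend(op(x) for x in strip)
        time_steps += 1
    return results, time_steps


# Usage: 12 items on 4 units take 3 time steps instead of 12.
squares, steps = strip_mine(list(range(12)), 4, lambda x: x * x)
```

When the item count is not a multiple of n, the last strip is simply shorter, so the step count is the item count divided by n, rounded up.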
Conclusions (1)
• We massively need programmable accelerator co-processors
• Established technologies are available, and we can still use standard software and its tools
• We need a massive Migration of Software to Configware
• To cope with the implementation wall: to cope with the programmer population's unsustainable skills mismatches
• Configware skills and basic hardware knowledge are essential qualifications for programmers
Conclusions (2)
• CS education is a monster!
• Fully wrong educational mainstream approaches
• Jaw-dropping sclerosis of curriculum taskforces
• We need a complete re-definition of CS education
• We urgently need Dual-Rail Education
• CS should learn a lot from Embedded Systems, like in Mechanical Engineering
END
thank you for your patience
backup for discussion:
time to space mapping
• time domain: procedure domain; space domain: structure domain
• time algorithm: program loop, n time steps, 1 CPU
• space algorithm: pipeline, 1 time step, n DPUs
• Bubble Sort (time algorithm): n x k time steps, 1 “conditional swap” unit
• Shuffle Sort (space/time algorithm): k time steps, n “conditional swap” units
Architecture instead of synchro
Example:
• direct time to space mapping of the program loop into a chain of “conditional swap” units
• accessing conflicts
• modification: with shuffle function (“Shuffle Sort”)
• Better architecture instead of complex synchronisation: half the number of blocks + up and down of data (shuffle function); no von Neumann syndrome!
Transformations since the 70ies
• time domain (procedure domain): program loop, time algorithm, n x k time steps, 1 CPU
• Strip Mining Transformation
• space domain (structure domain): pipeline, space/time algorithm, k time steps, n DPUs
• loop transformations: rich methodology published [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]
Microelectronics Design Revolution
• traditional division of labour: application, RT level, logic level, switching level, circuit level, layout level, technology, with submission and rejection between the levels; breadth of specialization; fragmentation
• the new M-&-C division of labour [Carver Mead, Lynn Conway, 1980] (in Germany: the E.I.S. project): application; the tall thin man; coherence; in-house breadth of specialization strongly reduced; Silicon Foundry (external technology)
• emphasis on “Systems”
Education Revolution: clearing out & intuitive models to resolve the education dilemma
Reconfigurable Computing Revolution
• Application level
• Program level
• von Neumann paradigm (instruction-stream-based)
• the tall thin man* > Dichotomy
*) or “tall thin woman”