High Performance Computing in the Multi-core Area
Arndt Bode Technische Universität München
ISPDC 2007 Hagenberg, Austria 07 July 2007
Technology Trends for Petascale Computing Architectures:
Multicore – Accelerators – Special Purpose – Reconfigurable – Memory and Cache – Network – Secondary Storage
Computing and Software: Programming Models – Heterogeneous vs. Homogeneous – Compilers and Threading – Operating Systems – Tools – Adaptivity
Applications: Scalability
07 July 2007 © Arndt Bode
Trends in Microprocessor Technology
Compute Performance of COTS Microprocessors depends on
- Clock Frequency
- Computer Organization
Clock Frequency:
- Exponential Increase impossible (Power, Cooling, 135 W/Chip)
- Partial Solutions: Clock and Power Reduction, Sleep Transistors, ...
- New Goal for Optimization: Energy Efficiency (MIPS, MFLOPS per W)
Computer Organization: ILP, Processor-internal Organization fully exploited
- Pipelining, Superscalar, VLIW, Wordlength, ... are counterproductive to Energy Efficiency (Speculation)
- Future is Processor/Thread-level Parallelism (Parallelism of Simpler Units)
- Multi-Core and Application-specific Processors, Fault Tolerance
Energy - Efficiency
$P \sim A \cdot C \cdot V^2 \cdot f$
P: Power
A: Activity Factor (Active Transistors on Chip)
C: Total Capacitance
V: Voltage
f: Clock Frequency
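The proportionality P ~ A C V² f can be explored numerically. The following is a minimal sketch, not taken from the talk: the function names and the scaling factors (half the clock, 80 % of the voltage) are illustrative assumptions, chosen to show why several simpler, slower units can be more energy-efficient than one fast unit.

```python
def dynamic_power(activity, capacitance, voltage, frequency):
    """Dynamic power up to a constant factor: P ~ A * C * V^2 * f."""
    return activity * capacitance * voltage ** 2 * frequency


def relative_power(voltage_scale, frequency_scale):
    """Power relative to a baseline when only V and f are rescaled."""
    return voltage_scale ** 2 * frequency_scale


if __name__ == "__main__":
    # Assumed scenario: two cores, each at half the clock frequency and
    # 80 % of the supply voltage, replace one baseline core.
    per_core = relative_power(voltage_scale=0.8, frequency_scale=0.5)
    print(f"power per core relative to baseline: {per_core:.2f}")
    print(f"power of two such cores:             {2 * per_core:.2f}")
```

Under these assumed numbers, the two slower cores together offer the same aggregate clock rate as the baseline core while drawing roughly two thirds of its power, provided the workload parallelizes perfectly.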
Multi-Core and Multithreading
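[Figure: Multicore-Chip with cores P0, P1, P2, ..., P7; each core runs four hardware threads (numbered 0 to 31 across the chip) and has its own first-level cache (FLC0, FLC1, FLC2, ..., FLC7); all cores share one second-level cache, SLC (shared)]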
Multi-Core Architectures: A Parallel World for Everybody
Compiler: Generation of parallel virtual processes (VP0, VP1, VP2, ..., VPk)
Virtualization at System Level: Load Balancing / Fault Tolerance / Placement
[Figure: Multi-Core Server built from Multicore-Chip 0 ... Multicore-Chip n attached to Memory; each chip has cores P0, P1, P2, ..., P7 with hardware threads 0 to 31, first-level caches FLC0 ... FLC7 and a shared SLC; some cores carry MMX or DSP units, and one core is marked faulty]
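The placement step named on this slide can be sketched as follows. This is a hypothetical illustration, not the mechanism proposed in the talk: the function `place` and the round-robin policy are assumptions, used only to show virtual processes being spread over the cores that are not marked faulty.

```python
def place(virtual_processes, cores, faulty=()):
    """Assign virtual processes round-robin to all non-faulty cores.

    Returns a dict mapping each healthy core to its list of virtual
    processes. Round-robin is an assumed, deliberately simple policy.
    """
    faulty = set(faulty)
    healthy = [core for core in cores if core not in faulty]
    if not healthy:
        raise ValueError("no healthy core available")
    placement = {core: [] for core in healthy}
    for index, vp in enumerate(virtual_processes):
        placement[healthy[index % len(healthy)]].append(vp)
    return placement


if __name__ == "__main__":
    vps = [f"VP{i}" for i in range(6)]
    cores = [f"P{i}" for i in range(4)]
    # P2 is assumed faulty, so its share is spread over the other cores.
    for core, assigned in place(vps, cores, faulty=["P2"]).items():
        print(core, assigned)
```

Because the application only sees virtual processes, a faulty core changes the mapping but not the program.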
Processor-Design Options for Petaflop Systems
- Mega-Multicore Systems: General Purpose Applications (Itanium, Xeon, Power-7, ...)
- Multicore and Attached Accelerators (Cell, ClearSpeed, NVIDIA, ...)
- Special Purpose (BlueGene, QCD, POLARIS, ...)
- Reconfigurable and Adaptive (FPGA: continuous research)
General Purpose (Programmability, Scalability, Applicability) vs. Special Purpose (High Performance, Low Power)
History: Special Purpose is "volatile", Programming Interface not Compatible
- Microprogrammable Devices: Vertical Migration and Bitslice
- Early Microprocessors: Coprocessors (FP, IO, DSP, ...)
- Early HPC: Vector Processor Attachments, Array Processors, Associative Processors
Petaflop Processor Options: Bode’s Oracle
Two Types of Systems will persist:
- General Purpose Mega-Multicore (Programmability, Compatibility, Cost)
- Special Purpose Systems for Large Specific Application Classes (Energy Efficiency)
The Systems will be Highly Parallel
Thomas Sterling: Multicore and Heterogeneous is Disruptive Technology
Tsugio Makimoto: Makimoto’s Wave predicts “Customized” for 2007 – 2017 (Pendulum: 1014 km)
Consequences for Computer Architecture
Memory-Bottleneck
- Today: Cache-Hierarchy between one processor (P) and memory (M)
- Multi-Core: many processors (P P P ... P) in front of one memory (M): Bottleneck intensified
Cache Organizations
- Distributed
- Shared
- Groupwise shared
[Figure: cores P0, P1, P2, ..., P7 with hardware threads 0 to 31, first-level caches FLC0 ... FLC7, and one SLC (shared)]
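Communication Bottleneck
- Classical SMP: 1 Interface / Proc. IC
- SMP based on Multi-Core: n Interfaces / Multi-Core IC (e.g. n cores: 5n Interfaces in a mesh structure)
[Figure: processors (P) connected to memory (M); for the Multi-Core SMP the connection between the cores and memory is marked "?"]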
HLRB II: SGI Altix 4700 / 9728 Montecito cores
[Figure: HLRB II installation, with dimension labels 11 m and 22 m]
Blade
[Figure: blade diagram with these labels: Itanium2 (Madison) Dual Core Socket with Core 1 and Core 2; 8.5 GB/s; Memory Controller (Shub 2.0); two groups of six DDR2 DIMMs; NL Port Plane 1 and NL Port Plane 2, each 6.4 GB/s]
4 GByte per Compute Core
Dual Socket Blades - Density Compute Blades
[Figure: blade diagram with these labels: two Itanium2 Dual Core Sockets, each with Core 1 and Core 2; 8.5 GB/s; Memory Controller (Shub 2.0); two groups of six DDR2 DIMMs; NL Port Plane 1 and NL Port Plane 2, each 6.4 GB/s]
4 GByte per Compute Core
One 256 Socket Partition: ccNUMA Shared Memory, Single System Image, Fat Tree Connection
[Figure: fat tree with three links, each labelled 32*6.4 GByte/s = 204.8 GByte/s, and 2*0.8 GByte/s per Socket pair]
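The bandwidth figures on this slide can be checked with a few lines of arithmetic. This is a sketch under one assumption that the slide does not state explicitly: that "2*0.8 GByte/s per Socket pair" is the even share of the 204.8 GByte/s aggregate across the 256 sockets of a partition. The constant and function names are illustrative.

```python
SOCKETS_PER_PARTITION = 256
LINKS = 32
GBYTE_PER_S_PER_LINK = 6.4


def aggregate_bandwidth(links=LINKS, per_link=GBYTE_PER_S_PER_LINK):
    """Aggregate bandwidth of a bundle of links in GByte/s."""
    return links * per_link


def per_socket_share(total, sockets=SOCKETS_PER_PARTITION):
    """Even share of the aggregate bandwidth per socket (assumed reading)."""
    return total / sockets


if __name__ == "__main__":
    total = aggregate_bandwidth()
    share = per_socket_share(total)
    print(f"aggregate:       {total:.1f} GByte/s")
    print(f"per socket:      {share:.1f} GByte/s")
    print(f"per socket pair: {2 * share:.1f} GByte/s")
```

Under that reading the numbers are consistent: 204.8 GByte/s divided over 128 socket pairs gives 1.6 GByte/s, that is 2*0.8 GByte/s per pair.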
2-D Torus Mesh topology (figure; axis label: Rank)
HLRB II Interconnect
Fat Tree Topology
Topology
- Batch Scheduler must be topology aware (PBSpro 8.0 early access)
- Placement of jobs is optimized according to topology
- Topology is stored in placement sets
- Topologies: Fat Tree, 2D-Torus
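One quantity a topology-aware scheduler could take into account is the hop distance between the nodes of a job. The sketch below computes it for a 2D torus; it is a hypothetical illustration and does not reproduce PBSpro's placement sets or any of its interfaces. The function names and the cost measure are assumptions.

```python
from itertools import combinations


def torus_hops(a, b, dims):
    """Minimal hop count between nodes a and b in a torus of size dims.

    In each dimension the route may wrap around, so the distance is the
    shorter of the direct and the wrap-around path.
    """
    return sum(min(abs(x - y), size - abs(x - y))
               for x, y, size in zip(a, b, dims))


def placement_cost(nodes, dims):
    """Sum of pairwise hop distances: one assumed measure of job spread."""
    return sum(torus_hops(a, b, dims) for a, b in combinations(nodes, 2))


if __name__ == "__main__":
    dims = (4, 4)
    compact = [(0, 0), (0, 1), (1, 0), (1, 1)]
    scattered = [(0, 0), (0, 2), (2, 0), (2, 2)]
    print("compact placement cost:  ", placement_cost(compact, dims))
    print("scattered placement cost:", placement_cost(scattered, dims))
```

A scheduler following this idea would prefer the compact candidate, since its nodes are fewer hops apart.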
Petascale Software
Key Issue is use of Excess Parallelism:
Programming Model, Language and Compiler
• Conventional
• Parallelizing (Including JIT)
• Parallelizing with Directives
• New Parallel Languages (Transactional Memory, ...)
Use of Threads
• Functional
• Speculative
• Assist/Helper (Prefetching, Monitoring, Debugging, Tools, Virtualization, Security, FT-Lockstep, ...)
Scalable Tool Models
You tell us!
This is the Academic Forum, so you tell us.
- What are the solutions?
- Where are the new language ideas?
- Can you design a statically checked race-free language which is useful?
- Can naïve users really use functional languages?
- Any language which talks about threads is too low level for most users. So how do we raise the language level?
- Should we be doing message passing inside the node? It’s the only demonstrated way to achieve high scalability.
- Do we really need to bring back Occam? ☺
James Cownie, Intel Academic Forum, Budapest, 13.06.2007
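The question of message passing inside the node can be made concrete with a small sketch. It is an illustration of the programming model only, not of its performance: Python threads and `queue.Queue` stand in for cores and channels, and the function names are hypothetical. The workers share no data and communicate only through messages, in the spirit of Occam-style channels.

```python
import queue
import threading


def worker(inbox, outbox):
    """Sum all numbers received on inbox; None marks the end of input."""
    total = 0
    while True:
        message = inbox.get()
        if message is None:
            break
        total += message
    outbox.put(total)


def sum_by_message_passing(values, n_workers=4):
    """Distribute values over workers by messages and combine the results."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    outbox = queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for inbox in inboxes]
    for thread in threads:
        thread.start()
    for index, value in enumerate(values):
        inboxes[index % n_workers].put(value)
    for inbox in inboxes:
        inbox.put(None)  # end-of-stream marker for each worker
    for thread in threads:
        thread.join()
    return sum(outbox.get() for _ in range(n_workers))


if __name__ == "__main__":
    print(sum_by_message_passing(range(101)))
```

No locks appear in the user code because no state is shared; all synchronization is hidden inside the channels.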
Some Examples of Research at MMI
- Cache-Prefetching with Helper-Core (up to 39 %)
- Cache-Behavior with shared/separate Caches
- Helper Cores for Fault Tolerance
Prefetching
Bandwidth obtainable for Matrix-Multiplication (Blocking with small amount of caching)
[Figure: bandwidth plotted against Matrix Dimension]
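For readers unfamiliar with blocking, the following sketch shows a blocked (tiled) matrix multiplication. It is not the benchmark code behind the slide, and it does not model the helper-core prefetching; the function name and the block size are assumptions. It only illustrates how the loops are restructured so that each tile is reused while it is still in the cache.

```python
def matmul_blocked(a, b, block=2):
    """Multiply matrices a (n x m) and b (m x p) tile by tile.

    Matrices are lists of rows. The three outer loops walk over tiles,
    the three inner loops multiply one tile of a with one tile of b.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + block, m)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += a_ik * row_b[j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_blocked(a, b))
```

The result is identical to the unblocked product; only the order of memory accesses changes.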
Advantages of Shared Caches
Synthetic Benchmark: Latency for continuous Writes onto the same Memory area (2 cores)
Lockstepping of Virtual Machines
- Synchronized Identical Execution of Applications on 2 Processors
- Virtual Environment allows for Monitoring and Control by Second Core
- Minimal Performance Loss
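The principle of lockstepping can be sketched in a few lines. This is a toy model, not the virtual-machine implementation described in the talk: two replicas execute the same deterministic step function, their states are compared after every step, and a fault is injected artificially. All names are hypothetical.

```python
def lockstep(step, initial, n_steps, inject_fault=None):
    """Run two replicas in lockstep and report the first divergence.

    inject_fault is an optional pair (step_index, corrupt), where corrupt
    is applied to the replica state after that step to simulate a fault.
    Returns the index of the first step at which the states differ,
    or None if both replicas agree for all n_steps.
    """
    primary = replica = initial
    for index in range(n_steps):
        primary = step(primary)
        replica = step(replica)
        if inject_fault is not None and inject_fault[0] == index:
            replica = inject_fault[1](replica)
        if primary != replica:
            return index
    return None


def example_step(state):
    """A deterministic toy computation standing in for the application."""
    return (state * 3 + 1) % 1000003


if __name__ == "__main__":
    print(lockstep(example_step, 7, 10))
    # Flip the lowest bit of the replica state after step 4.
    print(lockstep(example_step, 7, 10, inject_fault=(4, lambda s: s ^ 1)))
```

Comparing after every step is the simplest possible policy; a real system would have to choose comparison points that keep the performance loss small.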
Bavarian Graduate School of Computational Engineering (Elitenetzwerk Bayern)
Welcome to the Bavarian Graduate School of Computational Engineering
Computational Engineering ... refers to all activities in engineering that use computers as their main tool. Typical tasks in Computational Engineering are the solution of differential equations that model certain physical phenomena, the optimization of processes in engineering, or the stochastic simulation of a complex system.
The Bavarian Graduate School ... of Computational Engineering is an association of three Master programs:
• Computational Engineering (CE) at the Friedrich-Alexander-Universität Erlangen-Nürnberg,
• Computational Mechanics (COME), and
• Computational Science and Engineering (CSE) at the Technische Universität München.
Our goal is to push forward the rapidly growing field of Computational Engineering by offering high-quality master programs for students who are interested in the field.
The Elitenetzwerk Bayern ... is an initiative of the state of Bavaria to support the education and advancement of highly talented students. With the help of the Elitenetzwerk Bayern, we are able to offer an "elite" degree program for the best students in our master programs. Outstanding performance in one of the three Master's programs will be honoured by the newly formed academic degree of a "Master of Science with Honours". For further information about this elite program, please read on in our section "Information for students".
IGSSE: BGCE is a partner of IGSSE, TUM's International Graduate School of Science and Engineering
Summary
- Multicore (and Heterogeneous): Disruptive Technologies
- Many Choices in System Architecture and Programming Models
- HPC Systems will be massively parallel: Scalability is the Challenge for Application-Algorithms, System Software, Tools and Architectures
- Interesting Times Ahead!