Hong Kong, China, 13-15 Dec. 2004
Present and Future Supercomputer Architectures
Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory
12/12/2004
A Growth-Factor of a Billion in Performance in a Career
[Chart: peak performance growth of roughly a billion, from about 1 KFlop/s to approaching 1 PFlop/s, spanning the scalar, super scalar, vector, and parallel eras (1950-2010). Systems shown include EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, Intel ASCI Red, IBM ASCI White Pacific, and IBM BG/L. Annotation: 2X transistors/chip every 1.5 years. Scale markers: 1 KFlop/s (10^3), 1 MFlop/s (10^6), 1 GFlop/s (10^9), 1 TFlop/s (10^12), 35 TFlop/s, 1 PFlop/s (10^15).]
TOP500 — H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP: Ax=b, dense problem, TPP performance (see the timing sketch below)
- Updated twice a year: SC'xy in the States in November; meeting in Mannheim, Germany in June
- All data available from www.top500.org
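The yardstick above is the rate achieved solving a dense linear system Ax = b. As a rough illustration only (this is not the actual HPL/LINPACK benchmark driver, and the problem size is arbitrary), a Gflop/s rate can be estimated from a timed dense solve using the standard 2/3·n³ + 2·n² operation count:

```python
# Hedged sketch: estimate a LINPACK-style rate from a timed dense solve.
# NOT the HPL benchmark; the problem size n and the use of numpy.linalg.solve
# (LAPACK underneath) are illustrative assumptions.
import time
import numpy as np

def linpack_style_rate(n=2000, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.random((n, n))
    b = rng.random(n)
    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)                  # LU factorization + triangular solves
    elapsed = time.perf_counter() - t0
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2    # standard LINPACK operation count
    return flops / elapsed / 1e9               # Gflop/s

if __name__ == "__main__":
    print(f"~{linpack_style_rate():.1f} Gflop/s on this machine")
```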
What is a Supercomputer?
♦ A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.
♦ Why do we need them? Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways: computational fluid dynamics, protein folding, climate modeling, and national security (in particular cryptanalysis and simulating nuclear weapons), to name a few.
♦ Over the last 10 years the range for the Top500 has increased faster than Moore's Law
  ¾ 1993: #1 = 59.7 GFlop/s; #500 = 422 MFlop/s
  ¾ 2004: #1 = 70 TFlop/s; #500 = 850 GFlop/s
TOP500 Performance – November 2004
[Chart: aggregate (SUM), N=1, and N=500 performance, 1993-2004, on a log scale from 100 Mflop/s to 1 Pflop/s. SUM has grown from 1.167 TF/s to 1.127 PF/s. N=1 has grown from 59.7 GF/s to 70.72 TF/s (IBM BlueGene/L), passing the Fujitsu 'NWT' (NAL), Intel ASCI Red (Sandia), IBM ASCI White (LLNL), and the NEC Earth Simulator along the way. N=500 has grown from 0.4 GF/s to 850 GF/s. A "My Laptop" marker sits at roughly 1 Gflop/s.]
Vibrant Field for High Performance Computers
♦ Cray X1
♦ SGI Altix
♦ IBM Regatta
♦ IBM Blue Gene/L
♦ IBM eServer
♦ Sun
♦ HP
♦ Dawning
♦ Bull NovaScale
♦ Lenovo
♦ Fujitsu PrimePower
♦ Hitachi SR11000
♦ NEC SX-7
♦ Apple
♦ Coming soon …
  ¾ Cray RedStorm
  ¾ Cray BlackWidow
  ¾ NEC SX-8
  ¾ Galactic Computing
Architecture/Systems Continuum (Tightly Coupled → Loosely Coupled)
♦ Custom processor with custom interconnect
  ¾ Cray X1, NEC SX-7, IBM Regatta, IBM Blue Gene/L
  ¾ Best performance for codes that are not "cache friendly"
  ¾ Good communication performance
  ¾ Simplest programming model
  ¾ Most expensive
♦ Commodity processor with custom interconnect (hybrid)
  ¾ SGI Altix (Intel Itanium 2), Cray Red Storm (AMD Opteron), NEC TX7, IBM eServer, Dawning
  ¾ Good communication performance
  ¾ Good scalability
♦ Commodity processor with commodity interconnect
  ¾ Clusters: Pentium, Itanium, Opteron, Alpha with GigE, Infiniband, Myrinet, Quadrics
  ¾ Best price/performance (for codes that work well with caches and are latency tolerant)
  ¾ More complex programming model
[Chart: share of Top500 systems in each category, Jun-93 through Jun-04, on a 0-100% scale.]
Top500 Performance by Manufacturer (11/04)
IBM 49%, HP 21%, SGI 7%, NEC 4%, Fujitsu 2%, Cray 2%, Hitachi 1%, Sun 0%, Intel 0%, others 14%
Commodity Processors
♦ Intel Pentium Nocona
  ¾ 3.6 GHz, peak = 7.2 Gflop/s
  ¾ Linpack 100 = 1.8 Gflop/s
  ¾ Linpack 1000 = 3.1 Gflop/s
♦ AMD Opteron
  ¾ 2.2 GHz, peak = 4.4 Gflop/s
  ¾ Linpack 100 = 1.3 Gflop/s
  ¾ Linpack 1000 = 3.1 Gflop/s
♦ Intel Itanium 2
  ¾ 1.5 GHz, peak = 6 Gflop/s
  ¾ Linpack 100 = 1.7 Gflop/s
  ¾ Linpack 1000 = 5.4 Gflop/s
♦ HP PA RISC
♦ Sun UltraSPARC IV
♦ HP Alpha EV68
  ¾ 1.25 GHz, 2.5 Gflop/s peak
♦ MIPS R16000
(the peak figures above are derived in the sketch below)
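The peak numbers on this slide are simply clock rate times floating-point operations per cycle; a minimal sketch of that arithmetic follows. The flops-per-cycle values are inferred from the slide's own figures rather than taken from vendor documentation, so treat them as assumptions.

```python
# Hedged sketch: peak Gflop/s = clock (GHz) x floating-point ops per cycle.
# Flops/cycle below are back-calculated from the slide's peak numbers.
chips = {
    "Pentium Nocona": (3.6, 2),   # 3.6 GHz x 2  = 7.2 Gflop/s
    "AMD Opteron":    (2.2, 2),   # 2.2 GHz x 2  = 4.4 Gflop/s
    "Itanium 2":      (1.5, 4),   # 1.5 GHz x 4  = 6.0 Gflop/s
    "Alpha EV68":     (1.25, 2),  # 1.25 GHz x 2 = 2.5 Gflop/s
}

for name, (ghz, flops_per_cycle) in chips.items():
    print(f"{name}: peak = {ghz * flops_per_cycle:.1f} Gflop/s")
```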
Commodity Interconnects
♦ Gig Ethernet
♦ Myrinet
♦ Infiniband
♦ QsNet
♦ SCI

Interconnect | Switch topology | Cost NIC | Cost Sw/node | Cost Node | MPI Lat (µs) / 1-way (MB/s) / Bi-Dir (MB/s)
Gigabit Ethernet | Bus | $50 | $50 | $100 | 30 / 100 / 150
SCI | Torus | $1,600 | $0 | $1,600 | 5 / 300 / 400
QsNetII (R) | Fat Tree | $1,200 | $1,700 | $2,900 | 3 / 880 / 900
QsNetII (E) | Fat Tree | $1,000 | $700 | $1,700 | 3 / 880 / 900
Myrinet (D card) | Clos | $595 | $400 | $995 | 6.5 / 240 / 480
Myrinet (E card) | Clos | $995 | $400 | $1,395 | 6 / 450 / 900
IB 4x | Fat Tree | $1,000 | $400 | $1,400 | 6 / 820 / 790
(see the latency/bandwidth model sketch below)
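One way to read the latency and bandwidth columns together is the usual latency-plus-size/bandwidth message model; the sketch below applies it to the Gigabit Ethernet and IB 4x rows for a few illustrative message sizes. The model and the message sizes are my assumptions, not part of the original table.

```python
# Hedged sketch: effective MPI bandwidth under a latency + size/bandwidth model,
# using the latency and 1-way bandwidth figures from the table above.
def effective_bandwidth(msg_bytes, latency_us, peak_MBs):
    time_s = latency_us * 1e-6 + msg_bytes / (peak_MBs * 1e6)
    return msg_bytes / time_s / 1e6   # MB/s actually delivered

for name, lat_us, bw_MBs in [("Gigabit Ethernet", 30, 100), ("IB 4x", 6, 820)]:
    for size in (1_000, 100_000, 1_000_000):   # 1 KB, 100 KB, 1 MB messages
        print(f"{name:17s} {size:>9d} B -> "
              f"{effective_bandwidth(size, lat_us, bw_MBs):7.1f} MB/s")
```

Small messages are latency-bound (GigE delivers ~25 MB/s on 1 KB messages), while large messages approach the peak rate, which is why both columns matter for cluster design.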
24th List: The TOP10

Rank | Manufacturer | Computer | Rmax [TF/s] | Installation Site | Country | Year | #Proc
1 | IBM | BlueGene/L β-System | 70.72 | DOE/IBM | USA | 2004 | 32768
2 | SGI | Columbia (Altix, Infiniband) | 51.87 | NASA Ames | USA | 2004 | 10160
3 | NEC | Earth-Simulator | 35.86 | Earth Simulator Center | Japan | 2002 | 5120
4 | IBM | MareNostrum (BladeCenter JS20, Myrinet) | 20.53 | Barcelona Supercomputer Center | Spain | 2004 | 3564
5 | CCD | Thunder (Itanium2, Quadrics) | 19.94 | Lawrence Livermore National Laboratory | USA | 2004 | 4096
6 | HP | ASCI Q (AlphaServer SC, Quadrics) | 13.88 | Los Alamos National Laboratory | USA | 2002 | 8192
7 | Self Made | X (Apple XServe, Infiniband) | 12.25 | Virginia Tech | USA | 2004 | 2200
8 | IBM/LLNL | BlueGene/L DD1 (500 MHz) | 11.68 | Lawrence Livermore National Laboratory | USA | 2004 | 8192
9 | IBM | eServer pSeries 655 | 10.31 | Naval Oceanographic Office | USA | 2004 | 2944
10 | Dell | Tungsten (PowerEdge, Myrinet) | 9.82 | NCSA | USA | 2003 | 2500

399 systems > 1 TFlop/s; 294 machines are clusters; TOP10 average ~8K processors
IBM BlueGene/L — 131,072 Processors
♦ Chip (2 processors, BlueGene/L Compute ASIC): 2.8/5.6 GF/s, 4 MB (cache)
♦ Compute Card (2 chips, 2x1x1), 4 processors: 5.6/11.2 GF/s, 1 GB DDR
♦ Node Card (32 chips, 4x4x2), 16 compute cards, 64 processors: 90/180 GF/s, 16 GB DDR
♦ Rack (32 node boards, 8x8x16), 2,048 processors: 2.9/5.7 TF/s, 0.5 TB DDR
♦ System (64 racks, 64x32x32), 131,072 processors: 180/360 TF/s, 32 TB DDR
♦ "Fastest Computer": BG/L at 700 MHz, 32K processors in 16 racks; Peak: 91.7 Tflop/s; Linpack: 70.7 Tflop/s, 77% of peak (arithmetic checked in the sketch below)
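The peak and efficiency figures in the last bullet follow from the clock rate; a small sketch of the arithmetic, where the 4 flops per cycle per processor is inferred from the slide's 2.8 GF/s at 700 MHz and should be treated as an assumption:

```python
# Hedged sketch: BlueGene/L peak and Linpack efficiency from the slide's numbers.
clock_ghz = 0.7                 # 700 MHz
flops_per_cycle = 4             # inferred: 2.8 GF/s per processor / 0.7 GHz
processors = 32 * 1024          # 32K processors (16 racks)

peak_tflops = processors * clock_ghz * flops_per_cycle / 1000   # ~91.8 Tflop/s
linpack_tflops = 70.72
print(f"peak ~{peak_tflops:.1f} Tflop/s, "
      f"efficiency ~{linpack_tflops / peak_tflops:.0%}")
```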
BlueGene/L Interconnection Networks
♦ 3-Dimensional Torus
  ¾ Interconnects all compute nodes (65,536)
  ¾ Virtual cut-through hardware routing
  ¾ 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  ¾ 1 µs latency between nearest neighbors, 5 µs to the farthest
  ¾ 4 µs latency for one hop with MPI, 10 µs to the farthest
  ¾ Communications backbone for computations
  ¾ 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth (see the bandwidth sketch below)
♦ Global Tree
  ¾ Interconnects all compute and I/O nodes (1024)
  ¾ One-to-all broadcast functionality
  ¾ Reduction operations functionality
  ¾ 2.8 Gb/s of bandwidth per link
  ¾ Latency of one-way tree traversal 2.5 µs
  ¾ ~23 TB/s total binary tree bandwidth (64K machine)
♦ Ethernet
  ¾ Incorporated into every node ASIC
  ¾ Active in the I/O nodes (1:64)
  ¾ All external communication (file I/O, control, user interaction, etc.)
♦ Low Latency Global Barrier and Interrupt
  ¾ Latency of round trip 1.3 µs
♦ Control Network
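The per-node and aggregate torus bandwidth figures above can be checked from the 1.4 Gb/s link rate; a small sketch follows (counting each bidirectional link once for the aggregate is my interpretation of the slide's 68 TB/s total):

```python
# Hedged sketch: BlueGene/L torus bandwidth from the 1.4 Gb/s link rate.
link_gbps = 1.4                 # Gb/s per link direction
links_per_node = 12             # 6 neighbors x 2 directions in a 3D torus
nodes = 65_536

per_node_GBs = links_per_node * link_gbps / 8       # ~2.1 GB/s per node
aggregate_TBs = nodes * per_node_GBs / 2 / 1000     # each link shared by 2 nodes
print(f"{per_node_GBs:.1f} GB/s per node, ~{aggregate_TBs:.0f} TB/s aggregate")
```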
NASA Ames: SGI Altix Columbia — 10,240 Processor System
♦ Architecture: Hybrid Technical Server Cluster
♦ Vendor: SGI, based on Altix systems
♦ Deployment: Today
♦ Node:
  ¾ 1.5 GHz Itanium-2 processor
  ¾ 512 procs/node (20 cabinets)
  ¾ Dual FPUs / processor
♦ System:
  ¾ 20 Altix NUMA systems @ 512 procs/node = 10,240 procs
  ¾ 320 cabinets (estimate 16 per node)
  ¾ Peak: 61.4 Tflop/s; LINPACK: 52 Tflop/s
♦ Interconnect:
  ¾ FastNumaFlex (custom hypercube) within a node
  ¾ Infiniband between nodes
♦ Pluses:
  ¾ Large and powerful DSM nodes
♦ Potential problems (Gotchas):
  ¾ Power consumption: 100 kW per node (2 MW total)
Performance Projection
[Chart: extrapolation of the TOP500 SUM, N=1, and N=500 trend lines from 1993 out to 2015, on a scale from 100 Mflop/s to 1 Eflop/s, with markers for BlueGene/L, DARPA HPCS, and "My Laptop".]

Power: Watts/Gflop (smaller is better)
[Bar chart: Watts/Gflop for the Top 20 systems, based on processor power rating only, on a 0-120 W/Gflop scale. The systems listed range from the BlueGene/L DD2 beta-system (0.7 GHz PowerPC 440) at the low end to ASCI Q (AlphaServer SC45) and ASCI White (SP Power3, 375 MHz) at the high end.]
Top500 in Asia (Numbers of Machines)
[Chart: number of Top500 machines installed in Asia, 06/1993 through 06/2004, broken out by Japan, China, South Korea, India, and Others, on a 0-120 scale.]
17 Chinese Sites on the Top500

Rank | Installation Site | Manufacturer | Computer | Area | Year | Rmax [GF/s] | Procs
17 | Shanghai Supercomputer Center | Dawning | Dawning 4000A, Opteron 2.2 GHz, Myrinet | Research | 2004 | 8061 | 2560
38 | Chinese Academy of Science | Lenovo | DeepComp 6800, Itanium2 1.3 GHz, QsNet | Academic | 2003 | 4193 | 1024
61 | Institute of Scientific Computing/Nankai University | IBM | xSeries Xeon 3.06 GHz, Myrinet | Academic | 2004 | 3231 | 768
132 | Petroleum Company (D) | IBM | BladeCenter Xeon 3.06 GHz, Gig-Ethernet | Industry | 2004 | 1923 | 512
184 | Geoscience (A) | IBM | BladeCenter Xeon 3.06 GHz, Gig-Ethernet | Industry | 2004 | 1547 | 412
209 | University of Shanghai | HP | DL360G3 Xeon 3.06 GHz, Infiniband | Academic | 2004 | 1401 | 348
225 | Academy of Mathematics and System Science | Lenovo | DeepComp 1800, P4 Xeon 2 GHz, Myrinet | Academic | 2002 | 1297 | 512
229 | Digital China Ltd. | HP | SuperDome 1 GHz/HPlex | Industry | 2004 | 1281 | 560
247 | Public Sector | IBM | xSeries Cluster Xeon 2.4 GHz, Gig-E | Government | 2003 | 1256 | 622
324 | China Meteorological Administration | IBM | eServer pSeries 655 (1.7 GHz Power4+) | Research | 2004 | 1107 | 1008
355 | XinJiang Oil | IBM | BladeCenter Cluster Xeon 2.4 GHz, Gig-Ethernet | Industry | 2003 | 1040 | 448
372 | Fudan University | HP | DL360G3, Pentium4 Xeon 3.2 GHz, Myrinet | Academic | 2004 | 1016 | 256
384 | Huapu Information Technology | HP | SuperDome 875 MHz/HyperPlex | Industry | 2004 | 1013 | 512
419 | Saxony Developments Ltd | HP | Integrity Superdome, 1.5 GHz, HPlex | Industry | 2004 | 971 | 192
481 | Shenzhen University | Tsinghua U | DeepSuper-21C, P4 Xeon 3.06/2.8 GHz, Myrinet | Academic | 2003 | 877 | 256
482 | China Petroleum | HP | HP BL-20P, Pentium4 Xeon 3.06 GHz | Industry | 2004 | 873 | 238
498 | Digital China Ltd. | HP | SuperDome 875 MHz/HyperPlex | Industry | 2004 | 851 | 416

Total performance growing by a factor of 3 every 6 months for the past 24 months.
Important Metrics: Sustained Performance and Cost
♦ Commodity processors
  ¾ Optimized for commercial applications.
  ¾ Meet the needs of most of the scientific computing market.
  ¾ Provide the shortest time-to-solution and the highest sustained performance per unit cost for a broad range of applications that have significant spatial and temporal locality (good cache use).
♦ Custom processors
  ¾ For bandwidth-intensive applications that do not cache well, custom processors are more cost effective,
  ¾ hence offering better capacity on just those applications.
High Bandwidth vs Commodity Systems
♦ High bandwidth systems have traditionally been vector computers
  ¾ Designed for scientific problems
  ¾ Capability computing
♦ Commodity processors are designed for web servers and the home PC market
  ¾ (we should be thankful that the manufacturers keep 64-bit floating point)
  ¾ Used for cluster-based computers, leveraging their price point
♦ Scientific computing needs are different
  ¾ Require a better balance between data movement and floating-point operations; this results in greater efficiency.

System Balance - Memory Bandwidth (the operands/flop column is derived in the sketch below)
System | Year of Introduction | Node Architecture | Processor Cycle Time | Speed per Processor | Peak Operands/Flop (main memory)
Earth Simulator (NEC) | 2002 | Vector | 500 MHz | 8 Gflop/s | 0.5
Cray X1 (Cray) | 2003 | Vector | 800 MHz | 12.8 Gflop/s | 0.33
ASCI Q (HP EV68) | 2002 | Alpha | 1.25 GHz | 2.5 Gflop/s | 0.1
MCR Xeon | 2002 | Pentium | 2.4 GHz | 4.8 Gflop/s | 0.055
Apple Xserve IBM PowerPC | 2003 | PowerPC | 2 GHz | 8 Gflop/s | 0.063
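The Peak Operands/Flop column is main-memory bandwidth, expressed in 8-byte words per second, divided by peak flop rate. The sketch below reproduces the table's values using per-processor memory bandwidths that are back-calculated assumptions rather than vendor specifications:

```python
# Hedged sketch: operands/flop = (memory bandwidth in 8-byte words/s) / peak flop/s.
# The bandwidth figures are assumptions chosen to reproduce the table above.
systems = {
    "Earth Simulator": (32.0, 8.0),    # GB/s per processor, peak Gflop/s
    "Cray X1":         (34.0, 12.8),
    "ASCI Q (EV68)":   (2.0, 2.5),
    "MCR Xeon":        (2.1, 4.8),
}

for name, (mem_GBs, peak_gflops) in systems.items():
    operands_per_flop = (mem_GBs / 8) / peak_gflops
    print(f"{name:16s} {operands_per_flop:.3f} operands/flop")
```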
System Balance (Network): Network Speed (MB/s) vs Node Speed (flop/s)
Communication/Computation Balance (Bytes/Flop), higher is better:
  ASCI Blue Mountain  0.02
  LANL Pink           0.05
  ASCI White          0.08
  ASCI Purple         0.13
  PSC Lemieux         0.18
  Blue Gene/L         0.38
  Cray T3E/1200       1.00
  ASCI Red            1.20
  Cray Red Storm      1.60
  Cray X1             2.00
SETI@home: Global Distributed Computing
♦ Running on 500,000 PCs, ~1,300 CPU years per day
  ¾ 1.3M CPU years so far
♦ Sophisticated data & signal processing analysis
♦ Distributes datasets from the Arecibo Radio Telescope

SETI@home
♦ Uses thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
♦ When a participant's computer is idle (cycles otherwise wasted), the software downloads a ~half-MB chunk of data for analysis; each client performs about 3 Tflop of work per chunk over roughly 15 hours.
♦ The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants.
♦ About 5M users
♦ Largest distributed computation project in existence
  ¾ Averaging 72 Tflop/s
♦ Google query attributes
  ¾ 150M queries/day (2,000/second); 18M queries/day in June 2000
  ¾ 100 countries
  ¾ 8.0B documents in the index
♦ Data centers
  ¾ 100,000 Linux systems in data centers around the world
  ¾ 15 TFlop/s and 1,000 TB total capability
  ¾ 40-80 1U/2U servers per cabinet
  ¾ 100 Mb Ethernet switches per cabinet with gigabit Ethernet uplink
  ¾ growth from 4,000 systems (June 2000)
♦ Performance and operation
  ¾ simple reissue of failed commands to new servers
  ¾ no performance debugging; problems are not reproducible
♦ The ranking is an eigenvalue problem, Ax = λx, with n = 8x10^9 (see MathWorks Cleve's Corner): forward links are referred to in the rows, back links in the columns, and the matrix is the transition probability matrix of the Markov chain, so the ranking vector satisfies Ax = x (a power-iteration sketch follows below).
Source: Monika Henzinger, Google & Cleve Moler
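The eigenvalue bullet above (Ax = x for the Markov transition matrix) is the PageRank computation. Below is a minimal power-iteration sketch on a toy four-page link graph; the graph and the 0.85 damping factor are illustrative assumptions, not anything from the slide, and the real problem it refers to has n = 8x10^9.

```python
# Hedged sketch: PageRank as the eigenvector of a Markov transition matrix (Ax = x),
# found by power iteration on a tiny illustrative 4-page link graph.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}   # page -> pages it links to
n, damping = 4, 0.85

# Column-stochastic transition matrix: A[i, j] = prob. of following a link from j to i.
A = np.zeros((n, n))
for j, outs in links.items():
    for i in outs:
        A[i, j] = 1.0 / len(outs)

G = damping * A + (1 - damping) / n * np.ones((n, n))  # damped "Google matrix"
x = np.full(n, 1.0 / n)
for _ in range(100):                                   # power iteration: x <- Gx
    x = G @ x
    x /= x.sum()

print("PageRank vector:", np.round(x, 3))
```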
The Grid
♦ The Grid is about gathering resources …
  ¾ run programs, access data, provide services, collaborate
♦ … to enable and exploit large-scale sharing of resources
♦ Virtual organizations
  ¾ Loosely coordinated groups
♦ Provides for remote access of resources
  ¾ Scalable, secure, reliable mechanisms for discovery and access
♦ In some ideal setting:
  ¾ User submits work, infrastructure finds an execution target
  ¾ Ideally you don't care where.

The Grid
[Figure-only slide.]
The Grid: The Good, The Bad, and The Ugly
♦ Good:
  ¾ Vision
  ¾ Community
  ¾ Developed functional software
♦ Bad:
  ¾ Oversold the grid concept
  ¾ Still too hard to use
  ¾ Solution in search of a problem
  ¾ Underestimated the technical difficulties
  ¾ Not enough of a scientific discipline
♦ Ugly:
  ¾ Authentication and security
The Computing Continuum
Tightly Coupled ← Highly Parallel | Clusters | "Grids" | Special Purpose "SETI / Google" → Loosely Coupled
♦ Each strikes a different balance
  ¾ computation/communication coupling
♦ Implications for execution efficiency
♦ Applications for diverse needs
  ¾ computing is only one part of the story!
Grids vs. Capability vs. Cluster Computing
♦ Not an "either/or" question
  ¾ Each addresses different needs
  ¾ Each is part of an integrated solution
♦ Grid strengths
  ¾ Coupling necessarily distributed resources: instruments, software, hardware, archives, and people
  ¾ Eliminating time and space barriers: remote resource access and capacity computing
  ¾ Grids are not a cheap substitute for capability HPC
♦ Highest-performance computing strengths
  ¾ Supporting foundational computations: terascale and petascale "nation scale" problems
  ¾ Engaging tightly coupled computations and teams
♦ Clusters
  ¾ Low cost, group solution
  ¾ Potential hidden costs
♦ Key is easy access to resources in a transparent way
Petascale Systems In 2008
♦ Technology trends
  ¾ multicore processors, perhaps heterogeneous
    - IBM Power4 and Sun UltraSPARC IV today; Itanium "Montecito" in 2005; quad-core and beyond are coming
  ¾ reduced power consumption
    - laptop and mobile market drivers
  ¾ increased I/O and memory interconnect integration
    - PCI Express, Infiniband, …
♦ Let's look forward a few years to 2008
  ¾ 8-way or 16-way cores (8 or 16 processors/chip)
  ¾ ~10 GFlop/s cores (processors) and 4-way nodes (4 8-way cores/node)
  ¾ 12x Infiniband-like interconnect, perhaps heterogeneous
♦ With 10 GFlop/s processors (see the arithmetic sketch below)
  ¾ 100K processors and 3,100 nodes (4-way with 8 cores each)
  ¾ 1-3 MW of power, at a minimum
♦ To some extent, Petaflops systems will look like a "Grid in a Box"
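A quick check of how the 2008 estimates above compose; the individual numbers come from the slide's bullets, and the composition itself is just arithmetic:

```python
# Hedged sketch: composing the slide's 2008 petascale estimate.
cores_per_chip = 8          # "8-way cores"
chips_per_node = 4          # 4-way nodes
nodes = 3100                # slide's estimate
gflops_per_core = 10        # "~10 GFlop/s cores"

processors = nodes * chips_per_node * cores_per_chip     # ~99,200, i.e. ~100K
peak_pflops = processors * gflops_per_core / 1e6          # Gflop/s -> Pflop/s
print(f"{processors} processors, ~{peak_pflops:.2f} Pflop/s peak")
```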
How Big Is Big?
♦ Every 10X brings new challenges
  ¾ 64 processors was once considered large; it hasn't been "large" for quite a while
  ¾ 1024 processors is today's "medium" size
  ¾ 8096 processors is today's "large"; we're struggling even here
♦ 100K processor systems
  ¾ are in construction
  ¾ we have fundamental challenges in dealing with machines of this size
  ¾ … and little in the way of programming support
Fault Tolerance in the Computation
♦ Some next-generation systems are being designed with > 100K processors (IBM BlueGene/L).
♦ MTTF of 10^5 - 10^6 hours for a component
  ¾ sounds like a lot until you divide by 10^5! (see the sketch below)
  ¾ Failures for such a system can be just a few hours, perhaps minutes, away.
♦ Problem with the MPI standard: no recovery from faults.
♦ Application checkpoint/restart is today's typical fault tolerance method.
♦ Many clusters based on commodity parts don't have error-correcting primary memory.
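The "divide by 10^5" remark is the usual series-system estimate, system MTTF ≈ component MTTF / number of components (assuming independent, exponentially distributed failures); a small sketch with the slide's numbers:

```python
# Hedged sketch: expected time between failures for a system of N independent
# components, each with the MTTF range quoted on the slide.
n_components = 100_000           # ~100K processors (BlueGene/L scale)

for component_mttf_hours in (1e5, 1e6):
    system_mttf_hours = component_mttf_hours / n_components
    print(f"component MTTF {component_mttf_hours:.0e} h "
          f"-> system MTTF ~{system_mttf_hours:.1f} h "
          f"(~{system_mttf_hours * 60:.0f} min)")
```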
Real Crisis With HPC Is With The Software
♦ Programming is stuck
  ¾ Arguably it hasn't changed since the '60s
♦ It's time for a change
  ¾ Complexity is rising dramatically
    - highly parallel and distributed systems: from 10 to 100 to 1,000 to 10,000 to 100,000 processors!
    - multidisciplinary applications
♦ A supercomputer application and its software are usually much longer-lived than the hardware
  ¾ Hardware life is typically five years at most.
  ¾ Fortran and C are the main programming models.
♦ Software is a major cost component of modern technologies.
  ¾ The tradition in HPC system procurement is to assume that the software is free.
♦ We don't have many great ideas about how to solve this problem.
Collaborators / Support
♦ TOP500
  ¾ H. Meuer, Mannheim U
  ¾ H. Simon, NERSC
  ¾ E. Strohmaier