Hong Kong, China, 13-15 Dec. 2004

Present and Future Supercomputer Architectures
Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory
12/12/2004

A Growth Factor of a Billion in Performance in a Career

[Chart: peak performance on a log scale from 1 KFlop/s to 1 PFlop/s, 1950-2010, with representative machines of each era: scalar (EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, ~1 KFlop/s to 1 MFlop/s), super scalar and vector (Cray 1, Cray X-MP, Cray 2, TMC CM-2, ~1 GFlop/s), parallel (TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, ~1 TFlop/s), and super scalar/vector/parallel (IBM BG/L, approaching 1 PFlop/s). Transistors per chip double roughly every 1.5 years.]

TOP500: H. Meuer, H. Simon, E. Strohmaier, & J. Dongarra
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP: Ax = b, dense problem (TPP performance); a toy timing sketch of the idea follows below
- Updated twice a year: at SC'xy in the States in November and at the meeting in Mannheim, Germany in June
- All data available from www.top500.org
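The Rmax yardstick comes from timing the solution of a dense linear system Ax = b. A minimal Python/NumPy sketch of the idea (this is only an illustration of the metric, not the actual LINPACK/HPL code; the problem size n and the random test matrix are my own choices):

```python
import time
import numpy as np

def linpack_like(n=2000):
    """Time a dense solve of Ax = b and report an approximate flop rate."""
    rng = np.random.default_rng(0)
    A = rng.random((n, n))
    b = rng.random(n)

    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)                 # LU factorization + triangular solves
    elapsed = time.perf_counter() - t0

    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2   # standard LINPACK flop-count convention
    print(f"n={n}: {elapsed:.3f} s, ~{flops / elapsed / 1e9:.2f} Gflop/s")
    # sanity check that the solve actually worked
    print("residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))

linpack_like()
```

The real benchmark (HPL) runs a distributed-memory LU factorization at much larger n across the whole machine; this sketch only shows where the Gflop/s number comes from.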

What is a Supercomputer?
♦ A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.
♦ Over the last 10 years the performance range of the Top500 has increased faster than Moore's Law:
¾ 1993: #1 = 59.7 GFlop/s, #500 = 422 MFlop/s
¾ 2004: #1 = 70 TFlop/s, #500 = 850 GFlop/s
(a quick numerical check follows below)
♦ Why do we need them? Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways: computational fluid dynamics, protein folding, climate modeling, and national security (in particular cryptanalysis and simulating nuclear weapons), to name a few.
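A quick check, using only the numbers on this slide, that the Top500 has grown faster than a Moore's-Law doubling every 1.5 years would suggest:

```python
# 1993 -> 2004 entry point (#500) from the slide: 422 Mflop/s -> 850 Gflop/s
years = 2004 - 1993
growth_500 = 850e9 / 422e6            # ~2000x
moore = 2 ** (years / 1.5)            # ~160x if performance only tracked transistor doubling
print(f"#500 grew {growth_500:.0f}x in {years} years vs ~{moore:.0f}x from Moore's Law alone")

# same comparison for #1: 59.7 Gflop/s -> 70 Tflop/s
growth_1 = 70e12 / 59.7e9             # ~1200x
print(f"#1 grew {growth_1:.0f}x over the same period")
```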

TOP500 Performance – November 2004

[Chart: TOP500 performance, 1993-2004, on a log scale from 100 Mflop/s to 1 Pflop/s. The aggregate (SUM) has reached 1.127 PF/s. The #1 system (N=1) rises from 59.7 GF/s in 1993 (Fujitsu 'NWT', NAL) past Intel ASCI Red (Sandia, 1.167 TF/s), IBM ASCI White (LLNL), and the NEC Earth Simulator, to 70.72 TF/s on IBM BlueGene/L; the #500 entry point (N=500) rises from 0.4 GF/s to 850 GF/s. 'My Laptop' is marked for scale.]

Vibrant Field for High Performance Computers
♦ Cray X1
♦ SGI Altix
♦ IBM Regatta
♦ IBM Blue Gene/L
♦ IBM eServer
♦ Sun
♦ HP
♦ Dawning
♦ Bull NovaScale
♦ Lenovo
♦ Fujitsu PrimePower
♦ Hitachi SR11000
♦ NEC SX-7
♦ Apple
♦ Coming soon …
¾ Cray RedStorm
¾ Cray BlackWidow
¾ NEC SX-8
¾ Galactic Computing

Architecture/Systems Continuum (tightly coupled to loosely coupled)
♦ Custom processor with custom interconnect
¾ Cray X1, NEC SX-7, IBM Regatta, IBM Blue Gene/L
¾ Best performance for codes that are not "cache friendly"
¾ Good communication performance
¾ Simplest programming model
¾ Most expensive
♦ Commodity processor with custom interconnect (hybrid)
¾ SGI Altix (Intel Itanium 2), Cray Red Storm (AMD Opteron), NEC TX7, IBM eServer, Dawning
¾ Good communication performance
¾ Good scalability
♦ Commodity processor with commodity interconnect (clusters)
¾ Pentium, Itanium, Opteron, or Alpha processors with GigE, Infiniband, Myrinet, or Quadrics
¾ Best price/performance for codes that work well with caches and are latency tolerant
¾ More complex programming model

[Chart: share of the TOP500 by architecture class (custom, hybrid, commodity), June 1993 through June 2004.]

Top500 Performance by Manufacturer (11/04)
IBM 49%, HP 21%, others 14%, SGI 7%, NEC 4%, Fujitsu 2%, Cray 2%, Hitachi 1%, Sun 0%, Intel 0%

Commodity Processors
♦ Intel Pentium Nocona
¾ 3.6 GHz, peak = 7.2 Gflop/s
¾ Linpack 100 = 1.8 Gflop/s
¾ Linpack 1000 = 3.1 Gflop/s
♦ AMD Opteron
¾ 2.2 GHz, peak = 4.4 Gflop/s
¾ Linpack 100 = 1.3 Gflop/s
¾ Linpack 1000 = 3.1 Gflop/s
♦ Intel Itanium 2
¾ 1.5 GHz, peak = 6 Gflop/s
¾ Linpack 100 = 1.7 Gflop/s
¾ Linpack 1000 = 5.4 Gflop/s
♦ HP Alpha EV68
¾ 1.25 GHz, 2.5 Gflop/s peak
♦ HP PA RISC
♦ Sun UltraSPARC IV
♦ MIPS R16000
(the peak numbers are reproduced arithmetically in the sketch below)
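The "peak" figures above are simply clock rate times double-precision floating-point results per cycle. A small sketch reproducing them; the flops-per-cycle values are inferred from the slide's own clock and peak numbers (2 for the x86 and Alpha parts, 4 for Itanium 2's two fused multiply-adds):

```python
# peak Gflop/s = clock (GHz) * double-precision flops per cycle
cpus = {
    "Intel Pentium Nocona": (3.6, 2),
    "AMD Opteron":          (2.2, 2),
    "Intel Itanium 2":      (1.5, 4),   # two FMA units, 4 flops/cycle
    "HP Alpha EV68":        (1.25, 2),
}
for name, (ghz, flops_per_cycle) in cpus.items():
    print(f"{name}: {ghz * flops_per_cycle:.1f} Gflop/s peak")
```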

Commodity Interconnects
♦ Gig Ethernet ♦ Myrinet ♦ Infiniband ♦ QsNet ♦ SCI
(switch topologies: bus, torus, fat tree, Clos)

Interconnect      | Switch topology | Cost NIC | Cost Sw/node | Cost Node | MPI Lat (us) / 1-way (MB/s) / Bi-Dir (MB/s)
Gigabit Ethernet  | Bus             | $50      | $50          | $100      | 30 / 100 / 150
SCI               | Torus           | $1,600   | $0           | $1,600    | 5 / 300 / 400
QsNetII (R)       | Fat Tree        | $1,200   | $1,700       | $2,900    | 3 / 880 / 900
QsNetII (E)       | Fat Tree        | $1,000   | $700         | $1,700    | 3 / 880 / 900
Myrinet (D card)  | Clos            | $595     | $400         | $995      | 6.5 / 240 / 480
Myrinet (E card)  | Clos            | $995     | $400         | $1,395    | 6 / 450 / 900
IB 4x             | Fat Tree        | $1,000   | $400         | $1,400    | 6 / 820 / 790

(a ping-pong sketch of how the latency/bandwidth columns are measured follows below)
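Latency and bandwidth figures like those in the table are typically measured with an MPI ping-pong test between two nodes. A minimal sketch using mpi4py (my choice of library, not something from the talk); the message sizes and iteration counts are illustrative, and it assumes exactly two MPI ranks, e.g. `mpiexec -n 2 python pingpong.py`:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.Get_size() == 2, "run with exactly two MPI processes"

def one_way_time(nbytes, iters=1000):
    """Average one-way time for a message of nbytes bounced between ranks 0 and 1."""
    buf = np.zeros(max(nbytes, 1), dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(iters):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        else:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    return (MPI.Wtime() - t0) / (2 * iters)

lat = one_way_time(1)                    # tiny message: dominated by latency
t = one_way_time(1 << 20, iters=100)     # 1 MB message: dominated by bandwidth
if rank == 0:
    print(f"latency ~ {lat * 1e6:.1f} us, one-way bandwidth ~ {(1 << 20) / t / 1e6:.0f} MB/s")
```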

24th List: The TOP10

Rank | Manufacturer | Computer                                | Rmax [TF/s] | Installation Site                      | Country | Year | #Proc
1    | IBM          | BlueGene/L β-System                     | 70.72       | DOE/IBM                                | USA     | 2004 | 32768
2    | SGI          | Columbia (Altix, Infiniband)            | 51.87       | NASA Ames                              | USA     | 2004 | 10160
3    | NEC          | Earth-Simulator                         | 35.86       | Earth Simulator Center                 | Japan   | 2002 | 5120
4    | IBM          | MareNostrum (BladeCenter JS20, Myrinet) | 20.53       | Barcelona Supercomputer Center         | Spain   | 2004 | 3564
5    | CCD          | Thunder (Itanium2, Quadrics)            | 19.94       | Lawrence Livermore National Laboratory | USA     | 2004 | 4096
6    | HP           | ASCI Q (AlphaServer SC, Quadrics)       | 13.88       | Los Alamos National Laboratory         | USA     | 2002 | 8192
7    | Self Made    | X (Apple XServe, Infiniband)            | 12.25       | Virginia Tech                          | USA     | 2004 | 2200
8    | IBM/LLNL     | BlueGene/L DD1 (500 MHz)                | 11.68       | Lawrence Livermore National Laboratory | USA     | 2004 | 8192
9    | IBM          | pSeries 655                             | 10.31       | Naval Oceanographic Office             | USA     | 2004 | 2944
10   | Dell         | Tungsten (PowerEdge, Myrinet)           | 9.82        | NCSA                                   | USA     | 2003 | 2500

399 systems > 1 TFlop/s; 294 machines are clusters; the top 10 average ~8K processors.

IBM BlueGene/L: 131,072 Processors
¾ Chip (2 processors): 2.8/5.6 GF/s, 4 MB cache
¾ Compute Card (2 chips, 2x1x1): 4 processors, 5.6/11.2 GF/s, 1 GB DDR
¾ Node Card (32 chips, 4x4x2; 16 compute cards): 64 processors, 90/180 GF/s, 16 GB DDR
¾ Rack (32 node boards, 8x8x16): 2,048 processors, 2.9/5.7 TF/s, 0.5 TB DDR
¾ System (64 racks, 64x32x32): 131,072 processors, 180/360 TF/s, 32 TB DDR
All built from the BlueGene/L Compute ASIC.

"Fastest Computer": BG/L, 700 MHz, 32K processors in 16 racks. Peak: 91.7 Tflop/s; Linpack: 70.7 Tflop/s (77% of peak). (The arithmetic is checked in the sketch below.)
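The packaging hierarchy and the "fastest computer" numbers are consistent arithmetic; here is a quick check. The 4 flops/cycle per processor is inferred from the slide's own figures (2.8 Gflop/s per processor at 700 MHz):

```python
# processors at each packaging level:
# 2 per chip, 2 chips/compute card, 16 cards/node board, 32 boards/rack, 64 racks
procs_full = 2 * 2 * 16 * 32 * 64
print("full system:", procs_full, "processors")            # 131072

# the Nov 2004 machine: 16 racks = 32K processors at 700 MHz
procs = 2 * 2 * 16 * 32 * 16
peak_per_proc = 0.7 * 4                                     # GHz * flops/cycle = 2.8 Gflop/s
peak_tflops = procs * peak_per_proc / 1e3
print(f"{procs} procs, peak {peak_tflops:.1f} Tflop/s, "
      f"Linpack 70.7 -> {70.7 / peak_tflops:.0%} of peak")
```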

BlueGene/L Interconnection Networks
♦ 3-Dimensional Torus
¾ Interconnects all compute nodes (65,536)
¾ Virtual cut-through hardware routing
¾ 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
¾ 1 µs latency between nearest neighbors, 5 µs to the farthest
¾ 4 µs latency for one hop with MPI, 10 µs to the farthest
¾ Communications backbone for computations
¾ 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
♦ Global Tree
¾ Interconnects all compute and I/O nodes (1024)
¾ One-to-all broadcast functionality
¾ Reduction operations functionality
¾ 2.8 Gb/s of bandwidth per link
¾ Latency of one-way tree traversal 2.5 µs
¾ ~23 TB/s total binary tree bandwidth (64k machine)
♦ Ethernet
¾ Incorporated into every node ASIC
¾ Active in the I/O nodes (1:64)
¾ All external communication (file I/O, control, user interaction, etc.)
♦ Low-Latency Global Barrier and Interrupt
¾ Latency of round trip 1.3 µs
♦ Control Network

NASA Ames: SGI Altix Columbia, 10,240-Processor System
♦ Architecture: hybrid technical server cluster
♦ Vendor: SGI, based on Altix systems
♦ Deployment: today
♦ Node:
¾ 1.5 GHz Itanium 2 processor
¾ 512 procs/node (20 cabinets)
¾ Dual FPUs per processor
♦ System:
¾ 20 Altix NUMA systems @ 512 procs/node = 10,240 procs
¾ 320 cabinets (estimate 16 per node)
¾ Peak: 61.4 Tflop/s; LINPACK: 52 Tflop/s
♦ Interconnect:
¾ FastNumaFlex (custom hypercube) within node
¾ Infiniband between nodes
♦ Pluses:
¾ Large and powerful DSM nodes
♦ Potential problems (gotchas):
¾ Power consumption: 100 kW per node (2 MW total)

Performance Projection

[Chart: extrapolation of the TOP500 SUM, N=1, and N=500 trend lines from 1993 out to 2015, on a log scale from 100 Mflop/s to 1 Eflop/s, with BlueGene/L, the DARPA HPCS program, and 'My Laptop' marked.] (An extrapolation sketch from the #1 data points quoted earlier follows below.)

Power: Watts/Gflop (smaller is better)

[Bar chart: Watts per Gflop for each of the top 20 systems, based on processor power rating only; y-axis 0-120 W/Gflop.]
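Returning to the projection chart: such projections are just extensions of the historical exponential trend. A sketch of the same kind of extrapolation, using only the two #1 data points quoted earlier in the talk (59.7 Gflop/s in 1993, 70.72 Tflop/s in 2004); the crossing years it prints are a naive trend estimate, not figures from the talk:

```python
import math

# annual growth factor of the #1 system implied by the two endpoints
r = (70.72e12 / 59.7e9) ** (1.0 / (2004 - 1993))      # ~1.9x per year
print(f"implied growth: {r:.2f}x per year")

# years from 2004 until the trend crosses 1 Pflop/s and 1 Eflop/s
for target, label in [(1e15, "1 Pflop/s"), (1e18, "1 Eflop/s")]:
    years = math.log(target / 70.72e12) / math.log(r)
    print(f"{label} around {2004 + years:.0f}")
```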

Top500 in Asia (Number of Machines)

[Stacked chart: number of TOP500 systems in Asia by country (Japan, China, South Korea, India, others), June 1993 through June 2004.]

17 Chinese Sites on the Top500

Rank | Installation Site                                    | Manufacturer | Computer                                       | Area       | Year | Rmax [Gflop/s] | Procs
17   | Shanghai Supercomputer Center                        | Dawning      | Dawning 4000A, Opteron 2.2 GHz, Myrinet        | Research   | 2004 | 8061           | 2560
38   | Chinese Academy of Science                           | Lenovo       | DeepComp 6800, Itanium2 1.3 GHz, QsNet         | Academic   | 2003 | 4193           | 1024
61   | Institute of Scientific Computing/Nankai University  | IBM          | xSeries Xeon 3.06 GHz, Myrinet                 | Academic   | 2004 | 3231           | 768
132  | Petroleum Company (D)                                | IBM          | BladeCenter Xeon 3.06 GHz, Gig-Ethernet        | Industry   | 2004 | 1923           | 512
184  | Geoscience (A)                                       | IBM          | BladeCenter Xeon 3.06 GHz, Gig-Ethernet        | Industry   | 2004 | 1547           | 412
209  | University of Shanghai                               | HP           | DL360G3 Xeon 3.06 GHz, Infiniband              | Academic   | 2004 | 1401           | 348
225  | Academy of Mathematics and System Science            | Lenovo       | DeepComp 1800, P4 Xeon 2 GHz, Myrinet          | Academic   | 2002 | 1297           | 512
229  | Digital China Ltd.                                   | HP           | SuperDome 1 GHz/HPlex                          | Industry   | 2004 | 1281           | 560
247  | Public Sector                                        | IBM          | xSeries Cluster Xeon 2.4 GHz, Gig-E            | Government | 2003 | 1256           | 622
324  | China Meteorological Administration                  | IBM          | eServer pSeries 655 (1.7 GHz Power4+)          | Research   | 2004 | 1107           | 1008
355  | XinJiang Oil                                         | IBM          | BladeCenter Cluster Xeon 2.4 GHz, Gig-Ethernet | Industry   | 2003 | 1040           | 448
372  | Fudan University                                     | HP           | DL360G3, Pentium4 Xeon 3.2 GHz, Myrinet        | Academic   | 2004 | 1016           | 256
384  | Huapu Information Technology                         | HP           | SuperDome 875 MHz/HyperPlex                    | Industry   | 2004 | 1013           | 512
419  | Saxony Developments Ltd                              | HP           | Integrity Superdome, 1.5 GHz, HPlex            | Industry   | 2004 | 971            | 192
481  | Shenzhen University                                  | Tsinghua U   | DeepSuper-21C, P4 Xeon 3.06/2.8 GHz, Myrinet   | Academic   | 2003 | 877            | 256
482  | China Petroleum                                      | HP           | HP BL-20P, Pentium4 Xeon 3.06 GHz              | Industry   | 2004 | 873            | 238
498  | Digital China Ltd.                                   | HP           | SuperDome 875 MHz/HyperPlex                    | Industry   | 2004 | 851            | 416

Total performance has been growing by a factor of 3 every 6 months for the past 24 months.

Important Metrics: Sustained Performance and Cost
♦ Commodity processors
¾ Optimized for commercial applications
¾ Meet the needs of most of the scientific computing market
¾ Provide the shortest time-to-solution and the highest sustained performance per unit cost for the broad range of applications that have significant spatial and temporal locality (good cache use)
♦ Custom processors
¾ For bandwidth-intensive applications that do not cache well, custom processors are more cost effective
¾ Hence offering better capacity on just those applications

High Bandwidth vs Commodity Systems
♦ High-bandwidth systems have traditionally been vector computers
¾ Designed for scientific problems
¾ Capability computing
♦ Commodity processors are designed for web servers and the home PC market (we should be thankful that the manufacturers keep 64-bit floating point)
¾ Used for cluster-based computers, leveraging the commodity price point
♦ Scientific computing needs are different
¾ They require a better balance between data movement and floating-point operations, which results in greater efficiency

System Balance: memory bandwidth, expressed as peak operands per flop from main memory (the bandwidth implied by each entry is worked out in the sketch below)

System                     | Year | Node architecture | Processor cycle time | Speed per processor | Operands/Flop (main memory)
Earth Simulator (NEC)      | 2002 | Vector            | 500 MHz              | 8 Gflop/s           | 0.5
Cray X1 (Cray)             | 2003 | Vector            | 800 MHz              | 12.8 Gflop/s        | 0.33
ASCI Q (HP EV68)           | 2002 | Alpha             | 1.25 GHz             | 2.5 Gflop/s         | 0.1
MCR Xeon                   | 2002 | Pentium           | 2.4 GHz              | 4.8 Gflop/s         | 0.055
Apple Xserve (IBM PowerPC) | 2003 | PowerPC           | 2 GHz                | 8 Gflop/s           | 0.063
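The "operands per flop" column is the peak main-memory word rate divided by the peak flop rate, so it fixes the per-processor memory bandwidth. A small check using the table's own numbers (8-byte double-precision operands assumed):

```python
systems = {
    # name: (peak Gflop/s per processor, operands per flop from main memory)
    "Earth Simulator (NEC)": (8.0, 0.5),
    "Cray X1":               (12.8, 0.33),
    "ASCI Q (HP EV68)":      (2.5, 0.1),
    "MCR Xeon":              (4.8, 0.055),
    "Apple Xserve PowerPC":  (8.0, 0.063),
}
for name, (gflops, operands_per_flop) in systems.items():
    gb_per_s = gflops * operands_per_flop * 8    # 8 bytes per operand
    print(f"{name}: ~{gb_per_s:.1f} GB/s memory bandwidth per processor")
```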

System Balance (Network): network speed (MB/s) vs node speed (flop/s)

Communication/computation balance in Bytes/Flop (higher is better):
Cray X1            2.00
Cray Red Storm     1.60
ASCI Red           1.20
Cray T3E/1200      1.00
Blue Gene/L        0.38
PSC Lemieux        0.18
ASCI Purple        0.13
ASCI White         0.08
LANL Pink          0.05
ASCI Blue Mountain 0.02

(a small sketch of the calculation follows below)
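The bytes-per-flop figure is just per-node injection bandwidth divided by per-node peak. As an illustration, here it is computed for a hypothetical commodity node built from numbers given earlier in this talk (the pairing of a dual 2.2 GHz Opteron node with the Myrinet E-card rate is my own example, not one of the systems in the chart):

```python
def network_balance(link_mb_per_s, node_peak_gflops):
    """Communication/computation balance in bytes per flop."""
    return (link_mb_per_s * 1e6) / (node_peak_gflops * 1e9)

# hypothetical node: dual 2.2 GHz Opteron (2 x 4.4 Gflop/s) on Myrinet E card (450 MB/s one-way)
print(f"Opteron/Myrinet cluster node: {network_balance(450, 2 * 4.4):.2f} bytes/flop")
# compare with the chart: Cray Red Storm ~1.6, Blue Gene/L ~0.38, ASCI Blue Mountain ~0.02
```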

SETI@home: Global Distributed Computing
♦ Running on 500,000 PCs, ~1,300 CPU years per day
¾ 1.3M CPU years so far
♦ Sophisticated data & signal processing analysis
♦ Distributes datasets from the Arecibo radio telescope

SETI@home
♦ Uses thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
♦ When a participant's computer is idle, the software downloads a ~half-MB chunk of data for analysis; each client performs about 3 Tflop of work per chunk over roughly 15 hours (checked arithmetically below).
♦ The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants.
♦ About 5M users
♦ Largest distributed computation project in existence
¾ Averaging 72 Tflop/s
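A quick consistency check of the numbers on this slide:

```python
work_unit_flops = 3e12                         # ~3 Tflop of work per downloaded chunk
hours = 15
per_client = work_unit_flops / (hours * 3600)  # sustained rate per client while crunching
print(f"~{per_client / 1e6:.0f} Mflop/s per client")

aggregate = 72e12                              # reported average: 72 Tflop/s
print(f"72 Tflop/s aggregate ~ {aggregate / per_client / 1e6:.1f}M clients crunching at once")
```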


Google
♦ Google query attributes
¾ 150M queries/day (2,000/second)
¾ 100 countries
¾ 8.0B documents in the index
♦ Data centers
¾ 100,000 Linux systems in data centers around the world
¾ 15 TFlop/s and 1,000 TB total capability
¾ 40-80 1U/2U servers per cabinet
¾ 100 Mb Ethernet switches per cabinet with gigabit Ethernet uplink
¾ Growth from 4,000 systems (June 2000), 18M queries then
♦ Performance and operation
¾ Simple reissue of failed commands to new servers
¾ No performance debugging; problems are not reproducible
♦ The ranking computation is an eigenvalue problem Ax = λx with n = 8 x 10^9 (see MathWorks, Cleve's Corner). The matrix is the transition probability matrix of a Markov chain, so the desired vector satisfies Ax = x; forward links appear in the rows and back links in the columns. (A toy power-iteration sketch follows below.)
Source: Monika Henzinger, Google & Cleve Moler
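The computation behind this is the classic PageRank power iteration. A toy dense sketch; the damping factor, the 4-page example, and the dense matrix are illustrative only, since the real matrix with n ≈ 8 x 10^9 is handled as a sparse matrix distributed across those clusters:

```python
import numpy as np

def pagerank(links, d=0.85, tol=1e-10):
    """Toy PageRank by power iteration: find x with Gx = x for the 'Google matrix' G."""
    n = len(links)
    A = np.zeros((n, n))
    for i, outs in enumerate(links):
        if outs:
            for j in outs:
                A[j, i] = 1.0 / len(outs)    # column-stochastic transition matrix
        else:
            A[:, i] = 1.0 / n                # dangling page: jump anywhere
    G = d * A + (1.0 - d) / n                # damped transition matrix
    x = np.full(n, 1.0 / n)
    while True:
        x_new = G @ x                        # power iteration step: x <- Gx
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new

# tiny 4-page web: page 0 -> 1,2 ; 1 -> 2 ; 2 -> 0 ; 3 -> 2
print(pagerank([[1, 2], [2], [0], [2]]))
```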

The Grid
♦ The Grid is about gathering resources …
¾ run programs, access data, provide services, collaborate
♦ … to enable and exploit large-scale sharing of resources
♦ Virtual organization
¾ Loosely coordinated groups
♦ Provides for remote access of resources
¾ Scalable
¾ Secure
¾ Reliable mechanisms for discovery and access
♦ In some ideal setting:
¾ User submits work, infrastructure finds an execution target
¾ Ideally you don't care where.

The Grid
[Figure]

The Grid: The Good, The Bad, and The Ugly
♦ Good:
¾ Vision
¾ Community
¾ Developed functional software
♦ Bad:
¾ Oversold the grid concept
¾ Still too hard to use
¾ Solution in search of a problem
¾ Underestimated the technical difficulties
¾ Not enough of a scientific discipline
♦ Ugly:
¾ Authentication and security

The Computing Continuum
From loosely coupled to tightly coupled: Special Purpose ("SETI / Google") -> "Grids" -> Clusters -> Highly Parallel
♦ Each strikes a different balance
¾ computation/communication coupling
♦ Implications for execution efficiency
♦ Applications for diverse needs
¾ computing is only one part of the story!

Grids vs. Capability vs. Cluster Computing
♦ Not an "either/or" question
¾ Each addresses different needs
¾ Each is part of an integrated solution
♦ Grid strengths
¾ Coupling necessarily distributed resources: instruments, software, hardware, archives, and people
¾ Eliminating time and space barriers: remote resource access and capacity computing
¾ Grids are not a cheap substitute for capability HPC
♦ Highest-performance computing strengths
¾ Supporting foundational computations: terascale and petascale "nation scale" problems
¾ Engaging tightly coupled computations and teams
♦ Clusters
¾ Low cost, group solution
¾ Potential hidden costs
♦ Key is easy access to resources in a transparent way

Petascale Systems In 2008
♦ Technology trends
¾ Multicore processors, perhaps heterogeneous: IBM Power4 and Sun UltraSPARC IV today, Itanium "Montecito" in 2005, quad-core and beyond coming
¾ Reduced power consumption (laptop and mobile market drivers)
¾ Increased I/O and memory interconnect integration: PCI Express, Infiniband, …
♦ Looking forward a few years to 2008
¾ 8-way or 16-way cores (8 or 16 processors/chip)
¾ ~10 Gflop/s cores (processors) and 4-way nodes (4 8-way cores per node)
¾ 12x Infiniband-like interconnect, perhaps heterogeneous
♦ With 10 Gflop/s processors (see the sketch below)
¾ 100K processors and 3,100 nodes (4-way with 8 cores each)
¾ 1-3 MW of power, at a minimum
♦ To some extent, Petaflops systems will look like a "Grid in a Box"
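The 2008 numbers above hang together arithmetically:

```python
cores_per_chip = 8           # "8-way cores"
chips_per_node = 4           # 4-way nodes
nodes = 3100
gflops_per_core = 10

procs = nodes * chips_per_node * cores_per_chip   # ~99,200, i.e. ~100K processors
peak_pflops = procs * gflops_per_core / 1e6
print(f"{procs} processors -> ~{peak_pflops:.2f} Pflop/s peak")
```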

How Big Is Big?
♦ Every 10X brings new challenges
¾ 64 processors was once considered large
¾ it hasn't been "large" for quite a while
¾ 1024 processors is today's "medium" size
¾ 8096 processors is today's "large"
¾ we're struggling even here
♦ 100K processor systems
¾ are in construction
¾ we have fundamental challenges in dealing with machines of this size
¾ … and little in the way of programming support

Fault Tolerance in the Computation
♦ Some next-generation systems are being designed with more than 100K processors (IBM Blue Gene L).
♦ MTTF of 10^5 - 10^6 hours per component
¾ sounds like a lot until you divide by 10^5 components!
¾ failures for such a system can be just a few hours, perhaps minutes, away
♦ Problem with the MPI standard: no recovery from faults.
♦ Application checkpoint/restart is today's typical fault-tolerance method (a minimal sketch follows below).
♦ Many clusters based on commodity parts don't have error-correcting primary memory.
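Two quick sketches of the points above: the system-level MTTF arithmetic, and the kind of application-level checkpoint/restart loop that is today's typical answer. The file name, checkpoint interval, and "work" are illustrative placeholders, not anything from the talk:

```python
import os
import pickle

# If each of ~1e5 components has an MTTF of 1e5-1e6 hours, the whole system
# fails roughly component-MTTF / component-count apart.
for comp_mttf_hours in (1e5, 1e6):
    print(f"component MTTF {comp_mttf_hours:.0e} h -> "
          f"system MTTF ~ {comp_mttf_hours / 1e5:.0f} h with 1e5 components")

CHECKPOINT = "state.pkl"   # illustrative path

def run(steps=1_000_000, every=10_000):
    """Toy compute loop with periodic application-level checkpointing."""
    step, total = 0, 0.0
    if os.path.exists(CHECKPOINT):              # restart after a failure
        with open(CHECKPOINT, "rb") as f:
            step, total = pickle.load(f)
    while step < steps:
        total += step * 1e-6                    # stand-in for real work
        step += 1
        if step % every == 0:
            with open(CHECKPOINT, "wb") as f:   # save just enough state to resume
                pickle.dump((step, total), f)
    return total

print(run())
```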

Real Crisis With HPC Is With The Software
♦ Programming is stuck
¾ Arguably it hasn't changed since the 60's
♦ It's time for a change
¾ Complexity is rising dramatically
¾ highly parallel and distributed systems: from 10 to 100 to 1,000 to 10,000 to 100,000 processors!!
¾ multidisciplinary applications
♦ A supercomputer application and its software are usually much longer-lived than the hardware
¾ Hardware life is typically five years at most
¾ Fortran and C are the main programming models
♦ Software is a major cost component of modern technologies
¾ The tradition in HPC system procurement is to assume that the software is free
♦ We don't have many great ideas about how to solve this problem.

Collaborators / Support
♦ TOP500
¾ H. Meuer, Mannheim U
¾ H. Simon, NERSC
¾ E. Strohmaier