Overview of High Performance Computing

Jan Thorbecke
Cray Application Engineer
[email protected]
26 September 2002

Typeset with LaTeX and FoilTeX


Contents

• HPC architectures¹
• Vector architectures
• Clusters and their interconnects
• HPC software tools
• Classification of applications
• Cray HPC project

¹ Most of this material is based on the overview of Aad van der Steen.


HPC architectures: HPC definitions


“High Performance Computing encompasses a collection of hardware systems, software tools, languages, and generic programming approaches which make previously unfeasible applications possible at an available price.”


“A branch of computer science that concentrates on developing supercomputers and software to run on supercomputers. A main area of this discipline is developing parallel processing algorithms and software: programs that can be divided into little pieces so that each piece can be executed simultaneously by separate processors.”


“Encompasses advanced computing, communications, and information technologies, including scientific workstations, supercomputer systems, high speed networks, special purpose and experimental systems, the new generation of large scale parallel systems, and application and systems software with all components well integrated and linked over a high speed network.”


HPC architectures: Flynn’s classification

Flynn’s classification of High Performance Computers is based on the instruction and data streams into a processor:

• SISD: Single Instruction Single Data stream; most workstations
• SIMD: Single Instruction Multiple Data stream; vector processors that act on arrays of data, and array processors
• MISD: Multiple Instruction Single Data stream; not yet commercially built
• MIMD: Multiple Instruction Multiple Data stream; each processor fetches its own instructions and operates on its own data
  - Shared Memory²: the address space is shared, (N)UMA
  - Distributed Memory: every CPU has its own address space

² Virtual shared-memory programming models are able to address the whole collective address space on physically distributed memory systems.
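A minimal C sketch of the shared-memory MIMD case (added for illustration; thread count and problem size are arbitrary): each thread runs its own instruction stream, but all of them load and store within one shared address space. On a distributed-memory machine the partial sums would instead have to be exchanged with explicit messages.

    /* Added illustration: shared-memory MIMD with POSIX threads.
       All threads see the same arrays; no messages are needed. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1000000              /* divisible by NTHREADS */

    static double data[N];            /* shared: one address space for all */
    static double partial[NTHREADS];  /* one result slot per thread */

    static void *worker(void *arg)
    {
        long id = (long)arg;
        long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
        double sum = 0.0;
        for (long i = lo; i < hi; i++)   /* each thread sums its own block */
            sum += data[i];
        partial[id] = sum;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < N; i++) data[i] = 1.0;
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        double total = 0.0;
        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);  /* join makes partial[t] visible */
            total += partial[t];
        }
        printf("sum = %.1f\n", total);   /* prints 1000000.0 */
        return 0;
    }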


HPC architectures: SIMD: Thinking Machines CM-2

• a 12-D hypercube is a cube of 9-D hypercubes: 4096 chips
• processors were grouped 16 to a chip
• 65,536 1-bit processors that simultaneously perform the same calculation
• operations on larger data elements (32-bit floats) required one cycle per bit
• local memory of 4K bits
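An added sketch of why a 32-bit operation costs one cycle per bit on such 1-bit processors: the addition proceeds bit-serially, one bit plus a carry per step (plain C is used here only to model the idea).

    /* Added illustration: bit-serial addition as a 1-bit ALU performs it.
       A 32-bit add takes 32 steps, one bit (plus carry) per step. */
    #include <stdio.h>
    #include <stdint.h>

    static uint32_t bit_serial_add(uint32_t a, uint32_t b)
    {
        uint32_t sum = 0, carry = 0;
        for (int i = 0; i < 32; i++) {             /* one cycle per bit */
            uint32_t ai = (a >> i) & 1, bi = (b >> i) & 1;
            sum |= (ai ^ bi ^ carry) << i;         /* full-adder sum bit */
            carry = (ai & bi) | (ai & carry) | (bi & carry);
        }
        return sum;
    }

    int main(void)
    {
        printf("%u\n", bit_serial_add(123456u, 654321u)); /* prints 777777 */
        return 0;
    }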


HPC architectures: SIMD: Cray MTA

• multithreaded machine (eliminates useless cycles)
• 1-256 CPUs
• memory is physically distributed and globally addressable
• each processor is capable of storing the state of 128 separate threads

Cray announced that a 40-processor, all-CMOS Cray MTA-2(TM) system with a revolutionary multithreaded architecture has completed customer acceptance testing at the Washington, D.C. facilities of the Naval Research Laboratory (NRL). Cray provided the Cray MTA-2 system to NRL under a contract with Northrop Grumman Information Technology, a premier provider of advanced IT solutions, engineering and business services for government and commercial businesses. NRL is a leading-edge Distributed Center within the Department of Defense’s High Performance Computing Modernization Program. NRL’s mission is to explore and evaluate innovative computing and networking technologies.


HPC architectures: DM-MIMD: Cray T3E

• bi-directional 3D torus (480 MB/s in each direction)
• E-registers
• adaptive routing: source and destination node coordinates
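To make the torus topology concrete, a small added C sketch (dimensions and the rank mapping are illustrative, not from the slides): each node has six neighbors, found by wrapping each coordinate modulo the torus size, which is what coordinate-based routing between source and destination relies on.

    /* Added illustration: neighbor ranks on a 3D torus of DX x DY x DZ
       nodes. Wrap-around (modulo) gives two neighbors per axis. */
    #include <stdio.h>

    #define DX 8
    #define DY 8
    #define DZ 8

    /* linear rank of the node at torus coordinates (x, y, z) */
    static int rank_of(int x, int y, int z)
    {
        return (x * DY + y) * DZ + z;
    }

    int main(void)
    {
        int x = 0, y = 3, z = 7;   /* an example node */
        printf("+x neighbor: %d\n", rank_of((x + 1) % DX, y, z));
        printf("-x neighbor: %d\n", rank_of((x + DX - 1) % DX, y, z));
        printf("+z neighbor: %d\n", rank_of(x, y, (z + 1) % DZ));
        printf("-z neighbor: %d\n", rank_of(x, y, (z + DZ - 1) % DZ));
        return 0;
    }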


HPC architectures: DM-MIMD: Cray T3E Processor (EV-5, EV-6)



HPC architectures: SM-MIMD

The most important differences are in the organization of memory:

• Uniform Memory Access
• Non-Uniform Memory Access
• Cache Coherency


HPC architectures: SM-MIMD: UMA (Uniform Memory Access)


HPC architectures: SM-MIMD: NUMA (Non-Uniform Memory Access)


HPC architectures: SM-MIMD: NUMA Origin 3000 building block

Distributed shared memory is a model; to the user this is a shared-memory system.


HPC architectures: SM-MIMD: NUMA Origin 3000 building systems

Every Hub and Router adds bandwidth.



HPC architectures: Cache Coherency

[Figure, shown as a sequence of build-up slides: five processors P0-P4, each with a private cache, attached to a shared memory. All caches start with A = 1; one processor then updates A to 5 in its own cache, while stale copies with A = 1 survive in the other caches.]

If multiple processors cache and update data concurrently, then inconsistency may arise between the “same” data in different caches.
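A minimal C sketch of the slide’s scenario (added illustration; pthreads is an assumed vehicle, and the variable name A mirrors the figure): one thread plays the role of P3 and writes A = 5, while an unsynchronized read may still return the old value.

    /* Added illustration: the update from the figure as a data race. */
    #include <pthread.h>
    #include <stdio.h>

    static volatile int A = 1;          /* the figure's shared datum */

    static void *writer(void *arg)      /* plays the role of P3 */
    {
        (void)arg;
        A = 5;                          /* update may sit in P3's cache first */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);
        printf("unsynchronized read: A = %d\n", A);  /* may print 1 or 5 */
        pthread_join(t, NULL);          /* join orders the write before this read */
        printf("after join:          A = %d\n", A);  /* prints 5 */
        return 0;
    }

Hardware cache coherency protocols exist precisely so that, once the program synchronizes, all processors agree on the value of A.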


HPC architectures: Cache Coherency

There are two classes of cache coherency protocols:

• snoopy protocols: monitor all requests to memory; bus-based architectures (Sun)
• directory memory: cache state maintained for each memory block (SGI)

and two schemes:

• updates: write-through, which can consume more system bandwidth (Cray)
• invalidates: can lead to false sharing; optimized for uniprocessor performance (SGI)

A sketch of the false-sharing effect follows below.
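A hedged sketch of the false sharing named above (added; assumes C with OpenMP and 64-byte cache lines, none of which comes from the slides): per-thread counters packed into one cache line force an invalidation-based protocol to evict the line from the other caches on every write, while padding each counter to its own line avoids the ping-pong.

    /* Added illustration: false sharing versus padded counters. */
    #include <omp.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS    10000000

    struct padded { long value; char pad[64 - sizeof(long)]; };

    long          hot[NTHREADS];   /* all counters share one cache line */
    struct padded cold[NTHREADS];  /* one cache line per counter */

    int main(void)
    {
        double t0 = omp_get_wtime();
        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++)
                hot[id]++;         /* every write invalidates the line elsewhere */
        }
        double t1 = omp_get_wtime();
        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++)
                cold[id].value++;  /* line stays resident in the local cache */
        }
        double t2 = omp_get_wtime();
        printf("false sharing: %.3f s, padded: %.3f s\n", t1 - t0, t2 - t1);
        return 0;
    }

On a bus-based SMP the padded loop typically runs several times faster, even though both loops do identical arithmetic.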


HPC architectures: SIMD/MIMD: Vector Processor

Vector supercomputers load data into vector registers to hide latency, and perform vector operations to produce high computation rates.
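As a generic illustration (added here, not taken from the slides), the classic SAXPY loop below is the kind of code a vector unit thrives on: the iterations are independent, so the compiler can load strips of x and y into vector registers and issue one vector multiply-add per strip.

    /* Added illustration: y = a*x + y. Every iteration is independent,
       so a vectorizing compiler processes the loop in register-length
       strips; restrict asserts that the arrays do not overlap. */
    void saxpy(int n, float a, const float * restrict x, float * restrict y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

On a register-to-register vector machine such as the Cray 1 line, the strip length matches the vector register length, e.g. 64 elements.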


HPC architectures: Vector

• Overcomes limitations of pipelining in scalar processors:
  - deep pipelining can slow down a processor,
  - limits on instruction fetch and decode.
• A single vector instruction executes an entire loop: control unit overhead is greatly reduced.
• Vector instructions have a known access pattern to memory.
• High (interleaved) memory bandwidth is needed to feed the vector processors.
• Memory organization: memory-memory (’70s), memory-register (Cray 1 ... C-90, Convex, NEC), memory-cache-register (Cray SV1, X1).


HPC architectures: Pipelining

    Clock cycle   Fetch     Decode    Execute   Store
    1             Inst. 1   -         -         -
    2             Inst. 2   Inst. 1   -         -
    3             Inst. 3   Inst. 2   Inst. 1   -
    4             Inst. 4   Inst. 3   Inst. 2   Inst. 1
    5             Inst. 5   Inst. 4   Inst. 3   Inst. 2

Four-cycle instructions; pipelining uses all the available hardware (ALU, registers, ...) within one cycle. There is a balance between clocks per instruction and pipeline depth.



HPC architectures: Vector code

Vectorization inhibitors within loops are:

• function/subroutine calls
• I/O statements
• backward branches
• statement numbers (labels) with references from outside the loop
• references to character variables
• external functions that do not vectorize
• RETURN, STOP, or PAUSE statements
• dependencies (recurrences)

Two of these inhibitors are sketched in the example below.
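An added C sketch (the function name expensive is hypothetical): when the callee lives in another translation unit the compiler cannot see its body, so the call blocks vectorization; the data-dependent exit does the same by making the trip count unknown in advance.

    /* Added illustration: two vectorization inhibitors in one loop. */
    #include <math.h>
    #include <stdio.h>

    double expensive(double x)            /* imagine this in another file */
    {
        return sqrt(x) + 1.0;
    }

    double apply(int n, double *a)
    {
        double last = 0.0;
        for (int i = 0; i < n; i++) {
            if (a[i] < 0.0)
                break;                    /* early exit: inhibits vectorization */
            a[i] = expensive(a[i]);       /* opaque call: inhibits vectorization */
            last = a[i];
        }
        return last;
    }

    int main(void)
    {
        double a[8] = {1, 4, 9, 16, -1, 25, 36, 49};
        printf("last = %f\n", apply(8, a));   /* stops at the negative element */
        return 0;
    }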


HPC architectures: Vector code

A dependency within the loop cannot be vectorized:

    for (i=2; i<n; i++) {
        a[i] = a[i-1] + b[i];   /* illustrative body: a[i] needs the a[i-1]
                                   produced one iteration earlier */
    }

Each iteration consumes a result computed in the previous iteration, so the elements cannot be processed independently in vector lanes.
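For contrast, an added sketch (not part of the original deck): if the values read come from an array the loop never writes, the loop-carried dependency disappears and the loop vectorizes.

    /* Added illustration: no loop-carried dependency, so this vectorizes.
       restrict asserts that the arrays do not alias. */
    void no_recurrence(int n, double * restrict a,
                       const double * restrict old, const double * restrict b)
    {
        for (int i = 2; i < n; i++)
            a[i] = old[i-1] + b[i];   /* old[] is read-only: iterations independent */
    }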