Overview of High Performance Computing
Jan Thorbecke, Cray Application Engineer
[email protected]
26 September 2002
Typeset with LaTeX and FoilTeX
HPC
1
Contents
• HPC architectures [1]
• Vector architectures
• Clusters and their interconnects
• HPC software tools
• Classification of applications
• Cray HPC project

[1] Most of this material is based on the overview of Aad van der Steen.
HPC architectures: HPC definitions
"High Performance Computing is the collection of hardware systems, software tools, languages, and general programming approaches which make previously infeasible applications possible at an acceptable price."
"A branch of computer science that concentrates on developing supercomputers and the software that runs on supercomputers: parallel software is written so that pieces of an algorithm can be executed simultaneously by separate processors."
"HPC environments include advanced computing, communication, and information technologies such as scientific workstations, supercomputer systems, large scale parallel systems, high speed networks, special purpose and experimental systems, and applications and systems software, all well integrated and linked over high speed networks."
Architectures
HPC architectures

Flynn's classification of High Performance Computers is based on the instruction and data streams into a processor:
• SISD: Single Instruction Single Data stream (most workstations)
• SIMD: Single Instruction Multiple Data stream (vector processors acting on arrays of data, and array processors)
• MISD: Multiple Instruction Single Data stream (not yet commercially built)
• MIMD: Multiple Instruction Multiple Data stream (each processor fetches its own instructions and operates on its own data)
  – Shared Memory [2]: the address space is shared, (N)UMA
  – Distributed Memory: every CPU has its own address space
[2] Virtual shared-memory programming models can address the whole collective address space on physically distributed memory systems.
HPC architectures: SIMD Thinking Machines CM-2
• a 12-D hypercube is a cube of 9-D hypercubes: 4096 chips
• processors were grouped 16 to a chip
• 65,536 1-bit processors that simultaneously perform the same calculation
• operations on larger data elements (32-bit floats) required one cycle per bit
• local memory of 4K bits
HPC architectures: SIMD Cray MTA
• multithreaded machine (eliminates useless cycles)
• 1-256 CPUs
• memory is physically distributed and globally addressable
• each processor is capable of storing the state of 128 separate threads

Cray announced that a 40-processor, all-CMOS Cray MTA-2(TM) system with a revolutionary multithreaded architecture has completed customer acceptance
testing at the Washington, D.C. facilities of the Naval Research Laboratory (NRL). Cray provided the Cray MTA-2 system to NRL under a contract with Northrop Grumman Information Technology, a premier provider of advanced IT solutions, engineering and business services for government and commercial businesses. NRL is a leading-edge Distributed Center within the Department of Defense’s High Performance Computing Modernization Program. NRL’s mission is to explore and evaluate innovative computing and networking technologies.
HPC architectures: DM-MIMD Cray T3E:
• bi-directional 3D torus (480 MB/s in each direction)
• E-registers
• adaptive routing based on source and destination node coordinates
HPC architectures: DM-MIMD Cray T3E Processor (EV-5, EV-6)
HPC architectures: SM-MIMD
HPC architectures: SM-MIMD

The most important differences lie in the organization of memory:
• Uniform Memory Access
• Non-Uniform Memory Access
• Cache Coherency
HPC architectures: SM-MIMD UMA Uniform Memory Access
HPC architectures: SM-MIMD NUMA Non-Uniform Memory Access
HPC architectures: SM-MIMD NUMA Origin 3000 building block
Distributed shared memory is a model; to the user this is a shared memory system.
HPC architectures: SM-MIMD NUMA Origin 3000 building systems
Every Hub and Router adds bandwidth.
HPC architectures: SM-MIMD NUMA Origin 3000 building systems
HPC architectures: Cache Coherency

[Figure, built up over several slides: five processors P0..P4, each with a private cache, connected to a shared Memory. All caches and Memory start with A = 1; one processor then updates its copy of A to 5, while stale copies with A = 1 remain in the other caches and in Memory.]

If multiple processors cache and update data concurrently, inconsistency may arise between the "same" data in different caches.
HPC architectures: Cache Coherency

There are two classes of cache coherency protocols:
• snoopy protocols: every cache monitors all requests to memory; bus-based architectures (Sun)
• directory protocols: the cache state of each memory block is maintained in a directory (SGI)
and two write schemes:
• updates: write-through; can consume more system bandwidth (Cray)
• invalidates: can lead to false sharing; optimized for uniprocessor performance (SGI)
HPC architectures: SIMD/MIMD: Vector Processor
Vector supercomputers load data into vector registers to hide latency, and perform vector operations to produce high computation rates.
HPC architectures: Vector

• Overcomes limitations of pipelining in scalar processors:
  – deep pipelining can slow down a processor,
  – limits on instruction fetch and decode.
• A single vector instruction executes an entire loop: control unit overhead is greatly reduced.
• Vector instructions have a known access pattern to memory.
• High (interleaved) memory bandwidth is needed to feed the vector processors.
• Designs: memory-memory (1970s), memory-register (Cray 1 ... C-90, Convex, NEC), and memory-cache-register (Cray SV1, X1).
HPC architectures: Pipelining

Clock cycle      1        2        3        4        5
Fetch        Inst. 1  Inst. 2  Inst. 3  Inst. 4  Inst. 5
Decode                Inst. 1  Inst. 2  Inst. 3  Inst. 4
Execute                        Inst. 1  Inst. 2  Inst. 3
Store                                   Inst. 1  Inst. 2
Four-cycle instructions: all the available hardware (ALU, registers, ...) is kept busy within each cycle. There is a balance between clocks per instruction and pipeline depth.
HPC architectures: Vector
HPC architectures: Vector code

Vectorization inhibitors within loops are:
• function/subroutine calls
• I/O statements
• backward branches
• statement labels referenced from outside the loop
• references to character variables
• external functions that do not vectorize
• RETURN, STOP, or PAUSE statements
• dependencies (recurrences)
HPC architectures: Vector code

A dependency within a loop cannot be vectorized, for example a recurrence of the form:

for (i=2; i<n; i++)
    a[i] = a[i-1] + b[i];    /* each iteration needs the result of the previous one */