CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture)
Scalable Multiprocessors: Case Studies (Chapter 7)

Copyright 2001 Mark D. Hill, University of Wisconsin-Madison
Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!
Case-study slides (CM-5, J-Machine, Cray T3D/T3E) are (C) 2003 J. E. Smith.
Massively Parallel Processor (MPP) Architectures

[Figure: MPP node — processor with cache and network interface on the memory bus; main memory; I/O bridge to an I/O bus with disk controller and disks; network attached to the network interface]

• Network interface typically close to processor
  – memory bus:
    » locked to specific processor architecture/bus protocol
  – registers/cache:
    » only in research machines
• Time-to-market is long
  – processor already available, or work closely with processor designers
• Maximize performance (even at high cost)
Network of Workstations

[Figure: workstation node — processor with cache and core chip set; main memory, disk controller/disks, graphics controller, and network interface all on the I/O bus (interrupts delivered to the processor); network]

• Network interface on I/O bus
• Standards (e.g., PCI) => longer life, faster to market
• Slow (microseconds) to access network interface
• "System Area Network" (SAN): between LAN & MPP
Transaction Interpretation

• Simplest MP: assist doesn't interpret much, if anything
  – DMA from/to buffer, interrupt or set flag on completion
• User-level messaging: get the OS out of the way
  – assist does protection checks to allow direct user access to network
  – may have minimal interpretation otherwise
• Virtual DMA: get the CPU out of the way (maybe)
  – basic protection plus address translation: user-level bulk DMA
• Global physical address space (NUMA): everything in hardware
  – complexity increases, but performance does too (if done right)
• Cache coherence: even more so
  – stay tuned
Spectrum of Designs
(increasing HW support, specialization, intrusiveness, performance (???))

• None: physical bit stream
  – physical DMA (nCUBE, iPSC, ...)
• User/System
  – user-level port (CM-5, *T)
  – user-level handler (J-Machine, Monsoon, ...)
• Remote virtual address
  – processing, translation (Paragon, Meiko CS-2, Myrinet)
  – reflective memory (?) (Memory Channel, SHRIMP)
• Global physical address
  – proc + memory controller (RP3, BBN, T3D, T3E)
• Cache-to-cache (later)
  – cache controller (Dash, KSR, Flash)
Net Transactions: Physical DMA

[Figure: sender builds data + dest in memory and programs a DMA channel (addr, length, rdy); on the receiver, a DMA channel deposits the packet into memory and raises status/interrupt; P = processor on each node]

• Physical addresses: OS must initiate transfers
  – system call per message on both ends: ouch
• Sending OS copies data to kernel buffer w/ header/trailer (sketched below)
  – can avoid copy if interface does scatter/gather
• Receiver copies packet into OS buffer, then interrupts
  – user message then copied (or mapped) into user space
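A minimal C sketch of this OS-mediated send path, assuming a hypothetical DMA descriptor layout and helper names (ni_out_desc, kernel_virt_to_phys) rather than any particular machine's interface:

```c
/* Hedged sketch of the physical-DMA send path described above: the user
 * makes a system call, the OS copies the payload into a kernel buffer with
 * a header, then programs a DMA descriptor (addr/length/rdy).  Structure
 * and names are illustrative, not a particular machine's interface. */
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t dest;            /* destination node */
    uint32_t len;             /* payload bytes    */
} msg_header_t;

typedef struct {
    volatile uint64_t addr;   /* physical address of the kernel buffer */
    volatile uint32_t length; /* bytes to transmit                     */
    volatile uint32_t rdy;    /* set last: tells the NI to start DMA   */
} dma_desc_t;

extern dma_desc_t *ni_out_desc;                 /* assumed: NI output channel      */
extern uint64_t kernel_virt_to_phys(void *p);   /* assumed: OS address translation */

/* Runs inside the OS, once per message (the "system call per message: ouch"). */
int sys_net_send(uint32_t dest, const void *user_buf, uint32_t len)
{
    static uint8_t kbuf[4096];                  /* kernel bounce buffer */
    msg_header_t *hdr = (msg_header_t *)kbuf;

    if (len > sizeof(kbuf) - sizeof(*hdr))
        return -1;
    hdr->dest = dest;
    hdr->len  = len;
    memcpy(kbuf + sizeof(*hdr), user_buf, len); /* copy avoidable with scatter/gather */

    ni_out_desc->addr   = kernel_virt_to_phys(kbuf);
    ni_out_desc->length = sizeof(*hdr) + len;
    ni_out_desc->rdy    = 1;                    /* kick off the DMA */
    return 0;
}
```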
Conventional LAN Network Interface

[Figure: NIC with transceiver, controller, and DMA engine on the I/O bus; host memory holds rings of TX and RX descriptors (addr, len, status, next); the I/O bus bridges to the memory bus and processor]
User-Level Ports

[Figure: user/system data and destination flow through network ports exposed directly to the processor on each node, with status/interrupt]

• Map network hardware into user's address space
  – talk directly to network via loads & stores (sketched below)
• User-to-user communication without OS intervention: low latency
• Protection: user/user & user/system
• DMA hard… CPU involvement (copying) becomes bottleneck
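A hedged sketch of what user-level load/store access to a mapped port might look like; the register layout and map_net_port() are assumptions, not a real device's programming model:

```c
/* Hedged sketch of user-level ports: the network input/output FIFOs are
 * mapped into the user's address space, so sending is just stores and
 * receiving is just loads, with no OS on the critical path. */
#include <stdint.h>

typedef struct {
    volatile uint32_t out_dest;   /* store destination node here        */
    volatile uint32_t out_data;   /* store payload words here           */
    volatile uint32_t out_full;   /* load: nonzero if output FIFO full  */
    volatile uint32_t in_ready;   /* load: nonzero if a word is waiting */
    volatile uint32_t in_data;    /* load: next received word           */
} user_port_t;

extern user_port_t *map_net_port(void);   /* assumed: one-time setup via the OS */

void send_word(user_port_t *p, uint32_t dest, uint32_t w)
{
    while (p->out_full)
        ;                    /* spin until the NI drains the FIFO */
    p->out_dest = dest;      /* plain stores: no system call      */
    p->out_data = w;
}

uint32_t recv_word(user_port_t *p)
{
    while (!p->in_ready)
        ;                    /* poll until a word arrives */
    return p->in_data;
}
```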
Case Study: Thinking Machines CM-5

• Follow-on to CM-2
  – abandons SIMD, bit-serial processing
  – uses off-the-shelf processors/parts
  – focus on floating point
  – 32 to 1024 processors
  – designed to scale to 16K processors
• Designed to be independent of specific processor node
• "Current" processor node
  – 40 MHz SPARC
  – 32 MB memory per node
  – 4 FP vector chips per node
CM-5, contd.

• Vector Unit
  – four FP processors
  – direct connect to main memory
  – each has a 1-word data path
  – FP unit can do scalar or vector
  – 128 MFLOPS peak; 50 MFLOPS Linpack
Interconnection Networks

• Two networks: Data and Control
• Network Interface
  – memory-mapped functions
  – store data => data
  – part of address => control
  – some addresses map to privileged functions
Interconnection Networks, contd.

• Input and output FIFO for each network
• Two data networks
• Save/restore network buffers on context switch

[Figure: CM-5 system — diagnostics, control, and data networks connect processing partitions (PMs), control processors, and an I/O partition; node detail — SPARC, FPU, cache controller ($ ctrl), cache SRAM ($ SRAM), and NI on the MBUS, with vector units and DRAM controllers/DRAM]
Message Management

• Typical function implementation (sketched below)
  – each function has two FIFOs (in and out)
  – two outgoing control registers: send and send_first
    » send_first initiates a message
    » send sends any additional data
  – read send_ok to check for a successful send; else retry
  – read send_space to check space prior to a send
  – incoming register receive_ok can be polled
  – read receive_length_left for the message length
  – read receive for input data, in FIFO order
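A hedged C sketch of this polling protocol, using the register names from the slide (send_first, send, send_ok, receive_ok, …) but an assumed memory layout and word counts:

```c
/* Hedged sketch of CM-5-style user-level messaging via memory-mapped
 * FIFO registers.  Register names come from the slide; the struct layout,
 * addresses, and message sizes are illustrative assumptions. */
#include <stdint.h>

typedef struct {
    volatile uint32_t send_first;          /* write: start a new outgoing message   */
    volatile uint32_t send;                /* write: append additional data words   */
    volatile uint32_t send_ok;             /* read: nonzero if last send succeeded  */
    volatile uint32_t send_space;          /* read: space available in out-FIFO     */
    volatile uint32_t receive_ok;          /* read: nonzero if a message is waiting */
    volatile uint32_t receive_length_left; /* read: words remaining in message      */
    volatile uint32_t receive;             /* read: next data word, FIFO order      */
} cm5_ni_t;

/* Send a message of n words; retry until the network accepts it. */
void ni_send(cm5_ni_t *ni, const uint32_t *words, int n)
{
    do {
        ni->send_first = words[0];          /* first word initiates the message */
        for (int i = 1; i < n; i++)
            ni->send = words[i];            /* remaining words via 'send'       */
    } while (!ni->send_ok);                 /* not accepted: retry whole message */
}

/* Poll for an incoming message; returns the number of words read. */
int ni_receive(cm5_ni_t *ni, uint32_t *buf)
{
    int n = 0;
    if (!ni->receive_ok)
        return 0;                           /* nothing pending */
    while (ni->receive_length_left > 0)
        buf[n++] = ni->receive;             /* drain in FIFO order */
    return n;
}
```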
User-Level Network Ports

[Figure: net output port and net input port mapped into the virtual address space, alongside the processor's registers, status, and program counter]

• Appears to user as logical message queues plus status
• What happens if no user pop?
Control Network

• In general, every processor participates
  – a mode bit allows a processor to abstain
• Broadcasting
  – 3 functional units: broadcast, supervisor broadcast, interrupt
    » only 1 broadcast at a time
    » broadcast message is 1 to 15 32-bit words
• Combining
  – operation types:
    » reduction
    » parallel prefix
    » parallel suffix
    » router-done (big OR)
  – combine message is a 32- to 128-bit integer
  – operator types: OR, XOR, max, signed add, unsigned add
  – operation and operator are encoded in the send_first address
Control Network, contd.

• Global operations
  – big OR of 1 bit from each processor
  – three independent units: one synchronous, two asynchronous
  – synchronous unit useful for barrier synch
  – asynchronous units useful for passing error conditions, independent of other control-unit functions
• Individual instructions are not synchronized
  – each processor fetches its own instructions
  – barrier synch via the control network is used between instruction blocks
  – => supports a loose form of data parallelism
Data Network

• Architecture
  – fat tree
  – packet switched
  – bandwidth to neighbor: 20 MB/sec
  – latency in a large network: 3 to 7 microseconds
  – can support message passing or a global address space
• Network interface
  – one data-network functional unit
  – send_first gets destination address + tag
  – 1 to 5 32-bit chunks of data
  – tag can cause an interrupt at the receiver
• Addresses may be physical or relative
  – physical addresses are privileged
  – relative addresses are bounds-checked and translated
Data Network, contd.

• Typical message interchange (sketched below):
  – alternate between pushing onto the send FIFO and receiving on the receive FIFO
  – once all messages are sent, assert this with a combine function on the control network
  – alternate between receiving more messages and testing the control network
  – when router_done goes active, the message-passing phase is complete
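A sketch of that interchange, reusing the illustrative ni_send/ni_receive helpers above; the control-network calls (assert_router_done, router_done_active) are assumed stand-ins for the big-OR combine:

```c
/* Hedged sketch of the "all messages sent" protocol described above.
 * Types and helpers come from the earlier sketch or are assumptions. */
#include <stdint.h>

typedef struct cm5_ni cm5_ni_t;                              /* from the earlier sketch */
extern void ni_send(cm5_ni_t *ni, const uint32_t *words, int n);
extern int  ni_receive(cm5_ni_t *ni, uint32_t *buf);
extern void assert_router_done(void);   /* assumed: join the router-done combine      */
extern int  router_done_active(void);   /* assumed: big-OR over all nodes has completed */

void message_phase(cm5_ni_t *ni,
                   const uint32_t (*out_msgs)[5], int n_out,
                   uint32_t (*in_buf)[5])
{
    int sent = 0, received = 0;

    /* Alternate between pushing sends and draining receives, so the
     * receive FIFO never backs up and stalls the network. */
    while (sent < n_out) {
        ni_send(ni, out_msgs[sent], 5);
        sent++;
        if (ni_receive(ni, in_buf[received]) > 0)
            received++;
    }

    assert_router_done();               /* tell the control network we are done sending */

    /* Keep receiving until every node's messages have drained. */
    while (!router_done_active())
        if (ni_receive(ni, in_buf[received]) > 0)
            received++;
}
```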
User-Level Handlers

[Figure: like user-level ports, but each transaction carries user/system status, data, a destination, and an address that selects the handler to run]

• Hardware support to vector to the address specified in the message
  – message ports in registers
  – alternate register set for handler?
• Examples: J-Machine, Monsoon, *T (MIT), iWARP (CMU)
J-Machine

• Each node is a small message-driven processor
• HW support to queue messages and dispatch to a message-handler task
J-Machine Message-Driven Processor

• MIT research project
• Targets fine-grain message passing
  – very low message overheads allow:
    » small messages
    » small tasks
• J-Machine architecture
  – 3D mesh direct interconnect
  – global address space
  – up to 1 Mword of DRAM per node
  – MDP single chip supports processing, memory control, and message passing
  – not an off-the-shelf chip
Features/Example: Combining Tree

• Each node collects data from lower levels, accumulates a sum, and passes the sum up the tree when all inputs are done (see the sketch after this list)
• Communication
  – SEND instructions send values up the tree in small messages
  – on arrival, a task is created to perform the COMBINE routine
• Synchronization
  – message arrival dispatches a task
  – example: a combine message invokes the COMBINE routine
  – presence bits (full/empty) on state
  – value set empty; reads are blocked until it becomes full
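A software sketch of one combining-tree node; the MDP would express this as a message handler with hardware full/empty bits, so the valid flags, struct layout, and send_to_parent() here are illustrative stand-ins:

```c
/* Hedged sketch of a combining-tree node in plain C. */
#include <stdint.h>

#define FANIN 4

typedef struct {
    int64_t value[FANIN];   /* contributions from children              */
    int     valid[FANIN];   /* software stand-in for full/empty bits    */
    int     arrived;        /* how many children have reported          */
    int     parent;         /* node id of the parent, -1 at the root    */
} combine_node_t;

extern void send_to_parent(int parent, int64_t sum);  /* assumed: SEND up the tree */

/* Invoked when a combine message arrives from child 'slot'. */
void combine_handler(combine_node_t *n, int slot, int64_t v)
{
    n->value[slot] = v;                 /* the write marks the slot "full" */
    n->valid[slot] = 1;
    n->arrived++;

    if (n->arrived == FANIN) {          /* all inputs present: reduce and pass up */
        int64_t sum = 0;
        for (int i = 0; i < FANIN; i++)
            sum += n->value[i];
        if (n->parent >= 0)
            send_to_parent(n->parent, sum);
        for (int i = 0; i < FANIN; i++) /* reset: mark slots empty for the next round */
            n->valid[i] = 0;
        n->arrived = 0;
    }
}
```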
Features/Example, contd.

• Naming
  – segmented memory and translation instructions
  – protection via segment descriptors
  – allocate a segment of memory for each combining node
Instruction Set

• Separate register sets for three execution levels:
  – background: no messages
  – priority 0: lower-priority messages
  – priority 1: higher-priority messages
• Each register set:
  – 4 GPRs
  – 4 address regs: contain segment descriptors (base + field length)
  – 4 ID registers (P0 and P1 only)
  – instruction pointer: includes status info
• Addressing
  – an address is an offset + address register
Instruction Set, contd.

• Tags
  – 4-bit tag per 32-bit data value
  – data types, e.g., integer, boolean, symbol
  – system types, e.g., IP, addr, msg
  – 4 user-defined types
  – full/empty types: fut and cfut
  – interrupt on access to a cfut value
• MDP can do operand type checking and interrupt on mismatch
  – => program does not have to do software checking
• Instructions
  – 17-bit format (two per word)
  – three operands
  – words not tagged as instructions are loaded into R0 automatically
Instruction Set, contd.

• Naming
  – ENTER instruction places a translation into the TLB
  – XLATE instruction does a TLB translation
• Communication
  – SEND instructions
  – Example:
    » SEND   R0,0         ; send net address, priority 0
    » SEND2  R1,R2,0      ; send two data words
    » SEND2E R3,[3,A3],0  ; send two more words; end message
Instruction Set, contd.

• Task scheduling
  – a message at the head of the queue causes the MDP to create a task
  – opcode loaded into IP
  – length field and message head loaded into A3
  – takes three cycles
  – low-latency messages are executed directly
  – others specify a handler that locates the required method and transfers control to it
Network Architecture

• 3D mesh; up to 64K nodes
• No torus => faces used for I/O
• Bidirectional channels
  – channels can be turned around on alternate cycles
  – 9 bits data + 6 bits control => 9-bit phit
  – 2 phits per flit (granularity of flow control)
  – each channel 288 Mbps
  – bisection bandwidth (1024 nodes): 18.4 Gbps
• Synchronous routers
  – clocked at 2X the processor clock
  – => 9-bit phit per 62.5 ns
  – messages route at 1 cycle per hop
Dedicated Message Processing Without Specialized Hardware

[Figure: two nodes, each with memory, NI, a compute processor (P, user) and a message processor (MP, system), connected by the network]

• Msg processor performs arbitrary output processing (at system level)
• Msg processor interprets incoming network transactions (in system)
• User processor and msg processor share memory
• Msg processor communicates with the remote msg processor via system network transactions
Levels of Network Transaction

[Figure: same two-node organization — memory, NI, user processor P, and system message processor MP on each node]

• User processor stores cmd / msg / data into a shared output queue (sketched below)
  – must still check for output queue full (or grow it dynamically)
• Communication assists make the transaction happen
  – checking, translation, scheduling, transport, interpretation
• Avoids system call overhead
• Multiple bus crossings are the likely bottleneck
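A sketch of the user-side enqueue with the "queue full" check; the ring-buffer layout and field names are assumptions about what the shared queue might look like:

```c
/* Hedged sketch of the user-side "store command into shared output queue"
 * step.  On a real machine the message processor polls this queue and
 * performs checking/translation/transport; here only the producer side is shown. */
#include <stdint.h>

#define QSIZE 256   /* entries; power of two for cheap wrap-around */

typedef struct {
    uint32_t dest;      /* destination node              */
    uint32_t cmd;       /* operation requested of the MP */
    uint32_t data[4];   /* small inline payload          */
} queue_entry_t;

typedef struct {
    volatile uint32_t head;        /* advanced by the message processor */
    volatile uint32_t tail;        /* advanced by the user processor    */
    queue_entry_t     entry[QSIZE];
} out_queue_t;

/* Returns 0 if the queue is full (caller must retry or back off). */
int user_post(out_queue_t *q, const queue_entry_t *e)
{
    uint32_t tail = q->tail;
    if (((tail + 1) & (QSIZE - 1)) == q->head)
        return 0;                       /* full: cannot overwrite unconsumed work */
    q->entry[tail] = *e;                /* fill the slot first...                 */
    q->tail = (tail + 1) & (QSIZE - 1); /* ...then publish it to the MP           */
    return 1;
}
```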
Example: Intel Paragon

[Figure: service network and I/O nodes with devices; compute node with two i860 XP processors (50 MHz, 16 KB 4-way cache, 32 B blocks, MESI) — one compute processor P and one message processor MP — sharing memory over a 64-bit, 400 MB/s bus, with send/receive DMA (sDMA/rDMA) and a 2048 B NI onto 175 MB/s duplex network links; packet format: route, MP handler, variable data, EOP]
Dedicated MP w/ Specialized NI: Meiko CS-2

• Integrates the message processor into the network interface
  – active-messages-like capability
  – dedicated threads for DMA, reply handling, simple remote memory access
  – supports user-level virtual DMA
    » own page table
    » can take a page fault, signal the OS, restart
    » meanwhile, nack the other node
• Problem: the processor is slow and time-slices threads
  – fundamental issue with building your own CPU
Myricom Myrinet (Berkeley NOW)

• Programmable network interface on the I/O bus (Sun SBus or PCI)
  – embedded custom CPU ("LANai", ~40 MHz RISC)
  – 256 KB SRAM
  – 3 DMA engines: to network, from network, to/from host memory
• Downloadable firmware executes in kernel mode
  – includes source-based routing protocol
• SRAM pages can be mapped into user space
  – separate pages for separate processes
  – firmware can define status words, queues, etc.
    » data for short messages or pointers for long ones
    » firmware can do address translation too… w/ OS help
  – poll to check for sends from user
• Bottom line: I/O bus is still the bottleneck; CPU could be faster
Shared Physical Address Space

• Implement the SAS model in hardware w/o caching
  – actual caching must be done by copying from remote memory to local
  – programming paradigm looks more like message passing than Pthreads
    » yet, low-latency & low-overhead transfers thanks to HW interpretation; high bandwidth too if done right
    » result: great platform for MPI & compiled data-parallel codes
• Implementation (see the sketch below):
  – "pseudo-memory" acts as a memory controller for remote memory and converts accesses into network transactions (requests)
  – "pseudo-CPU" on the remote node receives requests, performs them on local memory, and sends replies
  – split-transaction or retry-capable bus required (or dual-ported memory)
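A software model of the pseudo-memory / pseudo-CPU pair, with an assumed transaction format and net_send/net_recv helpers:

```c
/* Hedged sketch of the "pseudo-memory / pseudo-CPU" pair described above.
 * Message formats, helper names, and field layouts are illustrative
 * assumptions, not a real machine's protocol. */
#include <stdint.h>

typedef struct {
    uint8_t  op;        /* 0 = read request, 1 = write request, 2 = reply */
    uint32_t src_node;  /* who to reply to                                 */
    uint64_t addr;      /* physical address on the home node               */
    uint64_t data;      /* write data, or read data in a reply             */
} net_txn_t;

extern void net_send(uint32_t dest_node, const net_txn_t *t);  /* assumed           */
extern void net_recv(net_txn_t *t);                            /* assumed, blocking */
extern uint32_t my_node;

/* Pseudo-memory: a local bus access to a remote address becomes a request. */
uint64_t remote_read(uint32_t home_node, uint64_t addr)
{
    net_txn_t req = { .op = 0, .src_node = my_node, .addr = addr };
    net_send(home_node, &req);
    net_txn_t reply;
    net_recv(&reply);      /* bus must not be held while waiting (split transaction) */
    return reply.data;
}

/* Pseudo-CPU: services incoming requests against local memory forever. */
void pseudo_cpu_loop(void)
{
    for (;;) {
        net_txn_t req, reply;
        net_recv(&req);
        volatile uint64_t *p = (volatile uint64_t *)(uintptr_t)req.addr;
        if (req.op == 0) {                       /* read: fetch and reply   */
            reply = (net_txn_t){ .op = 2, .src_node = my_node,
                                 .addr = req.addr, .data = *p };
            net_send(req.src_node, &reply);
        } else if (req.op == 1) {                /* write: update local mem */
            *p = req.data;
        }
    }
}
```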
Case Study: Cray T3D

• Processing element nodes
• 3D torus interconnect
• Wormhole routing
• PE numbering
• Local memory
• Support circuitry
• Prefetch
• Messaging
• Barrier synch
• Fetch and increment
• Block transfers
Processing Element Nodes

• Two processors per node
• Shared block transfer engine
  – DMA-like transfer of large blocks of data
• Shared network interface/router
• Synchronous 6.67 ns clock
Communication Links

• Signals:
  – data: 16 bits
  – channel control: 4 bits
    » request/response, virtual channel buffer
  – channel acknowledge: 4 bits
    » virtual channel buffer status
Routing

• Dimension-order routing
  – may go in either + or - direction along a dimension
• Virtual channels
  – four virtual channel buffers per physical channel
  – => two request channels, two response channels
• Deadlock avoidance
  – in each dimension, specify a "dateline" link
  – packets that do not cross the dateline use virtual channel 0
  – packets that cross the dateline use virtual channel 1
Packets

• Size: 8 physical units (phits)
  – 16 bits per phit
• Header:
  – routing info
  – destination processor
  – control info
  – source processor
  – memory address
• Read request
  – header: 6 phits
• Read response
  – header: 6 phits
  – body: 4 phits (1 word) or 16 phits (4 words)
Processing Nodes

• Processor: Alpha 21064
• Support circuitry:
  – address interpretation
  – reads and writes (local/non-local)
  – data prefetch
  – messaging
  – barrier synch
  – fetch and incr.
  – status
Address Interpretation

• T3D needs:
  – 64 MB memory per node => 26 bits
  – up to 2048 nodes => 11 bits
• Alpha 21064 provides:
  – 43-bit virtual address
  – 32-bit physical address (+2 bits for memory-mapped devices)
• => Annex registers in the DTB (see the sketch below)
  – external to the Alpha
  – map 32-bit address onto 48-bit node + 27-bit address
  – 32 annex registers
  – annex registers also contain function info, e.g. cached / non-cached accesses
  – DTB modified via load-locked / store-conditional instructions
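A hedged sketch of how an annex-register lookup could extend a 32-bit physical address; field widths and the helper name are assumptions chosen to match the counts above (32 annex registers, 11-bit PE number, 26-bit local offset):

```c
/* Hedged sketch of DTB-annex-style address extension; not the actual
 * T3D bit layout. */
#include <stdint.h>

typedef struct {
    uint16_t pe;          /* target processing element (up to 2048 => 11 bits) */
    uint8_t  cacheable;   /* function info, e.g. cached vs. non-cached access  */
} annex_reg_t;

static annex_reg_t annex[32];   /* external to the Alpha; set up ahead of time */

typedef struct {
    uint16_t pe;
    uint32_t offset;
    uint8_t  cacheable;
} remote_addr_t;

/* The high bits of the 32-bit physical address select an annex register;
 * the low 26 bits are the offset within the remote node's 64 MB memory. */
remote_addr_t make_remote_address(uint32_t phys_addr)
{
    remote_addr_t r;
    uint32_t annex_index = (phys_addr >> 26) & 0x1F;   /* assumed: 5 bits pick 1 of 32 */
    r.pe        = annex[annex_index].pe;
    r.cacheable = annex[annex_index].cacheable;
    r.offset    = phys_addr & ((1u << 26) - 1);        /* 26-bit local offset */
    return r;
}
```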
Reads and Writes

• Reads
  – note: no hardware cache coherence
  – may be cacheable or non-cacheable
  – two types of non-cacheable:
    » normal and atomic swap
    » swap uses the "swaperand" register
  – three types of cacheable reads:
    » normal, atomic swap, cached read-ahead
    » cached read-ahead buffers and pre-reads 4-word blocks of data
• Writes
  – Alpha uses a write-through, read-allocate cache
  – write counter: to help with consistency
    » increments on write request
    » decrements on ack
Data Prefetch

• Architected prefetch queue
  – 1 word wide by 16 deep
• Prefetch instruction:
  – Alpha prefetch hint => T3D prefetch
• Performance
  – allows multiple outstanding read requests
  – (normal 21064 reads are blocking)
Messaging

• Message queues
  – 256 KB of reserved space in local memory
  – => 4080 message packets + 16 overflow locations
• Sending processor:
  – uses Alpha PAL code
  – builds message packets of 4 words
  – plus the address of the receiving node
• Receiving node
  – stores the message
  – interrupts the processor
  – processor sends an ack
  – processor may execute a routine at the address provided by the message (active messages)
  – if the message queue is full, a NACK is sent
  – also, error messages may be generated by the support circuitry
Barrier Synchronization

• For barrier or eureka
• Hardware implementation
  – hierarchical tree
  – bypasses in the tree to limit its scope
  – masks for barrier bits to further limit scope
  – interrupting or non-interrupting
Fetch and Increment

• Special hardware registers (see the sketch below)
  – 2 per node
  – user accessible
  – used for auto-scheduling tasks (often loop iterations)
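A sketch of the auto-scheduling use case: each processor claims loop iterations with fetch-and-increment. The hardware register is modeled here with a C11 atomic so the sketch runs on an ordinary machine:

```c
/* Hedged sketch of self-scheduling loop iterations with fetch-and-increment. */
#include <stdatomic.h>

static atomic_long shared_counter;          /* stand-in for the hardware F&I register */

static long fetch_and_inc(void)
{
    return atomic_fetch_add(&shared_counter, 1);
}

/* Each processor repeatedly grabs the next unclaimed iteration. */
void self_scheduled_loop(long n_iters, void (*body)(long))
{
    for (;;) {
        long i = fetch_and_inc();           /* atomically claim one iteration */
        if (i >= n_iters)
            break;                          /* all iterations handed out */
        body(i);
    }
}
```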
Block Transfer

• Special block transfer engine
  – does DMA transfers
  – can gather/scatter among processing elements
  – up to 64K packets
  – 1 or 4 words per packet
• Types of transfers
  – constant-stride read/write
  – gather/scatter
• Requires a system call
  – for memory protection => big overhead
Cray T3E

• T3D post mortem
• T3E overview
• Global communication
• Synchronization
• Message passing
• Kernel performance
T3D Post Mortem

• Very high performance interconnect
  – 3D torus worthwhile
• Barrier network "overengineered"
  – barrier synch not a significant fraction of runtime
• Prefetch queue useful; should be more of them
• Block transfer engine not very useful
  – high overhead to set up
  – yet another way to access memory
• DTB Annex difficult to use in general
  – one entry might have sufficed
  – every processor must have the same mapping for a physical page
T3E Overview

• Alpha 21164 processors
• Up to 2 GB per node
• Caches
  – 8K L1 and 96K L2 on-chip
  – supports 2 outstanding 64-byte line fills
  – stream buffers to enhance the cache
  – only local memory is cached
  – => hardware cache coherence straightforward
• 512 (user) + 128 (system) E-registers for communication/synchronization
• One router chip per processor
T3E Overview, contd.

• Clock:
  – system (i.e., shell) logic at 75 MHz
  – processor at some multiple (initially 300 MHz)
• 3D torus interconnect
  – bidirectional links
  – adaptive multiple-path routing
  – links run at 300 MHz
Global Communication: E-registers

• Extend the address space
• Implement a "centrifuging" function
• Memory-mapped (in I/O space)
• Operations:
  – loads/stores between E-registers and processor registers
  – global E-register operations
    » transfer data to/from global memory
    » messaging
    » synchronization
Global Address Translation

• E-reg block holds base and mask
  – previously stored there as part of setup
• Remote memory access (memory-mapped store):
  – data bits: E-reg pointer (8) + address index (50)
  – address bits: command + src/dest E-reg
Address Translation, contd.

• Translation steps (see the sketch below)
  – address index centrifuged with mask => virtual address + virtual PE
  – offset added to base => vseg + segment offset
  – vseg translated => gseg + base PE
  – base PE + virtual PE => logical PE
  – logical PE through lookup table => physical routing tag
• GVA: gseg (6) + offset (32)
  – goes out over the network to the physical PE
    » at the remote node, global translation => physical address
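A C sketch of these steps; the centrifuge follows the mask-separation idea above, while the table shapes and field splits are invented for illustration:

```c
/* Hedged sketch of the T3E-style translation steps listed above.  The
 * centrifuge separates the address-index bits selected by the mask (virtual
 * PE bits) from the rest (virtual address bits) and compacts each group;
 * the later steps are shown as simple table lookups with assumed shapes. */
#include <stdint.h>

/* Compact the bits of x selected by mask into *hi, the rest into *lo. */
static void centrifuge(uint64_t x, uint64_t mask, uint64_t *hi, uint64_t *lo)
{
    uint64_t h = 0, l = 0;
    int hb = 0, lb = 0;
    for (int b = 0; b < 64; b++) {
        uint64_t bit = (x >> b) & 1;
        if ((mask >> b) & 1) h |= bit << hb++;
        else                 l |= bit << lb++;
    }
    *hi = h;  /* e.g., virtual PE      */
    *lo = l;  /* e.g., virtual address */
}

/* Illustrative tables; sizes and contents are assumptions. */
typedef struct { uint32_t gseg; uint32_t base_pe; } vseg_entry_t;
extern vseg_entry_t vseg_table[64];
extern uint16_t logical_to_physical[2048];   /* logical PE -> physical routing tag */

uint64_t translate(uint64_t addr_index, uint64_t mask, uint64_t ereg_base)
{
    uint64_t vpe, vaddr;
    centrifuge(addr_index, mask, &vpe, &vaddr);        /* step 1 */
    uint64_t va     = vaddr + ereg_base;               /* step 2: vseg + offset */
    uint32_t vseg   = (uint32_t)(va >> 32) & 0x3F;     /* assumed field split   */
    uint32_t offset = (uint32_t)va;
    uint32_t gseg    = vseg_table[vseg].gseg;                    /* step 3 */
    uint32_t log_pe  = vseg_table[vseg].base_pe + (uint32_t)vpe; /* step 4 */
    uint16_t routing = logical_to_physical[log_pe];              /* step 5 */
    (void)routing;   /* the routing tag accompanies the GVA onto the network */
    /* GVA = gseg(6) + offset(32), sent to the physical PE */
    return ((uint64_t)gseg << 32) | offset;
}
```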
Global Communication: Gets and Puts

• Get: global read
  – word (32 or 64 bits)
  – vector (8 words, with stride)
  – stride held in the E-reg block used for the access
• Put: global store
• Full/empty synchronization on E-regs
• Gets/puts are pipelined
  – up to 480 MB/sec transfer rate between nodes
Synchronization

• Atomic ops between E-regs and memory
  – fetch & inc
  – fetch & add
  – compare & swap
  – masked swap
• Barrier/eureka synchronization
  – 32 BSUs per processor
  – accessible as memory-mapped registers
  – BSUs have states and operations
  – state transition diagrams
  – barrier/eureka trees are embedded in the 3D torus
  – use the highest-priority virtual channels
Message Passing

• Message queues
  – arbitrary number
  – created in user or system mode
  – mapped into the memory space
  – up to 128 MBytes
• Message queue control word in memory
  – tail pointer
  – limit value
  – threshold triggers interrupt
  – signal bit set when interrupt is triggered
• Message send
  – from a block of 8 E-regs
  – send command, similar to a put
• Queue head management done in software
  – a swap can manage the queue in two segments
T3E Summary

• Messaging is done well...
  – within the constraints of a COTS processor
• Relies more on high communication and memory bandwidth than on caching
  – => lower perf on dynamic, irregular codes
  – => higher perf on memory-intensive codes with large communication
• Centrifuge: probably unique
• Barrier synch uses dedicated hardware but NOT a dedicated network
DEC Memory Channel (Princeton SHRIMP)

[Figure: sender's virtual page maps to a physical transmit region; writes are reflected into a physical receive region mapped into the receiver's virtual address space]

• Reflective memory
• Writes on the sender appear in the receiver's memory (sketched below)
  – send & receive regions
  – page control table
• Receive region is pinned in memory
• Requires duplicate writes; really just message buffers
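A sketch of reflective-memory style send/receive; the mapping calls and the trailing flag-word convention are assumptions, not the Memory Channel API:

```c
/* Hedged sketch of "reflective memory" use: the sender writes both its own
 * copy and the mapped transmit region, and the write shows up in the
 * receiver's pinned receive region. */
#include <stdint.h>
#include <stddef.h>

extern volatile uint64_t *map_tx_region(int channel, size_t words); /* assumed setup */
extern volatile uint64_t *map_rx_region(int channel, size_t words); /* assumed setup */

/* Sender: every store to tx[] is propagated to the receiver's memory.
 * The "duplicate write" the slide mentions: keep a local copy too, since
 * the transmit mapping is typically write-only. */
void send_record(volatile uint64_t *tx, uint64_t *local_copy,
                 const uint64_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        local_copy[i] = src[i];   /* local copy for the sender's own reads */
        tx[i] = src[i];           /* reflected into the receiver's region  */
    }
    tx[n] = 1;                    /* assumed convention: flag word signals "done" */
}

/* Receiver: spin on the flag word, then read the data out of its own memory. */
void recv_record(volatile uint64_t *rx, uint64_t *dst, size_t n)
{
    while (rx[n] == 0)
        ;                         /* wait for the sender's flag write to arrive */
    for (size_t i = 0; i < n; i++)
        dst[i] = rx[i];
}
```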
Performance of Distributed Memory Machines

• Microbenchmarking
• One-way latency of a small (five-word) message
  – echo test
  – round-trip time divided by 2 (sketched below)
• Shared memory remote read
• Message-passing operations
  – see text
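A sketch of the echo test; MPI is used here only as a portable harness (the machines in the following figures were measured with their native interfaces):

```c
/* Hedged sketch of the echo (ping-pong) microbenchmark: one-way latency is
 * estimated as half the average round-trip time of a small message. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, iters = 10000;
    long msg[5] = {0};                       /* "five-word" message */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {                     /* ping, then wait for the echo */
            MPI_Send(msg, 5, MPI_LONG, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(msg, 5, MPI_LONG, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {              /* echo it straight back */
            MPI_Recv(msg, 5, MPI_LONG, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(msg, 5, MPI_LONG, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency ~ %.2f us\n",
               (t1 - t0) / iters / 2 * 1e6);  /* round trip / 2 */
    MPI_Finalize();
    return 0;
}
```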
Network Transaction Performance

[Figure 7.31: bar chart of microseconds (0-14) for CM-5, Paragon, CS-2, NOW, and T3D, broken down into gap, latency (L), receive overhead (Or), and send overhead (Os)]
Remote Read Performance

[Figure 7.32: bar chart of microseconds (0-25) for CM-5, Paragon, CS-2, NOW, and T3D, broken down into gap, latency (L), and issue time]
Summary of Distributed Memory Machines

• Convergence of architectures
  – everything "looks basically the same"
  – processor, cache, memory, communication assist
• Communication assist
  – where is it? (I/O bus, memory bus, processor registers)
  – what does it know?
    » does it just move bytes, or does it perform some functions?
  – is it programmable?
  – does it run user code?
• Network transaction
  – input & output buffering
  – action on the remote node