End-to-end Modeling and Optimization of Power Consumption in HPC Interconnects
HUCAA 2016, August 16th, Philadelphia
Sébastien Rumley, Robert Polster, and Keren Bergman
Lightwave Research Lab, Columbia University, New York, NY, USA
Simon D. Hammond and Arun Rodrigues Sandia National Laboratories Albuquerque, NM, USA
Rev PA1
Context
• Power consumption of the largest supercomputers is slowly reaching the "pain point"
  – Power efficiency is growing more slowly than compute power
  – Megawatt-scale power consumption is the norm
    • Several instances above 10 MW (Tianhe-2, K computer)
• Pain point?
  – The Department of Energy places it at 20 MW
  – Point of comparison: a nuclear reactor produces ~500 MW
→ New rule for next-generation supercomputers (in particular, Exascale):
  – Scaling in compute power = scaling in power efficiency
  – Keep power consumption (at least) constant
Living in a power-constrained world
• Need to know:
  – Who the big consumers are
  – How much they consume
  – How consumption evolves with scale and structure
[Wallace et al., HPPAC 2013]
• In this presentation: focus on the interconnect
  – How the number of network elements scales with system size
  – Models for network element power consumption
  – Exploration of various designs, analysis of the most power-efficient ones
Interconnect model
[Figure: N compute nodes in the system. Each node attaches to a router via a short-distance NR link through electrical transceivers; routers are interconnected by short-distance RR links and by long-distance RR links using optical transceivers. NR: node to router; RR: router to router]
First aspect: topology and structure
• There is a large variety of topologies. But:
  – The interconnect can be expected to be "balanced"
    • Symmetry among nodes, no obvious bottleneck
  – The topology can be expected to have a low average distance Δ
    • Low Δ guarantees low latency, in particular for collectives
• A direct-topology interconnect can be reduced to three "shaping" parameters:
  1. Verbosity factor ν (byte/flop)
    • Establishes how much bandwidth is present, normalized by node compute power
  2. Concentration factor C
    • Defines how "big" the switches are (switchlets vs. large hubs)
  3. Internal bandwidth factor κ
    • Ratio between the bandwidth provided by internal links (RR: router-router) and external links (NR: node-router)
Interconnect model
[Figure: C nodes per router. The bandwidth of NR links is defined by ν (relative to node compute power); the bandwidth of RR links is defined by κ (relative to NR links)]
Balanced topologies
• Suppose uniform traffic, emitted at maximum injection rate
• What is the total instantaneous traffic?
  N · ν · Π_node · Δ
  (N nodes, each injecting at ν · Π_node, the injection bandwidth of one compute node; Δ is the average number of hops in the topology)
• And how much bandwidth do we have?
  S · R · κ = (N/C) · R · κ
  (S = N/C switches; R connections to other switches per switch; κ: RR bandwidth factor)
• The available bandwidth must be larger than, but close to, the traffic for the traffic to be adequately supported:
  (N · ν · Π_node · Δ) / ((N/C) · R · κ) ≤ 1
• Therefore:
  Δ ≤ R·κ / (C · ν · Π_node)
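The balance condition above can be checked numerically. A minimal sketch (function names and the example parameter values in the test are our own illustrative assumptions, not from the talk):

```python
def max_supported_distance(R, kappa, C, nu, pi_node):
    """Upper bound on the average hop count Delta for a balanced
    interconnect: Delta <= R * kappa / (C * nu * pi_node)."""
    return (R * kappa) / (C * nu * pi_node)

def is_balanced(N, delta, R, kappa, C, nu, pi_node):
    """Total instantaneous traffic N * nu * pi_node * delta must not
    exceed the total RR bandwidth (N / C) * R * kappa."""
    traffic = N * nu * pi_node * delta
    bandwidth = (N / C) * R * kappa
    return traffic <= bandwidth
```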
Switch RR connectivity related to Δ
• In an ideal topology, R switches are reached at distance 1, R(R−1) at distance 2, R(R−1)² at distance 3, and so on:
  Δ_ideal(R) = [R + 2·R(R−1) + 3·R(R−1)² + … + D·x] / (N − 1)   (1)
  (x: switches reached at the final distance D)
• Balance condition:
  Δ ≤ R·κ / (C · ν · Π_node)   (2)
[Figure: Δ_ideal(R) versus connectivity factor R (5 to 25), for N = 1,000, 10,000, and 100,000 switches]
• Assuming the topology:
  – achieves close to minimal distance, so (1) applies
  – is well balanced, so (2) applies
→ The number of RR links per switch, R, can be determined from the shaping parameters
• From R stem the number of links and the switch radix
[1] S. Rumley et al., "Design Methodology for Optimizing Optical Interconnection Networks in High Performance Systems", ISC-HPC 2015.
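The ideal-distance formula can be evaluated by filling successive distance tiers until all N − 1 other switches are reached. A sketch under the slide's idealized assumption that each hop multiplies the reachable switches by (R − 1):

```python
def ideal_avg_distance(R, N):
    """Average distance Delta_ideal(R) in an idealized topology:
    R switches at distance 1, R*(R-1) at distance 2, etc., with the
    last tier possibly only partially used."""
    remaining = N - 1       # switches to reach (all but the source)
    total_hops = 0
    d, tier = 1, R          # tier: switches available at distance d
    while remaining > 0:
        x = min(tier, remaining)  # switches actually placed at distance d
        total_hops += d * x
        remaining -= x
        tier *= (R - 1)
        d += 1
    return total_hops / (N - 1)
```

For example, with R = 4 and N = 17, the 16 other switches fill distance 1 (4 switches) and distance 2 (12 switches) exactly.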
Second ingredient: power models
• Short-distance electrical transceivers
  – Energy per bit: E = 0.189·B + 1.496 pJ/bit (B: bitrate in Gb/s)
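The linear transceiver model is trivial to encode (the function name is ours; coefficients are the ones fitted in the talk):

```python
def electrical_transceiver_pj_per_bit(bitrate_gbps):
    """Energy per bit (pJ) of a short-distance electrical transceiver
    as a function of the per-lane bitrate B in Gb/s."""
    return 0.189 * bitrate_gbps + 1.496
```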
Energy efficiency of optical links
• Performance-energy trade-offs for point-to-point links
[Figure, left: single channel (low density), assuming 30% laser efficiency: 0.67 pJ/bit. Right: many channels (high density), assuming 10% laser efficiency: 0.4 pJ/bit]
[Bahadori, optical interconnects, 2016] [R. Polster]
Optical links
[Figure: reported energy efficiency (fJ/bit, up to 30,000) versus per-channel bitrate (0-50 Gb/s)]
• ~pJ/bit energy efficiencies reported for a variety of bitrates with VCSEL-based links
• No clear trend! → Retained model: 1 pJ/bit at any rate
• NB: optical (long-distance) RR links have an extra transceiver
• 50% of RR links are optical
Switch power model
• What is the energy consumption (in pJ/bit) of a router chip with r ports, each providing a bandwidth B?
• Assumptions:
  – Number of IO pins limited by PINMAX (here 1280)
  – A lane uses 4 pins (differential signaling in the two directions)
  – Chip thermal dissipation limited (here to 132 W)
  – Modeled consumption accounts for 70% of chip power (the other 30%: overheads)
  – Switch power consumption = IO consumption + switching consumption
• IO consumption:
  1. Find the available lanes per port: Lport = PINMAX / (4·r)
  2. Find the bitrate per lane: Blane = B / Lport
  3. Apply the electrical transceiver model
  4. Obtain the chip-wide IO consumption by multiplying by B × r
• Example: 40 ports at 100 Gb/s
  – Lport = 1280 / (4 × 40) = 8
  – Blane = 100 / 8 = 12.5 Gb/s
  – 12.5 × 0.189 + 1.496 = 3.86 pJ/bit
  – 3.86 × 40 × 100 = 15.44 W
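The four IO steps and the worked example can be reproduced directly. A sketch of the slide's procedure (variable names are ours; the pin budget and transceiver coefficients are the slide's assumptions):

```python
PIN_MAX = 1280  # IO pin budget per chip (slide assumption)

def switch_io_power_watts(ports, port_bw_gbps):
    """Chip-wide IO power: lanes per port from the pin budget,
    bitrate per lane, the electrical transceiver energy model,
    then scaled by the total traffic ports * bandwidth."""
    lanes_per_port = PIN_MAX // (4 * ports)        # 4 pins per lane
    bitrate_per_lane = port_bw_gbps / lanes_per_port
    pj_per_bit = 0.189 * bitrate_per_lane + 1.496  # transceiver model
    total_gbps = ports * port_bw_gbps
    return pj_per_bit * total_gbps / 1000.0        # pJ/bit * Gb/s = mW -> W
```

With 40 ports at 100 Gb/s this reproduces the slide's 15.44 W figure (15.434 W before rounding the pJ/bit value).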
Switch power model (2)
• Switch "core" consumption
  – Based on commercial products
  – Core power obtained by subtracting IO power from total power
    • IO power estimated using the IO model
• Switch core power: P = 8.15·B + 50.68 W (B: bitrate over all ports, in Tb/s)
Switch power model (3)
• 6 Tb/s / 132 W = 22 pJ/bit; with 30% overhead → 31.4 pJ/bit
[Figure: switch energy per bit across configurations, showing power-limited and pin-limited regimes]
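The core model and the overhead-adjusted energy per bit combine into a short calculation. A sketch assuming the 6 Tb/s, 132 W operating point from the slide (function names are ours):

```python
def core_power_watts(total_tbps):
    """Switch core power model fitted to commercial products:
    P = 8.15 * B + 50.68 W, with B the aggregate bitrate in Tb/s."""
    return 8.15 * total_tbps + 50.68

def switch_energy_pj_per_bit(power_watts, throughput_tbps, overhead=0.30):
    """Energy per bit: dissipated power over aggregate throughput
    (W / (Tb/s) = pJ/bit), inflated for the 30% chip overhead."""
    pj_per_bit = power_watts / throughput_tbps
    return pj_per_bit / (1.0 - overhead)
```

At 132 W and 6 Tb/s this gives 22 pJ/bit, or about 31.4 pJ/bit once the 30% overhead is included, matching the slide.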
Result: interconnect-wide power consumption
• Verbosity = 0.01 B/F
[Figure: power breakdown among long-distance RR links, all short-distance links, and switching cores]
• (Packet) routing is by far the dominant power consumer
  – Motivation for developing energy-efficient switching schemes
Role of the concentration factor
• Increasing C also increases the router radix, limiting per-port bandwidth and thus node compute power
[Figure: 20 PF total]
Overall results
Disclaimer
• Modeling interconnect power consumption: this is our first attempt
  – Many aspects missing:
    • Impact of utilization
    • Are we modeling average or peak consumption?
  – Many "half-blind" assumptions
    • One optical link = 2 electrical transceivers: is that fair?
    • Overhead of commercial routers
→ Input and feedback most welcome
→ Especially for modeling the switching core
Conclusions
• Rule of thumb for a 10-100 TF system: 100 pJ/bit
  – 3 router traversals at ~31 pJ/bit each ≈ 95 pJ/bit (diameter-2 topology)
  – ~5-12 pJ/bit for transmission
→ Should be reduced by at least a factor of 4 for Exascale
• Switching takes the lion's share
  – Transceivers are not relevant power-wise
    • Is that different in terms of area? Cost?
  – Paths to improved energy efficiency:
    • Energy-efficient router chips (CMOS technology, microarchitecture)
    • More locality in algorithms, to decrease Δ
    • Bandwidth steering with optical switching [1]
• The interconnect shape can always be adapted to remain energy efficient
  – E.g., voluntarily downgrading the bandwidth of internal links compared to injection ones

[1] K. Wen et al., "Flexfly: Enabling a Reconfigurable Dragonfly Through Silicon Photonics", accepted at SC'16.
Thank you!