An FPGA Platform for Hyperscalers
F. Abel, J. Weerasinghe, C. Hagleitner, B. Weiss, S. Paredes
IBM Research – Zurich, Switzerland
Hot Interconnects 25, Santa Clara, CA, Aug. 29-30, 2017
What is a hyperscaler?
■ Definition
– A cloud operator that runs several hundreds of thousands of servers
– An operator with more than US$[1,4,8] billion in annual revenue* from:
• a service such as IaaS, PaaS or SaaS (e.g., AWS, Rackspace, Google)
• Internet, search, social networking (e.g., Salesforce, ADP, Google)
• e-commerce/payment processing (e.g., Amazon, Alibaba, eBay)
■ Hyperscalers in 2017:
– 24 hyperscale companies** operating 300+ data centers
Sources: * Cisco Global Cloud Index; ** Synergy Research Group
What does a hyperscale server look like?
[Photos: Microsoft's cloud server architecture; Facebook's Yosemite v1 sled with four Mono Lake servers]
Sources: How Microsoft Designs its Cloud-Scale Servers, Microsoft, 2014; Introducing "Yosemite": the first open source modular chassis for high-powered microservers, Facebook, 2015.
Why do we need a new platform?
■ Traditional bus attachment → FPGA as a co-processor
[Figure: CPU server nodes with PCIe-attached FPGAs, either as an in-server card or as a cabled FPGA node]
[Scorecard rating the bus-attached options on server homogeneity, increased server cost & power, performance boost, number of accelerators per server, and workload flexibility and migration – mostly negative marks]
Disaggregation of the FPGA from the server [1]
■ Network attachment → FPGA as a peer processor
[Figure: CPU server nodes and standalone FPGA nodes attached as peers to the DC network through an Ethernet switch]
[Scorecard: server homogeneity (+++), server cost & power (+++), management (++), number of FPGAs per server (+++), performance boost (++), workload flexibility & migration (+), large-scale distributed applications (++/– –)]
[1] J. Weerasinghe et al., "Enabling FPGAs in hyperscale data centers," in 2015 IEEE Int'l Conf. on Cloud and Big Data Computing, Beijing, China, 2015.
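Because each disaggregated FPGA terminates its own Ethernet traffic through an integrated NIC, software can address it like any other network peer. A minimal sketch in Python, assuming a hypothetical FPGA endpoint address and a simple UDP request/response convention (neither is specified in the talk):

```python
import socket

# Hypothetical iNIC endpoint of a network-attached FPGA (placeholder values).
FPGA_ENDPOINT = ("10.12.200.7", 2718)

def offload(payload: bytes, timeout: float = 1.0) -> bytes:
    """Send a request straight to the FPGA over UDP and wait for its reply.
    No host CPU sits in the data path on the FPGA side."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(payload, FPGA_ENDPOINT)
        reply, _ = sock.recvfrom(4096)
        return reply
```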
Standalone network-attached FPGA
1) Replace the PCIe interface with an integrated NIC (iNIC)
2) Turn the FPGA card into a self-contained appliance
3) Replace the transceivers with backplane connectivity
[Figure: FPGA module block diagram – Kintex UltraScale FPGA with an iNIC (~15% of the FPGA), two ×72 DDR4 channels, BPI configuration flash, and a pervasive PSoC (ARM Cortex-M3) handling JTAG, I2C/PMBus power monitoring and USB 2.0 management; a 3M SPD08-series backplane connector carries 10GBASE-KR (×6), PCIe (×8) and SATA (×2)]
One carrier sled = 32 FPGA modules
■ Our first FPGA module uses a Xilinx Kintex UltraScale KU060
– A mid-range FPGA with high performance/price and low wattage
[Figure: carrier sled populated with 2 × 16 FPGA modules and a 640 Gb/s Ethernet switch with ×8 40GbE uplinks]
Two carrier sleds per chassis = 64 FPGAs
[Figurative picture of the chassis]
Legend (per slice):
– ×8 40GbE uplinks (320 Gb/s)
– ×32 10GbE FPGA-to-switch links (320 Gb/s)
– ×32 10GbE redundant links
– ×32 10GbE FPGA-to-FPGA links
– ×16 PCIe ×8 Gen3
– ×1 Service Processor (SP)
Balanced (i.e., no over-subscription) between the north and south links of the Ethernet switch; a quick check follows below.
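The balance claim follows directly from the link counts in the legend above:

```python
# North (uplink) vs. south (FPGA-facing) capacity of the sled's Ethernet switch.
north_gbps = 8 * 40    # x8 40GbE uplinks
south_gbps = 32 * 10   # x32 10GbE FPGA-to-switch links
assert north_gbps == south_gbps == 320   # balanced: no over-subscription
```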
Sixteen chassis per rack = 1024 FPGAs
1024 FPGAs → 2.8M DSPs, 2×10¹⁵ fixed-point multiply-accumulates/s
10 Tb/s bisection bandwidth – 16 TB DDR4 – 40 kW max.
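These rack-level figures follow from per-device numbers. A back-of-the-envelope check, where the 2,760 DSP slices per KU060 come from the Xilinx data sheet and the ~700 MHz sustained MAC clock is an assumption needed to reproduce the 2×10¹⁵ figure:

```python
fpgas = 16 * 64                      # 16 chassis x 64 FPGAs = 1024
dsps = fpgas * 2760                  # ~2.8M DSP48E2 slices
macs_per_s = dsps * 700e6            # 1 MAC/cycle/DSP at ~700 MHz -> ~2.0e15
ddr4_tb = fpgas * 16 / 1024          # 16 GB DDR4 per module -> 16 TB
bisection_tbps = fpgas * 10 / 1000   # one 10GbE link per FPGA -> ~10 Tb/s
print(fpgas, dsps, macs_per_s, ddr4_tb, bisection_tbps)
```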
Sixteen chassis per rack = 1024 FPGAs → ×4 TORs
[Figure: example of a two-tier Clos architecture with 4096 FPGAs (four racks) and four edge switches (TORs)]
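Scaling arithmetic for the pictured example, assuming each rack's 40GbE uplinks are spread evenly across the four TORs (the even spread is an assumption, not stated on the slide):

```python
racks = 4096 // 1024          # four racks of 1024 FPGAs each
uplinks_per_rack = 16 * 8     # 16 chassis x 8 40GbE uplinks = 128
tors = 4                      # edge switches in the example
links_per_tor = racks * uplinks_per_rack // tors   # 128 x 40GbE per TOR
print(racks, uplinks_per_rack, links_per_tor)
```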
Combined passive and active water cooling
[Rendering of a 2U × 19″ chassis with cooling rail]
Packaging technology – Courtesy of DOME project
Passive cooling w/ heat spreader
[Schematic of the board assembly, showing the power notch and dimension callouts of 7.2 mm, 55 mm and 7.6 mm]
Packaging technology – Courtesy of DOME project
Prototype in the lab
Courtesy of DOME project
Software-defined multi-FPGA fabric (SDMFF)
[Figure: hyperscale infrastructure on the data center network, provisioning FPGAs at different scales –
• 1 server + 1 FPGA (e.g., security)
• 1 server + 5 FPGAs (e.g., text processing)
• a few servers + 100 FPGAs (e.g., distributed DNN)
• a few servers + 1000 FPGAs (e.g., HPC)
– image courtesy of Reddit]
Network performance
■ Comparison with bare-metal servers, virtual machines and Linux containers [2]
[Charts: round-trip latency (µs) and throughput (Gb/s) for the UDP, TCP and iWARP transports]
[2] J. Weerasinghe et al., "Disaggregated FPGAs: Network performance comparison against bare-metal servers, virtual machines and Linux containers," in IEEE Int'l Conf. on Cloud Computing Technology and Science, Luxembourg, 2016.
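Round-trip latencies like those reported in [2] are typically gathered with an echo-style microbenchmark. A minimal sketch of such a measurement over UDP (the endpoint name, port, probe size and sample count are placeholders, not values from the paper):

```python
import socket
import time

PEER = ("endpoint-under-test", 5001)   # placeholder host and port
SAMPLES = 1000

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    s.settimeout(1.0)
    rtts = []
    for _ in range(SAMPLES):
        t0 = time.perf_counter()
        s.sendto(b"x" * 64, PEER)      # 64-byte probe
        s.recvfrom(64)                 # wait for the echo
        rtts.append(time.perf_counter() - t0)

print(f"median RTT: {sorted(rtts)[SAMPLES // 2] * 1e6:.1f} us")
```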
SDMFF application
■ Distributed text analytics [3]
[Figure: standard UIMA collection processing pipeline – the master node (MN) runs the Collection Reader, Collection Process Engine and CAS Consumer; slave nodes (SNs) run the Analysis Engines that turn unstructured data into structured data. UIMA: Unstructured Information Management Architecture, https://uima.apache.org/]
[Charts: latency (ms), throughput (char/s) and cost ($) for two slave-node configurations, SN1 and SN2, combining CPUs with PCIe- and network-attached FPGAs]
[3] J. Weerasinghe et al., "Network-attached FPGAs for data center applications," in IEEE International Conference on Field-Programmable Technology (FPT '16), Xi'an, China, 2016.
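Conceptually, the UIMA pipeline is a scatter-gather pattern: the master node fans documents out to Analysis Engines and collects the structured results. A toy Python sketch of that division of labor (function names mirror the UIMA roles; the analysis itself is a stand-in):

```python
from concurrent.futures import ThreadPoolExecutor

def collection_reader(corpus):        # MN: emits raw documents
    yield from corpus

def analysis_engine(doc):             # SN: annotates one document
    return {"text": doc, "tokens": doc.split()}

def cas_consumer(results):            # MN: gathers structured output
    return list(results)

corpus = ["unstructured data in", "a hyperscale data center"]
with ThreadPoolExecutor(max_workers=2) as slave_nodes:  # stand-ins for SNs
    structured = cas_consumer(
        slave_nodes.map(analysis_engine, collection_reader(corpus)))
print(structured)
```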
Compute density – S822LC (aka Minsky) vs FPGA chassis
■ Same 2U chassis and similar power consumption

Minsky:
■ ×4 Tesla P100 w/ NVLink – total performance:
– 42.4 TeraFLOPS single-precision*
– 84.8 TeraFLOPS half-precision*
■ ×2 [POWER8 CPU + 256 GB DRAM]
■ Power consumption ≈ 2.3 kW

cloudFPGA chassis:
■ ×64 Xilinx KU060 + 16 GB DDR4 – total performance:
– 53 TeraFLOPS single-precision**
– 106 TeraFLOPS half-precision**
– 424 TeraOPS fixed-point (INT8)***
■ 1 TB DDR4 – power consumption ≈ 2.5 kW

* http://www.nvidia.com/object/tesla-p100.html
** Computed as: 64 × (#DSPs-per-KU060 × FMAX) / (#DSPs-per-FusedMultiplyAdd)
*** Xilinx, WP487 (v1.0), June 27, 2017 – 8-Bit Dot-Product Acceleration
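One set of values that reproduces the quoted FPGA figures via the ** formula: 2,760 DSP slices per KU060 (from the data sheet); the 600 MHz FMAX and the 2 DSPs per single-precision (1 per half-precision) fused multiply-add are back-solved assumptions, since the slide leaves them implicit:

```python
dsps = 2760                       # DSP48E2 slices per KU060 (data sheet)
fmax = 600e6                      # assumed DSP clock in Hz (back-solved)
sp = 64 * (dsps * fmax) / 2       # assumed 2 DSPs per SP fused multiply-add
hp = 64 * (dsps * fmax) / 1       # assumed 1 DSP per HP fused multiply-add
print(sp / 1e12, hp / 1e12)       # -> 53.0 and 106.0 TeraFLOPS
```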
Summary
■ A platform to deploy FPGAs at large scale in DCs
– Integrates FPGAs at the drawer/chassis layer
– Combines passive and active water cooling
– Provides high density, energy efficiency and reduced costs
• Fits 1000+ FPGAs per DC rack
■ Builds on the disaggregation of FPGAs from the servers
– FPGAs connect to the DC network over 10/40 Gb/s Ethernet
• Key enabler for large-scale deployment of FPGAs in DCs
• FPGAs generate and consume their own networking packets
– FPGA cards become stand-alone resources
• The number of deployed FPGAs becomes independent of the number of servers
• Promotes the use of medium- and low-cost FPGAs
■ Makes FPGAs plentiful in DCs
– Users can rent and link them in any type of topology
Acknowledgments
■ This work was conducted in the context of the joint ASTRON and IBM DOME project and was funded by the Netherlands Organisation for Scientific Research (NWO), the Dutch Ministry of EL&I, and the Province of Drenthe, the Netherlands.
■ Special thanks to Martin Schmatz, Ronald Luijten and Andreas Doering, who initiated this packaging concept for the needs of their DOME microserver project.
Thank you
May the FPGA be with you
https://www.zurich.ibm.com/cci/cloudFPGA/
Backup
The baseboard carrier at scale
Disaggregated switch module
From the Intel Seacliff Trail reference system (48 × 10GbE + 4 × 40GbE; 7,938 cm³**) to the SM6000 switch module (32 × 10GbE + 8 × 40GbE; 378 cm³*) – about 1/21 of the volume.
* 14 × 6 × 4.5 cm
** 41 × 44 × 4.4 cm
Other cloudFPGA modules
[Photos: NVMe module, USB hub module, service processor (T4240), power converter]
×4 Tesla P100 – Air-cooled
×4 Tesla P100 – Water-cooled
Minsky chassis vs cloudFPGA chassis
http://web.archive.org/web/20161016071547/https://www.xilinx.com/support/documentation/ip_documentation/ru/floating-point.html#kintexu
Related Work – Other large-scale FPGA deployments
[Figure: FPGAs per rack across deployments – this work: 1024; Amazon EC2 F1 shown for comparison]