An FPGA Platform for Hyperscalers
F. Abel, J. Weerasinghe, C. Hagleitner, B. Weiss, S. Paredes
IBM Research – Zurich, Switzerland
Hot Interconnects 25, Santa Clara, CA, Aug. 29-30, 2017
What is a hyperscaler?
■ Definition
– A cloud operator that runs several hundreds of thousands of servers
– An operator with more than US$[1,4,8] billion in annual revenue* from:
• a service such as IaaS, PaaS or SaaS (e.g., AWS, Rackspace, Google)
• Internet, search, social networking (e.g., Salesforce, ADP, Google)
• e-commerce/payment processing (e.g., Amazon, Alibaba, eBay)
■ Hyperscalers in 2017:
– 24 hyperscale companies** operating 300+ data centers
Sources: * Cisco Global Cloud Index; ** Synergy Research Group
What does a hyperscale server look like?
[Photos: Microsoft's cloud server architecture; Facebook's Yosemite v1 sled with four Mono Lake servers]
Sources: How Microsoft Designs its Cloud-Scale Servers, Microsoft, 2014; Introducing "Yosemite": the first open source modular chassis for high-powered microservers, Facebook, 2015.
Why do we need a new platform?
■ Traditional bus attachment → FPGA as a co-processor
[Figure: CPU server nodes with PCIe-attached FPGAs, either as an in-server card or as a cabled FPGA node]
[Scorecard rating the bus-attached options on server homogeneity, increased server cost & power, performance boost, number of accelerators per server, and workload flexibility and migration – mostly negative marks]
Disaggregation of the FPGA from the server [1]
■ Network attachment → FPGA as a peer processor
[Figure: CPU server nodes and standalone FPGA nodes attached as peers to the DC network through an Ethernet switch]
[Scorecard: server homogeneity (+++), server cost & power (+++), management (++), number of FPGAs per server (+++), performance boost (++), workload flexibility & migration (+), large-scale distributed applications (++/– –)]
[1] J. Weerasinghe et al., "Enabling FPGAs in hyperscale data centers," in 2015 IEEE Int'l Conf. on Cloud and Big Data Computing, Beijing, China, 2015.
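Because each disaggregated FPGA terminates its own Ethernet traffic through an integrated NIC, software can address it like any other network peer. A minimal sketch in Python, assuming a hypothetical FPGA endpoint address and a simple UDP request/response convention (neither is specified in the talk):

```python
import socket

# Hypothetical iNIC endpoint of a network-attached FPGA (placeholder values).
FPGA_ENDPOINT = ("10.12.200.7", 2718)

def offload(payload: bytes, timeout: float = 1.0) -> bytes:
    """Send a request straight to the FPGA over UDP and wait for its reply.
    No host CPU sits in the data path on the FPGA side."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(payload, FPGA_ENDPOINT)
        reply, _ = sock.recvfrom(4096)
        return reply
```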
Standalone network-attached FPGA
1) Replace the PCIe interface with an integrated NIC (iNIC)
2) Turn the FPGA card into a self-contained appliance
3) Replace the transceivers with backplane connectivity
[Figure: FPGA module block diagram – Kintex UltraScale FPGA with an iNIC (~15% of the FPGA), two ×72 DDR4 channels, BPI configuration flash, and a pervasive PSoC (ARM Cortex-M3) handling JTAG, I2C/PMBus power monitoring and USB 2.0 management; a 3M SPD08-series backplane connector carries 10GBASE-KR (×6), PCIe (×8) and SATA (×2)]
One carrier sled = 32 FPGA modules
■ Our first FPGA module uses a Xilinx Kintex UltraScale KU060
– A mid-range FPGA with high performance/price and low wattage
[Figure: carrier sled populated with 2 × 16 FPGA modules and a 640 Gb/s Ethernet switch with ×8 40GbE uplinks]
Two carrier sleds per chassis = 64 FPGAs
[Figurative picture of the chassis]
Legend (per slice):
– ×8 40GbE uplinks (320 Gb/s)
– ×32 10GbE FPGA-to-switch links (320 Gb/s)
– ×32 10GbE redundant links
– ×32 10GbE FPGA-to-FPGA links
– ×16 PCIe ×8 Gen3
– ×1 Service Processor (SP)
Balanced (i.e., no over-subscription) between the north and south links of the Ethernet switch; a quick check follows below.
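The balance claim follows directly from the link counts in the legend above:

```python
# North (uplink) vs. south (FPGA-facing) capacity of the sled's Ethernet switch.
north_gbps = 8 * 40    # x8 40GbE uplinks
south_gbps = 32 * 10   # x32 10GbE FPGA-to-switch links
assert north_gbps == south_gbps == 320   # balanced: no over-subscription
```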
Sixteen chassis per rack = 1024 FPGAs
1024 FPGAs → 2.8M DSPs, 2×10¹⁵ fixed-point multiply-accumulates/s
10 Tb/s bisection bandwidth – 16 TB DDR4 – 40 kW max.
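These rack-level figures follow from per-device numbers. A back-of-the-envelope check, where the 2,760 DSP slices per KU060 come from the Xilinx data sheet and the ~700 MHz sustained MAC clock is an assumption needed to reproduce the 2×10¹⁵ figure:

```python
fpgas = 16 * 64                      # 16 chassis x 64 FPGAs = 1024
dsps = fpgas * 2760                  # ~2.8M DSP48E2 slices
macs_per_s = dsps * 700e6            # 1 MAC/cycle/DSP at ~700 MHz -> ~2.0e15
ddr4_tb = fpgas * 16 / 1024          # 16 GB DDR4 per module -> 16 TB
bisection_tbps = fpgas * 10 / 1000   # one 10GbE link per FPGA -> ~10 Tb/s
print(fpgas, dsps, macs_per_s, ddr4_tb, bisection_tbps)
```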
Sixteen chassis per rack = 1024 FPGAs → ×4 TORs
[Figure: example of a two-tier Clos architecture with 4096 FPGAs (four racks) and four edge switches (TORs)]
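Scaling arithmetic for the pictured example, assuming each rack's 40GbE uplinks are spread evenly across the four TORs (the even spread is an assumption, not stated on the slide):

```python
racks = 4096 // 1024          # four racks of 1024 FPGAs each
uplinks_per_rack = 16 * 8     # 16 chassis x 8 40GbE uplinks = 128
tors = 4                      # edge switches in the example
links_per_tor = racks * uplinks_per_rack // tors   # 128 x 40GbE per TOR
print(racks, uplinks_per_rack, links_per_tor)
```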
Combined passive and active water cooling
[Rendering of a 2U × 19″ chassis with cooling rail]
Packaging technology – Courtesy of DOME project
Passive cooling w/ heat spreader
[Schematic of the board assembly, showing the power notch and dimension callouts of 7.2 mm, 55 mm and 7.6 mm]
Packaging technology – Courtesy of DOME project
Prototype in the lab
Courtesy of DOME project
Software-defined multi-FPGA fabric (SDMFF)
[Figure: hyperscale infrastructure on the data center network, provisioning FPGAs at different scales –
• 1 server + 1 FPGA (e.g., security)
• 1 server + 5 FPGAs (e.g., text processing)
• a few servers + 100 FPGAs (e.g., distributed DNN)
• a few servers + 1000 FPGAs (e.g., HPC)
– image courtesy of Reddit]
Network performance
■ Comparison with bare-metal servers, virtual machines and Linux containers [2]
[Charts: round-trip latency (µs) and throughput (Gb/s) for the UDP, TCP and iWARP transports]
[2] J. Weerasinghe et al., "Disaggregated FPGAs: Network performance comparison against bare-metal servers, virtual machines and Linux containers," in IEEE Int'l Conf. on Cloud Computing Technology and Science, Luxembourg, 2016.
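Round-trip latencies like those reported in [2] are typically gathered with an echo-style microbenchmark. A minimal sketch of such a measurement over UDP (the endpoint name, port, probe size and sample count are placeholders, not values from the paper):

```python
import socket
import time

PEER = ("endpoint-under-test", 5001)   # placeholder host and port
SAMPLES = 1000

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    s.settimeout(1.0)
    rtts = []
    for _ in range(SAMPLES):
        t0 = time.perf_counter()
        s.sendto(b"x" * 64, PEER)      # 64-byte probe
        s.recvfrom(64)                 # wait for the echo
        rtts.append(time.perf_counter() - t0)

print(f"median RTT: {sorted(rtts)[SAMPLES // 2] * 1e6:.1f} us")
```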
SDMFF application
■ Distributed text analytics [3]
[Figure: standard UIMA collection processing pipeline – the master node (MN) runs the Collection Reader, Collection Process Engine and CAS Consumer; slave nodes (SNs) run the Analysis Engines that turn unstructured data into structured data. UIMA: Unstructured Information Management Architecture, https://uima.apache.org/]
[Charts: latency (ms), throughput (char/s) and cost ($) for two slave-node configurations, SN1 and SN2, combining CPUs with PCIe- and network-attached FPGAs]
[3] J. Weerasinghe et al., "Network-attached FPGAs for data center applications," in IEEE International Conference on Field-Programmable Technology (FPT '16), Xi'an, China, 2016.
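Conceptually, the UIMA pipeline is a scatter-gather pattern: the master node fans documents out to Analysis Engines and collects the structured results. A toy Python sketch of that division of labor (function names mirror the UIMA roles; the analysis itself is a stand-in):

```python
from concurrent.futures import ThreadPoolExecutor

def collection_reader(corpus):        # MN: emits raw documents
    yield from corpus

def analysis_engine(doc):             # SN: annotates one document
    return {"text": doc, "tokens": doc.split()}

def cas_consumer(results):            # MN: gathers structured output
    return list(results)

corpus = ["unstructured data in", "a hyperscale data center"]
with ThreadPoolExecutor(max_workers=2) as slave_nodes:  # stand-ins for SNs
    structured = cas_consumer(
        slave_nodes.map(analysis_engine, collection_reader(corpus)))
print(structured)
```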
Compute density – S822LC (aka Minsky) vs FPGA chassis
■ Same 2U chassis and similar power consumption

Minsky:
■ ×4 Tesla P100 w/ NVLink – total performance:
– 42.4 TeraFLOPS single-precision*
– 84.8 TeraFLOPS half-precision*
■ ×2 [POWER8 CPU + 256 GB DRAM]
■ Power consumption ≈ 2.3 kW

cloudFPGA chassis:
■ ×64 Xilinx KU060 + 16 GB DDR4 – total performance:
– 53 TeraFLOPS single-precision**
– 106 TeraFLOPS half-precision**
– 424 TeraOPS fixed-point (INT8)***
■ 1 TB DDR4 – power consumption ≈ 2.5 kW

* http://www.nvidia.com/object/tesla-p100.html
** Computed as: 64 × (#DSPs-per-KU060 × FMAX) / (#DSPs-per-FusedMultiplyAdd)
*** Xilinx, WP487 (v1.0), June 27, 2017 – 8-Bit Dot-Product Acceleration
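One set of values that reproduces the quoted FPGA figures via the ** formula: 2,760 DSP slices per KU060 (from the data sheet); the 600 MHz FMAX and the 2 DSPs per single-precision (1 per half-precision) fused multiply-add are back-solved assumptions, since the slide leaves them implicit:

```python
dsps = 2760                       # DSP48E2 slices per KU060 (data sheet)
fmax = 600e6                      # assumed DSP clock in Hz (back-solved)
sp = 64 * (dsps * fmax) / 2       # assumed 2 DSPs per SP fused multiply-add
hp = 64 * (dsps * fmax) / 1       # assumed 1 DSP per HP fused multiply-add
print(sp / 1e12, hp / 1e12)       # -> 53.0 and 106.0 TeraFLOPS
```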
Summary
■ A platform to deploy FPGAs at large scale in DCs
– Integrates FPGAs at the drawer/chassis layer
– Combines passive and active water cooling
– Provides high density, energy efficiency and reduced costs
• Fits 1000+ FPGAs per DC rack
■ Builds on the disaggregation of FPGAs from the servers
– FPGAs connect to the DC network over 10/40 Gb/s Ethernet
• Key enabler for large-scale deployment of FPGAs in DCs
• FPGAs generate and consume their own networking packets
– FPGA cards become stand-alone resources
• The number of deployed FPGAs becomes independent of the number of servers
• Promotes the use of medium- and low-cost FPGAs
■ Makes FPGAs plentiful in DCs
– Users can rent and link them in any type of topology
Acknowledgments
■ This work was conducted in the context of the joint ASTRON and IBM DOME project and was funded by the Netherlands Organisation for Scientific Research (NWO), the Dutch Ministry of EL&I, and the Province of Drenthe, the Netherlands.
■ Special thanks to Martin Schmatz, Ronald Luijten and Andreas Doering, who initiated this packaging concept for the needs of their DOME microserver project.
Thank you
May the FPGA be with you
https://www.zurich.ibm.com/cci/cloudFPGA/
Backup
The baseboard carrier at scale
Disaggregated switch module
From the Intel Seacliff Trail reference system (48 × 10GbE + 4 × 40GbE; 7,938 cm³**) to the SM6000 switch module (32 × 10GbE + 8 × 40GbE; 378 cm³*) – about 1/21 of the volume.
* 14 × 6 × 4.5 cm
** 41 × 44 × 4.4 cm
Other cloudFPGA modules
[Photos: NVMe module, USB hub module, service processor (T4240), power converter]
×4 Tesla P100 – Air-cooled
×4 Tesla P100 – Water-cooled
Minsky chassis vs cloudFPGA chassis
http://web.archive.org/web/20161016071547/https://www.xilinx.com/support/documentation/ip_documentation/ru/floating-point.html#kintexu
Related Work – Other large-scale FPGA deployments
[Figure: FPGAs per rack across deployments – this work: 1024; Amazon EC2 F1 shown for comparison]