An FPGA Platform for Hyperscalers

F. Abel, J. Weerasinghe, C. Hagleitner, B. Weiss, S. Paredes
IBM Research – Zurich, Switzerland

Hot Interconnects 25, Santa Clara, CA, Aug. 29-30, 2017

What is a hyperscaler?

■ Definition
  – Cloud operator that runs several hundreds of thousands of servers
  – Operator with more than US$ [1,4,8] billion in annual revenue* from:
    • a service such as IaaS, PaaS or SaaS (e.g., AWS, Rackspace, Google)
    • Internet, search, social networking (e.g., Salesforce, ADP, Google)
    • e-commerce / payment processing (e.g., Amazon, Alibaba, eBay)



■ Hyperscalers in 2017:

– 24 hyperscale companies** operating 300+ data centers


Sources: * Cisco Global Cloud Index; ** Synergy Research Group


Hyperscale server – What does it look like?

Microsoft's cloud server architecture

Facebook's Yosemite V1 sled with four Mono Lake servers
Sources: "How Microsoft Designs its Cloud-Scale Servers," Microsoft, 2014; "Introducing 'Yosemite': the first open source modular chassis for high-powered microservers," Facebook, 2015.


Why do we need a new platform?

■ Traditional bus attachment → FPGA as a co-processor
  – The FPGA sits behind a CPU server node, attached over PCIe (on a plug-in card or in a separate FPGA node)

[Figure: PCIe-attached FPGA variants with a +/– scorecard covering server homogeneity, server cost & power, performance boost, number of accelerators per server, and workload flexibility & migration]

Disaggregation of the FPGA from the server [1]

■ Network attachment → FPGA as a peer processor

[Figure: CPU server nodes and standalone FPGA nodes attached as peers to the DC network through an Ethernet switch]

  +++ Server homogeneity
  +++ Server cost & power
  ++ Management
  +++ Number of FPGAs per server
  ++ Performance boost
  + Workload flexibility & migration
  ++/–– Large-scale distributed applications

[1] J. Weerasinghe et al., "Enabling FPGAs in hyperscale data centers," in 2015 IEEE Int'l Conf. on Cloud and Big Data Computing, Beijing, China, 2015.

Standalone network-attached FPGA

1) Replace the PCIe interface with an integrated NIC (iNIC)
2) Turn the FPGA card into a self-contained appliance
3) Replace the transceivers with backplane connectivity

[Figure: FPGA module block diagram – Kintex UltraScale FPGA with the iNIC (~15%), DDR4 DRAM (2 x72 channels), BPI configuration flash, and a pervasive PSoC (ARM Cortex-M3) handling JTAG, I2C/PMBus monitoring and USB 2.0; a 3M SPD08-series backplane connector carries 10GBASE-KR (x6), PCIe (x8), SATA (x2) and power]
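Because the iNIC terminates the network protocols in the FPGA fabric itself, a network-attached FPGA is addressed like any other endpoint on the DC network. A minimal sketch of what that looks like from a host, assuming a hypothetical FPGA node that echoes UDP datagrams at 10.12.200.5:2718 (address, port and payload format are illustrative only, not a documented interface of this platform):

```python
import socket

FPGA_ADDR = ("10.12.200.5", 2718)          # hypothetical network-attached FPGA node

# Example request: 1-byte opcode followed by a 32-bit big-endian operand.
payload = b"\x01" + (42).to_bytes(4, "big")

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    sock.settimeout(1.0)
    sock.sendto(payload, FPGA_ADDR)        # datagram goes straight to the FPGA's iNIC
    response, _ = sock.recvfrom(2048)      # reply is generated by the FPGA, no host CPU involved
    print("FPGA replied with", len(response), "bytes")
```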


One carrier sled = 32 FPGA modules

■ Our first FPGA module uses a Xilinx Kintex UltraScale KU060
  – A mid-range FPGA with high performance/price and low wattage


One carrier sled = 32 FPGA modules

■ Our first FPGA module uses a Xilinx Kintex UltraScale KU060
  – A mid-range FPGA with high performance/price and low wattage

[Figure: carrier sled with 2 x 16 FPGA modules and a 640 Gb/s Ethernet switch (8 x 40GbE uplinks)]


Two carrier sleds per chassis = 64 FPGAs

[Figurative picture of the chassis]

Legend (per slice):
  x8   40GbE up links (320 Gb/s)
  x32  10GbE FPGA-to-Switch links (320 Gb/s)
  x32  10GbE redundant links
  x32  10GbE FPGA-to-FPGA links
  x16  PCIe x8 Gen3
  x1   SP (Service Processor)

Balanced (i.e., no over-subscription) between the north and south links of the Ethernet switch
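The balance follows directly from the legend's link counts, comparing the switch's north (uplink) and south (FPGA-facing) sides:

$$ 8 \times 40\ \text{Gb/s} \;=\; 320\ \text{Gb/s} \;=\; 32 \times 10\ \text{Gb/s} $$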


Sixteen chassis per rack = 1024 FPGAs

1024 FPGAs → 2.8M DSPs, 2x10^15 fixed-point multiply-accumulates/s
10 Tb/s bisection bandwidth – 16 TB DDR4 – 40 kW max.
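For reference, the DSP count follows from the published per-device figure, and the MAC rate is consistent with it under an assumed per-DSP rate (the slide does not state the clock; ~0.7 GMAC/s per slice, e.g. two INT8 MACs per cycle at ~350 MHz, is an assumption):

$$ 1024 \times 2760\ \text{DSP slices} \approx 2.8 \times 10^{6}\ \text{DSPs} $$
$$ 2.8 \times 10^{6}\ \text{DSPs} \times 0.7 \times 10^{9}\ \tfrac{\text{MAC}}{\text{s} \cdot \text{DSP}} \approx 2 \times 10^{15}\ \tfrac{\text{MAC}}{\text{s}} $$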


Sixteen chassis per rack = 1024 FPGAs

1024 FPGAs → 2.8M DSPs, 2x10^15 fixed-point multiply-accumulates/s
10 Tb/s bisection bandwidth – 16 TB DDR4 – 40 kW max. → x4 TORs

Example of a two-tier Clos architecture with 4096 FPGAs and four edge switches (TORs)


Combined passive and active water cooling

[Figure: cooling rail – rendering of a 2U x 19" chassis]
Packaging technology – Courtesy of DOME project


Passive cooling w/ heat spreader

[Schematic of the board assembly with dimensions (55 mm, 7.2 mm, 7.6 mm) and power notch]
Packaging technology – Courtesy of DOME project


Prototype in the lab

Courtesy of DOME project

Software-defined multi-FPGA fabric (SDMFF)

[Figure: example fabrics carved out of the hyperscale infrastructure over the data center network]
  – E.g., security: 1 server + 1 FPGA
  – E.g., text processing: 1 server + 5 FPGAs
  – E.g., distributed DNN: few servers + 100 FPGAs
  – E.g., HPC: few servers + 1000 FPGAs

(Image courtesy of Reddit)
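The slide does not show a programmatic interface; the sketch below only illustrates what "software-defined" could mean here – a resource manager that hands out network-attached FPGAs and wires them into a requested topology. The endpoint, request fields and function name are hypothetical, not the cloudFPGA API:

```python
import json
import urllib.request

# Hypothetical resource-manager endpoint (illustrative only).
RESOURCE_MANAGER = "http://resmgr.example.net/api/fabrics"

def request_fabric(num_fpgas: int, topology: str, bitstream: str) -> dict:
    """Ask the (hypothetical) manager for `num_fpgas` network-attached FPGAs,
    linked in `topology` (e.g. "ring", "all-to-all") and loaded with `bitstream`."""
    body = json.dumps({"fpgas": num_fpgas,
                       "topology": topology,
                       "bitstream": bitstream}).encode()
    req = urllib.request.Request(RESOURCE_MANAGER, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)             # e.g. the IP addresses of the allocated FPGAs

# E.g., a text-processing job spanning one server and five FPGAs:
#   fabric = request_fabric(5, "ring", "text_annotator.bit")
```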

Network performance

■ Comparison with bare-metal servers, virtual machines and Linux containers [2]

[Charts: throughput (Gb/s) and round-trip latency (µs) for UDP, TCP and iWARP]

[2] J. Weerasinghe et al., "Disaggregated FPGAs: Network performance comparison against bare-metal servers, virtual machines and Linux containers," in IEEE Int'l Conf. on Cloud Computing Technology and Science, Luxembourg, 2016.


SDMFF Application

■ Distributed text analytics [3]

[Figure: standard UIMA pipeline – the master node (MN) runs the Collection Reader and the Collection Process Engine / CAS Consumer, slave nodes (SN) run the Analysis Engines that turn unstructured data into structured data; SN1 and SN2 denote slave-node configurations with two CPUs and two PCIe-attached FPGAs each]
[Charts: latency (ms), throughput (char/s) and cost ($)]

https://uima.apache.org/
UIMA: Unstructured Information Management Architecture; MN: Master Node, SN: Slave Node

[3] J. Weerasinghe et al., "Network-attached FPGAs for data center applications," in IEEE International Conference on Field-Programmable Technology (FPT '16), Xi'an, China, 2016.


Compute density – S822LC (aka Minsky) vs FPGA chassis

■ Same 2U chassis and similar power consumption

Minsky (S822LC):
  – x4 Tesla P100 w/ NVLink – total performance:
    • 42.4 TeraFLOPS Single-Precision*
    • 84.8 TeraFLOPS Half-Precision*
  – x2 [POWER8 CPU + 256 GB DRAM]
  – Power consumption ≈ 2.3 kW

cloudFPGA chassis:
  – x64 Xilinx KU060 + 16 GB DDR4 – total performance:
    • 53 TeraFLOPS Single-Precision**
    • 106 TeraFLOPS Half-Precision**
    • 424 TeraOPS Fixed-Point (INT8)***
  – 1 TB DDR4
  – Power consumption ≈ 2.5 kW

* http://www.nvidia.com/object/tesla-p100.html
** Computed as: 64 x (#DSPs-per-KU060 x FMAX) / (#DSPs-per-FusedMultiplyAdd)
*** Xilinx, WP487 (v1.0), June 27, 2017 – 8-Bit Dot-Product Acceleration
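A quick check of the ** formula. The DSP-slice count is the published KU060 figure; the DSP clock and the DSP-slices-per-FMA costs below are assumptions picked to reproduce the slide's totals, not values stated on the slide:

```python
# Chassis-level peak estimate per the ** footnote:
#   FLOPS = 64 x (DSPs_per_KU060 x FMAX) / DSPs_per_FMA
DSPS_PER_KU060 = 2760                      # DSP48E2 slices in a Kintex UltraScale KU060
FMAX_HZ = 600e6                            # assumed DSP clock
DSPS_PER_FMA = {"Single": 2, "Half": 1}    # assumed DSP slices per fused multiply-add

for precision, cost in DSPS_PER_FMA.items():
    flops = 64 * (DSPS_PER_KU060 * FMAX_HZ) / cost
    print(f"{precision}-Precision: {flops / 1e12:.0f} TeraFLOPS")
# Prints 53 (Single) and 106 (Half), matching the slide.
# The 424 TeraOPS INT8 figure follows similarly with two INT8 MACs per DSP per
# cycle (Xilinx WP487), counting each MAC as two operations:
#   64 x 2760 x 600e6 x 2 x 2 = 4.24e14 OPS.
```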


Summary

■ A platform to deploy FPGAs at large scale in DCs
  – Integrates FPGAs at the drawer/chassis level
  – Combines passive and active water-cooling
  – Provides high density, energy efficiency and reduced costs
    • Fits 1000+ FPGAs per DC rack

■ Builds on the disaggregation of FPGAs from the servers
  – FPGAs connect to the DC network over 10/40 Gb/s Ethernet
    • Key enabler for large-scale deployment of FPGAs in DCs
    • FPGAs generate and consume their own networking packets
  – FPGA cards become stand-alone resources
    • Deployed FPGAs become independent of the number of servers
    • Promotes the use of medium- and low-cost FPGAs

■ Makes FPGAs plentiful in DCs
  – Users can rent and link them in any type of topology


Acknowledgments





This work was conducted in the context of the joint ASTRON and IBM DOME project and was funded by the Netherlands Organization for Scientific Research (NWO), the Dutch Ministry of EL&I, and the Province of Drenthe, the Netherlands.

Special thanks to Martin Schmatz, Ronald Luijten and Andreas Doering, who initiated this new packaging concept for the needs of their DOME microserver project.


Thank you – May the FPGA be with you

https://www.zurich.ibm.com/cci/cloudFPGA/


Backup

The baseboard carrier at scale


Disaggregated switch module

From the Intel Seacliff Trail reference system (48 x 10 GbE + 4 x 40 GbE), 7,938 cm³** (** 41 x 44 x 4.4 cm),
to the Switch Module SM6000 (32 x 10 GbE + 8 x 40 GbE), 378 cm³* (* 14 x 6 x 4.5 cm) – roughly 1/21 of the volume.


Other cloudFPGA modules

– NVMe module
– USB hub module
– Service processor (T4240)
– Power converter


x4 Tesla P100 – Air cooled


x4 Tesla P100 – Water cooled


Minsky chassis vs cloudFPGA chassis

http://web.archive.org/web/20161016071547/https://www.xilinx.com/support/documentation/ip_documentation/ru/floating-point.html#kintexu


Related Work – Other large-scale FPGA deployments

[Chart: FPGAs per rack for large-scale FPGA deployments – this work (1024 FPGAs/rack) vs. others such as Amazon EC2 F1]

