HermitCore: A Library Operating System for Cloud and HPC

Jens Breitbart (Bosch Chassis Systems Control, Stuttgart, Germany)
Stefan Lankes (RWTH Aachen University, Aachen, Germany)
Simon Pickartz (RWTH Aachen University, Aachen, Germany)

22nd June 2017

Agenda

- Motivation: Challenges for Future Systems
- OS Architectures
- HermitCore Design
- Performance Evaluation
- Conclusion and Outlook


Motivation

- The complexity of high-end HPC systems keeps growing
  - Extreme degree of parallelism
  - Heterogeneous core architectures
  - Deep memory hierarchies
  - Power constraints
  ⇒ Need for scalable, reliable performance and the capability to rapidly adapt to new hardware

- Applications have also become complex
  - In-situ analysis, workflows
  - Sophisticated monitoring and tool support, etc.
  - Isolated, consistent simulation performance
  ⇒ Dependence on POSIX, MPI, and OpenMP

Seemingly contradictory requirements...


Operating System = Overhead

Every administration layer has its overhead ⇒ e.g. the hourglass benchmark

[Figure: hourglass histogram, number of events vs. loop time in cycles]

OS noise reduces scalability and increases latency
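For reference, an hourglass loop is just a tight loop that reads the time-stamp counter and records the gap between consecutive reads; large gaps mean the OS interrupted the core. A minimal sketch in C (the bin size and sample count are illustrative choices, not the ones used for the plot above):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>            /* __rdtsc() */

#define SAMPLES (100UL * 1000 * 1000)
#define MAX_GAP 32768             /* histogram range in cycles */

static uint64_t histogram[MAX_GAP];

int main(void)
{
    uint64_t prev = __rdtsc();

    for (uint64_t i = 0; i < SAMPLES; i++) {
        uint64_t now = __rdtsc();
        uint64_t gap = now - prev;
        prev = now;

        if (gap >= MAX_GAP)       /* everything larger lands in the last bucket */
            gap = MAX_GAP - 1;
        histogram[gap]++;
    }

    /* print non-empty buckets: loop time (cycles) vs. number of events */
    for (uint64_t gap = 0; gap < MAX_GAP; gap++)
        if (histogram[gap])
            printf("%" PRIu64 " %" PRIu64 "\n", gap, histogram[gap]);

    return 0;
}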

How does the HPC community reduce the overhead?

Light-weight Kernels
- Typically derived from an existing full-weight kernel (e.g. Linux)
- Removal of unneeded features to improve scalability
- Example: ZeptoOS

Multi-Kernels
- A specialized kernel runs side-by-side with a full-weight kernel (e.g. Linux)
- Applications of the full-weight kernel are able to run on the specialized kernel
- The specialized kernel catches every system call and delegates it to the full-weight kernel
- Binary compatible with the full-weight kernel
- Examples: mOS, McKernel

OS Designs for Cloud Computing – Usage of a Common OS

[Figure: two VMs, each running an application on a full OS with its own eth0, on top of a hypervisor; the host operating system connects them through a software virtual switch]

- Two operating systems to maintain for a single computer? Double management!
- Why run a multi-user / multi-tasking OS within a hypervisor in the cloud for a simple web server?

OS Designs for Cloud Computing – Library OS

[Figure: two VMs, each running an application linked directly against a library OS (libOS) with its own eth0, on top of a hypervisor; the host operating system connects them through a software virtual switch]

- Now, every system call is a function call ⇒ low overhead
- Whole-image optimization possible (including the library OS) via Link-Time Optimization (LTO)
- Removing unneeded code ⇒ reduces the attack surface

HermitCore Design

- Combination of the unikernel and multi-kernel approaches to reduce overhead
- The same binary is able to run in a VM (classical unikernel setup) or bare-metal side-by-side with Linux (multi-kernel setup)
- Support for the dominant programming models (MPI, OpenMP)
- Single-address-space operating system ⇒ no TLB shootdown


HermitCore Design

Classical Unikernel Setup
- Applications are able to boot directly within a VM
- Tested with QEMU / KVM
- Tested with uhyve (an experimental KVM-based hypervisor)
- QEMU emulates more than HermitCore needs ⇒ long setup time
- uhyve reduces the boot time (from 2 s to ~30 ms)
- Google Compute Engine

Multi-Kernel Setup
- One kernel per NUMA node
- Only local memory accesses (UMA)
- Message passing between NUMA nodes
- One full-weight kernel (Linux) in the system to get access to broader driver support
  - Linux acts only as a fallback for pre-/post-processing
  - The critical path should be handled by HermitCore

Runtime Support
- Pthreads
  - Thread binding at start time
  - No load balancing ⇒ less housekeeping
- SSE, AVX2, AVX-512, FMA, ...
- Full C-library support (newlib)
- HBM support similar to memkind (see the sketch after this slide)
- IP interface & BSD sockets (LwIP)
  - IP packets are forwarded to Linux
- OpenMP via Intel's runtime
- iRCCE & MPI (via SCC-MPICH)
  - Shared-memory interface between the kernels
- Full support for the Go runtime

[Figure: Intel SCC layout: a mesh of tiles connected by routers (R), two cores per tile (48 cores total) each with an L2 cache and a message-passing buffer (MPB), four memory controllers (MC 0-3), the MIU, and an FPGA]
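The HBM support is described only as "similar to memkind". As an illustration of that style of interface, here is a small sketch against the Linux memkind API (hbwmalloc.h); the actual HermitCore calls are not named on the slide, so do not read this as HermitCore's own API:

#include <hbwmalloc.h>            /* memkind's HBM interface; link with -lmemkind */
#include <stdio.h>

int main(void)
{
    size_t n = 1 << 20;

    /* returns 0 if high-bandwidth memory (e.g. MCDRAM on KNL) is usable */
    if (hbw_check_available() != 0) {
        fprintf(stderr, "no high-bandwidth memory available\n");
        return 1;
    }

    /* place only the bandwidth-critical working set in HBM */
    double *hot = hbw_malloc(n * sizeof(*hot));
    if (!hot)
        return 1;

    for (size_t i = 0; i < n; i++)
        hot[i] = (double)i;

    hbw_free(hot);
    return 0;
}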

Is the software stack difficult to maintain?

Changes to the common software stack, determined with cloc:

Software stack      LoC          Changes
binutils             5 121 217       226
gcc                  6 850 382     4 821
Linux               15 276 013     1 296
Newlib               1 040 826     5 472
LwIP                    38 883       832
Pthread                 13 768       466
OpenMP RT               61 594       324
HermitCore                   –    10 597

Operating System Micro-Benchmarks

Test systems:
- Intel Haswell CPUs (E5-2650 v3) clocked at 2.3 GHz
- Intel KNL (Xeon Phi 7210) clocked at 1.3 GHz, SNC mode with four NUMA nodes

Results in CPU cycles:

System activity                 KNL HermitCore   KNL Linux   Haswell HermitCore   Haswell Linux
getpid()                                    15         486                   14             143
sched_yield()                              197         983                   97             370
malloc()                                 3 051      12 806                3 715           6 575
first write access to a page             2 078       3 967                2 018           4 007
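The "first write access to a page" row measures demand paging: the page is mapped but not yet backed, so the first store traps into the kernel. A rough sketch of how such a number can be measured with the time-stamp counter (illustrative only; the slides do not show the measurement code):

#define _GNU_SOURCE
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <x86intrin.h>

#define PAGES     4096
#define PAGE_SIZE 4096

int main(void)
{
    /* fresh anonymous mapping: the pages are not backed yet */
    volatile char *buf = mmap(NULL, (size_t)PAGES * PAGE_SIZE,
                              PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    uint64_t total = 0;
    for (size_t i = 0; i < PAGES; i++) {
        uint64_t start = __rdtsc();
        buf[i * PAGE_SIZE] = 1;   /* first write triggers the page fault */
        total += __rdtsc() - start;
    }

    printf("first write access: %" PRIu64 " cycles per page (average)\n",
           total / PAGES);
    return 0;
}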

[Figure: OS noise over a 300 s run, gap in ticks (×10⁵) vs. time in s, for four configurations: standard Linux, Linux with isolcpus & nohz enabled, HermitCore, and HermitCore without the IP thread]

Overhead of VMs – Determined via the NAS Parallel Benchmarks (Class B)

[Figure: bar chart of speed in GMop/s for the CG, BT, SP, EP, and IS kernels, comparing native Linux with HermitCore + uhyve]

Outlook

[Figure: multi-kernel view of one machine: Linux on NUMA node 0 with the physical NIC, HermitCore instances on nodes 1-3 each with a virtual NIC (vNIC), all connected through a virtual IP device / message-passing interface]

- Arm v8 support
- SR-IOV simplifies the coordination between Linux & HermitCore

Conclusions

- It works! ⇒ https://youtu.be/gDYCJ1DOTKw
- Binary packages are available
- Reduces OS noise significantly
- Try it out! ⇒ http://www.hermitcore.org

Thank you for your kind attention!


Backup slides

OS Designs for Cloud Computing – Container

[Figure: two applications, each in its own container, sharing a single OS and its eth0]

- Builds virtual borders (namespaces)
- Containers and their processes don't see each other
- Fast access to OS services
- Less secure, because an exploit escaping the container also attacks the host OS
- Does not reduce the OS noise of the host system
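As a small illustration of the "virtual borders" that namespaces provide (plain Linux, unrelated to HermitCore; requires CAP_SYS_ADMIN), the following sketch detaches the process into its own UTS namespace so a hostname change stays invisible to the host:

#define _GNU_SOURCE
#include <sched.h>                /* unshare(), CLONE_NEWUTS */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* leave the host's UTS namespace (hostname/domain name) */
    if (unshare(CLONE_NEWUTS) != 0) {
        perror("unshare");
        return 1;
    }

    /* visible only inside the new namespace; the host keeps its hostname */
    const char *name = "container-demo";
    if (sethostname(name, strlen(name)) != 0) {
        perror("sethostname");
        return 1;
    }

    char buf[64];
    gethostname(buf, sizeof(buf));
    printf("hostname inside the namespace: %s\n", buf);
    return 0;
}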


EPCC OpenMP Micro-Benchmarks

[Figure: overhead in µs of the Parallel and Parallel For constructs vs. number of threads (2-10), on Linux with gcc, on Linux with icc, and on HermitCore]

Throughput Results of the Inter-Kernel Communication Layer

[Figure: PingPong throughput in GiB/s vs. message size (256 B to 1 MiB) via iRCCE, SCC-MPICH, and ParaStation MPI]
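The PingPong pattern behind these curves is the usual two-rank throughput test. A generic MPI sketch (standard MPI calls, not the iRCCE-specific code used for the measurements):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (1 << 20)        /* 1 MiB, one of the sizes on the x-axis */
#define ROUNDS   1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(MSG_SIZE);
    double start = MPI_Wtime();

    for (int i = 0; i < ROUNDS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        double elapsed = MPI_Wtime() - start;
        /* two messages per round; report throughput in GiB/s */
        double gib = 2.0 * ROUNDS * MSG_SIZE / (1024.0 * 1024.0 * 1024.0);
        printf("throughput: %.2f GiB/s\n", gib / elapsed);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}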

Lack of Programmability – Non-Uniform Memory Access

[Figure: two sockets, each with local memory, connected via an interconnect]

- Costs for memory accesses may vary
- Run processes where their memory is allocated
- Allocate memory where the process resides
- Implications for performance
- Where should the applications store their data?
- Who should decide on the location: the operating system or the application developers?
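On Linux, "allocate memory where the process resides" can be expressed explicitly with libnuma; this sketch is illustrative only and is not an API prescribed by the slides:

#include <numa.h>                 /* link with -lnuma */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int node = 0;
    size_t size = 64UL << 20;     /* 64 MiB */

    /* run on node 0 and take memory from node 0, so all accesses stay local */
    numa_run_on_node(node);
    double *data = numa_alloc_onnode(size, node);
    if (!data)
        return 1;

    for (size_t i = 0; i < size / sizeof(*data); i++)
        data[i] = 0.0;            /* the first touch also happens on node 0 */

    numa_free(data, size);
    return 0;
}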


Tuning Tricks

- Parallelization via shared memory (OpenMP)
  - Many side effects and error-prone
  - Incremental parallelization
- Parallelization via message passing (MPI)
  - Restructuring of the sequential code
  - Fewer side effects
- Performance tuning: bind MPI processes to one NUMA node ⇒ no remote memory accesses

[Figure: speed in GFlop/s for LU-MZ.C.16, BT-MZ.C.16, and SP-MZ.C.16 with different <threadcount> × <proccount> configurations (2×8, 4×4, 8×2, 16×1)]

OpenMP Runtime

- GCC includes an OpenMP runtime (libgomp)
  - Reuses the synchronization primitives of the Pthread library
  - Other OpenMP runtimes scale better
  - In addition, our Pthread library was originally not designed for HPC
- Integration of Intel's OpenMP runtime
  - Brings its own synchronization primitives
  - Binary compatible with GCC's OpenMP runtime
  - Changes for HermitCore support are small, mostly deactivation of the functions that set thread affinity
- Transparent usage: for the end user, no changes in the build process
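Because Intel's runtime is binary compatible with libgomp's entry points, an ordinary OpenMP program compiles unchanged (e.g. with gcc -fopenmp) and simply links against the replaced runtime; a trivial example:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* the bare parallel region is what the EPCC "Parallel" benchmark times */
    #pragma omp parallel
    {
        #pragma omp critical
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}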

Support for Compilers Besides GCC

- Just avoid the standard environment (-ffreestanding)
- Set the include path to HermitCore's toolchain
- Make sure that the ELF file uses HermitCore's ABI (patch object files via elfedit)
- Use GCC to link the binary

LD = x86_64-hermit-gcc
# CC = x86_64-hermit-gcc
# CFLAGS = -O3 -mtune=native -march=native -fopenmp -mno-red-zone
CC = icc -D__hermit__
CFLAGS = -O3 -xHost -mno-red-zone -ffreestanding -I$(HERMIT_DIR) -openmp
ELFEDIT = x86_64-hermit-elfedit

stream.o: stream.c
	$(CC) $(CFLAGS) -c -o $@ $<
	$(ELFEDIT) --output-osabi HermitCore $@

stream: stream.o
	$(LD) -o $@ $< $(LDFLAGS) $(CFLAGS)

[Figure: hourglass histograms (number of events vs. loop time in cycles) for Linux, Linux with isolcpus, HermitCore with LwIP, and HermitCore without LwIP]

Hydro (preliminary results)

[Figure: MFLOPS vs. number of cores (5-20) for Linux with 1 process × n threads, Linux with 1 process × n threads bound to cores 0-19, Linux with n processes × 5 threads bound to NUMA nodes, and HermitCore with n processes × 5 threads]

Thank you for your kind attention!

Jens Breitbart et al. – [email protected]
www.jensbreitbart.de

The HermitCore logo is provided by EmojiOne.
