HermitCore: A Library Operating System for Cloud and HPC
Jens Breitbart (Bosch Chassis Systems Control, Stuttgart, Germany), Stefan Lankes and Simon Pickartz (RWTH Aachen University, Aachen, Germany)
Agenda
- Motivation: Challenges for Future Systems
- OS Architectures
- HermitCore Design
- Performance Evaluation
- Conclusion and Outlook
2 HermitCore | Jens Breitbart et al. | personal open source project | 22nd June 2017
Motivation
The complexity of high-end HPC systems keeps growing:
- Extreme degree of parallelism
- Heterogeneous core architectures
- Deep memory hierarchies
- Power constraints
⇒ Need for scalable, reliable performance and the capability to adapt rapidly to new hardware

Applications have also become complex:
- In-situ analysis, workflows
- Sophisticated monitoring and tool support
- Isolated, consistent simulation performance
⇒ Dependence on POSIX, MPI, and OpenMP

Seemingly contradictory requirements...
Operating System = Overhead
Every administration layer has its overhead, as the hourglass benchmark shows.
[Figure: histogram of loop times – number of events (log scale) vs. loop time in cycles (10,000–30,000)]
OS noise reduces scalability and increases latency.
How does the HPC community reduce the overhead?
Light-weight kernels
- Typically derived from an existing full-weight kernel (e.g. Linux)
- Unneeded features are removed to improve scalability
- Example: ZeptoOS

Multi-kernels
- A specialized kernel runs side-by-side with a full-weight kernel (e.g. Linux)
- Applications of the full-weight kernel are able to run on the specialized kernel
- The specialized kernel catches every system call and delegates it to the full-weight kernel
- Binary compatible with the full-weight kernel
- Examples: mOS, McKernel
OS Designs for Cloud Computing – Usage of a Common OS
[Figure: two VMs, each running a full OS with an application, on top of a hypervisor and host operating system; the guests' eth0 interfaces are connected through a software virtual switch to the host's eth0]
- Two operating systems to maintain for a single computer ⇒ double management!
- Why run a multi-user, multi-tasking OS within a hypervisor in the cloud for a simple web server?
OS Designs for Cloud Computing – Library OS
[Figure: two VMs, each running an application linked against a library OS (libOS), on top of a hypervisor and host operating system; the guests' eth0 interfaces are connected through a software virtual switch to the host's eth0]
- Now every system call is a function call ⇒ low overhead
- Whole-image optimization becomes possible, including the library OS, e.g. via link-time optimization (LTO)
- Removing unneeded code reduces the attack surface
HermitCore Design
- Combination of the unikernel and multi-kernel approaches to reduce the overhead
- The same binary is able to run in a VM (classical unikernel setup) or bare-metal side-by-side with Linux (multi-kernel setup)
- Support for the dominant programming models (MPI, OpenMP)
- Single-address-space operating system ⇒ no TLB shootdowns
HermitCore Design – Classical Unikernel Setup
- Applications are able to boot directly within a VM
- Tested with QEMU/KVM, with uhyve (an experimental KVM-based hypervisor), and on Google Compute Engine
- QEMU emulates more than HermitCore needs ⇒ large setup time
- uhyve reduces the boot time from about 2 s to about 30 ms

Multi-Kernel Setup
- One kernel per NUMA node ⇒ only local memory accesses (UMA)
- Message passing between NUMA nodes
- One full-weight kernel (Linux) in the system provides access to broader driver support
- Linux serves only as a backup for pre-/post-processing; the critical path should be handled by HermitCore
Runtime Support
- Pthreads: thread binding at start time, no load balancing ⇒ less housekeeping
- SSE, AVX2, AVX-512, FMA, ...
- Full C-library support (newlib)
- HBM support similar to memkind
- IP interface & BSD sockets (LwIP); IP packets are forwarded to Linux
- Shared memory interface
- OpenMP via Intel's runtime
- iRCCE & MPI (via SCC-MPICH)
- Full support for the Go runtime
[Figure: Intel SCC die layout – 24 tiles with two cores each (cores 0–47), a mesh of routers (R), four memory controllers (MC 0–3), the MIU, per-tile L2 caches and message-passing buffers (MPB), and an FPGA]
Is the software stack difficult to maintain?
Changes to the common software stack, determined with cloc:

Software Stack   LoC          Changes
binutils         5 121 217    226
gcc              6 850 382    4 821
Linux            15 276 013   1 296
Newlib           1 040 826    5 472
LwIP             38 883       832
Pthread          13 768       466
OpenMP RT        61 594       324
HermitCore       –            10 597
Operating System Micro-Benchmarks
Test systems:
- Intel Haswell CPUs (E5-2650 v3) clocked at 2.3 GHz
- Intel KNL (Xeon Phi 7210) clocked at 1.3 GHz, SNC mode with four NUMA nodes

Results in CPU cycles:

                               Haswell                 KNL
System activity                HermitCore   Linux      HermitCore   Linux
getpid()                       14           143        15           486
sched_yield()                  97           370        197          983
malloc()                       3 715        6 575      3 051        12 806
first write access to a page   2 018        4 007      2 078        3 967
[Figure: OS noise over time – gap in ticks (·10^5) vs. time in s (0–300) for four configurations: standard Linux, Linux with isolcpus & nohz enabled, HermitCore, and HermitCore without the IP thread]
Overhead of VMs – Determined via the NAS Parallel Benchmarks (Class B)
[Figure: speed in GMop/s for CG, BT, SP, EP, and IS – native Linux vs. HermitCore + uhyve]
Outlook
[Figure: a four-node system; node 0 runs Linux with a physical NIC, nodes 1–3 run HermitCore with virtual NICs (vNICs), all connected via a virtual IP device / message-passing interface]
- Arm v8 support
- SR-IOV simplifies the coordination between Linux & HermitCore
Conclusions
- It works! ⇒ https://youtu.be/gDYCJ1DOTKw
- Binary packages are available
- Reduces the OS noise significantly
- Try it out: http://www.hermitcore.org

Thank you for your kind attention!
Backup slides
OS Designs for Cloud Computing – Containers
[Figure: two applications, each in its own container, on top of a shared OS with an eth0 interface]
- Virtual borders are built via namespaces: containers and their processes do not see each other
- Fast access to OS services
- Less secure, because an exploit for the container also attacks the host OS
- Does not reduce the OS noise of the host system
EPCC OpenMP Micro-Benchmarks
[Figure: overhead in µs (0–3) vs. number of threads (2–10) for the Parallel and For constructs on Linux (gcc), Linux (icc), and HermitCore]
Throughput Results of the Inter-Kernel Communication Layer
[Figure: throughput in GiB/s (0–5) vs. message size (256 B to 1 MiB) for PingPong via iRCCE, SCC-MPICH, and ParaStation MPI]
Lack of Programmability – Non-Uniform Memory Access
[Figure: two sockets, each with its local memory (Memory 0, Memory 1), connected via an interconnect]
- The cost of a memory access may vary:
  - Run processes where the memory is allocated
  - Allocate memory where the process resides
  - Implications for the performance
- Where should applications store their data? And who should decide on the location: the operating system or the application developers?
Tuning Tricks
- Parallelization via shared memory (OpenMP): many side effects and error-prone, but allows incremental parallelization
- Parallelization via message passing (MPI): requires restructuring of the sequential code, but has fewer side effects
- Performance tuning: bind MPI applications to one NUMA node ⇒ no remote memory accesses
[Figure: speed in GFlop/s (0–5) for LU-MZ.C.16, BT-MZ.C.16, and SP-MZ.C.16 with <threadcount> × <proccount> combinations 2×8, 4×4, 8×2, and 16×1]
OpenMP Runtime
- GCC includes an OpenMP runtime (libgomp) that reuses the synchronization primitives of the Pthread library
- Other OpenMP runtimes scale better; in addition, our Pthread library was originally not designed for HPC
- Integration of Intel's OpenMP runtime:
  - Includes its own synchronization primitives
  - Binary compatible with GCC's OpenMP runtime
  - The changes required for HermitCore support are small, mostly the deactivation of functions that define the thread affinity
- Transparent usage: for the end user, there are no changes to the build process
Support of Compilers besides GCC
- Avoid the standard environment (-ffreestanding)
- Set the include path to HermitCore's toolchain
- Ensure that the ELF file uses HermitCore's ABI by patching object files via elfedit
- Use GCC to link the binary

LD = x86_64-hermit-gcc
# CC = x86_64-hermit-gcc
# CFLAGS = -O3 -mtune=native -march=native -fopenmp -mno-red-zone
CC = icc -D__hermit__
CFLAGS = -O3 -xHost -mno-red-zone -ffreestanding -I$(HERMIT_DIR) -openmp
ELFEDIT = x86_64-hermit-elfedit

stream.o: stream.c
	$(CC) $(CFLAGS) -c -o $@ $<
	$(ELFEDIT) --output-osabi HermitCore $@

stream: stream.o
	$(LD) -o $@ $< $(LDFLAGS) $(CFLAGS)
[Figure: hourglass histograms – number of events (log scale) vs. loop time in cycles (10,000–30,000) for Linux, Linux (isolcpu), HermitCore with LwIP, and HermitCore without LwIP]
Hydro (preliminary results)
[Figure: MFLOPS vs. number of cores (5–20) for Linux (1 process × n threads), Linux (1 proc. × n thr., bind-to 0–19), Linux (n proc. × 5 thr., bind-to numa), and HermitCore (n proc. × 5 thr.)]
Thank you for your kind attention!
Jens Breitbart et al. – [email protected] – www.jensbreitbart.de
The HermitCore logo is provided by EmojiOne.