System-level IPC on Multi-core Platforms SICS Multicore Day – 2013-09-23
Ola Dahl CTO Office Enea Enea Confidential – Under Copyright © 2013 EneaNDA AB
Before we start
• Enea – founded 1968, ~400 employees, 468 MSEK revenue. Products and Services: Services, Middleware, OSE, Linux
• Myself – LTH, ST, LiU, Ericsson
System-level IPC
• Message-passing between processes – intra-node and inter-node
• Monitoring and event handling – fault-tolerance
• OSE operating system – kernel services, file system services, IP communication, program management, run-time loader, LINX
• Number of communicating entities ~ tens of thousands (pid space extension from 16 to 20 bits) – number of nodes ~ 100s
System-level IPC
Element Messaging Framework – Name server, message dispatch, communication patterns, HA functionality, Linux
#nodes ~ 100(s), #threads/node ~ 1000s
[Figure: components A, B, C and D deployed on SoC Platform, Fixed Multi-Node, Elastic Multi-Node and Cloud]
IPC
[Figure: IPC between two Operating System instances]
Communicating entities – Linux process, Linux thread, RTOS task, Bare-metal executive, User-space thread, Other executing entity (e.g. in an event-driven execution model)
IPC and Multicore
[Figure: Operating Systems on cores C0–C4 and D0–D2, on top of Bus, Interconnect, Cache, Controllers, I/O]
• Multicore, Multiple processing entities
• Parallelism on different levels – inside one SoC block, inside SoC, between SoC
• Communication on different levels – interconnect, caches, memory, hardware buffers and hardware IPC support
IPC and Multicore
[Figure: Realtime and Non-Realtime domains; Operating Systems on cores C0–C4 and D0–D2, on top of Bus, Interconnect, Cache, Controllers, I/O]
• Multicore, Multiple processing entities
• Parallelism on different levels – inside one SoC block, inside SoC, between SoC
• Communication on different levels – interconnect, caches, memory, hardware buffers and hardware IPC support
• Real-time – core isolation – dedicated cores for real-time response
Heterogeneous Hardware
TCI6638K2K - Multicore DSP+ARM KeyStone II System-on-Chip http://www.ti.com/product/tci6638k2k
• Processing – 8 C66x DSP Cores (up to 1.2 GHz), 4 ARM Cores (up to 1.4 GHz), Wireless comm (3GPP) coprocessors
• Interconnect and control – Multicore Navigator, TeraNet, Multicore Shared Memory Controller, HyperLink
Heterogeneous Software
Core isolation for real-time response
[Figure: Realtime and Non-Realtime domains; Operating System on cores C0, C1 and D0–D2, on top of Bus, Interconnect, Cache, Controllers, I/O]
Real-time domain and non-real-time domain
Run-time categories in real-time domain
• Native threads
• User-space threads
• RTOS migration
• Other execution frameworks, e.g. Open Event Machine
• ENEA LWRT
System-level IPC and Multicore
Communicating entities – e.g. processes, threads, user-space threads, bare-metal executives
Levels of parallelism
• Multicore processor in a SoC
• Multiple blocks in a SoC
• Multiple SoC in a node
• Multiple nodes
Communication on different levels (e.g. intra-node and inter-node)
• On each level – Establish contact, Perform communication, Monitor and act on events, Close
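The four steps named above (establish contact, perform communication, monitor and act on events, close) can be illustrated at the smallest scale with a UNIX domain socket pair inside one process. This is only a sketch: the function name ipc_lifecycle_demo is hypothetical, and "monitoring" is reduced to noticing that the peer endpoint has been closed. A real system-level IPC would use separate processes or nodes and a richer event mechanism.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: one establish / communicate / monitor / close cycle
 * between the two endpoints of a UNIX domain socket pair.
 * Copies the received message into reply; returns 0 on success, -1 on error. */
int ipc_lifecycle_demo(const char *msg, char *reply, size_t reply_len)
{
    int sv[2];
    size_t len = strlen(msg) + 1;          /* send the terminating NUL too */
    int rc = -1;

    if (len > reply_len)
        return -1;

    /* 1. Establish contact: create two connected endpoints. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    /* 2. Perform communication: endpoint 0 sends, endpoint 1 receives.
     *    A short message written in one call is assumed to arrive whole. */
    if (write(sv[0], msg, len) == (ssize_t)len &&
        read(sv[1], reply, reply_len) == (ssize_t)len) {
        /* 3. Monitor and act on events: once endpoint 0 is closed,
         *    a read on endpoint 1 returns 0 (end of stream). */
        char probe;
        close(sv[0]);
        sv[0] = -1;
        if (read(sv[1], &probe, 1) == 0)
            rc = 0;
    }

    /* 4. Close whatever is still open. */
    if (sv[0] != -1)
        close(sv[0]);
    close(sv[1]);
    return rc;
}
```

Calling ipc_lifecycle_demo("hello", buf, sizeof buf) with a 16-byte buf should return 0 and leave "hello" in buf.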
Where are we heading?
• Linux
• Hardware
• Virtualisation
Linux
EE Times report - http://seminar2.techonline.com/~additionalresources/embedded_mar1913/embedded_mar1913.pdf
Linux usage: 2012 – 46%, 2013 – 50%
Linux
Status of embedded Linux – March 2013 http://elinux.org/images/c/cf/Status-of-Embedded-Linux-2013-03-JJ44.pdf
• Average time between Linux releases, 3.3 – 3.8: 70 days
• Linux 3.4 – RPMsg for IPC between Linux and e.g. RTOS
• Linux 3.7 – ARM multi-platform support, ARM 64-bit support
• Linux 3.7 – perf trace (alternative to strace)
Status of Linux – September 2013
• Latest stable kernel – 3.11.1
• Example changes in 3.11 (released September 2, 2013):
  – ARM huge page support, KVM and XEN support for ARM64
  – SYSV IPC message queue scalability improvements
• Example changes in 3.10 (released June 30, 2013):
  – Timerless multitasking
Linux and real-time
• Real-time framework, e.g. Xenomai - http://www.xenomai.org/
• PREEMPT_RT - https://rt.wiki.kernel.org/index.php/Main_Page
• Core isolation and tickless operation – striving for “Bare-Metal Multicore Performance in a General-Purpose Operating System” http://www2.rdrop.com/~paulmck/scalability/paper/BareMetalMW.2013.02.25a.pdf
• Timerless multitasking in 3.10 retains 1 Hz tick also on isolated cores
• Linux 3.12-rc1 (2013-09-16) – even more tickless kernel (1 Hz maintenance tick removed) – still work to be done, e.g. with memory management
Hardware
ITRS - http://public.itrs.net - fifteen-year assessment of the semiconductor industry’s future technology requirements
ITRS 2012 UPDATE - http://public.itrs.net/Links/2012ITRS/Home2012.htm
• System Drivers - SOC Networking Driver, SOC Consumer Driver, Microprocessor (MPU) driver, Mixed-Signal Driver, Embedded Memory Driver
• SOC networking driver - moving towards “multicore architectures with heterogeneous on-demand accelerator engines”, with “integration of onboard switch fabric and L3 caches”
Hardware
SOC networking driver – MC/AE Architecture – from http://public.itrs.net/Links/2011ITRS/2011Chapters/2011SysDrivers.pdf
Hardware
SOC networking driver – System performance and # of cores – from http://public.itrs.net/Links/2011ITRS/2011Chapters/2011SysDrivers.pdf
Assumptions
• constant cost (die area)
• per-year increase of number of cores (1.4 x), core frequency (1.05 x), accelerator engine frequency (1.05 x)
• logic, memory, cache hierarchy, switching-fabric and system interconnect will scale consistently with the number of cores
System performance – the “product of number of cores, core frequency, and accelerator engine frequency”
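Taken at face value, the per-year factors above compound. As a worked illustration, with symbols introduced here rather than taken from the ITRS chapter (N for number of cores, f_core and f_AE for the two frequencies, P_t for system performance in year t, and a five-year horizon chosen arbitrarily):

```latex
\begin{align*}
P_t &= N_t \cdot f_{\mathrm{core},t} \cdot f_{\mathrm{AE},t} \\
\frac{P_{t+1}}{P_t} &= 1.4 \times 1.05 \times 1.05 = 1.5435 \\
\frac{P_{t+5}}{P_t} &= 1.5435^{5} \approx 8.76
\end{align*}
```

Under these assumptions system performance grows by roughly 54% per year, and most of that growth comes from the core count rather than from frequency.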
Virtualization
NFV – Network Function Virtualization
ETSI - http://portal.etsi.org/NFV/NFV_White_Paper.pdf
“leveraging standard IT virtualisation technology to consolidate many network equipment types onto industry standard high volume servers, switches and storage, which could be located in Datacentres, Network Nodes and in the end user premises”
Virtualization using e.g. KVM or XEN
System-level IPC aspects
Establishing and performing efficient communication
Constraints from
• Real-time
• Hardware
with an increasing interest in virtualization
IPC and Linux
Is there any remaining work to do?
IPC in Linux (and UNIX)
[Timeline figure, 1964 to now, with Enea's founding and Emacs as reference points. Mechanisms shown: pipe; UNIX SysV; mmap (SVR4); flock (4.2BSD); POSIX rt; Linux 1.0; POSIX shmem (Linux 2.4); POSIX mq (Linux 2.6.6); POSIX named semaphore (Linux 2.6); eventfd (Linux 2.6.22); CMA (Linux 3.2)]
Overview, book, man pages, etc. by Michael Kerrisk - http://man7.org/
IPC on Linux
[Timeline figure, 2000 to now. Mechanisms shown: TIPC, OpenMPI, DBUS, Binder, LINX for Linux, 0MQ, RPMsg, AF_BUS, kdbus, nanomsg, Enea Element]
Work in progress
sysv ipc shared mem optimizations, June 18, 2013 http://lwn.net/Articles/555469/
“With these patches applied, a custom shm microbenchmark stressing shmctl doing IPC_STAT with 4 threads a million times, reduces the execution time by 50%”
ALS: Linux interprocess communication and kdbus, May 30, 2013 http://lwn.net/Articles/551969/
“The work on kdbus is progressing well and Kroah-Hartman expressed optimism that it would be merged before the end of the year. Beyond just providing a faster D-Bus (which could be accomplished without moving it into the kernel, he said), it is his hope that kdbus can eventually replace Android's binder IPC mechanism.”
Work in progress
Speeding up D-Bus, February 29, 2012 http://lwn.net/Articles/484203/
“D-Bus currently relies on a daemon process to authenticate processes and deliver messages that it receives over Unix sockets. Part of the performance problem is caused by the user-space daemon, which means that messages need two trips through the kernel on their way to the destination”
Fast interprocess communication revisited, November 9, 2011 https://lwn.net/Articles/466304/
“Rather we start with the observation that this many attempts to solve essentially the same problem suggests that something is lacking in Linux. There is, in other words, a real need for fast IPC that Linux doesn't address”
Work in progress
Fast interprocess messaging, September 15, 2010 http://lwn.net/Articles/405346/
“Rather than copy messages through a shared segment, they would rather deliver messages directly into another process's address space. To this end, Christopher Yeoh has posted a patch implementing what he calls cross memory attach.”
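Cross memory attach is available in mainline Linux as the process_vm_readv and process_vm_writev system calls. The sketch below wraps a single-segment read. The wrapper name cma_read is hypothetical, and the caller is assumed to have ptrace-level permission on the target process (reading from your own pid satisfies this in most setups, though sandboxed environments may still refuse the call).

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical wrapper: copy len bytes starting at remote_addr in the
 * address space of process pid directly into the local buffer, with no
 * intermediate shared segment. Returns the number of bytes copied,
 * or -1 with errno set. */
ssize_t cma_read(pid_t pid, void *local, const void *remote_addr, size_t len)
{
    struct iovec local_iov  = { .iov_base = local, .iov_len = len };
    struct iovec remote_iov = { .iov_base = (void *)remote_addr,
                                .iov_len  = len };

    /* One local and one remote segment; the flags argument must be 0. */
    return process_vm_readv(pid, &local_iov, 1, &remote_iov, 1, 0);
}
```

The two processes still need a side channel (a pipe, a socket, a LINX signal) to exchange the remote address and length; the system call only performs the copy.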
Which IPC to use?
• Functionality
• Performance
• Cost
• Technology constraints
Choosing an IPC - Functionality

End-point addressing
• SysV Shared memory – SysV key
• POSIX Shared memory – Shmem object name
• FIFO – File system node
• Stream Socket – AF_UNIX: file system node, AF_INET: IP address and port
• 0MQ – Transport and address (Transport = TCP, ipc, inproc)
• LINX – Endpoint name specifying path to peer

End-point representation
• SysV Shared memory – Variable
• POSIX Shared memory – File desc
• FIFO – File desc x 2
• Stream Socket – Socket descriptor
• 0MQ – 0MQ socket
• LINX – LINX endpoint, spid

Channels
• SysV Shared memory – A memory area
• POSIX Shared memory – A memory area
• FIFO – The FIFO (unidirectional)
• Stream Socket – The socket (bidirectional)
• 0MQ – 0MQ socket internal (bidirectional), e.g. TCP or UNIX domain socket
• LINX – Buffer associated with LINX endpoint

Initialisation
• SysV Shared memory – shmget, shmat
• POSIX Shared memory – shm_open, mmap
• FIFO – mkfifo, open
• Stream Socket – socket, bind, listen, accept, connect
• 0MQ – Create 0MQ context and 0MQ socket
• LINX – linx_open, linx_hunt

Closing
• SysV Shared memory – shmdt
• POSIX Shared memory – munmap, shm_unlink
• FIFO – close, unlink
• Stream Socket – close
• 0MQ – Close 0MQ socket
• LINX – linx_close
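As an illustration of the initialisation and closing calls listed for POSIX shared memory (shm_open, mmap, munmap, shm_unlink), the sketch below creates a named object, maps it once as a writer and once as a reader would, and then tears everything down. Both mappings live in one process only to keep the example short. The function name shm_roundtrip, the object name passed by the caller and the 4096-byte size are assumptions made for the example. On older glibc versions the program may need to be linked with -lrt.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define SHM_SIZE 4096   /* assumed size of the shared area */

/* Hypothetical helper: write text through one mapping of the object `name`
 * and read it back through a second mapping. Returns 0 on success. */
int shm_roundtrip(const char *name, const char *text, char *out, size_t out_len)
{
    char *w = MAP_FAILED;
    char *r = MAP_FAILED;
    int rc = -1;

    if (out_len == 0 || strlen(text) >= SHM_SIZE)
        return -1;

    /* Initialisation, writer side: create the object, size it, map it. */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd == -1)
        return -1;
    if (ftruncate(fd, SHM_SIZE) == 0)
        w = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */

    /* Initialisation, reader side: open the same name, map read-only. */
    if (w != MAP_FAILED && (fd = shm_open(name, O_RDONLY, 0)) != -1) {
        r = mmap(NULL, SHM_SIZE, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);
    }

    if (w != MAP_FAILED && r != MAP_FAILED) {
        /* Sending and receiving are plain memory accesses; there is no
         * synchronisation unless the application adds it. */
        strcpy(w, text);
        strncpy(out, r, out_len - 1);
        out[out_len - 1] = '\0';
        rc = 0;
    }

    /* Closing: unmap both views and remove the name. */
    if (r != MAP_FAILED)
        munmap(r, SHM_SIZE);
    if (w != MAP_FAILED)
        munmap(w, SHM_SIZE);
    shm_unlink(name);
    return rc;
}
```

Between real processes, the reader would additionally need a semaphore, eventfd or similar to learn when the data is ready.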
Choosing an IPC - Functionality (continued)

Sending
• SysV Shared memory – Write to memory, no synchronization
• POSIX Shared memory – Write to memory, no synchronization
• FIFO – write
• Stream Socket – write
• 0MQ – Send message or number of bytes to 0MQ socket
• LINX – Send LINX signal

Receiving
• SysV Shared memory – Read from memory, no synchronization
• POSIX Shared memory – Read from memory, no synchronization
• FIFO – read
• Stream Socket – read
• 0MQ – Receive message or number of bytes from 0MQ socket
• LINX – Receive LINX signal

Blocking
• SysV Shared memory – No (unless implemented separately)
• POSIX Shared memory – No (unless implemented separately)
• FIFO – Blocking and non-blocking R/W
• Stream Socket – Blocking and non-blocking R/W
• 0MQ – Blocking and non-blocking R/W
• LINX – Receive is blocking (non-blocking possible), Send is not

Monitoring
• SysV Shared memory – No (unless implemented separately)
• POSIX Shared memory – No (unless implemented separately)
• FIFO – select, poll
• Stream Socket – select, poll
• 0MQ – Monitoring callback can be registered with 0MQ context
• LINX – LINX attach
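The Blocking and Monitoring entries for FIFOs and stream sockets map onto two small operations: switching a descriptor to non-blocking mode, and waiting for readiness with poll. The helper names set_nonblocking and wait_readable below are hypothetical. The same calls work on pipes, FIFOs and sockets alike.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: put fd in non-blocking mode, so that read() on an
 * empty pipe, FIFO or socket fails with EAGAIN instead of sleeping.
 * Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: monitor fd for incoming data.
 * Returns 1 if data can be read within timeout_ms, 0 on timeout,
 * -1 on error or if only an error/hang-up condition was reported. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc <= 0)
        return rc;
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

A receiver would typically call wait_readable in its event loop and only then issue a non-blocking read, which keeps one thread responsive to several channels at once.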
Choosing an IPC – Technology constraints

Sockets
• 0MQ – Yes
• kdbus – No
• LINX – Yes, own type

Daemons
• 0MQ – No
• kdbus – No
• LINX – Discovery daemon (optional)

Kernel modules
• 0MQ – No
• kdbus – Yes
• LINX – Yes

Pthread synchronization
• 0MQ – Yes
• kdbus – No
• LINX – Yes

Kernel synchronization
• 0MQ – No
• kdbus – Yes
• LINX – Yes

Programming languages
• 0MQ – C and more
• kdbus – C
• LINX – C

Development status
• 0MQ – Latest stable release is 3.2.3, from May 2013
• kdbus – Estimated to be ready in 2013
• LINX – Initial release 2006, current version is 2.6.5, released June 2013

License
• 0MQ – LGPLv3
• kdbus – LGPL
• LINX – BSD and GPLv2
Choosing an IPC - performance
• ipc-bench: A UNIX inter-process communication benchmark
• University of Cambridge http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/
• Measures Latency, Throughput, IPI latency
• Public results dataset
“Since we have found IPC performance to be a complex, multi-variate problem, and because we believe that having an open corpus of performance data will be useful to guide the development of hypervisors, kernels and programming frameworks, we provide a database of aggregated ipc-bench datasets.”
Enea and ipc-bench – porting to 32-bit, porting to ARM, porting to PowerPC, adding tests for CMA, LINX, ZeroMQ
Measuring IPC performance
Why is this interesting?
From The case for reconfigurable I/O channels, S. Smith et al, RESoLVE12, 2012 - http://anil.recoil.org/papers/2012-resolve-fable.pdf
“We show dramatic differences in performance between communication mechanisms depending on locality and machine architecture, and observe that the interactions of communication primitives are often complex and sometimes counter-intuitive”
“Furthermore, we show that virtualisation can cause unexpected effects due to OS ignorance of the underlying, hypervisor-level hardware setup”
Measuring IPC performance
Submitted measurements - http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/details/tmpn2YlFp.html
[Figure: Pairwise IPC latency between cores]
64 cores, AMD Opteron(TM) Processor 6272, 8 NUMA nodes, 125.9 GB
Linux 3.8.5-030805-generic, x86_64
Measuring IPC performance
Submitted measurements - http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/details/tmpn2YlFp.html
[Figure: Pairwise IPC throughput between cores (x-axis is packet size, y-axis is Gbps)]
64 cores, AMD Opteron(TM) Processor 6272, 8 NUMA nodes, 125.9 GB
Linux 3.8.5-030805-generic, x86_64
Measuring IPC performance
Intel(R) Xeon(R) CPU - X3460 @ 2.80GHz, Cores 6 and 7
[Chart: x-axis 64, 4096, 65536; y-axis 0 to 180000; series mempipe_spin_thr, mempipe_thr, pipe_thr, tcp_thr, unix_thr, vmsplice_coop_pipe_thr, vmsplice_pipe_thr]
Measuring IPC performance
ARM Pandaboard @ 1 GHz, Cores 0 and 1
[Chart: x-axis 64, 4096, 65536; y-axis 0 to 3000; series mempipe_spin_thr, mempipe_thr, pipe_thr, tcp_thr, unix_thr, vmsplice_coop_pipe_thr, vmsplice_pipe_thr]
Measuring IPC performance
Intel(R) Xeon(R) CPU - X3460 @ 2.80GHz, Cores 6 and 7
0MQ vs UNIX sockets
[Chart: zmq_inproc_thr, zmq_ipc_thr, zmq_tcp_thr and unix_thr, each at 64, 4096 and 65536; y-axis 0 to 30000]
Profiling and Performance
Brendan Gregg - Linux Performance Analysis and Tools - SCaLE 11x 2013 http://dtrace.org/blogs/brendan/2013/06/08/linux-performance-analysis-and-tools/
[Figure: Linux software stack – Apps and libs; System call interface; VFS, File systems, Block device interface; Sockets, TCP/UDP, IP, Ethernet; Scheduler, VM; Device drivers]
- perf - https://perf.wiki.kernel.org/index.php/Main_Page
- DTrace - https://github.com/dtrace4linux
- SystemTap - http://sourceware.org/systemtap/
Profiling and Performance
Collecting data with perf – IPC test with pipes
Profiling and Performance
Analyzing data recorded with perf
Profiling and Performance
Examining where time is spent
Profiling and Performance
A lot more to choose from*: strace, netstat, top, pidstat, mpstat, dstat, vmstat, slabtop, free, tcpdump, ip, nicstat, iostat, iotop, blktrace, ps, pmap, traceroute, ntop, ss, lsof, oprofile, gprof, kcachegrind, valgrind, google profiler, nfsiostat, cifsiostat, latencytop, powertop, LTTng, ktap, ...
* http://www.brendangregg.com/Slides/SCaLE_Linux_Performance2013.pdf
Summary
• IPC in Linux – Stable but not finished
• IPC on Linux – diversified
• Performance and profiling – ipc-bench (with adaptations and extensions), a large selection of profiling tools
Conclusions
• A variety of IPC mechanisms exist
• There is no clear one-size-fits-all solution
• Performance aspects and functionality aspects (location transparency, robustness) – different trade-offs for different use-cases
• IPC and Linux – many stable mechanisms but still work in progress (e.g. kdbus)
• Performance and profiling required
  – ipc-bench (with adaptations and extensions)
  – perf for performance profiling (one of several, however with a powerful feature set)
Challenges
• Systems requirements and design - parallelism, partitioning, heterogeneity, functional requirements, performance requirements – choosing an IPC mechanism
• Programming - frameworks and execution environments – legacy and re-use – choosing a programming paradigm
• Verification - measurements and profiling - are we designing (and implementing) the system as we planned? – choosing the right tools
Enea as an IPC partner - Long-term experience, Competence for building future IPC systems – development, integration, configuration, performance assessment
SICS Multicore day
System-level IPC on multicore platforms
Multicore System-on-Chip solutions, offering parallelization and partitioning, are increasingly used in real-time systems. As the number of cores increases, often in combination with increased heterogeneity in the form of hardware-accelerated functionality, we see increased demands on effective communication, both inside a multicore node and at the inter-node system level. The presentation will outline some of the challenges, as seen from Enea, to be expected when building future communication mechanisms, with requirements on performance and scalability, as well as transparency for applications. We will give examples from ongoing work in the Linux area, from Enea and from other open source contributors.