Container mechanics in rkt and Linux
Alban Crequy
LinuxCon Europe 2015, Dublin
Alban Crequy
✤ Working on rkt
✤ One of the maintainers of rkt
✤ Previously worked on D-Bus and AF_BUS
https://github.com/alban
Container mechanics in rkt and Linux
✤ Containers
✤ Linux namespaces
✤ Cgroups
✤ How rkt uses them
Containers vs virtual machines
Virtual machines: each app runs on a guest Linux kernel, on top of a hypervisor, the host Linux kernel and the hardware.
Containers: apps run in containers or pods managed by a runtime (Docker daemon, rkt), directly on the host Linux kernel and the hardware.
rkt architecture
✤ stage 2: the apps
✤ stage 1: systemd-nspawn or lkvm
✤ stage 0: the rkt binary
Containers: no guest kernel
Apps and rkt talk directly to the host Linux kernel API through system calls such as open() or sethostname(), on the host Linux kernel and hardware.
Containers with an example
Getting and setting the hostname:
✤ The system calls for getting and setting the hostname are older than containers:
int uname(struct utsname *buf);
int gethostname(char *name, size_t len);
int sethostname(const char *name, size_t len);
Each container can have its own hostname (e.g. “rainbow”, “sunshine”), independent of the host’s (“thunderstorm”).
Linux namespaces
Processes in namespaces
Processes in one UTS namespace get “rainbow” from gethostname(), while a process in another namespace gets “thunderstorm”.
Linux namespaces
Several independent namespaces:
✤ uts (Unix Timesharing System) namespace
✤ mount namespace
✤ pid namespace
✤ network namespace
✤ user namespace
Creating new namespaces
unshare(CLONE_NEWUTS);
The calling process is detached into a new UTS namespace, which starts as a copy of the old one (hostname still “rainbow”) and can then change its hostname independently.
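A minimal sketch in C of what the slide shows (an assumption for illustration, not rkt code; run as root since unshare(CLONE_NEWUTS) needs CAP_SYS_ADMIN): the process detaches into its own UTS namespace and changes its hostname without affecting the host.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char name[64];

    gethostname(name, sizeof(name));
    printf("before: %s\n", name);          /* e.g. "rainbow" */

    /* Detach into a private UTS namespace (needs CAP_SYS_ADMIN). */
    if (unshare(CLONE_NEWUTS) == -1) { perror("unshare"); return 1; }

    /* Only this process and its future children see the new name. */
    if (sethostname("sunshine", 8) == -1) { perror("sethostname"); return 1; }

    gethostname(name, sizeof(name));
    printf("after:  %s\n", name);          /* "sunshine"; the host keeps its name */
    return 0;
}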
PID namespace
Hiding processes and PID translation
✤ the host sees all processes
✤ the container sees only its own processes
A process that is pid 1 inside the container is actually another pid on the host (e.g. pid 30920).
Initial PID namespace: processes 1, 2, 6 and 7.
Creating a new namespace
clone(CLONE_NEWPID, ...);
The first process created this way is pid 1 inside the new PID namespace, while the initial namespace sees it with an ordinary pid (e.g. pid 7).
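A hedged sketch of that clone() call (run as root; the message strings are mine): the child is pid 1 in its new PID namespace, while the parent sees its pid in the initial namespace.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int child(void *arg)
{
    /* Inside the new PID namespace this process is pid 1. */
    printf("in the container: getpid() = %d\n", getpid());
    return 0;
}

int main(void)
{
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (!stack) { perror("malloc"); return 1; }

    /* CLONE_NEWPID needs CAP_SYS_ADMIN; the stack grows down on x86. */
    pid_t pid = clone(child, stack + stack_size, CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); return 1; }

    /* From the initial namespace the same process has an ordinary pid. */
    printf("on the host: child pid = %d\n", pid);
    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}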
rkt run …
✤ uses unshare() to create a new network namespace
✤ uses clone() to start the first process in the container with a new pid namespace

rkt enter …
✤ uses setns() to enter an existing namespace
Joining an existing namespace
setns(..., CLONE_NEWPID);
A process on the host joins the container’s PID namespace; the next child it creates is born inside that namespace (e.g. pid 4 in the container).
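A sketch of the mechanism rkt enter relies on, assuming a target pid is passed on the command line and the program runs as root: note that setns() with CLONE_NEWPID does not move the caller itself, only children forked afterwards land in the joined namespace.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/ns/pid", argv[1]);

    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    /* Join the target PID namespace; this only affects children created later. */
    if (setns(fd, CLONE_NEWPID) == -1) { perror("setns"); return 1; }
    close(fd);

    pid_t pid = fork();
    if (pid == 0) {
        /* This child gets a small pid (e.g. 4) inside the container. */
        printf("inside: getpid() = %d\n", getpid());
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}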
When does PID translation happen?
The kernel always shows PIDs translated into the caller’s PID namespace:
✤ getpid(), getppid()
✤ /proc
✤ /sys/fs/cgroup/.../cgroup.procs
✤ credentials passed in Unix sockets (SCM_CREDS)
✤ pid = fork()
Future:
✤ Possibly: getvpid() patch being discussed
Mount namespaces
The container sees its own filesystem tree (/ with /my-app), distinct from the host’s (/ with /etc, /var, /home/user).
Storing the container data (copy-on-write)
Container filesystem = overlayfs:
✤ “upper” directory: /var/lib/rkt/pods/run/$POD_UUID/overlay/sha512-.../upper/
✤ Application Container Image: /var/lib/rkt/cas/tree/sha512-...
rkt directories
/var/lib/rkt
├─ cas
│  └─ tree
│     ├─ deps-sha512-19bf...
│     └─ deps-sha512-a5c2...
└─ pods
   └─ run
      └─ e0ccc8d8
         ├─ overlay/sha512-19bf.../upper
         └─ stage1/rootfs/
unshare(..., CLONE_NEWNS);
The new mount namespace starts with a copy of the host’s mount tree (/etc, /var, /home/user); whether later mount changes propagate between host and container depends on the propagation mode (see below).
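A sketch of that idea (root required; the /mnt path and the shell at the end are just examples): after unshare(CLONE_NEWNS) the process gets its own copy of the mount tree, and with propagation turned off its new mounts stay invisible to the host.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
    /* Copy the current mount tree into a private mount namespace. */
    if (unshare(CLONE_NEWNS) == -1) { perror("unshare"); return 1; }

    /* Do not propagate our changes back to the host (see the propagation slides). */
    if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) == -1) { perror("MS_SLAVE"); return 1; }

    /* Visible here, invisible in the host's mount namespace. */
    if (mount("tmpfs", "/mnt", "tmpfs", 0, NULL) == -1) { perror("mount tmpfs"); return 1; }

    execl("/bin/sh", "sh", NULL);
    perror("execl");
    return 1;
}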
Changing root with MS_MOVE
mount($ROOTFS, “/”, MS_MOVE)
$ROOTFS = /var/lib/rkt/pods/run/e0ccc8d8.../stage1/rootfs
The prepared rootfs (containing my-app) is moved on top of “/” and becomes the container’s root.
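A minimal sketch of that root switch, under the assumption that ROOTFS (a hypothetical stand-in path for the pod’s stage1 rootfs) is already a mount point inside a non-shared mount namespace, as set up above. It mirrors the idea on the slide, not rkt’s exact code.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

/* Hypothetical path standing in for the pod's stage1 rootfs. */
#define ROOTFS "/path/to/stage1/rootfs"

int main(void)
{
    if (chdir(ROOTFS) == -1) { perror("chdir"); return 1; }

    /* Move the prepared rootfs on top of "/" ... */
    if (mount(ROOTFS, "/", NULL, MS_MOVE, NULL) == -1) { perror("MS_MOVE"); return 1; }

    /* ... and make it our root directory. */
    if (chroot(".") == -1) { perror("chroot"); return 1; }
    if (chdir("/") == -1) { perror("chdir /"); return 1; }

    execl("/my-app", "my-app", NULL);   /* /my-app inside the new root */
    perror("execl");
    return 1;
}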
Relationship between the two mounts:
- shared
- master / slave
- private
Mount propagation events
A mount added on one side (e.g. under /home) is propagated to the other side or not, depending on the relationship:
✤ private: no propagation in either direction
✤ shared: propagation in both directions
✤ master / slave: propagation only from the master to the slave
How rkt uses mount propagation events
✤ “/” in the container namespace is recursively set as slave:
mount(NULL, "/", NULL, MS_SLAVE|MS_REC, NULL)
New mounts on the host (e.g. under /home/user) still appear in the container, but mounts made in the container do not propagate back to the host.
Network namespace
Network isolation
Goal:
✤ each container has its own network interfaces (its own eth0, separate from the host’s eth0)
✤ a container cannot see the network traffic outside of it (e.g. with tcpdump)
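A tiny sketch of that isolation (run as root; the call to the ip tool is only there to show the result): after unshare(CLONE_NEWNET) the process sees a fresh network namespace containing only a loopback interface, not the host’s eth0 or its traffic.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Brand new network namespace: only a down "lo" interface exists here. */
    if (unshare(CLONE_NEWNET) == -1) { perror("unshare"); return 1; }

    /* tcpdump, ip, etc. run from here cannot see the host's interfaces. */
    return system("ip link");
}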
Network tooling
✤ Linux can create pairs of virtual net interfaces (veth)
✤ They can be linked in a bridge
✤ IP masquerading via iptables
Each container’s eth0 is one end of a veth pair; the other ends (veth1, veth2) are attached to a bridge on the host, which reaches the outside world through the host’s eth0.
How does rkt do it?
✤ rkt uses the network plugins implemented by the Container Network Interface (CNI, https://github.com/appc/cni)
✤ systemd-nspawn creates and joins the network namespace, referenced at /var/lib/rkt/pods/run/$POD_UUID/netns
✤ the network plugins, invoked by rkt (exec(), HTTP API), configure the namespace via setns + netlink
User namespaces
History of Linux namespaces
✓ 1991: Linux
✓ 2002: namespaces in Linux 2.4.19
✓ 2008: LXC
✓ 2011: systemd-nspawn
✓ 2013: user namespaces in Linux 3.8
✓ 2013: Docker
✓ 2014: rkt
… development still active
Why user namespaces?
✤ Better isolation
✤ Run applications which would need more capabilities
✤ Per-user limits
✤ Future:
  ✣ Unprivileged containers: possibility to have containers without root
User ID ranges
The host has the full 32-bit UID range (0 to 4,294,967,295); container 1 and container 2 each only use UIDs 0 to 65535, mapped into different parts of the host range.
User ID mapping
/proc/$PID/uid_map: “0 1048576 65536”
Container UIDs 0–65535 are mapped to host UIDs 1048576–1114111; everything outside that range stays unmapped.
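A sketch of how a launcher could install that exact mapping for a child it just created with CLONE_NEWUSER (the target pid is taken from the command line here, which is my assumption; a real launcher would also handle /proc/$PID/setgroups and gid_map, and needs the appropriate privileges — uid_map can be written only once).

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    /* Container UIDs 0..65535 become host UIDs 1048576..1114111. */
    const char map[] = "0 1048576 65536\n";

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/uid_map", argv[1]);

    int fd = open(path, O_WRONLY);
    if (fd == -1) { perror("open uid_map"); return 1; }
    if (write(fd, map, strlen(map)) != (ssize_t)strlen(map)) { perror("write uid_map"); return 1; }
    close(fd);
    return 0;
}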
Problems with container images
An Application Container Image (ACI), e.g. a web server image, is downloaded once and shared: each container gets its own overlayfs “upper” directory on top of the same ACI.
Problems with container images
✤ Files’ UID / GID
✤ rkt currently only supports user namespaces without overlayfs
  ✣ Performance loss: no COW from overlayfs
  ✣ “chown -R” for every file in each container
Problems with volumes
A volume (e.g. /data on the host) is bind mounted (rw / ro) into the container’s filesystem (e.g. at /data, next to /my-app):
✤ mounted in several containers
✤ no UID translation
✤ dynamic UID maps
User namespace and filesystem problem
✤ Possible solution: add options to mount() to apply a UID mapping
✤ rkt would use it when mounting:
  ✣ the overlay rootfs
  ✣ volumes
✤ Idea suggested on kernel mailing lists
Namespace lifecycle
Namespace references
A namespace is kept alive by the processes inside it and by namespace file descriptors (/proc/$PID/ns/*).
These files can be opened, bind mounted, fd-passed (SCM_RIGHTS).
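A sketch of keeping a namespace alive through its file: open /proc/$PID/ns/net and bind mount it onto an existing empty file, much like rkt’s /var/lib/rkt/pods/run/$POD_UUID/netns. The command-line arguments are an assumption for illustration, not rkt’s code.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 3) { fprintf(stderr, "usage: %s <pid> <target-file>\n", argv[0]); return 1; }

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/ns/net", argv[1]);

    /* Holding an open fd keeps the namespace alive even with no processes in it. */
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    /* So does a bind mount; <target-file> must already exist. */
    if (mount(path, argv[2], NULL, MS_BIND, NULL) == -1) { perror("mount"); return 1; }

    close(fd);
    return 0;
}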
Isolators
Isolators in rkt
✤ specified in an image manifest
✤ limiting capabilities or resources
Isolators in rkt
Currently implemented:
✤ capabilities
✤ cpu
✤ memory
Possible additions:
✤ block-bandwidth
✤ block-iops
✤ network-bandwidth
✤ disk-space
Capabilities (1/3)
✤ Old model (before Linux 2.2):
  ✣ User root (user id = 0) can do everything
  ✣ Regular users are limited
✤ Now: processes have capabilities
  ✣ Configuring the network: CAP_NET_ADMIN
  ✣ Mounting a filesystem: CAP_SYS_ADMIN
  ✣ Creating a block device: CAP_MKNOD
  ✣ etc.
✤ 37 different capabilities today
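As a small illustration (mine, not from the slides): dropping one capability from the bounding set with prctl(). After this, no process in this subtree can regain the right to configure the network, even if it runs as root. Dropping from the bounding set itself requires CAP_SETPCAP.

#define _GNU_SOURCE
#include <linux/capability.h>
#include <stdio.h>
#include <sys/prctl.h>

int main(void)
{
    /* Remove CAP_NET_ADMIN from the bounding set of this process. */
    if (prctl(PR_CAPBSET_DROP, CAP_NET_ADMIN, 0, 0, 0) == -1) { perror("prctl"); return 1; }

    /* Check: PR_CAPBSET_READ now returns 0 for CAP_NET_ADMIN. */
    int still_there = prctl(PR_CAPBSET_READ, CAP_NET_ADMIN, 0, 0, 0);
    printf("CAP_NET_ADMIN still in bounding set: %d\n", still_there);
    return 0;
}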
Capabilities (2/3)
✤ Each process has several capability sets:
  ✣ Permitted
  ✣ Inheritable
  ✣ Effective
  ✣ Bounding set
  ✣ Ambient (soon!)
Capabilities (3/3)
Other security mechanisms:
✤ Mandatory Access Control (MAC) with Linux Security Modules (LSMs):
  ✣ SELinux
  ✣ AppArmor…
✤ seccomp
Isolator: memory and cpu
✤ based on cgroups
cgroups
What’s a control group (cgroup)?
✤ group processes together
✤ organised in trees
✤ applying limits to them as a group
cgroup API
✤ /sys/fs/cgroup/*/
✤ /proc/cgroups
✤ /proc/$PID/cgroup
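A tiny sketch of the last entry: reading /proc/self/cgroup shows, one line per hierarchy, which cgroup the current process belongs to.

#include <stdio.h>

int main(void)
{
    /* One line per hierarchy: "<id>:<controllers>:<path>". */
    FILE *f = fopen("/proc/self/cgroup", "r");
    if (!f) { perror("/proc/self/cgroup"); return 1; }

    char line[512];
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);

    fclose(f);
    return 0;
}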
List of cgroup controllers
/sys/fs/cgroup/
├─ cpu
├─ devices
├─ freezer
├─ memory
├─ ...
└─ systemd
How systemd units use cgroups
/sys/fs/cgroup/
├─ systemd
│  ├─ user.slice
│  ├─ system.slice
│  │  ├─ NetworkManager.service
│  │  │  └─ cgroup.procs
│  │  ...
│  └─ machine.slice
How systemd units use cgroups with containers
/sys/fs/cgroup/
├─ systemd
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
├─ cpu
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
├─ memory
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
...
cgroups mounted in the container
/sys/fs/cgroup/
├─ systemd (RO)
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service (RW)
├─ cpu
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
├─ memory
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
...
The hierarchies are mounted read-only in the container, except the pod’s own subtree (app.service), which stays read-write.
Memory isolator
Application Image Manifest: “limit”: “500M”
→ systemd service file:
[Service]
ExecStart=
MemoryLimit=500M
→ systemd action: write to memory.limit_in_bytes
CPU isolator
Application Image Manifest: “limit”: “500m”
→ systemd service file:
[Service]
ExecStart=
CPUShares=512
→ systemd action: write to cpu.shares
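A hedged sketch of the last step on both slides: what MemoryLimit=500M and CPUShares=512 boil down to, namely writing into the app’s cgroup knobs. The cgroup paths below are assumptions for illustration; on a real host systemd creates them under machine.slice/machine-rkt….scope/system.slice/.

#include <stdio.h>

/* Hypothetical paths; the real ones live under machine.slice/machine-rkt….scope/... */
#define MEM_CG "/sys/fs/cgroup/memory/machine.slice/app.service"
#define CPU_CG "/sys/fs/cgroup/cpu/machine.slice/app.service"

static int write_value(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fputs(value, f);
    return fclose(f);
}

int main(void)
{
    /* MemoryLimit=500M -> memory.limit_in_bytes (500 * 1024 * 1024 bytes). */
    write_value(MEM_CG "/memory.limit_in_bytes", "524288000");

    /* CPUShares=512 -> cpu.shares. */
    write_value(CPU_CG "/cpu.shares", "512");
    return 0;
}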
Unified cgroup hierarchy
✤ Multiple hierarchies:
  ✣ one cgroup mount point for each controller (memory, cpu, etc.)
  ✣ flexible but complex
  ✣ cannot remount with a different set of controllers
  ✣ difficult to give to containers in a safe way
✤ Unified hierarchy:
  ✣ cgroup filesystem mounted only one time
  ✣ still in development in Linux: mount with option “__DEVEL__sane_behavior”
  ✣ initial implementation in systemd v226 (September 2015)
  ✣ no support in rkt yet
Isolator: network
✤ limit the network bandwidth
✤ cgroup controller “net_cls” to tag packets emitted by a process
✤ iptables / traffic control to apply limits on tagged packets
✤ open question: allocation of tags?
✤ not implemented in rkt yet
Isolator: disk quotas
Disk quotas — not implemented in rkt
✤ loop device
✤ btrfs subvolumes
  ✣ systemd-nspawn can use them
✤ per user and group quotas
  ✣ not suitable for containers
✤ per project quotas: in xfs and soon in ext4
  ✣ open question: allocation of project id?
Conclusion
We talked about:
✤ the isolation provided by rkt
✤ namespaces
✤ cgroups
✤ how rkt uses the namespace & cgroup API
Thanks
CC-BY-SA Thanks Chris for the theme!