Container mechanics in rkt and Linux
Alban Crequy
LinuxCon Europe 2015, Dublin
Alban Crequy
✤ Working on rkt
✤ One of the maintainers of rkt
✤ Previously worked on D-Bus and AF_BUS
https://github.com/alban
Container mechanics in rkt and Linux
✤ Containers
✤ Linux namespaces
✤ Cgroups
✤ How rkt uses them
Containers vs virtual machines
Virtual machines: each app runs on a guest Linux kernel, on top of a hypervisor, the host Linux kernel and the hardware.
Containers: apps run in containers or pods managed by a runtime (Docker daemon, rkt), directly on the host Linux kernel and the hardware.
rkt architecture
✤ stage 2: the apps
✤ stage 1: systemd-nspawn or lkvm
✤ stage 0: the rkt binary
Containers: no guest kernel
Apps and rkt talk directly to the host Linux kernel API through system calls such as open() or sethostname(), on the host Linux kernel and hardware.
Containers with an example
Getting and setting the hostname:
✤ The system calls for getting and setting the hostname are older than containers:
int uname(struct utsname *buf);
int gethostname(char *name, size_t len);
int sethostname(const char *name, size_t len);
Each container can have its own hostname (e.g. “rainbow”, “sunshine”), independent of the host’s (“thunderstorm”).
Linux namespaces
Processes in namespaces
Processes in one UTS namespace get “rainbow” from gethostname(), while a process in another namespace gets “thunderstorm”.
Linux namespaces
Several independent namespaces:
✤ uts (Unix Timesharing System) namespace
✤ mount namespace
✤ pid namespace
✤ network namespace
✤ user namespace
Creating new namespaces
unshare(CLONE_NEWUTS);
The calling process is detached into a new UTS namespace, which starts as a copy of the old one (hostname still “rainbow”) and can then change its hostname independently.
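A minimal sketch in C of what the slide shows (an assumption for illustration, not rkt code; run as root since unshare(CLONE_NEWUTS) needs CAP_SYS_ADMIN): the process detaches into its own UTS namespace and changes its hostname without affecting the host.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char name[64];

    gethostname(name, sizeof(name));
    printf("before: %s\n", name);          /* e.g. "rainbow" */

    /* Detach into a private UTS namespace (needs CAP_SYS_ADMIN). */
    if (unshare(CLONE_NEWUTS) == -1) { perror("unshare"); return 1; }

    /* Only this process and its future children see the new name. */
    if (sethostname("sunshine", 8) == -1) { perror("sethostname"); return 1; }

    gethostname(name, sizeof(name));
    printf("after:  %s\n", name);          /* "sunshine"; the host keeps its name */
    return 0;
}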
PID namespace
Hiding processes and PID translation
✤ the host sees all processes
✤ the container sees only its own processes
A process that is pid 1 inside the container is actually another pid on the host (e.g. pid 30920).
Initial PID namespace: processes 1, 2, 6 and 7.
Creating a new namespace
clone(CLONE_NEWPID, ...);
The first process created this way is pid 1 inside the new PID namespace, while the initial namespace sees it with an ordinary pid (e.g. pid 7).
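A hedged sketch of that clone() call (run as root; the message strings are mine): the child is pid 1 in its new PID namespace, while the parent sees its pid in the initial namespace.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int child(void *arg)
{
    /* Inside the new PID namespace this process is pid 1. */
    printf("in the container: getpid() = %d\n", getpid());
    return 0;
}

int main(void)
{
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (!stack) { perror("malloc"); return 1; }

    /* CLONE_NEWPID needs CAP_SYS_ADMIN; the stack grows down on x86. */
    pid_t pid = clone(child, stack + stack_size, CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); return 1; }

    /* From the initial namespace the same process has an ordinary pid. */
    printf("on the host: child pid = %d\n", pid);
    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}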
rkt run …
✤ uses unshare() to create a new network namespace
✤ uses clone() to start the first process in the container with a new pid namespace

rkt enter …
✤ uses setns() to enter an existing namespace
Joining an existing namespace
setns(..., CLONE_NEWPID);
A process on the host joins the container’s PID namespace; the next child it creates is born inside that namespace (e.g. pid 4 in the container).
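A sketch of the mechanism rkt enter relies on, assuming a target pid is passed on the command line and the program runs as root: note that setns() with CLONE_NEWPID does not move the caller itself, only children forked afterwards land in the joined namespace.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/ns/pid", argv[1]);

    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    /* Join the target PID namespace; this only affects children created later. */
    if (setns(fd, CLONE_NEWPID) == -1) { perror("setns"); return 1; }
    close(fd);

    pid_t pid = fork();
    if (pid == 0) {
        /* This child gets a small pid (e.g. 4) inside the container. */
        printf("inside: getpid() = %d\n", getpid());
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}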
When does PID translation happen?
The kernel always shows PIDs translated into the caller’s PID namespace:
✤ getpid(), getppid()
✤ /proc
✤ /sys/fs/cgroup/.../cgroup.procs
✤ credentials passed in Unix sockets (SCM_CREDS)
✤ pid = fork()
Future:
✤ Possibly: getvpid() patch being discussed
Mount namespaces
The container sees its own filesystem tree (/ with /my-app), distinct from the host’s (/ with /etc, /var, /home/user).
Storing the container data (copy-on-write)
Container filesystem = overlayfs:
✤ “upper” directory: /var/lib/rkt/pods/run/$POD_UUID/overlay/sha512-.../upper/
✤ Application Container Image: /var/lib/rkt/cas/tree/sha512-...
rkt directories
/var/lib/rkt
├─ cas
│  └─ tree
│     ├─ deps-sha512-19bf...
│     └─ deps-sha512-a5c2...
└─ pods
   └─ run
      └─ e0ccc8d8
         ├─ overlay/sha512-19bf.../upper
         └─ stage1/rootfs/
unshare(..., CLONE_NEWNS);
The new mount namespace starts with a copy of the host’s mount tree (/etc, /var, /home/user); whether later mount changes propagate between host and container depends on the propagation mode (see below).
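A sketch of that idea (root required; the /mnt path and the shell at the end are just examples): after unshare(CLONE_NEWNS) the process gets its own copy of the mount tree, and with propagation turned off its new mounts stay invisible to the host.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
    /* Copy the current mount tree into a private mount namespace. */
    if (unshare(CLONE_NEWNS) == -1) { perror("unshare"); return 1; }

    /* Do not propagate our changes back to the host (see the propagation slides). */
    if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) == -1) { perror("MS_SLAVE"); return 1; }

    /* Visible here, invisible in the host's mount namespace. */
    if (mount("tmpfs", "/mnt", "tmpfs", 0, NULL) == -1) { perror("mount tmpfs"); return 1; }

    execl("/bin/sh", "sh", NULL);
    perror("execl");
    return 1;
}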
Changing root with MS_MOVE
mount($ROOTFS, “/”, MS_MOVE)
$ROOTFS = /var/lib/rkt/pods/run/e0ccc8d8.../stage1/rootfs
The prepared rootfs (containing my-app) is moved on top of “/” and becomes the container’s root.
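A minimal sketch of that root switch, under the assumption that ROOTFS (a hypothetical stand-in path for the pod’s stage1 rootfs) is already a mount point inside a non-shared mount namespace, as set up above. It mirrors the idea on the slide, not rkt’s exact code.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

/* Hypothetical path standing in for the pod's stage1 rootfs. */
#define ROOTFS "/path/to/stage1/rootfs"

int main(void)
{
    if (chdir(ROOTFS) == -1) { perror("chdir"); return 1; }

    /* Move the prepared rootfs on top of "/" ... */
    if (mount(ROOTFS, "/", NULL, MS_MOVE, NULL) == -1) { perror("MS_MOVE"); return 1; }

    /* ... and make it our root directory. */
    if (chroot(".") == -1) { perror("chroot"); return 1; }
    if (chdir("/") == -1) { perror("chdir /"); return 1; }

    execl("/my-app", "my-app", NULL);   /* /my-app inside the new root */
    perror("execl");
    return 1;
}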
Relationship between the two mounts:
- shared
- master / slave
- private
Mount propagation events
A mount added on one side (e.g. under /home) is propagated to the other side or not, depending on the relationship:
✤ private: no propagation in either direction
✤ shared: propagation in both directions
✤ master / slave: propagation only from the master to the slave
How rkt uses mount propagation events
✤ “/” in the container namespace is recursively set as slave:
mount(NULL, "/", NULL, MS_SLAVE|MS_REC, NULL)
New mounts on the host (e.g. under /home/user) still appear in the container, but mounts made in the container do not propagate back to the host.
Network namespace
Network isolation
Goal:
✤ each container has its own network interfaces (its own eth0, separate from the host’s eth0)
✤ a container cannot see the network traffic outside of it (e.g. with tcpdump)
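A tiny sketch of that isolation (run as root; the call to the ip tool is only there to show the result): after unshare(CLONE_NEWNET) the process sees a fresh network namespace containing only a loopback interface, not the host’s eth0 or its traffic.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Brand new network namespace: only a down "lo" interface exists here. */
    if (unshare(CLONE_NEWNET) == -1) { perror("unshare"); return 1; }

    /* tcpdump, ip, etc. run from here cannot see the host's interfaces. */
    return system("ip link");
}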
Network tooling
✤ Linux can create pairs of virtual net interfaces (veth)
✤ They can be linked in a bridge
✤ IP masquerading via iptables
Each container’s eth0 is one end of a veth pair; the other ends (veth1, veth2) are attached to a bridge on the host, which reaches the outside world through the host’s eth0.
How does rkt do it?
✤ rkt uses the network plugins implemented by the Container Network Interface (CNI, https://github.com/appc/cni)
✤ systemd-nspawn creates and joins the network namespace, referenced at /var/lib/rkt/pods/run/$POD_UUID/netns
✤ the network plugins, invoked by rkt (exec(), HTTP API), configure the namespace via setns + netlink
User namespaces
History of Linux namespaces
✓ 1991: Linux
✓ 2002: namespaces in Linux 2.4.19
✓ 2008: LXC
✓ 2011: systemd-nspawn
✓ 2013: user namespaces in Linux 3.8
✓ 2013: Docker
✓ 2014: rkt
… development still active
Why user namespaces?
✤ Better isolation
✤ Run applications which would need more capabilities
✤ Per-user limits
✤ Future:
  ✣ Unprivileged containers: possibility to have containers without root
User ID ranges
The host has the full 32-bit UID range (0 to 4,294,967,295); container 1 and container 2 each only use UIDs 0 to 65535, mapped into different parts of the host range.
User ID mapping
/proc/$PID/uid_map: “0 1048576 65536”
Container UIDs 0–65535 are mapped to host UIDs 1048576–1114111; everything outside that range stays unmapped.
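A sketch of how a launcher could install that exact mapping for a child it just created with CLONE_NEWUSER (the target pid is taken from the command line here, which is my assumption; a real launcher would also handle /proc/$PID/setgroups and gid_map, and needs the appropriate privileges — uid_map can be written only once).

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    /* Container UIDs 0..65535 become host UIDs 1048576..1114111. */
    const char map[] = "0 1048576 65536\n";

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/uid_map", argv[1]);

    int fd = open(path, O_WRONLY);
    if (fd == -1) { perror("open uid_map"); return 1; }
    if (write(fd, map, strlen(map)) != (ssize_t)strlen(map)) { perror("write uid_map"); return 1; }
    close(fd);
    return 0;
}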
Problems with container images
An Application Container Image (ACI), e.g. a web server image, is downloaded once and shared: each container gets its own overlayfs “upper” directory on top of the same ACI.
Problems with container images
✤ Files’ UID / GID
✤ rkt currently only supports user namespaces without overlayfs
  ✣ Performance loss: no COW from overlayfs
  ✣ “chown -R” for every file in each container
Problems with volumes
A volume (e.g. /data on the host) is bind mounted (rw / ro) into the container’s filesystem (e.g. at /data, next to /my-app):
✤ mounted in several containers
✤ no UID translation
✤ dynamic UID maps
User namespace and filesystem problem
✤ Possible solution: add options to mount() to apply a UID mapping
✤ rkt would use it when mounting:
  ✣ the overlay rootfs
  ✣ volumes
✤ Idea suggested on kernel mailing lists
Namespace lifecycle
Namespace references
A namespace is kept alive by the processes inside it and by namespace file descriptors (/proc/$PID/ns/*).
These files can be opened, bind mounted, fd-passed (SCM_RIGHTS).
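A sketch of keeping a namespace alive through its file: open /proc/$PID/ns/net and bind mount it onto an existing empty file, much like rkt’s /var/lib/rkt/pods/run/$POD_UUID/netns. The command-line arguments are an assumption for illustration, not rkt’s code.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 3) { fprintf(stderr, "usage: %s <pid> <target-file>\n", argv[0]); return 1; }

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/ns/net", argv[1]);

    /* Holding an open fd keeps the namespace alive even with no processes in it. */
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    /* So does a bind mount; <target-file> must already exist. */
    if (mount(path, argv[2], NULL, MS_BIND, NULL) == -1) { perror("mount"); return 1; }

    close(fd);
    return 0;
}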
Isolators
Isolators in rkt
✤ specified in an image manifest
✤ limiting capabilities or resources
Isolators in rkt
Currently implemented:
✤ capabilities
✤ cpu
✤ memory
Possible additions:
✤ block-bandwidth
✤ block-iops
✤ network-bandwidth
✤ disk-space
Capabilities (1/3)
✤ Old model (before Linux 2.2):
  ✣ User root (user id = 0) can do everything
  ✣ Regular users are limited
✤ Now: processes have capabilities
  ✣ Configuring the network: CAP_NET_ADMIN
  ✣ Mounting a filesystem: CAP_SYS_ADMIN
  ✣ Creating a block device: CAP_MKNOD
  ✣ etc.
✤ 37 different capabilities today
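As a small illustration (mine, not from the slides): dropping one capability from the bounding set with prctl(). After this, no process in this subtree can regain the right to configure the network, even if it runs as root. Dropping from the bounding set itself requires CAP_SETPCAP.

#define _GNU_SOURCE
#include <linux/capability.h>
#include <stdio.h>
#include <sys/prctl.h>

int main(void)
{
    /* Remove CAP_NET_ADMIN from the bounding set of this process. */
    if (prctl(PR_CAPBSET_DROP, CAP_NET_ADMIN, 0, 0, 0) == -1) { perror("prctl"); return 1; }

    /* Check: PR_CAPBSET_READ now returns 0 for CAP_NET_ADMIN. */
    int still_there = prctl(PR_CAPBSET_READ, CAP_NET_ADMIN, 0, 0, 0);
    printf("CAP_NET_ADMIN still in bounding set: %d\n", still_there);
    return 0;
}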
Capabilities (2/3)
✤ Each process has several capability sets:
  ✣ Permitted
  ✣ Inheritable
  ✣ Effective
  ✣ Bounding set
  ✣ Ambient (soon!)
Capabilities (3/3)
Other security mechanisms:
✤ Mandatory Access Control (MAC) with Linux Security Modules (LSMs):
  ✣ SELinux
  ✣ AppArmor…
✤ seccomp
Isolator: memory and cpu
✤ based on cgroups
cgroups
What’s a control group (cgroup)?
✤ group processes together
✤ organised in trees
✤ applying limits to them as a group
cgroup API
✤ /sys/fs/cgroup/*/
✤ /proc/cgroups
✤ /proc/$PID/cgroup
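A tiny sketch of the last entry: reading /proc/self/cgroup shows, one line per hierarchy, which cgroup the current process belongs to.

#include <stdio.h>

int main(void)
{
    /* One line per hierarchy: "<id>:<controllers>:<path>". */
    FILE *f = fopen("/proc/self/cgroup", "r");
    if (!f) { perror("/proc/self/cgroup"); return 1; }

    char line[512];
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);

    fclose(f);
    return 0;
}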
List of cgroup controllers
/sys/fs/cgroup/
├─ cpu
├─ devices
├─ freezer
├─ memory
├─ ...
└─ systemd
How systemd units use cgroups
/sys/fs/cgroup/
├─ systemd
│  ├─ user.slice
│  ├─ system.slice
│  │  ├─ NetworkManager.service
│  │  │  └─ cgroup.procs
│  │  ...
│  └─ machine.slice
How systemd units use cgroups with containers
/sys/fs/cgroup/
├─ systemd
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
├─ cpu
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
├─ memory
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
...
cgroups mounted in the container
/sys/fs/cgroup/
├─ systemd (RO)
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service (RW)
├─ cpu
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
├─ memory
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
...
The hierarchies are mounted read-only in the container, except the pod’s own subtree (app.service), which stays read-write.
Memory isolator
Application Image Manifest: “limit”: “500M”
→ systemd service file:
[Service]
ExecStart=
MemoryLimit=500M
→ systemd action: write to memory.limit_in_bytes
CPU isolator
Application Image Manifest: “limit”: “500m”
→ systemd service file:
[Service]
ExecStart=
CPUShares=512
→ systemd action: write to cpu.shares
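A hedged sketch of the last step on both slides: what MemoryLimit=500M and CPUShares=512 boil down to, namely writing into the app’s cgroup knobs. The cgroup paths below are assumptions for illustration; on a real host systemd creates them under machine.slice/machine-rkt….scope/system.slice/.

#include <stdio.h>

/* Hypothetical paths; the real ones live under machine.slice/machine-rkt….scope/... */
#define MEM_CG "/sys/fs/cgroup/memory/machine.slice/app.service"
#define CPU_CG "/sys/fs/cgroup/cpu/machine.slice/app.service"

static int write_value(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fputs(value, f);
    return fclose(f);
}

int main(void)
{
    /* MemoryLimit=500M -> memory.limit_in_bytes (500 * 1024 * 1024 bytes). */
    write_value(MEM_CG "/memory.limit_in_bytes", "524288000");

    /* CPUShares=512 -> cpu.shares. */
    write_value(CPU_CG "/cpu.shares", "512");
    return 0;
}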
Unified cgroup hierarchy
✤ Multiple hierarchies:
  ✣ one cgroup mount point for each controller (memory, cpu, etc.)
  ✣ flexible but complex
  ✣ cannot remount with a different set of controllers
  ✣ difficult to give to containers in a safe way
✤ Unified hierarchy:
  ✣ cgroup filesystem mounted only one time
  ✣ still in development in Linux: mount with option “__DEVEL__sane_behavior”
  ✣ initial implementation in systemd v226 (September 2015)
  ✣ no support in rkt yet
Isolator: network
✤ limit the network bandwidth
✤ cgroup controller “net_cls” to tag packets emitted by a process
✤ iptables / traffic control to apply limits on tagged packets
✤ open question: allocation of tags?
✤ not implemented in rkt yet
Isolator: disk quotas
Disk quotas — not implemented in rkt
✤ loop device
✤ btrfs subvolumes
  ✣ systemd-nspawn can use them
✤ per user and group quotas
  ✣ not suitable for containers
✤ per project quotas: in xfs and soon in ext4
  ✣ open question: allocation of project id?
Conclusion
We talked about:
✤ the isolation provided by rkt
✤ namespaces
✤ cgroups
✤ how rkt uses the namespace & cgroup API
Thanks
CC-BY-SA Thanks Chris for the theme!