Efficient Memory Mapped File I/O for In-Memory File Systems
Jungsik Choi†, Jiwon Kim, Hwansoo Han
Storage Latency Close to DRAM
[Figure: operations per second vs. storage latency — SATA HDD (milliseconds), SATA/SAS Flash SSD (~100 μs), PCIe Flash SSD (~60 μs), 3D-XPoint & Optane (~10 μs), NVDIMM-N (~10 ns)]
Large Overhead in Software
• Existing OSes were designed for fast CPUs and slow block devices
• With low-latency storage, SW overhead becomes the largest burden
• Software overhead includes
  – Complicated I/O stacks
  – Redundant memory copies
  – Frequent user/kernel mode switches
• SW overhead must be addressed to fully exploit low-latency NVM storage
[Figure: latency decomposition (normalized) into software, storage, and NIC components for HDD, NVMe SSD, and PCM systems with 10Gb/100Gb NICs; the software share grows to 77–82% in the fastest configurations (Source: Ahn et al., MICRO-48)]
Eliminate SW Overhead with mmap
• In recent studies, memory mapped I/O has been commonly proposed
  – Memory mapped I/O can expose the NVM storage's raw performance
• Mapping files onto user memory space
  – Provide memory-like access
  – Avoid complicated I/O stacks
  – Minimize user/kernel mode switches
  – No data copies
• A mmap syscall will be a critical interface (sketched below)
[Figure: traditional I/O (app → r/w syscalls → file system → device driver → disk/flash) vs. memory mapped I/O (app → load/store instructions → persistent memory), with the user/kernel boundary marked]
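To make the memory mapped path concrete, below is a minimal sketch (not from the talk; the file path is a hypothetical Ext4-DAX mount) that maps a file once and then reads it with plain load instructions, with no per-access syscalls and no data copies:

    /* Minimal sketch of memory mapped file I/O (illustrative path). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/pmem/data.bin", O_RDONLY);   /* hypothetical DAX mount */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file into the user address space. */
        char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* Access via load instructions -- no read() syscalls, no data copies. */
        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i += 4096)
            sum += (unsigned char)buf[i];

        printf("checksum: %lu\n", sum);
        munmap(buf, st.st_size);
        close(fd);
        return 0;
    }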
Memory Mapped I/O is Not as Fast as Expected
• Microbenchmark: sequential read of a 4GB file (Ext4-DAX on NVDIMM-N)
  – read : read system calls
  – default-mmap : mmap without any special flags
  – populate-mmap : mmap with the MAP_POPULATE flag
[Figure: elapsed time (sec) decomposed into file access, page table entry construction, and page fault overhead; read ≈ 1.06 s, populate-mmap ≈ 1.02 s, default-mmap ≈ 1.42 s]
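As a rough illustration of the two mmap variants measured above (fd and len stand for an open file and its size, as in the earlier sketch):

    #define _GNU_SOURCE          /* for MAP_POPULATE on Linux */
    #include <stddef.h>
    #include <sys/mman.h>

    /* Sketch of the two mmap configurations compared in the microbenchmark. */
    static void *map_file(int fd, size_t len, int populate)
    {
        /* default-mmap  (populate == 0): pages are mapped lazily; every first
         *                touch takes a page fault and builds one PTE.
         * populate-mmap (populate != 0): MAP_POPULATE pre-faults the whole
         *                range at mmap() time, so later loads avoid faults. */
        int flags = MAP_SHARED | (populate ? MAP_POPULATE : 0);
        return mmap(NULL, len, PROT_READ, flags, fd, 0);
    }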
Overhead in Memory Mapped File I/O
• Memory mapped I/O can avoid the SW overhead of traditional file I/O
• However, memory mapped I/O introduces its own SW overhead
  – Page fault overhead
  – TLB miss overhead
  – PTE construction overhead
• This overhead diminishes the advantages of memory mapped file I/O
Techniques to Alleviate Storage Latency
• Readahead ⇒ Map-ahead
  – Preload pages that are expected to be accessed
• Page cache ⇒ Mapping cache
  – Cache frequently accessed pages
• fadvise/madvise interfaces ⇒ Extended madvise
  – Utilize user hints to manage pages
• However, these existing techniques can't be used in memory based FSs
  ⇒ New optimizations are needed in memory based FSs
[Figure: app issuing user hints (madvise/fadvise) to the page cache and readahead in main memory, backed by secondary storage]
Map-ahead Constructs PTEs in Advance
• When a page fault occurs, the page fault handler handles
  – The page that caused the fault (existing demand paging)
  – Pages that are expected to be accessed next (map-ahead)
• Kernel analyzes page fault patterns to predict pages to be accessed (sketched below)
  – Sequential faults : map-ahead window size ↑
  – Random faults : map-ahead window size ↓
• Map-ahead can reduce the number of page faults
[Figure: fault-pattern state machine with PURE_SEQ (window size ↑), NEARLY_SEQ, and NON_SEQ (window size ↓) states, with transitions on sequential (seq) and random (rnd) faults]
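A rough sketch of the window-sizing heuristic described above, written as ordinary C for illustration; the constants, field names, and doubling policy are assumptions, not the actual kernel implementation:

    #include <stddef.h>

    /* Hypothetical bounds for the map-ahead window (in pages). */
    #define WIN_MIN 1
    #define WIN_MAX 512

    struct mapahead_state {
        unsigned long next_seq_pfn;   /* page expected next if access is sequential */
        size_t        winsize;        /* current map-ahead window size (pages)      */
    };

    /* Called on each page fault; returns how many pages, starting at the
     * faulting page, should be mapped ahead. */
    static size_t mapahead_window(struct mapahead_state *st, unsigned long fault_pfn)
    {
        if (fault_pfn == st->next_seq_pfn) {
            /* Sequential fault pattern: grow the window (toward PURE_SEQ). */
            if (st->winsize * 2 <= WIN_MAX)
                st->winsize *= 2;
        } else {
            /* Random fault pattern: shrink back toward demand paging (NON_SEQ). */
            st->winsize = WIN_MIN;
        }
        st->next_seq_pfn = fault_pfn + st->winsize;
        return st->winsize;
    }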
Comparison of Demand Paging & Map-ahead
• Existing demand paging: every first access to a page (a–f) traps into the page fault handler, paying user/kernel mode switch, page fault, and TLB miss overhead each time
• Map-ahead: the page fault handler maps not only the faulting page but also the pages within the map-ahead window (window size == 5), so most subsequent accesses proceed without faulting
[Figure: execution timelines of the application and the page fault handler under demand paging (one fault per page) and under map-ahead (one fault maps the whole window), illustrating the resulting performance gain]
Async Map-ahead Hides Mapping Overhead
• Sync map-ahead: the page fault handler maps the whole map-ahead window (window size == 5) before returning to the application
• Async map-ahead: the handler maps only the pages needed right away and delegates the rest of the window to kernel threads, overlapping mapping work with application execution for an additional performance gain
[Figure: timelines of the application, the page fault handler, and kernel threads; in the async case the kernel threads construct the remaining mappings in the background]
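The async variant belongs to the kernel, but the idea — building mappings in the background while the application keeps running — can be illustrated with a user-space analog in which a helper thread pre-touches upcoming pages of an mmaped region (purely illustrative, not the paper's mechanism):

    #include <pthread.h>
    #include <stddef.h>

    /* User-space analog of async map-ahead: a helper thread touches each page
     * of an already mmaped region so its PTEs are built before the main
     * thread gets there, hiding most of the page fault overhead. */
    struct prefault_arg { volatile const char *base; size_t len; };

    static void *prefault_thread(void *p)
    {
        struct prefault_arg *a = p;
        for (size_t off = 0; off < a->len; off += 4096)
            (void)a->base[off];                 /* touch one byte per page */
        return NULL;
    }

    /* Usage sketch (buf/len from a prior mmap call):
     *   struct prefault_arg a = { buf, len };
     *   pthread_t t;
     *   pthread_create(&t, NULL, prefault_thread, &a);
     *   ... main thread reads buf concurrently ...
     *   pthread_join(t, NULL);
     */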
Extended madvise Makes User Hints Available
• Taking advantage of user hints can maximize the app's performance
  – However, the existing interfaces only optimize paging
• Extended madvise makes user hints available in memory based FSs (usage sketched below)
  – MADV_SEQUENTIAL, MADV_WILLNEED : run readahead aggressively ⇒ run map-ahead aggressively
  – MADV_RANDOM : stop readahead ⇒ stop map-ahead
  – MADV_DONTNEED : release pages in the page cache ⇒ release mappings in the mapping cache
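Because extended madvise reuses the existing madvise interface, an application issues exactly the calls it would today; a small sketch (addr and len come from an mmap call):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Sketch: hinting the kernel about a memory mapped file region. */
    static void advise_region(void *addr, size_t len, int sequential)
    {
        if (sequential)
            /* On a memory based FS with extended madvise: run map-ahead
             * aggressively (readahead on a traditional file system). */
            madvise(addr, len, MADV_SEQUENTIAL);
        else
            /* Random access: stop map-ahead / readahead. */
            madvise(addr, len, MADV_RANDOM);
    }

    /* When the region is no longer needed, MADV_DONTNEED lets the kernel
     * release the pages (page cache) or the mappings (mapping cache):
     *   madvise(addr, len, MADV_DONTNEED);
     */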
Mapping Cache Reuses Mapping Data
• When munmap is called, the existing kernel releases the VMA and PTEs
  – In memory based FSs, this mapping overhead is very large
• With the mapping cache, mapping overhead can be reduced (see the analog sketched below)
  – VMAs and PTEs are kept in the mapping cache instead of being released
  – When mmap is called again, they are reused if possible (cache hit)
[Figure: process address space with mmaped regions, VMAs in the mm_struct red-black tree, and page table entries (VA → PA) pointing to physical memory — the structures retained by the mapping cache]
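The mapping cache itself is a kernel change and needs no application modification. To make the caching idea concrete, here is a hypothetical user-level analog that keeps the most recently unmapped region and hands it back when the same file and length are mapped again (single-entry, not thread safe; purely illustrative):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Hypothetical single-entry cache of a mapped region, keyed by (fd, len). */
    struct map_cache { int fd; size_t len; void *addr; };
    static struct map_cache cache = { -1, 0, NULL };

    static void *cached_mmap(int fd, size_t len)
    {
        if (cache.addr && cache.fd == fd && cache.len == len) {
            void *addr = cache.addr;        /* "cache hit": reuse the mapping */
            cache.addr = NULL;
            return addr;
        }
        return mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    }

    static void cached_munmap(int fd, void *addr, size_t len)
    {
        if (cache.addr)                     /* evict the previously cached entry */
            munmap(cache.addr, cache.len);
        cache.fd = fd; cache.len = len; cache.addr = addr;   /* keep, don't unmap */
    }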
Experimental Environment
• Experimental machine
  – Intel Xeon E5-2620
  – 32GB DRAM, 32GB NVDIMM-N (Netlist DDR4 NVDIMM-N, 16GB × 2)
  – Linux kernel 4.4
  – Ext4-DAX filesystem
• Benchmarks
  – fio : sequential & random read (a possible job file is sketched below)
  – YCSB on MongoDB : load workload
  – Apache HTTP server with httperf
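The fio experiments on the following slides could be driven by a job file along these lines; the parameters here are guesses for illustration and are not taken from the paper:

    ; hypothetical fio job for the sequential read experiment
    [seqread-mmap]
    ; access the file through mmap rather than read()/write()
    ioengine=mmap
    ; sequential read; use rw=randread for the random read experiment
    rw=read
    filename=/mnt/pmem/testfile
    ; 4GB file on Ext4-DAX backed by NVDIMM-N
    size=4g
    ; record size, varied over 4k / 8k / 16k / 32k / 64k in the plots
    bs=4k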
fio : Sequential Read
[Figure: bandwidth (GB/s) vs. record size (4, 8, 16, 32, 64 KB) for default-mmap, populate-mmap, map-ahead, and extended madvise; annotations mark gains of 23%, 38%, and 46%, with ~16 million page faults for default-mmap versus ~13–14 thousand with map-ahead]
fio : Random Read
[Figure: bandwidth (GB/s) vs. record size (4, 8, 16, 32, 64 KB) for default-mmap, populate-mmap, map-ahead, and extended madvise; with madvise(MADV_RANDOM), extended madvise behaves the same as default-mmap]
Web Server Experimental Setting
• Apache HTTP server
  – Memory mapped file I/O
  – 10 thousand 1MB-size HTML files (from Wikipedia)
  – Total size is about 10GB
• httperf clients
  – 8 additional machines
  – Zipf-like distribution
  – 1Gbps switch, 24 ports (iperf: 8,800Mbps aggregate bandwidth)
[Figure: Apache HTTP server connected to the httperf client machines through the switch]
Apache HTTP Server
[Figure: CPU cycles (trillions, 10^12) spent in the kernel, libphp, libc, and others at request rates of 600–1000 reqs/s, comparing no mapping cache, a 2GB mapping cache, and an unlimited mapping cache]
Conclusion
• SW latency is becoming larger than storage latency
  – Memory mapped file I/O can avoid this SW overhead
• Memory mapped file I/O still incurs expensive additional overhead
  – Page fault, TLB miss, and PTE construction overhead
• To exploit the benefits of memory mapped I/O, we propose
  – Map-ahead, extended madvise, and mapping cache
• Our techniques demonstrate good performance by mitigating the mapping overhead
  – Map-ahead : 38% ↑
  – Map-ahead + extended madvise : 23% ↑
  – Mapping cache : 18% ↑
QnA
[email protected]