Efficient Memory Mapped File I/O for In-Memory File Systems
Jungsik Choi†, Jiwon Kim, Hwansoo Han
Storage Latency Close to DRAM
[Figure: operations per second vs. storage latency — SATA HDD (milliseconds), SATA/SAS Flash SSD (~100 μs), PCIe Flash SSD (~60 μs), 3D-XPoint & Optane (~10 μs), NVDIMM-N (~10 ns)]
Large Overhead in Software
• Existing OSes were designed for fast CPUs and slow block devices
• With low-latency storage, SW overhead becomes the largest burden
• Software overhead includes
  – Complicated I/O stacks
  – Redundant memory copies
  – Frequent user/kernel mode switches
• SW overhead must be addressed to fully exploit low-latency NVM storage
[Figure: latency decomposition (normalized) into software, storage, and NIC components for HDD, NVMe SSD, and PCM systems with 10Gb/100Gb NICs; the software share grows to 77–82% in the fastest configurations (Source: Ahn et al., MICRO-48)]
Eliminate SW Overhead with mmap
• In recent studies, memory mapped I/O has been commonly proposed
  – Memory mapped I/O can expose the NVM storage's raw performance
• Mapping files onto user memory space
  – Provide memory-like access
  – Avoid complicated I/O stacks
  – Minimize user/kernel mode switches
  – No data copies
• A mmap syscall will be a critical interface (sketched below)
[Figure: traditional I/O (app → r/w syscalls → file system → device driver → disk/flash) vs. memory mapped I/O (app → load/store instructions → persistent memory), with the user/kernel boundary marked]
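To make the memory mapped path concrete, below is a minimal sketch (not from the talk; the file path is a hypothetical Ext4-DAX mount) that maps a file once and then reads it with plain load instructions, with no per-access syscalls and no data copies:

    /* Minimal sketch of memory mapped file I/O (illustrative path). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/pmem/data.bin", O_RDONLY);   /* hypothetical DAX mount */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file into the user address space. */
        char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* Access via load instructions -- no read() syscalls, no data copies. */
        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i += 4096)
            sum += (unsigned char)buf[i];

        printf("checksum: %lu\n", sum);
        munmap(buf, st.st_size);
        close(fd);
        return 0;
    }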
Memory Mapped I/O is Not as Fast as Expected
• Microbenchmark: sequential read of a 4GB file (Ext4-DAX on NVDIMM-N)
  – read : read system calls
  – default-mmap : mmap without any special flags
  – populate-mmap : mmap with the MAP_POPULATE flag
[Figure: elapsed time (sec) decomposed into file access, page table entry construction, and page fault overhead; read ≈ 1.06 s, populate-mmap ≈ 1.02 s, default-mmap ≈ 1.42 s]
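As a rough illustration of the two mmap variants measured above (fd and len stand for an open file and its size, as in the earlier sketch):

    #define _GNU_SOURCE          /* for MAP_POPULATE on Linux */
    #include <stddef.h>
    #include <sys/mman.h>

    /* Sketch of the two mmap configurations compared in the microbenchmark. */
    static void *map_file(int fd, size_t len, int populate)
    {
        /* default-mmap  (populate == 0): pages are mapped lazily; every first
         *                touch takes a page fault and builds one PTE.
         * populate-mmap (populate != 0): MAP_POPULATE pre-faults the whole
         *                range at mmap() time, so later loads avoid faults. */
        int flags = MAP_SHARED | (populate ? MAP_POPULATE : 0);
        return mmap(NULL, len, PROT_READ, flags, fd, 0);
    }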
Overhead in Memory Mapped File I/O
• Memory mapped I/O can avoid the SW overhead of traditional file I/O
• However, memory mapped I/O introduces its own SW overhead
  – Page fault overhead
  – TLB miss overhead
  – PTE construction overhead
• This overhead diminishes the advantages of memory mapped file I/O
Techniques to Alleviate Storage Latency
• Readahead ⇒ Map-ahead
  – Preload pages that are expected to be accessed
• Page cache ⇒ Mapping cache
  – Cache frequently accessed pages
• fadvise/madvise interfaces ⇒ Extended madvise
  – Utilize user hints to manage pages
• However, these existing techniques can't be used in memory based FSs
  ⇒ New optimizations are needed in memory based FSs
[Figure: app issuing user hints (madvise/fadvise) to the page cache and readahead in main memory, backed by secondary storage]
Map-ahead Constructs PTEs in Advance
• When a page fault occurs, the page fault handler handles
  – The page that caused the fault (existing demand paging)
  – Pages that are expected to be accessed next (map-ahead)
• Kernel analyzes page fault patterns to predict pages to be accessed (sketched below)
  – Sequential faults : map-ahead window size ↑
  – Random faults : map-ahead window size ↓
• Map-ahead can reduce the number of page faults
[Figure: fault-pattern state machine with PURE_SEQ (window size ↑), NEARLY_SEQ, and NON_SEQ (window size ↓) states, with transitions on sequential (seq) and random (rnd) faults]
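A rough sketch of the window-sizing heuristic described above, written as ordinary C for illustration; the constants, field names, and doubling policy are assumptions, not the actual kernel implementation:

    #include <stddef.h>

    /* Hypothetical bounds for the map-ahead window (in pages). */
    #define WIN_MIN 1
    #define WIN_MAX 512

    struct mapahead_state {
        unsigned long next_seq_pfn;   /* page expected next if access is sequential */
        size_t        winsize;        /* current map-ahead window size (pages)      */
    };

    /* Called on each page fault; returns how many pages, starting at the
     * faulting page, should be mapped ahead. */
    static size_t mapahead_window(struct mapahead_state *st, unsigned long fault_pfn)
    {
        if (fault_pfn == st->next_seq_pfn) {
            /* Sequential fault pattern: grow the window (toward PURE_SEQ). */
            if (st->winsize * 2 <= WIN_MAX)
                st->winsize *= 2;
        } else {
            /* Random fault pattern: shrink back toward demand paging (NON_SEQ). */
            st->winsize = WIN_MIN;
        }
        st->next_seq_pfn = fault_pfn + st->winsize;
        return st->winsize;
    }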
Comparison of Demand Paging & Map-ahead
• Existing demand paging: every first access to a page (a–f) traps into the page fault handler, paying user/kernel mode switch, page fault, and TLB miss overhead each time
• Map-ahead: the page fault handler maps not only the faulting page but also the pages within the map-ahead window (window size == 5), so most subsequent accesses proceed without faulting
[Figure: execution timelines of the application and the page fault handler under demand paging (one fault per page) and under map-ahead (one fault maps the whole window), illustrating the resulting performance gain]
Async Map-ahead Hides Mapping Overhead
• Sync map-ahead: the page fault handler maps the whole map-ahead window (window size == 5) before returning to the application
• Async map-ahead: the handler maps only the pages needed right away and delegates the rest of the window to kernel threads, overlapping mapping work with application execution for an additional performance gain
[Figure: timelines of the application, the page fault handler, and kernel threads; in the async case the kernel threads construct the remaining mappings in the background]
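The async variant belongs to the kernel, but the idea — building mappings in the background while the application keeps running — can be illustrated with a user-space analog in which a helper thread pre-touches upcoming pages of an mmaped region (purely illustrative, not the paper's mechanism):

    #include <pthread.h>
    #include <stddef.h>

    /* User-space analog of async map-ahead: a helper thread touches each page
     * of an already mmaped region so its PTEs are built before the main
     * thread gets there, hiding most of the page fault overhead. */
    struct prefault_arg { volatile const char *base; size_t len; };

    static void *prefault_thread(void *p)
    {
        struct prefault_arg *a = p;
        for (size_t off = 0; off < a->len; off += 4096)
            (void)a->base[off];                 /* touch one byte per page */
        return NULL;
    }

    /* Usage sketch (buf/len from a prior mmap call):
     *   struct prefault_arg a = { buf, len };
     *   pthread_t t;
     *   pthread_create(&t, NULL, prefault_thread, &a);
     *   ... main thread reads buf concurrently ...
     *   pthread_join(t, NULL);
     */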
Extended madvise Makes User Hints Available
• Taking advantage of user hints can maximize the app's performance
  – However, the existing interfaces only optimize paging
• Extended madvise makes user hints available in memory based FSs (usage sketched below)
  – MADV_SEQUENTIAL, MADV_WILLNEED : run readahead aggressively ⇒ run map-ahead aggressively
  – MADV_RANDOM : stop readahead ⇒ stop map-ahead
  – MADV_DONTNEED : release pages in the page cache ⇒ release mappings in the mapping cache
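Because extended madvise reuses the existing madvise interface, an application issues exactly the calls it would today; a small sketch (addr and len come from an mmap call):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Sketch: hinting the kernel about a memory mapped file region. */
    static void advise_region(void *addr, size_t len, int sequential)
    {
        if (sequential)
            /* On a memory based FS with extended madvise: run map-ahead
             * aggressively (readahead on a traditional file system). */
            madvise(addr, len, MADV_SEQUENTIAL);
        else
            /* Random access: stop map-ahead / readahead. */
            madvise(addr, len, MADV_RANDOM);
    }

    /* When the region is no longer needed, MADV_DONTNEED lets the kernel
     * release the pages (page cache) or the mappings (mapping cache):
     *   madvise(addr, len, MADV_DONTNEED);
     */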
Mapping Cache Reuses Mapping Data
• When munmap is called, the existing kernel releases the VMA and PTEs
  – In memory based FSs, this mapping overhead is very large
• With the mapping cache, mapping overhead can be reduced (see the analog sketched below)
  – VMAs and PTEs are kept in the mapping cache instead of being released
  – When mmap is called again, they are reused if possible (cache hit)
[Figure: process address space with mmaped regions, VMAs in the mm_struct red-black tree, and page table entries (VA → PA) pointing to physical memory — the structures retained by the mapping cache]
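The mapping cache itself is a kernel change and needs no application modification. To make the caching idea concrete, here is a hypothetical user-level analog that keeps the most recently unmapped region and hands it back when the same file and length are mapped again (single-entry, not thread safe; purely illustrative):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Hypothetical single-entry cache of a mapped region, keyed by (fd, len). */
    struct map_cache { int fd; size_t len; void *addr; };
    static struct map_cache cache = { -1, 0, NULL };

    static void *cached_mmap(int fd, size_t len)
    {
        if (cache.addr && cache.fd == fd && cache.len == len) {
            void *addr = cache.addr;        /* "cache hit": reuse the mapping */
            cache.addr = NULL;
            return addr;
        }
        return mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    }

    static void cached_munmap(int fd, void *addr, size_t len)
    {
        if (cache.addr)                     /* evict the previously cached entry */
            munmap(cache.addr, cache.len);
        cache.fd = fd; cache.len = len; cache.addr = addr;   /* keep, don't unmap */
    }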
Experimental Environment
• Experimental machine
  – Intel Xeon E5-2620
  – 32GB DRAM, 32GB NVDIMM-N (Netlist DDR4 NVDIMM-N, 16GB × 2)
  – Linux kernel 4.4
  – Ext4-DAX filesystem
• Benchmarks
  – fio : sequential & random read (a possible job file is sketched below)
  – YCSB on MongoDB : load workload
  – Apache HTTP server with httperf
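The fio experiments on the following slides could be driven by a job file along these lines; the parameters here are guesses for illustration and are not taken from the paper:

    ; hypothetical fio job for the sequential read experiment
    [seqread-mmap]
    ; access the file through mmap rather than read()/write()
    ioengine=mmap
    ; sequential read; use rw=randread for the random read experiment
    rw=read
    filename=/mnt/pmem/testfile
    ; 4GB file on Ext4-DAX backed by NVDIMM-N
    size=4g
    ; record size, varied over 4k / 8k / 16k / 32k / 64k in the plots
    bs=4k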
fio : Sequential Read
[Figure: bandwidth (GB/s) vs. record size (4, 8, 16, 32, 64 KB) for default-mmap, populate-mmap, map-ahead, and extended madvise; annotations mark gains of 23%, 38%, and 46%, with ~16 million page faults for default-mmap versus ~13–14 thousand with map-ahead]
fio : Random Read
[Figure: bandwidth (GB/s) vs. record size (4, 8, 16, 32, 64 KB) for default-mmap, populate-mmap, map-ahead, and extended madvise; with madvise(MADV_RANDOM), extended madvise behaves the same as default-mmap]
Web Server Experimental Setting
• Apache HTTP server
  – Memory mapped file I/O
  – 10 thousand 1MB-size HTML files (from Wikipedia)
  – Total size is about 10GB
• httperf clients
  – 8 additional machines
  – Zipf-like distribution
  – 1Gbps switch, 24 ports (iperf: 8,800Mbps aggregate bandwidth)
[Figure: Apache HTTP server connected to the httperf client machines through the switch]
Apache HTTP Server
[Figure: CPU cycles (trillions, 10^12) spent in the kernel, libphp, libc, and others at request rates of 600–1000 reqs/s, comparing no mapping cache, a 2GB mapping cache, and an unlimited mapping cache]
Conclusion
• SW latency is becoming larger than storage latency
  – Memory mapped file I/O can avoid this SW overhead
• Memory mapped file I/O still incurs expensive additional overhead
  – Page fault, TLB miss, and PTE construction overhead
• To exploit the benefits of memory mapped I/O, we propose
  – Map-ahead, extended madvise, and mapping cache
• Our techniques demonstrate good performance by mitigating the mapping overhead
  – Map-ahead : 38% ↑
  – Map-ahead + extended madvise : 23% ↑
  – Mapping cache : 18% ↑
QnA
[email protected]