Understanding Manycore Scalability of File Systems
Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim
Application must parallelize I/O operations
● Death of single-core CPU scaling
  – CPU clock frequency: 3–3.8 GHz
  – # of physical cores: up to 24 (Xeon E7 v4)
● From mechanical HDD to flash SSD
  – IOPS of a commodity SSD: 900K
  – Non-volatile memory (e.g., 3D XPoint): 1,000x ↑
● But file systems become a scalability bottleneck
Problem: Lack of understanding of internal scalability behavior
● Exim mail server on a RAMDISK: an embarrassingly parallel application!
● [Figure: Exim throughput (messages/sec, 0–14k) vs. #core (0–80) for btrfs, ext4, F2FS, and XFS; three behaviors appear: 1. saturated, 2. collapsed, 3. never scales]
● Test machine: Intel 80-core machine (8-socket, 10-core Xeon E7-8870), 512GB RAM, 1TB SSD, 7200 RPM HDD
Even on a slower storage medium, the file system becomes a bottleneck
● [Figure: Exim throughput (messages/sec, 0–12k) at 80 cores on RAMDISK, SSD, and HDD for btrfs, ext4, F2FS, and XFS]
Outline
● Background
● FxMark design
  – A file system benchmark suite for manycore scalability
● Analysis of five Linux file systems
● Pilot solution
● Related work
● Summary
Research questions
● Which file system operations are not scalable?
● Why are they not scalable?
● Is it a problem of implementation or of design?
Technical challenges
● Applications are usually stuck on a few bottlenecks
  → cannot see the next level of bottlenecks before resolving the current ones
  → difficult to understand the overall scalability behavior
● How can we systematically stress file systems to understand their scalability behavior?
FxMark: evaluate & analyze manycore scalability of file systems
● FxMark: 19 micro-benchmarks + 3 applications
● File systems: tmpfs (memory FS), ext4 (journaling FS, with and without a journal: J/NJ), XFS (journaling FS), btrfs (CoW FS), F2FS (log FS)
● Storage media: RAMDISK, SSD, HDD
● # cores: 1, 2, 4, 10, 20, 30, 40, 50, 60, 70, 80
● In total: >4,700 experiments
Microbenchmarks: unveil hidden scalability bottlenecks
● Example: data block read at three sharing levels (R = read operation)
  – Low: each process reads a data block in its own private file
  – Medium: each process reads a private data block in a single shared file
  – High: all processes read the same data block in a single shared file
● Each benchmark stresses a different file system component at a different sharing level
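To make the sharing levels concrete, here is a minimal per-core worker sketch in C. The file names, the 4KB block size, and the iteration count are illustrative assumptions, not FxMark's actual code; a driver would pin one such worker on each core and measure operations per second.

    /*
     * Minimal sketch of the three data-block-read sharing levels.
     * File names, block size, and iteration count are illustrative
     * assumptions, not FxMark's implementation.
     */
    #define _XOPEN_SOURCE 700
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096
    #define ITERATIONS 1000000L

    int open_or_die(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror(path); exit(1); }
        return fd;
    }

    /* Low sharing: each core repeatedly reads a block of its own private file. */
    void read_low(int core_id)
    {
        char path[64], buf[BLOCK_SIZE];
        snprintf(path, sizeof(path), "private-%d.dat", core_id);
        int fd = open_or_die(path);
        for (long i = 0; i < ITERATIONS; i++)
            pread(fd, buf, BLOCK_SIZE, 0);               /* no cross-core sharing */
        close(fd);
    }

    /* Medium sharing: one shared file, but each core reads its own block. */
    void read_medium(int core_id)
    {
        char buf[BLOCK_SIZE];
        int fd = open_or_die("shared.dat");
        for (long i = 0; i < ITERATIONS; i++)
            pread(fd, buf, BLOCK_SIZE, (off_t)core_id * BLOCK_SIZE);
        close(fd);
    }

    /* High sharing: every core reads the very same block of the shared file. */
    void read_high(int core_id)
    {
        (void)core_id;
        char buf[BLOCK_SIZE];
        int fd = open_or_die("shared.dat");
        for (long i = 0; i < ITERATIONS; i++)
            pread(fd, buf, BLOCK_SIZE, 0);               /* all cores hit one page */
        close(fd);
    }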
Evaluation: data block read, low sharing level (DRBL)
● [Figure: DRBL throughput (M ops/sec, 0–250) vs. #core (0–80); legend: btrfs, ext4, ext4NJ, F2FS, tmpfs, XFS]
● All file systems show linear scalability
Outline
● Background
● FxMark design
● Analysis of five Linux file systems
  – What are the scalability bottlenecks?
● Pilot solution
● Related work
● Summary
Summary of results: file systems are not scalable
● [Figure: throughput vs. #core (0–80) for all 19 microbenchmarks (DRBL, DRBM, DRBH, DWOL, DWOM, DWAL, DWTL, DWSL, MRPL, MRPM, MRPH, MRDL, MRDM, MWCL, MWCM, MWUL, MWUM, MWRL, MWRM), their O_DIRECT variants (DRBL, DRBM, DWOL, DWOM), and the three applications (Exim, RocksDB, DBENCH); legend: btrfs, ext4, ext4NJ, F2FS, tmpfs, XFS]
Data block read
● Low (DRBL): all file systems scale linearly
● Medium (DRBM): XFS shows performance collapse
● High (DRBH): all file systems show performance collapse
● [Figure: DRBL, DRBM, and DRBH throughput (M ops/sec) vs. #core (0–80)]
Page cache is maintained for efficient access to file data
● Cache-miss read path in the OS kernel: 1. read a file block; 2. look up the page cache; 3. cache miss; 4. read the page from disk; 5. copy the page to the reader
Page cache hit
● Cache-hit read path: 1. read a file block; 2. look up the page cache; 3. cache hit; 4. copy the page to the reader
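The two paths above can be summarized in a short, self-contained C sketch. The toy direct-mapped cache, its size, and the fake disk read are illustrative assumptions; this is not the Linux page cache implementation.

    /* Minimal, self-contained sketch of the lookup-then-read flow described
     * above, using a toy direct-mapped cache. Everything here (types, sizes,
     * the fake disk) is illustrative; it is not the Linux page cache. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE   4096
    #define CACHE_SLOTS 1024

    struct cached_page {
        bool          valid;
        unsigned long index;                 /* file page index */
        char          data[PAGE_SIZE];
    };

    static struct cached_page page_cache[CACHE_SLOTS];

    /* Stand-in for a real disk read (step 4 on a miss). */
    static void disk_read(unsigned long index, char *buf)
    {
        memset(buf, (int)(index & 0xff), PAGE_SIZE);
    }

    /* Look up the cache, fall back to "disk" on a miss, copy to the caller. */
    static void read_file_page(unsigned long index, char *user_buf)
    {
        struct cached_page *slot = &page_cache[index % CACHE_SLOTS];

        if (!slot->valid || slot->index != index) {   /* cache miss */
            disk_read(index, slot->data);             /* read page from disk */
            slot->index = index;
            slot->valid = true;
        }
        memcpy(user_buf, slot->data, PAGE_SIZE);      /* copy page to reader */
    }

    int main(void)
    {
        char buf[PAGE_SIZE];
        read_file_page(7, buf);   /* miss: goes to the fake disk */
        read_file_page(7, buf);   /* hit: served from the cache */
        printf("first byte: %d\n", buf[0]);
        return 0;
    }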
Page cache pages can be evicted to reclaim free memory
… but only pages that are not being accessed
● Reference counting is used to track the number of tasks accessing a page:

access_a_page(...) {
    atomic_inc(&page->_count);   /* pin the page while it is being accessed */
    ...
    atomic_dec(&page->_count);   /* unpin it when done */
}
Reference counting becomes a scalability bottleneck
● In DRBH, every core executes the atomic_inc()/atomic_dec() pair above on the same page's _count
● [Figure: DRBH throughput (M ops/sec) and CPI (cycles-per-instruction) vs. #core (0–80)]
● High contention on a single page reference counter → huge memory stalls
● Many more shared objects behave the same way: directory entry cache, XFS inode, etc.
Lessons learned
● High locality can cause performance collapse
● Cache hits should be scalable
  → when cache hits are dominant, the scalability of the hit path is what matters (see the sketch below)
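One generic way to make a hit path scale is to avoid a single contended counter, for example with per-core counters that are only summed on the rare slow path. The sketch below is a generic illustration of that idea; it is not the paper's proposal and not Linux code, and a real implementation also needs a way to block new references before the final sum can be trusted.

    /* Per-core ("sloppy") reference counter: each core updates only its own
     * cache-line-aligned slot, so the hot get/put path never contends on a
     * shared word. Illustrative only; not the paper's solution or Linux code. */
    #include <stdatomic.h>

    #define MAX_CORES 80
    #define CACHELINE 64

    struct percpu_slot {
        _Alignas(CACHELINE) atomic_long count;   /* one slot per core, padded */
    };

    struct percpu_ref {
        struct percpu_slot slot[MAX_CORES];
    };

    static inline void ref_get(struct percpu_ref *r, int core)
    {
        atomic_fetch_add_explicit(&r->slot[core].count, 1, memory_order_relaxed);
    }

    static inline void ref_put(struct percpu_ref *r, int core)
    {
        atomic_fetch_sub_explicit(&r->slot[core].count, 1, memory_order_relaxed);
    }

    /* Slow path (e.g., before eviction): sum all per-core slots. A real
     * implementation must first stop new ref_get() calls for this to be safe. */
    static inline long ref_total(struct percpu_ref *r)
    {
        long sum = 0;
        for (int c = 0; c < MAX_CORES; c++)
            sum += atomic_load_explicit(&r->slot[c].count, memory_order_relaxed);
        return sum;
    }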
Data block overwrite
● Low (DWOL): ext4, F2FS, and btrfs show performance collapse
● Medium (DWOM): all file systems degrade gradually
● [Figure: DWOL and DWOM throughput (M ops/sec) vs. #core (0–80)]
Btrfs is a copy-on-write (CoW) file system
● It redirects a write to a block to a new copy of that block
  → never overwrites a block in place
  → maintains multiple versions of the file system image
● Consequently, CoW triggers a disk block allocation for every write
● Disk block allocation becomes a bottleneck
  – Similarly: ext4 → journaling, F2FS → checkpointing
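The toy block store below, a minimal self-contained sketch (not btrfs code), shows why every overwrite in a CoW design must pass through a shared allocator: the old block is never touched, so a new physical block has to be allocated and the logical-to-physical mapping updated on each write.

    /* Toy copy-on-write block store: every overwrite allocates a fresh block
     * and remaps the logical block to it, so the (shared) allocator is hit on
     * every write. Purely illustrative; not btrfs code. */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE   512
    #define NUM_BLOCKS   4096
    #define NUM_LOGICAL  128

    static char            store[NUM_BLOCKS][BLOCK_SIZE];
    static int              mapping[NUM_LOGICAL];          /* logical -> physical */
    static int              next_free = NUM_LOGICAL;
    static pthread_mutex_t  alloc_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Shared allocator: the serialization point every CoW write must cross. */
    static int alloc_block(void)
    {
        pthread_mutex_lock(&alloc_lock);
        int blk = next_free++ % NUM_BLOCKS;
        pthread_mutex_unlock(&alloc_lock);
        return blk;
    }

    /* Overwrite never touches the old block in place. */
    static void cow_write(int logical, const char *data)
    {
        int new_blk = alloc_block();                /* allocation on EVERY write */
        memcpy(store[new_blk], data, BLOCK_SIZE);   /* write the new copy */
        mapping[logical] = new_blk;                 /* publish the new version */
    }

    int main(void)
    {
        char data[BLOCK_SIZE] = "hello";
        for (int i = 0; i < NUM_LOGICAL; i++)
            mapping[i] = i;                         /* initial layout */
        cow_write(0, data);
        cow_write(0, data);                         /* second write: yet another block */
        printf("logical 0 now maps to physical %d\n", mapping[0]);
        return 0;
    }

A real allocator is far more sophisticated, but the structural point is the same: in-place-update file systems pay this cost mainly on appends, while CoW and log-structured file systems pay it on every overwrite.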
Lessons learned
● Overwriting can be as expensive as appending
  → critical for log-structured (F2FS) and CoW (btrfs) file systems
● Consistency guarantee mechanisms should be scalable
  → scalable journaling
  → scalable CoW index structures
  → parallel log-structured writing
Entire file is locked regardless of the update range
● All tested file systems hold a per-inode mutex for write operations
  – Range-based locking is not implemented:

***_file_write_iter(...) {
    mutex_lock(&inode->i_mutex);     /* serializes every writer of this file */
    ...
    mutex_unlock(&inode->i_mutex);
}
Lessons learned
● A file cannot be updated concurrently
  – Critical for VMs and DBMSs, which manage large files
● Need to consider techniques used in parallel file systems, e.g., range-based locking (see the sketch below)
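A minimal user-level sketch of the range-locking idea, assuming a fixed 1 MiB stripe granularity and a toy per-file stripe table; this is not how any of the tested kernels implement it.

    /* Range-based write locking sketch: instead of one i_mutex per file,
     * a writer locks only the stripe(s) covering its byte range, so
     * non-overlapping writes to the same file can proceed in parallel.
     * The striping scheme and sizes are illustrative assumptions. */
    #include <pthread.h>

    #define RANGE_SHIFT 20          /* lock granularity: 1 MiB */
    #define NUM_STRIPES 64          /* toy limit: the tail shares the last stripe */

    struct range_locked_file {
        pthread_mutex_t stripe[NUM_STRIPES];
    };

    static void range_file_init(struct range_locked_file *f)
    {
        for (int s = 0; s < NUM_STRIPES; s++)
            pthread_mutex_init(&f->stripe[s], NULL);
    }

    static int stripe_of(long off)
    {
        long s = off >> RANGE_SHIFT;
        return s < NUM_STRIPES ? (int)s : NUM_STRIPES - 1;   /* clamp the tail */
    }

    /* Lock only the stripes covering [off, off+len); ascending order avoids deadlock. */
    static void range_lock(struct range_locked_file *f, long off, long len)
    {
        for (int s = stripe_of(off); s <= stripe_of(off + len - 1); s++)
            pthread_mutex_lock(&f->stripe[s]);
    }

    static void range_unlock(struct range_locked_file *f, long off, long len)
    {
        for (int s = stripe_of(off + len - 1); s >= stripe_of(off); s--)
            pthread_mutex_unlock(&f->stripe[s]);
    }

With this scheme, two writers updating disjoint 1 MiB regions of the same file take different stripes and run in parallel, whereas the single i_mutex above serializes them.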
Summary of findings (see our paper for details)
● Many of the findings are unexpected and counter-intuitive:
● High locality can cause performance collapse
  → contention at the file system level
● Overwriting can be as expensive as appending
  → to maintain data dependencies
● A file cannot be updated concurrently
● All directory operations are sequential
● Renaming is system-wide sequential
● Metadata changes are not scalable
● Non-scalability often means wasted CPU cycles
● Scalability is not portable
Outline
● Background
● FxMark design
● Analysis of five Linux file systems
● Pilot solution
  – If we remove contention in a file system, does it become scalable?
● Related work
● Summary
RocksDB on a 60-partitioned RAMDISK scales better
● [Figure: RocksDB throughput (ops/sec, 0–700) vs. #core (0–80) on a single-partition RAMDISK vs. a 60-partition RAMDISK for btrfs, ext4, F2FS, tmpfs, and XFS; a 2.1x improvement is highlighted]
● Tested workload: DB_BENCH overwrite
● Reducing contention in the file system improves both performance and scalability
But partitioning makes performance worse on HDD
● [Figure: RocksDB throughput (ops/sec, 0–500) vs. #core (0–20) on a single-partition HDD vs. a 60-partition HDD for btrfs, ext4, F2FS, and XFS; a 2.7x drop is highlighted]
● Tested workload: DB_BENCH overwrite
● Reduced spatial locality degrades performance
  → medium-specific characteristics (e.g., spatial locality) should be considered
Related work
● Scaling operating systems
  – Mostly uses a memory file system to factor out the cost of I/O operations
● Scaling file systems
  – Scalable file system journaling
    ● ScaleFS [MIT MS Thesis '14]
    ● SpanFS [ATC '15]
  – Parallel log-structured writing on NVRAM
    ● NOVA [FAST '16]
Summary
● Comprehensive analysis of the manycore scalability of five widely-used file systems using FxMark
● Manycore scalability should be of utmost importance in file system design
● New challenges in scalable file system design
  – Minimizing contention, scalable consistency guarantees, spatial locality, etc.
● FxMark is open source
  – https://github.com/sslab-gatech/fxmark