Improvement of I/O Error Handling on Ext3 Filesystem

Hidehiro Kawai, Satoshi Oshima Hitachi Ltd., Systems Development Lab.

Copyright (c) 2009 Hitachi Ltd., Systems Development Lab. All rights reserved.

Agenda
• Introduction
• I/O Error handling improvements
• Next challenges
• Summary


Introduction
• Data protection mechanisms are important for enterprise systems
  – Data corruption costs more money and time than a system going down
• Also important for desktop users
  – Documents, photos, source code, etc.
• The hardware layer is equipped with a variety of protection mechanisms
  – RAID, ECC/CRC, DIF/DIX, MCA
• ...but what about the software layer?

Introduction
• A known data corruption problem attributable to software:
  the dirty pagecache page lost problem
  ① A buffered write leaves a dirty pagecache page
  ② Writeback to disk transiently fails (iSCSI link down, bad medium, ...),
    but the page is marked clean anyway
  ③ The clean page is reclaimed
  ④ A later read fetches old data from the disk; if the page is then
    modified and written back, possibly wrong data reaches the disk

Introduction
• To resolve the dirty page lost problem
  – How about leaving the page dirty when the write fails?
    That would cause another problem if the error turns out to be permanent
  – It's hard for me!
...by the way, are there similar problems in the ext3 file system, that is,
corruption caused by transient I/O errors? Yes, there were!

Agenda
• Introduction
• I/O Error handling improvements
  – Error on writing metadata
  – Error on logging metadata
  – Error on flushing file data
  – Error on updating inode block
• Next challenges
• Summary

About the Ext3 File System
• How does ext3 preserve its consistency?
  – A file operation changes several metadata blocks (inode, directory,
    indirect block, etc.) and sometimes file data blocks as well
  – To preserve consistency, at least the metadata blocks have to be
    updated atomically
  – To achieve this, these metadata blocks are first logged into the
    journal space, then written back to the file system
  – Multiple atomic file operations are bundled into a transaction,
    which is the unit of atomicity
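To make the bundling concrete, here is a minimal sketch of how one file
operation joins a JBD transaction (kernel-internal C using the real JBD API;
simplified, with error handling elided, so a sketch rather than ext3's
actual code):

    /* One atomic file operation, bracketed by journal_start/journal_stop.
     * Each modified metadata buffer is registered with the running
     * transaction, which is committed (and checkpointed) later. */
    handle_t *handle;

    handle = journal_start(journal, 3);        /* reserve 3 buffer credits */
    journal_get_write_access(handle, dir_bh);  /* declare intent to modify */
    /* ... modify the directory block in memory ... */
    journal_dirty_metadata(handle, dir_bh);    /* log it in this transaction */
    journal_stop(handle);                      /* done; commit happens later */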

Basic Journaling Mechanism
Logical image of the journal space (older → newer):

  [ Committed transaction | Committing transaction | Running transaction ]

• Committed transactions are ensured to be reflected to the fs,
  i.e. replayed at the next mount after a crash
• The running transaction is NOT reflected if the system crashes at this
  point; new updates are added here
  – e.g. $ mv old/foo new/bar adds "update dir 'old'" and
    "update dir 'new'" to the running transaction
• Journaled blocks are later written back to the fs (by pdflush or
  kjournald)

I/O Error handling improvements
• We found 4 data corruption cases caused by transient I/O errors:
  – Transient I/O error on writing metadata
  – Transient I/O error on logging metadata
  – Transient I/O error on flushing file data in data=ordered mode
  – Transient I/O error on updating inode block

Error on Writing Metadata
Type: Metadata corruption
How? Metadata blocks are written back to the fs after their transaction
commits, or while replaying the journal during recovery, but the write
fails due to a transient I/O error. The on-disk block is left with old or
random contents, yet it is released (at unmount, etc.) and a later read
succeeds, returning that old or random data as if it were valid.
Problem: Ext3 (JBD) didn't check for errors when writing metadata to
the fs.

Error on Writing Metadata: Fix 1
'After commit' case: when a metadata buffer is finally removed from the
journal space (released to make free space), check whether an I/O error
happened on it. If an error is found, don't release the buffer and abort
journaling. Because the buffer stays in the journal, the journal is
replayed at the next mount and the write is retried, this time
successfully (sketched below).
Note: a 'journal abort' causes a panic or a read-only remount, depending
on the mount option.
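A minimal sketch of the idea behind Fix 1, not the actual patch:
release_checkpointed_buffer() is a hypothetical name modeled on JBD's
checkpoint code, while buffer_write_io_error() and journal_abort() are real
kernel helpers:

    /* Abort journaling instead of silently releasing a buffer whose
     * writeback to the filesystem failed. */
    static int release_checkpointed_buffer(journal_t *journal,
                                           struct buffer_head *bh)
    {
            if (buffer_write_io_error(bh)) {
                    /* The metadata never reached the fs: keep the buffer
                     * in the journal so replay can retry, and abort. */
                    journal_abort(journal, -EIO);
                    return 0;   /* not released */
            }
            /* ... normal path: remove from the checkpoint list ... */
            return 1;           /* released */
    }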

Error on Writing Metadata: Fix 2
'Journal replay' case: check whether an I/O error happens during replay,
and if so, make the recovery (and therefore the mount) fail. Returning an
error when any replay write fails means the journal is replayed again at
the next mount attempt, until the writes succeed (sketched below).
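A sketch of the replay-side check (replay_one_block() is a hypothetical
name; the real logic lives in JBD's recovery pass, but the buffer-head
calls are the real APIs):

    /* Fail recovery if a replayed block cannot be written back. */
    static int replay_one_block(journal_t *journal, struct buffer_head *bh)
    {
            mark_buffer_dirty(bh);
            sync_dirty_buffer(bh);      /* synchronous write to the fs */
            if (!buffer_uptodate(bh))
                    return -EIO;        /* recovery, and the mount, fail */
            return 0;
    }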

Error on Logging Metadata
Type: Metadata corruption
How? A write of a metadata block to the journal space fails, but the
subsequent commit record is written successfully; journaling is then
aborted. The commit record says "please replay this transaction after an
unclean unmount", so at the next mount the journal replay overwrites the
on-disk metadata block with the garbage that was logged.
Problem: JBD tried to write the commit record even if the preceding
metadata logging had failed.

Error on Logging Metadata: Fix
When an I/O error is detected while logging metadata, abort journaling
before writing the commit record. A transaction without a commit record is
incomplete and is not replayed at the next mount, so the consistent state
is kept (sketched below).
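The commit path already waits for the log blocks before writing the commit
record, so the fix gates the record on their status. A simplified sketch
(write_commit_record() stands in for JBD's internal helper; the buffer-head
calls are real APIs):

    /* Only write the commit record if every log write succeeded. */
    int i, err = 0;

    for (i = 0; i < nr_log_bufs; i++) {
            wait_on_buffer(log_bufs[i]);        /* wait for log I/O */
            if (!buffer_uptodate(log_bufs[i]))
                    err = -EIO;                 /* logging failed */
    }
    if (err)
            journal_abort(journal, err);        /* no commit record written */
    else
            write_commit_record(journal, commit_transaction);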

Error on Flushing File Data
Type: File data corruption
How? An application issues a buffered write, and the writeback fails due to
a transient I/O error. After a while, the failed (now clean) page is
reclaimed. If the application doesn't check for the I/O error, old data can
be re-read from the disk, modified, and written back, putting possibly
wrong data on disk (the same sequence as the dirty pagecache page lost
problem in the Introduction).
Problem: the application didn't check for the I/O error via fsync(), etc.

Error on Flushing File Data: Approaches and Fix
Approach 1: the application checks for I/O errors via fsync(), etc., and
handles any error properly (see the example after this slide).
Approach 2: ext3/JBD checks for I/O errors on file data writes, and if any
are found, aborts journaling.
Fix: in data=ordered mode, all file data writes are tracked by a
transaction and guaranteed to be written out before the metadata blocks are
logged. So we changed ext3 to check whether those file data blocks had I/O
errors, and to abort journaling if so. But this causes another issue...
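For Approach 1, here is a minimal userspace sketch of what "check I/O
errors via fsync()" means in practice (plain POSIX C, nothing
ext3-specific; the filename is a placeholder):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("data.txt", O_WRONLY | O_CREAT, 0644);
            if (fd < 0) { perror("open"); return 1; }

            if (write(fd, "valuable data\n", 14) != 14) {
                    perror("write");    /* immediate failure */
                    return 1;
            }
            /* A buffered write can "succeed" yet still fail later during
             * writeback; fsync() is where that deferred I/O error (EIO)
             * surfaces, so check it before trusting the data is on disk. */
            if (fsync(fd) < 0) {
                    perror("fsync");    /* handle the error here */
                    return 1;
            }
            return close(fd) == 0 ? 0 : 1;
    }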

Another Issue
Problem: Andrew Morton asked, "Does any other filesystem driver turn the fs
read-only on the first write-IO-error? It seems like a big policy change
to me."
Improvement: add a new mount option for ext3 ordered mode (usage example
below):
  data_err=ignore  Just printk on a file data error (default)
  data_err=abort   Abort journaling on a file data error
NOTE: the effect of a journal abort depends on the 'errors' mount option:
  errors=continue    Make the fs read-only if journaling has been aborted
  errors=remount-ro  Make the fs read-only
  errors=panic       Cause a kernel panic
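For example, to opt in to the stricter behavior (the device and mount point
below are placeholders):

$ mount -o data=ordered,data_err=abort,errors=remount-ro /dev/sdb1 /mnt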

Error on Updating Inode
Type: Metadata corruption
How? A write of an inode block to the fs fails; later, a new inode is added
to the same block. Because the failed write cleared the buffer's uptodate
flag, the old inode block is re-read from the disk, the new entry is added
to that stale block, and the stale (bad) version is logged into the journal
space by the running transaction.
Problem: the in-memory data is actually up to date, but the fs treats it as
stale because the uptodate flag was cleared by the previous I/O error.

Error on Updating Inode: Fix
Regard an in-memory inode block as uptodate if it carries a write error
flag (BH_Write_EIO), even if it is no longer marked uptodate
(!BH_Uptodate). The new entry is then added without reloading stale data
from the disk, and the correct new block is logged by the running
transaction (sketched below).
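A sketch of the check, using the real buffer-head flag helpers (simplified
from ext3's inode-block lookup path):

    /* Trust in-memory contents whose writeback failed instead of
     * re-reading stale data from disk. */
    struct buffer_head *bh = sb_getblk(sb, block_nr);

    if (buffer_write_io_error(bh)) {
            /* The block was valid in memory; only the writeback failed.
             * Re-reading it would resurrect the old inode block. */
            set_buffer_uptodate(bh);
    }
    if (!buffer_uptodate(bh)) {
            /* genuinely not cached: read it from the disk as usual */
            ll_rw_block(READ, 1, &bh);
            wait_on_buffer(bh);
    }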

Agenda
• Introduction
• I/O Error handling improvements
• Next challenges
  – Generic approach to prevent data corruption
  – Pseudo I/O error from memory error recovery
• Summary

Next Challenges
• Prevent file data corruption at the VFS layer
  – Our work depended heavily on a specific file system
    (ext3/4 with data=ordered)
  – A generic way would be better
Idea (sketched below)
  – Re-dirty a failed page as long as the total number of failed pages
    stays under a given threshold
    • The rewrite may succeed
    • Unlimited re-dirtying could hang the system
  ...but this is still risky, because rewrite I/Os can get stuck in queues
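A rough sketch of the re-dirty idea at writeback-error time (purely
illustrative: failed_pages, REDIRTY_LIMIT, and handle_writeback_error() are
hypothetical names, not an existing kernel interface):

    /* Retry failed writeback by re-dirtying, but bound the number of
     * failed pages so a permanent error cannot pin unbounded memory. */
    static atomic_t failed_pages = ATOMIC_INIT(0);
    #define REDIRTY_LIMIT 128

    static void handle_writeback_error(struct page *page)
    {
            if (atomic_inc_return(&failed_pages) <= REDIRTY_LIMIT)
                    set_page_dirty(page);   /* keep the data; retry later */
            else
                    SetPageError(page);     /* give up; surface the error */
    }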

Next Challenges
• Address apparent transient I/O errors caused by the hwpoison feature
What is the hwpoison feature?
  – Mostly developed by Andi Kleen
  – Available since 2.6.32-rc1
  – Utilizes a new RAS feature of Nehalem-EX, which enables recovery from
    uncorrected memory errors(*), with or without some side effects
*: Until now, a panic and reboot was the only thing we could do when we got
an uncorrected error

Next Challenges
How to recover
• If we get an uncorrected error on a page, we cannot access the page
  anymore because it holds corrupted data, so we isolate it
• memory_failure() tries to isolate the page:
  – without any side effects (e.g. a clean pagecache page)
  – by killing processes (e.g. an anon page, an mmap'ed page)
  – by faking an I/O error (e.g. a dirty pagecache page)
This last case is a kind of transient I/O error!

Next Challenges
• Pseudo I/O error
  – A kind of transient I/O error → the next access causes a page fault,
    and clean (but old) data is read from the disk
  – AS_EIO is set on the corresponding address space → user space can
    handle the error via fsync(), etc.
  – The I/O error is NOT notified to the FS → ext3 continues to operate
    normally
This implies we need more work

Next Challenges
• How can we handle this issue?
  1. Disable the hwpoison feature (needs no fix):
     echo 0 > /proc/sys/vm/memory_failure_recovery
  2. Notify the error to the fs
     • Needs per-fs implementations (we can use the new
       a_ops->error_remove_page operation)
     • This would be the best approach; it would also be usable for the
       normal transient I/O error case!
  3. Make the error sticky (Wu Fengguang's idea)
     • Set a new flag, AS_HWPOISON, on the address space
     • When AS_HWPOISON is set:
       – writes always fail
       – adding new pagecache pages always fails (this prevents old data
         from being re-read)

Summary
• To prevent data corruption due to transient I/O errors, we fixed 4 error
  handling problems in ext3/jbd
• Status
  – All related patches were merged into 2.6.28 and RHEL 5.3
  – The ext4/jbd2 version was also merged into 2.6.28

Summary
• To avoid data corruption...
  – User space developers (file data)
    • should use the fsync() family to detect and properly handle I/O
      errors when using buffered writes for valuable data...
  – Kernel (fs) developers (metadata, file data)
    • should take care of transient I/O errors, which can cause data
      corruption
  – Admins/Users (file data)
    • might want to mount ext3/4 filesystems with data=ordered,
      data_err=abort
    • might want to turn off memory_failure_recovery for now, but...

Thank you for listening
• Trademarks
  – Linux® is the registered trademark of Linus Torvalds in the U.S. and
    other countries.
  – Other product names used in this publication are for identification
    purposes only and may be trademarks of their respective companies.