Linux Ext3 Filesystem

15: Filesystem Examples: Ext3, NTFS
Mark Handley

Linux Ext3 Filesystem


Problem: Recovery after a crash 

- fsck on a large disk can be extremely slow.
  - An issue for laptops: power failure is common.
  - An issue for highly available servers: failure is rare, but recovery must be reliable and fast.



- With a journaling file system (JFS), there is no need to check the whole disk.
  - After a crash, re-read the journal from the last checkpoint.

Journaling Filesystem  

- Atomically updated.
- Old and new versions of data are held on disk until the update is committed.
- Undo logging:
  - Copy old data to the log.
  - Write new data to the disk.
  - If you crash during the update, copy old data back from the log.
- Redo logging:
  - Write new data to the log.
  - Old data remains on disk until commit.
  - If you crash during the update, copy new data from the log.
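The two logging schemes above can be sketched on a toy in-memory "disk". This is only an illustration under stated assumptions: the one-entry log, the `struct disk` layout and the `enum crash` injection points are all hypothetical and are not how Ext3 lays out its journal.

```c
#include <assert.h>

#define NBLOCKS 4

/* Where the simulated crash happens (hypothetical test hook). */
enum crash { NO_CRASH, CRASH_BEFORE_COMMIT, CRASH_AFTER_COMMIT };

/* Toy disk: a few integer "blocks" plus a one-entry log. */
struct disk {
    int blocks[NBLOCKS];
    int log_block;   /* block number the log entry refers to */
    int log_value;   /* old value (undo) or new value (redo) */
    int log_valid;   /* non-zero while a log entry exists */
    int committed;   /* non-zero once the commit record is written */
};

/* Undo logging: save the old value, then overwrite the block in place. */
void undo_update(struct disk *d, int blk, int new_value, enum crash c)
{
    assert(blk >= 0 && blk < NBLOCKS);
    d->log_block = blk;
    d->log_value = d->blocks[blk];   /* copy old data to the log */
    d->log_valid = 1;
    d->committed = 0;
    d->blocks[blk] = new_value;      /* write new data to the disk */
    if (c == CRASH_BEFORE_COMMIT)
        return;
    d->committed = 1;
}

/* Undo recovery: roll back an update that never committed. */
void undo_recover(struct disk *d)
{
    if (d->log_valid && !d->committed)
        d->blocks[d->log_block] = d->log_value;
    d->log_valid = 0;
}

/* Redo logging: log the new value; the disk keeps old data until commit. */
void redo_update(struct disk *d, int blk, int new_value, enum crash c)
{
    assert(blk >= 0 && blk < NBLOCKS);
    d->log_block = blk;
    d->log_value = new_value;        /* write new data to the log */
    d->log_valid = 1;
    d->committed = 0;
    if (c == CRASH_BEFORE_COMMIT)
        return;
    d->committed = 1;
    if (c == CRASH_AFTER_COMMIT)
        return;
    d->blocks[blk] = new_value;      /* finally write to the real location */
    d->log_valid = 0;
}

/* Redo recovery: re-apply a committed update, discard an uncommitted one. */
void redo_recover(struct disk *d)
{
    if (d->log_valid && d->committed)
        d->blocks[d->log_block] = d->log_value;
    d->log_valid = 0;
}
```

In this sketch, undo recovery restores the old value after a crash before commit, while redo recovery leaves the disk untouched in that case and copies the new value from the log only if the commit record was written.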


Journal Data and Transactions  





- Fixed size, stored on disk, used as a circular buffer.
- Contains:
  - Metadata: the entire contents of a single block of filesystem metadata, as updated by the transaction.
  - Descriptor: where the metadata really lives on disk.
  - Header: head and tail of the journal (in the circular buffer).
- Each disk update is an atomic transaction.
  - Write new data to the journal.
  - Not complete until a commit.
- Only after commit is the update final.
  - Will be flushed to disk in due course.
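The head/tail bookkeeping of a fixed-size circular journal can be sketched as follows. The eight-block size and the names `struct journal`, `journal_append` and `journal_release` are assumptions made for illustration; the real Ext3 journal header is more involved.

```c
#include <assert.h>

#define JOURNAL_BLOCKS 8   /* assumed journal size, for illustration only */

struct journal {
    unsigned head;   /* index of the next journal block to write */
    unsigned tail;   /* index of the oldest block still needed */
    unsigned used;   /* number of journal blocks currently in use */
};

/* Number of journal blocks that can still be written. */
unsigned journal_free(const struct journal *j)
{
    return JOURNAL_BLOCKS - j->used;
}

/* Reserve n blocks at the head, wrapping around the end of the journal.
 * Returns 0 on success, -1 if the journal does not have enough free space. */
int journal_append(struct journal *j, unsigned n)
{
    if (n > journal_free(j))
        return -1;
    j->head = (j->head + n) % JOURNAL_BLOCKS;
    j->used += n;
    return 0;
}

/* Release the n oldest blocks once their contents have reached
 * their final location on disk. */
void journal_release(struct journal *j, unsigned n)
{
    assert(n <= j->used);
    j->tail = (j->tail + n) % JOURNAL_BLOCKS;
    j->used -= n;
}
```

Tracking `used` separately avoids the usual ambiguity of a circular buffer in which `head == tail` could mean either empty or full.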

Commit 

    

- Transaction is committed.
  - Subsequent file system operations will go in a new transaction.
- Flush the transaction to the journal on disk; pin the memory buffers, because the data is not yet in the right place on disk.
- Once flushed, update the journal header blocks.
- Sync the journal transaction to disk.
- Unpin the memory buffers.
- Release space in the journal.


Crash Recovery 



- Only completed updates have been committed.
  - During reboot, committed transactions in the journal are re-applied.
- Old and updated data are each stored separately until the commit block is written to the journal on disk.
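Replay at reboot can be sketched as a scan of the journal that applies a transaction's blocks only once its commit record is seen. The record layout below is hypothetical: `target` stands in for the descriptor and `value` for the journaled metadata block, and a real journal stores whole blocks rather than integers.

```c
#include <assert.h>
#include <stddef.h>

enum rec_type { REC_BLOCK, REC_COMMIT };

/* One journal record (hypothetical layout). */
struct record {
    enum rec_type type;
    int target;   /* final block number on disk (role of the descriptor) */
    int value;    /* updated block contents (role of the metadata block) */
};

/* Re-apply every committed transaction in the log to the disk.
 * Records after the last commit belong to an incomplete transaction
 * and are ignored. Returns the number of blocks written. */
int replay(const struct record *log, size_t n, int *disk, size_t disk_blocks)
{
    size_t txn_start = 0;
    int applied = 0;

    for (size_t i = 0; i < n; i++) {
        if (log[i].type != REC_COMMIT)
            continue;
        /* Commit found: apply all block records of this transaction. */
        for (size_t k = txn_start; k < i; k++) {
            assert(log[k].target >= 0 && (size_t)log[k].target < disk_blocks);
            disk[log[k].target] = log[k].value;
            applied++;
        }
        txn_start = i + 1;
    }
    return applied;
}
```

Because nothing is written until the commit record is found, a crash in the middle of writing a transaction to the journal leaves the old on-disk data untouched.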

Ext3 vs Ext2 vs LSFS 



- Ext3 is completely backwards compatible with Ext2.
  - Just adds a journal in a special file.
  - Does not change the basic filesystem structure, inodes, directories, etc.
- A log-structured filesystem (LSFS) ONLY contains a log.
  - Everything is written to the end of the log.


NTFS Filesystem

File System API Calls in Windows 2000,XP…

 

Principal Win32 API functions for file I/O. The second column gives the nearest UNIX equivalent.


Windows 2000: File System API

- The Windows API has very many parameters.
- E.g. CreateFile() has 7 parameters:
  - Pointer to the name of the file to open/create.
  - Flags for read/write/both.
  - Flags for whether multiple processes can simultaneously open the file.
  - Pointer to a security descriptor telling who can access the file.
  - Flags telling what to do if the file exists/doesn’t exist.
  - Flags dealing with attributes such as archiving and compression.
  - The handle of a file whose attributes should be cloned for the new file.

Windows 2000 File System API: Copying a File

    #include <windows.h>

    #define BUF_SIZE 4096   /* buffer size is not given on the slide; assumed */

    int main(void)
    {
        HANDLE inhandle, outhandle;
        char buffer[BUF_SIZE];
        DWORD count, ocnt;
        BOOL s;

        /* Open files for input and output */
        inhandle = CreateFile("data", GENERIC_READ, 0, NULL,
                              OPEN_EXISTING, 0, NULL);
        outhandle = CreateFile("new", GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

        /* Copy the file */
        do {
            s = ReadFile(inhandle, buffer, BUF_SIZE, &count, NULL);
            if (s && count > 0)
                WriteFile(outhandle, buffer, count, &ocnt, NULL);
        } while (s && count > 0);

        /* Close the files */
        CloseHandle(inhandle);
        CloseHandle(outhandle);
        return 0;
    }
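For comparison with the Win32 copy loop above, the same copy can be sketched with the nearest UNIX calls (`open`, `read`, `write`, `close`). The function name `copy_file`, the 4096-byte buffer and the 0644 permission bits are choices made for this sketch, not something the slides specify.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define BUF_SIZE 4096   /* assumed buffer size */

/* Copy the file named src to dst, creating or truncating dst.
 * Returns 0 on success, -1 on any error. */
int copy_file(const char *src, const char *dst)
{
    char buffer[BUF_SIZE];
    ssize_t count;
    int status = 0;

    assert(src != NULL && dst != NULL);

    /* Open files for input and output */
    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    /* Copy the file; write() may write fewer bytes than asked, so loop. */
    while (status == 0 && (count = read(in, buffer, sizeof buffer)) > 0) {
        ssize_t done = 0;
        while (done < count) {
            ssize_t w = write(out, buffer + done, (size_t)(count - done));
            if (w <= 0) {
                status = -1;
                break;
            }
            done += w;
        }
    }
    if (status == 0 && count < 0)
        status = -1;   /* read() failed */

    /* Close the files */
    close(in);
    if (close(out) < 0)
        status = -1;
    return status;
}
```

Where CreateFile() takes seven parameters, `open()` takes two or three: sharing mode, security descriptor and template handle have no direct counterpart in this call.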


Windows 2000 File System API: System Calls for Directory Management

The second column gives the nearest UNIX equivalent, when one exists.

NTFS   



- NTFS replaces the FAT file system in recent Windows releases.
- Designed from scratch: complex and fully featured.
- Each volume (partition) is a linear sequence of blocks.
  - A 4 KB block size is typical.
  - 64-bit block IDs.
- Each volume has a Master File Table (MFT).
  - Sequence of 1 KB records.
  - One (or more) record per file or directory.
  - Somewhat like i-nodes, but more flexible.
  - Each MFT record is a sequence of variable-length (attribute, value) pairs.
  - Long attributes can be stored externally, with a pointer kept in the MFT record.


NTFS Master File Table 

 

- The first 16 entries are reserved for NTFS metadata files.
- The MFT is itself a file.
- The first record describes the MFT file itself (where its blocks are on disk).

MFT MetaData Attributes

- $LogFile: when many changes to the filesystem are made, they’re logged here first. If the system goes down, consistency can be recovered by reading the log.
- $AttrDef: MFT attributes are defined here, allowing extensibility.
- $Bitmap: keeps track of free blocks.
- $Boot: points to the bootstrap loader for OS booting.
- $Upcase: defines filename case mapping (for non-Roman alphabets).


File System Structure (2)

The attributes used in MFT records

NTFS File Block Management   



- NTFS tries to allocate files in runs of consecutive blocks.
- Unlike with FAT, files can contain holes.
- In an MFT record, blocks are described by a sequence of DATA attributes, one for each section between holes.
  - Within each DATA attribute, there are multiple fields, each indicating a run of consecutive disk blocks.
- If all the attributes don’t fit into one MFT record, extension records can be used to hold more.
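To make the run idea concrete, here is a minimal sketch of translating a block offset within a file into a disk block number by walking a list of runs. The `struct run` layout and the function name are hypothetical, holes are not modelled, and the real on-disk run encoding in NTFS is more compact than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One run of consecutive disk blocks (hypothetical in-memory form). */
struct run {
    uint64_t start;    /* first disk block of the run */
    uint64_t length;   /* number of consecutive blocks in the run */
};

/* Translate file block number n into a disk block number.
 * Returns 0 and stores the result in *disk_block on success,
 * or -1 if n lies beyond the end of the file. */
int file_block_to_disk(const struct run *runs, size_t nruns,
                       uint64_t n, uint64_t *disk_block)
{
    assert(disk_block != NULL);
    for (size_t i = 0; i < nruns; i++) {
        if (n < runs[i].length) {
            *disk_block = runs[i].start + n;
            return 0;
        }
        n -= runs[i].length;   /* skip past this run */
    }
    return -1;
}
```

For example, with three illustrative runs `{20, 4}`, `{64, 2}` and `{80, 3}` (nine blocks in total), file block 4 falls at the start of the second run and maps to disk block 64.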


MFT Record for Normal File

An MFT record for a three-run, nine-block file

Extension MFT Records



- A file that requires three MFT records to store its runs.
  - Typically because the file is very fragmented or very large.


MFT Record for a Small Directory

 

- The MFT record for a small directory.
  - Directory entries are stored as a simple list.
- Large directories use B+ trees instead.

NTFS File Name Lookup



- Steps in looking up the file C:\maria\web.htm
  - First prepend \?? to the filename, and look it up in the \?? directory.


NTFS File Compression  



- The API can specify that a file should be compressed by the filesystem.
- The OS attempts to compress 16 blocks at a time.
  - If compression reduces them to 15 blocks or fewer, the compressed blocks are written to disk.
  - Otherwise the uncompressed blocks are written.
  - Runs of compressed blocks use two DATA runs in the MFT: one for the compressed data blocks, and one for how much compression was achieved.
- Seeking is not terribly efficient:
  - Must decompress 16 blocks at a time to find the correct uncompressed block.
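The per-unit decision described above comes down to a little arithmetic, sketched here. The function names are hypothetical, and the compressed size of each 16-block unit is simply passed in; no actual compression is performed.

```c
#include <assert.h>
#include <stddef.h>

#define UNIT_BLOCKS 16   /* NTFS tries to compress 16 blocks at a time */

/* Blocks written to disk for one 16-block unit, given the size the
 * unit would have after compression. The unit is stored compressed
 * only if that saves at least one block. */
unsigned stored_blocks(unsigned compressed_blocks)
{
    if (compressed_blocks <= UNIT_BLOCKS - 1)
        return compressed_blocks;   /* write the compressed blocks */
    return UNIT_BLOCKS;             /* write the uncompressed blocks */
}

/* Total blocks on disk for a file made of `units` 16-block units. */
unsigned total_stored(const unsigned *compressed, size_t units)
{
    unsigned total = 0;

    assert(compressed != NULL || units == 0);
    for (size_t i = 0; i < units; i++)
        total += stored_blocks(compressed[i]);
    return total;
}
```

As an assumed example, a 48-block file whose three units would compress to 8, 16 and 8 blocks is stored in 8 + 16 + 8 = 32 blocks, with the middle unit left uncompressed.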

NTFS File Compression

(a) An example of a 48-block file being compressed to 32 blocks. (b) The MFT record for the file after compression.


NTFS File Encryption

[Figure: file encryption diagram; surviving labels: "K retrieved", "user's public key"]

The Encrypting File System (EFS) sits above NTFS and below the Win32 API.
