Linux Filesystems Ext2, Ext3 Nafisa Kazi

What is a Filesystem
• A filesystem:
  – Stores files and the data in those files
  – Organizes data for easy access
  – Stores information about files, such as size, file permissions, owner, creation time, etc.
  – May use a storage device such as a hard disk or CD-ROM
    • This involves maintaining the physical location of the files
  – Could be virtual and exist only as an access method for virtual data or for data over a network (e.g. NFS)

Linux File System History
• Minix: the first file system for Linux
  – Restrictive and lacked performance
  – Filenames longer than 14 characters not allowed
  – Maximum file size was 64 Mbytes
• EXT (Extended File System): the first file system designed specifically for Linux
  – Introduced in April 1992
  – Still lacked performance
• In 1993, the Second Extended File System, or Ext2, was added
• In 1999, the Third Extended File System, or Ext3, was developed by Stephen Tweedie

Linux File System History (cont'd.)
• VFS (Virtual File System): developed when the EXT filesystem was added
  – VFS allows Linux to support different file systems
  – Each file system presents a common software interface to the VFS
  – All the details of the various file systems are translated by software
    • All file systems appear identical to the rest of the Linux kernel
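
VFS
• For example: cp /floppy/TEST /tmp/test
  – The source and the destination may be on different file systems; cp reaches both through the same VFS interface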
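
The common-interface idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not kernel code: the names FileSystem, MemFS, Vfs, mount, resolve and copy are hypothetical and chosen only for illustration. Each mounted filesystem implements the same two methods, and the VFS layer picks the filesystem by the longest matching mount point.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical common interface every filesystem presents to the VFS."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class MemFS(FileSystem):
    """Stand-in for a concrete filesystem; keeps its files in a dict."""

    def __init__(self) -> None:
        self.files: dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self.files[path]

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data


class Vfs:
    """Routes each path to the filesystem mounted at its longest matching prefix."""

    def __init__(self) -> None:
        self.mounts: dict[str, FileSystem] = {}

    def mount(self, point: str, fs: FileSystem) -> None:
        self.mounts[point.rstrip("/") or "/"] = fs

    def resolve(self, path: str) -> tuple[FileSystem, str]:
        candidates = [
            p for p in self.mounts
            if path == p or path.startswith(p.rstrip("/") + "/")
        ]
        best = max(candidates, key=len)  # longest mount point wins
        rel = path[len(best):] or "/"
        if not rel.startswith("/"):
            rel = "/" + rel
        return self.mounts[best], rel

    def copy(self, src: str, dst: str) -> None:
        # The caller never needs to know which filesystem type holds each path.
        src_fs, src_rel = self.resolve(src)
        dst_fs, dst_rel = self.resolve(dst)
        dst_fs.write(dst_rel, src_fs.read(src_rel))


# Usage: the same copy() works across two different mounted filesystems.
vfs = Vfs()
root_fs, floppy_fs = MemFS(), MemFS()
vfs.mount("/", root_fs)
vfs.mount("/floppy", floppy_fs)
floppy_fs.write("/TEST", b"hello")
vfs.copy("/floppy/TEST", "/tmp/test")
```

In the real kernel the dispatch goes through tables of function pointers rather than Python classes, but the effect is the same: the caller issues one kind of request regardless of the filesystem type behind the path.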

VFS: Superblocks and Inodes
• VFS describes the system's files in terms of superblocks and inodes
• The VFS inodes:
  – Describe files and directories within the system
• The VFS superblocks:
  – As each file system is initialized, it registers itself with the VFS at boot time
  – Each file system type's superblock read routine maps the filesystem's topology onto a VFS superblock
  – VFS keeps a list of the mounted file systems and their VFS superblocks
  – Each VFS superblock contains a pointer to the first VFS inode on the file system
  – As the system's processes access directories and files, system routines are called that traverse the VFS inodes
Logical Diagram of VFS

Caching in VFS
• Inode cache:
  – Repeatedly accessed inodes are kept in the inode cache for quicker access
• Directory cache:
  – VFS also keeps a cache of directory lookups so that the inodes for frequently used directories can be found quickly
  – Stores the directory name ⇒ inode mapping

Caching in VFS (cont'd.)
• Buffer cache:
  – Caches data buffers from the devices to help speed up access
  – Makes the Linux file systems independent of the underlying media and of the device drivers that support them
  – Is integrated with the block device interface
  – Read requests from a filesystem result in block device drivers reading physical blocks from the device that they control
  – These blocks are saved in the global buffer cache and are shared by all filesystems
  – Buffers are identified by their block number and a unique identifier for the device that read them
  – Filesystems don't have to go to the device if a block is already in the cache
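
A minimal sketch of such a cache, assuming buffers are keyed by a (device identifier, block number) pair as the slide describes. The class name BufferCache, the fake_driver function, the fixed capacity and the least-recently-used eviction policy are assumptions made for illustration; the slides do not say how the kernel evicts buffers.

```python
from collections import OrderedDict


class BufferCache:
    """Toy global buffer cache keyed by (device id, block number)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.buffers = OrderedDict()  # (dev, block) -> bytes, oldest first
        self.device_reads = 0         # counts trips to the simulated device

    def read_block(self, dev, block, read_from_device):
        key = (dev, block)
        if key in self.buffers:
            # Cache hit: no device access needed.
            self.buffers.move_to_end(key)
            return self.buffers[key]
        # Cache miss: ask the (hypothetical) driver for the physical block.
        data = read_from_device(dev, block)
        self.device_reads += 1
        self.buffers[key] = data
        if len(self.buffers) > self.capacity:
            self.buffers.popitem(last=False)  # evict least recently used (assumed policy)
        return data


def fake_driver(dev, block):
    """Hypothetical block device driver returning 1 KiB of deterministic data."""
    return bytes([block % 256]) * 1024


cache = BufferCache()
cache.read_block("hda1", 7, fake_driver)  # miss: goes to the device
cache.read_block("hda1", 7, fake_driver)  # hit: served from the cache
```

Because the key includes the device identifier, block 7 on one device never collides with block 7 on another, which is what lets all filesystems share a single cache.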

Ext2 Disk Data Structures
• The first block in each Ext2 partition is reserved for the partition boot sector
• The rest of the space is split into block groups, each of which has the following layout: superblock, group descriptors, data block bitmap, inode bitmap, inode table, data blocks
• All the block groups have the same size and are stored sequentially
  – The kernel can derive the location of a block group on disk simply from its integer index
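
The index-to-location arithmetic can be written out as follows. This is a sketch under assumptions: the function names are hypothetical, and the default first_data_block of 1 corresponds to a filesystem with 1 KiB blocks, where block 0 holds the boot sector (with larger block sizes the first data block is 0).

```python
def group_first_block(group, blocks_per_group, first_data_block=1):
    """Block number at which block group `group` starts."""
    return first_data_block + group * blocks_per_group


def group_of_block(block, blocks_per_group, first_data_block=1):
    """Index of the block group that contains `block`."""
    return (block - first_data_block) // blocks_per_group


def group_byte_offset(group, blocks_per_group, block_size, first_data_block=1):
    """Byte offset of the start of block group `group` within the partition."""
    return group_first_block(group, blocks_per_group, first_data_block) * block_size


# With 1 KiB blocks and 8192 blocks per group, group 3 starts at block 24577.
start = group_first_block(3, 8192)
```

No lookup table is needed: because every group has the same size and the groups are laid out back to back, one multiplication and one addition locate any group.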

Ext2 Superblock
• Contains a description of the file system
• Duplicated in each block group
• The superblock and the group descriptors in block group 0 are used when the filesystem is mounted
• Some important fields that this block holds:
  – Magic Number
    • Identifies the filesystem type
  – Block Group Number
    • The number of the block group that holds this copy of the superblock
  – Block Size
    • The size of a block for this file system, in bytes
  – Blocks per Group
    • The number of blocks in a group; this is fixed when the file system is created
  – Free Blocks
    • The number of free blocks in the file system
  – Free Inodes
    • The number of free inodes in the file system
  – First Inode
    • The inode number of the first inode in the file system
    • The first inode in an Ext2 root file system would be the directory entry for the '/' directory

Ext2 Group Descriptor and Bitmap
• All the group descriptors for all of the block groups are duplicated in each block group
• Each group descriptor records the block numbers of its group's:
  – Blocks bitmap
  – Inode bitmap
  – Inode table
• The bitmaps are sequences of bits
  – Value 0 specifies that the corresponding inode or data block is free
  – Value 1 specifies that the corresponding inode or data block is used
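
The free/used convention above can be sketched with a small bitmap class. The class and method names are hypothetical, the first-fit search is a simplification, and the bit order within each byte (least significant bit first) is an assumption of this sketch.

```python
class Bitmap:
    """Toy allocation bitmap: bit value 0 means free, 1 means used."""

    def __init__(self, nbits):
        self.nbits = nbits
        self.bits = bytearray((nbits + 7) // 8)  # all zero: everything free

    def is_used(self, i):
        return bool(self.bits[i // 8] & (1 << (i % 8)))

    def set_used(self, i):
        self.bits[i // 8] |= 1 << (i % 8)

    def set_free(self, i):
        self.bits[i // 8] &= ~(1 << (i % 8)) & 0xFF

    def allocate(self):
        """Mark the first free entry as used and return its index, or None if full."""
        for i in range(self.nbits):
            if not self.is_used(i):
                self.set_used(i)
                return i
        return None


inode_bitmap = Bitmap(16)
first = inode_bitmap.allocate()   # index of the first free inode slot
inode_bitmap.set_free(first)      # releasing it clears the bit again
```

One bit per inode or data block keeps the structure compact: a single 1 KiB bitmap block can describe 8192 entries.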

Inodes
• Every file and directory in the file system is described by one inode
• The inodes for each block group are kept in the inode table, together with a bitmap
• The inode contains the following fields:
  – Mode
    • What the inode describes and the permissions that users have on it
  – Owner information
    • The user and group identifiers of the owner
  – Size
    • The size of the file in bytes
  – Timestamps
    • The time that the inode was created and the last time that it was modified
  – Data blocks
    • Pointers to the blocks that contain the data that this inode is describing
Inode structure

Consistency Check Problem with Ext2
• Updates to filesystem blocks are kept in dynamic memory before being flushed to disk
• A power failure might leave the filesystem in an inconsistent state
• To overcome this problem, each filesystem is checked (and fixed) before it is mounted
  – The utility is called fsck
  – Runs upon reboot after a system crash
• Does not scale well
  – With today's large disks and filesystems, fsck can take many hours to perform a consistency check
  – Unacceptable in a production environment

Ext3 Filesystem
• Ext3 is a journaling filesystem
  – Goal of a journaling filesystem:
    • To avoid time-consuming consistency checks during system start-up after an ungraceful termination
  – Main idea:
    • First write blocks to a special area of the disk called the journal
    • Then write the blocks from the journal to the filesystem
  – Other examples of journaling file systems:
    • SGI's XFS and IBM's JFS
• Ext3 is as compatible as possible with the Ext2 filesystem
  – Fairly easy to migrate between Ext2 and Ext3

Journaling Filesystem (JFS)
• Two-step procedure for performing a high-level change to the filesystem:
  – Step 1: Committing to the journal
    • Keeps track of the information to be written to the hard drive in a journal
    • A copy of the blocks to be written is stored in the journal
  – Step 2: Committing to the filesystem
    • When the I/O transfer to the journal is completed, the blocks are written to the filesystem
    • When the I/O transfer to the filesystem is completed, the copies of the blocks in the journal are discarded
• The journal allows quick recovery of the filesystem after a crash
  – No need to scan the entire disk; only the journal area is scanned

System Recovery with JFS
• Two cases for system recovery
  – Case 1: the system failure occurred before a commit to the journal
    • The copies of the blocks relating to the change are either missing from the journal or incomplete
      – In both cases, fsck ignores them
    • Result: the high-level change to the filesystem is lost, but the filesystem state is still consistent
  – Case 2: the system failure occurred after a commit to the journal
    • The copies of the blocks are valid, and fsck writes them into the filesystem
    • Result: fsck applies the whole change, thus fixing every inconsistency due to unfinished I/O data transfers into the filesystem
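
The two-step procedure and the two recovery cases can be modelled with a toy in-memory disk. Everything here is an illustrative assumption: the class JournaledDisk, its method names, and the use of a boolean "committed" flag in place of a real on-disk commit record.

```python
class JournaledDisk:
    """Toy model of a disk with a filesystem area and a journal area."""

    def __init__(self):
        self.blocks = {}    # filesystem area: block number -> data
        self.journal = []   # journal area: list of transaction records

    def log(self, updates):
        """Step 1 (start): copy the blocks to be written into the journal."""
        txn = {"updates": dict(updates), "committed": False}
        self.journal.append(txn)
        return txn

    def commit(self, txn):
        """Step 1 (end): the journal copy is complete and valid."""
        txn["committed"] = True

    def checkpoint(self, txn):
        """Step 2: write the blocks to the filesystem, then discard the journal copy."""
        self.blocks.update(txn["updates"])
        self.journal = [t for t in self.journal if t is not txn]

    def recover(self):
        """After a crash: replay committed records, ignore incomplete ones."""
        for txn in self.journal:
            if txn["committed"]:
                self.blocks.update(txn["updates"])
        self.journal = []


# Case 2: crash after the journal commit but in the middle of the filesystem write.
disk = JournaledDisk()
txn = disk.log({10: b"inode", 11: b"bitmap", 12: b"data"})
disk.commit(txn)
disk.blocks[10] = b"inode"   # only one block reached the filesystem before the crash
disk.recover()               # replay completes the interrupted change
```

Recovery only walks the journal list, never the whole blocks dictionary, which mirrors why replaying a journal is so much faster than a full consistency check.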

Journaling Modes
• Logging blocks to the journal leads to a significant performance penalty
• Therefore, Ext3 lets the operator decide which kinds of blocks are logged
• This gives rise to three journaling modes:
  – Journal
  – Ordered
  – Writeback
• The journaling mode is specified as an option to the mount command
  – Example: mount -t ext3 -o data=writeback /dev/wd0a /jdisk

The Journal Journaling Mode
• All filesystem data and metadata are logged into the journal
  – Metadata includes superblocks, inodes, data bitmap blocks, inode bitmap blocks, etc.
• Minimizes loss of updates made to each file
• Requires additional disk accesses
  – Example: when a new file is created, all its data blocks are duplicated as log records
• Safest but slowest mode

Ordered Journaling Mode
• Only changes to filesystem metadata are logged to the journal
• Metadata and the related data blocks are grouped
  – Data blocks are written to disk before the metadata is written to disk
• Two cases of changes to a file
  – Case 1: appending to a file
    • If the system crashes after the data blocks are written to disk but before the metadata is, the metadata will not reflect the change
    • Hence the file is consistent, though the changes to it are lost
  – Case 2: overwriting part of a file
    • No guarantee that blocks are written to disk in order
      – Thus, one cannot assume that because overwritten block ‘x’ was updated, overwritten block ‘x-1’ was updated as well
    • No changes to metadata (block allocation bitmap)
    • Hence no way of knowing whether the file is consistent or not
• Default journaling mode for the Ext3 filesystem
  – Works out fine in practice, as appending to a file is much more common than overwriting in the middle of a file

Writeback Journaling Mode
• Only changes to filesystem metadata are logged
• Does not wait for the associated changes to file data to be written
• Example: files may exhibit inconsistencies between metadata and data
  – The block allocation bitmap will show data blocks as occupied, although the updated data was not written when the system went down
  – This isn't fatal, but can be disappointing to users
• Fastest mode

Journaling Block Device Layer
• The Ext3 journal is usually stored in a hidden file named .journal in the root directory of the filesystem
• The journal is handled by a kernel layer called the Journaling Block Device (JBD)
• The Ext3 filesystem invokes JBD routines to ensure that disk data structures don't get corrupted in case of a system failure

Interaction Between Ext3 and JBD
• JBD uses the same disk to log the changes performed by the Ext3 filesystem
• Thus JBD must protect itself from a system failure that could corrupt the journal
• Hence, the interaction between Ext3 and JBD is based on three fundamental units:
  – Log record
  – Atomic operation handle
  – Transaction
• Log record
  – Describes a single update of a disk block
  – Describes a low-level operation issued by the filesystem
  – Represented inside the journal as blocks of data or metadata

Atomic Operation Handles
• Group the log records of a set of low-level operations that correspond to a single high-level change to the filesystem
• Example: appending a block of data to a file involves many low-level operations
  – If a system failure occurs in the middle, the filesystem is left inconsistent
• Hence, when recovering from a system failure, either the whole high-level operation is applied or none of it is

Transactions
• All log records belonging to several atomic operation handles are grouped into a single transaction
• All log records of a transaction are stored in consecutive blocks of the journal
• JBD handles each transaction as a whole
• Reclaims the journal blocks used by a transaction only after all data in its log records is committed to the filesystem
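
The relationship between log records, atomic operation handles and transactions can be sketched with three small data classes. This is not the JBD data layout: the class names, the `flushed` flag and the one-journal-block-per-record simplification are assumptions made to illustrate the grouping and the reclaim rule.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    block: int             # filesystem block that this record updates
    data: bytes
    flushed: bool = False  # True once written to its final place in the filesystem


@dataclass
class Handle:
    """Log records of one high-level operation, applied all-or-nothing."""
    name: str
    records: list = field(default_factory=list)


@dataclass
class Transaction:
    """Several handles grouped together and treated as a whole."""
    handles: list = field(default_factory=list)

    def log_records(self):
        return [r for h in self.handles for r in h.records]

    def journal_blocks(self, start):
        """Consecutive journal block numbers, one per log record (simplification)."""
        return list(range(start, start + len(self.log_records())))

    def can_reclaim(self):
        """Journal space may be reused only after every record reached the filesystem."""
        return all(r.flushed for r in self.log_records())


append = Handle("append", [LogRecord(40, b"bitmap"), LogRecord(41, b"inode")])
rename = Handle("rename", [LogRecord(90, b"dirent")])
txn = Transaction([append, rename])
```

Grouping several handles into one transaction lets the journal be written in large sequential runs instead of one small write per operation.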