Design and Implementation of a Flexible Storage System: Nunbora∗
SungWon Chung†, Young-Jin Shin, Woo-Young Park, and Sang-Hwa Chung
Division of Electrical and Computer Engineering, Pusan National University, Pusan 609-735, Korea
ABSTRACT
The object of the Nunbora storage system is to enable storage system reconfiguration without backup and restore. To realize this object, we design and implement the Nunbora filesystem, a cache consistency layer, a virtual logical disk device driver, and a volume manager. Nunbora is the first storage system that enables filesystem size reduction and expansion without losing data. This paper describes the design of the Nunbora storage system and its implementation on the NetBSD operating system.

1. INTRODUCTION
Traditionally, a filesystem had a fixed size and could not span more than a single storage device. With the development of RAID, however, a filesystem could grow beyond a single device and, moreover, provide fault-tolerant service. As a new paradigm, we present a flexible storage concept. With its implementation, filesystem reconfiguration is dynamic and needs no remount or reformat. This means that the filesystem size can be reduced so as to remove a redundant storage device, or increased so as to attach additional storage devices, just like LEGO blocks, while providing continuous service. To realize this concept, we implement four integral components: a resizable filesystem, a cache consistency layer, a virtual logical disk device driver, and a volume manager. If we view these four components as a flexibility layer, the same scheme can be applied to a virtual memory system to enable memory module installation and removal without reloading the operating system.
∗ This report is a working paper on the Nunbora file system, which allows dynamic file system resizing on the fly. This work was supported by the Student Research Grant of The USENIX Association in 2002.
† SungWon Chung received the B.S. degree in electrical engineering from Pusan National University, Busan, Korea, in 2002. He designed the Nunbora file system architecture and implemented the cache coherency interface ([email protected]). Young-Jin Shin was with Forwin Information Technology, Co., Centerm Venture Town 205, Woo-2-dong, Haewoondae-gu, Pusan, Korea. He implemented the doubly linked i-node structure for the Nunbora file system ([email protected]). Woo-Young Park was a junior attending Pusan National University. He implemented the virtual logical disk device driver for the Nunbora file system ([email protected]). Sang-Hwa Chung received the B.S. degree from Seoul National University, Seoul, and the Ph.D. degree from the University of Southern California in 1993. He joined the faculty of Pusan National University, Pusan, Korea, where he is currently a Professor in the Department of Computer Engineering ([email protected]).

Copyright (c) 2002 SungWon Chung. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

2. ARCHITECTURE
The internal architecture of the Nunbora storage system is shown in Fig. 1. In the subsequent sections, we describe the design of its inner components.
[Figure 1 diagram: the system call interface and active file entries over the VNODE layer; beneath it NFS, UFS (local naming service), MFS, FFS, LFS, and nbrFS above the buffer cache, cache coherency module, and logical disk device; block-device and character-device drivers (special devices, tty line discipline management, swap space, VM, sockets with network protocols and network interface drivers); hardware at the bottom.]
Figure 1: Architecture of the Nunbora storage system
3. NUNBORA RESIZABLE FILESYSTEM
3.1 New Features
Table 1: Definition of terms
Term      Definition
F(d)      Free space on physical disk d
U(d)      Used space on physical disk d
M(d, n)   The disks in logical volume d other than disk n
N(d, n)   The position number of disk n in logical volume d
Q(d)      The number of physical disks in logical volume d
In this section, we describe our new features for flexibility: the addition and removal of storage devices. We first introduce some terms to quantitatively describe the efficiency issues related to flexibility, and then we define proper addition and proper removal.
3.1.1 Terms
The terms used in the definitions below are given in Tbl. 1.

3.1.2 Proper Addition
We define proper addition using the terms of Tbl. 1.
Definition: The addition of physical disk WDx to the logical volume NWD0 is proper if:
1. The filesystem on WDx is Nunbora.
2. U(WDx) = 0.
3. N(NWD0, WDx) = Q(NWD0).
4. During the proper addition, NWD0 does not accept other addition or removal requests from the user-level volume manager.
The first condition says that the user may only add a storage device initialized with the Nunbora filesystem. The second condition says that the used space of the disk to be added must be zero; that is, WDx must be empty. The third condition places the added disk at the end of the logical volume. There are two reasons for this condition: first, the user does not need to know the specific placement of data in the filesystem; second, if the added disk were placed in the middle of the logical volume, the linear allocation of logical block numbers in the volume would be broken. The final condition means that our system does not provide a request-queuing policy: while the filesystem is busy with an addition or removal, it does not accept any other requests. As a result, we can increase the filesystem size without block relocation, so the Nunbora filesystem has no overhead for proper addition.
3.1.3 Proper Removal
Proper removal must satisfy the following conditions.
Definition: The removal of storage device WDx from the logical volume NWD0 is proper if:
1. F(M(NWD0, WDx)) ≥ U(WDx).
2. During the proper removal, NWD0 does not accept addition or removal requests from the user-level volume manager.
The first condition requires that the free space on the remaining disks is at least the used space of the disk being removed; this ensures that the Nunbora system preserves the contents of the Nunbora filesystem after a proper removal.
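The two definitions translate directly into checks that a volume manager could perform. The following C sketch is illustrative only: the disk and volume bookkeeping structures, the 16-disk cap, and the zero-based disk positions are all our assumptions, not part of the Nunbora design.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping kept by the user-level volume manager. */
struct disk {
        unsigned long d_free;       /* F(d): free blocks */
        unsigned long d_used;       /* U(d): used blocks */
        bool          d_nunbora;    /* formatted with the Nunbora filesystem */
};

struct volume {
        struct disk  *v_disks[16];  /* member disks in logical order */
        size_t        v_ndisks;     /* Q(d): number of member disks */
        bool          v_busy;       /* an addition/removal is in progress */
};

/* Proper addition (Sec. 3.1.2): Nunbora-formatted, empty, appended at
 * the end (with zero-based positions, N = Q), and not concurrent with
 * another reconfiguration. */
static bool
addition_is_proper(const struct volume *v, const struct disk *d, size_t pos)
{
        return d->d_nunbora && d->d_used == 0 &&
               pos == v->v_ndisks && !v->v_busy;
}

/* Proper removal (Sec. 3.1.3): F(M(NWD0, WDx)) >= U(WDx), and not
 * concurrent with another reconfiguration. */
static bool
removal_is_proper(const struct volume *v, const struct disk *d)
{
        unsigned long free_rest = 0;

        for (size_t i = 0; i < v->v_ndisks; i++)
                if (v->v_disks[i] != d)
                        free_rest += v->v_disks[i]->d_free;
        return free_rest >= d->d_used && !v->v_busy;
}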
Table 2: Contents of each block
Block Type             Contents
Directory chunk        File names and inode numbers
Inode                  12 direct block numbers; indirect, double indirect, and triple indirect block numbers
Data block             Raw data
Indirect block         Raw data block numbers
Double indirect block  Indirect block numbers
Triple indirect block  Double indirect block numbers
3.2 Relocation
This subsection describes a new idea for efficient block relocation in a logical volume to reduce filesystem size. First, we show a characteristic of the traditional Unix filesystem. Second, we present a critical problem of block relocation in the traditional Unix filesystem. Finally, we suggest our solution.
3.2.1 Traditional Unix Filesystem
The inode structure distinguishes the traditional Unix filesystem from others. An inode looks like a singly linked list, and implementing a logical block relocation mechanism requires a new way of thinking about the filesystem. Tbl. 2 shows the contents of each physical block: a physical disk block contains super block information, an inode, indirect, double indirect, or triple indirect pointers, or raw data. Fig. 2 shows the characteristics of each block. A directory chunk holds file names and inode numbers, so an inode may have many referers. The inode in turn holds direct, indirect, double indirect, and triple indirect pointers, each of which points to a raw data block or to an indirect, double indirect, or triple indirect pointer block. On the other hand, raw data blocks and indirect, double indirect, and triple indirect pointer blocks each have exactly one referer. The singly linked structure provides efficient data storage and search, but it poses a serious problem for moving logical blocks, which we discuss in the next section.

Figure 2: Characteristics of each block. (Directory chunks point to an inode; the inode points to direct blocks and indirect blocks; data and pointer blocks each have exactly one referer, while an inode may have many.)
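For concreteness, a traditional on-disk inode in the 4.4BSD style looks roughly like the sketch below; the field names are simplified, not the exact FFS definitions. Every pointer leads downward toward the data, and nothing records who references the inode: that is the singly linked property at issue.

#include <stdint.h>

#define NDADDR 12   /* direct block pointers, as in Tbl. 2 */
#define NIADDR 3    /* single, double, and triple indirect pointers */

/* Simplified traditional on-disk inode: pointers only lead toward
 * the data blocks; there are no back pointers to the referers. */
struct dinode {
        uint16_t di_mode;           /* file type and permissions */
        uint64_t di_size;           /* file size in bytes */
        int32_t  di_db[NDADDR];     /* direct data block numbers */
        int32_t  di_ib[NIADDR];     /* indirect block numbers */
};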
3.2.2 Problems in Logical Block Relocation
The basic steps of logical block relocation are to move a block to a specific position and then update the meta information in its referer blocks. Two questions arise: how to allocate a new position in the first step, and how to obtain the list of referers in the second step. We can solve the first problem with the block usage information or the free block list of the traditional filesystem. However, it is very hard to obtain the list of referers in the singly linked structure. Since an inode has one or more referers, the whole directory tree must be traversed to find them in a traditional Unix filesystem such as FFS or LFS. This second step incurs intolerable overhead due to the many repeated scans. Moreover, we cannot immediately tell what kind of block a given block is; in the worst case, we must scan the whole disk to determine it.
Fig. 3 shows the steps of removing a middle disk from a logical volume composed of the physical storage devices WD0, WD1, and WD2. The numbers in the figure are the logical block numbers of each disk; for example, WD0 contains logical blocks 0 to 999. Removing any disk except the last one breaks the linear structure of the logical block number space.

Figure 3: Removing a disk. (WD0 holds logical blocks 0-999, WD1 holds 1000-2999, and WD2 holds 3000-5999; removing WD1 leaves a broken linear structure in its place.)

Clearly, there are two serious problems in traditional Unix filesystems: obtaining information about referers, and maintaining a linear logical block space after a storage device is removed.

3.2.3 Reversed Link
Fig. 4 shows a general singly linked list (A) and a doubly linked list (B) that contain the same elements. It is much harder to find the element that refers to the colored element 2 in list A than in list B.

Figure 4: A singly linked list and a doubly linked list containing the same elements.

We can draw an important clue from this: if the inode structure, which resembles singly linked list A, is changed into a structure resembling doubly linked list B, then logical block relocation becomes faster and more efficient. The right side of Fig. 5 shows an inode that has many referers. It is not easy to follow the links from such an inode in reverse, because an inode has one or more referers and the inode itself holds no information about them. If we used the inode structure shown on the left of Fig. 5, we would have to know the maximum number of reverse links and reserve room in the inode structure to store them. However, this is impossible, since there is no limit on the number of reverse links.

Figure 5: An inode with many referers, in singly and doubly linked structures.

We solve the problem in the same way that inodes use indirect pointers to reference large files. Just as an inode refers to different kinds of blocks, such as raw data blocks and indirect pointer blocks, it can also refer to different kinds of reverse pointer blocks. This solution is shown in Fig. 6. We devised reverse direct, indirect, double indirect, and triple indirect pointer blocks; they contain sufficient reverse pointers without any overhead in finding referers.

Figure 6: Reversed indirect pointer blocks. (An inode is referenced through reverse direct, indirect, and double indirect reversed blocks, each entry pointing back to a referer.)
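As a minimal sketch of the idea, a reverse pointer block might be laid out as follows. The paper defines the concept but not an on-disk format, so the counts and field names here are assumptions.

#include <stdint.h>

#define NREVPTR 127     /* back pointers per block (assumed geometry) */

/* Reverse direct pointer block: lists the referers of one block. */
struct rev_block {
        int32_t rb_referers[NREVPTR];   /* logical block numbers of referers */
        int32_t rb_next;                /* reverse indirect block, or -1 */
};

/* Reverse indirect pointer block: points to further reverse pointer
 * blocks, mirroring how inodes use indirect blocks for large files. */
struct rev_indirect {
        int32_t ri_blocks[NREVPTR + 1]; /* reverse pointer block numbers */
};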
3.2.4 Two Phases
We need to separate the relocation process from the process that makes the logical block number space continuous. The first step is to move all blocks off the disk that will be removed from the logical volume. The second step, if the logical block number space resulting from the first step is not continuous, is to correct the logical block numbers in the filesystem so that they become continuous. Fig. 3 shows the discontinuity: after the removal there should be only logical blocks 0 through 3999, but blocks 3000 through 5999 still exist with their former numbers. So we must correct the linkage information to make the logical block number space continuous. The key idea of the correction algorithm is simple. After moving the blocks, we apply the following formula to adjust the linkage information of the logical disk blocks placed after the removed blocks:

New block number = Original block number − (number of logical blocks on the removed disk)

For example, in Fig. 3 the removed disk WD1 held 2000 blocks, so block 3000 becomes block 1000.
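A minimal sketch of this correction step, assuming the linkage fields are visited one by one: only numbers beyond the removed range change, and they shift down by the size of the removed disk.

#include <stdint.h>

/* Correction formula of Sec. 3.2.4: removed_start is the first logical
 * block of the removed disk and removed_len the number of blocks it
 * held (1000 and 2000 for WD1 in Fig. 3, so 3000 becomes 1000). */
static int32_t
renumber(int32_t blkno, int32_t removed_start, int32_t removed_len)
{
        return (blkno < removed_start) ? blkno : blkno - removed_len;
}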
Fig. 7 shows the two phases of relocation. The first step starts with a removal request from the volume manager: it moves each block on the disk to be removed into the free space of another remaining disk. At that point, the middle of the logical block number space is not continuous, so the second step makes the logical block number space continuous. Finally, we obtain a logical volume composed of two storage devices whose entire logical block number space is continuous.

Figure 7: Steps of relocating. (Blocks occupying logical ranges 0-400 and 1100-1500 are moved so that the middle range becomes empty, then renumbered to occupy 0-800.)
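Putting the two phases together, the removal might be driven as in the sketch below. The struct disk extent fields and the primitives block_in_use(), move_block(), nlink_fields(), and link_field() are hypothetical placeholders for the relocation module's actual mechanics; renumber() is the formula of Sec. 3.2.4.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct volume;                                 /* opaque here */
struct disk { int32_t first_lbn, nblocks; };   /* disk as a logical extent */

/* Hypothetical primitives assumed to be provided by the relocation module. */
bool     removal_is_proper(const struct volume *, const struct disk *);
bool     block_in_use(const struct volume *, int32_t lbn);
void     move_block(struct volume *, int32_t lbn);
size_t   nlink_fields(const struct volume *);
int32_t *link_field(struct volume *, size_t i); /* i-th linkage pointer */
int32_t  renumber(int32_t blkno, int32_t removed_start, int32_t removed_len);

/* Two-phase removal of `victim` from volume `v` (Fig. 7). */
int
remove_disk(struct volume *v, struct disk *victim)
{
        if (!removal_is_proper(v, victim))
                return -1;

        /* Phase 1: move every used block off the victim disk into free
         * space on the remaining disks, fixing referers via reverse links. */
        for (int32_t b = victim->first_lbn;
             b < victim->first_lbn + victim->nblocks; b++)
                if (block_in_use(v, b))
                        move_block(v, b);

        /* Phase 2: shift every linkage pointer that lies beyond the
         * removed range so the logical block space is continuous again. */
        for (size_t i = 0; i < nlink_fields(v); i++) {
                int32_t *lp = link_field(v, i);
                *lp = renumber(*lp, victim->first_lbn, victim->nblocks);
        }
        return 0;
}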
3.3 Architecture
The architecture of the Nunbora filesystem is shown in Fig. 8.

Figure 8: Structure of the Nunbora file system. (A directory tree, e.g. dirA, dirB, dirAA, fileAB, fileAAA, fileAAB with inode numbers 2 through 10, whose inodes, directory chunks, direct blocks, and indirect blocks are connected by both link nodes and reversed link nodes, including a reversed indirect block.)
3.3.1 Linkage Information
We concentrate on linkage information in the design of the Nunbora filesystem, because most overhead occurs in the addition and removal of a physical disk from the logical volume. We put linkage information into each block; it contains a pointer to the block's parent. Thus, in the Nunbora filesystem, we can find any block's referer simply by reading the block.
3.3.2 Tags
We introduce a general structure called a tag. With tags, we reuse traditional filesystem code and add only the code that handles the tags. A tag contains meta data composed of a pointer and the identity of a block. As we saw in the previous section, the pointer indicates the referer of the block, i.e., its parent. The identity field of the tag tells what the block is: an inode, a directory chunk, a data block, an indirect block, and so forth.
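A tag might be declared as in the following sketch; the enumerated kinds follow the list above, while the field widths and names are our assumptions.

#include <stdint.h>

/* Identity of a block, as listed in Sec. 3.3.2. */
enum block_kind {
        BK_INODE,
        BK_DIR_CHUNK,
        BK_DATA,
        BK_INDIRECT,
        BK_DOUBLE_INDIRECT,
        BK_TRIPLE_INDIRECT,
        BK_REVERSED             /* reverse pointer block (Sec. 3.2.3) */
};

/* Tag kept with each block: enough to reach the referer (parent) and
 * to tell what the block is without scanning the disk. */
struct tag {
        int32_t         t_parent;   /* logical block number of the referer */
        enum block_kind t_kind;     /* identity of this block */
};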
4. CACHE COHERENCY INTERFACE
We implement this abstraction layer together with the Nunbora filesystem; it removes the burden of data migration, which otherwise requires many hours of service downtime to back up data, reinitialize the filesystem, and restore the data on large storage such as RAID. All components of the Nunbora filesystem, including the flexibility layer, are shown in Fig. 1. The flexibility layer is an abstraction layer whose interface manages a cache-coherent map between a unified addressable space and separate object pools, with a dedicated policy. The design policy of the flexibility layer is to make it general purpose, usable for both the filesystem and the virtual memory system, independent of any particular operating system. This paper describes the design and architecture of the flexibility layer. In the implementation on the NetBSD operating system, the modules of the flexibility layer have an operating-system-dependent shell interacting with the buffer cache and the inode cache. In describing the architecture of the flexibility layer, we present three viewpoints: the functional description, the internal organization, and the message interface.
4.1 Functional Description
The flexibility layer interacts with the other components of the Nunbora filesystem and various kernel subsystems[3], as shown in Fig. 9. It replaces the implementation of kernel buffer management; therefore its main functionality is read, write, and, additionally, the scheduling of physical block relocation. Read and write operations are similar to an ordinary buffer management implementation, except that each operation honors the validity of the meta-information mapping logical block numbers to physical block numbers. The relocation process itself is initiated by a user-level administration program and commanded by the relocation module in the Nunbora filesystem. When relocation to reconfigure the disk array is started by the user-level administration program, the relocation module in the Nunbora filesystem decides which physical blocks to move and sends requests to the flexibility layer. The layer then schedules the requests. When a request is processed, the flexibility layer initiates the block transfer and refreshes the meta-information cache with the necessary locking.

Figure 9: Internal architecture of the flexibility layer and virtual filesystem. (Management tools and user applications sit above pseudo system calls for the flexibility layer and the system call interface; beneath the VNODE operation layer, the flexibility layer comprises the policy manager, disk reconfigurator, disk block mapper, state machine, and buffer manager, interacting with LFS, XFS, and UVM, and driving the linear disk device over the physical disk array.)

We implement the flexibility layer as a component of the Nunbora filesystem on NetBSD. In this section, we present the interface and operation flow of the implementation of the flexibility layer in the NetBSD environment. There are four possible operation flows of the flexibility layer: buffer cache hit, buffer cache miss, write, and reconfiguration.
4.1.1 Message Interface
The principal message interface should provide the following functionality to the implementation of the Nunbora filesystem vnode operations.

• Buffer cache requests[4]: read and write

void flex_bawrite(struct buf *);
void flex_bowrite(struct buf *);
void flex_bdwrite(struct buf *);
void flex_biodone(struct buf *);
int  flex_biowait(struct buf *);
int  flex_bread(struct vnode *, ...);
int  flex_breada(struct vnode *, ...);
int  flex_breadn(struct vnode *, ...);
void flex_bufinit(void);
int  flex_bwrite(struct buf *);
void flex_cluster_callback(struct buf *);
int  flex_cluster_read(struct vnode *, ...);
void flex_cluster_write(struct buf *, ...);

• Meta-information management

int flex_meta_refresh(...);

• Relocation request queue management

int flex_req_relocate(ty_flex_req req);
int flex_req_fetch(ty_flex_req *p_req);
int flex_req_status(void);

• Locking mechanism

void flex_lock_bcache(ty_flex_lock_post lk);
void flex_unlock_bcache(ty_flex_lock_post lk);
void flex_giant_lock(ty_flex_lock_post lk);
void flex_giant_unlock(ty_flex_lock_post lk);

The specific interface, including parameter types, is not yet decided.
4.1.2 Buffer Cache Hit
When read access to a file is requested by a user process and the buffer cache hits, the request is processed with the flexibility layer as shown in Fig. 10.

Figure 10: Interface procedure for a buffer cache hit. (read() → vn_read() → lfs_read() → sys_lfs_bmapv() → cluster() → flex_bread() → flex_lock_bcache() → getblk() → bremfree() → flex_unlock_bcache().)
4.1.3 Buffer Cache Miss
When read access to a file is requested by a user process and a buffer cache miss occurs, the request is processed with the flexibility layer as shown in Fig. 11. The reason why flex_lock_xxx() wraps the disk read-write operation is to guarantee the validity of the inode information mapping logical block numbers to physical block numbers.

Figure 11: Interface procedure for a buffer cache miss. (read() → vn_read() → lfs_read() → sys_lfs_bmapv() → cluster() → flex_bread() → flex_lock_bcache() → getblk() → getnewbuf() → allocbuf() → disk read → flex_unlock_bcache().)
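The following sketch shows how flex_bread() could wrap both the hit path (Fig. 10) and the miss path (Fig. 11) in flex_lock_bcache()/flex_unlock_bcache(). It is schematic only: getblk() and disk_read() appear with simplified signatures that do not match the real NetBSD prototypes, and the buffer flags test is illustrative.

/* Schematic flex_bread(): the bcache lock keeps the inode's logical-
 * to-physical mapping valid across the cache lookup and, on a miss,
 * across the physical read started for the buffer. */
int
flex_bread(struct vnode *vp, daddr_t lbn, int size, struct buf **bpp)
{
        struct buf *bp;
        int error = 0;

        flex_lock_bcache(lbn);            /* hold off block relocation */
        bp = getblk(vp, lbn, size);       /* buffer cache lookup       */
        if (!ISSET(bp->b_flags, B_CACHE)) /* miss: buffer has no data  */
                error = disk_read(bp);    /* physical read (Fig. 11)   */
        flex_unlock_bcache(lbn);          /* relocation may proceed    */

        *bpp = bp;
        return error;
}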
4.1.4 Write
When write access to a file is requested by a user process, the request is processed with the flexibility layer as shown in Fig. 12.

Figure 12: Interface procedure for write. (write() → vn_write() → lfs_write() → sys_lfs_bmapv() → cluster() → flex_bwrite() → flex_lock_bcache() → disk write → flex_unlock_bcache().)
4.1.5 Reconfiguration
Though we have to investigate this issue further, the general concept we have developed follows the diagram shown in Fig. 13. The flex_giant_lock() interface is necessary to block any access to files while the mount structure is updated. After updating the mount structure, we can remove or attach a SCSI hard disk drive with the scsictl administration command in NetBSD.

Figure 13: Interface procedure for disk block relocation. (nunconfig drives flex_giant_lock(), flex_refresh_mount(), vfs_relocate(), and flex_giant_unlock(); relocation orders flow through flex_req_relocate() and flex_req_fetch() via the request queue, and each move runs flex_lock_bcache(), bio on the system inode, flex_refresh_meta_cache(), and flex_unlock_bcache().)
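In outline, the reconfiguration path of Fig. 13 might look like the following sketch. The function names are taken from the figure, but the parameter lists, the lock argument FLEX_LOCK_ALL, and the error handling are our assumptions.

/* Schematic of the reconfiguration flow in Fig. 13, triggered by the
 * nunconfig tool. All file access is blocked by the giant lock while
 * the mount structure is updated; afterwards the physical device can
 * be detached or attached with scsictl. */
int
flex_reconfigure(struct mount *mp, int victim_disk)
{
        flex_giant_lock(FLEX_LOCK_ALL);    /* block any access to files */
        vfs_relocate(mp, victim_disk);     /* two-phase relocation (Sec. 3.2.4) */
        flex_refresh_mount(mp);            /* update the mount structure */
        flex_giant_unlock(FLEX_LOCK_ALL);  /* resume normal service */
        return 0;
}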
4.2 Internal Organization
The flexibility layer itself is composed of a service request queue, a locker, a buffer cache management component, and a meta-information cache management component. It also contains two main data structures: a circular queue containing block relocation orders, and a hashed lock table holding the numbers of the physical blocks currently in relocation. The operation of these internal components of the flexibility layer, together with the other kernel data structures related to file access, is shown in Fig. 14[5].

Figure 14: Internal organization of the flexibility layer. (The vnode operation processor, finite state machine, open file entries, request queue, locker, buffer cache management, and meta-information cache management sit above an OS-dependent layer containing the circular queue, the hashed lock table, and the vnode, inode, mount structure, and disk.)
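As a concrete sketch, the two data structures might be declared as follows; the capacities, field names, and the simple bucket scheme are assumptions.

#include <stdint.h>

#define RELOC_QLEN   64     /* relocation order queue capacity (assumed) */
#define LOCK_BUCKETS 128    /* hashed lock table buckets (assumed) */

/* Circular queue holding block relocation orders issued by the
 * relocation module and consumed by the flexibility layer. */
struct reloc_queue {
        struct { int32_t from, to; } orders[RELOC_QLEN];
        unsigned head;      /* consumer index */
        unsigned tail;      /* producer index */
};

/* Hashed lock table holding the numbers of the physical blocks that
 * are currently in relocation; reads and writes of those blocks wait
 * until the corresponding entry is cleared. */
struct lock_table {
        int32_t bucket[LOCK_BUCKETS];   /* -1 when the slot is free */
};

static unsigned
lock_hash(int32_t pblkno)
{
        return (unsigned)pblkno % LOCK_BUCKETS;
}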
5. VIRTUAL LOGICAL DISK DEVICE AND VOLUME MANAGER
By utilizing the vnode interface and a virtual logical disk device, the storage device is not confined to a magnetic hard disk drive; it may also be a RAID device or a network disk. We named our implementation of the virtual logical disk device the ncd driver. It provides capabilities similar to the ccd pseudo device driver in BSD-derived UNIX, but with the ncd driver, device reconfiguration is possible for mounted filesystems.
5.1 Functions
The ncd device driver provides the following functions; a hypothetical ioctl-style interface for them is sketched after this list.
• Make several disks into one large logical volume.
• Add an extra disk to a logical volume.
• Remove a free disk from a logical volume. Of course, this operation is possible only when the logical volume has enough free space.
• Unconfigure a logical volume.
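The ncd ioctl interface is not specified in this paper; by analogy with the BSD ccd(4) pseudo device, it might look like the sketch below, where all names and ioctl numbers are invented for illustration.

#include <sys/ioctl.h>

/* Hypothetical ncd configuration request passed from the user-level
 * volume manager (e.g., nunconfig) to the ncd driver. */
struct ncd_ioctl {
        char    **ncd_paths;    /* component disk device paths */
        int       ncd_ndisks;   /* number of component disks */
};

#define NCDIOC_CONFIG   _IOW('N', 0, struct ncd_ioctl)  /* build a volume  */
#define NCDIOC_ADD      _IOW('N', 1, struct ncd_ioctl)  /* proper addition */
#define NCDIOC_REMOVE   _IOW('N', 2, struct ncd_ioctl)  /* proper removal  */
#define NCDIOC_UNCONFIG _IO('N', 3)                     /* unconfigure     */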
5.2 Internal Design
The ncd driver is mainly composed of three parts that depend on the operating system. Although the design is not yet confirmed, we are considering the following internal structure:
• NetBSD disklabel management for each storage device
• Disklabel management for the logical volume
• Error management
6. IMPLEMENTATION
We are beginning the implementation of the Nunbora storage system on the NetBSD operating system and will experiment with it on a Sun Ultra 1 workstation.
7. RELATED WORK
There are many related works on flexible disks and filesystems. xFS[6] and Zebra[7] try to make a single image of distributed disks with RAID facilities. The ISTORE project[1] provides features like ours, but it requires hardware support. Microsoft's dynamic volume manager supports spanning and mirroring, but does not support removing a physical disk to reduce filesystem size. The purpose of the Nunbora filesystem implementation is not to give a single disk image, but to give flexibility in storage management as a virtual layer; that is, Nunbora can be used with a RAID configuration or with network storage. Since the Nunbora filesystem uses a closely integrated virtual device driver to enable hot reconfiguration of the disk array, the crucial difference from products providing similar functionality, such as Microsoft Logical Disk, Sun StorEdge, or EMC, is that the filesystem size can be shrunk by the disk block relocation technique.
8. CONCLUSION
During one semester, we designed a new storage system architecture that enables the addition and removal of storage devices without reformatting or remounting. We are now starting the implementation and will evaluate its performance trade-offs.
9. REFERENCES
[1] A. Brown, D. Oppenheimer, K. Keeton, R. Thomas, J. Kubiatowicz, and D. A. Patterson, "ISTORE: Introspective Storage for Data-Intensive Network Services", Proceedings of the 7th Workshop on Hot Topics in Operating Systems (HotOS-VII), Rio Rico, Arizona, March 1999.
[2] HP-UX Volume Manager white paper, http://www.hp.com.
[3] Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, and John S. Quarterman, The Design and Implementation of the 4.4BSD Operating System, pp. 193-196, Addison-Wesley, 1996.
[4] Maurice J. Bach, The Design of the UNIX Operating System, pp. 46-56, Prentice Hall, 1986.
[5] Charles D. Cranor, "The Design and Implementation of the UVM Virtual Memory System", Ph.D. dissertation, Department of Computer Science, Washington University, 1998.
[6] Thomas E. Anderson, et al., "Serverless Network File Systems", ACM Transactions on Computer Systems, February 1996.
[7] John H. Hartman and John K. Ousterhout, "The Zebra Striped Network File System", ACM Transactions on Computer Systems, August 1995.