OASIS: Implementation of a Cluster File System Using Object-based Storage Devices

Young-Kyun Kim, Hong-Yeon Kim, Sang-Min Lee, June Kim, and Myoung-Joon Kim

Internet Server Group, Digital Home Research Division, Electronics and Telecommunications Research Institute, 161 Gajeong-dong, Yuseong-gu, Daejeon, 305-350, Korea
{kimyoung, kimhy, sanglee, jkim, joonkim}@etri.re.kr

Abstract. The emerging object-based storage device (OSD) architecture facilitates the creation of self-managed, secure, and shared storage. Despite its potential to greatly improve the scalability and performance of distributed storage systems, OSD has so far drawn attention mainly from high-end applications; it is now necessary to bring the OSD technology to mid- and entry-level applications as well. In this paper, we present OASIS, a cluster file system using OSD that combines the iSCSI protocol with the OSD SCSI command set. Our experiments show that the system matches the performance of existing systems and scales linearly as the number of clients increases.

1 Introduction

Fundamental advances in storage system architectures and storage device interfaces have greatly improved the scalability and performance of storage systems. On the architecture side, storage systems focus on separating metadata management from the data path, which is a key to the scalability of distributed storage. With this separation, clients access storage devices directly, and the system metadata can be distributed without a central bottleneck. Many systems have been built on this architecture, including IBM Storage Tank, the Lustre file system, and the Panasas cluster storage [6, 7, 9, 11]. On the device side, a new standard interface, the Object-based Storage Device (OSD), allows storage devices to manage the layout of object data within the device [12]. OSD differs from standard storage devices such as FC or IDE disks, which expose a traditional block-based interface. OSD enables the creation of self-managed, heterogeneous, and secure storage for storage networks. Storage systems that employ OSD can offload space management onto the storage devices and reach new levels of scalability. Although storage systems using OSD have advantages in scalability and performance, a few barriers to wide acceptance of the OSD technology remain, as discussed in [2]. The main barrier is that hosts no longer access blocks of data on disk; instead, they refer to objects and offsets within objects. This requires changes to the basic I/O subsystems in the system stack. Moreover, the client software accessing the storage device must also change, because hosts no longer handle space management.

To date, only a few systems using OSD have been introduced in industry. Systems built to support large, highly distributed, and scalable applications readily accept the OSD technology, and it is becoming an indispensable component of such systems. The representative systems adopting OSD for highly scalable storage are the PanFS cluster file system, the Lustre SAN file system, and the zFS distributed file system. Panasas [7] announced the ActiveScale Storage Cluster, which is comprised of OSDs, the PanFS MetaData Server (MDS), the Panasas File System (PanFS), and a Gigabit Ethernet network fabric. PanFS provides direct, parallel access to the OSDs for file data, while the MDS handles metadata operations out of band. Although Panasas is active in the standardization effort, its OSD is not currently standard-compliant. Cluster File Systems, Inc. developed Lustre [1, 6], an open-source file system intended to be a highly scalable SAN file system. It is built from three components, namely clients, a cluster control system, and Object Storage Targets (OSTs), connected by a variety of high-speed networks such as Quadrics, Myrinet, Fibre Channel, InfiniBand, and TCP. Like PanFS, Lustre uses a proprietary OSD to provide very high performance. zFS [9] is an IBM research project aimed at building a decentralized file system that distributes all aspects of file and storage management over a set of cooperating machines interconnected by a high-speed network, and it provides applications with a distributed file system. While the Panasas ActiveScale and Lustre share the same storage architecture concept, zFS does not: it distributes metadata management into the OSDs instead of managing metadata centrally.

All of the systems above concentrate only on highly scalable, high-performance storage. They therefore require high-speed networks such as FC and InfiniBand, and they access their OSDs through native protocols that are not standard-compliant and through proprietary OSD interfaces in order to obtain high performance. As a result, OSD adoption has been slow in many applications, including small-scale storage systems. We believe OSD can become more prevalent in many applications by emphasizing conformance to the standard OSD interface and a standard protocol for accessing networked storage over Gigabit Ethernet, the most common storage network fabric.

In this paper, we present OASIS¹, a scalable cluster file system using OSD, developed to bring the OSD technology into large-scale distributed storage systems. The basic idea of OASIS is to comply with the T10 OSD specification and to adopt the iSCSI protocol, which is widely accepted in the storage industry, for accessing the OSDs. We show the design and implementation of OASIS and describe experimental results that illustrate its performance implications.

¹ OASIS: Object-based storage Architecture for Scalability, Intelligence, and Security

The rest of this paper is organized as follows. Section 2 describes OASIS in detail, that is, its architecture and the functionality of the various components. Section 3 presents experiments on the implemented prototype and examines the performance implications of OASIS. Finally, Section 4 summarizes our work.

2 OASIS Architecture

OASIS is a research prototype of a cluster file system deploying OSD in an asymmetric file system environment similar to Lustre and PanFS. The design and implementation of OASIS aim at building a highly scalable file system that operates equally well on a few or hundreds of machines connected by TCP/IP over Gigabit Ethernet. More specifically, the major objectives of OASIS are as follows.

– To achieve high scalability, it separates storage management from file management and metadata operations from the data path by using object-based storage devices.
– It fully complies with the standard SCSI OSD T10 specification.
– For ease of acceptance, it provides a connectivity infrastructure that binds the components of the system over a Gigabit Ethernet network.
– To eliminate single points of failure, it supports a fail-over cluster of metadata servers.
– It yields an almost linear increase in performance whenever a computing node is added.

OASIS storage devices are network-attached OSDs that provide an object-level interface to file data and use iSCSI as their SCSI transport. Each OSD manages the layout of its objects and autonomously serves and stores them. The metadata server cluster is in charge of namespace management, authentication, and related tasks, and provides clients with the file-to-object mapping. Clients running the OASIS client file system module communicate directly with the OSDs to retrieve the objects belonging to the files they access. OASIS consists of three components, all of which are shown in Figure 1.

– The OASIS OSD Target (OASIS/OST) is a self-managed storage device. It lays out, manages, and serves objects through the OSD command interface. We assume that each OST contains a processor, RAM, a Gigabit Ethernet network interface, and disks.
– The OASIS MetaData Server (OASIS/MDS) supports all file system namespace operations, including file lookup, file creation, and file and directory attribute manipulation. It also allows clients to share file data while maintaining cache coherency on all clients.

[Fig. 1. OASIS components. The client node runs OASIS/FM (VFS, metadata manager, volume manager, multiple object devices driver, and a SCSI OSD initiator built on the Linux SCSI stack with the UNH iSCSI initiator extended for OSD CDBs); OASIS/MDS is reached via RPC; each OASIS/OST runs an iSCSI target driver, an object I/O manager, an OSD manager, and a storage device manager on a local file system (ext3/XFS/ReiserFS). The diagram distinguishes OASIS components from Linux built-in components.]

– The OASIS File System (OASIS/FS), implemented as the OASIS File Manager (OASIS/FM) module, runs on every client that wants to use OASIS. It presents the standard POSIX API to the client and provides access to OASIS files and directories. To access file data, it addresses the OSDs directly.

The operation of OASIS is as follows. Applications running on a client machine make file system requests to OASIS/FM on that machine. OASIS/FM preprocesses the requests and queries the clustered metadata servers, OASIS/MDS, to open the files and obtain the information used to determine which objects comprise the files. OASIS/FM then contacts the appropriate OASIS/OSTs to access the objects that contain the requested data, and returns that data to the applications.
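To make this control flow concrete, the following user-space sketch walks through the open/read path. All names (mds_open, ost_read_object, struct object_location) and the stub bodies are hypothetical placeholders invented for illustration; the real OASIS/FM implements this path as a Linux kernel module beneath the VFS.

```c
/* Minimal sketch of the OASIS open/read flow (hypothetical names and stubs). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

struct object_location {        /* where the MDS says the file data lives */
    uint32_t ost_id;            /* which OASIS/OST holds the object       */
    uint64_t object_id;         /* object identifier within that OST      */
};

/* Stand-in for the RPC to OASIS/MDS: lookup, access check, file-to-object map. */
static int mds_open(const char *path, struct object_location *loc)
{
    (void)path;
    loc->ost_id = 3;            /* made-up values for the demo            */
    loc->object_id = 0x1000;
    return 0;
}

/* Stand-in for an OSD READ sent to the OST over iSCSI. */
static ssize_t ost_read_object(const struct object_location *loc,
                               uint64_t offset, void *buf, size_t len)
{
    (void)loc; (void)offset;
    memset(buf, 'x', len);      /* pretend the OST returned the bytes     */
    return (ssize_t)len;
}

/* Client-side flow: metadata through the MDS, data directly from the OST. */
static ssize_t oasis_read(const char *path, uint64_t off, void *buf, size_t len)
{
    struct object_location loc;
    if (mds_open(path, &loc) != 0)                /* 1. control path (MDS) */
        return -1;
    return ost_read_object(&loc, off, buf, len);  /* 2. data path (OST)    */
}

int main(void)
{
    char buf[16];
    ssize_t n = oasis_read("/oasis/file", 0, buf, sizeof(buf));
    printf("read %zd bytes\n", n);
    return 0;
}
```

The point of the split is that the metadata server is consulted only for the control path; the bulk data transfer goes straight from the OST to the client.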

2.1 OASIS/OST Satisfying T10 OSD

OASIS/OST is the network-attached storage device on which file data is created and from which it is retrieved. It implements the standard SCSI OSD T10 protocol and uses iSCSI as its SCSI transport to create and delete objects and to write and read byte ranges of the objects that hold file data. OASIS/OST is comprised of the OSD manager and the storage device driver. The OSD manager is a kernel-loadable module that implements the T10 OSD protocol over iSCSI, while the storage device driver stores the data within a local file system such as ext3, ReiserFS, or XFS. This is similar to Lustre's OST, except that OASIS/OST transports SCSI OSD commands over TCP/IP.

The OSD manager is divided into the iSCSI target driver and the object I/O manager. The iSCSI target driver only supports the extensions to iSCSI that are needed for OSD commands; as a result, OSD commands between clients and targets are transported as SCSI OSD commands over TCP/IP. The object I/O manager is the key component of OASIS/OST. It is in charge of encapsulating and interpreting OSD commands, managing the data layout (including the allocation of object data as well as object attributes), enforcing security, and carrying out data I/O through the storage device driver. With this design, the object I/O manager runs on a Linux platform and is implemented on top of the local file systems; we believe this gives OASIS/OST the flexibility required to adopt new technologies and new types of file systems. The metadata associated with the objects stored in an OSD is managed by the OASIS/OST itself, so the metadata server maintains only one object per OSD regardless of how much data the object holds. This reduces the metadata management burden on the metadata server and greatly increases its scalability.
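As an illustration of how an object I/O manager can be layered on a local file system, the sketch below services an OSD READ by mapping a (partition id, object id) pair to a backing file and delegating the byte-range read to the local file system. The path layout and the OST_ROOT directory are assumptions made for this example only, not the actual OASIS on-disk format.

```c
/* Sketch: one plausible way an OSD object could be backed by a file in the
 * OST's local file system (ext3/XFS/ReiserFS). Layout is an assumption.   */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define OST_ROOT "/var/oasis/ost0"   /* assumed mount point of the local FS */

/* Map an OSD (partition_id, object_id) pair to a path in the local FS. */
static void object_path(char *buf, size_t len,
                        uint64_t partition_id, uint64_t object_id)
{
    snprintf(buf, len, OST_ROOT "/%016" PRIx64 "/%016" PRIx64,
             partition_id, object_id);
}

/* Serve an OSD READ: fetch 'len' bytes at 'offset' from the backing file. */
ssize_t osd_read(uint64_t partition_id, uint64_t object_id,
                 uint64_t offset, void *data, size_t len)
{
    char path[256];
    object_path(path, sizeof(path), partition_id, object_id);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* The local file system handles block allocation and on-disk layout. */
    ssize_t n = pread(fd, data, len, (off_t)offset);
    close(fd);
    return n;
}

int main(void)
{
    char buf[64];
    ssize_t n = osd_read(0x10000, 0x2a, 0, buf, sizeof(buf));
    printf("osd_read returned %zd\n", n);   /* -1 unless the path exists */
    return 0;
}
```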

2.2 OASIS/MDS with Reliability and High Availability

OASIS/MDS is in charge of file and directory management and maintains cache consistency among clients sharing the same file. For the scalability, reliability, and availability of OASIS, OASIS/MDS provides the following key functions.

OASIS/MDS implements the file structure on the storage system, including directory and file creation and deletion and access control. Whenever a client opens a file, the metadata server performs the permission and access-control checks associated with the file and returns the inode together with a map that enumerates the OASIS/OSTs over which the file data is located. To guarantee metadata coherency across multiple clients, OASIS/MDS strictly protects all metadata operations: clients running OASIS/FM always see the same view of the namespace and receive consistent results for update operations. In addition to metadata coherency, OASIS/MDS assists client-side data caching by notifying the OASIS/FMs when changes in cached data affect the correctness of a client's cache.

The metadata server provides online storage management by grouping multiple OASIS/OSTs into logical volumes to improve disk utilization and eliminate storage-related downtime. This is done in cooperation with the multiple object devices (MOD) driver, which provides virtual devices created from one or more independent underlying OASIS/OSTs. It performs functions such as mirroring (RAID1), RAID 0/5, and linear expansion of OASIS/OSTs. The metadata server also continuously monitors the status of the OASIS/OSTs participating in the logical volumes; when a failure occurs, it detects the failure and executes the recovery. To eliminate single points of failure, OASIS supports two metadata servers in an active/standby failover configuration: if the active OASIS/MDS fails, the other node automatically takes over all metadata services. To achieve MDS failover, the OASIS/MDSs use a shared storage device on which the metadata file system resides.

To support the services described above, OASIS/MDS is comprised of the metadata manager, the volume manager, the MOD driver, and a metadata repository. The metadata manager contains the network interface for communicating with clients through RPC, the namespace manager responsible for the various namespace operations, and the lock server for cache coherency among multiple clients. Multiple logical volumes can be configured by the combination of the volume manager and the MOD driver, which consists of raid0, raid1, raid5, and linear modules. We implemented all of OASIS/MDS except the metadata repository as kernel-loadable modules running on Linux. The metadata repository is built on ext3 residing on the shared storage, for easy implementation of failover. We believe that OASIS can increase the scalability of the system, supporting more clients and more storage, both by decoupling the data path from the control path and by supporting dynamic volume management.
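For illustration, the reply a client might receive from the MDS at open time could look like the structure below. This layout is only a sketch built from the elements described in the text (POSIX inode attributes, one object per OST, MOD striping parameters, and a token for cache coherency); it is not the actual OASIS/MDS wire format, and the field names are hypothetical.

```c
/* Hypothetical sketch of an MDS open reply; not the real OASIS format. */
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

#define OASIS_MAX_OSTS 16

struct oasis_ost_entry {
    uint32_t ost_id;               /* an OST in the mounted logical volume   */
    uint64_t object_id;            /* object the file maps to on this OST    */
};

struct oasis_open_reply {          /* what the client gets back on open()    */
    struct stat  attrs;            /* POSIX inode attributes                 */
    uint32_t     n_osts;           /* OSTs making up the logical volume      */
    uint64_t     stripe_unit;      /* MOD stripe unit in bytes               */
    struct oasis_ost_entry map[OASIS_MAX_OSTS];  /* file-to-object mapping   */
    uint32_t     lock_token;       /* assumed handle for cache coherency     */
};

int main(void)
{
    printf("open reply is %zu bytes in this sketch\n",
           sizeof(struct oasis_open_reply));
    return 0;
}
```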

2.3 POSIX-Compliant & Light-Weight OASIS/FM

Any client that wants to use OASIS must install a kernel-loadable module, OASIS/FM. OASIS/FM presents a POSIX file system interface and provides access to OASIS files and directories. It provides the following functions.

OASIS/FM implements the standard POSIX interface for applications, covering superblock operations, inode operations, and file operations. It also caches incoming data from the OASIS/OSTs as well as metadata from the OASIS/MDS on the client. All data is cached through the Linux buffer/page caches, so OASIS/FM automatically benefits from the read-ahead cache for prefetching file data and the write-back cache for efficient aggregation of multiple data writes, which are inherent functions of the Linux VFS (Virtual File System). Clients with OASIS/FM directly access multiple OSTs with the help of the MOD driver described in Section 2.2. Unlike Lustre, OASIS does not stripe on a per-file basis; instead, MOD stripes files across multiple OSTs at the level of the mounted file system, so clients take advantage of the RAID functions of the system (illustrated by the sketch after this section).

OASIS/FM uses the iSCSI protocol to transmit the SCSI OSD commands and the data payload across TCP networks. To enable iSCSI communication, it provides an iSCSI OSD initiator that follows the Linux SCSI stack. The iSCSI OSD initiator is comprised of the SCSI OSD (SO) driver, which constructs and parses SCSI OSD commands conforming to the T10 standard, and the iSCSI initiator, which sends and receives both the SCSI OSD commands and the data payload across the network. We use the UNH initiator [10] as the iSCSI initiator and extend it to support the extended CDBs (Command Descriptor Blocks) required for OSD commands. Excluding the MOD driver, OASIS/FM consists of very light-weight modules; we expect that its small size and simplicity will make it easy to port to other platforms, including Windows XP.
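The RAID0-style striping that the MOD driver applies at the file system level reduces to a small amount of arithmetic: a logical byte offset is mapped round-robin to an OST and to an offset inside that OST's object. The 64 KB stripe unit and the four OSTs below are illustrative values only, not parameters taken from the OASIS implementation.

```c
/* Sketch of RAID0-style striping as a MOD-like driver might compute it. */
#include <stdint.h>
#include <stdio.h>

struct stripe_target {
    uint32_t ost_index;    /* which OST in the logical volume      */
    uint64_t ost_offset;   /* byte offset within that OST's object */
};

static struct stripe_target map_offset(uint64_t logical_off,
                                       uint64_t stripe_unit,
                                       uint32_t n_osts)
{
    uint64_t stripe_no = logical_off / stripe_unit;    /* which stripe unit   */
    struct stripe_target t = {
        .ost_index  = (uint32_t)(stripe_no % n_osts),  /* round-robin OST     */
        .ost_offset = (stripe_no / n_osts) * stripe_unit
                      + logical_off % stripe_unit,     /* offset in the object */
    };
    return t;
}

int main(void)
{
    /* Example: 64 KB stripe unit over 4 OSTs (illustrative values). */
    for (uint64_t off = 0; off < 4 * 65536; off += 65536) {
        struct stripe_target t = map_offset(off, 65536, 4);
        printf("logical %8llu -> OST %u, offset %llu\n",
               (unsigned long long)off, t.ost_index,
               (unsigned long long)t.ost_offset);
    }
    return 0;
}
```

Because the mapping is fixed per mounted file system rather than per file, every file automatically benefits from the RAID layout chosen for the volume.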

3 Highly Scalable Performance

We implemented all components of the OASIS prototype on Fedora Core 3 Linux, kernel version 2.6.10. In this section, we present experimental results for the OASIS system using the well-known file system benchmark tools IOzone [5] and Iometer [4], demonstrating the scalable performance of our prototype.

3.1 Experimental Setup

All of the experiments were executed on servers with a Xeon 3.0 GHz CPU, 512 MB of RAM, and a Gigabit Ethernet NIC. The test servers comprise 10 OASIS/OSTs, 2 OASIS/MDS nodes, and 10 client nodes running OASIS/FM, all connected to a Gigabit Ethernet switch. Each OASIS/OST is equipped with two SATA/1.5 disks. For each experiment, the memory size of each client was limited to 256 MB and that of each OST to 512 MB. Before each run, we unmounted the directory on which the OASIS file system was mounted to purge the buffer/page cache, then remounted the OASIS file system and ran the benchmark. We repeated these steps ten times and took the average of the performance numbers obtained.

We used two benchmarks. The first measures the performance of sequential and random reads and writes of a single file whose size varies from 128 MB to 512 MB. The second is an aggregated performance benchmark that measures the aggregated read throughput of multiple reads issued by the clients; in this experiment the clients do not read the same file, but each client reads its own files in the file system.

3.2 Performance Implications

Figure 2(a) shows the results of the first benchmark, which measures the performance of a single request type, such as sequential read or sequential write, and allows us to examine the performance of our file system. As seen in Figure 2(a), OASIS achieves a throughput of about 100 MB/s for read operations that access cached data on the OASIS/OSTs, and about 40 MB/s for operations that require physical disk I/O in the OASIS/OSTs. From this benchmark we conclude that data caching plays an important role in OSDs in satisfying high performance requirements. Compared with Lustre, our prototype nearly matches Lustre's performance both for cache-hit operations and for operations with disk I/O.

[Fig. 2. OASIS performance benchmarks. (a) IOzone sequential read and write throughput (KB/s) for record sizes from 4 KB to 128 KB and file sizes of 128 MB, 256 MB, and 512 MB. (b) Aggregated read bandwidth (MB/s) versus the number of clients (1 to 10), showing linear scalability.]

Figure 2(b) shows how performance scales linearly with the number of clients in OASIS. We experimented with 10 clients and 10 OSDs and verified that performance scales linearly from 1 to 10 clients; the graph peaks at about 900 MB/s for sequential read operations on cache-hit data in the OSTs. This benchmark confirms that OASIS continues to provide increased aggregated throughput as the number of clients increases. In addition to these tests, we evaluated OASIS in a hardware environment employing TOE (TCP Offload Engine) accelerators installed on the clients and OSDs. In this experiment, OASIS gained only a small performance increase of about 3%, which implies that the OASIS system provides sufficient performance without any hardware accelerator.
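As a rough sanity check of these numbers (a back-of-the-envelope estimate, not a measurement reported above): with each client limited to roughly the ~100 MB/s cache-hit throughput observed for a single client, ten clients could deliver at most about 10 × 100 MB/s = 1000 MB/s, so the measured aggregate of about 900 MB/s, roughly 90 MB/s per client, sits close to the per-client Gigabit Ethernet wire limit and is consistent with the near-linear scaling in Figure 2(b).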

4 Conclusions

The emerging technologies based on iSCSI and OSD are promising for large-scale, high-performance file systems. Together with out-of-band metadata operations, iSCSI and OSD give storage systems the opportunity to provide high scalability and performance. To promote the dissemination of object-based storage devices and to realize a file system using OSD, we have designed and implemented OASIS, a cluster file system targeted at standard-compliant technologies that are widely accepted in industry. Currently, OASIS supports the mandatory interface of the T10 OSD specification and transports the OSD commands and data payload through the standard iSCSI protocol. We have tested the OASIS prototype; in particular, our experiments show that OASIS can provide sufficient performance for high-end applications and that its performance scales linearly to support a large number of clients by distributing their load across the OASIS/OSTs. With these results, we believe that OASIS is a good candidate for highly scalable cluster file systems.

References

[1] Braam, The Lustre Storage Architecture, Technical report, Cluster File Systems, Inc., 2002. http://www.lustre.org/docs/lustre.pdf
[2] Factor, Meth, Naor, Rodeh, Satran, Object Storage: The Future Building Block for Storage Systems, Position paper, IBM Labs in Israel, Haifa University, Mount Carmel. http://www.haifa.il.ibm.com/procject/storage/objectstore
[3] IETF IP Storage (ips) Charter, http://www.ietf.org/html.charters/ips-charter.html
[4] Iometer, http://www.iometer.org/doc/documents.html
[5] IOzone Filesystem Benchmark, http://www.iozone.org
[6] Lustre: A Scalable High Performance File System, Cluster File Systems, Inc., 2003. http://www.lustre.org/docs.html
[7] Nagle, Serenyi, Matthews, The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage, in Proceedings of the ACM/IEEE SC2004 Conference, Pittsburgh, PA, November 2004.
[8] Pease, D., Rees, R., Heneman, W., et al., IBM Storage Tank: A Distributed Storage System, in Proceedings of the First USENIX Conference on File and Storage Technologies, Work in Progress, Monterey, CA, January 2000.
[9] Rodeh, Teperman, zFS: A Scalable Distributed File System Using Object Disks, Technical report, IBM Labs in Israel, Haifa University, Mount Carmel, 2005. http://www.haifa.il.ibm.com/procject/storage/zFS/public.html
[10] UNH iSCSI, http://www.iol.unh.edu/consortiums/iscsi/
[11] Wang, Brandt, Miller, Long, OBFS: A File System for Object-based Storage Devices, in Proceedings of the 21st IEEE/12th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST 2004), College Park, MD, April 2004.
[12] Weber, SCSI Object-Based Storage Device Commands (OSD), ANSI/INCITS 400-2004, InterNational Committee for Information Technology Standards, December 2004. http://www.t10.org/drafts.html
