Hybrid RAID-Tape-Library Storage System for Backup

Lingfang Zeng, Dan Feng, Fang Wang, Ke Zhou, Peng Xia
Key Laboratory of Data Storage System, Ministry of Education
School of Computer, Huazhong University of Science and Technology, Wuhan, China
Corresponding Authors: [email protected], [email protected]

Abstract

Traditional tape backup systems have begun to wear out their welcome in some businesses, which are turning to virtual tape technologies. Virtual tape technologies combine the traditional backup methodology with disk drive technology to create a disk-based library that acts as a tape library. In concert with traditional backup software from vendors, virtual tape libraries write data to disk in current tape formats. Because disk is used rather than tape, data can be backed up at channel speeds many times faster than with tape, and can also be recovered more quickly. This paper presents a new hybrid storage appliance, the RAID-tape-library, which presents to customary backup software the behavior, performance and other features of tape drives. Unlike a traditional software-based or hardware-based virtual tape library (VTL), which separates the tape library from the disk, our implementation is an independent appliance that supports backup applications and Ultra SCSI attachments. It resembles a disk device, but it has the abundant storage capacity of a tape library. With this hybrid device, the time needed to mount, position and dismount tapes is overlapped with effective data transfer in a RAID. Simulation results show that the hybrid storage device performs favorably.

1. Introduction

Nowadays large amounts of data are stored on tertiary storage media such as optical disks and magnetic tapes. The tape library has its strengths, despite its relatively slow speed and overall bulkiness: it is inexpensive compared with most disks, it is demountable, and it has been used in data centers for many years. However, tertiary devices present a problem for most application software, since these devices have portable media and operational characteristics very different from those of disks. For example, a tape library often offers very high capacity at low cost, but tapes are accessed sequentially, so lengthy latencies and lower bandwidth cannot be avoided.

Typically, four main methods address those problems. The first is the disk-cache [1, 5] system, which allocates some disk (or RAID, redundant array of independent disks) space to a tape library to cache frequently used backup data. The second uses disks as the primary backup device: data are first stored on disks, and the backup software then backs up the data on disk to tape libraries [2, 3] under information lifecycle management (ILM) [4] policies. Virtual tape library (VTL) technology is the third method. The VTL makes disks appear and function just like a traditional tape library. This is convenient for users who have invested in backup software, since they do not have to change their daily backup work. Because disks offer higher throughput than most tape libraries, the backup window shrinks. Further, since all current data resides on the fast disk device, restoration can be performed without retrieving offsite tapes, further reducing the time required to restore data.

The fourth is to change the tape architecture, e.g., the AIT media [6], which features a semiconductor memory element called memory-in-cassette (MIC). The MIC is a memory chip built into the data cartridge that provides a direct and immediate connection to the drive's on-board processors. It speeds access to files and cartridge data, holds the system's log and other user-definable information, and provides a wealth of data about the history and current state of the data cartridge. Information and file search parameters are formatted within the MIC system, rather than using the on-tape index file or requiring the time-consuming media load and tape threading process used by other tape technologies. Data access time is effectively cut in

Proceedings of the Second International Conference on Embedded Software and Systems (ICESS’05) 0-7695-2512-1/05 $20.00 © 2005

IEEE

half, regardless of tape drive speed and recording density. Our solution is similar to the last category. Unlike VTLs, which may be software-based or hardware-based but keep the tape library and the disk separate, our implementation is an independent appliance that supports backup applications and Ultra SCSI attachments. It resembles a disk device, but it has abundant storage capacity. File data are distributed across the disk device and the tape library. Notably, there is no data migration between the disk device and the tape library (a VTL often performs data migration).

2. Architecture

Figure 1 shows the overall hardware architecture of our hybrid RAID-tape-library device. The device comprises a RAID and a tape library connected by a SCSI channel, and it attaches to the backup server by a SCSI bus. The backup server, the hybrid RAID-tape-library device, the console and the application servers (Web server, E-mail server, etc.) are interconnected by a TCP/IP network. There are thus two channels: the peripheral channel and the network logical data channel. In the following sections, we present our block-level and file-level implementations, respectively.

3. File-level implementation

3.1 File abstract

File abstract: The file abstract maintains the mapping information that directs space allocation for stored files. In the RAID, the mapping information is accompanied by the fore-part of the file's data, whose size is determined by the placement policy and the given storage device. In our method, file abstracts are stored in the RAID: the backup application first writes the file to the RAID while, at the same time, the tape library manager loads/unloads/mounts/unmounts the tape. The subsequent file data are written to tape once the tape drive is ready for writing. In this way, we overlap the time the tape library spends operating tapes. Because the file abstract is stored in the RAID, it also accelerates file location.

Variable file abstract: A variable file abstract is one whose size is not fixed. The main objective is to save RAID storage space.

File sub-abstract: A file sub-abstract is used for files that are distributed across different tapes, where two (or more) tape operation times have to be overlapped.

Figure 2 gives a sketch of the definitions above.

Figure 2. Organization of File with Abstract Design
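The definitions of file abstract, variable file abstract and file sub-abstract can be illustrated with a small data-structure sketch. This is not the layout used in Taper's structs.h; the names Segment and FileLayout and all fields are hypothetical, and the sketch assumes that each tape holding part of a file is preceded by a RAID-resident piece (the abstract for the first tape, a sub-abstract for each later tape) whose size may vary.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MB = 1024 * 1024


@dataclass
class Segment:
    """One RAID piece plus the tape piece that follows it (hypothetical)."""
    raid_bytes: int                 # abstract or sub-abstract data kept in the RAID
    tape_id: Optional[str] = None   # cartridge holding the rest; None if no tape part
    tape_bytes: int = 0             # bytes of this segment stored on that tape


@dataclass
class FileLayout:
    """Mapping information for one backed-up file (hypothetical)."""
    name: str
    segments: List[Segment] = field(default_factory=list)

    @property
    def total_size(self) -> int:
        # The file is the concatenation of every RAID piece and tape piece.
        return sum(s.raid_bytes + s.tape_bytes for s in self.segments)

    @property
    def raid_footprint(self) -> int:
        # RAID space consumed; a variable abstract size keeps this small.
        return sum(s.raid_bytes for s in self.segments)

    def tapes(self) -> List[str]:
        # Cartridges that must be mounted, in read order.
        return [s.tape_id for s in self.segments if s.tape_id is not None]


# A 300 MB file spread over two tapes: a 40 MB abstract covers the first
# mount, and a 10 MB sub-abstract covers the switch to the second tape.
layout = FileLayout("file1", [
    Segment(40 * MB, "T001", 160 * MB),
    Segment(10 * MB, "T002", 90 * MB),
])
```

Under these assumptions, layout.total_size is 300 MB, layout.raid_footprint is 50 MB, and layout.tapes() lists the two cartridges in the order they are needed.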

Figure 1. Architecture of hybrid RAID-Tape-Library Storage Device

3.2 File-level design

At the file level, our design has to be tied to specific backup software, and the mapping information is organized by defining corresponding data structures. We choose Taper [9], an easy-to-use backup solution. Taper allows backups to tape drives, floppy drives, removable devices or any device that Linux supports. Incremental backup and selective restore are available, as well as backup verification. However, Taper does not readily support auto-changing media, so MTX [10] is also adopted for automated library operations. MTX is a set of programs for controlling the tape drives and the robotic mechanism of autoloaders and tape libraries. Both Taper and MTX are licensed under the GNU General Public License, version 2. In Taper, the file structs.h defines the main data structures, such as tape_header, info_file_header, volume_header, file_info, etc. In our design, we add a new command option (-a, for abstract) so as not to change Taper's original definitions.

Some principles follow from our file abstract policy:

(a) Small files are better compressed into one large backup archive.
(b) A file smaller than a given size (e.g., 40 MB; this setting should depend on the given device) is stored in the RAID; otherwise, the file is split into two sub-files. The sub-file that contains the fore-part of the original file data, of not less than that size (e.g., 40 MB), is stored in the RAID.
(c) For a backup/restore operation, writing/reading the file begins from the RAID while, at the same time, the tape library operates the tape for the residual sub-file (for small files, this operation is not necessary).
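Principle (b) can be sketched as a small placement function. This is a minimal illustration, not code from Taper: the function name plan_placement is hypothetical, the 40 MB default simply reuses the example value from the text, and the sketch assumes the RAID-resident fore-part is exactly the threshold size (the text only requires it to be not less than that size).

```python
MB = 1024 * 1024


def plan_placement(file_size: int, threshold: int = 40 * MB):
    """Return (raid_bytes, tape_bytes) for one file under principle (b).

    Assumption: the fore-part kept in the RAID is exactly `threshold`
    bytes, the smallest size the policy allows.
    """
    if file_size < 0 or threshold <= 0:
        raise ValueError("file_size must be >= 0 and threshold must be > 0")
    if file_size <= threshold:
        # Small file: stored entirely in the RAID, no tape operation needed.
        return file_size, 0
    # Large file: the fore-part goes to the RAID, the residual sub-file to tape.
    return threshold, file_size - threshold


raid_part, tape_part = plan_placement(100 * MB)
```

With the 40 MB example threshold, a 100 MB file is split into a 40 MB sub-file in the RAID and a 60 MB residual sub-file on tape, while a 30 MB file stays entirely in the RAID.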

Figure 3. Mapping relation of file-level data organization

Figure 3 shows the mapping relation of file-level data organization. In Figure 3, file1 is split into sf101 and sf102, and likewise for file2, file3, etc. For any file, its fore-part is stored in the fast-access device (RAID), and the other part is stored on tape. Since the file system serves as file management, our main work is to create and maintain the mapping relation between a file and its sub-files. There is also a mapping relation between the file abstract and the sub-file in the RAID. When the system writes the sub-file to the RAID, the corresponding file abstract is also created, and the system reads the file abstract first when it responds to a client's file read request.

3.3 File-level prototype system

Our prototype system design is tied to specific backup software, and the file abstracts (file mapping information) described above are organized by defining corresponding data structures. We choose Taper [9], an easy-to-use backup solution. Taper allows backups to tape drives, floppy drives, removable devices or any device that Linux supports. Incremental backup and selective restore are available, as well as backup verification. However, Taper does not readily support auto-changing media, so MTX [10] is also adopted for automated library operations. MTX is a set of programs for controlling tape drives and the robotic mechanism of autoloaders and tape libraries. In Taper, the file structs.h defines the main data structures, such as tape_header, info_file_header, volume_header, file_abstract, etc. In our design, we add a new command option (-a, for abstract) so as not to change Taper's original definitions.

Some principles follow from our file abstract policy:

(a) Small files are better compressed into one large backup archive.
(b) A file smaller than a given size (e.g., 40 MB; this setting should depend on the given device) is stored in the RAID; otherwise, the file is split into two sub-files. The sub-file that contains the fore-part of the original file data, of not less than that size (e.g., 40 MB), is stored together with the file abstract in the RAID.
(c) For a backup/restore operation, writing/reading the file begins from a file abstract in the RAID while, at the same time, the tape library operates the tape for the residual sub-file (for small files, this operation is not necessary).

We studied and analyzed Taper [9] and used it as our blueprint. Part of the file (the file abstract) is stored in a RAID and the remainder of the file is stored in a tape library. When the hybrid device responds to a client access request, the program first reads some information and part of the file data from the RAID and transfers it to the client; at the same time, the tape library locates the correct position, which takes a considerable amount of time. While the tape library is seeking, Taper reads the file data in the RAID. After Taper finishes reading the data in the RAID, the subsequent data of the file are read from the tape library.

Our experimental platform is a personal computer running Linux. The RAID, designed by our lab and called HUST-RAID, is connected by a SCSI card (LSI53C875), and an hp StorageWorks MSL5000 series library (MSL5030) [12] is attached to the PC by the same SCSI card.
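The overlap between reading the RAID-resident part and positioning the tape can be illustrated with a simple timing model. This is a sketch under stated assumptions, not the prototype's code: the function names and the two-phase model are hypothetical, and the bandwidths and tape setup time used below are illustrative values, not measurements from the testbed.

```python
def restore_time(file_mb, raid_mb, raid_mbps, tape_mbps, tape_setup_s):
    """Estimate total restore time in seconds under a two-phase model.

    raid_mb is the size of the fore-part kept in the RAID; raid_mb == 0
    models the original layout with the whole file on tape.
    """
    if not 0 <= raid_mb <= file_mb:
        raise ValueError("raid_mb must lie between 0 and file_mb")
    raid_read_s = raid_mb / raid_mbps
    # The tape is mounted and positioned while the RAID part is read, so
    # tape streaming starts when the slower of the two activities finishes.
    tape_start_s = max(raid_read_s, tape_setup_s)
    return tape_start_s + (file_mb - raid_mb) / tape_mbps


def first_response_time(raid_mb, raid_latency_s, tape_setup_s):
    """Time until the first data can be sent to the client."""
    # With an abstract in the RAID, the client is served from disk at once;
    # without one, it must wait for the tape to be mounted and positioned.
    return raid_latency_s if raid_mb > 0 else tape_setup_s


# Illustrative numbers only: 100 MB file, 40 MB/s RAID, 10 MB/s tape,
# 30 s to mount and position the tape.
original = restore_time(100, 0, 40, 10, 30)    # 40.0
improved = restore_time(100, 40, 40, 10, 30)   # 36.0
```

In this model, as long as the tape setup time exceeds the RAID read time, the saving equals the time the tape would have needed to stream the fore-part (here 40 MB at 10 MB/s, i.e. 4 s), and the first response no longer waits for the tape at all.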

Table.3

Note: (1) is the File Size; (2) is the Total Backup Time (seconds); (3) is the Real Write Tape Time (seconds)

3.4 File-level experiment results

On the testbed described above, we test backup/restore of a series of files of different sizes (50MB, 100MB, 150MB, 200MB, 250MB, 300MB) using Taper. In Tables 1, 2, 3, 4 and 5, "Original" denotes the results of the original Taper without file abstract processing, and "Improved" denotes the results of Taper with the file abstract method.

The experiment proceeds as follows. First, the six files (50MB, 100MB, 150MB, 200MB, 250MB, 300MB) are written (backed up) sequentially, one by one, to a tape by the original Taper. The real write tape time and the total backup time are recorded. Then, we read (restore) these six files (in the order 50MB, 100MB, 150MB, 200MB, 250MB and 300MB) one by one using the original Taper and record the real read tape time, the total restore time and the first request response time. Lastly, we format the same tape and repeat the above test using the improved Taper (adopting the file abstract method), recording the real write tape time, the total backup time, the real read tape time, the total restore time, the first request response time and the real read abstract time.

Table.1

Original (without file abstract for write (backup))

(1)      (2)    (3)
50MB       9      7
100MB     20     14
150MB     30     22
200MB     39     28
250MB     48     35
300MB     57     41

Note: (1) is the File Size; (2) is the Total Backup Time (seconds); (3) is the Real Write Tape Time (seconds)

Table.2

Original (without file abstract for read (restore))

(1)      (2)    (3)
50MB       9      7
100MB     46     24
150MB     89     44
200MB    142     68
250MB    211    105
300MB    288    141

Note: (1) is the File Size; (2) is the Total Restore Time (seconds); (3) is the Real Read Tape Time (seconds)

Table.4

Improved (by file abstract read (restore))

(1)      (2)    (3)
50MB       9      7
100MB     46     23
150MB     86     41
200MB    138     63
250MB    202     96
300MB    277    130

Note: (1) is the File Size; (2) is the Total Restore Time (seconds); (3) is the Real Read Tape Time (seconds)

Table.5 The first request response time (original and improved) and real read abstract time (only for read)

Original Improved (2) (3) (4) 50MB 2000