A hybrid data compression approach for online backup service
Hua Wang, Ke Zhou, MingKang Qin
School of Computer Science and Technology, Huazhong University of Science & Technology
[email protected],
[email protected],
[email protected]
ABSTRACT
With the popularity of SaaS (Software as a Service), backup service has become a hot topic in storage applications. Because of the large number of backup users, reducing the massive data load is a key problem for system designers, and data compression provides a good solution. Traditional compression applications usually adopt a single method, which has limitations in some respects: data stream compression can only realize intra-file compression, de-duplication only eliminates inter-file redundant data, and neither alone meets the compression efficiency required by backup service software. This paper proposes a novel hybrid compression approach that works at two levels: global compression and block compression. The former eliminates redundant inter-file copies across different users, while the latter adopts data stream compression technology to remove intra-file redundancy. Several compression algorithms were used to measure compression ratio and CPU time, and the adaptability of each algorithm to particular situations is analyzed. The performance analysis shows that the hybrid compression policy brings a great improvement.
Keywords: backup service, block compression, global compression, hybrid compression
1 INTRODUCTION
Data backup technology, which serves as an effective means of data recovery in case of system crash, has been applied more and more widely. With the popularity of SaaS (Software as a Service), the software industry is undergoing a revolution in which software functions are delivered as services rather than as installed software. By backing up distributed data to a centralized storage server, backup service can reduce redundant investment, simplify software maintenance and update software automatically. It can also provide remote disaster recovery for organizations that lack the ability to configure remote storage space. The main difference between backup service and backup software is that the former faces an ever-increasing number of users; to respond to the backup requests of many users, high-speed data transfer and large storage space are needed. Under a given hardware condition, it is therefore particularly important to decrease the amount of data transferred and stored through data compression. In the data compression domain, besides traditional data stream compression, much research has been conducted on de-duplication in recent years. Benjamin Zhu described three techniques used for de-duplication in the Data Domain file system (DDFS) to relieve the disk bottleneck [1]. Athicha Muthitacharoen proposed a low-bandwidth network file system (LBFS), in which files are partitioned into content-based data chunks by their owners to
eliminate duplication [2]. William J. Bolosky described Single Instance Storage, which uses copy-on-close technology to detect redundant copies in NTFS volumes [3]. Sean Quinlan described Venti, a network storage system that implements a write-once policy by means of fixed-size data blocks [4]. Walter Santos presented a parallel de-duplication algorithm that detects replicas in datasets using a cluster; it belongs to post-processing de-duplication [5]. Jim Austin implemented de-duplication in the large database of a grid-based system; it scales to many nodes and avoids problems such as loss of data and deadlock [6]. All of the above research, whether inline de-duplication or post-processing, is aimed at file systems or storage systems and studies particular aspects of de-duplication, such as how to partition data blocks or how to relieve the disk bottleneck; none of it considers the backup application. Among application software that applies data de-duplication, Youjip Won introduced a main-memory index lookup structure to improve index lookup efficiency for a backup system [7], without studying how to reduce redundant data during the backup process. Yan Chen studied the types of data redundancy and compression methods for a disk-based network backup system [8], focusing on the comparison of disk-based compression technologies with tape-based ones. Tianming Yang proposed a fingerprint-based backup method (FBBM) that uses an anchor-based chunking scheme to perform data de-duplication [9]; its shortcoming is that the block partitioning algorithm consumes a lot of CPU resources on client machines, so backup performance is greatly affected. Landon E. Cox presented Pastiche, in which content-based blocking is used to implement de-duplication [10]; Pastiche serves peer-to-peer backup, where peers are selected from nodes that share significant common data, which differs from our backup service software. Traditional compression applications usually apply a single compression technology, such as de-duplication or dictionary-based stream compression, whose compression ratio is limited. This paper proposes a hybrid data compression approach that includes two levels. The higher level is global data compression, which means that data blocks with the same hash value are stored as only one copy on the storage server. The lower level is data block compression, which means that a new block that does not yet exist on the storage server is compressed before being transferred over the network. Experimental results show that this approach improves the compression ratio greatly and satisfies the needs of backup service well.
2 DESCRIPTION OF BACKUP SERVICE SOFTWARE
The backup service software comprises three modules: backup client, director server and storage server. The software architecture is illustrated in figure1.
Figure1 Architecture of backup service software (backup clients, director server and storage server connected through a WAN)
In this three-party structure, the director server is responsible for task scheduling, metadata management, and the enrollment and management of users and storage media. The storage server implements the concrete backup and recovery tasks, including reception and storage of backup data, reading and transmission of recovery data, and storage reclamation and cleanup. The backup client is installed on distributed user machines and connected to the server side through a WAN, whereas the director server and storage server are interconnected through an internal high-speed network. The backup service software implements many types of backup service. According to the client data objects, there are file set backup, database backup and system backup. Backup levels consist of full backup, incremental backup and differential backup. Backup modes include real-time manual backup and pre-defined automatic backup. The data stream is transmitted between the backup client and the storage server directly, without going through the director server; the command stream is used by the director server to control the other two modules. All source data come from the backup clients, so in order to respond to the massive backup and recovery requests of many users, it is necessary to apply data compression technology on the client side to improve system performance.
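The following is an illustrative sketch of how the command stream and the data stream are separated in this architecture; all class and method names are our own assumptions, not the implemented software. The director server only schedules jobs and records metadata, while backup data goes directly from the backup client to the storage server.

# Illustrative sketch of the three-party structure (hypothetical names).
class DirectorServer:
    def __init__(self):
        self.metadata = {}                        # job id -> list of block ids

    def schedule_backup(self, user):
        # Command stream: authenticate the user, pick a storage server, create a job.
        return {"job_id": f"{user}-job-1", "storage": storage_server}

    def record_metadata(self, job_id, block_ids):
        self.metadata[job_id] = block_ids

class StorageServer:
    def __init__(self):
        self.blocks = {}                          # block id -> raw data

    def put_block(self, block_id, data):          # data stream endpoint
        self.blocks[block_id] = data

storage_server = StorageServer()
director_server = DirectorServer()

def backup(user, blocks):
    job = director_server.schedule_backup(user)   # command stream only
    for block_id, data in blocks:
        job["storage"].put_block(block_id, data)  # data stream: client -> storage directly
    director_server.record_metadata(job["job_id"], [b for b, _ in blocks])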
3 A HYBRID DATA COMPRESSION APPROACH
3.1 Data block compression
Block level compression means applying stream compression to the data before transferring it to the storage server, so that compressed data are stored. In case of recovery, the compressed data are transmitted back to the client and decompressed to their original state. Block level compression is a simple and intuitive approach. We can adopt statistics-based or dictionary-based compression algorithms; different algorithms have different characteristics. The LZ77 algorithm, proposed by A. Lempel and J. Ziv in 1977, is dictionary-based, simple and efficient. Many popular compression tools, such as ARJ, WinZip, RAR,
GZip, etc., are based on LZ77. It uses a fixed-length sliding window: once a data sequence has appeared in the previous window, the subsequent occurrence is replaced by a pointer linking back to the previous one. MiniLZO is a lightweight subset of the LZO compression library; it is very fast and suitable for real-time compression. Bzip2 is an open source compression algorithm shipped with many Unix and Linux distributions; it supports recovering from media errors, i.e., decompressing the correct content from a damaged file. The process of using the different algorithms to accomplish data block compression for the backup service software is shown in figure2.
Figure2 Data block compression process for backup service software
Suppose that the original data are sliced into 4 MB units (a file smaller than 4 MB is treated as a whole). The 4 MB data in the grey rectangle in the figure are about to be compressed. The dashed box represents the process using the miniLZO and bzip2 algorithms, which consists of three procedures: (1) read, (2) compress and (3) write back. The process using the LZ77 algorithm is a little more complex: it adopts two levels of cache, and the compression procedure further consists of three sub-procedures: (2.1) read, (2.2) compress and (2.3) copy.
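A minimal sketch of this block compression step follows, assuming 4 MB fixed-size slices. Python's standard bz2 and zlib modules stand in for bzip2 and an LZ77-family codec (DEFLATE is LZ77-based); miniLZO would need a separate binding and is not shown. This is illustrative, not the system's implementation.

import bz2
import zlib

BLOCK_SIZE = 4 * 1024 * 1024   # 4 MB slicing unit

def slice_file(path, block_size=BLOCK_SIZE):
    """Yield fixed-size blocks; a file smaller than block_size yields one block."""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield block

def compress_block(block, algorithm="bzip2"):
    """Read -> compress -> return data ready to be written back or transferred."""
    if algorithm == "bzip2":
        return bz2.compress(block)
    elif algorithm == "lz77":
        return zlib.compress(block)          # DEFLATE: LZ77 matching + Huffman coding
    else:
        raise ValueError("unknown algorithm: %s" % algorithm)

def decompress_block(data, algorithm="bzip2"):
    """Restore a block to its original state during recovery."""
    return bz2.decompress(data) if algorithm == "bzip2" else zlib.decompress(data)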
3.2 Local compression
Block level compression can improve the efficiency of data transmission and storage, but its scope is restricted to redundant data streams within a single file, so the compression ratio is limited. For backup software, much redundant data may exist among the multiple jobs belonging to the same user, especially for multi-version file backup. The traditional backup mode distinguishes files by file name and time stamp, so for a file with only a small modification the whole file still has to be backed up. As backups continue, many redundant data copies are stored, and much valuable network bandwidth and storage space are wasted. De-duplication is a storage technology that has emerged in recent years; it is defined by the Enterprise Storage Group (ESG) as eliminating redundant files or data so that only one copy is stored on disk. Following the idea of de-duplication, we can maintain a local index table in each client so that a unique data block is stored only once. This is local compression, which is shown in figure3.
Figure3 Local compression for backup service software
For each backup client, before a backup task is executed, the relevant files are sliced into blocks and the hash value of each block is computed. The local index table is then queried for the hash of each block: if the hash exists, the block has been transferred previously and we only need to increase the reference count of the hash record; otherwise, the block is transferred to the storage server. For example, in client1, block D2 in Job2 has already been backed up by another job, so it does not need to be backed up again. In the storage server there is a global index table, in which the hashes of all blocks are saved. The local compression approach can eliminate inter-file redundant data blocks within a single client, but not across different users; for example, in figure3 the server saves two copies of block D3, one for client 1 and one for client n.
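A minimal sketch of this client-side index lookup is given below, assuming SHA-1 fingerprints, an in-memory index and a send_to_server callback; the names are illustrative, not the implemented software.

import hashlib

class LocalIndex:
    def __init__(self):
        self.table = {}                     # hash -> reference count

    def backup_block(self, block, send_to_server):
        digest = hashlib.sha1(block).hexdigest()
        if digest in self.table:
            # Block already transferred by an earlier job: just bump the count.
            self.table[digest] += 1
        else:
            # New block: transfer it and create an index entry.
            send_to_server(digest, block)
            self.table[digest] = 1
        return digest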
3.3 Global compression
Block level compression targets intra-file redundant data streams, and local compression can eliminate inter-file duplicated data within a client, but neither can perform de-duplication across different clients. Since backup service software has to serve numerous users over a WAN, a lot of duplicated information exists across users when general software, video/audio files, OS files, etc., are backed up. It is therefore of great significance to mine redundant data across different clients so as to improve the compression ratio. On the basis of local compression, we propose a global compression method: de-duplication across clients is accomplished by maintaining a global index table in the server, as represented in figure4.
[Figure content: Client1, Client2 and Client3 hold nine data blocks in total, identified by hashes H1-H6 with data D1-D6; the storage server keeps a global index table of (hash, reference count, pointer) entries (H1:1, H2:2, H3:2, H4:2, H5:1, H6:1) and a storage file holding a single copy of each of the six distinct blocks D1-D6.]
Figure4 Global compression for backup service software
On the client side, files are sliced according to job, and each data block has its corresponding hash. The three clients contain three, four and two blocks respectively. The storage server maintains a global index table, which records the hash value, reference count and data pointer of each block. Metadata describing the mapping between logical files and physical data blocks are maintained in the director server. In case of recovery, all the related blocks are looked up in the index, transferred to the client and composed back into the original files. We can see that through de-duplication the nine blocks held by the clients are reduced to six blocks in the storage server.
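A minimal sketch of the server-side global index table is shown below, assuming hex digests as block identifiers and an append-only storage file; field and method names are our own, not the implemented software.

class GlobalIndex:
    def __init__(self):
        # hash -> {"count": reference count, "pointer": position in the storage file}
        self.table = {}
        self.storage_file = []              # stands in for the on-disk storage file

    def has_block(self, digest):
        """Answer a client's query: is this block already stored?"""
        return digest in self.table

    def add_reference(self, digest):
        """The block exists: another logical file now references it."""
        self.table[digest]["count"] += 1

    def store_block(self, digest, data):
        """New block: append it once and create its index entry."""
        self.storage_file.append(data)
        self.table[digest] = {"count": 1, "pointer": len(self.storage_file) - 1}

    def fetch_block(self, digest):
        """Recovery path: follow the pointer to the stored copy."""
        return self.storage_file[self.table[digest]["pointer"]]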
3.4 A hybrid data compression approach
For the backup service, global compression accomplishes inter-file de-duplication, while block compression eliminates intra-file redundant data streams. The two methods operate at different levels, and the compression ratio can be improved greatly by combining them. This paper therefore proposes a hybrid data compression approach with two levels: the higher level is global data compression, which means that blocks with the same hash are stored only once, and the lower level is block compression, which means that a new block is compressed on the client side before being transferred to the server. The hybrid compression process is shown in figure5. The order of the two compression steps cannot be reversed; otherwise block compression would destroy the similarity among different files.
Figure5 Hybrid data compression process between backup client and storage server
In figure5, the functions H(Di) and C(Di) represent computing the hash of block Di and applying block level compression to it, respectively. For simplicity, only the data transfer between the backup client and the storage server is illustrated;
the authentication process and the state monitoring performed by the director server are omitted. During the backup process, the backup client creates backup objects, which can include a file set, a database or an operating system. A file larger than a pre-defined threshold is broken into fixed-size chunks, while smaller files are left unchanged; this provides a data size benchmark for de-duplication. In global compression, the hash value of each data block is computed and transmitted to the storage server to query whether a block with the same value already exists. If it exists in the global index table, the index of the block is updated without transferring the block; if it does not exist, the block is propagated to the storage server and a new index entry is created. For these new blocks, applying an appropriate compression algorithm before transmission saves bandwidth and storage space greatly. During the recovery process, the backup client receives the data blocks of a specific data object from the storage server according to the file list obtained from the director server, then decompresses the blocks and composes them into the original data object.
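The following self-contained sketch combines both levels under the assumptions above: de-duplicate by hash first, then compress only the blocks that are new to the storage server. Function names and the use of bz2 as the block compressor are illustrative assumptions, not the system's implementation.

import bz2
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024                 # pre-defined slicing threshold

def chunk(data, size=CHUNK_SIZE):
    """Fixed-size chunking; data smaller than `size` yields a single chunk."""
    return [data[i:i + size] for i in range(0, len(data), size)] or [b""]

def hybrid_backup(data, server_index, server_store):
    """server_index: dict hash -> refcount; server_store: dict hash -> compressed block."""
    manifest = []                             # ordered hashes, kept as file metadata
    for block in chunk(data):
        digest = hashlib.sha1(block).hexdigest()
        manifest.append(digest)
        if digest in server_index:
            server_index[digest] += 1         # global compression: block already stored
        else:
            server_store[digest] = bz2.compress(block)   # block compression, then transfer
            server_index[digest] = 1
    return manifest

def hybrid_restore(manifest, server_store):
    """Recovery: fetch, decompress and reassemble the blocks in order."""
    return b"".join(bz2.decompress(server_store[d]) for d in manifest)

Note that hashing is applied to the uncompressed block, which is why the two steps cannot be swapped: compressing first would destroy the block-level similarity between files.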
4 PERFORMANCE ANALYSIS
Since the hybrid data compression approach inevitably adds computing overhead on the client, we carried out a comprehensive test covering both compression ratio and CPU time. The testing computer (backup client and server are installed on the same machine to simplify the test) has a 1.73 GHz CPU and 1.25 GB of memory. The test uses the three block compression algorithms mentioned above and targets four file types. The testing items are block compression ratio, composite compression ratio and CPU utilization ratio, defined as follows. (1) Block compression ratio: the ratio of the original data to the compressed data under block compression alone, without regard to global compression. (2) Composite compression ratio: the ratio of the original data to the compressed data when both block compression and global compression are applied. (3) CPU utilization ratio: the ratio of the CPU time consumed by the backup client to that consumed by all processes during the course of data backup.
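Restated compactly (the symbols are ours, introduced only for clarity): with S_orig the original data size, S_blk the size after block compression alone, S_hyb the size after both levels, T_client the CPU time of the backup client and T_all the CPU time of all processes,

R_{block} = \frac{S_{orig}}{S_{blk}}, \qquad R_{composite} = \frac{S_{orig}}{S_{hyb}}, \qquad U_{CPU} = \frac{T_{client}}{T_{all}}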
4.1 Design of test cases
According to an analysis of users' habits, we selected four file types: DOC files, PDF files, MP3 files and a file set containing many file types. We sampled five backup objects for every file type, each object containing two files. In order to test the global compression ratio, from the second backup object onward each object includes one file that is the same as in the previous object, so that duplicated blocks occur.
4.2 Compression ratio
In order to compare the block compression ratio and composite compression ratio, we will show the two ratios
of the same file type in the same figure. The test results are shown in figure6 to figure9, in which the vertical axis represents the compression ratio (in %) and the horizontal axis denotes the five groups of selected test cases, arranged in order of increasing data duplicated with the server. Each curve is named A_B_C, in which A indicates the test benchmark, i.e. block compression ratio (Blo), composite compression ratio (Com) or CPU utilization ratio (CPU); B denotes the compression algorithm (LZ77, miniLZO or bzip2) or the original data without compression (ORG); and C represents the file type: DOC, PDF, MP3 or multi-type file set. All subsequent curves follow this naming rule.
Figure6 Compression ratio of DOC file
Figure7 Compression ratio of PDF file
Figure8 Compression ratio of MP3 file
Figure9 Compression ratio of File set
In the above four figures, for the same file type the curves can be partitioned into two groups according to how they cluster: the upper three curves represent the composite compression ratio, whereas the lower three denote the block compression ratio. It is obvious that the efficiency of composite compression is much higher than that of block compression, which indicates the prominent impact of de-duplication. Among the three algorithms, LZ77 is similar to miniLZO in compression ratio, whereas bzip2 achieves the highest. Another conclusion is that as the redundancy between client data and server data increases, the compression efficiency becomes much higher.
4.3 CPU utilization ratio
We tested the CPU utilization ratio for the four file types using the different compression algorithms; the results are
shown in figure10 to figure13. The vertical and horizontal axes represent the CPU utilization ratio and the sample data respectively.
Figure10 CPU utilization ratio of DOC file
Figure11 CPU utilization ratio of PDF file
Figure12 CPU utilization ratio of MP3 file
Figure13 CPU utilization ratio of File set
From the above figures, the CPU utilization ratio of miniLZO is the lowest, close to the original case without compression; that of LZ77 is in the middle; whereas bzip2 consumes a lot of system resources, and when compressing MP3 files its CPU utilization ratio reaches 70%. We can draw the following conclusions. (1) We can apply the miniLZO or LZ77 algorithm for block compression on systems with a weak hardware configuration or without the need for a high compression ratio; otherwise we can use the bzip2 algorithm. (2) Although the CPU utilization ratio of bzip2 is much higher than that of the other two algorithms, the cost is concentrated on compressing video/audio files. In a real application, we can exclude such files before applying the algorithm, so that we obtain good compression efficiency without consuming too much system resource.
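A minimal sketch of such a selection policy follows; the extension list and the hardware flag are illustrative assumptions, not values from the tests above.

import os

MEDIA_EXTENSIONS = {".mp3", ".mp4", ".avi", ".jpg", ".zip"}   # assumed "exclude" set

def choose_algorithm(path, weak_hardware=False):
    ext = os.path.splitext(path)[1].lower()
    if ext in MEDIA_EXTENSIONS:
        return None            # skip block compression for already-compressed media
    if weak_hardware:
        return "miniLZO"       # lowest CPU cost, ratio close to LZ77
    return "bzip2"             # highest compression ratio, highest CPU cost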
5 CONCLUSIONS
Aiming at the characteristics of backup service, we studied how to improve the transmission and storage efficiency of massive data and proposed a hybrid data compression approach. The approach covers two levels:
global compression realizes inter-file de-duplication, and block compression removes intra-file redundant data streams. The former spans the backup clients and the server: by comparing the data on the clients with that on the server side, only the data not yet present on the storage server are transferred. The latter is restricted to the clients and decreases the data volume by using stream compression algorithms. For block compression we adopted three lossless compression algorithms, analyzed the CPU utilization ratio, composite compression ratio and block compression ratio for frequently used file types, and summarized the adaptability of each algorithm to particular hardware configurations and application requirements. This provides a useful reference for the design of compression methods in backup service software. Currently, the global compression method uses fixed-length slicing, which cannot mine inter-file redundant data to the greatest extent. Future work will use content-based slicing to realize global compression without greatly affecting system performance.
References:
[1] Benjamin Zhu, Kai Li, Hugo Patterson, "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System", FAST'08: 6th USENIX Conference on File and Storage Technologies, 269-282 (2008).
[2] Athicha Muthitacharoen, Benjie Chen, David Mazières, "A Low-Bandwidth Network File System", (2001).
[3] William J. Bolosky, Scott Corbin, David Goebel, John R. Douceur, "Single Instance Storage in Windows 2000", (2000).
[4] Sean Quinlan, Sean Dorward, "Venti: a New Approach to Archival Storage", FAST'02: USENIX Conference on File and Storage Technologies, (2002).
[5] Walter Santos, Thiago Teixeira, Carla Machado, Wagner Meira Jr., "A Scalable Parallel Deduplication Algorithm", 19th International Symposium on Computer Architecture and High Performance Computing, 79-86 (2007).
[6] Jim Austin, Aaron Turner, Sujeewa Alwis, "Grid Enabling Data De-Duplication", Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, (2006).
[7] Youjip Won, Jongmyeong Ban, Jaehong Min, Jungpil Hur, Sangkyu Oh, Jangsun Lee, "Efficient index lookup for De-duplication backup system", (2008).
[8] Yan Chen, Zhiwei Qu, Zhenhua Zhang, Boon-Lock Yeo, "Data Redundancy and Compression Methods for a Disk-based Network Backup System", Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'04), (2004).
[9] Tianming Yang, Dan Feng, Jingning Liu, Yaping Wan, "FBBM: A new Backup Method with Data De-duplication Capability", 2008 International Conference on Multimedia and Ubiquitous Engineering, 30-35 (2008).
[10] Landon E. Cox, Christopher D. Murray, Brian D. Noble, "Pastiche: Making Backup Cheap and Easy", 5th Symposium on Operating Systems Design and Implementation, 285-298 (2002).