SCIENCE CHINA Information Sciences

• RESEARCH PAPERS • Special Focus

June 2011 Vol. 54 No. 6: 1104–1118 doi: 10.1007/s11432-011-4273-0

VMStore: Distributed storage system for multiple virtual machines

LIAO XiaoFei, LI He, JIN Hai*, HOU HaiXiang, JIANG Yue & LIU HaiKun

Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China

Received September 8, 2010; accepted April 26, 2011

Abstract  Desktop virtualization has attracted great attention in both industry and academia. Since a virtualized desktop system is built on multiple virtual machines (VMs), a distributed storage system is needed to manage the VM images. In this paper, we design a distributed storage system, VMStore, by taking into account three important characteristics: high-performance VM snapshots, booting optimization from multiple images, and redundancy removal of image data. We adopt a direct index structure of blocks for VM snapshots to speed up VM booting significantly; provide a distributed storage structure whose bandwidth scales well by dynamically changing the number of storage nodes; and propose a data preprocessing strategy with intelligent object partitioning techniques, which eliminates duplication more effectively. The performance analysis of VMStore focuses on two metrics: the speedup of VM booting and the overhead of de-duplication. Experimental results show the efficiency and effectiveness of VMStore.

Keywords  virtualization, virtualized desktop, distributed storage

Citation Liao X F, Li H, Jin H, et al. VMStore: Distributed storage system for multiple virtual machines. Sci China Inf Sci, 2011, 54: 1104–1118, doi: 10.1007/s11432-011-4273-0

1 Introduction

Since virtualization technologies [1] have been widely used in many areas, the concept of desktop virtualization, which can provide on-demand desktop services, has attracted great attention in both academia and industry. Desktop virtualization has the potential to offer a new, cost-efficient paradigm that eases the demand for resources while maximizing the return on investment. Combining the potential cost savings with other advantages such as disaster recovery, robustness, scalability and security makes it an attractive computing model. ClouDesk [2] is our implementation of this concept. In ClouDesk, many VMs run on a virtual cluster to provide desktop applications for client users. The system has been experimentally deployed in our working environment. Our analysis of user experience and system management shows that a high-performance storage system for multiple VMs is essential in ClouDesk. Since a single storage domain is not sufficient to support large numbers of online users, a distributed storage system is the natural choice for ClouDesk.

*Corresponding author (email: [email protected])


It is known that designing a distributed storage system for multiple VMs involves several special requirements, such as storing frequently snapshotted images for different VMs, fast VM booting from these image files, and reducing redundant data among VM images. Unfortunately, traditional distributed storage systems [3, 4] neither address these issues nor propose good solutions. Parallax [5] has been proposed to solve part of the above problems. It achieves good read/write performance when accessing the storage system for image files, and it is a good reference for designing a new distributed storage system for multiple VMs. However, it does not consider image redundancy, which causes about 50% redundancy in storage; moreover, it targets a single storage domain rather than a distributed storage system.

In this paper, we design and implement a distributed storage system, called VMStore, tailored to desktop environments backed by multiple VMs. The contributions of this paper can be summarized as follows.

• We implement high-performance snapshots in VMStore. Snapshotting is a special feature of virtualized storage due to the encapsulation provided by virtualization. We adopt a direct index structure of blocks to achieve high-performance snapshotting, at the cost of a slightly higher storage load than chained images.

• We provide a distributed storage structure whose data access bandwidth scales with the number of storage nodes.

• We propose a data preprocessing scheme with intelligent object partitioning techniques, which can eliminate duplicates across multiple VM images more effectively. On data nodes, we not only reduce storage overheads, but also improve bandwidth utilization.

The remainder of the paper is organized as follows. Section 2 introduces and analyzes the storage requirements of virtualized desktops. Section 3 describes the system architecture and key technologies of VMStore. Implementation issues are described in section 4. Detailed experiments and results are presented in section 5, followed by a discussion of related work in section 6. Finally, section 7 concludes the paper.

2 Storage requirements for virtualized desktop

2.1 Architecture of virtualized desktop

Figure 1 shows a general overview of the architecture of a virtualized desktop system, named ClouDesk, which has been implemented and deployed at Huazhong University of Science and Technology (HUST), China. In ClouDesk, every user accesses, through a client appliance, a virtualized desktop corresponding to a virtual machine in an application cluster. All user data and system data are stored in a shared storage, called the data center, for VM access, and a management node coordinates all components to provide a uniform service.

In a virtualized desktop system, client appliances provide a user interface to access user applications and data. This interface is a normal desktop, just like a graphical operating system, to achieve a smooth user experience. In current systems, there are two different solutions to building a desktop interface. One is based on remote desktop protocols such as VNC [6] or RDP, while the other uses special mechanisms to rebuild the interface of each user application in the client terminal and integrate the produced interfaces into a desktop view. The latter generally provides better applicability and flexibility than the former.

An application cluster is composed of a number of nodes deployed with a hypervisor to support hundreds of VMs. Application servers are usually equipped with large memory and powerful computing resources to provide good performance for client users. All nodes in this cluster are linked with a high-bandwidth fabric for mass data exchange, such as VM migration. To manage all components cooperatively and efficiently, a management node processes all user accesses and automatically starts up and shuts down VMs when client users log in and log out. It not only works in the login and logout phases of the virtualized desktop but also monitors each VM for load balancing on the application nodes.

The data center is usually a shared storage system with enormous capacity that stores all VM disk images and user data. For safety, the storage space for VM disk images is completely independent of the user data space.

Figure 1  An instance of virtualized desktop system (ClouDesk) overview.

Currently, data centers adopt different storage solutions for different user scales, e.g. NAS for small scale or a cluster file system for large scale. However, larger-scale virtualized desktop deployments are not well supported by a single storage domain. Moreover, several important characteristics of multiple-VM workloads have not been fully considered yet.

2.2 Single storage service for multiple VMs

In a system with multiple VMs, the storage subsystem is an essential component for high availability. For this goal, most storage solutions are based on a single storage service, either a local storage device or a NAS system using file-based protocols such as NFS (network file system). Parallax presents an elaborate design based on a single storage service. In Parallax, a storage VM is deployed on each application node and connects the other general VMs on the same node. The storage service is provided through a POSIX I/O interface. Owing to the encapsulation provided by virtualization, each general VM in this architecture accesses the storage service in the same way it would use a local disk. On top of the agnostic storage service, Parallax builds in special advanced storage functions. In particular, it provides an efficient disk snapshot facility for VMs; such functionality is traditionally implemented in storage arrays or storage switches, which are much more expensive than commodity storage hosts. Some VM systems implement similar snapshot functions based on chained images, but their performance degrades sharply as the number of snapshots increases. Through direct management of disk image content, the cost of snapshots in Parallax can be reduced. Parallax is a novel organization for a multi-VM storage system, providing good performance over both shared storage targets and commodity local disks. Parallax is a very suitable storage solution for a small-scale virtual cluster, but it only supports a single storage service. When the scale of VMs exceeds the capacity of a single storage service, solutions like Parallax cannot satisfy even ordinary VM storage I/O operations. Parallax also adopts a fine-grained block mapping, which consumes a large amount of storage to maintain a complex metadata structure. In some extreme cases, the metadata occupy about 50 percent of the space of the written data, while the ratio is 20 percent in normal cases.

2.3 Storage requirements for virtualized desktop

To design a storage subsystem for a virtualized desktop, three essential issues should be considered.


Snapshot. Snapshotting is a function of virtualized storage enabled by the encapsulation that virtualization provides. A VM disk image in an error state can be rewound to an arbitrary normal state [7] by using a snapshot. This is useful in many service-oriented applications for error recovery. For a virtualized desktop system, another advantage of snapshots is versioning of virtual disk image files. In a virtualized desktop, most users deploy their applications on the same operating system, especially the Windows series, so most of the data in disk images originating from one operating system are similar. In practice, the duplicated data would be copied into each VM's disk image; these copies are redundant and may quickly use up the storage space. Versioning based on snapshots has been adopted in some virtualized desktop systems to avoid this unnecessary duplication: all virtual disk images derived from the same operating system are generated as snapshots of a source operating system image, and each user VM image file is a version of the source image. However, as discussed before, the traditional snapshot solution is based on chained images, and performance degrades linearly with the number of chained snapshots. Due to this shortcoming, current systems merge most versions to guarantee ordinary storage access performance. As a result, one operating system image can only generate a limited number of versions for a few application scenarios, and users cannot save much storage space. A direct index structure of blocks can support snapshots with better performance, even though its storage load is usually higher than that of chained images. We adopt this solution, with some optimizations to reduce its storage load while keeping high snapshot performance, in our distributed storage system.

Boot optimization. In a common physical cluster, each node generally accesses its local file system when it boots up. In a virtualized desktop system, however, all VMs read booting information from the data center. In most cases, virtualized desktop systems are deployed in enterprises to provide the same desktop environment for their staff. Users boot up their virtualized desktops within a short period when their work hours begin; we call such a case a boot storm. In our evaluation samples, one VM needs at least 10 Mbps of data center I/O bandwidth when it boots up on an application node. When hundreds of VMs boot concurrently in an enterprise, the total bandwidth demand can exceed the capacity of a single storage service several times over. To solve this issue, some virtualized desktop systems, such as those based on VMFS [8], deploy a SAN (storage area network) or other expensive storage arrays. Even though such advanced storage systems reduce the incidence of boot storms when the user population is fixed, they lack flexibility when the number of users grows: upgrading them is expensive and inefficient, since expanding bandwidth is impossible in ordinary situations. To resolve this issue at an affordable expense, we resort to a distributed storage structure built from commodity hosts, so that the bandwidth capacity can adapt flexibly to the user scale by changing the number of storage nodes.

Less duplication. Virtualization is an efficient way to share computing resources, but for storage resources there is no better solution than a shared storage system. However, storage sharing is less efficient than computing resource sharing because storage access bandwidth is limited. This means that in a VM cluster the requirements on storage capacity are stricter than for physical hosts, and much duplicate data exists across the VM disks. Current multi-VM systems address this issue through versioning to some extent. In most situations, however, versioning cannot reduce the duplicated data when applications are diverse and data cannot be shared in tree-like version branches. Moreover, running applications produce much data that is largely similar across VMs. In general, a VM disk image is more than 20 GB to provide enough storage space for applications. Besides the VM disk images, the data center must also provide considerable space for user data. Therefore, de-duplication is very important for cost saving. In virtualized desktop systems, this issue is magnified, since every user has different demands and the diversity is greater than in a traditional single-VM system. Traditional de-duplication solutions are not suitable for multi-VM systems because of their complex processing and high latency. We propose a data preprocessing scheme using intelligent object partitioning techniques, which can eliminate duplicates more effectively. By doing this at the source, not only are storage overheads reduced, but bandwidth utilization is also improved.

Figure 2  Overview of the VMStore system architecture.

3 Design of VMStore

This section describes VMStore, a distributed structure for multiple-VM storage. VMStore expands a single storage service to a distributed structure by connecting storage service nodes together, and it provides a transparent interface for VMs to access the storage nodes. The advantages of VMStore include its distributed structure, image encapsulation, snapshot metadata, and de-duplication with good performance.

3.1 Architecture of VMStore

Figure 2 presents a high-level view of the structure of VMStore. A detailed discussion of each component is given below.

Storage nodes in VMStore are just commodity physical hosts running I/O appliances for network access. The I/O appliance on each storage node is not merely an intermediary for data exchange between applications and the storage resource; it also provides other facilities. We implement a data de-duplication mechanism in this appliance to reduce the storage load. It runs either while duplicate data are being generated or after the duplicate data have been written to the local disk.

In many distributed file systems or storage systems, a management node is set up to coordinate the storage nodes and store all metadata, providing an index service for client appliances to access data from the distributed system. Although a management node could be introduced in VMStore, our system has no centralized node, for two reasons. First, the metadata discussed later are too large to be stored on one node. Second, a centralized structure would become a performance bottleneck when processing massive I/O requests.

In VMStore, we deploy a VM running a storage appliance on each physical host of the application cluster; it mediates access to a virtual block storage device by providing distinct virtual disks to the other VMs running on the same host. The storage VMs on all application hosts build up an independent management domain above the entire VM application domain. All advanced features can be applied in this domain, including snapshots, data de-duplication, and global block remapping implemented in software, functions that are usually provided only by expensive enterprise storage hardware.

Like some virtualized desktop products, such as Citrix virtual desktops, which use Xen or XenServer as their virtualization platform, we implement our prototype on Xen because it is open source
and offers high performance. In the Xen toolchain, the block tap (blktap) driver [9] acts as the connector between users and VM storage. Developers can use the interfaces provided by blktap to implement a customized image format that indexes each block of a virtual disk. We designed a format that indexes each block of all VMs in the application cluster to the network of storage nodes, so every user storage operation is redirected to the storage nodes instead of accessing local disks.

3.2 Image file

In VMStore, the structure of the image file is the critical point that determines the data organization of the entire storage system. We designed an index structure for remapping all blocks of a virtual disk image based on the blktap interface, and considered two important issues in its design. First, the index structure should reduce the load of snapshotting in a distributed environment and keep the I/O performance of reloading from a snapshot from degrading the user experience too much. Second, the index structure needs to support the real-time de-duplication mechanism and reduce duplicate data as much as possible.

In a distributed environment, snapshotting is not exactly the same as in a single storage service. As the number of storage nodes increases, the potential network latency is larger than with a single storage node, especially when the access connection switches from one node to another. In this environment, the “chained” structures of VMDK [8] and qcow (a VM image file format) can become performance bottlenecks when distributed image files are accessed on different storage nodes; the degradation is more obvious for random reads/writes than for sequential access. We therefore propose a direct index structure, somewhat like that of Parallax, which is based on a single storage service. In VMStore, each block of an image file has a 64-bit global address remapped in the index structure. A radix tree is used to build the hierarchy of the index structure of one image file, and each tree root is an entry to access a specified version of the image file. When a snapshot is generated, the current radix tree is set read-only and becomes one version of the image file's index structure, while a new tree is replicated from the current one using a COW (copy-on-write) mechanism. After snapshotting, all write operations are mapped to new storage space indexed by the new index structure.

As discussed above, our de-duplication mechanism consists of two phases. One is completed in the storage server VM. The other, which is crucial, guarantees that fewer duplicates are generated by user write operations. Therefore, after a new index structure is generated, data to be written are cached in local memory. When a new snapshot is generated, the new data are compared with the data pointed to by the source radix tree; differing data are transferred to the storage nodes, while identical data are eliminated.
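The following is a minimal sketch of such a direct index structure under stated assumptions: a small radix tree maps virtual block numbers to 64-bit global addresses, a snapshot freezes the current tree, and later writes copy nodes along the path (copy-on-write). The fan-out, depth, class names and in-memory representation are illustrative choices, not VMStore's actual on-disk format.

```python
FANOUT_BITS = 9            # 512-entry nodes (assumption, for illustration only)
LEVELS = 3                 # depth of the radix tree (assumption)

class Node:
    def __init__(self):
        self.slots = [None] * (1 << FANOUT_BITS)
        self.readonly = False

class ImageIndex:
    def __init__(self, root=None):
        self.root = root or Node()

    def _slot(self, vblock, level):
        shift = FANOUT_BITS * (LEVELS - 1 - level)
        return (vblock >> shift) & ((1 << FANOUT_BITS) - 1)

    def lookup(self, vblock):
        node = self.root
        for level in range(LEVELS - 1):
            node = node.slots[self._slot(vblock, level)]
            if node is None:
                return None                          # block never written
        return node.slots[self._slot(vblock, LEVELS - 1)]

    def write_mapping(self, vblock, gaddr):
        # Copy-on-write: nodes inherited read-only from a snapshot are copied
        # along the path before the new 64-bit global address is installed.
        if self.root.readonly:
            self.root = self._copy(self.root)
        node = self.root
        for level in range(LEVELS - 1):
            i = self._slot(vblock, level)
            child = node.slots[i]
            if child is None:
                child = Node()
            elif child.readonly:
                child = self._copy(child)
            node.slots[i] = child
            node = child
        node.slots[self._slot(vblock, LEVELS - 1)] = gaddr

    @staticmethod
    def _copy(node):
        fresh = Node()
        fresh.slots = list(node.slots)               # share children, copy array
        return fresh

    def snapshot(self):
        # Freeze the current tree; it becomes an immutable version entry,
        # and a new writable root that shares its subtrees takes over.
        def freeze(node):
            if isinstance(node, Node) and not node.readonly:
                node.readonly = True
                for child in node.slots:
                    freeze(child)
        freeze(self.root)
        frozen = self.root
        self.root = self._copy(frozen)
        return ImageIndex(frozen)

# A tiny usage example: later writes never disturb the snapshotted version.
idx = ImageIndex()
idx.write_mapping(42, 0x1234)
version1 = idx.snapshot()
idx.write_mapping(42, 0x5678)
assert version1.lookup(42) == 0x1234 and idx.lookup(42) == 0x5678
```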

3.3 Data management

With the direct index structure, metadata in VMStore are more complex and expensive than in a traditional distributed file system. Since the metadata are too large to be kept in memory, a dedicated storage space is set aside for metadata on each storage node. Each index structure has an index address for every read/write request, and reading or writing the index structure is handled in the same way as reading or writing a data block on the storage nodes. After receiving a data search request, a storage node reads the metadata address to find the root of the radix tree and then follows the index structure to return the global address of the requested data.

As shown in Figure 3, the global address of each data or metadata block is divided into two parts. The high 33 bits are the key of consistent hashing [10] used to index a storage node, and the low 31 bits are the index of the data block within the storage node. All blocks with identical high 33-bit addresses are stored in one file whose filename is that 33-bit address; the low 31-bit address indexes a block within this file. With this simple block-mapping rule, every block of all image files and metadata is stored in the right location in VMStore.

In the data distribution procedure, the consistent hashing algorithm is chosen to hash blocks to storage nodes. Consistent hashing is the first DHT algorithm used in P2P networks, and we choose it in consideration of our decentralized structure and the small-to-medium scale of nodes.

Figure 3  The index of a global block address.

Using consistent hashing, each storage node needs to store only the network addresses of its neighbor nodes rather than the whole network topology, and dynamic node management is convenient to implement, as will be discussed in section 4.
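As a concrete illustration of this placement rule, the sketch below splits a 64-bit global address into its 33-bit consistent-hashing key and 31-bit in-file block index, and looks the key up on a simple hash ring. The ring construction (SHA-1 hashing of node names, bisect lookup) and the file-naming and offset conventions are assumptions made for the example, not the exact VMStore implementation.

```python
import bisect
import hashlib

BLOCK_SIZE = 4096                       # 4 KB blocks, as used in the paper

def split_address(gaddr):
    high33 = gaddr >> 31                # consistent-hash key / container file name
    low31 = gaddr & ((1 << 31) - 1)     # block index inside that file
    return high33, low31

class ConsistentHashRing:
    def __init__(self, node_ids):
        self._ring = sorted((self._hash(n), n) for n in node_ids)
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha1(str(key).encode()).digest()[:8], "big")

    def node_for(self, key):
        # First node clockwise from the key's position on the ring.
        i = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[i][1]

def locate(gaddr, ring):
    high33, low31 = split_address(gaddr)
    node = ring.node_for(high33)
    filename = f"{high33:09x}"          # container file named by the 33-bit key
    offset = low31 * BLOCK_SIZE         # byte offset of the block in that file
    return node, filename, offset

# Example: place one block on a three-node ring (node names are hypothetical).
ring = ConsistentHashRing(["store-1", "store-2", "store-3"])
print(locate(0x0000_00AB_CDEF_1234, ring))
```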

3.4 Data redundancy

Data de-duplication is an important technology in VMStore. In general, data de-duplication is a data compression technique for eliminating coarse-grained redundant data. It has typically been used in data backup to improve storage utilization, where it is executed by background applications without disrupting the running service and thus reduces costs. It is also particularly effective for virtual servers: in many cases, virtual servers contain duplicate copies of operating system and application files, so much redundant data exists in a virtualized environment. De-duplication compares data blocks to identify equivalent data and then eliminates the redundant copies. For example, suppose a node first stores file A (each data block within a file is represented by a letter). When the node later tries to store file B, the de-duplication system finds that B consists of the blocks of file A plus a new block L, so it stores only L and creates logical pointers to A; accessing file B then retrieves A's blocks from where file A is stored, together with L.

To build up a virtual desktop environment, the storage system should be able to support hundreds of users or more, so the storage requirement for the image library may reach the scale of terabytes. According to the image analysis by IBM [11], redundant data in a virtualized environment are inevitable. IBM compared a cross-section of images from various versions of operating systems such as Linux and Windows by examining their file systems; the overlap of the cross-section indicates duplication. In our system, a large quantity of duplicated data cannot be avoided either, so we analyzed how much duplicated data our storage system holds. We took an easier approach than crawling the file systems. First, we split the image files into 4 KB pieces, which matches the file system block size, and then computed the SHA-1 hash value of every unit. The ratio of duplicated data can then be obtained from the number of distinct items in the hash table and the total number of blocks. In this test, we selected 5 VM disk images installed with frequently used operating systems such as Fedora, RHEL and Windows. Table 1 gives the statistics for each disk image and shows the self-similarity of these images. Here “Items” means the number of distinct blocks. The difference between the sum of blocks and the items is the amount of data duplicated within a single image, and the ratio is the percentage of duplicates in the image file. A simple comparison shows that the ratio of duplicates is roughly proportional to the available capacity of the image.
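A small sketch of this measurement, under the stated 4 KB fixed-block assumption (the image path in the usage note and the zero-padding of a trailing partial block are illustrative):

```python
import hashlib

BLOCK_SIZE = 4096   # 4 KB, matching the guest file system block size

def duplicate_ratio(image_path):
    # Fingerprint every 4 KB block with SHA-1; the duplicate ratio is
    # (sum of blocks - distinct items) / sum of blocks, as in Table 1.
    fingerprints = set()
    total_blocks = 0
    with open(image_path, "rb") as img:
        while True:
            block = img.read(BLOCK_SIZE)
            if not block:
                break
            if len(block) < BLOCK_SIZE:
                block = block.ljust(BLOCK_SIZE, b"\x00")   # pad the tail block
            fingerprints.add(hashlib.sha1(block).digest())
            total_blocks += 1
    items = len(fingerprints)
    return total_blocks, items, 100.0 * (total_blocks - items) / total_blocks

# Usage (hypothetical path): duplicate_ratio("/images/fedora8-i386.img")
# would correspond to one row of Table 1, e.g. (1145939, 755526, 34.07).
```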

Table 1  Statistics of image on redundancy

Image                 Sum of blocks    Items      Ratio (%)
Fedora8 i386          1145939          755526     34.07
Fedora8 x86_64        1366427          936837     31.44
RedHat5 i386          1024257          700104     31.65
Windows XP i386       1877153          681148     63.71
Windows Vista i386    1400349          1054617    24.69

Table 2  Uniform blocks between images

Image              Fed x86_64    RedHat 5    XP i386    Vista i386
Fedora8 i386       50.94%        32.97%      52.48%     28.91%
Fedora8 x86_64     –             31.57%      50.12%     28.02%
RedHat5 i386       –             –           52.39%     27.63%
Windows XP i386    –             –           –          47.05%

Table 2 shows the overlaps between different operating systems and Linux distributions. For our local experimental data, the percentage of duplicated data reaches 46.19%. This value implies that the potential duplicates in the VM storage system might consume nearly 40 percent of the capacity or even more. In our storage solution, based on the virtualized user environment, 40 percent of the capacity means several disks or storage nodes. If we can eliminate the duplicates with de-duplication technology, costs and management overhead can be greatly reduced.

VMStore runs as a user-level daemon on the storage node and handles block requests through Xen's block tap driver. As designed, the de-duplication feature is implemented as an assistant module, divided into two parts that run on the application nodes and the storage nodes respectively. The module on the storage nodes, named Image Compression, eliminates duplicated data when the VMStore administrator imports image files into the virtual block device. It is in charge of detecting identical blocks between the data to be written and the existing data. According to the experiment by IBM, detecting duplicated data at a 4 KB granularity is more precise than crawling the file system, but the higher precision comes at the cost of a performance penalty. To achieve this goal effectively, there are two key points: the data structure and the redundancy elimination algorithm.

Common storage system data structures are not well suited to de-duplication, so we define a new chunk type in our system. Every chunk has three parts: a series of blocks storing content and two pointers referring to other chunks. In this design, the metadata barely change, because all data can be reorganized as long as the starting address of the file is known. Although this structure reduces the difficulty of implementing de-duplication, it also brings two problems. Usually, I/O performance is improved via striping, but here the prefetching efficiency of the hard disk is degraded by the discrete distribution of chunks. Moreover, the absolute addresses may not align with the offset addresses, which can cause random access to fail unless the data are reorganized before access; this is the second performance bottleneck. To avoid these obstacles, we plan to introduce a database for computing the offsets of all chunks and to accelerate reorganization by image prefetching. Image prefetching is very simple: we keep a trace on popular images, record the frequently accessed chunks, and then put the data, with striping, into a cache pool, enabling each data segment to contain as many of these chunks as possible. By reducing the number of chunk accesses, performance is improved directly, especially during a boot storm.

The hash algorithm and the redundancy elimination algorithm are extremely important in our system. We use the SHA-1 algorithm to calculate a block's hash value, which produces a 160-bit digest from a message with a maximum length of (2^64 - 1) bits. Duplicate elimination can be classified along two dimensions in our system [12]: the granularity of de-duplication and the time at which de-duplication is performed. VMStore is implemented with fixed-size blocks, but the number of blocks in a chunk is unlimited; in this way, de-duplication benefits from content-dependent chunking. Online duplicate elimination is implemented on the storage nodes. It improves write throughput, since duplicated blocks can be eliminated by modifying pointers and thus avoiding disk writes. We have built a table for each image
file to record the logical addresses and the corresponding physical addresses of its blocks. The time complexity of searching this table is O(log N) due to the index lookup overhead. The table is necessary for realizing the inline de-duplication function, so the overhead cannot be avoided, but it has been reduced by improvements to the mapping structure; in future work we will use an index tree instead of the mapping table to obtain better performance. Sparse Indexing [13] and Fingerdiff [14] are two representative de-duplication approaches. Fingerdiff uses hash values as fingerprints, while Sparse Indexing uses sampling and exploits inherent locality to eliminate redundant blocks. Although Fingerdiff can improve de-duplication without increasing the associated overheads, the management overhead of fingerprints grows as the number of chunks increases, so some ideas from Fingerdiff are suited to Redundancy Update.

The other module, Redundancy Update, runs on the application nodes and changes the index of redundant blocks created by VMs. It must capture all write requests issued by the VM; it then creates a buffer pool to cache the modified content and looks up identical blocks in the block device at the same time. Finally, it writes only one instance of each block into the storage system and relocates the others by changing the index in the metadata. Since the main workload on an application node runs in virtual machines, the performance overhead of Redundancy Update must be kept low. A feasible solution is to detect redundancy at medium or low precision by redefining the block size. The key technical difficulty is to predict the position of identical data in storage.
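To make the inline path concrete, the sketch below shows one way a storage node could satisfy a duplicate write by updating a pointer instead of writing to disk, using a fingerprint index plus a per-image logical-to-physical mapping. The in-memory dictionaries and the reference counter are illustrative stand-ins for the paper's mapping table and are not VMStore's actual implementation.

```python
import hashlib

class DedupStore:
    def __init__(self):
        self.blocks = []            # physical block store (disk stand-in)
        self.by_fingerprint = {}    # SHA-1 digest -> physical block number
        self.refcount = {}          # physical block number -> references
        self.mapping = {}           # (image_id, logical_block) -> physical block

    def write(self, image_id, logical_block, data):
        fp = hashlib.sha1(data).digest()
        phys = self.by_fingerprint.get(fp)
        if phys is None:
            phys = len(self.blocks)          # only new content reaches the disk
            self.blocks.append(data)
            self.by_fingerprint[fp] = phys
        self.refcount[phys] = self.refcount.get(phys, 0) + 1
        self.mapping[(image_id, logical_block)] = phys
        return phys

    def read(self, image_id, logical_block):
        return self.blocks[self.mapping[(image_id, logical_block)]]

# Two images writing the same OS block share one physical copy.
store = DedupStore()
a = store.write("vm-a", 7, b"\x00" * 4096)
b = store.write("vm-b", 7, b"\x00" * 4096)
assert a == b and len(store.blocks) == 1
```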

4 Implementation issues

4.1 Cache scheme

In VMStore, cache mechanisms play a very important role in reducing performance loss, since the components along the I/O path operate at very different speeds. We have implemented two types of cache: one is used in the I/O procedure and the other accelerates the block index process.

The I/O cache consists of a module for caching network packets and a module for sequential writing. Because individual block transfers are small, network congestion arises when massive numbers of I/O requests are generated. To alleviate this, I/O requests are merged before network transmission; we choose 512 KB as the cache length for buffering data, to balance memory overhead against network efficiency. Randomly writing small pieces of data to mechanical hard disks degrades overall system performance, so a sequential writing cache implemented on the storage node improves write performance by reducing the number of small pieces written to disk. We choose 8 MB as the length of the sequential writing cache, since the gain from writing even larger pieces is limited. The filename of the data file is the index into the sequential cache that buffers the corresponding data; when the buffered data reach 8 MB, all data in this cache are flushed to the hard disk.

The index cache buffers metadata to accelerate the data indexing procedure. Since metadata are stored on the hard disks of storage nodes, the index latency, which equals the disk access latency plus the network latency, lowers the performance of I/O requests. We therefore cache a part of the indexing structure in the storage VM to reduce network accesses for metadata. Due to the memory limitation of the storage VMs, the index cache only buffers the newest version of each disk image.
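As an illustration of the sequential writing cache just described, the sketch below buffers writes per data-file name and flushes once 8 MB has accumulated, turning many small block writes into one large sequential write. The flush callback and the explicit drain() helper are assumptions for the example, not part of VMStore.

```python
FLUSH_THRESHOLD = 8 * 1024 * 1024        # 8 MB, as chosen in the paper

class SequentialWriteCache:
    def __init__(self, flush_to_disk):
        self.flush_to_disk = flush_to_disk   # callable(filename, data_bytes)
        self.buffers = {}                    # filename -> list of pending chunks
        self.sizes = {}                      # filename -> buffered byte count

    def write(self, filename, data):
        self.buffers.setdefault(filename, []).append(data)
        self.sizes[filename] = self.sizes.get(filename, 0) + len(data)
        if self.sizes[filename] >= FLUSH_THRESHOLD:
            self._flush(filename)

    def _flush(self, filename):
        pending = b"".join(self.buffers.pop(filename, []))
        self.sizes.pop(filename, None)
        if pending:
            self.flush_to_disk(filename, pending)   # one sequential disk write

    def drain(self):
        # Flush whatever is left, e.g. on shutdown.
        for filename in list(self.buffers):
            self._flush(filename)

# Usage sketch:
# cache = SequentialWriteCache(lambda name, buf: open(name, "ab").write(buf))
```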

4.2 Nodes management

As discussed before, each node maintains only its neighbors' information, following the consistent hashing algorithm. Even though the consistent hashing scheme reduces the complexity of dynamic node joining and leaving, handling node changes is not a simple operation. To avoid inefficient queries across the storage network, the address of every node is stored on each storage VM. When a node wants to join the storage service, any storage VM can serve as its entry point for finding its correct position: the node downloads all storage node addresses from a storage VM and calculates its neighbors with the consistent hashing algorithm. After the node informs its two neighbors with join signals, all storage VMs update the information about the new node. The former node should transfer the data whose hash
values are larger than the new node's, while the latter node should delete the data whose hash values are larger than the new node's. When a node leaves, its neighbors carry out a dynamic procedure to maintain the network topology: the former node activates the replica of the offline node's data and transfers these data to its own former node, while the latter node transfers its data to the former node for replication. If a request is sent to the offline node, the request is redirected to the former node by default and the sender deletes the address of the offline node.
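The sketch below illustrates the membership mechanics under the standard consistent-hashing ownership rule (a key belongs to the first node clockwise from its hash): when a node joins, only the current owner of its arc hands over the keys the newcomer now owns, and when a node leaves, the clockwise neighbour inherits its keys. The hashing, node naming and replication-free hand-over are simplifying assumptions for illustration and omit the replica management described above.

```python
import hashlib

def hpos(key):
    # Position on the ring: first 8 bytes of SHA-1, as an integer.
    return int.from_bytes(hashlib.sha1(str(key).encode()).digest()[:8], "big")

class RingNode:
    def __init__(self, name):
        self.name, self.pos, self.data = name, hpos(name), {}

def owner(nodes, key):
    # A key is owned by the first node clockwise from its position.
    ordered = sorted(nodes, key=lambda n: n.pos)
    k = hpos(key)
    for n in ordered:
        if k <= n.pos:
            return n
    return ordered[0]                       # wrap around the ring

def join(nodes, newcomer):
    # The node currently owning the newcomer's arc hands over exactly the
    # keys the newcomer now owns; all other nodes are untouched.
    successor = owner(nodes, newcomer.name)
    nodes.append(newcomer)
    moved = [k for k in list(successor.data) if owner(nodes, k) is newcomer]
    for k in moved:
        newcomer.data[k] = successor.data.pop(k)
    return moved

def leave(nodes, leaver):
    # The clockwise neighbour inherits the departing node's keys.
    nodes.remove(leaver)
    heir = owner(nodes, leaver.name)
    heir.data.update(leaver.data)

# Example: add a third node and transfer only the keys it now owns.
nodes = [RingNode("store-1"), RingNode("store-2")]
for i in range(8):
    key = "blk-%04d" % i
    owner(nodes, key).data[key] = b"..."
moved = join(nodes, RingNode("store-3"))
assert sum(len(n.data) for n in nodes) == 8
```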

5 Experiments and evaluations

5.1 Experimental setup

As discussed above, a number of factors in the implementation of our system impose considerable performance overheads. The distributed design improves the overall throughput of the storage system, but hashing data addresses to storage nodes incurs processing overhead on every data request. Additionally, we adopt a de-duplication mechanism to reduce duplicated data, which may noticeably increase the cost of I/O access due to the complex processing of data comparison. The performance analysis aims to answer two questions. First, what speedup and overheads does VMStore impose on distributed data access requests? Second, what overhead emerges when de-duplication is applied to data processing in real time? We address these questions in turn, using VM boot-up throughput and sequential read/write tests to answer the first, and a performance comparison between the system with and without de-duplication to answer the second.

In all tests, we use Dell PowerEdge 2950 machines as application nodes, each with two 1.6 GHz Xeon processors, 4 GB of RAM, and an Intel e1000 GbE network interface. Storage nodes are Dell PowerEdge T710 machines, each with two 2.4 GHz Xeon processors, 24 GB of RAM, a 1 TB 7.2k RPM SAS hard drive, and a Broadcom 5709 dual-port 1 GbE network interface. We have been developing the system based on Xen 3.2.0 and will upgrade to Xen 3.3.4 for better blktap performance.

5.2 Booting performance

Suppose N VMs boot up simultaneously in the application cluster, each reading the data needed for OS launch from VMStore. Each VM is configured with one VCPU, 512 MB of RAM and a 4 GB disk image. A script boots a specific number of VMs concurrently, and we record the highest rate during the boot procedure; the boot procedure ends when the login interface of a VM appears. Since there is no central monitoring server, the global aggregate rate is obtained by summing the rate observed on each storage node.

Figure 4(a) shows the aggregate read rate for N VMs booting up. When one VM boots, the aggregate rate is 1.88 MB/s, and it rises to 9.82 MB/s when N is 5, showing that the aggregate rate increases roughly linearly with the number of VMs. When the number of VMs exceeds 5, the growth of the aggregate rate slows to a certain extent, because the per-VM I/O capacity drops as the number of VMs rises, which lowers the read rate of each VM. Figure 4(b) shows the aggregate write rate for N VMs booting up. The write rate is higher than the read rate since there are far more write operations during VM boot-up. As in the read case, the write rate per VM decreases when N is larger than 5.

5.3 Snapshot performance

Since snapshotting is a key facility of VMStore, we measured the read latency across checkpoints of a disk image. As shown in Figure 5, we repeated the snapshot operation 12 times, and after each snapshot we recorded the latency of reading 2 MB of data. The read latency fluctuates slightly around a constant value, unlike a “chained” image, whose performance decreases linearly with the number of snapshots.

Figure 4  The aggregate throughput of storage nodes when N VMs boot up. (a) Read; (b) write.

Figure 5  Read latency of different versions.

Figure 6  IOZone results running in VM. (a) Read; (b) write.

We chose IOZone (a file system benchmark) to test the sequential I/O performance of the virtual disk in VMs. As shown in Figure 6(a), sequentially reading data in 4 KB records performs better than other record lengths, since data of the default block size are transferred without extra segmentation overhead. As shown in Figure 6(b), write performance is better than read performance thanks to the sequential writing cache. As with reads, writing 4 KB records is faster than the other record lengths.

5.4 Data redundancy performance

The de-duplication rate indicates how much redundancy we can find in our system. This experiment is a simple test of the influence of the minimum chunk size on the de-duplication rate. In this test, we discard redundant data segments that are smaller than the minimum chunk. All redundant data are defined at the block level, so the de-duplication rate at 4 KB is the baseline, and redundant segments smaller than the chosen standard are treated as nonexistent when de-duplicating with that standard. From the test shown in Figure 7, we conclude that only the 8 KB and 12 KB standards are practical: with 4 KB as the standard, the hash computation imposes too much performance overhead, and the available computing resources could only process data at about 28 MB/s.

Figure 7  De-duplication rate with different standards.

Figure 8  (a) Write rate with de-duplication (IOZone); (b) read rate with de-duplication (IOZone).

In our future work, we will pay more attention to the 8 KB standard to improve the de-duplication rate.

The IOZone write trial was run on a VM with 512 MB of memory and a 4 GB disk image, with the minimum de-duplication chunk size set to 8 KB based on the previous test. Compared with the test in the same environment without de-duplication, the write rate on a VM is barely affected by the de-duplication feature: as shown in Figure 8(a), all results drop by only 1.5 to 2 MB/s, which is better than anticipated. One key factor is the cache pool in Dom0; another is that de-duplication on the application node is an update mechanism, so the background write operations involve only a few computing tasks.

The second IOZone experiment with de-duplication consists of a single client randomly reading a 32 MB file. In Figure 8(b), the read rate is reduced by up to 30% compared with the test without de-duplication. Two reasons can cause this. First, data in storage are not striped, so prefetching often fails, making the cache hit ratio lower than with striped data. Second, on a cache miss, the request is sent to the storage nodes to locate and reload the required data, which consumes I/O and network resources. The best way to eliminate this bottleneck is therefore to improve the cache mechanism and simplify the seeking method.
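For illustration, the sketch below computes the de-duplication rate under one reading of the procedure in section 5.4: duplicate 4 KB blocks are detected with a fingerprint set, consecutive duplicates are grouped into segments, and a segment counts as removable only if it is at least the minimum chunk size (4, 8 or 12 KB). The grouping rule and the in-memory buffer interface are assumptions, not the exact experiment.

```python
import hashlib

BLOCK = 4096

def blocks_of(image_bytes):
    return [image_bytes[i:i + BLOCK] for i in range(0, len(image_bytes), BLOCK)]

def dedup_rate(image_bytes, min_chunk):
    # Mark each 4 KB block as duplicate or not against a fingerprint set.
    seen, duplicate_flags = set(), []
    for blk in blocks_of(image_bytes):
        fp = hashlib.sha1(blk).digest()
        duplicate_flags.append(fp in seen)
        seen.add(fp)

    # Group consecutive duplicates; only segments >= min_chunk are removable.
    removable, run = 0, 0
    for is_dup in duplicate_flags + [False]:       # sentinel closes the last run
        if is_dup:
            run += 1
        else:
            if run * BLOCK >= min_chunk:
                removable += run
            run = 0
    return 100.0 * removable / max(len(duplicate_flags), 1)

# For an in-memory image buffer `img`: dedup_rate(img, 4096) is the 4 KB
# baseline; dedup_rate(img, 8192) and dedup_rate(img, 12288) show what
# survives at 8 KB and 12 KB minimum chunks.
```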

6 Related works

Storage has become a very important issue in virtualization environments. Initially, users typically used virtual disks to encapsulate the data of VM nodes. However, as VMs grow to hundreds or thousands of instances, storage management for virtual disks becomes an increasing challenge. Most data
center or cluster system administrators use a centralized storage solution such as SAN or NAS to reduce management complexity. However, virtual disk isolation and data redundancy become a significant contradiction, and performance is another major issue. There are many studies of and solutions to these issues; the following paragraphs introduce the state-of-the-art techniques for virtual storage design.

The VMware virtual machine file system (VMFS) is a high-performance cluster file system that provides storage virtualization optimized for VMs. Each VM is encapsulated in a small set of files, and VMFS is the default storage system for these files on physical SCSI disks and partitions. VMFS is optimized and certified for a wide range of Fibre Channel and iSCSI SAN equipment. It efficiently stores the entire VM state in a central location and can be created in advance, enabling instant provisioning of VMs without relying on a storage administrator. VMFS provides a set of compelling features such as distributed resource optimization, high availability, efficient off-host backup, and dynamic increase of VMFS volume size. However, VMFS is not an open storage technique and can only be used on the VMware platform.

Parallax is a distributed storage system that provides storage facilities specifically for virtualization environments. Parallax demonstrates that virtual disks can be stored centrally with very high scalability. Its design is based on four high-level concerns: storage agnosticism and isolation, block-level rather than file-level operations on virtual disk images (VDIs), minimal lock management, and snapshots as a primitive operation. A VDI is a single-writer virtual disk that may be accessed in a location-transparent manner from any of the physical hosts in the Parallax cluster. A snapshot in Parallax is a read-only image of an entire disk at a particular point in time. Parallax offers a comprehensive set of storage features, e.g. frequent, low-overhead snapshots of virtual disks, the “gold-mastering” of template images, and the ability to use local disks as a persistent cache to mitigate burst demand on networked storage. Parallax allows virtual disks to be efficiently used and modified in a copy-on-write fashion by many users. However, it does not allow cooperative sharing among users, nor does it enhance the transparency or improve the granularity of virtual disks.

Machine Bank [15] is a virtual storage management system engineered towards the popular shared-lab scenario, where users outnumber the available PCs and may get different PCs in different sessions. Machine Bank is organized in a client/server architecture. The client side runs Virtual PC, which instantiates users' working environments, allowing users to preserve their entire working environment across sessions. Each client runs a VM, which is saved to and re-instantiated from content-addressable backend storage. Machine Bank provides a set of techniques to improve re-instantiation speed and to remove unnecessary network and disk traffic. For example, lightweight hooks on the client side cache and track the logic of user sessions, and a two-level caching mechanism combined with on-demand fetching reduces VM re-instantiation latency.

Capo [16] is a transparent, persistent block request proxy for VM disk images. Capo attempts to reduce the aggregate consumption of shared storage bandwidth in situations where shared storage is used to host VM images. Capo also supports a configurable degree of differential durability, allowing administrators to relax the durability properties, and the associated write load, for less important subsets of a VM's file system.

Ventana is a virtualization-aware file system proposed by Stanford University [17]. Virtual disks are the main form of storage in today's VM environments. Compared with current file systems, they have many attractive features, such as whole-system versioning, isolation, and mobility. However, the low-level interface of virtual disks is very coarse grained, forcing whole-system rollback to be all or nothing, making their contents opaque, and offering no practical means of sharing. These problems seriously limit virtual disks' usability, security, and ease of management. Free of these limitations, Ventana combines the file-based storage and sharing of a conventional distributed file system with the versioning, mobility, and access control features that make virtual disks compelling. Ventana is similar to a conventional distributed file system in that it provides centralized storage for a collection of file trees, allowing transparency and collaborative sharing among users, and it is characterized by the versioning, isolation, and encapsulation properties that support virtualization. The file system of Ventana
is implemented using an object store technique [18]. Each version of a file's data or metadata is stored as an object. An object store contains objects: sparse arrays of bytes numbered from zero to infinity, similar to files. For compatibility with existing clients, the host manager uses an NFS v3 server for clients to access files. Ventana offers four types of system functions.

1) Branches. A private branch is used primarily by a single VM, while a shared branch is used by multiple VMs. In a shared branch, changes made from one VM are visible to the others, so these branches can be used for sharing files, as in a conventional network file system. Non-persistent branches, whose contents do not survive across reboots, are also provided for caches and cryptographic material.

2) Views. Ventana is organized as a collection of file trees. To instantiate a VM, a view is constructed by mapping one or more of these trees into a new file system name space.

3) Access control. Ventana provides orthogonal types of ACLs to enforce file permissions: those of the guest operating systems, which partition functionality according to the guests' own principals, and those of users, which control access to confidential information.

4) Disconnected operation. Ventana supports disconnected operation through a combination of aggressive caching and versioning.

7 Conclusions and future work

In this paper, we have addressed three important characteristics in the design of a distributed storage system, VMStore, for multiple virtualized desktops: high-performance snapshots, boot optimization from multiple images, and redundancy removal from image files [19]. We have evaluated VMStore by analyzing the speedup and overhead it imposes on distributed data access requests and the overhead of real-time de-duplication. The preliminary experimental results show that the system performs well. In the future, we plan to apply the system in large-scale environments and further optimize its design and performance.

Acknowledgements This work was supported by the National Basic Research Program of China (Grant No. 2007CB310900), Program for New Century Excellent Talents in University (Grant No. NCET-08-0218), the National Natural Science Foundation of China (Grant No. 60973133), FOK YING TUNG Education Foundation (Grant No. 122007), and the MoE-Intel Information Technology Special Research Foundation (Grant No. MOE-INTEL-10-05).

References

1 Wang X L, Sun Y F, Luo Y W, et al. Dynamic memory paravirtualization transparent to guest OS. Sci China Inf Sci, 2010, 53: 77–88
2 Liao X F, Jin H, Hu L T, et al. Towards virtualized desktop environment. Concurr Comp-Pract E, 2010, 22: 419–440
3 Xiao N, Chen T, Liu F. RSEDP: Reliable, scalable and efficient data placement algorithm. Int J Super Comput, 2011, 55: 103–122
4 Zhao Y J, Xiao N, Liu F. Red: An efficient replacement algorithm based on REsident distance for exclusive storage caches. In: Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), Nevada, USA, 2010
5 Meyer D, Aggarwal G, Cully B, et al. Parallax: Virtual disks for virtual machines. In: Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys 2008), ACM, 2008
6 Richardson T. Virtual network computing. IEEE Internet Comput, 1998, 2: 33–38
7 Peterson Z, Burns R. Ext3cow: A time-shifting file system for regulatory compliance. ACM Trans Stor, 2005, 1: 190–212
8 VMware, Inc. VMware VMFS product datasheet. http://www.vmware.com/pdf/vmfs_datasheet.pdf
9 Warfield A. Virtual devices for virtual machines. PhD thesis, University of Cambridge, 2006
10 Karger D, Lehman E, Leighton F, et al. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In: Proceedings of the 29th Annual ACM Symposium on Theory of Computing, ACM, 1997. 654–663
11 Liguori A, Van Hensbergen E. Experiences with content addressable storage and virtual disks. In: Proceedings of the First Workshop on I/O Virtualization, ACM, 2008
12 Dubnicki C, Gryz L, Heldt L, et al. HYDRAstor: A scalable secondary storage. In: Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST 2009), ACM, 2009
13 Lillibridge M, Eshghi K, Bhagwat D, et al. Sparse indexing: Large scale, inline de-duplication using sampling and locality. In: Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST 2009), ACM, 2009
14 Bobbarjung D, Dubnicki C, Jagannathan S. Fingerdiff: Improved duplicate elimination in storage systems. In: Proceedings of the IEEE/NASA Goddard Conference on Mass Storage Systems and Technologies (MSST 2006), IEEE, 2006
15 Tang S, Chen Y, Zhang Z. Machine Bank: Own your virtual personal computer. In: Proceedings of the 2007 IEEE International Parallel and Distributed Processing Symposium (IPDPS'07), IEEE, 2007
16 Meyer D T, Wires J, Ivanova M, et al. Capo: Recapitulating storage for virtual desktops. In: Proceedings of FAST'11: 9th USENIX Conference on File and Storage Technologies, ACM, 2011
17 Pfaff B, Garfinkel T, Rosenblum M. Virtualization aware file systems: Getting beyond the limitations of virtual disks. In: Proceedings of the 3rd Symposium on Networked Systems Design and Implementation, ACM, 2006
18 Factor M, Meth K, Naor D, et al. Object storage: The future building block for storage systems. In: Proceedings of the 2nd International IEEE Symposium on Mass Storage Systems and Technologies, Sardinia, Italy, IEEE, 2005
19 Suzaki K, Yagi T, Iijima K, et al. Effect of disk prefetching of guest OS on storage deduplication. In: Proceedings of the Runtime Environments/Systems, Layering, and Virtualized Environments (RESoLVE) Workshop, ACM, 2011
