
Designing a Storage Infrastructure for Scalable Cloud Services

Byung Chul Tak⋆, Chunqiang Tang⋆⋆, and Rong N. Chang⋆⋆

⋆ Pennsylvania State University, University Park, PA
⋆⋆ IBM T.J. Watson Research Center, Hawthorne, NY
[email protected], {ctang,rong}@us.ibm.com

Abstract. In IaaS (Infrastructure-as-a-Service) cloud services, the storage needs of VM (Virtual Machine) instances are met through virtual disks (i.e., virtual block devices). However, it is nontrivial to provide virtual disks to VMs in an efficient and scalable way, for two reasons. First, a VM host may be required to provide virtual disks for a large number of VMs. It is difficult to ascertain the largest possible storage demands and physically provision for them all in the host machine. On the other hand, if the storage space for virtual disks is provided through remote storage servers, the aggregate network traffic due to storage accesses from VMs can easily deplete the network bandwidth and cause congestion. We propose a system, vStore, which overcomes these issues by utilizing the host's limited local disk space as a block-level cache for the remote storage, in order to absorb the network traffic caused by storage accesses. This allows the VMM (Virtual Machine Monitor) to serve VMs' disk I/O requests from the host's local disks most of the time, while providing the illusion of a much larger storage space for creating new virtual disks. Caching is a well-studied topic in many different contexts, but caching virtual disks at the block level poses special challenges in achieving high performance while maintaining virtual disk semantics. First, after a disk write operation finishes from the VM's perspective, the data should survive even if the host immediately encounters a power failure. Second, as disk I/O performance is dominated by disk seek times, it is important to keep a virtual disk as sequential as possible in the limited cache space. Third, the destaging operation that sends dirty blocks back to the remote storage server should be self-adaptive and minimize the impact on foreground traffic. We propose techniques to address these challenges and have implemented them in Xen. Our evaluation shows that vStore provides the illusion of unlimited storage space, significantly reduces network traffic, and incurs a low disk I/O performance overhead.

Key words: Virtual machine, Virtual disk, Cloud storage

1 Introduction

Cloud Computing has recently drawn a lot of attention from both industry and the research community. The success of Amazon's Elastic Compute Cloud (EC2) service [2] has demonstrated the practicality of the concept and its potential as the next paradigm for enterprise-level services computing. Other competitors provide different types of Cloud services, such as Microsoft's Azure [18] and Google's AppEngine [8]. The research community has also recognized the importance of Cloud computing and has started to build several prototype Cloud computing systems [6, 22] for research purposes.

In this paper, we design a scalable architecture that provides reliable virtual disks (i.e., block devices as opposed to object stores) for virtual machines (VMs) in a Cloud environment. Much of the challenge arises from the scale of the problem. On the one hand, a host potentially needs to provide virtual disks, as virtual block devices, to a large number of virtual machines running on that host, which would incur a high cost if every host's local storage space were over-provisioned for the largest possible demand. On the other hand, if the storage space for virtual disks is provided through network attached storage (NAS), it may cause congestion in the network and/or at the storage servers in a large-scale Cloud environment.

We propose a system, called vStore, which addresses these issues by using the host's limited local disks as a block-level cache for the network attached storage. This cache allows the hypervisor to serve VMs' disk I/O requests using the host's local disks most of the time, while providing the illusion of unlimited storage space for creating new virtual disks. Caching is a well-studied topic in many different contexts, but it poses special challenges in achieving high performance while maintaining virtual disk semantics. First, the block-level cache must preserve data integrity in the event of host crashes. After a disk write operation finishes from the VM's perspective, the data should survive even if the host immediately encounters a power failure. This requires that cache handling operations always keep the on-disk metadata and data consistent, to avoid committing incorrect data to the network attached storage during recovery from a crash. A naive solution could easily introduce high overheads in updating on-disk metadata. Second, unlike memory-based caching schemes, the performance of an on-disk cache is highly sensitive to data layout. It requires a cache placement policy that maintains a high degree of data sequentiality in the cache, as in the original (i.e., remote) virtual disk. Third, the destaging operation that sends dirty blocks back to network attached storage should be self-adaptive and minimize the impact on foreground I/O operations.

To our knowledge, this is the first study that systematically addresses the issues in caching virtual disks at the block level. Log-structured file systems [23] and DCD [13] use a local disk as a write log buffer rather than as a generic cache that can handle normal read/write operations. Dm-cache [26] is the closest to vStore. It uses one block device to cache another block device, but may corrupt data in the event of host crashes, because the metadata and data are not always kept consistent on disk. FS-Cache [12] in the Linux kernel implements caching at the file system level instead of at the block device level, and it may also corrupt data in the event of host crashes, for the same reason. Network file systems such as NFS [25], AFS [15], Sprite [19], and Coda [14] employ disk caching at the file level. If the hypervisor uses these file systems as remote storage, virtual disk images have to exist in the remote file system as large files. Since caching is done at file granularity (or partial-file granularity), file-system-level caching may end up storing all or a large part of an entire virtual disk image, which local disks may not be able to accommodate due to their limited space.

Fig. 1. Our model of scalable cloud storage architecture with vStore.

We have implemented vStore in the Xen environment. Our evaluation shows that vStore can handle various workloads effectively. It provides the illusion of unlimited storage, significantly reduces network traffic, and incurs a low disk I/O performance overhead. The rest of this paper is organized as follows. Section 2 gives general background and motivates the work. Section 3 presents the design of vStore. Section 4 describes vStore's implementation details. In Section 5, we empirically evaluate various aspects of vStore. We discuss related work in Section 6 and state concluding remarks in Section 7.

2 Cloud Storage Architecture

Fig. 1 shows the architectural model of a scalable Cloud storage system. It consists of the following components:

• VM-hosting machines: A physical machine hosts a large number of VMs and has limited local storage space. vStore uses the local storage as a block-level cache and provides to VMs the illusion of unlimited storage space. The hypervisor provides a virtual block device to each VM, which means that VMs see raw block devices and are free to install any file system on top of them. Thus, the hypervisor receives block-level requests and has to redirect them to the remote storage.

• Storage Server Cluster: Storage server clusters provide network attached storage to the physical machines. They can be either dedicated high-performance storage servers or a cluster of servers using commodity storage devices. The interface to the hypervisors can be either block-level or file-level. If it is block-level, an iSCSI-type protocol is used between the storage servers and the clients (i.e., hypervisors). If it is file-level, the hypervisor mounts a remote directory structure and keeps the virtual disks as individual files. Note that, regardless of the protocol between hypervisors and storage servers, the interface between VMs and the hypervisor remains block-level.

• Directory Server: The directory server holds location information about the storage server clusters. When a hypervisor wants to connect to a specific storage server, it consults the directory server to determine the storage server's address.
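As a concrete (and purely illustrative) reading of this architecture, the sketch below shows how a hypervisor-side component might resolve a storage server through the directory service and then attach a virtual disk over either interface. All names, types, and the lookup result are hypothetical assumptions; only the control flow mirrors the description above.

```c
/* Hypothetical sketch of how a hypervisor-side component could attach a
 * virtual disk, following the architecture in Fig. 1.  All identifiers are
 * illustrative assumptions, not vStore's actual API. */
#include <stdio.h>
#include <string.h>

enum attach_mode { ATTACH_BLOCK_LEVEL,   /* e.g., an iSCSI-type protocol        */
                   ATTACH_FILE_LEVEL };  /* e.g., a mounted remote directory    */

struct storage_server {
    char address[64];        /* resolved via the directory server              */
    enum attach_mode mode;   /* interface exported to the hypervisor           */
};

/* Ask the directory server where a given virtual disk lives.
 * Stubbed out with a fixed answer for illustration. */
static int directory_lookup(const char *vdisk_id, struct storage_server *out)
{
    (void)vdisk_id;
    strcpy(out->address, "10.0.0.42");      /* placeholder address */
    out->mode = ATTACH_BLOCK_LEVEL;
    return 0;
}

/* Attach the disk.  Regardless of the mode chosen here, the guest VM
 * always sees a raw block device exported by the hypervisor. */
static int attach_virtual_disk(const char *vdisk_id)
{
    struct storage_server srv;
    if (directory_lookup(vdisk_id, &srv) != 0)
        return -1;
    if (srv.mode == ATTACH_BLOCK_LEVEL)
        printf("attach %s as a remote block device at %s\n", vdisk_id, srv.address);
    else
        printf("attach %s as an image file on a mount from %s\n", vdisk_id, srv.address);
    return 0;
}

int main(void) { return attach_virtual_disk("vdisk-0001"); }
```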


• Networking Infrastructure: Network bandwidth within a rack is usually well-provisioned, but the cross-rack network is typically under-provisioned by a factor of 5 to 10 relative to the within-rack network [4].

2.1 Motivations

There are multiple options when designing a storage system for a Cloud. One solution is to use only local storage. In a Cloud, VMs may use different amounts of storage space, depending on how much the user pays. If every host's local storage space is over-provisioned for the largest possible demand, the cost would be prohibitive. Another solution is to use only network attached storage. That is, a VM's root file system, swap area, and additional data disks are all stored on network attached storage. This solution, however, would incur an excessive amount of network traffic and disk I/O load on the storage servers. Sequential disk access can achieve a data rate of 100 MB/s; even with pure random access, it can reach 10 MB/s. Since a 1 Gbps network link can sustain roughly 13 MB/s, four uplinks to the rack-level switch are not enough to handle even a single sequential access. Note that uplinks to the rack-level network switches are limited in number and cannot be easily increased in commodity systems [4]. Even for random disk access, the uplinks can support only about five VMs' disk I/O traffic. Even with 10 Gbps networks, it is still hard to support the thousands of VMs running in one rack (e.g., typical numbers are 42 hosts per rack and 32 VMs per host, i.e., 1,344 VMs per rack). The seriousness of the bandwidth limitation caused by the hierarchical networking structure of data centers has been recognized, and there are continuing efforts to resolve this issue through network-related enhancements [1, 10]. However, we believe it is also necessary to address the issue from a systems perspective, using techniques such as vStore.
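Spelling out the arithmetic behind these bandwidth claims, using only the figures quoted above:

```latex
\begin{align*}
\text{aggregate uplink bandwidth} &\approx 4 \times 13\ \text{MB/s} = 52\ \text{MB/s} < 100\ \text{MB/s (one sequential stream)}\\
\text{VMs supported (random I/O)} &\approx 52\ \text{MB/s} \div 10\ \text{MB/s} \approx 5\\
\text{VMs per rack} &= 42 \times 32 = 1{,}344 \gg 5
\end{align*}
```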

2.2 Our Solution

vStore takes a hybrid approach that leverages both local storage and network attached storage. It still relies on network attached storage to provide sufficient storage space for VMs, but utilizes the local storage of a host to cache data and avoid accessing network attached storage as much as possible. Consider the case of Amazon EC2, where a VM is given one 10GB virtual disk to store its root file system and another 160GB virtual disk to store data. The root disk can be stored on local storage due to its small size. The large data disk can be stored on network attached storage and accessed through the vStore cache.

Data integrity and performance are the two main challenges in the design of vStore. After a disk write operation finishes from the VM's perspective, the data should survive even if the host immediately encounters a power failure. In vStore, system failures can compromise data integrity in several ways. If the host crashes while vStore is in the middle of updating either the metadata or the data, and there is no mechanism for detecting the inconsistency between the metadata and the data, then after the host restarts, incorrect data may remain in the cache and be written back to the network attached storage. Another way data integrity may be compromised is by violating the semantics of writes: if data is buffered in memory and not flushed to disk after reporting write completion to the VM, a system crash will cause data loss. This observation about write semantics constrains the design of vStore and also affects its overall performance.

The second challenge is to achieve high performance, which conflicts with ensuring data integrity and hence requires a careful design to minimize performance penalties. The performance of vStore is affected by several factors: (i) data placement within the cache, (ii) vStore metadata placement on disk, and (iii) complications introduced by the vStore logic. For (i), if sequential blocks in a virtual disk are placed far apart in the cache, a sequential read of these blocks incurs a high overhead due to long disk seek times. Therefore, it is important to keep a virtual disk as sequential as possible in the limited cache space. For (ii), ideally, the on-disk metadata should be small and should not require an additional disk seek to access data and metadata separately. For (iii), one potential overhead is the dependency among outstanding requests; for example, if one request is about to evict a cache entry, then all other requests on that entry must wait. All of these factors affected the design of vStore.
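To make the write-semantics constraint concrete: a hypervisor-side cache write must reach the disk before completion is reported to the guest. Below is a minimal POSIX-level sketch of that ordering; it is illustrative only, and vStore itself additionally couples each data block with a metadata trailer in the same write, as described later in Section 3.

```c
/* Minimal illustration of the write-semantics requirement: do not report
 * write completion to the VM until the data is durable on the local cache
 * disk.  Standard POSIX calls; the function name is hypothetical. */
#include <sys/types.h>
#include <unistd.h>

int durable_write(int fd, const void *buf, size_t len, off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    /* Force the data out of the page cache before acknowledging the VM;
     * otherwise a host crash would silently lose an "acknowledged" write. */
    return fdatasync(fd);
}
```

Opening the cache device with O_DSYNC would achieve the same ordering without an explicit fdatasync() call.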

3 System Design

3.1 System Architecture

The architecture of vStore is shown in Fig. 2. Our discussion is based on para-virtualized Xen [3]. VMs generate block requests in the form of (sector address, sector count). Requests arrive at the front-end device driver within the VM after passing through the guest kernel. Then they are forwarded to the back-end driver in Domain-0. The back-end driver issues actual I/O requests to the device and sends responses to the guest VM along the reverse path. The vStore module runs in Domain-0 and extends the function of the back-end device driver. vStore intercepts requests and filters them through its cache handling logic. As shown in Fig. 2, vStore internally consists of a wait queue for incoming requests, the cache handling logic, and in-memory metadata. Incoming requests are first put into vStore's wait queue. The wait queue is necessary because the cache entry that a request needs to use might be under eviction or update triggered by previous requests. After clearing such conflicts, the request is handled by the cache handling logic. The in-memory metadata are consulted to obtain information such as block address, dirty bit, and modification time. Finally, depending on the current cache state, actual I/O requests are made either to the cache on local storage or to the network attached storage.

Fig. 2. The architecture of vStore.

I/O Unit: Guest VMs usually operate on 4KB blocks, but vStore can perform I/O to and from the network attached storage at a configurable, larger unit. A large I/O unit reduces the size of the in-memory metadata, as it reduces the number of cache entries to manage. Moreover, a large I/O unit works well with high-end storage servers, which are optimized for large I/O sizes (e.g., 256KB or even 1MB). Thus, reading a large unit is as efficient as reading 4KB. This may increase the incoming network traffic, but our evaluation shows that the subsequent savings outweigh the initial cost (as shown in Fig. 8(b) of Section 5). We use the term block group to refer to the I/O unit used by vStore, as opposed to the (typically 4KB) block used by the guest VMs. That is, one block group contains one or more 4KB blocks.

Metadata: The metadata hold information about the cache entries on disk. The metadata are stored on disk for data integrity and cached in memory for performance; metadata updates are done in a write-through manner. After a host crashes and recovers, vStore visits each metadata entry on disk and recovers any dirty data that have not been flushed to the network attached storage. Table 1 summarizes the metadata fields.

Table 1. vStore Metadata.
  Virtual Disk ID (2 Bytes)  - ID assigned by vStore to uniquely identify a virtual disk. An ID is unique only within an individual hypervisor.
  Sector Address  (4 Bytes)  - The cache entry's remote address, in units of sectors.
  Dirty Bit       (1 Bit)    - Set if the cache content has been modified.
  Valid Bit       (1 Bit)    - Set if the cache entry is in use.
  Lock Bit        (1 Bit)    - Set if the entry is under modification by a request.
  Read Count      (2 Bytes)  - Number of read accesses within a time unit.
  Write Count     (2 Bytes)  - Number of write accesses within a time unit.
  Bit Vector      (Variable) - Each bit represents one 4KB block within the block group; set if the corresponding 4KB block is valid. The size is (block group size)/4KB bits.
  Access Time     (8 Bytes)  - Most recent access time.
  Total size: < 23 Bytes.

Virtual Disk ID identifies a virtual disk stored on network attached storage. When a virtual disk is detached and reconnected later, cached contents that belong to this disk are identified and reused. Bit Vector identifies the valid 4KB blocks within a block group. Without the Bit Vector, on a 4KB write, vStore would have to read the corresponding block group from network attached storage, merge it with the new 4KB data, and write the entire block group to the cache. With the Bit Vector, it can write the 4KB data directly without fetching the entire block group. Our experiments show that the Bit Vector helps significantly reduce network traffic when using a large cache unit size.

Maintaining the metadata on disk may cause poor performance. A naive implementation may require two disk accesses to handle one write request issued by a VM: one for the metadata update and one for writing the actual data. vStore solves this problem by putting metadata and data together and updating them in a single write. The details are described in Subsection 3.1.

In-memory Metadata: To avoid disk I/O for reading the on-disk metadata, vStore maintains a complete copy of the metadata in memory and updates it in a write-through manner. We use a large block group size (e.g., 256KB) to reduce the size of the in-memory metadata.
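The following struct is a rough sketch of how the per-entry metadata in Table 1 might be represented in memory; the field names and widths follow the table, but the layout itself is an assumption, not vStore's actual definition.

```c
/* Sketch of one in-memory cache-metadata entry, following Table 1.
 * Illustrative assumption, not vStore's actual code. */
#include <stdint.h>
#include <stdlib.h>

struct vstore_meta {
    uint16_t vdisk_id;      /* Virtual Disk ID, unique within one hypervisor */
    uint32_t sector_addr;   /* remote address of this cache entry, in sectors */
    unsigned dirty  : 1;    /* cache content modified, not yet destaged       */
    unsigned valid  : 1;    /* cache entry is in use                          */
    unsigned locked : 1;    /* entry currently being modified by a request    */
    uint16_t read_count;    /* read accesses within the current time unit     */
    uint16_t write_count;   /* write accesses within the current time unit    */
    uint64_t access_time;   /* most recent access time                        */
    uint8_t *bit_vector;    /* one bit per 4KB block in the block group;
                               length = block group size / 4KB bits           */
};

/* Allocate the bit vector for a given block group size (e.g., 256KB). */
static int meta_init(struct vstore_meta *m, size_t block_group_bytes)
{
    size_t nbits = block_group_bytes / 4096;       /* one bit per 4KB block */
    m->bit_vector = calloc((nbits + 7) / 8, 1);
    return m->bit_vector ? 0 : -1;
}

int main(void)
{
    struct vstore_meta m = {0};
    return meta_init(&m, 256 * 1024) == 0 ? 0 : 1;
}
```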


Cache Structure: vStore organizes the local storage as a set-associative cache with a write-back policy by default. We describe the cache as a table-like structure, where a cache set is a column in the table and a cache row is a row in the table. A cache row consists of multiple block groups, each of which may hold contents from a different virtual disk. Block groups in the same cache row are laid out in logically contiguous disk blocks.

Each 4KB block in a block group has a 512-byte trailer, as shown in Fig. 3. The trailer contains metadata and the hash value of the 4KB data block. On a write operation, vStore computes the hash of the 4KB block and writes the 4KB block and its 512-byte trailer in a single write operation. If the host crashes during the write operation, then after recovery the hash value helps detect that the 4KB block and the trailer are inconsistent. The 4KB block can be safely discarded, because the completion of the write operation has not yet been acknowledged to the VM.

Fig. 3. Structure of one cache entry. The block group consists of n 4KB blocks, and each 4KB block has a 512-byte trailer; the total size is n·(4096+512) bytes.

When handling a read request, vStore also reads the 512-byte trailer together with the 4KB block. As a result, a sequential read of two adjacent blocks issued by the VM is also sequential in the cache. If only the 4KB data block were read without the trailer, the sequential request would be broken into two sub-requests, spaced apart by 512 bytes, which adversely affects performance. Since a 512-byte trailer is 12.5% of the size of a 4KB block, the theoretical overhead of vStore is around 12.5%. Many of our design and implementation efforts have been directed at achieving this theoretical bound.

Cache Replacement: Simple policies like LRU and LFU are not suitable for vStore, because they are designed primarily for memory-based caches, without consideration of block sequentiality on disk. If two consecutive blocks in a virtual disk are placed at two random locations in vStore's cache, sequential I/O requests issued by the VM become random accesses on the physical disk. vStore's cache replacement algorithm strives to preserve the sequentiality of a virtual disk's blocks. Below, we describe vStore's cache replacement algorithm in detail.

We introduce the concept of the base cache row of a virtual disk. The base cache row is the default cache row on which the first row of blocks of a virtual disk is placed. Subsequent blocks of the virtual disk are mapped to the subsequent cache rows. For example, if there are two virtual disks Disk1 and Disk2 currently attached to vStore and the cache associativity is 5 (i.e., there are 5 cache rows), then Disk1 might be assigned 1 as its base cache row and Disk2 might be assigned 3, to keep them reasonably far apart. If we assume one cache row is made of ten 128KB block groups, Disk2's block at address 1280K will be mapped to row 4, which is the row following Disk2's base cache row. Upon arrival of a new data block, vStore determines the cache location in two steps. First, it looks at the state of the cache entry whose location is calculated using the base cache row and the block's address. If it is invalid or not dirty, then the incoming block is immediately assigned to that entry.
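The placement rule just described can be summarized by a small address calculation. The sketch below reproduces the worked example from the text (cache associativity 5, rows of ten 128KB block groups, Disk2 with base cache row 3, block address 1280K mapping to row 4); the within-row set index is our own assumption, not something the paper specifies.

```c
/* Sketch of the base-cache-row placement described above.  The row formula
 * reproduces the worked example in the text; the column (set) calculation
 * is an assumption added for illustration. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_GROUP_BYTES   (128 * 1024)          /* one cache entry        */
#define GROUPS_PER_ROW      10                    /* entries per cache row  */
#define ROW_BYTES           (BLOCK_GROUP_BYTES * GROUPS_PER_ROW)
#define NUM_ROWS            5                     /* cache associativity    */

struct cache_slot { unsigned row; unsigned set; };

/* Map a virtual-disk byte address to its default cache slot. */
static struct cache_slot default_slot(unsigned base_row, uint64_t disk_addr)
{
    struct cache_slot s;
    s.row = (base_row + (unsigned)(disk_addr / ROW_BYTES)) % NUM_ROWS;
    s.set = (unsigned)((disk_addr % ROW_BYTES) / BLOCK_GROUP_BYTES);  /* assumed */
    return s;
}

int main(void)
{
    /* Disk2 has base cache row 3; its block at address 1280K lands in row 4. */
    struct cache_slot s = default_slot(3, 1280 * 1024);
    printf("row=%u set=%u\n", s.row, s.set);   /* prints: row=4 set=0 */
    return 0;
}
```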
