
BitVault: a Highly Reliable Distributed Data Retention Platform

Zheng Zhang, Qiao Lian, Shiding Lin, Wei Chen, Yu Chen, Chao Jin
Microsoft Research Asia
[email protected]

Abstract

This paper summarizes our experience designing and implementing BitVault: a content-addressable retention platform for large volumes of reference data, the seldom-changing information that needs to be retained for a long time. BitVault uses "smart bricks" as the building block to lower the hardware cost. The challenges are to keep management costs low in a system that scales from one brick to tens of thousands, to ensure reliability, and to deliver a simple design. Our design incorporates peer-to-peer (P2P) technologies for self-managing and self-healing, and uses massively parallel repair to reduce the system's vulnerability to data loss. The simplicity of the architecture relies on an eventually reliable membership service provided by a perfect one-hop distributed hash table (DHT). Its object-driven repair model yields a last-replica recall guarantee independent of the failure scenario: as long as the last copy of a data object remains in the system, that object can be retrieved and its replication degree can be restored. A prototype has been implemented. Theoretical analysis, simulations and experiments have been conducted to validate the design of BitVault.

1. Introduction

BitVault is a backend storage platform for reference data. It was designed and prototyped in early spring of 2004. The term reference data denotes seldom-changing information that needs to be retained for its business value or for compliance reasons. Examples include check images, electronic invoices, and email messages. Enterprise Storage Group [1] estimates that by 2005, more than half of the data stored by North American businesses will be reference data. Furthermore, the amount of reference data is growing one and a half times faster than non-reference data. Behind these trends is the digitization of all kinds of information. For example, X-ray images alone produce over 20PB of data every year. Digitizing all phone conferences in a year would generate 17,000PB worth of data. In the enterprise world, legal regulations often require email archiving.

Our top three design goals are 1) low total cost of ownership (TCO), 2) very high reliability and availability, and 3) simplicity. To achieve these goals, our design is based on the latest peer-to-peer (P2P) technologies. In a radical departure from many other storage systems, the BitVault design is completely decentralized and does not employ any distributed transactions. The reliability guarantee that BitVault provides is one-replica survivability: if failures wipe out replicas of an object, then as long as one replica remains in the system, the full replication degree will be rapidly restored. The massively parallel repair process significantly reduces the vulnerability window. BitVault is self-managing and self-healing. Failed nodes can be taken away and new capacity can be provisioned dynamically, without interrupting normal operations or requiring human intervention. BitVault balances load internally and automatically in a policy-driven manner.

A prototype of BitVault has been implemented and evaluated in our lab, and the experimental results validate our design choices. In addition, we have developed several applications on top of BitVault.

This paper summarizes the main design points and our experience; more details are available in our technical report [42]. In Section 2 we elaborate on our design goals. Section 3 describes the architecture and the protocols. Section 4 describes our implementation, and Section 5 covers several applications built on top of BitVault. Section 6 reports some of the experimental results. Our experience building BitVault and the further research problems it identifies are summarized in Section 7. We discuss related work in Section 8 and share our conclusions in Section 9.

2. Requirements and design goals

We believe there are key requirements for a data retention platform for reference data: 1) store a very large and rapidly growing volume (hundreds of billions) of small and medium-sized objects that vary in size from a few KB for email messages to a few GB for video streams; and 2) provide very high reliability and availability, with good access performance in general.

Reference data must be kept for an indefinite time, but it must be easily accessible: the SEC-17(a) regulation states that when a court order is delivered, a company must retrieve the required documents within 48 hours or face a penalty. The raw performance of data access should be reasonable, so that search functionality can be built on top of the platform, a concern especially important for email archiving applications.

Accordingly, the design and implementation of BitVault have three top-level objectives.

Low TCO (total cost of ownership). We break down TCO into hardware cost, operational cost, and power consumption. To keep the total cost low, a large system must ride the economies of scale. Data retention today is dominated by tape libraries, which are expensive to operate, slow to access, and therefore ill suited once we start to look into the possibility of digitizing all kinds of data. In contrast, the price per GB of disk is declining and is now only twice that of tape, so backing up to disks instead of tapes is becoming feasible. For this reason we use "smart bricks" as our building blocks. A smart brick is a trimmed-down PC with large disk(s). We believe that smart bricks will soon be commodity components, just as PCs are today. This design decision also gives us good data-access performance, and will allow search functionality to be built on top of the platform.

Hardware cost is only a small fraction of TCO. The management overhead of dealing with a complex system rises quickly as the system scales. To provide 5PB of raw capacity with a single-disk capacity of 500GB, BitVault needs to scale to 10K bricks (assuming one disk per brick). Initially, few installations will need such a scale, making incremental scale-out capability important as well. BitVault must be self-managing and self-organizing: its administration overhead must be low and almost independent of system size. Ideally, all the administrator does is unpack a brick, install the BitVault software, and plug the brick into the system; the new brick is then integrated automatically. When a brick fails, it departs from the system and is later removed by the administrator. All these changes must be handled online, with minimum perturbation to ongoing operations and no compromise to reliability or availability.

Given the longevity of the data kept in the system and the volume of data growth, smart bricks of different capability will be procured at different times and must coexist. Thus, BitVault must cope with inherent heterogeneity and support online migration to new hardware.

A very large BitVault installation can consume substantial power, and this is where traditional "cold" media such as tape and CD hold an advantage: they cost zero energy. We do not address this issue yet, but believe there are ample opportunities to strike a balance between power consumption and accessibility. We could, for example, automatically power down certain replicas or items that have not been accessed for a long time.

High reliability and availability. One of the most important design goals of BitVault is rapid repair. In a system of 10K bricks, failures will be frequent, as has been observed in other large-scale systems (e.g., GFS [16]). Some of these failures are transient and only affect availability temporarily; others are fatal and impact reliability. In the design of BitVault, we use the term one-replica fast recall to characterize a storage system's ability to deal with failures: as long as an object still has one copy, its replication degree can be fully restored and the object becomes accessible again as quickly as possible. BitVault leverages the system scale and spreads the repair load to utilize all the aggregated network bandwidth, with a target repair window of minutes instead of hours.

Simplicity. A complex system of such a scale is difficult to get right, and the first two objectives are already ambitious. From the outset we aimed to make BitVault as simple as possible, so that the system behavior is provable and correct. Many of our decisions, such as the soft-state index (Section 3.3.1), the object-driven repair model (Section 3.3.2) and the avoidance of global and transactional protocols, are founded upon this drive toward simplicity.

3. Design

Interface: Reference data do not change often, if ever. Thus, BitVault stores immutable objects; if an object is updated, a new version is stored. Each object is identified by a 160-bit key. This key is the secure hash of the object content, and is the handle used to retrieve the object. Thus, BitVault belongs to the category of a Content-Addressable Store (CAS). The checkin and checkout APIs store and retrieve an object, respectively.
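To make the interface concrete, here is a minimal client-side sketch of a content-addressable store in the spirit of BitVault. The class name, method signatures and the Sha1 helper are illustrative assumptions, not BitVault's actual API; the hash helper is left as a declaration that a real build would back with a SHA-1 library.

    #include <array>
    #include <cstdint>
    #include <vector>

    // A 160-bit key, the size of a SHA-1 digest.
    using Key = std::array<uint8_t, 20>;

    // Declaration only: a real implementation would call a SHA-1 library.
    Key Sha1(const std::vector<uint8_t>& content);

    class BitVaultClient {
    public:
        // checkin: store an immutable object with replication degree k.
        // The content hash doubles as the retrieval handle.
        Key Checkin(const std::vector<uint8_t>& content, int k) {
            Key key = Sha1(content);
            // ... ship (key, content, k) to an arbitrary brick ...
            return key;
        }

        // checkout: retrieve an object by its key. Updates are stored
        // as new versions under new keys; objects never mutate.
        std::vector<uint8_t> Checkout(const Key& key);
    };

Because the key is derived from the content, identical objects checked in twice map to the same key, which is what later lets Machine Bank (Section 5) deduplicate VM image blocks.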

System model: BitVault operates in a controlled environment and all hardware is assumed to be trusted. Bricks are connected by a high-throughput, low-latency LAN; we assume Gigabit Ethernet as the switching fabric, though our prototype uses 100Mb switches. Transient failures such as brick reboots, cable disconnections, and switch failures can affect multiple bricks and cause correlated but temporary outages. For the time being, we do not consider network partitions. All other failures, such as disk crashes and brick hardware faults, are permanent and fail-stop.

Figure 1 gives a high-level overview of BitVault. Smart bricks are the building blocks of BitVault. All bricks have 160-bit node IDs and share the same 160-bit logical name space in which the objects and their indices are stored. The node IDs are secure hashes of their hostnames. Using a secure hash with enough bits makes the collision probability arbitrarily small.


We adopt the DHT (Distributed Hash Table) [37] logical space as the primary abstraction for its self-organizing and scale-out capability. The zone of any live brick x is (y.id, x.id], where y is x's immediate logical predecessor in the space. This is basically consistent hashing as in Chord. A brick x is called the root brick of an object o if o.key belongs to x.zone. As shown in Figure 1, replica placement is policy-controlled, and in theory the replicas of an object can live on any k bricks, where k is the object's specified replication degree. However, the index of an object is always found at its root brick. In other words, the index partition is DHT-based.
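The zone rule translates into a few lines of code. The following sketch, with the 160-bit IDs shortened to 64 bits for brevity, shows how any brick can compute the root of a key from the sorted set of live node IDs supplied by the membership service; the function name is ours, not BitVault's.

    #include <cstdint>
    #include <set>

    // Root brick of a key under consistent hashing: the first live node
    // whose ID is >= the key, wrapping around the ring. With zones of
    // the form (predecessor.id, x.id], that node owns the key.
    uint64_t RootBrick(const std::set<uint64_t>& liveNodeIds, uint64_t key) {
        auto it = liveNodeIds.lower_bound(key);
        if (it == liveNodeIds.end())
            it = liveNodeIds.begin();  // wrap around the logical space
        return *it;
    }

Because every brick evaluates this rule against its own membership view, locating an index needs no coordination; it only requires the views to converge, which is exactly what the MRL below provides.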

Figure 1. Example of the object layout in BitVault. Object x has replicas on bricks i, j and k, whereas object y has replicas on bricks j and k. The zone of a brick is the portion of the logical space between the ID of its immediate predecessor and its own ID.

There is no centralized entity in BitVault. The fundamental mechanism is a weakly consistent membership protocol in which all bricks participate. We discuss this layer before we move on to the other protocols.

3.1 Membership and routing layer

The Membership and Routing Layer (MRL for short) provides an eventual membership guarantee: once the system stabilizes (i.e., no brick fails for a long enough time), all live bricks will eventually agree on the set of live bricks in the system. The convergence of the membership should be as rapid as possible. When the system has stabilized, the sorted list of node IDs yields the logical name space. An eventual membership service does not require bricks to agree on intermittent membership views of the system; therefore, more expensive view-based group membership protocols (e.g., [6], [10]) are not necessary. Many eventual membership protocols exist, such as SWIM [13]. These protocols gain their scalability by dividing the problem into two correlated parts: failure detection and failure dissemination.

We extend a best-effort one-hop DHT called XRing [41] to implement the MRL. XRing divides a 160-bit logical space among the participating nodes using consistent hashing, as in Chord [37]. Each node in XRing has a three-layer data structure maintained by three protocols. The top two layers are rather conventional. The bottom one is the leafset, a set of 2L+1 nodes comprising the L closest nodes on each side of the node in the DHT logical space plus the home node itself. Heartbeat messages carrying the full leafset of a node are exchanged between every pair of leafset nodes to maintain the leafset data structure. Leafset members use a voting mechanism to detect brick leave and join events and to reduce erroneous detections. For a system with N bricks, the middle layer is a finger table containing O(logN) entries, which implements a straightforward O(logN) prefix-based routing algorithm. A node's i-th finger points to the node that owns the key identical to the node's ID except with the i-th bit flipped. Regular probing messages are sent to finger table entries to detect failures and repair the finger table. Finally, the top layer, the SSRT (soft-state routing table), enables one-hop lookup performance with high probability. The SSRT is maintained by broadcasting the node join and leave events detected by the leafset heartbeat protocol, using a scalable broadcast through finger and leafset members. The SSRT of XRing already contains most brick membership information, but by itself it does not satisfy eventual reliability, because the broadcast, though it has O(L+logN) redundancy, is only best-effort.

To ensure membership convergence, we therefore add a background anti-entropy protocol that lets bricks periodically reconcile missing membership information with random other nodes in the system. Essentially, randomly paired nodes regularly exchange timestamped signatures of their SSRTs, allowing the membership view to converge in a gossip fashion. The anti-entropy protocol is also used when a new brick joins the system or rejoins after leaving for a while: the new brick either has no SSRT at all or has a possibly outdated one, and anti-entropy quickly brings its SSRT up to date.
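The reconciliation step can be sketched as follows. The structures are our reading of the protocol rather than XRing's actual code, and the exchange of timestamped signatures that decides which entries to ship is elided; only the merge rule (newer timestamp wins) is shown.

    #include <cstdint>
    #include <unordered_map>

    // Soft-state routing table: node ID -> last observed membership event.
    struct MemberRecord {
        uint64_t timestamp;  // when the event was observed
        bool     alive;      // join (true) or leave (false)
    };
    using Ssrt = std::unordered_map<uint64_t, MemberRecord>;

    // One anti-entropy round with a random peer: for every entry, the
    // newer observation wins, so the two views move closer; repeated
    // random pairings make all views converge in a gossip fashion.
    void Reconcile(Ssrt& mine, const Ssrt& peers) {
        for (const auto& [nodeId, record] : peers) {
            auto it = mine.find(nodeId);
            if (it == mine.end() || it->second.timestamp < record.timestamp)
                mine[nodeId] = record;
        }
    }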

Figure 2 shows the MRL performance reported by a simulator in which each instance runs the actual MRL stack. XRing's leafset heartbeat, finger probing and anti-entropy intervals are 5s, 5s and 10s, respectively. The leafset size is 8 (i.e., 4 logical neighbors on each side). A brick marks a neighbor as dead after failing to hear from it for 3 heartbeat cycles; the vote among the leafset members then declares the brick's departure within 10-20 seconds. To probe the MRL's robustness, we drop r% of the packets. The failure detection time is around 16 seconds regardless of system size, because detection is done through the leafset nodes; this value is consistent with what we observed in the prototype. With the network latency set to 2ms, convergence is very fast and rises as O(logN) in general. A higher drop rate (e.g., r=40%) yields a longer convergence time, but the difference is negligible in practice.

Figure 2. Failure detection time (a) and membership convergence time (b) of the MRL, as a function of system scale and message drop rate.

3.2 Per-brick architecture

The components of a BitVault brick are shown in Figure 3. The MRL module participates in the eventual membership service discussed in the previous section.

Figure 3. Components of one BitVault brick: the Access Module (AM) takes client requests and responses, and the Index Module (IM) and Data Module (DM) sit on top of the MRL (XRing), which supplies membership change notifications and routing.

The Index module (IM) keeps the indices of all objects rooted at this brick. The index is soft state, and records k pointers to the actual locations of the replicas, where k is the replication degree of the object. The IM listens to the MRL for membership changes and issues repairs for missing replicas when necessary. The Data module (DM) stores replicas on the local disk. Along with each object we store some metadata, including its key and its specified replication degree. A reverse table kept in memory records the IP addresses of the root bricks of the stored objects; this table is used to perform index repair, which is also triggered by MRL notifications. Finally, the Access module (AM) is a very light module that serves as the brick's gateway for external client requests.

3.3 Individual protocols

This section goes through the major classes of BitVault protocols. After the basic checkin and checkout APIs that upper-layer applications use to call into BitVault, we devote most of the discussion to the self-managing aspects of the system. The protocols include failure handling (both permanent and transient), brick additions, and load balancing. We will show how we base the design upon a very simple set of invariants.

3.3.1 Checkin/checkout

The checkin request carries the object, the replication degree k supplied by the upper-layer application, and the object's key. The checkout request needs only the key as a parameter. Both requests can be submitted to an arbitrary brick in BitVault. When performing a checkin, the receiving brick invokes the placement policy (detailed in Section 3.3.4) to pick k bricks, and sends replicas to these bricks. When the first acknowledgement is received, the checkin procedure is complete from the client's perspective; if desired, the checkin can wait for responses from more replicas.

A brick persistently stores a received object in its DM and at the same time publishes the object with an ipublish message. Of all the messages in BitVault, this is the only one that needs to be reliable, and we ensure this with an ack/resend mechanism. The message uses the object's key and is routed through the MRL to reach the root brick; it also carries the specified replication degree. The ipublish message is one of the events that drive the index state machine (Figure 4) at the root brick. An index has two possible states: partial and complete. A pointer in the index is valid if and only if the brick pointed to is alive and contains a copy of the object. In a complete index, the number of valid pointers equals the specified replication degree; a partial index has fewer valid pointers. Upon receiving the first such message, a root brick starts a partial index with one valid pointer and starts a timer. If k such pointers are collected, the index becomes complete and the timer is stopped. If the timer expires before enough pointers are collected, repair is triggered.

Figure 4. Object index state machine. An index gets its first pointer when an object is first created, or when all k-1 replicas and the index itself are lost (root crash); collecting all k pointers (from the DMs of the bricks hosting the object) turns it complete, and the loss of any of the k pointers turns it back to partial.

Checkout is simple: the request is routed to the root brick, and from there it follows any one of the pointers to retrieve the object.
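The state machine of Figure 4 is small enough to capture in code. This is a minimal sketch of the transitions described above, with message plumbing and the repair timer elided; the type names are ours.

    #include <cstdint>
    #include <set>

    enum class IndexState { Partial, Complete };

    // Soft-state index kept at the root brick of one object.
    struct ObjectIndex {
        int                k = 0;     // specified replication degree
        std::set<uint64_t> pointers;  // bricks holding valid replicas
        IndexState         state = IndexState::Partial;

        // An ipublish arrived from a brick that stored a replica.
        void OnIpublish(uint64_t brick) {
            pointers.insert(brick);
            if ((int)pointers.size() >= k)
                state = IndexState::Complete;  // stop the repair timer
        }

        // MRL reports that a brick holding a replica failed: drop the
        // pointer; falling back to Partial (re)arms the repair timer.
        void OnBrickFailure(uint64_t brick) {
            pointers.erase(brick);
            if ((int)pointers.size() < k)
                state = IndexState::Partial;
        }
    };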


3.3.2 Repair of permanent failure

One-replica survivability is a system-wide invariant immune to arbitrary failure sequences. For example, suppose there is only one remaining replica, and immediately after we restore another copy, both the source replica and the index disappear simultaneously. In this case, the last replica moves from one brick to another, and this sequence can occur an unbounded number of times. Once the system stabilizes, however, both the index and the k replicas should be intact. Translating this property to the BitVault data structures, we have the following invariants: a) eventually (when the system stabilizes and no failure occurs for a sufficiently long time) an object's index is always found at the root brick of the object, and b) eventually all indices should be in the complete state. BitVault relies on the membership service provided by the MRL and the reliable ipublish message to deliver both, with a very simple set of mechanisms.

Index repair: the DM filters the membership change events sent by the MRL. For any object whose root changes to a different brick (due either to brick failures or additions), the DM issues the same ipublish it sent when it first received the object, towards the new root via the MRL. The first ipublish message establishes a partial index at the root; the remaining k-1 messages turn the index state to complete, repairing the index. Since indices are partitioned across all members, index repair occurs upon any membership change. An optimization that improves index availability is to lazily back up x's indices to x+1 (we use x+1 to denote the successor of x in the logical space), so that x+1 can serve from the cached indices after x crashes, since x+1 takes over the zone of x.

Data repair: the IM filters the membership change events sent by the MRL. For any index that has replicas on a failed brick, the IM changes its state from complete to partial and instructs one of the replica keepers, which, after consulting the object placement policy, inserts another copy at the selected brick. Data repair occurs only when a brick crashes.
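A sketch of the index-repair trigger run by the DM follows; data repair on the IM side is symmetric (watch for failed replica holders instead of moved roots). ComputeRoot stands for the consistent-hashing lookup of Section 3, and SendIpublish for the reliable MRL message; both are assumptions of this sketch rather than BitVault's actual interfaces.

    #include <cstdint>
    #include <set>
    #include <vector>

    uint64_t ComputeRoot(const std::set<uint64_t>& live, uint64_t key);
    void SendIpublish(uint64_t root, uint64_t key, int degree);

    // One row of the DM's in-memory reverse table.
    struct ReverseEntry { uint64_t key; int degree; uint64_t lastRoot; };

    // On every membership change: any object whose root moved is
    // re-published towards its new root, rebuilding the soft-state index.
    void DmOnMembershipChange(const std::set<uint64_t>& live,
                              std::vector<ReverseEntry>& reverseTable) {
        for (auto& e : reverseTable) {
            uint64_t root = ComputeRoot(live, e.key);
            if (root != e.lastRoot) {
                e.lastRoot = root;
                SendIpublish(root, e.key, e.degree);
            }
        }
    }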

These are the only necessary steps. Notice that when a new replica is made, the receiving brick generates an ipublish message towards the root, and the message changes the corresponding index's state to complete, closing the repair cycle. Should anything interrupt this distributed procedure, the index stays in the partial state and repair continues. This is true even when multiple failures occur (e.g., the index and several replicas are gone simultaneously). While a full proof is beyond the scope of this paper, we offer an informal argument that these protocols deliver the last-replica recall guarantee.

Consider the worst case, in which failures have wiped out all but the very last copy. The membership service of the MRL has the following property: eventually, and with high probability within an O(logN) time bound (where N is the number of bricks in the system), every live brick is known to every other live brick, and every failed brick is excluded. Since the membership list defines a DHT space, the brick holding the last copy can watch the change of its root and publish towards that root an ipublish message that includes the specified replication degree. This process occurs even if the root changes infinitely often (due to brick crashes or additions), and even if the last copy moves from one replica to another. The message starts a partial index and the repair timer which, when expired, instructs the last copy to insert new replicas into the system. The cycle completes if and only if enough replicas are generated and the index state changes to complete. Although the system should be able to uphold this property after stabilization, in practice a desired condition is that the repair rate is greater than the failure rate.

We note two key properties here. First, the repair strategy is object-driven. Indeed, one can say that repair of missing replicas is triggered by the index being in the partial state, but the index itself can be regenerated from any surviving object replica. Many alternative approaches rely instead on the robustness of the index, coupled with direct monitoring of the data, to ensure availability. Second, the contents of a DM are pointed to by the IMs of many different bricks, and their sibling replicas are spread across the whole system. This distribution of both repair triggering and repair sources is the basis of rapid and parallel repair. Figure 1 illustrates both points: if brick j fails, index repair for object x can be triggered by either brick i or k, and data repair for objects x and y can proceed at i and k in parallel.

A simple calculation shows the advantage of parallel repair. When using parallel repair, we need to consider the network bandwidth, especially since the bandwidth of the root switch may become the bottleneck. Let B_BRICK be the disk I/O bandwidth and B_NET be the available bandwidth of the root switch for data repair. Then the parallel repair degree N_r, the number of repair source-destination pairs that can participate in repair in parallel, is given by N_r = B_NET / B_BRICK. N_r is also the repair speedup, if the object replicas are spread evenly among roughly N_r bricks. For example, if the disk bandwidth B_BRICK = 5MB/s and the available root switch bandwidth B_NET = 1GB/s (67% of the bisection bandwidth of a 24-port Gigabit switch), then N_r = 200, which means that instead of taking more than one day to repair a failed disk holding 500GB of data, the parallel repair can be done in about 8 minutes. This drastically reduced repair time shrinks the vulnerability window. Therefore, spreading replicas among a large number of bricks achieves a much faster data repair speed.

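The arithmetic is easy to reproduce; the constants below are the ones used in the text above, not measurements.

    #include <cstdio>

    int main() {
        const double bBrick = 5e6;    // disk I/O bandwidth: 5 MB/s
        const double bNet   = 1e9;    // root switch budget: 1 GB/s
        const double lost   = 500e9;  // data on the failed disk: 500 GB

        const double nr = bNet / bBrick;  // parallel repair degree (200)
        std::printf("Nr = %.0f\n", nr);
        std::printf("serial repair:   %.1f hours\n", lost / bBrick / 3600);
        std::printf("parallel repair: %.1f minutes\n", lost / bNet / 60);
        // Prints roughly: Nr = 200, 27.8 hours versus 8.3 minutes.
        return 0;
    }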


3.3.3 Brick additions

A new brick is ready to join the service once it has the BitVault code installed. It takes the secure hash of its hostname as its ID and contacts any of the existing bricks. As part of the MRL protocol, all live bricks will include the new brick in their membership lists; likewise, the new brick acquires the same list as well. Since BitVault uses consistent hashing to partition the space, for any object whose root changes to the new brick, its hosting DMs will issue index repair to build indices in the IM of the new brick. Similar to the optimization that improves index availability when dealing with brick failures, when x joins, brick x+1 can split off the indices that belong to x and send them to x, so that x has a cached copy of the indices to begin serving.

3.3.4 Load balance

After enough brick failures, brick additions and data repair movements, the storage load on bricks is likely to become skewed. An unbalanced load reduces access performance, because overloaded bricks become a bottleneck while underutilized bricks sit mostly idle.

To address this issue, BitVault performs load balancing in the background. BitVault runs an in-system monitoring utility based on the SOMO [44] distributed data structure, which gathers and disseminates load information periodically. If the load of the current brick is higher than the average load (in our prototype the threshold is set at 5% above average), the brick randomly moves some replicas to low-load bricks. As before, the bricks receiving the replicas send ordinary ipublish messages to build the indices. When the source brick receives confirmation from the sink brick that a replica has been created there, it issues a delete message to the root of the object.
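The balancing decision itself is a simple threshold test. Here is a sketch under our reading of the prototype policy, with the load figures assumed to come from the SOMO utility; the names are illustrative.

    #include <cstdint>
    #include <vector>

    struct BrickLoad { uint64_t brickId; double load; };

    // True when this brick should shed replicas: its load exceeds the
    // system-wide average by the prototype's 5% threshold.
    bool ShouldShedLoad(double myLoad, const std::vector<BrickLoad>& all) {
        double sum = 0;
        for (const auto& b : all) sum += b.load;
        const double avg = sum / all.size();
        return myLoad > avg * 1.05;
    }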

The delete protocol is coordinated by the IM. It first removes the pointer to the replica to be deleted, and then inserts this pointer into a delete pool. The IM then picks an entry from the delete pool and issues a delete request to the target brick, retrying until an ack is received. The entry is removed from the delete pool when 1) the ack is received, or 2) the target brick crashes (as notified by the MRL). The protocol works correctly and ensures that there is never a dangling pointer. The worst that can happen is that the index is unaware of a replica, which can occur if the delete pool (soft state in memory) is corrupted. In this case, and in all others where the local state (including the index) may be bad, we simply reset the brick itself and let the transient-failure handling fix the problem.

The delete protocol is not exposed as an API, but it can be invoked by the load-balancing process and for other garbage collection purposes, as we discuss shortly.
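A sketch of the delete pool's drain loop follows; the two messaging helpers are hypothetical stand-ins for the brick's RPC layer and the MRL liveness notification.

    #include <cstdint>
    #include <set>
    #include <utility>

    struct Deletion { uint64_t brick; uint64_t key; };
    bool SendDeleteAndWaitAck(const Deletion& d);  // assumed RPC helper
    bool IsBrickAlive(uint64_t brick);             // assumed MRL query

    // Entries leave the pool only on an ack or a target crash, so a
    // pointer is never dropped while the replica might still exist.
    void DrainDeletePool(std::set<std::pair<uint64_t, uint64_t>>& pool) {
        for (auto it = pool.begin(); it != pool.end(); ) {
            const Deletion d{it->first, it->second};
            if (SendDeleteAndWaitAck(d) || !IsBrickAlive(d.brick))
                it = pool.erase(it);
            else
                ++it;  // keep retrying in later rounds
        }
    }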

3.3.5 Dealing with transient failures

In BitVault, many transient failures can occur, such as reboots caused by software upgrades, cable drops, switch failures, and power failures. In these cases, some data may become inaccessible for a short period of time. Despite this, as long as some bricks are alive, the system can still operate, albeit with reduced performance. It is difficult to know whether a failure is transient, and our strategy is to initiate repair regardless. The worst that can happen is that extra replicas exist when the bricks come back online. We set a high watermark (e.g., k+1), and when the total number of copies exceeds that threshold, we start deleting until the total equals k. Notice that if future failures leave the replica count at k or above, no repair is triggered. Also, if the watermark is set equal to k, then the replication degree is eventually strictly enforced. A similar strategy is proposed in TotalRecall [5].
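The watermark rule reduces to a few lines; the function below is our illustration of the policy, not prototype code.

    // Number of surplus copies to delete under the high-watermark rule.
    // Repair triggers only below k; deletion only above the watermark
    // (e.g., k+1), so bricks that flap between dead and alive do not
    // cause repair/delete churn. With watermark == k, the replication
    // degree is eventually enforced exactly.
    int CopiesToDelete(int currentCopies, int k, int highWatermark) {
        return (currentCopies > highWatermark) ? currentCopies - k : 0;
    }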

4. Implementation

The prototype, including most of BitVault's key features, was developed using a toolkit we built called WiDS (WiDS implements Distributed System) [28][27]. WiDS combines three aspects of a typical development process: prototyping and debugging, large-scale simulation, and deployment. WiDS defines a small set of APIs covering message passing, sockets, synchronization and timers. A distributed system constructed with the WiDS APIs can be simulated and debugged within a single process or deployed in a networked environment, all with the same codebase. Using WiDS significantly reduced the development effort of BitVault. All components of BitVault and WiDS are implemented in C++. Currently, BitVault, XRing, SOMO and WiDS have about 6K, 4K, 2.5K and 7K lines of code, respectively. Although it is difficult to quantify directly, the small code base is indeed a result of the simplicity of BitVault.

5. Building applications over BitVault

Any complete application that uses BitVault as its backend storage must incorporate some mechanism to group and catalog the object IDs. One solution is to set aside a SQL database for this functionality, but that makes the database server a single point of failure and, under heavy load, a scalability bottleneck. We explore an alternative, using the Catalog utility to build an application-level, soft-state index inside BitVault.

When an object is stored into BitVault, it can optionally be tagged. The tag must contain a keyword and a descriptor, and is persisted to disk along with the replicas.

Later, an application can use the hash of the keyword to retrieve the list of objects that share the same keyword. This list is called a catalog, each entry of which is a (key, descriptor) pair. The catalog is entirely soft state and is built in the same way the object index is built: the node that receives a replica, upon seeing its tag, publishes towards the node whose zone covers the hash of the keyword; that node then appends the entry to the catalog of the specified keyword. If the membership protocol indicates that the node covering the keyword of a catalog has changed, we rebuild the catalog by republishing the tags. This is a simple and robust mechanism for adding metadata management support inside BitVault.
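The catalog mechanics mirror the object index. A sketch, with illustrative names and a stand-in hash function:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct CatalogEntry { uint64_t objectId; std::string descriptor; };
    using Catalog = std::vector<CatalogEntry>;

    uint64_t Hash(const std::string& keyword);  // stand-in for SHA-1

    // State held by the node whose zone covers Hash(keyword).
    std::map<uint64_t, Catalog> catalogs;

    // Invoked when a replica holder publishes a tag. Because the catalog
    // is soft state, it is rebuilt by simply re-publishing the tags
    // whenever the node covering the keyword changes.
    void OnTagPublish(const std::string& keyword, uint64_t objectId,
                      const std::string& descriptor) {
        catalogs[Hash(keyword)].push_back({objectId, descriptor});
    }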

The catalog function allows us to build two complete BitVault applications, both utilities responding to daily requirements from users in our lab.


BitVault Client Utility (BCU). BCU allows users to back up and retrieve their files (documents and project files) from any desktop that can connect to a BitVault system. The client side of BCU is integrated with Windows Explorer. With a mouse click, a user can back up a file or directory into BitVault, or retrieve its version history and select one version to check out. At checkin time, a tag is automatically generated for the object: the keyword of the tag is the hash of the user's account name, and the descriptor is the complete path of the file plus any optional text annotation. Thus, inside BitVault there is a complete catalog for each user. BCU retrieves the catalog, parses it, and loads it into a Microsoft Access database file. The user can perform simple queries and initiate checkout of any version of the files.

Machine Bank. Like many research institutes, MSR Asia has a large shared lab for hundreds of intern students, and in the shared lab there is a tension between flexible resource utilization and productivity. A student may use different PCs across different working sessions, but must preserve his or her entire working environment across sessions. In Machine Bank, analogous to the safebox of a banking institute, PCs in the shared lab run Microsoft's Virtual PC. A VM (Virtual Machine) image is broken into 64KB blocks and stored into BitVault. Since the majority of the VM images across different users and times are the same, most of the blocks are identical; this avoids having each VM occupy its entire space. Each PC also implements a local persistent cache to improve performance, and at the end of a session only modified blocks are stored into BitVault. The mapping between the blocks and their hashes is captured in a file called the Virtual Machine Instance (VMI). When all modified blocks have been checked into BitVault, the VMI is checked in as well, with a tag that uses the hash of the user's account name as the keyword. Thus, a catalog of the user's VM images is built and stored inside BitVault. At login, the user can select any past VM instance to re-instantiate on the current PC, thus migrating a work environment seamlessly across both time and space. More details can be found in [39].
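Block-level deduplication falls out of content addressing. Here is a sketch reusing the Key and BitVaultClient types from the interface sketch in Section 3; the function and constant names are ours.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    const size_t kBlockSize = 64 * 1024;  // Machine Bank's block size

    // Cut a VM image into 64KB blocks and check each one in. Identical
    // blocks across users and sessions hash to the same key, so they are
    // stored only once; the returned block list is what the VMI records.
    std::vector<Key> StoreImage(BitVaultClient& bv,
                                const std::vector<uint8_t>& image, int k) {
        std::vector<Key> vmi;
        for (size_t off = 0; off < image.size(); off += kBlockSize) {
            const size_t len = std::min(kBlockSize, image.size() - off);
            std::vector<uint8_t> block(image.begin() + off,
                                       image.begin() + off + len);
            vmi.push_back(bv.Checkin(block, k));
        }
        return vmi;
    }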

6. Evaluation

We built a prototype of 30 bricks, each a commodity PC with a 3GHz Pentium 4 CPU, 512MB of memory and a 120GB SATA Seagate disk. These PCs run Windows XP and are connected by two AT-8324SX 100Mb switches stacked together. Unless otherwise specified, the replication degree is set to 3 in all experiments.

Due to space constraints, we refer readers to [42] for the detailed raw performance of BitVault and report instead the system behavior under failures. In general, the latency of checkin and checkout is comparable to that of a remotely mounted NTFS volume, and the scalability and throughput are comparable with the GFS [16] data on a similar testbed configuration.

6.1 Repair performance

Table 1 takes a closer look at what happens inside the system during repair. We let each brick periodically log the number of objects and indices it hosts, and merge the logs at the end after aligning the clocks. In this experiment, every brick initially holds 30K 1MB objects (30GB per brick). We vary the total number of initial bricks in steps of 5, then manually fail one brick and measure the time to repair 90% of the lost 30GB of data.

Table 1. Repair speed: time to repair 90% of the 30GB lost when one brick out of the initial set fails.




    Bricks remaining     Time to repair      Repair bandwidth
    (after one crash)    90% (min)           MB/s      GB/min
    4                    56.4                8.2       0.5
    9                    20.3                22.8      1.4
    14                   9.9                 46.5      2.8
    19                   9.3                 49.4      3.0
    24                   5.3                 86.4      5.2
    29                   5.3                 87.8      5.3

The total time to repair the 30GB of data with 20 and 30 bricks is about 600 and 300 seconds, giving repair rates of 50MB/s and 100MB/s, respectively. The super-linear improvement is probably due to better utilization of memory and other per-brick resources. The rate with 20 bricks is about 12% of what a 227-node GFS cluster achieves [16]. BitVault's repair performance scales nicely with the number of bricks, up to a ceiling imposed by the network bandwidth.

6.2 Performance under failure

BitVault should self-heal and continue to function even in the face of failure. To verify this, we ran checkins from 16 clients into a 16-brick BitVault, with each client continuously checking in 1MB objects. We gather statistics at a 5-second granularity. For the client side, we log the aggregate throughput in terms of total successful checkins. Similarly, we log the total number of objects that the bricks receive, again aggregated over all bricks. At the tenth minute, we take out one brick. The client-side throughput corresponds to what users perceive, while the server-side throughput reflects both checkin traffic and repair traffic. The expected behavior is that performance drops while repair is underway and then returns to normal.

Figure 5 shows the variation of the throughput of both clients and servers, with the server-side throughput normalized by 3 (the replication degree). Before the crash and after the repair, the client-side throughput matches the server-side throughput. After the crash and during the repair window, however, client-side throughput decreases as resources are dedicated to repairing the failed disk. The repair traffic, represented by the darkest area in the chart, corresponds to about 3GB worth of data on the failed disk. The repair window is about 70 seconds; if there were more data on the failed disk, the repair window would increase. During repair, policy determines how repair traffic and user requests are prioritized; in this prototype, they compete with the same priority (only MRL messages have higher priority).

Figure 5. Performance under failure. Server throughput is normalized by 3; a failure is introduced around the tenth minute.

7. Discussion

At the time of this writing, the BitVault project is almost three years old. We summarize the major lessons we learned here.

One key design decision is the fully decentralized design. As a tradeoff for being very scalable, the membership protocol can only offer eventual consistency semantics. Consequently, most system properties become correctness conditions that are only satisfied eventually, which has made debugging quite tedious at times. Another problem is that there is no central place to look for ground truth, which, in hindsight, increases management overhead rather than reducing it.

Another BitVault design highlight is the object-driven repair model. However, when there are a large number of objects in the system, index repair time suffers. When a checkout fails, it is therefore difficult to determine whether the object never existed in the system or has been lost. Adding index replication would improve index availability, but the protocol can be complex, countering the design principle of simplicity. This choice becomes especially hard when network partitions are considered.

Dealing with immutable objects through a simple get/put interface made the system easier to design. However, immutability does not mean that there is no mutation to system state; for instance, load balancing must delete replicas. Our experience shows that this is one place where bugs creep in. If metadata mutability is required, it becomes difficult to build a CAS system entirely with BitVault without adding other bookkeeping support.

Developing applications on top of BitVault is important to understand how useful the system is. By and large, the two applications we built remain research prototypes. Efforts to make them available for daily use failed for mostly non-technical reasons: the backup utility (BCU) failed because the BitVault prototype was not robust enough and we were not successful in delivering a good UI design, and Machine Bank stalled because the intern PCs in the shared lab were not powerful enough to run Virtual PC. However, BitVault is a successful research project by many measures. It identifies a previously unexplored point in the design space and suggests many new research directions.

As BitVault has the flexibility to install arbitrary policies to control replica placement, we wanted to understand how replica placement policy impacts system reliability. Our result in [23] shows that the policy should spread replicas widely to leverage parallel repair, but only to the point where bandwidth is fully utilized. Recently, we pushed this work further to ask how other important factors such as switch topology [8] affect system reliability. That work also quantifies how lazily (and consequently how accurately) we can declare the failure of a node, an important factor when dealing with transient failures.

Under transient failures, our aggressive repair policy leads to spurious replicas, and removing them raises a subtle issue: as the nodes autonomously decide to remove replicas, it is theoretically possible for all replicas to disappear. The problem can be mapped to the k-set problem, in which the system functions only when there are no more than k consensuses. The k-set protocol and its implications are an active research area in our group. Our IPTPS paper [9] asks whether we can achieve consistency stronger than eventual consistency in a fully decentralized manner.

On the more practical side, the project made us realize how important it is to design useful development tools on the side. Although WiDS made rapid prototyping possible, we found many subtle bugs that only appear when the system is deployed. Our most recent effort is the WiDS checker [26], which logs non-deterministic events and then uses the logged trace to replay deterministically in simulation. A developer can write assertions to check distributed properties such as the invariants in BitVault. The WiDS checker identified three subtle bugs that would have caused us to lose replicas. That effort continues on two fronts: first, we are making the utility work for applications developed against the native Windows API [20]; second, we are developing infrastructure to catch bugs while the system is operating online.

8. Related work

As stated in [31], the primary challenges for systems like BitVault are management and availability, not performance. The primary contributions of BitVault are the following: 1) it uses an eventual membership protocol with a DHT abstraction to offer great scale-out capability with low overhead and self-management; 2) it employs massively parallel repair to achieve very high data reliability and availability; and 3) it delivers both with a simple architecture. Below we contrast these core contributions with previous systems.

EMC Centera [14] is a brick-based retention platform that aims at self-healing and manageability. However, no architectural details are available, and its scalability target is not clear.

Single-box solutions such as Venti [32] cannot meet the challenge of coping with the volume and growth rate of reference data; client/server architectures such as GFS [16], NASD [17] and WiND [4] work to a certain extent but will hit bottlenecks as well. The fact that objects are often small and exhibit little or no access locality exacerbates the scaling problem. Existing fully distributed proposals such as Boxwood [29], FAB [15], Petal [23] and xFS [3] all require strong consensus protocols, which present a scalability challenge. The DHT-based systems such as Oceanstore [22], Pond [34], CFS [12], Ivy [30], PAST [36] and Pastiche [11] have gone to the other extreme: they operate over a logical space with a hash-table abstraction, maintain a small list of other members (O(logN)), and traverse the space in O(logN) steps. In these systems, replicas of an object are placed on successive nodes starting from the root node of the object. They target a wide-area P2P sharing scenario, are self-organizing, and can scale out. However, their target context is dynamic and has led to legitimate concerns about what guarantees these systems can provide [7]. We have found that, unfortunately, these designs do not fit the more benign environment either. The requirement to place replicas sequentially impacts the handling of heterogeneity for better storage utilization, causing data movement solely to satisfy the placement invariants. Coupling object placement with the logical space also fails to leverage the abundant network bandwidth for rapid and parallel repair. These are issues that can only be solved by using indices to control the placement.

Indirection introduces the indices as another vulnerability point. The conventional methodology adopted by many systems, including GFS [16] and TotalRecall [5], has been to first ensure the integrity of the index, which then reacts to failures via failure detection. BitVault demonstrates that, when coupled with an eventual membership service, an object-driven repair strategy, in which the survival of the last replica can quickly restore both the index and the rest of the replicas, is both simple and effective. This architecture also affords very rapid repair by spreading the repair load, which had so far only been done in centrally indexed systems such as GFS [16].

BitVault as a scalable store for immutable objects is only a starting point. For example, one could use Farsite [2]'s directory service for the namespace while storing objects inside BitVault. Similarly, by combining an in-system P2P locking protocol [25], it is possible to build a file system, perhaps in the same style as Frangipani [40].

9. Conclusion and future work

The main objectives of a large-scale distributed storage system are its maintainability and availability. P2P technologies, currently explored mostly in the wide-area context, are immensely interesting design alternatives for keeping management cost down. When many components are brought together, they also make it possible to do massively parallel repair for high data availability. BitVault has demonstrated both of these points.

As a research project, BitVault has been successful in helping to further research directions on many fronts, spanning both applied theoretical problems and advanced development and debugging tools.

Acknowledgement

The BitVault project owes a great deal to researchers and architects in other parts of the company. Specifically, we thank Dan Teodosiu, Scott Colville, Cristian Teodorescu, Lidong Zhou and Jim Gray for their insightful inputs. We are also grateful to Lidong Zhou, Mike Schroeder and John MacCormick for their help in editing the final version.

References

[1] Enterprise Storage Group, "Reference Information: The Next Wave", June.
[2] A. Adya, W.J. Bolosky, M. Castro, et al., "FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment", OSDI'02.
[3] T.E. Anderson, M.D. Dahlin, J.M. Neefe, et al., "Serverless Network File Systems", SOSP'95.
[4] A. Arpaci-Dusseau, R. Arpaci-Dusseau, et al., "Manageable Storage via Adaptation in WiND", CCGrid'01.
[5] R. Bhagwan, K. Tati, Y.C. Cheng, et al., "Total Recall: System Support for Automated Availability Management", NSDI'04.
[6] K. Birman and R. van Renesse, "Reliable Distributed Computing with the ISIS Toolkit", IEEE Computer Society Press.
[7] C. Blake, R. Rodrigues, "High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two", HotOS'03.
[8] M. Chen, W. Chen, L. Liu, Z. Zhang, "An Analytical Framework and Its Applications for Studying Brick Storage Reliability", submitted to DCCS.
[9] W. Chen, X. Liu, "Enforcing Routing Consistency in Structured Peer-to-Peer Overlays: Should We and Could We?", IPTPS'06.
[10] G.V. Chockler, I. Keidar and R. Vitenberg, "Group communication specifications: A comprehensive study", ACM Computing Surveys, 33:4, 2001, 427-469.
[11] L.P. Cox, C.D. Murray, B.D. Noble, "Pastiche: Making Backup Cheap and Easy", OSDI'02.
[12] F. Dabek, M.F. Kaashoek, D. Karger, et al., "Wide-area cooperative storage with CFS", SOSP'01.
[13] A. Das, I. Gupta, A. Motivala, "SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol", DSN'02.
[14] EMC Centera: http://www.emc.com/products/systems/centera.jsp
[15] S. Frolund, A. Merchant, Y. Saito, et al., "FAB: enterprise storage systems on a shoestring", HotOS'03.
[16] S. Ghemawat, H. Gobioff, S.T. Leung, "The Google File System", SOSP'03.
[17] G.A. Gibson, D.F. Nagle, K. Amiri, et al., "A Cost-Effective, High-Bandwidth Storage Architecture", ASPLOS'98.
[18] J. Gray, W. Chong, T. Barclay, et al., "TeraScale SneakerNet: Using Inexpensive Disks for Backup, Archiving, and Data Exchange", MSR Technical Report MSR-TR-2002-54.
[19] J. Gray, "Storage Bricks Have Arrived", invited talk, FAST'02.
[20] Z. ...