Cloud Storage Design Based on Hybrid of Replication and Data Partitioning

Yunqi Ye, Liangliang Xiao, I-Ling Yen, Farokh Bastani
Department of Computer Science, University of Texas at Dallas
Email: {yxy078000, xll052000, ilyen, bastani}@utdallas.edu

Abstract
Most existing works on data partitioning techniques simply partition data objects and distribute the shares to servers. Such pure data partitioning approaches may cause performance and scalability problems when used in widely distributed systems. First, it is difficult to apply lazy update. Second, share consistency verification may incur costly communications among widely distributed servers. In this paper, we propose a two-level DHT (TDHT) approach for widely distributed cloud storage to address these problems. First, we analyze the tradeoffs in security and availability between TDHT and the conventional pure data partitioning approach (integrated with DHT and called GDHT, for global DHT). The results show that TDHT can provide better security than GDHT and almost the same level of availability. To compare their performance, we design a two-level access (TLA) protocol for the TDHT approach and compare it with the distributed version server (DVS) protocol proposed in [Ye10] for the GDHT approach. The experimental results show that TLA provides much better user-perceived update response latency and the same or even better read access latency compared to DVS.
Keywords: cloud storage, global DHT, two-level DHT, short secret sharing, access protocols.

1 Introduction
Cloud storage integrates storage resources that are distributed all over the world and offered by different providers into one virtualized environment. Such an infrastructure can resist disastrous failures and has the potential of achieving low access latency and greatly reduced network traffic by replicating data close to where they are needed. To address confidentiality and availability issues, data partitioning techniques, such as secret sharing [Sha79], IDA (information dispersal algorithm, a type of erasure code) [Rab89], and short secret sharing (SSS) [Kra93], have been frequently used. A lot of research effort has been devoted to data partitioning based storage systems [Cac06, Abd05, Ye10]. Most of them simply partition data objects and distribute the shares to servers. However, such pure data partitioning approaches may incur performance and scalability problems when used in widely distributed cloud storage.

First, many replication based approaches use lazy update [Kri02] to achieve low user-perceived response latency and high scalability in widely distributed systems by asynchronously propagating updates to all replicas after the updates are performed on one replica. But it is very difficult to apply lazy update to pure data partitioning approaches. With an (n, k) data partitioning scheme, the updates to one share cannot be propagated to another share directly because they are not replicas. To apply lazy update, the update must first be performed on at least k shares (on k servers). Then the k servers need to coordinate to compute the new shares using re-share techniques [Nik03] and distribute them to the other servers. This implies a high computation cost and at least k messages among servers. Thus, in existing pure data partitioning approaches [Cac06, Abd05, Ye10], all the shares are updated in one round without considering lazy updates.

Second, in the pure data partitioning approach, share consistency verification can result in a large number of message exchanges among servers, which may significantly affect system scalability and performance in widely distributed systems. Since only consistent shares can be used to reconstruct the data, share consistency verification is a necessary task in data partitioning based systems. In existing pure data partitioning approaches, no matter which strategy is adopted (such as update-time verification [Cac06] or lazy verification [Abd05, Ye10]), all the servers hosting shares of one data object need to exchange verification information with each other to verify share consistency. In widely distributed systems, such information exchange among servers may incur a very high communication cost.

In this paper we consider the design of a two-level cloud storage system to address the two problems stated above. First, we divide the storage servers into server groups based on their location information. Within each group, we apply SSS to each data object and distribute the shares to the servers in the group. Then the data shares are replicated to different groups. In this approach, lazy update can be applied by updating the shares in only one group and propagating the updates to other groups. Thus, the two-level approach can significantly reduce the user-perceived response latency for update accesses. Also, consistent share verification can be performed within each group independently. Thus, the involved server-to-server communications are constrained within the group and the cost can be significantly reduced. Therefore, the two-level approach can be much more scalable than the pure data partitioning approach.

Directory management is another important issue in data partitioning based systems. The conventional centralized directory service may not scale up well with increasing system size. In this paper, we adopt the distributed hash table (DHT) technique to eliminate the need for directory management. DHT can scale very well since it is designed for peer-to-peer systems with thousands or millions of nodes. In the following we use global DHT (GDHT) to refer to the pure SSS approach that integrates the DHT technique for share allocation and lookup. Also, we design a two-level
DHT (TDHT) approach that integrates DHT into our two-level cloud storage design to provide secure, scalable and high performance management.

To study the tradeoffs between TDHT and GDHT, we conduct a series of experiments to compare their security and availability. The results show that TDHT can provide better security than GDHT and almost the same level of availability. Since access performance depends on the concrete access protocols, we design specific access protocols for TDHT and GDHT and compare their performance. For GDHT, we consider the distributed version server (DVS) protocol given in [Ye10]. Then, we make two changes to DVS to obtain the two-level access (TLA) protocol for TDHT. First, TLA adopts lazy update. Second, we select a super node for each group to maintain the update history and status. To tolerate super node failures, TLA is self-adaptive and can automatically switch to DVS when the super node fails. We conduct experiments to compare the performance of TLA and DVS. To make a fair comparison, we measure not only the user-perceived response latency for update accesses, but also the overall update latency including update propagation. The experimental results show that TLA has much lower client-perceived response latency for update accesses than DVS, while the overall update latency in TDHT is higher. For read latency, TLA and DVS are the same in most cases. But with large k, TLA yields better read latency than DVS.

The rest of the paper is structured as follows. The system model is given in Section 2. We present the two-level approach and compare the security and availability of TDHT and GDHT in Section 3. Section 4 discusses the TLA protocol and compares the access performance of the TDHT and GDHT approaches. Section 5 introduces related works. Section 6 concludes the paper.
2 System Model
Assume that the cloud consists of N storage servers geographically distributed over the Internet. Each storage server can be a monolithic storage platform or a normal server in a storage cluster. The storage servers are relatively reliable and trustworthy. They may dynamically join or leave the system, but at a low rate. Though the servers are generally trustworthy, some of them may still be malicious or be compromised. Each server is assigned a longitude and latitude to represent its location. The distance between two servers is defined as their geographical distance based on their coordinates.

Let D denote the set of critical data objects. Each data object d ∈ D has a unique key, which is the search key for the data object. Let d.data denote the actual data of d. We use the (n, k) SSS technique to decompose each data object into n shares. Let d.share = {d.shi | 1 ≤ i ≤ n} denote the set of shares of d.

We consider both malicious failures and crash failures. We assume that there are at most f server failures (including both crash and malicious failures) at any time, where f < min(k, n−k)/2, i.e., we require n > 4f to tolerate f malicious servers, the same assumption as in [Ye10]. A malicious client may simply write garbage (the shares are still consistent) to the data objects that it has the right to update. But this is beyond the scope of access protocols; approaches such as [Hu03, Vie05] can be used to detect malicious updates based on update history analysis.

We assume that all messages are transmitted on a reliable channel, that is, the channel does not change, duplicate, or drop messages. All messages sent from one node will eventually arrive at the destination in their sending order unless the destination node fails.
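To make the failure assumption concrete, here is a minimal sketch (in Python; the helper name is ours, not from the paper) that computes the largest tolerable f for given SSS parameters under the f < min(k, n−k)/2 condition.

```python
def max_tolerated_failures(n: int, k: int) -> int:
    """Largest integer f satisfying f < min(k, n - k) / 2, per the system model."""
    bound = min(k, n - k) / 2
    f = int(bound)
    if f == bound:        # the inequality is strict, so an integer bound is excluded
        f -= 1
    return max(f, 0)

# Example: with (n, k) = (8, 3), min(3, 5)/2 = 1.5, so at most f = 1 faulty server
# (crashed or malicious) can be tolerated, consistent with the n > 4f requirement.
assert max_tolerated_failures(8, 3) == 1
```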
3 Two-level DHT
We propose a two-level DHT (TDHT) approach for data allocation and lookup. First, the storage servers are divided into different groups. Each data object is replicated to r groups using the group-level DHT, where r denotes the replica number of each data object. The data replica only exists conceptually and is not actually stored on servers: within each group, each data replica is partitioned by SSS and the shares are allocated to the servers in the group using the server-level DHT. The detailed TDHT design is given in Subsection 3.1. To evaluate this design, we use the analysis models for security and availability in [Xia09] to compare the GDHT and TDHT approaches. To make the results comparable, we always generate the same number of shares for each data object in the two approaches. In other words, if we apply (n, k) SSS in the TDHT scheme, then we apply (n·r, k) SSS in the GDHT scheme. The comparison results are presented in Subsections 3.2 and 3.3 respectively.
3.1 TDHT Design
We consider the storage servers as a two-level topology and divide all the servers into M groups according to the pairwise distances between servers. After grouping, we assume that the distance between two nodes in one group is normally much less than that between two nodes in different groups. We assume that each group has a super node that is always trustworthy and may only fail due to routine maintenance. Let g.sn denote the super node of group g. The super node maintains the group-level information such as the group coordinates, which are defined as the coordinates of the centroid of all its servers. The distance between two groups is computed from the group coordinates.

At the group level, we adopt DHT for data replica allocation and lookup. As shown in Figure 1, let the system ID ring denote the DHT ID space. We assign each data object and each group an ID on the system ring. Let d.ID denote the DHT ID of data object d. When allocating data object d, the r data replicas are allocated to the first r successor groups of d, i.e., the first r groups on the system ID ring after d.ID. Let RGd denote these residence groups hosting replicas of d. To protect the locations of data objects, we consider a lattice-based, key-protected DHT scheme. The security domains of the data and the clearance levels of the clients are
divided into multiple categories in a lattice [Bis03]. Each category c holds a distinct key kc. The DHT ID of each data object is computed from both the data search key and kc. Thus, the adversary cannot infer where the data objects are stored when no user is compromised.

At the server level within a group, we also use DHT for share allocation and lookup. Let the group ID ring denote the ID space at this level. We assign each share an ID as d.shi.ID = (d.ID + i·2^m/n) mod 2^m, where 2^m is the size of the ID space, and we assign each server an ID based on its network address. Then d.shi is assigned to its successor server on the group ID ring. Let RSg,d denote the residence servers hosting d.share within group g.
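The following Python sketch illustrates the two ID computations described above. The hash construction (SHA-256 over the category key concatenated with the search key) and the ID-space size 2^32 are our assumptions for illustration; the paper only states that the ID is computed from the search key and kc.

```python
import hashlib

M = 32                      # assumed ID-space exponent: IDs live in [0, 2^M)
RING = 1 << M

def dht_id(search_key: str, category_key: str) -> int:
    """Key-protected DHT ID derived from the data search key and the category key kc."""
    digest = hashlib.sha256((category_key + "|" + search_key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % RING

def successors(target_id: int, node_ids: list, count: int) -> list:
    """First `count` nodes clockwise from target_id on the ID ring."""
    return sorted(node_ids, key=lambda nid: (nid - target_id) % RING)[:count]

def share_ids(data_id: int, n: int) -> list:
    """Share IDs spread evenly over the group ring: (d.ID + i*2^m/n) mod 2^m, i = 1..n."""
    return [(data_id + i * RING // n) % RING for i in range(1, n + 1)]

# Usage sketch: pick r = 5 residence groups on the system ring, then place the n = 8
# shares inside a group by assigning each share ID to its successor server.
d_id = dht_id("object-42", "category-key")            # hypothetical key names
group_ids = [dht_id(f"group-{i}", "group-salt") for i in range(20)]
residence_groups = successors(d_id, group_ids, count=5)
print(share_ids(d_id, n=8))
```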
Figure 1. Two-Level DHT
Since each access to partitioned data may involve lookup operations on multiple shares in different groups, to achieve better performance we adopt the one-hop routing DHT scheme [Gup03] at both the group and server levels so that each lookup operation completes in one hop. Thus, each super node keeps the full routing table of the super nodes of the other groups. To enable nearby replica access, the coordinates of all the groups are also maintained in the routing table. This information can be used to calculate geographical distances and help locate nearby residence groups. Though the distance does not necessarily correlate with the communication cost, [Zha05] points out that the distance can be used to establish a lower bound on the latency. Each super node also has the full routing information for the servers within its group. Then, after exchanging routing information among super nodes and servers, each server eventually has the full routing information of all servers. Therefore, the residence groups and servers can be computed independently from any part of the system. Since the servers are relatively stable, the cost of routing table maintenance is tolerable.
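As a small illustration of nearby residence group selection from the routing table, the sketch below picks the residence group whose centroid is geographically closest to the client; the coordinate format (latitude, longitude) and the haversine distance are our assumptions.

```python
import math

def spherical_distance_km(a, b):
    """Great-circle (haversine) distance between two (latitude, longitude) points, in km."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def closest_residence_group(client_coord, residence_groups, group_coords):
    """Pick the residence group in RGd whose centroid is closest to the client."""
    return min(residence_groups,
               key=lambda g: spherical_distance_km(client_coord, group_coords[g]))

# Usage: group_coords maps a group ID to the centroid coordinates kept in the routing table.
groups = ["g1", "g2", "g3"]
coords = {"g1": (32.99, -96.75), "g2": (48.85, 2.35), "g3": (35.68, 139.69)}
print(closest_residence_group((33.0, -96.7), groups, coords))   # -> "g1"
```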
3.2 Security Comparisons
In [Xia09], the security model is based on the National Vulnerability Database [NVD] and the Common Attack Pattern Enumeration and Classification [CAPEC] definitions. We simulate security attacks and obtain security levels by analyzing the ratio of compromised data objects. We randomly generate 2000 servers that are widely distributed over the Internet. For TDHT, we assume that these servers are grouped and each group has at least 20 servers. Since the servers are widely distributed, we assume that they have independent weaknesses and security enforcement policies, so that the compromise of one server cannot easily spread to other servers. Also, because the data locations are protected, we assume that the adversary does not know where the data are stored and just randomly selects servers to attack. We create and distribute 10000 data objects to the servers and compute the corresponding security levels.

We conduct three experiments to study the impact of various parameters on security. The default parameter settings are n=8, k=3, r=5, and the probability that a server is attacked W=0.01. The results are shown in Figures 2-4. In the first experiment, we consider W with settings 10^-4, 10^-3.5, 10^-3, 10^-2.5 and 10^-2. In the second, we consider replica numbers r from 3 to 8. And in the third, we consider k from 3 to 7. As can be seen, TDHT provides better security. This is because in GDHT the adversary can reconstruct a data object by compromising any k residence servers, while in TDHT it needs to compromise k residence servers holding different shares. As expected, a larger W, which implies that more servers will be attacked, results in a lower security level (Figure 2). When there are more replicas, the security level decreases since the data is more likely to be compromised (Figure 3). But with larger k, since the number of servers the adversary must compromise also increases, the security level becomes higher (Figure 4).
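The difference between the two cases can be captured by a simple Monte Carlo sketch (ours, not the [Xia09] model itself): each server is independently compromised with probability W; under GDHT a data object falls once any k of its residence servers are compromised, while under TDHT the adversary must obtain k distinct shares.

```python
import random

def compromise_servers(servers, W):
    """Each server is independently compromised with probability W."""
    return {s for s in servers if random.random() < W}

def gdht_compromised(residence_servers, bad, k):
    # GDHT: any k compromised residence servers are enough to reconstruct the data.
    return len(set(residence_servers) & bad) >= k

def tdht_compromised(share_holders, bad, k):
    # TDHT: share_holders maps share index -> the r servers holding a copy of that share
    # (one per residence group); the adversary needs k *distinct* shares.
    distinct = sum(1 for holders in share_holders.values() if holders & bad)
    return distinct >= k
```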
Figure 2. Security for varying W (x=log10(W))
Figure 3. Security for varying number of replicas
Figure 4. Security for varying k values

3.3 Availability Comparisons
In the availability model, the availability level is defined as the average ratio of accessible data objects over all clients. Since the network topology may affect data accessibility, we use Inet [Inet] to generate a topology with 3500 nodes, 2000 of which are used as servers. The servers are grouped according to their coordinates, and each group contains at least 20 servers in the TDHT scheme. All of the 3500 nodes can be clients. We assume that nodes may fail with probability pnf and that connection links may fail with probability pef. We also apply dynamic failure probabilities to the nodes and links to simulate failures, i.e., pnf and pef are generated for each node/link according to the corresponding probability distributions. Failures may partition the system into completely disconnected partitions; for each client, only the servers within the same partition are accessible. Again, 10000 data objects are generated and their replicas and/or shares are allocated to the 2000 servers. In TDHT, we use (8, k) SSS to partition the data and each share has r replicas. In GDHT, we use (8·r, k) SSS to partition the data object.

We conduct three experiments to study the impact of various parameters on availability. The default parameter settings are r=5, k=3, pnf = U(0, 0.01] and pef = U(0, 0.0001], where U(x, y] denotes a value uniformly selected from (x, y]. In the first experiment, we consider pef with settings U(0, 0.0001], U(0, 0.0005], U(0, 0.001], U(0, 0.005] and U(0, 0.01]. In the second experiment, we consider pnf with settings U(0, 0.01], U(0, 0.05], U(0, 0.1],
U(0, 0.15], U(0, 0.2], U(0, 0.25], U(0, 0.4] and U(0, 0.5]. In the third, we consider r from 3 to 8. The experimental results are shown in Figures 5-7. As can be seen, the availability levels of GDHT and TDHT are very close. As expected, the availability level decreases as the node or edge failure rate increases, since both node and edge failures may partition the network and some partitions may not contain sufficient shares for data reconstruction, i.e., the data becomes inaccessible for the clients in that partition. In the topology generated by Inet, some nodes have only a single link to the rest of the network (although most nodes have multiple links). Thus, once that single link fails, the node is isolated. When pnf and pef are small, after some node and edge failures the network contains one large, well-connected partition and a few isolated individual nodes. It is therefore highly likely that the large partition contains a sufficient number of shares/replicas and that most accesses occur within the large partition. So the availability is almost 1 even with lower replica numbers (Figure 7).
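A minimal simulation sketch of this availability measure is given below; it is our own illustration consistent with the setup above, not the [Xia09] model. Nodes and edges fail with their drawn probabilities, the surviving graph is split into connected components, and a data object is accessible to a client if the client's partition still holds k distinct shares.

```python
import random
from collections import defaultdict

def surviving_partitions(nodes, edges, pnf, pef):
    """Apply node/edge failures and return the connected components of the survivors.
    pnf and pef map a node / an edge to its failure probability (e.g., drawn from U(0, x])."""
    alive = {v for v in nodes if random.random() >= pnf[v]}
    adj = defaultdict(set)
    for u, v in edges:
        if u in alive and v in alive and random.random() >= pef[(u, v)]:
            adj[u].add(v)
            adj[v].add(u)
    seen, components = set(), []
    for start in alive:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            seen.add(x)
            stack.extend(adj[x] - comp)
        components.append(comp)
    return components

def accessible(client_partition, share_holders, k):
    """A data object is accessible if the client's partition holds k distinct shares."""
    distinct = sum(1 for holders in share_holders.values() if holders & client_partition)
    return distinct >= k
```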
Figure 5. Availability for varying edge failure rate
Figure 6. Availability for varying node failure rate
Figure 7. Availability for varying number of replicas
4 Access Protocol
To thoroughly study the TDHT approach, in this section we consider its access performance. However, the access performance evaluation depends on the concrete access protocols. Thus, we first design a two-level access (TLA) protocol in Subsection 4.1 based on the ideas of a GDHT based access protocol, the distributed version server (DVS) protocol proposed in [Ye10]. We then conduct experiments to compare the performance of TLA and DVS in Subsection 4.2.

4.1 TLA Protocol
DVS [Ye10] is a decentralized protocol for GDHT. It uses SSS to protect each data object and DHT to allocate and look up shares. To identify consistent shares, it assigns each update a unique version number consisting of a logical timestamp, the client ID, and the cross checksum. The cross checksum consists of the hash values of all the shares of a data object [Gon89] and is used to assure the integrity of the shares even if a compromised server modifies its shares. Let share instance (SI) denote one data share together with a version number. Also, in DVS, each server keeps multiple versions of every share it hosts so that the data can be recovered from malicious updates. Let share history denote the sequence of SIs of one data share. In DVS, for update accesses, the client requests version numbers from all the residence servers to ensure a total order of version numbers and then updates all shares directly. For read accesses, the client contacts the nearby
residence servers and retrieves their latest SIs together with the version number lists in their share histories. If the client can find k consistent shares, it reconstructs the data and repairs the missing SIs. Otherwise, the client needs to analyze the version number lists received from the servers to find the latest available version and then retrieve that specific version of the SIs.

In the following, we propose a two-level access (TLA) protocol for the two-level DHT approach based on the ideas of DVS, such as storing multiple versions of shares, totally ordering updates, and accessing nearby shares. Considering the particular benefits of two-level approaches, TLA makes two major improvements over DVS. First, TLA adopts lazy update. Second, in TLA, the super nodes are used to maintain the update history and status. Such a centralized component can conveniently handle version number related operations (such as version number generation and stale version identification) for each group. To tolerate super node failures, we make TLA an adaptive solution that automatically switches to DVS when a super node fails.
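The metadata described above can be summarized in a short data-structure sketch. It is an illustration of the SI and version-number concepts, not code from [Ye10]; SHA-256 as the hash for the cross checksum is our assumption.

```python
import hashlib
from dataclasses import dataclass, field

def cross_checksum(shares):
    """Hashes of all n shares of a data object; lets readers detect a modified share."""
    return tuple(hashlib.sha256(sh).hexdigest() for sh in shares)

@dataclass(frozen=True, order=True)
class VersionNumber:
    timestamp: int                                       # logical timestamp, compared first
    client_id: str                                       # tie-breaker between concurrent clients
    checksum: tuple = field(compare=False, default=())   # cross checksum of all shares

@dataclass
class ShareInstance:                                     # an "SI": one share plus its version
    index: int
    share: bytes
    version: VersionNumber

class ShareHistory:
    """Sequence of SIs for one share kept by a server, so malicious or stale versions
    can be identified and rolled back."""
    def __init__(self):
        self.instances = []
    def insert(self, si: ShareInstance):
        self.instances.append(si)
        self.instances.sort(key=lambda x: x.version)
    def latest(self):
        return self.instances[-1] if self.instances else None
```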
4.1.1 Update Algorithm uTLA
The update algorithm, uTLA, consists of two phases. In the first phase, the client writes the n SIs to the closest group. Upon an update request to data object d, the client first computes the residence groups RGd and finds the closest group gi. It then contacts the super node gi.sn to retrieve a new version number. If the client finds that the super node has failed (timeout), it executes the update algorithm uDVS of the DVS protocol [Ye10] within gi; in uDVS, the client contacts all the residence servers to retrieve totally ordered version numbers and then updates the shares. When gi.sn has not failed, since it maintains the update histories of the data objects hosted by its group, it increments the current largest timestamp and generates a new version number with empty cross checksum values. This version number is labeled as "incomplete" and sent back to the client. Concurrent clients that contact different groups may obtain the same timestamp, but their client IDs can be used to ensure the order.

Upon receiving the version number from gi.sn, the client creates the new shares, computes the cross checksums, and constructs the new SIs. Then it computes the residence servers RSgi,d and sends each SI to the corresponding server. Each server inserts the SI into its share history and notifies gi.sn. After the super node collects n−f notifications, it removes the incomplete label from the corresponding version number. The pseudo code is given in Figure 8. The share consistency verification information, such as the cross checksum, can be sent to the super node together with the notification so that the super node can verify whether the shares in this update are consistent. This process confines the message exchanges to one group and thus avoids costly communication across the whole system. It also does not affect the client-perceived latency since it does not require client involvement.
uTLA(d), Phase I:
1. The client computes RGd and finds the closest group gi;
   {d.sh1, d.sh2, …, d.shn} := SSS(d.data);
   send a version number request to gi.sn.
2. If gi.sn fails, the client executes uDVS; otherwise:
3. gi.sn generates a new version number vn,
   labels vn incomplete, and sends it to the client.
4. The client computes the cross checksums for vn,
   computes RSgi,d, and sends each SI to the corresponding server.
5. Each server updates its share history and notifies gi.sn.
6. When gi.sn receives n−f notifications for vn:
   remove the incomplete label from vn;
   initiate Phase II.

Figure 8. uTLA phase I: update the closest group
In the second phase, uTLA propagates the update to the other residence groups. To avoid blind propagation, we construct a uniform residence group ring that minimizes the overall transmission cost of propagating the update to all other residence groups. Such a ring is fixed and can be computed by each super node independently. We only allow the update to be propagated along one direction of the ring. Assume that RGd = {gi | i = 1…r}, the residence group ring is g1→g2→…→gr→g1, and the closest group to the client is gi. When gi.sn removes the incomplete label of the version number vn, gi.sn notifies all servers in RSgi,d to forward the SI with vn to the corresponding servers in RSgi+1,d (or RSg1,d if i = r). Then, similar to the first phase, the residence servers in gi+1 insert the new SI into their share histories and notify gi+1.sn. At the same time, each server also forwards the received SI to the residence servers in the next group on the ring. Upon receiving the first notification of version vn, the super node gi+1.sn puts vn into its update history with the incomplete label. Once gi+1.sn receives at least n−f notifications, it removes the incomplete label of vn, i.e., the update has been propagated to gi+1 successfully. The same process is then performed along the residence group ring until the update is propagated to group gi−1 (or gr if i = 1). As a result, all the residence groups hold a replica of version vn. The pseudo code of update propagation is given in Figure 9.
uTLA(d), Phase II:
1. gi := the closest group; gk := the next group after gi on the residence group ring
   (i.e., k := 1 if i = r, otherwise k := i + 1).
2. gi.sn notifies all residence servers in gi to propagate the new version to gk.
3. Each server forwards the SI of vn to the corresponding server in gk.
4. Upon receiving the forwarded SI of vn, each server notifies its super node.
5. If the next group is not gi, the server also forwards the SI to the corresponding
   server in the next group.

Figure 9. uTLA phase II: update propagation
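The ring itself is the only piece each super node must derive on its own. A stand-in sketch is shown below; ordering by group ID is our simplification, since the cost-minimizing ring construction is not specified in this excerpt.

```python
def residence_group_ring(residence_groups):
    """A fixed ring over RGd that every super node can compute independently.
    Sorting by group ID is a placeholder for the paper's cost-minimizing ordering."""
    return sorted(residence_groups)

def next_group(ring, current):
    """Propagation moves along one direction of the ring: g_i -> g_{i+1}, wrapping g_r -> g_1."""
    i = ring.index(current)
    return ring[(i + 1) % len(ring)]

# Example: with residence groups {g3, g1, g5}, every super node derives the ring
# g1 -> g3 -> g5 -> g1 and therefore forwards updates to the same next hop.
ring = residence_group_ring(["g3", "g1", "g5"])
assert next_group(ring, "g5") == "g1"
```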
If some residence servers fail, the corresponding SIs will be lost during the update propagation. But each group keeps at least n−2f (> k) consistent SIs in the worst case, which is sufficient for data reconstruction.

4.1.2 Read Algorithm rTLA
Due to partial updates, malicious updates, or concurrency between reads and updates, the latest SIs from different residence servers may be inconsistent or have different version numbers. Thus, if the residence servers simply returned their latest SIs upon receiving a read request, the client might not get exactly k consistent shares and additional costs would be required. To address this problem, when the client tries to read data object d, instead of directly requesting SIs from the residence servers, it first contacts the super node in the closest group. Since the super node has the update history of d, it can determine the latest version number without the incomplete label (say lvn) and then notify the residence servers to return the SIs with lvn to the client. According to the group model, the communication costs from the client to the various servers in one group are approximately equal; thus, in rTLA all n residence servers respond to the client with the SI of lvn or with null. Figure 10 shows the pseudo code of the rTLA algorithm. Since a version without the incomplete label has at least k consistent shares, the client can reconstruct the data correctly. Besides this, rTLA also repairs missing SIs so that no extra synchronization among servers is needed for this purpose. After the data reconstruction, for the servers that responded with null or whose cross checksums are inconsistent with the majority, the client generates the corresponding SIs and sends these missing SIs to them.

rTLA(d):
1. The client computes RGd and finds the closest group gi;
   send a read request to gi.sn.
2. If gi.sn fails, the client executes rDVS; otherwise:
3. gi.sn finds the latest version without the incomplete label, lvn,
   and sends lvn to the servers in RSgi,d.
4. Each residence server responds to the client with the SI of lvn if it has it;
   otherwise, it responds with null.
5. The client reconstructs d.data from the received SIs
   and repairs the missing or invalid SIs.
Figure 10. rTLA algorithm
Similar to the update algorithm, if the super node fails, the client executes the read algorithm rDVS of the DVS protocol [Ye10]. In rDVS, the client contacts all the residence servers in gi to retrieve their latest share versions together with the version numbers of the other shares. If the retrieved shares are consistent, they are used to reconstruct the data. Otherwise, the client analyzes the retrieved version numbers and finds the latest consistent version for data reconstruction.
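The version selection in this rDVS fallback can be illustrated with a small self-contained sketch: given the version numbers reported by each residence server, pick the newest version that at least k servers still hold. Version numbers are simplified to integers here; in the protocol they also carry the client ID and cross checksum.

```python
from collections import defaultdict

def latest_reconstructible_version(version_lists, k):
    """Return the newest version reported by at least k residence servers,
    i.e., the newest version for which k consistent shares are still available."""
    counts = defaultdict(int)
    for versions in version_lists:          # one version-number list per server
        for vn in set(versions):
            counts[vn] += 1
    candidates = [vn for vn, c in counts.items() if c >= k]
    return max(candidates) if candidates else None

# Example: with k = 3, version 7 exists on only two servers, so version 6 is chosen.
histories = [[5, 6, 7], [5, 6, 7], [5, 6], [6], [5]]
assert latest_reconstructible_version(histories, k=3) == 6
```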
4.2 Performance Comparisons
To evaluate the access performance of the TDHT approach, we compare the access performance of TLA and DVS (for GDHT). The simulated system consists of 2000 storage servers hosting 10000 data objects. Clients issue data requests to access data objects following a Zipf distribution. We assume that 20 percent of the requests are updates and the rest are reads. For partial updates, we assume that the number of shares successfully written to the servers follows a uniform distribution between 1 and n (for TLA) or n·r (for DVS). For update accesses in DVS, the user-perceived response latency involves all the shares, while for updates in TLA it only involves the communication between the client and its closest group. To make the comparison fair, we therefore also measure the whole update latency including the propagation time (upTLA).

To simulate realistic communication costs, we conduct experiments on the PlanetLab [Plab] platform and measure the communication latencies between 1000 pairs of nodes. Two fitting functions F1 and F2 are derived, where CL is the connection latency (without message payload):
F1: CL = 0.5 + 0.18·(Distance)^0.74
F2: Latency = 0.05·(Size)^0.46·CL + CL
In the simulation, the location (longitude and latitude) of each node is randomly generated. Given a message, we first calculate the spherical distance between the sender and the receiver and then obtain the message size from the protocols. The latency is then estimated using F1 and F2.

We conduct a series of experiments to compare the performance of TLA and DVS under various system configuration parameters, including the read/update request arrival rate, the SSS threshold k, and the replica number r. The default settings are listed in Table 1. In each experiment, we vary only one parameter and keep the others fixed to explore its performance impact. The experimental results are shown in Figures 11-16.

Table 1. Default parameters in experiments
Parameter               Default value
r                       5
n                       TDHT: 8; GDHT: 8·r
k                       4
Request arrival rate    10000 requests/second
Client failure rate     0.01

First, we consider different request arrival rates, from 250 to 25000 requests per second (Figures 11 and 12). For both protocols, the user-perceived update latency stays constant as the request arrival rate increases. For read accesses, the latency of DVS increases with the request arrival rate. This is because DVS is an optimistic protocol that completes read accesses in one round of share retrieval only under optimistic conditions; a higher request arrival rate implies more concurrent read/update accesses, so more read accesses require two rounds of share retrieval and the read latency increases. As for TLA, since it always determines the latest available version before the share retrieval, its read latency stays constant with increasing request arrival rate.
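The two fitting functions F1 and F2 above translate directly into a small latency model, sketched here in Python; the units of Distance and Size are whatever the PlanetLab fit used, which the excerpt does not state.

```python
def connection_latency(distance):
    """F1: connection latency (no payload) as a function of sender-receiver distance."""
    return 0.5 + 0.18 * distance ** 0.74

def message_latency(size, distance):
    """F2: total latency for a message of the given size over the given distance."""
    cl = connection_latency(distance)
    return 0.05 * size ** 0.46 * cl + cl

# Example: estimate the latency of a message of size 10 between nodes 5000 apart
# (in the units of the PlanetLab fit).
print(message_latency(10, 5000))
```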
Figure 11. Update latency for varying request arrival rate
Figure 12. Read latency for varying request arrival rate
Figure 13. Update latency for varying k
Figure 14. Read latency for varying k
Figure 15. Update latency for varying number of replicas
Second, we explore the impact of different threshold values k, from 2 to 7 (Figures 13 and 14). As k increases, the share size becomes smaller [Rab89]. So for update accesses, the latency decreases as k increases in both protocols. But for read accesses in DVS, the latency increases as k increases. This is because the clients have to retrieve more shares and, hence, need to access farther servers, which appears to have a more significant impact than the decreasing message size. In TLA, clients only need to contact the servers in one group for share retrieval, so their communication distances are almost constant. Thus, the read latency of TLA decreases as k increases due to the reduced message size, and it outperforms DVS when k is large (≥5).

Third, we consider different replica numbers, from 3 to 8 (Figures 15 and 16). When the replica number is higher, on one hand, the update propagation in TLA has to go through more hops that are randomly distributed over the Internet; on the other hand, the distance between the client and the closest residence group tends to become smaller. Thus, for TLA, the update propagation latency increases while the user-perceived update latency and read latency decrease as the replica number increases. Similarly, a higher replica number also implies that DVS has more chances to retrieve shares from nearby servers. Thus, the read latency of DVS decreases as the replica number increases.
Figure 16. Read latency for varying number of replicas
But the replica number does not affect the upper bound of the distance between the client and the farthest shares; thus, the update latency of DVS stays constant as the replica number increases. From all the experiments, the update propagation of TLA always has higher latency because the shares have to be forwarded several times among groups before they arrive at the farthest residence server, which significantly increases the transmission distance compared to DVS. The user-perceived update latency of TLA is much better since it only writes shares directly to the closest residence group. When k is small, the read latency of DVS is slightly better than that of TLA, but the situation is reversed when k is large.
5 Related Works
Most existing storage systems [Lak03, Cac06, Abd05, Hen07] only consider cluster based storage. The works in [Cac06, Abd05, Hen07] are based on pure data partitioning techniques. They assume that each server keeps a share of every data object; during accesses, the clients always contact all servers for share update or retrieval. So there is no directory management issue regarding where to find the data. But in widely distributed cloud storage with a potentially large number of servers, this assumption is too restrictive. The work in [Lak03] considers a hybrid of replication and secret sharing. In fact, it just adds multiple fully replicated servers to each server in [Cac06, Abd05, Hen07]. It neither handles complicated update issues, such as partial updates or concurrent updates, nor considers directory management.

In [Ye10], we proposed a GDHT based widely distributed cloud storage design. In this design, nearby share retrieval support is integrated into the DHT so that the client can locate nearby servers for share retrieval to achieve good read performance. However, as we have discussed, it has high response times for update accesses and cannot support lazy update approaches. It also generates a large volume of data and control messages among servers for stale version removal. [Zha02] is another GDHT based P2P storage system using erasure coding. It uses replicated version files to assure serialized update accesses and correct data reconstruction. But it does not support nearby share retrieval, and its update protocol requires three reading/writing operations on all the replicated version files, which can affect performance significantly.

The work in [Mei03] considers the share placement problem in a two-level network topology, but it is still a pure data partitioning approach. It tries to achieve high data assurance by moving data shares to the sub-networks where there are more access demands. The work in [Tu10] also considers share placement in a two-level network topology. It uses a hybrid of replication and data partitioning techniques and proposes algorithms that dynamically allocate the replicated shares to appropriate groups and servers to minimize the access cost. However, accessing data shares that are dynamically allocated on cloud storage servers requires the support of an efficient directory service in order to achieve the performance, security and availability goals, and directory management is not discussed in these two works.

6 Conclusion
The pure data partitioning approach has potential performance and availability problems when used in widely distributed systems: 1) it is difficult to apply lazy update; 2) the share consistency verification may incur costly communications among widely distributed servers. In this paper, we considered a two-level DHT approach that uses a hybrid of SSS and replication techniques to address the above problems and meet the security, dependability and performance requirements of cloud storage. We compared the TDHT and GDHT approaches with respect to security, availability and access performance, using the security and availability models proposed in [Xia09]. The results show that TDHT can provide better security than GDHT and almost the same level of availability. Since the access performance depends on the concrete access protocols, we presented the TLA access protocol based on DVS (a GDHT based protocol) and compared their access performance. The results show that the user-perceived response latency of update accesses in TLA is much lower than in DVS, while the whole update latency in TDHT is higher. For read latency, TLA and DVS are at the same level in most cases; but with large k, TLA yields better read latency than DVS.
References
[Abd05] M. Abd-El-Malek, G. R. Ganger, M. K. Reiter, J. J. Wylie, G. R. Goodson. Lazy verification in fault-tolerant distributed storage systems. SRDS, 2005.
[Bis03] M. Bishop. Computer Security: Art and Science. ISBN 0-201-44099-7, Addison Wesley Professional, 2003.
[Cac06] C. Cachin, S. Tessaro. Optimal resilience for erasure-coded Byzantine distributed storage. DSN, 2006.
[CAPEC] Common Attack Pattern Enumeration and Classification. http://capec.mitre.org/data/xml/capec_v1.4.xml
[Gon89] L. Gong. Securely replicating authentication services. ICDCS, 1989.
[Gup03] A. Gupta, B. Liskov, R. Rodrigues. One hop lookups for peer-to-peer overlays. HotOS, 2003.
[Hen07] J. Hendricks, G. R. Ganger, M. K. Reiter. Low-overhead Byzantine fault-tolerant storage. ACM SIGOPS, 2007.
[Hu03] Y. Hu, B. Panda. Identification of malicious transactions in database systems. IDEAS, 2003.
[Inet] Inet3. http://topology.eecs.umich.edu/inet/
[Kra93] H. Krawczyk. Secret sharing made short. Crypto, 1993.
[Kri02] S. Krishnamurthy, W. H. Sanders, M. Cukier. An adaptive framework for tunable consistency and timeliness using replication. DSN, 2002.
[Lak03] S. Lakshmanan, M. Ahamad, H. Venkateswaran. Responsive security for stored data. TPDS, 2003.
[Mei03] A. Mei, L. V. Mancini, S. Jajodia. Secure dynamic fragment and replica allocation in large-scale distributed file systems. TPDS, 2003.
[Nik03] V. Nikov, S. Nikova, B. Preneel. Multi-party computation from any linear secret sharing scheme unconditionally secure against adaptive adversary: the zero-error case. Lecture Notes in Computer Science, 2003.
[NVD] National Vulnerability Database. http://nvd.nist.gov/
[Plab] PlanetLab. http://www.planet-lab.org
[Rab89] M. Rabin. Efficient dispersal of information for security, load balancing, and fault tolerance. Journal of the ACM, 1989.
[Sha79] A. Shamir. How to share a secret. CACM, 1979.
[Tu10] M. Tu, P. Li, I. Yen, B. Thuraisingham, L. Khan. Secure data objects replication in data grid. TDSC, 2010.
[Vie05] M. Vieira, H. Madeira. Detection of malicious transactions in DBMS. PRDC, 2005.
[Xia09] L. Xiao, Y. Ye, I. Yen, F. Bastani. Evaluating dependable distributed storage systems. Technical Report UTDCS-50-09.
[Ye10] Y. Ye, I. Yen, L. Xiao, F. Bastani. Secure, dependable and high performance cloud storage. Technical Report UTDCS-10-10.
[Zha02] Z. Zhang, Q. Lian. Reperasure: replication protocol using erasure-code in peer-to-peer storage network. SRDS, 2002.
[Zha05] H. Zhang, A. Goel, R. Govindan. An empirical evaluation of internet latency expansion. ACM SIGCOMM Computer Communication Review, 2005.