DHT-based Self-adapting Replication Protocol for Achieving High Data Availability⋆

Predrag Knežević∗, Andreas Wombacher†, and Thomas Risse∗

∗ Fraunhofer IPSI, Dolivostrasse 15, 64293 Darmstadt, Germany
{knezevic,risse}@ipsi.fhg.de
† University of Twente, Department of Computer Science, Enschede, The Netherlands
[email protected]
Abstract. An essential issue in peer-to-peer data management is to keep data highly available at all times. The paper presents a replication protocol that autonomously adjusts the number of replicas to deliver a configured data availability guarantee. The protocol is based on a Distributed Hash Table (DHT); it measures the peer online probability in the system and adjusts the number of replicas accordingly. The evaluation shows that we are able to maintain the requested data availability while achieving near-optimal storage costs, independently of the number of replicas used during the initialization of the system.

Key words: decentralized data management, distributed hash tables, self-adaptation, high data availability
1 Introduction

Distributed Hash Tables provide an efficient way of storing and finding data within a peer-to-peer network, but they do not offer any guarantees about data availability. When a peer goes offline, the data stored locally on it become inaccessible as well. The work presented in this paper extends existing Distributed Hash Tables (DHTs) by delivering configurable data availability. The protocol is fully decentralized and has the following features: (1) measuring the current average peer online probability in the community, (2) computing locally the number of replicas that every stored object needs to have in order to achieve the requested average data availability, and (3) adjusting the number of replicas for locally stored items so that the requested data availability is achieved and kept. The research presented in this paper is motivated by the BRICKS³ project, which aims to design, develop and maintain a user- and service-oriented space of digital libraries on top of a decentralized/peer-to-peer architecture.
⋆ This work is partly funded by the European Commission under BRICKS (IST 507457).
³ BRICKS - Building Resources for Integrated Cultural Knowledge Services, http://www.brickscommunity.org
At run-time, the system requires a variety of system metadata, such as service descriptions, administrative information about collections, or ontologies, to be globally available [1].

The paper is organized in the following way. The next section gives details about the problem. Section 3 describes the applied replication technique. The approach is evaluated using a custom-made simulator, and various results are presented in Section 4. Work related to the idea presented in the paper is discussed in Section 5. Finally, Section 6 gives conclusions and some ideas for future work.
2 Problem Statement

As suggested in [2], a common DHT API should contain at least the following methods: (1) route(Key, Message) - deterministically forwards a message towards the peer responsible for the given key; (2) store(Key, Value) - stores a value under the given key in the DHT; and (3) lookup(Key) - returns the value associated with the key. Every peer is responsible for a portion of the key space, so whenever a peer issues a store or lookup request, it ends up on the peer responsible for that key. Although the details are implementation specific, a peer is responsible for the part of the keyspace nearest to its ID. The distance function can be defined in many ways, but usually it is a simple arithmetic distance. When the system topology changes, e.g. a peer goes offline, other online peers become responsible for the keyspace that belonged to the offline peer: that part of the keyspace is split and taken over by the peers that are currently nearest to its keys. Likewise, peers joining the system take over responsibility for parts of the keyspace that have been under the control of other peers until that moment.

As already mentioned, our decentralized storage uses a DHT layer for managing XML data. Unfortunately, DHT implementations do not guarantee data availability, i.e. when a peer is offline, the data stored locally on it are offline too. Therefore, high data availability can be achieved by wrapping a DHT and implementing a replication protocol on top of it, while at the same time staying transparent to the users, i.e. providing the same API to upper layers as the DHT does.
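To make the discussion concrete, the following sketch expresses the common DHT API from [2] as a minimal Python interface; the class and method signatures are illustrative assumptions, not the API of any particular DHT implementation. The replication wrapper described in Section 3 would implement this same interface on top of a basic DHT.

```python
from abc import ABC, abstractmethod
from typing import Optional

class DHT(ABC):
    """Minimal common DHT API as suggested in [2] (illustrative sketch only)."""

    @abstractmethod
    def route(self, key: bytes, message: object) -> None:
        """Deterministically forward a message towards the peer responsible for key."""

    @abstractmethod
    def store(self, key: bytes, value: bytes) -> None:
        """Store value under key on the peer responsible for key."""

    @abstractmethod
    def lookup(self, key: bytes) -> Optional[bytes]:
        """Return the value associated with key, or None if it cannot be reached."""
```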
3 Approach

Every value and its replicas are associated with keys used for the store and lookup operations. The first replica key is generated using a random number generator. All other replica keys are correlated with the first one, i.e. they are derived from it using the following rule:

\[
replicaKey(ron) =
\begin{cases}
c, & ron = 1 \\
hash(replicaKey(1) + ron), & ron \geq 2
\end{cases}
\tag{1}
\]
where ron is the replica ordinal number, c is a random byte array, and hash is a hash function with a low collision probability; replicaKey(1) and ron are treated as byte arrays, and + denotes array concatenation. The above key generation schema ensures that all replicas are stored under different keys in the DHT.
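As an illustration, here is a minimal sketch of the key generation schema of Formula 1 in Python. The choice of SHA-1 as the low-collision hash function and the concrete byte encoding of ron are assumptions made for the example, not prescribed by the protocol.

```python
import os
import hashlib

def replica_key(first_key: bytes, ron: int) -> bytes:
    """Derive the key of the ron-th replica (Formula 1).

    first_key is the randomly generated key of the first replica (ron = 1);
    for ron >= 2 the key is the hash of first_key concatenated with ron.
    """
    if ron == 1:
        return first_key
    # SHA-1 is used here only as an example of a hash with low collision
    # probability; the 4-byte big-endian encoding of ron is likewise an assumption.
    return hashlib.sha1(first_key + ron.to_bytes(4, "big")).digest()

# Usage: generate the keys of five replicas of one object.
first = os.urandom(20)                       # random first replica key
keys = [replica_key(first, i) for i in range(1, 6)]
```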
Further, two consecutive replica keys are usually very distant in the key space (a consequence of using a hash function), meaning that in a fairly large network the replicas should be stored on different peers.

Based on the key generation schema in Formula 1, deriving the key of the ron-th replica requires access to the first replica key and the replica ordinal number (ron). Therefore, this information is wrapped together with the stored value in an instance of the Entry class. Since the wrapper around the DHT implements the common DHT API, the store and lookup operations must be re-implemented.

store(Key, Value) When a value is created, it is wrapped in a number of instances of the Entry class and the appropriate keys are generated. The constructed objects are stored using the basic DHT store operation. If the objects already exist, their old values are overwritten.

lookup(Key) When a peer wants to get a value, it is sufficient to return any available replica. Therefore, the keys for the assumed number of replicas are generated and the basic DHT lookup operation is applied until the corresponding Entry object is retrieved.

When a peer rejoins the community, it does not change its ID, and consequently it will be responsible for a part of the keyspace that intersects with the previously managed keyspace. Therefore, the peer keeps the previously stored data, and no explicit data synchronization with other peers is required. Replicas whose keys are no longer in the part of the keyspace managed by the rejoined peer can be removed and may be sent to the peers that should manage them.

3.1 Number of Replicas

In order to add the data availability feature to an existing DHT, every stored value must be replicated R times. The number of replicas R is independent of the objects, i.e. the same data availability is requested for all objects. Joining the community for the first time, a peer can assume an initial value for R, or can obtain it from its neighbors.

To meet the requested data availability in a DHT, the number of replicas of the data stored at each peer has to be adjusted. Therefore, every peer measures once per online session, i.e. once during each period of being continuously online, the current average peer online probability and, knowing the requested data availability, calculates the new value for the number of replicas R. Knowing the previous value R′, a peer removes the replicas with ordinal number ron greater than R from its local storage if fewer replicas are needed than before (R < R′). In case a higher number of replicas is needed (R > R′), a peer creates new replicas of the data in its local storage under the keys replicaKey(j), j = R′ + 1, . . . , R. Be aware that replicating all R replicas in case of R > R′, as originally proposed in [3], instead of only the replicas R′ + 1, . . . , R, results in wrong online probability measurements, as discussed in the next section.

Having the peer online probability p and knowing the requested data availability a, a peer can compute the needed number of replicas. Let us denote with Y a random variable that represents the number of replicas being online, and with P(Y ≥ y) the probability that at least y replicas are online. Under the assumption that peers are independent and behave similarly, Y follows a Binomial distribution.
The probability a that a value is available in a DHT with peer online probability p is equal to the probability that at least one replica is online (P(Y ≥ 1)):

\[
a = P(Y \geq 1) = 1 - (1 - p)^R
\tag{2}
\]

Therefore, the number of needed replicas R is

\[
R = \frac{\log(1 - a)}{\log(1 - p)}
\tag{3}
\]
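A small sketch of how a peer could evaluate Formulas 2 and 3 follows; rounding R up to the next integer is an assumption made here, since the derivation itself yields a real number.

```python
import math

def required_replicas(a: float, p: float) -> int:
    """Replicas R needed for availability a given peer online probability p (Formula 3)."""
    # Rounding up is an assumption; the formula itself gives a real number.
    return math.ceil(math.log(1.0 - a) / math.log(1.0 - p))

def availability(p: float, r: int) -> float:
    """Probability that at least one of r replicas is online (Formula 2)."""
    return 1.0 - (1.0 - p) ** r

# Example: the two peer availabilities evaluated later in the paper.
for p in (0.2, 0.5):
    r = required_replicas(a=0.99, p=p)
    print(p, r, availability(p, r))   # p=0.2 gives R=21, p=0.5 gives R=7
```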
3.2 Measuring Peer Availability

Measuring the average peer online probability is based on probing peers and determining the peer online probability as the ratio between the number of positive and the total number of probes. However, the measurement cannot be done directly (e.g. by pinging peers), because we do not know anything about the peer community, i.e. peer IDs and/or their IP addresses. Our wrapper uses the underlying DHT via the common DHT API, i.e. its routing operation, for probing the availability of replicas.

According to the key generation properties (Formula 1) stated before, we assume that every replica of a particular object is stored on a different peer within a fairly large network. Based on this assumption, the peer online probability equals the replica online probability. Be aware that this holds only if each ron-th replica is stored on a single online or offline peer. Having several copies of the ron-th replica distributed over several online or offline peers results in peer online probability measurement errors, because the availability of a replica no longer equals the availability of a single peer, but of several peers. Having only one replica under a key is ensured by the implemented replication, and therefore the measurement of the peer online probability can be realized as follows.

Every peer has some replicas in its local store; based on them and Formula 1, the keys of all other replicas corresponding to the same objects can be generated, and their availability can be checked. Of course, we are not in a position to check all replicas stored in the system. In order to optimize the number of probes, we use confidence interval theory [4] to determine the minimal number of replicas that has to be checked so that the computed average replica online probability is accurate within some degree of confidence. Unfortunately, the direct usage of confidence intervals is not so promising for practical deployment: requesting a small absolute error (e.g. δ = 0.03) and a high probability of having the measurement within the given interval (e.g. C = 95%), a peer would have to make at least 1068 random replica availability probes, generating high communication costs. Even if the network is not so large (e.g. 400 peers), the number of probes n drops only to 290. Therefore, we slightly modify our probing strategy to achieve the requested precision. After probing some replicas using their keys and calculating the peer availability, the peer uses the same replica keys to ask other peers about their calculated values. Finally, the peer computes the average peer online availability as the average of all received values. This is done by routing an instance of the PeerProbabilityRequest message using the route DHT method with all replica keys from the probe set. The contacted peers route their answers (PeerProbabilityResponse) back.
The number of generated probe messages is now 2n, but we are able to check up to n² replicas. The measurement can therefore be done with low communication costs: already with n = 33, i.e. 66 messages, we are able to check up to 1089 replicas and achieve good precision in a network of any size.
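The following sketch illustrates this measurement under stated assumptions: the sample-size computation uses the standard normal-approximation formula with an optional finite-population correction (which roughly reproduces the 1068 and 290 figures quoted above), and the asynchronous PeerProbabilityRequest/Response exchange is collapsed into a hypothetical synchronous helper ask_peer_probability for brevity.

```python
import math
from typing import Optional

def probes_needed(delta: float, z: float, population: Optional[int] = None) -> int:
    """Probes needed to estimate a proportion within +/- delta at the confidence
    level given by z (normal approximation, worst case p = 0.5), optionally
    corrected for a finite population."""
    n = (z ** 2) * 0.25 / delta ** 2
    if population is not None:
        n = n / (1.0 + (n - 1.0) / population)
    return math.ceil(n)

# probes_needed(0.03, 1.96)                 -> 1068 direct probes
# probes_needed(0.03, 1.96, population=400) -> 292, close to the 290 quoted above

def measure_peer_availability(dht, probe_keys):
    """Two-phase measurement sketch: probe n replicas directly, then ask the peers
    responsible for the same keys for their own locally computed estimates."""
    # Phase 1: direct probes via the DHT lookup operation (n requests/responses).
    hits = sum(1 for key in probe_keys if dht.lookup(key) is not None)
    local_estimate = hits / len(probe_keys)

    # Phase 2: PeerProbabilityRequest/Response exchange (hypothetical helper);
    # with n probes, up to n^2 replicas are effectively covered.
    remote = [dht.ask_peer_probability(key) for key in probe_keys]
    estimates = [local_estimate] + [e for e in remote if e is not None]
    return sum(estimates) / len(estimates)
```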
3.3 Costs

In general, the total costs consist of two major parts: communication and storage costs. The presented approach is self-adapting; it tries to reach and keep the requested data availability with minimum costs at any point in time. If the initial number of replicas is not sufficient to ensure the requested data availability, new replicas will be created, i.e. the storage costs will increase. On the other hand, if there are more replicas than needed, peers will remove some of them, reducing the storage costs. Let us denote with S(t) the average storage costs per peer in a system with M objects and N peers that are online with probability p. The minimum storage costs per peer S_min that guarantee the requested data availability a are:

\[
S_{min} = \frac{M R}{N}
\tag{4}
\]
where R is computed as in Formula 3.

Every store operation generates R StoreRequest and R StoreResponse messages. A lookup operation does not have fixed costs: it stops sending LookupRequest messages when a replica is found. Sometimes the first replica is already available, and only two messages (LookupRequest and LookupResponse) are generated. In the worst case, 2R messages must be generated in order to figure out whether the requested object is available or not. It can be shown that on average a lookup needs to generate R messages before finding a replica of the requested object.

The proposed approach introduces additional communication costs for measuring the peer online probability p and adjusting the number of replicas. Every measurement generates n LookupRequest, n LookupResponse, n PeerProbabilityRequest, and n PeerProbabilityResponse messages, where n is the number of replicas to probe. As stated in Section 3.2, n depends on the absolute error δ we want to allow and on the probability that the measured value lies within the given interval (p − δ, p + δ). If the computed number of replicas R is greater than the previous one R′, the peer will on average create an additional (R − R′)S(t) StoreRequest messages.
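A short sketch that tallies the message and storage counts stated in this section; the helper names and the split into cost categories are illustrative only.

```python
def store_cost(r: int) -> int:
    """Messages per store: R StoreRequest + R StoreResponse."""
    return 2 * r

def lookup_cost_worst_case(r: int) -> int:
    """Worst case: all R replica keys are tried without finding the object."""
    return 2 * r

def measurement_cost(n: int) -> int:
    """Per measurement: n LookupRequest + n LookupResponse +
    n PeerProbabilityRequest + n PeerProbabilityResponse."""
    return 4 * n

def min_storage_per_peer(m: int, n_peers: int, r: int) -> float:
    """Formula 4: minimum average number of stored entries per peer."""
    return m * r / n_peers

def adjustment_cost(r_new: int, r_old: int, s_t: float) -> float:
    """Average extra StoreRequest messages when R grows from r_old to r_new."""
    return max(0, r_new - r_old) * s_t
```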
4 Evaluation

The evaluation has been done using a custom-made simulator. The protocol implementation is based on FreePastry [5], an open-source implementation of the Pastry DHT [6]. The obtained results match the developed model, i.e. by using the computed number of replicas and the defined operations, the DHT achieves the requested data availability.
4.1 Settings

Every simulation starts with the creation of a DHT community and populating it with a number of entries, which are initially replicated a predefined number of times. During this phase, all peers are online, to ensure that the load of the local peer storages is balanced. After initializing the DHT, the simulator executes a number of iterations. In every iteration, peers whose session time is over go offline, and some other peers come online with probability p and stay online for some time. The session length is generated using a Poisson distribution with the given average session length λ. Every time a peer comes online, it measures the average peer online probability, determines the number of required replicas, and tries once to adjust the number of replicas for its locally stored data according to the operations defined in Section 3.2. Finally, at the end of every iteration, the simulator measures the actual data availability in the system by trying to access all stored data: it iterates over the object keys and randomly picks an online peer, which then issues a lookup request.

The configuration parameters of the simulator consist of the community size N, the requested data availability a, the average peer online probability p, the average session time λ, the initial number of replicas R, and the number of replicas to probe n. These parameters offer a huge number of possible scenarios to simulate, and since the simulations are time-consuming, we have fixed the community size N to 400 peers, the average session time λ to three iterations, and the number of replicas to probe n to 30. As stated in Section 3.2, n = 17 would already be enough for a network of this size, but we wanted to be on the safe side. The system is initialized with fewer replicas than needed for reaching the requested data availability; therefore, peers must cope with that and create more replicas until the data availability is achieved. The protocol is evaluated on both a low and a high peer availability network, 10 times each, and the obtained values are averaged.

4.2 Low Available DHTs

We have set up a DHT with an average peer online probability of 20% and tried to reach a data availability of 99%, initializing peers with the number of replicas R = 5. After 10 runs, the obtained values are averaged; the results are shown in Figure 1. The requested data availability has been reached after the majority of peers have been able to measure the actual peer availability and to create the correct number of replicas (Figure 1a). Figure 1b shows the development of the average error (difference between the requested and obtained data availability) during the simulations. As can be seen, the DHT maintains a data availability that is a bit under the requested one. Figure 1c shows the average storage costs per peer during the simulations. After reaching the requested data availability, the storage load stabilizes as well and matches the "theoretical minimum" curve (Formulas 4 and 3) quite well.
Fig. 1. Reaching the requested availability in networks with 20% peer availability: (a) data availability over time (average, maximum, minimum); (b) average error of the data availability over time; (c) average storage size per peer over time (average, maximum, minimum, theoretical minimum).
4.3 High Available DHTs

The evaluation has also been applied to a DHT with an average peer availability of 50%. The peers have again been initialized with a lower number of replicas, R = 3, which is insufficient to guarantee the requested data availability of 99%. The obtained results are presented in Figure 2. The data availability has been reached even faster (Figure 2a) than in the previous case.
Fig. 2. Reaching the requested availability in networks with 50% peer availability: (a) data availability over time (average, maximum, minimum); (b) average error of the data availability over time; (c) average storage size per peer over time (average, maximum, minimum, theoretical minimum).
After measuring the actual peer availability and computing the number of replicas, peers can adjust the number of replicas for all their locally stored data. Since more peers are online at any time, the replica adjustment affects more data on average, and the requested data availability is reached faster. The achieved average error (Figure 2b) is as low as in the previous case.
Also, the average storage costs (Figure 2c) again fit the developed model very well.
5 Related Work

P2P file-storing systems like CFS [7] and PAST [8] have recognized the need for replication. CFS splits every file into a number of blocks that are then replicated a fixed number of times. PAST applies the same protocol without chunking files into blocks. In both systems, the number of replicas must be chosen carefully for the given community, because the systems are not able to correct it if the requested availability cannot be reached. Also, if the system behaves much better than expected, local peer storages will contain unnecessary replicas.

Oceanstore [9] is a P2P file-storing system aiming to provide wide-area storage. Oceanstore is not fully decentralized; it distinguishes between clients and powerful, highly available servers (i.e. a super-peer network). Within the storage, data are replicated using erasure coding (Reed-Solomon codes). Erasure coding is also used in TotalRecall [10]. Erasure coding offers lower storage costs compared to replication if the managed data are large in size; however, our DHT layer handles smaller data items. Additionally, if locally stored data are encoded, it is not possible to process them without retrieving all needed pieces from the network, which would decrease the system performance.

The protocol presented in [11] also exploits erasure coding for achieving data availability. Periodically, peers detect the current file availability and, if it is below the requested one, randomly replicate files in an attempt to increase their availability. The idea behind that protocol is similar to our approach, but it requires the existence of a global index and a global directory in which all information about files (locations and availability) is managed. The protocol presented in our paper is fully decentralized and does not need such a data structure to function properly.
6 Conclusion and Future Work

The paper has presented a fully decentralized replication protocol suited for DHT networks. It enables reaching and keeping the requested data availability, and at the same time the protocol keeps the storage costs very close to the minimum. The clear advantage of a DHT network that supports the presented protocol is its deployability in environments whose parameters, like peer availability, are not well known or cannot be predicted well. The system itself finds the optimal number of replicas and maintains the requested data availability. If the system properties change afterwards, the system adapts automatically, i.e. no system restart and/or reconfiguration is needed.

Future work will evaluate the robustness of the approach and address update issues. In particular, it has to be investigated how the protocol behaves in networks with a high churn rate, i.e. where peer online/offline rates change significantly during some periods of the system's runtime. Since not all peers are always online, an update might not be able to change all replicas, leaving some of them unmodified. Also, uncoordinated concurrent updates of an object result in unpredictable values of its replicas. The protocol presented in the paper has already been extended to address these issues [12], and
the first analysis shows that we are able to provide data consistency with arbitrarily high probabilistic guarantees. The extended protocol will be evaluated under various system settings and in communities similar to the one used in this paper.
References

1. Risse, T., Knežević, P.: A self-organizing data store for large scale distributed infrastructures. In: International Workshop on Self-Managing Database Systems (SMDB). (April 8-9 2005)
2. Dabek, F., Zhao, B., Druschel, P., Stoica, I.: Towards a common API for structured peer-to-peer overlays. In: 2nd International Workshop on Peer-to-Peer Systems. (February 2003)
3. Knežević, P., Wombacher, A., Risse, T., Fankhauser, P.: Enabling high data availability in a DHT. In: Grid and Peer-to-Peer Computing Impacts on Large Scale Heterogeneous Distributed Database Systems (GLOBE'05). (2005)
4. Berry, D.A., Lindgren, B.W.: Statistics: Theory and Methods. Duxbury Press (October 1995)
5. Rice University: FreePastry - Open-source Pastry Implementation. (2006) http://freepastry.org/FreePastry/
6. Rowstron, A., Druschel, P.: Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. Lecture Notes in Computer Science 2218 (2001)
7. Dabek, F., Kaashoek, M.F., Karger, D., Morris, R., Stoica, I.: Wide-area cooperative storage with CFS. In: Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, ACM Press (2001) 202-215
8. Rowstron, A., Druschel, P.: Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. In: Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, ACM Press (2001) 188-201
9. Kubiatowicz, J., Bindel, D., Chen, Y., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., Zhao, B.: Oceanstore: An architecture for global-scale persistent storage. In: Proceedings of ACM ASPLOS, ACM (November 2000)
10. Bhagwan, R., Tati, K., Cheng, Y.C., Savage, S., Voelker, G.M.: Total Recall: System support for automated availability management. In: First ACM/Usenix Symposium on Networked Systems Design and Implementation. (March 29-31 2004) 337-350
11. Cuenca-Acuna, F.M., Martin, R.P., Nguyen, T.D.: Autonomous replication for high availability in unstructured P2P systems. In: IEEE Symposium on Reliable Distributed Systems (SRDS). (2003) 99
12. Knežević, P., Wombacher, A., Risse, T.: Highly available DHTs: Keeping data consistency after updates. In: Proceedings of the International Workshop on Agents and Peer-to-Peer Computing (AP2PC). (2005)