Low-latency Access to Robust Amnesic Storage∗
Dan Dobre, Matthias Majuntke, Neeraj Suri
TU Darmstadt, Germany
[email protected], [email protected], [email protected]
ABSTRACT
We address the problem of building a reliable distributed read/write storage from unreliable storage units, e.g. a collection of servers, of which up to one-third can fail by not responding or by undetectably corrupting the data stored on them. Our contribution consists in the development of Byzantine-resilient storage algorithms that for the first time combine strong consistency and liveness guarantees with space efficiency. Previous solutions featuring equivalent properties resort to storing an unlimited number of data versions in the storage units, thus eventually running into problems of space exhaustion.
Categories and Subject Descriptors
B.3.2 [Memory Structures]: Design Styles—shared memory; C.2.4 [Computer-Communication Networks]: Distributed Systems; D.4.2 [Operating Systems]: Storage Management—secondary storage, distributed memories; D.4.5 [Operating Systems]: Reliability—fault-tolerance
General Terms
Algorithms, reliability, Byzantine failures

∗Research supported in part by DFG GRK 1362 (TUD GKmM), EC NoE ReSIST, and EC Genesys.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. © ACM 2008 ISBN: 978-1-60558-296-2 ...$5.00

1. BACKGROUND AND MOTIVATION

Fostered by recent advances in storage area network (SAN) technology and the availability of cheap commodity disks, reliable storage has developed into a critical aspect of any dependable architecture. Distributed storage services that reliably store and retrieve data have become a popular method to provide increased storage space, high availability and disaster tolerance. With distributed storage, redundant data copies may be spread across sites to prevent catastrophic failures from disrupting the availability of the data. Recent studies have explored the resilience of distributed storage to Byzantine components, which may fail by being non-responsive (e.g. due to a crash) or by responding with arbitrarily (and undetectably) corrupted data. This fault model captures a multitude of failure types, including software bugs, intermittent hardware failures and corruptions carried out by malicious intruders. The aim of this work is to extend this fruitful line of research towards a space-efficient distributed storage system with strong consistency and liveness guarantees. In our setting, previous solutions with equivalent guarantees are space-inefficient: they take the approach of storing the entire history of written values, eventually facing problems of space exhaustion. This extended abstract provides only an overview of our results; a detailed presentation can be found in our full paper, available as a technical report [5].

An essential building block of a distributed storage architecture is the abstraction of a reliable read/write register providing two operations: a write operation, which writes a value into the register, and a read operation, which returns a value previously written [13]. A reliable register is typically implemented from a collection of unreliable storage units, using replication. Depending on their power, storage units come in different flavors, ranging from simple disks to full-fledged servers executing complex protocols. Collectively, they are referred to as base objects. We consider algorithms that wait-free implement a regular register from malicious base objects. Algorithms that tolerate the arbitrary failure of one-third of the base objects are optimally resilient [16]. Wait-freedom [11] dictates that clients complete their operations independently of the progress and activity of other clients. An even stronger liveness property, called bounded wait-freedom, stipulates that every client completes its operations in a bounded number of its own steps. A regular register [13] always returns the latest value written before the read operation is invoked, or one written concurrently with the read. Regular registers are attractive because they never return stale data or values concocted by malicious base objects, and they can be used, for instance, in conjunction with a failure detector to solve consensus [1].

Algorithms that wait-free implement a regular register from Byzantine components are termed robust. Robust algorithms are particularly difficult to design when previously stored values are not kept in storage permanently but are eventually erased by a sequence of values written after them. Algorithms that satisfy this property are called amnesic (see [4] for a formal definition). Amnesic algorithms have the desirable property of storing in the base objects only a limited, typically small history of written values (if any). In contrast, all non-amnesic algorithms take the approach of storing an unlimited number of values in the base objects. Therefore, the amnesic property captures an important aspect of the space requirements of a distributed storage implementation. To avoid the problem of space exhaustion, non-amnesic algorithms need to be augmented with some form of garbage collection. However, network asynchrony and client crashes can prevent such a solution from working. The difficulty of implementing robust and amnesic storage stems from the fact that read operations might not be able to sample enough copies of the same value to return. This can happen when old values are continuously erased from the base objects by a sufficient number of overlapping writes. When t is the maximum number of corrupted base objects, a value read must be retrieved from at least t + 1 distinct base objects to guarantee that it is not forged. It has recently been shown [4] that robust and amnesic algorithms exist only if readers can write. To better understand the challenge behind robust and amnesic storage, consider an algorithm that stores only the last k written values. With this algorithm, each value v is erased by any sequence of k values v′ ≠ v written after v. Thus, v might be erased from the memory of the base objects before the reader has collected t + 1 copies, making it impossible for the reader to complete its operation. To simplify the illustration, we assume a single writer, a single reader, and n base objects, one of which has been corrupted by a malicious adversary. Moreover, we assume that each base object produces a reply in response to every read or write request. In this context, consider the critical situation where a slow read operation overlaps with a sequence of write operations storing values v1, v2, . . . (where vi = vj implies i = j).
The base objects respond to the reader with their history of the last k values they have stored, such that base object o ∈ {1, . . . , n} replies with the history v_{(o−1)k+1}, . . . , v_{ok}. We note that the histories sampled from any pair of base objects are disjoint, and that this situation could be extended to an infinite run (given an infinite value domain). Hence, the read operation cannot safely return any value read, because it cannot know whether that value was indeed written or was in fact forged by a malicious base object. To solve this problem, many algorithms take the approach of storing the entire history of written values in the base objects, e.g. [16, 6, 8, 9], and thus they are not amnesic. Some of the prior amnesic but not robust algorithms relax wait-freedom, providing only weaker liveness guarantees, for instance [1, 10]. Others relax the consistency guarantees provided by regular registers and feature only weaker, safe register semantics [12, 15]. With a safe register, a read operation executing concurrently with a write can return any value (from the value domain) [13]. A different approach is to assume a stronger model in which data is self-verifying and each value is typically digitally signed by the writer [3, 14]. Any value that is correctly signed may be returned, eliminating the problem of sampling multiple copies of a value. Our work focuses on the resilience and the (worst-case) latency of robust and amnesic storage algorithms. An implementation of a reliable register requires the clients accessing the register via a high-level operation to invoke multiple low-level operations on the base objects. In a distributed setting,
each invocation of a low-level operation results in one round of communication from the client to the base object and back. The number of rounds needed to complete the high-level operation is used as a measure of the latency of the algorithm. While a substantial amount of effort [1, 16, 8, 7] has been devoted to studying the latency of non-amnesic storage implementations, no corresponding study exists for amnesic (and robust) storage. Despite the importance of robust and amnesic storage, state-of-the-art research leaves the following question open: are amnesic algorithms inherently more expensive than non-amnesic ones?
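To make the adversarial scenario above concrete, the following sketch simulates it with hypothetical parameters (n = 4 objects, history length k = 3, t = 1 faulty object); it is an illustration of the difficulty, not one of the paper's algorithms:

```python
from collections import Counter, deque

# With n base objects each keeping only the last k values, a run of
# overlapping writes can leave the histories of any two objects
# disjoint, so a reader never finds the t+1 matching copies it needs.
n, k, t = 4, 3, 1  # hypothetical parameters; t+1 matching copies required

# Each base object stores only its last k values (a bounded history).
histories = [deque(maxlen=k) for _ in range(n)]

# Adversarial schedule: object o (0-based here) receives exactly the
# writes v_{o*k+1}, ..., v_{(o+1)*k} while the slow read is in progress.
for o in range(n):
    for i in range(o * k + 1, (o + 1) * k + 1):
        histories[o].append(f"v{i}")

# Count how many objects report each value.
counts = Counter(v for h in histories for v in h)
safe = [v for v, c in counts.items() if c >= t + 1]
print(safe)  # [] -- no value is reported by t+1 objects, so none is safe to return
```

Since every value appears in exactly one history, the reader can never distinguish a genuinely written value from one forged by the single corrupted object.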
Contribution. Our first contribution is to introduce the first robust and amnesic algorithm with optimal latency. Our second contribution is the first bounded wait-free amnesic algorithm with optimal resilience. The developed algorithms are based on a novel concurrency detection mechanism and a helping procedure, by which a writer detects overlapping reads and helps them to complete. In more detail, we make the following contributions:
Contribution 1. Our first algorithm is fast, meaning that every read and write operation completes in only one round of communication with the base objects. To the best of our knowledge, this is the first robust and amnesic algorithm with optimal latency. On the downside, it requires 4t + 1 base objects to tolerate t Byzantine failures. The only other robust and amnesic register construction using 4t + 1 base objects, albeit elegant, is non-optimal [2]. Moreover, any algorithm with fewer base objects (e.g. 4t) would require at least two communication rounds for both operations.
Contribution 2. The second algorithm uses the optimal number of 3t + 1 base objects. While write operations have a latency of three communication rounds, every read operation completes after only two communication rounds. To the best of our knowledge, this is the first bounded wait-free amnesic algorithm. Furthermore, the read latency of two rounds is tight [8]: at least two rounds are necessary even for reading from the weaker safe register, one that is allowed to return arbitrary values. Our result that robustly reading from amnesic storage can be equally cheap is surprising, given that most existing algorithms are either not amnesic or not robust. The only other existing robust and amnesic algorithm is not bounded wait-free [7].
2. INTUITION OF THE APPROACH
Justified by the impossibility of amnesic storage when readers do not write, our algorithms employ a reverse communication scheme between writer and reader. With this scheme, the reader writes some information that is subsequently read by the writer to detect overlaps. For this purpose, we introduce a shared object termed a safe counter, whose value is advanced by the reader and read by the writer. The value returned by the counter is termed a view, and read operations are associated with increasing views. When a read operation has advanced its current view, a subsequent write operation can read the new, updated view. When the writer detects a concurrent read operation rd (indicated by a view change), it freezes the latest value v previously written, such that v is not erased by any subsequent write. Freezing v means that v is not overwritten until the read operation rd has completed. Basically, this scheme guarantees that rd samples t + 1 copies of v, which ensures that v is not forged. We note that rd does not violate regularity by returning v; essentially, this is true because all values written after v are written concurrently with rd. However, to prevent read operations from returning outdated values (previously frozen), the writer freezes v together with a freshness indicator: the view of the read operation that requested v to be frozen. Thus, a read operation whose view is higher than that of the frozen value knows that it must pick a fresh value. A read completes when (a) it reads a value reported by at least one correct base object (i.e., by t + 1 base objects) and (b) no newer value was written before the read was invoked.
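The view-based concurrency detection and freezing scheme can be sketched as follows. This is a simplified, sequential illustration under our own naming (SafeCounter, Writer, the frozen slot); it abstracts away all distribution and fault tolerance and is not the paper's actual protocol:

```python
class SafeCounter:
    """Shared counter: advanced by the reader, read by the writer."""
    def __init__(self):
        self.view = 0

    def advance(self):          # invoked by the reader when a read begins
        self.view += 1
        return self.view

    def read(self):             # invoked by the writer to detect overlapping reads
        return self.view


class Writer:
    def __init__(self, counter):
        self.counter = counter
        self.last_view = 0
        self.last_value = None
        self.frozen = None      # (view, value): kept until the read completes

    def write(self, value):
        view = self.counter.read()
        if view > self.last_view and self.last_value is not None:
            # A read with a newer view is in progress: freeze the latest
            # previously written value, tagged with the freshness
            # indicator (the view that requested the freeze).
            self.frozen = (view, self.last_value)
            self.last_view = view
        self.last_value = value  # the ordinary, overwritable slot


counter = SafeCounter()
w = Writer(counter)
w.write("a")
counter.advance()   # a read begins and advances its view
w.write("b")        # the writer detects the view change and freezes "a"
print(w.frozen)     # -> (1, 'a')
```

The frozen pair models why a slow read can still gather t + 1 copies of "a" even though "b" (and any later values) overwrite the ordinary slot, and why the attached view lets a later read recognize the frozen value as outdated.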
3. DISCUSSION
Our results provide a first step towards developing amnesic algorithms with properties similar to non-amnesic ones. However, for deployment in a real distributed storage environment, many problems still need to be addressed. First, in the optimally-resilient case, the write operation needs three rounds of communication with the servers, whereas existing non-amnesic register constructions need only two rounds. Second, our algorithms implement only a single-reader register. A straightforward construction of a multiple-reader register for m readers can be realized using m copies of the single-reader register, one for each reader. In a distributed storage setting, the writer accesses all copies in parallel, whereas each reader accesses a single copy. Although correct, this construction is highly inefficient, because the writer has to store each value m times. Here, the challenge is to devise protocols that are more efficient in terms of bandwidth and space usage. To further improve the throughput and the memory consumption at the servers, our algorithms could be combined with the powerful approach of erasure coding. Instead of storing a complete copy of the data, each server holds a share, such that the original data can be reconstructed from enough servers' portions. Existing practical distributed storage systems utilizing erasure coding are either not amnesic [8] or not robust [10]. Specifically, in [10] read operations are guaranteed to terminate only in the absence of contention. Some of the prior amnesic (but not robust) register implementations assume that the readers cannot modify the server state (see e.g. [1]). This assumption in fact yields implementations that possess several properties that could be valuable in practice, for instance the ability to tolerate any number of malicious readers while using only O(1) memory at the servers. We are not aware of any robust implementation supporting this as well, and our algorithms are no exception.
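The straightforward m-copy construction described above can be sketched as follows. SRRegister is a hypothetical stand-in for the single-reader register (the Byzantine-tolerant protocol behind it is elided); the sketch only illustrates the m-fold storage cost that makes the construction inefficient:

```python
class SRRegister:
    """Placeholder for a single-reader register implementation."""
    def __init__(self):
        self.value = None

    def write(self, v):
        self.value = v

    def read(self):
        return self.value


class MRRegister:
    """Multi-reader register built from m single-reader registers."""
    def __init__(self, m):
        # One single-reader register per reader: m-fold storage cost.
        self.copies = [SRRegister() for _ in range(m)]

    def write(self, v):
        # The writer stores each value m times (in parallel in the
        # distributed setting; sequentially in this local sketch).
        for r in self.copies:
            r.write(v)

    def read(self, reader_id):
        # Each reader accesses only its own copy.
        return self.copies[reader_id].read()


reg = MRRegister(m=3)
reg.write("x")
print([reg.read(i) for i in range(3)])  # -> ['x', 'x', 'x']
```

The writer-side loop is exactly the m-fold write amplification the text criticizes; a bandwidth- and space-efficient multi-reader protocol would avoid it.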
We leave as an open problem the question of whether robust and amnesic register implementations exist that support any number of readers while using only O(1) memory at the servers.¹
4. REFERENCES
[1] I. Abraham, G. Chockler, I. Keidar, and D. Malkhi. Byzantine disk paxos: optimal resilience with byzantine shared memory. Distributed Computing, 18(5):387–408, 2006.
¹We thank Gregory Chockler for pointing out this open problem.
[2] I. Abraham, G. Chockler, I. Keidar, and D. Malkhi. Wait-free regular storage from byzantine components. Inf. Process. Lett., 101(2), 2007.
[3] C. Cachin and S. Tessaro. Optimal resilience for erasure-coded byzantine distributed storage. In DSN '06: Proceedings of the International Conference on Dependable Systems and Networks, pages 115–124, Washington, DC, USA, 2006. IEEE Computer Society.
[4] G. Chockler, R. Guerraoui, and I. Keidar. Amnesic distributed storage. In Proceedings of the 21st International Symposium on Distributed Computing (DISC'07), pages 139–151, 2007.
[5] D. Dobre, M. Majuntke, and N. Suri. On the time-complexity of robust and amnesic storage. Technical Report TR-TUD-DEEDS-04-01-2008, Technische Universität Darmstadt, 2008. http://www.deeds.informatik.tu-darmstadt.de/dan/amnesicTR.pdf.
[6] G. R. Goodson, J. J. Wylie, G. R. Ganger, and M. K. Reiter. Efficient byzantine-tolerant erasure-coded storage. In DSN '04: Proceedings of the 2004 International Conference on Dependable Systems and Networks, pages 135–144, Washington, DC, USA, 2004. IEEE Computer Society.
[7] R. Guerraoui, R. R. Levy, and M. Vukolić. Lucky read/write access to robust atomic storage. In DSN '06: Proceedings of the International Conference on Dependable Systems and Networks, pages 125–136, 2006.
[8] R. Guerraoui and M. Vukolić. How fast can a very robust read be? In PODC '06: Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing, pages 248–257, New York, NY, USA, 2006. ACM.
[9] R. Guerraoui and M. Vukolić. Refined quorum systems. In PODC '07: Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing, pages 119–128, 2007.
[10] J. Hendricks, G. R. Ganger, and M. K. Reiter. Low-overhead byzantine fault-tolerant storage. In SOSP '07: Proceedings of the twenty-first ACM SIGOPS symposium on Operating systems principles, pages 73–86, New York, NY, USA, 2007. ACM.
[11] M. Herlihy. Wait-free synchronization. ACM Trans. Program. Lang. Syst., 13(1):124–149, 1991.
[12] P. Jayanti, T. D. Chandra, and S. Toueg. Fault-tolerant wait-free shared objects. J. ACM, 45(3):451–500, 1998.
[13] L. Lamport. On interprocess communication. Part II: Algorithms. Distributed Computing, 1(2):86–101, 1986.
[14] B. Liskov and R. Rodrigues. Tolerating byzantine faulty clients in a quorum system. In ICDCS '06: Proceedings of the 26th IEEE International Conference on Distributed Computing Systems, pages 34–43, Washington, DC, USA, 2006. IEEE Computer Society.
[15] D. Malkhi and M. Reiter. Byzantine quorum systems. Distrib. Comput., 11(4):203–213, 1998.
[16] J.-P. Martin, L. Alvisi, and M. Dahlin. Minimal byzantine storage. In Proceedings of the 16th International Symposium on Distributed Computing (DISC 2002), LNCS 2508, pages 311–325, 2002.