Relaxing the Limitations of Serializable Transactions in Distributed Systems
Calton Pu
Department of Computer Science
Columbia University
New York, NY 10027
Internet: [email protected]

1 A Cost/Benefit Analysis

Atomic transactions have been recognized as an important concept in the development of databases and operating systems (OS). Serializability is the standard notion of correctness [3] in transaction processing. Informally, transactions maintain system consistency: a transaction takes a consistent system state into another consistent system state, regardless of concurrent executions of other transactions and crashes. Despite the recognition given to the concept, however, OS researchers have refrained from adopting transactions in practice. Some exceptions, such as Argus [10], Eden [1], and Clouds [6], only prove the rule.

The main advantage of programs encapsulated in transactions is their simple structure. Such programs do not have to deal with the interleaving of other concurrent programs or worry about system crashes. In other words, the programmer can concentrate on the program per se, even if the program runs in a complex, distributed, and parallel environment. Therefore, as our computing environments become more complex, distributed, and concurrent, one would expect the benefits of transactions to become stronger and more evident. Still, there is no widespread support for the direct implementation and use of transactions in distributed OS's.

A major objection raised against the support for and practical use of transactions is their cost, both in terms of compiled program size and run-time overhead. Systems that support transactions are large and run slowly. Although this argument appears attractive, we should recognize that other distributed OS primitives, such as the Remote Procedure Call (RPC) [4], also require relatively large code size and non-trivial run-time overhead. Unlike transactions, however, RPC is considered sufficiently useful and is widely adopted. With the continual advance of RISC technology, the cost measured in instruction counts will tend to go down, and more users will be willing to pay it.

Therefore, we should not blame only the cost of transactions in size and run-time overhead. A more fundamental problem resides in the concept of transactions, specifically with the serializability correctness criterion. First, practical algorithms that enforce serializability, called concurrency control, restrict system concurrency using smart heuristics. However, as the system grows larger and faster, the restrictions imposed by concurrency control become stricter and the system's effective concurrency decreases. Second, distributed commit protocols are needed to preserve the global atomicity of distributed transactions. These protocols require a synchronization step at the end of each distributed transaction. The agreement message exchange delays response time as the diameter of the network grows wider, since the slowest link is in the critical path. In addition, synchronizing multiple nodes limits availability as the system grows more heterogeneous, since the least reliable node must be up and running. Further, forcing the transaction components to agree on the transaction outcome also reduces site autonomy.

If we consider an idealized transaction-based system, with infinite CPU power, memory and disk storage, and network bandwidth, we will see that serializability remains the ultimate bottleneck in the system. In fact, the larger and more complex the system, the worse the bottleneck. A critical question is: "do we really need the consistency provided by serializable transactions?" Not surprisingly, the answer is negative in the distributed OS environment.
2 Consistency/Performance Trade-offs

In many situations we need only "approximately correct" information. Unlike the dollar amounts in bank databases, which must be preserved at all costs, much of the information in large distributed systems can be used as long as it is close to the true value. For example, few resource allocation algorithms aim for an optimal result. In most CPU scheduling policies, small temporary errors, in terms of scheduling a job out of order, can be corrected in the next round. In network bandwidth allocation, the relatively large number of messages allows similar recovery strategies.

Another important class of distributed OS actions that do not require serializable consistency is decisions with real-time constraints. Some of these decisions concern resource allocation, such as network congestion control. Other decisions are related to input/output processing, for example, in sound and image generation. Losing a few packets when reconstructing voice from digital samples is less important than delivering most of the packets on time. In these situations, getting or delivering most of the information on time is more important than getting all of it consistently.

In distributed environments, execution autonomy is very important, and preserving it often justifies tolerating some small amount of inconsistency. In a global bank, for example, if a small funds transfer hangs in the middle of two-phase commit due to network problems, it is preferable to allow the small inconsistency rather than lock up a large account involved. It is acceptable to fix the small problem later, but unacceptable to lock up a large sum of customer money because of the window of vulnerability. Similarly, telephone field services require access to free circuits regardless of what the system database indicates. This is especially serious if the field personnel are working under adverse weather conditions, say during a snow storm.

For these and other reasons, we need a way to allow non-serializable executions, provided the amount of inconsistency incurred is within tolerable bounds specified by the system designer. The bounds are important because most applications (in the OS kernel or outside) can tolerate some inconsistency, but not an arbitrarily large inconsistency. However, the difficulty is now in quantifying the amount of inconsistency introduced into the system and into the non-serializable transactions. For example, Gray [7] introduced four degrees of consistency, where degree 3 is equivalent to serializability. Degree 2 is a common level of consistency in real-world DB2 environments, allowing some non-committed values to be read. Unfortunately, there is no bound on the amount of inconsistency seen by any particular transaction. This problem limits the applicability of degree 2 consistency as well as some other schemes.

Allowing a bounded amount of inconsistency is also a direct way to alleviate the serializability bottleneck. Since non-serializable executions are allowed, more concurrency becomes possible. In fact, for application designers facing the serializability bottleneck in the idealized system, this is the only general way to push the envelope. Semantics-based optimization techniques, such as those based on commutativity, apply here too. We consider commutative operations as non-serializable reads and writes that are allowed to execute "out of order" because of application semantics.
3 Epsilon Serializability

We introduce the notion of epsilon-serializability (ESR) as a generalization of serializability. The purpose of ESR [12, 11, 16, 13, 14] is to explicitly allow some limited amount of inconsistency. ESR increases system throughput by alleviating data contention. For distributed TP systems, ESR allows asynchronous processing and therefore higher availability and autonomy. ESR has three main advantages over previous "weak consistency" models: (1) ESR is a general framework, applicable to a wide range of application semantics; (2) ESR is upward-compatible, since it reduces to serializability as ε → 0; and (3) ESR has a number of efficient algorithms that support it, derived from algorithms supporting serializability.

Let S be a system state space. S is a metric space if it has the following properties:
A distance function dist(u, v) is defined over every u, v ∈ S, taking real-number values.

Triangle inequality. dist(u, w) ≤ dist(u, v) + dist(v, w).

Symmetry. dist(u, v) = dist(v, u).

A real-world system state space usually contains strings and numerical values, too complex to be a metric space. For example, a bank system contains client names, addresses, account numbers, and account balances. However, the interesting updates happen only on the balance. If we consider the system state subspace obtained by restricting our attention to the balance, we have a metric space. However, there are state spaces that are not symmetric. For example, the actual flying time from New York to California is longer than from California to New York because of the jet stream. Also, there may be state spaces that do not respect the triangle inequality. The investigation of non-metric distance spaces is an active area of research. For simplicity we consider only metric spaces here.

We concede that in a broad sense the question of whether a system state space is metric depends on the semantics of the system state. However, because our world is a metric space, many practical applications with different semantics have that property. Bank accounts and seats in airline reservation systems are such examples. Timestamps used for calculating time intervals or as version numbers also form a metric space. Current algorithms that support ESR apply to any metric space, regardless of the underlying system state semantics.

We define an Epsilon-Transaction (ET) as a transaction with an error bound specification. (A more formal treatment of ETs can be found in a paper using the ACTA model [5] to characterize ESR [14].) An ET generalizes the transaction interface by adding a declaration to begin-ET that limits the amount of inconsistency allowed in the ET. Each ET specifies its own limits on inconsistency imported (called ImpLimit) and inconsistency exported (ExpLimit). There are four important cases, shown in Table 1. In the simplest case, ImpLimit = 0 and ExpLimit = 0 (ε = 0), we have serializable transactions. More generally, an ET has either ImpLimit > 0 or ExpLimit > 0, but not both. In Table 1 we see either queries (QET) with ImpLimit > 0 or updates (UET) with ExpLimit > 0.

                  ImpLimit = 0    ImpLimit > 0
  ExpLimit = 0    Transaction     QET
  ExpLimit > 0    UET             unbounded inconsistency

Table 1: Three Kinds of ETs.

In these cases the system remains in a consistent state, since all the updates are serializable with respect to each other, but queries are allowed to see inconsistent states produced by the updates. Divergence control algorithms [16] maintain each QET within its ImpLimit and each UET within its ExpLimit. In the most general case, both ImpLimit > 0 and ExpLimit > 0, permanent inconsistency may be introduced into the system and the system state may degenerate with unbounded inconsistency. The reason is that even though each ET introduces only a limited amount of inconsistency (bounded by ImpLimit and ExpLimit), successive ETs may corrupt the same data item to an arbitrary degree. Algorithms that bring the system back to a consistent state are a topic of active research.
4 Divergence Control Algorithms
In another paper [16] we have described in detail a methodology for extending classic conflict-based concurrency control methods, such as two-phase locking, into divergence control algorithms that guarantee ESR. Here we outline a simple case for illustration. We assume that the ET interface consists of ImpLimit for QET and ExpLimit for UET. The job of divergence control is to keep the amount of inconsistency in an execution below that specified by ImpLimit and ExpLimit.

We outline here the extension of 2PL concurrency control (CC) in the Read/Write (R/W) model to 2PL divergence control (DC). In Table 2, QET represents read locks in queries, and RET (WET) represents read (write) locks from updates. The table shows that read locks are always compatible (AOK). Also, the lower right corner of the table shows that the lock compatibility between the updates is exactly the same as in 2PL concurrency control, therefore ensuring SR ordering for the updates. The three incompatible squares correspond to the R/W and W/W conflicts between update ETs.

          QET      RET      WET
  QET     AOK      AOK      LOK-1
  RET     AOK      AOK      --
  WET     LOK-2    --       --

Table 2: Lock Compatibility for 2PL DC.

In general, a DC method maintains two counters for each ET: an imported inconsistency counter (ImpCounter) and an exported inconsistency counter (ExpCounter). For 2PL DC, in a QET only the ImpCounter is used, and in a UET only the ExpCounter is used. When a lock conflict is detected (an R/W conflict for LOK-1 or a W/R conflict for LOK-2), the involved ETs' inconsistency counters are incremented (the ImpCounter of the QET and the ExpCounter of the UET). For an accurate estimate of inconsistency, each write lock request carries the amount of change the intended write will introduce into the system; this estimate is the amount added to the counters. If the amount of change is not available (as for a blind write in some applications), then the amount to be written can be used as a safe overestimate for unsigned fields such as a bank account balance. If the counters in the conflicting ETs remain below their respective inconsistency limits, then the ET is allowed to proceed; otherwise either the QET or the UET is blocked. From Table 2 we can see that if ε = 0 then LOK-1 and LOK-2 are disallowed, and 2PL-based DC reduces to standard 2PL.

There are some other DC methods, such as those based on timestamps and optimistic validation algorithms. The same extension/relaxation methodology transforms all conflict-based concurrency control methods into corresponding DC methods. We omit their description here; the interested reader is referred to [16]. Semantics-based concurrency control can be seen as an optimization of DC methods. Such methods improve transaction concurrency in classic concurrency control by specifying all kinds of conflicts and their resolution, for example, via commutativity. The same specifications can be used as an optimization in DC methods, too. Intuitively, we can see the semantics as special situations in which we avoid incrementing the inconsistency counters.

5 Active ESR Research Topics

The design of distributed DC methods is a topic of active research. Our approach is the same: extension and relaxation. The idea is that the distributed algorithms must also detect the same conflicts, so we count them as usual and relax the abort decision. The main difference between distributed and centralized concurrency control resides in the complications due to redundancy and message passing in the distributed algorithms. These complications may add to the cost of accounting for inconsistency in conflicts. We use the Demarcation Protocol [2] to solve some of these complications. Similarly, we are designing the consistency restoration algorithms that bring the system state back to consistency and evaluating the amount of concurrency gained through ESR. Using ESR in distributed systems will become practical once efficient consistency restoration algorithms are described in detail.

At the same time, we are designing support for ESR by modifying classic transaction processing products. Since divergence control methods extend existing concurrency control methods, and consistency restoration methods extend existing crash recovery methods, our design is simplified. Furthermore, a successful implementation will confirm our claim that ESR algorithms are straightforward extensions of classic transaction processing algorithms. Many related works, such as the Optimistic Commit Protocol [9], Unilateral Commit [8], and Eventual Consistency [15], describe concrete protocols that implement some version of asynchronous transaction processing for distributed databases. ESR is a general concept with many possible implementations.

Relaxing the limitations of transactions through ESR may alleviate many of the problems of classic serializability discussed in Section 1. Efficient algorithms that support ESR offer hope that ETs may prove useful in the construction of large, complex, parallel, and distributed systems.
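The counter mechanism of the 2PL divergence control outlined in Section 4 can be sketched as follows. This is a minimal illustration in Python, not the paper's implementation: all names (ET, on_lock_conflict) are hypothetical, and the actual lock table and blocking machinery are omitted.

```python
# Sketch of the inconsistency-counter check in 2PL divergence control (DC)
# for the R/W model. Illustrative only; names are not from the paper.

class ET:
    def __init__(self, is_query: bool, imp_limit: float = 0.0, exp_limit: float = 0.0):
        self.is_query = is_query
        self.imp_limit = imp_limit   # ImpLimit, used by QETs
        self.exp_limit = exp_limit   # ExpLimit, used by UETs
        self.imp_counter = 0.0       # ImpCounter
        self.exp_counter = 0.0       # ExpCounter

def on_lock_conflict(query: ET, update: ET, write_delta: float) -> bool:
    """Handle an R/W (LOK-1) or W/R (LOK-2) conflict between a query ET and an
    update ET. write_delta is the amount of change carried by the write lock
    request (or a safe overestimate, e.g. the value to be written, for a blind
    write on an unsigned field). Returns True if the conflicting access may
    proceed, False if the QET or UET must be blocked."""
    if (query.imp_counter + write_delta <= query.imp_limit and
            update.exp_counter + write_delta <= update.exp_limit):
        query.imp_counter += write_delta    # the query imports inconsistency
        update.exp_counter += write_delta   # the update exports inconsistency
        return True
    return False                            # limit exceeded: block

# A query tolerating $100 of imported inconsistency against an update allowed
# to export $100: a $60 write passes, but a second $60 write must block.
q = ET(is_query=True, imp_limit=100.0)
u = ET(is_query=False, exp_limit=100.0)
assert on_lock_conflict(q, u, 60.0) is True
assert on_lock_conflict(q, u, 60.0) is False   # 60 + 60 > 100
# With epsilon = 0 (both limits zero), every conflict blocks: standard 2PL.
q0, u0 = ET(True), ET(False)
assert on_lock_conflict(q0, u0, 1.0) is False
```

Note that the check is symmetric in spirit but not in bookkeeping: the same write_delta is charged to the query's ImpCounter and the update's ExpCounter, matching the observation that in 2PL DC a QET uses only its ImpCounter and a UET only its ExpCounter.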
References

[1] G.T. Almes, A.P. Black, E.D. Lazowska, and J.D. Noe. The Eden system: A technical review. IEEE Transactions on Software Engineering, SE-11(1):43-58, January 1985.

[2] D. Barbara and H. Garcia-Molina. The demarcation protocol: A technique for maintaining linear arithmetic constraints in distributed database systems. In Proceedings of the International Conference on Extending Database Technology, Vienna, March 1991.

[3] P.A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, first edition, 1987.

[4] A.D. Birrell and B.J. Nelson. Implementing remote procedure calls. ACM Transactions on Computer Systems, 2(1):39-59, February 1984.

[5] P.K. Chrysanthis and K. Ramamritham. ACTA: The Saga continues. In Ahmed Elmagarmid, editor, Transaction Models for Advanced Applications. Morgan Kaufmann, 1991.

[6] P. Dasgupta, R.C. Chen, S. Menon, M.P. Pearson, R. Ananthanarayanan, U. Ramachandran, M. Ahamad, R.J. LeBlanc, W.F. Appelbe, J.M. Bernabeu-Auban, P.W. Hutto, M.Y.A. Khalidi, and C.J. Wilkenloh. The design and implementation of the Clouds distributed operating system. Computing Systems, 3(1):11-46, Winter 1990.

[7] J.N. Gray, R.A. Lorie, G.R. Putzolu, and I.L. Traiger. Granularity of locks and degrees of consistency in a shared data base. In Proceedings of the IFIP Working Conference on Modelling of Data Base Management Systems, pages 1-29, 1979.

[8] M. Hsu and A. Silberschatz. Unilateral commit: A new paradigm for reliable distributed transaction processing. In Proceedings of the Seventh International Conference on Data Engineering, Kobe, Japan, February 1990.

[9] E. Levy, H. Korth, and A. Silberschatz. An optimistic commit protocol for distributed transaction management. In Proceedings of the 1991 ACM SIGMOD International Conference on Management of Data, Denver, Colorado, May 1991.

[10] B.H. Liskov and R.W. Scheifler. Guardians and Actions: Linguistic support for robust, distributed programs. In Proceedings of the Ninth Annual Symposium on Principles of Programming Languages, pages 7-19, January 1982.

[11] C. Pu. Generalized transaction processing with epsilon-serializability. In Proceedings of the Fourth International Workshop on High Performance Transaction Systems, Asilomar, California, September 1991.

[12] C. Pu and A. Leff. Replica control in distributed systems: An asynchronous approach. In Proceedings of the 1991 ACM SIGMOD International Conference on Management of Data, pages 377-386, Denver, May 1991.

[13] C. Pu and A. Leff. Autonomous transaction execution with epsilon-serializability. In Proceedings of the 1992 RIDE Workshop on Transaction and Query Processing, Phoenix, February 1992. IEEE Computer Society.

[14] K. Ramamritham and C. Pu. A formal characterization of epsilon serializability. Technical Report CUCS-044-91, Department of Computer Science, Columbia University, 1991.

[15] A. Sheth, Yungho Leu, and Ahmed Elmagarmid. Maintaining consistency of interdependent data in multidatabase systems. Technical Report CSD-TR-91-016, Computer Science Department, Purdue University, March 1991.

[16] K.L. Wu, P.S. Yu, and C. Pu. Divergence control for epsilon-serializability. In Proceedings of the Eighth International Conference on Data Engineering, pages 506-515, Phoenix, February 1992. IEEE Computer Society.