Interactive Simulation of Distributed Transaction Processing Commit Protocols

S. Ramsay, J. Nummenmaa*, P. Thanisch†, R.J. Pooley and S.T. Gilmore
Dept. of Computer Science, University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland
Abstract

In distributed transaction processing, the database servers at each site must agree on whether to commit or roll back each transaction. Whilst the database servers are arriving at a global decision, site and network failures can occur, potentially causing blocking at the surviving sites, thereby preventing the progress of other transactions. To mitigate this problem, the database servers follow a distributed commit protocol. Commit protocols differ in their ability to withstand various combinations of failures without exhibiting some form of pathological behaviour. We have developed an interactive simulation environment for analysing distributed commit protocols in which the user can specify failures and observe the effects on the simulated system. In using simulation as a tool for protocol analysis, we are not interested in the simulated time to completion of the protocol, but rather in behavioural phenomena such as blocking and extra rounds of message passing between the sites.
1 Introduction
When using simulation to analyse the efficiency of a highly complex system, it is often the case that we wish to focus our attention on a particular component of that system, entirely suppressing all of the other components. Under these circumstances, we may be suppressing so much of the real-world system that it becomes unrealistic to use the simulated time of the model as a guide to the wall-clock time of the real-world system. However, we may still make productive use of simulation as it can provide insights into the behaviour of the component which may well have a bearing, albeit indirect, on the overall efficiency of the system.

The use of simulation to analyse the behaviour of protocols is a case in point. Typically, such protocols are embedded in larger software systems and under normal circumstances, the execution of the protocol may not dominate the level of performance of the system. However, there may be circumstances, possibly related to some error condition, where the behaviour of the protocol does impact on overall efficiency. Simulation can be used to study such `pathological' behaviour and the circumstances that trigger it.

* On leave from Department of Computer Science, University of Tampere, Finland.
† To whom correspondence should be addressed.

In the present paper we describe our use of simulation in the analysis of distributed commit protocols [1]. These protocols are in widespread use throughout industry in transaction processing systems. They allow geographically-remote, autonomous database servers to reach unanimity on whether a transaction's changes to the database can be made permanent. The participating servers communicate with each other by sending messages through a network. What makes this process of reaching agreement difficult (and therefore worthy of research) is the fact that failures can occur whilst the protocol is in operation. For example, sites and network components can fail and messages can be delayed. One of the challenges of designing simulation experiments is the specification of such failure scenarios.

Our simulation environment helps us to examine and compare the robustness of protocols. Given a particular sequence of failures, some protocols will block, whereas more robust (but more computationally expensive) protocols would be able to survive these failures and run to completion. Also, some failure sequences can cause more rounds of messages to pass between participants' sites in one protocol than in another. The simulation environment is interactive in the sense that the user has the option to specify when and where a failure occurs as the protocol progresses.
In Section 2, we discuss transaction processing and distributed commit protocols. In Sections 3 and 4, we describe the two protocols used in the simulations. In Section 5, we describe our simulation environment, and in Section 6 we present some example simulations. In Section 7, we describe other research that has used simulation to analyse the properties of commit protocols. We conclude with some remarks about future work in Section 8.
2 Transaction Processing
In transaction processing [6], a transaction is an atomic unit of work that transforms the database from one consistent state to another. Although there is a huge variation, industry folklore has it that the `typical' transaction updates ten or so different data items. For our present purposes, the key fact to bear in mind is that either all ten data items are successfully changed or, if for some reason this is not possible, none of the ten data items is changed and the transaction is rolled back and re-started. It must never be the case that some of the changes are made, but not others.

In many transaction processing systems, the data items that a particular transaction needs to update are stored on physically separate database servers that may be at different geographical locations. In this case, the transaction is said to be distributed. Each database server has a software component called a resource manager that decides (autonomously) whether the changes requested by a transaction can be made. There are several reasons why the resource manager might not be able to make the required changes. For example, it might have run out of disk space, the transaction might have to be rolled back in order to break a local deadlock, or the changes might violate a local integrity constraint. If the changes can be made, then the transaction can be committed; otherwise, the transaction must be rolled back.

A medium-sized transaction processing system must maintain a throughput of several hundred transactions per second. The operations associated with concurrently executing transactions have to be interleaved. Consequently, transactions `compete' with each other to lock the data items that they need to change. Once a transaction has finished its work, the sooner it can be committed, the sooner its data item locks will be released and other transactions can obtain those locks and proceed with their work.

The rules of distributed transaction processing require that each of the resource managers involved in a distributed transaction arrives at an independent, autonomous decision as to whether it can proceed with the commit of the transaction. However, before any changes are actually made to any of the databases, there must be unanimity among the affected resource managers that the commit can go ahead. This is because a transaction is defined to be an atomic unit of work: either all of its changes are made or none of them are made. If even one resource manager cannot make the changes, then the transaction must be rolled back at all sites.

Whilst the resource managers are deciding the fate of a particular transaction, site failures and network failures can occur. If the commit process for the transaction is
delayed by such failures then performance suffers even on servers that have not failed, since the transaction will continue to hold exclusive locks on shared data items until the commit process has completed. This may prevent the other transactions that are executing in the system from completing their work. Consequently, in practice the resource managers will follow a distributed commit protocol, such as two-phase commit (2PC) [5], which helps the surviving participants to progress the commit of the transaction despite failures.

Faults can take a variety of forms. For example: a server on which a database system is executing could fail; a communications link could be broken; a transient failure in the communications system could cause a message to be lost; the communications network could partition into components; or there could be a total failure of all servers.

In transaction processing, a database server writes details of the state of the commit process for each transaction into a log on stable storage. When a failed server recovers, it consults this log in order to resume the commit process¹. In the design of a commit protocol, there is a trade-off between the quantity of communication and the fragility of the protocol. A simple protocol that is cheap on communication might cause the coordinator and participants to block when some particular sequence of failures occurs, but another, more sophisticated (and communications-expensive) protocol might survive that same failure sequence without blocking.

¹ The concurrent execution of transactions means that, at any one time, the server may be in the process of committing several hundred different transactions. In the present study, however, we simplify by concentrating on the commit of a single transaction.

An interesting aspect of our work, as an example of simulation research, is the ambiguous nature of time in our experiments. As the systems in which we are interested are asynchronous, there is no notion of global time in practice. The performance metrics which we use in our simulation study are related to the message-passing behaviour of the protocols under various failure scenarios. The only relevance of time in our simulation study is to allow the experimenter to specify a partial order of events in the failure scenario with respect to the stages in the protocols under study.
3 The two-phase commit protocol

In our simplified description of protocols, we assume that there is just one transaction undergoing its commit phase. The changes requested by the transaction will affect a subset of the servers in the transaction processing system, and these servers are called the participants. We assume that there is a process executing at each server which is participating in the transaction.

At present, most of the major commercial transaction processing systems use the two-phase commit (`2PC') protocol. One of the participants is elected to be the coordinator. In 2PC, the coordinator multicasts a message to the participants asking them to vote Yes or No on the outcome of the transaction to be committed. When (and if!) a participant receives this message, it decides on what can be done, locally, with the transaction. A participant sends a Yes vote to the coordinator if it can commit the transaction; otherwise it sends a No vote.

The coordinator collects the votes. If there is a unanimous Yes vote, then the coordinator multicasts a Commit message. However, if there are any No votes, the coordinator multicasts a Rollback message. If the coordinator does not receive a reply from a participant after a timeout period, then the coordinator will multicast a Rollback message: the participant may have crashed and this is the only safe course of action.

When (and if!) the participants receive the Commit or Rollback message, they obey it. However, if a participant does not receive this message and it voted Yes, then there is a problem: the participant is blocked. It has to maintain locks on the transaction's data items until it hears the result of the voting. If the coordinator has crashed, the participant must either wait for the coordinator to recover or send Help-Me messages to the other participants, just in case the coordinator crashed halfway through multicasting the results of the election, in which case some of the other participants may know the outcome of the voting.
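The decision rule and the blocking condition described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the code of our simulator: the names `Vote`, `Decision`, `coordinator_decision` and `participant_is_blocked` are hypothetical, and a missing vote (a timeout at the coordinator) is modelled here as `None`.

```python
from enum import Enum
from typing import Dict, Optional


class Vote(Enum):
    YES = "Yes"
    NO = "No"


class Decision(Enum):
    COMMIT = "Commit"
    ROLLBACK = "Rollback"


def coordinator_decision(votes: Dict[str, Optional[Vote]]) -> Decision:
    """Decide the outcome from the votes collected by the coordinator.

    A value of None models a participant whose vote did not arrive
    before the coordinator's timeout expired (an assumption of this sketch).
    """
    # Commit needs a unanimous Yes; a No vote or a missing vote forces Rollback.
    if all(v is Vote.YES for v in votes.values()):
        return Decision.COMMIT
    return Decision.ROLLBACK


def participant_is_blocked(own_vote: Vote, received: Optional[Decision]) -> bool:
    """A participant is blocked only if it voted Yes and has heard no decision.

    A participant that voted No already knows the outcome must be Rollback.
    """
    return own_vote is Vote.YES and received is None
```

For instance, `coordinator_decision({"p1": Vote.YES, "p2": None})` yields `Decision.ROLLBACK`, which corresponds to the timeout case, while a participant that voted Yes and has received nothing is reported as blocked.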
4 A non-blocking protocol

As we saw in the previous section, there are situations where the 2PC protocol can lead to blocking: a participant site is unable to release locks on data items until a failure at another site has been repaired. In the present section, we describe a simplified version of the three-phase commit (`3PC') [10] protocol, which assumes that there are no network partitions and that site failures are non-total. Our description of 3PC follows closely the text of Bernstein et al. [1].

3PC proceeds in the same way as 2PC up to the point that the coordinator receives a unanimous vote to commit. At this point, the coordinator tells the participants to enter a Pre-Commit state and to await confirmation that all participants are aware of the outcome of the voting phase. As we shall see, this simple change avoids the blocking problem that was noted in our description of 2PC. The cost, however, is extra message-passing and more complex recovery.

Restricting our attention to the failures that the algorithm is intended to tolerate, at any point in this execution, if all existing failures are repaired and no new failures occur for a sufficiently long period of time, then all processes will eventually reach a decision. Because of failures, it can happen that all participants vote Yes, and yet the decision is to Rollback. As long as a participant has not voted Yes, it can make a unilateral decision to Rollback at any time. However, after voting Yes, the participating server cannot take any unilateral actions.
The period between voting Yes and receiving information about the overall decision is called the uncertainty period for the participating server. During that period, we say that the participant is uncertain. A participant is blocked if it must await the repair of a fault before it is able to proceed. A transaction may remain unterminated, holding locks, for arbitrarily long periods of time at the blocked server.

Figures 1 and 2 show the transitions made by the coordinator and the participants as 3PC proceeds.

1. The coordinator sends VoteRequest to all participants.
2. When a participant receives VoteRequest it responds Yes or No, depending on its vote. If a participant sends No, it decides Rollback and stops.
3. The coordinator collects the vote messages from all participants. If any of them are No or if the coordinator votes No, then the coordinator decides Rollback, sends Rollback to all participants who voted Yes and then stops. Otherwise, the coordinator sends Prepare-Commit to all participants.
4. A participant that voted Yes waits for a Prepare-Commit or Rollback message from the coordinator. If it receives a Rollback, the participant decides Rollback and stops. If it receives Prepare-Commit, it sends Ack (i.e. an acknowledgement message) to the coordinator.
5. The coordinator collects the Acks. When they have all been received, it decides Commit and sends Commit to all participants.
6. A participant waits for the Commit from the coordinator. When it receives it, it decides Commit and stops.

Messages received at steps (5) and (6) have a peculiar property: the recipient knows what the message is before it is received. In step (5) the coordinator knows it may only receive Acks, and in step (6) a participant knows it can only receive a Commit. However, these messages are useful because they inform their recipients of the occurrence of certain non-local events. (i) The receipt of Ack from participant P tells the coordinator that P is no longer uncertain. (ii) Since a Commit is sent only after all Acks have been received, a participant that receives the Commit knows that no participant is uncertain.

To handle failures, we must also describe timeout actions, which specify what a process must do if an expected message does not arrive.
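Steps (1) to (6) can be traced with a small failure-free sketch in Python. It is a minimal illustration under stated assumptions rather than the simulator's implementation: the function `run_3pc`, the state names and the message tuples are all hypothetical, the coordinator's own vote is left out, and no timeouts or crashes are modelled.

```python
from typing import Dict, List, Tuple

Message = Tuple[str, str, str]  # (sender, receiver, message type)


def run_3pc(votes: Dict[str, bool]) -> Tuple[str, Dict[str, str], List[Message]]:
    """Run one failure-free 3PC round; votes maps a participant name to its vote."""
    log: List[Message] = []
    state = {p: "Initial" for p in votes}

    # Step 1: the coordinator sends VoteRequest to all participants.
    for p in votes:
        log.append(("coordinator", p, "VoteRequest"))

    # Step 2: each participant replies; a No voter decides Rollback at once.
    for p, yes in votes.items():
        log.append((p, "coordinator", "Yes" if yes else "No"))
        state[p] = "Uncertain" if yes else "Rolled back"

    # Step 3: any No vote forces Rollback, which is sent only to the Yes voters.
    if not all(votes.values()):
        for p, yes in votes.items():
            if yes:
                log.append(("coordinator", p, "Rollback"))
                state[p] = "Rolled back"  # step 4, Rollback branch
        return "Rollback", state, log

    # Steps 3 and 4: Prepare-Commit ends each participant's uncertainty period,
    # and the participant acknowledges it.
    for p in votes:
        log.append(("coordinator", p, "Prepare-Commit"))
        state[p] = "Pre-Commit"
        log.append((p, "coordinator", "Ack"))

    # Steps 5 and 6: all Acks are in, so the coordinator sends Commit.
    for p in votes:
        log.append(("coordinator", p, "Commit"))
        state[p] = "Committed"
    return "Commit", state, log
```

Calling `run_3pc({"p1": True, "p2": True})` walks both participants through Uncertain and Pre-Commit to Committed, and the returned log makes the extra Prepare-Commit and Ack rounds visible.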
5 The Simulation Environment

5.1 The System Design

The simulation software is written in Java. It uses the SimJava and SimAnim class libraries [9] to provide simulation and animation facilities. These packages allow statistics on system performance and behaviour to be collected. As will be clear by now, we are more interested in counting the occurrences of certain categories of events, such as the number of rounds of messages, than we are in measuring the passage of simulated time. SimJava contains an appropriate set of facilities for these tasks.

Using Java's object-oriented programming facilities, we have a collection of Site objects. Each Site comprises a Participant object, a Coordinator object (which only gets activated if the site is elected to act as a coordinator) and a (network) Interface object. The Participants and Coordinators embody representations of the finite state automata for the protocols being simulated. Thus, they have the states defined in the automata, and they process messages in the way described in the automata.

A Site's Interface object will be passed messages by the Site's Participant and Coordinator. If a communication is local, it gets passed on, but non-local communication is passed to a message delivery system, which is controlled by an Adversary object. The Adversary is the means by which faults are introduced into the simulated system. Its role is in this sense analogous to that of adversaries in theorems and proofs about protocol behaviour [2].

The Adversary controls the delivery of messages. When a site is waiting for a message, the Adversary can do one of several things:

1. Once the message has been passed to the Adversary by the sending site, the Adversary can pass the message to the receiving site.
2. The Adversary can tell the receiving site that it should time out without receiving the message.
3. The Adversary can inform the site that is awaiting the arrival of the message that it either has crashed before receiving the message or that it will crash immediately after it has received the message.

Furthermore, the Adversary can tell any Site at any time that it has crashed. For example, a Site's Coordinator object might be in the middle of performing a broadcast to the Participants when the Adversary informs it that it has crashed. Thus the broadcast will be interrupted and some sites will not be sent the message. The Adversary also informs a crashed site if it should recover. Finally, the Adversary will also maintain the state of any network partitions. Its decisions as to which messages can be delivered are consistent with the current state of the network.

When the simulation is starting, the user may specify several parameters which control the behaviour of the Adversary. It is, for instance, possible to specify the states in which a participant or a coordinator may crash. Also, it is possible to specify a probability distribution for crashing, in which
case the Adversary will use the probability distribution when deciding whether to introduce failures or not. Similarly, it is possible to control the recovery of the crashed Sites. Further, it is possible to specify the maximum number of crashing Sites as well as the maximum number of crashed Sites which are allowed to recover.
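To make the Adversary's choices concrete, the following Python sketch shows one way such a component could be parameterised. It is a hypothetical illustration and does not reproduce the Java classes of our simulator: the names `Action`, `Adversary`, `crash_probability`, `max_crashes` and `crashable_states` are assumptions, the crash probability is a single fixed number rather than a general distribution, the even split between the two kinds of crash is arbitrary, and recovery and network partitions are omitted.

```python
import random
from enum import Enum
from typing import Optional, Set


class Action(Enum):
    DELIVER = "deliver the message"
    TIMEOUT = "time out without the message"
    CRASH_BEFORE = "crash before receiving the message"
    CRASH_AFTER = "crash immediately after receiving the message"


class Adversary:
    """Decides what happens to a site that is waiting for a message."""

    def __init__(self, crash_probability: float, max_crashes: int,
                 crashable_states: Set[str], seed: Optional[int] = None) -> None:
        self.crash_probability = crash_probability
        self.max_crashes = max_crashes            # upper bound on crashing sites
        self.crashable_states = crashable_states  # states in which a crash is allowed
        self.crashes = 0
        self.rng = random.Random(seed)            # seeded for repeatable runs

    def decide(self, site_state: str, message_available: bool) -> Action:
        may_crash = (self.crashes < self.max_crashes
                     and site_state in self.crashable_states)
        if may_crash and self.rng.random() < self.crash_probability:
            self.crashes += 1
            # Crashing after receipt only makes sense if a message was sent.
            if message_available and self.rng.random() < 0.5:
                return Action.CRASH_AFTER
            return Action.CRASH_BEFORE
        # No crash: deliver if the sender has handed over the message,
        # otherwise tell the waiting site to time out.
        return Action.DELIVER if message_available else Action.TIMEOUT
```

With `crash_probability` set to 0 the sketch never introduces a crash, which corresponds to a failure-free run; with a positive probability, the `max_crashes` bound limits how many sites may fail in one run.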
6 Example Simulations

Using the software we have produced example simulations to compare the 2PC and 3PC protocols' behaviour. We will describe in more detail two example simulations:

1. an example simulation without any failures, and
2. an example simulation where a coordinator and a participant fail, but do not recover.

It is generally considerably easier to handle a case where some of the participants vote No. Thus, in our examples all participants vote Yes.
6.1 No failures
In these simulations, both 2PC and 3PC run to completion without any sites failing and, as mentioned above, without any sites voting No. Figure 3 shows a screenshot at the end of the simulation with the 2PC protocol. In the top left corner is the visible part of the adversary window, which now shows just some final statistics. To the right of the adversary window is the options window, where some of the options for the run can be seen. At the bottom left is the shell where the run was started. The window shows different kinds of statistics. The coordinator, participant 1, and participant 2 windows show the last events for the coordinator and the respective participants. The last events include decisions made by the participants, messages received and sent, and discussion with the adversary.

Figure 4 shows a similar screenshot for a test simulation with the 3PC protocol. The screenshots do not show the (somewhat lengthy) trace files from the runs. The trace files show the information from the participant and coordinator windows and some more detailed data, e.g. about the communication.

The final result from both simulations is, thus, that the distributed commit was performed successfully. It now remains to compare the efficiency of the two protocols. As can be seen from the statistics visible in the screenshots, the simulations demonstrate a well-known fact: if there are no failures and all participants vote Yes, then the 2PC protocol needs fewer messages to perform the distributed commit than the 3PC protocol.
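The size of this difference can be estimated by counting one message per participant per round. The Python sketch below does this under stated assumptions: `n` counts the participants other than the coordinator, the coordinator's own vote is local and not counted, and the function names are hypothetical. The figures are derived from the protocol descriptions in Sections 3 and 4, not read from the simulator's statistics.

```python
def messages_2pc(n: int) -> int:
    """Failure-free 2PC: vote request, vote and decision for each participant."""
    return 3 * n


def messages_3pc(n: int) -> int:
    """Failure-free 3PC: the three 2PC rounds plus Prepare-Commit and Ack."""
    return 5 * n


for n in (2, 10):
    print(n, messages_2pc(n), messages_3pc(n))
```

Under these assumptions, 3PC always sends two more messages per participant than 2PC, which is the price paid for removing the uncertainty period before the final Commit.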
6.2 A coordinator and a participant failing

A classic example of a blocking failure sequence with the 2PC protocol is one in which a coordinator and a participant fail after the participants have voted Yes and before the coordinator has delivered the global result of the vote. Here, even a clever recovery protocol does not help, since all live participants have voted Yes, but they do not know how the failed participant voted.

Now, the remaining participants cannot commit, since the failed participant might have voted No and rolled back. Similarly, if we assume that all participants, including the failed participant, voted Yes, and that the failed coordinator received all the votes, sent Commit to the failed participant and then immediately failed itself, then it is possible that the failed participant received Commit and performed a commit before it failed. Thus, the remaining participants cannot roll back either, and they remain blocked. We performed an example simulation with this failure pattern and arrived at a blocked state, as can be seen from Figure 5.

Since 3PC is a non-blocking protocol, there is always a way for the remaining live participants to come to a local decision. Here, the only remaining participant can decide to roll back once it times out, since it has not received a Prepare-Commit. The 3PC protocol implemented in the software also has this feature, and Figure 6 shows a screenshot from the relevant simulation. There, although Participant 2 voted Yes, that is, to commit, it finally had to roll back to avoid blocking.

7 Related Research
A number of researchers have used simulation in order to assess the performance of distributed commit protocols. For example, Desai and Boutros [3] study the performance of a variant of 2PC called Prudent 2PC. This protocol can avoid the unnecessary rollback of a transaction when the failure was merely a transient communication problem. They compare this variant
of 2PC with standard 2PC. In their simulation study, they model various aspects of transaction processing in order to study the overall effect of workload, message passing, locking etc. This approach differs from ours in the sense that Desai and Boutros are interested in wall-clock timing differences between systems using different protocols when those systems are running a normal workload. By contrast, we are interested in identifying pathological behaviour when failure sequences occur.

Liu et al. [8] use simulation in order to study the performance of various 2PC protocols when site failures occur. Again, they are interested in wall-clock performance of a more realistic simulation model, analysing the effect of such factors as message-processing latency.

Various researchers are discovering the usefulness of three-phase commit for applications outside of transaction processing. For example, Li et al. [7] used a simulation analysis to study the problem of configuring the control scheme for a communications network that has to adapt to point-to-point, multicast and broadcast connections. 3PC is used to maintain the consistency of the system.
8 Conclusions
Traditionally, commit protocol performance comparisons have been based on asymptotic complexity analyses of message-passing. Comparisons have also been based on the class of failure scenarios in which the protocol can survive without blocking. We have used discrete-event simulation in order to study aspects of protocol behaviour.

At present, the system allows the user to gain an insight into the way the protocol reacts to a sequence of failures. However, we would like to scale up the experiments so that the user can specify, in more general terms, the nature of the failure scenarios in which he or she is interested. The simulation environment would then attempt to identify failure sequences which trigger pathological behaviour on the part of the protocol under test.

We intend to develop our simulation environment in a number of ways. We would like to specify failure scenarios in scripts, in addition to specifying them interactively. This would facilitate using the simulation environment to conduct experiments which compare the behaviour of protocols with respect to the same sequence of failures. It will also be necessary to make the generic simulation facilities in the Site class even more generic, so that a wider range of protocols can be tested without the need to do substantial re-programming. At present, the system can handle basic versions of two- and three-phase commit. Lastly, we would like to make the specification of the failure model more generic, in order to create simulated environments that can represent real-world transaction processing environments embodying technology such as mobile computing [4].

References

[1] P.A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading, MA, 1987.
[2] B.A. Coan and J.L. Welch. Transaction commit in a realistic timing model. Distributed Computing, 4:87-103, 1990.
[3] B.C. Desai and B.S. Boutros. Performance of a two-phase commit protocol. Information and Software Technology, 38:581-599, September 1996.
[4] A. Elmagarmid, J. Jing, and O. Bukhres. An efficient and reliable reservation algorithm for mobile transactions. In Proceedings of the Conference on Information and Knowledge Management (CIKM'95), pages 90-95, 1995.
[5] J. Gray. Notes on database operating systems. In Operating Systems: An Advanced Course, pages 394-481. Springer Verlag, Berlin, 1979.
[6] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, San Mateo, CA, 1993.
[7] C.-S. Li, C.J. Georgiou, and K.W. Lee. A hybrid multilevel control scheme for supporting mixed traffic in broadband networks. IEEE Journal on Selected Areas in Communications, 14(2):306-316, February 1996.
[8] L. Liu, D. Agrawal, and A. El Abbadi. The performance of two-phase commit protocols in the presence of site failures. In Proceedings of IEEE 24th International Symposium on Fault Tolerant Computing, pages 234-243, 1994.
[9] R. McNab and F.W. Howell. Using Java for discrete event simulation. Research note, University of Edinburgh, Dept. of Computer Science, Mayfield Road, Edinburgh, EH9 3JZ, September 1996. On-line, http://www.dcs.ed.ac.uk/home/hase/.
[10] D. Skeen. Nonblocking commit protocols. In Proceedings of the ACM SIGMOD Conference on the Management of Data (SIGMOD'81), pages 133-142, 1981.
[Diagram not reproduced. Labels recoverable from the figure: Start; Send VOTE-REQ; Collect Votes; One or more NOs or a timeout; Unanimous YES; Must Rollback; Send ROLLBACK; Rolled back; Should Commit; Send PREPARE-COMMIT; Collect ACKs; timeout; What the heck; Send COMMIT; Send COMMIT to operational participants; Committed.]
Figure 1: State transition diagram for the 3PC Coordinator
[Diagram not reproduced. Labels recoverable from the figure: Start; Receive VOTE-REQ; Send NO; Send YES; (timeout); Decide Rollback; Receive ROLLBACK; Uncertain; Recover 1; Receive PRE-COMMIT; Pre-Commit; Send ACK; Committable; Recover 2; Receive COMMIT; Committed.]
Figure 2: State transition diagram for 3PC Participant
Figure 3: A successful distributed commit using the 2PC protocol
Figure 4: A successful distributed commit using the 3PC protocol
Figure 5: A blocking failure sequence using the 2PC protocol
Figure 6: A non-blocking failure sequence using the 3PC protocol