A Single-Phase Non-Blocking Atomic Commitment Protocol1

Maha Abdallah, Philippe Pucheral
University of Versailles, PRiSM Laboratory
45, avenue des Etats-Unis, 78035 Versailles, France
e-mail: @prism.uvsq.fr

Abstract
Transactional standards have been promoted by OMG and X/Open to allow heterogeneous resources to participate in an Atomic Commitment Protocol (ACP), namely the two-phase commit protocol (2PC). Although widely accepted, 2PC is a blocking protocol and it introduces a substantial time delay (two phases to commit). Several optimized protocols and non-blocking protocols have been proposed. Optimized protocols generally violate site autonomy, while non-blocking protocols are inherently more costly in time and increase communication overhead. This paper proposes a new ACP that provides the non-blocking property while (1) having a lower latency than other ACPs (one phase to commit), and (2) preserving site autonomy, which makes it compatible with existing DBMSs. This protocol relies on the assumption that all participants are ruled by a rigorous concurrency control protocol. We show that this assumption can be relaxed to accept participants supporting SQL2 levels of isolation. Performance analysis shows that our protocol is more efficient in terms of time delay and message complexity than other blocking and non-blocking ACPs.

Keywords: distributed transaction processing, atomic commit protocols, non-blocking commit protocols, multidatabase systems

1. Introduction
Recently, much emphasis has been laid on the development of distributed systems integrating several data sources. The emergence of CORBA-compliant platforms reinforces this trend. Standards like DTP from X/Open [DTP91] and OTS from OMG [OTS94] define transactional interfaces that allow distributed data sources to participate in a two-phase commit (2PC) protocol [GR93], generally coordinated by TP-monitors. 2PC is used to preserve the atomicity of distributed transactions even in the presence of failures. Although 2PC is the most widely used Atomic Commitment Protocol (ACP), it has two major drawbacks. First, it is blocking: if the coordinator crashes between the vote phase and the decision phase, a transaction can hold system resources for an unbounded period. Second, it incurs three sequential

1 This work has been partially funded by the CEC under the OpenDREAMS Esprit project n°20843

message rounds (vote request, vote and decision) that introduce a substantial delay in the system even in the absence of failures. This makes 2PC ill-suited to today's highly reliable distributed platforms. Several solutions have been proposed to eliminate the voting phase of the 2PC. The early prepare protocol [SC90] forces each participant to enter a prepared state after the execution of each operation. A participant is thus ready to commit a transaction at any time, thereby making its vote implicit. The main drawback comes from the fact that each operation has to be registered in the participant's log on disk, thus introducing a blocking I/O. The coordinator log protocol [SC90] exploits the early prepare idea and avoids the blocking I/O on the participants by centralizing the participants' logs at the coordinator. However, this violates site autonomy by forcing participants to externalize their log records. More recently, the IYV (implicit yes-vote) protocol [AC95] has been proposed to exploit the performance and reliability properties of future gigabit-networked database systems. IYV capitalizes on the early prepare and coordinator log protocols. Participants in a transaction pass their redo log records and read locks to the coordinator in the acknowledgment messages. Thus, the coordinator can forward recover a transaction on failed participants. Although well adapted to gigabit-networked DBMSs, this protocol (i) does not preserve site autonomy, by forcing participants to externalize logging and locking information, (ii) places strong assumptions on the network bandwidth, and (iii) increases the probability of blocking, since a coordinator crash occurring at any time during transaction processing will block the participants until the coordinator's recovery.

A number of non-blocking commit protocols have been proposed in the literature. The simplest is the three-phase commit protocol (3PC) [Ske82]. 3PC is non-blocking at the expense of two extra message rounds needed to terminate a transaction even in the absence of failures. This high latency makes 3PC poorly adapted to today's systems with long mean times between failures. To avoid this problem, a new protocol, denoted ACP-UTRB, was proposed [BT93]. ACP-UTRB shares the basic structure of 2PC (two phases to commit, i.e., the same time delay as 2PC) and achieves the non-blocking property in synchronous systems by exploiting the properties of the underlying communication primitive it uses to broadcast decision messages.

In this paper, we propose a new atomic commitment protocol, denoted NB-SPAC (Non-Blocking Single-Phase Atomic Commit). NB-SPAC has a low latency (one phase to commit), is non-blocking, and preserves site autonomy.


During normal execution as well as in the case of one or more site failures, a transaction is committed in a single phase under the assumption that participating DBMSs are ruled by a rigorous concurrency control protocol [BHG87]. This assumption is exploited to eliminate the voting phase of the 2PC and to guarantee that our recovery mechanism preserves serializability. We show that this assumption can be relaxed to accept participants supporting SQL2 levels of isolation. Non-blocking is achieved by using a reliable broadcast primitive to deliver the decision messages [BT93]. Finally, NB-SPAC preserves site autonomy by exploiting techniques introduced in multidatabase systems to recover from failures [BGS92]. This paper is organized as follows. Section 2 introduces the distributed transaction model we are considering. Section 3 presents SPAC, our single-phase atomic commitment protocol. Section 4 details NB-SPAC, the non-blocking version of SPAC. The two protocols share the same structure but differ in the way communications between the sites are handled. Section 5 introduces the recovery protocol, common to SPAC and NB-SPAC, needed to recover from site failures. In Section 6, we study the performance of NB-SPAC and compare it with other well-known ACPs. Section 7 summarizes the main contributions of this work. Finally, in the Appendix, we show how to relax the rigorous hypothesis to accept DBMSs supporting SQL2 levels of isolation.

2. Distributed transaction model
In this paper, we consider distributed database systems built from pre-existing and independent local DBMSs located at different sites. Distributed transactions are allowed to access objects residing at these DBMSs under the control of a Global Transaction Manager (GTM). Each distributed transaction is decomposed by the GTM into one branch per accessed site. There is no restriction on the parallel or sequential processing by the GTM of the operations belonging to the same transaction or to different distributed transactions. However, we consider that each operation is acknowledged by the local DBMS executing it. Each operation is sent to and executed by a single DBMS. An important objective of the considered distributed transaction model is the preservation of site autonomy. This means that: (i) participating DBMSs do not have to be aware of the distribution, and (ii) local DBMS information (e.g., log records or lock tables) is not externalized and cannot be exploited by the GTM.


Site autonomy is a preliminary condition to the integration of legacy systems [SL90]. Each participating DBMS is assumed to guarantee, by means of a local transaction manager (LTM), serializability and fault tolerance of all transactions (i.e., GTM transaction branches) submitted to it. The GTM is responsible for the global serializability and the global atomicity of distributed transactions. A local GTM agent, denoted GTMAgent, can be used to ease the integration of each local DBMS. This does not violate site autonomy as long as the existing interface of each local DBMS is preserved. To simplify notations, we consider a single DBMS per site (i.e., site Sk is synonymous with DBMSk) and a single GTM in the architecture. A typical example of a GTM is a TP-monitor. Clearly, several instances of GTMs can coexist in the same architecture and may share the same GTMAgents. When the presence of several GTMs impacts the algorithms, it will be mentioned. The considered distributed transaction model is pictured in Figure 1.

[Figure 1: Distributed transaction model — the GTM, with its log, decomposes each distributed transaction Ti into branches Tik and Til sent to GTM Agentk and GTM Agentl; each agent fronts a local DBMS (with its LTMk or LTMl and its own log) on sites Sk and Sl.]

We summarize below the notations that will be used throughout the paper:
• e1 < e2 means that event e1 occurred before event e2.
• op1i, op2i, ..., opni is the sequence of operations performed by transaction Ti.
• ack(op) refers to the acknowledgment of a given operation op.
• Tik refers to the branch of transaction Ti executed on site Sk.
• commiti (resp., aborti) refers to the commit (resp., abort) of transaction Ti.
• commitik (resp., abortik) refers to the commit (resp., abort) of the transaction branch Tik.
• sendk(m) and receivek(m) respectively refer to the sending to site Sk and the receipt from site Sk of a given message m.
• Ti RW> Tj represents a dependency between Ti and Tj due to a Read/Write conflict, that is, Tj performed a write operation on a given object O which conflicts with a preceding read operation of Ti on O. Similar notations are used to express Write/Write (WW) and Write/Read (WR) conflicts. Ti → Tj represents any dependency between Ti and Tj.
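To fix ideas, the decomposition of a distributed transaction into per-site branches can be modeled in a few lines of Python; the class and attribute names below are our own shorthand for the entities of Figure 1, not part of the paper.

    # Minimal model of the Section 2 entities (names are our shorthand).
    # A distributed transaction Ti is decomposed by the GTM into one branch
    # Tik per accessed site; every operation goes to exactly one site.
    class DistributedTransaction:
        def __init__(self, tid):
            self.tid = tid
            self.branches = {}               # site -> operations of branch Tik

        def add_operation(self, site, op):
            """Route op to a single site, opening branch Tik on first use."""
            self.branches.setdefault(site, []).append(op)

    ti = DistributedTransaction("T1")
    ti.add_operation("Sk", "read(O)")        # executed by DBMSk on behalf of T1k
    ti.add_operation("Sl", "write(P)")       # executed by DBMSl on behalf of T1l
    assert set(ti.branches) == {"Sk", "Sl"}  # the participants of T1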


3. The Single-Phase Atomic Commit Protocol

3.1 Informal description and related work
The objective of this section is to introduce a Single-Phase Atomic Commitment protocol, denoted SPAC, that preserves site autonomy (i.e., that can be used with existing DBMSs). For the rest of the paper, the words GTM (resp., GTM agents) and coordinator (resp., participants) will be used interchangeably to designate the coordinator (resp., participants) of the atomic commitment protocol. The basic idea underlying SPAC is to eliminate the voting phase of the 2PC by exploiting the properties of the local DBMSs' serialization protocols. Let us assume that all participating DBMSs serialize their transactions using a pessimistic concurrency control protocol and that this protocol avoids cascading aborts (e.g., the strict 2-Phase Locking protocol [BHG87]). In this context, a transaction that successfully executes all its operations can no longer be aborted due to a serialization problem. Thus, this transaction is guaranteed to commit in a failure-free environment. When the acknowledgments of all operations of a transaction Ti have been received, this means that all the transaction branches Tik have been successfully executed to completion. At this point, the application submits its commit demand to the GTM, which can directly ask each site accessed by the transaction Ti to commit, with no synchronization between the sites. If a transaction branch, say Tik, is aborted by Sk during its execution for any reason (e.g., integrity constraint violation or non-serializability), the GTM simply asks each accessed site to abort that transaction.

While the SPAC and IYV protocols [AC95] share the same assumption concerning the serialization policies of the local DBMSs, they strongly differ in the way failures are handled. Assume that site Sk crashes during the one-phase commit of transaction Ti. Meanwhile, Ti may have been committed at other sites. To guarantee Ti's atomicity, the effects of the transaction branch Tik have to be forward recovered on Sk. In the IYV protocol, each site executing an update operation includes in its acknowledgment message the physical redo log records generated during the execution of this operation, along with their respective log sequence numbers (LSN) [GR93]. The GTM registers these records in its own log. Once Sk recovers from its crash, it asks the GTM for all redo log records whose LSNs are greater than Sk's highest LSN and reinstalls them in the database. Each site executing a read operation also includes in its acknowledgment message the list of read locks granted during the execution of this operation. This list is also logged by the GTM.


Logging read locks allows the GTM to recover on-going transactions and complete their execution after a site crash. This protocol is well adapted to the context of gigabit networks having a large bandwidth. However, it is not adequate for the distributed context we are considering, for three reasons: (i) it does not preserve site autonomy2, (ii) it places strong assumptions on the network bandwidth, and (iii) it increases the probability of blocking: one-phase commit protocols are often subject to blocking since the uncertainty period lasts for the whole transaction (see Section 3.4). To enforce transaction atomicity without sacrificing site autonomy, SPAC exploits logging schemes introduced in multidatabase systems. The retry approach [BGS92] simulates a two-phase commit protocol on top of local DBMSs not supporting the 2PC interface. On each site Sk, a GTM agent logs in a private log each operation sent to it before its execution. The GTM is the coordinator of the 2PC protocol while the GTM agents are the participants. During the voting phase of the 2PC, each agent force writes its log to stable storage and answers the GTM with a positive vote (if everything goes well). During the decision phase, when an agent receives the commit decision, it delivers it to its local DBMS. If the local DBMS crashes before completing the commit, it will abort the transaction during its recovery phase. After the DBMS recovery, its GTM agent re-executes all operations found in its log that belong to the globally committed transaction. This approach guarantees global atomicity while preserving site autonomy, but it may violate global serializability [BGS92]. In addition, the retry approach may fail if the content of the database changed between the initial transaction execution and the execution of the associated retry transaction [MRKS92]. Global serializability can be restored by adding restrictions on the way global transactions are executed [BST90] or on the objects accessed by global and local transactions [WV90]. Note that multidatabase systems distinguish between global transactions controlled by the GTM and local transactions that escape its control. Compared to these approaches, ours proposes a non-blocking single-phase ACP (instead of a blocking 2PC) and does not impose any restriction on global transactions. However, we restrict the serialization policies that can be used by participants and we consider only global transactions, that is, no transaction escapes the coordinator's control.

2 Site autonomy means, among other things, that internal information (e.g., LSN, lock cells, information on blocked transactions) remains private to a site and cannot be exported.


In order to recover from failures, SPAC maintains a redo log at the GTM. Indeed, maintaining the redo logs in the GTM agents could not lead to a single-phase ACP, since the GTM would have to ensure that the redo logs of all its agents are written to stable storage before taking its decision. The GTM redo log contains the list of operations sent to each site. In case of a participant crash during the one-phase commit, the failed transaction branches are re-executed thanks to the operations registered in the GTM redo log. Logging logical operations instead of physical redo records is mandatory to preserve site autonomy. Global serializability and redo correctness are maintained thanks to the assumptions made on the properties of the DBMSs' serialization protocols. These assumptions are detailed in the next section.
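To make the coordinator-side bookkeeping concrete, here is a minimal Python sketch of the GTM redo log and one-phase decision described above; all names (GTMLog, commit_one_phase, send) are illustrative assumptions, not the paper's API.

    # Hypothetical sketch of the GTM side of SPAC. The GTM buffers one
    # (non-forced) redo record per operation sent to a site; at commit time
    # it force-writes all records plus the decision in a single I/O.
    class GTMLog:
        def __init__(self):
            self.buffer = []          # non-forced records, kept in main memory
            self.stable = []          # records already forced to stable storage

        def write(self, record):      # non-forced write: no blocking I/O
            self.buffer.append(record)

        def force(self):              # the single blocking I/O of SPAC's Step 4
            self.stable.extend(self.buffer)
            self.buffer.clear()

    def commit_one_phase(tid, participants, log, send):
        """Commit tid in a single phase: no vote collection is needed because
        every operation was already acknowledged by a rigorous participant."""
        log.write(("commit", tid))    # decision record, still buffered
        log.force()                   # force decision + redo records together
        for site in participants:     # single round: broadcast the decision
            send(site, ("commit", tid))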

3.2 Hypotheses on participating DBMSs
In order for our protocol to work properly, the participating DBMSs must satisfy the two following properties:

P1: [∀n receivek(ack(opni))] by the GTM => Tik is locally serializable on site Sk, where opni denotes the nth operation of the transaction branch Tik.

P2: Local transactions are strongly isolated.

Assuming all sites satisfy property P1 guarantees that if the GTM receives the acknowledgment of all operations of transaction Ti, then Ti is serializable on all sites. Property P2 means that the effects of an uncommitted transaction are hidden until the transaction ends, which guarantees that local schedules are cascadeless. Moreover, P2 is necessary in order to handle failures properly (see Section 5). If both P1 and P2 are satisfied and if the commit decision is submitted after having received the acknowledgment of all operations, the assertion A1 given below is true. It follows that a single-phase commit protocol is sufficient to guarantee the atomicity of distributed transactions in a failure-free environment.

A1: [∀k sendk(commiti)] by the GTM => Tik will be committed if Sk doesn't crash.

Let us now characterize the class of DBMSs satisfying P1 and P2. [BGS92] identified four classes of DBMSs, namely Strongly Serializable, Serialization-Point, Strongly Recoverable and Rigorous DBMSs, in increasing order of restrictions incurred by their local serialization protocol.

Definition (Rigorous DBMS): A DBMS is said to be rigorous if, for all pairs of transactions Ti and Tj, whenever a conflict occurs and generates a dependency of the form Ti → Tj, then Tj is blocked until Ti terminates (by a commit or an abort).


Theorem: Rigorous DBMSs satisfy P1 and P2.
Proof: Let us start with property P1. If [∀n ack(opni)] is received by the GTM, this means that all operations of transaction Ti have been successfully executed. Thus, no dependency of the form Tj → Ti can exist in the serialization graph of Sk, ∀k. Otherwise, Ti would have been blocked and the GTM would not have received the acknowledgment of all operations of Ti. It follows that Tik is locally serializable on site Sk, ∀k. Property P2 is satisfied by construction. Indeed, if any conflict of the form Ti RW> Tj, Ti WR> Tj, or Ti WW> Tj exists, Tj is blocked until Ti terminates, thereby enforcing strong isolation.

Strongly Recoverable DBMSs enforce the assertion [(Ti → Tj) => commiti < commitj]. This assertion enforces neither P1 nor P2. Since the other classes of DBMSs provide weaker conditions than Strongly Recoverable DBMSs, Rigorous DBMSs appear to be the sole class of DBMSs compatible with our protocol. This restriction is not so strong, since almost all today's DBMSs (either relational or object-oriented) are ruled by strict 2PL, which is a rigorous serialization protocol. However, commercial relational DBMSs are likely to use the levels of isolation standardized by SQL2 [SQL92]. These DBMSs do not satisfy properties P1 and P2 since they accept some non-serializable schedules. In the Appendix, we show that NB-SPAC can be extended to deal with DBMSs providing levels of isolation.
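As an illustration of the blocking behavior that rigorousness imposes, the following Python sketch shows a toy lock table; the class RigorousLockManager and its methods are invented names, not the paper's mechanism.

    # Toy rigorous scheduler (invented names): any conflicting access blocks
    # the requester until the holder terminates. Hence a transaction whose
    # operations were all acknowledged can no longer be aborted for
    # serialization reasons (P1), and intermediate effects stay hidden (P2).
    class RigorousLockManager:
        def __init__(self):
            self.locks = {}                  # object -> (mode, set of holders)

        def request(self, tid, obj, mode):
            """Return True if the access is granted, False if tid must block."""
            held = self.locks.get(obj)
            if held is None:
                self.locks[obj] = (mode, {tid})
                return True
            held_mode, holders = held
            if held_mode == "R" and mode == "R":   # Read/Read does not conflict
                holders.add(tid)
                return True
            if holders == {tid}:                   # lock upgrade by the sole holder
                self.locks[obj] = ("W", {tid})
                return True
            return False                           # RW, WR or WW: block until holder ends

        def terminate(self, tid):
            """On commit or abort, release tid's locks so blocked transactions resume."""
            for obj in list(self.locks):
                mode, holders = self.locks[obj]
                holders.discard(tid)
                if not holders:
                    del self.locks[obj]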

3.3 Protocol description
The standard scenario induced by the SPAC protocol is depicted in Figure 2. To simplify the figure, we show the actions executed by a single GTM and a single site Sk, assuming that all GTMs and sites follow the same scenario. Step 1 is concerned with the initiation of the distributed transaction. At the application's request, the GTM starts the distributed transaction Ti. The Ti identifier is sent back to the application. Step 2 represents the execution of the first operation op1i of a transaction branch Tik. We assume that sites dynamically join a distributed transaction. Thus, the first operation sent to a site Sk forces a transaction branch Tik to be started at Sk. To preserve site autonomy, the local DBMSs are not assumed to be aware of the distribution. The GTM asks, through GTMAgentk, the local DBMSk to start a new ordinary transaction. GTMAgentk registers in a memory table the association <Ti, Tp>, where Tp is the identifier of the transaction started by the local DBMS, that is, the local representative of Ti.


Then, the GTM non-force writes in its log that site Sk is participating in Ti. Thus, at transaction end, the GTM log contains the list of all participants for each transaction. The execution of the first operation op1i is finally performed in the same way as any other operation (see Step 3).

[Figure 2: Basic SPAC protocol scenario — message exchanges between the application, the GTM, GTMAgentk and DBMSk: Step 1, BeginTrans() returning Ti; Step 2, op1i starting the local proxy transaction Tp, with the association <Ti, Tp> registered and "Sk joins Ti" non-force written in the GTM log; Step 3, each opni non-force written in the GTM log, executed by Tp as opnp and acknowledged up to the application; Step 4, the GTM force writes Ti's log records and commit decision, the agent writes the commit decision record inside Tp, commits Tp, and the GTM non-force writes ackk(commiti) and, once all acknowledgments are received, endi.]

Step 3 represents the sequence of operations executed on behalf of Tik. The GTM registers in its log each operation that is to be executed. Note that this registration is done by a non-forced write. Non-forced writes are buffered in main memory and do not generate blocking I/O. When the agent receives the operation, it asks the local DBMS to execute it on behalf of Tp. Each operation is acknowledged up to the application.

Step 4 concerns the atomic validation of Ti. If the application issues a commit request to the GTM, this means that all operations of Ti have been successfully executed (consequently, Ti is serializable at all sites). The GTM can thus take the commit decision. First, the GTM force writes all Ti log records along with its commit decision on stable storage in a single step. Assuming that the GTM log is a sequential file, this step generates a single blocking I/O, the size of which depends on the activity of Ti.


Second, the GTM asks its agents to commit Ti by broadcasting the commit decision to all of them. In order to be able to determine whether or not a failed DBMS has effectively committed a transaction, and without violating DBMS autonomy, each agent creates in the local database a special record <Ti, commit> containing the commit decision for Ti. This record is created by means of an ordinary write operation3 inside Ti. This operation is treated by the DBMS in exactly the same manner as the other operations belonging to Ti, that is, all of them are committed or aborted atomically. Once the write operation of the commit decision has been acknowledged by the DBMS, the GTM agent asks the local DBMS to commit transaction Tp, the local representative of Ti. In a failure-free environment, this commit always succeeds and is acknowledged up to the GTM. For every transaction Ti, the GTM non-force writes each ackk(commiti) in its log. When the acknowledgment has been received from all the participants in the transaction, the GTM non-force writes an endi record. At this point, the GTM can forget transaction Ti. If transaction Ti is to be aborted, Step 4 is simpler than its committing counterpart. The GTM discards all Ti log records and broadcasts the abort message to the local DBMSs through its agents. SPAC is a presumed-abort protocol. Thus, abort messages are not acknowledged, and neither the abort decision nor transaction initiations are recorded in the GTM log.
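The decision-record trick can be pictured with a few lines of hypothetical client code against an SQL participant; the table name gtm_commit_log and the DB-API-style connection calls are illustrative assumptions, not part of the protocol specification.

    # Hypothetical agent-side commit of Ti's local proxy Tp (Step 4).
    # The decision record is written INSIDE the transaction, so it becomes
    # durable atomically with Tp's updates: after a crash, its presence tells
    # the recovery procedure whether Tp actually committed.
    def commit_proxy(conn, ti_id):
        cur = conn.cursor()
        # ordinary write, executed on behalf of Tp (assumed table name)
        cur.execute("INSERT INTO gtm_commit_log(ti, decision) VALUES (?, 'commit')",
                    (ti_id,))
        conn.commit()          # commit Tp; record and updates survive together

    def locally_committed(conn, ti_id):
        cur = conn.cursor()
        cur.execute("SELECT 1 FROM gtm_commit_log WHERE ti = ?", (ti_id,))
        return cur.fetchone() is not None   # probed by the recovery procedure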

3.4 The SPAC algorithm
The algorithm implementing the SPAC protocol is presented in Figure 3. For the sake of clarity, we omit the message acknowledgments in this algorithm. This information can easily be deduced from the scenario depicted in Figure 2. Figure 3 shows the correspondence between the algorithm and the scenario steps. Note that, unlike 2PC protocols, this algorithm does not exhibit a separate procedure implementing the atomic commitment protocol. Indeed, a GTMAgent directly receives the decision of the GTM for a transaction Ti (line 18 or 19) after Ti's last operation invocation (last execution of line 17). The GTM informs the participants of a transaction (the GTMAgents) of its decision through a simple broadcast primitive presented in Figure 4.

3 Note that this operation cannot generate a dependency cycle since it is the last operation executed in any transaction that has to be committed.


GTM algorithm
1-  do forever {
2-      wait for (receipt of a message m from the application)
        case (m) of:
3-      beginTrans():  allocate a new Ti identifier ;                      // implements Step 1
4-                     sendappli(Ti) ;
5-      op1i():        sendGTMAgentk(beginTrans(Ti)) ;                     // implements Step 2
6-                     sendGTMAgentk(op1i) ;                               //   (op1i() concerns site Sk)
7-      opni():        sendGTMAgentk(opni) ;                               // implements Step 3 for
8-                                                                         //   any Ti operation after op1i
9-      aborti:        broadcastA(aborti) ;                                // implements Step 4
10-     commiti:       force write Ti log records and Ti commit decision in GTM's log ;
11-                    broadcastA(commiti) ;                               // implements Step 4
    }   // A stands for the set of all GTM agents participating in transaction Ti

GTMAgentk algorithm
12- do forever {
13-     wait for (receipt of a message m from the GTM or delivery of a decision message m by the broadcast)
        case (m) of:
14-     beginTrans(Ti): sendlocalDBMS(beginTrans()) ;                      // implements Step 2
15-                     register the association <Ti, Tp> ;
16-     opni():                                                            // implements Step 3
17-                     sendlocalDBMS(opnp) ;                              //   for each Ti operation
18-     aborti:         sendlocalDBMS(abortp) ;                            // implements Step 4
19-     commiti:        sendlocalDBMS(writep(commiti)) ;                   // implements Step 4
20-                     sendlocalDBMS(commitp) ;
    }

Figure 3: SPAC algorithm.

In the following, we say that a participant (i.e., a GTMAgent) decides on a transaction when it delivers the decision message to the local DBMS, whereas a local DBMS decides on a transaction when it executes the decision received from its agent. Thanks to assertion A1 (Section 3.2), and in the absence of failures, every local DBMS conforms to the decision received from its agent. So, to simplify the presentation of the following theorems and proofs, we do not distinguish between GTMAgents and local DBMSs when we use the term participant. We will show in Section 5 that, thanks to our recovery algorithms, local DBMSs eventually conform to the decision of their corresponding GTMAgents even in the case of failures.


To be correct, an atomic commitment protocol must satisfy the following property:

AC: all participants that decide reach the same decision.

Theorem: The SPAC algorithm satisfies property AC.
Proof: The proof is by contradiction. Assume a participant p decides Commit while another participant q decides Abort. Participant p can decide Commit only at line 20, while participant q can decide Abort only at line 18. In both cases, p and q have decided following the delivery of a decision message from the GTM. Since there is a single GTM per transaction, this means that the GTM has broadcast two different decisions. This contradicts the fact that the GTM decides only once, at line 9 or 11, depending on the application's decision.

broadcastA(m) {
    // the GTM executes:
    send(m) to all participants in A
    // each GTMAgent ∈ A executes:
    upon (receipt of m)
        deliver m
}

Figure 4: A simple broadcast primitive.

A key step in our protocol is the dissemination of the decision message to all the participants by the GTM. In SPAC, this is achieved through a simple broadcast primitive, denoted broadcastA. Figure 4 shows a possible implementation of this primitive using multiple send and receive operations. Indeed, a send operation has a corresponding action at the destination called receive; the broadcast primitive has a corresponding action at the destination called deliver. Note that this primitive is not reliable (i.e., not atomic): if the GTM crashes in the middle of the broadcast of the decision message, some participants may deliver the decision while others never do. Consequently, the SPAC protocol may produce blocking situations in case of a GTM failure. Assume the GTM crashes during the transaction execution (lines 6 to 8) or during the broadcast of the decision (lines 9 and 11). In this case, each GTMAgent that did not receive the GTM decision remains blocked at line 13. In [AP97], we proposed a termination protocol that allows GTMAgents to take a uniform decision in the case of a GTM crash. In this protocol, the GTMAgents cooperate in order to take a uniform decision depending on their local states. However, no termination protocol can solve all blocking situations. Take for example the blocking situation produced if the GTM crashes in the middle of the broadcast of its decision and this decision is received only by participants that later crash. Thus, non-failed participants cannot decide on the transaction outcome since there is no means by which they can deduce


whether or not failed participants have decided. In the next section, we propose an extension of the SPAC protocol which guarantees the non-blocking property in all possible crash failure cases.

4. Achieving non-blocking

4.1 System model
In the previous section, we made no assumptions whatsoever concerning message transfer delays or communication failures, since the SPAC protocol can be used in systems that are subject to site failures, communication failures, as well as unknown message transport delays. In order to achieve non-blocking, we follow the model and terminology used in [BT93] and assume, for the rest of the paper, a synchronous system composed of a set of DBMSs completely connected through a set of communication channels4. Since it is well known that distributed systems with unreliable communication do not admit non-blocking solutions to the atomic commitment problem [Gra78], we also assume reliable communication between the sites. A synchronous system means that there is a known upper bound on communication delays, that is, each message is received within δ time units after being sent. In such a model, site failures can be reliably detected and reported to any operational site. Reliable failure detectors can be easily implemented by means of time-outs. At any given time, a site may be operational or down. Operational sites may go down due to crash failures. A site that is down may become operational again by executing a recovery protocol. Finally, we say that a site is correct if it has never crashed; otherwise, it is faulty5.

4.2 Reliable broadcast As we explained in section 3.4, the protocol SPAC has failure scenarios for which no termination protocol can lead to a decision. This is due to the unreliability of the broadcast primitive used in the SPAC protocol and which allows faulty GTMAgents to deliver a message that is never delivered by correct ones. To

4 An adaptation of our non-blocking protocol to asynchronous systems augmented with unreliable failure detectors is currently being studied.
5 The period of interest of these definitions is the duration of the atomic commitment protocol execution.
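A timeout-based failure detector of the kind assumed in this model can be sketched as follows; the heartbeat scheme and all names (FailureDetector, PERIOD) are our own illustration, not part of the paper.

    import time

    # Toy timeout-based failure detector for a synchronous system (illustrative).
    # Every site is assumed to send a heartbeat at least every PERIOD seconds,
    # and messages arrive within DELTA seconds (the paper's δ), so a silence
    # longer than PERIOD + DELTA can reliably be interpreted as a crash.
    PERIOD = 1.0   # assumed heartbeat period (seconds)
    DELTA = 0.5    # assumed upper bound on message delay (seconds)

    class FailureDetector:
        def __init__(self, sites):
            now = time.monotonic()
            self.last_heard = {s: now for s in sites}

        def heartbeat(self, site):
            self.last_heard[site] = time.monotonic()

        def suspected(self):
            """Sites silent for longer than PERIOD + DELTA are reported as down."""
            now = time.monotonic()
            return [s for s, t in self.last_heard.items()
                    if now - t > PERIOD + DELTA]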


solve this problem, we need a broadcast primitive, called Reliable Broadcast [BT93], which guarantees the following property : Uniform Agreement : if any GTMAgent, correct or not, delivers a message m, then all correct GTMAgents will eventually deliver m. R-broadcast (m) { A

// the GTM executes : send(m) to all participants in A // each GTMAgent ∈ A executes : upon (first receipt of message m) { send(m) to all GTMAgents in A deliver m } } Figure 5 : A Reliable Broadcast primitive Figure 5 shows the R_broadcast primitive, a possible implementation of a reliable broadcast. Every participant (GTMAgent) relays the message it receives for the first time to all the others before delivering it. So, if a GTMAgent delivers a message m, then it has achieved relaying it. This guarantees that all correct GTMAgents will eventually deliver m. Thus, it is clear that this primitive satisfies the Uniform Agreement property even if the initial broadcaster (or the relayer) crashes. This primitive also satisfies the following property : ∆-Timeliness : There exists a known constant ∆ such that if the broadcast of a message m is initiated at time t, then no participant receives m after time t+∆ To prove this, [BT93] showed that there exists a constant delay ∆= (F+1)δ, where F denotes the maximum number of sites that may crash during the execution of the atomic commitment protocol, by which the delivery of m must occur.

4.3 The Non-Blocking Single-Phase ACP algorithm
The blocking SPAC can be made non-blocking by replacing the unreliable broadcast primitive used by the GTM to disseminate its decision message with the reliable primitive described in Section 4.2. Figure 6 illustrates the algorithm implementing the Non-Blocking Single-Phase Atomic Commit protocol, denoted NB-SPAC. The GTM algorithm for NB-SPAC is omitted since the only modification with respect to that


of SPAC is in line 11: the GTM broadcasts the commit decision message using the reliable broadcast primitive and thus executes R-broadcastA(commiti) instead of broadcastA(commiti).

GTMAgentk algorithm
12- do forever {
13-     wait for (receipt of a message m from the GTM
                  or delivery of a decision message m by the broadcast
                  or receipt of a GTM crash notification)
14-     if (receipt or delivery of m) then        // same treatment as in SPAC
15-         case (m) of:
16-         beginTrans(Ti): sendlocalDBMS(beginTrans()) ;
17-                         register the association <Ti, Tp> ;
18-         opni():         sendlocalDBMS(opnp) ;
19-         aborti:         sendlocalDBMS(abortp) ;
20-         commiti:        sendlocalDBMS(writep(commiti)) ;
21-                         sendlocalDBMS(commitp) ;
22-     else {                                    // GTM crash notification
23-         set time-out to ∆ ;
24-         wait until (delivery of a decision by the broadcast or time-out expiration) ;
25-         if (delivery of a decision) then
26-             sendlocalDBMS(decisionp) ;
27-         else sendlocalDBMS(abortp) ;          // on time-out expiration
        }
    }

Figure 6: NB-SPAC algorithm.

In this algorithm, when a GTMAgent detects a GTM crash, a time-out is set. The value of this time-out is the constant delay ∆ of the reliable broadcast, by which the delivery of a decision message must occur. If the delivery of a commit decision does not occur by the specified time, a participant can safely deduce that no other participant, correct or not, will deliver a commit decision (i.e., it can safely decide abort). This is due to the ∆-timeliness property of the reliable broadcast. An atomic commitment protocol is said to be non-blocking if it satisfies the following property in addition to property AC:

NBAC: every correct participant involved in the atomic commitment protocol eventually decides.

In the following, we prove that NB-SPAC satisfies properties AC and NBAC.

Theorem: NB-SPAC satisfies property AC of an atomic commitment protocol.
Proof: The proof is by contradiction. Suppose that participant p decides Commit and participant q decides Abort. Participant q can decide Abort only in line 19, 26 or 27. Participant p can decide Commit only in line 20, following the receipt of a commiti message from the GTM, or in line 26 by


means of another participant that relayed the message. In both cases, the GTM must have reliably broadcast the Commit decision at line 11, say at real time tR-broadcast. Since we have a single GTM per transaction and since the GTM broadcasts only one decision per transaction, participant q cannot have decided at line 19. Furthermore, due to the Uniform Agreement property of the reliable broadcast used to disseminate the commit decision received by p, q cannot have decided abort at line 26. This means q must have decided at line 27, following the time-out expiration. In this case, q must have reliably detected the GTM crash. Assuming that this detection occurs at real time tCrashDetection, the time-out expiration occurs at real time tCrashDetection + ∆. Since participant p has received a commiti message, the GTM had already executed the reliable broadcast of the commit decision before it crashed. Thus, tR-broadcast < tCrashDetection. Since p delivered a commit decision, by the Uniform Agreement property of the reliable broadcast, participant q eventually delivers the commit decision as well. By the ∆-timeliness property of the reliable broadcast, q does so by real time tR-broadcast + ∆. Since tR-broadcast < tCrashDetection, q must have received the commit decision before timing out. This contradicts the fact that q has executed line 27, and thus that p and q have decided differently.

Theorem: NB-SPAC satisfies property NBAC of an atomic commitment protocol.
Proof: In SPAC, a correct participant could be prevented from deciding on a transaction by being blocked indefinitely in the wait statement of line 13 following a GTM crash. In the NB-SPAC protocol, we substituted this wait statement with another one (line 13 of Figure 6) in which a participant blocks until the receipt of a message from the GTM or from the underlying reliable failure detector. In case of a GTM crash, the participant no longer remains blocked indefinitely at line 13, since the GTM crash will eventually be detected, in which case the participant executes the associated wait statement at line 24 after having set its timer. If the decision message being waited for is not received by the specified time, the time-out expires and the participant decides Abort. This proves that every correct participant eventually decides.

5. Recovering failures

5.1 Participant's recovery
Figure 7 details the recovery algorithm executed after a participant crash. For the sake of clarity, step numbers correspond here to the steps' ordering. Step 1 and Step 2 represent the standard recovery procedure executed by the local DBMS on the crashed site. To preserve site autonomy, we make no assumptions on the way these steps are handled.


In any case, we assume that each local DBMS integrates a Local Transaction Manager that guarantees atomicity and durability of all transactions submitted to it. Since the GTM agents do not maintain a private log, Step 3 is necessary to determine whether some globally committed transactions have to be locally re-executed on the crashed site Sk. GTMAgentk must contact every GTM that may have run transactions on site Sk. To simplify, we assume that each active instance of GTM is recorded by each GTM agent by means of records stored in the local DBMS. This can be done at the first connection of a GTM to the GTMAgent in order to avoid, during transaction processing, the blocking I/O needed to record the GTM associated with the transaction. Other solutions are possible to record active GTMs6. In any case, they do not impact the rest of the recovery procedure. In Step 4, the GTM aborts all still-active transactions in which the crashed site Sk participates. Step 5 checks whether there exists some globally committed transaction Ti for which GTMAgentk did not acknowledge the commit. This may happen in two situations. Either the participant crashed before the commit of Tik was achieved and Tik has been backward recovered during Step 2, or Tik is locally committed but the crash occurred before the acknowledgment was sent to the GTM. Note that these two situations must be carefully differentiated. Re-executing Tik in the latter case may lead to inconsistencies if Tik contains non-idempotent operations (i.e., op(op(x)) ≠ op(x)). Thus, the GTM must ask GTMAgentk about the status of Tik (Step 6). During Step 7, if GTMAgentk finds a <Ti, commit> record in the underlying database, this proves that Tik (more precisely Tp, the local Ti proxy) has been successfully committed by the local DBMS, since the write of <Ti, commit> is an operation performed on behalf of Tik. If, on the other hand, <Ti, commit> is not present in the database, this means that Tik has been backward recovered during Step 2 and must be entirely re-executed. This re-execution is performed by exploiting the GTM log (Step 8). In both cases, GTMAgentk informs the GTM about the status of Tik. Once the recovery procedure is completed, new distributed transactions are accepted by the GTM (Step 9) and by the GTM agent (Step 10).

6 In practical situations, a local DBMS is often under the control of a single GTM (e.g., a TP-monitor) recorded statically. In more general situations, active GTMs may be registered by external traders in distributed platforms.


Local DBMS algorithm
1-  forward recover the transactions that reached their commit state before the crash
2-  backward recover all other transactions

GTM agent algorithm
3-  once the local DBMS has restored a consistent state, contact the GTM(s)
7-  if the GTM asks about the local status of a transaction branch during Step 6, check for the corresponding <Ti, commit> decision record and answer the GTM
10- accept new transactions

GTM algorithm
If contacted by the GTM agent of site Sk during Step 3, do:
    for each transaction Ti in which Sk participates:
4-      if (commiti) ∉ GTM_log then            // Ti is still active
            broadcast (aborti) to all other Ti participants and forget Ti
5-      if (commiti) ∈ GTM_log then            // Ti had been committed before the crash
            if ackk(commiti) ∉ GTM_log then
6-              ask GTMAgentk for the exact state of Tik
8-              if Tik has not been locally committed then
                    restart a new transaction T'ik on Sk
                    re-execute all Tik operations on behalf of T'ik
9-  accept new distributed transactions

Figure 7: Participant recovery algorithm.

This recovery algorithm guarantees that re-executing the concerned transaction branches in Step 8 (i) will not produce undesirable conflicts, and (ii) will generate exactly the same database state as after their first execution. Let us call ϕ the set of all transaction branches that have to be re-executed on Sk during Step 8: ϕ = {Tik / commiti ∈ GTM_log ∧ ackk(commiti) ∉ GTM_log ∧ <Ti, commit> ∉ Sk's database}.

Proof of condition (i): Since local DBMSs are rigorous, ∀Tik, Tjk ∈ ϕ, ¬∃(Tjk → Tik). Otherwise, Tik would have been blocked during its first execution, waiting for the termination of Tjk, and would not have completed all its operations, which contradicts Tik ∈ ϕ. This means that Tik and Tjk do not compete for the same resources, thereby precluding the possibility of a conflict occurring.

Proof of condition (ii): Step 2 of the recovery algorithm guarantees that all resources accessed by any Tik ∈ ϕ are restored to their initial state (i.e., the state before Tik's execution). The re-execution of all other transaction branches belonging to ϕ cannot change this state, due to condition (i). Since Step 8 precedes Step 9 and Step 10, new transactions that may modify Tik's resources are executed only


after Tik's re-execution. Consequently, Tik is guaranteed to re-access the same state of its resources and will therefore generate the same database state as after its first execution.
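For concreteness, the GTM-side branch of Figure 7 (Steps 4 to 8) might be coded along these lines; gtm_log, broadcast_abort, ask_status and re_execute are assumed helpers, not the paper's API.

    # Hypothetical GTM-side handling of a recovered participant Sk (Steps 4-8).
    # gtm_log is assumed to be a set of tuples recording decisions and acks.
    def on_participant_recovery(site_k, transactions, gtm_log, broadcast_abort,
                                ask_status, re_execute):
        for ti in transactions:                     # every Ti in which Sk participates
            if ("commit", ti) not in gtm_log:       # Step 4: Ti still active -> abort it
                broadcast_abort(ti)
            elif ("ack_commit", site_k, ti) not in gtm_log:
                # Steps 5-6: Ti committed globally, but Sk never acknowledged;
                # ask the agent whether the <Ti, commit> record survived locally
                if not ask_status(site_k, ti):      # Step 8: branch was rolled back
                    re_execute(site_k, ti)          # replay Tik from the GTM log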

5.2 Coordinator's recovery
The recovery algorithm executed by the GTM is presented in Figure 8. Nothing has to be done for transactions that were still active at the time the GTM crashed. Indeed, the GTM force writes the log records of a transaction only if the commit decision is taken. Thus, still-active transactions do not appear in the GTM log after a crash. If a participant asks about the status of one of these transactions, the GTM answers abort, according to the presumed-abort assumption.

For each transaction Ti such that (commiti) ∈ GTM_log and (endi) ∉ GTM_log, do {
    if (∃k / ackk(commiti) ∈ GTM_log)
1-      GTM_Decision = commit
    else {
2-      ask at least one GTMAgentk for the outcome of Tik
        GTM_Decision = outcomeik
    }
    if (GTM_Decision = commit)
3-      for each Sk participating in Ti such that ackk(commiti) ∉ GTM_log, do sendk(commiti)
    if (∀k ackk(commiti) ∈ GTM_log), write (endi) in GTM_log and forget Ti
}
4-  Accept new distributed transactions

Figure 8: GTM recovery algorithm.

For transactions that had been committed before the GTM crash, the recovery algorithm slightly differs depending on whether SPAC or NB-SPAC is used. In the case of SPAC, participants are blocked if the GTM crashes. Thus, participants never take a unilateral decision and they all wait for the GTM's decision. Consequently, only Step 3 and Step 4 of the GTM recovery algorithm are mandatory. In the case of NB-SPAC, the recovery action depends on the number of commit acknowledgments the GTM received before crashing. If at least one acknowledgment has been received (Step 1), this means that at least one correct participant has received the commit decision, and thus all correct participants have decided commit, due to the Uniform Agreement property of the R-broadcast. In this case, the GTM takes a definitive commit decision and executes Step 3. If, however, no acknowledgment has been received, the correct participants must have decided abort in case no correct participant received the commit decision.


Otherwise, they must have decided commit (see Section 4.3). The GTM must therefore contact at least one correct participant to learn the outcome of that transaction, and it conforms to this decision (Step 2). After recovery, the GTM accepts new distributed transactions (Step 4). Since the recovery procedures may lead to a decision, we need to show that this decision is not in conflict with any decision taken by correct participants or by the GTM.

Theorem: The recovery algorithms guarantee property AC, stating that all participants (faulty or not) that decide reach the same decision.
Proof: The proof is by contradiction. Suppose that a participant q crashes. When q executes its recovery algorithm, it may decide according to the GTM decision either (i) abort due to Step 4 (i.e., (commiti) ∉ GTM_log), or (ii) commit due to Step 8 of the algorithm (i.e., commiti ∈ GTM_log). In case (i), assume a correct participant p decided commit. This means that the GTM broadcast the commit decision in the (NB-)SPAC algorithm, in which case this decision had been recorded in the GTM log (line 10) before the broadcast occurred (line 11). This contradicts the fact that (commiti) ∉ GTM_log. In case (ii), let us assume that a correct participant p decided abort. This means that either (a) the GTM broadcast an abort decision in (NB-)SPAC, or (b) p's time-out expired in NB-SPAC. Case (a) contradicts the fact that commiti ∈ GTM_log. Case (b) means that no correct participant received a commit decision. Since commiti ∈ GTM_log, this implies that the GTM crashed before sending the commit decision to any correct participant. In its recovery algorithm, the GTM conforms to the decision of the correct participants, that is, abort. Consequently, commiti ∉ GTM_log, which contradicts the assumption of case (ii). Consequently, property AC is always satisfied.
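The coordinator-side rule of Figure 8 reduces to a small decision function; this sketch, with invented helper names, assumes NB-SPAC (reliable broadcast) and that participants is a set of site identifiers.

    # Hypothetical NB-SPAC coordinator recovery for one committed transaction Ti.
    # If any ack reached the GTM before the crash, Uniform Agreement implies all
    # correct participants decided commit; otherwise a correct participant is asked.
    def recover_committed(ti, participants, gtm_log, ask_outcome, send_commit):
        acked = {k for k in participants if ("ack_commit", k, ti) in gtm_log}
        if acked:                                   # Step 1: definitive commit
            decision = "commit"
        else:                                       # Step 2: conform to the survivors
            decision = ask_outcome(ti)              # outcome at a correct participant
        if decision == "commit":                    # Step 3: resend to unacked sites
            for k in participants - acked:
                send_commit(k, ti)
        return decision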

6. Performance Analysis
In this section, we study and compare the performance of SPAC and NB-SPAC with 2PC and ACP-UTRB respectively. The reason for choosing these two protocols for our comparison is the following: 2PC is the most widely known blocking ACP, adopted by most commercial DBMSs and transactional standards (so we compare it to our blocking SPAC protocol), while ACP-UTRB is the most efficient non-blocking protocol proposed for synchronous systems (so we compare it to the non-blocking version of SPAC, namely NB-SPAC). Roughly speaking, ACP-UTRB shares the basic structure of 2PC and uses a reliable broadcast primitive to send the coordinator's decision to all participants, making the protocol non-blocking [BT93].


Let n denote the number of participants in the transaction. The other parameters that enter into the performance analysis are those introduced in Sections 4.1 and 4.2, namely δ, F and ∆. Since SPAC and NB-SPAC are single-phase protocols, the time delay (denoted T) and the message complexity (denoted M) are those needed for the broadcast of the decision message. Thus, the performance of each of these protocols depends on the particular broadcast primitive used. In SPAC, the coordinator broadcasts the decision using the unreliable broadcast primitive presented in Figure 4, so the time delay T = δ and the message complexity M = n. For 2PC, we need to add the complexity of the vote request and vote collection message rounds. This leads to the results presented in Figure 9, which shows clearly that SPAC is more efficient than 2PC in both time delay and message complexity.

Protocol | Number of steps | Time delay | Message complexity
SPAC     | 1               | δ          | n
2PC      | 2               | 3δ         | 3n

Figure 9: Steps, time delay and message complexity for 2PC and SPAC.

The coordinator of the NB-SPAC protocol uses the reliable broadcast primitive R-broadcast presented in Figure 5 to broadcast its Commit decision message to all the participants in the transaction. In this primitive, each participant relays the message to all other participants, so the message complexity M = n + n(n-1) = n². As mentioned in Section 4.2, the timeliness of R-broadcast is ∆ = (F+1)δ, so the time delay T = ∆ = (F+1)δ. Figure 10 shows the performance of the NB-SPAC and ACP-UTRB protocols. Similarly to 2PC, ACP-UTRB is a two-phase protocol, so we need to add the complexity of the vote request and vote collection message rounds. Again, Figure 10 shows that our non-blocking protocol is more efficient than ACP-UTRB in both time delay and message complexity.

Protocol | Number of steps | Time delay | Message complexity
NB-SPAC  | 1               | ∆          | n²
ACP-UTRB | 2               | 2δ + ∆     | 2n + n²

Figure 10: Steps, time delay and message complexity for NB-SPAC and ACP-UTRB.
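As a sanity check, the formulas of Figures 9 and 10 can be evaluated for a small configuration; the snippet below merely restates the tables for illustrative parameter values, it contains no new analysis.

    # Evaluate the time-delay and message-complexity formulas of Figures 9-10
    # for an illustrative configuration: n participants, at most F crashes,
    # message bound delta (seconds).
    n, F, delta = 5, 1, 0.05
    Delta = (F + 1) * delta            # timeliness of the reliable broadcast

    protocols = {
        "SPAC":     (delta,             n),
        "2PC":      (3 * delta,         3 * n),
        "NB-SPAC":  (Delta,             n + n * (n - 1)),   # = n**2
        "ACP-UTRB": (2 * delta + Delta, 2 * n + n ** 2),
    }
    for name, (T, M) in protocols.items():
        print(f"{name:9s} T = {T:.2f}s  M = {M} messages")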


Note that NB-SPAC is a presumed-abort protocol. Thus, the coordinator uses the unreliable broadcast primitive to broadcast an Abort decision. This gives NB-SPAC the same performance as the blocking version (SPAC) in the case of an Abort decision. As we saw previously, the performance of the NB-SPAC protocol depends on the performance of the reliable broadcast primitive. The implementation of the broadcast primitive given in Figure 5 is simple but requires a quadratic number of messages. [BT93] proposed different optimizations of this primitive that reduce the number of messages from quadratic to linear. These optimizations make ACP-UTRB comparable to 2PC in terms of time delay and message complexity when no crash occurs during the validation protocol. They can equally be applied to the NB-SPAC protocol, making it even more efficient than 2PC. Concerning the I/O cost, note that the SPAC and NB-SPAC protocols generate a single blocking I/O at the coordinator to log the coordinator's decision along with the operations executed by the committing transaction. Thus, there is no extra I/O compared with 2PC. However, the size of this I/O increases with the transaction's update activity (although it remains quite small for OLTP transactions).

7. Conclusion
The massive development of applications on top of distributed database systems makes the definition of efficient atomic commitment protocols (ACPs) a very challenging area. The efficiency of ACPs is even more crucial when a high transaction throughput has to be supported (e.g., OLTP transactions). Several ACPs have been proposed in the literature. Blocking ACPs immobilize system resources, a problem exacerbated in a multidatabase system. Non-blocking ACPs generally incur a higher latency and a substantial communication overhead even during normal execution. This makes them ill-adapted to today's systems with long mean times between failures. In this paper, we proposed an atomic commitment protocol, named NB-SPAC, which provides the non-blocking property while having the lowest latency possible for an ACP (one phase to commit). The underlying assumption in NB-SPAC is that all participants are ruled by a rigorous concurrency control


protocol. We show in the Appendix that this assumption can be relaxed to accept participants supporting SQL2 levels of isolation, making NB-SPAC practical for most commercial DBMSs. Finally, it is worth noting that NB-SPAC preserves site autonomy, making it compatible with legacy DBMSs. Throughout the paper, we abstracted the interface offered by the underlying DBMSs. In fact, NB-SPAC can be adapted to DBMSs offering the 2PC interface by integrating an interface mapping into the GTM agents. This makes NB-SPAC compatible with the X/Open XA and the OMG OTS standard interfaces. We did not detail this aspect for the sake of conciseness. While SPAC requires no assumption on the underlying network, we considered a synchronous communication system in the implementation of NB-SPAC. Chandra and Toueg proposed a protocol that solves the consensus problem (an abstract form of agreement problems) in an asynchronous system augmented with an unreliable failure detector [CT91]. The possibility of exploiting this protocol in NB-SPAC to adapt it to asynchronous systems is currently being studied.

Acknowledgments
The authors wish to thank Rachid Guerraoui from Ecole Polytechnique Fédérale de Lausanne (EPFL) for fruitful discussions on this work.

References

[AC95] Y. Al-Houmaily and P. Chrysanthis. Two-Phase Commit in Gigabit-Networked Distributed Databases. In Proceedings of the 8th International Conference on Parallel and Distributed Computing Systems, September 1995.

[AP97] M. Abdallah and P. Pucheral. A Non-Blocking Single-Phase Commit Protocol for Rigorous Participants. In Proceedings of the National Conference Bases de Données Avancées (France), September 1997.

[BGS92] Y. Breitbart, H. Garcia-Molina, and A. Silberschatz. Overview of Multidatabase Transaction Management. VLDB Journal, October 1992.

[BHG87] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.

[BST90] Y. Breitbart, A. Silberschatz, and G. Thompson. Reliable Transaction Management in a Multidatabase System. In Proceedings of the ACM-SIGMOD International Conference on Management of Data, 1990.

[BT93] O. Babaoglu and S. Toueg. Non-Blocking Atomic Commitment. In Sape Mullender, editor, Distributed Systems, ACM Press, 1993.

[CT91] T. Chandra and S. Toueg. Unreliable Failure Detectors for Reliable Distributed Systems. Technical Report 93-1374, Department of Computer Science, Cornell University, 1993. Also appeared in the Proceedings of the 10th ACM Symposium on Principles of Distributed Computing, August 1991.

[DTP91] CAE Specification, Distributed Transaction Processing: the XA Specification, XO/CAE/91/300, 1991.

[Gra78] J. Gray. Notes on Database Operating Systems. In Operating Systems: An Advanced Course, R. Bayer, R. M. Graham and G. Seegmuller, editors, LNCS 60, Springer Verlag, 1978.

[GR93] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.

[MRKS92] S. Mehrotra, R. Rastogi, H. Korth, and A. Silberschatz. A Transaction Model for Multidatabase Systems. In Proceedings of the International Conference on Distributed Computing Systems, 1992.

[OTS94] Object Transaction Service, OMG Document 94.8.4, OMG editor, August 1994.

[SC90] J. Stamos and F. Cristian. A Low-Cost Atomic Commit Protocol. In Proceedings of the Ninth Symposium on Reliable Distributed Systems, October 1990.

[Ske82] D. Skeen. Non-Blocking Commit Protocols. In Proceedings of the ACM-SIGMOD International Conference on Management of Data, 1982.

[SL90] A. Sheth and J. Larson. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, 1990.

[SQL92] International Standardization Organization, Information Processing Systems - Database Language SQL, ISO/IEC 9075, 1992.

[WV90] A. Wolski and J. Veijalainen. 2PC Agent Method: Achieving Serializability in Presence of Failures in a Heterogeneous Multidatabase. In Proceedings of the International Conference on Databases, Parallel Architectures and their Applications, 1990.

Appendix: Relaxing the rigorous hypothesis
The SPAC protocol, and consequently NB-SPAC, relies on properties P1 and P2. We demonstrated in Section 3.2 that Rigorous DBMSs satisfy both properties, and thus are first-class candidates to exploit the SPAC protocol. Although most commercial DBMSs exploit locking for concurrency control, the introduction of levels of isolation in the SQL2 standard [SQL92] makes them no longer rigorous. We recall below the SQL2 levels of isolation.
• Serializable: Transactions running at this level are fully isolated and are protected against phantoms.
• Repeatable Read: Transactions running at this level are guaranteed to perform only repeatable reads but are no longer protected against phantoms.
• Read Committed: Transactions running at this level read only committed data, but repeatable reads are no longer guaranteed. In a lock-based protocol, this means that read locks are released before transaction end. A transaction Ti willing to update the database must use SQL cursors with the for update option to enforce Cursor Stability [GR93]. This guarantees that a concurrent transaction Tj


will not modify an object O and commit it between the read and the write operations performed by Ti on that same object O.
• Read Uncommitted: Transactions running at this level may perform dirty reads. For this reason, they are not allowed to update the database.

Isolation levels are widely exploited because they allow faster executions, increase transaction parallelism and reduce the risk of deadlocks. Although non-serializable schedules may be produced, they are considered semantically correct. DBMSs supporting isolation levels satisfy properties P3 and P4 instead of P1 and P2.

P3: [∀n receivek(ack(opni))] by the GTM => Tik cannot be rejected on Sk due to a serialization problem, where opni denotes the nth operation of the transaction branch Tik.

P4: Local schedules are cascadeless.

Indeed, either no operation of Tik generates a conflict and Tik is serializable, or the generated conflicts are ignored by the local serialization policy. In both cases, Tik will not be rejected due to a serialization problem. Properties P3 and P4 guarantee that assertion A1 (see Section 3.2) remains true, which means that a single-phase commit protocol is sufficient to guarantee the atomicity of distributed transactions in a failure-free environment. The problem is to prove that P3 and P4 are strong enough to guarantee the correctness of the recovery algorithms. Let us study the schedules that each isolation level may produce.

Serializable: If all transactions submitted to the GTM run at this level of isolation, the schedules produced remain rigorous. Thus, the SPAC recovery algorithms remain valid.

Repeatable Read: At this level of isolation, the local DBMS ignores some Read/Write conflicts. Let us denote by Selectik a selection operation performed by Tik that retrieves all objects satisfying a predicate Q, and by Insertjk the creation by Tjk of a new object satisfying Q. Although a phantom is produced, the following local schedule is accepted with no blocking: Selectik < Insertjk < Commitjk < Selectik < Commitik. Assume Sk crashes and Tik must be re-executed by the recovery algorithm. Since Tik is re-executed after the commit of Tjk, the schedule produced is equivalent to Insertjk < Commitjk < Selectik < Selectik < Commitik and differs from the initial one.

Read Committed: At this level of isolation, the local DBMS ignores Read/Write conflicts that occur on a same object O. Thus, the following local schedule is accepted with no blocking: Read(O)ik
[receivedbms(ack(commitik))] < [senddbms(commitjk)], ∀Tj such that Ti and Tj have overlapping executions. Thus, the GTM agent submits the commit of Tj to its underlying DBMS only when it receives the acknowledgment of the Ti commit. Tj's commit is slightly deferred, but Ti and Tj are allowed to execute in parallel, thereby preserving the benefit of low levels of isolation. In fact, the commit ordering corresponds to the order of conflicts of the form (Ti RW> Tj) that are ignored by the local DBMS. Thus, Ti is guaranteed to commit before Tj, so that a potential re-execution of Ti will never see the effects of Tj.
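The deferred-commit rule above can be sketched as a tiny agent-side queue; this is our own illustration of the rule, with invented names, not the paper's code.

    # Illustrative agent-side commit ordering (invented names): the commit of
    # Tj is submitted to the local DBMS only after every overlapping Ti that
    # must precede it has had its own commit acknowledged by the DBMS.
    class CommitSequencer:
        def __init__(self, dbms_commit):
            self.dbms_commit = dbms_commit   # dbms_commit(tid): submit local commit
            self.pending = {}                # tid -> set of tids it must wait for
            self.unacked = set()             # commits submitted but not yet acked

        def request_commit(self, tid, must_follow):
            """must_follow: overlapping transactions that have to commit first."""
            waits = (set(must_follow) & self.unacked) | \
                    (set(must_follow) & set(self.pending))
            if waits:
                self.pending[tid] = waits    # defer Tj's commit
            else:
                self.unacked.add(tid)
                self.dbms_commit(tid)

        def on_commit_ack(self, tid):
            self.unacked.discard(tid)
            for t in list(self.pending):
                self.pending[t].discard(tid)
                if not self.pending[t]:      # all predecessors acknowledged
                    del self.pending[t]
                    self.unacked.add(t)
                    self.dbms_commit(t)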
