The Superdatabase Architecture: Cooperative Heterogeneous Transactions

Calton Pu
Department of Computer Science
Columbia University
New York, NY 10027
[email protected]

(212) 854-8110

Abstract

We propose the superdatabase (SDB) architecture to support atomic transactions across cooperative heterogeneous databases. The SDB is based on hierarchical composition of element databases (EDBs), enhanced by optimization and distribution. To support heterogeneous crash recovery, an SDB translates different commit agreement protocols without extra messages. To support heterogeneous concurrency control, an SDB groups different kinds of algorithms, such as two-phase locking, timestamps, and optimistic validation methods, to guarantee global serializability with little concurrency loss and overhead. By integrating heterogeneous commit protocols, local concurrency control, and local recovery, the SDB incurs very little run-time overhead and requires few messages. The Harmony prototype implementation of SDB, composed of the Supernova SDB and four different EDBs, and the availability of modern "open system" TP monitors confirm the practicality and usefulness of SDB.

Keywords: cooperative heterogeneous transaction processing, heterogeneous concurrency control, heterogeneous recovery, multidatabase serializability, open systems, database composability.

Contents

1 Introduction
2 Related Work
  2.1 Standardization
  2.2 Other Works in HDB
  2.3 Cooperative Heterogeneity
  2.4 TP in Multidatabases
3 Hierarchical Composition
  3.1 Model and Terminology
  3.2 Tree-Structured Superdatabase
  3.3 Hierarchical Recovery
    3.3.1 Heterogeneous Hierarchical Commit
    3.3.2 Superdatabase Recovery
    3.3.3 Summary of Hierarchical Recovery
  3.4 Hierarchical Concurrency Control
    3.4.1 Local Concurrency Control
    3.4.2 Hierarchical Certification with O-vectors
    3.4.3 Run-Time Cost
4 Optimization and Distribution
  4.1 Hierarchy Flattening
  4.2 Concurrency Control Grouping
    4.2.1 Two-Phase Locking
    4.2.2 Optimistic Validation
    4.2.3 Timestamp-Based Algorithms
    4.2.4 Summary of Concurrency Control Grouping
  4.3 Symmetric Distribution
5 Implementation
  5.1 Historical Perspective
  5.2 Supernova
  5.3 EDBs
  5.4 Evaluation
6 Conclusion
7 Acknowledgment

1 Introduction

Political and technical advantages push for the integration of heterogeneous databases (HDB). Politically, mergers and acquisitions bring different corporate databases together at the company level; within a company, departments and divisions may want to maintain their own databases. Technically, the integration of HDB has several advantages. For example, in an HDB we gain access to data previously hidden behind different database managers. Also, existing applications may be extended to a wider range of databases. Finally, specific databases tailored for their data may obtain much higher efficiency for specific access patterns.

The work on HDB started in the seventies, focusing mostly on data model translation, query language translation, and schema integration. In contrast, transaction processing (TP) in HDBs (in short, heterogeneous transaction processing, or HeTP) remained a challenge until the mid-eighties. This paper describes the superdatabase (SDB) proposal [25] for HeTP, summarizes our implementation experience [24], and evaluates the SDB in the context of HeTP evolution since SDB was introduced in 1987.

The first contribution of the SDB architecture is solving a difficult algorithmic heterogeneity problem. In contrast to data format heterogeneity (e.g., from EBCDIC to ASCII byte representation) and protocol heterogeneity (e.g., message translation in heterogeneous RPC [4]), where static translation suffices, algorithmic heterogeneity requires run-time dynamic integration of different programs implementing algorithms with varied data structures. The main HDB algorithmic heterogeneity problem is the integration of different concurrency control methods. Our approach to this problem is to capture a common abstract property of the concurrency control algorithms, namely local serializability. Section 3 explains the idea used to solve this problem.

The second contribution of the SDB architecture is to show that, under appropriate assumptions of cooperation (the same as homogeneous distributed TP), HeTP is not much more difficult than homogeneous TP. We describe efficient solutions for heterogeneous concurrency control and heterogeneous crash recovery, efficient in terms of local overhead, network messages, and the amount of concurrency loss in the system. To the best of our knowledge, no current HDB or multidatabase proposal shows better HeTP efficiency and concurrency than SDB. Section 4 summarizes the optimizations.

The third contribution of the SDB architecture is its simplicity and practicality, demonstrated by the implementation of the Harmony prototype at Columbia. Section 5 outlines the current implementation status. The feasibility of SDB is further emphasized by the recent availability of "open system" TP monitors [3], such as TopEnd from NCR and Tuxedo/T from USL, capable of integrating different database backends. These TP monitors can be seen as a special case of SDB. Finally, Section 6 concludes the paper with an evaluation of SDB.

2 Related Work

2.1 Standardization

First, we should clarify the tension between efforts to establish standards and research on heterogeneous systems. An extreme view considers the establishment of standards as the only worthwhile effort, since as soon as we agreed on a standard, we would not have any heterogeneity. This position is too narrow for several reasons. First, political decentralization of power and frequent mergers and acquisitions tend to introduce heterogeneity into large corporations. Second, widespread heterogeneity arises naturally from technical advantages; for example, niche products are better adapted to specific applications than generic ones. As long as there is diversified need, heterogeneity will thrive and evolve. Third, standards must rely on proven and mature technology, yet new technology and research results have to be tested and gain acceptance in a heterogeneous environment.

While we observe the limitations of standards, we also recognize their importance. For example, commercial products such as TopEnd and Tuxedo/T support heterogeneous backends by following the X/Open standard interface between the TP monitors and the backend resource managers [3]. Standards simplify the system integration work. But as we explore advanced technology we inevitably find heterogeneity. Therefore, we need to push the research on heterogeneity and interoperability further. Indeed, two reasons make the SDB work complementary to standards. First, the SDB can handle concurrency control and crash recovery that conform to different standards, or to no standard at all; this is especially important for the integration of new technologies (e.g., long transactions) and their applications (e.g., CAD/CAM/CASE databases). Second, during the transition period from one technology to another, both the old and the new need to run side by side; this requires a degree of integration beyond adherence to a standard.

2.2 Other Works in HDB

R* [22] and Ingres/Star [23] have demonstrated physical distribution of fully functional homogeneous databases. Early research on integrated heterogeneous databases has been largely limited to query-only systems such as Multibase and Mermaid. Multibase [20] is a retrieve-only system developed at Computer Corporation of America. Through the DAPLEX functional language, Multibase provides uniform access to a CODASYL database and a hierarchical database. The focus of Multibase is on query optimization and reconciliation of data, not consistent update across databases. Mermaid [31] was developed at System Development Corporation. Unlike Multibase, Mermaid supports the relational view of data directly, through the ARIEL query language. Although language and model translation as well as schema integration are important problems, we refer the reader to a recent survey by Sheth and Larson [28]. In addition, a recent issue of the Data Engineering Bulletin on Database Connectivity [26] is devoted to the progress made in the area. In this paper we concentrate on HeTP.

2.3 Cooperative Heterogeneity

We say that a distributed TP system is cooperative if it exchanges information willingly among its components, for distributed concurrency control, for crash recovery (commit protocols in particular), or for both. This is the amount of cooperation expected in traditional distributed database work such as R* [22]. The SDB architecture is cooperative. In this context, Gligor and Luckenbaugh [14] have discussed the recovery problem in HDBs, without describing specific algorithms. We discuss concrete recovery algorithms in Section 3.3.1. Gligor and Popescu-Zeletin [15] studied concurrency control in HDBs. They specified five conditions which should be satisfied by concurrency control mechanisms for HDBs. The SDB shares conditions 1 and 3, which say that each local element database (EDB) maintains local serializability and runs only one subtransaction per global transaction. The SDB does not use conditions 2 and 4, which impose a pessimistic concurrency control and explicit object identification from all sites. Finally, their fifth condition refers to global deadlock detection, a problem for further research.

A recent example of cooperative HeTP is the Pre-Specified Order approach by Elmagarmid and Du [10]. Their idea is to give the concurrency control mechanism a pre-specified order for each transaction submitted. This is a generalization of basic timestamp ordering. Their hierarchical structure can be seen as an instance of the SDB architecture, with a conservative Pre-Specified Order concurrency control throughout the system. In contrast, the algorithms described in this paper allow each site to control its own ordering, and are optimistic. Compared to the algorithms in this paper, Pre-Specified Order is more restrictive in both the kind of local concurrency control and the amount of execution concurrency allowed.

2.4 TP in Multidatabases

In contrast to the recovery and concurrency control algorithms in a cooperative HeTP system, typical multidatabase (MDB) work assumes little or no explicit exchange of control information between an EDBMS (see Footnote 1) and the MDBMS. Usually, an MDBMS uses some agent, which looks like a client to the EDBMS, to synchronize between the EDB and the MDBMS. The primary advantage of the MDB approach is that agents do not require modifications to the EDBMS. Although the MDB work solves an important problem, the evolution of many database systems (e.g., Sybase 4.0 and Oracle 7.0) to support standard interfaces such as X/Open decreases this advantage significantly. Therefore, a major objection to the cooperative approach (the lack of cooperative EDBMSs) is being removed by the marketplace.

One class of MDB work uses a form of group paradigm [9], making each site a group for synchronization. Breitbart et al. [5] propose the notion of a site graph to guarantee global consistency. Site graphs limit transaction concurrency severely: their MDB may run one transfer transaction but not two concurrently between the same databases. Salem et al. [27] have extended altruistic locking similarly. More recently, Vidyankar [33] used tree-structured site locking (non-two-phase). All the proposals use the site as a group; the difference is in how they order global transactions across groups. In all cases the amount of concurrency allowed between sites is low.

Another thrust of MDB research emphasizes asynchrony (sometimes called autonomy). One example of the asynchrony work is Breitbart et al. [6], where they study the difficulties of implementing distributed transactions without a commit protocol. There are other difficult problems in the integration of autonomous MDBs, such as stronger notions of autonomy and weaker notions of consistency. In this line of work, heterogeneity is an important motivation, but not a necessary condition. This contrasts with the SDB architecture, where heterogeneity is central, but not asynchrony. Thus we are concerned with the most efficient way to solve the HeTP problem, preserving the same semantics and assumptions of traditional distributed TP.

Footnote 1: In this paper, the suffix DBMS means database management system. For example, EDBMS is the database management system for an EDB and SDBMS is the database management system for an SDB.


[Figure 1 shows SDBMSs as the internal nodes of a tree, gluing together EDBMS1, EDBMS2, and EDBMS3 as leaves.]

Figure 1: Conceptual Structure of SDBs

3 Hierarchical Composition

3.1 Model and Terminology

In the SDB architecture, a database is a set of objects supporting read and write operations. An atomic transaction is a set of operations executed as a unit: either all of the operations are completed or all of them are rolled back. An EDBMS is an element database management system with its own concurrency control and crash recovery to support atomic transactions. An SDBMS is the heterogeneous database management system that implements atomic transactions across different EDBMSs. In Figure 1, the EDBMSj (the leaves) represent different EDBMSs glued together by SDBMSs (the internal nodes). A transaction spanning several EDBs is called a supertransaction, which is composed of transactions on EDBs. When participating in a supertransaction, the local transaction on each EDB is called a subtransaction. When we use the term "supertransaction", we are discussing a transaction with respect to the EDBs. When we use "subtransaction", we are referring to the transaction with respect to the SDB. When we use "transaction", it is with respect to the database in which the transaction is running. Finally, we assume a supertransaction is translated into no more than one subtransaction for each EDB. This is a standard assumption made both in HeTP [15] and in R*.
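As an illustration of this terminology, the sketch below models a supertransaction and its per-EDB subtransactions as plain C structures. The type and field names (super_txn, sub_txn, edb_id) and the bound MAX_EDBS are introduced only for this example; the paper does not prescribe any particular representation.

```c
#include <stdio.h>

#define MAX_EDBS 8   /* illustrative bound, not from the paper */

/* One subtransaction: the local transaction run on a single EDB. */
typedef struct {
    int  edb_id;            /* which EDB this subtransaction runs on */
    char tid[16];           /* hierarchical id, e.g. "1.1"           */
} sub_txn;

/* One supertransaction: at most one subtransaction per EDB. */
typedef struct {
    char    tid[16];        /* top-level id, e.g. "1"                */
    int     n_subs;
    sub_txn subs[MAX_EDBS];
} super_txn;

int main(void) {
    super_txn t1 = { "1", 2, { { 1, "1.1" }, { 2, "1.2" } } };
    for (int i = 0; i < t1.n_subs; i++)
        printf("supertransaction %s has subtransaction %s on EDB%d\n",
               t1.tid, t1.subs[i].tid, t1.subs[i].edb_id);
    return 0;
}
```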


3.2 Tree-Structured Superdatabase

We start with a tree-structured SDB for two reasons. First, hierarchical organization minimizes the amount of data transfer, in both the size and the number of messages. Specifically, in Section 3.4.3 we show that we only need to piggyback a small amount of information on messages already required for distributed commit protocols. Second, hierarchical algorithms are easy to explain and understand; this is the case for the concurrency control certification in Section 3.4. We will describe in Section 4 the optimizations to improve performance, transaction concurrency, and distribution for availability.

The SDB architecture solves two important HeTP problems: (1) composition of EDBMSs with different crash recovery methods, and (2) composition of EDBMSs with different concurrency control techniques. For hierarchical composition, an EDBMS is called composable if it satisfies two requirements:

1. Crash recovery. The EDBMS should understand some kind of agreement protocol, e.g., two-phase commit. As we shall see in Section 3.3.1, this requirement is a necessary consequence of distributed control, not heterogeneity.

2. Concurrency control. The EDBMS should present an explicit serial ordering of its local transactions. This can be obtained in several ways, as explained in Section 3.4. We assume that EDBMSs do not serialize subtransactions into the past; this is satisfied by the recoverability property [2], and all practical databases are recoverable. For some specific cases, the SDBMS can handle exotic concurrency control methods, but they are beyond the scope of this paper.

For consistent HeTP updates, these two are the only requirements we make on the EDBMSs (see Footnote 2). An EDBMS may be centralized, distributed, or another SDBMS. Since centralized databases do not need agreement protocols, nor do they supply the transaction serial order, an SDBMS cannot integrate centralized databases "as is". Nevertheless, we believe that these requirements, mild for distributed databases, can be feasibly incorporated into current and future database systems. This is confirmed by our own implementation experience (Section 5) and the current generation of open system TP monitors [3, 32].

Footnote 2: The existence of a commit protocol implies some sophistication in recovery algorithms. This is the same for both HeTP and homogeneous distributed TP.
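To make the two composability requirements concrete, here is a minimal sketch, in C, of the interface an EDBMS adapter might export to an SDBMS: a vote/commit/abort entry point for the agreement protocol, and a query for the local serial order. The type and function names (composable_edb, prepare, serial_order, and so on) are hypothetical illustrations, not the Harmony interface.

```c
/* composable_edb.h -- illustrative sketch of a composable EDBMS interface */

typedef enum { VOTE_YES, VOTE_NO } edb_vote;

/* Opaque serial-order token; the SDBMS only compares it (see Section 3.4). */
typedef struct {
    long value;
} o_element;

typedef struct composable_edb {
    /* Requirement 1: participate in an agreement protocol (2PC here).      */
    edb_vote (*prepare)(struct composable_edb *self, const char *subtid);
    void     (*commit) (struct composable_edb *self, const char *subtid);
    void     (*abort)  (struct composable_edb *self, const char *subtid);

    /* Requirement 2: expose the local serial order of a committed
       subtransaction as an O-element.                                      */
    o_element (*serial_order)(struct composable_edb *self, const char *subtid);

    /* EDBMS-supplied comparison; the SDBMS uses only the result.           */
    int (*compare)(o_element a, o_element b);   /* <0, 0, >0 */
} composable_edb;
```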


3.3 Hierarchical Recovery

3.3.1 Heterogeneous Hierarchical Commit

The usual model of a distributed transaction contains a coordinator and a set of subtransactions. Each subtransaction maintains its local undo/redo information. At transaction commit time, the coordinator runs an agreement protocol with the subtransactions to guarantee a unanimous decision. (Of the published protocols, two-phase commit is the most commonly used, for its low message overhead.) Without agreement protocols, it is very difficult to maintain the atomicity property of the supertransaction, since one subtransaction may commit while another aborts [6]. From the HeTP point of view, the important observation is that the need for agreement on the transaction outcome is due to distribution, not heterogeneity.

The distributed database system R* [22] supports a tree-structured model of computation that refines the above flat coordinator/subtransactions model. Subtransactions in R* are organized in a hierarchy, and the two-phase commit protocol is extended to cover the tree structure. At each level, the parent transaction serves as the coordinator. During phase one, the root sends the message "prepare to commit" to its children. The message is propagated down the tree until a leaf subtransaction is reached, which responds with its vote. At each level, the parent collects the votes; if all its own children vote "yes", then it sends "yes" to the grandparent. If all votes are "yes", the root commits and sends the "committed" message down the tree. Between the sending of its vote and the decision by the root, each child subtransaction remains in the prepared state, ready both to undo the transaction if it is aborted and to redo the transaction if the child crashed and the root decided to commit. Besides the hierarchical two-phase commit in R*, three-phase commit and Byzantine agreements also have natural tree-structured extensions (see Footnote 3). The SDBMS's function is to know and use the appropriate protocol for each EDBMS. If all EDBMSs use the same hierarchical protocol, the SDBMS is the coordinator. Interesting cases arise when EDBMSs support different kinds of agreement protocols.

The SDB divides the distributed agreement protocols into two groups: symmetric and asymmetric. In asymmetric protocols, a distinguished coordinator decides the outcome based on information supplied by the other participants. For example, in centralized and linear two-phase commit, as well as in three-phase commit, a coordinator initiates the protocol and decides whether the transaction commits or aborts. The SDBMS assumes the role of coordinator with respect to EDBMSs with asymmetric agreement protocols. The SDBMS collects one bit of commit/abort information from each asymmetric protocol and returns the decision. Two facts simplify the SDBMS's role in asymmetric protocols. First, we do not modify the protocols in any way. Second, no information is passed between participant EDBMSs, so they do not need to know what protocol the others are using.

Symmetric protocols such as Byzantine agreements and decentralized two-phase commit give all participants an equal role. In this case, we have two choices for the SDB. First, the naive method simulates the symmetric protocol for all EDBMSs by translating the information received from "asymmetric" EDBMSs and passing it to the "symmetric" participants. For example, consider three EDBMSs: TWOPC1 and TWOPC2 with two-phase commit, and SYMC3 using symmetric two-phase commit. In this naive method, the SDBMS passes the votes from TWOPC1 and TWOPC2 to SYMC3 explicitly, even though knowledge of the existence of TWOPC1 and TWOPC2 does not increase system resilience to crashes, since TWOPC1 and TWOPC2 do not understand symmetric two-phase commit and cannot help SYMC3 recover if the SDBMS crashes. This method is obviously correct, but sends unnecessary messages. Second, an optimized SDBMS may eliminate the extra messages by serving as a representative of all the asymmetric participants, sending the result of the asymmetric protocols in one round of messages. In our previous example, there would be just one message from the SDBMS to SYMC3 instead of two. This second method decreases the number of messages by combining the extra messages into one. These two choices also exist for the communication between symmetric participants using different protocols. The message savings for m participants of one protocol and n of the other is (m × n) − (m + n).

The more information a symmetric protocol requires, the more work it is to integrate the EDBMS into an HDB. For example, Byzantine agreement protocols include a node id in each message. In this case, the SDBMS may need to parse and generate messages correctly. Fortunately, this protocol heterogeneity problem is restricted to specific parameter translation, similar in nature to but simpler than heterogeneous RPC [4].

This translation guarantees that the subtransactions of a supertransaction will either all commit or all abort. We first observe that the SDBMS does not change the commit protocols, so the correctness of each commit protocol guarantees that each subtransaction agrees with the SDBMS on the transaction outcome. Therefore, the SDBMS's decision is the decision of all participant EDBMSs for their respective subtransactions. Between an SDBMS and its parent, we can use any agreement protocol that both understand. In this paper and in our implementation, we adopt two-phase commit to minimize message overhead. For its children, the SDB functions both as a coordinator for the asymmetric agreement protocols and as a translator for the symmetric protocols. It collects sufficient information for supertransaction commit, and provides enough information to participants using symmetric protocols so that they can reach the same decision on the transaction outcome as the other participants.

Footnote 3: In the discussion below, references on the Byzantine agreements can be found in several PODC Proceedings; the other protocols are described in the recent book by Bernstein et al. [2].
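The following sketch illustrates the coordinator role just described: the SDBMS collects one vote from every participant, folds the votes into a single decision, and relays that decision once per participant, acting as the single representative of the asymmetric-protocol children. It is a minimal single-process model of the message flow; the participant table, stub functions, and vote representation are illustrative assumptions, not the Supernova code.

```c
#include <stdio.h>

typedef enum { VOTE_YES, VOTE_NO } vote;

typedef struct {
    const char *name;
    vote (*prepare)(const char *tid);          /* phase one: collect vote   */
    void (*decide)(const char *tid, int ok);   /* phase two: relay decision */
} participant;

/* Stub EDBMSs standing in for TWOPC1, TWOPC2 (asymmetric 2PC) and SYMC3. */
static vote yes(const char *tid) { (void)tid; return VOTE_YES; }
static void ack(const char *tid, int ok)
{ printf("transaction %s: %s\n", tid, ok ? "commit" : "abort"); }

/* SDBMS as coordinator: one vote in from each child, one decision out.
   For symmetric participants the SDBMS is the representative of all the
   asymmetric ones, so one message carries their combined result.        */
static int sdbms_commit(participant *p, int n, const char *tid) {
    int ok = 1;
    for (int i = 0; i < n; i++)
        if (p[i].prepare(tid) == VOTE_NO) ok = 0;
    for (int i = 0; i < n; i++)
        p[i].decide(tid, ok);
    return ok;
}

int main(void) {
    participant edbs[] = { {"TWOPC1", yes, ack}, {"TWOPC2", yes, ack}, {"SYMC3", yes, ack} };
    sdbms_commit(edbs, 3, "1");
    return 0;
}
```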

3.3.2 Superdatabase Recovery

Since the SDBMS is the coordinator for the EDBMSs during commit protocols, it should record the commit protocol state on stable storage. Otherwise, a crash during the window of vulnerability would hold resources in the EDBMSs indefinitely. The commit protocol state is easily recorded on a log, which is conceptually separate from any EDBMS logs, just as the SDBMS itself is. A separate log is useful because supertransactions do not necessarily abort when an SDBMS crashes. Suppose that an SDBMS crashes but is brought back online quickly, before its subtransactions have finished. Since the SDBMS performs no computation, the supertransaction may still commit. To carry out the commit agreement after such glitches, the participant subtransactions should be recorded in the log, which is read at restart time to reconstruct the SDBMS state before the crash. For each transaction, the commit protocol state on the log consists of: the participant subtransactions, the parent SDBMS (if any), and the transaction id and state (active, prepared, committed, or aborted). If a transaction was in the active state when an SDBMS crashed, the SDBMS simply waits for the (re)transmission of two-phase commit from the parent; a root SDBMS (re)starts the two-phase commit itself. If a transaction was in the prepared state when an SDBMS crashed, the SDBMS asks the parent about the outcome of the transaction (or consults its log, if it is the root). The abort/commit decision is then retransmitted to the subtransactions.
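As a concrete illustration of the recovery state just listed, here is a minimal sketch of an SDBMS log record and of the restart decision it drives. The record layout and names are assumptions made for this example; they are not the Supernova log format.

```c
#include <stdio.h>

typedef enum { TXN_ACTIVE, TXN_PREPARED, TXN_COMMITTED, TXN_ABORTED } txn_state;

/* One commit-protocol log record kept on stable storage by the SDBMS. */
typedef struct {
    char      tid[16];             /* supertransaction id                  */
    char      parent[32];          /* parent SDBMS, or "" for the root     */
    int       n_participants;
    char      participants[8][32]; /* participant EDBMSs (subtransactions) */
    txn_state state;
} sdb_log_record;

/* Restart-time handling of one record read back from the log. */
static void recover(const sdb_log_record *r) {
    switch (r->state) {
    case TXN_ACTIVE:
        if (r->parent[0] == '\0')
            printf("%s: root, (re)start two-phase commit\n", r->tid);
        else
            printf("%s: wait for 2PC (re)transmission from %s\n", r->tid, r->parent);
        break;
    case TXN_PREPARED:
        printf("%s: ask %s for outcome, then retransmit to %d participants\n",
               r->tid, r->parent[0] ? r->parent : "own log", r->n_participants);
        break;
    default:
        printf("%s: decision already logged, redeliver if needed\n", r->tid);
    }
}

int main(void) {
    sdb_log_record r = { "1", "", 2, { "EDBMS1", "EDBMS2" }, TXN_PREPARED };
    recover(&r);
    return 0;
}
```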

3.3.3 Summary of Hierarchical Recovery

The key idea of hierarchical recovery in SDB is the separation of the two recovery components of an HDB: local undo/redo functions and commit protocols. Of these, the local undo/redo functions of an EDBMS are completely isolated from the other EDBMSs and the SDBMS. The SDBMS delivers each supertransaction's commit/abort decision to the participant EDBMSs using each EDBMS's own protocol, and each EDBMS then carries out the decision using its own undo/redo algorithms. Consequently, heterogeneity in the EDBMSs' internal undo/redo functions is not a problem.

The main problem in HeTP recovery is the translation between different commit protocols. We have divided the protocols into two groups, symmetric and asymmetric. The SDBMS communicates with the EDBMSs in each group using an appropriate local protocol. Both the collection of votes and the distribution of the decision happen only between the SDBMS and individual EDBMSs. This isolation of the EDBMSs from each other simplifies their integration by the SDBMS. To guard against SDBMS crashes, we save the SDBMS commit protocol state on stable storage, just as homogeneous distributed databases do. The total cost of recovery is dominated by the commit protocols and SDBMS logging, both necessary for homogeneous TP and HeTP in the same way. The only additional overhead is a fixed (per transaction) translation cost. In summary, the successful separation between commit protocols and undo/redo functions of all recovery methods shows that heterogeneous recovery only adds protocol translation to the cost of homogeneous distributed transaction recovery.

3.4 Hierarchical Concurrency Control

Our approach to heterogeneous concurrency control is analogous to hierarchical recovery. We use the common abstract element of all concurrency control methods, serializability, to simplify the problem. The main difference is the amount of information we need to handle in concurrency control: the transaction serialization information is larger and more diverse than the one-bit commit/abort decision. The SDB architecture does not impose a specific concurrency control algorithm for ensuring global serializability. For example, the Pre-Specified Order approach by Elmagarmid and Du [10] can be seen as a pessimistic way to achieve that goal. In this section, we describe an optimistic algorithm that minimizes the amount of information passed. In Section 4 we will see some other choices in different variants of the algorithm.

3.4.1 Local Concurrency Control

To illustrate the main problem in serializing heterogeneous transactions, let us consider the following example (see Footnote 4). In Figure 2, the supertransaction T1 has subtransactions T1.1 and T1.2 running on EDB1 and EDB2, respectively. Suppose that both EDBMS1 and EDBMS2 use two-phase locking. If T1.1 starts releasing locks while T1.2 has not reached its lock point, the supertransaction T1 may lose its two-phase property and become non-serializable.

Footnote 4: In this example, the top-level transaction may commit regardless of subtransaction aborts.

    (Top-level, T1)   BeginTransaction tid: 1
                      cobegin
    (EDB1, T1.1)          BeginTransaction parentid: 1 tid: 1.1
                          ... actions 1 ...
                          CommitTransaction(T1.1)
    (EDB2, T1.2)          BeginTransaction parentid: 1 tid: 1.2
                          ... actions 2 ...
                          CommitTransaction(T1.2)
                      coend
                      CommitTransaction(T1)

    Figure 2: Example Distributed Transaction

This scenario reveals the crucial problem in hierarchical composition of concurrency control mechanisms: the union of local serializations does not guarantee global serialization (see Footnote 5). Our solution is to have the SDBMS certify that all local orderings are compatible with a globally serializable ordering. One way to implement SDBMS certification is to require that each EDBMS provide the ordering of its local transactions to the SDBMS. This method is sufficient for composition of heterogeneous databases, but not necessary, since implicit serialization is possible under certain circumstances (see Section 4.2 for optimization).

Footnote 5: This is called a global property by Weihl and Herlihy.

The serial order of each local transaction is represented by an order-element (called O-element for short). For each EDBMS, the most important property of its O-elements is that they should be comparable to each other, and that a comparison recovers the serialization order maintained by the EDBMS's concurrency control method. There is no constraint on the actual format of an O-element. Each EDBMS may encode its O-elements and provide its own comparison routine. Since the SDBMS does not compare O-elements from different EDBMSs, an SDBMS does not have to understand an O-element's representation or know how the comparison is done. All it needs is the result of the comparison to find the local serialization order. This fact will be used in Section 4.2 to increase global concurrency at the cost of more complex and larger O-elements.

The comparison between two O-elements is defined by the usual precede relation (denoted by ≺) in the partial ordering of local transactions [2]. If T1 ≺ T2 in the local serialization then O-element(T1) ≺ O-element(T2). If T1 and T2 do not conflict (i.e., neither T1 ≺ T2 nor T2 ≺ T1), then we say that O-element(T1) is concurrent with (denoted by ◇) O-element(T2). Concurrent O-elements are defined here for completeness. Although they may help preserve concurrency in the SDB, it takes a large amount of data and processing to achieve this result. Therefore, in the remainder of this paper we make a simplification: when T1 and T2 are concurrent, we allow the comparison routine to return either O-element(T1) ≺ O-element(T2) or O-element(T2) ≺ O-element(T1). This simplification introduces two solvable problems. First, it reduces potential concurrency, since the certification algorithm (Section 3.4.2) may see conflicts where none exist. We deal with this problem in Section 4.2. Second, it introduces dependencies that local concurrency control algorithms should take into account; in particular, optimistic validation algorithms should add the new dependency arc into the transaction dependency graph (see Footnote 6). Despite these two technical problems, we adopt the simplification, both for simplicity of exposition and for the O-element production methods of the main concurrency control algorithms.

Footnote 6: We note that the simplification omitting concurrent O-elements introduces an implied dependency between the successive O-elements. These arcs should be introduced into the transaction dependency graph by an EDBMS using optimistic validation.

We start with two-phase locking. Transactions acquire all locks in a growing phase, and release them during a shrinking phase, in which no additional locks may be acquired. The moment between the growing phase and the shrinking phase is called the transaction's lock point. Eswaran et al. [12] showed that two-phase locking guarantees serializability of transactions because SHRINK(Ti), a timestamp taken at the lock point of Ti, indicates Ti's place in the serialization with respect to all transactions. We can use SHRINK(Ti) as the O-element for EDBMSs using two-phase locking.

Second, some databases use timestamp-based concurrency control methods. The timestamps used for serialization represent an explicit ordering, so they serve well as O-elements. Timestamp intervals [1] or multidimensional timestamps [21] can be passed as O-elements as well. The important thing is to capture the serialization order of committing local transactions.

Third, optimistic validation methods also provide an explicit serialization order. Kung and Robinson [18] assign a serial transaction number after the write phase, which can be used directly as an O-element. Ceri and Owicki [7] proposed a distributed algorithm in which a two-phase commit follows a successful validation. Taking a timestamp from a Lamport-style global clock [19] at commit time will capture the serial order of transactions: since the write phase has yet to start, all following transactions will have a later timestamp.

One concern from the practical side is the need to modify existing EDBMSs to make the O-elements explicit. Georgakopoulos and Rusinkiewicz [13] have shown that it is possible to produce O-elements without modifying the EDBMS. Their idea is called forced local conflicts. Instead of asking an EDBMS to produce an O-element, they add an agent that runs on the EDB. The agent "guards" a local data item, called a ticket. By convention, all the subtransactions originated by some supertransaction acquire the ticket at an appropriate time (for example, the lock point) and release it when safe (for example, after commit). Since access to the ticket is serialized, the agent can count each access and return a serial number, which is an O-element. The advantage of the forced local conflicts method is the preservation of the local EDBMS interface. The costs include some changes in the supertransaction (to acquire the ticket) and some concurrency loss (limited by ticket acquisition).

3.4.2 Hierarchical Certification with O-vectors

Given the O-elements from the EDBs, the SDBMS can certify global serializability. First, we define an order-vector (O-vector). Conceptually, an O-vector is a vector of n O-elements in an HDB of n EDBs. The O-element from EDBi is the i-th component of the O-vector. If a supertransaction is not running on all EDBs, we use a wild-card O-element, denoted by * (star), to fill in for the missing EDBs. Since its order does not matter, by definition O-element(any) ◇ *, or in our simplification, both O-element(any) ≺ * and * ≺ O-element(any). The ordering over O-vectors is induced by the comparison between O-elements:

- O-vector(T1) ◇ O-vector(T2) if for all j, O-element(T1.j) ◇ O-element(T2.j).

- O-vector(T1) ≺ O-vector(T2) if for at least one j, O-element(T1.j) ≺ O-element(T2.j), and for all j, either O-element(T1.j) ◇ O-element(T2.j) or O-element(T1.j) ≺ O-element(T2.j).

Given this ordering, we can detect subtransactions serialized in different ways. In our example (Section 3.4.1), this happens when a second transaction T2 of the same type produces the circular ordering: O-element(T1.1) ≺ O-element(T2.1) and O-element(T2.2) ≺ O-element(T1.2).

The hierarchical certification checks the history of committed supertransactions. If transaction Ti's O-vector causes a circular ordering with any committed supertransaction, we abort Ti. Otherwise, we commit Ti and insert its O-vector into the history. From the SDB architecture point of view, the key observation is that the certification based on O-vectors is independent of any particular concurrency control method used by the EDBMSs. Consequently, an SDBMS can combine heterogeneous concurrency control methods.

The construction of the O-vector allows recursive composition of an SDBMS as an EDB, since an O-vector qualifies as an O-element: the certification gives the supertransaction an explicit serial order. Thus we have found a way to hierarchically compose SDB concurrency control, maintaining serializability at each level. The naive certification method is optimistic, in the sense that it allows EDBMSs to run their subtransactions to completion and then certifies the global ordering. In particular, the O-vector is constructed only after the subtransactions have finished. One way to improve the performance of the SDBMS is to do the certification incrementally, as soon as an O-element becomes available. This is especially useful for EDBMSs that decide on the serialization order early, such as basic timestamps. If a circular ordering is found early, the supertransaction may be restarted right away. In any case, the certification ends only after the last O-element has reached the SDB. This optimization is analogous to incremental two-phase commit protocols.

3.4.3 Run-Time Cost

The run-time cost of hierarchical concurrency control is in general a trade-off between the amount of information produced and the amount of concurrency lost. In our simplification, we have adopted simple O-elements and comparisons, postponing the optimizations (on concurrency loss) to Section 4.2. Taking a timestamp for an O-element in a centralized database is inexpensive. However, if an EDBMS is a distributed database with internal concurrency control, a Lamport-style global clock may be necessary. Fortunately, the maintenance of a global clock is independent of the number of transactions, and therefore its cost can be amortized.

On the SDBMS side, the certification of an O-vector implies comparison with all committed supertransactions, which is potentially expensive both in storage and in processing. For simple O-elements, it is not necessary to compare the O-vector with all committed supertransactions. The upper bound is given by an artificial O-vector(T0) such that for every i, O-element(T0.i) ≺ O-element(Tk.i) for all active transactions Tk, k > 0. First we construct the O-vector(T0). Let SA be the set of currently active or pending transactions (submitted but unfinished). For every EDBj, find a committed Tj such that Tj ≺ SA for all pending transactions in EDBj. By recoverability, any Tj that committed before any member of SA started suffices. Having chosen the Tj, let O-vector(T0) = (T1.1, ..., Tj.j, ...). By construction, O-element(T0.j) ≺ O-element(Tk.j) for every j and every Tk in SA. Since we will not serialize any Tk in SA before T0, the history preceding T0 can be garbage collected.
In some databases, read-only queries may be serialized into the past if older versions are available. This strategy increases concurrency. For these cases, the construction of T0 may limit the number of old versions available to supertransactions. Since this is a garbage collection problem, we can use simple heuristics, such as "when memory runs out, cut the history by half", or cut as far back as SA allows. We note the similarity between this bound and the Global Virtual Time (GVT) in Time-Warp systems [17]. In Time-Warp style optimistic computations, once all nodes have passed GVT, the history before GVT can be garbage collected. Similarly, in SDB, once all EDBs have passed O-vector(T0), the history before it can be released.

Finally, we analyze the message overhead of certification. For each supertransaction, the only piece of information that the SDBMS needs from the EDBMSs is the O-element. Since an agreement protocol is necessary for recovery purposes, at least one round of messages must be exchanged between the SDBMS and each participant EDBMS at commit time. The certification occurs only at commit time, so the subtransaction serial order information can piggyback on the commit vote message. Therefore, the hierarchical SDB does not introduce any extra message overhead for HeTP (compared to homogeneous distributed transactions).
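To make the certification of Sections 3.4.2 and 3.4.3 concrete, here is a small sketch of O-vector certification against a history of committed supertransactions. It treats every O-element as an integer and every missing EDB as a wild card, following the simplification above; the data layout and names are our own illustration, not the Supernova implementation.

```c
#include <stdio.h>

#define N_EDBS   3
#define WILDCARD 0            /* stands in for the "*" O-element            */
#define MAX_HIST 64

/* One committed supertransaction: one O-element per EDB (0 = not involved). */
typedef struct { long oe[N_EDBS]; } o_vector;

static o_vector history[MAX_HIST];
static int      n_hist = 0;

/* Certify a finished supertransaction against every committed one:
   reject if some committed vector precedes it on one EDB while the new
   vector precedes that one on another EDB (a circular ordering).            */
static int certify(const o_vector *t) {
    for (int h = 0; h < n_hist; h++) {
        int before = 0, after = 0;
        for (int j = 0; j < N_EDBS; j++) {
            if (t->oe[j] == WILDCARD || history[h].oe[j] == WILDCARD) continue;
            if (t->oe[j] < history[h].oe[j]) before = 1;   /* t precedes h on EDBj */
            if (t->oe[j] > history[h].oe[j]) after  = 1;   /* h precedes t on EDBj */
        }
        if (before && after) return 0;   /* circular ordering: abort            */
    }
    if (n_hist < MAX_HIST)
        history[n_hist++] = *t;          /* serializable: commit and record      */
    return 1;
}

int main(void) {
    o_vector t1 = { { 1, 2, WILDCARD } };   /* T1 on EDB1 and EDB2               */
    o_vector t2 = { { 2, 1, WILDCARD } };   /* T2 serialized the opposite way    */
    printf("T1 %s\n", certify(&t1) ? "commits" : "aborts");
    printf("T2 %s\n", certify(&t2) ? "commits" : "aborts");   /* aborts          */
    return 0;
}
```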

4 Optimization and Distribution

4.1 Hierarchy Flattening

Hierarchical algorithms (e.g., hierarchical two-phase commit) have run time dominated by the depth of the tree. A standard optimization technique for hierarchical algorithms, used to reduce response time, is to flatten the hierarchy by coalescing all internal nodes into one root node, which then communicates with all leaf nodes directly. The flattened tree uses the same algorithm (simplified to one level), with run time bounded only by the slowest link. Since flattening occurs at the time a new node joins the tree, its implementation is straightforward. The SDBMS maintains a list of the EDBMSs attached to it, with their attributes, including the local concurrency control and commit agreement protocol used. Instead of creating a hierarchy, we attach all EDBMSs under the same root SDBMS. This optimization does not change the internal structure of each EDB; for example, R* maintains its own tree structure.

Flattening simplifies the structure of the whole system, but it also accentuates the centralization problem: the single root SDBMS may become both a performance bottleneck and an Achilles' heel for availability. We will take advantage of the simpler structure in Section 4.2 to increase transaction concurrency, and address the centralization problem in Section 4.3.

4.2 Concurrency Control Grouping

The key idea for improving concurrency in SDB is to use the semantics of sufficiently similar concurrency control methods to eliminate certification conservatism. Usually we can avoid certification completely within each group. Here, we discuss three groups chosen for their practical importance: two-phase locking (2PL), timestamp-based methods, and optimistic validation.

4.2.1 Two-Phase Locking

We consider first strict 2PL, due to its practical importance (many commercial database products adopt it). This case turns out to be remarkably simple. All the subtransactions controlled by strict-2PL database managers share a global lock point at commit time. Therefore there is no need for any additional certification between EDBMSs using strict 2PL.

The second group is general 2PL. The example in Section 3.4.1 shows that general 2PL requires more care than strict 2PL. To avoid mismatched lock points, we synchronize the lock points of the EDBMSs through an agreement protocol. Consider the concrete example of using a two-phase agreement with the SDBMS as coordinator. As each EDBMS reaches its subtransaction's lock point, it sends a "lock point" message to the SDBMS. When all 2PL participants of the supertransaction have reached their lock points, the SDBMS replies "global lock point" and each participant 2PL subtransaction may enter its shrinking phase.
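A minimal sketch of that lock-point agreement follows: the SDBMS counts "lock point" messages from the general-2PL participants and, once all have reported, releases them into the shrinking phase at once. The message exchange is simulated within one process, and the names are illustrative assumptions only.

```c
#include <stdio.h>

#define PARTICIPANTS 3   /* general-2PL subtransactions of one supertransaction */

/* SDBMS side: wait until every 2PL participant has reported its lock point,
   then broadcast "global lock point" so all may enter the shrinking phase.   */
static void coordinate_lock_point(int n) {
    int reported = 0;
    while (reported < n) {
        /* a real system would receive a message here; we simulate the arrival */
        printf("EDBMS%d -> SDBMS: lock point\n", reported + 1);
        reported++;
    }
    for (int i = 0; i < n; i++)
        printf("SDBMS -> EDBMS%d: global lock point (begin shrinking phase)\n", i + 1);
}

int main(void) {
    coordinate_lock_point(PARTICIPANTS);
    return 0;
}
```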

4.2.2 Optimistic Validation

Unlike 2PL, optimistic validation algorithms use an explicit representation of the transaction dependency information, such as transaction dependency graphs (TDGs). In principle, this information can be used to decrease concurrency loss. For example, if an EDBMS sends the entire TDG in the O-element, the comparison routine can decide whether there is precedence, no conflict, or interference between two transactions by checking the TDG. Although this solution guarantees no concurrency loss, the execution overhead would be very high. One way to decrease the size of O-elements is to maintain a copy of the TDG in the SDBMS and send incremental updates through O-elements. However, this strategy introduces delicate problems of mutual consistency between the copy of the TDG in the SDBMS and the original in the EDBMS. Since maintaining the TDG information implies a large amount of communication, an alternative is to send the O-elements being compared back to the EDBMS instead of receiving TDG updates: the comparison routine for an O-element would not do local comparisons, but would send the O-elements over to the EDBMS. In either case, the gain in concurrency is weighed against message overhead in a trade-off. The naive certification algorithm of Section 3.4.2 would work poorly here due to many wasted messages; an improved certification algorithm sends all the O-elements together and receives an answer from the EDBMS confirming or denying their serializability.

4.2.3 Timestamp-Based Algorithms

Timestamp-based concurrency control algorithms can be divided into two subgroups: static (e.g., the basic timestamp method [2]) and dynamic (e.g., time intervals [1] and multidimensional timestamps [21]). The best way to avoid aborts for static methods is to have the SDBMS generate global timestamps and send them with the subtransactions. By sending globally ordered timestamps, we can guarantee the synchronization of subtransactions and prevent aborts due to differences in timestamps. The treatment of dynamic methods is similar to optimistic validation. The main difference is that timestamps carry less information than a TDG; therefore, it is easier to fit the timestamps or time intervals in an O-element and do the comparison. The comparison routine may either resolve the conflicts on the SDBMS side or send the O-elements back to the EDBMS. In either case, the comparison yields a precedence if the time intervals do not overlap. Otherwise the transactions may be serialized either way by subdividing the time interval, returning "no conflict". If the SDBMS does the subdivision, it should send a message communicating the result to the EDBMS, so that the EDBMS can take the subdivision into account in future decisions.
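The interval comparison just described can be sketched in a few lines: non-overlapping intervals yield a precedence, while overlapping intervals are subdivided so that the two transactions can be serialized either way. The interval representation and the midpoint split are assumptions for illustration only.

```c
#include <stdio.h>

typedef struct { long lo, hi; } interval;   /* candidate serialization range */

/* Compare two timestamp intervals used as O-elements.
   Returns -1 (a precedes b), 1 (b precedes a), or 0 after subdividing
   overlapping intervals so that a is placed before b ("no conflict").       */
static int compare_intervals(interval *a, interval *b) {
    if (a->hi < b->lo) return -1;
    if (b->hi < a->lo) return  1;
    /* Overlap: split at the midpoint of the common region, a first.         */
    long lo = a->lo > b->lo ? a->lo : b->lo;
    long hi = a->hi < b->hi ? a->hi : b->hi;
    long mid = lo + (hi - lo) / 2;
    a->hi = mid;
    b->lo = mid + 1;
    return 0;
}

int main(void) {
    interval t1 = { 10, 30 }, t2 = { 20, 40 };
    int r = compare_intervals(&t1, &t2);
    printf("result=%d  T1=[%ld,%ld]  T2=[%ld,%ld]\n", r, t1.lo, t1.hi, t2.lo, t2.hi);
    return 0;
}
```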

4.2.4 Summary of Concurrency Control Grouping

In this section we described three optimization methods that reduce or completely eliminate concurrency loss for three important groups of concurrency control algorithms: 2PL, optimistic validation, and timestamp-based. Each group has its own requirements and trade-offs. For example, we can almost completely eliminate the concurrency loss problem for 2PL. In contrast, the amount of information exchanged is the key problem in improving the performance of optimistic validation algorithms. Timestamp-based methods fall somewhere between 2PL and optimistic validation.

There are two additional observations on concurrency control grouping. First, we have defined the O-element comparison routine abstractly. Altogether, we have suggested three concrete implementations that illustrate the range of trade-offs between concurrency preservation and simplicity: timestamp comparison, TDG traversal, and direct consultation with an EDBMS. Second, while we can avoid or refine the certification within each group of EDBMSs using similar concurrency control methods, a global certification must still be carried out between the different groups. In this higher-level certification, each group participates with one O-element. Therefore, supertransactions aborted due to non-serializability necessarily come from different groups.

4.3 Symmetric Distribution

As we have seen in Section 3, a hierarchical organization of SDBMSs results in low message overhead. However, a hierarchical structure implies a centralized organization: shutting down any of the internal nodes may isolate parts of the tree. In the optimized (flattened) version, the root SDBMS appears to be a single point of failure that can make the whole heterogeneous system inaccessible. In reality, the four functions of the root SDBMS (supertransaction decomposition, result collection, concurrency control, and crash recovery) can be independently distributed for parallelism and availability.

For each supertransaction, the decomposer sends the list of participant EDBMSs to the result collector, the commit coordinator, and the concurrency controller. Decomposition and result collection of one supertransaction are independent of another, so individual replicas of the SDBMS can decompose different supertransactions and collect their results. The situation is the same for recovery: coordinators of the two-phase commit protocol for different supertransactions do not communicate with each other. If an SDBMS replica crashes, it affects only the supertransactions handled by that replica. One way to increase system availability is to checkpoint the decomposition and commit information to a backup SDBMS; in case of a crash, the backup may be able to finish the commit protocol and compose the results.

The situation is more complicated for concurrency control. We could replicate the global certification information in several SDBMS nodes, resulting in higher message overhead to keep the replicas consistent. Unfortunately, consistent replication is expensive, and this approach loses the low-overhead advantage of the hierarchical SDB. Alternatively, we can circulate the concurrency control certification information among several sites. This approach is similar to existing distributed optimistic validation algorithms (for example, see [7]); again, higher message overhead will be incurred. A reasonable compromise would be a central root SDBMS for normal certification, with periodic checkpoints sending the global wait-for graph (the history of O-vectors) to backup sites. If the root node crashes, one of the backups takes over and reconstructs the situation. The trade-offs between normal processing cost and recovery time are similar to those in homogeneous distributed systems.

5 Implementation

5.1 Historical Perspective

When we started the Harmony project (see Footnote 7) in 1987, the database market was dominated by proprietary systems such as CICS, Oracle, and Ingres, which did not communicate with each other. To show the practicality of the SDB architecture and to test and refine the ideas in SDB, we started an implementation effort in 1987. The initial prototype was built by about 15 students taking an Advanced Database course, which included group projects. We divided the projects into three layers of software: operating system (OS) support, EDBMSs, and SDBMS. The OS support group worked on a common interface between the OS and the database, intended to improve the portability of Harmony. The SDBMS group designed and implemented a simple instance of an SDBMS, called Supernova. The EDBMS layer was further divided into four parts: storage manager, concurrency controller, query compiler, and data dictionary. This layer became the first version of our own relational database, called Nova.

Since then, more than 20 project students have worked on the Harmony project (for a partial listing see Section 7). Averaging about 2.5 project-semesters each, the total implementation effort exceeds 50 project-semesters. All of the major components have been re-implemented at least once. All the code has been written in C on some flavor of Unix. The current working code adds up to about 30,000 lines out of a total of more than 80,000. From the management point of view, the stability in leadership (the author and a PhD student, S.W. Chen) has been instrumental in keeping the project together.

We spent a significant part of our time on the specification of interfaces, and we wish we had spent a larger percentage of this time earlier in the project. This is perhaps unsurprising to the developers of interoperable systems, in particular of the X/Open standard. However, we started the system development thinking that we could modify the interfaces quickly and easily, since we had the sources to every one of the components. As it turned out, many interfaces (e.g., the communications protocol between the SDBMS and the EDBMSs) are used by more than one component, and each hasty decision carried a significant penalty in terms of software rewrite across several modules. The lesson is that carefully defined interfaces are the foundation of interoperable (heterogeneous) systems, whether the system is open or built by one group.

Footnote 7: The Harmony project was initially also called Superdatabase until its renaming in 1989.

5.2 Supernova

Supernova is the SDBMS that glues the Harmony prototype together. It contains three modules: a query compiler, global recovery, and global concurrency control. The query compiler distributes global supertransactions to the component databases. The global recovery module logs supertransaction states for recovery and translates commit protocols. The global concurrency control module certifies the global ordering of component transactions when they complete. The current version of Supernova does not include the optimizations described in Section 4, but it implements the recovery and concurrency control algorithms summarized in Sections 3.3 and 3.4. It runs under Ultrix on a Microvax 3600 and is referred to as Supernova/Ultrix.

Supernova/Ultrix uses certification with O-vectors for global concurrency control. Its concurrency control module maintains a list of O-elements for each component database. Each item in the list represents the O-element returned by an EDBMS participating in a supertransaction. Currently, we have two families of concurrency control techniques among the component databases: two-phase locking and optimistic validation (see Section 5.3). For two-phase locking, the O-element comes from the timestamp of the lock point of each local transaction. The optimistic validation produces its O-elements directly from a counter. Currently, all the EDBMSs share the same comparison routine for O-elements, since they all use an integer as the O-element.

The recovery module of Supernova/Ultrix also translates between different commit protocols. Supernova maintains a table of procedures that implements the commit process for each type of commit protocol. The actual commit protocol is table-driven, to allow easy addition of new protocols. Since we had to add a commit protocol to each one of the EDBMSs, the first prototype simply used the same protocol (developed for Nova) for all of them. We have implemented a subset of the LU 6.2 protocol for inclusion in the system; however, the integration of LU 6.2 into the table-driven Supernova has not been completed.
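The table-driven protocol translation described above can be pictured as a small dispatch table mapping each protocol type to the procedures that drive its commit phases. This is only a sketch of the idea under assumed names; the actual Supernova tables and entry points are not shown in the paper.

```c
#include <stdio.h>

typedef enum { PROTO_NOVA_2PC, PROTO_LU62_SUBSET, PROTO_COUNT } proto_id;

/* One table entry: the procedures implementing a commit protocol's phases. */
typedef struct {
    const char *name;
    int  (*collect_vote)(const char *edb, const char *tid);
    void (*send_decision)(const char *edb, const char *tid, int commit);
} commit_protocol;

static int  vote_yes(const char *edb, const char *tid)
{ printf("%s votes yes for %s\n", edb, tid); return 1; }
static void tell(const char *edb, const char *tid, int c)
{ printf("%s told %s for %s\n", edb, c ? "commit" : "abort", tid); }

/* Adding a new protocol only means adding a row to this table. */
static const commit_protocol table[PROTO_COUNT] = {
    [PROTO_NOVA_2PC]    = { "nova-2pc",     vote_yes, tell },
    [PROTO_LU62_SUBSET] = { "lu6.2-subset", vote_yes, tell },
};

int main(void) {
    /* Two EDBMSs speaking different protocols, one supertransaction "1". */
    struct { const char *edb; proto_id p; } part[] =
        { { "Ingres", PROTO_NOVA_2PC }, { "Jake", PROTO_LU62_SUBSET } };
    int ok = 1;
    for (int i = 0; i < 2; i++)
        ok &= table[part[i].p].collect_vote(part[i].edb, "1");
    for (int i = 0; i < 2; i++)
        table[part[i].p].send_decision(part[i].edb, "1", ok);
    return 0;
}
```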

5.3 EDBs

Supernova/Ultrix currently integrates four different EDBMSs: a modified version of university Ingres running on SunOS, a Camelot server running on the Mach operating system (on a Microvax), and two flavors of our own Nova relational DBMS.

The first version of Nova, implementing 2PL, was written by project students from scratch. Nova-2PL includes a (small subset) SQL/C query compiler, rudimentary data dictionary support, two-phase locking concurrency control, and recovery with logging on the Unix file system. Nova-2PL uses a simple two-phase commit protocol to communicate with Supernova; this protocol has been incorporated into the other EDBMSs. The current version of Nova-2PL runs on Ultrix. The Nova-2PL concurrency controller contains about 8000 lines, the data manager 4000 lines, and the simple query compiler (without query optimization) 1600 lines. The second Nova changes the concurrency control from 2PL to optimistic validation, resulting in Nova-OCC. The other modules of Nova-OCC are shared with Nova-2PL. The optimistic concurrency controller is about 5000 lines and took about three project-semesters. This modification was feasible because the implementor was able to maintain the interfaces between the four original modules of Nova-2PL. The two versions of Nova showed that a modular prototype is feasible and useful.

The university version of Ingres is a good example of an originally centralized database made composable. We added the Nova protocol and the "prepared" state to the recovery mechanism, plus the return of the O-element to Supernova. The conversion of the centralized Ingres to a composable Ingres took about four project-semesters. The total number of lines changed was about 2000 (out of a total of 100,000 lines). The current commercial version of Ingres already offers two-phase commit (as part of Ingres/Star), and some other products such as Oracle 7.0 already do the same [32]. Two candidates for "open" commit protocols are LU 6.2 and the X/Open proposed standard.

Camelot [29] is a reliable distributed transaction library developed at CMU to run on top of the Mach operating system. A Camelot application is composed of servers that support transactions invoked by clients. Our Camelot server EDBMS is called Jake, a slightly modified version of the Jack server distributed with the Camelot package. Jack executes simple transactions and returns results; Jake converts the Camelot commit protocol to the Supernova protocol. Although the Camelot servers by design satisfy the composability conditions, we had some difficulties with the interface. We gladly acknowledge the help of Prof. Dan Duchamp of Columbia University, who wrote the original Camelot transaction manager. The current working version of Jake is about 900 lines of code, which has been rewritten three times. It is interesting to note that the successor to Camelot, the Encina TP monitor produced by Transarc Corp. [3], is composable from the commit protocol point of view, since it supports the X/Open standard.

Figure 3: TP Monitors with X/Open Interface. (The figure shows a transaction manager such as Tuxedo, TopEnd, or Transarc connected through the XA interface to resource managers such as Sybase, Informix, and Oracle.)

5.4 Evaluation

The implementation of Supernova/Ultrix and the four EDBMSs showed the feasibility of the SDB architecture. The total implementation effort of the Harmony prototype is similar in magnitude to several of the prototypes reported in a recent special issue of IEEE Transactions on Knowledge and Data Engineering [30]. The degree of heterogeneity, global atomicity, amount of concurrency preserved, and low overhead of Supernova compare favorably with the HeTP systems reported in the literature [8, 11, 16, 26]. This result is relevant given the relatively modest implementations reported in a recent special issue of ACM Computing Surveys on Heterogeneous Databases [11].

Another demonstration of the SDB architecture is the availability of open system commercial TP monitors such as TopEnd, Tuxedo/T, and Transarc [3]. Figure 3 shows the structure of these systems, which follow the X/Open standard interface. For comparison with Figure 1, we have turned the X/Open figure 90 degrees, with the transaction manager at the top rather than on the side. These TP monitors do not have global serializability certification, since they assume that every resource manager (EDB) follows the strict 2PL protocol (Section 4.2.1). They also offer only limited commit protocol translation, since they assume the resource managers to follow the XA interface of the X/Open standard.

Compared to these commercial TP monitors, our implementation is modest, but the degree of heterogeneity in our system is larger. For example, all the commercial TP monitors assume the adoption of strict-2PL concurrency control in the EDBMSs, thus bypassing the heterogeneous concurrency control problem. On the recovery side, they assume the adoption of some standard commit protocol; Supernova will do the same until we include the LU 6.2 subset.

Preliminary performance measurements showed the bottlenecks to be the local transactions at the EDBMSs. For example, reading and writing two records in university Ingres and Nova takes from half a second to one second. Camelot is faster, but the supertransactions run at the pace of the slowest EDB. The overhead in Supernova is not measurable under these circumstances. Our current efforts focus on the Harmony interfaces and portability, in particular the addition of new protocols such as X/Open.
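Since the XA interface comes up repeatedly here, the following C sketch gives a rough idea of what a resource-manager switch and the TP monitor's commit loop look like. The structure is abbreviated and the names are simplified from memory of the X/Open specification (the authoritative declarations, including several more entry points, are in the standard xa.h header); the toy resource manager and the driver loop are purely illustrative.

    /*
     * Abbreviated sketch of an XA-style resource-manager switch and the
     * two-phase commit loop a TP monitor drives through it.  Not the real
     * xa.h: fields and entry points are simplified for illustration.
     */
    #include <stdio.h>

    #define XIDDATASIZE 128
    #define XA_OK 0

    typedef struct {
        long formatID;                /* format of the transaction id       */
        long gtrid_length;            /* global transaction id length       */
        long bqual_length;            /* branch qualifier length            */
        char data[XIDDATASIZE];
    } XID;

    typedef struct {
        char name[32];
        int (*xa_prepare_entry)(XID *xid, int rmid, long flags);
        int (*xa_commit_entry)(XID *xid, int rmid, long flags);
        int (*xa_rollback_entry)(XID *xid, int rmid, long flags);
        /* xa_open, xa_close, xa_start, xa_end, xa_recover, ... omitted */
    } rm_switch;

    /* Toy resource manager standing in for Sybase, Informix, or Oracle. */
    static int toy_prepare(XID *x, int rmid, long f)
    { (void)x; (void)f; printf("rm %d prepared\n", rmid);    return XA_OK; }
    static int toy_commit(XID *x, int rmid, long f)
    { (void)x; (void)f; printf("rm %d committed\n", rmid);   return XA_OK; }
    static int toy_rollback(XID *x, int rmid, long f)
    { (void)x; (void)f; printf("rm %d rolled back\n", rmid); return XA_OK; }

    int main(void)
    {
        rm_switch rms[3] = {
            { "rm-a", toy_prepare, toy_commit, toy_rollback },
            { "rm-b", toy_prepare, toy_commit, toy_rollback },
            { "rm-c", toy_prepare, toy_commit, toy_rollback },
        };
        XID xid = { 0x1234, 4, 0, "gtid" };
        int ok = 1;

        /* The TP monitor drives a standard two-phase commit through every
         * resource manager's switch; because each one is assumed to use
         * strict 2PL locally, no global certification step is needed. */
        for (int i = 0; i < 3; i++)
            if (rms[i].xa_prepare_entry(&xid, i, 0) != XA_OK) ok = 0;
        for (int i = 0; i < 3; i++) {
            if (ok) rms[i].xa_commit_entry(&xid, i, 0);
            else    rms[i].xa_rollback_entry(&xid, i, 0);
        }
        return 0;
    }

The contrast with Supernova is visible in the loop: the monitor only prepares and commits, with no O-element flowing back and no certification step, which is exactly the restriction discussed above.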

6 Conclusion

When the superdatabase (SDB) architecture was first proposed [25] in 1986, the main objection to it was the need to change the interface to element databases (EDBs) to make the EDBMSs composable by the SDBMS. The marketplace has since shown the advantages of changing the interface to achieve the interoperability offered by such EDBMSs. Many database backends such as Informix and Sybase export their two-phase commit protocol through the standard X/Open interface, so modern TP monitors such as TopEnd and Tuxedo/T [3] can integrate them as an SDBMS. Therefore, cooperative HeTP has been established as a viable and desirable way to do distributed transaction processing. We note that this trend is not in opposition to multidatabase work (which assumes no cooperation), but complementary to it.

SDB provides atomic transactions across databases with different concurrency control and crash recovery mechanisms. The key idea is to solve the algorithmic heterogeneity problem between the participant EDBMSs. Several reasons make the SDB an attractive approach to consistent HDBs:

1. The SDB structure is straightforward, and a completed prototype implementation demonstrates its feasibility.

2. SDB performance is good. No transaction concurrency is lost for EDBs that share the same concurrency control method. Run-time overhead in both CPU and messages is low. To the best of our knowledge, no other HeTP system does better.

3. Replication and distribution of the SDB for parallelism and availability is easy, although distribution of concurrency control will carry additional message overhead.

4. SDBs can bridge the gap between different standards. This will be useful when making the transition from an older standard to an emerging standard introduced by new technology.

The contributions of this work include the description of the SDB architecture, the algorithms for heterogeneous recovery and concurrency control (for both the SDBMS and the EDBMSs), a number of optimization options for the algorithms, and an implementation containing significant heterogeneity. Compared to the multidatabase work, the EDBMS composability conditions are stricter, but in compensation an SDBMS provides the same degree of consistency (global serializability) as homogeneous TP systems at the same performance level, which is much more than multidatabase systems offer. Recently, commercial open system TP monitors have become available. They are specialized SDBMSs that assume strict 2PL for concurrency control and some standard commit protocol. Thus the usefulness of the SDB architecture has been demonstrated in practice.

7 Acknowledgment

Two of my PhD students contributed significantly to the Harmony project: Avraham Leff on cooperation between databases and Shu-Wie Chen on the implementation of Nova and Supernova. Both have made comments that improved this paper. Many project students contributed to the implementation of the Harmony prototype. The main contributors are listed in alphabetical order.

- Undergraduate project students: Jeff Alvidrez, Steven Harari.

- MS project students: Ariel Blumencwejg, Shiu Chong, Heidi Jones, Surasak Lertpongwipusana, Pierre Nicoli, Pauline Powledge, Mike Sokolsky, Vanessa Sun, Nathan Tanuwidjaja, Les Temple, Boris Umylny, Magdeline Vargas, Holger Veith, Albert Wang.

- PhD student: David Fox.

This work is partially funded by the New York State Center for Advanced Technology on Computer and Information Systems under grant NYSSTF CU-0112580, the National Science Foundation under grant CDA-88-20754, the AT&T Foundation under the Special Purpose Grant program, the Digital Equipment Corporation under the External Research Program, and IBM Fellowships.

References

[1] R. Bayer, K. Elhardt, J. Heigert, and A. Reiser. Dynamic timestamp allocation for transactions in database systems. In H. J. Schneider, editor, Distributed Data Bases. North-Holland, 1982.

[2] P.A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley Publishing Company, first edition, 1987.

[3] P.A. Bernstein and R.W. Taylor, editors. Proceedings of the Fourth International Workshop on High Performance Transaction Systems, Asilomar, California, September 1991.

[4] B. N. Bershad, D. T. Ching, E. D. Lazowska, J. Sanislo, and M. Schwartz. A remote procedure call facility for interconnecting heterogeneous computer systems. IEEE Transactions on Software Engineering, SE-13(8):880-894, August 1987. To be reprinted in Distributed Processing: Concepts and Structures, ed. A.L. Ananda and B. Srinivasan, IEEE Computer Society Press.

[5] Y. Breitbart and A. Silberschatz. Multidatabase update issues. In Proceedings of the 1988 SIGMOD International Conference on Management of Data, pages 135-142, May 1988.

[6] Y. Breitbart, A. Silberschatz, and G. Thompson. Reliable transaction management in a multidatabase system. In Proceedings of the 1990 SIGMOD International Conference on Management of Data, pages 215-224, May 1990.

[7] S. Ceri and S. Owicki. On the use of optimistic methods for concurrency control in distributed databases. In Proceedings of the Sixth Berkeley Workshop on Distributed Data Management and Computer Networks, pages 117-129, University of California, Berkeley, February 1982. Lawrence Berkeley Laboratory.

[8] W.W. Chu, editor. Special Issue on Distributed Database Systems, volume 75:5 of Proceedings of the IEEE. IEEE Press, May 1987.

[9] A. El Abbadi and S. Toueg. The group paradigm for concurrency control protocols. IEEE Transactions on Knowledge and Data Engineering, 1(3):376-386, September 1989.

[10] A. Elmagarmid and W. Du. A paradigm for concurrency control in heterogeneous distributed database systems. In Proceedings of the Sixth International Conference on Data Engineering, pages 37-46, Los Angeles, February 1990.

[11] A.K. Elmagarmid and C. Pu, editors. Special Issue on Heterogeneous Databases, volume 22:3 of ACM Computing Surveys. ACM, September 1990.

[12] K.P. Eswaran, J.N. Gray, R.A. Lorie, and I.L. Traiger. The notions of consistency and predicate locks in a database system. Communications of the ACM, 19(11):624-633, November 1976.

[13] D. Georgakopoulos and M. Rusinkiewicz. On serializability of multidatabase transactions through forced local conflicts. In Proceedings of the Seventh International Conference on Data Engineering, Kobe, Japan, April 1991.

[14] V. Gligor and G.L. Luckenbaugh. Interconnecting heterogeneous database management systems. Computer, 17(1):33-43, January 1984.

[15] V. Gligor and R. Popescu-Zeletin. Concurrency control issues in distributed heterogeneous database management systems. In F.A. Schreiber and W. Litwin, editors, Distributed Data Sharing Systems, pages 43-56. North Holland Publishing Company, 1985. Proceedings of the International Symposium on Distributed Data Sharing Systems.

[16] A. Gupta, editor. Integration of Information Systems: Bridging Heterogeneous Databases. IEEE Press, 1989.

[17] D.R. Jefferson. Virtual time. ACM Transactions on Programming Languages and Systems, 7(3):404-425, July 1985.

[18] H. T. Kung and John T. Robinson. On optimistic methods for concurrency control. ACM Transactions on Database Systems, 6(2):213-226, June 1981.

[19] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558-565, July 1978.

[20] T. Landers and R.L. Rosenberg. An overview of MULTIBASE. In H.J. Schneider, editor, Distributed Data Bases. North Holland Publishing Company, September 1982. Proceedings of the Second International Symposium on Distributed Data Bases.

[21] P.J. Leu and B. Bhargava. Multidimensional timestamp protocols for concurrency control. IEEE Transactions on Software Engineering, SE-13(12):1238-1253, December 1987.

[22] B. Lindsay, L.M. Haas, C. Mohan, P.F. Wilms, and R.A. Yost. Computation and communication in R*: a distributed database manager. ACM Transactions on Computer Systems, 2(1):24-38, February 1984.

[23] R. McCord. INGRES/STAR: a distributed heterogeneous relational DBMS. Vendor presentation in SIGMOD, May 1987.

[24] C. Pu and S.W. Chen. Implementation of a prototype superdatabase. In Proceedings of the Workshop on Experimental Distributed Systems, Huntsville, Alabama, October 1990.

[25] Calton Pu. Superdatabases for composition of heterogeneous databases. In Amar Gupta, editor, Integration of Information Systems: Bridging Heterogeneous Databases, pages 150-157. IEEE Press, 1989. Also appeared in Proceedings of the Fourth International Conference on Data Engineering, 1988, Los Angeles.

[26] D.S. Reiner, editor. Special Issue on Database Connectivity, volume 13:2 of Quarterly Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. IEEE Computer Society, June 1990.

[27] K. Salem and H. Garcia-Molina. Altruistic locking: A strategy for coping with long lived transactions. Technical Report CS-TR-087-87, Department of Computer Science, Princeton University, April 1987.

[28] A. Sheth and J. Larson. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys, 22(3):183-236, September 1990.

[29] A.Z. Spector, D.S. Thompson, R.F. Pausch, J.L. Eppinger, D. Duchamp, R.P. Draves, D.S. Daniels, and J.J. Bloch. Camelot: A distributed transaction facility for Mach and the Internet - an interim report. Technical Report CMU-CS-87-129, Computer Science Department, Carnegie-Mellon University, June 1987.

[30] M. Stonebraker, editor. Special Issue on Database Prototype Systems, volume 2:1 of IEEE Transactions on Knowledge and Data Engineering. IEEE Computer Society, March 1990.

[31] M. Templeton, D. Brill, S. K. Dao, E. Lund, P. Ward, A.L.P. Chen, and R. MacGregor. MERMAID - a front-end to distributed heterogeneous databases. Proceedings of the IEEE, 75(5):695-708, May 1987.

[32] G. Thomas, G.R. Thompson, C-W. Chung, E. Barkmeyer, F. Carter, M. Templeton, S. Fox, and B. Hartman. Heterogeneous distributed database systems for production use. ACM Computing Surveys, 22(3):237-266, September 1990.

[33] K. Vidyasankar. A non-two-phase locking protocol for global concurrency control in distributed heterogeneous database systems. IEEE Transactions on Knowledge and Data Engineering, presumed 1990. Forthcoming.
