B-Tree Concurrency Control and Recovery in Page-Server Database Systems

IBRAHIM JALUTA, Helsinki University of Technology
SEPPO SIPPU, University of Helsinki
ELJAS SOISALON-SOININEN, Helsinki University of Technology

We develop new algorithms for the management of transactions in a page-shipping client-server database system in which the physical database is organized as a sparse B-tree index. Our starvation-free fine-grained locking protocol combines adaptive callbacks with key-range locking and guarantees repeatable-read-level isolation (i.e., serializability) for transactions containing any number of record insertions, record deletions, and key-range scans. Partial and total rollbacks of client transactions are performed by the client. Each structure modification such as a page split or merge is defined as an atomic action that affects only two levels of the B-tree and is logged using a single redo-only log record, so that the modification never needs to be undone during transaction rollback or restart recovery. The steal-and-no-force buffering policy is applied by the server when flushing updated pages onto disk and by the clients when shipping updated data pages to the server, while pages involved in a structure modification are forced to the server when the modification is finished. The server performs restart recovery from client and system failures using an ARIES/CSA-based recovery protocol. Our algorithms avoid accessing stale data but allow a data page to be updated by one client transaction and read by many other client transactions simultaneously, and updates may migrate from one data page to another in structure modifications caused by other transactions while the updating transaction is still active.

Categories and Subject Descriptors: H.2.2 [Database Management]: Physical Design—Access methods; recovery and restart; H.2.4 [Database Management]: Systems—Concurrency; transaction processing

General Terms: Algorithms, Design

Authors’ addresses: I. Jaluta and E. Soisalon-Soininen, Department of Computer Science and Engineering, Helsinki University of Technology, P. O. Box 5400, FIN-02015 HUT, Finland; email: {ijaluta,ess}@cs.hut.fi; S. Sippu, Department of Computer Science, University of Helsinki, P. O. Box 68 (Gustaf Hällströmin katu 2b), FIN-00014 University of Helsinki, Finland; email: [email protected].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].

© 2006 ACM 0362-5915/06/0300-0082 $5.00

ACM Transactions on Database Systems, Vol. 31, No. 1, March 2006, Pages 82–132.


Additional Key Words and Phrases: ARIES, ARIES/CSA, B-tree, cache consistency, callback locking, client-server database system, data shipping, key-range locking, page server, partial rollback, physiological logging, sparse B-tree, structure modification

1. INTRODUCTION

In a data-shipping client-server database management system, database pages or individual objects are shipped from the server to clients, so that clients can run applications and perform database operations on cached pages or objects. Having pages or objects available at the clients can reduce the number of client-server interactions and offload server resources (CPU and disks), thus improving client-transaction response time. Many data-shipping systems have adopted the page-server architecture, because it is the simplest to implement and can utilize the clustering of objects [DeWitt et al. 1990; Franklin et al. 1996b].

Local caching in a page-server system allows for copies of a page to reside in multiple client caches. When intertransaction caching is used, a page may remain cached locally when the transaction that accessed it has terminated. A replica-management protocol is needed to ensure that client transactions can either avoid accessing stale (out-of-date) data in cached page copies (replicas) or can detect when they have accessed stale data (see, e.g., Franklin et al. [1997]). Many page-server systems use callback locking [Howard et al. 1988; Lamb et al. 1991; Wang and Rowe 1991; Franklin and Carey 1994] for the management of page replicas; being avoidance-based, this protocol guarantees that copies of pages in client caches are always up-to-date, so that a client transaction can safely S-lock and read cached pages without server intervention.

When fine-grained (record/object-level) concurrency control is used, a page may contain uncommitted updates by several transactions. For avoidance-based page-server systems, two approaches have been suggested for controlling the concurrent updating of page copies.
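As a concrete illustration, the avoidance-based callback idea can be sketched as follows; the class and method names are our own and the model is deliberately minimal (no latching, no lock table, and the callback is simulated rather than sent as a message):

```python
# Minimal sketch of avoidance-based callback locking (names are ours).
# The server tracks which clients hold a cached copy of each page. Before
# granting exclusive access, it "calls back" (invalidates) all other copies,
# so cached copies are always current and clients may read them locally.

class CallbackServer:
    def __init__(self):
        self.copies = {}  # page_id -> set of client ids caching the page

    def request_read(self, client, page_id):
        # Shared caching: just record the copy; no callback needed.
        self.copies.setdefault(page_id, set()).add(client)

    def request_exclusive(self, client, page_id):
        # Call back the page from every other client before granting
        # exclusive access; their copies are purged from the cache table.
        others = self.copies.get(page_id, set()) - {client}
        for other in others:
            self._callback(other, page_id)
        self.copies[page_id] = {client}
        return sorted(others)  # clients that had to give up their copy

    def _callback(self, client, page_id):
        # In a real system this is a message; here we only drop the entry.
        pass

server = CallbackServer()
server.request_read("c1", "p7")
server.request_read("c2", "p7")
called_back = server.request_exclusive("c3", "p7")
```

After the exclusive grant, only the requesting client remains in the cache table, which is exactly what makes local S-mode reads safe in the avoidance-based scheme.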
In the “update-privilege” (or “write-token”) approach, a page to be updated at some site (a client or a server site) is called back from all other sites, and exclusive access to the page is granted to the updating site for the time of the update [Mohan and Narang 1991, 1994]. In the “update-merging” approach, different sites are allowed to update simultaneously (different records/objects in) different copies of a page; these updates are later merged at the server to create a single current page copy [Mohan et al. 1991; Carey et al. 1994b; Zaharioudakis et al. 1997]. For a discussion of these two approaches, see Panagos et al. [1996] and Zaharioudakis et al. [1997].

Update merging is difficult or impossible if records vary in size or if updates include record insertions and deletions; when merging cannot be done, one must resort to the update-privilege approach anyway. The update-privilege approach also works better than the update-merging approach with the ARIES family of recovery protocols [Mohan 1990a, 1996; Mohan et al. 1992; Mohan and Levine 1992; Mohan and Narang 1994], in which the log sequence number (LSN) stamped in the Page-LSN field of each page is used to indicate the currency of the page. In a performance study by Zaharioudakis et al. [1997], it was found that the update-merging and update-privilege approaches are equivalent in efficiency. The update-merging approach led to a small amount of additional disk reads, while the update-privilege approach led to a small amount of


additional message traffic for the write-token transfers. In both cases, these special overheads were so small compared to other common overheads that they did not play a significant role in the overall performance.

Mohan and Narang [1994] presented a version of the ARIES recovery algorithm for page-server systems. This algorithm, called ARIES/CSA, assumes the update-privilege approach, so that a page that is being updated by some client transaction may reside cached at that client only, while a page that is currently being only read may reside cached at several clients. ARIES/CSA assumes lazy update propagation for all pages, so that a page updated at a client need not be shipped to the server until the page is needed at the server or at some other client, or until the client cache becomes full. According to the client-server write-ahead-logging (WAL) protocol, the log records generated by a client transaction for updates on a page must be shipped from the client to the server before the updated page is shipped, and the log records must be forced to the log disk at the server when the transaction commits.

The performance of different concurrency-control and replica-management algorithms has been the subject of many studies in page-server or object-server systems [Wilkinson and Neimat 1990; Carey et al. 1991; Carey and Livny 1991; Wang and Rowe 1991; Franklin and Carey 1994; Franklin et al. 1992a, 1993; Carey et al. 1994b; Adya et al. 1995; Liskov et al. 1996; Panagos and Biliris 1997; Franklin et al. 1997; Zaharioudakis et al. 1997; Özsu et al. 1998; Voruganti et al. 1999, 2004]. On the other hand, less attention has been paid to recovery and index management (B-trees), and to repeatable-read-level isolation of transactions containing record insertions, record deletions, and key-range scans. Besides Mohan and Narang [1994], ARIES-based recovery protocols have been discussed in Deux [1991], Franklin et al. [1992b], Carey et al. [1994a], White and DeWitt [1994], Panagos et al. [1996], and Voruganti et al. [2004]. Algorithms for B-tree management in page-server systems are presented in Gottemukkala et al. [1996], Basu et al. [1997], and Zaharioudakis and Carey [1997]. The performance studies on optimistic detection-based concurrency-control and replica-management protocols have not shown how B-tree indexes are managed or how repeatable-read-level isolation could be achieved in a fine-grained manner in the presence of insertions and deletions of objects and scans over object collections. Index pages have more relaxed consistency requirements than data pages, and they are accessed more often and hence cached at clients more often than data pages.

In the B-tree-index-management algorithms for centralized database systems that work with the ARIES-based recovery algorithms [Mohan 1990a, 1996; Mohan and Levine 1992], a page that is subject to a record update or a structure modification (such as a page split or merge) by a transaction T is kept write-latched only for the (short) duration of performing the update or modification, so that the page can be updated and modified by other transactions while T is still active.

Besides the work on ARIES and ARIES/CSA [Mohan et al. 1992; Mohan and Narang 1994; Mohan 1996], the work most closely related to our article appeared in Gottemukkala et al. [1996], Basu et al. [1997], and Zaharioudakis and Carey [1997], which all considered the management of dense B-tree indexes


in a page-server system in which each client is assumed to run only a single transaction at a time.

In the simulation-based performance study by Basu et al. [1997], concurrency control and replica management (callback locking) on index pages were performed at the page level. Concurrency control for data pages is adaptive, so that a page-level lock held by a client transaction can be replaced by a set of record-level locks should transactions at other clients want to simultaneously access other records in the locked page. Immediate propagation for updated pages is assumed, so that a copy of an updated page is shipped to the server immediately after the update. B-tree structure modifications are not considered, and the simulation only involves index leaf pages, while page overflow and page underflow are not dealt with. The simulation does not consider repeatable-read-level isolation for transactions, nor recovery in the face of client-transaction and system failures.

Gottemukkala et al. [1996] presented a simulation-based comparative study of two index-cache-consistency algorithms for page-server systems. These algorithms are called the callback-locking (CBL) and relaxed-index-consistency (RIC) algorithms, respectively. The CBL algorithm is avoidance-based and uses a strict-consistency approach for index and data pages. Strict consistency is achieved by a page-level callback-locking protocol. On the other hand, the RIC algorithm is detection-based in that it uses a strict-consistency approach for data pages but a relaxed-consistency approach for index pages. In the relaxed-consistency approach, clients are responsible for detecting stale copies of index pages. To do so, the RIC algorithm maintains index-coherency information, both at the server and at the clients, in the form of index-page version numbers. The server updates its index-coherency information whenever it receives a request for an X latch on an index page from a client. When a client sends a request to the server for a record lock, the server sends the updated index-coherency information along with the record lock when the request is granted. The client uses this index-coherency information to check the validity of its cached copies of index leaf pages. The callback-locking protocol used in the algorithms in Gottemukkala et al. [1996] works at the page level, and it is not clear how to extend this protocol to work at the record level or in an adaptive manner. These algorithms do not show how structure modifications are handled or how recovery is performed, and no explicit B-tree algorithms are presented. Moreover, the management of index-coherency information is quite complex.

Zaharioudakis and Carey [1997] presented a simulation study with four approaches to the cache consistency and concurrency control of dense B-trees in page-server systems. The four approaches are called no-caching (NC), strict hybrid caching (HC-S), relaxed hybrid caching (HC-R), and full caching (FC). These approaches assume a B-tree in which the pages on each level of the tree are doubly linked, allowing the propagation of structure modifications along the tree in a lazier way than in ARIES/IM [Mohan and Levine 1992]. In the NC approach, clients do not cache B-tree pages at all, so that the server executes all B-tree operations on client requests. In the HC-S approach, clients are only allowed to cache index leaf pages, and clients can perform fetch actions and to some extent also fetch-next actions, while the server still


performs page updating on client requests. The HC-R approach is similar to the HC-S approach, but HC-R uses a relaxed-consistency approach for cached index leaf pages. When the server updates a leaf page, copies of that page may remain cached at other clients for reading. Clients are prevented from accessing stale key ranges by using a locking protocol that is less permissive than the standard key-range locking protocol. In the NC, HC-S, and HC-R approaches, the server executes most of the B-tree operations on client requests. In the FC approach, clients cache both interior and leaf pages of the dense B-tree index and are able to perform all B-tree operations locally. To update a page at a client, an exclusive update privilege must be acquired on the page, so that all copies of the page at other clients must be called back. Transaction rollback or restart recovery was not considered in Zaharioudakis and Carey [1997].

In this article, we present a comprehensive set of new algorithms for the management of transactions in a page-server system in which the physical database is organized as a sparse B-tree, that is, a B+-tree whose leaf pages are the data pages of the database and store the actual data records (which may vary in length; see Section 3). In our page-server model, each client can run several transactions concurrently (Section 4). Unlike in the protocols presented in Gottemukkala et al. [1996], Basu et al. [1997], and Zaharioudakis and Carey [1997], in our page-level replica-management protocol only the pages that are subject to a structure modification need be cached in exclusive (X) mode, while leaf pages that are subject to a record insertion or deletion need only be cached in update (U) mode (Sections 4 and 8). Update mode is analogous to U-locking [Mohan and Narang 1991; Gray and Reuter 1993; Lomet and Salzberg 1997] in that a page cached in update mode at a client may simultaneously be cached at other clients in shared (S) mode. This protocol combined with our fine-grained concurrency-control protocol allows a transaction at one client to update records in a leaf page while transactions at other clients are simultaneously reading other records from a copy of that page (Section 9).

Update propagation is immediate for pages involved in a structure modification and lazy for updated data (leaf) pages (Section 7). Holding a page cached at a client in X mode is of short duration in that X mode is downgraded to U or S mode immediately after the structure modification is completed and the pages involved in the modification have been shipped to the server (Section 10). Pages may reside cached at a client in U or S mode for an indefinite time, but the caching of an unlatched page is given up immediately when the page is called back to be cached in X mode at some other site. All latches are held for a short duration only, and no process ever holds any latch while waiting for a lock (Sections 8, 9, and 12). A record updated by a client transaction may migrate from its page to another page, and copies of the page containing the updated record may migrate to other clients, while the updating transaction is still active.

Our fine-grained locking protocol is based on standard key-range locking (Section 5), where the lock names are hash values computed from the primary keys of data records [Mohan 1990a, 1996; Mohan and Levine 1992; Gray and Reuter 1993]. Although stale copies of data pages may be around in the system and read by transactions, our implementation of the locking protocol prevents


access to stale key ranges and thus guarantees repeatable-read-level isolation (i.e., serializability) for transactions (Section 9). Our algorithms work in the general setting in which any number of forward-rolling and backward-rolling transactions, as well as transactions doing partial rollbacks, are running concurrently, and each transaction contains any number of record insertions, record deletions, and key-range scans.

Our recovery protocol is based on the ARIES/CSA recovery algorithm [Mohan and Narang 1994], and it deals with partial and total rollbacks of transactions and restart recovery from client and server failures (Sections 12 and 13). In our protocol, total and partial rollbacks of client transactions during normal processing are performed by the client, while the server performs the restart recovery after a system failure. The server is assumed to apply the steal and no-force buffering policies. Record insertions and deletions on leaf pages of the B-tree are logged using physiological redo-undo log records as in Mohan [1990a, 1996], Mohan and Levine [1992], Gray and Reuter [1993], and Lomet and Salzberg [1992, 1997] (Section 6).

Our technique for handling B-tree structure modifications is different from that in Mohan [1990a, 1996], Mohan and Levine [1992], and Zaharioudakis and Carey [1997]. In our algorithms, no special locking mechanism during normal processing is needed to guarantee correct recovery from system failures, nor is any additional pass needed to handle structure modifications correctly in restart recovery.
This is in contrast to the algorithms in Mohan [1990a, 1996] and Mohan and Levine [1992], in which structure modifications are effectively serialized by the use of “tree latches” in order to guarantee that logical undo operations always see a structurally consistent tree, and to multilevel recovery algorithms [Lomet 1992, 1998; Lomet and Salzberg 1997], in which two undo passes over the log are needed, the first to undo the incomplete structure modifications and the second to undo the record insertions and deletions of aborted transactions. This is because in these algorithms a single structure modification may involve any number of levels of the B-tree, and several redo-undo log records are used to log such a modification in the name of the transaction that triggered the modification.

We identify five structure modifications for our B-tree, namely, split, merge, redistribute, increase-height, and decrease-height, and define each such modification as an atomic action that involves only three pages (at two adjacent levels) of the tree (Section 10; cf. Jaluta [2002]; Jaluta et al. [2005]). Each successfully completed structure modification brings the B-tree into a structurally consistent and balanced state whenever the tree was structurally consistent and balanced initially. Each structure modification is logged using a single redo-only log record. Thus, the redo pass of our restart-recovery algorithm will always produce a structurally consistent and balanced B-tree, hence making possible the logical undo of record insertions and deletions. Our algorithms maintain the strict balance of the B-tree under both record insertions and deletions, so that the time taken by a B-tree traversal from the root page down to the target leaf page is always logarithmic in the number of data records (Sections 11 and 12).
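To illustrate, one redo-only log record per structure modification might be modeled as follows; the record layout and field names are our own illustration, not the paper's actual log format, and the log is an in-memory list rather than a WAL-managed file:

```python
# Sketch: each of the five structure modifications is an atomic action on
# three pages at two adjacent B-tree levels, described by a single
# redo-only log record, so it is never undone during rollback or restart.
# (Record layout and field names are hypothetical.)

from dataclasses import dataclass, field
from enum import Enum

class SMO(Enum):
    SPLIT = "split"
    MERGE = "merge"
    REDISTRIBUTE = "redistribute"
    INCREASE_HEIGHT = "increase-height"
    DECREASE_HEIGHT = "decrease-height"

@dataclass
class RedoOnlyRecord:
    lsn: int
    kind: SMO
    parent: int          # page-id of the parent (upper-level) page
    left: int            # page-id of the modified child page
    right: int           # page-id of its sibling (a new page for a split)
    payload: dict = field(default_factory=dict)  # e.g., the split key

class Log:
    def __init__(self):
        self.records, self.next_lsn = [], 1

    def append(self, kind, parent, left, right, **payload):
        rec = RedoOnlyRecord(self.next_lsn, kind, parent, left, right, payload)
        self.records.append(rec)
        self.next_lsn += 1
        return rec.lsn  # would be stamped into the Page-LSN of all three pages

log = Log()
lsn = log.append(SMO.SPLIT, parent=3, left=7, right=12, split_key=42)
```

Because the record is redo-only, the redo pass can replay it to restore a structurally consistent, balanced tree, and no undo logic for structure modifications is ever required.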


2. LOGICAL DATABASE AND TRANSACTION MODEL

We assume that our logical database consists of data records of the form (x, w), where x is the key of the record and w denotes the value of the record (representing the values of the other attributes of the record). The keys are unique, and there is a total order, ≤, among the keys. The least key is denoted by −∞ and the greatest key is denoted by ∞. We assume that the keys −∞ and ∞ do not appear in data records; they can only appear in B-tree index records.

A transaction on our logical database can contain any number of forward database actions of the following forms in the forward-rolling phase of the transaction (cf. Sippu and Soisalon-Soininen [2001]; Jaluta [2002]; Jaluta et al. [2005]):

— I[x, w]: insert a new record (x, w). Given a key x and a value w, insert the record (x, w) into the database, provided that the database does not contain a record with key x.

— D[x, w]: delete the record with key x. Given a key x, delete the record, (x, w), with key x from the database, if the database contains such a record.

— F[x, θz, w]: fetch the first matching record (x, w). Given a key z, find the least key x and the associated value w such that x θ z and the record (x, w) is in the database. If there is no such record in the database, the special record (∞, 0) is returned. Here θ is one of the comparison operators ≥ or >. We also say that the operation scans the key range [z, x] (if θ is ≥) or (z, x] (if θ is >).

The action F[x, θz, w] can be used to simulate both the find-first and find-next actions used in Mohan [1990a, 1996] and Mohan and Levine [1992] to scan the records within a key range. To fetch the first record in the range, the action F[x1, ≥ z, w1] is performed, where z is the lower bound of the range. To fetch the next record in the range, the action F[x2, > x1, w2] is performed, etc. We may write the actions I[x, w], D[x, w], and F[x, θz, w] as I[x], D[x], and F[x, θz], respectively, for short.
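As a sketch of the fetch semantics, the logical database can be modeled as a plain Python dict keyed by record keys (our simplification; in the paper the database is stored in a sparse B-tree):

```python
# Sketch of the fetch action F[x, θz, w] on the logical database, modeled
# here as a dict mapping keys to values. Given a key z and comparison θ
# (>= or >), it returns the record with the least key x such that x θ z,
# or the special record (∞, 0) if no key matches.

import operator

def fetch(db, z, theta=operator.ge):
    """Return (x, w) with the least x satisfying x theta z, else (inf, 0)."""
    matching = [x for x in db if theta(x, z)]
    if not matching:
        return (float("inf"), 0)   # the special record (∞, 0)
    x = min(matching)
    return (x, db[x])

db = {10: "a", 20: "b", 30: "c"}
first = fetch(db, 20, operator.ge)      # find-first: scans key range [20, 20]
nxt = fetch(db, first[0], operator.gt)  # find-next: scans key range (20, 30]
```

Iterating `fetch` with `>` and the previously returned key, as in the last line, is exactly the key-range scan pattern described above.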
For an insert or a delete action o, we define the undo action (or the backward database action), denoted by o⁻¹ or undo-o, which can appear in the backward-rolling phase of an aborted transaction as well as in a partial rollback of a forward-rolling transaction:

— I⁻¹[x, w] or undo-I[x, w]: undo the insertion of (x, w). Delete from the database the record (x, w) inserted into the database by the corresponding forward action of the same transaction.

— D⁻¹[x, w] or undo-D[x, w]: undo the deletion of (x, w). Reinsert into the database the record (x, w) deleted from the database by the corresponding forward action of the same transaction.

In addition to the above actions, a transaction can contain the following control actions:

— B: begin a new transaction;

— C: commit a forward-rolling transaction or complete the rollback of a backward-rolling transaction;


— A: abort a forward-rolling transaction, that is, change the state of the transaction from a forward-rolling transaction to a backward-rolling transaction;

— S[P]: set savepoint P;

— A[P]: begin partial rollback to savepoint P;

— C[P]: complete the partial rollback to savepoint P.

The actions S[P], A[P], and C[P] are used to control partial rollbacks within the forward-rolling phase of a transaction; they do not change the state of the transaction.

In normal transaction processing, a transaction can be in one of the following four states: forward-rolling, committed, backward-rolling, or rolled-back. A forward-rolling transaction is a string of the form Bα, where B is the begin-transaction action and α is an action string constituting the forward-rolling phase of the transaction (defined below). A committed transaction is of the form BαC, where Bα is a forward-rolling transaction and C is the commit action. A backward-rolling transaction is of the form BαβAβ⁻¹, where Bαβ is a forward-rolling transaction, A is the abort action, and β⁻¹ is the undo string of β (defined below). The string β⁻¹ is called the backward-rolling phase of the transaction. A backward-rolling transaction BαβAβ⁻¹ denotes a transaction that has undone a suffix, β, of its forward-rolling phase, while the prefix, α, is still not undone. A transaction of the form BαAα⁻¹C is a rolled-back transaction or a transaction that has completed its (total) rollback.

A transaction is an active transaction if it is forward-rolling or backward-rolling, an aborted transaction if it is backward-rolling or rolled-back, and a terminated transaction if it is committed or rolled-back.

The action string constituting the forward-rolling phase of a transaction may take one of the following three forms:

(1) A string α of database actions I, D, and F, and set-savepoint actions S.

(2) A string of the form αS[P]βA[P]β⁻¹C[P]γ, where α, β, and γ are strings of the form (1) or (2). Here the substring S[P]βA[P]β⁻¹C[P] represents a completed rollback to savepoint P.

(3) A string of the form αS[P]βδA[P]δ⁻¹, where α, β, and δ are strings of the form (1) or (2). Here the substring S[P]βδA[P]δ⁻¹ indicates that the transaction is in the process of rolling back to savepoint P.

The undo string α⁻¹ of an action string α of one of the above forms is defined as follows. For a string α of form (1), let α1 be the substring of α containing only I and D actions. If α1 is empty, then α⁻¹ is defined to be empty. Otherwise, if α1 is α2o, where o is a single action, then α⁻¹ is defined to be o⁻¹α2⁻¹. For a string of form (2), the undo string is defined to be γ⁻¹α⁻¹. For a string of form (3), the undo string is defined to be β⁻¹C[P]α⁻¹.

Example 2.1.

The transaction

B I[x1, w1] S[P1] I[x2, w2] S[P2] I[x3, w3] A[P2] I⁻¹[x3, w3] C[P2] I[x4, w4] A[P1] I⁻¹[x4, w4] I⁻¹[x2, w2] C[P1] I[x5, w5] A I⁻¹[x5, w5] I⁻¹[x1, w1] C


is a rolled-back transaction, which in its forward-rolling phase inserts (x1, w1), sets savepoint P1, inserts (x2, w2), sets savepoint P2, inserts (x3, w3), begins a partial rollback to savepoint P2, undoes the insertion of (x3, w3), completes the partial rollback, inserts (x4, w4), begins a partial rollback to savepoint P1, undoes the insertions of (x4, w4) and (x2, w2), completes the partial rollback, inserts (x5, w5), aborts the transaction, that is, starts the total rollback of the transaction, undoes the insertions of (x5, w5) and (x1, w1), and completes the total rollback.

3. SPARSE B-TREES

Our physical database consists of a set of pages. A page can be a data page, index page, free page, or storage-map page. A storage-map page contains a bit vector that indicates, for a range of page identifiers (page-ids), which pages are allocated (1) and which are free (0). A data page is an allocated page that is used to store data records. An index page is an allocated page that is used to store index records. Each page contains a type field that indicates the type (data, index, free, storage-map) of the page, and a Page-LSN field that contains the log sequence number (LSN) of the most recent update (record insert, record delete, structure modification, or page allocation) performed on the page. (Section 6 explains how LSNs are generated.)

The data and index pages form a sparse B-tree, that is, a B+-tree whose leaf pages are data pages and whose nonleaf pages are index pages [Bayer and Schkolnick 1977]. The logical database represented by a B-tree b, denoted by db(b), is the set of data records in its leaf pages. Each page p of the B-tree covers a range of keys x, low-key(p) ≤ x < high-key(p). The ranges covered by the list of pages, p1, . . . , pk, at each single level of the tree are disjoint and together cover the entire key space, that is, low-key(p1) = −∞, high-key(pi) = low-key(pi+1) for all i = 1, . . . , k − 1, and high-key(pk) = ∞. A data page p can contain data records (x, w) where x is covered by p. An index page p can contain index records (y, q), called child links, where y is a key covered by p and q is the page-id of the child page of p with low-key(q) = y. For each index page p, the key in the first (or lowest-key) record in p is low-key(p).

As usual, we assume that the data pages are chained from left to right; the link in data page p is represented together with the high-key value of p as a high-key record (high-key(p), p′). The page p′ pointed to by the high-key record is the next page of p. The high-key record in the data page p with high-key(p) = ∞ is (∞, 0). Each data or index page contains a height field that stores the height (i.e., the number of levels) of the subtree rooted at that page. The height of a subtree rooted at a leaf page is one. The height field is used by our traversal algorithms (see Section 11), which have to know the mode (read or write) in which to latch the next page in the search path, before accessing the page.

Data records and their keys may be of varying length. However, we assume that a data record never takes more space than one eighth of the space reserved for the records in a data page. More specifically, we assume that each page can hold a maximum of M1 ≥ 8 data records or a maximum of M2 ≥ 8 index records. Let m1, 2 ≤ m1 < M1/2, and m2, 2 ≤ m2 < M2/2, be the chosen minimum load


Fig. 1. Page-server system with page buffers (a) and processes (b).

factors for a nonroot data page and a nonroot index page, respectively. We say that a B-tree page p is underflown if either (1) p is a nonroot data page and contains less than m1 data records, or (2) p is a nonroot index page and contains less than m2 index records, or (3) p is a root index page and contains only one index record. A B-tree is structurally consistent if it satisfies the basic definition of the B-tree, so that all the paths from the root page down to any leaf page are of equal length, and the data page covering a given key can be accessed from the root by following child links. A structurally consistent B-tree can contain underflown pages. We say that a structurally consistent B-tree is balanced if none of its pages are underflown. We say that a B-tree page p is about to underflow if either (1) p is a nonroot data page and contains only m1 data records, or (2) p is a nonroot index page and contains only m2 index records, or (3) p is a root index page and contains only two index records. One reason for using a sparse B-tree, instead of a dense B-tree, as our physical database model is that in that way it is easiest to demonstrate the aspect that a data record containing an uncommitted update can migrate from its page to another page while the updating transaction is still active. Moreover, the sparse B-tree has been found practical in modern database management: it is the structure for the “index-organized table” in Oracle [Srinivasan et al. 2000]. It is easy to adapt our algorithms to work with dense B-trees (see Section 15). 4. PAGE-SERVER MODEL Our database system consists of a page-server system with one server, s, connected to n clients, c1 , . . . , cn , via a network. The server manages the disk version of the database and the log file, while the clients do not use local disks for permanent storage of the database nor of the log records. Thus, our page-server model is different from models (such as the one in Panagos et al. 
[1996]) in which the clients are executing in a stable environment. The server uses the server buffer to buffer database pages. Each client has a local client buffer, also called the client cache, to buffer database pages fetched from the server (see Figure 1(a)).

The server s runs a multithreaded database server process, S, that services requests coming from the clients using the principle of page shipping. There is
one server thread for each client. Each client ci runs a multithreaded client database process, Ci, that services requests coming from application processes, Aij, running at ci. There is one client thread for each application process (see Figure 1(b)). In addition, there is one client thread dedicated to servicing callback requests coming from the server.

The application processes Aij at ci communicate with the client database process at ci using the principle of query shipping: they send SQL queries to the client database process Ci, which parses and optimizes the queries, unravels the optimized queries into sequences of record-wise insert, delete, and fetch actions, performs these actions on the cached pages, and generates log records for the insert and delete actions.

Each application process runs a database application and generates a sequence of transactions. Only one transaction is generated by one application process at a time, but, as each client can run several application processes concurrently, there may be several transactions active at each client at any time. The server is also capable of running transactions of its own that perform their actions directly on the pages residing in the server buffer. Each site c (a client or the server) generates transaction identifiers (transaction-ids) T locally; the pair (c, T) uniquely identifies a transaction across all transactions in the system.

A page can reside in the server buffer or in a client buffer in one of three caching modes: S, U, or X (cf. Mohan and Narang [1991]). To access the contents of a page, the page must be cached at the accessing site in one of these modes. S mode (or shared mode) only allows for reading the page contents. Several copies of a page may simultaneously reside cached in S mode at different sites, besides the copy at the server (if any). To insert a record into a data page, or to delete a record from a data page, the page has to be cached in U mode (or update mode).
For any data page, at most one copy may be cached in U mode at a time; in addition to this copy, other copies of the page may simultaneously be cached in S mode. To perform a structure modification such as a page split or merge on an index page or a data page, the page must be cached in X mode (or exclusive mode). For any page, only one copy may be cached in X mode at a time, and no other copies may simultaneously be cached in any mode; this also holds for the server copy. In our algorithms, caching in U mode applies to data pages only, while index pages are only cached in S or X mode. We assume that the buffer-control block of each page cached at a client is used to store the mode in which the page is currently cached at the client.

To keep track of all the page copies cached at the clients, the server maintains a main-memory hash table, called the page-caching map, of all tuples (p, c, m), where p is the page-id of a page currently cached at site c in mode m. The table is indexed by the page-ids. The server does not need to register explicitly the mode in which a page in the server buffer is cached at the server. A page p in the server buffer is regarded as cached at the server in S mode if p is also cached at some client in U mode, in X mode if p is not cached at the clients, and in U mode if p is cached at some clients in S mode but at no client in U mode.

When there are several cached copies of an index page around in the system, these copies must always have the same contents. In other words, all the copies
must have an equal Page-LSN value. On the contrary, the cached copies of a data page need not have the same contents. Their Page-LSNs may be different, because different sites may have obtained their copy from the server at different times, and the copy cached in U mode and updated at a client may have migrated several times via the server to other clients. We only require that each cached copy of a data page always covers the same key range, that is, for any two cached copies p1 and p2 of a data page, low-key(p1) = low-key(p2) and high-key(p1) = high-key(p2).

In our approach, one of the copies of any page is always a unique current version, that is, the most up-to-date copy of the page. This is consistent with the ARIES-based recovery protocols [Mohan 1990a, 1996; Mohan et al. 1992; Mohan and Levine 1992; Mohan and Narang 1994], which require the Page-LSN value of each page to uniquely designate the currency of the page. Among the copies of a page, the one with the greatest Page-LSN value is always the current version of the page, and for any two copies p1 and p2 of a page, if Page-LSN(p1) < Page-LSN(p2), then p2 is more up-to-date than p1, so that p2 is obtained from p1 by redoing on p1 the updates for the page logged by log records with LSNs greater than Page-LSN(p1) and less than or equal to Page-LSN(p2).

The disk version pd of page p is the version of p found on the server disk. The server version ps of p is the version of p found in the server buffer, if p is there, and the disk version of p otherwise. The client version pc of p for client c, or the client-c version of p, is the version of p found in the buffer of c, if p is cached at c, and the server version of p otherwise. The site-c version of p, for site c, is the client-c version of p if c is a client, and the server version of p if c is the server.
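As a minimal sketch of the currency rule just stated (our own illustration, not the paper's code), the copy with the greatest Page-LSN value among the cached copies of a page is the current version:

```python
# Sketch: Page-LSN designates currency; among the copies of a page,
# the one with the greatest Page-LSN is the current version.

def current_version(copies):
    """copies: dict mapping site name -> Page-LSN of that site's copy.
    Returns the site holding the current (most up-to-date) version."""
    return max(copies, key=copies.get)

def more_up_to_date(lsn1, lsn2):
    """True if a copy with Page-LSN lsn2 is strictly more up-to-date than
    one with Page-LSN lsn1; the newer copy can be obtained from the older
    one by redoing the logged updates with LSNs in (lsn1, lsn2]."""
    return lsn1 < lsn2
```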
The following two conditions must always hold:

(a) If page p is cached at client c in S mode, then the server version ps of p is not less up-to-date than the client version pc of p. In other words, Page-LSN(ps) ≥ Page-LSN(pc).

(b) If page p is cached at client c in U or X mode, then the client version pc of p is not less up-to-date than the server version ps of p. In other words, Page-LSN(pc) ≥ Page-LSN(ps).

Condition (a) implies that if p is not cached at any client in U or X mode, then the server version of p is the current version. Conditions (a) and (b) together imply that if p is cached at site c (a client or the server) in U or X mode, then the site-c version of p is the current version. The above conditions are maintained if we do the following: (i) if client c wants to cache in S mode a page p that is not cached elsewhere in U or X mode, then c is given the server version of p, or, if so requested, the current version; (ii) if client c wants to cache in U or X mode a page p, then c is given the current version of p; (iii) if the current version requested by client c resides at some other site c′, then the current version is first installed at the server and then at c; (iv) if client c ceases to cache page p in U or X mode, or downgrades the mode to S mode, then the client-c version of p must first be shipped to the server, where it replaces the server version of p. Actually, in case (i), for reasons that will
become apparent later, we choose to give c the current version of p whenever p is not yet cached at c or c explicitly requests the cached copy of p to be replaced by the current version.

The following condition, which holds for any centralized database system, naturally also holds for our system:

(c) The server version ps of any page p is not less up-to-date than the disk version pd of p. In other words, Page-LSN(ps) ≥ Page-LSN(pd).

The disk version (respectively server version or client-c version or current version) of a physical database is the collection of disk versions (respectively server versions or client-c versions or current versions) of its pages. The disk version (respectively server version or current version) of a logical database is the collection of data records stored in disk versions (respectively server versions or current versions) of its data pages. Note that we define no “client-c version” of the logical database. This is because we will insist that no client transaction shall ever access stale data, so that, whenever the isolation level set for a transaction allows the transaction to access a data record, that record belongs to the current version of the logical database.

5. TRANSACTIONAL ISOLATION BY KEY-RANGE LOCKING

A transaction running concurrently with other transactions must be protected against isolation anomalies of three types: dirty writes, dirty reads, and unrepeatable reads (cf. Berenson et al. [1995]; Sippu and Soisalon-Soininen [2001]). A key x is dirty by an active transaction T, with respect to the current logical database and the current states of transactions, if there are not-yet-undone updates by T on the record with key x; in other words, if the last update (insertion or deletion) on the record with key x is either (1) by a forward action I[x] or D[x] by T or (2) by an undo action I⁻¹[x] or D⁻¹[x] by T that undoes a forward action I[x] or D[x] by T that is not the first update on x by T.
An insertion or a deletion of a record (x, w) by transaction T2 is a dirty write if the database contains a record with key x that is dirty by some other transaction T1. Scanning a key range by transaction T2 is a dirty read if the key range scanned contains a key x that is dirty by some other transaction T1. Scanning a key range by a transaction T1 is an unrepeatable read if the key range scanned contains a key x such that some record with key x is later inserted or deleted by some other transaction T2 while T1 is still forward-rolling. A transaction runs at the repeatable-read isolation level (or is serializable) if it does no dirty writes, dirty reads, or unrepeatable reads. A transaction runs at the read-committed isolation level if it does no dirty writes or dirty reads.

Serializability for all transactions can be guaranteed by key-range locking or key-value locking [Mohan 1990a, 1996; Mohan and Levine 1992; Gray and Reuter 1993]. We consider key-range locking in its simplest form, which uses only the two basic lock modes, S (shared) and X (exclusive). Transactions protect their insert, delete, and fetch actions by locking keys of records with commit-duration or short-duration S or X locks. Short-duration locks are held only for the time the action is being performed, while commit-duration
locks are released only after the transaction has committed or completed its rollback.

To perform a forward insert action I[x, w], the transaction acquires a commit-duration X lock on the key x and a short-duration X lock on the next key y, that is, the least key y such that x < y and there is a record with key y in the database. If there is no record with key greater than x in the database, the next key of x is defined to be ∞. To perform a forward delete action D[x, w], the transaction acquires a short-duration X lock on the key x and a commit-duration X lock on the next key y. To perform a fetch action F[x, θz, w], the transaction acquires an S lock on the key x; the lock is of commit duration if serializability for the transaction is required, and of short duration if the transaction is run at the read-committed isolation level.

No additional locks are acquired for backward actions. To undo the insertion of record (x, w) by the undo-insert action I⁻¹[x, w], the transaction just deletes (x, w) under the protection of the commit-duration X lock acquired on x for the corresponding forward action I[x, w]. To undo the deletion of record (x, w) by the undo-delete action D⁻¹[x, w], the transaction just inserts (x, w) under the protection of the commit-duration X lock acquired on the next key y for the corresponding forward action D[x, w].

To implement the key-range locking protocol, each client c maintains a local client lock table that contains the local locks for c, that is, locks acquired locally for transactions running at c. The server maintains a server lock table that contains the locks acquired for transactions running at the server and any global locks of client transactions, that is, local locks from the clients that have also been registered at the server. A lock-table entry for a lock of mode m and of duration d on key x for transaction T at site c is a tuple of the form (c, T, x, m, d), where x denotes the lock name for key x.
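The key-range locking rules above can be sketched as follows. This is a simplified illustration (names and representation are ours), assuming the existing keys are kept in a sorted list, with INF playing the role of the artificial next key ∞:

```python
# Sketch of next-key computation and the lock requests for forward
# insert and delete actions under key-range locking.
import bisect

INF = float("inf")  # stands for the artificial next key ∞

def next_key(keys, x):
    """Least key y with x < y, or INF if no record with key > x exists.
    keys must be sorted in ascending order."""
    i = bisect.bisect_right(keys, x)
    return keys[i] if i < len(keys) else INF

def locks_for_insert(keys, x):
    # I[x, w]: commit-duration X lock on x, short-duration X lock on next key
    return [(x, "X", "commit"), (next_key(keys, x), "X", "short")]

def locks_for_delete(keys, x):
    # D[x, w]: short-duration X lock on x, commit-duration X lock on next key
    return [(x, "X", "short"), (next_key(keys, x), "X", "commit")]
```

The asymmetry (commit-duration lock on the inserted key, but on the deleted key's next key) is what protects key-range scans from phantoms in both directions.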
The lock name of key x may be the key x itself, some truncated version of x, or any hash value computed from x. In an efficient organization of the lock table, the lock names are fixed-length hash values [Gray and Reuter 1993; Mohan 1996]. The lock table is organized as a hash table indexed by the lock names. There is also a secondary index to the table on the client-id-transaction-id pairs (c, T), or the lock-table entries for each pair (c, T) are chained.

Every X lock for a client transaction must be acquired both locally and globally, whereas S locks need only be acquired locally when the key to be locked is not X-locked by any other transaction. A local S lock held by transaction T at client c on a key x need not be registered globally as long as the data page p covering x resides cached at c and the key range covered by p does not change. If T commits before p is purged or becomes subject to a structure modification, the S lock need never appear in the server lock table. However, when data page p is to be purged from the buffer of c, or the caching mode of p is to be changed to X mode (for a structure modification), all the local S locks on keys covered by p that have not yet been registered globally at the server must now be registered globally.

To keep track of the local S locks that have not yet been registered globally, each client c maintains a main-memory table, called the lock-to-page map, of
tuples (x, T, p), where x is the lock name of a local S lock held by transaction T and p is the page-id of the data page cached at c that currently covers a key whose lock name is x. Obviously, a local S lock on x held by T at c must be registered globally at the moment the information about the lock stored in the lock-to-page map at c is going to become invalid, that is, when the page p covering the locked key is purged from the buffer of c, or the record with the locked key moves from p to another page in a structure modification. After the global registration of the lock (c, T, x, S, d), the tuple (x, T, p) is deleted from the lock-to-page map. The lock-to-page map is organized as a hash table indexed by the page-ids, and there is also a secondary index to the table on the lock names.

At client c, the function local-S-locks(p) returns the set of S locks (c, T, x, S, d) in the lock table of c for which the entry (x, T, p) is found in the lock-to-page map at c, and deletes those entries (x, T, p) from the lock-to-page map, if p is a data page; otherwise it returns an empty set. The mechanism used to detect conflicts between X locks at one client and local S locks at another client is explained in Section 9.

6. LOGGING AND LSN MANAGEMENT

In our page-server system, updates are logged using physiological logging [Mohan 1990a, 1996; Mohan et al. 1992; Mohan and Levine 1992; Gray and Reuter 1993]. Forward update actions are logged as follows. The log record for the insert action I[x, w] performed by transaction T at site c (a client or the server) on data page p is n: ⟨c, T, I, p, (x, w), m⟩, where n is the LSN of the log record, p is the page-id of page p, and m is the Undo-Next-LSN of the log record, that is, the LSN of the log record for the previous (not-yet-undone) forward action by T. The log record for the delete action D[x, w] performed by T at c on p is similarly n: ⟨c, T, D, p, (x, w), m⟩.
These records are redo-undo log records: they are used in the redo pass of restart recovery to install missing updates on data pages, as well as in normal processing and in the undo pass of restart recovery to undo updates in partial or total rollbacks of transactions.

Undo actions are logged as follows. The log record for the undo-insert action I⁻¹[x, w] performed by transaction T at site c on data page p is n: ⟨c, T, I⁻¹, p, x, m⟩, where m is the Undo-Next-LSN of the log record, that is, the LSN of the log record for the next forward action by T to be undone. The log record for the undo-delete action D⁻¹[x, w] performed by T at c on p is n: ⟨c, T, D⁻¹, p, (x, w), m⟩. These records, also called compensation log records (CLRs), are redo-only log records: they are only used in the redo pass of restart recovery to install missing updates [Mohan 1990a, 1996; Mohan et al. 1992].

The log records for the transaction-control actions B, C, A, S[P], A[P], and C[P] by transaction T at site c are n: ⟨c, T, B⟩, n: ⟨c, T, C⟩, n: ⟨c, T, A⟩, n: ⟨c, T, S[P], m⟩, n: ⟨c, T, A[P]⟩, and n: ⟨c, T, C[P]⟩, respectively. In ⟨c, T, S[P], m⟩, m is the Undo-Next-LSN. In practice, the log records also have a field Prev-LSN that is used to link together all the log records generated by a transaction; we omit this field because it is not needed in our algorithms.
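The role of the Undo-Next-LSN field can be illustrated with a small sketch of our own: starting from a transaction's latest log record and repeatedly following Undo-Next-LSN enumerates exactly the forward updates still to be undone, skipping over updates that have already been compensated.

```python
# Sketch: walking a transaction's Undo-Next-LSN chain during rollback.
# The log is modeled as a dict mapping LSN -> (action, key, undo_next_lsn);
# a chain end is marked by undo_next_lsn = None.

def undo_chain(log, start_lsn):
    """Yield the (action, key) pairs still to be undone, newest first."""
    lsn = start_lsn
    while lsn is not None:
        action, key, undo_next = log[lsn]
        yield (action, key)
        lsn = undo_next
```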
Fig. 2. The log records generated for transaction T: B I[x1, w1] S[P1] I[x2, w2] S[P2] I[x3, w3] A[P2] I⁻¹[x3, w3] C[P2] I[x4, w4] A[P1] I⁻¹[x4, w4] I⁻¹[x2, w2] C[P1] I[x5, w5] A I⁻¹[x5, w5] I⁻¹[x1, w1] C.

Figure 2 gives the log records generated for the transaction given in Example 2.1, assuming that all the insertions go to the same data page, p, that no structure modifications occur, and that no other transactions are active at the same time.

The allocation, formatting, or deallocation of a page, or a structure modification such as a page split or merge affecting a set of B-tree pages, is logged using a single redo-only log record that does not belong to any transaction. The exact form of these log records can be seen in Section 10, which gives the algorithms for the page allocation and deallocation calls and for the different structure modifications.

The log records for client transactions and structure modifications performed at a client are generated at that client, and clients assign LSNs locally. Thus LSNs are not unique across the entire system and do not necessarily form a sequence that is monotonically increasing in time. However, for any log record with LSN n generated at site c (a client or the server), the pair (c, n) must form a unique key of the record across all log records generated in the system, and the LSNs generated at c must form a monotonically increasing sequence. Moreover, for each page p, the LSNs stamped in the Page-LSN field of p at different sites must form a monotonically increasing sequence.

Following ARIES/CSA [Mohan and Narang 1994], these properties are guaranteed as follows. Each client maintains a Local-Max-LSN value, which is the LSN of the log record most recently generated by the local log manager. The LSN n for a log record generated at client c is computed by the formula

n = 1 + max{Local-Max-LSN(c), Page-LSN(p1), . . . , Page-LSN(pk)},

where p1, . . . , pk (k ≥ 0) are the pages that are updated by the logged operation and Page-LSN(pi) is the LSN value found in the Page-LSN field of page pi before the update, for i = 1, . . . , k. The new LSN value n is stamped in the Page-LSN fields of the pages p1, . . .
, pk and assigned as the new value of Local-Max-LSN(c). In our algorithms we use the call log(n, . . .) to append a new log record to the local log buffer. The call returns the LSN n of the log record.

7. UPDATE PROPAGATION

The server version ps of a page p is modified if it is different from the disk version pd of p, that is, if Page-LSN(ps) > Page-LSN(pd). We assume that the
server applies the usual steal-and-no-force policy in handling modified pages: (1) the server may flush onto disk a dirty page, that is, a page that contains an update by an active transaction; (2) the server need not flush onto disk updated pages when the updating transaction is committing, nor does it need to flush pages involved in a structure modification when the structure modification is completed. In flushing modified pages from the server buffer onto disk, the server always applies the write-ahead logging (WAL) policy: the server call flush-onto-disk(p), for the page-id p of a modified page, first flushes onto the log disk all log records in the server log buffer with LSN ≤ Page-LSN(p), then flushes page p, and marks p as an unmodified page in its buffer frame.

The client version pc of page p is modified if it is different from the server version ps of p, that is, if Page-LSN(pc) > Page-LSN(ps). We assume that each client applies the client-side steal-and-no-force policy for data pages: (1) the client may ship dirty data pages to the server at any time; (2) the client need not ship updated data pages to the server when the updating transaction is committing. However, for structure modifications, the clients apply the force-to-server-at-commit policy: pages involved in a structure modification are shipped to the server and installed in the server buffer as soon as the structure modification is completed.

In shipping modified pages to the server, each client c always applies the client-side write-ahead logging (WAL) policy: page p may be shipped to the server only if all log records generated at c for updates on p have already been shipped to the server or are shipped to the server as part of the same message that contains the modified page p. A client may discard log records from its buffer only after receiving from the server an acknowledgment that those log records have been installed in the log buffer of the server [Mohan and Narang 1994].
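The server-side WAL rule can be sketched as follows. This is our own in-memory stand-in for the disk and the log (the names are not from the paper): log records up to Page-LSN(p) must reach the log disk before the page itself is written.

```python
# Sketch of flush-onto-disk(p) under write-ahead logging.

def flush_onto_disk(page, log_buffer, log_disk, data_disk):
    """page: dict with 'id', 'page_lsn', 'modified';
    log_buffer: list of (lsn, record) pairs awaiting the log disk."""
    # 1. WAL: first flush all buffered log records with LSN <= Page-LSN(p).
    to_flush = [rec for rec in log_buffer if rec[0] <= page["page_lsn"]]
    log_disk.extend(to_flush)
    for rec in to_flush:
        log_buffer.remove(rec)
    # 2. Only then write the page and mark it unmodified in its frame.
    data_disk[page["id"]] = {"id": page["id"], "page_lsn": page["page_lsn"]}
    page["modified"] = False
```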
As will be apparent, our algorithms apply a lazy or demand-driven update-propagation policy for data pages: a modified data page is shipped from a client to the server only when the client buffer becomes full or when the page is needed at some other site (called back). Of course, when a transaction T commits or completes its rollback at client c, all log records generated by T are shipped from c to the server and flushed onto the log disk.

In comparison, the basic ARIES/CSA method allows for a lazy update-propagation policy to be applied for all pages [Mohan and Narang 1994]. In Carey et al. [1994b] and Zaharioudakis et al. [1997], pages are never shipped from a client to the server; instead, updates performed at a client on cached pages are propagated to the server by shipping (at transaction commit) to the server the log records generated for the updates, which are then replayed on the affected pages in the server buffer. In Gottemukkala et al. [1996], it is assumed that clients follow the force-to-server-at-commit policy, so that a client sends all cached dirty pages to the server when a transaction commits. The full-caching (FC) algorithm of Zaharioudakis and Carey [1997] postpones the installation of index updates at the server until commit time. Both Basu et al. [1997] and Jaluta [2002]
used immediate update-propagation policies for all pages, thus avoiding client checkpointing.

The call ship-to-server(P, M, R, L) is used by client c to ship a set P of copies of modified pages, a set R of log records, and a set L of local S locks to the server. The call assumes that the pages in P reside write-latched, and cached in U or X mode, in the buffer of c. The parameter M is a set of pairs (p, m), where p is a page-id and m takes one of the values S, U, “purge” or “deallocate”, indicating the mode in which page p remains cached at c; “purge” means that c gives up the caching of p and will purge p from the buffer; “deallocate” means that p is purged at c and is to be deallocated at the server. The set R contains the not-yet-shipped log records residing in the log buffer of c with LSNs up to the maximum of the LSNs stamped in the Page-LSN fields of the pages in P. The set L contains local S locks on keys covered by some data pages cached at c; these locks are to be registered globally at the server.

The actions performed by client c in the ship-to-server call are the following. Request the server to append the log records in R, to install the pages in P with remaining caching modes M, and to register globally the locks in L, and wait for an acknowledgment. When the acknowledgment arrives, advance the Shipped-Log-LSN (a locally maintained LSN value) to the maximum LSN of the log records in R, mark each page p in P that remains cached at c in mode m as an unmodified page with caching mode m in the buffer of c, and unlatch and purge each page in P whose remaining caching mode in M is “purge” or “deallocate”. Log records with LSNs less than or equal to Shipped-Log-LSN may be discarded from the local log buffer if the local log buffer becomes full. The call log-records-up-to(n), for an LSN n, returns the set of log records with LSNs n′ satisfying Shipped-Log-LSN < n′ ≤ n.
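The client-side Shipped-Log-LSN bookkeeping and the log-records-up-to(n) call might look as follows in outline. This is a simplified sketch under our own naming, not the actual implementation:

```python
# Sketch of a client's log buffer with Shipped-Log-LSN bookkeeping.

class ClientLog:
    def __init__(self):
        self.buffer = []            # (lsn, record) pairs, LSNs increasing
        self.shipped_log_lsn = 0    # highest LSN acknowledged by the server

    def log_records_up_to(self, n):
        # not-yet-shipped records with Shipped-Log-LSN < lsn <= n
        return [rec for rec in self.buffer if self.shipped_log_lsn < rec[0] <= n]

    def acknowledge(self, n):
        # the server acknowledged installing records up to LSN n
        self.shipped_log_lsn = max(self.shipped_log_lsn, n)

    def discard_shipped(self):
        # shipped records may be dropped when the local log buffer fills up
        self.buffer = [rec for rec in self.buffer if rec[0] > self.shipped_log_lsn]
```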
The request issued by the call ship-to-server(P, M, R, L) at client c is processed by the server as follows. Append the log records in R to the server log buffer, and insert the locks in L in the server lock table. For each page p in P, write-latch p, assign as the Force-Addr of p the logical address in the server log of the log record with local LSN = Page-LSN(p) (for enforcing the WAL protocol) [Mohan and Narang 1994], install p in the buffer as a modified page, update the page-caching map to reflect the current caching mode m of p at c, and unlatch p. Send the acknowledgment to c. Note: The inserting of the locks in L in the server lock table does not cause any lock waits, because local locks are granted only when it is known that there exist no incompatible locks (see Section 9).

The server uses the call ship-to-client(c, p) to ship page p to client c. In response to the request sent by this call, client c installs page p in its buffer.

8. PAGE-LEVEL REPLICA MANAGEMENT AND LATCHING

In this section we explain how page copies cached at various sites are managed in our page-server system and how page accesses within a site are synchronized. Our page-level replica-management protocol extends the standard callback-read-locking protocol [Lamb et al. 1991; Wang and Rowe 1991; Franklin and
Carey 1994; Carey et al. 1994b] to work with the relaxed update-privilege approach, in which a page may reside cached in U mode at one site while copies of the page reside cached in S mode at other sites (cf. Mohan and Narang [1991]).

Within a single site c, latches [Mohan 1990a, 1996; Mohan et al. 1992; Mohan and Levine 1992; Gray and Reuter 1993] are used to protect a page cached at c from improper simultaneous access by concurrent processes. Given the page-id of a page p, the call read-latch p fixes and latches p in shared mode for the thread of the database process of c that wants to read the contents of p. The call write-latch p fixes and latches p in exclusive mode for the thread that wants to update p or for some other reason needs exclusive access to p. A read latch on p can be granted immediately if there exists no write latch on p, while a write latch on p can be granted immediately only if there exists no latch at all on p. Otherwise, a wait occurs. An existing read latch cannot be upgraded to a write latch.

The call read-latch p is used at site c only when p resides cached in the buffer of c, while the call write-latch p may also be issued when p does not reside in the buffer of c. If p does not reside in the buffer, the call write-latch p first allocates a buffer frame for p and then fixes and latches it in exclusive mode. If there is no free buffer frame, the least-recently-used (LRU) unlatched page q in the buffer is write-latched. If q is a modified page, either the call flush-onto-disk(q) (when c is the server) or the call ship-to-server({q}, {(q, “purge”)}, R, L) (when c is a client) is issued, where R = log-records-up-to(Page-LSN(q)) and (for a data page q) L = local-S-locks(q). If q is an unmodified page at client c, the call ship-to-server(∅, {(q, “purge”)}, ∅, L) is issued. The buffer frame for q is then allocated for p.
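The latch-grant rules stated above amount to the following check (a sketch of ours; a real latch manager would also queue waiters and handle fairness):

```python
# Sketch: a read latch is granted when no write latch is held;
# a write latch only when no latch at all is held. No upgrades.

def can_grant(requested, held):
    """requested: 'read' or 'write'; held: list of latches on the page."""
    if requested == "read":
        return "write" not in held
    if requested == "write":
        return not held
    raise ValueError(requested)
```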
When the call write-latch p results in the allocation of a buffer frame for p, the buffer frame is marked as not-installed, to indicate that it does not yet contain a copy of page p. Naturally, a page with not-installed contents is never shipped to the server or flushed onto disk. We use the call installed(p) to test whether the buffer-frame contents of a latched page p are installed or not installed. At the server, the call fetch-from-disk(p) is used to install the disk version of p into the buffer frame of p; the page is then marked as an unmodified page in the buffer. The call unlatch p unlatches and unfixes a latched installed page p, while the call unlatch and purge p, for a write-latched installed page p, also purges p from the buffer. If p is a write-latched not-installed page, the call unlatch p unfixes, unlatches, and purges p.

Our B-tree algorithms normally use the call cache-and-latch(p, m) to get page p cached in mode m and latched in the appropriate mode. Here m is one of S, SC, U, or X. Page p is read-latched if m is S or SC, and write-latched if m is U or X. When executed at the server, the call cache-and-latch(p, S) caches and read-latches the server version of p, if p resides in the server buffer, and the current version of p otherwise. When executed at a client, the call cache-and-latch(p, S) caches and read-latches the client version of p, if p resides in the client buffer, and the current version of p otherwise. For m = SC, U, and X, the call cache-and-latch(p, m) always results in the current version of p being cached and latched in the buffer of

the accessing site. SC means getting the current version of p cached in S mode.

Cache-and-latch(p, m) at the server: First wait if p is cached at some client in X mode.

Case m = S: If p resides in the server buffer, then read-latch p and return; else do as in case m = SC.

Case m = SC: Write-latch p. If installed(p) and p is not cached in U mode at any client, then downgrade the latch to a read latch and return. If not installed(p) and p is not cached in U mode at any client, then fetch-from-disk(p), downgrade the latch to a read latch, and return. Send the request callback-page(p, SC) to the client c that caches p in U mode, and wait for the reply. When c replies with page copy p and log-record set R, append the log records in R to the log buffer, install p in the buffer frame as a modified page, send an acknowledgment to c, and downgrade the latch to a read latch.

Case m = U: Write-latch p. If installed(p) and p is not cached in U mode at any client, return. If not installed(p) and p is not cached in U mode at any client, fetch-from-disk(p) and return. If p is cached at some client c in U mode, send the request callback-page(p, U) to c, and wait for the reply. When c replies with page copy p and log-record set R, append the log records in R to the log buffer, install p in the buffer frame as a modified page, replace the entry (p, c, U) by (p, c, S) in the page-caching map, and send an acknowledgment to c.

Case m = X: Write-latch p. If installed(p) and p is not cached at any client, return. If not installed(p) and p is not cached at any client, fetch-from-disk(p) and return. If p is cached at some client c in U mode, send the request callback-page(p, X) to c, and wait for the reply.
When c replies with page copy p, log-record set R and S-lock set L, insert the locks in L into the server lock table, append the log records in R into the log buffer, install p in the buffer frame as a modified page, delete the entry ( p, c, U) from the pagecaching map, and send an acknowledgment to c. Send the request callback-page( p, X) to all clients c that cache p in S mode, and wait for the replies. For each c , when c replies with S-lock set L, insert the locks in L into the server lock table, and delete the entry ( p, c , S) from the page-caching map.

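The case analysis above determines which callback-page requests the server must issue before it can cache a page in the requested mode. As a rough illustration, this decision can be sketched as follows (a hypothetical Python model; the function name and data shapes are invented and are not part of the paper's algorithms):

```python
# Sketch (invented names): which callback-page(p, m') requests the server
# issues before caching page p in mode m, per the cases above.

def callbacks_for(m, caching):
    """caching: list of (client, mode) entries for p in the page-caching map."""
    u_holders = [c for c, mode in caching if mode == "U"]
    s_holders = [c for c, mode in caching if mode == "S"]
    if m in ("SC", "U"):
        # Only a U-mode holder may hold a newer version; S-mode copies remain.
        return [(c, m) for c in u_holders]
    if m == "X":
        # X-mode caching is exclusive: call back and purge every remote copy.
        return [(c, "X") for c in u_holders + s_holders]
    return []  # m == "S": served from the server buffer when p resides there
```

For example, `callbacks_for("U", [("c2", "U"), ("c1", "S")])` yields `[("c2", "U")]`: only the U-mode holder is called back, after which its caching mode is downgraded to S.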
The server call cache-and-latch(p, X, c), where c is a client-id, works as cache-and-latch(p, X) but does not send the callback-page(p, X) request to client c. This call is used at the server in processing get-page(p, X) requests from client c. The server call cache-and-latch(p, X, c, no-fetch) works as cache-and-latch(p, X, c) but p is never fetched from disk. This call is used at the server in processing get-X-mode(p) requests from client c.

Cache-and-latch(p, m) at client c:

Case m = S: If p resides in the buffer of c then read-latch p and return else do as in case m = SC.

Case m = SC: Write-latch p. If not installed(p) then send the request get-page(p, SC) to the server else let n = Page-LSN(p) and send the request get-page(p, SC, n) to the server. Wait for the reply. If a copy of page p arrives, install p in the buffer of c as an unmodified page and register p as cached at c in S mode. Downgrade the write latch on p to a read latch.

Case m = U: Write-latch p. If p is cached at c in U or X mode, return. If p is cached at c then let n = Page-LSN(p) and send the request get-page(p, U, n) to the server else send the request get-page(p, U) to the server. Wait for the reply. If a copy of p arrives, install it in the buffer of c as an unmodified page. Register p as cached at c in U mode.

Case m = X: Write-latch p. If p is cached at c in X mode, return. If p is cached at c in U mode then send the request get-X-mode(p) to the server else if p is cached at c in S mode then let n = Page-LSN(p) and send the request get-page(p, X, n) to the server else send the request get-page(p, X) to the server. Wait for the reply. If a copy of p arrives, install it in the buffer of c as an unmodified page. Register p as cached at c in X mode.

I. Jaluta et al.

The request get-page(p, m, n) or get-page(p, m) sent by client c is processed by the server as follows:

Case m = SC: cache-and-latch(p, SC); if n is given and Page-LSN(p) = n then reply “cached page p up-to-date” else ship-to-client(c, p); register p as cached at c in S mode; unlatch p.

Case m = U: cache-and-latch(p, U); if n is given and Page-LSN(p) = n then reply “cached page p up-to-date” else ship-to-client(c, p); register p as cached at c in U mode; unlatch p.

Case m = X: cache-and-latch(p, X, c); if n is given and Page-LSN(p) = n then reply “cached page p up-to-date” else ship-to-client(c, p); register p as cached at c in X mode; unlatch and purge p. [Note: p must be purged from the server buffer, in order to maintain the property that caching in X mode is exclusive across all the sites.]

The request get-X-mode(p) from client c that caches the current version of p is processed by the server as follows: cache-and-latch(p, X, c, no-fetch); reply “X mode on p granted”; register p as cached at c in X mode; unlatch and purge p.

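The Page-LSN check in the get-page processing is what makes shipping lazy: a client whose copy already carries the current Page-LSN gets only a short confirmation instead of the page itself. A minimal sketch of the reply logic (hypothetical Python; the function and return shape are invented):

```python
# Sketch (invented shapes): the server's reply to get-page(p, m[, n]) after
# its cache-and-latch(p, m) call has succeeded, per the cases above.

def serve_get_page(c, m, server_lsn, n=None):
    """Return (reply, caching_entry, purge_server_copy) for client c."""
    up_to_date = n is not None and n == server_lsn
    reply = "cached page up-to-date" if up_to_date else "ship page copy"
    entry = (c, "S" if m == "SC" else m)  # register p as cached at c
    purge = m == "X"  # keep X-mode caching exclusive across all sites
    return reply, entry, purge
```

The page crosses the network only when the client's copy is stale (or absent, in which case no LSN n accompanies the request).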
The request callback-page(p, m) sent by the server to client c that caches p is processed by the dedicated thread of the client database process at c as follows:

Case m = SC: write-latch p; let R = log-records-up-to(Page-LSN(p)); ship-to-server({p}, {(p, U)}, R, ∅); unlatch p. [Note: This case occurs only when p is cached at c in U mode; p remains cached there in U mode.]

Case m = U: write-latch p; let R = log-records-up-to(Page-LSN(p)); ship-to-server({p}, {(p, S)}, R, ∅); downgrade the caching mode of p to S mode; unlatch p. [Note: This case occurs only when p is cached at c in U mode; p remains cached there in S mode.]

Case m = X: write-latch p; if p is a data page, set L = local-S-locks(p), else set L = ∅; if c caches p in U mode then let R = log-records-up-to(Page-LSN(p)) and ship-to-server({p}, {(p, purge)}, R, L) else ship-to-server(∅, {(p, purge)}, ∅, L); unregister the caching of p at c; unlatch and purge p. [Note: This case occurs when p is cached at c in U or S mode; in both cases p is purged from the buffer of c.]

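The client-side effect of a callback can be summarized by two outcomes: whether the page copy (with its pending log records) must be shipped back, and which caching mode, if any, remains at the client. A sketch under invented names (this is an illustration, not the paper's code):

```python
# Sketch (invented shapes) of the client-side effect of callback-page(p, m):
# returns (ship_page_copy, remaining_mode); None means p is purged.

def handle_callback_page(m, cached_mode):
    if m == "SC":                        # only a U-mode holder receives this
        return True, "U"                 # ship the copy, keep caching in U mode
    if m == "U":                         # only a U-mode holder receives this
        return True, "S"                 # ship the copy, downgrade to S mode
    if m == "X":                         # U- or S-mode holders receive this
        return cached_mode == "U", None  # only a U holder ships; then purge
    raise ValueError(m)
```

An S-mode holder never ships the page on an X callback, because its copy cannot be newer than the server's; it only surrenders its local S locks and purges the page.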
A client c that caches page p in X mode but has not modified or updated p may issue the call downgrade-mode(p, S) in order to request the server to downgrade the caching mode of p to S mode. Page p remains cached at c in S mode.

We say that an operation is of short duration if the time taken by the operation does not depend on the number of actions of any transaction, given that the operation is not charged for tasks whose amortized cost is constant when averaged over the total number of actions in the transactions. Tasks with constant amortized cost include shipping modified pages, log records, or locks from a client to the server, and flushing modified pages or log records onto disk. An operation that is not of short duration is of long duration. For example, holding a commit-duration lock is of long duration, and so is waiting for a short-duration or a commit-duration lock.

All our algorithms maintain the following conditions:

(1) Holding of a latch, once it has been acquired, is of short duration.
(2) No process thread ever holds more than a small constant number of latches (namely, three latches) at a time.
(3) No read latch is ever upgraded to a write latch.

B-Tree Concurrency Control and Recovery



103

(4) Each process thread acquires latches on B-tree pages in the parent-to-child, older-sibling-to-younger-sibling order, so that the parent page is never latched while holding a latch on a child page, and an older sibling page is never latched while holding a latch on a younger sibling page.
(5) Holding a page cached in X mode is of short duration.
(6) All requests for a latch are serviced using a fair (i.e., first-come-first-served) policy.

Under these conditions, we have the following:

LEMMA 8.1. The cache-and-latch calls maintain the conditions (a), (b), and (c) stated in Section 4 and are deadlock-free and starvation-free, so that, when a page is requested to be cached and latched, this will eventually happen, given that no system crash occurs. Moreover, the calls read-latch, write-latch, cache-and-latch, get-page, get-X-mode, callback-page, and downgrade-mode are of short duration.

9. RECORD-LEVEL REPLICA MANAGEMENT AND LOCKING

S-locking keys only locally in a data page cached at a client works only if the client knows which keys in the page are possibly X-locked by transactions at other sites. We use record marking [Carey et al. 1994b; Zaharioudakis et al. 1997] as a cheap mechanism for conveying information about X-locked keys in data pages between the sites. In the header of each data record (and of the high-key record (∞, 0)), one bit is reserved as an unavailability bit that indicates whether the record is available (0) or unavailable (1). An available record can be accessed by any transaction, while an unavailable record can only be accessed by the transaction that holds an X or S lock on it. The unavailability bit is part of the logical record in that, when the record moves from one data page to another, the bit also moves. Setting or clearing the unavailability bit is not considered an update on the page; it is not logged and does not affect the modified/unmodified status of the page.
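The access rule attached to the unavailability bit can be stated compactly in code. The following sketch uses invented structures (a record class and a key-to-lock-holder map) purely for illustration:

```python
# Sketch (invented structure): an unavailability bit in the record header.
# Setting or clearing the bit is not logged and does not dirty the page.

class Record:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.unavailable = False  # 0 = available, 1 = unavailable

def may_access(rec, txn, lock_of):
    """An available record is accessible to every transaction; an unavailable
    one only to the transaction holding an X or S lock on its key.
    lock_of: invented map key -> transaction holding a lock on that key."""
    return not rec.unavailable or lock_of.get(rec.key) == txn
```

Because the bit travels with the logical record, the rule keeps working even after structure modifications move the record to another page.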
To set an unavailability bit of a record in a data page, the page must be cached at the site in U or S mode and write-latched. An unavailability bit can be cleared only at the server; the page must then be cached at the server in X, U, or S mode and write-latched. Record marking allows transactions at site c to acquire just local S locks on the keys of records to be fetched when these records are available. The unavailability bit of record (x, w) in data page p is set in all cached copies of p if and only if some transaction T has acquired an X lock on x. Thus, (x, w) is unavailable for all transactions other than T.

Our fine-grained locking protocol combines our page-level replica-management protocol with record marking and standard key-range locking. Our protocol is based on that presented in Jaluta [2002], but modified to work with lazy update propagation for data pages. In our protocol, data pages are cached and latched, records are marked, and appropriate locks for database actions by transaction T at site c are acquired in such a way that the following conditions hold.


(1) In performing I[x, w], when the record (x, w) is about to be inserted, the current version of the data page p covering key x and the current version of the data page p′ containing the next key y must be cached at c in U mode and write-latched, p must have room for (x, w), x must be locally and globally X-locked by T for commit duration, y must be locally and globally X-locked by T for short duration, and the record with key y, when present, must have been marked as unavailable in all copies of p′ cached at sites other than c. After inserting (x, w) into p, the inserted record must be marked as unavailable in the copy of p at c, the pages p and p′ must be unlatched, and the X lock on y must be released.

(2) In performing D[x, w], when the record (x, w) is about to be deleted, the current version of the data page p containing (x, w) and the current version of the data page p′ containing the next key y must be cached at c in U mode and write-latched, p must not underflow when (x, w) is deleted, x must be locally and globally X-locked by T for short duration, y must be locally and globally X-locked by T for commit duration, and the records with keys x and y, when present, must have been marked as unavailable in all copies of p and p′ cached at sites other than c. After deleting (x, w) from p, the next-key record must be marked as unavailable in the copy of p′ at c, the pages p and p′ must be unlatched, and the X lock on x must be released.
(3) In performing F[x, θz, w], when the record (x, w) is about to be fetched, either (i) the data page p covering key z and the data page p′ containing (x, w) must be cached at c and read-latched, x must be the least key with x θ z among the record keys in the site-c version of the physical database, the record (x, w) must be available in p′, and key x must be locally S-locked by T for commit duration, or (ii) key x must be locally and globally S-locked by T for commit duration, c must cache and hold a read latch on versions of the data page p covering key z and of the data page p′ containing (x, w) that were current at some time between granting the global S lock on x and performing the fetch, and x must be the least key with x θ z among the record keys in the cached versions of p and p′. Thus, if (x, w) is found to be unavailable, a global S lock on x and the current versions of p and p′ must be acquired. After fetching (x, w), the pages p and p′ must be unlatched.

(4) In performing I⁻¹[x, w] during normal processing, when the inserted record (x, w) is about to be deleted, the current version of the data page p containing (x, w) must be cached at c in U mode and write-latched, p must not underflow when (x, w) is deleted, and T must hold both the global and the local commit-duration X lock acquired on x for the corresponding forward action I[x, w]. After deleting (x, w), the page p must be unlatched.

(5) In performing D⁻¹[x, w] during normal processing, when the deleted record (x, w) is about to be reinserted, the current version of the data page p covering key x must be cached at c in U mode and write-latched, p must have room for (x, w), and T must hold both the global and the local commit-duration X lock acquired on the next key y for the corresponding forward action D[x, w]. After inserting (x, w), the record must be marked as unavailable in the copy of p at c and the page p must be unlatched.


(6) The unavailability bit of a record in a data page p cached at client c can be set upon the server’s request, provided that all local S locks on the key of the record at c are registered globally. The bit can never be cleared at c. The unavailability bit of a record in p can only be cleared at the server when the current version of p is cached at the server (in U or S mode) and the key of the record is no longer globally X-locked.

(7) Pages are cached and latched using the cache-and-latch calls given in the previous section.

By induction on the number of insert, delete, undo-insert, and undo-delete actions, conditions (1), (2), (4), (5), (6), and (7) imply the following:

LEMMA 9.1. Using our protocol, the following conditions hold for any site c and data page p covering key x.

(a) If p is cached at c and contains an available record with key x, then that record belongs to the current version of the logical database and the key x is not X-locked globally.
(b) If the current version of the logical database does not contain a record with key x but the copy of p at c does, then that record is unavailable.
(c) If the current version of the logical database contains a record with key x but the copy of p at c does not, then the record with the key next to x in the site-c version of the physical database is unavailable.

By conditions (1), (2), (4), and (5) satisfied by our protocol, the insertion or the deletion of a record or the undoing of an insertion or a deletion of a record is always performed on the current version of the logical database. The following theorem states that the same is also true for fetch actions, although fetch actions are not necessarily performed on the current version of the physical database.

THEOREM 9.2. When implemented using our protocol, the action F[x, θz, w] always fetches the record with the least key x satisfying x θ z in the current version of the logical database.

PROOF.

See the Electronic Appendix.

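The two cases of condition (3) amount to a simple decision on the cached page copy: an available record needs only a local S lock, while an unavailable one forces a global S lock and a re-fetch of the current page version. A sketch with invented data shapes (records as dictionaries), not the paper's code:

```python
# Sketch (invented shapes) of the fetch decision for F[x, >= z] on a cached
# page copy: case (i) stays local, case (ii) goes to the server.

def plan_fetch(z, page_records):
    """page_records: cached records as dicts {'key': ..., 'unavailable': ...}.
    Returns (key, action), or (None, 'miss') if no key qualifies."""
    candidates = [r for r in page_records if r["key"] >= z]
    if not candidates:
        return None, "miss"
    x = min(candidates, key=lambda r: r["key"])
    if not x["unavailable"]:
        return x["key"], "local S lock"                     # case (i)
    return x["key"], "global S lock + current version"      # case (ii)
```

After re-caching the current version, the fetch condition must be rechecked, since the record found locally may no longer be the least qualifying key.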
Example 9.3. Assume that client c1, client c2, and the server all cache the current version of data page p that contains data records with keys 20, 24, and 40, all available (see Figure 3(a)). Client c2 caches p in U mode. Transaction T1 at c1 and transaction T2 at c2 perform the following actions:

Fig. 3. The server and the clients c1 and c2 all cache data page p. In (a), the page copies have identical contents, and all records are available. In (b), the record (38, w) has been inserted into the copy of p at c2 and marked there as unavailable (×), while the next-key record in the copies of p at the server and at client c1 has been marked as unavailable. In (c), the current version of p has been shipped from c2 to the server and from there to c1.

—F[x, > 21] by T1. The call cache-and-latch(p, S) at c1 read-latches the copy of p cached at c1. Key 24 satisfies the fetch condition and the record with key 24 is available. Key 24 is S-locked locally at c1. The record with key 24 is fetched. Page p is unlatched.

—I[38, w] by T2. The key 38 is locally X-locked for T2 at c2. The call cache-and-latch(p, U) at c2 write-latches p. The next key y = 40 is determined and locally X-locked for T2 at c2. A commit-duration X lock on 38 in p and a short-duration X lock on 40 in p are requested from the server. The server acquires the requested X locks for T2, write-latches p, marks the record with key 40 as unavailable, and requests c1 to do the same. Client c1 write-latches p. If all the S locks on keys in p (if any) were already registered globally, p would now be purged at c1. Assume this is not the case. Then the record with key 40 is marked as unavailable at c1, and p is unlatched. The server unlatches p and informs c2 that the requested X locks have been granted. Client c2 now inserts (38, w) into p, marks the record as unavailable, generates the log record n: ⟨c, T2, I, p, (38, w), m⟩, where m = Undo-Next-LSN(T2), stamps n into the Page-LSN of p, unlatches p, releases the local X lock on 40, and requests the server to release the global X lock on 40 (see Figure 3(b)).

—F[x, > 24] by T1. The call cache-and-latch(p, S) at c1 read-latches p. Key 40 satisfies the fetch condition but the record with that key is unavailable. Therefore, p is unlatched and a global commit-duration S lock on 40 for T1 is requested from the server. As 40 is no longer X-locked at the server, the global S lock is granted. The S lock is inserted into the client lock table at c1. The call cache-and-latch(p, SC) is then used to get the current version of p cached and read-latched at c1 as follows. First, the call write-latches p and sends the request get-page(p, SC, n′) to the server, where n′ = Page-LSN(p). The request is processed at the server as follows. The call cache-and-latch(p, SC) at the server is used to get the current version of p cached and read-latched at the server. This call includes the following: as p is cached at client c2 in U mode, the server sends the request callback-page(p, SC) to c2. Client c2 write-latches p, replies with page p and log-record set R, and unlatches p. The server appends the log records in R into the log buffer, installs p in the buffer frame as a modified page, and sends an acknowledgment to c2. As Page-LSN(p) > n′, the server uses the call ship-to-client(c1, p) to ship p to client c1, and unlatches p.
Client c1 installs p in the buffer at c1 as an unmodified page and downgrades the write latch on p to a read latch (see Figure 3(c)). Client c1 rechecks the fetch condition, only to find that the record with key 38 is now the correct record to fetch but unavailable. Client c1 thus unlatches p and requests from the server a global commit-duration S lock on 38. The server finds that key 38 is X-locked by another transaction and thus replies that the lock request is pending and puts the lock request into the queue of pending lock requests for key 38. T1 must now wait until the conflicting lock has been released.

In comparison, the relaxed-hybrid-caching (HC-R) algorithm in Zaharioudakis and Carey [1997] would acquire commit-duration X locks on both the inserted key 38 and its next key 40.

THEOREM 9.4. Our protocol guarantees serializability for all transactions.

PROOF. See the Electronic Appendix.

The conditions (1) to (7) satisfied by our protocol do not specify exactly when the unavailability bits of records are cleared. To prevent the accumulation of unavailable records that are no longer X-locked, the unavailability bit must be cleared in situations in which this can be done at low cost. It is impractical to try to clear the bit of an updated record when the updating transaction commits, because the record may have migrated away from the page in which it resided at the time of the update, and because the page that currently contains the update may even have been flushed onto disk and purged from the buffer by the time the updating transaction finally commits.

The Commit-LSN concept has been suggested by Mohan [1990b] to reduce locking for transactions that run at the read-committed isolation level, and the concept can also be utilized for other purposes. In a centralized database, the Commit-LSN is the LSN of the earliest begin-transaction log record among the active transactions that have so far done updates. The Commit-LSN can be used to conclude that a data page does not contain uncommitted updates if its Page-LSN is less than the Commit-LSN. To implement this in a page-server system, each client maintains a local Commit-LSN and ships this to the server whenever it has to ship log records to the server. The server uses as a global Commit-LSN the minimum of the log-record addresses corresponding to the local Commit-LSNs [Mohan and Narang 1994]. In the server call ship-to-client(c, p), before the page p is shipped to client c, if p is a data page for which the log-record address corresponding to Page-LSN(p) is less than the global Commit-LSN, the unavailability bits in all records in p are cleared.
If p is a data page for which the log-record address corresponding to Page-LSN(p) is greater than the global Commit-LSN, then, for each unavailable record in p, the server lock table is probed to check whether the key of the record is still X-locked or there are pending X-lock requests for the key, and if not, the unavailability bit in that record is cleared. The same actions on the unavailability bits are performed in the server call fetch-from-disk(p), after page p has been fetched from disk and installed in the server buffer, and in the server call flush-onto-disk(p), before page p is flushed from the server buffer onto disk. These actions guarantee that the unavailability bits of no-longer-locked records will eventually be cleared.

In the following, we give the lock calls that are needed to implement our protocol. Every X lock for a client transaction must first be acquired locally, and then a global X lock must be requested from the server. To acquire an X lock on key x, when the data page covering x is cached in U mode and write-latched, a conditional lock call is first tried:

Conditionally-locally-X-lock(c, T, x, d) acquires a local X lock of duration d on key x for transaction T at site c if x is not locally locked by any other transaction at c. If the lock is granted, the lock (c, T, x, X, d) is inserted into the lock table at c; otherwise the call returns with “lock not granted.”

If the conditional lock call returns with “lock not granted”, the page covering the key to be locked is unlatched, in order to prevent an undetectable deadlock that might occur due to a process holding a latch while waiting for a lock [Mohan 1990a, 1996; Mohan and Levine 1992; Lomet and Salzberg 1992, 1997]. The lock is then acquired unconditionally:

Locally-X-lock(c, T, x, d) acquires a local X lock of duration d on key x for transaction T at site c. A wait occurs if x is locally locked by some other transaction at c. When the lock is granted, the lock (c, T, x, X, d) is inserted into the lock table at c.

After acquiring the local X lock, the page covering the key is again cached in U mode and write-latched, and a global X lock is requested from the server:

Request-global-X-lock(c, T, p, x, d) at client c requests from the server a global X lock of duration d on key x covered by data page p, and waits for a reply. At the server the request is processed as follows:

1. If T at c already holds a global X lock on x, then insert the requested lock (c, T, x, X, d) into the server lock table (if the lock already held by T is of a duration different from d), reply “lock granted immediately,” and return.
2. If x is globally locked by some transaction other than T or if there are pending lock requests for x, reply “lock request pending,” append the request into the queue of pending lock requests for key x, and continue with processing requests from the other threads of the client database process of c, if any.
3. Write-latch p. If installed(p) and p contains a record with key x, mark that record as unavailable in p.
4. Let C′ be the set of clients other than c that cache p. Send the request callback-record(p, x) to all clients c′ ∈ C′, and wait for replies.
5. For each client c′ ∈ C′, if c′ replies “page purged” then delete the entry (p, c′, S) from the page-caching map else if c′ replies with a nonempty set L of local S locks then insert the locks in L into the server lock table.
6. If all the clients c′ ∈ C′ reply “page not cached”, “page purged”, or “key not S-locked” (so that in the previous step no S locks on x were inserted), insert the requested lock (c, T, x, X, d) into the lock table, reply “lock granted immediately”, unlatch p, and return. Otherwise, reply “lock request pending”, append the request into the queue of pending lock requests for key x, unlatch p, and continue with processing requests from the other threads of the client database process of c, if any.
7. The queue of pending lock requests for x will be reprocessed whenever some transaction releases a global lock on x. In the reprocessing of a request, no new callback-record requests are sent.
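The grant-or-queue logic of the steps above can be sketched as follows (hypothetical Python; the lock-table and reply shapes are invented, and latching and the callback round-trips themselves are elided):

```python
# Sketch (invented shapes) of steps 1-2 and 5-6: grant the global X lock only
# if no other transaction holds or awaits a lock on x and no called-back
# client reported a local S lock on x.

def process_global_x_lock(key, holder, lock_table, pending, replies):
    """lock_table: list of (txn, key) global locks; replies: the S-lock sets
    returned by callback-record from the other caching clients."""
    others = [t for t, k in lock_table if k == key and t != holder]
    if others or pending.get(key):
        pending.setdefault(key, []).append((holder, key))
        return "lock request pending"
    for lock_set in replies:             # step 5: register called-back S locks
        lock_table.extend(lock_set)
    if any(k == key and t != holder for t, k in lock_table):
        pending.setdefault(key, []).append((holder, key))
        return "lock request pending"
    lock_table.append((holder, key))     # step 6: grant
    return "lock granted immediately"
```

Note that the conflicting S locks surface only through the callbacks: until then the server may not even know that some client transaction holds a purely local S lock on x.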

If the request-global-X-lock call returns “lock request pending”, the page covering the key to be locked is unlatched and the client thread waits for the eventual granting of the lock.

The request callback-record(p, x) is processed at client c as follows:

1. Write-latch p. If not installed(p) then reply “page not cached,” unlatch p, and return.


2. Set L = local-S-locks(p). If L is empty then unlatch and purge p, reply “page purged”, and return.
3. If p contains a record with key x, mark it as unavailable.
4. If L does not contain a lock on x then reply with L and the message “key not S-locked” else reply with L and the message “key S-locked”. Unlatch p.
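For illustration, the four callback-record steps can be condensed into one function (a sketch with invented data shapes; the real processing also write-latches and unlatches p):

```python
# Sketch (invented shapes) of the callback-record(p, x) steps 1-4 at a client.

def handle_callback_record(page, x):
    """page: None if not installed, else a dict with 'records' (list of dicts)
    and 'local_s_locks' (list of locked keys). Returns (reply, shipped_locks)."""
    if page is None:
        return "page not cached", []                  # step 1
    locks = page["local_s_locks"]
    if not locks:
        page["purged"] = True                         # step 2
        return "page purged", []
    for rec in page["records"]:                       # step 3
        if rec["key"] == x:
            rec["unavailable"] = True
    if all(k != x for k in locks):                    # step 4
        return "key not S-locked", locks
    return "key S-locked", locks
```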

To acquire an S lock, when the page or pages covering the key range to be scanned are cached and read-latched, a client transaction first tries to get a local S lock using the following conditional lock call: Conditionally-locally-S-lock(c, T, p, x), for data page p cached at client c and key x covered by p, acquires a local commit-duration S lock on key x for transaction T at c if the following conditions are satisfied: (1) x is not locally X-locked by any other transaction at c, and (2) either x is X-locked or S-locked by T or the record with key x in the copy of p at c is available. If the lock is granted, the lock (c, T, x, S, commit) is inserted into the lock table of c and the entry (x, T, p) is inserted into the lock-to-page map at c; otherwise the call returns with “lock not granted”.

If the S lock is not granted by the conditional lock call, the page(s) covering the key range to be scanned are unlatched and a global S lock is requested from the server:

Request-global-S-lock(c, T, x) at client c requests from the server a global commit-duration S lock on key x, and waits for a reply. At the server the request is processed as follows:

1. If x is not globally X-locked by any transaction other than T and there are no pending requests for an X lock on x, then insert the requested lock (c, T, x, S, commit) into the server lock table (if not already there), reply “lock granted immediately”, and return.
2. Reply “lock request pending”, append the request into the queue of pending lock requests for key x, and continue with processing requests from the other threads of the client database process of c, if any.
3. The queue of pending lock requests for x will be reprocessed whenever some transaction releases a global lock on x.

If the request-global-S-lock call returns “lock request pending”, the client thread waits for the eventual granting of the lock. When the lock is granted, it is inserted into the client lock table, and the client thread uses the call cache-and-latch(p, S) to cache and latch the current version(s) of the page(s) p covering the key range to be scanned.

A lock held by an active transaction may be released by the following call:

Unlock(c, T, x, m, d) releases the local lock of mode m and of duration d held by transaction T at site c on key x, and, if c is a client and the lock may have been registered globally (i.e., if either m = X or the lock-to-page map at c does not contain an entry (x, T, p)), requests the server to release the global lock.
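The condition inside Unlock that decides whether a server round-trip is needed can be written out explicitly (a sketch; the lock-to-page map is reduced here to a set of (x, T) pairs with the page-id omitted):

```python
# Sketch of the release test in Unlock(c, T, x, m, d): the server must be
# asked to release the lock whenever it may have been registered globally.

def must_release_globally(mode, key, txn, lock_to_page):
    """lock_to_page: invented view of the client's lock-to-page map,
    as a set of (key, txn) pairs for purely local S locks."""
    return mode == "X" or (key, txn) not in lock_to_page
```

A local S lock recorded in the lock-to-page map was granted purely locally against an available record, so releasing it never needs a server round-trip.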

10. B-TREE STRUCTURE MODIFICATIONS

In this section we show how B-tree pages are allocated and deallocated and how the structure of the B-tree is modified. To reduce the inevitable contention on storage-map pages, these pages are never cached at clients. All B-tree pages are allocated and deallocated by the server in response to client requests.


To allocate a new page of type t (index page or data page) requested by client c, the server performs the following:

1. Cache-and-latch(sm, X) for some storage-map page sm; determine the page-id p of some free page in the range covered by sm; set p as allocated in sm.
2. Log(n, s, allocate, sm, p), where s is the site-id of the server; set Page-LSN(sm) = n; unlatch sm.
3. Write-latch p; format p as an empty page of type t; mark p as a modified page in the buffer; log(n′, s, format, p, t); set Page-LSN(p) = n′.
4. Register p as cached at c in X mode; ship-to-client(c, p); unlatch and purge p.

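The first three steps above can be sketched compactly: flip a storage-map bit, then format the page, emitting a separate redo-only log record for each modified page (hypothetical Python; the storage-map and log-record shapes are invented):

```python
# Sketch (invented shapes) of the allocation steps: pick a free page-id from
# the storage map and produce the two redo-only log records of steps 2-3.

def allocate_page(storage_map, page_type, next_lsn):
    """storage_map: dict page-id -> allocated?  Returns (p, redo_log)."""
    p = next(pid for pid in sorted(storage_map) if not storage_map[pid])
    storage_map[p] = True
    redo_log = [
        (next_lsn, "allocate", "sm", p),        # stamped as Page-LSN(sm)
        (next_lsn + 1, "format", p, page_type), # stamped as Page-LSN(p)
    ]
    return p, redo_log
```

Because both records are redo-only, neither the allocation nor the formatting ever needs to be undone during transaction rollback or restart recovery.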
A client c can request a page p to be deallocated by specifying “deallocate” in place of the remaining caching mode of p in the ship-to-server(P, M, R, L) call, when p resides write-latched and cached in X mode in the buffer of c. The server installs the pages in P (including p) in the server buffer, appends the log records in R to the log buffer, inserts the locks in L into the lock table, unregisters the caching of p at c, and sends an acknowledgment to c, which then unlatches and purges p from its cache. Then the server issues the call cache-and-latch(sm, X) for the storage-map page sm that covers the page-id p, sets p as deallocated in sm, calls log(n, s, deallocate, sm, p), sets Page-LSN(sm) = n, and unlatches sm.

The log record generated for a page allocation, formatting, or deallocation is redo-only. Thus, these operations are separated from the B-tree structure modifications that give rise to them. A downside of this approach is that, in some rare cases, a process failure or a client or server crash occurring between a page allocation and a B-tree structure modification (or between a B-tree structure modification and a page deallocation) may cause a storage-map page to show some page as allocated when it actually is not. We discuss this minor problem in Section 13.

The structure of the B-tree is modified by one of the five structure modifications: split, merge, redistribute, increase-height, or decrease-height. Each of these modifications is logged using a single redo-only log record. Below we list the pre- and postconditions for each of these modifications when performed at a client. We also give the algorithm for split; the algorithms for merge, redistribute, increase-height, and decrease-height are given in the Electronic Appendix. The calls save(p) and save(p, q) appearing in the algorithms are related to maintaining a saved traversed path; their meaning is explained in Section 11.

Split(p, q, x).
Given are a key x (the current search key) and the page-ids of a write-latched parent page p and a write-latched child page q of p, both pages cached at client c in X mode, where page p has room for a child link with a maximum-length key and page q contains at least twice the minimum number of records required for the page. The operation allocates a new page q′, moves the upper half of the records in q to q′, and makes q′ the right sibling of q. Upon return, the parameter p denotes the one of the pages q, q′ that covers x; that page remains write-latched and cached in X mode (if it is an index page) or in U mode (if it is a data page), while the parent and the sibling are unlatched and their caching modes are downgraded to S mode.


B-Tree Concurrency Control and Recovery




1. If q is a data page, set L = local-S-locks(q); else set L = ∅.
2. Request the server to allocate a new data page q′ (if q is a data page) or a new index page q′ (if q is an index page); wait for the page to arrive; write-latch q′; install q′ in the buffer as a modified page; set height(q′) = height(q).
3. Determine the key y that distributes the records in page q evenly, move the records with keys greater than or equal to y from q to q′, and make q′ the right sibling of q by inserting the child link (y, q′) into p.
4. If q is a data page, move the high-key record in q to q′ and insert (y, q′) as the high-key record in q.
5. Log(n, c, split, p, q, y, q′, V), where V is the set of records moved from q to q′. Stamp the LSN n in the Page-LSN of pages p, q, and q′.
6. Let R = log-records-up-to(n). We have four cases:
(a) x < y and q is an index page: ship-to-server({p, q, q′}, {(p, S), (q′, S)}, R, L); save(p, q); unlatch p; unlatch q′; set p = q.
(b) x < y and q is a data page: ship-to-server({p, q, q′}, {(p, S), (q, U), (q′, S)}, R, L); save(p, q); unlatch p; unlatch q′; set p = q.
(c) x ≥ y and q′ is an index page: ship-to-server({p, q, q′}, {(p, S), (q, S)}, R, L); save(p, q′); unlatch p; unlatch q; set p = q′.
(d) x ≥ y and q′ is a data page: ship-to-server({p, q, q′}, {(p, S), (q, S), (q′, U)}, R, L); save(p, q′); unlatch p; unlatch q; set p = q′.

Merge(p, q, q′). Given are the page-ids of a write-latched parent page p and write-latched child pages q and q′ of p, all cached at client c in X mode, where p is not about to underflow (so that a child link can safely be deleted), q′ is the right sibling of q, and the records in q and q′ fit in a single page. The operation moves the records from q′ to q and deallocates q′.
Upon return, the parameter p denotes the page q; that page remains write-latched and cached in X mode (if it is an index page) or in U mode (if it is a data page), while the parent is unlatched and its caching mode is downgraded to S mode.

Redistribute(p, q, q′, x). Given are a key x and the page-ids of a write-latched parent page p and write-latched child pages q and q′ of p, all cached at client c in X mode, where q′ is the right sibling of q and q tolerates the change of the key in a child link into one with a maximum-length key. This operation is used when one of the pages q, q′ is about to underflow, but they cannot be merged. The operation redistributes the records in q and q′ evenly. Upon return, the parameter p denotes the one of the pages q, q′ that covers x; that page remains write-latched and cached in X mode (if it is an index page) or in U mode (if it is a data page), while the parent and the sibling are unlatched and their caching modes are downgraded to S mode.

Increase-height(p, x). Given are a key x and the page-id of the write-latched root page p, cached at client c in X mode, where p contains at least twice the minimum number of records required. The operation allocates two new pages, q and q′, moves the lower half of the records in p to q and the upper half of the records in p to q′, and makes q and q′ the children of p. Upon return, the parameter p denotes the one of the pages q, q′ that covers x; that page remains write-latched and cached in X mode (if it is an index page) or in U mode (if it is a data page), while the root and the sibling are unlatched and their caching modes are downgraded to S mode.

Decrease-height(p, q, q′). Given are the page-ids of a write-latched parent page p and write-latched child pages q and q′ of p, all cached at client c in X mode, where q and q′ are the only children of p and can be merged, so that the records in q and q′ fit in a single page. The operation moves the records in q and q′ to p and deallocates q and q′. Upon return, the parameter p still denotes the root page; that page remains write-latched and cached in X mode (if it is an index page) or in U mode (if it is a data page).
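As a concrete illustration of step 3 of split, the following sketch (our own toy model with pages as sorted lists of key-value pairs, not the paper's code) chooses the separator key y and moves the upper half of a page's records to the new sibling:

```python
# Illustrative sketch only: a page is modeled as a sorted list of
# (key, value) pairs; real pages also carry a Page-LSN, a high-key
# record, and sibling links, which are omitted here.
def split_records(q_records):
    """Choose a separator key y that distributes the records of page q
    evenly, and move the records with keys >= y to the new sibling."""
    mid = len(q_records) // 2
    y = q_records[mid][0]                               # separator key
    lower = [(k, v) for (k, v) in q_records if k < y]   # stays in q
    upper = [(k, v) for (k, v) in q_records if k >= y]  # moves to q'
    return y, lower, upper
```

The child link (y, q′) inserted into the parent then routes searches for keys greater than or equal to y to the new sibling.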




I. Jaluta et al.

The server-side algorithms for the structure modifications are somewhat simpler than the above client-side algorithms, because no ship-to-server calls are needed. We have the following:

LEMMA 10.1. Assume that the current version of the physical database is a structurally consistent and balanced B-tree. Then any of the structure modifications split, merge, redistribute, increase-height, and decrease-height, when performed at any site in a situation in which the preconditions for the operation hold, retains the structural consistency and balance of the B-tree. Each structure modification is of short duration. When the structure modification is terminated, the server versions of the modified pages are the current versions of the pages.

11. B-TREE TRAVERSALS

To make repeated accesses to a page more efficient, our B-tree traversal algorithms maintain a saved path (cf. Gray and Reuter [1993]; Lomet and Salzberg [1992, 1997]). Each thread of the database process at site c stores its saved path in a local array path[1 · · · h], where h is the height of the B-tree. After traversing a search path ph, ph−1, . . . , p1 of the B-tree, where ph is the page-id of the root page and p1 is the page-id of the target leaf page, the contents of the entries path[i], i = 1, . . . , h, are the following: path[i].page-id = pi, path[i].page-lsn = Page-LSN(pi), path[i].low-key = low-key(pi), path[i].high-key = high-key(pi), and path[i].space-status = the space status of page pi.

The space status has meaning only for an index page, and it can take one of the values "about-to-underflow," "about-to-overflow," or "safe." Status "about-to-underflow" means that the page does not tolerate the deletion of any child link. Status "about-to-overflow" means that the page does not have room for a child link with a maximum-length key. Status "safe" means that the page is neither about to underflow nor about to overflow.
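The saved-path entry described above can be modeled as follows; the field names follow the text, but the dataclass representation and the simplified covering test are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class PathEntry:
    page_id: int
    page_lsn: int
    low_key: float       # -infinity for the root
    high_key: float      # +infinity for the root
    space_status: str    # "about-to-underflow", "about-to-overflow", or "safe"

def covers_key(entry: PathEntry, key: float) -> bool:
    # The saved key range is half-open: low-key <= x < high-key. The real
    # covers(p, i, x) call must also revalidate the entry against the
    # current Page-LSN(p), which this sketch omits.
    return entry.low_key <= key < entry.high_key
```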
The call save(p, q), for a latched index page p and its latched child page q, updates the entry path[height(q)] by setting the current values for page q to all the five fields. The value low-key(q) is obtained from page p, and the value high-key(q) is obtained from page p or from path[height(p)].high-key. The call save(p, p′), for a latched data page p and its latched next page p′, updates the entry path[height(p′)] by setting the current values for page p′. The value low-key(p′) is obtained from the high-key record in p. The call save(r), for the latched root page r, updates the entry path[height(r)] by setting the current values for the root page (with low-key(r) = −∞ and high-key(r) = ∞). These calls are used to maintain the saved path by the traversal algorithms below and by the structure-modification algorithms given in the previous section.

The function call covers(p, i, x), for a latched page p with path[i].page-id = p, is used to test if p is a B-tree page that covers key x (for an algorithm, see the Electronic Appendix). When the call returns false, we cannot be sure whether or not p still covers key x. In that case we have to retraverse the B-tree starting from some higher-level page in the saved path.

The following call uses latch-coupling [Mohan 1990a, 1996; Mohan and Levine 1992; Gray and Reuter 1993; Lomet and Salzberg 1992, 1997] (an analogue to lock-coupling [Bayer and Schkolnick 1977]) to traverse a path


from a given page p that covers key x down to the data page q that covers x:

Traverse(p, x, m, q). Given the page-id of a latched index or data page p, a key x covered by p, and mode m (S or U), the call traverses the path from p down to the data page q that covers x. The path is saved. The page q is cached in (at least) m mode and either read-latched (if m = S) or write-latched (if m = U).
1. Let q = p and n = Page-LSN(p). If p is a read-latched data page and m = U, unlatch p and cache-and-latch(p, U).
2. If Page-LSN(p) ≠ n, downgrade-mode(p, S), unlatch p, let p be the page-id of the root page, cache-and-latch(p, S), save(p), and go to step 1.
3. While p is not a data page, repeat the following: let q be the child page of p that covers x; if height(p) = 2 then cache-and-latch(q, m) else cache-and-latch(q, S); save(p, q); unlatch p; set p = q.
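The latch-coupling loop in step 3 can be sketched on a toy in-memory tree; the dict-based pages and the `held` set standing in for latches are illustrative assumptions:

```python
# Descend by latch-coupling: latch the child that covers x before
# releasing the parent, so at most two pages are latched at any moment.
def traverse(page, x, held):
    held.add(page["id"])                     # page arrives latched
    while not page["is_data"]:
        child = next(c for c in page["children"]
                     if c["low"] <= x < c["high"])
        held.add(child["id"])                # latch the child first,
        assert len(held) <= 2                # latch-coupling invariant
        held.discard(page["id"])             # then unlatch the parent
        page = child
    return page                              # data page, still latched
```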

In performing F[x, θz, w], the following call is used to find the data page q that covers the key z:

Traverse-for-fetch(z, q). Given a key z, the call returns the page-id of the data page q that covers z. The page q is cached and read-latched.
1. If no saved path exists, let r be the page-id of the root, cache-and-latch(r, S), save(r), traverse(r, z, S, q), and return.
2. Let i be the least index with path[i].low-key ≤ z < path[i].high-key, let p = path[i].page-id, and cache-and-latch(p, S).
3. While not covers(p, i, z), unlatch p, set i = i + 1, set p = path[i].page-id, and cache-and-latch(p, S).
4. Traverse(p, z, S, q).
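Steps 2 and 3, picking the lowest saved-path entry that covers z and climbing one level whenever the entry proves stale, can be sketched as follows (the array layout and the `stale` set modeling failed covers() checks are illustrative assumptions):

```python
# path[0] is the leaf entry and path[-1] the root, mirroring path[1..h].
# 'stale' models pages for which covers() fails, forcing a climb.
def start_of_traversal(path, z, stale):
    i = next(k for k, e in enumerate(path)
             if e["low"] <= z < e["high"])   # least covering index
    while path[i]["page_id"] in stale:       # covers() returned false:
        i += 1                               # climb toward the root
    return path[i]["page_id"], i
```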

The following call is used to find the data page q on which to perform one of the forward actions I[x, w] or D[x, w] or to implement logically one of the undo actions I⁻¹[x, w] and D⁻¹[x, w]:

Traverse-for-update(x, q). Given a key x, the call returns the page-id of the data page q that covers x. The page q is cached in U mode and write-latched.
1. If no saved path exists, let r be the page-id of the root, cache-and-latch(r, S), save(r), traverse(r, x, U, q), and return.
2. Let i be the least index with path[i].low-key ≤ x < path[i].high-key, and let p = path[i].page-id.
3. If i = 1 then cache-and-latch(p, U) else cache-and-latch(p, S).
4. While not covers(p, i, x), unlatch p, set i = i + 1, set p = path[i].page-id, and cache-and-latch(p, S).
5. Traverse(p, x, U, q).

The following call is needed in performing an insert action I[x, w] or a logically implemented undo-delete action D⁻¹[x, w], when the data page covering the key x has been found to have no room for the record (x, w) to be inserted.

Retraverse-for-insert(x, w, p). Given a data record (x, w), the call returns the page-id of the data page p that covers x. The call performs any structure modifications (increase-height, split) needed to arrange that p has room for (x, w). Page p is cached in U mode and write-latched. Steps 1 and 2 determine the highest-level page that may need a structure modification, and then steps 3 to 6 traverse the path from that page down to


the data page that covers x, caching and latching pages in X mode and splitting pages level-by-level.
1. Let i be the least index with path[i].low-key ≤ x < path[i].high-key and path[i].space-status = "about-to-overflow" or, if no such index exists, the index of the root. Let p = path[i].page-id; cache-and-latch(p, X).
2. While p is not the root and, in addition, either not covers(p, i, x) or there is no room in p for a child link with a maximum-length key (if p is an index page) or there is no room in p for the record (x, w) (if p is a data page), repeat the following: downgrade-mode(p, S), unlatch p, set i = i + 1, set p = path[i].page-id, and cache-and-latch(p, X).
3. If p is the root such that there is no room in p for a child link with a maximum-length key (if p is an index page) or there is no room in p for the record (x, w) (if p is a data page), increase-height(p, x).
4. While p is an index page, repeat steps (5) to (6). Then return.
5. Let q be the page-id of the child page of p that covers x; cache-and-latch(q, X).
6. If there is no room in q for a child link with a maximum-length key (if q is an index page) or there is no room in q for the record (x, w) (if q is a data page), split(p, q, x); else save(p, q), downgrade-mode(p, S), downgrade-mode(q, U) (if q is a data page), unlatch p, and set p = q.
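The top-down splitting of steps 4 to 6 can be sketched as follows; the dict-based page model and the caller-supplied `has_room` and `split` helpers are illustrative assumptions standing in for the free-space test and the split(p, q, x) modification:

```python
# Descend from p, splitting any child that lacks room before stepping
# into it, so a split never has to propagate back up the tree.
def child_covering(p, x):
    return next(c for c in p["children"] if c["low"] <= x < c["high"])

def descend_splitting(p, x, has_room, split):
    while not p["is_data"]:
        q = child_covering(p, x)
        if not has_room(q):
            p = split(p, q, x)    # returns the half of q that covers x
        else:
            p = q
    return p                      # data page with room for the record
```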

The following call is needed in performing a delete action D[x, w] or a logically implemented undo-insert action I⁻¹[x, w], when the data page containing the record (x, w) has been found to be about to underflow:

Retraverse-for-delete(x, p). Given a key x, the call returns the page-id of the data page p that covers x. The call performs any structure modifications (decrease-height, merge, redistribute) needed to arrange that p tolerates the deletion of any record. Page p is cached in U mode and write-latched. The algorithm is given in the Electronic Appendix.

LEMMA 11.1. Given any number of traverse-for-fetch, traverse-for-update, retraverse-for-insert, and retraverse-for-delete calls executed at various sites of the page-server system on a structurally consistent and balanced B-tree, where each call is immediately followed by the unlatching of the last page in the search path, the execution of these calls is deadlock-free and starvation-free and retains the structural consistency and balance of the tree. A traverse-for-fetch call keeps at most two pages latched (namely, read-latched) at a time. A traverse-for-update call keeps at most two pages latched at a time; one of these pages is read-latched and the other page is either read-latched or write-latched. A retraverse-for-insert call or a retraverse-for-delete call keeps at most three pages latched (namely, write-latched) at a time. In the best case, a traverse-for-fetch call or a traverse-for-update call caches and latches only one page.

PROOF.

See the Electronic Appendix.

Our approach of performing structure modifications in a top-down manner bears some similarities to the algorithms in Mond and Raz [1985], where about-to-overflow pages encountered in a traversal for insertion are split in advance, and about-to-underflow pages encountered in a traversal for deletion are merged or redistributed in advance. However, our algorithms work with variable-length records and still avoid doing structure modifications that may turn out to be unnecessary. Only in those rare cases in which the path saved in the traverse-for-update call fails to indicate correctly the pages in the current


path or the space status of the pages, a subsequent retraverse-for-insert call may start splitting pages too early in the path, or a retraverse-for-delete call may start merging pages too early.

12. IMPLEMENTATIONS OF THE ACTIONS OF TRANSACTIONS

In this section we give algorithms that implement the fetch, insert, delete, undo-insert, undo-delete, and transaction-control actions of our transactions, using the B-tree algorithms presented in the previous sections. To manage the state of each transaction, each site c maintains an active-transaction table that contains an entry for each forward-rolling or backward-rolling transaction of site c (cf. Mohan et al. [1992]). The entry for transaction T contains, besides the transaction-id T, the following fields: (1) state(T): "forward-rolling" or "backward-rolling"; (2) Undo-Next-LSN(T): the LSN of the log record for the last not-yet-undone forward action by T. In practice, the entry for T also contains a field (3) Last-LSN(T): the LSN of the log record for the last logged action of T; we omit this field because it is not needed in our algorithms.

The following call implements the action I[x, w] by transaction T at client c:

Insert(c, T, x, w). Given a client-id c, a transaction-id T, and a data record (x, w), the call inserts the record into the database, provided that the database does not yet contain a record with key x. The insertion is redo-undo-logged for T at c.
1. Locally-X-lock(c, T, x, commit). Traverse-for-update(x, p). If page p has no room for (x, w) then unlatch p and retraverse-for-insert(x, w, p). Let y be the greatest data-record key in p. If x ≥ y then cache-and-latch(p′, U), where p′ is the next page of p, and let x′ be the least data-record key in p′; else let p′ = p and let x′ be the least data-record key in p with x′ > x.
2. Conditionally-locally-X-lock(c, T, x′, short). If the lock is granted then go to step 4.
Save Page-LSN(p) and Page-LSN(p′); unlatch p; unlatch p′ (if p′ ≠ p); locally-X-lock(c, T, x′, short); cache-and-latch(p, U); cache-and-latch(p′, U) (if p′ ≠ p).
3. Check if p and p′ still have the saved Page-LSNs, and, if not, check if p still is a data page covering x and if p′ still is a data page covering x′ and if x′ still is the least data-record key greater than x and if p still has room for (x, w); if not (or if we cannot tell whether or not), unlatch p, unlatch p′, remember the lock on x′, and restart the algorithm.
4. Request-global-X-lock(c, T, p, x, commit). Request-global-X-lock(c, T, p′, x′, short). If both X locks are granted immediately then go to step 5. Save Page-LSN(p) and Page-LSN(p′); mark the record with key x′ in page p′ as unavailable (to prevent starvation); unlatch p; unlatch p′ (if p′ ≠ p); wait for the locks to be granted; cache-and-latch(p, U); cache-and-latch(p′, U) (if p′ ≠ p); perform step 3.
5. If p contains a data record with key x then unlatch p, unlatch p′ (if p′ ≠ p), unlock(c, T, x′, X, short), and return with exception "uniqueness violation."
6. Insert (x, w) into p and mark it as unavailable; log(n, c, T, I, p, (x, w), m), where m = Undo-Next-LSN(T); set Page-LSN(p) = n; set Undo-Next-LSN(T) = n; mark p as a modified page in the buffer; unlatch p; unlatch p′ (if p′ ≠ p); unlock(c, T, x′, X, short). Also release all locks on keys other than x remembered in step 3, if any.
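The next-key computation in step 1 (finding the key x′ that receives the short-duration X lock) can be sketched with sorted key lists standing in for pages p and p′, an illustrative simplification:

```python
import bisect

# x' is the least data-record key greater than x: taken from page p
# itself when such a key exists, and otherwise from p's next page p'.
def next_key(p_keys, next_page_keys, x):
    i = bisect.bisect_right(p_keys, x)
    if i < len(p_keys):
        return p_keys[i]              # least key in p greater than x
    return next_page_keys[0]          # least key in the next page p'
```

Locking x′ for the duration of the insertion is what protects key-range scans by other transactions from phantom insertions.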

The following call implements the action D[x, w] by transaction T at client c:

Delete(c, T, x, w). Given a client-id c, a transaction-id T, and a key x, the call deletes the data record (x, w) with key x from the database, provided that the database contains


such a record. The deletion is redo-undo-logged for T at c. The algorithm for this call is given in the Electronic Appendix.

The following call implements the action F[x, θz, w] by transaction T at client c:

Fetch(c, T, x, θz, w). Given a client-id c, a transaction-id T, a comparison operator θ, and a key z < ∞, the call fetches from the database the record (x, w) with the least key x with x θ z.
1. Traverse-for-fetch(z, p). Let y be the greatest data-record key in p. If z < y, or z = y and θ = "≥", then let p′ = p and let x be the least data-record key in p with x θ z; else if high-key(p) = ∞ then let p′ = p and x = ∞; else let p′ = the next page of p, cache-and-latch(p′, S), save(p, p′), and let x be the least data-record key in p′.
2. Conditionally-locally-S-lock(c, T, p′, x). If the lock is granted, go to step 4. Save Page-LSN(p) and Page-LSN(p′); unlatch p; unlatch p′ (if p′ ≠ p); request-global-S-lock(c, T, x); if the reply "lock request pending" is received, wait for the lock to be granted; insert the lock (c, T, x, S, commit) into the client lock table; cache-and-latch(p, SC); cache-and-latch(p′, SC) (if p′ ≠ p).
3. Check if p and p′ still have the saved Page-LSNs, and if not, check if p still is a data page covering z and if p′ still is a data page covering x and if x still is the least data-record key with x θ z; if not (or if we cannot tell whether or not), unlatch p, unlatch p′, remember the lock on x, and restart the algorithm.
4. If x = ∞ then let w = 0; else get the record (x, w) with key x from p′. Unlatch p; unlatch p′ (if p′ ≠ p). Release all locks on keys other than the final x remembered in step 3, if any. Return with (x, w).
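The semantics of the fetch action, returning the record with the least key x satisfying x θ z, can be sketched on a single sorted list; collapsing the two-page boundary handling of step 1 into one list is an illustrative simplification:

```python
import bisect

# Return the record with the least key x such that x θ z, for θ in
# {">=", ">"}; (inf, 0) encodes "no such record", matching the
# convention x = ∞, w = 0 in step 4 of the fetch algorithm.
def fetch(records, theta, z):
    keys = [k for k, _ in records]
    i = (bisect.bisect_left(keys, z) if theta == ">="
         else bisect.bisect_right(keys, z))
    return records[i] if i < len(records) else (float("inf"), 0)
```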

The server calls insert(s, T, x, w), delete(s, T, x, w), and fetch(s, T, x, θz, w), which implement the actions I[x, w], D[x, w], and F[x, θz, w] of transactions run at the server s, are somewhat simpler than the above client calls; we omit their algorithms.

The following call implements the action I⁻¹[x, w] by transaction T at site c:

Undo-insert(n, c, T, p, x, m). Given a log record n: ⟨c, T, I, p, (x, w), m⟩, the call deletes (x, w) from the database. The deletion is redo-only-logged for T at c.
1. Cache-and-latch(p, U). If p is a data page and either Page-LSN(p) = n or p can be seen still to cover x and is not about to underflow, then let p′ = p and go to step 3.
2. Unlatch p. Traverse-for-update(x, p′). If page p′ is about to underflow then unlatch p′ and retraverse-for-delete(x, p′).
3. Delete the record with key x from p′; log(n′, c, T, I⁻¹, p′, x, m); set Page-LSN(p′) = n′; set Undo-Next-LSN(T) = m; mark p′ as a modified page in the buffer; unlatch p′.

The following call implements the action D⁻¹[x, w] by transaction T at site c:

Undo-delete(n, c, T, p, x, w, m). Given a log record n: ⟨c, T, D, p, (x, w), m⟩, the call inserts (x, w) into the database. The insertion is redo-only-logged for T at c.
1. Cache-and-latch(p, U). If p is a data page and either Page-LSN(p) = n or p can be seen still to cover x and has room for (x, w), then let p′ = p and go to step 3.
2. Unlatch p. Traverse-for-update(x, p′). If page p′ has no room for (x, w) then unlatch p′ and retraverse-for-insert(x, w, p′).
3. Insert (x, w) into p′ and mark it as unavailable; log(n′, c, T, D⁻¹, p′, (x, w), m); set Page-LSN(p′) = n′; set Undo-Next-LSN(T) = m; mark p′ as a modified page in the buffer; unlatch p′.
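The Page-LSN shortcut in step 1 of both undo calls can be sketched as follows; the function and parameter names are illustrative, not the paper's:

```python
# If Page-LSN(p) still equals the LSN n of the original update's log
# record, p cannot have been modified since that update, so the record
# is certainly still in p and the undo can be applied there directly.
# Otherwise p is usable only if it can still be seen to cover x and has
# enough slack (room for the record, or no danger of underflow).
def can_undo_in_place(page_lsn, n, still_covers_x, has_slack):
    return page_lsn == n or (still_covers_x and has_slack)
```

When the test fails, the undo call falls back on traverse-for-update (and possibly a retraversal), implementing the undo logically rather than physically.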


The following calls implement the transaction-control actions B, C, A, S[P], A[P], and C[P] by T at site c:

Begin(c, T): generate a new transaction-id T; log(n, c, T, B); insert T into the active-transaction table of c with state(T) = "forward-rolling" and Undo-Next-LSN(T) = n.

Commit(c, T): log(n, c, T, C); let R = log-records-up-to(n); request the server to append the log records in R into the server's log buffer, to flush the log, and to release all global locks held by T; wait for the reply; advance the Shipped-Log-LSN to n (see Section 7); release all local locks held by T; remove T from the active-transaction table of c.

Abort(c, T): set state(T) = "backward-rolling" in the active-transaction table; log(n, c, T, A).

Set-savepoint(c, T, P): log(n, c, T, S[P], m), where m = Undo-Next-LSN(T); set Undo-Next-LSN(T) = n.

Abort-to-savepoint(c, T, P): log(n, c, T, A[P]).

Complete-rollback-to-savepoint(c, T, P): log(n, c, T, C[P]).

A partial rollback and a total rollback are performed by the following calls at site c:

Rollback-to-savepoint(c, T, P):
1. Abort-to-savepoint(c, T, P); let n = Undo-Next-LSN(T); let r = get-log-record(c, n).
2. While r is not ⟨c, T, S[P], m⟩, perform steps 3 and 4. Then complete-rollback-to-savepoint(c, T, P) and return.
3. If r is ⟨c, T, I, p, (x, w), m⟩ then undo-insert(n, c, T, p, x, m); else if r is ⟨c, T, D, p, (x, w), m⟩ then undo-delete(n, c, T, p, x, w, m); else if r is ⟨c, T, S[P′], m′⟩ with P′ ≠ P then set Undo-Next-LSN(T) = m′.
4. Let n = Undo-Next-LSN(T); let r = get-log-record(c, n).

Rollback(c, T):
1. Abort(c, T); let n = Undo-Next-LSN(T); let r = get-log-record(c, n).
2. While r is not ⟨c, T, B⟩, perform steps 3 and 4. Then commit(c, T) and return.
3. If r is ⟨c, T, I, p, (x, w), m⟩ then undo-insert(n, c, T, p, x, m); else if r is ⟨c, T, D, p, (x, w), m⟩ then undo-delete(n, c, T, p, x, w, m); else if r is ⟨c, T, S[P], m⟩ then set Undo-Next-LSN(T) = m.
4. Let n = Undo-Next-LSN(T); let r = get-log-record(c, n).
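The backward walk over the Undo-Next-LSN chain performed by rollback can be sketched as follows; the dict-based log and the omission of savepoint records are illustrative simplifications:

```python
# Walk the Undo-Next-LSN chain backward from the transaction's last
# forward action, undoing inserts and deletes until the begin record B
# is reached. Log records are dicts keyed by LSN; the 'prev' field
# plays the role of the Undo-Next-LSN stored in each record.
def rollback(log, last_lsn):
    undone = []
    n = last_lsn
    while True:
        r = log[n]
        if r["type"] == "B":
            return undone                     # rollback complete
        if r["type"] == "I":
            undone.append(("I-1", r["key"]))  # undo the insertion
        elif r["type"] == "D":
            undone.append(("D-1", r["key"]))  # undo the deletion
        n = r["prev"]                         # follow Undo-Next-LSN
```

Because the undo actions themselves are redo-only-logged, a crash during this walk never requires undoing an undo.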

The call get-log-record(c, n) returns the log record with LSN n generated at site c. If the log record still resides in the log buffer of c, it is fetched from there. Otherwise, the log record is requested from the server (if c is a client) or fetched from the log disk (if c is the server). To make this efficient, each client maintains an in-memory mapping of locally generated LSNs to addresses in the local log buffer, for all LSNs greater than or equal to the local Commit-LSN (see Section 9). If each client has enough virtual memory allocated for the local log buffer to accommodate all the log records by currently active transactions at the client, then any client transaction can be rolled back without fetching any log records from the server.

THEOREM 12.1. The calls insert, delete, fetch, undo-insert, and undo-delete correctly implement our fine-grained locking protocol and are starvation-free, so that each of these calls will eventually terminate, given that each site services all


lock requests in a fair (first-come, first-served) manner and that no deadlocks or system crashes occur.

PROOF.

See the Electronic Appendix.

In general, deadlocks cannot be avoided, because transactions may upgrade S locks to X locks and access multiple records in arbitrary orders. However, since the calls insert, delete, and fetch never hold any page latched while being forced to wait for a lock, we have the following:

THEOREM 12.2. Assume that all transactions access data records in ascending key order and that no transaction ever upgrades any S lock to an X lock. Then no deadlocks can occur.

Lemmas 8.1, 10.1, and 11.1 imply the following:

THEOREM 12.3. Assume that any number of forward-rolling, committed, backward-rolling, and rolled-back transactions have been run in our page-server system, on an initially structurally consistent and balanced B-tree b. Then the resulting current version of the physical database is a structurally consistent and balanced B-tree b′ such that (1) db(b′) is the logical database produced by those transactions for db(b), (2) the server version of any index page in b′ is the current version of the page, (3) the cached copies of any index page in b′ are equal, and (4) the cached copies of any data page in b′ cover an equal key range.

13. RESTART RECOVERY

When a client c crashes, all the page copies cached at c, and all the not-yet-shipped log records generated at c, are lost. Thus, if a page was cached at c in U or X mode, the server version of the page is then the most up-to-date existing version of the page. However, because data-page updates are propagated lazily, there may be log records from c at the server for updates on data pages cached at c in U mode that do not yet show in the server versions of those pages. Thus, redoing of data-page updates may be needed to reproduce the current versions of data pages.
After the redoing, all transactions that were active at c at the time of the crash must be rolled back at the server, using the log records found at the server and undoing any not-yet-undone insertions and deletions by those transactions on the reproduced current versions of the pages.

When the server crashes, all pages in the server buffer, any unflushed log records, the server lock table, and the page-caching map are lost. Thus there is no other option than to invalidate all page copies residing in the client caches and to discard all log records that remain at the clients. The current versions of pages are reproduced by redoing data-page updates, structure modifications, and page allocations, formattings, and deallocations on the disk versions of the pages, using the log records found on the server's log disk. After the redoing, all transactions that were active at any site at the time of the crash must be aborted and rolled back.

Our restart-recovery protocol is based on ARIES/CSA [Mohan and Narang 1994]. The differences between our protocol and ARIES/CSA are mainly related to how the log records for structure modifications and page allocations


and deallocations are processed. As explained in Sections 6 and 10, in our protocol each structure modification or page allocation, formatting, or deallocation is logged using a single redo-only log record, while in ARIES and ARIES/CSA such a modification is logged using several redo-undo log records. In our protocol, such a modification never needs to be undone when rolling back transactions during normal processing or restart recovery, while in ARIES and ARIES/CSA an interrupted modification is undone when rolling back the transaction that triggered it.

As in ARIES/CSA, we assume that each site maintains a modified-page table (called "dirty-pages list" in Mohan and Narang [1994]). The modified-page table at the server contains, for each modified page p in the server buffer, a physical log address Rec-Addr(p) that indicates the position in the log after which there may be log records for updates on p that do not yet exist in the disk version of p. When page p is modified or updated at the server for the first time since p was last fetched from disk, p is inserted into the modified-page table at the server with Rec-Addr(p) set to the address of the log record for the modification or update.

The modified-page table at a client c contains, for each modified page p in the buffer of c, an LSN value Rec-LSN(p) with the property that Rec-LSN(p) ≤ n for the LSNs n of the log records for the updates on p that do not yet exist in the server version of p. When page p is modified or updated at client c for the first time since p was last cached at c, p is inserted into the modified-page table at c with Rec-LSN(p) = n, where n is the LSN of the log record for the modification or update. We assume that whenever client c uses the ship-to-server call to ship a set of modified pages to the server, it also ships the Rec-LSNs of those pages.
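The client-side bookkeeping just described can be sketched as follows; the dict-based table and the function name are illustrative assumptions:

```python
# Rec-LSN(p) is recorded only on the first update since p was cached,
# so it is a lower bound on the LSN of every update on p that may not
# yet exist in the server version of p.
def note_update(modified_pages, page_id, lsn):
    if page_id not in modified_pages:   # first update since caching
        modified_pages[page_id] = lsn   # Rec-LSN(p) = n
    return modified_pages[page_id]
```

During restart recovery, the minimum of these Rec-LSNs (the Redo-LSN) bounds how far back the redo pass must start scanning the log.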
When installing such a shipped page p in the server buffer, if the server version of p is unmodified, that is, either p does not reside in the server buffer or the copy of p in the server buffer has the same contents as the disk version of p, then p is inserted into the modified-page table of the server with a Rec-Addr(p) that corresponds to (or, conservatively, is lower than) Rec-LSN(p). Otherwise, Rec-Addr(p) is not changed. The entry for p is deleted from the modified-page table at client c when p has been shipped to the server and an acknowledgement has been received from the server. The entry for p is deleted from the modified-page table at the server when p has been flushed onto disk.

As in ARIES/CSA, we assume that all the clients as well as the server take checkpoints periodically. A checkpoint logged at client c contains copies of the active-transaction table and the modified-page table of c. When the server receives a log record ⟨c, end-checkpoint⟩ indicating the end of a checkpoint, the server determines the physical log-record addresses corresponding to the Rec-LSNs in the logged modified-page table and appends these into the log with the end-checkpoint log record. When taking a checkpoint at the server, all the clients are requested to ship a copy of their modified-page table to the server. The server combines the modified-page tables of the clients with its own modified-page table and appends into the log the combined table and the server's active-transaction table [Mohan and Narang 1994].

Recovery from a client failure is performed by the server thread that services the failed client c. Once the failure of c has been noticed, the server


thread gains exclusive access to all entries in the page-caching map that relate to pages cached at c. The entries for pages cached at c in S or X mode can safely be deleted. Note that, because of the immediate propagation of structure modifications, the log record for a structure modification on a page that remains cached at c in X mode cannot have reached the server. Exclusive access to the page-caching-map entries for pages cached at c in U mode is held by the server thread until the current versions of those pages have been reproduced. Moreover, other threads of the server process are prevented from getting a latch on those pages. Other server threads that are waiting in vain for their callback-page or callback-record requests to be serviced by c must adjust their behavior accordingly (for simplicity, we have omitted the steps needed to accomplish this from the algorithms in Sections 8 and 9). With these exceptions, normal transaction processing can continue during the recovery process. The rollbacks of transactions of c are protected by the commit-duration X locks still found in the lock table at the server.

The recovery begins with the analysis pass, which is performed by scanning the log in the forward direction for log records generated for c, starting from the latest checkpoint taken by c. In the analysis pass, the active-transaction table and the modified-page table of c are reconstructed as in ARIES/CSA, except that, when a redo-only log record for a structure modification is encountered, every page affected by that modification is removed from the modified-page table, if it is there. Recall that structure modifications done at the clients are immediately propagated to the server: if a log record for a structure modification is found at the server, then the pages involved in that modification have also been shipped to the server (all in one ship-to-server call with the log record) and installed in the server buffer.
As in ARIES/CSA, the analysis pass ends by determining the value Redo-LSN, the minimum of the Rec-LSNs in the reconstructed modified-page table. Next the server thread performs the redo pass for c by scanning the log in the forward direction for log records for c, starting from the log record with LSN = Redo-LSN. Any log record with LSN n for an action I[x, w], D[x, w], I⁻¹[x, w], or D⁻¹[x, w] by a transaction of c on a data page p that does not appear in the modified-page table or appears there with Rec-LSN(p) > n is skipped, while for any such log record with Rec-LSN(p) ≤ n the call cache-and-latch(p, U) is issued and Page-LSN(p) is examined. Now if p is unmodified and Rec-LSN(p) ≤ Page-LSN(p), then Rec-LSN(p) is advanced to Page-LSN(p) + 1. Then either p is unlatched (if Page-LSN(p) ≥ n) or the logged action is performed on p, Page-LSN(p) is advanced to n, and p is unlatched (if Page-LSN(p) < n).

Finally the server thread performs the undo pass for c. In the undo pass, all forward-rolling transactions are aborted and rolled back, and the rollback of all backward-rolling transactions is completed. The log is scanned backward from the end of the log until all updates of such transactions are undone. First, abort(c, T) is called for each forward-rolling transaction T in the reconstructed active-transaction table of c. Then the following is repeated until the reconstructed active-transaction table is empty: let n = the greatest Undo-Next-LSN(T) value in the reconstructed active-transaction table; let r = get-log-record(c, n); if r is ⟨c, T, B⟩ then commit(c, T) else if r is ⟨c, T, I, p, (x, w), m⟩
then undo-insert(n, c, T, p, x, m) else if r is ⟨c, T, D, p, (x, w), m⟩ then undo-delete(n, c, T, p, x, w, m) else if r is ⟨c, T, S[P], m⟩ then set Undo-Next-LSN(T) = m. Here the transactions are rolled back using a single backward scan of the log. In an alternative solution, the call rollback(c, T) (see Section 12) is used to roll back each transaction T individually. These calls can be run serially within the server thread, or new concurrent processes can be created to do the job.

THEOREM 13.1. Assume that restart recovery from a failure at client c is performed in the presence of concurrent transactions at the active sites. Then the current version of the logical database is one that would be possible to produce in normal processing where c, instead of crashing, just aborts and rolls back all its forward-rolling transactions and completes the rollback of all its backward-rolling transactions. No structure modifications are redone. The structural consistency and balance of the B-tree are retained, except that, in rare cases, some storage-map pages may show some pages as allocated although they are not part of the B-tree.

The slight inconsistency relating to storage-map pages occurs when a client has allocated a page, in order to perform one of the structure modifications split or increase-height, and has then failed before the modification was completed or propagated to the server. As regards our algorithms, the inconsistency does no harm. To reclaim the unused allocated pages, the analysis pass of restart recovery from a client failure can be augmented so as to detect unused allocated pages, or a garbage-collection process can be run periodically. An alternative solution would be to let the clients do page allocation (and deallocation), so that the allocation (or deallocation) is logged as part of the log record for the structure modification, as was done in Jaluta [2002]. However, caching storage-map pages at the clients in U or X mode limits concurrency.
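The single-backward-scan undo loop described above can be sketched as follows. The dictionary-based log records, the helper names, and the recording (rather than application) of undo actions are simplifying assumptions for illustration:

```python
def undo_pass(log_by_lsn, active_tx):
    """Roll back all of the failed client's transactions in one backward
    sweep: repeatedly pick the transaction with the greatest Undo-Next-LSN
    and process that log record. Undo actions are recorded rather than
    applied; a real system would call undo-insert/undo-delete and would
    log the completion of each rollback (the commit(c, T) step)."""
    undone = []
    while active_tx:
        tid = max(active_tx, key=active_tx.get)   # greatest Undo-Next-LSN
        r = log_by_lsn[active_tx[tid]]
        if r["type"] == "begin":
            del active_tx[tid]                    # rollback of tid complete
        elif r["type"] == "insert":
            undone.append(("undo-insert", r["page"], r["key"]))
            active_tx[tid] = r["undo_next"]
        elif r["type"] == "delete":
            undone.append(("undo-delete", r["page"], r["key"]))
            active_tx[tid] = r["undo_next"]
        else:                                     # set-savepoint record S[P]
            active_tx[tid] = r["undo_next"]
    return undone
```

Because the loop always selects the greatest Undo-Next-LSN across all transactions, the log records are visited in strictly decreasing LSN order, which is what makes a single backward scan sufficient.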
Recovery from a server failure is performed similarly, except that (1) the last checkpoint taken by the server is used as the starting point of the analysis pass, (2) the pages affected by structure modifications encountered in the analysis pass are not removed from the modified-page table but instead are inserted there (if not yet there), (3) the redo pass also includes redoing structure modifications and page allocation, formatting, and deallocation operations, (4) active transactions of all sites are rolled back, (5) the undo pass processes the log records in decreasing log-record-address order, and (6) no new transactions are admitted to the system until the recovery has been completed.

The recovery process is similar to that of ARIES/CSA, except for the processing of log records for structure modifications. Assume a redo-only log record r with LSN n for an operation that affects pages p1, ..., pk. In the redo pass, this log record is processed as follows. For each page pi for which Rec-LSN(pi) ≤ n, the server call cache-and-latch(pi, X) is issued and Page-LSN(pi) is examined. If Page-LSN(pi) < n, then the effect of the logged operation is installed on page pi, Page-LSN(pi) is advanced to n, and pi is unlatched; otherwise pi is just unlatched. It does not matter in which order the pages p1, ..., pk are processed, because the log record contains all the information needed to reproduce the effect of the operation on any page pi.
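This multi-page redo step can be sketched as follows, under the assumption that pages and log records are plain dictionaries and that the logged effect is installed by a caller-supplied apply callback (all names are illustrative):

```python
def redo_smo(record, modified_pages, buffer_pool):
    """Redo one redo-only structure-modification log record that affects
    several pages, following the LSN tests in the text. A page is a dict
    with a 'page_lsn' field; record['apply'] installs the logged effect."""
    n = record["lsn"]
    for pid in record["pages"]:
        rec_lsn = modified_pages.get(pid)
        if rec_lsn is None or rec_lsn > n:
            continue                      # page already reflects this record
        page = buffer_pool[pid]           # stands in for cache-and-latch(pid, X)
        if page["page_lsn"] < n:
            record["apply"](page)         # install the logged effect on the page
            page["page_lsn"] = n
        # else: effect already present; the page is just unlatched
```

Since the single apply callback can reproduce the effect on any of the affected pages independently, the loop may visit the pages in any order, mirroring the observation above.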
For example, the log record n: ⟨c, split, p, q, y, q′, V⟩ can be redo-processed as follows:

1. If p does not appear in the modified-page table or Rec-LSN(p) > n, then go to step 2. Otherwise, cache-and-latch(p, X); if p is unmodified and Rec-LSN(p) ≤ Page-LSN(p), then set Rec-LSN(p) = Page-LSN(p) + 1; if Page-LSN(p) < n, then insert the child link (y, q′) into p and set Page-LSN(p) = n; unlatch p.

2. If q does not appear in the modified-page table or Rec-LSN(q) > n, then go to step 3. Otherwise, cache-and-latch(q, X); if q is unmodified and Rec-LSN(q) ≤ Page-LSN(q), then set Rec-LSN(q) = Page-LSN(q) + 1; if Page-LSN(q) < n, then delete from q the records with keys greater than or equal to y, insert (y, q′) as the high-key record in q (if q is a data page), and set Page-LSN(q) = n; unlatch q.

3. If q′ does not appear in the modified-page table or Rec-LSN(q′) > n, then return. Otherwise, cache-and-latch(q′, X); if q′ is unmodified and Rec-LSN(q′) ≤ Page-LSN(q′), then set Rec-LSN(q′) = Page-LSN(q′) + 1; if Page-LSN(q′) < n, then insert into q′ the records from V and set Page-LSN(q′) = n; unlatch q′.
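The three steps can be condensed into a toy redo routine. It folds the modified-page-table and Page-LSN tests of each step into one check and omits the Rec-LSN adjustment and the data-page high-key record, so it is a sketch under simplifying assumptions rather than the paper's algorithm:

```python
def redo_split(n, p, q, q2, y, V, modified_pages):
    """Toy redo of the split log record n: <c, split, p, q, y, q', V>.
    Pages are dicts {'id', 'page_lsn', 'records'} holding (key, value)
    pairs; V holds the records moved to the new page q'. The Rec-LSN
    adjustment and the data-page high-key record are omitted for brevity."""
    def needs_redo(page):
        rec_lsn = modified_pages.get(page["id"])
        return rec_lsn is not None and rec_lsn <= n and page["page_lsn"] < n

    if needs_redo(p):                     # parent: insert the child link (y, q')
        p["records"].append((y, q2["id"]))
        p["page_lsn"] = n
    if needs_redo(q):                     # split page: drop records with key >= y
        q["records"] = [r for r in q["records"] if r[0] < y]
        q["page_lsn"] = n
    if needs_redo(q2):                    # new page: install the moved records V
        q2["records"] = list(V)
        q2["page_lsn"] = n
```

Each page is tested and updated independently, which is why the single redo-only split record never forces redo work on pages that already carry the split's effect.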

In comparison, a page split is logged in ARIES [Mohan et al. 1992; Mohan 1996] using three redo-undo log records, one for each page involved. Redoing these three log records involves the same amount of work as redoing the single log record that represents the effects of the split on all three pages. The redo algorithm uses the call cache-and-latch, which includes latching. As in ARIES, write-latching of pages during restart recovery from a server failure is actually needed only if parallelism is employed to speed up the recovery process. Thus, if parallelism is not employed, we may use a simpler version of the call cache-and-latch that does not include latching.

THEOREM 13.2. Assume that restart recovery from a server failure is performed. Then the current version of the logical database is one that would be possible to produce in normal processing starting with the current version of the physical database at the time of the crash and aborting and rolling back the forward-rolling transactions at all sites and completing the rollback of the backward-rolling transactions at all sites. The server version of the B-tree is the current version, and it is structurally consistent and balanced, except that, in rare cases, some storage-map pages may show some pages as allocated although they are not part of the B-tree.

In this case the slight inconsistency relating to storage-map pages may also be due to one of the structure modifications merge or decrease-height, when the log record for the structure modification has been flushed onto disk before the crash but the log record for the page deallocation has not.

Recovery from media failures is handled exactly as in ARIES/CSA. Assume that a page p is cached at site c in U or X mode and that the page copy becomes corrupted when an update is performed on it. If c is a client, the page is recovered by c from the server version of p by redoing on it any missing updates with LSN ≥ Rec-LSN(p).
If c is the server, the page is recovered by the server from the disk version of p by redoing on it any missing updates with a log-record address greater than or equal to Rec-Addr(p). If an uncorrupted version of p cannot be obtained from disk, a copy of p is obtained from the last backup copy (archive
dump), and updates are redone on it starting from the appropriate log address as recorded with the backup copy [Mohan et al. 1992; Mohan and Narang 1994].

14. ANALYSIS OF THE ALGORITHMS

In this section we give examples that demonstrate some important features of our algorithms. More examples are given in the Electronic Appendix.

Example 14.1. There are n clients, each running m applications. The database contains R data records and B data pages, and the B-tree is of height h. Each application generates a transaction that fetches all the records in the database in ascending key order, by issuing a series of fetch actions F[x_{i+1}, > x_i], i = 0, 1, ..., R, where x_0 = −∞. No updating transactions are active. At each client, the lag (in the number of pages accessed) between the fastest-progressing transaction and the slowest-progressing transaction is always less than the size of the client buffer. The lag between the fastest-progressing transaction and the slowest-progressing transaction across the entire system is less than the size of the server buffer.

Then, to fetch the first record, the client-database-process thread that services an application needs to perform a traversal from the root down to the leftmost data page, caching the pages in the traversed path in S mode. The rest of the records are fetched by caching data pages alone. This is because the traversed path is saved and because the leaf pages are sideways linked from left to right. Assume that all the buffers are empty initially. Then each client performs, in all, m(B + h − 1) cache-and-latch calls, B + h − 1 get-page requests, and mR conditionally-locally-S-lock calls. No global locks are acquired. The server performs n(B + h − 1) cache-and-latch calls, B + h − 1 fetch-from-disk calls, and n(B + h − 1) ship-to-client calls.

Example 14.2. We assume a database that contains R data records with even keys 2, 4, ...
, 2R stored in data pages, where each page is half-full of records and has odd low and high keys. Transaction T1 at client c1 inserts R records with odd keys 1, 3, ..., 2R − 1 in ascending key order, while transaction T2 at client c2 fetches the records with even keys by first issuing the action F[x′, > 1] and then repeating the action F[x′, > x + 1], where x is the previous key fetched. Then T1 and T2 can run concurrently in many ways, provided that at least in the beginning T1 runs at least one record ahead of T2, so that T1 has time to acquire and to release the short-duration X locks on the next keys (i.e., the even keys) before T2 acquires the commit-duration S locks on those keys. Note that T2 does no dirty or unrepeatable read, because the odd keys inserted by T1 do not belong to the key ranges scanned by T2. Assume, in particular, that T1 runs all the time one page ahead of T2, so that, for all data pages p, T2 fetches the first even key from p just after T1 has inserted into p the greatest odd key covered by p.

When x is the least odd key covered by data page p (i.e., the low key of p), performing the action I[x] at c1 leads to the caching at the server and at c1 of the pages in the path from the root page to p. The index pages are cached at
both sites in S mode, while the data page p is cached at c1 in U mode and at the server in S mode. Because the path is saved at c1, the path is not retraversed when inserting the other odd keys covered by p, and, as p is already cached at c1 in U mode, p is just write-latched for the time of the insertion of each such key. The server grants T1 global commit-duration X locks on the odd keys inserted into p and global short-duration X locks on the even keys (the next keys). The records with the odd keys are marked as unavailable in the copy of p at c1, and the records with the even keys are marked as unavailable in the copy of p at the server.

Performing the action F[x′, > 1] at c2 leads to the caching at c2 in S mode of the pages in the path from the root page to the leftmost data page, while the action F[x′, > x + 1], for even keys x > 0, leads to the caching of the data page, p, alone (because of the sideways linking of the data pages). The pages are obtained by issuing the call cache-and-latch(p′, S) at c2 for each page p′ in the path. The calls result in the requests get-page(p′, SC) being sent to the server. For each index page, the request is satisfied by shipping to c2 the copy of the page found in the server buffer, while for the data page p the server sends the request callback-page(p, SC) to c1. Client c1 responds by shipping p with the log records to the server, which then installs p in its buffer, appends the log records to the log buffer, and sends a copy of p to c2. In this copy, the records with odd keys are unavailable while those with even keys are available. Thus, for all the even keys x′ in p, the call conditionally-locally-S-lock(c2, T2, p, x′) returns with the lock on x′ granted, so that no global locks need be acquired and the records with even keys in p can be fetched without any further communication with the server.
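The fast path of conditionally-locally-S-lock implied by this example can be sketched as follows; the availability map and the lock-set representation are illustrative assumptions:

```python
def conditionally_locally_s_lock(cached_page, key, local_locks):
    """Fast path of record-level replica management: if the record is
    marked available in the cached page copy, grant a local S lock with
    no server message; otherwise report failure so that the caller
    acquires a global S lock from the server instead."""
    if cached_page["available"].get(key, False):
        local_locks.add((key, "S"))       # granted locally, no communication
        return True
    return False                          # record unavailable: go global
```

This is what lets T2 in the example fetch all the available even keys in a cached page copy without any lock-request messages to the server.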
For simplicity, assume that the server buffer and the buffer at c1 are large enough that no page need be fetched twice from disk or shipped twice to c1. (A two-page buffer is enough for c2.) Then, with a database with B data pages and B′ index pages, the server issues B + B′ fetch-from-disk calls, 2B + B′ + h − 1 ship-to-client calls, and B callback-page calls, and grants R commit-duration X locks and R short-duration X locks. Client c1 issues B + B′ get-page requests, 2R requests for global locks, and B ship-to-server calls. Client c2 issues B + h − 1 get-page requests, but no requests for global locks.

Example 14.3. Assume that the lag between the transactions T1 and T2 of the previous example is only one record, so that, for all data pages p and even keys x′ covered by p, T2 fetches x′ just after T1 has inserted into p a record with key x′ − 1. Then, for each insertion of a record in p, the next-key record is marked as unavailable in the copy of p at the server and, if the inserted record is not the first one in p, also in the copy of p at c2. Thus, in all, R − B callback-record requests are sent by the server to c2. When fetching any but the first record in any data page p, client c2 finds every time that the record to be fetched is unavailable. Thus, a global S lock on the key is acquired from the server and the current version of p is fetched from c1. In this case, c1 issues B + B′ get-page requests, 2R requests for global locks, and R ship-to-server calls, while c2 issues R + h − 1 get-page calls and R − B requests for global locks. This represents a worst-case scenario (for two-client transactions),
in which a maximum number of global-lock requests and get-page calls are needed.

Example 14.3 demonstrates a case in which our approach inevitably leads to more lock-request messages being passed between a client and the server than are expected in approaches that use optimistic protocols, such as those in Voruganti et al. [2004]. However, a definite comparison is difficult, because most of the many performance studies on page-server algorithms are based on the abstract model of reading and writing objects in data pages, thus missing the issues that are most relevant for our article, namely, B-tree concurrency control and recovery, and fine-grained repeatable-read-level isolation in the presence of record insertions, record deletions, and key-range scans. As stated by Mohan [1992] (also cited by Gottemukkala et al. [1996]), there are several practical problems in implementing optimistic protocols, such as space management, access-path maintenance, and fine-granularity conflict checking in the presence of object insertions and deletions.

As for previous B-tree algorithms for page servers, using the full-caching (FC) or the strict-hybrid-caching (HC-S) algorithm in Zaharioudakis and Carey [1997] or the callback-locking (CBL) algorithm in Gottemukkala et al. [1996] in the scenarios of Examples 14.2 and 14.3 would result in repeated purging of each data page from the cache of client c2 and refetching it from client c1, because exclusive mode is required for any leaf-page update. On the other hand, if the relaxed-hybrid-caching (HC-R) algorithm in Zaharioudakis and Carey [1997] were used, the transactions T1 and T2 could not run concurrently, because in HC-R commit-duration (instead of short-duration) X locks are acquired on the next keys of inserted records. The B-tree algorithms in Basu et al. [1997] are not comparable to ours, because they do not guarantee repeatable-read-level isolation and because B-tree structure modifications are not considered.

15.
EXTENSIONS AND ALTERNATIVES

In order to keep the presentation of our algorithms short, we have used a database and transaction model that is stripped of all details that are not relevant to discussing the main points of our approach. In this section we briefly discuss how to include some natural extensions in the model and how to implement some alternative solutions.

Multiple granularity can be added to our model by grouping sets of records as relations of tuples, and sets of relations as databases. A tuple (x, w) with primary key x in a relation named r of a database named b is identified by the triple (b, r, x). Intention locks of modes IX, IS, and SIX on database names and relation names are used. The action of inserting (x, w) into relation r of database b is denoted by I[b, r, x, w], and the action is protected by a commit-duration IX lock on b, a commit-duration IX lock on (b, r), a commit-duration X lock on (b, r, x), and a short-duration X lock on (b, r, y), where y is the next key of x in r. All lock names are short hash values computed from the logical keys b, (b, r), (b, r, x). The action of creating a new relation r in database b is denoted
by I[b, r, R], where R is the schema of r. The action is protected by a commit-duration IX lock on b, a commit-duration X lock on (b, r), and a short-duration X lock on (b, s), where s is the relation name next to r in b. Thus, the key-range locking protocol is applied on each level of the granularity hierarchy, so that dirty or unrepeatable scans of relation and database names can be prevented. The action of fetching the name r and schema R of the first matching relation in database b is denoted by F[b, r, θr′, R].

For handling the locks and intention locks on the coarser-grained database objects (i.e., relation and database names) in our page-server system, we suggest the straightforward and easy-to-implement solution in which all such locks and intention locks are always acquired globally. Thus, for the fetch action F[b, r, x, θy, w] by transaction T at client c, T must acquire global commit-duration IS locks on b and (b, r) before acquiring the S lock on (b, r, x). A transaction that scans an entire relation r (or a large part of it) by issuing a series of fetch actions F[b, r, x_{i+1}, > x_i, w], i = 1, ..., n, may just acquire a commit-duration IS lock on b and a commit-duration S lock on (b, r), in order to save on the number of lock calls. Global relation-level S locks are also our solution for protecting record-fetch actions that are based on conditions more complex than "least x with x ≥ z" or "least x with x > z."

In Mohan [1990a, 1996] and Mohan and Levine [1992], intention locks were used to define a key-range locking protocol that is more permissive than the basic protocol that uses only the lock modes S and X. It suffices that the next-key lock acquired for an insert action I[b, r, x, w] is an IX lock instead of an X lock. This protocol can readily be implemented in our page-server system.

The physical database model, the sparse B-tree, that we have assumed in our study can readily be applied to a multigranular database as follows.
Each relation occupies a sparse B-tree of its own, so that the leaf pages of the B-tree for relation r of database b store the tuples (x, w) of r. For each database b there is a sparse B-tree that stores the data-dictionary table for b, that is, a relation with tuples of the form (r, R), where r is the name and R is the schema of a relation in b. Finally, there is a sparse B-tree that stores a relation with tuples of the form (b, B), where b is the name and B is a description (user information, etc.) of a database in the system.

It is also possible to adapt our algorithms to a physical database organization in which the data records are stored in a heap file, and all indexes are dense B-trees on that heap. The leaf pages then store index records (x, (p, i)), where (p, i) is the physical record identifier (record-id) of a data record with key x stored in record position i of data page p. With this organization it is usually assumed that the data records do not move from their original pages in normal database actions, so that an insertion, a deletion, or an update of a record with record-id (p, i) does not cause any modifications of the index entries pertaining to records other than the one with record-id (p, i). When a data record is inserted or deleted, both the affected data page and the B-tree leaf page must be cached in U mode. Two redo-undo log records are generated: one for the insertion or deletion of the data record and one for the insertion or deletion of the index record. Database reorganization operations that do move data records must be protected by X locks of coarser granularity.
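The multigranular key-range locking described earlier in this section can be sketched in Python. The CRC-based lock-name hash and all function names are illustrative assumptions (the paper says only that lock names are short hash values computed from the logical keys):

```python
import zlib

def lock_name(obj):
    """Short hash-based lock name for a database, relation, or key object,
    standing in for whatever short hash a real lock manager would use."""
    return zlib.crc32(repr(obj).encode())

def locks_for_insert(b, r, x, next_key):
    """Lock requests protecting the insert action I[b, r, x, w] per the
    text: commit-duration IX on b and on (b, r), commit-duration X on
    (b, r, x), and a short-duration X lock on the next key (b, r, next_key)."""
    return [
        (lock_name(b), "IX", "commit"),
        (lock_name((b, r)), "IX", "commit"),
        (lock_name((b, r, x)), "X", "commit"),
        (lock_name((b, r, next_key)), "X", "short"),
    ]
```

Hashing the hierarchical keys to fixed-size names keeps lock-table entries small while still letting the key-range protocol operate at every level of the granularity hierarchy.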
Some B-tree algorithms used in well-known database management systems use two-way sideways linking of the leaf pages, and some variations of the B-tree, such as the B-link tree, may use sideways linking at all levels of the tree. To implement one-way linking of index pages, only some very minor additions are needed in the algorithms for the structure modifications. To implement two-way linking, some minor additions are needed in the algorithms for redistribute, increase-height, and decrease-height, while slightly more modifications are needed in the algorithms for split and merge (and hence also in the algorithms for retraverse-for-insert and retraverse-for-delete). In split or merge, an additional page must be cached in X mode and write-latched, in order to adjust its previous-page link.

Our locking protocol guarantees repeatable-read-level isolation (i.e., serializability) for all transactions. It is easy to modify our algorithms so as to let a transaction run at a lower degree of isolation. When running a client transaction at the read-committed isolation level, available records on cached and latched pages are fetched without locking, and only short-duration global S locks are needed on records that are found to be unavailable. The Commit-LSN optimization [Mohan 1990b] can be used to further reduce the number of S locks needed.

In our page-server system, update propagation for B-tree pages involved in a structure modification is immediate, in that modified pages are shipped to the server as soon as the modification is completed, while lazy propagation is used for a data page updated by inserting or deleting a data record. Our algorithms are easily modified so as to use immediate propagation also for data-page updates, as is done in the algorithms in Jaluta [2002]. This would simplify our recovery protocol in that the clients would no longer need to take checkpoints and in that no redo pass would be needed in recovering from client failures.
On the other hand, more changes would be needed in our algorithms if lazy propagation were used for structure modifications. The page-level replica-management protocol would have to be changed, because then a page cached in X mode at some client would have to be explicitly called back when it is needed at the server or at some other client, whereas with immediate propagation the server need only wait for the page to arrive. In recovering from a client failure, the redo pass would then also include redoing structure modifications.

We have assumed that each transaction executes entirely at one site. It might be advantageous to allow for the option of "hybrid shipping," that is, executing parts of a client transaction at the server (cf. Franklin et al. [1996b]; Voruganti et al. [2004]). In this setting, some SQL queries submitted by a client application to the client database process are forwarded to the server, thus using query shipping. Modifying our algorithms to allow query shipping besides page shipping between a client and the server is possible, but it would have the undesirable effect of blocking all other applications at the client if the shipped query involves updates. This is because the client must then ship all the log records and the Local-Max-LSN to the server, in order that the server can generate log records for the update actions performed at the server on behalf of the client; during the time the server performs those actions, the client cannot
perform any other update. Thus this extension is valid only for shipping read-only queries, or for the case in which a client runs only a single transaction at a time.

16. CONCLUSION

We have presented a complete set of algorithms for the management of transactions in a page-shipping client-server database system in which the physical database is organized as a sparse B+-tree (Section 3). Our page-server model is a general one in which each client as well as the server can run several transactions concurrently and each transaction executes in its entirety at one site (Section 4). We have written our algorithms for a logical database model in which the database consists of a set of variable-length data records identified by their primary keys and the database actions include record insertions, record deletions, and record fetches based on primary-key-range scans (Section 2), but we have also discussed how to extend this model to include multiple granularity (Section 15). Our transaction model includes total and partial rollbacks (Section 2). During normal processing, a partial or total rollback of a client transaction is performed by the client as in ARIES/CSA [Mohan and Narang 1994] (Section 12).

For page-level replica management, we apply a relaxed update-privilege (or write-token) approach that allows a data page (i.e., a leaf page of the sparse B-tree) to be simultaneously cached at one site in update (U) mode and at several other sites in shared (S) mode (Sections 4 and 8) (cf. Mohan and Narang [1991]). Update mode allows data records to be inserted and deleted, and shared mode allows records to be fetched. A data page or an index (nonleaf) page involved in a structure modification such as a page split or merge must be cached in exclusive (X) mode. This approach allows for more concurrency than the FC and HC-S algorithms in Zaharioudakis and Carey [1997] and the CBL algorithm in Gottemukkala et al.
[1996], which require exclusive mode for leaf-page updates. Intertransaction caching is used for all pages, so that any page copy once cached at a client remains cached there until (1) the client buffer becomes full, (2) the page is called back to be cached in X mode at some other site, (3) a record in the page is called back and all S locks that still exist on the keys of records in the page have been registered globally, (4) the client explicitly requests the page copy to be replaced by its current version, or (5) the page is deallocated (Sections 8–12). A callback request for a page cached at a client can be serviced at any time when the page is not latched. No page is held latched for longer than it takes to perform a single operation (a latch-coupling step, a page read, a page update, or a structure modification), and no latch is ever held by a process that is waiting for a lock. No more than three latches are ever held at a time by any client-database-process thread that services an application (Lemma 11.1).

Transactional isolation is based on standard key-range locking (Section 5) coupled with a record-level replica-management protocol based on record marking (Section 9). When client transactions insert or delete records in cached data
pages, the records or their next-key records must be marked as unavailable in all cached copies of those pages, so that transactions at other clients can fetch available records from their cached page copies without server intervention. We have shown that, when our protocol is used, fetch actions indeed always retrieve records belonging to the current version of the logical database (Theorem 9.2) and that repeatable-read-level isolation (i.e., serializability) is guaranteed (Theorem 9.4). Lower degrees of isolation, such as the read-committed level, can also be supported (Section 15). In comparison, the HC-R algorithm in Zaharioudakis and Carey [1997] uses a key-range locking protocol that is less permissive than standard key-range locking.

Our algorithms assume that the server applies the usual steal-and-no-force policy in handling modified pages in the server buffer, and that each client applies the client-side steal-and-no-force policy in shipping to the server data pages updated at the client (Section 7). On the other hand, the force-to-server policy is applied in propagating structure modifications, so that pages involved in a structure modification are shipped to the server immediately after the modification is completed (Sections 7 and 10). Besides simplifying the page-level replica-management protocol and reducing the time a page is kept cached in exclusive mode at a client, this also reduces the work to be done in the redo pass when recovering from a client failure (Lemma 10.1).

In our new B-tree algorithms, the structure of the B-tree is modified by one of five structure modifications: split, merge, redistribute, increase-height, and decrease-height (Section 10). Each modification is defined as an atomic action that involves only three B-tree pages.
Our B-tree-traversal algorithms (Section 11) and our implementations of the fetch, insert, delete, undo-insert, and undo-delete actions (Section 12) are deadlock-free and starvation-free (Lemma 11.1, Theorem 12.1) and retain the structural consistency and balance of the tree under all circumstances (Lemma 10.1 and Theorem 12.3). The traversal algorithms perform (in a top-down manner) any structure modifications needed to ensure that the target data page can accommodate a record insertion or a record deletion. Saved paths are utilized to reduce the need for retraversals and to avoid performing unnecessary structure modifications.

Restart recovery from client and server failures is done at the server using an ARIES/CSA-based recovery protocol (Section 13). A record insertion or deletion is logged physiologically using a redo-undo log record, while a structure modification is logged using a single redo-only log record (Sections 6 and 10). The redo pass of the restart-recovery algorithm will always produce a structurally consistent and balanced tree, on which the undo-insert and undo-delete actions of backward-rolling transactions can be performed logically (Theorems 13.1 and 13.2). Our approach reduces logging and the work to be done in the redo and undo passes. This is in contrast to previous B-tree algorithms, in which structure modifications are logged using redo-undo log records and may need to be undone in transaction rollback and in the undo pass of restart recovery. Moreover, concurrency is increased, because structure modifications require no additional synchronization mechanism, nor is any additional pass needed in restart recovery.


ELECTRONIC APPENDIX

The electronic appendix for this article can be accessed in the ACM Digital Library by visiting the following URL: http://www.acm.org/pubs/citations/journals/tods/2006-31-1/p1-jaluta.

ACKNOWLEDGMENTS

We wish to thank the associate editor and the referees for their valuable comments that greatly helped to improve the article.

REFERENCES

ADYA, A., GRUBER, R., LISKOV, B., AND MAHESHWARI, U. 1995. Efficient optimistic concurrency control using loosely synchronized clocks. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data. 23–34.
BASU, J., KELLER, A., AND PÖSS, M. 1997. Centralized versus distributed index schemes in OODBMS—a performance analysis. In Proceedings of the 1st East-European Symposium on Advances in Databases and Information Systems (ADBIS'97). 162–169.
BAYER, R. AND SCHKOLNICK, M. 1977. Concurrency of operations on B-trees. Acta Informatica 9, 1–21.
BERENSON, H., BERNSTEIN, P., GRAY, J., MELTON, J., O'NEIL, E., AND O'NEIL, P. 1995. A critique of ANSI SQL isolation levels. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data. 1–10.
CAREY, M. J., DEWITT, D. J., FRANKLIN, M. J., HALL, N. E., MCAULIFFE, M. L., NAUGHTON, J. F., SCHUH, D. T., SOLOMON, M. H., TAN, C. K., TSATALOS, O. G., WHITE, S. J., AND ZWILLING, M. J. 1994a. Shoring up persistent applications. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data. 383–394.
CAREY, M. J., FRANKLIN, M. J., LIVNY, M., AND SHEKITA, E. J. 1991. Data caching tradeoffs in client-server DBMS architectures. In Proceedings of the 1991 ACM SIGMOD International Conference on Management of Data. 357–366.
CAREY, M. J., FRANKLIN, M. J., AND ZAHARIOUDAKIS, M. 1994b. Fine-grained sharing in a page server OODBMS. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data. 359–370.
CAREY, M. J. AND LIVNY, M. 1991. Conflict detection tradeoffs for replicated data. ACM Trans. Database Syst. 16, 703–746.
DEUX, O. 1991. The O2 system. Commun. ACM 34, 50–63.
DEWITT, D. J., MAIER, D., FUTTERSACK, P., AND VELEZ, F. 1990. A study of three alternative workstation-server architectures for object-oriented database systems. In Proceedings of the 16th International Conference on Very Large Data Bases. 107–121.
FRANKLIN, M. J. AND CAREY, M. J. 1994. Client-server caching revisited. In Distributed Object Management, Papers from the 1992 International Workshop on Distributed Object Management. Morgan Kaufmann, San Francisco, CA, 57–78.
FRANKLIN, M. J., CAREY, M. J., AND LIVNY, M. 1992a. Global memory management in client-server DBMS architecture. In Proceedings of the 18th International Conference on Very Large Data Bases. 596–609.
FRANKLIN, M. J., CAREY, M. J., AND LIVNY, M. 1993. Local disk caching for client-server database systems. In Proceedings of the 19th International Conference on Very Large Data Bases. 641–654.
FRANKLIN, M. J., CAREY, M. J., AND LIVNY, M. 1997. Transactional client-server cache consistency: Alternatives and performance. ACM Trans. Database Syst. 22, 315–363.
FRANKLIN, M. J., JÓNSSON, B. J., AND KOSSMANN, D. 1996. Performance tradeoffs for client-server query processing. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. 149–160.
FRANKLIN, M. J., ZWILLING, M. J., TAN, C., CAREY, M. J., AND DEWITT, D. 1992b. Crash recovery in client-server EXODUS. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data. 165–174.


GOTTEMUKKALA, V., OMIECINSKI, E., AND RAMACHANDRAN, U. 1996. Relaxed index consistency for a client-server database. In Proceedings of the 12th IEEE International Conference on Data Engineering. 352–361.
GRAY, J. AND REUTER, A. 1993. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA.
HOWARD, J. H., KAZAR, M. L., MENEES, S. G., NICHOLS, D. A., SATYANARAYANAN, M., SIDEBOTHAM, R. N., AND WEST, M. J. 1988. Scale and performance in a distributed file system. ACM Trans. Comput. Syst. 6, 51–81.
JALUTA, I. 2002. B-tree concurrency control and recovery in a client-server database management system. Ph.D. dissertation, Department of Computer Science and Engineering, Helsinki University of Technology, Helsinki, Finland. Go online to http://lib.hut.fi/Diss/2002/isbn9512257068/.
JALUTA, I., SIPPU, S., AND SOISALON-SOININEN, E. 2005. Concurrency control and recovery for balanced B-link trees. VLDB J. 14, 257–277.
LAMB, C., LANDIS, G., ORENSTEIN, J., AND WEINREB, D. 1991. The ObjectStore database system. Commun. ACM 34, 50–63.
LISKOV, B., ADYA, A., CASTRO, M., DAY, M., GHEMAWAT, S., GRUBER, R., MAHESHWARI, U., MYERS, A. C., AND SHRIRA, L. 1996. Safe and efficient sharing of persistent objects in Thor. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. 318–329.
LOMET, D. 1992. MLR: A recovery method for multi-level systems. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data. 185–194.
LOMET, D. 1998. Advanced recovery techniques in practice. In Recovery Mechanisms in Database Systems, V. Kumar and M. Hsu, Eds. Prentice Hall PTR, Englewood Cliffs, NJ, 697–710.
LOMET, D. AND SALZBERG, B. 1992. Access method concurrency with recovery. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data. 351–360.
LOMET, D. AND SALZBERG, B. 1997. Concurrency and recovery for index trees. VLDB J. 6, 224–240.
MOHAN, C. 1990a. ARIES/KVL: A key-value locking method for concurrency control of multiaction transactions operating on B-tree indexes. In Proceedings of the 16th International Conference on Very Large Data Bases. 392–405.
MOHAN, C. 1990b. Commit LSN: A novel and simple method for reducing locking and latching in transaction processing systems. In Proceedings of the 16th International Conference on Very Large Data Bases. 406–418.
MOHAN, C. 1992. Less optimism about optimistic concurrency control. In RIDE-TQP'92, Proceedings of the 2nd International Workshop on Research Issues on Data Engineering: Transaction and Query Processing. 199–204.
MOHAN, C. 1996. Concurrency control and recovery methods for B+-tree indexes: ARIES/KVL and ARIES/IM. In Performance of Concurrency Control Mechanisms in Centralized Database Systems, V. Kumar, Ed. Prentice Hall, Englewood Cliffs, NJ, 248–306.
MOHAN, C., HADERLE, D., LINDSAY, B., PIRAHESH, H., AND SCHWARZ, P. 1992. ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Trans. Database Syst. 17, 94–162.
MOHAN, C. AND LEVINE, F. 1992. ARIES/IM: An efficient and high concurrency index management method using write-ahead logging. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data. 371–380.
MOHAN, C. AND NARANG, I. 1991. Recovery and coherency-control protocols for fast intersystem page transfer and fine-granularity locking in a shared disks transaction environment. In Proceedings of the 17th International Conference on Very Large Data Bases. 193–207.
MOHAN, C. AND NARANG, I. 1994. ARIES/CSA: A method for database recovery in client-server architectures. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data. 55–66.
MOHAN, C., NARANG, I., AND SILEN, S. 1991. Solutions to hot spot problems in a shared disks transaction environment. In Proceedings of the 4th International Workshop on High Performance Transaction Systems.
MOND, Y. AND RAZ, Y. 1985. Concurrency control in B+-tree databases using preparatory operations. In Proceedings of the 11th International Conference on Very Large Data Bases. 331–334.


ÖZSU, M. T., VORUGANTI, K., AND UNRAU, R. C. 1998. An asynchronous avoidance-based cache consistency algorithm for client caching DBMSs. In Proceedings of the 24th International Conference on Very Large Data Bases. 440–451.
PANAGOS, E. AND BILIRIS, A. 1997. Synchronization and recovery in a client-server storage system. VLDB J. 6, 209–223.
PANAGOS, E., BILIRIS, A., JAGADISH, H. V., AND RASTOGI, R. 1996. Fine-granularity locking and client-based logging for distributed architectures. In Advances in Database Technology—EDBT'96, Proceedings of the 5th International Conference on Extending Database Technology. 388–402.
SIPPU, S. AND SOISALON-SOININEN, E. 2001. A theory of transactions on recoverable search trees. In ICDT 2001, Proceedings of the 8th International Conference on Database Theory. 83–98.
SRINIVASAN, J., DAS, S., FREIWALD, C., CHONG, E. I., JAGANNATH, M., YALAMANCHI, A., KRISHNAN, R., TRAN, A.-T., DEFAZIO, S., AND BANERJEE, J. 2000. Oracle8i index-organized table and its application to new domains. In Proceedings of the 26th International Conference on Very Large Data Bases. 285–296.
VORUGANTI, K., ÖZSU, M. T., AND UNRAU, R. C. 1999. An adaptive hybrid server architecture for client caching object DBMSs. In Proceedings of the 25th International Conference on Very Large Data Bases. 150–161.
VORUGANTI, K., ÖZSU, M. T., AND UNRAU, R. C. 2004. An adaptive data-shipping architecture for client caching data management systems. Distrib. Parall. Databases 15, 137–177.
WANG, Y. AND ROWE, L. 1991. Cache consistency and concurrency control in a client-server DBMS architecture. In Proceedings of the 1991 ACM SIGMOD International Conference on Management of Data. 367–376.
WHITE, S. J. AND DEWITT, D. J. 1994. QuickStore: A high performance mapped object store. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data. 395–406.
WILKINSON, K. AND NEIMAT, M. 1990. Maintaining consistency of client-cached data. In Proceedings of the 16th International Conference on Very Large Data Bases. 122–133.
ZAHARIOUDAKIS, M. AND CAREY, M. 1997. Highly concurrent cache consistency for indices in client-server database systems. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data. 50–61.
ZAHARIOUDAKIS, M., CAREY, M. J., AND FRANKLIN, M. J. 1997. Adaptive fine-grained sharing in a client-server OODBMS: A callback-based approach. ACM Trans. Database Syst. 22, 570–627.

Received January 2004; revised December 2004, May 2005, August 2005; accepted September 2005
