Transaction Management for a Main-Memory Database

Authors: Piyush Burte, Boanerges Aleman-Meza, D. Brent Weatherly, Rong Wu
Supervising Professor: John A. Miller
Affiliation: The Department of Computer Science, The University of Georgia

Abstract

As part of research by members of the Department of Computer Science at the University of Georgia, we have developed a Java-based Transaction Manager that fits into the multi-layered design of MMODB, a main-memory database system. We have sought to maximize the benefits of the Java programming language and to implement transaction principles that are suitable for in-memory databases. In this paper, we examine the details of thread concurrency and resource locking protocols, our deadlock prevention scheme, and the Java-based implementation of these design decisions. We show the effectiveness of our design with performance tests that simulate typical transactions on a highly concurrent database system.

1. Introduction

As technological advances in non-volatile physical memory continue to be made, the possibilities for utilizing the speed of main memory for database systems become more of a reality. Whereas traditional database systems rely on the disk subsystem to retrieve and update data and use an offline storage device such as magnetic tape for backup, a main-memory database uses physical memory as primary storage and a disk subsystem for backup. To facilitate this shift, database systems must be redesigned not only to take advantage of the performance benefit but also to handle implementation issues surrounding the inherent differences between disk and memory storage.
In addition, the increased speed of main-memory databases puts a strain on the concurrency control mechanism of the database transaction manager. The main-memory database designer must be careful to choose concurrency primitives and locking protocols that can efficiently and accurately schedule transactions while contributing as little overhead as possible to the database system. We have designed a transaction manager with the following principles: monitor-based concurrency control, a rigorous two-phase locking protocol, database extent-level locking, and deadlock prevention. We have chosen Java as the programming language because it is well suited to "layered" programming and has many built-in features to facilitate thread concurrency, as well as data structures to handle database resource locking.

We begin this paper with a look at other research in the area of main-memory databases. We then detail the decisions that were made in the design of this transaction manager. The implementation itself is explained in Section 4. We provide some performance test results in Section 5 and our conclusions in Section 6. Finally, Section 7 explains the changes and improvements that we would like to make.

2. Related Work

The predominant focus of main-memory databases is real-time applications. The design of these databases has to provide concurrency control that efficiently schedules real-time transactions for applications with shared, direct access to data. Research in this area includes the work carried out by Bell Labs researchers on the Dali project, by IBM on the Starburst database system, and the groundbreaking System M proposed by Kenneth Salem and Hector Garcia-Molina. We briefly discuss the approach to transaction management of these three systems.
2.1 Dali Main-Memory Storage Manager

The Dali system is part of ongoing research on main-memory databases at Bell Laboratories [1]. Its scheme organizes data into regions. Each region has a single associated lock with exclusive (X) and shared (S) modes, referred to as the region lock, which guards accesses and updates to the region. Dali uses a concurrency scheme that provides support for enhanced concurrency based on the semantics of operations. Specifically, it permits the use of weaker operation locks in place of stronger shared/exclusive locks. Under this scheme, consider a find operation. This operation obtains a find operation lock on the key value, then an S region lock (shared lock) on the bucket containing the key value. It releases the lock on the bucket once the node in the bucket chain containing the key value has been found. However, the find lock on the key value is held for the duration of the transaction. Update operations, like insert, work in a similar fashion: first the operation lock is obtained and then the region lock is obtained, except that the region lock is an X region lock.

2.2 IBM Starburst

As a part of the Starburst extensible database project developed at the IBM Almaden Research Center [2], the authors have designed and implemented a memory-resident storage component that coexists with Starburst's disk-oriented storage system. The transaction management system uses a single latch for protecting a table, all of its indexes, and all of its related lock information, in order to reduce storage component latch costs. The researchers show that although a table-level latch is a large-granule latch, it does not significantly restrict concurrency. In addition, they suggest that their design is more appropriate for memory-resident storage components than the traditional lock manager design. The new design exploits direct addressing of lock data and dynamic, multi-granularity locks.

2.3 System M

System M [6] is an experimental transaction processing test bed that runs on top of the Mach operating system. The system comprises a collection of processes operating on shared data structures, including the database itself. Each process acts as a server, accepting work requests and returning results. Though there are four parts to the system, two are responsible for transaction management: the Transaction Server and the Message Server. The Message Server is responsible for queuing transaction requests, and the Transaction Server processes these transactions. System M utilizes a lock manager that provides shared and exclusive lock modes, deadlock detection, and lock conversion (shared to exclusive). Locks are acquired at segment granularity and can be requested in a non-blocking mode. Lock acquisition time is reduced by preallocating the data structures that are used to represent lockable database objects. Because the primary database is stored in main memory, transaction execution in System M is relatively simple and efficient; transactions even run serially, if possible.

3. Design Decisions

3.1 Locking Protocol

We have chosen a version of two-phase locking known as rigorous two-phase locking (rigorous 2PL). Two-phase locking is a concurrency protocol that guarantees serializable schedules by ensuring that all locking operations precede the first unlock
operation in the transaction [3]. Serializability is important because it guarantees an ordering of operations that has the same effect on the database as some execution in which all transactions run consecutively, without any interleaving. Rigorous two-phase locking is more restrictive than standard 2PL in that it enforces the rule that no locks can be released by a transaction until after it commits or aborts. This ensures that an item in the database cannot be read or written by a transaction until the previous transaction that wrote that item has committed. This is known as a strict schedule, and it simplifies the recovery process. The final reason for choosing rigorous 2PL is that it is straightforward to implement.

We have chosen extent-level locking for our transaction manager. We use the term extent because our database is designed around object-oriented principles, but essentially we lock the entire table for any request for an object within it.

3.2 Concurrency Mechanism

For concurrency, we use the built-in Java synchronization primitives. Java uses monitors to perform thread synchronization. Monitors, introduced by C. A. R. Hoare [4], ensure that only a single thread can be executing a method or section of code at a time. Threads that are unable to enter a section of code because of the presence of another thread are blocked until the current thread leaves this section of code (called a critical section). This solves our problem of thread synchronization. However, we must also choose a method to block transactions that are unable to obtain a lock on a database resource. To accomplish this, we create a Java object for every extent that is requested in the database. When a transaction needs to block, we synchronize on this database resource object and then perform a "wait" on it. Similarly, when a
transaction commits or aborts, it "notifies" any transactions that are waiting on the resources that it has locked. By choosing the database resource as the synchronization object on which to wait/notify, we eliminate the unnecessary waking of threads that are blocked on other resources.

3.3 Deadlock Handling

The rigorous two-phase locking protocol does not prevent deadlock; therefore, deadlock detection is an important part of our system. Formally, deadlock occurs when each transaction T in a set of two or more transactions is waiting for some item that is locked by some other transaction T' in the set [3]. In general, two categories of algorithms exist to deal with deadlocks: deadlock prevention, and deadlock detection and recovery. We have chosen a deadlock detection scheme; however, we actually employ it to prevent deadlocks (see Section 4.3). One method of deadlock detection is to discover a cycle in a directed graph of transactions and resources, in which nodes represent transactions or resources and arcs represent relationships between them [5]. A cycle in the graph means that a deadlock exists involving the transactions and the resources in that cycle. Our system uses a modified version of this algorithm that inspects the graph and terminates either when it has found a cycle or when it has shown that none exists [5].

4. Implementation

4.1 Transactions

A Java object, class Transaction, encapsulates all operations that are accessible by a database transaction. A transaction begins when a Transaction object is created (in the constructor). A number of methods may be called on a Transaction
object. These methods are wrappers for calls to the next layer (the Storage Manager). The methods that access data from the Storage Manager layer try to obtain either a shared lock or an exclusive lock. These locks are granted or denied through a table of database resource locks (see Section 4.2). A Transaction ends when the commit or rollback method is called on the Transaction object.

Class Transaction contains both instance and static information. The static section ensures that Transaction objects can be repeatedly instantiated by the Query Processor while maintaining the "global" control data that is necessary for interfacing with the Storage Manager and for concurrency control. This control data includes: an instantiated object of the Storage Manager module, an instantiated object that provides the Java RMI methods to other modules, a LockTable object that implements the concurrency control, and a counter that assigns (internal) transaction ids to every newly created Transaction object. Each Transaction object also contains information that is specific to that transaction instance, such as the transaction id, a list of locks held, and an identifier for the object on which this transaction may be waiting. This information is part of the non-static section of the object.

4.2 Resource Locking

For every extent that is requested from the database, an object of class DBResource is created. This object stores the extent name, the type of lock that is held on the extent, and the list of transactions that hold the current lock. A lock on a DBResource can be none, shared, or exclusive. Once created, these objects are stored in a hash table keyed by the extent name. This table is contained within class LockTable. The LockTable
controls access to all database resources. A transaction interfaces with the LockTable via three methods: getSharedLock, getExclusiveLock, and releaseLocks.

For getSharedLock and getExclusiveLock, there are two different synchronized blocks of code. The first block synchronizes on the LockTable object itself. By synchronizing on the LockTable, we ensure that only one transaction can access the table at a time. Each method performs a look-up in the hash table for a DBResource whose name matches the extent name that has been requested. If such a DBResource does not exist, one is created and the lock granted. Otherwise, there are three scenarios in which the lock will be granted: the resource is not locked; the resource has a shared lock and the current transaction has requested a shared lock; or the resource has a shared lock that is held ONLY by the current transaction and the current transaction has requested an exclusive lock. If none of these conditions is met, the transaction is forced to wait. After confirming that waiting will not cause a deadlock, the transaction leaves the block of code that is synchronized on the LockTable and enters a block of code that is synchronized on the actual DBResource object that it is trying to lock. The transaction then issues a wait on this object and goes to sleep.

When a transaction has committed or aborted, it calls the releaseLocks method of the LockTable. This method iterates through the transaction's list of locked DBResource objects. It synchronizes on each resource in the list, clears the lock on that object, and issues a "notify" to all threads that may be waiting for a lock on that resource.

4.3 Deadlock Handling
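The waits-for cycle check outlined in Section 3.3 can be sketched in Java as follows. This is a simplified sketch, not our actual implementation: Txn and Res stand in for the Transaction and DBResource classes, and the method names here are assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal stand-in for the Transaction class: waitingFor points at the
// resource this transaction is blocked on (null if it is running).
class Txn {
    final int id;
    Res waitingFor;
    Txn(int id) { this.id = id; }
}

// Minimal stand-in for the DBResource class: lockedBy lists the current
// lock holders.
class Res {
    final String name;
    final List<Txn> lockedBy = new ArrayList<>();
    Res(String name) { this.name = name; }
}

class DeadlockCheck {
    // Returns true if letting 'current' wait on 'wanted' would close a
    // cycle in the waits-for graph, i.e. cause a deadlock.
    static boolean wouldDeadlock(Txn current, Res wanted) {
        return reaches(wanted, current, new HashSet<>());
    }

    // Depth-first search: does some holder of 'r', following waitingFor
    // edges transitively, lead back to 'target'?
    private static boolean reaches(Res r, Txn target, Set<Txn> seen) {
        for (Txn holder : r.lockedBy) {
            if (holder == target) return true;          // cycle closed
            if (seen.add(holder) && holder.waitingFor != null
                    && reaches(holder.waitingFor, target, seen)) {
                return true;
            }
        }
        return false;                                    // dead end: safe to wait
    }
}
```

Because the check runs before the requesting transaction ever waits, the search starts only from that transaction, mirroring the single-starting-node simplification described below.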
Our system uses a modified version of the cycle detection algorithm mentioned in Section 3.3 to prevent deadlock. In the usual implementation of this algorithm, a "deadlock-detection" thread/process recursively constructs a "waits-for" graph for EVERY transaction [5]; we modify the algorithm so that we only need to construct the graph for the current (running) transaction. We can do this because any transaction that is unable to obtain a lock on a resource must perform the deadlock check before waiting, rather than having a separate thread/process detect deadlocks after they have already occurred. For this reason, deadlock never actually arises; a cycle can only appear in a "waits-for" graph involving the current transaction and the resource on which it wants to wait. Thus, performing a search from every node in the graph is unnecessary, and the current transaction is the only starting node in the search.

To construct the "waits-for" graph, we utilize the waitingFor field in the Transaction class, which indicates the DBResource for which this transaction is waiting, and the LockedBy list in the DBResource class, which includes all of the Transactions holding a lock on this resource. A recursive method searches the graph and terminates when a cycle is found or the algorithm reaches a dead end. If a cycle is found, the transaction requesting the lock is aborted. In this case, an exception is thrown (see Section 4.5) to trigger rollback (if necessary) and to instruct the Query Processor to terminate this transaction. Otherwise, the transaction is allowed to wait for that resource.

4.4 RMI Interfaces for Snapshot Handling

RMI (Remote Method Invocation) is used to handle snapshots in our system. Two RMI interfaces, class ITransactionSnapshotRMI and class ITransactionStorageRMI, are provided for the Snapshot Manager and the Storage Manager, respectively. The only
method in ITransactionSnapshotRMI is requestSnapshot, which receives snapshot requests from the Snapshot Manager. Similarly, the only method in ITransactionStorageRMI is endSnapshot, which is called by the Storage Manager when the snapshot is complete.

The Storage Manager employs a NO-UNDO/NO-REDO recovery scheme that keeps two copies of the data in the database at all times. All updates are performed on one copy, while the other copy is maintained as the shadow copy. This scheme prevents the shadow copy from being modified during transaction execution and therefore guarantees that the shadow copy is always consistent. For this reason, all snapshots are taken on the shadow copy. Unfortunately, a transaction can affect the snapshot if the transaction commits while the snapshot is in progress, because the Storage Manager updates the shadow copy upon transaction commit. To prevent this problem, our system forbids transactions from committing while the snapshot is running. In our implementation, the transaction thread is blocked if a snapshot is in progress and is woken when the snapshot is complete. Conversely, it may be necessary to delay the snapshot request if one or more transactions are in the process of committing; these transactions must be allowed to complete the commit process before the snapshot begins. To solve this problem, the snapshot thread is put to sleep until it is safe to proceed. To avoid snapshot starvation, no other transactions are allowed to begin the commit phase while a snapshot request is pending.

4.5 Error Handling

For simplicity, when a transaction is to be aborted, an exception, an instance of AbortTransactionException, is thrown to the current Transaction. This exception contains
some information about the aborted transaction that will be needed by the Storage Manager to perform rollback. The Transaction then throws this exception to the next layer (the Query Processor), which handles it accordingly.

5. Performance Tests

5.1 Parameters

Three sets of performance tests were run. They differ in their parameters but share the same testing procedure. The modules involved in testing were the Transaction Manager and the Storage Manager. The data set used for testing had 80 different extents. A total of 1,000 transactions were run via 1,000 threads (one transaction per thread). Each transaction executed a fixed number of operations in each set of tests. In addition, each operation involves a single extent, which is chosen randomly from the 80 possibilities. Each set of tests was run with different percentages of updates/queries. The percentages tested were:

• 0% updates, 100% queries
• 20% updates, 80% queries
• 40% updates, 60% queries
• 60% updates, 40% queries
• 80% updates, 20% queries
• 98% updates, 2% queries
For each of these tests, the following information was recorded:

• time taken for completion of all the transactions
• average number of times a transaction needed to wait for an extent in use by another transaction
• number of transactions aborted because of a detected deadlock
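A rough sketch of such a test harness in Java is shown below. All class and method names here are hypothetical; the real tests drove the actual Transaction Manager and Storage Manager modules rather than the counters used in this illustration.

```java
import java.util.Random;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the Section 5.1 setup: one thread per transaction, each
// operation touching one of 80 extents chosen at random, with an
// update/query mix controlled by updatePercent.
class Workload {
    static final int EXTENTS = 80;
    final AtomicInteger updates = new AtomicInteger();
    final AtomicInteger queries = new AtomicInteger();

    void run(int transactions, int opsPerTxn, int updatePercent)
            throws InterruptedException {
        CountDownLatch done = new CountDownLatch(transactions);
        for (int t = 0; t < transactions; t++) {
            final int seed = t;
            new Thread(() -> {
                Random rnd = new Random(seed);
                for (int op = 0; op < opsPerTxn; op++) {
                    int extent = rnd.nextInt(EXTENTS);  // extent this op would touch
                    if (rnd.nextInt(100) < updatePercent) {
                        updates.incrementAndGet();      // would take an exclusive lock
                    } else {
                        queries.incrementAndGet();      // would take a shared lock
                    }
                }
                done.countDown();
            }).start();
        }
        done.await();  // wait for all transaction threads to finish
    }
}
```

In the real tests, the body of each thread would create a Transaction object, issue the query/update calls against the chosen extents, and then commit or roll back.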
Each of the tests was run several times in order to obtain an average for each value.

5.2 Transaction Performance
Transaction performance is measured in two areas: the throughput for transaction completion and the average number of times a transaction enters the 'wait' state because it is unable to obtain a lock on a resource.

Throughput: Transaction throughput is measured by taking the total number of transactions attempted, subtracting the number of transactions that were aborted because of deadlock, and dividing by the time required to complete the test. The three sets of tests gave the expected results: transaction throughput decreased as the percentage of updates increased, and throughput was likewise hindered when we increased the number of operations per transaction. These three cases are illustrated in the following figure (Figure 6.1):
Figure 6.1
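The throughput computation described above can be written as a small helper; the method name is hypothetical.

```java
// Throughput as defined in Section 5.2: completed transactions
// (attempted minus those aborted by deadlock) per second.
class Metrics {
    static double throughput(int attempted, int abortedByDeadlock, double seconds) {
        return (attempted - abortedByDeadlock) / seconds;
    }
}
```

For example, 1,000 attempted transactions with 100 deadlock aborts completing in 4.5 seconds yields a throughput of 200 transactions per second.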
Wait States: For the set of tests executing only one operation per transaction, the number of wait-states increased slowly for update percentages under 60% and rapidly for update percentages over 60%. For the set of tests executing 1 to 5 operations per transaction, the number of wait-states increased as the percentage of updates increased. For the set of tests executing 1 to 20 operations per transaction, however, the number of wait-states at first appears inconclusive. The rise and fall of the wait-states can be explained by the increase in deadlocks: with so many operations, deadlock occurs quickly, resulting in transactions aborting quickly. This "infant death" of transactions frees up resources, which enables other transactions to continue. Eventually, though, deadlock between the remaining transactions occurs, as evidenced by the figure. These three cases are illustrated in the following figure (Figure 6.2):
Figure 6.2
5.3 Frequency of Deadlock

As for deadlock detection, the test results were as expected: the number of transactions aborted because of a detected deadlock increases as the percentage of updates increases. Note that no deadlock occurs if all transactions contain a single operation on a single extent. The three cases are illustrated in the following figure (Figure 6.3):
Figure 6.3
6. Conclusions

We have described our transaction manager for a main-memory database. We believe that the design decisions we made were prudent, and our performance tests show that our implementation is promising: an average throughput of 230 transactions per second for a thousand single-operation transactions is a good start. We would like to further our work by implementing the proposed changes and performing tests in a true multi-user, multi-processor environment.

7. Future Work

7.1 Transactions
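The generic Reflection-based dispatch method proposed in this section could be sketched as follows. This is a hypothetical sketch: the class, method, and parameter names are assumptions, and the lock-acquisition step is elided.

```java
import java.lang.reflect.Method;

// Hypothetical generic lock-then-dispatch wrapper: acquire a shared or
// exclusive lock, then forward the call to the Storage Manager by name
// using Java Reflection.
class GenericDispatch {
    private final Object storageManager;
    GenericDispatch(Object storageManager) { this.storageManager = storageManager; }

    Object invoke(boolean exclusive, String methodName, Object[] args)
            throws Exception {
        // 1. Acquire the appropriate lock (elided), e.g.:
        //    if (exclusive) lockTable.getExclusiveLock(...); else lockTable.getSharedLock(...);
        // 2. Resolve the Storage Manager method by name and argument types.
        Class<?>[] types = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) types[i] = args[i].getClass();
        Method m = storageManager.getClass().getMethod(methodName, types);
        m.setAccessible(true);  // tolerate package-private implementation classes
        // 3. Forward the call: lower-layer methods can be added or changed
        //    without touching hand-written wrapper methods.
        return m.invoke(storageManager, args);
    }
}

// Toy stand-in for a Storage Manager method, used only to demonstrate dispatch.
class DemoStore {
    public String getExtentByName(String name) { return "extent:" + name; }
}
```

One trade-off of this design is that method-resolution errors surface at run time rather than compile time, so the wrapper should translate NoSuchMethodException into a transaction abort.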
The interface methods that access Storage Manager services through the Transaction Manager fall into three types: commit/rollback, methods that require an exclusive lock, and methods that require a shared lock. It may be very useful to create a generic method that acquires either a shared or an exclusive lock and then calls the appropriate method of the Storage Manager module using Java Reflection. To implement this feature, the generic method would request a shared or exclusive lock based on, say, a boolean parameter. The name of the Storage Manager method to be called could be represented as a String, and the other parameters could be placed in an array. This use of Java Reflection within the Transaction Manager would make the layered design more robust by allowing methods at the lower layer (the Storage Manager) to be added or changed without modifying the Transaction Manager's wrapper methods.

7.2 Resource Locking

In the current implementation of MMODB, virtually all operations on the database require the Query Processor to issue a method call to get an extent "object" by resource name. The transaction manager attempts to secure a shared lock on this resource. However, the operation may be an update, in which case the transaction manager will eventually have to upgrade this lock to exclusive. This implementation can lead to increased transaction "waits" and/or deadlocks, because multiple transactions may acquire a shared lock, preventing any of them from upgrading to exclusive. To address this issue, we have added another version of the getExtentByName method, called getExclusiveExtentByName, which attempts to acquire an exclusive lock on the resource. Assuming the Query Processor knows the eventual type of operation (i.e., read or update) being performed, the proper locking method can be called at the outset of the operation. This should improve the performance of the transaction manager.

We would also like to implement object-level locking. It is possible that this finer granularity of locking will reduce the average time for transaction completion and decrease the average number of waits encountered during transaction execution. However, extent-level locking may outperform object-level locking in some cases (such as large range queries). For this reason, it would be beneficial to support both methods and allow the Query Processor to select the locking granularity based upon the type of operation.

In addition, we would like to explore other concurrency control protocols in order to compare them with two-phase locking. One obvious approach that would be relatively trivial to implement is Multiversion Two-Phase Locking (MV2PL). Multiversion 2PL differs from standard 2PL in that it includes a third locking mode, certify-locked [3]. In this scenario, the database maintains a current version of all committed items. If a write lock is requested, the database makes a copy of the item that the transaction has requested. This scheme allows read locks and write locks to coexist on database resources; all reads for an item continue on the committed version, while writes proceed on the writer's exclusive copy of that item. The current design of the Storage Manager would facilitate this modification quite naturally. It would also be interesting to implement completely different concurrency protocols, such as Timestamp Ordering and Multiversion Timestamp Ordering. Timestamp Ordering is another protocol that, like two-phase locking, guarantees serializable schedules. Essentially, some ordering value is used to uniquely identify each transaction. This value could be generated from a counter or the
current date/time of the system clock [3]. It is then the responsibility of the transaction manager to order the transactions based on their timestamps. An enhancement of this scheme is Multiversion Timestamp Ordering, which allows multiple versions of data items to exist at any time. Various algorithms exist to implement these protocols, and the interested reader is urged to consult the literature, such as [3], for further details.

Lastly, there is a potential performance boost in pooling DBResource objects that are freed, either upon lock release or upon the physical removal from the database of the extent they represent. In the former scheme, DBResource objects that remain in the lock table are always locked; in the latter, a DBResource object is removed from the lock table only after its extent has been deleted from the database. In either case, a transaction requesting a resource that does not exist in the lock table would first check the pool of unused DBResource objects; if one is available, it is updated to represent the proper extent, and otherwise a new DBResource object is created. By reusing DBResource objects, we minimize the overhead associated with object creation.

7.3 Deadlock Handling

For simplicity, when a deadlock is found, our system aborts the transaction that made the lock-waiting request (the "current" transaction). A more efficient method would involve a good victim-selection scheme, which should avoid choosing transactions that have been running for a long time and/or have performed many updates, and instead select transactions that have not made many changes [3]. These requirements would require adding more information to the Transaction class.
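A victim-selection policy along these lines might be sketched as a comparator over per-transaction statistics. The fields used here are hypothetical; as noted above, the current Transaction class does not yet track them.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical victim selection: among the transactions in a deadlock
// cycle, abort the one with the fewest updates performed, breaking ties
// in favor of the youngest transaction (latest start time).
class VictimSelector {
    static class TxnStats {
        final int id;
        final int updatesPerformed;  // would need to be tracked per transaction
        final long startMillis;      // transaction start time
        TxnStats(int id, int updates, long start) {
            this.id = id;
            this.updatesPerformed = updates;
            this.startMillis = start;
        }
    }

    static TxnStats chooseVictim(List<TxnStats> cycle) {
        return cycle.stream()
                .min(Comparator.comparingInt((TxnStats t) -> t.updatesPerformed)
                        .thenComparing(Comparator.comparingLong(
                                (TxnStats t) -> t.startMillis).reversed()))
                .orElseThrow();
    }
}
```

The cost of this scheme is bookkeeping: each Transaction must maintain its update count, which adds a small overhead to every exclusive-lock operation.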
In addition, the depth-first search used in the cycle detection is far from optimal [5]. A more efficient cycle detection algorithm would improve system performance.

8. References

1. P. Bohannon, D. Lieuwen, R. Rastogi, A. Silberschatz, S. Seshadri, S. Sudarshan. "The Architecture of the Dali Main-Memory Storage Manager," Kluwer Academic Publishers, Boston.
2. Vibby Gottemukkala, Tobin J. Lehman. "Locking and Latching in a Memory-Resident Database System," Proceedings of the 18th VLDB Conference, Vancouver, British Columbia, Canada, 1992.
3. Ramez Elmasri, Shamkant B. Navathe. Fundamentals of Database Systems. Addison-Wesley, 2000.
4. C. A. R. Hoare. "Monitors: An Operating System Structuring Concept," Communications of the ACM, Vol. 17, No. 10, October 1974, pp. 548-557.
5. Andrew S. Tanenbaum. Modern Operating Systems. Prentice Hall, 1992.
6. Kenneth Salem, Hector Garcia-Molina. "System M: A Transaction Processing Testbed for Memory Resident Data," IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1, pp. 161-172, 1990.