Highly Available Database Management Systems

Wenbing Zhao, Cleveland State University, USA

Introduction

In the Internet age, real-time Web-based services are becoming more pervasive every day. They span virtually all business and government sectors, and typically have a large number of users. Many such services require continuous operation, 24 hours a day, seven days a week. Any extended disruption of service, whether planned or unplanned downtime, can result in significant financial loss and negative social effects. Consequently, the systems providing these services must be made highly available.

A Web-based service is typically powered by a multi-tier system, consisting of Web servers, application servers, and database management systems, running in a server farm environment. The Web servers handle direct Web traffic and pass requests that need further processing to the application servers. The application servers process the requests according to the predefined business logic. The database management systems store and manage all mission-critical data and application states so that the Web servers and application servers can be programmed as stateless servers. (Some application servers may cache information or keep session state. The loss of such state may temporarily reduce performance or mildly annoy the affected users, but it is not critical.) This design is driven by the demand for high scalability (to support a large number of users) and high availability (to provide services all the time). If the number of users increases, more Web servers and application servers can be added dynamically. If a Web server or an application server fails, the next request can simply be routed to another server for processing.

Inevitably, this design increases the burden and importance of the database management systems. However, this is not done without good reason. Web applications often need to access and generate a huge amount of data on requests from a large number of users. A database management system can store and manage the data in a well-organized and structured way (often using the relational model). It also provides highly efficient concurrency control on accesses to shared data.

While it is relatively straightforward to ensure high availability for Web servers and application servers by simply running multiple copies in the stateless design, it is not so for a database management system, which in general has abundant state. The subject of highly available database systems has been studied for more than two decades, and many alternative solutions exist (Agrawal, El Abbadi, & Steinke, 1997; Kemme & Alonso, 2000; Patino-Martinez, Jimenez-Peris, Kemme, & Alonso, 2005). In this article, we provide an overview of two of the most popular database high availability strategies, namely database replication and database clustering. The emphasis is on those that have been adopted and implemented by major database management systems (Davies & Fisk, 2006; Ault & Tumma, 2003).

Background

A database management system consists of a set of data and a number of processes that manage the data. These processes are often collectively referred to as database servers. The core programming model used in database management systems is called transaction processing. In this programming model, a group of read and write operations on a data set is demarcated within a transaction. A transaction has the following ACID properties (Gray & Reuter, 1993):

• Atomicity: All operations on the data set agree on the same outcome. Either all the operations succeed (the transaction commits) or none of them do (the transaction aborts).
• Consistency: If the database is consistent at the beginning of a transaction, then the database remains consistent after the transaction commits.
• Isolation: A transaction does not read or overwrite a data item that has been accessed by another concurrent transaction.
• Durability: The update to the data set becomes permanent once the transaction is committed.
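
To make transaction demarcation concrete, here is a minimal sketch using Python's standard-library sqlite3 module; the database file, the accounts table, and the transfer amounts are all hypothetical.

```python
import sqlite3

# A transfer between two accounts: both UPDATEs succeed together
# (commit) or neither takes effect (rollback), illustrating atomicity.
# The 'accounts' table and the amounts are hypothetical.
conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES (1, 100.0), (2, 50.0)")
conn.commit()

try:
    # sqlite3 opens a transaction implicitly on the first write.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    conn.commit()          # both updates become durable together
except sqlite3.Error:
    conn.rollback()        # neither update is applied
finally:
    conn.close()
```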

To support multiple concurrent users, a database management system uses sophisticated concurrency control algorithms to ensure the isolation of different transactions even if they access some shared data concurrently (Bernstein, Hadzilacos, & Goodman, 1987). The strongest isolation can be achieved by imposing a serializable order on all conflicting read and write operations of a set of transactions, so that the transactions appear to be executed sequentially. Two operations are said to conflict if they access the same data item, at least one of them is a write operation, and they belong to different transactions.

Another popular isolation model is snapshot isolation. Under the snapshot isolation model, a transaction performs its operations against a snapshot of the database taken at the start of the transaction. The transaction will be committed if its write operations do not conflict with those of any other transaction that has committed since the snapshot was taken. The snapshot isolation model can provide better concurrency than the serializable isolation model.

A major challenge in database replication, the basic method to achieve high availability, is that it is not acceptable to reduce the concurrency level. This is in sharp contrast to the replication requirements in some other fields, which often assume that the replicas are single-threaded and deterministic (Castro & Liskov, 2002).
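
As an illustration of the commit rule just described, the following Python sketch implements a first-committer-wins check of the kind that underlies snapshot isolation; the Transaction class, the timestamps, and the validate_commit function are illustrative names, not drawn from any particular DBMS.

```python
from dataclasses import dataclass, field

@dataclass
class Transaction:
    start_ts: int                                  # timestamp of the snapshot read
    write_set: set = field(default_factory=set)    # data items written

def validate_commit(txn, commit_log, now_ts):
    """First-committer-wins: abort txn if any transaction that committed
    after txn's snapshot was taken wrote an item txn also wrote.
    commit_log maps commit timestamp -> set of items written."""
    for commit_ts, items in commit_log.items():
        if commit_ts > txn.start_ts and items & txn.write_set:
            return False          # write-write conflict: txn must abort
    commit_log[now_ts] = set(txn.write_set)
    return True                   # txn commits

# Example: t2 committed item 'x' after t1's snapshot, so t1 must abort.
log = {}
t2 = Transaction(start_ts=0, write_set={"x"})
assert validate_commit(t2, log, now_ts=5)
t1 = Transaction(start_ts=3, write_set={"x", "y"})
assert not validate_commit(t1, log, now_ts=6)
```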

Database High Availability Techniques

To achieve high availability, a database system must maximize the time it operates correctly without a fault and minimize the time it takes to recover from a fault. The transaction processing model used in database management systems has some degree of fault tolerance in that a fault normally cannot corrupt the integrity of the database. If a fault occurs, all ongoing transactions are aborted on recovery. However, the recovery time would be too long to satisfy the high availability requirement. To effectively minimize the recovery time, redundant hardware and software must be used.

Many types of hardware fault can in fact be masked. For example, power failures can be masked by using redundant power supplies, and local communication system failures can be masked by using redundant network interface cards, cables, and switches. Storage medium failures can be masked by using RAID (redundant array of inexpensive disks) or similar techniques. To tolerate the failure of the database servers themselves, several server instances (instead of one) must be used so that if one fails, another instance can take over. The most common techniques are database replication and database clustering. The two techniques are not completely distinct from each other, however.

Database replication is typically used to protect against total site failures. In database replication, two or more redundant database systems operate in different sites (ideally in different geographical regions) and communicate with each other using messages over a (possibly redundant) communication channel. Database clustering is used to provide high availability for a local site. There are two competing approaches in database clustering. One uses a shared-everything (also referred to as shared-disk) design, such as the Oracle Real Application Cluster (RAC) (Ault & Tumma, 2003). The other follows a shared-nothing strategy, such as the MySQL Cluster (Davies & Fisk, 2006) and most DB2 shared database systems. To achieve maximum fault tolerance, and hence high availability, one can combine database replication with database clustering.

Database Replication

Database replication means that two or more instances of a database management system, including server processes, data files, and logs, run on different sites. Usually one of the replicas is designated as the primary, and the rest of the replicas are backups. The primary accepts users' requests and propagates the resulting changes to the database to the backups. In some systems, the backups are allowed to accept read-only queries. It is also possible to configure all replicas to handle users' requests directly, but doing so increases the complexity of concurrency control and the risk of more frequent transaction aborts.

Depending on how and when changes to the database are propagated across the replicas, there are two database replication styles, often referred to as eager replication and lazy replication (Gray & Reuter, 1993). In eager replication, the changes (i.e., the redo log) are transferred to the backups synchronously, before the commit of a transaction. In lazy replication, the changes are transferred asynchronously from the primary to the backups after the transactions have been committed. Because of the high communication cost, eager replication is rarely used to protect against site failures, where the primary and the backups are usually far apart. (Eager replication has been used in some shared-nothing database clusters.)

Eager Replication

To ensure strong replica consistency, the primary must propagate the changes to the backups within the boundary of a transaction. For this, a distributed commit protocol is needed to coordinate the commitment of each transaction across all replicas. The benefit of eager replication is that if the primary fails, a backup can take over as soon as it detects the primary's failure.

The most popular distributed commit protocol is the two-phase commit (2PC) protocol (Gray & Reuter, 1993). The 2PC protocol guarantees the atomicity of a transaction across all replicas in two phases. In the first phase, the primary (which serves as the coordinator for the protocol) sends a prepare request to all backups. If a backup can successfully log the changes, so that it can perform the update even in the presence of a fault, it responds with a "Yes" vote. If the primary collects "Yes" votes from all backups, it decides to commit the transaction. If it receives even a single "No" vote, or it times out waiting for a backup, the primary decides to abort the transaction. In the second phase, the primary notifies the backups of its decision. Each backup then either commits or aborts the transaction locally, according to the primary's decision, and sends an acknowledgment to the primary.

As can be seen, the 2PC protocol incurs significant communication overhead. There are also other problems, such as potential blocking if the primary fails after all backups have voted to commit a transaction (Skeen, 1981). Consequently, there has been extensive research on alternative eager replication techniques, for example, epidemic protocols (Agrawal et al., 1997; Stanoi, Agrawal, & El Abbadi, 1998) and multicast-based approaches (Kemme & Alonso, 2000; Patino-Martinez et al., 2005). However, they have not been adopted by any major commercial product, due to their high overhead or complexity.
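
The following Python sketch mirrors the 2PC message flow just described, with the primary acting as coordinator; the Backup class, its prepare/decide methods, and the timeout handling are simplified assumptions rather than a real replication API.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

class Backup:
    """Stand-in for a backup replica reachable over the network."""
    def prepare(self, txn_id, redo_log):
        # Force the redo log to stable storage, then vote.
        return "YES"              # or "NO" if logging fails
    def decide(self, txn_id, decision):
        pass                      # apply COMMIT or ABORT locally

def two_phase_commit(txn_id, redo_log, backups, timeout=2.0):
    # Phase 1: send prepare requests and collect votes in parallel;
    # a timeout is treated the same as a "NO" vote.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(b.prepare, txn_id, redo_log) for b in backups]
        try:
            votes = [f.result(timeout=timeout) for f in futures]
        except TimeoutError:
            votes = ["NO"]
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    # Phase 2: propagate the decision to every backup.
    for b in backups:
        b.decide(txn_id, decision)
    return decision

assert two_phase_commit(1, {"x": 42}, [Backup(), Backup()]) == "COMMIT"
```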

Lazy Replication

Most commercial database systems support lazy replication. In lazy replication, the primary commits a transaction immediately. The redo log, which reflects the changes made by recently committed transactions, is transferred to the backups asynchronously. Usually, the backup replicas lag behind the primary by a few transactions. This means that if the primary fails, the last several committed transactions might be lost.

Besides the primary/backup replication approach, some database management systems allow a multi-primary configuration in which all replicas are allowed to accept update transactions. If this configuration is used with lazy replication, different replicas might make incompatible decisions, in which case manual reconciliation is required.
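
A minimal sketch of the asynchronous log shipping behind lazy replication, assuming an in-process queue as a stand-in for the network channel and plain dictionaries as stand-ins for the data files; all names are illustrative.

```python
import queue
import threading

ship_queue = queue.Queue()        # stands in for the network channel

def primary_commit(txn_id, redo_records, database):
    database.update(redo_records)            # commit locally first ...
    ship_queue.put((txn_id, redo_records))   # ... then ship the redo log

def backup_apply_loop(replica_db):
    """The backup applies shipped redo records in commit order; it lags
    the primary, so a primary crash can lose the last few transactions."""
    while True:
        txn_id, redo_records = ship_queue.get()
        replica_db.update(redo_records)
        ship_queue.task_done()

# Example usage:
primary_db, backup_db = {}, {}
threading.Thread(target=backup_apply_loop, args=(backup_db,), daemon=True).start()
primary_commit(1, {"x": 42}, primary_db)   # returns once the local commit is done
ship_queue.join()                          # wait for the backup to catch up
assert backup_db == primary_db
```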

Database Clustering

In recent years, database clustering has evolved into the most promising technique for achieving high availability as well as high scalability (Ault & Tumma, 2003; Davies & Fisk, 2006). Database clustering, as the name suggests, uses a group of computers interconnected by a high-speed network. In the cluster, multiple database server instances are deployed. If one instance fails, another instance takes over very quickly, so high availability is ensured. Database clustering brings not only high availability but also scaling-out capability. Scaling out means that the capacity of a database management system can be increased dynamically by adding more inexpensive nodes while keeping the old equipment.

There are two alternative approaches to database clustering. One approach, pioneered in Oracle RAC, adopts a shared-everything architecture. A number of other products choose the shared-nothing architecture. Both approaches have their challenges and advantages.

Shared-Everything Cluster

In a shared-everything database cluster, all server instances share the same storage device, such as a storage area network. The cluster nodes typically connect to the shared storage device via a Fibre Channel switch or shared SCSI for fast disk I/O. The shared storage device must also have built-in redundancy, such as mirrored disks, to mask disk failures.

To minimize disk I/O, all server instances share a common virtual cache space. The virtual cache space consists of the local cache buffers owned by individual server instances. A number of background processes are used to maintain the consistency of the data blocks in the cache space. These processes are also responsible for synchronizing access to the cached data blocks, because only one server instance is allowed to modify a data block at a time.

Each server instance has its own transaction logs, stored on the shared disk. If a server instance fails, another server instance takes over by performing a roll-forward recovery using the redo log of the failed server instance. This ensures that the changes made by committed transactions are recorded in the database and do not get lost. The recovering instance also rolls back the transactions that were active at the time of the failure and releases the locks on the resources used by those transactions.

The shared-everything design makes it unnecessary to repartition the data, and therefore eases the tasks of cluster maintenance and management. However, this benefit does not come for free. The most prominent concern is the cost of inter-node synchronization. Unless a high-speed interconnect is used and the workload is properly distributed among the server instances, inter-node synchronization might limit the scalability of the cluster. In addition, the requirement for a high-speed shared disk system imposes a higher financial cost than using conventional disks.
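
The following sketch illustrates the roll-forward step described above: replaying the failed instance's redo log so that committed updates survive while uncommitted ones are discarded. The three-field record format is a simplifying assumption.

```python
def recover(redo_log, database):
    """Replay a failed instance's redo log: re-apply the updates of
    committed transactions (roll forward) and discard the updates of
    transactions still active at the failure (roll back). Each record
    is (txn_id, op, payload) with op in {'update', 'commit'}."""
    committed = {t for (t, op, _) in redo_log if op == "commit"}
    for txn_id, op, payload in redo_log:
        if op == "update" and txn_id in committed:
            database.update(payload)      # roll forward committed work
    # Updates of uncommitted transactions are simply not re-applied,
    # and their locks are released, which rolls them back.
    return database

# Transaction 2 was active at the crash, so only txn 1's update survives.
log = [(1, "update", {"x": 1}), (1, "commit", None),
       (2, "update", {"y": 9})]
assert recover(log, {}) == {"x": 1}
```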

Shared-Nothing Cluster

In a shared-nothing database cluster, each node runs one or more server instances and has its own memory space and stable storage. Essential to the shared-nothing approach, the data must be partitioned, either manually or automatically by the database system, across the different nodes. Each partition must be replicated on two or more nodes to maintain the desired redundancy level. Concurrency control and caching are carried out at each local node, and therefore they are more efficient than in shared-everything clusters. However, to ensure the consistency of replicated data and fast recovery, the two-phase commit protocol is often used to ensure atomic commitment of the transactions in the cluster. Compared with the shared-everything approach, the cost of inter-node synchronization is essentially replaced by that of distributed commit.

The shared-nothing approach faces the additional challenge of preventing the split-brain syndrome (Birman, 2005). The split-brain syndrome may arise if the network partitions and each partition makes incompatible decisions on the outcome of transactions or their relative orders. To prevent this problem, typically only the main partition is allowed to survive; the minor partitions must stop accepting new transactions and abort their active transactions. Usually, the main partition is the one that contains the majority of the replicas, or the one that contains a special node designated as the arbitration node (Davies & Fisk, 2006).
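
A small sketch of the majority rule commonly used for split-brain prevention; the function name and the arbitration-node flag are illustrative.

```python
def partition_may_survive(nodes_in_partition, total_nodes,
                          has_arbitration_node=False):
    """A partition keeps accepting transactions only if it holds a
    strict majority of the cluster's nodes, or if it contains the
    designated arbitration node (which breaks ties in even splits)."""
    return (len(nodes_in_partition) > total_nodes // 2
            or has_arbitration_node)

# A 4-node cluster splits 2/2: only the half holding the arbitration
# node survives, so the two halves cannot commit conflicting work.
assert not partition_may_survive({"n1", "n2"}, 4)
assert partition_may_survive({"n3", "n4"}, 4, has_arbitration_node=True)
```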

Future Trends

Existing database systems are designed to tolerate process crash faults and hardware faults. However, considering the increasing pace of security breaches, future database management systems must be designed to be intrusion tolerant; that is, they should provide high availability against a variety of security threats, such as the unauthorized deletion and alteration of database records, the disruption of distributed commit (which may cause replica inconsistency), and the exposure of confidential information. To make a database system intrusion tolerant, many fundamental protocols, such as the 2PC protocol, must be enhanced. There may also be a need to design special tamper-proof storage devices to protect data integrity (Strunk, Goodson, Scheinholtz, Soules, & Ganger, 2000). Even though there has been intensive research in this area (Castro & Liskov, 2002; Malkhi & Reiter, 1997; Mohan, Strong, & Finkelstein, 1983; Deswarte, Blain, & Fabre, 1991), the results have rarely been incorporated into commercial products. The primary barriers are the high computation and communication cost, the complexity, and the high degree of replication required to tolerate malicious faults.

Conclusion

Database systems are the cornerstones of today's information systems. The availability of database systems largely determines the quality of service provided by the information systems built on them. In this article, we provided a brief overview of state-of-the-art database replication and clustering techniques. For many applications, a low-cost shared-nothing database cluster that uses conventional hardware might be a good starting point towards high availability. We envisage that future generations of database management systems will be intrusion tolerant; that is, they will be capable of continuous operation in the face of not only hardware and process crash faults, but a variety of security threats as well.

References

Agrawal, D., El Abbadi, A., & Steinke, R.C. (1997). Epidemic algorithms in replicated databases. Proceedings of the ACM Symposium on Principles of Database Systems (pp. 161-172), Tucson, AZ.

Ault, M., & Tumma, M. (2003). Oracle9i RAC: Oracle real application clusters configuration and internals. Kittrell, NC: Rampant TechPress.

Bernstein, P.A., Hadzilacos, V., & Goodman, N. (1987). Concurrency control and recovery in database systems. Reading, MA: Addison-Wesley.

Birman, K. (2005). Reliable distributed systems: Technologies, Web services, and applications. Berlin: Springer-Verlag.

Castro, M., & Liskov, B. (2002). Practical Byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems, 20(4), 398-461.

Davies, A., & Fisk, H. (2006). MySQL clustering. MySQL Press.

Deswarte, Y., Blain, L., & Fabre, J.C. (1991). Intrusion tolerance in distributed computing systems. Proceedings of the IEEE Symposium on Research in Security and Privacy (pp. 110-121). Oakland, CA: IEEE Computer Society Press.

Gray, J., & Reuter, A. (1993). Transaction processing: Concepts and techniques. San Mateo, CA: Morgan Kaufmann.

Kemme, B., & Alonso, G. (2000). A new approach to developing and implementing eager database replication protocols. ACM Transactions on Database Systems, 25(3), 333-379.

Malkhi, D., & Reiter, M. (1997). Byzantine quorum systems. Proceedings of the ACM Symposium on Theory of Computing (pp. 569-578), El Paso, TX.

Mohan, C., Strong, R., & Finkelstein, S. (1983). Method for distributed transaction commit and recovery using Byzantine agreement within clusters of processors. Proceedings of the ACM Symposium on Principles of Distributed Computing (pp. 89-103), Montreal, Quebec.

Patino-Martinez, M., Jimenez-Peris, R., Kemme, B., & Alonso, G. (2005). Middle-R: Consistent database replication at the middleware level. ACM Transactions on Computer Systems, 375-423.

Skeen, D. (1981). Nonblocking commit protocols. Proceedings of the ACM International Conference on Management of Data (pp. 133-142), Ann Arbor, MI.

Stanoi, I., Agrawal, D., & El Abbadi, A. (1998). Using broadcast primitives in replicated databases. Proceedings of the IEEE International Conference on Distributed Computing Systems (pp. 148-155), Amsterdam, The Netherlands.

Strunk, D., Goodson, G., Scheinholtz, M., Soules, C., & Ganger, G. (2000). Self-securing storage: Protecting data in compromised systems. Proceedings of the USENIX Association Symposium on Operating Systems Design and Implementation (pp. 165-189), San Diego, CA.

Key Terms

Database Cluster (Shared-Everything, Shared-Nothing): A database management system runs on a group of computers interconnected by a high-speed network. In the cluster, multiple database server instances are deployed. If one instance fails, another instance takes over very quickly to ensure high availability. In the shared-everything design, all nodes can access a shared stable storage device. In the shared-nothing design, each node has its own cache buffer and stable storage.

Database Recovery (Roll-Backward, Roll-Forward): Recovery is needed when a database instance that has failed is restarted, or when a surviving database instance takes over a failed one. In roll-backward recovery, the transactions active at the time of failure are aborted and the resources allocated to those transactions are released. In roll-forward recovery, the updates recorded in the redo log are applied to the database so that they are not lost.

Database Replication (Eager, Lazy): Multiple instances of a database management system are deployed on different computers (often located at different sites). Their state is synchronized closely to ensure replica consistency. In eager replication, the updates are propagated and applied to all replicas within the transaction boundary. In lazy replication, the changes are propagated from one replica to the others asynchronously.

High Availability (HA): The capability of a system to operate with long uptime and to recover quickly if a failure occurs. Typically, a highly available system implies that its measured uptime is five nines (99.999%) or better, which corresponds to at most about 5.26 minutes of planned and unplanned downtime per year.

Split-Brain Syndrome: This problem may happen if the network partitions in a database cluster and each partition makes incompatible decisions on the outcome of transactions or their orders. To prevent this problem, typically only the main partition is allowed to survive.

Transaction: A group of read/write operations on the same data set that succeeds or fails atomically. More precisely, a transaction has the atomicity, consistency, isolation, and durability (ACID) properties.

Two-Phase Commit Protocol (2PC): This protocol ensures atomic commitment of a transaction that spans multiple nodes in two phases. During the first phase, the coordinator (often the primary replica) queries the prepare status of the transaction. If all participants agree to commit, the coordinator decides to commit; otherwise, the transaction is aborted. The second phase propagates the decision to all participants.
