JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 23, 723-739 (2007)
Concurrent View Maintenance Scheme for Soft Real-Time Data Warehouse Systems

NAMGYU KIM AND SONGCHUN MOON*
Department of Business Information Technology, Kookmin University, Seoul, 136-702 Korea
*Department of Management Engineering, Korea Advanced Institute of Science and Technology, Seoul, 305-701 Korea

The data warehouse should be refreshed periodically or aperiodically in order to prevent OLAP transactions from reading extremely stale data. This refreshing process is referred to as view maintenance. There has been much research on ways to preserve data consistency during view maintenance. The simplest approach is to perform view maintenance at night while OLAP transactions are allowed to be active only in the daytime. However, this approach cannot be applied to globalized corporations because they have no common idle time. In this paper, we propose a new algorithm for performing view maintenance without blocking OLAP transactions. Our algorithm improves data freshness by allowing one or more maintenance transactions and OLAP transactions to be executed concurrently. We introduce the details of the algorithm with some examples, present proofs for the devised scheme, and then evaluate performance gains by means of simulation.

Keywords: concurrency control, database management, data warehouse, information storage and retrieval, real-time systems, view maintenance
Received June 7, 2005; revised December 12, 2005 & January 17, 2006; accepted March 20, 2006. Communicated by Ming-Syan Chen.

1. INTRODUCTION

OLAP [1] transactions usually issue long-lived queries, such as complicated decision support processes. Multiple, distributed, and heterogeneous databases are integrated to support such information-intensive OLAP queries. The integration methods can be classified into the on-demand approach and the in-advance approach [2]. In the former, information is not extracted from the source databases until queries are actually issued. In the latter, by contrast, the information of interest is extracted from each source database in advance and stored in a logically centralized repository. When a query is issued, it is evaluated directly at the repository without accessing the operational database (ODB). This approach is commonly referred to as data warehousing, and the integrated repository itself is referred to as a data warehouse (DWH).

In a DWH environment, data flows from the ODB into the DWH [3]. Data values in the DWH cannot be updated directly by write operations, because only read operations of OLAP transactions are allowed to access data in the DWH. Data in the ODB, by contrast, can be updated directly by write operations of OLTP transactions. Therefore, data values in the DWH and the ODB diverge over time. This implies that data values in the DWH should be replaced with recent values from the ODB in order to prevent OLAP transactions from reading excessively stale data from the DWH. This refreshing process is frequently referred to as view maintenance and is performed by maintenance transactions generated by the DWH management system.

There has been much research [2, 4-8] on ways to preserve data consistency [9] during view maintenance. The simplest approach is to execute maintenance transactions at night. In this approach, every maintenance transaction is executed only at night, and no OLAP transaction is allowed to execute its read operations in this period. Conversely, no maintenance transaction is allowed to execute its write operations in the daytime. View maintenance can thus be performed without damaging data consistency.

This approach has two limitations. First, there is no common nighttime for a globalized corporation. Second, even if there is one, it is hard to guarantee that maintenance transactions complete within the night, because they are usually long-lived. For this reason, there have been many efforts to enable OLAP and maintenance transactions to execute concurrently without damaging data consistency. Most of them assume that two or more maintenance transactions are not allowed to execute concurrently. This restriction seems unduly strict considering that each ODB might have a different owner. Moreover, the restriction degrades transaction concurrency and data freshness [10] because it disallows concurrent execution of maintenance transactions even when they are independent of each other. By relaxing the restriction, therefore, the data freshness of the DWH could be improved without raising the cost of version management.

The view maintenance scheme proposed in this paper can be characterized as follows.
First, the scheme tries to minimize the cost of space management by maintaining only one version for OLAP transactions and another for OLTP transactions. Second, it can improve the data freshness of the DWH by allowing concurrent execution of multiple maintenance transactions. To explain our algorithm simply using the transactional approach [4], we regard OLAP transactions as read-only and OLTP transactions as write-only. In this paper, we discuss the details of the algorithm with some examples, present proofs for the devised scheme, and then evaluate performance gains by means of simulation.

The rest of this paper is organized as follows. Section 2 presents an overview of previous work on view maintenance and its limitations. In section 3, we propose the conflict order-based (COB) view maintenance scheme for the soft real-time data warehouse environment. Section 4 provides proofs of correctness for COB. Performance gains of COB are analyzed in section 5. Finally, section 6 concludes this paper.
2. RELATED WORKS

In order to preserve data consistency between ODBs and the DWH, traditional view maintenance schemes [6] allowed only one maintenance transaction to be executed at a time. However, serial execution of maintenance transactions degrades data freshness. In [6], the two-version no locking (2VNL) algorithm is proposed to enhance data availability by allowing OLAP transactions and a maintenance transaction to execute concurrently. 2VNL achieves this concurrency by offering two alternative data versions to OLAP transactions. 2VNL distinguishes three types of database versions: a future version, a current version, and a previous version. While a maintenance transaction is running, only a future version and a current version exist. Conversely, only a current version and a previous version can exist when there is no active maintenance transaction. Maintenance transactions always operate on the future version, while OLAP transactions read data values from either the previous version or the current version. Each OLAP transaction is associated with exactly one version, the one that was current when the transaction began, and continues to read that version throughout its lifetime. Fig. 1 illustrates a simple scenario for view maintenance in 2VNL.
[Figure: timeline showing maintenance transactions TU1, TU2, TU3; OLAP transactions TR1 to TR5; database versions V0, V1, V2; time axis marked at 8 am and 9 am.]
Fig. 1. Performing view maintenance without blocking OLAP transactions.
In Fig. 1, OLAP transactions TR1, TR2, TR3, TR4, and TR5 continue their execution without being blocked while maintenance transactions TU1, TU2, and TU3 execute. To support this concurrency, 2VNL provides two database versions to OLAP transactions. V0 is the initial version of the database, while V1 and V2 are new versions created by maintenance transactions TU1 and TU2, respectively. Let us suppose that every maintenance transaction starts at 9 a.m. and finishes at 8 a.m. the next morning. A new database version is created as soon as the maintenance transaction finishes. This version is maintained from then on and expires at 9 a.m. on the next day.

In Fig. 1, TR1 and TR2 access only V0, which was the current version when they started. In particular, TR2 must still read V0 even after 8 a.m., although another version V1 has been created and is available to TR3. The 2VNL scheduler does not allow TR2 to read V1, in order to prevent TR2 from reading inconsistent data from two versions, V0 and V1. Therefore, V0 must still be offered to TR2 until 9 a.m. although it is no longer the current version. To prevent the number of versions from growing without bound, previous versions expire at a predefined deadline (i.e., 9 a.m.). In the same manner, TR3, TR4, and TR5 access V1 throughout their lifetimes. TR4 reads V1 even after 8 a.m. for the same reason as TR2. The execution of TR5 deserves attention: TR5 will eventually be aborted because it reads data from V1 and is still active after V1 expires. This implies that the longer OLAP transactions run, the more of them the 2VNL scheduler aborts.

In 2VNL, an OLAP transaction is aborted if it is still running after its associated version has expired. Frequent aborts of OLAP transactions can cause critical overhead because OLAP transactions usually have long execution times. This implies that two versions might be insufficient for view maintenance in DWH management systems [11]. For this reason, [6] proposed not only 2VNL but also the nVNL algorithm, which maintains more than two database versions. However, nVNL has the limitation that it is not easy to estimate exactly how many versions are required. Moreover, the cost of version management increases with the number of versions. In addition, both 2VNL and nVNL allow only one maintenance transaction to execute at a time. This strict restriction can degrade transaction concurrency because it disallows multiple maintenance transactions even when they do not conflict with each other at all.
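The version binding and expiration behavior of 2VNL described above can be illustrated with a minimal Python sketch. This is not the implementation from [6]; the class names `Version` and `TwoVersionStore`, the method names, and the use of plain hour offsets as timestamps are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Version:
    name: str
    created_at: float                   # hour at which the version becomes current
    expires_at: Optional[float] = None  # hour at which it is discarded; None until set


class TwoVersionStore:
    """Toy model of 2VNL-style version binding (hypothetical, illustrative only)."""

    def __init__(self, initial: Version) -> None:
        self.versions: List[Version] = [initial]

    def install(self, new: Version, previous_expires_at: float) -> None:
        # The old current version becomes the previous version and receives a deadline.
        self.versions[-1].expires_at = previous_expires_at
        self.versions.append(new)

    def current_at(self, t: float) -> Version:
        # An OLAP transaction starting at time t is bound to the newest
        # version created at or before t.
        candidates = [v for v in self.versions if v.created_at <= t]
        return max(candidates, key=lambda v: v.created_at)

    def can_read(self, bound: Version, t: float) -> bool:
        # A reader never switches versions; it must abort once its version has expired.
        return bound.expires_at is None or t < bound.expires_at
```

With hours counted from midnight of the first day, installing V1 at hour 32 (8 a.m. on the second day) with `previous_expires_at=33` (9 a.m.) means that a transaction bound to V0 may still read at hour 32.5 but would have to be aborted at hour 33.5, which mirrors the fate of TR5 with respect to V1.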
3. CONFLICT ORDER-BASED VIEW MAINTENANCE SCHEME

3.1 Notion of Publish Ordering

To execute multiple maintenance transactions concurrently without damaging data consistency, COB maintains a data version for the ODB (VODB) as well as a version for the DWH (VDW). Only one VODB is maintained globally, instead of one being maintained locally by each ODB. Let TUi denote an OLTP transaction that updates data in VODB, and TRi an OLAP transaction. Data updated by TUi cannot be reflected directly in VDW, so it is stored in VODB for a while. TRi's read operations, by contrast, retrieve data values directly from VDW. In summary, every OLAP transaction reads data values from VDW, while every OLTP transaction writes new data values into VODB. To prevent OLAP transactions from reading excessively stale data, therefore, data in VDW should be replaced with recent data in VODB by maintenance transactions. In this paper, the term publish stands for this replacement process, and the notation Pub( ) denotes its operation. Data in VDW can be refreshed with recent data in VODB only by performing publish operations. For example, the publish of all data written by a committed transaction TUi is represented as Pub(TUi).

To preserve data consistency, COB uses a publish order graph (POG) to schedule a conflict-free publish order. Conflicts do not occur between OLAP transactions because OLAP transactions never issue write operations. Every conflict that occurs during the view maintenance process follows one of two patterns: (1) a conflict between a read operation of an OLAP transaction and a write operation of a maintenance transaction, or (2) a conflict between write operations of maintenance transactions. We refer to the first case as an inter-conflict and to the second as an intra-conflict. In the next section, some rules will be presented that enable the COB protocol to generate a conflict-free schedule.
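The separation between VODB and VDW, and the role of Pub( ), can be sketched in Python as follows. This is only a minimal illustration under assumptions made here: the class `TwoCopyStore`, its method names, and the per-transaction buffer of pending writes are hypothetical and are not taken from the COB specification; the ordering rules that decide when a publish may run are deliberately left out.

```python
from typing import Dict


class TwoCopyStore:
    """Toy model of the VODB / VDW split (hypothetical names, illustrative only)."""

    def __init__(self, initial: Dict[str, int]) -> None:
        self.v_dw: Dict[str, int] = dict(initial)   # version read by OLAP transactions
        self.v_odb: Dict[str, int] = dict(initial)  # version written by OLTP transactions
        # Values written by each OLTP transaction and not yet published.
        self.pending: Dict[str, Dict[str, int]] = {}

    def write(self, txn: str, item: str, value: int) -> None:
        # OLTP writes go to VODB only; VDW is left untouched.
        self.v_odb[item] = value
        self.pending.setdefault(txn, {})[item] = value

    def read(self, item: str) -> int:
        # OLAP reads always come from VDW.
        return self.v_dw[item]

    def publish(self, txn: str) -> None:
        # Pub(txn): copy everything txn wrote into VDW in one step.
        for item, value in self.pending.pop(txn, {}).items():
            self.v_dw[item] = value
```

For example, after `write("TU1", "x", 1)` a call to `read("x")` still returns the old value, and the new value becomes visible to readers only after `publish("TU1")`.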
Each maintenance transaction is generated from exactly one OLTP transaction. To create a maintenance transaction, COB simply appends a publish operation to the end of the OLTP transaction. Some transactions can execute their publishes immediately after they commit, while others must be delayed. By arranging the publish order of the maintenance transactions in accordance with the precedence in the POG, the COB scheduler can preserve data consistency during concurrent execution of maintenance transactions.

3.2 View Maintenance Using Publish Order Graph

When a transaction commits, links are appended to the POG if the transaction has executed any operations that conflict with other transactions. To produce a one-copy serializable view maintenance schedule [9], all updated data in VODB should be published in the topological order of the POG. In the following scenario, TRd is an OLAP transaction, while TUi and TUd are view maintenance transactions, each of which is generated by attaching a publish operation to the end of an OLTP transaction. A link TUd → TUi in the POG implies that Pub(TUi) should be performed after Pub(TUd). Similarly, when TRd → TUi exists in the POG, Pub(TUi) cannot be performed until TRd is committed. In this subsection, we define Rules 1 and 2 for appending links to the POG and Rules 3 and 4 for publishing data in one-copy serializable order. When a transaction commits, the rules are triggered in the order Rules 1, 2, 3, and 4.

[Figure: schedule of OLAP transactions TR1, TR2 and OLTP transactions TU1, TU2, TU3. Operations per transaction:
TR1: r(x) r(y) cR1
TR2: r(y) r(z) cR2
TU1: w(x) w(y) cU1 Pub(TU1)
TU2: w(z) cU2 Pub(TU2)
TU3: w(y) w(z) cU3 Pub(TU3)]
Fig. 2. Data inconsistency caused by immediate publishes of TU1, TU2, and TU3.
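Publishing in the topological order of the POG, as required in this subsection, can be sketched with Python's standard `graphlib` module. The sketch below is an assumption-laden illustration rather than the COB scheduler itself: the function names and the example links are hypothetical, and the rules that decide which links are appended to the POG are not modeled.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Iterable, List, Set, Tuple


def publish_order(links: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the nodes of a POG in an order that respects every link.

    A link (a, b) stands for a -> b: b may be published only after a
    has been published (or, for an OLAP node a, after a has committed).
    """
    predecessors: Dict[str, Set[str]] = {}
    for before, after in links:
        predecessors.setdefault(before, set())
        predecessors.setdefault(after, set()).add(before)
    # TopologicalSorter expects a mapping from each node to its predecessors.
    return list(TopologicalSorter(predecessors).static_order())


def has_consistent_order(links: Iterable[Tuple[str, str]]) -> bool:
    # A cycle means that no publish order can satisfy all links.
    try:
        publish_order(links)
        return True
    except CycleError:
        return False
```

With the hypothetical links `[("TU1", "TU3"), ("TU2", "TU3")]`, any order returned by `publish_order` places TU3 after both TU1 and TU2, while the relative order of TU1 and TU2 is left open because no link relates them.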
Data consistency might be threatened if a transaction publishes its updated data immediately after its commit (Fig. 2). TR1 and TR2 read data from VDW while TU1, TU2, and TU3 update data in VODB. Data updated by OLTP transactions becomes accessible to OLAP transactions only through Pub( ) operations. First, let us consider only TR1 and TU1. The data x and y updated by TU1 are not accessible to TR1 until Pub(TU1) is performed. Therefore, TR1 reads the old value of x, while it reads the new value of y, which has been updated and published by TU1. This results in inconsistent data retrieval because TR1 is affected by partial results of TU1. For a schedule to be one-copy serializable, every transaction should be affected by all or none of the results of any other transaction. We can detect inconsistent data retrieval by observing a cycle in the serialization graph (SG) [9]. In the SG for TU1 and TR1, there is a cycle consisting of the links TR1 → TU1 on data x and TU1 → TR1 on data y. COB imposes the following Rule 1 to prevent this type of cycle from forming.

In the following rules, TUi and TUd are maintenance transactions, while TRi and TRd are OLAP transactions. They are assumed to contend for data x. The operation Pub(TUi) is initially issued only after TUi commits. The operation is then either executed immediately or delayed according to the following rules. In the following rules, a symbol