A Model for Schema Versioning in Temporal Database Systems

John F. Roddick
Advanced Computing Research Centre
School of Computer and Information Science
University of South Australia
The Levels, Adelaide, South Australia 5095
[email protected]

Abstract

The aim of temporal data models is to accommodate the nature of time naturally and directly, and to avoid the ad hoc, one-off extensions commonly employed in information systems design. Schema versioning is one of a number of related areas dealing with the general problem of using multiple schemata for various database-related tasks. In particular, schema versioning, and its weaker companion, schema evolution, deal with the need to retain current data and software system functionality in the face of changing database structure. This paper brings temporal and evolutionary models together and proposes an extension to temporal data models capable of accommodating partial schema versioning.

Keywords: Schema Evolution, Partial Schema Versioning, Temporal Data Modelling.
1. Introduction

In recent years there has been considerable work investigating the accommodation of change in data models. Change in data results from the dynamic character of the real world and has prompted research into dynamic and temporal modelling techniques. More recently, change in data structure, reflecting changes in real-world object morphology or user requirements, has resulted in research into schema evolution and schema versioning. Within the database systems field, this has resulted in the prototype development of temporal and evolving database systems. A variety of temporal data models have been proposed in the literature [1-13] and a comparative survey of those based on the relational model is given by McKenzie and Snodgrass [14]. Substantially fewer models have been proposed accommodating schema evolution or schema versioning [15-17], although an increasing number of papers in this field are appearing [18]. This paper brings these models together and proposes an extension to temporal data models capable of accommodating partial schema versioning.
This paper first briefly outlines the necessary characteristics of a temporal, evolving model. Because of the complexity involved, this necessitates giving a number of definitions which will be used to delineate the scope of the model. In Section 3 the extensions are then presented and evaluated with respect to the pragmatic requirements of such a model. Some discussion of other research issues is then given in Section 4.
2. Characteristics of Temporal, Evolving Data Models

The growth in research into accommodating changing data and data structure has resulted, above all, from a strong user requirement for stable systems in changing environments. Numerous analyses of future database directions have indicated that legacy data, change management and high availability are characteristics that are currently, if not poorly, then at least inadequately supported by databases [19-21]. From our perspective, this gives a strong pragmatic emphasis to the development of a temporal, evolving model. In addition, the development of commercial object-oriented database management systems has divided (very broadly) into two groups: the unified architectures, which attempt to build OO functionality on top of a relational engine (for example UniSQL/X [22] and Postgres/Montage [23, 24]), and the pure approach, which generally does not (for example O2 [25] and ObjectStore [26]). The approach taken here is to extend those temporal data models which are based on the relational model and thus provide a basis that may be utilised by the unified architectures. The decision to extend the relational model is reinforced by recent initiatives to extend the SQL92 standard to support temporal data and, to a more limited extent, evolving schemata [27]. Schema versioning is defined as being accommodated when [28]:

    … a database system allows the accessing of all data, both retrospectively and prospectively, through user definable version interfaces.
Proceedings of the 19th ACSC Conference, Melbourne, Australia, January 31-February 2 1996, pp 446-452.
Schema versioning subsumes the concept of schema evolution, which in turn subsumes that of schema modification. A more complete discussion of some of the general schema evolution issues can be found in [29]. The definition can be further refined by distinguishing between retrieval and update activity. In particular, the definitions of full and partial schema versioning can be given as follows:

Partial Schema Versioning is accommodated when a database system allows the viewing of all data, both retrospectively and prospectively, through user definable version interfaces. Data updates are allowable through reference to one designated (normally the current) schema definition only.

Full Schema Versioning is accommodated when a database system allows the viewing and update of all data, both retrospectively and prospectively, through user definable version interfaces.

Some of the problems associated with schema versioning are analogous to those associated with data and view integration, which aims to merge or allow the use of schemata from different sources. Research in this area has largely been directed at integrating heterogeneous systems, but the applicability to the evolution of schemata is clear. Miller, Ioannidis and Ramakrishnan [30], for example, investigate the concept of the information capacity of a schema to decide whether schema or view integration would be lossless. A taxonomy is given in which the translation ability for two schemata is distinguished by their ability to retain information during update and retrieval. This aspect is also investigated by Orlowska and Ewald [31-33], who view schema integration as a schema evolution process in which the integration of two or more schemata is effected by choosing one and applying the facts held in the others. Geller et al. [34] present a method which allows the integration of structurally similar but semantically dissimilar datasets by using a "dual" model which maintains separate representations for structure and semantics.

Although very similar to data and view integration, and although many aspects of this work are common, the accommodation of historical data and the ability to retain the evolutionary history of schemata enable and, indeed, necessitate a different treatment of the problem. Miller et al. [30] have shown that in order to update data stored under two different schemata using the opposite schema, the schemata must be equivalent, i.e. all valid instances of some schema S1 must be able to be stored under S2 and vice-versa. Since we cannot (generally) foresee the future requirements of the database, and hence the changes required to the data structure, and since neither the active nor the historical schemata can be changed (we may only create a new version which
supersedes the old), this is too strong a condition to impose. We therefore adopt the weaker concept of partial schema versioning, in which data stored under any historical schema may be viewed through any other schema but may be updated only through the current or active schema. Even this level of support necessitates imposing some restrictions on the schema modifications which can be achieved without reorganisation and data coercion. This aspect is discussed in more detail in Section 3.

A set of characteristics for a temporal, evolving data model is now presented. The length of this paper precludes lengthy discussion of these; however, references are provided to other sources for further information.

a. The model must support bitemporal data. While a number of valid-time-only and transaction-time-only models have been proposed, and while valid-time and transaction-time are orthogonal, it is important that the model be general, with the non-temporal models as special cases†.

b. The model must degrade, in a conceptually simple manner, to non-evolving, snapshot or temporal models.

c. The model must accommodate retrieval of all data through any defined version interface‡. Note that version interfaces may cease to be defined under specified circumstances.

d. The model must accommodate the ability to update all (allowable*) data, although the version interfaces allowable for this purpose may be restricted. In the context of the definitions above, this model accommodates partial schema versioning and allows update only through the current schema definition.

e. The model must accommodate a set of restructuring operations which must be both complete and pragmatically useful. Completeness in this sense requires not only that any allowable relational structure be able to be created but also that this evolution be able to be tracked. SQL's DDL statements, for example, while able to permit the creation of any relational structure, do not permit all forms of evolution to that structure, for example, creating one relation from the outer join of two others.

f. The model must accommodate a set of pragmatically useful type conversion rules to allow for sensible evolution of attribute domains.

g. The model must facilitate some form of meta-level vacuuming in order to remove unwanted schema versions [35, 36]. See Section 4.2.

† Terminology in this paper will adhere to the consensus glossary presented by Jensen et al. [28].
‡ Although it will be assumed in this paper that all changes to a database schema result in a new version interface, this need not be the case. This is discussed further in Section 4.
* Bitemporal and transaction-time models do not allow changes to data stored with a transaction-time end date which is not set to the open value. That is, retrospective changes may only occur on the valid-time axis; temporal references to events on the transaction-time axis are linked to the system clock.
3. An Extension to Temporal Models to Support Versioning

3.1 Completing the Relation

The extension presented here uses, as an example, the bitemporal conceptual data model (BCDM) employed in the TSQL2 language design [13, 27], although the ideas presented are generally applicable to many other temporal models. Under the BCDM, time is conceptually modelled as consisting of bitemporal chronons (elementary rectangles) existing in the two-dimensional space between transaction-time and valid-time. A set of chronons is termed a bitemporal element. Facts are represented by tuples consisting of an arbitrary number of explicit attributes and an implicit timestamp attribute which represents a bitemporal element. Thus a tuple in an instance of a relation comprises an arbitrary number of attributes representing a fact and a temporal element indicating the temporal validity of the fact.

Consider the example in Figure 1, adapted from Jensen and Snodgrass [13]. Flight XX025 to Sydney was to leave at 1:00pm and arrive at 3:30pm. This was recorded at 11:15am that morning. At 12:30 a delay of an hour to the flight was reported. At 2:15pm an actual departure time of 1:30pm was recorded, as well as a new estimated time of arrival of 3:30pm. Finally, at 4:00pm the actual arrival time was recorded as 3:45pm.

Figure 1 - Schematic representation of a fact in bitemporal space. The fact (XX023, Sydney) is plotted against transaction-time (11:00am to 4:00pm) and valid-time (1:00pm to 4:00pm).

The introduction of schema versioning affects the composition and the method of retrieval and update of the explicit attributes (in the example above, flight number and destination). Given a bitemporal relation scheme R = (A1, …, An, T), a tuple x, without schema versioning, would have the form (a1, …, an | t). With the introduction of schema versioning, the concept of a non-temporal completed relation scheme C is introduced which contains the minimal union of all explicit attributes which have been defined during the life of the relation. Moreover, the domain of each attribute in C is considered syntactically general enough to hold all data stored under every version of the relation scheme R, and the implicit primary key of C is defined as the maximal set of key attributes for the scheme over time. Versions of the schema (denoted Stime) can be seen to be views of C. Thus the relation scheme active during the interval t1 is Rt1 = (St1, T). The current schema, through which updates may be performed, is denoted Snow.

3.2 View Functions

A view function Vt1 maps C to a subset of the attributes in a schema St1 active during t1. The converse function Wt2 maps from St2 to C. An example is shown in Figure 2.

Figure 2 - Versions of schemata over time. Schema versions St1 to St7 (= Snow) are shown, with view functions such as Vt5 and Wt2 mapping between the versions and the completed scheme C.
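The action of the view functions, mapping a tuple stored under one schema version to the format of another via the completed scheme C, can be sketched as follows. This is an illustrative sketch only: Python is used purely for exposition, and the function names and dictionary representation are ours, not part of any proposed model.

```python
# Illustrative sketch: schema versions are attribute lists; the completed
# scheme C is the union of all attributes ever defined. W maps a tuple
# stored under some version into C (attributes absent from that version
# become None, i.e. null); V projects C onto a target version.

C = ["Id", "Name", "Gender", "Maritalstatus", "Salary"]  # completed scheme

def w_to_completed(tuple_, schema):
    """W: map a tuple stored under `schema` into the completed scheme C."""
    stored = dict(zip(schema, tuple_))
    return {attr: stored.get(attr) for attr in C}  # missing -> None

def v_from_completed(completed_tuple, schema):
    """V: view a completed-scheme tuple through a target schema version."""
    return tuple(completed_tuple[attr] for attr in schema)

# Data stored under the 1 Jan schema, viewed through the 1 Feb schema
# (schemata taken from the Employee example later in this section):
s_jan = ["Id", "Name", "Salary"]
s_feb = ["Id", "Name", "Gender", "Maritalstatus", "Salary"]
row = v_from_completed(w_to_completed((45, "Smith", 1200), s_jan), s_feb)
# row == (45, "Smith", None, None, 1200)
```

Note that W is total on the stored attributes while V simply discards those attributes of C not present in the target version, mirroring equations 1-4 below.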
Thus the data stored during t2 may be mapped to the format specified during t5 through invocation of Vt5(Wt2(St2)). Note that the view functions V and W are syntactic devices only and can be considered as complex casting mechanisms. As an example, consider the following structural history for an Employee relation:

1 Jan, 1995   Relation Employee defined as:
              Id NUM(6), Name CHAR(30), Salary NUM(4)

1 Feb, 1995   Attributes Gender CHAR(1), Maritalstatus CHAR(1) added

1 Mar, 1995   Attribute Maritalstatus deactivated†

1 Apr, 1995   Attribute Salary redefined as: Salary NUM(5,2)

2 Apr, 1995   Attribute Salary redefined as: Salary NUM(5)

The completed scheme for this relation would be:

    Id NUM(6), Name CHAR(30), Gender CHAR(1), Maritalstatus CHAR(1), Salary NUM(5,2)
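The construction of the completed scheme from such a structural history can be sketched mechanically. The widening rule shown (keeping the domain with the larger (precision, scale) pair) is our simplification of the requirement that each domain in C be "syntactically general enough"; it is an assumption, but it reproduces the completed scheme above, including the retention of the deactivated Maritalstatus attribute.

```python
# Sketch: compute the completed scheme C as the minimal union of all
# attributes ever defined, widening NUM(p[,s]) and CHAR(n) domains to the
# maximum (precision, scale) seen over the relation's life. Deactivated
# attributes are retained, since C covers the whole life of the relation.
import re

def widen(d1, d2):
    """Return the more general of two domain strings (simplified rule)."""
    def key(d):
        m = re.match(r"(?:NUM|CHAR)\((\d+)(?:,(\d+))?\)", d)
        return (int(m.group(1)), int(m.group(2) or 0))
    return d1 if key(d1) >= key(d2) else d2

def completed_scheme(history):
    """history: chronological list of (attribute, domain) definitions."""
    c = {}
    for attr, domain in history:
        c[attr] = widen(c[attr], domain) if attr in c else domain
    return c

history = [("Id", "NUM(6)"), ("Name", "CHAR(30)"), ("Salary", "NUM(4)"),
           ("Gender", "CHAR(1)"), ("Maritalstatus", "CHAR(1)"),
           ("Salary", "NUM(5,2)"), ("Salary", "NUM(5)")]
print(completed_scheme(history)["Salary"])   # NUM(5,2)
```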
At all points in time the view functions V and W are available to convert from the stored schema to the completed schema C and then from C to the required schema. Given a completed relation C = (A1, …, An) and a given version S = (Bk, …, Bm), with instance (bk, …, bm | t), a view function Vt can be considered to be composed of attribute view functions (vt1, …, vtn) as follows:

    ∀ Ai ∈ S : vti = Γ                          ... 1
    ∀ Ai ∉ S : vti is undefined                 ... 2

Furthermore, the function Wt is composed of the attribute view functions (wt1, …, wtn) as follows:

    ∀ Bi ∈ S ∧ Ψ(bi, Ai) : wti = Γ              ... 3
    ∀ Bi ∈ S ∧ ¬Ψ(bi, Ai) : wti = ∆ti(Bi)       ... 4

where Γ is one of a set of standard type conversion functions, Ψ(β, α) indicates whether the application of Γ is lossless between each instance of β and the type of α, and ∆t(α) is a default value function for α at t, to be used in cases where Γ is not lossless. For example, consider the following data stored on 1 Apr, 1995 under the example outlined earlier:
    Id    Name     Gender    Salary
    45    Smith    F         12,532.00
    46    Brown    M         12,532.50

† It should be remembered that in a temporal database data may be marked as old but is never deleted (although see Section 4.2). Similarly, in evolving temporal databases, structure is superseded rather than deleted.
If this data was retrieved through S1 Apr, 1995 the application of Γ is lossless. If it was retrieved through S2 Apr, 1995 the application of Γ for the first tuple is lossless but for the second is not. Thus, in the second case, the default value function is used (which in this instance may be a simple truncation or rounding, or may be the signalling of warnings or failures indicating loss of data accuracy). Finally, if this data was retrieved through S2 Jan, 1995 the application of Γ is in both cases not lossless and the default value function would again be used (which in this instance may return some predefined value indicating overflow). In TSQL2 [27], ∆ is defined through a new INAPPLICABLE clause which provides for a simple function restricted to an SQL function of attributes in that relation.
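The interplay of Γ, Ψ and ∆ on this sample data can be sketched as follows. One assumption should be flagged: NUM(p,s) is read here loosely as "values below 10^p with s decimal places"; under a stricter total-digits reading the 1 Apr values would themselves overflow NUM(5,2). The rounding and overflow behaviours shown are only one possible choice of ∆, which the model leaves user-definable.

```python
# Sketch: psi tests whether casting a value into NUM(p,s) is lossless;
# where it is not, a default value function delta is applied (here:
# rounding when only scale is lost, a None sentinel on overflow).
from decimal import Decimal, ROUND_HALF_UP

def psi(value, p, s=0):
    """True iff casting `value` into NUM(p,s) loses no information."""
    v = Decimal(value)
    return v == v.quantize(Decimal(1).scaleb(-s)) and abs(v) < 10 ** p

def delta(value, p, s=0):
    """One possible default-value function for the non-lossless cases."""
    v = Decimal(value)
    if abs(v) >= 10 ** p:
        return None                      # overflow: predefined sentinel
    return v.quantize(Decimal(1).scaleb(-s), rounding=ROUND_HALF_UP)

for salary in ("12532.00", "12532.50"):
    # (p, s) for S_1Apr = NUM(5,2), S_2Apr = NUM(5), S_2Jan = NUM(4):
    for p, s in ((5, 2), (5, 0), (4, 0)):
        print(salary, (p, s), psi(salary, p, s))
```

As in the text: both tuples are lossless through S1 Apr, only the first through S2 Apr, and neither through S2 Jan, where ∆ signals overflow.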
3.3 Restrictions Imposed on the Completed Scheme C

It can be seen that the retrieval of information requires a set of view functions which translate the data from any schema definition St1, through C, to any other schema definition St2. Since schema modification is not generally lossless, updates cannot be guaranteed not to violate key and other constraints. We therefore impose the following rule: the current schema must be an updatable view of the completed schema. This is required so that the current schema is always available for data update. Note that this does not imply that all attributes in C may be updated through Snow; rather, any update through Snow must not result in a violation of the specified functional and foreign key dependencies for the relation and therefore, implicitly, in C. Given that when the database is initially defined Snow ≡ C, and leaving aside relation composition and decomposition for a moment, Snow will be an updatable view unless one of the following has occurred:

a. The domains of the primary key attributes have been reduced.
b. The composition of the primary key of the relation has been reduced.

Note that many of the conditions normally specified for update through views do not apply here, as schema updates can ensure that no existing data contradicts the new structure (for example, the new key definitions) and, under partial schema versioning, data cannot be added under any other schema.
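The two conditions above can be captured as a simple check on the key definitions of C and Snow. This is a sketch under stated assumptions: domains are compared by their (precision, scale) pair as a simplification, and the function name is ours.

```python
# Sketch: S_now remains an updatable view of C unless the primary key has
# been reduced, either in composition (key attributes dropped, condition b)
# or in domain (a key attribute's domain narrowed, condition a).
import re

def domain_key(domain):
    """Order NUM(p[,s]) / CHAR(n) domains by (precision, scale)."""
    m = re.match(r"(?:NUM|CHAR)\((\d+)(?:,(\d+))?\)", domain)
    return (int(m.group(1)), int(m.group(2) or 0))

def updatable(c_key, s_key):
    """c_key, s_key: dicts mapping key attribute -> domain string."""
    if not set(c_key) <= set(s_key):           # condition b: key reduced
        return False
    return all(domain_key(s_key[a]) >= domain_key(c_key[a])
               for a in c_key)                 # condition a: domain reduced

print(updatable({"Id": "NUM(6)"}, {"Id": "NUM(6)"}))          # True
print(updatable({"Id": "NUM(6)"}, {"Id": "NUM(4)"}))          # False
print(updatable({"Id": "NUM(6)", "Dept": "NUM(2)"},
                {"Id": "NUM(6)"}))                            # False
```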
3.4 The Completed Scheme C and Rules for Relation Composition

Schema versioning commonly allows the formation of one or more relations through the composition or decomposition of one or more original relations, q.v. [36, 37]. For example, two relations might be combined horizontally and then decomposed vertically in a database restructuring exercise. In cases where the relation has been composed from the amalgamation of other relations (either vertically or horizontally), the completed relation C must be constructed to apply to two or more sets of historical schemata. Given the maximal nature of C, and the fact that the view functions convert between schemata by converting to and from C, this does not cause further problems. Note that it is the user's responsibility (or the responsibility of a different part of the schema formation mechanism) to ensure that past data does not result in duplicate primary keys†.

For decomposition of relations, the completed relation is duplicated for each of the new relations. However, problems are encountered in the querying of old data where relations have been horizontally decomposed and new data has been added, as it is not clear which of the two old relations (if any) the new data is implicitly a member of‡. Three options are suggested:

a. Disallow queries through historical versions;
b. Assume new data belongs to neither obsolete relation; or
c. Allow the implicit or explicit specification of a relation to resolve the problem during relation construction.

One overarching solution (essentially adhering to a. and b. above) would be to define the life of a relation as beginning at its explicit creation and not to include possible antecedent relations from which it may have been composed. However, given that neither vertical decomposition, nor vertical nor horizontal composition, suffers from this ambiguity, this may be too restrictive a general rule to use. The resolution of this problem is currently the subject of further research, but at present we are investigating the use of the following rule: Data may be retrieved through antecedent relations except in cases where there could be ambiguity of the implied data source.
Thus all schema changes are allowed but some will result in data ceasing to be retrievable from the source relations. This rule also allows for the possible incorporation of a disambiguation function. Finally, note that the completed scheme can be simplified by removing old schema definitions (meta-level vacuuming).
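The rule under investigation can be sketched as follows. The `source_of` membership function is our own device, standing in for whatever implicit or explicit specification mechanism (option c. above) is adopted; where it cannot determine membership, the tuple exhibits the implicit data source problem and ceases to be retrievable through the superseded interface.

```python
# Sketch of the proposed rule: data may be retrieved through a superseded
# (antecedent) relation except where the implied data source is ambiguous,
# i.e. the tuple was added after the restructuring and its membership in
# either source relation cannot be determined.

def retrieve_through(antecedent, tuples, source_of):
    """Return (visible, ambiguous) tuples for the `antecedent` interface.

    source_of(t) returns the antecedent relation a tuple belongs to, or
    None where membership is ambiguous (the implicit data source problem).
    """
    visible, ambiguous = [], []
    for t in tuples:
        src = source_of(t)
        if src is None:
            ambiguous.append(t)          # ceases to be retrievable
        elif src == antecedent:
            visible.append(t)
    return visible, ambiguous

# Hypothetical example: Staff and Casuals were restructured into one
# relation; tuple 3 was added afterwards with no determinable source.
tuples = [{"id": 1, "src": "Staff"}, {"id": 2, "src": "Casuals"},
          {"id": 3, "src": None}]
vis, amb = retrieve_through("Staff", tuples, lambda t: t["src"])
```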
4. Discussion and Future Research Issues

Apart from the data sourcing problem outlined in Section 3.4, many issues remain to be solved, some of which are listed briefly below. Other issues, including query language construction issues, are discussed in more detail in [29].
4.1 Meta-level Transaction Processing

Many schema change requirements involve composite operations, and thus a mechanism for schema-level commit and rollback functions is suggested, which could operate at a higher level than the data-level commit and rollback operations. For example, since data updated to a revised schema may be inapplicable if the schema change is not itself committed, it may not be considered appropriate to allow data to be committed outside the scope of the current session until the schema-level commit is issued. This allows the definition and population of attributes to be completed as one molecular operation. As an example, the following sequence of database operations may be issued to test a program:

    Add attribute(s) to relation;
    Populate attribute(s);
    Data level commit;
    Run test programs;
    If tests successful
        Schema-level-commit;
    else
        Schema-level-rollback;
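The sequence above can be sketched with a minimal in-memory model, in which data-level work performed inside the schema-level scope is held in a shadow copy and only made durable by the schema-level commit. The class and method names are ours and do not reflect any proposed syntax.

```python
# Sketch: a schema-level transaction holding both the schema change and
# its dependent data-level work until a schema-level commit; a schema-
# level rollback discards both together (one molecular operation).
class SchemaTransaction:
    def __init__(self, schema, data):
        self.schema, self.data = schema, data
        self._schema_shadow = dict(schema)            # uncommitted schema
        self._data_shadow = [dict(r) for r in data]   # uncommitted data

    def add_attribute(self, name, domain, default=None):
        self._schema_shadow[name] = domain
        for row in self._data_shadow:
            row[name] = default

    def populate(self, name, values):
        for row, value in zip(self._data_shadow, values):
            row[name] = value         # data-level work inside the scope

    def schema_level_commit(self):
        self.schema.clear(); self.schema.update(self._schema_shadow)
        self.data[:] = self._data_shadow

    def schema_level_rollback(self):
        self._schema_shadow = dict(self.schema)       # discard all work
        self._data_shadow = [dict(r) for r in self.data]

schema = {"Id": "NUM(6)"}
data = [{"Id": 45}, {"Id": 46}]
txn = SchemaTransaction(schema, data)
txn.add_attribute("Gender", "CHAR(1)")    # Add attribute(s) to relation
txn.populate("Gender", ["F", "M"])        # Populate attribute(s)
ok = all(r["Gender"] in ("F", "M") for r in txn._data_shadow)
if ok:                                    # Run test programs
    txn.schema_level_commit()
else:
    txn.schema_level_rollback()
```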
4.2 Schema Vacuuming

In temporal databases the concept of vacuuming allows for the physical deletion of temporal data in cases where the utility of holding the data is outweighed by the cost of doing so [35]. Similar consideration must be given to the retention of old schema definitions, especially in cases where no data exists either adhering to that version (physically) or referring, through its transaction-time values, to the period in which the definition was active. In [36] two pragmatic positions are proposed:

a. All schema definitions which pre-date all data (both in format and in transaction-time values) are to be considered obsolete and should be deleted;
b. Old schema definitions are considered valuable independent of whether data exists and may only be deleted through a special form of vacuuming.
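Position a. can be sketched as a simple predicate over the schema-version catalogue. This is an illustrative model under our own names and time representation: a version is obsolete when no tuple is physically stored under it and no tuple's transaction-time refers to the period in which it was active.

```python
# Sketch of position (a): a schema version pre-dates all data, and may be
# vacuumed, if it is not the physical format of any stored tuple and its
# active period ended before the earliest transaction-time of any tuple.

def obsolete_versions(versions, tuples):
    """versions: {name: (start, end)} active periods (end None = current);
    tuples: [(version_stored_under, transaction_time)] for all data."""
    earliest_tt = min(tt for _, tt in tuples)
    in_use = {v for v, _ in tuples}
    return [name for name, (start, end) in versions.items()
            if name not in in_use
            and end is not None and end <= earliest_tt]

versions = {"S1": (0, 10), "S2": (10, 20), "S3": (20, None)}
tuples = [("S2", 12), ("S3", 25)]
print(obsolete_versions(versions, tuples))   # ['S1']
```

Under position b., by contrast, the same function would only ever be invoked as part of an explicit, special form of vacuuming rather than automatically.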
† We take a pragmatic approach to schema versioning. Schema changes are generally required whether the database system facilitates them or not. Our aim is to provide maximal resilience to schema change, but we advocate providing warnings rather than prohibitions in cases where data loss may occur.
‡ We term this the implicit data source problem.
Acknowledgements

Some of the ideas contained here resulted from work on a temporal extension to the SQL92 standard, TSQL2 [27, 38]. The commentaries giving background discussions on this initiative are
available on-line in the tsql/doc directory at FTP.cs.arizona.edu via anonymous FTP, or in the recently published [38], while further information on this initiative may be obtained from Prof. Richard Snodgrass at the University of Arizona.

References

[1] G. Ariav, A. Beller and H.L. Morgan, "A temporal model". Decision Sciences Dept., University of Pennsylvania, Technical Report DS-WP-82-12-05, 1982.
[2] G. Ariav, "A temporally oriented data model". ACM Trans. Database Syst., vol. 11, no. 4, pp. 499-527, 1986.
[3] J.A. Bubenko, "The temporal dimension in information modelling", in Architecture and Models in Data Base Management Systems, Nijssen, G. (ed.), North-Holland, Amsterdam, 1977.
[4] J. Clifford and G. Ariav, "Temporal data management: models and systems", in New Directions for Database Systems, Ch. 12, Norwood, NJ: ABLEX Publishing Co., pp. 168-185, 1986.
[5] S.K. Gadia, "Towards a multihomogeneous model for a temporal database", in Proc. Second IEEE International Conference on Data Engineering, Los Angeles, CA, IEEE Computer Society Press, pp. 390-397, 1986.
[6] S.K. Gadia, "A homogeneous relational model and query languages for temporal databases". ACM Trans. Database Syst., vol. 13, no. 4, pp. 418-448, 1988.
[7] N.A. Lorentzos and R.G. Johnson, "TRA: a model for a temporal relational algebra", in Proc. IFIP TC 8/WG 8.1 Working Conference on Temporal Aspects in Information Systems, Sophia-Antipolis, France, Elsevier Science Publ. (North-Holland), Amsterdam, pp. 95-108, 1988.
[8] A. Segev and A. Shoshani, "Logical modelling of temporal data". SIGMOD Rec., vol. 16, pp. 454-466, 1987.
[9] J.F. Roddick and J.D. Patrick, "Adding temporal support to the RM/T model", in Databases in the 1990s, Srinivasan, B. and Zeleznikow, J. (eds.), World Scientific, pp. 45-57, 1991.
[10] A.U. Tansel, "Modelling temporal data". Inf. Softw. Technol., vol. 32, no. 8, pp. 514-520, 1990.
[11] S.B. Navathe and R. Ahmed, "Temporal relational model and a query language". Inf. Sci., vol. 49, pp. 147-175, 1989.
[12] R. Elmasri and G.T.J. Wuu, "A temporal model and query language for ER databases", in Proc. Sixth IEEE International Conference on Data Engineering, Los Angeles, CA, IEEE Computer Science Press, pp. 76-83, 1990.
[13] C.S. Jensen and R. Snodgrass, "The TSQL2 data model". University of Arizona. Also available as a TSQL2 commentary, TSQL2 language design committee, TempIS document 37, 1992.
[14] L.E. McKenzie and R.T. Snodgrass, "Evaluation of relational algebras incorporating the time dimension in databases". ACM Comput. Surv., vol. 23, no. 4, pp. 501-543, 1991.
[15] J. Banerjee, H.-T. Chou, H.J. Kim and H.F. Korth, "Semantics and implementation of schema evolution in object-oriented databases". ACM SIGMOD International Conference on Management of Data, SIGMOD Rec., vol. 16, no. 3, pp. 311-322, 1987.
[16] P. Klahold, G. Schlageter and W. Wilkes, "A general model for version management in databases", in Proc. Twelfth International Conference on Very Large Databases, Kyoto, Morgan Kaufmann, Palo Alto, CA, 1986.
[17] L.E. McKenzie and R.T. Snodgrass, "Schema evolution and the relational algebra". Inf. Syst., vol. 15, no. 2, pp. 207-232, 1990.
[18] J.F. Roddick, "Schema evolution in database systems - an updated bibliography". School of Computer and Information Science, University of South Australia, Technical Report CIS-94-012, 1994. An earlier version appeared in SIGMOD Rec., vol. 21, no. 4, pp. 35-40, 1992.
[19] Laguna Beach Report, "Future directions in DBMS research". SIGMOD Rec., vol. 18, no. 1, pp. 17-26, 1989.
[20] P.G. Selinger, "Predictions and challenges for database systems in the year 2000", in Proc. Nineteenth International Conference on Very Large Databases, Dublin, Ireland, Morgan Kaufmann, Palo Alto, CA, pp. 667-675, 1993.
[21] M. Stonebraker, R. Agrawal, U. Dayal, E.J. Neuhold and A. Reuter, "DBMS research at the crossroads: the Vienna update", in Proc. Nineteenth International Conference on Very Large Databases, Dublin, Ireland, Morgan Kaufmann, Palo Alto, CA, pp. 688-692, 1993.
[22] W. Kim, "Object-oriented systems: promises, reality and future", in Proc. Nineteenth International Conference on Very Large Databases, Dublin, Ireland, Morgan Kaufmann, Palo Alto, CA, pp. 676-687, 1993.
[23] M. Stonebraker and G. Kemnitz, "The Postgres next generation database management system". Commun. ACM, vol. 34, no. 10, pp. 78-92, 1991.
[24] M. Stonebraker and L. Rowe, "The design of POSTGRES", in Proc. ACM SIGMOD International Conference on Management of Data, Washington, ACM, pp. 340-355, 1986.
[25] O. Deux, "The O2 system". Commun. ACM, vol. 34, no. 10, pp. 34-48, 1991.
[26] C. Lamb, G. Landis, J. Orenstein and D. Weinreb, "The ObjectStore database system". Commun. ACM, vol. 34, no. 10, pp. 50-63, 1991.
[27] R.T. Snodgrass, et al., "TSQL2 language specification". SIGMOD Rec., vol. 23, no. 1, pp. 65-86, 1994.
[28] C. Jensen, et al., "A consensus glossary of temporal database concepts". SIGMOD Rec., vol. 23, no. 1, pp. 52-64, 1994. Also Technical Report R93-2035, Department of Mathematics and Computer Science, Aalborg University, Denmark, November 1993.
[29] J.F. Roddick, "A survey of schema versioning issues for database systems". Inf. Softw. Technol., vol. 37, no. 7, pp. 383-393, 1995.
[30] R.J. Miller, Y.E. Ioannidis and R. Ramakrishnan, "The use of information capacity in schema integration and translation", in Proc. Nineteenth International Conference on Very Large Databases, Dublin, Ireland, Morgan Kaufmann, Palo Alto, CA, pp. 120-133, 1993.
[31] M.E. Orlowska and C.A. Ewald, "Meta-level updates: the evolution of fact-based schemata". Key Centre for Software Technology, Department of Computer Science, University of Queensland, Technical Report 211, 1991.
[32] M.E. Orlowska and C.A. Ewald, "Schema evolution - the design and integration of fact-based schemata", in Research and Practical Issues in Databases, Third Australian Database Conference, Srinivasan, B. and Zeleznikow, J. (eds.), La Trobe University: World Scientific, pp. 306-320, 1992.
[33] C.A. Ewald and M.E. Orlowska, "A theoretical approach to the understanding of relational schema evolution". Key Centre for Software Technology, University of Queensland, Technical Report 287, 1994.
[34] J. Geller, Y. Perl, E. Neuhold and A. Sheth, "Structural schema integration with full and partial correspondence using the dual model". Inf. Syst., vol. 17, no. 6, pp. 443-464, 1992.
[35] C.S. Jensen, "Vacuuming in TSQL2". TSQL2 language design committee, TSQL2 Commentary, 1993.
[36] J.F. Roddick and R.T. Snodgrass, "Schema versioning support", in The TSQL2 Temporal Query Language, Snodgrass, R.T. (ed.), Boston: Kluwer Academic Publishing, Ch. 22, pp. 427-449, 1995.
[37] B. Shneiderman and G. Thomas, "An architecture for automatic relational database system conversion". ACM Trans. Database Syst., vol. 7, no. 2, pp. 235-257, 1982.
[38] R.T. Snodgrass (ed.), The TSQL2 Temporal Query Language, New York: Kluwer Academic Publishing, 1995.