A Specification for a Temporal Query System

Martin J. O’Connor, Samson W. Tu, and Mark A. Musen
Stanford Medical Informatics
251 Campus Drive, Medical School Office Building (MSOB), X-215
Stanford University, Stanford, CA 94305-5479
Email: {moconnor, tu, musen}@smi.stanford.edu

Corresponding author:
Martin J. O’Connor
Stanford Medical Informatics
251 Campus Drive, Medical School Office Building (MSOB), X-267
Stanford University, Stanford, CA 94305-5479
Voice: +1-(650) 345-6063
Fax: +1-(650) 725-7994

Abstract

This document outlines a specification for a temporal query system called Chronus II. The design of Chronus II has been influenced by the original Chronus system, which was written at Stanford University by Amar Das, and by the TSQL2 temporal query specification, which was written by Richard Snodgrass at the University of Arizona. We first outline the general features that we would like Chronus II to provide. We follow this discussion with a detailed review of the temporal model that will serve as the underpinning of our implementation. We then describe the steps that we will take to provide this implementation.

1. Introduction

Our temporal query system must meet the following core requirements:

Support as full a subset of SQL as possible
Ideally, Chronus II should support any SQL query that conforms to the SQL-92 standard. Given that there are no complete implementations of this standard in existence, support for a significant subset of SQL should suffice. The original Chronus system [1] supported a very small subset of SQL. For example, it did not allow table creation or subqueries. Non-temporal tables were also not allowed.

Standards conformance
Maintaining compatibility with future database standards is desirable. While there are no standards for temporal databases per se, parts of the TSQL2 specification [2] have influenced the proposed SQL3 standard, which has a number of temporal extensions [3]. We would like the temporal model and query language we propose to maintain as close a mapping as possible to this standard.

Increase the power of the original Chronus query language
The original Chronus query language had a number of shortcomings that made concise query expression difficult. Many temporal queries had to be expressed as two or more subqueries. Chronus II should allow concise query expression.

The implementation must be portable
Chronus was designed to operate above a relational DBMS using ODBC as its access layer. This approach dramatically increases portability. Chronus II will also use this approach. To further increase its portability, it will be written in Java.

An additional facet of portability is query language conformance with existing SQL standards. We would like our query language to be compatible with standard SQL. Ideally, it will be a superset of SQL, so that all valid SQL queries are also valid Chronus II queries. At the very least, our query language should support a significant subset of existing SQL features. If we are careful when defining our query language, this requirement should not be too difficult to meet. If we specify our query language using localized, well-defined extensions to standard SQL, we should be able to implement it using a pre- and post-processor that sits outside a DBMS. With this approach, most of the non-temporal components of a query can be dealt with by the DBMS. Queries with no temporal operations can be passed through and dealt with directly by the underlying system.

Additional crucial features are:

Support for indeterminacy
In many domains, the time of certain events is not known with certainty. We would like to support the ability to deal with this uncertainty when working with temporal information.

Multiple calendar support
The standard Gregorian calendar is pervasive in SQL. All temporal operations are expressed in terms of this calendar. In some domains, however, the ability to perform temporal operations with user-defined calendars may be desirable. For example, we may wish to define a calendar that is tailored to specific treatment regimes or diseases. We would like to be able to easily specify temporal queries using the new calendar. We would also like to be able to interact seamlessly with the standard calendar, and with other user-defined calendars.

Joins
Joining temporal tables is non-trivial, and any system that includes temporal information must rigorously define the semantics of such joins.

Transaction time support
Few vendors presently support transaction time in their DBMSs. However, transaction time can be a useful feature in some domains.
This document outlines an approach to implementing transaction time support in Chronus II. Because of the significant implementation effort, however, the addition of transaction time support in the near term is not likely. For present purposes, the crucial point is that our design does not preclude its future addition to the system.

Truth maintenance
Currently, Chronus interacts closely with a temporal abstraction system called RÉSUMÉ [4]. Chronus uses RÉSUMÉ to perform temporal abstractions and uses these abstractions when evaluating queries. These abstractions are not cached and are regenerated for each query. If such abstractions were cached, we would need a truth maintenance scheme to ensure that they remain valid when new data arrive, because such data may invalidate previous abstractions. We may need support from Chronus II to do this truth maintenance.

‘What If’ queries
Many decisions are made with incomplete information. In some cases, we might wish to make some assumptions about this incomplete information and perform hypothetical queries using these assumptions. In other words, we would like to posit ‘what if’ queries by temporarily inserting information into the database and making decisions using these new data. In other cases, we may have made decisions in the past using incomplete information and would like to use information received at a later date to retroactively reevaluate earlier decisions. We would like Chronus II to facilitate both of the above types of queries by efficiently supporting the temporary positing of data and also by providing the ability to alter the system’s concept of ‘now’ when evaluating queries.

Hierarchical queries
Significant hierarchical relationships may exist between attributes in a database. We would like our query language to capture these relationships. For example, we might like a query to understand that an entity is a type of something else. To do this, we need to model hierarchical relationships between attributes in our database schema.

Backward compatibility
Backward compatibility with the original Chronus is desirable.
There are two aspects to this compatibility: compatibility at the schema level and query language compatibility. Compatibility at the schema level is desirable because of the significant amount of temporal data that already exists in our databases. The design specified in this document satisfies this requirement. Compatibility at the query language level was not felt to be necessary. One of the primary motivations of Chronus II is to increase the expressiveness of the original query language, so this incompatibility is essentially unavoidable.

This document outlines a design that addresses the above requirements. To provide some background for subsequent discussion, we provide a brief overview of the current Chronus system and the TSQL2 specification. We follow this overview with a detailed description of the temporal model that we intend to use in the next incarnation of Chronus. This model serves as a basis for the design of Chronus II. The subsequent sections give detailed design specifications for the primary Chronus II language features. These features are defined in terms of our temporal model. The final section outlines an implementation strategy for Chronus II.

2. Chronus and TSQL2

This section provides a brief overview of both the current Chronus system and the TSQL2 specification.

Chronus

Chronus is a temporal database query system developed at Stanford by Amar Das. It is based on the relational model and uses a query language that is an extension of SQL to perform temporal queries. This query language is called TimeLineSQL (TLSQL). Its temporal model assumes that every tuple in the database contains information specifying the period of its validity. This information is stored in two columns labeled START_TIME and STOP_TIME. For example, a temporal table called OCCUPATION that holds information about employees and their job titles could look as follows:

Name   Title        START_TIME       STOP_TIME
Fred   Mechanic     June 10, 1995    May 12, 1996
Fred   Supervisor   May 13, 1996     December 31, 1996
Fred   Mechanic     Jan 1, 1997      April 10, 1997
Joe    Mechanic     April 1, 1994    February 12, 1997

The TLSQL query language can then be used to perform temporal queries on this table.

A TLSQL retrieval statement may contain the following clauses:

    SELECT [FIRST | SECOND | .. | LAST] [CONCATENATED] select_item_commalist
    FROM table_name_commalist
    [WHEN table_name [[NOT] CONCURRENT WITH] table_name | temporal_comparison_commalist]
    [WHERE search_condition_commalist]

TLSQL adds an additional WHEN clause to SQL’s SELECT statement. This clause is used to express the temporal part of a query. The operations in the WHEN clause may be simple direct comparisons using temporal information contained in the timestamp columns. The operations may also be one of several temporal functions provided by TLSQL. Examples of such temporal functions include DURATION, BEFORE, and AFTER. The FIRST, SECOND, and LAST keywords in the SELECT clause allow us to choose specific rows from the result of a query based on their temporal ordering.

For example, using the above OCCUPATION table, the query “List those employees that were supervisors for longer than 30 days” can be specified in TLSQL as:

    GRAIN day
    SELECT Name
    FROM OCCUPATION
    WHERE Title = 'Supervisor'
    WHEN DURATION(OCCUPATION.START_TIME, OCCUPATION.STOP_TIME) > 30

The GRAIN clause specifies the query’s temporal granularity. In the above case, we are dealing with a granularity of days. The DURATION function can then be used to calculate the time difference between the start and stop timestamps using this granularity. TLSQL provides several other temporal functions. These include:

    BEFORE, UNTIL, LEADS, STARTS, EQUALS, DURING, SPANS, FINISHES, LAGS, FROM, AFTER, OVERLAPS, ADJACENT

Other queries could include “List those employees that have worked continuously for longer than one year”:

    GRAIN day
    SELECT CONCATENATED Name INTO #tmp
    FROM OCCUPATION

    GRAIN day
    SELECT #tmp.Name
    FROM #tmp
    WHEN DURATION(#tmp.START_TIME, #tmp.STOP_TIME) > 365

The CONCATENATED operator merges value-equivalent tuples. For example, the above projection using the ‘Name’ field will generate the following table:

Name   START_TIME       STOP_TIME
Fred   June 10, 1995    April 10, 1997
Joe    April 1, 1994    February 12, 1997
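The behavior of CONCATENATED and DURATION in the two queries above can be sketched in Python. This is only an illustration under stated assumptions, not the Chronus implementation: the function names `concatenated` and `duration_days` are hypothetical, periods are treated as closed at day granularity, two periods are taken to be adjacent when one starts the day after the other stops, and DURATION is taken to be the plain difference in days.

```python
from datetime import date, timedelta
from itertools import groupby

# Rows of the OCCUPATION table: (name, title, start, stop), closed periods in days.
OCCUPATION = [
    ("Fred", "Mechanic", date(1995, 6, 10), date(1996, 5, 12)),
    ("Fred", "Supervisor", date(1996, 5, 13), date(1996, 12, 31)),
    ("Fred", "Mechanic", date(1997, 1, 1), date(1997, 4, 10)),
    ("Joe", "Mechanic", date(1994, 4, 1), date(1997, 2, 12)),
]

ONE_DAY = timedelta(days=1)  # assumed granularity (GRAIN day)


def concatenated(rows, key):
    """Project each row onto key(row) and merge overlapping or adjacent periods."""
    projected = sorted((key(r), r[2], r[3]) for r in rows)
    result = []
    for value, group in groupby(projected, key=lambda t: t[0]):
        merged = []
        for _, start, stop in group:
            # Merge when the next period overlaps or starts the day after the last stop.
            if merged and start <= merged[-1][1] + ONE_DAY:
                merged[-1][1] = max(merged[-1][1], stop)
            else:
                merged.append([start, stop])
        result.extend((value, s, e) for s, e in merged)
    return result


def duration_days(start, stop):
    """Assumed reading of DURATION at day granularity: difference in days."""
    return (stop - start).days


# "Supervisors for longer than 30 days" operates on the original rows.
supervisors = [n for n, t, s, e in OCCUPATION
               if t == "Supervisor" and duration_days(s, e) > 30]

# "Worked continuously for longer than one year" needs the concatenated projection.
tmp = concatenated(OCCUPATION, key=lambda r: r[0])
long_serving = [n for n, s, e in tmp if duration_days(s, e) > 365]
```

The intermediate `tmp` list plays the role of the #tmp table: the duration test is applied to the merged periods rather than to the periods of the original rows.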

The temporary table is needed in this example because the WHEN clause in Chronus operates on the time elements in the original table, not on the projected and concatenated table. This is a serious shortcoming because most temporal queries require operations that use projection and concatenation.

Another significant limitation of Chronus is that its temporal model is incompatible with the standard relational model. The model requires that all tuples have START_TIME and STOP_TIME attributes. Thus non-temporal tables are not supported. Other limitations include:

- No more than two tables can be specified in a FROM clause.
- A very small subset of standard SQL is supported. For example, table creation is not supported; subqueries are also not allowed.

The above two restrictions are related to the current implementation and are not a limitation of the underlying model. However, the more significant restriction on non-temporal tables is related to this model. The model does not allow interoperation with standard relational tables. We would like to see this restriction removed in Chronus II because it limits the usefulness of the system.

TSQL2

The TSQL2 specification was developed at the University of Arizona by Richard Snodgrass. It supports both temporal and non-temporal tables, and its query language is compatible with standard SQL. Part of the TSQL2 specification may be included in the SQL3 standard under the SQL/Temporal proposal. No complete implementations of the TSQL2 specification exist. A TSQL2 temporal table has a column indicating the valid time of each tuple.

Name   Title      ValidTime
Fred   Mechanic   {[June 10, 1995 - May 12, 1996], [Jan 1, 1997 - April 10, 1997]}
Joe    Mechanic   [April 1, 1994 - February 12, 1997]

TSQL2 will concatenate value-equivalent tuples in projections over temporal tables. For example, the query:

    SELECT Name
    FROM OCCUPATION

will generate the following table:

Name   ValidTime
Fred   [June 10, 1995 - April 10, 1997]
Joe    [April 1, 1994 - February 12, 1997]

We can also restrict our queries to particular intervals:

    SELECT Name
    VALID PERIOD '[4/6/95-now]'
    FROM OCCUPATION

Our previous query, “List those employees that have worked for longer than one year”, can be written in TSQL2 as follows:

    SELECT SNAPSHOT Name
    FROM OCCUPATION(Name) AS O1
    WHERE CAST(VALID(O1) AS INTERVAL YEAR) > INTERVAL '1' YEAR

The SNAPSHOT keyword tells the query to return a table without a valid-time column, i.e., a non-temporal table. If we would like to see who has worked continuously for longer than one year, we can do the following:

    SELECT SNAPSHOT Name
    FROM OCCUPATION(Name)(PERIOD) AS O1
    WHERE CAST(VALID(O1) AS INTERVAL YEAR) > INTERVAL '1' YEAR

The PERIOD keyword tells TSQL2 not to merge value-equivalent tuples so that we can look at the constituent maximal periods of the timestamps.

The TSQL2 specification is large and complex, and no production-quality implementations exist. Two partial prototype implementations have been developed for subsets of the query language, but these are not reusable outside their development environments. However, the specification does contain a very detailed description of a temporal query model. This model supports indeterminacy, multiple calendars, and transaction time. We intend to adopt this model in Chronus II. We describe this model in the next section; subsequent sections use this model as a basis for describing the extended temporal support that Chronus II will provide.

3. Temporal Model

This section outlines a conceptual temporal model that serves as the underpinning of our design of Chronus II. This model is described by Richard Snodgrass in “The TSQL2 Temporal Query Language” [2]. We first describe the model in general terms, and we follow this discussion with a more detailed description of its core concepts. We then outline an approach to representing these concepts in an implementation.

Time

Time has a standard geometric metaphor: time itself can be compared to a line; a point on this line is called an instant; the time between two instants is known as a time period; and a length, or unanchored segment, of the time-line is called an interval. In general, three basic models of time can be chosen:

- the continuous model, where time is viewed as being isomorphic to the real numbers;
- the dense model, where time is seen as being isomorphic to the rationals; and
- the discrete model, with time being seen as isomorphic to the integers.

Typically the discrete model will be chosen because it is most efficiently modeled on a discrete computing device. This choice is, however, not crucial (apart from efficiency considerations) and does not limit the power of the temporal model. Times are stored in a structure called a timestamp. Typically, a bounded time model will be chosen because an unbounded timestamp representation would be impractical. For example, SQL-92's calendar can only represent time from 1 A.D. to 9999 A.D. In any case, we usually operate on time as though it has some sort of bounds, so this model is not restrictive in practice. Times on this bounded time segment are obtained by reading a time-line clock, the units of which are referred to as time-line chronons. The time-line is normally partitioned into a finite set of smaller segments called granules. The partitioning function that maps time-line clock chronons into granules is referred to as the granularity. The smallest possible granularity is the time-line clock chronon; the largest granularity is the entire (bounded) length of time. An instant is a point on a time-line, as opposed to granules, which are segments on the time-line. A period is the time between two instants. An interval is an unanchored duration of time; it will have a known length but no start or stop instants. A datetime is an isolated point in time known to have a given granularity.

The following are examples of the use of datetimes, intervals, and periods:

Datetime   When did the patient visit the clinic?
Interval   How long was the patient in the clinic?
Period     Did the patient visit the clinic between October and November?

Times on the time-line clock are related to more familiar temporal descriptions using calendars. The most familiar calendar would be the Gregorian calendar (a variant of which is used by SQL), though many others are possible, including user-defined calendars. An instant timestamp records that an instant is located sometime during a particular granule, but we often do not know the exact granule during which it is located. Such an instant is termed indeterminate. Periods and intervals may also be indeterminate. The exact start and stop times of an indeterminate period are not reliably known. Similarly, the length (in time) of an indeterminate interval is not precisely known. The following sections give a more detailed definition of some of the terms outlined above.

Granularity

Granularity is the unit of measure for a temporal datum. The most familiar granularities that we deal with are seconds, minutes, days, weeks, and so on. Abstractly, a granularity may be seen as a partitioning function that maps time-line clock chronons into granules. The smallest possible granularity is that of time-line clock chronons. The largest possible granularity is ‘all of time’, the entire time-line considered as a single granule. A standard assumption is that the set of granules is well ordered within a given granularity. If we assume that we are using a bounded time model, then we can use beginning and forever as two special values in this ordering. These values exist just outside the closed period of time. They represent the lowest and greatest values, respectively, in the ordering. Operationally, they are treated as minus and plus infinity. These special values will typically be used extensively in comparisons involving time. The model may also use a distinguished time called now to refer to the current time. ‘Now’ is a special kind of datetime, rather than an interval or period.

Indeterminacy

Granularity and indeterminacy are related features of temporal data. Granularity is the unit of measure for a temporal datum, while indeterminacy represents partial information about finer units of measure. For example, an instant known at the granularity of an hour has an hour-long period of indeterminacy. We do not know, for example, in what second or minute the instant occurred. In other words, an instant measured at a particular granularity is indeterminate at all finer granularities.
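The view of a granularity as a partitioning function, and the hour-long period of indeterminacy described above, can be illustrated with a short Python sketch. It assumes a time-line clock chronon of one second counted from an arbitrary origin, and it covers only regular granularities; the names `granule_of` and `indeterminacy` are hypothetical and do not come from the TSQL2 specification.

```python
# Assumption: the chronon is one second, counted from an arbitrary origin (chronon 0).
CHRONONS_PER_GRANULE = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def granule_of(chronon, granularity):
    """Partitioning function: map a chronon to the index of the granule containing it."""
    return chronon // CHRONONS_PER_GRANULE[granularity]


def indeterminacy(granule, granularity, finer):
    """Closed range of granules at the finer granularity covered by one coarser granule."""
    ratio = CHRONONS_PER_GRANULE[granularity] // CHRONONS_PER_GRANULE[finer]
    first = granule * ratio
    return first, first + ratio - 1


chronon = 5 * 3600 + 42 * 60 + 7                   # 05:42:07 after the origin
hour = granule_of(chronon, "hour")                 # the instant as known at hour granularity
lo, hi = indeterminacy(hour, "hour", "second")     # seconds in which it may have occurred
```

Knowing only `hour`, we can say that the instant lies somewhere between `lo` and `hi` at the granularity of seconds, which is the sense in which an instant measured at one granularity is indeterminate at all finer granularities.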

Instants

An instant is a point on the time-line, whereas a granule is a segment (however short) on the time-line. An instant timestamp records that an instant is located in a particular granule. In many cases, however, we may not know the exact granule in which the event occurred; we may only know that the instant occurred some time during a range of granules. We term this instant an indeterminate instant. A determinate instant is a special case of an indeterminate instant with a one-granule period of indeterminacy. In the abstract, all instant timestamps are indeterminate in that, beyond the resolution of the finest granularity (which will be the resolution of the time-line clock), we can never know with absolute certainty when they occurred. This uncertainty is representative of the real world in that absolute precision is impossible: our measuring equipment will always have a limited resolution.

Calendars

A calendar is a human abstraction of some portion of physical time. Thus, every calendar has an origin, which describes the point before which the calendar is not valid, and an end point, which describes the point after which the calendar is not valid. For example, the SQL calendar uses a variant of the Gregorian calendar with an origin of 1 A.D. Thus, the SQL calendar is unable to represent times before this date. Within each calendar will be a finite set of granularities. For example, the Gregorian calendar supports granularities of seconds, minutes, hours, days, weeks, months, years, and decades; the Business calendar supports granularities of years, quarters, weeks, and days. The SQL-92 calendar supports the granularities of year, month, day, hour, minute, second, and fractions of a second. A calendar is responsible for defining mappings between the different granularities that it contains. The Business calendar, for example, must be able to map between quarters and days.
To perform this mapping, the calendar will first map from the coarsest granularity to successively finer granularities until it reaches the target granularity. Some mappings are straightforward or regular. The conversion of minutes to seconds is an example of a regular mapping. Other mappings are more complex and are termed irregular. The conversion of days to months would be an example of an irregular mapping. Irregular mappings are typically described by functions. The relationship between granularities within a calendar is usually represented using a lattice. Each connecting arrow in the lattice describes the mapping between the granularities at each end of the arrow. Granularities are calendar-dependent but they may be related by a single multi-calendar granularity lattice. In other words, different calendars may share the same granularities.
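The difference between regular and irregular mappings can be made concrete with a small Python sketch. It is a simplification under stated assumptions: leap seconds are ignored, the irregular mapping takes a year and a day-of-year number rather than a granule index, and the function names and the `LATTICE` dictionary are hypothetical. The standard-library call `calendar.monthrange` supplies the Gregorian month lengths.

```python
import calendar


def minutes_to_seconds(minutes):
    """Regular mapping: every minute holds the same number of seconds (no leap seconds)."""
    return minutes * 60


def day_of_year_to_month(year, day_of_year):
    """Irregular mapping: month lengths differ and February depends on leap years."""
    remaining = day_of_year
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        if remaining <= days_in_month:
            return month
        remaining -= days_in_month
    raise ValueError("day_of_year is out of range for this year")


# A toy lattice: each edge (source granularity, target granularity) names its mapping.
LATTICE = {
    ("minute", "second"): minutes_to_seconds,
    ("day", "month"): day_of_year_to_month,
}
```

A regular mapping reduces to multiplication or division by a constant, whereas the irregular mapping has to consult the calendar: day 60 falls in February in 1996 but in March in 1997.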

In some cases, we may wish to specify calendars that are tailored to particular domains. For example, in the diabetes domain, we are frequently interested in the cycle of hours within a day, and are less interested in absolute time. Similarly, for other domains, we may be more concerned with the cycle of weeks within a month.

Modeling Time

This section outlines an approach for the representation of some of the temporal concepts that we have just discussed.

Instants

An instant is a point on the time-line and may be modeled with an instant timestamp. For example, the DATETIME data type in SQL can be used to hold an instant timestamp.

Periods

We can model periods using a period timestamp. Each period timestamp is composed of two instant timestamps, which are assumed to have the same granularity. This granularity is also the granularity of the associated period. Null periods are allowed, so the start and end timestamps may be equal.

Intervals

Intervals are modeled by an interval timestamp. An interval timestamp is a count of granules and is assumed to have a particular granularity. The count may be positive (representing time heading into the future) or negative (representing time heading into the past).

Temporal Tables

Conceptually, we may represent a temporal table by adding a valid-time column to a non-temporal table. This column holds the set of valid periods for each tuple in the table. For example, a temporal table representing a history of employees and their job titles could look as follows:

Name   Title      ValidTime
Fred   Mechanic   {[June 10, 1995 - May 12, 1996], [Jan 1, 1997 - April 10, 1997]}
Joe    Mechanic   [April 1, 1994 - February 12, 1997]

Value-equivalent tuples are not allowed by the data model, so any rows with the same non temporal information will automatically be coalesced by the system. Coalescing will ensure that all overlapping or adjacent intervals for tuples with the same values are merged and that suitable new intervals are generated.

The above representation is conceptual; it will probably not be the appropriate representation at the physical database level. However, the above representation (or something very close to it) should be what a user of the temporal query system sees.

Granularity

In SQL, granularities are referenced when specifying the schema. In the SQL-92 standard, these are: DATE (year, month, day), TIME (hour, minute, second, and, optionally, fractions of a second), TIMESTAMP (a DATE concatenated with a TIME), INTERVAL (duration of time), TIME WITH TIME ZONE (a TIME concatenated with an INTERVAL HOUR TO MINUTE), and TIMESTAMP WITH TIME ZONE. With all but DATE one can specify the number of fractional digits. The smallest granularity in a system will be the granularity of the time-line clock chronon. All other granularities will ultimately be described in terms of this granularity. As described above, a calendar is responsible for defining the granularities that it contains and for defining the mapping functions from one granularity to another. These mapping functions will be exercised regularly because the mixing of temporal data at different granularities in a single database is common. This mixing creates various problems:

- What are the semantics of operations with operands at different granularities?
- Can times at different granularities be ordered?
- Can times be converted from one granularity to another?
- How expensive are temporal operations at mixed granularities?
- Can multiple granularities be stored as efficiently as a single granularity?

We will return to these issues in a later section.

Indeterminacy

We would like the ability to capture the uncertainty associated with some temporal information. A temporal uncertainty model will depend on the type of uncertainty that we are trying to describe. The type of uncertainty that we have to deal with typically involves events that are known to have happened; the exact time of their occurrence is usually what is in question. A model that describes this type of uncertainty is outlined by Curtis Dyreson and Richard Snodgrass in “Supporting Valid-Time Indeterminacy” [5]. This model represents an indeterminate instant with three components: a lower support, an upper support, and a probability mass function (p.m.f.). While we may not know exactly when the event occurred, we do know that it did not happen before the lower support; similarly, we know that it did not happen after the upper support. The supports have a particular granularity and represent the lower and upper bound on when an indeterminate instant is known to have occurred, respectively. Between the supports lies a period of indeterminacy, which identifies a set of possible instants in which the event occurred. The p.m.f. describes the distribution of probabilities in this set of possible instants. Periods and intervals may also be indeterminate and can be modeled using a similar approach. We will consider indeterminacy in detail in a later section.

Calendars

Calendars define granularities. The Gregorian calendar provided by SQL defines the granularities of decades, years, months, fortnights, weeks, days, hours, minutes, seconds, and fractions of a second. At present, this calendar is the only one supported in SQL-92’s temporal model. To support multiple coexisting calendars, we envision that a calendar will be defined in a specification file. These specification files can be provided by a database vendor, third-party suppliers, or local database administrators, or may perhaps be defined by the user. The calendar specification file must provide a full definition of all the granularities it contains. Each granularity will be described as a further partitioning of some other granularity using either a regular or an irregular mapping. Ultimately, the granularities will map to the finest granularity provided by the DBMS, which will be the resolution of the time-line clock. Anchor points must also be given for each granularity. In essence, a calendar is a collection of mapping functions that are to be linked with the DBMS and the specification file. If we are to have multiple coexisting calendars in a DBMS, we will quickly want a means of converting a time representation in one calendar to a time representation in another. We will go into further detail on this issue, in addition to discussing both granularity and calendar representation, in later sections.
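The three-component representation of an indeterminate instant can be sketched as a small Python data structure. This is an illustration only, not the representation proposed in [5]: the class name, the `uniform` helper, and the `prob_before` method are hypothetical, granules are identified by integer indices, and the p.m.f. is stored as one exact probability per granule between the supports.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class IndeterminateInstant:
    lower: int   # lower support: the event did not happen before this granule
    upper: int   # upper support (inclusive): the event did not happen after this granule
    pmf: tuple   # one probability for each granule in lower..upper

    def __post_init__(self):
        if len(self.pmf) != self.upper - self.lower + 1:
            raise ValueError("pmf needs one entry per granule between the supports")
        if sum(self.pmf) != 1:
            raise ValueError("pmf must sum to 1")

    def prob_before(self, granule):
        """Probability that the instant lies strictly before the given granule."""
        granules = range(self.lower, self.upper + 1)
        return sum(p for g, p in zip(granules, self.pmf) if g < granule)


def uniform(lower, upper):
    """Indeterminate instant with equal probability for every possible granule."""
    n = upper - lower + 1
    return IndeterminateInstant(lower, upper, tuple(Fraction(1, n) for _ in range(n)))


visit = uniform(10, 13)      # occurred in one of granules 10, 11, 12, or 13
exact = uniform(5, 5)        # a determinate instant: one-granule period of indeterminacy
```

With this structure, a temporal predicate such as "before" yields a probability rather than a plain truth value, and a determinate instant is simply the case in which the two supports coincide.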

4. Granularity Conversion

Regular and irregular mapping functions are provided by calendars to convert between granularities. Some irregular mappings may be quite complex. For example, a mapping from days to months must accommodate months that have different numbers of days, and must also consider leap years. In some calendars, there may even be an irregular mapping from minutes to seconds to account for leap second insertions. The calendar must also specify an anchor point for each granularity it defines. At first sight, we would expect all granularities to share the same anchor point, namely the earliest time defined by the associated calendar. However, in many cases this assumption proves to be false.

For example, the origin of the Gregorian calendar is midnight, January 1, A.D. 1. If we adopted this time as the anchor point of the decade granularity, then the 10 years between 1971 and 1980, inclusive, would be a decade, rather than the 10 years between 1970 and 1979, inclusive. The latter, and not the former, fits the common definition of decade. Consequently, we cannot adopt the Gregorian calendar origin as the anchor point for decades. Instead, we must pick a decade boundary as our anchor point. An example anchor point would be the start of the year A.D. 1000. When anchor points differ between two granularities, semantic checking is performed to determine if the anchor point of the coarser granularity is on a granule boundary of the finer granularity.

The semantics of temporal predicates are modified by the granularities of the temporal elements in the predicate. For example, how do we compare a timestamp at the granularity of an hour with one at the granularity of a day? There are several possible approaches:

- Give a mismatched granularity error. Given that operations over different granularities will be common, this solution is not useful.
- Perform the operation at the granularity of the first operand. Thus, the granularity of an operation can be controlled by rearranging its operands. However, there will be a loss of symmetry in operations.
- Perform the operation at the finest granularity. All instants are thus converted to the granularity of the finest operand before any temporal operations take place. However, as noted earlier, all determinate instants at a particular granularity are indeterminate at all finer granularities. As a consequence, this approach will introduce indeterminacy into the result. Alternatively, we could perform the operation at the coarsest granularity. We thus avoid adding any indeterminacy that is not already present. However, we do lose precision because information at a finer granularity is lost when it is converted to a coarser granularity.
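The trade-off between the finest-granularity and coarsest-granularity approaches can be seen in a short Python sketch that compares an instant known to the hour with an instant known to the day. It is only an illustration: granules are integer indices counted from a shared origin, the hour instant is treated as exact at hour granularity, the function names are hypothetical, and `None` stands for an answer that cannot be decided.

```python
HOURS_PER_DAY = 24  # regular mapping between the hour and day granularities


def before_at_coarsest(hour, day):
    """Compare at day granularity: scale the hour up, which discards precision."""
    return hour // HOURS_PER_DAY < day


def before_at_finest(hour, day):
    """Compare at hour granularity: the day becomes a range of possible hours.

    Returns True or False when the answer holds for every possible hour of the
    day, and None when the outcome depends on the unknown hour.
    """
    first_hour = day * HOURS_PER_DAY
    last_hour = first_hour + HOURS_PER_DAY - 1
    if hour < first_hour:
        return True      # earlier than every hour the day instant could occupy
    if hour >= last_hour:
        return False     # not earlier than any hour the day instant could occupy
    return None          # indeterminate: depends on where in the day the instant lies


same_day_coarse = before_at_coarsest(30, 1)   # hour 30 is 06:00 on day 1
same_day_fine = before_at_finest(30, 1)
next_day_fine = before_at_finest(30, 2)
```

For two instants on the same day, the coarse comparison returns a definite answer that hides the lost precision, while the fine comparison reports that the result is indeterminate.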

With the exception of the first approach, all the above solutions rely on operations that move times within the granularity lattice. The following subsection describes an operation called scale that moves time up and down the granularity lattice.

Scale

Scale uses the regular and irregular mappings between granularities to convert a time from one granularity to another. The mechanics of scale depend on whether we are converting to a finer or a coarser granularity. We discuss scaling from a finer to a coarser granularity first, followed by scaling from a coarser to a finer granularity.

Scaling from Fine to Coarse

Scaling a determinate instant from a finer to a coarser granularity results in a determinate instant. We use the mapping function provided in the lattice to scale from the fine to the coarse granularity; we must also consider differences between the anchor points of both granularities. For example, suppose that we wish to scale an instant, $\mathit{time}_f$, from a finer granularity, $f$, to a coarser granularity, $c$. In this example, we assume that there are no intermediate granularities between $f$ and $c$. Let $d_{f,c}$ be the difference between the anchor points of granularities $f$ and $c$, expressed in the finer granularity, $f$. Let $\mathrm{map}_{f\_to\_c}$ be the mapping (regular or irregular) from the finer to the coarser granularity. Then we can define scale as follows:

\[ \mathrm{scale}(\mathit{time}_f, c) = \mathrm{map}_{f\_to\_c}(\mathit{time}_f - d_{f,c}) = \mathit{time}_c \]

If an intermediate granularity, $i$, lies between $f$ and $c$ in the granularity lattice, we can compose the mapping functions as follows:

\[ \mathrm{scale}(\mathit{time}_f, c) = \mathrm{map}_{i\_to\_c}(\mathrm{map}_{f\_to\_i}(\mathit{time}_f - d_{f,i}) - d_{i,c}) = \mathit{time}_c \]

Using the above approach we can move time any number of levels in the granularity lattice. Note that we lose information when we scale from a finer to a coarser granularity.

Scaling from Coarse to Fine

Scaling a determinate instant from a coarser to a finer granularity generates an indeterminate instant. The mechanics of scaling are similar to the approach used in the fine-to-coarse mapping outlined above, but now our result will be an indeterminate period. Assuming a mapping from a coarse granularity, $c$, to a finer granularity, $f$, we define our mapping as follows:

\[ \mathrm{scale}(\mathit{time}_c, f) = \mathrm{map}_{c\_to\_f}(\mathit{time}_c - d_{c,f}) \sim \left(\mathrm{map}_{c\_to\_f}((\mathit{time}_c + 1) - d_{c,f}) - 1\right) \]

The upper and lower bounds are scaled separately and our result is a pair of times representing an indeterminate period. With minor modifications, we can adapt the above approach to scaling both periods and intervals. It can also be applied to indeterminate temporal elements.
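The two directions of scale can be sketched for the special case of a regular mapping, where one coarse granule always contains the same number of fine granules. This is only an illustration under that assumption: the function and parameter names are hypothetical, and irregular mappings (such as days to months) would need a lookup function in place of the division and multiplication.

```python
from typing import Tuple


def scale_to_coarser(time_f: int, granules_per_coarse: int, d_fc: int = 0) -> int:
    """Fine to coarse: subtract the anchor difference, then apply the regular mapping."""
    return (time_f - d_fc) // granules_per_coarse


def scale_to_finer(time_c: int, granules_per_coarse: int, d_cf: int = 0) -> Tuple[int, int]:
    """Coarse to fine: return the (lower, upper) bounds of an indeterminate period.

    The lower bound is the image of the granule itself; the upper bound is the
    image of the next granule minus one fine granule.
    """
    lower = (time_c - d_cf) * granules_per_coarse
    upper = ((time_c + 1) - d_cf) * granules_per_coarse - 1
    return lower, upper


# Composition through an intermediate granularity (seconds -> hours -> days).
def seconds_to_days(seconds: int) -> int:
    return scale_to_coarser(scale_to_coarser(seconds, 3600), 24)
```

With years as the fine granularity and decades as the coarse one, an anchor difference of 1000 places 1970 and 1979 in the same decade, whereas an anchor difference of 1 groups 1971 with 1980, which is the situation described in the decade example earlier.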

5. Basic Temporal Queries

The Chronus II query language will be a superset of SQL. All existing non-temporal queries will retain their existing semantics. To support temporal queries we describe a set of extensions to standard SQL syntax.

These extensions may be grouped under the following headings:

- Table creation
- Table updates
- Basic temporal functions
- Period operators
- Query support

The following subsections outline some of the extensions that will be defined. A very basic outline of the proposed syntax is presented. A separate report will describe the completed query language. To distinguish them from non-temporal commands, all Chronus II temporal commands will be preceded by the TEMPORAL keyword.

Table Creation

If we wish to create a temporal table, we must have a way of specifying that it is temporal when we create it. We propose appending a VALID clause after the normal SQL table creation command. A number of possible keywords may appear after the VALID clause: the STATE keyword will indicate that the table is a temporal table with valid-time periods associated with each tuple; the EVENT keyword will indicate that the table is a temporal table with instant timestamps associated with each tuple. Other keywords may be used to indicate, for example, that a table has transaction time support. A granularity must also be specified when creating a temporal table. This granularity is specified as part of the VALID clause. The following example shows the creation of a state table with a granularity of days.

    TEMPORAL CREATE TABLE T1 ( ... ) AS VALID STATE DAYS

Table Updates

The table update command must be modified to allow specification of the time components associated with the update. We first introduce a PERIOD clause that can be used to specify time periods in a temporal command. The PERIOD clause will have the following form:

    PERIOD('', '')

An INSTANT clause will also be defined to describe instants. It will have the form:

    INSTANT('')

The format of the time literal specification will follow the SQL-92 standard for the standard Gregorian calendar. If user-defined calendars are allowed, these time literals may be in a user-specified format. When updating a temporal table, a VALID clause can be used to indicate the valid-time associated with the update. An example of an UPDATE command could be as follows:

    TEMPORAL UPDATE T1 VALID PERIOD('..', '..'), SET ...

Only tuples that have a valid-time intersecting with the specified period are updated by the above command. We may also specify more than one period in the update by separating the PERIOD clauses with the plus operator. For example:

    TEMPORAL UPDATE T1 VALID PERIOD('..', '..') + PERIOD('..', '..') SET ...

A VALID clause may also be applied to a SELECT statement to indicate that we are interested only in tuples in a particular time period for this query. For example:

    TEMPORAL SELECT Name VALID PERIOD('..','..') FROM T1 WHERE ...

Basic Temporal Functions

Several temporal operators are required to assist in the construction of temporal predicates. The ability to extract the valid-times associated with tuples in a temporal table is a minimal requirement. We will provide the function VALID_TIME to do this extraction. For example, the expression VALID_TIME(T1) can be used in a query to explicitly refer to the valid-time element of the temporal table T1. The application of the VALID_TIME function to a non-temporal table will produce an error. We may also wish to refer to the start or end instants of the valid periods. The functions START_OF and END_OF will be supplied to extract these times; they will have the same form as VALID_TIME.

Many temporal queries will require the conversion of valid times in an expression to a specific granularity. Functions will be provided to perform this conversion.

Period Operators

We will need a set of standard operators on temporal periods. For example, a question of the form “Did period X occur before period Y?” would be common in temporal queries. We will provide the following minimal set of operations:

    BEFORE, AFTER, EQUALS, MEETS, MET_BY, OVERLAPS, OVERLAPPED_BY, DURING, CONTAINS, STARTS, STARTED_BY, FINISHES, FINISHED_BY

These temporal operators follow the definitions of Allen’s operators. So if we wish to ask if tuples in state table T1 temporally overlap with tuples in state table T2, we use the following expression:

    OVERLAPS(T1, T2)

Query Support

Temporal queries are similar to their non-temporal counterparts except that they contain additional temporal predicates. We propose adding a WHEN clause to the SELECT statement to support temporal predicates. For example, if we are given the following OCCUPATION state table

    Name   Title      ValidTime
    Fred   Mechanic   {[June,10,1995-May,12,1996], [Jan,1,1997-April,10,1997]}
    Joe    Mechanic   [April,1,1994-February,12,1997]

and wish to select employees that have worked as a mechanic for longer than two months, we could express the query as follows:

    TEMPORAL SELECT Name FROM OCCUPATION WHERE Title = 'Mechanic' WHEN DURATION(OCCUPATION, 'Months') > 2

The result of a temporal query will be a temporal table, so the above selection will return a table with a column called ‘Name’ followed by a valid-time column.

Many temporal queries will wish to project selected columns in a table before operating on them. The resulting projection may be coalesced or uncoalesced. Given that many temporal queries will operate in this way, we would like to avoid requiring that the projection be performed in a separate SELECT statement, because this approach would necessitate the writing of two SELECT statements for many temporal queries. Instead, we allow projection to be performed in the FROM clause of a SELECT statement. These projections are named so that we can refer to them in the subsequent WHERE and WHEN clauses. They are coalesced by default; the keyword UNCOALESCED may be used to turn off coalescing. The following is an example of a SELECT statement with a coalesced temporal projection in the FROM clause:

    TEMPORAL SELECT Name FROM OCCUPATION(Name) AS P WHEN DURATION(P, 'Months') > 2

The above query differs from the first query in that we are now asking for employees that worked for longer than 2 months at any job.
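Coalescing a projection can be sketched as follows. The sketch assumes closed periods at a granularity of days and treats two periods as mergeable when they overlap or meet; the row layout and the names coalesce and project are hypothetical and only illustrate the idea.

```python
from datetime import date
from typing import Dict, List, Tuple

Period = Tuple[date, date]  # closed period [start, end] at day granularity


def coalesce(periods: List[Period]) -> List[Period]:
    """Merge periods that overlap or meet (previous end + 1 day >= next start)."""
    merged: List[Period] = []
    for start, end in sorted(periods):
        if merged and start.toordinal() <= merged[-1][1].toordinal() + 1:
            last_start, last_end = merged[-1]
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged


def project(rows: List[dict], column: str) -> Dict[str, List[Period]]:
    """Project rows onto one column and coalesce the valid times of equal values."""
    grouped: Dict[str, List[Period]] = {}
    for row in rows:
        grouped.setdefault(row[column], []).append(row["valid"])
    return {value: coalesce(periods) for value, periods in grouped.items()}
```

If a hypothetical employee held one job throughout January 1995 and a different job from February through March 1995, neither tuple alone lasts longer than two months, but the coalesced projection on the name yields a single three-month period. A duration predicate in the WHEN clause would therefore select this employee only when applied to the coalesced projection.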

6. Joining Temporal Tables

Performing joins on temporal tables is nontrivial, and any system that includes temporal information must rigidly define the semantics for such joins. For example, we cannot join two temporal tables using the semantics of the standard join because our resulting table would have two valid-time columns. Such a table will not fit our temporal model. Instead, we must define our own semantics for temporal joins.

When joining two or more temporal tables we need to specify a way of combining the valid-time elements of the temporal tables involved in the join. A number of options are possible. The original Chronus specified the semantics of a number of temporal joins, the most common type of join being the extract join. This join has the most intuitive semantics: the valid-time periods of a joined table are created from the intersection of the valid-time elements of the tables specified in the join. This type of join has variously been called the intersect join and the temporal join. Other types of joins include the difference join, in which the result valid-time elements are computed from non-common periods in the valid-times of the joined tables.

TSQL2 adopts a similar approach. When performing a temporal join in TSQL2, the user is responsible for specifying the function that will generate the valid-time result from the joined tables. This function could in principle be arbitrary, but will usually be drawn from a small set of standard functions, such as intersect and difference. When performing a temporal join, the valid-time columns are at first ignored and the non-temporal attributes are joined as normal. Then, the specified temporal function is evaluated using the source valid-time periods of each tuple in the result. If the function evaluates to a null time period, the tuple is excluded from the final result. If the time period is not null, then the tuple is included in the result. The result time period will serve as this tuple’s new valid-time.

Assume we have a temporal function INTERSECT that takes two tuples from temporal tables and produces a valid-time that is computed from the intersection of the two tuple valid-times. We can then perform an intersect join on two tables T1 and T2 with attributes ‘a’ and ‘b’, respectively, as follows:

    TEMPORAL SELECT a, b, VALID INTERSECT(T1, T2) FROM T1, T2

This approach can be extended directly to three or more tables if we provide temporal functions that take a variable number of parameters. It also allows us to directly model the semantics of Chronus joins. In Chronus II, we thus follow the join model adopted by TSQL2.
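The join procedure described above (join the non-temporal attributes, evaluate the temporal function, discard tuples whose result is null) can be sketched as below. The sketch assumes each tuple carries a single closed period of integer instants and represents a table as a list of (attribute, period) pairs; these representations and the function names are illustrative assumptions.

```python
from typing import Callable, List, Optional, Tuple

Period = Tuple[int, int]  # closed period of instants [start, end]
Row = Tuple[str, Period]  # (non-temporal attribute, valid-time period)


def intersect(p: Period, q: Period) -> Optional[Period]:
    """Return the common part of two periods, or None for a null period."""
    start, end = max(p[0], q[0]), min(p[1], q[1])
    return (start, end) if start <= end else None


def temporal_join(
    t1: List[Row],
    t2: List[Row],
    valid_fn: Callable[[Period, Period], Optional[Period]] = intersect,
) -> List[Tuple[str, str, Period]]:
    """Cross-join the attributes, then keep tuples with a non-null valid-time."""
    result = []
    for a, p in t1:
        for b, q in t2:
            valid = valid_fn(p, q)
            if valid is not None:
                result.append((a, b, valid))
    return result
```

Passing a different valid_fn corresponds to the user choosing another temporal function for the join. A difference join would need a function returning several periods per tuple pair, which this single-period sketch does not model.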

7. Multiple Calendar Support

The SQL-92 and SQL3 language specifications make the implicit assumption that only the Gregorian calendar will be used. This assumption is pervasive throughout both specifications. As mentioned earlier, we would like to add support for multiple calendars to our query system. This support must also allow the conversion of times from one calendar representation to another.

To allow meaningful coexistence of different calendars, we must weave the granularities from several calendars together. To perform this association we assert that some pairs of granularities in different calendars are congruent. Two granularities are congruent if they partition the time-line into identical sets. They may, however, have different anchor points. For example, suppose that the ‘day’ granularity in the Gregorian and the Business calendars partitions the time-line identically. The anchor points, however, may differ, primarily because the two calendars could define time over different portions of physical time. The 23rd business calendar day may not be the same as the 23rd Gregorian day; instead, it may be the 1023rd Gregorian day.

Conceptually, we may test two granularities for congruence by enumerating the set of time-line clock chronons in each granularity, converting each chronon to the given granularity, and seeing whether they belong to the same partition. Alternatively, the DBA could simply declare congruence between pairs of granularities.

Another approach is to define mappings between granularities in different calendars. Some of these mappings could be regular. For example, Gregorian seconds can be directly mapped to astronomy hundredth-days using a conversion factor of 864. More complex relationships would require functions to do the mapping.

To summarize the steps involved in creating a multi-calendar system:

- first describe the various calendars involved using calendar specification files
- assert any congruencies between granularities in the calendars
- assert any regular mappings between granularities in the calendars
- write mapping functions for the irregular mappings between granularities

Conceptually, we have now combined the granularities from different calendars into a single system-wide lattice.
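As a rough illustration of a regular cross-calendar mapping and of the conceptual congruence test, the sketch below assumes that chronons are integer seconds and that each granularity is given as a function from a chronon to a granule number. Both function names are hypothetical, and the test can only examine a finite sample of chronons.

```python
from typing import Callable, Dict, Iterable

SECONDS_PER_HUNDREDTH_DAY = 864  # 86,400 seconds per day divided by 100


def seconds_to_hundredth_days(seconds: int) -> int:
    """Regular mapping from Gregorian seconds to astronomy hundredth-days."""
    return seconds // SECONDS_PER_HUNDREDTH_DAY


def congruent(
    to_granule_a: Callable[[int], int],
    to_granule_b: Callable[[int], int],
    chronons: Iterable[int],
) -> bool:
    """Check that two granularities partition the sampled chronons identically.

    Granule numbers may differ because anchor points may differ; only the
    grouping of chronons into granules is compared.
    """
    a_to_b: Dict[int, int] = {}
    b_to_a: Dict[int, int] = {}
    for chronon in chronons:
        ga, gb = to_granule_a(chronon), to_granule_b(chronon)
        if a_to_b.setdefault(ga, gb) != gb or b_to_a.setdefault(gb, ga) != ga:
            return False
    return True
```

For example, a Gregorian day numbered as seconds // 86400 and a business day numbered 1000 lower partition the time-line identically, so the test succeeds even though the granule numbers differ; comparing days with hours fails.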

8. Temporal Indeterminacy

Not all temporal information will be known with certainty. For example, we may only know that some event happened “between 1 AM and 6 AM”, “sometime in late June”, or “sometime last year”. These are examples of temporal indeterminacy: we know that some event occurred but we do not know exactly when it occurred. This indeterminacy can arise for several reasons: many dating techniques are inherently imprecise; the time of many events is often imprecisely specified; often, we simply do not know when an event occurred.

In the abstract, all temporal information has some degree of uncertainty associated with it: the degree of accuracy of any measurement is at the very least limited by the resolution of our measurement device. This resolution must be specified at some granularity, and thus, by definition, the measurements will be imprecise at all finer granularities. However, if we specify that we are interested in time information only at a particular granularity, then we can consider certain times specified at that granularity to be determinate. If we did not adopt this approach, we could never draw any definite conclusions using the temporal information in our database.

Note that we are dealing with uncertainty only in terms of when an event happened: we assume that the event did in fact happen; we just do not know exactly when. We are not dealing with the uncertainty of whether an event did in fact take place.

We would like Chronus II to be able to capture indeterminacy in its temporal model. We would also like it to support temporal queries dealing with the associated indeterminate data. This section outlines the approach we intend to adopt. We first outline the conceptual temporal model of indeterminacy we plan to use; this outline is followed by a description of the syntactic extensions we propose for retrieving temporally indeterminate information from the database.

Modeling Indeterminacy

Indeterminate Instants

If we do not know the exact location of an event in time but instead know a range of possible instants in which it occurred, we can model the indeterminacy with three components: a lower support, an upper support, and a probability mass function (p.m.f.) [5]. The supports indicate the times that delimit when the instant is located; while we do not know exactly when the event occurred, we do know that it did not happen before the lower support; similarly, we know that it did not happen after the upper support. Between these supports lies a period of indeterminacy, which identifies a set of possible instants in which the event occurred. The probability mass function will map each instant in this period of indeterminacy to a probability; this probability indicates the likelihood that the event occurred at that instant.

In the simplest case of all instants being equally likely, the p.m.f. will simply be the reciprocal of the number of instants in the indeterminacy period. A p.m.f. evaluating to the normal distribution would be common for certain classes of events. Other distributions could include the “probably early” distribution, which would skew the mass function towards the lower support; the “probably late” distribution could similarly skew the function towards the upper support. User-defined p.m.f.s could be appropriate in some cases.

In some cases, the user may not know the underlying distribution, either because the information is unavailable or because the mass function would exceed the implementation capacities of the system. Such distributions are termed missing and indicate a complete lack of knowledge about the distribution. We thus have incomplete information about indeterminate information, which would lead to great complexities if we tried to model it. We thus do not allow unknown or partially known distributions.
In a similar attempt at model simplification, we assume that all indeterminate instants are independent. That is, we do not consider the case in which the probability of an event occurring at a certain instant within one period of indeterminacy is related to the probability of another event occurring at some instant in its period of indeterminacy. The consideration of joint or dependent probabilities would generate serious complications in the model. As presented above, the p.m.f. is a discrete rather than continuous function, implying a discrete model of time. However, the above approach can also be applied to the continuous model of time without loss of generality.
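The three components of an indeterminate instant can be captured in a small data structure. The following is a minimal sketch under the discrete model of time, using exact fractions for probabilities; the class and function names are hypothetical and do not reflect a decided Chronus II representation.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Dict


@dataclass
class IndeterminateInstant:
    lower: int                # lower support: the event did not happen earlier
    upper: int                # upper support: the event did not happen later
    pmf: Dict[int, Fraction]  # probability of each instant in the period of indeterminacy


def uniform_instant(lower: int, upper: int) -> IndeterminateInstant:
    """Simplest case: every instant between the supports is equally likely."""
    count = upper - lower + 1
    pmf = {i: Fraction(1, count) for i in range(lower, upper + 1)}
    return IndeterminateInstant(lower, upper, pmf)
```

An event known only to lie within six consecutive hour granules would be built with uniform_instant(1, 6), giving each instant a probability of one sixth. Skewed distributions such as “probably early” would supply a different pmf over the same supports.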

Indeterminate Periods

We may model an indeterminate period using a pair of indeterminate instants. We term the instants in this pair the starting and terminating instants, respectively. An indeterminate period may start during any member of a set of starting instants and end during a member of a set of terminating instants. An indeterminate period thus represents a set of possible periods between the lower support of the starting instant and the upper support of the terminating instant. The one restriction on the above model is that the sets of possible instants of the bounding instants must overlap by at most a single instant. This restriction ensures that we do not have negative periods; it also ensures that we do not have null periods. Expressed differently, it ensures that every indeterminate period has a determinate portion of at least one instant.

Indeterminate Intervals

A determinate interval can be described by a count of instants; the interval lasts a known duration of time. An indeterminate interval, on the other hand, must be described in terms of an imprecise number of instants. For example, we may state that an interval lasted between 10 and 20 instants. An indeterminate interval will have a p.m.f. associated with it describing the distribution of the number of instants in the interval. The assumptions and restrictions that we made in the case of indeterminate instants above will apply here also.

Model Representation

This section describes an outline representation of the above temporal indeterminacy model. The semantic extensions required for our temporal query language to support indeterminacy are detailed; the necessary syntactic extensions are also described.

Semantic Extensions

Several semantic extensions to the existing determinate model are required to support indeterminate queries.

Range Credibility

Range credibility changes the information available to query processing. It eliminates unlikely (though possible) periods from indeterminate periods until the desired credibility is reached. Possible instants from the start of an indeterminate period are eliminated. In each query, we are given the option to use a range credibility value to control this elimination. This value is a number between 0 and 100. It specifies the degree of certainty we require in our consideration of an indeterminate period. The range credibility is first scaled to between 0 and 1, inclusive, and we use the probability mass function for the indeterminate period to exclude instants at the start that have a cumulative probability less than the scaled range credibility. For example, a range credibility of 50 will remove instants at the start of an indeterminate period that have a cumulative probability of less than 0.5. The period is thus shrunk to exclude those instants. A range credibility of 0 represents the retention of all indeterminacy, so no shrinking occurs; a range credibility of 100 represents its entire removal: the period will be shrunk to the last possible instant.

If we have an indeterminate instant, $\alpha$, delimited by instants $\alpha_*$ (lower) and $\alpha^*$ (upper), and a scaled range credibility $\gamma$, we can define a function shrink as follows:

\[ \mathrm{shrink}(\gamma, [\alpha_*, \alpha^*]) = [x, \alpha^*] \]

where $x$ is constrained by:

\[ (\alpha_* \le x \le \alpha^* \land F_\alpha(x) \ge \gamma) \land \lnot(\exists i)(x < i \le \alpha^* \land F_\alpha(i) = F_\alpha(x)) \land \lnot(\exists j)(\alpha_* < j \le x \land F_\alpha(x) > F_\alpha(j)) \]

$F_\alpha$ is the cumulative probability function for indeterminate instant $\alpha$. We must also compute a p.m.f. for the new period because the old p.m.f. may have assigned non-zero probabilities to instants that are no longer in the period of indeterminacy. We construct a new mass function by scaling the probabilities of the remaining instants by the cumulative probability of the chopped instants. This new mass function will then be used in the further evaluation of the associated query.

Ordering Plausibility

A set of determinate instants has a single temporal ordering. Given a temporal expression consisting of temporal predicates, this ordering will either satisfy the expression or fail to satisfy it. However, a set of indeterminate instants and periods will have many possible temporal orderings, with varying degrees of plausibility. Thus, when evaluating a query that operates on indeterminate instants or periods, we will want to be able to specify the degree of plausibility that we are willing to accept. We call this term the ordering plausibility. For any two indeterminate instants, $\alpha$ and $\beta$, the probability that $\alpha$ is before $\beta$ is:

\[ P(\alpha < \beta) = \sum_{i, j \in \mathbb{Z},\; i < j} P_\alpha(i) \times P_\beta(j) \]

where $P_\alpha$ is the p.m.f. for indeterminate instant $\alpha$.

The ordering plausibility is analogous to the range credibility factor. It will also be a number between 0 and 100, inclusive. We use this plausibility to define a function called before that determines if it is plausible that one instant occurred before another. This function takes the ordering plausibility, $\gamma$, and the two indeterminate instants $\alpha$ and $\beta$ as follows:

\[ \mathrm{before}(\alpha, \beta, \gamma) \equiv \lnot(\alpha \text{ is } \beta) \land ((P(\alpha < \beta) \times 100) \ge \gamma) \]

Similar functions can be defined for after, intersects, and so on. We can associate this plausibility with any temporal comparison operation on indeterminate data. We vary this plausibility depending on the degree of confidence we would like in our result. An ordering plausibility of 0 indicates that we are willing to accept the level of uncertainty that already exists in the underlying data; a factor of 100 represents complete certainty and reduces to normal determinate semantics.

In line with the simplifying assumption made for the range credibility factor above, we only consider the case where there are no dependencies between the probabilities associated with indeterminate instants. Hence, we cannot use the above approach to determine a triplet ordering between three indeterminate instants. We can examine the ordering of the first instant relative to the second, the second instant relative to the third, and the first relative to the third independently, but not all at once. Significant performance issues would arise if this simplification were not adopted. Generating an ordering between three or more dependent indeterminate instants is computationally expensive.

Query Reducibility

If the user does not require indeterminacy support, we would like the above extensions to be syntactically and semantically transparent to them. Our extended semantics should reduce to determinate semantics for determinate queries. In other words, queries with a range credibility and ordering plausibility of 100 (i.e., all determinate queries) should not activate or require knowledge of any indeterminacy extensions. The semantics of extant determinate queries are thus preserved. In addition, all determinate queries will effectively ignore indeterminate information; the range credibility of 100 ensures this outcome. If necessary, indeterminacy may be removed from a database by applying a range credibility of 100 to each indeterminate table in the database. This process effectively replaces each indeterminate period with a determinate one. (Recall that every indeterminate period reduces to at least a one-instant determinate period.) Such a database is termed period reduced.
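The range credibility and ordering plausibility mechanisms described above can be sketched together. The sketch makes several simplifying assumptions: a p.m.f. is a dictionary from integer instants to exact fractions, shrink keeps everything from the first instant whose cumulative probability reaches the scaled credibility (it does not reproduce the additional tie-breaking clauses of the formal constraint), and the remaining probabilities are renormalized so that they sum to one. All names are hypothetical.

```python
from fractions import Fraction
from typing import Dict

Pmf = Dict[int, Fraction]  # maps each possible instant to its probability


def uniform(lower: int, upper: int) -> Pmf:
    """Equal probability for every instant between the supports."""
    count = upper - lower + 1
    return {i: Fraction(1, count) for i in range(lower, upper + 1)}


def shrink(pmf: Pmf, credibility: int) -> Pmf:
    """Drop leading instants whose cumulative probability is below credibility/100."""
    gamma = Fraction(credibility, 100)
    cumulative = Fraction(0)
    kept: Pmf = {}
    for instant in sorted(pmf):
        cumulative += pmf[instant]
        if kept or cumulative >= gamma:
            kept[instant] = pmf[instant]
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}  # renormalized mass function


def prob_before(pa: Pmf, pb: Pmf) -> Fraction:
    """P(a < b) for independent instants: sum of Pa(i) * Pb(j) over all i < j."""
    return sum(
        (p * q for i, p in pa.items() for j, q in pb.items() if i < j),
        Fraction(0),
    )


def before(pa: Pmf, pb: Pmf, plausibility: int) -> bool:
    """True when the two instants are distinct and P(a < b) * 100 >= plausibility."""
    return pa is not pb and prob_before(pa, pb) * 100 >= plausibility
```

With a uniform instant over 1..4 and another over 3..6, P(a < b) is 13/16, so before holds at a plausibility of 80 but not at 90. A credibility of 100 shrinks the first instant to the single instant 4, and two determinate instants compared at a plausibility of 100 behave as in the determinate model.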

Syntactic Extensions

Several basic syntactic extensions are required to support the above semantics. We need to be able to:

- indicate that a temporal object is indeterminate rather than determinate
- specify determinate temporal tables in the CREATE statement
- specify a range credibility in the SELECT statement
- specify an ordering plausibility in the SELECT statement
- specify query defaults for both range credibility and ordering plausibility

The exact syntactic representation in Chronus II has yet to be decided.

9. Transaction Time Support

A database that supports both valid-time and transaction time is termed bitemporal. In addition to valid-times, a table in a bitemporal database may optionally have transaction times associated with it. Each transaction time records operations on the associated table. The three primary types of operations are creation, modification, and deletion. Transaction times may also be associated with database schema. However, in this section we consider only data-oriented transaction time support.

Bitemporal tables are append-only. Information is never deleted: if the table is modified, the data preceding the modification is superseded by the new information but it remains in the database. We use the transaction times associated with the table to identify the currently valid data. By default, unless explicit transaction time operations are used, only currently valid data will be visible when operating on the database.

Transaction time periods are a function of the system clock at the time when the DBMS successfully commits the updates. Thus, two tuples in the same or different tables updated in the same transaction are given the same transaction time. Tables updated by separate transactions must be given distinct transaction times, primarily for audit purposes. Thus the granularity of the transaction time dimension must be fine enough to differentiate between two successive updates. This granularity is system dependent.

Valid and transaction time have identical ontologies and identical baseline clocks. However, there are significant differences in their use. The transaction time component of a database operation is supplied automatically by the system and is not specified by the user. An example of a query using transaction time would be: “List all employee details incorrectly recorded on January 1, 1992.” This query requires the use of both valid-time and transaction time and illustrates that the two time components cannot be considered separately. The specification of transaction time and valid-time periods must occur in the same query.
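The append-only behaviour of transaction time can be illustrated with a small in-memory sketch. It is not a proposal for the Chronus II implementation: valid-time is omitted for brevity, an integer counter stands in for the system clock, and the class and field names are hypothetical. Superseded rows are never removed; only their transaction-time period is closed.

```python
from typing import Any, Dict, List


class TransactionTimeTable:
    def __init__(self) -> None:
        self.rows: List[Dict[str, Any]] = []
        self._clock = 0  # stand-in for the system clock at commit time

    def commit(self, changes: Dict[str, Any]) -> int:
        """Apply one transaction; every row it writes gets the same transaction time."""
        self._clock += 1
        tt = self._clock
        for key, value in changes.items():
            for row in self.rows:
                if row["key"] == key and row["tt_stop"] is None:
                    row["tt_stop"] = tt  # supersede, but keep, the old version
            self.rows.append({"key": key, "value": value, "tt_start": tt, "tt_stop": None})
        return tt

    def current(self) -> Dict[str, Any]:
        """Default view: only data that has not been superseded."""
        return {r["key"]: r["value"] for r in self.rows if r["tt_stop"] is None}

    def as_of(self, tt: int) -> Dict[str, Any]:
        """Explicit transaction-time query: the table as recorded at time tt."""
        return {
            r["key"]: r["value"]
            for r in self.rows
            if r["tt_start"] <= tt and (r["tt_stop"] is None or tt < r["tt_stop"])
        }
```

A query such as the “incorrectly recorded” example would combine an as_of lookup like this with a valid-time predicate on the rows it returns.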

The addition of transaction time to the valid-time system we propose would not alter the semantics of any existing operations, temporal or otherwise. In addition, because transaction time shares the same temporal model as valid-time, it can make use of much of the common underlying temporal functionality. However, irrespective of how easily the transaction model fits, providing transaction time support in a portable way is nontrivial. A significant advantage of the valid-time model outlined in this document is that it can be implemented reasonably efficiently as a layer above an existing DBMS. An efficient implementation of transaction time support at this level is not straightforward; an implementation at the database level is desirable. However, few commercial DBMSs currently support transaction time, and those that do provide it in a proprietary way. No transaction time support is proposed in the SQL3 standard.

10. Implementation Strategy

A crucial property of the design outlined in this document is that it can be implemented without requiring modifications to an underlying DBMS. Given the desire for portability and the plethora of DBMSs, this property is important. Obviously, underlying database support for temporal queries would be ideal, but such support is not likely in any reasonable timeframe.

We intend to implement Chronus II as a pre- and post-processor that sits outside a DBMS and uses ODBC as its access layer. The preprocessor will take Chronus II temporal commands and map them to standard SQL. The SQL commands can then be sent to a standard database server. The results of the query may then be post-processed if necessary and passed to the user. Commands that require no temporal support (i.e., standard SQL commands) can be passed straight through to the database, and the result returned without post-processing. Obviously, there will be some performance hit for each command, but if we are careful the penalty can be relatively minor.

The implementation of the various subsections of Chronus II will roughly follow the sequence of their description in this document. We envisage the following stages:

- Temporal Ontology Support. We will provide an implementation of the temporal model outlined in this document. Basic support will include the ability to specify timestamps, periods, and intervals. Temporal comparison and associated granularity conversion operations will also be provided.
- Basic Language Support. A parser will be written that will extend the syntax of SQL to support Chronus II temporal queries. Temporal joins will also be supported. A parser has already been written for standard SQL and is currently being extended to include Chronus II productions.
- Indeterminacy. The implementation of the temporal model and query language will be extended to support indeterminacy.
- Multiple Calendars. Multiple calendar support will be added if required.
- Transaction Time. If needed, we may add transaction time support to the system.

The implementation language will be Java. JavaCC (Java Compiler Compiler, the Java equivalent of Lex and Yacc) will be used to write the parser.
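The pre- and post-processing flow described in this section can be outlined in a few lines. The sketch is written in Python purely for brevity and is not the planned Java implementation; the translate, execute, and post_process callables are hypothetical placeholders for the Chronus II translator, the database access layer, and the result post-processor.

```python
from typing import Any, Callable


def is_temporal(command: str) -> bool:
    """Temporal commands are those whose first keyword is TEMPORAL."""
    parts = command.split(None, 1)
    return bool(parts) and parts[0].upper() == "TEMPORAL"


def process(
    command: str,
    translate: Callable[[str], str],
    execute: Callable[[str], Any],
    post_process: Callable[[Any], Any],
) -> Any:
    """Route a command through the processor, or pass it straight through."""
    if not is_temporal(command):
        return execute(command)        # standard SQL: no translation, no post-processing
    sql = translate(command)           # map the temporal command to standard SQL
    return post_process(execute(sql))  # adjust the raw result before returning it
```

Keeping the pass-through path free of translation work is one way to limit the per-command overhead mentioned above.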

References

1. Das AK, Musen MA. A temporal query system for protocol-directed decision-support. Methods of Information in Medicine, 1994; 33:358-370.
2. Snodgrass RT (Ed). The TSQL2 Temporal Query Language. Kluwer Academic Publishers, Boston, 1995.
3. Snodgrass RT, Jensen CS, Steiner A. Transitioning temporal support in TSQL2 to SQL3. Temporal Databases: Research and Practice, 1998; 150-194.
4. Shahar Y, Musen MA. RÉSUMÉ: a temporal-abstraction system for patient monitoring. Computers and Biomedical Research, 1993; 26:255-273.
5. Dyreson C, Snodgrass RT. Supporting Valid-time Indeterminacy. TimeCenter Publications: http://www.cs.auc.dk/research/DP/tdb/TimeCenter/publications.html, 1997.
