Extending The Constraint Database Framework ∗

Dina Goldin

Ayferi Kutlu

Mingjun Song

University of Connecticut Storrs, CT, USA

University of Connecticut Storrs, CT, USA

University of Connecticut Storrs, CT, USA


ABSTRACT

Constraint Databases (CDBs) are an extension of relational databases that enrich both the relational data model and the relational query primitives with constraints. By providing a finite representation of data with infinite semantics, the Constraint Database approach is particularly appropriate for querying spatiotemporal data. Since the introduction of Constraint Databases in the early 1990s, several Constraint Database systems have been implemented. In this paper, we discuss several extensions to Paris’s Constraint Database framework that we believe necessary, based on the experience with these implementations, and specifically CQA/CDB:

• Extending the CDB schema to explicitly distinguish traditional from constraint attributes.
• Additional Constraint Query Algebra operators: keeping queries safe.
• Multi-attribute indexing systems: how best to group the attributes?
• Flexibility in representing infinite data: taking constraints out of CDBs.

We believe that Paris would have been the first to modify the Constraint Database framework if he felt there were a way to improve it, and we hope that these extensions to the CDB framework bring us closer towards realizing the promise that constraint database technology holds for integrating the advantages of traditional relational database technology with emerging data-intensive applications.

Categories and Subject Descriptors

H.2.1 [Database Management]: Logical Design; H.2.3 [Database Management]: Languages; H.2.5 [Database Management]: Heterogeneous Databases; H.2.8 [Database Management]: Database Applications - spatial databases and GIS

∗Supported by NSF grants no. 9733678 and IRI-0296195.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PCK50, June 8, 2003, San Diego, California, USA. Copyright 2003 ACM 1-58113-604-8/03/0006 ...$5.00.

General Terms Constraint Database Design, Constraint Query Languages, Database Theory

Keywords constraint databases, constraint query languages, query algebra, spatial databases, database indexing

1. INTRODUCTION

Constraint Databases (CDBs) extend the relational database paradigm to allow working with infinite amounts of data, as long as it is finitely representable, via constraints (section 2). A perfect example is spatiotemporal data, which consists of points in time and/or in space, typical of the new database applications such as medical, scientific, or geographic applications. Besides having constraints for data representation, Constraint Databases also have them as query language primitives, with clean integration into all query language paradigms: procedural (algebraic), declarative (formula-based), or deductive (logic programming). Therefore, CDBs provide a strictly more expressive query paradigm than relational databases, even for traditional “administrative” data.

In the mid-1990s, the Constraint Database framework was a subject of active research by Paris and his students [17, 18, 16, 11, 19]. While his own work was theoretical, Paris was looking forward to practical implementations based on this theory, with the goal of eventual commercial viability. Since then, several Constraint Database systems have been implemented, including MLPQ/GIS [25, 26], C³ [4], and Dedale [13, 27].

These implementation efforts have validated Paris’ Constraint Database ideas, while refining and extending them. In this paper, we discuss some of these extensions, focusing on those that have emerged out of our work with CQA/CDB (section 1.1). The four extensions that we highlight are as follows:

• An extension to the CDB schema to explicitly distinguish traditional from constraint attributes (section 3);
• Additional CQA operators (section 4);
• Multi-attribute indexing systems (section 5);
• Flexibility in representing infinite data (section 6).

Our discussion is in the spirit of honoring Paris’ Constraint Database ideas; we believe that Paris would have been the first to modify the Constraint Database framework if he felt there were a way to improve it.

1.1 CQA/CDB

One of Paris’ last papers, submitted for publication two days before his tragic accident, focused on the role of Constraint Query Algebras in the Constraint Database framework [11]. The CQA/CDB system is based on this work.

CQA/CDB is a prototype rational linear Constraint Database system built from the ground up in Java. It allows the management and querying of heterogeneous data, including traditional relational data as well as spatial and temporal data represented with linear constraints over rational coefficients. While traditional relational data can be viewed as a special case of constraint data (with equality constraints) [18], we have found it necessary to differentiate between traditional and constraint data; this is discussed in section 3.

Just as with other CDB implementations [27], our choice of linear constraints is made for reasons of query evaluation efficiency. The constraint query approach applies to other classes of constraints as well, such as polynomial, but accurate evaluation of algebraic operators becomes more time consuming in this case. To its advantage, a data model based on linear constraints can approximate any spatial extent to an arbitrary accuracy (by making line segments shorter). Since all spatial measurements contain an inherent degree of inaccuracy, the linear constraint representation does not lead to an additional loss of accuracy.

CQA/CDB uses constraint query algebra (CQA) for its query representation, optimization, and evaluation. CQA queries consist of primitive operations that are equivalent to the operations of the relational query algebra. CQA represents an extension of the relational algebra querying paradigm to the constraint setting, with a familiar syntax and a well defined semantics (section 2.4) that allow for uniform handling of heterogeneous data.
Just as with relational algebras in the case of relational databases, CQA is expected to play a key role as the “middle layer” of commercial Constraint Database systems, i.e., the layer underneath the user interface layer and above the disk access layer. Queries can be specified by the user in a higher-level language, perhaps visual or declarative or SQL-based, and translated to CQA. In turn, CQA queries can be optimized for efficient evaluation, through the use of indexing (section 5) and through operator reordering. The design of such a system, centered around CQA, is illustrated in Figure 1.

Figure 1: The design of a Constraint Database system.

There are fundamental differences between CQA/CDB and other prototype Constraint Database systems of which we are aware. Unlike CQA/CDB, Dedale and C³ use commercial object-oriented database systems (O2 and ObjectStore, respectively) for embedding their constraints, providing a special interface for the constraint evaluation. Unlike CQA/CDB, MLPQ/GIS separates relational and constraint data stores and uses approximation and conversion modules for transitioning between formats; also, it relies on Datalog for query representation.

We set out to implement a Constraint Database system from the bottom up. Our goal was to create a system that can handle both non-spatial and spatial data in a homogeneous fashion and that provides the full functionality of a relational database for spatial data. Our reliance on CQA for query representation, optimization, and evaluation is unique and, we believe, most promising as a foundation of a feasibility study around all issues of linear Constraint Databases, from user interfaces to data access methods.

1.2 Overview

In section 2, we discuss the CDB framework, as defined by Paris [18]. This is followed by definitions of the CDB data model and the CQA query language.

In section 3, we show that the semantics of Constraint Databases lead to an inconsistency when queried in the presence of missing attributes. We fixed this inconsistency in CQA/CDB by enriching the CDB relational schema with an extra flag for each attribute, resulting in heterogeneous databases. A Hurricane application served as a case study of a heterogeneous database for CQA/CDB.

In section 4, we extend CQA by introducing two new spatial operators, Buffer-Join and k-Nearest. Both are examples of whole-feature operators, which return a relation keyed by spatial feature IDs. While common spatial operators such as distance result in queries that are unsafe (section 2.4), queries involving whole-feature operators are guaranteed to be safe.

Then, we focus on indexing structures for CDBs, showing in section 5 that there can be advantages to providing a single multidimensional indexing structure, rather than having a separate one-dimensional index structure for each attribute to be indexed, as was originally discussed in the Constraint Database literature [11, 19]. Given a heterogeneous relation R over attributes α(R), finding heuristics for choosing the subsets of α(R) whose attributes are to be indexed together is an interesting open question.

Finally, in section 6 we point out that Paris’ contribution to Constraint Databases can be reinterpreted in a “constraint-neutral” fashion. As a result, the CDB framework can afford the flexibility to incorporate other means for data representation in addition to, or instead of, constraints. We argue that for spatial databases, there are advantages to choosing data representation based on geometry rather than constraints.

2. CONSTRAINT DATABASES

This section provides the foundation for the rest of the paper. We discuss the principles behind the Constraint Database framework and define the constraint data model and the constraint query algebra (CQA).

2.1 Constraint Database Framework

Relational database technology is not suited for “nontraditional” data intensive applications, such as scientific or geographical or medical data repositories and query systems. Constraint Databases address this shortcoming of the relational paradigm by enriching both the data model and the query primitives with constraints.

One can look at the history of Constraint Logic Programming (CLP) to get a glimpse of the potential promise that Constraint Databases hold. Until the mid-’80s, Logic Programming was viewed mostly as an instructional tool, with limited real-world applications. Then, Logic Programming was transformed by applying the following insight:

• the unification mechanism of standard Logic Programming can be regarded as a trivial constraint solver (for equality constraints only).

The resulting CLP technology has revolutionized the field of Operations Research and has found use in other areas where problems can be stated declaratively and can be solved by a combinatorial search approach [31].

In [18], the gap between database programming and constraint solving was bridged with an analogous insight:

• a tuple (or a record or ground fact) in standard relational databases can be regarded as a conjunction of equality constraints on the attributes of the tuple.

Therefore, an appropriate generalization of a tuple is a conjunction of constraints over the tuple attributes. The constraint data model extends the relational data model to allow relations that include infinitely many data points, by replacing the notion of finite data with finitely representable data. A perfect example is spatiotemporal data, which consists of points in time and/or in space [28, 21], typical of the applications mentioned at the beginning of the section.
Besides extending the data model with constraints, Constraint Databases also integrate constraints into the queries, while preserving the efficient bottom-up declarative semantics that enabled relational databases to become such a success. By widely enriching the types of data that can be managed by Database Management Systems, as well as the types of queries that can be expressed in the system, the Constraint Database approach holds a lot of promise for integrating the advantages of traditional relational database technology with emerging data-intensive applications.

2.2 Constraint Database Querying

The framework for Constraint Database (CDB) queries was first proposed in [18], where bottom-up, efficient, declarative database programming was combined with efficient constraint solving. The resulting constraint query calculi (CQC) are a generalization of relational calculus to constraints. For many constraint classes, these calculi have polynomial time data complexity [21]. This use of data complexity, a common tool for studying expressibility in finite model theory, distinguishes the CDB framework from arbitrary, and inherently exponential, theorem proving.

In subsequent work [11], the declarative query approach of CQC was supplemented by an efficient bottom-up evaluation strategy based on the familiar syntax of relational algebra. The new query language, Constraint Query Algebra (CQA), was proven to have equivalent expressiveness to CQC. The algebraic querying paradigm is procedural rather than declarative, since the algebraic expressions represent a ‘plan’ or a ‘recipe’ for evaluating a query. Examples of CQA queries are presented in section 3.3.

Relational Algebras play a very important role in database theory, since they are more useful than the calculi for carrying out query optimization. For this reason, it is typical that declarative user queries are translated into algebraic expressions before they are optimized and evaluated by the relational database system. It is expected that CQAs will prove just as ubiquitous in Constraint Database systems as Relational Query Algebras are in relational database systems.

2.3 Constraint Data Model

We present a summary of the CDB framework, based on [18]. This framework encompasses all classes of constraints, rather than just the linear case.

Definition 1. A constraint k-tuple t is a set of constraints on k variables (attributes), where these variables range over a set (domain) D. The formula for t, denoted φ(t), is the conjunction of the constraints in t; the semantics of t is the set of assignments to the attributes which satisfy φ(t).

There are many kinds of constraint tuples, depending on the kind of constraints used. In all cases, equality constraints among individual variables and constants are allowed, to ensure that Constraint Databases constitute a proper extension of relational databases.

Example 1. Let (3, 4) be a tuple of arity 2 in a (traditional) relation R. It is a single point in two-dimensional space, representable also as R(x, y) with x = 3 and y = 4, where x, y range over some finite set. This illustrates the insight that relational tuples are a special case of constraint tuples, over equality constraints. R(x, y) with (x = y ∧ x < 2) is also a constraint tuple of arity 2, and so is R(x, y) with x + y = 2.5, where x, y range over the rational or the real numbers.

Definition 2. A constraint relation R of arity k is a finite set of constraint k-tuples, with each k-tuple over the same variables. The formula for R, denoted φ(R), is the disjunction of the formulas for the constraint tuples in R; two constraint relations are equivalent if their formulas are equivalent. The semantics of R is the set of assignments to the attributes which satisfy φ(R); equivalent relations have the same semantics. A Constraint Database is a finite set of constraint relations.

φ(R) is a first-order formula in disjunctive normal form (DNF) of constraints, which uses at most k variables ranging over the set D. Each constraint relation R is thus a finite representation, as a set of constraint tuples, of a possibly infinite set of k-ary tuples (or points in k-dimensional space D^k) corresponding to those variable assignments which satisfy φ(R). Given one such point, or tuple, t, we use the notation R(t) to indicate that t is in the semantics of R.

The syntax and semantics of constraint query calculi (CQC) for CDBs can be found in [18]. We summarize them as follows:

Definition 3. The syntax of CQC is the union of a relational database query language and formulas in a decidable logical theory. The semantics of CQC is based on that of the decidable logical theory, by interpreting database atoms as shorthands for formulas of the theory. Examples of such theories are the theory of real closed fields [30, 24] and the theory of dense order with constants [8].

2.4 Constraint Query Algebra

We will assume familiarity with the syntax and semantics of relational algebra as presented in [15]. In particular, it is based on a small number of primitive operations on relations: project, select, natural-join, union, rename, and difference. Note that Relational Algebra is equivalent to the domain-independent Relational Calculus for both finite and infinite (i.e., unrestricted) relations [15].

Constraint Query Algebra is based on the same set of operators as relational algebra, reinterpreted over constraint relations. Let X be a finite set of attributes, and R be a constraint relation whose attributes are α(R). For each algebraic operation, in clause (1) we have the syntax of the expression; in clause (2) we have the conditions required of the arguments and the result arity; in clause (3) we have the semantics of the operation, i.e. the set of tuples which satisfy the expression.

Project: (1) E = π_X(R) is the projection of R on X. (2) X ⊆ α(R) and α(E) = X. (3) {t[X] : R(t)}, where t is a mapping from α(R) to values in D, and t[X] is the restriction of t to X.

Select: (1) E = ς_ξ(R) is the selection on R by ξ. (2) ξ is a conjunction of constraints over α(R) and α(E) = α(R). (3) {t : R(t) and ξ(t) is true}.

Natural-Join: (1) E = R1 ⋈ R2 is the (natural) join of R1 and R2. (2) α(E) = α(R1) ∪ α(R2). (3) {t : R1(t[α(R1)]) and R2(t[α(R2)])}. Remark: Both Cross-Product and Intersection are special cases of this Natural-Join, so we do not include them in our set of operators.

Union: (1) E = R1 ∪ R2 is the union of R1 and R2. (2) α(R1) = α(R2) and α(E) = α(R1). (3) {t : R1(t) or R2(t)}.

Rename: (1) E = ϱ_{B|A}(R) is the renaming in R of A into B. (2) A ∈ α(R), B ∉ α(R) and α(E) = (α(R) − {A}) ∪ {B}. (3) {t : for some t′ such that R(t′), t[B] = t′[A] and t[C] = t′[C] for each C ∈ α(R) − {A}}.

Difference: (1) E = R1 − R2 is the difference of R1 and R2. (2) α(R1) = α(R2) and α(E) = α(R1). (3) {t : R1(t) and ¬R2(t)}.

Examples of CQA queries can be found as part of the Hurricane case study (section 3.3). Further details for this case study, as well as two others, are available at the CQA/CDB web site [1].

Note that CQA is equivalent to the domain-independent CQC [21]. Also note that the framework of [18] imposes the following critical requirement on queries:

• For each input, the queries must be evaluable in closed form.

This means that the output must be representable in the same form (e.g. the same constraint class) as the input. The analogue for the relational model is that relations are finite structures, and queries are supposed to preserve this finiteness. This requirement can therefore be viewed as an extension of the notion of query safety to the Constraint Database setting. We will return to the issue of constraint query safety in section 4.

2.5 The Closure Principle

The principle of semantic closure for CQA operations was articulated in [11]. It enables us to treat algebraic operators syntactically, as concrete operations over some specific data representation, and yet be able to argue about their correctness at the semantic level, which corresponds to unrestricted database relations over traditional data. The distinction between the syntax and semantics of the operators parallels that between the syntactical (set-of-constraint-tuples) and the semantical (set-of-points) views of constraint data itself (section 2.3).

Given a concrete syntactic description of an operator (which translates into its implementation in a straightforward fashion), one proves correctness by showing that this operator would have the desired semantics, i.e. that the results are the same as they would be for equivalent relational algebra expressions over the corresponding (infinite) sets of points.

Also, since our set of CQA operators is the same as for the relational calculus, as a corollary of semantic closure for each operator, we obtain the equivalence of the Constraint Algebra and Constraint Calculus [11]. Another corollary of semantic closure is that the semantics of any CQA query remain the same if we substitute any constraint relation in the query expression with another, equivalent, one. We shall make use of this corollary in section 3.1.
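As a rough illustration of the constraint data model and of some of the CQA operators defined in this section, the following Python sketch handles only a very restricted constraint class: each attribute is bounded independently by a closed interval (axis-parallel boxes), which is a special case of linear constraints. The dictionary representation and all function names are assumptions made for this sketch; they are not taken from CQA/CDB.

```python
from fractions import Fraction

INF = float("inf")

# Assumed representation (not the CQA/CDB one):
# a constraint tuple maps attribute -> (lo, hi), meaning lo <= attribute <= hi;
# an attribute missing from the mapping is unconstrained;
# a constraint relation is a list of tuples, read as a disjunction.

def bounds(t, attr):
    """Interval of attr in tuple t; the whole line if attr is unconstrained."""
    return t.get(attr, (-INF, INF))

def conjoin(t1, t2):
    """Conjunction of two tuples, or None when it is unsatisfiable."""
    out = {}
    for attr in set(t1) | set(t2):
        lo = max(bounds(t1, attr)[0], bounds(t2, attr)[0])
        hi = min(bounds(t1, attr)[1], bounds(t2, attr)[1])
        if lo > hi:
            return None
        out[attr] = (lo, hi)
    return out

def select(rel, cond):
    """Select: conjoin the condition with every tuple, drop empty results."""
    joined = (conjoin(t, cond) for t in rel)
    return [t for t in joined if t is not None]

def natural_join(r1, r2):
    """Natural-Join: pairwise conjunction over the union of attributes."""
    joined = (conjoin(t1, t2) for t1 in r1 for t2 in r2)
    return [t for t in joined if t is not None]

def project(rel, attrs):
    """Project: for satisfiable boxes, dropping other attributes is exact."""
    return [{a: iv for a, iv in t.items() if a in attrs} for t in rel]

def union(r1, r2):
    """Union: the disjunction of both tuple sets."""
    return r1 + r2

def member(rel, point):
    """R(t): does the point satisfy at least one constraint tuple?"""
    return any(all(lo <= point[a] <= hi for a, (lo, hi) in t.items())
               for t in rel)

R = [{"x": (0, 2), "y": (0, 1)}, {"x": (5, 5), "y": (7, 7)}]
S = [{"y": (1, 9), "z": (4, 4)}]

print(select(R, {"x": (1, 3)}))
print(natural_join(R, S))
print(member(R, {"x": Fraction(3, 2), "y": Fraction(1, 2)}))
```

Every operator here returns a list of boxes, so the output stays in the same constraint class as the input, in the spirit of the closed-form requirement. Difference is omitted because the difference of two boxes is in general a union of several boxes with strict bounds, which this closed-interval representation cannot express.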

3. CDB FOR HETEROGENEOUS DATA

In this section, we show that the semantics of Constraint Databases (Definition 2) are inconsistent with relational semantics when queries involve attributes not present in the tuple. After identifying this missing attribute inconsistency, we present a way to avoid it that was implemented in CQA/CDB: enriching the schema of Constraint Databases with a flag that indicates the type of each attribute, relational or constraint. The result is heterogeneous databases. We then present the Hurricane Database as a case study in heterogeneous databases.

3.1 The Missing Attribute Inconsistency

“Upward compatibility” of Constraint Databases with traditional relational databases is one of the requirements of the Constraint Database framework; it is a corollary of the principle that Constraint Databases are an extension of relational databases [18].

Upward compatibility: Given the same query, a constraint relation R over equality constraints should have the same output as the corresponding traditional relation R.

This is expected to hold because, in the case of traditional (administrative) data, its constraint representation and its relational extent coincide. However, when a constraint tuple involves only a subset of the relation’s attributes, the semantics no longer coincide:

• For relational databases, a missing attribute is assumed to have a null value, distinct from all values in the domain; e.g., if an employee’s age is missing and we ask “whose age is 40?”, it would be wrong to return that employee.
• For constraint attributes, it follows from Definitions 1 and 2 that a missing attribute admits all values in the domain.

We refer to these semantics as narrow and broad, respectively.

Example 2. Consider the following constraint relation R over attributes {x, y}, consisting of a single tuple:
R: {(x = 1)},
and let Q be a query over R:
Q: ς_{y=17}(R).
According to the broad interpretation dictated by Definitions 1 and 2, R is equivalent to R′:
R′: {(x = 1, −∞ < y < ∞)};
therefore, Q should evaluate to {(x = 1, y = 17)}. By contrast, if R represented a traditional relation, missing attribute semantics dictate the narrow interpretation, and Q should instead evaluate to ∅ (the empty set).

Note that missing attribute values are akin to incomplete or indefinite information, where the information is underspecified if not unknown. Incomplete information can be specified by constraints, and has been discussed in the context of constraint databases [25, 20]. Just as for unknown data, the semantics of this constraint specification is different from constraint tuples. The semantics is disjunctive rather than conjunctive; one of the values satisfying the constraints is correct, rather than all of them, as for constraint tuples.

Proposition 1. The semantics of Constraint Databases are inconsistent with relational semantics when queries involve attributes not present in the tuple.

Proof. See Example 2.

As we have shown, this inconsistency depends not on the query or the data, but on the interpretation of semantics for the attribute missing from the tuple, narrow or broad. Note that this inconsistency is not restricted to queries involving the select operator. Queries involving natural-join or intersect can exhibit the same inconsistency.

3.2 Heterogeneous Data Model

There are multiple approaches to avoiding the inconsistency identified in section 3.1, including:

1. Set one of these interpretations as the default; when the other interpretation is needed, it is specified explicitly. In the example of R above, we should instead state: R: {(x = 1), (y = null)}.
2. Specify the desired interpretation of each attribute as part of the schema. In the example of R above, its schema would state that y is a relational attribute, implying a narrow interpretation in case of missing attribute values.

The second approach was adopted by CQA/CDB, as follows:

1. For each attribute in the constraint relational schema, we introduce a flag that indicates whether the corresponding attribute is “constraint” or “relational”; we refer to it as the C/R flag.
2. When a constraint tuple t has a constraint attribute A that does not participate in any constraints, it is interpreted broadly, that is: true ⊨ φ(t) ∧ (A = a), for all values a ∈ D.
3. When a constraint tuple t has a relational attribute A that does not participate in any constraints, it is interpreted narrowly, that is: true ⊭ φ(t) ∧ (A = a), for any value a ∈ D.

Example 3. Let R = {(x = 1), (y = 1), (x = 17, y = 17)}, where the schema of R is [x : relational, y : constraint]. The asymmetry of this schema leads to an asymmetric (but consistent) interpretation of queries:
ς_{x=17}(R) returns {(x = 17, y = 17)}
ς_{y=17}(R) returns {(x = 1, y = 17), (x = 17, y = 17)}.

The “dual” behavior of attributes depending on the C/R flag led us to refer to Constraint Databases augmented with this flag as the heterogeneous data model.

Claim: Unlike the constraint data model, the heterogeneous data model is completely upwardly compatible with the relational data model.

For the rest of the paper, we will be working with the heterogeneous data model rather than the constraint data model. While this solution avoids the need to explicitly specify the meaning of any tuple that is missing an attribute, note that there are other advantages in having such a flag as part of the schema. Attribute type plays a role, for example, in establishing variable independence [5]; if an attribute is known to be relational, it is automatically independent of all other attributes.

3.3 Hurricane Database: a Case Study

We now present the Hurricane database as a case study of the heterogeneous data model and queries in CQA/CDB. It consists of three relations whose schemas are as follows:

Land [landID: string, relational; x, y: rational, constraint]
Landownership [name: string, relational; t: rational, constraint; landID: string, relational]
Hurricane [t, x, y: rational, constraint]

In the schemas above, t represents time and x, y represent spatial coordinates. Land contains the ID and the extent for all land parcels; Landownership is a cadastral relation, representing who owned which parcel of land and when; and Hurricane represents the path of a hurricane, showing when each segment of the path was traversed. An example instance of this database is illustrated in Figure 2.

Figure 2: An instance of the Hurricane Database.

Non-traditional data is represented with rational linear constraints. A data model based on linear constraints can approximate any spatial extent to an arbitrary accuracy, by making line segments shorter. For example, it is assumed that in a real database, the hurricane path and the land parcel boundaries would contain many more segments, to approximate the actual geometry to an arbitrary accuracy. Note that, even though approximation is involved in representing the actual data in the CQA/CDB system, there is no approximation involved in evaluating CQA/CDB queries; the output preserves the data and query semantics accurately. Also note that the constraint query approach applies to other classes of constraints as well (such as polynomial), but accurate evaluation of algebraic operators becomes more time consuming in this case. Since all spatial measurements contain an inherent degree of inaccuracy, this approach does not lead to an additional loss of accuracy.

We now present five typical queries over the Hurricane Database. Note that instead of using the operator symbols as described in section 2.4, we use their English equivalents in CQA/CDB. This allows queries to be representable in ASCII, for portability of the system. Also note that CQA/CDB queries are broken up into multiple steps, similar to the approach in [7]. All relation names except for the original three (Land, Landownership, Hurricane) represent intermediate relations; the last step of the query produces the query output.

Query 1: who owned Land A and when
R0 = select landID=A from Landownership
R1 = project R0 on name, t

Query 2: all landIDs that hurricane passed
R0 = join Hurricane and Land
R1 = project R0 on landID

Query 3: names of those whose land was hit by the hurricane between time 4 and 9
R0 = join Landownership and Land
R1 = select t>=4, t [...]

[...] b. In this case, the query box for searching the multi-dimensional index is: −∞ < x < a, b < y < ∞. By contrast, if there were separate indices for the two attributes, a separate search would be needed for each index, followed by intersecting the resulting sets of tuples.

Furthermore, suppose that for each of the two constraints above, the selectivity is very low; that is, about half of all the tuples in the relation intersect with x < a and about half intersect with y > b. However, suppose that very few tuples satisfy both of these constraints simultaneously. In this case, the advantage of our approach becomes very pronounced, reducing the search time from linear to logarithmic in the size of the data.
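The contrast between the two search strategies can be modeled with a small Python sketch. It makes several simplifying assumptions that are not in the text: tuples are plain points rather than constraint tuples, no real index structure is built, and the "cost" is simply the number of candidate tuples each strategy retrieves before producing its answer. The function names are hypothetical.

```python
def separate_search(points, a, b):
    """One search per attribute, then intersect the two candidate sets."""
    by_x = {i for i, (x, _) in enumerate(points) if x < a}
    by_y = {i for i, (_, y) in enumerate(points) if y > b}
    # Cost model (an assumption): every candidate from both searches
    # has to be retrieved before the intersection can be computed.
    return by_x & by_y, len(by_x) + len(by_y)

def joint_search(points, a, b):
    """One search with the query box -inf < x < a, b < y < inf."""
    hits = {i for i, (x, y) in enumerate(points) if x < a and y > b}
    # Cost model (an assumption): only tuples inside the box are retrieved.
    return hits, len(hits)

# About half of the points satisfy x < 50 and about half satisfy y > 50,
# but only the last point satisfies both constraints.
points = [(i, i) for i in range(100)] + [(10, 90)]

sep_hits, sep_cost = separate_search(points, 50, 50)
joint_hits, joint_cost = joint_search(points, 50, 50)
print(sep_hits == joint_hits, sep_cost, joint_cost)  # True 101 1
```

Both strategies return the same answer, but under this cost model the separate searches retrieve roughly as many candidates as there are tuples, while the joint search retrieves only the single tuple inside the query box. A real multidimensional index would also visit internal nodes, which this sketch ignores.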

5.4 Experiments with Indexing Structures

We performed a series of experiments, querying relations of two attributes x, y and comparing the performance of two indexing strategies for these queries: joint index, where there is a single indexing structure for both attributes, and separate index, where each attribute has its own index. An R*-tree was used as the index data structure.

The data is heterogeneous; that is, attributes can be either relational (having a single value for any given tuple) or constraint (having a range of possible values). Queries can involve either one attribute or both. For the joint index, when querying only one attribute, the bound of the other attribute is set from minimum to maximum. Altogether, there were five experiments, for different combinations of relations and queries:

expt. 1-A: both x, y are constraint attributes; queries involve both attributes
expt. 1-B: both x, y are relational attributes; queries involve both attributes
expt. 2-A: both x, y are constraint attributes; queries involve one attribute
expt. 2-B: both x, y are relational attributes; queries involve one attribute

Prior to running the experiments, we randomly generated a data file and a query file as follows:

1. Randomly generate 10,000 bounding boxes representing data tuples, with height and width in [1, 100]; store them in the data file.
2. Randomly generate 100 queries, which are rectangles of height and width in [1, 100]; store them in the query file. For experiment 3, generate 500 queries.
3. All rectangles are obtained by randomly generating (a) the upper-left coordinates, and (b) the height and width of each rectangle. All coordinates are in [0, 3000].

For queries over one attribute, we graphed the number of disk accesses against the query length. For queries over two attributes, we graphed the number of disk accesses against the query area. Next, we describe the experiments and analyze their outcomes.
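The generation procedure above could be reproduced along the following lines. This is a sketch under stated assumptions: the text does not say which distribution was used, whether values were integers, or whether rectangles were clipped to the coordinate range, so the code draws uniform real values and constrains only the upper-left corner to [0, 3000]. The seed, the function name, and the tuple layout are likewise hypothetical.

```python
import random

def random_rectangles(n, rng, coord_max=3000, side_min=1, side_max=100):
    """Generate n rectangles as (x, y, width, height) tuples.

    (x, y) is the upper-left corner; all values are drawn uniformly,
    which is an assumption rather than something the text specifies.
    """
    rects = []
    for _ in range(n):
        x = rng.uniform(0, coord_max)
        y = rng.uniform(0, coord_max)
        width = rng.uniform(side_min, side_max)
        height = rng.uniform(side_min, side_max)
        rects.append((x, y, width, height))
    return rects

rng = random.Random(0)                    # fixed seed, chosen arbitrarily
data = random_rectangles(10_000, rng)     # bounding boxes of the data tuples
queries = random_rectangles(100, rng)     # query rectangles

# Quantities the plots are drawn against: query area (two attributes)
# and query length along x (one attribute).
areas = [w * h for (_, _, w, h) in queries]
lengths = [w for (_, _, w, _) in queries]
print(len(data), len(queries))
```

Using a dedicated `random.Random` instance keeps the data file and the query file reproducible from a single seed without touching the global random state.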

5.4.1 Queries involving two attributes

We read in the data file, building (a) one 2-dimensional R∗ tree for all the bounding boxes, and (b) two 1-dimensional R∗ trees for the x-bounds and y-bounds of all boxes, respectively. Then we ran the queries from a randomly generated query file against both indexing structures and measured the number of disk accesses. In the case of two one-dimensional R∗ trees, the overall number of disk accesses was the sum of the numbers for the two subqueries. The graph plotting disk accesses against the area of the query bounding box is shown in Figure 4.

The graph shows that for both relational and constraint attributes, if the query involves both of the attributes, it is more efficient to have them stored in the same index structure. We can also conclude that:

1. When the query area is small, using a joint index for both attributes gives a larger improvement for constraint attributes than for relational attributes.
2. The disk access count depends on query selectivity (query area) much less in the case of a joint index than in the case of separate indices.
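The measurement procedure of this experiment can be imitated with a simplified stand-in for the index. The sketch below does not implement an R∗ tree: it bulk-loads a static packed R-tree in sort-tile order and counts the nodes visited per query as a proxy for disk accesses. The node capacity, the data generator, and all function names are assumptions made for illustration.

```python
import math
import random

CAPACITY = 50  # assumed number of entries per node (page)

def mbr(rects):
    """Minimum bounding rectangle; a rect is a tuple of (lo, hi) pairs."""
    return tuple((min(r[d][0] for r in rects), max(r[d][1] for r in rects))
                 for d in range(len(rects[0])))

def overlaps(a, b):
    return all(a[d][0] <= b[d][1] and b[d][0] <= a[d][1] for d in range(len(a)))

def pack(entries):
    """Group (rect, payload) entries into nodes of at most CAPACITY entries."""
    def center(entry, d):
        lo, hi = entry[0][d]
        return lo + hi  # twice the midpoint, which is enough for ordering
    ordered = sorted(entries, key=lambda t: center(t, 0))
    if len(entries[0][0]) == 2:  # tile by x first, then sort each slab by y
        pages = math.ceil(len(ordered) / CAPACITY)
        slab = CAPACITY * math.ceil(math.sqrt(pages))
        slabs = [ordered[s:s + slab] for s in range(0, len(ordered), slab)]
        ordered = [e for part in slabs
                   for e in sorted(part, key=lambda t: center(t, 1))]
    groups = [ordered[i:i + CAPACITY] for i in range(0, len(ordered), CAPACITY)]
    return [(mbr([rect for rect, _ in g]), g) for g in groups]

def build(rects):
    """Bulk-load a tree; leaf entries carry tuple ids, inner entries carry children."""
    level = pack([(r, i) for i, r in enumerate(rects)])
    while len(level) > 1:
        level = pack(level)
    return level[0]

def search(node, query):
    """Return (ids of matching tuples, number of nodes visited)."""
    ids, accesses = set(), 1
    for rect, payload in node[1]:
        if overlaps(rect, query):
            if isinstance(payload, int):
                ids.add(payload)
            else:
                child_ids, child_accesses = search((rect, payload), query)
                ids |= child_ids
                accesses += child_accesses
    return ids, accesses

rng = random.Random(0)
boxes = []
for _ in range(10_000):
    x, y = rng.uniform(0, 2900), rng.uniform(0, 2900)
    boxes.append(((x, x + rng.uniform(1, 100)), (y, y + rng.uniform(1, 100))))

joint = build(boxes)                       # (a) one 2-dimensional tree
x_index = build([(b[0],) for b in boxes])  # (b) two 1-dimensional trees
y_index = build([(b[1],) for b in boxes])

query = ((1000.0, 1050.0), (2000.0, 2050.0))
joint_ids, joint_accesses = search(joint, query)
x_ids, x_accesses = search(x_index, (query[0],))
y_ids, y_accesses = search(y_index, (query[1],))
separate_ids = x_ids & y_ids
separate_accesses = x_accesses + y_accesses  # sum over the two subqueries
```

Repeating the last block for queries of different areas and recording `joint_accesses` and `separate_accesses` gives data of the kind plotted in Figure 4, although absolute numbers from this simplified structure are not comparable to those of an R∗ tree.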

5.4.2 Queries involving one attribute

In the second set of experiments, we queried only one of two attributes, using either a joint index or two separate ones. The results are shown in Figure 5. Based on these experiments, we can conclude that, as expected, it is better to have separate indices when queries only use one attribute. However, this advantage is not as significant as the advantage of joint indices when queries use both attributes.

In general, an index structure for a constraint relation is over some subset of the relation’s attributes. While the CDB framework has focused on subsets of size one (single attributes), we focus on the maximal subset of all attributes. However, it is possible to consider the problem of designing index structures for a constraint relation in the most general setting; to our knowledge, this is an open problem:

Problem: Given a constraint relation over attributes X = {x1 , . . . , xk }, determine a set of subsets of X that should correspond to indices over X, with one index per subset.

As our discussion in section 5.3 shows, the selectivity of various attributes and the kinds of queries that are “typical” will need to be considered when solving this problem.

6. TAKING CONSTRAINTS OUT OF CDBS

In this section, we rethink the main insight behind the Constraint Database framework – it’s not about constraints! We then point out some disadvantages of the constraint representation for spatial data, and argue that a geometric representation, such as the one used in the vector model, is preferable for spatial data.

6.1 From Two Layers to Three

The constraint data model extends the relational data model to allow relations that include infinitely many data points, by replacing the notion of finite data with finitely representable data. The result is a new (syntactic) database layer for the finite data representation, sandwiched between the logical (semantic) layer and the physical layer (section 2.5).

This additional layer in Paris’ Constraint Database framework builds on the separation of layers that was introduced in Codd’s relational databases two decades earlier [6]. While the pre-relational database models required queries to be aware of the physical data layout, the relational model separated this physical layer from the logical layer. The queries were based on the latter, achieving a clean semantics for the system, while freeing the query system from dependence on physical data layout.

Constraint Databases provide a further separation of the logical layer into the finite representation and its infinite relational extent. The semantics of Constraint Database queries are based on the latter, while the evaluation strategies are concerned with the former. By separating the (infinite) relational semantics of the data from its finite representation, we obtain a database framework that can handle nontraditional data, while providing clean query semantics.

While Paris chose constraints for the finite representation, it is in principle possible to choose alternate representations for the middle layer of the CDB framework. In fact, the choice of representation for this middle layer should be transparent to the user, since query formulation, which is based on the top layer, should not be affected. We next argue that constraints do not always make the best choice of representation, especially for spatial data.

6.2 Disadvantages of Constraints

Consider linear spatial features such as roads, rivers, or hurricane trajectories. These features can be viewed as consisting of many connected line segments, and their constraint representation consists of many constraint tuples, one for every segment. Each of these tuples involves three constraints: one for the line collinear with the segment, one for its starting point, and one for the ending point.

Furthermore, consider non-linear spatial features, such as lakes, towns, or temperature zones. These features can be viewed as enclosed in an outline consisting of many line segments, typically concave in many places. The constraint data model requires us to represent such a feature as a union of convex polyhedra, each of which is represented as a constraint tuple.
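To make the per-segment cost concrete, the sketch below converts a polyline into one constraint tuple per segment, each consisting of the three constraints named above. It is an illustration only and not the CQA/CDB encoding: the `(a, b, op, c)` tuple layout, read as a·x + b·y op c, and the function names are assumptions. Rational arithmetic is used to match the rational linear constraint setting.

```python
from fractions import Fraction

def segment_constraints(p, q):
    """Return three linear constraints (a, b, op, c) describing segment pq."""
    x1, y1, x2, y2 = map(Fraction, (*p, *q))
    if (x1, y1) == (x2, y2):
        raise ValueError("degenerate segment")
    # Line collinear with the segment: (y2 - y1)*x - (x2 - x1)*y = constant.
    a, b = y2 - y1, -(x2 - x1)
    line = (a, b, "=", a * x1 + b * y1)
    # One bound per endpoint, taken along x, or along y for a vertical segment.
    if x1 != x2:
        lo, hi = sorted((x1, x2))
        axis = (Fraction(1), Fraction(0))
    else:
        lo, hi = sorted((y1, y2))
        axis = (Fraction(0), Fraction(1))
    return [line, (*axis, ">=", lo), (*axis, "<=", hi)]

def satisfies(point, constraints):
    """Check whether a point satisfies every constraint of one tuple."""
    x, y = map(Fraction, point)
    ops = {"=": lambda l, r: l == r,
           ">=": lambda l, r: l >= r,
           "<=": lambda l, r: l <= r}
    return all(ops[op](a * x + b * y, c) for a, b, op, c in constraints)

def polyline_tuples(points):
    """One constraint tuple per segment of a polyline."""
    return [segment_constraints(p, q) for p, q in zip(points, points[1:])]

road = [(0, 0), (2, 1), (2, 4)]       # hypothetical linear feature
road_tuples = polyline_tuples(road)   # two segments, two constraint tuples
```

A polyline with n points therefore yields n − 1 tuples and 3(n − 1) constraints, and the endpoint bounds of adjacent segments repeat the same coordinates.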

Figure 4: Querying both attributes.

Figure 5: Querying one attribute.

The representation of both types of features above involves two types of redundancy:

1. Should the relation include attributes other than the spatial extent, these attributes are duplicated for each of the constraint tuples representing the same feature.
2. The constraints representing the boundaries of each line segment or convex polyhedron are the same as for the tuples representing neighboring segments or polyhedra.

The first type of redundancy can be avoided in multiple ways:

• Dedale chose to depart from the relational model and use the nested model instead [13, 27]. The constraint parts of all tuples representing the same feature are grouped into a set and stored as one nested attribute value; the non-spatial attributes for each feature are only stored once, together with this nested value. The nest and unnest operators in Dedale are necessary to work with this data model.
• In CQA/CDB, the same redundancy is minimized by using spatial constraint relations (section 4.2), where the feature ID is the only non-spatial attribute in the relation.

However, the second type of redundancy is unavoidable as long as constraints are used to represent the spatial data.

Another disadvantage of the constraint-based data representation is that it is far removed from the representation used for data input and (visual) output in Geographic Information Systems, requiring costly conversions in each direction:

• The spatial data in Geographic Information Systems is normally obtained by digitization [32]. When digitizing a linear feature or a boundary of a region, streams of points and segments are generated. To use the constraint model, we must first convert these points to the constraint representation.
• When displaying a feature as part of data visualization or query output, the reverse conversion must take place. In order to display a feature, its boundary points have to be computed from the constraints. The spatial outlines corresponding to each tuple must be found and combined together to obtain the feature boundary.

A good alternative would be to combine the vector and constraint representations in a system that can handle both of them. For relations where the advantages of the constraint representation are not needed, less symbolic alternatives such as the vector representation (section 4) may be preferable, for the reasons discussed above. Query evaluation would be performed directly over the vector representation:

Example 8. If a two-dimensional region is stored directly as a sequence of points that outline it, the region’s projection onto either of the dimensions can be obtained by taking the appropriate coordinate of each point, and finding the extrema over all points.
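Example 8 amounts to a few lines of code when the outline is available as a point sequence. The sketch below is illustrative; the function name `project` and the sample outline are assumptions.

```python
def project(outline, axis):
    """Project a region, stored as a sequence of outline points, onto one axis.

    Returns the closed interval (minimum, maximum) of the chosen coordinate.
    """
    coords = [point[axis] for point in outline]
    return (min(coords), max(coords))

lake = [(2, 1), (6, 0), (9, 4), (5, 3), (3, 7)]  # hypothetical concave outline
x_extent = project(lake, 0)
y_extent = project(lake, 1)
```

No decomposition into convex pieces is needed, even though the outline is concave.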

7. SUMMARY

Constraint Databases (CDBs) extend the relational database paradigm to allow working with data with infinite semantics, as long as it is finitely representable, via constraints. A perfect example is spatiotemporal data, which consists of points in time and/or in space, typical of new database applications such as medical, scientific, or geographic ones. CDBs provide a strictly more expressive query paradigm than relational databases, even for traditional “administrative” data.

To reach Paris’ dream of making CDBs commercially viable, his work on CDB theory needs to be validated through practical implementations. Towards that goal, we developed CQA/CDB, a prototype rational linear Constraint Database system built from the ground up in Java. CQA/CDB allows the management and querying of heterogeneous data, including traditional relational data as well as spatial and temporal data that is represented with linear constraints over rational coefficients.

This paper has discussed four extensions to the CDB framework that we believe necessary, based on our experience with CQA/CDB, as well as other CDB systems. In particular, we have shown that the semantics of Constraint Databases lead to an inconsistency when queried in the presence of missing attributes. We fixed this inconsistency in CQA/CDB by enriching the CDB relational schema with an extra flag for each attribute, resulting in heterogeneous databases. A Hurricane application served as a case study of a heterogeneous database for CQA/CDB.

We have also extended CQA by introducing two new spatial operators, Buffer-Join and k-Nearest. Both of these are examples of whole-feature operators, which return a set of feature IDs. While common spatial operators such as distance result in queries that are unsafe (section 2.4), queries involving whole-feature operators are guaranteed to be safe.

Then, we focused on indexing structures for CDBs, showing that there can be advantages to providing a single multidimensional indexing structure, rather than having a separate one-dimensional index structure for each attribute to be indexed, as was originally discussed in the Constraint Database literature [11, 19]. Given a heterogeneous relation R over attributes α(R), finding heuristics for choosing the subsets of α(R) whose attributes are to be indexed together is an interesting open question.

Finally, we point out that Paris’ contribution to Constraint Databases can be reinterpreted in a “constraint-neutral” fashion. As a result, the CDB framework can afford the flexibility to incorporate other means for data representation in addition to, or instead of, constraints. We argue that for spatial databases, there are advantages to choosing a data representation based on geometry rather than constraints.

We hope that these extensions to the CDB framework bring us closer towards realizing the promise that constraint database technology holds for integrating the advantages of traditional relational database technology with emerging data-intensive applications. We believe that Paris would have been the first to modify the Constraint Database framework if he felt there were a way to improve it, and we are honored to be doing this work.

8. REFERENCES

[1] CQA/CDB Project - Demos. http://www.cse.uconn.edu/cdb/demos.html
[2] N. Beckmann, H.P. Kriegel, R. Schneider, B. Seeger. The R∗-tree: an efficient and robust access method for points and rectangles. Proc. 1990 ACM SIGMOD Int’l Conf. Management of Data, pp. 322–331, May 1990, Atlantic City, New Jersey, United States.
[3] T. Brinkhoff, H.P. Kriegel, R. Schneider, B. Seeger. Multi-step processing of spatial joins. Proc. 1994 ACM SIGMOD Int’l Conference on Management of Data, pp. 197–208, May 1994, Minneapolis, Minnesota, United States.
[4] A. Brodsky, V. E. Segal, and P. A. Exarkopoulo. The CCUBE constraint object-oriented database system. CONSTRAINTS, An International Journal, 51(1), pp. 26–52, 1995.
[5] Jan Chomicki, Dina Goldin, Gabriel Kuper, and David Toman. Variable Independence in Constraint Databases. To appear in IEEE Transactions on Knowledge and Data Engineering, 2003.
[6] E.F. Codd. A Relational Model for Large Shared Data Banks. In Communications of the ACM 13(6), pp. 377–387, 1970.
[7] R. Elmasri and S.B. Navathe. Fundamentals of Database Systems, 3rd edition. Addison Wesley, 2000.
[8] J. Ferrante and J.R. Geiser. An Efficient Decision Procedure for the Theory of Rational Order. Theoretical Computer Science, 4, pp. 227–233, 1977.
[9] S. I. Gass. Linear Programming, Fifth Edition. McGraw-Hill Inc., New York, 1985.
[10] D.Q. Goldin and P.C. Kanellakis. On Similarity Queries for Time-Series Data: Constraint Specification and Implementation. Proc. 1st Int’l Conf. on the Principles and Practice of Constraint Programming, LNCS 976, pp. 137–153. Cassis France, September 1995.
[11] D.Q. Goldin and P.C. Kanellakis. Constraint Query Algebras. Constraints Journal, 1(1/2), pp. 45–83, 1996.
[12] Dina Goldin, Ayferi Kutlu, Mingjun Song, Fuzheng Yang. The Constraint Database Framework: Lessons Learned from CQA/CDB. In Proc. Int’l Conf. on Data Engineering (ICDE), 2003.
[13] S. Grumbach, P. Rigaux, L. Segoufin. The DEDALE System for Complex Spatial Queries. Proc. ACM SIGMOD, June 1998.
[14] R.H. Gueting. An Introduction to Spatial Database Systems. VLDB Journal 3 (1994), pp. 357–399.
[15] P.C. Kanellakis. Elements of Relational Database Theory. Handbook of Theoretical Computer Science, J. van Leeuwen editor, volume B, chapter 17, pp. 1073–1158. North-Holland, 1990.
[16] P.C. Kanellakis, D.Q. Goldin. Constraint Programming and Database Query Languages. Symposium on Theoretical Aspects of Computer Software, LNCS 789, pp. 96–120, Sendai Japan, April 1994.
[17] P.C. Kanellakis, G.M. Kuper, and P.Z. Revesz. Constraint Query Languages. Proc. 9th ACM PODS Symposium on the Principles of Database Systems (PODS), Nashville Tennessee USA, pp. 299–314, March 1990.
[18] P.C. Kanellakis, G.M. Kuper, and P.Z. Revesz. Constraint Query Languages. J. Computer and System Sciences, 51(1), pp. 26–52, 1995.
[19] P. C. Kanellakis, S. Ramaswamy, D. E. Vengroff, J. S. Vitter. Indexing for Data Models with Constraints and Classes. J. Computer and System Sciences, 52(3), pp. 589–612, 1996.
[20] M. Koubarakis. Database Models for Infinite and Indefinite Temporal Information. Information Systems 19:2, pp. 141–173, 1994.
[21] G.M. Kuper, L. Libkin, J. Paredaens. Constraint Databases. Springer Verlag, 2000.
[22] G. Kuper and M. Scholl. Geographic Information Systems, in [21].
[23] A. Mitchell. The ESRI Guide to GIS Analysis, Volume 1: Geographic Patterns & Relationships. ESRI Press, 1999.
[24] J. Renegar. On the Computational Complexity and Geometry of the First-order Theory of the Reals: Parts I–III. Journal of Symbolic Computation, 13, pp. 255–352, 1992.
[25] Peter Revesz et al. The MLPQ/GIS Constraint Database System. Proc. ACM SIGMOD, May 2000.
[26] P. Revesz. Introduction to Constraint Databases. Springer Verlag, New York, 2002.
[27] Philippe Rigaux, Michel Scholl, Luc Segoufin, Stephane Grumbach. Building a Constraint-Based Spatial Database System: Model, Languages, and Implementation. To appear in Information System Journal, 2003.
[28] P. Rigaux, M. Scholl, and A. Voisard. Spatial Databases with Application to GIS. Academic Press, 2002.
[29] A. Schrijver. Theory of Linear and Integer Programming. John Wiley and Sons, 1986.
[30] A. Tarski. A Decision Method for Elementary Algebra and Geometry. University of California Press, Berkeley, California, 1951.
[31] P. Van Hentenryck. Constraint Satisfaction in Logic Programming. MIT Press, 1989.
[32] M. F. Worboys. GIS: A Computing Perspective. Taylor & Francis, 1995.