Dense-Order Constraint Databases
(Extended Abstract)
Stephane Grumbach University of Toronto and INRIA
Abstract
We consider infinite databases which admit a finite representation in terms of dense-order constraints. We study the complexity and the expressive power of various query languages over dense-order constraint databases, allowing order, addition, recursion, or nested sets. We provide in particular an exact characterization of the class of dense-order queries computable in PTIME (data complexity). We also prove that region and graph connectivity queries are not definable with linear constraints. We then investigate complex object models for constraint databases. Complex objects are fundamental to deal with pointsets as first-class citizens. We introduce an active domain semantics, and show that in terms of complexity and expressive power, the characterizations of the calculus for constraint complex objects are similar to the case of the classical complex object calculus.
Jianwen Su
University of California at Santa Barbara

Author notes. S. Grumbach: I.N.R.I.A., Rocquencourt BP 105, 78153 Le Chesnay, France. E-mail: [email protected]. Work supported in part by Esprit Project BRA AMUSING, and an NSERC fellowship in Canada. J. Su: Computer Science Dept., University of California, Santa Barbara, California 93106, USA. E-mail: [email protected]. Work supported in part by NSF grant IRI-9117094 and NASA grant NAGW-3888. A part of this work was done while visiting I.N.R.I.A.

1 Introduction

Constraint databases constitute a very promising model for new database applications such as geographical information systems. The constraints provide a sound mathematical framework to define both data models and query languages. There are many challenging problems from both practical and theoretical points of view. In recent years, there has been a growing interest in constraint-based databases and query languages (e.g., [KKR90, Rev93, Kup93, BJM93, KG94, ACGK94, GS94, PVV94, GST95]).

Constraint database models extend the traditional relational databases [Cod70] to potentially infinite collections of data items under the assumption that the databases admit finite representations. As introduced in the seminal paper by Kanellakis, Kuper and Revesz [KKR90], the basic idea is to generalize the relations by allowing generalized tuples as conjunctions of constraints. For instance, $x \le y \wedge x \ge 0$ defines a binary generalized tuple. A generalized, or "finitely representable", relation is then a finite set of such tuples. In the rational plane, it results in an infinite set of points, or tuples of atomic values, over $\mathbb{Q}^k$. The relational calculus over finitely representable relations constitutes a constraint query language which admits a declarative semantics and an efficient bottom-up evaluation in closed form [KKR90].

There have been very few theoretical results on infinite databases. General results on the completeness of query languages for infinite recursive databases were reported by Hirst and Harel in [HH93], where it is shown in particular that quantifier-free first-order logic is complete on the class of all recursive databases. In [KKR90, KG94], the data complexity of both the relational calculus and inflationary Datalog with negation is studied. It was shown that over dense-order constraint databases, the relational calculus has AC$^0$ data
complexity and inflationary Datalog with negation has PTIME data complexity.

In this paper, we restrict our attention to databases which admit a finite representation with dense-order constraints over the rational numbers $\mathcal{Q} = (\mathbb{Q}, \le)$. We generalize the classical definition of relational queries [CH80] to "dense order" databases and consider queries closed under automorphisms of $\mathcal{Q}$. We study the complexity and the expressive power of query languages for such databases. As shown in [GS94], this topic has hardly been investigated, and there is a serious lack of proof techniques.

We concentrate on three query languages of increasing expressive power: first-order with dense-order constraints, FO; first-order with linear constraints (with a built-in addition), FO$^+$; and inflationary Datalog with negation, Datalog$^\neg$. We first consider their complexity, and show in particular that: (i) FO$^+$ has AC$^0$ data complexity when restricted to input databases defined with integers only [GST95]; and (ii) Datalog$^\neg$ expresses exactly all PTIME constraint queries.

FO and Datalog$^\neg$ express mappings that are closed under automorphisms of $\mathcal{Q}$ (queries). This is not the case for FO$^+$ in general. We therefore restrict our attention to the queries definable in FO$^+$. Queries in FO$^+$ are expressible in Datalog$^\neg$, but we prove that various forms of connectivity (expressible in Datalog$^\neg$) are not expressible in FO$^+$.

On the other hand, a great amount of research has been done on hierarchical database structures, in particular, those constructed using the tuple and set constructs such as the nested or non-first-normal-form relations (e.g., [JS82, FT83, RKS88]), and the complex objects (e.g., [AB87, HS91, GV91]). Query languages for complex objects have been extensively studied, and their complexity is now well understood. Complex objects provide a modeling power which is fundamental in the context of constraint databases.
In "flat" query languages for constraint databases, the universe is restricted to atomic objects, i.e., tuples of rational numbers. Pointsets, i.e., sets of points satisfying some constraints, should also be first-class citizens in constraint query languages. In practical examples, there are properties naturally associated to pointsets and not to individual points (e.g., rainfall, population, etc. in geographical databases). The need for hierarchically structured or aggregation constraints has already been observed [KG94, Kup93, SRR94, SRS94, Rev95, BK95].

We develop a complex object model for constraint databases. Intuitively, "complex constraint objects" are composed from finitely representable sets by the tuple and set constructs. A logic-based query language C-CALC is also proposed which uses the "active domain" semantics. In particular, the language allows quantifying over sets. We study the complexity and expressive power of the new language and show that essentially the results [HS91, GV91] for query languages of classical complex objects carry over to the context of constraint databases. For example, when restricted to flat input and output, C-CALC expresses exactly all constraint queries having hyper-exponential time (space) complexity; the hierarchy based on "set-height" does not collapse. A sub-language using range restriction is also discussed.

The paper is organized as follows. In Sections 2 and 3, the basic concepts such as finitely representable databases, queries, and data complexity are introduced. Section 4 is devoted to the complexity and the expressive power of FO, FO$^+$ and Datalog$^\neg$. In Section 5, the hierarchical constraint database model is introduced, with its query language, along with complexity and expressive power results. Section 6 concludes the paper.
2 Dense Order Databases

Finitely representable databases over dense-order constraints were first studied in [KKR90], and further investigated in particular in [KG94, GS94, GST95]. In the following, we recall the definitions and illustrate them with motivating examples.

We consider a countable first-order language $L$ with equality ($=$) and order ($\le$). Let $\sigma = \{R_1, \ldots, R_n\}$ be a signature (database schema) such that $L \cap \sigma = \emptyset$, where $R_1, \ldots, R_n$ are relation symbols. We distinguish between logical predicates (e.g., $=$, $\le$) in $L$ and relation symbols in $\sigma$. Throughout the whole paper, we consider the structure $\mathcal{Q} = (\mathbb{Q}, \le)$ of the set of rational numbers along with its dense order. Unless stated otherwise, constraints are given in terms of equations or inequalities over the rationals.

Kanellakis, Kuper, and Revesz [KKR90] introduced the concept of a $k$-ary generalized tuple, which is a constraint expressed as a conjunction of atomic formulas in $L$ over $k$ variables. For instance, $(x \le y \wedge x \ge 0 \wedge y \le 10)$ is a binary generalized tuple representing a triangle. A generalized tuple is a finite representation of a potentially infinite set of tuples over rationals. A $k$-ary finitely representable relation (or generalized relation in [KKR90]) is then a finite set of $k$-ary generalized tuples. In this framework, a tuple $(a, b)$ in the classical relational data model [Cod70] is an abbreviation for the formula (constraint) "$(x = a \wedge y = b)$" represented using only the equality symbol "$=$" and constants.

Intuitively, a finitely representable database over $\sigma$ can be seen as an expansion of $\mathcal{Q}$ to $\sigma$, i.e., a structure over the vocabulary $\{\le\} \cup \sigma$, which coincides with $\mathcal{Q}$ on $\{\le\}$. The new relations from $\sigma$ constitute the database. They may be infinite, but they have to be representable in a finite way with $\{=, \le\} \cup \mathbb{Q}$. More formally, we have:
Definition 2.1 A $k$-ary relation $R \subseteq \mathbb{Q}^k$ is said to be finitely representable if there exists a quantifier-free formula $\varphi(x_1, \ldots, x_k)$ in $L \cup \mathbb{Q}$ with $k$ distinct free variables $x_1, \ldots, x_k$ such that for all $a_1, \ldots, a_k \in \mathbb{Q}$,

$$\mathcal{Q} \models \varphi(a_1, \ldots, a_k) \iff (a_1, \ldots, a_k) \in R.$$

Let $\mathcal{A}$ be an expansion of $\mathcal{Q}$ to $\sigma$. The structure $\mathcal{A}$ is said to be finitely representable if for every relation symbol $R$ in $\sigma$, $R^{\mathcal{A}}$ is finitely representable.

If an expansion $\mathcal{A}$ of $\mathcal{Q}$ to $\sigma$ is finitely representable then each set of formulas representing the relations $R^{\mathcal{A}}$ in the restriction $\mathcal{A}|_\sigma$ of $\mathcal{A}$ to $\sigma$ is called a (database) instance over $\sigma$. It is easy to see that the class $K_\sigma$ of instances over $\sigma$ is effectively enumerable. Furthermore, every relation in each instance is recursive. Note that $K_\sigma$ has interesting closure properties. Indeed, it is closed under finite union and intersection and moreover under complement. This differs from finite model theory (the complement of a finite model is not finite). Finally, if $\varphi$ is a sentence in $L \cup \sigma$, we denote by $K_\varphi$ the collection of instances satisfying $\varphi$, that is, for each finitely representable expansion $\mathcal{A}$, if $\mathcal{A} \models \varphi$, then each instance representing $\mathcal{A}|_\sigma$ is in $K_\varphi$.

A finite relation is representable using only equality (and constants). The converse doesn't hold. Nevertheless, a monadic relation is representable with equality only iff it is finite or cofinite. For practical purposes, one can assume that the database contains the quantifier-free formulas defining the relations. The other predicates ($=$, $\le$) are built-in.

We next introduce a syntactic normal form of the instances, called maximal cover. For simplicity, we consider here relations in dimension 2, and assume that $R$ is some binary relation. It will be clear how to generalize the following to relations of arbitrary arity. The binary relation $R$ defines a pointset over $\mathbb{Q}^2$. It can be represented by a finite set of regions of some atomic shape. In dimension 2, there are four types of atomic shapes: (i) isolated point, (ii) line segment, (iii) triangle, and (iv) rectangle (see Figure 1). (Atomic shapes may also have infinite boundaries.)

Definition 2.2 Let $R$ be a $k$-ary finitely representable relation. An atomic shape $S$ of dimension less than or equal to $k$ is covered by $R$ if each point in $S$ is also in $R$. $S$ is maximal in $R$ if it is covered by $R$ and is not contained in any other atomic shape covered by $R$. Finally, the set of all maximal atomic shapes in $R$ is called the maximal cover of $R$.
It is easy to verify that the maximal cover of each finitely representable relation is unique and finite. The finite representation of a relation is in normal form when each tuple denotes a maximal atomic shape. It is important to note that these particular shaped objects can be represented by four constants along with a flag indicating the shape (and boundary conditions). This leads to an efficient encoding of dense-order constraint databases. An alternative encoding was proposed in [KG94].
[Figure 1: Atomic shapes in dimension 2. The drawing shows a point p at (a0, b0), a line segment s from (a1, b1) to (a2, b2), two triangles t given by (a3, b3), (a4, b4) and by (a5, b5), (a6, b6), and a rectangle r between (a7, b7) and (a8, b8).]
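The notions of generalized tuple and finitely representable relation from this section can be illustrated with a minimal Python sketch. The triple encoding of atomic constraints and the names `satisfies` and `member` are assumptions made for illustration only; they are not taken from [KKR90] or from this paper.

```python
from fractions import Fraction
import operator

# Hypothetical encoding: an atomic constraint is (left, op, right), where each
# side is either a variable name (str) or a rational constant (Fraction).
OPS = {"<=": operator.le, "<": operator.lt, "=": operator.eq,
       ">=": operator.ge, ">": operator.gt}


def value(term, point):
    """Look up a variable in the point, or return the constant itself."""
    return point[term] if isinstance(term, str) else term


def satisfies(gen_tuple, point):
    """A generalized tuple is a conjunction: every atomic constraint must hold."""
    return all(OPS[op](value(l, point), value(r, point)) for l, op, r in gen_tuple)


def member(relation, point):
    """A finitely representable relation is a finite set of generalized tuples."""
    return any(satisfies(t, point) for t in relation)


# The triangle x <= y AND x >= 0 AND y <= 10.
TRIANGLE = [("x", "<=", "y"), ("x", ">=", Fraction(0)), ("y", "<=", Fraction(10))]
# A classical tuple (3, 1/2) abbreviates x = 3 AND y = 1/2.
POINT = [("x", "=", Fraction(3)), ("y", "=", Fraction(1, 2))]
R = [TRIANGLE, POINT]
```

Membership of a rational point is then a finite check, e.g. `member(R, {"x": Fraction(1, 3), "y": Fraction(7, 2)})`, even though the relation itself contains infinitely many points.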
3 Queries

The notion of a database query was introduced by Chandra and Harel [CH80] as a mapping $Q$ from (finite) structures over a given signature $\sigma$ to (finite) relations of a fixed arity $n$, which is partial recursive and satisfies the following consistency criterion: if two structures, $A$ and $B$, over $\sigma$ are isomorphic under an isomorphism $\rho$, then $Q(A)$ and $Q(B)$ are also isomorphic under $\rho$. This criterion was then called "genericity" in the database literature.

It is easy to see that automorphisms of $\mathcal{Q}$ preserve the finite representability of structures, that is, if $\mathcal{A}$ is a finitely representable expansion of $\mathcal{Q}$, and $\rho$ is an automorphism of $\mathcal{Q}$ (footnote 1), then $\rho(\mathcal{A})$ is also finitely representable. This allows us to extend naturally the definition of [CH80] to finitely representable databases (where the recursiveness is with respect to the finite representation of the relations).

Footnote 1: where $\rho$ is extended to relations (subsets of $\mathbb{Q}^k$ for some arity $k$) in the natural way.

Definition 3.1 A boolean query $K$ is a partial recursive collection of finitely representable database instances over $\sigma$ closed under automorphisms of $\mathcal{Q}$.

The definition of non-boolean queries is then classical. In [PVV94], different notions of queries were introduced. They are based on consistency criteria related to the geometry in which the database is to be interpreted. We next see that our definition of a query corresponds naturally to a topological concept. Consider the usual topology on the set $\mathbb{Q}$ of rationals. A homeomorphism of $\mathbb{Q}$ is a bicontinuous bijection from $\mathbb{Q}$ onto $\mathbb{Q}$. It maps open sets to open sets and closed sets to closed sets. The following proposition follows from Definition 3.1.
Proposition 3.1 Each boolean query $K$ is closed under homeomorphisms of $\mathbb{Q}$.

Queries over dense-order constraints are thus insensitive to homeomorphic transformations on the axis. This constitutes a tool to prove nondefinability results as shown in the next section. A consequence of Proposition 3.1 is that there are very simple mappings which are not queries, such as "does there exist a line separating two regions in the input" in the rational plane.

This definition of queries may appear at first glance to be very restrictive. Nevertheless, numerous natural mappings such as the parity of the cardinality of a set, the connectivity of a graph or a region, etc. are queries. Examples which are not queries include convex hull, Voronoi diagram, etc. For these latter examples from computational geometry, dense-order constraints are not very appropriate. Instead, linear constraints are necessary. The expressive power of first-order in the case of linear constraint databases has been studied in [ACGK94, GST95].

The "data complexity" of queries is defined as usual based on computational devices and "standard encodings" of the input and output. We first introduce the standard encoding of a database, which is obtained by encoding the quantifier-free formula representing it. Formulas are encoded in the following alphabet:
$$\{\#, (, ), \wedge, \vee, \neg, =, \le, 0, 1, /, x\} \cup \sigma.$$
Natural numbers are encoded in binary notation, and rationals as pairs (fractions) of natural numbers. We illustrate the encoding of a relation with the following example.
Example 3.1 The binary relation $R$ defined by:

$$R(x, y) \equiv (2.75 \le x \le 7 \wedge x > y) \vee (x \le y)$$

is encoded as follows:

$$R \vee (1011/100 \le x1 \wedge x1 \le 111 \wedge \neg x1 \le x0)(x1 \le x0)\#$$

$\Box$
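The encoding of constants and variables in Example 3.1 can be reproduced with a short Python sketch. It is only an illustration: the function names are hypothetical, the `/` separator between numerator and denominator is an assumption about the alphabet, and negative constants are left out because the text does not say how they are written.

```python
from fractions import Fraction


def encode_natural(n):
    """Binary notation for a natural number, e.g. 7 -> '111'."""
    return format(n, "b")


def encode_rational(q, sep="/"):
    """Encode a non-negative rational as a pair (fraction) of binary naturals.

    The separator is an assumption of this sketch. Integers are written
    without a denominator, as 7 is in Example 3.1.
    """
    q = Fraction(q)
    if q < 0:
        raise ValueError("negative constants are outside this sketch")
    if q.denominator == 1:
        return encode_natural(q.numerator)
    return encode_natural(q.numerator) + sep + encode_natural(q.denominator)


def encode_variable(index):
    """A variable is the letter x followed by its index in binary."""
    return "x" + encode_natural(index)
```

For instance, 2.75 equals 11/4, so `encode_rational("2.75")` yields the numerator 1011 and the denominator 100 in binary.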
The size of a database is defined as the size of the encoding of its normal form representation. Let $C$ be a complexity class (such as PTIME, PSPACE, etc.). A query $Q$ is computable in $C$ if there is a Turing machine $M$ (footnote 2), such that $M$ starting on a standard encoding of the input database produces a standard encoding of the output database, in a bounded amount of resources according to $C$ (such as polynomial time, polynomial space, etc.) in the size of the input. For each complexity class $C$, we denote also by $C$ the set of queries over dense-order constraint databases computable in $C$.
4 First-Order Query Languages, with Addition or Recursion

In this section, we consider the following query languages:

FO: first-order with dense-order constraints,
FO$^+$: FO with linear constraints (with a built-in addition, $+$), and
Datalog$^\neg$: inflationary Datalog with negation with dense-order constraints (or similarly inflationary fixpoint logic [AV89, GS86]).
We shall focus on their expressive power and complexity. The language FO over dense-order constraints is essentially first-order logic over the language $\{=, \le\} \cup \sigma$. Datalog with (dense-order) constraints is defined as follows. Constraints are allowed in the bodies of rules. For Datalog$^\neg$, negations are allowed in the bodies of rules. The inflationary semantics is computed by adding after each iteration the set of facts just derived to the set previously obtained. It has been shown in [KKR90] that both FO and Datalog$^\neg$ can be evaluated bottom-up and in closed form, i.e., instances are mapped to instances. FO$^+$ also allows the addition operator, $+$, with the intended semantics. It follows from results in [Tar51] that FO$^+$ can be evaluated bottom-up.

Footnote 2: Or any other device such as a family of boolean circuits in the case of parallel complexity classes.

Remark: We restrict our attention to queries (in the sense of Definition 3.1) over dense-order databases. In the following, FO$^+$ denotes the set of queries definable with $+$. Note that in general, mappings definable in FO$^+$ may not be closed under automorphisms of $\mathcal{Q}$, and moreover non-boolean FO$^+$ queries may have outputs whose representations require addition.

It is easy to see that under the restriction to finitely representable databases, first-order sentences and boolean inflationary Datalog$^\neg$ programs define boolean queries. The theory of dense order without endpoints is decidable and admits quantifier elimination procedures [CK73]. It has been shown in [GS94] that this is a sufficient and necessary condition for first-order logic to define a query language.

The data complexity of queries in first-order logic and in Datalog$^\neg$ has been studied in [KKR90, KG94]. FO queries are computable in AC$^0$ data complexity, FO$^+$ queries in NC, and Datalog$^\neg$ queries in PTIME. Nevertheless, the expressive power of these languages has not been seriously investigated. It is still open whether numerous well-known queries are first-order definable over finitely representable databases.

The next result [GST95] shows that FO$^+$ queries can be evaluated in AC$^0$ over dense-order constraint databases defined using only integers in the constraints. The restriction is harmless since dense-order databases are homeomorphic (transformation on the axis) to databases representable with only integers, and the representation over integers only can be used in practice to avoid the encoding of rationals.
Theorem 4.1 [GST95] FO$^+$ has uniform AC$^0$ data complexity over inputs defined with integers.

The proof relies on the fact that in the evaluation of a query involving addition over dense-order constraint databases whose parameters are integers (no rationals), it is only necessary to do additions of database parameters (which is in AC$^0$), and multiplications by a constant coming from the query (which is also in AC$^0$).

The previous result has fundamental consequences related to the expressive power of linear constraints. We prove that numerous classical queries including various forms of connectivity are not expressible in FO$^+$. We consider (i) classical graph queries, and (ii) spatial queries. For graph queries, the inputs are finite relations over integer values (defined with equality constraints). It was shown in [FSS84] that parity and connectivity of a finite graph are not in AC$^0$. It follows easily that:
Theorem 4.2 The graph connectivity and parity queries are not linear (not in FO$^+$).
For spatial queries, we consider queries over infinite relations. Let $R$ be an infinite binary relation. It can be seen as defining an infinite set of points on the rational plane. The region connectivity query returns yes if $R$ is connected, i.e., every pair of points in $R$ can be linked by a curve contained entirely in $R$. Similarly, we can define other topological queries: the input region has at least (exactly) one hole (a connected region not intersecting $R$ but completely surrounded by points in $R$). When $R$ consists of only line segments, the Eulerian traversal query returns yes if there is a traversal going through each line segment exactly once. Finally, the 2-dimensional homeomorphism query checks if two input binary relations are homeomorphic in dimension 2. Using reductions from the function majority, the following can be shown (for dimension 2).
Theorem 4.3 The following queries are not linear (not in FO$^+$): (i) region connectivity, (ii) at least one hole, (iii) exactly one hole, (iv) Eulerian traversal, and (v) homeomorphism.
We next consider Datalog$^\neg$. Some of the previous queries can easily be defined in Datalog$^\neg$. Consider the (2-dimensional) region connectivity query. The basic idea is to alternate horizontal and vertical "sweeps". We first pick an arbitrary point (e.g., lowest and leftmost) within the input region and store it in a temporary relation $S$ at iteration 0. For odd (even) iterations, we extend $S$ horizontally (vertically) by adding points of $R$ (input) into $S$ which are horizontally (vertically) connected to $S$. The process ends when $S$ stops growing. Finally, if $S$ is the same as $R$, the algorithm stops and answers "the region is connected"; otherwise, it is "not connected". It is clear that this is expressible in Datalog$^\neg$.

The following main result of this section completely characterizes the expressive power of inflationary Datalog$^\neg$.
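The alternating sweeps can be sketched in Python under a strong simplifying assumption: the region is given as a finite set of unit grid cells with integer coordinates, and two cells are connected when they share an edge. This is a discretized illustration of the idea, not the Datalog$^\neg$ program itself, and the names `sweep` and `is_connected` are hypothetical.

```python
def sweep(region, seen, dx, dy):
    """Extend `seen` along direction (dx, dy) and its opposite, staying inside `region`."""
    grown = set(seen)
    for (x, y) in seen:
        for sx, sy in ((dx, dy), (-dx, -dy)):
            cx, cy = x + sx, y + sy
            while (cx, cy) in region:
                grown.add((cx, cy))
                cx, cy = cx + sx, cy + sy
    return grown


def is_connected(region):
    """Alternate horizontal and vertical sweeps from one start cell until S stops growing."""
    region = set(region)
    if not region:
        return True  # convention of this sketch: the empty region is connected
    # Iteration 0: pick the lowest, then leftmost, cell.
    seen = {min(region, key=lambda c: (c[1], c[0]))}
    horizontal = True
    stalled = 0  # number of consecutive sweeps that added nothing
    while stalled < 2:
        direction = (1, 0) if horizontal else (0, 1)
        grown = sweep(region, seen, *direction)
        stalled = stalled + 1 if grown == seen else 0
        seen = grown
        horizontal = not horizontal
    return seen == region
```

The loop stops once one horizontal and one vertical sweep in a row add nothing, which corresponds to the point where $S$ stops growing; the final comparison of `seen` with `region` plays the role of testing whether $S$ equals $R$.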
Theorem 4.4 Inflationary Datalog$^\neg$ = PTIME.

Remark: A similar statement can be found in [KKR90] and also in [KG94], but its meaning is different. What was proved there was that "inflationary Datalog$^\neg$ can express any relational database query computable in PTIME" ([KKR90] Theorem 3.15), i.e., PTIME queries from finite relational databases to finite relations. In other words, it was shown that:

$$\mathrm{PTIME}_{\text{relational input/output}} \subseteq \text{Inflationary Datalog}^\neg \subseteq \mathrm{PTIME}_{\text{dense-order input/output}} = \mathrm{PTIME}$$

In our result, PTIME denotes a set of queries over dense-order constraint databases, and not over relational databases as defined in [CH80].
Proof of Theorem 4.4 (Sketch). The inclusion of inflationary Datalog$^\neg$ in PTIME has been shown in [KKR90]. We only prove the converse inclusion. For the sake of simplicity, we detail the proof in the case of a binary query over a database schema containing a single relation of arity 2. In other words, we work in a space of dimension 2. The technique carries over easily to higher dimensions.
Let $Q$ be a query over dense-order constraints computable in polynomial time and $M$ be a Turing machine that computes $Q$. The proof consists of the following steps:

1. Given a quantifier-free formula $\varphi_I$ representing an input $I$, there is a Datalog$^\neg$ program, encode, which "computes" a relational representation of the maximal cover of $I$, and its encoding on the input tape of the machine.
2. There is a Datalog$^\neg$ program, compute, which computes the encoding of the output starting from an encoding of the input, i.e., the program simulates the computation of $M$.
3. There is a Datalog$^\neg$ program, decode, which computes a relational representation of the output $O$, from its encoding on the output tape of the machine.

Let $I$ be a database instance. Recall that the maximal cover of the instance $I$ consists of a set of points, segments, triangles, and rectangles. The relational representation of $I$ is a (classical) relation of fixed arity 5, where each tuple denotes an atomic shape as follows. Consider the rectangle between the points $(a_7, b_7)$ and $(a_8, b_8)$ (see Figure 1) defined by a conjunction: "$a_7 < x < a_8 \wedge b_7 < y < b_8$". It is represented by a tuple: $[a_7, b_7, a_8, b_8, r]$, where $r$ denotes the type of the object. Points, segments and triangles are represented similarly as follows for the shapes shown in Figure 1:

$$[a_0, b_0, a_0, b_0, p];\quad [a_1, b_1, a_2, b_2, s];\quad [a_3, b_3, a_4, b_4, t];\quad [a_5, b_5, a_6, b_6, t].$$

The symbols $+\infty$ and $-\infty$ are used for non-compact atomic shapes. The following can be easily verified.
Lemma 4.5 There is a Datalog$^\neg$ program which, given a quantifier-free formula $\varphi_I$ representing the input $I$, produces a 5-ary relation, $R_I$, containing the relational representation of the maximal cover of $I$.

The rest of the proof involves techniques which have been used already in the literature, following the results of [Var82, Imm86]. Note that in $I$ and its relational representation, the constants are rational numbers. These rational constants occurring in the relational representation of the input or in the query itself, are encoded into consecutive integers by respecting their order. Zero is zero. The smallest positive (respectively negative) constant occurring in the relation is $1$ (respectively $-1$), and so on. The integers are translated in binary notation ($+\infty$ and $-\infty$ are encoded as bigger integers), and the whole relation is finally encoded on the Turing tape (i.e., a relation representing the tape, whose first attributes denote the indices over the alphabet of the initial relation, and whose last attribute contains the value of the cell over the alphabet $\sigma \cup \{\#, [, ], 0, 1, p, s, t, r\}$, where $\sigma$ is the database schema). $\Box$
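The order-preserving renaming of rational constants into consecutive integers can be sketched in Python. The function name is hypothetical, and the sketch reads "smallest negative constant" as the negative constant closest to zero, which is the reading that keeps the order intact; the treatment of $+\infty$ and $-\infty$ is omitted.

```python
from fractions import Fraction


def rank_constants(constants):
    """Map rational constants to consecutive integers, preserving their order.

    Zero (if present) stays zero, positive constants become 1, 2, ... in
    increasing order, and negative constants become -1, -2, ... moving away
    from zero.
    """
    distinct = {Fraction(c) for c in constants}
    positives = sorted(c for c in distinct if c > 0)
    negatives = sorted((c for c in distinct if c < 0), reverse=True)
    ranks = {}
    if Fraction(0) in distinct:
        ranks[Fraction(0)] = 0
    for i, c in enumerate(positives, start=1):
        ranks[c] = i
    for i, c in enumerate(negatives, start=1):
        ranks[c] = -i
    return ranks
```

Because only the relative order of constants matters for dense-order constraints, replacing each constant by its rank yields a database that is equivalent up to an automorphism of the rationals, while every constant now has a short binary encoding.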
5 Complex Objects

In this section, we extend the constraint database model to constraint complex objects. This allows us to represent nested finitely representable relations and sets. Thus, properties over natural spatio-temporal objects can be easily modeled. We also extend the first-order constraint query language FO to the constraint complex objects framework. The resulting language C-CALC uses an active domain semantics and allows bottom-up evaluation in closed form. The expressive power and complexity of C-CALC is studied.

Nested sets and pointset properties occur naturally in many applications of constraint databases. For example, populations of regions, areas of river basins, and region boundaries, etc. are properties associated with sets of points in the corresponding space. In addition, many of the spatial data models include hierarchical constructs. For example, multi-layered thematic maps may represent many regions and channels which in turn are represented by many atomic spatial objects such as lines and points.

In order to bound the complexity, sets are incorporated into the constraint database framework with restrictions: only finitely representable sets are allowed. For example, a database may have "nested" relations; however, each nested set is represented as "$y = \{(x_1, \ldots, x_n) \mid \varphi(x_1, \ldots, x_n)\}$", where $y$ is a set variable and $x_i$ is an (atomic or set) variable for each $1 \le i \le n$.
Definition 5.1 The family of types is a collection of expressions defined recursively as follows:

1. If $n \ge 1$ and $\forall i \in \{1..n\}$, $T_i = \mathbb{Q}$ or is a set type, then $[T_1, \ldots, T_n]$ is an $n$-ary tuple type; and
2. If $T$ is a tuple type, then $\{T\}$ is a set type.

A type is flat if it does not contain a set constructor. For simplicity we have only tuple and set types in the formal system. However, we may blur the distinction of unary tuple type constructions (e.g., between $[\mathbb{Q}]$ and $\mathbb{Q}$) in informal discussions.
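The recursive shape of Definition 5.1 can be mirrored by a small Python sketch. The class names `Rational`, `SetType` and `TupleType` and the helper `is_flat` are hypothetical names chosen for this illustration; the validation in `__post_init__` simply enforces the two formation rules.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Rational:
    """The atomic type Q."""


@dataclass(frozen=True)
class SetType:
    """Rule 2: {T} is a set type when T is a tuple type."""
    element: "TupleType"

    def __post_init__(self):
        if not isinstance(self.element, TupleType):
            raise TypeError("the element of a set type must be a tuple type")


@dataclass(frozen=True)
class TupleType:
    """Rule 1: [T1, ..., Tn] with n >= 1, each Ti being Q or a set type."""
    components: Tuple[Union[Rational, SetType], ...]

    def __post_init__(self):
        if len(self.components) < 1:
            raise ValueError("a tuple type needs at least one component")
        for c in self.components:
            if not isinstance(c, (Rational, SetType)):
                raise TypeError("tuple components must be Q or set types")


def is_flat(t):
    """A type is flat if it contains no set constructor."""
    if isinstance(t, Rational):
        return True
    if isinstance(t, SetType):
        return False
    return all(is_flat(c) for c in t.components)


Q = Rational()
PLANE_POINT = TupleType((Q, Q))                     # [Q, Q]
REGION_WITH_VALUE = TupleType((SetType(PLANE_POINT), Q))  # [{[Q, Q]}, Q]
```

Here `PLANE_POINT` is flat, while `REGION_WITH_VALUE`, which pairs a planar pointset with a rational, is not. Directly nesting a tuple type inside a tuple type is rejected, as the definition requires.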
Example 5.1 Consider regions and their average rainfall. The information can be represented using a tuple type $T = [\{[\mathbb{Q}, \mathbb{Q}]\}, \mathbb{Q}]$, where the first component (of an object of $T$) stores a region (in dimension 2) and the second represents the average rainfall of the region. $\Box$

We now define "constraint complex objects" and "constraint (complex object) instances". Intuitively, the former are built (inductively) using tuple and set constructors from generalized tuples and finitely representable sets and the latter are finite sets of objects. The constructions are essentially similar to the classical complex object model except that generalized tuples and finitely representable sets are used, and sets may be infinite.

We assume that each variable is associated with a unique type and there is an infinite set of variables for each type. Let $L_c$ be the language extending $L$ with (typed) variables, and logical predicates $\in_T$, $\subseteq_T$, $=_T$ for each type $T$ (when the context is clear, type subscripts are omitted), and necessary delimiters. Constraint complex objects and instances are represented by formulas in $L_c$ ($\in_T$, $\subseteq_T$ are not used).
Definition 5.2 For each type $T$, the domain of $T$, denoted $\mathrm{dom}(T)$, is defined recursively as follows:

1. If $T$ is an $n$-ary flat tuple type, then $\mathrm{dom}(T)$ is the set of all generalized $n$-ary tuples, i.e., all quantifier-free conjunctive formulas in $L$ whose free variables are in $\{x_1, \ldots, x_n\}$.

2. If $T = \{T'\}$ is a set type, then

$$\mathrm{dom}(T) = \left\{ \bigvee_{i=1}^{k} \varphi_i \;\middle|\; k \ge 0,\ \forall i \in \{1..k\},\ \varphi_i \in \mathrm{dom}(T') \right\}.$$

For each formula $\psi \in \mathrm{dom}(T)$ with $n$ (the arity of the type $T'$) distinct free variables $x_1, \ldots, x_n$, let $\{(x_1, \ldots, x_n) \mid \psi\}$ be a $T$-term or set term.

3. If $T = [T_1, \ldots, T_{n_1}, S_1, \ldots, S_{n_2}]$ is a tuple type where $\forall i \in \{1..n_1\}$, $T_i = \mathbb{Q}$ and $\forall i \in \{1..n_2\}$, $S_i = \{S_i'\}$ is a set type, then

$$\mathrm{dom}(T) = \left\{ \varphi \wedge \bigwedge_{i=1}^{n_2} y_i = t_i \;\middle|\; \varphi \in \mathrm{dom}([T_1, \ldots, T_{n_1}]),\ \forall i \in \{1..n_2\},\ t_i \text{ is an } S_i'\text{-term} \right\}.$$

Each element in $\mathrm{dom}(T)$ is called a constraint complex object (or simply c-object) of type $T$; each finite subset of $\mathrm{dom}(T)$ is called a constraint complex object instance (or simply c-instance) of type $T$.
Example 5.2 Let $T = [\{[\mathbb{Q}, \mathbb{Q}]\}, \mathbb{Q}]$ be the type in Example 5.1. A c-object of type $T$ is $(x, y)$: $(x = \{(x_1, x_2) \mid \psi\} \wedge y = 30)$ where $\psi = (1 \le x_1 \le 2 \wedge 2 \le x_2 \le 3)$. $\Box$
Definition 5.3 A complex database schema $\sigma$ is a finite set of tuple types (nested signatures), and a complex constraint database is a mapping $I$ from $\sigma$ to c-instances such that for each $T \in \sigma$, $I(T)$ is a c-instance of $T$.
Note that we only allow tuple types in the schema. The restriction is for technical convenience, since set types can be viewed as unary tuple types.

We next consider a logic-based query language for constraint complex object databases. Syntactically, the query language is defined over the language $L_c$ in a way very similar to the classical complex object calculus presented in [AB87, HS91, GV91] and with the following extension. "Set terms" may be composed in the following manner: if $\varphi$ is a formula with free variables in $\{x_1, \ldots, x_n\}$, then $\{(x_1, \ldots, x_n) \mid \varphi\}$ is a set term. Finally, each formula in the new language defines a query whose free variables specify the answer. We denote this language as C-CALC.
The semantics for C-CALC deserves a careful examination. The primary difficulty comes from the interpretation of set variables. A naive semantics is to allow set variables to range over arbitrary sets. In this case, addition, multiplication, and exponentiation are easily definable. However, it is known that, with the presence of addition, multiplication, and exponentiation, the theory doesn't admit a quantifier elimination procedure [Dri82]. An improvement to the naive approach would be to restrict the range of set variables to only finitely representable sets. For example, a set variable $x$ of type $\{[\mathbb{Q}]\}$ would range over all possible finite sets of rational intervals. The following theorem states that this alternative is not feasible either.
Theorem 5.1 Let $\sigma$ be a schema containing at least one unary (flat) type. There exists a boolean query $\varphi$ in C-CALC such that, under the finitely representable set semantics for set variables, the set of database instances satisfying $\varphi$ is not recursive.
Proof: (Sketch) Suppose $R$ is the input relation and $M$ is a Turing machine with the input alphabet $\{a\}$. Since $R$ is unary, it defines a finite set of intervals over $\mathbb{Q}$. Now we can construct a query $Q$ such that $Q$ returns $R$ itself if $M$ halts on $a^n$, and the empty set otherwise, where $n$ is the number of intervals in $R$. The construction of $Q$ is similar to that presented in the proof of similar results for the classical complex object calculus with invention [HS91, HS93]. $\Box$

In this paper, we propose a semantics for C-CALC which is analogous to the active domain semantics for the classical complex object calculus [AB87, HS91, GV91]. In particular, under this semantics, the range of each set variable consists of a finite number of c-objects. The number and also the actual c-objects depend on the input database. Specifically, these c-objects are constructed from a finite set of "active" constants (the active domain in the traditional sense is possibly infinite) and possibly operations analogous to the "powerset" if the sets are nested.

We use the notion "active domain" (footnote 3) to capture the set of active constants. The notion uses maximal covers and is defined informally here. Let $R$ be a c-instance of type $[\mathbb{Q}, \ldots, \mathbb{Q}]$. The active domain of $R$, $\mathrm{adom}(R)$, is the set of constants used in the representation of the maximal cover of $R$.

Footnote 3: We keep the same name.

In general, a c-instance $\{\varphi_1, \ldots, \varphi_k\}$ of type $T$ can also be viewed as a set term $t = \{(x_1, \ldots, x_n) \mid \bigvee_i \varphi_i\}$. Note that the formulas $\varphi_i$ may also include set terms. We now recursively define the active domain of a set term. Let $t$ be the following (set) $T$-term:
$$\{(x_1, \ldots, x_n, y_1, \ldots, y_k) \mid \bigvee_i (\varphi_i \wedge \bigwedge_j y_{i,j} = t_{i,j})\}$$
where the $x_i$'s are variables ranging over rational numbers and the $y_j$'s are set variables. Then, the active domain of $t$, $adom(t)$, is defined to be the union:
$$adom(t) = adom(R) \cup \bigcup_{i,j} adom(t_{i,j})$$
where $R = \{(x_1, \ldots, x_n) \mid \bigvee_i \varphi_i\}$ is a flat relation. Finally, the active domain of a c-instance is the active domain of the corresponding set term and the active domain of a complex constraint database $I$, $adom(I)$, is the union of the active domains of all c-instances in $I$.
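As an illustration of the recursive definition above, the following Python sketch computes the active domain of a nested set term. The class names `Atom`, `Disjunct`, and `SetTerm` and the encoding of constraints are assumptions made for this example and are not notation from the paper. The sketch also assumes that the atoms supplied for each flat part are already those of its maximal cover; it does not compute maximal covers.

```python
from dataclasses import dataclass, field
from fractions import Fraction
from typing import List, Set, Tuple, Union

# A term in an atomic constraint: a variable name or a rational constant.
Term = Union[str, Fraction]


@dataclass(frozen=True)
class Atom:
    """Dense-order atomic constraint `left op right` (op is one of <, <=, =)."""
    left: Term
    op: str
    right: Term


@dataclass
class Disjunct:
    """One disjunct: the conjunction phi_i plus bindings y_{i,j} = t_{i,j}."""
    atoms: List[Atom]
    set_bindings: List[Tuple[str, "SetTerm"]] = field(default_factory=list)


@dataclass
class SetTerm:
    """Set term given as a disjunction of Disjunct objects."""
    disjuncts: List[Disjunct]


def adom_flat(disjuncts: List[Disjunct]) -> Set[Fraction]:
    """Constants of the flat relation R obtained by dropping the set bindings.

    Assumption: the atoms are already those of the maximal cover of R.
    """
    constants: Set[Fraction] = set()
    for disjunct in disjuncts:
        for atom in disjunct.atoms:
            for term in (atom.left, atom.right):
                if isinstance(term, Fraction):
                    constants.add(term)
    return constants


def adom(term: SetTerm) -> Set[Fraction]:
    """adom(t) = adom(R) united with adom(t_{i,j}) over all i, j."""
    result = adom_flat(term.disjuncts)
    for disjunct in term.disjuncts:
        for _variable, subterm in disjunct.set_bindings:
            result |= adom(subterm)
    return result


# Inner set term: the interval 0 <= x < 1.
inner = SetTerm([Disjunct([Atom(Fraction(0), "<=", "x"),
                           Atom("x", "<", Fraction(1))])])
# Outer set term: z = 5/2 together with the binding y = inner.
outer = SetTerm([Disjunct([Atom("z", "=", Fraction(5, 2))], [("y", inner)])])
print(sorted(adom(outer)))
```

The recursion mirrors the union in the definition: constants of the flat part are collected first, then the constants of every nested set term are added.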
Definition 5.4 Let $C$ be a set of constants. A c-object $\varphi$ is $C$-representable if each constant occurring in $\varphi$ is in $C$.

If $C$ is finite, it is easy to verify that for each type $T$, the set of $C$-representable c-objects of type $T$ is also finite. Let $\sigma$ be a schema and $I$ a complex constraint database of $\sigma$. The semantics of C-CALC is then defined in the usual manner except that for each variable $x$ of a type $T$, the range of $x$ is the set $\{\varphi \in dom(T) \mid \varphi \text{ is } adom(I)\text{-representable}\}$. When restricted to flat input schemas and (second order) set variables ranging over relations, our semantics is in the spirit of quantifying over "cells" [Col75, KY85].

In the remainder of the section, we consider the expressive power and complexity of the language C-CALC. We focus here only on queries over dense-order constraint databases ("flat" complex constraint databases). We show that the results on classical complex object languages [HS91, GV91] carry over to the complex constraint database framework. (Proofs are provided in the full paper.)

We first define the set-height [HS91] of a type to be the maximal number of set constructs in a path from the root to a leaf in a syntax tree of the type. Similar to [HS91], let C-CALC$_i$ be the set of all C-CALC queries $Q$ whose input and output are flat and such that the set-height of each type used in $Q$ is at most $i$. Intuitively, queries in C-CALC$_i$ can use variables whose types have at most $i$ levels of nested sets. C-CALC$_0$ is FO. Theorem 5.2 below characterizes the expressive power of C-CALC$_1$.
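The set-height definition can be made concrete with a small Python sketch. The encoding of types used here (the string `"Q"` for the rationals, `("tuple", [...])` for a tuple type, and `("set", T)` for a set type) is an assumption chosen for illustration and is not the paper's syntax.

```python
def set_height(type_expr):
    """Maximal number of set constructs on a root-to-leaf path of the type."""
    if type_expr == "Q":
        # A leaf of the syntax tree: no set construct below it.
        return 0
    kind, argument = type_expr
    if kind == "set":
        # A set construct adds one level to the height of its element type.
        return 1 + set_height(argument)
    if kind == "tuple":
        # A tuple construct adds nothing; take the deepest component.
        return max((set_height(component) for component in argument), default=0)
    raise ValueError(f"unknown type constructor: {kind!r}")


flat = ("tuple", ["Q", "Q"])                   # [Q, Q]
intervals = ("set", ("tuple", ["Q"]))          # {[Q]}
nested = ("set", ("tuple", ["Q", intervals]))  # {[Q, {[Q]}]}

print(set_height(flat), set_height(intervals), set_height(nested))
```

Under this encoding, a query that uses only the types `flat` and `intervals` would fall in C-CALC$_1$, while a variable of type `nested` requires C-CALC$_2$.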
Theorem 5.2 PTIME $\subseteq$ C-CALC$_1$ $\subseteq$ PSPACE.
Proof: (Sketch) The upper bound result is obtained similarly to the classical complex object case [HS91] but focusing on maximal covers. To see the lower bound, it is sufficient to observe that one can simulate fixpoint computation with one level of set nesting [AB87]. $\Box$

Let $H_i$ ($i \ge 0$) be families of functions from $\mathbb{N}$ to $\mathbb{N}$ such that $H_0 = \{p \mid p \text{ is a polynomial}\}$ and $H_i = \{2^f \mid f \in H_{i-1}\}$ for $i \ge 1$. For a family of functions $F$, we denote by $F$-TIME (respectively $F$-SPACE) the set of queries over dense-order databases having time (respectively space) complexity $g \in F$.
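To give a feel for how quickly the families $H_i$ grow, the following Python sketch evaluates the member of $H_i$ obtained from a polynomial by $i$ successive exponentiations. The function name `tower` and the choice of the polynomial $p(n) = n$ are assumptions made for this example.

```python
def tower(level, polynomial, n):
    """Evaluate at n the function in H_level built from `polynomial`.

    H_0 contributes polynomial(n); each further level replaces the
    current value v by 2**v, following H_i = { 2^f | f in H_{i-1} }.
    """
    value = polynomial(n)
    for _ in range(level):
        value = 2 ** value
    return value


def identity(n):
    # The polynomial p(n) = n, used only as a sample member of H_0.
    return n


for level in range(4):
    print(level, tower(level, identity, 2))
```

With $p(n) = n$ and $n = 2$, the values for levels 0 through 3 are 2, 4, 16, and 65536, so each level of the hierarchy corresponds to one more exponential in the resource bound.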
Theorem 5.3 For each $i \ge 0$, $H_i$-TIME $\subseteq$ C-CALC$_{i+1}$ $\subseteq$ $H_i$-SPACE.
Following from the above theorem and the space (time) hierarchy results [GJ79], the family of languages C-CALC$_i$ ($i \ge 0$) forms a non-collapsing hierarchy. In particular, this gives a separation of C-CALC$_{i+2}$ from C-CALC$_i$ for each $i \ge 0$. It is known that, in the similar hierarchy for classical complex objects, each level of nested sets yields strictly more expressive power [HS91]. Using a reduction to the classical case, the above separation can be further improved:
Theorem 5.4 For each $i \ge 0$, C-CALC$_i$ $\subsetneq$ C-CALC$_{i+1}$.
Let E be the set of constraint queries having hyper-exponential time (space) complexity.
Corollary 5.5 C-CALC $= E = (\bigcup_i H_i)$-TIME $= (\bigcup_i H_i)$-SPACE.

We can also extend C-CALC with fixpoint and while constructs similarly to [KKR90, GV91]. The following can be shown:
Theorem 5.6 For each $i \ge 0$, C-CALC$_i$ + fixpoint = $H_i$-TIME and C-CALC$_i$ + while = $H_i$-SPACE.

Before we end the section, we briefly discuss another approach to incorporating sets into constraint databases. This approach, called "range restriction", uses syntactic conditions on formulas to ensure that set values assigned to set variables are only from the input database. The range restriction rules are defined similarly to those for classical complex objects in [GV91]. For example, one rule states that if $R(x_1, \ldots, x_n)$ is an atomic formula, then $x_1, \ldots, x_n$ are range restricted. Due to space limitations, we do not list the conditions here. For range restricted queries, the active domain semantics coincides with the natural semantics, i.e., set variables range over all possible sets. However,
Theorem 5.7 Range-restricted C-CALC $\subsetneq$ C-CALC.
6 Conclusions

We studied the complexity and expressive power of query languages for (nested) dense-order constraint databases. Some of the results do not hold in other contexts. In particular, Theorem 4.4 doesn't carry over to the case of discrete orders. It has been shown in [Rev93] that stratified Datalog$^{\neg}$ can express any Turing computable function. Nevertheless, we conjecture that query languages capturing PTIME queries over other classes of constraints can be obtained with bounded fixpoint. The results presented in this paper have consequences beyond the scope of dense-order constraint databases. Indeed, the non-definability results carry over for non-restricted classes of models. Finally, we enumerate some open problems:
Separation of FO and FO+.
Define a PTIME language for linear constraint databases.
Can connectivity be defined with constraints over a richer language?
Acknowledgments

The authors wish to thank Paris Kanellakis for motivating discussions on this topic, and Christophe Tollu for technical comments.
References

[AB87] S. Abiteboul and C. Beeri. On the power of languages for the manipulation of complex objects. In Proc. Int. Workshop on Theory and Applications of Nested Relations and Complex Objects (extended abstract), Darmstadt, 1987. INRIA research report 846.
[ACGK94] F. Afrati, S. Cosmadakis, S. Grumbach, and G. Kuper. Expressiveness of linear vs. polynomial constraints in database query languages. In Proc. Workshop on the Principles and Practice of Constraint Programming, 1994.
[AV89] S. Abiteboul and V. Vianu. Fixpoint extensions of first-order logic and datalog-like languages. In Proc. 4th Symp. on Logic in Computer Science, pages 71–79, 1989.
[BJM93] A. Brodsky, J. Jaffar, and M. J. Maher. Towards practical constraint databases. In Proc. Int. Conf. on Very Large Data Bases, pages 567–580, 1993.
[BK95] A. Brodsky and Y. Kornatzky. The LyriC language: Querying constraint objects. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1995.
[CH80] A. K. Chandra and D. Harel. Computable queries for relational data bases. Journal of Computer and System Sciences, 21(2):156–78, 1980.
[CK73] C. C. Chang and H. J. Keisler. Model Theory, volume 73 of Studies in Logic. North-Holland, 1973.
[Cod70] E. F. Codd. A relational model of data for large shared data banks. Communications of ACM, 13(6):377–387, 1970.
[Col75] G. E. Collins. Quantifier elimination for real closed fields by cylindric decompositions. In Proc. 2nd GI Conf. Automata Theory and Formal Languages, volume 35 of Lecture Notes in Computer Science, pages 134–83. Springer-Verlag, 1975.
[Dri82] L. Van den Dries. Remarks on Tarski's problem concerning (R, +, ·, exp). In Logic Colloquium, North-Holland, 1982. Elsevier.
[FSS84] M. Furst, J. B. Saxe, and M. Sipser. Parity, circuits, and the polynomial-time hierarchy. Math. Syst. Theory, 17:13–27, 1984.
[FT83] P. C. Fischer and S. J. Thomas. Operators for non-first-normal-form relations. In Proc. IEEE Computer Software Applications Conference, pages 464–475, 1983.
[GJ79] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, 1979.
[GS86] Y. Gurevich and S. Shelah. Fixed-point extensions of first-order logic. Annals of Pure and Applied Logic, 32:265–280, 1986.
[GS94] S. Grumbach and J. Su. Finitely representable databases (extended abstract). In Proc. 13th ACM Symp. on Principles of Database Systems, 1994.
[GST95] S. Grumbach, J. Su, and C. Tollu. Linear constraint databases. In Proc. LCC, 1995. To appear in LNCS Springer-Verlag volume.
[GV91] S. Grumbach and V. Vianu. Tractable query languages for complex object databases. In Proc. ACM Symp. on Principles of Database Systems, pages 315–327, 1991.
[HH93] T. Hirst and D. Harel. Completeness results of recursive data bases. In Proc. 12th ACM Symp. on Principles of Database Systems, pages 244–252, 1993.
[HS91] R. Hull and J. Su. On the expressive power of database queries with intermediate types. Journal of Computer and System Sciences, 43(1):219–267, August 1991.
[HS93] R. Hull and J. Su. Algebraic and calculus query languages for recursively typed complex objects. Journal of Computer and System Sciences, 47(1):121–56, August 1993.
[Imm86] N. Immerman. Relational queries computable in polynomial time. Inf. and Control, 68:86–104, 1986.
[JS82] B. Jaeschke and H. J. Schek. Remarks on the algebra of non first normal form relations. In Proc. ACM Symp. on Principles of Database Systems, 1982.
[KG94] P. C. Kanellakis and D. Q. Goldin. Constraint programming and database query languages. In Proc. 2nd Conference on Theoretical Aspects of Computer Software (TACS), April 1994. To appear in LNCS Springer-Verlag volume.
[KKR90] P. Kanellakis, G. Kuper, and P. Revesz. Constraint query languages. In Proc. 9th ACM Symp. on Principles of Database Systems, pages 299–313, Nashville, 1990.
[Kup93] G. M. Kuper. Aggregation in constraint databases. In Proc. Workshop on the Principles and Practice of Constraint Programming, pages 176–183, April 1993.
[KY85] D. Kozen and C. Yap. Algebraic cell decomposition in NC. In Proc. IEEE Foundations of Computer Science, pages 515–521, 1985.
[PVV94] J. Paredaens, J. Van den Bussche, and D. Van Gucht. Towards a theory of spatial database queries. In Proc. 13th ACM Symp. on Principles of Database Systems, pages 279–88, 1994.
[Rev93] P. Z. Revesz. A closed form for datalog queries with integer (gap)-order constraints. Theoretical Computer Science, 116(1):117–149, 1993.
[Rev95] P. Z. Revesz. Datalog queries of set constraint databases. In Proc. Int. Conf. on Database Theory, 1995.
[RKS88] M. A. Roth, H. F. Korth, and A. Silberschatz. Extended algebra and calculus for nested relational databases. ACM Transactions on Database Systems, 13(4):389–417, 1988.
[SRR94] D. Srivastava, R. Ramakrishnan, and P. Z. Revesz. Constraint objects. In Proc. Workshop on the Principles and Practice of Constraint Programming, 1994.
[SRS94] D. Srivastava, K. A. Ross, and P. J. Stuckey. Foundations of aggregation constraints. In Proc. Workshop on the Principles and Practice of Constraint Programming, 1994.
[Tar51] A. Tarski. A Decision Method for Elementary Algebra and Geometry. University of California Press, Berkeley, California, 1951.
[Var82] M. Vardi. Relational queries computable in polynomial time. In Proc. 14th ACM Symp. on Theory of Computing, pages 137–146, 1982.