Towards Practical Constraint Databases (Extended Abstract)

Stéphane Grumbach, I.N.R.I.A., Rocquencourt BP 105, 78153 Le Chesnay, France, stephane.grumbach@inria.fr

Abstract We develop a framework for (real) constraint databases based on finite precision arithmetic which fulfills the main requirements of practical constraint databases. First, it allows the manipulation of approximate values, standard in scientific applications. More importantly, it permits the extension of the relational calculus with aggregate functions, while preserving the fundamental property of closed form evaluation with PTIME data complexity. This is an important step since the initial model of [KKR90] cannot be extended to aggregate functions. Moreover, finite precision computation plays a central role in efficient query processing. We introduce the finite precision semantics of queries and prove expressive power results concerning it. We then present a new constraint query language, CALCF , which includes aggregate and analytical functions, and show that it admits a closed form evaluation in PTIME.

1 Introduction

Since their introduction by Kanellakis, Kuper and Revesz [KKR90], constraint databases have generated a rapidly growing interest in the research community. Until now, the primary focus has been on constraint database models [KKR90, PVV94, GS95a], and the expressive power and complexity analysis of the corresponding query languages [ACGK94, GST94, GS95b, PVV95, BDLW95]. Some fundamental aspects have been insufficiently addressed, such as (i) implementation issues, and (ii) the adequacy of the constraint data models and query languages to

*Work supported in part by an NSERC Fellowship in Canada. Part of this work was done while visiting UCSB.
†Work supported in part by NSF grant IRI94-11330 and NASA grant NAGW-3888.

Jianwen Su Computer Science Department University of California Santa Barbara, California 93106, USA [email protected]

the intended applications. In this paper, we initiate an investigation into both issues and propose a general constraint database paradigm based on a solid mathematical foundation and engineered towards realistic applications. Constraint databases integrate database technology with constraint solving to deal with new applications such as spatial or geographical applications and those requiring arithmetic computations. Implementation of constraint databases is a challenge, and very little has been done yet. Indexing techniques for constraint data [KRVV93], and more general techniques for real-world data (in the application range of constraint models) [FK94], have been studied. Constraint solving is a large and interdisciplinary field, with efficient methods developed for specific tasks in constraint logic programming, operations research, graphics and visualization, etc. Large sets of constraints in databases will require specific algorithms. Following the paradigm of [KKR90], a constraint relation is represented by a quantifier-free formula over some arithmetical domain. We consider in this paper constraints over the real numbers. The seminal argument exploited in [KKR90] was to show that relational calculus queries over constraint databases could be evaluated in closed form (i.e., the output of a query is also represented by a quantifier-free formula) with PTIME data complexity. Query evaluation is performed by quantifier elimination; its tractability follows from the tractability of quantifier elimination (with a fixed number of variables) in the theory of real closed fields [Tar51, Col75, BKR86, Ren92a]. The constraint database paradigm has a promising potential to be a fundamental basis for new database models, but extensions are necessary to fulfill the needs of applications. There are essential shortcomings. First, quantifier elimination is not a sufficient means to evaluate queries. Indeed, the data is always represented in terms of

a boolean combination of polynomial functions. Often, one is interested in the actual numerical values (obtained for instance as the roots of polynomial equations). Our first result is to show that the numerical evaluation (extraction of values from formulas) can be done with PTIME data complexity. In practice, these roots need to be approximated. The underlying machinery for implementing constraint databases cannot avoid intensive numerical computations and approximated values. This is a significant difference from relational database implementations, where symbolic computation is sufficient. Another limitation of the initial approach is that extensions of the relational calculus with new functions (such as aggregate functions) do not satisfy the quantifier elimination property [Dr82], and therefore there is no procedure to evaluate queries in closed form, as already observed in [Kup93]. On the other hand, without aggregate functions, the applicability of the constraint database approach is drastically reduced. In this paper, we introduce a new framework based on a semantics with finite precision arithmetic. We show that although in general the finite precision semantics offers less expressive power than the general semantics with full arithmetic over the reals, under some interesting restrictions the two semantics have the same power. Moreover, we prove that tractable recursion can be easily added under the finite precision semantics, which contrasts sharply with the general semantics. We then show that under the finite precision semantics, general functions (e.g., exponentiation, trigonometric functions, integrals, etc.) can be safely added to the relational calculus, therefore permitting the extension to a large set of aggregate functions. These theoretical results lead to the definition of a new constraint data model, CALCF, integrating approximate values and a large class of analytical functions.
We prove our main result that query evaluation can be done in closed form and PTIME data complexity. The paper is organized as follows. In Section 2, we discuss informally the need for extending the existing constraint database approach. In Section 3, constraint databases are defined, the core of constraint query evaluation is briefly described, and we prove the tractability of numerical evaluation. We then analyze the constraint relational calculus under finite precision arithmetic in Section 4, and propose a practical constraint query language in Section 5. Brief conclusions

are drawn in Section 6.

2 Motivations

In this section, we explain the various facets of query evaluation in the context of constraint databases and illustrate the need to extend current models (formal definitions can be found in Section 3). We restrict the scope of the paper to constraints over real numbers only. Constraint relations allow the encoding of potentially infinite sets of points using constraints over the reals. For example, the expression

S(x, y) ≡ (4x² − y − 20x + 25 ≤ 0) defines a binary relation S ⊆ ℝ². Simple queries can be performed on S, such as checking if a specific point (α, β) is contained in S. This is easily done by evaluating the numerical value of the polynomial expression at this point.
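Such a membership test reduces to a single polynomial evaluation; a minimal sketch (the predicate name `in_S` is ours, not from the paper):

```python
# Membership test for the constraint relation
#   S(x, y) ≡ 4x² − y − 20x + 25 ≤ 0.

def in_S(x: float, y: float) -> bool:
    """Check whether the point (x, y) satisfies the defining constraint of S."""
    return 4 * x**2 - y - 20 * x + 25 <= 0

# The vertex of the parabola y = (2x − 5)² is (2.5, 0), so (2.5, 0) lies in S:
print(in_S(2.5, 0.0))   # → True
print(in_S(0.0, 0.0))   # → False: 25 ≤ 0 fails
```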

[Figure: the relation S, the region above the parabola 4x² − 20x + 25 = y, shown together with the line y = 9.]

The relational calculus allows the definition of more complex queries. Consider for instance

Q(x) ≡ ∃y (S(x, y) ∧ y ≤ 0).

The evaluation of such a query generally requires the following three steps (Figure 1).

1. INSTANTIATION: Replace the symbol S in Q by its definition. This leads to a formula over the reals with no database relation symbols: ∃y (4x² − y − 20x + 25 ≤ 0 ∧ y ≤ 0).

2. QUANTIFIER ELIMINATION: Express the set of reals satisfying the query by eliminating the quantified variables, using only the variable x. This results in a new constraint relation, which happens to be an equation for this example: 4x² − 20x + 25 = 0.

3. NUMERICAL EVALUATION: Solve the resulting system(s) of equation(s) to obtain the values of the variable(s) which satisfy the previous expression. Here it results in a unique root, x = 2.5.

Step 1 is a purely syntactic manipulation. Step 2 is based on a partly symbolic technique, effective in the case of real numbers with polynomial constraints. Step 3 is purely numerical.
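For this particular query the steps can be carried out by hand: eliminating ∃y from (y ≥ 4x² − 20x + 25 ∧ y ≤ 0) leaves 4x² − 20x + 25 ≤ 0, whose solution can then be located numerically. A minimal sketch of this ad hoc reasoning (the helper names are ours; real systems use general quantifier elimination, not this shortcut):

```python
# Hand-rolled version of steps 2 and 3 for
#   Q(x) ≡ ∃y (4x² − y − 20x + 25 ≤ 0 ∧ y ≤ 0):
# a suitable y exists iff p(x) ≤ 0, where p(x) = 4x² − 20x + 25 = (2x − 5)².
# Since p is convex and nonnegative, step 3 reduces to finding its minimum.

def p(x: float) -> float:
    return 4 * x**2 - 20 * x + 25          # p(x) = (2x − 5)²

def argmin_convex(f, lo: float, hi: float, eps: float = 1e-9) -> float:
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    while hi - lo > eps:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

x_star = argmin_convex(p, 0.0, 10.0)
# p attains its minimum 0 at x = 2.5, so Q(x) holds exactly for x = 2.5:
print(round(x_star, 6))   # → 2.5
```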

[Figure 1 residue: the pipeline maps the constraint relation 4x² − y − 20x + 25 ≤ 0 and the query ∃y(S(x, y) ∧ y ≤ 0) through quantifier elimination to 4x² − 20x + 25 = 0, then through numerical evaluation to x = 2.5.]
Figure 1: Query Evaluation

Note that it is not always possible to obtain exact values as in the simple example considered. Indeed, the roots might be irrational. In such a case, only approximate values can be computed. The set of answers may also be infinite; in this case, step 3 does not come into the picture. In practice, queries can be more complicated and might involve various forms of aggregate functions. Among the desirable aggregate functions, MIN and MAX are easily definable in the relational calculus. On the other hand, the fundamental function AVG (e.g., the average value of a bond over a period of time) is not definable. For relations of higher dimensions, functions such as SURFACE and VOLUME, very useful in most of the related applications, are not definable either. Consider the query SURFACE_{x,y}(S(x, y) ∧ y ≤ 9). The surface of the area {(x, y) | S(x, y) ∧ y ≤ 9} can be mathematically defined as 27 − ∫₁⁴ (4x² − 20x + 25) dx = 27 − (F(4) − F(1)) = 18, where F(x) = (4/3)x³ − 10x² + 25x is a primitive. In the presence of aggregate functions, the query evaluation process needs an extra step for aggregate evaluation.

4. AGGREGATE EVALUATION: Apply aggregate functions to relations and produce the answer. In the example, the query computes the surface and gives a numerical value as output. During the evaluation of a query, steps 1 to 4 will be intertwined, depending on the query. The result of an intermediate step is a constraint relation with exact or approximate values. It may sometimes be a finite relation. The existing constraint database approaches handle steps 1 and 2 based on the well known quantifier elimination techniques. However, they do not carry out the numerical and aggregation steps. A recent proposal was made to deal with aggregate functions [CK95], but it cannot express the function SURFACE of the above example for instance. The work reported here is, to the best of our knowledge, the first attempt towards these goals.
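The SURFACE value of the example above can be checked with exact rational arithmetic, using the primitive F(x) = (4/3)x³ − 10x² + 25x (a sketch; the function name is ours):

```python
# Exact computation of SURFACE(S(x, y) ∧ y ≤ 9) = 27 − (F(4) − F(1)),
# where F is a primitive of 4x² − 20x + 25.
from fractions import Fraction

def F(x: Fraction) -> Fraction:
    return Fraction(4, 3) * x**3 - 10 * x**2 + 25 * x

surface = 27 - (F(Fraction(4)) - F(Fraction(1)))
print(surface)   # → 18
```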

3 Constraint Databases

We briefly recall the main concepts of constraint databases as they are exposed in [KKR90, GS94]. We consider the first-order language L of the real closed field [CK73], with equality =, order ≤, addition +, and multiplication ×. Let σ = {R₁, ..., Rₙ} be a signature (database schema), where R₁, ..., Rₙ are relation symbols. Introduced in [KKR90], a k-ary generalized tuple is a conjunction of atomic formulas in L over k variables. For instance, "(x ≤ y ∧ x ≥ 0 ∧ y ≤ 10)" is a binary generalized tuple representing a filled triangle. A k-ary finitely representable relation is then a finite set of k-ary generalized tuples, denoting a possibly infinite set of tuples in ℝᵏ, with a finite representation in L. A constraint database is a finite collection of finitely representable relations. It can be seen as an expansion of the real field ⟨ℝ, ≤, +, ×, 0, 1⟩ to a database schema σ. For this reason, we speak of a constraint database ⟨R̂₁, ..., R̂ₙ⟩ in the context of the real field. The domain of the database is the set of real numbers. Other contexts will be considered in the next sections. The notion of satisfaction is defined as usual w.r.t. the context structure. A constraint database ⟨R̂₁, ..., R̂ₙ⟩ satisfies a sentence φ iff ⟨ℝ, ≤, +, ×, 0, 1, R̂₁, ..., R̂ₙ⟩ ⊨ φ, which is simply written as ⟨R̂₁, ..., R̂ₙ⟩ ⊨_ℝ φ. A query is a partial mapping from constraint databases to finitely representable relations. We do not assume any notion of genericity here. Moreover, throughout the paper, we will only consider queries computable in polynomial time. It was shown in [KKR90] that the relational calculus constitutes a query language in the context of polynomial constraints over the real numbers. A first-order formula φ(x̄) in the language L ∪ σ with free variables x̄ naturally defines a query Q:

Q(⟨R̂₁, ..., R̂ₙ⟩) = { ā | ⟨R̂₁, ..., R̂ₙ⟩ ⊨_ℝ φ(ā) }.

The answer to Q over a database is a potentially infinite relation that should be defined by a quantifier-free formula ψ, where ψ is logically equivalent to the formula φ after each relation symbol Rᵢ has been replaced by its definition (i.e., a disjunction of generalized tuples). It is rather non-obvious that first-order logic defines a query language for constraint databases. This follows from the fundamental quantifier elimination property of the first-order theory of real closed fields. A logical theory admits quantifier elimination if every formula is equivalent to a quantifier-free formula. The fact that the theory of real closed fields admits quantifier elimination was discovered by Tarski [Tar51]. Moreover, there are tractable algorithms to perform quantifier elimination for a fixed number of distinct variables. These properties were cleverly used in [KKR90] to design the constraint database framework, and shown necessary in [GS94]. This very fortunate situation (tractable quantifier elimination) has led to the success of the constraint database model. In practice, things are not so simple. Indeed, query evaluation procedures must rely on very sophisticated algorithms developed in the area of real algebraic geometry [Arn88]. We distinguish three fundamental phases in the evaluation process: (i) solution of systems of polynomial inequalities, (ii) cylindrical algebraic decomposition, and finally (iii) quantifier elimination. (The main technical aspects of each phase are described in Appendix I.) Let φ be a first-order sentence. The quantifier elimination algorithm is complete, so any sentence is reduced to either the tautology 0 = 0 or its negation 0 ≠ 0. We write ⊨^ℝ_QE φ if φ is reduced by the quantifier elimination algorithm to the tautology. Collins' first result [Col75] can be rephrased as follows. For each first-order sentence φ,

⊨_ℝ φ   iff   ⊨^ℝ_QE φ.
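Full quantifier elimination over the reals needs CAD-style machinery, but the basic idea can be illustrated on purely linear constraints with Fourier–Motzkin elimination; the toy code below (our own sketch, not the algorithm of [Col75]) eliminates one variable from a conjunction of linear inequalities:

```python
# Toy Fourier–Motzkin elimination: each inequality is (coeffs, b), meaning
#   sum(coeffs[v] * v for v) <= b.
# Eliminating variable j combines every lower bound on j (negative
# coefficient) with every upper bound on j (positive coefficient).
from fractions import Fraction

def eliminate(ineqs, j):
    """Eliminate variable j from a list of ({var: coeff}, bound) inequalities."""
    lower, upper, rest = [], [], []
    for coeffs, b in ineqs:
        c = coeffs.get(j, Fraction(0))
        if c > 0:
            upper.append((coeffs, b, c))
        elif c < 0:
            lower.append((coeffs, b, c))
        else:
            rest.append((coeffs, b))
    for lc, lb, lcj in lower:
        for uc, ub, ucj in upper:
            # Requiring (lower bound on j) <= (upper bound on j) yields
            #   sum (uc[v]/ucj - lc[v]/lcj) v  <=  ub/ucj - lb/lcj.
            coeffs = {v: uc.get(v, Fraction(0)) / ucj - lc.get(v, Fraction(0)) / lcj
                      for v in set(lc) | set(uc)}
            del coeffs[j]                      # coefficient of j cancels to 0
            rest.append((coeffs, ub / ucj - lb / lcj))
    return rest

# ∃y (x − y ≤ 0 ∧ y ≤ 5) reduces to x ≤ 5:
ineqs = [({'x': Fraction(1), 'y': Fraction(-1)}, Fraction(0)),
         ({'y': Fraction(1)}, Fraction(5))]
result = eliminate(ineqs, 'y')   # a single inequality: 1·x ≤ 5
```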

We denote by FO_ℝ the set of queries defined by first-order formulas. These techniques lead to the fundamental result of [KKR90], pointing out the tractability of the constraint framework in the database context at a theoretical level.

Theorem 3.1 [KKR90] Queries in FO_ℝ on constraint databases have PTIME data complexity.

Remark. The quantifier elimination property is of a very rare nature. It was shown by Van den Dries [Dr82] that any proper finite extension of the real field with real analytic but not semi-algebraic functions doesn't admit quantifier elimination. In other words, for sentences with, say, the exponentiation function, there is no quantifier elimination algorithm to decide validity. It is easy to see

that most of the aggregate functions mentioned in the previous section are not semi-algebraic. It follows that formulas containing aggregate functions cannot always be evaluated in closed form, as was already observed in [Kup93]. Any proper extension of the constraint database model of [KKR90] is therefore impossible. In Section 5, we propose a solution based on approximations, and provide in particular a tractable algorithm for the AGGREGATE EVALUATION step. Kanellakis, Kuper and Revesz proved that step 2 of our query evaluation framework, QUANTIFIER ELIMINATION, has PTIME data complexity. We next prove that step 3, NUMERICAL EVALUATION, also has PTIME data complexity, therefore proving the tractability of the first three steps of the evaluation.

Theorem 3.2 For each ε > 0, the NUMERICAL EVALUATION (up to ε-approximation) step can be done in PTIME.

A sketch of the proof can be found in Appendix II. We can actually prove a better parallel complexity bound: indeed, the NUMERICAL EVALUATION (up to ε-approximation) can be done in NC. The result follows from [Nef90]. The same parallel complexity bound was shown for the data complexity of the QUANTIFIER ELIMINATION step of the evaluation of queries in FO_ℝ [KKR90]. Despite the positive essence of the previous results, they are of little use in practice. In the following section, we consider constraint databases in a more realistic context.
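In the univariate case, the NUMERICAL EVALUATION step amounts to approximating the real roots of a polynomial to within a prescribed ε. A minimal sketch (names ours, far simpler than the NC algorithms cited, and detecting only sign-changing roots):

```python
# ε-approximation of real roots of a univariate polynomial by sampling
# for sign changes and refining each bracket by bisection.

def eval_poly(coeffs, x):
    """Evaluate the polynomial with coefficients [a0, a1, ...] via Horner's rule."""
    acc = 0.0
    for a in reversed(coeffs):
        acc = acc * x + a
    return acc

def roots(coeffs, lo=-100.0, hi=100.0, samples=10000, eps=1e-9):
    found = []
    step = (hi - lo) / samples
    x0, f0 = lo, eval_poly(coeffs, lo)
    for i in range(1, samples + 1):
        x1 = lo + i * step
        f1 = eval_poly(coeffs, x1)
        if f0 == 0.0:
            found.append(x0)
        elif f0 * f1 < 0:                      # sign change: a root is bracketed
            a, b = x0, x1
            while b - a > eps:                 # bisect down to ε
                m = (a + b) / 2
                if eval_poly(coeffs, a) * eval_poly(coeffs, m) <= 0:
                    b = m
                else:
                    a = m
            found.append((a + b) / 2)
        x0, f0 = x1, f1
    return found

# x² − 2 has roots ±√2:
print([round(r, 6) for r in roots([-2.0, 0.0, 1.0])])   # → [-1.414214, 1.414214]
```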

4 Relational Calculus with Finite Precision Arithmetic

In this section, we introduce the finite precision semantics for the relational calculus with real arithmetic, and consider its complexity and expressive power. Finite precision values are best represented by floating (point) numbers. Assume a numeration in base b. A floating number is a pair [n, e], where n and e are integers, denoting the rational number n · bᵉ. We distinguish floating numbers of various sizes, and define a k-floating number with a mantissa n over k digits, and an exponent e over log(k) digits. Let F_k be the set of k-floating point numbers. Arithmetic operations over elements of F_k are defined as usual [Knu69]. We consider the finite structure¹ of k-floating numbers 𝓕_k = ⟨F_k, ≤, +, ×, 0, 1⟩. Our goal is to use this structure as a context for constraint databases. We define constraint databases w.r.t. floating numbers. A k-floating constraint database over a schema σ is a structure ⟨R̂₁, ..., R̂ₙ⟩, where R̂₁, ..., R̂ₙ are relations over schema σ, finitely representable in the finite language of 𝓕_k. The universe (semantics) is the set of real numbers, but the active domain (numbers manipulated) is limited to floating numbers. We consider first-order logic over floating constraint databases. We start by noting difficulties that arise with the definition of the semantics. Our goal is to inherit the semantics of the real case. The standard notion of satisfaction is not satisfactory in this context. It is indeed easy to see that, for instance, 𝓕_k ⊨ ∃x∀y (y ≤ x). The existence of a biggest element is of course false for the reals. And sadly, 𝓕_k does not even satisfy the distributive laws, which hold in the real field. For example, the two expressions a × (b + c) and (a × b) + (a × c) may have different values and are therefore not equivalent. The arithmetic over floating numbers has poor algebraic properties. Moreover, unlike in the context of real numbers, arithmetic computations with finite precision² are sensitive to the order in which subexpressions are evaluated. Two different evaluation strategies of the same expression may lead to different results. The approach we propose to overcome the previous difficulties is to use a semantics defined w.r.t. a specific evaluation algorithm, instead of the classical Tarskian semantics. The goal is twofold: first, to avoid undesirable deductions (such as the existence of a biggest element), and second, to determine the meaning of terms by imposing some systematic choice.

¹Note that arithmetic operations are only partially defined in F_k. They have to be seen as relations, in a way similar to the arithmetic over finite segments of the integers.
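The failure of the distributive law is easy to witness with ordinary IEEE-754 doubles, which are one instance of finite precision arithmetic (the example values are ours):

```python
# With finite precision arithmetic, a × (b + c) and (a × b) + (a × c)
# need not coincide: rounding happens at different points.
a, b, c = 100.0, 0.1, 0.2

lhs = a * (b + c)          # rounds b + c first
rhs = (a * b) + (a * c)    # rounds each product first

print(lhs == rhs)          # → False on IEEE-754 doubles
print(lhs, rhs)
```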
A natural candidate to achieve this goal is to use a slightly modified version (described in the full paper) of a quantifier elimination method for the real closed field (we choose the version of Renegar [Ren92a]), denoted the QE algorithm in the following. It follows the steps described in Appendix I. It differs from the original algorithm in that only numbers in F_k are allowed, and that the value of terms might be undefined, which may be caused by overflow of the exponent (number too large or too small) or of the mantissa (insufficient precision).³ Moreover, we assume that the set of variables is ordered, and that the cylindrical algebraic decomposition is always performed following this pre-established order. However, arithmetic operations are still carried out in exact values.

²This holds for all finite precision arithmetics, such as interval arithmetic [Moo66], etc.
³In most practical cases, machine precision is sufficiently high.

We call the semantics associated to the QE algorithm the finite precision semantics, and investigate the expressive power of first-order queries under this semantics. To this purpose, we introduce w.l.o.g. some restrictions which simplify the presentation of the proofs of the results in this section. We consider integers instead of floating numbers. Let 𝓩_k = ⟨Z_k, ≤, +, ×, 0, 1⟩ be the structure of the integers of bit length at most k. The restriction is harmless since polynomials in F[X̄] can always be transformed into equivalent polynomials in ℤ[X̄]. The satisfaction relation, denoted ⊨^F_QE, is defined as follows. Let ⟨R̂₁, ..., R̂ₙ⟩ be a constraint database defined with integers of bit length at most k, and φ be a sentence. We write that:

⟨R̂₁, ..., R̂ₙ⟩ ⊨^F_QE φ   iff   𝓩_k ⊔ ⟨R̂₁, ..., R̂ₙ⟩ ⊨_QE φ,

that is, φ can be reduced to the tautology by the QE algorithm using only integers of bit length k. The bit length of the integers allowed in the QE algorithm depends upon the input database and the query. The active domain is therefore the Z_k such that k is a bound on the bit length of all integers occurring in the finite representation of the input or in the query. Queries are defined as usual (w.r.t. the satisfaction relation) as mappings from constraint databases over some schema {R₁, ..., Rₙ} to relations, defined by:

⟨R̂₁, ..., R̂ₙ⟩ ⟼ { ā | ⟨R̂₁, ..., R̂ₙ⟩ ⊨^F_QE φ(ā) }.
Let FO^F_QE be the set of queries defined as above with the finite precision semantics. We compare its expressive power with FO_ℝ (or equivalently FO^ℝ_QE) (with the general semantics, as introduced in [KKR90]). FO^F_QE is a set of partial queries, while FO_ℝ contains only total queries. Consider a query Q in FO^F_QE expressible by some formula φ, and let D_Q be its domain (the set of databases on which Q is defined). It is clear that the formula φ expresses the same query over the same domain in FO^ℝ_QE. In this restricted sense, we write that FO^F_QE ⊆ FO^ℝ_QE. The converse doesn't hold.

Theorem 4.1 FO^ℝ_QE has more expressive power than FO^F_QE.

Proof: (Sketch) The proof is based on the fact that integers of large size (polynomial in the degree of the polynomials in the input) are necessary in the QE algorithm. Polynomially large numbers cannot be defined in FO^F_QE (proof in the full paper). □

On the other hand, the finite precision semantics is as expressive as the general semantics in various interesting cases, obtained either by modifying the query language, or the class of inputs on which queries are applied. We define FO^F_QE(L) to be the query language restricted to queries expressible in the language L over constraint databases also definable in L, and we consider the subset of total queries: total-FO^F_QE(L). It is easy to see that total-FO^F_QE(≤) = FO^ℝ_QE(≤). The proof follows from the fact that queries with the order relation only are insensitive to exact values [GS94, GS95a], but only to their respective order. The result extends to the case of queries with linear constraints (proof in Appendix II).

Theorem 4.2 total-FO^F_QE(≤, +, 0, 1) = FO^ℝ_QE(≤, +, 0, 1).

Although Theorem 4.1 shows that the previous equivalence doesn't hold in general in the presence of multiplication, we prove a weaker result by modifying the language and restricting the class of inputs. Consider the class K_{d,m} of constraint databases over some fixed schema whose finite representation contains at most m distinct polynomials, each of degree at most d. We consider a slightly different finite arithmetic: 𝓩_k^{l/u} = ⟨Z_k, ≤, +ˡ, +ᵘ, ×ˡ, ×ᵘ, 0, 1⟩, where +ˡ defines a total function from Z_k × Z_k to Z_k defining the k lower bits of the addition, while +ᵘ defines the k higher bits. The same holds for ×ˡ and ×ᵘ.

Theorem 4.3 For every fixed d and m, total-FO^F_QE(≤, +ˡ, +ᵘ, ×ˡ, ×ᵘ, 0, 1)|K_{d,m} = FO^ℝ_QE|K_{d,m}.

The proof is a consequence of the two following technical lemmas (proofs in Appendix II).

Lemma 4.4 Over the class K_{d,m}, the bit length of the integers computed by the QE algorithm is linear in the bit length of the integers occurring as parameters in the input constraint database.

Lemma 4.5 The relations of the structure 𝓩_{2k}^{l/u} are first-order definable in 𝓩_k.

Also, the well-known hierarchy [ACGK94] induced by the arithmetic operators carries over in the case of the finite precision semantics.

Proposition 4.6 FO^F_QE(≤) ⊊ FO^F_QE(≤, +) ⊊ FO^F_QE(≤, +, ×).

Datalog with finite precision semantics. The use of the finite precision semantics allows a natural tractable extension of first-order logic with recursion. We consider the language Datalog¬^F_QE, which is Datalog with inflationary negation and finite precision semantics (the QE algorithm is called at each iteration). We consider the complexity of queries w.r.t. the size of the input constraint database with its context arithmetic⁴, that is, the size of 𝓩_k ⊔ ⟨R̂₁, ..., R̂ₙ⟩, where k is the bit length of the integers occurring in the representation of the relations.

Theorem 4.7 Datalog¬^F_QE ⊆ PTIME.

It contrasts with the fact that Datalog¬^ℝ_QE contains all Turing computable queries. More interestingly, Datalog¬^F_QE captures all PTIME queries over various classes of inputs. We consider the class DO of dense-order inputs (defined without the symbols + and ×).

Theorem 4.8 PTIME|DO ⊆ Datalog¬^F_QE|DO.

The proof is in the spirit of a similar characterization of PTIME for dense-order constraint databases [GS95a]. The inherent difficulty of this result lies in the encoding of a constraint database into finite relations. This is easily achieved for dense-order constraint databases [KG94, GS95a], but is much more complex for more general constraint databases.
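Inflationary evaluation, the mechanism behind the Datalog variant above, simply accumulates derived facts until a fixpoint is reached. A minimal sketch over ordinary finite relations (transitive closure; rule and relation names are ours, and the constraint/QE machinery and negation are omitted):

```python
# Inflationary fixpoint evaluation: apply the rule bodies to the current
# database and add (never retract) the derived facts until nothing changes.

def inflationary_fixpoint(edges):
    """Transitive closure of `edges` via the rules
         tc(x, y) <- e(x, y)
         tc(x, z) <- tc(x, y), e(y, z)."""
    tc = set(edges)
    while True:
        derived = {(x, z) for (x, y) in tc for (y2, z) in edges if y == y2}
        if derived <= tc:                 # fixpoint reached
            return tc
        tc |= derived                     # inflationary: the set only grows

print(sorted(inflationary_fixpoint({(1, 2), (2, 3)})))
# → [(1, 2), (1, 3), (2, 3)]
```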

5 Towards Practical Constraint Queries

In this section, we develop a framework for the implementation of constraint databases for practical applications. We designed a generic constraint query language, CALCF, extending FO^F_QE with non-polynomial functions and aggregate functions. More precisely, the new functionalities are: (i) analytical functions (e.g., polynomial, exponential, logarithmic, trigonometric functions, etc.), and (ii) aggregate functions including MIN, MAX, AVG, LENGTH, SURFACE, VOLUME, and EVAL. The functions LENGTH, SURFACE, and VOLUME have standard meanings for objects of the appropriate dimension, and the functions MIN, MAX, AVG are standard unary functions which return respectively the smallest, largest, and average values if they exist, and are undefined otherwise. The function EVAL maps a given system of constraints S either to its finite set of solutions, if it exists, or to S itself otherwise. Both kinds of new functions are essential in most of the potential constraint database applications. The finite precision semantics, associated with an approximation mechanism, makes this extension possible, although the classical quantifier elimination doesn't hold [Dr82]. Instead of excluding these fundamental functions and letting the user handle the interactions between queries and numerical computations, we propose a sound general query processing system (algorithm) that "consults" external "numerical computation modules" and performs bottom-up query evaluation in closed form with PTIME data complexity. The syntax of CALCF is standard and therefore omitted here (provided in the full version). Terms are built using arbitrary functions. If ψ is a formula in CALCF with free variables among x̄, ȳ and g_ȳ is an aggregate function mapping relations over variables ȳ to k-ary relations, then g_ȳ[ψ] is a (|x̄| + k)-ary aggregate predicate in CALCF.

⁴This might appear as a bad definition of the size of the input. Our claim is that in practice small integers are enough to define a variety of polynomials, and that the size of 𝓩_k ⊔ ⟨R̂₁, ..., R̂ₙ⟩ is close to the size of ⟨R̂₁, ..., R̂ₙ⟩.

Example 5.1 Consider the binary relation S and the query computing the surface of a portion of S in Section 2. The query can be expressed in CALCF as follows:

{ z | SURFACE_{x,y}[S(x, y) ∧ y ≤ 9](z) }. □

The semantics of CALCF relies on two kinds of numerical computation modules: (i) to approximate non-polynomial functions with polynomials, and (ii) to evaluate aggregate functions. Roughly speaking, queries are evaluated in several stages, depending on the maximal number of nesting levels of the aggregate predicates used. If no aggregates occur (e.g., at the innermost level), we first replace all non-polynomial functions with their polynomial approximations (using a procedure to be described below), then apply the QE algorithm to obtain a quantifier-free formula in CALCF with only polynomial constraints. If the query formula is inside an aggregate predicate, the corresponding aggregate computation module is called, which results in a constraint relation. If there are nested aggregate predicates, the above steps are repeated. Non-polynomial functions are approximated over approximation bases. An approximation base (a-base) is a list of floating numbers b₁, ..., b_{ℓ−1} where b_{i−1} < b_i. For convenience, we denote −∞, +∞ as b₀, b_ℓ (respectively) and (sloppily) use the intervals [b₀, b₁], [b_{ℓ−1}, b_ℓ].
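As an illustration of module kind (i), a function can be replaced on each a-base interval by a low-degree polynomial. The sketch below is our own code, with an arbitrary a-base and plain Lagrange interpolation standing in for the paper's approximation modules:

```python
# Piecewise polynomial approximation over an a-base b1 < ... < b_{ℓ-1}:
# on each interval we fit a quadratic through three sample points.
import math

def quadratic_through(f, lo, hi):
    """Coefficients (c0, c1, c2) of the parabola interpolating f at lo, mid, hi."""
    x0, x1, x2 = lo, (lo + hi) / 2, hi
    y0, y1, y2 = f(x0), f(x1), f(x2)
    denom = (x0 - x1) * (x0 - x2) * (x1 - x2)
    c2 = (x2 * (y1 - y0) + x1 * (y0 - y2) + x0 * (y2 - y1)) / denom
    c1 = (x2**2 * (y0 - y1) + x1**2 * (y2 - y0) + x0**2 * (y1 - y2)) / denom
    c0 = y0 - c1 * x0 - c2 * x0**2
    return (c0, c1, c2)

a_base = [0.0, 0.5, 1.0, 1.5, 2.0]          # an arbitrary a-base
pieces = [(lo, hi, quadratic_through(math.exp, lo, hi))
          for lo, hi in zip(a_base, a_base[1:])]

def approx_exp(x):
    for lo, hi, (c0, c1, c2) in pieces:
        if lo <= x <= hi:
            return c0 + c1 * x + c2 * x**2
    raise ValueError("outside the a-base range")

# The quadratic pieces track exp closely on [0, 2]:
err = max(abs(approx_exp(x / 100) - math.exp(x / 100)) for x in range(201))
print(err < 1e-2)   # → True
```

A finer a-base (or a higher interpolation order) shrinks the error, which is exactly why the a-base must be chosen per database and query.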

Definition 5.2 Let k > 0 be an integer. A k-order approximation module is a mapping which, on input an n-ary (n ≥ 0) function f and n intervals [αᵢ, βᵢ] (1 ≤ i ≤ n), produces an n-variate polynomial g in F[X̄] of degree k, defined over the hypercube ∏ᵢ₌₁ⁿ [αᵢ, βᵢ] in Fⁿ, which approximates f.

The above definition does not involve error factors. Although any analytical function continuous on an interval (hypercube) [α, β] can be approximated by a polynomial within any given error margin (Weierstrass Approximation Theorem [Wei85], see also [SA94]), the a-base has to be appropriately chosen to bound the error margin. (Theoretically, if numerical modules have bounded errors, the global error margin of the query answer can potentially be bounded too. However, as we discuss below, the issue concerning errors is not simple.) With approximation modules, CALCF allows virtually all analytical functions, as long as their polynomial approximations can be effectively obtained. CALCF does approximation dynamically using an a-base. For different databases and queries the a-base may be chosen differently, for complexity and accuracy reasons. Evaluating aggregate predicates of queries in CALCF requires a set of "aggregate (evaluation) modules". Slightly different from approximation modules, each of which may be used to approximate more than one function, each aggregate module computes a unique aggregate function. (The difference is merely technical.)

Definition 5.3 Let k ≥ 1, l ≥ 0 be integers. A (k, l)-aggregate (evaluation) module is a partial mapping from k-ary constraint relations to l-ary constraint relations.

For technical reasons, we assume in the following that aggregate functions are applied on relations with no parameters. Let Q be an arbitrary query in CALCF with m − 1 aggregate predicates. We construct a directed acyclic graph G_Q with m nodes, each labeled by an aggregate predicate in Q or by Q itself.
If a predicate of node i appears inside a predicate of node j, we construct a directed edge from i to j. The evaluation of the query Q has the following steps (suppose the a-base is b₁, ..., b_{ℓ−1}) for a given k:

1. Select a node without incoming edges. If the node is labeled Q, perform steps 2 and 3, output the result of step 3 as the answer, and stop. Otherwise we assume it is labeled g_ȳ[ψ].

2. For each term f(z̄) involving non-polynomial functions and each hypercube e = ∏ᵢ₌₁ⁿ [b_{jᵢ−1}, b_{jᵢ}] (1 ≤ j₁, ..., jₙ ≤ ℓ), we call a k-order approximation module on f, e, and let h_e be the resulting polynomial. Each tuple t containing f(z̄) is then replaced by a set of tuples t_e ∧ "z̄ ∈ e", where e is a hypercube and t_e is t after f(z̄) has been replaced by h_e(z̄). (The formal description for t having more than one non-polynomial function term is provided in the full paper.) Let ψ′ be the resulting formula after all non-polynomial functions are replaced.

k’s (and an a-base with small intervals for reducing errors). The complexity of each approximation call is however independent of the database size. Aggregate evaluation calls can also be very expensive for complex aggregate functions. However, the aggregate functions included in CALCF can be implemented by known numerical methods [BF85, PTVF92]. Therefore,

3. Apply the QE algorithm to obtain an ary constraint relation r from 0 .

Finally, we discuss the error margins. Seemingly errors can be caused by any numerical computation calls. Although the QE algorithm we described in this paper performs exact computation, it may propagate errors. (Practically, one might use this algorithm to perform inexact operations and in this case roundoff errors can also occur.) Since for each fixed database and each fixed query, the number of arithmetic operations to be performed by the QE algorithm is likely to be bounded (for k -order (k fixed) approximations and if aggregate modules do not introduce new polynomials), it appears that one can obtain -approximate query answers if numerical computation errors are bounded, if we define the error to be the difference between two relations (databases) in the standard way. Unfortunately, this cannot be formalized as well as we would hope. Indeed, even without aggregate functions, the “exact answers” may not be finitely representable. Moreover, any approximation of a function with singular points (e.g., log(x ? 3) at x = 3) in the intervals of the a-base admits no bounded error. Error analysis remains an interesting issue to be resolved.

jxj + jyj)-

(

4. If for each t 2 r, constraints in t can be divided into constraints only on x and constraints only on y , i.e., t  tx ^ ty , we perform the following (the query is undefined otherwise):

Construct a CAD C (see Appendix I) on the constraint relation ftx j t 2 rg. For each c 2 C , since c can be represented by a tuple tc , we obtain a relation rc = fy j 9x r(x; y ) ^ tc (x)g. We now assign the following constraint relation to the predicate gy [] after the calls gy (rc ):

 t ^ t c 2 C ; t 2 g (r ) : c y c

Remove the node and all incident edges and go back to step 1. Example 5.4 For the query Q in Example 5.1, its graph GQ has two nodes, labeled Q and SURFACEx;y [S (x; y )^y 9] (respectively), and an edge from the latter to the former. Since there are no other functions or quantifiers, the aggregate module for SURFACEx;y is called at step 4, which results in a unary relation f18g. In the second iteration (corresponding to Q), step 3 produces the 2 final answer f18g. Next we consider the data complexity of CALCF queries. We use the standard data complexity notion, and view approximation and aggregation modules as oracles. From the evaluation algorithm and known results, it is clear that: Theorem 5.5 Let k > 0 be an integer. Every query in CALCF can be evaluated in PTIME data complexity and with polynomially many k -order approximation and aggregate computation calls. Numerical models are computationally intensive. For example, function approximation with a single polynomial of degree k can be very complex when k is large. An alternative is to choose small

Corollary 5.6 Every CALCF query can be answered in PTIME w.r.t. the database size.
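One way to realize a k-order approximation module as used in step 2 is degree-k Lagrange interpolation of f on each a-base interval. The Python sketch below is illustrative only: the module interface, the choice of sin as the non-polynomial function, and the sample a-base are assumptions, not the paper's.

```python
import math

def approx_module(f, a, b, k):
    """A k-order approximation module (sketch): interpolate f by a
    degree-k Lagrange polynomial at k+1 equally spaced nodes in [a, b]."""
    xs = [a + (b - a) * i / k for i in range(k + 1)]
    ys = [f(x) for x in xs]
    def h(x):
        # evaluate the interpolating polynomial in Lagrange form
        total = 0.0
        for i in range(k + 1):
            li = 1.0
            for j in range(k + 1):
                if j != i:
                    li *= (x - xs[j]) / (xs[i] - xs[j])
            total += ys[i] * li
        return total
    return h

# a-base 0, 0.5, 1.0: replace sin(z) by one polynomial h_e per interval,
# mirroring the per-hypercube substitution of step 2
abase = [0.0, 0.5, 1.0]
pieces = [(lo, hi, approx_module(math.sin, lo, hi, 3))
          for lo, hi in zip(abase, abase[1:])]

lo, hi, h = pieces[0]          # the piece for the interval [0, 0.5]
print(abs(h(0.3) - math.sin(0.3)) < 1e-4)  # → True
```

The a-base tradeoff discussed above is visible here: shrinking the intervals (or raising k) tightens the error of each piece at the cost of more module calls and more disjuncts in ψ′.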

6 Conclusion

We introduced a constraint query language CALC_F which fulfills the basic requirements of practical constraint databases, with an increased modeling power: non-semi-algebraic functions, to define aggregate functions and to allow more complex data (such as periodic information defined with trigonometric functions, for instance); and approximate numbers, originating from scientific data or resulting from solutions of algebraic systems. This is coupled with concerns about the efficiency of query evaluation: finite precision computation speeds up the costly CAD algorithm. Finite precision arithmetic is the fundamental concept here. It is (i) desirable for data modeling, (ii) fundamental for efficiency, and (iii) necessary to have closed form evaluation in the presence of aggregate functions.

There are important implementation issues, which we have started considering. Clearly, the central problems are optimization and error control. For the approximation, there are standard approaches such as Taylor polynomials, Lagrange interpolation polynomials, iterated interpolation, cubic spline interpolation, etc. Among them, cubic spline interpolation gives a set of polynomials rather than a single one. Appropriate a-bases are important: small intervals reduce the errors but increase the complexity. A good compromise seems to be to select an a-base according to the database, based for example on the CAD of the polynomials occurring in the constraint database.

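To illustrate the point about splines: the textbook natural cubic spline construction (as in [BF85]) indeed returns one cubic per knot interval, which matches the piecewise, per-interval shape of an a-base. The Python sketch below is a hedged illustration; the knot placement and the use of sin are assumptions for the example.

```python
import math

def natural_cubic_spline(xs, ys):
    """Natural cubic spline through (xs, ys): returns one cubic
    (a, b, c, d) per knot interval, evaluated as a+bt+ct^2+dt^3
    with t = x - xs[j] (standard tridiagonal construction)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * ((ys[i + 1] - ys[i]) / h[i]
                        - (ys[i] - ys[i - 1]) / h[i - 1])
    # forward sweep of the tridiagonal system for the c coefficients
    l = [1.0] + [0.0] * n
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # back substitution (natural boundary: c[0] = c[n] = 0)
    c = [0.0] * (n + 1)
    b = [0.0] * n
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])
    return [(ys[j], b[j], c[j], d[j]) for j in range(n)]

def eval_spline(xs, polys, x):
    j = max(i for i in range(len(polys)) if xs[i] <= x)
    a, b, c, d = polys[j]
    t = x - xs[j]
    return a + b * t + c * t * t + d * t ** 3

xs = [0.0, 0.5, 1.0, 1.5]
polys = natural_cubic_spline(xs, [math.sin(x) for x in xs])
print(abs(eval_spline(xs, polys, 0.75) - math.sin(0.75)) < 1e-2)  # → True
```

Each tuple in `polys` is one polynomial piece; in the constraint setting each piece would be attached to its interval constraint, exactly as in step 2 of the evaluation algorithm.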
References

[ACGK94] F. Afrati, S. Cosmadakis, S. Grumbach, and G. Kuper. Expressiveness of linear vs. polynomial constraints in database query languages. In Second Workshop on the Principles and Practice of Constraint Programming, 1994.
[ACM88] D. Arnon, G. Collins, and S. McCallum. Cylindrical algebraic decomposition. SIAM J. Computing, 13(4):865–889, 1988.
[Arn88] D. Arnon. A bibliography of quantifier elimination for real closed fields. Journal of Symbolic Computation, 5:267–274, 1988.
[BDLW95] M. Benedikt, G. Dong, L. Libkin, and L. Wong. Relational expressive power of constraint query languages. Manuscript, 1995.
[BF85] R.L. Burden and J.D. Faires. Numerical Analysis (3rd edition). PWS-Kent, Boston, MA, 1985.
[BKR86] M. Ben-Or, D. Kozen, and J. Reif. The complexity of elementary algebra and geometry. Journal of Computer and System Sciences, 32:251–264, 1986.
[CK73] C.C. Chang and H.J. Keisler. Model Theory, volume 73 of Studies in Logic. North-Holland, 1973.
[CK95] J. Chomicki and G. Kuper. Measuring infinite relations. In Proc. 14th ACM Symp. on Principles of Database Systems. ACM Press, 1995.
[CL82] G.E. Collins and R. Loos. Real zeros of polynomials. Computing, 1982.
[Col75] G.E. Collins. Quantifier elimination for real closed fields by cylindric decompositions. In Proc. 2nd GI Conf. Automata Theory and Formal Languages, volume 35 of Lecture Notes in Computer Science, pages 134–183. Springer-Verlag, 1975.
[Dr82] L. van den Dries. Remarks on Tarski's problem concerning (R, +, ·, exp). In Logic Colloquium. Elsevier, North-Holland, 1982.
[FK94] C. Faloutsos and I. Kamel. Beyond uniformity and independence: Analysis of R-trees using the concept of fractal dimension. In Proc. 13th ACM Symp. on Principles of Database Systems, pages 4–13, Minneapolis, 1994.
[GS94] S. Grumbach and J. Su. Finitely representable databases. In Proc. 13th ACM Symp. on Principles of Database Systems, pages 289–300, Minneapolis, May 1994.
[GS95a] S. Grumbach and J. Su. Dense order constraint databases. In Proc. 14th ACM Symp. on Principles of Database Systems, San Jose, May 1995.
[GS95b] S. Grumbach and J. Su. First-order definability over constraint databases. In Proc. First Int. Conf. on Principles and Practice of Constraint Programming, Cassis, Sept. 1995.
[GST94] S. Grumbach, J. Su, and C. Tollu. Linear constraint query languages: Expressive power and complexity. In D. Leivant, editor, Logic and Computational Complexity Workshop, Indianapolis, 1994. Springer-Verlag. To appear in LNCS.
[GV88] D.Yu. Grigor'ev and N.N. Vorobjov. Solving systems of polynomial inequalities in subexponential time. Journal of Symbolic Computation, 1988.
[KG94] P. Kanellakis and D. Goldin. Constraint programming and database query languages. In Proc. 2nd Conf. on Theoretical Aspects of Computer Software (TACS), 1994.
[KKR90] P. Kanellakis, G. Kuper, and P. Revesz. Constraint query languages. In Proc. 9th ACM Symp. on Principles of Database Systems, pages 299–313, Nashville, 1990.
[Knu69] D.E. Knuth. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. Addison-Wesley, 1969.
[KRVV93] P. Kanellakis, S. Ramaswamy, D. Vengroff, and J. Vitter. Indexing for data models with constraints and classes. In Proc. 12th ACM Symp. on Principles of Database Systems, pages 233–243, 1993.
[Kup93] G. Kuper. Aggregation in constraint databases. In Proc. First Workshop on Principles and Practice of Constraint Programming, 1993.
[Moo66] R.E. Moore. Interval Analysis. Prentice-Hall, 1966.
[Nef90] C. Neff. Specified precision polynomial root isolation is in NC. In Proc. IEEE Foundations of Computer Science, 1990.
[Pan92] V. Pan. Complexity of computations with matrices and polynomials. SIAM Review, 34(2):225–262, 1992.
[PVV94] J. Paredaens, J. Van den Bussche, and D. Van Gucht. Towards a theory of spatial database queries. In Proc. 13th ACM Symp. on Principles of Database Systems, pages 279–288, 1994.
[PVV95] J. Paredaens, J. Van den Bussche, and D. Van Gucht. First-order queries on finite structures over the reals. In Proc. 10th IEEE Symp. on Logic in Computer Science. IEEE Computer Society Press, 1995.
[PTVF92] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C (Second Edition). Cambridge University Press, 1992.
[Ren92a] J. Renegar. On the computational complexity and geometry of the first-order theory of the reals. Journal of Symbolic Computation, 13:255–352, 1992.
[Ren92b] J. Renegar. On the computational complexity of approximating solutions for real algebraic formulae. SIAM Journal of Computing, 21:1008–1025, 1992.
[SA94] B. Sendov and A. Andreev. Approximation and interpolation theory. In P.G. Ciarlet and J.L. Lions, editors, Handbook of Numerical Analysis, volume III, pages 223–464. North-Holland, 1994.
[Tar51] A. Tarski. A Decision Method for Elementary Algebra and Geometry. Univ. of California Press, Berkeley, California, 1951.
[Wei85] K. Weierstrass. Über die analytische Darstellung sogenannter willkürlicher Funktionen einer reellen Veränderlichen. Sitzungsber. der Akad. zu Berlin, pages 633–639, 1885.
[Yap94] C.K. Yap. Fundamental Problems in Algorithmic Algebra. Princeton University Press, 1994.

Appendix I: Real Algebraic Geometry

The Appendix contains a description of the main aspects of the algorithms of real algebraic geometry mentioned in the paper.

Solution of systems of polynomial inequalities

Consider a system (conjunction) ψ of m polynomial inequalities over k variables of maximum degree d, and with integer coefficients of maximum bit length ℓ. ψ defines a semi-algebraic set S ⊆ R^k. Let S = ∪_i S_i be the unique decomposition of S into disjoint connected components. There is an algorithm [GV88] which defines an algebraic witness for each of the connected components. (An algebraic number α is defined here by a polynomial φ such that φ vanishes at α.) Moreover, the algorithm finds, for any rational ε, a rational ε-approximation of all witnesses. The complexity of the algorithm is polynomial in log(1/ε), ℓ, m, and d (it is exponential in the number k of variables). It follows that the NUMERICAL EVALUATION (up to some ε-approximation) can be done in PTIME (Theorem 3.2).

Cylindrical algebraic decomposition

A cylindrical algebraic decomposition (CAD) of R^k w.r.t. a set of k-variate integral polynomials P ⊆ Z[x_1, ..., x_k] is a partition of R^k into finitely many semi-algebraic connected subsets, the cells, in each of which each polynomial is sign-invariant, that is, it either vanishes (equals zero) everywhere or nowhere in the cell. The CAD is said to be P-invariant. Assume that C is a CAD of R^k. If k = 1, each cell is either an isolated point or an open interval. For k ≥ 2, there is a CAD C′ of R^{k−1} such that for each cell c′ in C′, the cylinder c′ × R is a union of cells of C. Also, if c ⊆ c′ × R is a cell of C, the k-th projection of c is c′. C is said to induce C′. Thus a CAD is a tower of CAD's C_1, ..., C_k, where C_i is a CAD of R^i, and C_i induces C_{i−1}.

The first phase of the CAD algorithm computes successive projections of the set of polynomials P. The operator PROJ, introduced by Collins, maps a set of polynomials P_i ⊆ Z[x_1, ..., x_i] to a set of polynomials over i − 1 variables, PROJ(P_i) ⊆ Z[x_1, ..., x_{i−1}], such that for any PROJ(P_i)-invariant CAD C′ of R^{i−1}, there is a P_i-invariant CAD C of R^i which induces C′. Polynomials of PROJ(P_i) are formed by addition, subtraction, and multiplication of the coefficients (elements of Z[x_1, ..., x_{i−1}]) of the polynomials of P_i, with the technique of subresultants [Yap94].

The second phase of the algorithm results in a PROJ^{k−1}(P)-invariant CAD of R. All the roots are identified [CL82], and the cells are constructed. An algebraic number is defined by its minimal polynomial p_α and an isolating interval for the particular root of p_α. Open intervals are defined, modulo some ε, by rational numbers. The third phase extends the CAD of R to a CAD of R², and so on. For each cell, sample points are exhibited, making it possible to check the values of the polynomials on the sample points. The cells are also indexed in a simple way which permits determining their dimension and their relative positions in the stacks. More details can be found in [ACM88, Ren92a].

The quantifier elimination procedure

Consider a first-order formula in prenex normal form:

    Q_{n+1} x_{n+1} ... Q_{n+m} x_{n+m} ψ(x_1, ..., x_{n+m})

where the Q_i's are quantifiers and ψ is the unquantified matrix. Let P be the set of polynomials occurring in ψ, and C_P be a CAD associated to P. From the sample points of the cells, we can decide in which cells the unquantified matrix holds. The decomposition of the space induces a decomposition of each lower-dimension space. Since each cylinder is partitioned into a finite number of cells, the universal (respectively existential) quantifiers can be replaced by finite conjunctions (respectively disjunctions). One can thus decide in which cells of a lower-dimensional space the quantified formula holds. To complete the quantifier elimination, the formulas defining the latter cells have to be constructed. Renegar [Ren92a] gave a very precise characterization of these formulas with a detailed complexity analysis (see Appendix II, Proof of Lemma 4.4).
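For the one-dimensional base case, where the cells of R are isolated points and open intervals, the structure of a CAD can be sketched in a few lines of Python. The floating-point root isolation below is a naive stand-in for the exact methods of [CL82]; the interval bounds and example polynomials are illustrative.

```python
def horner(p, x):
    """Evaluate a polynomial given as coefficients, highest degree first."""
    r = 0.0
    for c in p:
        r = r * x + c
    return r

def roots_in(p, lo, hi, steps=1000):
    """Naive real-root isolation: scan for sign changes, then bisect.
    (Real CAD implementations use exact arithmetic instead.)"""
    rs = []
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    for a, b in zip(xs, xs[1:]):
        fa, fb = horner(p, a), horner(p, b)
        if fa == 0.0:
            rs.append(a)
        elif fa * fb < 0:
            for _ in range(80):
                m = (a + b) / 2
                if horner(p, a) * horner(p, m) <= 0:
                    b = m
                else:
                    a = m
            rs.append((a + b) / 2)
    return rs

def cad_1d(polys, lo=-10.0, hi=10.0):
    """Cells of a P-invariant decomposition of [lo, hi]: the isolated
    root points of the polynomials and the open intervals between them."""
    pts = sorted(set(r for p in polys for r in roots_in(p, lo, hi)))
    bounds = [lo] + pts + [hi]
    cells = []
    for a, b in zip(bounds, bounds[1:]):
        cells.append(("interval", a, b))
        if b != hi:
            cells.append(("point", b))
    return cells

# P = {x^2 - 2, x - 1}: roots -sqrt(2), 1, sqrt(2) give 3 point cells
# and 4 open intervals, on each of which both polynomials keep their sign
cells = cad_1d([[1, 0, -2], [1, -1]])
print(len(cells))  # → 7
```

Every polynomial in P is sign-invariant on each of the returned cells, which is exactly the property the higher-dimensional phases lift through the PROJ operator.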

Appendix II: Proofs of the Main Results

The proofs of some of the results of the paper have been collected in this Appendix.

Proof of Theorem 3.2: (Sketch) For each ε, the NUMERICAL EVALUATION (up to ε-approximation) can be done in PTIME. The proof is based on powerful techniques to solve systems of polynomial inequalities (see Appendix I on the solution of systems of polynomial inequalities). In particular, in the case where ψ defines a finite set, the algorithm finds an ε-approximation of all the solutions. The data complexity is defined w.r.t. the size of the input constraint database, assuming without loss of generality that the polynomials have integer coefficients. The size of the integers is defined by their bit length, as in [GST94]. The complexity is exponential in the number of variables. In the case of data complexity, the number of variables is fixed. The complexity is then polynomial in the size of the database (an upper bound on the size of the integers, the number of distinct polynomials, and the degree). □

Proof of Theorem 4.2: (Sketch) The proof follows from the fact that all the integers obtained during the computation of a linear query (over linear constraint databases) have a bit length linearly bounded by the bit length of the coefficients of the input database. If the bit length of the integers in the input is bounded by k, then the bit length of all integers during the computation of a query Q is bounded by c · k, where c is a constant which depends only upon the query Q (and its quantifier depth). The linear arithmetic over integers of length c · k can be defined in first-order from the arithmetic of Z_k. Consider the structure of the integers of bit length at most k:

    Z_k = ⟨Z_k, ≤_k, +_k, 0^k, 1^k⟩

where the relations are indexed by the bit length k. We prove that the constants 0^{2k}, 1^{2k} and the relations ≤_{2k}, +_{2k} of the structure Z_{2k} are first-order definable in Z_k. We define integers of length 2k by pairs of integers of length k. The constants are obviously defined (e.g., 1^{2k} = [1^k, 0^k]). We now give the definitions of the relations (length indices are omitted when clear from the context). We use the operational notation instead of the relational notation for simplicity. Subtraction is used as an abbreviation (x −_k y =_k z iff x =_k y +_k z).

    [x, x′] ≤_{2k} [y, y′]  ⇔  x <_k y ∨ (x =_k y ∧ x′ ≤_k y′)

    [x, x′] +_{2k} [y, y′] = [z, z′]  ⇔
        (z =_k x +_k y ∧ z′ =_k x′ +_k y′)
      ∨ ( ∃α ((x +_k y) +_k 1^k =_k α)
        ∧ ∀α′ ((x′ +_k y′) ≠_k α′)
        ∧ z =_k (x +_k y) +_k 1^k
        ∧ z′ =_k x′ +_k (y′ −_k 1^k) )

The second disjunct handles the carry case: the sum of the primed halves overflows (is undefined in Z_k), so a carry of 1^k is added to the other halves.

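The construction above can be mirrored operationally: represent a 2k-bit integer as a pair of k-bit halves and define comparison and addition using only k-bit (partial) operations. The Python sketch below is a semantic illustration, with a (low, high) pair convention and k = 8 chosen for readability; it is not a transcription of the first-order formulas.

```python
K = 8                      # pretend machine integers have bit length k
MAX = (1 << K) - 1         # largest element of Z_k

def add_k(x, y):
    """Partial k-bit addition: undefined (None) on overflow, mirroring
    the bounded structure Z_k in which + is only a relation."""
    s = x + y
    return s if s <= MAX else None

def le_2k(p, q):
    """Compare 2k-bit integers given as (low, high) pairs of k-bit halves."""
    (xl, xh), (yl, yh) = p, q
    return xh < yh or (xh == yh and xl <= yl)

def add_2k(p, q):
    """2k-bit addition from k-bit operations: one clause for the
    carry-free case, one for a carry out of the low halves."""
    (xl, xh), (yl, yh) = p, q
    s = xl + yl
    if s <= MAX:                      # low halves do not overflow
        return (s, add_k(xh, yh))
    # carry case: wrap the low half, propagate a carry of 1 to the high half
    return (s - (MAX + 1), add_k(add_k(xh, yh), 1))

def decode(p):
    return p[0] + (MAX + 1) * p[1]

a, b = (200, 3), (100, 5)             # encode 968 and 1380 with k = 8
print(decode(add_2k(a, b)))           # → 2348
print(le_2k(a, b))                    # → True
```

Iterating the pairing, exactly as the proof iterates the definability step, yields 4k-bit, 8k-bit, ... arithmetic from the same k-bit base operations.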
By iterating this technique, we obtain integers (with order and addition) of sufficient length. □

Proof of Lemma 4.4: Consider the class K_{d,m} of constraint databases over schema σ whose finite representation contains at most m distinct polynomials of degree at most d. Let t̂ = ⟨R_1, ..., R_n⟩ be some arbitrary constraint database in K_{d,m}, with R_i defined by φ_i. Let Q be an ℓ-ary query in FO_R defined by a formula φ in L ∪ σ with free variables x_1, ..., x_ℓ. To evaluate the query, it suffices to perform the QE of the L-formula φ[R_i := φ_i](x_1, ..., x_ℓ). As shown by Renegar [Ren92a], the QE algorithm produces a quantifier-free formula of the form:

    ⋁_{i=1}^{I} ⋀_{j=1}^{J_i}  p_{ij}(x_1, ..., x_ℓ) θ_{ij} 0

where θ_{ij} ∈ {=, <, >}, and the numbers I and J_i, as well as the degrees of the polynomials p_{ij}, are bounded by a constant C depending only upon the class K_{d,m} and the query Q. The constant C is given by C = ((m + m′) sup(d′, d))^{2^{O(w)} ∏_k n_k}, where m′ is the number of polynomials in φ and d′ their maximum degree, w − 1 is the number of quantifier alternations, and n_i is the number of variables for each quantifier type in the alternation. All numbers computed by the QE algorithm, including the coefficients of the polynomials p_{ij}, are integers of bit length bounded by C · k. It follows that the numbers sufficient for the QE algorithm to compute the answer to a query Q on a constraint database are of length linear in the numbers in the input. □

Proof of Lemma 4.5: Consider the structure of the integers of bit length at most k:
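The quantifier-free output format can be represented directly as a list of conjunctions of polynomial sign conditions and evaluated at sample points. A small Python sketch (the univariate example formula is illustrative, not from the paper):

```python
# A quantifier-free formula  V_i  /\_j  p_ij theta_ij 0  is a list of
# conjuncts, each a list of (polynomial, comparison) pairs; polynomials
# are coefficient lists, highest degree first.
def holds(dnf, x):
    """Evaluate a DNF of univariate polynomial sign conditions at x."""
    def val(p):
        r = 0.0
        for c in p:
            r = r * x + c
        return r
    def sat(term):
        p, theta = term
        v = val(p)
        if theta == "=":
            return v == 0
        return v < 0 if theta == "<" else v > 0
    return any(all(sat(t) for t in conj) for conj in dnf)

# e.g. a plausible result of eliminating y from  Exists y (y^2 = x):
# simply x >= 0, written as the DNF  (x > 0) \/ (x = 0)
dnf = [[([1, 0], ">")], [([1, 0], "=")]]
print(holds(dnf, 2.0), holds(dnf, -1.0))  # → True False
```

The bound C of the lemma caps I, the J_i, and the degrees of the p_{ij}, so the size of this representation is independent of the database content within the class K_{d,m}.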

    Z_k^{l/u} = ⟨Z_k, ≤_k, +^l_k, +^u_k, ×^l_k, ×^u_k, 0^k, 1^k⟩

where the relations are indexed by the bit length and by the position of the output bits: lower (l) or upper (u). We prove that the relations +^l_{2k}, +^u_{2k}, ×^l_{2k}, ×^u_{2k} of the structure Z_{2k}^{l/u} are first-order definable in Z_k^{l/u}.

    [x, x′] +^l_{2k} [y, y′] = [z, z′]  ⇔
        z =_k (x′ +^u_k y′) +^l_k (x +^l_k y)  ∧  z′ =_k x′ +^l_k y′

    [x, x′] +^u_{2k} [y, y′] = [z, z′]  ⇔
        z =_k 0^k  ∧  z′ =_k (x +^u_k y) +^l_k ((x′ +^u_k y′) +^u_k (x +^l_k y))

    [x, x′] ×^l_{2k} [y, y′] = [z, z′]  ⇔
        z =_k (x′ ×^u_k y′) +^l_k ((y ×^l_k x′) +^l_k (y′ ×^l_k x))  ∧  z′ =_k x′ ×^l_k y′

    [x, x′] ×^u_{2k} [y, y′] = [z, z′]  ⇔
        z =_k (((x′ ×^u_k y′) +^u_k ((y ×^l_k x′) +^u_k (y′ ×^l_k x))) +^u_k (x ×^l_k y)) +^l_k (x ×^u_k y)
      ∧ z′ =_k ((x′ ×^u_k y′) +^u_k ((y ×^l_k x′) +^l_k (y′ ×^l_k x))) +^u_k (x ×^l_k y)

This concludes the proof. □
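The same idea can be mirrored operationally: with k-bit primitives returning the lower (l) or upper (u) k bits of a sum or product, double-length multiplication is the schoolbook scheme of combining half-products and carries. The Python sketch below uses illustrative names, a (low, high) pair convention, and k = 8; it computes only the lower 2k bits of the product, the upper half being analogous.

```python
K = 8
M = 1 << K                       # 2^k

def addl(x, y): return (x + y) % M     # lower k bits of the sum
def addu(x, y): return (x + y) // M    # upper bits of the sum (the carry)
def mull(x, y): return (x * y) % M     # lower k bits of the product
def mulu(x, y): return (x * y) // M    # upper k bits of the product

def mull_2k(p, q):
    """Lower 2k bits of the product of two 2k-bit numbers given as
    (low, high) pairs, using only the k-bit half operations above
    (a sketch of the schoolbook scheme)."""
    (xl, xh), (yl, yh) = p, q
    zl = mull(xl, yl)                            # low k bits of xl*yl
    zh = addl(mulu(xl, yl),                      # carry out of the low product
              addl(mull(xl, yh), mull(xh, yl)))  # low halves of the cross terms
    return (zl, zh)

def decode(p):
    return p[0] + M * p[1]

a, b = (123, 1), (45, 2)         # encode 379 and 557 with k = 8
print(decode(mull_2k(a, b)) == (379 * 557) % (M * M))  # → True
```

Terms such as xh * yh never reach the lower 2k bits, which is why they are absent from the definition of ×^l_{2k} above as well.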