SQL Can Maintain Polynomial-Hierarchy Queries

Limsoon Wong
BioInformatics Centre & Institute of Systems Science, Singapore 119597
Email: [email protected]

Leonid Libkin
Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974, USA
Email: [email protected]

22 September 1997
Part of this work was done when Wong was visiting Bell Labs.

1 Summary

Let us introduce the concept of an incremental evaluation system, or IES, discussed in [15]. Suppose we have a query $Q : S \to T$. An IES(L) for maintaining the query $Q$ is a system consisting of an input database $I$, an answer database $A$, an optional auxiliary database, and a finite set of "update" functions that correspond to the different kinds of permissible updates to the input database. Each update function takes as input the corresponding update, the input database, the answer database, and the auxiliary database; collectively they produce as output the updated input database, the updated answer database, and the updated auxiliary database. There are only two requirements: the condition $A = Q(I)$ must be maintained, and the update functions must be expressible in the language L. (L is called the ambient language of the IES.)

In this report, we consider only queries from flat relations to flat relations, and the permissible updates are restricted to the insertion and deletion of a single tuple. We also require that the constants appearing in the auxiliary database must appear in the input database, in the answer, or in some fixed set. In this report, this fixed set is $\mathbb{Q}$, the set of rational numbers.

We use the first-order incremental evaluation system, IES(FO) (called FOIES in [10]), to illustrate the concept. IES(FO) uses first-order logic to express update functions [9, 11]. For each relation symbol $R$, we use $R^o$ to refer to the instance of $R$ before an update, and $R^n$ to refer to the instance of $R$ after the update (here 'o' stands for old and 'n' for new). Consider the view EVEN that is defined to be $\{1\}$ if the relation $R$ has even cardinality and $\{\}$ if $R$ has odd cardinality. While EVEN is well known to be inexpressible in first-order logic [2], it can be maintained in IES(FO). The update function when a tuple $o$ is deleted from $R$ is given by

$$EVEN^n(1) \iff (R(o) \wedge \neg EVEN^o(1)) \vee (\neg R(o) \wedge EVEN^o(1)).$$

The update function when a tuple $o$ is inserted into $R$ is given by

$$EVEN^n(1) \iff (R(o) \wedge EVEN^o(1)) \vee (\neg R(o) \wedge \neg EVEN^o(1)).$$
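The two update functions for EVEN can be read as a small program. The following Python sketch is only an illustration: the function names and the driver `apply_updates` are hypothetical, and Python sets stand in for the relation $R$. Each update function sees only the old relation, the old value of EVEN, and the updated tuple, mirroring the first-order formulas.

```python
def even_after_delete(R_old, even_old, o):
    # EVEN^n(1) iff (R(o) and not EVEN^o(1)) or (not R(o) and EVEN^o(1))
    in_R = o in R_old
    return (in_R and not even_old) or (not in_R and even_old)


def even_after_insert(R_old, even_old, o):
    # EVEN^n(1) iff (R(o) and EVEN^o(1)) or (not R(o) and not EVEN^o(1))
    in_R = o in R_old
    return (in_R and even_old) or (not in_R and not even_old)


def apply_updates(updates):
    """Replay a list of ("ins" | "del", tuple) updates, maintaining EVEN incrementally."""
    R, even = set(), True  # the empty relation has even cardinality
    for kind, o in updates:
        if kind == "ins":
            even = even_after_insert(R, even, o)
            R = R | {o}
        else:
            even = even_after_delete(R, even, o)
            R = R - {o}
        # Invariant A = Q(I): the maintained flag agrees with recomputation.
        assert even == (len(R) % 2 == 0)
    return R, even
```

Note that inserting a tuple already in $R$, or deleting one that is absent, leaves EVEN unchanged, which is exactly what the $R(o)$ and $\neg R(o)$ disjuncts encode.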
The IES(FO) that we used to maintain EVEN above is also called a space-free IES(FO), because it does not make use of any auxiliary relations. It is sometimes necessary to use auxiliary relations. We write IES(FO)$_k$ to mean the subclass of IES(FO) where auxiliary relations of arities up to $k$ can be used. In general, we write IES(L)$_k$ to mean the subclass of IES(L) where flat auxiliary relations of arities up to $k$ can be used.

Much is already known about IES(FO). The transitive closure of undirected graphs can be maintained in IES(FO)$_3$ [19] and even in IES(FO)$_2$ [10]. Dong and Su [10] showed that the IES(FO)$_k$ hierarchy is strict for $k \le 2$. More recently, using a result of Cai [5], Dong and Su showed in the journal version of their paper [10] that the IES(FO)$_k$ hierarchy is strict for every $k$. However, their example query that proved the strict inclusion of IES(FO)$_k$ in IES(FO)$_{k+1}$ had input arity $6k$. Even more recently, using a result of Razborov [20], Dong and Zhang separated IES(FO)$_k$ from IES(FO)$_{k+1}$ using an example query of arity $3k + 1$. However, it is open whether there is an IES(FO) for the transitive closure of general directed graphs. It is also open whether the IES(FO)$_k$ hierarchy remains strict if we restrict to queries having fixed input arity.

Besides these unresolved problems, IES(FO) has the further problem of not properly reflecting the power of practical relational systems. This is because IES(FO) uses first-order logic as its ambient language, while practical relational systems use SQL, which is more powerful than first-order logic. So we study the incremental evaluation system whose ambient language is $NRC^{aggr}$, our reconstruction of SQL based on a nested relational calculus. We use the notation IES($NRC^{aggr}$) to denote the incremental evaluation system where both the input database and the answer are flat relations, but the auxiliary database can consist of nested relations. We use the notation IES(SQL) when the auxiliary database is restricted to flat relations.
The rationale for IES(SQL) is that it more closely approximates what could be done in a relational database, which can store only flat tables. With features such as nesting of intermediate data (as in GROUPBY) and aggregates, the ambient language has essentially the power of SQL, hence the notation.

Much is also known about IES(SQL). We know that space-free IES(SQL) is unable to maintain the transitive closure of arbitrary graphs [7]. We also know that the transitive closure of arbitrary graphs remains unmaintainable in IES(SQL) even in the presence of auxiliary data whose degrees are bounded by a constant [8]. On the positive side, we know that if the bounded degree constraint on auxiliary data is removed, the transitive closure of arbitrary graphs becomes maintainable in IES(SQL). In fact, this query and even the alternating path query can be maintained in IES(SQL)$_2$! Finally, we also know that the IES(SQL)$_k$ hierarchy collapses to IES(SQL)$_2$ [15].

One can therefore ask: what exactly is the limit of the power of IES(SQL)? In this paper, after we introduce IES(SQL) more formally in Section 2, we prove results aimed at answering this question. On the positive side, we show in Section 3, in a uniform manner, that all relational queries expressible in second-order logic, and hence in the polynomial hierarchy [14], are maintainable in IES(SQL). On the negative side, we show in Section 4, using a cardinality argument, that second-order logic is essentially the upper bound on the power of IES(SQL). From these results, we conclude that practical relational databases, as well as more advanced systems like Kleisli [6], possess remarkable power in a way that was little suspected before.
2 Nested Relational Calculus with Aggregates

The language $NRC^{aggr}$ is obtained by extending the nested relational calculus NRC(=) of [4, 23] with arithmetic and aggregate functions. The motivation for considering $NRC^{aggr}$ is that it is a much more realistic query language than relational algebra. Indeed, as explained later, one can consider $NRC^{aggr}$ to be a theoretical reconstruction of SQL, the de facto relational query language of the commercial world.

We present the language incrementally. We start from NRC(=), which is equivalent to the usual nested relational algebra [1, 4]. The data types that can be manipulated are:
$$s ::= b \mid s_1 \times \cdots \times s_n \mid \{s\}$$

The symbol $b$ ranges over base types like Booleans $\mathbb{B}$, rational numbers $\mathbb{Q}$, etc. The type $s_1 \times \cdots \times s_n$ contains $n$-ary tuples whose components have types $s_1$, ..., $s_n$ respectively. The objects of type $\{s\}$ are sets of finite cardinality whose elements are objects of type $s$. As can be seen from the data types, NRC(=) is a language for arbitrarily nested relations. The syntax and typing rules of NRC are given below.
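$$
\frac{}{x^s : s}
\qquad
\frac{}{c : b}
\qquad
\frac{e_1 : s_1 \quad \cdots \quad e_n : s_n}{(e_1, \ldots, e_n) : s_1 \times \cdots \times s_n}
\qquad
\frac{e : s_1 \times \cdots \times s_n}{\pi_i\, e : s_i}
$$

$$
\frac{}{\{\}^s : \{s\}}
\qquad
\frac{e : s}{\{e\} : \{s\}}
\qquad
\frac{e_1 : \{s\} \quad e_2 : \{s\}}{e_1 \cup e_2 : \{s\}}
\qquad
\frac{e_1 : \{t\} \quad e_2 : \{s\}}{\bigcup \{ e_1 \mid x^s \in e_2 \} : \{t\}}
$$

$$
\frac{}{true : \mathbb{B}}
\qquad
\frac{}{false : \mathbb{B}}
\qquad
\frac{e_1 : \mathbb{B} \quad e_2 : s \quad e_3 : s}{\text{if } e_1 \text{ then } e_2 \text{ else } e_3 : s}
\qquad
\frac{e_1 : s \quad e_2 : s}{e_1 = e_2 : \mathbb{B}}
$$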
We often omit the type superscripts as they can be inferred. An expression $e$ having free variables $\vec{x}$ is interpreted as a function $f(\vec{x}) = e$, which given input $\vec{O}$, of the same arity as $\vec{x}$, produces $e[\vec{O}/\vec{x}]$ as its output. Here $[\vec{O}/\vec{x}]$ is the substitution replacing the $i$th component of $\vec{x}$ by the $i$th component of $\vec{O}$. An expression $e$ with no free variable can be regarded as a constant function $f \equiv e$.

Let us briefly recall the semantics; see also [4]. Variables $x^s$ are available for each type $s$. Every constant $c$ of base type $b$ is available. The operations for tuples are standard. Namely, $(e_1, \ldots, e_n)$ forms an $n$-tuple whose $i$th component is $e_i$, and $\pi_i\, e$ returns the $i$th component of the $n$-tuple $e$. $\{\}$ forms the empty set. $\{e\}$ forms the singleton set containing $e$. $e_1 \cup e_2$ forms the union of the two sets $e_1$ and $e_2$. $\bigcup \{ e_1 \mid x \in e_2 \}$ maps the function $f(x) = e_1$ over all elements of $e_2$ and then returns their union; thus, if $e_2$ is the set $\{o_1, \ldots, o_n\}$, the result of this operation is $f(o_1) \cup \cdots \cup f(o_n)$. For example, $\bigcup \{ \{(x, x)\} \mid x \in \{1, 2\} \}$ evaluates to $\{(1, 1), (2, 2)\}$.
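The big-union construct is the only iterator in NRC, so it may help to see it spelled out. The sketch below is an illustration in Python, not part of NRC itself; the helper name `big_union` and the use of `frozenset` to model finite sets are assumptions made here.

```python
def big_union(f, s):
    """Model of the NRC construct: map f over the set s and union the results.

    f must return a (frozen)set for every element of s.
    """
    result = frozenset()
    for x in s:
        result = result | f(x)
    return result


# The example from the text: the union of {(x, x)} over x in {1, 2}.
diagonal = big_union(lambda x: frozenset({(x, x)}), frozenset({1, 2}))

# Selection is a special case: return a singleton or the empty set per element.
evens = big_union(
    lambda x: frozenset({x}) if x % 2 == 0 else frozenset(),
    frozenset({1, 2, 3, 4}),
)
```

Because each element contributes a set rather than a single value, the same construct covers mapping, filtering, and flattening.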
The operations for Booleans are also quite typical, with $true$ and $false$ denoting the two Boolean values. $e_1 = e_2$ returns $true$ if $e_1$ and $e_2$ have the same value and returns $false$ otherwise. Finally, if $e_1$ then $e_2$ else $e_3$ evaluates to $e_2$ if $e_1$ is $true$ and evaluates to $e_3$ if $e_1$ is $false$.

We have provided an equality test on every type $s$. However, this is equivalent to having the equality test restricted to base types together with an emptiness test for sets of base types [22]. Also, set membership, union, intersection, etc. can obviously be defined in terms of the equality test.

NRC possesses the so-called conservative extension property [23]: if a function $f : s_1 \to s_2$ is expressible in NRC, then it can be expressed using an expression of height no more than that of $s_1$ and $s_2$. The height of a type is defined as its depth of nesting of set brackets. The height of an expression is defined as the maximum height of all types that appear in its typing derivation. More specifically, if $f : s_1 \to s_2$ takes flat relations to flat relations and is expressible in NRC, then it is also expressible in the standard flat relational algebra [18, 4].

It is a common misconception that the relational algebra is the same as SQL. The truth is that all versions of SQL come with three features that have no equivalent in relational algebra: SQL extends the relational calculus by having arithmetic operations, a group-by operation, and various aggregate functions such as AVG, COUNT, SUM, MIN, and MAX. It is known [4] that the group-by operator can already be simulated in NRC(=). The others need to be added. The arithmetic operators are the standard ones: $+$, $-$, $\cdot$, and $\div$ of type $\mathbb{Q} \times \mathbb{Q} \to \mathbb{Q}$. We also add the order on the rationals: $\le_{\mathbb{Q}} : \mathbb{Q} \times \mathbb{Q} \to \mathbb{B}$. As to aggregate functions, we add just the following construct:
$$
\frac{e_1 : \mathbb{Q} \quad e_2 : \{s\}}{\sum \{|\, e_1 \mid x^s \in e_2 \,|\} : \mathbb{Q}}
$$

The semantics is this: map the function $f(x) = e_1$ over all elements of $e_2$ and then add up the results. Thus, if $e_2$ is the set $\{o_1, \ldots, o_n\}$, it returns $f(o_1) + \cdots + f(o_n)$. For example, $\sum \{|\, 1 \mid x \in X \,|\}$ returns the cardinality of $X$. Note that this is different from adding up the values in $\{f(o_1), \ldots, f(o_n)\}$; in the example above, doing so yields 1 as no duplicates are kept. To emphasize that duplicate values of $f$ are being added up, we use bag (multiset) brackets $\{|\ |\}$ in this construct.

We denote this theoretical reconstruction of SQL by $NRC^{aggr}$. That is, $NRC^{aggr}$ has all the constructs of NRC(=), the arithmetic operations $+$, $-$, $\cdot$, and $\div$, the summation construct $\sum$, and the linear order on the rationals. It was shown in [16, 17] that all SQL aggregate functions mentioned above can be implemented in $NRC^{aggr}$. It is also known [16, 17] that $NRC^{aggr}$ has the conservative extension property, and thus its expressive power depends only on the height of input and output and is independent of the height of intermediate data. So to conform to SQL, it suffices to restrict our input and output to height at most one, that is, to the usual flat relational databases.

Let us briefly introduce a nice shorthand, based on the comprehension notation [21, 3], for writing $NRC^{aggr}$ queries. Recall from [3, 4, 23] that the comprehension $\{e \mid A_1, \ldots, A_n\}$, where each $A_i$ either has the form $x_i \in e_i$ or is an expression $e_i$ of type $\mathbb{B}$, has a direct correspondent in NRC that is given by recursively applying the following equations:
$$\{e \mid x_i \in e_i, \ldots\} = \bigcup \{ \{e \mid \ldots\} \mid x_i \in e_i \}$$

$$\{e \mid e_i, \ldots\} = \text{if } e_i \text{ then } \{e \mid \ldots\} \text{ else } \{\}$$

The comprehension notation is more user-friendly than the syntax of $NRC^{aggr}$. For example, it allows us to write $\{(x, y) \mid x \in e_1, y \in e_2\}$ for the cartesian product of $e_1$ and $e_2$ instead of the clumsier $\bigcup \{ \bigcup \{ \{(x, y)\} \mid y \in e_2 \} \mid x \in e_1 \}$.

In addition to comprehension, we also find it convenient to use a little bit of pattern matching, which can be removed in a straightforward manner. For example, we write $\{(x, z) \mid (x, y) \in e_1, (y', z) \in e_2, y = y'\}$ for relational composition instead of the more official $\{(\pi_1 X, \pi_2 Y) \mid X \in e_1, Y \in e_2, \pi_2 X = \pi_1 Y\}$ or the much clumsier $\bigcup \{ \bigcup \{ \text{if } \pi_2 X = \pi_1 Y \text{ then } \{(\pi_1 X, \pi_2 Y)\} \text{ else } \{\} \mid Y \in e_2 \} \mid X \in e_1 \}$. Here $X$ and $Y$ denote edges ($(x, y)$ and $(y, z)$ respectively), whose components $x$, $y$, and $z$ are obtained by applying the projections $\pi_1$ and $\pi_2$.

We use the notation IES($NRC^{aggr}$) to refer to the incremental evaluation system that uses $NRC^{aggr}$ as its ambient language. We use the notation IES(SQL) to refer to the restriction of IES($NRC^{aggr}$) to use only flat auxiliary relations. This restriction is made to achieve a direct correspondence to real relational databases, which can store only flat relations. It is known [15] that this restriction does not cost a loss in power; in fact, even a restriction to flat auxiliary relations of arity 2 does not result in a loss in power. Hence, in the development of our results, we freely use the IES that is most convenient.
Fact 2.1 IES($NRC^{aggr}$) = IES(SQL) = IES(SQL)$_2$. □
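The summation construct and the desugaring of comprehensions described in this section can be made concrete with a short Python sketch. It is an illustration only: `big_union`, `bag_sum`, and `compose` are hypothetical helper names, `frozenset` models finite sets, and `Fraction` stands in for the rationals.

```python
from fractions import Fraction


def big_union(f, s):
    """Union of f(x) over all x in s (the NRC big-union construct)."""
    result = frozenset()
    for x in s:
        result = result | f(x)
    return result


def bag_sum(f, s):
    """Summation construct: add f(x) once per element x of s, keeping duplicates."""
    return sum((Fraction(f(x)) for x in s), Fraction(0))


def compose(e1, e2):
    """Relational composition, written in the fully desugared nested big-union form."""
    return big_union(
        lambda X: big_union(
            # if pi_2 X = pi_1 Y then {(pi_1 X, pi_2 Y)} else {}
            lambda Y: frozenset({(X[0], Y[1])}) if X[1] == Y[0] else frozenset(),
            e2,
        ),
        e1,
    )


X = frozenset({"a", "b", "c"})
cardinality = bag_sum(lambda x: 1, X)        # counts every element of X
collapsed = sum({1 for x in X})              # a set keeps only one copy of 1
```

The contrast between `cardinality` and `collapsed` is the point made in the text about bag brackets: summing over the set of values loses duplicates, while the summation construct does not.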
3 Maintainability of Second Order Queries

We prove in this section that we can maintain all queries expressible in second order logic (equivalently, all queries in PH, the polynomial hierarchy). Let us use the notation $\wp(B^k)$ to mean the powerset of the $k$-fold cartesian product of the set $B : \{b\}$ of atomic objects. The proof involves two steps. In the first step, we show that $\wp(B^k)$ can be maintained in IES(SQL) for every $k$, when $B$ is updated. In the second step, we show that if the domain $R$ of each second order quantifier $\exists x \in R.\,\phi$ is made available to $NRC^{aggr}$ as a symbol $\hat{R}$, then any second order logic formula can be translated to $NRC^{aggr}$. Let us take the first and key step now.
Proposition 3.1 IES(SQL) can maintain $\wp(B^k)$ for every $k$ when $B : \{b\}$ is updated.

Proof. Let $PB^k_o$ and $PB^k_n$ be the symbols naming the nested relation $\wp(B^k)$ immediately before and after the update. We proceed by induction on $k$. The base case of $k = 1$ is that of maintaining the powerset of a unary relation, which is obvious and is omitted. For the induction case of $k > 1$, we consider two cases.
Suppose the update is the insertion of a new element $x$ into the set $B$. By the induction hypothesis, IES(SQL) can maintain $\wp(B^{k-1})$. So we can create the following nested sets in IES(SQL): $Y_0 = \{\{(x, \ldots, x)\}\}$ and $Y_i = \{ \{(z_1, \ldots, z_i, x, z_{i+1}, \ldots, z_{k-1}) \mid (z_1, \ldots, z_{k-1}) \in X\} \mid X \in PB^{k-1}_n \}$, for $i = 1, \ldots, k - 1$. Let cartprod be the function that forms the cartesian product of two sets; this function is easily definable in $NRC^{aggr}$. Let allunion be the function that takes a tuple $(S_1, \ldots, S_k)$ of sets and returns a set of sets containing all possible unions of $S_1$, ..., $S_k$; this function is also definable in $NRC^{aggr}$ because the number of combinations is finite once $k$ is given. Then it is not difficult to see that

$$PB^k_n = \{ X \mid Y \in (PB^k_o \text{ cartprod } Y_0 \text{ cartprod } Y_1 \text{ cartprod } \cdots \text{ cartprod } Y_{k-1}),\ X \in \text{allunion}(Y) \}.$$

Suppose the update is the deletion of an existing element $x$ from the set $B$. Then all we need is to delete from each of $PB^1$, ..., $PB^k$ all the sets that have $x$ as a component of one of their elements, which is clearly definable in $NRC^{aggr}$. □

We can now do the second step and prove the main result of this section.
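The idea behind Proposition 3.1 can be sketched in Python. This is a simplified illustration and not the $NRC^{aggr}$ construction from the proof: instead of the sets $Y_i$ and the function allunion, it unions every old set with every subset of the tuples that mention the new atom. All function names are hypothetical, and `frozenset` models finite sets.

```python
from itertools import combinations, product


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def insert_atom(B, PBk, k, x):
    """Given B and PBk = powerset(B^k), return the pair for B with x inserted."""
    if x in B:
        return B, PBk
    B_new = B | {x}
    # Tuples of B_new^k that mention x are exactly those missing from B^k.
    fresh = [t for t in product(B_new, repeat=k) if x in t]
    # Every subset of B_new^k is an old subset unioned with a subset of fresh.
    PBk_new = frozenset(X | Z for X in PBk for Z in subsets(fresh))
    return B_new, PBk_new


def delete_atom(B, PBk, x):
    """Drop every set that has x as a component of one of its tuples."""
    return B - {x}, frozenset(X for X in PBk if all(x not in t for t in X))


# Start from the empty domain, whose powerset is {{}}, and replay updates.
B, PB2 = frozenset(), frozenset({frozenset()})
B, PB2 = insert_atom(B, PB2, 2, 1)
B, PB2 = insert_atom(B, PB2, 2, 2)
```

The update functions never recompute the powerset from scratch; they only read the old state and the updated atom, which is the shape an IES requires. The exponential size of the maintained relation is inherent in the statement being proved.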
Theorem 3.2 IES(SQL) can maintain all queries expressible in second-order logic. Thus IES(SQL) can maintain all queries that are in PH.

Proof. Suppose we want to maintain $A = \{\vec{y} \mid \vec{R} \models \phi\}$, where $\vec{R}$ are the input relations, $A$ is the output relation, and $\phi$ is a second order formula containing free variables $\vec{y}$. We can assume that all quantifiers in $\phi$ are of the form $\exists x \in U_k.\,\psi$, where $U_k$ is interpreted as $\wp(B^k)$ and $B$ contains all the atomic objects found in $\vec{R}$. (That is, $B$ is the active domain.) We show that $\phi$ can be expressed in $NRC^{aggr}$ as an expression $\hat{\phi}$, assuming that the symbols $PB^k$ are available and are taking on the values of $\wp(B^k)$.

We proceed by induction on the structure of $\phi$. If $\phi$ is $\exists x \in U_k.\,\psi$, then we translate it as $\sum \{|\, 1 \mid z \in \{x \mid x \in PB^k, \hat{\psi}\} \,|\} > 0$. If $\phi$ is $R(\vec{z})$, then we translate it as $\vec{z} \in R$. If $\phi$ is $\psi \wedge \varphi$, then we translate it as if $\hat{\psi}$ then $\hat{\varphi}$ else $false$. If $\phi$ is $\neg\psi$, then we translate it as if $\hat{\psi}$ then $false$ else $true$. Finally, if $\phi$ is a comparison operation, we just use the corresponding one in $NRC^{aggr}$.

So to maintain $A = \{\vec{y} \mid \vec{R} \models \phi\}$, all we need is to maintain $A = \{\vec{y} \mid \vec{y} \in \hat{\phi}\}$ instead. Since $\hat{\phi}$ is already in $NRC^{aggr}$, to maintain it, all we need to do is to maintain the $PB^k$ used in $\hat{\phi}$ when the input relations $\vec{R}$ are updated. Since each update to $\vec{R}$ can only add a finite number of new atoms to the active domain or delete a finite number of old atoms from the active domain, by Proposition 3.1, we know that we can maintain $PB^k$ in IES(SQL). The theorem is thus proved. □
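The translation of a second order existential quantifier into a count over the maintained powerset can be illustrated as follows. This Python sketch makes several assumptions: the names `exists_over` and `is_two_colorable` are hypothetical, `PB1` is built directly by a helper rather than maintained incrementally, its elements are sets of atoms rather than sets of 1-tuples, and Python's `all` stands in for the first-order part of the formula. The example query (2-colorability of a graph) is chosen here for brevity and does not appear in the paper.

```python
from itertools import combinations


def powerset(B):
    """All subsets of B; in an IES this relation would be maintained, not recomputed."""
    items = list(B)
    return frozenset(
        frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)
    )


def exists_over(PB, psi):
    """Translate 'exists x in U_k . psi' as: the count of witnesses in PB is > 0."""
    witnesses = frozenset(X for X in PB if psi(X))
    return sum(1 for _ in witnesses) > 0


def is_two_colorable(E, PB1):
    """Second order query: some set X of vertices separates the ends of every edge."""
    return exists_over(PB1, lambda X: all((u in X) != (v in X) for (u, v) in E))


PB1 = powerset({1, 2, 3})
path = {(1, 2), (2, 3)}
triangle = {(1, 2), (2, 3), (3, 1)}
```

Once the powerset is available as an ordinary relation, the second order quantifier becomes a plain selection followed by an aggregate, which is why no recursion is needed in the ambient language.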
Many queries are complete for interesting complexity classes. For instance, the 3-colorability query 3-COLOR is NP-complete [12] and the alternating paths query APATH is P-complete [13]. It follows from our result above that these interesting queries can be maintained in IES(SQL).
Corollary 3.3 Both 3-COLOR and APATH can be maintained in IES(SQL). □
4 Unmaintainability of Third Order Queries

Having captured the whole of the polynomial hierarchy inside IES(SQL), can we do more? Perhaps we can go beyond mere $\wp(B^k)$ and also maintain huge sets like $\wp(\wp(B^k))$, $\wp(\wp(B) \text{ cartprod } \wp(B))$, etc.? In this section we show that IES(SQL) cannot maintain any of these huge sets. Indeed, while IES(SQL) can fully exploit exponential space, it cannot escape much beyond that and is strictly contained inside 2-DEXPSPACE.
Theorem 4.1 IES(SQL) is inside 2-DEXPSPACE. In fact, for every IES(SQL) there is a number $m$ such that if $N$ is the total size of the initial input database, answer database, and auxiliary database, then the total size of the final input database, answer database, and auxiliary database after $n$ updates is $O(N^{m^n})$.

Proof. Suppose an IES(SQL) is given. It is known from previous work that all $NRC^{aggr}$ queries have polynomial complexity [4, 22]. Since all the update functions of our IES(SQL) are expressible in $NRC^{aggr}$, their complexity must be $O(N^m)$ for some constant $m$, where $N$ is the total size of the old input database, the old answer database, and the old auxiliary database of our IES(SQL). That is, after a single update, the total size of the updated input database, the updated answer database, and the updated auxiliary database must be $O(N^m)$. It follows that, after $n$ updates, the total size of the updated input database, the updated answer database, and the updated auxiliary database must be $O(N^{m^n})$. That is, IES(SQL) is in 2-DEXPSPACE. □
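The step from one update to $n$ updates in the proof above can be spelled out as a recurrence. The following is a sketch under stated assumptions: $S_i$ (the total size after $i$ updates) and $c \ge 1$ (the constant hidden in the $O$-notation) are names introduced here for illustration, and $m \ge 2$ is assumed.

```latex
S_0 = N, \qquad S_{i+1} \le c\, S_i^{\,m}
\quad\Longrightarrow\quad
S_n \;\le\; c^{\,1 + m + \cdots + m^{n-1}}\, N^{m^n}
    \;=\; c^{(m^n - 1)/(m - 1)}\, N^{m^n}
    \;\le\; (cN)^{m^n}.
```

The exponent $m^n$ is itself exponential in the number of updates, so the bound has the doubly exponential shape stated in the theorem, up to the constant $c$.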
Let us write $\wp^j(B^k)$ to mean taking the powerset $j$ times on the $k$-fold cartesian product of the set $B$ of atomic objects. It follows from the theorem above that
Corollary 4.2 $\wp^j(B^k)$ can be maintained by an IES(SQL) when $B$ is updated iff either $j = 1$, or $j = 2$ and $k = 1$.

Proof. We need to prove three things.

First, we need to show that $\wp(B^k)$ can be maintained. This was already done in Proposition 3.1.

Second, we need to show that $\wp^2(B)$ can be maintained. Here is an IES(SQL) that does the job. Let $B : \{b\}$ denote the input database. Let $PPB : \{\{\{b\}\}\}$ denote the answer database. $B$ is initially empty. $PPB$ is initially $\{\{\}, \{\{\}\}\}$. We want to maintain $PPB = \wp(\wp(B))$ when $B$ is updated. We have two cases. Suppose the update is the insertion of a new atomic object $x$ into $B$. Let $\Delta = \{ U \cup \{\{x\} \cup v \mid v \in V\} \mid U \in PPB^o, V \in PPB^o \}$. Then $PPB^n = PPB^o \cup \Delta$ is the desired double powerset. Suppose the update is the deletion of an old object $x$ from $B$. Then we simply delete from $PPB$ all those sets that mention $x$, a task readily expressible in $NRC^{aggr}$. This finishes the case.

Lastly, we need to show that for $j \ge 2$ and $k > 1$, $\wp^j(B^k)$ cannot be maintained. But this follows immediately from Theorem 4.1. □
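The IES for the double powerset in the proof above is small enough to run. The Python sketch below follows the two update rules as stated; the function names are hypothetical, `frozenset` models finite sets, and "mention $x$" is read as "contain a set that has $x$ as a member".

```python
EMPTY = frozenset()

# PPB is initially {{}, {{}}}, the double powerset of the empty set.
PPB_INITIAL = frozenset({EMPTY, frozenset({EMPTY})})


def insert_atom(PPB_old, x):
    """PPB^n = PPB^o union Delta, with Delta built from pairs (U, V) of old members."""
    delta = frozenset(
        U | frozenset(v | {x} for v in V)
        for U in PPB_old
        for V in PPB_old
    )
    return PPB_old | delta


def delete_atom(PPB_old, x):
    """Keep only the members none of whose elements contains x."""
    return frozenset(W for W in PPB_old if all(x not in v for v in W))


after_a = insert_atom(PPB_INITIAL, "a")
after_ab = insert_atom(after_a, "b")
after_delete = delete_atom(after_ab, "a")
```

Each insertion squares the size of the answer (2, 4, 16, 256, ...), which is the fastest growth that the bound of Theorem 4.1 still permits.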
Therefore, the largest meaningful class of queries that can be maintained by IES(SQL) is essentially those in PH.
References

[1] S. Abiteboul and P. Kanellakis. Query languages for complex object databases. SIGACT News, 21(3):9–18, 1990.
[2] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995.
[3] Peter Buneman, Leonid Libkin, Dan Suciu, Val Tannen, and Limsoon Wong. Comprehension syntax. SIGMOD Record, 23(1):87–96, March 1994.
[4] Peter Buneman, Shamim Naqvi, Val Tannen, and Limsoon Wong. Principles of programming with complex objects and collection types. Theoretical Computer Science, 149(1):3–48, September 1995.
[5] Jin-Yi Cai. Lower bound for constant-depth circuits in the presence of help bits. Information Processing Letters, 36:79–83, 1990.
[6] Susan Davidson, Christian Overton, Val Tannen, and Limsoon Wong. BioKleisli: A digital library for biomedical researchers. International Journal of Digital Libraries, 1(1):36–53, April 1997.
[7] Guozhu Dong, Leonid Libkin, and Limsoon Wong. On impossibility of decremental recomputation of recursive queries in relational calculus and SQL. In Proceedings of 5th International Workshop on Database Programming Languages, Gubbio, Italy, September 1995, Springer Electronic Workshops in Computing, page 8, 1996. Available at http://www.springer.co.uk/eWiC/Workshops/DBPL5.html.
[8] Guozhu Dong, Leonid Libkin, and Limsoon Wong. Local properties of query languages. In Proceedings of 6th International Conference on Database Theory, pages 140–154, Delphi, Greece, January 1997.
[9] Guozhu Dong and Jianwen Su. Incremental and decremental evaluation of transitive closure by first-order queries. Information and Computation, 120(1):101–106, July 1995.
[10] Guozhu Dong and Jianwen Su. Space-bounded FOIES. In Proceedings of 14th ACM Symposium on Principles of Database Systems, San Jose, California, pages 139–150, May 1995.
[11] Guozhu Dong, Jianwen Su, and Rodney Topor. Nonrecursive incremental evaluation of Datalog queries. Annals of Mathematics and Artificial Intelligence, 14:187–223, 1995.
[12] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, San Francisco, 1979.
[13] Neil Immerman. Languages that capture complexity classes. SIAM Journal of Computing, 16:760–778, 1987.
[14] D. Johnson. A Catalog of Complexity Classes, volume A of Handbook of Theoretical Computer Science, chapter 2, pages 67–161. North Holland, 1990.
[15] Leonid Libkin and Limsoon Wong. Incremental recomputation of recursive queries with nested sets and aggregate functions. In LNCS ????: Proceedings of 6th International Workshop on Database Programming Languages, Estes Park, Colorado, August 1997. To appear.
[16] Leonid Libkin and Limsoon Wong. Aggregate functions, conservative extension, and linear orders. In Catriel Beeri, Atsushi Ohori, and Dennis E. Shasha, editors, Proceedings of 4th International Workshop on Database Programming Languages, New York, August 1993, pages 282–294. Springer-Verlag, January 1994. See also UPenn Technical Report MS-CIS-93-36.
[17] Leonid Libkin and Limsoon Wong. New techniques for studying set languages, bag languages, and aggregate functions. In Proceedings of 13th ACM Symposium on Principles of Database Systems, pages 155–166, Minneapolis, Minnesota, May 1994. See also UPenn Technical Report MS-CIS-93-95.
[18] Jan Paredaens and Dirk Van Gucht. Converting nested relational algebra expressions into flat algebra expressions. ACM Transaction on Database Systems, 17(1):65–93, March 1992.
[19] Sushant Patnaik and Neil Immerman. Dyn-FO: A parallel dynamic complexity class. In Proceedings of 13th ACM Symposium on Principles of Database Systems, pages 210–221, Minneapolis, Minnesota, May 1994.
[20] R. Smolensky. Algebraic methods in the theory of lower bounds for boolean circuit complexity. In Proceedings of 19th ACM Symposium on Theory of Computing, pages 77–82, 1987. Same thing as A. Razborov, Lower bounds for the size of circuits of bounded depth with basis AND, XOR, Notes (1987).
[21] Philip Wadler. Comprehending monads. Mathematical Structures in Computer Science, 2:461–493, 1992.
[22] Limsoon Wong. Querying Nested Collections. PhD thesis, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, August 1994. Available as University of Pennsylvania IRCS Report 94-09.
[23] Limsoon Wong. Normal forms and conservative extension properties for query languages over collection types. Journal of Computer and System Sciences, 52(3):495–505, June 1996.