A CRITIQUE OF THE SQL DATABASE LANGUAGE

C. J. Date
PO Box 2647, Saratoga, California, USA

December 1983
The ANS Database Committee (X3H2) is currently at work on a proposed standard relational database language (RDL), and has adopted as a basis for that activity a definition of the "structured query language" SQL from IBM [10]. Moreover, numerous hardware and software vendors (in addition to IBM) have already released or at least announced products that are based to a greater or lesser extent on the SQL language as defined by IBM. There can thus be little doubt that the importance of that language will increase significantly over the next few years. Yet the SQL language is very far from perfect. The purpose of this paper is to present a critical analysis of the language's major shortcomings, in the hope that it may be possible to remedy some of the deficiencies before their influence becomes too all-pervasive. The paper's standpoint is primarily that of formal computer languages in general, rather than that of database languages specifically.
1. INTRODUCTION
The relational language SQL (the acronym is usually pronounced "sequel"), pioneered in the IBM prototype System R [1] and subsequently adopted by IBM and others as the basis for numerous commercial implementations, represents a major advance over older database languages such as the DL/I language of IMS and the DML and DDL of the Data Base Task Group (DBTG) of CODASYL. Specifically, SQL is far easier to use than those older languages; as a result, users in a SQL system (both end-users and application programmers) can be far more productive than they used to be in those older systems (improvements of up to 20 times have been reported). Among the strong points of SQL that lead to such improvements we may cite the following:

* simple data structure
* powerful operators
* short initial learning period
* improved data independence
* integrated data definition and data manipulation
* double mode of use
* integrated catalog
* compilation and optimization

These advantages are elaborated in the appendix to this paper.
The language does have its weak points too, however. In fact, it cannot be denied that SQL in its present form leaves rather a lot to be desired -- even that, in some important respects, it fails to realize the full potential of the relational model. The purpose of this paper is to describe and examine some of those weak points, in the hope that such aspects of the language may be improved before their influence becomes too all-pervasive.

Before getting into details, I should like to make one point absolutely clear: The criticisms that follow should not be construed as criticisms of the original designers and implementers of the SQL language. The paper is intended solely as a critique of the SQL language as such, and nothing more. Note also that the paper applies specifically to the dialect of SQL implemented by IBM in its products SQL/DS, DB2, and QMF. It is entirely possible that some specific point does not apply to some other implemented dialect. However, most points of the paper do apply to most of the dialects currently implemented, so far as I am aware.

The remainder of the paper is divided into the following sections:

* lack of orthogonality: expressions
* lack of orthogonality: builtin functions
* lack of orthogonality: miscellaneous items
* formal definition
* mismatch with host languages
* missing function
* mistakes
* aspects of the relational model not supported
* summary and conclusions
Reference [3] gives some background material -- specifically, a set of principles that apply to the design of programming languages in general and database languages in particular. Many of the criticisms that follow are expressed in terms of those principles.

Note: Some of the points apply to interactive SQL only and some to embedded SQL only, but most apply to both. I have not bothered to spell out the distinctions; the context makes it clear in every case. Also, the structure of the paper is a little arbitrary, in the sense that it is not really always clear which heading a particular point belongs under. There is also some repetition (I hope not too much), for essentially the same reason.
2. LACK OF ORTHOGONALITY: EXPRESSIONS

It is convenient to begin by introducing some nonSQL terms.

* A table-expression is a SQL expression that yields a table -- for example, the expression

     SELECT *
     FROM   EMP
     WHERE  DEPT# = 'D3'

* A column-expression is a SQL expression that yields a single column -- for example, the expression

     SELECT EMP#
     FROM   EMP
     WHERE  DEPT# = 'D3'

  A column-expression is a special case of a table-expression.

* A row-expression is a SQL expression that yields a single row -- for example, the expression

     SELECT *
     FROM   EMP
     WHERE  EMP# = 'E2'

  A row-expression is a special case of a table-expression.

* A scalar-expression is a SQL expression that yields a single scalar value -- for example, the expression

     SELECT AVG (SALARY)
     FROM   EMP

  or the expression

     SELECT SALARY
     FROM   EMP
     WHERE  EMP# = 'E2'

  A scalar-expression is a special case of a row-expression and a special case of a column-expression.
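The four kinds of expression can be tried out against a present-day SQL engine. The following is a minimal sketch only, using Python's standard sqlite3 module as a host; the sample EMP rows are invented for illustration, and the column names containing "#" are written as quoted identifiers because SQLite does not accept them unquoted.

```python
import sqlite3

# Hypothetical sample data: the rows below are assumptions, not from the paper.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE EMP ("EMP#" TEXT PRIMARY KEY, "DEPT#" TEXT, SALARY INTEGER)')
conn.executemany(
    "INSERT INTO EMP VALUES (?, ?, ?)",
    [("E1", "D3", 40000), ("E2", "D3", 50000), ("E3", "D1", 60000)],
)

# table-expression: yields a table (several rows, several columns)
table_result = sorted(
    conn.execute('SELECT * FROM EMP WHERE "DEPT#" = ?', ("D3",)).fetchall()
)

# column-expression: yields a single column
column_result = sorted(
    conn.execute('SELECT "EMP#" FROM EMP WHERE "DEPT#" = ?', ("D3",)).fetchall()
)

# row-expression: yields a single row
row_result = conn.execute('SELECT * FROM EMP WHERE "EMP#" = ?', ("E2",)).fetchall()

# scalar-expression: yields a single scalar value
scalar_result = conn.execute("SELECT AVG(SALARY) FROM EMP").fetchall()

for label, result in [
    ("table", table_result),
    ("column", column_result),
    ("row", row_result),
    ("scalar", scalar_result),
]:
    print(label, result)
```

Every result comes back in the same shape (a list of row tuples), which illustrates the sense in which a column, a row, and a scalar are all degenerate tables.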
Note that these four kinds of expression correspond to the four classes of data object (table, column, row, scalar) supported by SQL -- though incidentally SQL is inconsistent as to whether its expressions yield values or references, in general. Note too that (as pointed out in [3]) the four classes of object can be partially ordered as follows:

              table         (highest)
             /     \
        column      row
             \     /
              scalar        (lowest)

(columns are neither higher nor lower than rows with respect to this ordering). As explained in [3] (again), a language should provide, for each class of object it supports, at least all of the following:

* a constructor function, i.e., a means for constructing an object of the class from literal (constant) values and/or variables of lower classes;

* a means for comparing two objects of the class;

* a means for assigning the value of one object in the class to another;

* a selector function, i.e., a means for extracting component objects of lower classes from an object of the given class;

* a general, recursively defined syntax for expressions that exploits to the full any closure properties the object class may possess.

The table below shows that SQL does not really measure up to these requirements.
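 obj \ opn | constructor            | compare | assign             | selector | gen expr
-----------+------------------------+---------+--------------------+----------+-------------
 table     | no                     | no      | only via           | yes      | no
           |                        |         | INSERT - SELECT    |          | (see below)
-----------+------------------------+---------+--------------------+----------+-------------
 column    | only as arg to IN      | no      | no                 | no       | no
           | (host vbles & consts   |         |                    |          |
           | only)                  |         |                    |          |
-----------+------------------------+---------+--------------------+----------+-------------
 row       | only in INSERT &       | no      | only to/from set   | yes      | no
           | UPDATE (host vbles &   |         | of host scalars    |          |
           | consts only)           |         |                    |          |
-----------+------------------------+---------+--------------------+----------+-------------
 scalar    | N/A                    | yes     | only to/from       | (yes)    | (yes)
           |                        |         | host scalar        |          |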
Let us consider table-expressions in more detail. The SELECT statement, which, since it yields a table, may be regarded as a table-expression (possibly of a degenerate form, e.g., as a column-expression), currently has the following structure:

     SELECT scalar-expression-commalist
     FROM   table-name-commalist
     WHERE  predicate

(ignoring numerous irrelevant details). Notice that it is just table-names that appear in the FROM clause. Completeness suggests that it should be table-expressions (as Gray puts it [8], "anything in computer science that is not recursive is no good"). This is not just an academic consideration, by the way; on the contrary, there are several practical reasons as to why such recursiveness is desirable.

First, consider the relational algebra. Relational algebra possesses the important property of closure -- that is, relations form a closed system under the operations of the algebra, in the sense that the result of applying any of those operations to any relation(s) is itself another relation. As a consequence, the operands of any given operation are not constrained to be real ("base") relations only, but rather can be any algebraic expression. Thus, the relational algebra allows the user to write nested relational expressions -- and this feature is useful for precisely the same reasons that nested expressions are useful in ordinary arithmetic.
Now consider SQL. SQL is a language that supports, directly or indirectly, all the operations of the relational algebra (i.e., SQL is relationally complete). However, the table-expressions of SQL (which are the SQL equivalent of the expressions of the relational algebra) cannot be arbitrarily nested. Let us consider the question of exactly which cases SQL does support. Simplifying matters slightly, the expression SELECT - FROM - WHERE is the SQL version of the nested algebraic expression

     projection ( restriction ( product ( table1, table2, ... ) ) )
(the product corresponds to the FROM clause, the restriction to the WHERE clause, and the projection to the SELECT clause; table1, table2, ... are the tables identified in the FROM clause -- and note that, as remarked earlier, these are simple table-names, not more complex expressions). Likewise, the expression

     SELECT ... FROM ... WHERE ...
     UNION
     SELECT ... FROM ... WHERE ...
     ...

is the SQL version of the nested algebraic expression

     union ( tabexp1, tabexp2, ... )
where tabexp1, tabexp2, ... are in turn table-expressions of the form shown earlier (i.e., projections of restrictions of products of named tables). But it is not possible to formulate direct equivalents of any other nested algebraic expressions. Thus, for example, it is not possible to write a direct equivalent in SQL of the nested expression

     restriction ( projection ( table ) )
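To make the closure property concrete, the algebraic operators can be sketched as ordinary functions over sets of rows. This is an illustrative model only, not anything defined in the paper: the function names, the frozenset representation, and the sample relation are all assumptions. Because every operator returns a relation, any nesting is legal, including the restriction of a projection that SQL does not allow directly.

```python
# Relations are modelled as frozensets of rows; each row is a frozenset of
# (attribute, value) pairs. All names here are hypothetical, for illustration.

def relation(rows):
    """Build a relation from an iterable of dicts."""
    return frozenset(frozenset(r.items()) for r in rows)

def projection(rel, attrs):
    """Keep only the named attributes; duplicate rows collapse automatically."""
    return frozenset(
        frozenset((a, v) for a, v in row if a in attrs) for row in rel
    )

def restriction(rel, pred):
    """Keep only the rows for which pred(row_as_dict) is true."""
    return frozenset(row for row in rel if pred(dict(row)))

# Hypothetical sample relation.
emp = relation([
    {"EMP#": "E1", "DEPT#": "D3", "SALARY": 40000},
    {"EMP#": "E2", "DEPT#": "D3", "SALARY": 50000},
    {"EMP#": "E3", "DEPT#": "D1", "SALARY": 60000},
])

in_d3 = lambda row: row["DEPT#"] == "D3"
kept = {"EMP#", "DEPT#"}

# The "natural" nesting: restriction applied to a projection.
nested = restriction(projection(emp, kept), in_d3)

# The recast form: projection applied to a restriction.
recast = projection(restriction(emp, in_d3), kept)

print(nested == recast)
```

The two forms agree here because the restriction predicate mentions only attributes that survive the projection; when it does not, the recast form is the only one that is even well defined, which is exactly the kind of validity condition the user is forced to reason about.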
Instead, the user has to recast the expression into a semantically equivalent (but syntactically different) form in which the restriction is applied before the projection. What this means in practical terms is that the user may have to expend time and effort transforming the "natural" formulation of a given query into some different, and arguably less "natural", representation (see Example below). What is more, the user is therefore also required to understand exactly when such transformations are valid. This may not always be intuitively obvious. For example, is a projection of a union always equivalent to the union of two projections?

Example: Given the two tables

     NYC ( EMP#, DEPT#, SALARY )
     SFO ( EMP#, DEPT#, SALARY )

(representing New York and San Francisco employees, respectively), list EMP# for all employees.

"Natural" formulation (projection of a union):

     SELECT EMP#
     FROM   ( NYC UNION SFO )

SQL formulation (union of two projections):

     SELECT EMP# FROM NYC
     UNION
     SELECT EMP# FROM SFO
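Later SQL engines do accept a table-expression in the FROM clause, so the two formulations can be compared directly. The sketch below uses Python's sqlite3 module with invented sample rows; note that SQLite still does not accept the bare form ( NYC UNION SFO ), so each operand is spelled out as SELECT * FROM ..., which is an adaptation and not the syntax proposed in the text.

```python
import sqlite3

# Hypothetical sample data; table contents are assumptions for illustration.
conn = sqlite3.connect(":memory:")
for name in ("NYC", "SFO"):
    conn.execute(f'CREATE TABLE {name} ("EMP#" TEXT, "DEPT#" TEXT, SALARY INTEGER)')
conn.executemany(
    "INSERT INTO NYC VALUES (?, ?, ?)",
    [("E1", "D1", 40000), ("E2", "D2", 50000)],
)
conn.executemany(
    "INSERT INTO SFO VALUES (?, ?, ?)",
    [("E3", "D1", 45000), ("E4", "D3", 55000)],
)

# "Natural" formulation: projection of a union (nested table-expression).
natural = conn.execute(
    'SELECT "EMP#" FROM (SELECT * FROM NYC UNION SELECT * FROM SFO)'
).fetchall()

# Recast formulation: union of two projections.
recast = conn.execute(
    'SELECT "EMP#" FROM NYC UNION SELECT "EMP#" FROM SFO'
).fetchall()

print(sorted(natural))
print(sorted(recast))
```

The results are compared as sorted lists here because the sample rows have distinct EMP# values. In general a SQL SELECT does not eliminate duplicates unless DISTINCT is specified, so the two forms are guaranteed to agree only as sets.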
We remark in passing that allowing both formulations of the query would enable different users to perceive and express the same problem in different ways (ideally, of course, both formulations would translate to the same internal representation, for otherwise the choice between the two would no longer be arbitrary).

The foregoing example tacitly makes use of the fact that a simple table-reference (i.e., a table-name) ought to be just a special case of a general table-expression. Thus we wrote

     NYC UNION SFO

instead of

     SELECT * FROM NYC
     UNION
     SELECT * FROM SFO
which current SQL would require. It would be highly desirable for SQL to allow the expression "SELECT * FROM T" to be replaced by simply "T" wherever it appears, in the style of more conventional languages. In other words, SELECT should be regarded as a statement whose function is to retrieve a table (represented by a table-expression). Table-expressions per se -- in particular, nested table-expressions -- should not require the "SELECT * FROM". Among other things this change would improve the usability of the EXISTS builtin function (see later). It would also be clear that INTO and ORDER BY are clauses of the SELECT statement and not part of a table- (or column-) expression; the question of whether they can appear in a nested expression would then simply not arise, thus avoiding the need for a rule that looks arbitrary but is in fact not.

A nested table-expression is permitted -- in fact required -- in current SQL as the argument to EXISTS (but strangely enough not as the argument to the other builtin functions; this point is discussed in the next section). Nested column-expressions ("subqueries") are (a) required with the "ANY" and "ALL" operators (includes the IN operator, which is just a different spelling for =ANY); and (b) permitted with scalar comparison operators (<, =, etc.), if and only if the column-expression yields a column having at most one row. Moreover, the nested expression is allowed to include GROUP BY and HAVING in case (a) but not in case (b). More arbitrariness.
Elsewhere I have proposed some extensions to SQL to support the outer join operation [4]. The details of that proposal do not concern us here; what does concern us is the following. If the user needs to compute an outer join of three or more relations, then (a) that outer join is constructed by performing a sequence of binary outer joins (e.g., join relations A and B, then join the result and relation C); and (b) it is essential that the user indicate the sequence in which those binary joins are performed, because different sequences will produce different results, in general. Indicating the required sequence is done, precisely, by writing a suitable nested expression. Thus, nested expressions are essential if SQL is to provide direct (i.e., single-statement) support for general outer joins of more than two relations.

Another example (involving outer join again): Part of the proposal for supporting outer join [4] involves the use of a new clause, the PRESERVE clause, whose function is to preserve rows from the indicated table that would not otherwise participate in the result of the SELECT. Consider the tables

     COURSE   ( COURSE#, SUBJECT )
     OFFERING ( COURSE#, OFF#, LOCATION )

and consider the query "List all algebra courses, with their offerings if any". The two following SELECT statements (neither of which is valid in current SQL, of course) represent two attempts to formulate this query:

     SELECT   ALGEBRA.COURSE#, OFF#, LOCATION
     FROM     ( SELECT COURSE#
                FROM   COURSE
                WHERE  SUBJECT = 'Algebra' ) ALGEBRA, OFFERING
     WHERE    ALGEBRA.COURSE# = OFFERING.COURSE#
     PRESERVE ALGEBRA

     SELECT   COURSE.COURSE#, OFF#, LOCATION
     FROM     COURSE, OFFERING
     WHERE    COURSE.COURSE# = OFFERING.COURSE#
     AND      SUBJECT = 'Algebra'
     PRESERVE COURSE
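The behaviour of these two attempts can be approximated with the LEFT OUTER JOIN of later SQL dialects. This is a sketch under stated assumptions: PRESERVE is the proposed clause of [4] and is not implemented anywhere here, LEFT OUTER JOIN is a stand-in for it, the mapping of the second statement onto a compound ON condition is my own reading, and the sample rows are invented.

```python
import sqlite3

# Hypothetical sample data: two algebra courses (one with an offering, one
# without) and one non-algebra course that does have an offering.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE COURSE ("COURSE#" TEXT, SUBJECT TEXT)')
conn.execute('CREATE TABLE OFFERING ("COURSE#" TEXT, "OFF#" TEXT, LOCATION TEXT)')
conn.executemany(
    "INSERT INTO COURSE VALUES (?, ?)",
    [("C1", "Algebra"), ("C2", "Algebra"), ("C3", "History")],
)
conn.executemany(
    "INSERT INTO OFFERING VALUES (?, ?, ?)",
    [("C1", "O1", "Oslo"), ("C3", "O2", "Rome")],
)

# First attempt: restrict to algebra courses in a nested table-expression,
# then preserve every row of that nested result.
first = conn.execute("""
    SELECT ALGEBRA."COURSE#", "OFF#", LOCATION
    FROM (SELECT "COURSE#" FROM COURSE WHERE SUBJECT = 'Algebra') AS ALGEBRA
    LEFT OUTER JOIN OFFERING
      ON ALGEBRA."COURSE#" = OFFERING."COURSE#"
""").fetchall()

# Second attempt: no nesting, so the whole COURSE table is preserved and the
# subject test can only decide which rows count as matched.
second = conn.execute("""
    SELECT COURSE."COURSE#", "OFF#", LOCATION
    FROM COURSE
    LEFT OUTER JOIN OFFERING
      ON COURSE."COURSE#" = OFFERING."COURSE#" AND SUBJECT = 'Algebra'
""").fetchall()

print(sorted(first))
print(set(second) - set(first))
```

With this data the second form returns an extra null-padded row for the History course, which is not wanted by the query as stated.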
Each of these statements does list all algebra courses, together with their offerings, for all such courses that do have any offerings. The first also lists algebra courses that do not have any offerings, concatenated with null values in the OFFERING positions; i.e., it preserves information for those courses (note the introduced name ALGEBRA, which is used to refer to the result of evaluating the inner expression). The second, by contrast, preserves information not only for algebra courses with no offerings, but also for all courses for which the subject is not algebra (regardless of whether those courses have any offerings or not). In other words, the first preserves information for algebra courses only (as required), the second produces a lot of unnecessary output. And note that the first cannot even be formulated (as a single statement) if nested expressions are not supported.

* In fact, SQL does already support nested expressions in a kind of "under the covers" sense. Consider the following example:

  Base table:

     S ( S#, SNAME, STATUS, CITY )

  View definition:

     CREATE VIEW LONDON_SUPPLIERS
        AS SELECT S#, SNAME, STATUS
           FROM   S
           WHERE  CITY = 'London'

  Query (Q):

     SELECT *
     FROM   LONDON_SUPPLIERS
     WHERE  STATUS > 50

  Resulting SELECT statement (Q'):

     SELECT S#, SNAME, STATUS
     FROM   S
     WHERE  STATUS > 50
     AND    CITY = 'London'
The SELECT statement Q' is obtained from the original query Q by a process usually described as "merging": statement Q is "merged" with the SELECT in the view definition to produce statement Q'. To the naive user this looks a little bit like magic. But in fact what is going on is simply that the reference to LONDON_SUPPLIERS in the FROM clause in Q is being replaced by the expression that defines LONDON_SUPPLIERS, as follows:

     SELECT *
     FROM   ( SELECT S#, SNAME, STATUS
              FROM   S
              WHERE  CITY = 'London' )
     WHERE  STATUS > 50
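The equivalence of the three forms (the query Q against the view, the merged statement Q', and the explicitly nested statement) can be checked on an engine that does accept nesting in the FROM clause. The following is a minimal sketch using Python's sqlite3 module; the supplier rows are invented for illustration.

```python
import sqlite3

# Hypothetical sample data for the supplier table S.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE S ("S#" TEXT, SNAME TEXT, STATUS INTEGER, CITY TEXT)')
conn.executemany(
    "INSERT INTO S VALUES (?, ?, ?, ?)",
    [
        ("S1", "Smith", 60, "London"),
        ("S2", "Jones", 10, "Paris"),
        ("S3", "Blake", 30, "London"),
        ("S4", "Clark", 70, "Paris"),
    ],
)
conn.execute(
    """CREATE VIEW LONDON_SUPPLIERS AS
       SELECT "S#", SNAME, STATUS FROM S WHERE CITY = 'London'"""
)

# Q: the query as the user writes it, against the view.
q = "SELECT * FROM LONDON_SUPPLIERS WHERE STATUS > 50"

# Q': the "merged" statement against the base table.
q_merged = """SELECT "S#", SNAME, STATUS FROM S
              WHERE STATUS > 50 AND CITY = 'London'"""

# The nested form: the view name replaced by its defining expression.
q_nested = """SELECT * FROM
              (SELECT "S#", SNAME, STATUS FROM S WHERE CITY = 'London')
              WHERE STATUS > 50"""

results = [sorted(conn.execute(sql).fetchall()) for sql in (q, q_merged, q_nested)]
for rows in results:
    print(rows)
```

All three statements return the same single supplier, which is the substitution argument made above expressed as an executable check.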
This explanation, though both accurate and easy to understand, cannot conveniently be used in describing or teaching SQL, precisely because SQL does not support nesting at the external or user's level.

* UNION is not permitted in a subquery, and hence (among other things) cannot be used in the definition of a view (although strangely enough it can be used to define the scope for a cursor in embedded SQL). So a view cannot be "any derivable relation", and the relational closure property breaks down. Likewise, INSERT ... SELECT cannot be used to assign the union of two relations to another relation. Yet another consequence of the special treatment given to UNION is that it is not possible to apply a builtin function such as AVG to a union. See the following section.

We conclude this discussion of SQL expressions by noting additional (and apparently arbitrary) restrictions.
The predicate C BETWEEN A AND B is equivalent predicate A