ALGORITHMS FOR ACYCLIC DATABASE SCHEMES
Mihalis Yannakakis
Bell Laboratories Murray Hill, NJ 079?4
ABSTRACT: Many real-world situations can be captured by a set of functional dependencies and a single join dependency of a particular form called acyclic [B..]. The join dependency corresponds to a natural decomposition into meaningful objects (an acyclic database scheme). Our purpose in this paper is to describe efficient algorithms in this setting for various problems, such as computing projections, minimizing joins, inferring dependencies, and testing for dependency satisfaction.
1. INTRODUCTION

An important part in the design of relational database schemes is the specification of constraints satisfied by the data, called dependencies. The first dependencies to be introduced were the functional dependencies [C]. Their properties are well understood and efficient algorithms have been developed for inferring new dependencies and designing database schemes [BE, Ber]. Multivalued dependencies [F, Z] were introduced to describe those cases where a relation can be decomposed into two of its projections, and join dependencies [R] for a decomposition into several projections without loss of information; i.e. the original relation can be reconstructed by joining the projections. There are also efficient algorithms for the inference of multivalued dependencies [Bee] and their use in the design process. However, in general they are harder to grasp and deal with than functional dependencies; e.g. the best known algorithm to infer a join dependency from multivalued dependencies takes exponential time and space [ABU]. Join dependencies are studied in [BV, MSY, Y].

Recently, [FMU] advanced the hypothesis that most real-world situations have a particularly simple structure: they can be captured by some functional dependencies and a single join dependency that describes a "natural" decomposition into meaningful "objects". Furthermore, the join dependency is in most cases of a special form, called acyclic [B..]. Acyclic join dependencies are those that are equivalent to some set of multivalued dependencies. Such sets of multivalued dependencies are conflict-free, a notion introduced by [L] in the study of the relation between the network and the relational model. [S] argues also that most real-world sets of mvds fall into this category or can be put in such a form. The class of acyclic database schemes contains the class of loop-free Bachman diagram schemes of [L], the class of simply connected schemes of [Z], and bears a close resemblance to the class of tree queries of [BG].

In this paper we shall give efficient algorithms for several problems on acyclic database schemes, such as computing projections, testing satisfaction of the dependencies by a database, and inferring other dependencies. In Section 2 we review the basic terminology. Sections 3-6 deal with a database scheme and its associated join dependency (no functional dependencies). In Section 3 we assume a general database scheme $Q$. We examine the complexity of computing a projection of the join of the relations in a database over $Q$, of determining if the projection can be computed by joining only some of the relations, and of inferring dependencies from the join dependency associated with $Q$. In Section 4 we examine the same problems when $Q$ is an acyclic database scheme. In Section 5 we define the association $D[X]$ between the attributes of a set $X$ represented by a database $D$ and show how to compute it. In Section 6 we examine loop-free Bachman diagram schemes and give a characterization of them in terms of lossless joins. Section 7 assumes a set of functional dependencies and an acyclic join dependency; it examines the inference of other dependencies, and testing if a given database satisfies
the dependencies.

CH1701-2/81/0000/008~.75 © 1981 IEEE

2. TERMINOLOGY

In this Section we will go briefly over the basic relational theory terminology. For more details the reader is referred to [U]. The universe is a finite set $U$ of attributes. A relation scheme $\mathbf{R}$ is a subset of $U$ (relation schemes are written in bold to tell them apart from relations). A database scheme $Q$ (over $U$) is a set of relation schemes with union $U$. Every attribute $A$ has an associated set of values, its domain $\mathrm{Dom}(A)$. If $X$ is a set of attributes, an $X$-tuple (or $X$-value) is a mapping $\mu$ from $X$ into $\bigcup_{A \in X} \mathrm{Dom}(A)$ such that $\mu(A) \in \mathrm{Dom}(A)$ for each $A$ in $X$. A relation $R$ over relation scheme $\mathbf{R}$ is a finite set of $\mathbf{R}$-tuples. A database $D$ over a database scheme $Q$ is a set of relations containing one relation over each relation scheme of $Q$.

The projection $t[Y]$ of an $X$-tuple $t$ onto a subset $Y$ of $X$ is the restriction of $t$ to $Y$. The projection $\pi_Y(R)$ of a relation $R$ over $X$ to $Y$ is the set of projections of the tuples in $R$ to $Y$. Let $R_1, \ldots, R_k$ be relations over schemes $\mathbf{R}_1, \ldots, \mathbf{R}_k$ respectively and $\mathbf{R} = \bigcup_{i=1}^{k} \mathbf{R}_i$. The join of $R_1, \ldots, R_k$, denoted $R_1 \bowtie \cdots \bowtie R_k$ (or $\bowtie_i R_i$ or $\bowtie\{R_1, \ldots, R_k\}$), is the set of $\mathbf{R}$-tuples $t$ with $t[\mathbf{R}_j] \in R_j$ for $j = 1, \ldots, k$.

Using the projection operator $\pi_X$ ($X \subseteq U$) and the join operator we can build project-join relational expressions. An expression $\phi$ defines a mapping from relations over $U$ to relations over a certain set of attributes, the target relation scheme of $\phi$, $\mathrm{trs}(\phi)$. An expression $\phi$ is contained in an expression $\psi$, or $\phi \subseteq \psi$, if $\mathrm{trs}(\phi) = \mathrm{trs}(\psi)$ and $\phi(R) \subseteq \psi(R)$ for every relation $R$ over $U$; $\phi$ and $\psi$ are equivalent, denoted $\phi \equiv \psi$, if $\phi \subseteq \psi$ and $\psi \subseteq \phi$.

A useful tool for comparing expressions is the tableau [ASU1]. Each attribute $A$ of $U$ has an associated symbol set $S(A) = \{a, a_1, a_2, \ldots\}$; $a$ is called a distinguished symbol and the $a_i$'s are nondistinguished. A tableau $T$ is a relation over $U$ with the symbol sets as the domains of the attributes. The target relation scheme $\mathrm{trs}(T)$ of $T$ is the set of attributes in which $T$ has a distinguished symbol. The summary $s_T$ of $T$ is the $\mathrm{trs}(T)$-tuple of distinguished symbols. A tableau $T$ defines a mapping $f_T$ from relations over $U$ (or universal relations) to relations over $\mathrm{trs}(T)$ as follows. A valuation $\rho$ is a mapping that maps, for each attribute $A$, $S(A)$ into $\mathrm{Dom}(A)$. A valuation is extended to tuples componentwise and to relations elementwise. $f_T$ is defined as follows: $f_T(R) = \{\rho(s_T) \mid \rho \text{ is a valuation with } \rho(T) \subseteq R\}$. A tableau $T_1$ is contained in another tableau $T_2$ (denoted $T_1 \subseteq_T T_2$) if they both have the same target relation scheme and $f_{T_1}(R) \subseteq f_{T_2}(R)$ for every universal relation $R$; $T_1$ and $T_2$ are equivalent (or $T_1 \equiv_T T_2$) if $T_1 \subseteq_T T_2$ and $T_2 \subseteq_T T_1$. Note that, if $T_1 \subseteq T_2$ (where $\subseteq$ is set-inclusion) and $\mathrm{trs}(T_1) = \mathrm{trs}(T_2)$, then $T_2 \subseteq_T T_1$. We now list some of the basic results of the theory of tableaux [ASU1, ASU2].

(1) For every project-join expression $\phi$ there is a tableau $T$ with $\phi(R) = f_T(R)$ for every universal relation $R$. The tableau $T$ is constructed recursively from $\phi$ as follows. If $\phi = \pi_X \sigma$, and $S$ is the tableau for $\sigma$, $T$ is obtained from $S$ by changing each distinguished symbol $a$ with $A \notin X$ into a new nondistinguished symbol. If $\phi = \sigma_1 \bowtie \sigma_2 \bowtie \cdots \bowtie \sigma_k$, and $T_1, T_2, \ldots, T_k$ are the tableaux for $\sigma_1, \ldots, \sigma_k$, then $T$ is the union of the $T_i$'s.

(2) Let $T_1, T_2$ be two tableaux with the same target relation scheme. A homomorphism $h$ from $T_1$ to $T_2$ is a mapping from $S(A)$ to $S(A)$ for each $A$, such that $h(a) = a$ and $h(T_1) \subseteq T_2$; the mapping from the tuples of $T_1$ to the tuples of $T_2$ induced by $h$ is a containment mapping. Such a homomorphism exists if and only if $T_2 \subseteq_T T_1$.

(3) Each tableau $T$ has a minimal subset $\bar{T}$ equivalent to $T$; $\bar{T}$ is unique up to renaming of nondistinguished symbols and is called the minimal equivalent tableau of $T$. If $T$ is the tableau of an expression $\phi$, then $\bar{T}$ is the tableau of an expression $\psi$ equivalent to $\phi$ that contains the minimum number of (binary) joins.

A functional dependency (or fd) is a statement of the form $X \to Y$ where $X, Y \subseteq U$. It is satisfied by a universal relation $R$ if for all tuples $t_1, t_2$ of $R$ with $t_1[X] = t_2[X]$, also $t_1[Y] = t_2[Y]$ holds. A join dependency (or jd) is a statement of the form $*Q$ where $Q$ is a database scheme over $U$; it is satisfied by a universal relation $R$ if $\bowtie\{\pi_{\mathbf{R}_i}(R) \mid \mathbf{R}_i \in Q\} = R$. A multivalued dependency (or mvd) $X \twoheadrightarrow Y$ is the join dependency $*\{XY, XZ\}$ where $Z = U - XY$.
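The definitions of projection, join, fd and jd satisfaction above can be made concrete with a small sketch. This is an illustration only, not code from the paper: tuples are modelled as frozensets of (attribute, value) pairs, and the helper names `tup`, `restrict`, `project`, `join`, `join_all`, `satisfies_fd` and `satisfies_jd` are hypothetical.

```python
from itertools import product

def tup(**values):
    """Build a tuple as a frozenset of (attribute, value) pairs."""
    return frozenset(values.items())

def restrict(t, attrs):
    """t[Y]: the restriction of tuple t to the attributes in attrs."""
    return frozenset((a, v) for a, v in t if a in attrs)

def project(rel, attrs):
    """pi_Y(R): the set of restrictions of the tuples of rel to attrs."""
    return {restrict(t, attrs) for t in rel}

def join(r, s):
    """Natural join: merge every pair of tuples that agree on shared attributes."""
    out = set()
    for t, u in product(r, s):
        merged = t | u
        # consistent iff no attribute ends up with two different values
        if len({a for a, _ in merged}) == len(merged):
            out.add(merged)
    return out

def join_all(rels):
    """Join of several relations, starting from the relation with one empty tuple."""
    result = {frozenset()}
    for r in rels:
        result = join(result, r)
    return result

def satisfies_fd(rel, X, Y):
    """X -> Y holds iff tuples that agree on X also agree on Y."""
    seen = {}
    for t in rel:
        if seen.setdefault(restrict(t, X), restrict(t, Y)) != restrict(t, Y):
            return False
    return True

def satisfies_jd(rel, schemes):
    """*Q holds iff joining the projections on the schemes of Q gives rel back."""
    return join_all(project(rel, s) for s in schemes) == set(rel)

# A two-tuple universal relation over U = {A, B, C}.
R = {tup(A=1, B=1, C=1), tup(A=2, B=1, C=2)}
```

With this `R`, `satisfies_jd(R, [{"A", "B"}, {"A", "C"}])` is true (this jd is the mvd $A \twoheadrightarrow B$), while `satisfies_jd(R, [{"A", "B"}, {"B", "C"}])` is false because the join of the two projections contains four tuples.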
An embedded join dependency (ejd) is like a join dependency except that the union of the relation schemes does not have to contain all the attributes; i.e. if $S$ is a collection of relation schemes, the ejd $*S$ is satisfied by a universal relation $R$ if $\bowtie\{\pi_{X_i}(R) \mid X_i \in S\} = \pi_X(R)$, where $X$ is the union of the schemes in $S$. We will usually say that $S$ has a lossless join. The expression $\bowtie\{\pi_{X_i} \mid X_i \in S\}$ is called a project-join mapping and is denoted by $m_S$. It has the properties (1) $m_S \supseteq \pi_X$ and (2) $m_S(m_S) = m_S$ (idempotency). From the idempotency of the project-join mapping it follows that the join of the relations of a database $D$ over $Q$ satisfies the jd $*Q$.

A template dependency (or td) [SU] is a very general kind of dependency; it is a statement of the form $T/s_T$ where $T$ is a tableau and $s_T$ its summary. It is satisfied by a universal relation $R$ if $f_T(R) = \pi_X(R)$ where $X$ is the target relation scheme of $T$.

A set $\Gamma$ of dependencies implies another dependency $\sigma$ (or $\Gamma \models \sigma$) if $\sigma$ holds in every relation satisfying $\Gamma$. A procedure for testing if $\Gamma \models \sigma$ is the chase ([ABU, MMS]). Here we will consider only sets $\Gamma$ of functional and join dependencies. The tableau $T_\sigma$ of a template dependency $\sigma = T/s_T$ is $T$. The tableau $T_\sigma$ of a functional dependency $\sigma = X \to Y$ has two tuples; the one has distinguished symbols in the attributes of $X$ and new nondistinguished symbols in the rest, and the other tuple has distinguished symbols in the attributes of $XY$ and nondistinguished symbols in the rest of the attributes. The chase procedure modifies $T_\sigma$, trying to include $s_T$ in it if $\sigma$ is a template dependency, or to eliminate the nondistinguished symbols from the attributes of $Y$ if $\sigma$ is an fd; if this happens then it succeeds. The rules for modifying $T_\sigma$ are as follows.

FD-rule. If $f: X \to Y$ is an fd in $\Gamma$ and there are two tuples $t_1, t_2$ that agree on $X$ but disagree on an attribute $B$ of $Y$, replace all occurrences of one of $t_1(B)$, $t_2(B)$ by the other, keeping a distinguished symbol or the nondistinguished symbol with a lower subscript.

JD-rule. If $*Q$ is a jd in $\Gamma$, replace the tableau $T$ by $m_Q T$.

The chase of $T_\sigma$ under $\Gamma$, $\mathrm{chase}_\Gamma(T_\sigma)$, is the tableau resulting after applying the rules for the dependencies in $\Gamma$ as far as possible. The chase procedure has the Church-Rosser property (i.e. $\mathrm{chase}_\Gamma(T_\sigma)$ does not depend on the order in which the rules are applied), and $\Gamma \models \sigma$ iff it succeeds [MMS].

The procedure can be applied also to relations $R$ with elements from $\mathrm{Dom}(A) \cup S(A)$ for each attribute $A$. In this case, if in the application of an FD-rule an element of $S(A)$ is identified with an element of $\mathrm{Dom}(A)$ (or constant), we keep the constant. If two constants are identified, then the chase procedure stops and we say that a contradiction arises. Again, the order in which the rules are applied is immaterial: either a contradiction arises or else the final relation $\mathrm{chase}_\Gamma(R)$ is the same.

With a database scheme $Q$ or a jd $*Q$ we can associate a hypergraph. A hypergraph, like a graph, consists of a set of nodes and a set of edges. The only difference from a graph is that an edge can contain an arbitrary number of nodes. The hypergraph of $Q$ has the universe $U$ as its set of nodes, and the relation schemes of $Q$ as its set of edges. We will denote it also as $Q$. Graph notions like "path", "connectedness" etc. carry over to hypergraphs in a straightforward way. We are going to assume for the rest of this paper that $Q$ is connected; if $Q$ is disconnected then the attributes in the different components of $Q$ are independent of each other, and the results generalize in a straightforward way by considering each component separately.

An acyclic join dependency $*Q$ is defined in [B..] in terms of the topological properties of the associated hypergraph. We will not give the definition here but rather list some equivalent characterizations. It is a join dependency which implies and is implied by (or is equivalent to) some set of mvds. A database scheme $Q$ is acyclic if $*Q$ is an acyclic jd. We are going to use the following characterizations of acyclic database schemes $Q$ [B..].

(1) $Q$ can be reduced to the empty set by repeatedly deleting an attribute if it occurs in exactly one relation scheme, and deleting a relation scheme if it is contained in another relation scheme.

(2) There is a tree $T$ with the relation schemes as nodes such that the subgraph of $T$ induced by the nodes containing an attribute $A$ is
connected (i.e. a subtree) for each attribute $A$ of $U$. We say that $T$ represents $Q$.

Acyclic schemes $Q$ have also the following properties.

(3) We say that a database $D = \{R_1, \ldots, R_k\}$ over $Q$ is (fully) reduced or the projection of a universal instance if there is a universal relation $R$ such that $R_i = \pi_{\mathbf{R}_i}(R)$ for each relation scheme $\mathbf{R}_i \in Q$. If $D$ is a database over an acyclic scheme $Q$, then $D$ is reduced iff $R_i = \pi_{\mathbf{R}_i}(R_i \bowtie R_j)$ for each pair of relations $R_i, R_j$ in $D$. There is an efficient algorithm using a particular kind of joins (called semijoins) which computes the reduction of $D$: $D' = \{R_1', \ldots, R_k'\}$ where $R_i' = \pi_{\mathbf{R}_i}(\bowtie_j R_j)$ [BG].

(4) If $Z$ is a set of attributes, let $Q(Z)$ be the family of nonempty sets $\mathbf{R}_i \cap Z$ where $\mathbf{R}_i \in Q$. $Q(Z)$ is an acyclic scheme over the universe $Z$, called the scheme generated by $Z$.

3. GENERAL DATABASE SCHEMES

Let $Q = \{\mathbf{R}_1, \ldots, \mathbf{R}_k\}$ be a database scheme and $D = \{R_1, \ldots, R_k\}$ a database over it. It is trivial to determine if a given tuple is in the join of the relations in $D$. If however the tuple is only partially specified (in some set of attributes $X$) then the problem is much harder:

Theorem 3.1. It is NP-complete to test if an $X$-tuple $t$ is in $\pi_X(\bowtie_i R_i)$ where $D = \{R_i\}_{i=1,\ldots,k}$ is a database over an arbitrary scheme, even if $D$ is the projection of a universal instance.

Proof. Membership in NP is obvious. For the NP-hardness part we use a result of [SY], that it is NP-complete to test if an expression $\phi_1 = \pi_X(\bowtie_i \pi_{Y_i})$ is contained in another expression $\phi_2 = \pi_X(\bowtie_i \pi_{Z_i})$. Let $T_1$ be the tableau of $\phi_1$ and $s_1$ its summary. From the theory of tableaux, $\phi_1 \subseteq \phi_2$ if and only if $s_1 \in \phi_2(T_1)$. Thus, we can decide if $\phi_1 \subseteq \phi_2$ by testing whether $s_1 \in \pi_X(\bowtie_i R_i)$ where $R_i = \pi_{Z_i}(T_1)$. □

As a consequence of Theorem 3.1, it is NP-complete to test if an arbitrary join dependency implies a template dependency of the form $\pi_X m_S = \pi_X$. (Membership in NP follows from the idempotency of the project-join mapping.) There are however some cases of particular interest which can be efficiently decided even for general join dependencies. Let $J = *\{X_1, \ldots, X_n\}$ be a join dependency.

(1) Inferring lossless joins [BeV]. Method. Let $S = \{Y_j\}$ be a collection of subsets of $U$ with $Y = \bigcup_j Y_j$. We can decide if $J$ implies that $S$ has a lossless join (i.e. if $\bowtie_j \pi_{Y_j}(R) = \pi_Y(R)$ for every universal relation satisfying $J$) as follows. Form a bipartite graph $G$ with nodes the $X_j$'s and the attributes in $U - Y$, and an edge between an attribute and a set if the attribute is in the set. Let $K_1, \ldots, K_t$ be the connected components of $G$. Let $Z_i$ be all the attributes of $Y$ that belong to a set $X_j$ in $K_i$, for $i = 1, \ldots, t$. [BeV] shows that $\bowtie_j \pi_{Y_j}$ is lossless iff each $Z_i$ is contained in some $Y_j$. □

(2) Let $S \subseteq \{X_1, \ldots, X_n\}$ and $X \subseteq \bigcup_{X_i \in S} X_i$. Deciding if $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) = \pi_X$ (*). Method. We claim that $J$ implies (*) if and only if $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) \equiv \pi_X(\bowtie_{i=1}^{n} \pi_{X_i})$ is a tautology (i.e. holds for all universal relations). At first note that for all universal relations $R$, $\pi_X[m_J(R)] \subseteq \pi_X[m_S(R)]$. Suppose that there is a relation $R$ with $\pi_X m_J(R) \subsetneq \pi_X m_S(R)$. Let $R' = m_J(R)$. Then $R'$ satisfies the join dependency $J$ but violates (*), since $\pi_X(R') = \pi_X[m_J(R')]$ (from the idempotency of $m_J$). Conversely, if $\pi_X[m_J(R)] = \pi_X[m_S(R)]$ for every $R$, then for every instance $I$ satisfying $J$ we have $\pi_X[m_S(I)] = \pi_X[m_J(I)] = \pi_X(I)$, since $m_J(I) = I$. Thus, it suffices to test if $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) \equiv \pi_X(\bowtie_{i=1}^{n} \pi_{X_i})$. Both expressions are simple [ASU1], and therefore their equivalence can be tested in polynomial time using tableaux techniques [ASU1]. □

The significance of (2) is the following: Suppose that a universal instance $R$ (satisfying $J$) is decomposed into its projections on the $X_i$'s. Then we can test efficiently if a projection of it (on some set $X$) can be recovered by joining only some of the relations in the database. Moreover, we can find efficiently the minimum number
of such relations whose join gives the projection of $R$ on $X$: we just have to minimize the (simple) tableau of $\pi_X m_J$ [ASU1, ASU2]. From the theory of tableaux, all minimal subsets $S$ of $J$ with $\pi_X m_S = \pi_X$ have the same minimum cardinality, and the expressions $\pi_X m_S$ have identical tableaux $T$ up to renaming of nondistinguished symbols. Therefore, for every minimal $S = \{X_{i_1}, \ldots, X_{i_m}\}$, $\pi_X m_S(R) = \pi_X[\bowtie_j \pi_{Y_j}(R)]$, where $Y_j$ is the set of attributes in which the $j$-th row of the minimal tableau $T$ has a distinguished or repeated nondistinguished symbol ($Y_j \subseteq X_{i_j}$). In other words, for every minimal set $S$, the computation of $\pi_X(R)$ involves the join of exactly the same relations ($\pi_{Y_j}(R)$); the best choice of $S$ depends on how fast $\pi_{Y_j}(R)$ can be obtained from $R_{i_j} = \pi_{X_{i_j}}(R)$ (organization of the various relations, supporting data structures, etc.).

Note that we might have $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) = \pi_X$ even though the join is lossy, i.e. $\bowtie_{X_i \in S} \pi_{X_i} \neq \pi_Y$ where $Y = \bigcup_{X_i \in S} X_i$. However, any nonredundant join for computing the projection on $X$ is lossless:

Theorem 3.2 Let $*J$ be a jd and suppose that $S$ is a minimal subset of $J$ such that $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) = \pi_X$ is implied by $*J$. Then $\bowtie_{X_i \in S} \pi_{X_i}$ is a lossless join. □

Regarding the actual computation of $\pi_X(\bowtie_i R_i)$, the size of the output relation can be exponential in the size of the input database. Thus, the best we can hope for is an algorithm polynomial in the size of the input and the output. However, we know that it is NP-complete to determine if $\bowtie_i R_i$ is empty, or is a (universal) relation $R$ with $R_i = \pi_{\mathbf{R}_i}(R)$ [WY]. Also it is NP-complete to determine if a universal relation $R$ satisfies $R = \bowtie_i \pi_{\mathbf{R}_i}(R)$ [MSY]. Thus, probably (unless P = NP) there is no efficient (polynomial in the sizes of the input and the output) algorithm to carry out either the reduction of the database or the join of a reduced database. Also, it is NP-complete to test if a universal relation satisfies a join dependency [WY].

4. ACYCLIC DATABASE SCHEMES

Let $D = \{R_1, \ldots, R_k\}$ be a database over the acyclic scheme $Q = \{\mathbf{R}_1, \ldots, \mathbf{R}_k\}$. Let $X$ be a set of attributes. We shall show how to compute $\pi_X(\bowtie_i R_i)$ in time polynomial in the size of the input and the output. The algorithm is based on the relation between acyclic database schemes and tree queries. Let $T$ be a tree representing the scheme $Q$ as explained in Section 2. At first we compute a full reduction $D'$ of $D$ as in [BG] using semijoins. Then we root the tree at an arbitrary node, say $\mathbf{R}_1$, and prune it leaf-to-root. Let $T_i$ be the subtree of $T$ rooted at node $\mathbf{R}_i$, $Z_i$ the set of attributes labelling the edge from $\mathbf{R}_i$ to its father $\mathbf{R}_j$ ($Z_i = \mathbf{R}_i \cap \mathbf{R}_j$), and $X_i$ the set of attributes of $X$ that are contained in some node of $T_i$. When the turn of $\mathbf{R}_i$ comes to be deleted, $R_i$ has been replaced by a relation $R_i'$ over the set of attributes $X_i \mathbf{R}_i$ (i.e. $X_i \cup \mathbf{R}_i$). The deletion of node $\mathbf{R}_i$ is carried out by deleting the relation $R_i'$ and replacing the current relation $R_j'$ of its father $\mathbf{R}_j$ by $R_j' \bowtie \pi_{X_i Z_i}(R_i')$. It is easy to show by an induction that when node $\mathbf{R}_i$ is deleted, the current relation $R_i'$ is equal to $\pi_{X_i \mathbf{R}_i}(\bowtie_{\mathbf{R}_l \in T_i} R_l)$. Thus, when we are left with the root, the relation that is stored there is $R_1' = \pi_{X \mathbf{R}_1}(\bowtie_i R_i)$, and $\pi_X(R_1') = \pi_X(\bowtie_i R_i)$.

We had to reduce first the database so that intermediate results will not get exponential in the input and the output:

Lemma 4.1 Throughout the computation, the size of the current version $R_i'$ of $R_i$ is bounded by $|R_i| \cdot |\pi_X(\bowtie_j R_j)|$. □

Theorem 4.1 If $\{R_1, \ldots, R_k\}$ is a database over an acyclic scheme, then $\pi_X(\bowtie_i R_i)$ can be computed in time polynomial in the input and the output. □

Simple variations of the algorithm can be used to

(1) Test if a universal relation $R$ satisfies an acyclic join dependency $*J$. Apply the algorithm to $D = \{\pi_{X_i}(R) \mid X_i \in J\}$ with $X = U$ while making sure that the various relations along the way don't become different from the corresponding projections of $R$. Another way for doing this is to take a set $M$ of mvd's equivalent to $*J$ with $|M| \le |J|$
(there is always such a set [B..]) and check if $R$ satisfies $M$.

(2) Include selections. Let $Y$ be a set of attributes and $Sl(A)$ a subset of $\mathrm{Dom}(A)$ for each attribute $A$ in $Y$. To compute the projections on $X$ of the tuples in $\bowtie_i R_i$ which have $A$-value in $Sl(A)$ for each $A \in Y$, we first select those tuples from each relation $R_i$ that can contribute to the result. That is, we remove from $R_i$ those tuples $t$ with $t[A] \notin Sl(A)$ for some $A$ in $Y \cap \mathbf{R}_i$. Then we can apply the algorithm to the remaining relations.

The following now is immediate.

Corollary 4.1 (1) Given a database $\{R_1, \ldots, R_k\}$ and an $X$-tuple $t$, we can decide in polynomial time if $t \in \pi_X(\bowtie_i R_i)$. (2) We can decide in polynomial time if an acyclic join dependency implies a template dependency. □

Returning to the computation of $\pi_X(\bowtie_i R_i)$, we note that we do not have to carry out the second phase using the whole tree. Since the join of the relations in the database satisfies the jd $*Q$, we can use dependencies that are implied by it.

Lemma 4.2. The join of a subset of the relations, whose schemes form a subtree (connected subgraph) of $T$, is lossless. □

Thus, it suffices to join a set of relations whose schemes form a subtree $T'$ of $T$ and have a union containing $X$. We will say that $T'$ covers $X$. If $X$ is contained in some relation scheme $\mathbf{R}_i$, then we can obtain $\pi_X(\bowtie_i R_i)$ by projecting the corresponding relation $R_i$ onto $X$. If not, then the intersection of all $T'$ covering $X$ is also a subtree covering $X$:

Lemma 4.3. If $X$ is not contained in two or more relation schemes, then there is a unique minimal subtree $T_X$ of $T$ covering $X$.

Proof (sketch) The subtree $T_X$ is defined as follows. Let $\mathbf{R}_i$ be a node of $T$ and $T_1, T_2, \ldots, T_l$ the subtrees hanging from it (see Figure 1). If one of the $T_i$'s covers $X$ then $\mathbf{R}_i \notin T_X$. Otherwise we include $\mathbf{R}_i$ into $T_X$.

[Figure 1]

It can be shown that (1) every subtree that covers $X$ has to contain $T_X$, and (2) $T_X$ covers $X$ and is connected. □

Even though $T_X$ is a minimal subtree covering $X$, it might still contain redundant relations. For example, suppose that we have a film database with relations FD (film-director), FP (film-producer), FA (film-actor) arranged as in the tree of Figure 2.

[Figure 2]

The minimal subtree that relates directors to producers is the whole tree. However, clearly $DP = \pi_{DP}(FD \bowtie FP)$. We could change the arrangement so that FD and FP become adjacent. This is true in general:

Theorem 4.2 Let $S$ be a subset of schemes from $Q$ that join losslessly. There is a tree $T$ representing $Q$ such that the schemes in $S$ form a subtree of $T$.

Proof (sketch) Let $S = \{\mathbf{R}_{i_1}, \ldots, \mathbf{R}_{i_m}\}$ and $Y = \bigcup_j \mathbf{R}_{i_j}$. From our discussion in Section 3 (see Property (1) of general join dependencies) the schemes $S$ join losslessly if and only if the schemes in $Q - S$ can be partitioned into sets $K_1, \ldots, K_m$ so that (a) if $\mathbf{R}_l \in K_j$ then $\mathbf{R}_l \cap Y \subseteq \mathbf{R}_{i_j}$, and (b) if $\mathbf{R}_l$ and $\mathbf{R}_p$ belong to different $K_j$'s then $\mathbf{R}_l \cap \mathbf{R}_p \cap (U - Y) = \emptyset$. At first we show that $S$ and each of $K_j \cup \{\mathbf{R}_{i_j}\}$ are acyclic schemes. We can construct thus trees $T_0, T_1, \ldots, T_m$ representing respectively $S, K_1 \cup \{\mathbf{R}_{i_1}\}, \ldots, K_m \cup \{\mathbf{R}_{i_m}\}$.
We attach the trees $T_1, \ldots, T_m$ to $T_0$ at the nodes corresponding to $\mathbf{R}_{i_1}, \ldots, \mathbf{R}_{i_m}$ to get the desired tree. □

Corollary 4.2 All lossless joins (and their projections) can be computed efficiently. □

Note that a lossless join might give a different result (a proper superset) than the projection of the join of all the relations, if the database is not the projection of a universal instance. And there might be good reason for wanting this particular join: for example, if in the film database we have also a relation FS (film-sound engineer), the way to compute directors of talkies is by joining FD with FS.

By contrast to the Corollary, lossy joins can be hard to compute: every database scheme $Q$ can be augmented to an acyclic one by adding an appropriate relation scheme; thus, the NP-completeness results in Section 3 are relevant in these cases.

Combining Theorem 4.2 with Theorem 3.2 we can conclude that the schemes of any nonredundant join computing the projection on $X$ form a connected set in some tree representation of $Q$. We use this fact now to devise an easy method for computing a (in fact all) minimal subset $S$ of $Q$ for which the jd $*Q$ implies $\pi_X m_S = \pi_X$. That is, we find the collection $S' = \{Y_1, \ldots, Y_m\}$ of sets such that $\pi_X m_{S'} = \pi_X$ is implied by the jd $*Q$, and every minimal subset $S$ of $Q$ with $\pi_X m_S = \pi_X$ implied by $*Q$ contains $m$ distinct sets $\mathbf{R}_{i_1}, \ldots, \mathbf{R}_{i_m}$ with $Y_j \subseteq \mathbf{R}_{i_j}$ (for $j = 1, \ldots, m$). We compute $S'$ using the following elimination algorithm. Let $Q'$ be a family of relation schemes, initially set equal to $Q$. 1) Eliminate an attribute in $U - X$ if it belongs to exactly one set of $Q'$, 2) Eliminate a set in $Q'$ if it is contained in another set. $S' = \{Y_1, \ldots, Y_m\}$ is the final family $Q'$ of sets after applying steps 1) and 2) as far as possible.

Theorem 4.3 (1) The j.d. $*Q$ implies $\pi_X m_{S'} = \pi_X$. (2) If $S$ is any subset of $Q$ that has for each $j = 1, \ldots, m$ a scheme $\mathbf{R}_{i_j}$ containing $Y_j$, then the j.d. $*Q$ implies $\pi_X m_S = \pi_X$. (3) If $S$ is any subset of $Q$ with $\pi_X m_S = \pi_X$ implied by $*Q$, then $S$ contains distinct sets $\mathbf{R}_{i_1}, \ldots, \mathbf{R}_{i_m}$ with $Y_j \subseteq \mathbf{R}_{i_j}$ (for $j = 1, \ldots, m$). □

Remark Parts (1) and (2) hold even if $*Q$ is an arbitrary j.d. Part (3) does not; i.e. the algorithm is inapplicable for deriving the $Y_j$'s if the jd is not acyclic.

5. ASSOCIATIONS WHEN THE DATABASE IS NOT THE PROJECTION OF A UNIVERSAL INSTANCE

Consider the example of Figure 2, and suppose we want to know the associations of directors with their producers. The "natural" interpretation of such a query is the set of pairs $(p, d)$ where director $d$ has worked for (directed a film produced by) producer $p$. However, $\pi_{PD}(\bowtie_i R_i)$ will not necessarily compute the correct set: it will miss all associations between producers and directors of documentaries, which have no value for the Actor attribute.

Let $D = \{R_1, \ldots, R_k\}$ be a database over $Q$. We form a universal instance $R(D)$ by padding out to $U$ every tuple in each $R_i$ with new distinct nulls (or unknown, unspecified, etc.; marked nulls in [W], [M]). Let $R^*(D) = m_Q R(D)$. A tuple of $R^*(D)$ is $X$-total if it contains no nulls in the attributes of $X$. We denote by $D[X]$ the projection onto $X$ of all $X$-total tuples of $R^*(D)$. That is, $D[X]$ is the set of $X$-tuples that are in the $X$-projection of each universal relation $R$ satisfying the jd $*Q$ with $R_i \subseteq \pi_{\mathbf{R}_i}(R)$ for $i = 1, \ldots, k$. Such a relation $R$ is called a containing instance. (Note that if $D$ is the projection of a universal instance then $D[X] = \pi_X(\bowtie_i R_i)$.)

From our discussion in the previous Sections, it follows that for a general database scheme it is hard to compute $D[X]$ or determine if a particular $X$-tuple is in it, since both problems are hard even if $D$ is the projection of a universal instance. Suppose now that $Q$ is an acyclic scheme. Consider Remark (2) after Theorem 4.1, with the projection of $R(D)$ on the relation schemes of $Q$ in place of the database $D$ there, the set $X$ in place of $Y$, and with $Sl(A)$ the set of nonnull $A$-values for each $A$ in $X$. From Remark (2) we have:
Theorem 5.1 If $Q$ is an acyclic database scheme, we can test if an $X$-tuple is in $D[X]$ and compute $D[X]$ (or a projection of it) in time polynomial in the size of the input and the output. □

Returning to the representation of $Q$ by a tree $T$, we noted before that navigating through $T$ (joining the relations on the way) produces valid associations (the joins are lossless). In other words, if the database is the projection of a universal relation $R$ (satisfying the join dependency $*Q$), then the computed relation is the projection of $R$ on the corresponding set $Y$ of attributes, and is therefore independent of the choice of the tree $T$ that represents $Q$. If $D$ is not the projection of a universal instance, then again only valid associations will be produced (i.e. every produced $Y$-tuple is in $D[Y]$); however, some valid associations might be lost, and the result might be sensitive to the tree $T$ chosen. We shall show that $D[X]$ is the set of associations derived by considering all trees representing $Q$.

Lemma 5.1 Let $Q$ be a general database scheme, $D = \{R_1, \ldots, R_k\}$ a database over it. Let $t$ be a tuple of $R^*(D)$ defined on the set of attributes $Y$. Then there is a subset $S$ of $Q$ of relation schemes with a lossless join, with $t[\mathbf{R}_i] \in R_i$ for each $\mathbf{R}_i \in S$ and $Y = \bigcup \{\mathbf{R}_i \mid \mathbf{R}_i \in S\}$. □

Theorem 5.2 Let $Q$ be an acyclic database scheme, $D = \{R_1, \ldots, R_k\}$ a database over it, and $X$ a set of attributes. (1) If $X$ is contained in a relation scheme, then $D[X] = \bigcup \{\pi_X R_i \mid X \subseteq \mathbf{R}_i\}$. (2) If $X$ is not contained in any relation scheme, then $D[X] = \bigcup \{\pi_X(\bowtie_{\mathbf{R}_i \in T_X} R_i) \mid T \text{ a tree representing } Q\}$, where $T_X$ is the minimal subtree of $T$ covering $X$ as in Lemma 4.3.

Proof (1) From the construction of $R(D)$, $\bigcup \{\pi_X R_i \mid X \subseteq \mathbf{R}_i\} \subseteq \pi_X R(D) \subseteq \pi_X R^*(D)$. For the opposite inclusion, it is easy to see that any lossless join of a family $S$ of schemes covering $X$ must contain at least one scheme containing $X$. The conclusion then follows from Lemma 5.1. (2) One inclusion follows from Lemma 4.2, and the other from Lemma 5.1 and Theorem 4.2. □

In the next Section we shall see that there are some schemes which have a representing tree such that the union can be dropped from Theorem 5.2, and thus $D[X]$ can be computed by just joining some relations from $D$.

Even for a general $Q$ however, we can compute $D[X]$ without introducing null values, and clearly faster than the general method described at the beginning of the Section (as shown also in [Sa]). Let $S' = \{Y_1, \ldots, Y_m\}$ be the collection of sets in which the rows of the tableau of $\pi_X m_Q$ have distinguished or repeated nondistinguished symbols. Let $R_j' = \bigcup_{Y_j \subseteq \mathbf{R}_i} \pi_{Y_j}(R_i)$.

Theorem 5.3 [Sa] Let $Q$ be a general database scheme, $D = \{R_1, \ldots, R_k\}$ a database over it. Let $X$ be a set of attributes and let the $R_j'$'s be defined as above. Then $D[X] = \pi_X(\bowtie_j R_j')$.

Proof. (1) $\pi_X(\bowtie_j R_j') \subseteq D[X]$. We know that $\pi_X m_{S'} = \pi_X m_Q$. From the definition of $R(D)$, $R_j' \subseteq \pi_{Y_j} R(D)$. Thus, $\pi_X(\bowtie_j R_j') \subseteq \pi_X m_{S'} R(D) \subseteq \pi_X m_Q R(D) = \pi_X R^*(D)$. Therefore, the set of $X$-total tuples of the left-hand side ($= \pi_X(\bowtie_j R_j')$) is contained in the set of $X$-total tuples of the right-hand side ($= D[X]$).

(2) $D[X] \subseteq \pi_X(\bowtie_j R_j')$. Let $t$ be an $X$-total tuple of $R^*(D)$ defined on $Z \supseteq X$. From Lemma 5.1 there is a subfamily $S$ of $Q$ of relation schemes with $t[\mathbf{R}_i] \in R_i$ for each $\mathbf{R}_i \in S$, $Z = \bigcup \{\mathbf{R}_i \mid \mathbf{R}_i \in S\}$ and $m_S = \pi_Z m_Q$. Thus, $\pi_X m_S = \pi_X m_Q$ and therefore $S$ contains distinct sets $\mathbf{R}_{i_1}, \ldots, \mathbf{R}_{i_m}$ with $Y_j \subseteq \mathbf{R}_{i_j}$ for $j = 1, \ldots, m$. Since $t[\mathbf{R}_{i_j}] \in R_{i_j}$, $t[Y_j] \in R_j'$ for each $j$. Let $Y = \bigcup_j Y_j$; we have $t[Y] \in \bowtie_j R_j'$, and therefore $t[X] \in \pi_X(\bowtie_j R_j')$. □
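The two procedures of Section 4 that Section 5 builds on, the semijoin reduction followed by leaf-to-root pruning (Theorem 4.1) and the elimination algorithm for $S'$ (Theorem 4.3), can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's own code: tuples are frozensets of (attribute, value) pairs, the representing tree is assumed to be given as a `children` map, and all function names (`project_join`, `eliminate`, `leaves_first`, and the helpers) are hypothetical.

```python
def row(**values):
    return frozenset(values.items())

def restrict(t, attrs):
    return frozenset((a, v) for a, v in t if a in attrs)

def project(rel, attrs):
    return {restrict(t, attrs) for t in rel}

def join(r, s):
    out = set()
    for t in r:
        for u in s:
            m = t | u
            if len({a for a, _ in m}) == len(m):  # agree on shared attributes
                out.add(m)
    return out

def semijoin(r, s, common):
    """Tuples of r that join with at least one tuple of s."""
    keys = project(s, common)
    return {t for t in r if restrict(t, common) in keys}

def leaves_first(children, root):
    """Nodes ordered so that every node comes after all its descendants."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, ()))
    return order[::-1]

def project_join(schemes, rels, children, root, X):
    """pi_X of the join of all relations, for a scheme represented by the given tree."""
    rels = dict(rels)
    parent = {c: p for p, cs in children.items() for c in cs}
    order = leaves_first(children, root)
    # Phase 1: full reduction, semijoins leaf-to-root and then root-to-leaf.
    for n in order[:-1]:
        rels[parent[n]] = semijoin(rels[parent[n]], rels[n], schemes[parent[n]] & schemes[n])
    for n in reversed(order[:-1]):
        rels[n] = semijoin(rels[n], rels[parent[n]], schemes[parent[n]] & schemes[n])
    # Phase 2: prune leaf-to-root, passing up only X_i Z_i from each deleted node.
    attrs = {n: set(s) for n, s in schemes.items()}
    for n in order[:-1]:
        p = parent[n]
        keep = (attrs[n] & X) | (schemes[n] & schemes[p])
        rels[p] = join(rels[p], project(rels[n], keep))
        attrs[p] |= keep
    return project(rels[root], X)

def eliminate(schemes, X):
    """Steps 1) and 2) of the elimination algorithm, applied as far as possible."""
    fam = {frozenset(s) for s in schemes}
    changed = True
    while changed:
        changed = False
        for a in set().union(*fam) - set(X):
            holders = [s for s in fam if a in s]
            if len(holders) == 1:          # attribute outside X in exactly one set
                fam.remove(holders[0])
                fam.add(holders[0] - {a})
                changed = True
        for s in list(fam):
            if any(s < t for t in fam):    # set contained in another set
                fam.discard(s)
                changed = True
    return fam

# Film database in the spirit of Figure 2, with the chain FD - FA - FP as the tree.
schemes = {"FD": {"F", "D"}, "FA": {"F", "A"}, "FP": {"F", "P"}}
rels = {
    "FD": {row(F="f1", D="d1"), row(F="f2", D="d2")},
    "FA": {row(F="f1", A="a1"), row(F="f2", A="a2"), row(F="f3", A="a3")},
    "FP": {row(F="f1", P="p1"), row(F="f2", P="p2"), row(F="f2", P="p3")},
}
children = {"FD": ["FA"], "FA": ["FP"]}
result = project_join(schemes, rels, children, "FD", {"D", "P"})
minimal = eliminate(schemes.values(), {"D", "P"})
```

On this data `result` holds the three director-producer pairs obtained from the full join, and `minimal` is the family {FD, FP}, in line with the observation that $DP = \pi_{DP}(FD \bowtie FP)$. The semijoin passes are what keep the intermediate relations of the second phase within the bound of Lemma 4.1.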
6. LOOP-FREE BACHMAN DIAGRAMS

A loop-free Bachman diagram (lfBd) [Ba, Li] is a tree T with directions on its edges, whose nodes are distinct sets of attributes, satisfying the following conditions.
(1) Every attribute is a node,
(2) If $X \to Y$, then $X \subset Y$, and
(3) If $X \subseteq Y$ are nodes, then $X \to^{*} Y$. ($\to^{*}$ is the transitive closure of $\to$.)

Let Q be the database scheme containing the nodes of the diagram; we will call Q a lfBd scheme. Clearly, the diagram T is a tree representation of Q; thus Q is an acyclic scheme.

From the definition of a loop-free Bachman diagram, it follows easily that
(a) for every set X, the set of nodes that contain X form a rooted tree,
(b) if X and Y are two nodes with a nonempty intersection, then $X \cap Y$ is also a node: their lowest common ancestor.

Loop-free Bachman diagram schemes have the following nice property.
(CJ): The jd *Q implies that every subset of Q with a connected hypergraph has a lossless join.

Theorem 6.1 Every lfBd scheme has the (CJ) property.

Proof. Let Q be the database scheme of a loop-free Bachman diagram T, and let S be a subset of Q with a connected hypergraph. Let $R_i$, $R_j$ be two relation schemes in S with a nonempty intersection. From property (b) above there is a path from $R_i$ to $R_j$ in T that goes from $R_i$ up to $R_i \cap R_j$ meeting nodes that are subsets of $R_i$ and then down to $R_j$ through nodes that are subsets of $R_j$ (see Figure 3). Let p be this path. We have $\pi_{R_i} \bowtie \pi_{R_j} = \bowtie \{\pi_{R_k} \mid R_k \in p\}$. Let S' be the set of nodes that lie in such a path p connecting two sets in S with a nonempty intersection. We have $m_{S'} = m_S$. Since S has a connected hypergraph, S' induces a subtree of T. The conclusion now follows from Lemma 4.2. □

[Figure 3]

Corollary 6.1 If D is a database over a lfBd scheme Q, then all joins (and their projections) can be computed efficiently.

Proof. Let S be a subset of Q. The join of the relations in each connected component of (the hypergraph of) S is lossless and thus can be computed efficiently. The join of the relations in S is just the Cartesian product of these joins. □

Note that an arbitrary acyclic scheme may have a connected subset with a lossy join. In fact, lfBd schemes are the only ones with the (CJ) property:

Theorem 6.2 Let Q be a database scheme with the (CJ) property. Then Q is contained in a lfBd scheme Q' with the same maximal sets (and thus with an equivalent jd).

The proof of Theorem 6.2 is based on the following two Lemmas, whose proof we omit.

Lemma 6.1 If Q satisfies (CJ), then the closure under intersection of Q has also the (CJ) property. □

Let Q' now be the union of the closure under intersection of Q and the set of all singletons. Let us define a directed graph G on the elements of Q' by having an arc $X \to Y$ if $X, Y \in Q'$, $X \subseteq Y$, and there is no Z in Q' with $X \subset Z \subset Y$ (i.e. $\to$ is the transitive reduction of $\subseteq$).

Lemma 6.2 If Q' has the (CJ) property then G contains no (undirected) cycles. □

The closure under intersection of an arbitrary family of sets can have size exponential in the size of the original family. (Just consider the family of subsets of cardinality n - 1 of a set of size n.)
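The exponential blow-up mentioned for arbitrary families can be checked on small cases. The sketch below is only an illustration under stated assumptions: sets are represented as Python frozensets, and the helper names `closure_under_intersection` and `co_singletons` are hypothetical, not from the paper. For the family of all subsets of cardinality n - 1 of an n-element set, every proper subset arises as an intersection, so the closure has $2^n - 1$ members while the family itself has only n.

```python
from typing import FrozenSet, Iterable, List, Set


def closure_under_intersection(family: Iterable[Iterable[int]]) -> Set[FrozenSet[int]]:
    """Smallest family containing `family` that is closed under pairwise intersection."""
    closed: Set[FrozenSet[int]] = {frozenset(s) for s in family}
    frontier = set(closed)
    while frontier:
        new: Set[FrozenSet[int]] = set()
        # Intersect every recently added set with everything found so far.
        for a in frontier:
            for b in closed:
                c = a & b
                if c not in closed:
                    new.add(c)
        closed |= new
        frontier = new
    return closed


def co_singletons(n: int) -> List[FrozenSet[int]]:
    """All subsets of cardinality n - 1 of the set {0, ..., n - 1}."""
    universe = frozenset(range(n))
    return [universe - {i} for i in range(n)]


if __name__ == "__main__":
    for n in range(2, 8):
        family = co_singletons(n)
        closed = closure_under_intersection(family)
        # The closure consists of all proper subsets: 2**n - 1 sets.
        print(n, len(family), len(closed))
```

The fixpoint loop is quadratic in the size of the result, which is adequate for illustration but would not scale to large families.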
This is not true for families with the (CJ) property:

Lemma 6.3 If Q is a database scheme with the (CJ) property and Q' is as above, then $|Q'| \le |Q| + 2|U|$. □

Thus, we can represent a database scheme with the (CJ) property by a loop-free Bachman diagram without having to introduce too many new relation schemes. These relations can facilitate the computation of the associations D[X] by having essentially ready the relations $R'_j$ of Theorem 5.3.

A database D over Q is legal if it satisfies the following existence constraints. For all nodes X, Y with $X \to Y$ the corresponding relations satisfy $R_X \supseteq \pi_X(R_Y)$.

Theorem 6.3 Let D be a legal database over the database scheme Q of a loop-free Bachman diagram T, and X a set of attributes.
(1) If X is contained in at least two relation schemes, then $D[X] = \pi_X R_Y$, where Y is the unique minimal set of Q containing X.
(2) If X is not contained in two or more relation schemes, then $D[X] = \pi_X(\bowtie_{T_X} R_i)$, where $T_X$ is as in Lemma 4.3. □

We leave it as an open problem to determine those acyclic schemes Q that have a representing tree T which can produce all D[X] (by joining connected subsets), even with the addition of existence constraints. (It is easy to see that some kind of such constraints is necessary for this to hold.) Such a tree would keep, in some sense, attributes as closely connected as possible. For example, the scheme of Figure 2 has the Bachman diagram of Figure 4 (nodes A, P, D are unnecessary).

[Figure 4]

Scheme Q = {FPM, FPA, FAR, FP, FA, F} has such a tree with the appropriate existence constraints (but has no loop-free Bachman diagram); $Q \cup \{FD\}$ has no such tree.

7. AN ACYCLIC JOIN DEPENDENCY WITH A SET OF FUNCTIONAL DEPENDENCIES

As in Section 5, a database $D = \{R_1, \ldots, R_k\}$ represents the information common to all containing universal instances (i.e. instances R with $\pi_{R_i}(R) \supseteq R_i$) which satisfy the dependencies, if there is such a containing universal instance [H]. We can check the dependencies and find the information represented by D, by forming a universal relation R(D) as in Section 5 and then chasing the dependencies on R(D). In this Section we will see how to do this efficiently in the presence of an acyclic join dependency Q and a set of functional dependencies.

7.1. Testing a functional dependency.

Let $D = \{R_1, \ldots, R_k\}$ be a database over the acyclic scheme Q and $f: X \to A$ a functional dependency. We shall show how to check if $\bowtie_i R_i$ satisfies f, and list all violations if it does not, i.e. list all pairs of distinct A-values associated with the same X-value, in time polynomial in the size of D. Note that we cannot simply compute $\pi_{XA}(\bowtie_i R_i)$ and then check f, since its size might not be polynomially bounded in that of D.

Let T be a tree representing Q. We root the tree at a node containing attribute A, say $R_1$. We prune the tree leaf to root while associating a graph $G_i$ with every relation $R_i$. The nodes of $G_i$ are the tuples of the current version of $R_i$. Initially, there is an edge between two tuples of $R_i$ if they agree on the attributes of $X \cap R_i$. Let $T_i$ be the subtree of T rooted at node $R_i$, $Z_i$ the set of attributes labeling the edge from $R_i$ to its father ($Z_i \subseteq R_i$), and $X_i$ the set of attributes of X that are covered by $T_i$. The deletion of node $R_i$ is carried out as follows. Let $R'_i$ be the current version of $R_i$ (a relation on the set of attributes $R_i$) and $R'_j$ the current version for its father $R_j$.
At first we project $R'_i$ onto $Z_i$, and merge nodes of $G_i$ that correspond to tuples with the same $Z_i$-projection. Two merged nodes u, v of $G_i$ are adjacent if there were two nodes u', v' respectively merged into them that were adjacent. Then we replace $R'_j$ by $R'_j \bowtie \pi_{Z_i}(R'_i)$, delete $R'_i$ and update $G_j$ as follows. Two nodes u, v of the new $G_j$ are adjacent if they were adjacent in the old $G_j$ and their $Z_i$-projections were adjacent in $G_i$. Let $R'_1$ be the version of the root relation when all other nodes have been deleted and $G_1$ the corresponding graph. Let $R_A$ be the projection of $R'_1$ onto A, and $G_A$ the graph (on $R_A$) obtained from $G_1$ by merging all nodes with the same A-projection. We claim that the edges of $G_A$ give all violations of f in $\bowtie_i R_i$, i.e. all pairs of A-values associated with the same X-value. Thus, $\bowtie_i R_i$ satisfies f iff $G_A$ has no edges.

At first note that, apart from the graphs, the algorithm computes $\pi_A(\bowtie_i R_i)$ as in the beginning of Section 4. (The reduction process is not needed here since the output is small.) Therefore, when node $R_i$ is deleted the current relation $R'_i$ is equal to $\pi_{R_i}(\bowtie_{T_i} R_j)$. We can show also by an induction that in the corresponding version of the graph $G_i$, two $R_i$-tuples u and v are adjacent iff there are two tuples u', v' of $\bowtie_{T_i} R_j$ which agree with each other on $X_i$, and with u and v respectively on $R_i$. Therefore, at the end two A-values $a_1$, $a_2$ are adjacent in $G_A$ iff there are two tuples $t_1$, $t_2$ in $\bowtie_i R_i$ which agree with each other on X, and have $t_1[A] = a_1$, $t_2[A] = a_2$.

Theorem 7.1 Let D be a database over an acyclic scheme Q and $f: X \to A$ a functional dependency. We can check in polynomial time if $\bowtie_i R_i$ satisfies f, and list all violations (pairs of A-values associated with the same X-value) if it does not. □

We could easily modify the algorithm to give for each pair of A-values in the list of violations an X-value with which they are associated, or all such X-values, if so desired. In the last case the algorithm would run in time polynomial in the input and the output. Also, if the database is already reduced (the projection of a universal instance), then we can use the improvements of Section 4 where XA now plays the role of the set X there (i.e. apply the algorithm to the minimal subtree of T covering XA, or the tree of the set S' there for U).

As for testing if D satisfies the functional dependency $f: X \to A$ and the join dependency Q (or equivalently, whether D[XA] satisfies f), we can either (i) form the universal relation R(D), project back down on the schemes and then check if f is satisfied in the join of the new relations, or still better (ii) find the set S' for XA as in Theorem 4.3, compute the corresponding $R'_i$'s as for Theorem 5.3 and then check if f holds in the join of the $R'_i$'s. However, if we have a set F of functional dependencies, it is not sufficient to check each dependency individually; i.e. D may satisfy Q and f for each fd f of F but still violate the dependencies as a whole.

A similar algorithm can be used to list for two sets X, Y, all pairs of Y-tuples that are associated with a common X-tuple in $\bowtie_i R_i$. In this case we must first reduce the database (so that intermediate results will be bounded by the size of the input and the output) and then we can root the tree at any node. Besides testing a functional dependency the algorithm can be used also to compute some queries involving equijoins; e.g. Have these two actors worked with the same director?, Which films are made by the same director and producer? Equijoins can be transformed to natural joins by renaming attributes. However, the transformed query may involve cyclic joins. For example, the second query above corresponds to $\pi_{FF'}(FD \bowtie FP \bowtie F'D \bowtie F'P)$ where F' is a renaming of F.

7.2. Computing the chase

Let $Q = \{R_1, \ldots, R_k\}$ be an acyclic join dependency (database scheme) and F a set of functional dependencies of the form $X \to A$. We shall show how to compute efficiently the chase of a relation R under Q and F. [MSY] shows that any tuple in the chase of a tableau under a single join dependency and a set of functional dependencies can be generated nondeterministically
in polynomial time. Our algorithm amounts essentially to turning all nondeterministic steps in this algorithm into deterministic ones. From R we shall construct (in polynomial time) a relation R' by identifying all symbols that the chase will identify, and such that $chase(R) = m_Q R'$. Thus, by "computing the chase efficiently" we mean that we can check if a certain (possibly partially specified) tuple is in it, generate a projection of it or the total tuples in a projection (if there are nulls) in time polynomial in the size of the input and the output.

Computing the chase of R
While there is a change do
    For each $f: X \to A$ in F do
    Begin
        Let D be the projection of R onto Q;
        Apply the algorithm of Section 7.1 for f and D;
        Identify A-values in the same connected component of $G_A$ and delete duplicate tuples from R;
    End;

Clearly, the while-loop cannot be executed more than $|R|$ times. Since the algorithm of Section 7.1 takes polynomial amount of time in the size of D (which is bounded by $k|R|$), the algorithm stops with a final relation R' in time polynomial in k, $|R|$ and $|F|$. We claim that $chase(R) = m_Q R'$. Clearly, any two values identified by the algorithm are correctly identified. Since $m_Q R'$ then satisfies the j.d. Q and all functional dependencies in F, it is the chase of R.

Theorem 7.2 We can compute in polynomial time from a relation R another relation R' such that the chase of R under an acyclic join dependency Q and a set of functional dependencies is equal to $m_Q R'$. □

Corollary 7.1 Given a set P consisting of functional dependencies and a single acyclic join dependency, we can decide in polynomial time (1) if a given database satisfies P, (2) if a given template dependency is implied by P. □

For a set P of functional dependencies and a single general join dependency *Q, we know that it is NP-complete to determine if P implies another join dependency, but functional and multivalued dependencies can be efficiently inferred [MSY, V]. Also, it is easy to modify the proof of Theorem 3.1 to show that it is NP-complete to determine if a given database D satisfies P, even if D is the projection of a universal instance. [H] shows how to test if a database satisfies a set of functional dependencies (no jds).

REFERENCES

[ABU]
A. V. Aho, C. Beeri, J. D. Ullman, "The theory of joins in relational databases", ACM Trans. on Database Systems, 4(3), 297-314, (1979).
[ASU1]
A. V. Aho, Y. Sagiv, J. D. Ullman, "Equivalence among relational expressions", SIAM J. Computing, 8(2), 218-246, (1979).
[ASU2]
A. V. Aho, Y. Sagiv, J. D. Ullman, "Efficient optimization of a class of relational expressions", ACM Trans. on Database Systems, 4(4)
[Ba]
C. W. Bachman, "Data structure diagrams", Data Base, 1(2), 4-10, (1969).
[Be]
C. Beeri, "On the membership problem for multivalued dependencies in relational databases", ACM Trans. on Database Systems.
[BB]
C. Beeri, P. A. Bernstein, "Computational problems related to the design of normal form relational schemes", ACM Trans. on Database Systems, 4(1), 30-59, (1979).
[BFMMUY]
C. Beeri, R. Fagin, D. Maier, A. Mendelzon, J. D. Ullman, M. Yannakakis, "Properties of acyclic database schemes", ACM Symp. Theory of Computing, (1981).
[BV]
C. Beeri, M. Y. Vardi, "On the properties of total join dependencies", Proc. Workshop on Formal Bases for Databases, Toulouse, (1979).
[Ber]
P. A. Bernstein, "Synthesizing third normal form relations from functional dependencies", ACM Trans. on Database Systems, 1(4), 277-298, (1976).
[BG]
P. A. Bernstein and N. Goodman, "The theory of semi-joins", TR CCA-79-27, Computer Corp. of America, (1979).
[Bg]
C. Berge, Graphs and Hypergraphs, North-Holland, 1973.
[C]
E. F. Codd, "A relational model for large shared data banks", Comm. ACM, 13(6), 377-387, (1970).
[F]
R. Fagin, "Multivalued dependencies and a new normal form for relational databases", ACM Trans. on Database Systems, 2(3), 262-278, (1977).
[FMU]
R. Fagin, A. Mendelzon, J. D. Ullman, "A simplified universal relation assumption and its properties", RJ2900, IBM, San Jose, Cal., (1980).
[H]
P. Honeyman, "Testing functional dependency satisfaction", to appear in Journal of ACM.
[HLY]
P. Honeyman, R. Ladner, M. Yannakakis, "Testing the universal instance assumption", Inf. Proc. Letters, 10(1), 14-19, (1980).
[Li]
Y. E. Lien, "On the equivalence of database models", Bell Labs memorandum, (1979).
[M]
D. Maier, "Discarding the universal instance assumption: preliminary results", Proc. XP1 Conference, Stonybrook, NY, (1980).
[MMS]
D. Maier, A. Mendelzon, Y. Sagiv, "Testing implications of data dependencies", ACM Trans. on Database Systems, 4(4), 455-469, (1979).
[MSY]
D. Maier, Y. Sagiv, M. Yannakakis, "Testing implications of functional and join dependencies", Journal of ACM, to appear.
[R]
J. Rissanen, "Theory of relations for databases - a tutorial survey", Proc. 7th Symp. on Mathematical Foundations of Computer Science, Lecture Notes in Computer Science 64, Springer-Verlag, 537-551, (1978).
[SU]
F. Sadri, J. D. Ullman, "A complete axiomatization for a large class of dependencies in relational databases", Proc. 12th Ann. ACM Symp. on Theory of Computing, 117-122, (1980).
[Sa]
Y. Sagiv, "Can we use the universal instance assumption without using null values?", 7th ACM-SIGMOD Int'l Conf. on Management of Data, 108-120, (1981).
[SY]
Y. Sagiv and M. Yannakakis, "Equivalences among relational expressions with the union and difference operators", Journal of ACM, 27(4), 633-655, (1980).
[S]
E. Sciore, "Real-world MVDs", 7th ACM-SIGMOD Int'l Conf. on Management of Data, 121-132, (1981).
[U]
J. D. Ullman, Principles of Database Systems, Computer Science Press, 1979.
[V]
M. Y. Vardi, "Inferring multivalued dependencies from functional and join dependencies", Dept. of Applied Math, Weizmann Inst. of Science, Rehovot, Israel, (1980).
[W]
A. Walker, "Time and space in a lattice of universal relations with blank entries", Proc. XP1 Conference, Stonybrook, NY, (1980).
[Z]
C. Zaniolo, "Analysis and design of relational schemata for database systems", TR UCLA-ENG-7769, Dept. of Comp. Sci., UCLA, (1976).