Algorithms For Acyclic Database Schemes

ALGORITHMS FOR ACYCLIC DATABASE SCHEMES

Mihalis Yannakakis

Bell Laboratories, Murray Hill, NJ 07974

ABSTRACT: Many real-world situations can be captured by a set of functional dependencies and a single join dependency of a particular form called acyclic [B..]. The join dependency corresponds to a natural decomposition into meaningful objects (an acyclic database scheme). Our purpose in this paper is to describe efficient algorithms in this setting for various problems, such as computing projections, minimizing joins, inferring dependencies, and testing for dependency satisfaction.

1. INTRODUCTION

An important part in the design of relational database schemes is the specification of constraints satisfied by the data, called dependencies. The first dependencies to be introduced were the functional dependencies [C]. Their properties are well understood and efficient algorithms have been developed for inferring new dependencies and designing database schemes [BE, Ber]. Multivalued dependencies [F, Z] were introduced to describe those cases where a relation can be decomposed into two of its projections, and join dependencies [R] for a decomposition into several projections without loss of information; i.e. the original relation can be reconstructed by joining the projections. There are also efficient algorithms for the inference of multivalued dependencies [Bee] and their use in the design process. However, in general they are harder to grasp and deal with than functional dependencies; e.g. the best known algorithm to infer a join dependency from multivalued dependencies takes exponential time and space [ABU]. Join dependencies are studied in [BV, MSY, Y].

Recently, [FMU] advanced the hypothesis that most real-world situations have a particularly simple structure: they can be captured by some functional dependencies and a single join dependency that describes a "natural" decomposition into meaningful "objects". Furthermore, the join dependency is in most cases of a special form, called acyclic [B..]. Acyclic join dependencies are those that are equivalent to some set of multivalued dependencies. Such sets of multivalued dependencies are conflict-free, a notion introduced by [L] in the study of the relation between the network and the relational model. [S] argues also that most real-world sets of mvds fall into this category or can be put in such a form. The class of acyclic database schemes contains the class of loop-free Bachman diagram schemes of [L], the class of simply connected schemes of [Z], and bears a close resemblance to the class of tree queries of [BG].

In this paper we shall give efficient algorithms for several problems on acyclic database schemes, such as computing projections, testing satisfaction of the dependencies by a database, and inferring other dependencies. In Section 2 we review the basic terminology. Sections 3-6 deal with a database scheme and its associated join dependency (no functional dependencies). In Section 3 we assume a general database scheme $Q$. We examine the complexity of computing a projection of the join of the relations in a database over $Q$, of determining if the projection can be computed by joining only some of the relations, and of inferring dependencies from the join dependency associated with $Q$. In Section 4 we examine the same problems when $Q$ is an acyclic database scheme. In Section 5 we define the association $D[X]$ between the attributes of a set $X$ represented by a database $D$ and show how to compute it. In Section 6 we examine loop-free Bachman diagram schemes and give a characterization of them in terms of lossless joins. Section 7 assumes a set of functional dependencies and an acyclic join dependency; it examines the inference of other dependencies, and testing if a given database satisfies the dependencies.


CH1701-2/81/0000/008~.75

© 1981 IEEE

2. TERMINOLOGY

In this Section we will go briefly over the basic relational theory terminology. For more details the reader is referred to [U]. The universe is a finite set $U$ of attributes. A relation scheme $\mathbf{R}$ is a subset of $U$. A database scheme $Q$ (over $U$) is a set of relation schemes with union $U$. Every attribute $A$ has an associated set of values, its domain $Dom(A)$. If $X$ is a set of attributes, an $X$-tuple (or $X$-value) is a mapping $\mu$ from $X$ into $\bigcup_{A \in X} Dom(A)$ such that $\mu(A) \in Dom(A)$ for each $A$ in $X$. A relation $R$ over relation scheme $\mathbf{R}$ is a finite set of $\mathbf{R}$-tuples. A database $D$ over a database scheme $Q$ is a set of relations containing one relation over each relation scheme of $Q$.

The projection $t[Y]$ of an $X$-tuple $t$ onto a subset $Y$ of $X$ is the restriction of $t$ to $Y$. The projection $\pi_Y(R)$ of a relation $R$ over $X$ to $Y$ is the set of projections of the tuples in $R$ to $Y$. Let $R_1, \dots, R_k$ be relations over schemes $\mathbf{R}_1, \dots, \mathbf{R}_k$ respectively and $\mathbf{R} = \bigcup_{i=1}^{k} \mathbf{R}_i$. The join of $R_1, \dots, R_k$, denoted $R_1 \bowtie \cdots \bowtie R_k$ (or $\bowtie_i R_i$ or $\bowtie \{R_1, \dots, R_k\}$), is the set of $\mathbf{R}$-tuples $t$ with $t[\mathbf{R}_j] \in R_j$ for $j = 1, \dots, k$.

Using the projection operator $\pi_X$ ($X \subseteq U$) and the join operator we can build project-join relational expressions. An expression $\phi$ defines a mapping from relations over $U$ to relations over a certain set of attributes, the target relation scheme of $\phi$, $trs(\phi)$. An expression $\phi$ is contained in an expression $\psi$, or $\phi \subseteq \psi$, if $trs(\phi) = trs(\psi)$ and $\phi(R) \subseteq \psi(R)$ for every relation $R$ over $U$; $\phi$ and $\psi$ are equivalent, denoted $\phi \equiv \psi$, if $\phi \subseteq \psi$ and $\psi \subseteq \phi$. A useful tool for comparing expressions is the tableau [ASU1]. Each attribute $A$ of $U$ has an associated symbol set $S(A) = \{a, a_1, a_2, \dots\}$; $a$ is called a distinguished symbol, the $a_i$'s are nondistinguished. A tableau $T$ is a relation over $U$ with the symbol sets as the domains of the attributes. The target relation scheme $trs(T)$ of $T$ is the set of attributes in which $T$ has a distinguished symbol. The summary $s_T$ of $T$ is the $trs(T)$-tuple of distinguished symbols.

A tableau $T$ defines a mapping $f_T$ from relations over $U$ (or universal relations) to relations over $trs(T)$ as follows. A valuation $\rho$ is a mapping that maps, for each attribute $A$, $S(A)$ into $Dom(A)$. A valuation is extended to tuples componentwise and to relations elementwise. $f_T$ is defined as follows: $f_T(R) = \{\rho(s_T) \mid \rho \text{ is a valuation with } \rho(T) \subseteq R\}$. A tableau $T_1$ is contained in another tableau $T_2$ (denoted $T_1 \subseteq_T T_2$) if they both have the same target relation scheme and $f_{T_1}(R) \subseteq f_{T_2}(R)$ for every universal relation $R$; $T_1$ and $T_2$ are equivalent (or $T_1 \equiv_T T_2$) if $T_1 \subseteq_T T_2$ and $T_2 \subseteq_T T_1$. Note that, if $T_1 \subseteq T_2$ (where $\subseteq$ is set-inclusion) and $trs(T_1) = trs(T_2)$ then $T_2 \subseteq_T T_1$. We now list some of the basic results of the theory of tableaux [ASU1, ASU2].

(1) For every project-join expression $\phi$ there is a tableau $T$ with $\phi(R) = f_T(R)$ for every universal relation $R$. The tableau $T$ is constructed recursively from $\phi$ as follows. If $\phi = \pi_X \sigma$, and $S$ is the tableau for $\sigma$, $T$ is obtained from $S$ by changing each distinguished symbol $a$ with $A \notin X$ into a new nondistinguished symbol. If $\phi = \sigma_1 \bowtie \sigma_2 \bowtie \cdots \bowtie \sigma_k$, and $T_1, T_2, \dots, T_k$ are the tableaux for $\sigma_1, \dots, \sigma_k$, then $T$ is the union of the $T_i$'s.

(2) Let $T_1$, $T_2$ be two tableaux with the same target relation scheme. A homomorphism $h$ from $T_1$ to $T_2$ is a mapping from $S(A)$ to $S(A)$ for each $A$, such that $h(s_{T_1}) = s_{T_2}$ and $h(T_1) \subseteq T_2$; the mapping from the tuples of $T_1$ to the tuples of $T_2$ induced by $h$ is a containment mapping. Such a homomorphism exists if and only if $T_2 \subseteq_T T_1$.

(3) Each tableau $T$ has a minimal subset $\bar{T}$ equivalent to $T$; $\bar{T}$ is unique up to renaming of nondistinguished symbols and is called the minimal equivalent tableau of $T$. If $T$ is the tableau of an expression $\phi$, then $\bar{T}$ is the tableau of an expression $\psi$ equivalent to $\phi$ that contains the minimum number of (binary) joins.

A functional dependency (or fd) is a statement of the form $X \to Y$ where $X, Y \subseteq U$. It is satisfied by a universal relation $R$ if for all tuples $t_1, t_2$ of $R$ with $t_1[X] = t_2[X]$, also $t_1[Y] = t_2[Y]$ holds. A join dependency (or jd) is a statement of the form $*Q$ where $Q$ is a database scheme over $U$; it is satisfied by a universal relation $R$ if $\bowtie \{\pi_{\mathbf{R}_i}(R) \mid \mathbf{R}_i \in Q\} = R$. A multivalued dependency (or mvd) $X \to\to Y$ is the join dependency $*\{XY, XZ\}$ where $Z = U - XY$.
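The definitions of projection, join, and satisfaction of an fd, jd, or mvd can be made concrete with a small Python sketch. The representation (a tuple as a frozenset of (attribute, value) pairs, a relation as a set of such tuples) and all helper names are assumptions made for this illustration; they are not notation from the paper.

```python
def tup(**kv):
    """Hypothetical helper: build a tuple as a frozenset of (attribute, value) pairs."""
    return frozenset(kv.items())

def restrict(t, attrs):
    """t[Y]: restriction of tuple t to the attributes in attrs."""
    return frozenset((a, v) for a, v in t if a in attrs)

def project(rel, attrs):
    """pi_Y(R): set of projections of the tuples of rel onto attrs."""
    return {restrict(t, attrs) for t in rel}

def join(r, s):
    """Natural join: union of every pair of tuples that agree on shared attributes."""
    out = set()
    for t in r:
        d = dict(t)
        for u in s:
            if all(d.get(a, v) == v for a, v in u):
                out.add(t | u)
    return out

def satisfies_fd(rel, X, Y):
    """X -> Y holds if tuples that agree on X also agree on Y."""
    image = {}
    for t in rel:
        if image.setdefault(restrict(t, X), restrict(t, Y)) != restrict(t, Y):
            return False
    return True

def satisfies_jd(rel, schemes):
    """*Q holds if joining the projections of rel on the schemes gives rel back."""
    joined = None
    for s in schemes:
        p = project(rel, s)
        joined = p if joined is None else join(joined, p)
    return joined == rel

def satisfies_mvd(rel, X, Y, universe):
    """X ->> Y is the jd *{XY, XZ} with Z = U - XY."""
    X, Y = set(X), set(Y)
    Z = set(universe) - X - Y
    return satisfies_jd(rel, [X | Y, X | Z])

r = {tup(A=1, B=1, C=1), tup(A=1, B=2, C=2)}
print(satisfies_fd(r, {"B"}, {"C"}))                    # True
print(satisfies_mvd(r, {"A"}, {"B"}, {"A", "B", "C"}))  # False
```

In the example relation, joining the projections on AB and AC produces four tuples, so the mvd fails; adding the two missing combinations makes it hold.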

An embedded join dependency (ejd) is like a join dependency except that the union of the relation schemes does not have to contain all the attributes; i.e. if $S$ is a collection of relation schemes, the ejd $*S$ is satisfied by a universal relation $R$ if $\bowtie \{\pi_{X_i}(R) \mid X_i \in S\} = \pi_X(R)$, with $X$ the union of the schemes in $S$. We will usually say that $S$ has a lossless join. The expression $\bowtie \{\pi_{X_i} \mid X_i \in S\}$ is called a project-join mapping and is denoted $m_S$. It has the properties (1) $m_S \supseteq \pi_X$, and (2) $m_S(m_S) = m_S$ (idempotency). From the idempotency of the project-join mapping it follows that the join of the relations of a database $D$ over $Q$ satisfies the jd $*Q$.

A template dependency (or td) [SU] is a very general kind of dependency; it is a statement of the form $T/s_T$ where $T$ is a tableau and $s_T$ its summary. It is satisfied by a universal relation $R$ if $f_T(R) = \pi_X(R)$ where $X$ is the target relation scheme of $T$.

A set $P$ of dependencies implies another dependency $\sigma$ (or $P \models \sigma$) if $\sigma$ holds in every relation satisfying $P$. A procedure for testing if $P \models \sigma$ is the chase ([ABU, MMS]). The tableau $T_\sigma$ of a functional dependency $\sigma = X \to Y$ has two tuples; the one has distinguished symbols in the attributes of $X$ and nondistinguished symbols in the rest, and the other tuple has distinguished symbols in the attributes of $XY$ and nondistinguished in the rest of the attributes. The tableau $T_\sigma$ of a template dependency $\sigma = T/s_T$ is $T$. Here we will consider only sets $P$ of functional dependencies and a join dependency. The chase procedure modifies the tableau $T_\sigma$, trying to include $s_T$ in it if $\sigma$ is a template dependency, or to eliminate the nondistinguished symbols from the attributes of $Y$ if $\sigma$ is an fd; if this happens then it succeeds. The rules for modifying $T_\sigma$ are as follows.

FD-rule. If $f: X \to Y$ is an fd in $P$ and there are two tuples $t_1, t_2$ that agree on $X$ but disagree on an attribute $B$ of $Y$, replace all occurrences of one of $t_1(B)$, $t_2(B)$ by the other, keeping a distinguished symbol or the nondistinguished symbol with a lower subscript.

JD-rule. If $*Q$ is a jd in $P$, replace the tableau $T$ by $m_Q(T)$.

The chase of $T_\sigma$ under $P$, $chase_P(T_\sigma)$, is the tableau resulting after applying the rules for the dependencies in $P$ as far as possible. The chase procedure has the Church-Rosser property (i.e. the final relation does not depend on the order in which the rules are applied), and $P \models \sigma$ iff it succeeds [MMS].

The procedure can be applied also to relations, considering the elements of $Dom(A)$ (constants) like new nondistinguished symbols. In this case, if in the application of an FD-rule a symbol is identified with an element of $Dom(A)$ (or constant), we keep the constant. If two constants are identified, then the chase procedure stops and we say that a contradiction arises. Again, the order in which the rules are applied is immaterial: either a contradiction will arise or else the final relation $chase_P(R)$ is the same.

With a database scheme $Q$ or a jd $*Q$ we can associate in a straightforward way a hypergraph. A hypergraph, like a graph, consists of a set of nodes and a set of edges. The only difference from a graph is that an edge can contain an arbitrary number of nodes. The hypergraph associated with $Q$ has the universe $U$ as its set of nodes, and the relation schemes of $Q$ as its set of edges. We will denote it also as $Q$. Graph notions like "path", "connectedness" etc. carry over to hypergraphs in a straightforward way. We are going to assume for the rest of this paper that $Q$ is connected; if $Q$ is disconnected then the attributes in the different components of $Q$ are independent of each other, and the results generalize by considering each component separately.

An acyclic join dependency $*Q$ is defined in [B..] in terms of the topological properties of the associated hypergraph. We will not give the definition here but rather list some equivalent characterizations. It is a join dependency which implies and is implied by (or is equivalent to) some set of mvds. A database scheme $Q$ is acyclic if $*Q$ is an acyclic jd. We are going to use the following characterizations of acyclic database schemes $Q$ [B..].

(1) $Q$ can be reduced to the empty set by repeatedly deleting an attribute if it occurs in exactly one relation scheme, and deleting a relation scheme if it is contained in another scheme.

(2) There is a tree $T$ with the relation schemes as nodes such that the subgraph of $T$ induced by the nodes containing an attribute $A$ is connected (i.e. a subtree) for each attribute $A$ of $U$. We say that $T$ represents $Q$.
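The two characterizations of acyclic schemes translate directly into short checks. The sketch below is illustrative only: the function names, the representation of schemes as Python sets, and the encoding of a tree as a list of index pairs are assumptions, and `represents` assumes its edge list already forms a tree over the schemes.

```python
def is_acyclic(schemes):
    """Characterization (1): repeatedly delete attributes that occur in exactly
    one scheme and schemes contained in another scheme."""
    edges = [set(s) for s in schemes]
    changed = True
    while changed:
        changed = False
        # Rule 1: delete an attribute that occurs in exactly one scheme.
        for e in edges:
            lonely = {a for a in e if sum(a in f for f in edges) == 1}
            if lonely:
                e -= lonely
                changed = True
        # Rule 2: delete a scheme contained in another scheme.
        for i, e in enumerate(edges):
            if any(j != i and e <= f for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
    # The scheme is acyclic if nothing but (at most one) empty scheme is left.
    return all(not e for e in edges)

def represents(tree_edges, schemes):
    """Characterization (2): for every attribute, the nodes containing it must
    induce a connected subgraph of the tree (tree_edges are index pairs)."""
    for a in set().union(*schemes):
        nodes = {i for i, s in enumerate(schemes) if a in s}
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for i, j in tree_edges:
                for x, y in ((i, j), (j, i)):
                    if x == u and y in nodes and y not in seen:
                        seen.add(y)
                        stack.append(y)
        if seen != nodes:
            return False
    return True

chain = [{"A", "B"}, {"B", "C"}, {"C", "D"}]
print(is_acyclic(chain))                                   # True
print(is_acyclic([{"A", "B"}, {"B", "C"}, {"C", "A"}]))    # False
print(represents([(0, 1), (1, 2)], chain))                 # True
```

The triangle {AB, BC, CA} is the standard small cyclic example: no attribute is private to one scheme and no scheme is contained in another, so neither deletion rule ever applies.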

Acyclic schemes $Q$ have also the following properties.

(3) We say that a database $D = \{R_1, \dots, R_k\}$ over $Q$ is (fully) reduced or the projection of a universal instance if there is a universal relation $R$ such that $R_i = \pi_{\mathbf{R}_i}(R)$ for each $\mathbf{R}_i \in Q$. If $D$ is a database over an acyclic scheme $Q$, then $D$ is reduced iff $R_i = \pi_{\mathbf{R}_i}(R_i \bowtie R_j)$ for each pair of relations $R_i, R_j$ in $D$. There is an efficient algorithm using a particular kind of joins (called semijoins) which computes the reduction of $D$: $D' = \{R_1', \dots, R_k'\}$ where $R_i' = \pi_{\mathbf{R}_i}(\bowtie_j R_j)$ [BG].

(4) If $Z$ is a set of attributes, let $Q(Z)$ be the family of nonempty sets $\mathbf{R}_i \cap Z$ where $\mathbf{R}_i \in Q$. $Q(Z)$ is an acyclic scheme over the universe $Z$, called the scheme generated by $Z$.

3. GENERAL DATABASE SCHEMES

Let $Q = \{\mathbf{R}_1, \dots, \mathbf{R}_k\}$ be a database scheme and $D = \{R_1, \dots, R_k\}$ a database over it. It is trivial to determine if a given tuple is in the join of the relations in $D$. If however the tuple is only partially specified (in some set of attributes $X$) then the problem is much harder:

Theorem 3.1. It is NP-complete to test if an $X$-tuple $t$ is in $\pi_X(\bowtie_i R_i)$ where $D = \{R_i\}_{i=1,\dots,k}$ is a database over an arbitrary scheme, even if $D$ is the projection of a universal instance.

Proof. Membership in NP is obvious. For the NP-hardness part we use a result of [SY], that it is NP-complete to test if an expression $\phi_1 = \pi_X(\bowtie_i \pi_{Y_i})$ is contained in another expression $\phi_2 = \pi_X(\bowtie_i \pi_{Z_i})$. Let $T_1$ be the tableau of $\phi_1$ and $s_1$ its summary. From the theory of tableaux, $\phi_1 \subseteq \phi_2$ if and only if $s_1 \in \phi_2(T_1)$. Thus, we can decide if $\phi_1 \subseteq \phi_2$ by testing whether $s_1 \in \pi_X(\bowtie_i R_i)$ where $R_i = \pi_{Z_i}(T_1)$. □

As a consequence of Theorem 3.1, it is NP-complete to test if an arbitrary join dependency implies a template dependency of the form $\pi_X m_S = \pi_X$. (Membership in NP follows from the idempotency of the project-join mapping.) There are however some cases of particular interest which can be efficiently decided even for general join dependencies. Let $J = *\{X_1, \dots, X_n\}$ be a join dependency.

(1) Inferring lossless joins [BeV].

Method. Let $S = \{Y_j\}$ be a collection of subsets of $U$ with $Y = \bigcup_j Y_j$. We can decide if $J$ implies that $S$ has a lossless join (i.e. if $\bowtie_j \pi_{Y_j}(R) = \pi_Y(R)$ for every universal relation satisfying $J$) as follows. Form a bipartite graph $G$ with nodes the $X_j$'s and the attributes in $U - Y$, and an edge between an attribute and a set if the attribute is in the set. Let $K_1, \dots, K_l$ be the connected components of $G$. Let $Z_i$ be all the attributes of $Y$ that belong to a set $X_j$ in $K_i$, for $i = 1, \dots, l$. [BeV] shows that $\bowtie_j \pi_{Y_j}$ is lossless iff each $Z_i$ is contained in some $Y_j$. □

(2) Let $S \subseteq \{X_1, \dots, X_n\}$ and $X \subseteq \bigcup_{X_i \in S} X_i$. Deciding if $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) = \pi_X$ (*).

Method. We claim that $J$ implies (*) if and only if $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) \equiv \pi_X(\bowtie_{i=1}^{n} \pi_{X_i})$ is a tautology (i.e. holds for all universal relations). At first note that for all universal relations $R$, $\pi_X[m_J(R)] \subseteq \pi_X[m_S(R)]$. Suppose that there is a relation $R$ with $\pi_X m_J(R) \subsetneq \pi_X m_S(R)$. Let $R' = m_J(R)$. Then $R'$ satisfies the join dependency $J$ but violates (*) since $\pi_X(R') = \pi_X[m_J(R')]$ (from the idempotency of $m_J$). Conversely, if $\pi_X[m_J(R)] = \pi_X[m_S(R)]$ for every $R$, then for every instance $I$ satisfying $J$ we have $\pi_X[m_S(I)] = \pi_X[m_J(I)] = \pi_X(I)$, since $m_J(I) = I$. Thus, it suffices to test if $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) \equiv \pi_X(\bowtie_{i=1}^{n} \pi_{X_i})$. Both expressions are simple [ASU1], and therefore their equivalence can be tested in polynomial time using tableaux techniques [ASU1]. □

The significance of (2) is the following: Suppose that a universal instance $R$ (satisfying $J$) is decomposed into its projections on the $X_i$'s. Then we can test efficiently if a projection of it (on some set $X$) can be recovered by joining only some of the relations in the database.
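The [BeV] test for inferring lossless joins, item (1) of this Section, can be sketched in a few lines. This is an illustrative reading of the method as stated in the text, not code from the paper; the function name and the union-find bookkeeping are assumptions. Two sets $X_i$, $X_j$ fall in the same component of the bipartite graph exactly when they are linked through attributes outside $Y$, so the sketch merges schemes that share such an attribute.

```python
def implies_lossless_join(jd, S):
    """Does the jd *{X_1..X_n} imply that the collection S = {Y_j} joins losslessly?
    jd and S are lists of attribute sets."""
    jd = [frozenset(x) for x in jd]
    S = [frozenset(y) for y in S]
    Y = frozenset().union(*S)

    # Union-find over the X_j's: merge two sets that share an attribute of U - Y.
    parent = list(range(len(jd)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(jd)):
        for j in range(i + 1, len(jd)):
            if (jd[i] & jd[j]) - Y:
                parent[find(i)] = find(j)

    # Z_i: attributes of Y occurring in some X_j of component K_i.
    comps = {}
    for i, x in enumerate(jd):
        comps.setdefault(find(i), set()).update(x & Y)

    # Lossless iff every Z_i is contained in some Y_j.
    return all(any(z <= y for y in S) for z in comps.values())

chain = [{"A", "B"}, {"B", "C"}, {"C", "D"}]
print(implies_lossless_join(chain, [{"A", "B"}, {"B", "C"}]))  # True
print(implies_lossless_join(chain, [{"A", "B"}, {"C", "D"}]))  # False
```

In the second call the two chosen schemes share no attribute, so their join is a Cartesian product and cannot equal the projection on ABCD in general.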

Moreover, we can find efficiently the minimum number of such relations whose join gives the projection of $R$ on $X$: we just have to minimize the (simple) tableau of $\pi_X m_J$ [ASU1, ASU2].

From the theory of tableaux, all minimal subsets $S$ of $J$ with $\pi_X m_S = \pi_X$ have the same minimum cardinality, and the expressions $\pi_X m_S$ have identical tableaux $T$ up to renaming of nondistinguished symbols. Therefore, for every minimal $S = \{X_{i_1}, \dots, X_{i_m}\}$, $\pi_X m_S(R) = \pi_X[\bowtie_j \pi_{Y_j}(R)]$, where $Y_j$ is the set of attributes in which the $j$-th row of the minimal tableau $T$ has a distinguished or repeated nondistinguished symbol ($Y_j \subseteq X_{i_j}$). In other words, for every minimal set $S$, the computation of $\pi_X(R)$ involves the join of exactly the same relations ($\pi_{Y_j}(R)$); the best choice of $S$ depends on how fast $\pi_{Y_j}(R)$ can be obtained from $R_{i_j} = \pi_{X_{i_j}}(R)$ (organization of the various relations, supporting data structures, etc.).

Note that we might have $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) = \pi_X$ even though the join is lossy, i.e. $\bowtie_{X_i \in S} \pi_{X_i} \neq \pi_Y$ where $Y = \bigcup_{X_i \in S} X_i$. However, any nonredundant join for computing the projection on $X$ is lossless:

Theorem 3.2 Let $*J$ be a jd and suppose that $S$ is a minimal subset of $J$ such that $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) = \pi_X$ is implied by $*J$. Then $\bowtie_{X_i \in S} \pi_{X_i}$ is a lossless join. □

Regarding the actual computation of $\pi_X(\bowtie_i R_i)$, the size of the output relation can be exponential in the size of the input database. Thus, the best we can hope for is an algorithm polynomial in the size of the input and the output. However, we know that it is NP-complete to determine if $\bowtie_i R_i$ is empty or if there is a (universal) relation $R$ with $R_i = \pi_{\mathbf{R}_i}(R)$ [WY]. Also it is NP-complete to determine if a universal relation $R$ satisfies $R = \bowtie_i \pi_{\mathbf{R}_i}(R)$ [MSY]. Thus, probably (unless P = NP) there is no efficient (polynomial in the sizes of the input and the output) algorithm to carry out either the reduction of the database or the join of a reduced database. Also, it is NP-complete to test if a universal relation satisfies a join dependency [WY].

4. ACYCLIC DATABASE SCHEMES

Let $D = \{R_1, \dots, R_k\}$ be a database over the acyclic scheme $Q = \{\mathbf{R}_1, \dots, \mathbf{R}_k\}$. Let $X$ be a set of attributes. We shall show how to compute $\pi_X(\bowtie_i R_i)$ in time polynomial in the size of the input and the output.

The algorithm is based on the relation between acyclic database schemes and tree queries. Let $T$ be a tree representing the scheme $Q$ as explained in Section 2. At first we compute a full reduction $D'$ of $D$ as in [BG] using semijoins. Then we root the tree at an arbitrary node, say $\mathbf{R}_1$, and prune it leaf-to-root. Let $T_i$ be the subtree of $T$ rooted at node $\mathbf{R}_i$, $Z_i$ the set of attributes labelling the edge from $\mathbf{R}_i$ to its father $\mathbf{R}_j$ ($Z_i = \mathbf{R}_i \cap \mathbf{R}_j$), and $X_i$ the set of attributes of $X$ that are contained in some node of $T_i$. When the turn of $\mathbf{R}_i$ comes to be deleted, $R_i$ has been replaced by a relation $R_i'$ over the set of attributes $X_i \mathbf{R}_i$. The deletion of node $\mathbf{R}_i$ is carried out by deleting the relation $R_i'$ and replacing the current relation $R_j'$ of its father $\mathbf{R}_j$ by $R_j' \bowtie \pi_{X_i Z_i}(R_i')$. It is easy to show by an induction that when node $\mathbf{R}_i$ is deleted, the current relation $R_i'$ is equal to $\pi_{X_i \mathbf{R}_i}(\bowtie_{\mathbf{R}_l \in T_i} R_l)$. Thus, when we are left with the root, the relation that is stored there is $R_1' = \pi_{X \mathbf{R}_1}(\bowtie_i R_i)$, and $\pi_X(R_1') = \pi_X(\bowtie_i R_i)$.

We had to reduce first the database so that intermediate results will not get exponential in the input and the output:

Lemma 4.1 Throughout the computation, the size of the current version $R_i'$ of $R_i$ is bounded by $|R_i| \cdot |\pi_X(\bowtie_j R_j)|$. □

Theorem 4.1 If $\{R_1, \dots, R_k\}$ is a database over an acyclic scheme, then $\pi_X(\bowtie_i R_i)$ can be computed in time polynomial in the input and the output. □

Simple variations of the algorithm can be used to

(1) Test if a universal relation $R$ satisfies an acyclic join dependency $*J$. Apply the algorithm to $D = \{\pi_{X_i}(R) \mid X_i \in J\}$ with $X = U$ while making sure that the various relations along the way don't become different from the corresponding projections of $R$. Another way for

doing this is to take a set $M$ of mvd's equivalent to $*J$ with $|M| \le |J|$ (there is always such a set [B..]) and check if $R$ satisfies $M$.

(2) Include selections. Let $Y$ be a set of attributes and $Sl(A)$ a subset of $Dom(A)$ for each attribute $A$ in $Y$. To compute the projections on $X$ of the tuples in $\bowtie_i R_i$ which have $A$-value in $Sl(A)$ for each $A \in Y$, we first select those tuples from each relation $R_i$ that can contribute to the result. That is, we remove from $R_i$ those tuples $t$ with $t[A] \notin Sl(A)$ for some $A$ in $Y \cap \mathbf{R}_i$. Then we can apply the algorithm to the remaining relations.

The following now is immediate.

Corollary 4.1 (1) Given a database $\{R_1, \dots, R_k\}$ and an $X$-tuple $t$, we can decide in polynomial time if $t \in \pi_X(\bowtie_i R_i)$. (2) We can decide in polynomial time if an acyclic join dependency implies a template dependency. □

Returning to the computation of $\pi_X(\bowtie_i R_i)$, we note that we do not have to carry out the second phase using the whole tree. Since the join of the relations in the database satisfies the jd $*Q$ we can use dependencies that are implied by it.

Lemma 4.2. The join of a subset of the relations, whose schemes form a subtree (connected subgraph) of $T$, is lossless. □

Thus, it suffices to join a set of relations whose schemes form a subtree $T'$ of $T$ and have a union containing $X$. We will say that $T'$ covers $X$. If $X$ is contained in some relation scheme $\mathbf{R}_i$, then we can obtain $\pi_X(\bowtie_j R_j)$ by projecting the corresponding relation $R_i$ onto $X$. If not, then the intersection of all $T'$ covering $X$ is also a subtree covering $X$:

Lemma 4.3. If $X$ is not contained in two or more relation schemes, then there is a unique minimal subtree $T_X$ of $T$ covering $X$.

Proof (sketch). The subtree $T_X$ is defined as follows. Let $\mathbf{R}_i$ be a node of $T$ and $T_1, T_2, \dots, T_l$ the subtrees hanging from it (see Figure 1). If one of the $T_i$'s covers $X$ then $\mathbf{R}_i \notin T_X$. Otherwise we include $\mathbf{R}_i$ into $T_X$.

Figure 1. A node $\mathbf{R}_i$ of $T$ with the subtrees $T_1, T_2, \dots, T_l$ hanging from it.

It can be shown that (1) every subtree that covers $X$ has to contain $T_X$, and (2) $T_X$ covers $X$ and is connected. □

Even though $T_X$ is a minimal subtree covering $X$, it might still contain redundant relations. For example, suppose that we have a film database with relations FD (film-director), FP (film-producer), FA (film-actor) arranged as in the tree of Figure 2.

Figure 2. Tree arrangement of the film database relations FD, FA, FP.

The minimal subtree that relates directors to producers is the whole tree. However, clearly $DP = \pi_{DP}(FD \bowtie FP)$. We could change the arrangement so that FD and FP become adjacent. This is true in general:

Theorem 4.2 Let $S$ be a subset of schemes from $Q$ that join losslessly. There is a tree $T$ representing $Q$ such that the schemes in $S$ form a subtree of $T$.

Proof (sketch). Let $S = \{\mathbf{R}_{i_1}, \dots, \mathbf{R}_{i_m}\}$ and $Y = \bigcup_j \mathbf{R}_{i_j}$. From our discussion in Section 3 (see Property (1) of general join dependencies) the schemes $S$ join losslessly if and only if the schemes in $Q - S$ can be partitioned into sets $K_1, \dots, K_m$, so that (a) if $\mathbf{R}_l \in K_j$ then $\mathbf{R}_l \cap Y \subseteq \mathbf{R}_{i_j}$, and (b) if $\mathbf{R}_l$ and $\mathbf{R}_n$ belong to different $K_j$'s then $\mathbf{R}_l \cap \mathbf{R}_n \cap (U - Y) = \emptyset$. At first we show that $S$ and each of $K_j \cup \{\mathbf{R}_{i_j}\}$ are acyclic schemes. We can construct thus trees $T_0, T_1, \dots, T_m$ representing respectively $S$, $K_1 \cup \{\mathbf{R}_{i_1}\}$, ..., $K_m \cup \{\mathbf{R}_{i_m}\}$. We attach the trees $T_1, \dots, T_m$ to $T_0$ at the nodes corresponding to $\mathbf{R}_{i_1}, \dots, \mathbf{R}_{i_m}$ to get the desired tree. □
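The two-phase computation of Section 4 (full reduction by semijoins, then leaf-to-root pruning of the tree) can be sketched as follows on the film database. This is a minimal illustration under stated assumptions, not the paper's own code: tuples are frozensets of (attribute, value) pairs, the representing tree is given as a `parent` list with node 0 as the root and `parent[i] < i`, and all function names are hypothetical.

```python
def tup(**kv):
    return frozenset(kv.items())

def restrict(t, attrs):
    return frozenset((a, v) for a, v in t if a in attrs)

def project(rel, attrs):
    return {restrict(t, attrs) for t in rel}

def join(r, s):
    out = set()
    for t in r:
        d = dict(t)
        for u in s:
            if all(d.get(a, v) == v for a, v in u):
                out.add(t | u)
    return out

def semijoin(r, s, common):
    """Tuples of r that agree with some tuple of s on the common attributes."""
    keys = project(s, common)
    return {t for t in r if restrict(t, common) in keys}

def project_join(schemes, rels, parent, X):
    """pi_X of the join of rels over an acyclic scheme.
    Assumes parent[0] is None and parent[i] < i, so children have larger indices."""
    n = len(schemes)
    schemes = [set(s) for s in schemes]
    rels = [set(r) for r in rels]
    X = set(X)
    # Phase 1: full reduction, semijoins leaf-to-root and then root-to-leaf.
    for i in range(n - 1, 0, -1):
        j = parent[i]
        rels[j] = semijoin(rels[j], rels[i], schemes[i] & schemes[j])
    for i in range(1, n):
        j = parent[i]
        rels[i] = semijoin(rels[i], rels[j], schemes[i] & schemes[j])
    # Phase 2: prune leaf-to-root, carrying only X_i and the edge label Z_i upward.
    cur = [set(s) for s in schemes]      # attributes of the current relation R_i'
    for i in range(n - 1, 0, -1):
        j = parent[i]
        keep = (cur[i] & X) | (schemes[i] & schemes[j])   # X_i Z_i
        rels[j] = join(rels[j], project(rels[i], keep))
        cur[j] |= keep
    return project(rels[0], X)

FD = {tup(F="f1", D="d1"), tup(F="f2", D="d2")}
FA = {tup(F="f1", A="a1"), tup(F="f2", A="a2"), tup(F="f3", A="a3")}
FP = {tup(F="f1", P="p1"), tup(F="f2", P="p2")}
schemes = [{"F", "D"}, {"F", "A"}, {"F", "P"}]
result = project_join(schemes, [FD, FA, FP], [None, 0, 1], {"D", "P"})
print(result == {tup(D="d1", P="p1"), tup(D="d2", P="p2")})  # True
```

Projecting each child onto `keep` before joining it into its father is what keeps intermediate relations within the bound of Lemma 4.1; the actor attribute never travels up the tree because it is neither in $X$ nor on an edge label.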

Parts (1) and (2) 4.2 All

efficiently.

lomless joins (and their projections)

than the projection

database

is not the projection

we have also a relation

of a universal

by

instance.

this particular

FS (film-sound

base, the way to compute

n’

directors

completeness

the

in computing

the example

interpretation

where direct0r.d ducer p.

NP-

the join

attribute

4.2 with

Theorem

that the schemes of any nonredundant a connected

has worked

Actor

3.2 we can conclude

join computing

the projection

set in some tree representation

of R.

We

can be computed

X.

Tbat

is,

we

, Y,,,} of sets such that nxm,. subset

contains

S of n,

m distinct

for

find

collection

= xx is implied which

Sets &,,

the

. . . . &,

jd

with

by the jd *Q

implies

Y, C E,, (for

Q be a family

Eliminate

a

S’ using the following

of relation

elimination

schemes, initially

in

Q

if

it

S’ = {Y,, . . . . Y,,,} is the final family

is

algorithm.

set equal to n.

in U - X if it belongs to exactly

set

contained

The

is the set of pairs

(p,d)

by) pro-

will miss all associations

with

producers.

their

(or

unknown,

1) Elim-

another

Q of sets after applying

4.3 (1) The j.d. *Q implies

nrmr.

= TX.

dependency

as follows.

out to U every

That

unspecified

is, the etc.)

in

(marked

nulls in [WI,

We denote by D[X] of each universal

Q

set.

D is the projection

steps 1)

in each R, with Let R’(D)

[Ml).

new distinct

= mo R(D).

Sections,

(2) If S is

of a universal

now

Q is an acyclic

that

Theorem

instance

instance.

of X.

of

D[X] is

D[X] or determine are hard even if

(Note,

that if D is the

then D[X] = nx(WR,)).

Suppose

I

scheme.

4.1, with the projection

A tuple

it follows that for

is in it, since both problems

of a universal

nulls

tuples of R’(D).

in the previous

projection

for

instance R(D) by

We form a universal

onto X of all X-total

X-tuple

relation

R, G T&, (R)

with

a general database scheme it is hard to compute if a particular

the

R is called a conmining insronce. D[X]

tuple

From our discussion

one set of Q, 2) in

join

padding

Let

and 2) as far as possible. Theorem

and directors.

a film produced

., R,} be a database.

., k. Such a relation

the projection

inate an attribute

THE

2, and suppose we

R*(D) is X-fornl if it contains no nulls in the attributes

j = 1, . . . . m). We compute

NOT

npo (yRi)

the

satisfying

a

= nr.

for (directed

R

will

nxms

the correct

IS

producers

that are in the X-projection

minimal

every

of Figure

set of X-tuples

i = 1,

and

between

computing

Let D = {RI,

a (in fact all)

*R,

Part (3)

INSTANCE

is inapplicable

for deriving

S’ = {Y,,

compute

DATABASE

scheme

of documentaries

use this fact now to devise an easy method for

j.d.

these cases.

Theorem

join

THE

of such a query

However,

of directors

over Q.

Combining

on X form

“natural”

FD with

thus,

WHEN

want to know the associations

data-

to an acyclic one

schemes;

results in Section 3 are relevant

of the relations

will not necessarily

OF A UNIVERSAL

Consider

if

lossy joins can be hard to com-

relation

ASSOCIATIONS

PROJECTION

there

in the film

database scheme n can be augmented appropriate

And

is by joining

5.

if the

join: for example,

engineer)

of talkies

to the Corollary,

adding

is an arbitrary

result (a proper

of the join of all the relations,

might be good reason for wanting

pute: Every

0

Yj’s if tItc jd is not acyclic.

superset)

By contrast

lQ

hold even if

does not ; i.e. the algorithm

can be

0

Note that a lossless join might give a different

FS.

. . . . Bi, with Yj c hj (for j = 1, . . . . m).

q

Cornlhuy computed

&I,

to Ril, . . . . &,, to get the

Tt, . . . . T,,, to TO at the nodes corresponding

Consider

Remark

of R(D) on the relation

(2)

after

schemes of

any subset of Q that has for each j = 1, .__, m a scheme &, contain-

Q in place of the database D there, the set X in place of Y and with

ing Y, then the j.d. *Q implies xxms

S/(A)

n

with

nxms = xx

implied

by *Q,

= nx.

(3) If S is any subset of

then S contains

distinct

sets

the set of nonnull

(2) we have :


A-values

for each A in X.

From

Remark

Theorem 5.1 If Q is an acyclic databasescheme,,we can teat if

from Lemma 5.1 and Theorem 4.2. 0

an X-tuple is in D[X] and compute D[X] (or a projection of it) in In the next Section we shall see that there are some schemes

time polynomial in the size of the input and the output. o

which have a representing tree such that the union can be dropped

Returning to the representation of Q by a tree T, we noted

from Theorem 5.2, and thus D[X] can be computed by just joining

before that navigating through T (joining the relations on the way)

somerelations from D .

producesvalid associations(the joins are lossless). In other words, if

Even for a general Q however,we can compute D[X] without

the databaseis the projection of a universal relation R (satisfying’the

introducing null values - and clearly faster than the general method

join dependencyQ), then the computed relation is the projection of

described at the beginning of the Section (as shown also in [Sal).

R on the correspondingset Y of attritrJtes - and is therefore indepen-

Let S’ = {Yr, .. ., Y,,,}be the collection of sets in which the rows of

dent of the choice of the tree T that representsQ. If D is not the

the tableau of nxmn have a distinguished or repeated nondis-

projection of a universal instance, then again only valid associations

tinguished symbols. Let Ri = ,,& my,(Ri).

will be produced (i.e. every produced Y-tuple is in D[Y]); however,

I

some valid associationsmight be lost, and the result might be sensiTkwem

tive to the tree T chosen. We shall show that D[X] is the set of

5.3 [Sal Let Q be a general database scheme,

D = {RI, . . ..Rt}adatabaseoverit.

associationsderived by considering o/l trees representingQ.

LetXbeasetofattributesand

let the R;‘s be defined as above. Then D[X] = mx (W Rj’). j

Lemma 5.1 Let Q be a general database scheme, D = {RI, . . . . Rk} a database over it. Let I be a tuple of R*(D)

Pmd

defined on the set of attributes Y. Then there is a subset5 of Q of

. (1) TX (9 Rj) S D[xI. I

relation schemeswith a loasleasjoin with r[&] C Ri for each & C S

We know that nxms. = rxmp.

and Y = U I& C S} 0 T&em

if

X

5.2 Let Q be an acyclic database scheme,

ie

nx (WRY) G nxms.R(D) C mxmnR(D) = qR’(D).

j

contained in

a

relation

in the set of X-total tuples of the right-hand side (= D[X]).

(2) D[xl

If X is not contained in any relation scheme, then

D[X] = U {nx ($r

I

Therefore, the

set of X-total tuples of the left-hand side (= ox (WR,)) is contained i

scheme, then

D[X] = U {nxRi 1.X E; &). (2)

TllUS,

Ri’ E v,W’).

D = {RI, .. .. Rk} a databaseover it, and X a set of attributes. (1)

From the definition of D(R),

Ri) 1T a tree representingQ}, where Tx is the

E TX W;). 1

Let I be an X-total tuple of R’(D) defined on Z 1 X. From Lemma 5.1 there is a subfamily S of Q of relation schemeswith r[&] C Ri

minimal subtreeof T covering X as in Lemma 4.3.

for each &CS,

P&f

Z=U&gicS}

and ms=nrmp.

Thus,

qrnr = nxmp and therefore S contains distinct sets l&r, .. ., I&,,, (1) From the construction of R(D), c nxR(D) C nxR’(D).

U (~xRi 1X G Ri}

With Yj C&j for j = 1, ... . m. Since t&j

For the opposite inclusion, it is easy to see

each j.

that any losslessjoin of a family 5 of schemescovering X, must con-

from Lemma5.1. (2) lie

Let Y = U Yj; “We have r[Y] C WRT, and therefore i ’ i

r[X] C ax (YR;).

tain at Icast one schemecontaining X. The conclusionthen follows

one inclusion folknvs from’lemma 4.2, and the other


C Rij. r[Yj] C RJ, for

0
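Part (1) of Theorem 5.2 reduces D[X], when X fits inside some relation scheme of an acyclic Q, to a union of projections. The following Python fragment is only a sketch of that one case. The function name `associations_within_scheme`, the dict-based row format, and the sample data are illustrative assumptions, not taken from the paper.

```python
def associations_within_scheme(schemes, relations, X):
    """Union of the X-projections of all relations whose scheme contains X.

    schemes   : dict mapping a relation name to its set of attributes
    relations : dict mapping a relation name to a list of dict rows
    X         : set of attributes

    Covers only case (1) of Theorem 5.2: if no scheme contains X the
    result is empty, which is NOT D[X] in general.
    """
    X = frozenset(X)
    result = set()
    for name, attrs in schemes.items():
        if X <= attrs:
            for row in relations[name]:
                # Store each projected tuple as sorted (attribute, value) pairs.
                result.add(tuple(sorted((a, row[a]) for a in X)))
    return result


# Hypothetical sample database over the schemes FD and FP.
schemes = {"FD": {"F", "D"}, "FP": {"F", "P"}}
relations = {
    "FD": [{"F": "f1", "D": "d1"}],
    "FP": [{"F": "f1", "P": "p1"}, {"F": "f2", "P": "p2"}],
}
films = associations_within_scheme(schemes, relations, {"F"})
```

Here `films` collects the F-values of both relations, because F lies in both schemes.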

6. LOOP-FREE BACHMAN DIAGRAMS

A loop-free Bachman diagram (lfBd) [Ba, Li] is a tree T with directions on its edges, whose nodes are distinct sets of attributes, satisfying the following conditions.

(1) Every attribute is a node,

(2) If $X \to Y$, then $X \subset Y$, and

(3) If $X \subseteq Y$ are nodes, then $X \to^* Y$. ($\to^*$ is the transitive closure of $\to$.)

Let Q be the database scheme containing the nodes of the diagram; we will call Q a lfBd scheme. Clearly, the diagram T is a tree representation of Q; thus Q is an acyclic scheme.

From the definition of a loop-free Bachman diagram, it follows easily that

(a) for every set X, the set of nodes that contain X form a rooted tree,

(b) if X and Y are two nodes with a nonempty intersection, then $X \cap Y$ is also a node: their lowest common ancestor.

Loop-free Bachman diagram schemes have the following nice property.

(CJ): The jd $*Q$ implies that every subset of Q with a connected hypergraph has a lossless join.

Theorem 6.1 Every lfBd scheme has the (CJ) property.

Proof. Let Q be the database scheme of a loop-free Bachman diagram T, and let S be a subset of Q with a connected hypergraph. Let $\mathbf{R}_i$, $\mathbf{R}_j$ be two relation schemes in S with a nonempty intersection. From property (b) above there is a path from $\mathbf{R}_i$ to $\mathbf{R}_j$ in T that goes from $\mathbf{R}_i$ up to $\mathbf{R}_i \cap \mathbf{R}_j$ meeting nodes that are subsets of $\mathbf{R}_i$, and then down to $\mathbf{R}_j$ through nodes that are subsets of $\mathbf{R}_j$ (see Figure 3). Let p be this path. We have $\pi_{\mathbf{R}_i} \bowtie \pi_{\mathbf{R}_j} = \bowtie \{\pi_{\mathbf{R}_l} \mid \mathbf{R}_l \in p\}$. Let S' be the set of nodes that lie in such a path p connecting two sets in S with a nonempty intersection. We have $m_S = m_{S'}$. Since S has a connected hypergraph, S' induces a subtree of T. The conclusion now follows from Lemma 4.2. □

[Figure 3]

Corollary 6.1 If D is a database over a lfBd scheme Q, then all joins (and their projections) can be computed efficiently.

Proof. Let S be a subset of Q. The join of the relations in each connected component of (the hypergraph of) S is lossless and thus can be computed efficiently. The join of the relations in S is just the Cartesian product of these joins. □

Note that an arbitrary acyclic scheme may have a connected subset with a lossy join. In fact, lfBd schemes are the only ones with the (CJ) property:

Theorem 6.2 Let Q be a database scheme with the (CJ) property. Then Q is contained in a lfBd scheme Q' with the same maximal sets (and thus with an equivalent jd).

The proof of Theorem 6.2 is based on the following two Lemmas whose proof we omit.

Lemma 6.1 If Q satisfies (CJ), then the closure under intersection of Q has also the (CJ) property. □

Let Q' now be the union of the closure under intersection of Q and the set of all singletons. Let us define a directed graph G on the elements of Q' by having an arc $X \to Y$ if $X, Y \in Q'$, $X \subset Y$, and there is no Z in Q' with $X \subset Z \subset Y$ (i.e. $\to$ is the transitive reduction of $\subset$).

Lemma 6.2 If Q' has the (CJ) property then G contains no (undirected) cycles. □

The closure under intersection of an arbitrary family of sets can have size exponential in the size of the original family. (Just consider the family of subsets of cardinality $n - 1$ of a set of size n.)
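The parenthetical example about subsets of cardinality n - 1 can be checked numerically. The sketch below is an illustration, not part of the paper; the helper names `intersection_closure` and `co_singletons` are assumptions made here. It closes the family of all (n - 1)-element subsets of an n-element set under intersection and records how many sets result.

```python
def intersection_closure(family):
    """All sets obtainable by intersecting one or more members of family."""
    closed = {frozenset(s) for s in family}
    frontier = set(closed)
    while frontier:
        new = set()
        for a in frontier:
            for b in closed:
                c = a & b
                if c not in closed:
                    new.add(c)
        closed |= new
        frontier = new          # only newly found sets need another pass
    return closed


def co_singletons(n):
    """The n subsets of {0, ..., n-1} that have cardinality n - 1."""
    return [frozenset(range(n)) - {i} for i in range(n)]


# Size of the closure for a few small n; the input family has only n sets.
sizes = {n: len(intersection_closure(co_singletons(n))) for n in range(2, 8)}
```

Every proper subset of the ground set is the intersection of the co-singletons that omit its missing elements, so the closure of this n-member family contains $2^n - 1$ sets.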


This is not true for families with the (CJ) property:

Lemma 6.3 If Q is a database scheme with the (CJ) property and Q' is as above, then $|Q'| \le |Q| + 2|U|$. □

Thus, we can represent a database scheme with the (CJ) property by a loop-free Bachman diagram without having to introduce too many new relation schemes. These relations can facilitate the computation of the associations D[X] by having essentially ready the relations $R_j'$ of Theorem 5.3.

A database D over Q is legal if it satisfies the following existence constraints. For all nodes X, Y with $X \to Y$ the corresponding relations satisfy $R_X \supseteq \pi_X(R_Y)$.

Theorem 6.3 Let D be a legal database over the database scheme Q of a loop-free Bachman diagram T, and X a set of attributes.

(1) If X is contained in at least two relation schemes, then $D[X] = \pi_X R_Y$, where Y is the unique minimal set of Q containing X.

(2) If X is not contained in two or more relation schemes, then $D[X] = \pi_X(\bowtie_{T_X} R_i)$, where $T_X$ is as in Lemma 4.3. □

We leave it as an open problem to determine those acyclic schemes Q that have a representing tree T which can produce all D[X] (by joining connected subsets), even with the addition of existence constraints. (It is easy to see that some kind of such constraints is necessary for this to hold.) Such a tree would keep, in some sense, attributes as closely connected as possible. For example, the scheme of Figure 2 has the Bachman diagram of Figure 4 (nodes A, P, D are unnecessary). The scheme Q = {FPM, FPA, FAR, FP, FA, F} has such a tree with the appropriate existence constraints (but has no loop-free Bachman diagram); $Q \cup \{FD\}$ has no such tree.

[Figure 4: Bachman diagram for the scheme of Figure 2]

7. AN ACYCLIC JOIN DEPENDENCY WITH A SET OF FUNCTIONAL DEPENDENCIES

As in Section 5, a database $D = \{R_1, \ldots, R_k\}$ represents the information common to all containing universal instances (i.e. instances R with $\pi_{\mathbf{R}_i}(R) \supseteq R_i$) which satisfy the dependencies, if there is such a containing universal instance [H]. We can check the dependencies and find the information represented by D, by forming a universal relation R(D) as in Section 5 and then chasing the dependencies on R(D). In this Section we will see how to do this efficiently in the presence of an acyclic join dependency Q and a set of functional dependencies.

7.1. Testing a functional dependency

Let $D = \{R_1, \ldots, R_k\}$ be a database over the acyclic scheme Q and $f: X \to A$ a functional dependency. We shall show how to check if $\bowtie_i R_i$ satisfies f, and list all violations if it does not, i.e. list all pairs of distinct A-values associated with the same X-value, in time polynomial in the size of D. Note that we cannot simply compute $\pi_{XA}(\bowtie_i R_i)$ and then check f, since its size might not be polynomially bounded in that of D.

Let T be a tree representing Q. We root the tree at a node containing attribute A, say $\mathbf{R}_1$. We prune the tree leaf to root while associating a graph $G_i$ with every relation $R_i$. The nodes of $G_i$ are the tuples of the current version of $R_i$. Initially, there is an edge between two tuples of $R_i$ if they agree on the attributes of $X \cap \mathbf{R}_i$. Let $T_i$ be the subtree of T rooted at node $\mathbf{R}_i$, $Z_i$ the set of attributes labelling the edge from $\mathbf{R}_i$ to its father ($Z_i \subseteq \mathbf{R}_i$), and $X_i$ the set of attributes of X that are covered by $T_i$. The deletion of node $\mathbf{R}_i$ is carried out as follows. Let $R_i'$ be the current version of $R_i$ (a relation on the set of attributes $\mathbf{R}_i$) and $R_j'$ the current version for its father $\mathbf{R}_j$.


At first we project $R_i'$ onto $Z_i$, and merge nodes of $G_i$ that correspond to tuples with the same $Z_i$-projection. Two merged nodes u, v of $G_i$ are adjacent if there were two nodes u', v' respectively merged into them that were adjacent. Then we replace $R_j'$ by $R_j' \bowtie \pi_{Z_i}(R_i')$, delete $R_i'$ and update $G_j$ as follows. Two nodes u, v of the new $G_j$ are adjacent if they were adjacent in the old $G_j$ and their $Z_i$-projections were adjacent in $G_i$. Let $R_1'$ be the version of the root relation when all other nodes have been deleted and $G_1$ the corresponding graph. Let $R_A$ be the projection of $R_1'$ onto A, and $G_A$ the graph (on $R_A$) obtained from $G_1$ by merging all nodes with the same A-projection. We claim that the edges of $G_A$ give all violations of f in $\bowtie_i R_i$, i.e. all pairs of A-values associated with the same X-value. Thus, $\bowtie_i R_i$ satisfies f iff $G_A$ has no edges.

At first note that, apart from the graphs, the algorithm computes $\pi_A(\bowtie_i R_i)$ as in the beginning of Section 4. (The reduction process is not needed here since the output is small.) Therefore, when node $\mathbf{R}_i$ is deleted the current relation $R_i'$ is equal to $\pi_{\mathbf{R}_i}(\bowtie_{T_i} R_l)$. We can show also by an induction that in the corresponding version of the graph $G_i$, two $\mathbf{R}_i$-tuples u and v are adjacent iff there are two tuples u', v' of $\bowtie_{T_i} R_l$ which agree with each other on $X_i$, and with u and v respectively on $\mathbf{R}_i$. Therefore, at the end two A-values $a_1$, $a_2$ are adjacent in $G_A$ iff there are two tuples $t_1$, $t_2$ in $\bowtie_i R_i$ which agree with each other on X, and have $t_1[A] = a_1$, $t_2[A] = a_2$.

Theorem 7.1 Let D be a database over an acyclic scheme Q and $f: X \to A$ a functional dependency. We can check in polynomial time if $\bowtie_i R_i$ satisfies f, and list all violations (pairs of A-values associated with the same X-value) if it does not. □

We could easily modify the algorithm to give for each pair of A-values in the list of violations an X-value with which they are associated, or all such X-values, if so desired. In the last case the algorithm would run in time polynomial in the input and the output. Also, if the database is already reduced (the projection of a universal instance), then we can use the improvements of Section 4 where XA now plays the role of the set X there (i.e. apply the algorithm to the minimal subtree of T covering XA, or the tree of the set S' there for U).

As for testing if D satisfies the functional dependency $f: X \to A$ and the join dependency Q (or equivalently, whether D[XA] satisfies f), we can either (i) form the universal relation R(D), project back down on the schemes and then check if f is satisfied in the join of the new relations, or still better (ii) find the set S' for XA as in Theorem 4.3, compute the corresponding $R_j'$'s as for Theorem 5.3 and then check if f holds in the join of the $R_j'$'s. However, if we have a set F of functional dependencies, it is not sufficient to check each dependency individually; i.e. D may satisfy Q and f for each fd f of F but still violate the dependencies as a whole.

A similar algorithm can be used to list for two sets X, Y, all pairs of Y-tuples that are associated with a common X-tuple in $\bowtie_i R_i$. In this case we must first reduce the database (so that intermediate results will be bounded by the size of the input and the output) and then we can root the tree at any node. Besides testing a functional dependency the algorithm can be used also to compute some queries involving equijoins; e.g. "Have these two actors worked with the same director?", "Which films are made by the same director and producer?" Equijoins can be transformed to natural joins by renaming attributes. However, the transformed query may involve cyclic joins. For example, the second query above corresponds to $\pi_{FF'}(FD \bowtie FP \bowtie F'D \bowtie F'P)$ where F' is a renaming of F.

7.2. Computing the chase

Let $\mathbf{P} = \{\mathbf{R}_1, \ldots, \mathbf{R}_k\}$ be an acyclic join dependency (database scheme) and F a set of functional dependencies of the form $X \to A$. We shall show how to compute efficiently the chase of a relation R under $\mathbf{P}$ and F. [MSY] shows that any tuple in the chase of a tableau under a single join dependency and a set of functional dependencies can be generated nondeterministically in polynomial time.
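The test of Section 7.1 is the subroutine that the chase computation applies repeatedly, so a compact sketch of it may be useful at this point. The Python below is an illustration under stated assumptions, not the paper's code: relations are lists of dict rows, the join tree is supplied as a `parent` map together with a child-before-father `order`, graphs are stored as sets of ordered pairs with loops included, and all names (`fd_violations`, `restrict`) are hypothetical.

```python
def restrict(row, attrs):
    """Project a row (tuple of (attribute, value) pairs) onto attrs."""
    return tuple(p for p in row if p[0] in attrs)


def fd_violations(schemes, relations, parent, order, X, A):
    """Pairs of distinct A-values that share an X-value in the full join.

    schemes[i]   : set of attribute names of relation i
    relations[i] : list of dict rows over schemes[i]
    parent[i]    : father of node i in a join tree (None for the root)
    order        : node ids, each child before its father, root last;
                   the root scheme is assumed to contain A
    """
    X = set(X)
    # Current version of each relation; rows become sorted pair-tuples.
    rows = {i: {tuple(sorted(r.items())) for r in relations[i]}
            for i in schemes}
    # G[i] holds ordered adjacent pairs. Initially two rows are adjacent
    # when they agree on the attributes of X inside scheme i.
    G = {}
    for i in schemes:
        xi = X & schemes[i]
        G[i] = {(u, v) for u in rows[i] for v in rows[i]
                if restrict(u, xi) == restrict(v, xi)}
    for i in order[:-1]:                  # delete nodes leaf to root
        j = parent[i]
        Z = schemes[i] & schemes[j]       # label of the edge to the father
        # Merge nodes of G[i] that have the same Z-projection.
        merged = {(restrict(u, Z), restrict(v, Z)) for (u, v) in G[i]}
        alive = {restrict(u, Z) for u in rows[i]}
        # Replace the father's relation by its join with the Z-projection.
        rows[j] = {u for u in rows[j] if restrict(u, Z) in alive}
        # Keep an edge of G[j] only if the Z-projections were adjacent.
        G[j] = {(u, v) for (u, v) in G[j]
                if u in rows[j] and v in rows[j]
                and (restrict(u, Z), restrict(v, Z)) in merged}
    found = set()
    for u, v in G[order[-1]]:             # merge root rows by A-value
        a1, a2 = dict(u)[A], dict(v)[A]
        if a1 != a2:
            found.add(frozenset((a1, a2)))
    return found


# Hypothetical two-relation database; the FD tested is C -> A.
schemes = {1: {"A", "B"}, 2: {"B", "C"}}
relations = {
    1: [{"A": "a1", "B": "b1"}, {"A": "a2", "B": "b2"}],
    2: [{"B": "b1", "C": "c1"}, {"B": "b2", "C": "c1"}],
}
parent = {1: None, 2: 1}
violations = fd_violations(schemes, relations, parent, [2, 1], {"C"}, "A")
```

In this sample the join contains the tuples (a1, b1, c1) and (a2, b2, c1), so the pair {a1, a2} is reported without the join ever being materialized. The pair sets grow quadratically in the relation sizes, which stays within the polynomial bound the section asks for but is not tuned for speed.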

Our algorithm amounts essentially to turning all nondeterministic steps of the [MSY] procedure into deterministic ones. From R we shall construct (in polynomial time) a relation R' by identifying all symbols that the chase will, and such that $\mathrm{chase}(R) = m_{\mathbf{P}} R'$. Thus, by "computing the chase efficiently" we mean that we can check if a certain (possibly partially specified) tuple is in it, generate a projection of it or the total tuples in a projection (if there are nulls) in time polynomial in the size of the input and the output.

Computing the chase of R

    While there is a change do
        For each f: X -> A in F do
        Begin
            Let D be the projection of R onto P;
            Apply the algorithm of Section 7.1 for f and D;
            Identify A-values in the same connected component of G_A
                and delete duplicate tuples from R;
        End;

Clearly, the while-loop cannot be executed more than $|R|$ times. Since the algorithm of Section 7.1 takes polynomial amount of time in the size of D (which is bounded by $k|R|$), the algorithm stops with a final relation R' in time polynomial in k, $|R|$ and $|F|$. We claim that $\mathrm{chase}(R) = m_{\mathbf{P}} R'$. Clearly, any two values identified by the algorithm are correctly identified. Since $m_{\mathbf{P}} R'$ then satisfies the j.d. $\mathbf{P}$ and all functional dependencies in F, it is the chase of R.

Theorem 7.2 We can compute in polynomial time from a relation R another relation R' such that the chase of R under an acyclic join dependency $\mathbf{P}$ and a set of functional dependencies is equal to $m_{\mathbf{P}} R'$. □

Corollary 7.1 Given a set P consisting of functional dependencies and a single acyclic join dependency, we can decide in polynomial time (1) if a given database satisfies P, (2) if a given template dependency is implied by P. □

For a set P of functional dependencies and a single general join dependency $*Q$, we know that it is NP-complete to determine if P implies another join dependency, but functional and multivalued dependencies can be efficiently inferred [MSY, V]. Also, it is easy to modify the proof of Theorem 3.1 to show that it is NP-complete to determine if a given database D satisfies P, even if D is the projection of a universal instance. [H] shows how to test if a database satisfies a set of functional dependencies (no jds).

REFERENCES

[ABU] A. V. Aho, C. Beeri, J. D. Ullman, "The theory of joins in relational databases", ACM Trans. on Database Systems, 4(3), 297-314, (1979).

[ASU1] A. V. Aho, Y. Sagiv, J. D. Ullman, "Equivalence among relational expressions", SIAM J. Computing, 8(2), 218-246, (1979).

[ASU2] A. V. Aho, Y. Sagiv, J. D. Ullman, "Efficient optimization of a class of relational expressions", ACM Trans. on Database Systems, 4(4)

[Ba] C. W. Bachman, "Data structure diagrams", Data Base, 1(2), 4-10, (1969).

[B] C. Beeri, "On the membership problem for multivalued dependencies in relational databases", ACM Trans. on Database Systems.

[BB] C. Beeri, P. A. Bernstein, "Computational problems related to the design of normal form relational schemes", ACM Trans. on Database Systems, 4(1), 30-59, (1979).

[B..] C. Beeri, R. Fagin, D. Maier, A. Mendelzon, J. D. Ullman, M. Yannakakis, "Properties of acyclic database schemes", ACM Symp. Theory of Computing, (1981).

[BV] C. Beeri, M. Y. Vardi, "On the properties of total join dependencies", Proc. Workshop on Formal Bases for Databases, Toulouse, (1979).

[Be] P. A. Bernstein, "Synthesizing third normal form relations from functional dependencies", ACM Trans. on Database Systems, 1(4), 277-298, (1976).

[BG] P. A. Bernstein and N. Goodman, "The theory of semi-joins", TR CCA-79-27, Computer Corp. of America, (1979).

[Bg] C. Berge, Graphs and Hypergraphs, North-Holland, 1973.

[C] E. F. Codd, "A relational model for large shared data banks", Comm. ACM, 13(6), 377-387, (1970).

[F] R. Fagin, "Multivalued dependencies and a new normal form for relational databases", ACM Trans. on Database Systems, 2(3), 262-278, (1977).


[FMU] R. Fagin, A. Mendelzon, J. D. Ullman, "A simplified universal relation assumption and its properties", RJ2900, IBM, San Jose, Cal., (1980).

[H] P. Honeyman, "Testing functional dependency satisfaction", to appear in Journal of ACM.

[HLY] P. Honeyman, R. Ladner, M. Yannakakis, "Testing the universal instance assumption", Inf. Proc. Letters, 10(1), 14-19, (1980).

[Li] Y. E. Lien, "On the equivalence of database models", Bell Labs memorandum, (1979).

[M] D. Maier, "Discarding the universal instance assumption: preliminary results", Proc. XP1 Conference, Stonybrook, NY, (1980).

[MMS] D. Maier, A. Mendelzon, Y. Sagiv, "Testing implications of data dependencies", ACM Trans. on Database Systems, 4(4), 455-469, (1979).

[MSY] D. Maier, Y. Sagiv, M. Yannakakis, "Testing implications of functional and join dependencies", Journal of ACM, to appear.

[R] J. Rissanen, "Theory of relations for databases - a tutorial survey", Proc. 7th Symp. on Mathematical Foundations of Computer Science, Lecture Notes in Computer Science 64, Springer-Verlag, 537-551, (1978).

[SU] F. Sadri, J. D. Ullman, "A complete axiomatization for a large class of dependencies in relational databases", Proc. 12th Ann. ACM Symp. on Theory of Computing, 117-122, (1980).

[Sa] Y. Sagiv, "Can we use the universal instance assumption without using null values?", 7th ACM-SIGMOD Int'l Conf. on Management of Data, 108-120, (1981).

[SY] Y. Sagiv and M. Yannakakis, "Equivalences among relational expressions with the union and difference operators", Journal of ACM, 27(4), 633-655, (1980).

[S] E. Sciore, "Real-world MVDs", 7th ACM-SIGMOD Int'l Conf. on Management of Data, 121-132, (1981).

[U] J. D. Ullman, Principles of Database Systems, Computer Science Press, 1979.

[V] M. Y. Vardi, "Inferring multivalued dependencies from functional and join dependencies", Dept. of Applied Math, Weizmann Inst. of Science, Rehovot, Israel, (1980).

[W] A. Walker, "Time and space in a lattice of universal relations with blank entries", Proc. XP1 Conference, Stonybrook, NY, (1980).

[Z] C. Zaniolo, "Analysis and design of relational schemata for database systems", TR UCLA-ENG-7769, Dept. of Comp. Sci., UCLA, (1976).

