Proc. 1979 ACM-SIGMOD (ed. P. A. Bernstein), 153-160.
Normal forms and relational database operators by Ronald Fagin IBM Research Laboratory San Jose, California 95193
ABSTRACT: We discuss the relationship between normal forms in a relational database and an allowed set of relational operators. We define "projection-join normal form" (PJ/NF), which is the ultimate normal form when only projection and join are allowed. Aho, Beeri, and Ullman made the counterintuitive discovery that there is a relation schema with a valid decomposition into three of its projections without the decomposition being equivalent to a cascade of decompositions, each into two projections. Because of this possibility, there exist bizarre relation schemata that are in fourth normal form but not in PJ/NF. We also discuss issues associated with allowing the union operator.
We note that PJ/NF could logically be called "fifth normal form", since it is stronger than fourth normal form. However, we instead choose to call it projection-join normal form for several reasons. First, we wish to emphasize its finality with respect to the projection and join operators. Second, we feel that from now on, it will be desirable to explicitly point out the relationship between normal forms and the allowed relational operators. For example, as we show in Section 4, by allowing a new relational operator (the union operator), one might normalize further. The first 3 sections of this paper provide a clean completion to the normalization question, with respect to the usual operators (projection and join). On the other hand, the final section merely "opens the door a crack" with respect to new operators; a great deal of further research remains to be done in this latter area.
1. Introduction
If S[UV] and T[UW] are relations (where the attributes, or column names, in common are the attributes in U), then the join S*T of S and T is the relation consisting of the tuples (u,v,w), where (u,v) is in S and (u,w) is in T. The concept of multivalued dependency (see Fagin [Fa1]) is intimately related to that of join. Specifically, if U and V are subsets of attributes of a relation R, and if W is the set of attributes of R not in U or V, then the multivalued dependency U →→ V holds in R if and only if R is the join of its projections R[UV] and R[UW]. (This is Theorem 1 of [Fa1].)
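The characterization above can be tried out directly. The following is a minimal Python sketch, not part of the original paper: a tuple is modelled as a frozenset of (attribute, value) pairs, attribute names are assumed to be single characters, and the helper names `row`, `project`, `join`, and `mvd_holds` are hypothetical.

```python
def row(**kv):
    """Build one tuple as a frozenset of (attribute, value) pairs."""
    return frozenset(kv.items())

def project(rel, attrs):
    """Projection R[attrs]; duplicate tuples collapse because rel is a set."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in rel}

def join(s, t):
    """Natural join S*T: combine tuples that agree on all common attributes."""
    out = set()
    for x in s:
        for y in t:
            dx, dy = dict(x), dict(y)
            if all(dx[a] == dy[a] for a in dx.keys() & dy.keys()):
                out.add(x | y)
    return out

def mvd_holds(rel, u, v, all_attrs):
    """U ->> V holds in R iff R equals the join of R[UV] and R[UW]."""
    u, v = set(u), set(v)
    w = set(all_attrs) - u - v
    return rel == join(project(rel, u | v), project(rel, u | w))

# Illustrative data (an assumption): two tuples over attributes A, B, C.
r = {row(A=1, B="x", C="p"), row(A=1, B="y", C="q")}
violates = not mvd_holds(r, "A", "B", "ABC")
# Adding the two "mixed" tuples makes R the join of R[AB] and R[AC].
r_closed = r | {row(A=1, B="x", C="q"), row(A=1, B="y", C="p")}
satisfies = mvd_holds(r_closed, "A", "B", "ABC")
```

In this sketch the first relation violates A →→ B because the join of its projections contains two tuples that the relation lacks; once they are added, the dependency holds.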
The process of normalization consists of converting a relation schema (or set of relation schemata) into another form that stores the same data but in a different format (see Beeri, Bernstein, and Goodman [BBG] for a detailed discussion). The question of which operators are allowed in the conversion process is, of course, critical. Surely, the two most important relational operators are projection and (natural) join (see Codd [Co2]). In this paper, we introduce "projection-join normal form" (PJ/NF), in which the only legal operators are projection and join. PJ/NF is almost by definition the ultimate normal form when only projection and join are allowed. PJ/NF is stronger than fourth normal form [Fa1] because of a surprising phenomenon that was first discovered by Aho, Beeri, and Ullman [ABU], and that was also investigated by Nicolas [Ni]. We also discuss in this paper the issue of allowing the union operator.
NOTE: When we write UV, where U and V are sets of attributes, we mean the set U ∪ V. When we write AB, where A and B are attributes, we mean the set {A,B}; similarly for ABCD and so on.
It is convenient for us to use a generalized definition of join, due to Aho, Beeri, and Ullman [ABU], that gives the join of an arbitrary collection of relations. If R1,...,Rm are relations, then the join of the set {R1,...,Rm} is the set of all tuples (a1,...,an), where A1,...,An are the attributes appearing in at least one of R1,...,Rm, and where for each i, the projection of the tuple (a1,...,an) onto the attributes of Ri is a tuple in the relation Ri. It is straightforward to verify that the old and new definitions of join agree when there are only two relations to be joined. Furthermore, this generalized join can be "built up" out of binary joins. For example, the join *{R1,R2,R3,R4} is equal to (((R1*R2)*R3)*R4), that is, the result of joining R1 and R2, joining the result with R3, and then joining this result with R4. By the associativity and commutativity of the binary join, we can take the binary joins in any order we wish, and parenthesize however we wish, and still get the same answer. For example, *{R1,R2,R3,R4} also equals (R2*R3)*(R4*R1).
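As an illustration of the generalized join, the sketch below (an assumption-laden Python rendering, not the paper's own formulation) computes the join of a collection of relations directly from the definition, by testing every candidate tuple over the values that actually occur, and compares it with two different ways of cascading binary joins. The names `row`, `attrs_of`, `general_join`, and `binary_join`, and the four small relations, are hypothetical.

```python
from functools import reduce
from itertools import product

def row(**kv):
    """Build one tuple as a frozenset of (attribute, value) pairs."""
    return frozenset(kv.items())

def attrs_of(rel):
    """Set of attributes appearing in a relation."""
    return set().union(*(dict(t).keys() for t in rel))

def general_join(rels):
    """Join of a collection of relations, straight from the definition:
    keep each candidate tuple whose projection onto every Ri lies in Ri."""
    attrs = sorted(set().union(*(attrs_of(r) for r in rels)))
    domain = {a: {dict(t)[a] for r in rels for t in r if a in dict(t)}
              for a in attrs}
    out = set()
    for values in product(*(domain[a] for a in attrs)):
        cand = dict(zip(attrs, values))
        if all(frozenset((a, cand[a]) for a in attrs_of(r)) in r for r in rels):
            out.add(frozenset(cand.items()))
    return out

def binary_join(s, t):
    """Ordinary two-relation natural join."""
    out = set()
    for x in s:
        for y in t:
            dx, dy = dict(x), dict(y)
            if all(dx[a] == dy[a] for a in dx.keys() & dy.keys()):
                out.add(x | y)
    return out

# Illustrative relations over AB, BC, CD, and DA (made-up data).
r1 = {row(A=1, B=1), row(A=2, B=1)}
r2 = {row(B=1, C=1), row(B=1, C=2)}
r3 = {row(C=1, D=1)}
r4 = {row(D=1, A=1)}

direct = general_join([r1, r2, r3, r4])
left_to_right = reduce(binary_join, [r1, r2, r3, r4])   # ((R1*R2)*R3)*R4
regrouped = binary_join(binary_join(r2, r3), binary_join(r4, r1))
```

The brute-force enumeration is exponential in the number of attributes and is meant only to mirror the definition; any practical implementation would cascade binary joins as in the last two lines.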
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
Let X1,...,Xm each be subsets (not necessarily disjoint) of the attributes of relation R, where each attribute of R is contained in at least one of X1,...,Xm. Following Rissanen [Ri2], we say that R obeys the join dependency *{X1,...,Xm} if R is the join of its projections R[X1],...,R[Xm].
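A join dependency can be tested on a concrete relation by projecting and rejoining. The following Python sketch is offered only as an illustration under stated assumptions: tuples are frozensets of (attribute, value) pairs, attribute names are single characters, and `row`, `project`, `join`, `jd_holds`, and the sample data are hypothetical.

```python
from functools import reduce

def row(**kv):
    """Build one tuple as a frozenset of (attribute, value) pairs."""
    return frozenset(kv.items())

def project(rel, attrs):
    """Projection R[attrs]."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in rel}

def join(s, t):
    """Binary natural join; cascading it gives the generalized join."""
    out = set()
    for x in s:
        for y in t:
            dx, dy = dict(x), dict(y)
            if all(dx[a] == dy[a] for a in dx.keys() & dy.keys()):
                out.add(x | y)
    return out

def jd_holds(rel, components):
    """R obeys *{X1,...,Xm} iff R is the join of R[X1],...,R[Xm].
    Assumes every attribute of R occurs in at least one component."""
    return rel == reduce(join, (project(rel, set(x)) for x in components))

# Made-up relation over attributes A, B, C.
r = {row(A=1, B=1, C=2), row(A=1, B=2, C=1), row(A=2, B=1, C=1)}
before = jd_holds(r, ["AB", "AC", "BC"])
# The join of the three projections also contains (1,1,1); adding it
# makes the relation equal to that join.
after = jd_holds(r | {row(A=1, B=1, C=1)}, ["AB", "AC", "BC"])
```

Here `before` is false and `after` is true: the three-tuple relation is strictly smaller than the join of its projections on AB, AC, and BC.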
As noted in [ABU], the join of the projections R[X1],...,R[Xm] of R equals

   {t : there are tuples w1,...,wm of R such that wi[Xi] = t[Xi] for each i (1≤i≤m)}.

Thus, the join dependency *{X1,...,Xm} holds for the relation R if and only if R contains each tuple t for which there are tuples w1,...,wm of R (not necessarily distinct) such that wi[Xi] = t[Xi] for each i (1≤i≤m).
If R and U, V, W are as in the definition above of multivalued dependency, then the multivalued dependency U →→ V holds in R if and only if the join dependency *{UV,UW} holds in R. Thus, it is possible to represent each multivalued dependency by a join dependency. We note that Nicolas [Ni] defined the mutual dependency, which is the special case of the join dependency where there are exactly three relations to be joined. Also, Dayal and Bernstein [DB] explore properties of join dependencies, which they call interdependencies.
Aho, Beeri, and Ullman [ABU] present a surprising example to show that a relation can be the join of three of its projections, without this join being the result of cascading 2-way projections. Thus, let R, with attributes ABCDEF, obey the functional dependencies A → B and F → E. They show that the join dependency *{ABDE,ACDF,BCEF} necessarily holds. That is, if we abbreviate the projections R[ABDE], R[ACDF], and R[BCEF] by R1, R2, and R3 in any order, then the relation R equals the join of its three projections R1, R2, and R3. However, it is not hard to show that this decomposition is not necessarily the result of cascading 2-way decompositions. That is, there are not necessarily projections S1 and S2 of R such that
(1) R is the join of S1 and S2.
(2) S2 is the join of two of its projections S21 and S22.
(3) R1, R2, and R3 equal S1, S21, and S22 respectively.
Thus, this 3-way decomposition is not built out of 2-way decompositions. Let us look a little more carefully at what goes wrong in this example. Since R = *{R1,R2,R3}, we know that R is the join of R1 with R2*R3. In this case, R2*R3 has all of the attributes of R, and contains all of the tuples of R, and possibly more. The effect of joining R1 with R2*R3 is only to remove enough tuples from R2*R3 to make it exactly equal to R. As a concrete example of the join removing tuples, assume that the relation R contains, possibly among others, the tuples (a,b,c,d,e,f) and (a',b',c,d',e,f), where a≠a', b≠b', and d≠d', and that the relation R obeys the functional dependencies A → B and F → E. In particular, the tuple (a,b',c,d,e,f) is not in R, or else the functional dependency A → B would be violated. Let R1, R2, and R3 be the projections R[ABDE], R[ACDF], and R[BCEF] respectively. As noted earlier, the relation R equals the join of R1 with R2*R3. It is easy to verify that R2*R3 contains, among others, the tuple (a,b',c,d,e,f), which, as we saw, is not in R. When we join the relation R2*R3 with R1, this tuple disappears.
Fourth normal form is defined in terms of functional and multivalued dependencies alone. Since multivalued dependencies correspond to 2-way decompositions, and since, as we saw, there are decompositions that are not the result of cascading 2-way decompositions alone, an "ultimate" PJ/NF must also consider the more general join dependencies.
2. Projection-join normal form
In this section, the only relational operators that we consider are projection and join. Thus, in this section, the ground rules for normalization are:
(1) Each relation that we consider is already in Codd's first normal form (1NF); that is, each entry is atomic.
(2) When a relation is replaced by a set of relations, each of the new relations is a projection of the original relation.
(3) The original relation must be the join of the set of new relations.
Second normal form, third normal form, Boyce-Codd normal form (BCNF), and fourth normal form (4NF) all fulfill these ground rules, as will the new projection-join normal form (PJ/NF).
From now on, we will abbreviate functional dependency, multivalued dependency, and join dependency by FD, MVD, and JD respectively. Let X be a set of attributes, let σ be a dependency (FD, MVD, or JD), and let Σ be either a dependency or a set of dependencies. When we say that Σ logically implies σ (in the context of X), or that σ is a logical consequence of Σ (in the context of X), we mean that whenever Σ holds for a relation with attributes X, then so does σ. That is, there is no "counterexample relation" R with attributes X such that every dependency in Σ holds in R, but such that σ does not hold in R. We write Σ ⊢_X σ, or, if the context X is understood, simply Σ ⊢ σ. As a simple example,

   {A → B, B → C} ⊢ A → C.

We remark that the context X is not important for FDs alone. That is, it follows from the completeness of Armstrong's axioms [Ar, Fa2] that if Σ is an FD or a set of FDs, and if σ is a single FD, then Σ ⊢_X σ if and only if Σ ⊢_Y σ, as long as X and Y each contain all of the attributes appearing in Σ and/or σ. However, as noted in [Fa1], the context is important when MVDs (or JDs) are involved. For example, if X = ABC, then A →→ B ⊢ A →→ C holds, whereas it does not hold if X = ABCD. When Ω is a set of dependencies, then by Σ ⊢ Ω, we mean that Σ ⊢ σ for each σ in Ω. Later on, we will make use of the easily verified fact that if ρ, σ, and τ are dependencies (or sets of dependencies), if ρ ⊢ σ and if σ ⊢ τ, then ρ ⊢ τ (transitivity).
We say that a dependency is trivial (in the context of the set X of attributes) if it holds in every relation that has the set X as its attributes. That is, a dependency σ is trivial (in the context of X) if ∅ ⊢_X σ, where ∅ is the empty set. It is easy to verify that the only trivial FDs U → V are those where V is a subset of U. Further, the only trivial MVDs U →→ V are those where either V is a subset of U, or where the union of U and V equals X. Finally, it follows easily by using the techniques in [ABU] that the only trivial JDs *{X1,...,Xm} are those where one of the Xi's equals X.
Following Cadiou [Ca], we define a relation schema to be a set of attributes, along with a set of dependencies (in our case, FDs, MVDs, and JDs). Cadiou actually includes more, that does not concern us here, in his definition of relation schema. So that we can always speak of dependencies in the schema, rather than dependencies that are logical consequences of those in the schema, it is convenient to assume that the set of
dependencies in the schema is closed under logical consequence. That is, if Σ is the set of dependencies in the schema, and if σ is a dependency such that Σ ⊢ σ, then σ is also a dependency in the schema.
Let R* be a relation schema with attributes X. If K is a subset of X, then we say that K is a key (of the relation schema) if the FD K → X is in the schema, and if there is no proper subset L of K such that the FD L → X is also in the schema. We call such a functional dependency K → X a key dependency of R*. We note that each theorem in this paper that mentions a key dependency would also be true if we were to modify our definition of key dependency to allow an arbitrary functional dependency L → X, where the left-hand side L is either a key or a superset of a key (a "superkey" [Be]).
To emphasize the analogy between BCNF, 4NF, and PJ/NF, it will be helpful to define each of them in several distinct (but equivalent) ways.
Definition 1(a) [Co~]. A 1NF relation schema R* with attributes X is in BCNF if, for each nontrivial FD U → V in R*, the FD U → X is in R*.
Definition 1(b). A 1NF relation schema R* with attributes X is in BCNF if Δ ⊢ σ for each FD σ in R*, where Δ is the set of key dependencies of R*. Thus, every FD is the result of keys.
Definition 1(c). A 1NF relation schema R* with attributes X is in BCNF if, for each FD σ in R*, there is a key dependency K → X of R* such that K → X ⊢ σ. Thus, every FD is the result of a key.
Theorem 1. Definitions 1(a), 1(b), and 1(c) are equivalent.
Proof: Assume first that a relation schema R* is in BCNF according to Definition 1(a). We will show that it is in BCNF according to Definition 1(c). Let U → V be an FD that holds in R*. By Definition 1(a), we know that either (i) U → V is a trivial FD, or else (ii) the FD U → X holds. In case (i), let K be an arbitrary key of the schema. Then K → X ⊢ U → V, since the FD U → V is trivial. This completes case (i). In case (ii), we know that U contains a key K. Since K → X ⊢ U → X and U → X ⊢ U → V, it follows by transitivity that K → X ⊢ U → V, as desired. Thus, in either case, R* is in BCNF according to Definition 1(c).
The fact that a relation schema in BCNF according to Definition 1(c) is also in BCNF according to Definition 1(b) is trivial.
Now assume that R* is in BCNF according to Definition 1(b). The proof is complete if we show that R* is in BCNF according to Definition 1(a). Let U → V be a nontrivial FD in R*. We need to show that the FD U → X is in R*. From Definition 1(b), we know that Δ ⊢ U → V. Write Δ as {K1 → X, ..., Kr → X}. We now show that Ki is a subset of U for some i (1≤i≤r). Assume not; we will derive a contradiction. Define a two-tuple relation Q with attributes X, such that one tuple has only zeros as entries, and the other tuple has zeros in the U columns and ones everywhere else. (We do not make any assumption that Q is a valid instance of the relation schema R*.) For each i (1≤i≤r), the two tuples of Q disagree in at least one column of Ki, since Ki is not a subset of U. Hence, the relation Q obeys each FD in Δ. However, relation Q does not obey the FD U → V, since the two tuples agree in the U columns but disagree in at least one column of V, since V is not a subset of U (or else the FD U → V would be trivial). So, relation Q is a counterexample relation that obeys Δ but disobeys U → V. This contradicts the fact that Δ ⊢ U → V. We have shown by contradiction that Ki is a subset of U for some i. Hence, Ki → X ⊢ U → X. Since Ki → X holds in R*, and since Ki → X ⊢ U → X, it follows that the FD U → X holds in the schema R*, which was to be shown. □
We now present three definitions of 4NF.
Definition 2(a) [Fa1]. A 1NF relation schema R* with attributes X is in 4NF if, for each nontrivial MVD U →→ V in R*, the FD U → X is in R*.
Definition 2(b). A 1NF relation schema R* with attributes X is in 4NF if Δ ⊢ σ for each MVD σ in R*, where Δ is the set of key dependencies of R*. Thus, every MVD is the result of keys.
Definition 2(c). A 1NF relation schema R* with attributes X is in 4NF if, for each MVD σ in R*, there is a key dependency K → X of R* such that K → X ⊢ σ. Thus, every MVD is the result of a key.
Theorem 2. Definitions 2(a), 2(b), and 2(c) are equivalent.
Proof: The proof that a relation schema in 4NF by Definition 2(a) is also in 4NF by Definition 2(c) is almost identical to the proof that a relation schema in BCNF by Definition 1(a) is also in BCNF by Definition 1(c). We leave the trivial modifications to the reader. As before, the implication from 2(c) to 2(b) is trivial. To conclude the proof, we show that a relation schema in 4NF according to Definition 2(b) is also in 4NF according to Definition 2(a). Let U →→ V be a nontrivial MVD in R*. We must show that the FD U → X is in R*. We can assume (by replacing V with the set difference V−U if necessary) that U and V are disjoint. Let W be the complement (in X) of the union of U and V. Since the MVD U →→ V is nontrivial, we know that V and W are both nonempty. From Definition 2(b), we know that Δ ⊢ U →→ V. Write Δ as {K1 → X, ..., Kr → X}. We now show that Ki is a subset of U for some i (1≤i≤r). Assume not. As in the proof of Theorem 1, define a two-tuple relation Q with attributes X, such that one tuple has only zeros as entries, and the other tuple has zeros in the U columns and ones everywhere else. As before, relation Q obeys every FD in Δ. We can write the two tuples of Q as (u,v,w) and (u,v',w'), where u, v, and w each contain only zeros, and v' and w' each contain only ones. Since the tuple (u,v',w) does not occur in Q, it follows by Proposition 3 of [Fa1] that Q does not obey the MVD U →→ V. So, relation Q is a counterexample relation that obeys Δ but disobeys U →→ V. This contradicts the fact that Δ ⊢ U →→ V. We have shown by contradiction that Ki is a subset of U for some i. Hence, Ki → X ⊢ U → X. Since Ki → X holds in R*, and since Ki → X ⊢ U → X, it follows that the FD U → X holds in the schema R*, which was to be shown. □
Let U → V be an FD in a 4NF relation schema R*. Then, of course, the MVD U →→ V holds in R* (since U → V ⊢ U →→ V), and so, by Definition 2(c), there is a key K of R* such that K → X ⊢ U →→ V. Is the stronger fact true that K → X ⊢ U → V? As we now show, the answer is yes.
Theorem 3. Let U →→ V be a nontrivial MVD. Denote the set of attributes by X. If K → X ⊢ U →→ V, then K is a subset of U. In particular, K → X ⊢ U → V.
Proof: This follows from the proof of Theorem 2. □
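The two-tuple relation Q used in the proofs of Theorems 1 and 2 is easy to construct explicitly. The Python sketch below is only an illustration: the attribute set ABCD, the candidate keys AB and CD, and the FD AC → D are made-up inputs chosen so that no key is a subset of U, and the function names are hypothetical.

```python
def two_tuple_relation(attrs, u):
    """Q from the proofs: one all-zero tuple, and one tuple that is zero
    on the U columns and one everywhere else."""
    zeros = {a: 0 for a in attrs}
    other = {a: (0 if a in u else 1) for a in attrs}
    return [zeros, other]

def fd_holds(rel, lhs, rhs):
    """An FD lhs -> rhs holds if tuples agreeing on lhs also agree on rhs."""
    for s in rel:
        for t in rel:
            if all(s[a] == t[a] for a in lhs) and any(s[a] != t[a] for a in rhs):
                return False
    return True

# Hypothetical inputs: X = ABCD, keys AB and CD, and the FD AC -> D.
X = "ABCD"
keys = ["AB", "CD"]
U, V = "AC", "D"

Q = two_tuple_relation(X, U)
obeys_all_keys = all(fd_holds(Q, k, X) for k in keys)
obeys_fd = fd_holds(Q, U, V)
```

With these inputs Q obeys both key dependencies (its two tuples disagree somewhere on AB and somewhere on CD) but violates AC → D, so Q is a counterexample relation showing that these key dependencies do not imply AC → D.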
Corollary 4. Assume that the FD σ holds in the 4NF relation schema R* with attributes X. Then there is a key dependency K → X of R* such that K → X ⊢ σ.
Proof: Assume that the FD σ is U → V. Then the MVD U →→ V is in R*. By Definition 2(c), there is a key dependency K → X of R* such that K → X ⊢ U →→ V. By Theorem 3, K → X ⊢ U → V, which was to be shown. □
Before we define PJ/NF, we present the following membership algorithm, which tells whether or not a join dependency σ is a member of the set of logical consequences of the set Δ of key dependencies, that is, whether or not Δ ⊢ σ. It follows in a fairly straightforward manner from the more general membership algorithm of [ABU] that our membership algorithm succeeds (with inputs σ and Δ) if and only if Δ ⊢ σ.
Membership algorithm. The inputs are a join dependency σ = *{X1,...,Xm}, and a set Δ = {K1 → X, ..., Kr → X} of key dependencies. Initialize set S as {X1,...,Xm}. Apply the following rule until it can be no longer applied: if Ki ⊆ Y ∩ Z for some i (1≤i≤r) and for some members Y and Z of S, then replace Y and Z in S by their union, that is, remove the sets Y and Z from S and add to S the single member Y ∪ Z. (In particular, the number of members of S then decreases by one.) If the algorithm terminates with X, the set of all attributes, as a member of S, then we say the algorithm succeeds; otherwise, we say it fails.
Example: Let the set X of attributes be ABCD. Let σ be the JD *{AB,AD,BC}, and let Δ be {A → ABCD, B → ABCD}. We now show, by using our membership algorithm, that Δ ⊢ σ. (We will make use of this example later.) Initialize S as {AB,AD,BC}. Since A → ABCD is in Δ, and since AB and AD are in S, we replace AB and AD in S by ABD. At this stage, S is {ABD,BC}. Since B → ABCD is in Δ, and since ABD and BC are in S, we replace ABD and BC by ABCD. We are left with S = {X}. Thus, the membership algorithm succeeds, and so Δ ⊢ σ.
In this example, we can understand intuitively what is going on in the decomposition associated with the JD σ. There is one relation (with attributes AB) that gives the association between the keys A and B; a second relation (with attributes AD) that relates D to the key A; and a third relation (with attributes BC) that relates C to the key B.
Since, as noted, our membership algorithm succeeds if and only if Δ ⊢ σ, the following pair of definitions are equivalent.
Definition 3(a). A 1NF relation schema R* with attributes X is in PJ/NF if, for each JD σ in R*, the membership algorithm succeeds with inputs σ and Δ, where Δ is the set of key dependencies of R*.
Definition 3(b). A 1NF relation schema R* with attributes X is in PJ/NF if Δ ⊢ σ for each JD σ in R*, where Δ is the set of key dependencies of R*. Thus, every JD is the result of keys.
Since Definitions 1(b) and 1(c) are equivalent, as are Definitions 2(b) and 2(c), we might naturally assume that PJ/NF is equivalent to what we call overstrong PJ/NF in the following definition.
Definition 4. A 1NF relation schema R* with attributes X is in overstrong PJ/NF if, for each JD σ in R*, there is a key dependency K → X of R* such that K → X ⊢ σ. Thus, every JD is the result of a key.
However, as we will show in Section 3, overstrong PJ/NF is not equivalent to PJ/NF.
By comparing Definitions 1(b), 2(b), and 3(b), we see the striking relationship between BCNF, 4NF, and PJ/NF. In particular, since every MVD can be represented by a JD, the following theorem is immediate.
Theorem 5. PJ/NF implies 4NF.
Of course, we already know (from Theorem 2 of [Fa1]) that 4NF implies BCNF. A new proof of this result is trivially obtained from Definition 1(c) and Corollary 4. From Definition 2(c), Corollary 4, and Theorem 5, we obtain the following result.
Theorem 6. Let R* be a relation schema in either 4NF or PJ/NF. Then, for every FD and MVD σ, there is a key dependency K → X of R* such that K → X ⊢ σ.
Thus, in a PJ/NF relation schema, every dependency (FD, MVD, and JD) is determined by keys (the FDs and MVDs each by a single key, and the JDs by perhaps several keys). Hence, in a PJ/NF relation schema, all that needs to be specified is the set of attributes and the set of keys: all FDs, MVDs, and JDs (and in particular, all decompositions) are determined by these. This is good, since keys are quite fundamental and easily understood. If all relation schemata that correspond to stored relations in a database are in PJ/NF (and if there are no inter-relational dependencies), then the database management system need only have a mechanism for supporting keys, rather than a mechanism for supporting more general dependencies. We remark that in fact, System R [ABC] supports keys (via "unique indices"), but does not support general functional dependencies.
We now show that if a relation schema obeys only those dependencies (FDs, MVDs, and JDs) that are logical consequences of a set of FDs, then BCNF, 4NF, and PJ/NF coincide.
Theorem 7. Assume that the set of dependencies (FDs, MVDs, and JDs) in relation schema R* equals {σ : Σ ⊢ σ}, where Σ is a set of FDs. Then all of the following are equivalent:
(i) R* is in BCNF.
(ii) R* is in 4NF.
(iii) R* is in PJ/NF.
Proof: We already know that (iii) implies (ii), and that (ii) implies (i). So we need only show that (i) implies (iii). Thus, assume that R* is in BCNF; we will show that R* is in PJ/NF. Let Δ be the set of key dependencies of R*. Since R* is in BCNF, we know by Definition 1(c) that for each FD τ in Σ, there is a key dependency τ' in Δ such that τ' ⊢ τ. Since for each τ in Σ there is τ' in Δ such that τ' ⊢ τ, it follows that Δ ⊢ Σ. Let σ be an arbitrary JD of R*; thus, Σ ⊢ σ. Since Δ ⊢ Σ and Σ ⊢ σ, it follows by transitivity that Δ ⊢ σ. So by Definition 3(b), schema R* is in PJ/NF. □
Similarly, the following theorem holds.
Theorem 8. Assume that the set of dependencies (FDs, MVDs, and JDs) in relation schema R* equals {σ : Σ ⊢ σ}, where Σ is a set of FDs and MVDs. Then the following are equivalent:
(i) R* is in 4NF.
(ii) R* is in PJ/NF.
Proof: We already know that (ii) implies (i). We will show that (i) implies (ii). So, assume that R* is in 4NF; we will show that it is in PJ/NF. Once again, let Δ be the set of key dependencies of R*. For each FD or MVD τ in Σ, we know by Theorem 6 that there is a key dependency τ' in Δ such that τ' ⊢ τ. So, Δ ⊢ Σ. Let σ be an arbitrary JD in R*. Since Δ ⊢ Σ and Σ ⊢ σ, it follows by transitivity that Δ ⊢ σ. Hence, by Definition 3(b), R* is in PJ/NF. □
It is easy to see that by repeated decomposition, we can always convert a relation schema that is not in PJ/NF into a family of relation schemata that are (although inter-relational dependencies might be introduced). Compare Theorem 3 of [Fa1], which deals with the 4NF case. We remark that "embedded" JDs must be considered in the PJ/NF decomposition, just as embedded MVDs must be considered in the 4NF case. Investigating the interrelationships between embedded JDs and ordinary JDs is an interesting research problem.
It is important to note that a relation schema in PJ/NF is not necessarily decomposed "as far as it can go". For example, let R* be a relation schema with attributes ABCD and with its only dependencies being the logical consequences of A → ABCD. Thus, the only dependencies in R* are those that are the result of A being a key. Then R* is in PJ/NF, although it is possible to decompose R* into its projections on AB, AC, and AD, and still to be able to reobtain the original relation from the projections by taking the join. The point is that this decomposition is not necessary, since it does not seem to "buy" anything.
It follows from Definition 3(b) that PJ/NF is the "ultimate" normal form with respect to projection and join, since for a PJ/NF relation schema, the only JDs, and hence the only decompositions, are those that are the result of keys. That is, a PJ/NF relation schema cannot be decomposed any further, except by decompositions based on keys. It is an important question to consider criteria by which one decomposition into PJ/NF is "better than" another. A good example of a partial answer to the question is given by Rissanen's concept of "independent components" [Ri1]. When we say that PJ/NF is "ultimate" (with respect to projections and joins), we simply mean that there is no need to decompose further.
We close this section by presenting the following bizarre example (modified slightly from the example given by Nicolas [Ni]), which shows that it is possible for a relation schema to be in 4NF but not in PJ/NF. Assume that there are only three attributes A, P, C, where the tuple (a,p,c) means, intuitively, that agent a represents product p produced by company c. We want this schema to obey the JD *{AP,AC,PC}, which we will abbreviate by σ. In English, the JD σ says precisely the following:
"Assume that agent a represents product p produced by company c', that agent a represents product p' produced by company c, and that agent a' represents product p produced by company c. Then agent a represents product p produced by company c."
Nicolas actually assumes the following statement, which we will abbreviate by Y:
"An agent must not work for two companies whose products overlap."
We now show that Y ⊢ σ (note that we are using "⊢" in a slightly extended way from before, since Y is not an FD, MVD, or JD). Assume that Y holds in a relation R. We must show that σ holds in R, that is, that if (a,p,c'), (a,p',c), and (a',p,c) are all tuples of R, then so is the tuple (a,p,c). We will, in fact, show the stronger result that it is impossible for all of (a,p,c'), (a,p',c), and (a',p,c) to appear as tuples in R. Assume that they do. Since (a,p,c') and (a',p,c) both appear, companies c and c' both carry product p. Since agent a works for company c (because of the tuple (a,p',c)), he cannot work for company c', and so the tuple (a,p,c') cannot appear. Since it does, we have reached a contradiction.
Nicolas shows, by giving an example, that Y does not logically imply any nontrivial FD or MVD. Hence, if we let R* be a relation schema with attributes APC, and with only those dependencies that are logical consequences of σ, then no nontrivial FDs or MVDs hold (because if σ ⊢ τ for a nontrivial FD or MVD τ, then, since Y ⊢ σ, it follows by transitivity that Y ⊢ τ, which contradicts Nicolas' result). So R* is a relation schema in 4NF (since it has no nontrivial FDs or MVDs). However, it is not in PJ/NF, since the only key dependency of R* is the trivial FD APC → APC, and Definition 3(b) is violated (since the JD σ, being nontrivial, is not a logical consequence of the trivial dependency APC → APC). To obtain PJ/NF, it is necessary to decompose on the basis of σ, into the projection of R* on each of AP, AC, and PC.
Note that if the only restriction we know about the schema R* is the JD σ given by *{AP,AC,PC} (and if, in particular, we do not assume that Nicolas' statement Y holds), then it is certainly desirable to decompose R* into its projections on AP, AC, and PC. For, it is then possible to independently update each of these 3 relations; a tuple can be inserted into or deleted from any of these relations with no effect on any other tuple. The decomposition automatically enforces the JD σ. By contrast, in the original relation with attributes APC, the addition of a tuple may force the presence of other tuples. Also, in this original relation, it may be illegal to delete certain tuples, since the deletion may cause the violation of the JD σ.
We remark that in this example, there is a simple direct proof [Me] that if τ is an FD or MVD such that σ ⊢ τ, then τ is necessarily trivial. For, if τ is nontrivial, then it is easy to see that there is a two-tuple relation R such that τ fails in R. However, we leave to the reader the fairly straightforward verification that, since the union of any two of AP, AC, and PC is the set X of all attributes, necessarily σ = *{AP,AC,PC} holds in every 2-tuple relation. Thus, it is impossible that σ ⊢ τ, since our 2-tuple relation R is a counterexample relation in which σ holds and τ fails.
3. An overstrengthening of PJ/NF
In this section, we discuss "overstrong PJ/NF", defined in Definition 4 of the previous section. Since Definitions 1(b) and 1(c) are equivalent, as are Definitions 2(b) and 2(c), it is natural to believe by analogy that PJ/NF (Definition 3(b)) is equivalent to overstrong PJ/NF (Definition 4). We now show that this is not the case.
Let R* be a relation schema with attributes X = ABCD, and with only those dependencies that are the logical consequences of the key dependencies A → X and B → X. Thus, the dependencies in R* are precisely those that are the result of A and B being keys. So, R* is in PJ/NF. We now show that R* is not in overstrong PJ/NF. Let σ be the JD *{AB,AD,BC}. As we showed in the example immediately following our description of the membership algorithm, σ is a logical consequence of the key dependencies A → X and B → X. Hence, σ is a JD of R*. However, it is false that A → X ⊢ σ, as the reader can see by considering our membership algorithm where we let Δ be {A → X}. Similarly, it is false that B → X ⊢ σ. So, there is no key dependency K → X of R* such that K → X ⊢ σ. Hence, R* is not in overstrong PJ/NF.
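The runs of the membership algorithm appealed to above can be replayed mechanically. What follows is a minimal Python sketch of the algorithm as described in Section 2, under the assumptions that attributes are single characters, that each key dependency K → X is given just by its left-hand side K, and that the function name `membership` is hypothetical.

```python
from itertools import combinations

def membership(components, keys, all_attrs):
    """Merge two members of S whenever some key lies in their intersection;
    succeed iff the full attribute set X ends up as a member of S."""
    s = {frozenset(x) for x in components}
    key_sets = [frozenset(k) for k in keys]
    merged = True
    while merged:
        merged = False
        for y, z in combinations(s, 2):
            if any(k <= (y & z) for k in key_sets):
                # Replace Y and Z by their union, then rescan from scratch.
                s = (s - {y, z}) | {y | z}
                merged = True
                break
    return frozenset(all_attrs) in s

sigma = ["AB", "AD", "BC"]                              # the JD *{AB,AD,BC}
both_keys = membership(sigma, ["A", "B"], "ABCD")
only_a = membership(sigma, ["A"], "ABCD")
only_b = membership(sigma, ["B"], "ABCD")
```

With both keys the sketch merges AB and AD into ABD and then ABD and BC into ABCD, so it succeeds. With A alone it stops at {ABD, BC}, and with B alone it stops at {ABC, AD}, so it fails in both cases, in agreement with the claims in the text.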
Note also that R*, being in PJ/NF, is in BCNF. So, Theorem 7 fails if we replace "PJ/NF" by "overstrong PJ/NF". A similar comment holds for Theorem 8.

We feel that, as the name implies, overstrong PJ/NF is too demanding, since, for example, the schema R* we just analyzed is not in overstrong PJ/NF. However, we find it interesting from a theoretical point of view to explore overstrong PJ/NF a little more carefully. Hence, we will give two definitions of overstrong PJ/NF, and show that they are equivalent. First, we need to define two more concepts. We say the JD *{X1,...,Xm} covers the JD *{Y1,...,Yp} if

(i) they contain the same attributes, that is, ∪{Xi: 1≤i≤m} = ∪{Yj: 1≤j≤p}; and
(ii) each Yj equals an Xi, that is, for each j (1≤j≤p) there exists i (1≤i≤m) such that Yj = Xi.

For example, the JD *{Z1,Z2,Z3} covers the JD *{Z1,Z2}, provided they contain the same set of attributes, that is, provided every attribute in Z3 is also in Z1 or Z2. We say that a JD in a relation schema is minimal if no JD that it covers is also in the schema.

Lemma 9: If the JD σ covers the JD τ, then τ ⊢ σ.

Proof: Assume that the JD τ holds in relation R. We must show that the JD σ also holds in R. Let us write σ as *{X1,...,Xm}, and τ as *{Y1,...,Yp}. Let t be an arbitrary tuple for which there are tuples w1,...,wm of R such that wi[Xi] = t[Xi] for each i. As we noted in the introduction, to show that σ holds in R, we need only show that the tuple t is in R. For each j (1≤j≤p), find i(j) such that Xi(j) = Yj. Then wi(j)[Yj] = wi(j)[Xi(j)] = t[Xi(j)] = t[Yj]. Now, the tuples wi(j) are in R (1≤j≤p), and we have shown that wi(j)[Yj] = t[Yj] (1≤j≤p). So, since τ holds for R, the tuple t is in R. This was to be shown. □

We remark that Lemma 9 is a special case of a stronger result in [ASU]. Further, it is essentially the same as Proposition 5 of [Ni].

We now present two definitions of overstrong PJ/NF. The second definition (Definition 4(b) below) is the same as our earlier Definition 4.

Definition 4(a). A 1NF relation schema R* with attributes X is in overstrong PJ/NF if, for each minimal JD *{X1,...,Xm} in R*, there is a key K of R* such that K is a subset of each Xi.

Definition 4(b). A 1NF relation schema R* with attributes X is in overstrong PJ/NF if, for each JD σ in R*, there is a key dependency K→X of R* such that K→X ⊢ σ. Thus, every JD is determined by a key.

To see why the word "minimal" cannot be eliminated from Definition 4(a), let R* be a relation schema with attributes ABC, whose only dependencies are those caused by A being a key. That is, the only dependencies in R* are A→ABC and its logical consequences. Then R* is in overstrong PJ/NF, according to Definition 4(b). We now show that it does not obey a modified version of Definition 4(a) with the word "minimal" eliminated. By Heath's Theorem [He], A→ABC ⊢ *{AB,AC}. By Lemma 9, *{AB,AC} ⊢ *{AB,AC,BC}. By transitivity, A→ABC ⊢ *{AB,AC,BC}. Thus, *{AB,AC,BC} is in R*. However, the (only) key A is not in BC.

Theorem 10. Definitions 4(a) and 4(b) are equivalent.

Proof: Assume first that a relation schema R* is in overstrong PJ/NF according to Definition 4(a). We will show that it is in overstrong PJ/NF according to Definition 4(b). Let σ be a JD in R*. We must show that there is a key dependency K→X such that K→X ⊢ σ. It is easy to see that the schema R* contains a minimal JD τ that is covered by σ (τ may, of course, be σ itself). By Lemma 9, τ ⊢ σ. Write τ as *{Y1,...,Yp}. By Definition 4(a), there is a key K of R* such that K is a subset of each Yj. It follows easily from our membership algorithm that, since K is a subset of each Yj, necessarily K→X ⊢ τ. Since K→X ⊢ τ, and since τ ⊢ σ, it follows by transitivity that K→X ⊢ σ. This was to be shown.

Now assume that a relation schema R* is in overstrong PJ/NF according to Definition 4(b). We must show that it is also in overstrong PJ/NF according to Definition 4(a). Let σ be a minimal JD of R*. Assume that σ is *{X1,...,Xm}. By Definition 4(b), there is a key dependency K→X of R* such that K→X ⊢ σ. Since K is a key, we need only show that K is a subset of each Xi. Assume not. Without loss of generality, assume that K is not a subset of X1. Denote by A an attribute in K that is not in X1. We will derive a contradiction.

Since σ is a minimal JD of R*, we know that *{X2,X3,...,Xm} is not a JD of R*. There are two possible reasons why.

Case 1: There is an attribute in X1 that is not in any of the other Xi's. Denote this attribute by B. Note that A≠B, since B is in X1, and A is not. Define a two-tuple relation Q as follows. One tuple (call it u) has only zeros as entries. The other tuple (call it v) has a one in the A and B columns and zeros everywhere else. Denote by t a tuple that has a one in the B column and zeros everywhere else. In particular, tuple t is not in relation Q. Now, v[X1] = t[X1], since both v[X1] and t[X1] have a one in the B column and zeros everywhere else (recall that X1 contains B but not A). Further, u[Xi] = t[Xi], for 2≤i≤m, since they both contain only zeros (recall that Xi does not contain B, if 2≤i≤m). It now follows from the definition of join dependency that since t is not in Q, the JD σ does not hold in Q. However, the FD K→X holds in Q, since the two tuples of Q differ in the A column (and A is in K). So, relation Q is a counterexample relation that obeys K→X but disobeys σ. This contradicts the fact that K→X ⊢ σ.
Case 2: X1 is contained in the union of X2,X3,...,Xm. Denote the JD *{X2,X3,...,Xm} by τ. Since σ is minimal, we know that τ is not in the relation schema R*. In particular, since the FD K→X is in R*, we know that it is false that K→X ⊢ τ. Thus, there is a counterexample relation S such that K→X holds for S, but τ does not. Since τ does not hold in relation S, there are tuples w2,w3,...,wm in S and a tuple t not in S such that wi[Xi] = t[Xi], for 2≤i≤m. Let z be a new tuple that agrees with t everywhere except in the A column (recall that A is an attribute that is in K but not in X1). The value of z in the A column is defined to be a new value, that does not appear in t or in any tuple of S. Let T be a new relation that consists of all tuples in S, along with the tuple z. We will now show that T is a counterexample relation, in which K→X holds, but in which σ fails. This contradicts our assumption that K→X ⊢ σ. It is easy to see that the FD K→X holds in T, since it holds in S, and T differs from S only by having a new tuple with a new value in the A column (and A is in K). To complete the proof, we need only show that σ does not hold in T. Now z[X1] = t[X1] by construction of z, since attribute A is not in X1. We already know that wi[Xi] = t[Xi], for 2≤i≤m. However, t is not in T (it is not in S, and it differs from z in the A column). So the JD σ does not hold in T, as was to be shown. □
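The Case 1 construction can be made concrete on a small instance. The sketch below is a minimal illustration, not part of the proof. It assumes attributes ABC, the key K = A, and the components X1 = BC and X2 = AC, so that B occurs only in X1 and A is missing from X1. These choices, and the helper names `jd_holds` and `fd_holds`, are hypothetical. A JD is tested directly from its definition: the relation must equal the join of its projections.

```python
def jd_holds(relation, schema, components):
    """True if `relation` equals the join of its projections onto `components`.
    The components are assumed to cover every attribute of `schema`."""
    rows = [dict(zip(schema, row)) for row in relation]
    partial = [{}]  # partial tuples built so far, one projection at a time
    for comp in components:
        proj = {tuple((a, r[a]) for a in comp) for r in rows}
        nxt = []
        for p in partial:
            for q in proj:
                q = dict(q)
                # Keep the pair only if it agrees on attributes already fixed in p.
                if all(p.get(a, q[a]) == q[a] for a in comp):
                    nxt.append({**p, **q})
        partial = nxt
    joined = {tuple(p[a] for a in schema) for p in partial}
    return joined == set(relation)


def fd_holds(relation, schema, lhs, rhs):
    """True if the FD lhs -> rhs holds in `relation`."""
    seen = {}
    for row in relation:
        r = dict(zip(schema, row))
        key = tuple(r[a] for a in lhs)
        val = tuple(r[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


SCHEMA = "ABC"
u = (0, 0, 0)   # all zeros
v = (1, 1, 0)   # ones in the A and B columns
t = (0, 1, 0)   # a one in the B column only; not in Q
Q = {u, v}

key_ok = fd_holds(Q, SCHEMA, "A", "ABC")    # the two tuples differ on A
jd_ok = jd_holds(Q, SCHEMA, ["BC", "AC"])   # the join also produces t
```

Here the join of the two projections contains t, which is not in Q, so the JD fails even though the key dependency holds. By contrast, `jd_holds(Q, SCHEMA, ["AB", "AC"])` succeeds, as Heath's Theorem predicts when A is a key.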
4. Allowing other relational operators

Perhaps the next important operator to consider is the union (and a corresponding inverse operator, which we will call the split). Let ψ(t) be a formula of the relational calculus [Co2] with one free tuple variable t. The split (with respect to ψ(t)) of a relation R is the set of two relations R1 and R2, each with the same attributes as R, where the set of tuples in R is the union of the tuples in R1 and R2. The relations R1 and R2 are disjoint, and a tuple t of R is in R1 if and only if it obeys ψ(t).

In the previous section, our ground rules allowed us to decompose a relation, that is, to replace it by a set of its projections, provided the original relation was the join of the new relations. In this section, we not only allow a relation to be decomposed (provided, as before, that the operation can be reversed by taking joins); we furthermore allow a relation to be split (possibly repeatedly); the relation that was split is then the union of the new relations. We allow a concatenation of decompositions and splits, possibly interleaved.

Smith [Sm] gives several interesting examples where splitting is desirable. As one of his examples, assume that there is a relation schema P-ROLE with attributes PERSON and ROLE. A PERSON is assumed to have two types of ROLES: sex role (male or female) and profession. It is assumed that a person can have several professions, but only one sex. The schema can be split into two schemata, SEX-ROLE and PROF-ROLE. Here, ψ(t) is (t.ROLE = "male") ∨ (t.ROLE = "female"). The schema SEX-ROLE obeys the FD PERSON→ROLE, while the schema PROF-ROLE does not. As Smith notes, an advantage of splitting is that the system's key maintenance mechanism can then prevent the insertion of two sex roles for one person.

As a second of Smith's examples, assume that there is a VEHICLE schema with a number of attributes, such as WINGSPAN and SAIL-AREA. Certain of these attributes do not apply to certain vehicles. For example, WINGSPAN applies to air vehicles, but does not apply to water vehicles. The schema can be split into several schemata (such as AIR-VEHICLE and WATER-VEHICLE), on the basis of VEHICLE-TYPE. It is then desirable to "drop" inapplicable attributes (by taking projections) in the new schemata. The net result is to eliminate what Codd calls "property inapplicable" nulls (as opposed to his "value at present unknown" nulls) [Co4].

Note that if attribute A "does not apply" to tuples in the relation R1, then R1 obeys the FD ∅→A (where ∅ is the empty set). Thus, in Smith's second example, an instance of the schema WATER-VEHICLE obeys the FD ∅→WINGSPAN, whereas the set of remaining tuples does not. Thus, a split is called for.

A possible definition (which, as we will show later, must be modified) of a "projection-split-join-union normal form" (PSJU/NF) is: a relation schema R* is in PSJU/NF if (i) it is in PJ/NF, and (ii) there is no way to split R* (on the basis of an expression ψ(t)) into relation schemata R1* and R2*, where the set of dependencies that hold in R1* is not the same as the set of dependencies that hold in R2*.

Smith's first example then violates PSJU/NF (and is therefore split), since the schema SEX-ROLE obeys the FD PERSON→ROLE, while the schema PROF-ROLE does not.

A major defect of our definition of PSJU/NF as it stands is that we have given no restrictions on what ψ(t) may be. For example, by letting ψ(t) completely specify a single tuple, the split operator would simply split off a single tuple, and this process of splitting off single tuples could take place repeatedly (note that a one-tuple relation obeys every FD, MVD, and JD). As a more subtle example, assume that we have a SUPPLY relation schema with only two attributes, SUPPLIER and PART. Let ψ(t) be the relational calculus equivalent of "t.SUPPLIER supplies every PART". On the basis of ψ(t), the SUPPLY schema is split into two schemata UNIV-SUPPLY and NON-UNIV-SUPPLY, where universal SUPPLIERS (those SUPPLIERS who supply every PART) appear in the UNIV-SUPPLY relation; thus, this schema obeys the MVD ∅→→SUPPLIER. So, the UNIV-SUPPLY schema can then be decomposed into two schemata UNIV-SUPPLIER (with one attribute, SUPPLIER), and PART (with one attribute, PART). We are left with three schemata: one lists the universal suppliers, one lists all parts, and the third lists all (supplier, part) pairs for non-universal suppliers. We note that in Codd's relational algebra [Co2], the "quotient" of the SUPPLY schema by the PART schema is the UNIV-SUPPLY schema, and the "remainder" is the NON-UNIV-SUPPLY schema. Is it worthwhile to split the original SUPPLY schema in the first place? It probably depends on whether or not we expect a large proportion of the suppliers to be universal suppliers.
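The split operator and two of the examples in this section can be sketched in a few lines. This is an illustration under stated assumptions: the sample tuples are invented, relations are modelled as Python sets of tuples, and the helper names `split` and `fd_holds` are hypothetical. The predicate passed to `split` plays the role of the formula that selects the first of the two relations.

```python
def split(relation, predicate):
    """Split a relation into (tuples obeying the predicate, remaining tuples)."""
    r1 = {t for t in relation if predicate(t)}
    return r1, set(relation) - r1


def fd_holds(relation, lhs, rhs):
    """True if the FD lhs -> rhs holds; lhs and rhs are tuples of column indexes."""
    seen = {}
    for t in relation:
        key = tuple(t[i] for i in lhs)
        val = tuple(t[i] for i in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


# P-ROLE(PERSON, ROLE), with invented sample data.
p_role = {
    ("ann", "female"), ("ann", "doctor"), ("ann", "pilot"),
    ("bob", "male"), ("bob", "doctor"),
}
sex_role, prof_role = split(p_role, lambda t: t[1] in ("male", "female"))

# SUPPLY(SUPPLIER, PART), with invented sample data: s1 supplies every part.
supply = {("s1", "p1"), ("s1", "p2"), ("s2", "p1")}
parts = {p for _, p in supply}
universal = {s for s, _ in supply
             if {p for s2, p in supply if s2 == s} == parts}
univ_supply, non_univ_supply = split(supply, lambda t: t[0] in universal)
univ_supplier = {s for s, _ in univ_supply}
```

In this instance `sex_role` obeys PERSON→ROLE while `prof_role` does not, and `univ_supply` equals the cross product of `univ_supplier` and `parts`, which is what allows the further decomposition into two single-attribute schemata. In both cases the union of the two pieces gives back the original relation.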
We believe that it is an important research problem to clarify which expressions ψ(t) we should allow in our definition of PSJU/NF.
5. Acknowledgements
The author is grateful to Y. Sagiv and E. F. Codd for valuable suggestions. He also thanks C. J. Date for helpful comments.

BIBLIOGRAPHY

[ABC] M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, K. P. Eswaran, J. N. Gray, P. P. Griffiths, W. F. King, R. A. Lorie, P. R. McJones, J. W. Mehl, G. R. Putzolu, I. L. Traiger, B. W. Wade, and V. Watson. System R: A relational approach to data base management. TODS 1,2 (June 1976), 97-137.
[ABU] A. V. Aho, C. Beeri, and J. D. Ullman. The theory of joins in relational data bases. Proc. 19th IEEE Symp. on Foundations of Computer Science (Oct. 1977), 107-113.
[Ar] W. W. Armstrong. Dependency structures of database relationships. Proc. IFIP 74, North Holland (1974), 580-583.
[ASU] A. V. Aho, Y. Sagiv, and J. D. Ullman. Equivalences among relational expressions. To appear, SIAM J. Computing.
[BBG] C. Beeri, P. A. Bernstein, and N. Goodman. A sophisticate's introduction to database normalization theory. Proc. 1978 VLDB (Berlin), 113-124.
[Be] P. A. Bernstein. Synthesizing third normal form relations from functional dependencies. TODS 1,4 (Dec. 1976), 277-298.
[Ca] J.-M. Cadiou. On semantic issues in the relational model of data. Proc. Int. Symp. on Math. Foundations of Computer Science, Gdansk, Poland. Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, Sept. 1975.
[Co1] E. F. Codd. A relational model of data for large shared data banks. Comm. ACM 13,6 (June 1970), 377-387.
[Co2] E. F. Codd. Relational completeness of data base sublanguages. Courant Computer Science Symposia 7, Data Base Systems, New York City, May 21-25, 1971. Prentice Hall.
[Co3] E. F. Codd. Recent investigations in relational data base systems. Proc. IFIP Congress 74, North-Holland (1973), 1017-1021.
[Co4] E. F. Codd. Understanding relations. Installment No. 7. FDT (bulletin of ACM SIGMOD) 7, 3-4, Dec. 1975.
[DB] U. Dayal and P. A. Bernstein. The fragmentation problem: lossless decomposition of relations into files. Technical report CCA-78-13, Computer Corporation of America, Cambridge, Mass. (Nov. 15, 1978).
[Fa1] R. Fagin. Multivalued dependencies and a new normal form for relational databases. TODS 2,3 (Sept. 1977), 262-278.
[Fa2] R. Fagin. Functional dependencies in a relational database and propositional logic. IBM J. Res. and Devel. 21,6 (Nov. 1977), 534-544.
[He] I. J. Heath. Unacceptable file operations in a relational data base. Proc. 1971 ACM-SIGFIDET workshop on data description, access, and control, San Diego, Calif., Nov. 11-12, 1971.
[Me] A. Mendelzon. Private communication received transitively through Y. Sagiv.
[Ni] J. M. Nicolas. Mutual dependencies and some results on undecomposable relations. Proc. 1978 VLDB (Berlin), 360-367.
[Ri1] J. Rissanen. Independent components of relations. TODS 2,4 (Dec. 1977), 317-325.
[Ri2] J. Rissanen. Theory of relations for databases - a tutorial survey. Proc. 7th Symp. on Math. Found. of Comp. Science, Lecture Notes in Comp. Science 64, Springer-Verlag, 537-551.
[Sm] J. M. Smith. A normal form for abstract syntax. Proc. 1978 VLDB (Berlin), 156-162.