A Normal Form for Precisely Characterizing Redundancy in Nested Relations

WAI YIN MOK, YIU-KAI NG, and DAVID W. EMBLEY
Brigham Young University
We give a straightforward definition for redundancy in individual nested relations and define a new normal form that precisely characterizes redundancy for nested relations. We base our definition of redundancy on an arbitrary set of functional and multivalued dependencies, and show that our definition of nested normal form generalizes standard relational normalization theory. In addition, we give a condition that can prevent an unwanted structural anomaly in nested relations, namely, embedded nested relations with at most one tuple. Like other normal forms, our nested normal form can serve as a guide for database design.

Categories and Subject Descriptors: H.2.1 [Database Management]: Logical Design - data models; normal forms

General Terms: Design, Theory

Additional Key Words and Phrases: Database design, data redundancy, functional and multivalued dependencies, nested normal form, nested relations, normalization theory, scheme trees
1. INTRODUCTION

Although normalization theory for flat relations has a long research history, its extension to nested relations is much more recent. Partition Normal Form (PNF) [Roth et al. 1988], which guarantees expected properties for nesting and unnesting and for keys of nested relations, has been well accepted. Indeed, nested relations are sometimes defined such that only PNF relations are allowed,¹ and for Abiteboul and Bidoit [1986], the definition predates PNF.

A normal form for nested relation schemes that detects potential redundancy, and the possible update anomalies that accompany redundancy, has not been widely accepted, even though some have been proposed [Ozsoyoglu and Yuan 1987, 1989; Roth and Korth 1987]. Although these earlier proposals provided guidance for the design of nested relation schemes, they did not succeed in precisely characterizing potential redundancy. In this article we propose a new normal form for individual nested relation schemes that completely characterizes redundancy with respect to any given set of functional dependencies (FDs) and multivalued dependencies (MVDs). The result we present is a generalization of standard relational normalization theory.

¹ See Abiteboul and Bidoit [1986], Ozsoyoglu and Yuan [1987, 1989], and Roth and Korth [1987].

Much of the work on this paper was done while W. Y. Mok was at Hong Kong Polytechnic. Authors' address: Department of Computer Science, Brigham Young University, Provo, UT 84602. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. © 1996 ACM 0362-5915/96/0300-0077 $03.50

ACM Transactions on Database Systems, Vol. 21, No. 1, March 1996, Pages 77-106

We proceed as follows. In Section 2, we provide our basic definitions for nested relations. Like Abiteboul and Bidoit [1986], Ozsoyoglu and Yuan [1987, 1989], and Roth and Korth [1987], we define our nested relations to be in PNF. In Section 2 we also give carefully specified redundancy definitions. As illustrations for our redundancy definitions, we give examples, which we use later to show that none of the earlier definitions fully detects redundancy. In Section 3, we present our definition, which we call NNF (Nested Normal Form). As we illustrate our definition, we also compare it to earlier definitions and show that ours can provide greater flexibility in how attributes may be clustered in nested relation schemes. In Section 4, we present a theorem guaranteeing that NNF detects potential redundancy. In Section 5, we investigate the converse of this theorem. We show that a nested relation scheme that is not consistent with the given set of MVDs and FDs, as we define consistency, is automatically not in our normal form. In addition, we are able to show that if a nested relation scheme is consistent with the given set of MVDs and FDs and there is no potential redundancy, then the nested relation scheme satisfies our definition of NNF. In Section 6, we show that our definition of NNF is a generalization of standard relational normalization theory. In particular, we show that 4th Normal Form (4NF), as defined by Fagin [1977], is a special case of NNF, and that Boyce-Codd Normal Form (BCNF) is also a special case when we limit the dependencies to FDs. Thus, like other normal forms, our definition of NNF can provide a guide to database design.
It also has the drawbacks of these other normal forms and, in this sense, is not a panacea for database design. We therefore comment on what our definition does and does not provide for the designer. In Section 7, we present a condition that can prevent an unwanted structural characteristic of nested relations, which we call singleton buckets because a nested relation represented by a singleton bucket allows at most one tuple. We then prove that this condition does indeed prevent singleton buckets. Although this condition has nothing to do with redundancy, it is in harmony with earlier definitions [Ozsoyoglu and Yuan 1989; Roth and Korth 1987] that also disallow singleton buckets. In Section 8, we present our conclusions.

2. BASIC DEFINITIONS AND PROPERTIES

2.1. Nested Relations

A nested relation allows each tuple component to be either atomic or another nested relation, which may itself be nested several levels deep. As in Abiteboul and Bidoit [1986], Ozsoyoglu and Yuan [1987, 1989], and Roth and Korth [1987], we are only interested in nested relations that are in PNF. Thus, in a nested relation, there can never be distinct tuples that agree on the atomic attributes of either the nested relation itself or of any nested relation embedded within it [Atzeni and De Antonellis 1993].
Definition 2.1.1. Let $U$ be a set of attributes. A nested relation scheme is recursively defined as follows:

(1) If $X$ is a nonempty subset of $U$, then $X$ is a nested relation scheme over the set of attributes $X$.

(2) If $X, X_1, \ldots, X_n$ are pairwise disjoint, nonempty subsets of $U$, and $R_1, \ldots, R_n$ are nested relation schemes over $X_1, \ldots, X_n$, respectively, then $X(R_1)^* \cdots (R_n)^*$ is a nested relation scheme over $X X_1 \cdots X_n$.
Definition 2.1.2. Let $R$ be a nested relation scheme over a nonempty set of attributes $Z$. Let the domain of an attribute $A \in Z$ be denoted by $dom(A)$. A nested relation over $R$ is recursively defined as follows:

(1) If $R$ has the form $X$, where $X$ is a set of attributes $\{A_1, \ldots, A_n\}$, $n \ge 1$, then $r$ is a nested relation over $R$ if $r$ is a (possibly empty) set of functions $\{t_1, \ldots, t_m\}$ where each function $t_i$, $1 \le i \le m$, maps $A_j$ to an element in $dom(A_j)$, $1 \le j \le n$.

(2) If $R$ has the form $X(R_1)^* \cdots (R_m)^*$, $m \ge 1$, where $X$ is a set of attributes $\{A_1, \ldots, A_n\}$, $n \ge 1$, then $r$ is a nested relation over $R$ if

(a) $r$ is a (possibly empty) set of functions $\{t_1, \ldots, t_p\}$ where each function $t_i$, $1 \le i \le p$, maps $A_j$ to an element in $dom(A_j)$, $1 \le j \le n$, and maps $R_k$ to a nested relation over $R_k$, $1 \le k \le m$, and

(b) $t_i \in r$ and $t_j \in r$ and $t_i(X) = t_j(X)$ implies $t_i = t_j$, $1 \le i, j \le p$.

Each function of a nested relation $r$ over nested relation scheme $R$ is a nested tuple of $r$.
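Condition (2)(b) of Definition 2.1.2 is the PNF requirement, and it can be checked mechanically. The following is a minimal sketch under an assumed encoding that is not part of the paper: a nested relation is a duplicate-free Python list of dicts, atomic attributes map to plain values, and embedded nested relations map to lists of dicts. The function names and the key names such as `Interests` are hypothetical.

```python
def atomic_part(t):
    """Return the atomic components of a nested tuple as a hashable key."""
    return tuple(sorted((a, v) for a, v in t.items() if not isinstance(v, list)))


def satisfies_pnf(r):
    """Check that no two nested tuples agree on their atomic attributes,
    at this level and, recursively, in every embedded nested relation."""
    seen = set()
    for t in r:
        key = atomic_part(t)
        if key in seen:  # two nested tuples agree on the atomic attributes
            return False
        seen.add(key)
        for v in t.values():
            if isinstance(v, list) and not satisfies_pnf(v):
                return False
    return True


# Two nested tuples over Student (Interest)*, as in the first bucket of Example 2.1.1.
good = [
    {"Student": "Young", "Interests": [{"Interest": "Chess"}, {"Interest": "Soccer"}]},
    {"Student": "Barker", "Interests": [{"Interest": "Skiing"}]},
]
# Violates PNF: Young appears in two distinct nested tuples.
bad = [
    {"Student": "Young", "Interests": [{"Interest": "Chess"}]},
    {"Student": "Young", "Interests": [{"Interest": "Soccer"}]},
]
```

Under this encoding, `satisfies_pnf(good)` is true and `satisfies_pnf(bad)` is false.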
Example 2.1.1. Figure 1 shows a nested relation. Its scheme is Dept Chair (Prof (Hobby)* (Matriculation (Student (Interest)*)*)*)*, and it contains two nested tuples. As in Abiteboul and Bidoit [1986], we draw a bucket for each embedded nested relation. Each bucket also contains nested tuples of its own. For example, (Young, {Chess, Soccer}) and (Barker, {Skiing}) are nested tuples in the first bucket under the embedded nested relation scheme Student (Interest)*. Notice that, as required, PNF is satisfied. Thus the values for the atomic attributes Dept Chair differ, and in each bucket the atomic values differ.

Dept Chair (Prof (Hobby)* (Matriculation (Student (Interest)*)*)*)*

CS    Turing
      Jane    {Skiing}
              Ph.D.   Young   {Chess, Soccer}
                      Barker  {Skiing}
              M.S.    Adams   {Skiing}
      Pat     {Hiking}
              Ph.D.   Lee     {Travel}
Math  Polya
      Steve   {Dance, Hiking}
              M.S.    Carter  {Travel, Skiing}

Fig. 1. Nested relation. (Each level of indentation stands for one bucket of the original drawing.)

Definition 2.1.3. Let $R$ be a nested relation scheme. Let $r$ be a nested relation on $R$. The total unnesting of $r$ is recursively defined as follows:

(1) If $R$ has the form $X$, where $X$ is a set of attributes, then $r$ is the total unnesting of $r$.

(2) If $R$ has the form $X(R_1)^* \cdots (R_n)^*$, where $X_i$ is the set of attributes in $R_i$, $1 \le i \le n$, then the total unnesting of $r$ = $\{t \mid$ there exists a nested tuple $u \in r$ such that $t(X) = u(X)$ and $t(X_i)$ is a tuple in the total unnesting of $u(R_i)$, $1 \le i \le n\}$.
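The recursion in Definition 2.1.3 can be sketched in a few lines. This is an illustration under an assumed encoding that is not part of the paper: a nested relation is a list of dicts, atomic attributes map to plain values, and embedded nested relations map to lists of dicts. The function name and key names such as `Profs` are hypothetical; the data is the Math tuple of Figure 1.

```python
from itertools import product


def total_unnesting(r):
    """Totally unnest a nested relation given as a list of dicts."""
    flat = []
    for u in r:
        # t(X) = u(X): copy the atomic components of the nested tuple.
        atomic = {a: v for a, v in u.items() if not isinstance(v, list)}
        # Totally unnest each embedded nested relation u(R_i).
        parts = [total_unnesting(v) for v in u.values() if isinstance(v, list)]
        # One flat tuple per choice of one tuple from each embedded unnesting.
        # If u has no embedded relations, product() yields one empty choice.
        for combo in product(*parts):
            t = dict(atomic)
            for piece in combo:
                t.update(piece)
            flat.append(t)
    return flat


# The Math tuple of Figure 1 in the assumed encoding.
math_dept = [{
    "Dept": "Math", "Chair": "Polya",
    "Profs": [{
        "Prof": "Steve",
        "Hobbies": [{"Hobby": "Dance"}, {"Hobby": "Hiking"}],
        "Matriculations": [{
            "Matriculation": "M.S.",
            "Students": [{
                "Student": "Carter",
                "Interests": [{"Interest": "Travel"}, {"Interest": "Skiing"}],
            }],
        }],
    }],
}]

flat = total_unnesting(math_dept)
```

Here `flat` holds four flat tuples, one for each combination of Steve's two hobbies with Carter's two interests. Note that, following the definition literally, a nested tuple with an empty embedded relation contributes no flat tuples.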
Definition 2.1.4. Let $R$ be a nested relation scheme. Let $r$ be a nested relation on $R$. Let $t$ be a nested tuple of $r$. The total unnesting of $t$ is defined as the total unnesting of $q$, where $q$ is a nested relation containing the single nested tuple $t$.

Example 2.1.2. Figure 2 shows the total unnesting of the nested relation in Figure 1. Observe that the first two tuples in the total unnesting contain the total unnesting of the nested tuple (Young, {Chess, Soccer}).

2.2. Scheme Trees
We can graphically represent a nested relation scheme by a tree, called a scheme tree. A scheme tree captures the logical structure of a nested relation scheme and explicitly represents a set of MVDs. Scheme trees have been used for earlier normal form definitions for nested relations [Ozsoyoglu and Yuan 1987, 1989; Roth and Korth 1987]. We use them here for the same purpose.

Definition 2.2.1. A scheme tree $T$ corresponding to a nested relation scheme $R$ is recursively defined as follows:

(1) If $R$ has the form $X$, then $T$ is a single node scheme tree whose root node is the set of attributes $X$.

(2) If $R$ has the form $X(R_1)^* \cdots (R_n)^*$, then the root node of $T$ is the set of attributes $X$, and a child of the root of $T$ is the root of the scheme tree $T_i$, where $T_i$ is the corresponding scheme tree for the nested relation scheme $R_i$, $1 \le i \le n$.
Dept  Chair   Prof   Hobby   Matriculation  Student  Interest
CS    Turing  Jane   Skiing  Ph.D.          Young    Chess
CS    Turing  Jane   Skiing  Ph.D.          Young    Soccer
CS    Turing  Jane   Skiing  Ph.D.          Barker   Skiing
CS    Turing  Jane   Skiing  M.S.           Adams    Skiing
CS    Turing  Pat    Hiking  Ph.D.          Lee      Travel
Math  Polya   Steve  Dance   M.S.           Carter   Travel
Math  Polya   Steve  Dance   M.S.           Carter   Skiing
Math  Polya   Steve  Hiking  M.S.           Carter   Travel
Math  Polya   Steve  Hiking  M.S.           Carter   Skiing

Fig. 2. Total unnesting of nested relation in Fig. 1.
The one-to-one correspondence between a scheme tree and a nested relation scheme, along with the definition of a nested relation scheme, imposes several properties on a scheme tree. Let $T$ be a scheme tree. We denote the set of attributes in $T$ by $Aset(T)$. Observe that the atomic attributes of a nested relation scheme, at any level of nesting, constitute a node in a scheme tree. Observe further that because Definition 2.1.1 requires nonempty sets of attributes, every node in $T$ consists of a nonempty set of attributes. Furthermore, because the sets of attributes corresponding to nodes in $T$ are pairwise disjoint and include all the attributes of $T$, the nodes in $T$ are pairwise disjoint, and their union is $Aset(T)$.

Let $N$ be a node in $T$. Notationally, $Ancestor(N)$ denotes the union of attributes in all ancestors of $N$, including $N$. Similarly, $Descendant(N)$ denotes the union of attributes in all descendants of $N$, including $N$. In a scheme tree $T$ each edge $(V, W)$, where $V$ is the parent of $W$, denotes an MVD $Ancestor(V) \twoheadrightarrow Descendant(W)$. Notationally, we use $MVD(T)$ to denote the union of all the MVDs represented by the edges in $T$. By construction, each MVD in $MVD(T)$ is satisfied in the total unnesting of any nested relation for $T$.

Because FDs are also of interest, we use $FD(T)$ to denote any set of FDs equivalent to all FDs $X \to Y$ implied by a given set of FDs and MVDs over a set of attributes $U$ such that $Aset(T) \subseteq U$ and $XY \subseteq Aset(T)$. (Note that because a set of FDs $F$ together with a set of MVDs $M$ can imply FDs not implied by $F$ alone, $FD(T)$, in general, is not equivalent to the set of FDs in $F$ whose left-hand side is a subset of $Aset(T)$ and whose right-hand side is restricted to $Aset(T)$.)

Example 2.2.1. Figure 3 shows the scheme tree $T$ for the scheme of the nested relation in Figure 1. Figure 3 also gives the set of attributes in $Aset(T)$ and the set of dependencies $MVD(T)$. Observe that each of the MVDs in $MVD(T)$ is satisfied in the unnested relation in Figure 2.
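The edge-to-MVD reading of a scheme tree can be sketched as follows. The `Node` class and the function names are hypothetical helpers introduced only for illustration; the tree built at the end is the scheme tree for the scheme of Figure 1.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    attrs: frozenset
    children: list = field(default_factory=list)


def descendant(n):
    """Union of the attributes in n and in all descendants of n."""
    out = set(n.attrs)
    for c in n.children:
        out |= descendant(c)
    return frozenset(out)


def mvds(root):
    """Return MVD(T) as (lhs, rhs) pairs: one MVD
    Ancestor(V) ->> Descendant(W) for each edge (V, W)."""
    result = []

    def walk(v, ancestor_v):
        for w in v.children:
            result.append((ancestor_v, descendant(w)))
            walk(w, ancestor_v | w.attrs)  # Ancestor(W) = Ancestor(V) plus W

    walk(root, frozenset(root.attrs))
    return result


# Scheme tree for Dept Chair (Prof (Hobby)* (Matriculation (Student (Interest)*)*)*)*
interest = Node(frozenset({"Interest"}))
student = Node(frozenset({"Student"}), [interest])
matriculation = Node(frozenset({"Matriculation"}), [student])
hobby = Node(frozenset({"Hobby"}))
prof = Node(frozenset({"Prof"}), [hobby, matriculation])
root = Node(frozenset({"Dept", "Chair"}), [prof])

for lhs, rhs in mvds(root):
    print(" ".join(sorted(lhs)), "->>", " ".join(sorted(rhs)))
```

The tree has five edges, so the sketch yields five MVDs, one per edge, with left-hand sides growing from Dept Chair down to Dept Chair Prof Matriculation Student.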
T =      Dept Chair
             |
            Prof
           /    \
      Hobby      Matriculation
                      |
                   Student
                      |
                   Interest

Aset(T) = Dept Chair Prof Hobby Matriculation Student Interest

MVD(T) = {Dept Chair ↠ Prof Hobby Matriculation Student Interest,
          Dept Chair Prof ↠ Hobby,
          Dept Chair Prof ↠ Matriculation Student Interest,
          Dept Chair Prof Matriculation ↠ Student Interest,
          Dept Chair Prof Matriculation Student ↠ Interest}

Fig. 3. Scheme tree T, Aset(T), and MVD(T) for nested relation scheme in Fig. 1.

2.3. Data Redundancy

Data redundancy is a concern in database design. Redundant data can lead to higher storage and access cost. It can lead to update anomalies, forcing multiple copies of the same data value to be updated when one copy changes, and it can lead to data inconsistency if all copies do not agree.
Except in rare cases, such as Vincent and Srinivasan [1992], papers and textbooks on normalization fail to provide rigorous definitions for redundancy and thus also fail to prove that normalization removes redundancy as expected. Offered instead are motivating examples to illustrate redundancy removal. Thus in the vast body of research literature on normalization, we have mostly only rigorous syntactic justifications for normalization; what we are missing are rigorous semantic justifications. Besides only providing for syntactic characterizations, a danger of not treating redundancy formally is that the examples may be misleading. Indeed, as we show in the following, the definition for 4NF found in most textbooks does not detect potential redundancy for all cases even though some readers of these books are led to believe that it does.

In creating definitions for redundancy, we should try to find simple and intuitive characterizations, but creating a simple and intuitive definition for redundancy is more difficult than one might at first think. Any definition will involve a sophisticated statement, and there are many possible approaches one might use. Our notion of redundancy is based on the idea that an atomic value v in a nested or flat relation is redundant if we can erase v, and then from what remains and from a single FD or MVD that holds, determine what v must have been.
U = {Dept, Chair, Prof, Hobby, Hobby-Equipment, Matriculation, Student, Interest}

F = {Student → Matriculation, Student → Prof, Prof → Dept, Dept → Chair}

M = {Student ↠ Interest, Prof ↠ Hobby Hobby-Equipment, Hobby ↠ Hobby-Equipment}

Fig. 4. Some given constraints over a set of attributes.
The way we define “holds” is important. Here, we adapt Fagin’s [ 1977] definition, and we explain it thoroughly before proceeding with our definition of redundancy.
Definition 2.3.1. Let $U$ be a set of attributes. Let $M$ be a set of MVDs over $U$ and $F$ be a set of FDs over $U$. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. An MVD $X \twoheadrightarrow Y$ holds for $T$ with respect to $M$ and $F$ if $X \subseteq Aset(T)$ and there exists a set of attributes $Z \subseteq U$ such that $Y = Z \cap Aset(T)$ and $M \cup F$ implies $X \twoheadrightarrow Z$ on $U$. An FD $X \to Y$ holds for $T$ with respect to $M$ and $F$ if $XY \subseteq Aset(T)$ and $M \cup F$ implies $X \to Y$ on $U$.

This definition is motivated by the following lemma, which is Theorem 5 in Fagin [1977].

LEMMA 2.3.1. Let $U$ be a set of attributes and $R \subseteq U$. Let $M$ be a set of MVDs over $U$ and $F$ be a set of FDs over $U$. Let $X \subseteq R$, $Z \subseteq U$, and $Y = Z \cap R$. If $M \cup F$ implies $X \twoheadrightarrow Z$ on $U$, then $M \cup F$ implies $X \twoheadrightarrow Y$ on $R$.

PROOF. Fagin [1977]. □
Example 2.3.1. Figure 4 shows a given set of attributes U, a given set of FDs F over U, and a given set of MVDs M over U. All the FDs in F in Figure 4 hold for the scheme tree T in Figure 3, as do all the FDs implied by M ∪ F. Not all the MVDs in M hold for T. In particular, neither Hobby ↠ Hobby-Equipment nor Prof ↠ Hobby Hobby-Equipment holds for T. Because Hobby Hobby-Equipment ∩ Aset(T) = Hobby, however, Prof ↠ Hobby does hold for T. Although Prof ↠ Hobby holds for T, observe that it is not implied by M ∪ F on U.

As illustrated in Example 2.3.1, certain MVDs hold for a relation scheme even when they are not implied by a given set of FDs and MVDs. It is those that hold that are of interest to us.

We now return to our task of defining redundancy. Because our definition depends on the validity of a nested relation, however, we must first define what it means for a relation to be valid for a given set of FDs and MVDs.
Definition 2.3.2. Let $U$ be a set of attributes. Let $M$ be a set of MVDs over $U$ and $F$ be a set of FDs over $U$. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. Let $r$ be a nested relation on $T$. Nested relation $r$ is valid with respect to $M \cup F$ if in the total unnesting of $r$, every FD and every MVD that holds for $T$ with respect to $M$ and $F$ is satisfied.

We now define redundancy. The definition has two parts: FD redundancy and MVD redundancy.
Definition 2.3.3. Let $U$ be a set of attributes. Let $M$ be a set of MVDs over $U$ and $F$ be a set of FDs over $U$. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. Let $XY \subseteq Aset(T)$, and let $X \to Y$ be an FD, or $X \twoheadrightarrow Y$ an MVD, that holds for $T$ with respect to $M$ and $F$ and has an attribute $A \in Y$ with $A \notin X$. Let $r$ be a nonempty nested relation on $T$ that is valid with respect to $M \cup F$. Let $S$ be a subtree of $T$ that contains $A$ as an atomic attribute, and let $s_1, \ldots, s_n$ be the nested relations over $S$ in $r$. Let $u_1$ and $u_2$ be distinct nested tuples of $s_i$ and $s_j$, respectively, $1 \le i, j \le n$, such that $u_1(A) = v$, $u_2(A) = v'$, and $v = v'$. (Note that $i = j$ is possible, so that $s_i$ and $s_j$ may either be the same nested relation under $S$ or be different nested relations under $S$.) Let $t_1$ and $t_2$ be distinct tuples in the total unnesting of $r$ such that $t_1(Aset(S))$ and $t_2(Aset(S))$ are tuples in the total unnesting of $u_1$ and $u_2$, respectively.

- FD redundancy, when the dependency is the FD $X \to Y$: If $t_1(X) = t_2(X)$, then atomic value $v$ is a redundant atomic value in $r$ caused by $X \to Y$.

- MVD redundancy, when the dependency is the MVD $X \twoheadrightarrow Y$: If $t_1(X) = t_2(X)$, $t_1(Y) = t_2(Y)$, and $t_1(Z) \ne t_2(Z)$, where $Z = Aset(T) - (XY)$, then atomic value $v$ is a redundant atomic value in $r$ caused by $X \twoheadrightarrow Y$.
Example 2.3.2. Let Student → Dept and consider the nested relation and its total unnesting in Figure 5. Both appearances of Math are redundant in both the nested and unnested relation. We can see this formally as follows. Let t_1 be the third tuple and t_2 be the last tuple in the unnested relation. Now we have t_1(Student) = t_2(Student), and thus Math in the third tuple of the unnested relation is redundant. Because Math in the third tuple of the unnested relation comes from the first nested tuple in the nested relation, Math in the first nested tuple of the nested relation is redundant. By reversing t_1 and t_2, we can see formally that Math in the second nested tuple of the nested relation and in the last tuple of the unnested relation are also redundant.

It is possible for a value not to be redundant in a nested relation and yet be redundant in the total unnesting of the relation. Indeed, this is often the reason we create nested relations: to remove redundant values.
Example 2.3.3. Suppose Student ↠ Interest and we allow students to have multiple majors. Now consider the nested relation and its total unnesting in Figure 6. Observe that Skiing is redundant in the unnested relation. However, in the nested relation, Skiing is not redundant because it appears only once.
Student → Dept

Interest (Dept (Student)*)*

Skiing   CS     {Barker, Adams}
         Math   {Carter}
Travel   CS     {Lee}
         Math   {Carter}

Interest  Dept  Student
Skiing    CS    Barker
Skiing    CS    Adams
Skiing    Math  Carter
Travel    CS    Lee
Travel    Math  Carter

Fig. 5. Redundancy caused by an FD.

Student ↠ Interest

Interest (Dept (Student)*)*

Skiing   CS     {Barker, Adams}
         Math   {Barker}
Travel   Math   {Carter}

Interest  Dept  Student
Skiing    CS    Barker
Skiing    CS    Adams
Skiing    Math  Barker
Travel    Math  Carter

Fig. 6. Elimination of redundancy by nesting.
Example 2.3.4. Let Prof ↠ Dept Student and Prof ↠ Hobby Hobby-Equipment and consider the flat relation in Figure 7. Because the scheme of the relation in Figure 7 is Prof Student Hobby, and Prof ↠ Dept Student and Prof ↠ Hobby Hobby-Equipment are given, by Lemma 2.3.1 Prof ↠ Student and Prof ↠ Hobby hold for the scheme Prof Student Hobby. But now all the data values under Student and Hobby are redundant, as can be seen formally by appropriately picking two distinct tuples and choosing which attribute and value to consider. For example, let t_1 be the first tuple and t_2 be the second tuple; then Young in t_1 is redundant because t_1(Prof) = t_2(Prof), t_1(Student) = t_2(Student), and t_1(Hobby) ≠ t_2(Hobby).

As an aside, we observe here that by the common definition of 4NF found in most textbooks (e.g., [Korth and Silberschatz 1991; Maier 1983]) the relation scheme in Figure 7 is in 4NF. This is because no nontrivial MVD, given or implied, applies to the scheme, where "applies" means that the set of attributes that constitutes the MVD is a subset of the scheme. In particular for Example 2.3.4, neither Prof ↠ Student nor Prof ↠ Hobby is implied by or is in the given set of MVDs {Prof ↠ Dept Student, Prof ↠ Hobby Hobby-Equipment}. According to the original definition given by Fagin [1977], however, the relation scheme in Figure 7 is not in 4NF. Fagin's definition not only considers all nontrivial MVDs that are given or implied (without regard to the scheme under consideration), but also the MVDs that hold when the scheme is considered.
Example 2.3.5. To show an example of a nested relation (with embedded relations) that has redundancy caused by an MVD, we present one more example of redundancy. Let U = {Prof, Article-Title, Publication-Location} and let Prof ↠ Article-Title and Article-Title ↠ Prof be the MVDs. (Note that Example 2 in Beeri and Kifer [1986] has exactly the same characteristics.) Consider the nested relation and its total unnesting in Figure 8. Based on the MVD Article-Title ↠ Publication-Location, which holds for the nested relation scheme in Figure 8, all the values under Publication-Location in the nested relation are redundant. We can see formally, for example, that the last Hong Kong value under (Publication-Location)* is redundant by letting t_1 be the last tuple and t_2 be the 4th tuple in the unnested relation. Thus t_1(Article-Title) = t_2(Article-Title), t_1(Publication-Location) = t_2(Publication-Location), and t_1(Prof) ≠ t_2(Prof).

Definition 2.3.3 tells us what it means for an individual atomic value to be redundant in a nested relation for an FD or MVD that holds. Our next definition ties together the notion of a redundant data value in a nested relation and the notion of a nested relation scheme that permits valid nested relations that contain redundancy. It is this definition that allows us to later show that our normal form definition detects redundancy.

Definition 2.3.4. Let $U$ be a set of attributes. Let $M$ be a set of MVDs over $U$ and $F$ be a set of FDs over $U$. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. $T$ is said to have potential redundancy with respect to $M \cup F$ if there exists a redundant atomic value in any valid nested relation for $T$ caused by either an FD or an MVD that holds for $T$ with respect to $M$ and $F$.
Prof ↠ Dept Student
Prof ↠ Hobby Hobby-Equipment

Prof  Student  Hobby
Jane  Young    Reading
Jane  Young    Skiing
Jane  Barker   Reading
Jane  Barker   Skiing

Fig. 7. A flat relation with redundancy.

Prof ↠ Article-Title
Article-Title ↠ Prof

Prof (Article-Title)* (Publication-Location)*

Steve   {Programming in C++, Programming in Ada}   {USA, Hong Kong}
Pat     {Programming in Ada}                       {USA, Hong Kong}

Prof   Article-Title       Publication-Location
Steve  Programming in C++  USA
Steve  Programming in Ada  USA
Steve  Programming in C++  Hong Kong
Steve  Programming in Ada  Hong Kong
Pat    Programming in Ada  USA
Pat    Programming in Ada  Hong Kong

Fig. 8. Redundancy caused by an MVD.

Example 2.3.6. Because the nested relations in Figures 5, 7, and 8 all have redundancy, the nested relation schemes in Figures 5, 7, and 8 are all said to have potential redundancy with respect to the FDs and MVDs given for the examples.
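For a flat relation, Definition 2.3.3 reduces to a search over pairs of distinct tuples, since the only subtree is the scheme itself and each nested tuple is its own unnesting. The following sketch covers only that flat special case and assumes the relation is already valid; the function names are hypothetical, and the data is the relation of Figure 7.

```python
def project(t, attrs):
    """Project a flat tuple (a dict) onto a set of attributes."""
    return tuple(t[a] for a in sorted(attrs))


def redundant_values(rel, attrs, X, Y, kind):
    """Return the (row index, attribute) positions marked redundant by the
    dependency X -> Y (kind='fd') or X ->> Y (kind='mvd'), flat case only."""
    Z = set(attrs) - set(X) - set(Y)
    found = set()
    for i, t1 in enumerate(rel):
        for j, t2 in enumerate(rel):
            if i == j or t1 == t2:
                continue  # t1 and t2 must be distinct tuples
            if project(t1, X) != project(t2, X):
                continue  # both cases require t1(X) = t2(X)
            if kind == "mvd" and not (
                project(t1, Y) == project(t2, Y) and project(t1, Z) != project(t2, Z)
            ):
                continue  # MVD case also needs t1(Y) = t2(Y) and t1(Z) != t2(Z)
            for A in set(Y) - set(X):
                if t1[A] == t2[A]:  # same value v under attribute A
                    found.add((i, A))
    return found


ATTRS = {"Prof", "Student", "Hobby"}
fig7 = [
    {"Prof": "Jane", "Student": "Young", "Hobby": "Reading"},
    {"Prof": "Jane", "Student": "Young", "Hobby": "Skiing"},
    {"Prof": "Jane", "Student": "Barker", "Hobby": "Reading"},
    {"Prof": "Jane", "Student": "Barker", "Hobby": "Skiing"},
]

by_student = redundant_values(fig7, ATTRS, {"Prof"}, {"Student"}, "mvd")
by_hobby = redundant_values(fig7, ATTRS, {"Prof"}, {"Hobby"}, "mvd")
```

On this data, every Student value is flagged through Prof ↠ Student and every Hobby value through Prof ↠ Hobby, which agrees with the observation in Example 2.3.4. Handling nested relations would additionally require tracking which nested tuple each flat tuple was unnested from, which this sketch does not attempt.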
3. NESTED NORMAL FORM

We motivate the need for a new normal-form definition for nested relations by making certain observations about the examples we have presented. First, if we are given the FDs and MVDs in Figure 4, none of the earlier normal-form definitions [Ozsoyoglu and Yuan 1987, 1989; Roth and Korth 1987] allow the
scheme of the nested relation in Figure 1, which is also equivalent to the scheme tree T in Figure 3. They therefore do not allow the nested relation in Figure 1 even though it is a good clustering for this application and has no redundancy. For a scheme tree T to satisfy the earlier normal-form definitions, T must satisfy four conditions. It turns out that T in Figure 3 does not satisfy the fourth condition for any of these previous definitions. One requirement of the fourth condition insists that the root of a scheme tree be the left-hand side of a reduced nontrivial MVD, but all (implied) MVDs with Dept Chair as the left-hand side are trivial. In fact, this is not the only violation. In particular, Matriculation cannot be an inner node of T in Figure 3. For the definitions in Ozsoyoglu and Yuan [1987, 1989], there are even partial MVDs in T because of the edges (Prof, Hobby) and (Student, Interest). Because there are unnecessary conditions in these previous normal form definitions, they all restrict attribute clustering and design flexibility, as these examples show. In fact, these conditions can lead to unnecessary decompositions of schemes.

Second, all the earlier definitions [Ozsoyoglu and Yuan 1987, 1989; Roth and Korth 1987] allow the scheme of the nested relation in Figure 8, but as pointed out in Example 2.3.5, the nested relation has redundancy. We can see that the earlier definitions allow the scheme of the nested relation in Figure 8 as follows. Let T be the scheme tree for the scheme of the nested relation in Figure 8, and assume we are given the set of MVDs M = {Prof ↠ Article-Title, Article-Title ↠ Prof} and the empty set of FDs. When there are no FDs, all three earlier definitions are equivalent. Now observe that MVD(T) = {Prof ↠ Article-Title, Prof ↠ Publication-Location}. Because M implies MVD(T), the first condition of their definitions is satisfied. Because M does not imply an MVD with a left-hand side that is a proper subset of Prof, T has no partial MVDs, and thus their second condition is satisfied. Because Article-Title ↠ Prof, T has no transitive MVDs, and thus their third condition is satisfied. Each node in the scheme tree for Figure 8 is a single attribute; therefore there can be no decomposition of nodes, and thus their fourth condition is satisfied.

We now give our definition for Nested Normal Form.
Definition 3.1. Let $U$ be a set of attributes. Let $M$ be a set of MVDs over $U$ and $F$ be a set of FDs over $U$. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. $T$ is in Nested Normal Form (NNF) with respect to $M \cup F$ if the following conditions are satisfied.

(1) If $D$ is the set of MVDs and FDs that hold for $T$ with respect to $M \cup F$, then $D$ is equivalent to $MVD(T) \cup FD(T)$ on $Aset(T)$.

(2) For each nontrivial FD $X \to A$ that holds for $T$ with respect to $M \cup F$, $X \to Ancestor(N_A)$ also holds with respect to $M \cup F$, where $N_A$ is the node in $T$ that contains $A$.

Example 3.1. Suppose we are given U, F, and M as in Figure 4. Then the scheme tree T in Figure 3 is in NNF. We can see this from our definition as follows. We have observed in Example 2.3.1 that Hobby ↠ Hobby-Equipment does not hold for Aset(T). The set of MVDs and FDs that do hold for T is equivalent to F ∪ {Student ↠ Interest, Prof ↠ Hobby} considered on Aset(T). This set is thus equivalent to D in the NNF definition. MVD(T) is the set of MVDs in Figure 3, and we can let F in Figure 4 be FD(T). We can convince ourselves that Condition 1 is satisfied by applying a few standard MVD and FD derivation rules. For example, we can derive Prof ↠ Hobby from MVD(T) and FD(T) by using the FDs in FD(T) to obtain Prof → Dept Chair Prof, converting this derived FD into an MVD, and then applying transitivity with Dept Chair Prof ↠ Hobby in MVD(T) to obtain Prof ↠ Hobby. To convince ourselves that Condition 2 is satisfied, we consider Student → Matriculation, which holds for T, and observe that Ancestor(Matriculation) = Matriculation Prof Dept Chair and that Student → Matriculation Prof Dept Chair is implied. Hence Student → Ancestor(Matriculation). We similarly check each nontrivial FD in FD(T), which is sufficient to ensure that Condition 2 is satisfied.

Example 3.1 not only illustrates NNF in a nontrivial case, but also shows that our definition accepts the nested relation scheme in Figure 1, which we consider to be good, but which is rejected by all the earlier definitions as previously explained. We now continue by giving two more examples, one that violates Condition 1 of NNF and one that violates Condition 2. Our example that violates Condition 1 also shows that NNF detects the problem of the nested relation scheme in Figure 8. It therefore recognizes the scheme that allows redundancy, but is not detected by the earlier definitions.
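The Condition 2 check carried out by hand in Example 3.1 can be sketched with the standard attribute-closure algorithm for FDs. This is an illustration rather than a full NNF test: it uses the FDs alone, so a positive answer is sound, but when MVDs are present a negative answer could miss FDs that follow only through the interaction of FDs and MVDs. The function names are hypothetical, and the FDs are those of Figure 4.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under FDs given as (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs  # lhs is determined, so rhs is determined too
                changed = True
    return result


def condition2_holds(lhs, ancestor_of_a, fds):
    """True if lhs -> Ancestor(N_A) is derivable from the FDs alone."""
    return set(ancestor_of_a) <= closure(lhs, fds)


# F from Figure 4.
FIG4_F = [
    (frozenset({"Student"}), frozenset({"Matriculation"})),
    (frozenset({"Student"}), frozenset({"Prof"})),
    (frozenset({"Prof"}), frozenset({"Dept"})),
    (frozenset({"Dept"}), frozenset({"Chair"})),
]

# Student -> Matriculation, with Ancestor(Matriculation) = Matriculation Prof Dept Chair.
ok = condition2_holds({"Student"}, {"Matriculation", "Prof", "Dept", "Chair"}, FIG4_F)

# Scheme of Figure 5, Interest (Dept (Student)*)*, with F = {Student -> Dept} and no MVDs:
# Ancestor(Dept) = Dept Interest.
FIG5_F = [(frozenset({"Student"}), frozenset({"Dept"}))]
fig5_ok = condition2_holds({"Student"}, {"Dept", "Interest"}, FIG5_F)
```

Here `ok` is true, matching the derivation in Example 3.1, while `fig5_ok` is false because Interest is not in the closure of Student. In the second case no MVDs are given, so the closure is exact and the negative answer is conclusive.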
Example 3.2. Let U = {Prof, Article-Title, Publication-Location}, M = {Prof ↠ Article-Title, Article-Title ↠ Prof}, and F = ∅. As in Figure 8, let T be Prof (Article-Title)* (Publication-Location)*. T does not satisfy Condition 1 because MVD(T) ∪ FD(T), which is {Prof ↠ Article-Title, Prof ↠ Publication-Location}, is not equivalent to the set of FDs and MVDs that hold for T. For example, we cannot derive Article-Title ↠ Prof from {Prof ↠ Article-Title, Prof ↠ Publication-Location}. Thus Condition 1 is not satisfied.

Example 3.3. Let U = {Interest, Dept, Student}, M = ∅, and F = {Student → Dept}. As in Figure 5, let T = Interest (Dept (Student)*)*. T does not satisfy Condition 2 because Student → Dept is a nontrivial FD that holds for T and Ancestor(Dept) = Dept Interest, but Student → Dept Interest does not hold.

4. NNF AND POTENTIAL REDUNDANCY
In this section we prove one of our main results, In particular, we prove that a nested relation whose scheme is in NNF for a given set of FDs and MVDs cannot have redundancy with respect to the given FDs and MVDs. Many of the lemmas here depend on a set of FD and MVD derivation rules. We use the following rules, where X, Y, Z, V, W, and Z’ are all subsets of a set of attributes R:
FD derivation rules:
F1: (reflexivity) Y ⊆ X implies X → Y.
F2: (augmentation) X → Y and V ⊆ W imply XW → YV.
F3: (transitivity) X → Y and Y → Z imply X → Z.
ACM Transactions on Database Systems, Vol. 21, No. 1, March 1996
MVD derivation rules:
M1: (reflexivity) Y ⊆ X implies X →→ Y.
M2: (augmentation) X →→ Y and Z' ⊆ Z imply XZ →→ YZ'.
M3: (transitivity) X →→ Y and Y →→ Z imply X →→ Z − Y.
M4: (complementation) X →→ Y implies X →→ R − (XY).
M5: (trivial complementation) X →→ R − X.
Combined FD and MVD derivation rules:
C1: (replication) X → Y implies X →→ Y.
C2: (coalescence) X →→ Y, Z → W, W ⊆ Y, and Y ∩ Z = ∅ imply X → W.
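As a minimal sketch of how the FD rules are applied mechanically: when no MVDs are given, F1–F3 alone decide FD implication, and the standard attribute-closure computation applies them exhaustively. The Python below is illustrative only; the function names `closure` and `implies` are hypothetical and do not come from the paper. It rechecks Example 3.3, where M = ∅ and F = {Student → Dept}. When MVDs are present, rule C2 can produce further FDs, so this check alone would not be sufficient.

```python
def closure(attrs, fds):
    """Closure of `attrs` under `fds` using only F1-F3.

    `fds` is a list of (lhs, rhs) pairs of frozensets.
    Assumption: no MVDs are given, so C2 never contributes new FDs.
    """
    result = set(attrs)          # F1 (reflexivity): attrs determine themselves
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # F2/F3: once lhs is covered, rhs can be added to the closure
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def implies(fds, lhs, rhs):
    """True if lhs -> rhs follows from `fds` (FDs only)."""
    return set(rhs) <= closure(lhs, fds)


# Example 3.3: F = {Student -> Dept}, Ancestor(Dept) = Dept Interest
F = [(frozenset({"Student"}), frozenset({"Dept"}))]
print(implies(F, {"Student"}, {"Dept"}))               # True
print(implies(F, {"Student"}, {"Dept", "Interest"}))   # False
```

The second check returns False because Interest is not in the closure of Student, which matches the violation of Condition 2 described in Example 3.3.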
The FD and MVD derivation rules above are sound and complete [Beeri et al. 1977], but not minimal. Indeed, part of what we show is that M1 (reflexivity) is not needed, so that without it the derivation rules remain sound and complete. The more common choice, of course, is to retain M1 and omit M5. For our proofs about scheme trees, however, it is often required that our MVDs stretch from root to leaf. We therefore use the alternative choice for trivial MVDs. Because this choice is not common, we prove in Lemma 4.1 that this is possible. In addition, we add a corollary that tailors the lemma only for the case of MVDs.
LEMMA 4.1. Let U be a set of attributes. Let M be a set of MVDs over U and F be a set of FDs over U. Let Z →→ W be an MVD implied by M ∪ F on U. There exists an (M ∪ F)-based derivation sequence for Z →→ W on U that uses only F1–F3, M2–M5, and C1–C2.
PROOF. Because Z →→ W is implied by M ∪ F on U and rules F1–F3, M1–M5, C1, and C2 are sound and complete, there exists a derivation sequence S for Z →→ W on U using these rules. If S does not include M1, we are done. Otherwise, we replace each usage of M1 as follows: X →→ Y, by M1 (reflexivity), where Y ⊆ X, is replaced by the following sequence of derivation rules:
X → Y, by F1 (reflexivity) because Y ⊆ X.
X →→ Y, by C1 (replication). □
COROLLARY. If F = ∅, there exists an M-based derivation sequence for Z →→ W that uses only the MVD rules M2–M5.
PROOF. Because M1–M4 are sound and complete when no FDs are given, there exists a derivation sequence S for Z →→ W that uses only M1–M4. If S does not include M1, we are done. Otherwise, we replace each usage of M1 by the following sequence of derivation rules:
X →→ R − X, by M5 (trivial complementation).
X →→ R − (X(R − X)), by M4 (complementation).
X →→ ∅, because R − (X(R − X)) = ∅.
XY →→ Y, by M2 (augmentation).
X →→ Y, because Y ⊆ X. □
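Because the rules are sound, a claim that some MVD cannot be derived can be confirmed by exhibiting a single flat relation that satisfies the given MVDs but violates the candidate. The following Python sketch does this for Example 3.2, where F = ∅. The helper `satisfies_mvd` and the sample tuples are illustrative assumptions, not taken from the paper; the check uses the usual tuple-swapping definition of MVD satisfaction on a flat relation.

```python
from itertools import product


def satisfies_mvd(rows, schema, lhs, rhs):
    """True if the flat relation `rows` (list of dicts) satisfies lhs ->> rhs.

    For every pair t1, t2 agreeing on lhs, the tuple that takes lhs and rhs
    values from t1 and all remaining attributes from t2 must also be present.
    """
    x = sorted(lhs)
    y = sorted(set(rhs) - set(lhs))
    z = sorted(set(schema) - set(lhs) - set(rhs))
    present = {tuple(r[a] for a in x + y + z) for r in rows}
    for t1, t2 in product(rows, repeat=2):
        if all(t1[a] == t2[a] for a in x):
            mixed = tuple(t1[a] for a in x + y) + tuple(t2[a] for a in z)
            if mixed not in present:
                return False
    return True


SCHEMA = ("Prof", "Article-Title", "Publication-Location")
# Hypothetical sample data: p1 and p2 share article a1 but not locations.
ROWS = [dict(zip(SCHEMA, t)) for t in [
    ("p1", "a1", "l1"), ("p1", "a1", "l2"),
    ("p1", "a2", "l1"), ("p1", "a2", "l2"),
    ("p2", "a1", "l3"),
]]

print(satisfies_mvd(ROWS, SCHEMA, {"Prof"}, {"Article-Title"}))         # True
print(satisfies_mvd(ROWS, SCHEMA, {"Prof"}, {"Publication-Location"}))  # True
print(satisfies_mvd(ROWS, SCHEMA, {"Article-Title"}, {"Prof"}))         # False
```

The sample relation satisfies both MVDs in MVD(T) of Example 3.2 but not Article-Title →→ Prof, so no derivation sequence for the latter from the former can exist.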
Lemma 4.2 guarantees us that if an attribute of a node N in a scheme tree is in the right-hand side, but not the left-hand side, of a nontrivial MVD and if the MVD is implied by the MVDs of the scheme tree, then all the attributes in N are included in the MVD.
LEMMA 4.2. Let U be a set of attributes. Let T be a scheme tree such that Aset(T) ⊆ U. Let XY ⊆ Aset(T). Let X →→ Y be an MVD in MVD(T)⁺ on Aset(T) such that A ∈ Y and A ∉ X. Let A be in node N of T. Then N ⊆ XY.
PROOF. Because X →→ Y is an MVD in MVD(T)⁺ on Aset(T) and MVD(T) consists only of MVDs, there exists an MVD(T)-based derivation sequence S for X →→ Y on Aset(T) that, by the Corollary to Lemma 4.1, uses only the MVD rules M2–M5. We show by induction on the number of MVDs n in S that for every MVD X' →→ Y' in S, if A is an attribute in node N of T such that A ∈ Y' and A ∉ X', then N ⊆ X'Y'. Thus because X →→ Y is in S, N ⊆ XY.
Basis: Suppose n = 1. Because only rules M2–M5 are used and M2–M4 require antecedents, X →→ Y is either given or introduced by M5 (trivial complementation). If X →→ Y is given, then X →→ Y ∈ MVD(T), and thus Y ⊇ N. If X →→ Y is introduced by M5 (trivial complementation), XY = Aset(T), and thus N ⊆ XY.
Induction: X →→ Y can be introduced by any of the MVD derivation rules M2–M5 or as a given MVD in MVD(T), and therefore we have five cases to consider. Because the cases for given MVDs and M5 (trivial complementation) have no antecedents, they are the same as in the basis. Therefore, we can reduce the cases to three.
(1) X' →→ Y' is introduced by M2 (augmentation). Let V →→ W be the antecedent MVD and let Z' ⊆ Z such that X' = VZ and Y' = WZ'. If A ∈ Y' − X', then because Z' ⊆ Z, A ∈ W and A ∉ V. By the induction hypothesis, N ⊆ VW and thus N ⊆ X'Y'.
(2) X' →→ Y' is introduced by M3 (transitivity). Let V →→ W and W →→ Z be the antecedent MVDs. Thus X' = V and Y' = Z − W. Now assume there exists an attribute A in node N of T such that A ∈ Y' and A ∉ X', but N ⊄ X'Y'. Then there exists an attribute B such that B ∈ N and B ∉ X'Y'. Because B ∉ X'Y' and X' = V and Y' = Z − W, B ∉ V and either B ∈ W or B ∈ Aset(T) − (VWZ). Suppose B ∈ W. Then because B ∈ W and B ∉ V, by the induction hypothesis, N ⊆ VW. A ∈ N, therefore A ∈ VW. But because A ∉ X' and X' = V, A ∉ V; and because A ∈ Y' and Y' = Z − W, A ∉ W. Thus A ∉ VW, contradicting A ∈ VW. We therefore suppose that B ∈ Aset(T) − (VWZ). But now we have A ∈ Y', Y' = Z − W, and therefore A ∈ Z, A ∉ W, and A ∈ N. Therefore, by the induction hypothesis, we have N ⊆ WZ. Because B ∈ N, B ∈ WZ. However, B ∈ Aset(T) − (VWZ) and thus B ∉ WZ, a contradiction.
(3) X' →→ Y' is introduced by M4 (complementation). Let V →→ W be the antecedent MVD. Thus X' = V and Y' = Aset(T) − (VW). Now assume there exists an attribute A in node N of T such that A ∈ Y' and A ∉ X', but N ⊄ X'Y'. Then there exists an attribute B such that B ∈ N and B ∉ X'Y'. Hence B ∉ V(Aset(T) − (VW)) and thus B ∈ W and B ∉ V. Because B ∈ N, by the induction hypothesis, N ⊆ VW. A ∈ N, therefore, A ∈ VW. However, because A ∈ Y' and Y' = Aset(T) − (VW), A ∉ VW, a contradiction. □
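To make the objects in Lemma 4.2 concrete, the following Python sketch models a scheme tree and computes Ancestor, Descendant, and MVD(T). It assumes that Ancestor(N) and Descendant(N) include the attributes of N itself and that MVD(T) contains Ancestor(V) →→ Descendant(W) for every edge (V, W) of T. These assumptions are consistent with the values quoted in Examples 3.2 and 3.3, but the class and function names are hypothetical, and `base_case_holds` checks only the given MVDs (the basis of the induction), not derived ones.

```python
class Node:
    """A scheme-tree node holding a set of attributes (hypothetical model)."""

    def __init__(self, attrs):
        self.attrs = frozenset(attrs)
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def ancestor(node):
    """Union of attributes of `node` and all its ancestors (assumed definition)."""
    out = set()
    while node is not None:
        out |= node.attrs
        node = node.parent
    return frozenset(out)


def descendant(node):
    """Union of attributes of `node` and all its descendants (assumed definition)."""
    out = set(node.attrs)
    for child in node.children:
        out |= descendant(child)
    return frozenset(out)


def nodes(root):
    out, stack = [], [root]
    while stack:
        n = stack.pop()
        out.append(n)
        stack.extend(n.children)
    return out


def mvd_of_tree(root):
    """MVD(T) as pairs (lhs, rhs), one per edge (V, W): Ancestor(V) ->> Descendant(W)."""
    return {(ancestor(v), descendant(w)) for v in nodes(root) for w in v.children}


def base_case_holds(root):
    """For each given MVD X ->> Y: any node with an attribute in Y - X lies in XY."""
    for x, y in mvd_of_tree(root):
        for n in nodes(root):
            if (n.attrs & y) - x and not n.attrs <= (x | y):
                return False
    return True


# Example 3.3: T = Interest(Dept(Student)*)*
interest = Node({"Interest"})
dept = interest.add(Node({"Dept"}))
student = dept.add(Node({"Student"}))
print(sorted(ancestor(dept)))     # ['Dept', 'Interest']
print(base_case_holds(interest))  # True
```

Building the tree of Example 3.2 in the same way, with Prof as the root and Article-Title and Publication-Location as its children, yields exactly the two MVDs listed there.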
Lemma 4.3 extends Lemma 4.2 to guarantee not only that a node is included, but also that all ancestors and descendants of the node are included. That is, Lemma 4.3 guarantees us that if an attribute of a node in a scheme tree is in the right-hand side of a nontrivial MVD (but not in the left-hand side), and if the MVD is implied by the MVDs of the scheme tree, then both the ancestors and the descendants of the node are included in the MVD.
LEMMA 4.3. Let U be a set of attributes. Let T be a scheme tree such that Aset(T) ⊆ U. Let XY ⊆ Aset(T). Let X →→ Y be a nontrivial MVD in MVD(T)⁺ on Aset(T). Let A be an attribute such that A ∈ Y and A ∉ X, and let A be in node N of T. Then both Ancestor(N) ⊆ XY and Descendant(N) ⊆ XY.
PROOF. As in Lemma 4.2, we show by induction on the number of MVDs n in the MVD(T)-based derivation sequence S for X →→ Y on Aset(T) that for every MVD X' →→ Y' in S, if A is an attribute such that A ∈ Y' and A ∉ X', and if A is in node N of T, then both Ancestor(N) ⊆ X'Y' and Descendant(N) ⊆ X'Y'.
Basis: Suppose n = 1. Because S has no MVD introduced by M1 (reflexivity) and there is only one MVD X →→ Y in S, X →→ Y is given or is introduced by M5 (trivial complementation). If X →→ Y is given, then X →→ Y ∈ MVD(T). As argued in Lemma 4.2, Y ⊇ N, and since X →→ Y ∈ MVD(T), therefore both Ancestor(N) ⊆ XY and Descendant(N) ⊆ XY. If X →→ Y is introduced by M5 (trivial complementation), XY = Aset(T). Hence every node is a subset of XY, and thus both Ancestor(N) ⊆ XY and Descendant(N) ⊆ XY.
Induction: As in Lemma 4.2, we have only three cases to consider.
(1) X' →→ Y' is introduced by M2 (augmentation). The argument is similar to the proof of Case 1 in Lemma 4.2.
(2) X' →→ Y' is introduced by M3 (transitivity). Let V →→ W and W →→ Z be the antecedent MVDs. Thus X' = V and Y' = Z − W. Let A be an attribute in node N of T such that A ∈ Y' and A ∉ X'. Hence, by Lemma 4.2, N ⊆ X'Y'. We claim that Ancestor(N) ⊆ X'Y'. If not, then there exists an attribute B ∈ Ancestor(N) such that B ∉ X'Y'. Because B ∉ X'Y' and X' = V and Y' = Z − W, B ∉ V(Z − W). Thus B ∉ V and either B ∈ W or B ∈ Aset(T) − (VWZ). We first assume that B ∈ W. Because B ∉ V and B ∈ W, by Lemma 4.2, B is in a node N' such that N' ⊆ VW. B ∈ Ancestor(N) and B ∈ N', thus N ⊆ Descendant(N') and A ∈ Descendant(N'). By the induction hypothesis, Descendant(N') ⊆ VW. Thus A ∈ Descendant(N'), and therefore A ∈ VW. But because A ∈ Y' and Y' = Z − W, A ∉ W, and because A ∉ X' and X' = V, A ∉ V. Thus A ∉ VW, a contradiction. Thus we assume that B ∈ Aset(T) − (VWZ). Because A ∈ Y' and Y' = Z − W, A ∈ Z and A ∉ W. Thus by the induction hypothesis, Ancestor(N) ⊆ WZ. B ∈ Ancestor(N) and Ancestor(N) ⊆ WZ, and therefore B ∈ WZ. Thus B ∉ Aset(T) − (VWZ), a contradiction. By an identical argument with Descendant replacing Ancestor and vice versa, Descendant(N) ⊆ X'Y'.
(3) X' →→ Y' is introduced by M4 (complementation). The argument is similar to the proof of Case 3 in Lemma 4.2. □
Lemma 4.4 tells us that the set D of dependencies that holds for a scheme tree T is the closure of itself on Aset(T).
LEMMA 4.4. Let U be a set of attributes. Let M be a set of MVDs over U and F be a set of FDs over U. Let T be a scheme tree such that Aset(T) ⊆ U. Let D be the set of dependencies in (M ∪ F)⁺ that hold for T. Then D⁺ = D on Aset(T).
PROOF. The strategy for proving this result is to show that none of the inference rules can add a new dependency that is not already in D. All the cases except for C2 are straightforward, therefore we just prove that C2 cannot add a new FD. Let X →→ Y be an MVD in D, and let Z → W be an FD in D such that W ⊆ Y and Y ∩ Z = ∅. Thus X →→ Y and Z → W hold for T. Because X →→ Y holds for T, XY ⊆ Aset(T) and there exists an MVD X →→ Y' in (M ∪ F)⁺ such that Y = Y' ∩ Aset(T). Z → W holds for T, and therefore Z → W is in (M ∪ F)⁺ and ZW ⊆ Aset(T). Because W ⊆ Y and Y ⊆ Y', W ⊆ Y'. Y = Y' ∩ Aset(T), Z ⊆ Aset(T), and Y ∩ Z = ∅, therefore Y' ∩ Z = ∅. Hence X → W is in (M ∪ F)⁺. Because XW ⊆ Aset(T), X → W is already in D. □
Lemma 4.5 provides an interesting result: the given FDs can be disregarded if we are only interested in certain implied MVDs. In particular, if MVD(T) and FD(T) imply an MVD X →→ Y, then if we close the left-hand side of the MVD under MVD(T) and FD(T) on Aset(T) to obtain X⁺, MVD(T) alone is sufficient to imply X⁺ →→ Y. The converse also holds, and although we do not need the converse for Theorem 4.1, we provide it here because we need it later for a lemma required for Theorem 5.2.
LEMMA 4.5. Let U be a set of attributes. Let M be a set of MVDs over U and F be a set of FDs over U. Let T be a scheme tree such that Aset(T) ⊆ U. Let XY ⊆ Aset(T). If MVD(T) ∪ FD(T) implies X →→ Y on Aset(T), then MVD(T) implies X⁺ →→ Y on Aset(T), where X⁺ is the closure of X under MVD(T) ∪ FD(T), and conversely.
PROOF. The result can be proved by using Theorem 1 in Beeri and Kifer [1986]. □
Lemma 4.6 begins to address the redundancy issue in nested relations directly. We use it twice in Theorem 4.1, and thus we write it separately as a lemma. Before stating and proving Lemma 4.6, we need a definition for a path in a scheme tree.
Definition 4.1. A path of a scheme tree T is a sequence of nodes N₁, ..., Nₙ where N₁ is the root of T and Nₙ is a leaf node of T and Nᵢ is the parent of Nᵢ₊₁, 1