IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,
VOL. 14,
NO. 2,
MARCH/APRIL 2002
369
A Comparative Study of Various Nested Normal Forms
Wai Yin Mok, Member, IEEE

Abstract: As object-relational databases (ORDBs) become popular in the industry, it is important for database designers to produce database schemes with good properties in these new kinds of databases. One distinguishing feature of an ORDB is that its tables may not be in first normal form. Hence, ORDBs may contain nested relations along with other collection types. To help the design process of an ORDB, several normal forms for nested relations have recently been defined, and some of them are called nested normal forms. In this paper, we investigate four nested normal forms, which are NNF [20], NNF [21], NNF [23], and NNF [25], with respect to generalizing 4NF and BCNF, reducing redundant data values, and design flexibility. Another major contribution of this paper is that we provide an improved algorithm that generates nested relation schemes in NNF [20] from an $\alpha$-acyclic database scheme, which is the most general type of acyclic database scheme. After presenting the algorithm for NNF [20], the algorithms of all four nested normal forms and the nested database schemes that they generate are compared. We discovered that when the given set of MVDs is not conflict-free, NNF [20] is inferior to the other three nested normal forms in reducing redundant data values. However, in all of the other cases considered in this paper, NNF [20] is at least as good as the other three nested normal forms.

Index Terms: Object-relational database management systems, object-relational databases, SQL:1999, nested relation schemes, nested relations, nested database schemes, nested databases, nested normal forms, conflict-free sets of MVDs, acyclic database schemes, nested database design, data redundancy, design flexibility, algorithms.
1 INTRODUCTION
Since data are becoming more and more complex, operations on complex data will become commonplace in the future. In order to support operations on complex data, it is generally believed that the next generation of database management systems (DBMSs) not only needs a better representation of the real world, but it also needs many of the tested and proven services provided by conventional DBMSs [6], [24], [29], [27]. Object-orientation, with its rich collection of modeling concepts, such as abstraction, encapsulation, and inheritance, has been proven to be an effective way of representing the real world [3]. Also, a considerable amount of knowledge of DBMSs has been discovered in the last three decades. Hence, an obvious way to develop the next generation of DBMSs is to combine these two technologies.
SQL:1999, formerly known as SQL3, is a new standard of SQL. SQL:1999 incorporates many object-oriented concepts [5], [14]. For example, SQL:1999 allows the definition of user-defined types that can participate in supertype/subtype relationships and provides type constructors for collection types such as arrays and rows. In addition, SQL:1999 also allows triggers to be defined. If a database system conforms to SQL:1999, it will have the capabilities to support object-oriented data management. This type of database system is usually called an object-relational database management system (ORDBMS) [4], [24], [29].
The author is with the Department of Accounting and Information Systems, University of Alabama in Huntsville, Huntsville, AL 35899. E-mail: [email protected]. Manuscript received 19 Jan. 1999; revised 17 Jan. 2000; accepted 20 Oct. 2000; posted to Digital Library 10 Sept. 2001. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 109011.
Examples of ORDBMSs include Informix Universal Server, IBM DB/2, UniSQL, and Oracle8. As we mentioned earlier, a distinguishing feature of an ORDB is that it supports a rich type system which would enable us to declare collection types. A particular collection type of interest in this paper is the nested relation; that is, a relation which may be embedded in another relation. In fact, many commercial ORDBMSs are already supporting nested relations [15], [31]. Since ORDBMSs support nested relations, it is imperative for database designers to be able to design nested databases with good properties. In the past, numerous normal forms have been defined for flat relations so that if a flat relation scheme satisfies a certain normal form, then the relations on that scheme will enjoy the properties of the normal form. For a long time, database designers have been using these normal forms as guides for flat relational database design. In the same spirit, numerous normal forms have recently been defined for nested relations as well [20], [21], [23], [25], [26]. Among all of these cited normal forms, Partition Normal Form (PNF), which is defined in [26], is the most fundamental. In essence, PNF basically states that in a nested relation, there can never be distinct tuples that agree on the atomic attributes of either the nested relation itself or of any nested relation embedded within it [26]. Since this is a basic property of nested relations, the nested normal forms defined in [20], [21], [23], [25] all imply PNF. As guides for database design, normal forms should be used with caution. Database designers should understand the strengths and the weaknesses of a normal form in order to use it intelligently. As an example, it is well known that BCNF is able to remove redundancy caused by FDs;
1041-4347/02/$17.00 © 2002 IEEE
Fig. 1. A nested relation.
however, it is not dependency preserving. On the other hand, 3NF is dependency preserving but it is not able to remove redundancy caused by FDs in all cases. Knowing information like this is in fact vital for a successful database design. Hence, the main purpose of this paper is to compare the nested normal forms defined in [20], [21], [23], [25], and to discover their strengths and weaknesses. In particular, we investigate them with respect to generalizing 4NF and BCNF, reducing redundant data values, and design flexibility. In addition, we also examine their algorithms and the nested database schemes that they generate. Notice that the comparison among NNF [21], NNF [23], and NNF [25] is already done, as presented in [23], [25], and the results are not reproduced here. Another major contribution of this paper is that we provide an algorithm that generates nested relation schemes in the nested normal form defined in [20] from an $\alpha$-acyclic database scheme, which is the most general type of acyclic database scheme. After presenting this algorithm, the algorithms of all four nested normal forms and the nested database schemes that they generate are examined. Here, we would like to recognize some other normal forms defined for nested relations, such as the ones defined in [16], [30]. However, they are not included in this paper because the normal form defined in [30] mainly deals with semantic issues as opposed to removing redundancy; the one defined in [16] is based on extended FDs which involve single-valued and nested-valued attributes. Further, the nested relation schemes defined in [16] are not represented as trees. On the other hand, the FDs and MVDs considered in this paper are ordinary. In the following, for brevity, we abbreviate the nested normal form defined in [20] as NNF [20]. Similarly, the
Fig. 2. The total unnesting of the nested relation in Fig. 1.
nested normal form defined in [21] is abbreviated as NNF [21], the one defined in [23] as NNF [23], and the one defined in [25] as NNF [25]. This paper is organized as follows: We present some of the motivations for nested relations in Section 2. In Section 3, we present the basic definitions and concepts used in this paper. In Section 4, we present the algorithm for NNF [20]. The nested normal forms are then rigorously compared in Section 5 and we summarize our results in Section 6.
2 MOTIVATIONS
In this section, we present some of the motivations for nested relations and their normal forms. In Fig. 1, we can find a nested relation. The corresponding flat relation that stores the same data is shown in Fig. 2. We first note that the number of data values in the nested relation is fewer than the number of data values in the corresponding flat relation. Advanced data storage techniques, such as the ones used in Oracle and Jasmine [13], [31], can be used to store the tuples in the nested relation directly, thus reducing data storage. Second, object compositions occur frequently in ORDBs in which objects may have subparts and, in turn, a subpart may have its own subparts. Nested relations can be used to represent such object compositions. As an example, the hierarchy of data represented by the scheme of the nested relation in Fig. 1 is shown in Fig. 3. From this tree structure, it is easy to see the composition of the objects. Third, using nested relations can also avoid joins, which are the most expensive operations. For example, the data in the nested relation in Fig. 1 are organized with respect to departments. As we retrieve the data for a particular department, the query system may just traverse down the hierarchy without taking any join. Fourth, as we mentioned in Section 1,
Fig. 3. Scheme tree $T$, $Aset(T)$, and $MVD(T)$ of the nested relation scheme in Fig. 1.
SQL:1999 allows relations not in first normal form. In addition to other collection types such as arrays and rows, nested relations will be supported by DBMSs that conform to this new standard of SQL. A good nested relation scheme denotes a good clustering of attributes. In fact, the tree in Fig. 3, which corresponds precisely to the scheme of the nested relation in Fig. 1, represents a hierarchy of data constructed from the point of view of department. As discussed in [20], this hierarchy of data represents a good clustering of attributes. In addition, the nested relations on this scheme do not have data redundancy caused by the MVDs and FDs in Fig. 4. As an example, as we nest the flat relation in Fig. 2 to obtain the nested relation in Fig. 1, we are able to remove the redundant data values caused by the MVDs and FDs in Fig. 4. In fact, the nested relation scheme in Fig. 1 is in NNF [20]. Observe further that none of the sets of attributes in the two paths of the tree is in 3NF with respect to the MVDs and FDs in Fig. 4, and thus they are not in BCNF or 4NF. Hence, from this example, we can see that normal forms for flat relations are not appropriate in designing nested relation schemes. Certainly, we need normal forms designed specifically for nested relations.
Fig. 4. Some given MVDs and FDs over a set of attributes.
3 BASIC CONCEPTS AND TERMINOLOGY
We first present some basic definitions, after which, the definition of each of the nested normal forms is presented.
3.1 Nested Relation Schemes and Nested Relations
The following definitions of nested relation schemes, nested relations, and scheme trees are adapted from [20]. However, any equivalent definitions of these concepts, such as those in [21], [23], [25], can be used as well. A nested relation allows each tuple component to be either atomic or another nested relation, which may itself be nested several levels deep.

Definition 1. Let $U$ be a set of attributes. A nested relation scheme is recursively defined as follows:
1. If $X$ is a nonempty subset of $U$, then $X$ is a nested relation scheme over the set of attributes $X$.
2. If $X, X_1, \ldots, X_n$ are pairwise disjoint, nonempty subsets of $U$, and $R_1, \ldots, R_n$ are nested relation schemes over $X_1, \ldots, X_n$, respectively, then $X (R_1)^* \ldots (R_n)^*$ is a nested relation scheme over $X X_1 \ldots X_n$.
Definition 2. Let $R$ be a nested relation scheme over a nonempty set of attributes $Z$. Let the domain of an attribute $A \in Z$ be denoted by $dom(A)$. A nested relation over $R$ is recursively defined as follows:
1. If $R$ has the form $X$, where $X$ is a set of attributes $\{A_1, \ldots, A_n\}$, $n \geq 1$, then $r$ is a nested relation over $R$ if $r$ is a (possibly empty) set of functions $\{t_1, \ldots, t_m\}$, where each function $t_i$, $1 \leq i \leq m$, maps $A_j$ to an element in $dom(A_j)$, $1 \leq j \leq n$.
2. If $R$ has the form $X (R_1)^* \ldots (R_m)^*$, $m \geq 1$, where $X$ is a set of attributes $\{A_1, \ldots, A_n\}$, $n \geq 1$, then $r$ is a nested relation over $R$ if
   a. $r$ is a (possibly empty) set of functions $\{t_1, \ldots, t_p\}$, where each function $t_i$, $1 \leq i \leq p$, maps $A_j$ to an element in $dom(A_j)$, $1 \leq j \leq n$, and maps $R_k$ to a nested relation over $R_k$, $1 \leq k \leq m$, and
   b. $t_i \in r$ and $t_j \in r$ and $t_i(X) = t_j(X)$ implies $t_i = t_j$, $1 \leq i, j \leq p$.
Each function of a nested relation $r$ over nested relation scheme $R$ is a nested tuple of $r$.

Several observations can be made about Definitions 1 and 2. First, flat relation schemes are also nested relation schemes. Second, any two distinct embedded nested relation schemes do not have any attribute in common. For example, a nested relation scheme such as A (C)* (B (C)*)* is not allowed. Third, because of Condition 2b, every nested relation in this paper is in PNF [26].

Example 1. Fig. 1 shows a nested relation. Its scheme is Dept Chair (Prof (Hobby)* (Matriculation (Student (Interest)*)*)*)* and it contains two nested tuples. Each embedded nested relation also contains nested tuples of its own; this holds, for example, for the embedded nested relations under the scheme Student (Interest)*. Notice that, as required, PNF is satisfied. Thus, the values for the atomic attributes, Dept Chair, differ, and in each embedded nested relation, the atomic values differ.

Definition 3. Let $R$ be a nested relation scheme. Let $r$ be a nested relation on $R$. The total unnesting of $r$ is recursively defined as follows:
1. If $R$ has the form $X$, where $X$ is a set of attributes, then the total unnesting of $r$ is $r$ itself.
2. If $R$ has the form $X (R_1)^* \ldots (R_n)^*$, where $X_i$ is the set of attributes in $R_i$, $1 \leq i \leq n$, then the total unnesting of $r$ is equal to $\{t \mid$ there exists a nested tuple $u \in r$ such that $t(X) = u(X)$ and $t(X_i)$ is a tuple in the total unnesting of $u(R_i)$, $1 \leq i \leq n\}$.
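Definitions 2 and 3 can be illustrated with a minimal Python sketch. The representation is an assumption made only for illustration: a nested tuple is a pair of a dictionary of atomic values and a list of embedded relations, and the names `Scheme`, `satisfies_pnf`, and `total_unnest` as well as the sample data are hypothetical (they are not the contents of Fig. 1).

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scheme:
    attrs: tuple          # atomic attributes X
    children: tuple = ()  # embedded schemes R1, ..., Rn


# A nested tuple is a pair (atomic, subrelations): `atomic` maps each attribute
# in X to a value; `subrelations` holds one nested relation (a list of nested
# tuples) per embedded scheme, in the same order as Scheme.children.

def satisfies_pnf(scheme, relation):
    """Condition 2b: no two nested tuples agree on the atomic attributes."""
    seen = set()
    for atomic, subs in relation:
        key = tuple(atomic[a] for a in scheme.attrs)
        if key in seen:
            return False
        seen.add(key)
        if not all(satisfies_pnf(c, s) for c, s in zip(scheme.children, subs)):
            return False
    return True


def total_unnest(scheme, relation):
    """Definition 3: combine t(X) with one flat tuple from each u(Ri)."""
    rows = []
    for atomic, subs in relation:
        parts = [total_unnest(c, s) for c, s in zip(scheme.children, subs)]
        for combo in product(*parts):  # one flat tuple per embedded relation
            row = dict(atomic)
            for part in combo:
                row.update(part)
            rows.append(row)
    return rows


# Hypothetical data over the scheme Prof (Hobby)* (Student (Interest)*)*
student = Scheme(("Student",), (Scheme(("Interest",)),))
prof = Scheme(("Prof",), (Scheme(("Hobby",)), student))
r = [({"Prof": "Smith"}, [
    [({"Hobby": "Chess"}, []), ({"Hobby": "Golf"}, [])],
    [({"Student": "Ann"}, [[({"Interest": "Art"}, [])]]),
     ({"Student": "Bob"}, [[({"Interest": "Music"}, []),
                            ({"Interest": "Film"}, [])]])],
])]
flat = total_unnest(prof, r)
```

In this sample the nested relation stores 8 data values, whereas its total unnesting has 6 flat tuples of 4 values each, which mirrors the storage argument made in Section 2.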
Example 2. Fig. 2 shows the total unnesting of the nested relation in Fig. 1.

We can graphically represent a nested relation scheme by a tree, called a scheme tree. A scheme tree captures the logical structure of a nested relation scheme and explicitly represents a set of MVDs.

Definition 4. A scheme tree $T$ corresponding to a nested relation scheme $R$ is recursively defined as follows:
1. If $R$ has the form $X$, then $T$ is a single node scheme tree whose root node is the set of attributes $X$.
2. If $R$ has the form $X (R_1)^* \ldots (R_n)^*$, then the root node of $T$ is the set of attributes $X$, and a child of the root of $T$ is the root of the scheme tree $T_i$, where $T_i$ is the corresponding scheme tree for the nested relation scheme $R_i$, $1 \leq i \leq n$.
The one-to-one correspondence between a scheme tree and a nested relation scheme, along with the definition of a nested relation scheme, imposes several properties on a scheme tree. Let $T$ be a scheme tree. We denote the set of attributes in $T$ by $Aset(T)$. Observe that the atomic attributes of a nested relation scheme, at any level of nesting, constitute a node in a scheme tree. Observe further that since Definition 1 requires nonempty sets of attributes, every node in $T$ consists of a nonempty set of attributes. Furthermore, since the sets of attributes corresponding to nodes in $T$ are pairwise disjoint and include all the attributes of $T$, the nodes in $T$ are pairwise disjoint, and their union is $Aset(T)$.

Let $N$ be a node in $T$. Notationally, $Ancestor(N)$ denotes the union of attributes in all ancestors of $N$, including $N$. Similarly, $Descendent(N)$ denotes the union of attributes in all descendants of $N$, including $N$. In a scheme tree $T$, each edge $(V, W)$, where $V$ is the parent of $W$, denotes an MVD $Ancestor(V) \twoheadrightarrow Descendent(W)$. Notationally, we use $MVD(T)$ to denote the set of all the MVDs represented by the edges in $T$. By construction, each MVD in $MVD(T)$ is satisfied in the total unnesting of any nested relation for $T$. Since FDs are also of interest, we use $FD(T)$ to denote any set of FDs equivalent to all FDs $X \rightarrow Y$ implied by a given set of FDs and MVDs over a set of attributes $U$ such that $Aset(T) \subseteq U$ and $XY \subseteq Aset(T)$.

Example 3. Fig. 3 shows the scheme tree $T$ for the scheme of the nested relation in Fig. 1. Fig. 3 also gives the set of attributes in $Aset(T)$ and the set of MVDs in $MVD(T)$. Observe that each of the MVDs in $MVD(T)$ is satisfied in the unnested relation in Fig. 2.

Given a set $D$ of MVDs and FDs over a set $U$ of attributes, and a scheme tree $T$ such that $Aset(T) \subseteq U$, $Aset(T)$ may be a proper subset of $U$. However, $D$ may imply MVDs and FDs that hold for $T$. By Theorem 5 in [7], an MVD $X \twoheadrightarrow Y$ holds for $T$ with respect to $D$ if $X \subseteq Aset(T)$ and there exists a set of attributes $Z \subseteq U$ such that $Y = Z \cap Aset(T)$ and $D$ implies $X \twoheadrightarrow Z$ on $U$. An FD $X \rightarrow Y$ holds for $T$ with respect to $D$ if $XY \subseteq Aset(T)$ and $D$ implies $X \rightarrow Y$ on $U$.

Example 4. Fig. 4 shows a given set $U$ of attributes, a given set of FDs $F$ over $U$, and a given set of MVDs $M$ over $U$. All the FDs in $F$ hold for the scheme tree $T$ in Fig. 3. Not all the MVDs in $M$ hold for $T$, however. In particular, neither Hobby $\twoheadrightarrow$ Hobby-Equipment nor Prof $\twoheadrightarrow$ Hobby Hobby-Equipment holds for $T$. Since Hobby Hobby-Equipment $\cap$ $Aset(T)$ = Hobby, Prof $\twoheadrightarrow$ Hobby does hold for $T$. Although Prof $\twoheadrightarrow$ Hobby does hold for $T$, observe that it is not implied by $M \cup F$ on $U$.

Given a scheme tree $T$, a path of $T$ is a sequence of nodes $N_1, \ldots, N_n$, where $N_1$ is the root of $T$, $N_n$ is a leaf node of $T$, and $N_i$ is the parent of $N_{i+1}$, $1 \leq i \leq n - 1$. Nonetheless, for the sake of convenience, we may sometimes refer to $Ancestor(N_n)$ as a path of $T$ when the context is clear. That is, the set of attributes that appear in a path is sometimes referred to as a path. If $T$ has leaf nodes $N_{l_1}, \ldots, N_{l_m}$, where $m \geq 1$, $Path(T)$ denotes the set $\{Ancestor(N_{l_1}), \ldots, Ancestor(N_{l_m})\}$.

We now define what consistent scheme trees are. Given a set $D$ of MVDs and FDs over a set $U$ of attributes, and a scheme tree $T$ such that $Aset(T) \subseteq U$, $T$ is consistent with $D$ if for each MVD $X \twoheadrightarrow Y$ in $MVD(T)$, $D$ implies an MVD $X \twoheadrightarrow Z$ on $U$ such that $Y = Z \cap Aset(T)$. A scheme tree should be consistent with the given MVDs and FDs; otherwise, its scheme implies an MVD that does not follow from the given MVDs and FDs. Further, consistency of nested relation schemes has been studied in [10]. Hence, only consistent scheme trees are considered in this paper.
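The scheme-tree notation can be made concrete with a short Python sketch. It builds the tree for the scheme given in Example 1 and derives $Ancestor$, $Descendent$, $MVD(T)$, and $Path(T)$ from it. The class `Node` and the function names are hypothetical helpers introduced only for this illustration.

```python
class Node:
    """A scheme-tree node: a nonempty set of attributes plus child nodes."""

    def __init__(self, attrs, children=()):
        self.attrs = frozenset(attrs)
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def ancestor(node):
    """Union of the attributes of `node` and of all its ancestors."""
    attrs = set()
    while node is not None:
        attrs |= node.attrs
        node = node.parent
    return frozenset(attrs)


def descendent(node):
    """Union of the attributes of `node` and of all its descendants."""
    attrs = set(node.attrs)
    for child in node.children:
        attrs |= descendent(child)
    return frozenset(attrs)


def mvd_of_tree(root):
    """One MVD Ancestor(V) ->> Descendent(W) per edge (V, W)."""
    mvds, stack = [], [root]
    while stack:
        v = stack.pop()
        for w in v.children:
            mvds.append((ancestor(v), descendent(w)))
            stack.append(w)
    return mvds


def paths(root):
    """Path(T): the attribute set Ancestor(leaf) for every leaf."""
    if not root.children:
        return {ancestor(root)}
    result = set()
    for child in root.children:
        result |= paths(child)
    return result


# Scheme of Example 1:
# Dept Chair (Prof (Hobby)* (Matriculation (Student (Interest)*)*)*)*
interest = Node({"Interest"})
student = Node({"Student"}, [interest])
matriculation = Node({"Matriculation"}, [student])
hobby = Node({"Hobby"})
prof = Node({"Prof"}, [hobby, matriculation])
root = Node({"Dept", "Chair"}, [prof])

tree_mvds = mvd_of_tree(root)
tree_paths = paths(root)
```

The tree has five edges and therefore five MVDs, and two leaves and therefore two paths, which agrees with the remark in Section 2 about "the two paths of the tree."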
3.2 Conflict-Free Sets of MVDs and Acyclic Database Schemes
Some researchers have claimed that most real-world sets of MVDs are conflict-free and that acyclic database schemes are general enough to encompass most real-world situations [1], [28]. In fact, conflict-free sets of MVDs and acyclic database schemes have numerous desirable properties [1]. Therefore, we would like to examine the normal forms with respect to conflict-free sets of MVDs and acyclic database schemes. Their definitions are now presented.

An MVD $X \twoheadrightarrow Y$ (with $X$ and $Y$ disjoint) splits two attributes $A$ and $B$ if one of them is in $Y$ and the other is in $U - XY$, where $U$ is the set of all the attributes. A set $M$ of MVDs splits $A$ and $B$ if some MVD in $M$ splits them. An MVD (or a set of MVDs) splits a set $X$, where $X \subseteq U$, if it splits two distinct attributes in $X$. Let $D$ be a set of MVDs and FDs over $U$. $LHS(D)$ denotes the set of left-hand sides of the members of $D$. As usual, $DEP_D(X)$ denotes the dependency basis of $X$ with respect to $D$, which is a partition of $U - X$.

Definition 5. A set $M$ of MVDs is conflict-free if
1. $M$ does not split any element in $LHS(M)$.
2. For every $X \in LHS(M)$ and for every $Y \in LHS(M)$, $DEP_M(X) \cap DEP_M(Y) \subseteq DEP_M(X \cap Y)$.

A conflict-free set of MVDs allows a unique 4NF decomposition [1]. We shall use this fact in the proof of Lemma 11. A database scheme $\mathbf{R} = \{R_1, \ldots, R_n\}$ over a set of attributes $U$ is a set of relation schemes where each relation scheme $R_i$ is a subset of $U$ and $\bigcup_{i=1}^{n} R_i = U$. Notice that every database scheme $\mathbf{R}$ corresponds to a unique join dependency, namely, $\bowtie \mathbf{R}$ [17]. A database scheme $\mathbf{R}$ is acyclic if and only if the join dependency $\bowtie \mathbf{R}$ is equivalent to a conflict-free set of MVDs [1]. Also, $\mathbf{R}$ is acyclic if and only if $\mathbf{R}$ has a join tree [1].
Definition 6. Let $\mathbf{R} = \{R_1, \ldots, R_n\}$, $n \geq 1$, be a database scheme. A join tree for $\mathbf{R}$ is a tree where each $R_i$ is a node, and
1. Each edge $(R_i, R_j)$ is labeled by the set of attributes $R_i \cap R_j$, and
2. For every pair $R_i$ and $R_j$ ($R_i \neq R_j$) and for every $A$ in $R_i \cap R_j$, each edge along the unique path between $R_i$ and $R_j$ includes label $A$ (possibly among others).
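The two conditions of Definition 6 can be checked mechanically. The following Python sketch is a direct, unoptimized transcription of the definition; the function name `is_join_tree` and the index-based edge representation are assumptions made for this illustration. The sample database scheme is the one used later in Example 6.

```python
from itertools import combinations


def is_join_tree(schemes, edges):
    """Check Definition 6 for `edges`, given as pairs of indices into `schemes`."""
    n = len(schemes)
    sets = [frozenset(s) for s in schemes]
    if len(edges) != n - 1:          # a tree on n nodes has exactly n - 1 edges
        return False
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)

    def path(src, dst):
        """Return the node sequence from src to dst, or None if disconnected."""
        stack, seen = [(src, [src])], {src}
        while stack:
            node, trail = stack.pop()
            if node == dst:
                return trail
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, trail + [nxt]))
        return None

    for i, j in combinations(range(n), 2):
        trail = path(i, j)
        if trail is None:
            return False
        shared = sets[i] & sets[j]
        # Every edge label (Ra ∩ Rb) on the path must contain Ri ∩ Rj.
        for a, b in zip(trail, trail[1:]):
            if not shared <= (sets[a] & sets[b]):
                return False
    return True


# Database scheme of Example 6; indices 0..3 stand for ABC, CDE, AEF, ACE.
schemes = ["ABC", "CDE", "AEF", "ACE"]
star = [(0, 3), (1, 3), (2, 3)]    # every scheme attached to ACE
chain = [(0, 1), (1, 2), (2, 3)]   # ABC - CDE - AEF - ACE
```

Under these assumptions the star centered on ACE passes the check, while the chain fails: ABC and ACE share {A, C}, but the edge between ABC and CDE is labeled only {C}.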
Let $M$ be a set of MVDs over a set of attributes $U$. Notationally, we use $M^+$ to denote the closure of $M$. $M$ has the intersection property if whenever the MVDs $X \twoheadrightarrow Z$ and $Y \twoheadrightarrow Z$ are implied by $M$ (with $Z$ disjoint from both $X$ and $Y$), then $X \cap Y \twoheadrightarrow Z$ is also implied by $M$. Furthermore, $M$ has the intersection property if and only if $M$ is implied by a join dependency $\bowtie \mathbf{R}$ [1]. We shall use this property in the proof of Theorem 8.
3.3 NNF [20]
We now present NNF [20].

Definition 7. Let $U$ be a set of attributes. Let $M$ be a set of MVDs over $U$ and $F$ be a set of FDs over $U$. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. $T$ is in NNF [20] with respect to $M \cup F$ if the following conditions are satisfied.
1. If $D$ is the set of MVDs and FDs that hold for $T$ with respect to $M \cup F$, then $D$ is equivalent to $MVD(T) \cup FD(T)$ on $Aset(T)$.
2. For each nontrivial FD $X \rightarrow A$ that holds for $T$ with respect to $M \cup F$, $X \rightarrow Ancestor(N_A)$ also holds with respect to $M \cup F$, where $N_A$ is the node in $T$ that contains $A$.
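Condition 2 of Definition 7 can be explored with a small Python sketch. It makes a strong simplifying assumption that is not part of the definition: FD implication is decided from $F$ alone by attribute closure, ignoring FDs that follow only from the interaction of $M$ and $F$. The FDs below are hypothetical, chosen to be compatible with the narrative of Example 5 (Fig. 4 is not reproduced here), and the single-letter attribute names follow the abbreviations introduced in Example 5.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under FDs given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def satisfies_condition_2(ancestor_of, fds):
    """ancestor_of maps each attribute A of the tree to Ancestor(N_A).

    ASSUMPTION: only FDs derivable from `fds` by closure are considered.
    The check enumerates every X, so it is exponential in |Aset(T)|.
    """
    attrs = sorted(ancestor_of)
    for size in range(1, len(attrs) + 1):
        for x in combinations(attrs, size):
            reach = closure(x, fds)
            for a in reach & set(attrs):
                # X -> A is nontrivial; X must also determine Ancestor(N_A).
                if a not in x and not ancestor_of[a] <= reach:
                    return False
    return True


# Hypothetical FDs: Dept <-> Chair, Prof -> Dept, Student -> Prof,
# Student -> Matriculation.
fds = [("D", "C"), ("C", "D"), ("P", "D"), ("S", "P"), ("S", "M")]

# Ancestor(N_A) for the scheme tree of Example 1.
fig3_tree = {
    "D": set("DC"), "C": set("DC"), "P": set("DCP"), "H": set("DCPH"),
    "M": set("DCPM"), "S": set("DCPMS"), "I": set("DCPMSI"),
}

# A deliberately poor tree: root H, child P, grandchild D.
poor_tree = {"H": set("H"), "P": set("HP"), "D": set("HPD")}
```

With these hypothetical FDs, the tree of Example 1 passes the simplified check, whereas the poor tree fails because Prof determines Dept without determining Hobby, which lies on the path above Dept.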
3.4 NNF [21]
The definition of reduced MVDs is fundamental to NNF [21], NNF [23], and NNF [25] and is now adapted from [21], [22].

Definition 8. Let $U$ be a set of attributes. Let $M$ be a set of MVDs over $U$. $X \twoheadrightarrow W$ in $M^+$ is
1. trivial if $XW = U$ or $W \subseteq X$.
2. left-reducible if there is an $X' \subset X$ such that $X' \twoheadrightarrow W$ is in $M^+$.
3. right-reducible if there is a $W' \subset W$ such that $X \twoheadrightarrow W'$ is a nontrivial MVD in $M^+$.
4. transferable if there is an $X' \subset X$ such that $X' \twoheadrightarrow (X - X')W$ is in $M^+$.
An MVD $X \twoheadrightarrow W$ is reduced if it is nontrivial, left-reduced (nonleft-reducible), right-reduced (nonright-reducible), and nontransferable.

Let $M_1$ and $M_2$ be sets of MVDs over a set of attributes $U$. $M_1$ is a cover of $M_2$ if and only if $M_1^+ = M_2^+$.

Definition 9. Let $U$ be a set of attributes. Let $M$ be a set of MVDs over $U$. Let $M^- = \{X \twoheadrightarrow W \mid X \twoheadrightarrow W$ is a reduced MVD in $M^+\}$. Elements in $LHS(M^-)$ are called keys of $M$. A minimal cover $M_{min}$ of $M$ is a subset of $M^-$ such that no proper subset of $M_{min}$ is a cover of $M$.

NNF [21] disallows several configurations of scheme trees. To achieve this goal, transitive dependencies and fundamental keys in a scheme tree are defined. Let $M$ be a set of MVDs over a set $U$ of attributes. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. $M$ implies $MVD(T)$ on $Aset(T)$ if for each MVD $X \twoheadrightarrow Y$ in $MVD(T)$, $M$ implies an MVD $X \twoheadrightarrow Z$ on $U$ such that $Y = Z \cap Aset(T)$. Assume $M$ implies $MVD(T)$ on $Aset(T)$. Let $(V, W)$ be an edge in $T$. Suppose that there is a key $X$ of $M$ such that there exists a dependent $Z \in DEP_M(X)$ and $Descendent(W) = Z \cap Aset(T)$. If there exist some sibling nodes $W_1, \ldots, W_n$ of $W$ in $T$ such that $Y = \bigcup_{i=1}^{n} Descendent(W_i)$, $X \subseteq Ancestor(V)\,Y$, and $XY \twoheadrightarrow Ancestor(V)$ does not hold for $T$ with respect to $M$, then $W$ is transitive redundant with respect to $X$ in $T$. In this case, $X \twoheadrightarrow Descendent(W)$ on $Aset(T)$ is a transitive dependency in $Aset(T)$.

Let $V$ be a subset of $U$. The set of fundamental keys on $V$, denoted by $FK(V)$, is defined as $FK(V) = \{V \cap X \mid X \in LHS(M_{min})$ and $V \cap X \neq \emptyset$, and there is no $Y \in LHS(M_{min})$ such that $X \cap V \supset Y \cap V \neq \emptyset\}$.

When FDs are given, NNF [21] uses the MVD counterparts of FDs. That is, each given FD $X \rightarrow Y$ is replaced by the set of MVDs $\{X \twoheadrightarrow A \mid A \in Y\}$. We are now ready to present NNF [21].

Definition 10. Let $U$ be a set of attributes. Let $D$ be a set of MVDs and FDs over $U$. Let $M$ be the set $\{X \twoheadrightarrow Y \mid X \twoheadrightarrow Y \in D\} \cup \{X \twoheadrightarrow A \mid X \rightarrow Y \in D$ and $A \in Y\}$. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. $T$ is in NNF [21] with respect to $D$ if
1. $M$ implies $MVD(T)$ on $Aset(T)$.
2. For each edge $(V, W)$ of $T$, $Ancestor(V) \twoheadrightarrow Descendent(W)$ on $Aset(T)$ is left- and right-reduced with respect to $M$.
3. For each node $N$ in $T$, there is no key $X$ of $M$ such that $N$ is transitive redundant with respect to $X$.
4. The root of $T$ is a key of $M$, and for each other node $N$ in $T$, if $FK(Descendent(N)) \neq \emptyset$, then $N \in FK(Descendent(N))$.

3.5 NNF [23]
When FDs are given, this normal form uses the envelope sets defined in [33] to handle MVDs and FDs together. Hence, it is not necessary to consider the MVD counterparts of FDs. From a given set $D$ of MVDs and FDs, the authors first derive the envelope set $E(D)$ of $D$, and the normal form is defined in terms of $E(D)$ and $D$. The envelope set $E(D)$ of $D$ is defined as $\{X \twoheadrightarrow W \mid X \in LHS(D)$ and $W \in DEP_D(X)$ and $X \nrightarrow W\}$. Notice that the authors also redefine transitive dependencies and fundamental keys for this normal form, which affects Conditions 3 and 4. The new definition of fundamental keys on a set of attributes $V$ is denoted by $FK(V)$ and is defined as $\{V \cap X \mid X \in LHS(E(D)_{min})$ and $V \cap X \neq \emptyset$, and there is no $Y \in LHS(E(D)_{min})$ such that $X \cap V \supset Y \cap V \neq \emptyset\}$. The new definition of transitive dependencies is lengthy and involved, however. Furthermore, since we will not use this new definition of transitive dependencies in this paper, we do not reproduce it here.

Definition 11. Let $U$ be a set of attributes. Let $D$ be a set of MVDs and FDs over $U$. Let $E(D)$ be the envelope set of $D$. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. $T$ is in NNF [23] with respect to $D$ if
1. $E(D)$ implies $MVD(T)$ on $Aset(T)$.
2. For each edge $(V, W)$ of $T$, $Ancestor(V) \twoheadrightarrow Descendent(W)$ on $Aset(T)$ is left- and right-reduced with respect to $E(D)$.
3. For each node $N$ in $T$, there is no key $X$ of $E(D)$ such that $N$ is transitive redundant with respect to $X$.
4. The root of $T$ is a key of $E(D)$ and, for each edge $(V, W)$ in $T$, if $D$ does not imply $Ancestor(V) \rightarrow Descendent(W)$ and $FK(Descendent(W)) \neq \emptyset$, then $W \in FK(Descendent(W))$, otherwise $W$ is a leaf node of $T$.

3.6 NNF [25]
When FDs are given, this normal form uses the integrated approach described in [2] to handle MVDs and FDs together. Hence, it is not necessary to consider the MVD counterparts of FDs. From a given set $D$ of MVDs and FDs, the authors first derive another set $M'$ of MVDs, and the normal form is defined mainly in terms of $M'$. Notice in the following that $X^+$ denotes the closure of a set of attributes $X$ and that NNF [25] uses the original definition of fundamental keys in Section 3.4.

Definition 12. Let $U$ be a set of attributes. Let $D$ be a set of MVDs and FDs over $U$. Let $M'$ be the set {$X \twoheadrightarrow Y$ | $X \twoheadrightarrow Y \in D$ and X X}. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. $T$ is in NNF [25] with respect to $D$ if
1. $M'$ implies $MVD(T)$ on $Aset(T)$.
2. For each edge $(V, W)$ of $T$, $Ancestor(V) \twoheadrightarrow Descendent(W)$ on $Aset(T)$ is left- and right-reduced with respect to $M'$.
3. For each node $N$ in $T$, there is no key $X$ of $M'$ such that $N$ is transitive redundant with respect to $X$.
4. The root of $T$ is a key of $M'$ and is in $LHS(M')$, and for each other node $N$ in $T$, if $FK(Descendent(N)) \neq \emptyset$, then $N \in FK(Descendent(N))$.
Further normalization is specified to remove redundancy caused by FDs. If $T$ contains a node $N$ such that $X \rightarrow N$, where $X \subset N$, then replace $N$ by $N'$ where $N' \subset N$, $N' \rightarrow N$, and for no $N'' \subset N'$ does $N'' \rightarrow N$ hold. In essence, each node $N$ in $T$ is replaced by one of its candidate keys defined in the usual sense [24], [29].
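Two of the set constructions used by these normal forms are simple enough to sketch in Python: the MVD counterparts of FDs used by NNF [21], and the fundamental keys $FK(V)$ of Section 3.4. The sketch reads the fundamental-key condition as "keep the nonempty intersections $V \cap X$ that have no nonempty proper subset among the other intersections," which is an interpretation of the definition; the function names and the sample left-hand sides are hypothetical.

```python
def mvd_counterparts(fds):
    """Replace each FD X -> Y by the MVDs X ->> A, one per attribute A in Y."""
    return {(frozenset(x), frozenset({a})) for x, y in fds for a in y}


def fundamental_keys(v, lhs_min):
    """FK(V) for attribute set `v`, given the left-hand sides of a minimal cover.

    ASSUMPTION: `lhs_min` is supplied by the caller; computing a minimal
    cover of reduced MVDs is outside the scope of this sketch.
    """
    v = frozenset(v)
    candidates = {v & frozenset(x) for x in lhs_min} - {frozenset()}
    # Discard a candidate if another candidate is a proper subset of it.
    return {c for c in candidates if not any(d < c for d in candidates)}


# Hypothetical left-hand sides and FD, for illustration only.
lhs_min = [{"D"}, {"P"}, {"P", "S"}]
fk = fundamental_keys({"P", "S", "I"}, lhs_min)
mvds = mvd_counterparts([({"S"}, {"P", "M"})])
```

Here the intersections with V = {P, S, I} are {P} and {P, S}; only {P} survives because it is a proper subset of {P, S}.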
4 ALGORITHM FOR NNF [20]
In this section, we present an algorithm that generates scheme trees in NNF [20] from an acyclic database scheme and a set of embedded FDs (defined in Algorithm 1). This algorithm generalizes the algorithms in [18], [19], in which only a more restricted type of acyclic database scheme is allowed [8]. The acyclic database schemes that we consider in this paper, however, are $\alpha$-acyclic database schemes, which are the most general type of acyclic database scheme [8]. In Section 3.2, some of the numerous equivalent definitions of $\alpha$-acyclic database schemes are presented. To be brief, in this paper, when we say "an acyclic database scheme," we actually mean "an $\alpha$-acyclic database scheme" unless otherwise stated. In Section 5, we shall compare the algorithms of all of the nested normal forms and the nested database schemes that they generate.
In addition to the assumptions that the given database scheme is acyclic and each given FD is embedded in a relation scheme of the given database scheme, we also assume that each relation scheme is in BCNF with respect to the given FDs. These assumptions are justified as follows: First, in Section 3.2, we have already mentioned the importance of acyclic database schemes. Second, as stated in [9], most FDs that are relevant for data structuring are embedded in some relation schemes of the given database scheme. Furthermore, most FDs that are derived from the semantic data models that we use in [18], [19] are indeed embedded in some relationship sets, which roughly correspond to relation schemes in this paper. Third, as stated in [19], in most cases, relation schemes are of small arity and a majority of them are binary; thus, we believe that most relation schemes in practice are in BCNF. We now present Algorithm 1, which is the algorithm that generates nested relation schemes in NNF [20].

Algorithm 1.
Input: An acyclic database scheme $\mathbf{R} = \{R_1, \ldots, R_n\}$, $n \geq 1$, and a set $F$ of nontrivial FDs where each FD $X \rightarrow Y$ in $F$ is embedded in an $R_i$ ($XY \subseteq R_i$). Furthermore, each $R_i$ is in BCNF with respect to $F$. Without loss of generality, we assume no $R_i$ is a subset of $R_j$ if $i \neq j$.
Output: Nested relation schemes that are in NNF [20] with respect to $\bowtie \mathbf{R}$ and $F$.
Internal Data Structure: A join tree $J$ of $\mathbf{R}$ and a first-in first-out queue $L$ of relation schemes that is initially empty.
1. Use the GYO Reduction (footnote 1) to derive $J$ from $\mathbf{R}$. We begin with the graph with nodes $R_1, \ldots, R_n$ and with no edge. Let $R'_i$ be $R_i$ after applying zero or more node removals. Each time an $R'_i$ is removed because $R'_i \subseteq R'_j$, where $i \neq j$, add an edge between the corresponding $R_i$ and $R_j$ in the graph. Eventually, $n - 1$ or fewer edges will be added to the graph and the resulting graph is $J$. Notice that since there may be more than one possible reduction sequence, an acyclic database scheme may have more than one join tree.
2. While there is an unmarked node in $J$, do:
   2.1. Select an unmarked node $R_{seed}$ in $J$. Create a single path scheme tree $T_{R_{seed}}$ from $R_{seed}$. Mark $R_{seed}$ and enter $R_{seed}$ into $L$.
   2.2. While $L$ is not empty, do:
      2.2.1. Let $R_M$ be the first marked relation scheme in $L$. Remove $R_M$ from $L$.
      2.2.2. For each unmarked neighbor $R_U$ of $R_M$ in $J$, do:
         2.2.2.1. If there is a node $N$ in $T_{R_{seed}}$ such that $R_U \rightarrow Ancestor(N)$ and $(R_U \cap R_M) \subseteq Ancestor(N)$ (if there are several nodes that satisfy these conditions, choose the lowest one), modify $T_{R_{seed}}$ as follows:
            2.2.2.1.1. If $Ancestor(N) \rightarrow R_U$, put $(R_U - R_M)$ into $N$. Otherwise, create a single path scheme tree $T_{R_U}$ from $(R_U - R_M)$ and attach the root of $T_{R_U}$ as a child of $N$.
            2.2.2.1.2. Mark $R_U$ and enter $R_U$ into $L$.
3. A scheme tree $T$ produced in Step 2 can be modified by moving an attribute $A$ in $Aset(T)$ up or down the nodes in the path in which $A$ appears, as long as $T$ satisfies Condition 2 of NNF [20] and $Path(T)$ remains the same.

Footnote 1. GYO Reduction is called Graham Reduction in [17].

Fig. 5. A join tree of the join dependency in Example 5.
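Step 1 of Algorithm 1 can be sketched in Python as follows. This is a minimal, unoptimized rendering of the GYO Reduction as described in Step 1: repeatedly drop attributes that occur in only one remaining scheme, then remove a scheme that has become a subset of another and record an edge between the two original schemes. The function name `gyo_join_tree` and the index-based edge representation are assumptions of this sketch. Because the reduction order is fixed by iteration order, it returns only one of the possibly several join trees.

```python
def gyo_join_tree(schemes):
    """Return join-tree edges as index pairs, or None if the scheme is cyclic."""
    current = {i: set(s) for i, s in enumerate(schemes)}
    edges = []
    changed = True
    while changed and len(current) > 1:
        changed = False
        # Node removal: drop attributes that occur in exactly one scheme.
        for attrs in current.values():
            for a in list(attrs):
                if sum(a in other for other in current.values()) == 1:
                    attrs.discard(a)
                    changed = True
        # Scheme removal: R'i is contained in R'j, so add edge (Ri, Rj).
        for i in list(current):
            j = next((j for j in current
                      if j != i and current[i] <= current[j]), None)
            if j is not None:
                edges.append((i, j))
                del current[i]
                changed = True
                break
    return edges if len(current) == 1 else None


# Database scheme of Example 6; indices 0..3 stand for ABC, CDE, AEF, ACE.
example6 = ["ABC", "CDE", "AEF", "ACE"]
tree = gyo_join_tree(example6)

# A cyclic scheme: no attribute is isolated and no scheme contains another.
triangle = gyo_join_tree(["AB", "BC", "AC"])
```

For the scheme of Example 6 the sketch attaches ABC, CDE, and AEF to ACE, which matches the description of Fig. 6 in which ACE is the only neighbor of ABC and CDE and AEF are not neighbors.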
Since a given acyclic database scheme may have several join trees, Algorithm 1 may generate different scheme trees with different join trees. The purpose of using join trees, however, is to ensure that scheme trees generated by Algorithm 1 are in NNF [20]. Nevertheless, the decision on which scheme trees to use may depend on some other issues such as the semantics of the application at hand and data usage patterns, which are beyond the scope of this paper. The purpose of Step 3 of Algorithm 1 is to provide more flexibility to the user so that he or she may decide the final shape of the scheme tree based on some other considerations that cannot be handled by syntactic means.

Example 5. Consider the set $U$ of attributes, the set $M$ of MVDs, and the set $F$ of FDs in Fig. 4. To save space, we abbreviate Dept as D, Chair as C, Prof as P, Hobby as H, Hobby-Equipment as E, Matriculation as M, Student as S, and Interest as I. Notice that $M$ and $F$ are equivalent to $\bowtie \mathbf{R}$ and $F$ where $\mathbf{R} = \{DC, DP, PH, HE, SP, SM, SI\}$. At Step 1, a join tree $J$ of $\mathbf{R}$ is derived and is shown in Fig. 5. Initially, every node in $J$ is unmarked. At Step 2.1, suppose DC is selected as the seed relation scheme ($R_{seed}$) of a new scheme tree and the single path scheme tree $T_{DC}$ is simply created as DC. The node DC in $J$ is then marked and is entered into $L$. At Step 2.2.1, $R_M$ becomes DC and it has one unmarked neighbor DP in $J$. The node $N$ in Step 2.2.2.1 is DC and at Step 2.2.2.1.1, P is attached as a child of DC. DP is then marked and is entered into $L$. Back at Step 2.2.1, $R_M$ becomes DP and it has two unmarked neighbors PH and SP in $J$. Successively, H and S become children of P in $T_{DC}$ and both PH and SP are marked and are entered into $L$. Back at Step 2.2.1, $R_M$ becomes PH and it has one unmarked neighbor HE. Since there is no node $N$ in $T_{DC}$ such that $HE \rightarrow Ancestor(N)$, we cannot extend the path that contains H. Back at Step 2.2.1, $R_M$ becomes SP and it has two unmarked neighbors SI and SM in $J$. For SI, the node $N$ in Step 2.2.2.1 is S and I becomes a child of S in $T_{DC}$. For SM, the node $N$ in Step 2.2.2.1 is also S. But since $Ancestor(N) \rightarrow SM$, we put the attribute M into the node $N$ and, thus, the node $N$ becomes SM. Notice that in this example, the order of choosing unmarked neighbors at Step 2.2.2 does not make a difference in $T_{DC}$.
Fig. 6. The join tree of the join dependency in Example 6.
The only unmarked node left in $J$ is HE and, thus, in the second iteration of Step 2, a single path scheme tree is created from HE and HE is then marked. In Step 3, we may choose to move M to be the parent of S in $T_{DC}$, and the scheme tree in Fig. 3 is generated.
The database scheme $\mathbf{R}$ in Example 5 is -acyclic. The following example shows that Algorithm 1 is able to generate scheme trees in NNF [20] from an $\alpha$-acyclic, but not -acyclic, database scheme.
Example 6. Assume $U = ABCDEF$ and $\mathbf{R} = \{ABC, CDE, AEF, ACE\}$. The join tree generated at Step 1 is shown in Fig. 6. Suppose we select ABC as the seed relation scheme and the single path scheme tree $T_{ABC}$ created at Step 2.1 is AC-B, that is, a root AC with a single child B. ABC has only one unmarked neighbor, ACE. Later, E becomes a child of AC in $T_{ABC}$. As the reader may verify, CDE and AEF cannot be attached to $T_{ABC}$. In the second and the third iterations of Step 2, two single path scheme trees are created from CDE and AEF, respectively. Notice that CDE and AEF cannot appear in the same scheme tree because they are not neighbors in the join tree in Fig. 6.
The concept of a closed set of relation schemes is crucial to the proof of the correctness of Algorithm 1. Let $\mathbf{R} = \{R_1, \ldots, R_n\}$ be a database scheme over a set $U$ of attributes. Let $\mathbf{S} \subseteq \mathbf{R}$ and let $\bar{S}$ be $\bigcup_{R_i \in \mathbf{S}} R_i$. $\mathbf{S}$ is closed if, for every $R_i \in \mathbf{R}$, there is an $R_k \in \mathbf{S}$ such that $R_i \cap \bar{S} \subseteq R_k$. Notice that, if $\mathbf{R}$ is acyclic, then any closed set of relation schemes of $\mathbf{R}$ is also acyclic [1]. The proof of the correctness of Algorithm 1 also depends on Lemmas 1 and 3, which characterize the set of MVDs and the set of FDs that hold for a scheme tree generated by Algorithm 1.
Let $\bowtie \mathbf{R}$ be a join dependency where $\mathbf{R} = \{R_1, \ldots, R_n\}$ is a database scheme over a set $U$ of attributes. $\bowtie \mathbf{R}$ implies numerous MVDs. However, every MVD implied by $\bowtie \mathbf{R}$ is also implied by an MVD in $MVD(\bowtie \mathbf{R})$. $MVD(\bowtie \mathbf{R})$ is a set of MVDs of the form $\bowtie \{\bigcup_{i \in S_1} R_i, \bigcup_{i \in S_2} R_i\}$, where $S_1 \cup S_2 = \{1, \ldots, n\}$ and $S_1 \cap S_2 = \emptyset$ [17].$^2$ As shown in Chapter 13 in [17], $\mathbf{R}$ is acyclic if and only if $MVD(\bowtie \mathbf{R})$ and $\bowtie \mathbf{R}$ are equivalent.
2. For every MVD $X \twoheadrightarrow Y$ on $U$, $X \twoheadrightarrow Y$ is equivalent to the join dependency $\bowtie \{XY, X(U - XY)\}$. $\bowtie \mathbf{R}$ implies $\bowtie \{XY, X(U - XY)\}$ if and only if $R_i \subseteq XY$ or $R_i \subseteq X(U - XY)$, $1 \leq i \leq n$. Hence, $\bowtie \{XY, X(U - XY)\}$ is implied by an MVD in $MVD(\bowtie \mathbf{R})$.
Lemma 1. Let $U$ be a set of attributes. Let $\mathbf{R} = \{R_1, \ldots, R_n\}$ be a database scheme over $U$. Let $\mathbf{S} \subseteq \mathbf{R}$ and let $\bar{S}$ be $\bigcup_{R_i \in \mathbf{S}} R_i$. If $\mathbf{S}$ is closed and $D$ is the set of MVDs that hold for $\bar{S}$ with respect to $\bowtie \mathbf{R}$, then $D$ is equivalent to $MVD(\bowtie \mathbf{S})$.
Proof. Let $X \twoheadrightarrow Y$ be an MVD implied by $\bowtie \mathbf{R}$ where $X \subseteq \bar{S}$. Thus, $X \twoheadrightarrow Y \cap \bar{S}$ is in $D$. $X \twoheadrightarrow Y$ is equivalent to the join dependency $\bowtie \{XY, X(U - XY)\}$. $\bowtie \mathbf{R}$ implies $\bowtie \{XY, X(U - XY)\}$ if and only if $R_i \subseteq XY$ or $R_i \subseteq X(U - XY)$, $1 \leq i \leq n$. Hence, for each $R_i$ in $\mathbf{S}$, $R_i \subseteq XY$ or $R_i \subseteq X(U - XY)$. Thus, $X \twoheadrightarrow Y \cap \bar{S}$ is implied by $MVD(\bowtie \mathbf{S})$. Now, consider an MVD $V \twoheadrightarrow W$ in $MVD(\bowtie \mathbf{S})$, which is equivalent to the join dependency $\bowtie \{VW, V(\bar{S} - VW)\}$. Let us construct an MVD implied by $\bowtie \mathbf{R}$, which is denoted by $\bowtie \{L, R\}$. Initially, $L = VW$ and $R = V(\bar{S} - VW)$. As we add the relation schemes in $\mathbf{R} - \mathbf{S}$ to either the set $L$ or the set $R$, since $\mathbf{S}$ is closed, $L \cap R$ is always equal to $V$. Thus, $\bowtie \mathbf{R}$ implies an MVD $V \twoheadrightarrow Z$ on $U$ such that $W = Z \cap \bar{S}$. □
In the following, if $G$ is a set of FDs and $W$ is a set of attributes, then $G_W = \{X \rightarrow Y \in G \mid XY \subseteq W\}$.
Lemma 2. Let $\mathbf{R} = \{R_1, \ldots, R_n\}$ be a database scheme. Let $F$ be a set of FDs such that each FD in $F$ is embedded in an $R_i$. $\bowtie \mathbf{R}$ and $F$ imply an FD $X \rightarrow Y$ if and only if $F$ implies $X \rightarrow Y$.
Proof. Lemma 1 in [12]. □
Lemma 3. Let $U$ be a set of attributes. Let $\mathbf{R} = \{R_1, \ldots, R_n\}$ be an acyclic database scheme over $U$ and let $J$ be a join tree of $\mathbf{R}$. Let $F$ be a set of FDs over $U$ such that each FD in $F$ is embedded in an $R_i$. Let $J'$ be a connected subtree of $J$. Let $\mathbf{S}$ be the set of nodes (relation schemes) in $J'$ and let $\bar{S}$ be $\bigcup_{R_i \in \mathbf{S}} R_i$. If $D$ is the set of FDs that hold for $\bar{S}$ with respect to $\bowtie \mathbf{R}$ and $F$, then $D$ is equivalent to $\bigcup_{R_i \in \mathbf{S}} F_{R_i}$.
Proof. By Lemma 2, $\bowtie \mathbf{R}$ has nothing to do with the closures of sets of attributes. The if-part is obvious because for each $R_i \in \mathbf{S}$, $F_{R_i} \subseteq D$. We proceed to the only-if part. Without loss of generality, we assume the right-hand side of every FD in $F$ is a single attribute. We first show that if $W \rightarrow B \in F$, where $WB \subseteq \bar{S}$, then $W \rightarrow B$ is embedded in a relation scheme in $\mathbf{S}$. Assume not; we shall derive a contradiction. Let $R_B$ be a relation scheme in $\mathbf{R} - \mathbf{S}$ such that $WB \subseteq R_B$. Since $J'$ is connected, there is a unique node (relation scheme) $R_u$ in $J'$ such that among all of the nodes in $J'$, $R_u$ is the closest node to $R_B$. By Condition 2 of Definition 6, for each $R_j \in \mathbf{S}$, $R_j \cap R_B \subseteq R_u$. Now, since $WB \subseteq \bar{S}$, therefore, $WB \subseteq R_u$, which is a contradiction.
Let $X \rightarrow A$ be an FD in $D$. Since $X \rightarrow A \in D$, $XA \subseteq \bar{S}$ and there is a derivation sequence $Z$ of $X \rightarrow A$ by using the FDs in $F$. Let $Z$ be the following derivation sequence:
$$
\begin{array}{lll}
 & & X_0 = X, \\
V_1 \subseteq X_0, & V_1 \rightarrow A_1 \in F, & X_1 = X_0 A_1, \\
V_2 \subseteq X_1, & V_2 \rightarrow A_2 \in F, & X_2 = X_1 A_2, \\
\vdots & \vdots & \vdots \\
V_p \subseteq X_{p-1}, & V_p \rightarrow A_p \in F, & X_p = X_{p-1} A_p, \\
\vdots & \vdots & \vdots \\
V_q \subseteq X_{q-1}, & V_q \rightarrow A_q \in F, & X_q = X_{q-1} A_q, \\
\vdots & \vdots & \vdots \\
V_m \subseteq X_{m-1}, & V_m \rightarrow A_m \in F, & X_m = X_{m-1} A_m,
\end{array}
$$
where $A_m = A$.
If each $A_i$ that appears in $Z$ is in $\bar{S}$, then every $V_i$ in $Z$ is a subset of $\bar{S}$. This implies that every $V_i A_i$ is a subset of $\bar{S}$. Thus, by what we have just proved, each $V_i \rightarrow A_i$ is embedded in a relation scheme in $\mathbf{S}$ and the proof is done. Now, assume some $A_i$'s are not in $\bar{S}$. Suppose $A_p$ is the first such attribute in $Z$. Since $A_m = A$ and $A \in \bar{S}$, there is an $A_q$ that appears in $Z$ such that $A_1 \in \bar{S}, \ldots, A_{p-1} \in \bar{S}$, and $A_p \notin \bar{S}, \ldots, A_{q-1} \notin \bar{S}$, and $A_q \in \bar{S}$.
Let $R_{A_i}$ be a relation scheme in $\mathbf{R}$ that embeds $V_i \rightarrow A_i$. Since $J'$ is connected and $R_{A_p}$ is not a node in $J'$, $Z$ can be arranged in such a way that $V_{p+1}, \ldots, V_q$ all contain some of these attributes $A_p, \ldots, A_{q-1}$. This implies that we can arrange $Z$ such that $R_{A_p}, \ldots, R_{A_{q-1}}$ all belong to the same remaining connected subtree $J''$ after we remove every node in $J'$ from $J$ and every edge in $J$ that is incident on a node in $J'$. Since $J'$ is a connected subtree, there is a unique node $R_v$ in $J'$ such that among all of the nodes in $J'$, $R_v$ is the closest node to any node in $J''$. We now show by induction that $X_{p-1} \cap R_v \rightarrow A_k$, $p \leq k \leq q$.
Basis. Since $A_1 \in \bar{S}, \ldots,$ and $A_{p-1} \in \bar{S}$, thus $V_p \subseteq \bar{S}$. Since $V_p A_p \subseteq R_{A_p}$ and $R_{A_p}$ is in $J''$, thus, $R_{A_p} \notin \mathbf{S}$. Therefore, by Condition 2 of Definition 6, $V_p \subseteq R_v$. Since $V_p \subseteq X_{p-1}$, therefore, $V_p \subseteq X_{p-1} \cap R_v$. Thus, $X_{p-1} \cap R_v \rightarrow A_p$.
Induction. Assume $X_{p-1} \cap R_v \rightarrow A_l$ for every $l$, where $p \leq l < q$. Now, consider $l + 1$. By the construction of $Z$, $V_{l+1} \subseteq A_p \ldots A_l X_{p-1}$. Since $V_{l+1} A_{l+1} \subseteq R_{A_{l+1}}$, which is not a node in $J'$, $V_{l+1} \subseteq (A_p \ldots A_l X_{p-1}) \cap R_{A_{l+1}}$. This implies that $V_{l+1} \subseteq A_p \ldots A_l (X_{p-1} \cap R_{A_{l+1}})$. But, since $R_{A_{l+1}}$ is not a node in $J'$ and $X_{p-1} \subseteq \bar{S}$, by Condition 2 of Definition 6, $X_{p-1} \cap R_{A_{l+1}} \subseteq X_{p-1} \cap R_v$, which implies $V_{l+1} \subseteq A_p \ldots A_l (X_{p-1} \cap R_v)$. By the induction hypothesis, $X_{p-1} \cap R_v \rightarrow A_p \ldots A_l$. Thus, $X_{p-1} \cap R_v \rightarrow A_{l+1}$ and we conclude that $X_{p-1} \cap R_v \rightarrow A_k$, $p \leq k \leq q$.
Now, since $V_q \rightarrow A_q$ is embedded in $R_q$, $R_q \in \mathbf{R} - \mathbf{S}$, and $A_q \in \bar{S}$, therefore, $A_q \in R_v$. Thus, $X_{p-1} \cap R_v \rightarrow A_q$ follows from $F_{R_v}$. By using the same reasoning, it is possible to show that for each $X \rightarrow A \in D$, $X \rightarrow A$ is implied by $\bigcup_{R_i \in \mathbf{S}} F_{R_i}$. □
Lemma 4. Let $U$ be a set of attributes. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. $MVD(T)$ is equivalent to the join dependency $\bowtie Path(T)$ on $Aset(T)$.
Proof. Proposition 4.1 in [21]. □
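To make Lemma 4 concrete, the following Python sketch computes $Path(T)$ and the members of $MVD(T)$ for the scheme tree $T_{DC}$ that Example 5 produces at the end of Step 2 (before Step 3 moves M). It assumes the definitions used earlier in the paper: $Ancestor(N)$ and $Descendent(N)$ are the unions of the attributes in the ancestors and descendants of $N$, including $N$ itself, $Path(T)$ collects the attribute set of each root-to-leaf path, and $MVD(T)$ contains $Ancestor(V) \twoheadrightarrow Descendent(W)$ for every edge $(V, W)$. The `Node` class and the function names are illustrative and are not part of the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A scheme-tree node: a nonempty attribute set plus its child nodes."""
    attrs: frozenset
    children: list = field(default_factory=list)


def descendent(node):
    """Union of the attributes in `node` and in every node below it."""
    result = set(node.attrs)
    for child in node.children:
        result |= descendent(child)
    return frozenset(result)


def paths(node, above=frozenset()):
    """Path(T): the attribute set of every root-to-leaf path."""
    here = above | node.attrs  # this is Ancestor(node)
    if not node.children:
        return {here}
    found = set()
    for child in node.children:
        found |= paths(child, here)
    return found


def mvds(node, above=frozenset()):
    """MVD(T) as pairs (Ancestor(V), Descendent(W)), one per edge (V, W)."""
    here = above | node.attrs
    found = set()
    for child in node.children:
        found.add((here, descendent(child)))
        found |= mvds(child, here)
    return found


# T_DC from Example 5 after Step 2: root DC, child P, and below P the
# leaf H and the node SM, which in turn has the child I.
t_dc = Node(frozenset("DC"), [
    Node(frozenset("P"), [
        Node(frozenset("H")),
        Node(frozenset("SM"), [Node(frozenset("I"))]),
    ]),
])

print(sorted("".join(sorted(p)) for p in paths(t_dc)))  # ['CDHP', 'CDIMPS']
```

Under these assumptions the tree has two paths and four edges, so $MVD(T_{DC})$ has four members, for example $DCP \twoheadrightarrow SMI$.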
Lemma 5. All of the nodes in $T_{R_{seed}}$ that satisfy the conditions in Step 2.2.2.1 are on the same path.
Proof. Assume there are two nodes $N$ and $N'$ that satisfy the conditions. That is, $R_U \rightarrow Ancestor(N)$ and $(R_U \cap R_M) \subseteq Ancestor(N)$, and $R_U \rightarrow Ancestor(N')$ and $(R_U \cap R_M) \subseteq Ancestor(N')$. This implies that $(R_U \cap R_M) \subseteq (Ancestor(N) \cap Ancestor(N'))$. Since $R_U$ and $R_M$ are neighbors in $J$, $R_U \cap R_M = R_U \cap Aset(T_{R_{seed}})$ and since every given FD is embedded in a relation scheme of the given database scheme, $R_U \rightarrow Ancestor(N)$ if and only if $(R_U \cap R_M) \rightarrow Ancestor(N)$. Similarly, $R_U \rightarrow Ancestor(N')$ if and only if $(R_U \cap R_M) \rightarrow Ancestor(N')$. Thus,
$$(Ancestor(N) \cap Ancestor(N')) \rightarrow Ancestor(N)\,Ancestor(N').$$
This FD will force $N$ and $N'$ to be on the same path at Step 2.2.2.1.1. □
In the following proof, we assume the reader is familiar with the chase, which is described in Chapter 8 in [17].
Theorem 1. Algorithm 1 is correct as stated.
Proof. It is obvious that Step 1 generates a join tree for the given acyclic database scheme. Next, we show that every tree generated by Algorithm 1 satisfies Definitions 1 and 4. In particular, we show that the nodes in a generated tree are all nonempty and are pairwise disjoint. By assumption, no relation scheme is a subset of another relation scheme in the given database scheme. Also, since $R_U$ and $R_M$ are neighbors in the derived join tree, $R_U \cap R_M = R_U \cap Aset(T_{R_{seed}})$ and, thus, $R_U - R_M$, which is not empty, does not intersect with $Aset(T_{R_{seed}})$. Furthermore, the single path scheme trees generated at Steps 2.1 and 2.2.2.1.1 satisfy Definitions 1 and 4 and, thus, they do not have empty nodes. Step 3 will not violate these two definitions. Thus, Algorithm 1 generates trees that satisfy Definitions 1 and 4.
Observe that Algorithm 1 generates a scheme tree from the nodes in a connected subtree of the derived join tree. Since the nodes in a connected subtree constitute a closed set of relation schemes, Algorithm 1 generates scheme trees from closed sets of relation schemes. In fact, this is exactly the purpose of the join tree created at Step 1. As an example, {CDE, AEF} in Example 6 is not a closed set of relation schemes and, thus, CDE and AEF cannot appear in the same scheme tree. The purpose of the first-in first-out queue in Algorithm 1, however, is to ensure that a scheme tree is built level by level.
Next, we need to characterize the set $D$ of MVDs and FDs that hold for a scheme tree $T$ generated by Algorithm 1. Let $\mathbf{S}$ be the closed set of relation schemes from which $T$ is constructed and let $\bar{S}$ be $\bigcup_{R_i \in \mathbf{S}} R_i$. Thus, $Aset(T)$ is equal to $\bar{S}$. It turns out that $D$ is equivalent to $\bigcup_{R_i \in \mathbf{S}} F_{R_i}$ and $\bowtie \mathbf{S}$. The proof is as follows: $D$ implies $\bigcup_{R_i \in \mathbf{S}} F_{R_i}$ because $\bigcup_{R_i \in \mathbf{S}} F_{R_i} \subseteq D$. Since the given database scheme $\mathbf{R}$ is acyclic and $\mathbf{S}$ is closed, $\mathbf{S}$ is acyclic. Thus, $MVD(\bowtie \mathbf{S})$ and $\bowtie \mathbf{S}$ are equivalent. However, since $\mathbf{S}$ is closed, by Lemma 1, $MVD(\bowtie \mathbf{S}) \subseteq D$. Therefore, $D$ implies $\bowtie \mathbf{S}$. We now consider the reverse implication. By Lemma 3, the set of FDs in $D$ is equivalent to $\bigcup_{R_i \in \mathbf{S}} F_{R_i}$. Let $X \twoheadrightarrow Y$ be an MVD in $D$. By the definition of $D$, $\bowtie \mathbf{R}$ and $F$ imply an MVD $X \twoheadrightarrow Z$ on $U$ $(= \bigcup_{R_i \in \mathbf{R}} R_i)$ such that $X \subseteq \bar{S}$ and $Y = Z \cap \bar{S}$. Let us run the chase on the tableau $T_{XZ}$ for $X \twoheadrightarrow Z$. $T_{XZ}$ has two rows, $r_1$ and $r_2$. Row $r_1$ has $a$'s under the $XZ$-columns and $b$'s elsewhere, and row $r_2$ has $a$'s under the $X(U - XZ)$-columns and $b$'s elsewhere. Since $\bowtie \mathbf{R}$ and $F$ imply $X \twoheadrightarrow Z$ on $U$, by Lemma 2, we can assume that we apply the FDs in $F$ until no more $b$ can be changed into $a$, and then for every $R_i \in \mathbf{R}$, $R_i \subseteq \{C \mid r_1(C) = a\}$ or $R_i \subseteq \{C \mid r_2(C) = a\}$. This statement also holds for every $R_i \in \mathbf{S}$ because $\mathbf{S} \subseteq \mathbf{R}$. Now, consider using the FDs in $\bigcup_{R_i \in \mathbf{S}} F_{R_i}$ instead of the FDs in $F$. By Lemma 2 again, the FDs in $\bigcup_{R_i \in \mathbf{S}} F_{R_i}$ are strong enough to ensure that for every $R_i \in \mathbf{S}$, $R_i \subseteq \{C \mid r_1(C) = a\}$ or $R_i \subseteq \{C \mid r_2(C) = a\}$. Thus, $\bigcup_{R_i \in \mathbf{S}} F_{R_i}$ and $\bowtie \mathbf{S}$ imply $X \twoheadrightarrow Y$ on $\bar{S}$.
We are now ready to prove by induction that a scheme tree $T$ generated by Algorithm 1 is in NNF [20]. The induction is on the size of $\mathbf{S}$, which is denoted by $|\mathbf{S}|$.
Basis. When $|\mathbf{S}| = 1$, $T$ is created at Step 2.1. $MVD(T)$ is a set of trivial MVDs and, thus, is equivalent to $\bowtie \mathbf{S}$, which is a trivial join dependency. By definition, $FD(T)$ is equivalent to $\bigcup_{R_i \in \mathbf{S}} F_{R_i}$. Therefore, $T$ satisfies Condition 1 of NNF [20]. By assumption, every given relation scheme is in BCNF. Hence, $T$ also satisfies Condition 2 of NNF [20]. At Step 3, $Path(T)$ remains the same and thus $T$ still satisfies NNF [20].
Induction. Assume the statement is true for every closed set of relation schemes $\mathbf{S}$ of $\mathbf{R}$ where $1 \leq |\mathbf{S}| \leq k$. Now, consider when $|\mathbf{S}| = k + 1$. In the connected subtree $J'$ from which $\mathbf{S}$ is defined, let $N_{k+1}$ be a node in $J'$ that has exactly one neighbor in $J'$. We denote the scheme tree before $N_{k+1}$ is added by $T_{\mathbf{S} - N_{k+1}}$ and the scheme tree after $N_{k+1}$ is added by $T_{\mathbf{S}}$. The goal is to show that $MVD(T_{\mathbf{S}}) \cup FD(T_{\mathbf{S}})$ is equivalent to $\bigcup_{R_i \in \mathbf{S}} F_{R_i}$ and $\bowtie \mathbf{S}$. Observe that $FD(T_{\mathbf{S}})$ is equivalent to $\bigcup_{R_i \in \mathbf{S}} F_{R_i}$. By Lemma 4, $MVD(T)$ is equivalent to the join dependency $\bowtie Path(T)$ on $Aset(T)$ for any scheme tree $T$. Hence, $\bowtie \mathbf{S}$ implies $MVD(T_{\mathbf{S}})$ because for each $R_i$ in $\mathbf{S}$, there is a path $P$ in $Path(T_{\mathbf{S}})$ such that $R_i \subseteq P$. Therefore, we are left to show that $\bowtie Path(T_{\mathbf{S}}) \cup FD(T_{\mathbf{S}})$ implies $\bowtie \mathbf{S}$.
Let us run the chase on the tableau $T_{\bowtie \mathbf{S}}$ for $\bowtie \mathbf{S}$. Notice that $T_{\bowtie \mathbf{S}}$ has every row in $T_{\bowtie (\mathbf{S} - N_{k+1})}$, which is the tableau for $\bowtie (\mathbf{S} - N_{k+1})$. Observe that $FD(T_{\mathbf{S} - N_{k+1}}) \subseteq FD(T_{\mathbf{S}})$. By the induction hypothesis, $\bowtie Path(T_{\mathbf{S} - N_{k+1}}) \cup FD(T_{\mathbf{S} - N_{k+1}})$ implies $\bowtie (\mathbf{S} - N_{k+1})$. Now, consider adding $N_{k+1}$ into $T_{\mathbf{S} - N_{k+1}}$. By Lemma 5, there is only one node $N$ in $T_{\mathbf{S} - N_{k+1}}$ that satisfies the conditions at Step 2.2.2.1. One more path will be added to $T_{\mathbf{S} - N_{k+1}}$ if $Ancestor(N) \not\rightarrow N_{k+1}$, or the paths that contain $N$ will be enlarged if $Ancestor(N) \rightarrow N_{k+1}$. These changes can be done in $T_{\bowtie \mathbf{S}}$ by using the FDs in $\bigcup_{R_i \in \mathbf{S}} F_{R_i}$, which is equivalent to $FD(T_{\mathbf{S}})$. Thus, for each path $P$ in $Path(T_{\mathbf{S}})$, there is a row $r$ in $T_{\bowtie \mathbf{S}}$ such that $P \subseteq \{C \mid r(C) = a\}$. Therefore, $\bowtie Path(T_{\mathbf{S}}) \cup FD(T_{\mathbf{S}})$ implies $\bowtie \mathbf{S}$ and, thus, $MVD(T_{\mathbf{S}}) \cup FD(T_{\mathbf{S}})$ implies $\bowtie \mathbf{S}$. Hence, Condition 1 of NNF [20] is satisfied. The satisfaction of Condition 2 of NNF [20] by $T$ is implied by the fact that every given relation scheme is in BCNF and by the conditions in Step 2.2.2.1. At Step 3, $Path(T)$ remains the same and thus $T$ still satisfies NNF [20]. □
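The closed-set condition that the proof relies on is easy to check mechanically. The sketch below follows the definition of a closed set of relation schemes directly and confirms the remark about Example 6. The function name `is_closed` and the encoding of relation schemes as strings of single-letter attributes are illustrative choices, not the paper's.

```python
def is_closed(database_scheme, subset):
    """Return True if `subset` is a closed set of relation schemes.

    S is closed if, for every R_i in the database scheme, some R_k in S
    contains the intersection of R_i with the union of the schemes in S.
    """
    schemes = [frozenset(r) for r in database_scheme]
    chosen = [frozenset(r) for r in subset]
    union = frozenset().union(*chosen)
    return all(
        any((r_i & union) <= r_k for r_k in chosen)
        for r_i in schemes
    )


# Example 6: R = {ABC, CDE, AEF, ACE}
R = ["ABC", "CDE", "AEF", "ACE"]

# ACE meets the union ACDEF in ACE, which fits in neither CDE nor AEF.
print(is_closed(R, ["CDE", "AEF"]))  # False

# ABC and ACE are neighbors in the join tree, so they form a connected subtree.
print(is_closed(R, ["ABC", "ACE"]))  # True
```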
5 COMPARISON OF THE NESTED NORMAL FORMS
In this section, we compare the nested normal forms with respect to generalizing 4NF and BCNF, reducing redundant data values in nested relations, and providing flexibility in nested relation scheme design. Furthermore, we also examine their algorithms and the nested database schemes that they produce.
5.1 Generalizing 4NF and BCNF
As stated in Section 3.1, flat relation schemes are also nested relation schemes. Here, we show that NNF [20], NNF [21], NNF [23], and NNF [25] all imply 4NF with respect to the given set of MVDs and FDs. Furthermore, each of the nested normal forms also implies BCNF when there are only FDs. However, the converses of these results are not true for NNF [21], NNF [23], and NNF [25].
Theorem 2. Let $U$ be a set of attributes. Let $D$ be a set of MVDs and FDs over $U$. Let $T$ be a single node scheme tree such that $Aset(T) \subseteq U$. $T$ is in NNF [20] with respect to $D$ if and only if $T$ is in 4NF with respect to $D$.
Proof. Theorem 6.1 in [20]. □
Lemma 6. Let $U$ be a set of attributes. Let $M$ be a set of MVDs over $U$. Let $Z$ be a key of $M$ and $X \subseteq Z$. There exists a dependent $V \in DEP_M(X)$ such that $Z \subseteq XV$.
Proof. Lemma 3.5 in [22]. □
Lemma 7. Let $U$ be a set of attributes. Let $M$ be a set of MVDs over $U$. Let $Z$ be a key of $M$. $Z$ is in 4NF with respect to $M$.
Proof. Assume $X \subseteq Z$. By Lemma 6, there exists a dependent $V \in DEP_M(X)$ such that $Z \subseteq XV$. Hence, $X \twoheadrightarrow V_i$ does not split $Z$ for every $V_i \in DEP_M(X)$. Therefore, $Z$ is in 4NF with respect to $M$. □
Theorem 3. Let $U$ be a set of attributes. Let $D$ be a set of MVDs and FDs over $U$. Let $T$ be a single node scheme tree such that $Aset(T) \subseteq U$. If $T$ is in NNF [21] with respect to $D$, then $T$ is also in 4NF with respect to $D$.
Proof. Since $T$ is a single node scheme tree, it only consists of the root. By Condition 4 of NNF [21], the root is a key of $M$, which is the set of MVDs defined in Definition 10. By Lemma 7, the root is in 4NF with respect to $M$. By the definition of $M$, it is clear that the root is also in 4NF with respect to $D$. □
Lemma 8. Let $U$ be a set of attributes. Let $D$ be a set of MVDs and FDs over $U$. Let $E(D)$ be the envelope set of $D$. Let $R \subseteq U$. If $R$ is in 4NF with respect to $E(D)$, then $R$ is also in 4NF with respect to $D$.
Proof. Proposition 4.2 in [33]. □
Theorem 4. Let $U$ be a set of attributes. Let $D$ be a set of MVDs and FDs over $U$. Let $T$ be a single node scheme tree such that $Aset(T) \subseteq U$. If $T$ is in NNF [23] with respect to $D$, then $T$ is also in 4NF with respect to $D$.
Proof. Since $T$ is a single node scheme tree, it only consists of the root. By Condition 4 of NNF [23], the root is a key of $E(D)$. By Lemma 7, the root is in 4NF with respect to $E(D)$. By Lemma 8, the root is also in 4NF with respect to $D$. □
Theorem 5. Let $U$ be a set of attributes. Let $D$ be a set of MVDs and FDs over $U$. Let $T$ be a single node scheme tree such that $Aset(T) \subseteq U$. If $T$ is in NNF [25] with respect to $D$, then $T$ is also in 4NF with respect to $D$.
Proof. By Condition 4 of NNF [25], the root $K$ is a key of $M'$ and is in $LHS(M')$, where $M'$ is the set of MVDs defined in Definition 12. $K$ is then replaced by one of its candidate keys, say $T$. Since $K$ is a key of $M'$, by Lemma 7, $K$ is in 4NF with respect to $M'$. We now show that $T$ is in 4NF with respect to $D$. Assume not; then, there is a nontrivial MVD $X \twoheadrightarrow Y$ that holds for $T$ with respect to $D$ such that $X \not\rightarrow T$. Thus, $X \subseteq T$ and $D$ implies an MVD $X \twoheadrightarrow Z$ on $U$ such that $Y = Z \cap T$. We now claim that $X^+ \twoheadrightarrow W$, where $W = Z \cap K$, is a nontrivial MVD that holds for $K$ with respect to $M'$ such that $X^+ \not\rightarrow K$. Therefore, $K$ is not in 4NF with respect to $M'$, which is a contradiction.
We first show that $X^+ \twoheadrightarrow W$ holds for $K$ with respect to $M'$. Since $D$ implies $X \twoheadrightarrow Z$, $D$ implies $X^+ \twoheadrightarrow Z$. By the definition of $M'$, $X^+ \twoheadrightarrow Z$ is in $M'$. Since $X \subseteq T$ and $T \subseteq K$, $X \subseteq K$. Therefore, $X^+ \subseteq K^+$. Since $K$ is in $LHS(M')$, $K^+ = K$. Hence, $X^+ \subseteq K$. Since $X^+ \subseteq K$ and $X^+ \twoheadrightarrow Z$ is in $M'$, $X^+ \twoheadrightarrow W$ holds for $K$ with respect to $M'$.
We next show that $X^+ \twoheadrightarrow W$ is nontrivial. Since $X \twoheadrightarrow Y$ is nontrivial and holds for $T$, neither $Y$ nor $T - XY$ is empty. Furthermore, $X \not\rightarrow Y$; otherwise, $T$ is not a candidate key. Therefore, there is an attribute $A \in Y$ such that $X \not\rightarrow A$. Since $A \in Y$ and $Y = Z \cap T$, $A \in Z$ and $A \in T$. Since $T \subseteq K$, $A \in K$. Therefore, $A \in W$. Since $X \not\rightarrow A$, $A \notin X^+$. Similarly, there is an attribute $B$ such that $B \in T - XY$, $X \not\rightarrow B$, and $B \in K - X^+ W$. Thus, $X^+ \twoheadrightarrow W$ is nontrivial on $K$. □
The converses of Theorems 3, 4, and 5 are all false, as the following example shows.
Example 7. Suppose $U = ABC$, $\mathbf{R} = \{AB, AC\}$, and $M = \{A \twoheadrightarrow B, A \twoheadrightarrow C\}$. Clearly, $M$ is conflict-free and $\mathbf{R}$ is the unique 4NF decomposition of $U$ with respect to $M$. However, neither AB nor AC is in NNF [21], NNF [23], or NNF [25] because neither AB nor AC is a key (A is the only key in this case). On the other hand, both AB and AC are in 4NF and NNF [20]. Notice that the nested relation scheme A(B)*(C)*, which has two paths AB and AC, is in NNF [21], NNF [23], and NNF [25]. However, this nested relation scheme is clearly not in 4NF. Further, due to Theorem 6 below, A(B)*(C)* is in NNF [20] as well.
In fact, it is easy to generalize the above example. Suppose $M$ is a conflict-free set of MVDs over a set $U$ of attributes. Let $\mathbf{R} = \{R_1, \ldots, R_n\}$, $n \geq 1$, be the unique 4NF decomposition of $U$ with respect to $M$. Each key of $M$ is of the form $R_p \cap R_q$, where $R_p \in \mathbf{R}$, $R_q \in \mathbf{R}$, and $(R_p, R_q)$ is an edge in a join tree of $\mathbf{R}$, which implies $M$ has at most $n - 1$ distinct keys.$^3$ Therefore, each $R_i$ in $\mathbf{R}$ is not in NNF [21], NNF [23], or NNF [25] because $R_i$ is not a key. On the other hand, each $R_i$ in $\mathbf{R}$ is in 4NF and NNF [20]. The case for BCNF follows immediately from the case for 4NF.
We conclude this section by stating that NNF [20] is superior to NNF [21], NNF [23], and NNF [25] in generalizing 4NF and BCNF.
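The counting argument in the generalization of Example 7 can be illustrated with a few lines of Python. The sketch assumes, as in the paragraph above, that the set of MVDs is conflict-free and that the relation schemes form its unique 4NF decomposition; it only enumerates the intersections along join tree edges, which are the possible keys, and does not verify conflict-freeness. The helper name `candidate_keys` is hypothetical.

```python
def candidate_keys(join_tree_edges):
    """Intersections R_p & R_q over the edges (R_p, R_q) of a join tree."""
    return {frozenset(p) & frozenset(q) for p, q in join_tree_edges}


# Example 7: R = {AB, AC}; its join tree has the single edge (AB, AC).
schemes = ["AB", "AC"]
candidates = candidate_keys([("AB", "AC")])
print(candidates)  # {frozenset({'A'})}

# No relation scheme coincides with a candidate, so none of them is a key of M.
print(any(frozenset(r) in candidates for r in schemes))  # False
```

A join tree over $n$ relation schemes has $n - 1$ edges, which is why at most $n - 1$ distinct intersections can arise.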
3. If $M$ is a set of trivial MVDs, then $M$ is conflict-free and $n = 1$. This means that $M$ has no key.
5.2 Reducing Redundant Data Values
In this section, each of the nested normal forms is investigated with respect to reducing redundant data values in nested relations. Two cases are considered: when the given set of MVDs is conflict-free and when the given set of MVDs is not conflict-free.
5.2.1 Conflict-Free Sets of MVDs
Here, we show that if a scheme tree $T$ is in NNF [21] with respect to a conflict-free set of MVDs, then $T$ is also in NNF [20]. The converse of this result, however, is not true. Furthermore, if a scheme tree $T$ is consistent with a conflict-free set $M$ of MVDs, and each path of $T$ is in 4NF with respect to $M$, then the nesting structure of $T$ is able to squeeze out redundant data values. When there is no FD, NNF [21], NNF [23], and NNF [25] are all equivalent. Therefore, these results also hold for NNF [23] and NNF [25]. Notice that we do not consider FDs here; instead, FDs are considered in Section 5.4.
Lemma 9. Let $U$ be a set of attributes. Let $M$ be a conflict-free set of MVDs over $U$. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. If $T$ is in NNF [21] with respect to $M$, then $Path(T)$ is in 4NF with respect to $M$.
Proof. Stated in the conclusions of [21]. □
Lemma 10. Let $U$ be a set of attributes. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. If $MVD(T)$ does not imply an MVD $X \twoheadrightarrow Y$ on $Aset(T)$, then $X \twoheadrightarrow Y$ splits a path of $T$.
Proof. Lemma 5.1 in [20]. □
Lemma 11. Let $U$ be a set of attributes. Let $M$ be a conflict-free set of MVDs over $U$. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$ and is consistent with $M$. Let $D$ be the set of MVDs implied by $M$ that hold for $T$. If $Path(T)$ is in 4NF with respect to $M$, then $MVD(T)$ implies $D$ on $Aset(T)$.
Proof. As mentioned in Section 3.2, a conflict-free set of MVDs allows for a unique 4NF decomposition [1]. Since $Path(T)$ is in 4NF with respect to $M$, which is conflict-free, and $T$ is consistent with $M$, $Path(T)$ is the unique 4NF decomposition of $Aset(T)$ with respect to $M$. Now, suppose $MVD(T)$ does not imply $D$. Then, there is an MVD $X \twoheadrightarrow Y$ in $D$ such that $MVD(T)$ does not imply $X \twoheadrightarrow Y$. By Lemma 10, $X \twoheadrightarrow Y$ splits a path of $T$ and, thus, $Path(T)$ is not a unique 4NF decomposition of $Aset(T)$ with respect to $M$, which is a contradiction. □
Theorem 6. Let $U$ be a set of attributes. Let $M$ be a conflict-free set of MVDs over $U$. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. If $T$ is in NNF [21] with respect to $M$, then $T$ is also in NNF [20] with respect to $M$.
Proof. Since there is no given FD, we only need to show that $T$ satisfies Condition 1 of NNF [20]. Since $T$ satisfies Condition 1 of NNF [21], $T$ is consistent with $M$. Thus, the set $D$ defined in Condition 1 of NNF [20] implies $MVD(T)$ on $Aset(T)$. By Lemma 9, $Path(T)$ is in 4NF with respect to $M$ and, then by Lemma 11, $MVD(T)$ implies the set $D$ defined in Condition 1 of NNF [20] on $Aset(T)$. □
The converse of Theorem 6 is not true, as demonstrated by Example 7 and Theorem 2.
Lemma 12. Let $U$ be a set of attributes. Let $M$ be a conflict-free set of MVDs over $U$. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. If $T$ is in NNF [20] with respect to $M$, then $Path(T)$ is in 4NF with respect to $M$.
Proof. Assume there is a path $P$ in $Path(T)$ such that $P$ is not in 4NF with respect to $M$. Then, there is a nontrivial MVD $X \twoheadrightarrow Y$ that holds for $P$. Since there is no given FD, $MVD(T) \cup FD(T)$ does not imply $X \twoheadrightarrow Y$, which violates Condition 1 of NNF [20]. □
When a conflict-free set of MVDs is given, by Lemmas 9 and 11, if a scheme tree $T$ is in NNF [21], then all the MVDs that hold for $T$ follow from $MVD(T)$. Thus, the nesting structure of $T$ is able to squeeze out the redundant data values. By Lemmas 11 and 12, NNF [20] also has this property. Since NNF [21], NNF [23], and NNF [25] are all equivalent when there is no FD, NNF [23] and NNF [25] both have this property.
Fig. 7. A nested relation with redundancy.
Example 8. Let $U$ = {Prof, Course, Hobby, Hobby-Equipment} and let $M$ = {Prof $\twoheadrightarrow$ Course, Hobby $\twoheadrightarrow$ Hobby-Equipment}. Notice that $M$ is conflict-free. Consider the nested relation in Fig. 7. Its scheme violates NNF [20], NNF [21], NNF [23], and NNF [25] because one of its paths (Prof, Hobby, Hobby-Equipment) is not in 4NF with respect to $M$. Due to this violation, there are redundant data values in the nested relation. As we can see, the equipment values of hiking are stored twice in the nested relation. After we decompose this nested relation into the two smaller nested relations in Fig. 8, the redundant data values are removed. Notice that every path of the two nested relation schemes in Fig. 8 is in 4NF and both nested relation schemes satisfy NNF [20], NNF [21], NNF [23], and NNF [25].
Fig. 8. A decomposition of the nested relation in Fig. 7.
We conclude this section by stating that when no FD is given, NNF [20], NNF [21], NNF [23], and NNF [25] are all able to reduce redundant data values with respect to conflict-free sets of MVDs. Furthermore, if there is no given FD, NNF [21] implies NNF [20] with respect to conflict-free sets of MVDs. The same results hold for NNF [23] and NNF [25].
5.2.2 Non-Conflict-Free Sets of MVDs
We first show a property of NNF [20].
Theorem 7. Let $U$ be a set of attributes. Let $D$ be a set of MVDs and FDs over $U$. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. If $T$ is consistent with $D$, then $T$ is in NNF [20] with respect to $D$ if and only if for every nested relation $R$ on $T$, $R$ does not have redundancy caused by any MVD or FD implied by $D$ that holds for $T$.
Proof. Follows immediately from Theorems 5.1 and 5.2 in [20]. □
Neither NNF [21], nor NNF [23], nor NNF [25] has this property, as the following example shows, which is similar to Example 2 in [2].
Example 9. Consider a genealogy application in which persons are interested in finding information about their ancestors with the help of professional genealogists. We then have the MVDs Person $\twoheadrightarrow$ Ancestor and Person $\twoheadrightarrow$ Genealogist because a person may be interested in finding information about several specific ancestors and may use the services of several genealogists to do so (i.e., Ancestor and Genealogist are independent with respect to Person). We also have the MVDs Ancestor $\twoheadrightarrow$ Person and Ancestor $\twoheadrightarrow$ Genealogist because it does not matter which genealogist comes up with information about a person's ancestor (i.e., Person and Genealogist are independent with respect to Ancestor). Hence, the given set of MVDs is {Person $\twoheadrightarrow$ Ancestor, Person $\twoheadrightarrow$ Genealogist, Ancestor $\twoheadrightarrow$ Person, Ancestor $\twoheadrightarrow$ Genealogist}. Consider the nested relation and its total unnesting in Fig. 9. There are redundant data values in the nested relation. For example, the last Mary value under (Genealogist)* is redundant because if it is covered up, based on the MVD Ancestor $\twoheadrightarrow$ Genealogist and the other data values in the nested relation, we can deduce that it must be Mary.
If a scheme tree $T$ violates Condition 1 or Condition 2 of NNF [20], then there exists a nested relation on $T$ that has redundancy [20]. The scheme of the nested relation in Fig. 9 violates Condition 1 of NNF [20] and Example 9 is designed to show the redundancy. However, this nested relation scheme satisfies NNF [21], NNF [23], and NNF [25]. From this example, we can see that NNF [21], NNF [23], and NNF [25] all allow nested relations with redundancy. To satisfy NNF [20], the nested relation in Fig. 9 needs to be decomposed into the two smaller nested relations in Fig. 10. The number of data values, however, increases from 9 to 11. Therefore, decomposing the nested relation in Fig. 9 cannot remove the redundant data values.
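The kind of redundancy discussed in Example 9 can be reproduced on a small flat relation. The sketch below checks whether an MVD holds in a relation given as a list of tuples, using the standard tuple-exchange definition of an MVD. The rows are hypothetical values chosen for illustration and are not the data of Fig. 9; the function name `satisfies_mvd` is likewise an assumption.

```python
from itertools import product


def satisfies_mvd(rows, attrs, lhs, rhs):
    """Check X ->> Y on a flat relation given as a list of tuples.

    For every pair of tuples that agree on X, the tuple that takes its
    Y-values from the first and all other values from the second must
    also be in the relation.
    """
    index = {a: i for i, a in enumerate(attrs)}
    x = [index[a] for a in lhs]
    y = [index[a] for a in rhs if a not in lhs]
    present = set(rows)
    for t1, t2 in product(rows, repeat=2):
        if all(t1[i] == t2[i] for i in x):
            t3 = list(t2)
            for i in y:
                t3[i] = t1[i]
            if tuple(t3) not in present:
                return False
    return True


attrs = ("Person", "Ancestor", "Genealogist")
# Hypothetical total unnesting: two persons share ancestor a1 and
# both genealogists g1 and g2 have worked on a1.
rows = [
    ("p1", "a1", "g1"),
    ("p1", "a1", "g2"),
    ("p2", "a1", "g1"),
    ("p2", "a1", "g2"),
]

print(satisfies_mvd(rows, attrs, ["Ancestor"], ["Genealogist"]))       # True
# Covering up the last tuple breaks Ancestor ->> Genealogist, so its
# Genealogist value can be deduced from the other tuples.
print(satisfies_mvd(rows[:-1], attrs, ["Ancestor"], ["Genealogist"]))  # False
```

The second call illustrates why the covered-up value is redundant: the MVD forces the missing tuple back into the relation.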
On the other hand, NNF [21], NNF [23], and NNF [25] all accept the nested relation scheme in Fig. 9 without requiring the nested relation to be decomposed. Notice that the set of MVDs in Example 9 is not conflict-free because it violates Condition 2 of Definition 5. As stated in [2], there is hardly a satisfactory solution to normalization with respect to non-conflict-free sets of MVDs.
Fig. 9. A nested relation with redundancy caused by an MVD.
Fig. 10. A decomposition of the nested relation in Fig. 9.
Notice that it is easy to generalize Example 9. Suppose $M$ is a conflict-free set of nontrivial MVDs. Let $T$ be a scheme tree in NNF [21], NNF [23], and NNF [25] with respect to $M$. By Theorem 6, $T$ is also in NNF [20] with respect to $M$. Since $M$ is a set of nontrivial MVDs, there are at least two paths in $T$. Choose two edges $(V, W_1)$ and $(V, W_2)$ in $T$, where $V$ is the parent of $W_1$ and $W_2$. Since $(V, W_2)$ is an edge in $T$ and $T$ is in NNF [20] with respect to $M$, $M$ implies an MVD $Ancestor(V) \twoheadrightarrow Z$ such that $Descendent(W_2) = Aset(T) \cap Z$. Now, we add two MVDs $Descendent(W_1) \twoheadrightarrow Ancestor(V)$ and $Descendent(W_1) \twoheadrightarrow Z$ to $M$. After adding these two MVDs, $M$ is no longer conflict-free since $Ancestor(V) \twoheadrightarrow Z$ and $Descendent(W_1) \twoheadrightarrow Z$ violate the intersection property. Now, $T$ is no longer in NNF [20] because of these two new MVDs. However, $T$ is still in NNF [21], NNF [23], and NNF [25]. Therefore, NNF [21], NNF [23], and NNF [25] all allow $T$ not to be decomposed while NNF [20] demands $T$ to be decomposed without being able to remove redundancy caused by these two MVDs.
We now argue that decomposition does not help in reducing redundant data values when non-conflict-free sets of MVDs are given. Our discussion closely follows that in [2]. If a set of MVDs is not conflict-free, there may be more than one possible 4NF decomposition. Hence, even if every path of a scheme tree $T$ is in 4NF, there may still be MVDs that hold for $T$ which do not follow from $MVD(T)$. Since the nesting structure of a nested relation scheme is only able to squeeze out redundant data values caused by $MVD(T)$, the MVDs that are not implied by $MVD(T)$ are the ones that cause the redundancy. Therefore, even after the paths of the scheme tree are separated to satisfy NNF [20], the redundant data values remain. This problem of decomposition has been pointed out in [2] (see Example 2 in that paper). Moreover, decomposition of a nested relation scheme incurs overhead. In Example 9, the data values Steve and Pat are replicated in the two smaller nested relations so that they can be joined back together to obtain the original nested relation.
Interestingly, if the given set of MVDs has the intersection property, then Conditions 1 and 2 of NNF [21] imply Condition 3 of NNF [21].
Theorem 8. Let $U$ be a set of attributes. Let $M$ be a set of MVDs over $U$ that has the intersection property. Let $T$ be a scheme tree such that $Aset(T) \subseteq U$. If $T$ satisfies Conditions 1 and 2 of NNF [21] with respect to $M$, then $T$ also satisfies Condition 3 of NNF [21] with respect to $M$.
Proof. Suppose $T$ does not satisfy Condition 3 of NNF [21]. We shall derive a contradiction. Assume there is an edge $(V, W)$ in $T$ and there is a key $X$ of $M$ such that there exists a dependent $Z \in \mathit{DEP}_M(X)$ and $\mathit{Descendent}(W) = Z \cap \mathit{Aset}(T)$. Assume also that there are some sibling nodes $W_1, \ldots, W_n$ of $W$ in $T$ such that $Y = \bigcup_{i=1}^{n} \mathit{Descendent}(W_i)$, $X \subseteq \mathit{Ancestor}(V)\,Y$, and $XY \twoheadrightarrow \mathit{Ancestor}(V)$ does not hold for $T$ with respect to $M$.
$\mathit{Ancestor}(V) \not\subseteq X$; otherwise, $XY \twoheadrightarrow \mathit{Ancestor}(V)$ on $\mathit{Aset}(T)$. Thus, $X \neq \mathit{Ancestor}(V)$. $X \not\subset \mathit{Ancestor}(V)$; otherwise, $\mathit{Ancestor}(V) \twoheadrightarrow \mathit{Descendent}(W)$ on $\mathit{Aset}(T)$ is not left-reduced with respect to $M$ and, thus, $T$ violates Condition 2 of NNF [21]. Hence, $X \not\subseteq \mathit{Ancestor}(V)$.
Since $T$ satisfies Condition 1 of NNF [21], $M$ implies an MVD $\mathit{Ancestor}(V) \twoheadrightarrow Z'$ on $U$ such that $\mathit{Descendent}(W) = Z' \cap \mathit{Aset}(T)$. Since the nodes in $T$ are pairwise disjoint, $X \cap Z = \emptyset$ and $\mathit{Ancestor}(V) \cap Z' = \emptyset$. $X \twoheadrightarrow Z$ implies $X(Z - Z') \twoheadrightarrow Z \cap Z'$ and $\mathit{Ancestor}(V) \twoheadrightarrow Z'$ implies $\mathit{Ancestor}(V)(Z' - Z) \twoheadrightarrow Z \cap Z'$. Since $X \cap Z = \emptyset$ and $\mathit{Ancestor}(V) \cap Z' = \emptyset$, $Z \cap Z'$ is disjoint from both $X(Z - Z')$ and $\mathit{Ancestor}(V)(Z' - Z)$. By the intersection property of $M$, $X(Z - Z') \cap \mathit{Ancestor}(V)(Z' - Z) \twoheadrightarrow Z \cap Z'$. However, $X(Z - Z') \cap \mathit{Ancestor}(V)(Z' - Z) = X \cap \mathit{Ancestor}(V)$ and $(Z \cap Z') \cap \mathit{Aset}(T) = \mathit{Descendent}(W)$. Since $\mathit{Ancestor}(V) \not\subseteq X$ and $X \not\subseteq \mathit{Ancestor}(V)$, $X \cap \mathit{Ancestor}(V)$ is a proper subset of $\mathit{Ancestor}(V)$. Thus, $\mathit{Ancestor}(V) \twoheadrightarrow \mathit{Descendent}(W)$ on $\mathit{Aset}(T)$ is not left-reduced with respect to $M$, which is a contradiction. $\Box$
We conclude this section by stating that, if the given set of MVDs is not conflict-free, NNF [21], NNF [23], and NNF [25] all allow redundancy while NNF [20] does not. On the other hand, decomposition, as demanded by NNF [20], does not help in reducing redundant data values in this situation. In addition, Condition 3 of NNF [21] is redundant with respect to sets of MVDs with the intersection property. This implies that Condition 3 of NNF [21] is also redundant with respect to conflict-free sets of MVDs since conflict-free sets of MVDs have the intersection property [1].
5.3 Design Flexibility
In this section, we show that NNF [20] allows greater flexibility in nested relation scheme design than NNF [21], NNF [23], and NNF [25] under the input requirements of Algorithm 1, which are very reasonable and practical. The way we prove our claim is to show that the set of nested relation schemes allowed by each of NNF [21], NNF [23], and NNF [25] is a proper subset of the set of nested relation schemes allowed by NNF [20] under the input requirements of Algorithm 1. In addition, by Theorem 7, NNF [20] precisely characterizes data redundancy for consistent nested relation schemes. Thus, by considering these two points together, the results in this section show that there are conditions in NNF [21], NNF [23], and NNF [25] that hinder design flexibility but have nothing to do with redundancy removal. To proceed, we first provide several examples to illustrate the problems of Conditions 4 of NNF [21], NNF [23], and NNF [25]. Then, we formally prove the main theorems of this section.
Example 10. Suppose $U = \{\text{Student}, \text{Semester}, \text{Course}, \text{Part-time Job}\}$ and $M = \{\text{Student Semester} \twoheadrightarrow \text{Course},\ \text{Student Semester} \twoheadrightarrow \text{Part-time Job}\}$. Some of the nested relation schemes allowed by NNF [20] are Student Semester (Course)* (Part-time Job)*, Student (Semester (Course)* (Part-time Job)*)*, and Semester (Student (Course)* (Part-time Job)*)*. On the other hand, because of their Conditions 4, Student Semester (Course)* (Part-time Job)* is the only nested relation scheme allowed by NNF [21], NNF [23], or NNF [25].
Example 11. Assume $U = \{\text{Prof}, \text{Hobby}, \text{Hobby-Equipment}\}$ and let $M = \{\text{Hobby} \twoheadrightarrow \text{Hobby-Equipment}\}$ be the given set of MVDs. The only nested relation scheme allowed by NNF [21], NNF [23], or NNF [25] is Hobby (Prof)* (Hobby-Equipment)* since Hobby is the only key. In this nested relation scheme, data are stored from the point of view of Hobby. Notice that Hobby (Prof)* (Hobby-Equipment)* also satisfies NNF [20]. Now, suppose we need to store the data from the point of view of Prof. Thus, the nested relation schemes that are needed are Prof (Hobby)* and Hobby (Hobby-Equipment)*. Both of these nested relation schemes satisfy NNF [20]. However, since Prof is not a key, Prof (Hobby)* violates Conditions 4 of NNF [21], NNF [23], and NNF [25].
We now show another example to illustrate another problem of Conditions 4 in NNF [21], NNF [23], and NNF [25].
Example 12. As discussed in [20], the scheme tree in Fig. 3 is in NNF [20], but it violates NNF [21], NNF [23], and NNF [25]. There are several violations. Since the reasons behind the violations for NNF [23] and NNF [25] are quite similar to those of NNF [21], we focus our discussions on NNF [21]. Notice that the set of MVDs and FDs considered in this example (presented in Fig. 4) satisfies the input requirements of Algorithm 1. We now argue that Matriculation cannot be an inner node in the scheme tree. Consider the set of attributes $\mathit{Descendent}(\text{Matriculation})$, which is equal to {Matriculation, Student, Interest}. Student is in $\mathit{FK}(\mathit{Descendent}(\text{Matriculation}))$ because Student is a key and is contained in $\mathit{Descendent}(\text{Matriculation})$. However, since Matriculation is not a key, it is not in $\mathit{FK}(\mathit{Descendent}(\text{Matriculation}))$ and, thus, the scheme tree violates a subcondition of Condition 4 of NNF [21]. Moreover, Dept Chair cannot be the root since Dept Chair is not a key. The reader can verify that these two violations also apply to NNF [23] and NNF [25]. Notice that there are also partial dependencies in the scheme tree and these partial dependencies force the scheme tree to be decomposed more than necessary. Partial dependencies are further discussed in Section 4.
Examples 10, 11, and 12 are very reasonable. By these examples, we can see that Conditions 4 of both NNF [21] and NNF [23] are problematic and NNF [25] inherits the same problem. We now proceed to prove our claim by considering two separate cases: when there is no FD and when there are FDs.
Theorem 9. Let $U$ be a set of attributes. Let $R$ be an acyclic database scheme over $U$. The set of nested relation schemes allowed by each of NNF [21], NNF [23], and NNF [25] with respect to $R$ is a proper subset of the set of nested relation schemes allowed by NNF [20] with respect to $R$.
Proof. Since $R$ is acyclic, $R$ is equivalent to a conflict-free set of MVDs. Thus, this theorem follows immediately from the results in Section 5.2.1. $\Box$
Theorem 10. Let $U$ be a set of attributes. Let $R$ be an acyclic database scheme over $U$ and let $F$ be a set of FDs such that each FD in $F$ is embedded in a relation scheme in $R$ and each relation scheme in $R$ is in BCNF with respect to $F$. The set of nested relation schemes allowed by each of NNF [23] and NNF [25] with respect to $R$ and $F$ is a proper subset of the set of nested relation schemes allowed by NNF [20] with respect to $R$ and $F$.
Since NNF [23] is designed to handle FDs when FDs are given, in Theorem 10 we only consider NNF [23] and NNF [25].
Proof. We first show that the theorem statement is true for NNF [23]. The decomposition algorithm in [23] will produce a set of scheme trees $T_1, \ldots, T_n$, $n \ge 1$, where each $T_i$ is in NNF [23] and every path in $T_i$ is in 4NF with respect to $R$ and $F$. Hence, each path in $T_i$ is also in BCNF with respect to $F$. In addition, the algorithm in [23] will guarantee that $\mathit{Path}(T_i)$ is a closed set of relation schemes of $R$. By Lemma 4 and by using a similar proof to Theorem 1, we can show that $T_i$ satisfies Condition 1 of NNF [20]. Since each path in $T_i$ is in BCNF with respect to $F$, $T_i$ also satisfies Condition 2 of NNF [20]. Thus, each $T_i$ is in NNF [20]. By Examples 10, 11, and 12, this superset and subset relationship is proper.
We now show that the theorem statement is also true for NNF [25]. Suppose $R$ is equal to $\{R_1, \ldots, R_n\}$, where $n \ge 1$, and suppose $T$ is a scheme tree constructed by the algorithm in [25]. By using the chase, it is easy to prove that $R$ and $F$ are equivalent to $S$ and $F$, where $S = \{S_1, \ldots, S_m\}$ in which each $S_i$ equals $R_j^+$, $1 \le j \le n$, and each $R_j^+$ equals $S_i$, $1 \le i \le m \le n$. We first show that $S$ is acyclic. Since $R$ is acyclic, $R$ has a join tree $J$. It is easy to modify $J$ to obtain a join tree for $S$; thus, we can conclude that $S$ is acyclic. Consider a nontrivial FD $X \to A$ in $F$. As we compute the closure of each relation scheme in $R$, if $R_i$ is a superset of $X$, we add $A$ to $R_i$. Let $X \to A$ be embedded in $R_{XA}$. By Condition 2 of Definition 6, every $R_j$ on the unique path that leads from $R_i$ to $R_{XA}$ is a superset of $X$. Thus, we have to add $A$ to every $R_j$ on that path. After we do that, it is clear that Condition 2 of Definition 6 is still satisfied. Thus, $S$ is acyclic and $S$ is equivalent to a conflict-free set of MVDs, say $M'$. Since every $S_i$ is a closure, the left-hand side of every MVD in $M'$ is also a closure and, thus, $M'$ is the set of MVDs we need in Definition 12. Similarly, the algorithm in [25] will guarantee that $\mathit{Path}(T)$ is a closed set of relation schemes of $S$. Now, since each $S_i$ is in 4NF with respect to $M'$, and each path in $T$ is constructed precisely from an $S_i$, by using a similar proof to Theorem 1, we can show that $T$ satisfies Condition 1 of NNF [20]. The algorithm in [25] will also make sure that each node in a path of $T$ will functionally determine its parent. After this step, each node is replaced by one of its candidate keys. Hence, $T$ also satisfies Condition 2 of NNF [20]. Thus, $T$ is in NNF [20]. By Examples 10, 11, and 12 again, this superset and subset relationship is proper. $\Box$
By Theorems 9 and 10, we conclude this section by stating that NNF [20] provides greater design flexibility than the other three nested normal forms under the input requirements of Algorithm 1. However, the results in this section do not apply if the input does not satisfy the input requirements of Algorithm 1.
Fig. 11. A nested relation I on A (B (C (D)*)*)* and its total unnesting.
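The scheme-tree notation used in Examples 10 to 12 can be made concrete with a small sketch. It assumes, consistently with Example 12, that Descendent(N) contains the attributes of N together with those of all nodes below it, and, as a further assumption, that Ancestor(N) contains the attributes of N together with those of all nodes above it; a path is taken to be the union of the nodes from the root to a leaf. The class and function names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    attrs: frozenset
    children: list = field(default_factory=list)

def node(*attrs, children=()):
    """Convenience constructor for a scheme-tree node."""
    return Node(frozenset(attrs), list(children))

def aset(t):
    """All attributes appearing in the subtree rooted at t."""
    out = set(t.attrs)
    for c in t.children:
        out |= aset(c)
    return out

# Descendent(N): attributes of N and of every node below it (assumed definition).
descendent = aset

def paths(t, prefix=frozenset()):
    """Attribute sets of all root-to-leaf paths of the tree rooted at t."""
    here = prefix | t.attrs
    if not t.children:
        return [here]
    return [p for c in t.children for p in paths(c, here)]

def ancestor(root, target, prefix=frozenset()):
    """Attributes of `target` and of every node above it, or None if absent."""
    here = prefix | root.attrs
    if root is target:
        return here
    for c in root.children:
        found = ancestor(c, target, here)
        if found is not None:
            return found
    return None

# Student (Semester (Course)* (Part-time Job)*)* from Example 10.
course = node("Course")
job = node("Part-time Job")
semester = node("Semester", children=[course, job])
tree = node("Student", children=[semester])

tree_paths = paths(tree)
semester_ancestor = ancestor(tree, semester)
semester_descendent = descendent(semester)
```

With this representation, the three schemes of Example 10 differ only in which attributes sit in the root node and which sit in inner nodes, while their path sets coincide.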
5.4 Comparison of the Algorithms
In this section, we first discuss how to utilize FDs in creating large scheme trees. Second, we present some results on dependency preservation.
5.4.1 Utilizing FDs
In general, a nested relation scheme needs to be decomposed if redundancy occurs in the nested relations on that scheme. Nevertheless, with FDs, redundancy may have been prevented and, thus, decomposition becomes unnecessary. In this section, we first show an example demonstrating how FDs can be used to prevent redundancy. After that, we show another example which demonstrates that removing partial dependencies, as defined in Conditions 2 of NNF [21] and NNF [23], will force a nested relation scheme to be decomposed more than necessary.
Example 13. Suppose T = A (B (C (D)*)*)* is the nested relation scheme in Fig. 11. Further, assume that the given set $M$ of MVDs is $\{B \twoheadrightarrow A,\ C \twoheadrightarrow D\}$. It is clear that both MVDs in $M$ hold for $T$. The nested relation $I$ in Fig. 11 clearly has redundancy caused by both MVDs. First of all, notice that $T$ violates NNF [20], NNF [21], NNF [23], and NNF [25]. Now, in addition to $M$, assume we also have a set $F$ of FDs $\{B \to A,\ C \to B\}$. With $F$, the redundancy cannot happen. Notice that $I$ violates both FDs in $F$. In fact, no nested relation on $T$ that satisfies $M$ and $F$ can have redundancy. $T$ is in NNF [20] with respect to $M$ and $F$. However, $T$ still violates NNF [21], NNF [23], and NNF [25] with respect to $M$ and $F$. To satisfy either NNF [21] or NNF [23], $T$ needs to be decomposed into two nested relation schemes B (A)* (C)* and C (D)* (see footnote 4). For NNF [25], since the set $M'$ of MVDs defined in Definition 12 is trivial, $M'$ has no key. Hence, NNF [25] is ill-defined for $M$ and $F$.
Removing partial dependencies, as demanded by Conditions 2 of NNF [21] and NNF [23], regardless of the given FDs, may lead to small scheme trees and an unnecessarily large number of scheme trees.
Example 14. The scheme tree in Fig. 3 has partial dependencies according to the definitions of NNF [21] and NNF [23]. Consider the edge (Student, Interest). $\mathit{Ancestor}(\text{Student}) \twoheadrightarrow \mathit{Descendent}(\text{Interest})$ is not left-reduced and, thus, the scheme tree violates Conditions 2 of NNF [21] and NNF [23]. Furthermore, $\mathit{Ancestor}(\text{Prof}) \twoheadrightarrow \mathit{Descendent}(\text{Hobby})$ is also not left-reduced according to NNF [21] and NNF [23]. To remove these partial dependencies, which do not cause any data redundancy in the presence of the FDs in Fig. 4, the algorithms of NNF [21] and NNF [23] decompose more than necessary. For example, the decomposition algorithm in [23] produces the following four nested relation schemes (see footnote 5):
1. Dept (Chair)* (Prof)*,
2. Prof (Student)* (Hobby)*,
3. Student (Matriculation)* (Interest)*, and
4. Hobby (Hobby-Equipment)*.
Since NNF [23] is designed to handle FDs when FDs are given, we focus our discussion on NNF [23] and NNF [25]. By Theorem 10, the set of nested relation schemes accepted by either NNF [23] or NNF [25] is a proper subset of the set of nested relation schemes accepted by NNF [20] under the input requirements of Algorithm 1. Therefore, NNF [20] must allow greater attribute clustering than NNF [23] and NNF [25]. Thus, we have established our claim. In short, when the input satisfies the input requirements of Algorithm 1, NNF [20] utilizes FDs more in constructing large scheme trees than NNF [21], NNF [23], and NNF [25]. This claim does not apply if the input does not satisfy the input requirements of Algorithm 1.
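The interplay between total unnesting and FDs in Example 13 can be illustrated with a short sketch. It assumes a nested tuple is represented as a pair of atomic values and a list of sub-relations, and that total unnesting combines one flat tuple from every sub-relation. The helper names and the sample data are hypothetical; the data are not the contents of Fig. 11.

```python
from itertools import product

def nt(atoms, *subrels):
    """Build a nested tuple: atomic values plus zero or more sub-relations."""
    return (atoms, list(subrels))

def unnest(nested_rel):
    """Totally unnest a nested relation into a list of flat dict tuples.

    A nested tuple with an empty sub-relation contributes no flat tuples.
    """
    flat = []
    for atoms, subrels in nested_rel:
        # Unnest every sub-relation, then combine one flat tuple from each.
        parts = [unnest(sub) for sub in subrels]
        for combo in product(*parts):
            row = dict(atoms)
            for piece in combo:
                row.update(piece)
            flat.append(row)
    return flat

def satisfies_fd(flat, lhs, rhs):
    """Return True if the flat relation satisfies the FD lhs -> rhs."""
    seen = {}
    for row in flat:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

# Hypothetical nested relation on a chain-shaped scheme A (B (C (D)*)*)*.
d_rel = [nt({"D": "d1"}), nt({"D": "d2"})]
c_rel = [nt({"C": "c1"}, d_rel)]
b_rel = [nt({"B": "b1"}, c_rel)]
nested = [nt({"A": "a1"}, b_rel), nt({"A": "a2"}, b_rel)]

flat = unnest(nested)
b_determines_a = satisfies_fd(flat, ["B"], ["A"])
c_determines_b = satisfies_fd(flat, ["C"], ["B"])
```

In this instance the subtree under b1 is stored twice, once under each A value, and the instance violates B → A. An instance that satisfies B → A cannot repeat a B value under two different A values, which is the kind of redundancy the FDs rule out.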
Footnote 4. Notice that the algorithm in [23] discovers that there is an FD $B \to A$ pointing down in B (A)* (C)* and, thus, it will bring A up to the level of B. Thus, the actual nested relation schemes generated are BA (C)* and C (D)*.
Footnote 5. Because of the FDs, the actual nested relation schemes generated are 1. Dept Chair (Prof)*, 2. Prof (Student)* (Hobby)*, 3. Student Matriculation (Interest)*, and 4. Hobby (Hobby-Equipment)*.
5.4.2 Dependency Preservation
Let $D$ be a set of MVDs and FDs over a set $U$ of attributes. A decomposition $R = \{R_1, \ldots, R_n\}$, where $n \ge 1$ and $\bigcup_{R_i \in R} R_i = U$, of $U$ with respect to $D$ is dependency preserving if there is a set $F$ of FDs such that each FD in $F$ is embedded in an $R_i$, and $R \cup F$ is equivalent to $D$ on $U$ [33].
Given a set $U$ of attributes and a set $D$ of MVDs and FDs over $U$, using the decomposition algorithm in [23], a set of scheme trees $\{T_1, \ldots, T_m\}$, $m \ge 1$, is generated. According to Proposition 6.3 in [23], if $D$ is extended conflict-free, then $\bigcup_{i=1}^{m} \mathit{Path}(T_i)$ is an acyclic, 4NF, and dependency preserving decomposition of $U$ with respect to $D$. Notice that the definition of an extended conflict-free set of MVDs and FDs has undergone substantial development. The definition used by the authors in [23] is old and they published a new definition in [32]. According to Proposition 6.1 in [32], an extended conflict-free set of MVDs and FDs is necessary and sufficient to have an acyclic, 4NF, and dependency preserving database scheme. In Lemma 13 below, we show that the definition of an extended conflict-free set of MVDs and FDs is equivalent to the input requirements of Algorithm 1. In Theorem 11, we show that the nested database schemes generated by Algorithm 1 are dependency preserving.
Lemma 13. Let $U$ be a set of attributes. Let $R$ be an acyclic database scheme over $U$, and let $F$ be a set of FDs such that each FD in $F$ is embedded in a relation scheme in $R$. Each relation scheme $R_i$ in $R$ is in BCNF with respect to $F$ if and only if $R_i$ is in 4NF with respect to $R$ and $F$.
Proof. The if-part is obvious. We now prove the only-if part. Assume there is an $R_k$ in $R$ such that $R_k$ is not in 4NF with respect to $R$ and $F$. Thus, there is an MVD $X \twoheadrightarrow Y$ implied by $R$ and $F$ on $U$ such that $X \twoheadrightarrow Y$ splits $R_k$. Consider the tableau $T_{XY}$ for $X \twoheadrightarrow Y$. $T_{XY}$ has two rows $r_1$ and $r_2$. Row $r_1$ has $a$'s under the $XY$-columns and $b$'s elsewhere and row $r_2$ has $a$'s under the $X(U - XY)$-columns and $b$'s elsewhere. Since $X \twoheadrightarrow Y$ splits $R_k$, $R_k$ is a subset of neither $XY$ nor $X(U - XY)$. Nevertheless, since $R$ and $F$ imply $X \twoheadrightarrow Y$ on $U$, by Lemma 2, we can assume that we apply the FDs in $F$ until no more $b$'s can be changed into $a$, and then for every $R_i \in R$, $R_i \subseteq \{C \mid r_1(C) = a\}$ or $R_i \subseteq \{C \mid r_2(C) = a\}$. Without loss of generality, assume $R_k \subseteq X^{+}Y$. Since $R_k$ is in BCNF with respect to $F$, it must be that $X \to R_k$. Thus, $R_k$ is in 4NF with respect to $R$ and $F$, which is a contradiction. $\Box$
Theorem 11. Algorithm 1 generates nested database schemes that are dependency preserving with respect to the input of Algorithm 1.
Proof. Let $U$ be a set of attributes. Let $R = \{R_1, \ldots, R_n\}$, $n \ge 1$, be an acyclic database scheme over $U$ and let $F$ be a set of FDs such that each FD in $F$ is embedded in a relation scheme $R_i$ in $R$ and each relation scheme $R_i$ is in BCNF with respect to $F$. Assume $R$ and $F$ are input to Algorithm 1 and Algorithm 1 generates $m$ scheme trees $T_1, \ldots, T_m$ from $m$ closed sets of relation schemes $S_1, \ldots, S_m$ of $R$. By the construction of Algorithm 1, $\bigcup_{i=1}^{m} S_i = R$. The set $F$ of FDs is clearly preserved since each FD in $F$ is embedded in an $R_i$. By Theorem 1, $\mathit{MVD}(T_{S_j}) \cup \mathit{FD}(T_{S_j})$ is equivalent to $\bigcup_{R_i \in S_j} F_{R_i}$ and $S_j$, $1 \le j \le m$. Therefore, $\mathit{MVD}(T_{S_1}) \cup \mathit{FD}(T_{S_1}) \cup \ldots \cup \mathit{MVD}(T_{S_m}) \cup \mathit{FD}(T_{S_m})$ is equivalent to $\bigcup_{i=1}^{m} S_i$ and $F$, which is equivalent to $R \cup F$. $\Box$
Thus, the decomposition algorithm in [23] produces dependency preserving nested database schemes from extended conflict-free sets of MVDs and FDs. Algorithm 1 also produces dependency preserving nested database schemes from the input of Algorithm 1. As we have proven in Lemma 13, the definition of an extended conflict-free set of MVDs and FDs is equivalent to the input requirements of Algorithm 1.
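Two of the input requirements of Algorithm 1 stated in Theorem 11, namely that each FD is embedded in a relation scheme and that each relation scheme is in BCNF with respect to F, can be checked mechanically. The following is a minimal sketch using the standard attribute-closure algorithm and an exhaustive subset test for BCNF; it does not check acyclicity of R, and the function names and sample schemes are hypothetical rather than taken from the paper.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(scheme, fds):
    """Exhaustive BCNF test of `scheme` against the FDs implied by `fds`."""
    scheme = set(scheme)
    for k in range(1, len(scheme)):
        for combo in combinations(sorted(scheme), k):
            x = set(combo)
            determined = closure(x, fds) & scheme
            # Violation: x determines something new in the scheme
            # without determining the whole scheme.
            if determined != x and determined != scheme:
                return False
    return True

def meets_fd_requirements(schemes, fds):
    """Each FD is embedded in some scheme and each scheme is in BCNF."""
    embedded = all(any(lhs | rhs <= set(r) for r in schemes) for lhs, rhs in fds)
    return embedded and all(is_bcnf(r, fds) for r in schemes)

# The FDs of Example 13 with a hypothetical database scheme.
fds = [({"B"}, {"A"}), ({"C"}, {"B"})]
schemes = [{"A", "B"}, {"B", "C"}, {"C", "D"}]

c_closure = closure({"C"}, fds)
requirements_met = meets_fd_requirements(schemes, fds)
single_scheme_bcnf = is_bcnf({"A", "B", "C"}, fds)
```

The subset enumeration is exponential in the size of a scheme, so this form is only suitable for small schemes such as those in the examples.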
6 CONCLUSIONS
The results of this paper are summarized in the following:
1. We have presented a more general algorithm for NNF [20], which is called Algorithm 1 in this paper.
2. NNF [20] is equivalent to 4NF when MVDs and FDs are given and NNF [20] is equivalent to BCNF when only FDs are given. NNF [21], NNF [23], and NNF [25] all imply 4NF and BCNF but neither 4NF nor BCNF implies any of them.
3. When no FD is given, NNF [20], NNF [21], NNF [23], and NNF [25] are all able to reduce redundant data values with respect to conflict-free sets of MVDs.
4. When no FD is given, NNF [21], NNF [23], and NNF [25] all imply NNF [20] with respect to conflict-free sets of MVDs but NNF [20] does not imply any of them.
5. When the given set of MVDs is not conflict-free, NNF [21], NNF [23], and NNF [25] all allow redundancy while NNF [20] does not. On the other hand, decomposition, as demanded by NNF [20], does not help in reducing redundant data values in this situation.
6. Condition 3 of NNF [21] is redundant with respect to sets of MVDs with the intersection property. This implies that Condition 3 of NNF [21] is also redundant with respect to conflict-free sets of MVDs since conflict-free sets of MVDs have the intersection property.
7. NNF [20] provides greater flexibility in nested relation scheme design than the other three nested normal forms under the input requirements of Algorithm 1. However, this claim does not apply if the input does not satisfy the input requirements of Algorithm 1.
8. When the input satisfies the input requirements of Algorithm 1, NNF [20] utilizes FDs more in constructing large scheme trees than NNF [21], NNF [23], and NNF [25]. Again, this claim does not apply if the input does not satisfy the input requirements of Algorithm 1.
9. The decomposition algorithm in [23] produces dependency preserving nested database schemes from extended conflict-free sets of MVDs and FDs. Algorithm 1 also produces dependency preserving nested database schemes from the input of Algorithm 1. Further, the definition of an extended conflict-free set of MVDs and FDs is equivalent to the input requirements of Algorithm 1.
ACKNOWLEDGMENTS
Example 9 is authored by Professor David W. Embley of Brigham Young University. The author would also like to thank Professor Z. Meral Ozsoyoglu and the reviewers for their many helpful suggestions.
REFERENCES
[1] C. Beeri, R. Fagin, D. Maier, and M. Yannakakis, "On the Desirability of Acyclic Database Schemes," J. ACM, vol. 30, no. 3, pp. 479-513, July 1983.
[2] C. Beeri and M. Kifer, "An Integrated Approach to Logical Design of Relational Database Schemes," ACM Trans. Database Systems, vol. 11, no. 2, pp. 134-158, June 1986.
[3] G. Booch, J. Rumbaugh, and I. Jacobson, The Unified Modeling Language: User Guide. Reading, Mass.: Addison-Wesley, 1999.
[4] M.J. Carey, N.M. Mattos, and A.K. Nori, "Object-Relational Database Systems: Principles, Products, and Challenges (Tutorial)," Proc. 1997 ACM SIGMOD Int'l Conf. Management of Data, p. 502, May 1997.
[5] A. Eisenberg and J. Melton, "SQL:1999, Formerly Known as SQL3," SIGMOD Record, vol. 28, no. 1, pp. 131-138, Mar. 1999.
[6] R. Elmasri and S. Navathe, Fundamentals of Database Systems, third ed. Reading, Mass.: Addison-Wesley, 2000.
[7] R. Fagin, "Multivalued Dependencies and a New Normal Form for Relational Databases," ACM Trans. Database Systems, vol. 2, no. 3, pp. 262-278, Sept. 1977.
[8] R. Fagin, "Degrees of Acyclicity for Hypergraphs and Relational Database Schemes," J. ACM, vol. 30, no. 3, pp. 514-550, July 1983.
[9] R. Fagin, A.O. Mendelzon, and J.D. Ullman, "A Simplified Universal Relation Assumption and Its Properties," ACM Trans. Database Systems, vol. 7, no. 3, pp. 343-360, Sept. 1982.
[10] P.C. Fischer, L.V. Saxton, S.J. Thomas, and D. Van Gucht, "Interactions Between Dependencies and Nested Relational Structures," J. Computer and System Sciences, vol. 31, no. 3, pp. 343-354, Dec. 1985.
[11] N. Goodman, O. Shmueli, and Y.C. Tay, "GYO Reductions, Canonical Connections, Tree and Cyclic Schemas and Tree Projections," Proc. Second ACM SIGACT-SIGMOD Symp. Principles of Database Systems, pp. 267-278, Mar. 1983.
[12] M.H. Graham and M. Yannakakis, "Independent Database Schemas," J. Computer and System Sciences, vol. 28, no. 1, pp. 121-141, 1984.
[13] H. Ishikawa, F. Suzuki, F. Kozakura, A. Makinouchi, M. Miyagishima, Y. Izumida, M. Aoshima, and Y. Yamane, "The Model, Language, and Implementation of an Object-Oriented Multimedia Knowledge Base Management System," ACM Trans. Database Systems, vol. 18, no. 1, pp. 1-50, Mar. 1993.
[14] ISO Database Language SQL, Part 2: Foundation (ISO/IEC 9075-2). Int'l Organization for Standardization, 1998.
[15] W. Kim, "Bringing Object/Relational Down to Earth," Database Programming and Design, vol. 10, no. 7, pp. 26-35, July 1997.
[16] T.W. Ling and L.L. Yan, "NF-NR: A Practical Normal Form for Nested Relations," J. Systems Integration, vol. 4, pp. 309-340, 1994.
[17] D. Maier, The Theory of Relational Databases. Rockville, Md.: Computer Science Press, 1983.
[18] W.Y. Mok and D.W. Embley, "Transforming Conceptual Models to Object-Oriented Database Designs: Practicalities, Properties, and Peculiarities," Conceptual Modeling: ER '96, pp. 309-324, Oct. 1996.
[19] W.Y. Mok and D.W. Embley, "Using NNF to Transform Conceptual Data Models to Object-Oriented Database Designs," Data and Knowledge Eng., vol. 24, no. 3, pp. 313-336, Jan. 1998.
[20] W.Y. Mok, Y.K. Ng, and D.W. Embley, "A Normal Form for Precisely Characterizing Redundancy in Nested Relations," ACM Trans. Database Systems, vol. 21, no. 1, pp. 77-106, Mar. 1996.
[21] Z.M. Ozsoyoglu and L.Y. Yuan, "A New Normal Form for Nested Relations," ACM Trans. Database Systems, vol. 12, no. 1, pp. 111-136, Mar. 1987.
[22] Z.M. Ozsoyoglu and L.Y. Yuan, "Reduced MVDs and Minimal Covers," ACM Trans. Database Systems, vol. 12, no. 3, pp. 377-394, Sept. 1987.
[23] Z.M. Ozsoyoglu and L.Y. Yuan, "On the Normalization in Nested Relational Databases," Lecture Notes in Computer Science, pp. 243-271, 1989.
[24] R. Ramakrishnan, Database Management Systems. Boston, Mass.: WCB/McGraw-Hill, 1998.
[25] M.A. Roth and H.F. Korth, "The Design of ¬1NF Relational Databases into Nested Normal Form," Proc. 1987 ACM SIGMOD Int'l Conf. Management of Data, pp. 143-159, May 1987.
[26] M.A. Roth, H.F. Korth, and A. Silberschatz, "Extended Algebra and Calculus for Nested Relational Databases," ACM Trans. Database Systems, vol. 13, no. 4, pp. 389-417, Dec. 1988.
[27] M. Stonebraker, P. Brown, and D. Moore, Object-Relational DBMSs: Tracking the Next Great Wave, second ed. San Francisco: Morgan Kaufmann Publishers, 1999.
[28] E. Sciore, "Real-World MVD's," Proc. 1981 ACM SIGMOD Int'l Conf. Management of Data, pp. 121-132, Apr. 1981.
[29] A. Silberschatz, H.F. Korth, and S. Sudarshan, Database System Concepts, third ed. Boston: WCB/McGraw-Hill, 1999.
[30] Z. Tari, J. Stokes, and S. Spaccapietra, "Object Normal Forms and Dependency Constraints for Object-Oriented Schemata," ACM Trans. Database Systems, vol. 22, no. 4, pp. 513-569, Dec. 1997.
[31] S. Urman, Oracle8 PL/SQL Programming. Berkeley, Calif.: Osborne/McGraw-Hill, 1997.
[32] L.Y. Yuan and Z.M. Ozsoyoglu, "Design of Desirable Relational Database Schemes," J. Computer and System Sciences, vol. 45, no. 3, pp. 435-470, Dec. 1992.
[33] L.Y. Yuan and Z.M. Ozsoyoglu, "Unifying Functional and Multivalued Dependencies for Relational Database Design," Information Sciences, vol. 59, pp. 189-211, 1992.
Wai Yin Mok received the BS, MS, and PhD degrees in computer science from Brigham Young University in 1990, 1992, and 1996, respectively. He is an assistant professor of information systems at the University of Alabama in Huntsville. From 1999 to 2001, he was an assistant professor of information systems at Utah State University, Logan, Utah. From 1996 to 1999, he was an assistant professor of computer science at the University of Akron in Ohio. His papers appear in ACM Transactions on Database Systems, Data and Knowledge Engineering, and Information Processing Letters. Currently, he is on the editorial review board of the Journal of Database Management. He is a member of the IEEE and the IEEE Computer Society.