Strong Functional Dependencies and a Redundancy Free Normal Form for XML

Millist W. Vincent and Jixue Liu and Chengfei Liu
Advanced Computing Research Centre, School of Computer and Information Science
The University of South Australia
The Levels, SA5095, Adelaide, Australia
Email: millist.vincent@unisa.edu.au

Abstract

In this paper we address the problem of how to extend the definition of functional dependencies (FDs) in incomplete relations to XML documents. There are two complementary approaches to defining FDs in incomplete relational databases. The first approach, called the weak satisfaction approach, defines an FD to be weakly satisfied in an incomplete relation if there is at least one complete relation, obtained by replacing all null values by data values, which satisfies the FD in the ordinary sense. The second approach, called the strong satisfaction approach, defines an FD to be strongly satisfied in an incomplete relation if every complete relation, obtained by replacing all null values by data values, satisfies the FD in the ordinary sense. In this paper we consider the extension of the strong satisfaction approach to defining FDs in XML (referred to as XFDs). We propose a syntactic definition of strong XFD satisfaction in an XML document and then justify it on two grounds. The first justification is to show that for a very general class of mappings of a relation into an XML document, a relation strongly satisfies a unary FD if and only if the XML document also strongly satisfies the equivalent XFD. The second justification is to show that an XML tree strongly satisfies an XFD if and only if every completion of the tree also strongly satisfies the XFD. This last result is the XML equivalent of a well known result in relational database theory which establishes the equivalence of the semantic definition of strong satisfaction (the one just given) with a syntactic definition. The next contribution of our paper is the investigation of the issue of axiomatization of strong XFD implication: we provide a set of axioms for reasoning about XFD implication and show that the set is sound. The axioms have the well known Armstrong's axioms for FDs in relational databases as a subset but contain some additional axioms which have no counterparts in Armstrong's system.
The final contribution of the paper is to propose a normal form for XML documents based on our definition of XFDs and provide a formal justification for it by proving that it is a necessary and sufficient condition for the elimination of redundancy in an XML document.

1 Introduction

The eXtensible Markup Language (XML) [11] has recently emerged as a standard for data representation and interchange on the Internet [46, 1]. While providing syntactic flexibility, XML provides little semantic content and as a result several papers have addressed the topic of how to improve the semantic expressiveness of XML. Among the most important of these approaches has been that of defining integrity constraints in XML [13, 28]. Several different classes of integrity constraints for XML have been defined, including key constraints [12, 13, 14], path constraints [3, 17, 13, 16] and inclusion constraints [27, 26], and properties such as axiomatization and satisfiability have been investigated for these constraints. One observation to make on this research is that the flexible structure of XML makes the investigation of integrity constraints in XML more complex and subtle than in relational databases. However, one topic that has been identified as an open problem in XML research [46] and which has been little investigated is how to extend the oldest and most well studied integrity constraint in relational databases, namely functional dependencies (FDs), to XML and then how to develop a normalization theory for XML. This problem is not of just theoretical interest. The theory of FDs and normalization forms the cornerstone of practical relational database design, and the development of a similar theory for XML will similarly lay the foundation for understanding how to design XML documents. The only paper that has specifically addressed this problem is the recent paper [5]. The approach in [5] was based, roughly speaking (see footnote 1), on an extension of the concept of weak satisfaction of FDs in incomplete relational databases to XML documents.
In our paper we investigate the problem of how to define FDs and normal forms in XML using an extension of the other approach to defining the satisfaction of FDs in incomplete relations, namely the approach based on the concept of strong satisfaction. Before outlining the contributions of our paper, we shall briefly review the weak and strong satisfaction approaches in incomplete relations and then discuss their extensions to XML. For a more complete discussion of integrity constraints and incomplete information in relational databases we refer the reader to [35, 7]. We assume that the reader is familiar with the basic concepts of relational databases and functional dependencies when relations are complete as given in any standard book on database theory [35, 7, 2].

Consider then the problem of defining satisfaction of FDs in incomplete relations when the incomplete information has the semantics of being unknown (for an analysis of FD satisfaction when incomplete information has the meaning of being no information refer to [7]). We represent missing values by the special symbol $\perp$, and Figure 1 shows a relation with incomplete information. Consider the FDs Course $\rightarrow$ Lecturer and Course $\rightarrow$ Department. In the presence of null values, an incomplete relation is assumed to represent the set of all possible complete relations which are obtained by replacing all occurrences of $\perp$ by values drawn from the domain of the underlying attributes. Denote by $Poss(r)$ the set of all complete relations which are obtained by replacing all occurrences of $\perp$ by data values in a relation $r$.

Course    Lecturer    Department
c1        l1          d1
c1        l1          $\perp$

Figure 1: A relation containing incomplete information.

Then $r$ is defined to weakly satisfy an FD $X \rightarrow Y$ if there exists at least one $r_1 \in Poss(r)$ that satisfies $X \rightarrow Y$ in the ordinary sense. Thus the relation $r$ in Figure 1 weakly satisfies both the FDs Course $\rightarrow$ Lecturer and Course $\rightarrow$ Department, since the replacement of $\perp$ by d1 results in a complete relation which satisfies both FDs. In contrast, a relation $r$ is said to strongly satisfy an FD $X \rightarrow Y$ if every $r_1 \in Poss(r)$ satisfies $X \rightarrow Y$ in the ordinary sense. Thus in Figure 1 the relation strongly satisfies Course $\rightarrow$ Lecturer but not Course $\rightarrow$ Department, since the replacement of $\perp$ by d2 results in a complete relation which violates Course $\rightarrow$ Department. We note that weak satisfaction and strong satisfaction coincide with the conventional notion of satisfaction when a relation is complete. Also, axiom systems have been developed for both strong and weak satisfaction of FDs. In the case of strong satisfaction the axioms coincide with the normal Armstrong's axioms [6]. However, in the case of weak satisfaction, the transitivity rule in Armstrong's system is no longer valid and a modified system of axioms applies [8, 9, 37]. We also note that the definitions of weak and strong satisfaction just given are not particularly useful for testing satisfaction, and more efficient syntactic equivalents have been defined and shown to be equivalent to strong and weak satisfaction (see Lemmas 6.1 and 6.2 in [7]).

Footnote 1: We say roughly speaking because, as will be outlined in more detail later, the definition of a functional dependency in XML in [5] does not always adhere to the weak satisfaction property.

We now compare these two approaches to FD satisfaction in incomplete relations. The first disadvantage of the weak satisfaction approach is in the area of constraint maintenance. In general, when a complete relation is updated the FDs which apply to the relation may not be satisfied after the update and this gives rise to problems called update anomalies [25, 43, 18, 10].
However, if the set of FDs satisfies the Boyce-Codd normal form (BCNF) condition [20] then it can be proven [25, 43] that BCNF is a necessary and sufficient condition for there to be no update anomalies (see footnote 2). Thus if a complete relation is in BCNF then the FDs will still be satisfied after an update, which avoids the potentially expensive operation of checking whether a relation satisfies a set of FDs. However, in incomplete relations using the weak satisfaction approach, BCNF does not guarantee that the FDs will be satisfied after an update. To see this consider the relation in Figure 1. It weakly satisfies both Course $\rightarrow$ Lecturer and Course $\rightarrow$ Department, so Course is a key for the relation and the relation is in BCNF. However, if $\perp$ is replaced by d2 then Course $\rightarrow$ Department is violated. Thus in an incomplete relation using the weak satisfaction approach one cannot avoid having to check for FD satisfaction after an update if the consistency of the relation is to be maintained.

Footnote 2: We note, however, that for one class of update anomalies a normal form that lies between 3NF and BCNF suffices for there to be no update anomalies [43].

The second problem that arises with the weak satisfaction approach is that of additivity [7, 31, 32], a problem that does not occur for satisfaction of FDs in complete relations. Consider the relation $r$ in Figure 2.

Course    Lecturer    Phone
c1        l1          p1
c1        $\perp$     p2

Figure 2: A relation demonstrating the additivity problem.

The relation weakly satisfies Course $\rightarrow$ Lecturer since replacing $\perp$ by l1 results in a complete relation which satisfies Course $\rightarrow$ Lecturer. The relation also weakly satisfies Lecturer $\rightarrow$ Phone since replacing $\perp$ by l2 results in a complete relation which satisfies Lecturer $\rightarrow$ Phone. However, there is no complete relation in $Poss(r)$ that simultaneously satisfies both Course $\rightarrow$ Lecturer and Lecturer $\rightarrow$ Phone. The reason is that any relation $r_1 \in Poss(r)$ which satisfies Course $\rightarrow$ Lecturer and Lecturer $\rightarrow$ Phone must, by the transitivity rule in Armstrong's axioms, also satisfy Course $\rightarrow$ Phone, which is a contradiction because the two tuples agree on Course but have the different Phone values p1 and p2. In this case we say that weak satisfaction is not additive with respect to the relation $r$ and the set $F$ of FDs {Course $\rightarrow$ Lecturer, Lecturer $\rightarrow$ Phone}. This poses a problem because in some sense the relation $r$ and $F$ are contradictory, since there is no way we can complete $r$ and satisfy $F$.

There are two possible solutions to this problem. The first is to view the situation as arising because the set of FDs is not correctly specified and then to look at syntactic conditions on the set of FDs which ensure that additivity is satisfied. This is the approach taken in [31, 32], where it was shown that the set $F$ satisfying a property called monodependence [34] is a necessary and sufficient condition for additivity. The second solution is to view additivity as a problem with the incomplete relation rather than with the set of FDs. For example, if we remove the first tuple from the relation in Figure 2 then satisfaction is additive. A procedure has been developed [29, 7] for determining if satisfaction with respect to a given relation and a set of FDs is additive, and so this procedure can be used to check for additivity after an update to a relation. However, neither of these solutions to the problem of additivity is entirely satisfactory. In the first solution we must restrict the set of allowable FDs.
In the second solution we place no restrictions on the set of FDs but instead must execute a checking procedure after each update to ensure that additivity is maintained.

Neither of the above problems occurs if the approach of strong satisfaction of FDs is adopted. This is a simple consequence of the definition of strong satisfaction. However, the price to be paid for the strong satisfaction approach is that it places a much greater restriction on the amount of uncertainty that an incomplete relation can store. In summary, the weak satisfaction approach makes maintenance of the constraints a more difficult task but allows for a greater degree of uncertainty to be represented in the database, whereas strong satisfaction simplifies the task of constraint maintenance at the expense of restricting the amount of uncertainty that can be represented in the database. However, as also argued in [35], we believe that these approaches are complementary rather than competing and both have their place in the real world depending on the needs of the application. Also, as investigated in [33], it is possible to combine the two approaches and have some FDs in a relation weakly satisfied and others strongly satisfied.

We now discuss the extension of these approaches to the satisfaction of FDs in XML. In considering the satisfaction of FDs in XML, the first question that one must address is how to extend the notion of the completion of an incomplete relation to XML. There are some subtleties in this, which will be explored in more detail later, but for the moment it suffices to adopt a somewhat imprecise approach and regard a completion of an XML document as one that is obtained by 'filling in' all missing information in the document. For example, given the XML document in Figure 3, if we add a name element to it then we obtain a completion as shown in Figure 4.
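The XML discussion that follows leans on the relational notions reviewed above: $Poss(r)$, weak and strong satisfaction, and additivity. As a minimal sketch, they can be checked by brute force on the relations of Figures 1 and 2. The sketch assumes a small finite domain for each attribute that contains a null (the paper draws values from the full attribute domain), and the function names `satisfies`, `poss`, `weakly` and `strongly` are illustrative, not taken from the paper.

```python
from itertools import product

NULL = None  # stands for the null marker used in Figures 1 and 2


def satisfies(rel, lhs, rhs):
    """Ordinary satisfaction of lhs -> rhs in a complete relation."""
    seen = {}
    for t in rel:
        key = tuple(t[a] for a in lhs)
        val = tuple(t[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


def poss(rel, domains):
    """Yield each completion of rel over the assumed finite domains."""
    slots = [(i, a) for i, t in enumerate(rel) for a in t if t[a] is NULL]
    for choice in product(*(domains[a] for _, a in slots)):
        done = [dict(t) for t in rel]
        for (i, a), v in zip(slots, choice):
            done[i][a] = v
        yield done


def weakly(rel, domains, fds):
    """Some completion satisfies every FD in fds at once."""
    return any(all(satisfies(c, l, r) for l, r in fds)
               for c in poss(rel, domains))


def strongly(rel, domains, fds):
    """Every completion satisfies every FD in fds."""
    return all(all(satisfies(c, l, r) for l, r in fds)
               for c in poss(rel, domains))


fig1 = [{"Course": "c1", "Lecturer": "l1", "Department": "d1"},
        {"Course": "c1", "Lecturer": "l1", "Department": NULL}]
fig2 = [{"Course": "c1", "Lecturer": "l1", "Phone": "p1"},
        {"Course": "c1", "Lecturer": NULL, "Phone": "p2"}]
dom1 = {"Department": ["d1", "d2"]}  # assumed finite domain
dom2 = {"Lecturer": ["l1", "l2"]}    # assumed finite domain

c_l = (["Course"], ["Lecturer"])
c_d = (["Course"], ["Department"])
l_p = (["Lecturer"], ["Phone"])

print(weakly(fig1, dom1, [c_d]), strongly(fig1, dom1, [c_d]))
print(weakly(fig2, dom2, [c_l]), weakly(fig2, dom2, [l_p]),
      weakly(fig2, dom2, [c_l, l_p]))
```

On Figure 1 the sketch reports Course $\rightarrow$ Department as weakly but not strongly satisfied. On Figure 2 each FD is weakly satisfied on its own, but the pair is not, which is the additivity problem. A domain that is too small can make strong satisfaction appear to hold when it does not, so the domains should contain at least one value not already present in the column.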
[Figure 3 body: the XML markup was lost in extraction; the surviving text values are e1 and n1 in the first entry and e2 in the second.]

Figure 3: An XML document with incomplete information.

[Figure 4 body: the XML markup was lost in extraction; the surviving text values are e1 and n1 in the first entry and e2 and n2 in the second.]

Figure 4: A completed XML document.

We now consider the problems with the notion of FD satisfaction in XML as defined in [5]. We shall refer to FDs in XML documents as XFDs. Consider the XML document shown in Figure 5. If one adopts the definition of XFDs as given in [5], the XML document in Figure 5 satisfies the XFD root.Enrolment.Course.S $\rightarrow$ root.Enrolment.Lecturer.S (see footnote 3). Without formally defining the notion of XFD satisfaction as given in [5], one can see intuitively that a completion, shown in Figure 6, satisfies the XFD root.Enrolment.Course.S $\rightarrow$ root.Enrolment.Lecturer.S since both enrolments have different lecturers. It also satisfies the XFD root.Enrolment.Course.S $\rightarrow$ root.Enrolment.

Footnote 3: The meaning of the .S in the notation will be defined later.

However, the same problem with constraint maintenance that occurs with weak satisfaction in relations also occurs in the document in Figure 5. Firstly we note that the set of XFDs {root.Enrolment.Course.S $\rightarrow$ root.Enrolment.Lecturer.S, root.Enrolment.Course.S $\rightarrow$ root.Enrolment} satisfies the normal forms proposed in [5] and in this paper. However, this does not ensure that the XFD constraints are satisfied after an update, since if c1 is inserted into the missing Course element in Figure 5, rather than c2, then in the resulting XML document, shown in Figure 7, root.Enrolment.Course.S $\rightarrow$ root.Enrolment.Lecturer.S and root.Enrolment.Course.S $\rightarrow$ root.Enrolment are both violated. Hence, as in the relational case, XFDs are not automatically satisfied after an update, even if the document is normalised, and so a checking procedure has to be executed after an update if the integrity of the document is to be maintained. If one instead adopts the strong satisfaction approach used in our paper, then the document shown in Figure 5 violates the XFD root.Enrolment.Course.S $\rightarrow$ root.Enrolment.Lecturer.S. Indeed, one of the main results of our paper is to show that the syntactic definition of strong satisfaction that we propose results in every completion of an XML document also satisfying the XFD, and so the integrity maintenance problem that we have just shown cannot arise.

[Figure 5 body: the XML markup was lost in extraction; the surviving text values are c1, l1 and l2.]

Figure 5: An XML document.

[Figure 6 body: the XML markup was lost in extraction; the surviving text values are c1 and l1 in the first enrolment and c2 and l2 in the second.]

Figure 6: A completed XML document.

[Figure 7 body: the XML markup was lost in extraction; the surviving text values are c1 and l1 in the first enrolment and c1 and l2 in the second.]

Figure 7: An XML document violating the XFD constraints.

Next, we investigate the additivity problem in XML. Consider the XML document shown in Figure 8. In this example, the document satisfies the XFDs root.Enrolment.Course.S $\rightarrow$ root.Enrolment.Lecturer.S and root.Enrolment.Lecturer.S $\rightarrow$ root.Enrolment.Phone.S according to the definitions of XFDs as proposed in [5]. However, as in the relational case, there is no completion that satisfies both root.Enrolment.Course.S $\rightarrow$ root.Enrolment.Lecturer.S and root.Enrolment.Lecturer.S $\rightarrow$ root.Enrolment.Phone.S simultaneously, and so the additivity problem still arises. Once again we are faced with either restricting the class of XFDs or checking the XML document after an update to see if it is additive with respect to the set of XFDs. At present, deriving syntactic conditions on the set of XFDs which ensure additivity, or developing a procedure for testing the additivity of an XML document after an update, are issues which have not been investigated. If instead one uses the strong satisfaction approach used in our paper, then it follows immediately from one of the main results of the paper that the additivity problem cannot arise.

[Figure 8 body: the XML markup was lost in extraction; the surviving text values are c1, l1 and p1 in the first enrolment and c1 and p2 in the second.]

Figure 8: An XML document demonstrating the additivity problem.

To summarize, the comparison between strong and weak satisfaction of XFDs in XML documents is completely analogous to the situation in relational databases. The weak satisfaction approach makes maintenance of the constraints in XML a more difficult task but allows for a greater degree of uncertainty to be represented in the document, whereas strong satisfaction simplifies the task of constraint maintenance at the expense of restricting the amount of uncertainty that can be represented in the document.

The main contributions of our paper are as follows.

• We give a precise syntactic definition of the strong satisfaction of an XFD in an XML document.

• We provide formal justification for our syntactic definition of an XFD on two grounds. The first justification is based on the investigation of the correspondence between strong satisfaction of FDs in incomplete relations and strong satisfaction of XFDs in XML documents. We show that for a very general class of mappings from incomplete relations to XML documents, a unary FD (one with only one path on the left-hand side of the XFD) is strongly satisfied in the relation if and only if the corresponding XFD is strongly satisfied in the XML document. The second justification is to show that an XML document strongly satisfies an XFD if and only if every completion of the XML document also satisfies the XFD. This is the XML equivalent of Lemma 6.1 in [7], which proves the equivalence between a syntactic definition of strong satisfaction of FDs in incomplete relations and the semantic definition (the one previously given). This result thus justifies our use of the term 'strong' in XFD satisfaction and ensures that strong satisfaction of an XFD eliminates the integrity maintenance problems discussed earlier.

• We investigate the relationship between our notion of strong XFD satisfaction and the notion of key satisfaction as proposed recently in [13, 15]. We show that while in general there is no correspondence between the two notions, for an important class of key constraints an XML tree satisfies a key constraint if and only if the tree satisfies a corresponding XFD. So for this class of keys, key satisfaction can be viewed as a special case of XFD satisfaction, in a similar fashion to the way that key satisfaction in relational databases is a special case of FD satisfaction.

• We investigate the issue of axiomatization of strong XFD satisfaction: we provide a set of axioms for reasoning about implication of XFDs and show that they are sound for arbitrary XFDs. Interestingly, the set of axioms that we define has Armstrong's axioms [6] as a subset but contains some additional axioms which have no counterparts in Armstrong's system. We also note that the axioms have been shown to be complete for unary XFDs and the implication problem for unary XFDs has been shown to be decidable in time linear in the number of XFDs [45].

• We propose a normal form for XML documents based on our definition of XFDs and provide a formal justification for it by proving that it is a necessary and sufficient condition for the elimination of redundancy in an XML document. We note that a normal form for XML documents was also proposed in [5], but we show that it needs a minor modification to guarantee the elimination of redundancy.

The rest of the paper is organized as follows. Section 2 contains some preliminary definitions. Section 3 contains the definitions of strong XFDs in XML. In Section 4 we compare our notion of XFD satisfaction with that of key satisfaction. Section 5 first investigates the correspondence between strong satisfaction of FDs in incomplete relations and strong satisfaction of XFDs in XML documents. Then the relationship between strong satisfaction of an XFD in an incomplete document and the strong satisfaction of the XFD in a completed document is investigated. In Section 6 axioms for XFDs are defined and shown to be sound. Section 7 defines a normal form for XML documents and investigates the relationship between the normal form and the elimination of redundancy. Section 8 discusses related work and finally Section 9 contains some concluding comments.

2 Preliminary Definitions

In this section we present some preliminary definitions that we need before defining XFDs.

Definition 1 A tree is a finite, acyclic, directed graph in which there is a unique node, called the root, with indegree 0 and every other node has indegree 1. A node $v'$ is a child of a node $v$ (or equivalently, $v$ is the parent of $v'$) if there is a directed edge from $v$ to $v'$. A node is a leaf if it has no children. A node $v'$ is recursively defined to be a descendant of a node $v$ (or equivalently, $v$ is an ancestor of $v'$) if either $v'$ is a child of $v$, or $v$ has a child $v''$ such that $v'$ is a descendant of $v''$. The height of a tree is the number of nodes on the longest path from the root to a leaf node. A tree $T'$ is a subtree of a tree $T$ if the nodes of $T'$ are a subset of those of $T$ and for every pair of nodes $v'$ and $v$ of $T'$, $v'$ is a child of $v$ in $T$ if and only if $v'$ is a child of $v$ in $T'$. A subtree $T'$ is a principal subtree of $T$ if the root of $T'$ is a child of the root of $T$.

We now present the definition of an XML tree adapted from the definition given in [13].

Definition 2 Assume a countably infinite set $E$ of element labels (tags), a countably infinite set $A$ of attribute names and a symbol $S$ indicating text. An XML tree is defined to be $T = (V, lab, ele, att, val, v_r)$ where $V$ is a finite set of nodes in $T$; $lab$ is a function from $V$ to $E \cup A \cup \{S\}$; $ele$ is a partial function from $V$ to a sequence of $V$ nodes such that for any $v \in V$, if $ele(v)$ is defined then $lab(v) \in E$; $att$ is a partial function from $V \times A$ to $V$ such that for any $v \in V$ and $l \in A$, if $att(v, l) = v_1$ then $lab(v) \in E$ and $lab(v_1) = l$; $val$ is a function such that for any node $v \in V$, $val(v) = v$ if $lab(v) \in E$ and $val(v)$ is a string if either $lab(v) = S$ or $lab(v) \in A$; $v_r$ is a distinguished node in $V$ called the root of $T$ and we define $lab(v_r)$ = root. Since node identifiers are unique, a consequence of the definition of $val$ is that if $lab(v_1) \in E$ and $lab(v_2) \in E$ and $v_1 \neq v_2$ then $val(v_1) \neq val(v_2)$. We also extend the definition of $val$ to sets of nodes: if $V_1 \subseteq V$, then $val(V_1)$ is the set defined by $val(V_1) = \{val(v) \mid v \in V_1\}$. For any $v \in V$, if $ele(v)$ is defined then the nodes in $ele(v)$ are called subelements of $v$. For any $l \in A$, if $att(v, l) = v_1$ then $v_1$ is called an attribute of $v$. Note that an XML tree $T$ must be a tree. Since $T$ is a tree, the ancestors of a node $v$, denoted by $Ancestor(v)$, are defined as in Definition 1. The children of a node $v$ are also defined as in Definition 1 and we denote the parent of a node $v$ by $Parent(v)$.

We now explain the intuition behind the above definition. $V$ is the set of nodes of the tree $T$. These nodes can be classified into three types: element nodes, attribute nodes and text nodes. As shown in Figure 9, text nodes (S) have no name but carry text, attribute nodes (A) both have a name and carry text, and element nodes (E) have a name. Also, for a node $v$ the functions $ele$ and $att$ define the children of $v$, which are partitioned into subelements and attributes. Subelements of a node $v$ are ordered, whereas the attributes of a node $v$ are unordered and are identified by their labels.
The function $val$ is the identity for element nodes and assigns a text value to attribute and text nodes. We note that our definition of $val$ differs slightly from that in [13] since we have extended the definition of the $val$ function so that it is also defined on element nodes. The reason for this is that we want to include in our definition paths that do not end at leaf nodes, and when we do this we want to compare element nodes by node identity, i.e. node equality, but when we compare attribute or text nodes we want to compare them by their contents, i.e. value equality. This point will become clearer in the examples and definitions that follow.

[Figure 9 body: tree drawing lost in extraction. The surviving node annotations are: $v_r$ (E, root); $v_1$, $v_2$ (E, Division); $v_3$, $v_{14}$ (A, D#); $v_4$, $v_{15}$ (E, Section); $v_6$, $v_{16}$ (A, S#); $v_5$, $v_7$ (E, Employee); $v_8$, $v_9$ (E, Emp#); $v_{10}$ (E, Name); $v_{11}$, $v_{12}$, $v_{13}$ (S); and the text values "d1", "d2", "s1", "s2", "e1", "e2" and "n1".]

Figure 9: An XML tree.

We note that there is a one-to-one mapping between XML documents and trees and we now illustrate this mapping by an example.

Example 1 Consider the XML document given in Figure 3. Then the corresponding XML tree is shown in Figure 9. In this example the element nodes are $\{v_r, v_1, v_2, v_4, v_5, v_7, v_8, v_9, v_{10}, v_{15}\}$, the attribute nodes are $\{v_3, v_6, v_{14}, v_{16}\}$ and the text nodes are $\{v_{11}, v_{12}, v_{13}\}$.

We now give some preliminary definitions related to paths.

Definition 3 A path is an expression of the form $l_1.\cdots.l_n$, $n \geq 1$, where $l_i \in E \cup A \cup \{S\}$ for all $i$, $1 \leq i \leq n$, and $l_1$ = root. If $p$ is the path $l_1.\cdots.l_n$ then $Last(p) = l_n$.

For instance, if $E$ = {root, Division, Employee} and $A$ = {D#, Emp#} then root, root.Division, root.Division.D# and root.Division.Employee.Emp#.S are all paths.

Definition 4 Let $p$ denote the path $l_1.\cdots.l_n$. The function $Parnt(p)$ is the path $l_1.\cdots.l_{n-1}$. Let $p$ denote the path $l_1.\cdots.l_n$ and let $q$ denote the path $q_1.\cdots.q_m$. The path $p$ is said to be a prefix of the path $q$, denoted by $p \subseteq q$, if $n \leq m$ and $l_1 = q_1, \ldots, l_n = q_n$. Two paths $p$ and $q$ are equal, denoted by $p = q$, if $p$ is a prefix of $q$ and $q$ is a prefix of $p$. The path $p$ is said to be a strict prefix of $q$, denoted by $p \subset q$, if $p$ is a prefix of $q$ and $p \neq q$. We also define the intersection of two paths $p_1$ and $p_2$, denoted by $p_1 \cap p_2$, to be the maximal common prefix of both paths. It is clear that the intersection of two paths is also a path.

For example, if $E$ = {root, Division, Employee} and $A$ = {D#, Emp#} then root.Division is a strict prefix of root.Division.Employee and root.Division.D# $\cap$ root.Division.Employee.Emp#.S = root.Division.
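The path operations of Definitions 3 and 4 can be sketched in a few lines. The sketch below assumes a path is represented as a tuple of labels; the helper names (`parse`, `last`, `parnt`, `is_prefix`, `is_strict_prefix`, `intersection`) are illustrative choices, not notation from the paper beyond $Last$ and $Parnt$.

```python
def parse(text):
    """Turn 'root.Division.D#' into the tuple ('root', 'Division', 'D#')."""
    return tuple(text.split("."))


def last(p):
    """Last(p): the final label of the path."""
    return p[-1]


def parnt(p):
    """Parnt(p): the path without its final label.

    For the one-label path 'root' this returns the empty tuple,
    which is not itself a path since Definition 3 requires n >= 1.
    """
    return p[:-1]


def is_prefix(p, q):
    """p is a prefix of q (equality allowed)."""
    return len(p) <= len(q) and q[:len(p)] == p


def is_strict_prefix(p, q):
    """p is a prefix of q and differs from q."""
    return is_prefix(p, q) and p != q


def intersection(p, q):
    """Maximal common prefix of two paths."""
    common = []
    for a, b in zip(p, q):
        if a != b:
            break
        common.append(a)
    return tuple(common)


p1 = parse("root.Division.D#")
p2 = parse("root.Division.Employee.Emp#.S")
print(".".join(intersection(p1, p2)))
print(is_strict_prefix(parse("root.Division"), parse("root.Division.Employee")))
```

Because every path starts with the label root, the intersection of two paths is never empty, which matches the remark that the intersection of two paths is also a path.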

Definition 5 A path instance in an XML tree $T$ is a sequence $v_1.\cdots.v_n$ such that $v_1 = v_r$ and for all $v_i$, $1 < i \leq n$, $v_i \in V$ and $v_i$ is a child of $v_{i-1}$. A path instance $v_1.\cdots.v_n$ is said to be defined over the path $l_1.\cdots.l_n$ if for all $v_i$, $1 \leq i \leq n$, $lab(v_i) = l_i$. Two path instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ are said to be distinct if $v_i \neq v'_i$ for some $i$, $1 \leq i \leq n$. The path instance $v_1.\cdots.v_n$ is said to be a prefix of $v'_1.\cdots.v'_m$ if $n \leq m$ and $v_i = v'_i$ for all $i$, $1 \leq i \leq n$. The path instance $v_1.\cdots.v_n$ is said to be a strict prefix of $v'_1.\cdots.v'_m$ if $n < m$ and $v_i = v'_i$ for all $i$, $1 \leq i \leq n$. The set of path instances over a path $p$ in a tree $T$ is denoted by $Paths(p)$.

For example, in Figure 9, $v_r.v_1.v_4$ is a path instance defined over the path root.Division.Section and $v_r.v_1.v_4$ is a strict prefix of $v_r.v_1.v_4.v_7$.

We now assume the existence of a set of legal paths $P$ for an XML application. Essentially, $P$ defines the semantics of an XML application in the same way that a set of relational schemas defines the semantics of a relational application. $P$ may be derived from the DTD, if one exists, or from some other source which understands the semantics of the application if no DTD exists. The advantage of assuming the existence of a set of paths, rather than a DTD, is that it allows for a greater degree of generality, since having an XML tree conform to a set of paths is much less restrictive than having it conform to a DTD. Firstly we place the following restriction on the set of paths.

Definition 6 A set $P$ of paths is consistent if for any path $p \in P$, if $p_1 \subseteq p$ then $p_1 \in P$.

This is a natural restriction on the set of paths and any set of paths that is generated from a DTD will be consistent. We now define the notion of an XML tree conforming to a set of paths $P$.

Definition 7 Let $P$ be a consistent set of paths and let $T$ be an XML tree. Then $T$ is said to conform to $P$ if every path instance in $T$ is a path instance over a path in $P$.
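Definitions 6 and 7 translate directly into two checks. The following is a minimal sketch under two assumptions that are not part of the paper: a tree is stored as a dictionary mapping a node identifier to a `(label, parent)` pair, and paths are tuples of labels. Since each node is the last node of exactly one path instance starting at the root, conformance can be tested node by node. The small tree and the path set are hypothetical and only reuse labels from the examples above.

```python
def parse(text):
    return tuple(text.split("."))


def is_consistent(P):
    """Definition 6: every (non-empty) prefix of a path in P is in P."""
    return all(p[:k] in P for p in P for k in range(1, len(p)))


def path_of(tree, v):
    """Labels along the path instance from the root down to node v."""
    labels = []
    while v is not None:
        label, parent = tree[v]
        labels.append(label)
        v = parent
    return tuple(reversed(labels))


def conforms(tree, P):
    """Definition 7: every path instance is defined over a path in P."""
    return all(path_of(tree, v) in P for v in tree)


# Hypothetical path set and tree, for illustration only.
P = {parse(s) for s in [
    "root",
    "root.Division",
    "root.Division.D#",
    "root.Division.Employee",
    "root.Division.Employee.Emp#",
]}
tree = {
    "vr": ("root", None),
    "v1": ("Division", "vr"),
    "v2": ("D#", "v1"),
    "v3": ("Employee", "v1"),
}

print(is_consistent(P), conforms(tree, P))
```

The tree conforms even though the Employee node has no Emp# child: conformance only constrains the paths that are present, which is why missing information has to be treated separately.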

We now need to address the problem of missing information in an XML document. XML is quite flexible in the structure of data and it is possible, for instance, to have the set of paths $P$ be {root, root.Dept, root.Dept.Section, root.Dept.Section.Emp, root.Dept.Section.Project}, and so the XML tree in Figure 10 (a) conforms to $P$. In the relational context, there are two basic interpretations for missing or null values [7]. The first is the 'dke' approach, which assumes that the null value represents something that does not exist. The second is the 'unk' approach, which assumes that the null value stands for a value that exists but is not known. In this paper we adopt the 'unk' approach. In particular, this means that we consider the tree in Figure 10 (a) to have missing information in that there are path instances from node $v_2$ that are missing. In this sense we regard the tree in Figure 10 (a) as representing a family of complete trees. For instance, the tree in Figure 10 (b) represents one possible completion.

[Figure 10 body: tree drawings lost in extraction. Panel (a): $v_r$ (E, root); $v_1$, $v_2$ (E, Dept); $v_3$ (E, Section); $v_4$ (A, Emp, "e1"); $v_5$ (A, Emp, "e2"); $v_6$ (A, Project, "j1"). Panel (b): the nodes of (a) together with $v_7$ (E, Section); $v_8$ (A, Emp, "e3"); $v_9$ (A, Emp, "e4"); $v_{10}$ (A, Project, "j2"). Panel (c): the nodes of (a) together with marked nulls labelled Section, Emp and Project.]

Figure 10: XML trees illustrating missing information.

However, before defining the completion of a tree, which will be done later, we first define the notion of a minimal extension of a tree as an intermediate step. The minimal extension of a tree extends the tree by including marked nulls for any incomplete path instances. This leads to the following definitions.

Definition 8 Let $T$ be an XML tree and let $P$ be a set of paths such that $P$ is consistent and $T$ conforms to $P$. An extended XML tree is a tree $(V \cup N, lab, ele, att, val, v_r)$ where $N$ is a set of marked nulls $\{\perp_1, \ldots, \perp_n\}$ that is disjoint from $V$. Also, unlike the case for nodes in $V$, we define the $val$ of any node $v \in N$ to be the marked null itself, i.e. if $v = \perp_i$ then $val(v) = \perp_i$, even if $lab(v) \notin E$.

We note that in the previous definition, the $val$ of different marked nulls are distinct, i.e. $val(\perp_i) \neq val(\perp_j)$ if $i \neq j$.

Definition 9 Let $T$ be an XML tree $(V, lab, ele, att, val, v_r)$ and let $T'$ be an extended tree $(V \cup N, lab, ele, att, val, v_r)$. Then $T$ is said to be embedded in $T'$ if:
(i) $v_1$ has a child $v_2$ in $T$ iff $v_1$ has a child $v_2$ in $T'$, where $v_1$ and $v_2$ are in $V$;
(ii) for any node $v \in V$, $lab(v)$ in $T$ is equal to $lab(v)$ in $T'$;
(iii) for any node $v \in V$, $val(v)$ in $T$ is equal to $val(v)$ in $T'$.

For example, the tree in Figure 10 (a) is embedded in the tree of Figure 10 (c).

Definition 10 Let $P$ be a consistent set of paths, let $T$ be an XML tree $(V, lab, ele, att, val, v_r)$ that conforms to $P$ and let $T'$ be an extended tree $(V \cup N, lab, ele, att, val, v_r)$. Then $T'$ is said to be a minimal extension of $T$ if:
(i) $T$ is embedded in $T'$;
(ii) $T'$ conforms to $P$;
(iii) if there exist paths $p_1$ and $p_2$ in $P$ such that $p_1 \subset p_2$ and there exists a path instance $v_1.\cdots.v_n$ defined over $p_1$ in $T'$, then there exists a path instance $v'_1.\cdots.v'_m$ defined over $p_2$ in $T'$ such that $v_1.\cdots.v_n$ is a prefix of the instance $v'_1.\cdots.v'_m$;
(iv) there is no other extended tree $(V \cup N'', lab, ele, att, val, v_r)$, denoted by $T''$, such that $T''$ satisfies (i), (ii) and (iii) and $N'' \subset N$.

Condition (iii) of the definition ensures that there is no missing information in the minimal extension of a tree. Also, the tree in Figure 10 (c) is a minimal extension of Figure 10 (a). The next crucial result shows that all minimal extensions of an XML tree $T$ are essentially the same.

Theorem 1 Let $T_1$ and $T_2$ be two minimal extensions of a tree $T$. Then $T_1$ and $T_2$ are isomorphic.

Proof. See Appendix. $\Box$

As a consequence of this result, from now on we refer to the minimal extension of a tree, rather than a minimal extension of a tree, and denote it by $M(T)$.
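The construction of $M(T)$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the construction used in the proofs: the tree encoding (a dictionary from node identifier to label and parent), the names `TREE`, `PATHS`, `path_of` and `minimal_extension`, and the use of strings such as `"?1"` for marked nulls are all hypothetical, and the sketch assumes that $P$ contains every prefix of each of its paths. The sample tree is an assumed shape consistent with the examples quoted for Figure 10.

```python
from itertools import count

# node id -> (label, parent id); an assumed tree in the spirit of Figure 10 (a)
TREE = {
    "vr": ("root", None),
    "v1": ("Dept", "vr"),
    "v2": ("Dept", "vr"),
    "v3": ("Section", "v1"),
    "v4": ("Emp", "v3"),
    "v5": ("Emp", "v3"),
}
# Assumption: P is closed under prefixes.
PATHS = {"root", "root.Dept", "root.Dept.Section", "root.Dept.Section.Emp"}


def path_of(tree, v):
    """Return the dotted label path from the root down to node v."""
    labels = []
    while v is not None:
        label, v = tree[v]
        labels.append(label)
    return ".".join(reversed(labels))


def minimal_extension(tree, paths):
    """Add one marked null for every path in `paths` that a node fails to extend."""
    ext = dict(tree)
    fresh = count(1)
    todo = list(tree)
    while todo:
        v = todo.pop(0)
        here = path_of(ext, v)
        child_labels = {lab for lab, par in ext.values() if par == v}
        for p in sorted(paths):
            prefix, _, last = p.rpartition(".")
            if prefix == here and last not in child_labels:
                null = f"?{next(fresh)}"      # marked null, numbered arbitrarily
                ext[null] = (last, v)
                todo.append(null)             # the null may itself need children
    return ext


M = minimal_extension(TREE, PATHS)
```

The numbering of the nulls depends on traversal order, which is harmless in light of Theorem 1: any two results differ only by a renaming of nulls.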

Definition 11 A path instance $v_1.\cdots.v_n$ in $M(T)$ is defined to be complete if $v_i \notin N$ for all $i$, $1 \leq i \leq n$. An XML tree is complete if $T = M(T)$.

Also we often do not need to distinguish between different nulls and so the statement $v = \bot$ is shorthand for $\exists j (v = \bot_j)$ and $v \neq \bot$ is shorthand for $\nexists j (v = \bot_j)$. The next function returns all the final nodes of the path instances of a path $p$ in $M(T)$.

Definition 12 Let $P$ be a consistent set of paths, let $T$ be an XML tree that conforms to $P$ and let $M(T)$ be the minimal extension of $T$. The function $N(p)$, where $p \in P$, is the set of nodes in $M(T)$ defined by $N(p) = \{v \mid v_1.\cdots.v_n \in Paths(p) \wedge v = v_n\}$.

For example, in Figure 10 (c), N(root.Dept) = $\{v_1, v_2\}$ and N(root.Dept.Section.Emp) = $\{v_4, v_5, \bot_2\}$. We now need to define a function that returns a node and its ancestors.

Definition 13 Let $P$ be a consistent set of paths, let $T$ be an XML tree that conforms to $P$ and let $M(T)$ be the minimal extension of $T$. The function $AAncestor(v)$, where $v \in V \cup N$, is the set of nodes in $M(T)$ defined by $AAncestor(v) = v \cup Ancestor(v)$.

For example, in Figure 10 (c), $AAncestor(\bot_2) = \{v_r, v_2, \bot_1, \bot_2\}$. The next function returns all nodes that are the final nodes of path instances of $p$ and are descendants of $v$.

Definition 14 Let $P$ be a consistent set of paths, let $T$ be an XML tree that conforms to $P$ and let $M(T)$ be the minimal extension of $T$. The function $Nodes(v, p)$, where $v \in V \cup N$ and $p \in P$, is the set of nodes in $M(T)$ defined by $Nodes(v, p) = \{x \mid x \in N(p) \wedge v \in AAncestor(x)\}$. Note that $Nodes(v, p)$ may contain nulls or be empty.

For example, in Figure 10 (c), Nodes($v_r$, root.Dept) = $\{v_1, v_2\}$, Nodes($\bot_1$, root.Dept.Section.Emp) = $\{\bot_2\}$ and Nodes($v_3$, root.Dept) = $\emptyset$. We also define a partial ordering on the set of nodes as follows.

Definition 15 The partial ordering $>$ on the set of nodes $V$ in an XML tree $T$ is defined by $v_1 > v_2$ iff $v_2 \in Ancestor(v_1)$.
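The functions of Definitions 12 to 14 can be sketched directly over a parent-pointer encoding of $M(T)$. The encoding, the identifiers `TREE`, `path_of`, `N`, `aancestor` and `nodes`, and the strings `"?1"`, `"?2"` standing for the marked nulls are assumptions made for illustration. The tree below is one shape that is consistent with the values quoted above for Figure 10 (c); the figure itself may contain further nodes.

```python
# node id -> (label, parent id); an assumed encoding of the tree of Figure 10 (c)
TREE = {
    "vr": ("root", None),
    "v1": ("Dept", "vr"),
    "v2": ("Dept", "vr"),
    "v3": ("Section", "v1"),
    "v4": ("Emp", "v3"),
    "v5": ("Emp", "v3"),
    "?1": ("Section", "v2"),   # marked null
    "?2": ("Emp", "?1"),       # marked null
}


def aancestor(tree, v):
    """AAncestor(v): the node v together with all of its ancestors."""
    found = set()
    while v is not None:
        found.add(v)
        v = tree[v][1]
    return found


def path_of(tree, v):
    """The dotted label path from the root down to v."""
    labels = []
    while v is not None:
        labels.append(tree[v][0])
        v = tree[v][1]
    return ".".join(reversed(labels))


def N(tree, p):
    """N(p): the final nodes of all path instances of p."""
    return {v for v in tree if path_of(tree, v) == p}


def nodes(tree, v, p):
    """Nodes(v, p): the members of N(p) that have v as an ancestor (or are v)."""
    return {x for x in N(tree, p) if v in aancestor(tree, x)}
```

For instance, `nodes(TREE, "?1", "root.Dept.Section.Emp")` evaluates to `{"?2"}`, matching the example given for Definition 14.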

3 Strong Functional Dependencies in XML

This leads us to the main definition of our paper. We first consider the case where there is a single path on the l.h.s. of the XFD.

Definition 16 Let $P$ be a set of consistent paths and let $T$ be an XML tree that conforms to $P$. An XML functional dependency (XFD) is a statement of the form $p \rightarrow q$ where $p \in P$ and $q \in P$. $T$ strongly satisfies the XFD if $p = q$ or, for any two distinct path instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(q)$ in $M(T)$, $val(v_n) \neq val(v'_n) \Rightarrow x_1 \neq y_1$ if $Last(p) \in E$, else $\bot \notin Nodes(x_1, p)$ and $\bot \notin Nodes(y_1, p)$ and $val(Nodes(x_1, p)) \cap val(Nodes(y_1, p)) = \emptyset$, where $x_1 = \max\{v \mid v \in \{v_1, \ldots, v_n\} \wedge v \in N(p \cap q)\}$ and $y_1 = \max\{v \mid v \in \{v'_1, \ldots, v'_n\} \wedge v \in N(p \cap q)\}$.

We note that since the path $p \cap q$ is a prefix of $q$, there exists only one node in $v_1.\cdots.v_n$ that is also in $N(p \cap q)$ and so $x_1$ is always defined and unique. Similarly for $y_1$. Also, since $x_1 \in N(p \cap q)$ and $y_1 \in N(p \cap q)$, it is always the case that $val(Nodes(x_1, p)) \neq \emptyset$ and that $val(Nodes(y_1, p)) \neq \emptyset$. As a result, if $x_1 = y_1$ then $p \rightarrow q$ is automatically violated.

We now outline the thinking behind the above definition. In the relational model, if we are given a relation $r$ and a FD $A \rightarrow B$, then to see if $A \rightarrow B$ is satisfied we have to check the $B$ values and their corresponding $A$ values. In the relational model the correspondence between $B$ values and $A$ values is obvious: the $A$ value corresponding to a $B$ value is the $A$ value in the same tuple as the $B$ value. However, in XML there is no concept of a tuple so it is not immediately clear how to generalize the definition of a FD to XML. Our solution is based on the following observation. In a relation $r$ with tuple $t$, the value $t[A]$ can be seen as the 'closest' $A$ value to the $B$ value $t[B]$. In Definition 16 we generalize this observation and, given a path instance $v_1.\cdots.v_n$ in $Paths(q)$, we first compute the 'closest' ancestor of $v_n$ that is also an ancestor of a node in $N(p)$ ($x_1$ in the above definition) and then compute the 'closest $p$-nodes' to be the set of nodes which terminate a path instance of $p$ and are descendants of $x_1$. We then proceed in a similar fashion for the other path $v'_1.\cdots.v'_n$ and compute the '$p$-nodes' which are closest to $v'_n$.

We note that in this definition, as opposed to the relational case, there will be in general more than one 'closest $p$-node' and so $Nodes(x_1, p)$ and $Nodes(y_1, p)$ will in general contain more than one node. Having computed the 'closest $p$-nodes' to $v_n$ and $v'_n$, if $val(v_n) \neq val(v'_n)$ we then require, generalizing on the relational case, that the $val$s of the sets of corresponding 'closest $p$-nodes' be disjoint. We note that if $p$ ends with an element node, then since $T$ is a tree and node identifiers are unique, the condition $x_1 \neq y_1$ automatically ensures that $val(Nodes(x_1, p)) \cap val(Nodes(y_1, p)) = \emptyset$. We now illustrate the definition by some examples.

Example 2 Let $P$ be the set of paths {root, root.Department, root.Department.Dname, root.Department.Head, root.Department.Lecturer, root.Department.Lecturer.Lname, root.Department.Lecturer.Subject, root.Department.Lecturer.Subject.Subject#, root.Department.Lecturer.Subject.SubjName, root.Department.Lecturer.Subject.SubjName.S}.

The XML tree shown in Figure 11 conforms to $P$ and is complete. Consider the XFD root.Department.Lecturer.Subject.Subject# $\rightarrow$ root.Department.Lecturer.Subject.SubjName.S. Then $v_r.v_1.v_5.v_{13}.v_{17}.v_{22}$ and $v_r.v_2.v_9.v_{15}.v_{21}.v_{24}$ are two distinct path instances in Paths(root.Department.Lecturer.Subject.SubjName.S) and $val(v_{22})$ = "n1" and $val(v_{24})$ = "n2". So N(root.Department.Lecturer.Subject.Subject# $\cap$ root.Department.Lecturer.Subject.SubjName.S) = $\{v_{13}, v_{14}, v_{15}\}$ and so $x_1 = v_{13}$ and $y_1 = v_{15}$. Thus val(Nodes($x_1$, root.Department.Lecturer.Subject.Subject#)) = {"s1"} and val(Nodes($y_1$, root.Department.Lecturer.Subject.Subject#)) = {"s2"}, and so these two sets are disjoint. Similarly for the paths $v_r.v_1.v_6.v_{14}.v_{19}.v_{23}$ and $v_r.v_2.v_9.v_{15}.v_{21}.v_{24}$, and so the XFD is satisfied.

Consider next the XFD root.Department.Head $\rightarrow$ root.Department. Then $v_r.v_1$ and $v_r.v_2$ are two distinct path instances in Paths(root.Department) and $val(v_1) = v_1$ and $val(v_2) = v_2$. Also N(root.Department.Head $\cap$ root.Department) = $\{v_1, v_2\}$ and so $x_1 = v_1$ and $y_1 = v_2$. Thus val(Nodes($x_1$, root.Department.Head)) = {"h1"} and val(Nodes($y_1$, root.Department.Head)) = {"h2"} and so the XFD is satisfied.

Consider next the XFD root.Department.Lecturer $\rightarrow$ root.Department.Lecturer.Subject.Subject#. Then $v_r.v_1.v_5.v_{13}.v_{16}$ and $v_r.v_2.v_9.v_{15}.v_{20}$ are two distinct path instances in Paths(root.Department.Lecturer.Subject.Subject#) and $val(v_{16})$ = "s1" and $val(v_{20})$ = "s2". So N(root.Department.Lecturer $\cap$ root.Department.Lecturer.Subject.Subject#) = $\{v_5, v_6, v_9\}$ and so $x_1 = v_5$ and $y_1 = v_9$. Thus val(Nodes($x_1$, root.Department.Lecturer)) = $\{v_5\}$ and val(Nodes($y_1$, root.Department.Lecturer)) = $\{v_9\}$. Similarly for the paths $v_r.v_1.v_6.v_{14}.v_{18}$ and $v_r.v_2.v_9.v_{15}.v_{20}$, and so the XFD is satisfied.

Figure 11: An XML tree illustrating the definition of an XFD.

Example 3 Let $P$ be the same set of paths as in the previous example. The XML tree shown in Figure 12 conforms to $P$ and is complete. Consider the XFD root.Department.Lecturer.Subject.Subject# $\rightarrow$ root.Department.Lecturer.Subject.SubjName.S. Then $v_r.v_1.v_5.v_{13}.v_{17}.v_{22}$ and $v_r.v_1.v_6.v_{14}.v_{19}.v_{23}$ are two distinct path instances in Paths(root.Department.Lecturer.Subject.SubjName.S) and $val(v_{22})$ = "n1" and $val(v_{23})$ = "n3". So N(root.Department.Lecturer.Subject.Subject# $\cap$ root.Department.Lecturer.Subject.SubjName.S) = $\{v_{13}, v_{14}, v_{15}\}$ and so $x_1 = v_{13}$ and $y_1 = v_{14}$. Thus val(Nodes($x_1$, root.Department.Lecturer.Subject.Subject#)) = {"s1"} and val(Nodes($y_1$, root.Department.Lecturer.Subject.Subject#)) = {"s1"} and so the XFD is violated.

Consider next the XFD root.Department.Head $\rightarrow$ root.Department. Then $v_r.v_1$ and $v_r.v_2$ are two distinct path instances in Paths(root.Department) and $val(v_1) = v_1$ and $val(v_2) = v_2$. So N(root.Department.Head $\cap$ root.Department) = $\{v_1, v_2\}$ and so $x_1 = v_1$ and $y_1 = v_2$. Thus val(Nodes($x_1$, root.Department.Head)) = {"h1"} and val(Nodes($y_1$, root.Department.Head)) = {"h1"} and so the XFD is violated.

The next example illustrates the effect of incomplete paths on the definition of an XFD.

Figure 12: An XML tree with XFD.

Example 4 Let $P$ be the same set of paths as in the previous example. The XML tree shown in Figure 13 conforms to $P$ but is not complete. Its minimal extension is shown in Figure 14. Consider the XFD root.Department.Lecturer.Subject.Subject# $\rightarrow$ root.Department.Lecturer.Subject.SubjName.S. Then $v_r.v_1.v_5.v_{13}.v_{17}.v_{22}$ and $v_r.v_2.v_9.\bot_1.\bot_2.\bot_3$ are two distinct path instances in Paths(root.Department.Lecturer.Subject.SubjName.S) and the final node in $v_r.v_2.v_9.\bot_1.\bot_2.\bot_3$ is null. So N(root.Department.Lecturer.Subject.Subject# $\cap$ root.Department.Lecturer.Subject.SubjName.S) = $\{v_{13}, v_{14}, \bot_1\}$ and so $x_1 = v_{13}$ and $y_1 = \bot_1$. Thus val(Nodes($x_1$, root.Department.Lecturer.Subject.Subject#)) = {"s1"} and val(Nodes($y_1$, root.Department.Lecturer.Subject.Subject#)) = $\{\bot_5\}$, which contains a null, and so the XFD is violated.

Consider next the XFD root.Department.Head $\rightarrow$ root.Department. Then $v_r.v_1$ and $v_r.v_2$ are two distinct path instances in Paths(root.Department) and $val(v_1) = v_1$ and $val(v_2) = v_2$. So N(root.Department.Head $\cap$ root.Department) = $\{v_1, v_2\}$ and so $x_1 = v_1$ and $y_1 = v_2$. Then we note that Paths(root.Department.Head) = $\{v_r.v_1.v_4, v_r.v_2.\bot_4\}$. Thus val(Nodes($x_1$, root.Department.Head)) = {"h1"} and val(Nodes($y_1$, root.Department.Head)) = $\{\bot_4\}$ and so the XFD is violated.

Consider next the XFD root.Department.Lecturer $\rightarrow$ root.Department.Lecturer.Subject.Subject#. Then $v_r.v_1.v_5.v_{13}.v_{16}$ and $v_r.v_2.v_9.\bot_1.\bot_5$ are two distinct path instances in Paths(root.Department.Lecturer.Subject.Subject#) and $val(v_{16})$ = "s1" and $val(\bot_5) = \bot_5$. Then N(root.Department.Lecturer $\cap$ root.Department.Lecturer.Subject.Subject#) = $\{v_5, v_6, v_9\}$ and so $x_1 = v_5$ and $y_1 = v_9$. Thus $x_1 \neq y_1$. Similarly for the path instances $v_r.v_1.v_6.v_{14}.v_{18}$ and $v_r.v_2.v_9.\bot_1.\bot_5$, and so the XFD is satisfied.

Figure 13: An XML tree with XFDs.

We now generalize the definition of an XFD to the case where there is more than one path on the l.h.s. of the XFD. Our approach is similar to that in Definition 16. Given an XFD $p_1, \ldots, p_k \rightarrow q$ and two paths $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(q)$ which end in nodes having different $val$s, we first compute, for each $p_i$, the set of 'closest $p_i$ nodes' to $v_n$ in the same fashion as in Definition 16. Then, extending the relational approach to FD satisfaction, we require that in order for $p_1, \ldots, p_k \rightarrow q$ to be satisfied there is at least one $p_i$ for which the $val$s of the set of 'closest $p_i$ nodes' to $v_n$ is disjoint from the $val$s of the set of 'closest $p_i$ nodes' to $v'_n$.

Definition 17 Let $P$ be a set of consistent paths and let $T$ be an XML tree that conforms to $P$. An XML functional dependency (XFD) is a statement of the form $p_1, \ldots, p_k \rightarrow q$, $k \geq 1$, where $p_1, \ldots, p_k$ and $q$ are paths in $P$. $T$ strongly satisfies the XFD if $p_i = q$ for some $i$, $1 \leq i \leq k$, or, for any two distinct path instances $v_1.\cdots.v_n$ and $v'_1.\cdots.v'_n$ in $Paths(q)$ in $M(T)$, $val(v_n) \neq val(v'_n) \Rightarrow \exists i$, $1 \leq i \leq k$, such that $x_i \neq y_i$ if $Last(p_i) \in E$, else $\bot \notin Nodes(x_i, p_i)$ and $\bot \notin Nodes(y_i, p_i)$ and $val(Nodes(x_i, p_i)) \cap val(Nodes(y_i, p_i)) = \emptyset$, where $x_i = \max\{v \mid v \in \{v_1, \ldots, v_n\} \wedge v \in N(p_i \cap q)\}$ and $y_i = \max\{v \mid v \in \{v'_1, \ldots, v'_n\} \wedge v \in N(p_i \cap q)\}$.

For the same reason as in Definition 16, $x_i$ and $y_i$ are defined and unique for all $i$, $1 \leq i \leq k$. We now illustrate the definition by an example.

Figure 14: The minimal extension of the tree in Figure 13.

Example 5 Let $P$ be the set of paths {root, root.Department, root.Department.Project, root.Department.Project.Employee, root.Department.Project.Task, root.Department.Project.Hours}. The XML tree shown in Figure 15 conforms to $P$ and is complete.

Figure 15: An XML tree with XFDs.

Consider the XFD root.Department.Project.Employee, root.Department.Project.Task $\rightarrow$ root.Department.Project.Hours. Then $v_r.v_1.v_3.v_7$ and $v_r.v_2.v_4.v_{10}$ are two distinct paths in Paths(root.Department.Project.Hours) such that $val(v_7)$ = "h1" $\neq$ $val(v_{10})$ = "h2". We then compute N(root.Department.Project.Employee $\cap$ root.Department.Project.Hours) = $\{v_3, v_4\}$ and so $x_1 = v_3$ and $y_1 = v_4$. Next we compute N(root.Department.Project.Task $\cap$ root.Department.Project.Hours) = $\{v_3, v_4\}$ and so $x_2 = v_3$ and $y_2 = v_4$. So val(Nodes($x_1$, root.Department.Project.Employee)) = {"e1"} and val(Nodes($y_1$, root.Department.Project.Employee)) = {"e1"} and val(Nodes($x_2$, root.Department.Project.Task)) = {"t1"} and val(Nodes($y_2$, root.Department.Project.Task)) = {"t2"}. The Task values are disjoint, so the XFD is satisfied.

Next consider the XFD root.Department, root.Department.Project $\rightarrow$ root.Department.Project.Task in Figure 15. Then $v_r.v_1.v_3.v_6$ and $v_r.v_2.v_4.v_9$ are two distinct paths in Paths(root.Department.Project.Task) such that $val(v_6) \neq val(v_9)$. We then compute N(root.Department $\cap$ root.Department.Project.Task) = $\{v_1, v_2\}$ and so $x_1 = v_1$ and $y_1 = v_2$. Also, N(root.Department.Project $\cap$ root.Department.Project.Task) = $\{v_3, v_4\}$ and so $x_2 = v_3$ and $y_2 = v_4$. So val(Nodes($x_1$, root.Department)) = $\{v_1\}$ and val(Nodes($y_1$, root.Department)) = $\{v_2\}$ and val(Nodes($x_2$, root.Department.Project)) = $\{v_3\}$ and val(Nodes($y_2$, root.Department.Project)) = $\{v_4\}$, so the XFD is satisfied.
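The check described by Definition 17 can be sketched as follows, using the tree of Figure 15 as read from Example 5. This is an illustrative sketch only: the dictionary encoding, the identifiers `FIG15`, `ELEMENT_LABELS`, `chain`, `separates` and `strongly_satisfies`, and the convention that marked nulls have identifiers starting with `"?"` are assumptions, and the input is taken to be $M(T)$ already.

```python
from itertools import combinations

# node id -> (label, parent id, text value); element nodes carry no text value
FIG15 = {
    "vr": ("root", None, None),
    "v1": ("Department", "vr", None),
    "v2": ("Department", "vr", None),
    "v3": ("Project", "v1", None),
    "v4": ("Project", "v2", None),
    "v5": ("Employee", "v3", "e1"),
    "v6": ("Task", "v3", "t1"),
    "v7": ("Hours", "v3", "h1"),
    "v8": ("Employee", "v4", "e1"),
    "v9": ("Task", "v4", "t2"),
    "v10": ("Hours", "v4", "h2"),
}
ELEMENT_LABELS = {"root", "Department", "Project"}   # the label set E


def chain(tree, v):
    """The node v and its ancestors, listed from the root downwards."""
    out = []
    while v is not None:
        out.append(v)
        v = tree[v][1]
    return out[::-1]


def path_of(tree, v):
    return tuple(tree[u][0] for u in chain(tree, v))


def val(tree, v):
    """val of an element node or a marked null is the node itself."""
    text = tree[v][2]
    return v if text is None else text


def nodes(tree, v, p):
    return {x for x in tree if path_of(tree, x) == p and v in chain(tree, x)}


def separates(tree, p, q, a, b):
    """Does the l.h.s. path p tell apart the q-nodes a and b?"""
    k = 0                                   # length of the common prefix p ∩ q
    while k < min(len(p), len(q)) and p[k] == q[k]:
        k += 1
    x, y = chain(tree, a)[k - 1], chain(tree, b)[k - 1]
    if p[-1] in ELEMENT_LABELS:
        return x != y
    nx, ny = nodes(tree, x, p), nodes(tree, y, p)
    if any(n.startswith("?") for n in nx | ny):     # a null among the closest p-nodes
        return False
    return not ({val(tree, n) for n in nx} & {val(tree, n) for n in ny})


def strongly_satisfies(tree, lhs, q):
    lhs = [tuple(p.split(".")) for p in lhs]
    q = tuple(q.split("."))
    if q in lhs:
        return True
    ends = [v for v in tree if path_of(tree, v) == q]
    return all(
        val(tree, a) == val(tree, b) or any(separates(tree, p, q, a, b) for p in lhs)
        for a, b in combinations(ends, 2)
    )
```

Calling `strongly_satisfies(FIG15, ["root.Department.Project.Employee", "root.Department.Project.Task"], "root.Department.Project.Hours")` reproduces the first conclusion of Example 5. Dropping the Task path from the l.h.s. makes the check fail, since both projects share the Employee value "e1".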

4 Relationship between XFDs and Keys in XML

Keys are of fundamental importance in relational databases where they establish the connection between a real world object and its representation in the database, thus enabling information about an object to be located in the database. With the increasing use of XML in electronic data interchange, there is a similar need for the specification of keys to help in locating data in an XML document and in enforcing semantic integrity constraints. Recently, key specifications have been proposed for the XML standard [11], XML Data [23] and XML Schema [39]. As discussed in more detail in [15], these proposals are both not sufficiently powerful in some areas and too powerful in other areas. They are not sufficiently powerful because they can neither express keys having a complex structure nor express scoped keys (called relative keys in [15, 13]). They are too powerful because the proposal in [39] is based on the powerful language XPath [19] and this allows key constraints to be specified which are not satisfiable. Also, it is desirable to be able to solve the implication problem for keys, i.e. deciding whether a set of key constraints logically implies another key constraint, since being able to solve this problem is essential for query optimization and integrity enforcement. At the moment, the decidability of the implication problem for keys in XML Schema, as well as their axiomatization, is unknown.

Recently, a new way of specifying keys has been proposed [13, 15] which overcomes the problems just outlined with other proposals. In particular, the proposal allows one to specify keys having a complex structure as well as scoped keys. Also, it has been shown that the implication problem for this type of key specification is finitely axiomatizable and there is an algorithm for the implication problem which is polynomial in the size of the keys.

We now investigate the relationship between the key specification in [13, 15] and XFDs as defined in this paper. We first recall the definition of keys in XML as proposed in [15]. We shall modify the notation of [15] to conform to the notation of our paper. Also, we only consider the case of absolute keys. We begin with the definition of value equality.

Definition 18 Two nodes $v_1$ and $v_2$ are value equal, denoted by $v_1 =_v v_2$, iff the following conditions are satisfied:

- $lab(v_1) = lab(v_2)$;
- if $lab(v_1) \in A \cup S$ then $val(v_1) = val(v_2)$;
- if $lab(v_1) \in E$ then: (1) for any $a_1 \in att(v_1)$, there exists $a_2 \in att(v_2)$ such that $a_1 =_v a_2$, and vice versa; (2) if $ele(v_1) = [v_1, \ldots, v_k]$, then $ele(v_2) = [v'_1, \ldots, v'_k]$ and for all $i$, $1 \leq i \leq k$, $v_i =_v v'_i$.

That is, $v_1 =_v v_2$ iff their subtrees are isomorphic by an isomorphism that is the identity on text values. For example, in Figure 16, $v_6 =_v v_8$ and $v_7 =_v v_{10}$. Next, we recall the definition of keys.

Definition 19 A key constraint $\sigma$ is a specification of the form $\sigma = (q, \{p_1, \ldots, p_n\})$ where $q$ and $p_1, \ldots, p_n$ are in a set of paths $P$ and $q$ is a prefix of every path in $\{p_1, \ldots, p_n\}$. Then an XML tree $T$ satisfies $\sigma$ if for any two nodes $v_1$ and $v_2$ in $N(q)$ in $T$, if for all $i$, $1 \leq i \leq n$, $Nodes(v_1, p_i) \cap_v Nodes(v_2, p_i) \neq \emptyset$ then $v_1 = v_2$, where $Nodes(v_1, p_i) \cap_v Nodes(v_2, p_i) = \{(z_1, z_2) \mid z_1 \in Nodes(v_1, p_i), z_2 \in Nodes(v_2, p_i), z_1 =_v z_2\}$.

For example, in Figure 16, $T$ satisfies the key constraint (root.Dept, {root.Dept.Dname}). We now consider the relationship between the key constraint $(q, \{p_1, \ldots, p_n\})$ and the XFD $p_1, \ldots, p_n \rightarrow q$. Note also that the set of key paths, $\{p_1, \ldots, p_n\}$, may be empty. In this case the key constraint $(q, \{\})$ reduces to the requirement that the $val$ of nodes in $N(q)$ are distinct, i.e. if there are two nodes $v_1$ and $v_2$ in $N(q)$ such that $val(v_1) = val(v_2)$ then $v_1 = v_2$. For example, in Figure 16, $T$ satisfies the key constraint (root.Dept.Dname, {}).

The first thing to note is that because value equality for nodes whose labels are in $E$ is defined differently in this paper from [15], the statement that $T$ satisfies the key constraint $(q, \{p_1, \ldots, p_n\})$ is not equivalent in general to the statement that $T$ satisfies the XFD $p_1, \ldots, p_n \rightarrow q$. To see this, consider the XFD root.Dept.Major.Student $\rightarrow$ root.Dept.Major and the key constraint (root.Dept.Major, {root.Dept.Major.Student}). Then the path instances that end in $v_3$, $v_4$ and $v_5$ are all distinct path instances in Paths(root.Dept.Major). However, because of the way that we define $val$, we have that $val(v_7) \neq val(v_9) \neq val(v_{18})$ and so the XFD root.Dept.Major.Student $\rightarrow$ root.Dept.Major is satisfied. However, the constraint (root.Dept.Major, {root.Dept.Major.Student}) is not satisfied because $v_3$ and $v_4$ are distinct nodes yet $v_7 =_v v_{10}$.
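The value-equality test of Definition 18, on which the counterexample above relies, can be sketched recursively. The `Node` class, its field names and the helper `student` are hypothetical, and the two Student subtrees are an assumed reading of Figure 16 (a Student element with Fname and Lname attributes).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    kind: str                       # "E" element, "A" attribute, "S" text
    text: Optional[str] = None      # set only for kinds "A" and "S"
    att: List["Node"] = field(default_factory=list)   # attribute children (unordered)
    ele: List["Node"] = field(default_factory=list)   # element/text children (ordered)


def value_equal(a, b):
    """v1 =_v v2 in the sense of Definition 18."""
    if a.label != b.label or a.kind != b.kind:
        return False
    if a.kind in ("A", "S"):
        return a.text == b.text
    # (1) every attribute of one node has a value-equal attribute in the other
    atts_match = (all(any(value_equal(x, y) for y in b.att) for x in a.att)
                  and all(any(value_equal(x, y) for x in a.att) for y in b.att))
    # (2) the ordered child lists agree position by position
    eles_match = (len(a.ele) == len(b.ele)
                  and all(value_equal(x, y) for x, y in zip(a.ele, b.ele)))
    return atts_match and eles_match


def student(fname, lname):
    return Node("Student", "E", att=[Node("Fname", "A", fname), Node("Lname", "A", lname)])


v7, v9, v10 = student("Bill", "Smith"), student("Alan", "Jones"), student("Bill", "Smith")
```

Under these assumptions `value_equal(v7, v10)` holds while `value_equal(v7, v9)` does not, even though all three are distinct nodes and therefore have pairwise distinct $val$ in the sense of this paper.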
Let us now restrict ourselves to the class of simple key paths (those not ending in a label in $E$) and assume that $Last(p_i) \notin E$ for all $i$, $1 \leq i \leq n$, i.e. all the key paths end in leaves. In this case we still do not have equivalence between keys and XFDs because of the differing ways that our paper and [15] deal with missing information. To see this consider the tree shown in Figure 16 and the key constraint (root.Dept, {root.Dept.Dname}). Then the key constraint is satisfied since Nodes($v_1$, root.Dept.Dname) $\cap_v$ Nodes($v_{30}$, root.Dept.Dname) = $\emptyset$. However, the XFD root.Dept.Dname $\rightarrow$ root.Dept is not satisfied since $y_1 = v_{30}$ and $\bot \in$ Nodes($y_1$, root.Dept.Dname). The difference in this case is because we have adopted in this paper a strong satisfaction approach, whereas the approach in [15] can in some sense be viewed as an extension of the weak satisfaction approach.

Figure 16: An XML tree.

There is one further case where the key specification of [15] is not equivalent to an XFD, and that is the case of an empty set of key paths. As noted earlier, the key constraint $(q, \{\})$ is equivalent to the requirement that the $val$ of all nodes in $N(q)$ are unique. This constraint cannot be expressed as an XFD because we require at least one path on the l.h.s. of an XFD. However, this difference is not very significant. It is possible to modify the definitions in our paper to allow an empty set of paths on the l.h.s. of an XFD but we have chosen not to do so as it slightly complicates the definition. However, had we chosen to do so, then the key constraint $(q, \{\})$ would be equivalent to the XFD $\emptyset \rightarrow q$. Finally, let us restrict ourselves to the class of keys previously defined and let the XML tree be complete. Then we have the following result.

Lemma 1 Let $P$ be a set of consistent paths and let $T$ be a complete XML tree that conforms to $P$. Let $\sigma$ be the key constraint $\sigma = (q, \{p_1, \ldots, p_n\})$, where $\{p_1, \ldots, p_n\} \neq \emptyset$ and $Last(p_i) \notin E$ for all $i$, $1 \leq i \leq n$. Then $T$ satisfies $\sigma$ iff $T$ satisfies the XFD $p_1, \ldots, p_n \rightarrow q$.

Proof.

If: We show the contrapositive that if $T$ violates $\sigma = (q, \{p_1, \ldots, p_n\})$ then it violates $p_1, \ldots, p_n \rightarrow q$. Since $T$ violates $\sigma$, there exist distinct nodes $v_0$ and $v'_0$ in $N(q)$ and $v_1, \ldots, v_n$ and $v'_1, \ldots, v'_n$ such that for all $i$, $1 \leq i \leq n$, $val(v_i) = val(v'_i)$ and $v_i \in N(p_i)$ and $v'_i \in N(p_i)$. We first note that $val(v_0) \neq val(v'_0)$ since $Last(q) \in E$. Also, since $q$ is a prefix of $p_i$ then $x_i = v_0$ and $y_i = v'_0$, where $x_i$ and $y_i$ are as defined in Definition 17. Hence for all $i$, $1 \leq i \leq n$, $Nodes(x_i, p_i) \cap Nodes(y_i, p_i) \neq \emptyset$ and so $p_1, \ldots, p_n \rightarrow q$ is violated in $T$.

Only If: We show the contrapositive that if $T$ violates $p_1, \ldots, p_n \rightarrow q$ then it violates $\sigma = (q, \{p_1, \ldots, p_n\})$. If $T$ violates $p_1, \ldots, p_n \rightarrow q$ then there exist two distinct nodes $v_0$ and $v'_0$ in $N(q)$ such that $Nodes(x_i, p_i) \cap Nodes(y_i, p_i) \neq \emptyset$. However, as we have already noted, $x_i = v_0$ and $y_i = v'_0$ so $Nodes(v_0, p_i) \cap_v Nodes(v'_0, p_i) \neq \emptyset$ and thus $(q, \{p_1, \ldots, p_n\})$ is violated. $\Box$

We also note that XFDs defined in this paper are more expressive than the key constraints defined in [15] since in an XFD $p_1, \ldots, p_n \rightarrow q$ there are no restrictions placed on the relationship between the path $q$ and the paths $p_1, \ldots, p_n$, whereas in the key constraint $\sigma = (q, \{p_1, \ldots, p_n\})$ $q$ must be a prefix of $p_1, \ldots, p_n$. So, for certain important cases, keys as defined in [15] are a special case of XFDs as defined in this paper, a situation that is analogous to the situation for FDs and keys in relations.

5 Properties of Strong XFDs

In this section we provide formal justification for the definition of strong XFDs given in the previous section. Firstly we look at the correspondence between FDs in relations and XFDs in XML trees. We show that for a very general class of mappings from relations to XML trees, a unary FD $A \rightarrow B$ is strongly satisfied in a relation if and only if the corresponding XFD is strongly satisfied in the XML tree. The second justification is to investigate the relationship between the satisfaction of XFDs in XML trees with missing information and XML trees with no missing information. We show that an XFD is strongly satisfied in an XML tree $T$ if and only if every completion of $T$ also strongly satisfies the same XFD.

5.1 XFDs in XML trees and FDs in relations

5.1.1 Definitions

We now give some basic definitions of nested relations that we need to define the mapping between relations and XML trees. This section is adapted from the definitions given in [44]. Let $U$ be a fixed countable set of atomic attribute names. Associated with each attribute name $A \in U$ is a countably infinite set of values denoted by $DOM(A)$ and the set $DOM$ is defined by $DOM = \bigcup DOM(A_i)$ for all $A_i \in U$. We assume that $DOM(A_i) \cap DOM(A_j) = \emptyset$ if $i \neq j$. A scheme tree is a tree containing at least one node and whose nodes are labelled with nonempty sets of attributes that form a partition of a finite subset of $U$. If $n$ denotes a node in a scheme tree $S$ then:

- $ATT(n)$ is the set of attributes associated with $n$;
- $A(n)$ is the union of $ATT(n_1)$ for all $n_1 \in Ancestor(n)$.

Figure 17: A scheme tree.

Figure 17 illustrates an example scheme tree defined over the set of attributes {Name, Sid, Major, Class, Exam, Project}.

Definition 20 A nested relation scheme (NRS) for a scheme tree $S$, denoted by $N(S)$, is the set defined recursively by: (i) if $S$ consists of a single node $n$ then $N(S) = ATT(n)$; (ii) if $A = ATT(ROOT(S))$ and $S_1, \ldots, S_k$, $k \geq 1$, are the principal subtrees of $S$ then $N(S) = A \cup \{N(S_1)\} \cup \cdots \cup \{N(S_k)\}$.

For example, for the scheme tree $S$ shown in Figure 17, $N(S)$ = {Name, Sid, {Major}, {Class, {Exam}, {Project}}}. We now recursively define the domain of a scheme tree $S$, denoted by $DOM(N(S))$.

Definition 21 (i) If $S$ consists of a single node $n$ with $ATT(n) = \{A_1, \ldots, A_n\}$ then $DOM(N(S)) = DOM(A_1) \times \cdots \times DOM(A_n)$; (ii) if $A = ATT(ROOT(S))$ and $S_1, \ldots, S_k$ are the principal subtrees of $S$, then $DOM(N(S)) = DOM(A) \times P(DOM(N(S_1))) \times \cdots \times P(DOM(N(S_k)))$ where $P(Y)$ denotes the set of all nonempty, finite subsets of a set $Y$.

The set of atomic attributes in $N(S)$, denoted by $Z(N(S))$, is defined by $Z(N(S)) = N(S) \cap U$. The set of higher order attributes in $N(S)$, denoted by $H(N(S))$, is defined by $H(N(S)) = N(S) - Z(N(S))$. For instance, for the example shown in Figure 17, $Z(N(S))$ = {Name, Sid} and $H(N(S))$ = {{Major}, {Class, {Exam}, {Project}}}.

Finally we define a nested relation over a nested relation scheme $N(S)$, denoted by $r(N(S))$, or often simply by $r$ when $N(S)$ is understood, to be a finite nonempty set of elements from $DOM(N(S))$. If $t$ is a tuple in $r$ and $Y$ is a nonempty subset of $N(S)$, then $t[Y]$ denotes the restriction of $t$ to $Y$ and the restriction of $r$ to $Y$ is then the nested relation defined by $r[Y] = \{t[Y] \mid t \in r\}$. An example of a nested relation over the scheme tree of Figure 17 is shown in Figure 18.
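The recursive construction of $N(S)$, together with $Z(N(S))$ and $H(N(S))$, can be sketched with nested frozensets. The encoding of a scheme tree as a pair (attribute names, list of subtrees) and the function names `nrs`, `atomic` and `higher_order` are assumptions chosen for this illustration; the tree is the one described for Figure 17.

```python
# A scheme tree node is (attribute names, list of principal subtrees).
FIG17 = (("Name", "Sid"), [
    (("Major",), []),
    (("Class",), [(("Exam",), []), (("Project",), [])]),
])


def nrs(node):
    """N(S): the root's attributes plus one nested set per principal subtree."""
    attrs, subtrees = node
    return frozenset(attrs) | frozenset(nrs(s) for s in subtrees)


def atomic(scheme):
    """Z(N(S)): the atomic attribute names in the scheme."""
    return {a for a in scheme if isinstance(a, str)}


def higher_order(scheme):
    """H(N(S)): the nested (set-valued) attributes in the scheme."""
    return {a for a in scheme if not isinstance(a, str)}


SCHEME = nrs(FIG17)
```

Here `atomic(SCHEME)` yields the two names Name and Sid, and `higher_order(SCHEME)` yields the nested sets for {Major} and {Class, {Exam}, {Project}}, in agreement with the example above.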

A tuple $t_1$ is said to be a subtuple of a tuple $t$ in $r$ if there exists $Y \in H(N(S))$ such that $t_1 \in t[Y]$ or there exists a tuple $t_2$, defined over some NRS $N_1$, such that $t_2$ is a subtuple of $t$ and there exists $Y_1 \in H(N_1)$ such that $t_1 \in t_2[Y_1]$. For example, in the relation shown in Figure 18 the tuples <CS100, {Mid-year, Final}, {Project A, Project B, Project C}> and <Project A> are both subtuples of <Anna, Sid1, {Maths, Computing}, {CS100, {Mid-year, Final}, {Project A, Project B, Project C}}>.

    Name  Sid   {Major}     {Class  {Exam}     {Project}}
    Anna  Sid1  Maths       CS100   Mid-year   Project A
                Computing           Final      Project B
                                               Project C
    Bill  Sid2  Physics     P100    Final      Prac 1
                Chemistry                      Prac 2
                            CH200   Test A     Experiment 1
                                    Test B     Experiment 2

Figure 18: A nested relation.

We now introduce the nest and unnest operators for nested relations as defined in [38].

Definition 22 Let $Y$ be a nonempty proper subset of $N(S)$. Then the operation of nesting a relation $r$ on $Y$, denoted by $\nu_Y(r)$, is defined to be a nested relation over the scheme $(N(S) - Y) \cup \{Y\}$ and a tuple $t \in \nu_Y(r)$ iff: (i) there exists $t_1 \in r$ such that $t[N(S) - Y] = t_1[N(S) - Y]$ and (ii) $t[\{Y\}] = \{t_2[Y] \mid t_2 \in r \text{ and } t_2[N(S) - Y] = t[N(S) - Y]\}$.

Definition 23 Let $r(N(S))$ be a relation and $\{Y\}$ an element of $H(N(S))$. Then the unnesting of $r$ on $\{Y\}$, denoted by $\mu_{\{Y\}}(r)$, is a relation over the nested scheme $(N(S) - \{Y\}) \cup Y$ and a tuple $t \in \mu_{\{Y\}}(r)$ iff there exists $t_1 \in r$ such that $t_1[N(S) - \{Y\}] = t[N(S) - \{Y\}]$ and $t[Y] \in t_1[\{Y\}]$.

More generally, one can define the total unnest of a nested relation $r$, denoted by $\mu^*(r)$, as the flat relation defined as follows.

Definition 24 (i) If $r$ is a flat relation then $\mu^*(r) = r$; (ii) otherwise $\mu^*(r) = \mu^*(\mu_{\{Y\}}(r))$ where $\{Y\}$ is a higher order attribute in the NRS for $r$.

It can be shown [38] that the order of unnesting is immaterial and so $\mu^*(r)$ is uniquely defined. Also, we need the following result from [38]. Let us denote the NRS of nested relation $r$ by $N(r)$.

Lemma 2 For any nested relation $r$ and any $Y \subset N(r)$, $\mu_{\{Y\}}(\nu_Y(r)) = r$.

We note the well known result [38] that the converse of the above lemma does not hold, i.e. there are nested relations such that $\nu_Y(\mu_{\{Y\}}(r)) \neq r$.
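The nest and unnest operators, and the asymmetry between Lemma 2 and its converse, can be illustrated with a small sketch. Relations are modelled as lists of dictionaries and a nested attribute $\{Y\}$ is stored under the tuple of its attribute names; this encoding, the function names `nest`, `unnest` and `as_set`, and the sample rows are all assumptions made for illustration.

```python
def nest(rows, Y):
    """Nest on Y: group rows that agree outside Y and collect their Y-values into a set."""
    Y = tuple(Y)
    groups = {}
    for row in rows:
        rest = frozenset((k, v) for k, v in row.items() if k not in Y)
        groups.setdefault(rest, set()).add(tuple(row[a] for a in Y))
    return [{**dict(rest), Y: frozenset(vals)} for rest, vals in groups.items()]


def unnest(rows, Y):
    """Unnest on {Y}: emit one row per member of the set stored under Y."""
    Y = tuple(Y)
    out = []
    for row in rows:
        rest = {k: v for k, v in row.items() if k != Y}
        out.extend({**rest, **dict(zip(Y, member))} for member in row[Y])
    return out


def as_set(rows):
    """Compare relations as sets of tuples, ignoring row order."""
    return {frozenset(row.items()) for row in rows}


flat = [
    {"Name": "Anna", "Exam": "Mid-year"},
    {"Name": "Anna", "Exam": "Final"},
    {"Name": "Bill", "Exam": "Final"},
]
nested = nest(flat, ["Exam"])

# A nested relation that unnest followed by nest does not restore:
split = [
    {"Name": "Anna", ("Exam",): frozenset({("Mid-year",)})},
    {"Name": "Anna", ("Exam",): frozenset({("Final",)})},
]
```

Unnesting `nested` on Exam gives back `flat`, as Lemma 2 states. Unnesting and then re-nesting `split` merges its two Anna tuples into one, so the result differs from `split`, which is an instance of the failing converse.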

5.1.2 Mapping from relations to XML

The translation of a relation into an XML tree consists of two phases. In the first we map the relation to a nested relation whose nesting structure is arbitrary and then we map the nested relation to an XML tree. In the first step we let the nested relation $r^*$ be defined by $r_i = \nu_{Y_{i-1}}(r_{i-1})$, $r_0 = r$, $r^* = r_n$, $1 \leq i \leq n$, where $r$ represents the initial (flat) relation and $r^*$ represents the final nested relation. The $Y_i$ are allowed to be arbitrary apart from the obvious restriction that $Y_i$ is an element of the NRS for $r_i$. The nest operator as defined above is defined only for complete relations so we have to indicate how we handle nulls. Our approach is in the first step to consider the nulls to be marked, and hence distinguishable, and to treat these marked nulls as though they are data values. Thus the definition of the nest operator and $r^*$ remain unchanged.

In the second step of the mapping procedure we take the nested relation and convert it to an XML tree as follows. We start with an initially empty tree. For each tuple $t$ in $r^*$ we first create an element node of type Id and then for each $A \in Z(N(r^*))$ we insert a single attribute node with a value $t[A]$. We then repeat recursively the procedure for each subtuple of $t$. The final step in the procedure is to compress the tree by removing all the nodes containing nulls from the tree. We now illustrate these steps by an example.

Example 6 Consider the flat relation shown in Figure 19.

Name Sid Anna Anna Anna Anna Anna Anna Anna Anna

?4 ?4 ?4 ?4

Sid1 Sid1 Sid1 Sid1 Sid1 Sid1 Sid1 Sid1 Sid2 Sid2 Sid2 Sid2

Major Maths Maths Maths Maths

?2 ?2 ?2 ?2

Chemistry Chemistry Chemistry Chemistry

Class

CS100 CS100 CS100 CS100 CS100 CS100 CS100 CS100 CH200 CH200 CH200 CH200

Exam

Mid-year Mid-year Final Final Mid-year Mid-year Final Final Test A

?3

Test A

?3

Project

Project A

?1

Project A

?1

Project A

?1

Project A

?1

Prac 1 Prac 1 Prac 2 Prac 2

Figure 19: A at relation. If we then transform the relation r in Figure 19 by the sequence of nestings r1 = PROJECT (r), r2 = EXAM (r1 ), r3 = CLASS;fEXAM g;fPROJECT g (r2), r = MAJOR (r3 ) then the relation r is shown 27

in Figure 20. We then transform the nested relation in Figure 20 to the XML tree shown in Figure 21.
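The sequence of nestings in Example 6 can be sketched in Python as follows. This is only a minimal illustration and not the formal definition of the nest operator: the function name `nest`, the dictionary representation of tuples, the naming of nested attributes as `{...}` and the strings `null1` to `null4` standing for the marked nulls are all assumptions made for the example. Marked nulls are treated as ordinary data values, as described in the mapping procedure.

```python
from itertools import product


def nest(rows, attrs):
    """Nest a relation on `attrs` (hypothetical helper, not from the paper).

    Tuples that agree on every attribute outside `attrs` are merged into one
    tuple whose new attribute holds the set of their `attrs` values.
    """
    groups = {}
    for row in rows:
        # Group key: the values of all attributes that are not being nested.
        key = tuple(sorted(((k, v) for k, v in row.items() if k not in attrs),
                           key=lambda kv: kv[0]))
        groups.setdefault(key, set()).add(tuple(row[a] for a in attrs))
    name = "{" + ",".join(attrs) + "}"
    return [{**dict(key), name: frozenset(vals)} for key, vals in groups.items()]


# The flat relation of Figure 19; marked nulls are plain strings here.
flat = [
    {"Name": "Anna", "Sid": "Sid1", "Major": m, "Class": "CS100",
     "Exam": e, "Project": p}
    for m, e, p in product(["Maths", "null2"], ["Mid-year", "Final"],
                           ["Project A", "null1"])
] + [
    {"Name": "null4", "Sid": "Sid2", "Major": "Chemistry", "Class": "CH200",
     "Exam": e, "Project": p}
    for e, p in product(["Test A", "null3"], ["Prac 1", "Prac 2"])
]

r1 = nest(flat, ("Project",))
r2 = nest(r1, ("Exam",))
r3 = nest(r2, ("Class", "{Exam}", "{Project}"))
r_star = nest(r3, ("Major",))
```

Under these assumptions the twelve flat tuples collapse to two nested tuples, one per value of Sid, which is the shape of the nested relation in Figure 20.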

Name  Sid   {Major}    {Class  {Exam}    {Project}}
Anna  Sid1  Maths      CS100   Mid-year  Project A
            ⊥2                 Final     ⊥1
⊥4    Sid2  Chemistry  CH200   Test A    Prac 1
                               ⊥3        Prac 2

Figure 20: A nested relation derived from a flat relation.

[Figure 21: An XML tree derived from a nested relation.]

This now leads to the main result of this section, which establishes the correspondence between satisfaction of FDs in relations and satisfaction of XFDs in XML. We denote by $T_r$ the XML tree derived from $r^*$.

Theorem 2 Let $r$ be a flat relation and let $A \rightarrow B$ be an FD defined over $r$. Then $r$ strongly satisfies $A \rightarrow B$ iff $T_r$ strongly satisfies $p_A \rightarrow p_B$, where $p_A$ denotes the path in $T_r$ to reach $A$ and $p_B$ denotes the path to reach $B$.

Proof. See Appendix.

To illustrate the above result, consider the relation given in Figure 19. This relation satisfies the FD Major → Sid and violates the FDs Class → Exam and Exam → Project. One can see that the XML tree in Figure 21 satisfies the XFD root.Id.Id.Major → root.Id.Sid and violates the XFDs root.Id.Id.Class → root.Id.Id.Id.Exam and root.Id.Id.Id.Exam → root.Id.Id.Id.Project.

5.2 XFDs in Completions of XML trees

In this section we investigate the relationship between satisfaction of XFDs in XML trees with missing information and satisfaction in complete trees with no missing information. In particular, we show that an XML tree satisfies an XFD if and only if every completion of the tree also satisfies the XFD. This result is the XML analogue of Lemma 6.1 in [7], which proved a similar result for the syntactic definition of strong satisfaction in relations. Our result is important since it provides justification for regarding our definition of XFDs as conforming to the strong satisfaction approach. Firstly, we have to make precise what is meant by an XML tree having no missing information. This is quite straightforward.

Definition 25 Let $P$ be a consistent set of paths and let $T$ be an XML tree that conforms to $P$. Then a completion of $T$ is a tree constructed as follows. Let $V_1$ be another set of node identifiers that is disjoint from $V$ and let $v$ be any node in $M(T)$. Then for all $v$, if $v \in N$ then replace $v$ by a distinct node from $V_1$, and if $lab(v) \notin E$ then assign an arbitrary text string to $val(v)$. The set of all possible completions of $T$ is denoted by $Poss(T)$.

We then have one of the main results of our paper.

Theorem 3 An XML tree $T$ strongly satisfies an XFD $p_1, \ldots, p_k \rightarrow q$ iff every $T'$ in $Poss(T)$ strongly satisfies $p_1, \ldots, p_k \rightarrow q$.

6 Axiomatization for XFDs

In this section we address the issues of satisfaction and axiomatization for XFDs. We firstly show that every set of XFDs is satisfiable. We then present an axiom system for XFDs and show that it is sound. We also note that the axiom system has been shown to be complete for unary XFDs [45], and an algorithm for the implication problem has been given which is linear in the number of XFDs.

Definition 26 A finite set $\Sigma$ of XFDs is satisfiable if there exists an XML tree which satisfies $\Sigma$.

It can be easily verified from our definition of an XFD that the tree consisting of just the root node $v_r$ satisfies any set of XFDs. Hence we have the following result.

Lemma 3 Every set of XFDs is satisfiable.

We now present a set of axioms for determining implication of XFDs. We note that the first three axioms correspond to the well known Armstrong axioms [6] for FDs in relations, but the other axioms have no counterpart in Armstrong's system.

Axiom A1. $p_1, \ldots, p_k \rightarrow p_i$ for any $p_i$, $1 \le i \le k$.
Axiom A2. If $p_1, \ldots, p_k \rightarrow q$, then $p, p_1, \ldots, p_k \rightarrow q$ for any path $p$.
Axiom A3. If $p_1, \ldots, p_k \rightarrow q$ and $q \rightarrow s$, then $p_1, \ldots, p_k \rightarrow s$.
Axiom A4. If $p_1, \ldots, p_k \rightarrow q$ and $\forall i, 1 \le i \le k$, $p_i \cap q = \mathrm{root}$, then $p \rightarrow q$ for any path $p$.
Axiom A5. If $p \rightarrow q$ then $p' \rightarrow q$ for all paths $p'$ such that $p \cap q$ is a prefix of $p'$ and either $p'$ is a prefix of $p$ or $p'$ is a prefix of $q$.
Axiom A6. If $Last(p) \in E$ and $q$ is a prefix of $p$ then $p \rightarrow q$.
Axiom A7. If $Last(q) \in A$ then $Parnt(q) \rightarrow q$.
Axiom A8. $p \rightarrow \mathrm{root}$ for any path $p$.

Theorem 4 Axioms A1 - A8 are sound for implication of arbitrary XFDs.

Proof. See Appendix.

We now illustrate these axioms by an example.

[Figure 22: XML tree illustrating axioms for XFDs.]

Example 7 Consider the XML tree shown in Figure 22 and the set $\Sigma$ of XFDs {root.A.B.C.C# → root.A.D.E, root.A.D.E → root.A.D.E.F.F#, root.A → root.G}. It can be easily verified that the XML tree in Figure 22 satisfies $\Sigma$. Then from $\Sigma$ and the axioms we can deduce that the following XFDs are implied by $\Sigma$ [footnote 4]:
from A1 we can derive root.A → root.A;
from A2 and root.A → root.G we can derive that root.A, root.A.B.C → root.G;
from A3 and root.A.B.C.C# → root.A.D.E and root.A.D.E → root.A.D.E.F.F# we can derive that root.A.B.C.C# → root.A.D.E.F.F#;
from A4 and root.A → root.G we can derive that root.A.D.E → root.G;
from A5 and root.A.B.C.C# → root.A.D.E we can derive that root.A.B → root.A.D.E and that root.A.D → root.A.D.E;
from A6 we can derive that root.A.D.E → root.A;
from A7 we can derive that root.A.D.E.F → root.A.D.E.F.F#;
and from A8 we can derive that root.A.D → root.
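The application of Axiom A5 in Example 7 can be checked mechanically. The following Python fragment is a minimal sketch under stated assumptions: paths are represented as tuples of labels, $p \cap q$ is taken to be the longest common prefix of $p$ and $q$, and the helper names (`is_prefix`, `common_prefix`, `a5_left_sides`, `path`) are hypothetical and introduced only for this illustration.

```python
def is_prefix(p, q):
    """True if path p is a (not necessarily strict) prefix of path q."""
    return len(p) <= len(q) and q[:len(p)] == p


def common_prefix(p, q):
    """Longest common prefix of two paths (assumed meaning of p ∩ q)."""
    n = 0
    while n < min(len(p), len(q)) and p[n] == q[n]:
        n += 1
    return p[:n]


def a5_left_sides(p, q):
    """All paths p' for which Axiom A5 derives p' -> q from p -> q."""
    meet = common_prefix(p, q)
    # Candidates are the non-empty prefixes of p and of q.
    candidates = {p[:i] for i in range(1, len(p) + 1)}
    candidates |= {q[:i] for i in range(1, len(q) + 1)}
    # Keep those candidates that have p ∩ q as a prefix.
    return {c for c in candidates if is_prefix(meet, c)}


def path(s):
    """Turn a dotted path such as 'root.A.B' into a tuple of labels."""
    return tuple(s.split("."))


p = path("root.A.B.C.C#")
q = path("root.A.D.E")
derived = {".".join(c) for c in a5_left_sides(p, q)}
```

With these definitions `derived` contains root.A.B and root.A.D, the two left-hand sides quoted in Example 7, together with the other prefixes of $p$ and $q$ that extend root.A.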

Footnote 4: We do not show all the XFDs that can be derived from the axioms.

7 A Redundancy Free Normal Form for XML Documents

In this section we propose a normal form for XML documents that is adapted from the one given in [5]. We also provide a formal justification for the normal form by showing that it is a necessary and sufficient condition for the elimination of redundancy. This approach to justifying the definition of a normal form is an extension of the approach adopted by one of the authors in some other research which investigated the issue of providing justification for the normal forms defined in standard relational databases [40, 41, 42, 43, 36].

The approach that we use to justify our normal form is to formalise the notion of redundancy, the most intuitive approach to justifying normal forms, and then to try to ensure that our normal form is a necessary and sufficient condition for there to be no redundancy [footnote 5]. However, defining redundancy is not quite so straightforward as might first appear. The most obvious approach is, given a relation $r$ and an FD $A \rightarrow B$ and two tuples $t_1$ and $t_2$, to define a value $t_1[B]$ to be redundant if $t_1[B] = t_2[B]$ and $t_1[A] = t_2[A]$. While this definition is fine for FDs in relations, it doesn't generalise in an obvious way to other classes of relational integrity constraints, such as multi-valued dependencies (MVDs), join dependencies (JDs) or inclusion dependencies (INDs), nor to other data models.

The key to finding the appropriate generalization is based on the observation that if a value $t_1[B]$ is redundant in the sense just defined then every change of $t_1[B]$ to a new value results in the violation of $A \rightarrow B$. One can then define a data value to be redundant if every change of it to a new value results in the violation of the set of constraints (whatever they may be). This is essentially the definition proposed in [43], where it was shown that BCNF is a necessary and sufficient condition for there to be no redundancy in the case of FD constraints and fourth normal form (4NF) is a necessary and sufficient condition for there to be no redundancy in the case of FD and MVD constraints. Interestingly though, it was shown in [42] that when the set of constraints contains JDs, the necessary and sufficient condition for there to be no redundancy is a new normal form that is different from the standard definition of 5NF [24].

We now give our definition of redundancy, which is an extension of the definition given in [43]. Firstly, let us denote by $P_\Sigma$ the set of paths that appear on the l.h.s. or r.h.s. of any XFD in $\Sigma$.
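The relational notion of redundancy recalled above, and the observation that changing a redundant value violates the FD, can be sketched in Python as follows. This is an illustration only: the function names and the three-tuple relation are hypothetical and are not taken from [43].

```python
def redundant_values(rows, a, b):
    """Indices of tuples whose b-value is redundant for the FD a -> b.

    Following the text: t1[b] is redundant if some other tuple t2 has
    t1[a] == t2[a] and t1[b] == t2[b].
    """
    return [
        i for i, t1 in enumerate(rows)
        if any(j != i and t1[a] == t2[a] and t1[b] == t2[b]
               for j, t2 in enumerate(rows))
    ]


def violates(rows, a, b):
    """True if two tuples agree on a but differ on b."""
    seen = {}
    for t in rows:
        # setdefault returns the b-value first recorded for this a-value.
        if seen.setdefault(t[a], t[b]) != t[b]:
            return True
    return False


# Hypothetical relation satisfying the FD Id -> Name.
rows = [
    {"Id": "p1", "Name": "n1", "Emp": "e1"},
    {"Id": "p1", "Name": "n1", "Emp": "e2"},
    {"Id": "p2", "Name": "n2", "Emp": "e1"},
]

redundant = redundant_values(rows, "Id", "Name")
# Changing a redundant Name value to a new value breaks Id -> Name ...
changed_redundant = [dict(rows[0], Name="n9")] + rows[1:]
# ... whereas changing a non-redundant one does not.
changed_other = rows[:2] + [dict(rows[2], Name="n9")]
```

In this small relation the Name values of the first two tuples are redundant, and replacing either of them by a new value violates the FD, which is the property the generalized definition is built on.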

Definition 27 Let $T$ be an XML tree and let $v$ be a node in $T$. Then the change from $v$ to $v'$, resulting in a new tree $T'$, is said to be a valid change if $v \neq v'$ and $val(v) \neq val(v')$.

We note that the second condition in the definition, $val(v) \neq val(v')$, is automatically satisfied if the first condition is satisfied when $lab(v) \in E$.

Definition 28 Let $P$ be a consistent set of paths and let $\Sigma$ be a set of XFDs such that $P_\Sigma \subseteq P$. Then $\Sigma$ is said to cause redundancy if there exists an XML tree $T$ such that $T$ conforms to $P$ and satisfies $\Sigma$, and a node $v$ in $T$ such that every valid change from $v$ to $v'$, resulting in a new XML tree $T'$, causes $\Sigma$ to be violated.

We now illustrate this definition by an example.

Footnote 5: There are other ways of justifying normal forms, such as the elimination of update anomalies [25, 40, 42], but we leave the investigation of their relationship to normal forms for future work.

[Figure 23: XML tree illustrating redundancy.]

Example 8 Let $P$ be the set of paths {root, root.Project, root.Project.Id, root.Project.Name, root.Employee, root.Employee.Phone, root.Employee.Emp#, root.Employee.Phone.S}. Consider the set $\Sigma$ of XFDs {root.Project.Id → root.Project.Name} and the XML tree $T$ shown in Figure 23. Then $\Sigma$ causes redundancy because $T$ is consistent with $P$ and satisfies $\Sigma$, yet every valid change to either of the Name nodes results in root.Project.Id → root.Project.Name being violated.

We now define the notion of a trivial XFD and then define the normal form for XML.

Definition 29 An XFD $p_1, \ldots, p_k \rightarrow q$ is trivial if any of the following conditions hold:
(i) $q = p_i$ for some $i$, $1 \le i \le k$;
(ii) $Last(p_i) \in E$ and $q$ is a prefix of $p_i$ for some $i$, $1 \le i \le k$;
(iii) $Last(q) \in A$ and $p_i = Parnt(q)$ for some $i$, $1 \le i \le k$;
(iv) $q = \mathrm{root}$.
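The four conditions for a trivial XFD translate directly into a small check. The Python fragment below is a minimal sketch under stated assumptions: paths are tuples of labels, $Parnt(q)$ is taken to be $q$ with its last label removed, prefixes are not required to be strict, and the label sets are read off the paths of Example 8. The names `is_trivial`, `is_prefix`, `path`, `ELEMENTS` and `ATTRIBUTES` are hypothetical.

```python
def is_prefix(p, q):
    """True if path p is a (not necessarily strict) prefix of path q."""
    return len(p) <= len(q) and q[:len(p)] == p


def is_trivial(lhs, q, elements, attributes):
    """Check conditions (i)-(iv) for the XFD lhs -> q.

    lhs is a list of paths p_1..p_k, q is the right-hand-side path, and
    elements / attributes are the label sets E and A.
    """
    if q == ("root",):                                   # (iv)
        return True
    for p in lhs:
        if p == q:                                       # (i)
            return True
        if p[-1] in elements and is_prefix(q, p):        # (ii)
            return True
        if q[-1] in attributes and p == q[:-1]:          # (iii)
            return True
    return False


def path(s):
    """Turn a dotted path such as 'root.Project' into a tuple of labels."""
    return tuple(s.split("."))


# Label sets assumed from the paths used in Example 8.
ELEMENTS = {"root", "Project", "Employee", "Phone"}
ATTRIBUTES = {"Id", "Name", "Emp#"}

t1 = is_trivial([path("root.Project")], path("root"), ELEMENTS, ATTRIBUTES)
t2 = is_trivial([path("root.Employee")], path("root.Employee.Emp#"),
                ELEMENTS, ATTRIBUTES)
t3 = is_trivial([path("root.Project.Id")], path("root.Project.Name"),
                ELEMENTS, ATTRIBUTES)
```

Here `t1` is true by condition (iv) and `t2` by condition (iii), whereas `t3` is false, so root.Project.Id → root.Project.Name is a nontrivial XFD.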

Definition 30 Let $P$ be a consistent set of paths and let $\Sigma$ be a set of XFDs such that $P_\Sigma \subseteq P$. The set $\Sigma$ of XFDs is in XML normal form (XNF) if for every nontrivial XFD $p_1, \ldots, p_k \rightarrow q \in \Sigma^+$, $Last(q) \notin S$ and if $Last(q) \in A$ then $p_1, \ldots, p_k \rightarrow Parnt(q) \in \Sigma^+$, where $\Sigma^+$ denotes the set of XFDs logically implied by $\Sigma$.

The main difference between this definition and the one in [5] is that we have added the extra condition that $Last(q) \notin S$. Without this extra condition, the normal form does not guarantee the elimination of redundancy. Also, as a result of the fact that we have an axiom system for XFDs, our definition of a trivial XFD covers some extra cases that were not considered in [5]. We now illustrate the definition by an example.

Example 9 Consider the XML tree $T$ shown in Figure 23 and the sets of XFDs $\Sigma_1$ = {root.Employee → root.Employee.Phone.S}, $\Sigma_2$ = {root.Project.Id → root.Project.Name} and $\Sigma_3$ = {root.Project → root; root.Employee.Phone.S → root.Employee.Phone; root.Employee.Emp# → root.Employee; root.Employee.Phone.S → root.Employee.Emp#}. Then $\Sigma_1$ is not in XNF because the last label in the path root.Employee.Phone.S is a text node, and $\Sigma_2$ is not in XNF since root.Project.Id → root.Project.Name does not imply root.Project.Id → root.Project. $\Sigma_3$ is in XNF because the first XFD is trivial, the last label in the path on the r.h.s. of the second XFD is an element, as is the last label in the path on the r.h.s. of the third XFD, and for the last XFD it follows from applying A3 to root.Employee.Phone.S → root.Employee.Emp# and root.Employee.Emp# → root.Employee that root.Employee.Phone.S → root.Employee $\in \Sigma_3^+$, and so $\Sigma_3$ is in XNF.

This now leads to one of the main results of this paper.

Theorem 5 Let $P$ be a consistent set of paths and let $\Sigma$ be a set of XFDs such that $P_\Sigma \subseteq P$. Then $\Sigma$ does not cause redundancy iff $\Sigma$ is in XNF.

Proof. See Appendix.

8 Related Work

Functional dependencies for XML documents were also proposed in [30] and some guidelines were given for how such FDs should be hierarchically structured. However, the approach in [30] is informal and no precise definition of an FD is given. Instead the paper gives an informal approach based on our intuitive understanding of keys and entities. However, as shown in the work on keys [12, 13], the extension of concepts from conventional relational databases to those in XML documents is not as straightforward as first appears, and many unexpected subtleties arise when one tries to formally extend intuitive relational concepts to XML. Indeed, the purpose of our paper is to put the understanding of XFDs on a firmer foundation by providing precise and rigorous definitions.

Normal forms for XML documents were also investigated in [22]. In that paper it was assumed that the starting point for the design of an XML document is a conceptual model such as the one proposed in [21]. The conceptual model is then converted into a set of XML documents and these documents are defined to be in normal form if they have no potential redundancy. Algorithms are then defined which are guaranteed to produce XML documents which are free of redundancy. While the aim of [22] and the normal form section of our paper are similar, namely that of producing XML documents without redundancy, there are major differences. The most important difference is that the approach in [22] assumes that the XML document is derived from a conceptual model. This is a serious limitation since in practice XML documents can be produced in a variety of situations where no conceptual model exists. Also, constraints and redundancy are defined with respect to the conceptual model rather than within XML. In contrast, in our paper XFDs and normal forms are defined entirely within the context of an XML document and so can be applied no matter how the XML document is generated. Also, we give a syntactic characterization of the normal form whereas no such characterization is given in [22].

The work that is closest to our paper is that of [5]. Both our paper and [5] address the problem of defining FDs and normal forms in XML; however, there are several major differences which we now outline. In [5], FDs are defined based on the concept of a 'tree tuple' whereas we use a more direct approach based on paths and path instances that has similarities with the approach used in [27] to define path functional constraints and that of [13, 15] in defining key constraints. Also, our definition does not require the existence of a DTD, so our definition of an XFD is orthogonal to the definition of a DTD. This approach is similar to the one adopted in [13, 14, 15] in defining keys for XML. Next, there is the difference in how we define satisfaction. We define satisfaction based on the strong satisfaction approach and our definition is supported by one of the main results in our paper (Theorem 3). The approach in [5] has some elements of the weak satisfaction approach but is not a strict application of the approach, since there are XFDs which are violated in a tree with missing information according to the XFD definition in [5], yet there exist completions of the tree which satisfy the XFD. To see this, consider the example taken from [4].

[Figure 24: An XML tree illustrating XFD satisfaction in [5].]

Example 10 Consider the XML tree shown in Figure 24. This tree violates the XFD root.A.A#, root.B.B# → root.C.C# according to the definition of an XFD in [5]. However, if we add a C# attribute node with val c3 to the tree then the new tree satisfies the XFD root.A.A#, root.B.B# → root.C.C# according to the definition in [5].

A question that then arises is whether the definition of an XFD in [5] and our definition of an XFD coincide in the case where the XML document does not contain missing information. We believe that this is the case but have left the formal verification for future work.

In addition to the differences just discussed, the other major difference between [5] and our paper is that we address several issues that were not considered in [5]. Firstly, we address the issue of providing formal justification for our definition of a strong XFD by proving that, for a very general class of mappings from relations to XML documents, a relation strongly satisfies a unary FD if and only if the corresponding XML document also strongly satisfies the corresponding XFD. The other result justifies our use of the term 'strong satisfaction' and shows that an XML document strongly satisfies an XFD if and only if every completion of the document also satisfies the XFD. The second important issue that we address is the axiomatization of XFDs and we define a sound set of axioms for the implication of XFDs. Lastly, we address the issue of providing justification for the normal form we propose, a minor modification of the one proposed in [5], and show that the normal form is a necessary and sufficient condition for the elimination of redundancy.

9 Conclusions

In this paper we have investigated the issues of XFDs and normalization for XML documents. We have defined a strong XFD in XML and justified it on two grounds. The first is that we have shown that for a very general class of mappings from an incomplete relation to an XML document, an FD is strongly satisfied in the relation if and only if the corresponding XFD is strongly satisfied in the XML document. The second justification is to show that an XML document strongly satisfies an XFD if and only if every completion of the XML document also satisfies the XFD. This result thus justifies our use of the term 'strong' in XFD satisfaction and ensures that strong satisfaction of an XFD eliminates the integrity maintenance problem and the additivity problem discussed in Section 1. Next we have investigated axiomatization of XFDs and provided a sound set of axioms for the implication of XFDs. The final contribution of our paper has been to propose a normal form for XML documents based on our definition of XFDs and provide a formal justification for it by proving that it is a necessary and sufficient condition for the elimination of redundancy.

There are several other topics related to the ones investigated in this paper that warrant further investigation. Firstly, some of the main results in our paper (Theorem 2) have only been proven for unary XFDs and there is a need to generalize the results to arbitrary XFDs. Secondly, efficient algorithms need to be developed for converting unnormalized XML documents to normalized ones. Some preliminary work on this problem was reported in [5] but more work needs to be done. Thirdly, in cases where a DTD exists, the issue of interaction between XFDs and the DTD needs investigation. Fourthly, our approach to defining XFDs can also be extended to weak satisfaction, and similar results to the ones derived in this paper need to be established for weak satisfaction. Finally, MVDs also occur naturally in XML documents and there is a need to formalise their definition and establish a 4NF for XML.

References

[1] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web. Morgan Kaufmann, 2000.
[2] S. Abiteboul, R. Hull, and V. Vianu. Foundations of databases. Addison Wesley, 1996.
[3] S. Abiteboul and V. Vianu. Regular path queries with constraints. In Proc. ACM PODS Conference, pages 122-133, 1997.
[4] M. Arenas, 2002. Personal communication.
[5] M. Arenas and L. Libkin. A normal form for XML documents. In Proc. ACM PODS Conference, pages 85-96, 2002.
[6] W.W. Armstrong. Dependency structure of database relationships. In IFIP Congress, pages 580-583, 1974.
[7] P. Atzeni and V. DeAntonellis. Foundations of databases. Benjamin Cummings, 1993.
[8] P. Atzeni and N. M. Morfuni. Functional dependencies in relations with null values. Information Processing Letters, 18(4):233-238, 1984.
[9] P. Atzeni and N. M. Morfuni. Functional dependencies and constraints on null values in database relations. Information and Control, 70(1):1-31, 1986.
[10] P.A. Bernstein and N. Goodman. What does Boyce-Codd normal form do? In 6th International Conference on Very Large Databases, pages 245-249, 1980.
[11] T. Bray, J. Paoli, and C.M. Sperberg-McQueen. Extensible markup language (XML) 1.0. Technical report, http://www.w3.org/Tr/1998/REC-xml-19980819, 1998.
[12] P. Buneman, S. Davidson, W. Fan, and C. Hara. Keys for XML. In Proceedings of the 10th International World Wide Web Conference, pages 201-210, 2001.
[13] P. Buneman, S. Davidson, W. Fan, and C. Hara. Reasoning about keys for XML. In International Workshop on Database Programming Languages, 2001.
[14] P. Buneman, S. Davidson, W. Fan, C. Hara, and W. Tan. Keys for XML. Computer Networks, 39(5):473-487, 2002.
[15] P. Buneman, S. Davidson, W. Fan, C. Hara, and W. Tan. Reasoning about keys for XML. Information Systems, 2003. To appear.
[16] P. Buneman, W. Fan, and S. Weinstein. Path constraints on structured and semistructured data. In Proc. ACM PODS Conference, pages 129-138, 1998.
[17] P. Buneman, W. Fan, and S. Weinstein. Interaction between type and path constraints. In Proc. ACM PODS Conference, pages 129-138, 1999.
[18] E. P. F. Chan. A design theory for solving the anomalies problem. SIAM Journal on Computing, 18(3):429-448, 1989.
[19] J. Clark and S. DeRose. XML path language (XPath). W3C Working Draft, http://www.w3.org/Tr/1998/xpath, 1999.
[20] E.F. Codd. Recent investigations in relational database systems. In IFIP Conference, pages 1017-1021, 1974.
[21] D. W. Embley, B. D. Kurtz, and S.N. Woodfield. Object-Oriented Systems Analysis: A Model driven Approach. Prentice-Hall, 1992.
[22] D. W. Embley and W. Y. Mok. Developing XML documents with guaranteed "good" properties. In ER 2001, 20th International Conference on Conceptual Modeling, pages 426-441, 2001.
[23] T. Bray et al. XML-data. W3C Note, http://www.w3.org/Tr/1998/Note-xml-data, 1998.
[24] R. Fagin. Normal forms and relational database operators. In ACM SIGMOD International Conference on the Management of Data, pages 123-134, 1979.
[25] R. Fagin. A normal form for relational databases that is based on domains and keys. ACM Transactions on Database Systems, 6(3):387-415, 1981.
[26] W. Fan and L. Libkin. On XML integrity constraints in the presence of DTDs. In Proc. ACM PODS Conference, pages 114-125, 2001.
[27] W. Fan and J. Simeon. Integrity constraints for XML. In Proc. ACM PODS Conference, pages 23-34, 2000.
[28] W. Fan, G. Kuper, and J. Simeon. A unified constraint model for XML. In The 10th International World Wide Web Conference, pages 179-190, 2001.
[29] P. Honeyman. Testing satisfaction of functional dependencies. Journal of the ACM, 29:668-667, 1982.
[30] M.L. Lee, T.W. Ling, and W. L. Low. Designing functional dependencies for XML. In International Conference on Extending Database Technology, 2002.
[31] M. Levene and G. Loizou. The additivity problem for functional dependencies in incomplete relations. Acta Informatica, 34:135-149, 1997.
[32] M. Levene and G. Loizou. The additivity problem for data dependencies in incomplete relations. In B. Thalheim and L. Libkin, editors, Semantics in databases, pages 136-169. Springer Verlag, 1998.
[33] M. Levene and G. Loizou. Axiomatization of functional dependencies in incomplete relations. Theoretical Computer Science, 206:283-300, 1998.
[34] M. Levene and G. Loizou. Database design of incomplete relations. ACM Transactions on Database Systems, 24:35-68, 1999.
[35] M. Levene and G. Loizou. A guided tour of relational databases and beyond. Springer, 1999.
[36] M. Levene and M. W. Vincent. Justification for inclusion dependency normal form. IEEE Transactions on Knowledge and Data Engineering, 12:281-291, 2000.
[37] Y. E. Lien. On the equivalence of database models. Journal of the ACM, 29(2):333-362, 1982.
[38] S.J. Thomas and P.C. Fischer. Nested relational structures. In P. Kanellakis, editor, The theory of databases, pages 269-307. JAI Press, 1986.
[39] H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML schema part 1: structures. W3C Working Draft, http://www.w3.org/Tr/1998/xmlschema-1, 2001.
[40] M. W. Vincent. Semantic foundations of normal forms in relational database design. PhD thesis, Department of Computer Science, Monash University, 1994.
[41] M. W. Vincent. A corrected 5NF definition for relational database design. Theoretical Computer Science, 185:379-391, 1997.
[42] M. W. Vincent. A new redundancy free normal form for relational database design. In B. Thalheim and L. Libkin, editors, Database Semantics, pages 247-264. Springer Verlag, 1998.
[43] M. W. Vincent. Semantic foundations of 4NF in relational database design. Acta Informatica, 36:1-41, 1999.
[44] M. W. Vincent and M. Levene. Restructuring partitioned normal relations without information loss. SIAM Journal on Computing, 39(5):1550-1567, 2000.
[45] M.W. Vincent and J. Liu. The implication problem for unary functional dependencies in XML. Submitted for publication, 2002.
[46] J. Widom. Data management for XML - research directions. IEEE Data Engineering Bulletin, 22(3):44-52, 1999.

10 Appendix

Before proving Theorem 1, we firstly define an isomorphism between two extended XML trees. Essentially it means that two trees are isomorphic if they are unique up to a labelling of the node identifiers.

Definition 31 Let $T_1$ be the extended XML tree $(V_1 \cup N_1, lab, ele, att, val, v_r)$ and let $T_2$ be the extended XML tree $(V_2 \cup N_2, lab, ele, att, val, v_r)$. Then $T_1$ and $T_2$ are isomorphic if there exists a 1-1 function $\phi$ from $V_1 \cup N_1$ onto $V_2 \cup N_2$ such that:
(i) $v \in V_1$ iff $\phi(v) \in V_2$;
(ii) $v_1$ is a child of $v_2$ in $T_1$ iff $\phi(v_1)$ is a child of $\phi(v_2)$ in $T_2$;
(iii) for any node $v$ in $T_1$, $lab(v) = lab(\phi(v))$;
(iv) for any node $v$ in $T_1$ such that $v \in V_1$, $val(v) = val(\phi(v))$.

[Figure 25: XML trees illustrating the definition of an isomorphism, panels (a), (b) and (c).]

For example, the trees in Figure 25 (a) and Figure 25 (b) are isomorphic but not the trees in Figure 25 (a) and Figure 25 (c). Before proving the isomorphism theorem, we firstly establish a preliminary lemma.

Lemma 4 Let $P$ be a consistent set of paths, let $T$ be an XML tree $(V \cup N, lab, ele, att, val, v_r)$ that conforms to $P$ and let $M(T)$ be the minimal extension of $T$. Then for any node $v$ in $M(T)$ and for any label $l \in E \cup A \cup S$, $v$ has only one child node $v_1$ such that $lab(v_1) = l$ and $v_1 \in N$.

Proof. Suppose to the contrary that $v$ has two nodes $v'$ and $v''$ in $M(T)$ with the same label and such that both $v'$ and $v''$ are in $N$. We firstly claim that for any path instance $v_1.\cdots.v_n$ which is defined over some path $p$, and includes $v'$, there exists another path instance $v'_1.\cdots.v'_n$ defined also over path $p$ and including $v''$. Suppose that there doesn't exist such a path instance. Then this implies that there exists a path $p_1$ such that $p \subset p_1$ and there does not exist a path $v'_1.\cdots.v'_m$, defined over $p$, such that $v'_1.\cdots.v'_n$ is a prefix of $v'_1.\cdots.v'_m$. However this contradicts condition (iii) of a minimal extension and so we can conclude that for any path instance $v_1.\cdots.v_n$ which is defined over some path $p$ and includes $v'$ there exists a path instance $v'_1.\cdots.v'_n$ defined also over path $p$ and including $v''$. One can then also use the same argument to show that for any path instance $v'_1.\cdots.v'_n$ which is defined over some path $p$ and includes $v''$ there exists a path instance $v_1.\cdots.v_n$ defined also over path $p$ and including $v'$.

Consider then removing the subtree rooted at $v''$ from $M(T)$. We claim that doing this still retains property (iii) of a minimal extension. Suppose there exist paths $p_1$ and $p_2$ in $P$ such that $p_1 \subset p_2$ and there exists a path instance $v_1.\cdots.v_n$, containing $v'$, defined over $p_1$, in $M(T)$. Then since $M(T)$ is a minimal extension of $T$, there exists a path instance $v'_1.\cdots.v'_m$ defined over $p_2$ in $M(T)$ such that $v_1.\cdots.v_n$ is a prefix of the instance $v'_1.\cdots.v'_m$. If $v'_1.\cdots.v'_m$ is not contained in the subtree rooted at $v''$ then property (iii) will still hold in the new tree. If $v'_1.\cdots.v'_m$ is contained in the subtree rooted at $v''$ then by the result just established there will be a path instance $v''_1.\cdots.v''_m$ which includes $v'$ with the same property. Hence when the subtree rooted at $v''$ is removed, $v''_1.\cdots.v''_m$ will still be in the new tree and so property (iii) is again maintained. However, this shows that we can remove the subtree rooted at $v''$ from $M(T)$ and obtain a tree which still satisfies (i), (ii) and (iii). This contradicts the definition of a minimal extension and so we conclude that $v$ has only one child node that is of type $l$ and is in $N$, and so the lemma is established. □

Proof of Theorem 1

Let $T$ be an XML tree, let $T_1$ be a minimal extension $(V \cup N_1, lab, ele, att, val, v_r)$ and let $T_2$ be a minimal extension $(V \cup N_2, lab, ele, att, val, v_r)$. Let us define a sequence of sets of nodes in $T_1$ by $S_0 = V$, $S_i = S_{i-1} \cup \{v \mid v \in N \wedge \exists v_1 (v_1 \in S_{i-1} \wedge v_1 = Parent(v))\}$. We note that eventually there is an $n$ such that $S_n = V \cup N$ and at this stage the iteration stops. We shall prove by induction that there exists a 1-1 function from $V \cup N_1$ into $V \cup N_2$. Initially the result is true for $S_0$ since $T$ is embedded in both $T_1$ and $T_2$, and so we let $\phi$ be the identity function on $S_0$. Suppose then the result is true for the set $S_{i-1}$. Consider a node $v_2$ in $S_i - S_{i-1}$. We firstly note that $v_2 \in N$. By definition of $S_i$, there exists a node $v_1$ in $S_{i-1}$ such that $v_1 = Parent(v_2)$. Consider $v_1$ and $\phi(v_1)$ and the two path instances which end in $v_1$ and $\phi(v_1)$. It follows from the induction hypothesis that the two path instances are defined over the same path, call it $p$. Then because $v_2$ is a child of $v_1$ it follows that the path instance that ends with $v_2$ is a path instance of another path $p_1$ such that $p \subset p_1$ and $p_1 \in P$ (since by (ii) of the definition of a minimal extension $T_1$ conforms to $P$). This implies, by (iii) of the definition of a minimal extension, that since the path instance that ends in $\phi(v_1)$ is an instance over $p$ and $p \subset p_1$, then $\phi(v_1)$ must have a child $v_3$ with the same label as $v_2$. Then, because of Lemma 4, $v_2$ is the only child of $v_1$ of that type and similarly $v_3$ is the only child of $\phi(v_1)$. We then define $\phi(v_2) = v_3$.

Next we show that if $v_2 \neq v'_2$ then $\phi(v_2) \neq \phi(v'_2)$ and so $\phi$ is 1-1. To see this, we firstly note that because of Lemma 4, $Parent(v_2) \neq Parent(v'_2)$. So since by the induction hypothesis $\phi$ is 1-1 on $S_{i-1}$, $\phi(Parent(v_2)) \neq \phi(Parent(v'_2))$, and so since $T_1$ is a tree and node identifiers are unique, this implies that $\phi(v_2) \neq \phi(v'_2)$, which was to be shown. We have just seen that $\phi$ satisfies (i), (ii) and (iii) and it satisfies (iv) because $v_2 \in N$. Hence by the induction hypothesis, there exists a 1-1 function from $V \cup N_1$ into $V \cup N_2$ and so $|V \cup N_1| \le |V \cup N_2|$, where $|\cdot|$ denotes cardinality. Using exactly the same argument in reverse, one can show that there is a 1-1 function from $V \cup N_2$ into $V \cup N_1$ and so $|V \cup N_2| \le |V \cup N_1|$. Hence $|V \cup N_2| = |V \cup N_1|$ and so $\phi$ is a 1-1 function from $V \cup N_1$ onto $V \cup N_2$ satisfying (i) - (iv), and so $T_1$ and $T_2$ are isomorphic. □

Before proving Theorem 2 we need some preliminary definitions and lemmas.

Definition 32 Let $A$ and $B$ be attributes in $U$ and let $S$ be a scheme tree defined over $U$. Then $A$ and $B$ are defined to be siblings if $A$ and $B$ are members of the label for a node, $A$ is an ancestor of $B$ if $A$ is a member of the label of a node which is the ancestor of a node for which $B$ is a member of the label, and $A$ and $B$ are unrelated if $A$ and $B$ are not siblings and $A$ is not an ancestor of $B$ and $B$ is not an ancestor of $A$.

For example, in the scheme tree shown in Figure 17, Name and Sid are siblings, Name is an ancestor of Exam, and Major and Project are unrelated. Next, let us denote by $S(r^*)$ the scheme tree for a nested relation $r^*$.
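The three relationships of Definition 32 can be sketched as a small classification function. The Python fragment below is illustrative only. The shape of the scheme tree is an assumption read off the heading of Figure 20 (a root node labelled {Name, Sid} with children {Major} and {Class}, the latter with children {Exam} and {Project}); it agrees with the three examples quoted in the text. The node identifiers and function names are hypothetical.

```python
# Assumed scheme tree: node id -> (label, parent id).
NODES = {
    "n0": ({"Name", "Sid"}, None),
    "n1": ({"Major"}, "n0"),
    "n2": ({"Class"}, "n0"),
    "n3": ({"Exam"}, "n2"),
    "n4": ({"Project"}, "n2"),
}


def node_of(attr):
    """The node whose label contains the attribute."""
    return next(n for n, (label, _) in NODES.items() if attr in label)


def ancestors(node):
    """All proper ancestors of a node, nearest first."""
    result = []
    parent = NODES[node][1]
    while parent is not None:
        result.append(parent)
        parent = NODES[parent][1]
    return result


def relation(a, b):
    """Classify the attribute pair (a, b) following Definition 32."""
    na, nb = node_of(a), node_of(b)
    if na == nb:
        return "siblings"
    if na in ancestors(nb):
        return "ancestor"      # a is an ancestor of b
    if nb in ancestors(na):
        return "descendant"    # b is an ancestor of a
    return "unrelated"
```

For instance, `relation("Name", "Sid")` gives "siblings", `relation("Name", "Exam")` gives "ancestor" and `relation("Major", "Project")` gives "unrelated", matching the examples above.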

Lemma 5 Let $r$ and $r^*$ be as defined in the procedure given in Section 5.1.2, i.e. $r$ is any flat relation and $r^* = \nu_{Y_{n-1}} \cdots \nu_{Y_0}(r)$, and let $A \in U$ and $B \in U$. Also, let $t_A$ be a subtuple in $r^*$ in which $A$ is an atomic attribute and let $t_B$ be a subtuple in $r^*$ in which $B$ is an atomic attribute. If $t_A[A] = a$ and $t_B[B] = b$ then there exists a tuple $t$ in $r$ such that $t[A,B] = \langle a, b \rangle$ if any of the following conditions are true: (i) $A$ and $B$ are siblings in $S(r^*)$ and $t_A = t_B$; (ii) $A$ is an ancestor of $B$ in $S(r^*)$ and $t_B$ is a subtuple of $t_A$; (iii) $B$ is an ancestor of $A$ in $S(r^*)$ and $t_A$ is a subtuple of $t_B$; (iv) $A$ and $B$ are unrelated in $S(r^*)$.

Proof. Suppose that (i) is satisfied. We shall show by induction that there exists a tuple $t \in \mu^*(r^*)$ such that $t[A,B] = \langle a, b \rangle$, from which it follows that $t \in r$ by Lemma 2. Since the ordering of unnesting is immaterial we unnest $r^*$ by $\mu_{\{Y_0\}} \cdots \mu_{\{Y_{n-1}\}}(r^*)$. Let $Y_i$ be the NRS in the unnesting in which $A$ and $B$ are atomic attributes. Initially, we have a subtuple $t_A$ in $r^*$ for which $t_A[A,B] = \langle a, b \rangle$. Assume inductively then that $\mu_{\{Y_j\}} \cdots \mu_{\{Y_{n-1}\}}(r^*)$, $i + 1 < j$, contains the subtuple $t_A$. It follows from the definition of unnest that $t_A$ will still be a subtuple in $\mu_{\{Y_{j-1}\}} \cdots \mu_{\{Y_{n-1}\}}(r^*)$ and so by induction $t_A$ is a subtuple in $\mu_{\{Y_{i+1}\}} \cdots \mu_{\{Y_{n-1}\}}(r^*)$. From the definition of unnest, it follows that $\mu_{\{Y_i\}} \cdots \mu_{\{Y_{n-1}\}}(r^*)$ will contain a tuple $t$ such that $t[A,B] = \langle a, b \rangle$ and the property will then still hold, by a similar inductive argument and the definition of unnest, for $\mu_{\{Y_j\}} \cdots \mu_{\{Y_{n-1}\}}(r^*)$, $j < i$, and so the property is proven.

Consider (ii). Let $Y_i$ denote the NRS in the construction of $r^*$ in which $A$ appears as an atomic attribute and let $Y_j$ denote the NRS in the construction of $r^*$ in which $B$ appears as an atomic attribute. Since $A$ is an ancestor of $B$ the unnesting on $Y_i$ will be performed before the unnesting on $Y_j$ in the total unnest. We firstly note that since $t_A$ is a subtuple in $r^*$ then it follows by a simple inductive argument similar to the one just given in (i) and the definition of unnest that $t_A$ will be a subtuple in $\mu_{\{Y_{i+1}\}} \cdots \mu_{\{Y_{n-1}\}}(r^*)$ and $t_A$ has the subtuple $t_B$. It then follows by the definition of unnest that there will be a tuple $t$ in $\mu_{\{Y_i\}} \cdots \mu_{\{Y_{n-1}\}}(r^*)$ such that $t[A] = \langle a \rangle$ and $t$ has the subtuple $t_B$. Then again by induction and the definition of unnest there will be a tuple $t_1$ in $\mu_{\{Y_j\}} \cdots \mu_{\{Y_{n-1}\}}(r^*)$ such that $t_1[A,B] = \langle a, b \rangle$ and again by induction and the definition of unnest the same property will hold for $\mu_{\{Y_k\}} \cdots \mu_{\{Y_{n-1}\}}(r^*)$, $k < j$, and so the result is proven.

The result (iii) follows, by symmetry, using the same argument as in (ii).

Consider (iv). Let $Y_i$ denote the NRS in the construction of $r^*$ in which $A$ appears as an atomic attribute and let $Y_j$ denote the NRS in the construction of $r^*$ in which $B$ appears as an atomic attribute. Suppose that the unnesting on $Y_i$ is performed first. Then using the same arguments as for the previous cases it follows that there exists a tuple $t$ in $\mu_{\{Y_i\}} \cdots \mu_{\{Y_{n-1}\}}(r^*)$ such that $t[A] = \langle a \rangle$. Then using the same arguments as before it follows that there will exist a tuple $t_1$ in $\mu_{\{Y_j\}} \cdots \mu_{\{Y_i\}} \cdots \mu_{\{Y_{n-1}\}}(r^*)$ such that $t_1[A,B] = \langle a, b \rangle$ and the same property will then also hold for $\mu_{\{Y_k\}} \cdots \mu_{\{Y_{n-1}\}}(r^*)$, $k < j$, and so the result is proven. □
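To make the role of nest and unnest in Lemma 5 concrete, the following is a minimal sketch for a single level of nesting. It assumes relations are lists of Python dictionaries without nulls; the function names `nest` and `unnest`, the attribute name `Exams` and the sample data are hypothetical and are not taken from the paper.

```python
def nest(rows, ys, name):
    """Nest a relation on the attribute set ys.

    Rows that agree on all attributes outside ys are merged, and their
    ys-projections are collected into a list stored under `name`.
    """
    groups = []  # list of (rest, subtuples) pairs
    for row in rows:
        rest = {k: v for k, v in row.items() if k not in ys}
        sub = {k: row[k] for k in ys}
        for g_rest, g_subs in groups:
            if g_rest == rest:
                if sub not in g_subs:
                    g_subs.append(sub)
                break
        else:  # no existing group matched
            groups.append((rest, [sub]))
    return [{**rest, name: subs} for rest, subs in groups]

def unnest(rows, name):
    """Flatten the nested attribute `name` back into atomic attributes."""
    return [{**{k: v for k, v in row.items() if k != name}, **sub}
            for row in rows for sub in row[name]]

r = [
    {"Name": "Ann", "Exam": "DB"},
    {"Name": "Ann", "Exam": "AI"},
    {"Name": "Bob", "Exam": "DB"},
]
nested = nest(r, {"Exam"}, "Exams")

# Case (ii) of Lemma 5 on this toy data: every subtuple t_B of a tuple t_A
# gives rise to a flat tuple <t_A[Name], t_B[Exam]> in r.
pairs = [(t_a["Name"], t_b["Exam"]) for t_a in nested for t_b in t_a["Exams"]]
```

Here `Name` plays the role of the ancestor attribute $A$ and `Exam` the role of $B$; unnesting the nested relation recovers exactly the tuples of `r`.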

The next results are essentially converses of the above result.

Lemma 6 If $t$ is a tuple in $r$ such that $t[A,B] = \langle a, b \rangle$ and $A$ and $B$ are siblings in $S(r^*)$ then there exists a subtuple $t_2$ in $r^*$, defined over a NRS $N_1$, such that $A$ and $B$ are atomic attributes in $N_1$ and $t_2[A,B] = \langle a, b \rangle$.

Proof. We shall prove the result by induction on the nesting operations. Let $Y_i$ be the NRS in which $A$ and $B$ appear as atomic attributes. Initially the result is true for $r$ and suppose inductively that it is true for $r_j$, where $j < i - 1$. Then by property (i) of the nest operator the result will be true after we nest $r_j$ on $Y_j$ and so the property is true for $r_{j+1}$. Consider then $r_i = \nu_{Y_{i-1}}(r_{i-1})$. By property (ii) of the nest operator, if there exists a tuple $t$ with $t[A,B] = \langle a, b \rangle$ before nesting on $Y_i$ then after the nesting there will be a subtuple $t_1$ defined over $Y_i$ such that $t_1[A,B] = \langle a, b \rangle$. It then follows by a similar inductive argument and property (ii) of the nest operator that each relation $r_j$, $j > i$, will contain the subtuple $t_1$ and so the result is proven. □

Lemma 7 If $t$ is a tuple in $r$ such that $t[A,B] = \langle a, b \rangle$ and $A$ is an ancestor of $B$ in $S(r^*)$, then there exist subtuples $t_2$ (defined over a NRS $N''$) and $t_3$ (defined over a NRS $N'''$), such that $A$ is an atomic attribute in $N''$ and $t_2[A] = \langle a \rangle$ and $t_3$ is a subtuple of $t_2$ and $B$ is an atomic attribute in $N'''$ and $t_3[B] = \langle b \rangle$.

Proof. We shall prove the result by induction on the nesting operations. Let $Y_i$ be the NRS in which $A$ appears as the atomic attribute and let $Y_j$ be the NRS in which $B$ appears as the atomic attribute. Since $A$ is an ancestor of $B$ we have that $i > j$. Initially $r$ contains a tuple $t$ with $t[A,B] = \langle a, b \rangle$. The same argument as used in the previous lemma then shows that $r_k$, where $k < j$, will contain a tuple $t_k$ such that $t_k[A,B] = \langle a, b \rangle$. Consider then the effect of nesting on $Y_j$. By definition of the nest operator, after the nesting $r_{j+1}$ will contain a tuple $t_j$ and $t_{j1}$, a subtuple of $t_j$, such that $t_j[A] = \langle a \rangle$ and $t_{j1}[B] = \langle b \rangle$. It then follows by a similar inductive argument and properties of the nest operator that each relation $r_k$, $i > k > j$, will have the same property. Consider then the effect of nesting on $Y_i$. By property (ii) of the nest operator, after the nesting on $Y_i$ there will exist a subtuple $t_i$ in $r_{i+1}$ and $t_{i1}$, a subtuple of $t_i$, such that $t_i[A] = \langle a \rangle$ and $t_{i1}[B] = \langle b \rangle$. Using a similar inductive argument one can show that the same property remains true for all $r_j$, $j > i$, and so the result is proven. □

Lemma 8 If $t$ is a tuple in $r$ such that $t[A,B] = \langle a, b \rangle$ and $A$ and $B$ are unrelated in $S(r^*)$ then there exists a tuple $t_1$ in $r^*$ and there exist subtuples $t_2$ (defined over NRS $N''$) and $t_3$ (defined over NRS $N'''$) of $t_1$, such that $A$ is an atomic attribute in $N''$ and $t_2[A] = \langle a \rangle$ and $B$ is an atomic attribute in $N'''$ and $t_3[B] = \langle b \rangle$ and neither $t_2$ nor $t_3$ is a subtuple of the other.

Proof. We shall prove the result by induction on the nesting operations. Let $Y_i$ be the NRS in which $A$ appears as the atomic attribute and let $Y_j$ be the NRS in which $B$ appears as the atomic attribute. Since $A$ and $B$ are unrelated it does not matter which nest is performed first. We shall choose arbitrarily $Y_j$ to be nested first. Initially $r$ contains a tuple $t$ with $t[A,B] = \langle a, b \rangle$. The same argument as used in the previous lemma then shows that $r_k$, where $k < j$, will contain a tuple $t_k$ such that $t_k[A,B] = \langle a, b \rangle$. Consider then the effect of nesting on $Y_j$. By definition of the nest operator, after the nesting $r_{j+1}$ will contain a tuple $t_j$ and $t_{j1}$, a subtuple of $t_j$, such that $t_j[A] = \langle a \rangle$ and $t_{j1}[B] = \langle b \rangle$. It then follows by a similar inductive argument and the properties of the nest operator that each relation $r_k$, $i > k > j$, will have the same property. Consider then the effect of nesting on $Y_{i-1}$. By property (ii) of the nest operator, after the nesting on $Y_i$ there will exist subtuples $t_i$ and $t_{i1}$ in $r_i$ such that $t_i[A] = \langle a \rangle$ and $t_{i1}[B] = \langle b \rangle$ but $t_i$ and $t_{i1}$ are not subtuples of each other. Using a similar inductive argument one can show that the same property remains true for $r_j$, $j > i$, and so the result is proven. □

We also need the following result (Lemma 6.1 in [7]) which gives a syntactic characterization of strong satisfaction in flat relations.

Lemma 9 A FD $A \to B$ is strongly satisfied in a relation $r$ if for every $t_1, t_2 \in r$, either both $t_1[A]$ and $t_2[A]$ are non null and not equal or $t_1[B]$ and $t_2[B]$ are non null and equal.

We now prove the main result of Section 5.1.2.
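As an aside before the proof, the syntactic condition of Lemma 9 is straightforward to test mechanically. The sketch below is illustrative only: it assumes a relation is a list of dictionaries, that `None` stands for the null value, and that the condition is checked over pairs of distinct tuples; the function name `strongly_satisfies` is hypothetical.

```python
from itertools import combinations

NULL = None  # assumed encoding of the null value

def strongly_satisfies(rows, a, b):
    """Check the condition of Lemma 9 for the unary FD a -> b.

    For every pair of distinct tuples, either both a-values are non null
    and different, or both b-values are non null and equal.
    """
    for t1, t2 in combinations(rows, 2):
        a_differs = (t1[a] is not NULL and t2[a] is not NULL
                     and t1[a] != t2[a])
        b_agrees = (t1[b] is not NULL and t2[b] is not NULL
                    and t1[b] == t2[b])
        if not (a_differs or b_agrees):
            return False
    return True

# Non-null, distinct A-values: the condition holds.
ok = strongly_satisfies([{"A": 1, "B": "x"}, {"A": 2, "B": "y"}], "A", "B")
# A null A-value with differing B-values (compare Case D in the proof below).
null_a = strongly_satisfies([{"A": 1, "B": "x"}, {"A": NULL, "B": "y"}], "A", "B")
# Equal A-values with a null B-value (compare Case B in the proof below).
null_b = strongly_satisfies([{"A": 1, "B": "x"}, {"A": 1, "B": NULL}], "A", "B")
```

The failing inputs mirror the kinds of violating tuple pairs that the case analysis in the proof of Theorem 2 enumerates.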

Proof of Theorem 2

If: We shall show the contrapositive, that if $r$ violates $A \to B$ then $T_r$ violates $p_A \to p_B$. We shall consider several cases.

Case A: there exist $t_1$, $t_2$ in $r$ such that $t_1[A] \neq \bot$ and $t_2[A] \neq \bot$ and $t_1[A] = t_2[A] = \langle a \rangle$ and $t_1[B] \neq \bot$ and $t_2[B] \neq \bot$ and $t_1[B] = b_1 \neq b_2 = t_2[B]$.

Suppose firstly that $A$ and $B$ are siblings in $S(r^*)$. By Lemma 6 there exist subtuples $t_1$ and $t_2$ in $r^*$, defined over a NRS $N_1$, such that $A$ and $B$ are atomic attributes in $N_1$ and $t_1[A,B] = \langle a, b_1 \rangle$ and $t_2[A,B] = \langle a, b_2 \rangle$. It then follows by the construction procedure that the node in $M(T_r)$ corresponding to $t_1[A]$, call it $v_1$, and the node corresponding to $t_1[B]$, call it $v_1'$, are both children of the same node (call it $v_{p1}$). It thus follows from the definition of $x_1$ in Definition 16 that $x_1 = v_{p1}$ and so $v_1 \in Nodes(x_1, p_A)$. It then follows by the construction procedure that the node in $M(T_r)$ corresponding to $t_2[A]$, call it $v_2$, and the node corresponding to $t_2[B]$, call it $v_2'$, are both children of the same node (call it $v_{p2}$). It thus follows from the definition of $y_1$ in Definition 16 that $y_1 = v_{p2}$ and so $v_2 \in Nodes(y_1, p_A)$. Thus since $val(v_1) = val(v_2) = a$ and $val(v_1') = b_1 \neq b_2 = val(v_2')$ it follows from the definition of an XFD that $T_r$ violates $p_A \to p_B$.

Suppose next that $A$ is an ancestor of $B$ in $S(r^*)$. It follows from Lemma 7 that there exist subtuples $t_2$ (defined over NRS $N''$) and $t_3$ (defined over NRS $N'''$), such that $A$ is an atomic attribute in $N''$ and $t_2[A] = \langle a \rangle$ and $t_3$ is a subtuple of $t_2$ and $B$ is an atomic attribute in $N'''$ and $t_3[B] = \langle b_1 \rangle$. Using Lemma 7 again there exist subtuples $t_4$ (defined over NRS $N''$) and $t_5$ (defined over NRS $N'''$), such that $A$ is an atomic attribute in $N''$ and $t_4[A] = \langle a \rangle$ and $t_5$ is a subtuple of $t_4$ and $B$ is an atomic attribute in $N'''$ and $t_5[B] = \langle b_2 \rangle$. Then by the construction procedure the node in $M(T_r)$ corresponding to $t_1[A]$, call it $v_1$, and the node corresponding to $t_1[B]$, call it $v_1'$, are such that the parent of $v_1$ (call it $v_{p1}$) is an ancestor of $v_1'$.
It thus follows from the definition of $x_1$ in Definition 16 that $x_1 = v_{p1}$ and so $v_1 \in Nodes(x_1, p_A)$. It then follows by the construction procedure that the node in $M(T_r)$ corresponding to $t_2[A]$, call it $v_2$, and the node corresponding to $t_2[B]$, call it $v_2'$, are such that the parent of $v_2$ (call it $v_{p2}$) is an ancestor of $v_2'$. It thus follows from the definition of $y_1$ in Definition 16 that $y_1 = v_{p2}$ and so $v_2 \in Nodes(y_1, p_A)$. Thus since $val(v_1) = val(v_2) = a$ and $val(v_1') = b_1 \neq b_2 = val(v_2')$ it follows from the definition of an XFD that $T_r$ violates $p_A \to p_B$. The case where $B$ is an ancestor of $A$ is handled similarly.

Lastly consider the case where $A$ and $B$ are not related in $S(r^*)$. By Lemma 8 there exist subtuples $t_2$ (defined over NRS $N''$) and $t_3$ (defined over NRS $N'''$), such that $A$ is an atomic attribute in $N''$ and $t_2[A] = \langle a \rangle$ and $B$ is an atomic attribute in $N'''$ and $t_3[B] = \langle b_1 \rangle$ and neither $t_2$ nor $t_3$ is a subtuple of the other. Again, by Lemma 8 there exist subtuples $t_4$ (defined over NRS $N''$) and $t_5$ (defined over NRS $N'''$), such that $A$ is an atomic attribute in $N''$ and $t_4[A] = \langle a \rangle$ and $B$ is an atomic attribute in $N'''$ and $t_5[B] = \langle b_2 \rangle$ and neither $t_5$ nor $t_4$ is a subtuple of the other. It then follows by the construction procedure that the node in $M(T_r)$ corresponding to $t_1[A]$, call it $v_1$, and the node corresponding to $t_1[B]$, call it $v_1'$, have a maximal common ancestor (call it $v_{p1}$). Moreover, because of the construction of $M(T_r)$ and the definition of $x_1$ it follows that $x_1 = v_{p1}$ and so $v_1 \in Nodes(x_1, p_A)$. Similarly, it follows by the construction procedure that the node in $M(T_r)$ corresponding to $t_2[A]$, call it $v_2$, and the node corresponding to $t_2[B]$, call it $v_2'$, have a maximal common ancestor (call it $v_{p2}$). Moreover, because of the construction of $M(T_r)$ and the definition of $y_1$ it follows that $y_1 = v_{p2}$ and so $v_2 \in Nodes(y_1, p_A)$. Thus since $val(v_1) = val(v_2)$ and $val(v_1') \neq val(v_2')$ it follows from the definition of an XFD that $T_r$ violates $p_A \to p_B$.

Case B: there exist $t_1$, $t_2$ in $r$ such that $t_1[A] \neq \bot$ and $t_2[A] \neq \bot$ and $t_1[A] = t_2[A] = \langle a \rangle$ and $t_2[B] = \bot$.

Suppose firstly that $A$ and $B$ are siblings in $S(r^*)$. Using the same argument as in Case A, the node in $M(T_r)$ corresponding to $t_1[A]$, call it $v_1$, and the node corresponding to $t_1[B]$, call it $v_1'$, are both children of the same node (call it $v_{p1}$) and so $x_1 = v_{p1}$ and $v_1 \in Nodes(x_1, p_A)$. Using the same argument as in Case A, the node corresponding to $t_2[A]$, call it $v_2$, and the node corresponding to $t_2[B]$, the null node $v_2'$, have the same parent $v_{p2}$. It thus follows that $y_1 = v_{p2}$ and so $v_2 \in Nodes(y_1, p_A)$. So $T_r$ violates $p_A \to p_B$ since $val(v_1') \neq val(v_2')$ (since $v_2'$ is a null node) and $val(v_1) = val(v_2)$.

Suppose next that $A$ is an ancestor of $B$ in $S(r^*)$.
Using the same argument as in Case A, it follows that the node in $M(T_r)$ corresponding to $t_1[A]$, call it $v_1$, and the node corresponding to $t_1[B]$, call it $v_1'$, are such that the parent of $v_1$ (call it $v_{p1}$) is an ancestor of $v_1'$ and so $x_1 = v_{p1}$ and thus $v_1 \in Nodes(x_1, p_A)$. Following Case A, the node in $M(T_r)$ corresponding to $t_2[A]$, call it $v_2$, and the null node $v_2'$ (corresponding to $t_2[B]$) are such that the parent of $v_2$ (call it $v_{p2}$) is an ancestor of $v_2'$. Thus $y_1 = v_{p2}$ and so $v_2 \in Nodes(y_1, p_A)$ and thus $T_r$ violates $p_A \to p_B$ for the same reasons as before. The case where $B$ is an ancestor of $A$ is handled similarly.

Lastly consider the case where $A$ and $B$ are not related in $S(r^*)$. From Lemma 8 and using the same argument as in Case A, the node in $M(T_r)$ corresponding to $t_1[A]$, call it $v_1$, and the node corresponding to $t_1[B]$, call it $v_1'$, have a maximal common ancestor (call it $v_{p1}$) and $x_1 = v_{p1}$ and so $v_1 \in Nodes(x_1, p_A)$. Following Case A again, the node in $M(T_r)$ corresponding to $t_2[A]$, call it $v_2$, and the null node $v_2'$ (corresponding to $t_2[B]$) have a maximal common ancestor (call it $v_{p2}$) and $y_1 = v_{p2}$ and so $v_2 \in Nodes(y_1, p_A)$ and so $T_r$ violates $p_A \to p_B$ for the same reasons as before.

Case C: there exist $t_1$, $t_2$ in $r$ such that $t_1[A] \neq \bot$ and $t_2[A] \neq \bot$ and $t_1[A] = t_2[A] = \langle a \rangle$ and $t_1[B] = \bot$. As for Case B.

Case D: there exist $t_1$, $t_2$ in $r$ such that $t_1[B] \neq \bot$ and $t_2[B] \neq \bot$ and $t_1[B] \neq t_2[B]$ and $t_1[A] \neq \bot$ and $t_2[A] = \bot$.

Suppose firstly that $A$ and $B$ are siblings in $S(r^*)$. As before, the node in $M(T_r)$ corresponding to $t_1[A]$, call it $v_1$, and the node corresponding to $t_1[B]$, call it $v_1'$, are both children of the same node (call it $v_{p1}$) and $x_1 = v_{p1}$ and so $v_1 \in Nodes(x_1, p_A)$. It then follows by the construction procedure that the node in $M(T_r)$ corresponding to $t_2[A]$ will be a null node and the node corresponding to $t_2[B]$, call it $v_2'$, will have the same parent (call it $v_{p2}$). It thus follows from the definition of $y_1$ in Definition 16 that $y_1 = v_{p2}$ and so $\bot \in Nodes(y_1, p_A)$ and thus $T_r$ violates $p_A \to p_B$ because $val(v_1') \neq val(v_2')$.

Suppose next that $A$ is an ancestor of $B$ in $S(r^*)$. As in Case A the node in $M(T_r)$ corresponding to $t_1[A]$, call it $v_1$, and the node corresponding to $t_1[B]$, call it $v_1'$, are such that the parent of $v_1$ (call it $v_{p1}$) is an ancestor of $v_1'$. It thus follows from the definition of $x_1$ in Definition 16 that $x_1 = v_{p1}$ and so $v_1 \in Nodes(x_1, p_A)$. It then follows by the construction procedure that the node in $M(T_r)$ corresponding to $t_2[A]$ ($v_2$) will be null and that $v_2$ and the node corresponding to $t_2[B]$, call it $v_2'$, are such that the parent of $v_2$ (call it $v_{p2}$) is an ancestor of $v_2'$. It thus follows from the definition of $y_1$ in Definition 16 that $y_1 = v_{p2}$ and so $\bot \in Nodes(y_1, p_A)$ and thus $T_r$ violates $p_A \to p_B$ for the same reasons as before.

Lastly consider the case where $A$ and $B$ are not related in $S(r^*)$. As for Case A, the node in $M(T_r)$ corresponding to $t_1[A]$, call it $v_1$, and the node corresponding to $t_1[B]$, call it $v_1'$, have a maximal common ancestor (call it $v_{a1}$) and $x_1 = v_{a1}$ and so $v_1 \in Nodes(x_1, p_A)$. It then follows by the construction procedure that the node in $M(T_r)$ corresponding to $t_2[A]$ ($v_2$) will be null, and that $v_2$ and the node corresponding to $t_2[B]$, call it $v_2'$, have a maximal common ancestor (call it $v_{a2}$). It thus follows from the definition of $y_1$ in Definition 16 that $y_1 = v_{a2}$ and so $\bot \in Nodes(y_1, p_A)$ and thus $T_r$ violates $p_A \to p_B$.

Case E: there exist $t_1$ and $t_2$ in $r$ such that $t_1[B] = \bot$ and $t_2[A] = \bot$. As for the previous case.

The other cases are handled similarly to the ones just given.
Only If: We shall show the contrapositive, that if $T_r$ violates $p_A \to p_B$ then $r$ violates $A \to B$. Let $v_1.\cdots.v_n$ and $v_1'.\cdots.v_n'$ be two distinct path instances in $Paths(p_B)$ in $M(T_r)$. We firstly note that because of the construction for $T_r$ and axiom A7 we can assume that $p_A \neq p_B$ and $Last(p_A) \in \mathbf{A}$ and $Last(p_B) \in \mathbf{A}$. There are several cases to consider.

Case A: $Parnt(p_A)$ is a prefix of $p_B$. Since $p_A \to p_B$ is violated we must have that $val(v_n') \neq val(v_n)$ and ($\bot \in Nodes(x_1, p_A)$ or $\bot \in Nodes(y_1, p_A)$ or $val(Nodes(x_1, p_A)) \cap val(Nodes(y_1, p_A)) \neq \emptyset$) where $x_1$ and $y_1$ are defined as in Definition 16. We consider each of these cases.

Case A.1: $val(Nodes(x_1, p_A)) \cap val(Nodes(y_1, p_A)) \neq \emptyset$. Let $v_1$ be a node in $Nodes(x_1, p_A)$ and let $v_2$ be a node in $Nodes(y_1, p_A)$ such that $val(v_1) = val(v_2)$ ($v_1$ and $v_2$ may be the same node). Then because $Parnt(p_A)$ is a prefix of $p_B$ and the definition of $x_1$ and $y_1$ in Definition 16 it follows that $x_1 = Parent(v_1)$ and $y_1 = Parent(v_2)$. Hence by the construction procedure for $T_r$ and the fact that $A$ is an ancestor of $B$ in $S(r^*)$ (because $Parnt(p_A)$ is a prefix of $p_B$), there exists a subtuple $t_1$ in $r^*$ such that $t_1[A] = val(v_1)$ and there exists a subtuple $t_2$ of $t_1$ such that $t_2[B] = val(v_n)$. So by Lemma 5 there exists a tuple $t \in r$ such that $t[A,B] = \langle val(v_1), val(v_n) \rangle$. Using the same argument for $v_2$ and $v_n'$ shows that there exists another tuple $t' \in r$ such that $t'[A,B] = \langle val(v_2), val(v_n') \rangle$ and so by Lemma 9 $A \to B$ is violated in $r$ since $val(v_n') \neq val(v_n)$ and $val(v_1) = val(v_2)$.

Case A.2: $\bot \in Nodes(x_1, p_A)$. Let $v_1$ be the node in $Nodes(x_1, p_A)$ that is null and let $v_2$ be any node in $Nodes(y_1, p_A)$. Following the argument used in the previous case, if $p_A \to p_B$ is violated then there exists a subtuple $t_1$ in $r^*$ such that $t_1[A] = \bot$ and there exists a subtuple $t_2$ of $t_1$ such that $t_2[B] = val(v_n)$. So by Lemma 5 there exists a tuple $t \in r$ such that $t[A,B] =$
