dencies in relational databases (INDs) and XINDs in XML documents and we .... We now assume the existence of a set of legal paths P for an XML application.
Generalized Inclusion Dependencies in XML Millist W. Vincent, Michael Schrefl, Jixue Liu, Chengfei Liu, and Solen Dogen School of Computer and Information Science University of South Australia {millist.vincent, michael.schrefl,jixue.liu, chengfei.liu}@unisa.edu.au
Abstract. Integrity constraints play a fundamental role in defining semantics in both conventional databases and in XML documents. In this paper we generalize previous approaches to defining inclusion dependencies in XML. Previous approaches have considered only the case where the paths in the l.h.s. are child attributes of the same node and the paths on the r.h.s. of the dependency are child attributes of the same node, whereas we do not apply this restriction. We then give an axiom system for XINDs and prove that the system is sound and complete. As a corollary, we also show that the implication problem for XINDs is decidable. Finally we consider the relationship between inclusion dependencies in relational databases (INDs) and XINDs in XML documents and we show that for a very general class of mappings from a relational database to a set of XML documents, and IND is satisfied in a relational database if and only if the corresponding XIND is satisfied in the XML documents.
1
Introduction
Integrity constraints play a fundamental, and much studied [1], role in relational databases where they are used to define the semantics of data. Similarly, with the adoption of the eXtensible Markup Language (XML) [5] as the standard for data interchange over the Web, the study of integrity constraints in XML has attracted considerable interest in recent years [7, 15]. Several different classes of integrity constraints for XML have been defined including key constraints [6–8], path constraints [2, 11, 7, 10], and inclusion constraints [13, 12, 14] and properties such as axiomatization and satisfiability have been investigated for these constraints. One observation to make on this research is that the flexible structure of XML makes the investigation of integrity constraints in XML more complex and subtle than in relational databases. We also point out that the study of integrity constraints in XML has assumed additional significance with the recent trend in Web technology of developing a semantic Web [4]. In such a scenario, XML documents form the bottom layer of the semantic Web architecture [4]. To be able to implement this vision of a machine processible semantic Web, it is essential to store not only XML data, but also the semantics of the XML data. Hence the importance of integrity constraints in XML, given that they are one of the most important and fundamental ways of describing the semantics of data.
In this paper we extend this study of XML constraints. In particular we focus on those classes of constraints which correspond to the classical constraints in relational databases. This is a relatively little studied area and most of the XML constraints described earlier are unique to XML. We claim that this area of study is of considerable significance since, with today’s technology, a large part of XML data is derived from relational databases. Hence, given the importance of semantics and constraints in XML, it is also important to be able to determine how constraints in relations map to constraints in XML. We have recently defined functional and multivalued dependencies in XML (called XFDs and XMVDs) [20, 23, 22, 21] and shown that, for a very general class of mappings, classical functional and multivalued dependencies in relations map to XFDs and XMVDs in XML. We focus our attention in this paper on inclusion dependencies in XML (called XINDs). While such dependencies have already been defined [13, 12, 14], these existing definitions can only handle the situation where the paths on the l.h.s. are all attributes of element node and all the paths on the r.h.s. are child elements of one element. Under this restriction, one cannot in general map a relational inclusion dependency to an XIND using the existing definitions when the relation is restructured, as is often the case, during the mapping to XML. To see this, consider the following example. Example 1. Consider the relational database shown in Figure 1 which contains the relations R1 and R2. The database satisfies the inclusion dependency (called an IND) R2[StudentId, Course] ⊆ R1[StudentId,Course]. Suppose we then map R1 to an XML document by first nesting on Course Grade and map R2 to an XML document by first nesting on Course Tutorial. The two XML documents are shown below. R1 StudentId Course Grade s1 c1 P s1 c2 F R2 StudentId Course Tutorial s1 c1 t1 Fig. 1. A relational database
Because the relational database satisfies the IND R2[StudentId, Course] ⊆ R1[StudentId,Course], in the two XML documents one would like to express the constraint that for the allocation element in T2, there is a enrol element in T1 such that the StudentId value and Course value in the allocation el-
%XML Document T1 < alloc Course = "c1" Tutorial = "t1"> Fig. 2. Two XML documents
ement of T2 match the corresponding values in the enrol element of T1. This constraint cannot be defined in the approach used in [13, 12, 14] because Course and StudentId are not attributes of one element. However, as we shall see later, such a constraint can be defined in our definition of an XIND. We also point out that, as in the relational case, the constraint just mentioned is not in general equivalent to the two separate unary constraints that a StudentId value in a allocation element of T2 has a matching value in a enrol element of T1, and Course value in a allocation element of T2 has a matching value in a enrol element of T1. The main contributions of this paper are as follows. – We propose a definition of XINDs in XML which generalizes previous definitions ([13, 12, 14]) and allows arbitrary multiple paths on the left and right hand side of the dependency. – We present an axiom system for XINDs and prove that the system is sound and complete for XINDs without duplicate paths. Interestingly, the axioms include the well known axioms for INDs in relations [18] but also have an additional axiom that has no counterpart in INDs in relations. As a corollary of these results, we show that the implication problem for XINDs is decidable. – We provide formal justification for our syntactic definition of an XIND. The justification is based on the the correspondence between INDs in relations and XINDs in XML documents. We show that for a very general class of mappings from relations to XML documents, an IND is satisfied in a relation if and only if the corresponding XIND is satisfied in the XML document. Thus there is a natural correspondence between XINDs in XML documents and INDs in relations. The class of mappings we consider is very general and consists of those mapppings defined by firstly mapping the flat relation
to a nested relation by an arbitrary sequences of nest operations, followed by mapping the nested relation to an XML document in a straightforward way by making each subtuple an element with the atomic attributes in the subtuple being attributes. This is a very general class of mappings and we believe that it includes all the mappings that are likely to occur in practice. The rest of the paper is organized as follows. Section 2 contains some preliminary definitions. In Section 3 we give the definition of an XIND. Section 4 contains an axiom system for XINDs and a proof of sounness and completeness of the axioms. In Section 5 we consider the relationship between INDs and XINDs and show that essentially INDs in relations map to XINDs in XML. Finally, Section 6 contains some concluding comments.
2
Preliminary Definitions
In this section we present some preliminary definitions that we need before defining XFDs. We now present the definition of an XML tree adapted from the definition given in [9]. Definition 1. Assume a countably infinite set E of element labels (tags), a countable infinite set A of attribute names and a symbol S indicating text. An XML tree is defined to be T = (V, lab, ele, att, val, vr ) where V is a finite set of nodes in T ; lab is a function from V to E∪A∪{S}; ele is a partial function from V to a sequence of V nodes such that for any v ∈ V , if ele(v) is defined then lab(v) ∈ E; att is a partial function from V × A to V such that for any v ∈ V and l ∈ A, if att(v, l) = v1 then lab(v) ∈ E and lab(v1 ) = l; val is a function such that for any node in v ∈ V, val(v) = v if lab(v) ∈ E and val(v) is a string if either lab(v) = S or lab(v) ∈ A; vr is a distinguished node in V called the root of T and we define lab(vr ) = root. Since node identifiers are unique, a consequence of the definition of val is that if lab(v1 ) ∈ E and lab(v2 ) ∈ E and v1 6= v2 then val(v1 ) 6= val(v2 ). We also extend the definition of val to sets of nodes and if V1 ⊆ V , then val(V1 ) is the set defined by val(V1 ) = {val(v)|v ∈ V1 }. For any v ∈ V , if ele(v) is defined then the nodes in ele(v) are called subelements of v. For any l ∈ A, if att(v, l) = v1 then v1 is called an attribute of v. Note that an XML tree T must be a tree. Since T is a tree the ancestors of a node v, are denoted by Ancestor(v) and the the parent of a node v is denoted by P arent(v). We note that our definition of val definition differs slightly from that in [9] since we have extended the definition of the val function so that it is also defined on element nodes. The reason for this is that we want to include in our definition paths that do not end at leaf nodes, and when we do this we want to compare element nodes by node identity, i.e. node equality, but when we compare attribute or text nodes we want to compare them by their contents, i.e. value equality. This point will become clearer in the examples and definitions
that follow. We also note that there is a 1-1 mapping between XML documents and XML trees. Example 2. An XML tree is shown in Figure 3. In this example the element nodes are {vr , v1 , v2 , v4 , v5 , v7 , v8 , v9 , v10 , v15 }, the attribute nodes are {v3 , v6 , v14 , v16 } and the text nodes are {v11 , v12 , v13 }.
Fig. 3. An XML tree.
We now give some preliminary definitions related to paths. Definition 2. A path is an expression of the form l1 . · · · .ln , n ≥ 1, where li ∈ E ∪ A ∪ {S} for all i, 1 ≤ i ≤ n and l1 = root. If p is the path l1 . · · · .ln then Last(p) = ln . For instance, if E = {root, Division, Employee, Section, Emp#, Name} and A = {D#, S#} then root, root.Division, root.Division.D#, root.Division.Section.Employee.Emp#.S are all paths. We now assume the existence of a set of legal paths P for an XML application. Essentially, P defines the semantics of an XML application in the same way that a set of relational schema define the semantics of a relational application. P may be derived from the DTD, if one exists, or P be derived from some other source which understands the semantics of the application if no DTD exists. The advantage of assuming the existence of a set of paths, rather than a DTD, is that it allows for a greater degree of generality since having an XML tree conforming to a set of paths is much less restrictive than having it conform to a DTD. Firstly we place the following restriction on the set of paths.
Definition 3. A set P of paths is consistent if for any path p ∈ P , if p1 ⊂ p then p1 ∈ P . This is natural restriction on the set of paths and any set of paths that is generated from a DTD will be consistent. We now define the notion of an XML tree conforming to a set of paths P . Definition 4. Let P be a consistent set of paths and let T be an XML tree. Then T is said to conform to P if every path instance in T is a path instance over a path in P .
3
Defining XINDs
In this section we present our definition of general XINDs and illustrate the definition by some examples. Definition 5. Let P and Q be consistent set of paths. Then an XML inclusion dependency XIND is a statement of the form P [p1 , . . . , pn ] ⊆ Q[q1 , . . . , qn ] where {p1 , . . . , pn } ∈ P and Last(pi ) ∈ / A for (i = 1..n), and {q1 , . . . , qn } ∈ Q, Last(qi ) ∈ / A for (i = 1..n) and pi 6= pj if i 6= j and qi 6= qj if i 6= j. Definition 6. Let P and Q be consistent sets of paths and let TP and TQ be XML trees (not necessarily complete) that conform to P and Q, respectively. Then TP and TQ satisfy the XIND P [p1 , . . . , pn ] ⊆ Q[q1 , . . . , qn ], with p = p1 ∩ . . . ∩ pn and q = q1 ∩ . . . ∩ qn , if for all nodes v1 , . . . , vn in TP such that v1 ∈ NTP (p1 ), . . . , vn ∈ NTP (pn ) and v1 , . . . , vn are descendants of some node vp ∈ NTP (p), then there exists a node vq ∈ NTQ (q) and there exist nodes w1 , . . . , wn in TQ such that w1 , . . . , wn are all descendants of vq and val(v1 ) = val(w1 ), . . . , val(vn ) = val(wn ). Example 3. Consider the trees shown in Figure 4. The trees satisfy the XIND P[root.Enrol.Student.StudentId, root.Enrol.Student.Result.Course] ⊆ Q[root.Allocation.Student.StudentId, root.Allocation.Student.Alloc.Course]. We note however that although the treees satisfy the XINDs P[root.Enrol.Student.result.Course] ⊆ Q[root.Allocation.Student.Alloc.Course] and P[root.Enrol.Student.result.Term] ⊆ Q[root.Allocation.Student.Alloc.Term] it does not satisfy the XIND P[root.Enrol.Student.result.Course, root.Enrol.Student.result.Term] ⊆ Q[root.Allocation.Student.Alloc.Course, root.Allocation.Student.Alloc.Term].
Fig. 4. An XML tree illustrating the definition of an XIND.
4
Axiomatization of XINDs
In this section we provide a set of axioms for reasoning about the implication of XINDs and prove that the set is sound and complete. As a corollary, we also show that the implication problem is decidable. Axioms for XINDs: I1 (Reflexivity) ` P [p1 , . . . , pn ] ⊆ P [p1 , . . . pn ] I2 (Permutated Projection P [p1 , . . . , pn ] ⊆ Q[q1 , . . . qn ] ` P [pσ1 , . . . , pσm ] ⊆ Q[qσ1 , . . . qσm ], where σ1 . . . σm (m ≤ n) are members of {1, . . . , n} I3 (Transitivity) P [p1 , . . . , pn ] ⊆ Q[q1 , . . . qn ] and Q[q1 , . . . , qn ] ⊆ R[r1 , . . . rn ] ` P [p1 , . . . , pn ] ⊆ R[r1 , . . . rn ] I4 (Expansion) P [p1 , . . . , pm ] ⊆ Q[q1 , . . . qm ] and P [pm+1 . . . , pn ] ⊆ Q[qm+1 , . . . qn ], where q1 ∩ . . . ∩ qn = root ` P [p1 , . . . , pn ] ⊆ Q[q1 , . . . qn ]
I5 (Substitution) P [p1 , . . . , pn ] ⊆ Q[q10 , . . . qn0 ] and P [p1 , . . . , pn ] ⊆ Q[q100 , . . . qn00 ] ` P [p1 , . . . , pn ] ⊆ Q[q1 , . . . qn ], where for i = 1..n qi = qi0 or qi = qi00 , and q1 ∩ . . . ∩ qn = root. Note: I5 redundant (achieved by projection plus expansion). Theorem 1. Axioms I1 − I5 are sound and complete for XINDs without repeated paths. Corollary 1. The implication problem for XINDs without repeated paths is decidable.
5
XINDs in XML and INDs in relations
In this section we provide justification for our definition of XINDs. We show that for a very general class of mappings from relations to XML documents, a relational database satisfies an IND if and only if the corresponding XFD is strongly satisfied in the XML documents. Thus there is a close relationship between INDs in relations and XINDs in XML documents. Since our mapping from an incomplete document to an XML document uses nested relations as an intermediate step, we assume that the reader is familiar with the definition of the nested relational model, and in particular the nest operator νY (r∗ ), as given in standard texts such as [3, 16] or [19]. The translation of a relation into an XML tree consists of two phases. In the first phase we map the relation to a nested relation whose nesting structure is arbitrary and then we map the nested relation to an XML tree. In the first step we let the nested relation r∗ be defined by ri = νYi−1 (ri−1 ), r0 = ∗ r, r = rn , 1 ≤ i ≤ n where r represents the initial (flat) relation and r∗ represents the final nested relation. The Yi are allowed to be arbitrary. The nest operator as defined above is defined only for complete relations so we have to indicate how we handle nulls. Our approach is in the first step to consider the nulls to be marked, and hence distinguishable, and to treat these unmarked nulls as though they are data values. Thus the definition of the nest operator and r∗ remain unchanged. In the second step of the mapping procedure we take the nested relation and convert it to an XML tree as follows. We start with an initially empty tree. For each tuple t in r∗ we first create an element node of type Id and then for each A ∈ Z(N (r∗ )) we insert a single attribute node with a value t[A]. We then repeat recursively the procedure for each subtuple of t. The final step in the procedure is to compress the tree by removing all the nodes containing nulls from the tree. This now leads to the main result of this section which establishes the correspondence between satisfaction of FDs in relations and satisfaction of XFDs in XML. We denote by Tr∗ the XML tree derived from r∗ .
Theorem 2. A relational database satisfies the IND r1 [A1 , . . . , An ] ⊆ r2 [B1 , . . . , Bn ] iff the corresponding XML database satisfies T1 [pA1 , . . . , pAn ] ⊆ T2 [pB1 , . . . , pBn ], where pBi and qBi , i ∈ [1 . . . n] represent the paths to reach attribute Bi in T1 and T2 respectively.
6
Conclusions
In this paper we have addressed the issue of defining inclusion dependencies in XML (XINDs). We have generalized previous approaches [13, 12, 14] and shown how to define XINDs for the case of arbitrary paths on the l.h.s. and r.h.s. of the dependency, rather than for paths that all end in attributes of one element as is the case with previous approaches. We also show in some sense that our definition is ’correct’ by proving that for a wide class of mappings from a relational database to a set of XML documents, a relational database satisfies an inclusion dependency (IND) if and only if the corresponding set of XML documents satisfies the corresponding XIND. Apart from justifying our definition of an XIND, this last result also is important because it shows how constraints in relations map to constraints in XML. This is relevant to the recent recognition of the importance of knowing semantics of data, in addition to the data, in areas such as the development of the semantic Web [4].
References 1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of databases. Addison WAesley, 1995. 2. S. Abiteboul and V. Vianu. Regular path queries with constraints. In Proceedings of the SixteenthACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 122 – 133, 1997. 3. P. Atzeni and V. DeAntonellis. Foundations of databases. Benjamin Cummings, 1993. 4. T. Berners-Lee. Weaving the Web. London:Orion Business, 1999. 5. T. Bray, J. Paoli, and C.M. Sperberg-McQueen. Extensible markup language (xml) 1.0. Technical report, http://www.w3.org/Tr/1998/REC-xml-19980819, 1998. 6. P. Buneman, S. Davidson, W. Fan, and C. Hara. Keys for xml. In Proceedings of the 10th International World Wide Web Conference, pages 201 – 210, 2001. 7. P. Buneman, S. Davidson, W. Fan, and C. Hara. Reasoning about keys for xml. In International Workshop on Database Programming Languages, 2001. 8. P. Buneman, S. Davidson, W. Fan, C. Hara, and W. Tan. Keys for xml. Computer Networks, 39(5):473–487, 2002. 9. P. Buneman, S. Davidson, W. Fan, C. Hara, and W. Tan. Reasoning about keys for xml. Information Systems, 2003. 10. P. Buneman, W. Fan, and S. Weinstein. Path constraints on structured and semistructured data. In Proc. ACM PODS Conference, pages 129 – 138, 1998. 11. P. Buneman, W. Fan, and S. Weinstein. Interaction between type and path constraints. In Proc. ACM PODS Conference, pages 129 – 138, 1999. 12. W. Fan, G. M. Kuper, and J. Simeon. A unified constraint model for xml. Computer Networks, 39(5):489–505, 2002.
13. W. Fan and L. Libkin. On xml integrity constraints in the presence of dtds. Journal of the ACM, 49(3):368 – 406, 2002. 14. W. Fan and J. Simeon. Integrity constraints for xml. Journal of Computer and System Sciences, 66(1):254–291, 2003. 15. Wenfei Fan, Gabriel Kuper, and Jrme Simon. A unified constraint model for xml. In The 10th International World Wide Web Conference, pages 179–190, 2001. 16. M. Levene and G. Loizu. A guided tour of relational databases and beyond. Springer, 1999. 17. M. Levene and M. W. Vincent. Justification for inclusion dependency normal form. IEEE Transactions on Knowledge and Data Engineering, 12:281 –291, 2000. 18. H. Mannila and K. Raiha. The design of relational databases. Addison WAesley, 1992. 19. M. W. Vincent and M. Levene. Restructuring partitioned normal relations without information loss. SIAM Journal on Computing, 39(5):1550 – 1567, 2000. 20. M.W. Vincent and J. Liu. Strong functional dependencies and a redundancy free normal form for xml. Submitted to ACM Transactions on Database Systems, 2002. 21. M.W. Vincent and J. Liu. Multivalued dependencies and a 4nf for xml. In 15th International Conference on Advanced Information Systems Engineering (CAISE), pages 14–29, 2003. 22. M.W. Vincent and J. Liu. Multivalued dependencies in xml. In 20th British National Conference on Databases (BNCOD), pages 4–18, 2003. 23. M.W. Vincent, J. Liu, and C. Liu. Strong functional dependencies and a redundancy free normal form for xml. In 7th World Multi-Conference on Systemics, Cybernetics and Informatics, 2003.