Integrating XML and Relational Database Systems Gerti Kappel Institute of Software Technology and Interactive Systems, Business Informatics Group (BIG) Vienna University of Technology, Favoritenstr. 9 - 11 / 188, A-1040 Wien, AUSTRIA Tel. ++43/1/58801-18870 Fax ++43/1/58801-18896 {gerti}@big.tuwien.ac.at
Elisabeth Kapsammer, Werner Retschitzegger Institute of Applied Computer Science, Department of Information Systems (IFS) University of Linz, Altenbergerstraße 69, A-4040 Linz, Austria Tel. ++43/732/2468-8883 Fax ++43/732/2468-9511 {ek, wr}@ifs.uni-linz.ac.at
Abstract. Relational databases are increasingly employed to store the content of a web site. At the same time, XML is fast emerging as the dominant standard at the hypertext level of web site management, describing pages and the links between them. Thus, the integration of XML with relational database systems to enable the storage, retrieval, and update of XML documents is of major importance. Data model heterogeneity and schema heterogeneity, however, make this a challenging task. In this respect, the contribution of this paper is threefold. First, a comparison of the concepts available in XML schema specification languages and relational database systems is provided. Second, basic kinds of mappings between XML concepts and relational concepts are presented, and reasonable mappings in terms of mapping patterns are determined. Third, design alternatives for integrating XML and relational database systems are examined, and X-Ray, a generic approach for integrating XML with relational database systems, is proposed. Finally, an in-depth evaluation of related approaches illustrates the current state of the art with respect to the design goals of X-Ray.
Keywords. Relational Database System, XML, Heterogeneity, Meta Schema, Generic Integration
Running Head: Integrating XML and Relational Database Systems
1 Introduction
Web-based information systems no longer aim at purely providing read-only access to their content, represented simply as web pages stored in the web server's directory. Nowadays, not least due to new requirements emerging from application areas such as electronic commerce, employing databases (DB) to store the content of a web site turns out to be worthwhile [27], [50]. This makes it easy to handle both retrieval and update of large amounts of data in a consistent way on a large distributed scale [23]. Besides using databases at the content level, the Extensible Markup Language (XML) [69] is fast emerging as the dominant standard for representing the hypertext level of a web site, i.e., the logical composition of web pages and the navigation among them [1], [15], [58], [67]. Since there is no blending with layout aspects, multi-delivery is supported, meaning that one and the same hypertext can easily be rendered for, e.g., different devices [43]. Furthermore, XML has become the first choice for data exchange between different organizations. Because of the increasing importance of XML and database systems (DBS), their integration with respect to storage, retrieval, and update is a major need [16], [67].
Regarding the kind of data model used by the DBS as the basis for integration, one can distinguish three alternatives [10], [28]. First, special-purpose or native DBS are particularly tailored to store, retrieve, and update XML documents by using XML itself as the underlying data model. Examples are research prototypes such as Rufus [62], Lore [30], Strudel [27], and Natix [37] as well as commercial systems such as Tamino [57] and Infonyte [35]. Second, because of their rich object-oriented data modeling capabilities, object-oriented DBS such as Poet [49] as well as object-relational DBS such as DB2 [6], Oracle [32], and SQLServer [54] are well suited for storing XML documents.
Third, concerning the latter systems, there is also the possibility of using their relational data model for integration purposes. This alternative is especially motivated by the fact that a significant amount of data is currently stored in pre-existing relational databases¹ and will continue to be used by existing applications in the future [29]. There is an increasing demand to publish (parts of) these existing relational data as XML documents according to existing (standardized) XML schemata, specified either as a document type definition (DTD) [69] or in the more powerful XML Schema language [70]², or, vice versa, to store XML documents in existing databases.
¹ Note that, in the following, we use the term relational database system (RDBS) whenever the relational data model is used as the basis for integration with XML.
² Note that, in the following, we use a capital letter to denote the XML Schema standard and a small letter to denote any XML schema.
Concerning the integration with RDBS, there exist three basic alternatives. The most straightforward approach is to store each XML document as a whole within a single database attribute, using a simple CLOB attribute (cf., e.g., [22]) or a dedicated XML data type (cf., e.g., DB2 [6], Oracle [32], or the forthcoming SQL/XML standard [24]). Another possibility is to decompose XML documents in some way (“shredding”), e.g., into a graph structure, and to store them in appropriate database tables (cf., e.g., [28]). In both cases, the XML schema specification in the form of tags and attributes is stored together with the content of the XML document as database values. The advantage is that the DB schema is independent of the structure of the XML documents, so that XML documents of arbitrary structure can be stored. One disadvantage of the first alternative is that, concerning transaction management, it is no longer possible to lock only parts of a document for update purposes. A drawback of the second alternative is that a query, for example, has to reconstruct the schema (possibly involving several joins) before being able to access the “real” data needed, which makes query formulation cumbersome and decreases performance. Finally, neither approach allows integration with already existing relational schemata. To avoid these deficiencies, the third integration alternative maps the structure of XML documents to a corresponding relational schema in which the XML documents are stored according to the mapping (cf., e.g., [11], [22], [59]). Only this approach allows existing relational schemata to be reused, and it is therefore the one investigated further in this paper³.
One major challenge of this schema-to-schema mapping approach is the existence of data model heterogeneity and schema heterogeneity⁴. Data model heterogeneity refers to the fact that there are fundamental differences between the concepts provided by XML and those provided by RDBS, which have to be considered when defining a certain mapping. These differences concern, e.g., structuring, typing, and identification issues, relationships, default declarations, and the order of stored instances (cf. Section 2). Schema heterogeneity means that, even if the XML schema specification and the relational schema represent the same part of the universe of discourse, their designs may differ, due to different design goals or simply because they have been developed independently of each other without integration in mind. Consider, for example, business-to-business electronic commerce, where a supplier wants to store the product catalogue of another supplier, represented in XML, within an already existing relational database. In this scenario, the autonomy of both the XML schema specification and the relational schema should be preserved in that neither of them has to be changed.
This paper deals with these different forms of heterogeneity, bringing together previous work done in this area [39], [40], [41]. First, an in-depth analysis of data model heterogeneity is provided by comparing the concepts available in RDBS and XML schema specification languages, comprising XML DTD and XML Schema (cf. Section 2).
Second, basic kinds of mappings between XML concepts and relational concepts are presented, and reasonable mappings in terms of mapping patterns are determined to mediate between the different structuring mechanisms supported by XML and RDBS (cf. Section 3). Third, design alternatives for integrating XML and relational database systems are examined (cf. Section 4), and X-Ray, a generic approach for integrating XML with relational database systems, is proposed (cf. Section 5). An extensive evaluation of existing approaches closely related to X-Ray is presented in Section 6. Finally, Section 7 concludes the paper with a summary and gives an outlook on future work.
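The three alternatives for integrating XML with an RDBS described in the introduction can be contrasted in a few lines. The following is only a minimal sketch using Python's standard sqlite3 and xml.etree modules; the table names doc_clob, edge, and village, their columns, and the sample document are assumptions chosen for illustration and are not taken from X-Ray.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = "<village><name>Innsbruck</name><country>Tyrol</country></village>"

conn = sqlite3.connect(":memory:")

# Alternative 1: the whole document is stored in a single text (CLOB-like) attribute.
conn.execute("CREATE TABLE doc_clob (doc_id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO doc_clob VALUES (1, ?)", (DOC,))

# Alternative 2: generic shredding into an edge table; tag names become database values.
conn.execute(
    "CREATE TABLE edge (node_id INTEGER PRIMARY KEY, parent_id INTEGER, "
    "tag TEXT, value TEXT)"
)


def shred(element, parent_id=None):
    """Insert one row per element, depth first, and return the new row id."""
    text = (element.text or "").strip() or None
    cur = conn.execute(
        "INSERT INTO edge (parent_id, tag, value) VALUES (?, ?, ?)",
        (parent_id, element.tag, text),
    )
    node_id = cur.lastrowid
    for child in element:
        shred(child, node_id)
    return node_id


root = ET.fromstring(DOC)
shred(root)

# Alternative 3: the document structure is mapped to a domain-specific relation.
conn.execute("CREATE TABLE village (name TEXT PRIMARY KEY, country TEXT)")
conn.execute(
    "INSERT INTO village VALUES (?, ?)",
    (root.findtext("name"), root.findtext("country")),
)

# Reading the country: a self-join on the edge table versus a plain select.
edge_country = conn.execute(
    "SELECT c.value FROM edge p JOIN edge c ON c.parent_id = p.node_id "
    "WHERE p.tag = 'village' AND c.tag = 'country'"
).fetchone()[0]
mapped_country = conn.execute("SELECT country FROM village").fetchone()[0]
```

Even in this tiny case, the edge table needs a join to rebuild the parent/child structure that the mapped schema expresses directly, which is the query-formulation drawback noted for the shredding alternative.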
2 Comparison of Concepts – XML versus RDBS
This section is dedicated to an in-depth investigation of the similarities and differences between XML concepts and RDBS concepts at different levels of abstraction. It focuses on six aspects of the data models: structuring and typing mechanisms, uniqueness of names, null values and default values, identification, relationships, and order. Concerning XML, DTD concepts are used as a starting point; XML Schema concepts are considered as far as they differ from or go beyond those of DTDs. It has to be noted that we do not consider every XML Schema concept in full detail but rather give an overview of those concepts closely related to DTDs and RDBS. Concerning RDBS, we distinguish, as far as necessary, between concepts as defined by the original relational model and their realization by the SQL standard [5]. This section provides the basis for analyzing different mapping possibilities in Section 3. Further, it is the prerequisite for discussing design goals and design decisions in Section 4 and for developing a meta schema that bridges the heterogeneous concepts in Section 5.
2.1 Levels of Abstraction
For discussing and comparing the basic concepts of XML from a database point of view, it is important to keep in mind that they belong to different levels of abstraction, as illustrated in Fig. 1. These levels comprise the data model level (i.e., the concepts provided for defining the structure of data), the schema level (i.e., the utilization of data model concepts for structuring certain domain data), and the instance level (i.e., a concrete XML document or relational database being the instance of a certain schema). Regarding the data model concepts provided by RDBS and XML, there are fundamental differences leading to data model heterogeneity, which complicate the integration of both paradigms.
Heterogeneity with respect to data models is mainly due to the different purposes for which RDBS and XML have been developed. The aim of RDBS is to store large amounts of data, enabling efficient access and ensuring consistency [5]. In contrast, XML is intended to serve as a format for structuring and exchanging hypertext documents [1], [67]. Data model heterogeneity is discussed in the following subsections, focusing on typing mechanisms, null values and default values, identification, relationships, and order of instances.
³ It has to be noted that there are also hybrid approaches using, e.g., a mapping approach for the structured parts of a document and the CLOB approach for the unstructured parts (cf., e.g., [20]).
⁴ Other forms of heterogeneity that would also be relevant in this context, such as “semantic heterogeneity” [63], are not dealt with further in this paper.
Since, from a database perspective, many concepts supported by DTDs are insufficient for schema definition, e.g., the typing mechanisms, XML is often referred to as a data format only, lacking an appropriate data model. Consequently, there have been strong efforts to supplement DTDs by means of the richer XML schema specification language XML Schema [70]. In contrast to DTDs, XML Schema is expressed by means of XML itself. Extensions include a richer set of primitive data types as well as a mechanism to enable inheritance. Although using XML itself as the language for specifying schemata allows existing XML tools to be reused for schema validation, using XML Schema is far more complex than simply using DTD syntax.
Fig. 1. Concepts at Different Levels of Abstraction
Level | Relational Concepts | XML Concepts
Data Model Level | Relation, Attribute | Element Type, Attribute
Schema Level | Relational Schema: Relation A, Relation B, ...; Attribute X, Attribute Y, ... | DTD / XML Schema (optional): Element Type a, Element Type b, ...; Attribute x, Attribute y, ...
Instance Level | Relational Database: Tuple, Value | XML Document: Element, Element Value, Attribute, Attribute Value
Legend: edges denote “consists of” and “may consist of”
Analogous to heterogeneity at the data model level, there may also be heterogeneity at the schema level. This is very likely, since the schema of an XML document is allowed to be irregular, implicit, partial, incomplete, not always known ahead of time, and may change frequently and without notice, which demonstrates the close resemblance to semi-structured data [69]. If XML documents are based on an explicit schema specification, applications are able to validate the documents' structure against this specification by means of an XML parser [19]. A DTD as well as an XML Schema specification can be stored in a separate file referenced from within the XML documents by means of a URI (Uniform Resource Identifier) [69], [7]. In addition, a DTD can be included directly within an XML document. Since explicit schema specifications are optional, it is not clear at this time whether more XML documents will be governed by such schemata or whether more documents will exist without them [58]. These are fundamental differences to RDBS, where the existence of an a priori schema, stored directly within the database, is mandatory, and the validity of tuples with respect to this schema is checked by the system before they are inserted into the database.
Finally, at the instance level, we consider a certain XML document and a certain relational database, respectively. XML documents are self-describing, meaning that parts of the schema definition in terms of tags are replicated within each XML document, no matter whether the schema is defined explicitly or not. This is in contrast to RDBS, where the schema exists only once for the whole database. Storing the schema along with the data, as XML does, provides flexibility with respect to both integrating heterogeneous sources and changing the structure. However, the replicated schema information implies space cost for storage, time cost for retrieval, and the danger of inconsistencies in case of schema updates [19].
2.2 Structuring and Typing Mechanisms
The basic mechanisms used to specify the structure of XML documents and relational schemata are element types and attributes for XML and relations and attributes for RDBS, respectively (cf. Fig. 1). Concerning element types, it is useful for the further discussion to categorize them along two dimensions (cf. Table 1). The first dimension denotes whether the element type contains an atomic domain or not, whereas the second dimension denotes whether the element type contains a composite domain or not. This distinction results in four different kinds of element types⁵. It has to be emphasized that this classification is applicable to both DTDs and XML Schema. RDBS, however, do not allow domains to be specified for relations, but for attributes only. Let us consider each of these kinds of element types as well as XML and database attributes in more detail.
⁵ Note that the XML standard specification [69] does not provide any terminology for such a distinction.
Table 1. Kinds of Element Types
Kind of Element Type (ET) | Atomic Domain | Composite Domain
Atomic ET | yes | no
Composite ET with Element Content | no | yes
Composite ET with Mixed Content | yes | yes
Empty ET | no | no
Legend: yes ... contains, no ... does not contain
Element types that contain an atomic domain only are called atomic element types. Concerning DTD element types, there is only one predefined atomic domain, namely #PCDATA. The predefined atomic domains for DTD attributes comprise a string type called CDATA (which differs from #PCDATA in that tags occurring within #PCDATA values are interpreted by an XML parser), an enumeration type, and some special types including, e.g., ID and IDREF(S) (cf. Sections 2.5 and 2.6). In contrast to DTDs, XML Schema provides a large range of predefined atomic domains for both element types and attributes. These predefined atomic domains are comparable to those present in RDBS and include some special ones such as anyURI to represent URIs [7] and QName (qualified name) to specify a name that may have a namespace prefix. XML Schema allows atomic domains to be used as a basis for deriving user-defined atomic domains by specifying appropriate extensions or constraints, e.g., length restrictions or enumeration restrictions, which is similar to the object-oriented concept of sub-classing.
Besides atomic domains, element types may be associated with a composite domain; such element types are called composite element types in the following. A composite element type contains other element types, called component element types, which are used to build arbitrarily deep part-of hierarchies by means of nesting. For each XML document, it is required that all component element types are rooted in a single element type. This is in contrast to RDBS, where part-of hierarchies cannot be realized by means of nesting, since relations consist of atomic-valued attributes only. However, part-of hierarchies can be expressed in RDBS by means of foreign key constraints (cf. Section 2.6). Since composite element types may have an atomic domain in addition to a composite domain, they are further distinguished into composite element types with mixed content and composite element types with element content (cf. Table 1).
Concerning the latter, it has to be specified whether the component element types occur in a sequence or as a choice, meaning that they are mutually exclusive.
Fig. 2. Composite Element Type with Element Content (panels: DTD, XML Schema, and XML document with the element values Innsbruck, Tyrol, Hotel Post, Hotel Admiral, Hotel Anker)
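The four kinds of element types in Table 1 can be illustrated at the instance level. The sketch below is an assumption-laden approximation: the paper classifies element types (schema level), whereas the hypothetical helper classify only inspects a concrete element with Python's xml.etree module, so an atomic element holding an empty string is indistinguishable from an empty element here.

```python
import xml.etree.ElementTree as ET


def classify(element):
    """Classify an element by its content, following the categories of Table 1."""
    has_children = len(element) > 0
    # Character data may appear before the first child (text) or after any child (tail).
    texts = [element.text] + [child.tail for child in element]
    has_text = any(t and t.strip() for t in texts)
    if has_children and has_text:
        return "composite with mixed content"
    if has_children:
        return "composite with element content"
    if has_text:
        return "atomic"
    return "empty"


# Usage on small, made-up fragments.
kinds = [
    classify(ET.fromstring("<name>Innsbruck</name>")),
    classify(ET.fromstring("<village><name>Innsbruck</name></village>")),
    classify(ET.fromstring("<p>Visit <b>Tyrol</b> now</p>")),
    classify(ET.fromstring("<separator/>")),
]
```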
Considering the definition of composite element types, there is a significant difference between DTDs and XML Schema. In contrast to DTDs, XML Schema separates the definition of a composite element type from the declaration of its composite domain, which specifies the component element types. This separation allows a domain to be reused for different composite element types, i.e., they can share the same domain. Fig. 2 illustrates the definition of the composite element type village and the specification of the composite domain villageInfo. The keyword complexType denotes the declaration of this composite domain. Note that, when defining a schema specification using XML Schema, namespaces have to be used to distinguish between the elements and data types provided by XML Schema and the elements and data types defined by the particular schema specification. In this example, this is done by applying the prefix “acc” when using the complex domain villageInfo as the type of the element village.
Similar to atomic domains, composite domains can be derived from each other, which is not supported by DTDs and RDBS. Element types that have neither an atomic domain nor a composite domain are called empty element types. An element type can also be declared to have ANY content, meaning that there is no restriction concerning the kind of element type. In XML Schema, any is more powerful in that it can be restricted to element types of a certain namespace and can also be used for attributes (anyAttribute). Finally, each element type, no matter whether it is an atomic, composite, or empty element type, may contain XML attributes. Table 1 summarizes the different kinds of element types by denoting their characteristics (for examples, see Sections 3.1 and 3.2).
Concerning the instance level, XML documents contain elements, each of them marked by a start tag and an end tag carrying the name of a specific element type. An element may contain component elements, expressed by nested tags, as well as attributes. Both elements and attributes are allowed to contain values; we therefore distinguish between element values and attribute values. Attribute names and their values are placed within the start tag, whereas element values occur between start tag and end tag. Consequently, schema information of an explicit schema specification is replicated within XML documents in that each element and each attribute value is annotated with the corresponding element type name and attribute name, respectively. The instance level of an RDBS is considerably simpler, since values belong exclusively to attributes, which are in turn composed into tuples.
2.3 Uniqueness of Names
The name of a relation is required to be unique within the whole relational schema, similar to the name of an XML element type, which is unique throughout the DTD. By means of so-called namespaces [68], XML allows element types to have the same name, provided that different namespace prefixes are used.
Namespaces, however, are not considered further in this paper. XML Schema is more flexible in this respect, since the name of an XML element type has to be unique only within a so-called symbol space. A symbol space is, among others, associated with each composite domain defined by a user. Thus, the same name may appear in composite element types defined on the basis of different composite domains without conflict [70]. For example, two composite domains may both contain an element type named address, each with a different domain. The name of an XML attribute defined within a DTD or an XML Schema has to be unique within its element type, again similar to the name of an RDBS attribute, which has to be unique within its relation.
2.4 Null Values and Default Values
Similar to RDBS, XML allows null values⁶ as well as default values to be expressed. In RDBS, the concept of null values is defined for attributes only. XML, however, supports null values for both attributes and elements. In DTDs, default values may be applied to XML attributes only, whereas XML Schema supports default values for XML element types, too. Concerning XML attributes, the so-called default declaration within a DTD requires one of the following constraints to be specified for each attribute:
• #REQUIRED, meaning that a value is required in the sense of NOT NULL in RDBS.
• #IMPLIED, denoting the optional nature of an attribute value, expressed by the omission of NOT NULL in RDBS. Note that if no value is provided for such an XML attribute at the instance level, the attribute name is omitted from the XML document, too.
• #FIXED followed by a value, defining a constant value, which is not possible in RDBS.
• A value without a keyword, specifying a default value analogous to the DEFAULT clause in RDBS.
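The correspondence between DTD default declarations and SQL column constraints listed above can be sketched as a small translation function. This is an illustrative assumption rather than part of the paper's approach: the function column_constraint, the table accommodation, and its columns are hypothetical, and #FIXED is rejected because the text states that it has no relational counterpart.

```python
import sqlite3


def column_constraint(default_decl, value=None):
    """Translate a DTD attribute default declaration into an SQL column constraint."""
    if default_decl == "#REQUIRED":
        return "NOT NULL"
    if default_decl == "#IMPLIED":
        return ""  # optional attribute: NOT NULL is simply omitted
    if default_decl == "#FIXED":
        # Per the comparison above, a constant value has no direct counterpart.
        raise ValueError("#FIXED has no direct counterpart in this sketch")
    if default_decl is None and value is not None:
        # A bare value in the DTD corresponds to the DEFAULT clause.
        return "DEFAULT '" + value.replace("'", "''") + "'"
    raise ValueError("unknown default declaration")


columns = {
    "name": column_constraint("#REQUIRED"),
    "category": column_constraint("#IMPLIED"),
    "currency": column_constraint(None, "EUR"),
}
ddl = "CREATE TABLE accommodation (" + ", ".join(
    f"{col} TEXT {constraint}".strip() for col, constraint in columns.items()
) + ")"

conn = sqlite3.connect(":memory:")
conn.execute(ddl)
# Only the required attribute is supplied; the others fall back to NULL and the default.
conn.execute("INSERT INTO accommodation (name) VALUES ('Hotel Post')")
row = conn.execute("SELECT name, category, currency FROM accommodation").fetchone()
```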
In XML Schema, there is an additional constraint for attributes named prohibited, which allows an inherited attribute to be masked for the actual element type. These constraints are expressed by the attribute use (possible values: required, optional, which corresponds to #IMPLIED in DTDs, and prohibited) and the attributes default and fixed, which store a default value and a fixed value, respectively.
Whether an element may be omitted or not is specified within both DTDs and XML Schema by means of cardinality constraints. The cardinality specifies how often an element of a certain element type occurs as a component element of its composite element. Since an element type may be a component of more than one composite element type, each of its occurrences as a component element type can exhibit a different cardinality. The cardinality symbols for DTDs are ‘?’ (zero or one), ‘*’ (zero or more), ‘+’ (one or more), and no symbol (exactly one). In XML Schema, the cardinality can be specified in more detail by using the attributes minOccurs and maxOccurs (cf. Table 2).
⁶ Note that although different meanings of null values are discussed in the literature (cf., e.g., [18]), e.g., the value is inapplicable, exists but is missing, or even its existence is unknown, these differentiations can be expressed neither on the XML side using DTDs or XML Schema nor on the DB side using RDBS.
Table 2. Comparison of Concepts: Cardinality
Cardinality | UML | DTD | XML Schema minOccurs | XML Schema maxOccurs
zero or one | 0..1 | ? | 0 | 1 (default)
exactly one | 1 | default, no symbol | 1 (default) | 1 (default)
zero or more | 0..*, * | * | 0 | unbounded
one or more | 1..* | + | 1 (default) | unbounded
arbitrary cardinality, e.g., three to five | 3..5 | not supported | 3 | 5
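The correspondence in Table 2 can be written down as a small lookup. The names DTD_TO_OCCURS and occurs_to_dtd are hypothetical and serve only to make the table executable; the sketch returns None for ranges such as 3..5, which DTDs cannot express.

```python
UNBOUNDED = "unbounded"

# DTD occurrence indicator -> (minOccurs, maxOccurs), following Table 2.
DTD_TO_OCCURS = {
    "": (1, 1),           # no symbol: exactly one
    "?": (0, 1),          # zero or one
    "*": (0, UNBOUNDED),  # zero or more
    "+": (1, UNBOUNDED),  # one or more
}


def occurs_to_dtd(min_occurs=1, max_occurs=1):
    """Return the DTD symbol for an occurrence range, or None if DTDs cannot express it."""
    for symbol, bounds in DTD_TO_OCCURS.items():
        if bounds == (min_occurs, max_occurs):
            return symbol
    return None
```

The default arguments mirror the XML Schema defaults of 1 for both minOccurs and maxOccurs.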
It is emphasized that there is a semantic difference between a start tag directly followed by an end tag and a start tag and end tag omitted altogether from the XML document. The former corresponds to one of three different specifications within DTDs and XML Schema: (1) The element is specified as an empty element type. (2) The element is specified as an atomic element type whose value is an empty string. (3) The element is specified as a composite element type, but within the particular XML document, no component elements exist. In contrast, the omission of tags indicates a null value in the sense of RDBS. Using XML Schema, an alternative is to set the special attribute xsi:nil to the value true. Note that there is no corresponding mechanism for XML attributes. XML Schema also provides a boolean attribute nillable, indicating whether an element is allowed to have neither text content nor element content, despite having a specification requiring content. In such a case, the element must have an attribute xsi:nil with the value true.
2.5 Identification
In RDBS, tuples are uniquely identified by means of a primary key, which may be composed of one or more attributes of the corresponding relation (cf. Table 3). In DTDs, only a single attribute of an element type can be designated as the identifying attribute by means of the special attribute type ID, which may in turn contain a string value (cf. Fig. 3).
Table 3. Comparison of Concepts: Identification
 | RDBS | DTD | XML Schema
Concept | Primary key | ID attribute type | key (in addition to DTD concept)
Composite Key | Yes - one or more attributes of a relation | No - single attribute of an element type | Yes - one or more XML attributes or atomic element types
Scope of Identification | Unique identification of tuple within relation | Unique identification of element within document | Unique identification of element within a scope identified by an XPath expression
Optional Key | Yes | Yes | Yes
Equality & Identity | No distinction | No distinction | No distinction
In addition to the DTD concept ID, XML Schema allows not just attributes but also element types of an arbitrary atomic domain, and combinations thereof, to serve as keys. The scope of identification in RDBS is a single relation, i.e., the value of the primary key uniquely identifies each tuple within a relation. In DTDs, the scope of identification is broader in the sense that the value of an ID attribute is unique within the whole XML document. This allows an element to be uniquely identified not only with respect to other elements of the same element type but across all elements of any element type. XML Schema allows the scope of each key to be specified by means of an XPath [71] expression (cf. Fig. 3, attribute xpath of element selector). Another XPath expression denotes the element types and/or attributes serving as key (attribute xpath of element field).
Fig. 3. Exemplary Identification in XML (panels: DTD, XML Schema, and XML document)
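The difference in scope between ID attributes and primary keys can be made concrete. In the following sketch, which is an illustration under stated assumptions, the attribute name id is assumed to be declared with type ID, and the element types village and hotel as well as the helper duplicate_ids are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(xml_text, id_attribute="id"):
    """Return ID values that occur more than once anywhere in the document."""
    root = ET.fromstring(xml_text)
    values = [
        el.get(id_attribute) for el in root.iter() if el.get(id_attribute) is not None
    ]
    return sorted(v for v, n in Counter(values).items() if n > 1)


# Document scope: the value v1 clashes although it is used by different element types.
doc = '<region><village id="v1"/><village id="v2"/><hotel id="v1"/></region>'
clashes = duplicate_ids(doc)

# Relation scope: the same key value is accepted in two different relations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE village (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE hotel (id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO village VALUES ('v1')")
conn.execute("INSERT INTO hotel VALUES ('v1')")
hotel_rows = conn.execute("SELECT COUNT(*) FROM hotel").fetchone()[0]
```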
In DTDs and XML Schema, element types are not required to contain an ID attribute or a key, respectively. This is similar to RDBS products, where relations need not contain a primary key. Note that this is in contrast to the theory of the relational model, where primary keys are mandatory for each relation. Concerning DTDs, even if an element type has an attribute of type ID, its usage may be optional by defining it as #IMPLIED. In contrast, keys in XML Schema must always be non-nillable. Since the identification of both tuples in RDBS and elements in XML is value-based, it is not possible to distinguish between equality and identity, as is possible in the object-oriented data model [14] and in the XQuery 1.0/XPath 2.0 data model [73]. As can be seen, whereas keys in RDBS and in XML Schema are very similar, keys and attributes of type ID are rather heterogeneous concepts.
2.6 Relationships
In RDBS, relationships between relations can be expressed by means of foreign keys, i.e., arbitrary attributes that refer to the primary key of the same relation or of another relation. The number of tuples that may participate in a relationship can be constrained by defining the foreign key as NOT NULL and/or UNIQUE. With this, different cardinalities can be supported, as illustrated in Table 4. DTDs offer two alternative ways of specifying relationships between element types: IDREF(S) attributes and component element types (cf. Section 2.2). Attributes of type IDREF(S) represent a kind of foreign key referencing attributes of type ID. The distinction between IDREF attributes and IDREFS attributes concerns their cardinality, in that the former are single-valued and the latter are multi-valued. In contrast to RDBS, where the participating tuples are constrained to the participating relation, the elements participating via IDREF(S) cannot be constrained to be of a certain element type.
Table 4. Comparison of Concepts: Relationships
 | RDBS | DTD: IDREF(S) attribute | DTD: Component ET | XML Schema: keyref | XML Schema: Component ET
Concept | Foreign key | IDREF(S) attribute | Component ET | keyref (in addition to DTD concepts) | Component ET
Participants | Relations (tuples) | Element types (elements) | Element types (elements) | Element types (elements) | Element types (elements)
Typing of Participants | Tuples of a certain relation | It is not possible to constrain the type of the participating elements | The participating elements are restricted by the component ET | Elements of a certain element type | The participating elements are restricted by the component ET and derivations thereof
Cardinality | (0..1):*, 1:*, (0..1):(0..1), 1:(0..1) | *:(0..1), *:1, *:*, *:(1..*) | 1:1, 1:*, 1:(1..*), 1:(0..1) | Same as in RDBS | 1:arbitrary value
Legend: 0..1 ... zero or one, 1 ... exactly one, 1..* ... one or more, * ... zero or more
XML Schema supports the concept of a so-called keyref, which is similar to the RDBS concept of foreign keys, meaning that a certain element/attribute combination refers to the corresponding element/attribute combination forming the key. In contrast to DTDs, the participants of a relationship are typed by the element type containing the key. Regarding relationships that are realized by specifying component element types, the cardinality of component element types can be an arbitrary value, as already mentioned in Section 2.4. Further, the participating elements need not be component elements only, as required when using DTDs, but may also be elements derived from these component elements.
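How NOT NULL and UNIQUE on a foreign key yield the RDBS cardinalities of Table 4 can be tried out directly. The sketch below uses SQLite through Python's sqlite3 module; the relations village, accommodation, and mayor and the helper rejected are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE village (name TEXT PRIMARY KEY);
-- 1:* : every accommodation belongs to exactly one village (NOT NULL foreign key)
CREATE TABLE accommodation (
    name    TEXT PRIMARY KEY,
    village TEXT NOT NULL REFERENCES village(name)
);
-- (0..1):(0..1) : optional and UNIQUE foreign key
CREATE TABLE mayor (
    name    TEXT PRIMARY KEY,
    village TEXT UNIQUE REFERENCES village(name)
);
""")

conn.execute("INSERT INTO village VALUES ('Innsbruck')")
conn.execute("INSERT INTO accommodation VALUES ('Hotel Post', 'Innsbruck')")
conn.execute("INSERT INTO accommodation VALUES ('Hotel Anker', 'Innsbruck')")
conn.execute("INSERT INTO mayor VALUES ('A', 'Innsbruck')")


def rejected(sql):
    """Return True if the statement violates one of the declared constraints."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True
```

Dropping NOT NULL from accommodation.village would turn the relationship into (0..1):*, and adding NOT NULL to mayor.village would give 1:(0..1).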
2.7 Order

In contrast to relations and tuples in RDBS, the element types and elements of an XML document adhere to both an explicit and an implicit order. The order of element types can be explicitly defined within a DTD by using the sequence operator ‘,’, whereas XML Schema uses the element type sequence. The example shown in Fig. 4 specifies that an element of type village comprises the following three component element types in the specified order, i.e., name has to occur first, then country, and then accommodation.

[Fig. 4: a DTD declaring element type village and a conforming XML document; surviving text content of the document: Innsbruck Tyrol Hotel Post Hotel Admiral Hotel Anker]
Fig. 4. Explicit and Implicit Order
At the instance level, the order of concrete elements is defined implicitly by the position of elements within the XML document (cf. also Fig. 4). Note that this implicit order must not contradict the explicit order defined by the corresponding DTD. In our example, the order of the particular accommodation elements is given at the instance level by their position within an element of type village. It is important to be aware that elements occurring as component elements of different composite elements do not always have to exhibit the same order. For example, elements of type accommodation may show a different order as component elements of type village than as component elements of type owner. An implicit order concerns not only elements of the same element type but also elements of different element types, as depicted in Fig. 5.

[Fig. 5: a DTD and a conforming XML document; surviving text content of the document: Einkehr Hotel Post Hotel Admiral Verdi Diele Hotel Anker]
Fig. 5. Implicit Order Between Elements of Different Element Types

[Fig. 6: XML Schema fragment]
Fig. 6. Unordered Component Element Types
In addition to these concepts, XML Schema allows the explicit definition of an unordered occurrence of component element types by means of the element type all (cf. Fig. 6). Note that when all is used, the cardinality constraints are restricted: minOccurs may only take the values 0 and 1, and maxOccurs may only take the value 1.
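Because a relation is an unordered set of tuples, the implicit document order has to be made explicit when elements are stored. The following minimal sketch illustrates this under several assumptions: the markup of the document is a guess modeled on the description of Fig. 4, the ordering attribute Pos and the functions store_with_order and load_in_order are hypothetical, and SQLite stands in for the RDBS.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumed markup; only the element type names and the values stem from the text.
DOC = """<village>
  <name>Innsbruck</name>
  <country>Tyrol</country>
  <accommodation>Hotel Post</accommodation>
  <accommodation>Hotel Admiral</accommodation>
  <accommodation>Hotel Anker</accommodation>
</village>"""

def store_with_order(conn, village):
    """Store accommodation elements together with their document position."""
    conn.execute("CREATE TABLE Accommodation (Village TEXT, Name TEXT, Pos INTEGER)")
    village_name = village.findtext("name")
    # findall yields the children in document order; Pos records that order.
    for pos, acc in enumerate(village.findall("accommodation"), start=1):
        conn.execute("INSERT INTO Accommodation VALUES (?, ?, ?)",
                     (village_name, acc.text, pos))

def load_in_order(conn, village_name):
    """Retrieve the accommodation names in their original document order."""
    rows = conn.execute(
        "SELECT Name FROM Accommodation WHERE Village = ? ORDER BY Pos",
        (village_name,))
    return [row[0] for row in rows]
```

Without the ORDER BY clause on Pos, the RDBS would be free to return the tuples in any order, so the original document could not be reproduced reliably.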
3 Patterns for Mapping XML and RDBS

Having analyzed the differences between XML concepts and RDBS concepts, let us consider the possibilities for mapping a DTD to a relational schema. This section proposes some basic mapping possibilities and, on their basis, determines which kind of mapping is reasonable in a certain situation, thus yielding so-called mapping patterns (cf. footnote 7). These mapping patterns are universally applicable and have been used as a basis for designing a meta schema for representing mapping knowledge in our X-Ray approach (cf. Section 5).

7 Mapping patterns that also support the XML Schema standard are currently under development (cf. Section 7).

3.1 Basic Kinds of Mappings
A straightforward way would be to map each element type to a relation and each XML attribute to an attribute of the respective relation (cf. Fig. 7). Due to data model heterogeneity and schema heterogeneity, however, such a one-to-one mapping is neither always possible nor always desirable. For example, in the presence of deep element nesting, directly mapping elements to tuples of different relations would fragment the document excessively over various relations, thus decreasing performance.

[Fig. 7: an XML element of type accommodation with name “Hilton” and id “a1” next to a relation Accommodation whose first tuple holds the values Hilton, a1, and Vienna]
Fig. 7. Straightforward Mapping of XML Concepts to Relational Concepts
When considering the structuring mechanisms of XML and RDBS as discussed in Section 2.2, three basic kinds of mappings at the data model level may be distinguished (cf. Fig. 8):

(1) ET_R. An element type (ET) is mapped to a relation (R), hereafter called base relation. Note that several element types can be mapped to one base relation. An example of an ET_R mapping is the mapping of element type accommodation to relation Accommodation in Fig. 8.

(2) ET_A. An element type is mapped to a relational attribute (A), whereby the relation of the attribute represents the base relation of the element type. Note that several element types can be mapped to the attributes of one base relation. An example of an ET_A mapping is the mapping of element type name to attribute Name of relation Accommodation in Fig. 8.

(3) A_A. An XML attribute is mapped to a relational attribute whose relation represents the base relation of the XML attribute. Again, several XML attributes can be mapped to the attributes of one base relation. The mapping of XML attribute id to attribute AccID of relation Accommodation in Fig. 8 gives an example of an A_A mapping.

It has to be emphasized that each element type and each XML attribute can be mapped to a single base relation and a single relational attribute only. Another point is that ET_A and A_A mappings also determine the instance level, in that database values are mapped to XML values. Thus, it makes sense for ET_R mappings to occur together with ET_A and A_A mappings. Furthermore, it is not mandatory that all element types and attributes of a DTD, or all relations and attributes of a relational schema, have a mapping. An example on the relational side is a foreign key that serves to establish a relationship but is not relevant within the XML document and therefore requires no mapping. An example on the XML side is an empty element type that occurs exactly once at a certain position within the XML document and does not require any mapping either.
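[Fig. 8: mapping possibilities between XML concepts and RDBS concepts. An ElementType maps to a Relation (ET_R) or to a relational Attribute (ET_A); an XML Attribute maps to a relational Attribute (A_A). Example: element type accommodation maps to relation Accommodation (ET_R), element type name maps to attribute Name (ET_A), and XML attribute id maps to attribute AccID (A_A).]
Fig. 8. Basic Kinds of Mappings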
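The three basic kinds of mappings can be written down as plain lookup tables. The sketch below is only an illustration under assumptions: the dictionaries ET_R, ET_A, and A_A and the function decompose are hypothetical names, only the example of Fig. 8 is covered, and the code is not taken from X-Ray.

```python
import xml.etree.ElementTree as ET

# ET_R: element type -> base relation
ET_R = {"accommodation": "Accommodation"}
# ET_A: element type -> (base relation, relational attribute)
ET_A = {"name": ("Accommodation", "Name")}
# A_A: (element type, XML attribute) -> (base relation, relational attribute)
A_A = {("accommodation", "id"): ("Accommodation", "AccID")}

def decompose(elem):
    """Turn one element mapped via ET_R into (relation, tuple as dict)."""
    relation = ET_R[elem.tag]
    row = {}
    # A_A mappings: XML attribute values become relational attribute values.
    for (etype, attr), (rel, col) in A_A.items():
        if etype == elem.tag and rel == relation and attr in elem.attrib:
            row[col] = elem.attrib[attr]
    # ET_A mappings: the content of component elements becomes attribute values.
    for child in elem:
        target = ET_A.get(child.tag)
        if target is not None and target[0] == relation:
            row[target[1]] = child.text
    return relation, row
```

For the element of Fig. 8, decompose yields one tuple of relation Accommodation. Component elements without an ET_A entry are simply skipped, which corresponds to the omission of mappings discussed above.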
The examples demonstrate that omitting mappings is conceivable not only if the DTD and the relational schema have been developed independently of each other, but also if one has been derived from the other. However, if the cardinality of a relationship or an element type, respectively, or the default declaration of an attribute requires the existence of a corresponding instance, a proper mapping is mandatory.

The three basic kinds of mappings introduced above can be further refined with respect to the determination of an element type’s base relation. First, if an element type is to be mapped, one has to consider the first of its ancestor element types that is mapped to a relation or an attribute, thus having a base relation. This base relation constitutes the parent base relation of the element type to be mapped and is a candidate for being its base relation, too. If none of its ancestor element types is mapped, an arbitrary relation can be chosen as base relation. In the example of Fig. 9 (cf. also the more comprehensive example given in Fig. 10), the element types address, street, and country all have the same parent base relation, namely Accommodation, which is the base relation of the ancestor element type accommodation. Note that, aiming at an intuitive presentation, Fig. 9 depicts mappings between XML element types and relations in terms of a UML class diagram [52]. To distinguish between element types and relations, they are depicted as instances of the corresponding ‘meta class’ ElementType and Relation, respectively.

[Fig. 9: UML class diagram of a DTD fragment and a relational schema. The root element type accommodations has no mapping and consequently no base relation. Its component element type accommodation is the first element type being mapped; it has base relation Accommodation and consequently no parent base relation. Its component element type address has no mapping and consequently no base relation. The component element types street and country have parent base relation Accommodation; street is mapped directly, whereas country is mapped indirectly with base relation Country. The relational schema comprises the relations Accommodation, Village, and Country.]
Fig. 9. Exemplary Mappings
Second, if an XML attribute is to be mapped, its element type has to be considered first. If the attribute’s element type is not mapped, its ancestor element types have to be considered in the same way as for element types. Again, the relation to which the first of these ancestor element types is mapped represents the parent base relation of the XML attribute, thus being a candidate for being its base relation, too.

The parent base relation also constitutes the base relation if the XML element type or the attribute, respectively, can be mapped to this relation or to one of its attributes, which is hereafter called direct mapping. For an example, consider the element type street in Fig. 9, which is directly mapped to an attribute of its parent base relation Accommodation. Otherwise, a proper base relation may be one of those relations that are reachable from the parent base relation via foreign key relationships, which is hereafter called indirect mapping. For an example, consider the element type country, which is indirectly mapped to an attribute of relation Country, reachable from its parent base relation Accommodation. Indirect mapping is reasonable if the relational attribute that is to be the mapping target has been factored out of the parent base relation, e.g., due to normalization or vertical partitioning. Note that element type address is used to group address data and thus has no relational counterpart and no base relation at all.

Both direct and indirect mapping are applicable to the three basic mapping possibilities introduced above, thus resulting in ET_Rdirect/indirect, ET_Adirect/indirect, and A_Adirect/indirect mappings. Furthermore, the possibility of a direct mapping always implies the possibility of an indirect mapping due to vertical partitioning. This differentiation is also made in [58], where it is proposed to inline as many sub-elements as possible to reduce fragmentation (direct mapping) and to keep multi-valued elements and elements involved in recursive associations in separate tables (indirect mapping).

3.2 Reasonable Mappings
After introducing the basic kinds of mappings, this section discusses reasonable mappings. Reasonable mappings may serve as mapping patterns when mapping XML concepts to RDBS concepts. If one tries to map two existing schemata to each other, mapping patterns can facilitate this mapping process at a syntactical level by analyzing the structure of both schemata, proposing potential mappings, and preventing others because of syntactical conflicts. This focus differs from approaches supporting default mapping rules to derive one schema from the other (cf., e.g., [9], [32], [54], [58]). The determining factors can be categorized into characteristics of the XML element type (cf. Section 3.2.1) and characteristics of the XML attribute (cf. Section 3.2.2).

To illustrate the subsequent investigations, Fig. 10 provides a comprehensive running example building on the ones given in the previous section. The example is intended to show as many mapping possibilities as possible (cf. footnote 8). Fig. 10 shows the running example in terms of a DTD and in terms of a relational schema. The latter is depicted both as a table structure and as a UML class diagram, which better visualizes relationships. Concerning the relational schema, primary keys are formatted in bold face and underlined, and foreign keys are set in italics.
[Fig. 10, DTD: the element declarations are not recoverable; the surviving attribute declarations use the types CDATA and (hotel | motel) and the default declarations #REQUIRED, #IMPLIED, #FIXED “Austria”, and “hotel”.]

[Fig. 10, relational schema, shown as a table structure and as a UML class diagram:]
Accommodation (AccID, Name, Kind, Street, VillageName, AcceptsCreditCard, Sauna)
Village (Name, PostalCode, CountryID)
Country (CountryID, Name)
History (VillageName, YearFound)
Phone (AccID, Number)
EmailAddress (AccID, Email)
Pool (AccID, Name)
ActualRating (AccID, RatingID, RatingOrder)
PossibleRating (RatingID, Rating)
RatingDescription (AccID, RatingOrder, Description)

Fig. 10. Exemplary DTD, Relational Schema, and UML Class Diagram
Even this small example shows that data model heterogeneity and schema heterogeneity prevent a simple one-to-one mapping. Regarding the DTD illustrated in Fig. 10, there is a single root element type accommodations having no relational counterpart. Its component element type accommodation contains various element types, which have either relational attributes or relations as counterparts. The different cardinalities specified for each of these element types correspond to those defined on the relational side. Regarding the composite element type address and its atomic component element types street, village, and country, it can be seen that the relational schema does not contain a relation Address with attributes Street, Village, and Country. Moreover, there is no counterpart for address in the relational schema at all, and its component element types correspond to attributes located in three different relations connected by ‘*:1’ relationships, namely attribute Street of relation Accommodation, attribute Name of relation Village, and attribute Name of relation Country. Having three relations instead of one is the consequence of the normalization process.

The element type accommodation as well as some of its component element types contain attributes. One of these attributes, namely state, has the fixed value ‘Austria’ and therefore lacks a relational counterpart. The attribute kind is restricted to an enumeration of two values. The composite element type description has mixed content comprising the atomic element type rating, meaning that elements of this type may occur several times, mixed with atomic data in any order, within an XML document. Note that the attributes RatingOrder of the two relations ActualRating and RatingDescription are not mapped to any XML concept. They express an absolute order over both rating descriptions and actual ratings with respect to a certain accommodation. This is not necessary on the XML side, since the order is implicitly defined by the position of the elements within the XML document.

8 Note that, to reach this goal, we had to carefully design the schemata with respect to each other, although the focus of our approach is on schemata designed relatively independently of each other.
3.2.1 Element Type Characteristics

As already mentioned, choosing a certain mapping is based on characteristics of the element type to be mapped. As illustrated in Fig. 11, these decisive characteristics can be categorized into three orthogonal dimensions comprising the kind of element type, whether it contains attributes, and its cardinality. Note that if the element type has been declared to have ANY content (cf. Section 2.2), a reasonable mapping cannot be determined in advance. Depending on the combination of these characteristics, certain reasonable mappings can be determined as shown in Table 5. In the following, these mappings are discussed by means of the running example.

First, we consider composite element types with element content. Mapping this kind of element type is influenced neither by its cardinality nor by whether it contains any attributes. Since there are no values associated with elements of this type, the only reasonable mapping possibility is ET_R. Depending on whether the element type can be mapped to its parent base relation or not, an ET_Rdirect or an ET_Rindirect mapping can be used. In fact, the lack of any mapping would not result in a loss of information, since elements of this type contain no values which could be stored in the database.

[Fig. 11: three orthogonal dimensions. Kind of element type: composite ET with element content, composite ET with mixed content, atomic ET, empty ET. Contains attributes: yes, no. Cardinality: ?, 1, *, +.]
Fig. 11. Orthogonal Dimensions Characterizing XML Element Types
Concerning our running example, whereas the root element type accommodations does not require any mapping, the element type accommodation is mapped to the relation Accommodation (ET_R mapping). Since accommodation does not have a parent base relation, we do not distinguish between a direct and an indirect mapping in this case.

Next, let us consider the mapping of an atomic element type. The reasonable mappings of such element types depend on the cardinality only and are not influenced by the existence of XML attributes. Since atomic element types contain values, they always require a mapping to relational attributes, i.e., an ET_A mapping. In case of cardinality ‘?’ and ‘1’, an ET_Adirect mapping is possible, since no more than one element may occur. However, an ET_Aindirect mapping may also be necessary when the relational attribute to which the atomic element type is to be mapped is not part of the parent base relation. In case of cardinality ‘*’ and ‘+’, an ET_Aindirect mapping is required due to normalization.

Concerning our running example, the simplest case is represented by element type name, which has cardinality ‘1’ and is mapped to attribute Name of base relation Accommodation, representing an ET_Adirect mapping. Element type accommodation, the direct ancestor element type of name, is mapped to Accommodation, i.e., the base relation and the parent base relation are the same. This kind of mapping also applies to element type street. In this case, the parent element type address has no mapping, and the ancestor element type accommodation is mapped to the relation that contains the relational counterpart Street. The element types village and country require ET_Aindirect mappings, since their relational counterparts are stored in base relations different from the parent base relation Accommodation due to normalization. The relational counterparts are attribute Name of base relation Village and attribute Name of base relation Country, respectively. This kind of mapping is possible, since Accommodation and Village, as well as Village and Country, are directly connected via foreign key relationships. Element type email has cardinality ‘*’, requiring an ET_Aindirect mapping, and is therefore mapped to attribute Email of relation EmailAddress. The same holds true for element type rating, with the difference that the parent base relation Accommodation and the base relation PossibleRating containing the attribute Rating are indirectly connected via the relation ActualRating, thereby explicitly demonstrating schema heterogeneity. Another example of schema heterogeneity is given by the empty element type pool, which is mapped to the relational attribute Name of relation Pool storing the names of pools.
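The difference between direct and indirect mapping becomes visible when an accommodation element is composed from the database. The following minimal sketch assumes the relations Accommodation, Village, and Country of Fig. 10 (reduced to the attributes needed here), SQLite as the RDBS, and invented sample values for street, postal code, and country key; the functions build_db and compose_accommodation are hypothetical and not part of X-Ray.

```python
import sqlite3
import xml.etree.ElementTree as ET

def build_db():
    """Create a reduced version of the running example with one accommodation."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Country (CountryID INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE Village (Name TEXT PRIMARY KEY, PostalCode TEXT,
                              CountryID INTEGER REFERENCES Country(CountryID));
        CREATE TABLE Accommodation (AccID TEXT PRIMARY KEY, Name TEXT, Street TEXT,
                                    VillageName TEXT REFERENCES Village(Name));
        -- Sample values are invented for illustration.
        INSERT INTO Country VALUES (1, 'Austria');
        INSERT INTO Village VALUES ('Vienna', '1010', 1);
        INSERT INTO Accommodation VALUES ('a1', 'Hilton', 'Example Street 1', 'Vienna');
    """)
    return conn

def compose_accommodation(conn, acc_id):
    """Compose an accommodation element from its base relation and joined relations."""
    row = conn.execute("""
        SELECT a.AccID, a.Name, a.Street, v.Name, c.Name
        FROM Accommodation a
        JOIN Village v ON v.Name = a.VillageName      -- indirect: one foreign key away
        JOIN Country c ON c.CountryID = v.CountryID   -- indirect: two foreign keys away
        WHERE a.AccID = ?""", (acc_id,)).fetchone()
    acc = ET.Element("accommodation", id=row[0])       # A_A direct
    ET.SubElement(acc, "name").text = row[1]           # ET_A direct
    address = ET.SubElement(acc, "address")            # grouping element, no mapping
    ET.SubElement(address, "street").text = row[2]     # ET_A direct
    ET.SubElement(address, "village").text = row[3]    # ET_A indirect
    ET.SubElement(address, "country").text = row[4]    # ET_A indirect
    return acc
```

Directly mapped element types are read from the parent base relation itself, whereas each indirectly mapped element type requires following one or more foreign key relationships, here expressed as joins.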
Table 5. Reasonable Mappings of XML Element Types

| Kind of Element Type | Contains Attributes | Cardinality | Reasonable Mapping |
|---|---|---|---|
| Composite ET with element content | No influence | No influence | ET_Rdirect/indirect; No mapping |
| Atomic ET | No influence | ?, 1 | ET_Adirect/indirect |
| Atomic ET | No influence | +, * | ET_Aindirect |
| Empty ET | No | 1 | No mapping |
| Empty ET | Yes | 1 | ET_Rdirect/indirect; No mapping |
| Empty ET | No influence | ? | ET_Adirect/indirect |
| Empty ET | No influence | *, + | ET_Aindirect |
| Composite ET with mixed content | No influence | No influence | ET_Aindirect |
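Table 5 can be read as a decision function over the three dimensions of Fig. 11. The sketch below is one possible encoding; the function name, the string labels for the kinds of element type, and the returned labels are assumptions chosen for illustration.

```python
def reasonable_element_mappings(kind, has_attributes, cardinality):
    """Return the reasonable mappings of Table 5 for one element type.

    kind: 'composite-element', 'composite-mixed', 'atomic', or 'empty'
    has_attributes: whether the element type declares XML attributes
    cardinality: one of '?', '1', '*', '+'
    """
    if cardinality not in {"?", "1", "*", "+"}:
        raise ValueError(f"unknown cardinality: {cardinality!r}")
    multiple = cardinality in {"*", "+"}

    if kind == "composite-element":
        # No values of its own: ET_R or no mapping at all.
        return ["ET_R direct", "ET_R indirect", "no mapping"]
    if kind == "composite-mixed":
        # Several values may occur within a single element.
        return ["ET_A indirect"]
    if kind == "atomic":
        return ["ET_A indirect"] if multiple else ["ET_A direct", "ET_A indirect"]
    if kind == "empty":
        if cardinality == "1":
            # Only here does the existence of attributes matter.
            if has_attributes:
                return ["ET_R direct", "ET_R indirect", "no mapping"]
            return ["no mapping"]
        return ["ET_A indirect"] if multiple else ["ET_A direct", "ET_A indirect"]
    raise ValueError(f"unknown kind of element type: {kind!r}")
```

The function does not cover element types declared with ANY content, for which a reasonable mapping cannot be determined in advance, nor the additional restriction imposed by unmapped ancestor element types with cardinality ‘*’.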
Regarding empty element types with a cardinality of ‘1’, no mapping is required, no matter whether there are attributes or not, since a corresponding element occurs exactly once without carrying any value. However, if there are attributes, it makes sense to employ a direct or indirect ET_R mapping, since the resulting base relation can serve as the base relation for the attributes. In case of any other cardinality, the existence of attributes does not influence the reasonable mappings, and an ET_A mapping is required in any case. It depends on the particular cardinality whether a direct or an indirect mapping is reasonable. Referring to our example, the empty element types facilities, without attributes, and sauna, including a single attribute, represent the simplest case: both have a cardinality of one and thus require no mapping. The attribute available of element type sauna is mapped to the relational attribute Sauna of the parent base relation of the element type sauna, namely Accommodation. The optional empty element type acceptsCreditCard contains no attributes and is mapped directly to the relational attribute AcceptsCreditCard of its parent base relation Accommodation. Finally, the empty element types phone and pool, having a cardinality of ‘+’ and ‘*’, respectively, are mapped via ET_Aindirect to the relational attribute Number of the relation Phone and the relational attribute Name of the relation Pool, respectively.

Considering composite element types with mixed content, neither the existence of attributes nor the cardinality has any influence on the reasonable mappings. Since at the instance level several values may occur within a single element, an ET_Aindirect mapping is required. Our example contains one composite element type with mixed content, namely description, which is mapped to the attribute Description of the relation RatingDescription. The attributes RatingOrder of the two relations ActualRating and RatingDescription are, as already mentioned, not mapped to any XML concept, since they express an absolute order over both rating descriptions and actual ratings with respect to a certain accommodation.

It has to be noted that if one or several ancestor element types are not mapped and any of these ancestor element types has a cardinality of ‘*’, the next component element type being mapped can be mapped indirectly only.

3.2.2 XML Attribute Characteristics

The mapping of XML attributes depends on two orthogonal dimensions comprising the multiplicity of the XML attribute, i.e., whether it is single-valued or multi-valued, and the default declaration (cf. Fig. 12). Considering the different combinations of these two dimensions, three reasonable mapping alternatives may be identified, as shown in Table 6.

[Fig. 12: two orthogonal dimensions. Default declaration: fixed, default, implied, required. Multiplicity of attribute: single-valued, multi-valued.]
Fig. 12. Orthogonal Dimensions Characterizing XML Attributes
For XML attributes whose default declaration is #FIXED, no mapping is necessary, independent of the multiplicity of the XML attribute. In our example, the XML attribute state of the element type accommodation has the constant value Austria. Regarding XML attributes which are not specified to be #FIXED, it has to be distinguished whether they are single-valued, like attributes of type CDATA, or multi-valued, like attributes of type IDREFS. Single-valued attributes can be directly mapped to relational attributes (A_Adirect) or may require indirect mapping due to normalization (A_Aindirect), whereas multi-valued attributes may be mapped indirectly (A_Aindirect) only. Considering attributes of type ID and IDREF(S), it seems conceivable to map them to primary key attributes and foreign key attributes, respectively, of the relational schema. Due to data model heterogeneity, however, this is not always feasible, since there are differences concerning scope and composite keys (cf. Section 2.5 and Section 2.6).
Table 6. Reasonable Mappings of XML Attributes

| Multiplicity of XML Attribute | Default Declaration | Reasonable Mapping |
|---|---|---|
| No influence | #FIXED | No mapping |
| Single-valued | #REQUIRED, #IMPLIED, Default Value | A_Adirect/indirect |
| Multi-valued | #REQUIRED, #IMPLIED, Default Value | A_Aindirect |
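Table 6 can be encoded in the same style. In the sketch below, the function name and the labels are assumptions; in addition to IDREFS, which the text names explicitly, the set of multi-valued types includes NMTOKENS and ENTITIES, on the assumption that the other list-valued attribute types of XML 1.0 behave alike.

```python
# List-valued DTD attribute types; only IDREFS is named in the text above.
MULTI_VALUED_TYPES = {"IDREFS", "NMTOKENS", "ENTITIES"}

def reasonable_attribute_mappings(attr_type, default_declaration):
    """Return the reasonable mappings of Table 6 for one XML attribute.

    attr_type: DTD attribute type, e.g. 'CDATA', 'ID', 'IDREF', 'IDREFS'
    default_declaration: '#FIXED', '#REQUIRED', '#IMPLIED', or 'default value'
    """
    if default_declaration == "#FIXED":
        # A constant value need not be stored in the database.
        return ["no mapping"]
    if attr_type in MULTI_VALUED_TYPES:
        return ["A_A indirect"]
    return ["A_A direct", "A_A indirect"]
```

As with element types, the function ignores the exception that an unmapped ancestor element type with cardinality ‘*’ restricts the attribute to an A_Aindirect mapping.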
In our example, directly mapped single-valued attributes comprise id and kind of element type accommodation, number of element type phone, and available of element type sauna. Single-valued attributes which have to be mapped indirectly are postalCode of element type address and yearOfFoundation of element type village. Multi-valued attributes are not part of our example. It has to be emphasized that, with one exception, the reasonable mappings of an attribute are independent of the kind of mapping of its element type: if the element type of the attribute is not mapped and any of its unmapped ancestor element types has a cardinality of ‘*’, the attribute can be mapped via A_Aindirect only.
4 Design Alternatives for Integrating XML and RDBS

This section discusses design alternatives for integrating XML and RDBS, together with the design goals and the corresponding design decisions taken for our system X-Ray (cf. Section 5). As already mentioned, our focus is on integrating already existing schemata representing the universe of discourse. Therefore, design alternatives that store XML documents within a single attribute or decompose them according to their graph structure are not considered further. The design alternatives considered also constitute the basis for the discussion of related work in Section 6 and can be categorized into three dimensions, comprising the schemata to be integrated, the mapping between these schemata, and the access to the stored data (cf. Fig. 13).

Kind of Schemata. Regarding the first dimension, one has to consider the kind of schemata at both the XML-side and the DB-side, which leads to a derived approach and a user-defined approach (cf. Fig. 14). The derived approach requires that either the DB schema is derived from the XML schema according to certain pre-defined rules or vice versa. Concerning the derivation process, one can distinguish different degrees of automation, depending on whether derivation rules can be configured manually or not. The user-defined approach allows the DB schema to be developed independently of the XML schema and vice versa. This is typical of an electronic commerce scenario in which part of the content stored within the database is to be transferred to business partners according to a standardized schema, or in which XML documents received are to be stored within an existing DB. The mapping between the schemata is not derived on the basis of pre-defined rules but rather defined by a user, possibly with appropriate support by the system. Since a major design goal of X-Ray is to support existing schemata rather than to (semi-)automatically derive schemata from each other, the only feasible design decision is to adhere to the user-defined approach. To cope with the various heterogeneity problems arising when mapping two existing schemata, we provide mapping patterns for resolving data model and schema heterogeneities.
[Fig. 13: design alternatives along three dimensions, related to the design goals mapping transparency / ease of maintenance, schema autonomy, multiple schemata, existing schemata, unified approach, access homogeneity, and DB schema transparency.
Schemata: kind of schemata (derived approach, user-defined approach).
Mapping: representation of mapping knowledge (hard-coded, reified in file, reified in DB); coupling with schemata (tight, loose); mapping cardinality (multiple at DB-side, multiple at XML-side).
Access: access capability (storage, publishing); access language (DB-centric, XML-centric, others); access target (DB schema, XML schema, mapping knowledge).]
Fig. 13. Design Alternatives
Representation of Mapping Knowledge. To perform the necessary transformations when inserting or retrieving XML documents or parts thereof, appropriate mapping knowledge has to be managed by the system in one way or another. Regarding its representation, one can distinguish between hard-coding mapping knowledge within applications or within a query, respectively, and reifying the mapping knowledge within a file or within a DB. Hard-coding mapping knowledge can be done either at runtime, meaning that the user issuing the request must know the two schemata and the mapping in between, or already at initialization time, thereby ensuring mapping transparency for subsequent access. Hard-coding is a feasible solution in case of the above-mentioned derived approach, since the mapping knowledge required remains the same independent of changes to the initial schema. If changes to the mapping knowledge are likely, hard-coding is very inflexible, since such changes would require implementation effort and possibly a recompilation of the system. According to [25], hard-coding mapping knowledge within a query may in addition result in very large and complex queries, decreasing flexibility and maintainability. In contrast, reification of mapping knowledge facilitates retrieval and maintenance, especially if a DB is used for storage instead of plain files. Since an important design goal of X-Ray is to realize mapping transparency while enhancing maintainability, X-Ray stores mapping knowledge within a meta schema managed by an RDBS.

| Schema at DB-side | Schema at XML-side: Derived | Schema at XML-side: User-defined |
|---|---|---|
| Derived | not applicable | derived approach |
| User-defined | derived approach | user-defined approach |

Fig. 14. Kind of Schemata
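To make the idea of reifying mapping knowledge within a DB more concrete, the following sketch stores mappings in a single relation that is kept apart from the domain schema. It is a deliberately simplified assumption and not the actual X-Ray meta schema: the relation Mapping, its attributes, and the functions create_meta_schema, add_mapping, and lookup are invented for illustration, and SQLite stands in for the RDBS.

```python
import sqlite3

def create_meta_schema(conn):
    """Create a hypothetical relation holding reified mapping knowledge."""
    conn.executescript("""
        CREATE TABLE Mapping (
            XmlKind   TEXT NOT NULL CHECK (XmlKind IN ('ET', 'A')),
            XmlName   TEXT NOT NULL,      -- element type or 'elementType/@attribute'
            Kind      TEXT NOT NULL CHECK (Kind IN ('ET_R', 'ET_A', 'A_A')),
            Relation  TEXT NOT NULL,      -- base relation
            Attribute TEXT,               -- NULL for ET_R mappings
            -- Each XML concept has at most one base relation and attribute.
            PRIMARY KEY (XmlKind, XmlName)
        );
    """)

def add_mapping(conn, xml_kind, xml_name, kind, relation, attribute=None):
    """Record one mapping between an XML concept and a relational concept."""
    conn.execute("INSERT INTO Mapping VALUES (?, ?, ?, ?, ?)",
                 (xml_kind, xml_name, kind, relation, attribute))

def lookup(conn, xml_kind, xml_name):
    """Return (kind, relation, attribute), or None if the concept is unmapped."""
    return conn.execute(
        "SELECT Kind, Relation, Attribute FROM Mapping "
        "WHERE XmlKind = ? AND XmlName = ?", (xml_kind, xml_name)).fetchone()
```

Changing a mapping then amounts to updating a tuple rather than modifying application code or queries, and neither the DTD nor the domain schema has to be altered.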
Coupling with Schemata. Mapping knowledge can be coupled with the schemata involved in the integration either tightly or loosely. Tight coupling means that mapping knowledge is intermingled with either the XML schema or the DB schema, whereas loose coupling allows mapping knowledge to be stored separately. The main drawback of tight coupling is that it requires changes to existing schemata, thus violating schema autonomy. Since schema autonomy is a crucial design goal of X-Ray, we adhere to the loose coupling approach. For this, the above-mentioned meta schema reifies the schemata at both the XML-side and the DB-side.

Mapping Cardinality. The criterion of mapping cardinality describes, on the one hand, the possibility of mapping a certain XML schema to multiple different schemata at the DB-side and, on the other hand, the possibility of mapping a certain DB schema to multiple different schemata at the XML-side. Mapping a certain XML schema to multiple schemata at the DB-side makes it possible to provide a global XML view of relational data stored in several different databases, e.g., product databases of different subsidiaries. Vice versa, a certain DB schema could be used to supply relational data to several XML documents conforming to different schemata. For example, relational data could be published according to varying schemata prescribed by different business partners. Since a design goal of X-Ray is to support multiple schemata, both options are supported by allowing multiple relationships between the reified schemata.

Access Capability. The criterion of access capability describes whether the focus of a system is on storing XML documents or on publishing XML documents or other relational data out of a DB. X-Ray aims at providing a unified approach and therefore supports both directions.

Access Language. The languages used for storing and retrieving XML documents or parts thereof can mainly be divided into DB-centric languages (e.g., SQL or extensions thereof) and XML-centric languages (e.g., XQuery or XPath). There are also other possibilities, e.g., providing predefined functions. The goal of X-Ray in this respect is to support homogeneous access, meaning that both the query language and the result adhere to the same data model. Thus, X-Ray is based on the XML-centric approach, using an XML query language.

Access Target. The access target reflects whether a request can be issued directly against the DB schema, against the XML schema, and/or against the mapping knowledge itself. Directly accessing the DB schema requires the user to know two schemata belonging to different data models as well as the required mapping between them. If the access target is an XML schema, this schema constitutes an XML view over the relational schema, thus achieving DB schema transparency, meaning that the user is not concerned with the structure of the underlying DB schema. The necessary transformations and the eventual access to the relational schema are performed automatically by the system. If this view is virtual, retrieval and materialization are done for the actually accessed data only. For performance reasons, however, it may also be preferable to materialize the view or parts of it. As X-Ray aims at achieving DB schema transparency, an XML schema realizing a virtual XML view is used as access target. In addition, convenient access to mapping knowledge is facilitated, since the schemata to be mapped are reified together with the mappings between them within a meta schema.
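The combination of loose coupling and multiple mapping cardinality discussed above can be illustrated with a minimal Python sketch. It is not part of X-Ray; the names `SchemaRef` and `MappingRegistry`, as well as the example schema names, are hypothetical. The point is only that the links live outside both schemata and may be many-to-many.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SchemaRef:
    """Reified schema (hypothetical): a reference only, the schema itself stays untouched."""
    kind: str  # "xml" or "db"
    name: str


@dataclass
class MappingRegistry:
    """Mapping knowledge kept separately from both schemata (loose coupling)."""
    links: set = field(default_factory=set)

    def relate(self, xml: SchemaRef, db: SchemaRef) -> None:
        # Any number of links per schema is allowed (multiple cardinality).
        self.links.add((xml, db))

    def db_schemata_for(self, xml: SchemaRef) -> list:
        return sorted(d.name for x, d in self.links if x == xml)

    def xml_schemata_for(self, db: SchemaRef) -> list:
        return sorted(x.name for x, d in self.links if d == db)
```

With such a registry, one DTD can be related to the product databases of two subsidiaries, and one database can be related to the DTDs of two business partners, without editing any of the schemata.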
5 Overview of X-Ray

This section is dedicated to an overview of the X-Ray approach. The main focus is on the meta schema component which represents one of the most distinguishing characteristics with respect to other closely related approaches (cf. Section 6). It is not the aim of this section to provide an in-depth description of all X-Ray components.

5.1 Architecture
The overall architecture of X-Ray consists of three main components, namely the generic meta schema, the mapping knowledge editor, and the composer/decomposer component (cf. Fig. 15).

[Figure: XML documents and DTDs, Quilt queries, Composer/Decomposer, Mapping Knowledge Editor, Meta Schema, SQL query and result, Domain DB, Meta Schema Repository]
Fig. 15. Architecture of X-Ray
Before X-Ray can be used for storing and retrieving XML documents, the mapping knowledge required for mapping a certain XML schema to a certain relational schema has to be specified in an initialization phase. To support this task, the X-Ray architecture provides a mapping knowledge editor. On the basis of the database schema and the DTD, the user may interactively specify the required mappings, guided by the proposed mapping patterns. As soon as the system is initialized with the mapping knowledge, which is stored within the meta schema repository, the user is able to transparently issue queries using Quilt [17] against a virtual XML view specified by a DTD. A single query may also access the XML view together with one or several XML documents, no matter where they are stored. Quilt is a second-generation XML query language that was developed by synthesizing concepts from several other XML query languages and has had a major impact on XQuery [72], the XML query language proposed by the W3C. Utilizing the mapping knowledge, the query is decomposed into
corresponding SQL queries on the relational database. The result is used to compose XML documents out of flat relational data. The composer/decomposer component serves for storing and retrieving XML documents and therefore performs all necessary transformations based on the mapping knowledge stored in the meta schema. To realize the composition/decomposition task, Kweelt [55], a Java framework for querying XML based on Quilt, has been used and slightly extended. A prototype of X-Ray is operational and supports retrieval and storage of XML documents [31].

5.2 Meta Schema

The insights gained in the previous sections concerning data model heterogeneity, schema heterogeneity, and the mapping possibilities between XML and relational schemata provide the basis for the design of the meta schema of X-Ray. The meta schema is the key mechanism for the genericity of X-Ray, making it possible to map DTDs (cf. footnote 9) and relational schemata. The meta schema consists of three components describing the relevant meta knowledge (cf. Fig. 16). The DBSchema component is responsible for storing information about relational schemata that shall be mapped to DTDs, either to make their data available to XML documents or to store XML documents. Analogously, the XMLDTD component stores schema information about XML documents as specified by means of DTDs. Finally, the XMLDBSchemaMapping component stores the mapping knowledge between DBSchema and XMLDTD. The goal of XMLDBSchemaMapping is to bridge both data model heterogeneity and schema heterogeneity in order to support a proper mapping.

[Figure: XMLDTD and DBSchema connected by a many-to-many (*, *) association with the association class XMLDBSchemaMapping]
Fig. 16. Components of the X-Ray Meta Schema
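The interplay of mapping knowledge, decomposition into SQL, and composition of XML out of flat rows can be sketched in a few lines of Python. This is only an illustration under strong assumptions: X-Ray itself decomposes Quilt queries via Kweelt, whereas the sketch merely looks up one element type. The dictionary `MAPPING`, the functions `decompose`, `compose`, and `demo`, and the table `Product` are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mapping knowledge: element type -> relation and attributes.
MAPPING = {
    "product": {
        "root": "catalog",
        "relation": "Product",
        "children": {"name": "pname", "price": "price"},
    },
}


def decompose(elem_type: str) -> str:
    """Turn a request for an element type into an SQL query."""
    m = MAPPING[elem_type]
    cols = ", ".join(m["children"].values())
    return f"SELECT {cols} FROM {m['relation']} ORDER BY rowid"


def compose(elem_type: str, rows) -> ET.Element:
    """Build nested XML elements out of flat relational rows."""
    m = MAPPING[elem_type]
    root = ET.Element(m["root"])
    for row in rows:
        elem = ET.SubElement(root, elem_type)
        for child, value in zip(m["children"], row):
            ET.SubElement(elem, child).text = str(value)
    return root


def demo() -> str:
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Product (pname TEXT, price REAL)")
    con.executemany("INSERT INTO Product VALUES (?, ?)",
                    [("pen", 1.5), ("ink", 4.0)])
    rows = con.execute(decompose("product")).fetchall()
    return ET.tostring(compose("product", rows), encoding="unicode")
```

Because the mapping is data rather than code, a second DTD over the same relation would only require a second entry in the mapping, not a change to `decompose` or `compose`.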
In X-Ray, a database schema is not limited to a single DTD but may be mapped to several DTDs and vice versa. This is reasonable since, due to presentation requirements, it may be necessary to represent a particular piece of information by several XML documents based on different DTDs. Likewise, if several relational schemata store data of the same domain, it may be required to represent these data by XML documents based on the same DTD. Concerning the storage of the meta knowledge itself, X-Ray comprises both a relational representation of the meta schema stored within a relational database and an object-oriented representation for main-memory mapping. The latter is initialized with the content of the relational meta schema at the beginning of an X-Ray session, thereby allowing an efficient composition and decomposition of XML documents at runtime. The object-oriented representation in terms of UML class diagrams is also used throughout this section to concisely and precisely depict the various meta schema components.

5.2.1 Database Schema Component

Concerning the database schema component, it has to be emphasized that it is not necessary to store meta knowledge about the complete relational schema, but only about those relations and attributes that are relevant for the mapping to the DTD. However, not only base relations and their attributes are relevant, but also non-base relations, i.e., the connecting relations between two base relations.

[Figure: UML class diagram with the classes DBSchema, DBRelation, DBAttribute, DBConcept, DBRelationship, DBJoinSegment, and DBJoinPath]
Fig. 17. Meta Schema of the Relational Schema
Footnote 9: An extension of the meta schema in order to also support the XML Schema standard is currently under development (cf. Section 7).

As illustrated in Fig. 17, DBSchema contains at least one DBRelation, which consists of at least one DBAttribute. DBAttribute stores, among other things, its atomic domain and whether it represents a primary key attribute. DBRelation and DBAttribute are generalized to DBConcept. Relationships (DBRelationship) connect two relations and specify one or more join segments (DBJoinSegment) comprising the join attributes, i.e., primary key
and foreign key attributes of two relations that realize the relationship. The relationship comprises more than one join segment if the primary key is composed of two or more attributes. If parts of an XML document are stored within different relations, information about the proper join paths (DBJoinPath) is necessary. A DBJoinPath consists of one or more relationships. It comprises more than one relationship if more than two relations have to be joined for composing or decomposing a particular part of an XML document. Note that there is no difference between relationships connecting different relations and those referring to one and the same relation, neither with respect to storing information within the meta schema about a recursive relationship or a recursive element type, respectively, nor concerning the mapping between them.

5.2.2 XML DTD Component

Similar to the database schema component, it is not necessary to store meta knowledge about the complete DTD, but only about those parts that are relevant for the mapping to the relational schema (cf. footnote 10). The meta knowledge specifies that a DTD (XMLDTD, cf. Fig. 18) has a certain element type (XMLElemType) that serves as root. For element types with attributes, XMLAttribute stores information about their atomic domains and their default declaration. Similar to the database schema component, XMLElemType and XMLAttribute are generalized to XMLConcept. For enumeration attributes, the possible values are stored within XMLAttValEnum. According to the distinction made in Section 2.2, XMLElemType is specialized into XMLAtomicET, XMLEmptyET, and XMLCompositeET. The latter is further specialized into XMLCompositeETMixedContent and XMLCompositeETElemContent.

[Figure: UML class diagram of the package XMLMain with the classes XMLDTD, XMLConcept, XMLElemType, XMLAttribute, XMLAttValEnum, XMLAtomicET, XMLEmptyET, XMLCompositeET, XMLCompositeETMixedContent, and XMLCompositeETElemContent, the association "has root elem type", and a reference to CompositionStructure::XMLContentParticle]
Fig. 18. Meta Schema of the DTD
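To make the specialization of XMLElemType concrete, the following Python sketch classifies the content model of a DTD element type declaration into the four leaf classes. It assumes that the classes correspond to the usual DTD content kinds (`EMPTY`, `#PCDATA` only, mixed content, element content); the function name `classify_content_model` is hypothetical and the string matching is deliberately simplistic.

```python
import re


def classify_content_model(model: str) -> str:
    """Map a DTD content model string to a leaf class of the XMLElemType hierarchy."""
    m = re.sub(r"\s+", "", model)  # ignore whitespace inside the model
    if m == "EMPTY":
        return "XMLEmptyET"
    if m in ("(#PCDATA)", "(#PCDATA)*"):
        return "XMLAtomicET"
    if m.startswith("(#PCDATA|") and m.endswith(")*"):
        return "XMLCompositeETMixedContent"
    if m == "ANY":
        # Not covered by the hierarchy as described here.
        raise ValueError("ANY content is not covered by this sketch")
    return "XMLCompositeETElemContent"
```

For instance, `<!ELEMENT book (title, author+)>` would be classified as XMLCompositeETElemContent, while `<!ELEMENT para (#PCDATA | em)*>` would be classified as XMLCompositeETMixedContent.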
The nesting structure of an XMLCompositeETElemContent is described by the package CompositionStructure (cf. Fig. 19). For an XMLCompositeETMixedContent, the nesting structure need not be represented in the meta schema, since, as already mentioned, component element types are only allowed to occur in a choice with cardinality ‘*’.

[Figure: UML class diagram of the package CompositionStructure with the classes XMLContentParticle, XMLSequence, XMLChoice, and Position, the association "has outer most content particle" to XMLMain::XMLCompositeETElemContent, and a reference to XMLMain::XMLElemType]
Fig. 19. Meta Schema of the XML Composition Structure
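A composition structure of nested sequences and choices can be sketched as a small tree of content particles. The Python classes below are hypothetical stand-ins for XMLSequence, XMLChoice, and element type references; the list order inside a sequence plays the role of Position, and `to_dtd` renders the tree back into DTD content model syntax as a sanity check.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ElemRef:
    """Reference to a component element type with its cardinality."""
    name: str
    cardinality: str = ""  # "", "?", "*" or "+"


@dataclass
class Sequence:
    particles: List["Particle"]  # list order stands for Position
    cardinality: str = ""


@dataclass
class Choice:
    particles: List["Particle"]
    cardinality: str = ""


Particle = Union[ElemRef, Sequence, Choice]


def to_dtd(p: Particle) -> str:
    """Render a content particle tree as a DTD content model."""
    if isinstance(p, ElemRef):
        return p.name + p.cardinality
    sep = ", " if isinstance(p, Sequence) else " | "
    return "(" + sep.join(to_dtd(c) for c in p.particles) + ")" + p.cardinality
```

A sequence containing a choice, e.g. `Sequence([ElemRef("title"), Choice([ElemRef("author", "+"), ElemRef("editor")]), ElemRef("year", "?")])`, shows how arbitrary combinations of sequences and choices nest.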
For component element types occurring in an XMLSequence or in an XMLChoice, the cardinality of the element type and, in case of a sequence, its position have to be stored. Furthermore, arbitrary combinations of sequences and choices can be described.

Footnote 10: Note that the aim of our approach is neither to store schema-less XML documents nor to provide for a “round-trip engineering” of XML documents by preserving instance-level information such as comments, processing instructions, or document order. Rather, our aim is to deal with XML documents adhering to a certain schema and to allow them to be either generated out of or stored within an RDBS.

5.2.3 Mapping Knowledge

The mapping knowledge is expressed by various associations between the object classes of the XML DTD component and the database schema component. Fig. 20 illustrates these mapping relationships, denoting them with
bold lines. For representation convenience, only those object classes are shown which are part of a mapping relationship. In order to meet the requirement that the meta schema is able to store mappings between different DTDs (XMLDTD) and different database schemata (DBSchema), the mapping between the class XMLConcept and the class DBConcept takes part in a ternary relationship with the association class XMLDBSchemaMapping. As discussed in Section 3.2, deciding on the exact kind of element type is a prerequisite for deciding on a reasonable mapping to a database concept. Consequently, the leaf classes of the XMLElemType hierarchy are mapped to DBAttribute, with two exceptions. The class XMLCompositeETElemContent is mapped to DBRelation, and the mapping of class XMLEmptyET is not further refined, since it inherits the (ternary) association to DBConcept.

[Figure: UML class diagram relating XMLDTD, XMLConcept, XMLAttribute, XMLElemType, XMLCompositeET, XMLAtomicET, XMLEmptyET, XMLCompositeETMixedContent, and XMLCompositeETElemContent to DBSchema, DBConcept, DBRelation, and DBAttribute via the association class XMLDBSchemaMapping; mapping relationships are drawn as bold lines, partly under an {OR} constraint]
Fig. 20. Meta Schema Describing the Mapping Knowledge
Besides the mapping relationships depicted in Fig. 20, there are also relationships to class DBJoinPath (cf. Fig. 17) which are not illustrated for representation convenience. Due to space restrictions, the attributes of the various object classes are also not shown. An example mapping in terms of the filled-in meta schema can be found in [42].
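How a join path stored in the meta schema could drive SQL generation is sketched below in Python. The classes mirror DBJoinSegment, DBRelationship, and DBJoinPath in name only; their fields, the function `to_sql`, and the example relations are assumptions. The sketch handles a simple chain of distinct relations and does not cover recursive relationships, which would need table aliases.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JoinSegment:
    left_attr: str   # join attribute of the left relation
    right_attr: str  # join attribute of the right relation


@dataclass
class Relationship:
    left: str
    right: str
    segments: List[JoinSegment]  # more than one for composite keys


@dataclass
class JoinPath:
    relationships: List[Relationship]  # more than one for multi-way joins


def to_sql(path: JoinPath, select: str = "*") -> str:
    """Generate a join query for a chain of relationships."""
    sql = f"SELECT {select} FROM {path.relationships[0].left}"
    for rel in path.relationships:
        cond = " AND ".join(
            f"{rel.left}.{s.left_attr} = {rel.right}.{s.right_attr}"
            for s in rel.segments
        )
        sql += f" JOIN {rel.right} ON {cond}"
    return sql
```

A relationship with two join segments yields a conjunctive join condition, and a join path with two relationships yields a three-way join.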
6 Related Work
This section provides an in-depth survey of related work. The survey comprises six different research prototypes as well as three of the most prominent commercial RDBS, namely DB2, Oracle, and SQL Server. The rationale behind choosing these nine was to assemble a representative mix of current approaches supporting different concepts which are closely related to X-Ray. Another intent was to evaluate not only research approaches but also commercially available systems. Commercial systems in particular often support different integration alternatives, each of them pursuing another goal. We have considered those alternatives that are closely related to X-Ray. Each of the selected approaches is described in the following within a separate subsection, using the design alternatives discussed in Section 4 as evaluation criteria. Thereby, we intend to provide an overall understanding of each approach before discussing general findings of the evaluation with respect to the X-Ray approach. The results of this survey are summarized in Table 7. It has to be noted that there are already papers available that focus on a comparison of different approaches for integrating XML and RDBS (cf., e.g., [36], [45], [66]). The following survey differs from these, since it discusses the approaches specifically from the viewpoint of the design alternatives considered by X-Ray.

6.1 Agora

Schemata. Agora [46], [47] is a data integration system which provides a global virtual XML schema. It integrates existing relational and DOM-compliant data sources and translates XML queries on the global schema into (SQL) queries on the underlying data sources, regardless of the kind of mapping employed. The global XML schema may also already exist, e.g., in terms of a standardized schema available as DTD or XML Schema, thus supporting a user-defined approach.

Mapping. The data sources are defined as SQL views over a global XML schema (the so-called “local-as-view” (LAV) approach), thereby hard-coding the mapping knowledge at initialization time. For this, a normalized relational schema consisting of 11 tables serves as an intermediate layer to losslessly represent the content of XML documents in a relational way. This intermediate schema, which can be virtual, is also the basis for query translation. It has to be emphasized that this intermediate schema is completely different from the meta schema employed in X-Ray. First, it is independent of the mapping between the global XML schema and the real data sources, second, it is only used to get queries across the language gap, and third, it stores the real content of the XML document
instead of information about this content. Coupling of the mapping knowledge with the schemata is loose; mapping cardinality seems to be multiple at the DB-side only.

Access. The main focus of Agora is on publishing existing (relational) data. XML-centric queries on the global XML schema, which are expressed by means of XQuery, are processed in three steps. First, the query is normalized by applying equivalence transformations to enable a direct translation to SQL. Second, the normalized query is translated into an SQL query on the intermediate schema (still working independently of the local data sources). Third, the SQL query is rewritten into an SQL query on the real data sources using their definitions as views over the intermediate schema.

6.2 LegoDB

Schemata. LegoDB [9], [51] is a cost-based XML storage mapping engine that explores the space of possible XML-to-relational mappings. In particular, LegoDB generates, with respect to query performance, the “best” mapping and corresponding relational schema for a given XML schema, an XML query workload, and statistics over the XML data, thus supporting a derived approach. The XML schema, expressed with the XML Schema standard or DTDs, is converted (“normalized”) into a schema tree (the so-called “physical schema”) consisting of type constructors using an XML query algebra. For this normalization process, only a subset of XML Schema concepts is used, disregarding, e.g., the distinction between groups and complex types as well as between local and global declarations. Semantics-preserving schema transformation operations (e.g., inlining/outlining, repetition merge/split) are then repeatedly applied to these physical schemata, and cost estimation is done for each transformed schema until a “good” DB schema and corresponding mapping is found using heuristics.

Mapping. LegoDB uses the physical schemata as a basis for a fixed set of derivation rules which are hard-coded within the system and separated from the schemata. In particular, LegoDB creates a table for each type, maps the contents of the elements into columns of that table, and generates a key column that contains the ID of the corresponding element and a foreign key that keeps track of the parent/child relationship. Based on these derivation rules, an extended XML Schema parser is used to automatically generate the appropriate mapping knowledge in a batch process and to populate the database with the content of the XML documents by generating appropriate SQL insert statements. Different mapping cardinalities are not further considered.

Access. Concerning access, the main focus of LegoDB is on the storage of XML documents. A subset of XQuery is used as XML-centric access language at the XML-side, employing a very simple translation algorithm from XQuery to SQL.

6.3 MARS

Schemata. The system MARS (“Mixed And Redundant Storage”) [20], [21] focuses, similar to [19], on a mixed XML and relational storage scenario, where redundancy can be exploited to enhance the performance of translating an XML query to the underlying data source. MARS realizes a user-defined approach and also handles unstructured parts of data in terms of CLOBs.

Mapping. Mapping knowledge is hard-coded within views, defined in both directions, i.e., towards the DB-side and the XML-side, thereby realizing not only a LAV approach (cf. Agora) but also a “global-as-view” (GAV) approach to data integration (cf. footnote 11). Similar to Agora, XML documents are represented in a relational way, using an intermediate relational meta schema called “Generic Relational representation of XML” (GReX). This intermediate schema consists of 8 tables and, unlike in Agora, of a set of relational constraints. The mapping from relational data to the published schema is expressed using XQuery. The mapping from the published schema to the storage schema is done using relational constraints. Redundant data used for supporting queries is expressed by materialized SQL views over the intermediate schema.

Access.
The major focus of the system is on publishing. Access to the global XML schema is provided by means of XQuery. Due to the combined LAV and GAV approach, one and the same algorithm can be used for performing “rewriting-with-views”, “composition-with-views”, or both.

Footnote 11: For a discussion of the benefits and drawbacks of each approach, the reader is referred to [47].

6.4 MXM

Schemata. MXM (“My XML Mapper”) [3], [4] is a declarative XML-to-relational mapping language that makes it possible to specify several mappings that have been proposed in the literature. MXM takes into account both XML
documents without any schema and XML documents conforming either to a DTD or an XML Schema and generates a target relational schema, thus realizing a derived approach.

Mapping. Mapping knowledge is represented separately from the schemata and reified within XML documents conforming to an XML Schema. In addition, it is stored within a relational database consisting of 11 tables for describing the schema at the XML-side, the schema at the DB-side, and the mapping between them. The main difference between MXM and X-Ray in this respect is the expressiveness of the meta schema. Since MXM adheres to a derived approach, heterogeneities between the schemata to be mapped can be reduced to a minimum using appropriate derivation rules. Therefore, the meta schema can be kept very simple, which is in contrast to X-Ray, where several possible heterogeneities have to be taken into account because of its user-defined approach. In MXM, there are some very simple default mapping rules (e.g., table names and CLOB names are system-generated if not explicitly given) which can be configured and extended by users within an XML configuration file. Concerning mapping cardinality, although it seems to be possible to realize multiple mappings at both sides, it is only mentioned that multiple mappings into relational back-ends are possible.

Access. Finally, MXM provides an interface in terms of a set of C functions to query the mapping schema, i.e., all choices made in a mapping, to generate the target relational schema, and to store XML documents in this schema.

6.5 SilkRoute

Schemata. SilkRoute [25], [26] is an XML publishing framework that supports a user-defined approach, defining the XML schema using XQuery.

Mapping. The mapping process proceeds as follows. First, the relational schema is transformed automatically into a canonical XML view that represents the DB tables and their attributes in XML format. Then, on the basis of this view, an administrator defines a public XML view, which is virtual, using a subset of XQuery. The public query represents the mechanism to specify the schema at the XML-side together with the mapping knowledge describing how this schema is related to the canonical XML view. Thus, one part of the mapping knowledge is hard-coded within the system to generate the canonical view, and the other part is hard-coded within a query to define the public view, thereby allowing multiple mappings at the XML-side but only a single mapping at the DB-side.

Access. Finally, users may access the public XML view by formulating application queries using XQuery to publish XML documents out of the DB. Internally, XQuery expressions are transformed into view forests, a representation that separates the structure of the output XML document from the computations that produce the document’s content. The computations are expressed in SQL, whereby an attempt is made to translate XQuery expressions into an optimal set of corresponding SQL queries.

6.6 XTABLES

Schemata / Mapping. The XTABLES approach [29], [65], formerly known as “XPERANTO” [13], provides a middleware to publish XML views of relational data, to query XML views, and to store and query XML documents within an RDB. Regarding the publishing component, XTABLES automatically generates a derived default XML view out of the underlying relational data together with a corresponding XML Schema [60]. Based on this simple view, more complex application-specific views can be derived in a hard-coded way using XQuery and materialized on demand [60]. Concerning the storage of XML documents, one of possibly many relational schema generators automatically derives appropriate relational tables for storage purposes. XML documents are shredded according to the schema generation technique used (e.g., [28] or [59]) and stored within the tables. The schema generator as well as the document shredder have to be implemented manually, thus hard-coding mapping knowledge. In addition, according to the schema generation technique used, XTABLES generates a reconstruction XML view over these tables, which in fact represents a query over the default XML view of the created tables. Thus, XTABLES eliminates the need to build a new query processor for different relational schema generation techniques.

Access. XML views can again be queried using XQuery. The result is computed by means of a view composition mechanism, ensuring that only those relational data items that are needed are materialized and not any intermediate results. Furthermore, XQuery can be used to issue queries that span XML documents and XML views of relational data. In any case, most computation is pushed down to the RDB engine [61]. XTABLES can be used on top of any ODBC-compliant RDBS and has already been integrated into IBM’s DB2, named “XML for Tables” for the publishing component and “XML Data Mediator” [33] for the storage component [34].
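The idea of a derived default XML view over relational data, as used in different forms by SilkRoute and XTABLES, can be sketched as follows. This is not the format produced by either system: the element names `db` and `row`, the function `default_xml_view`, and the use of SQLite are assumptions, and the view is materialized eagerly here, whereas the systems discussed above keep it virtual.

```python
import sqlite3
import xml.etree.ElementTree as ET


def default_xml_view(con: sqlite3.Connection, db_name: str = "db") -> ET.Element:
    """Derive an XML tree from all tables: one element per table, row, and column."""
    root = ET.Element(db_name)
    tables = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    for table in tables:
        table_el = ET.SubElement(root, table)
        cur = con.execute(f'SELECT * FROM "{table}" ORDER BY rowid')
        cols = [d[0] for d in cur.description]
        for row in cur:
            row_el = ET.SubElement(table_el, "row")
            for col, value in zip(cols, row):
                if value is not None:  # NULL values yield no element
                    ET.SubElement(row_el, col).text = str(value)
    return root
```

Since the view follows mechanically from the relational schema, no mapping knowledge has to be specified by the user, which is exactly what distinguishes a derived from a user-defined approach.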
6.7 DB2 XML Extender

Schemata. IBM’s DB2 XML Extender [6] provides procedures to perform storage and retrieval of XML documents, realizing a user-defined approach.

Mapping. Mapping knowledge is stored within data access definition (DAD) files, which are in fact XML documents. DAD files support the so-called “XML Collection” approach to map XML documents to several database attributes and the so-called “XML Column” approach to store XML documents within a single database attribute. XML Collection provides two ways to define a mapping. With “SQL Node mapping”, an SQL query over the relational schema is specified, thus hard-coding a part of the mapping knowledge, and the mapping is defined to the relational schema of the query result. “RDB Node mapping”, in contrast, requires a mapping to the relational schema as it is. Since the XML schema is also specified within the DAD files using pre-defined elements and attributes, the mapping knowledge is intermingled with the XML schema specification. It is possible to define multiple mappings at the XML-side, but only a single mapping at the DB-side.

Access. SQL Node mapping supports retrieval of XML documents, whereas RDB Node mapping supports both retrieval and storage of XML documents. Instead of explicitly accessing the DB-side or the XML-side, access is performed by applying procedures that take a DAD as parameter. Conditions restricting the data may be specified within the DAD file.

6.8 Oracle

First of all, it has to be noted that, in contrast to the other commercial systems, Oracle9i R2 [32] builds on the object-relational data model for integrating XML. In this way, nested structures at the XML-side are mapped to nested structures at the DB-side, making the process of mapping schemata more intuitive than in approaches relying on the flat relational data model, but making it harder to reuse existing relational data. In particular, Oracle supports several mechanisms for integrating XML, comprising the data type XMLType (cf. footnote 12), the XML SQL Utility (XSU), and a canonical XML view over the relational DB.

XMLType

Schemata. Using XMLType to decompose XML documents, an object-relational schema at the DB-side is derived from a user-defined XML schema specified by using the XML Schema standard, or vice versa.

Mapping. Mapping knowledge is mainly hard-coded within the application, but may be enhanced by specifying changes to the default mapping rules intermingled within the XML schema specification. Such changes are supported by pre-defined attributes, which make it possible to rename elements and attributes, for instance. Only a single mapping is possible at each side.

Access. Storage as well as retrieval of XML documents is supported by several functions provided by the data type XMLType, which make it possible to reference elements and attributes by XPath expressions. Although these functions are applied directly to XMLType columns of a relational schema using SQL, conditions are formulated against the XML schema, which therefore constitutes the access target.

XML SQL Utility (XSU)

Schemata. The XML SQL Utility (XSU) is actually a programming interface for Java and PL/SQL for storing XML into and publishing XML out of Oracle. XSU supports a derived approach by applying default mapping rules to the result of an SQL query. There exists no explicit schema specification at the XML-side, although it is possible to generate a DTD or an XML Schema specification corresponding to the schema derived from the query result.

Mapping. The mapping knowledge is hard-coded within the application and within the query, defined at access time, thus being coupled tightly with the structure at the XML-side. Therefore, it is possible to define multiple mappings at the XML-side, but only a single mapping at the DB-side.

Access. XSU supports storage as well as retrieval of XML documents by several methods, having the DB-side as access target. In case of storage it is not necessary to distinguish whether the DB-side or the XML-side is accessed, since they are required to have the same structure.

Canonical XML View

Schemata. Further, Oracle supports a virtual canonical XML view over all schemata of the DB. Therefore, the schema at the XML-side is derived from the DB schema.
Footnote 12: Note that XMLType is used both for storing XML documents as a whole within a single attribute and for shredding XML documents into different attributes.
Mapping. The mapping knowledge is hard-coded within the application. Thus, there is a loose coupling to the mapped schemata. Only a single mapping is possible at each side, resulting from applying a canonical mapping.

Access. The virtual XML view can be used to publish XML using a subset of XPath within URLs.

6.9 SQL Server

Microsoft SQL Server 2000 [53], [54] supports several different options for integrating XML with the SQL Server, comprising a FOR XML clause for SQL statements, an annotation mechanism for XML Schema, the OpenXML function to define relational views over XML, and a canonical XML view over the RDB (cf. footnote 13).

FOR XML

Schemata. The FOR XML clause is an extension to SQL and provides four modes to transform query results into XML. Depending on the mode, a derived approach (RAW, AUTO, NESTED modes) as well as a user-defined approach (EXPLICIT mode) are supported. Concerning the latter, the XML schema is defined by specifying a universal table, encoding information about the structure of the XML document using a special syntax for column names of the query result. In this way, arbitrary nesting levels are supported, and for each attribute of the query result it can be determined whether it is mapped to an element or to an XML attribute. Further, additional information, like the attribute type ID for an XML attribute, may be specified.

Mapping. In the derived case, mapping knowledge is mainly hard-coded within the application, whereby changes to the default mapping rules may be hard-coded within the query. In the user-defined case, mapping knowledge is hard-coded within the query only. The mapping knowledge is defined at access time and not stored in any way. There is a tight coupling between mapping knowledge and the specification of the structure of the XML document, allowing multiple mappings at the XML-side but only a single mapping at the DB-side.

Access. The FOR XML clause is an extension to the SQL SELECT statement issued against the RDB-side and thus supports publishing of XML documents.

Annotated Schemata

Schemata. Another way to map relational data to XML is given by annotated schemata supporting a user-defined approach. XML Schema as well as a variant invented by Microsoft, called XML Data Reduced (XDR), can be used to define a virtual XML view.

Mapping. The mapping is defined by enhancing the XML schema definition with annotations, representing a tight coupling. The annotations specify the relations and attributes that are mapped to, together with the foreign-key relationships that have to be considered when retrieving or storing data from or in more than one relation. In this way, multiple mappings may be specified at the XML-side, but only a single mapping may be specified at the DB-side.

Access. Annotated schemata support both retrieval and storage of XML documents by accessing the virtual XML view. Retrieval is supported by applying a subset of XPath, whereas storage (insert, update, and delete) is supported by a special mechanism called “Updategram”. Updategrams are XML documents that describe a “before state”, used for finding the corresponding element or attribute, and an “after state”, used for defining the new values. Since Updategrams demand a special syntax, it is not possible to insert arbitrary XML documents. Updategrams may also be used to update the canonical XML view over the relational schema.

OpenXML

Schemata. Whereas the annotated schemata approach defines XML views over relational data, OpenXML works the other way round, making it possible to define relational views over XML. For this, an XML document is transformed via DOM into a flat table structure. The OpenXML function may be used within an SQL statement to insert the generated tuples into a relational table, for instance. For this purpose, OpenXML provides the WITH clause, in which the mapping knowledge is defined.
OpenXML supports a user-defined approach between the implicit schema of the XML document and the relational schema, actually a single relation. Mapping. The transformation is either done on the basis of default mapping rules or by explicitly defining the mapping knowledge within the WITH clause of the OpenXML function performing the transformation. For this, XPath expressions are used to reference elements or XML attributes within the XML document and assign them to DB attributes. The mapping knowledge is coupled loosely with the schema specification and it is possible to define multiple mappings at the XML-side as well as at the DB-side. Access. The OpenXML function may be used to store XML documents by applying it within an SQL statement issued against the DB. 13
This approach has already been discussed within the context of Oracle.
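To make the before/after idea behind Updategrams more concrete, the following minimal Python sketch applies a strongly simplified, Updategram-like document to an in-memory list of rows. The element names `sync`, `before` and `after`, the `Customer` rows, and the function `apply_updategram` are assumptions made for illustration only; the sketch does not reproduce SQL Server's actual Updategram syntax or processing.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified Updategram-like document (not SQL Server's syntax):
# <before> identifies the affected row, <after> carries the new values.
UPDATEGRAM = """
<sync>
  <before><Customer id="7" city="Linz"/></before>
  <after><Customer id="7" city="Wien"/></after>
</sync>
"""


def apply_updategram(rows, xml_text):
    """Apply one before/after pair to a list of dict rows (illustrative only)."""
    sync = ET.fromstring(xml_text.strip())
    before = sync.find("before")
    after = sync.find("after")
    # Each state holds at most one element whose attributes form one row.
    old = dict(before[0].attrib) if before is not None and len(before) else None
    new = dict(after[0].attrib) if after is not None and len(after) else None

    if old is None and new is not None:
        # No before state: treat the after state as an insert.
        return rows + [new]
    if not any(row == old for row in rows):
        raise LookupError("before state does not match any row")
    if new is None:
        # No after state: delete the row found via the before state.
        return [row for row in rows if row != old]
    # Both states present: replace the matching row with the new values.
    return [new if row == old else row for row in rows]


rows = [{"id": "7", "city": "Linz"}, {"id": "8", "city": "Graz"}]
updated = apply_updategram(rows, UPDATEGRAM)
```

The sketch also shows why arbitrary XML documents cannot be stored this way: only documents that follow the before/after structure are understood by the processing function.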
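The OpenXML idea described above, a flat relational view over an XML document whose columns are defined by path expressions, can be sketched as follows. This is a minimal Python illustration based on `xml.etree.ElementTree` and `sqlite3`, not SQL Server's OpenXML function; the sample document, the `COLUMN_PATHS` dictionary (which plays the role of the WITH clause), and the `customer` relation are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document.
DOC = """
<Customers>
  <Customer id="7"><Name>Huber</Name><City>Linz</City></Customer>
  <Customer id="8"><Name>Maier</Name><City>Wien</City></Customer>
</Customers>
"""

# Assumed mapping knowledge: which nodes become rows, and which path
# (relative to a row node) fills which DB attribute. "@x" denotes an
# XML attribute, anything else a child element.
ROW_PATH = "./Customer"
COLUMN_PATHS = {"id": "@id", "name": "Name", "city": "City"}


def shred(xml_text, row_path, column_paths):
    """Turn the XML document into a list of flat tuples."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for node in root.findall(row_path):
        row = []
        for path in column_paths.values():
            if path.startswith("@"):
                row.append(node.get(path[1:]))    # XML attribute value
            else:
                row.append(node.findtext(path))   # text of child element
        rows.append(tuple(row))
    return rows


# Store the generated tuples in a single relation, as in the text above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id TEXT, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    shred(DOC, ROW_PATH, COLUMN_PATHS),
)
stored = conn.execute("SELECT id, name, city FROM customer ORDER BY id").fetchall()
```

Because the mapping lives in a separate structure rather than in the document or the relation, a different `COLUMN_PATHS` dictionary could map the same document to another relation, which corresponds to the loose coupling noted above.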
6.8 Summary of Results
In the following, the major findings of our evaluation are briefly summarized according to the design alternatives presented in Section 4 (cf. Table 7).
Kind of Schemata. Concerning the kind of schemata, about half of the examined approaches support a derived approach, either at the DB-side or at the XML-side, whereas the others, like X-Ray, allow a user-defined approach. Among the commercial systems, Oracle provides a derived approach only, whereas DB2 and SQL Server also allow a user-defined approach.
Representation of Mapping Knowledge. Mapping transparency is violated only by a few approaches realized by commercial systems, which require the mapping knowledge to be defined at runtime. Many approaches hard-code the mapping knowledge within queries at initialization time, thereby complicating its maintenance. Only five approaches allow the mapping knowledge to be reified within a meta schema, four of them within XML documents; only two of them, namely MXM and X-Ray, allow it to be stored within a DB.
Coupling with Schemata. Most of the investigated approaches support loose coupling of the mapping knowledge with the schemata to be mapped; only SilkRoute and some approaches of the commercial systems adhere to tight coupling.
Mapping Cardinality. Except for SQL Server’s OpenXML and X-Ray, none of the evaluated approaches supports multiple schemata at the DB-side as well as multiple schemata at the XML-side. Seven approaches allow multiple mappings at the XML-side, four approaches allow multiple mappings at the DB-side, and five allow only a single mapping in both directions.
Access Capability. About one third of the approaches investigated aim at both storage and publishing of XML documents within a unified approach.
Access Language. Most research prototypes support XML-centric access, whereas commercial products also realize database-centric access.
Access Target. Most of the approaches support the XML schema as the target of the access; only a few of them use the DB schema. Only two approaches, MXM and X-Ray, allow the mapping knowledge to be queried, which follows from the fact that both store the mapping knowledge within the DB.
Table 7. Overview of Evaluation Results
[Table 7 is a check-mark matrix; the individual cell entries are not recoverable here. Its rows and columns are as follows.]
Rows (approaches evaluated):
  Research Approaches: Agora, LegoDB, MARS, MXM, XTABLES, SilkRoute
  Commercial Systems: DB2 (XML Extender); Oracle (XSU, XML Type); SQL Server (XML View, FOR XML, Annotated Schemata, OpenXML)
  X-Ray
Columns (evaluation criteria):
  Schemata
    Kind of Schemata: derived approach; user-defined approach
  Mapping
    Representation of Mapping Knowledge: hard-coded at runtime; hard-coded at initialisation time; reified in file; reified in DB
    Coupling with Schemata: tight; loose
    Mapping Cardinality: multiple at DB-side; multiple at XML-side
  Access
    Access Capability: Publishing; Storage
    Access Language: DB-centric; XML-centric; Others
    Access Target: DB schema; XML schema; Mapping Knowledge
Summing up, the characteristics that most distinguish X-Ray from the majority of the compared approaches are the following: (1) X-Ray ensures mapping transparency and eases the maintenance of mapping knowledge by reifying the mapping knowledge as an instance of a meta schema stored within a DB. (2) X-Ray realizes multiple schemata at the DB-side and at the XML-side by allowing multiple relationships between the reified concepts of the schemata to be defined within the meta schema. (3) X-Ray supports a unified approach, allowing XML documents to be both stored and published on the basis of an RDBS. (4) X-Ray provides different access targets; in particular, since the mapping knowledge is stored within the DB, it can be reasoned about simply by using queries.
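Points (1), (2) and (4) can be illustrated with a small sketch: once mapping knowledge is stored as data, ordinary SQL queries can inspect it, and one DB attribute can take part in several mappings. The tables `xml_concept`, `db_attribute` and `mapping` below are hypothetical and far simpler than X-Ray's actual meta schema, which is not reproduced here; Python's `sqlite3` module merely stands in for an RDBS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reified XML concepts (element types and attributes), hypothetical layout.
CREATE TABLE xml_concept (id INTEGER PRIMARY KEY, kind TEXT, name TEXT);
-- Reified relational concepts.
CREATE TABLE db_attribute (id INTEGER PRIMARY KEY, relation TEXT, name TEXT);
-- The mapping knowledge itself: one row per relationship between concepts.
CREATE TABLE mapping (
    xml_id INTEGER REFERENCES xml_concept(id),
    db_id  INTEGER REFERENCES db_attribute(id)
);

INSERT INTO xml_concept VALUES
    (1, 'element',   'Customer/Name'),
    (2, 'attribute', 'Customer/@id'),
    (3, 'element',   'Client/FullName');
INSERT INTO db_attribute VALUES
    (1, 'customer', 'name'),
    (2, 'customer', 'id');
-- Two different XML concepts are mapped to the same DB attribute.
INSERT INTO mapping VALUES (1, 1), (2, 2), (3, 1);
""")


def xml_concepts_mapped_to(relation, attribute):
    """Query the mapping knowledge: which XML concepts feed a DB attribute?"""
    cursor = conn.execute(
        """
        SELECT x.name
        FROM mapping m
        JOIN xml_concept x ON x.id = m.xml_id
        JOIN db_attribute d ON d.id = m.db_id
        WHERE d.relation = ? AND d.name = ?
        ORDER BY x.name
        """,
        (relation, attribute),
    )
    return [row[0] for row in cursor]


name_sources = xml_concepts_mapped_to("customer", "name")
```

Changing a mapping in this setting means updating rows in `mapping` rather than rewriting queries, which is the maintenance benefit that hard-coded mapping knowledge lacks.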
7 Summary and Future Work
This paper discusses several issues that are relevant when integrating XML and RDBS and proposes X-Ray, an approach that realizes such an integration in a generic way. First, the paper provides an analysis of data model heterogeneity between XML, in terms of DTDs and XML Schema, and the relational data model. In particular, an in-depth investigation of similarities and differences between XML concepts and RDBS concepts is provided, focusing on structuring and typing mechanisms, uniqueness of names, null values and default values, identification, relationships, and order. Second, on the basis of different mapping possibilities between XML and RDBS, a set of mapping patterns is introduced, determining reasonable mappings of XML concepts to RDBS concepts by considering several characteristics of XML element types and XML attributes. Third, design alternatives relevant when integrating XML and RDBS are discussed, comprising the schemata to be mapped, the mapping itself, and the access to the system. Fourth, X-Ray, a generic approach for integrating XML with RDBS, is proposed. X-Ray allows user-defined mappings between independently developed schemata to be realized with the support of mapping patterns, thus preserving the autonomy of the participating DTDs and relational schemata while integrating them in a generic way.
Concerning future work, besides working on a data manipulation language for X-Ray (cf. [31], [12], [48]), we are currently extending X-Ray towards the XML Schema standard. As discussed in Section 6, existing approaches support XML Schema only in rather restricted ways with respect to the design goals of X-Ray. They either support a derived approach or enhance the schema language so that the mapping knowledge is intermingled with the XML Schema specification. Consequently, they do not allow a schema to be mapped to multiple schemata at the other side. In addition, it has been shown that they simplify XML Schema concepts or do not support all concepts provided. Also, mapping patterns that describe which XML Schema concepts might be mapped to which relational concepts are not provided. Finally, neither the reification of both schemata nor the storage of the mapping knowledge within a DB is supported, which prevents a more sophisticated model management as suggested, e.g., by Bernstein et al. [8].
X-Ray will incorporate XML Schema according to the design goals discussed in Section 4. For this, a meta schema has to be designed that incorporates the concepts of the XML Schema standard in an object-oriented way, as has already been done for the XML DTD component of our existing meta schema. To support the mapping process, patterns for mapping XML Schema concepts to RDB concepts have to be proposed. These mapping patterns have to be realized within the meta schema of X-Ray so that only reasonable mappings are allowed. Concerning the definition of mapping patterns, we build heavily on existing work on mapping object-oriented models to relational models (cf., e.g., [2], [44], [38]), since several concepts available in XML Schema are closely related to concepts available in object-oriented models. At the same time, the rich set of concepts provided by XML Schema further increases the heterogeneity with respect to RDBS. One example of these concepts is the inheritance mechanism provided by the XML Schema standard for user-defined simple / complex types, which are used for defining simple / complex elements. We have already defined four different mapping patterns dealing with inheritance of user-defined complex types, representing one of the more complex scenarios when mapping XML Schema to RDBS. These patterns capture, on the one hand, three basic realization alternatives for inheritance in RDBS as proposed in the literature (cf., e.g., [5]) and cover, on the other hand, the notion of dynamic binding, meaning that a composite element is able to contain not only nested elements whose dynamic type equals the static one, but also elements whose dynamic type is derived from the static one.
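The text above does not spell out the three realization alternatives or the four patterns. As a hedged illustration only, the following Python sketch shows three ways of storing a type hierarchy in relations that are commonly described in the object-relational mapping literature, and how a query for the static type must also return instances of the derived type (dynamic binding). The type names `PersonType` and `EmployeeType`, all table names, and the choice of these three alternatives are assumptions, not the patterns defined for X-Ray.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- (a) One relation for the whole hierarchy; dyn_type records the dynamic type.
CREATE TABLE a_person (id INTEGER PRIMARY KEY, dyn_type TEXT,
                       name TEXT, salary INTEGER);
INSERT INTO a_person VALUES (1, 'PersonType', 'Huber', NULL),
                            (2, 'EmployeeType', 'Maier', 3000);

-- (b) One relation per type; the derived type stores only its own attributes
--     and shares the key with the base relation.
CREATE TABLE b_person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE b_employee (id INTEGER PRIMARY KEY REFERENCES b_person(id),
                         salary INTEGER);
INSERT INTO b_person VALUES (1, 'Huber'), (2, 'Maier');
INSERT INTO b_employee VALUES (2, 3000);

-- (c) One relation per type; inherited attributes are repeated.
CREATE TABLE c_person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE c_employee (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER);
INSERT INTO c_person VALUES (1, 'Huber');
INSERT INTO c_employee VALUES (2, 'Maier', 3000);
""")

# "All elements whose static type is PersonType": with dynamic binding this
# must include instances of the derived EmployeeType in every alternative.
ALL_PERSONS = {
    "a": "SELECT name FROM a_person ORDER BY id",
    "b": "SELECT name FROM b_person ORDER BY id",
    "c": """SELECT name FROM (
                SELECT id, name FROM c_person
                UNION ALL
                SELECT id, name FROM c_employee
            ) ORDER BY id""",
}

results = {
    variant: [row[0] for row in conn.execute(query)]
    for variant, query in ALL_PERSONS.items()
}
```

In this sketch alternative (c) needs a union to answer the query, whereas (a) and (b) read a single relation; trade-offs of this kind are what mapping patterns for inheritance have to make explicit.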
References
[1]
S. Abiteboul, P. Buneman, and D. Suciu, “Data on the Web: From Relations to Semistructured Data and XML”, Morgan Kaufmann Publishers, 2000.
[2]
S.W. Ambler, “Mapping Objects to Relational Data”, Ambysoft http://www.ambysoft.com/mappingObjects.html, [last access 2003-08-07].
[3]
S. Amer-Yahia, and D. Srivastava, “A Mapping Schema and Interface for XML Stores”, in Fourth ACM CIKM International Workshop on Web Information and Data Management (WIDM'02), Virginia, USA, Nov. 2002.
[4]
S. Amer-Yahia, M. Fernandez, R. Greer, and D. Srivastava, “Logical and Physical Support for Heterogeneous Data”, in Eleventh Int. ACM Conference on Information and Knowledge Management (CIKM'02), Virginia, USA, Nov. 2002.
[5]
P. Atzeni, S. Ceri, S. Paraboschi, and R. Torlone, “Database Systems – Concepts, Languages and Architectures”, McGraw-Hill, 1999.
[6]
S.E. Benham, “IBM XML-Enabled Data Management Product Architecture and Technology”, in XML Data Management: Native XML and XML-Enabled Database Systems, A.B. Chaudhri, A. Rashid, and R. Zicari (eds.), Addison Wesley, 2003.
[7]
T. Berners-Lee, R. Fielding, and L. Masinter, “Uniform Resource Identifiers (URI): Generic Syntax”, Network Working Group, Aug. 1998, http://www.ietf.org/rfc/rfc2396.txt, [last access 2003-08-07].
[8]
P.A. Bernstein, A.Y. Halevy, and R.A. Pottinger, “A Vision for Management of Complex Models”, ACM SIGMOD Record 29(4), Dec. 2000.
[9]
P. Bohannon, J. Freire, J. Haritsa, M. Ramanath, P. Roy, and J. Simeon, “Bridging the XML-Relational Divide with LegoDB: A Demonstration”, in Proceedings of ICDE, 2003.
[10]
R. Bourret, “XML and Databases”, http://www.rpbourret.com/xml/XMLAndDatabases.htm, 2003, [last access 2003-08-07].
[11]
R. Bourret, C. Bornhövd, and A.P. Buchmann, “A Generic Load/Extract Utility for Data Transfer Between XML Documents and Relational Databases”, in 2nd Int. Workshop on Advanced Issues of EC and Web-based Information Systems (WECWIS), San Jose, California, June 2000.
[12]
V. Braganholo, S. Davidson, C. Heuser, “On the Updatability of XML Views over Relational Databases”, Proc. of the 6th Int. Workshop on the Web and Databases (WebDB), San Diego, California, June, 2003.
[13]
M. Carey, D. Florescu, Z. Ives, Y. Lu, J. Shanmugasundaram, E. Shekita, and S. Subramanian, “XPERANTO: Publishing Object-Relational Data as XML”, in Proc. of the Third International Workshop on the Web and Databases (WebDB), in conjunction with ACM SIGMOD, Dallas, Texas, May 2000.
[14]
R. G. G. Cattell, and D.K. Barry (eds.), “The Object Data Standard: ODMG 3.0”, Morgan Kaufmann Publishers, Jan. 2000.
[15]
S. Ceri, P. Fraternali, and S. Paraboschi, “Design Principles for Data-Intensive Web Sites”, ACM SIGMOD Record 28(1), March 1999.
[16]
S. Ceri, P. Fraternali, and S. Paraboschi, “XML: Current Developments and Future Challenges for the Database Community”, in Proc. of the 7th Int. Conf. on Extending Database Technology (EDBT), Springer, LNCS 1777, Konstanz, March 2000.
[17]
D. Chamberlin, J. Robie, and D. Florescu, “Quilt: An XML Query Language for Heterogeneous Data Sources”, Lecture Notes in Computer Science, Springer-Verlag, Dec. 2000.
[18]
E.F. Codd, “Missing Information (Applicable and Inapplicable) in Relational Databases”, ACM SIGMOD Record 15(4), Dec. 1986.
[19]
A. Deutsch, M. F. Fernandez, and D. Suciu, “Storing Semistructured Data in Relations”, Workshop on Query Processing for Semistructured Data and Non-Standard Data Formats, Jerusalem, Jan. 1999.
[20]
A. Deutsch, M. F. Fernandez, and D. Suciu, “Storing Semistructured Data with STORED”, in Proc. of the Int. ACM SIGMOD Conference on Management of Data, Philadelphia, USA, June 1999.
[21]
A. Deutsch, and V. Tannen, “Reformulation of XML Queries and Constraints”, in Proc. of the 9th International Conference on Database Theory (ICDT), Siena, Italy, Jan. 2003.
[22]
A. Deutsch, and V. Tannen, “MARS: A System for Publishing XML from Mixed and Redundant Storage”, in Proc. of the 29th Int. Conference On Very Large Databases (VLDB), Berlin, Germany, 2003.
[23]
G. Ehmayer, G. Kappel, and S. Reich, “Connecting Databases to the Web - A Taxonomy of Gateways”, in Proc. of the 8th Int. Conf. on Database and Expert Systems Applications (DEXA), Springer LNCS 1308, Toulouse, Sept. 1997.
[24]
A. Eisenberg, and J. Melton, “SQL/XML is Making Good Progress”, SIGMOD Record 31(2), 2002.
[25]
M.F. Fernandez, W.-C. Tan, and D. Suciu, “SilkRoute: Trading between Relations and XML”, in Proc. of the 9th Int. World Wide Web Conf. (WWW), Amsterdam, May 2000.
[26]
M.F. Fernandez, Y. Kadiyska, A. Morishima, D. Suciu, and W.-C. Tan, “SilkRoute: A Framework for Publishing Relational Data in XML”, ACM Transactions on Database Systems 27(4), Dec. 2002.
[27]
D. Florescu, A. Levy, and A. Mendelzon, “Database Techniques for the World Wide Web: A Survey”, ACM SIGMOD Record 27(3), Sept. 1998.
[28]
D. Florescu, and D. Kossmann, “Storing and Querying XML Data Using an RDBMS”, IEEE Data Engineering Bulletin, Special Issue on XML, 22(3), Sept. 1999.
[29]
J. Funderburk, G. Kiernan, J. Shanmugasundaram, E. Shekita, and C. Wei, “XTABLES: Bridging Relational Technology and XML”, IBM Systems Journal 41(4), 2002.
[30]
R. Goldman, J. McHugh, J. Widom, “From Semistructured Data to XML: Migrating the Lore Data Model and Query Language”, in Proc. of the 2nd Int. Workshop on the Web and Databases (WebDB), Philadelphia, June 1999.
[31]
Ch. Hiebl, “Implementation of a Declarative Query and Data Manipulation Language for X-Ray”, master's thesis, Department of Information Systems, Johannes Kepler University of Linz, Austria, 2002.
[32]
U. Hohenstein, “Supporting XML in Oracle9i”, in XML Data Management: Native XML and XML-Enabled Database Systems, A.B. Chaudhri, A. Rashid, and R. Zicari (eds.), Addison Wesley, 2003.
[33]
IBM, alphaWorks, “XML Data Mediator”, www.alphaworks.ibm.com/tech/XI, [last access 2003-08-07].
[34]
IBM, alphaWorks, “XML for Tables”, www.alphaworks.ibm.com/tech/xtable, [last access 2003-08-07].
[35]
Infonyte XML database, http://www.infonyte.com, [last access 2003-08-07].
[36]
L. Khan, Q. Chen, and Y. Rao, “A Comparative Study of Storing XML Data in Relational Database Management Systems”, in Proc. of International Conference on Internet Computing, Las Vegas, Nevada, June, 2002, 277-282.
[37]
C.-C. Kanne, and G. Moerkotte, “Efficient Storage of XML Data”, in Proc. of the 16th Int. Conf. On Data Engineering (ICDE), San Diego, March 2000.
[38]
G. Kappel, S. Preishuber, E. Pröll, S. Rausch-Schott, W. Retschitzegger, R.R. Wagner, and Ch. Gierlinger, “COMan - Coexistence of Object-Oriented and Relational Technology”, in Proc. of the 13th Int. Conf. on the Entity-Relationship Approach (ER), Manchester, Dec. 1994.
[39]
G. Kappel, E. Kapsammer, S. Rausch-Schott, and W. Retschitzegger, “X-Ray - Towards Integrating XML and Relational Database Systems”, in Proc. of the 19th Int. Conf. on Conceptual Modeling (ER), LNCS 1920, Springer, Salt Lake City, USA, Oct. 2000.
[40]
G. Kappel, E. Kapsammer, and W. Retschitzegger, “Architectural Issues for Integrating XML and Relational Database Systems – The X-Ray Approach”, in Proc. of the Workshop on XML Technologies and Software Engineering (XSE), 23rd Int. Conf. On Software Engineering (ICSE), Toronto, Canada, May 2001.
[41]
G. Kappel, E. Kapsammer, and W. Retschitzegger, “XML and Relational Database Systems – A Comparison of Concepts”, in Proc. of the 2nd Int. Conf. on Internet Computing (IC), CSREA Press, Las Vegas, USA, June 2001.
[42]
G. Kappel, E. Kapsammer, and W. Retschitzegger, “X-Ray – Towards Integrating XML and Relational Database Systems”, Technical Report, Department of Information Systems (IFS), Johannes Kepler University of Linz, Austria, July 2000, http://www.ifs.uni-linz.ac.at/ifs/research/publications/papers00.html, [last access 2003-08-07].
[43]
G. Kappel, B. Pröll, W. Retschitzegger, and W. Schwinger, “Customisation for Ubiquitous Web Applications - A Comparison of Approaches”, Int. Journal of Web Engineering and Technology (IJWET), Inaugural Volume, Inderscience Publishers 2003.
[44]
W. Keller, “Mapping Objects To Tables - A Pattern Language”, Second European Conference on Pattern Languages of Programming (EuroPlop), Irsee, Germany, July 1997.
[45]
R. Krishnamurthy, R. Kaushik, and J. Naughton, “XML-SQL Query Translation Literature: The State of the Art and Open Problems”, The first XML Database Symposium (XSym03), held in conjunction with VLDB2003, Berlin, Sept. 2003.
[46]
I. Manolescu, D. Florescu, D. Kossmann, F. Xhumari, and D. Olteanu, “Agora: Living with XML and Relational”, in Proc. of the 26th Int. Conf. On Very Large Data Bases (VLDB), Cairo, Egypt, 2000.
[47]
I. Manolescu, D. Florescu, and D. Kossmann, “Answering XML Queries over Heterogeneous Data Sources”, in Proc. of the Int'l. Conf. on Very Large Databases (VLDB), Roma, Italy, 2001.
[48]
D. Obasanjo, and S.B. Navathe, “A Proposal for an XML Data Definition and Manipulation Language”, EEXTT, 2002, 1-2.
[49]
Poet Software Corporation, www.poet.com, [last access 2003-08-07].
[50]
B. Pröll, H. Sighart, W. Retschitzegger, and H. Starck, “Ready for Prime Time - Pre-Generation of Web Pages in TIScover”, in Proc. of the 8th Int. ACM Conference on Information and Knowledge Management (CIKM), Kansas City, Missouri, Nov. 1999.
[51]
M. Ramanath, J. Freire, J. Haritsa, and P. Roy, “Searching for Efficient XML-to-Relational Mappings”, XML Database Symposium (XSym), in conjunction with VLDB 2003, Berlin, Germany, 2003.
[52]
J. Rumbaugh, I. Jacobson, and G. Booch, “The Unified Modeling Language Reference Manual”, Addison-Wesley, 1999.
[53]
M. Rys, “State-of-the-art Support in RDBMS: Microsoft SQL Server's XML Features”, IEEE Data Engineering Bulletin, 24(2), June 2001.
[54]
M. Rys, “XML Support in Microsoft SQL Server 2000”, in XML Data Management: Native XML and XML-Enabled Database Systems, A.B. Chaudhri, A. Rashid, and R. Zicari (eds.), Addison Wesley, 2003.
[55]
A. Sahuguet, “Kweelt, the Making-of: Mistakes Made and Lessons Learned”, Technical Report, Department of Computer and Information Science, University of Pennsylvania, http://db.cis.upenn.edu/DL/kweelt-TR.pdf, Nov. 2000, [last access 2003-08-07].
[56]
A.R. Schmidt, M.L. Kersten, M.A. Windhouwer, and F. Waas, “Efficient Relational Storage and Retrieval of XML Documents”, Workshop on the Web and Databases (WebDB), Dallas, May 2000.
[57]
H. Schöning, and J. Wäsch, “Tamino – An Internet Database System”, in Proc. of the 7th Int. Conf. on Extending Database Technology (EDBT), Springer, LNCS 1777, Konstanz, March 2000.
[58]
M. Schrefl, M. Bernauer, E. Kapsammer, B. Pröll, W. Retschitzegger, and T. Thalhammer, “Self-Maintaining Web Pages”, Information Systems (IS), International Journal, 28(8), Elsevier Science Ltd., 2003, 1005-1036.
[59]
J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, and J. Naughton, “Relational Databases for Querying XML Documents: Limitations and Opportunities”, VLDB Conference, Sept. 1999.
[60]
J. Shanmugasundaram, E. Shekita, R. Barr, M. Carey, B. Lindsay, H. Pirahesh, and B. Reinwald, “Efficiently Publishing Relational Data as XML Documents”, VLDB Journal 10(2-3), 2001.
[61]
J. Shanmugasundaram, J. Kiernan, E. Shekita, C. Fan, and J. Funderburk, “Querying XML Views of Relational Data”, VLDB Conference, Sept. 2001.
[62]
K. Shoens, A. Luniewski, P. Schwarz, J. Stamos, and J. Thomas, “The Rufus system: Information organization for semi-structured data”, in Proc. of the Int. Conf. On Very Large Data Bases (VLDB), Dublin, Ireland, 1993.
[63]
S. Spaccapietra, C. Parent, and Y. Dupont, “Model Independent Assertions for Integration of Heterogeneous Schemas”, VLDB Journal, 1(1), 1992, 81-126.
[64]
B. Surjanto, N. Ritter, and H. Loeser, “XML Content Management based on Object-Relational Database Technology”, in Proc. of the 1st Int. Conf. On Web Information Systems Engineering (WISE), Hong Kong, June 2000.
[65]
I. Tatarinov, S.D. Viglas, K. Beyer, J. Shanmugasundaram, E. Shekita, “Storing and Querying Ordered XML Using a Relational Database System”, SIGMOD Conference, June 2002.
[66]
F. Tian, D.J. DeWitt, J. Chen, and C. Zhang, “The Design and Performance Evaluation of Alternative XML Storage Strategies”, Sigmod Record, 31(1), March 2002.
[67]
J. Widom, “Data Management for XML - Research Directions”, IEEE Data Engineering Bulletin, Special Issue on XML, 22(3), Sept. 1999.
[68]
World Wide Web Consortium (W3C), “Namespaces in XML”, W3C Recommendation, Jan. 1999, http://www.w3.org/TR/1999/REC-xml-names-19990114/, [last access 2003-08-07].
[69]
World Wide Web Consortium (W3C), “Extensible Markup Language (XML) 1.0 (Second Edition)”, W3C Recommendation, Oct. 2000, http://www.w3.org/TR/2000/REC-xml-20001006, [last access 2003-08-07].
[70]
World Wide Web Consortium (W3C), “XML Schema”, W3C Recommendation, May 2001, http://www.w3.org/XML/Schema, [last access 2003-08-07].
[71]
World Wide Web Consortium (W3C), “XML Path Language (XPath) 1.0”, W3C Recommendation, Nov. 1999, http://www.w3.org/TR/xpath, [last access 2003-08-07].
[72]
World Wide Web Consortium (W3C), “XQuery 1.0: An XML Query Language”, W3C Working Draft, May 2003, http://www.w3.org/TR/xquery, [last access 2003-08-07].
[73]
World Wide Web Consortium (W3C), “XQuery 1.0 and XPath 2.0 Data Model”, http://www.w3.org/TR/xpath-datamodel, [last access 2003-08-07].