
Storage Techniques and Mapping Schemas for XML

Sihem Amer-Yahia
AT&T Labs Research

ABSTRACT

As an increasing amount of XML data is being exchanged between Web applications, reliable XML storage systems that make this data persistent and process it efficiently are becoming necessary. The flexibility of representation offered by XML raises challenging issues when storing XML data in relational or object systems. Major database vendors (IBM, Microsoft and Oracle) provide a declarative mapping schema in which database administrators (DBAs) can express how to map XML data into their system. One of the main benefits of using a mapping schema is that it makes mappings transparent to DBAs and thus helps them modify and combine mappings when tuning an XML storage system. Unfortunately, existing mapping schemas for XML storage are proprietary and can be used with only one storage backend. We developed a new mapping schema that identifies orthogonal aspects of mapping XML into relations and provides a powerful tool in which most existing research and commercial solutions for storing XML can be expressed. If mapping information is made accessible, optimizing and reusing applications on top of XML stores will be easier. In particular, an application such as XML data exchange would greatly benefit from the knowledge of how XML data is stored in the underlying systems. We designed a mapping interface that provides a simple way of querying mappings expressed in our schema. This paper presents an overview of storage techniques and mapping schemas for XML and discusses related open issues.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 2003 ACM X-XXXXX-XX-X/XX/XX ...$5.00.

1. INTRODUCTION

XML is the standard format for data exchange between Web applications. Many such applications produce and consume large volumes of XML data and thus require efficient and reliable storage systems. The flexibility of representation in XML raises challenging issues for storing XML data in relational or object database management systems. To illustrate this flexibility, we classify XML documents into three broad categories: regular documents, which have regular structure and contain mostly scalar values; mixed documents, which have irregular structure and contain mostly large blocks of annotated text; and “semi-structured” documents, whose structure is unknown. Most existing documents fit into the first two categories. In fact, this is true in all data exchange applications where a schema is provided. Thus, we will consider regular and mixed documents. A discussion on how to extract structure from semi-structured documents such as in [29] and how to store these documents [19] is beyond the scope of this paper.

The structure of regular documents is typically known and is specified in a DTD or XML Schema [42]. For example, the DTDs developed by the Health Level Seven standardization committee (HL7) [23] define documents in which document structure is regular and terminal data is mostly scalar values. In mixed documents, the structure is flexible and is typically specified using “mixed content models” [42], which permit arbitrary interleaving of text with XML markup. For instance, the Library of Congress publishes congressional bills [43] as XML documents that contain data with regular structure as well as large portions of annotated text with irregular structure. We expect that as XML becomes more widely used, an increasing number of documents will fit into both categories.

Developing a complete XML storage system that can efficiently support these categories is challenging. The first challenge is to capture variability due to the presence of optional attributes, optional and repeated elements, and alternatives of multiple elements. Another challenge is to preserve document structure, i.e., the relative order of values in a document, which may be significant to an application. Finally, providing support for mixed content is one of the most challenging issues.

Many solutions for storing XML have been proposed, including the use of relational and object systems, LDAP directories, and native XML databases. Major database vendors (IBM [25], Microsoft [33] and Oracle [7]) provide a set of tools to manipulate XML data. The main tool is a declarative mapping schema in which database administrators (DBAs) can express mappings between XML and relational values. Mapping schemas are end-user languages that make mappings transparent and help DBAs combine and change mappings when tuning an XML storage system. However, existing mapping schemas are tailored to one particular system and hard-code some default mappings on behalf of users. Therefore, they cannot be used with an arbitrary relational backend. Our first contributions are:

• We summarize XML storage techniques developed in research and XML storage tools provided by IBM, Microsoft and Oracle. In particular, we describe their mapping schemas for XML storage.

• We discuss orthogonal aspects of XML-to-relational mappings and present our mapping schema MXM, which maps each aspect separately, thereby allowing users to express most existing storage techniques and create new mappings. MXM is independent of whether a DTD or an XML Schema is used to describe input documents. MXM allows users to specify defaults explicitly. Finally, MXM is extensible and can incorporate new mappings.

Some applications built on top of stored XML data need only limited information on document content and structure; others rely on detailed information to guarantee efficiency. For example, e-commerce applications that publish catalogs or portions of catalogs use only coarse information about how XML documents are stored. On the other hand, applications that need to correlate data in an XML document, such as medical records data [23], rely on a finer knowledge of the document structure and data types. Thus, our next contribution is:

• An interface, IMXM, to query XML-to-relations mappings written in MXM. IMXM is designed as a library of functions that are easy to use inside an application.

In the next section, we present a classification of XML documents and the properties of XML data that impact storage. In Section 3, we review XML storage techniques developed in research. In Section 4, we describe XML storage in IBM, Microsoft and Oracle. In Section 5, we present our mapping schema and interface. Section 6 is a discussion of related issues in the context of XML storage. We do not discuss native storage solutions, i.e., systems that are built specifically for storing XML. We refer readers to [3, 10] for an overview of these systems and to [15] for a comprehensive coverage of XML storage and publishing on relational and object systems.

2. DOCUMENTS AND QUERIES

We describe two classes of XML documents: regular and mixed. Regular documents resemble “de-normalized” relational data. They have regular structure and contain mostly scalar values. Examples include inventory data, logs, and stock quotes. Mixed documents have flexible structure, which permits arbitrary interleaving of text with XML markup. Examples include manuals, transcripts, and tax forms. In practice, documents might exhibit characteristics from each category. For example, product catalogs contain regular content derived from inventory systems and mixed content describing individual products.

2.1 Regular Document Example

In order to facilitate exchanging and processing electronic healthcare documents, Health Level Seven (HL7) has been developing document standards for the health-care industry since September 1996 [23]. The example below describes medical records. A patient has information such as family and given names, date of birth and observations.

Jones William 1961-06-13 150 Na Above high 220 K Normal ...

HL7 documents illustrate the variability that occurs in regular documents. A patient’s driver’s license number is optional. A patient may have any number of observations. The containment between parent and child elements is significant, but the order in which observations appear in a patient’s record is insignificant. Queries that manipulate regular documents are mainly select-project-join queries and sorting by value. A typical query might return all admission records of patients born after 8/30/01, sorted by family and given names. Queries may also contain grouping operations (e.g., return all patients grouped by date of birth).

2.2 Mixed Document Example

The Library of Congress publishes congressional bills as XML documents [43]. The example below contains information about a bill such as its stage, congress and action, which is described by a date and a possibly large portion of annotated text. 110th CONGRESS 1st session H.R. 133 IN THE HOUSE OF REPRESENTATIVES Mr. English (for himself and Mr. Coyne) introduced the following bill; which was referred to the Committee on Financial Services...

In this example, the order in which text in a bill’s action interleaves with markup is significant. More subtle differences such as the presence of white spaces and new lines in the annotated text might also be meaningful [41]. We classify queries that operate on mixed documents into three categories. A detailed description of these queries is given in [2].

Queries on text only. These queries are similar to Information Retrieval (IR) full-text search queries [9, 34]. Examples include keyword search, stemming and proximity search. The query that finds all bills where “striking” and “amended” are within six intervening words in the text is an example of a proximity query. In [22], the authors develop a technique that extends IR keyword search to XML.

Queries on text and structure. These queries both search for keywords within text and query the relative order of elements and text. A query that returns text elements containing “exemption” and “social security” and their preceding and following text elements is an example of combining full-text search with querying document order. Several proposals [9, 12, 21, 39] have studied techniques for combining IR search with document structure.

Queries that span structure. These queries apply full-text search operators while ignoring specific markup in the searched text. An example is the query that returns bills containing the phrase “referred to the committee on financial services”. To be successful, this query must ignore the markup that occurs inside the phrase. In [5], the authors present a system to support this kind of query.
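To make the proximity query above concrete, the following Python sketch checks whether two keywords occur with at most a given number of intervening words. The function name within_proximity, the tokenization rule and the sample sentences are assumptions made for illustration; a full-text engine would typically answer such queries from a positional index rather than by scanning the text.

```python
import re


def within_proximity(text, first, second, max_gap=6):
    """Return True if `first` and `second` occur in `text` with at most
    `max_gap` intervening words (in either order)."""
    # Naive tokenization: lower-cased runs of word characters.
    words = [w.lower() for w in re.findall(r"\w+", text)]
    pos_first = [i for i, w in enumerate(words) if w == first.lower()]
    pos_second = [i for i, w in enumerate(words) if w == second.lower()]
    # abs(i - j) - 1 is the number of words strictly between two positions.
    return any(
        abs(i - j) - 1 <= max_gap
        for i in pos_first
        for j in pos_second
        if i != j
    )


# Hypothetical bill fragments with the markup already removed.
NEAR = "by striking section 2 and the text as amended"
FAR = "by striking the first of the many long sections that were later amended"
```

Here within_proximity(NEAR, "striking", "amended") holds because exactly six words separate the two keywords, while the same call on FAR does not.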

2.3 Properties of XML Data and Queries

The semantics of XPath [17, 18] and XQuery [14] strongly depend on XML data properties. These properties significantly impact storage design.

Attributes vs. Elements. Attributes and elements might contain scalar values or lists. The question of whether they should be stored similarly is open. One should keep in mind that elements might contain a complex structure and that attributes are not ordered whereas elements are. XPath and XQuery are designed to query attribute and element values and the relationship between attributes and elements.

Heterogeneity. Optionality, repetition and alternation create variance in regular content. Thus, documents conforming to the same XML schema may have substantially different structures. A thorough study of how to extend relational systems to support heterogeneity is presented in [4]. XPath 2.0 [18] permits querying of heterogeneous structure. For example, the XPath expression PATIENT/(surgery|checkup) returns all surgeries and checkups associated with patients.

Identity and structure. The identity of an element in an XML document depends upon the location of the element in the document. Node identity is central to querying and, in XPath, is used to define the semantics of operators such as element equality and union, intersection, and difference of element sequences. Document structure, i.e., parent/child relationships and sibling order, may be significant to an application and is always significant in mixed documents. XPath axes are defined in terms of global document order. The XPath expression bill/co-sponsor[./text()=‘Mr. English’ and following-sibling::co-sponsor/text()=‘Mr. Coyne’] returns all co-sponsor elements whose text is “Mr. English” and that are followed by a sibling co-sponsor whose text is “Mr. Coyne”, i.e., “Mr. Coyne” follows “Mr. English” in the text.

Flexibility in mixed content. The schema associated with mixed documents may specify that only specific markup is permitted in the annotated text or that any tag is permitted (the case of an open content model [41]). The design of appropriate indices to query markup inside text has to account for the fact that in some applications, specific markup could occur inside the text and in others, any markup might occur.

Structure dependence. Parent/child relationships might indicate dependency between an element and its parent (e.g., an observation depends on a patient). Dependent elements are typically accessed “via” their parent or ancestors. This dependency may be used to determine whether “direct” access (e.g., indexing) to elements is necessary or not and to decide whether to inline dependent elements with their parent in the schema or to cluster them on disk.

Presence of a schema. Although many existing XML documents do not conform to a predefined XML schema, most XML documents in data exchange applications will have an associated schema that specifies the types of terminal data and constraints on document structure. The schema should be used for storage if it exists.

3. STORAGE TECHNIQUES

The main focus of existing techniques for storing XML data in relational and object systems is to capture element identity and document structure and order. Variability in content is captured using null values. Very little has been done for handling mixed content. We will mention support for mixed content in the section on storage tools from industry (Section 4). In this section, we focus on storage techniques for capturing element identity and document structure and on techniques that make use of available schema information.

3.1 Capturing Identity, Structure and Order

All existing techniques rely on computing a unique identifier for each node in the XML document tree. These identifiers are used in multiple ways to capture document structure. In most cases, recovering document fragments requires multiple joins. Some cases benefit from the ordered nature of XML data to implement efficient join algorithms such as sort-merge [1, 44]. We give a brief description of the major techniques that have been proposed and refer readers to [36], which compares several of these techniques.

3.1.1 Foreign Keys

The simplest way of capturing parent/child relationships is to use a foreign key in the child element that refers to the key of its parent element. In addition, the sibling order can be captured using an ordinal value (which can be the key of the element itself). We refer to this technique as KFO for Key, Foreign key and Ordinal. KFO is used in a number of storage methods on relational and object databases [20, 40]. The main idea of these methods is to explore several alternatives for storing a tree in a generic relational schema in the absence of a schema for input documents. Four alternative mappings have been explored for edges, two for values. The approaches that store edges are studied in [20]. Each node in the XML document tree is assigned a unique integer. Terminal values are either stored in a separate table V[vid,value] or inlined in the relation that stores edges. The edge alternatives are:

The Edge relation stores all edges in a single table [source,ordinal,name,flag,target]. source and target are the identifiers of the edge end-points; name is the tag on the edge; ordinal is an integer that captures sibling order; and flag indicates whether the target node of an edge is a node or a value.

The Attribute relations alternative is the result of horizontally partitioning the Edge table on its name field. This alternative creates as many relations as there are distinct tags in a document, each with the same schema as Edge (except for the field name).

Finally, the Universal and the Normalized Universal alternatives result from applying a full outer-join to the Attribute relations.

Possible variations of these tables exist. The Edge table might have a field whose value is the type of a node in the XML tree. Mixed documents could be captured using a special value “text” for text nodes and “element” for element nodes in the text.
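As an illustration of the Edge alternative, the following Python sketch shreds a small document into rows of the form (source, ordinal, name, flag, target), with terminal values inlined in the target field. It is a minimal sketch under several assumptions that are not taken from [20]: the function name shred_to_edges, the flag values "ref" and "val", the "@" prefix for attribute names, the use of 0 as the identifier of a virtual document root, and the Value element in the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical sample in the spirit of the HL7 example.
SAMPLE = (
    '<PATIENT IDNum="635"><FaNa>Jones</FaNa><GiNa>William</GiNa>'
    "<Obx><Value>150</Value></Obx></PATIENT>"
)


def shred_to_edges(xml_text):
    """Return Edge rows (source, ordinal, name, flag, target) for one document."""
    root = ET.fromstring(xml_text)
    ids = count(1)  # unique integer per non-terminal node
    rows = []

    def visit(elem, node_id):
        ordinal = 0
        # Attributes become value edges; giving them ordinals is a
        # simplification, since XML attributes are unordered.
        for name, value in elem.attrib.items():
            ordinal += 1
            rows.append((node_id, ordinal, "@" + name, "val", value))
        for child in elem:
            ordinal += 1
            if len(child) == 0 and not child.attrib:
                # Terminal element: inline its text as the edge target.
                text = (child.text or "").strip()
                rows.append((node_id, ordinal, child.tag, "val", text))
            else:
                child_id = next(ids)
                rows.append((node_id, ordinal, child.tag, "ref", child_id))
                visit(child, child_id)

    root_id = next(ids)
    rows.append((0, 1, root.tag, "ref", root_id))  # edge from the virtual root
    visit(root, root_id)
    return rows
```

Partitioning the resulting rows on the name field yields one row set per tag, which corresponds to the Attribute relations alternative.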

3.1.2 Interval Encoding

INTERVAL creates an interval for each node in the XML document tree that records the subtree rooted at that element node. The interval encoding of a node has to be included in the interval of its parent node. A common way of generating intervals is to generate a unique identifier for each node in a preorder traversal of the document tree and a unique identifier in a postorder traversal of the tree. In addition, in order to distinguish children from descendants, a level number is recorded with each node. Finally, a unique document number distinguishes nodes from multiple documents. The INTERVAL value of a node is composed of four numbers at most.

In [1, 44], the authors develop algorithms with nice linearity properties to compute children, ancestors and descendants of nodes.
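The preorder/postorder variant of INTERVAL described above can be sketched in Python as follows. Each element receives the four numbers (document number, preorder rank, postorder rank, level); a node is a descendant of another exactly when its preorder rank is larger and its postorder rank is smaller. The function names and the sample document are assumptions made for illustration, and the sketch does not reproduce the join algorithms of [1, 44].

```python
import xml.etree.ElementTree as ET


def interval_encode(xml_text, doc_id=1):
    """Return (tag, (doc_id, pre, post, level)) pairs for every element."""
    root = ET.fromstring(xml_text)
    counters = {"pre": 0, "post": 0}
    labels = []

    def visit(elem, level):
        counters["pre"] += 1          # rank on entering the node
        pre = counters["pre"]
        for child in elem:
            visit(child, level + 1)
        counters["post"] += 1         # rank on leaving the node
        labels.append((elem.tag, (doc_id, pre, counters["post"], level)))

    visit(root, 0)
    return labels


def is_descendant(d, a):
    """d lies inside a's interval: same document, later pre, earlier post."""
    return d[0] == a[0] and a[1] < d[1] and d[2] < a[2]


def is_child(c, p):
    """A child is a descendant exactly one level below its parent."""
    return is_descendant(c, p) and c[3] == p[3] + 1
```

The level number is what separates children from deeper descendants: both satisfy the interval containment test, but only children differ from the parent by exactly one level.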

3.1.3 Dewey

DEWEY is named after the Dewey Decimal Classification, which was developed for general knowledge classification [30]. This encoding is based on assigning a unique integer to each node in the document tree and using it to record at each node the path from the node to the root of the document by concatenating the identifiers of all nodes present on that path starting from the node being identified. Document structure is recovered using substring comparisons between the DEWEY values of nodes. DEWEY is the technique that is used in LDAP directories to encode entity identity and hierarchical relationships in a database [24]. In [26], the authors develop stack-based algorithms with linear properties to compute children and descendants of entities in LDAP. These algorithms are similar to the twigstack algorithms developed in [1]. Since the DEWEY value of a node contains its unique id (which can also serve as an ordinal value) and the id of its parent, DEWEY is the most complete encoding of node identity, document structure and document order, when compared to KFO and INTERVAL. However, unlike INTERVAL, where only four values are used for each node, in DEWEY, the deeper a node is in the document tree, the longer the path from the root to the node and the longer the DEWEY value of the node.
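The following Python sketch illustrates one common variant of this encoding. As an assumption that differs in detail from the description above, labels are written root-first and use sibling ordinals as node identifiers, so that structure is recovered with prefix comparisons instead of general substring comparisons. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def dewey_encode(xml_text):
    """Return (tag, label) pairs in document order; a label is a tuple of
    sibling ordinals along the path from the root to the node."""
    root = ET.fromstring(xml_text)
    out = []

    def visit(elem, label):
        out.append((elem.tag, label))
        for ordinal, child in enumerate(elem, start=1):
            visit(child, label + (ordinal,))

    visit(root, (1,))
    return out


def is_ancestor(a, d):
    """a is a proper prefix of d."""
    return len(a) < len(d) and d[: len(a)] == a


def is_parent(p, c):
    """c extends p by exactly one component."""
    return len(c) == len(p) + 1 and c[:-1] == p


def to_string(label):
    """Dotted form, e.g. (1, 1, 2) -> '1.1.2'."""
    return ".".join(str(part) for part in label)
```

Comparing labels as tuples also yields document order, and the length of a label grows with the depth of its node, which is the space cost noted above.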

3.1.4 Paths

In [38], the authors store paths in XML documents using the following relational schema, denoted XRel.

Path[pathID,pathexp]
Element[docID,pathID,start,end,index,reindex]
Attribute[docID,pathID,start,end,value]
Text[docID,pathID,start,end,value]

A relation is created for each node type (element, attribute, text). Paths are stored in the relation Path to avoid redundancies. Each path is stored as a string and sub-string matching is used at query time. Since the same path might be shared by several nodes, paths are not sufficient to recover document structure. Thus, the attributes start and end are used to record the region of each node. For an element, start records the start position of the element in a document, end records its end position. The attribute index records document order and the attribute reindex records reverse document order. Both attributes are used for efficient query processing. By storing paths, XRel reduces the number of join operations that need to be performed to recover document structure. It also uses B+-trees and R-trees.
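A simplified Python sketch of the Path and Element relations is given below. It is not the XRel implementation: the function names are hypothetical, start and end are visit counters standing in for the positions in the document text, and the index and reindex attributes are omitted.

```python
import xml.etree.ElementTree as ET


def build_path_tables(xml_text, doc_id=1):
    """Return (path_ids, elements): a pathexp -> pathID mapping and a list
    of (docID, pathID, start, end) rows sorted by start."""
    root = ET.fromstring(xml_text)
    path_ids = {}   # plays the role of the Path relation
    elements = []   # plays the role of the Element relation
    counter = [0]

    def visit(elem, prefix):
        pathexp = prefix + "/" + elem.tag
        # Each distinct path expression is stored once.
        path_id = path_ids.setdefault(pathexp, len(path_ids) + 1)
        counter[0] += 1
        start = counter[0]
        for child in elem:
            visit(child, pathexp)
        counter[0] += 1
        elements.append((doc_id, path_id, start, counter[0]))

    visit(root, "")
    elements.sort(key=lambda row: row[2])
    return path_ids, elements


def match_suffix(path_ids, suffix):
    """String matching on stored paths, e.g. '/PATIENT/Obx'."""
    return {pid for pathexp, pid in path_ids.items() if pathexp.endswith(suffix)}


def contains(outer, inner):
    """Region containment between two Element rows of the same document."""
    return outer[0] == inner[0] and outer[2] < inner[2] and inner[3] < outer[3]
```

Two Obx elements under the same PATIENT share a single row in the path table, so a path query selects both with one string match, while start and end still distinguish them and relate them to their enclosing elements.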

3.2 Using Schemas

3.2.1 Fixed Mapping

In this category, we find DTD-driven mappings: Basic, Shared and Hybrid [37]. The decision of whether to create a table for an element or to inline it with its parent is central to these approaches and is made on the basis of whether or not an element is “shared” by other elements in the DTD. These solutions vary in the amount of redundancy they may generate (an element could be inlined in several of its referencing elements). In all of them, KFO is used to capture document structure. Basic creates relations for every element in the input DTD. For example, in the HL7 example (see Section 2.1), a separate relation will be generated for each of FaNa and GiNa. Shared reduces the number of relations created by Basic by creating separate relations only for elements in the DTD graph whose nodes have an in-degree greater than one, such as FaNa. Elements with an in-degree of one are inlined. Elements with an in-degree of zero are stored in separate relations because they are not reachable from any other element. Finally, of all mutually recursive elements having in-degree one, one of them is stored in a separate relation. Hybrid is similar to Shared except that it performs additional inlining. In particular, it inlines elements with an in-degree greater than one that are not recursive or reached through a “*” edge.

3.2.2 Flexible Mapping

LegoDB [11] considers an XML Schema as input and adopts an optimization approach to derive the best corresponding relational schema, i.e., the one that optimizes a given query workload. LegoDB applies semantics-preserving XML transformations to the XML Schema such as: Inlining, Union Factorization, Repetition Merge, Wildcard rewritings and Unions to options. An example of rewriting a repetition is given below:

type Patient = PATIENT [ @IDNum[ String ], PaNa [ PANA ], Obx 1,* ]
type Patient = PATIENT [ @IDNum[ String ], PaNa [ PANA ], Obx, Obx 0,* ]

To estimate the cost of an XML Schema, LegoDB transforms it into a relational schema using straightforward rules such as creating one relation for each type name and using KFO for document structure. The benefit of LegoDB lies in applying schema transformations on the XML Schema, thereby exploring mappings that might be harder to find if the rewritings were applied to a relational schema.
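The repetition rewriting in the example above can be sketched as a small Python function. The representation of a type as a list of (name, min, max) triples, with None standing for “*”, and the function name split_repetition are assumptions made for illustration; they are not LegoDB's internal representation.

```python
def split_repetition(fields):
    """Rewrite every field with bounds (n, *), n >= 1, into one mandatory
    occurrence followed by a field with bounds (n - 1, *)."""
    out = []
    for name, lo, hi in fields:
        if lo >= 1 and hi is None:
            out.append((name, 1, 1))          # the first occurrence
            out.append((name, lo - 1, None))  # the remaining occurrences
        else:
            out.append((name, lo, hi))
    return out


# type Patient = PATIENT [ @IDNum[ String ], PaNa [ PANA ], Obx 1,* ]
PATIENT = [("@IDNum", 1, 1), ("PaNa", 1, 1), ("Obx", 1, None)]
```

Applied to PATIENT, the function yields one mandatory Obx followed by Obx with bounds 0,*. Both types accept the same documents, but the rewritten one lets the first observation be inlined with the patient while the remaining ones are stored separately.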

4. STORAGE TOOLS

The products of major database vendors, Oracle 9iR2 [7], IBM DB2 XML Extender [25] and Microsoft SQL Server [33], offer XML storage and publishing tools on top of their storage system. Due to the mismatch between the XML data model and the data model of the storage system, a mapping between the two data models is necessary. Consequently, each vendor provides mapping interfaces to help users specify their mappings from XML to relations (and objects) using a declarative mapping schema and special purpose queries. In these systems, document structure is captured using KFO and mixed content is supported in a limited way.

4.1 IBM DB2 XML Extender

Users can annotate a simplified XML Schema with mapping information. The resulting schema is referred to as the XML Extender Document Access Definition (DAD). DADs are used both for publishing relational data in XML and for storing XML. A DAD mapping defines RDB Nodes. Each table needs a primary key and column types. Two functions are provided: dxxShredXML() to decompose an incoming XML document and dxxGenXML() to recompose a shredded XML document. A number of stored procedures are provided for handling XML columns. XMLVarCharFromFile() is used for type conversion. Cast functions such as Varchar(XMLVarChar) are used for retrieval. Update functions such as Update(xmlobj, path, value) and selection functions using XPath such as Extractvarchar() are also provided. The example below shows a portion of the DAD used to map the HL7 example into DB2.

IDNum > "635"


This mapping creates a table Patient_tab to store PATIENT elements and a column named Patient_key using the attribute IDNum of each PATIENT element. More complex mappings, e.g., using join conditions and vertical partitioning of an element into multiple tables, could be provided. In addition, XML columns can be registered with the types: XMLCLOB for large XML documents; XMLVARCHAR for small XML documents; and XMLFile for XML documents stored outside DB2. XML Extender provides an XML DTD repository. Each XML database contains a DTD reference table called DTD_REF, which is used to store meta information on users’ mappings. Users can access this table to insert their own DTDs. These DTDs can be used to validate XML documents. Given a mapping, the system reads an arbitrary XML document and loads it into a DB2 database. Users do not have to write loading programs by hand. Mixed content is handled using CLOBs (Character Large OBjects) and side tables for indexing structured data contained in text. Side tables are automatically updated when new documents are inserted. This method of handling mixed content is the most advanced among the solutions provided by database vendors.

4.2 Microsoft SQL Server

Microsoft provides extensions to SQL to publish relational data as XML documents using the FOR XML clause. There are three publishing modes: RAW, AUTO and EXPLICIT. RAW creates flat XML documents by converting each row in the SQL result into an XML element and each non-NULL column value to an attribute (the column name becomes the attribute name). In the AUTO mode, query results are used to build nested documents where each table in the FROM clause is represented as an XML element. The columns listed in the SELECT clause are mapped to attributes or sub-elements. EXPLICIT mode provides more flexible publishing of relational data. It defines a SQL view to assemble relevant rows. Special column names such as Tag and Parent are used. Nesting is explicitly specified as part of the query.

Microsoft adopts three solutions for storing XML documents. It implements the generic Edge technique described in Section 3.1. It allows users to annotate an XML schema in order to determine the XML-to-relations mapping. Finally, it provides OpenXML.

Annotated schemas are created using the XML Schema Definition (XSD) language. The XSD language is the successor to the XML-Data Reduced (XDR) schema definition language. This solution is implemented in SQLXML, which enables XML support for SQL Server 2000 databases. SQLXML includes an XDR to XSD converter tool that is designed to help convert annotated XDR schemas to equivalent XSD schemas. An XSD schema is enclosed in a schema element. Additional attributes that define the namespace in which the schema resides and the namespaces that are used in the schema can be defined for that element. Below is an annotated XSD schema that describes a mapping to a relational database [28]. The view specification contains embedded SQL references. As with IBM, this schema is used both for publishing and for storage.

The mapping describes the names of the tables and columns used to store XML documents. Mappings can also describe the names of the KFO fields that capture document structure. Mapping schemas are parsed to generate the corresponding relational schema. The third solution, OpenXML, compiles XML documents into an internal DOM representation using sp_xml_preparedocument. The generic syntax is OPENXML(<XML doc handler>, <path expression>, <flags>) WITH (schema | Table). This T-SQL function builds rowsets from an XML stream. OpenXML can be attribute-centric or element-centric. An example of an attribute-centric mapping is given below:

Select * from OpenXML(@pat, '/HL7/PATIENT', 1) WITH (IDNum int, GiNa varchar(20))
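To illustrate what an attribute-centric mapping computes, the following Python sketch emulates it outside the database: each node selected by the row path yields one row, and each declared column is read from the attribute of the same name. This is not SQL Server code. The function name, the sample document and the use of a relative path (required by the ElementTree XPath subset) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in which IDNum and GiNa are attributes.
SAMPLE = (
    '<HL7><PATIENT IDNum="635" GiNa="William"/>'
    '<PATIENT IDNum="700"/></HL7>'
)


def attribute_centric_rows(xml_text, row_path, columns):
    """Return one tuple per node matched by `row_path`; `columns` is a list
    of (attribute name, conversion function) pairs."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall(row_path):
        row = []
        for name, convert in columns:
            raw = node.get(name)
            # A missing attribute becomes None, mirroring a NULL column.
            row.append(convert(raw) if raw is not None else None)
        rows.append(tuple(row))
    return rows


# './PATIENT' is evaluated relative to the HL7 root element.
ROWS = attribute_centric_rows(SAMPLE, "./PATIENT", [("IDNum", int), ("GiNa", str)])
```

An element-centric mapping would read the column values from child elements of each selected node instead of from its attributes.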

In order to load the XML data in the underlying database, users provide a decomposition of XML documents into multiple tables in a programmatic way, which can make this task tedious.

1. The XML document is first parsed into a DOM tree.
2. Users must write XPath expressions to specify XML values to map into tuples and attribute values.

As an example, the user could define a table containing patient records. The user can then specify how tuples in this table are computed using the query: /PATIENT row in Table Patient specifies that each distinct patient node corresponds to a distinct row in the Patient table. The query /PATIENT//FaNa LastName in Table Patient specifies how to compute the value of the LastName column in the Patient table.

Microsoft supports storing XML documents in CLOBs. However, unlike IBM, no side tables are provided to index mixed content data. Templates are used to query the relational database that stores XML data. Templates are XML documents that provide a parameterized query and update mechanism to the database. In a template, elements in the urn:schemas-microsoft-com:xml-sql namespace are processed by the template processor and used to return database data as part of the resulting XML document.



4.3 Oracle 9iR2

The first Oracle product that offered support for XML was Oracle8i, which was released in 1999 [31]. Its main functionality was the ability to publish relational data in XML. In Oracle9i, released in June 2001, Oracle added XML support directly to the database. The Oracle9i Database Release 1 XDK contains a number of tools for processing XML into and out of the database. They include: XML Parsers, an XSLT Processor, an XML Schema Processor and XML SQL Utility to generate XML documents, DTDs and schemas from SQL queries. In addition, two new datatypes, one for XML (XMLType) and one for logical pointers (URI-Ref), were added to the kernel for direct XML storage. To facilitate the import of XML-encoded records, new Table Functions were introduced that could be used to decompose XML documents across multiple tables. The URI-Ref datatype introduced a vendor-neutral way to specify pointers to information both inside and outside the database. Finally, several operators are associated with XMLType: Extract() extracts nodes from the document identified by the XPath expression; getStringVal() or getNumberVal() get scalar content; existsNode() checks if the given XPath evaluates to at least one XML element or text node.

Oracle9i Database Release 2 (Oracle 9iR2) introduced Oracle XML DB. The features offered by Oracle make it the most appealing of all commercial solutions. Mappings can either be automatically generated by the system, or such default mappings can be overridden by users annotating an XML Schema. Once a mapping has been defined, XML DB loads the schema file and stores mapping information internally. XML DB may also create SQL types and tables, indexes, and Java classes associated with a mapping. XML DB subclasses the XML Schema definitions defined by the standard by adding extra attributes and elements to specify the mapping information. Among all commercial solutions for storing XML, Oracle’s is the only one that is based on XML Schema. Below is a mapping example. As with Microsoft, additional attributes that define the namespace in which the schema resides and the namespaces that are used in the schema can be defined for the schema element.

Declare doc varchar(5000) := ‘ ’;
Begin
dbms_xmlschema.registerSchema(‘http://www.oracle.com/HL7.xsd’, doc);
end;

Once the mapping schema is compiled, XML documents can be loaded using a variety of tools such as standard ftp, drag-and-drop between Windows folders, or any of the Oracle database load techniques (insert, sqlloader, etc.). Users can also define fine-grained access control on XMLType data. XML Schema is understood in the database. When XML instances are inserted, Oracle XML DB can check the validity of each instance according to schema constraints. This can also be done in the IBM XML Extender solution. Data can be accessed with XPath and SQL in the style of the ANSI and ISO standard [32]. XML documents can be generated from SQL queries using a set of predefined functions such as XMLELEMENT and XMLATTRIBUTES. An example of such a query is:

SELECT XMLELEMENT("PATIENT", XMLATTRIBUTES(PatientID), XMLFOREST(LastName, BirthDate)) FROM Patient;
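The shape of the document produced for each row by the query above can be approximated with the following Python sketch. It emulates the construction outside the database and makes several assumptions: the function name is hypothetical, element and attribute names are kept exactly as written in the column list (the casing chosen by a database may differ), and columns whose value is None are skipped in the forest part.

```python
import xml.etree.ElementTree as ET


def patient_to_xml(row):
    """Build one PATIENT element from a dict with the keys PatientID,
    LastName and BirthDate."""
    # XMLATTRIBUTES(PatientID): the column becomes an attribute.
    elem = ET.Element("PATIENT", {"PatientID": str(row["PatientID"])})
    # XMLFOREST(LastName, BirthDate): one child element per column.
    for column in ("LastName", "BirthDate"):
        value = row.get(column)
        if value is not None:  # assumption: empty columns produce no element
            ET.SubElement(elem, column).text = str(value)
    return ET.tostring(elem, encoding="unicode")


ROW = {"PatientID": 635, "LastName": "Jones", "BirthDate": "1961-06-13"}
```

Calling patient_to_xml(ROW) returns a single PATIENT element with a PatientID attribute and LastName and BirthDate children; applying it to every row of a Patient table mirrors the one-document-per-row behaviour of the query.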

SYS_XMLGEN() is a function that operates at the row level, returning an XML document for each row. It can be used as follows:

    SELECT SYS_XMLGEN(PatientID) AS xml_doc FROM Patient;

When XML documents are read into memory, Oracle XML DB provides lazy materialization, which reads in DOM nodes on demand. XPath queries are optimized using B-tree indices. Like Microsoft, Oracle supports storing XML documents in CLOBs (Character Large OBjects), which permits limited support of mixed content and full-text search.

5. MXM (MY XML MAPPER)

Our goal is to provide a declarative mechanism to express existing XML-to-relations mappings and to offer an interface for querying these mappings. We also want our mapping language to be independent of whether a DTD or an XML Schema is used to describe XML documents. Finally, we want MXM to be extensible. To achieve these goals, we identified the following orthogonal aspects of a mapping:

Elements and attributes. We believe that attributes and elements should be mapped in similar ways. Therefore, we allow attribute outlining. Outlining attributes captures complex types (e.g., list-valued attributes) more naturally. In this case, the relationship between an attribute and its containing element is captured in the same manner as document structure, except that attributes are not ordered.

Groups. The ability to map groups offers additional flexibility. For example, in DTDs, entities are treated as groups and non-terminal nodes are used to specify the XML-to-relations mapping. When an XML Schema is given, element, attribute and group names are used in the mapping.

Document structure. Users should be allowed to choose how document structure is captured. This choice is then applied uniformly across all XML documents. We identify each possibility for mapping structure by a unique name, which is used in the mapping. A consequence of this design is extensibility. For example, we can use an external relation to map document structure by adding a reserved word to our system that can be specified in a mapping.

Table names. Our design allows flexible naming of tables. If the user specifies a table name, it is used; otherwise, a default name is generated.

Defaults. To avoid hard-coding the semantics of default mapping rules into each application, default rules can be specified in a "configuration file" that is itself expressed in MXM. Defaults can thus be shared by multiple MXM mappings, and applications built on top of the XML store can query them. An example of a default rule is to always capture document structure in an external relation unless a mapping specifies otherwise.
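The fallback behavior for table names and defaults can be sketched in a few lines of Python. This is a hypothetical illustration, not MXM's implementation: the dictionary representation, the DEFAULTS object standing in for the configuration file, and the function names are all assumptions; only the rule "use what the mapping specifies, otherwise fall back to the shared default" comes from the text.

```python
# Hypothetical shared defaults, standing in for an MXM "configuration file".
# The single rule here is the example from the text: capture document
# structure in an external relation unless a mapping says otherwise.
DEFAULTS = {"whichMap": "EXTREL"}


def effective_struct_map(mapping, defaults=DEFAULTS):
    """Return the structure technique of a mapping, or the shared default."""
    return mapping.get("whichMap") or defaults["whichMap"]


def effective_table_name(table, generate_name):
    """Return the user-given table name, or a generated one if none is given."""
    return table.get("name") or generate_name(table)


if __name__ == "__main__":
    print(effective_struct_map({}))                   # falls back to the default
    print(effective_struct_map({"whichMap": "KFO"}))  # explicit choice wins
```

Keeping the defaults in one shared object is what lets several mappings, and the applications that query them, agree on the same fallback rules.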

5.1 MXM Grammar and Examples

We express the grammar of MXM using XML Schema [41]. We define an XtoRMapping as being composed of a mapping of document structure, StructMap, a mapping of elements, attributes and groups into tables, TableMap, and a mapping of elements and attributes into CLOBs, CLOBMap.



StructMap indicates which technique is used to capture document structure between elements and outlined elements and attributes. Currently, whichMap can have one of the following values: empty, KFO, INTERVAL, DEWEY, PATH, EXTREL, EDGE, ATTRIBUTE, UNIVERSAL, BASIC, SHARED, HYBRID. All of them refer to the techniques presented in Section 3. EXTREL outlines KFO in an external table.

TableMap is used to create tables from source elements, attributes and groups (choice, sequence and allgroup). A table may be assigned a name; otherwise, a name is generated automatically by concatenating all source names together, and a tag field is used to distinguish between tuples corresponding to the same source name. It is also possible to specify whether a tag field should be created or not. When input documents are described with a DTD, any non-terminal node name in the DTD can be used as a source name. When input documents are described with an XML Schema, element, attribute and group names can be used as source names.

Finally, CLOBMap indicates the creation of a CLOB from a source name. The name of a CLOB is either given or generated automatically. The CLOB contains all the substructure rooted at the specified source name.

MXM captures the mappings described in [11] (see Section 3.2.2), since their XML-to-relations rules map document structure using KFO and element names are described using a tag field. We illustrate MXM with two mapping examples. More examples are given in [6]. The first example captures document structure using KFO. It also specifies user-given table names.
[Listing: first MXM mapping example. Only the source names FaNa, GiNa, DTofBi and OBX are recoverable.]


The name of the table containing dates of birth is system-generated since it is not explicitly specified by the user. The same is true for the table containing observations. In this case, a tag field called ElemName is defined; it records the name of the element to which a particular tuple corresponds.

The second example shows the creation of a CLOB to contain the whole document. A CLOB could also be created for only a fragment of the input documents. Note that the attribute whichMap is not specified in this case.
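The structure of an XtoRMapping described in this subsection can be summarized as a small Python data model. This is a sketch under stated assumptions, not the MXM grammar itself: the class and field names only echo the terms used in the text, the table names FamilyName and GivenName, the source name Document, and the "_CLOB" suffix for generated CLOB names are invented for illustration, and the grouping of sources into tables is one possible reading of the two examples.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Values of whichMap listed in the text; None models "empty" or unspecified.
WHICH_MAP_VALUES = {
    None, "KFO", "INTERVAL", "DEWEY", "PATH", "EXTREL", "EDGE",
    "ATTRIBUTE", "UNIVERSAL", "BASIC", "SHARED", "HYBRID",
}


@dataclass
class TableMap:
    sources: List[str]               # element, attribute or group names
    name: Optional[str] = None       # user-given table name, if any
    tag_field: Optional[str] = None  # e.g. "ElemName"; None means no tag field

    def table_name(self) -> str:
        # Default name: concatenation of all source names, as in the text.
        return self.name or "".join(self.sources)


@dataclass
class CLOBMap:
    source: str                      # root of the substructure stored as a CLOB
    name: Optional[str] = None

    def clob_name(self) -> str:
        # The generation scheme is an assumption; the text only says the
        # name is generated automatically when it is not given.
        return self.name or self.source + "_CLOB"


@dataclass
class XtoRMapping:
    which_map: Optional[str] = None  # the StructMap technique
    tables: List[TableMap] = field(default_factory=list)
    clobs: List[CLOBMap] = field(default_factory=list)

    def __post_init__(self):
        if self.which_map not in WHICH_MAP_VALUES:
            raise ValueError(f"unknown whichMap value: {self.which_map!r}")


# Loose reading of the first example: KFO structure, two user-named tables
# (names invented), and two tables whose names are system-generated.
first_example = XtoRMapping(
    which_map="KFO",
    tables=[
        TableMap(["FaNa"], name="FamilyName"),
        TableMap(["GiNa"], name="GivenName"),
        TableMap(["DTofBi"]),
        TableMap(["OBX"], tag_field="ElemName"),
    ],
)

# Loose reading of the second example: the whole document in one CLOB,
# with whichMap left unspecified. "Document" is a hypothetical source name.
second_example = XtoRMapping(clobs=[CLOBMap("Document")])
```

Validating whichMap against a fixed set keeps the sketch close to the idea of reserved words: supporting a new structure-mapping technique amounts to adding one more name to the set.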