Logical and Physical Support for Heterogeneous Data
Sihem Amer-Yahia, Mary Fernández, Rick Greer, Divesh Srivastava
AT&T Labs - Research
fsihem,m,rxga,
[email protected]
ABSTRACT
Heterogeneity arises naturally in virtually all real-world data. This paper presents evolutionary extensions to a relational database system for supporting three classes of data heterogeneity: variational, structural, and annotational. We define these classes and show the impact of these new features on data storage, data-access mechanisms, and the data-description language. Since XML is an important source of heterogeneity, we describe how the system automatically utilizes these new features when storing XML documents.
1. INTRODUCTION
Heterogeneity arises naturally in virtually all real-world data. Variational heterogeneity occurs in semantically related data items that have some shared and some unique properties. For example, variants of tcp messages, such as http and smtp, share common fields, but also have protocol-specific fields. Structural heterogeneity arises in data that has nested structured content, which also may be optional or repeated. For example, billing information may contain multiple, structured addresses or a tax form may contain multiple, nested tax schedules. Annotational heterogeneity occurs in data that mixes structured content with unstructured text. A product catalog, for example, typically contains pricing or technical information on a product embedded in marketing text. These classes are not mutually exclusive: the data of one application may exhibit all three kinds of heterogeneity.
Relational and object-relational (O-R) database systems adeptly handle homogeneous data and some simple kinds of heterogeneity. For example, missing or optional atomic values are typically modeled by null-able fields, and repeated atomic values are modeled by list- or set-valued fields. However, the combination of variants, nested content, and structured content embedded in text is handled only clumsily in relational and O-R systems. Without constructs that directly support heterogeneity, heterogeneous data must be
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM'02, November 4-9, 2002, McLean, Virginia, USA. Copyright 2002 ACM 1-58113-492-4/02/0011 ...$5.00.
normalized to fit the available constructs [6]. This mismatch means that the heterogeneity is encoded in the queries that recover the heterogeneous data, not in the schema or stored data itself.
In this paper, we present XDX, an evolutionary extension of the Daytona [7] relational database system, which provides logical and physical support for heterogeneity. We show the changes to Daytona that XDX's new features require on the data-storage strategy, the data-access mechanisms, and the data-description language. We do not focus on a single storage feature, but instead present three new features and illustrate how their interaction can support data with complex heterogeneity.
One important source of heterogeneous data is XML documents. XML can represent all three kinds of heterogeneity described above and many XML documents contain all three kinds [15]. Support for such complexity is one reason XML is succeeding as a universal data-exchange format. Existing techniques for storing XML in relational and O-R systems "shred" XML documents into multiple relations and/or object classes, which makes recovering the original document cumbersome and costly. XDX's storage features more naturally handle XML constructs and can be combined to model most features in XML schemas. We note, however, that XDX is not an "XML database", but a storage system that can support the heterogeneity we have observed in a variety of data sources.
Given the demand for database support of XML, one might ask why not build a custom database system for XML? Our response is that, given the effort necessary to build a high-performance database system, we believe that evolving an existing system to support heterogeneous data, including XML, is a more efficient strategy than implementing such a system from scratch. Our base architecture scales, is thoroughly tested, and must continue to handle homogeneous data as efficiently as before.
Commercial database vendors agree, but to date, their strategy has been to shred heterogeneous data to fit the available constructs or to encapsulate heterogeneous data in non-native data types [9, 14, 13]. In contrast, our strategy is to conservatively extend a relational model with constructs for heterogeneous data and to modify the native storage and indexing capabilities to support these new constructs. In this way, heterogeneous data (in non-XML or XML sources) is handled as uniformly and efficiently as homogeneous data.
We begin with an overview of XDX and Daytona, then we describe each class of heterogeneous data and XDX's support for that class. Next, we describe XDX's support for
XML and conclude with a discussion of other commercial and research solutions for handling heterogeneous data.
2. XDX AND DAYTONA
XDX (eXtending Daytona for XML) is an extension of the Daytona™ data management system [7]. Daytona has been developed in AT&T Labs - Research and is used by AT&T to solve a variety of data management problems. For example, as of February 2002, Daytona is managing a 40 terabyte, 7x24 production data warehouse whose largest table contains over 191 billion (yes, billion) records. The total number of records being stored over all tables is over 345 billion. Daytona's architecture is based on translating its high-level query language, Cymbal™ (which includes SQL as a subset), completely into C and then compiling that C into object code. The system resulting from this architecture is fast, powerful, easy to use and administer, reliable, and open to UNIX tools. Two forms of data compression plus horizontal partitioning enable Daytona to handle terabytes with ease. Daytona offers all the essentials of data management including a high-level query language, B-tree indexing, locking, transactions, logging, and recovery. One application of Daytona handles more than 70,000 queries per month.
XDX extends Daytona by providing support for storage and querying of heterogeneous data. XDX adds variant records, nested records, and embedded records to Daytona's logical data model and supports the storing, indexing, and updating of all record types. In this paper, we focus on how these extensions affect Daytona's data-storage strategy, its data-access mechanisms, and its data-description language. We do not address how they impact the Cymbal query language. Because XDX is an extension of Daytona, it includes all Daytona's features for querying and storing homogeneous data. This design makes XDX an evolutionary, not a revolutionary, system.
XML documents include and combine all three classes of heterogeneity. XDX supports storage and querying of XML documents in an XML frontend to Daytona.
XDX's XML frontend is designed to store XML documents for which XML Schema are defined, not for schema-less semi-structured data [1]. Practical applications of XML as a data-exchange format (e.g., in electronic commerce and bioinformatics) require a priori DTDs or XML Schema, so this requirement is reasonable. The frontend supports automatic mapping of XML schemas into XDX schemas and utilizes XDX's support for heterogeneity. The frontend also supports querying of XML documents by translating XQuery [21] queries into the Cymbal query language. We discuss our techniques for mapping XML documents into the XDX data model. We do not discuss translation of XQuery into Cymbal here.
2.1 Daytona's Data Model
Daytona supports the relational model, both at the logical and the physical levels. A Daytona record class is analogous to a relation. All records in a record class are homogeneous, i.e., they are all instances of the same relational type. Records in the same record class are stored together in one or more UNIX files. The logical data model and physical storage features of Daytona are expressed in its record-class description (RCD) language. The logical part of an RCD is analogous to a SQL data dictionary, consisting of, in part, specifications for field name
and type. The types include standard atomic types such as integer, float, string, date, time, as well as lists and sets of atomic types. RCDs also specify keys¹ as lists of one or more fields that are (usually) associated with indexes to facilitate rapid search. For example, the RCD in Fig. 1 describes a subset of the general and entity headers of http messages [11].² An http header has multiple fields, two of type string (i.e., Kind and Content Type), one of type date clock (i.e., Date), one of type optional date clock (i.e., Expires), and one of type integer (i.e., Content Length). The keys section specifies the indexed values. In this case, a pair of Content Type and Date values comprise one key value.
Daytona RCDs also specify physical storage features. A record class can be partitioned horizontally and/or vertically into one or more bins, each of which is implemented as a separate UNIX file. The bins associated with a given record class are specified in its RCD, by listing the field names and values on which to partition records. In our example, there are two horizontal partitions, and records are partitioned on the value of their Kind field. Appendix A illustrates how horizontal and vertical partitions can be combined. Such partitioning is critical to feasibility: 191 billion records cannot be stored in one file, so the corresponding table is horizontally partitioned into 18,144 files. The utility of vertical partitioning is demonstrated in [2].
Fields in individual records are separated by field separators (often the `|' character), and records are terminated by a newline. Records can be separated by comments (beginning with # and terminating with a newline). A null-valued field is represented by two consecutive field delimiters. Storing records in UNIX files permits Daytona users to view the records directly and to apply standard UNIX tools (e.g., grep, awk) to the same data files that Daytona uses.
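The record format just described can be illustrated with a minimal Python sketch. It is not Daytona code: the function name `parse_bin` is hypothetical, the field order (Date, Content Type, Content Length, Expires) is inferred from the sample records in Fig. 2, and treating a trailing separator as a null final field is an assumption.

```python
from typing import List, Optional


def parse_bin(text: str, sep: str = "|") -> List[List[Optional[str]]]:
    """Parse newline-terminated, separator-delimited records.

    Lines starting with '#' are comments. An empty field (two consecutive
    separators, or a trailing separator) is returned as None, i.e. null.
    """
    records = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments between records
        records.append([field if field else None for field in line.split(sep)])
    return records


# The file_req bin from Fig. 2. The partitioning field Kind is not stored.
FILE_REQ = (
    "# file_req\n"
    "03/22/2002@08:09:01|image/gif|0|\n"
    "03/22/2002@08:11:07|text/html|2000|\n"
)

records = parse_bin(FILE_REQ)
for date, content_type, content_length, expires in records:
    print(date, content_type, content_length, expires)
```

Because the bin is plain text, the same file remains usable with grep or awk; the sketch only shows how little machinery is needed to recover fields and nulls.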
External indexes on records are also specified as part of the RCD, and indexes are built for each bin associated with a record class. A standard B-tree index can be specified by an ordered list of fields. The index maps key values to byte offsets of individual records in a bin. Daytona also associates a .siz index with each bin of a record class, which stores in order the byte offset of records in the bin (".siz" is the suffix of the file that contains this ordinal index). This is important, since different records in a record class may have different lengths, especially in the presence of string-valued fields. The .siz index makes it possible to skip over records without scanning for the end-of-record delimiter. Once a record is located, individual fields are accessed by scanning the record for field separators.
Fig. 2 (left) depicts two bins of records in the HTTP record class. Note that values for the partitioning field Kind are not stored with the data; they only appear in the schema. The corresponding .siz indexes for the bins are in Fig. 2 (right). The record numbers are included only for illustration. The second column comprises the actual index, minus a header that contains statistics and an optional compression dictionary.
¹ The term "key" does not denote a uniqueness constraint. Uniqueness is specified as an optional attribute of a key.
² To avoid proliferation of notations, we use XML as the RCD syntax. Daytona's RCD syntax is syntactically isomorphic to the XML here. To be consistent with Daytona syntax, all class names are in upper case, and field names are in up-low case.
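The ordinal index described above can be sketched in a few lines of Python. This is an illustration, not Daytona's implementation: `build_siz_index` and `nth_record` are hypothetical names, the index is held in a list rather than in a .siz file, and the header with statistics and the compression dictionary is omitted.

```python
from typing import List


def build_siz_index(data: bytes) -> List[int]:
    """Return the byte offset of the start of each record, in file order."""
    offsets = []
    pos = 0
    for line in data.splitlines(keepends=True):
        if line.strip() and not line.startswith(b"#"):
            offsets.append(pos)  # a record starts at this byte
        pos += len(line)         # comment lines still occupy bytes
    return offsets


def nth_record(data: bytes, siz: List[int], n: int) -> List[str]:
    """Seek to record n (1-based) via the index, then split it into fields."""
    start = siz[n - 1]
    end = data.find(b"\n", start)
    if end == -1:
        end = len(data)
    return data[start:end].decode("ascii").split("|")


# The two bins of the HTTP record class, using the sample records of Fig. 2.
FILE_REQ = (b"03/22/2002@08:09:01|image/gif|0|\n"
            b"03/22/2002@08:11:07|text/html|2000|\n")
FILE_ANS = (b"03/22/2002@08:10:07|text/html|23|03/23/2002@08:12:00\n"
            b"03/22/2002@08:12:01|text/html|3219|\n")

siz_req = build_siz_index(FILE_REQ)
siz_ans = build_siz_index(FILE_ANS)
print(siz_req, siz_ans)  # [0, 33] [0, 53]
print(nth_record(FILE_ANS, siz_ans, 2))
```

The computed offsets agree with the byte offsets listed for the two bins in Fig. 2. Reaching record N costs one index lookup plus one seek, and the field-separator scan happens only inside the located record.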
Figure 1: Record class description of http messages

# file_req
03/22/2002@08:09:01|image/gif|0|
03/22/2002@08:11:07|text/html|2000|
# file_ans
03/22/2002@08:10:07|text/html|23|03/23/2002@08:12:00
03/22/2002@08:12:01|text/html|3219|

Bin        Record Number   Byte Offset
file_req   1               0
file_req   2               33
file_ans   1               0
file_ans   2               53

Figure 2: Sample http records and .siz (byte offset) index

Daytona supports update of records. Locating the record to update is equivalent to locating a record during querying. If the size of the updated record does not exceed the size of the existing record, the record is updated in place, possibly padding the new entry with comment characters. Otherwise, Daytona tags the space holding the existing record as free, and the updated record is written into the first free space of adequate size in the appropriate bin file.
Given this foundation, we now describe three extensions to Daytona for directly supporting heterogeneity.
3. VARIATIONAL HETEROGENEITY
The success of the relational model is due to its simplicity, which it achieves by imposing homogeneity on real-world data. However, real-world data is often heterogeneous! For example, http, smtp, and ftp messages are all instances of tcp messages, with some variations in the set of applicable fields. Even messages within a particular protocol, such as the request and response messages of http, have some shared fields and some message-specific fields.
This variational heterogeneity can be modeled in three ways within the relational model. One approach requires fragmenting the heterogeneous data into multiple homogeneous relations, with one table for the shared fields, and a separate table for each variant, linked through foreign keys. A second approach is to have one table for each variant type whose records contain all the fields, shared and specific, for that type. The third approach uses a universal relation, which contains all fields present in all the variants; fields not shared by all variants must be null-able.
None of these approaches is ideal. In the first approach, the association between shared and variant fields is encoded in the queries that recover the heterogeneous data, not in the data itself, and recovering the data often requires multi-way joins. In the second approach, any query over a shared field in all variants requires accessing each variant table and unioning the results. In the universal-relation approach, the association between shared and variant fields is preserved in the data, but this does not permit ease of schema evolution: if a new variant is added, the universal schema must change to include the new variant's fields, and every record in the universal table must be modified to represent these new fields. Finally, the second approach cannot preserve any order that may be inherent in the data. Consider a log of records of different types ordered by a timestamp field: the temporal ordering is lost when the records are partitioned into a separate table for each variant.
Providing direct logical and physical support for variational heterogeneity avoids the limitations of these approaches. This is achieved in XDX with variant-record classes. An XDX variant-record class (unlike a Daytona record class) contains a heterogeneous collection of records that are intermingled in the same file.
3.1 Variant record layout, RCD, and indexes
Although they may have different numbers and types of fields, variant records do not pose any problem in data layout, since the underlying system already handles variable-sized records and fields. For example, Fig. 3(left) contains four REQUEST and RESPONSE records, which are both variants of HTTP records, and Fig. 3(right), their corresponding .siz index. These records share some fields (those in Fig. 1), and differ in other fields (depending on the message variant). Note that this file is not horizontally partitioned.

req|03/22/2002@08:09:01|image/gif|0||GET|/motd|[email protected]|Mozilla/4.03
ans|03/22/2002@08:10:07|text/html|23|03/23/2002@08:12:00|404 Not Found|Netscape/3.5.1
req|03/22/2002@08:11:07|text/html|2000||PUT|/info/index.html|[email protected]|IE/5.0
ans|03/22/2002@08:12:01|text/html|3219|||200 OK|Netscape/3.5.1

Record Number   Byte Offset
1               0
2               78
3               164
4               266

Figure 3: Sample http variant records and .siz index

The RCD of a base record class contains the fields that are common to all variants and contains the RCD of each variant. Fig. 4 shows the RCD for the HTTP base record class and its REQUEST and RESPONSE variants. The fields that distinguish between variants are identified by the Is A Classifier role, e.g., the Kind field in HTTP. The relationship between the base record class and its variants is specified by the Is A Subclass role. Note that each variant-record class specifies the value of its Kind field, which identifies the variant. In general, XDX permits a variant to serve as the base record class of other variants. For example, we could further refine the REQUEST record class with variants for get, post, and head requests.

Figure 4: http variant-record class description

An important issue to address is the impact of variant records on query processing, in particular, on indexing, in which a record is accessed based on field values, and on projection, in which individual fields of the record are accessed.
XDX's variant records have minimal impact on indexing. The .siz index, which maps the record number in the bin file to its byte offset, is clearly unchanged. One can build and use value-based indexes, both on fields that are present in all variants and on fields that are present only in some variants. In particular, a value-based index on the Kind field allows direct access to specific variants in the file. On the records in Fig. 3, the Kind index value req is associated with the byte offsets of records one and three (i.e., 0 and 164) and the index value ans is associated with the offsets for records two and four (78 and 266).
Recovering all semantically related fields is easy, since the semantics of the variants is in the RCD (schema) itself, and each variant record explicitly maintains a classifier field. Checking the value of the classifier field is a small added cost not incurred when dealing with Daytona homogeneous records. Processing all variant records (e.g., selecting all http messages initiated in a particular time period) requires only a simple selection expression on the common Date field. Recovering the complete record requires no additional work (i.e., no joins are necessary), because the complete record is available from contiguous bytes. Also, if the ordering of
records in a bin file is meaningful, for example, when storing time-series data, then XDX preserves that ordering. Finally, adding a new variant does not require changing any existing records. A new variant simply has a different value for its Kind field and new variant records can be appended to existing bin files.
4. STRUCTURAL HETEROGENEITY
In the previous section, we dealt with variational heterogeneity arising in flat relational data. However, real-world data is often structured! For example, the Health Level Seven (HL7) Standard [8] specifies myriad formats for electronic exchange of medical data. An HL7 laboratory report contains patient information as a flat structure together with multiple observation records, each of which is flat. Some structured data is more deeply nested, e.g., a tax form contains schedules, which in turn, contain sub-forms. Structured data may also be nested to arbitrary depth; for example, a parts assembly may contain a part record, which recursively contains other part records.
While relational and object-relational database systems are adept at handling optional atomic values (as null fields), and repetition of atomic values (as set- and list-valued fields), the combination of nested, structured content and optionality/repetition/choice is modeled only clumsily in the flat relational model. The nested relational model [18] supports repeated structured data, but does not directly support recursion or choice.
Representing this structural heterogeneity at the storage level is important for efficiency and ease of use. This is achieved in XDX with nested-record classes, which may be combined freely with variant record classes. An XDX nested-record class (like a Daytona record class) contains a collection of homogeneous independent records, but (unlike a Daytona record class) each such record can contain (zero or more) collections of dependent records, which in turn, may contain dependent records. An independent record is uniquely identifiable by its key fields, but, in general, a dependent record is only identifiable by its own fields together with the key fields of all its ancestors.
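The identification rule for dependent records can be made concrete with a small Python model. This is only a sketch of the logical idea, not XDX's data model: the `Record` class, the `identities` function, and the field names `PatientId`, `LastName`, and `Code` are hypothetical, and using a single field of each observation as its identifying field is an assumption made for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Record:
    fields: Dict[str, str]
    key: Tuple[str, ...]  # names of the identifying fields (assumed)
    dependents: List["Record"] = field(default_factory=list)


def identities(rec: Record,
               ancestors: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the full identity of rec and of every record nested inside it."""
    ident = ancestors + tuple(rec.fields[name] for name in rec.key)
    yield ident
    for dep in rec.dependents:
        # A dependent is identified only in the context of its ancestors.
        yield from identities(dep, ident)


lab = Record(
    fields={"PatientId": "Id1234", "LastName": "Jones"},
    key=("PatientId",),
    dependents=[
        Record({"Code": "84295", "ValueName": "Na"}, key=("Code",)),
        Record({"Code": "84132", "ValueName": "K+"}, key=("Code",)),
    ],
)

ids = list(identities(lab))
print(ids)
```

The recursion handles any nesting depth, so the same function would cover a dependent record nested inside another dependent record.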
An independent record and its nested dependent records are a logical and physical unit, which, for example, are moved as an ensemble to a new physical location in the file if any part of the nested record is updated and update-in-place is not possible. A bin file is associated with an independent record class; that association determines the bin files of the dependent records in a particular independent record class.
In our HL7 example, the patient information comprises the independent record and the multiple observation records comprise the dependent records. An observation record depends on the patient that it describes, i.e., it is only uniquely identifiable within its containing records. Nested records make it possible to preserve this dependence explicitly in the storage system. We do not address here how the independence-dependence relationship is determined. The schema designer could state it explicitly or it might be inferred from the input schema or a query workload.

Figure 6: LAB nested-record class description

4.1 Nested record layout, RCD, and indexes
To physically represent nested records, we only need new delimiters to distinguish dependent records from individual fields. For example, Fig. 5 contains two laboratory reports represented as XDX nested records. The first record contains one independent record containing patient information and four dependent nested records containing laboratory "observations". (The indentation is for presentation only. More importantly, the data is stored and processed as it appears here; Daytona does not transform it into a proprietary binary form.)
The RCD of a nested record class contains the fields that are in the independent record and also the RCD of each dependent class. Fig. 6 shows the RCD for the LAB independent record class and its single, dependent record class (OBX). Multiple dependent classes are permitted and dependent classes may be nested recursively in other dependent classes. An independent record may contain any number of dependent records, i.e., there are no default multiplicity constraints on the number of dependent records.
Nested records introduce some additional computation when indexing and accessing dependent records. We generalize the .siz index by adding a separate .siztree index, which includes the byte offsets of both independent and dependent records in a bin file. For nested records, the N-th .siz entry contains the offset in the .siztree file for the .siztree entry for the N-th independent record; this entry contains at least the byte offset in the data file of the N-th record. For example, Fig. 5(center) contains the .siz index for the records in Fig. 5(left). The .siz entries contain the offsets (A and C) for the two independent records in the .siztree file. Because a nested record may contain a variable number of records, a .siztree entry for an independent record and all its nested dependents is also variable length. Fig. 5(right) contains the .siztree index for the records in Fig. 5(left).
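The two-level lookup from .siz to .siztree can be sketched as follows. This is a deliberately simplified in-memory model, not the on-disk format: the lists `SIZ` and `SIZTREE` and the functions `lookup` and `dependent_offsets` are hypothetical, list slots stand in for byte offsets into the .siztree file, and each entry holds only the absolute offset of the independent record followed by the relative offsets of its dependents. The numeric values are the ones shown in Fig. 5.

```python
from typing import List, Tuple

# Variable-length entries, stored back to back (values taken from Fig. 5).
SIZTREE: List[int] = [0, 34, 73, 99, 126,  # entry for independent record 1
                      152, 36]             # entry for independent record 2

# N-th .siz entry: where the N-th independent record's entry starts in SIZTREE
# (the positions labelled A and C in Fig. 5).
SIZ: List[int] = [0, 5]


def lookup(n: int) -> Tuple[int, List[int]]:
    """Return (absolute offset, relative dependent offsets) for record n."""
    start = SIZ[n - 1]
    end = SIZ[n] if n < len(SIZ) else len(SIZTREE)
    entry = SIZTREE[start:end]
    return entry[0], entry[1:]


def dependent_offsets(n: int) -> List[int]:
    """Absolute data-file offsets of the dependents of independent record n."""
    absolute, relative = lookup(n)
    return [absolute + rel for rel in relative]


print(lookup(1))             # (0, [34, 73, 99, 126])
print(dependent_offsets(2))  # [188]
```

The extra indirection exists because entries are variable length: the fixed-width .siz file keeps record-number lookup constant time, while the .siztree entry carries however many dependent offsets a given record needs.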
The .siztree entry for a leaf nested record L (with no dependents) is the relative offset of L from its independent record in the data file (e.g., the dependent records 1.1-1.4 and 2.1). The .siztree entry for a non-leaf, dependent record D has three parts: a relative data-file offset like that for a leaf nested record; followed by a sequence of .siztree offsets, one for each class of dependent record that may be nested in D; followed (recursively) by the .siztree entries for each of D's dependent records. The .siztree entry for an independent record I (e.g., records 1 and 2) is the same as for a non-leaf dependent record, except that its first part is the absolute offset of I in the data file. Because each .siztree entry for a LAB record has one possible nested dependent class (OBX), its absolute offset is followed by one .siztree offset. A .siztree offset is a relative offset from the start of the .siztree entry for the independent record to the last byte in the .siztree file that represents the dependent records of the associated child kind. The .siztree provides fast access to all children or children of certain kinds. Since .siztree entries represent records in different classes, navigation depends on the RCD nesting information to correctly interpret the .siztree entries.
One can build value-based indexes on fields present in dependent records by mapping field values to the byte offsets of the corresponding independent record, but this technique requires scanning the entire nested record to check any additional conditions on its fields, in particular the fields of the same dependent record. This is not efficient when an independent record contains a large number of dependent records. Instead, XDX supports indexing on multiple fields of a dependent nested record. An index on a field of a dependent record D, nested at level N, maps a field value to an N-tuple of the offsets of the N records in which D is contained as well as the offset for D. For example, given an index on the AbnormalFlag field of an OBX record, the index value "Above high normal" maps to the pair (0, 35), i.e., the byte offset for the first independent record and its first dependent record. Given direct access to the dependent record, one can then check whether its ValueName field is equal to "Na", without having to scan the entire nested record.
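The dependent-record index for one level of nesting can be sketched in Python as below. Several things here are assumptions rather than facts from the paper: the exact byte layout of the bin (each record on its own line, an opening brace before the first dependent and a closing brace after the last), the positions of ValueName and AbnormalFlag within an OBX record, and the names `index_dependent_field` and `read_dependent`. A real index would be a B-tree on disk, not a dictionary.

```python
from typing import Dict, List, Tuple

# Assumed physical layout of the two LAB reports from Fig. 5.
LAB = (b"Id1234|Jones|William|1961-06-13|M\n"
       b"{84295|Na|150|136-148|Above high normal\n"
       b"84132|K+|4.5|3.5-5|Normal\n"
       b"82435|Cl|102|94-105|Normal\n"
       b"82374|CO2|27|24-31|Normal}\n"
       b"Id1235|Buchsbaum|Elena|1964-01-02|F\n"
       b"{71020|Chest X-ray||It is a normal PA Chest X-ray}\n")

VALUE_NAME, ABNORMAL_FLAG = 1, 4  # assumed field positions in an OBX record


def index_dependent_field(data: bytes,
                          position: int) -> Dict[str, List[Tuple[int, int]]]:
    """Map a dependent field's value to (independent offset, relative offset)."""
    index: Dict[str, List[Tuple[int, int]]] = {}
    offset, indep, in_deps = 0, 0, False
    for line in data.splitlines(keepends=True):
        start, body = offset, line.rstrip(b"\n")
        offset += len(line)
        if not in_deps and not body.startswith(b"{"):
            indep = start  # an independent record begins here
            continue
        if body.startswith(b"{"):
            in_deps, start, body = True, start + 1, body[1:]
        if body.endswith(b"}"):
            in_deps, body = False, body[:-1]
        fields = body.decode("ascii").split("|")
        if position < len(fields) and fields[position]:
            index.setdefault(fields[position], []).append((indep, start - indep))
    return index


def read_dependent(data: bytes, indep: int, rel: int) -> List[str]:
    """Jump straight to one dependent record and split it into fields."""
    start = indep + rel
    end = data.find(b"\n", start)
    end = len(data) if end == -1 else end
    return data[start:end].rstrip(b"}").decode("ascii").split("|")


index = index_dependent_field(LAB, ABNORMAL_FLAG)
hits = [pair for pair in index["Above high normal"]
        if read_dependent(LAB, *pair)[VALUE_NAME] == "Na"]
print(hits)  # [(0, 35)]
```

Under this assumed layout the first OBX record starts at byte 35, which agrees with the pair (0, 35) quoted in the text. The second condition on ValueName is checked by reading a single dependent record, without scanning the other observations of the same report.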
Recovering all fields of an independent record and its dependent records is easy, since the semantics of the dependence is in the RCD itself. Recovering the complete record requires no additional work (i.e., no joins are necessary). However, checking for delimiters of dependent records is a small added cost not incurred when dealing with Daytona (independent) records.
4.2 Combining Record Classes
A benefit of XDX is the orthogonality of its features. Variant and nested record classes, for example, can be combined freely to model complex recursive data. A "parts assembly" is an example that requires both variants and nesting. A part contains some shared fields and is either a "basic" part, which is unnested, or a "compound" part, which recursively contains one or more parts. XDX can easily model such data by combining nested and variant records. Fig. 7 contains six compound-part and seven basic-part records that describe the hardware configuration of a computer. The first field in each part record is a classifier field indicating whether it is compound ("C") or basic ("B"), followed by the part's category, id, and cost fields. The fields of a compound-part record are then followed by nested part records.
Representing recursive records extends naturally from Daytona's support for variable-width fields and records. The .siz and .siztree indexes are the same for non-recursive and recursive nested records; Figure 8 shows the .siztree index for the data in Fig. 7. It contains recursive PART records and the .siztree index for those records. AbsOff(r) denotes the absolute address of the independent record r in the data file; RelOff(r) denotes the relative byte offset of dependent record r from its independent record in the data file; SzTrOffset denotes the last byte offset of the .siztree entries for a given class of children elements.

Id1234|Jones|William|1961-06-13|M
  {84295|Na|150|136-148|Above high normal
  84132|K+|4.5|3.5-5|Normal
  82435|Cl|102|94-105|Normal
  82374|CO2|27|24-31|Normal}
Id1235|Buchsbaum|Elena|1964-01-02|F
  {71020|Chest X-ray||It is a normal PA Chest X-ray}

Record Number   .siz Offset
1               A
2               C

Record Number   .siztree Offset   Contents
1               A                 0  B
1.1                               34
1.2                               73
1.3                               99
1.4                               126
                B
2               C                 152  D
2.1                               36
                D

Figure 5: Sample LAB nested record, .siz index, and .siztree index

The RCD for the PART record class in Fig. 7 illustrates the combination of variant and nested records. A PART is a nested record class that consists of shared fields and two variants: BASIC PART and COMPOUND PART. The COMPOUND PART record class recursively includes the PART record class. Record classes are globally scoped, so an empty nested RCD denotes a (possibly recursive) reference to a fully specified RCD for the named class.
Nested relations in traditional relational databases are less expressive than nested and variant record classes, and implementing nested relations is typically more complex [18, pp. 330-336], because a storage system based on fixed-width fields and records is more rigid. Nested records naturally extend from Daytona records, because Daytona already handles variable-sized records and fields. They require slightly more complex indexing structures than flat records, but the extra complexity is outweighed by the increased expressiveness.
5. ANNOTATIONAL HETEROGENEITY
Real-world data often mixes or annotates large bodies of
text with structured data. For example, book databases often associate bibliographic information about a book with excerpts from and reviews of the book. The Library of Congress [12] publishes bills and resolutions of the U.S. Congress in which the body of a document contains structured data identifying sponsoring members and committees. Another real-world example is bioinformatics data that interleaves descriptions of experimental methods with references to related experiments and results [15]. Note that these examples are not self-describing in that the structure of the embedded data is known a priori. We discuss support for self-describing data in Sec. 6.
Relational databases have considered the case where records have (completely unstructured) text-valued fields. Such values are modeled as CLOBs (Character Large OBjects) and special-purpose functions (such as UDFs) are used to match and manipulate CLOB fields. While this suffices for the book database example, using CLOBs in this way does not adequately address the cases where structured data is embedded within the text. In such cases, the database system
is unable to take advantage of its native indexing and query-evaluation capabilities in processing such embedded structure.
Dealing with this annotational heterogeneity at the storage level is clearly important for efficiency and ease of use. This is achieved in XDX with embedded-record classes. An XDX embedded-record class contains a collection of records, embedded in string filler data. The embedded records represent the structured data within the filler data. Embedded records generalize Daytona records, which may be separated by intervening comments, by permitting records to be interleaved with arbitrary filler data. Note that the relative order of embedded records is significant.
5.1 Embedded records, RCD, and indexes
Embedded records simply require additional delimiters to distinguish them from surrounding filler data. For example, Fig. 9 illustrates a Library of Congress document modeled in XDX as five embedded records, interleaved with text filler data. (For presentation purposes, we include newlines in the textual filler and use {} to delimit embedded records.)
The RCD of an embedded record class must define the base type of the filler data. The embedded record class may have variants. For example, Fig. 9 shows the RCD for the BILL embedded record class, which consists of a SPONSOR variant and a COMMITTEE variant.
Indexing and accessing fields of embedded records is analogous to the corresponding operations for ordinary Daytona records. The .siz index is unchanged, as are value-based indexes. Fig. 9(right) contains the .siz index for the embedded records. Embedded record classes, however, require an additional kind of index to obtain direct access to filler data; this is important, for example, for full text indexing of textual filler data. A filler index maps filler data values to byte offsets within a bin file; text filler values might be individual words, stems, phrases, etc. Additional operators are needed to operate on filler data, but a discussion on this subject is outside the scope of this work.
Note that by storing embedded records contiguously with filler data in UNIX files, standard text-processing tools (e.g., editors, grep, etc.) may be applied directly to the data. Also, contiguous storage makes it possible to apply full-text operators over both the unstructured filler data and structured content. In contrast, storing such data in traditional databases excludes use of standard tools and furthermore, normalization techniques that partition unstructured text and their structured content can complicate full-text search.
In a traditional storage system in which fixed-width, homogeneous records are common, directly representing embedded structured data as XDX does would be difficult, if not impossible. Given Daytona's existing support for variable-sized records and fields, and XDX's variant-record extension, embedded records can be modeled easily and can be processed using Daytona's native indexing and query evaluation capabilities. Moreover, embedded, variant, and nested records are entirely orthogonal and can be combined freely.

C|chassis||555000
  {C|sys_board|0|35000
    {B|cpu|0|6000|900MHz}
    {B|cpu|1|6000|900MHz}
    {C|mem_board||4000
      {B|mem_bank|0|3000|256M}
      {B|mem_bank|1|3000|256M}}
    {C|io_board||7000
      {B|hba_card|0|1200|DWIS}
      {B|hba_card|2|1100|QFE}}}
  {C|sys_board|1|35000
    {B|cpu|0|6000|400MHz}
    {C|io_board||7000}}

Figure 7: Sample recursive PART records and PART nested-record class description

6. SUPPORT FOR XML
All three classes of heterogeneity arise in XML documents. XML Schema [20], a schema language for specifying the permissible structure and content of XML documents, supports the specification of optional, repeated, and recursive structured content, of choice among alternatives, and of structured content mixed with text filler. XDX's XML frontend supports automatic mapping of an XML Schema into an XDX schema. Fig. 11 depicts the architecture of XDX's XML frontend. The frontend accepts one or more XML Schema(ta) of the XML documents to be stored by XDX. XML Schema has constructs (e.g., attribute and element groups, global complex types) that help a schema writer factor common structure and that can be modeled by other constructs (e.g., substitution groups). The frontend applies normalization rules [19] to remove redundant and purely syntactic structures. After normalization, the frontend maps the normalized XML schema into an XDX schema, utilizing XDX's support for heterogeneity. The relationship between XML schema constructs and their corresponding XDX schema constructs is specified declaratively in the mapping schema. This specification is the interface between the "XML universe" of XML documents, schemas, and queries and the "XDX universe" of bin files, RCDs, and Cymbal queries. Three frontend modules take mapping schemata as input. The schema-extractor module extracts the target XDX schema from a mapping schema. The document-loader module takes XML documents that conform to a mapping schema and produces records in one or more bin files. Finally, the query-translator module takes an XQuery query and mapping schema(ta) for the query's input documents and produces a Cymbal query
that, when evaluated, yields the XML result required by the input query. The mapping module itself takes as input the names of elements in the input schema that should be mapped to independent record classes in the XDX schema. An independent record and its nested dependent records form a logical and physical unit that is moved as a whole (if necessary) during updates; thus, classifying an element as independent can have an impact on update cost. Choosing independent elements is a design problem that we do not discuss here. We illustrate schema mapping on the ebXML business process specification schema [5]. Limited space prevents describing the complete algorithm. Fig. 12 contains the mapping schema for one ebXML element; the mapping schema interleaves normalized XML schema constructs (element, attribute, and content definitions) with their corresponding XDX constructs (record classes and fields). Removing the XDX constructs (each marked by a prefix symbol in Fig. 12) yields the normalized XML schema. An element definition consists of the element's name and its simple or complex content. Simple content consists of attribute definitions and a simple type (an atomic type or list of atomic types). Complex content consists of attribute definitions followed by element references combined with sequence, choice, and repetition operators. For example, the ProcessSpec element contains a name attribute, followed by a repetition of a choice of Documentation or other (recursive) ProcessSpec elements. An element with complex content (C) is always mapped to a record class. An attribute of C consists of a (name, simple type) pair and is mapped to an XDX field with a corresponding XDX data type. In Fig. 12, the attribute name is mapped to a field with the same name. An element's complex content is mapped as follows. A choice operator is mapped to a variant-record class, and all element references within the choice are mapped to distinct variants. In Fig.
12, the choice operator is modeled by the variant-record class VARIANT_1; note that each element reference is modeled by a variant, which contains a nested record (record-class-stub) corresponding to the referenced element. The repetition operator has no explicit representation in XDX: any element references within a repetition are modeled as a nested class, which, by definition, contains multiple records. We discuss how the sequence operator is modeled shortly.

C|chassis||555000
  {C|sys_board|0|35000
    {B|cpu|0|6000|900MHz}
    {B|cpu|1|6000|900MHz}
    {C|mem_board||4000
      {B|mem_bank|0|3000|256M}
      {B|mem_bank|1|3000|256M}}
    {C|io_board||7000
      {B|hba_card|0|1200|DWIS}
      {B|hba_card|2|1100|QFE}}}
  {C|sys_board|1|35000
    {B|cpu|0|6000|400MHz}
    {C|io_board||7000}}

Record Number   Byte Offsets in .siztree file   Contents          Comment
1               0-3                             AbsOff(1)         chassis
                4-5                             7                 SzTrOffset for B kids
                6-7                             51                SzTrOffset for C kids
1.1             8-9                             RelOff(1.1)       sys_board
                10-11                           17                SzTrOffset for B kids
                12-13                           37                SzTrOffset for C kids
1.1.1           14-15                           RelOff(1.1.1)     cpu
1.1.2           16-17                           RelOff(1.1.2)     cpu
1.1.3           18-19                           RelOff(1.1.3)     mem_board
                20-21                           27                SzTrOffset for B kids
                22-23                           27                SzTrOffset for C kids
1.1.3.1         24-25                           RelOff(1.1.3.1)   mem_bank
1.1.3.2         26-27                           RelOff(1.1.3.2)   mem_bank
1.1.4           28-29                           RelOff(1.1.4)     io_board
                30-31                           37                SzTrOffset for B kids
                32-33                           37                SzTrOffset for C kids
1.1.4.1         34-35                           RelOff(1.1.4.1)   hba_card
1.1.4.2         36-37                           RelOff(1.1.4.2)   hba_card
1.2             38-39                           RelOff(1.2)       sys_board
                40-41                           45                SzTrOffset for B kids
                42-43                           51                SzTrOffset for C kids
1.2.1           44-45                           RelOff(1.2.1)     cpu
1.2.2           46-47                           RelOff(1.2.2)     io_board
                48-49                           51                SzTrOffset for B kids
                50-51                           51                SzTrOffset for C kids
Figure 8: Sample (recursive) PART data and .siztree index

On June 5, 2003, in the House of Representatives, {sponsor|Mr. English|Libertarian} (for himself and {sponsor|Mr. Coyne|Green}) introduced the following bill; which was referred to the {comm|Committee on Financial Services}, {comm|Committee on Health}, and {comm|Committee on Ways and Means}. The bill permits revocation by members of the clergy of their exemption from Social Security coverage and to reform the election process and to make reparations to citizens who suffered servitude.

Record Number   Byte Offset
1               51
2               102
3               186
4               226
5               258
Figure 9: Sample Library of Congress (BILL) embedded records and .siz index

The context in which an element with simple content (S) is referenced affects its mapping into an XDX construct. If S is referenced within a repetition, then it is always mapped into a nested-record class: the nested-record set naturally models repetition. If S does not occur within a repetition, e.g., it occurs as a single or optional element within another element C, then the content and attributes of S may be "inlined" as fields in C [17]. Independent elements are mapped to record classes that have bin files. In Fig. 12, the ProcessSpec element is independent. By definition, independent records are not nested in other records; therefore, if an independent element E contains an element reference to another independent element F, the element reference is modeled by a field that contains a foreign key of the independent-record class for F. Nested records capture structure and nesting, but are not guaranteed under update to preserve the relative order of dependent records. This means that a nested-record class cannot directly model the sequence operator, which specifies the relative order between sibling elements. To model document order, XDX associates a quadruple key with each element in a document. During document loading, XDX associates a depth-first prefix and postfix ordinal value with each element; the quadruple key contains a unique document identifier, the preorder and postorder numbers, and the element's parent's preorder number. Note in Fig. 12 the PROCESS_SPEC
record class contains a Quad field, which encodes the document order of ProcessSpec elements. Quadruples support efficient evaluation of XPath expressions. We have described how embedded records are used to model structured records embedded in text, which corresponds directly to XML mixed content in which the structured content is known a priori. One of the main benefits of embedded records is that they avoid having to shred XML documents that contain mixed content. Embedded records can also be used to model open content models, i.e., text interleaved with arbitrary structured content, and XML documents that are laxly validated [20]. Fig. 13 contains the mapping schema for a Documentation element with open content. Because the embedded structure is unknown, it is modeled as an arbitrary graph of nodes and edges [6]. The tag of each element is stored along with either its terminal leaf value or the content of its complex node.
7. RELATED SYSTEMS AND DISCUSSION
Current activity in research and commercial support for heterogeneous data is focused on XML. Native systems are designed specifically for XML, whereas non-native solutions focus on mapping XML documents into the relational or object-relational models.

Native solutions handle XML-specific heterogeneity at the storage level. The Natix [10] storage system decomposes an XML document into sub-forests; each sub-forest is stored on one disk page and split when it overflows. Sub-forests are linked via pointers to recover the tree structure. Natix may use application-specific hints to cluster related elements in order to minimize page hits. The Tamino [16] database system is a commercial native XML solution, which focuses on both structural and full-text search of XML documents. By definition, native XML solutions do not support non-XML data and therefore are not as adept as traditional databases at storing and querying homogeneous data.

Figure 10: BILL embedded-record class description
Figure 11: Architecture of XDX's XML Frontend. Components labeled in the diagram: XML Schema(ta), Independent Elements, Normalization, Mapping, Mapping Schema(ta), XDX Schema Extractor, XDX Schema, XML Document(s), XML Document Loader, Data Files (Bins), XQuery Query, Query Translator, Cymbal Query, XDX/Daytona Query Engine.

Non-native solutions are either generic, schema-driven, or user-defined. Generic mappings define a generic relational schema for storing any XML document. Schema-driven mappings define a set of fixed or application-dependent rules for mapping an XML schema construct into a relational construct. User-defined mappings require the user to specify explicitly the mappings of XML schema constructs into corresponding relational constructs.

The generic Edge and Attribute techniques [6] model an XML document as an edge-labeled tree. Edges can be stored in a single table using a generic schema (source, ordinal, tag, target), or they can be horizontally partitioned into multiple tables based on edge labels (tags). These methods handle variational and structural heterogeneity in that edges that are not present in the data are not stored, but they sacrifice most of the benefits of the relational model by ignoring schema information. Even simple navigation operations can require many joins, and document reconstruction may require computing the transitive closure of the edge table.
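To make the join cost of the generic Edge approach concrete, the following is a minimal sketch, not the implementation evaluated in [6]. It shreds a small XML fragment into a single edge table with the (source, ordinal, tag, target) schema and evaluates a three-step path, which takes one self-join per additional step. The sample document, the separate leaf table for text values, the use of node 0 as a virtual document root, and all function names are assumptions made for illustration; attributes are ignored for brevity.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document (not taken from the paper).
DOC = """<ProcessSpec name="order">
  <Documentation>top-level notes</Documentation>
  <ProcessSpec name="billing">
    <Documentation>billing notes</Documentation>
  </ProcessSpec>
</ProcessSpec>"""


def shred(xml_text):
    """Shred a document into edge rows and leaf-value rows (attributes ignored)."""
    edges, leaves = [], []
    next_id = [1]  # node 0 is an assumed virtual document root

    def visit(elem, parent_id, ordinal):
        node_id = next_id[0]
        next_id[0] += 1
        edges.append((parent_id, ordinal, elem.tag, node_id))
        children = list(elem)
        if not children and elem.text and elem.text.strip():
            leaves.append((node_id, elem.text.strip()))
        for i, child in enumerate(children, start=1):
            visit(child, node_id, i)

    visit(ET.fromstring(xml_text), 0, 1)
    return edges, leaves


def load(conn, edges, leaves):
    """Create the generic edge table and an assumed side table for leaf text."""
    conn.execute("CREATE TABLE edge (source INT, ordinal INT, tag TEXT, target INT)")
    conn.execute("CREATE TABLE leaf (node INT, content TEXT)")
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?, ?)", edges)
    conn.executemany("INSERT INTO leaf VALUES (?, ?)", leaves)


# Path /ProcessSpec/ProcessSpec/Documentation: three steps, two edge self-joins.
PATH_QUERY = """
SELECT l.content
FROM edge e1
JOIN edge e2 ON e2.source = e1.target
JOIN edge e3 ON e3.source = e2.target
JOIN leaf l ON l.node = e3.target
WHERE e1.source = 0 AND e1.tag = 'ProcessSpec'
  AND e2.tag = 'ProcessSpec'
  AND e3.tag = 'Documentation'
"""

conn = sqlite3.connect(":memory:")
edges, leaves = shred(DOC)
load(conn, edges, leaves)
result = [row[0] for row in conn.execute(PATH_QUERY)]
print(result)
```

Because the ProcessSpec element is recursive, a path of unknown depth cannot be written with a fixed number of joins in this layout, which is the transitive-closure cost noted above.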
The schema-driven Shared and Hybrid techniques [17] use the structure of a DTD to determine the mapping of DTD constructs into relational constructs. For each element, the key choice is whether to create a relational table for the element (e.g., if the element is "shared" by more than one parent element) or to "inline" the element's content in those elements that refer to it. The "hybrid" approach combines these two. The techniques introduce varying degrees of redundancy with the goal of reducing query costs. Schema-driven LegoDB [4] goes one step further and derives an optimal relational mapping for an XML schema based on a query workload. The optimizer applies semantics-preserving transformations (e.g., inlining/outlining, union factorization) to an XML Schema, and the resulting schema is translated to a relational schema by a fixed mapping. The optimizer chooses the XML schema whose corresponding relational schema minimizes query costs. Although these techniques automate the mapping problem, they all normalize XML constructs into "pure" relational constructs and therefore suffer from all the limitations of normalization described in Secs. 3 and 4.

Figure 12: Mapping schema for subset of ebXML business processes

Commercial systems typically provide automatic generic mappings or user-defined mappings. Microsoft SQL Server [14] supports an automated mapping using the generic edge approach or a (non-automated) user-defined mapping. IBM DB2 [9] provides a user-defined mapping in a DAD specification, which is similar to XDX's mapping schema, but which the user must write entirely by hand. Oracle 9i [13] is the only solution to use object-relational features in its mapping of XML schema constructs: an element is either mapped to a record in a table or, if necessary, to an object, but there is a rigid separation between records and objects, which prevents their intermingling to support heterogeneity. All three systems use CLOBs to support large bodies of XML text. IBM DB2 users can specify "side tables" for storing and indexing structured data that is embedded in CLOBs. None of the solutions provides direct support for annotational heterogeneity.

Object-oriented (OO) database systems were designed, in part, to handle variational and structural heterogeneity: together, collection types and an inheritance hierarchy can model optional and repeated structured content as well as variants. Variant- and nested-record classes are in spirit "object-oriented" in that they permit definition of shared fields in base classes and unique fields within variants. Unlike object-oriented features, they are a natural extension of the relational model and do not carry the semantic weight of object classes.

In this paper, we described the design choices that have been made in XDX. XDX is in the early stages of implementation. To our knowledge, XDX is the first system that seamlessly extends the relational data model to support variational, structural, and annotational heterogeneity at both the logical and physical levels. These extensions are important for handling heterogeneity that arises in non-XML and XML data sources.
8. REFERENCES
[1] S. Abiteboul. Querying semi-structured data. In ICDT, 1997.
[2] A. Ailamaki, D. J. DeWitt, M. D. Hill, M. Skounakis. Weaving Relations for Cache Performance. In VLDB, 2001.
[3] S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, D. Srivastava, Y. Wu. Structural joins: A Primitive for Efficient XML Query Pattern Matching. In ICDE, 2002.
[4] P. Bohannon, et al. From XML schema to relations: A cost-based approach to XML storage. In ICDE, 2001.
[5] ebXML Business Process Specification Schema v1.01. http://www.ebxml.org/specs/index.htm
[6] D. Florescu, D. Kossmann. A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database. IEEE Data Engineering Bulletin, 1999.
[7] R. Greer. Daytona and the fourth-generation language Cymbal. In ACM SIGMOD, 1999.
[8] Health Level 7, v2.2. http://www.hl7.org/
[9] IBM DB/2 XML Extender Administration and Programming. http://www-4.ibm.com/software/data/db2/extenders/xmlext/docs/v71wrk/english/index.htm
[10] C. Kanne and G. Moerkotte. Efficient storage of XML data. In ICDE, 2000.
[11] B. Krishnamurthy and J. Rexford. Web Protocols and Practice. Addison Wesley, 2001.
[12] Library of Congress, sample DTDs. http://lcweb.loc.gov/crsinfo/xml/
[13] Oracle9i Application Developer's Guide - XML, Release 1 (9.0.1). http://download-east.oracle.com/otndoc/oracle9i/901_doc/appdev.901/a88894/toc.htm
[14] G. Malcolm. Programming Microsoft SQL Server 2000 with XML, 2001.
[15] PROXIML (PROtein eXtensIble Markup Language). http://www.cse.ucsc.edu/~douglas/proximl/
[16] H. Schöning. Tamino - A DBMS designed for XML. In ICDE, 2001.
[17] J. Shanmugasundaram, et al. Relational databases for querying XML documents: Limitations and opportunities. In VLDB, 1999.
[18] J. Ullman. Principles of Database and Knowledge-base Systems, Volume II, 1999.
[19] World-Wide Web Consortium, "XML Schema: Formal Description", W3C Working Draft, Sept. 2001. http://www.w3.org/TR/xmlschema-formal
[20] World-Wide Web Consortium, "XML Schema Part 1: Structures", W3C Recommendation, May 2001. http://www.w3.org/TR/xmlschema-1
[21] World-Wide Web Consortium, "XQuery 1.0: An XML Query Language", W3C Working Draft, Dec. 2001. http://www.w3.org/TR/xquery/

Figure 13: Mapping schema for element with "open" content model
APPENDIX A. VERTICAL PARTITIONING
Fig. 14 illustrates the interaction of horizontal and vertical partitioning. Records in the SUPPLIER record class are vertically partitioned into three partitioning record classes: SUPPLIER_V_1, which contains the fields Number, Name, City, and Zip; SUPPLIER_V_2, which contains the Orders field; and SUPPLIER_V_3, which contains the Company_History field. Each vertical partition is horizontally partitioned on the Category field. A partitioned record can be accessed like any other record; the correspondence between fields in the partitioned fragments is handled automatically by Daytona.
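As a rough illustration of how the two partitioning schemes compose, the sketch below splits hypothetical SUPPLIER records into the three vertical fragments named above, groups each fragment by Category, and reassembles one logical record. It is not Daytona's mechanism: the use of Number as the key that correlates fragments, the in-memory dictionaries standing in for bin files, the sample data, and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Field groups named in the text; Number as the correlating key is an assumption.
VERTICAL_GROUPS = {
    "SUPPLIER_V_1": ["Number", "Name", "City", "Zip"],
    "SUPPLIER_V_2": ["Orders"],
    "SUPPLIER_V_3": ["Company_History"],
}
KEY = "Number"
HORIZONTAL_FIELD = "Category"


def partition(records):
    """Return {vertical class: {category: {key: fragment}}}."""
    bins = {name: defaultdict(dict) for name in VERTICAL_GROUPS}
    for rec in records:
        for name, fields in VERTICAL_GROUPS.items():
            fragment = {f: rec[f] for f in fields}
            bins[name][rec[HORIZONTAL_FIELD]][rec[KEY]] = fragment
    return bins


def fetch(bins, category, key):
    """Reassemble one logical record from its three vertical fragments."""
    record = {HORIZONTAL_FIELD: category}
    for name in VERTICAL_GROUPS:
        record.update(bins[name][category][key])
    return record


# Hypothetical sample data.
suppliers = [
    {"Number": 1, "Name": "Acme", "City": "Newark", "Zip": "07102",
     "Category": "hardware", "Orders": [101, 102],
     "Company_History": "Founded 1950."},
    {"Number": 2, "Name": "Bolt", "City": "Summit", "Zip": "07901",
     "Category": "software", "Orders": [],
     "Company_History": "Founded 1999."},
]

bins = partition(suppliers)
restored = fetch(bins, "hardware", 1)
print(restored == suppliers[0])
```

A query that touches only Name and City would read SUPPLIER_V_1 for the relevant category and never load the larger Orders or Company_History fragments, which is the usual motivation for combining the two schemes.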