From XML Schema to Relations: A Cost-Based Approach to XML ...

From XML Schema to Relations: A Cost-Based Approach to XML Storage

Phil Bohannon, Juliana Freire, Prasan Roy, Jérôme Siméon
Bell Labs
{bohannon,juliana,prasan,simeon}@research.bell-labs.com

Abstract XML has become an important medium for data representation, particularly when that data is exchanged over or browsed on the Internet. As the volume of XML data increases, there is a growing interest in storing XML in relational databases so that the well-developed features of these systems (e.g., concurrency control, crash recovery, query processors) can be re-used. However, given the wide variety of XML applications and the mismatch between XML’s nested-tree structure and the flat tuples of the relational model, storing XML documents in relational databases presents interesting challenges. LegoDB is a cost-based XML-to-relational mapping engine that addresses this problem. It explores a space of possible mappings and selects the best mapping for a given application (defined by an XML Schema, XML data statistics, and an XML query workload). LegoDB leverages existing XML and relational technologies: it represents the target application using XML standards and constructs the space of configurations using XML-specific operations, and it uses a traditional relational optimizer to obtain accurate cost estimates of the derived configurations. In this paper, we describe the LegoDB mapping engine and provide experimental results that demonstrate the effectiveness of this approach.

1 Introduction

As XML is now an important medium for representing, exchanging and accessing data over the Internet, applications are processing an increasing amount of XML data. Not surprisingly, there is a growing interest in storing XML in relational databases so that these applications can use a complete set of data management services (including concurrency control, crash recovery, scalability, etc.) and benefit from the highly optimized relational query processors. A number of strategies have been proposed [7, 11, 14, 18, 19] to address the XML-to-relational mapping problem. An important limitation of most of these proposals is that they rely on a fixed XML-to-relational mapping. One single mapping is unlikely to work well for more than a few of the wide variety of access patterns an application may present. For example, a web site may perform a large volume of simple lookup queries, whereas a catalog printing application may require large and complex queries with deeply nested results. Modern commercial databases (see, e.g., [24]), on the other hand, provide a more flexible approach to storing XML data, by allowing the developer to specify the storage mapping.

However, this approach has drawbacks, most notably: it requires the developer to master two quite distinct technologies (XML and RDBMS); and it might be hard, even for an expert, to determine a good mapping for a complex application. In this paper, we introduce a novel cost-based framework for XML-to-relational storage mapping, and describe the design of LegoDB, a tool based on this framework that automatically finds an efficient XML-to-relational mapping for a target application.

The three main design principles behind LegoDB are cost-based search, logical/physical independence, and re-use of existing technology. Since a one-size-fits-all mapping is unlikely to be effective given the wide variety of XML applications (with data ranging from flat to nested, schemas ranging from structured to semistructured, access patterns ranging from traditional SPJ queries to full-text or recursive queries), our first principle is to take application parameters into account. More precisely, given a schema describing the XML data to be processed, a query workload, and data statistics, the LegoDB engine explores various relational configurations in order to find the most efficient one for the application. Our second principle is to support logical/physical independence. Developers of XML applications should deal with XML structures and operations, and they should not be concerned with the underlying physical storage in a relational database. Hence, the LegoDB interface is purely XML-based: it takes as input XML queries, schemas and statistics. Our third principle is to leverage existing XML and relational technology whenever possible. LegoDB relies on: 1) existing XML standards to represent the target application, 2) XML-specific operations over a schema to generate a space of possible mappings, and 3) a traditional relational optimizer to obtain accurate cost estimates for the derived mappings.
On the first point, the queries which make up the application workload are given in XQuery [5], and the user data is described with XML Schema [21].

Our main contributions are summarized below. We introduce the notion of physical XML Schemas (p-schemas), which are XML Schemas extended with statistics about the underlying XML data. We define a fixed mapping from a particular p-schema to a relational schema and a corresponding mapping from XML documents to databases. We define XML Schema transformations that, when applied to a p-schema and followed by the fixed mapping, lead to a space of alternative storage configurations. The idea is that one may define many alternate XML Schemas which are equivalent in terms of the documents which are valid under each schema, but which yield different configurations. For instance, types can be introduced or elided, and regular expressions can be rewritten without affecting the semantics of the schema. Because the proposed rewritings are specific to XML Schema, this search space contains many configurations not exploited by relational storage design tools (see, for example, [1, 23]). Through the fixed mapping, XML-specific statistics are translated into the corresponding relational statistics, and XQuery workloads are converted into the corresponding SQL workloads. As a result, we can exploit a traditional relational optimizer to obtain cost estimates for the various configurations, and select the best among them. One potential problem with this approach is search space explosion. Due to the nature

of XML Schema, the schema transformations may lead to a large (possibly infinite) search space. In this paper we use a greedy evaluation strategy to explore an interesting subset of this space. We give experimental results which show that LegoDB is able to find efficient storage designs for a variety of workloads in a reasonable time. Our results indicate that our cost-based exploration selects storage designs which would not be arrived at by previously-proposed heuristics, and that in most cases, these designs have significantly lower costs.

Organization of the paper. The rest of the paper is organized as follows. In Section 2, we present a motivating example along with some background information. In Section 3, we present the LegoDB framework for mapping XML Schemas, queries, and documents into relational configurations, queries, and databases. In Section 4, we present the rewriting rules defining the search space and our search algorithm. In Section 5, we present preliminary experimental results. We review related work in Section 6, and discuss directions for future work in Section 7.

2 Background and Motivating Example

In this section, we motivate our approach, notably the use of XML Schema and the cost-based evaluation of storage mappings, with an example XML storage mapping scenario inspired by the Internet Movie Database [13].

XML documents and DTDs. Figure 1 gives an example XML fragment in which the show element is used to represent movies and TV shows. This element contains information that is shared between movies and TV shows, such as title and year, as well as information specific to movies (e.g., box office and video sales) and to TV shows (e.g., seasons). Figure 2(a) shows a Document Type Definition (DTD) [2] for the example document of Figure 1. The DTD contains declarations for all elements and attributes in the document. The contents of each element may be text (e.g., #PCDATA, CDATA), or a regular expression over other elements (e.g., (show*,director*,actor*)).

Using XML Schema for storage. Figure 2(b) shows an alternative schema described using the notation for types from the XML Query Algebra [9]. This notation captures the core semantics of XML Schema, abstracting away some of the complex features of XML Schema which are not relevant for our purposes (e.g., the distinction between groups and complexTypes, local vs. global declarations, etc.). The XML Schema and the XML Query Algebra notation for our sample schema can be found in Appendix B. Like DTDs, XML Schema describes elements (e.g., show) and attributes (e.g., @type) and uses regular expressions to describe allowed subelements (e.g., imdb contains Show*, Director*, Actor*). But Figure 2(b) also illustrates a number of distinguishing features that are useful for storage. First, one can specify precise data types (e.g., String, Integer) instead of text, an essential feature for generating an efficient storage configuration.
Also, regular expressions are extended with more precise cardinality annotations for collections (e.g., {1,10} indicates that there can be between 1 and 10 aka elements for show),


Fugitive, The 1993 Auf der Flucht Fuggitivo, Il Roger Ebert Two thumbs up! This is a fun action movie, Harrison Ford at his best. The standard Hollywood summer movie strikes back. 183,752,965 72,450,220

X Files, The 1994 Akte X - Die unheimlichen Fälle des FBI Aux frontieres du Reel 10 1993 1994 1995 1996 1997 1998 1999 2000 2001 A paranoiac FBI agent teams up with a frustrated female scientist to chase DNA modified aliens financed by the NSA. Ghost in the Machine Jerrold Freedman Fallen Angel Larry Shaw ....

Figure 1: XML data sample for a subset of the IMDB

<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT aka (#PCDATA)>
<!ELEMENT review (#PCDATA)>
<!ELEMENT seasons (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT episode (name,guest_director)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT guest_director (#PCDATA)>

(a)

type IMDB = imdb [ Show*, Director*, Actor* ]
type Show = show [ @type[ String ], title[ String ], Year, Aka{1,10}, Review*, ( Movie | TV ) ]
type Year = year[ Integer ]
type Aka = aka[ String ]
type Review = review[ ~[ String ] ]
type Movie = box_office[ Integer ], video_sales[ Integer ]
type TV = seasons[ Integer ], description[ String ], episode[ name[ String ], guest_director[ String ] ]*
type Director = ...

(b)

Figure 2: Schema samples for the IMDB documents


Original XML Schema:

type Show = show [ @type[ String ], title[ String ], year[ Integer ], Aka{1,10}, Review*, ( Movie | TV ) ]
type Aka = aka [ String ]
.....

Mapped relational schema:

TABLE Show ( Show_id INT, type STRING, title STRING, year INT )
TABLE Aka ( Aka_id INT, aka STRING, parent_Show INT )
.....

Figure 3: Mapping XML Schema to relations

which enables the specification of more constrained collections. Finally, XML Schema can describe so-called wildcards: for instance, the [AnyType] notation specifies that the review element can contain an element with an arbitrary name and content. This allows XML Schema to describe parts of the schema for which no precise structural information is available.

Storage mappings. In addition to the features described above, a very important difference between XML Schema and DTDs is that the former distinguishes between elements (e.g., a show element) and their type (e.g., the Show type). The type name never appears in the document, and one element may have different allowed content when it appears in different types. A key feature of the LegoDB approach is that it uses the classification of elements to type names as the basis for creating storage mappings. As an example, Figure 3 shows a sample mapping for a fragment of the schema in Figure 2(b). Each type (e.g., Show) can be used to group a set of elements together. The LegoDB mapping engine creates a table for each such type (e.g., Show) and maps the contents of the elements (e.g., type, title, etc.) to columns of that table. Finally, the mapping also generates a key column that contains the id of the corresponding element (e.g., the Aka_id column), and a foreign key that keeps track of the parent-child relationship (e.g., the parent_Show column). Clearly, it is not always possible to map types into relations. For instance, since there can be many episode elements in the type TV, these elements cannot be mapped into columns of that table. In Section 3, we introduce a restricted form of XML Schemas, which we refer to as physical schemas, which have the property that they are easily mapped to relations by creating one relation for each type name.

Schema transformations. An important observation is that there are many different XML schemas that validate the exact same set of documents.
For instance, different but equivalent regular expressions (e.g., (a,(b|c*)) ≡ ((a,b)|(a,c*))) can describe the contents of a given element. In addition, the allowed subelements of an element can be referred to directly (e.g., the element title in Show), or can be referred to by a type name (e.g., see the type Year). Although the presence of a type name does not change the semantics of the XML Schema, it affects the derived relational schema, as our mapping generates one relation for each type. Hence, by performing a sequence of transformations (also called rewritings) which preserve the semantics of the schema and then generating the implied storage mapping, a space of storage

(a)
TABLE Show ( Show_id INT, type STRING, title STRING, year INT, box_office INT, video_sales INT, seasons INT, description STRING )
TABLE Review ( Reviews_id INT, tilde STRING, reviews STRING, parent_Show INT )
TABLE Episode ( Episode_id INT, name STRING, parent_Show INT )
....

(b)
TABLE Show ( Show_id INT, type STRING, title STRING, year INT, box_office INT, video_sales INT, seasons INT, description STRING )
TABLE NYT_Reviews ( Reviews_id INT, review STRING, parent_Show INT )
TABLE Reviews ( Reviews_id INT, tilde STRING, review STRING, parent_Show INT )
TABLE Episode ( Episode_id INT, name STRING, parent_Show INT )
.....

(c)
TABLE Show_Part1 ( Show_Part1_id INT, type STRING, title STRING, year INT, box_office INT, video_sales INT )
TABLE Show_Part2 ( Show_Part2_id INT, type STRING, title STRING, year INT, seasons INT, description STRING )
TABLE Reviews ( Reviews_id INT, tilde STRING, review STRING, parent_Show INT )
TABLE Episode ( Episode_id INT, name STRING, parent_Show INT )
....

Figure 4: Three storage mappings for the Show element

mappings can be explored.

Cost-based evaluation of XML storage. Figure 4 shows three possible relational storage mappings that are generated by some of our transformations. For instance, configuration (a) results from inlining as many elements as possible in a given table, roughly corresponding to the strategy advocated by [19]. Configuration (b) is obtained from configuration (a) by partitioning the reviews table into two tables (one that contains New York Times reviews, and another for reviews from other sources). Finally, configuration (c) is obtained from configuration (a) by splitting the Show table into two tables, one for movies and one for TV shows. Even though each of these configurations can be the best for a given application, there may be cases where they perform poorly. An important question is then how to select a particular configuration. In LegoDB, this decision is based on query workloads and data statistics. Consider the queries of Figure 5 described in XQuery [4]. The first query returns the title, year and the New York Times reviews for all shows from 1999. Query 2 publishes all the information available for all shows in the database. Query 3 retrieves the description of a show based on the title, and Query 4 retrieves episodes of shows directed by a particular guest director. Whereas queries 1 and 2 are typical of a publishing scenario (i.e., to send a movie catalog to an interested partner), queries 3 and 4 contain specific selection criteria and are typical of interactive lookup queries. We then define two workloads, W1 and W2, where each workload contains a set of queries and an associated weight for each query that could reflect its relative importance for the application. From an application perspective, workload W1 might be representative of the workload generated by a cable company which routinely publishes large parts of the database for download to intelligent set-top boxes, while W2


Q1: FOR $v in imdb/show
    WHERE $v/year = 1999
    RETURN $v/title, $v/year, $v/nyt_reviews

Q2: FOR $v in imdb/show
    RETURN $v

Q3: FOR $v in imdb/show
    WHERE $v/title = c2
    RETURN $v/description

Q4: FOR $v in imdb/show
    RETURN $v/title, $v/year,
      (FOR $e in $v/episode
       WHERE $e/guest_director = c4
       RETURN $e)

Figure 5: Queries for the Show element

      Storage Map 1   Storage Map 2   Storage Map 3
      (Fig 4(a))      (Fig 4(b))      (Fig 4(c))
Q1    1.00            0.83            1.27
Q2    1.00            0.50            0.48
Q3    1.00            1.00            0.17
Q4    1.00            1.19            0.40
W1    1.00            0.75            0.75
W2    1.00            1.01            0.40

Figure 6: Estimated Costs for Queries and Workloads

might represent the lookup queries issued to a movie-information web site, like the IMDB itself.

Figure 6 shows the estimated costs for the queries and workloads returned by the LegoDB storage mapping tool for each configuration in Figure 4. These costs are normalized by the costs of Storage Map 1. Note that only the first of the three storage mappings shown in Figure 4 can be generated by the previous heuristic approaches of which we are aware. However, this mapping has significant disadvantages for either workload we consider. First, due to its treatment of union, it inlines several fields which are not present in all the data, making the Show relation larger than necessary. Second, when the entire Show relation is exported as a single document, the records corresponding to movies need not be joined with the Episode tables, but this join is required by mappings 4(a) and 4(b). Finally, the large description element need not be inlined unless it is frequently queried.
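The selection step described above can be illustrated with a minimal sketch, assuming that a workload's cost is the weighted sum of its per-query cost estimates. The function names and the two weight assignments below are hypothetical and are not taken from the paper. The per-query numbers are the normalized values of Figure 6; because they are normalized per query, the weighted sums computed here are illustrative only and are not expected to reproduce the W1 and W2 rows of that figure.

```python
from typing import Dict

def workload_cost(query_costs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-query cost estimates for one configuration."""
    return sum(weights[q] * query_costs[q] for q in weights)

def best_configuration(costs_by_config: Dict[str, Dict[str, float]],
                       weights: Dict[str, float]) -> str:
    """Name of the configuration with the lowest weighted workload cost."""
    return min(costs_by_config,
               key=lambda name: workload_cost(costs_by_config[name], weights))

# Per-query costs from Figure 6, normalized by Storage Map 1.
FIGURE6 = {
    "map1": {"Q1": 1.00, "Q2": 1.00, "Q3": 1.00, "Q4": 1.00},
    "map2": {"Q1": 0.83, "Q2": 0.50, "Q3": 1.00, "Q4": 1.19},
    "map3": {"Q1": 1.27, "Q2": 0.48, "Q3": 0.17, "Q4": 0.40},
}

# Hypothetical weights (assumptions, not the paper's W1 and W2):
# one publishing-heavy workload and one lookup-heavy workload.
PUBLISHING = {"Q1": 0.5, "Q2": 0.5, "Q3": 0.0, "Q4": 0.0}
LOOKUP = {"Q1": 0.0, "Q2": 0.0, "Q3": 0.5, "Q4": 0.5}

if __name__ == "__main__":
    print(best_configuration(FIGURE6, PUBLISHING))
    print(best_configuration(FIGURE6, LOOKUP))
```

Under these assumed weights, a different configuration wins for each workload, which is the point of making the choice depend on the workload rather than on a fixed heuristic.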

3 From XML Schema to Relations

The architecture of the LegoDB mapping engine is depicted in Figure 7. Given an XML Schema and statistics extracted from an example XML dataset, we first generate an initial physical schema (PS0). The physical schema and the XQuery workload are then input into the Query/Schema Translation module, which in turn generates the corresponding relational catalog (schema and statistics) and SQL queries that are input into a relational optimizer for cost estimation. Schema transformation operations are then repeatedly applied

[Figure 7 diagram. Inputs: XML Schema, XML data statistics, XQuery workload. Components: Generate Physical Schema (produces PS0), Physical Schema Transformation, Query/Schema Translation (PSi to RSi), Query Optimizer (RSi to cost(SQi)). Output: Optimal configuration. Legend: PSi: Physical Schema; RSi: Relational Schema/Queries/Stats.]

Figure 7: Architecture of the Mapping Engine

to PS0, and the process of Schema/Query translation and cost estimation is repeated for each transformed PS until a good configuration is found. In this section we focus on physical schemas and on the Query/Schema Translation module.
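The transform-translate-estimate loop described above can be sketched as a generic greedy search. This is only an illustration under stated assumptions: the actual LegoDB search algorithm is presented in Section 4, and the names `greedy_search`, `neighbors`, and `cost` are hypothetical. Here `neighbors` stands in for applying one schema transformation to a physical schema, and `cost` stands in for translating the schema and workload to SQL and asking the relational optimizer for an estimate.

```python
from typing import Callable, Iterable, Tuple, TypeVar

S = TypeVar("S")  # a physical schema, in whatever representation is used

def greedy_search(initial: S,
                  neighbors: Callable[[S], Iterable[S]],
                  cost: Callable[[S], float]) -> Tuple[S, float]:
    """Repeatedly move to the cheapest one-step transformation of the
    current schema; stop when no transformation lowers the cost."""
    best, best_cost = initial, cost(initial)
    while True:
        candidates = [(cost(s), s) for s in neighbors(best)]
        if not candidates:
            return best, best_cost
        cand_cost, cand = min(candidates, key=lambda pair: pair[0])
        if cand_cost >= best_cost:
            return best, best_cost
        best, best_cost = cand, cand_cost

if __name__ == "__main__":
    # Toy stand-in: a "schema" is an integer, a "transformation" adds or
    # subtracts one, and the cost function has its minimum at 3.
    schema, estimate = greedy_search(0,
                                     lambda n: [n - 1, n + 1],
                                     lambda n: (n - 3) ** 2)
    print(schema, estimate)
```

A greedy loop of this kind only ever evaluates the immediate neighbors of the current schema, which keeps the number of optimizer calls bounded even when the full space of equivalent schemas is very large.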

3.1 Physical XML Schemas

As pointed out in [19], mapping DTDs to relational configurations is a hard problem. There are several reasons for that: (1) the presence of regular expressions, nested elements and recursive types results in a mismatch with flat relations; (2) DTDs do not differentiate between elements that correspond to entities (e.g., a person) and elements that correspond to some attribute of that entity (e.g., the name of a person); hence it is not clear whether one should map an element to a relation or to an attribute of a relation; (3) DTDs define no explicit data types for elements (e.g., integer, date), and as a result all values must be stored as strings, which can lead to inefficiencies. As we have seen in Section 2, XML Schema differs from DTDs in a number of ways. Notably, because XML Schema distinguishes between type names and element description, a straightforward mapping strategy is to create a relation for each type in XML Schema. In addition, XML Schema provides explicit data types which lead to more natural (and efficient) storage mappings. However, a number of difficulties remain: the mismatch between the structure of XML Schema types and relations, due to the presence of nested tree regular expressions, and the lack of information about the data to be stored, e.g., cardinality of collections and number of distinct values for an attribute, which is necessary for designing an efficient storage mapping. In order to address these problems, we introduce the notion of physical XML schemas (p-schemas). P-schemas have the following properties: (i) they are as expressive as XML Schemas, (ii) they contain useful statistics about the data to be stored, and (iii) there exists a fixed, simple mapping from p-schemas into relational schemas. Before we give a precise definition of p-schemas, we illustrate the construction of a p-schema from an XML Schema through an example.
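As a rough illustration of the "one relation per type" strategy, the sketch below turns a simplified type description into a table definition in the notation of Figure 3: a key column named after the type, one column per scalar-valued element or attribute, and a parent foreign key when the type is nested inside another type. The `TypeDef` class and `to_table` function are hypothetical helpers introduced for this example; they are not part of LegoDB, and they cover only String and Integer scalars.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed scalar-type translation, following the STRING/INT names of Figure 3.
SQL_TYPES = {"String": "STRING", "Integer": "INT"}

@dataclass
class TypeDef:
    name: str                       # type name, e.g. "Show"
    columns: List[Tuple[str, str]]  # (element or attribute name, scalar type)
    parent: Optional[str] = None    # enclosing type name, if any

def to_table(t: TypeDef) -> str:
    """Render one type as a table: id key, scalar columns, parent foreign key."""
    cols = [f"{t.name}_id INT"]
    cols += [f"{col} {SQL_TYPES[scalar]}" for col, scalar in t.columns]
    if t.parent is not None:
        cols.append(f"parent_{t.parent} INT")
    return f"TABLE {t.name} ( {', '.join(cols)} )"

if __name__ == "__main__":
    show = TypeDef("Show", [("type", "String"), ("title", "String"), ("year", "Integer")])
    aka = TypeDef("Aka", [("aka", "String")], parent="Show")
    print(to_table(show))
    print(to_table(aka))
```

For the Show and Aka types this produces the two table definitions shown in Figure 3; types whose content is a collection or a union need the additional treatment developed in the rest of this section.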
Transforming an XML Schema into a P-schema. By inserting appropriate type names for certain elements, one can satisfy (iii) above while preserving the semantics of the original schema. For instance, in order to

(a) Initial XML Schema
type Show = show [ @type[ String ], title[ String ], year[ Integer ], reviews[ String ]*, ... ]

(b) P-schema transformation
type Show = show [ @type[ String ], title[ String ], year[ Integer ], Reviews*, ... ]
type Reviews = reviews[ String ]

(c) Relational configuration
TABLE Show ( Show_id INT, type STRING, title STRING, year INT )
TABLE Review ( Review_id INT, review STRING, parent_Show INT )

Figure 8: P-schema creation

guarantee that there exists a simple and unique mapping into a relational configuration, the XML Schema is rewritten so that all multi-valued elements have an associated type name. For example, the Show type of Figure 8(a) cannot be stored directly into a relational schema because there might be multiple reviews elements in the data. However, the equivalent schema in Figure 8(b), in which this element is described by a separate type name, can be easily mapped into the relational schema shown in Figure 8(c). The foreign key from the Review table, parent_Show, is present since the type name Reviews appears within the definition of the Show type. No indication of the relationship appears in the Show table.

Data statistics. The p-schema also needs to store data statistics. These statistics are extracted from the data and inserted in the original physical schema PS0 during its creation. A sample p-schema with statistics for the type Show is given below:

type Show = show [ @type[ String ], year[ Integer ], title[ String ], Review* ]
type Review = review[ String ]

where Scalar indicates, for each scalar datatype, the corresponding size (e.g., 4 bytes for an integer), minimum and maximum values, and the number of distinct values; and String specifies the length of a string as well as the number of distinct values. The notation * indicates the relative number of Review elements within each element of type Show (e.g., in this example, there are 10 reviews per show).

Stratified physical types. We are now ready to define p-schemas. As we have discussed, it is essential that each type name contains a structure that can be directly mapped to a relation. Accordingly, we adapt the original syntax for types of [9] to enforce the appropriate structure (see footnote 1). The resulting grammar is shown in Figure 9. Because this new grammar is stratified (i.e., instead of the types defined in the original XML Schema, there are three different layers of types), it ensures that type names are always used within collections or unions in the schema. The first layer, physical types, contains only singleton elements, nested singleton elements, and optional types. The second layer, optional types, is used to represent element structures that

Footnote 1: For space reasons, we do not include the original grammar here, but encourage the reader to consult the original XML Query Algebra document.


[Figure 9: stratified grammar for physical types, with productions for scalar type (Integer, String, Boolean), physical scalar, named type, optional type, physical type, schema item, and schema.]
