Ozone: Integrating Structured and Semistructured Data

Tirthankar Lahiri (Stanford University), Serge Abiteboul (INRIA Rocquencourt), Jennifer Widom (Stanford University)

Abstract

Applications have an increasing need to manage semistructured data (such as XML) along with conventional structured data. We extend the structured object database model ODMG and its query language OQL with the ability to handle semistructured data based on the OEM model and Lorel language, and we implement our extensions in a system called Ozone. In our approach, structured data, such as typed objects or relations, may contain entry points to semistructured data, and vice versa. The unified representation and querying of such "hybrid" data is the main contribution of our work. We retain strong typing and access to all properties of structured portions of the data while still allowing untyped access to semistructured portions of the data. Ozone also enhances both ODMG/OQL and OEM/Lorel by virtue of their combination. For example, ordering in ODMG allows Ozone to provide ordering in OEM (enabling correct modeling of XML). Furthermore, untyped OEM semantics can optionally be applied to suspend type-checking on ODMG data, allowing Ozone to support semistructured-style navigation of structured data. Ozone is implemented as a wrapper on top of the ODMG-compliant O2 database system, and it fully supports our extensions to the ODMG model and OQL.

1 Introduction

Database management systems traditionally have used data models based on regular structures, such as the relational model [Cod70] or the object model [Cat94]. Meanwhile, the proliferation of diverse information repositories across the Internet, the integration of heterogeneous data sources, and the recent emergence of XML [MFDG98] have motivated research in the area of semistructured data models, e.g., [BDS95, FFLS97a, PGMW95]. Like XML, semistructured data models allow the representation of information that is irregular, incomplete, or whose structure changes dynamically.

In this paper, we extend the standard model for object databases, the ODMG model [Cat94], and its query language, OQL, to integrate semistructured data with structured data. We present an implementation of the extended ODMG model and query language in a system called Ozone. We will see that Ozone is very well suited to handling hybrid data, that is, data that is partially structured and partially semistructured. We expect this combination of structured and semistructured data to become more and more prevalent. Increasingly many applications, in addition to processing structured data, import data from the Web or integrate data from heterogeneous, autonomous sources. The integration of semistructured data within ODMG greatly simplifies the design of such applications. The exclusive use of a structured data model (such as pure ODMG) for such hybrid data would force semistructured portions of the data into structured encodings, and would miss the numerous advantages of a semistructured data model and query language [Abi97, Bun97]. On the other hand, exclusive use of a semistructured data model precludes strong typing and efficient implementation mechanisms for structured portions of the data. Our approach based on a hybrid data model provides the advantages of both worlds.

This work was supported by the Air Force Rome Laboratories under DARPA Contract F30602-95-C-0119. Phone: (650) 506-6279, FAX: (650) 506-7203.

The integration of structured and semistructured data poses a number of challenges arising from the fundamental differences between their data models and query languages. Structured data models are characterized by strong schemas and type systems. All values in a structured model are associated with precise types specified by the schema, and type errors are raised for operations that violate the type system. In contrast, semistructured data models are characterized by the absence of schemas and type systems. Values within a semistructured model are self-describing, and operations on semistructured data do not produce type errors [AQM+97]. Combining the data models effectively requires preserving strong typing for structured parts of the data while also preserving untyped semantics for semistructured parts of the data. In addition, all properties of structured data should be accessible, even when the structured data is embedded within semistructured data. The query language for navigating hybrid data should accommodate structured and semistructured operands within the same expressions, and the semantics of such expressions must be defined to reflect the semantics of both structured and semistructured data.

Our extension to the ODMG data model uses the Object Exchange Model (OEM) [PGMW95] to represent semistructured portions of the data, and it allows structured and semistructured data to be mixed freely together in the same physical database. The query language for Ozone, called OQLS, is syntactically identical to the Object Query Language (OQL) [Cat94]. The semantics of OQLS extends the OQL semantics for querying ODMG data with the semantics of the OQL-based Lorel query language [AQM+97] for OEM data, as well as adding new semantics for the integrated querying of structured and semistructured data.

An interesting feature of our approach is that it also enables structured data to be treated as semistructured data if so desired, effectively turning off type-checking, but not losing access to structural properties. Conversely, our approach automatically adds some useful features to semistructured data, such as support for ordered lists. We have implemented the full functionality of Ozone as a wrapper application on top of the O2 [BDK92] ODMG-compliant database management system.

We could also have considered extending the relational model or the SQL-3 object-relational model [MD96] with semistructured OEM data. While we plan to explore these avenues as future work, the use of objects as a basic primitive in the ODMG model facilitated the task of extending it with the semistructured model of OEM. Our choice of the ODMG model was further motivated by the fact that ODMG provides a query language (OQL) with a well-defined semantics and an existing complete implementation in O2 (whereas SQL-3 is still evolving), and that the Lorel language for querying OEM data was derived from OQL [AQM+97]. Section 2.4 compares the Ozone approach to managing hybrid data with a purely semistructured approach.

Related Work

Data models, query languages, and systems for semistructured data are areas of active research. Of particular interest and relevance, eXtensible Markup Language (XML) [LB97, MFDG98] is an emerging standard for Web data, and is remarkably similar to semistructured data models introduced in research, e.g., [BDS95, FFLS97b, PGMW95]. An example of a complete database management system for semistructured data is Lore [MAG+97], a repository for OEM data featuring the Lorel query language. Another system devoted to semistructured data is the Strudel Web site management system [FFLS97a], which features the StruQL query language [FFLS97b] and a data model similar to OEM. UnQL [BDHS96, BDS95] is a query language that allows queries on both the content and structure of a semistructured database and also uses a data model similar to Strudel and OEM. All of these data models, languages, and systems are dedicated to pure semistructured data.

We know of no previous research that has explored the integration of structured and semistructured data as exhibited by Ozone. Note also that the query language OQLS is supported by a complete implementation in the Ozone system.

There have been numerous proposals for query languages for Web navigation; examples are W3QL (implemented in the W3QS system [KS95]), WebSQL [MMM96], and WebLog [LSS96]. The goal of such languages is to enable flexible and expressive querying and navigation of HTML hyperlinks and text. All such languages we know of rely on a model to represent Web data that is either semistructured or structured, and therefore these languages do not provide a hybrid approach such as that supported by Ozone.

Frameworks for integrating heterogeneous databases [Com91, Wie92] provide the capability to combine structured and semistructured data, since they are designed for disparate data sources. However, in general such work concentrates either on making the integrated information appear structured through a global schema, e.g., [LMR90, SL90], or treats all information as semistructured for the purposes of integration, e.g., [CGMH+94].

There has been some work in extracting structural information from and building structural summaries of semistructured databases. [NAM97] shows how approximate structural information can be inferred from semistructured databases, while [NAM98] shows how schema information can be extracted from OEM databases by typing semistructured data using Datalog. Structural properties of semistructured data can be described and enforced using graph schemas as shown in [BDFS97]. Structural summaries called DataGuides are used in the Lore system as described in [GW97]. These lines of research are dedicated to finding the structural properties of pure semistructured data and do not address the integration of structured and semistructured data as performed by Ozone.

The OQL-doc query language [ACV+97] is an example of an OQL extension with a semistructured flavor: it extends OQL to navigate document data without precise knowledge of its structure. OQL-doc differs from OQLS in that it navigates only "typed document data," i.e., documents with some form of structural specification (such as an XML or SGML DTD [LB97]), so it does not support the querying of pure semistructured data.

Finally, the object-relational data model [MD96] provides a good example of the extension of a data model (the relational model) with the ability to represent data from a very different data model (the object model). SQL-3 relations can have attributes with object types, just as ODMG objects in our proposed extension to the ODMG data model can have OEM attributes.

Outline

The rest of the paper is organized as follows. Section 2 provides background and motivation for our work by briefly describing the ODMG and OEM data models and their query languages, and by presenting an example application scenario that requires a combination of the two models and query languages. Section 3 specifies our extension to the ODMG data model with the ability to represent OEM data. Section 4 describes important features of the semantics of OQLS. Section 5 discusses the implementation of Ozone as a wrapper on top of the O2 database system. Section 6 concludes and mentions future work.

2 Background and Motivation

In this section, we first briefly review the ODMG data model and the OQL query language. We also briefly describe the OEM model and Lorel query language for semistructured data. We then use an example scenario to show how structured and semistructured data can be handled together within an extension of ODMG and OQL that allows the management of hybrid (ODMG and OEM) data.

2.1 The structured ODMG model and OQL

The ODMG data model is the accepted standard for object databases. ODMG has all the necessary features of object-orientation: classes encapsulating attributes and methods, subtyping, and multiple inheritance allowing complex class hierarchies. The basic primitives of the model are objects (values with unique identifiers) and literals (values without identifiers). All values in an ODMG database must have a valid type defined in the schema, and all values of an object type or class are members of a collection known as the extent for that class. The possible types in ODMG are:

Atomic types: integer, real, boolean, string, char, binary, and nil.

Structured types: The tuple constructor is used to generate structured types. If t1, ..., tn are valid types and l1, ..., ln are labels, then tuple(l1: t1, l2: t2, ..., ln: tn) is a structured type. (The keywords struct and tuple are interchangeable in ODMG.)

Collection types: ODMG supports sets, bags, lists, and arrays. If T denotes a valid type, then set(T), bag(T), list(T), and array(T) are collection types.

Classes: Each object in an ODMG database is an instance of a class that encapsulates some type T. In addition, a class may define methods that specify the legal set of operations on objects belonging to the class. A class may also define relationships with other classes. Classes most commonly encapsulate structured types, and the different fields in the encapsulated structure (along with methods without arguments) denote the attributes of the class. ODMG defines the class Object to be the root of the class hierarchy.

An attribute of a class in the ODMG model may have a literal type and therefore not have identity. Named objects and literals form entry points into an ODMG database. The Object Query Language, or OQL, is a declarative query language for navigating ODMG data. It is an expression-oriented query language: an OQL query is composed of one or more expressions or subqueries whose types can be inferred statically. Complex queries can be formed by composing expressions as long as the compositions respect the type system of ODMG. Examples will be given below. Details of the ODMG model and the OQL query language can be found in [Cat94], but in-depth knowledge is not essential for understanding this paper.

2.2 The semistructured OEM model and Lorel

The Object Exchange Model, or OEM, is a self-describing semistructured data model, useful for representing irregular or dynamically evolving data. OEM objects may either be atomic, containing atomic literal values (of type integer, real, string, binary, etc.), or complex, containing a set of labeled OEM subobjects. A complex OEM object may have any number of children (subobjects), including multiple children with the same label. Note that all OEM subobjects have identity, unlike ODMG class attributes. An OEM database may be viewed as a labeled directed graph, with complex OEM objects as internal nodes and atomic OEM objects as leaf nodes. Named OEM objects form entry points into an OEM database. Examples will be given below.

Lorel is a declarative query language for navigating OEM data and is based on OQL. Some important features of Lorel are:

Path expressions: Lorel queries navigate OEM databases using path expressions that are sequences of labels. For instance, the query "Select D From A.b B, B.c C, C.d D" returns all objects reachable by following a path with labels b, c, d starting from the named object A. Lorel supports path expressions with wildcards and regular expression operators. For instance, the query "Select C From A(.b|.c%)* C" selects all objects reachable from A by following zero or more edges having either label b or a label beginning with the character c.

Automatic coercion: Lorel attempts to coerce operands to compatible types whenever it performs a comparison or other operation on them. For instance, if X is an atomic OEM object encapsulating the string value "4", then for the evaluation of X < 10, Lorel coerces X to the integer value 4. If no such coercion is possible (for instance, if X were a complex object), the predicate returns false. Lorel also coerces between sets and singleton values whenever appropriate.

No type errors: Lorel never raises type errors. For instance, an attempted navigation from an OEM object using a nonexistent label simply produces an empty result, and a comparison between non-comparable values evaluates to false (as mentioned previously). Thus, arbitrary queries can be executed on an OEM database with unknown or partially known structure without the risk of run-time errors.
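The three Lorel features above can be illustrated with a small Python sketch. This is not Lorel or the Lore implementation: the class OemObject and the helpers follow, path, closure, and less_than are hypothetical names introduced only for illustration, and the wildcard pattern (.b|.c%)* is approximated here by a regular expression over edge labels.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

Atom = Union[int, float, str]


@dataclass
class OemObject:
    """Hypothetical OEM node: atomic if `value` is set, complex if it has children."""
    value: Optional[Atom] = None
    children: List[Tuple[str, "OemObject"]] = field(default_factory=list)


def follow(objs, label):
    """Follow every edge with the given label; a nonexistent label yields nothing."""
    return [child for o in objs for (lbl, child) in o.children if lbl == label]


def path(root, labels):
    """Evaluate a simple path expression such as A.b.c.d from a named root."""
    current = [root]
    for label in labels:
        current = follow(current, label)
    return current


def closure(root, label_regex):
    """Objects reachable via zero or more edges whose labels match label_regex."""
    pattern = re.compile(label_regex)
    seen, stack, result = set(), [root], []
    while stack:
        obj = stack.pop()
        if id(obj) in seen:          # guard against cycles in the graph
            continue
        seen.add(id(obj))
        result.append(obj)
        stack.extend(c for (lbl, c) in obj.children if pattern.fullmatch(lbl))
    return result


def less_than(obj, bound):
    """X < bound with coercion; evaluates to False instead of raising a type error."""
    if obj.children or obj.value is None:
        return False                 # complex objects cannot be coerced
    try:
        return float(obj.value) < bound
    except (TypeError, ValueError):
        return False                 # non-numeric strings are not comparable


# Illustrative data only: A.b.c.d leads to an atomic object holding the string "4".
d = OemObject(value="4")
c = OemObject(children=[("d", d)])
b = OemObject(children=[("c", c)])
A = OemObject(children=[("b", b), ("cx", OemObject(value=7))])
```

In this sketch, path(A, ["b", "c", "d"]) returns the single object d, a path through a missing label returns an empty list, and less_than(d, 10) succeeds because the string "4" is coerced to a number.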

Details of OEM and Lorel can be found in [AQM+97, PGMW95], but again in-depth knowledge is not crucial to understanding this paper.

2.3 Example of hybrid data and queries

In this section, we present an example scenario used throughout the paper. Our relatively simple example centers around a database for an on-line retail agency that sells products on the Web on behalf of different companies. There are three ODMG classes of objects in this database: Catalog, Company, and Product. A single object in the class Catalog represents the on-line catalog maintained by the agency. This class has two attributes: a vendors attribute of type set(Company), denoting the companies whose products are sold in the catalog, and a products attribute of type set(Product), denoting the full suite of products sold in the catalog. The Company class defines a one-to-many produces relationship with the class Product of type list(Product). This relationship specifies the list of products manufactured by the company, ordered by product number. Likewise, the Product class defines the inverse many-to-one madeby relationship with the class Company denoting the company manufacturing the product. The Company class contains other attributes such as name, address (of type tuple(street:string, city:string, zip:integer)), and totalstock (the total value of all of its stocked products), and an inventory() method that takes a product name argument and returns the number of stocked units of the product of that name. The Product class contains other attributes such as name, prodnum (product number), unitprice, and stocklevel (total value of the stocked units of the product). The named object WebCat of type Catalog provides an entry point to this database.

The database described so far is a standard structured ODMG database, and its schema is easily specified using the Object Definition Language (ODL) [Cat94] of ODMG, although a full description is omitted here. Figure 1 is a graphical representation of this schema with atomic attributes omitted.

[Figure 1: Structured ODMG classes in the retail-agency database. The named object WebCat is of class Catalog, whose products and vendors attributes lead to Product and Company, respectively. Company and Product are linked by the relationships produces (1 to many) and madeby (many to 1).]

In addition to this structured data, let us suppose that we have product-specific XML information available for some products, drawn from, say, Web sites of companies and analyst firms. This data might include manufacturer specifications (power ratings, weight, etc.), compatibility information if it applies (for instance, the strobes compatible with a particular camera), a listing of competing companies and products, etc. To integrate this XML data within our database, we enhance the Product class with a prodinfo attribute that serves as an entry point to the product-specific data for the corresponding product. Since this data is much less structured and less stable than the retailer database, we cannot easily use a fixed ODMG type for its representation. It is much more convenient to use the semistructured OEM data model. Therefore, we let the prodinfo attribute be a "crossover point" (described below) to OEM data.

There is also a need for referencing structured data from semistructured data. If a competing product (or company) or a compatible product is one that appears in the retail catalog, then it should be represented by a direct reference to the ODMG object representing that product or company. Only if the competing product or company is not part of the catalog is a complex OEM object created to encode the XML data for that product or company.

[Figure 2: Example OEM graph for the prodinfo attribute of a Product object. The graph is rooted at the prodinfo (OEM) attribute of o1: Product. Edge labels shown: specs, wattage, weight, compatible, competing, name, madeby. Nodes and values shown: 100, "16oz", o2: Product, o3: Company, o4: Product, "System12", "Xylem Inc."]

An example OEM database graph for the prodinfo attribute of a product is shown in Figure 2. Note that in Figure 2, the competing product named "System12" is not part of the catalog database and therefore is represented by a (complex) OEM object; the other competing product and company are part of the catalog and thus are represented by references to ODMG objects of type Product and Company.

To continue with the example, let us also suppose that we have some review data available in XML for products and companies. The information is available from Web pages of different review agencies and varies in structure across different agencies. We enhance our example database with a second entry point: the named object Reviews integrates all the XML review data. Since this review data is neither stable nor well-structured, it is better represented by the OEM data model than by any fixed ODMG type. Thus, Reviews is a complex OEM object integrating available reviews of companies and products. There is once again a need for referencing structured data from this semistructured data: reviewed companies and products that are part of the catalog should be denoted by references to the ODMG objects representing them.

Figure 3 is a simplified example of this semistructured Reviews data. We assume that the reviews by a given agency reside under distinct subobjects of Reviews, and the names of the review agencies (ConsumersInc, ABC_Consulting, etc.) form the labels for these subobjects. Different agencies use different formats for their reviews. For subsequent examples, we restrict ourselves to reviews by ConsumersInc. Reviews by this agency are shown to have a category subobject, specifying the category of the product or company being reviewed, and a subject subobject denoting the subject of the review (either a product or a company). The measurable aspects of each subject are rated by ConsumersInc under rating subobjects. If the subject of a review is part of the catalog, the subject subobject is a reference to the ODMG object representing the company or product. Non-catalog subjects are represented by complex OEM objects. Both cases are depicted in Figure 3.

[Figure 3: Semistructured Reviews data for the Web catalog. The named object Reviews has subobjects labeled ConsumersInc and ABC_Consulting. Labels shown under the reviews: category, subject, rating, torque, efficiency, power, distortion, name, prodnum. Nodes and values shown: "handtools", "electronics", o1: Product, "Amplifier", 200, 4.3, 6.5, 3.9, 6.5.]

Our example scenario consists of hybrid data. Some of the data is structured, such as the Product class without the prodinfo attribute, while some of the data is semistructured, such as the data reachable via a prodinfo attribute or via the Reviews entry point. Queries should allow the navigation of this hybrid database from either entry point, WebCat or Reviews. We conclude this section with three such queries expressed in an extension to OQL. The first query selects products compatible with the product "Diskman12":

    Select C
    From WebCat.products P, P.prodinfo.compatible C
    Where P.name = "Diskman12"

This query starts in the ODMG world. Note that variable P will be assigned to an ODMG object, while the prodinfo attribute moves us into the OEM world. The next query finds the names of all products and companies that have reviews by ConsumersInc with a category containing the term "electronics":

    Select N
    From Reviews.ConsumersInc R, R.subject S, S.name N
    Where Exists C In R.category: C Like "%electronics%"

This query starts in the OEM world. Note that variable S may be assigned to both OEM objects and ODMG objects, since some of the subobjects labeled subject in Figure 3 are of type Product or Company, while others are complex OEM objects. In either case, the evaluation of S.name works as expected, accessing either an ODMG attribute or an OEM subobject. The third query selects the inventory of any product called "Diskman123" made by a competitor of the product called "DiskmanXY":

    Select C.inventory("Diskman123")
    From WebCat.products P, P.prodinfo.competing C
    Where P.name = "DiskmanXY"

Here, inventory(string product name) is a method. It is applicable only when variable C is bound to an ODMG object with an inventory() method of the appropriate signature. In our example, only the class Company has such a method. When C is bound to an OEM object, no result is returned for that binding, but an error is not raised.
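The behavior of the third query on mixed bindings can be sketched in Python as follows. This is only an illustration of the "no result, no error" rule described above, not Ozone's implementation: the classes Company and OemComplex, the helper call_method, and all data values are hypothetical, and the signature check covers only the number of arguments, not their types.

```python
import inspect
from dataclasses import dataclass, field


@dataclass
class Company:
    """Stands in for the ODMG class Company (hypothetical, simplified)."""
    name: str
    stock: dict = field(default_factory=dict)

    def inventory(self, product_name):
        """Number of stocked units of the named product."""
        return self.stock.get(product_name, 0)


@dataclass
class OemComplex:
    """Stands in for a complex OEM object holding (label, value) pairs."""
    children: list = field(default_factory=list)


def call_method(binding, method, *args):
    """Return [result] if the binding has a matching method, else [] (no error)."""
    fn = getattr(binding, method, None)
    if not callable(fn):
        return []
    try:
        inspect.signature(fn).bind(*args)   # arity check only
    except TypeError:
        return []
    return [fn(*args)]


# Bindings for C: one catalog company and one non-catalog OEM object.
competitors = [
    Company("Acme", {"Diskman123": 12}),
    OemComplex([("name", "System12")]),
]
result = [r for c in competitors
          for r in call_method(c, "inventory", "Diskman123")]
```

Only the Company binding contributes to result; the OEM binding is skipped silently, which mirrors the semantics stated for C.inventory("Diskman123").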

2.4 Benefits over a pure semistructured approach

The main benefit of our approach over a purely semistructured approach such as Lore [MAG+97] is that we are capable of exploiting structure when it is available. In particular, we can more easily take advantage of known query optimization techniques for structured data, and we can take advantage of strong typing when portions of the data are typed. There are additional benefits to mixing structured and semistructured data in the environment of a full-fledged object database:

1. Benefits from ODMG classes: Semistructured data models such as OEM usually are equipped with a fixed set of atomic types. In an ODMG environment, it is easy to introduce new atomic types as ODMG classes. For instance, if we want to introduce an audio type, it suffices to create a class Audio using the full power of ODMG (e.g., its Java binding) and we can reference Audio objects from the OEM part of the data.

2. Benefits from ODMG ordered collections: Most semistructured data models do not support ordering. Ordering is often relevant, e.g., in XML. In our approach, since ODMG already supports lists, XML or any other semistructured but ordered data can easily be handled without changing the model or the system. Similarly, we could introduce "semistructured spreadsheets" as arrays of OEM objects.

3. Circumventing ODMG type checking: As a side benefit to allowing OEM data to exist in the ODMG environment, it is possible to treat ODMG data as if it were OEM and thus disable static typing. Queries can therefore be written with incomplete knowledge of the schema of an ODMG database without the risk of type error. At the same time, all properties of the ODMG data (such as methods, ordering, indexes, etc.) remain accessible to such queries.
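To make the point about ordering concrete, the following Python sketch maps a small XML fragment to an ordered list of (label, child) pairs, preserving document order. It is a simplification made for illustration, not the mapping used by Ozone: the function xml_to_oem and the sample document are hypothetical, and XML attributes and mixed content are ignored.

```python
import xml.etree.ElementTree as ET


def xml_to_oem(elem):
    """Map an element to its text (leaf) or an ordered list of (label, child)."""
    kids = list(elem)
    if not kids:
        return (elem.text or "").strip()
    # A Python list keeps sibling order, unlike an unordered set of subobjects.
    return [(k.tag, xml_to_oem(k)) for k in kids]


doc = ET.fromstring(
    "<specs><step>unpack</step><note>fragile</note><step>plug in</step></specs>"
)
oem = xml_to_oem(doc)
```

With an unordered representation, the relative position of the note between the two steps would be lost; the list-based form retains it.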

3 The Extended ODMG Data Model

Our basic extension to the ODMG data model to accommodate semistructured data is relatively straightforward: we extend the ODMG model with a new built-in class type OEM. Using this OEM type, we can construct ODMG types that include semistructured data. For instance, we can define a partially semistructured class Product as in our running example:

    interface Product (extent Products) {
        attribute String name;
        attribute Integer prodnum;
        attribute OEM prodinfo;
        relationship Company madeby;
        ...
    }

There is no restriction on the use of the type OEM. For instance, tuple(a:integer, b:OEM) and list(OEM) are valid types. This use of OEM in type constructors enables the first kind of hybrid data: semistructured data may be referenced from the structured portions of an ODMG database.

Objects in the class OEM are of two categories: OEMcomplex and OEMatomic, representing complex and atomic OEM objects respectively (recall Section 2.2). Note that these categories do not represent subclasses in the strict sense of the term, since OEM objects are untyped. An OEMcomplex object encapsulates a collection of (label, value) pairs, where label is any string of characters and value is an OEM object. The original OEM data model specification included only unordered collections of subobjects [AQM+97, PGMW95]. We allow complex OEM objects with either unordered or ordered subobjects in our data model, and respectively refer to them as OEMcomplexset and OEMcomplexlist.

To enable the second kind of hybrid data (semistructured references to structured data), the value of an OEMatomic object may have any valid ODMG type (including OEM). Thus, apart from the ODMG atomic types integer, real, string, etc., the value of an OEMatomic object may for example be of type Product, list(tuple(a:integer, b:Product, c:OEM)), array(string), etc. When the content of an OEMatomic object is of type T, we will say that its type is OEM(T). Since OEM objects are actually untyped, OEM(T) denotes a "dynamic type" that does not serve as a constraint and is not checked at run-time when queries access objects of this type. For example, an object of type OEM(integer) may be compared with an object of type OEM(string) or OEM(set(Product)) without raising a type error; further discussion of such operations is provided in Section 4.3. Intuitively, atomic OEM objects are untyped containers for typed values.
Note that an OEMatomic object of type OEM(OEM) can be used to store a reference to another OEM object (possibly external to the database). Also note that OEM(nil) is a valid type for an OEMatomic object, and we denote a particular object of this type as the value OEMNil.
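The object categories of this section can be sketched as Python classes. The sketch is an illustration under stated assumptions rather than Ozone's storage layout: the Python class definitions, the helper dynamic_type, and the Product stand-in are hypothetical, and the mapping from Python types to ODMG type names is chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


class OEM:
    """Root of the hypothetical OEM object categories."""


@dataclass
class OEMatomic(OEM):
    value: Any = None                 # untyped container for a typed value


@dataclass
class OEMcomplexset(OEM):
    children: List[Tuple[str, OEM]] = field(default_factory=list)  # order ignored


@dataclass
class OEMcomplexlist(OEM):
    children: List[Tuple[str, OEM]] = field(default_factory=list)  # order kept


@dataclass
class Product:
    """Stands in for the ODMG class Product."""
    name: str
    prodnum: int


# Assumed mapping from Python types to ODMG atomic type names.
_ATOMIC_NAMES = {bool: "boolean", int: "integer", float: "real",
                 str: "string", type(None): "nil"}


def dynamic_type(obj: OEMatomic) -> str:
    """Describe the dynamic type OEM(T); it is informative, never enforced."""
    v = obj.value
    if isinstance(v, OEM):
        return "OEM(OEM)"             # reference to another OEM object
    return f"OEM({_ATOMIC_NAMES.get(type(v), type(v).__name__)})"


OEMNil = OEMatomic(None)              # a particular object of type OEM(nil)
```

For example, dynamic_type(OEMatomic(4)) yields "OEM(integer)" and dynamic_type(OEMatomic(Product("Amplifier", 200))) yields "OEM(Product)"; nothing in the sketch prevents comparing objects of different dynamic types.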

4 The OQLS Query Language

The query language for the Ozone system is OQLS. OQLS is not a new query language: except for some syntactic conveniences derived from Lorel, it is syntactically identical to OQL. The semantics of OQLS on structured data is identical to OQL on standard ODMG data. OQLS extends OQL with additional semantics that allow it to access semistructured data. The semistructured capabilities of OQLS are mostly derived from the Lorel query language for navigating pure semistructured data. Like Lorel, OQLS allows the navigation of semistructured data without the possibility of run-time errors. OQLS also provides new features necessary for the navigation of hybrid data. Since OQLS expressions can contain both structured and semistructured operands, OQLS defines new semantics that allow such queries to be interpreted appropriately.

[Figure 4: Some of the similarities and differences between OQL, Lorel, and OQLS. The figure shows two ovals, OQL and Lorel, within OQLS.
OQL only: strong typing, methods, list and array indexing, type constructors, type conversions (casts), ...
Common to OQL and Lorel: path expressions, select-from-where queries, arithmetic expressions, set expressions, ...
Lorel only: no type errors, automatic coercion, regular path expression operators, OEM object constructors, ...
OQLS only (outside the two ovals): structured to semistructured crossover and vice versa, mixed expressions, coercion from OEM to structured types and vice versa.]

Figure 4 highlights some of the features of OQL, Lorel, and OQLS. Since Lorel was derived from OQL, there are a number of features common to both languages, such as path expressions, select-from-where queries, and arithmetic expressions. OQLS also has special features necessary for hybrid data (illustrated outside the two ovals). Note the extreme cases of OQLS:

OQLS on pure ODMG data: If a database is fully structured, the set of admissible OQLS queries on the database (with the exception of casting to OEM, described later on) is precisely the set of admissible OQL queries for the same data, and the behavior of such queries is identical to that of OQL.

OQLS on pure OEM data: If a database is completely semistructured, the set of meaningful queries on such data is precisely the set of queries allowed by Lorel, and the behavior of such queries is identical to that of Lorel.

In the remainder of this section, we explain the semantics of OQLS by focusing mostly on hybrid data. We describe queries that cross boundaries from structured to semistructured data in Section 4.1. In Section 4.2, we describe the reverse form of "crossover", i.e., from semistructured to structured data. Section 4.3 discusses the semantics of OQLS constructs such as arithmetic and logical expressions involving hybrid operands.

4.1 Structured to semistructured crossover

Structured to semistructured crossover in OQLS is straightforward since we can always identify the crossover points statically. For instance, the following query selects the names of all competing products and companies for all products in the Web catalog from Section 2.3:

    Select N
    From WebCat.products P, P.prodinfo.competing C, C.name N

In this query, P is statically known to be of type Product, but prodinfo is an OEM attribute, and C is therefore of type OEM. Thus, prodinfo represents a crossover from structured to semistructured data.
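The static identification of such crossover points can be sketched in Python as a walk over a schema. This is a minimal illustration, not the Ozone type checker: the SCHEMA dictionary encodes only a hypothetical fragment of the running example, types are represented as plain strings, and the helper names are invented for this sketch.

```python
# Hypothetical fragment of the retail-agency schema; types are plain strings.
SCHEMA = {
    "Catalog": {"products": "set(Product)", "vendors": "set(Company)"},
    "Product": {"name": "string", "prodnum": "integer",
                "prodinfo": "OEM", "madeby": "Company"},
    "Company": {"name": "string", "produces": "list(Product)"},
}


def element_type(t):
    """Strip one collection constructor, e.g. set(Product) -> Product."""
    for ctor in ("set", "bag", "list", "array"):
        if t.startswith(ctor + "(") and t.endswith(")"):
            return t[len(ctor) + 1:-1]
    return t


def infer(root_type, labels):
    """Static type after each step; once OEM is reached, every later step is OEM."""
    types, current = [], root_type
    for label in labels:
        if current != "OEM":
            attrs = SCHEMA.get(current)
            if attrs is None or label not in attrs:
                raise TypeError(f"{current} has no attribute {label}")
            current = element_type(attrs[label])
        types.append(current)
    return types


def crossover_index(root_type, labels):
    """Index of the step that crosses into OEM data, or None if there is none."""
    types = infer(root_type, labels)
    return types.index("OEM") if "OEM" in types else None


def is_well_typed(root_type, labels):
    """True if the structured prefix of the path type-checks against SCHEMA."""
    try:
        infer(root_type, labels)
        return True
    except TypeError:
        return False
```

For the path products.prodinfo.competing.name starting from Catalog, infer returns Product, OEM, OEM, OEM, so the crossover is found at the prodinfo step, while a path such as products.address is still rejected statically because it lies entirely in the structured part.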

4.2 Semistructured to structured crossover

Semistructured to structured crossover is more complicated than crossover in the other direction, since it is not possible to identify such crossover points statically without detailed knowledge of the structure of a hybrid database. To define the semantics of queries with transparent crossover from semistructured to structured data, we introduce the logical concept of OEM proxy for encapsulating structured data in OEM objects. We also discuss an explicit form of crossover for users who do have detailed knowledge of structure.

In the purely semistructured OEM data model, atomic OEM objects are leaf nodes in the database graph. The result of attempting to navigate an edge from an atomic OEM object is defined in Lorel to be the empty set, i.e., the result of evaluating X.label is empty if X is not a complex OEM object. However, in our extended ODMG model, OEMatomic objects serve a richer purpose. In Ozone, OEMatomic objects are containers for values with any ODMG type, so in addition to containing atomic values, they may provide semistructured crossover to structured data. Thus, OQLS extends Lorel path expressions to be able to navigate structured data encapsulated by OEMatomic objects, in addition to navigating semistructured data represented by OEMcomplex objects. In our running example, some of the objects in the Reviews graph labeled subject are OEMatomic objects of type OEM(Product) and OEM(Company). Thus, the following query has a semistructured to structured crossover since some of the bindings for C are OEM(Company) and OEM(Product) objects:

    Select A
    From Reviews.ConsumersInc R, R.subject C, C.address A

For C bindings of type OEM(Company), the evaluation of C.address generates the OEM proxy (to be defined later in this section) of the address attribute in the Company class. For C bindings of type OEM(Product), the evaluation of C.address yields the empty set.

Semistructured to structured crossover is complicated by two factors arising from the absence of type information in semistructured data. First, a semistructured to structured crossover point cannot be identified statically. In the above query, we cannot statically infer the precise types of the bindings for R, C, and A. A user may not know of the existence of OEM(Company) or OEM(Product) objects in the bindings for C, and cannot identify "R.subject C" to be a crossover point without detailed knowledge of the Reviews database. Second, we cannot always statically infer the types of values encapsulated by OEMatomic objects. An OEMatomic object is an untyped container and there is no constraint on the type of the value it encapsulates. The bindings for C in our example may include OEMcomplex objects, and OEMatomic objects of types OEM(Product) and OEM(Company).

The semantics of semistructured to structured crossover should therefore satisfy the following two design goals:

Information should be preserved: An OQLS query should be able to access any structural property of data encapsulated by OEMatomic objects, without requiring the user to cast such data explicitly to its true ODMG type. In particular, all attributes and methods of a class should be accessible from an OEMatomic object encapsulating an object of that class. For instance, the address attribute of class Company must be accessible from an object of type OEM(Company), as illustrated by the above example.

Type errors should not be generated: Since the types of values embedded in OEMatomic objects cannot be inferred statically, in keeping with the spirit of untyped semistructured data navigation, references to nonexistent properties of data contained in OEMatomic objects should not lead to run-time errors. For instance, in the preceding example, some of the C bindings are of type OEM(Product) and the class Product does not have an address attribute. For such bindings, the evaluation of C.address produces an empty result and does not generate a type error.

These goals are achieved for semistructured to structured crossover in OQLS through the logical notion of OEM proxy objects, defined as follows:

Definition 4.1 (OEM Proxy) An OEM proxy object is a temporary OEM object created (perhaps only logically) to encapsulate a value of any ODMG type. It is an OEMatomic object that serves as a proxy or a surrogate for the value it encapsulates.

Semistructured to structured crossover is accomplished by (logically) creating OEM proxy objects, perhaps recursively, to contain the result of navigating past an OEMatomic object. It is important to note that this concept of OEM proxy is a logical rather than a physical concept. It specifies how a query over hybrid data is to be interpreted, but does not specify anything about the actual implementation of the query processor. All we require is that the result of a query should be the same as the result that would be produced if proxy objects were actually created for every navigation past every OEMatomic object. For C bindings of type OEM(Company) in our example query, the corresponding A bindings are OEM proxy objects of type OEM(tuple(street:string, city:string, zip:integer)) encapsulating the address attribute of the corresponding Company objects.

The general algorithm for evaluating X.l when X is an OEMatomic object of type OEM(T) encapsulating the value Y follows. Consider the different cases for T:

1. T is an atomic ODMG type (i.e., one of integer, real, char, string, boolean, binary, or nil): For any label l, the result of evaluating X.l is the empty set.

2. T is a tuple type: For all fields in the tuple whose labels match l (note that a label with wildcards can match more than one field), a proxy object is created encapsulating the value of the field. The result of evaluating X.l is the set of these proxies.

3. T is a collection: If T is a set or a bag type, X.l returns a set. This set is empty unless the label l is the specific label item. For this built-in system label, the value of X.l is a set of OEM proxies encapsulating the elements in the collection Y. If T is a list or an array type, X.l is evaluated similarly, except that the ordering of the elements of Y is preserved by returning a list of proxies instead of a set. Note that navigation past such objects requires some knowledge of the type T, since the user needs to use the item label (or a wildcard) to navigate below the encapsulated collection.

4. T is OEM: In this case, X is a reference to an OEM object Y, and the result of X.l is the same as the result of evaluating Y.l, i.e., automatic dereferencing (see footnote 3) is performed.

5. T is a class: Here, Y is an object encapsulating some value: let Z be an OEM proxy for that value. The result of evaluating X.l is a set including the OEM proxies obtained by evaluating Z.l (by recursively applying these rules), the OEM proxies encapsulating the values of any relationships with names matching l, and the OEM proxies encapsulating the results of invoking any methods with names matching l.

The rules above are recursive, as illustrated by the following query, which selects the names of all products manufactured by all companies reviewed by ConsumersInc:

    Select N
    From Reviews.ConsumersInc R, R.subject C, C.produces L, L.item P, P.name N

Here, for those C bindings that are of type OEM(Company), the corresponding L bindings are proxies of type OEM(list(Product)), encapsulating the produces relationship of the Company class. The P bindings are therefore proxies of type OEM(Product) encapsulating the different Product objects in the lists encapsulated by the L bindings. Finally, the N bindings encapsulate the name attributes of the Product objects encapsulated by the P bindings, and the type of these N bindings is OEM(string). Thus, proxies allow "transparent navigation" to structured data from semistructured data, with the caveat that users must be aware of the presence of collection types (such as the produces relationship of class Company) to know that they need to use the item label to navigate below them.

The rules for navigating past OEMatomic objects allow OQLS queries to access methods without arguments in objects encapsulated by OEMatomic objects, since such methods are indistinguishable from other class attributes. In keeping with our design goal of allowing access to all properties of structured data referenced by semistructured data, OQLS also allows queries to invoke methods with arguments on structured objects encapsulated by OEMatomic objects. The expression X.m(arg1, arg2, ..., argn) applies the method m() with the specified list of arguments to the object encapsulated by X. If X is of type OEM(T), the result of this expression is a set containing the OEMatomic object encapsulating the return value of the method, provided T is a class that has a method of this name and with formal parameters whose types match the types of the actual parameters arg1, arg2, ..., argn. If T is not a class, or if it is a class without a matching method, or if X is of type OEMcomplex, this expression returns the empty set.
As an example, the following query selects the inventories of "camera1" for all companies reviewed by ConsumersInc:

    Select C.inventory("camera1")
    From Reviews.ConsumersInc R, R.subject C

For those C bindings that are not of type OEM(Company), the evaluation of C.inventory("camera1") yields the empty set. For C bindings of type OEM(Company), C.inventory("camera1") returns a singleton set containing an OEM(integer) object encapsulating the value produced by applying the inventory() method with argument "camera1" to the Company object encapsulated by C.

Footnote 3: There is an important restriction to this automatic dereferencing: it is performed only if l does not contain the Kleene closure operator. This feature is needed for restricting the search space of queries with Kleene closures but is not detailed further here.

OQL allows casts on objects, and as a ramification of our approach, OQLS allows any object to be cast to the type OEM by creating the appropriate proxy for the object. For instance, an object C of type Company can be cast to an OEM proxy object of type OEM(Company) by the expression (OEM) C. Once this conversion is performed, the proxy can be navigated in a semistructured style, i.e., type-checking is disabled. This casting approach is useful when:

Users have approximate knowledge of the structure of some (possibly very complex) structured ODMG data but prefer not to study its schema in detail.

The structure of the database changes frequently and users want the same query to run on the changing database without type errors; or, users want the same query to run without type errors on different databases with slightly different structures.
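The five navigation cases for OEM proxies and the method-invocation rule described above can be sketched in Python. This is an illustration under stated assumptions, not Ozone's query processor: `Struct`, `navigate`, `invoke`, and the sample data are hypothetical names; shell-style `*` wildcards from `fnmatch` stand in for Lorel's label wildcards; results are always Python lists, whereas the rules distinguish sets from lists; class attributes and relationships are both modeled as instance fields; method matching checks arity only, whereas the rule above requires matching parameter types; and Kleene closure is not modeled.

```python
import inspect
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Any

ATOMIC = (type(None), bool, int, float, str, bytes)   # case 1 types
COLLECTION = (list, tuple, set, frozenset)            # case 3 types


class Struct(dict):
    """Stand-in for an ODMG tuple value; keys are field labels."""


@dataclass
class OEMComplex:
    edges: list = field(default_factory=list)   # (label, OEM object) pairs


@dataclass
class OEMAtomic:
    value: Any   # any ODMG value, or a reference to another OEM object


def is_class_object(y):
    return not isinstance(y, ATOMIC + COLLECTION + (Struct, OEMComplex, OEMAtomic))


def navigate(x, label):
    """Evaluate x.label and return a list of OEM objects (empty on a mismatch)."""
    if isinstance(x, OEMComplex):
        return [child for edge, child in x.edges if fnmatchcase(edge, label)]
    y = x.value
    if isinstance(y, (OEMComplex, OEMAtomic)):   # case 4: dereference
        return navigate(y, label)
    if isinstance(y, ATOMIC):                    # case 1: nothing below an atom
        return []
    if isinstance(y, Struct):                    # case 2: proxies for matching fields
        return [OEMAtomic(v) for k, v in y.items() if fnmatchcase(k, label)]
    if isinstance(y, COLLECTION):                # case 3: only "item" descends
        return [OEMAtomic(e) for e in y] if fnmatchcase("item", label) else []
    # Case 5: attributes and relationships live in the instance dict here,
    # and public zero-argument methods are invoked like attributes.
    found = [OEMAtomic(v) for k, v in vars(y).items()
             if not k.startswith("_") and fnmatchcase(k, label)]
    for name, method in inspect.getmembers(y, inspect.ismethod):
        if (not name.startswith("_") and fnmatchcase(name, label)
                and not inspect.signature(method).parameters):
            found.append(OEMAtomic(method()))
    return found


def invoke(x, name, *args):
    """Evaluate x.name(args); the empty list replaces any type error."""
    if not isinstance(x, OEMAtomic) or not is_class_object(x.value):
        return []
    method = getattr(x.value, name, None)
    if not inspect.ismethod(method):
        return []
    try:
        inspect.signature(method).bind(*args)    # arity check only (assumption)
    except TypeError:
        return []
    return [OEMAtomic(method(*args))]


class Product:
    def __init__(self, name):
        self.name = name


class Company:
    def __init__(self, address, produces, stock):
        self.address = address       # tuple-valued attribute
        self.produces = produces     # relationship of type list(Product)
        self._stock = stock          # private helper state (hypothetical)

    def inventory(self, product_name):
        return self._stock.get(product_name, 0)


# Hypothetical sample data standing in for the Reviews graph.
camera = Product("camera1")
acme = Company(Struct(street="1 Main St", city="Palo Alto", zip=94301),
               [camera], {"camera1": 7})
consumers_inc = OEMComplex([("subject", OEMAtomic(acme)),
                            ("subject", OEMAtomic(camera)),
                            ("subject", OEMComplex([("note", OEMAtomic("n/a"))]))])

subjects = navigate(consumers_inc, "subject")
addresses = [a.value for c in subjects for a in navigate(c, "address")]
names = [n.value for c in subjects for l in navigate(c, "produces")
         for p in navigate(l, "item") for n in navigate(p, "name")]
stock = [r.value for c in subjects for r in invoke(c, "inventory", "camera1")]
print(addresses, names, stock)
```

In this sketch the cast (OEM) C corresponds to wrapping a structured object as `OEMAtomic(c)`. Only the Company subject contributes to `addresses`, `names`, and `stock`; the Product and OEMcomplex subjects yield empty results rather than type errors, which mirrors the second design goal.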

Semistructured to structured crossover by explicit coercion

OQLS provides another mechanism for accessing structured data from semistructured data: a modified form of casting that extracts the structured value encapsulated by an OEMatomic object. Although OEM proxies allow all properties of structured data contained in OEMatomic objects to be accessed without casts, casts enable static type checking. Furthermore, casting a semistructured operand to its true structured type may also provide performance advantages, since it may allow the query processor to exploit the known structure of the data.

The standard OQL cast expression "(T) X", denoting a conversion of the identifier X to the type T, is available in OQLS for performing type conversions on standard structured operands. As in OQL, such casts may lead to run-time errors: if the actual type of an object o bound to X is not a subtype of T, the evaluation of "(T) o" produces a run-time type error. While this approach is acceptable for operands whose types can be inferred statically, it is not appropriate for casts on identifiers bound to OEM objects, since it would contradict the untyped philosophy of semistructured data. Therefore, OQLS provides a separate mechanism for performing type conversions on OEM objects, through the Coerce() function defined as follows:

Definition 4.2 (The Coerce() function) Let X be an OEM object. The value of Coerce(C, X) is the singleton set set((C) Y) if X is an OEMatomic object encapsulating an object Y in class C (or a subclass of C). Otherwise, the value of Coerce(C, X) is the empty set.

As an example, the following query selects the products of all Company subjects of reviews by ConsumersInc. Since the type of the C bindings is known to be Company, the type of the result returned by the query can be determined statically to be set(list(Product)).

    Select P
    From Reviews.ConsumersInc R, R.subject S, Coerce(Company, S) C, C.produces P
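The behavior of Coerce() can be sketched in a few lines of Python. This is a hedged illustration rather than the Ozone implementation: the argument order follows Definition 4.2 (class first, OEM object second), Python's `isinstance` stands in for ODMG class membership, the result is a list rather than a set, and `Subsidiary` and the sample subjects are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class OEMAtomic:
    value: Any          # encapsulated ODMG value


@dataclass
class OEMComplex:
    edges: list         # (label, OEM object) pairs


def coerce(cls, x):
    """Coerce(C, X): [Y] when X is an OEMatomic holding an instance Y of C, else []."""
    if isinstance(x, OEMAtomic) and isinstance(x.value, cls):
        return [x.value]
    return []


class Product:
    def __init__(self, name):
        self.name = name


class Company:
    def __init__(self, produces):
        self.produces = produces     # relationship of type list(Product)


class Subsidiary(Company):           # hypothetical subclass, to show subclass matching
    pass


subjects = [OEMAtomic(Company([Product("camera1")])),
            OEMAtomic(Subsidiary([Product("lens1")])),
            OEMAtomic(Product("camera2")),
            OEMComplex([])]

# Mirrors the From clause: each subject S is coerced to Company, then C.produces is read.
result = [c.produces for s in subjects for c in coerce(Company, s)]
print([[p.name for p in products] for products in result])
```

Because every element that survives `coerce` is known to be a Company, `c.produces` is an ordinary typed attribute access, which is what lets the result type be determined before the query runs.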


4.3 Semantics of mixed expressions

In OQL, expressions (queries) can be composed to form more complex expressions as long as the expressions have types that can be composed legally. OQL provides numerous operators for compositions, such as arithmetic operators (e.g., +), comparison operators (e.g.,
