INTEROPERABLE QUERY PROCESSING FROM OBJECT TO RELATIONAL SCHEMAS BASED ON A PARAMETERIZED CANONICAL REPRESENTATION

LOUIQA RASCHID

College of Business and Management and Institute for Advanced Computer Studies University of Maryland, 3129 A. V. Williams Building College Park Maryland 20742 USA

and YA-HUI CHANG

Department of Computer Science University of Maryland, 3228 A.V. Williams Building College Park Maryland 20742 USA

Received October 1, 1994
Revised May 1, 1995

ABSTRACT

In this paper, we develop techniques for interoperable query processing between object and relational schemas. The objective is to pose a query against a local object schema and be able to share information transparently from target relational databases, which have equivalent schema. Our approach is a mapping approach (as opposed to a global schema approach) and is based on using canonical representations (CR). We use one CR for resolving heterogeneity based on the object and relational query languages. We use a second parameterized CR to resolve representational heterogeneity between object and relational schema, and to build a mapping knowledge dictionary. There is also a set of mapping rules, based on the CR, which defines the appropriate mapping between schemas. A query posed against the local object schema is first represented in the CR for queries, and then transformed by the mapping rules, to an appropriate query for the target relational schema, using relevant information from the mapping knowledge dictionary. The use of the parameterized CR allows us to build the mapping knowledge dictionary easily, and allows reusability of the mapping rules.

Keywords: interoperable query processing, heterogeneous databases, representational heterogeneity, transparent query transformation, canonical representation

1. Introduction

The proliferation of database systems based on different data models has created the need for techniques that support interoperability, which allows users to access data from multiple independent heterogeneous databases. To support interoperability, we need to resolve conflicts between different databases and generate appropriate queries corresponding to each database. Sharing of data requires resolving conflicts among equivalent concepts which may have been defined in different ways, leading to representational heterogeneity among schema. Heterogeneity may be caused by using different data models; for example, relational models and object-oriented models have dissimilar representational power. At the same time, there are also many ways to represent similar knowledge based on the same model. For example, in the relational model, data about IBM could correspond to tuples in a relation with "IBM" as an attribute value in one schema, and could also be represented by all the tuples of a relation "IBM" in another schema. In addition to representational heterogeneity among schema, using different query languages with each database also introduces heterogeneity. For example, the standard query language for the relational model is SQL, whereas query languages for object-oriented models include a path-oriented language, XSQL (Kifer et al (1992)), or a functional language, OSQL (Fishman et al (1990)).

In this paper, we study the issue of interoperable query processing between object and relational schema. The objective is to support query transformation in a transparent manner so that users may pose a query against a local schema and be able to (transparently) access data from remote databases which may have a different data model, schema or query language. We make the assumption that although there is representational heterogeneity among the schemas, the data obtained from multiple remote databases (and the local data) satisfy the integrity constraints of the local schema. Obviously this is a shortcoming in some environments, and in related research (see Vidal and Raschid (1995)), we are investigating knowledge and data sharing among different schema (F-logic schema), where the semantics may not be equivalent.
Many approaches advocate support of a global schema and several systems have been proposed and built as a result of this research, e.g., Multibase (see Dayal and Hwang (1984)), Pegasus (see Ahmed et al (1991, 1993), Albert et al (1993)), UniSQL/M (see Kim et al (1993), Kim and Seo (1991)), and SIMS (see Arens and Knoblock (1992), Arens et al (1993), Hsu and Knoblock (1993)). One shortcoming of building a unified global schema is that all queries must be posed against this schema. Such an approach cannot provide interoperability for existing (legacy) applications, where queries are posed against existing local schema. Also, it is complicated to incorporate new databases, because the global schema has to be modified.

There has also been research on the use of higher-order query languages and meta-data models to support interoperability (see Barsalou and Gangopadhay (1992), Chomicki and Litwin (1993), Kent (1991), Krishnamurthy et al (1991), Lakshmanan et al (1993)). The drawback here is that the query must be expressed in a higher-order language, and there is no support for interoperability with respect to existing queries against local schema and legacy applications. However, the advantage is that higher-order queries can resolve conflicts that cannot be resolved when, for example, only relational queries can be used to import data from a target relational database.

Our approach is a mapping approach based on canonical representations (CR). It is similar to approaches described by Lefebvre et al (1992), Qian (1993), Qian and Lunt (1994), and Qian and Raschid (1995). We transform a query posed against the local schema into an appropriate query with respect to the remote databases. A user has access to remote databases, without needing to know their exact schemas and query languages. This will also provide interoperability for legacy applications, where the queries are posed against the local schema.

To meet this need, we propose an architecture for interoperability which utilizes two canonical representations (CR). The heterogeneities are classified and parameterized within these two CR. The first CR resolves heterogeneity based on the query language. We define a representation for the query language of each data model. For example, we have CRQrel (CR for a relational query) and CRQobj (CR for an object query). The second canonical representation provides the mapping information to resolve representational heterogeneity among different schemas, and is used to build a mapping knowledge dictionary. There is a CR corresponding to each of the data models with which we interoperate, e.g., relational and object models. There is also a set of mapping rules, based on the CR, which defines the appropriate mapping between schemas.

A local query posed against the local schema is first represented in the corresponding CR for queries. Next, the canonical query will be transformed by the mapping rules, using relevant information from the mapping knowledge dictionary. We produce a transformed query (TQrel or TQobj), in the appropriate CR, and this is used to generate the appropriate query for the target database. We note that we are limited by the restrictions of the target database system and query language. This does not allow us to query schema knowledge in the target relational schema, and would limit the conflicts that are resolved, compared to higher-order query languages that can query both schema and data.

The CR to solve the representational heterogeneity is parameterized and thus provides a standard way to specify the mapping among schemas.
This feature allows us to incorporate different schemas into the mapping knowledge dictionary easily, since users simply need to fill in a "template" in the parameterized CR to interoperate with another database. In contrast, in the global schema approach, a new global schema has to be rebuilt to incorporate new local schemas. Further, a high-level language, e.g., OSQL in the Pegasus system, is used to specify all the mappings from the local schema to the global schema, and these mappings are written explicitly, and sometimes repeated for similar mappings among the schema. Our use of the parameterized CR and mapping rules results in reusability and ease of specification, since the users only need to fill in a template, and the mapping rules which encode the mapping knowledge are reused.

Currently, we use a second order logical language (F-logic), which has second order syntactical structures but first order semantics, to express both canonical representations. F-logic is designed to capture the properties of object-orientedness, and has good modeling constructs for CR and mapping rules. Each CR is a class in F-logic, and parameters are defined to be the attributes or methods of the corresponding classes.

Our approach is also more flexible than a query translation approach in which the source query is directly translated into a target query form. Using two CRs to represent the semantics of queries, and to resolve representation heterogeneity, allows us to separate the mapping knowledge needed to resolve heterogeneity of schema from the knowledge needed to translate between the constructs in the different query languages. It also simplifies the mapping rules needed for query transformation. If we used a direct translation approach, then the rules for translation would be very complicated, since they would have to resolve representational heterogeneity between schema, and also consider the syntax of each query language.

The interoperation between object and relational queries was initially described in Raschid et al (1993, 1994), and the interoperation from relational to object queries has been presented in Chang et al (1994). Details of the canonical representation to resolve representational heterogeneity among schemas have been presented in Chang and Raschid (1995). In this paper, we provide a detailed description of the interoperability between object and relational queries.

The paper is structured as follows: In section 2, we discuss related research, and in section 3, we provide an example of query interoperation. In section 4, we present an architecture for query interoperation. Section 5 discusses the canonical representation for queries and the Extractor Module to express a local object query in CRQobj. Section 6 discusses the parameterized CR for the mapping knowledge to resolve representational heterogeneity and the Heterogeneous Mapping Module which actually performs the mappings. We provide a detailed algorithm for resolving heterogeneity among object and relational schema, and for query interoperation, based on a set of mapping rules. We then discuss the F-logic implementation of these mapping rules. Finally, we discuss evaluation of the mapping knowledge base and the expressive power of the CR and mapping rules. Section 6 is a summary and suggests future research.

2. Related Research

We review research on approaches to resolve representational heterogeneity with heterogeneous schema and data models. The proposed approaches can be broadly classified into three classes as follows: those advocating a unified global schema; those requiring the use of a higher-order language or a meta-model; and those advocating a mapping approach.

Many approaches advocate support of a unified global schema and global database. Early work discusses schema integration based on the relational model or the entity-relationship model (see a summary in Batini et al (1986)). Mannino et al (1988) provide a rule-based approach for merging generalization hierarchies, which can be applied to integrate object-oriented schemas. This research does not consider schematic discrepancies, where the data in one schema corresponds to the metadata in another, e.g., IBM could be a value in the domain of an attribute CompanyName in one relational schema, and it could be an attribute name or a relation name in another schema. In later research, such heterogeneity was also considered and resolved. Several systems have been proposed and/or built, e.g., Multibase (see Dayal and Hwang (1984)), Pegasus (see Ahmed et al (1991, 1993), Albert et al (1993)), UniSQL/M (see Kim et al (1993), Kim and Seo (1991)), and SIMS (see Arens and Knoblock (1992), Arens et al (1993), Hsu and Knoblock (1993)).

In the global schema approach, a single unified global schema is created such that each external or remote database is a model of that schema, i.e., the database satisfies the integrity constraints of the global schema. If a single unified model can be obtained, then there are many advantages to accessing several underlying databases through a single interface. However, the requirements to create a unified schema such that all underlying schema satisfy the integration criteria may introduce much redundancy. One shortcoming of building a unified global schema is that all queries must be posed against this schema. Such an approach cannot provide interoperability for existing (legacy) applications, where queries are posed against existing local schema. Also, it is complicated to incorporate new databases because the global schema has to be modified. However, if several databases wish to interoperate, then there is an advantage to building some "common" schema, in some "common" data model. We would advocate our mapping approach to interoperate between the legacy applications and the common schema.

We now describe a few of these systems based on a global schema. UniSQL/M (see Kim et al (1993), Kim and Seo (1991)) is a commercial multidatabase product whose query language UniSQL/M is similar to XSQL. It builds a unified schema where virtual classes are created to resolve and "homogenize" heterogeneous entities in the different schemas, which may include relational and object-oriented schema. Instances of the local schema are imported to populate the virtual classes of the integrated schema, and this involves creating new instances. The first step in integration is defining the attributes (methods) of a virtual class, and the second step is a set of queries to populate this class. They provide a vertical join operator, similar to a tuple constructor, and a horizontal join, which is equivalent to performing a union of tuples. The major focus of their research is conflicts due to generalization, e.g., an entity in one schema can be included in (i.e., become a subclass of) an entity in the global schema, or a class and its subclasses may be included by an entity in the global schema. Attribute inclusion conflicts between two entities can be solved by creating a subclass relationship among the entities. Other problems that are studied are aggregation and/or composition conflicts.
Pegasus (see Ahmed et al (1991, 1993), Albert et al (1993)) is a heterogeneous DBMS that provides access to native and external heterogeneous schema. Queries access the local schema via the imported Pegasus global schema. They use the HOSQL high-level language to define "imported types" (corresponding to class definitions) and functions (relationships among instances). New objects are generated for instances of each imported type. For supporting schema integration, they define equivalences among objects, reconciliation of discrepancies, and "covering" supertypes which are collections of instances of different imported types.

In the SIMS system (see Arens and Knoblock (1992), Arens et al (1993), Hsu and Knoblock (1993)), information sharing from multiple relational schema is facilitated through using the LOOM knowledge representation schema to construct a global schema for each application domain. Here, the global query language is a LOOM query. Each external relation has to be mapped into a single LOOM concept, based on some notion of a primary key, and they cannot express a view over the external relations. This can be a drawback if the corresponding concepts or entities in the schemas are mismatched. Although they research many issues in query processing, this work cannot be applied in the context of heterogeneous DBMS supporting SQL-like query languages.

The second approach advocates the use of a higher-order query language or a meta-model (see Barsalou and Gangopadhay (1992), Chomicki and Litwin (1993), Kent (1991), Krishnamurthy et al (1991), Lakshmanan et al (1993)). Krishnamurthy et al (1991) propose the higher-order language features needed for interoperability based on relational schema. They define a powerful language which can query schema; its variables can range over databases, relations, attributes and values. Queries against a unified schema (which is a nested relational object) are expressed using an Interoperable Definition Language (IDL). A disadvantage is that the DBMS must support the higher-order language, and queries are also expressed in this language. Thus, it does not allow the interoperation of legacy applications.

SchemaLog (see Lakshmanan et al (1993)) is a higher-order logic with formal semantics. It is a very expressive declarative language that can query multiple schema. A query has higher-order syntactic features but the logic is a first-order logic and has model and proof semantics. One disadvantage is that resolution (unification) for literals in the formula can be complex, compared to Prolog unification. Although the research is interesting, it is not applicable in a DBMS environment with legacy applications. Chomicki and Litwin (1993) also propose a language for declarative specification of mapping between different object-oriented multidatabases.

As a final example of this approach, the M(DM) meta-model uses meta-level descriptions of schemas to facilitate interoperation (see Barsalou and Gangopadhay (1992)). They build an inheritance lattice to organize meta-types of the schema. Thus, there is no unified schema based on a single data model. Queries are expressed against the meta-model and translated against local relational, object, or other schema. Second-order logic is used to reason about the meta-types. Again, this research does not support the interoperation of legacy applications.

The final approach is a mapping approach and advocates building a mapping knowledge base to facilitate interoperation between existing (legacy) applications. Such approaches are discussed in Lefebvre et al (1992), Qian (1993), Qian and Lunt (1994) and Qian and Raschid (1995). Lefebvre et al (1992) consider the problem of interoperable query processing among multiple relational schemas.
F-logic, a second order logic, is used to express the mapping information among relational schemas and to express the algorithm for query transformation. Qian (1993) and Qian and Lunt (1994) suggest that a language which has minimal representation bias may express mismatch in representation among heterogeneous schema. They choose a first order deductive database to represent mapping knowledge among different relational schema. Each SQL query is converted to some restricted clausal form. The relational schema and the corresponding integrity constraints are also expressed in the form of an implication of some restricted clausal form, where all variables in the body of the clause are universally quantified and all variables in the head are existentially quantified. A mediation knowledge base (of such restricted clauses) is built. An advantage of this approach is that the query, the schema, the constraints and the mediation knowledge are all expressed in the same language. An important aspect of this research is that it uses a theorem proving approach rather than a translation approach to transform the queries. Qian and Raschid (1995) then extend this approach to resolve mapping among object and relational schema. They consider queries with higher-order features in the XSQL language. They use a canonical deductive database to represent the object schema and the mapping knowledge. The higher-order features in the query are resolved in the first step, and in the next step the query is simplified and optimized. Finally, it is transformed using a set of mapping rules to obtain a query with respect to some target relational schema.

Approaches on translating semantic data models (e.g., ER models) to relational models (see Markowitz and Shoshani (1992)) have focused on generating optimal relational schemas from ER schemas. The issue of query translation is not dealt with. Moreover, they do not handle translation to existing relational databases. On the other hand, approaches on coupling object-oriented and relational databases (see Keller (1993), Keim et al (1993, 1994), Pirahesh et al (1994)), and on federated databases (see Ahmed et al (1991), Dayal and Hwang (1984), Kim et al (1993)) have focused on constructing object-oriented interfaces or federated schemas on top of relational schemas by relational views, through which queries are translated. They do not handle query translation from existing object-oriented databases. We believe that query interoperation with respect to existing databases is a crucial issue in a heterogeneous environment.

3. Examples of Transforming Queries from Object to Relational Schema

We use a sample object schema (Figure 1) and two equivalent relational schema (Figures 2 and 3). In the object schema,[1] each node is an object, and the broken arcs represent an inheritance hierarchy. A solid arc represents a pointer to another object; for example, LeasedVehicle is an attribute of Company and refers to some object Vehicle. The local query for the object schema is an XSQL query.[2] We use simple XSQL queries that only differ from SQL queries in the query path expressions (qpe) of the where clause. An XSQL query takes the following form:

Query 1

SELECT Z
FROM Company X, GasolineAuto Y, Company Z
WHERE X.LeasedVehicle[Y].Manufacturer[Z]

In the object schema, we interpret X.LeasedVehicle[Y] as selecting each Company X having LeasedVehicle Y, where Y is restricted to be a GasolineAuto (in the from clause). The Company Z is the Manufacturer of the GasolineAuto. GasolineAuto does not have an attribute Manufacturer and has to inherit this from the class Vehicle.

[1] The objects described in the schema are class objects, which may be distinguished from individual objects.
[2] The language is defined in Section 5 and was initially presented in Kifer et al (1992).

Figure 1: Sample object schema #1. [Diagram not reproduced. Class nodes and attribute labels: Company (CompanyName, LeasedVehicle); YachtClub (ClubName, OwnedVehicle); Vehicle (VehicleId, Manufacturer, Model); DieselVehicle; GasolineVehicle; DieselBoat and GasolineBoat (Registration, Engine, Weight); DieselAuto and GasolineAuto (License, Body, Engine).]

Figure 2: Sample relational schema #1.
  RelCompany-Vehicle (CompanyName, LeasedVehicle)
  RelYachtClub-Vehicle (ClubName, OwnedVehicle)
  RelVehicle (VehicleId, Manufacturer, Model)
  RelAuto (VehicleId, Engine, Body, License, Type)
  RelBoat (VehicleId, Engine, Registration, Weight, Type)

Figure 3: Sample relational schema #2.
  RelCompany-Vehicle (CompanyName, LeasedVehicle)
  RelHatchBack (AutoId, VehType, License, Body, Engine)
  RelSedan (AutoId, VehType, License, Body, Engine)
  RelDslVeh (DslAutoId, DslVehId)
  RelAllVeh (VehicleId, Manufacturer, Model)

There are two relational schemas in Figures 2 and 3. In RDBMS #1, relations RelCompany-Vehicle and RelYachtClub-Vehicle correspond to vehicle information for Company and YachtClub, RelVehicle has the Model and Manufacturer information on each Vehicle, and RelAuto and RelBoat correspond to Auto and Boat, respectively. To obtain a query equivalent to Query 1 in RDBMS #1, we use the knowledge that the VehicleId values in RelCompany-Vehicle may refer to either boats or automobiles. We use these VehicleId values to refer to information in RelAuto. We must select only those tuples in RelAuto corresponding to GasolineAuto, as indicated by the value of the Type attribute. To obtain the manufacturer information we need to use the VehicleId values to refer to the relation RelVehicle. An equivalent SQL query would be as follows:

Query 2

SELECT Manufacturer
FROM RelCompany-Vehicle, RelVehicle, RelAuto
WHERE RelCompany-Vehicle.LeasedVehicle = RelAuto.VehicleId
  & RelAuto.Type = GasolineAuto
  & RelAuto.VehicleId = RelVehicle.VehicleId

Next, we consider the relational schema of RDBMS #2, where information on automobiles (GasolineAuto and DieselAuto) is actually stored in two relations, RelHatchBack and RelSedan. There is also a relation storing information on diesel vehicles, RelDslVeh, and a relation RelAllVeh that stores the Manufacturer and Model information for all vehicles. The corresponding SQL query is as follows:

Query 3

SELECT Manufacturer
FROM RelCompany-Vehicle, RelHatchBack, RelSedan, RelDslVeh, RelAllVeh
WHERE RelCompany-Vehicle.LeasedVehicle = RelAllVeh.VehicleId
  & RelDslVeh.DslVehId = RelAllVeh.VehicleId
  & (RelSedan.AutoId = RelDslVeh.DslAutoId | RelHatchBack.AutoId = RelDslVeh.DslAutoId)
  & (RelSedan.VehType = DieselAuto | RelHatchBack.VehType = DieselAuto)
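As an informal check of Query 2, the following Python sketch builds a toy instance of relational schema #1 in SQLite and runs the query in standard SQL syntax. Everything beyond the relation and attribute names is an assumption made for illustration: the sample rows, the manufacturer names, the helper name run_query2, the reduction of RelAuto to the two columns that the query uses, and the double-quoting of the hyphenated relation name.

```python
import sqlite3

# Hypothetical sample rows for relational schema #1 (not taken from the paper).
COMPANY_VEHICLE = [("Acme", "v1"), ("Acme", "v2"), ("Bolt", "v3")]
VEHICLE = [("v1", "MakerA", "m100"), ("v2", "MakerB", "m200"), ("v3", "MakerC", "m300")]
AUTO = [("v1", "GasolineAuto"), ("v2", "DieselAuto")]  # v3 is a boat, so it is absent here


def run_query2(company_vehicle, vehicle, auto):
    """Run Query 2 on an in-memory toy instance and return the manufacturers."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(
            """
            CREATE TABLE "RelCompany-Vehicle" (CompanyName TEXT, LeasedVehicle TEXT);
            CREATE TABLE RelVehicle (VehicleId TEXT, Manufacturer TEXT, Model TEXT);
            -- Only the columns that Query 2 touches are modelled for RelAuto.
            CREATE TABLE RelAuto (VehicleId TEXT, Type TEXT);
            """
        )
        conn.executemany('INSERT INTO "RelCompany-Vehicle" VALUES (?, ?)', company_vehicle)
        conn.executemany("INSERT INTO RelVehicle VALUES (?, ?, ?)", vehicle)
        conn.executemany("INSERT INTO RelAuto VALUES (?, ?)", auto)
        rows = conn.execute(
            """
            SELECT RelVehicle.Manufacturer
            FROM "RelCompany-Vehicle" AS cv, RelVehicle, RelAuto
            WHERE cv.LeasedVehicle = RelAuto.VehicleId
              AND RelAuto.Type = 'GasolineAuto'
              AND RelAuto.VehicleId = RelVehicle.VehicleId
            ORDER BY RelVehicle.Manufacturer
            """
        ).fetchall()
        return [manufacturer for (manufacturer,) in rows]
    finally:
        conn.close()
```

With the sample rows above, run_query2(COMPANY_VEHICLE, VEHICLE, AUTO) returns only the manufacturer of v1: v2 is filtered out by the Type selection and v3 has no tuple in RelAuto. This mirrors the role of the two join conditions and the selection on Type described in the text.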

Figure 4: Architecture of the Interoperability Module (IM). [Diagram not reproduced. Components shown: the mapping knowledge dictionary (HTobj-relMapping, HTrel-objMapping); the HeTerogeneous Mapping Module (HTM) with its mapping rules; local knowledge dictionaries 1 to n, each holding schema information and query language information; an Extractor Module (EM) and a Generator Module (GM) on each side; and the query forms SQL query, XSQL query, CRQobj, CRQrel, TQrel and TQobj. The legend distinguishes data flow from information accessed.]

4. An Architecture for Query Interoperation

This section describes the architecture for interoperable query processing through transparent query transformation. Figure 4 illustrates the overall design of the Interoperability Module (IM). The architecture is general and should be able to provide interoperation among schemas based on the same or different data model. In this paper, we focus on the mapping from an object schema to a relational schema.

Suppose a user poses an XSQL object query against a local object schema. The Extractor Module (EM) will transform the query into a canonical representation, say CRQobj. The HeTerogeneous Mapping Module (HTM) will resolve the conflicts due to representational heterogeneity and produce a transformed query, say TQrel, with respect to a remote relational schema. The Generator Module (GM) will produce a proper query, say an SQL query, for a remote relational database.

We now discuss the functionality of each module. Corresponding to each DBMS in the heterogeneous environment, there is a local knowledge dictionary that encodes relevant knowledge on (1) the query language constructs, and (2) the data model and the local schema. This local dictionary is accessed by the EM. This module accesses both the local query language and schema dictionary for the task of extracting relevant information from the source query. For a sample XSQL source query, the EM will identify object and attribute variables, path expressions in the query corresponding to class hierarchies for inheritance, path expressions corresponding to identifying specific sub-classes of a class, etc., and represent this using CRQobj. For a sample SQL source query, the EM will identify join paths and selection criteria, depending on whether the joins are over the primary key, a foreign key of another relation, an unstructured join with no key restrictions, etc. The EM will represent this information in the corresponding canonical representation, CRQrel. In this paper, we discuss the EM function for obtaining CRQobj from an XSQL query.

The module which performs the transformation among the heterogeneous schema is the HeTerogeneous Mapping Module (HTM). This module accesses the mapping knowledge dictionary, which contains the canonical representation of the mapping knowledge, HTx-yMapping, where x and y denote the schema. HTx-yMapping resolves conflicts among entities in the different schemas. So far we have considered conflicts among relational and object schemas, so there are four different mapping possibilities for HTM. The intent is to obtain a single parameterized CR (HTMapping) for resolving the conflicts among the schema. The function of HTM is defined by a set of mapping rules, and these rules will encode the appropriate transformation needed for each of the four possible mappings among relational and object schema. The mapping rules in HTM process CRQobj (or CRQrel) as input, and utilize the relevant HTx-yMapping in the knowledge dictionary. After applying the rules, HTM produces a TransformedQuery (TQrel or TQobj, respectively), corresponding to the target query in the canonical representation.

The mapping rules are classified into different groups, each of which performs a class of transformations relevant to the target model/query language. For example, one group of rules resolves the class hierarchy conflict. When mapping from relational schema to object schema, these rules will use the key inclusion dependencies among relations to determine the inclusion relationship between them and identify the relevant class hierarchies in the object schema (see Chang et al (1994)). When mapping from object schema to relational schema, these rules will produce a sequence of joins for the relational query, corresponding to inheritance through a class hierarchy in the object schema (see Raschid et al (1993, 1994)).
Selected rules in each group will be applied, and each produces a part of the transformed query (TQobj or TQrel). Currently, we only consider the interoperation of simple SQL and XSQL queries, e.g., without nesting or higher-order features. To interoperate with more complex query language constructs, we need to add more mapping rules and extend the canonical representation for the query languages. In this paper, we focus on the mapping algorithm to resolve conflicts when mapping target relational schema into a local object schema. The corresponding mapping rules are classified into three groups, namely, Ident-Rel-Attr, Sel-Constr and Join-Path; they are discussed in detail in a later section.

The final module of the IM is the Generator Module (GM). This module accesses the local knowledge dictionary corresponding to the target schema. It will access the syntax information of the particular query language to produce a correct query; it may also use knowledge on the cost of processing a query, indexing information, and knowledge of equivalences in the target query language and schema, etc., to produce an optimal query in the target query language. We note that in general, the optimization of queries in a heterogeneous environment is an important research area, and refer the reader to some of our work in this area (see Florescu et al (1994)).
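The flow through the three modules can be sketched in Python as follows. This is only an illustration under strong simplifying assumptions, not the F-logic implementation: the function names, the TransformedQuery class, and the layout of MAPPING_DICTIONARY are all hypothetical, the three rule groups are collapsed into a single dictionary lookup, and the join that realizes inheritance of Manufacturer from Vehicle is written by hand into the dictionary entry rather than derived by rules.

```python
from dataclasses import dataclass, field

# Hypothetical mapping knowledge dictionary for RDBMS #1. Each entry maps one
# step of an object path (class, attribute, target class) to relations and
# join/selection conditions. The layout is an assumption, not the paper's CR.
MAPPING_DICTIONARY = {
    ("Company", "LeasedVehicle", "GasolineAuto"): {
        "relations": ["RelCompany-Vehicle", "RelAuto"],
        "conditions": [
            "RelCompany-Vehicle.LeasedVehicle = RelAuto.VehicleId",
            "RelAuto.Type = GasolineAuto",
        ],
    },
    ("GasolineAuto", "Manufacturer", "Company"): {
        "relations": ["RelVehicle"],
        "conditions": ["RelAuto.VehicleId = RelVehicle.VehicleId"],
        "select": "Manufacturer",
    },
}


@dataclass
class TransformedQuery:
    """Stand-in for TQrel: the pieces of the target relational query."""
    select: list = field(default_factory=list)
    relations: list = field(default_factory=list)
    conditions: list = field(default_factory=list)


def extractor_module(bindings, path):
    """EM stand-in: replace object variables by their classes (a toy CRQobj)."""
    return [(bindings[src], attr, bindings[dst]) for src, attr, dst in path]


def mapping_module(crq_obj, dictionary):
    """HTM stand-in: look up each step and accumulate the relational pieces."""
    tq = TransformedQuery()
    for step in crq_obj:
        entry = dictionary[step]
        for relation in entry["relations"]:
            if relation not in tq.relations:
                tq.relations.append(relation)
        tq.conditions.extend(entry["conditions"])
        if "select" in entry:
            tq.select.append(entry["select"])
    return tq


def generator_module(tq):
    """GM stand-in: render the transformed query in the paper's SQL notation."""
    return (
        "SELECT " + ", ".join(tq.select)
        + " FROM " + ", ".join(tq.relations)
        + " WHERE " + " & ".join(tq.conditions)
    )


def interoperate(bindings, path, dictionary=MAPPING_DICTIONARY):
    """Chain the EM, HTM and GM stand-ins for one query path expression."""
    crq_obj = extractor_module(bindings, path)
    return generator_module(mapping_module(crq_obj, dictionary))
```

Calling interoperate({"X": "Company", "Y": "GasolineAuto", "Z": "Company"}, [("X", "LeasedVehicle", "Y"), ("Y", "Manufacturer", "Z")]) on the path of Query 1 yields a string with the same three conditions as Query 2; only the order of the relations in the FROM list differs. Keeping the dictionary separate from the three functions reflects the point made above that the mapping knowledge is data to be filled in, while the rules that consume it are reused.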


5. Canonical Representation for Queries and the Extractor Module

We consider an object query language such as XSQL, an extension of SQL for object schema. The language is described in more detail in Kifer et al (1992). We discuss the canonical representation for a simple subset of XSQL queries and describe the function of the Extractor Module (EM), which parses an XSQL query to extract information relevant to the query transformation process and produces a CRQobj.

An XSQL query is as follows:

SELECT $e_1, \ldots, e_m$ FROM $p_1, \ldots, p_n$ WHERE $p$

where $m \ge 1$ and $n \ge 0$. Each $e_i$ is a path expression for $1 \le i \le m$, each $p_i$ is a binding clause for $1 \le i \le n$, and $p$ is a formula involving arithmetic formulas, path expressions, path formulas, and boolean operators.

The building blocks of XSQL queries are query path expressions, which are compositions of references. A query path expression (qpe) is of the form:

$s_0.a_1\{[s_1]\}. \cdots .a_n\{[s_n]\}$

where $n \ge 0$, and expressions in curly brackets are optional. Expression $a_i$ is either an attribute variable or the name of a reference attribute for $1 \le i \le n-1$, and expression $a_n$ is either an attribute variable or the name of an attribute. Expression $s_i$ is either an object variable or an object identifier for $0 \le i \le n-1$, and $s_n$ is either a variable or a constant. A qpe is ground if it contains no variables. A qpe could be either a term or an atomic formula. As a term, the value of a ground qpe is the union of the values of its ground instances. As an atomic formula, a ground qpe evaluates to true if there is a path satisfying it, and a qpe evaluates to true if at least one of its ground instances does so.

A path operator is of the form $[Q_1]\,\theta\,[Q_2]$, where $Q_1, Q_2$ are either $\exists$ or $\forall$, and $\theta$ is in $\{\ldots, \ldots\}$. A path formula is of the form $e_1$ op $e_2$, where $e_1, e_2$ are qpe and op is a path operator.
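The qpe form defined above can be illustrated with a small Python parser that splits a simple qpe into its head selector and its sequence of (attribute, selector) steps, with the selector set to None where the optional bracket is omitted. This is a minimal sketch, not the Extractor Module: the function name parse_qpe and the tuple layout are assumptions, selectors are restricted to plain identifiers (quoted or numeric constants are not handled), and no distinction is made between variables and names.

```python
import re

# One step of a qpe: ".attribute" optionally followed by "[selector]".
_STEP = re.compile(r"\.\s*([A-Za-z_][\w-]*)\s*(?:\[\s*([A-Za-z_][\w-]*)\s*\])?\s*")
# The leading selector s0.
_HEAD = re.compile(r"\s*([A-Za-z_][\w-]*)\s*")


def parse_qpe(text):
    """Split a simple qpe s0.a1[s1]...an[sn] into (s0, [(a_i, s_i or None), ...])."""
    head = _HEAD.match(text)
    if head is None:
        raise ValueError(f"qpe must start with a selector: {text!r}")
    position = head.end()
    steps = []
    # Consume one ".attribute[selector]" step at a time until the text is exhausted.
    while position != len(text):
        step = _STEP.match(text, position)
        if step is None:
            raise ValueError(f"unexpected text at offset {position}: {text[position:]!r}")
        steps.append((step.group(1), step.group(2)))
        position = step.end()
    return head.group(1), steps
```

For the path of Query 1, parse_qpe("X.LeasedVehicle[Y].Manufacturer[Z]") gives the head X followed by the steps (LeasedVehicle, Y) and (Manufacturer, Z). A bare selector such as "X" corresponds to the case n = 0 and yields an empty step list.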
A binding clause is of the form o : c, a AttributeOf c, or c SubclassOf c′, where o is either an object variable or an object identifier, a is either an attribute variable or the name of an attribute, and c, c′ are either class variables or names of classes. It is called respectively an object, attribute, or class binding clause. A variable could appear in more than one binding clause.

In this paper, we only consider a simple un-nested XSQL query, without attribute or class variables, and where the qpe do not include quantifiers. We describe the EM algorithm to produce a canonical representation for such queries. Simplifying higher-order XSQL queries with attribute and class variables is discussed in related research in Qian and Raschid (1995). We describe the procedure to obtain the canonical representation for a single qpe in detail. The procedure to transform a complete XSQL query is straightforward, and has no impact on the heterogeneous mapping to transform queries among multiple schema, since each qpe is independently mapped. We also do not discuss the case where an XSQL query must first be expressed as a "global" query and several "sub-queries," and where each of the sub-queries is evaluated in a different target database.

Consider a qpe of an XSQL query, of the following form:

X.attr-X [Y].attr-Y [Z].attr-Z [A].attr-A [B] ...

where X, Y, Z, etc., are object variables, [3] attr-X, attr-Y, etc., are attribute labels, and X.attr-X [Y] is read as: the attribute attr-X of the object X is an object Y. The Extractor Module (EM) will produce a canonical structure CRQobj for the corresponding XSQL query, and each qpe will be represented by a list or sequence, QPSobj, in CRQobj. Each pair Y.attr-Y [Z] will be represented by an element of the sequence, QPSobj(i). Each QPSobj is a class in F-logic. We present the structure in this section but an explanation of F-logic will follow in a later section. Each (simple) QPSobj(i) structure is as follows:

QPSobj(i)[begin-object → Y*, sub-list → Psub, sup-list → Psup, attr → attr-Y, end-object → Y′]

Consider the QPSobj(i) for [Y].attr-Y in the qpe X.attr-X [Y].attr-Y [Z]. The begin-object is Y*, corresponding to X.attr-X, i.e., the object pointed to by the pair (X, attr-X). If Y* is identical to the object Y specified in the object binding of the FROM clause, then the sub-list will be empty. If Y* differs from Y, then sub-list will contain a path from Y* to Y. The end-object will be some Y′ which has attribute attr-Y. If Y is the same as Y′, then the sup-list will be empty; otherwise, sup-list will contain the path from Y to Y′. We use Table 1 to summarize a procedure Extract which is performed by the EM to produce each QPSobj(i), and describe it using examples:

Table 1: The Function of the Extractor Module

Extract: Extracting information from X.attr-X[Y].attr-Y[Z]

Extract1      Object type of Y specified in object binding in FROM
  Extract1,1  X.attr-X returns object Y* and Y* ≡ Y
  Extract1,2  X.attr-X returns object Y* and Y* ≠ Y
Extract2      Object type of Y not specified in object binding in FROM
  Extract2,1  X.attr-X returns some object Y* and Y* has some subclass(es) Y which have attribute attr-Y
  Extract2,2  X.attr-X returns some object Y* but none of the subclass(es) Y of Y* has attribute attr-Y. However, Y* and Y may inherit attribute attr-Y

[3] We do not interoperate with queries which select object identifiers, since object identifiers are not shared across databases.


Begin Procedure Extract

Extract1: object type of Y is specified in object binding in FROM

Extract1,1: Y* ≡ Y

Consider the following example Query 4:

SELECT * FROM GasolineBoat Y WHERE YachtClub.OwnedVehicle[Y]

The object pointed to by (YachtClub, OwnedVehicle), or Y*, is exactly the object GasolineBoat, specified for Y in the query. Thus, in the corresponding QPSobj(i), GasolineBoat is the begin-object and the sub-list will be empty.

Extract1,2: Y* ≠ Y but Y is specified

Consider the following example Query 5:

SELECT * FROM DieselBoat Y, Company Z WHERE Company.LeasedVehicle [Y].Manufacturer [Z] ...

The object pointed to by (Company, LeasedVehicle), or Y*, is Vehicle, but Y is specified to be DieselBoat. Now, in the QPSobj(i), the begin-object is Vehicle and sub-list will contain the path from Vehicle to DieselBoat. The end-object (that has attribute Manufacturer) is Vehicle, and DieselBoat inherits this attribute from Vehicle. Thus, the sup-list will contain the path from DieselBoat to Vehicle. In this case, we see that there is a total overlap between the sup-list and the sub-list. We will consider such overlap after we discuss the basic Extract procedure.

Extract2: object type of Y not specified in object binding in FROM

Extract2,1: Y* has sub-class Y with attribute attr-Y

Consider the following example Query 6:

SELECT * FROM WHERE Company.LeasedVehicle [Y].Registration ...

Here, Y is unspecified in the object binding in FROM. The object Y* which is pointed to by (Company, LeasedVehicle) is Vehicle. From the object schema, Extract will infer that there are two objects (Y), namely DieselBoat and GasolineBoat, with attribute Registration. Thus, in QPSobj(i), begin-object is Vehicle, and there are two paths in sub-list (from Vehicle to DieselBoat, and from Vehicle to GasolineBoat). [4] There are two end-objects, DieselBoat and GasolineBoat, and the sup-list is empty.

[4] In this paper, we do not discuss details of this case; however, it is straightforward to handle such a situation.

Extract2,2: Y* has no sub-class with attribute attr-Y

Consider the following example Query 7:

SELECT * FROM WHERE YachtClub.OwnedVehicle [Y].Manufacturer [Z].CompanyName ...

Here, Y is unspecified in the object binding in FROM. The object Y* which is pointed to by (YachtClub, OwnedVehicle) is GasolineBoat. GasolineBoat does not have attribute Manufacturer, nor does it have any sub-class with this attribute. However, GasolineBoat can inherit the attribute Manufacturer from object Vehicle, or Y′. Thus, for the QPSobj(i), begin-object is GasolineBoat, the sub-list is empty, the end-object is Vehicle, and the sup-list is a path from GasolineBoat to Vehicle.

End Procedure Extract
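The four branches of Extract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the EM implementation: the toy class hierarchy (DieselBoat and GasolineBoat as direct subclasses of Vehicle), the placement of the attributes, and the names `QPS`, `path_up`, `owner` and `extract` are all hypothetical, chosen only so that Queries 5 to 7 can be replayed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

# Hypothetical toy schema inferred from the example queries: each class maps to
# its direct superclass, and to the attributes it defines explicitly.
PARENT: Dict[str, Optional[str]] = {
    "Vehicle": None, "DieselBoat": "Vehicle", "GasolineBoat": "Vehicle",
}
EXPLICIT: Dict[str, Set[str]] = {
    "Vehicle": {"Manufacturer"},
    "DieselBoat": {"Registration"},
    "GasolineBoat": {"Registration"},
}


@dataclass
class QPS:
    begin_object: str            # Y*, the object pointed to by (X, attr-X)
    attr: str
    sub_lists: List[List[str]]   # paths from Y* down to each Y
    sup_lists: List[List[str]]   # paths from each Y up to its end-object Y'
    end_objects: List[str]


def path_up(cls: str, ancestor: str) -> Optional[List[str]]:
    """Path from cls up to ancestor (inclusive), or None if ancestor is not above cls."""
    path = [cls]
    while path[-1] != ancestor:
        parent = PARENT.get(path[-1])
        if parent is None:
            return None
        path.append(parent)
    return path


def owner(cls: Optional[str], attr: str) -> Optional[str]:
    """Nearest class at or above cls that defines attr explicitly."""
    while cls is not None:
        if attr in EXPLICIT.get(cls, set()):
            return cls
        cls = PARENT.get(cls)
    return None


def extract(y_star: str, y: Optional[str], attr: str) -> QPS:
    if y is not None:                       # Extract1: Y is bound in FROM
        targets = [y]
    else:                                   # Extract2: Y is not bound in FROM
        targets = [c for c in PARENT        # Extract2,1: sub-classes that have attr
                   if c != y_star and path_up(c, y_star)
                   and attr in EXPLICIT.get(c, set())]
        if not targets:                     # Extract2,2: fall back to inheritance
            targets = [y_star]
    qps = QPS(y_star, attr, [], [], [])
    for t in targets:
        down = path_up(t, y_star)
        end = owner(t, attr)
        if down is None or end is None:
            raise ValueError(f"{t} is not below {y_star} or cannot reach {attr}")
        qps.sub_lists.append(down[::-1] if t != y_star else [])
        qps.sup_lists.append(path_up(t, end) if end != t else [])
        qps.end_objects.append(end)
    return qps
```

With this toy schema, `extract("Vehicle", "DieselBoat", "Manufacturer")` replays Query 5 and shows the total overlap between sub-list and sup-list, while `extract("GasolineBoat", None, "Manufacturer")` replays Query 7 with an empty sub-list.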

There is a possibility of overlap among the sub-list and the sup-list, when they are initially determined by the EM. Figure 5 lists several possibilities. [5] In this figure, (a), (b) and (c) examine several possible paths that may occur between the objects Y* and Y, corresponding to the sub-list, and Y and Y′, corresponding to the sup-list. In (a), there is no overlap between the sub-list and the sup-list. In (b), the sup-list from Y to Y′ totally overlaps the sub-list from Y* to Y. This could produce redundancy in the transformed query. To eliminate redundancy, the sub-list could be eliminated (set to NULL). Alternately, the sup-list could be adjusted to be the non-overlapping portion from Y* to Y′. Finally, in (c), the sup-list from Y to Y′ and the sub-list from Y* to Y show a partial overlap. In this case, either the sup-list could be changed to be the non-overlapping portion from Y+ to Y′, or alternately, the sub-list could be changed to the non-overlapping portion from Y* to Y+. In each case, care must be taken when combining the sequence of QPSobj(i), corresponding to a single qpe such as X.attr-X[Y].attr-Y[Z], since the set of objects referred to by the pair (Y′, attr-Y) must correspond to the set of objects Z*.
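The three overlap situations can be detected with a small helper, sketched here in Python. It assumes, purely for illustration, that each list is a sequence of class names with the sub-list running from Y* down to Y and the sup-list running from Y up to Y′; the function name `split_overlap` is hypothetical. The helper returns both trimmed alternatives, and a caller would apply only one of them, as the text describes.

```python
from typing import List, Optional, Tuple


def split_overlap(sub: List[str], sup: List[str]) -> Tuple[Optional[str], List[str], List[str]]:
    """Return (pivot, sub_nov, sup_nov).

    pivot   : the highest class shared by both paths (Y+ in the partial case,
              Y* in the total case, Y itself when there is no overlap)
    sub_nov : non-overlapping part of the sub-list, from Y* to the pivot
    sup_nov : non-overlapping part of the sup-list, from the pivot to Y'
    Use EITHER (sub, sup_nov) OR (sub_nov, sup); trimming both would drop
    the restriction to Y.
    """
    down = sub[::-1]                 # sub-list read upwards, starting at Y
    k = 0
    while k < min(len(down), len(sup)) and down[k] == sup[k]:
        k += 1
    if k == 0:                       # the paths do not even share Y
        return None, sub, sup
    pivot = sup[k - 1]
    return pivot, sub[: len(sub) - k + 1], sup[k - 1:]
```

A one-element result such as `["Vehicle"]` denotes a path with nothing left to traverse, which corresponds to setting that list to NULL.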

[5] There could also be more complicated cases where there may be multiple paths among objects, an object may inherit the same attribute from multiple parent objects in the class hierarchy, etc., each of which is an extension we have not considered in this paper.

[Figure 5: Examples of overlap of sup-list and sub-list in QPSobj(i), for the qpe X.attr-X [Y].attr-Y [Z].attr-Z ... Panels: (a) No overlap; (b) Total overlap; (c) Partial overlap.]

6. Canonical Representation for Mapping Knowledge and Heterogeneous Transformation Module

In this section, we first discuss the parameterized canonical representation for the mapping knowledge, HTx-yMapping, for resolving heterogeneity between object and relational schema. The representation is general, and can handle the four possibilities of mapping between schema, where each x and y could be "rel" or "obj". In this paper, our focus is on the parameters for mapping target relational schema into a local object schema, i.e., HTobj-relMapping; this is presented in section 6.1. The main parameters are mappings, joins, and constraints. The function of HTM is defined by a set of mapping rules which specifies the algorithm for mapping among the schema. The mapping rules for HTobj-relMapping are classified into three groups, namely, Ident-Rel-Attr, Join-Path and Sel-Constr, and they operate on each of the parameters mentioned above. In section 6.2., we use examples to present several cases of HTM for resolving conflicts when mapping target relational schema into a local object schema. In section 6.3., we provide a detailed algorithm, using the corresponding mapping rules. In section 6.4., we briefly present the F-logic implementation of these mapping rules. The output of HTM is a transformed query, in this case TQrel, whose parameters rel-attr, join-path, and constraints are produced by each group of mapping rules, respectively. Finally, in section 6.5., we discuss the verification of the mapping knowledge base, HTobj-relMapping, and the expressive power of the CR and mapping rules. The resolution of other conflicts that are not particular to the object to relational mapping is also described in section 6.1., and we refer the reader to a detailed presentation in Chang and Raschid (1995).

6.1. Parameterized Canonical Representation HTx-yMapping

A comprehensive framework for classifying representational heterogeneity among relational databases and object-oriented databases is provided in Kim and Seo (1991) and Kim et al (1993). We use this framework as the basis for building a corresponding parameterized canonical representation, namely, HTx-yMapping. We consider possible conflicts in the remote schema, with respect to an entity or an attribute in the local schema. The parameters of the CR are used to resolve each class of conflict. We discuss the parameters in general, for each of the four possible mappings, but the focus of this paper is HTobj-relMapping, for mapping target relational schema into a local object schema. Suppose X represents an entity in the local schema D1, A is an attribute defined for X, and Y represents an entity in the remote schema, D2. If an entity is a class in the local schema, then we have information on the corresponding class hierarchy. If it is a relation in the target schema, then we know its primary key attribute(s). The parameters of HTx-yMapping are as follows:

• equiv-entity →→ {ENTITY-NAMES}
• sup-entity →→ {ENTITY-NAMES}
• sub-entity →→ {ENTITY-NAMES}
• joins →→ join(X, X′)[rel-attr →→ (Y, Z1, Y′, Z2)]
• mappings →→ map(X, A, i)
  – entity → ENTITY-NAME
  – attribute → ORDERED-LIST-of-ATTRIBUTE-NAME
  – type → ORDERED-LIST-of-TYPE
  – homo-expression@LOCAL-VALUE → REMOTE-VALUE
• constraints →→ constraint(X, Y)
  – attribute → ATTRIBUTE-NAME
  – range →→ {ATTRIBUTE-VALUES}
• constraints →→ constraint(X, A)
  – value-entity →→ {(VALUE, ENTITY-NAME)}
• reachable-attr
• includes

The first three parameters resolve entity-versus-entity conflicts, between entity X in D1 and Y in D2. These include conflicts in entity names (equiv-entity) and conflicts in the class inclusion hierarchy (sub-entity and sup-entity). Since a target relational schema does not have a class hierarchy, these parameters are not relevant to this paper.

The most direct conflicts are attribute-value-versus-attribute-value conflicts in the two schemas, together with dissimilar expressions that must be homogenized, for some attribute A of local entity X. The parameter mappings resolves these conflicts. It has three arguments, X, A, and i, where i is used to distinguish between multiple solutions that resolve the conflict. These conflicts are not particular to the mapping of target relational schema to local object schema. The sub-parameter type includes specifications such as whether the local attribute is multi-valued, the domain over which it is defined, i.e., whether it is a primitive type or a user-defined class, etc. The sub-parameter homo-expression is used to resolve conflicts of unit, precision, etc. We refer the reader to Chang and Raschid (1995) for further details. The class of mapping rules that resolves these conflicts is Ident-Rel-Attr, and they produce the output for parameter rel-attr in TQrel.

The parameter joins resolves class hierarchy and attribute inheritance conflicts when mapping from a target relational schema. Consider an attribute A of object X, in the local object schema, which is an implicit attribute, i.e., it is inherited from some super class X′, and it is an explicit attribute of X′. Suppose a relational schema has relations Y and Y′, whose tuples correspond to instances of X and X′. There are several possibilities for associating attribute A with relations Y and Y′.
In the case of a horizontal partitioning, all instances of X could be represented in relation Y, and relation Y could have both the explicit and implicit attributes for the instances of X. In the case of a vertical partitioning, instances of X could be replicated in both Y and Y′, and the implicit attribute A may be represented in relation Y′, for all instances of X. These are the straightforward cases. However, due to possible normalization of relations in the target relational schema, or because there is no one-to-one correspondence for each class in the local object schema, there could be other possibilities involving additional relations. Thus, there could be some mixed partitioning, where some general relational SPJ-type expression must be computed over several relations, in addition to Y and Y′, in order to compute the attribute values of A for instances of X. The mapping rules, namely Join-Path rules, use the parameter joins to resolve these conflicts, and they produce the output for parameter join-path in TQrel.

The parameter constraints resolves conflicts that are defined as schematic conflicts, or entity-name-versus-attribute-value conflicts. These are conflicts of data (attribute value) and meta-data (names of attributes, classes, etc.) in heterogeneous schema. For example, the class X in the local object schema may correspond to those tuples in relation Y, in the target schema, that satisfy some selection criterion based on the values of some other attribute of relation Y. These conflicts, too, are typical of mapping from target relational schema to local object schema, and the mapping rules that resolve these conflicts, namely Sel-Constr, produce the output for parameter constraints in TQrel.

There are two more parameters, reachable-attr and includes, which are used by the mapping rules; however, they do not resolve conflicts among schema but are used as control parameters by the groups of mapping rules. They will be referred to in the detailed mapping algorithm.

6.2. Heterogeneous Transformation Module

We now discuss the function of the HTM in applying the appropriate mapping rule. We focus on the transformation corresponding to HTobj-relMapping here. The transformation corresponding to HTrel-objMapping is discussed in a companion paper (see Chang et al (1994)). In this section, we use examples to present several cases for resolving conflicts when mapping target relational schema into a local object schema. In section 6.3., we provide a detailed algorithm. The HTM expresses a mapping from the XSQL query to a relational SPJ-type query in a target relational schema. Each XSQL query is constructed of a sequence of structures QPSobj, which can be conceptually represented as some combination (SO, EO, PO, attribute), with a start object (SO), and corresponding attribute, in the local object schema. If the attribute is not actually structurally associated with the start object, then it must be associated with an end object (EO), through a class hierarchy in the object schema. Finally, PO refers to the path objects that may occur in the path between SO and EO. Each QPSobj(i) will actually yield one or two such combinations as input to HTM. Consider the QPSobj(i) for some qpe X.attr-X[Y].attr-Y[Z]. In one combination, SO will be the object variable Y named in the qpe, EO will be the end-object of QPSobj(i), or Y′, PO will be the optional sup-list, say Psup, from Y to Y′, and attribute will be attr-Y. Thus, the combination is (Y, Y′, Psup, attr-Y). There may be a second combination if Y*, the object referred to by (X, attr-X), is not Y; then SO will be Y*, EO will be Y, PO will be the optional sub-list from Y* to Y, say Psub, and the value for attribute will be irrelevant (since we are specifying a specific sub-class and not inheritance). Here, the combination is (Y*, Y, Psub, _). As described in the previous section, the output of HTM is a set of parameters for TQrel(), namely rel-attr, join-path, and constraints, which is used by GM to generate the appropriate query.
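The way one QPSobj(i) yields one or two (SO, EO, PO, attribute) combinations can be written down directly. The Python sketch below is only illustrative; the names `Combination` and `combinations_for` are hypothetical, and `None` stands in for the irrelevant attribute written as "_" in the text.

```python
from typing import List, NamedTuple, Optional, Tuple


class Combination(NamedTuple):
    so: str                    # start object
    eo: str                    # end object
    po: Tuple[str, ...]        # path objects between SO and EO
    attribute: Optional[str]   # None plays the role of "_"


def combinations_for(y_star: str, y: str, y_prime: str,
                     p_sub: List[str], p_sup: List[str],
                     attr: str) -> List[Combination]:
    """Build the inputs to HTM for the QPSobj(i) of [Y].attr-Y."""
    # First combination: from the named object Y to the end-object Y'.
    combos = [Combination(y, y_prime, tuple(p_sup), attr)]
    # Second combination only when (X, attr-X) points to an object other than Y.
    if y_star != y:
        combos.append(Combination(y_star, y, tuple(p_sub), None))
    return combos
```

For Query 5 this gives two combinations, one from DieselBoat to Vehicle for Manufacturer and one from Vehicle to DieselBoat with no attribute.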
In this section, we informally describe the function of the HTM, in English, for a number of cases, as well as the output TQrel(). The exact details of the parameters of TQrel() are in sections 6.3. and 6.4. We note that in the case of the object to relational mapping, the canonical representation and the mapping rules are limited to only resolving those conflicts where a relational query, expressed on the target relational schema, can be used to express the data that is being imported from the target system. This does not allow us to query schema knowledge in the target relational schema, and would limit the conflicts that are resolved, compared to higher-order query languages that can query both schema and data. Table 2 summarizes the cases that are to be discussed in this section.

Table 2: The Function of the HeTerogeneous Mapping Module: Mapping from Target Relational Schema to Object Schema

Case 1    Transformation depends on start object SO
  Case 1.1  (SO, _, _, attribute) maps to list of rel-attr, or {(relation RN, attribute AN)}
  Case 1.2  (SO, _, _, attribute) maps to list of rel-attr, constraints, or {(relation RN, attribute AN, criterion to select subset of RN tuples based on SO)}
  Case 1.3  (SO, _, _, attribute) maps to SPJ-type of relational expression. Algorithm provides details of obtaining this expression using all 3 parameters
Case 2    Transformation depends on end object EO
  Case 2.1  (Does not depend on SO) (_, EO, _, attribute) maps to list of rel-attr, or {(relation RN, attribute AN)}
  Case 2.2  (Also depends on SO) (SO, EO, _, attribute) maps to list of rel-attr, constraints, or {(relation RN, attribute AN, criterion to select subset of RN tuples based on SO)}
  Case 2.3  (Does not depend on SO) (_, EO, _, attribute) maps to list {(RN, AN, criterion to select subset of RN tuples based on EO)}
  Case 2.4  (Also depends on SO) (SO, EO, _, attribute) maps to {(RN, AN, criterion to select RN tuples based on EO, criterion to select RN tuples based on SO)}
  Case 2.5  (Also depends on SO) (SO, EO, _, attribute) maps to SPJ-type of relational expression. Algorithm provides details of obtaining this expression using all 3 parameters
Case 3    (SO, EO, PO, attribute) maps to SPJ-type of relational expression. Algorithm provides details of obtaining this expression using all 3 parameters
Case 4    General case where (SO, EO, PO, _) refers to an object. Algorithm provides details of dealing with object reference

[Case 1]

This corresponds to a horizontal partitioning of objects, in the target relational schema. The corresponding relational query is only dependent on the start object SO and attribute, and is independent of the end object EO or optional path objects. Consider an example combination of (DieselAuto, _, _, Model). There are three possibilities, as follows:

[Case 1.1]

(SO, _, _, attribute) maps to (a set of) rel-attr parameters, corresponding to {(relation name RN, attribute name AN)}, in TQrel().

Suppose there is a relation RelDieselAuto containing information on all diesel automobiles, with an attribute AttrModel. Then, (DieselAuto, _, _, Model) will map to {(RN ≡ RelDieselAuto, AN ≡ AttrModel)}. Instead of a single pair, (SO, _, _, attribute) could produce a set of pairs {(RN, AN)}. For example, if the information on diesel autos were stored in separate relations, for hatchbacks and sedans, this would produce the following: {(RN ≡ RelDieselSedan, AN ≡ AttrModel), (RN ≡ RelDieselHatchback, AN ≡ AttrModel)}. A simple SQL query can be generated from this TQrel().

[Case 1.2]

This corresponds to a horizontal partitioning, where there is an additional schematic conflict on SO. (SO, _, _, attribute) maps to (a set of) rel-attr and constraints, corresponding to {(RN, AN, criterion (constraint) to select subset of tuples of RN based on SO)}, in TQrel(). In this case, not all tuples in RN correspond to instances of object SO. SO provides an additional selection criterion, to select those tuples in relation RN which actually correspond to instances of SO. For example, attribute AN-SO may be an additional attribute and its values may be used to identify instances of SO in RN. This is an entity-name-versus-attribute-value schematic conflict since the values of an attribute (AN-SO) are used to identify the instances of SO. Suppose there is a relation RelAuto which stores information on several types of automobiles. Now the combination (DieselAuto, _, _, Model) will map to {(RN ≡ RelAuto, AN ≡ AttrModel, AttrType = {"DieselHatchBack", "DieselSedan"})}. Here, certain tuples of relation RelAuto are selected, based on the value of AttrType matching the set {"DieselHatchBack", "DieselSedan"}, which is an enumeration of the types of DieselAuto.
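As a rough illustration of how a Case 1.2 triple could be turned into SQL, the Python sketch below renders the selection criterion as an IN predicate and runs it against an in-memory SQLite table. The helper name `select_with_constraint`, the choice of an IN predicate, and the sample rows (models M1 to M3) are assumptions made for this example; the paper leaves the actual generation to the GM.

```python
import sqlite3
from typing import List, Sequence, Tuple


def select_with_constraint(rn: str, an: str, attr_c: str,
                           values: Sequence[str]) -> Tuple[str, List[str]]:
    """Render (RN, AN, attr_c = {values}) as a parameterized SQL query.

    Relation and attribute names are interpolated directly, on the assumption
    that they come from the trusted mapping knowledge dictionary.
    """
    marks = ", ".join("?" for _ in values)
    return f"SELECT {an} FROM {rn} WHERE {attr_c} IN ({marks})", list(values)


# Hypothetical sample data for the relation RelAuto of the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RelAuto (AttrModel TEXT, AttrType TEXT)")
conn.executemany(
    "INSERT INTO RelAuto VALUES (?, ?)",
    [("M1", "DieselSedan"), ("M2", "GasolineSedan"), ("M3", "DieselHatchBack")],
)

sql, params = select_with_constraint(
    "RelAuto", "AttrModel", "AttrType", ["DieselHatchBack", "DieselSedan"])
diesel_models = sorted(row[0] for row in conn.execute(sql, params))
```

Only the tuples whose AttrType is in the enumeration for DieselAuto are returned, which is the selection the constraints parameter is meant to express.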

[Case 1.3]

This is also a horizontal partitioning with schematic conflict based on SO. (SO, _, _, attribute) maps to some SPJ-type of relational expression in TQrel(), where all three parameters are produced as output. Consider the relational schema in Figure 6, with relations RelHatchBack, RelSedan and RelDieselAuto. Consider the combination (DieselAuto, _, _, License). The relations RelHatchBack and RelSedan have an attribute License. These relations may contain information for both diesel autos and gasoline autos. The relation RelDieselAuto has an attribute DslAutoId which can be used to identify those tuples in RelHatchBack and RelSedan corresponding to diesel autos. This would require a join of RelHatchBack (and RelSedan) with RelDieselAuto, over the corresponding attributes AutoId and DslAutoId, respectively. Then, the values of License may be obtained, for the diesel autos. The three parameters of TQrel() which are obtained are described in the next section, in the detailed algorithm.

[Figure 6: Relational Schema for Case 1.3. Relations: RelHatchBack (AutoId, Vehtype, License, Body, Engine); RelSedan (AutoId, Vehtype, License, Body, Engine); RelDieselAuto (DslAutoId, DslVehId).]
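The join described for Case 1.3 can be tried out on a small SQLite database. The sketch below is a hedged illustration only: the tables are reduced to the attributes the join needs, the rows are invented, and combining the two relations with UNION is an assumption, since the text does not say how the per-relation results are merged.

```python
import sqlite3
from typing import Iterable

conn = sqlite3.connect(":memory:")
# Reduced versions of the Figure 6 relations, with hypothetical rows.
conn.executescript("""
CREATE TABLE RelHatchBack (AutoId INTEGER, License TEXT);
CREATE TABLE RelSedan (AutoId INTEGER, License TEXT);
CREATE TABLE RelDieselAuto (DslAutoId INTEGER, DslVehId INTEGER);
INSERT INTO RelHatchBack VALUES (1, 'H-001'), (2, 'H-002');
INSERT INTO RelSedan VALUES (3, 'S-003'), (4, 'S-004');
INSERT INTO RelDieselAuto VALUES (2, 20), (3, 30);
""")


def license_query(relations: Iterable[str]) -> str:
    """Join each auto relation with RelDieselAuto over AutoId = DslAutoId."""
    parts = [
        f"SELECT r.License FROM {rn} AS r "
        f"JOIN RelDieselAuto AS d ON r.AutoId = d.DslAutoId"
        for rn in relations
    ]
    return " UNION ".join(parts)


licenses = sorted(
    row[0] for row in conn.execute(license_query(["RelHatchBack", "RelSedan"])))
```

Only the licenses of autos whose AutoId appears as a DslAutoId in RelDieselAuto survive the join, so gasoline autos stored in the same relations are filtered out.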

[Case 2]

Suppose we consider the combination (DieselVehicle, Vehicle, _, Manufacturer). The attribute Manufacturer is actually associated with the end object (EO) Vehicle. Here the relational query to obtain the manufacturer attribute is dependent on the end object, EO, or Vehicle. If the manufacturer attribute does not depend on the SO, DieselVehicle, then this corresponds to the scenario in Cases 2.1 and 2.3. This corresponds to a vertical partitioning in the target relational schema, with possible schematic conflict based on EO. However, it is possible that the SO, DieselVehicle, is a further constraint that is used in the relational query (indicating further selection of some appropriate tuples which are actually diesel vehicles). This is the situation in Cases 2.2 and 2.4, where there is schematic conflict based on SO, and possibly EO as well. Finally, Case 2.5 is similar to Case 1.3 in that we obtain some SPJ-type of relational expression.

[Case 2.1]

This is a vertical partitioning in the target relational schema. (_, EO, _, attribute) maps to {(RN, AN)}, using parameter rel-attr in TQrel(). Suppose SO is DieselVehicle, EO is Vehicle and the attribute is Manufacturer. Suppose there is a relation RelVehicle, with attribute AttrManufacturer, which has the manufacturer information for all vehicles. HTM uses the combination (_, Vehicle, _, Manufacturer) to produce {(RelVehicle, AttrManufacturer)}. This implies that all tuples in relation RelVehicle correspond to DieselVehicle, and there is no necessity to select a subset of those tuples from RelVehicle, corresponding to the SO, DieselVehicle.

[Case 2.2]

This is a vertical partitioning in the target relational schema with schematic conflict based on SO. (SO, EO, _, attribute) maps to {(RN, AN, criteria to select subset of RN tuples based on SO)} in TQrel(), and uses the parameters rel-attr and constraints. Consider (DieselVehicle, Vehicle, _, Manufacturer) as before, producing

{(RelVehicle, AttrManufacturer, AttrType = {"DieselAuto", "DieselBoat"})}.

Here, all Vehicles are in RelVehicle, and corresponding to the SO, DieselVehicle, we select a subset of RelVehicle tuples, where the value of AttrType in {"DieselAuto", "DieselBoat"} identifies a subset of tuples corresponding to DieselVehicle.

[Case 2.3]

This is a vertical partitioning in the target relational schema with schematic conflict based on EO. (_, EO, _, attribute) maps to {(RN, AN, criterion to select subset of RN tuples based on EO)}, using parameters rel-attr and constraints in TQrel(). This is similar to Case 2.1, and the EO is used to identify a relation and attribute. However, not all tuples in this relation may correspond to EO, and there is an additional selection constraint on that relation, to identify the tuples corresponding to EO. Suppose we consider a situation where all the Vehicle engine data are stored together with other types of engines in a relation RelAllEngines. There is an attribute AttrEngineType which can be used to identify those engines corresponding to Vehicle. Further, all Vehicle engines in this relation RelAllEngines correspond to DieselVehicle. Then, (_, Vehicle, _, Manufacturer) will map to a relational query corresponding to the following, in TQrel: {(RelAllEngines, AttrManufacturer, AttrEngineType = {"enumeration of Vehicle engine types"})}.

[Case 2.4]

This is a vertical partitioning in the target relational schema with schematic conflict based on SO and EO. (SO, EO, _, attribute) maps to {(RN, AN, subset of RN tuples based on SO, subset of RN tuples based on EO)}, and uses parameters rel-attr and constraints in TQrel(). As in Case 2.3, all information on Vehicle engines is in a relation RelAllEngines. However, not all the Vehicle engines in this relation correspond to DieselVehicle engines. We have to identify a subset of tuples for DieselVehicle. Thus, (DieselVehicle, Vehicle, _, Manufacturer) maps to the following expression, corresponding to a relational query: {(RelAllEngines, AttrManufacturer, AttrEngineType = {"enumeration of Vehicle engine types"}, AttrFuelType = {"enumeration of fuel types for DieselVehicle"})}. Here, AttrEngineType selects tuples corresponding to Vehicle and AttrFuelType identifies DieselVehicle. In this scenario, we need two selection criteria to identify the tuples corresponding to objects that are in both Vehicle and DieselVehicle. To explain the need for two selection criteria instead of a single criterion, note that there may be multiple class hierarchies associated with Vehicle. DieselVehicle may only participate in one such hierarchy. Similarly, DieselVehicle may participate in other hierarchies that do not include Vehicle.

[Figure 7: Relational Schema for Case 3. Relations: RelHatchBack (AutoId, Vehtype, License, Body, Engine); RelSedan (AutoId, Vehtype, License, Body, Engine); RelDieselAuto (DslAutoId, DslVehId).]

[Case 2.5]

In this scenario, we obtain a general SPJ-type relational expression from (SO, EO, _, attribute). This is a special case of Case 3, where (SO, EO, PO, attribute) maps to a relational expression, and we will not describe this separately. All three parameters of TQrel are produced.

[Case 3]

This is the most general case, and (SO, EO, PO, attribute) maps to an SPJ-type of relational query expression, using all three parameters of TQrel. This may be a mixed partitioning with possible schematic conflict based on SO, EO and all objects PO. In some sense, we may consider the previous cases to be degenerate cases of this case. Case 1 could be a degenerate case where het-map ignores the path objects PO and end object EO, and terminates after using SO to produce a mapping. Case 2 could also be a degenerate case where het-map ignores the path objects PO and uses EO (and optionally SO) to produce a mapping. We will describe this scenario using an example. The algorithm in section 6.3. has all details of TQrel(). Suppose we consider the combinations (Auto, Vehicle, , Model) or (Auto, Vehicle, , Model). Information on automobiles is in RelHatchBack and RelSedan, there is a relation RelDieselAuto corresponding to DieselAuto, and a relation RelAllVehicles has information on all Vehicle objects. The relations are in Figure 7. The output in TQrel() would consist of two pairs. The first pair specifies the joins between RelHatchBack and RelDieselAuto, and between RelSedan and RelDieselAuto, as follows:

({(RelHatchBack, AN-Join ≡ AttrAutoId), (RelSedan, AN-Join ≡ AttrAutoId)}, {(RelDieselAuto, AN-Join ≡ AttrDslAutoId)})

The second pair is a join between RelDieselAuto and RelAllVehicles, as follows:

({(RelDieselAuto, AN-Join ≡ AttrDslVehId)}, {(RelAllVehicles, AN-Join ≡ AttrVehicleId)})

We refer to the next two sections for details. In general, each P_i in PO could map to a pair (or a set of pairs) as follows: (RN, AN-Join), or a quadruple (or a set) as follows: (RN, AN-Sel, range of values for AN-Sel, AN-Join). The expressions for each adjacent element in PO would then be combined. In the case of a pair, the relation RN_i for P_i and RN_i+1 for P_i+1 are joined over the corresponding attributes AN-Join_i and AN-Join_i+1. If it is a quadruple, then the range of values for AN-Sel_i is used to select some tuples from each relation RN_i, after which the relations are joined. There is also the possibility that each object P_i itself maps to an SPJ-type query expression. We also consider the possibility that this joining path information may be incomplete with respect to each adjacent pair in the path PO, in which case we skip over subsequent objects until we obtain a pair of objects in the path which does produce a query expression. These details are discussed in the next section.
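The combination of adjacent path elements, including the skipping of objects for which no joining information is known, might look as follows in Python. This is one possible reading of the informal description, not the algorithm of section 6.3.: the dictionary layout, the function name `join_clauses`, and the path Auto, DieselAuto, Vehicle used in the usage note are assumptions.

```python
from typing import Dict, List, Tuple

# (object_i, object_j) -> ((RN_i, AN-Join_i), (RN_j, AN-Join_j)).
# Hypothetical layout for the joining information of the mapping knowledge.
JoinInfo = Dict[Tuple[str, str], Tuple[Tuple[str, str], Tuple[str, str]]]


def join_clauses(path: List[str], joins: JoinInfo) -> List[str]:
    """Produce one join predicate per pair of path objects with known join info."""
    clauses: List[str] = []
    i = 0
    while i < len(path) - 1:
        # Look ahead for the nearest later object that can be joined with path[i],
        # skipping over objects for which the information is incomplete.
        j = next((k for k in range(i + 1, len(path))
                  if (path[i], path[k]) in joins), None)
        if j is None:
            i += 1                     # nothing is known for this object
            continue
        (rn_i, an_i), (rn_j, an_j) = joins[(path[i], path[j])]
        clauses.append(f"{rn_i}.{an_i} = {rn_j}.{an_j}")
        i = j
    return clauses
```

Given joining information for (Auto, DieselAuto) and (DieselAuto, Vehicle), the function returns the two predicates of the example above for RelSedan; if only (Auto, Vehicle) were known, DieselAuto would be skipped and a single predicate produced.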

[Case 4]

Suppose the attribute of an object is itself an object, for example, the Manufacturer of a Vehicle is a Company. We may have a qpe in a query as follows: Company.LeasedVehicle [Vehicle].Manufacturer [Company]. This will produce a sequence of QPSobj(i), each of which will produce some query expression, and they will have to be combined appropriately. See details in the next section.

6.3. Details of the Mapping Algorithm and Mapping Rules

We describe the mapping algorithm in detail, for resolving conflicts and mapping between the object and relational schema. The discussion follows the cases of Table 2, outlined in the previous section. We process each QPSobj(i) independently, to produce a portion of TQrel(QPSobj(i), D2), the transformed query with respect to the target relational schema D2. Each QPSobj(i) is as follows:

QPSobj(i)[begin-object → Y, sub-list → Psub, sup-list → Psup, attr → attr-Y, end-object → Y′]

The CR for the mapping knowledge is encoded in HTobj-relMapping as follows:

HTobj-relMapping(Y, D1, D2)[
  mappings →→ map(Y, attr-Y, i)[rel-attr →→ (R, Z)],
  constraints →→ constraint(R, Y)[attr → attr-C, range → S],
  joins →→ join(Y, Y″)[rel-attr →→ (..., (R1, R2, Z1, Z2), ...)],
  reachable-attr →→ attr-Y,
  includes →→ Y′]

The output of the algorithm is TQrel(QPSobj(i), D2), which is as follows:

TQrel(QPSobj(i), D2)[rel-attr →→ (R, AN), constraints@R,X →→ (R, attr-C, S), join-path@P → L]

The parameter rel-attr in TQrel corresponds to attr-Y, and parameters constraints and join-path produce selection predicates, or joining clauses, respectively, in the SQL query.

HTM Mapping Algorithm

Step 1

This step covers Cases 1.1, 1.2 and 1.3 (Table 2) of the HTM function. For begin-object Y and attr attr-Y of QPSobj(i), i.e., for the combination (Y, -, -, attr-Y), test if there are matching entries in database D2 for

HTobj-relMapping(Y,D1,D2)[reachable-attr ↔ attr-Y, mappings ↔ map(Y,attr-Y,i)].

This is the test to determine if we are in Case 1. If we are in Case 1 (there are matches), then apply the appropriate rule in Ident-Rel-Attr, and produce entries in TQrel() for rel-attr. Next, for the Join-Path rules, we test for matching entries in database D2 for

HTobj-relMapping(Y,D1,D2)[joins ↔ join(Y,Y)[⋯]].

If there are entries produced by the test, we produce appropriate entries in TQrel() for join-path@Y. Rules in Sel-Constr must be checked for all relations which are present in TQrel(). To do so we look for matching entries in database D2 for HTobj-relMapping(Y,D1,D2)[constraints ↔ constraint(R,Y)[⋯]], where R matches the relations Ri present in either TQrel(i,QPSobj(i),D)[rel-attr ↔ (Ri, -)] or TQrel(i,QPSobj(i),D)[join-path@Y ↔ (⋯, (Ri, Rj, -, -), ⋯)]. When there are matching entries, applying these rules in this step produces the appropriate entries in TQrel() for rel-attr (and constraints@R,Y and join-path@Y where applicable).

We relate these tests with the functionality of HTM as summarized in Table 2. If there are entries in TQrel()[join-path@Y], then this is Case 1.3, i.e., (Y, -, -, attr-Y) maps to an arbitrary SPJ relational expression. If there are entries in TQrel()[constraints@R,Y], but none in TQrel()[join-path@Y], then this is Case 1.2. If there are only entries in TQrel()[rel-attr], then this is Case 1.1.

If there are matching entries in Step 1, then after applying these rules, proceed to Step 4. However, if the test for Case 1 fails, i.e., for begin-object Y and attr attr-Y of QPSobj(i), or the combination (Y, -, -, attr-Y), there are no matching entries in

HTobj-relMapping(Y,D1,D2)[reachable-attr ↔ attr-Y, mappings ↔ map(Y,attr-Y,i)],

then proceed to Step 2. Note that to enter Step 2, the sup-list cannot be NULL. If it is NULL, and the tests in Step 1 failed to produce matching entries to proceed to Step 4, then the entries for HTobj-relMapping(Y,D1,D2) are incomplete.
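The tests of Step 1 can be sketched in Python as follows. This is a minimal sketch, not the F-logic rules themselves: the `MappingEntry` class is a hypothetical in-memory stand-in for one HTobj-relMapping(Y, D1, D2) entry, the keys of `mappings` are assumed to double as the reachable attributes, and `self_joins` stands for the join(Y, Y) entries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MappingEntry:
    """Hypothetical stand-in for HTobj-relMapping(Y, D1, D2)."""
    # attr-Y -> (R, Z); keys are assumed to be the reachable attributes
    mappings: Dict[str, Tuple[str, str]] = field(default_factory=dict)
    # R -> (attr-C, S): selection constraint on relation R for object Y
    constraints: Dict[str, Tuple[str, Tuple[str, ...]]] = field(default_factory=dict)
    # join(Y, Y) entries as (R1, R2, Z1, Z2)
    self_joins: List[Tuple[str, str, str, str]] = field(default_factory=list)


def step1(entry: MappingEntry, attr: str) -> Optional[dict]:
    """Return the TQrel parts and the case label, or None if Case 1 fails."""
    if attr not in entry.mappings:
        return None  # no matching entries: the algorithm proceeds to Step 2
    rel_attr = entry.mappings[attr]
    join_path = list(entry.self_joins)
    # Sel-Constr is checked for every relation present in rel-attr or join-path
    relations = {rel_attr[0]} | {r for join in join_path for r in join[:2]}
    constraints = {r: entry.constraints[r] for r in relations if r in entry.constraints}
    if join_path:
        case = "1.3"
    elif constraints:
        case = "1.2"
    else:
        case = "1.1"
    return {"case": case, "rel-attr": rel_attr,
            "constraints": constraints, "join-path": join_path}
```

A `None` result corresponds to the situation in which Step 2 must be entered, or, when the sup-list is NULL, to an incomplete dictionary.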

Step 2

This step covers all Cases 2.x of the HTM function in Table 2.

Step 2a

For QPSobj(i), consider the sup-list Psup and the end object of that list, end-object@Psup, say Y′, from which Y inherits the attribute attr-Y. For Y′ and attr-Y, i.e., the combination (Y, Y′, -, attr-Y), test if there are matching entries in database D2 for

HTobj-relMapping(Y′,D1,D2)[reachable-attr ↔ attr-Y, mappings ↔ map(Y′,attr-Y,i)].

If there are entries, then apply the rules Ident-Rel-Attr, and produce entries in TQrel() for rel-attr. Next, for the Join-Path rules, we test for matching entries in database D2 for

HTobj-relMapping(Y′,D1,D2)[joins ↔ join(Y′,Y′)[⋯]].

If there are entries these rules are applied, and we produce appropriate entries in TQrel() for join-path@Y′. As in Step 1, we check rules in Sel-Constr; we look for matching entries in database D2 for HTobj-relMapping(Y′,D1,D2)[constraints ↔ constraint(R,Y′)[⋯]], where R matches the relations Ri, Rj occurring in either TQrel(i,QPSobj(i),D)[rel-attr ↔ (Ri, -)] or TQrel(i,QPSobj(i),D)[join-path@Y′ ↔ (⋯, (Ri, Rj, -, -), ⋯)]. When there are entries, applying these rules produces appropriate entries in TQrel() for constraints@R,Y′.

Step 2b

Next, for begin-object Y, i.e., the combination (Y, Y′, -, attr-Y), test if there is an entry

HTobj-relMapping(Y,D1,D2)[includes ↔ Y′].

If there is, complete processing Step 2 and proceed to Step 4. If there is no such entry, then continue to Step 2c.

Step 2c

For begin-object Y, i.e., the combination (Y, Y′, -, attr-Y), test if there are entries for Y as follows: HTobj-relMapping(Y,D1,D2)[constraints ↔ constraint(R,Y)[⋯]], where the relation R matches the relations Ri present in either TQrel(i,QPSobj(i),D)[rel-attr ↔ (Ri, -)] or TQrel(i,QPSobj(i),D)[join-path@Y′ ↔ (⋯, (Ri, Rj, -, -), ⋯)]. If there are such entries, apply rules in Sel-Constr and produce appropriate entries in TQrel() for constraints@R,Y. After applying these rules, complete processing Step 2 and proceed to Step 4.

Step 2d

If the tests in Step 2c do not produce appropriate entries, then proceed to Step 2d. For the objects Y and Y′, i.e., the combination (Y, Y′, -, attr-Y), determine if there are entries as follows:

HTobj-relMapping(Y,D1,D2)[joins ↔ join(Y, Y′)[rel-attr ↔ (⋯, (Ri, Rj, -, -), ⋯)]].

If there are such entries, then apply rules in Join-Path and produce appropriate entries in TQrel() for join-path@Y,Y′. Optionally, test if there are the following entries: HTobj-relMapping(Y,D1,D2)[constraints ↔ constraint(Rk, Y)[⋯]], and/or HTobj-relMapping(Y′,D1,D2)[constraints ↔ constraint(Rl, Y′)[⋯]], where Rk or Rl match entries in Ri or Rj as identified in this step. If there are such entries, then apply rules in Sel-Constr, as appropriate, to produce appropriate entries in TQrel() for constraints@Rk,Y and constraints@Rl,Y′, where applicable. If the test in 2d also fails to produce appropriate entries, then the elements in sup-list have to be processed recursively. Proceed to Step 3.

To relate these tests with the functionality of HTM as summarized in Table 2, if there are entries in TQrel() for join-path@Y′ or join-path@Y,Y′, then this corresponds to Case 2.5, i.e., the combination (Y, Y′, -, attr-Y) maps to an SPJ-type relational expression. If there are any entries in TQrel() for constraints, then this corresponds to Cases 2.2, 2.3 or 2.4. If there are entries for both constraints@Y and constraints@Y′, then this is Case 2.4. If there are only entries for constraints@Y′, then this is Case 2.3. If there are entries for only constraints@Y, then this is Case 2.2. If there are no entries for constraints or join-path, then this is Case 2.1.
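The correspondence between the entries produced in Step 2 and the Cases 2.1 to 2.5 of Table 2 can be written as a small decision function. The Python sketch below is illustrative only; the function name and its arguments are hypothetical, and it assumes the caller has already recorded which objects received constraints entries and whether any join-path entry was produced.

```python
from typing import Set


def classify_case2(begin_obj: str, end_obj: str,
                   constrained: Set[str], has_join_path: bool) -> str:
    """Map the TQrel entries produced in Step 2 to a case label of Table 2.

    constrained   -- objects for which a constraints@R,<object> entry exists
    has_join_path -- True if join-path@Y' or join-path@Y,Y' has entries
    """
    if has_join_path:
        return "2.5"          # SPJ-type relational expression
    on_begin = begin_obj in constrained
    on_end = end_obj in constrained
    if on_begin and on_end:
        return "2.4"          # constraints for both Y and Y'
    if on_end:
        return "2.3"          # constraints only for Y'
    if on_begin:
        return "2.2"          # constraints only for Y
    return "2.1"              # neither constraints nor join-path
```

For instance, a combination such as (DieselVehicle, Vehicle, -, Manufacturer) with constraints recorded for both objects and no join path would be labelled Case 2.4.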

Step 3

In this step we process each element in sup-list in turn.

Step 3a

We consider the pair Y and Y1, the first element in the sup-list, Psup. For this pair, i.e., combination (Y, Y1, Psup, attr-Y), we determine if there are entries in HTobj-relMapping(Y,D1,D2) and HTobj-relMapping(Y1,D1,D2), similar to those described in Steps 2b, 2c and 2d, where we replace Y′ (in Steps 2b, 2c, 2d) with Y1. If there are matching entries, then we apply the appropriate rules in Sel-Constr and Join-Path and produce entries in TQrel(). If none of the tests in Step 3a, similar to 2b, 2c or 2d, produce matching entries for this pair, we skip this element Y1, which is the first element in sup-list, Psup, and consider the subsequent pair of Y and Y2, where Y2 is the next element in Psup, i.e., we consider the combination (Y, Y2, P′sup, attr-Y). We continue this process, skipping over each subsequent element in Psup, until we find an element Yi for which one of the tests in Step 3a (similar to Steps 2b, 2c or 2d), for the pair of Y and Yi, produces appropriate matching entries, i.e., some combination (Y, Yi, P′sup, attr-Y). If there is no such element Yi in sup-list which produces matching entries in Step 3a, then the HTobj-relMapping is incomplete. As soon as an element Yi is identified, we proceed to Step 3b, with the combination (Y, Yi, P′sup, attr-Y), where P′sup is a new sup-list corresponding to truncating all the elements in Psup preceding element Yi.

Step 3b

For the element Yi obtained in Step 3a, i.e., the combination (Y, Yi, P′sup, attr-Y), we determine if there are entries in HTobj-relMapping(Yi,D1,D2) and HTobj-relMapping(Y′,D1,D2), to produce matching entries for tests in Step 3b, similar to the tests in Steps 2b, 2c and 2d. In these tests, we replace Y (in Steps 2b, 2c and 2d) with Yi. If there are matching entries, we apply the appropriate rules in Sel-Constr and Join-Path and produce entries in TQrel(). If any of the tests in Step 3b (similar to those in 2b, 2c or 2d) produce appropriate entries, then we complete processing Step 3 and proceed to Step 4. If none of the tests in Step 3b produce entries for the combination (Yi, Y′, P′sup, attr-Y), then we return to Step 3a, where the element we were currently processing, Yi, replaces Y, and with the truncated sup-list P′sup from Step 3a. If none of the tests in Step 3b produce matching entries and the truncated sup-list P′sup is NULL, then we conclude that HTobj-relMapping is incomplete.
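The interplay of Steps 3a and 3b, skipping over elements of the sup-list until a linked pair is found, can be sketched as a loop. The Python below is a simplified sketch under stated assumptions: `link_sup_list` and `has_entries` are hypothetical names, `has_entries(a, b)` stands for "one of the tests similar to Steps 2b, 2c or 2d succeeds for the pair (a, b)", `sup_list` is assumed to hold only the objects strictly between the begin-object and the end object, and the truncated list is assumed to start after Yi.

```python
from typing import Callable, List, Optional, Tuple


def link_sup_list(begin: str, sup_list: List[str], end: str,
                  has_entries: Callable[[str, str], bool]
                  ) -> Optional[List[Tuple[str, str]]]:
    """Return the object pairs that produced entries, or None if the
    mapping dictionary is incomplete for this combination."""
    pairs: List[Tuple[str, str]] = []
    current = begin
    remaining = list(sup_list)
    while True:
        # Step 3a: skip elements until the pair (current, Yi) has entries
        idx = next((k for k, y in enumerate(remaining)
                    if has_entries(current, y)), None)
        if idx is None:
            return None                      # dictionary is incomplete
        yi = remaining[idx]
        pairs.append((current, yi))
        remaining = remaining[idx + 1:]      # truncated sup-list
        # Step 3b: try to link Yi with the end object
        if has_entries(yi, end):
            pairs.append((yi, end))
            return pairs                     # proceed to Step 4
        if not remaining:
            return None                      # truncated list is NULL
        current = yi                         # return to Step 3a with Yi as Y
```

With joining information available only for (Auto, DieselAuto) and (DieselAuto, Vehicle), a call with begin-object Auto, intermediate elements DieselAuto and DieselVehicle, and end object Vehicle links Auto to DieselAuto and then DieselAuto to Vehicle, so DieselVehicle is skipped.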

Step 4

Step 4 is executed only if the sub-list is not NULL. Suppose we are processing the (i+1)-th QPSobj(i+1), Y.attr-Y [Z] ⋯, which occurs in the qpe X.attr-X [Y].attr-Y [Z] ⋯, and Y*, the object which is pointed to by X.attr-X, is different from Y, the begin-object in QPSobj(i+1). Now, the sub-list of QPSobj(i+1) will be a path from Y* to Y. Step 4 is similar to Step 3, and has two sub-steps 4a and 4b. One difference is that instead of considering the begin-object Y, we consider Y* and the sub-list Psub from Y* to Y. Another difference is that while processing Step 4b (similar to Step 3b), when the tests produce appropriate matching entries, then instead of proceeding to Step 4, the algorithm will terminate processing of the current QPSobj(i+1).

If we consider the case where there is either partial or total overlap in Psup and Psub, and we process these lists without eliminating the overlap, then we will produce redundant expressions in the corresponding TQrel(). However, if we eliminate the overlap in the sup-list and the sub-list, we must take care that the object referred to by some combination (EO, attribute) of QPSobj(i) is identical to the SO of the next combination (SO, EO, PO, attribute) for the next QPSobj(i+1).

6.4. Specifying Mapping Knowledge using F-logic Rules

We use the F-logic language as formulated by Kifer and Lausen (1989) and Kifer et al. (1990) for the canonical representations and to represent the mapping rules defining the HTM, which perform the query transformation from CRQobj (actually each QPSobj(i)) to TQrel. We choose F-logic since it has good modeling constructs and we are able to represent the mapping rules in the language. Each canonical representation is a class in F-logic, and parameters are defined as the attributes or methods of the corresponding classes. We briefly present the syntax of F-logic and then examine a few of the mapping rules.

6.4.1. Syntax of F-logic

We borrow this brief description from Lefebvre et al. (1992). In F-logic, the instance term o : c means that the object o is an instance of class c. A data term o[m@a1, ⋯, an → v, m′@a1, ⋯, ap ↔ {v′, v″}] means that the value of the functional method m with arguments a1 to an for the object o is v, and that the value of the set-valued method m′ with arguments a1 to ap for the object o is a set containing the values v′ and v″. If a method m has no arguments, "@" will be omitted. The symbol ↔ indicates a set-valued method (footnote 6). Note that other values could also be members of this set, and that the expression above does not restrict the value of m′ for o to be {v′, v″}; it only indicates some of the values of the corresponding set. An object can be denoted by a constant, or a term. For example, dept(cs) is a valid object identifier. An atomic data term is a data term referring to only one method. Notational conventions allow us to write o[m′ ↔ v] instead of o[m′ ↔ {v}] for a single element of a set-valued method; the expression o[m → v, n → v′] is equivalent to o[m → v] ∧ o[n → v′]; the expression o : c[m → v] is equivalent to o : c ∧ o[m → v].

An F-logic program consists of a set of data declarations (data or instance terms) and a set of deduction rules. A deduction rule has a head, which is a data term, and a body, which is a conjunction of data and instance terms. Disjunction and negation are allowed in the body of rules. Deduction rules can be used in a way similar to Datalog rules (or Prolog clauses), i.e., the head of a rule defines a derived method, the value of which can be found by evaluating the body. An interpreter for the language is described in Lawley (1993).

Footnote 6: This symbol is different from the one used in the original paper.

6.4.2. Specifying Mapping Rules in F-logic

The rules that we present here are intended to provide a flavor of the types of transformations that can be expressed and are not intended to represent the complete transformation algorithm from object to relational schema. We present rules corresponding to Cases 1.1, 1.2, 2.4 and 3, as representative examples of the types of transformations accomplished by the mapping algorithm. The mapping rules defining the HTM function occur in several groups. Ident-Rel-Attr is a group of rules which identify the corresponding relation and attribute name in the TQrel data structure. Sel-Constr is a group of rules which specify selection criteria that must first be satisfied by the tuples of the relations specified in Ident-Rel-Attr rules (or the Join-Path rules). Join-Path is a group of rules which specify relations which must be joined, corresponding to the more complicated Cases 1.3, 2.5 and 3 of the algorithm.

Mapping Rules for Case 1.1

Case 1.1 is selected when the CRQobj and HTobj-relMapping indicate that given some begin class X, A is a reachable attribute. In other words, the corresponding relation and attribute names are directly obtained. The following mapping rule in the group Ident-Rel-Attr may be applied in this case, and will provide one or more (relation, attribute) pair(s) in the corresponding TQrel() structure:

TQrel(i, QPSobj(i), D2)[rel-attr ↔ (R, AN)] ←
    QPSobj(i)[begin-object → X, attr → A, sub-list → Psub] ∧
    HTobj-relMapping(X, D1, D2)[mappings ↔ map(X, A, i)[rel-attr ↔ (R, AN)], reachable-attr ↔ A] ∧
    TQrel(i, QPSobj(i), D2)[constraints@R,X ↔ C] ∧
    TQrel(i, QPSobj(i), D2)[joining-path@Psub → L]

From our example, an instantiation of this rule for the combination (DieselAuto, -, -, Model) may be as follows:

TQrel(1, QPSobj(1), D2)[rel-attr ↔ (RelDieselAuto, AttrModel)] ←
    QPSobj(1)[begin-object → DieselAuto, attr → Model, sub-list → NULL] ∧
    HTobj-relMapping(DieselAuto, D1, D2)[mappings ↔ map(DieselAuto, Model, 1)[rel-attr ↔ (RelDieselAuto, AttrModel)], reachable-attr ↔ Model] ∧
    TQrel(1, QPSobj(1), D2)[constraints@RelDieselAuto,DieselAuto ↔ NULL] ∧
    TQrel(1, QPSobj(1), D2)[joining-path@NULL → NULL]

For this Case 1.1, the corresponding relation and attribute pair are directly determined, and there is no sub-list nor are there constraints or join paths. A rule in the group Sel-Constr will be instantiated, indicating that there is no constraint on X, as follows:

TQrel(1, QPSobj(1), D2)[constraints@RelDieselAuto,DieselAuto ↔ NULL] ←
    not(HTobj-relMapping(DieselAuto, D1, D2)[constraints ↔ constraint(RelDieselAuto, DieselAuto)])

Mapping Rules for Case 1.2

As an example for Case 1.2, (DieselAuto, -, -, Model) must provide the following information in TQrel(): {(RelAuto, AttrModel, AttrType = {"DieselHatchBack", "DieselSedan"})}. The model information on DieselAuto is stored in an attribute AttrModel of a relation RelAuto. The DieselAuto tuples are distinguished by a value of "DieselSedan" or "DieselHatchBack" in the attribute AttrType, and in order to select only those tuples we use one of the Sel-Constr rules. A particular instantiation for a rule from Ident-Rel-Attr (which produces the pair (RelAuto, AttrModel)) and a rule from Sel-Constr (which selects tuples from RelAuto based on AttrType values) are as follows:

TQrel(i, QPSobj(i), D2)[rel-attr ↔ (RelAuto, AttrModel)] ←
    QPSobj(i)[begin-object → DieselAuto, attr → Model, sub-list → NULL] ∧
    HTobj-relMapping(DieselAuto, D1, D2)[mappings ↔ map(DieselAuto, Model, i)[rel-attr ↔ (RelAuto, AttrModel)], reachable-attr ↔ Model] ∧
    TQrel(i, QPSobj(i), D)[constraints@RelAuto,DieselAuto ↔ C] ∧
    TQrel(i, QPSobj(i), D)[joining-path@NULL → NULL]

TQrel(i, QPSobj(i), D2)[constraints@RelAuto,DieselAuto ↔ (RelAuto, AttrType, {"DieselHatchBack", "DieselSedan"})] ←
    HTobj-relMapping(DieselAuto, D1, D2)[constraints ↔ constraint(RelAuto, DieselAuto)[attr → AttrType, range → {"DieselHatchBack", "DieselSedan"}]]

After these two rules are applied, we produce the following TQrel():

TQrel(1, QPSobj(1), D2)[rel-attr ↔ (RelAuto, AttrModel), constraints@RelAuto,DieselAuto ↔ (RelAuto, AttrType, {"DieselHatchBack", "DieselSedan"})]

It should be straightforward to generate an SQL query which selects AttrModel values for those tuples of RelAuto such that the value of AttrType is in {"DieselHatchBack", "DieselSedan"}.
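As an illustration of the last remark, the following Python sketch renders a TQrel() of this shape as an SQL string. It is only one possible rendering, not the Generator Module of the paper: the function name `select_with_constraints` and the tuple encodings of rel-attr and constraints are assumptions made for the example.

```python
from typing import Iterable, Tuple

# (relation, constrained attribute, allowed values): one constraints entry
Constraint = Tuple[str, str, Tuple[str, ...]]


def select_with_constraints(rel_attr: Tuple[str, str],
                            constraints: Iterable[Constraint]) -> str:
    """Build a single-relation SELECT from rel-attr and constraints entries."""
    relation, attribute = rel_attr
    sql = f"SELECT {attribute} FROM {relation}"
    predicates = []
    for rel, attr, values in constraints:
        # Double any embedded single quote so the literal stays valid SQL
        quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
        predicates.append(f"{rel}.{attr} IN ({quoted})")
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)
    return sql


query = select_with_constraints(
    ("RelAuto", "AttrModel"),
    [("RelAuto", "AttrType", ("DieselHatchBack", "DieselSedan"))],
)
```

For the Case 1.2 example, `query` holds `SELECT AttrModel FROM RelAuto WHERE RelAuto.AttrType IN ('DieselHatchBack', 'DieselSedan')`. With an empty constraint list, as in Case 1.1, no WHERE clause is produced.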

Mapping Rules for Case 2.4

As an example for Case 2.4, (DieselVehicle, Vehicle, -, Manufacturer) must provide the following in TQrel(): {(RelAllEngines, AttrManufacturer, AttrEngineType = {"enumeration of vehicle engine types"}, AttrFuelType = {"enumeration of fuel types for DieselVehicle"})}. The relation RelAllEngines stores information on the object DieselVehicle as well as on other vehicles. Based on the attribute AttrEngineType, we select Vehicle tuples, and based on the attribute AttrFuelType, we select DieselVehicle tuples, from RelAllEngines. The necessity for having two selection constraints may be due to the fact that DieselVehicle may participate in multiple hierarchies, which may not include Vehicle, and vice versa. The instantiated rules from Ident-Rel-Attr and Sel-Constr are as follows:

TQrel(i,QPSobj(i),D2)[rel-attr ↔ (RelAllEngines, AttrManufacturer)] ←
    QPSobj(i)[begin-object → DieselVehicle, sup-list ↔ Psup, end-object@Psup → Vehicle, attr → Manufacturer, sub-list → NULL] ∧
    not(HTobj-relMapping(DieselVehicle,D1,D2)[reachable-attr ↔ Manufacturer]) ∧
    HTobj-relMapping(Vehicle,D1,D2)[mappings ↔ map(Vehicle,Manufacturer,i)[rel-attr → (RelAllEngines, AttrManufacturer)], reachable-attr ↔ Manufacturer] ∧
    HTobj-relMapping(DieselVehicle,D1,D2)[constraints ↔ constraint(RelAllEngines,DieselVehicle)[attr → AttrEngineType, range → S]] ∧
    TQrel(i,QPSobj(i),D2)[constraints@RelAllEngines,DieselVehicle ↔ C] ∧
    TQrel(i,QPSobj(i),D2)[constraints@RelAllEngines,Vehicle ↔ C′] ∧
    TQrel(i,QPSobj(i),D2)[joining-path@NULL → NULL]

The following two rules are evaluated in the group Sel-Constr to provide the constraints on RelAllEngines for DieselVehicle and Vehicle:

TQrel(i,QPSobj(i),D2)[constraints@RelAllEngines,DieselVehicle ↔ (RelAllEngines, AttrEngineType, {"enumeration of vehicle engine types"})] ←
    HTobj-relMapping(DieselVehicle,D1,D2)[constraints ↔ constraint(RelAllEngines,DieselVehicle)[attr → AttrEngineType, range → {"enumeration of vehicle engine types"}]]

TQrel(i,QPSobj(i),D2)[constraints@RelAllEngines,Vehicle ↔ (RelAllEngines, AttrFuelType, {"enumeration of diesel vehicle fuel types"})] ←
    HTobj-relMapping(Vehicle,D1,D2)[constraints ↔ constraint(RelAllEngines,Vehicle)[attr → AttrFuelType, range → {"enumeration of diesel vehicle fuel types"}]]

The following TQrel() is produced, from which an SQL query is easily obtained:

TQrel(i,QPSobj(i),D2)[rel-attr ↔ (RelAllEngines, AttrManufacturer),
    constraints@RelAllEngines,DieselVehicle ↔ (RelAllEngines, AttrEngineType, {"enumeration of vehicle engine types"}),
    constraints@RelAllEngines,Vehicle ↔ (RelAllEngines, AttrFuelType, {"enumeration of diesel vehicle fuel types"})]

Mapping Rules for Case 3

As an example of Case 3, (Auto, Vehicle, -, Model) must provide the following. The first pair specifies a join between RelSedan and RelDieselAuto and between RelHatchBack and RelDieselAuto, as follows:

( {(RelHatchBack, AN-Join ≡ AttrAutoId), (RelSedan, AN-Join ≡ AttrAutoId)}, {(RelDieselAuto, AN-Join ≡ AttrDslAutoId)} )

The second pair is a join between RelDieselAuto and RelAllVehicles, as follows:

( {(RelDieselAuto, AN-Join ≡ AttrDslVehId)}, {(RelAllVehicles, AN-Join ≡ AttrVehicleId)} )

The following rule will be applied from Ident-Rel-Attr:

TQrel(i,QPSobj(i),D2)[rel-attr ↔ (RelAllEngines, AttrModel)] ←
    QPSobj(i)[begin-object → Auto, sup-list ↔ Psup, end-object@Psup → Vehicle, attr → Model, sub-list → NULL] ∧
    not(HTobj-relMapping(Auto,D1,D2)[reachable-attr ↔ Model]) ∧
    not(HTobj-relMapping(Auto,D1,D2)[includes ↔ Vehicle]) ∧
    not(HTobj-relMapping(Auto,D1,D2)[constraints ↔ constraint(RelAllEngines, Auto)]) ∧
    HTobj-relMapping(Vehicle,D1,D2)[mappings ↔ map(Vehicle,Model,i)[rel-attr → (RelAllEngines, AttrModel)], reachable-attr ↔ Model] ∧
    TQrel(i,QPSobj(i),D2)[join-path@Psup → Lsup] ∧
    TQrel(i,QPSobj(i),D2)[join-path@Psub → Lsub]

We do not evaluate any rules in group Sel-Constr to provide constraints on RelAllEngines for Auto or Vehicle.

The list Psup is the path from Auto to Vehicle (the list Psub is not discussed here). This path will be processed by the rules in Join-Path to produce the appropriate path in TQrel(). One instantiated Join-Path rule will be as follows:

TQrel(i,QPSobj(i),D2)[join-path@Psup → cons((RelHatchBack, RelDieselAuto, AttrAutoId, AttrDslAutoId), L′sup)] ←
    QPSobj(i)[sup-list ↔ Psup[head → Auto, tail → P′sup[head → DieselAuto]]] ∧
    TQrel(i,QPSobj(i),D2)[join-path@P′sup ↔ L′sup] ∧
    HTobj-relMapping(Auto,D1,D2)[joins ↔ join(Auto,DieselAuto)[rel-attr ↔ (RelHatchBack, RelDieselAuto, AttrAutoId, AttrDslAutoId)]] ∧
    TQrel(i,QPSobj(i),D2)[constraints@RelHatchBack,Auto ↔ CAuto] ∧
    TQrel(i,QPSobj(i),D2)[constraints@RelDieselAuto,DieselAuto ↔ CDieselAuto]

The following Join-Path rule is applied since there is no joining information for the pair (DieselAuto, DieselVehicle), and the algorithm skips ahead to the pair (DieselAuto, Vehicle):

TQrel(i,QPSobj(i),D2)[join-path@Psup → L′sup] ←
    QPSobj(i)[sup-list ↔ Psup[head → DieselAuto, tail → Psup1[head → DieselVehicle, tail → Psup2]]] ∧
    not(HTobj-relMapping(DieselAuto,D1,D2)[joins ↔ join(DieselAuto,DieselVehicle)]) ∧
    TQrel(i,QPSobj(i),D2)[join-path@cons(DieselAuto, Psup2) ↔ L′sup]

The following Join-Path rule is the terminating rule, corresponding to the pair (DieselAuto, Vehicle):

TQrel(i,QPSobj(i),D2)[join-path@cons(DieselAuto,cons(Vehicle,nil)) → cons((RelDieselAuto, RelAllEngines, AttrDslVehId, AttrVehicleId), nil)] ←
    QPSobj(i)[sup-list ↔ cons(DieselAuto,cons(Vehicle,nil))] ∧
    HTobj-relMapping(DieselAuto,D1,D2)[joins ↔ join(DieselAuto,Vehicle)[rel-attr ↔ (RelDieselAuto, RelAllEngines, AttrDslVehId, AttrVehicleId)]] ∧
    TQrel(i,QPSobj(i),D2)[constraints@RelHatchBack,Auto ↔ CAuto] ∧
    TQrel(i,QPSobj(i),D2)[constraints@RelDieselAuto,DieselAuto ↔ CDieselAuto]

Finally, the following TQrel() is produced, from which an SQL query is easily obtained:

TQrel(1,QPSobj(1),D2)[rel-attr ↔ (RelAllEngines, AttrModel),
    join-path@cons(Auto,cons(DieselAuto,cons(DieselVehicle,cons(Vehicle,nil)))) ↔
        cons((RelHatchBack, RelDieselAuto, AttrAutoId, AttrDslAutoId),
        cons((RelDieselAuto, RelAllEngines, AttrDslVehId, AttrVehicleId), nil))]

6.5. An Evaluation of the Query Interoperation Approach Using a CR

In this section we discuss how we can evaluate our approach to query interoperation, and discuss the procedure to verify the correctness of the approach.

One aspect of evaluation is wrt the expressive power of our canonical representation, as compared to the higher-order query languages proposed in Kent (1991), Krishnamurthy et al. (1991) and Lakshmanan et al. (1993). However, we must consider that these proposed languages can only resolve representational heterogeneity in relational schema, and do not address the heterogeneity between object and relational schema. Given this limitation, we must compare our parameterized representation, HTx-yMapping, and the transformations that are specified in the mapping rules, with the higher-order constructs in these languages. The objective is to determine if we are as expressive as a higher-order query language. We can assume that the higher-order query is expressed wrt some global schema and the queries are translated against the local relational schema. We must determine the extent to which our parameters and mapping rules can cover similar language constructs as in the higher-order queries. We note that we are limited by the target query languages; for example, we cannot query the schema of a target relational database. This comparison is a topic of current research.

The second aspect of evaluation is to determine if a particular knowledge dictionary is correctly constructed and is internally consistent. This would guarantee that every object query would be translated to a relational query. We perform this verification in two steps. The first step is to make sure that for every attribute associated with an object Y in a local object schema, and which occurs in an XSQL query, we have defined it to be a reachable attribute in the target database system. We construct an F-logic class Y(D1, D2) for each object Y, and use the following rule to determine if this is the case, i.e., there is an entry for each attribute of Y:

Y(D1, D2)[attr ↔ {attr-Y}] ←
    HTobj-relMapping(Y,D1,D2)[reachable-attr → attr-Y]
    ∨ Y : Y′ ∧ HTobj-relMapping(Y′,D1,D2)[reachable-attr → attr-Y]

For each such reachable attribute defined in the target database, in the second step, we determine if the mapping knowledge base has sufficient parameters to construct a query in the target database. To verify this, for each reachable attribute in the target database, we identify the corresponding relation (R) and attribute (Z) pair for this reachable attribute (attr-Y), and the associated parameters. We construct an F-logic class Y(D1, D2)[attr-rel-attr ↔ (attr-Y, R, Z)]. We use the following rule to determine if there is an entry for each pair:

Y(D1, D2)[attr-rel-attr ↔ (attr-Y, R, Z)] ←
    HTobj-relMapping(Y,D1,D2)[reachable-attr → attr-Y, mappings → map(Y,attr-Y,D)[rel-attr → (R,Z)]]
    ∨
    Y : Y′ ∧ Y′(D1, D2)[attr-rel-attr ↔ (attr-Y, R, Z)] ∧
    ( HTobj-relMapping(Y,D1,D2)[includes → Y′]
      ∨ HTobj-relMapping(Y,D1,D2)[constraints ↔ constraint(R, Y)[⋯]]
      ∨ HTobj-relMapping(Y,D1,D2)[joins ↔ join(Y, Y′)[rel-attr ↔ (⋯ (-, R, -, -) ⋯)]]
    )

This second step makes sure there will be at least one path (graph) in the target (relational) schema D2 for each (Y, attr-Y), so that a relational query can be obtained.

We can also compare our approach, based on a parameterized representation and mapping rules, with the approach of Qian (1993), Qian and Lunt (1994), and Qian and Raschid (1995). The advantage in the latter approach is that the mapping knowledge base is a first order deductive database. They use a theorem proving approach to provide a query which is interoperated wrt a target schema (object or relational). Since the queries and the mapping knowledge base are all expressed using the same language, it is more straightforward to verify the correctness of the interoperation with this approach. However, it suffers from the disadvantage that the mapping rules can be repetitive and must be written explicitly for each entity, whereas our parameterized representation makes it very simple to build the knowledge dictionary HTx-yMapping.
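The first verification step can be approximated outside F-logic by a simple reachability check. The Python sketch below is an assumption-laden illustration, not the authors' procedure: `missing_attributes`, `reachable` and `superclasses` are hypothetical names, the dictionary is reduced to a map from each object to its reachable-attr set, and superclasses are followed transitively, which goes slightly beyond the single Y : Y′ step written in the rule.

```python
from typing import Dict, List, Set


def missing_attributes(obj: str,
                       schema_attrs: Set[str],
                       reachable: Dict[str, Set[str]],
                       superclasses: Dict[str, List[str]]) -> Set[str]:
    """Return the attributes of obj with no reachable-attr entry,
    either for obj itself or for one of its (transitive) superclasses."""
    seen: Set[str] = set()
    found: Set[str] = set()
    stack = [obj]
    while stack:
        y = stack.pop()
        if y in seen:
            continue            # guard against cycles in the hierarchy
        seen.add(y)
        found |= reachable.get(y, set())
        stack.extend(superclasses.get(y, []))
    return set(schema_attrs) - found
```

An empty result for every object of the local schema corresponds to the first step succeeding. The second step would additionally require, for each inherited attribute, an includes, constraints or joins entry linking the object to the superclass that supplies the (R, Z) pair.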

7. Summary and Future Research

We developed techniques for interoperable query processing between object and relational schema. The objective is to pose a query against a local object schema and be able to share information transparently from target relational databases, which have equivalent schema. Our approach is a mapping approach (as opposed to a global schema approach) and is based on using canonical representations (CR). A mapping approach allows existing legacy applications to interoperate, whereas in the global schema approach, only queries posed against the global schema are interoperable. Our architecture for interoperability is composed of an Interoperability Module (IM) which has three sub-modules: an Extractor Module (EM), a HeTerogeneous Mapping Module (HTM), and a Generator Module (GM). We utilize two canonical representations (CR). We use one CR for resolving heterogeneity based on the object and relational query languages. We use a second parameterized CR to resolve representational heterogeneity between object and relational schema and to build a mapping knowledge dictionary, HTobj-relMapping. There is also a set of mapping rules, which specifies the functionality of the HTM and defines the appropriate mapping between schemas. A query posed against the local object schema is first represented in the CR for queries, CRQobj, by the EM. It is then transformed by the mapping rules of HTM, to obtain an appropriate query for the target relational schema, TQrel, using relevant information from the mapping knowledge dictionary, HTobj-relMapping. The use of the parameterized CR allows us to build the mapping knowledge dictionary easily, and allows reusability of the mapping rules. Currently, we use a second order logical language (F-logic), which has second order syntactical structures but first order semantics, to express both CR and the mapping rules. Each CR is a class in F-logic, and parameters are defined to be the attributes or methods of the corresponding classes. In related research, we have studied the interoperation from relational to object queries, and this is presented in Chang et al. (1994).

We now discuss future research. An important task is to compare our approach of using parameterized CRs with other mapping approaches, based on canonical deductive bases and theorem-proving techniques, to interoperate between heterogeneous schema. We must also compare the expressive power of our parameterized CR and mapping rules with other proposals for higher-order query languages or logics. We also intend to study the interoperability of query languages with higher-order features. This is discussed in Qian and Raschid (1995). In this paper, we make the assumption that the databases are equivalent, i.e., the answers obtained for a query from a remote database satisfy the integrity constraints of a local schema. This is a shortcoming in many situations. In related research (see Vidal and Raschid (1995)), we are investigating knowledge and data sharing among different schema (F-logic schema), where the semantics may not be equivalent. We use the KIF interchange logic to build a mediation knowledge base for sharing knowledge in this environment. A final topic of interest is heterogeneous query optimization. We are pursuing research on specifying algebraic representations for the mapping knowledge among instances in the heterogeneous schema, so that we may apply algebraic transformations to obtain equivalent and more efficient queries. This research is discussed in Florescu et al. (1994).

8. Acknowledgements

This research has been supported by the Advanced Research Projects Agency under grant ARPA/ONR 92-J1929. The authors express their gratitude to Bonnie J. Dorr, for her numerous contributions during the course of this research, and in particular, her contribution in the specification of parameterized canonical representations.

9. References Ahmed, R., de Smedt, P., Du, W., Kent, W., Ketabchi, M.A., Litwin, W.A., Ra i, A. and Shan, M.-C. (1991) \The Pegasus heterogeneous multidatabase system," IEEE Computer, 24(12), 19{27. Ahmed, R., Albert, J., Du, W. and Kent, W. (1993) \An overview of Pegasus," Proceedings of the Workshop on Research Issues in Data Engineering, International Conference on Data Engineering. Albert, J., Ahmed, R., Ketabchi, M., Kent, W. and Shan, M.-C. (1993) \Automatic importation of relational schemas in Pegasus," Proceedings of the Workshop on Research Issues in Data Engineering, International Conference on Data Engineering.

38

Arens, Y. and Knoblock, C.A. (1992) "Planning and reformulating queries for semantically-modeled multidatabase systems," Proceedings of the First International Conference on Knowledge Management.
Arens, Y., Chee, C.Y., Hsu, C.-N. and Knoblock, C.A. (1993) "Retrieving and integrating data from multiple information sources," International Journal of Intelligent and Cooperative Information Systems, Vol. 2, No. 2, 127-158.
Barsalou, T. and Gangopadhyay, D. (1992) "M(DM): An open framework for interoperation of multimodel multidatabase systems," Proceedings of the International Conference on Data Engineering.
Batini, C., Lenzerini, M. and Navathe, S.B. (December 1986) "A comparative analysis of methodologies for database schema integration," ACM Computing Surveys, Vol. 18, No. 4, 323-364.
Chang, Y. and Raschid, L. (1995) "Toward interoperability among heterogeneous databases: a mapping approach based on canonical representations," Submitted to the International Conference on Database Systems for Advanced Applications (DASFAA'95).
Chang, Y., Raschid, L. and Dorr, B.J. (1994) "Transforming queries from a relational schema to an equivalent object schema: a prototype based on F-logic," Proceedings of the International Symposium on Methodologies in Information Systems (ISMIS-94).
Chomicki, J. and Litwin, W. (1993) "Declarative definition of object-oriented multidatabase mappings," in Distributed Object Management, Özsu, M.T., Dayal, U. and Valduriez, P. (eds.), Morgan Kaufmann Publishers.
Dayal, U. and Hwang, H. (1984) "View definition and generalization for database integration in a multidatabase system," IEEE Transactions on Software Engineering, 10(6), 628-645.
Fishman, D.H., Beech, D., Cate, H.P., Chow, E.C., Connors, T., Davis, J.W., Derrett, N., Hoch, C.G., Kent, W., Lyngbaek, P., Mahbod, B., Neimat, M.A., Ryan, T.A. and Shan, M.C. (1990) "Iris: an object-oriented database management system," in Readings in Object-Oriented Database Systems, Zdonik, S.B. and Maier, D. (eds.), Morgan Kaufmann, Inc.
Hsu, C.-N. and Knoblock, C.A. (1993) "Reformulating query plans for multidatabase systems," Proceedings of the Second International Conference on Knowledge Management.
Jeusfeld, M. and Jarke, M. (1992) "From relational to object-oriented integrity simplification," Proceedings of the Second International Conference on Deductive and Object-Oriented Databases.
Keller, A. (1992) "Generating objects from relations," Technical report.
Kent, W. (1991) "Solving domain mismatch and schema mismatch problems with an object-oriented database programming language," Proceedings of the International Conference on Very Large Data Bases.
Keim, D.A., Kriegel, H.-P. and Miethsam, A. (1993) "Object-oriented querying of existing relational databases," Proceedings of the Fourth International Conference on Database and Expert Systems Applications (DEXA 93).
Keim, D.A., Kriegel, H.-P. and Miethsam, A. (1994) "Query translation supporting the migration of legacy databases into cooperative information systems," Proceedings of the Second International Conference on Cooperative Information Systems, pages 203-214.
Kifer, M. and Lausen, G. (1989) "F-logic: A higher-order language for reasoning about objects, inheritance and scheme," Proceedings of the ACM Sigmod Conference.
Kifer, M., Lausen, G. and Wu, J. (1990) "Logical foundations of object-oriented and frame-based languages," Technical report 90/14, SUNY at Stony Brook.
Kifer, M., Kim, W. and Sagiv, Y. (1992) "Querying object-oriented databases," Proceedings of the ACM Sigmod Conference.
Kim, W., Choi, I., Gala, S. and Scheevel, M. (1993) "On resolving schematic heterogeneity in multidatabase systems," Distributed and Parallel Databases, 1(3), 251-279.
Kim, W. and Seo, J. (December 1991) "Classifying schematic and data heterogeneity in multidatabase systems," IEEE Computer, pages 12-18.
Krishnamurthy, R., Litwin, W. and Kent, W. (1991) "Language features for interoperability of databases with schematic discrepancies," Proceedings of the ACM Sigmod Conference.

Lakshmanan, L.V.S., Sadri, F. and Subramanian, I.N. (1993) "On the logical foundations of schema integration and evolution in heterogeneous database systems," Proceedings of the Third International Conference on Deductive and Object-Oriented Databases, pages 81-100.
Lawley, M. (1993) "A Prolog interpreter for F-logic," Technical report, Griffith University.
Lefebvre, A., Bernus, P. and Topor, R. (1992) "Query transformation for accessing heterogeneous databases," Joint International Conference and Symposium on Logic Programming, Workshop on Deductive Databases.
Markowitz, V.M. and Shoshani, A. (1992) "Representing extended entity-relationship structures in relational databases: a modular approach," ACM Transactions on Database Systems, 17(3), 423-464.
Pirahesh, H., Mitschang, B., Sudkamp, N. and Lindsay, B. (1994) "Composite-object views in relational DBMS: an implementation perspective," Information Systems, 19(1), 69-88.
Qian, X. (1993) "Semantic interoperation via intelligent mediation," Workshop on Research Issues in Data Engineering, International Conference on Data Engineering.
Qian, X. and Lunt, T. (1995) "Semantic interoperation: a query mediation approach," SRI-CSL-94-02, Computer Science Laboratory, SRI International.
Qian, X. and Raschid, L. (1995) "Query interoperation among object-oriented and relational databases," Proceedings of the International Conference on Data Engineering.
Raschid, L. (1994) "Issues in supporting interoperable query processing with multiple heterogeneous information servers," Submitted to the Workshop on Information Technologies and Systems, International Conference on Information Systems.
Raschid, L. and Chang, Y. (1995) "A comparison of a parameterized canonical representation versus a higher order language in a heterogeneous environment," In preparation.
Raschid, L., Chang, Y. and Dorr, B. (1993) "Query mapping and transformation techniques for problem solving with multiple knowledge servers," Proceedings of the Second International Conference on Information and Knowledge Management (CIKM-93).
Raschid, L., Chang, Y. and Dorr, B.J. (1994) "Query transformation techniques for interoperable query processing in cooperative information systems," Proceedings of the Second International Conference on Cooperative Information Systems (CoopIS94), May 17-20.
Raschid, L., Florescu, D. and Valduriez, P. (1994) "Defining the search space for heterogeneous query optimization," In preparation.
Sheth, A. and Larson, J. (1990) "Federated database systems for managing distributed, heterogeneous, and autonomous databases," ACM Computing Surveys, 22(3), 183-236.
Vidal, M.E. and Raschid, L. (1995) "Interoperation between multiple F-logic schema using a KIF mediator," In preparation.
