
Enhancing the semantics of federated schemata by translating SQL-queries into object methods

Mark W.W. Vermeer, Peter M.G. Apers
fvermeer, [email protected]
University of Twente, Computer Science Department
P.O. Box 217, NL-7500 AE Enschede, The Netherlands
April 18, 1995

Abstract

Within the context of federated databases, considerable attention has been paid to the translation of relational export schemata of participating databases to a semantically richer common data model, such as the object-oriented model. Such a translation results in the definition of an object-oriented component schema for the local database. Each class definition in the component schema is a mapping of a set of attributes of one or more tables of the underlying relational schema. Component schemata are subsequently integrated into a federated schema. We argue that it is meaningful to extend the scope of the translation/integration process to include methods and constraints derived from applications on the underlying schemata. The component and federated schemata can thus be equipped with methods representing common queries on the underlying schemata and application-enforced constraints, enhancing the semantics of the schema, yielding a higher-level interface to federated applications, and supporting the possibility of filling the federated object schema with new or migrated data. We describe an algorithm translating SQL-statements on a relational schema which has been translated to an object structure, into corresponding methods on this structure, taking into account the additional semantics that have been acquired in the data structure translation phase. We use the TM object-oriented database specification language as the target DML of the translation. It is shown that a considerable simplification and increase of semantic expressiveness can be achieved in this translation. We are currently investigating the semantic expressiveness of the translation in the context of real-life applications within a large oil company.

1 Introduction

1.1 Federated databases

With the increasing functionality and availability of database technology, today most organisations have replaced application-based data management by database systems, thus integrating data management for common application domains. Typically, the information management within larger organisations is organised around a number of corporate databases, each containing data owned by a specific department or discipline within the organisation, and often managed by different database management systems. The current trends towards company-wide computing have stimulated demand for interoperability among these legacy systems, to support cooperation between various disciplines within the organisation. As legacy systems represent considerable value to an organisation, however, the current operation of these systems cannot be tampered with.

Federated database systems [18] have been proposed to achieve interoperability for pre-existing, heterogeneous, autonomous database systems. They offer the possibility of defining an integrated view of data available anywhere in the organisation, hiding differences in location and representation from the user. This view is called the federated schema and is usually expressed in a semantically powerful common data model, such as the functional, EER, or object-oriented one. The federation itself usually does not contain data, although in the long run data might be migrated from the participating local databases to a federation database, thus actually populating the federated schema.

1.2 Schema translation and integration

Defining a federated schema requires translation of the (usually relational) schema of every participating database into a so-called component schema [18]. A component schema offers a view of the local database expressed in terms of the common data model. Component schemata are subsequently integrated into a federated schema. This integration process benefits considerably from the semantic power of the common data model. The (semi-)automation of such schema translation has been the subject of a large amount of research, which we shall discuss in a subsequent section. At this point, we wish to stress that until today only the integration of data structures has been investigated in this context. The integration of operations and constraints has not yet been considered. It is our view, however, that equipping component and federated schemata with these features is meaningful for the following reasons:

- Specifying operations and constraints on the data structures enhances the semantics of the schema. Experience from industry shows that especially in non-administrative application domains, a considerable part of the semantics of the schema is buried in the applications defined on it. Constraints on the underlying schemata are usually hard-coded into the applications due to the inadequacy of the data model to express them. Providing semantically expressive views of the local databases is of utmost importance in federated database systems, as it eases the definition of a federated schema and subsequently aids the federated user in his search for relevant information.

- Existing applications on the underlying schema usually have a certain number of common operations on the schema. For example, consider a bank database with several applications for payment orders, cheques, interest, investments etc. All of these applications contain code for the transfer operation on accounts, checking the adequacy of the credit balance etc. Such operations are likely to be relevant to new, federated, applications as well. Offering a high-level interface to the integrated data structure, consisting of operations defined in existing applications, thus supports the reuse of code when developing federated applications.

- Whereas incorporation of operations and constraints is an attractive option in general, it is absolutely indispensable in case the federated object schema is gradually filled with either new data or data migrated from the component databases. In these cases the interface offered by the operations defined on the object schema can provide the necessary location transparency, allowing data to be manipulated regardless of the migration phase of individual objects.

[Figure 1 diagram; recoverable labels: Federative Schema, Component Schema, Local schema, Local Databases, Application]

Figure 1: Incorporating methods and constraints in the integration process

Figure 1 illustrates the proposed approach, embedded in the reference architecture for federated database systems by Sheth and Larson [18]. Our approach implies the use of the object-oriented data model as the global data model, enabling the specification of methods and constraints on the integrated data structure.

TM. We use the TM [2,9] object-oriented database specification language because of its high level of abstraction and its expressiveness. TM allows the specification of data structures in terms of Classes and Sorts (abstract data types without object identity) and the definition of constraints on this structure using first-order logic, and it provides a computationally complete, functional data manipulation language capable of expressing set-oriented queries. Methods on the schema are expressed in this DML.

1.3 Scope of the paper

In the construction of an object-oriented federated schema equipped with methods and constraints, the following steps can be identified:

1. Translate each local schema into the structural part of an object-oriented component schema.
2. Isolate from the application code operating on the local schema the part that performs the interaction with the database, i.e. the embedded SQL-statements, and translate them into methods on the component schema.
3. Extract additional methods and hard-coded constraints from the data processing part of the application code.
4. Integrate the structural part of the component schemata into the object structure of a federated schema, taking advantage of the additional semantics acquired in steps 2 and 3.
5. Integrate the methods of the component schemata into methods on the federated schema.

The focus of this paper is the second step of this list. We describe the translation of SQL-statements on relational tables to TM-methods on TM-classes and sorts. Translating part of the application code, integrating the resulting methods, and deriving constraints from applications will be the subject of subsequent research. Note that existing work on translating and integrating data structures (steps 1 and 4) is still relevant in our approach. In fact, the translation of SQL-queries substantially benefits from the extra semantic information acquired in the data structure translation phase.

Since the DML of TM has more expressive power than SQL, we could provide a direct translation for any SQL-query, simulating relational concepts on the object structure by using joins to establish relationships etc. We therefore do not aim at extensively discussing the translation of general SQL-queries, as this would be highly target-language dependent and the translation to a semantically stronger language is not interesting in general. In contrast, we focus on those SQL-constructs of which the translation can benefit from object-oriented concepts such as subclassing and object references available in the translated schema, resulting in simpler and semantically more expressive expressions.

1.4 Related work

Query translation has been an important topic in federated database systems. However, existing work [3,7,20] is devoted to the run-time translation of queries on the federated schema, expressed in the DML of the common data model, into queries on the underlying databases, expressed in the local DML. Whereas these approaches focus on implementing the data access operations specified at the federation level as efficiently as possible at the local level, it is our goal to provide a translation that reflects and clarifies the meaning of the queries specified at the local level to the federation. Processing efficiency is of no concern to us, as the actual data access is not taking place at the federation level.

To our knowledge, the only publication specifically addressing the translation of SQL-queries to an object-oriented DML is the work of Meng et al. [13]. The paper focuses on constructing a relational front end to object-oriented databases. SQL-queries defined on a relational view of the object schema are made operational in terms of the DML of the underlying object-oriented database. It thus belongs to the research area described above. Although there are some strong similarities between their translation algorithm and ours, which we developed independently, two important differences can be pointed out.

In [13], the relational schemata on which the SQL-queries are formulated are the result of translating the underlying object schemata to the relational model. As a result of this, these schemata are highly standardized and the mapping from tables to classes is quite straightforward. As the schema translation was performed from the object-oriented model to the semantically weaker relational model, no additional semantics have been acquired in the translation process. In contrast, we assume that the object schema is a result of the translation from an existing relational schema to an object-oriented model, which requires additional semantic information to be input during the translation process, resulting in a non-trivial mapping between tables and classes. We show that the translation of SQL-queries benefits considerably from these extra semantics.

Since Meng et al. focus on the operational aspects of query translation, they do not discuss the translation of nested queries, as there exist unnesting algorithms [12,14] that transform nested queries into an equivalent collection of simple queries. We show that this approach would be inadequate for our purposes, as it would obscure the meaning of the original query.

Existing work on schema translation is of course still relevant in our approach, as we assume the underlying relational schema has been translated into an object schema. Although we abstract from the specific translation method used, we do assume a method sufficiently powerful for our purposes is used. In the context of federated databases, research on schema translation towards a semantically powerful data model has been conducted by, among others, Saltor and Castellanos. [4] contains a quite exhaustive treatment which uses the object-oriented data model BLOOM as its target model. The Pegasus project at Hewlett-Packard Laboratories uses a functional object model as its common data model. [1] describes the translation method used for relational schemata. The latter approach requires less semantic input, resulting in an output schema that bears more resemblance to the original relational schema; for example, so-called missing entities are not identified.

Translation of a relational schema to a semantically richer data model has been the subject of more generally oriented research as well. Examples include the reverse engineering methods of Navathe [15] and Davis [6], which extract the underlying ER-models of relational schemata, covering isa-relations but not missing entities, and, more recently, the powerful methods described by Johannesson [10], an elaborated formal approach, and Chiang et al. [5]. Although the methods themselves are somewhat different, the output schemata of the more semantically powerful methods ([4], [10] and [5]) for a given relational schema are roughly equivalent.

Occasionally, research conducted into other database topics is relevant to the present subject as well. The translation of SQL to a query on an object-oriented schema bears resemblance to the translation of an SQL-query on a 1NF conceptual schema to a query on the nested relations of an internal NF²-schema [17]. However, an object-oriented schema also incorporates notions such as subclassing and weak entities, which require additional techniques, demonstrated in this paper. There are also some similarities to the translation from SQL to CODASYL-based retrieval techniques [16].

Also relevant is past work on data migration, in particular the framework for database and application conversion described by Su et al. [19]. They advocate a structural approach towards application conversion in the face of semantic database changes, which isolates queries from the application code, analyzes the access patterns they define, and transforms these according to the mapping information extracted from the schema transformation process. Our approach neatly fits into this framework, as shall become apparent from the subsequent discussion.

1.5 Overview

The remainder of this paper is organized as follows. We first summarize results concerning data structure translation in Section 2 and describe the schema mapping information needed in the query translation process. In Section 3 we describe two translations of an SQL-query, illustrating the idea of semantic expressiveness of TM-methods. Next, in Section 4 we describe how SQL-queries and object schemata can be matched, resulting in the definition of the object structure described by the query, which can subsequently be used to determine the class or classes to which the translated methods should be assigned. Finally, Section 5 describes the translation of relevant SQL-queries to TM-methods. Section 6 presents a brief summary of the translation algorithm.

    EMPLOYEE(ssn, name, bdate, address, sex, salary, superssn, dno)
    DEPARTMENT(dnumber, dname, mgrssn, mgrstartdate)
    DEPENDENT(essn, dependent-name, sex, bdate, relationship)
    PROJECT(pnumber, pname, plocation, dnum)
    WORKS-ON(wssn, pno)

    EMPLOYEE.superssn  ⊆  EMPLOYEE.ssn
    EMPLOYEE.dno       ⊆  DEPARTMENT.dnumber
    DEPARTMENT.mgrssn  ⊆  EMPLOYEE.ssn
    DEPENDENT.essn     ⊆  EMPLOYEE.ssn
    PROJECT.dnum       ⊆  DEPARTMENT.dnumber
    WORKS-ON.wssn      ⊆  EMPLOYEE.ssn
    WORKS-ON.pno       ⊆  PROJECT.pnumber

Figure 2: Example relational schema with inclusion dependencies

2 Schema translation

Our query translation method assumes that the relational schema on which the input query is defined has been translated into an object schema. Several methods for the (semi-)automatic translation of a relational schema to a semantically richer model have been described. These methods differ considerably in the semantics they add to the original schema. Obviously, a trivial translation is possible, simply modelling each table as a class. However, all methods try to exploit the modelling power of the semantic model used to obtain a more meaningful schema. We discuss characteristics of such methods and requirements we place on their output schema.

2.1 Existing methodologies

We assume a translation procedure is used which results in schemata with a semantic expressiveness comparable to that of [4] or [10]. Such methods are capable of inferring a class hierarchy, and detect so-called missing entities which were implicit in the original schema. Required input information usually consists of table definitions in 3NF, information on keys, and inclusion dependencies between attribute values of different tables. Figure 2 shows an example input to a schema translation algorithm. Application of any of these methods results in an object schema, where each of the classes is a mapping of a number of attributes from one or many tables. Figure 3 shows a possible translated schema, expressed in TM (the P-symbol represents a set-valued attribute). We assume a Sort exists for each relational datatype which is not a basic type in TM, such as Date.

    Class Employee
    attributes
      ssn        : string          name       : string
      bdate      : Date            address    : string
      sex        : string          salary     : real
      employee   : Employee        department : Department
      dependents : PDependent      projects   : PProject
    object constraints
      oc1: forall x in projects | self in x.employees
    end Employee

    Class Department
    attributes
      dnumber : integer            dname      : string
      manager : Manager
    object constraints
      oc1: manager.department = self
    end Department

    Class Manager isa Employee
    attributes
      mgrstartdate : Date          department : Department
    object constraints
      oc1: department.manager = self
    end Manager

    Class Project
    attributes
      pnumber   : integer          pname       : string
      plocation : string           mdepartment : Department
      employees : PEmployee
    object constraints
      oc1: forall x in employees | self in x.projects
    end Project

    Sort Dependent
    type ⟨ dependent-name : string, sex : string,
           relationship   : string, bdate : Date ⟩
    end Dependent

Figure 3: Translation of the example relational schema

In general, tables can be translated to classes in various ways. To translate queries on these tables, we need information on the mapping of tables to classes induced by the translation. The following mappings between tables and classes are possible:

1:1. A table gives rise to exactly one class. There is an exact correspondence between the attributes, except for table attributes holding foreign key values (attribute sets appearing as the rhs of some inclusion dependency). These are replaced by attributes referring to other objects. Such tables represent a primary entity or an m:n relationship with attributes.

0:1. A table representing an m:n relationship between other tables is usually mapped to a pair of set-valued object references between the related classes if it consists of key attributes only. This is the case for the WORKS-ON relation in the example. Thus such a table does not give rise to a class being defined. Actually, this is more properly an m:n mapping, as we may regard both related classes (Employee and Project) to include the connecting table.

1:n. A class corresponds to a merge of the attributes of several tables. This is the case when an entity is represented by a number of tables, for example to eliminate repeating groups. We also consider weak entities as a repeating group and represent them as a set of instances of a complex datatype, a Sort in TM, associated with the class.

This is illustrated in the example, where the Sort Dependent appears as a set-valued attribute in Employee. This translation stresses the dependence of a weak entity on its owner. Note that we could also have chosen to implement a weak entity as a class.

n:1. A table has been vertically partitioned into multiple classes. That is, so-called missing entities are detected in a table. Examples of circumstances indicating missing entities are an "extraneous" identifier (a secondary key which is not just an alternative identifier) or an inclusion dependency on an attribute set which is not a key. The example shows a table DEPARTMENT containing a missing entity Manager, as mgrssn is an extraneous secondary key. Note that although in this case there is a 1:1 relationship between classes stemming from the same table, this need not be the case in general [4]. To our knowledge, there are no methods that use horizontal partitioning to indicate missing entities, i.e. where tuples belong to one of several classes according to their values.

m:n. The cases above can be combined into more complicated relationships between objects and tables. For example, the missing entities detected in different tables may be identical. Thus each table represents multiple classes, while the missing entity is represented by multiple tables.

isa-relationships between classes also indicate a non-trivial relationship between objects and tables. When a class $C_1$ which has been inferred from a table $T_1$ is found to be a subclass of $C_2$ from $T_2$, $C_1$ is in fact the representation of the natural join of $T_1$ and $T_2$ on their common key, where all tuples from $T_1$ participate in the join. The common key attributes of $T_1$ and $T_2$ need not be repeated in $C_1$, whereas other common attributes are assumed to be overriding definitions. In the example, the missing entity Manager is found to be a subclass of Employee due to the inclusion dependency of their keys.

The schema translation thus results in a definition of an object schema where each class $C_k$ can be seen as a representation of a natural join on the common key attributes of a number of table fragments $\pi_{a_{i1} \ldots a_{in}}(T_i)$. We say that $C_k$ is an image of each of the tables $T_i$. The attributes of $C_k$ are formed by the union of all $a_{ij}$ of the projections not contained in the rhs of any inclusion dependencies, leaving out the key attributes if $C_k$ is involved in an isa-relation with another class $C_l$. An object referencing attribute is added instead for each rhs which is contained in the attributes of $C_k$.
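As a rough illustration of the statement that a subclass represents a natural join of table fragments on their common key, the following Python sketch rebuilds Manager instances from toy EMPLOYEE and DEPARTMENT rows. The sample rows and the helper name `manager_objects` are assumptions made for illustration only; they do not come from any of the schema translation methods cited above.

```python
# Toy rows for two tables of the example schema (all values are made up).
EMPLOYEE = [
    {"ssn": "1", "name": "Smith", "dno": 5},
    {"ssn": "2", "name": "Wong", "dno": 5},
    {"ssn": "3", "name": "Zelaya", "dno": 4},
]
DEPARTMENT = [
    {"dnumber": 5, "dname": "Research", "mgrssn": "2", "mgrstartdate": "1988-05-22"},
    {"dnumber": 4, "dname": "Admin", "mgrssn": "3", "mgrstartdate": "1995-01-01"},
]


def manager_objects(employees, departments):
    """Join the DEPARTMENT fragment (dnumber, mgrssn, mgrstartdate) with
    EMPLOYEE on mgrssn = ssn; every fragment tuple takes part in the join."""
    by_ssn = {e["ssn"]: e for e in employees}
    managers = []
    for d in departments:
        inherited = by_ssn[d["mgrssn"]]       # exists by the inclusion dependency
        obj = dict(inherited)                 # attributes inherited from Employee
        obj["mgrstartdate"] = d["mgrstartdate"]
        obj["department"] = d["dnumber"]      # stands in for an object reference
        managers.append(obj)
    return managers


managers = manager_objects(EMPLOYEE, DEPARTMENT)
print(sorted(m["name"] for m in managers))
```

The key attribute mgrssn is not copied into the result, mirroring the remark that common key attributes need not be repeated in the subclass.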

2.2 Requirements on the object schema

Although we abstract from the translation method used, we pose some requirements on the object schema to be able to perform query translation. Let $C_k$ be a class as described above. Let $T$ be a table such that both $C_k$ and some other class $C_l$ are images of $T$. We require that $C_k$ and $C_l$ contain attributes referencing each other, as SQL-queries will treat object pairs from $C_k$ and $C_l$ as one occurrence of a tuple of $T$. If one of the classes, say $C_l$, does not contain a key for $T$, the object reference from $C_l$ to $C_k$ must be set-valued, as tuples representing multiple instances of $C_k$ may represent the same $C_l$-object (i.e. there exists a 1:n relationship between $C_l$ and $C_k$). Some methods will add these references automatically, as they reflect trivial inclusion dependencies on the relational schema.

Annotations. We further require the annotation of the class definition of $C_k$ with:

- the set of attributes $a_{ij}$ from tables $T_{k_1}, \ldots, T_{k_m}$ the class stems from;
- the subset of these attributes that form a key for the class;
- for each object reference within the class, the (set of) foreign key attribute(s) or connecting table it is based on.

We thus obtain information on the mapping from tables to classes, and the inclusion dependencies object references were inferred from. Such information can very well be represented using a meta-schema, as in [11] and [20]. For the sake of the presentation here, we use the predicates Image(className, tableName, attributeSet), Key(className, attributeSet) and DenotedBy(objectReferencingAttribute, foreignKeySet), respectively. In the following we will also write $C_k = \mathit{Image}(T_{k_i})$ instead of Image($C_k$, $T_{k_i}$, attributeSet). Figure 4 shows the annotations necessary for the example. Note that the foreign key essn in DEPENDENT is not translated to an object reference in the Sort Dependent. This foreign key determines which Employee object a Dependent value is part of. This is indicated using the keyword OWNER.

    Image(Employee, EMPLOYEE, {*})
    Key(Employee, {ssn})
    DenotedBy(employee, {superssn})
    DenotedBy(department, {dno})
    DenotedBy(projects, {WORKS-ON.pno})

    Image(Department, DEPARTMENT, {dnumber, dname, mgrssn})
    Key(Department, {dnumber})
    DenotedBy(manager, {mgrssn})

    Image(Manager, DEPARTMENT, {dnumber, mgrssn, mgrstartdate})
    Key(Manager, {mgrssn})
    DenotedBy(department, {dnumber})

    Image(Dependent, DEPENDENT, {*})
    Key(Dependent, {essn, dependent-name})
    DenotedBy(OWNER, {essn})

    Image(Project, PROJECT, {*})
    Key(Project, {pnumber})
    DenotedBy(employees, {WORKS-ON.wssn})

Figure 4: The annotations for the example TM schema

The input to the query translation algorithm thus consists of an object schema, obtained by applying a schema translation method to a relational schema, and some annotations on the schema mapping induced by the translation. The next section discusses the desired output of such an algorithm.
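To make the role of the annotations more concrete, the sketch below encodes a few of the Image, Key and DenotedBy clauses as plain Python records and answers the question "which classes are images of a given table?". The record layout, the extra `class_name` field on `DenotedBy`, and the helper `images_of` are assumptions chosen for illustration; the paper itself only suggests that a meta-schema could hold this information.

```python
from typing import FrozenSet, NamedTuple


class Image(NamedTuple):
    class_name: str
    table_name: str
    attributes: FrozenSet[str]    # {"*"} stands for all attributes of the table


class Key(NamedTuple):
    class_name: str
    attributes: FrozenSet[str]


class DenotedBy(NamedTuple):
    class_name: str               # owning class; added here to disambiguate
    reference: str
    foreign_key: FrozenSet[str]


# A subset of the annotations for the example schema.
ANNOTATIONS = [
    Image("Employee", "EMPLOYEE", frozenset({"*"})),
    Key("Employee", frozenset({"ssn"})),
    DenotedBy("Employee", "employee", frozenset({"superssn"})),
    DenotedBy("Employee", "department", frozenset({"dno"})),
    Image("Department", "DEPARTMENT", frozenset({"dnumber", "dname", "mgrssn"})),
    Key("Department", frozenset({"dnumber"})),
    DenotedBy("Department", "manager", frozenset({"mgrssn"})),
    Image("Manager", "DEPARTMENT", frozenset({"dnumber", "mgrssn", "mgrstartdate"})),
    Key("Manager", frozenset({"mgrssn"})),
    DenotedBy("Manager", "department", frozenset({"dnumber"})),
]


def images_of(table_name, annotations):
    """Return the names of all classes that are images of the given table."""
    return sorted(
        a.class_name
        for a in annotations
        if isinstance(a, Image) and a.table_name == table_name
    )


print(images_of("DEPARTMENT", ANNOTATIONS))
```

A table with more than one image, such as DEPARTMENT here, signals a vertically partitioned table, which is exactly the case where the mutual object references required above come into play.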


    SELECT name
    FROM EMPLOYEE
    WHERE NOT EXISTS (SELECT *
                      FROM DEPENDENT
                      WHERE ssn=essn AND sex="F")
      AND EXISTS (SELECT *
                  FROM DEPARTMENT
                  WHERE ssn=mgrssn)

Figure 5: Example query

3 Semantic expressiveness of query translations

Assuming a translated schema, annotated as above, we now turn to the translation of SQL-queries on the original relational schema to methods on the object schema. We assume the existence of a set-oriented method specification language that has equivalents for all SQL primitives. The DML of TM is such a language. Its most important statement in this context, which we will use for translating the SQL SELECT-statement, is the following:

    collect e1 for var in e2 [iff Pred]

where e2 is a set or list expression, var is a variable ranging over e2, e1 is an expression of arbitrary type (typically involving var), and Pred represents an optional condition. Such a DML allows the translation of any SQL statement in an ad-hoc way, i.e. without exploiting any of the additional semantics captured by the object schema. In contrast, our translation method uses these additional semantics to achieve a more meaningful query translation. In particular, we exploit the class hierarchy and object references of the object model to eliminate joins, sometimes leading to the elimination of complete subqueries.
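For readers unfamiliar with TM, the collect statement behaves much like a filtered comprehension. The following Python sketch mimics it under that assumption; the function name `collect`, its parameter order and the sample data are illustrative, and a list is returned where TM would yield a set or list depending on e2.

```python
def collect(e1, e2, pred=None):
    """Evaluate e1 for every element of e2 that satisfies the optional pred."""
    return [e1(var) for var in e2 if pred is None or pred(var)]


# Made-up sample data standing in for a class extension.
employees = [
    {"name": "Smith", "salary": 30000.0},
    {"name": "Wong", "salary": 40000.0},
]

# Roughly: collect x.name for x in employees iff x.salary > 35000
names = collect(lambda x: x["name"], employees, lambda x: x["salary"] > 35000.0)
print(names)
```

Omitting the third argument corresponds to leaving out the optional iff-clause, in which case every element of e2 contributes to the result.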

Example 3.1. Consider the query of Figure 5. Note that the intended meaning of the query, "List the names of managers not having a female dependent", is far from obvious in the relational schema. Note also how the definition of an employee which is also a manager, the latter being an entity that remains implicit in the relational schema, is expressed by the second subquery. Such a query could in principle be translated in an ad-hoc manner, as is shown in the first translation of Figure 6. Note that the extra semantics of the object schema ensure the increased semantic expressiveness even when the query translation is performed in a very direct manner, as analogous to the structure of the original SQL-query as possible. For example, due to the translation of the DEPENDENT-table as a repeating group (set-valued Sort attribute) in Employee, we have to access the dependents by the set-valued attribute in Employee. However, the join in the second subquery is still explicitly performed. The second translation is much more in accordance with the semantics of the query and the object model. Note how we use object references instead of joins, and the subquery identifying employees that are also managers is made irrelevant by simply defining the method on members of class Manager. It is this kind of translation that we strive for in this paper.


    database retrieval method Trans1(out P⟨Name : string⟩) =
      collect ⟨Name = x.name⟩
      for x in employees
      iff not exists y in x.dependents | y.sex="F"
          and exists z in departments | z.manager.mgrssn = x.essn

    class retrieval method for Manager
    Trans2(out P⟨Name : string⟩) =
      collect ⟨Name = x.name⟩
      for x in self
      iff not exists y in x.dependents | y.sex="F"

Figure 6: Two translations of the example query
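The difference between the two translations of Figure 6 can be imitated on in-memory objects. The Python sketch below is only an analogy: the dataclasses, the sample instances and the function names `trans1` and `trans2` are assumptions, and the explicit join in `trans1` compares `ssn` values rather than the attribute names used in the figure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dependent:
    dependent_name: str
    sex: str


@dataclass
class Employee:
    ssn: str
    name: str
    dependents: List[Dependent] = field(default_factory=list)


@dataclass
class Manager(Employee):
    mgrstartdate: str = ""


@dataclass
class Department:
    dnumber: int
    manager: Optional[Manager] = None


def trans1(employees, departments):
    """Ad-hoc translation: the join with departments is still explicit."""
    return {
        x.name
        for x in employees
        if not any(y.sex == "F" for y in x.dependents)
        and any(z.manager is not None and z.manager.ssn == x.ssn for z in departments)
    }


def trans2(managers):
    """Semantic translation: defined on Manager, so no join is needed."""
    return {x.name for x in managers if not any(y.sex == "F" for y in x.dependents)}


# Made-up instances: Wong has a female dependent, Zelaya does not.
wong = Manager("2", "Wong", [Dependent("Joy", "F")], "1988-05-22")
zelaya = Manager("3", "Zelaya", [Dependent("Abner", "M")], "1995-01-01")
smith = Employee("1", "Smith")
employees = [smith, wong, zelaya]
departments = [Department(5, wong), Department(4, zelaya)]
managers = [e for e in employees if isinstance(e, Manager)]

print(trans1(employees, departments), trans2(managers))
```

Both functions select the same names on this data, but only `trans2` states the intent of the query directly, since membership of the Manager class replaces the second subquery.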

Translating nested queries. In [13], Meng et al. do not consider the translation of nested queries explicitly. Existing unnesting algorithms [8,12,14] are relied upon to first flatten nested queries; subsequently a general translation procedure for unnested queries is applied. In our context, such a procedure would be unsatisfactory. These algorithms in general result in collections of simple queries, with temporary relations holding intermediate results. In other words, they were not devised with the purpose of returning queries with intuitively clear semantics. We therefore deal with nested queries separately, using equivalent TM constructs for SQL-subqueries, aggregates and quantifiers, as in Figure 6.

Sketch of the translation algorithm. As shown in the example, a good query translation should be based on a comparison of the semantics of the original query to the object model. In the translation of the SQL-query we can then use object references and the class hierarchy, which both can be seen as the materialization of certain joins in the relational schema. We therefore formalize the notion of joins materialized in an object schema by defining a join materialization graph (JMG) for any object schema. The join structure of an SQL-query Q is represented by a query join graph (QJG). This QJG can then be matched with the JMG, to obtain a subgraph of the JMG, called the matching object graph (MOG), containing those parts of the object structure addressed by Q. Each connected subgraph in a MOG can be seen as a complex object definition (COD) by Q. Selections on these complex objects are implemented by methods on their root class, from which attributes involved in selection conditions can be accessed by following the appropriate object references. The result of Q is then defined as a selection on the cartesian product of these complex object definitions.

The translation is complicated by the fact that object references are directed, whereas joins are essentially bi-directional. Another complication is the non-trivial mapping that may exist between tables and objects, as demonstrated in the example schema translation. The next section discusses how the object structure described by an SQL-query can be determined, resulting in the identification of the set of classes on which the translated query should be defined.


4 Identifying complex object definitions in a query

In this section, we describe how the semantic structure of object schemata and SQL-queries can be formalized and subsequently be matched, resulting in the set of complex object definitions described by the query.

4.1 Representing object schemata and SQL-queries

For an object schema resulting from applying a translation algorithm to an underlying relational schema, we describe a representation formalism called the join materialization graph (JMG). A JMG describes the structure of the object schema, consisting of the class hierarchy and the delegation structure (the set of object referencing attributes defined in a schema). It further relates each structural relationship in the object schema to a pair of attributes from the underlying relational schema, on which a join must be performed to calculate the corresponding relationship for the relational schema.

The idea is that an object reference $r$ from a class $C_1$ imaging a table $T_1$ to a class $C_2$ imaging a table $T_2$ with key $K$, where $r$ is based on the foreign key $F$ occurring in $T_1$, materializes the join $T_1 \bowtie_{F=K} T_2$. A pair of set-valued object references between two classes, implementing an m:n relationship between the underlying tables which was implemented by a connecting table $R$ in the relational schema, is then the materialization of any join involving $R$. Furthermore, the inheritance semantics of TM ensure that a class $C_1$ based on a table $T_1$ with key $K_1$ which has an isa-relationship to a class $C_2$ based on a table $T_2$ with key $K_2$ is a materialization of the join $T_1 \bowtie_{K_1=K_2} T_2$. Lastly, a pair of object references between two classes inferred from the same underlying table can be seen as the representation of a join that was already materialized in the relational schema.

Note that although the join-operator is symmetric, a materialized join has a direction. A join need not be materialized in both directions. When we say that the join $R \bowtie S$ is materialized, we will assume that there is an object reference from the class imaging $R$ to the class imaging $S$, but not necessarily one in the opposite direction. However, some object references, such as those implementing an m:n relationship, necessarily occur in pairs.

A JMG thus consists of a node for each schema entity, and a labeled, directed edge for each equijoin materialized in the object schema, with labels describing the attributes occurring in the join condition, and edges pointing from a class containing an object referencing attribute to the referenced class. Pairs of object references are represented by bidirectional edges.

Definition 4.1 : Join Materialization Graph (JMG) Given an object schema O, derived by applying a transformation procedure to an underlying relational schema R, and a set of annotation clauses A on the mapping between R and O, a join materialization graph $JMG = \langle V, E \rangle$ is defined, consisting of

- a set of vertices V, where each vertex v represents a class or sort from O. In the following, we will write "class $v_1$" as short for "the class or sort represented by the node $v_1$".
- a set of labeled, directed edges $E = \{x \mid x = (v_1, v_2, L),\ v_1, v_2 \in V\} = E_{fk} \cup E_{tb} \cup E_{me} \cup E_{isa}$, where:
  - An $E_{fk}$-edge with label $\langle fkey, key \rangle$ is added for each object reference r from $v_1$ to $v_2$ where DenotedBy(r, fkey) and Key($v_2$, key) are in A.
  - A bidirectional $E_{tb}$-edge with label T, where T is a table name occurring in R, is added for each pair of set-valued object references r, s between two classes $v_1$ and $v_2$ where DenotedBy(r, $T.a_i$) and DenotedBy(s, $T.a_j$) are in A. T thus implements the m:n relationship between $v_1$ and $v_2$ expressed by r and s.
  - A bidirectional $E_{me}$-edge is added for a pair of object references r, s between two vertices $v_1$ and $v_2$ where Image($v_1$, T) and Image($v_2$, T) are in A (i.e. at least one of $v_1$, $v_2$ represents a missing entity detected in the table T). The pair r, s was added according to the requirements listed in Section 2.
  - An $E_{isa}$-edge with label $\langle ksub, ksuper \rangle$ is added between each pair of classes $v_1$ and $v_2$, where O contains the declaration $v_1$ isa $v_2$ and Key($v_1$, ksub) and Key($v_2$, ksuper) are in A.

Figure 7: The join materialization graph for the example (nodes: Project, Employee, Department, Dependent, Manager; edge labels: WORKS-ON; dnum,dnumber; superssn,ssn; dno,dnumber; essn,ssn; mgrssn,ssn; edge types: Efk, Eisa, Eme, Etb)
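To make the structure of Definition 4.1 concrete, the following Python sketch stores a JMG as a list of typed, labeled edges. It is an illustration only: the names `Edge`, `JMG_EDGES` and `neighbours` are hypothetical, and the edge endpoints and directions are an assumed reading of Figure 7 (in particular, Dependent is assumed to be referenced from Employee).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    kind: str          # "fk", "tb", "me" or "isa"
    src: str           # class holding the reference (subclass for "isa")
    dst: str           # referenced class (superclass for "isa")
    label: tuple = ()  # (fkey, key), (table,), () or (ksub, ksuper)

    @property
    def bidirectional(self):
        # tb- and me-edges stand for pairs of object references
        return self.kind in ("tb", "me")


# Assumed reading of Figure 7; endpoints and directions are not given in the text.
JMG_EDGES = [
    Edge("fk", "Employee", "Employee", ("superssn", "ssn")),
    Edge("fk", "Employee", "Department", ("dno", "dnumber")),
    Edge("fk", "Project", "Department", ("dnum", "dnumber")),
    Edge("fk", "Employee", "Dependent", ("essn", "ssn")),
    Edge("tb", "Employee", "Project", ("WORKS-ON",)),
    Edge("isa", "Manager", "Employee", ("mgrssn", "ssn")),
    Edge("me", "Manager", "Department"),
]


def neighbours(node, edges=JMG_EDGES):
    """Classes reachable from `node` in one step, respecting edge direction."""
    out = set()
    for e in edges:
        if e.src == node:
            out.add(e.dst)
        if e.bidirectional and e.dst == node:
            out.add(e.src)
    return out
```

Bidirectional edges are stored once and traversed in both directions, which mirrors the remark that Etb- and Eme-edges inherently occur in pairs.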

Figure 7 shows the JMG for the example object schema. Notice the isa-edge from Manager to Employee, and the me-edge between Manager and Department due to their common origin: the DEPARTMENT-table. Notice also that we represent $E_{tb}$- and $E_{me}$-edges as bidirectional, as they inherently occur in pairs, and we thus may travel them in any direction.

Following Dayal in [8], we define a comparable representation formalism for SQL-queries, including queries involving the nesting predicates IN, NOT IN, EXISTS and NOT EXISTS. Dayal shows that IN and EXISTS conditions can be represented by semijoins, whereas NOT IN and NOT EXISTS conditions correspond to a special kind of antijoins. Such antijoins should always be computed after the computation of the regular joins. In the remainder of this paper, we shall restrict our discussion of antijoins to these cases. We assume conjunctive queries; the extension to disjunctive queries is analogous to [8]. We thus assume that the conditions of Q can be described as $jc \wedge sjc \wedge sc \wedge ajc$, where jc is a conjunction of join conditions, sjc is a conjunction of semijoin conditions, sc is a conjunction of selection conditions of the form R.A Op Const, and ajc is a conjunction of the special antijoin conditions as described.

Definition 4.2 : Query Join Graph (QJG) With an SQL-query Q, a query join graph $QJG = \langle V, E \rangle$ is associated, consisting of

- a set of vertices V, where each vertex v represents a tuple variable occurring in Q.
- a set of edges $E = E_{jn} \cup E_{sj} \cup E_{aj}$, where
  - An undirected $E_{jn}$-edge with annotation $\langle A, B \rangle$ is added between nodes $v_1$ and $v_2$ for each remaining condition R.A=S.B of Q (i.e. each such condition not covered by the two cases below), where $v_1$ and $v_2$ represent R and S, respectively.
  - A directed $E_{sj}$-edge with annotation $\langle A, B \rangle$ is added from node $v_1$ to $v_2$ for each condition of the form R.A IN SELECT S.B FROM... WHERE... or R.A=S.B, where S was declared in the FROM clause of a query block within the scope of an EXISTS quantifier, and R was declared outside of this scope, where $v_1$ and $v_2$ represent R and S, respectively.
  - A directed $E_{aj}$-edge from $v_1$ to $v_2$ with annotation $\langle A, B \rangle$ is added for each condition of the form R.A NOT IN SELECT S.B FROM... WHERE... or R.A=S.B, where S was declared in the FROM clause of a query block within the scope of a NOT EXISTS quantifier, and R was declared outside of this scope, where $v_1$ and $v_2$ represent R and S, respectively.

Figure 8: The query join graph for the example query (nodes: DEPENDENT, EMPLOYEE, DEPARTMENT; edge labels: ssn,essn; ssn,mgrssn; edge types: Ejn, Eaj, Esj)
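The edge rules of Definition 4.2 can be sketched in Python as follows. The sketch assumes that the query has already been parsed into equality conditions, each tagged with the nesting construct under which it was found; this input format and the names `QEdge`, `KIND` and `build_qjg` are hypothetical and only serve the illustration.

```python
from collections import namedtuple

# kind is "jn" (join), "sj" (semijoin) or "aj" (antijoin)
QEdge = namedtuple("QEdge", "kind src dst label")

# Nesting construct under which a condition R.A = S.B occurs -> edge kind.
KIND = {None: "jn", "IN": "sj", "EXISTS": "sj",
        "NOT IN": "aj", "NOT EXISTS": "aj"}


def build_qjg(tuple_vars, conditions):
    """conditions: iterable of (R, A, S, B, nesting). For nested conditions,
    R is declared outside and S inside the subquery; "jn" edges are stored
    with an orientation but are meant to be read as undirected."""
    vertices = set(tuple_vars)
    edges = [QEdge(KIND[nesting], r, s, (a, b))
             for r, a, s, b, nesting in conditions]
    return vertices, edges


# The query of the form R.A NOT IN SELECT S.B FROM S ... yields one aj-edge.
antijoin_vertices, antijoin_edges = build_qjg(
    ["R", "S"], [("R", "A", "S", "B", "NOT IN")])
```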

A QJG G covers the entire query Q. If Q contains i subqueries, however, we may distinguish i + 1 subgraphs in G: $G_{out}$ and $G_{in_j}$, $1 \le j \le i$, corresponding to the outer query block and the inner query blocks, respectively. Note that we consider nesting of depth 1 only here; the extension to nesting of general depth is quite straightforward. Figure 8 shows the QJG associated with the example query from Figure 5. The example QJG is connected, and the subqueries are correlated.

4.2 Matching SQL-queries to object schemata

This subsection discusses how the representations of the relational query Q and the object schema O can be matched to obtain the set of so-called complex object definitions described by Q, which will form the basis for our query translation procedure.

4.2.1 Tuple variables and the JMG

To perform such a matching, the focus of both representations must be brought into accordance. Note that where SQL-queries describe relevant tuples, i.e. instances of concepts described by a table, an object schema describes classes, i.e. the concepts themselves. As a consequence, Q may contain multiple tuple variables referring to the same table. To match the QJG to the JMG, we therefore consider JMG-nodes to represent class instance variables rather than classes themselves, where multiple class instance variables may be associated with a particular class. This motivates the following definition:

Definition 4.3 : Expanding a JMG Given an SQL-query Q and a join materialization graph G of an object schema O representing the translation of a relational schema R, we define the expansion $JMG_Q$ of a JMG for Q as the following transformation of G: For each table $T \in R$ that is the scope of i tuple instance variables of Q, $i > 1$, replicate the set $S_V = \{v \mid v \in V_{JMG},\ v \text{ represents } C,\ Image(C, T, \ldots)\}$ and the set $S_E = \{e = (v_1, v_2) \mid e \in E_{JMG},\ v_1 \in S_V \vee v_2 \in S_V\}$ i times in $JMG_Q$. Subsequently replace all replicates of edges $e = (v, v) \in S_E$ (i.e. the self-loops in $S_E$) by a set of edges completely connecting all replicates of v.

Figure 9: An example JMG-expansion (nodes: Project, Employee 1, Employee 2, Department, Dependent, Manager; edge labels: WORKS-ON (twice); dnum,dnumber; superssn,ssn; dno,dnumber (twice); essn,ssn (twice); mgrssn,ssn (twice); edge types: Efk, Eisa, Eme, Etb)

Example 4.1 Consider the following query on the relational schema of Figure 2.

SELECT X.name, Y.name
FROM EMPLOYEE X, EMPLOYEE Y, DEPARTMENT
WHERE X.ssn=Y.superssn AND X.dno=dnumber AND dname="R&D"

Figure 9 shows the $JMG_Q$ obtained from the JMG of Figure 7 for this query.

Note that if Q does not contain multiple tuple variables referencing the same table, then $JMG_Q = JMG$. To keep formulations simple, in the remainder of this section we will assume that each tuple variable of Q addresses a unique table, and thus $JMG_Q = JMG$. We thus speak of tables instead of tuple variables and classes instead of class instance variables. The ideas discussed extend trivially to the general case (see Example 4.7).

Note that the matching with $JMG_Q$ is performed separately for each connected subgraph in the QJG. In particular, if a QJG is composed of n connected subgraphs that have no connecting edges between one another, the matching object graph (MOG, see Definition 4.9) resulting from the matching should have at least n such subgraphs as well. If object structures matching distinct QJG-subgraphs should overlap, classes appearing in multiple matchings should be represented by multiple class instance variables in the translated query, each corresponding to the role the class plays in the translation of a particular QJG-subgraph.
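The expansion of Definition 4.3 can be sketched in Python as below. The function `expand_jmg` and its input format are hypothetical. Two points are assumptions rather than statements of the definition: the replicas of a self-loop are connected in both directions, and an edge between two distinct classes is copied for every combination of their replicas.

```python
from itertools import permutations


def expand_jmg(nodes, edges, var_count):
    """nodes: {class_name: imaged_table}; edges: list of (src, dst, label);
    var_count: {table: number of tuple variables of Q ranging over it}."""
    def replicas(v):
        n = var_count.get(nodes[v], 1)
        return [v] if n <= 1 else [f"{v} {k}" for k in range(1, n + 1)]

    new_nodes = {r: t for v, t in nodes.items() for r in replicas(v)}
    new_edges = []
    for src, dst, label in edges:
        if src == dst:
            reps = replicas(src)
            if len(reps) == 1:
                new_edges.append((src, dst, label))  # untouched self-loop
            else:
                # self-loop: completely connect all replicas (both directions)
                new_edges += [(a, b, label) for a, b in permutations(reps, 2)]
        else:
            # assumption: link every replica of src to every replica of dst
            new_edges += [(a, b, label)
                          for a in replicas(src) for b in replicas(dst)]
    return new_nodes, new_edges


# Fragment of the example schema; EMPLOYEE has two tuple variables (X and Y).
example_nodes = {"Employee": "EMPLOYEE", "Department": "DEPARTMENT",
                 "Manager": "DEPARTMENT"}
example_edges = [("Employee", "Employee", ("superssn", "ssn")),
                 ("Employee", "Department", ("dno", "dnumber")),
                 ("Manager", "Employee", ("mgrssn", "ssn"))]
expanded_nodes, expanded_edges = expand_jmg(
    example_nodes, example_edges, {"EMPLOYEE": 2})
```

For this fragment the sketch produces the class instance variables Employee 1 and Employee 2, with the dno,dnumber and mgrssn,ssn edges duplicated, which agrees with the duplicated labels visible in Figure 9.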

4.2.2 Using Complex Object Definitions

A complex object definition is defined as follows.

Definition 4.4 : Complex Object Definition (COD) A complex object definition associated with a query Q on a relational schema R that has been translated to an object schema O, is a connected subgraph $\langle V', E' \rangle$ of the $JMG_Q$ $\langle V, E \rangle$ induced by Q and O such that:

1. each node $v \in V'$ represents a class C that images a table T appearing in the FROM-clause of Q.
2. each edge $e \in E'$ is either an $E_{me}$-edge or represents the materialization of a join specified by Q.
3. there exists a node $v_R \in V'$ representing a class $C_R$ called the root class, from which all other $v' \in V'$ are reachable.
4. for each object o in a class C represented by a node $v \in V'$ that is the image of a tuple t in the table T imaged by C, the following holds: if t is selected by Q, then there exists an object $o_R$ in $C_R$ from which o is reachable by following object references.

This definition ensures that for each complex object definition, we can translate the part of Q addressing the COD into a method on its root class, selecting at least all relevant objects in the COD. The idea of the translation process is then to define a set S of CODs of minimal cardinality, such that Q can be translated to a database method M which combines the results of a set of methods $S_M$ on the root classes of S. The CODs in S must therefore contain all classes relevant to Q. Although at first sight the definition of a COD may seem rather strict, the following lemma identifies a trivial set S of CODs satisfying Definition 4.4, and containing all classes addressed by Q.

Lemma 4.5 : Primitive CODs Given an SQL query Q, the set S of most primitive CODs associated with Q is the set $S = \{C \mid Image(C, T, \{a_1, a_2, \ldots, a_n\}),\ T.a_i \text{ occurs in } Q\}$. Q can be translated to a database method on the cartesian product of the elements of S.

Proof Each COD in S consists of one class only that images a table appearing in Q. Thus the conditions for a COD are trivially satisfied. The absence of any edges in the CODs implies that we do not exploit any of the navigational properties of the object structure. Q expresses some SQL-selection criterion on the cartesian product of the tables mentioned in its FROM-clauses. By assumption, the object DML used is powerful enough to express these criteria on the cartesian product of the corresponding classes.

Obviously, using an S as defined in this lemma as the basis for our translation is undesirable, as the navigational possibilities of the object structure are not exploited. A better approach is to merge multiple classes into one COD by connecting them using object references that materialize joins specified by Q. All relevant objects of these classes can then be selected using a single retrieval method. Ideally, we would associate a single COD with Q, with a root class from which all objects satisfying Q can be retrieved. Q is then translated to a retrieval method on this root class.

Constructing CODs from multiple classes and object references between them is based on the matching of joins described by Q and joins materialized in the object schema by object references. Note that apart from the regular joins of Q, we also try to match the implied semijoins and antijoins, as represented in the QJG. The matching process is described in the next subsections.

4.3 Matching joins on relationship tables

Figure 10: Three cases of joins involving relationship tables (tables: T(T-ID), TS(T-ID, S-ID), S(S-ID), SR(S-ID, R-ID), R(R-ID); cases 1., 2., 3.)

We first discuss the matching of joins involving simple connecting tables that were translated into a pair of $E_{tb}$-edges. Figure 10 illustrates three possibilities of joins on tables implementing an m:n relationship.

Case (1) is simplest; the relationship is traversed in the normal way, using a pair of joins on the table implementing the relationship. This case translates to a "join" of the entities represented by tables T and S.

Case (2) depicts the situation where a relationship table is involved in only one join. The semantics of such a join is selecting the tuples of T involved in the relationship. This case translates to a semijoin from entity T to entity S.

Case (3) is in fact an extension of case (2). The join TS.S-ID=SR.S-ID is equivalent to the pair of joins $TS.S\text{-}ID = S.S\text{-}ID \wedge S.S\text{-}ID = SR.S\text{-}ID$, each corresponding to case (2). Thus, such a join translates to a pair of semijoins: one from S to T and the other from S to R.¹

Note that joins on tables implementing relationships between entities may imply semijoins on those entities! As simple connecting tables are not translated into classes but into pairs of set-valued object references, we preprocess the QJG, replacing joins on these tables by joins and semijoins on the tables representing entities, as described above. The joins and semijoins between tables representing entities are then matched like regular joins and semijoins.
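The preprocessing of the three cases can be sketched in Python as follows. This is a simplified illustration under several assumptions: connecting tables are binary, the mapping `conn` from each connecting-table column to the entity table it references is available from the schema translation, and the function name `preprocess` and its input format are hypothetical.

```python
def preprocess(joins, conn):
    """joins: list of ((table, column), (table, column)) equijoins of Q.
    conn: {connecting_table: {column: referenced_entity_table}}.
    Returns (joins between entities, semijoins as (from_entity, to_entity))."""
    touched = {}   # connecting table -> entities it is joined with
    entity_joins = []
    for (t1, c1), (t2, c2) in joins:
        if t1 in conn and t2 in conn:
            # case 3: split the join via the entity both columns reference
            touched.setdefault(t1, set()).add(conn[t1][c1])
            touched.setdefault(t2, set()).add(conn[t2][c2])
        elif t1 in conn:
            touched.setdefault(t1, set()).add(t2)
        elif t2 in conn:
            touched.setdefault(t2, set()).add(t1)
        else:
            entity_joins.append((t1, t2))  # ordinary join between entities
    semijoins = []
    for table, entities in touched.items():
        linked = set(conn[table].values())
        if entities == linked:
            # case 1: the relationship is traversed completely
            entity_joins.append(tuple(sorted(linked)))
        else:
            # case 2: only one side of the relationship is joined
            (entity,) = entities
            (other,) = linked - entities
            semijoins.append((entity, other))
    return entity_joins, semijoins


CONN = {"TS": {"T-ID": "T", "S-ID": "S"}, "SR": {"S-ID": "S", "R-ID": "R"}}
case1 = preprocess([(("T", "T-ID"), ("TS", "T-ID")),
                    (("TS", "S-ID"), ("S", "S-ID"))], CONN)
case2 = preprocess([(("T", "T-ID"), ("TS", "T-ID"))], CONN)
case3 = preprocess([(("TS", "S-ID"), ("SR", "S-ID"))], CONN)
```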

4.3.1 Matching a join

We now describe matching a regular join appearing in Q. The simple idea is to match a QJG-edge e representing the join $R \bowtie_{A=B} S$ by a $JMG_Q$-edge e' that represents the materialization of this join. e' has annotation $\langle A, B \rangle$ and points either from a class imaging R to a class imaging S or vice versa. However, the match need not be so straightforward.

A join $R \bowtie_{A=B} S$ can be matched by a chain of object references $r_1, r_2, \ldots, r_n$ where $r_1$ is a materialization of $R \bowtie_{A=A_1} R_1$, $r_2$ materializes $R_1 \bowtie_{A_1=A_2} R_2$, ..., and $r_n$ materializes $R_{n-1} \bowtie_{A_{n-1}=B} S$. Without loss of generality, we assume that in such cases there exists a (derived) object reference r that materializes $R \bowtie_{A=B} S$ directly. On the other hand, the joins $T_1 \bowtie_{A=B} T_2 \bowtie_{B=C} T_3$ may be matched by an object reference r materializing the join $T_1 \bowtie_{A=C} T_3$ (which is implied by transitivity) and s materializing either $T_1 \bowtie_{A=B} T_2$ or $T_2 \bowtie_{B=C} T_3$. We therefore compute the transitive closure of the QJG, matching each of the joins implied by a query with a materialized join in the object schema.

¹ Note that these semantics remain unchanged if the relationship tables themselves are involved in a semijoin instead of a join.
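The transitivity argument can be illustrated with a small Python sketch that closes a set of equijoin conditions under transitivity by merging equated columns into equivalence classes. It covers regular equijoins only (semijoin and antijoin edges are left out), and the name `join_closure` is hypothetical.

```python
from itertools import combinations


def join_closure(joins):
    """joins: iterable of ((table, attr), (table, attr)) equality conditions.
    Returns all equalities implied by transitivity, as a set of frozensets."""
    classes = []  # disjoint equivalence classes of equated columns
    for a, b in joins:
        merged = {a, b}
        keep = []
        for c in classes:
            if c & merged:
                merged |= c      # c shares a column with the new equality
            else:
                keep.append(c)
        classes = keep + [merged]
    return {frozenset(pair)
            for c in classes
            for pair in combinations(sorted(c), 2)}


# T1.A = T2.B and T2.B = T3.C imply T1.A = T3.C
closure = join_closure([(("T1", "A"), ("T2", "B")),
                        (("T2", "B"), ("T3", "C"))])
```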

Lemma 4.6 : Matching a join Let a join $R \bowtie_{A=B} S$ be matched by an edge e representing an object reference r from a class $C_1$ to a class $C_2$ as described. Let $C_1$ be included in a COD $G = \langle V, E \rangle$ with a root class $C_R$. Now $G' = \langle V \cup \{C_2\}, E \cup \{e\} \rangle$ is again a COD.

Proof Since $\bowtie$ is symmetric, assume without loss of generality that $C_1$ images R and $C_2$ images S. By nature of the matching procedure, $C_2$ satisfies condition 1 and e satisfies condition 2 of Definition 4.4. By assumption, $C_1$ is reachable from $C_R$; therefore, $C_2$ is reachable from $C_R$ via $C_1$ and e, satisfying the third condition. Finally, since e matched $R \bowtie_{A=B} S$, r is based on the inclusion dependency $R.A \subseteq S.B$. Thus the objects o in $C_2$ reachable by r from $C_1$ are exactly those objects that correspond to the tuples of S satisfying the join condition, including all o's satisfying Q. By assumption, all objects o' in $C_1$ satisfying Q are reachable from $C_R$. Therefore all objects o in $C_2$ satisfying Q are also reachable from $C_R$.

4.3.2 Matching a semijoin

The materialized joins of the object schema can also be used in the translation of the semijoins and antijoins implied by Q. Note that the semijoin $R \ltimes S$ is defined as $\pi_R(R \bowtie S)$. A semijoin can therefore be matched like a normal join, regardless of the direction of the materialization; the projection does not play a role here. We illustrate this with a small example.

Example 4.2 Consider a query Q =

SELECT * FROM R WHERE R.A IN SELECT S.B FROM S.

Q has a QJG consisting of nodes for R and S and one edge representing $R \ltimes S$. Let $C_1$ and $C_2$ be images of R and S, respectively. Let r be a $C_1$-valued object reference in $C_2$ materializing the join $S \bowtie_{A=B} R$. Although the direction of r is opposite to that of the semijoin, the latter can simply be translated by collecting all $C_1$-objects referenced by $C_2$-objects. Q therefore describes a complex object definition with root $C_2$, on which the following retrieval method is defined:

collect x.r for x in self
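The equivalence claimed in Example 4.2 can be checked on toy data. The Python sketch below evaluates the relational reading of Q and an object-style reading side by side; the data, the attribute `name` and the dictionary-based object model are hypothetical, and B is assumed to be a foreign key on the key A.

```python
# Toy instances of the tables R and S (hypothetical data).
R = [{"A": 1, "name": "r1"}, {"A": 2, "name": "r2"}, {"A": 3, "name": "r3"}]
S = [{"B": 1}, {"B": 1}, {"B": 3}]

# Relational reading: SELECT * FROM R WHERE R.A IN SELECT S.B FROM S
sql_result = {t["name"] for t in R if t["A"] in {s["B"] for s in S}}

# Object reading: every C2-object holds a reference r to the C1-object
# whose key A equals its B.
c1 = {t["A"]: t for t in R}
c2 = [{"r": c1[s["B"]]} for s in S]

# collect x.r for x in self   (self being the extension of C2)
tm_result = {x["r"]["name"] for x in c2}
```

Both readings select r1 and r3, although the reference r points from the image of S to the image of R.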

Lemma 4.7 : Matching a semijoin Let a semijoin $R \ltimes_{A=B} S$ be matched by an edge e representing an object reference r from a class $C_1$ to a class $C_2$ as described. Let $C_1$ be included in a COD $G = \langle V, E \rangle$ with a root class $C_R$. Now $G' = \langle V \cup \{C_2\}, E \cup \{e\} \rangle$ is again a COD.

Proof The proof is completely analogous to the proof of Lemma 4.6, as $R \ltimes S = \pi_R(R \bowtie S)$.

4.3.3 Matching an antijoin

The antijoin $R \triangleright S$ is defined as $R \setminus \pi_R(R \bowtie S)$. Here the direction of the materialization does play a role. Again we illustrate this with an example.

Example 4.3 Consider a query Q' =

SELECT * FROM R WHERE R.A NOT IN SELECT S.B FROM S WHERE S.C=0

Q' has a QJG consisting of nodes for R and S and one edge from R to S representing $R \triangleright S$. As in Example 4.2, assume that r is a $C_1$-valued object reference of $C_2$, materializing the join $S \bowtie_{A=B} R$. Here we cannot regard Q' as describing a complex object definition with root $C_2$. The semantics of this query construct is such that we cannot access every $C_1$-object satisfying Q' from $C_2$, violating the fourth condition of Definition 4.4. The problem is that any $C_1$-object not referenced by a $C_2$-object satisfies Q'. Q' thus describes two CODs (each consisting of a single class) and is translated to the database method

collect x for x in C1 iff x not in collect y.r for y in C2 iff y.c=0

On the other hand, if r is a $C_2$-valued object reference of $C_1$, Q' can be translated as a retrieval method on $C_1$ (assuming r is a single-valued attribute):

collect x for x in self iff not x.r.c=0

Now consider the case that $C_1$ is a Sort, representing the fact that R is a weak entity belonging to S. In this situation, the problem with a $C_1$-valued object reference in $C_2$ does not occur, as each $C_1$-object is necessarily referenced by a $C_2$-object. Here we may translate Q' to a retrieval method on $C_2$:

collect y.r for y in self iff not y.c=0

even though the direction of r is opposite to that of the antijoin.

Lemma 4.8 : Matching an antijoin Let an antijoin $R \triangleright_{A=B} S$ be matched by an edge e representing an object reference r from a class $C_1$ to a class or sort $C_2$ as described. Let $C_1$ be included in a COD $G = \langle V, E \rangle$ with a root class $C_R$. Now $G' = \langle V \cup \{C_2\}, E \cup \{e\} \rangle$ is again a COD.

Proof We focus on condition 4 of Definition 4.4. By nature of the matching procedure described, we need to distinguish between two cases.

1. Both $C_1$ and $C_2$ are classes. Then $C_1$ is necessarily an image of R and $C_2$ is an image of S. By assumption, all objects o in $C_1$ satisfying Q can be accessed from $C_R$. Since the antijoins considered in this paper are derived from the constructions

SELECT ... FROM R,... WHERE R.A NOT IN SELECT S.B FROM S....

or a similar NOT EXISTS construction, Q by definition does not select any tuples in S. The condition is therefore trivially satisfied. (The reason why e is included in the COD here is not to retrieve objects from $C_2$ but to express conditions on $C_1$-objects. Conditions expressed by NOT EXISTS and NOT IN-constructs restrict the set of allowed $C_2$-instances related to a $C_1$-instance. These related instances are accessible via r.)

2. $C_2$ is a sort representing the fact that R is a weak entity dependent on S. Thus $C_1$ is an image of S and $C_2$ is an image of R. The antijoin matched the object reference r relating owner $C_1$-objects to the corresponding $C_2$-values. By definition, every $C_2$-instance is accessible by following r from $C_1$, thus including the $C_2$-instances satisfying Q.
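The role of the direction of r in Example 4.3 can be made visible on toy data. The Python sketch below uses hypothetical data in which one R-tuple is not referenced by any S-tuple, and models the $C_1$-valued reference r simply by the referenced key value; it is an illustration under these assumptions, not a TM implementation.

```python
R = [1, 2, 3]                                  # key values R.A
S = [{"B": 1, "C": 0}, {"B": 2, "C": 5}]       # S.B references R.A

# Q': SELECT * FROM R WHERE R.A NOT IN SELECT S.B FROM S WHERE S.C=0
sql_result = {a for a in R if a not in {s["B"] for s in S if s["C"] == 0}}

# r is a C1-valued reference in C2, modelled here by the referenced key.
c2 = [{"r": s["B"], "c": s["C"]} for s in S]

# Retrieval method rooted at C2: collect y.r for y in self iff not y.c=0
rooted_at_c2 = {y["r"] for y in c2 if not y["c"] == 0}

# Database method over both classes:
# collect x for x in C1 iff x not in collect y.r for y in C2 iff y.c=0
db_method = {x for x in R if x not in {y["r"] for y in c2 if y["c"] == 0}}
```

The method rooted at $C_2$ misses the object with key 3, which no $C_2$-object references, whereas the database method agrees with the relational result. If $C_1$ were a sort, such an unreferenced object could not exist.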

4.3.4 Matching Object Graphs

Example 4.3 shows that if not all QJG-edges can be matched, we may obtain multiple CODs even from a connected QJG. However, due to the splitting of tables into multiple classes in the data structure translation phase, we may obtain multiple CODs from a connected QJG even if all edges can be matched. CODs can therefore be "glued together" by connecting classes stemming from the same table with $E_{me}$-edges, provided that the resulting structure is again a COD.

To identify the CODs of a query Q, we first identify the subgraph of $JMG_Q$ containing classes and materialized joins matching the QJG of Q (i.e. containing all CODs of Q). This subgraph is called the matching object graph.

Definition 4.9 : Matching Object Graph (MOG) Given a QJG of a query Q and a $JMG_Q$ of an object structure, we constructively define the subgraph of $JMG_Q$ that matches QJG, called the matching object graph, by the following procedure:

input:  A query join graph QJG for an SQL-query Q
        A join materialization graph JMG_Q for the object schema
output: A matching object graph MOG = <V_MOG, E_MOG> for Q

G = TransitiveClosure(QJG); V_MOG = ∅; E_MOG = ∅;
foreach connected subgraph G_i of G do
  foreach edge e = (v1, v2, <a1, a2>) ∈ G_i:
    find an edge d = (v1', v2', L) ∈ E_JMG matching e
    if such a d exists then
      if vj' (j = 1, 2) already matched another G_i' then
        duplicate vj' and its edges in JMG_Q to obtain a new class instance variable node vj''
      else vj'' = vj';
      V_MOG = V_MOG ∪ {v1'', v2''}; E_MOG = E_MOG ∪ {d};
V_MOG = V_MOG ∪ {v' | v' ∈ JMG_Q, v ∈ QJG, v' = Image(v), v' has an attribute a' stemming from an attribute a from v referenced by Q};
foreach d_me = (v1', v2') ∈ E_me such that v1', v2' ∈ V_MOG:
  E_MOG = E_MOG ∪ {d_me};
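The core matching step of Definition 4.9 can be sketched in Python as below. The sketch is deliberately reduced: it omits the transitive closure, the duplication of nodes for overlapping subgraphs, the addition of classes holding referenced attributes and the me-edges. It also assumes that an edge matches when the imaged tables and the attribute labels agree irrespective of order. The names `match_edges` and `image`, the direction of the Dependent edge and the edge endpoints are assumptions.

```python
def match_edges(qjg_edges, jmg_edges, image):
    """qjg_edges: list of (R, S, (A, B)) over tables;
    jmg_edges: list of (C1, C2, (a1, a2)) over classes;
    image: {class: table it images}. Returns (V_MOG, E_MOG)."""
    v_mog, e_mog = set(), []
    for r, s, (a, b) in qjg_edges:
        for c1, c2, label in jmg_edges:
            same_tables = {image[c1], image[c2]} == {r, s}
            same_attrs = set(label) == {a, b}
            if same_tables and same_attrs:
                v_mog |= {c1, c2}
                e_mog.append((c1, c2, label))
                break  # one materialized join per query edge suffices here
    return v_mog, e_mog


IMAGE = {"Employee": "EMPLOYEE", "Department": "DEPARTMENT",
         "Dependent": "DEPENDENT", "Manager": "DEPARTMENT"}
JMG = [("Manager", "Employee", ("mgrssn", "ssn")),
       ("Employee", "Dependent", ("essn", "ssn")),
       ("Employee", "Department", ("dno", "dnumber"))]
# Edge labels as they appear in Figure 8
QJG = [("EMPLOYEE", "DEPARTMENT", ("ssn", "mgrssn")),
       ("EMPLOYEE", "DEPENDENT", ("ssn", "essn"))]
v_mog, e_mog = match_edges(QJG, JMG, IMAGE)
```

On this input the query edge on ssn,mgrssn is matched by the isa-edge from Manager to Employee rather than by the dno,dnumber reference, because the labels must agree as well.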

The matching procedure attempts to match each edge from the transitive closure of the query join graph with an edge of the join materialization graph by comparing the respective starting and ending nodes and the labels. The MOG then consists of the matching edges and nodes connected by these edges or containing attributes addressed in Q. Finally, me-edges are added if this results in a higher connectivity of the MOG.

Example 4.4 Figure 11 shows the MOG for the example query of Figure 5. Note how the semijoin $EMPLOYEE \ltimes_{ssn=mgrssn} DEPARTMENT$ has matched the isa-edge from Manager to Employee. By matching the QJG with the JMG, we have isolated those join conditions of Q which are directly satisfiable in the object schema by following object references.

Figure 11: The MOG for the example query (nodes: Employee, Dependent, Manager; edge labels: essn,ssn; mgrssn,ssn)

On closer inspection of Figure 11, we note that the distinction between Manager and Employee is in fact superfluous in the context of Q, since Q considers only those employees that are also a manager, and the semantics of the isa-relation defined in the object schema ensures that these are exactly the instances of class Manager. This observation can be generalized to the following proposition.

Proposition 4.10 : SQL and Inheritance Let an SQL-query Q on a relational schema R contain a join specification $R \bowtie_{A=B} S$ that matches an $E_{isa}$-edge e from a class $C_1$ to a class $C_2$ in the JMG of the corresponding object schema O. This join is the SQL-representation of an inheritance mechanism that is implicit in the semantics of the object schema.

Proof Since e matched $R \bowtie_{A=B} S$, the annotations Image($C_1$, R, ...), Image($C_2$, S, ...), Key(R, A) and Key(S, B) (or the symmetrical case) exist. Since $C_1$ isa $C_2$ was deduced in the schema translation phase, the inclusion dependency $R.A \subseteq S.B$ must hold. Therefore, Q selects all tuples in R and joins them to the tuples in S with the same key. The resulting table thus consists of the union of the attributes of R and S and contains exactly one $R \bowtie_{key=key} S$-tuple per R-tuple. This is precisely the semantics of R isa S.

As the inheritance mechanism specified by the SQL-query is implicit in the object-oriented data model, the explicit specification of the superclass is superfluous in the context of a COD. This motivates the following transformation.

Algorithm 4.11 : Superclass elimination Let G be a MOG containing an edge $e \in E_{isa}$. The following transformation deletes this edge from G and yields a new MOG in which $v_1$ and $v_2$ have been merged:

input:  A matching object graph MOG
output: A transformed MOG' with superfluous superclasses eliminated.

E_MOG' = E_MOG; V_MOG' = V_MOG;
foreach e = (v1, v2, <a1, a2>) ∈ E_isa:
  E_MOG' = E_MOG' \ {e};
  replace all edges from/to v2 by edges from/to v1;
  V_MOG' = V_MOG' \ {v2};
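Superclass elimination can be sketched in Python as follows, under the assumption that an isa-edge is stored with the subclass as source and the superclass as target. The function name `eliminate_superclasses`, the tuple layout of the edges and the direction of the Dependent edge in the example are hypothetical.

```python
def eliminate_superclasses(nodes, edges):
    """nodes: set of class names; edges: list of (kind, src, dst, label),
    where an "isa" edge points from the subclass to the superclass."""
    nodes, edges = set(nodes), list(edges)
    while True:
        isa = next((e for e in edges if e[0] == "isa"), None)
        if isa is None:
            return nodes, edges
        _, sub, sup, _ = isa
        edges.remove(isa)
        # redirect every edge from/to the superclass to the subclass
        edges = [(kind, sub if s == sup else s, sub if d == sup else d, label)
                 for kind, s, d, label in edges]
        nodes.discard(sup)


# MOG of Figure 11, with the Dependent reference assumed to start at Employee
mog_nodes = {"Employee", "Manager", "Dependent"}
mog_edges = [("isa", "Manager", "Employee", ("mgrssn", "ssn")),
             ("fk", "Employee", "Dependent", ("essn", "ssn"))]
reduced_nodes, reduced_edges = eliminate_superclasses(mog_nodes, mog_edges)
```

The input graph is copied before it is modified, which corresponds to working on MOG' rather than on MOG itself.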

Example 4.5 Figure 12 shows the example MOG after eliminating the superfluous superclass Employee, with dotted circles indicating subgraphs matching the subqueries. Note that $G_{in_2}$ and $G_{out}$ coincide here.

Figure 12: The MOG of Figure 8 after eliminating a superfluous superclass, with subgraphs (nodes: Manager, Dependent; edge label: essn,ssn; subgraph labels: Gout,Gin(2); Gin(1))

Superclass elimination and TM In TM, we have slight difficulties implementing superclass elimination. Consider the MOG of Figure 13. Note that the query resulting in this MOG apparently retrieves Employees whose employee-object reference points to an Employee which is also a Manager.

Figure 13: Example MOG with superclass (nodes: Employee 1, Employee 2, Manager)

Now we reduce this MOG to obtain the MOG of Figure 14. Note that this MOG correctly depicts the semantics of the query, but we have difficulty implementing it as a TM-method. This is due to the following.

Figure 14: The reduced MOG (nodes: Employee 1, Manager)

The edge from Employee to Manager does not have a corresponding object reference in the object schema. The object reference points to the superclass Employee of Manager. The query, however, addresses only those employees that are managers. This cannot be expressed as a class method in TM; we need a database method to express this. It would be desirable to have a language construct to formulate such a condition more directly. Since our approach should be valid for a general OODML, we apply the superclass elimination as it is semantically valid, and abstract from specific implementation issues.

4.3.5 Identifying CODs

Algorithm 4.12 : Identifying CODs After elimination of the isa-edges, the following procedure identifies the CODs of the MOG. In particular, the set S of root classes of the CODs described by the query Q is determined, as follows:

input:  A matching object graph MOG
output: A set S of root classes

S := ∅; V := V_MOG; E := E_MOG;
repeat
  if there exists a node v ∈ V with indegree(v) = 0
    then root := v
    else root := some v of the cycle in G;
  S := S ∪ {root};
  DeleteSet := {(v1, v2) | v1 is reachable from root};
  V := V \ {v | ∀e = (v', v) ∈ E : e ∈ DeleteSet};
  E := E \ DeleteSet;
until V = ∅
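A possible Python rendering of the root-finding procedure is sketched below. It departs from the procedure in two labeled ways, which are our assumptions: only nodes reachable from the chosen root are removed from V, and in the cycle case the root is a node that reaches the largest number of nodes, which guarantees that it lies on a cycle no other node feeds into. The names `reachable` and `identify_roots` are hypothetical.

```python
def reachable(root, E):
    """Nodes reachable from root along the directed edges E (root included)."""
    seen, stack = {root}, [root]
    while stack:
        u = stack.pop()
        for s, d in E:
            if s == u and d not in seen:
                seen.add(d)
                stack.append(d)
    return seen


def identify_roots(nodes, edges):
    """nodes: set of class names; edges: set of directed (src, dst) pairs,
    with bidirectional edges assumed to be listed in both directions."""
    V, E, roots = set(nodes), set(edges), []
    while V:
        sources = sorted(v for v in V if all(d != v for _, d in E))
        if sources:
            root = sources[0]
        else:
            # only cycles are left: pick a node reaching the most nodes
            root = max(sorted(V), key=lambda v: len(reachable(v, E)))
        roots.append(root)
        reach = reachable(root, E)
        E = {(s, d) for s, d in E if s not in reach}   # E := E \ DeleteSet
        # assumption: drop only reachable nodes without remaining in-edges
        V -= {v for v in reach if all(d != v for _, d in E)}
    return roots
```

For the reduced MOG of Figure 12, read as a single edge from Manager to Dependent, `identify_roots({"Manager", "Dependent"}, {("Manager", "Dependent")})` yields Manager as the only root.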

Proposition 4.13 : Identifying CODs The procedure above correctly identifies a set of CODs of minimal cardinality associated with a query Q.

Proof The procedure starts with a single class $v \in V_{MOG}$. Note that $V_{MOG} = \{C \mid Image(C, T, \{a_1, a_2, \ldots, a_n\}),\ T.a_i \text{ occurs in } Q\}$. As shown in Lemma 4.5, an element of this set is always a COD. Furthermore, as each edge in the MOG matched an edge of the QJG of Q, by Lemmata 4.6, 4.7, and 4.8 adding a class $v' \in V_{MOG}$ and edge e connecting v to v' to the COD again results in a COD. Therefore, each node reachable in the MOG from a COD may be included in that COD. Each iteration of the procedure thus results in a COD satisfying Definition 4.4. Lastly, each COD has maximal size, i.e. we cannot add any new node to it without violating the conditions of Definition 4.4. The cardinality of S is therefore minimal.

Corollary 4.14 : Q can be translated to a set of retrieval methods on the elements of S.

Example 4.6 The query of Figure 5 describes a single complex object definition with root class Manager, as can be seen from Figure 12. We therefore translate the example query of Figure 5 to a method on Manager.

Example 4.7 To illustrate our claim that the procedure described trivially extends to the case where multiple tuple variables address the same table, below we show the single COD described by the query from Example 4.1.

(Figure: nodes Employee 1, Employee 2, Department; edge labels: ssn,superssn; dno,dnumber)

So far, we have determined the object structure described by a query Q, and the class or classes on which the translated methods should be defined. We now turn to the actual translation of SQL statements to TM-methods.

5 Query translation

This section discusses the translation of some relevant SQL-queries to TM-methods on the set of root classes of the complex object definitions specified by an SQL-query. We do not offer a complete coverage of all possible query forms allowed by SQL. This would be of little interest, as TM has more expressive power than SQL; thus, any SQL-statement has a direct translation to TM. We therefore only discuss the translation of those SQL-constructs that benefit from the additional semantics offered by the object schema, resulting in a non-trivial translation. Topics of interest are:

- Determining the proper location for the translation method(s) in the object schema.
- Applying query conditions by travelling complex object structures.
- Translating nested queries, including the translation of constructs such as EXISTS and IN.

As indicated in Section 4, the essence of our approach is formed by finding a set of root classes from which all objects addressed by a query Q can be accessed by following object references that materialize joins specified in Q. In the following, we interpret the SQL-SELECT as SELECT DISTINCT due to the set-oriented nature of our target language.

5.1 Defining a database method

As a SELECT-query Q is a retrieval action on a relational database, it is translated to a database retrieval method in a TM-specification. Assume Q gives rise to a set S of CODs. The database method translating Q calculates the cartesian product of the results of the retrieval methods defined on the root class of each COD, with the addition of conditions spanning multiple CODs (i.e. the unmatched join conditions of Q). If the cardinality of S is 1, Q describes a single complex object definition, and the corresponding database method is a trivial call of the retrieval method on the root class of the COD.

In Section 4, we made the assumption that the conditions of Q can be expressed as a conjunction $c_1 \wedge c_2 \wedge \ldots \wedge c_n$ of expressions, where each $c_i$ is of the form $\langle R.A_i\ Op\ S.A_j \rangle$, i.e. $c_i$ is a join condition and is part of the QJG of Q, or $\langle R.A_i\ Op\ Term \rangle$, i.e. $c_i$ is a selection condition. We showed that SQL-constructs like EXISTS and IN can be mapped to such expressions. For each condition $c_i$ of Q, we define its scope as follows.

Definition 5.1 : Scope Given a condition c appearing in an SQL-query Q, we define its scope as the set $\{C \mid Image(C, R, \{a_1, a_2, \ldots, a_n\}),\ R.a_i \text{ appears in } c,\ 1 \le i \le n\}$. The scope of a single attribute $R.A_j$ is defined analogously as $\{C \mid Image(C, R, \{a_1, a_2, \ldots, a_n\}),\ a_i = A_j,\ 1 \le i \le n\}$. A condition c is called an explicit join condition iff there does not exist a COD G such that $scope(c) \subseteq V_G$.

For each COD of Q, a retrieval method is defined, satisfying the following definition.

Definition 5.2 : Retrieval method for a COD Given a COD G associated with a query Q, a retrieval method for G is a retrieval method $M_G$ on the root class $C_R$ of G of type $\mathbb{P}\langle a_1 : \tau_1, a_2 : \tau_2, \ldots, a_n : \tau_n \rangle$, where each $a_i$ is the image of an attribute $R.A_i$ such that $scope(R.A_i) \subseteq V_G$ and $R.A_i$ either appears in the outer SELECT-clause of Q or in an explicit join condition of Q. Each tuple returned by $M_G$ satisfies the condition $\bigwedge_{scope(c_i) \subseteq V_G} c_i$.

That is, the retrieval method expresses all conditions "local" to the COD. We now define the (partial) translation of an SQL-query Q:

Definition 5.3 : Translating an SQL-query Let Q be an SQL-query of the form SELECT $A_1, A_2, \ldots, A_n$ FROM $R_1, R_2, \ldots, R_m$ WHERE... inducing a set of CODs S with cardinality $\ge 1$. Let a retrieval method $M_G$ satisfying Definition 5.2 be defined for each $G \in S$. Q is translated to a database method M with result type $\mathbb{P}\langle a_1 : \tau_1, a_2 : \tau_2, \ldots, a_n : \tau_n \rangle$, where each $a_i$ is an image of $A_i$ and each $\tau_i$ is the TM-type corresponding to the type of $A_i$. M is formed by a TM collect-expression collecting Dot($A_i$) for each $A_i$ in the SELECT-clause of Q from the cartesian product $Res_{G_1} \times Res_{G_2} \times \ldots \times Res_{G_k}$, where k is the cardinality of S and $Res_{G_i}$ is the result set of the retrieval method of the ith COD of Q, with the additional condition $\bigwedge_{\nexists G \in S :\, scope(c_i) \subseteq V_G} c_i$ (i.e. the conjunction of the explicit join conditions of Q). If S has cardinality 1, M consists of a simple call of $M_G$.

Thus, a typical TM-database method translating an SQL-query defining 2 CODs, say, has the following form:

    database retrieval method Trans(in i: tau-i; out ⟨o1: tau-o1; o2: tau-o2; ...; on: tau-on⟩)
        unnest
        collect
            collect ⟨o1 = x.accesspath.o1, o2 = x.o2, ..., on = y.accesspath.on⟩
            for y in MG2 iff condition(x,y)
        for x in MG1
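The shape of such a database method can be mimicked in Python as a filtered cartesian product. This is only a sketch under stated assumptions: the function `translate`, the sample result sets `mg1` and `mg2`, and their attribute names are hypothetical, and Python lists stand in for TM sets.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def translate(results: List[Iterable[Row]],
              join_cond: Callable[..., bool],
              selector: Callable[..., Row]) -> List[Row]:
    """Collect selector(...) over the cartesian product of the COD result
    sets, keeping only combinations satisfying the explicit join conditions."""
    return [selector(*combo)
            for combo in product(*results)
            if join_cond(*combo)]


# Hypothetical result sets of two retrieval methods MG1 and MG2.
mg1 = [{"name": "Ann", "city": "Enschede"}, {"name": "Bob", "city": "Delft"}]
mg2 = [{"pname": "Alpha", "location": "Enschede"}]

rows = translate(
    [mg1, mg2],
    join_cond=lambda x, y: x["city"] == y["location"],   # explicit join condition
    selector=lambda x, y: {"o1": x["name"], "o2": y["pname"]},
)

# With a single COD the method degenerates to a plain call of its retrieval method.
single = translate([mg1], join_cond=lambda x: True, selector=lambda x: x)
```

The conditions local to each COD are assumed to have been applied already inside `mg1` and `mg2`, so only the explicit join condition remains at this level.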

The definition is only partial; we have yet to define retrieval methods for CODs. This is addressed in the next subsection.

5.2 Defining a retrieval method on a COD

This section discusses the construction of a retrieval method $M_G$ on a COD $G$ satisfying Definition 5.2. In the following, we use dot-notation to address an attribute, defined as follows.

Definition 5.4: Dot. Given an attribute $R.A_i$ appearing in an SQL-query $Q$ and a set $S$ of CODs induced by $Q$, we define $\mathit{Dot}(A_i)$ as $C_R.C_1.C_2.\cdots.C_n.a_i$, with $C_R = \mathit{root}(G)$ where $G \in S$, $\mathit{scope}(R.A_i) \subseteq V_G$, $\mathit{Image}(C_n, R, \{a_1, a_2, \ldots, a_k\})$, and $(C_R, C_1, C_2, \ldots, C_n)$ are nodes on a path from $C_R$ to $C_n$ in $G$. In this expression, $C_R.C_1.\cdots.C_m$ ($m \le n$) may be replaced by a class instance variable representing the class $C_m$. For a condition $c$ of the form $A_i\ \mathit{Op}\ A_j$, $\mathit{Dot}(c) = \mathit{Dot}(A_i)\ \mathit{Op}\ \mathit{Dot}(A_j)$.
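As an illustration of the dot-notation, the sketch below builds a dotted access path over a small COD. The adjacency-map encoding, the function names `path_to` and `dot`, and the use of reference names (rather than class names) along the path are assumptions made for this example; a breadth-first search is used so that a shortest path is chosen, which the definition itself does not require.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# A COD as an adjacency map: class -> list of (reference name, target class).
Cod = Dict[str, List[Tuple[str, str]]]


def path_to(cod: Cod, root: str, target: str) -> Optional[List[str]]:
    """Breadth-first search for the reference names leading from root to target."""
    queue = deque([(root, [])])
    seen = set()
    while queue:
        node, refs = queue.popleft()
        if node == target:
            return refs
        if node in seen:
            continue
        seen.add(node)
        for ref, nxt in cod.get(node, []):
            queue.append((nxt, refs + [ref]))
    return None


def dot(cod: Cod, root: str, target: str, attr: str, var: str = "x") -> str:
    """Dotted access path to attr, with var standing for an instance of root."""
    refs = path_to(cod, root, target)
    if refs is None:
        raise ValueError(f"{target} is not reachable from {root}")
    return ".".join([var] + refs + [attr])


# Hypothetical COD rooted at Employee.
cod = {
    "Employee": [("department", "Department"), ("projects", "Project")],
    "Project": [("department", "Department")],
}
```

For instance, `dot(cod, "Employee", "Department", "dname")` yields the path used in the condition on the department name in Figure 15, with the variable `x` replacing the root class.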

Constructing a collect-expression. The body of $M_G$ is formed by a TM-expression of the form

    collect selector for x in self iff conds

where the keyword self refers to the extension of the class on which the method is formulated, i.e. $C_R$. The expression is constructed gradually while travelling the graph structure of the COD in a depth-first manner, starting at $C_R$, and guided by the following rules.

Visiting a node. When visiting a node $v$ representing a class $C$ for the first time, replace selector by selector $\cup\ \{\mathit{Dot}(a_i) \mid a_i \text{ is an attribute of } C,\ a_i \text{ is an image of } A_i \text{ appearing in the selector or an explicit join condition of } Q\}$. Replace conds by conds $\wedge\ \bigwedge \{\mathit{Dot}(c_i) \mid \mathit{scope}(c_i) \subseteq \{v \mid v \in V_G,\ v \text{ has been visited}\}\}$. When visiting a node $v$ that has been visited before, do the following: let $P$ be the path from $C_R$ to $v$ currently explored; let $P'$ be the path from $C_R$ to $v$ that was explored when visiting $v$ for the first time; add $P = P'$ to conds.
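The rule for revisited nodes can be sketched as a depth-first traversal that remembers the first path found to each class and emits an equality whenever a class is reached again. The encoding of the COD and the function name `path_equalities` are assumptions for this example, and the sketch deliberately ignores the introduction of new variables for nested levels.

```python
from typing import Dict, List, Tuple

Cod = Dict[str, List[Tuple[str, str]]]  # class -> [(reference name, target class)]


def path_equalities(cod: Cod, root: str, var: str = "x") -> List[str]:
    """Return the conditions P = P' generated when a node is visited again."""
    first_path = {root: var}   # path explored when each class was first visited
    conds: List[str] = []

    def visit(node: str, path: str) -> None:
        for ref, target in cod.get(node, []):
            new_path = f"{path}.{ref}"
            if target in first_path:
                # Second access path to the same class: both must coincide.
                conds.append(f"{new_path} = {first_path[target]}")
            else:
                first_path[target] = new_path
                visit(target, new_path)

    visit(root, var)
    return conds


# Hypothetical COD with two access paths from Employee to Department.
cod = {
    "Employee": [("department", "Department"), ("projects", "Project")],
    "Project": [("department", "Department")],
}
conds = path_equalities(cod, "Employee")
```

The single condition produced here corresponds to the equality of both access paths to Department noted in Example 5.1.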

Selecting an edge to travel. Let $E_v$ be the set of edges leaving the current node $v$. The selection of the next $e \in E_v$ to be travelled is governed primarily by the ordering:

1. any edge that matched a regular join;
2. any edge that matched a semijoin;
3. any edge that matched an antijoin;

and secondarily by the ordering: single-valued object references before multi-valued object references. This ensures that outer query blocks of $Q$ are always processed before inner query blocks of $Q$.
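The two-level ordering amounts to a lexicographic sort key. A minimal sketch follows, where the `Edge` record, its field names and the sample edges are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Primary ordering: regular join, then semijoin, then antijoin.
JOIN_RANK = {"regular": 0, "semijoin": 1, "antijoin": 2}


@dataclass(frozen=True)
class Edge:
    reference: str      # name of the object reference
    matched: str        # "regular", "semijoin" or "antijoin"
    multi_valued: bool  # False: single-valued, True: multi-valued


def travel_order(edges: List[Edge]) -> List[Edge]:
    """Sort by join kind first, then single-valued before multi-valued."""
    return sorted(edges, key=lambda e: (JOIN_RANK[e.matched], e.multi_valued))


edges = [
    Edge("dependents", "antijoin", True),
    Edge("projects", "regular", True),
    Edge("department", "regular", False),
    Edge("manager", "semijoin", False),
]
order = [e.reference for e in travel_order(edges)]
```

Because `False` sorts before `True`, single-valued references come first within each join kind.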

Travelling a single-valued object reference. Travelling an edge $e = (v, v')$ representing a single-valued object reference has consequences only when $e$ matched an antijoin edge $e'$ of the QJG. Note that regular join conditions, and the SQL-constructs EXISTS and IN underlying semijoin conditions, are inherently satisfied when travelling a single-valued reference that matched the corresponding join or semijoin. In the case of $e$ matching an antijoin, the directions of $e$ and $e'$ must be considered. Consider the case where $e$ and $e'$ have the same direction. Then $Q$ is of the form SELECT (expression involving the table underlying $v$) ... WHERE NOT EXISTS SELECT (query $Q'$ expressing a selection on the table underlying $v'$), or a similar NOT IN construction. As for every instance of $v$ there exists a single associated $v'$-instance, we simply express the negation of the conditions of $Q'$ on $v'$. Thus, any condition generated after travelling such an edge must be negated in the translation of $Q$. However, if the directions of $e$ and $e'$ are opposite, the roles of $v$ and $v'$ in $Q$ are reversed. Now the current conds must be negated, as the expression generated so far concerns the COD described by the NOT EXISTS-block. This will be illustrated in Example 5.2.

Travelling a set-valued object reference. Travelling an edge $e = (v, v')$ representing a multi-valued object reference $r$ gives rise to an extra level of nesting in the collect-expression. The exact translation of travelling $e$ depends on the edge $e' \in$ QJG which matched $e$ when creating the MOG. There are three possibilities:

1. $e'$ represents a regular join;
2. $e'$ represents a semijoin;
3. $e'$ represents an antijoin.

Case 1: regular join. Let

    collect selector for var in dom iff conds

be the current expression, and let $e$ have matched a regular join of $Q$. Travelling $e$ transforms the expression to

    unnest
    collect
        collect selector for newvar in Dot(r)
    for var in dom iff conds

where the unnest-operator flattens a nested set. The inner collect-expression then becomes the current expression, to which selectors and conditions are added when further travelling the COD from $v'$. Note how any object related by $r$ and satisfying the conditions contributes a tuple to the result. See Example 5.1.
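The rewriting step of case 1 can be sketched as an operation on a small expression tree that renders TM-like text. The `Collect` class, the function `travel_regular_join` and the rendering with angle brackets are assumptions for this illustration, not TM syntax definitions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Collect:
    var: str
    dom: str
    selector: List[str] = field(default_factory=list)
    conds: List[str] = field(default_factory=list)
    inner: Optional["Collect"] = None

    def render(self) -> str:
        # A nested collect takes the place of the selector tuple.
        body = self.inner.render() if self.inner else "<" + ", ".join(self.selector) + ">"
        text = f"collect {body} for {self.var} in {self.dom}"
        if self.conds:
            text += " iff " + " and ".join(self.conds)
        return text


def travel_regular_join(current: Collect, newvar: str, ref: str) -> Collect:
    """Nest a new collect over current.var.ref; it becomes the current expression."""
    inner = Collect(var=newvar, dom=f"{current.var}.{ref}", selector=current.selector)
    current.selector = []
    current.inner = inner
    return inner


outer = Collect("x", "self", selector=["Name=x.name"],
                conds=['x.department.dname="R&D"'])
inner = travel_regular_join(outer, "y", "projects")
inner.selector.append("Pname=y.pname")
inner.conds.append("y.department=x.department")
text = "unnest " + outer.render()
```

After the step, selectors and conditions are appended to `inner`, mirroring the statement that the inner collect-expression becomes the current expression.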

Case 2: semijoin. If $e = (v, v')$ matched a semijoin, the directions of $e$ and $e'$ must be considered. If $e$ and $e'$ have the same direction, $Q$ is of the form SELECT (expression involving the table underlying $v$) ... WHERE EXISTS SELECT (query $Q'$ expressing a selection on the table underlying $v'$), or a similar IN construction, and the expression becomes

    collect selector for var in dom iff conds ∧ exists newvar in Dot(r) | newconds

The exists-expression, which is treated as a collect without a selector, becomes the current expression. See Example 3.1. If $e$ and $e'$ have opposite directions, the roles of $v$ and $v'$ in $Q$ are reversed, and $e$ can then be travelled as a regular join edge. The fact that the method accesses objects of the class $v'$ by reference from objects of the class $v$ implies satisfaction of the condition that for each $v'$-object there must exist a related $v$-object. See Example 5.2.

Case 3: antijoin. If $e$ matched an antijoin, the respective directions of $e$ and $e'$ again play a role. If the directions are equal, a not exists-expression is formed analogous to the exists-form of case 2. If the directions of $e$ and $e'$ are opposite, first the current condition is negated, since it refers to the COD described by a NOT EXISTS-block as discussed under the single-valued case, and then the edge is processed as a regular join edge of case 1. Note that by the nature of the matching procedure as described in Lemma 4.8, this can occur only when $v'$ is a Sort representing a weak entity. Again see Example 5.2.
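For the same-direction variants of cases 2 and 3, the exists and not exists forms correspond to existential tests over the referenced set. The following Python sketch uses hypothetical objects encoded as dictionaries and a hypothetical query (employees with, or without, a female dependent); `any` plays the role of the exists-expression.

```python
# Hypothetical Employee objects with a multi-valued reference "dependents".
employees = [
    {"name": "Ann", "dependents": [{"dependent_name": "Eve", "sex": "F"}]},
    {"name": "Bob", "dependents": [{"dependent_name": "Tom", "sex": "M"}]},
    {"name": "Cid", "dependents": []},
]

# Case 2 (semijoin, same direction):
#   collect <Name = x.name> for x in self iff exists y in x.dependents | y.sex = "F"
with_female_dependent = [
    {"Name": x["name"]}
    for x in employees
    if any(y["sex"] == "F" for y in x["dependents"])
]

# Case 3 (antijoin, same direction): the same shape with "not exists".
without_female_dependent = [
    {"Name": x["name"]}
    for x in employees
    if not any(y["sex"] == "F" for y in x["dependents"])
]
```

Note that an employee without any dependents satisfies the not exists form, as expected for an antijoin.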

Example 5.1. Figure 15 shows an example query illustrating case 1. The query selects R&D employees and the projects they work on, provided that these projects are under the responsibility of their own department. Notice how travelling the multi-valued object reference from Employee to Project gives rise to a nested collect. Note also the condition y.department=x.department, ensuring the equality of both access paths to Department.

Example 5.2. Figure 16 illustrates the cases where the direction of the edge matching a semi- or antijoin (in this case the P-valued reference from Employee to Dependent) is opposite to the direction of the join itself. The first query illustrates the semijoin case; the second query contains an antijoin. Note how the edge matched the antijoin only because Dependent is a Sort.
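The two translations of this example (Trans1 and Trans2 in Figure 16) can be emulated in Python on toy data. The objects and values are hypothetical; the sketch relies on the assumption stated in the text that Dependent is a Sort representing a weak entity, so every dependent is reached through exactly one employee.

```python
# Hypothetical Employee objects; each dependent belongs to exactly one employee.
employees = [
    {"salary": 60000, "dependents": [
        {"dependent_name": "Eve", "sex": "F"},
        {"dependent_name": "Tom", "sex": "M"},
    ]},
    {"salary": 40000, "dependents": [
        {"dependent_name": "Zoe", "sex": "F"},
    ]},
]

# Trans1 (semijoin, opposite direction): the edge is travelled as a regular join.
trans1 = [
    {"DepName": y["dependent_name"]}
    for x in employees if x["salary"] > 50000
    for y in x["dependents"] if y["sex"] == "F"
]

# Trans2 (antijoin, opposite direction): the condition on the employee is
# negated first, then the edge is travelled as a regular join.
trans2 = [
    {"DepName": y["dependent_name"]}
    for x in employees if not x["salary"] > 50000
    for y in x["dependents"] if y["sex"] == "F"
]
```

The flat comprehension with two `for` clauses plays the role of the unnest over the nested collect.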

    SELECT name, pname
    FROM EMPLOYEE, WORKS-ON, PROJECT
    WHERE ssn=wssn AND pno=pnumber
      AND EXISTS SELECT *
                 FROM DEPARTMENT
                 WHERE dnum=dnumber AND dno=dnumber AND dname="R&D"

[Graph: nodes Project, Department and Employee; edge labels WORKS-ON, "dnum,dnumber" and "dno,dnumber"]

    class retrieval method for Employee
    Trans(out P⟨Name: string, Pname: string⟩) =
        unnest
        collect
            collect ⟨Name = x.name, Pname = y.pname⟩
            for y in x.projects iff y.department = x.department
        for x in self iff x.department.dname="R&D"

Figure 15: Query translation illustrating case 1
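To see what the method of Figure 15 computes, it can be emulated on a handful of hypothetical objects. In this sketch, dictionaries stand in for TM objects, Python lists for sets, and object identity (`is`) for equality of object references; all sample data are invented for illustration.

```python
# Hypothetical Department, Project and Employee objects.
rd = {"dname": "R&D"}
sales = {"dname": "Sales"}

alpha = {"pname": "Alpha", "department": rd}
beta = {"pname": "Beta", "department": sales}

employees = [
    {"name": "Ann", "department": rd, "projects": [alpha, beta]},
    {"name": "Bob", "department": sales, "projects": [beta]},
]

# unnest collect (collect <Name, Pname> for y in x.projects iff ...) for x in self iff ...
trans = [
    {"Name": x["name"], "Pname": y["pname"]}
    for x in employees
    if x["department"]["dname"] == "R&D"
    for y in x["projects"]
    if y["department"] is x["department"]   # both access paths reach the same Department
]
```

Ann works on Beta as well, but Beta belongs to another department, so only the pair (Ann, Alpha) is returned.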

Translation of SQL-nesting constructs. Note that where a semi- or antijoin edge in the QJG of $Q$ indicates a level of nesting in $Q$, an edge in a COD corresponding to a multi-valued object reference indicates a level of nesting in the TM-translation of $Q$. Note also that SQL-constructs like IN and EXISTS often (viz. when corresponding to a single-valued object reference or in cases like Example 5.2) do not need a corresponding TM-construct for their translation, due to the semantics of the object model.

Reviewing the translation process described in this paper, notice that the crucial point for translating relational queries to queries on an object schema is the exploitation of the navigational structure of the object model. Each query is first associated with such a navigational structure, as described in Section 4. This section described how joins specified in the SQL-context are subsequently translated into path traversals in the object structure. Although our examples use the TM object-oriented DML, the strategy is sufficiently general to be applicable to any set-based OODML.

    SELECT dname
    FROM DEPENDENT
    WHERE sex="F" AND essn IN
        SELECT ssn
        FROM EMPLOYEE
        WHERE salary>50000

    SELECT dname
    FROM DEPENDENT
    WHERE sex="F" AND NOT EXISTS
        SELECT ssn
        FROM EMPLOYEE
        WHERE salary>50000 AND ssn=essn

[Graph: nodes Dependent and Employee]

    Class retrieval method for Employee
    Trans1(out P⟨DepName: string⟩) =
        unnest
        collect
            collect ⟨DepName = y.dependent-name⟩
            for y in x.dependents iff y.sex="F"
        for x in self iff x.salary>50000

    Class retrieval method for Employee
    Trans2(out P⟨DepName: string⟩) =
        unnest
        collect
            collect ⟨DepName = y.dependent-name⟩
            for y in x.dependents iff y.sex="F"
        for x in self iff not x.salary>50000

Figure 16: Examples of opposite directions

6 Conclusion

In this paper, we argued the feasibility of equipping object-oriented federated schemata with methods reflecting operations on the component schemata extracted from the applications. An algorithm to obtain object methods in an object-oriented DML from SQL-queries has been described, based on the matching of query join graphs with object structures. It was shown that the navigational structure provided by the object model can be exploited to represent joins and nesting constructs contained in an SQL-query. We have shown that with such a translation, considerable gain in the intuitive meaning of the query can be achieved. When applied to queries from applications on the underlying databases in a federated database system, both the subsequent schema integration phase and the development of new, federated, applications benefit from this additional insight into the semantics of the underlying schema.

Further research. We are currently planning a PROLOG implementation of the translation procedure, which will be evaluated on a test set of existing queries from large database applications operational within a large international oil company, to investigate the semantic expressiveness of the results. A later version will use the ODMG-DML as its target language.

