Semantic Query Processing in a Heterogeneous Database Environment

Semantic Query Processing in a Heterogeneous Database Environment John Cardiff

Tiziana Catarci & Giuseppe Santucci

Department of Computing RTC Tallaght Dublin 24 Ireland [email protected]

Dipartimento di Informatica e Sistemistica Università degli Studi di Roma "La Sapienza" Via Salaria 113 I-00198 Roma, ITALY [catarci/santucci]@infokit.ing.uniroma1.it

Abstract

Semantic query optimization is the process of transforming a query issued by a user into a different query which, because of the semantics of the application, is guaranteed to yield the correct answer for all states of the database. While this process has been successfully applied in centralised databases, its potential for distributed and heterogeneous systems is enormous, since it offers the opportunity to eliminate inter-site joins, the single biggest cost factor in query processing. Further justification for its use is provided by the fact that users of heterogeneous databases typically issue queries through high-level languages, which may result in very inefficient queries if mapped directly, without consideration of the semantics of the system. Even if this is not the case, users cannot be expected to be familiar with the semantics of the component databases, and may consequently issue queries which are unnecessarily complicated. In this paper, we present the design of a semantic query optimizer for a heterogeneous database management system (HDBMS). It is based on a powerful data model, the Graph Model, which we argue is suitable as the canonical model of a HDBMS. We also introduce a logic language to formally express interdependencies between classes belonging to different schemas. Given a set of such assertions, we may benefit in several ways from the possibility of reasoning about them, and specifically by applying them in heterogeneous query processing.

Research supported by the EEC under the Esprit Project 6398 "VENUS".

Contents

1. Introduction
2. The VENUS Architecture and Model
   2.1 The VENUS Architecture
   2.2 The Graph Model
   2.3 Graphical Primitives
3. Background to Global Semantic Query Processing
   3.1 Review of Related Work
   3.2 Our Approach to Semantic Query Processing
4. Schema Assertions and Transformations
   4.1 Schema Knowledge
   4.2 Specifying Interschema Knowledge
   4.3 Cooperative Information System
   4.4 Query Transformations
       4.4.1 Transformations within a Class Hierarchy
       4.4.2 Transformation of Queries between Class Hierarchies
       4.4.3 Transformations using Domain Assertions
5. Global Semantic Query Optimisation
   5.1 Representation of Global Execution Strategies
       5.1.1 Conceptual Global Query Trees
       5.1.2 Implementation Global Query Trees
   5.2 HDB Query Optimisation Methodology
       5.2.1 Global Semantic Transformation
       5.2.2 Initial CGQT Optimisation
       5.2.3 Construction of Candidate IGQTs
       5.2.4 Selective IGQT Transformation
   5.3 Example
   5.4 Appraisal
6. Summary and Conclusions

1. Introduction

The motivation for this work comes from two different needs. On the one hand, the number of non-expert users accessing databases is growing. On the other hand, information systems no longer tend to be based on a single centralized architecture; they tend to be constituted of several heterogeneous component systems which cooperate to achieve global tasks. For such systems to be deployed and exploited successfully, friendly man-machine interaction is paramount. For this purpose, we have developed a query system based on the use of visual formalisms (i.e., a visual query system, VQS). VQSs are particularly appropriate for querying heterogeneous databases: in these cases, users typically are not aware of the structure or location of the underlying data. A VQS can represent iconically the concepts and their properties that are stored in the database, while the translation of the query to a set of component databases, plus the corresponding result assimilation, is achieved transparently to the user. However, existing VQSs do not exploit this potential, remaining limited to querying databases expressed in a single model. Our approach aims to overcome this weakness by defining a complete system for interfacing heterogeneous databases through an adaptive interface providing the user with several visual representations. This system has been designed in the framework of the Esprit Project "VENUS" and relies on both a general data model, the Graph Model (GM), whose constructs are visual, and a minimal set of Graphical Primitives (GPs), in terms of which general query operations may be visually expressed. However, it is worth noting that the main contribution of our approach does not lie in defining a new data model. On the contrary, the basic idea is to give importance to the visual representation, which is one of the most important components in building effective interfaces, and to define a set of basic GPs in terms of which more complex interactions may be expressed.
By using the VENUS system a user can pose queries and see the resulting data by interacting with a conceptually single database, namely a Graph Model Database (GMDB). The user's query is then translated into a set of queries which are executed on the component databases, and their results are combined into a single result. The user is thus oblivious to the existence of the underlying databases, and need not be concerned with their specific storage formats or query languages. The Graph Model is powerful enough to express the semantics of most of the common data models, and is therefore suitable as a unifying canonical model. Once the user issues a query through the visual interface, the query is decomposed into a number of subqueries. These are translated into the languages of the underlying component databases, and the query is executed. This may result in very inefficient query processing if the query is translated directly to the underlying databases, without consideration of their semantics. Even if this is not the case, users cannot be expected to be familiar with the semantics of the component databases, and may consequently issue queries which are unnecessarily complicated.

We address this issue by augmenting the "traditional" approach to heterogeneous query processing with additional methodologies to exploit the availability of semantic information that could simplify the expression of a query. This semantic information may apply to the global or external view of the database, hold between pairs of schemas, or be purely local to component databases. The information is exploited in a variety of ways, at different stages in the transformation process. In all cases, however, our goal is to find a "near optimal" query that can be derived with minimal overhead. While the utilisation of semantics in query processing has been successfully applied in centralised databases, its potential for distributed and heterogeneous systems is enormous, since it offers the opportunity to eliminate inter-site joins, the single biggest cost factor in query processing. Our approach to heterogeneous semantic query processing forms the central theme of this paper. In order to exploit the underlying semantics in heterogeneous query processing, we need a formal means of representing them. Therefore, we introduce a logic language to express interdependencies between classes belonging to different schemas. Such interdependencies allow a designer of a cooperative information system to establish several relationships between both the intensional definitions and the sets of instances of classes represented in different schemas. Once we have a set of assertions of the above-mentioned kinds, we may benefit in several ways from the possibility of reasoning about them, in particular in providing integrated access to the cooperative information system. It is worth noting that the notion of interschema knowledge is crucial for the development of cooperating heterogeneous information systems.
Recent work on interoperability points out that two individual information systems can interoperate on the basis of a mutual understanding of the information resources they provide [Geller Perl Neuhold 1991; Sheth Kashyap 1992; Yu et al. 1991; Barsalou Gangopadhyay 1992; Brodie Ceri 1992; Fang Hammer McLeod 1991; Li McLeod 1991; Krishnamurthy Litwin Kent 1991]. Obviously, in order to achieve this mutual understanding, several forms of interschema knowledge must be expressed and reasoned upon. The paper is organized as follows. Section 2 presents the overall framework for the current research, i.e., the VENUS system and the Graph Model. Section 3 gives the background to the field of semantic query optimization. Section 4 introduces the notion of inter- and intraschema knowledge, and presents a set of transformations which can be exploited in the task of semantic query optimisation. Section 5 presents the methodology for heterogeneous semantic query optimisation, and the conclusions are presented in Section 6.

2. The VENUS Architecture and Model

2.1 The VENUS Architecture

In Figure 1 we show the architecture of the VENUS system. Such a system consists of a Visual Interface Manager, a User Model Manager, a GMDB & Query Manager, and one or

more Database Management Systems. The kernel of the system consists of the three managers, which are cooperating processes.

[Figure 1 (diagram): the Visual Interface Manager, User Model Manager, and GMDB & Query Manager (with its User Modeling Knowledge and GM Schemata stores) sit above the component systems DBMS1/DB1, DBMS2/DB2, ..., DBMSn/DBn.]

Figure 1: The Venus System Architecture

The Visual Interface Manager is capable of supporting multiple visual representations (form-based, iconic, diagrammatic and hybrid) of the databases and the corresponding interaction modalities. Based upon the user model provided by the User Model Manager, the Visual Interface Manager selects the visual representation most appropriate for the user. At any moment, the user is free to shift to any of the available interaction modalities, finding the state of the query updated in the new modality. At the bottom of the figure, different databases structured according to several data models are shown. Each database is translated into a Graph Model Data Base (GMDB) by the GMDB & Query Manager, using the mappings defined in [Catarci Santucci Angelaccio 1993] and in [Cardiff Catarci Santucci 1994]. It is up to the GMDB & Query Manager to manage such mappings and to translate the visual queries into queries that can be executed by the appropriate Database Management System. Figure 2 describes the initial activity of the GMDB & Query Manager, i.e., the conversion of the DBMS schemata into GMDB schemata and their subsequent integration.

[Figure 2 (diagram): database translators for Model 1 ... Model n convert the local schemas S1 ... Sn of DBMS1 ... DBMSn into GM Schemas 1 ... n; the GM schema integrator, consulting a Knowledge Base, merges these into the GM Integrated Schema.]

Figure 2: Conversion and integration of local schemas

Each of the merging schemata is expressed in terms of a data model supported by a suitable DBMS. The GMDB & Query Manager activates the suitable translators to convert the local schemata into GMDB schemata. During this phase, each translator interacts with an internal knowledge base with two main aims: 1) to document the choices adopted during the translation activity; 2) to find out additional information (if available) about the schema to be translated.

Once the set of GMDB schemas is available, the GM schema integrator module integrates them. The first step in the integration of two schemata is to recognize their similarities, which provide the starting point for the integration. However, the main difficulty during schema integration is to discover and resolve possible conflicts in the schemata to be merged, e.g., different representations for the same concepts. During the integration process the schema integrator exploits the interschema knowledge contained in the knowledge base (see [Cardiff Catarci Santucci 1994]). The interschema knowledge consists of interdependencies between classes belonging to different schemas, which are expressed in terms of a suitable language. These interdependencies allow one to establish several relationships between both the intensional definitions and the sets of instances of classes represented in different schemas. If we want to integrate two or more schemas belonging to the heterogeneous information system, we can benefit from this knowledge in all the phases of the integration process. In particular, the activity of conflict detection can be fully automated. It also becomes possible to single out several forms of redundancy in the integrated schemas, and to make explicit the links to be included in the integrated schema in order to reflect meaningful relationships between classes coming from different schemas.
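Interschema assertions of this kind can be viewed operationally as data that drives the integration steps. The following sketch is purely illustrative — the class names, assertion kinds, and representation are invented, not the paper's actual assertion language — and shows how equivalence and subset assertions might be harvested to propose class merges and ISA links:

```python
# Illustrative interschema assertions between two component schemas.
assertions = [
    ("S1.Author",  "equivalent", "S2.Writer"),
    ("S1.Referee", "subset",     "S2.Person"),
    ("S1.Author",  "disjoint",   "S1.Referee"),
]

def merge_candidates(assertions):
    """Classes asserted equivalent should collapse into one integrated class."""
    return [(a, b) for a, kind, b in assertions if kind == "equivalent"]

def isa_links(assertions):
    """Subset assertions become explicit ISA edges in the integrated schema."""
    return [(a, b) for a, kind, b in assertions if kind == "subset"]

print(merge_candidates(assertions))  # [('S1.Author', 'S2.Writer')]
print(isa_links(assertions))         # [('S1.Referee', 'S2.Person')]
```

Disjointness assertions, similarly, would let the integrator rule out spurious merges automatically.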

[Figure 3 (diagram): the Query Handler, consulting the Knowledge Base, maps an admissible view of the GM Integrated Schema to admissible views on the local GM schemas; query translators for Model 1 ... Model n convert each view into a query for DBMS1 ... DBMSn, whose results flow back to the Query Handler.]

Figure 3: Query Management

In Figure 3 we show the modules of the GMDB & Query Manager devoted to query management. Through the visual interface manager (see [Catarci et al. 1994; Catarci Chang Santucci 1994]) the user interacts with a view of the integrated GM schema, and her/his actions are translated in terms of GPs. Each time the above process results in an admissible Typed Graph (see Section 4) the user can ask for the query computation. The Query Handler module, through the analysis of the distribution information, produces a set of admissible views on the local GM schemata. Each view is processed by the appropriate query translator, resulting in a query expressed in terms of the underlying DBMS. Each DBMS computes its query and sends the result to the Query Handler that, in turn, merges them, producing a unique result. The global and local strategies for query optimization carried out by the Query Handler are described in Section 5.

2.2 The Graph Model

In this section we introduce the basic notions of the data model, namely the Graph Model, and the query language, namely the Graphical Primitives, adopted in the VENUS system. In particular, we concentrate on the aspects of the GM which deeply influence the strategies for semantic query optimization described in Section 5, i.e., the constraint language which is used to express the so-called intra-schema knowledge. The Graph Model allows one to define a GMDB D in terms of a triple <g, c, m>, where g is a Typed Graph, c is a set (possibly empty) of suitable Constraints, and m is the

corresponding Interpretation. The schema of a database, i.e., its intensional part, is represented in the Graph Model by the Typed Graph and the set of Constraints. The instances of a database, i.e., its extensional part, are represented by the notion of Interpretation. A Typed Graph g is a tuple g = <N, E, l1, l2, f1, f2, f3>, where N is the set of nodes, divided into NC, the set of class-nodes, and NR, the set of role-nodes. Moreover, NC is partitioned into NCp, the set of printable class-nodes, and NCu, the set of unprintable ones. E is the set of edges; l1 and l2 are the sets of node and edge labels (l2 includes a special label T, corresponding to the true value); f1 and f2 are functions associating nodes and edges with labels in l1 and l2 respectively; finally, f3 is a total function mapping each node to a value in {unselected, selected, displayed}. The set c may be specified by using a logic-based Constraint Language, which is intended to be exploited by the system designer in order to specify constraints on, and meaningful properties of, the nodes represented in the Typed Graph (which correspond to classes of objects and relationships among them). The idea is that all the knowledge about such nodes can be specified in terms of a set of assertions on the nodes of the Typed Graph. It is worth noting that a user who simply wants to query the database does not need to be acquainted with the existence of such a Constraint Language. On the contrary, it is intended to be used both by the designers of each single information system and by the designer who builds the heterogeneous system, who needs to know much more about the component systems s/he has to integrate. The Constraint Language can be used at different levels.
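The Typed Graph tuple can be transcribed almost directly into code. The sketch below (the node names and the Python representation are illustrative assumptions) mirrors g = <N, E, l1, l2, f1, f2, f3>, with f3 defaulting every node to unselected:

```python
from dataclasses import dataclass, field

# A direct transcription of the Typed Graph tuple; example data is invented.
@dataclass
class TypedGraph:
    class_nodes: set      # NC, partitioned into printable (NCp) and unprintable (NCu)
    role_nodes: set       # NR
    edges: set            # E: pairs of adjacent nodes
    node_labels: dict     # f1: node -> label in l1
    edge_labels: dict     # f2: edge -> label in l2 ("T" denotes true)
    node_state: dict = field(default_factory=dict)  # f3: node -> state

    def __post_init__(self):
        # f3 is total: every node starts out unselected unless stated otherwise
        for n in self.class_nodes | self.role_nodes:
            self.node_state.setdefault(n, "unselected")

g = TypedGraph(
    class_nodes={"Person", "Name"},        # "Name" printable, "Person" unprintable
    role_nodes={"HasName"},
    edges={("Person", "HasName"), ("HasName", "Name")},
    node_labels={"Person": "Person", "HasName": "HasName", "Name": "Name"},
    edge_labels={("Person", "HasName"): "T", ("HasName", "Name"): "T"},
)
print(g.node_state["Person"])   # unselected
```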
For instance, in [Catarci Santucci Angelaccio 1993] a simple subset of constructs has been used for expressing ISA and cardinality constraints, while its potential has been fully exploited in [Catarci D'Angiolini Lenzerini 1994] and [Catarci Lenzerini 1991], allowing also for the representation of indefinite and negative information. The main idea of the language is that all the knowledge about the basic elements in the Typed Graph can be specified in terms of a set of assertions. Syntactically, an assertion is a statement of the form L1 isa L2, where L1 and L2 are expressions of the language. Informally, an assertion of the above form states that every instance of the class (denoted by the expression) L1 is also an instance of the class L2 (for a formal treatment of this subject see [Catarci Lenzerini 1991]). Suppose we are given an alphabet B of symbols for a GMDB D, including:
• the node labels l1 and the set of elementary values d;
• two special class symbols ⊥ and ⊤;
• the special symbols ∩, ∪, ∃, ∀, (, ), {, }.

We use the term class expression over B to denote any expression that is formed using the symbols in B according to some syntactic rules (the simplest class expression is the one formed by the label of a node). Intuitively, the expression R:C denotes the restriction of the relationship represented by the role-node R to those pairs whose second

component is an instance of the concept C. Moreover, the symbol ° denotes composition of roles, the symbol -1 is used to denote the inverse of a role, and the symbol * is used to denote the transitive closure of a role. The expression {e1,...,en}, where e1,...,en are elementary values, denotes the concept whose interpretation is the set of values corresponding to e1,...,en. The meaning of the constructors ¬, ∩, ∪ is simply that of set complement, set intersection and set union, respectively. For instance, C ∩ F (where C and F are class-nodes) represents the set of instances of both C and F. The constructors ∃ and ∀ are used to describe concepts on the basis of their links to roles: intuitively, ∃R.C denotes the set of objects that are linked by the role-node R to at least one instance of the class-node C, whereas ∀R.C denotes the set of objects which are linked by R only to objects that are instances of C. The expression ∃R denotes the set of objects that are linked by the role-node R to at least one object. Several kinds of assertions are expressible using the logic language, and this great richness can be profitably used when we want to represent and use interschema knowledge. We can not only express the usual isa relationships between classes (e.g., Author isa Person), but also isa relationships on roles. Moreover, we can use explicit negation, stating that the sets of instances of two class-nodes are disjoint. For example, we can represent that the classes Author and Referee are disjoint by means of the assertion: Author isa ¬Referee. Another example of negative information is an assertion stating that a certain role is meaningless for a class, i.e., that the instances of the class cannot participate in the role. Several kinds of indefinite information can also be expressed by means of assertions. One of the most important kinds of indefinite assertion is related to the use of disjunction.
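At the level of interpretations, the set-based semantics of these constructors is easy to make concrete. In the illustrative sketch below (the data and function names are invented), classes denote sets of objects and role-nodes denote sets of pairs, so isa, disjointness, and ∃R.C reduce to plain set operations:

```python
# Classes denote sets of objects; role-nodes denote sets of pairs.
interp = {
    "Person":  {"ann", "bob", "carl"},
    "Author":  {"ann", "bob"},
    "Referee": {"carl"},
}
roles = {
    "WrittenBy": {("p1", "ann"), ("p2", "bob")},   # (paper, author) pairs
}

def satisfies_isa(m, c1, c2):
    """c1 isa c2 holds when every instance of c1 is an instance of c2."""
    return m[c1] <= m[c2]

def exists_role(r, c, m, role_interp):
    """∃R.C: the objects linked by role R to at least one instance of C."""
    return {x for (x, y) in role_interp[r] if y in m[c]}

print(satisfies_isa(interp, "Author", "Person"))       # True: Author isa Person
print(interp["Author"].isdisjoint(interp["Referee"]))  # True: Author isa ¬Referee
print(sorted(exists_role("WrittenBy", "Author", interp, roles)))  # ['p1', 'p2']
```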
For example, one can assert that persons are either males or females by writing: Person isa (male ∪ female). Finally, it is possible to combine necessary and sufficient conditions, thus providing a definition of a concept in terms of other concepts. For instance, we can associate a definition with the class-node ItalianPaper as follows: ItalianPaper defint Paper ∩ (∀WrittenBy.(Author ∩ (∀Nationality.Italian))), which means: an Italian paper is a paper whose authors are all Italians. Finally, let us turn our attention to the notion of Interpretation, which is used for characterizing the instances of the database. We start by considering c to be an empty set. In this case, the interpretation for a Typed Graph g is a function mapping the printable class-nodes of g to a subset of the set of elementary printable values, the unprintable class-nodes to a subset of the set of elementary unprintable values, and the role-nodes to a subset of a set of structured objects, defined as the smallest set containing the set of elementary values and all the possible labeled tuples (of any arity). In particular, given a role-node n, its Interpretation is constituted by a set of tuples whose arity is equal to the number of class-nodes adjacent to n, where each component is labeled with the label of one adjacent class-node and takes its values in the corresponding Interpretation. When the set c is not empty, the definition of

interpretation is extended in order to satisfy the set of assertions in c. A formal semantics is defined for the class expressions, as well as a set of semantic equations which have to be satisfied in order for a certain m to be an interpretation for the pair <g, c>. An interpretation m is called a model of a set of assertions if every assertion is satisfied by m. Recalling the correspondence between interpretations and database states, it is easy to see that the models correspond to those database states which are legal with respect to a set of integrity constraints (in our case, the set of assertions).

2.3 Graphical Primitives

The Graphical Primitives (GPs) are defined in [Catarci Santucci Angelaccio 1993] as the GM query language. They are not intended to be used by the final user; rather, they are used for formally characterizing the semantics of the different visual languages provided to the user by our multiparadigmatic interface (see [Catarci et al. 1994]). The main idea is to express any query-oriented user interaction with a database in terms of two simple graphical operations: the selection of a node and the drawing of a labeled edge. The former is the simplest graphical operation available to the user, and corresponds to switching the state of a node. The latter is the linkage of two nodes by a labeled edge, and corresponds to either restricting the node Interpretations according to the rules stated in the label, or performing a set operation on them. The Selection of a node n in D, denoted by s(D,n), is used to restrict the original graph g to a subgraph g' by changing the state of the specified node from unselected to selected and displayed. The drawing of a labeled edge in D, denoted by e(D,f,n,q), can only be applied when no edge between n and q is in D. Its effect on the database D depends on the label f, which may be a boolean expression, "≡", or a set operator. Applying a set of GPs on a GMDB D results again in a GMDB D'.
A further step is devoted to associating with D' a new GMDB, called D@, denoting the information content of the query represented by D'. The process of building D@ is explained in [Catarci Santucci Angelaccio 1993]. Generally speaking, D@ is a GMDB composed of a single unprintable class-node linked, by means of binary role-nodes, to a set of printable nodes, corresponding to the ones set to displayed in D'. The Interpretation of the above binary role-nodes is computed in two logical steps: in the first step all the selected role-nodes of D' are joined together, giving the meaning of a fictitious n-ary role-node; in the second step this meaning is suitably projected onto the binary role-nodes of D@.
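The two Graphical Primitives can be sketched as pure state updates on the node states and edge labels of a Typed Graph. This is an illustrative approximation only: it collapses "selected and displayed" into a single value and ignores the computation of D@, and all names are invented.

```python
def select(node_state, n):
    """s(D, n): switch node n from unselected to selected and displayed."""
    new = dict(node_state)
    new[n] = "displayed"    # selected-and-displayed collapsed into one value here
    return new

def draw_edge(edge_labels, f, n, q):
    """e(D, f, n, q): draw an edge labeled f between n and q; only applicable
    when no edge between n and q is already present in D."""
    if (n, q) in edge_labels or (q, n) in edge_labels:
        raise ValueError("an edge between n and q already exists in D")
    new = dict(edge_labels)
    new[(n, q)] = f         # f may be a boolean expression, "≡", or a set operator
    return new

states = select({"Author": "unselected", "Paper": "unselected"}, "Author")
edges = draw_edge({}, "year > 1990", "Author", "Paper")
print(states["Author"])     # displayed
print(edges)                # {('Author', 'Paper'): 'year > 1990'}
```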

3. Background to Global Semantic Query Processing

The term Semantic Query Optimization refers to the use of integrity constraints in the optimization process. The underlying concept is that by harnessing the schema knowledge, a user's query can be transformed into a query which is syntactically different from the original but which will produce the same result for all states of the database, and be more efficient to execute than the original query. In

order to contrast this type of optimization with conventional query optimization, the latter is often referred to as syntactic query optimization, although this is not a strictly accurate term. Semantic query optimization results in the transformation of an input query into a "semantically equivalent query". Two queries are said to be semantically equivalent if, for every state of the database, they produce the same result. The schema knowledge must be used to determine semantic equivalence. We state this more formally as follows:

Definition: Semantic Equivalence. In a database D which is a model of a set of schema assertions SA = {SA1, ..., SAn}, a query predicate Q' is semantically equivalent to a given query predicate Q if it can be expressed in the form (Q ∧ SA'), where SA' = SA1 ∧ ... ∧ SAm, SAi ∈ SA, m ≤ n.

This means that Q ∧ SA' |= Q' and Q' ∧ SA' |= Q.
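A toy instance of the definition: with the single assertion Author isa Person playing the role of SA', the query Q = Author ∩ Person and the simpler Q' = Author are semantically equivalent, since on every database state satisfying SA' they return the same answer. The sketch below (all data invented) checks this on one legal state:

```python
# One schema assertion SA': Author isa Person (every author is a person).
def satisfies_sa(db):
    return db["Author"] <= db["Person"]

def answer_q(db):          # Q  = Author ∩ Person
    return db["Author"] & db["Person"]

def answer_q_prime(db):    # Q' = Author: the intersection is redundant under SA'
    return db["Author"]

state = {"Author": {"ann", "bob"}, "Person": {"ann", "bob", "carl"}}
assert satisfies_sa(state)                        # state is a model of SA'
print(answer_q(state) == answer_q_prime(state))   # True
```

Q' touches one class fewer than Q; in the heterogeneous setting, if Author and Person live at different sites, the rewrite eliminates an inter-site join.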

There are a number of different stages of transformation in heterogeneous semantic query processing. Firstly, the sites containing the data of interest are identified and the alternative sets of queries to the component databases are formulated. Note that data may be retrieved from more than one source, as there will be replication across the heterogeneous database. We refer to this stage as Global Semantic Query Optimization. Each of these component queries can then be transformed into a set of semantically equivalent queries (Local Semantic Query Optimization). Finally, for each of these queries, the internal DBMS optimiser will select an execution strategy. We refer to this stage as "Syntactic" Query Optimization. We are not concerned with the determination of the query execution strategies, but rather with the choice of a set of queries to the component databases which can answer the user's query, and with the transformation of these component queries into semantically equivalent queries which can be executed in a shorter time. These tasks are independent of one another (although cost estimates and statistics need to be passed between levels), and therefore this work can be seen as a semantic query optimization system for a centralised DBMS, with generalisations and extensions to accommodate heterogeneous DBMSs. The problems of query processing are considerably more complicated in a Heterogeneous Database Management System (HDBMS), since the systems to be integrated may be defined using different models, are likely to have large schematic discrepancies (both syntactic and semantic), and have prerequisites in terms of the degree of autonomy they are prepared to relinquish. However, very little effort has been devoted to this problem, possibly because it was assumed that the processing techniques developed for Distributed DBMSs (i.e., homogeneous systems) could be adapted for the heterogeneous case [Lu et al., 1992].
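The first stage, choosing among replicated sources for each class a query touches, can be sketched as a cost-based site selection. The catalog, the per-site costs, and all names below are illustrative assumptions, not part of the VENUS design:

```python
# Illustrative catalog of which sites replicate which classes, plus assumed
# per-site access costs.
catalog = {
    "Author": ["site1", "site2"],   # Author is replicated at two sites
    "Paper":  ["site2"],
}
site_cost = {"site1": 1.0, "site2": 2.5}

def choose_sites(classes):
    """For each class referenced by a query, pick the cheapest site storing it."""
    return {c: min(catalog[c], key=site_cost.get) for c in classes}

print(choose_sites(["Author", "Paper"]))   # {'Author': 'site1', 'Paper': 'site2'}
```

Each chosen component query would then pass through local semantic optimization before being handed to the site's own ("syntactic") optimizer.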

In [Dayal Landers Yedwab 1982], the most significant differences between homogeneous and heterogeneous query processing are identified as:
• the potential for data inconsistencies in the presence of replicated data, and the fact that global attributes may be defined as an aggregation of local attributes, so that many of the optimisation transformations that can be applied to homogeneous systems no longer hold;
• the difference in local processing costs, which has implications for the choice of the execution strategy; and
• differences in local processing capabilities: since the functionalities of the local DBMSs will not be identical, it may be difficult or impossible to perform certain operations at certain sites (e.g., a join on a hierarchical DBMS).

A further problem with defining a query optimization strategy for a HDBMS is that the participating systems may not be willing or able to give details of their cost models, making the definition of a global cost model extremely difficult. An approach to this problem is presented in [Du Krishnamurthy Shan 1992], where a calibration process is used to estimate the various cost models. However, we believe that regardless of the individual cost models, substantial savings may be made by performing a "semantic pre-processing" step in query decomposition, which involves harnessing the integrity constraints of the HDBMS in order to simplify the subqueries to be executed by the individual systems. This process is known as Semantic Query Optimization and has been shown in several prototypes to be successful in reducing the overall query processing time in many cases (e.g., [King 1981], [Chakravarthy 1986]). These systems almost universally concentrated on the centralised case, i.e., query optimization in a single system.

3.1 Review of Related Work

The QUIST (QUery Improvement by Semantic Transformation) system [King 1981] represented the first major work in the area of semantic query optimisation. It is a front-end to a conventional optimizer and transforms queries by attaching to them merge-compatible integrity constraints. It considers for transformation a "significant subset" of Select-Project-Join queries. A query is applied to a single relation, which may be physical or virtual. Virtual relations are defined in terms of joins of real relations, where only one join is permitted between any two physical relations. The search for a more efficient query is based on the plan-generate-test paradigm of AI. In the Plan stage, the constraint targets are established. Heuristics are described which guide the process of target selection. In the Generate stage, the space of semantically equivalent queries is searched. The constraint target list permits only transformations that may be cheaper. Finally, in the Test stage, estimates of the processing costs of each query are obtained. Chakravarthy et al. [Chakravarthy Fishman Minker 1985; Chakravarthy Minker Grant 1987] consider semantic query optimization in the context of deductive databases. A two-phased approach to optimization is presented. The first phase is semantic compilation, which associates the integrity constraint fragments, or residues, with compiled axioms. This is done

before any queries are issued. A technique known as partial subsumption is used to associate the residues with the relevant axioms. The second phase is semantic transformation which is applied when a query is issued. It uses the axioms to generate semantically equivalent queries, which may be executed faster than the original. Shenoy and Ozsoyoglu [Shenoy Ozsoyoglu 1987; Shenoy Ozsoyoglu 1989] use a graph theoretic approach to identify redundant join and restriction clauses. The algorithm also attempts to introduce as many restrictions on indexed attributes as possible. Only “profitable” constraints are used, which are decided by heuristic rules, parameters and assumptions. Two types of integrity constraint are used – subset (i.e., referential) and implication integrity constraints. The heuristics used are restriction elimination, index introduction, scan reduction and join elimination. The transformation process is divided into two main phases: (1) Semantic expansion which iteratively adds any new restriction or join implied by the query graph and semantic implication constraints: (2) Semantic reduction, which eliminates all redundant relations and restrictions from the query graph. Query processing in the context of heterogeneous systems was first explored in the Multibase project [Landers Rosenberg 1982]. In [Dayal Landers Yedwab 1982], four transformations which are shown to be “beneficial” are described, along with the conditions under which they can be applied. A query processing strategy is given which exploits these transformations in a multidatabase query processing algorithm. A methodology for reformulating multidatabase queries into component queries using semantic knowledge is presented in [Florescu et al. 1995]. A global query, expressed in OQL, is converted into a set of queries against the union of the local databases, each of which is a different method of computing the result of the global query. 
Each of these is then decomposed into a series of queries to be executed by the local systems, plus a composite query whose purpose is to assimilate the results passed back by the local databases. One of these alternatives is chosen for execution; a wrapper converts the local queries into the local language. The semantic rules used to transform queries into semantically equivalent queries are expressed as a set of assertions, and describe constraints that hold on the global view, on local databases, and between local databases (i.e., information about replication of data). Although the issue of heterogeneous query optimisation is not explicitly addressed, the techniques of semantic knowledge representation, exploitation of data replication, and reuse of executed queries provide valuable input to this process.

Lu et al. [Lu et al. 1992] identify the key phases of global query optimisation as decomposition, plan generation, cost evaluation and subquery execution monitoring. Decomposition breaks a query into a query graph in which the nodes are query units (queries which can be executed at a single site). Plan generation takes a global query graph and constructs a number of possible execution plans, based on the available statistics; it is guided in this by the cost evaluation and by heuristics to reduce the search space. In this approach, the query plan can be generated dynamically, as the query is being processed. When a subquery finishes executing, its cost is compared to the estimated cost. If there is a significant difference, this may affect the cost estimation for the overall plan. The Plan Generator is therefore reconsulted, and any parts of the plan not yet executed may be redefined (e.g., by changing execution sites or reformulating subqueries).

3.2 Our Approach To Semantic Query Processing

Much of the work in the area (e.g., [King 1981]) regarded the goal of Semantic Query Optimization as the production of a query semantically equivalent to the original which can be executed most efficiently (i.e., the minimal element in the equivalence class of a given query – where equivalence is interpreted as "semantic equivalence" – under the ordering Query Execution Cost). However, the cost of determining the most efficient query rises exponentially with the number of schema assertions to be considered – increasing the search space of query execution plans in a centralised database by an order of magnitude, and by two orders of magnitude in the case of a heterogeneous database. Hence, the optimization costs cannot be regarded as negligible with respect to the query execution costs, as is the case in conventional optimizers. Instead, it is preferable to choose a suboptimal query while strictly controlling the optimization costs.

Definition: Goal of Semantic Query Optimization. Let S be the set of schema knowledge assertions applicable to a given database D. The goal of semantic query optimization can be seen as the application of an arbitrary number of schema assertions to a given query, in such a way as to minimise the total cost of the query, viz. the sum of the query execution and optimisation costs.

In order to achieve the goal of minimising total execution cost, several issues must be considered. These include the following:

• Control of Search Space. This is the process of choosing rules that are relevant to the query. In an application where the schema knowledge consists of n schema assertions, there are 2^n – 1 ways of combining them with the query. As the schema assertions are true for all states of the database, the answer will be correct in each case, but only very few combinations will be of benefit.



• Control of Costs. Semantic query optimization increases the optimization cost by an order of magnitude in a centralised database, and by two in a distributed or heterogeneous system. For each set of query execution plans, an execution strategy must be determined in order to estimate the costs. With a conventional query optimizer, optimization costs are effectively negligible; here, they can be a major factor.



• Termination. The application of a schema assertion to a query can produce a query which may in turn be transformed by the application of another assertion. This procedure can be repeated indefinitely, so we need precise criteria for termination.
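One simple discipline that addresses termination is to iterate the transformation to a fixed point while tracking previously seen queries and bounding the number of rounds. The Python sketch below is purely illustrative (queries are stood in for by strings, and the rewrite function is a toy):

```python
def transform_to_fixpoint(query, rewrite, max_rounds=100):
    """Apply a rewrite step until the query stops changing, a
    previously seen query reappears (a cycle), or the round budget
    is exhausted -- three concrete termination criteria."""
    seen = {query}
    for _ in range(max_rounds):
        new = rewrite(query)
        if new == query or new in seen:
            break
        seen.add(new)
        query = new
    return query

# toy rewrite: each application strips one trailing "+" marker
result = transform_to_fixpoint("Q+++",
                               lambda q: q[:-1] if q.endswith("+") else q)
```

The `seen` set guarantees that a pair of assertions which undo each other cannot cause the process to loop forever.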

4. Schema Assertions and Query Transformations

4.1 Schema Knowledge

In this section we recall the notion of schema knowledge as defined in [Catarci Lenzerini 1993], and show how it can be used inside the global system architecture presented in Section 2. It is important to note that interschema knowledge is needed in a heterogeneous cooperative information system independently of the architecture of the system. We can distinguish between two basic approaches for accessing heterogeneous information systems. The first approach, which is the one we follow in VENUS, is based on a global schema describing the information in the individual databases, in such a way that users and applications are presented with the illusion of a single, centralized information system. The second approach avoids constructing the global schema, and relies on various tools for sharing information between users and applications. It is evident that interschema knowledge is an essential element in both approaches: in the former, it provides the necessary information for building the global schema, while in the latter, it is used for understanding the content of the different databases, so as to share relevant information. In our approach, we use logic both for expressing schema knowledge and for reasoning about it. We also show that it can assert relationships on the global schema, between component schemas (interschema knowledge), and on the local schemas. We assume that the individual information systems which are the components of the heterogeneous information system are defined in terms of the same data model, i.e., the Graph Model presented in Section 2, which is a class-oriented model. We have shown in [Catarci Santucci Cardiff 1997] the formal translations between the most representative data models and the GM, so this assumption does not limit the generality of our approach.
The basic idea of our approach is to propose a logic-based language to express interdependencies between classes belonging to different schemas. Such interdependencies allow a designer of a heterogeneous information system to establish several relationships between both the intensional definition and the set of instances of classes represented in different schemas. For example, one can assert in our language that the concept represented by the class-node GraduateStudent in the schema S1 is the same as the concept represented by SeniorStudent in S2. This assertion implies a sort of intensional equivalence between the two classes, but does not imply that the set of instances of the former is always the same as the set of instances of the latter. On the other hand, one can assert that the set of instances of the class-node Teacher in the schema S2 is always disjoint from the set of instances of the class-node Tutor in S3, even if there exists a form of intensional inclusion between the latter and the former. Once we have a set of assertions of the above mentioned kinds, we may benefit in several ways from the possibility of reasoning about them. For example, one can check whether one class represented in the cooperative information system is incoherent, i.e. it has an empty extension in every state, or can deduce that the extension of a class A in the schema Si is always a subset of the extension of the class B in Sj, so that accessing Si is useless if we want
to retrieve all the instances (stored in any of the information systems) of the concept represented by B.

4.2 Specifying Interschema Knowledge

In this subsection we describe our approach for specifying interschema knowledge in terms of interdependencies between classes belonging to different schemas. Suppose that the heterogeneous information system is constituted by n individual information systems, called component information systems, whose schemas are S1,...,Sn with alphabets B1,...,Bn. We assume that all the schemas are expressed using the Graph Model and that some extra assertions are available to the designer. Let S0 be a further schema (having its own alphabet B0 and set of assertions c0), called the common knowledge schema of the cooperative information system. Intuitively, S0 represents the general properties of the classes that should be considered common knowledge in the heterogeneous information system. Obviously, in those applications where such knowledge is not available, the set of assertions c0 will be empty. Interschema knowledge is then specified in terms of interschema assertions. There are five kinds of such assertions, whose forms are:

L1 defint L2
L1 isaint L2
L1 defext L2
L1 isaext L2
L1 exclext L2

where in every assertion L1 represents an Si expression and L2 represents an Sj expression, in such a way that L1 and L2 are expressions of the same type, and i ≠ j. Moreover, if L1 and L2 are role expressions, then the set of components appearing in L1 must have the same cardinality as the set of components appearing in L2. We now discuss the intuitive meaning of the five kinds of interschema assertions by means of examples, and refer to [Catarci Lenzerini 1993] for their formal semantics. The first assertion states that the expression L1 is intensionally equivalent to L2.
Intuitively, this means that, if Si and Sj referred to a unique set of objects in the real world, then the extension of L1 would be the same as the extension of L2. Therefore, the above assertion is intended to state that, although the extension of L1 may be different from the extension of L2, the concept represented by L1 is in fact the same as the concept represented by L2. As a simple example, the S1-class UndergraduateStudent can be declared intensionally equivalent to the S2-class Student ∩ ¬ GraduateStudent, to reflect that, even if the instances of the two expressions may be different in the various states of the cooperative information system, the

concept of UndergraduateStudent in the schema S1 is fully captured by the above class expression in the schema S2.

The second assertion states that the class expression L1 of schema S1 is intensionally less general than the class expression L2 (of schema S2). This means that there is a sort of containment between L1 and L2; this containment is conceptual, and is not necessarily reflected at the instance level. In other words, the above intensional relationship is intended to state that, if S1 and S2 referred to a unique set of objects in the real world, then the extension of L1 would be a subset of the extension of L2. For example, Tutor2 may be declared intensionally less general than Teacher3, if the concept of tutor as represented in the schema S2 is subsumed by the concept of teacher in the schema S3.

The third assertion states that L1 and L2 are always extensionally equivalent: in every state of the cooperative information system, the set of objects that are instances of L1 in S1 is the same as the set of objects that are instances of L2 in S2.

The fourth assertion states that the extension of L1 is always a subset of the extension of L2. For example, if the class Student1 refers to the students of a University Department, and the schema S2 refers to the whole University, then we may assert that Student1 is extensionally less general than Student2.

Finally, the fifth assertion states that the set of instances of L1 in S1 and the set of instances of L2 in S2 are disjoint.

4.3 Cooperative Information System

A cooperative information system CIS can be characterized on the basis of the knowledge it contains. In particular, at the class level we can define it as CIS = ⟨S1,...,Sn, S0, Σ⟩, where
• S1,...,Sn is the collection of the n Typed Graphs of the various information systems;
• S0 is a common knowledge schema, which may be empty;
• Σ is a set of schema assertions of the above form.
Σ is partitioned into a number of sets:
− Σ G, which contains assertions relating exclusively to the global schema;
− Σ IS, which contains assertions describing relationships between the component schemas; and
− a set of schema assertions {Σ L1,...,Σ Ln}, where Σ Li contains assertions relating exclusively to the typed graph Si, 1 ≤ i ≤ n.

Σ G and any of the local assertion sets Σ Li may be empty.
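For concreteness, the structure CIS = ⟨S1,...,Sn, S0, Σ⟩ and the partition of Σ can be sketched as a small data structure. This is an illustrative encoding, not the paper's formalism; the schema and class names are invented.

```python
from dataclasses import dataclass, field

# Illustrative encoding of CIS = <S1,...,Sn, S0, Sigma>.  An assertion
# is a triple (kind, L1, L2), with kind one of the five interschema
# forms; Sigma is kept partitioned as in the text.

KINDS = {"defint", "isaint", "defext", "isaext", "exclext"}

@dataclass
class CIS:
    component_schemas: list                                   # S1,...,Sn
    common_schema: object = None                              # S0, may be empty
    global_assertions: set = field(default_factory=set)       # Sigma_G
    interschema_assertions: set = field(default_factory=set)  # Sigma_IS
    local_assertions: dict = field(default_factory=dict)      # Sigma_Li per Si

    def add_interschema(self, kind, l1, l2):
        assert kind in KINDS, "unknown assertion kind"
        self.interschema_assertions.add((kind, l1, l2))

cis = CIS(component_schemas=["S1", "S2", "S3"])
cis.add_interschema("isaext", "GraduateStudent@S2", "Student@S1")
```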

In [Catarci Lenzerini 1993] the conditions for a global interpretation m of CIS to satisfy the interschema assertions are formally stated. An interpretation m for CIS is called a model of CIS if every assertion in Σ is satisfied by m. If σ is an intraschema or interschema assertion, we say that σ is a logical consequence of CIS if σ is satisfied by every model of CIS.

Most of the reasoning we want to perform on a cooperative information system CIS can be reduced to the problem of checking whether an assertion σ is a logical consequence of CIS. In [Catarci Lenzerini 1993] a technique is described that can be used to devise a solution to this problem. We will not go into the details of this technique in the present paper, but we show how to use it for building a so-called interschema knowledge network, which is a compact structure reflecting all the relevant interdependencies holding between classes of different schemas. The interschema knowledge network is defined as a directed graph whose nodes represent classes, and whose arcs represent extensional containment between classes. The construction of the interschema knowledge network associated with a cooperative information system is again based on the above mentioned reasoning technique, and proceeds as follows:
1) select from the schemas S0, S1,...,Sn the nodes that are to be associated with the nodes of the network;
2) for every pair of selected nodes whose extensional containment is a logical consequence of CIS, draw an arc between the corresponding nodes in the network;
3) compute the transitive reduction of the network, i.e., iteratively remove from the network every arc from n1 to n2 such that there is a node n3 with arcs ⟨n1, n3⟩ and ⟨n3, n2⟩ in the network.

By comparing the interschema knowledge network with the cooperative system itself, we can point out three main characteristics of the network. First, the network refers to a selected subset of (not necessarily all) the nodes in the various schemas constituting the system. There are several possibilities for such a selection. For example, we may select only atomic classes from the various schemas, or we may include all atomic classes plus some important complex classes, or we may want to represent in the network all the classes denoted by some expression in the interschema assertions.
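Step 3 (transitive reduction) admits a direct implementation. The following Python sketch is purely illustrative: the network is represented as a set of arcs, and every arc implied by a longer path is removed; the class names are invented.

```python
# Step 3 of the network construction: remove every arc (n1, n2) for
# which some intermediate node yields a path of length >= 2 from n1
# to n2.  Arcs denote extensional containment between classes.

def transitive_reduction(edges):
    """Keep only arcs not implied by a longer path."""
    def reachable_via_path(a, b):
        # DFS from a that must pass through at least one other node
        stack = [m for (x, m) in edges if x == a and m != b]
        seen = set()
        while stack:
            n = stack.pop()
            if n == b:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(m for (x, m) in edges if x == n)
        return False

    return {(a, b) for (a, b) in edges if not reachable_via_path(a, b)}

# Tutor <= Teacher <= Employee makes the direct Tutor -> Employee arc
# redundant, so the reduction removes it.
edges = {("Tutor", "Teacher"), ("Teacher", "Employee"),
         ("Tutor", "Employee")}
reduced = transitive_reduction(edges)
```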
Note that the choice of which classes are represented in the network determines both the complexity and the completeness of the information carried by the network itself. Second, while the system is a flat collection of interschema assertions, the network reflects the structure of interdependencies between classes, by making explicit all the extensional inclusions between selected classes that are implied by the system. Note that such a structure results in a partial order over the selected classes. Finally, the arcs of the network reflect some important inferences that can be drawn from the cooperative system. In particular, the existence of a path in the network from a node n1 to a node n2 implies that all instances of the class represented by n1 are also instances of the class represented by n2. Moreover, the mutual extensional equivalence of a set of selected classes is made explicit in the network by merging the corresponding nodes.

It is worth noting that we can use the interschema network in several ways. First, besides providing information on the correspondence between classes in different schemas, interschema assertions actually constitute a declarative specification of several consistency requirements over different databases. For example, the interschema assertion GraduateStudent2 isaext Student1 specifies that any update operation inserting new instances into the class GraduateStudent of schema S2 should ensure that the same instance is also in the class Student of schema S1. Obviously, if the interschema knowledge is itself incoherent, then no state of the cooperative information system can exist satisfying all the interschema assertions. Therefore checking coherence is one of the basic activities for verifying the correctness of the cooperative information system. In our approach, coherence verification corresponds to checking CIS for satisfiability, and this task can be done directly by exploiting the reasoning technique described in [Catarci Lenzerini 1993]. An important aspect related to the consistency of interschema knowledge concerns schema integration [Cardiff Catarci Santucci 1994]. If we want to integrate two or more schemas belonging to the cooperative information system, we can benefit from the knowledge expressed in our language in all the phases of the integration process. Finally, we will see in the next section that interschema knowledge is crucial in realizing integrated access to the heterogeneous information system (by integrated access we mean any querying operation that may require accessing different component information systems).

4.4 Query Transformations

Suppose a user issues a query Q on a class CI defined in the integrated schema. The immediate mapping of this query is to a series of queries on the component databases against classes C1,...,Cn, where each Ci is either intensionally equivalent to or less general than CI. The query can be represented as C1 INT C2 INT ... INT Cn, where INT refers to a class integration operator, such as outerunion or outerjoin.
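To make the integration operator concrete, the following illustrative Python sketch applies an outerunion to two component extents, padding each tuple to the union of the component schemas with nulls. The class names, attributes, and data are invented for the example.

```python
# Illustrative outerunion: recombine the extents C1..Cn of an
# integrated class.  Each extent is a (schema, rows) pair, with rows
# as dicts; missing attributes are padded with None (null).

def outerunion(extents):
    """Return (attribute list, padded rows) for the combined class."""
    all_attrs = sorted(set().union(*(schema for schema, _ in extents)))
    out = []
    for schema, rows in extents:
        for row in rows:
            out.append({a: row.get(a) for a in all_attrs})
    return all_attrs, out

c1 = ({"name", "age"},  [{"name": "Ann", "age": 22}])
c2 = ({"name", "dept"}, [{"name": "Bob", "dept": "CS"}])
attrs, rows = outerunion([c1, c2])
```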
We define the schema of a class C, where f1(n) = C, as

schema(C) = (AD(n) ∪ ⋃i AD(ni)) − {x | x ∈ NR ∧ |AD(x) ∩ NCu| > 1}, with ni ∈ NCu and n ISA* ni.

Essentially, this corresponds to the set of printable attributes of C, including those of its subclasses, which do not take part in a relationship with any other class.

4.4.1 Transformations within a Class Hierarchy

In this section, we describe the cases in which it is possible to eliminate the need to query some of the component databases, when the query is defined on a single class hierarchy only. By this we mean that there is only a single unprintable node (corresponding to a class identifier) in a "selected" state.

Case 1:

We first consider the case where all of the printable nodes of CI are set to "displayed"2, and all edge labels are set to T.

• If there is an interschema assertion Ci defext Cj, then:
− if schema(Cj) ⊆ schema(Ci), Cj can be removed from Q;
− if schema(Ci) ⊆ schema(Cj), Ci can be removed from Q;
− if schema(Ci) = schema(Cj), either Ci or Cj can be removed from Q.
• If there is an interschema assertion Ci isaext Cj, and schema(Ci) ⊆ schema(Cj), then Ci can be removed from Q. Similarly, if there is an interschema assertion Ci isaext Cj, and schema(Cj) ⊆ schema(Ci), then Cj can be removed from Q.

Case 2: Only some of the printable nodes of CI are set to "displayed"; the edge labels are all set to T. Here, we consider only the portion of the schema of class CI having nodes set to "displayed", and proceed as in Case 1.

Case 3: Some of the edge labels of class CI are not set to T. If an edge label between a role node and a printable node is other than T, it effects a restriction on the result of the query. Let Eres be the set of edges in class CI whose labels are not T. For each class schema Si, let Ei be the set of edges in this hierarchy.
• If Eres ⊆ Ei does not hold, Si can be removed from Q.
• If there is an inter-schema assertion which is defined to hold under a certain condition, and the restrictions defined in Eres enforce this condition, then the transformations described for Case 1 can be applied.
• For each label L associated with the edges in Eres:
− if there is an intra-schema assertion σ defined on schema Ci such that any extension satisfying σ must also satisfy L, then L can be replaced with T;
− if there is an intra-schema assertion σ defined on schema Ci such that no extension can simultaneously satisfy σ and L, then Ci can be removed from Q.
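The removal rules of Case 1 can be paraphrased in code. The sketch below is an illustrative simplification in which schema() is modelled as a plain attribute set and assertions as triples; the class names and attributes are invented.

```python
# Case 1 rules: given extensional assertions between the component
# classes C1..Cn of an integrated class, return those classes whose
# contribution to Q is subsumed and which can therefore be dropped.

def removable(assertions, schemas):
    out = set()
    for kind, ci, cj in assertions:
        if kind == "defext":                 # extensionally equivalent
            if schemas[cj] <= schemas[ci]:
                out.add(cj)
            elif schemas[ci] <= schemas[cj]:
                out.add(ci)
        elif kind == "isaext":               # ext(Ci) subset of ext(Cj)
            if schemas[ci] <= schemas[cj]:
                out.add(ci)
    return out

schemas = {"C1": {"name", "age"}, "C2": {"name", "age", "dept"}}
# C1 defext C2 with schema(C1) ⊆ schema(C2): C1 adds nothing to Q
drop = removable([("defext", "C1", "C2")], schemas)
```

Dropping even one component class here avoids an entire subquery against the database holding it.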

2 The nodes set to "displayed" correspond to the concepts which have to be included in the GMDB D@ representing the query result.

4.4.2 Transformation of Queries between Class Hierarchies

Here, we describe the cases in which it is possible to eliminate the need to query some of the component databases when the query spans two or more class hierarchies, i.e., there is more than one unprintable class-node (corresponding to a class identifier) in a "selected" state. In the following, we assume that the query contains a selected role node R having label ⟨x, y⟩, where x and y are unprintable nodes corresponding to classes X and Y.

Case 1: If the nodes set to "displayed" in the GMDB are all under a single class hierarchy, say X, and there is an assertion X isa ∃R.Y, then the query can be transformed to consider X alone. Essentially, such a query asks for information on instances of X that participate in a role R with instances of Y. The assertion states that every instance of X participates in this role, and therefore it is not necessary to confirm this during execution. An example of this type of query is: "List the students who study a course." Given the assertion Student isa ∃Study.Course, the query can be transformed to: "List all students." This type of transformation can be applied at the level of the integrated schema, or on individual component schemas, depending on the level at which the assertions are stated.

Case 2: Here we consider more specific assertions. Again, say the nodes set to "displayed" in the GMDB are all under a single class hierarchy, X. Let ⟨x, d⟩ be the label of a role node R played by X and D, where D is a printable node. The label on the edge linking R and D is a restriction on the values of D: ∀D.{r}. If there is an assertion (X and ∀D.{f}) isa ∃S.Y, then if the restriction ∀D.{r} is stronger than or equivalent to ∀D.{f}, the selected instances of X will also satisfy (X and ∀D.{f}). In this case, as above, it is not necessary to confirm during execution that these instances of X participate in the role S with Y.
Similarly, if ∀D.{r} conflicts with ∀D.{f}, i.e., an instance of X cannot satisfy both conditions simultaneously, then the result of the query is empty. As in Case 1, this type of transformation can be applied at the level of the global schema, or on individual component schemas, depending on the level at which the assertions are stated.

Case 3: If the nodes set to "displayed" in the GMDB are under both the class hierarchies of X and Y, the opportunities for transformation are limited to a subset of Case 2. If we assume that the same assertion holds, and a restriction of the same format is applied to the query, then, as before, if ∀D.{r} conflicts with ∀D.{f}, the result of the query is empty. The transformations in Case 2 can be applied to each component database, since there may exist a schema where schema(Yi) ⊆ schema(YI), and the set of displayed nodes is in YI − Yi. As an example, consider the query: "List the names of students taking a course, and the director of each course." If there is a component schema in which the role node "Has Director" does not exist, then on this database the query can be simplified to listing the names of all students. Another query, "List students enrolled in an arts course", may be simplified in the same way if a specific university offers only arts courses.

Case 4: We can detect queries that are guaranteed to have no answer using assertions of the type C isa not C', or C isa not ∃R. Assertions of this type are used when the user issues a query which has no meaning in the database, for instance "List all courses having an age."

4.4.3 Transformations using Domain Assertions

We can consider two categories of assertions on domains which can be useful for effecting query transformations.

Case 1: Direct Domain Assertions. These place a direct constraint on the domain of an attribute, independent of all others. They take the form C isa ∀A.(<expression>), where <expression> can be a numeric range or a set of valid strings. Examples of this type of constraint are Person isa ∀Age.({1 ... 100}), or Student isa ∀Degree.({Arts, Science, Engineering, ...}). For subclasses, stronger assertions may be stated, e.g., Student isa ∀Age.({18...31}). Assertions of this type can benefit the optimisation process in two ways:
• If σ is an assertion and f is a restriction in a query (i.e., a labelled edge), then if σ logically implies f (written σ → f), the label can be set to T. This will reduce local processing.
• More usefully, if σ → ¬f, then the query result is empty.
As before, these assertions can be made at either the global or local levels.
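For numeric ranges, both effects of a direct domain assertion reduce to simple interval checks. The following Python sketch is an illustrative simplification (closed intervals only; the Student example follows the assertion above):

```python
# A direct domain assertion C isa ∀A.({a1...a2}) and a query
# restriction {r1...r2} on the same attribute, both as closed
# intervals.  If the assertion implies the restriction, the label can
# be set to T; if they are disjoint, the query result is empty.

def implies(assertion, restriction):
    """[a1, a2] implies [r1, r2] iff the assertion lies inside it."""
    (a1, a2), (r1, r2) = assertion, restriction
    return r1 <= a1 and a2 <= r2

def contradicts(assertion, restriction):
    """Disjoint intervals: no value can satisfy both."""
    (a1, a2), (r1, r2) = assertion, restriction
    return a2 < r1 or r2 < a1

student_age = (18, 31)                       # Student isa ∀Age.({18...31})
redundant = implies(student_age, (0, 100))   # "Age <= 100": drop the label
empty = contradicts(student_age, (40, 99))   # "Age >= 40": empty result
```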

Case 2: Value Dependent Assertions. These take the form

(C and ∀A'.{<expression1>} [and ∀A''.{<expression2>} and ...]) isa ∀A.{<expression3>}

The meaning is that, for any instance of class C, if the conditions specified on the left hand side of the isa hold, then the condition on the right hand side must hold. For example, the assertion Person and ∀Age.(>40) isa ∀Salary.(>20000) states that any person over 40 earns more than $20,000. Such constraints may also apply across related classes. For example, (Student and ∀LivesIn.{X}) isa (Course and ∀TakenIn.{X}) means that students must live in the same country as that in which their course is taken. The optimisation possibilities for these assertions are an extension of those for direct domain assertions. If σ is a value dependent assertion of the form C ∧ f isa g, and the query contains two edges whose labels are r1 and r2, then:
• if f → r1 and g → r2, the edge whose label is r2 can be set to T;
• if f → r1 and g → ¬r2, the query result is empty.

5. Global Semantic Query Optimisation

In this section, we describe strategies for exploiting the intra- and inter-schema knowledge in order to transform a query issued by a VENUS user into a form that can be executed as efficiently as possible.

5.1 Representation of Global Execution Strategies

In order to produce an efficient optimisation strategy for the global query, we need some means of representing its constituent queries and the cost factors involved in its computation. Distributed query optimization algorithms traditionally consider two cost parameters: the local processing costs and the communication costs. Several of these (e.g., SDD-1 [Bernstein et al. 1981]) disregard the former, assuming it to be negligible with respect to the latter. We believe that this is too general an assumption, for the following reasons:
• the heterogeneous database may be spread over several networks, which may have widely varying data transfer costs, and
• the speed at which a query is processed depends on several local factors, such as the size of the database, the current load, and the power of the machine on which it resides.

Therefore we regard the cost of executing a distributed query as a function of the processing and communication costs. Although there are other minor parameters that should strictly be considered (e.g., load, network traffic, etc.), their currency is difficult to guarantee, and in any case they are usually insignificant in comparison to the processing and communication costs.

We introduce two representations of a decomposed global query. The first, which we call a Conceptual Global Query Tree (CGQT), is a generic representation of the decomposed global query. There is a single CGQT associated with each global query. In effect, this information describes the operations to be effected to compute the result, without explicitly stating how they will be implemented. Clearly, there can be many alternative strategies for implementing a specific CGQT − an integration operation can be split over several queries, and each of these can be executed at any site. We use a structure called an Implementation Global Query Tree (IGQT) to represent specific execution strategies. These are described in detail below.

5.1.1 Conceptual Global Query Trees

We can regard any global query as being executed by three types of operation:
• local queries, which identify the subset of a class's extension which is relevant to the query,
• logical reconstruction of the class, which can be achieved using an outerjoin or outerunion operation, and
• result class computation, in which the inter-class restrictions and projections are performed.

We need to identify, for any given execution strategy of the global query, the minimum time in which the query can be executed. The component queries cannot all be executed independently of each other, since the result of one query may be required as input to another. Therefore the execution time is bounded by the time taken to execute the most expensive sequence of consecutive component queries. To describe this process more formally, we use a structure called a Conceptual Global Query Tree (CGQT). A CGQT is a directed acyclic graph whose nodes are the component queries of the global query. The edges indicate that the result of one component query is an operand of another, with the direction of data movement being determined by the arrow. The root node query computes the final result of the global query.
The concept behind CGQTs is similar to that of relational algebra query trees [Elmasri Navathe 1989], to the extent that a node represents an operation and the edges represent the flow of data from the leaves to the root. In addition, a CGQT defines a specific execution plan for a global query, and we can apply heuristics to transform it into a form which can be executed more efficiently. The idea of using a tree or graph structure to represent the execution of a global query is not new (see for example [Dayal Hwang 1984], [Florescu et al. 1995]); however, we define its structure differently, apply different types of transformations, and extend the graph structure in certain cases.
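The lower bound described above (execution time dominated by the most expensive chain of dependent component queries) can be computed by a longest-path traversal of the CGQT. The sketch below is illustrative; the node names and cost figures are invented.

```python
# Minimum execution time of a CGQT: the cost of the most expensive
# root-to-leaf chain of dependent component queries.  children[q]
# lists the queries whose results q consumes; cost[q] is an assumed
# execution cost for query unit q.

def critical_path_cost(query, children, cost):
    """Longest-path cost from the leaves up to the given node."""
    below = max((critical_path_cost(c, children, cost)
                 for c in children.get(query, [])), default=0)
    return cost[query] + below

children = {"result": ["join12", "q3"], "join12": ["q1", "q2"]}
cost = {"result": 2, "join12": 5, "q1": 4, "q2": 1, "q3": 3}
bound = critical_path_cost("result", children, cost)
# the chain result -> join12 -> q1 dominates: 2 + 5 + 4 = 11
```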

One of the main problems we experience in determining a global query processing strategy is that not only is there a large search space of equivalent CGQTs, but there are also several CGQTs which will yield the required result. Figure 4 shows a “canonical” form of a CGQT. In this diagram:
• The leaf nodes correspond to local queries; the data being processed is controlled by a single DBMS. Where two or more local queries can contribute the same data, this is represented by a branch in the data flow.
• The next level corresponds to the reconstruction of the logical classes: the outerjoin operation (or outerunion in the case of disjoint entity sets) is applied to the results of the local queries.
• The intermediate nodes represent inter-class restrictions, which operate on the data produced by lower nodes.
• The root node represents the result of the global query.


Figure 4: Conceptual Global Query Tree Structure

5.1.2 Implementation Global Query Trees

An Implementation Global Query Tree is a global query execution strategy. It identifies a specific method of executing a CGQT, which in this context can be regarded as a “generic” specification of a global query. Accordingly, there are several IGQTs that can implement any given CGQT. The overall structure of an IGQT is similar to that of a CGQT; however, there are some required implementation details added. There are three types of label in an IGQT: an internal and external label associated with a node, and a label associated with an edge.
− the internal node label identifies a site at which the query is to be executed,
− the external node label gives the estimated cost of processing this query at this site, and

− the edge label gives the estimated cost of transferring the result of a query to the receiving site.
In addition, there is no concept of “alternative data sources” as in a CGQT, since an IGQT indicates a specific data source. Figure 5 shows the structure of an IGQT. The external node and edge label values are denoted pc and cc respectively.
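As an illustration of these labels, the sketch below attaches an execution site and a processing-cost estimate pc to each node, and a communication-cost estimate cc to each edge. The node names, sites and cost values are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class IGQTNode:
    name: str
    site: str                                   # internal label: execution site
    pc: float                                   # external label: processing cost estimate
    inputs: list = field(default_factory=list)  # list of (child, cc) pairs

def total_cost(node):
    """Sum of all processing and communication costs in the subtree (illustrative)."""
    return node.pc + sum(cc + total_cost(child) for child, cc in node.inputs)

s1 = IGQTNode("Student1", "@GMDB1", pc=26)
s2 = IGQTNode("Student2", "@GMDB2", pc=31)
join = IGQTNode("reconstruct Student", "@GMDB2", pc=40, inputs=[(s1, 10), (s2, 0)])
print(total_cost(join))   # 26 + 31 + 40 + 10 + 0 = 107
```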


Figure 5: Implementation Global Query Tree Structure

5.2 HDB Query Optimisation Methodology

The transformations described in Section 4 vary considerably in their effect on the overall cost of the GQT, and so they cannot be applied indiscriminately to each GQT under consideration, since this may impact the overall execution time. Instead we adopt the following tactics for the application of transformations.

The global query is first considered for semantic transformations using only the global assertions, ΣG. The assertions in ΣG describe properties relating exclusively to GMDBI, and are useful in identifying redundancy or inconsistencies in a query specification, due to user ignorance of the domain, for instance. The output of this phase is a semantically “relevant” query, i.e. one without redundancy and inconsistency. In effect, this analysis is equivalent to semantic query optimisation on a centralised database.

The query is then decomposed from its specification on GMDBI to component queries on the underlying Export Schemas GMDB1...n. Query decomposition has been well explored in the literature (see for example [Meng Yu, 1995]) and is not considered in this paper. The output of this phase is a Conceptual Global Query Tree (CGQT), a representation of an execution strategy for the global query on the component Graph Model databases.

There are several categories of transformation that can be applied immediately to the canonical GQT; however, we restrict application initially to the identification of redundant and alternative execution sites. This phase performs a series of initial semantic transformations on the CGQT, identifying:

• those databases which contain no data relevant to the result,
• those which need not be queried, due to the query semantics, and
• those which contain data replicated over two or more sites, any of which can be used in computing the result.
The output is a semantically transformed CGQT.

The next phase uses the CGQT to derive a number of potential execution strategies, represented as IGQTs. Since there is a large number of IGQTs that can be derived from a CGQT, heuristics are used to narrow the search space of candidate IGQTs chosen. Further semantic transformations are then applied to the candidate IGQTs. The set of IGQTs still represents a very large search space of potential execution strategies, and so transformations are applied selectively using heuristics which are non-exhaustive, but are designed to identify a reasonably efficient execution plan.

These phases are summarised in Figure 6, and are described in detail in the following subsections.


Figure 6: Phases of Global Query Processing

5.2.1 Global Semantic Transformation

The purpose of Global Semantic Transformation is to identify redundancies and inconsistencies in the specification of the global query, e.g. queries in which every object satisfies a restriction in an edge label, or in which no object can satisfy such a restriction. Global assertions can be expected to be small in number, and so are easy and cost-effective to apply.

The application of global transformations is straightforward: for each class Ci ∈ GMDBI having state of “selected” or “displayed”, use pattern matching techniques to identify the assertions in ΣG in which Ci is referenced. If the conditions specified are met, the transformation is performed.

Algorithm Global-Transformation
For each edge ei ∈ E, 1 ≤ i ≤ n, having f2(ei) ≠ T, // the set of restricted edges
    Let f2(ei) = f(f1(ni)) // f1(ni) is the name of the restricted role node
    // apply direct domain assertions
    Let ΣG,i,dda be the set of direct domain assertions in ΣG relating to node ni
    If ∃ σ ∈ ΣG,i,dda | f2(ei) → ¬σ Then Q ← φ // query result is empty
    If ∃ σ ∈ ΣG,i,dda | f2(ei) → σ Then f2(ei) ← T
    // apply value dependent assertions
    Let ΣG,i,vda be the set of value dependent assertions in ΣG of the form C ∧ f(ei) isa ∀g(ej), i ≠ j
    If ∃ σ ∈ ΣG,i,vda | f2(ei) → f(ei) and ∃ ej ∈ E, i ≠ j | g → ¬f2(ej) Then Q ← φ // query result is empty
    If ∃ σ ∈ ΣG,i,vda | f2(ei) → f(ei) and ∃ ej ∈ E, i ≠ j | g → f2(ej) Then f2(ej) ← T
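The two direct-domain cases of this algorithm can be sketched as follows, under the simplifying assumption that both the restriction f2(ei) and the assertion are numeric interval bounds, so that logical implication reduces to interval containment. The representation is ours, not the paper's.

```python
EMPTY, TRUE = "empty result", "T"   # T is the unrestricted label, as in the paper
INF = float("inf")

def implies(a, b):
    """Every value satisfying interval a also satisfies interval b."""
    return a[0] >= b[0] and a[1] <= b[1]

def disjoint(a, b):
    return a[1] < b[0] or b[1] < a[0]

def global_transform(restriction, assertion):
    """One direct-domain step of Global-Transformation for a single restricted edge."""
    if disjoint(assertion, restriction):
        return EMPTY         # the restriction contradicts the assertion: empty result
    if implies(assertion, restriction):
        return TRUE          # every object satisfies the restriction: drop the label
    return restriction       # nothing can be inferred

# Assertion: Age > 18 holds for all instances; query restriction: Age > 17.
print(global_transform((17, INF), (18, INF)))   # -> 'T' (restriction is redundant)
print(global_transform((0, 16), (18, INF)))     # -> 'empty result'
```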

5.2.2 Initial CGQT Optimisation

This phase is one of the most important phases of the transformation algorithm, in the sense that the transformations having the greatest impact on query cost can be effected at this stage. During this phase, the inclusion and exclusion extensional assertions are used to identify sites which need not be queried. A site can be excluded from the global query if the assertions allow us to infer any of the following:
• no instances used in computing the result are stored at that site,
• the set of instances used in computing the result is a subset of those stored at another site, or
• the set of instances used in computing the result is equivalent to a set stored at another site.
In the first two cases, the subquery has no net effect on the global query result, and can simply be eliminated from the CGQT, which from a computation viewpoint is clearly the most desirable outcome. In the third case, a set of component queries is identified having the property that execution of any one of the queries in the set will contribute the same data to the result. Therefore, only one of these queries needs to be executed, rather than all. This is represented on the CGQT by a branch (see Figure 4).

Algorithm Initial-GQT-Opt
Input: Conceptual GQT
Output: Transformed CGQT
• Let C be any unprintable class node in GMDBI. C is a generalisation of component class nodes C1 ... Cn, on local schemas GMDB1...GMDBn.
• Let selected(C) be the set of role nodes associated with unprintable class node C which are set to selected in Q.
• Let displayed(C) be the set of role nodes associated with unprintable class node C which are set to displayed in Q.

For every class node C ∈ CGQT having component class nodes C1 ... Cn
    For each i, 1 ≤ i ≤ n,
        If displayed(Ci) ∩ displayed(C) = φ and selected(Ci) ∩ selected(C) = φ
        Then remove Ci
    For each j, 2 ≤ j ≤ n,
        If ∃ σ ∈ ΣIS | σ has operands Ci and Cj
            If σ = Ci isaext Cj, and displayed(Ci) ⊆ displayed(Cj), and selected(Ci) ⊆ selected(Cj)
            Then remove Ci
            If σ = Ci defext Cj, and displayed(Ci) = displayed(Cj), and selected(Ci) ⊆ selected(Cj)
            Then only one of Ci and Cj needs to be executed

Although the complexity of this algorithm increases exponentially with the number of component classes, the number of these can be expected to be very small in general, and therefore this should not be a degrading factor in the performance of the optimisation process.

5.2.3 Construction of Candidate IGQTs

A CGQT can be regarded as a canonical form of a multidatabase query. It needs to be transformed into a form in which it can be efficiently executed. This transformation process consists of two activities:
• decomposing the logical integration operations into a sequence of subqueries that collectively effect that operation, and
• determining at which site each subquery is to be executed.
An execution strategy for a CGQT is called an Implementation GQT (IGQT). Clearly the search space of IGQTs which implement a given CGQT is very large, and so the prime purpose of this phase is to identify a small subset of the IGQTs which can be considered as potential execution strategies.
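The size of this search space is easy to see in a sketch: if each subquery may run at any site holding one of its operands (a heuristic discussed below), the candidate IGQTs correspond to the Cartesian product of the per-subquery site sets. The encoding below is a hypothetical illustration.

```python
from itertools import product

def candidate_sites(operand_sites):
    """Sites at which at least one operand of the subquery is located."""
    return set().union(*operand_sites)

def candidate_assignments(subqueries):
    """All site assignments: one candidate IGQT skeleton per combination."""
    per_query = [sorted(candidate_sites(ops)) for ops in subqueries]
    return list(product(*per_query))

# Two subqueries: one over data held at sites 1 and 2, one over data at sites 2 and 3.
print(candidate_assignments([[{1}, {2}], [{2}, {3}]]))
# -> [(1, 2), (1, 3), (2, 2), (2, 3)]
```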

The Candidate IGQT set is constructed by employing a number of techniques:
• application of heuristic transformation rules which are known to reduce the query execution cost,
• application of heuristic rules for selection of a site at which to execute a subquery, and
• application of schema assertions which can identify query simplification opportunities.
Algebraic transformations and site selection are research issues which have received attention in other projects, both in the distributed (homogeneous) and heterogeneous cases. We therefore do not focus on these techniques, but advocate the adoption of previously developed strategies. We summarise the main points of these below, and then discuss the application of schema assertions in identifying optimisation possibilities.

5.2.3.1 Heuristics for Transforming Queries and Site Selection

By performing algebraic transformations, an integration operation can be divided into a number of smaller operations that can be executed over several sites, in such a way as to minimise the overall response time. These activities have been well researched as part of the query decomposition phase in several research projects, e.g. SDD-1 [Bernstein et al. 1981], R* [Selinger Adiba 1980], and Multibase [Dayal Hwang 1984]. The results of the Multibase project are particularly important in our case, since they generalise the tactics and transformations developed for homogeneous databases to autonomous heterogeneous systems. In [Dayal Landers Yedwab 1982], several heuristics are proposed to reduce the search space over which integration operations are optimised. These include the distribution of selection and join over generalisation, and the use of the semijoin and semiouterjoin operations. Since the focus of our work is on the application of semantic transformations to effect global query transformation, we will assume the application of these transformations without describing them further.
The allocation of subqueries to specific sites further increases the potential search space. We apply the following heuristics to minimise the search space:
• at least one of the operands must be located at the site, and
• in cases where there are two or more reasonable alternative sites for execution of a subquery, the choice between them should be made based on local processing capabilities, bandwidth of network links, available memory, and query language expressiveness.

5.2.3.2 Heuristics for Transforming Queries using Schema Assertions

As described above, there has already been considerable work done on the multidatabase query decomposition problem, and we consider that this work can be used within our system.

However, we can improve on existing query decomposition techniques by exploiting semantic knowledge in the cost estimation of inter-site queries. The traditional approach to cost estimation when producing a query execution plan is to take the “average” expected execution time of a particular inter-site integration operation, given approximate knowledge of the cardinalities, network speeds, etc. In this section, we describe the conditions under which we can revise these estimates downwards, based on interschema relationships which we know to hold. Say that class C is integrated from component classes {C1,...,Cn}, that Ci and Cj, 1 ≤ i, j ≤ n, i ≠ j, are two such classes, and that σ is an inter-schema assertion in ΣIS between Ci and Cj. Then if:
• σ = Ci defexcl Cj, the integration operation is simplified to an outerunion operation, since the sets of instances of Ci and Cj are mutually exclusive. This means that the cost of processing the integration operation is significantly lower.
• σ = Ci isaext Cj, execution of the integration operation at Cj's site is likely to be more efficient, since no redundant data will be transferred.
• σ = Ci defext Cj, execution of this integration operation at an early stage is likely to be beneficial, since there will be no “dangling” values, and therefore the overall size of the result is lower than would otherwise be estimated.
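These three cases can be summarised in a small cost-revision sketch. The discount factors below are purely illustrative placeholders, since the paper does not quantify the revisions.

```python
def revise_cost(default_cost, assertion_kind):
    """Revise the default inter-site integration cost estimate, given an
    interschema assertion between the two operand classes.
    assertion_kind: 'defexcl', 'defext', 'isaext' or None."""
    if assertion_kind == "defexcl":
        return default_cost * 0.2   # outerjoin degrades to a cheaper outerunion
    if assertion_kind == "defext":
        return default_cost * 0.5   # no dangling values, so a smaller result
    return default_cost             # isaext guides site choice, not this estimate

print(revise_cost(200, "defexcl"))   # -> 40.0
print(revise_cost(200, "isaext"))    # -> 200
```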

5.2.4 Selective IGQT Transformation

The application of the above sets of heuristics will reduce the search space of candidate IGQTs to a small subset, one of which will be chosen as the execution strategy. Clearly, we wish to choose as the execution strategy the IGQT having the fastest response time. There is still the opportunity to perform semantic transformations using the local schema assertions. However, applying transformations to the IGQTs indiscriminately would be prohibitively expensive. Our approach to this problem is to apply the transformations selectively to specific sequences of queries in an IGQT.

The path of execution in an IGQT is from the leaves (representing queries on local data) to the root (representing the result); an intermediate node cannot execute until its operands (results from preceding queries) have been computed and transmitted to the site of execution. Therefore the overall execution time is bounded by the sequence of queries taking longest to execute. We refer to such a sequence − where the output of one query is input to the next − as a chain, and the chain taking longest to execute as the longest chain. Since the estimated execution time of an IGQT is bounded by its longest chain, it follows that applying transformations to chains other than the longest chain will have no effect on the overall response time. We therefore restrict application of transformations to the longest chain in each candidate IGQT. The transformation algorithm described for the Global Schema is used in this case.

Successful transformations will shorten this chain. The IGQT is then examined to determine whether this chain is still the longest. If it is, then no further reduction in response time can be achieved by applying further transformations. On the other hand, if a different chain is now the longest, transformations are applied to it. This process continues until the transformation process does not change the longest chain. The IGQT having the minimum longest chain is then chosen for execution.
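The iterate-on-the-longest-chain loop can be sketched as below, with chains abstracted to name-to-cost pairs and a toy transform standing in for the semantic transformations (all numbers are hypothetical).

```python
def optimise_igqt(chains, transform):
    """Repeatedly apply `transform` to the currently longest chain, stopping
    when a pass leaves the longest chain's cost unchanged.
    chains: dict mapping chain name to estimated execution time."""
    while True:
        longest = max(chains, key=chains.get)
        new_cost = transform(longest, chains[longest])
        if new_cost == chains[longest]:
            return chains               # longest chain unchanged: stop
        chains[longest] = new_cost      # re-check which chain is longest now

# Toy transform: assume transformations shave 100 off any chain costing over 450.
result = optimise_igqt({"c1": 588, "c2": 414},
                       lambda name, cost: cost - 100 if cost > 450 else cost)
print(result)   # -> {'c1': 388, 'c2': 414}
```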

Algorithm Selective-IGQT-Opt
INPUT: A set of Global Query Trees GQT1 ... GQTn
OUTPUT: A Global Query Tree GQT’, derived from GQTi, 1 ≤ i ≤ n
1. Let ci1 ... cim be the set of chains in Global Query Tree GQTi, 1 ≤ i ≤ n.
   Let cij = {qij1 ... qijp}, 1 ≤ i ≤ n, 1 ≤ j ≤ m.
   Let pcijk be the cost of executing component query qijk, 1 ≤ i ≤ n, 1 ≤ j ≤ m, 1 ≤ k ≤ p.
   Let ccijk be the communication cost incurred from the transmission of the result of component query qijk to its destination site, 1 ≤ i ≤ n, 1 ≤ j ≤ m, 1 ≤ k ≤ p.
2. For i in 1...n,
   − Find ci,max, the chain whose estimated execution time is max over j = 1...m of ( Σ(k=1..p) pcijk + Σ(k=1..p−1) ccijk )
   − Apply assertions in ΣLi to ci,max, following the Global-Transformation algorithm.
     If ∃ cij, cij ≠ ci,max, such that the estimated execution time of cij is greater than that of ci,max
     Then { ci,max ← cij; apply relevant assertions in ΣLi to ci,max }
     Repeat until there is no change in ci,max.
3. Find cmin, the chain whose estimated execution time is min over i = 1...n of ( ci,max ).
4. Return the GQT containing this chain.
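Step 2's cost expression can be sketched directly: a chain is encoded as a leaf-to-root list of (pc, cc) pairs, where the final query's result is not transmitted, so only the first p−1 communication costs count. The encoding is an assumption of ours.

```python
def chain_cost(chain):
    """Estimated execution time of a chain: sum of pc for k = 1..p
    plus sum of cc for k = 1..p-1 (the root's result is not shipped)."""
    pcs = sum(pc for pc, _ in chain)
    ccs = sum(cc for _, cc in chain[:-1])
    return pcs + ccs

def longest_chain(chains):
    """ci,max: the chain with the maximum estimated execution time."""
    return max(chains, key=chain_cost)

# Two hypothetical chains of three component queries each.
igqt = [[(26, 10), (40, 5), (45, 0)],
        [(31, 20), (40, 5), (45, 0)]]
print(chain_cost(longest_chain(igqt)))   # 31+40+45 + 20+5 = 141
```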

5.3 Example

Consider the following example, in which a query to retrieve the names of all Students over 17 living in Ireland is posed to GMDBI (Figure 7).

Figure 7: Global Graph Model Schema (the Student class, with role nodes GPA, Age and Name, is linked by Lives in to the City class, with role nodes Name and Country; the query labels are Age > 17 and Country = “Ireland”)

Now say that GMDBI is formed by the Export Schemas shown in Figure 8: Student schemas at Sites 1, 2, 5 and 6, City schemas at Sites 3 and 4, and schemas containing both Student and City at Sites 7, 8 and 9.

Figure 8: Component Graph Model Schemas

The interschema and intraschema knowledge is as follows:
ΣG = {<Age > 18>}
ΣIS = {<…>, <…>, <…>}
ΣL7 = {<…>}
ΣL9 = {<…>}

Phase 1: Global Query Transformation
Application of the global assertion <Age > 18> results in the removal of the label on the edge between Age and Integer.

Phase 2: Initial CGQT Optimisation
Query decomposition produces the initial CGQT shown in Figure 9. The local queries indicate the GMDB on which they are executed.

Figure 9: Initial CGQT (the local queries Student1, Student2, Student5, Student6, Student7, Student8, Student9 and City3, City4, City7, City8, City9 feed inter-site queries that reconstruct the Student and City classes, which in turn feed the construction of the Result class)

Application of the Init-CGQT-Opt algorithm results in the following transformations to this CGQT:
• Student5 is removed, using the assertion <…>,
• Student8 and Student9 are tagged as alternative sources of the same data.
The Optimised CGQT is shown in Figure 10.


Figure 10: Optimised CGQT

Phase 3: Construct Candidate IGQTs
It is infeasible to present all candidate IGQTs produced under this phase. We therefore show two sample IGQTs in Figure 11 and demonstrate for these how the algorithm proceeds. The maximum path through the IGQT is highlighted in each case.

Figure 11: Example IGQTs (IGQT 1 and IGQT 2, annotated with execution sites @GMDB, processing costs pc and communication costs cc; the longest chain is highlighted in each)

The schema assertion heuristics are now applied to each IGQT. The following transformations take place:
IGQT 1: Application of assertion <…> causes revision of the cost estimate of the integration of Student1 and Student2 to 40.
IGQT 2: None.

Phase 4: Selective IGQT Optimisation
We now attempt optimisation of the queries by iterating through each IGQT, considering only the longest chain on each iteration. In the initial state, the longest chains have costs of 588 and 466 for IGQT1 and IGQT2 respectively. These chains are highlighted in Figure 11.

Iteration 1
IGQT 1: Application of local value dependent assertion <…> is in conflict with the query label “Country = Ireland”. Since this query result will be empty, the query on Student7 can be removed.
IGQT 2: Application of local value dependent assertion <…> makes the query label “Country = Ireland” redundant.

The transformed IGQTs are shown in Figure 12.

Figure 12: IGQTs After First Transformation (the revised IGQT 1 and IGQT 2, with updated costs and their longest chains highlighted)

Iteration 2
As a result of these transformations the chains have been reduced to 414 for IGQT1 and 456 for IGQT2. The longest chain in IGQT2 is now 460, as shown in Figure 12, and since this is a different chain the algorithm is repeated for this chain.
IGQT 2: Application of local value dependent assertion <…> is in conflict with the query label “Country = Ireland”. Since this query result will be empty, the query on Student7 can be removed.

The transformed IGQT is shown in Figure 13.

Figure 13: IGQT 2 After Iteration 2

The transformation reduces the cost of the chain to 330. There is no change in the longest chain, and so the IGQT having the minimal longest chain, IGQT2, is chosen for execution.

5.4 Appraisal

To the best of our knowledge, this work is the only comprehensive treatment of semantic query optimisation in a heterogeneous environment. The transformation process we have described exploits at different points both exclusively global and exclusively local knowledge. This could lead to criticism on the grounds that a Cooperative Information System should not maintain global assertions, and that local assertions may not be available due to autonomy requirements. However, the process we have described can work effectively without either set of this information (i.e. if ΣG and ΣLi are empty): they are exploited if available, and can produce beneficial transformations. The main contribution of this algorithm, however, is in the use of interschema assertions (ΣIS) to effect transformations, and its operation is not affected by the absence of the other classes of assertions.

The initial approaches to semantic query optimization (e.g. [King 1981]) concentrated on acquiring a more optimal query, and the costs associated with finding this query were ignored. The QUIST system, for example, has strong expressive power of schema assertions (as statements in clausal logic), but the issue of detecting the profitable constraints is not addressed. Shenoy and Ozsoyoglu's semantic query optimizer is more efficient in terms of optimization costs, but there is a trade-off against weaker expressive power of schema assertions. The most general work on semantic query optimization to date is that of Chakravarthy, who considers semantic query optimization in the context of a deductive database. The schema assertions relevant to a given relation are readily available, as they are associated directly with the relation by the semantic compilation process.
However, the fact that they are compiled makes them very difficult to modify. The approach of dynamic plan generation by subquery monitoring taken by Lu et al. is very specific in its focus, and does not consider the multiplicity of plans that may exist due to data replication and semantic information. Florescu et al. exploit semantic information during query reformulation, but purely for the purpose of global to component query transformation, rather than in the context of heterogeneous semantic query optimisation.

6. Summary and Conclusions

This paper has presented a methodology for semantic query optimisation in a heterogeneous database environment. During its development, we have focussed our efforts on keeping optimisation costs to a minimum, applying schema assertions in such a way as to eliminate where possible the need to perform integration operations (reducing their complexity where elimination is not possible), exploiting data replication, and targeting strategic sections of a query execution plan for the application of optimisation strategies.

It is particularly important to address the issue of global semantic query optimisation for two reasons. Firstly, in a heterogeneous database, a user typically interacts with the system through a VQS, which in our case is built upon the Graph Model. This hides the details of the underlying models, storage characteristics, and location of the data; consequently the user, when formulating a query, has no control over how the query will be executed. Secondly, the user is likely to interact with a view, or external schema [Sheth Larson 1990], rather than with a global conceptual schema. Although we assume a global schema in VENUS, in a large system with many cooperating databases this view is unrealistic. As a consequence of these two points, a query decomposition algorithm that fails to exploit schema assertions may produce an extremely inefficient query execution plan, containing for instance redundant restrictions or inter-class assertions.

One of the most important aspects of the VENUS Project is its ability to integrate data from multiple heterogeneous databases, allowing the end user to interact with a conceptually single database. To this aim, we propose an adaptive visual interface as an appropriate interaction tool for the final user. However, such a user may issue queries which are either unnecessarily complicated or very inefficient if mapped directly onto the DBMSs.
In this paper we present our proposal for avoiding such drawbacks. In particular, we describe a semantic query optimization tool for the VENUS system, exploiting the notions of intra- and inter-schema knowledge to optimise a VENUS user's query on two levels. On the global level, the interschema knowledge can be used to select the set of queries to component databases which can be executed in minimal time, and on the local level, the inter- and intra-schema knowledge is used to convert the individual queries to more efficient representations of the same queries.

We intend to progress this work by investigating methods of performing data mining from local assertions. Typically, it is relatively simple to derive local assertions, since application domain experts will have detailed knowledge of individual systems; there is, however, no such expertise about the relationships that exist between systems, and we intend to automate the extraction of interschema knowledge from these assertions.

Our approach to semantic query optimization provides a very flexible framework. The schema assertions which are used to perform semantic query transformations are defined declaratively. This allows them to be used for other purposes such as integrity maintenance, and also provides the flexibility to augment the classes of assertion without major reorganisation. The optimization costs are kept as low as possible by applying a fixed set of

algorithms to the query, in a pre-defined order. The algorithms recognise cases where a more efficient query might be achieved, and only such promising cases are examined further. Only transformations which can decrease the overall response time of the query are applied. The termination problem (i.e., of determining when to cease applying additional rules to a query) is avoided by applying each strategy only once to a given query. Although some optimization opportunities may be lost by not re-applying the strategies, the majority of the transformations, and the most restrictive, can be performed in a single pass. The system can be tailored to suit a particular user's needs − novice users may require a high degree of optimization, whereas experienced users, being familiar with the application, are more likely to issue correct, efficient queries and so need less optimization. The system also serves as an educator: semantically incorrect queries are notified to the user with an indication of the integrity constraint that has been violated, thereby increasing the user's awareness of the system and lessening their reliance on the semantic query optimizer. Furthermore it is entirely modular, both in the sense that specific parts of it may be installed and removed at will, and in the sense that it is applicable both to centralised and distributed databases, with very little modification required to extend it.

References

[Barsalou Gangopadhyay 1992] T. Barsalou and D. Gangopadhyay - M(DM): An open framework for interoperation of multimodel multidatabase systems - in Proc. of the 8th IEEE Data Engineering Conference.
[Bernstein et al. 1981] P. Bernstein et al. - Query Processing in a System for Distributed Databases (SDD-1) - ACM Transactions on Database Systems, Vol. 6, No. 4.
[Brodie Ceri 1992] M. L. Brodie and S. Ceri - On intelligent and cooperative information systems: a workshop summary - International Journal of Intelligent and Cooperative Information Systems, Vol. 1, N. 1.
[Cardiff 1990] J. Cardiff - The Design of an Extensible Semantic Query Optimization System - PhD Thesis, University of Queensland, Australia.
[Cardiff Catarci Santucci 1994] J. Cardiff, T. Catarci and G. Santucci - Interfacing Heterogeneous Databases - Technical Report DI-17/08-009, Esprit Project 6398 “VENUS”.
[Catarci Chang Santucci 1994] T. Catarci, S. K. Chang and G. Santucci - Query Representation and Management in a Multiparadigmatic Visual Query Environment - Journal of Intelligent Information Systems, Special Issue on “Advances in Visual Information Management Systems”, Vol. 3, N. 3/4, pp. 1-32.
[Catarci D’Angiolini Lenzerini 1994] T. Catarci, G. D’Angiolini and M. Lenzerini - Statistical Data Modeling based on Concept Logic - Data and Knowledge Engineering, to appear.
[Catarci et al. 1994] T. Catarci, S. K. Chang, M. F. Costabile, S. Levialdi and G. Santucci - A Visual Interface for Multiparadigmatic Access to Databases - IEEE Transactions on Knowledge and Data Engineering, to appear.

[Catarci Lenzerini 1991] T. Catarci and M. Lenzerini - Conceptual Database Modeling Through Concept Modeling - in Information Modelling and Knowledge Bases III, S. Ohsuga et al. (Eds), IOS Press, Tokyo.
[Catarci Lenzerini 1993] T. Catarci and M. Lenzerini - Representing and Using Interschema Knowledge in Cooperative Information Systems - Journal of Intelligent and Cooperative Information Systems, Vol. 2, N. 4, pp. 375-398, World Scientific.
[Catarci Santucci Angelaccio 1993] T. Catarci, G. Santucci and M. Angelaccio - Fundamental Graphical Primitives for Visual Query Languages - Information Systems, Vol. 18, N. 2.
[Catarci Santucci Cardiff 1997] T. Catarci, G. Santucci and J. Cardiff - Graphical Interaction with Heterogeneous Databases - VLDB Journal, to be published in 1997.
[Chakravarthy Fishman Minker 1985] U.S. Chakravarthy, D.H. Fishman and J. Minker - Semantic Query Optimization in Expert Systems and Database Systems - in Expert Database Systems, Kerschberg Ed., Benjamin/Cummings Publishing Co.
[Chakravarthy 1986] U.S. Chakravarthy - Semantic Query Optimization in Deductive Databases - Ph.D. Thesis, Dept. of Computer Science, University of Maryland.
[Chakravarthy Minker Grant 1987] U.S. Chakravarthy, J. Minker and J. Grant - Semantic Query Optimization: Additional Constraints and Control Strategies - in Expert Database Systems, Kerschberg Ed., Benjamin/Cummings Publishing Co.
[Dayal Hwang 1984] U. Dayal and H. Hwang - View Definition and Generalisation for Database Integration in a Multidatabase System - IEEE Transactions on Software Engineering, pp. 628-644.
[Dayal Landers Yedwab 1982] U. Dayal, T. Landers and L. Yedwab - Global Query Optimization in Multibase, a System for Heterogeneous Distributed Databases - Technical Report TR82-05, Computer Corporation of America.
[Du Krishnamurthy Shan 1992] W. Du, R. Krishnamurthy and M.C. Shan - Query Optimization in a Heterogeneous DBMS - in Proc. of the 18th International Conference on Very Large Data Bases, Vancouver.
[Elmasri Navathe 1989] R. Elmasri and S. Navathe - Fundamentals of Database Systems - Benjamin/Cummings.

[Fang Hammer McLeod 1991] D. Fang, J. Hammer and D. McLeod - The Identification and Resolution of Semantic Heterogeneity in Multidatabase Systems - Proc. of the 1st IEEE International Workshop on Interoperability in Database Systems.

[Florescu et al. 1995] D. Florescu, L. Raschid and P. Valduriez - A Methodology for Query Reformulation in CIS Using Semantic Knowledge - manuscript.

[Geller Perl Neuhold 1991] J. Geller, Y. Perl and E. Neuhold - Structural Schema Integration in Heterogeneous Multi-Database Systems Using the Dual Model - Proc. of the 1st IEEE International Workshop on Interoperability in Database Systems.

[Krishnamurthy Litwin Kent 1991] R. Krishnamurthy, W. Litwin and W. Kent - Language Features for Interoperability of Databases with Schematic Discrepancies - Proc. of the 1991 ACM SIGMOD International Conference on Management of Data.

[Landers Rosenberg 1982] T.A. Landers and R. Rosenberg - An Overview of Multibase - in Distributed Databases, Schneider (Ed.), North Holland Publishing Company.

[Li McLeod 1991] Q. Li and D. McLeod - An Object-Oriented Approach to Federated Databases - Proc. of the 1st IEEE International Workshop on Interoperability in Database Systems.

[Lu et al. 1992] H. Lu, B-C. Ooi and C-H. Goh - On Global Multidatabase Query Optimization - SIGMOD Record, Vol. 21, No. 4.

[Meng Yu 1995] W. Meng and C. Yu - Query Processing in Multidatabase Systems - in Modern Database Systems, Kim (Ed.), Addison-Wesley.

[Ozsu Valduriez 1991] T. Ozsu and P. Valduriez - Principles of Distributed Database Systems - Prentice Hall.

[Selinger Adiba 1980] P. Selinger and M. Adiba - Access Path Selection in Distributed Database Management Systems - Proc. of the International Conference on Databases, Aberdeen, Scotland.

[Shenoy Ozsoyoglu 1987] S. Shenoy and Z. Ozsoyoglu - A System for Semantic Query Optimization - Proc. of the ACM-SIGMOD Conference on Management of Data.

[Shenoy Ozsoyoglu 1989] S. Shenoy and Z. Ozsoyoglu - Design and Implementation of a Semantic Query Optimizer - IEEE Transactions on Knowledge and Data Engineering, Vol. 1, No. 3.

[Sheth Larson 1990] A. Sheth and J. Larson - Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases - ACM Computing Surveys, Vol. 22, No. 3.

[Sheth Kashyap 1992] A. Sheth and V. Kashyap - So Far (Schematically), So Near (Semantically) - Proc. of the IFIP DS-5 Conference on Semantics of Interoperable Database Systems, Elsevier.

[Zdonik 1980] S. Zdonik - On the Use of Domain Specific Knowledge in the Processing of Database Queries - M.Sc. Thesis, MIT.
