InterData


Research programme (co-financed by MURST, fiscal year 1997)

Methodologies and technologies for the management of data and processes on Internet and Intranet networks

The Integrated WG-Log System for Querying Semi-structured Information

Sara Comai, Ernesto Damiani, Barbara Oliboni, Letizia Tanca

T2-R17

15 February 1999

Abstract

Nowadays several information systems comprise a number of heterogeneous datasources, such as relational, object-oriented or object-relational databases. Moreover, an increasing amount of information is being made available through unstructured and semi-structured datasources such as collections of textual documents and Web sites. Our proposal outlines a distributed architecture for heterogeneous datasources, by means of schemata expressed via WG-Log, a common data description and query language independent of the native datasource models. As a first result, we show how O2 (an object-oriented database) and Tsimmis (a system for integrating semi-structured datasources) can be queried by means of WG-Log.

Theme: Theme 2: Extraction of information distributed on the WWW
Code: T2-R17
Date: 15 February 1999
Product type: Technical report
Responsible unit: VR
Units involved: VR
Authors: Sara Comai, Ernesto Damiani, Barbara Oliboni, Letizia Tanca
Contact author: Sara Comai, Dipartimento di Elettronica e Informazione, Politecnico di Milano, P.zza L. da Vinci, 32 - 20133 Milano, [email protected]


1 Introduction

Nowadays information systems usually comprise a considerable number of heterogeneous datasources, some of which are conventional relational databases while others adopt different data models, such as object-oriented or object-relational ones. Moreover, an increasing amount of information is being made available through unstructured and semi-structured datasources such as collections of textual documents (e.g. e-mail messages) and Web sites. While all these datasources are currently managed and accessed by applications independently from one another, it is widely recognized that, in the fullness of time, they will have to be accessible in an integrated and uniform way to both end users and software application layers. This fully integrated scenario, however, presents several problems, some of which are listed in the following.

• Integration quality. When integration support is missing or insufficient, applications involving multiple heterogeneous datasources must locate, retrieve and pre-process all relevant data before they can be used. Indeed, when data are integrated in a naive fashion, semantic discrepancies among the various data representations may well occur; for instance, some information may be missing in some sources, an attribute may be single-valued in one source and multi-valued in another, or the same entity may be represented by different types in different sources.

• Integration effort. Considerable development effort is therefore required to ensure via the application layer that the integrated data are well structured and conform to a single, uniform abstract schema. Additional programming effort is also required if one or more information sources change, or when new sources are added.

• Semi-structured information management. Focusing on semi-structured datasources, let us consider the data stored on enterprise-wide internal Web sites. At each site, HTML pages are frequently updated, and even the overall site structure may change without notice. Currently, not all these Web sites store their information in a database system; it is clear, however, that client applications could take advantage of database support, e.g., by having the ability to pose queries involving logical data relationships, which are usually known by the site's creators but not made explicit.

• Control of application-datasource coupling. A fully-fledged integration architecture, while allowing sophisticated interaction styles between applications and datasources, should not hinder the possibility of closely coupling applications and specific datasources in order to achieve the desired performance level.

In order to deal with these problems, our present proposal outlines a distributed architecture allowing description of heterogeneous datasources, some of which are provided with their own local data description and query language, by means of schemata expressed via WG-Log, a common data description and query language totally independent of the native datasource models. As a first result of this general approach, in this paper we show how the two systems O2 (a typical object-oriented database) and Tsimmis [CGH94] (a system for the integration of semi-structured datasources) can be queried by means of the WG-Log language. Details on the WG-Log language and system can be found in [CDT98a, BCDT98, CDPT98].

The paper is organized as follows: in Section 2, our general architecture is briefly outlined. Section 3 briefly describes the use of CORBA-compliant middleware to support system integration; Section 4 deals with interacting with foreign datasources by means of WG-Log and delves into the integration of O2, an O-O datasource currently used in the context of the ENEL information system, and of the semi-structured object model OEM. In Section 5 the design and implementation of the proposed architecture, including communication protocols and the use of CORBA-compliant design patterns, are briefly discussed. Finally, in Section 6 we draw some conclusions.

2 The General Integration Architecture

In order to deal with the problems presented in Section 1, our approach relies on a distributed architecture (Fig. 1) whose main focus is the uniform description of heterogeneous datasources by means of schemata expressed via WG-Log, a common data description and query language independent of the data models of the heterogeneous datasources. Datasources' schemata are made available to client applications through specialized modules called Traders, which manage repositories of known schemata and help clients to identify the datasources best suited to their needs, based on the information contained in the schemata.

Besides allowing datasource search and identification based on a description of their semantics, the WG-Log common language can be used for querying both structured datasources (regardless of their native data model and local language) and semi-structured ones. In the first case, a set of Mediators will be available to translate queries expressed in the common language into the specific query languages available at target sites. The query results produced by remote sites are sent back to the Mediator and translated into a WG-Log result instance, which is returned to the client. In the case of semi-structured datasources, query execution will be performed by Remote Query Managers running at the target sites. Remote Query Managers can compute query results on the basis of the query itself and a synthetic internal representation of site data. In the case of Web sites, such a representation can either be extracted from the HTML pages or exist from the beginning of the site's life.

As we shall see, as a consequence of the high expressive power of WG-Log, it is theoretically possible that some kinds of WG-Log queries simply cannot be translated into the query languages available at some target datasource. However, this problem turns out not to be relevant in practice, since WG-Log queries are usually posed taking the WG-Log schemata of the datasources into account. Since these schemata are obtained by translating the foreign data into the common language, only expressible WG-Log queries will be formulated.

Figure 1: The general architecture (components shown: WG-Log Client, Schema Robot, WG-Log Schemata (native or O2 derived) held by the Trader, WG-Log to O2 Mediator, O2 client and O2, Remote Query Manager and WG-Log Source, WG-Log to Foreign datasource Mediator, foreign-language Client and Foreign datasource)

Finally, it is worthwhile pointing out that our proposed architecture does not require any modification of the datasources' operation, freely allowing special purpose and legacy applications to continue to access their specific datasources using local query languages in the conventional way. However, the architecture also comprises a Naming Service aimed at interactive clients, such as conventional browsers. This service allows for fast datasource identification on the basis of an extended name following the well-known Uniform Resource Locator standard. In this case, the target site's schema is supposed to be known to the client, which can submit WG-Log queries directly to the remote datasource via the appropriate Mediator.

3 Middleware-based Techniques for Interfacing Foreign Datasources

The WG-Log schema-based approach is particularly suitable for representing data which have a well-defined structure, because with this method we obtain a compact data representation simply by translating data representations expressed using foreign data models. However, this approach can also be applied to represent loosely structured data or, in the worst case, totally unstructured data, even if in the latter case the schema and the instance tend to become identical. Our aim is to exploit the structure of data whenever it is possible, without losing the capability of describing unstructured data sources. Moreover, we intend to exploit distributed object standards in order to define a uniform set of interfaces for our integration architecture, regardless of the data model and the level of structure of the involved datasources.


3.1 The CORBA Standard

The past few years have witnessed a tremendous increase of interest in object-oriented client/server applications and, in general, in distributed computing, which has made significant advances due to the diffusion of World Wide Web technology. To encourage a tight coupling of distributed objects and the Web, the Object Management Group (OMG) decided to promote the CORBA (Common Object Request Broker Architecture) object-oriented distributed software architecture [YD96]. Using CORBA, a client can transparently (i.e., without knowing the servers' implementation details) invoke a method on a server object in two different ways: static invocation or dynamic invocation. With dynamic invocation, the server is chosen at run time, according to the features of its interface (e.g., using the method signatures provided by the servers' interfaces). CORBA also provides a Trading Service that identifies an object offering the services requested by the client on the basis of functional information: it contains descriptions of the services and offer properties provided by the servers, such as what the servers exactly do, and how and where they do it.

The OMG has developed a conceptual model, known as the core object model, and a reference architecture, called the Object Management Architecture (OMA), upon which applications can be constructed. The OMA attempts to define, at a high level of abstraction, the various facilities necessary for distributed object-oriented computing. It consists of four components: an Object Request Broker (ORB), Object Services (OS), Common Facilities (CF), and Application Objects (AO). The Object Services specifications define a set of objects that perform fundamental functions such as naming services, life cycle services, transaction services or trader services. Generally speaking, they augment and complement the functionality of the ORB, whereas the CORBA Common Facilities provide services of direct use to application objects.

The core of the OMA is the ORB component, a transparent communication bus for objects that lets them make requests to, and receive responses from, other objects located locally or remotely. In other words, the ORB allows client and server objects, which can be written in different languages, to interoperate. It intercepts calls and is responsible for finding an object that can execute them, passing it the parameters, invoking its methods and returning the results. Moreover, operation invocations can be done either statically at compile time or dynamically at run time with a late binding of servers.

The client side is composed of IDL (Interface Definition Language) stubs, a Dynamic Invocation Interface (DII), an Interface Repository and an ORB interface. The client-side IDL stubs provide the static interfaces to object services and define how clients invoke the corresponding services on the servers. On the other hand, the DII allows clients to construct and issue a request, whose signature is possibly unknown until run time, using information from the Interface Repository. As for the ORB interface (the only component of the architecture shared by both sides), it allows functions of the ORB to be accessed directly by the client code.

The implementation side consists of server IDL skeletons that provide static interfaces to each service exported by the server, a Dynamic Skeleton Interface (DSI), an Object Adapter, an Implementation Repository and the ORB interface. The DSI (the server equivalent of the DII) looks at the parameter values in an incoming message to determine a target object and method. The Object Adapter sits on top of the ORB's core communication services and accepts requests on behalf of the server's objects. It provides the run time environment for creating instances of server objects, passing requests to them and registering their classes in the Implementation Repository.
As said previously, in addition to static method invocation, CORBA also provides a dynamic distributed invocation mechanism. Using CORBA's Dynamic Invocation Interface, Naming Services, Trader Services and the Interface Repository, a client application can discover new objects at run time and dynamically invoke their methods, with a late binding of servers. Clearly, the DII provides a very dynamic environment that allows systems to remain flexible and extensible.

Figure 2: The CORBA architecture (components shown: Client, Object Implementation, IDL Stubs, Dynamic Invocation, ORB services, IDL skeleton, Object Adapter, Object Request Broker)

In CORBA, the dynamic identification of an object takes place in four steps [S96]:

1. The Trader identifies an object offering the service requested by the user on the basis of its functional properties.
2. Using the Interface Repository, the Trader Service retrieves the object interface, as well as a reference to it (an IOR, or Interoperable Object Reference).
3. According to the description of the signature of the method (number and types of arguments), it constructs the invocation.
4. Finally, it invokes the object's method with adequate parameters and receives the results.

Remote Method Invocation (RMI), which is part of the Java Development Kit, was designed to support remote method invocations on objects across Java virtual machines. Its main characteristics are:

• It lets programmers move code in addition to data.
• As it is based on Java, it ensures that the downloaded code is safe.
• It uses Java both as an interface definition language and as an implementation language.

RMI integrates a distributed object model into the Java language. Like CORBA, RMI lets the user invoke methods on a remote object as if it were a local object, and also lets a reference to a remote object be passed as an argument or returned as a result. Moreover, RMI provides interfaces and classes to find remote objects, load them, and then run them securely. Currently, it also provides a primitive naming service to help locate remote objects, and a class loader to let clients download stubs from the server. Furthermore, even if RMI does not offer a dynamic invocation interface as specified by the CORBA standard, it provides dynamic stub loading, which allows clients to dynamically download stubs that reference remote objects from the server. Although RMI provides ORB-like facilities from the Java object model, it is impossible to use it to invoke objects written in other languages. The CORBA standard presumes a heterogeneous and multi-language environment and accordingly a language-neutral object model. On the contrary, RMI assumes the homogeneous environment of the Java Virtual Machine. For the purposes of this paper, we will consider RMI as a CORBA-light solution for remote services invocation.

However, after native CORBA support was added to Java with the release of JDK 1.2, CORBA provides the missing link between the Java portable application environment and the world of services [OH97]. CORBA allows applets to invoke methods on remote objects. Moreover, CORBA provides a rich set of distributed services that extend Java's possibilities, such as transaction services, security services, naming services or trading services. On the other hand, Java simplifies code distribution in CORBA execution environments because its code can be easily managed from one server and, most of all, it provides a portable object-oriented infrastructure that runs on nearly every operating system.
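The four-step dynamic identification described above can be illustrated, by analogy only, in plain Java. The sketch below does not use CORBA or the DII: it assumes a hypothetical in-memory `ToyTrader` that maps a functional property to an object, and it relies on `java.lang.reflect` to inspect the object's interface at run time and build the call. The names `EchoService`, `ToyTrader` and `DynamicInvocationSketch` are illustrative and do not appear in the original system.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Hypothetical service object; it stands in for a remote server object.
class EchoService {
    public String echo(String text, int times) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < times; k++) {
            sb.append(text);
        }
        return sb.toString();
    }
}

// Hypothetical trader: maps a functional property to an object offering it.
class ToyTrader {
    private final Map<String, Object> offers = new HashMap<>();

    void export(String property, Object service) {
        offers.put(property, service);
    }

    // Step 1: identify an object on the basis of a functional property.
    Object lookup(String property) {
        Object service = offers.get(property);
        if (service == null) {
            throw new IllegalArgumentException("no offer for " + property);
        }
        return service;
    }
}

class DynamicInvocationSketch {
    // Steps 2 to 4: inspect the interface, match the signature, invoke.
    static Object invoke(ToyTrader trader, String property,
                         String methodName, Object... args) {
        Object target = trader.lookup(property);
        for (Method m : target.getClass().getMethods()) {        // step 2
            if (m.getName().equals(methodName)
                    && m.getParameterCount() == args.length) {   // step 3
                try {
                    return m.invoke(target, args);               // step 4
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        throw new IllegalArgumentException("no method " + methodName);
    }
}
```

In this analogy the client names neither the class of the target nor its method at compile time; both are resolved when `invoke` runs, which is the late binding that the DII is meant to provide across process and language boundaries.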

3.2 Integration architecture: The Components

Before describing in detail our CORBA-compliant architecture for the integration of heterogeneous datasources, we shall briefly outline the role of its basic modules: client, mediator, wrapper and interface repository.

• Client. The client module is a platform-independent program which, besides allowing graphical query formulation and editing, communicates syntactically valid WG-Log queries to site wrappers. The client module is also able to show site schemata in graphical form as a support for query formulation.

• Mediator. The Mediator module includes two separate though complementary components: a Trader and a Naming service. The Trader is basically a metadata repository holding site schemata. Clients can query this module specifying a content descriptor. In its simplest form, a descriptor X is a simple keyword list which is compared with the data dictionary Y of each stored schema using straightforward lexical nearness, namely $\|X \cap Y\| / \|X \cup Y\|$ (the ratio between the number of common words and the total number of words).¹ The Trader answers with the URLs corresponding to the schemata best matching the descriptor provided by the client. The Naming service, on the other hand, allows for locating sites by name. The two models are complementary in that the result of a query to the Trader is a list of names later to be resolved using the Naming service.

• Wrapper. A Wrapper module is basically a translator from the common language WG-Log to the local query language of a given datasource. In order to perform the translation, Wrappers may query Mediators and local datasources to obtain WG-Log and local schemata.

We are now ready to outline the software design of the modules mentioned above. The dynamic model for our system is rather simple: client modules query Mediators in order to obtain candidate site schemata; then they formulate WG-Log queries which are sent to wrappers. Queries are translated by wrappers into local query languages, if necessary consulting the corresponding WG-Log schema. Finally, query results are back-translated into WG-Log and sent to the client. If the datasource is not available, an exception is generated and communicated to the client.

¹ This matching can also be executed in a semantics-aware fashion, provided that the descriptor is equipped with a simple verb-noun syntax.
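The lexical nearness used by the Trader can be sketched in a few lines of Java. The sketch assumes that "the total number of words" means the number of distinct words in the union of descriptor and data dictionary; the class `LexicalNearness`, its method names and the URL keys are hypothetical and only illustrate the ratio.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class LexicalNearness {
    // Ratio between the number of common words and the number of distinct
    // words in the union (assumed reading of "total number of words").
    static double nearness(Set<String> descriptor, Set<String> dictionary) {
        Set<String> union = new HashSet<>(descriptor);
        union.addAll(dictionary);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(descriptor);
        common.retainAll(dictionary);
        return (double) common.size() / union.size();
    }

    // Returns the URL whose data dictionary is nearest to the descriptor,
    // or null when no schema is stored.
    static String bestMatch(Set<String> descriptor,
                            Map<String, Set<String>> dictionariesByUrl) {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Set<String>> e : dictionariesByUrl.entrySet()) {
            double score = nearness(descriptor, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    // Convenience helper for building keyword sets.
    static Set<String> words(String... w) {
        return new HashSet<>(Arrays.asList(w));
    }
}
```

A Trader built along these lines would return several top-ranked URLs rather than a single one; the single best match is kept here only to keep the sketch short.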


3.3 WQS Interface Repository

The design of our Web Query System (WQS) was aimed at attaining high flexibility through incremental reuse of components. In fact, reuse of available modules is ensured by identifying and encapsulating their variation points. This feature is particularly important for "heavyweight components" such as Wrappers, which must be easily modifiable in order to accommodate additional sites whose data model and query language may differ from those of existing datasources. Distributed object standards such as CORBA and COM/DCOM allow for dealing with this problem through hierarchical IDL-based interface design. CORBA Interface Repositories can be organized either in a hierarchical or in a flat fashion. While in a flat repository interfaces are stored independently from one another, with no factorization of common features, hierarchical repositories allow for identifying and encapsulating variation points while inheriting common features from the higher levels of the hierarchy. In the following, a sample hierarchical interface repository for our WQS is given. The repository is written in JavaIDL, Sun Microsystems' interface definition language associated with its skeleton CORBA implementation, Door ORB.
The Repository IDL interface is described in the following:

    module wqs {
      //=================== TAG AND STRING DESCRIPTORS ==================
      const string SSEPARATOR = "";
      const string ESEPARATOR = "";

      //=================== DTD definition ==============================
      const string STRING1_0 = "string 1.0";      // Keywords
      const string WG_LOG_VER1_0 = "wg-log 1.0";  // WG-Log
      const string RDF_VER1_0 = "rdf 1.0";        // rdf
      const string OEM_VER1_0 = "oem 1.0";        // OEM
      const string OQL_VER1_0 = "oql 1.0";        // OQL
      const string INTERFACE = "interface";       // interface type

      //=================== DATA STRUCTURES =============================
      struct metadata {
        string dtd;
        string data;
      };

      // scope supports management of a hierarchical repository
      struct interface {
        string dtd;
        string scope;
        string data;
      };

      struct descriptor {
        string dtd;
        string data;
      };

      struct url {
        string datasource;
        string address;
      };

      //=================== INTERFACES ==================================
      // wrapper general interface
      interface wrapper {
        metadata sendQuery(in metadata query, in descriptor searchdesc);
      };

      // OQL wrapper interface
      interface wrapperOQL : wrapper {
        string getPersistentRoot(in metadata node);
      };

      // Repository interface
      interface repository {
        void addInterface(in interface interfaceToIns);
        interface getInterface(in string reference);
        interface getScopedInterface(in string theScope, in string reference);
        void delInterface(in descriptor interDesc);
      };
    };
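To make the role of the scope field more concrete, here is a minimal in-memory sketch of a hierarchical repository in Java. It is not the WQS implementation: the IDL above does not say how a reference is matched or how scopes are written, so the sketch assumes that each entry is stored under an explicit reference string, that scopes are slash-separated paths such as `wrapper/wrapperOQL`, and that a scoped lookup falls back to the enclosing scopes. `InterfaceEntry` and `ToyRepository` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Mirrors the IDL struct with fields dtd, scope and data.
class InterfaceEntry {
    final String dtd;
    final String scope;
    final String data;

    InterfaceEntry(String dtd, String scope, String data) {
        this.dtd = dtd;
        this.scope = scope;
        this.data = data;
    }
}

class ToyRepository {
    private final Map<String, InterfaceEntry> entries = new HashMap<>();

    private static String key(String scope, String reference) {
        return scope + "#" + reference;
    }

    // Stores an entry under its own scope (assumption: explicit reference).
    void addInterface(String reference, InterfaceEntry entry) {
        entries.put(key(entry.scope, reference), entry);
    }

    // Looks in the given scope first, then in each enclosing scope,
    // ending with the root scope "". Returns null when nothing matches.
    InterfaceEntry getScopedInterface(String scope, String reference) {
        String current = scope;
        while (true) {
            InterfaceEntry found = entries.get(key(current, reference));
            if (found != null) {
                return found;
            }
            if (current.isEmpty()) {
                return null;
            }
            int cut = current.lastIndexOf('/');
            current = (cut < 0) ? "" : current.substring(0, cut);
        }
    }

    // Unscoped lookup: only the root scope is searched.
    InterfaceEntry getInterface(String reference) {
        return getScopedInterface("", reference);
    }

    void delInterface(String scope, String reference) {
        entries.remove(key(scope, reference));
    }
}
```

With this layout, an operation registered once at the `wrapper` level is visible from `wrapper/wrapperOQL` without being stored again, which is the factorization of common features that a flat repository lacks.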

4 Heterogeneous Datasources Integration

We are now ready to describe in detail the integration techniques that can be used for interfacing heterogeneous datasources with the rest of our general WQS architecture.

4.1 Integration of Totally Unstructured Data

Consider first the case of a standard hard disk, containing a flat file system, whose files contain multimedia documents. The relevant information is available as text files, image files and audio files. In this example we deal with a well-defined data domain (enterprise-wide technical information); this knowledge has to be exploited to create a WG-Log schema representing the large amount of information stored on the hard disk. The presence of a schema makes it possible to refer to the data in a precise and compact way; in fact, an instance graph representation can be added, where all multimedia files are considered just by means of a reference (their file name). Software applications can then take advantage of this representation to pose queries based on the abstract, schema-based representation, while it remains perfectly possible for developers to choose to access individual files on the hard disk through the low-level file system interface.

4.2 Interacting with O2 datasources

We will now describe the integration of a well structured datasource: an O2 database. The O2 object-oriented database system [O2] provides a complete object-oriented database solution for object developers. As it is in use in the framework of a system aimed at the support of the protection information system of the electrical transmission network, we chose it as a case study for WG-Log based datasource integration. As outlined before, integration in the WG-Log environment does not require any interference with the datasource's internal operation. However, the interaction protocol between our Wrapper and O2 datasources must be specified. In the following, we rely on the CORBA standard in order to meet the following requirements:

• OMG-compliant interface. The joint O2-CORBA solution offers a standard OMG-compliant interface, with persistence, transaction and query services.

• Heterogeneous platforms. CORBA and O2 jointly provide a fully heterogeneous platform. For instance, a client on an Intel PC will be able to access an O2 database managed on SPARC or RS-6000 platforms.

• Multiple servers. An O2 system manages a distributed database, i.e., a set of O2 schemas and O2 bases, distributed on a set of disks inside a network.

Since O2 provides a C++/ODMG-compliant interface [C96], it is not only feasible but also very simple to adapt O2 objects to any CORBA environment.

4.3 WG-log to O2/OQL Translation Algorithm

As seen in Section ??, the WG-Log lexicon includes entities, slots, collections and entry points, while labeled edges can be single or double. Moreover, "dummy" nodes can be used to increase the language's expressive power. WG-Log entities are mapped into class membership expressions inside the FROM clause of an OQL query. Such a class is given a conventional name prefixed by "E_". On the other hand, slots are simple objects and are mapped to conditions in the WHERE clause. Thus, the slot name of the class person whose value is "Robert" is translated into a condition such as x.name="Robert". Collections can be translated into OQL lists. The O2 object model allows arcs representing part-of and is-a relationships only. The former will be denoted in WG-Log by arcs whose labels are prefixed by "L_".

O2 path expressions, data manipulation, polymorphism and operator composition make the translation of WG-Log queries a rather straightforward procedure: the remote client sends to the wrapper both the query itself and its associated schema (or a handle to it). The wrapper then computes the O2 query. However, a problem must be solved concerning how the resulting O2 query gets access to data in its persistent storage. There are three ways for an O2 object to become persistent:

1. Insertion in a persistent class.
2. Reachability from a persistent object.
3. Denomination. Any named O2 variable is persistent.

Our wrapper will access the O2 database starting from a standard entry point (PersistentRoot), corresponding to an ad-hoc persistent class. Procedure givePersistentRoot is used to access any object starting from the persistent entry point.

We are now ready to describe the translation process in some detail. The algorithm first deals with translating the red part of the WG-Log query. Then it queries the O2 database, taking the green part of the WG-Log query into account to process the O2 query result. Red dashed arcs are dealt with using O2 negation. Goals are managed as views on the query result. In fact, given a WG-Log rule $R$, an instance $I$ on $G$ satisfies $R$ with goal $G$ if there exists an instance $I'$ satisfying $R$ such that $I'|_G = I$. This concept is readily mapped into an O2 view on the OQL query result.

Before giving our algorithm, we observe that it relies on an OQLQuery class in order to create the OQL query's clauses. A Java version of class OQLQuery is given below:

    /**
     * OQLQuery: this class stores the OQL query with some methods
     */

    import java.util.StringTokenizer;
    import java.util.Vector;

    class OQLQuery {
        // Clauses of the query, plus the entity type bound to each variable
        Vector<String> select, from, where, varType;
        int i = 0; // number of variables allocated so far

        /** Standard constructor. */
        public OQLQuery() {
            select = new Vector<String>();
            from = new Vector<String>();
            where = new Vector<String>();
            varType = new Vector<String>();
        }

        /** Registers a variable type and returns the variable name (x1, x2, ...). */
        public String insertType(String type) {
            varType.addElement(type);
            i++;
            return "x" + i;
        }

        /** Returns the variable bound to the given type, or null if unknown. */
        public String getVar(String type) {
            int index = varType.indexOf(type);
            return index < 0 ? null : "x" + (index + 1);
        }

        /** Returns the type bound to the given variable name. */
        public String getVarType(String name) {
            StringTokenizer parser = new StringTokenizer(name, "x");
            int num = Integer.parseInt(parser.nextToken());
            return varType.elementAt(num - 1);
        }

        /** Inserts a piece of the select clause. */
        public void insertSelect(String cond) {
            select.addElement(cond);
        }

        /** Inserts a piece of the from clause. */
        public void insertFrom(String cond) {
            from.addElement(cond);
        }

        /** Inserts a piece of the where clause. */
        public void insertWhere(String cond) {
            where.addElement(cond);
        }

        /** The standard toString method. */
        public String toString() {
            return "SELECT\n******\n" + select + "\nFROM\n****\n" + from
                    + "\nWHERE\n*****\n" + where;
        }
    }

Here is a pseudocode version of the algorithm:

    OQLQuery translate(WGGraph wgQuery, SchemaID id) {
        Enumeration edges = wgQuery.getEdges();
        OQLQuery query = new OQLQuery();
        while (edges.hasMoreElements()) {
            WGEdge currentEdge = (WGEdge) edges.nextElement();
            WGNode nodeToTranslate = currentEdge.getTo();
            // allocate a fresh variable for the target entity
            String var = query.insertType(nodeToTranslate.getLabel());
            query.insertFrom(var + " IN " + nodeToTranslate.getLabel());
            if ((currentEdge.type()).isNavigational()) {
                query.insertSelect(var);
            }
            Enumeration slots = nodeToTranslate.getSlots();
            while (slots.hasMoreElements()) {
                // WGSlot is an assumed name for the element type of getSlots()
                WGSlot slot = (WGSlot) slots.nextElement();
                query.insertWhere(var + "." + slot.getName() + "=" + slot.getLabel());
            }
        }
        return query;
    }

The algorithm operates by translating one arc at a time; termination occurs when all arcs in the WG-Log query have been translated and is ensured by coloring the already visited arcs. In the following, some of the main features are briefly described.

• Applicability. Let us underline once again that, given the difference in expressive power between the source and the target language, not all conceivable WG-Log queries can be effectively translated into OQL. However, here we are interested only in queries that can be meaningfully posed to an O2 database schema. Our algorithm is therefore applicable to OQL-compatible queries only, i.e. queries composed of a connected RS part (any topology), an RD part formed by paths of length 1 and a GS part consisting of a single node. This class coincides with the simple query notion given in previous sections. As we shall see in the next section, our algorithm translates such queries into OQL as follows:

    select GS from x in Entity where conditions in slots

• Dummy nodes. Dummy nodes can be of any type and act as wildcards in WG-Log queries. Our algorithm deals with dummy nodes by means of the O2 root Object class.

• Multiple inheritance. Our approach currently deals with single inheritance only; however, it could be extended to include multiple inheritance via coloring of nodes, provided that some notion of order among siblings (e.g. lexicographic ordering) is introduced in WG-Log semantics.
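The select/from/where pattern produced for OQL-compatible queries can be tried out with a small self-contained Java sketch. It is a simplification under stated assumptions, not the wrapper's code: `ToyNode`, `ToyEdge` and `ToyTranslator` are hypothetical stand-ins for the WG-Log graph classes, slot values are assumed to be strings and are quoted, and negation, goals and dummy nodes are ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical entity node: a class label plus slot name/value pairs.
class ToyNode {
    final String label;
    final Map<String, String> slots = new LinkedHashMap<>();

    ToyNode(String label) {
        this.label = label;
    }

    ToyNode slot(String name, String value) {
        slots.put(name, value);
        return this;
    }
}

// Hypothetical arc: only the target node and the navigational flag matter here.
class ToyEdge {
    final ToyNode to;
    final boolean navigational;

    ToyEdge(ToyNode to, boolean navigational) {
        this.to = to;
        this.navigational = navigational;
    }
}

class ToyTranslator {
    // Translates one arc at a time: entity -> FROM, slots -> WHERE,
    // navigational arcs -> SELECT.
    static String translate(List<ToyEdge> edges) {
        List<String> select = new ArrayList<>();
        List<String> from = new ArrayList<>();
        List<String> where = new ArrayList<>();
        int i = 0;
        for (ToyEdge edge : edges) {
            String var = "x" + (++i);
            from.add(var + " in " + edge.to.label);
            if (edge.navigational) {
                select.add(var);
            }
            for (Map.Entry<String, String> s : edge.to.slots.entrySet()) {
                where.add(var + "." + s.getKey() + " = \"" + s.getValue() + "\"");
            }
        }
        // Assumption: with no navigational arc, every variable is selected.
        String selected = select.isEmpty() ? "*" : String.join(", ", select);
        StringBuilder q = new StringBuilder("select " + selected
                + " from " + String.join(", ", from));
        if (!where.isEmpty()) {
            q.append(" where ").append(String.join(" and ", where));
        }
        return q.toString();
    }
}
```

For the person example mentioned earlier, a single navigational arc into a `Person` node with slot `name` set to `Robert` yields `select x1 from x1 in Person where x1.name = "Robert"`.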

4.4 Backtranslation of O2 Query Results into WG-log

Once the OQL query has been executed, the result must be backtranslated into WG-Log. There is one main problem worth mentioning, i.e. method translation. While in our design we identify this as an issue to be dealt with at the wrapper level, our system does not yet deal with the representation of active content; therefore we shall not elaborate on method representation in this paper. Besides this, the backtranslation procedure is basically the inverse of the translation procedure described above. Namely, starting from the result entry point, OQL types are iteratively mapped into WG-Log entities, while values are backtranslated into slots. The wrapper uses the WG-Log schema as a guide while visiting OQL query results and creating a WG-Log instance. The backtranslation turns out to be simpler than direct translation, as the algorithm uses the O2 inheritance relationship to insert WG-Log arcs among the entities it has obtained by translating OQL types. A class OQLObject is used to provide the algorithm with service information, such as OQL types, needed to perform the backtranslation.

4.5 Worked-out Examples

An example, in the OMT graphic notation, of an O2 schema reproducing part of the ENEL electrical network information system is shown in Fig. 3. Fig. 4 shows its translation into WG-Log.

[Figure 3 diagram not reproduced. Classes shown: Area, Section, Sub-section, Substation, Bus-bar, Bay, Line bay, Line, Protection device, Circuit breaker, Distance relay, Reclosure device, Voltage transformer, Current transformer. Attributes shown: Name, Code, Type, Voltage, Setting, Data. Relationships shown: sections, sub-sections, substations, bus-bars, bays, line-bays, devices, line, circuit breaker, TV, TA.]

Figure 3: A sample O2 schema

[Figure 4 diagram not reproduced. Entities shown: E_AREA, E_SECTION, E_SUBSECTION, E_SUBSTATION, E_BUS_BAR, E_BAY, E_LINE_BAY, E_LINE, E_PROTECTION DEVICE, E_CIRCUIT BREAKER, E_DISTANCE RELAY, E_RECLOSURE DEVICE, E_VOLTAGE TRANSFORMER, E_CURRENT TRANSFORMER. Links shown: L_SECTIONS, L_SUBSECTIONS, L_SUBSTATIONS, L_BUS_BARS, L_BAYS, L_LINE_BAYS, L_DEVICES, L_LINE, L_CIRCUIT BREAKER, L_TV, L_TA, ISA. Slots shown: NAME, CODE, TYPE, VOLTAGE, SETTING, DATA.]

Figure 4: The WG-Log translation of an O2 schema

[Figure 5 diagram not reproduced. Three query graphs are shown: NAMES (E_SUBSTATION with slots NAME and TYPE = Production), SUBSTATIONS (E_SUBSTATION linked by L_SECTIONS to an E_SECTION with VOLTAGE = 220 KV) and LINES (E_LINE with VOLTAGE = 380 KV, E_LINE_BAY, and E_SUBSTATION with TYPE = Production).]

Figure 5: Sample WG-Log queries

Like a WG-Log schema, an O2 schema contains no atomic values and it is essential for both programs and users to explore the database and formulate queries. In passing, we observe that the WG-Log lexicon is redundant for this purpose, because O2 query syntax does not allow the expression of navigational and presentation-related concepts. However, even though O2 lacks explicit navigational information, any object in the O2 database is either complex or atomic; both types exist in WG-Log, so we can exploit this information to logically connect the objects. WG-Log schemata contain two types of nodes: "non-printable" nodes representing complex object classes and "printable" nodes representing atomic object classes; it is therefore possible to represent complex O2 objects via WG-Log non-printable nodes with label "E_" and atomic O2 objects by means of printable WG-Log nodes (slots) with label "string". In an object-oriented representation, the binding between complex objects and their sub-objects is the relationship "part of". This scenario can be represented in WG-Log by means of logical links from the parent object to subobjects with textual label "L_". If the subobject is itself complex, it will have a set of labeled links to its component objects.

The translation process outlined in this Section can be applied to any O2 object in order to produce a WG-Log graph. Therefore, the translation of an O2 schema yields a WG-Log schema, while the translation of a generic O2 representation yields a WG-Log instance.

We now turn to some examples of the translation of WG-Log queries into OQL queries. As we saw in the previous Section, a general-purpose query translation algorithm has to consider several particular conditions that increase its complexity and impair its readability. In this Section, for the sake of clarity, we show how the translation works on some sample queries (Fig. 5) referring to the schema of Fig. 3.
The O2 version of these queries is given below:

1. SELECT x.name FROM x IN Areas.substations WHERE x.type="production"
2. SELECT x FROM x IN Areas.substations WHERE EXISTS y IN x.sections: y.voltage="220"
3. SELECT x.line FROM y IN Areas.substations, x IN y.line-bays WHERE x.line.voltage="380" AND y.type="production"

Notice the use of the persistent root Areas, collecting the set of all objects of class Area, which is used as entry point for the queries. This was not necessary in WG-Log, where the only references needed are those to the entities directly involved in the query. Such an entry point is known to the Wrapper, which uses it appropriately in the translation phase.

The integration of O2 databases is possible because these datasources provide both the logical structure description (their object-oriented schema) and the knowledge of the instance information. Since, as shown previously, it is possible to translate the schema from O2 to WG-Log, we can store the WG-Log description of O2 sites in our Trader. Whenever the client chooses, during the interaction with the Trader, an O2-derived schema, the WG-Log query formulated by the user is converted into a conventional OQL query. After the translation phase, the OQL query is sent to the site. The query result is an O2 object that is sent back to the Wrapper and translated into a WG-Log result instance.
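The shape of the first two OQL queries above can be reproduced from a compact description of a simple query. The following is a minimal sketch, not the Wrapper's actual translation algorithm: the `SimpleQuery` record, its field names and the `to_oql` function are hypothetical, and the collection reached from the entry point (here `Areas.substations`) is assumed to be supplied by the Wrapper.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleQuery:
    target: str                   # slot of x to return, "" to return x itself
    collection: str               # collection reached from the entry point
    slot_conditions: dict = field(default_factory=dict)    # slot of x -> value
    exists_conditions: list = field(default_factory=list)  # (link, slot, value)


def to_oql(query, var="x"):
    """Build 'SELECT ... FROM x IN ... WHERE ...' for a simple query."""
    select = f"{var}.{query.target}" if query.target else var
    clauses = [f'{var}.{slot}="{value}"'
               for slot, value in query.slot_conditions.items()]
    for link, slot, value in query.exists_conditions:
        clauses.append(f'EXISTS y IN {var}.{link}: y.{slot}="{value}"')
    oql = f"SELECT {select} FROM {var} IN {query.collection}"
    if clauses:
        oql += " WHERE " + " AND ".join(clauses)
    return oql


names = SimpleQuery("name", "Areas.substations", {"type": "production"})
substations = SimpleQuery("", "Areas.substations",
                          exists_conditions=[("sections", "voltage", "220")])
print(to_oql(names))
print(to_oql(substations))
```

The two printed strings coincide with queries 1 and 2. Query 3 ranges over two collections and would need a richer description than this sketch provides.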

4.6 Expressing OEM in WG-Log

Now we proceed to describe the integration of semi-structured datasources. The idea of representing data as a graph has inspired many projects that deal with (re)structuring and querying semistructured data. As a reference for the present work, the graph-based OEM data model was chosen. OEM is currently used in several related projects, such as LORE and C3 [CGM97]. An example of an OEM graph representation is shown in Fig. 6. The proposed standard has three main characteristics:

- object orientation: each OEM object has an Object Identifier (OID) and its representation is composed of three elements:
  - a label that describes what the object represents;
  - the data type of the object value. Any object may be atomic or complex. In the first case the type may be integer, real, string or any other data considered indivisible. The type is "set" if the object is complex;
  - a value that is either the variable-length value for atomic objects or, if the object is complex, a collection of one or more OEM subobjects, each linked to the parent via a "descriptive textual label".
- representation of semantics: the label component in the OEM structure captures the semantics of the object.
- flexibility: the OEM structure is flexible enough to encompass all types of information.
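The three-part representation described above (label, type, value, plus the OID) can be sketched as a small data structure. This is an illustration only: the class and function names are hypothetical, the type is derived from the value rather than stored, and `to_text` merely imitates the look of the textual fragments in Fig. 7 without claiming to follow the exact grammar of [GCC97].

```python
from dataclasses import dataclass
from typing import List, Union

# Assumed mapping from Python types to the atomic OEM type names in the text.
_TYPE_NAMES = {int: "integer", float: "real", str: "string"}


@dataclass
class OemObject:
    oid: str
    label: str
    value: Union[int, float, str, List["OemObject"]]

    def is_complex(self) -> bool:
        return isinstance(self.value, list)

    @property
    def type(self) -> str:
        """'set' for complex objects, otherwise the atomic type name."""
        return "set" if self.is_complex() else _TYPE_NAMES[type(self.value)]


def to_text(obj: OemObject) -> str:
    """Render an object in a bracketed textual form resembling Fig. 7."""
    if obj.is_complex():
        inner = " ".join(to_text(child) for child in obj.value)
        return f"< {obj.label} {{ {inner} }} >"
    rendered = f'"{obj.value}"' if isinstance(obj.value, str) else str(obj.value)
    return f"< {obj.label} {rendered} >"


# Illustrative data; the OIDs are made up for the example.
entree = OemObject("&3", "Entree", [
    OemObject("&4", "Name", "Red_curry"),
    OemObject("&5", "Label_Number", 2),
])
print(to_text(entree))
```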

[Figure 6 diagram not reproduced. Node labels shown: Frodos, Restarea, Restaurant, Name, Type, Price, Entree, Label_N, N_Fax, N_Phone, Opinion, Rating.]

Figure 6: A sample OEM graph

An OEM object may be translated into a standard textual format [GCC97]. This format keeps the information about semantics and contents and can be easily parsed by a program. An example of this format is the text in Fig. 7 (left). This is the textual representation of an OEM database which contains complex (Restaurant, Entree, etc.) and atomic (Name, Type, etc.) objects. As in all semistructured data, in this OEM database there is no schema fixed in advance.

In lieu of a standard schema, the LORE project uses an alternative tool to represent the time-invariant structure of an OEM database: an OEM object, called DataGuide, that is straightforwardly generated from the OEM database and dynamically maintained in order to provide several of the functions normally provided by means of a schema. It is interesting to remark that, from the same OEM database, many DataGuides may be extracted, each of them representing a summary of the structure of the database. We are interested, in particular, in the so-called strong DataGuide [GWi97], which induces a straightforward one-to-one correspondence between source target sets and objects in the DataGuide. Like a WG-Log schema, a DataGuide contains no atomic values and it is essential for both programs and users to explore the OEM database and formulate queries. Since a DataGuide is an OEM object, it is possible to translate it into the standard textual format: Fig. 7 (right) represents the DataGuide of the previous OEM database (its graphical representation is depicted in Fig. 6).

Representing semistructured data in a synthetic way is also the basic aim of WG-Log schemata; in exactly the same straightforward way as a DataGuide is extracted, we can derive a WG-Log graph based on the OEM source data representation.
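The correspondence between target sets and DataGuide objects can be made concrete with a small sketch. It assumes the OEM database is given as a dictionary from OIDs to lists of (label, child OID) pairs, and it builds one DataGuide node per distinct target set, in the spirit of the strong DataGuide of [GWi97]; the function names and the sample OIDs are hypothetical, and this is not the LORE implementation.

```python
def strong_dataguide(children, root):
    """Map each target set (frozenset of OIDs) to {label: target set}.

    `children` maps an OID to its list of (label, child_oid) pairs;
    atomic objects may be absent from it.
    """
    start = frozenset([root])
    guide = {}
    pending = [start]
    while pending:
        targets = pending.pop()
        if targets in guide:
            continue  # this target set already has its DataGuide object
        by_label = {}
        for oid in targets:
            for label, child in children.get(oid, []):
                by_label.setdefault(label, set()).add(child)
        guide[targets] = {label: frozenset(oids)
                          for label, oids in by_label.items()}
        pending.extend(guide[targets].values())
    return guide, start


def follow(guide, start, path):
    """Return the target set reached by a label path in the DataGuide."""
    node = start
    for label in path:
        node = guide[node][label]
    return node


# Illustrative database loosely modelled on the restaurant example.
db = {
    "r": [("Restaurant", "r1"), ("Restaurant", "r2"), ("Restarea", "a1")],
    "r1": [("Name", "n1"), ("Entree", "e1")],
    "r2": [("Name", "n2"), ("Entree", "e2"), ("Entree", "e3")],
    "e1": [("Name", "n3"), ("Rating", "g1")],
    "e2": [("Name", "n4")],
    "e3": [("Name", "n5"), ("Opinion", "o1")],
    "o1": [("Rating", "g2")],
    "a1": [("Name", "n6")],
}
guide, start = strong_dataguide(db, "r")
```

Because nodes are keyed by target set, two label paths that reach the same set of objects end in the same DataGuide node, and the construction terminates even when the database contains cycles.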
Unfortunately, the OEM syntax does not allow the expression of navigational and presentation-related concepts, which are very useful in the representation of Web sites: think for instance of queries involving some kind of "reachability" relationship between Web pages, which cannot be expressed over OEM or DataGuide representations. However, even though OEM-based data lack the explicit navigational information contained in a WG-Log schema, any object in the database is either complex or atomic; both types exist in WG-Log, so we can exploit this information to logically connect the objects. WG-Log schemata contain two types of nodes: "non-printable" nodes representing complex object classes and "printable" nodes representing atomic object classes; it is therefore possible to represent complex OEM objects via WG-Log non-printable nodes with label "E_" and atomic OEM objects by means of printable WG-Log nodes (slots) with label "string". In an OEM representation, the binding between complex objects and their subobjects is the relationship "part of".


< Entree { < Label_Number 2 > } > } > < Restaurant { < Entree { < Name "Route_9Red_curry"> < Opinion { < Rating "Great"> } > } > < Entree { }> }> < Restarea { } > } >

< Type > < Number_Phone > < Number_Fax > < Number_Home > < Entree { < Name > < Price > < Label_Number > < Opinion { < Rating > } > < Rating > } > } > < Restarea { < Name > } > } >

Figure 7: OEM instance (left) and DataGuide (right) textual representation

This scenario is representable in WG-Log by means of logical links from the parent object to subobjects with textual label "L_". If the subobject is itself complex, it will have a set of labeled links to its component objects. The OEM-to-WG-Log translation algorithm is specified in Fig. 8. Thus, OEM objects correspond to WG-Log objects; as an example, consider the OEM DataGuide object and its WG-Log representation in Fig. 9.

The textual format also provides the notion of SymOid (Symbolic Object Identifier) [GCC97] to identify objects included as children of multiple complex objects. In particular, it is possible to specify that a SymOid is persistent: in an OEM database at least one persistent SymOid is required to serve as entry point. Considering this persistent SymOid as the entry point of the corresponding WG-Log schema, we can recursively apply the previous method starting from this identifier, in order to obtain a WG-Log representation of an entire OEM data source. Note that Figure 10 depicts the WG-Log schema derived from the DataGuide textual representation of Fig. 7, where Frodos is the (unique) entry point.

The translation process outlined in this section can be applied to any OEM object and produces a WG-Log graph. DataGuides are themselves OEM objects; therefore the translation of a DataGuide yields a WG-Log schema, while the translation of a generic OEM representation yields a WG-Log instance.

Consider now the translation of WG-Log queries into Lorel queries. The query translation algorithm has to consider several particular conditions that increase its complexity and impair its readability. In this paper we just describe a simplified version of the algorithm, showing how it works on two sample queries. Let us consider the query depicted in Fig. 11, asking for a list of all the Entrees of the "Blues by the bay" restaurant.
The target of the operation is represented in WG-Log with a green node; therefore, in the considered query, the result is a list of Entrees. The target information has to be included in the SELECT clause of the Lorel query. The FROM clause is set by default to "root", and it is different only when more complex queries are translated. The constraints that the result set must satisfy are expressed in the WHERE clause: in the sample WG-Log query, the only condition is that the restaurant name must be the one written in the "name" slot (i.e. "Blues by the bay"). It is necessary to visit the query graph nodes to obtain the proper path expression. The translation algorithm is specified in Fig. 13. The Lorel translation of such a query is then:

SELECT Restaurant.Entree FROM root WHERE *.Restaurant.Name = "Blues_by_the_bay"

The "*" character is used here as a wildcard; in this way Lorel can indicate a variable-length path starting from the root (specified in the FROM clause). Note that this is not necessary in WG-Log, where no language constraint forces navigation to start from the entry point. Similarly, the query depicted in Fig. 12 is translated into the following Lorel expression:

SELECT Restaurant FROM root WHERE *.Restaurant.Entree.Rating = "great" OR *.Restaurant.Entree.Opinion.Rating = "great"

It is worth noticing that in this query we added the boolean operator OR; in this way we can express constraints on disjoint "path-expressions". This translates the WG-Log double-rule query.
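The two translations above can be reproduced by visiting each rule of the query graph to obtain its path expression and joining the resulting conditions with OR. The sketch below is a simplification under assumptions: each rule is a dictionary with a list of (parent, child) links and one constrained slot, labels are normalised by dropping the "E_" or "L_" prefix and capitalising, and all names are hypothetical rather than taken from the actual Wrapper.

```python
def plain_name(label):
    """'E_RESTAURANT' -> 'Restaurant', 'L_name' -> 'Name' (assumed convention)."""
    return label[2:].capitalize()


def entity_path(links, entity):
    """Path expression from the topmost entity of the rule down to `entity`."""
    parent_of = {child: parent for parent, child in links}
    chain = [entity]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return ".".join(plain_name(e) for e in reversed(chain))


def to_lorel(target, rules, root="root"):
    """One WHERE condition per rule; several rules are combined with OR."""
    select = entity_path(rules[0]["links"], target)
    conditions = []
    for rule in rules:
        owner, link_label, value = rule["slot"]
        path = entity_path(rule["links"], owner) + "." + plain_name(link_label)
        conditions.append(f'*.{path} = "{value}"')
    return f"SELECT {select} FROM {root} WHERE " + " OR ".join(conditions)


fig11 = [{"links": [("E_RESTAURANT", "E_ENTREE")],
          "slot": ("E_RESTAURANT", "L_name", "Blues_by_the_bay")}]
fig12 = [{"links": [("E_RESTAURANT", "E_ENTREE")],
          "slot": ("E_ENTREE", "L_Rating", "great")},
         {"links": [("E_RESTAURANT", "E_ENTREE"), ("E_ENTREE", "E_OPINION")],
          "slot": ("E_OPINION", "L_Rating", "great")}]
print(to_lorel("E_ENTREE", fig11))
print(to_lorel("E_RESTAURANT", fig12))
```

The printed strings match the two Lorel expressions given in the text; the rule descriptions are a hand-written encoding of Figs. 11 and 12.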

WGGraph wgraph;
// symoid is the OEM database entry point's identifier
OemObject obj = *symoid;

// translate an OEM object into a WG-Log graph using the persistent SymOid
// as entry point
procedure OemToWGLog(obj) {
    while (obj.hasOtherChildren()) {
        OemObject child = obj.NextChild();
        if (child.isComplex()) {
            // SymOids make it possible to create cycles among OEM objects
            if (!child.alreadyVisited()) {
                // add the complex OEM object as a WG-Log node
                wgraph.addNode(obj, child, child.label);
                OemToWGLog(child);   // recursive call for the complex object
            }
            // add a WG-Log link to a previously visited object
            else wgraph.addLink(obj, child);
        }
        // an atomic OEM object is translated into a WG-Log slot
        else wgraph.addSlot(obj, child, child.value);
    }
}

Figure 8: Algorithm to translate an OEM object into a WG-Log graph
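A runnable counterpart of the procedure in Fig. 8 may help to see how cycles introduced through SymOids are handled. This is a sketch under assumptions, not the authors' code: the OEM database is a dictionary from OIDs to (label, value) pairs, where the value is a list of child OIDs for complex objects; the "E_" and "L_" label conventions are taken from Fig. 10; and the graph is returned as plain dictionaries and lists.

```python
def oem_to_wglog(objects, entry):
    """Translate the OEM objects reachable from `entry` into a WG-Log graph."""
    graph = {"nodes": {}, "links": [], "slots": []}
    visited = set()

    def visit(oid):
        visited.add(oid)
        label, value = objects[oid]
        graph["nodes"][oid] = "E_" + label.upper()
        for child in value:
            child_label, child_value = objects[child]
            link_label = "L_" + child_label.lower()
            if isinstance(child_value, list):
                # complex child: always add the link, recurse only once
                graph["links"].append((oid, link_label, child))
                if child not in visited:
                    visit(child)
            else:
                # atomic child: becomes a slot of the current node
                graph["slots"].append((oid, link_label, str(child_value)))

    visit(entry)
    return graph


# Illustrative data with a cycle: the Entree refers back to its Restaurant.
oem = {
    "&1": ("Frodos", ["&2", "&3"]),
    "&2": ("Restaurant", ["&4", "&5"]),
    "&3": ("Restarea", ["&6"]),
    "&4": ("Name", "Blues_by_the_bay"),
    "&5": ("Entree", ["&7", "&2"]),
    "&6": ("Name", "North"),
    "&7": ("Name", "Red_curry"),
}
wglog = oem_to_wglog(oem, "&1")
```

Marking an object as visited before descending into its children is what keeps the recursion from looping on the back reference from the Entree to the Restaurant.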

< Entree { < Name > < Price > < Rating > }>

[WG-Log graph not reproduced: the node E_ENTREE has three links, L_Name, L_Price and L_Rating, each leading to a String slot.]

Figure 9: The OEM DataGuide textual format (left) and its translation into a WG-Log graph

[Figure 10 diagram not reproduced. Nodes shown: FRODOS (entry point), E_RESTAURANT, E_RESTAREA, E_ENTREE, E_OPINION. Links shown: L_restaurant, L_restarea, L_name, L_type, L_numphone, L_numfax, L_numhome, L_price, L_labelnumber, L_rating. All slots are of type String.]

Figure 10: WG-Log schema derived from the DataGuide

[Figure 11 diagram not reproduced. Elements shown: E_RESTAURANT, E_ENTREE, the link L_name to a String slot with value Blues_by_the_bay, and the RESULTS node.]

Figure 11: Example of WG-Log query

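[Figure 12 diagram not reproduced. Two rules are shown, both with a RESULTS node: one with E_RESTAURANT, E_ENTREE and the link L_Rating to a String slot with value Great; the other with E_RESTAURANT, E_ENTREE, E_OPINION and the link L_Rating to a String slot with value Great.]

Figure 12: Example of WG-Log query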

string WGtoLorel(WGGraph query, WGGraph schema) {
    string lorelQuery;
    WGNode result = query.entry();           // returns the node to insert in SELECT
    WGNode root = schema.getroot();
    string select = result.getlabel();       // returns the label
    lorelQuery.concat(select);
    string from = schema.getEntry(result);   // returns the path of the entry node
                                             // of the query
    lorelQuery.concat(from);
    WGNode node;
    string where;
    while (node = query.NextNode()) {        // returns NULL if all nodes are visited
        for (int i = 0; i < node.numberSlot(); i++) {
            string constraint = node.getConstraint(i);   // Node.link = value
            where.concat(constraint);
        }
    }
    lorelQuery.concat(where);
    return lorelQuery;
}

Figure 13: Algorithm to translate a WG-Log query into a LOREL query


5 Design patterns for the WG-Log architecture

The design of our architecture relies on some CORBA-compliant design patterns [MM97] for the Wrapper and Trader components, as well as for the Naming service. Such patterns describe a minimum interface that all the mentioned components should provide, and the communication protocols within our architecture. The complete specification of these patterns as design-level software components will be carried out in a future version of this paper. Here, we limit ourselves to a brief comment on the choice of CORBA patterns w.r.t. standard O2 schemas and bases definition using the standard O2 tools, as explained for instance in Chapter 7 of the O2 Administration Guide. If, as in our case, the designer chooses to access a database "from outside", CORBA's IDL (Interface Definition Language) gives an external view of it. In IDL, the designer defines the interface of any object and selects the methods needed for the particular purpose of the view. This view can then be used by a client outside O2 (a client which runs in a process but not as an O2 client). The designer of the view then has to implement this view on top of O2, i.e., inside another process which is an actual O2 client.

5.1 From IDL to C++

The CORBA compiler parses the IDL interface and generates a corresponding C++ class. Let us call the IDL interface "K". CORBA generates a C++ class, called K. This class can be used directly by the CORBA client (the application "outside" O2). The programmer then has to define a C++ class, called K_i, which implements the view inside O2. This class has to inherit from a CORBA-generated class, called KBOAImpl (Basic Object Adapter Implementation for K), which implements the CORBA client/server protocol. The class KBOAImpl is a subclass of K. K_i can use any feature of the C++ binding provided by O2. In the proposed approach, K_i is a pure transient C++ class. It does not need to be known by O2. It is linked with an O2 process which runs a standard O2 C++ client.

There are two possible architectures:

1. One O2 client for each CORBA client
2. A single O2 client for several CORBA clients

Solution (1) must be chosen when the CORBA clients run concurrently on the same data in potentially conflicting modes. In this case they need to rely on the O2 transactional system (one O2 client per transaction). Solution (2) can be adopted either when the CORBA clients are not conflicting (for instance, they all run in read-only mode) or when each request coming from a CORBA client can be implemented as a complete transaction (which is worthwhile only when the implementation of the request involves a significant amount of data manipulation). In our proposed architecture, we rely on solution (2), as it is suited to allowing CORBA access to existing O2 data while not interfering with its normal operation.
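The relationship among the three classes can be pictured with a small analogue. The real classes are C++ and two of them are generated by the IDL compiler; the Python below is only a structural sketch in which every name other than K, KBOAImpl and K_i (the operation `get_s`, the `dispatch` method, the dictionary standing in for the O2 binding) is an assumption made for illustration.

```python
from abc import ABC, abstractmethod


class K(ABC):
    """Stands in for the class generated from the IDL interface K."""

    @abstractmethod
    def get_s(self) -> str:
        ...


class KBOAImpl(K):
    """Stands in for the generated skeleton handling the client/server protocol."""

    def dispatch(self, operation, *args):
        # Route an incoming request to the operation of the same name.
        return getattr(self, operation)(*args)


class K_i(KBOAImpl):
    """Hand-written, transient implementation of the view inside the O2 client."""

    def __init__(self, database):
        self._database = database  # stand-in for access through the O2 binding

    def get_s(self) -> str:
        return self._database["s"]


# Solution (2): one server-side object shared by several read-only clients.
shared_view = K_i({"s": "substation data"})
replies = [shared_view.dispatch("get_s") for _client in range(3)]
```

Only K is visible to the clients; K_i exists solely in the process that is an actual O2 client, which is why it need not be known to O2 itself.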

5.2 Building a CORBA server with O2: a basic example

Building a CORBA/O2 server consists of compiling and linking the files produced by the O2 export tool and the IDL CORBA compiler, the C++ application files (if needed), the CORBA library and the O2 runtime libraries. An O2 exported class is mapped into an IDL interface, the name of which is the name of the O2 class prefixed by "d_". For each exported method, a corresponding IDL operation is generated with the same name and signature. All the parameters are declared as in parameters. Table 1 lists the O2 types the client can use as parameter types and result type for an exported O2 method, together with the corresponding IDL types. Q stands for any O2 type, and P for its corresponding IDL type.

O2 type                            IDL type
no type                            void
integer                            long
real                               double
boolean                            boolean
char                               char
string                             string
K (O2 class)                       d_K (IDL interface)
O2_list_Q (O2 collection class)    d_List_of_P (IDL collection interface)
O2_set_Q (O2 collection class)     d_Set_of_P (IDL collection interface)
O2_bag_Q (O2 collection class)     d_Bag_of_P (IDL collection interface)

Table 1: Corresponding O2 and IDL types for export

We are now ready to provide an example. Let K be the class to be exported:

    class K private inherit K1
        type tuple(s: string, o: K2)   /* K2 is an O2 class */
        method public get_s: string,
               public set_s(v: string),
               public foo(obj: K2)
    end;

and let the O2 export command be:

    O2 export ... -class "K -idl -method get_s -method set_s -method foo"

The corresponding interface d_K will be generated as follows:

    // forward declarations
    interface d_K2;
    interface d_K: d_K1 {
        string get_s();
        void set_s(in string v);
        void foo(in d_K2 obj);
    };

The O2 classes K1 and K2 must be exported as well, either before or after the export of class K; the order is not important.
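The export mapping for the class K can be mimicked in a few lines. The sketch below is not the O2 export tool: it assumes that identifiers contain underscores (d_K, get_s, set_s), that a method without a result type maps to void as in the generated operation for set_s, and that any type name that is not atomic denotes an exported class; collection types and forward declarations are left out, and the function names are hypothetical.

```python
# Atomic O2 types and their IDL counterparts, as listed in Table 1.
O2_TO_IDL = {
    "integer": "long",
    "real": "double",
    "boolean": "boolean",
    "char": "char",
    "string": "string",
}


def idl_type(o2_type):
    """Map an O2 type name to IDL; None means the method has no result type."""
    if o2_type is None:
        return "void"
    if o2_type in O2_TO_IDL:
        return O2_TO_IDL[o2_type]
    return "d_" + o2_type  # assumed to be an exported O2 class


def export_interface(name, parent, methods):
    """methods: list of (method_name, [(param_name, o2_type), ...], result_type)."""
    lines = [f"interface d_{name}: d_{parent} {{"]
    for method, params, result in methods:
        args = ", ".join(f"in {idl_type(ptype)} {pname}" for pname, ptype in params)
        lines.append(f"    {idl_type(result)} {method}({args});")
    lines.append("};")
    return "\n".join(lines)


k_methods = [
    ("get_s", [], "string"),
    ("set_s", [("v", "string")], None),
    ("foo", [("obj", "K2")], None),
]
print(export_interface("K", "K1", k_methods))
```

Every parameter is emitted as an in parameter, following the rule stated in the text for exported methods.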


6 Conclusions and Future Work

In this paper we have presented a technique for the representation and querying of heterogeneous datasources, which can be effectively coupled with other approaches in order to reach the full integration of all the datasources in a distributed information system. In particular, we have outlined the feasibility of this integration on the examples of an O2 object-oriented database and of the Tsimmis system for semi-structured information querying. We believe that this experience can be easily transferred to other data-model-based datasources, thus allowing the WG-Log environment to seamlessly manage information about site contents derived from different approaches.

Acknowledgments

The authors wish to thank D. Lucarella and A. Zanzi for the useful discussions on the subject. Thanks are also due to M.Sc. candidates Nicola Drago, Anna Mazzi, Stefano Vantini and Davide Veronese for their effort towards the implementation of a prototype of our architecture.

References

[Abi97] S. Abiteboul. Querying Semi-Structured Data. ICDT'97, 6th International Conference on Database Theory, Vol. 1186, pp. 1-18, Springer, 8-10 Jan 1997.

[AKn97] N. Ashish, C. Knoblock. Wrapper Generation for Semi-structured Internet Sources. Proceedings of the ACM SIGMOD International Conference on Management of Data.

[AMM97] P. Atzeni, A. Masci, G. Mecca, P. Merialdo, E. Tabet. ULIXES: Building relational views over the Web. In Proc. of the 13th Int'l Conf. on Data Engineering (ICDE'97), pages 576-576, Washington - Brussels - Tokyo, April 1997. IEEE.

[AQM96] S. Abiteboul, D. Quass, J. McHugh, J. Widom, J. Wiener. The Lorel query language for semistructured data. Journal of Digital Libraries, November 1996.

[BBB97] R. J. Bayardo et al. InfoSleuth: Agent-Based Semantic Integration of Information in Open and Dynamic Environments. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Vol. 26,2, pp. 195-206, ACM Press, May 13-15 1997.

[BCDT98] M. Baldi, S. Comai, E. Damiani, F. Insaccanebbia, L. Tanca. The Architecture of the WG-Log Web Query System. Technical Report Interdata Proj. n. T2-R07.

[BDI97] M. Baldi, E. Damiani, F. Insaccanebbia. Structuring and querying the Web through graph-oriented languages. In Proc. of SEBD 97, SEBD Conferences, Verona, Italy, June 1997.

[C96] R. Cattell et al. The Object Database Standard: ODMG-93, release 1.2. Morgan Kaufmann, 1996.

[CGH94] S. Chawathe et al. The TSIMMIS project: Integration of heterogeneous information sources. In Proceedings of IPSJ, Tokyo, Japan, October 1994.

[CGM97] S. Chawathe, H. Garcia-Molina. Meaningful Change Detection in Structured Data. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Tucson, Arizona, May 1997.

[CDPT98] S. Comai, E. Damiani, R. Posenato, L. Tanca. A Schema-based Approach to Modeling and Querying WWW Data. LNAI 1495.

[CDT98a] S. Comai, E. Damiani, L. Tanca. The WG-Log System: Data Model and Semantics. Technical Report Interdata Proj. n. T2-R06, 1998.

[Cor98] A. Cortesi, A. Dovier, E. Quintarelli, L. Tanca. Operational and Abstract Semantics of a Query Language for Semi-Structured Information. Proc. of the Intl. Workshop on Deductive Logic Programming.

[DTa97] E. Damiani, L. Tanca. Semantic Approach to Structuring and Querying the Web Sites. In Proceedings of 7th IFIP Work. Conf. on Database Semantics (DS-97), 1997.

[FFK97] M. Fernandez, D. Florescu, J. Kang, A. Levy, D. Suciu. STRUDEL: A Web Site Management System. Proceedings of the ACM SIGMOD International Conference on Management of Data, Vol. 26,2, pp. 549-552, ACM Press, May 13-15 1997.

[GCC97] R. Goldman, S. Chawathe, A. Crespo, J. McHugh. A Standard Textual Interchange Format for the Object Exchange Model (OEM). Manuscript available from http://www-db.stanford.edu

[GWi97] R. Goldman, J. Widom. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. VLDB'97, pp. 436-445, 1997.

[HGC97] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, A. Crespo. Extracting Semistructured Information from the Web. Paper available at http://www-db.stanford.edu

[MSL97] E. Miller, B. Schloss, O. Lassila, R. Swick. Resource description framework model and syntax. Technical report, W3 Consortium, Oct 1997. Revision 1,02, http://www.w3.org/TR/WD-rdf-syntax/.

[MM97] T. J. Mowbray, R. C. Malveau. CORBA Design Patterns. J. Wiley and Sons, New York, 1997.

[O2] O. Deux et al. The O2 System. Communications of the ACM, Vol. 34, N. 10, Oct. 1991.

[PGW95] Y. Papakonstantinou, H. Garcia-Molina, J. Widom. Object Exchange Across Heterogeneous Information Sources. Proc. of the 11th Int'l Conf. on Data Engineering, pp. 251-260, IEEE Computer Society Press, Mar 1995.

[PPT95] J. Paredaens, P. Peelman, L. Tanca. G-Log: A Declarative Graphical Query Language. IEEE Trans. on Knowledge and Data Eng., vol. 7, 1995, pp. 436-453.

[QWG96] D. Quass, J. Widom, R. Goldman, K. Haas, Q. Luo, J. McHugh, S. Nestorov, A. Rajaraman, H. Rivero, S. Abiteboul, J. D. Ullman, J. L. Wiener. LORE: A Lightweight Object REpository for Semistructured Data. Proc. of the 1996 ACM SIGMOD Int'l Conf. on Management of Data, p. 549, 4-6 Jun 1996.

[OH97] R. Orfali, D. Harkey. Client/Server Programming with Java and CORBA. John Wiley Computer Publishing, 1997.

[S96] J. Siegel. CORBA Fundamentals and Programming. John Wiley Computer Publishing, 1996.

[YD96] Z. Yang, K. Duddy. CORBA: A Platform for Distributed Object Computing. ACM Operating Systems Review, vol. 30, 1996.