Using Object-Grammars for Internet Data Warehousing

Lukas Faulstich, Myra Spiliopoulou, Volker Linnemann

Abstract

The increasing amount of information available on the web demands sophisticated querying methods and knowledge discovery techniques. In this study, we introduce WIND, our model for a data warehouse over a domain-specific portion of the Internet. The aim of WIND is to provide a partially materialized, structured view onto a thematic section of the web, on which database querying can be applied and mining techniques can be developed. WIND organizes web documents into local repositories with functionalities ranging from OODBMSs to file systems. This allows for a combination of attribute-oriented and content-oriented query processing. Special attention is paid to the format specifications of documents, where the notion of format is extended to cover characteristics and constraints that hold in the subject domain. To support conversion between (semi-)structured documents and database objects, we consider a technique for generating format converters based on the notion of object-grammars.

Keywords: data warehouse, web, mining, information retrieval, format conversion, grammars

1 Introduction

The Internet forms a large source of data in which users are mining for useful information. Access is mostly done by browsing combined with keyword search on index servers like Lycos, Galaxy or AltaVista. Since the coverage of index servers varies and their currency is far from perfect, the user should not have to deal with them directly. Moreover, the result of a query on a certain topic should be a structured compilation of information about this topic instead of an often unreliable list of HTML references.

Integration of Information Systems. Among the numerous projects dealing with the integration of information systems, we mention TSIMMIS [CGMH+94] and Information Manifold [LRO96]. TSIMMIS wraps each information source in a “translator” that translates both queries and results. However, the hierarchical OEM data model it uses is not well suited for complex objects. Complex objects are needed to model e.g. cyclic relationships, as they appear in web descriptions of products with interrelated components, related statistics and catalog prices. The Information Manifold models each information source as a set of relational tables with a limited querying facility. General relational queries are transformed into distributed queries based on this facility.

The special subject of integrating text files into databases is discussed in [ACM93, ACM95], where database queries and updates on text files are studied and general undecidability results are given. In [ABH93, BA96], generic storage methods are proposed for SGML and HyTime documents in the VODAK OODBMS. General grammar-defined formats are not supported.

Author affiliations. Lukas Faulstich: Institut für Informatik, Freie Universität Berlin. Myra Spiliopoulou: Institut für Wirtschaftsinformatik, Humboldt-Universität zu Berlin. Volker Linnemann: Institut für Informationssysteme, Medizinische Universität zu Lübeck.


Document transformations and unparsing. The tree transformations discussed in [KP93, FW93, ACM97] can be used to transform the syntax trees of structured documents. These approaches do not extend to the unparsing problem, i.e. the transformation of a possibly complex semantic value into a syntax tree. In [ACM95], an algorithm for the unparsing of database contents is given. The somewhat simpler problem of pretty-printing parsed program code is discussed in [vdBV96]. Approaches for object-oriented extensions of attribute grammars mostly aim at compiler construction [Paa95]. An object-oriented data model based on attributed syntax trees is proposed in [SL95]. To our knowledge, there is no general methodology for text representations of objects in the literature yet.

Mining and Data Warehouses. The ultimate motive for browsing the web is to find useful information. As pointed out by Etzioni [Etz96], web mining methodologies based on testing and domain-specific knowledge can discover precious information in the WWW. Inmon stresses that the ideal environment for data mining is a data warehouse [Inm96]. Hence, organizing the information on the web in a data warehouse instead of, e.g., an index server for text retrieval offers a much higher potential for knowledge discovery. Data warehouses [Inm92, IK93] are used to integrate data from legacy systems in a large database for mining. Research projects on data warehousing like WHIPS [HGML+95] and H2O [ZHKF95] consider problems of view updating and semantic heterogeneity of data originally well organized in information systems. Since the web is potentially unlimited and contains mostly unstructured documents, it is important (a) to identify a domain of information, so that domain-specific knowledge can be exploited [Etz96], and (b) to prepare the data in a way appropriate for mining strategies [FPSS96].

Our Model. In this study, we propose a model for the organization of domain-specific web information into a data warehouse and for the retrieval of this information with querying techniques, on top of which data mining can be implemented. The WIND architecture (Warehouse on INternet Data) is designed to integrate structured and unstructured documents from the web, media objects and other types of data, which are fetched from the Internet on request, i.e. as the result of a user query. WIND offers querying functionality and supports format conversions of documents according to the user’s requirements. In order to support the storage of different forms of data, WIND uses a variety of repositories, including databases, text archiving and retrieval systems, media servers and file systems. Their different query facilities are integrated in a uniform query language, WINDSurf, which is transparent with respect to storage and access methods as well as data formats.

Format conversion is essential for WIND. Since the vast number of existing formats rules out ad hoc methods, a general methodology is needed to specify bidirectional format transformations declaratively. For this, we propose the concept of object-grammars.

This article is organized as follows: in the next section we present a running example of a web domain of information. In Sections 3 and 4 we describe WIND and apply it to our running example. Since we emphasize the problem of format conversion, we introduce our concept of object-grammars in Section 5. In Section 6 we show the usage of object-grammars in WIND in the framework of our running example. The last section concludes the study.


2 The Internet Movie Database

In this paper, we use the Internet Movie Database (IMDB) as a running example. We show how the IMDB can be remodelled as an instance of our WIND architecture. The IMDB is a public domain data set describing movies and the people involved in them. It is based on ASCII files consisting of simple records as shown in Fig. 1. Access to the IMDB is provided via FTP, SMTP and WWW servers mirroring the IMDB data. The web interface supports a number of query templates and generates HTML pages on the fly. These are linked by cross-references to films, persons, locations etc. Figures 9 and 10 (in the appendix) show two typical pages generated by this web interface.

    Allen, Weldon            Dolores Claiborne (1994)  [Bartender]

    Allen, William Lawrence  Dangerous Touch (1994)  [Slim]
                             Sioux City (1994)  [Dan Larkin]

    Allen, Woody             Annie Hall (1977)  (AAN)  (C:GGN)  [Alvy Singer]
                             Bananas (1971)  [Fielding Mellish]
                             ...
                             Zelig (1983)  (C:GGN)  [Leonard Zelig]

Figure 1: A part of the list of actors in the IMDB.

However, in order to support ad hoc queries like “which actors played in films by both Jim Jarmusch and Aki Kaurismäki?”, the IMDB source files must be loaded into a database and query results must be formatted as HTML pages.

The web interface of the IMDB uses HTML forms to enter new data or updates. A more intuitive method is to edit existing HTML pages directly using a WYSIWYG editor like Amaya and send the modified pages back to the WWW server, which updates the data set.

Besides the IMDB, there are many other resources about cinema on the Internet: home pages of artists, organizations and companies, images, sound and video clips, magazines, etc. The IMDB offers a few links without integrating this information. Integration is needed to answer queries like “who plays the main character in ‘The 3rd Man’ and on which channel can he be seen the next time?”. The WIND architecture discussed hereafter is intended as a framework to meet these demands.
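To make the record layout of Fig. 1 concrete, the following Python sketch parses a single credit into its parts. It is an illustration only: the regular expression, the `Credit` class and the `parse_credit` function are hypothetical names, and the assumed layout (title, four-digit year in parentheses, optional parenthesized remarks, optional role in square brackets) is read off the few records visible in Fig. 1. It is neither the object-grammar mechanism of Section 5 nor an official description of the IMDB file format.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed layout of one credit, inferred from Fig. 1:
#   Title (year) (remark)* [role]
CREDIT_RE = re.compile(
    r"^(?P<title>.+?)\s+\((?P<year>\d{4})\)"   # title and four-digit year
    r"(?P<remarks>(?:\s+\([^)]*\))*)"          # zero or more "(...)" remarks
    r"(?:\s+\[(?P<role>[^\]]*)\])?\s*$"        # optional "[role]"
)


@dataclass
class Credit:
    title: str
    year: int
    remarks: List[str] = field(default_factory=list)
    role: Optional[str] = None


def parse_credit(text: str) -> Credit:
    """Parse one credit such as 'Zelig (1983) (C:GGN) [Leonard Zelig]'."""
    m = CREDIT_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a credit line: {text!r}")
    # Split the matched remark group into its individual "(...)" parts.
    remarks = re.findall(r"\(([^)]*)\)", m.group("remarks"))
    return Credit(m.group("title"), int(m.group("year")), remarks, m.group("role"))
```

A hand-written parser like this covers only one direction (text to object); producing the text again from a `Credit` would need a second, separately maintained routine, which is the duplication that a bidirectional specification is meant to avoid.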

3 The WIND Architecture

Our Internet warehouse model gathers information on a given domain (of topics) from different information sources across the Internet and organizes it in Document Repositories (DRs). Each DR is encapsulated in a wrapper module, the WIND-Wrapper, which is responsible for the communication between the DR and the WIND-Server. The WIND-Server administers the data warehouse as a whole and offers interfaces to information sources and clients/users. It consists of the following modules: (i) The Internet Loader is responsible for the import of data into WIND. (ii) The Repository Manager performs transaction processing and query processing at the server level before forwarding the subtransactions and subqueries to the underlying DRs, and postprocesses the results returned by them. (iii) The View Exporter is in charge of the communication with the clients. The WIND architecture is depicted in Fig. 2.

[Figure 2 is a block diagram that did not survive text extraction. Its labels: Client (WWW) and Client (DB), connected over the Internet and exchanging queries/requests, metadata and data with the View Exporter (ClientInterface modules); the WIND-Server with Repository Manager, Query Optimizer, Query Processor, Transaction Manager, Service Catalog and Fusion Table; Document Repositories (DBMS, Text Archive, File Server), each with a WIND-Wrapper containing a Query Transformer and a Format Conversion Service; the Internet Loader (SourceInterface modules), connected over the Internet to Source (WWW) and Source (FTP).]

Figure 2: The WIND Architecture

3.1 Internet Loader

The Internet Loader gathers information from sources across the Internet. For this purpose, it needs a variety of interfaces to communicate with different types of information sources, such as WWW servers, news servers, database servers and, possibly, file managers. Support for the interaction with meta-information providers, such as WWW search engines, is also desirable, because search engines have already registered large portions of the Internet.

Since the information retained in a WIND instance is domain-specific, domain knowledge is used to specify the general characteristics to be satisfied by the documents imported by the Internet Loader. Depending on the domain and the applications on it, such characteristics can be keywords, predetermined URL parts etc. More sophisticated characteristics, related e.g. to the structure of HTML pages, are handled by the Repository Manager and its modules.

The Internet Loader is used to load an initial set of documents into the data warehouse and then to extend this set on demand, i.e. in response to requests from the Repository Manager. Moreover, it must keep the contents of WIND up to date, both by periodically polling its information sources and in response to update requests from the Repository Manager. The usually high cost of polling should be weighed against the frequency of data changes and the importance of keeping the data current.
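The simple import characteristics mentioned above (keywords, predetermined URL parts) can be pictured as a small filter that the Internet Loader applies to each candidate document. The sketch below is an assumption for illustration: the `DomainFilter` class, its fields and the policy of accepting a document when either a URL part or a keyword matches are hypothetical and not taken from the WIND design.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DomainFilter:
    """Hypothetical import filter: keywords and URL fragments of the domain."""
    keywords: Tuple[str, ...] = ()
    url_parts: Tuple[str, ...] = ()

    def accepts(self, url: str, text: str) -> bool:
        # Assumed policy: one matching URL part or one matching keyword suffices.
        url_ok = any(part in url for part in self.url_parts)
        lowered = text.lower()
        keyword_ok = any(k.lower() in lowered for k in self.keywords)
        return url_ok or keyword_ok
```

For a movie domain, a filter such as `DomainFilter(keywords=("film", "movie"), url_parts=("imdb",))` would let through pages from an IMDB mirror as well as pages elsewhere that mention films. Structural criteria on HTML pages are deliberately left out, since the text assigns them to the Repository Manager.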

3.2 View Exporter

The View Exporter consists of interfaces to clients, such as web browsers, file-based legacy applications, database applications etc. Those interfaces translate the client requests into the internal query language of WIND, WINDSurf, and forward them to the Repository Manager. The results returned are forwarded by the interfaces to the clients. Different ways of implementing web interfaces in the View Exporter are presented in [FLS97].

Client requests can be queries, format conversions of documents, and updates of documents that a client considers obsolete. Such updates are supported, for instance, in the IMDB. A policy for recognizing authorized clients and propagating the updates from the DRs to the Internet sources is yet to be designed.

3.3 Document Repositories

The data retained in a WIND instance belong to a particular domain, such as the movie data of our running example. Attempting to store the whole Internet in a single warehouse is infeasible and meaningless, since no mining strategy would ever search the whole Internet unselectively to discover some knowledge on a given subject. The domain knowledge is used to specify the characteristics of the documents to be retained in WIND, including meta-information, structural specifications and format descriptors, wherever possible.

Depending on their structure, data entities are retained in one or more “Document Repositories” (DRs). The DRs of WIND can be databases, text retrieval systems, multimedia managers, file repositories etc. Each DR is regarded as an information system encapsulated in a wrapper. As shown in Fig. 2, the WIND-Wrapper and its submodules, the Query Transformer and the Format Conversion Server (FCS), are responsible for the interaction of the DR with the WIND-Server.

The schema describing the data in a DR depends on the expressiveness of the underlying information system: a DBMS usually has a powerful data model, while data in a text archive must be modelled as a set of documents with some simple predefined attributes. The functionality of the data retrieval service over a DR also varies: a database offers a querying interface, while an information retrieval system offers pattern matching algorithms.

User queries towards a DR can only be submitted via the WIND-Server, since DRs, as usual in a warehouse, are not autonomous. The Query Optimizer transforms a query towards the server into a set of query subplans towards the DRs. The Query Transformer in the WIND-Wrapper of each DR translates the subplan into a sequence of commands that can be processed by the DR.

A query subplan towards a DR is accompanied by a list of objects used as arguments to the query and by a specification of the output format. As shown in Fig. 3, the query is translated by the Query Transformer, while its arguments are converted by the Format Converters (FC-1, ..., FC-n) of the FCS into a format supported by the DR. The query results are then converted into the desired output format and returned to the Repository Manager.

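[Figure 3 is a diagram that did not survive text extraction. Its labels: Subquery Plan, Query Transformer, Local IS Language, Argument-1 ... Argument-i ... Argument-k, Conv-1 ... Conv-n, Result, Format Conversion Server, WIND-Wrapper, Information System, Document Repository.]

Figure 3: Querying a Document Repository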

Format conversion in the FCS is based on the notion of object-grammars (see Section 5), which support the generation of converters. However, the FCS is also open to the incorporation of special-purpose converters, e.g. for the reformatting of pictures and videos.

Object-grammars are appropriate both for text analysis (parsing) and for text generation (unparsing). Hence, they can specify the characteristics of the documents in the domain in a concise way. They can be used not only for the translation of a semi-structured text into a (composite) database object, but also for the reformatting of objects: format converters generated from object-grammars can be coupled to translate a piece of text into an intermediate object-oriented representation and then into text of another format. This is particularly useful when objects are requested in a large variety of alternative formats.
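The coupling of converters through an intermediate object representation can be sketched as follows. Everything in this Python example is an assumption made for illustration: the `Format` class, the two toy formats and the hand-written `parse_*`/`unparse_*` functions merely stand in for converters that, in WIND, would be generated from object-grammars.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Format:
    parse: Callable[[str], dict]    # text -> intermediate object
    unparse: Callable[[dict], str]  # intermediate object -> text


def parse_record(text: str) -> dict:
    # Toy format "Title (year)"
    title, _, rest = text.rpartition(" (")
    return {"title": title, "year": int(rest.rstrip(")"))}


def unparse_record(obj: dict) -> str:
    return f"{obj['title']} ({obj['year']})"


def parse_html_item(text: str) -> dict:
    # Toy format "<li>Title, year</li>"
    inner = text[len("<li>"):-len("</li>")]
    title, _, year = inner.rpartition(", ")
    return {"title": title, "year": int(year)}


def unparse_html_item(obj: dict) -> str:
    return f"<li>{obj['title']}, {obj['year']}</li>"


FORMATS: Dict[str, Format] = {
    "record": Format(parse_record, unparse_record),
    "html_item": Format(parse_html_item, unparse_html_item),
}


def reformat(text: str, source: str, target: str) -> str:
    """Convert text between two formats via the intermediate object."""
    obj = FORMATS[source].parse(text)
    return FORMATS[target].unparse(obj)
```

With n formats, this arrangement needs n parser/unparser pairs rather than a dedicated converter for every pair of formats, which is why the intermediate representation pays off when many output formats are requested.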

3.4 Repository Manager

The Repository Manager is responsible for the administration of the Document Repositories. Its “schema” is the union of the DR schemas. The data retained in the DRs are not necessarily distinct: the same “entity” may appear in several DRs, e.g. as an HTML page, a database object and a PostScript file. We distinguish between the original entity imported from the Internet into a DR and its replicas, and refer to all of them jointly as “doppelgänger(s)”. Queries towards an entity may be applied to one or more doppelgängers and should be forwarded to the appropriate DRs. Updates should be applied to all doppelgängers.

Hence, the Repository Manager contains modules for query and update processing and maintains information on the data in the DRs and their formats. The Fusion Table keeps track of the doppelgängers of an entity, retaining their common global identifier, a pointer to the original, information on the conversions that produced the replicas and the location of the replicas. The Service Catalog keeps track of the formats and converters available in the FCS of the WIND-Wrappers.
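One possible in-memory shape for the Fusion Table, holding exactly the four pieces of information listed above, is sketched below in Python. The class and method names (`Copy`, `FusionEntry`, `FusionTable`, `register_original`, `add_replica`, `find`) are hypothetical; the text does not prescribe a concrete data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Copy:
    repository: str                    # name of the DR holding this copy
    fmt: str                           # format of the copy, e.g. "html"
    produced_by: Optional[str] = None  # converter used; None for the original


@dataclass
class FusionEntry:
    global_id: str                     # identifier shared by all doppelgaengers
    original: Copy                     # the entity as imported from the Internet
    replicas: List[Copy] = field(default_factory=list)

    def doppelgaengers(self) -> List[Copy]:
        return [self.original] + self.replicas


class FusionTable:
    def __init__(self) -> None:
        self._entries: Dict[str, FusionEntry] = {}

    def register_original(self, global_id: str, repository: str, fmt: str) -> None:
        self._entries[global_id] = FusionEntry(global_id, Copy(repository, fmt))

    def add_replica(self, global_id: str, repository: str, fmt: str,
                    converter: str) -> None:
        self._entries[global_id].replicas.append(Copy(repository, fmt, converter))

    def find(self, global_id: str, fmt: str) -> Optional[Copy]:
        """Return a doppelgaenger of the entity in the given format, if any."""
        entry = self._entries.get(global_id)
        if entry is None:
            return None
        for copy in entry.doppelgaengers():
            if copy.fmt == fmt:
                return copy
        return None
```

Recording the converter that produced each replica is what later allows a replica to be regenerated from the original after an update.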

3.5 The Query Language of WIND

For WIND we consider an object-oriented query language based on OQL, as proposed in [R.G94]. Our language, WINDSurf, must support: (i) object-oriented database queries, (ii) predicates for information retrieval from multimedia archives, mainly based on pattern matching, (iii) format conversion requests and (iv) document updates. Those operations can be incorporated into an object-oriented query language as methods on the objects. Therefore, WINDSurf differs from OQL designs not in syntax but in evaluation: a WINDSurf query is executed against multiple DRs; some parts of the query can be executed by only a single DR, while for other parts there are several candidate DRs. Updates referring to an entity must be propagated to all DRs containing doppelgängers of that entity.

3.5.1 Query Optimizer and Query Processor

The Query Optimizer decomposes each WINDSurf query into subqueries assigned to the individual DRs. To perform the decomposition and the assignment, the optimizer must verify whether each DR involved can process the arguments passed to it and produce the desired output format. If a DR does not support the processing of a given entity, the optimizer consults the Fusion Table, looking for a doppelgänger in the appropriate format. If no such doppelgänger is found, the optimizer searches the Service Catalog for format converters that can transform the entity into the desired format.
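The lookup order described in this paragraph (existing doppelgänger first, converter second) can be written down as a short Python sketch. The function `plan_access` and the dictionary shapes used for the Fusion Table and the Service Catalog are assumptions for illustration; in particular, only a single conversion step is considered, whereas a real optimizer might chain several converters.

```python
from typing import Dict, List, Optional, Tuple

# Assumed shapes, chosen for brevity:
#   fusion_table:    entity id -> formats in which a doppelgaenger exists
#   service_catalog: (source format, target format) -> converter name
FusionTable = Dict[str, List[str]]
ServiceCatalog = Dict[Tuple[str, str], str]


def plan_access(entity_id: str, wanted_format: str,
                fusion_table: FusionTable,
                service_catalog: ServiceCatalog) -> Optional[tuple]:
    """Decide how to obtain an entity in the format a DR can process.

    Returns ("use", fmt) if a doppelgaenger in that format exists,
    ("convert", source_fmt, converter) if one conversion step suffices,
    and None if neither lookup succeeds.
    """
    available = fusion_table.get(entity_id, [])
    # Step 1: look for an existing doppelgaenger in the wanted format.
    if wanted_format in available:
        return ("use", wanted_format)
    # Step 2: look for a converter from any available format.
    for source in available:
        converter = service_catalog.get((source, wanted_format))
        if converter is not None:
            return ("convert", source, converter)
    return None
```

When the second branch is taken, the conversion produces a new replica, which the Query Processor would then register in the Fusion Table.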


The output of the Query Optimizer is an execution plan consisting of subplans and conversion requests towards the DRs. The Query Processor assigns the subplans and requests to the DRs. It is responsible for supervising the transfer of intermediate results from one DR (actually, one WIND-Wrapper) to another and for merging the results. If new replicas of an entity are produced during query execution, the Query Processor inserts the appropriate entries in the Fusion Table.

The merging of results into a list of documents goes beyond standard query processing, because some of the returned documents are selected from archives and ranked by the pattern matching facilities used there. Advances in the processing of ranking predicates [CG96, Fag96] will be considered to assign ranks to the results of the whole query¹.

The data retained in WIND can occasionally be inadequate to answer a query. This is the case, for instance, when the number of result entities requested by the user cannot be reached from the contents of the DRs. Then, the Query Processor asks the Internet Loader to bring additional data from the web. Since WINDSurf is more expressive than the query languages of search engines like AltaVista and Lycos, the query issued by the Internet Loader should be limited to the predicates supported by those servers. The results must then be imported into the DRs and processed according to the original WINDSurf query.

3.5.2 Transaction Manager

The Transaction Manager of WIND supervises the execution of updates over the DRs. Updates occur when an obsolete object should be replaced with a newer version fetched by the Internet Loader, and whenever an (authorized) client requests an update. An update should be performed on all doppelgängers of the same entity, as registered in the Fusion Table. Hence, for each update request towards an entity, an equivalent update request per doppelgänger must be generated. We are considering efficient ways of generating such updates. Our initial solution is to perform the original update, use the updated entity as the original and generate its doppelgängers anew as replicas. All update operations initiated by the original update request must either be performed immediately or be deferred until the next access to a doppelgänger. Since the DRs are heterogeneous but not autonomous, those updates can be performed with a rather simple protocol for nested transactions.
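The initial solution described above (update the original, then regenerate the replicas either immediately or on next access) can be illustrated with a small Python sketch. The `Entity` class and its methods are hypothetical, the converters are plain functions, and the transactional aspects (atomicity, nested transactions across DRs) are left out entirely.

```python
from typing import Callable, Dict, Set


class Entity:
    """An original plus replicas that are regenerated after each update."""

    def __init__(self, original: str,
                 converters: Dict[str, Callable[[str], str]]) -> None:
        self.original = original
        self.converters = converters         # format -> converter from the original
        self.replicas: Dict[str, str] = {}
        self.stale: Set[str] = set(converters)  # formats needing regeneration

    def update(self, new_original: str, deferred: bool = True) -> None:
        # Perform the update on the original; all replicas become stale.
        self.original = new_original
        self.stale = set(self.converters)
        if not deferred:
            for fmt in list(self.stale):
                self._regenerate(fmt)

    def get(self, fmt: str) -> str:
        # Deferred propagation: regenerate a stale replica on access.
        if fmt in self.stale:
            self._regenerate(fmt)
        return self.replicas[fmt]

    def _regenerate(self, fmt: str) -> None:
        self.replicas[fmt] = self.converters[fmt](self.original)
        self.stale.discard(fmt)
```

Deferring the regeneration avoids conversion work for replicas that are never read again, at the price of a slower first access after an update.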

4 Modelling the IMDB as a WIND Instance

The IMDB has several mirror sites on the Internet. We discuss the modelling of the IMDB in WIND as an enhanced IMDB mirror with extended functionality.

4.1 Structure of the IMDB-WIND

For reasons of brevity we consider a minimally equipped WIND instance. It consists of an OODBMS, a text archive and an HTML page repository; the Internet Loader has HTTP and FTP interfaces; the View Exporter supports a web interface. Additional repositories, such as a video archive, can obviously be added.

The data in the OODBMS repository are organized according to the schema in Fig. 4. This schema is a simplified version of the IMDB schema, allowing us to concentrate on the aspects of the movie database important for our study.

    persons:Set[Person]          films:Set[Film]

    Person                       Film
      name:String                  title:String
      biography:HyperText          year:Integer
      jobs:Set[Job]                jobs:Set[Job]

    HyperText                    Job
      url:String                   category:Category
                                   role:String
                                   remarks:List[String]
                                   rank:Integer
                                   performer:Person
                                   film:Film

Figure 4: The schema of the movie database

The Format Conversion Server of the OODBMS is equipped with object-grammars that can transform the IMDB files into objects, and with object-grammars for the conversion of database objects into HTML pages.
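For readers who prefer code to diagrams, the schema of the movie database can be approximated with Python dataclasses. This is a sketch under assumptions: the two `Category` members are taken from the example query of Section 4.2 (the full set of categories is not given), `title` is typed as a string here, and the `__post_init__` hook that links a `Job` to its person and film is a convenience of this example, not part of the schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class Category(Enum):
    CAST = "cast"            # assumed members; only these two appear
    DIRECTION = "direction"  # in the example query


@dataclass(eq=False)         # identity-based equality: the object graph is cyclic
class HyperText:
    url: str


@dataclass(eq=False)
class Person:
    name: str
    biography: HyperText
    jobs: Set["Job"] = field(default_factory=set)


@dataclass(eq=False)
class Film:
    title: str
    year: int
    jobs: Set["Job"] = field(default_factory=set)


@dataclass(eq=False)
class Job:
    category: Category
    role: str
    performer: Person
    film: Film
    remarks: List[str] = field(default_factory=list)
    rank: int = 0

    def __post_init__(self) -> None:
        # Keep both directions of the cyclic relationship consistent.
        self.performer.jobs.add(self)
        self.film.jobs.add(self)
```

The cycle Person, Job, Film, Job, Person is the kind of relationship that the introduction cites as a reason for needing complex objects rather than a purely hierarchical model.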

4.2 An Example Query on the IMDB

In the WIND instance of IMDB, let us retrieve all persons (in alphabetical order), whose biography contains the words “neurotic” and “New York” or “New Yorker”, and who have appeared in movies of Woody Allen. This query can be expressed in WINDSurf as follows:

     1  sort p in (
     2    select person
     3    from person in persons
     4    where person.biography.match("neurotic NEAR New York*")
     5    and exists film in (
     6      select job.film
     7      from job in person.jobs
     8      where job.category = cast ):
     9      (exists director in (
    10        select directing.person
    11        from directing in film.jobs
    12        where directing.category = direction ):
    13        director.name = "Allen, Woody")
    14  ) by p.name
    15  ) @ HTMLDoc[UnorderedList[Link]]("Query result")

¹ A query returning documents with ranks does not need to support fuzzy predicates. Fuzzy queries would be an enhancement of WINDSurf, which we intend to consider.


This query selects the persons whose biography contains the pattern “neurotic NEAR New York*” (line 4) and for whom there is a movie that registers them as cast members (lines 6-8: category = cast) and that is directed (category = direction) by a person named “Allen, Woody” (lines 9-13). The results are sorted alphabetically by name (line 14). In line 15 we specify that the set of person objects output by this query must be formatted as an HTML document (@ HTMLDoc...). This document should be titled “Query result”. Its content is an unordered list (HTML-Tag