Using Object-Grammars for Internet Data Warehousing
Lukas Faulstich
Myra Spiliopoulou
Volker Linnemann
Abstract
The increasing amount of information available on the web demands sophisticated querying methods and knowledge discovery techniques. In this study, we introduce our model WIND for a data warehouse over a domain-specific portion of the Internet. The aim of WIND is to provide a partially materialized structured view onto a thematic section of the web, on which database querying can be applied and mining techniques can be developed. WIND organizes web documents into local repositories with functionalities ranging from OODBMSs to file systems. This allows for a combination of attribute and content-oriented query processing. Special interest is paid to the format specifications of documents, where the notion of format is extended to cover characteristics and constraints that hold on the subject domain. To support conversion between (semi-)structured documents and database objects, we consider a format converter generation technique based on the notion of object-grammars.
Keywords: data warehouse, web, mining, information retrieval, format conversion, grammars
1 Introduction
The Internet forms a large source of data in which users mine for useful information. Access is mostly by browsing, combined with keyword search on index servers like Lycos, Galaxy or Altavista. Since the coverage of index servers differs and their content is far from up to date, the user should not have to deal with them directly. Moreover, the result of a query on a certain topic should be a structured compilation of information about this topic instead of an often unreliable list of HTML references.

Integration of Information Systems. Among the numerous projects dealing with the integration of information systems, we mention TSIMMIS [CGMH+ 94] and Information Manifold [LRO96]. TSIMMIS wraps each information source in a “translator” that translates both queries and results. However, the hierarchical OEM data model it uses is not well suited for complex objects. Complex objects are needed to model e.g. cyclic relationships, as they appear in web descriptions of products with interrelated components, related statistics and catalog prices. The Information Manifold models each information source as a set of relational tables with a limited querying facility. General relational queries are transformed into distributed queries based on this facility. The special subject of integrating text files into databases is discussed in [ACM93, ACM95], where database queries and updates on text files are studied and general undecidability results are given. In [ABH93, BA96], generic storage methods are proposed for SGML and HyTime documents in the VODAK OODBMS. General grammar-defined formats are not supported.

Author affiliations: Lukas Faulstich, Institut für Informatik, Freie Universität Berlin; Myra Spiliopoulou, Institut für Wirtschaftsinformatik, Humboldt-Universität zu Berlin; Volker Linnemann, Institut für Informationssysteme, Medizinische Universität zu Lübeck.
Document transformations and unparsing. The tree transformations discussed in [KP93, FW93, ACM97] can be used to transform the syntax trees of structured documents. These approaches do not extend to the unparsing problem, i.e. the transformation of a possibly complex semantic value into a syntax tree. In [ACM95], an algorithm for the unparsing of database contents is given. The somewhat simpler problem of pretty-printing parsed program code is discussed in [vdBV96]. Approaches for object-oriented extensions of attribute grammars mostly aim at compiler construction [Paa95]. An object-oriented data model based on attributed syntax trees is proposed in [SL95]. To our knowledge, there is no general methodology for text representations of objects in the literature yet.

Mining and Data Warehouses. The ultimate motive for browsing the web is to find useful information. As pointed out by Etzioni [Etz96], web mining methodologies based on testing and domain-specific knowledge can discover valuable information on the WWW. Inmon stresses that the ideal environment for data mining is a data warehouse [Inm96]. Hence, organizing the information on the web in a data warehouse, instead of e.g. an index server for text retrieval, offers a much higher potential for knowledge discovery. Data warehouses [Inm92, IK93] are used to integrate data from “legacy systems” in a large database for mining. Research projects on data warehousing like WHIPS [HGML+ 95] and H2O [ZHKF95] consider problems of view updating and semantic heterogeneity of data originally well organized in information systems. Since the web is potentially unlimited and contains mostly unstructured documents, it is important (a) to identify a domain of information, so that domain-specific knowledge can be exploited [Etz96], and (b) to prepare the data in a way appropriate for mining strategies [FPSS96].

Our Model.
In this study, we propose a model for the organization of domain-specific web information into a data warehouse and for the retrieval of this information with querying techniques, on top of which data mining can be implemented. The WIND architecture for a Warehouse on INternet Data, is designed to integrate structured and unstructured documents from the web, media objects and other types of data, which are fetched from the Internet on request, i.e. as the result of a user query. WIND offers querying functionality and supports format conversions of documents according to the user’s requirements. In order to support the storage of different forms of data, WIND uses a variety of repositories, including databases, text archiving and retrieval systems, media servers and file systems. Their different query facilities are integrated in a uniform query language, WINDSurf, which is transparent with respect to storage and access methods as well as data formats. Format conversion is essential for WIND. Since the vast number of existing formats forbids ad hoc methods, a general methodology is needed to specify declaratively bidirectional format transformations. For this, we propose the concept of object-grammars. This article is organized as follows: in the next section we present a running example of a web domain of information. In sections 3 and 4 we describe WIND and apply it on our running example. Since we emphasize the problem of format conversion, we introduce our concept of object-grammars in section 5. In section 6 we show the usage of object-grammars in WIND in the framework of our running example. The last section concludes the study.
2 The Internet Movie Database
In this paper, we use the Internet Movie Database (IMDB) as a running example. We show how the IMDB can be remodelled as an instance of our WIND architecture. The IMDB is a public domain data set describing movies and the people involved in them. It is based on ASCII files consisting of simple records as shown in Fig. 1. Access to the IMDB is provided via FTP, SMTP and WWW servers mirroring the IMDB data. The web interface supports a number of query templates and generates HTML pages on the fly. These are linked by cross-references to films, persons, locations etc. Figures 9 and 10 (in the appendix) show two typical pages generated by this web interface.

    Allen, Weldon            Dolores Claiborne (1994) [Bartender]

    Allen, William Lawrence  Dangerous Touch (1994) [Slim]
                             Sioux City (1994) [Dan Larkin]

    Allen, Woody             Annie Hall (1977) (AAN) (C:GGN) [Alvy Singer]
                             Bananas (1971) [Fielding Mellish]
                             ...
                             Zelig (1983) (C:GGN) [Leonard Zelig]

Figure 1: A part of the list of actors in the IMDB.

However, in order to support ad hoc queries like “which actors played both in films of Jim Jarmusch and of Aki Kaurismäki?”, the IMDB source files must be loaded into a database and query results must be formatted as HTML pages. The web interface of the IMDB uses HTML forms to enter new data or updates. A more intuitive method is to edit existing HTML pages directly using a WYSIWYG editor like Amaya and send the modified pages back to the WWW server, which updates the data set.

Besides the IMDB, there are many other resources about cinema on the Internet: home pages of artists, organizations and companies, images, sound and video clips, magazines, etc. The IMDB offers a few links without integrating this information. Integration is needed to answer queries like “who plays the main character in ‘The 3rd Man’ and on which channel can he be seen the next time?”. The WIND architecture discussed hereafter is intended as a framework to meet these demands.
3 The WIND Architecture
Our Internet warehouse model gathers information on a given domain (of topics) from different information sources across the Internet and organizes it in Document Repositories (DRs). Each DR is encapsulated in a wrapper module, the WIND-Wrapper, which is responsible for the communication between the DR and the WIND-Server. The WIND-Server administers the data warehouse as a whole and offers interfaces to information sources and clients/users. It consists of the following modules:
(i) The Internet Loader is responsible for the import of data into WIND.
(ii) The Repository Manager performs transaction processing and query processing at the server level before forwarding the subtransactions and subqueries to the underlying DRs, and postprocesses the results returned by them.
(iii) The View Exporter is in charge of the communication with the clients.
The WIND architecture is depicted in Fig. 2.
Figure 2: The WIND Architecture
[Diagram not reproduced. Labelled components: Clients (WWW, DB) connected via the Internet; View Exporter with ClientInterfaces; WIND-Server; Repository Manager with Query Optimizer, Query Processor, Transaction Manager, Fusion Table and Service Catalog; Document Repositories (DBMS, Text Archive, File Server), each with a WIND-Wrapper containing a Query Transformer and a Format Conversion Service; Internet Loader with SourceInterfaces; Sources (WWW, FTP) connected via the Internet. Arrow labels: queries/requests, meta-data, data.]
3.1 Internet Loader
The Internet Loader gathers information from sources across the Internet. For this purpose, it needs a variety of interfaces to communicate with different types of information sources, such as WWW servers, news servers, database servers and, eventually, file managers. Support for the interaction with meta-information providers, such as WWW search engines, is also desirable, because search engines have already registered large portions of the Internet.

Since the information retained in a WIND instance is domain-specific, domain knowledge is used to specify the general characteristics to be satisfied by the documents imported by the Internet Loader. Depending on the domain and the applications on it, such characteristics can be keywords, pre-determined URL parts etc. Sophisticated characteristics, related e.g. to the structure of HTML pages, are handled by the Repository Manager and its modules.

The Internet Loader is used to load an initial set of documents into the data warehouse and then to extend this set on demand, i.e. in response to requests from the Repository Manager. Moreover, it must regularly update the contents of WIND, both by periodically polling its information sources and in response to update requests from the Repository Manager. The usually high cost of polling should be weighed against the frequency of data changes and the importance of keeping the data up to date.
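The import characteristics mentioned above (keywords, pre-determined URL parts) can be sketched as a simple acceptance test. This is only an illustration under stated assumptions: the names `DomainProfile` and `accepts`, and the rule that a single matching URL part or keyword suffices, are hypothetical and not part of WIND.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class DomainProfile:
    """Hypothetical description of the documents a WIND instance imports."""
    keywords: set = field(default_factory=set)    # words expected in the text
    url_parts: set = field(default_factory=set)   # substrings expected in host or path


def accepts(profile, url, text):
    """Return True if the URL contains a known part or the text a keyword.

    Assumption: one match is enough; a real loader may weigh several criteria.
    """
    parsed = urlparse(url)
    location = (parsed.netloc + parsed.path).lower()
    if any(part.lower() in location for part in profile.url_parts):
        return True
    words = set(text.lower().split())
    return any(keyword.lower() in words for keyword in profile.keywords)


# Illustrative profile for the movie domain (values are made up).
movie_profile = DomainProfile(keywords={"film", "director"}, url_parts={"imdb"})
```

Structural criteria, such as the layout of an HTML page, are deliberately left out, since the text assigns them to the Repository Manager.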
3.2 View Exporter
The View Exporter consists of interfaces to clients, such as web browsers, file-based legacy applications, database applications etc. Those interfaces translate the client requests into the internal query language of WIND, WINDSurf, and forward them to the Repository Manager. The results returned are forwarded by the interfaces to the clients. Different ways of implementing web interfaces in the View Exporter are presented in [FLS97].

The client requests can be queries, format conversions of documents and updates of documents which a client considers to be obsolete. Such updates are for instance supported in the IMDB. A policy for recognizing authorized clients and propagating the updates from the DRs to the Internet sources is yet to be designed.
3.3 Document Repositories
The data retained in a WIND instance belong to a particular domain, such as the movie data of our running example. Attempting to store the whole Internet in a single warehouse is infeasible and meaningless, since no mining strategy would ever search the whole Internet unselectively to discover some knowledge on a given subject. The domain knowledge is used to specify the characteristics of the documents to be retained in WIND, including meta-information, structural specifications and format descriptors, wherever possible.

Depending on their structure, data entities are retained in one or more “Document Repositories” (DRs). The DRs of WIND can be databases, text retrieval systems, multimedia managers, file repositories etc. Each DR is regarded as an information system encapsulated in a wrapper. As shown in Fig. 2, the WIND-Wrapper and its submodules, the Query Transformer and the Format Conversion Server (FCS), are responsible for the interaction of the DR with the WIND-Server.

The schema describing the data in a DR depends on the expressiveness of the underlying information system: a DBMS usually has a powerful data model, while data in a text archive must be modelled as a set of documents with some simple predefined attributes. The functionality of the data retrieval service over a DR also varies: a database offers a querying interface, while an information retrieval system offers pattern matching algorithms.

User queries towards a DR can only be submitted via the WIND-Server, since DRs, as usual in a warehouse, are not autonomous. The Query Optimizer transforms a query towards the server into a set of query subplans towards the DRs. The Query Transformer in the WIND-Wrapper of each DR translates the subplan into a sequence of commands that can be processed by the DR. A query subplan towards a DR is accompanied by a list of objects used as arguments to the query and by a specification of the output format. As shown in Fig.
3, the query is translated by the Query Transformer, while its arguments are converted by the Format Converters (FC-1, ..., FC-n) of the FCS into a format supported by the DR. The query results are then converted into the desired output format and returned to the Repository Manager.

Figure 3: Querying a Document Repository
[Diagram not reproduced. Labelled components: Subquery Plan with Argument-1 ... Argument-k; WIND-Wrapper containing the Query Transformer and the Format Conversion Server with converters Conv-1 ... Conv-n; Local IS Language; Information System; Document Repository; Result.]
Format conversion in the FCS is based on the notion of object-grammars (see Section 5), which can support converter generation. However, the FCS is open to the incorporation of special purpose converters as well, e.g. for the reformatting of pictures and videos. Object-grammars are appropriate both for text analysis (parsing) and for text generation (unparsing). Hence, they are appropriate for specifying the characteristics of the documents in the domain in a concise way. They can be used not only for the translation of a semi-structured text into a (composite) database object, but also for the reformatting of objects: format converters generated from object-grammars can be coupled to translate a piece of text into an intermediate object-oriented representation and then into text of another format. This is particularly useful when objects are requested in a large variety of alternative formats.
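The coupling of two converters through an intermediate object can be illustrated with a small sketch. The converters here are written by hand, whereas in WIND they would be generated from object-grammars; the function names, the record shape "Title (YYYY)" and the `<LI>` output format are assumptions chosen for the illustration.

```python
import html
import re
from dataclasses import dataclass


@dataclass
class Film:
    """Intermediate object-oriented representation."""
    title: str
    year: int


def parse_record(text):
    """Parser side: turn a record of the assumed form 'Title (YYYY)' into a Film."""
    match = re.fullmatch(r"\s*(.+?)\s*\((\d{4})\)\s*", text)
    if match is None:
        raise ValueError(f"not a film record: {text!r}")
    return Film(match.group(1), int(match.group(2)))


def unparse_list_item(film):
    """Unparser side: represent a Film as an HTML list item."""
    return f"<LI>{html.escape(film.title)} ({film.year})</LI>"


def couple(parser, unparser):
    """Chain a parser and an unparser into a text-to-text converter."""
    return lambda text: unparser(parser(text))


record_to_html = couple(parse_record, unparse_list_item)
```

With m input formats and n output formats, coupling through the object representation needs m parsers and n unparsers rather than m times n direct converters, which is why the intermediate step pays off when many formats are requested.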
3.4 Repository Manager The Repository Manager is responsible for the administration of the Document Repositories. Its “schema” is the union of the DR schemas. The data retained in the DRs are not necessarily distinct: the same “entity” may appear in several DRs, e.g. as a HTML-page, a database object and a postscript file, whereby we distinguish between the original entity imported from the Internet into a DR and its replicas, jointly denoted as “doppelgänger(s)”. Queries towards an entity may be applied to one or more doppelgängers and should be forwarded to the appropriate DRs. Updates should be applied to all doppelgängers. Hence, the Repository Manager contains modules for query and update processing and maintains information on the data in the DRs and their formats. The Fusion Table keeps track of doppelgängers for an entity, retaining their common global identifier, a pointer to the original, information on the conversions that produced the replicas and the location of the replicas. The Service Catalog keeps track of the formats and converters available in the FCS of the WIND-Wrappers.
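A minimal sketch of the bookkeeping described for the Fusion Table follows. The class and field names are hypothetical; the source only states which pieces of information are kept (global identifier, pointer to the original, producing conversions, replica locations).

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    repository: str   # DR that stores the replica
    fmt: str          # format of the replica
    converter: str    # conversion that produced it


@dataclass
class FusionEntry:
    global_id: str             # identifier shared by all doppelgängers
    original_repository: str   # "pointer" to the original
    original_format: str
    replicas: list = field(default_factory=list)

    def locations(self):
        """Map every available format to the repository holding it."""
        found = {self.original_format: self.original_repository}
        for replica in self.replicas:
            found.setdefault(replica.fmt, replica.repository)
        return found


class FusionTable:
    def __init__(self):
        self._entries = {}

    def register_original(self, global_id, repository, fmt):
        self._entries[global_id] = FusionEntry(global_id, repository, fmt)

    def add_replica(self, global_id, replica):
        self._entries[global_id].replicas.append(replica)

    def doppelgangers(self, global_id):
        return self._entries[global_id].locations()
```

A Service Catalog could be kept in a similar structure, for instance as a mapping from (source format, target format) pairs to converter names; that representation is likewise an assumption.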
3.5 The Query Language of WIND
For WIND we consider an object-oriented query language based on OQL, as proposed in [R.G94]. Our language, WINDSurf, must support: (i) object-oriented database queries, (ii) predicates for information retrieval from multimedia archives, mainly based on pattern matching, (iii) format conversion requests and (iv) document updates. Those operations can be incorporated into an object-oriented query language as methods on the objects. Therefore, WINDSurf does not differ from OQL designs in its syntax but in its evaluation: a WINDSurf query is executed towards multiple DRs; parts of the query can be executed by a single DR, while for other parts there are several candidate DRs. Updates referring to an entity must be propagated to all DRs containing doppelgängers of that entity.

3.5.1 Query Optimizer and Query Processor
The Query Optimizer decomposes each WINDSurf query into subqueries assigned to the individual DRs. To do the decomposition and the assignment, the optimizer must verify whether each DR involved can process the arguments passed to it and produce the desired output format. If a DR does not support the processing of a given entity, the optimizer consults the Fusion Table, looking for a doppelgänger in the appropriate format. If no such doppelgänger is found, the optimizer searches the Service Catalog for format converters that can transform the entity into the desired format.
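The two-stage lookup just described (Fusion Table first, Service Catalog second) can be sketched as follows. The function name, the tuple-shaped plan steps and the dictionary layouts are assumptions made for this illustration, not the optimizer's actual interface.

```python
def resolve_argument(entity_id, accepted_formats, fusion_table, service_catalog):
    """Find a way to offer `entity_id` in a format the target DR accepts.

    fusion_table:    entity id -> {format: repository}      (assumed layout)
    service_catalog: (source format, target format) -> name  (assumed layout)
    Returns a plan step as a tuple, or None if nothing applies.
    """
    doppelgangers = fusion_table.get(entity_id, {})

    # 1. Prefer an existing doppelgänger in an accepted format.
    for fmt in accepted_formats:
        if fmt in doppelgangers:
            return ("use", doppelgangers[fmt], fmt)

    # 2. Otherwise look for a converter from any available format.
    for source_fmt, repository in doppelgangers.items():
        for target_fmt in accepted_formats:
            converter = service_catalog.get((source_fmt, target_fmt))
            if converter is not None:
                return ("convert", repository, source_fmt, target_fmt, converter)

    return None
```

The sketch considers only single-step conversions; chaining several converters would require a search over the catalog, which is not covered here.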
The output of the Query Optimizer is an execution plan consisting of subplans and conversion requests towards the DRs. The Query Processor assigns the subplans and requests to the DRs. It is responsible for supervising the transfer of intermediate results from one DR (actually WIND-Wrapper) to another and for merging the results. If new replicas of an entity are produced during query execution, the Query Processor inserts the appropriate entries in the Fusion Table.

The merging of results into a list of documents goes beyond standard query processing, because some of the returned documents are selected from archives and ranked by the pattern matching facilities used there. Advances in the processing of ranking predicates [CG96, Fag96] will be considered to assign ranks to the results of the whole query (footnote 1).

The data retained in WIND can occasionally be inadequate to answer a query. This is for instance the case when the number of result entities requested by the user cannot be reached from the DRs' contents. Then, the Query Processor asks the Internet Loader to bring additional data from the web. Since WINDSurf is more expressive than the query languages of search engines like AltaVista and Lycos, the query issued by the Internet Loader should be limited to the predicates supported by those servers. The results must then be imported into the DRs and processed according to the original WINDSurf query.

3.5.2 Transaction Manager
The Transaction Manager of WIND supervises the execution of updates over the DRs. Updates occur when an obsolete object should be replaced with a newer version fetched by the Internet Loader, and whenever an (authorized) client requests an update. An update should be performed on all doppelgängers of the same entity, as registered in the Fusion Table. Hence, for each update request towards an entity an equivalent update request per doppelgänger must be generated. We are considering ways of generating such updates in an efficient way. Our initial solution is to perform the original update, use the updated entity as original and generate its doppelgängers anew as replicas. All update operations initiated from the original update request must be either performed immediately or deferred until the next access to a doppelgänger. Since the DRs are heterogeneous but not autonomous, those updates can be performed with a rather simple protocol for nested transactions.
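The initial solution described above (update the original, then regenerate its doppelgängers either immediately or on the next access) can be sketched like this. The class `ReplicatedEntity` and its methods are hypothetical, and the sketch ignores atomicity, i.e. the nested-transaction protocol is not modelled.

```python
class ReplicatedEntity:
    """An original plus replicas derived from it by format converters."""

    def __init__(self, original, converters):
        self.original = original
        self.converters = dict(converters)   # format name -> conversion function
        self.replicas = {fmt: convert(original)
                         for fmt, convert in self.converters.items()}
        self.stale = set()                   # formats awaiting regeneration

    def _regenerate(self, fmt):
        self.replicas[fmt] = self.converters[fmt](self.original)
        self.stale.discard(fmt)

    def update(self, new_original, deferred=False):
        """Replace the original; rebuild replicas now or mark them stale."""
        self.original = new_original
        if deferred:
            self.stale = set(self.converters)
        else:
            for fmt in self.converters:
                self._regenerate(fmt)

    def read(self, fmt):
        """Access a doppelgänger, regenerating it first if it is stale."""
        if fmt in self.stale:
            self._regenerate(fmt)
        return self.replicas[fmt]
```

Deferring trades update latency for read latency: a replica that is never accessed again is never regenerated.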
4 Modelling the IMDB as a WIND Instance
The IMDB has several mirror sites on the Internet. We discuss the modelling of the IMDB in WIND as an enhanced IMDB mirror with extended functionality.
4.1 Structure of the IMDB-WIND
For reasons of brevity we consider a minimally equipped WIND instance. It consists of an OODBMS, a text archive and an HTML-page repository; the Internet Loader has HTTP and FTP interfaces; the View Exporter supports a web interface. Additional repositories, such as a video archive, can obviously be added.

The data in the OODBMS repository are organized according to the schema in Fig. 4. This schema is a simplified version of the IMDB schema, allowing us to concentrate on the aspects of the movie database important for our study. The Format Conversion Server of the OODBMS is equipped with object-grammars that can transform the IMDB files into objects, and with object-grammars for the conversion of database objects into HTML pages.

    persons: Set[Person]
    films:   Set[Film]

    Person:     name: String, biography: HyperText, jobs: Set[Job]
    Film:       title: String, year: Integer, jobs: Set[Job]
    HyperText:  url: String
    Job:        category: Category, role: String, remarks: List[String],
                rank: Integer, performer: Person, film: Film

Figure 4: The schema of the movie database
4.2 An Example Query on the IMDB
In the WIND instance of the IMDB, let us retrieve all persons (in alphabetical order) whose biography contains the words “neurotic” and “New York” or “New Yorker”, and who have appeared in movies of Woody Allen. This query can be expressed in WINDSurf as follows:

     1  (sort p in (
     2    select person
     3    from person in persons
     4    where person.biography.match("neurotic NEAR New York*")
     5    and exists film in (
     6      select job.film
     7      from job in person.jobs
     8      where job.category = cast ):
     9      (exists director in (
    10        select directing.person
    11        from directing in film.jobs
    12        where directing.category = direction ):
    13        director.name = "Allen, Woody")
    14    ) by p.name
    15  ) @ HTMLDoc[UnorderedList[Link]]("Query result")
1 A query returning documents with ranks does not need to support fuzzy predicates. Fuzzy queries would be an enhancement of WINDSurf, which we intend to consider.
This query selects the persons whose biography contains the pattern "neurotic NEAR New York*" (line 4), and for whom there is a movie registering them as cast members (lines 6-8: category = cast) and directed (lines 9-13: category = direction) by a person named "Allen, Woody". The results are sorted alphabetically by name (line 14). In line 15 we specify that the set of person objects output by this query must be formatted as an HTML document (@ HTMLDoc...). This document should be titled "Query result". Its content is an unordered list (HTML tag <UL>) of hyperlinks, which point to the web pages of the persons found. Those pages will be generated upon traversal of the links to them.

The predicate on line 4 is a pattern matching operation supported in the text archive only. The other predicates must be processed by the OODBMS. Thus, the Query Optimizer produces the following execution plan:

1. The predicates of the original OQL query are executed in the OODBMS, with the exception of the pattern matching predicate. The result is the list of cast members of all Woody Allen movies.
2. For each such person:
   (a) The HTML page with the biography of the person is transformed into a pure-text file (a replica of the HTML page) and stored in the text archive. For this transformation, we need a converter translating an HTML page into a string object. Such a converter can belong to the FCS of the HTML repository or to the FCS of the text archive.
   (b) The pattern matching operation is applied to the replica of the biography in the text archive. If the operation succeeds, the person is attached to the list of results.
3. The persons found are sorted by name.
4. The FCS of the OODBMS is used to transform the list of results into the format HTMLDoc[UnorderedList[Link]]("Query result").
5. The HTML page produced is sent to the web interface of the View Exporter.

Format conversions are necessary in steps 2(a) and 4. The corresponding converters are generated by object-grammars, which are introduced in the following.
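The five steps of the execution plan can be traced with a small in-memory simulation. Everything concrete here is an assumption made for illustration: the sample persons and films are invented, NEAR is interpreted as "within a few words", and the link target `/person?name=...` is a hypothetical URL scheme.

```python
import html
import re
from urllib.parse import quote

# Invented stand-in for the OODBMS: (person, film, category).
JOBS = [
    ("Alpha, Ann", "Film One", "cast"),
    ("Beta, Bob", "Film One", "cast"),
    ("Director, Dan", "Film One", "direction"),
    ("Gamma, Gil", "Film Two", "cast"),
    ("Other, Olga", "Film Two", "direction"),
]
# Invented stand-in for the HTML repository: person -> biography page.
BIOGRAPHY_PAGES = {
    "Alpha, Ann": "<P>Often cast as a <B>neurotic</B> New Yorker.</P>",
    "Beta, Bob": "<P>Stage actor from Berlin.</P>",
    "Gamma, Gil": "<P>A neurotic New York poet.</P>",
}


def cast_of_director(director):
    """Step 1: all predicates except the pattern match."""
    films = {f for p, f, c in JOBS if c == "direction" and p == director}
    return {p for p, f, c in JOBS if c == "cast" and f in films}


def html_to_text(page):
    """Step 2(a): strip tags to obtain a pure-text replica."""
    return html.unescape(re.sub(r"<[^>]+>", " ", page))


def near_match(text, window=5):
    """Step 2(b): 'neurotic' within `window` words of 'New York*' (assumed semantics)."""
    words = re.findall(r"\w+", text.lower())
    neurotic = [i for i, w in enumerate(words) if w == "neurotic"]
    new_york = [i for i in range(len(words) - 1)
                if words[i] == "new" and words[i + 1].startswith("york")]
    return any(abs(i - j) <= window for i in neurotic for j in new_york)


def run_plan(director):
    candidates = cast_of_director(director)
    replicas = {p: html_to_text(BIOGRAPHY_PAGES.get(p, "")) for p in candidates}
    hits = sorted(p for p in candidates if near_match(replicas[p]))   # step 3
    items = "".join(                                                   # step 4
        f'<LI><A HREF="/person?name={quote(p)}">{html.escape(p)}</A></LI>'
        for p in hits)
    return ("<HTML><HEAD><TITLE>Query result</TITLE></HEAD>"
            f"<BODY><UL>{items}</UL></BODY></HTML>")                   # step 5: page to export
```

In WIND the conversions in steps 2(a) and 4 would be carried out by converters generated from object-grammars rather than by the hand-written functions used here.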
5 Object-Grammars
An object-grammar describes the textual representation of an object according to a given format and depending on the object’s actual type. In an object-oriented environment, objects can have different textual representations. For instance, a number can be written in decimal as well as in hexadecimal notation. We may indicate this by using the format names dec and hex, respectively. Since there are Reals as well as Integers to be represented, different rules are necessary to describe, for each format name, its floating point and its integer notation.

Existing attribute grammar formalisms [Paa95] do not have this feature. Furthermore, most of them are used for parsing only. Even though Definite Clause Grammars [PW80] can be used both for parsing and for text generation (so-called unparsing), they are not well suited for our purpose, since the underlying language Prolog supports neither object-oriented data models nor strong typing.
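The idea that a rule is chosen by both the object's type and the format name can be illustrated with a small dispatch table. This is a sketch of the selection principle only: it covers unparsing but not parsing, and the table layout and the function name `unparse` are assumptions.

```python
# One rule per (type, format name) pair, mirroring the dec/hex example.
RULES = {
    (int, "dec"): lambda value: str(value),
    (int, "hex"): lambda value: format(value, "x"),
    (float, "dec"): lambda value: repr(value),
    (float, "hex"): lambda value: value.hex(),
}


def unparse(value, format_name):
    """Select the rule for the value's actual type and the requested format."""
    rule = RULES.get((type(value), format_name))
    if rule is None:
        raise LookupError(
            f"no rule for {type(value).__name__} @ {format_name}")
    return rule(value)
```

For example, `unparse(255, "hex")` yields `ff`, while the same format name applied to a float selects the floating point rule instead.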
5.1 Specifying HTML Pages by an Object-Grammar
We show the use of object-grammars by specifying web pages for persons in the movie database. In order to stress the general principles of object-grammars while keeping the example simple, we restrict the specification to simplified HTML pages. The web page of a person contains a set of links to all movies the person has been involved in. It has the following structure (variable parts are described in words):

    <HTML>
    <HEAD>
    <TITLE>the person's name</TITLE>
    </HEAD>
    <BODY>
    <H1>the person's name</H1>
    <OL>
    <LI><A HREF="URL of movie 1">title of movie 1</A>
    ...
    <LI><A HREF="URL of movie n">title of movie n</A>
    </OL>
    </BODY>
    </HTML>
A Person’s HTML Page. The following object-grammar PersonSimpleHTML provides a format SimpleHTML to represent persons as HTML pages.

     1  grammar PersonSimpleHTML
     2
     3  Person @ SimpleHTML -->                 /* Rule 1 */
     4    "<HTML>"
     5    "<HEAD>"
     6    "<TITLE>" self.name "</TITLE>"
     7    "</HEAD>"
     8    "<BODY>"
     9    "<H1>" self.name "</H1>"
    10    self @ Films
    11    "</BODY>"
    12    "</HTML>"
Rule 1 describes how a Person can be represented in format SimpleHTML as an HTML page whose skeleton is given in lines 4–12. Strings (like "<HTML>") and string-valued expressions (like self.name) occurring in rule 1 are represented verbatim using the format Default. The specification @Default can be omitted. The person’s name is used for the title (line 6) and the headline (line 9) of the page. The nonterminal expression self@Films in line 10 stands for the set of movies the person has contributed to.

The representation of a Person in format Films is a numbered list of all films (s)he participated in. This list is defined by the query starting at line 14 as a view containing the films in which the person was involved, taken from the list of jobs of this person. The result is represented as OrderedList, ordered by the movie’s year of appearance and title.

    13  Person @ Films -->                      /* Rule 2 */
    14    (sort film in (
    15      select distinct job.film
    16      from job in self.jobs
    17    ) by -film.year, film.title
    18    ) @ OrderedList
    19  List[Film] @ OrderedList -->            /* Rule 3 */
    20    "<OL>"
    21    self @ Sequence
    22    "</OL>"
An OrderedList in HTML is a Sequence of elements enclosed in a pair of <OL>...</OL> tags, as described in rule 3. Format Sequence is defined in rule 4 as a wrapper for the recursive format From(.), which is in turn defined by rule 5 providing two alternatives (lines 25–27 and line 28). It represents the elements self[i] of a List of Films as a concatenation of HTML ListItems. The constraints given in lines 27 and 28 use the value of i to select the appropriate alternative, in order to limit the number of recursions.

    23  List[Film] @ Sequence -->               /* Rule 4 */
    24    self @ From(0)
    25  List[Film] @ From(i: Integer) -->       /* Rule 5 */
    26    self[i]@ListItem self@From(i+1)
    27    { i < self.count }
    28    | { i >= self.count }
Each element of the list of films is represented according to rule 6 by prepending an <LI> tag and then using format Link, as defined in rule 7, to provide a link to this film.

    29  Film @ ListItem --> "<LI>" self@Link    /* Rule 6 */
    30  Film @ Link -->                         /* Rule 7 */
    31    "<A HREF=" self.url ">" self.title "(" self.year ")"
    32    "</A>"
HTML uses <A> elements to specify links. The destination URL must be given as an attribute HREF. Assume that there is a method url in the class Film which returns the unique URL of its web page. The title and year of the film form the link’s anchor (line 31), and are therefore given between the <A>...</A> tags.

Generalisation. The rules described above are self-contained and do not make use of modern software engineering concepts. Since HTML representations of several object types will be needed and lists for numerous element types are to be formatted in various ways, generic formats are essential for software reuse. We now show how we can generalize the formats OrderedList, ListItem, Sequence and From(.) in a generic way.

In rule 3 defining format OrderedList, we generalize the type from List[Film] to the generic type List[G] and introduce a formal generic parameter EltFormat. When format OrderedList is used to represent an actual list, e.g. of type List[Film], EltFormat must be replaced by an actual format which can be used to represent objects of the actual type replacing G, i.e. Film. This is denoted as EltFormat->G:

    List[G] @ OrderedList[EltFormat->G] -->     /* Rule 3’ */
      "<OL>"
      self @ Sequence[ListItem[EltFormat]]
      "</OL>"
Each element of a list formatted as OrderedList is represented as a ListItem. The list item itself is represented in the format EltFormat. This requires the format ListItem to become generic as well, in order to propagate the format parameter EltFormat to it. The list items are concatenated as specified by the format Sequence used for the list itself. The rules for Sequence and From are generalized in the same way as rule 3, by allowing a generic type List[G] and adding a format parameter EltFormat:

    List[G] @ Sequence[EltFormat->G] -->            /* Rule 4’ */
      self @ From[EltFormat](0)

    List[G] @ From[EltFormat->G](i: Integer) -->    /* Rule 5’ */
      self[i] @ EltFormat self @ From[EltFormat](i+1)
      { i < self.count }
      | { i >= self.count }

The generic version of format ListItem generalizes the type from Film to the generic type G and requires a format parameter ItemFormat which can be used to represent objects of type G, as specified by ItemFormat->G:

    G @ ListItem[ItemFormat->G] -->                 /* Rule 6’ */
      "<LI>" self @ ItemFormat

The generic formats described above can now be included in a standard library and used by other grammars. To do so for PersonSimpleHTML, the format OrderedList in line 18 must be replaced by the generic format OrderedList[Link]. Rules 3–6 can then be omitted.
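The generic formats of rules 3’ to 6’ behave much like higher-order functions that take an element format as a parameter. The following sketch mirrors them for the unparsing direction only; modelling a format as a Python function from object to string, and the concrete `<OL>` and `<LI>` tags, are assumptions of this illustration.

```python
def list_item(item_format):
    """Rule 6': prepend <LI> and represent the object in item_format."""
    return lambda obj: "<LI>" + item_format(obj)


def sequence(elt_format):
    """Rules 4' and 5': concatenate the elements via the recursive format From(i)."""
    def from_index(items, i):
        if i < len(items):                       # constraint { i < self.count }
            return elt_format(items[i]) + from_index(items, i + 1)
        return ""                                # constraint { i >= self.count }
    return lambda items: from_index(items, 0)


def ordered_list(elt_format):
    """Rule 3': wrap the sequence of list items in <OL>...</OL>."""
    return lambda items: "<OL>" + sequence(list_item(elt_format))(items) + "</OL>"
```

Any function from an element to a string can be plugged in, for instance `ordered_list(str)([1, 2])`, which corresponds to instantiating EltFormat with a format suitable for the list's element type.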
5.2 Picasso’s HTML Page
We now apply the grammar PersonSimpleHTML to a person who does not owe his fame to his movie career but was nevertheless involved in films: Pablo Picasso played himself in the films “Le Testament d’Orphee” of 1959 and “Le Mystere Picasso” of 1956; for the latter he also wrote the script. The objects relevant to his film activities are shown in Figure 5. The query producing the films Picasso was involved in is:

    sort film in (
      select distinct job.film
      from job in self.jobs
    ) by -film.year, film.title

This query returns the list [Film1,Film2], which is represented as an OrderedList[Link], each film finally being formatted as Link. The syntax tree representing the result as an HTML page is shown in Fig. 6. After each node, we show the rule used to expand this node.
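[Figure: object graph. The set persons contains Person 1 (Picasso, Pablo); the set films contains Film 1 (Le Testament d’Orphee, 1959) and Film 2 (Le Mystere Picasso, 1956). Person 1 has the jobs Job 1 (cast, himself), Job 2 (cast, himself) and Job 3 (writer); each job refers to its person and its film.]
Figure 5: Film projects of Picasso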
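The effect of the query in this section can be traced with a small Python sketch. The classes Film and Job and the function films_of are hypothetical stand-ins for the schema classes; the assignment of Job 1 to Film 1 and of Jobs 2 and 3 to Film 2 follows the description in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Film:
    title: str
    year: int


@dataclass(frozen=True)
class Job:
    film: Film
    category: str


def films_of(jobs: List[Job]) -> List[Film]:
    # select distinct job.film from job in self.jobs
    distinct = {job.film for job in jobs}
    # sort ... by -film.year, film.title
    return sorted(distinct, key=lambda film: (-film.year, film.title))


film1 = Film("Le Testament d'Orphee", 1959)
film2 = Film("Le Mystere Picasso", 1956)
jobs = [Job(film1, "cast"), Job(film2, "cast"), Job(film2, "writer")]
result = films_of(jobs)
```

Film 2 occurs in two jobs but appears only once in the result because of the distinct clause, and the descending year puts Film 1 first.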
5.3 Formats, Format Names and Potential Nonterminals. A format is described by a format name. We extend this concept by allowing formal parameters to be attached to a format name. For instance, we could specify the base of a numerical notation as an argument to a general format name number. We define a format as a format name supplied with actual arguments.

A nonterminal in an object-grammar is a pair $\tau@\varphi$, where $\tau$ is a type and $\varphi$ a format name. A grammar rule may be given for each nonterminal. Otherwise, $\tau@\varphi$ inherits the rule of another nonterminal $\tau'@\varphi$, where $\tau'$ is a supertype of $\tau$. Since the actual type of an object may be a subtype of its declared type, different rules may be selected for a fixed term denoting an object and for a given format name. This is similar to the concept of late binding in object-oriented programming languages.

Instead of nonterminals, we have nonterminal expressions on the right hand side of an object-grammar rule. A nonterminal expression is a pair t@f representing an object expression t in a format f. To each nonterminal expression there belongs a set of potential nonterminals $\tau@\varphi$ corresponding to the format name $\varphi$ of f and all potential types $\tau$ of t. A production of an object-grammar rule is obtained by replacing each nonterminal expression on its right hand side by one of its potential nonterminals. This instance is a context free production. Therefore, the equivalent of an object-grammar rule in a standard context free grammar would be the set of the rule’s productions. Depending on the class hierarchy, the context free equivalent of an object-grammar can have arbitrarily many productions. Hence, object-grammars provide a very concise means to describe text representations of objects.

Generic Types. Generic types are an important feature of modern object-oriented languages. Text representations of generic types need to be generic as well. For instance, the format used for a list of objects as a whole should be independent of the format used to represent each object. In the simplest case, a list will be represented as a concatenation of its elements’ text representations with the element format supplied as a generic parameter. This is yet another requirement not addressed by typical grammar formalisms. We define a generic format to be a generic format name supplied with actual generic parameters.
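The growth of the context free equivalent can be illustrated by enumerating the productions of a single rule. The sketch below is an assumption-laden illustration: the subtype table (including the classes Actor and TVFilm and the format Name) is hypothetical, and nonterminals are modelled as plain strings.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical class hierarchy: each declared type with all of its potential types.
SUBTYPES: Dict[str, List[str]] = {
    "Person": ["Person", "Actor"],
    "Film": ["Film", "TVFilm"],
}


def potential_nonterminals(declared_type: str, format_name: str) -> List[str]:
    # One nonterminal per potential type of the object expression.
    types = SUBTYPES.get(declared_type, [declared_type])
    return [f"{t}@{format_name}" for t in types]


def productions(head: str, body: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    # body lists (declared type, format name) for each nonterminal expression.
    choices = [potential_nonterminals(t, f) for t, f in body]
    return [(head, list(combo)) for combo in product(*choices)]


rule_body = [("Person", "Name"), ("Film", "MovieTitle")]
expanded = productions("Job@Appearance", rule_body)
```

With two potential types for each of the two nonterminal expressions, the single rule corresponds to four context free productions; the number multiplies with every further expression and subtype.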
[Figure: derivation tree; the number of the rule used to expand a node is given after the node.]

    Person1@SimpleHTML                                     #1
      Person1.name  (Picasso,Pablo)
      Person1@Films                                        #2
        Person1.name  (Picasso,Pablo)
        [Film1,Film2]@OrderedList[Link]                    #3’
          [Film1,Film2]@Sequence[ListItem[Link]]           #4’
            [Film1,Film2]@From[ListItem[Link]](0)          #5’
              Film1@ListItem[Link]                         #6’
                Film1@Link                                 #7
                  Film1.url  Film1.title  Film1.year  (Le Testament d’Orphee (1959))
              [Film1,Film2]@From[ListItem[Link]](1)        #5’
                Film2@ListItem[Link]                       #6’
                  Film2@Link                               #7
                    Film2.url  Film2.title  Film2.year  (Le Mystere Picasso (1956))
                [Film1,Film2]@From[ListItem[Link]](2)      #5’

Figure 6: Derivation of Picasso’s HTML page

Queries and Constraints. The concept of object-grammars is open for different object-oriented data models and query languages (e.g. OQL). Using queries in nonterminal expressions is a very powerful feature allowing for text representations based on database views. Constraints can be given on the right hand side of a grammar rule, in order to both determine the applicability of the rule and to create new variable bindings as in constraint logic programming languages. Object expressions in nonterminal expressions can be seen as a special case of equality constraints. Parsing means solving these constraints in order to determine a new consistent database state including the parsed information.

After this informal introduction, the next sections give formal definitions for object-grammars, their type correctness and the concept of derivations.
5.4 Formal Definition of Object-Grammars
Object-Grammars are formulated with respect to an (object-oriented) database schema S which is described by the following sets:

• classes(S) is the set of classes in S. It contains at least the base classes Boolean and String. We observe the “subclass” relationship on classes(S) as a partial ordering denoted as $\preceq$. For instance, in the IMDB schema (cf. Figure 4), classes(S) is the set {Boolean, Integer, String, List[], Set[], Person, Film, Job}.

• typevar(S) defines a (countable) set of type variables, denoted by capital letters G, H, ...

• gentypes(S) is the set of all generic type expressions over classes(S) with type variables from typevar(S). Examples of type expressions are Integer, Set[Person] and Table[K,List[G]]. The subclass relationship extends to a subtype relationship on gentypes(S), i.e., if $C'[\,] \preceq C[\,]$, then $C'[\tau] \preceq C[\tau]$ for any type $\tau \in$ classes(S). For instance, if we assume SparseMatrix[] $\preceq$ Matrix[], then SparseMatrix[Real] $\preceq$ Matrix[Real].

• The set types(S) describes all ground types, i.e. type expressions not containing any type variables. Ground types are String, Set[Job], List[Film] etc. It holds that types(S) $\subseteq$ gentypes(S).

• instances(S) is the set of database instances with respect to S. In our example, an instance is the transitive closure of persons $\cup$ films for a finite set persons of person objects and films of film objects.

• The set val(S) is the universe of possible data, containing both values (e.g. all Integers) and objects (e.g. all Person-objects). Therefore, val(S) is the union of all instance sets of classes in classes(S).

• var(S) is the set of variable names. They are denoted as lower case strings. The special variable name self is reserved for the current object.

• terms(S) is the set of object expressions over var(S) which can be evaluated to elements in val(S). For instance, self.film, persons and (select p from p in persons where p.name = "Allen,Woody") are valid terms.
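The extension of the subclass ordering to generic type expressions can be checked mechanically. The following Python sketch assumes a hypothetical table of direct superclasses and represents a type expression as a pair of class name and argument tuple; only the SparseMatrix/Matrix example is taken from the text.

```python
from typing import Dict, Optional, Tuple

# Hypothetical direct-superclass table (None marks a root class).
SUPERCLASS: Dict[str, Optional[str]] = {
    "SparseMatrix": "Matrix",
    "Matrix": None,
    "Real": None,
    "Integer": None,
}

GenType = Tuple[str, tuple]  # (class name, tuple of argument types)


def is_subclass(c: str, d: str) -> bool:
    # Reflexive, transitive closure of the direct-superclass relation.
    current: Optional[str] = c
    while current is not None:
        if current == d:
            return True
        current = SUPERCLASS.get(current)
    return False


def is_subtype(t: GenType, u: GenType) -> bool:
    # C'[tau] is a subtype of C[tau] if C' is a subclass of C
    # and both are applied to the same arguments.
    (c, c_args), (d, d_args) = t, u
    return is_subclass(c, d) and c_args == d_args


real: GenType = ("Real", ())
integer: GenType = ("Integer", ())
sparse_real: GenType = ("SparseMatrix", (real,))
matrix_real: GenType = ("Matrix", (real,))
```

The sketch keeps the type arguments fixed, as the rule in the text does; it does not attempt co- or contravariant comparison of arguments.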
Using the above sets, an Object-Grammar G is defined by:
• a set var(G) of format variables, e.g. ItemFormat, EltFormat.

• a set id(G) of format identifiers, e.g. ListItem, OrderedList.

• a set names(G) of format names $F[p_1 \to \tau_1, \ldots, p_k \to \tau_k](a_1 : \sigma_1, \ldots, a_n : \sigma_n)$, where $F \in$ id(G) is a format identifier, each $p_j \in$ var(G), $j = 1, \ldots, k$, is a formal generic parameter, each $a_i \in$ var(S), $i = 1, \ldots, n$, is a formal argument, and $\tau_1, \ldots, \tau_k, \sigma_1, \ldots, \sigma_n \in$ gentypes(S) are types. We require that there exists exactly one format name for each format identifier in G. names(G) contains a special element “Default” denoting the verbatim representation for strings and a decimal representation for integers. For non-generic format names (k = 0), the brackets may be omitted. For instance, valid format names in PersonSimpleHTML are Link, ListItem[ItemFormat->G] or From[EltFormat->G](i:Integer).

• the set of nongeneric formats, denoted as formats$_0$(G). Each format $F(t_1, \ldots, t_n) \in$ formats$_0$(G) consists of a format identifier F corresponding to a nongeneric format name $F(a_1 : \sigma_1, \ldots, a_n : \sigma_n)$ and a list of object expressions $t_i \in$ terms(S) replacing the formal arguments $a_i$. If there are no arguments (n = 0), parentheses may be omitted.

• the set formats(G) of all formats is a superset of formats$_0$(G) $\cup$ var(G). It is defined recursively by the following rule: If $F \in$ names(G) is a format name with k formal generic parameters and n formal arguments, and $f_1, \ldots, f_k$ are formats from formats(G), $t_1, \ldots, t_n$ object expressions from terms(S), then $F[f_1, \ldots, f_k](t_1, \ldots, t_n)$ is a format in formats(G). For instance, formats(PersonSimpleHTML) contains the formats SimpleHTML, OrderedList[Link] and From[EltFormat](i+1).

• the set of format instances, denoted as forminst(G). A format instance is a format in which all object expressions have been evaluated and all format variables have been replaced by format instances. forminst(G) is defined recursively as a superset of $\{F(o_1, \ldots, o_n) \mid F \in \mathrm{id}(G),\ o_1, \ldots, o_n \in \mathrm{val}(S)\}$ by applying the following rule: If $F \in$ id(G) is a format identifier and $q_1, \ldots, q_k$ are format instances from forminst(G) and $o_1, \ldots, o_n$ values or objects from val(S), then $F[q_1, \ldots, q_k](o_1, \ldots, o_n)$ is an element of forminst(G).

• a set prod(G) of productions in the form $\tau@\varphi \longrightarrow \gamma_1 \mid \ldots \mid \gamma_m$, where $\tau@\varphi$ is a nonterminal. For each pair $\tau@\varphi$ there exists at most one production. Each alternative $\gamma_i$ of a production consists of a sequence of so-called nonterminal expressions t@f, with t being an object expression and f a format. For f = Default, t@f can be abbreviated as t. For each alternative, a constraint {c} may be given which specifies a selection criterion. Hence, a production alternative has the general form $t_1@f_1 \ldots t_r@f_r\ \{c\}$, where $t_1, \ldots, t_r, c \in$ terms(S) and $f_1, \ldots, f_r \in$ formats(G).
5.5 Type-Correctness
Informally, an object-grammar is type correct if all occurring object expressions are typed correctly and their values can be represented in the given formats. In this section, we formalize this notion of type correctness.

Definition 1: We call an object of type $\tau \in$ gentypes(S) representable in format $f \in$ formats(G) (“$f \to \tau$”), if there exists a supertype $\tau'$ of $\tau$ ($\tau \preceq \tau'$) which occurs in the head of a production for the format name of f. We require that there is at most one such type that is also minimal with respect to $\preceq$. (This can always be achieved by duplication of productions.)

Definition 2: The type of an object expression t occurring in the body of a production $\rho$ for type $\tau$ (with arguments $a_i$ of type $\sigma_i$, $i = 1, \ldots, n$) conforms to a type $\tau'$ (“$t : \tau'$”), if the type system of S allows the deduction $\{\mathit{self} : \tau,\ a_1 : \sigma_1, \ldots, a_n : \sigma_n\} \vdash t : \tau'$.

Definition 3: Any formal generic parameter p declared by $p \to \tau$ in the head of a rule $\rho$ is called type correct with respect to $\rho$. Then $p \to \tau$ can be assumed in the body of $\rho$.

Definition 4: We call a format $F[f_1, \ldots, f_k](t_1, \ldots, t_n)$ (with format name $F[p_1 \to \tau_1, \ldots, p_k \to \tau_k](a_1 : \sigma_1, \ldots, a_n : \sigma_n)$) type correct (with respect to a rule $\rho$), if
1. all actual generic parameters $f_j$ are type correct with respect to $\rho$,
2. $f_j \to \tau_j$ holds for all $j = 1, \ldots, k$ (according to definitions 1 and 3),
3. $t_i : \sigma_i$ holds for all $i = 1, \ldots, n$ (according to definition 2).

Corollary: Any non-generic format without arguments is type correct.

Definition 5: A nonterminal expression t@f (with t being an object expression and f a type correct format) occurring in a production $\rho$ is type correct, if there exists a type $\tau$ such that both $t : \tau$ and $f \to \tau$ can be deduced with respect to $\rho$. This guarantees that the value of t can be represented in format f.

Definition 6: A constraint {c} in a production $\rho$ is type correct, if c : Boolean can be deduced with respect to $\rho$.

Definition 7: An alternative of a production is type correct, if all nonterminal expressions t@f as well as the constraint {c} occurring in it are type correct. A production of the form $\tau@\varphi \longrightarrow \gamma_1 \mid \ldots \mid \gamma_m$ is type correct, if all its alternatives are type correct and there are no type variables which do not occur in the type expression $\tau$. An object-grammar is type correct, if all of its productions are type correct.
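The rule lookup implied by Definition 1 can be sketched in Python as follows. The class hierarchy (with the classes Object and Actor) and the set of production heads are hypothetical; the sketch only shows how the minimal supertypes with a production for a given format name are determined.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical hierarchy: each class with its direct superclasses.
PARENTS: Dict[str, List[str]] = {
    "Object": [],
    "Person": ["Object"],
    "Actor": ["Person"],
    "Film": ["Object"],
}


def ancestors(t: str) -> Set[str]:
    # All supertypes of t, including t itself.
    seen, stack = {t}, [t]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def minimal_rule_types(t: str, fmt: str, heads: Set[Tuple[str, str]]) -> Set[str]:
    # Supertypes of t occurring in a production head for fmt ...
    candidates = {s for s in ancestors(t) if (s, fmt) in heads}
    # ... restricted to those with no other candidate below them.
    return {c for c in candidates
            if not any(d != c and c in ancestors(d) for d in candidates)}


def representable(t: str, fmt: str, heads: Set[Tuple[str, str]]) -> bool:
    # Representable with an unambiguous rule: exactly one minimal candidate.
    return len(minimal_rule_types(t, fmt, heads)) == 1


HEADS = {("Person", "Link"), ("Object", "Link"), ("Film", "MovieTitle")}
```

Here an Actor in format Link would be handled by the rule for Person@Link, since Person is the minimal supertype with a production, while Object@Link is shadowed.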
5.6 Derivations
Preliminaries: In order to define derivations with object-grammars, we have to discuss the evaluation of object expressions to values and of formats to format instances. Object expressions are evaluated with respect to a database instance $db \in$ instances(S) and to an assignment which maps free variables to values. We can assume that assignments are extended to total functions on the set of all variables using default values. Therefore, we assume a partial function

$$\mathrm{eval} : \mathrm{terms}(S) \times \mathrm{val}(S)^{\mathrm{var}(S)} \times \mathrm{instances}(S) \longrightarrow \mathrm{val}(S)$$

which assigns a value to a type correct object expression provided that the computation terminates. Informally, the format instance of a format is obtained by evaluating all object terms occurring in the format and replacing all format variables with their format instances according to a given assignment.

Definition 8: Let $\beta : \mathrm{var}(G) \longrightarrow \mathrm{forminst}(G)$ be an assignment of format variables and $\alpha : \mathrm{var}(S) \longrightarrow \mathrm{val}(S)$ an assignment of object variables. A format $f \in$ formats(G) is either a format variable in var(G) or has the form $f = F[f_1, \ldots, f_k](t_1, \ldots, t_n)$. We define the function $\mathrm{inst}(f, \beta, \alpha, db)$, which instantiates f with respect to $\beta$ and $\alpha$:

$$\mathrm{inst}(f, \beta, \alpha, db) = \begin{cases} \beta(f) & f \in \mathrm{var}(G) \\ F[q_1, \ldots, q_k](o_1, \ldots, o_n) & f \in \mathrm{formats}(G) \setminus \mathrm{var}(G) \end{cases}$$

where $q_j = \mathrm{inst}(f_j, \beta, \alpha, db)$ for $j = 1, \ldots, k$ and $o_i = \mathrm{eval}(t_i, \alpha, db)$ for $i = 1, \ldots, n$.

Definition 9: Let o be an object of type $\tau$, q be an instance of a format f with $f \to \tau$, and $\varphi$ the format name of f. Then we call o@q an instance of the nonterminal $\tau@\varphi$.

Let $\varphi = F[p_1 \to \tau_1, \ldots, p_k \to \tau_k](a_1 : \sigma_1, \ldots, a_n : \sigma_n)$ be the format name of f. Then q has the form $F[\tilde{q}_1, \ldots, \tilde{q}_k](\tilde{o}_1, \ldots, \tilde{o}_n)$ for format instances $\tilde{q}_j$ and objects $\tilde{o}_i$.
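The instantiation function of Definition 8 translates almost directly into a recursive procedure. In the following Python sketch, the classes Format and FormatInstance are hypothetical encodings; format variables are modelled as strings, and an object expression is modelled as a callable that plays the role of eval for that expression.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple, Union

# Assumption: an object expression is a function of (assignment, database).
Term = Callable[[Dict[str, Any], Any], Any]


@dataclass(frozen=True)
class Format:
    ident: str
    params: Tuple[Union[str, "Format"], ...] = ()  # str = format variable
    args: Tuple[Term, ...] = ()


@dataclass(frozen=True)
class FormatInstance:
    ident: str
    params: Tuple["FormatInstance", ...] = ()
    args: Tuple[Any, ...] = ()


def inst(f: Union[str, Format], beta: Dict[str, FormatInstance],
         alpha: Dict[str, Any], db: Any) -> FormatInstance:
    if isinstance(f, str):
        # Case f in var(G): look the variable up in the format assignment.
        return beta[f]
    # Otherwise instantiate the generic parameters and evaluate the arguments.
    return FormatInstance(
        f.ident,
        tuple(inst(p, beta, alpha, db) for p in f.params),
        tuple(term(alpha, db) for term in f.args),
    )


# From[EltFormat](i+1), instantiated with EltFormat = Link and i = 0.
link = FormatInstance("Link")
fmt = Format("From", params=("EltFormat",), args=(lambda alpha, db: alpha["i"] + 1,))
q = inst(fmt, {"EltFormat": link}, {"i": 0}, db=None)
```

The result corresponds to the format instance From[Link](1), as it occurs in the derivation of Picasso's page when the element format is Link.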
Definition 10: We call a sequence $o_1@q_1 \ldots o_r@q_r$ of nonterminal instances directly derivable from the instance $o@q$ of a nonterminal $\tau@\varphi$ with respect to the variable assignments $\beta$ and $\alpha$ ($o@q \Rightarrow o_1@q_1 \ldots o_r@q_r$), if the following conditions hold:
1. there is a unique minimal type $\tau'$ with $\tau \preceq \tau'$ such that there exists a rule $\rho$ with head $\tau'@\varphi$,
2. there exists an alternative of $\rho$ having the form $t_1@f_1 \ldots t_r@f_r\ \{c\}$,
3. $\mathrm{eval}(c, \alpha, db) = \mathit{true}$,
4. $\mathrm{eval}(t_j, \alpha, db) = o_j$ for $j = 1, \ldots, r$,
5. $\mathrm{inst}(f_j, \beta, \alpha, db) = q_j$ for $j = 1, \ldots, r$.

Definition 11: A word w over the alphabet $\Sigma$ is called derivable in Default format from the string s ($s@\mathrm{Default} \Rightarrow w$), iff w is the value of s. This derivation relationship can now be extended point-wise on sequences $o_1@q_1 \ldots o_r@q_r$ of nonterminal instances.

Definition 12: Let $o@q$ be a nonterminal instance and $w \in \Sigma^{*}$ a word over an alphabet $\Sigma$. We call w derivable from $o@q$ iff

$$\exists\, w_1, \ldots, w_n \in \Sigma^{*} :\ o@q \Rightarrow^{*} w_1 \ldots w_n \ \wedge\ w = w_1 \ldots w_n$$
As with simple context free grammars, there do exist nonterminating derivations. However, the uniqueness of derivations can be achieved if the constraints occurring in the alternatives of a production are mutually exclusive.
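The role of mutually exclusive constraints can be illustrated with the two alternatives of rule 5’. The Python sketch below is a simplification under stated assumptions: alternatives are reduced to a label and a constraint, constraints are modelled as predicates over a variable assignment, and the helper names are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Assignment = Dict[str, Any]
Alternative = Tuple[str, Callable[[Assignment], bool]]  # (label, constraint)

# The two alternatives of rule 5', reduced to their constraints.
RULE_5: List[Alternative] = [
    ("element followed by rest", lambda a: a["i"] < a["count"]),
    ("empty", lambda a: a["i"] >= a["count"]),
]


def applicable(alternatives: List[Alternative], assignment: Assignment) -> List[str]:
    # Labels of all alternatives whose constraint evaluates to true.
    return [label for label, constraint in alternatives if constraint(assignment)]


def exactly_one_applies(alternatives: List[Alternative],
                        assignments: Iterable[Assignment]) -> bool:
    # True if every assignment selects exactly one alternative.
    return all(len(applicable(alternatives, a)) == 1 for a in assignments)


samples = [{"i": i, "count": 2} for i in range(4)]
deterministic = exactly_one_applies(RULE_5, samples)
```

For each sampled assignment exactly one alternative of rule 5’ is selected, so each derivation step is determined; a rule with overlapping constraints would fail this check.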
5.7 Using Object-Grammars in WIND
In the WIND architecture, object-grammars are used for query translation and for data translation.

Incoming queries (coded for instance as URLs) are translated by the client interfaces into WINDSurf queries. Also, the subquery execution plans sent to the document repositories are represented as objects to be translated into the local repository query language by the query transformer. In both cases, object-grammars can be used to specify the translations.

The arguments and results of subqueries directed to a document repository are translated into/from the repository internal data representation using format converters provided by the format conversion services of the WIND wrapper. This is shown in Figure 3. Each format converter in the format conversion service can be specified by an object-grammar.

In order to create a format converter, we use the schema of the wrapped information system to develop an object-grammar describing the text format in terms of this schema. The object-grammar together with the schema are used as input for a generator which creates the format converter. Figure 7 shows this process as well as the use of the created converters in WIND wrappers.
6 Object-Grammars for the Internet Movie Database
We present the use of object-grammars in an extended example. It is used for remodelling the Internet Movie Database as a WIND instance. The object-grammar shown in Figure 8 specifies the translation of the actors.list source file into database contents. The file consists of a record for each person who has acted in a film, as shown in Fig. 1 in section 2.
[Figure: diagram with the components Document Repository, Information System, Query Translator, Format Conversion Service, Converter and Converter Generator, the inputs QEP Schema, Query Language OG, IS Schema and Format OG, and the data items Query, QEP, Object Graph, Syntax Tree and Text Representation.]
Figure 7: Generation and use of format converters

Rule 1 The text representation of a person set (in format Actors) consists of all those persons who had a Job of category cast in some film. The query in lines 4-9 selects these persons and sorts them alphabetically by name. The result is expected in format Sequence (see section 5.1) with each person in format Actor (see rule 2). In order to import actors.list into the database, it is parsed with start symbol persons @ Actors, resulting in a set of database constraints determining the subset of all actors in persons.

Rule 2 Each person’s record begins with the person’s full name, which serves as primary key in the set allpersons according to the rule’s constraint. Assuming the database to be consistent, there exists at most one Person object for each name. If the parsed name matches the name of a person in allpersons, self will be bound to this Person object; otherwise a new Person object is created. In the person’s record, there follows a list of the person’s appearances in films, consisting of the person’s jobs where category equals cast, sorted by descending film year and then by film title. Its expected format is a Sequence of Jobs, each formatted as Appearance. Each record is terminated by a blank line specified by "\n".

Rule 3 Each appearance of a person in a film is written as the MovieTitle of the film, followed by optional remarks (each in format Remark), the role name (in brackets) and optionally the person’s position in the credits (in format Rank). When parsing a Job in format Appearance, its respective attributes will be assigned the parsed data. The attribute person will be set to the argument actor and category will be set to the constant cast.

Rules 4–6 In format MovieTitle, a film is represented by its title, followed by its production year (in parentheses). The constraint specifies this combination as a primary key in the set films. Similar to rule 2, self will be assigned either an existing film of that title/year or, otherwise, a new Film object. Rules 5 and 6 describe the representation of a string as a Remark and of an integer as optional Rank, as used in rule 3.
     1  grammar Actors inherit Standard
     2
     3  Set[Person] @ Actors -->                              /* Rule 1 */
     4    (sort actor in (
     5       select distinct person
     6       from person in self
     7       where exists job in person.jobs:
     8             job.category = cast
     9     ) by actor.name )
    10    @ Sequence[Actor(self)]
    11
    12  Person @ Actor(allpersons: Set[Person]) -->           /* Rule 2 */
    13    self.name
    14    ( sort appearance in (
    15        select job
    16        from job in self.jobs
    17        where job.category = cast
    18      ) by -appearance.film.year, appearance.film.title
    19    ) @ Sequence[Appearance(self)]
    20    "\n"
    21    { self = element(select p from p in allpersons
    22                     where p.name = self.name) }
    23
    24  Job @ Appearance(actor: Person) -->                   /* Rule 3 */
    25    "\t" self.film @ MovieTitle
    26    self.remarks @ Sequence[Remark]
    27    "[" self.role "]"
    28    self.rank @ Rank
    29    "\n"
    30    { self.person = actor,
    31      self.category = cast }
    32
    33  Film @ MovieTitle -->                                 /* Rule 4 */
    34    self.title "(" self.year ")"
    35    { self = element(select film from film in films
    36                     where film.title = self.title
    37                     and film.year = self.year) }
    38
    39  String @ Remark -->                                   /* Rule 5 */
    40    "(" self ")"
    41
    42  Integer @ Rank -->                                    /* Rule 6 */
    43      "<" self ">"                                      { self != 0 }
    44    | ""                                                { self = 0 }
    45
    46  end Actors;

Figure 8: Object-grammar for the source file actors.list
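For comparison with a hand-written converter, the parsing direction of rules 3 to 6 can be approximated in Python with a regular expression. This is a sketch under assumptions, not the converter that would be generated from the object-grammar: it assumes that tokens are separated by optional whitespace, that remarks contain no nested parentheses, and that the rank is written in angle brackets; the function name parse_appearance and the sample lines are illustrative.

```python
import re
from typing import Dict, Optional

APPEARANCE = re.compile(
    r"\t(?P<title>.+?)\s*\((?P<year>\d{4})\)"  # self.film @ MovieTitle (rule 4)
    r"(?P<remarks>(?:\s*\([^()]*\))*)"          # self.remarks @ Sequence[Remark]
    r"\s*\[(?P<role>[^\]]*)\]"                  # "[" self.role "]"
    r"(?:\s*<(?P<rank>\d+)>)?"                  # self.rank @ Rank (assumed "<n>")
    r"\s*\n"                                    # terminating "\n"
)


def parse_appearance(line: str) -> Optional[Dict[str, object]]:
    match = APPEARANCE.fullmatch(line)
    if match is None:
        return None
    return {
        "title": match.group("title"),
        "year": int(match.group("year")),
        "remarks": re.findall(r"\(([^()]*)\)", match.group("remarks")),
        "role": match.group("role"),
        # Rule 6: an absent rank corresponds to the value 0.
        "rank": int(match.group("rank")) if match.group("rank") else 0,
        # Constraint of rule 3: the category is fixed to cast.
        "category": "cast",
    }


with_rank = parse_appearance("\tLove and Death (1975) [Boris Dimitrovich Grushenko] <1>\n")
with_remark = parse_appearance("\tDon't Drink the Water (1994) (TV) [Walter Hollander]\n")
```

Unlike the object-grammar, such a pattern covers only one direction and does not resolve the parsed title and year to an existing Film object; the primary key constraint of rule 4 would have to be coded separately.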
7 Conclusions
The information available in the Internet is mainly retrieved by browsing and searching on index servers. Database-like queries are supported only to a limited extent because of the sparse structure of most web documents and the practically infinite size of the Internet. Another problem of Internet information retrieval is to support user-specific views with respect to both the content and the format of the presented data. In this work, we presented the WIND architecture, which proposes solutions for both aspects of Internet data retrieval. We focus on information integration for some application domain. A core of essential information is loaded from the Internet in advance, whereas additional information is retrieved on demand. It can be maintained in multiple presentations in different Document Repositories. The WIND-Server queries and updates the DRs in a uniform manner by using WIND-Wrappers to translate queries and results. A key to our methodology for a flexible view of the Internet is the generation of format converters from object-grammars. As opposed to ad hoc solutions for transformations between documents and objects, we propose the object-grammar formalism and show its appropriateness by examples of semistructured Internet data. The implementation of the WIND components reveals several challenges, such as the automatic generation of format converters, the optimization of processing requests comprised of conventional subqueries and format conversion requests, and update propagation from the Internet sources to local representations. The aims of our current research are the development of efficient parsing and unparsing algorithms for object-grammars, the implementation of a format converter generator and the establishment of a working WIND prototype. Converters are a key component of the prototype, since they are needed for storing and reformatting objects during query processing.
The Query Optimizer should also take into account the converters available to each DR it considers. We need to develop a cost model estimating the cost of involving DRs with different converters in the execution of a given query, and a transformation mechanism to explore the increased search space.
References
[ABH93]
Karl Aberer, Klemens Böhm, and Christoph Hüser. The prospects of publishing using advanced database concepts. Electronic Publishing, 6(4):469–480, dec 1993.
[ACM93]
Serge Abiteboul, Sophie Cluet, and Tova Milo. Querying and updating the file. In 19th Intl. Conference on VLDB, volume 19, pages 73–85, 8 1993.
[ACM95]
Serge Abiteboul, Sophie Cluet, and Tova Milo. A database interface for file update. In SIGMOD ’95, pages 386–397, 1995.
[ACM97]
Serge Abiteboul, Sophie Cluet, and Tova Milo. Correspondence and translation for heterogeneous data. In ICDT’97, 1997. to appear.
[BA96]
Klemens Böhm and Karl Aberer. HyperStorM—administering structured documents using object-oriented database technology. In ACM SIGMOD Intl. Conference on Management of Data, page 547, Montreal, Quebec, Canada, 4–6 June 1996.
[CG96]
Surajit Chaudhuri and Luis Gravano. Optimizing queries over multimedia repositories. In SIGMOD’96, pages 91–102, Montreal, Canada, June 1996. ACM.
[CGMH+ 94]
S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS project: Integration of heterogeneous information sources. In Proceedings of the 100th Anniversary Meeting, pages 7–18. Information Processing Society of Japan, 1994.
[Etz96]
Oren Etzioni. The World-Wide Web: Quagmire or gold mine? CACM, 39(11):65–68, Nov. 1996.
[Fag96]
Ronald Fagin. Combining fuzzy information from multiple systems. In PODS’96, pages 216–226, Montreal, Canada, June 1996. ACM.
[FLS97]
Lukas C. Faulstich, Volker Linnemann, and Myra Spiliopoulou. Using object-grammars for Internet data warehousing. Technical report, Institut für Informationssysteme, Med. Universität Lübeck, 1997. http://www.inf.fu-berlin.de/faulstic/wind.ps (English version).
[FPSS96]
Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from volumes of data. CACM, 39(11):27–34, Nov. 1996.
[FW93]
An Feng and Toshiro Wakayama. SIMON: A grammar-based transformation system for structured documents. Electronic Publishing: Origination, Dissemination, and Design, 6(4):361–372, December 1993.
[HGML+ 95]
J. Hammer, H. Garcia-Molina, W. Labio, J. Widom, and Y. Zhuge. The Stanford data warehousing project. In Widom [Wid95], pages 42–48.
[IK93]
W.H. Inmon and C. Kelley. Rdb/VMS: Developing the Data Warehouse. QED Publishing Group, Boston, Massachusetts, 1993.
[Inm92]
W.H. Inmon. EIS and the data warehouse: a simple approach to building an effective foundation for EIS. Database Programming & Design, 5(11):70–73, nov 1992.
[Inm96]
W.H Inmon. The data warehouse and data mining. CACM, 39(11):49–50, Nov. 1996.
[KP93]
Eila Kuikka and Martti Penttonen. Transformation of structured documents with the use of grammar. Electronic Publishing: Origination, Dissemination, and Design, 6(4):373–383, December 1993.
[LRO96]
A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying Heterogeneous Information Sources Using Source Descriptions. In 22nd Intl. Conference on VLDB, pages 251– 262, 1996.
[Paa95]
Jukka Paakki. Attribute grammar paradigms: A high-level methodology in language implementation. ACM Computing Surveys, 27(2):196–255, June 1995.
[PW80]
F.C.N. Pereira and D.H.D. Warren. Definite Clause Grammars for language analysis. Artificial Intelligence, 13:231–278, 1980.
[R.G94]
R. G. G. Cattell. The Object Database Standard, ODMG-93. Morgan Kaufmann, 1994.
[SL95]
Ulrike Stutschka and Volker Linnemann. Attributierte Grammatiken als Werkzeug zur Datenmodellierung. In Georg Lausen, editor, Datenbanksysteme in Büro, Technik und Wissenschaft (BTW’95), pages 160–178. GI, 1995.
[vdBV96]
Mark van den Brand and Eelco Visser. Generation of formatters for context-free languages. ACM Trans. on Software Engineering and Methodology, 5(1):1–41, jan 1996.
[Wid95]
J. Widom, editor. Special Issue on materialized views and data warehousing, IEEE Data Engineering Bulletin 18(2), 1995.
[ZHKF95]
Gang Zhou, Richard Hull, Roger King, and Jean-Claude Franchitti. Warehousing using h2o. In Widom [Wid95], pages 29–40.
Appendix

    Love and Death (1975)
    USA 1975 Color
    Produced by: Jack Rollins & Charles H. Joffe Productions
    Genre(s)/keyword(s): Comedy / historical / suicide
    Certification: USA:PG
    Runtime: USA:85

    Directed by
        Woody Allen
    Cast (in credits order) verified as complete
        Woody Allen .... Boris Dimitrovich Grushenko
        Diane Keaton .... Sonja
        ...
    Written by
        Woody Allen
        Mildred Cram
        Donald Ogden Stewart
    Cinematography by
        Ghislain Cloquet
    Music by
        Sergei Prokofiev
    ...
    Copyright 1990-1996 The Internet Movie Database Team

Figure 9: IMDB HTML-Page for the movie “Love and Death” (shortened)
    Woody Allen
    Woody Allen’s biographical information.
    Also Known As: Allen Stewart Konigsberg
    There’s also a combined view of this filmography.
    The filmography lists the titles for which Woody Allen was:
        Writer  Actor  Director  Producer  Composer

    Writer
        1. Everybody Says I Love You (1996)
        2. Mighty Aphrodite (1995) (DS:AAN)
        3. Bullets Over Broadway (1994) (DS:AAN)
        4. Don’t Drink the Water (1994) (TV) (also play)
        5. Manhattan Murder Mystery (1993)
        6. ...
    Actor
        1. Everybody Says I Love You (1996)
        2. Mighty Aphrodite (1995) .... Lenny
        3. Don’t Drink the Water (1994) (TV) .... Walter Hollander
        4. ...
    Director
        1. Everybody Says I Love You (1996)
        2. Mighty Aphrodite (1995)
        3. Bullets Over Broadway (1994) (AAN)
        4. ...
    ...
    Copyright 1990-1996 The Internet Movie Database Team

Figure 10: IMDB HTML-page for Woody Allen (shortened)