A novel approach to querying the Web: Integrating Retrieval and Browsing

Zoe Lacroix*, Arnaud Sahuguet†, R. Chandrasekar‡ and B. Srinivas§
IRCS, University of Pennsylvania, Suite 400A, 3401 Walnut St, Philadelphia PA 19104, USA
Phone: 1 (215) 898 0326 -- Fax: 1 (215) 573 9427
Web: http://www.cis.upenn.edu/AKIRA
{lacroix,[email protected]} {mickeyc,[email protected]}

August 27, 1997

Abstract

We propose a new approach to querying the Web based on information retrieval (IR), browsing, as well as database techniques, so as to provide maximum flexibility to the user. We present a model based on object representation where an identity does not correspond to a source HTML page but to a fragment of it. A fragment is identified by using the explicit structure provided by the HTML tags as well as the implicit structure extracted using IR techniques. We define a query language, PIQL, that is a simple algebra extended with restructuring primitives. Our language expresses browsing and restructuring based on IR techniques in a unified framework. All these components take part in the AKIRA system, currently under development. Keywords: views, data model, query language, information retrieval, agents

1 Introduction

The Web invades our lives. While you read this short introduction, it will probably have grown by several hundred pages, making extra information available to everyone. But of what use can this information be if we cannot manage it properly? Some people view the Web as a set of text pages without paying much attention to the explicit structure given by links; these people typically use full-text search engines to retrieve content. Other people view the Web as a gigantic network, and focus on the hyperlinks between pages. In both cases, a crucial aspect of the information has been overlooked, which prevents any effective processing of it. What we need is an effective method of using all the information available. In this paper, we present AKIRA (Another Knowledge-based Information Retrieval Approach), an attempt to integrate, using database technologies, the navigational capabilities offered by HTML documents, and IR techniques able to exploit and infer finer structure implicit in the content of a set of Web pages. Many people model the Web as a database, but is it really so? On the Web, data consists of HTML pages with tags and hyperlinks. It has an explicit structure defined by its tags and anchors. A page also has content (text, image, sound, video, etc.) which is usually in free and unrestricted form, and predominantly textual. The content carries implicit structure not known in advance. Its blurred structure makes us view the Web as a semistructured database [Abi97, Bun97].

* Work supported by NSF STC grant SBR-8920230 and ARO grant DAAH04-95-I-0169.
† Work supported by ARO grant DAAH04-93-G0129 and ARPA grant N00014-94-1-1086.
‡ On leave from NCST, Bombay.
§ This work is partially supported by NSF grant NSF-STC SBR 8920230, ARPA grant N00014-94

and ARO grant DAAH04-94-G0426.

In contrast to a database, no super-user, central authority, or federation monitors, regulates, schedules, controls, or supervises the whole Web. Each user can modify her pages at any time. Moreover, a user may also give any visitor the right to update her pages (for instance a counter, or a guest-book may be remotely updated by a visitor who does not own the page). This anarchy results in a permanent evolution of the Web. Every second, several pages are deleted and created. This constant change outpaces any exploration strategy. It is therefore impossible to take an exhaustive snapshot of the Web. We need better methods to explore the Web. A database system provides a query language. Is there an analogous high-level language to query the Web? Today, querying the Web consists essentially of browsing: a human being sitting in front of an idle computer following the tiresome iterative process of click-wait-read. AKIRA will obviate much of this tedious task using the PIQL language. If the Web is not a database, why do we consider using database technology? First of all, we note that currently available tools to access the Web use caches to store retrieved pages. These caches can be viewed as a database (though a very primitive one). Moreover, such caches can also be seen as a view (a partial snapshot) of the Web. In AKIRA, we propose to consider the cache as an object-oriented database where HTML pages are retrieved on demand and stored as fragments; we consider our views as "smart caches". Our smart cache is a repository of the (meta-)knowledge we have accumulated so far from our various explorations/navigations. In addition, AKIRA tries to make the most of IR techniques to avoid "brute-force browsing" and focuses instead on "directed exploration". Finally, the cache can deliver enriched views of the Web document by populating the retrieved content with extra information according to the user's needs expressed through a target structure.
Another major aspect of the Web is that we almost never know what we will get. Sometimes we do not know what we are looking for, and sometimes source providers have decided to modify the way their content is delivered. AKIRA assumes zero-knowledge1 as far as the source is concerned. It does not require predefined schemas but builds its own knowledge (almost) from scratch to guarantee maximum flexibility. The user's annotated bookmark file (including access to search engines) can be a very good start for AKIRA's exploration. After each "expedition", the content is fed into agents that extract new knowledge and store it in the database. To fulfill its mission, AKIRA's design implies a very high flexibility in order to deal with new types of documents, new requests from the user, new tools developed to better analyze the content, etc. To satisfy the user's expectations, it also has to be highly tunable and should learn from the user's feedback or external knowledge. AKIRA, as mentioned earlier, can be viewed as a database where pages are stored as fragments. A fragment is a piece of information per se. Fragmentation is triggered by a pool of agents that have the knowledge to identify specific types of information (names of persons, names of locations, relations, etc.). Agents can be seen as intelligent filters that parse the content of page fragments to generate other sub-fragments. These fragments can then be queried (in the standard database sense), enriched, and reshaped to provide a final document to be returned to the user. Agents have various core competences based on IR (Information Retrieval) techniques. They are able to help while browsing, analyzing content or restructuring content. Among other things, they allow fuzzy browsing (the path provided by the user may not be the exact path, with the labels defined by the creator of the site, but it should have the same meaning).
Agents should also be aware of the user's preferences, and should take advantage of whatever information is available. The paper is organized as follows. Section 2 gives an overview of the architecture of the AKIRA system. Section 3 defines the central concept of Web views. Section 4 presents the notion of a Web query language and introduces the PIQL language. Section 5 explains the IR (Information Retrieval) techniques used. Section 6 compares our approach to previous proposals. The last section contains our conclusion and some directions for future work.

2 AKIRA architecture

The AKIRA system can be viewed as a personal proxy. AKIRA offers extended browsing capabilities and therefore acts as an interface between the user and the source of information.

1 "Zero-knowledge" should not be interpreted as it is in cryptography (for extra details about zero-knowledge protocols, [Sch94] is a good start).


[Figure: AKIRA architecture. The user's browser (with bookmarks and a local cache) reaches remote sources on the Internet through the AKIRA proxy, which answers PIQL queries; an ISP intranet proxy provides centralized caching, filtering, and access control.]
[Figure, reconstructed from the flattened original: a page is split into five fragments o1-o5. Fragment o2 is an anchor (content "Services") whose HREF_CONTENT "#services" points to fragment o4 (content "Our services...", REF_NAME "services").]

  fragment   PRED   NEXT   HREF   HREF_CONTENT   REF_NAME
  o1         null   o2     null   ""             ""
  o2         o1     o3     o4     "#services"    ""
  o3         o2     o4     null   ""             ""
  o4         o3     o5     null   ""             "services"
  o5         o4     null   null   ""             ""

Fragments here correspond to boundaries of tags.
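The fragment model of Section 3 can be sketched in a few lines of Python. This is an illustrative sketch only, not the AKIRA implementation; the class and attribute names mirror the attributes of the model (PRED, NEXT, HREF, REF_NAME), and the linking function is an assumption about how local "#name" anchors would be resolved.

```python
# Hypothetical sketch of the fragment representation: a page is split
# into fragments chained by PRED/NEXT, and an anchor's HREF resolves
# its "#name" reference to the fragment carrying that NAME.

class Fragment:
    def __init__(self, oid, content="", href_content="", ref_name=""):
        self.oid = oid
        self.content = content
        self.href_content = href_content  # e.g. "#services" for an anchor
        self.ref_name = ref_name          # e.g. "services" for a named target
        self.pred = None                  # previous fragment in the page
        self.next = None                  # next fragment in the page
        self.href = None                  # resolved target fragment, if any

def link(fragments):
    """Chain fragments with PRED/NEXT and resolve local '#name' anchors."""
    by_name = {f.ref_name: f for f in fragments if f.ref_name}
    for a, b in zip(fragments, fragments[1:]):
        a.next, b.pred = b, a
    for f in fragments:
        if f.href_content.startswith("#"):
            f.href = by_name.get(f.href_content[1:])
    return fragments

frags = link([
    Fragment("o1"),
    Fragment("o2", content="Services", href_content="#services"),
    Fragment("o3"),
    Fragment("o4", content="Our services...", ref_name="services"),
    Fragment("o5"),
])
assert frags[1].href is frags[3]   # o2's "#services" resolves to o4
assert frags[3].pred is frags[2] and frags[4].next is None
```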

Fragments provide a clear advantage for locality issues. Our fragmentation allows direct access to the relevant portion of a document (say, an internal reference like #ID-153) and also allows us to navigate up and down (along PRED and NEXT) to grab the local context: we do not need to retrieve the entire page. The flat representation also extends the locality to fragments reachable from this fragment (along PRED, NEXT and HREF). A browsing query such as Query 1 defined in Example 3.2 is easily expressed by a standard select-from-where query.

Example 3.2

Consider the following query. Query 1: "Find today's news alluding to Microsoft". We assume that the user has a bookmark file with an implicit or explicit reference to a source of news that will be used to answer this query. In our case this will be http://www.yahoo.com/headlines/tech/. Query 1 is expressed by:

select   y.URL
from     x in Fragment, y in Fragment
where    x.URL = ``http://www.yahoo.com/headlines/tech/''
         x.HREF = y
         y.CONTENT = ``*Microsoft*'';;

The semantics of the expression consists in loading the page located at http://www.yahoo.com/headlines/tech/ (x in Fragment), then fragmenting it with respect to the tags, then loading each referred HTML page as a fragment and checking each of them for the pattern "Microsoft".
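The select-from-where semantics just described can be mimicked with a comprehension over an in-memory fragment store. A minimal sketch, assuming fragments are plain dictionaries standing in for pages already retrieved (a real evaluation would fetch and fragment pages on demand):

```python
# Sketch of Query 1's semantics over an in-memory fragment store.
# The dictionaries below are stand-ins for retrieved fragments; "u1"
# and "u2" are hypothetical URLs of pages linked from the news index.

fragments = [
    {"URL": "http://www.yahoo.com/headlines/tech/", "HREF": "u1", "CONTENT": ""},
    {"URL": "u1", "HREF": None, "CONTENT": "Microsoft announces..."},
    {"URL": "u2", "HREF": None, "CONTENT": "Apple ships..."},
]

by_url = {f["URL"]: f for f in fragments}

result = [
    y["URL"]
    for x in fragments
    for y in fragments
    if x["URL"] == "http://www.yahoo.com/headlines/tech/"
    and by_url.get(x["HREF"]) is y        # x.HREF = y
    and "Microsoft" in y["CONTENT"]       # y.CONTENT = ``*Microsoft*''
]
assert result == ["u1"]
```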

3.2 Concepts

Another component of the database is the set of concept classes. Concepts can be viewed as abstractions related to a specific knowledge area, or a given domain of expertise. A concept is defined as an abstract class with one attribute REFERS_TO of type set of Fragment that refers to objects of class Fragment. In Example 3.3 a concept class is class Person, with attributes like name of type string and age of type integer. Methods can be defined, such as getGender() (which infers the gender of the person based on the name and the context), getEmailAddr(), etc.

Example 3.3 Concept classes Person, Position and Appointment are defined as follows.

class Person {
    name : string;
    REFERS_TO : Fragment;
    ...
}

class Position {
    title : string;
    REFERS_TO : Fragment;
    ...
}

class Appointment {
    person : Person;
    position : Position;
    REFERS_TO : Fragment;
    ...
}

AKIRA also has some specific concepts (we can call them meta-concepts) that can express relationships between concepts. In Example 3.3, the concept class Appointment associates an instance of class Person with an instance of class Position. Example 3.4 illustrates the expressive power of concept classes and their relationships.

Example 3.4 Concept classes Person, Position and Appointment are defined as in Example 3.3. Consider the query defined by: Query 2: "Name and position of newly appointed person at Microsoft in today's news". This query is an extension of the previous one. We assume that we start from the result of the previous query (we have fragments about news) and that the classes have been defined as in Example 3.3.

select   z.person.name, z.position.title
from     x', x'' in Fragment, z in Appointment
where    z.URL = x'.URL = x''.URL
         x' in z.person.REFERS_TO
         x'' in z.position.REFERS_TO

3.3 View mechanism

Our view mechanism goes through a pipeline (see Figure 2) of successive and interleaved materialized views (obtained by successive materialized extensions [dS95, LDB97]). Initially, class Fragment is supposed to be populated with data retrieved from the Web. Afterwards, each step derives a Web view from a Web view through an extension step. An extension consists in adding structure (concept classes, organized in a hierarchy, and attributes defined at these classes) and data. New data may be added through three different processes:
- new fragments derived from previous ones (finer fragmentation);
- new fragments added from newly retrieved Web pages;
- new instances of concept classes.
The interleaved Web views define the semantics of a Web query, expressed in PIQL as explained in Section 4 and based on a fixed-point mechanism [Llo87]. Our view mechanism provides an elegant and flexible way of handling both "fond et forme"6 of the Web.
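The extension pipeline can be sketched as an iteration to a fixed point: agents are applied to the current store until no new fragments appear. This is an illustrative sketch, not AKIRA's implementation; the sentence-splitting agent is a hypothetical example of "finer fragmentation".

```python
# Illustrative fixed-point extension: each step applies agents that may
# derive new fragments from existing ones; iteration stops when no
# agent produces anything new.

def split_sentences(fragment):
    """Hypothetical agent: finer fragmentation into one piece per sentence."""
    parts = [s.strip() for s in fragment.split(".") if s.strip()]
    return parts if len(parts) > 1 else []

def extend_to_fixpoint(fragments, agents):
    store = set(fragments)
    changed = True
    while changed:
        changed = False
        for agent in agents:
            for f in list(store):
                for new in agent(f):
                    if new not in store:
                        store.add(new)
                        changed = True
    return store

store = extend_to_fixpoint({"Gates appointed. Stock rises."}, [split_sentences])
assert "Gates appointed" in store and "Stock rises" in store
```

Termination is the delicate point of such a mechanism: in this sketch it holds because each agent eventually returns nothing new on its own output.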

4 Web Query Language

The only language available today for a user to query the Web is the "type-and-fetch" language (clicking is no more than a shortcut). The user actually types a query (which is nothing but a URL) and gets back some pages or nothing. It is rather interesting to note that "type-and-fetch" languages do not even provide low-level file commands such as dir or Unix ls. It is not always possible to get the content of the directory under which an HTML page is physically located. The owner of the page only grants us limited access to the content of this directory. The structure of the Web does not allow any real search: everything is browsing! File systems on a computer are all connected in some regular fashion (usually as a tree). Web pages, on the other hand, can be viewed as a set of graphs not (fully) connected in any regular manner. Thus there is no a priori way to search for something. There is no expectation that pages will be connected to any other page, nor is it possible to expect that any page will have links outside the page. The only way to explore is to start from a given point (a valid URL, a bookmark, etc.) and navigate from page to page, accumulating addresses of new pages and knowledge about the Web itself. A "query" to a search engine is nothing but a "type-and-fetch" command to a parameterized URL. For instance, looking in Yahoo's base for "Microsoft" corresponds to the following URL: http://search.yahoo.com/bin/search?p=microsoft. We regard search-engine querying as parameterized browsing through access to databases (such as AltaVista, Yahoo) that return views of the Web. In any case, only pages that are visible to search engines or their spiders are returned by such queries; this visibility comes from being connected to other visible parts of the Web, or from explicitly registering such pages with spiders with a request to be indexed. Actually, the situation is even worse.
The user really has to go through "type-fetch'n-read" and repeat this process until he reaches a satisfactory answer. Since the content is not handled by available tools, the user must cope with reading everything to extract relevant information and discard trash, according to his request. We believe that a high-level Web query language should provide the following features:
1. assume zero-knowledge with regard to the source;

6 Content and form.


2. express the target structure according to the user's requirements;
3. allow fuzzy reasoning;
4. be both navigation and content oriented;
5. manage transparency and automation of queries;
6. specify the structure of the output;
7. support flexibility.
It is unrealistic to assume from the (naive) user any knowledge about the structure (implicit or explicit) of the Web. One does not know anything (zero-knowledge) about where to find information (e.g. news, weather forecasts, stock market quotes, etc.), anything about the architecture of a given site, or anything about the structure used to represent the data (plain text, HTML tags, extended tags). The user does not know the structure of the source either, but he knows the structure of his needs. In Example 3.4, the user expresses a target structure composed of a relationship between a person and a position through an appointment. However, he has no idea of how this information is represented in the source. In the same vein, the user might express some incomplete or slightly inaccurate requests based on partial or erroneous knowledge. Notions such as similarity should be available to answer a fuzzy query. In order to capture the meaning of the information accessible from the Web, the language must address both content and structure (related to navigation). A Web query language should offer the user a way to specify exploration plans that generate automatic browsing (no more clicks!) and analysis of the content. The successive steps of the exploration are performed transparently by the system on behalf of the user. The user is not always interested in an HTML page as the result of his request. For instance, one might ask for a Microsoft Support phone number. The output should not be a list of thousands of hyperlinks pointing to pages one will have to read to get to the correct information. The expected output is a string: the requested phone number.
On the other hand, the output might be a customized HTML page, generated on the fly by a compilation of data extracted from all over the Web. A Web query language should provide a large range of output restructuring tools to meet the user's needs. The semantics of a Web query language should be extensible in order to embrace more advanced abstractions: new IR technologies (handling text, images [MSS97], audio, video, etc.), new browsing strategies, and the evolution of the Web (HTML syntax extensions7, protocols, etc.). We propose PIQL, a high-level OQL-like query language described in Section 4.1. The navigational aspect of the Web ("Everything is browsing!") favors object-oriented models where a query is a path traversal, rather than a sequence of joins as in the relational model. Some operators restructuring the output are proposed in Section 4.2.

4.1 PIQL: a High-level query language

With no real surprise, since we have decided to adopt an object-oriented approach, our Web query language has the flavor of OQL. As presented in Section 3, we adopt a flat representation where navigation through hyperlinks or within a page is expressed using attributes of an object-oriented schema. This representation permits the use of path expressions. Our view mechanism requires the creation of new object identities. We adopt PIQL (Path Identity Query Language), an enrichment of OQL with identity invention a la IQL [AK89] and with generalized path expressions a la POQL [CCM96]. The syntax and the semantics of PIQL will be formally defined in the full version of the paper. In Example 4.1 the expression of the query uses the pattern (NEXT | HREF)*, which represents any path defined by a sequence of NEXT (moving to the next fragment in the same page) or HREF (moving to the pointed page, if any). When the path cannot be followed, the exploration is moved to another path.

7 Various formats such as XML [BSM97], CDF [Mic97] and MCF [App97] have already been proposed.


Example 4.1 Query 3: "Find all documents from the Microsoft Web-site that allude to 'Bill Gates'."

select   x.URL
from     x, y, z, t in Fragment
where    y.URL = ``http://www.microsoft.com''              / start on top of page
         y.PRED = NULL
         y.(NEXT | HREF)* = z                              / z is pointed from y
         z.HREF_CONTENT = ``http://www.microsoft.com*''    / z points inside the site
         x = z.HREF                                        / x is pointed from z
         x.(PRED | NEXT)*.CONTENT = ``*Bill Gates*''       / check the entire page and not just the fragment
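One way to read a generalized path expression such as y.(NEXT | HREF)* is as graph reachability: the set of fragments reachable from y along any sequence of NEXT or HREF edges. A minimal sketch of such an evaluation (a breadth-first closure; the edge dictionary and fragment names are illustrative, not PIQL's actual evaluator):

```python
# Evaluating (NEXT | HREF)* as a breadth-first transitive closure
# over labeled edges between fragments.

from collections import deque

def closure(start, edges, labels=("NEXT", "HREF")):
    """edges: {fragment: {label: fragment}}; returns the reachable set."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for label in labels:
            target = edges.get(node, {}).get(label)
            if target is not None and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical fragment graph: y chains to z1, z2 in the same page;
# z1 is an anchor whose HREF leads to fragment p2 on another page.
edges = {
    "y":  {"NEXT": "z1"},
    "z1": {"NEXT": "z2", "HREF": "p2"},
    "z2": {},
    "p2": {"NEXT": "z3"},
}
assert closure("y", edges) == {"y", "z1", "z2", "p2", "z3"}
```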

PIQL also understands some advanced string matching expressions as well as some fancy operators, such as perm (see Example 4.2) to facilitate expressive queries. Example 4.2 also illustrates fuzzy-reasoning capabilities.

Example 4.2 Query 4: "Find the list of Microsoft press releases related to its stock-holders, knowing that it should be located at or around http://www.microsoft.com/Corporate/Stock-holders/Press/News/". We almost know the path to get the data, but we are not sure of it. We use IR techniques to compensate for our lack of certainty.

select   x.URL
from     x in Fragment, y in Fragment
where    y.URL = ``http://www.microsoft.com''
         x = y.fuzzy(Corporate).fuzzy(Stock-Holders).fuzzy(Press).fuzzy(News);;

In some cases, we do not even know the right order of path components. In Yahoo's directory of concepts, some data can be found under various paths such as Computer/Business or Business/Computer. We introduce a perm operator to look for paths following any permutation of the concepts: perm(A,B,C) would lead to A.B.C, A.C.B, B.A.C, B.C.A, C.A.B and C.B.A. In contrast to usual database systems, a user may ask a query with a condition on the time allowed to answer the query. This condition may also be expressed by a condition on the depth (in terms of HTML links) of the search.
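One possible reading of the perm operator is to enumerate every ordering of the given path components and try each until one resolves. A small sketch of the enumeration (the function name is illustrative, not part of PIQL):

```python
# perm(A, B, C) enumerated as candidate path expressions.

from itertools import permutations

def perm_paths(*components):
    """All orderings of the components, joined as path expressions."""
    return [".".join(p) for p in permutations(components)]

paths = perm_paths("Computer", "Business")
assert paths == ["Computer.Business", "Business.Computer"]
assert len(perm_paths("A", "B", "C")) == 6   # A.B.C, A.C.B, B.A.C, ...
```

Since the number of permutations grows factorially, an actual evaluator would presumably stop at the first path that resolves rather than trying all of them.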

4.2 Output Restructuring

Restructuring operators are available to the user as macros. Some of them are briefly described in the following.
TableOfContent: takes one or several pages and returns a new HTML page listing the headers found in the content of the input. The result consists of a list of new tags pointing to their corresponding entries.
Chronology: takes one or several pages and returns an HTML page listing all references to dates found in the content of the input, sorted in chronological order.
Summary: takes one or several pages and returns a summary of the content (for more details about summarization see [Pai90, ENHJ95]).
Emphasize: takes a page, a concept and a tag (color, bold, italic, underline, etc.) and returns a copy of the same page where all the strings corresponding to the concept are emphasized.
New operators can easily be defined by the user.
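A bare-bones illustration of what a TableOfContent macro might do, using Python's standard HTML parser: collect header text and emit a list of links pointing back to the entries. This is a sketch under the assumption that headers double as anchor names; the real macro would generate proper anchors.

```python
# Sketch of a TableOfContent-style macro: collect <h1>-<h3> headers
# from the input HTML and emit a list of links pointing to them.

from html.parser import HTMLParser

class HeaderCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headers, self._in_header = [], False
    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._in_header = True
    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self._in_header = False
    def handle_data(self, data):
        if self._in_header:
            self.headers.append(data.strip())

def table_of_content(html):
    c = HeaderCollector()
    c.feed(html)
    items = "".join(f'<li><a href="#{h}">{h}</a></li>' for h in c.headers)
    return f"<ul>{items}</ul>"

toc = table_of_content("<h1>Intro</h1><p>text</p><h2>Model</h2>")
assert "Intro" in toc and "Model" in toc
```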


5 Information Retrieval

The IR agents in AKIRA aim to glean information from given segments of text. Some of these agentive services are based on techniques developed for the Glean project [RC93, CS97a, CS97b]. AKIRA will use similar and extended agents. There is a lot of information latent in any coherent text, which can be used to enhance retrieval efficiency. This information includes not just the syntactic structure of natural-language text, but also the structure inherent in many domains, such as tables of stock prices, time-line information, etc. We can also use information extraction ideas to identify entities in the text such as names of people and places, designations, time phrases, currency expressions, etc. In addition, collocations and grammatical relations between words/phrases in a sentence may provide valuable information. Some of this information is predicated on grammatical structures of a specific language (English, say) or a particular domain. Much of it is based on heuristics about language use, and hence does not require users to provide any structure information a priori. Such information is extracted by AKIRA's IR agents using a number of tools and techniques of various complexities, including a supertagger, a lightweight dependency analyzer and various tools such as noun-phrase and verb-group chunkers. This information is used to populate the database, to answer user queries and for further reuse. Some of AKIRA's advanced IR services rely on the use of `supertagging' [JS94]. Supertags are rich syntactic labels which provide information about each word in a text, including the part of speech of each word, as well as information about the syntactic role it plays in a sentence. Supertagging is a method to assign the most appropriate supertag to each word in a sentence. The labeling provided by supertagging can be used to obtain patterns of word use, and to deduce relations between words in a sentence.
Agents can use these patterns and relations to identify portions of text relevant to user queries. In addition, these tools are used to restructure the information displayed to the user. We assume that the user provides the target structure, which specifies the concepts involved in the query and the relationship(s) between them.8 The concepts specified in the target structure are used to select the appropriate agents that would process the query and the documents. The idea may be explained with an example. We can define a relation Appointed between a Person and a Position. The user may indicate this relation to AKIRA using a set of sentences which exhibit this relation. Glean technology uses supertagging to label the words in these sentences, and extracts patterns which are typical of this relation (e.g. Name/Person appointed Name/Person Position or Name/Person was appointed to Position, etc.). Using such information increases the precision of information retrieval. Once we have some idea of the structure of the text being processed, we can use this information in restructuring the output as well. Relations provided to AKIRA by users may be stored in libraries, and used in other, similar contexts. We could envisage situations where users could subscribe to services which provide stored patterns and relation-descriptors on demand, for analysis or for restructuring.
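A toy version of the pattern idea: once a template like "Person was appointed Position" has been learned, even a plain regular expression can instantiate the Appointment relation over new text. This is a deliberately crude stand-in; Glean's supertag-based patterns are far richer, and the name/position regexes below are illustrative assumptions.

```python
# Toy extraction of the Appointed(Person, Position) relation via a
# learned surface pattern (the real system uses supertag patterns).

import re

PATTERN = re.compile(
    r"([A-Z][a-z]+ [A-Z][a-z]+) was appointed (?:to )?([A-Z][A-Za-z ]+)"
)

def extract_appointments(text):
    return [{"person": p, "position": q.strip()} for p, q in PATTERN.findall(text)]

hits = extract_appointments("Jane Smith was appointed to Chief Executive Officer.")
assert hits == [{"person": "Jane Smith", "position": "Chief Executive Officer"}]
```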

6 Related Work

In this section, we compare our approach with related approaches in the literature, at several levels.

6.1 Modeling the Web

The Web can be seen as an oracle that returns some content (for a valid URL) or nothing. This approach has already been presented in [MM97]. We believe that actually the only mode of computation currently available on the Web is browsing. The search capabilities available are only accesses to databases, which are a sort of view of the Web. Querying one of these databases is nothing but opening a URL. In this sense, our understanding of search differs from the one expressed in [AV97a].

8 We will address this issue further in the final version of the paper.


6.2 Granularity of Description

Most formal approaches represent the Web as a labeled graph [AV97a, BDFS97, FFLS97, PGMW95, AM97a, AMM97a, KS97] where a vertex corresponds to an HTML page. These approaches are primarily designed to provide browsing capabilities and hence do not take the content of the page into account in a satisfying way: they are browsing-oriented. In [AMM97b], the content is considered but only at a second level, using Editor [AM97b] for presentation purposes. In contrast to these approaches that exploit the explicit structure provided by the HTML tags for browsing and presentation purposes, the approach we present in this paper exploits both the explicit and the implicit structure present in Web documents, thus allowing for a uniform interface to both retrieval and browsing capabilities. Our internal representation is based on the notion of fragments, which raise content-based and hyperlink-based navigation to the same level. Our model is not page oriented. This fragment terminology also exists in RAW [FWM97], and the same idea appears in [LSS96] with rel-infons, tokens of related information. Our representation is both navigational and content oriented.

6.3 View Representation

For a user, the only vision of the Web is the content of both cache9 and bookmarks. AKIRA enhances the standard browser cache into a smart cache serving as a database view of the Web. Our view representation is based on an object-oriented model, unlike WebSQL [MMM96] or RAW [FWM97], which use a relational approach. We do not use a tree-based representation either, like WebOQL [AM97a] (with abstract syntax trees), Strudel [FFLS97], Lorel [AQM+97], or W3QS [KS97], but prefer a flat representation. Another approach consists in using the structure of the content (which is supposed to be known at some point) as a schema of a database [CACS94, GW97, ACC+97]. In contrast to these approaches that only exploit the explicit structure provided by the HTML tags or make some assumptions on the structure of the source (see ARANEUS), our approach does not assume any predefined schema for the source but only focuses on the structure requested by the user (target structure). Unlike ARANEUS [AMM97b], which requires two consecutive distinct steps (retrieve then store), AKIRA adopts an interleaving view mechanism based on a sequence of extensions (retrieval then internal restructuring). The entire process is a series of views of views [LDB97].

6.4 Query Language

When answering a query, AKIRA assumes that the user knows what he is looking for, but does not have a clear idea of how to express his needs. AKIRA also assumes that he has no knowledge of the format of the source. Our query language therefore assumes zero-knowledge as far as the source is concerned. This approach is different from [GW97], where the semi-structured source is reachable and can provide a "structural summary" of itself (the dataguide). AKIRA has to populate the cache on the fly and has no a priori knowledge of the structure of the source. Our choice of a flat representation of the Web makes an extension of POQL [Chr96] a good candidate for our query language. POQL allows the use of OQL-style queries and the use of generalized path expressions, as well as underlying optimization techniques. It is also very similar to Lorel [AQM+97], even though it does not really need coercion. PIQL extends POQL with a set of IR tools and some output restructuring operators. Moreover, we extend the notion of generalized path expression to fuzzy generalized path expression, where the fuzziness (see [MSS97]) comes from the help of IR techniques to guess some relevant paths. We also define an underlying algebra for fragments with a set of operators. Like RAW [FWM97], to embrace new concepts coming from the Web (link, page, etc.), we include classes of objects that are equivalent to RAW's domains HTTP-address, HTTP-address-path and HTML-document-path. We also provide the user some restructuring tools (available as macros) that are similar to some proposed in SgmlQL [HMMV97].

9 Tell me what you have in your cache, and I will tell you who you are.


6.5 The use of Information Retrieval techniques

Most information retrieval (IR) systems, as well as IR algorithms embedded in Web search engines, have considered documents to be just sets of words. The most popular model amongst them, the Vector Space model [Sal89, SM83], treats the set of words present in a document as a vector in a multi-dimensional space. A user query is similarly treated as another vector in that space, and documents that are in the `neighborhood' of the query vector are deemed relevant. Other IR systems simply use retrieval based on inverted indexes of content words to approximate the content of documents: each document is parsed and an entry (pointing back to the document) is created in the index for every "relevant" word. But these methods ignore the relations among words. Our approach, based on techniques developed for the Glean project, focuses not only on the words in a text but also on the relations between words. The use of information retrieval and information extraction techniques helps us make explicit the tokens and the relationships among tokens that may be present in the query as well as the document, so as to improve the relevance of retrieved information. We also exploit the linguistic structure implicit in the document to postprocess the results of Web search engines so as to improve the precision of the retrieved results [CS97a, CS97b].
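The Vector Space model just described fits in a few lines: documents and the query become term-count vectors, and relevance is cosine similarity. A minimal sketch (real systems add tf-idf weighting, stemming, stop-word removal, etc.; the sample documents are illustrative):

```python
# The Vector Space model in miniature: rank documents by the cosine
# of the angle between their term-count vectors and the query vector.

import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = {
    "d1": Counter("microsoft announces new press release".split()),
    "d2": Counter("apple ships new hardware".split()),
}
query = Counter("microsoft press release".split())

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
assert ranked[0] == "d1"
assert cosine(query, docs["d1"]) > cosine(query, docs["d2"])
```

The limitation the paper points out is visible here: the vectors record which words occur, never how the words relate to one another.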

7 Conclusion

AKIRA is a system that behaves as a proxy server for a user. It establishes an interface between the user and the sources of information. The system can be viewed as a "smart cache" since it will retrieve data for the user and restructure it. The use of information retrieval and information extraction techniques, along with advanced tools from computational linguistics, permits us to increase the precision of retrieval, using knowledge about the content of queries and text processed by AKIRA. We have proposed a high-level Web query language, PIQL, that assumes zero-knowledge with regard to the source, expresses the target structure according to the user's requirements, allows fuzzy reasoning, is both navigation and content oriented (thanks to our chosen flat representation of documents as fragments), manages transparency and automation of queries, specifies the structure of the output and supports flexibility. The query process is also interactive in the sense that it can be iterated with the user refining his query. In future work, we plan to investigate issues arising from information distributed across fragments, as well as information available in fragments connected in tree-like structures, the use of fragments for collaborative work, and integration with a high-level user interface.

References

[Abi97] S. Abiteboul. Querying semi-structured data. In Proc. of Intl. Conf. on Database Theory, pages 1-18, Delphi, Greece, January 1997. LNCS 1186, Springer.
[ACC+97] S. Abiteboul, S. Cluet, V. Christophides, T. Milo, G. Moerkotte, and J. Simeon. Querying documents in object databases. Journal on Digital Libraries, 1997.
[AHV95] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[AK89] S. Abiteboul and P. Kanellakis. Object identity as a query language primitive. In ACM SIGMOD Symp. on the Management of Data, pages 159-173, Portland, Oregon, USA, June 1989.
[AM97a] G.O. Arocena and A. Mendelzon. WebOQL: Restructuring documents, databases and webs. Submitted for publication, 1997.
[AM97b] P. Atzeni and G. Mecca. Cut and paste. In Proc. ACM Symp. on Principles of Database Systems, Tucson, Arizona, May 1997.
[AMM97a] P. Atzeni, G. Mecca, and P. Merialdo. Semistructured and structured data in the Web: Going back and forth. In ACM SIGMOD Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997.
[AMM97b] P. Atzeni, G. Mecca, and P. Merialdo. To weave the Web. Submitted for publication, 1997.
[App97] Apple. The Meta Content Format (MCF). Apple, July 1997. Available at http://mcf.research.apple.com/hs/mcf.html.
[AQM+97] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J.L. Wiener. The Lorel query language for semistructured data. Journal on Digital Libraries, 1997. ftp://db.stanford.edu//pub/papers/lorel96.ps.
[AV97a] S. Abiteboul and V. Vianu. Queries and computation on the Web. In Proc. of Intl. Conf. on Database Theory, 1997.
[AV97b] S. Abiteboul and V. Vianu. Regular path queries with constraints. In Proc. ACM Symp. on Principles of Database Systems, 1997.
[BDFS97] P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding structure to unstructured data. In Proc. of Intl. Conf. on Database Theory, Delphi, Greece, January 1997.
[BLCL+94] T. Berners-Lee, R. Caillau, A. Lautonen, H.F. Nielsen, and A. Secret. The World Wide Web. Communications of the ACM, 37(8):76-82, August 1994.
[BSM97] T. Bray and C. M. Sperberg-McQueen. The XML Specification. W3C, 1997. Available at http://www.w3.org/pub/WWW/TR/WD-xml.html.
[Bun97] P. Buneman. Semistructured data. In Proc. ACM Symp. on Principles of Database Systems, Tucson, 1997. Invited tutorial.
[CACS94] V. Christophides, S. Abiteboul, S. Cluet, and M. Scholl. From structured documents to novel query facilities. In Proc. ACM SIGMOD Symp. on the Management of Data, 1994.
[CCM96] V. Christophides, S. Cluet, and G. Moerkotte. Evaluating queries with generalized path expressions. In Proc. ACM SIGMOD Symp. on the Management of Data, 1996.
[Chr96] V. Christophides. Documents Structurés et Bases de Données Objet. PhD thesis, Conservatoire National des Arts et Métiers, 1996. Available at http://www-rocq.inria.fr/christop/.
[CS97a] R. Chandrasekar and B. Srinivas. Gleaning information from the Web: Using syntax to filter out irrelevant information. In AAAI Spring Symposium on Natural Language Processing for the World Wide Web, Stanford University, March 1997.
[CS97b] R. Chandrasekar and B. Srinivas. Using syntactic information in document filtering: A comparative study of part-of-speech tagging and supertagging. In Proceedings of RIAO'97, Montreal, June 1997.
[dS95] C. Souza dos Santos. Un Mécanisme de Vues pour les Systèmes de Gestion de Bases de Données Objet. PhD thesis, Université de Paris Sud - Centre d'Orsay, Paris, France, November 1995.
[ENHJ95] B. Endres-Niggemeyer, J. Hobbs, and K. Sparck Jones. Summarizing text for intelligent communication. Technical Report 79, Dagstuhl Seminar, 1995. http://www.bid.fhhannover.de/SimSum/Abstract/.
[FFLS97] M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language and processor for a Web-site management system. In ACM SIGMOD Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997.
[FWM97] T. Fiebig, J. Weiss, and G. Moerkotte. RAW: A relational algebra for the Web. In ACM SIGMOD Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997.
[GW97] R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. Technical report, Department of Computer Science, Stanford Univ., 1997.
[HMMV97] S. Harie, J. Le Maître, E. Murisasco, and J. Véronis. SgmlQL: Language reference. Universités de Toulon et de Provence, France, May 1997. Available at http://www.lpl.univaix.fr/projects/SgmlQL/.
[JS94] A.K. Joshi and B. Srinivas. Disambiguation of super parts of speech (or supertags): Almost parsing. In Proceedings of the 15th International Conference on Computational Linguistics (COLING '94), Kyoto, Japan, August 1994.
[KS97] D. Konopnicki and O. Shmueli. W3QS - The WWW Query System. Department of Computer Science, Technion, Israel, May 1997. Available at http://www.cs.technion.ac.il/W3QS/.
[LDB97] Z. Lacroix, C. Delobel, and Ph. Breche. Object views and database restructuring. In Proc. of Intl. Workshop on Database Programming Languages, August 1997.
[Llo87] J.W. Lloyd. Foundations of Logic Programming, 2nd Ed. Springer-Verlag, 1987.
[LSS96] L. V. S. Lakshmanan, F. Sadri, and I.N. Subramanian. A declarative language for querying and restructuring the Web. In Sixth International Workshop on Research Issues in Data Engineering - Interoperability of Nontraditional Database Systems, 1996. Available at http://wwwdb.stanford.edu/pub/papers/icde95.ps.
[Mic97] Microsoft. The Channel Definition Format (CDF). Microsoft, July 1997. Available at http://www.microsoft.com/standards/cdf.htm.
[MM97] A. Mendelzon and T. Milo. Formal models of Web queries. In Proc. ACM Symp. on Principles of Database Systems, 1997.
[MMM96] A. Mendelzon, G. Mihaila, and T. Milo. Querying the World Wide Web. In Proc. PDIS'96, Miami, USA, December 1996.
[MSS97] C. Meghini, F. Sebastiani, and U. Straccia. Modelling the retrieval of structured documents containing texts and images. In Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries, Pisa, September 1997.
[OO88] K. Otomo and M. Oshii. Akira: Neo Tokyo is about to E.X.P.L.O.D.E. Akira Committee, 1988. 124 min.
[Pai90] C. D. Paice. Constructing literature abstracts by computer. Information Processing and Management, 26(1):171-186, 1990.
[PGMW95] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across heterogeneous information sources. In Proceedings of the International Conference on Data Engineering, 1995. Available at http://www-db.stanford.edu/pub/papers/icde95.ps.
[Rag97] D. Raggett. HTML 3.2 Reference Specification. Technical report, W3C, 1997.
[RC93] S. Ramani and R. Chandrasekar. Glean: A tool for automated information acquisition and maintenance. Technical report, NCST Bombay, 1993.
[RL96] F. Rouaix and B. Lang. The V6 engine. Technical report, INRIA, 1996.
[Sal89] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
[Sch94] B. Schneier. Applied Cryptography: Protocols, Algorithms, and Source Code in C. John Wiley & Sons, Inc., 1994.
[SM83] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.