A novel approach to querying the Web: Integrating Retrieval and Browsing

Zoe Lacroix*, Arnaud Sahuguet†, R. Chandrasekar‡ and B. Srinivas§
IRCS, University of Pennsylvania, Suite 400A, 3401 Walnut St, Philadelphia PA 19104, USA
Phone: 1 (215) 898 0326 -- Fax: 1 (215) 573 9427
Web: http://www.cis.upenn.edu/AKIRA
{lacroix,sahuguet}[email protected] {mickeyc,srini}[email protected]
August 27, 1997

Abstract
We propose a new approach to querying the Web based on information retrieval (IR), browsing, and database techniques, so as to provide maximum flexibility to the user. We present a model based on object representation in which an identity corresponds not to a source HTML page but to a fragment of it. A fragment is identified by using the explicit structure provided by the HTML tags as well as the implicit structure extracted using IR techniques. We define a query language, PIQL, which is a simple algebra extended with restructuring primitives. Our language expresses browsing and restructuring based on IR techniques in a unified framework. All these components take part in the AKIRA system, currently under development.

Keywords: views, data model, query language, information retrieval, agents
1 Introduction

The Web invades our lives. While you read this short introduction, it will probably have grown by several hundred pages, making extra information available to everyone. But of what use can this information be if we cannot manage it properly? Some people view the Web as a set of text pages, without paying much attention to the explicit structure given by links; these people typically use full-text search engines to retrieve content. Other people view the Web as a gigantic network and focus on the hyperlinks between pages. In both cases, a crucial aspect of the information has been overlooked, which prevents any effective processing of it. What we need is an effective method of using all the information available.

In this paper, we present AKIRA (Another Knowledge-based Information Retrieval Approach), an attempt to integrate, using database technologies, the navigational capabilities offered by HTML documents with IR techniques able to exploit and infer the finer structure implicit in the content of a set of Web pages.

Many people model the Web as a database, but is it really one? On the Web, data consists of HTML pages with tags and hyperlinks. It has an explicit structure defined by its tags and anchors. A page also has content (text, image, sound, video, etc.) which is usually in free and unrestricted form, and predominantly textual. The content carries implicit structure not known in advance. This blurred structure makes us view the Web as a semistructured database [Abi97, Bun97].

* Work supported by NSF STC grant SBR-8920230 and ARO grant DAAH04-95-I-0169.
† Work supported by ARO grant DAAH04-93-G0129 and ARPA grant N00014-94-1-1086.
‡ On leave from NCST, Bombay.
§ This work is partially supported by NSF grant NSF-STC SBR 8920230, ARPA grant N00014-94 and ARO grant DAAH04-94-G0426.
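The distinction between explicit structure (tags and anchors) and implicit structure (free-form content) can be made concrete with a small sketch. The following example, which is our own illustration and not part of the AKIRA system itself, shows how the explicit structure of a page -- here its anchors -- can be extracted mechanically, without any analysis of the content:

```python
from html.parser import HTMLParser

# Illustrative only: this shows the kind of "explicit structure" that HTML
# tags make available for free. Implicit structure (entities, relations in
# the text) would instead require the IR techniques discussed in the paper.

class AnchorExtractor(HTMLParser):
    """Collect (href, link text) pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []      # completed (href, text) pairs
        self._href = None    # href of the anchor currently being read
        self._text = []      # text pieces seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

page = '<p>See <a href="piql.html">the PIQL language</a> for details.</p>'
parser = AnchorExtractor()
parser.feed(page)
print(parser.links)   # [('piql.html', 'the PIQL language')]
```

The anchors recovered this way are exactly the navigational structure that the "gigantic network" view of the Web exploits; everything the parser ignores is the content whose implicit structure AKIRA targets.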
In contrast to a database, no super-user, central authority, or federation monitors, regulates, schedules, controls, or supervises the whole Web. Each user can modify her pages at any time. Moreover, a user may also give any visitor the right to update her pages (for instance, a counter or a guest-book may be remotely updated by a visitor who does not own the page). This anarchy results in a permanent evolution of the Web. Every second, several pages are deleted and created. This constant change outpaces any exploration strategy; it is therefore impossible to take an exhaustive snapshot of the Web. We need better methods to explore it.

A database system provides a query language. Is there an analogous high-level language to query the Web? Today, querying the Web consists essentially of browsing: a human being sitting in front of an idle computer, following the tiresome iterative process of click-wait-read. AKIRA will obviate much of this tedious task using the PIQL language.

If the Web is not a database, why do we consider using database technology? First of all, we note that currently available tools to access the Web use caches to store retrieved pages. These caches can be viewed as a (very primitive) database. Moreover, such caches can also be seen as a view (a partial snapshot) of the Web. In AKIRA, we propose to treat the cache as an object-oriented database where HTML pages are retrieved on demand and stored as fragments; we consider our views as "smart caches". Our smart cache is a repository of the (meta-)knowledge we have accumulated so far from our various explorations and navigations. In addition, AKIRA tries to make the most of IR techniques to avoid "brute-force browsing" and focuses instead on "directed exploration". Finally, the cache can deliver enriched views of a Web document by populating the retrieved content with extra information according to the user's needs, expressed through a target structure.
Another major aspect of the Web is that we almost never know what we will get. Sometimes we do not know what we are looking for, and sometimes source providers have decided to modify the way their content is delivered. AKIRA assumes zero-knowledge¹ as far as the source is concerned. It does not require predefined schemas but builds its own knowledge (almost) from scratch to guarantee maximum flexibility. The user's annotated bookmark file (including access to search engines) can be a very good starting point for AKIRA's exploration. After each "expedition", the content is fed into agents that extract new knowledge and store it in the database.

To fulfill its mission, AKIRA's design implies a very high flexibility, in order to deal with new types of documents, new requests from the user, new tools developed to better analyze content, etc. To satisfy the user's expectations, it also has to be highly tunable and should learn from the user's feedback or from external knowledge.

AKIRA, as mentioned earlier, can be viewed as a database where pages are stored as fragments. A fragment is a piece of information per se. Fragmentation is triggered by a pool of agents that have the knowledge to identify specific types of information (names of persons, names of locations, relations, etc.). Agents can be seen as intelligent filters that parse the content of page fragments to generate other sub-fragments. These fragments can then be queried (in the standard database sense), enriched, and reshaped to produce a final document to be returned to the user.

Agents have various core competences based on IR (Information Retrieval) techniques. They are able to help while browsing, analyzing content, or restructuring content. Among other things, they allow fuzzy browsing: the path provided by the user may not be the exact path, with the labels defined by the creator of the site, but it should have the same meaning.
Agents should also be aware of the user's preferences, and should take advantage of whatever information is available.

The paper is organized as follows. Section 2 gives an overview of the architecture of the AKIRA system. Section 3 defines the central concept of Web views. Section 4 presents the notion of a Web query language and introduces the PIQL language. Section 5 explains the IR (Information Retrieval) techniques used. Section 6 compares our approach to some previously proposed ones. The last section contains our conclusion and some directions for future work.
2 AKIRA architecture

The AKIRA system can be viewed as a personal proxy. AKIRA offers extended browsing capabilities and therefore acts as an interface between the user and the source of information.
1 "Zero-knowledge" should not be interpreted as it is in cryptography (for extra details about zero-knowledge protocols, [Sch94] is a good start).