Catering to the needs of Web users: Integrating Retrieval and Browsing

Zoe Lacroix,* Arnaud Sahuguet,† R. Chandrasekar‡ and B. Srinivas§
IRCS, University of Pennsylvania
Suite 400A, 3401 Walnut St, Philadelphia PA 19104, USA
Phone: 1 (215) 898 0326 / Fax: 1 (215) 573 9427
Web: http://www.cis.upenn.edu/AKIRA
{lacroix,[email protected]  {mickeyc,[email protected]

Abstract

We propose a new approach to querying hypermedia documents on the Web that combines information retrieval (IR), browsing, and database techniques so as to provide maximum flexibility to the user. We present a model based on object representation where an identity does not correspond to a source HTML page but to a fragment of it. A fragment is identified using the explicit structure provided by the HTML tags as well as the implicit structure extracted using IR techniques. Our fragmentation provides access to the different heterogeneous components (text, image, audio, video, etc.) of a given document, and to their relationships (implicit, or explicit through hyperlinks). Our language expresses browsing and restructuring based on IR techniques in a unified framework. All these are integral components of the AKIRA system, currently under development.

Keywords: multimedia, hypermedia, Web, views, data model, query language, information retrieval, agents

1 Introduction

The Web invades our lives. While you read this short introduction, it will probably have grown by several hundred pages, making extra information available to everyone. But of what use can this information be if we cannot manage it properly? Some people view the Web as a set of text pages, without paying much attention to the explicit structure given by links; these people typically use full-text search engines to retrieve content. Other people view the Web as a gigantic network and focus on the hyperlinks between pages. In both cases, a crucial aspect of hypertext information has been overlooked, which prevents any effective processing of information. A Web document is actually a hypermedia document in which components (text, images, animations, sounds, etc.) together contribute to the information content of the page. What we need is an effective method of using all the information available, taking advantage of all (or at least most of) the tools that already exist to handle this information.

In this paper, we present AKIRA (Another Knowledge-based Information Retrieval Approach), an attempt to integrate, using database technologies, the navigational capabilities offered by HTML documents and an agent pool of extended IR techniques able to exploit and infer the finer structure implicit in the hypermedia content of a set of Web pages. One problem with the Web is that it is not always clear what information is available, where it is available, and how it is structured. Sometimes we do not know what we are looking for, and sometimes source providers

* Work supported by NSF STC grant SBR-8920230 and ARO grant DAAH04-95-I-0169.
† Work supported by ARO grant DAAH04-93-G0129 and ARPA grant N00014-94-1-1086.
‡ On leave from NCST, Bombay.
§ Work partially supported by NSF grant NSF-STC SBR 8920230, ARPA grant N00014-94, and ARO grant DAAH04-94-G0426.

[Figure: AKIRA as an intermediary between the user's browser and the Internet. Browsers, with their bookmarks and local caches, send PIQL queries to AKIRA; an ISP/intranet proxy (centralized caching, filtering, access control) stands between them and the remote sources, and is used directly when not using AKIRA's services.]
[Mail] caption of type string that expresses the meaning in a textual form. An agent specialized in speech recognition (for example [JHC97]) assigns to attribute caption the string Zoe for the corresponding object in class Sound, while an agent expert in pen-stroke recognition (for example [ABL96]) assigns to Lacroix its translation as a string, and a last agent matches a given icon to an element in an icon library by similarity (in the spirit of [Jag96]).

[Figure 3: Hypermedia documents (zoe.html, friends.html), (a) without and (b) with redundancy.]
3.3 View mechanism

Our view mechanism goes through a pipeline (see Figure 4) of successive and interleaved materialized views (obtained by successive materialized extensions [dSDA94, LDB97]). Initially, class Fragment is populated with data retrieved from the Web. Each subsequent step derives a Web view from a Web view through an extension step. An extension consists in adding structure (concept classes, organized in a hierarchy, and attributes defined at these classes) and data. New data may be added through three different processes: (i) new fragments derived from previous ones (finer fragmentation), (ii) new fragments added from newly retrieved Web pages, or (iii) new instances of concept classes. The interleaved Web views define the semantics of a Web query, expressed in PIQL and based on a fixed-point mechanism [Llo87]. Our view mechanism provides an elegant and flexible way of handling both the "fond et forme" (content and form) of the Web.
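The extension step described above, deriving a new Web view by adding concept-class instances to the previous one, can be sketched in a few lines. All names here (WebView, Fragment, the toy date agent) are illustrative assumptions, not AKIRA's actual interfaces.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Fragment:
    url: str        # page the fragment was cut from
    content: str    # raw text/HTML of the fragment

@dataclass
class WebView:
    classes: dict = field(default_factory=dict)  # class name -> list of instances

def extend(view: WebView, agent) -> WebView:
    """One extension step: copy the previous Web view, then let an agent
    add new instances of concept classes derived from the fragments."""
    new = WebView({k: list(v) for k, v in view.classes.items()})
    for frag in new.classes.get("Fragment", []):
        for cls, inst in agent(frag):            # agent yields (class, instance)
            new.classes.setdefault(cls, []).append(inst)
    return new

def date_agent(frag):
    """A toy agent: promote year mentions to instances of class Date."""
    for year in re.findall(r"\b(?:19|20)\d{2}\b", frag.content):
        yield "Date", {"year": year, "REFERS_TO": frag}

view = WebView({"Fragment": [Fragment("http://a.html", "appointed in 1997")]})
view = extend(view, date_agent)
print(sorted(view.classes))   # ['Date', 'Fragment']
```

Each pipeline stage is just another call to `extend`, so finer fragmentation, newly retrieved pages, and new concept instances all fit the same loop.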

4 Querying the Web

The only language available today for a user to query the Web is the "type-and-fetch" language. The user types a query (which is nothing but a URL) and gets back some pages, or nothing. It is rather interesting to note that "type-and-fetch" languages do not even provide low-level file commands such as dir or Unix ls. It is not always possible to get the content of the directory under which an HTML page is

physically located. The owner of the page grants us only limited access to the content of this directory: a sort of encapsulation that defines its own visibility of information.

[Figure 4: AKIRA's fragmentation pipeline. The original page is divided into fragments; a "Date" agent and a "Person" agent create new instances of classes Date and Person, link them to the proper fragments, and store them in the database (extension phases).]

The structure of the Web does not allow any real search: everything is browsing! File systems on a computer are all connected in some regular fashion (usually as a tree). Web pages, on the other hand, can be viewed as a set of graphs not (fully) connected in any regular manner, so there is no a priori way to search for something. There is no expectation that a page will be connected to any other page, nor that any page will have links outside itself. The only way to explore is to start from a given point (a valid URL, a bookmark, etc.) and navigate from page to page, accumulating addresses of new pages and knowledge about the Web itself.

A "query" to a search engine is nothing but a "type-and-fetch" command to a parameterized URL. For instance, looking in Yahoo's base for "Microsoft" corresponds to the following URL: http://search.yahoo.com/bin/search?p=microsoft. We regard search-engine querying as parameterized browsing through access to databases (such as AltaVista or Yahoo) that return views of the Web. In any case, such queries return only pages that are visible to search engines or their spiders; this visibility comes from being connected to other visible parts of the Web, or from explicitly registering pages with spiders with a request to be indexed.

Actually, the situation is even worse. The user really has to go through "type-fetch'n read" and repeat this process until she reaches a satisfactory answer. Since available tools do not take care of content, the user must cope with reading, listening to, viewing, and understanding every component of a page to extract relevant information and discard trash, according to her request.
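The parameterized-URL view of search-engine querying can be made concrete in a few lines. The template below is the Yahoo URL quoted in the text; nothing is actually fetched in this sketch, and the template table is an illustrative assumption.

```python
# "Type-and-fetch" as parameterized browsing: a search-engine "query" is just
# a URL template with the query string filled in.
from urllib.parse import urlencode

SEARCH_TEMPLATES = {
    "yahoo": "http://search.yahoo.com/bin/search?{params}",
}

def search_url(engine: str, query: str) -> str:
    """Build the parameterized URL that a search-engine query really is:
    type-and-fetch against a database view of the Web."""
    params = urlencode({"p": query})
    return SEARCH_TEMPLATES[engine].format(params=params)

print(search_url("yahoo", "microsoft"))
# -> http://search.yahoo.com/bin/search?p=microsoft
```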
Querying the Web is therefore a two-fold problem: getting information and processing it. On the Web, information is stored in Web documents that are real hypermedia documents. Each document consists of a connected set of files (HTML pages, texts, images, sounds, animations, videos, etc.). One could think of this connected set as a book, composed of a cover, a table of contents, some figures, etc. But, in contrast to a book, it has no physical boundaries. Processing Web documents is tricky even for human beings! On the one hand, the creator of the document can choose various media formats to express his message. Let us see the message as divided into information tokens (see Figure 3). The document creator can either partition information tokens among the available media formats (the document is like a rebus, as in http://zoe1.html) or use an overlapping representation where tokens are redundantly expressed through more than one format (see http://zoe2.html). He is also responsible for the density of the message, by splitting it (or not) among several pages. On the other hand, the comprehension of the document by the user depends on her preferences and/or her display capabilities. Restrictions such as screen size, color depth, layout, frames and/or scrollbars, transfer delays, synchronization, etc., strongly influence the perception of the message. A Web document may encapsulate other Web documents. Its scope is therefore variable, depending on the accuracy of the expected understanding. For instance, the understanding of this paper on AKIRA can be

limited to the reading of the text, or extended to pictures, bibliographic references, and so on. By extension, the World Wide Web itself can be seen as a Web document, and evaluating a query means determining the required level of encapsulation. Each level of encapsulation corresponds to a unique Web document which defines its own locality, i.e., its logical boundaries. It is worth noting that the usual (linear) notion of locality in standard IR techniques cannot embrace this (tree-like) hypermedia locality. Some heuristics to address the question of locality in hypertext are proposed in [Spe96]. It is unrealistic to assume that the (naive) user has any knowledge of the structure (implicit or explicit) of the Web. She knows nothing (zero-knowledge) about where to find information (e.g., news, weather forecasts, stock-market quotes), about the architecture of a given site, or about the structure used to represent the data (plain text, HTML tags, extended tags). The user does not know the structure of the source, but she knows the structure of her needs. In Example 3.2, the user expresses a target structure composed of a relationship between a person and a position through an appointment; she has, however, no idea of how this information is represented in the source. In the same vein, the user might express incomplete or slightly inaccurate requests based on partial or erroneous knowledge. Notions such as similarity should be available to answer such fuzzy queries. In order to capture the meaning of the information accessible from the Web, the language must address both content and structure (related to navigation). A Web query language should let the user specify exploration plans that generate automatic browsing (no more clicks!) and analysis of the content. The successive steps of the exploration are performed transparently by the system on behalf of the user.
With no real surprise, since we have adopted an object-oriented approach, our Web query language has the flavor of OQL [Ba97]. As presented in Section 3, we adopt a flat representation where navigation through hyperlinks or within a page is expressed using attributes of an object-oriented schema. This representation permits the use of (possibly fuzzy) path expressions. Our view mechanism requires the creation of new object identities. We adopt PIQL (Path Identity Query Language), an enrichment of OQL with identity invention à la IQL [AK89] and with generalized path expressions à la POQL [CCM96]. Query 1 of Example 3.2 is expressed in PIQL in Example 4.1.

Example 4.1 Query 1 defined in Example 3.2 is expressed in PIQL by the following expression, assuming that the pages (related to press releases announcing new appointments) are already retrieved and fragmented.

    select  z.person.name, z.position.title
    from    x, y in Fragment, z in Appointment
    where   z.URL = x.URL = y.URL
            and x in z.person.REFERS_TO
            and y in z.position.REFERS_TO
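To make the join semantics of this query concrete, here is a hand evaluation over toy in-memory objects. The object layout (REFERS_TO lists, a single Appointment, the sample names) is an assumption for illustration only; PIQL itself is evaluated by AKIRA, not by this sketch.

```python
# Evaluate the shape of Example 4.1's query as a Python comprehension
# over toy Fragment and Appointment objects.

class Obj:
    def __init__(self, **kw): self.__dict__.update(kw)

f1 = Obj(URL="http://pr.html", text="Jane Doe")   # fragment mentioning a person
f2 = Obj(URL="http://pr.html", text="CEO")        # fragment mentioning a position
person   = Obj(name="Jane Doe", REFERS_TO=[f1])
position = Obj(title="CEO",     REFERS_TO=[f2])
appt     = Obj(URL="http://pr.html", person=person, position=position)

Fragment, Appointment = [f1, f2], [appt]

result = [
    (z.person.name, z.position.title)
    for x in Fragment for y in Fragment for z in Appointment
    if z.URL == x.URL == y.URL
    and x in z.person.REFERS_TO
    and y in z.position.REFERS_TO
]
print(result)   # [('Jane Doe', 'CEO')]
```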

The user is not always interested in an HTML page as the result of her request. For instance, one might ask for a Microsoft support phone number. The output should not be a list of thousands of hyperlinks pointing to pages one will have to read to get to the correct information; the expected output is a string: the requested phone number. On the other hand, the output might be a customized HTML page, generated on the fly by a compilation of data extracted from all over the Web. A Web query language should provide a large range of output-restructuring tools to meet the user's needs. Some examples are given below.

TableOfContent takes one or several pages and returns a new page listing the headers (textual or graphical) found in the content of the input. The result consists of a list of new tags pointing to their corresponding entries.

Chronology takes one or several pages and returns an HTML page listing all references to dates found in the content of the input, sorted in chronological order.

Summary takes one or several pages and returns a summary of the content (for more details about summarization see [Pai90, ENHJ95]).

TranslateIconToText transforms (by similarity, as in [Jag96]) icons into a piece of text that carries an equivalent meaning (useful for users with text-only displays).

The semantics of a Web query language should be extensible in order to embrace more advanced abstractions: new IR technologies (handling text, image [MSS97], audio, video, etc.), new browsing strategies, and the evolution of the Web (HTML syntax extensions such as XML [BSM97], CDF [Mic97], and MCF [App97], new protocols, etc.).
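A minimal sketch of what a TableOfContent-style operator might do, assuming plain <h1>–<h6> headers in the input and a hypothetical #tocN anchor scheme (both assumptions of this sketch, not of the paper):

```python
# Collect headers from input HTML and emit a new page of links to them.
import re

def table_of_content(html: str) -> str:
    """Return a new HTML page listing the headers found in the input."""
    headers = re.findall(r"<h([1-6])[^>]*>(.*?)</h\1>", html, re.I | re.S)
    items = "".join(
        f'<li><a href="#toc{i}">{text.strip()}</a></li>'
        for i, (_level, text) in enumerate(headers)
    )
    return f"<html><body><ul>{items}</ul></body></html>"

page = "<h1>Intro</h1><p>...</p><h2>Querying the Web</h2>"
print(table_of_content(page))
```

A Chronology operator would follow the same pattern, with a date pattern in place of the header pattern and a sort before emission.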

5 Hypermedia Information Retrieval

AKIRA's goal is to process hypermedia information on behalf of the user according to her preferences. AKIRA's source is the World Wide Web, not a database where the information has already been preprocessed. To our knowledge, no available system is currently able to deal with raw hypermedia information retrieved on the fly (without preprocessing). However, various tools have been successfully developed to index specific media formats. These tools either provide access to preprocessed indices or can be used for live indexing. Search engines like AltaVista [Alt], Excite [Exc], Infoseek [Infb] or HotBot [HB] for text, and WebSeer [Web] or Lycos [Lyc] for images, give access to pre-computed indices of Web sources. On the other hand, indexing tools such as Glean [RC93], Topic [Top], WAIS [Wai], Live Topics [LT] or Glimpse [Gli] for text, WebSeer [SFA97], QBIC [Qbi], Photobook [Pho], Virage [Vir] or Informix's datablades [Infa] for images, and Muscle Fish [Fis] or the speech recognition systems developed at SRI [STA] for audio, can index documents on the fly. AKIRA tries to provide an open architecture that takes advantage of the existing core competencies mentioned above. Its flat representation of Web documents and its agentive IR services permit these techniques to work together. AKIRA's IR agents used for information extraction are highly autonomous; they use available Web search engines to locate information, as well as other filtering tools such as Glean [RC93]. Some of AKIRA's text-based agentive services are based on techniques developed for the Glean project [RC93, CS97a, CS97b]. In effect, there is a lot of information latent in any coherent text which can be used to enhance retrieval efficiency. This information includes not just the syntactic structure of natural-language text, but also the structure inherent in many domains, such as tables of stock prices, time-line information, etc.
We can also use information-extraction ideas to identify entities in the text, such as names of people and places, designations, time phrases, currency expressions, etc. In addition, collocations and grammatical relations between words and phrases in a sentence may provide valuable information. Glean relies on the use of supertagging [JS94]. Supertags are rich syntactic labels which provide information about each word in a text, including its part of speech as well as the syntactic role it plays in the sentence. Supertagging is a method to assign the most appropriate supertag to each word in a sentence. The labeling provided by supertagging can be used to obtain patterns of word use, and to deduce relations between words in a sentence. IR agents can use these patterns and relations to identify portions of text relevant to user queries. These tools currently handle only text-based components; in the future, AKIRA's tools will be extended to a larger range of formats.
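The entity-identification step can be approximated, very crudely, with surface patterns. Glean's actual supertag-based machinery is far richer; the regular expressions below are an illustrative stand-in for date and currency entities only.

```python
# Toy entity identification with surface patterns (a crude stand-in for
# the supertag-based extraction used by Glean).
import re

PATTERNS = {
    "Date":     r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
                r"[a-z]*\.? \d{1,2}, \d{4}\b",
    "Currency": r"\$\d[\d,]*(?:\.\d{2})?",
}

def identify_entities(text: str):
    """Return (entity_class, surface_string) pairs found in the text."""
    return [(cls, m) for cls, pat in PATTERNS.items()
            for m in re.findall(pat, text)]

print(identify_entities("Acme named Jane Doe CEO on March 3, 1997 at $1,200,000."))
```

Each pair found this way can populate an instance of the corresponding concept class (Date, Currency, etc.) linked back to its fragment, as in the pipeline of Figure 4.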

6 Related Work

The Web can be seen as an oracle that returns some content (for a valid URL) or nothing. This approach has already been presented in [MM97]. We believe that the only mode of computation currently available on the Web is browsing: the search capabilities available are merely accesses to databases, which are like views of the Web, and querying one of these databases is nothing but opening a URL. In this sense, our understanding of search differs from the one expressed in [AV97]. Many people model the Web as a database, but is it really one? On the Web, data consists of HTML pages with tags and hyperlinks; it has an explicit structure defined by its tags and anchors. A page also has content (text, image, sound, video, etc.), which is usually in free and unrestricted form, and predominantly textual. The content carries implicit structure not known in advance. In contrast to a database, no superuser, central authority, or federation monitors, regulates, schedules, controls, or supervises the whole Web. This anarchy results in a permanent evolution of the Web. Most formal approaches represent the Web as a labeled graph [AV97, BDFS97, FFLS97, PGMW95, AM97a, AMM97a, KS97] where a vertex corresponds to an HTML page. These approaches are primarily designed to provide browsing capabilities, and hence do not take the content of the page into account in a satisfying way: they are browsing-oriented. In [AMM97b], access to the content is considered, but only at a second level, using Editor [AM97b] for presentation purposes. In contrast to these approaches, which exploit the explicit structure provided by the HTML tags for browsing and presentation purposes, the approach we present in this paper exploits both the explicit and the implicit structure present in Web documents,

thus allowing for a uniform interface to both retrieval and browsing capabilities. AKIRA enhances the standard browser cache into a smart cache acting as a database view of the Web. Our view representation is based on an object-oriented model, unlike WebSQL [MMM97] or RAW [FWM97], which use a relational approach. Nor do we use a tree-based representation like WebOQL [AM97a] (with abstract syntax trees), Lorel [AQM+97], or W3QS [KS97]; we prefer a flat representation. Our internal representation is based on the notion of fragments, which raises content-based and hyperlink-based navigation to the same level. Our model is not page-oriented. The fragment terminology also exists in RAW [FWM97], and the same idea appears in [LSS96] with rel-infons, tokens of related information. Our representation is both navigational and content-oriented. The entire process is a series of views of views [LDB97]. Another approach consists in using the structure of the content (which is supposed to be known at some point) as the schema of a database [CACS94, GW97, ACC+97]. In contrast to these approaches, which only exploit the explicit structure provided by the HTML tags or make assumptions about the structure of the source (see ARANEUS [AMM97b]), our approach does not assume any predefined schema for the source but focuses only on the structure requested by the user (the target structure). When answering a query, AKIRA assumes that the user knows what she is looking for, but that she does not have a clear idea of how to express her needs; it also assumes that she has no knowledge of the format of the source. Our query language therefore assumes zero-knowledge as far as the source is concerned. This approach differs from [GW97], where the semi-structured source is reachable and can provide a "structural summary" of itself (the DataGuide). AKIRA has to populate the cache on the fly and has no a priori knowledge of the structure of the source.
Our choice of a flat representation of the Web makes an extension of POQL [ACC+97] a good candidate for our query language. POQL allows OQL-style queries and the use of generalized path expressions, as well as the underlying optimization techniques. It is also very similar to Lorel [AQM+97], even though it does not really need coercion. We also provide the user with restructuring tools (available as macros) similar to some proposed in SgmlQL [HMMV97]. Our agentive services try to perform as much work as possible on behalf of the user. In this sense, we do not follow the "suggest rather than act" principle of Lieberman [Lie97]: our agents have to behave autonomously when retrieving information from the Web. Our exploration of the Web is based on the content of the page (using an extended notion of locality) rather than on knowledge of previous tours, as in WebWatcher [JFM97]. Most information retrieval (IR) systems [Top], as well as the IR algorithms embedded in Web search engines [Alt, HB], have considered documents to be just sets of words. The most popular model among them, the Vector Space model [Sal89, SM83], treats the set of words present in a document as a vector in a multi-dimensional space. A user query is similarly treated as another vector in that space, and documents that are in the "neighborhood" of the query vector are deemed relevant. Other IR systems (see for instance [Alt, Wai]) simply use retrieval based on inverted indexes of content words to approximate the content of documents: each document is parsed and an entry (pointing back to the document) is created in the index for every "relevant" word. But these methods ignore the relations among words. Our approach, based on techniques developed for the Glean project [CS97a, CS97b], focuses not only on the words in a text but also on the relations between words. AKIRA's architecture is aimed at integrating and plugging in new services. Our flat representation allows us to take advantage of various available techniques developed for multimedia processing. Most existing services (such as [Alt, Lyc, Web]) provide access to preprocessed data stored in a database and indexed according to a given schema, while AKIRA retrieves information on the fly and processes it according to the user's request. AKIRA can use the existing services mentioned above if they prove useful in answering queries.
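The Vector Space model mentioned above can be sketched in a few lines: documents and the query become term-count vectors, and "neighborhood" is cosine similarity. The toy corpus below is an illustrative assumption; real systems add weighting (e.g., TF-IDF) and stemming on top of this core.

```python
# Minimal Vector Space retrieval: term-count vectors + cosine similarity.
import math
from collections import Counter

def vec(text: str) -> Counter:
    """Turn a text into a bag-of-words vector (term -> count)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term vectors; 1.0 = identical direction."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["akira queries the web", "recipes for apple pie"]
q = vec("querying web documents")
ranked = sorted(docs, key=lambda d: cosine(vec(d), q), reverse=True)
print(ranked[0])
```

Note that "queries" and "querying" do not match here, which is exactly the kind of surface-level limitation (no relations between words, no morphology) that the Glean-based approach aims to go beyond.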

7 Conclusion

AKIRA is a system that behaves as a proxy server for a user: it establishes an interface between the user and the sources of information. The system can be viewed as a "smart cache" since it retrieves data for the user and restructures it. The use of information retrieval and information extraction techniques, along with advanced tools from computational linguistics, permits us to increase the precision of retrieval, using knowledge about the content of queries and of the hypermedia documents processed by AKIRA. We have proposed a high-level Web query language, PIQL, that assumes zero-knowledge with regard to the source, expresses the target structure according to the user's requirements, allows fuzzy reasoning, is both navigation and content

oriented (thanks to our chosen flat representation of documents as fragments), manages transparency and automation of queries, specifies the structure of the output, and supports flexibility. The query process is also interactive in the sense that it can be iterated, with the user refining her query. In future work, we plan to investigate issues arising from information distributed across fragments, as well as information available in fragments connected in some tree-like fashion; the use of fragments for collaborative work and integration with a high-level user interface; and the incorporation of new services in the agent pool.

Acknowledgments: We thank Alberto Mendelzon and Anne-Marie Vercoustre for valuable comments on an earlier version of this paper.

References

[ABL96] W.G. Aref, D. Barbara, and D. Lopresti. Ink as a first-class datatype in multimedia databases. In Multimedia Database Systems, 1996.
[ACC+97] S. Abiteboul, S. Cluet, V. Christophides, T. Milo, G. Moerkotte, and J. Simeon. Querying documents in object databases. Journal on Digital Libraries, 1997.
[AHV95] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[AK89] S. Abiteboul and P. Kanellakis. Object identity as a query language primitive. In ACM SIGMOD Symposium on the Management of Data, pages 159-173, Portland, Oregon, USA, June 1989.
[Alt] AltaVista. Available at http://altavista.digital.com/.
[AM97a] G.O. Arocena and A. Mendelzon. WebOQL: Restructuring documents, databases and webs. Unpublished manuscript, 1997.
[AM97b] P. Atzeni and G. Mecca. Cut and paste. In Proc. ACM Symp. on Principles of Database Systems, Tucson, Arizona, May 1997.
[AMM97a] P. Atzeni, G. Mecca, and P. Merialdo. Semistructured and structured data in the Web: Going back and forth. In ACM SIGMOD Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997.
[AMM97b] P. Atzeni, G. Mecca, and P. Merialdo. To weave the Web. In Proc. of Intl. Conf. on Very Large Data Bases, 1997.
[App97] Apple. The Meta Content Format (MCF). July 1997. Available at http://mcf.research.apple.com/hs/mcf.html.
[AQM+97] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J.L. Wiener. The Lorel query language for semistructured data. Journal on Digital Libraries, 1997. ftp://db.stanford.edu/pub/papers/lorel96.ps.
[AV97] S. Abiteboul and V. Vianu. Queries and computation on the Web. In Proc. of Intl. Conf. on Database Theory, 1997.
[Ba97] D. Bartels et al. The Object Database Standard: ODMG 2.0. Morgan Kaufmann, San Francisco, 1997.
[BDFS97] P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding structure to unstructured data. In Proc. of Intl. Conf. on Database Theory, Delphi, Greece, January 1997.
[BLCL+94] T. Berners-Lee, R. Caillau, A. Lautonen, H.F. Nielsen, and A. Secret. The World Wide Web. Communications of the ACM, 37(8):76-82, August 1994.
[BSM97] T. Bray and C.M. Sperberg-McQueen. The XML Specification. W3C, 1997. Available at http://www.w3.org/pub/WWW/TR/WD-xml.html.
[CACS94] V. Christophides, S. Abiteboul, S. Cluet, and M. Scholl. From structured documents to novel query facilities. In Proc. ACM SIGMOD Symp. on the Management of Data, 1994.
[CCM96] V. Christophides, S. Cluet, and G. Moerkotte. Evaluating queries with generalized path expressions. In Proc. ACM SIGMOD Symp. on the Management of Data, 1996.
[CS97a] R. Chandrasekar and B. Srinivas. Gleaning information from the Web: Using syntax to filter out irrelevant information. In AAAI Spring Symposium on Natural Language Processing for the World Wide Web, Stanford University, March 1997.
[CS97b] R. Chandrasekar and B. Srinivas. Using syntactic information in document filtering: A comparative study of part-of-speech tagging and supertagging. In Proceedings of RIAO'97, Montreal, June 1997.
[dSDA94] C. Souza dos Santos, C. Delobel, and S. Abiteboul. Virtual schemas and bases. In Proceedings of the International Conference on Extending Database Technology, March 1994.
[ENHJ95] B. Endres-Niggemeyer, J. Hobbs, and K. Sparck Jones. Summarizing text for intelligent communication. Technical Report 79, Dagstuhl Seminar, 1995. http://www.bid.fh-hannover.de/SimSum/Abstract/.
[Exc] Excite. Available at http://www.excite.com/.
[FFLS97] M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language and processor for a website management system. In ACM SIGMOD Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997.
[Fis] Muscle Fish. Available at http://www.musclefish.com/.
[FWM97] T. Fiebig, J. Weiss, and G. Moerkotte. RAW: A relational algebra for the Web. In ACM SIGMOD Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997.
[Gli] Glimpse. Available at http://donkey.cs.arizona.edu:1994/.
[GW97] R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. Technical report, Department of Computer Science, Stanford Univ., 1997.
[HB] HotBot. Available at http://www.hotbot.com/index.html.
[HMMV97] S. Harie, J. Le Maître, E. Murisasco, and J. Veronis. SgmlQL: Language reference. Universités de Toulon et de Provence, France, May 1997. Available at http://www.lpl.univ-aix.fr/projects/SgmlQL/.
[Infa] Informix. Available at http://www.informix.com/.
[Infb] Infoseek. Available at http://www.infoseek.com/.
[Jag96] H.V. Jagadish. Indexing for retrieval by similarity. In Multimedia Database Systems, 1996.
[JFM97] T. Joachims, D. Freitag, and T. Mitchell. WebWatcher: A tour guide for the World Wide Web. In Proceedings of IJCAI, August 1997.
[JHC97] L. Julia, L. Heck, and A. Cheyer. A speaker identification agent. In Proceedings of AVBPA, 1997.
[JS94] A.K. Joshi and B. Srinivas. Disambiguation of super parts of speech (or supertags): Almost parsing. In Proceedings of the 17th International Conference on Computational Linguistics (COLING '94), Kyoto, Japan, August 1994.
[KS97] D. Konopnicki and O. Shmueli. W3QS - The WWW Query System. Department of Computer Science, Technion, Israel, May 1997. Available at http://www.cs.technion.ac.il/W3QS/.
[LDB97] Z. Lacroix, C. Delobel, and Ph. Breche. Object views and database restructuring. In Proc. of Intl. Workshop on Database Programming Languages, August 1997.
[Lie97] H. Lieberman. Autonomous interface agents. In Proceedings of the ACM Conference on Computers and Human Interface, CHI-97, Atlanta, Georgia, March 1997.
[Llo87] J.M. Lloyd. Foundations of Logic Programming, 2nd Ed. Springer Verlag, 1987.
[LSS96] L.V.S. Lakshmanan, F. Sadri, and I.N. Subramanian. A declarative language for querying and restructuring the Web. In Sixth International Workshop on Research Issues in Data Engineering - Interoperability of Nontraditional Database Systems, 1996.
[LT] Live Topics. Available at http://altavista.digital.com/av/lt/help.html.
[Lyc] Lycos. Available at http://www.lycos.com/.
[Mic97] Microsoft. The Channel Definition Format (CDF). July 1997. Available at http://www.microsoft.com/standards/cdf.htm.
[MM97] A. Mendelzon and T. Milo. Formal models of Web queries. In Proc. ACM Symp. on Principles of Database Systems, 1997.
[MMM97] A. Mendelzon, G. Mihaila, and T. Milo. Querying the World Wide Web. Journal on Digital Libraries, 1(1):54-67, 1997.
[MSS97] C. Meghini, F. Sebastiani, and U. Straccia. Modelling the retrieval of structured documents containing texts and images. In Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries, Pisa, Italy, September 1997.
[OO88] K. Otomo and M. Oshii. Akira: Neo Tokyo is about to E.X.P.L.O.D.E. Akira Committee, 1988. 124 min.
[Pai90] C.D. Paice. Constructing literature abstracts by computer. Information Processing and Management, 26(1):171-186, 1990.
[PGMW95] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across heterogeneous information sources. In Proceedings of the International Conference on Data Engineering, 1995. Available at http://www-db.stanford.edu/pub/papers/icde95.ps.
[Pho] Photobook. Available at http://iris.elis.rug.ac.be/pds/imvb/thesisdemo/photo book index.html.
[Qbi] QBIC. Available at http://wwwqbic.almaden.ibm.com/qbic/qbic.html.
[Rag97] D. Raggett. HTML 3.2 reference specification. Technical report, W3C, 1997.
[RC93] S. Ramani and R. Chandrasekar. Glean: A tool for automated information acquisition and maintenance. Technical report, NCST Bombay, 1993.
[RL96] F. Rouaix and B. Lang. The V6 engine. Technical report, INRIA, 1996.
[Sal89] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
[SFA97] M. Swain, C. Frankel, and V. Athitsos. Distinguishing photographs and graphics on the World Wide Web. Submitted to the IEEE Workshop on Content-Based Access of Image and Video Libraries, March 1997. Available at http://infolab.cs.uchicago.edu/webseer/.
[SM83] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
[Spe96] E. Spertus. ParaSite: Mining structural information on the Web. In Proceedings of the Sixth International World Wide Web Conference, Santa Clara, California, USA, April 1996.
[STA] STAR. Available at http://www-speech.sri.com/.
[Top] Topic. Available at http://www.informix.com/.
[VDH97] A-M. Vercoustre, J. Dell'Oro, and B. Hills. Reuse of information through virtual documents. In Proceedings of the 2nd Australian Document Computing Symposium, Melbourne, Australia, April 1997.
[Vir] Virage. Available at http://www.virage.com/.
[Wai] WAIS. Available at http://www.informix.com/.
[Web] WebSeer. Available at http://webseer.cs.uchicago.edu/.
