
Adding support for dynamic and focused search with Fetuccino

Israel Ben-Shaul a, Michael Herscovici b, Michal Jacovi b, Yoelle S. Maarek b,*, Dan Pelleg c, Menachem Shtalhaim b, Vladimir Soroka b, Sigalit Ur b

a Department of Electrical Engineering, Technion, Haifa, Israel
b IBM Haifa Research Laboratory, MATAM, Haifa, Israel
c Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, USA

* Corresponding author. E-mail: [email protected]

Abstract

This paper proposes two enhancements to existing search services over the Web. One enhancement is the addition of limited dynamic search around results provided by regular Web search services, in order to correct part of the discrepancy between the actual Web and its static image as stored in search repositories. The second enhancement is an experimental two-phase paradigm that allows the user to distinguish between a domain query and a focused query within the dynamically identified domain. We present Fetuccino, an extension of the Mapuccino system that implements these two enhancements. Fetuccino provides an enhanced user interface for visualization of search results, including advanced graph layout, display of structural information and support for standards (such as XML). While Fetuccino has been implemented on top of existing search services, its features could easily be integrated into any search engine for better performance. A light version of Fetuccino is available on the Internet at http://www.ibm.com/java/fetuccino. © 1999 Published by Elsevier Science B.V. All rights reserved.

Keywords: Search enhancements; Search filtering; Search results visualization; Site mapping

1. Introduction and motivation

The Web keeps growing at a phenomenal rate. From the perspective of search technology, this growth has two important characteristics. First, the 'update' policy is totally uncontrolled, with millions of users creating, modifying and deleting content at will, and linking it to other content in an unstructured manner. Second, Web growth is increasingly fueled by the addition of dynamically and automatically generated content [18]. These characteristics impact the major criteria of evaluation for search services, namely effectiveness (i.e., the quality of results) and efficiency (i.e., response time). Effectiveness is typically evaluated in information retrieval (IR) by recall¹ and precision² [17]. Recall, in the context of the Web, requires good coverage of the Web, which is itself achieved via periodic and wide off-line 'crawling'. Content thus accessed is indexed and stored in inverted index repositories that are used at runtime to return results to queries.

¹ Recall is defined in information retrieval as the ratio between the number of relevant retrieved documents and the total number of existing relevant documents; using a simple analogy, it refers to "all the truth".
² Precision is defined in information retrieval as the ratio between the number of relevant retrieved documents and the total number of retrieved documents; using the same analogy, it refers to "nothing but the truth".
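In symbols, letting R denote the set of relevant documents in the collection and A the set of retrieved (answer) documents, the two footnotes above amount to the standard definitions of [17]:

```latex
\mathrm{recall} = \frac{|R \cap A|}{|R|}, \qquad
\mathrm{precision} = \frac{|R \cap A|}{|A|}
```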

Because of the dynamic nature of the Web, there might be a substantial difference between the image of the Web that is stored in batch-generated search repositories and the actual Web. In addition to missing newly added documents, this batch (or, simply put, static) approach cannot address the deletion or movement of documents since the last crawl, returning to users the annoying 'broken link' (a.k.a. '404 error') message³. In spite of increasing computing power and network bandwidth, this discrepancy is likely to grow. At the same time, a fully dynamic search is clearly not scalable time-wise. Thus, a combined approach is highly desirable.

The precision ratio is also a major challenge in Web-IR research, despite much work in this area in recent years. Improvements in Web search technology include taking into account usage paths and users' recommendations, as in Alexa⁴. Another family of advances has focused on analyzing the link topology⁵ of a sub-Web, or more generally the hypertext structure, to discover 'important' documents that should be assigned a higher score. Example approaches and systems along this direction include HyPursuit [21], WebQuery [3,16], the 'Hubs and Authorities' approach [8,10] and Google [2]. Nevertheless, most of these solutions still lack search context. This results in many returned entries that may all match the query but belong to different and unrelated domains. The lack of context stems from the inherent lack of structure in the Web. Some search engines, most notably Yahoo, address this problem by allowing users to specify (sub)domains in which to evaluate the query. Such domain restriction significantly reduces the number of irrelevant results and narrows the search space, hence increasing precision.

³ In some cases, rather than eliminating broken links from search results, users may want to have access to these 'out of print' entries. This can be solved by archiving the global Web, as proposed by Alexa's 'Archive of the Web Service'.
⁴ Alexa Technology: http://www.alexa.com/support/technology.html; Alexa Internet Stats: http://www.alexa.com/support/internet_stats.html
⁵ A clear benefit of topology-based approaches is that they are less sensitive to 'spam' attacks (documents that are made to appear relevant to a given topic by artificially inserting relevant terms, for instance).

However, the domain hierarchy is manually (or at best semi-manually) crafted as a taxonomy of predefined categories. The ability to support an 'uncontrolled vocabulary' for domain specification is desirable from a usability perspective, but it requires addressing a set of non-trivial technical issues. First, the 'domain views' (by analogy with schema views in databases) need to be dynamically identified. Second, users should be allowed to perform a search focused on a given topic within the scope of the dynamic domain.

The Fetuccino system, presented in this paper, addresses these issues by:
(1) augmenting conventional static search with dynamic search that explores the 'vicinity' of the returned results, in order to reveal relevant (possibly new, or dynamically generated) documents and check the validity of the retrieved documents at the time of the search;
(2) introducing an experimental dynamic two-phase search that allows the user to specify a domain query and a focused query, and dynamically applies the focused search within the specified domain;
(3) visualizing the results of the dynamic exploration, showing the structure and relative position of the returned entries within their sites.

Fetuccino has been fully implemented as an extension of the Mapuccino⁶ site mapping system [9], previously known as WebCutter [13]. A light version of the service (without the two-phase search) is already available on the IBM Corporate Java Page⁷. A stand-alone 'Personal Mapuccino/Fetuccino' system for Win 95/NT is also available for download on AlphaWorks⁸. The full-fledged applet (implemented as a Java bean and accepting XML-defined maps according to the SGF DTD [11]) and/or server have been integrated into IBM products and services, including IBM WebSphere Site Analysis.

The rest of this paper is organized as follows: Section 2 describes dynamic search and its embodiment in the Fetuccino light service⁹. Section 3 presents the two-phase search paradigm, along with some examples obtained with the full Fetuccino service. Section 4 summarizes the contributions of this paper and points to future work.

⁶ http://www.ibm.com/java/mapuccino
⁷ http://www.ibm.com/java/fetuccino
⁸ http://www.alphaworks.ibm.com/formula/mapuccino
⁹ http://www.ibm.com/java/fetuccino

2. Adding dynamic search

2.1. The approach

Dynamic search is primarily distinguished from conventional static search in that the former involves fetching the actual documents at the time the query is issued and analyzing their relevance to the search query on the fly, while the latter is based on evaluating the query against pre-computed repositories. Obviously, dynamic search cannot be employed 'from scratch', because of the high cost of text analysis and the huge search space; it therefore requires a starting point for recursive exploration, as well as some limitation of the search space. For example, the WebGlimpse¹⁰ [15] system allows users to pose a query from any 'WebGlimpse'-enabled page, using the current page as a starting point for exploration up to a predefined 'hop' parameter (i.e., depth in the Web viewed as a graph). The rationale for searching for new relevant pages in the 'vicinity' of other relevant pages follows from the hypertext structure of the Web, where a hyperlink is likely to imply relevance of the target document to the source document.

Our basic idea is to invoke a conventional static search engine in order to obtain an initial set of 'good' links to pages that are presumably relevant to the query, and to continue the exploration from there. This dynamic exploration has several benefits. First, a more precise relevance analysis can be made, since corpus statistics can be derived from a narrower domain than the full Web. Second, the process validates the existence and reachability of documents, i.e., it makes it possible to detect 'broken' links caused by documents or sites that no longer exist or are unreachable. Third, the dynamic exploration of outgoing links makes it possible to discover pages that are relevant to the query but did not appear in the results of the static search.

One way to look at this dynamic process is as an 'automatic browsing' session conducted on behalf of the user, one that significantly improves the starting point the user would normally be at had s/he used only static search. With static results only, such a user would typically conduct a manual browsing session, stumble on several 'not found' or 'not reachable' documents, and read irrelevant documents that only appear relevant. Then, after finding relevant documents, s/he would probably further explore their links to find more relevant information. Using our method, much of this tedious and time-consuming process can be avoided.

The approach of using page relevance as a heuristic to direct an exploration procedure was first introduced by the 'fish search' paradigm [7], which capitalizes on the intuition that relevant documents often have relevant neighbors. Thus, it searches deeper under documents that have been found relevant to the search query, and stops searching in 'dry' areas. In a previous paper [9], we introduced our own relevance-based exploration, which enhances the fish-search algorithm, for use in the context of site mapping. The algorithm, termed 'shark search', increases the number of relevant pages explored within a given amount of time. The original purpose of the shark-search algorithm was to dynamically build a Web site map tailored to a given topic. We propose here to use the same kind of heuristic-based approach to dynamically explore the neighborhood of static search results. One key difference between this context of application and regular site mapping is that the root page is not manually specified by the user (via a URL), but dynamically generated as the result page of a given search service.

¹⁰ WebGlimpse homepage: http://glimpse.cs.arizona.edu/webglimpse/index.html; WebGlimpse paper: ftp://ftp.cs.arizona.edu/people/udi/webglimpse.ps.Z
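To make this concrete, here is a minimal Java sketch of such a relevance-directed crawler, in the spirit of fish/shark search. It is an illustration only: the Fetcher, Scorer and LinkExtractor interfaces, the inheritance/decay rule and all names are our own placeholders, not the published shark-search algorithm of [9] nor Fetuccino's actual implementation.

```java
import java.util.*;

/**
 * Minimal sketch of relevance-directed crawling in the spirit of
 * fish/shark search: explore deeper under relevant pages, stop in
 * 'dry' areas. Scoring and decay details are simplifications.
 */
public class RelevanceCrawler {

    /** Assumed HTTP layer: returns page text, or null for a broken link. */
    interface Fetcher { String fetch(String url); }

    /** Assumed IR engine: relevance of a page to the query, in [0,1]. */
    interface Scorer { double relevance(String text); }

    /** Assumed HTML parser: outgoing links of a page. */
    interface LinkExtractor { List<String> links(String text); }

    public static List<String> crawl(List<String> seeds, Fetcher fetcher,
                                     Scorer scorer, LinkExtractor extractor,
                                     int maxPages, double decay) {
        // Frontier ordered by inherited potential relevance, best first.
        PriorityQueue<Map.Entry<String, Double>> frontier =
            new PriorityQueue<>((a, b) -> Double.compare(b.getValue(), a.getValue()));
        for (String url : seeds) frontier.add(Map.entry(url, 1.0));

        Set<String> visited = new HashSet<>();
        List<String> relevant = new ArrayList<>();
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            Map.Entry<String, Double> entry = frontier.poll();
            String url = entry.getKey();
            if (!visited.add(url)) continue;   // already fetched
            String page = fetcher.fetch(url);
            if (page == null) continue;        // broken link detected and skipped
            double score = scorer.relevance(page);
            if (score > 0) relevant.add(url);
            // Children inherit a decayed score, so exploration dries up
            // under irrelevant pages and deepens under relevant ones.
            double inherited = decay * Math.max(score, entry.getValue());
            for (String child : extractor.links(page))
                if (!visited.contains(child)) frontier.add(Map.entry(child, inherited));
        }
        return relevant;
    }
}
```

In Fetuccino, the stopping condition is a user-specified time limit rather than a page budget; substituting a deadline for the maxPages bound is straightforward.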


Fig. 1. The Advanced Form in the Fetuccino service (light).

2.2. Contextualizing the search results

In addition to the benefits of dynamic search mentioned above, another key benefit of the exploration process is that nearby documents (in the Web topology) can be identified not as isolated pages, but rather in the context of a structured Web neighborhood. Indeed, as the graph of the local sub-Web around each search candidate is fetched, a high-level view of that graph, showing the mutual relationships, can be provided. This structural information is orthogonal to the content-based information upon which most traditional search engines base their scoring. Yet, knowing that many search results point to the same candidate (which is not necessarily a top candidate) gives an important clue about how 'authoritative' that page is on a given subject. We discuss later how this concept of 'authorities' has become increasingly important in more recent search services. As a first approximation, though, simply visualizing the structure of the results gives a visual clue of the importance of the document, as shown below.

2.3. An example of dynamic search: Fetuccino light

The Fetuccino service (light version) embodies this dynamic crawling approach on top of existing search services. In the advanced options form (see Fig. 1), the user can select his/her favorite search service from a list of supported services, and control a set of parameters such as the time limit, the depth (in terms of hops to follow), etc. Once the user clicks on the "Create the Results Map!" button, Fetuccino forwards the query to the selected search service. It then pushes those results back to the user's browser 'as is' and, at the same time, feeds them to the dedicated crawler component, which starts the exploration within the time limit specified in the form. Note that during the exploration process, the user can already scan through the regular search results, which are provided in a separate browser window.


Fig. 2. Augmenting static candidates with dynamically fetched results (regular tree layout).

Once the time limit is reached, a map of the results is sent back to the user, together with the Mapuccino visualization applet, and pops up on the client station. The map view shows the search results page as the root, and the top-ranked candidates as first-level descendants. The shade of blue of each node label indicates the degree of relevance of the candidate (the darker, the more relevant), according to Fetuccino's built-in IR engine; a possible such shading function is sketched after the list below. Subsequent levels in the tree show the pages fetched by Fetuccino and how they are related to each other. The screendump in Fig. 2 shows the added value of the interactive Fetuccino map, as discussed before:
(1) Added relevance: the most useful feature is the identification of very relevant results that were not among the top candidates of the regular search. In Fig. 2, one of the two most relevant (darker) pages at the second level is actually the authority for the given query ('xml'), namely the actual XML entry at the http://www.w3.org site.
(2) Structure: by default, the user gets only the underlying tree structure, as induced by the crawling process. However, by clicking on one node, or by selecting the "Backlink" option in the View menu, s/he gets the cross links for that node, or for all nodes in the map, respectively (in blue in Fig. 2).
(3) Abstraction: one very convenient feature of the high-level view of results is the ability to perform global operations. This feature, devised for the Mapuccino system for site management (see [13]), is particularly useful for fetching the favorite search results and concatenating them in a single page (including images) for easy printing.
(4) Validity: finally, the user may also ask to see the broken links. By default, s/he sees only pages that were accessible at the time the query was issued. By clicking on the "Broken Links" option in the View menu, the user will get the 404-results (if any).
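For concreteness, the relevance-to-shade mapping mentioned above can be as simple as the following hypothetical helper; the actual palette used by the applet is not specified in the paper.

```java
import java.awt.Color;

class RelevanceShading {
    /** Illustrative only: map a relevance score in [0,1] to a shade of
     *  blue, darker meaning more relevant, as in the Fetuccino map view. */
    static Color relevanceShade(double score) {
        int light = (int) Math.round(255 * (1.0 - score)); // 1.0 -> dark blue, 0.0 -> near white
        return new Color(light, light, 255);
    }
}
```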


3. Adding domain/focused search

3.1. A two-phase approach

A major problem when conducting searches over a large, heterogeneous and uncontrolled document set such as the Web is the quality of the results for ambiguous queries. Indeed, it happens quite often that the user discovers, only after issuing a search, that the query expression s/he picked has a different interpretation in a totally irrelevant domain. This might happen even if the user is an expert at expressing clear and precise queries, simply because of the large scope of the Web. This poor quality of the results is evidenced by low precision as well as low recall rates in the following cases:
• low precision: when the query expression is relevant to a set of unrelated domains. Thus a user explicitly asking for "Merced", but implicitly thinking of the Intel Architecture, will get many results pertaining to "Merced County" as well;
• low recall: when the query expression, implicitly targeted at a particular domain, returns results overshadowed by the same expression in a much heavier domain. Thus, for example, a query for "Michael Jordan" (the researcher) will return a huge amount of data relating to the other famous Michael Jordan, of basketball fame, so that the intended results have a very low probability of being among the top results.

One solution is to disambiguate the query by embedding the domain inside the query term. In our first example, adding "architecture" to "Merced" would probably be enough to disambiguate the query. However, in our second example, adding the specific field of interest to "Michael Jordan" (e.g., "Cognitive Science") might narrow the search space too much (we report further experiments with this approach in Section 3.3). Indeed, the disambiguating terms are not used by the search engine as a simple qualifier of the original query, but as equally important terms. They might even overshadow the query: in our example, the term "Cognitive Science" might be rare enough to get more weight than "Michael Jordan". Another approach is to automatically cluster search results by content; this approach, first applied in regular IR systems [6,14], is regaining attention in the context of the Web [12,22]. However, while clustering techniques are getting faster and more precise, cluster labels are still rarely as good as those provided by a manually defined ontology. Thus, for quality control, some search services provide a precise ontology or taxonomy that allows the user to specify the domain in which the query should be evaluated, separately from the query term itself, as done in Yahoo. The user can browse through the available categories and, when the proper specificity is reached, the query is evaluated within the selected domain. The domain selection disambiguates the query and therefore improves the precision rate.

Fetuccino applies the same idea to dynamic search. The user specifies two terms: a domain term and a query term. The search proceeds in two phases. In the first phase, the domain that matches the domain term is identified (notice that the domain is user-defined, as opposed to the common fixed domain hierarchy). In the second phase, the query term is dynamically evaluated within the domain. This approach, while promising, introduces two hard technical problems: how to dynamically identify a domain, and how to confine the dynamic search to this domain. Finding an optimal solution to these problems remains a far-reaching goal; however, we propose here an experimental approach that performs well in many cases. The general idea is to represent a domain by finding the central documents in it, and to approximate intra-domain search by exploring these pages dynamically.

The problem of finding central documents in a domain has been researched extensively in recent years, but mostly in the context of general-purpose search. The common approach is to analyze the link topology and give documents a score that is independent of their relevance to specific queries and that represents how authoritative they are. We argue that this general approach is particularly effective for identifying central documents of a domain (as opposed to general-purpose topic search), since a domain is more likely to have such central nodes (which is not necessarily true for arbitrary terms), and domain names are typically less ambiguous than specific terms. There are several approaches to finding important documents (which we refer to as 'authorities' in the following, to reuse the term coined by Kleinberg [10]), and some are embodied in freely available Web search services, e.g., Google¹¹. We have chosen to take advantage of one of the implementations of Kleinberg's 'Hubs and Authorities' approach, namely Clever [4], which 'parasites' AltaVista to build the graph on which to compute authorities. Clever is one of the most advanced implementations of the Hubs and Authorities model and is conveniently located on the same IBM intranet as our Fetuccino prototype. In principle, however, any method for finding hubs and authorities as domain representatives could be used. Let us remind the reader that the 'Hubs and Authorities' approach consists of identifying authoritative documents that are pointed to by many sites, and 'hub' nodes that point to authoritative sites. This approach has had several variants and extensions [1,5]. It is also particularly effective for our purposes, since we have found that the best way to represent a domain for the intra-domain search phase is to select not only the top authorities, as one might think, but also the top hubs. The former are likely to contain relevant domain information, and the latter are likely to point to pages with domain information. Since intra-domain search is effectively a dynamic exploration along the link topology, hubs are just as important as authorities as domain seeds.
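To make the reminder concrete, the following self-contained sketch runs Kleinberg's iterative hubs-and-authorities computation [10] on a toy link graph. The graph and iteration count are illustrative; production systems such as Clever add refinements (e.g., text-based weighting [4]) that are omitted here.

```java
import java.util.*;

/** Sketch of Kleinberg's hubs-and-authorities iteration [10] on a small
 *  directed graph given as adjacency lists; the graph is a toy example. */
public class HubsAuthorities {
    public static void main(String[] args) {
        // links.get(i) = pages that page i points to
        List<List<Integer>> links = List.of(
            List.of(1, 2), List.of(2), List.of(0), List.of(1, 2));
        int n = links.size(), iterations = 50;
        double[] hub = new double[n], auth = new double[n];
        Arrays.fill(hub, 1.0);
        for (int it = 0; it < iterations; it++) {
            Arrays.fill(auth, 0.0);
            for (int i = 0; i < n; i++)      // auth(p) = sum of hub(q) over q -> p
                for (int j : links.get(i)) auth[j] += hub[i];
            Arrays.fill(hub, 0.0);
            for (int i = 0; i < n; i++)      // hub(p) = sum of auth(q) over p -> q
                for (int j : links.get(i)) hub[i] += auth[j];
            normalize(auth);
            normalize(hub);
        }
        System.out.println("authorities: " + Arrays.toString(auth));
        System.out.println("hubs:        " + Arrays.toString(hub));
    }

    static void normalize(double[] v) {
        double norm = Math.sqrt(Arrays.stream(v).map(x -> x * x).sum());
        if (norm > 0) for (int i = 0; i < v.length; i++) v[i] /= norm;
    }
}
```

Authorities get high scores when pointed to by good hubs, and hubs when pointing to good authorities; the mutually recursive scores stabilize under the normalization.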

¹¹ http://www.google.com

3.2. A session example

We have extended the Fetuccino light service presented in Section 2.3 to support the two-phase approach introduced above. From a user-interface perspective, the main difference between the two versions of Fetuccino is that in this heavier version (code-named "Fetuccino Alfredo"), the user is required to enter two query fields, a 'domain query' field and a 'focused query' field, instead of a single free-text entry. An example of the advanced form is shown in Fig. 3. Once the user clicks on the "Create the Results Map" button, the server forwards the domain query to Clever, which computes a set of hubs and authorities (note that, depending on the communication bandwidth between the authority engine and the actual search server from which the graph is retrieved, this process might take up to a few minutes). The results are pushed back to the user's browser; see Fig. 4 for an example of Clever output on the domain query "Chicago Bulls". At the same time, the top authorities and the top hubs are fed as seeds to the shark-directed crawler. The key point here is that the shark-directed search does not use the domain query, but a more focused query that in general does not include the domain terms (so as not to bias the results; the domain terms should remain implicit in the dynamic domain represented by the hubs and authorities). After the time limit expires, the map of augmented and filtered results is pushed back to the user's browser. An example of such a map is given in Fig. 5, where the top 3 results are highlighted in darker blue (and framed in red). The top one refers to "Bulls roster" and the second to "Bulls transactions", while the third one (at the bottom right-hand side), as well as the other nodes in a lighter shade of blue, refer to "trades and transactions" of other NBA teams (in the case of the 3rd candidate, the LA Clippers). This is due to the fact that, as explained before, the domain is not a manually crafted one, and as such includes related information (other NBA teams) that does not perfectly match the "Bulls" domain. One option would be to filter the domain by forcing each page to explicitly refer to the Bulls, or to re-specify the domain within the focused query. We believe, though, that for the sake of better recall it is preferable to show these results to the user, who gets additional clues from the relevance score (shade of blue) and from the proximity to the domain seeds.
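Putting the two phases together, the control flow just described amounts to roughly the following sketch. All names here (AuthorityEngine, httpGet, focusedRelevance, the seed counts and decay value) are hypothetical placeholders rather than the actual Clever or Fetuccino APIs; RelevanceCrawler refers to the sketch in Section 2.1, and the choice of the top 5 authorities and top 5 hubs as seeds mirrors the 'Clever a' configuration of Table 1 below.

```java
import java.util.*;

/** Hypothetical glue for the two-phase search; all names are placeholders. */
class TwoPhaseSearch {
    interface AuthorityEngine {            // phase 1: domain identification
        List<String> topAuthorities(String domainQuery, int k);
        List<String> topHubs(String domainQuery, int k);
    }

    static List<String> search(AuthorityEngine engine, String domainQuery,
                               String focusedQuery) {
        // Phase 1: represent the domain by its central pages, of both kinds.
        List<String> seeds = new ArrayList<>();
        seeds.addAll(engine.topAuthorities(domainQuery, 5));
        seeds.addAll(engine.topHubs(domainQuery, 5));
        // Phase 2: relevance-directed crawl scored against the focused
        // query only; the domain stays implicit in the seeds.
        return RelevanceCrawler.crawl(
            seeds,
            url -> httpGet(url),                          // assumed fetcher
            text -> focusedRelevance(text, focusedQuery), // assumed IR scorer
            text -> extractLinks(text),                   // assumed parser
            200, 0.8);
    }

    static String httpGet(String url) { return null; }                // placeholder
    static double focusedRelevance(String t, String q) { return 0; }  // placeholder
    static List<String> extractLinks(String t) { return List.of(); }  // placeholder
}
```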


Fig. 3. The Advanced Form in Fetuccino Alfredo.

3.3. Some experiments

We show below some comparative results on a set of queries, which illustrate the main advantages of a two-phase domain/focused query approach over simply combining the two into a single query submitted to a regular search service or to an authority-based search service. Our first example is the domain/focused query "Chicago Bulls"/"trades and transactions". We tried the concatenated query "Chicago Bulls, trades and transactions" (in free-text rather than Boolean form) against Excite, AltaVista and Clever. Both Excite's and Clever's results were dominated by results relevant to the "Chicago Bulls". For Clever this was to be expected, as the hubs and authorities model specifically states that it works best for 'broad' queries. In any case, Clever did indeed return good Bulls authorities, while Excite returned the official site of the Bulls only as the 5th candidate, with a fan's page ranked 3rd. AltaVista had stranger results. On the day we issued the query, the top 10 results included, as the 5th and 6th candidates, pages entitled "Chicago Bulls Transactions", but, typically enough, the 6th one was a broken link. The top 2 candidates referred to the Bulls, the 3rd one was a non-official "trade and free agent form" (from GeoCities), not for the Bulls specifically, and the remaining four referred to "J.A.C.K" (Journal of Associated Construction Knowledge, Publications for the trades), most of which were broken links, and which in any case had nothing to do with basketball. By comparison, the results obtained by Fetuccino, as shown in Fig. 5, were much more precise.

Thanks to the quality of the seeds, or domain pages, returned by Clever, Fetuccino could find in their vicinity precise information about trades in the right domain. We discovered that the two-phase approach is particularly useful when the focused query has a different meaning in other domains (as with the 'trades' focused query in the previous example), or takes on a different perspective, as in the following example. In Fig. 6, we see the results for the domain/focused query "Sun Corporation"/"McLaren deal". This map evidences very good results in a very short time span. The top candidate in the close neighborhood, entitled "Accelerated Success", has as subtitle (when the page is visited) "McLaren racing team picks up speed with Sun". The good quality of the results here is due to the fact that: (1) the top candidate returned by Clever was an excellent authority, namely the Sun home page; and (2) the McLaren deal information was in the close neighborhood of the authority.


Fig. 4. Results from Clever for the domain query “Chicago Bulls”.

Fig. 5. Fetuccino results for the two-phase query "Chicago Bulls" and "trades and transactions".

In such a simple case, a user skilled in Web searches could have found the same information manually, by going to the authority (which s/he should know) and making two hops. Fetuccino automated these natural steps for the user and added a convenient interface where, for instance, all relevant pages can be printed in one click.


Fig. 6. Fetuccino results for query “Sun Corporation/McLaren deal”.

However, the two-phase approach is not as successful when the authorities are not 'hubby' enough or the hubs are not authoritative enough. Indeed, when trying the two-phase query with domain "Intel Architecture" and focus "Merced", we did not get any good results on Merced from the second-stage focused search. When analyzing the reason for this failure, we found out (by searching the Intel site) that documents pertinent to the Merced architecture were on an Intel developer site that was not among the retrieved authorities. The main Intel authority (www.intel.com) did point to that developer site, but it would have required too many 'hops' for the dynamic crawling to find the relevant documents. In other words, when the authority seeds are too far from the relevant information, or the domain view is too large, the dynamic search loses its efficiency and relevant information is not found in a reasonable amount of time.

It is difficult to evaluate the effectiveness of our two-phase search precisely without adequate benchmarking, which is in itself a complicated issue for the Web (whereas in the IR community, the TREC approach gives a good measure for evaluating recall and precision). However, as a simple indication, we show in Table 1 some precision results for the 10 best candidates returned by a few Web search engines (namely Excite and InfoSeek, and more recent ones like Clever and Google), as well as by Fetuccino, over the same set of queries. Since these test cases were designed experimentally (rather than formally, over a statistically significant sample and following testing guideline procedures), they should be considered a positive indicator of the effectiveness of the algorithm rather than formal results. As mentioned before, our results depend on the quality of Clever's results. The first conclusion we drew from these experiments is that it might be worthwhile to examine and compare various authority engines so as to get the best possible seeds.

Table 1
Precision score over the 10 best candidates, per domain/focused query

Query (domain/focus)                                    InfoSeek  Excite  Clever a  Clever b  Google  Alfredo
Leonardo da Vinci/inventions                            0         0       0         0.1       0       0.3
General Electric/career and employment opportunities   0.1       0       0         0         0.2     0.2
Sun Corporation/McLaren deal                            0         0.2     0         0         0.1     0.2
Philips/medical systems                                 0.1       0.1     0.1       0.1       0.3     0.1

a Clever top 5 authorities and top 5 hubs.
b Clever top 10 authorities.
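The figures above are precision at 10: the fraction of the 10 best returned candidates that are relevant. A trivial helper pins down what the scores mean; the judgedRelevant set stands for manual relevance assessments, which is an assumption on our part.

```java
import java.util.*;

class PrecisionAt10 {
    /** Fraction of the top 10 returned documents judged relevant. */
    static double precisionAt10(List<String> ranked, Set<String> judgedRelevant) {
        return ranked.stream().limit(10)
                     .filter(judgedRelevant::contains).count() / 10.0;
    }
}
```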

4. Conclusion and future work

The accelerated growth of the Web might cause traditional, purely static search engines to become less accurate and less effective over time, even if their indexing, retrieval and storage techniques are likely to improve. Fetuccino addresses this shortcoming by augmenting static search services with text-based dynamic exploration around the vicinity of the search results, thereby discovering new relevant information and validating old information. The key to making effective use of the limited exploration time, which is typically bounded by the (im)patience of the interactive user, is the use of a text-based relevance check for controlling the exploration process itself, in addition to its use for scoring. On top of the basic dynamic search mechanism, we added a novel capability: a user-defined domain search, followed by a dynamic focused search within the domain. Although our current solution is only a crude approximation of truly domain-specific search, it nevertheless produces promising results that could improve advanced static methods. We believe that adding a dynamic search capability opens the door to many other future enhancements, of which domain-based search is only a starting point. For example, assuming that a standard interface for site search, like the one WAIS used to provide, were to reappear, the identification of the domain could be much improved by dispatching queries to various intra-site search engines, in a way similar to how MetaCrawler [19,20] dispatches queries to search services.

Acknowledgements

We thank Jon Kleinberg¹² and Ron Fagin for useful discussions on Hubs and Authorities (and Ron Pinter for putting us in contact). We are also grateful to Prabhakar Raghavan for letting us use the Clever system. Finally, we are indebted to Dirk Nicol for hosting Mapuccino and Fetuccino on the IBM Corporate Java Site. This research was done while Dan Pelleg was an extern student at the IBM Haifa Research Laboratory.

¹² Jon Kleinberg's papers are available at http://simon.cs.cornell.edu/home/kleinber/

References

[1] K. Bharat, A. Broder, M. Henzinger, P. Kumar and S. Venkatasubramanian, The connectivity server: fast access to linkage information on the Web, in: Proc. 7th Int. World Wide Web Conference, Brisbane, April 1998.
[2] S. Brin and L. Page, The anatomy of a large-scale hypertextual Web search engine, in: Proc. 7th Int. World Wide Web Conference, Brisbane, April 1998.
[3] J. Carriere and R. Kazman, WebQuery: searching and visualizing the Web through connectivity, in: Proc. 6th Int. World Wide Web Conference, Santa Clara, April 1997.
[4] S. Chakrabarti, B. Dom, D. Gibson, S.R. Kumar, P. Raghavan, S. Rajagopalan and A. Tomkins, Experiments in topic distillation, in: ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, Melbourne, Australia, 1998.
[5] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan and S. Rajagopalan, Automatic resource list compilation by analyzing hyperlink structure and associated text, in: Proc. 7th Int. World Wide Web Conference, Brisbane, April 1998.
[6] D.R. Cutting, D.R. Karger, J.O. Pedersen and J.W. Tukey, Scatter/gather: a cluster-based approach to browsing large document collections, in: Proc. 15th Annu. Int. SIGIR Conference, Copenhagen, Denmark, June 1992, pp. 318–329.
[7] P. De Bra, G.-J. Houben, Y. Kornatzky and R. Post, Information retrieval in distributed hypertexts, in: Proc. RIAO '94, Intelligent Multimedia, Information Retrieval Systems and Management, New York, NY, 1994.
[8] D. Gibson, J. Kleinberg and P. Raghavan, Inferring Web communities from link topology, in: Proc. 9th ACM Conf. on Hypertext and Hypermedia, 1998.
[9] M. Herscovici, M. Jacovi, Y.S. Maarek, D. Pelleg, M. Shtalhaim and S. Ur, The shark-search algorithm. An application: tailored Web site mapping, in: Proc. 7th Int. World Wide Web Conference, Brisbane, April 1998.
[10] J. Kleinberg, Authoritative sources in a hyperlinked environment, in: Proc. 9th ACM-SIAM Symp. on Discrete Algorithms, 1998.
[11] O. Liechti, M.J. Sifer and T. Ichikawa, Structured graph format: XML metadata for describing Web site structure, in: Proc. 7th Int. World Wide Web Conference, Brisbane, April 1998.
[12] Y.S. Maarek, R. Fagin and I.Z. Ben-Shaul, Efficient document clustering and its application on the Internet, in preparation.
[13] Y.S. Maarek, M. Jacovi, M. Shtalhaim, S. Ur, D. Zernik and I.Z. Ben-Shaul, WebCutter: a system for dynamic and tailorable site mapping, in: Proc. 6th Int. World Wide Web Conference, Santa Clara, April 1997.
[14] Y.S. Maarek and A.J. Wecker, The librarian's assistant: automatically assembling books into dynamic bookshelves, in: Proc. RIAO '94, New York, NY, 1994.
[15] U. Manber, M. Smith and B. Gopal, WebGlimpse: combining browsing and searching, in: Proc. Usenix Technical Conference, January 6–10, 1997; ftp://ftp.cs.arizona.edu/people/udi/webglimpse.ps.Z, http://glimpse.cs.arizona.edu/webglimpse/index.html
[16] M. Marchiori, The quest for correct information on the Web: hyper search engines, in: Proc. 6th Int. World Wide Web Conference, Santa Clara, April 1997.
[17] G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, Computer Series, McGraw-Hill, New York, 1983.
[18] Z. Smith, The truth about the Web: crawling towards eternity, Web Techniques Magazine 2 (5) (1997).
[19] E. Selberg and O. Etzioni, Multi-service search and comparison using the MetaCrawler, in: Proc. 4th Int. World Wide Web Conference, 1995.
[20] E. Selberg and O. Etzioni, The MetaCrawler architecture for resource aggregation on the Web, IEEE Expert 12 (1) (1997) 8–14.
[21] R. Weiss, B. Velez, M.A. Sheldon, C. Manprempre, P. Szilagyi, A. Duda and D.K. Gifford, HyPursuit: a hierarchical network search engine that exploits content-link hypertext clustering, in: Proc. 7th ACM Conf. on Hypertext, New York, 1996.
[22] O. Zamir and O. Etzioni, Web document clustering: a feasibility demonstration, in: Proc. 19th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1998, pp. 46–54.

Israel (Issy) Ben-Shaul is a faculty member in the Department of Electrical Engineering at the Technion, Israel Institute of Technology, and a consultant in the Information Retrieval and Organization Group at the IBM Haifa Research Lab. He received his BSc in Mathematics and Computer Science from Tel Aviv University in 1988, and his MS and PhD in Computer Science from Columbia University in 1991 and 1995, respectively. During 1995, before joining the Technion, Ben-Shaul was a research staff member at the IBM research laboratory in Haifa, where he worked on applications and extensions of clustering technology to the Internet. He leads the Distributed Systems Group and the associated software systems laboratory at the Technion. His research interests include distributed and mobile systems, software engineering, information retrieval, the Web, advanced transactions, workflow management systems and electronic commerce. He has published over 30 papers in refereed journals and conferences.

Michael Herscovici is a Research Staff Member at the IBM Haifa Research Lab in Haifa, Israel, and belongs to the Information Retrieval and Organization Group. His research interests include Internet applications and parsing techniques. Mr. Herscovici received his B.Sc. in Computer Science from the Technion, Israel Institute of Technology, Haifa, in 1998. He joined IBM in 1997 and has since worked on the dedicated robot component of Mapuccino, a Web site mapping tool.

Michal Jacovi is a Research Staff Member at the IBM Haifa Research Lab in Haifa, Israel, and belongs to the Information Retrieval and Organization Group. Her research interests include Internet applications, user interfaces, and visualization. She received her M.Sc. in Computer Science from the Technion, Haifa, Israel, in 1993. Ms. Jacovi joined IBM in 1993 and has worked on several projects involving user interfaces and object-oriented technology, some of which have been published in journals and conferences. Since the emergence of Java, she has been involved in the conception and implementation of Mapuccino, a Web site mapping tool, written in Java, that is being integrated into several IBM products.

Yoelle S. Maarek is a Research Staff Member at the IBM Haifa Research Lab in Haifa, Israel, where she manages the Information Retrieval and Organization Group, which has about 15 members. Her research interests include information retrieval, Internet applications, and software reuse. She graduated from the Ecole Nationale des Ponts et Chaussees, Paris, France, and received her D.E.A. (graduate degree) in Computer Science from Paris VI University in 1985. She received a Doctor of Science degree from the Technion, Haifa, Israel, in January 1989. Before joining IBM Israel, Dr. Maarek was a research staff member at the IBM T.J. Watson Research Center for about 5 years. She serves on the program committees of several international conferences and is a member of the Review Board of the WebNet Journal. She has published over 25 papers in refereed journals and conferences.

Dan Pelleg received his B.A. in 1995 and his M.Sc. in 1998 from the Department of Computer Science, Technion, Haifa, Israel. His Master's thesis topic was 'Phylogeny Approximation via Semidefinite Programming'. He is currently a PhD candidate in the Computer Science Department at Carnegie-Mellon University. His research interests include computational biology, combinatorial optimization and Web-based software agents. During the summers of 1997 and 1998, Dan worked as an extern student at the IBM Haifa Research Laboratory.

Menachem Shtalhaim is a Research Staff Member at the IBM Haifa Research Lab in Haifa, Israel, and belongs to the Information Retrieval and Organization Group. His research interests include Internet applications, communication protocols and heuristic algorithms. Mr. Shtalhaim joined IBM in 1993 and has worked on several projects involving morphological analysis tools, a network design and analysis tool (the IBM product NetDA/2), and the AS/400 logical file system layer. In the past, Mr. Shtalhaim worked on medical diagnostic systems. He is the author of the dedicated robot component of Mapuccino, a Web site mapping tool.

Vladimir Soroka is a Research Staff Member at the IBM Haifa Research Lab in Haifa, Israel, and belongs to the Information Retrieval and Organization Group. His research interests include Internet applications and information organization. Mr. Soroka received his B.Sc. in Computer Science from the Technion, Israel Institute of Technology, Haifa, Israel, in 1996. Before joining IBM, Mr. Soroka worked on Internet-based fax servers. He joined IBM in 1998 and has since worked on various applications such as Mapuccino, a Web site mapping tool.

Sigalit Ur is a Research Staff Member at the IBM Haifa Research Lab in Haifa, Israel, working on Mapuccino, a Web site mapping tool, written in Java, that is being integrated into several IBM products. She received a Master of Science degree in Intelligent Systems from the University of Pittsburgh in 1993. Before joining IBM, Ms. Ur was involved in projects in a wide variety of fields, including data processing, databases, cognitive science, multi-agent planning and image processing, some of which have been published in journals and conferences.
