The WebCluster Project. Using Clustering for Mediating ... - CiteSeerX

5 downloads 16384 Views 102KB Size Report
domain of interest, and avoiding the bulk of non- ... by using one the many available search engines (such as ... collection speci c to their domain of interest. 2.
The WebCluster Project. Using Clustering for Mediating Access to the World Wide Web. Mourad Mechkour The Robert Gordon University-SCMS Aberdeen, AB25 1HG, UK www.scms.rgu.ac.uk/staff/mrm/

David J. Harper The Robert Gordon University-SCMS Aberdeen, AB25 1HG, UK www.scms.rgu.ac.uk/staff/djh/

Gheorghe Muresan The Robert Gordon University-SCMS Aberdeen, AB25 1HG, UK www.scms.rgu.ac.uk/staff/gm/

1 Introduction

We believe in the need for some sort of intelligent assistance to lter the information available on the WWW When it comes to information seeking on the World Wide and reduce the scope of view to specialized subsets which Web (WWW), we can identify two main methods of per- were selected according to the area of interest of each parforming this task depending on many factors. ticular user or group of users. We believe that document clustering is a good method 1. Seeking very specialized information related to a for extracting the inherent structure of a large document domain of interest, and avoiding the bulk of non- collection, and organizing the documents into classes of relevant information available on the WWW. Here closely related documents. one wants to see the WWW from a very narrow perIn the WebCluster project, we propose using docuspective limiting the scope of view to the relevant ment clustering to structure a collection of documents information. (the source collection), which is representative of a doof interest of a particular user population. We 2. Seeking information with serendipity, learning about main then propose mediating access to the WWW (the target things one did not know he/she is interested, or collection) through clustering constructed over the nding information by \stumbling" into it by chance. source collection, by the using a combination of cluster-based retrieval strategies together with WWW search engines The rst way is achieved on the World Wide Web (see Figure 1). by using one the many available search engines (such as Altavista, Lycos, or InfoSeek), and the second way is mainly achieved by browsing the WWW, following pre- 2 Using clustering for mediating access to de ned links between documents. Most of the people the WWW tend to move between these two modes depending on the task to perform. Most of the clustering methods cannot be practically apNowadays, most users are facing diculties using the plied to clustering huge document collections (the whole existing tools to achieve satisfaction in their information WWW, for example). The most popular clustering algoseeking task. We think that the two main problems in- rithms (Hierarchical Agglomerative Clustering Method, herent to the WWW, which are behind these problems, see [3]) have time and space complexity which are O(n2 are : or worse. On top of the classical advantages o ered by the mediated access to the WWW and document clus1. The World Wide Web is a heterogeneous document tering, their combination o ers solutions to some of their collection; however users, most of the time, are inherent problems : interested in searching a homogeneous small subcollection speci c to their domain of interest.  by limiting the clustering to user domain speci c document collections (of a manageable size) we can 2. Browsing is done by following links introduced by have an ecient clustering process. Thus, we can document writers, to structure a hyper-document a ord to provide clusterings of di erent source color for reference purposes, and seldom by following lections, each of which is tailored to a particular possible semantic links between documents. user domain, and where we reveal the semantic structure of that domain, and thence the WWW. Permission to make digital/hard copy of all or part of this work  clustering the source collection will produce a strucfor personal or classroom use is granted without fee provided that ture for the domain covered by the collection, and copies are not made or distributed for pro t or commercial advantage, the copyright notice, the title of the publication and its this can be used to lter the target collection, and date appear, and notice is given that copying is by permission of allow a narrow view of it. ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior speci c permission and/or Clearly, a source collection should be representative c 1998 ACM 1-58113-015-5 fee. SIGIR'98, Melbourne, Australia 8/98 $5.00. of the domain of interest of the intended user population, and it could be derived (possibly by sampling) from

authoritative collections of document such as those produced by abstracting services. Alternatively, it could be derived from documents on an Intranet, or from existing collections of WWW documents. Target WWW = Collection

Web Cluster

Source Collection

WWW Search Engine

Figure 1: Mediating access to the World Wide Web. In WebCluster, the mediated search in a document collection is a two stage process : 1. Browsing/querying of the clustered source collection to gain knowledge about its structure, about how documents are described in it, and to consult some relevant documents, to re ne the user's need de nition to produce the best possible query. 2. Querying the target collection by issuing the query built in stage (1) to the di erent search engines that o er access to di erent WWW sub-collections.

3.2 The end user interface

This interface is used by end users to access the CF services. These services are : 1. Clustering document collections. The user can choose the clustering method to use and its parameters. This part of the interface allows the users to conduct multiple clustering for the same document collection and display the di erent results for comparison purposes. 2. Querying and Browsing document collections. Here we implement di erent scenarios for searching a multi-collection information space, by using di erent search strategies : cluster based search, best match search, mediated access, and a combined search.

4 Conclusion

The rst phase of the project is completed. We have developed the clustering framework, and we are designing the end-user interface to be used in the evaluation stage of the project. Evaluation of the project is to be done by conducting a set of experiments to show the validity of a set of working hypothesis : 1. The structure produced by clustering a source col2.1 Advantages of this approach lection can be transposed to the target collection. 1. It allows a semantic structuring and a cluster based 2. The cluster based mediated search on the WWW search of the WWW at low cost. is more e ective than the classical retrieval search 2. Allows di erent domain speci c views on the WWW. alone. 3. It allows a precision-oriented search compared to Acknowledgements We thank T. Bratvold, J. Mythe general WWW search engines available. llymaki, M. L. Barja, G. Sonnenberger, and H-P. Frei for 4. The user can be highly guided during the process their valuable comments. of query formulation, a process which is reportedly This project is supported by Ubilab of the Union very dicult for many users of WWW search en- Bank of Switzerland. gines. 5. The user has complete control over the evolution References of the source document collection (adding/deleting of documents), and hence its re-indexing and re- [1] Hearst M. A. and Pederson J. O. Reexamining the clustering. cluster hypothesis : Scatter/gather on retrieval. In 19th, ACM SIGIR Conference, pages 76{84, August 18-22 1996. 3 The tool : WebCluster R. C., Riehle D., and Buschmann F. Pattern In this project, we aim to conduct a series of experiments [2] Martin Languages of Program Design 3. Addisson-Wesley, to obtain evidence about the improvement in the e ec1997. tiveness in the retrieval process introduced by combining cluster-based and retrospective searches in mediating ac- [3] Rasmussen E. Clustering algorithms. In William B. cess to the WWW. In these experiments, we will need Frakes and Ricardo Baeza-Yaltes, editors, Informato choose the most appropriate parameters for clustertion Retrieval. Data Structures and Algorithms. Prening a source document collection (such as the clustering tice Hall, Englewood Cli s, New Jersey, 1992. method or the similarity measure), the best search strate- [4] Parsons J. and Wand Y. Choosing classes in congies, and the best user interface for combining di erent ceptual modeling. Communications of the ACM, searches and visualizing search results. To assist in these 40(6):63{69, 1997. experiments we are developing two main tools : the clustering framework and the end user interface. [5] Van Rijsbergen C. J. Information Retrieval, 2nd edition. Buttersworth, London, 1979. 3.1 The clustering framework [6] Jain A. K. and Dubes R. C. Algorithms for Clustering Data. Prentice Hall, 1988. The Clustering Framework (CF) is an integrated tool that can be used to cluster di erent document collec- [7] A. Leouski and W.B. Croft. An evaluation of techtions and to search the produced clusters using di erent niques for clustering search results. Technical Report search strategies. This tool can be used to experiment IR-76, University of Massachusetts at Amherst, 1996. with a wide range of clustering methods.

Suggest Documents