A Distributed Weighted Centroid-based Indexing System

Miguel Rio, Joaquim Macedo, Vasco Freitas
(Author sponsored by PRAXIS XXI - BM/6644)

Abstract
This paper describes the WHERE system, an approach to a distributed indexing service for document search on the Internet based upon an architecture of centroids. Numerical data, produced with the aid of Information Retrieval techniques in the form of a weighting measure, are added to the Whois++ centroids, enabling ranked results to be delivered to clients, which use them not only for presenting an ordered response to the user's query but also for a more efficient interaction with the directory mesh. The underlying retrieval engine is based upon the vector space model. The system provides for a reduced search space in the distributed index mesh, as compared to that of Whois++, as allowed by the mesh traversal algorithm employed.

I. Introduction

As the Internet increases in size and popularity, networked information services (the World Wide Web, in particular) become the main information source for many of its users. However, the large amount of available information requires a set of efficient and effective information discovery tools. Information Retrieval (IR) is a discipline concerned with the selection of the relevant set of documents within a given collection for a user's particular information need. IR techniques have been developed and used for several decades in applications with limited scope. The IR research community is now faced with the interesting challenge of designing and implementing IR solutions for a very large, dynamic, distributed and heterogeneous information space: the Internet. The first approach that came to mind for the use of IR systems over the Internet was the adoption of the client/server model of communication. Due to the popularity of the WWW, HTTP has been the main document transfer protocol used. Information collection is done by WWW robots, which are clients that periodically detect Web changes using WWW hyperlink recursive processing algorithms. This is the solution adopted by the most popular centralized IR systems on the Internet, such as AltaVista [1], Lycos [2], and so forth. With the exponential growth of the Internet in terms of geography, connected hosts, number of users and therefore information available on-line, it has become evident that a good centralized solution to information search and retrieval on the Internet is a difficult goal to achieve [3]. In fact: available statistical data show that current Internet IR systems do not index the global Web; the best Internet search services are frequently overloaded; Web resource diversity (different document types) cannot be cached by a unique service, which must use very generic metadata; it is not possible to generate good index information automatically, for instance, phrase indexing based on local names and localities; most organizations, such as libraries, must employ manpower for the local creation of index information; some organizations, with access restrictions, only allow a controlled exploration of their index information; at the current scale of the Web, a centralized solution becomes prohibitively expensive.

An approach that obviates most of these problems is to distribute the IR service over the Internet, e.g., several servers would index and answer queries about partitions of the total information space. The partitioning criteria may be geographical, by subject, by document type, or other. However, several problems must be solved before this can happen. This paper proposes a distributed indexing service for document searching on the Internet.
II. Existing approaches

WAIS [4] is a very robust indexing service that uses the Z39.50 protocol [5]. It can index many types of documents (text, HTML, PostScript, etc.), including binary files. It also provides an excellent ranking system. All WAIS servers are supposed to be registered in a directory-of-servers to provide a global indexing service. This global system, however, did not prove to be satisfactory, since the user needs to select the servers to which to place his queries.

Harvest [6] is a package that retrieves document resources, automatically obtains their meta-information with ESSENCE [7] and indexes that information with GLIMPSE [8]. Harvest's big disadvantage is the absence of a ranking system for results, due to the underlying boolean model it uses.

Whois++ [9] started as an extension to the Whois [10] service, but as requirements became more ambitious, Whois++ got a life of its own. Whois++ tries to provide directory and indexing services based on the centroid technology. Centroids are 'summaries' of a server's information that are passed to indexing services, which in turn pass on their centroids to other indexing services, forming the directory mesh [11]. The system seems to be efficient and scalable, but no ranking system is provided.

CIP [12] is an extension to Whois++ and proposes that centroids should include a weight for each term, but it does not specify how that weight is calculated. CIP is a protocol intended to allow all kinds of information servers (X.500 [13], LDAP [14], Whois++) to produce centroids in a standard way, allowing different servers to participate in the mesh. The ideas contained in CIP are the starting point for the concept presented in this paper.
Fig 1: WHERE service architecture. (The figure shows the Resource Transponder gathering resources and producing URCs, the Template Manager, and the query server with its MINI-CIP daemon exchanging queries, answers and centroids.)
III. The WHERE architecture

The WHERE system uses a Resource Transponder [15] to gather the resources to be indexed and to produce their meta-information, which is kept valid by periodically checking whether those resources have changed. URCs [16] are then kept in a query server which answers queries and exports a centroid with the organization's information index. Figure 1 pictorially illustrates this architecture.

III.A. Production of URCs

The Resource Transponder uses a list of URLs to retrieve the documents to be indexed. Once they are gathered, a lexical analyzer parses the text to obtain the URC attributes. In order to identify the format the document is written in, it tries to capture tokens like <html>, to recognize an HTML page, or %!PS-Adobe-2.0, to recognize PostScript documents, and so on. If no special tokens appear, the parser identifies the document as plain text. If the recognized format contains markup that identifies the title, like the <title> and </title> tags in an HTML page, the title is stored in the URC. To uniquely identify the document, its summary is calculated using the MD5 algorithm and is also stored in the URC as the value of the summary attribute. The identification of the document's author is, in most cases, very difficult. The parser looks for lines with the word author and heuristically tries to obtain the author's name. Should this fail, it looks for an e-mail address. If some token resembling xpto@domain is found, it assumes that it is the author's e-mail and assigns it to the e-mail attribute. Should a directory service be available in the organization (an X.500 DSA, an LDAP server, a Whois++ server) which contains the organization's personnel entries, a query is issued asking for the name of the person with that e-mail address. If the query succeeds, it fills the author attribute in the URC with the result. If a directory service is not available, then a file containing (e-mail, author) pairs is used to map e-mails to authors.
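As a rough illustration of the parsing steps just described, the following C sketch detects the document format from characteristic tokens and computes the MD5 summary. It assumes OpenSSL's MD5() is available, and the helper names (detect_format, hex_digest) are illustrative rather than taken from the WHERE sources; the real analyzer is built with Lex (see section VII).

    #include <stdio.h>
    #include <string.h>
    #include <openssl/md5.h>

    /* Sketch of format detection and summary calculation. The token heuristics
     * are reduced to substring searches; names are illustrative. */
    static const char *detect_format(const char *text)
    {
        if (strstr(text, "<html") || strstr(text, "<HTML"))
            return "text/html";
        if (strstr(text, "%!PS-Adobe"))
            return "application/postscript";
        return "text/plain";            /* no special tokens: plain text */
    }

    static void hex_digest(const char *text, size_t len,
                           char out[2 * MD5_DIGEST_LENGTH + 1])
    {
        unsigned char md[MD5_DIGEST_LENGTH];

        MD5((const unsigned char *)text, len, md);
        for (int i = 0; i < MD5_DIGEST_LENGTH; i++)
            sprintf(out + 2 * i, "%02x", md[i]);
    }

    int main(void)
    {
        const char *doc = "<html><title>The Common Indexing Protocol</title></html>";
        char sum[2 * MD5_DIGEST_LENGTH + 1];

        hex_digest(doc, strlen(doc), sum);
        printf("Content-type: %s\nSummary: %s\nSummary-algorithm: md5\n",
               detect_format(doc), sum);
        return 0;
    }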
III.B. Calculation of keywords

The keywords attribute is harder to calculate. IR techniques were adapted to obtain the relevant keywords and their respective weights. The keywords attribute consists of a set of (term, weight) pairs, separated by commas, as shown in figure 2, which exemplifies a URC. The weight is calculated using a tf.idf algorithm as in INQUERY [17], SMART [18] and OKAPI [19].

Template-type: URC
Handle: UMINHOURC3
Title: The Common Indexing Protocol CIP
URL: ftp://ftp.uminho.pt/Docs/internetdrafts/draft-ietf-find-new-cip-00.txt
Last Modified: Wed Jul 15:38:13 WET 1996
Size: 31401
Summary: a2b34d5f36e7789bbcc56dffe4acb4dc
Summary-algorithm: md5
Content-type: text/html
E-mail: [email protected]
Keywords: cip(120.60), index(94.27), server(86.13), poll(51.29), referr(48.52), object(43.67), queri(40.20), domain(30.28), search(28.42), dsi(27.73), mesh(27.48), dataset(24.95), whoi(21.99), type(21.99), databas(20.10), inform(17.33), sent(16.64), internet(14.56), attribut(12.48), ar(24.26), centroid(20.15), base(13.86), uri(11.91), set(11.91), draft(11.91), rout(11.09), respect(11.09), onli(11.09), name(11.09), mai(11.09), ldap(11.09), chang(11.09), wa(10.39), ne(10.39), specif(10.07), format(10.07), tree(9.70), send(9.70), manag(9.70), distribut(9.70), requir(9.01), gener(9.01), form(9.01), becaus(9.01), pass(7.33), authent(7.33), thei(6.93), technologi(6.93), receiv(6.93), html(6.93), hostnam(6.93), futur(6.93), effici(6.93), avail(6.93), valu(6.41), directori(6.41), provid(6.23), origin(6.23), necessari(6.23), issu(6.23), implement(6.23), docum(6.23), allow(6.23), algorithm(9.70), answer(7.33), weboid(5.54)

Fig 2: URC example

A lexical analyzer parses each document, obtaining its terms and ignoring the most well-known words, which are listed in a stoplist [20]. Then stemming [21] is used to conflate morphologically similar terms. The weight, w, of a term is then appended to it. Its value is computed from the equation:

    w = K \cdot tf \cdot \log(N / n)

where:

K: a context constant
tf: term frequency in the document
n: number of documents the term appears in
N: number of documents in the collection

K is a context constant and is used to increase the weights of terms appearing in the title, in the first lines of the document, as uppercase words, and so on, as suggested in [22]. tf is used to increase the weights of terms that appear many times in the text. The last factor, log(N/n), is a measure of self-information in the sense of Information Theory, i.e., the amount of information associated with the occurrence, within a collection, of a document containing the term. This measure is used to decrease the weights of terms that appear in many documents and therefore have a smaller discriminating power.
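A minimal sketch of this weight computation in C, the system's implementation language; K, tf, n and N are supplied by the caller, and both the function name and the numbers in main() are illustrative rather than taken from the WHERE collection.

    #include <math.h>
    #include <stdio.h>

    /* w = K * tf * log(N / n): context constant, term frequency in the
     * document, and the collection-level self-information factor. */
    static double term_weight(double K, int tf, int n, int N)
    {
        if (n <= 0 || N <= 0)
            return 0.0;
        return K * tf * log((double)N / (double)n);
    }

    int main(void)
    {
        /* Illustrative numbers only: a term seen 30 times in a document,
         * occurring in 2 of 110 documents, with context constant 1.0. */
        printf("w = %.2f\n", term_weight(1.0, 30, 2, 110));
        return 0;
    }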
III.B.1. Grouping of physical resources

A document may, and often does, consist of more than one physical resource. The calculation of an attribute should therefore consider not just one part of the document but as many parts as possible. The administrator may indicate in the Resource Transponder file more than one URL for each document. If he wishes, he can use wildcards to indicate which links to follow.
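As an illustration only, such wildcard link selection could rely on POSIX fnmatch(); the pattern syntax and the URLs shown here are assumptions made for the example, since the paper does not specify the list-file format.

    #include <fnmatch.h>
    #include <stdio.h>

    /* Sketch: decide whether a retrieved link belongs to the logical document
     * whose Resource Transponder entry carries a wildcard pattern. */
    static int belongs_to_document(const char *pattern, const char *url)
    {
        return fnmatch(pattern, url, 0) == 0;
    }

    int main(void)
    {
        const char *pattern = "http://www.uminho.pt/report/*.html";

        printf("%d\n", belongs_to_document(pattern, "http://www.uminho.pt/report/ch2.html"));
        printf("%d\n", belongs_to_document(pattern, "http://www.uminho.pt/other/note.html"));
        return 0;
    }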
III.C. The Resource Transponder

As pointed out earlier on, documents may change frequently. The Resource Transponder plays an important role in keeping track of such changes. To accomplish this, it periodically issues HTTP GET requests carrying the If-Modified-Since condition for the resources in its database. Should a response be positive, the new URC is calculated and stored in the database. Then, a data-changed signal is issued to the query server.
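A sketch of this freshness check using libcurl (an assumption; the paper does not say which HTTP code the Transponder uses). CURLINFO_CONDITION_UNMET reports whether the server answered 304 Not Modified; recomputing the URC and signalling the query server are left out.

    #include <stdio.h>
    #include <time.h>
    #include <curl/curl.h>

    /* Ask whether `url` changed since `last_seen`, using the If-Modified-Since
     * condition. Returns 1 if modified, 0 if not, -1 on error. With the default
     * write callback any returned body simply goes to stdout. */
    static int changed_since(const char *url, time_t last_seen)
    {
        CURL *h = curl_easy_init();
        long unmet = 0;
        int rc = -1;

        if (!h)
            return -1;
        curl_easy_setopt(h, CURLOPT_URL, url);
        curl_easy_setopt(h, CURLOPT_TIMECONDITION, (long)CURL_TIMECOND_IFMODSINCE);
        curl_easy_setopt(h, CURLOPT_TIMEVALUE, (long)last_seen);
        if (curl_easy_perform(h) == CURLE_OK) {
            curl_easy_getinfo(h, CURLINFO_CONDITION_UNMET, &unmet);
            rc = unmet ? 0 : 1;   /* unmet condition => 304 Not Modified */
        }
        curl_easy_cleanup(h);
        return rc;
    }

    int main(void)
    {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        printf("changed: %d\n",
               changed_since("http://example.org/", time(NULL) - 24 * 3600));
        curl_global_cleanup();
        return 0;
    }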
III.D. Creation of meta-information

In the indexing systems analyzed in section II, two approaches can be distinguished: the one found in Whois++ and CIP, whereby all the information has to be created and updated manually, and that found in Harvest, where all meta-information is automatically produced. The first approach will very likely lead to a rapidly outdated database, and the work involved in producing and maintaining all the meta-information will discourage organizations from adopting such systems. On the other hand, services like AltaVista produce meta-information on every HTML page their robots find, without grouping them into logical documents as a Whois++ template administrator is supposed to do.
In WHERE a combined policy is used. The administrator first produces a list of URLs to be indexed (possibly grouped in clusters) and then calls the Resource Transponder. When new documents are created, their URLs must be appended to that list. To ease the administrator's task, WHERE accepts a Netscape bookmarks file as input, and therefore the browser may be used to produce the list file. All meta-information may be changed manually. This is useful when the author provides external knowledge about the resource (e.g., extra keywords).

III.E. Export of weighted centroids

WHERE answers queries via a MINI-CIP daemon which implements a subset of CIP. MINI-CIP implements a simple query-answer service without providing regular expressions, which are not relevant for what the system tries to implement. It calculates its own centroid and may poll (and be polled by) other servers to obtain their centroids. Besides the list of all terms for each attribute, as defined in [9], the centroid provides the following numerical data for each term: the maximum and the minimum weights over all the server's URCs and the number of documents in which the term occurs, as exemplified in figure 3. These data, as will be shown, turn out to be very useful in query routing and in the dissemination of information. When the server polls other servers, the maximum and the minimum weights of each keyword become, respectively, the largest and the smallest of all weights for that keyword in all centroids, and the number of documents it now occurs in is the sum of the numbers of documents it appears in in each centroid.

    Term      max     min    n
    cip       120.60  90.5   11
    circul    4.15    2.3    12
    citat     1.38    1.2    8
    cite      1.38    1.1    7
    clariti   1.38    1.1    6
    class     5.54    2.3    1
    classif   2.77    2.2    2
    clearer   1.38    1.0    3
    clearli   1.83    1.7    4
    click     2.77    1.2    34
    client    6.41    4.3    12
    clip      1.38    1.2    3
    close     2.07    1.2    4
    cmu       2.77    1.0    7
    code      4.15    3.2    6
    coher     1.38    1.2    5
    collat    2.77    1.3    4
    collect   11.09   8.7    3
    column    1.38    1.27   5
    com       9.70    4.7    7
    combin    1.38    2.4    8
    command   22.18   10.1   9
    comment   1.83    1.8    8

Fig 3: Weighted centroid extract
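A sketch of the merge rule just described, applied when a polled centroid carries an entry for a term the local centroid already holds. The struct layout and function names are illustrative and are not the CIP wire format; the example numbers are taken from the paper's figures.

    #include <stdio.h>

    /* Per-term centroid data as described above: the extreme weights over the
     * server's URCs and the number of documents containing the term. */
    struct centroid_entry {
        char   term[64];
        double max_weight;
        double min_weight;
        int    ndocs;
    };

    /* Fold a polled server's entry for the same term into the local one:
     * maximum of the maxima, minimum of the minima, sum of document counts. */
    static void merge_entry(struct centroid_entry *local,
                            const struct centroid_entry *polled)
    {
        if (polled->max_weight > local->max_weight)
            local->max_weight = polled->max_weight;
        if (polled->min_weight < local->min_weight)
            local->min_weight = polled->min_weight;
        local->ndocs += polled->ndocs;
    }

    int main(void)
    {
        struct centroid_entry mine   = { "cip", 120.60, 90.5, 11 };
        struct centroid_entry polled = { "cip",  98.60, 81.0,  3 };

        merge_entry(&mine, &polled);
        printf("%s max=%.2f min=%.2f n=%d\n",
               mine.term, mine.max_weight, mine.min_weight, mine.ndocs);
        return 0;
    }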
IV. The WHERE Client

The client's main task is to implement the query routing algorithm. When the user places a query, the client removes the most frequent words, applies stemming and passes the query on to the default server. It then executes the traversal algorithm and, in the end, presents the results to the user. WHERE uses a WWW interface due to its friendliness. Another advantage of such an interface is its capability of keeping information from previous queries in the same session.
V. Query Routing

In the query routing system of WHERE [23], indexing servers act like routers in a WAN and information servers act as hosts. A query issued by a WHERE client is propagated to a small subset of servers in the net, depending on whether the keyword(s) are present or not in the centroid. As is usually done by centralized servers, only a small number of resources at a time are presented to the user. Since the WHERE centroids carry numerical information for each server, the client may search for the n best resources without having to query all servers in which the keyword matches. This technique leads to a reduction in wasted time and bandwidth per query. Three important steps need to be followed by the client: to find which servers have the best documents; to retrieve the best documents from each server; and to merge the results and order them by rank.
Since a server may be both an information server and an indexing server, the first two steps must be performed simultaneously. Each server replies with its templates and with which servers to ask next. It must keep templates although the client may decide not to show them to the user since
better responses are available. The pseudo-code for the client's algorithm is:

    n <- number-of-resources-wanted
    list-servers <- {} ; list-resources <- {} ; servers-visited <- {}
    insert(initial-server, list-servers)
    while (list-servers != {}) do
        server <- elem(list-servers)
        if (server not in servers-visited) do
            query(server, n)
            foreach (resource returned) do
                insert(resource, list-resources)
            done
            foreach (server-to-ask returned) do
                if (elems(list-servers) < n) then
                    insert(server-to-ask, list-servers)
                else if (server-to-ask.weight > min(list-servers.weight)
                         AND n_resources(server-to-ask.max) < n) do
                    remove(min(list-servers), list-servers)
                    insert(server-to-ask, list-servers)
                done
            done
            remove(server, list-servers)
            insert(server, servers-visited)
        done
    done
    list-resources <- order(list-resources)
    present-x-results(n, list-resources)
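The admission test in the inner loop can be read as the small C predicate below; the names are illustrative, and the surrounding bookkeeping (the lists and the n_resources function) is described below.

    #include <stdio.h>

    /* A returned server enters list-servers only if its maximum weight for the
     * term beats the weakest candidate already held, and fewer than n resources
     * better than that maximum have already been retrieved. */
    static int admit_server(double candidate_max, double weakest_held_max,
                            int resources_better_than_candidate, int n)
    {
        return candidate_max > weakest_held_max &&
               resources_better_than_candidate < n;
    }

    int main(void)
    {
        /* With n = 4, a candidate whose best document weighs 80.2 is admitted
         * against a weakest held candidate of 50.4 when only 3 better
         * resources have been retrieved so far. */
        printf("%d\n", admit_server(80.2, 50.4, 3, 4));
        return 0;
    }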
Fig 4: Query routing. (The figure shows a client and a mesh of query routers and information hosts; a query is forwarded only to the routers and hosts whose centroids contain the query term.)
The n_resources(i) function returns the number of resources, with a weight greater than i, that are already being retrieved. The lists list-servers, list-resources and servers-visited contain, respectively, the servers to be queried, the resources retrieved and the servers that have already been queried. Functions elem, elems, insert and remove are list-manipulating functions. Figure 4 represents a visualization of the technique.

In figure 5 a small indexing mesh is presented as an example to illustrate the behaviour of a client. Servers are represented by boxes. Those that contain the keyword the user is asking for also contain the data for that term in the centroid (maximum and minimum weights and number of documents). A user wishes to see the four best documents related to some keyword (this is a very small number, but the mesh is very small too). The client begins with a query to the default server, A, which returns pointers to servers B, C, D, E and F. Since only four documents are required, server B will not be queried, since the best four documents reside in servers C, D, E and F. It then queries servers D, E and F. When server E is queried it returns server L with a weight (80.2) which is larger than that of server C. Server C is then removed from the servers-to-ask list. But as server I contains three documents with weights between (98.6) and (81.0), servers L and M are eliminated from the list, since it can be concluded that the four best documents lie in the sub-trees below I and J. The search engine is based upon the vector space model [24], with the document space filtered using other meta-information attributes. Since a homogeneous system is used, the merging process is straightforward. It can be assumed that, although the number of resources (and information servers) may grow exponentially, the number of resources required by a user does not grow significantly. So, the sub-graph of nodes visited by the client increases much less than the number of servers containing a document with that keyword. Therefore this represents a better mesh traversal algorithm than that in Whois++.
Fig 5: WHERE and Whois++ traversal: WHERE optimizes the search space. (The figure shows the example indexing mesh: the client, the default server A, servers B to U linked by polling relations, and, for the servers holding the query term, that term's centroid data, maximum weight, minimum weight and number of documents. The WHERE traversal visits only a subset of the servers reached by the Whois++ traversal.)
VI. Dissemination of information
An interesting feature of the WHERE system is that it takes advantage of the information in the centroid to calculate the weights of keywords in the server. As pointed out in section III.B., a tf.idf algorithm is used. But if the documents in a server are not very different, merging their weights with other servers' replies may not be a sound approach. Imagine a Computer Science Department producing templates for all documents in its server. The term computer is likely to appear in most of the documents, thus reducing the weight for that term. Should the English Department host a document with the term computer, this document will certainly have a larger weight than those of the CS Department, which is unreasonable. This is the reason why WHERE sees a collection as being the total number of documents above the server's sub-graph. A model for the dissemination of collection-wide information in a distributed information retrieval system has already been discussed in [25]. In the WHERE system this dissemination is carried by the centroids. The formula actually used to compute the weight of a term is:

    w = K \cdot tf \cdot \log\left(\frac{\sum_{i=1}^{m} N_i + N_s}{\sum_{i=1}^{m} n_i + n_s}\right)

where:

K: the context constant of section III.B
tf: term frequency in the document
n_i: number of documents the term appears in at polled server i
N_i: number of documents in polled server i
n_s: number of the server's own documents the term appears in
N_s: number of documents in the server itself
m: number of polled servers
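A sketch of this adjusted weight with the polled servers' statistics passed in as arrays; in the real system these values come from the stored centroids, and the names and the numbers in main() are illustrative.

    #include <math.h>
    #include <stdio.h>

    /* w = K * tf * log((sum_i N_i + N_s) / (sum_i n_i + n_s)), with the
     * polled servers' collection sizes and document frequencies as arrays. */
    static double disseminated_weight(double K, int tf,
                                      const int *Ni, const int *ni, int m,
                                      int Ns, int ns)
    {
        long N = Ns, n = ns;

        for (int i = 0; i < m; i++) {
            N += Ni[i];
            n += ni[i];
        }
        if (n <= 0)
            return 0.0;
        return K * tf * log((double)N / (double)n);
    }

    int main(void)
    {
        /* Illustrative numbers: two polled servers widen the collection, so a
         * term that is common locally is still weighted against 760 documents. */
        int Ni[] = { 400, 250 }, ni[] = { 3, 1 };

        printf("w = %.2f\n", disseminated_weight(1.0, 12, Ni, ni, 2, 110, 40));
        return 0;
    }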
VII. Implementation Notes

Almost all modules of the WHERE system, that is, the Resource Transponder, the MINI-CIP daemon and the WHERE client, are written in C. Lex was used to build the lexical analyzer that parses the resources. The Template Manager was written in Java. The inverted files [26] used in the system database were implemented using hashing and balanced binary trees.
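As a sketch of the shape such an inverted file could take, the structures below pair a hashed term dictionary with per-term posting lists. The actual WHERE database also uses balanced binary trees, which are omitted here, and all names are illustrative.

    #include <stdio.h>

    /* A linked posting list stands in for the balanced binary tree. */
    struct posting {
        unsigned doc_id;          /* document (URC) identifier */
        double   weight;          /* the term's weight in that document */
        struct posting *next;
    };

    struct term_entry {
        char term[64];
        int ndocs;                /* number of documents containing the term */
        struct posting *postings;
        struct term_entry *next;  /* chaining within one hash bucket */
    };

    #define NBUCKETS 4096
    struct inverted_file {
        struct term_entry *buckets[NBUCKETS];
    };

    /* Simple string hash (djb2) for the term dictionary. */
    static unsigned long hash_term(const char *s)
    {
        unsigned long h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    int main(void)
    {
        printf("bucket(\"centroid\") = %lu\n", hash_term("centroid"));
        return 0;
    }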
VIII. Evaluation
WHERE's performance has been evaluated against Whois++ using a subset of the TREC [27] Wall Street Journal (WSJ) 1992 document collection and a number of related queries. Because WHERE claims a higher efficiency of interaction with the information mesh, the numbers of servers visited by both systems' query routing algorithms were compared. The same set of queries was used in both cases. In this experiment, the following parameters may take on different values: the number of servers in the search space, the number of terms in the query, and the number of items in the result set. 5000 WSJ documents were randomly split among 10, 20, 30, 40 and 50 servers. From q100 to q120 of the TREC WSJ collection, 10 queries with 1, 2, 3, 4, 5 and 6 terms were manually derived, and the algorithm was run with the number of response items set to 5, 10, 20, 30, 40 and 50. Results for all queries in each scenario were collected and their mean value derived. In order to understand which test scenarios are the more meaningful, we examined one year's log files of our local proxy server and concluded that 90% of queries have 3 terms or less and that a typical user ends up reading only the first response page, with 10 items. Therefore, we only considered 1, 2 and 3 term queries and 5 and 10 result items. For each combination, the fraction of servers visited against the total number of servers was recorded. Figures 6 to 11 plot these results for both the WHERE and Whois++ systems.
Fig 6: Visited vs total servers: 1 term, 5 items
Fig 7: Visited vs total servers: 1 term, 10 items
Fig 8: Visited vs total servers: 2 terms, 5 items
Fig 9: Visited vs total servers: 2 terms, 10 items
Fig 10: Visited vs total servers: 3 terms, 5 items
Fig 11: Visited vs total servers: 3 terms, 10 items
Fig 12: Visited vs items: 2 terms, 50 servers
Fig 13: Visited vs items: 3 terms, 50 servers
Fig 14: Visited vs terms: 10 items, 50 servers
Even using our pessimistic approach in the algorithm, a significant reduction in the number of servers visited is achieved. The tendencies of the number of servers visited against the number of items returned are shown in figures 12 and 13, and figure 14 shows the number of servers visited against the number of terms in the query. The most likely explanation for the significant increase in the number of servers visited as a function of the number of document items returned is the small number of documents used in the experiment as compared to the number of documents returned. Also, because the correlation of terms in the same document is not known (it is not modelled), a fast increase in the number of servers visited as a function of the number of terms was an expected result.

A second criterion of evaluation concerned the effectiveness of the items returned by the WHERE system in terms of their relevance to the query. The goal was to verify whether the document sets returned by WHERE were the best-ranked documents Whois++ would return. Although this conclusion could be derived from the pessimistic approach of the client mesh traversal algorithm, an analysis of the experiments performed showed that this was indeed the case.
IX. Conclusions
Taking as its starting point the Whois++ centroid technology and Information Retrieval techniques, the WHERE system implements a distributed indexing service providing ranked results to user queries. Ranking of resources is performed by a term weighting scheme which, in the context of a collection, represents the term's information content (in the sense of Information Theory) and a measure of a document's relevance. The search space in the mesh of the distributed index is reduced, as the search takes the nodes' weights into account, leading to a reduced number of nodes being visited. A better mesh traversal algorithm than that in Whois++ has been achieved. Some tuning still needs to be done regarding weight calculation, namely the values of the context constant, K, and weight normalization. An evaluation of WHERE against Whois++ confirms the expected increase in efficiency of WHERE as regards the number of servers visited as a function of the total number of servers, for the most common numbers of terms per query and of items in the returned set. Should an information mesh with a larger number of more homogeneous servers be used, e.g., with a more realistic document allocation than a random distribution, it is felt that better results would be achieved.
Acknowledgment
TREC data was used in the evaluation of the WHERE system when one of the authors was visiting the Multimedia and Information Retrieval Group in the School of Computer Applications, Dublin City University, Ireland.
X. References

[1] Frequently Asked Questions about AltaVista. 1996. URL=http://altavista.digital.com.
[2] Inside Lycos. 1996. URL=http://www.lycos.com/info/index.html.
[3] Chris Weider. The Future of Search on the Internet. In Proceedings of INET'96 - Annual Meeting of the Internet Society, Montreal, Canada, June 1996.
[4] B. J. Kunze, H. Morris, M. J. Fullton, K. J. Goldman, and F. Schiettecatte. RFC 1625 - WAIS over Z39.50-1988. June 1994. URL=ftp://ftp.funet.fi/rfc/rfc1625.Z.
[5] John A. Kunze. Basic Z39.50 Server Concepts and Creation. August 1995.
[6] Darren R. Hardy, Michael F. Schwartz, and Duane Wessels. Harvest User's Manual. Technical Report CU-CS-743-94, University of Colorado at Boulder, January 1996.
[7] Darren R. Hardy and Michael F. Schwartz. Essence: A Resource Discovery System Based on Semantic File Indexing. In Winter USENIX, San Diego, CA, January 1993.
[8] Udi Manber and Sun Wu. GLIMPSE: A Tool to Search Through Entire File Systems. Technical Report TR 93-94, University of Arizona, October 1993.
[9] P. Deutsch, R. Schoultz, P. Faltstrom, and C. Weider. RFC 1835 - Architecture of the WHOIS++ Service. August 1995. URL=ftp://funet.fi/rfc/rfc1835.Z.
[10] K. Harrenstien, E. Feinler, and M. Stahl. RFC 954 - NICNAME/WHOIS. October 1985.
[11] P. Faltstrom, R. Schoultz, and C. Weider. RFC 1914 - How to Interact with the Whois++ Mesh. February 1996. URL=ftp://funet.fi/rfc/rfc1914.Z.
[12] Jeff Allen and Patrik Faltstrom. The Common Indexing Protocol. Internet draft (work in progress), February 1996. URL=ftp://funet.fi/internet-drafts/.
[13] ITU-T. Information Technology - Open Systems Interconnection - The Directory. Recommendations X.500-X.521, ISO/IEC 9594/1-9. ITU - International Telecommunication Union.
[14] W. Yeong, T. Howes, and S. Kille. RFC 1777 - Lightweight Directory Access Protocol. March 1995.
[15] Chris Weider. RFC 1728 - Resource Transponders. December 1994.
[16] Ron Daniel Jr. and Michael Mealling. URC Scenarios and Requirements. Internet draft (work in progress), March 1995.
[17] James P. Callan, W. Bruce Croft, and Stephen M. Harding. The INQUERY Retrieval System. In Proceedings of the 3rd International Conference on Database and Expert Systems Applications, pages 78-83.
[18] Chris Buckley. Implementation of the SMART Information Retrieval System. Technical Report 85-686, Cornell University, Ithaca, New York, May 1985.
[19] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Proceedings of TREC-3, 3rd Text REtrieval Conference, National Institute of Standards and Technology, Gaithersburg, MD, November 1994.
[20] Christopher Fox. Lexical Analysis and Stoplists. In Information Retrieval - Data Structures and Algorithms, chapter 7, pages 102-130. Prentice-Hall, 1992.
[21] W. B. Frakes. Stemming Algorithms. In Information Retrieval - Data Structures and Algorithms, chapter 8, pages 131-159. Prentice-Hall, 1992.
[22] Donna Harman. Ranking Algorithms. In Information Retrieval - Data Structures and Algorithms, chapter 14, pages 363-391. Prentice-Hall, 1992.
[23] Mark A. Sheldon, Andrzej Duda, Ron Weiss, and David K. Gifford. Discover: A Resource Discovery System based on Content Routing. In Proceedings of the Third International World Wide Web Conference, Elsevier, North Holland. To appear in a special issue of Computer Networks and ISDN Systems, April 1995. URL=http://www-psrg.lcs.mit.edu/www95/discover.html.
[24] Gerard Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York, 1983.
[25] Charles L. Viles and James C. French. Dissemination of Collection Wide Information in a Distributed Information Retrieval System. In Proc. of the 18th ACM/SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, July 1995.
[26] Donna Harman, Edward Fox, R. Baeza-Yates, and W. Lee. Inverted Files. In Information Retrieval - Data Structures and Algorithms, chapter 3, pages 28-43. Prentice-Hall, 1992.
[27] Donna Harman. Overview of the First Text REtrieval Conference. In Proc. of the 16th ACM/SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, PA, USA, July 1993.
Author Information

Miguel Rio graduated in Systems and Informatics Engineering at the University of Minho, Portugal, in 1995 and is currently working towards his Master's degree in Informatics. He joined the Computer Communications group of the Department of Informatics in 1995 and has since collaborated in several projects in protocol development for networked information and directory services.

Joaquim Macedo is a Lecturer in Computer Communications at the University of Minho, Braga, Portugal. He graduated in Electronics and Telecommunications Engineering in 1983 at the University of Agostinho Neto, Angola, and received his Master's degree from the University of Minho in 1993. From 1990 until 1994 he participated in several technical working groups for the establishment of the Portuguese National R&D Network, the RCCN, and was particularly active in the development of X.500 Directory Services. He is currently doing research for a Doctoral degree, where his interests concern networked information and directory services and protocols.

Vasco Freitas is Associate Professor of Computer Communications at the University of Minho, Portugal. He graduated in Electronic and Telecommunications Engineering in 1972 at the University of Lourenco Marques and received his M.Sc. and Ph.D. degrees from the University of Manchester (UK) in 1977 and 1980 respectively. From 1989 until 1994, he was in charge of the establishment and management of the Portuguese R&D Network (RCCN).