Open Architecture for Distributed Search Systems⋆
Mikhail Bessonov¹, Udo Heuser², Igor Nekrestyanov³, and Ahmed Patel¹
¹ University College Dublin, Ireland; [email protected]
² Tuebingen University, Germany; [email protected]
³ St. Petersburg State University, Russia; [email protected]
Abstract. This paper describes an open architecture for distributed Internet search engines and the experience gained from implementing a conforming prototype. The architecture enables competing collaboration of independent information retrieval service providers. It allows integration of multiple cheap private servers into a powerful distributed system, which still guards the independence and commercial interests of every player. Special emphasis was placed on demonstrating the ability of the architecture to make effective use of the latest advances in information retrieval technology. The prototype implementation has proved the feasibility of the approach. It has also exposed a wide range of optimisations desirable at the component level. The source code of the prototype is publicly available.
1 Introduction
Given the quantity of information and users on the Internet, centralised global search engines, such as AltaVista, Excite, Lycos, and HotBot, require huge single-site network and hardware resources to handle index construction and query processing loads. The costs associated with establishing a new competitive service are prohibitive for all but the largest organisations [10]. The technical constraints imposed by limited single-site resources also result in a low quality of service provided to the user. Because of the high query processing load, only computationally inexpensive processing, such as basic keyword searching, can be performed. Due to the quantity and dynamic nature of information on the Internet, comprehensive index coverage can only be achieved at the expense of many outdated index records, if at all. These problems can only grow as the quantity of information, which includes large multimedia data objects, and the number of users on the Internet increase. Consequently, massively scalable search engine design is a key challenge for Internet-based Information Retrieval research [2]. From this perspective, distributed searching looks like a very promising approach. Search engine design must be both technically and economically scalable if it is to meet this challenge.
⋆ This work was undertaken as a part of the OASIS project (INCO Copernicus PL96 1116) funded by the Commission of European Communities.
Technically scalable design requires partitioning of index construction and query processing loads across widely distributed servers. By partitioning the index construction load, each server can construct and maintain a comprehensive and up-to-date index of a subset of the information available on the Internet. By partitioning the query processing load, each server can perform advanced and computationally expensive query processing on that index. Economically scalable design requires individual proprietorship of servers in the system, with co-operation between servers for mutual benefit. This co-operation can occur in the form of user query and URL propagation. Query propagation allows search engines to process queries submitted to the system as a whole, thus increasing the potential user population for each server. Similarly, URL propagation provides each index robot with access to URLs discovered by the system as a whole, thus increasing the potential quality of each index. Given the individual proprietorship of servers in the system, co-operation through URL propagation is explicitly required to be self-interested. That is, each robot propagates only unwanted URLs; otherwise a disproportionate distribution of costs and benefits across servers may develop. In terms of algorithm design there are essentially three requirements: efficient index construction and maintenance, precise query propagation, and self-interested URL propagation.
A considerable amount of work has been done by other researchers in the area. A number of architectures, such as Harvest [3], the WAIS network [5], distributed inference networks [4], HyPursuit [18], and DESIRE [1], were developed to address these issues. The size of the paper does not allow us to present related work in the necessary detail.
The rest of the paper is organised as follows: Section 2 describes the architecture of the distributed OASIS system, Section 3 elaborates on techniques developed to make effective use of the architecture.
Section 4 gives a brief overview of the prototype implementation and demonstration scenarios. The main conclusions of the paper and directions for future research are summarised in Section 5.
2 OASIS Architecture
The OASIS service is a distributed system of Internet search engines. The system provides search services for plain text and HTML documents stored on publicly accessible HTTP and FTP servers on the Internet. A collaboration diagram illustrating the interaction of several OASIS servers to serve a user query is presented in Fig. 1. A user can contact an OASIS server with a search request. The request is then processed locally by the OASIS server and/or automatically propagated to other servers in the system. If the request is propagated, the OASIS server that received the user request acts as a client of other OASIS servers. This happens transparently to the initiator of the search. Before returning the result data set to the search initiator, the OASIS server that initiated the request propagation eliminates duplicate records and sorts the results by their relevance index.
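The final collation step described above (eliminate duplicate records, then sort by relevance) can be pictured with a minimal Python sketch. It assumes that each server returns `(url, relevance)` pairs and that relevance values from different servers can be compared directly, which Sect. 3.2 notes is not true in general; the function name `collate` and the tie-breaking rule are illustrative choices, not part of the OASIS protocol.

```python
from typing import Iterable


def collate(result_sets: Iterable[list[tuple[str, float]]],
            limit: int) -> list[tuple[str, float]]:
    """Merge per-server (url, relevance) lists into one ranked list."""
    best: dict[str, float] = {}
    for results in result_sets:
        for url, relevance in results:
            # A URL returned by several servers is kept once,
            # with the highest relevance reported for it.
            if url not in best or relevance > best[url]:
                best[url] = relevance
    # Sort by descending relevance; ties are broken by URL for determinism.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

For instance, `collate([[("http://a", 0.5), ("http://b", 0.9)], [("http://a", 0.7)]], limit=10)` keeps a single entry for `http://a` with relevance 0.7 and places it after `http://b`.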
[Figure 1: diagram not reproduced. Participants: User, OASIS servers, Distributed Directory Service, WWW servers, FTP server. Protocols shown: HTTP/HTML, LDAP, CORBA IIOP, FTP. Numbered messages: 1 User query; 2 Important terms from user query; 3 The list of relevant OASIS servers; 4, 5 Propagated query; 6, 7 Local collection search; 8, 9 Local search results; 10 Collated result set (URLs); 11, 12 HTML document; 13 Text file.]
Fig. 1. Ordinary query processing collaboration diagram
The OASIS system is optimised for processing poorly specified search criteria. It pays special attention to user ranking of results and the use of this relevance feedback for the improvement of search result accuracy. After receiving the first references in a result data set, the user may respond by stating which of these references are relevant and which are not. From the user’s point of view this is an easy and intuitive way to refine a query and hence improve subsequent results. This relevance feedback is propagated to all OASIS servers which have taken part in the distributed search, and can be processed by those servers to improve subsequent user query processing. After receiving the results (in the form of URLs) the user may access the referenced HTTP or FTP content-server directly and obtain the full document. Every OASIS server keeps a local index of a relatively small portion of the documents available on the Internet. As most user queries are subject-oriented, subject-specific indexes are a requirement for scalable distributed query processing. Consequently, each OASIS server specialises in a particular topic, or set of topics, chosen by the server’s administrator. OASIS servers are not required to be mutually exclusive in terms of the topic areas they cover. It is possible to have more than one server that covers or overlaps the same topic area. This facilitates individual, and possibly competing, proprietorship of OASIS servers. The activity diagram for processing of conventional user query is presented in Fig. 2. A properly registered user has an account associated with her user name. The balance of the account can be incremented by the OASIS server administrator, e.g., in the event of payment by the user. Payments take place outside the OASIS system, but the OASIS system keeps track of the user account balance and decrements it every time the user is charged for a service. OASIS
servers must have accounts with each other in order to use each other’s non-free collections for request propagation.
[Figure 2: diagram not reproduced. Activities: Authenticate user (on authentication error: error notification; on authentication ok: continue); Get user query; Select propagation set; Propagate query to Collection (one activity per selected Collection); Merge results; Update user account; Output result page; then one of: new query, relevance feedback, log out.]
Fig. 2. Ordinary query processing activity diagram
The protocols used for communication between OASIS servers are based on CORBA and the Lightweight Directory Access Protocol (LDAP). LDAP is used for announcement of OASIS servers and their document collections. Every OASIS server has write access to the records in the Directory Service describing its document collections. It is in the interest of the server to update these in sync with the updates of the collections themselves. These descriptions represent the forward knowledge about the document collections, on the basis of which every OASIS server makes request propagation decisions. Apart from the collection descriptions, each OASIS server exports its CORBA object reference via the Directory Service.
Query propagation is performed via the CORBA Internet Inter-ORB Protocol (IIOP). Every OASIS server makes available the “standard” CORBA distributed object interface, which provides methods for communication of both the query and the search results. The query is accompanied by quality of service requirements, including the maximum number of records returned in the answer, the maximum service cost, and the maximum response time. All queries are signed by the issuer with its private key and represent a kind of “contract” between the initiator and the responder that cannot be repudiated by the issuer. This forms the basis for commercial relations between the OASIS service providers, who can charge each other on the basis of the signed entries in the service log.
Each OASIS server index can be built, and regularly updated, by the server’s Internet scanning process, which runs in the background (the so-called Crawler). This process uses a subject-specific harvesting strategy, supporting the HTTP and FTP protocols, to determine which documents to retrieve for building the index and which to revisit for updating it. In this manner, each server in the system can construct a comprehensive and up-to-date index of a subset of the information on the Internet. The crawlers run as separate processes, isolated from the server processes. This allows greater flexibility in terms of collection management: the administrator can run crawlers at night when the load is low, create the collection by hand, etc. The crawlers are capable of co-operating with each other. An OASIS server can set up a standing query with another server as described below.
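The quality of service requirements that accompany a propagated query (maximum number of records, maximum service cost, maximum response time) can be pictured as a small record plus a check that a response stays within them. The sketch below is only an illustration: the class and field names, the unit of cost, and the use of seconds for time are assumptions, and the signing of queries with the issuer's private key is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QoS:
    """Quality of service limits attached to a propagated query (hypothetical layout)."""
    max_records: int          # maximum number of records returned in the answer
    max_cost: float           # maximum service cost (unit is an assumption)
    max_response_time: float  # maximum response time, assumed to be in seconds


def honours(qos: QoS, records: int, cost: float, elapsed: float) -> bool:
    """Return True if a response respects every limit of the 'contract'."""
    return (records <= qos.max_records
            and cost <= qos.max_cost
            and elapsed <= qos.max_response_time)
```

A responder could run such a check before logging a chargeable entry, for example `honours(QoS(20, 1.5, 10.0), records=20, cost=1.0, elapsed=3.2)`.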
Though the development of Intelligent Agents is outside the scope of the project, each OASIS search server, and the OASIS distributed service as a whole, has intelligent agent properties. An OASIS server performs as an intelligent agent when an asynchronous mode of search result delivery is chosen by the user. In this case, the user’s request is stored by the OASIS server and it is treated as a persistent object, or standing query. Every new document retrieved by the server’s Internet scanning process is immediately evaluated against the persistent user request. If it matches well, the URL reference will be delivered to the user via e-mail. A similar mechanism is employed directly by OASIS servers to enable interserver cooperation for the improvement of local index quality. An OASIS server can set a standing query, representing the server’s chosen topic, with other servers in the system. If a server’s Internet scanning process discovers a document not relevant to its own server, but relevant to a server for whom a standing query has been set, then the URL is propagated to the server. This provides each server with access to URLs discovered by other servers in the system, thus increasing the potential quality of each local topic index. In this sense the whole OASIS service can be treated as a distributed intelligent agent.
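The standing query mechanism described above, where every newly retrieved document is evaluated against stored queries and forwarded to the owners it matches, can be sketched as follows. The matching score used here (share of the query's term weight found in the document) and the threshold of 0.5 are assumptions made for illustration; the paper does not specify how "matches well" is decided.

```python
def score(query_weights: dict[str, float], doc_terms: set[str]) -> float:
    """Share of the total query weight carried by terms present in the document."""
    total = sum(query_weights.values())
    if total == 0:
        return 0.0
    matched = sum(w for term, w in query_weights.items() if term in doc_terms)
    return matched / total


def route_new_document(doc_terms: set[str],
                       standing_queries: dict[str, dict[str, float]],
                       threshold: float = 0.5) -> list[str]:
    """Return the owners (users or other servers) whose standing query matches."""
    return [owner for owner, query in standing_queries.items()
            if score(query, doc_terms) >= threshold]
```

The same routine covers both uses named in the text: an owner may be a user who is notified by e-mail, or another OASIS server to which the URL is propagated.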
Administration of the OASIS servers and their document collections is considered a local matter of the server administrator and is not covered by the OASIS “standard”. The current implementation enables the administration of the server via a Web browser. The administrator is capable of user account management, collection content and description management, and crawler management.
2.1 Prototype OASIS Server Implementation
A brief overview of the components of a prototype OASIS server implementation is provided below.
The User Interface mediates between the user and the rest of the OASIS system. Interaction with the rest of OASIS is performed via the standard OASIS protocol. This is the only requirement for this component. The basic tasks of this component are: to get a query from the user, to propagate it to the Query Server, to present the returned search results to the user, and to provide users with access to their account information. Users are required to have their own accounts in OASIS. This allows us to have personalised user profiles. It also provides a framework for commercial services.
The Query Server plays a central role in the OASIS architecture. It is responsible for mediating between the user and the distributed set of available Collections in order to process the user’s query. Its functions include the selection of a set of Collections for query propagation, the actual query propagation, results merging, and the set-up of, and delivery of results for, standing user queries.
The OASIS Directory is a global directory of available Collections. It assists the Query Server in the selection of a set of Collections for query propagation.
An OASIS Collection is a topic-specific index of documents or document representations (profiles). A collection’s topic is chosen by its server’s administrator. Examples of topics include mathematics, medicine, computer science, physics, etc. A topic could also refer to a type of commercial service or product available on the Internet, such as travel services, software retail, booksellers, etc. In terms of the OASIS architecture, a Collection processes user queries received from a Query Server, stores the profiles of documents recommended by the Crawler or presented by the Collection administrator, and generates a Collection description. The architecture of a Collection itself is not specified¹ in the OASIS framework.
This allows other types of collections to be easily added to OASIS. OASIS Collections are topic-oriented, and constructing and maintaining a topic-oriented index is much more complex than maintaining a usual “anything goes” one. The OASIS Crawler is a tool that assists the OASIS server administrator in doing this. It searches for documents on the Internet that are relevant to the specified topic filter (i.e., a description of the desired topic) and recommends them to the filter owner. Typically the filter owner is the Crawler’s associated Collection. In certain cases, however, the filter may be in the form of a standing user query or a standing query from another Crawler. The Crawler uses a topic-oriented harvesting strategy [13] that determines which documents to retrieve and recommend for inclusion in its associated Collection. Its goal is to maximise the relevance of the documents in the Collection with respect to the Collection’s chosen topic, where out-of-date documents are considered to be irrelevant.
¹ Several collection architectures are implemented in the OASIS prototype and the list is open for extension.
2.2 Queries in OASIS
A major obstacle to producing accurate search results is informationally poor queries. An analysis of the logs of the squid proxy of a medium-size Internet provider, undertaken during the OASIS project, has shown that the average size of a query submitted to AltaVista is only 1.21 words. Clearly, this does not provide enough information for a precise search. The OASIS architecture aims at producing and propagating queries that are as rich as possible. The user can start the search from keywords, from an example of a relevant document², or both. In the process of the search the user can mark some documents from the result set as relevant and restart the search. The rating information becomes a part of the new query. As a result, we have dozens of weighted keywords in a query, obtained in a user-friendly manner. No irreversible transforms are performed on the queries as they are propagated between OASIS servers³. This makes the choice of query processing techniques a local matter at every OASIS service, allowing for competition of implementations, extensibility and innovation.
There are two basic types of user query in OASIS: standing queries and conventional queries. In the case of a conventional query the user waits for the system response. Such queries may be refined with the help of relevance feedback based on search results. In the case of a standing query the user receives the system response in asynchronous mode via e-mail.
² The user submits the URL of the example document.
³ Queries are compressed to speed up transfer over slow links.
2.3 Communication Issues
The ambition of the OASIS project is to offer a set of protocols for distributed search service that could be taken up by third party developers. To minimise their development costs, we tried to reuse existing protocols and technologies whenever possible. LDAP is used for fetching collection descriptions. Every server has an interest in keeping the descriptions of its collections up to date, to maximise the number of service requests. Collection descriptions are replicated to slave LDAP servers to speed up look-up. The process of selection of collections for query propagation is discussed in some detail in Sect. 3. Simple structured object interchange format (SOIF) is used for encoding of queries and results. We have used an extended version of the Stanford Proposal for
Internet Retrieval and Search [7]. The OASIS protocol allows simple extensions by the addition of new kinds of query and document SOIF templates in addition to the obligatory one⁴. The kind of templates to use may be negotiated between OASIS servers.
OASIS relies heavily on CORBA technology, which provides a generic way of interaction in heterogeneous environments. The main OASIS components are objects in the CORBA world. CORBA IIOP is used as the transport. The Naming Service is used for the discovery of OASIS resources. It provides a reliable way of making contacts between OASIS entities. For example, in order to contact a specific Collection, a Query Server consults the Naming Service to get a real object reference for a known symbolic name.
The CORBA Query Service has much in common with the basic OASIS architecture. It is designed for solving the problem of distributed search in a generic way [9]. The proposed solution is too generic for the OASIS system, and not all of the provided functionality is applicable in this case. However, a significant part of the available services may be reused, and the required additional functionality may be provided by extending some of the proposed features. The CORBA Query Service provides query operations on collections of objects. The Query Service focuses mainly on collections of CORBA objects, but these are not well suited to OASIS. The main reason is the need to provide traversal of Collections by implementing the Iterator interface. However, the Query Framework defined in the Query Service provides another type of service, which allows the use of collections that are available only through native protocols. This flexible approach may be easily reused for OASIS purposes.
The OASIS User Interface plays the role of a CORBA Client. The Query Server and Collections both act as Query Evaluators in the CORBA world. However, the underlying semantics are different: the Query Server only manages the distributed search, while Collections only search in the local database through a native protocol. Unfortunately the Query Service does not provide facilities for asynchronous query processing. This feature is very important in the OASIS framework for the support of standing queries. In order to provide the required functionality, additional interfaces were designed. The set of CORBA services used is limited by the lack of available service implementations. Services that could potentially be used include the LifeCycle Service and the Externalization and Internalization Services.
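To give a feel for the SOIF encoding of queries and results mentioned earlier in this subsection, the sketch below serialises one object in the SOIF layout as it is commonly described for Harvest: a template type and URL on the opening line, then attributes of the form name, byte count in braces, colon, tab, value. This layout is stated from general knowledge of SOIF and should be treated as an assumption here; the template name `QUERY` and the attribute name `Terms` are hypothetical and are not taken from the OASIS protocol.

```python
def soif_encode(template: str, url: str, attributes: dict[str, str]) -> bytes:
    """Serialise one SOIF-style object (layout assumed, see the text above)."""
    parts = [f"@{template} {{ {url}\n".encode("utf-8")]
    for name, value in attributes.items():
        data = value.encode("utf-8")
        # The byte count in braces lets a value contain newlines or braces.
        parts.append(f"{name}{{{len(data)}}}:\t".encode("utf-8") + data + b"\n")
    parts.append(b"}\n")
    return b"".join(parts)
```

Because every value carries its own length, a receiver can skip attributes from templates it does not understand, which fits the extension-by-new-templates approach described above.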
3 Discussion
In this section we discuss several techniques that are used in the OASIS prototype implementation, such as query propagation, result merging and use of relevance feedback.
⁴ TF-profiles are the obligatory ones. Currently we are experimenting with the use of Latent Semantic Indexing (LSI) techniques, and additional types of SOIF templates were added to support LSI profiles at the protocol level.
3.1 Query Propagation
In order to process a user’s query, an OASIS server must select several Collections for query propagation that are likely to return the best results while meeting the user-imposed limits on query cost, number of returned documents, and query processing time. For every Collection in the selected set the system must also specify the maximum search cost and time, thus allocating resources of the distributed system for the job. The selection criteria are based on forward knowledge derived from the Collection descriptions that each Collection exports to the OASIS Directory via LDAP. The quality of selection of Collections for query routing is critical for the scalability and performance of the system. Extensive experiments were conducted to find a satisfactory solution.
Collection Descriptions. The collection description is divided into two parts: human-written and automatically generated. The human-written description contains manually collected information. It includes Collection contact information for query propagation (e.g., the CORBA object name in the CORBA Naming Service), human-oriented Collection identification and description, a price list, etc. The automatically generated part is more complex. First of all, for each term $t$ used in the collection $c$ we calculate the document frequency $df_{tc}$, which equals the number of documents containing the term. The total number of documents $N_c$ in the collection is also known. We call this information a full description and do not store it in the Directory; only the shorter topical description described below is exported via LDAP. The full description can be obtained from the Collection directly via a CORBA call. A topical description of collection $c$ is derived from its full description by removing information on the terms that appear in this collection less frequently than on average in the whole collection set. In other words, information on term $t$ is present in the topical description of collection $c$ if
$$\frac{df_{tc}}{N_c} \ge \frac{\sum_{c' \in C} df_{tc'}}{\sum_{c' \in C} N_{c'}} , \tag{1}$$
where $C$ denotes the set of all Collections in the distributed system. The idea is to emphasise the terms that are specific to the Collection, the ones that reflect the subject of the documents it contains. Every collection exports its human-written and automatically generated topical descriptions to the LDAP Directory. Aggregate statistics about all the Collections in the OASIS system are needed to calculate a topical description. They can be obtained by polling all these Collections via CORBA calls. It is in the interest of a Collection to publish its updated description periodically, even if its document set has not changed, because the aggregate statistics may have changed.
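Inequality (1) translates almost directly into code. The following Python sketch assumes that the full descriptions of all Collections are available as dictionaries mapping terms to document frequencies, together with the collection sizes $N_c$; the function and parameter names are illustrative only.

```python
def topical_description(full: dict[str, dict[str, int]],
                        sizes: dict[str, int],
                        collection: str) -> dict[str, int]:
    """Keep the terms of `collection` that satisfy inequality (1).

    full[c][t] is the document frequency df_tc, sizes[c] is N_c.
    """
    total_docs = sum(sizes.values())
    kept: dict[str, int] = {}
    for term, df in full[collection].items():
        # Document frequency of the term summed over all Collections.
        global_df = sum(stats.get(term, 0) for stats in full.values())
        # Cross-multiplied form of df/N_c >= global_df/total_docs,
        # which stays in integer arithmetic.
        if df * total_docs >= global_df * sizes[collection]:
            kept[term] = df
    return kept
```

With two collections of 100 documents each, `full = {"med": {"cell": 80, "the": 100, "proof": 1}, "math": {"proof": 90, "the": 100, "cell": 5}}`, the topical description of `"med"` keeps `cell` and `the` and drops `proof`. A term that is equally frequent everywhere, such as `the`, sits exactly on the boundary of (1) and is kept by the non-strict inequality.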
Selection of Propagation Set. We start with a user query that also includes constraints on the search cost and time and on the number of documents returned. The goal is to generate several sub-searches, each limited to a single Collection, specifying the corresponding constraints for every sub-search. The algorithm for Collection selection and constraints estimation should maximise the number of relevant documents in the result set while meeting the user constraints for the whole search. For the sake of simplicity, in this paper we consider a simpler model, where the only limitation imposed is $N$, the maximum number of documents retrieved. A full model can be found in [14]. For every collection $c \in C$ we calculate an estimate $r(q, c)$ (up to a constant multiplier) of the number of documents in the collection $c$ relevant to the query $q$. The resources (i.e., the maximum number of documents to retrieve) are then distributed between the highest ranked collections in proportion to the values of $r(q, c)$. Predicting the number $r(q, c)$ of documents in a collection that are relevant to a query is the most complex part in both the simplified and the full model. After a large amount of experimentation the following method was adopted for the implementation. First of all, correlation between the presence of different terms is expected. To formulate it more precisely, if $df_{t_i c} \le df_{t_j c}$, $t_i \in q$, $t_j \in q$, then every document containing the term $t_i$ is supposed also to contain the term $t_j$. In this case the estimate for the number of documents in collection $c$ containing all the terms from a query $q'$ is calculated as follows:
$$df(q', c) = \min_{t \in q'} df_{tc} . \tag{2}$$
Then all the terms of a query $q$ are sorted in descending order of the value $df_{tc} \cdot w(t, q)$:
$$df_{t_1 c} \cdot w(t_1, q) \ge \ldots \ge df_{t_m c} \cdot w(t_m, q) ,$$
where $w(t, q)$ is the term weight of $t$ in query $q$. Sub-queries of the kind $q_i = \{t_1, \ldots, t_i\}$, $i \in \{1, \ldots, m\}$, are considered, and the proximity function $s(q_i, c)$ of a sub-query $q_i$ to a given collection $c$ is calculated as
$$s(q_i, c) = df(q_i, c) \cdot \left( \sum_{t \in q_i} w(t, q) \right)^{\delta} , \tag{3}$$
where $\delta$ is a constant parameter of the model (the value of 10 produced the best results). Finally, $r(q, c)$ is calculated as
$$r(q, c) = \max_{i \in \{1, \ldots, m\}} s(q_i, c) . \tag{4}$$
Let $N$ be the (user-imposed) upper limit on the total number of documents to be retrieved from all the collections as a result of a distributed search. The upper limit on the number of documents retrieved from collection $c$ for query $q$ is set to
$$N \cdot \frac{r(q, c)}{\sum_{c' \in C} r(q, c')} . \tag{5}$$
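Equations (2) to (5) can be followed step by step in a short Python sketch of the simplified model. Several details are assumptions rather than statements from the paper: a query term missing from a Collection's description is treated as having document frequency 0, the limit of Eq. (5) is rounded down to a whole number of documents, and the budget is shared among all Collections instead of only the highest ranked ones. The function names are illustrative.

```python
def estimate_relevant(query: dict[str, float],
                      df: dict[str, int],
                      delta: float = 10.0) -> float:
    """r(q, c) from Eqs. (2)-(4) for one Collection description `df`."""
    # Sort the query terms by df_tc * w(t, q) in descending order.
    terms = sorted(query, key=lambda t: df.get(t, 0) * query[t], reverse=True)
    best = 0.0
    min_df = None
    weight_sum = 0.0
    for term in terms:  # prefix sub-queries q_1, ..., q_m
        d = df.get(term, 0)
        min_df = d if min_df is None else min(min_df, d)    # Eq. (2)
        weight_sum += query[term]
        best = max(best, min_df * weight_sum ** delta)      # Eqs. (3), (4)
    return best


def allocate(query: dict[str, float],
             descriptions: dict[str, dict[str, int]],
             n_max: int,
             delta: float = 10.0) -> dict[str, int]:
    """Per-Collection document limits according to Eq. (5)."""
    r = {c: estimate_relevant(query, df, delta) for c, df in descriptions.items()}
    total = sum(r.values())
    if total == 0:
        return {c: 0 for c in r}
    return {c: int(n_max * r[c] / total) for c in r}
```

The large exponent $\delta$ strongly favours sub-queries that cover more of the query weight: for a two-term query with unit weights, a Collection with `{"neural": 30, "network": 60}` gets $30 \cdot 2^{10}$ from the full query, far more than the $60 \cdot 1^{10}$ of the single-term sub-query.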
The values for $df_{tc}$ used in the process are the ones from the Collections’ topical descriptions (see Sect. 3.1). They are obtained from the Directory via LDAP. Query propagation may be tuned manually by defining query propagation policies. By default, the set of Collections enabled for query propagation consists of all the available Collections. The administrator of the Query Server may restrict this set to improve the quality of Query Server services. Advanced users may forbid or force propagation to particular Collections.
3.2 Result Merging
As a result of query propagation to the set of selected Collections, the Query Server obtains several sets of query results. Since each Collection returns results relevant to the query from its own point of view, their document scores are not correlated. The Query Server has to perform additional analysis of the obtained results to select the best ones based on a common metric. This process is called result merging. It can be divided into two parts: de-duplication and document selection. De-duplication in its simplest form means removal of multiple entries of the same URL returned by different collections. Ideally the system should also detect and remove identical and almost identical documents residing at different locations. We combine this process with result document selection by clustering. Query results are split into several groups according to a similarity measure based on term usage. The documents in each group are considered by the system to be more similar to each other than to documents from other groups. If the number of clusters exceeds a threshold (ten, for example), then some neighbouring ones are merged. Only one document from each cluster is presented in the result set. Since the distance between quasi-identical documents is vanishingly small, they end up in the same cluster, so only one instance appears in the result set. Most importantly, the result set is presented to the user in a structured fashion.
We use a hierarchical radius-based competitive learning (HRCL) neural network, developed exclusively for the OASIS project, to accomplish the clustering. The rest of this subsection contains a rather technical description of the method. First, it is primarily based on the neural gas approach by Martinetz and Schulten [12] and uses output neurons without fixed grid dimensionality. Neurons are rearranged according to their distances from the current input every time a new input sample is generated. Second, HRCL uses fixed radii around each neuron and repels those second winners from the current input sample whose radii overlap with the winner’s radius. This feature is based on and extends the rival penalised competitive learning algorithm by Xu et al. [19]. Third, HRCL builds a hierarchy of clusters and sub-clusters in either a top-down or a bottom-up manner: the first generated top-down hierarchical level consists of detected clusters or cluster prototypes. Every neuron is trained to represent one prototype. The second level then tries to detect sub-clusters in every first level cluster, and so forth. Initial neuron settings at each hierarchical level are generated in accordance with probability densities at fixed cells in vector space, using a cell-like clustering similar to the BANG-clustering system by Schikuta and Erhart [16].
Conventional statistical clustering methods, like the single- or one-pass clustering used in the SMART retrieval system by Salton and McGill [15], p. 127 ff., or similar heuristic methods are highly dependent on the order of input vectors fed into the system. Conventional neural approaches, above all error minimising competitive learning algorithms, are generally able to detect major clusters. But competitive learning methods with a priori given and fixed output neuron dimensionality, like Kohonen’s Self-Organizing Map (SOM) [11], also place neurons at locations with lower probability densities and run into problems if the fractal dimension of the input vector space is higher than the usually two-dimensional output space [17]. Competitive learning without a priori given network dimensionality, like Fritzke’s Growing Neural Gas (GNG) [6], uses adapted network dimensionality at vector hyper-spaces with different “local” fractal dimension and thus tries to circumvent the above-mentioned drawbacks, but the granularity of detected clusters is highly dependent on the number of training steps. If the GNG training is interrupted at an early processing stage, GNG will find overall clusters with appropriate cluster centroids; if training is extended, GNG will only find sub-clusters. It does not find both.
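The de-duplication and one-document-per-cluster selection described at the start of this subsection can be illustrated without HRCL itself. The sketch below swaps in a much simpler technique, greedy single-pass ("leader") clustering with cosine similarity over term counts. This is exactly the kind of order-dependent heuristic the text above criticises, so it should be read only as a stand-in that shows the data flow; the threshold of 0.8 and all names are assumptions.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_representatives(results: list[tuple[str, list[str]]],
                           threshold: float = 0.8) -> list[str]:
    """Return one URL per group of near-identical documents, in input order."""
    seen_urls: set[str] = set()
    leaders: list[tuple[str, Counter]] = []
    for url, terms in results:
        if url in seen_urls:
            continue  # the same URL returned by several Collections
        seen_urls.add(url)
        vector = Counter(terms)
        if any(cosine(vector, leader) >= threshold for _, leader in leaders):
            continue  # near-identical to a document that is already selected
        leaders.append((url, vector))
    return [url for url, _ in leaders]
```

A mirrored copy of a page under a different URL has a similarity close to 1 with the original and is therefore dropped, which mirrors the remark that quasi-identical documents end up in the same cluster.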
[Figure 3: plots not reproduced. Two scatter plots, each showing the mode data and the HRCL centroids; both axes range from -1.5 to 1.5.]
Fig. 3. Left: first level HRCL; right: second level HRCL
Figure 3 shows the training results of HRCL hierarchical clustering on 2-dimensional artificially arranged multi-modal input data using top-down hierarchical refinement. The left picture depicts 3 neurons of the first hierarchical level placed at cluster centroids after 565 HRCL learning steps using a user-supplied neuron radius of 0.3. The right picture depicts 5 HRCL neurons of the second hierarchical level placed at the sub-cluster centres of the cluster that is defined by one of the first level neurons plus its radius, using an additional 95 learning steps. HRCL is able to automatically build a hierarchy of clusters, sub-clusters and so on, dependent on the neuron settings at each level and the user-supplied radius. For the above-mentioned input data, which consist of 1,800 vectors, HRCL automatically detects 3 hierarchy levels reflecting the globularity of every top-level cluster.
4 Experimental Results
The prototype of OASIS Service was developed and currently it is in beta testing period. Currently OASIS Service includes ten topic collections that are distributed among four sites. The main entry point5 to the OASIS Service is www.oasis-europe.org . Source code of the software is also publicly available. 4.1
Experiments with Query Propagation Strategy
Since distributed search is the distinguishing feature of OASIS compared with traditional search engines, we paid extra attention to the performance of the query propagation algorithms. We tested the algorithms described above against “standard” document collections that have relevance ratings generated by human experts. Six topical collections were formed from TREC-5 documents, and 30 queries were picked from the set of queries supplied for the routing task. The total number of documents in all collections is 3514, and their total size is approximately 40 MB.
Expert ratings were used only for search inside the collections, while the propagation of queries to the collections was performed as described in Sect. 3.1. Since the search inside the collections was therefore perfect, the rate of relevant documents among those returned by such a distributed search can be used as a measure of the quality of the propagation strategy. In our experiments it equalled 69%. For comparison, the best method that does not use aggregate statistics for all the collections (derived from bGlOSS [8]) scored only 48% in the same setting. With the method described, the most relevant collection was ranked first for 71% of queries and first or second for 93%.
When documents are distributed uniformly between the collections, i.e., the collections are not topic-oriented, the performance of the described methods drops dramatically (to 28% in our experiments). This is hardly surprising, since the whole OASIS system is optimised for the case of subject-oriented, individually owned collections. The Crawler component, described in the next paper of this book [13], is instrumental in the creation and maintenance of topic-oriented Web indexes.
⁵ Any OASIS Server may be used as an entry point to the OASIS Service. A list of known servers is available at http://www.oasis-europe.org.
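The two measures used in this experiment, the rate of relevant documents among those returned and the share of queries for which the most relevant collection is ranked within the top positions, can be computed as in the following sketch. The function names and the data layout (dictionaries keyed by query identifier) are assumptions for illustration; they do not come from the OASIS software.

```python
def relevant_rate(returned, relevant):
    """Share of returned documents that the experts judged relevant."""
    if not returned:
        return 0.0
    return sum(1 for doc in returned if doc in relevant) / len(returned)

def top_k_hit_rate(rankings, best, k):
    """Share of queries whose most relevant collection is ranked in the top k.

    rankings maps a query id to its ranked list of collections;
    best maps a query id to its most relevant collection.
    """
    hits = sum(1 for q, ranked in rankings.items() if best[q] in ranked[:k])
    return hits / len(rankings)
```

With `k=1` the second function gives the share of queries for which the most relevant collection is ranked first, and with `k=2` the share for which it is ranked first or second.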
5 Conclusions and Future Work
The OASIS architecture is capable of supporting a wide variety of document, query and collection representations. It delivers on its main promise of a distributed search environment that carries queries, result sets and relevance feedback data between all the actors in a transaction. The strategies chosen by the private actors in a distributed search service can be quite sophisticated, and they require study at both the component level and the system level. Once the public service starts, more information about the interactions of the prototype system with real users will become available.
References
[1] Anders Ardö and Sigfrid Lundberg. A regional distributed WWW search and indexing service - the DESIRE way. In Proc. of the Seventh International World Wide Web Conference. Elsevier Science, 1998.
[2] C.M. Bowman, P.B. Danzig, D.R. Hardy, U. Manber, and M.F. Schwartz. Scalable internet resource discovery: Research problems and approaches. Communications of the ACM, 37(8):98–107, 1994.
[3] C.M. Bowman, P.B. Danzig, D.R. Hardy, U. Manber, and M.F. Schwartz. The Harvest information discovery and access system. Computer Networks and ISDN Systems, 28(1-2):119–125, 1995.
[4] James Callan, Zhihong Lu, and Bruce Croft. Searching distributed collections with inference networks. In Proc. of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995.
[5] Andrzej Duda and Mark Sheldon. Content routing in a network of WAIS servers. In Proc. of the 14th International Conference on Distributed Computing Systems, pages 124–132. IEEE Computer Society Press, 1994.
[6] B. Fritzke. A growing neural gas network learns topologies. Advances in Neural Information Processing Systems, 7:625–632, 1995.
[7] L. Gravano, C.K. Chang, H. Garcia-Molina, and A. Paepcke. STARTS: Stanford proposal for internet meta-searching. In Proc. of the International Conference on Management of Data, 1997.
[8] L. Gravano, H. Garcia-Molina, and A. Tomasic. Precision and recall of GlOSS estimators for database discovery. In Proc. of the 3rd International Conference on Parallel and Distributed Information Systems (PDIS’94), 1994.
[9] OMG Group. The Common Object Request Broker: Architecture and Specification. July 1995.
[10] T. Koch, A. Ardö, A. Bremmer, and S. Lundberg. The building and maintenance of robot based internet search services: A review of current indexing and data collection methods, 1998.
[11] T. Kohonen. Self-Organizing Maps. Springer-Verlag, 1995.
[12] T.M. Martinetz and K.J. Schulten. A “neural-gas” network learns topologies. Artificial Neural Networks, pages 397–402, 1991.
[13] Igor Nekrestyanov, Tadhg O’Meara, and Ekaterina Romanova. Building topic-specific collections with intelligent agents. In Proc. of the Sixth International Conference on Intelligence in Services and Networks, April 1999.
[14] OASIS project consortium. Distributed search algorithms specification. INCO Copernicus PL961116 deliverable D3.4.
[15] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
[16] E. Schikuta and M. Erhart. The BANG-clustering system: Grid-based data analysis. Advances in Intelligent Data Analysis (IDA-97), pages 513–524, 1997.
[17] H. Speckmann. Analyse mit fraktalen Dimensionen und Parallelisierung von Kohonens selbstorganisierender Karte. PhD thesis, University of Tuebingen, 1995.
[18] R. Weiss, B. Velez, M. Sheldon, C. Namprempre, P. Szilagyi, A. Duda, and D. Gifford. HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. In Hypertext’96, The Seventh ACM Conference on Hypertext, pages 180–193. ACM Press, 1996.
[19] L. Xu, A. Krzyzak, and E. Oja. Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Transactions on Neural Networks, 4(4), July 1993.