Simulation-Based Approach to Evaluation of Management Strategies in a Distributed Web Search System

Rinat Khoussainov and Ahmed Patel
Computer Networks and Distributed Systems Research Group
Department of Computer Science, University College Dublin
Belfield, Dublin 4, Ireland

Abstract: In our previous work, we advocated the use of distributed search architectures in which multiple independently owned and managed topic-specific search engines act as one search system. This approach has significant advantages over the currently predominant centralised Web search model, including a low market entry cost for individual search providers and the ability to provide better information coverage. The potential for competition between engines, however, requires new approaches to effective engine management. Many complex issues arise, such as deciding what topic an engine should specialise in, as well as choosing service pricing strategies. Performance evaluation is an important stage in the development of management algorithms. It is also a challenging problem due to the complex and interactive nature of distributed search systems. We propose a simulation-based approach to this problem that provides for fast, low-cost, and verifiable performance measurements. In this paper, we describe our simulation environment and present experimental results for a heuristic topic management algorithm obtained using the developed simulator.

Key-Words: Distributed Web search architectures, service management, simulation

1 Introduction

Effective search for information resources on the Internet has become a popular and well-motivated research topic in recent years. The huge size, continuing growth, and diversity of the Web have made it difficult for existing Web search systems to provide complete, relevant, and up-to-date information in response to user search queries. The source of these problems can be traced to the use of centralised search architectures. Centralised Web search architectures, as employed by AltaVista, FAST, Google, and Inktomi, use many computers under centralised management to act as a single search engine. These engines attempt to index all, or a representative sample of all, the pages on the Web and target all user queries in the search services market. While they scale to large numbers of computers, and can process large numbers of pages and queries, the scaling techniques they use require centralised access to, and control of, the underlying document index. A number of problems are associated with this approach, including cost, biased coverage, and the "invisible" Web (non-Web document collections having a Web front-end and proprietary content) [4, 8]. In our previous work, we argued that the solution to these problems lies in the use of Distributed Search Architectures (DSA) [4, 3]. In distributed search architectures, multiple search engines owned by different organisations or individuals act as a single search system. Each engine chooses to index only a subset of all documents available on the Web and processes only a subset of all user queries. We advocated the use of a topic-based distribution principle for building such systems. According to this principle, each search engine indexes documents and processes user queries only on a selected topic, whereas the selection of relevant topic-specific search engines, propagation of user queries, and aggregation (merging) of results is performed automatically and transparently to the end user.

In both centralised and distributed search scenarios, there exists competition between different service providers. Search engines adjust various service parameters to attract users by offering, for example, higher result quality, more complete coverage, or a lower service price (in the case of paid search services). The decision making required for successful competition, however, is much more complicated in distributed search systems. Not only is the number of parameters that need to be adjusted larger, but the number of competitors influencing a search engine's performance is also much greater. How should an engine choose its topic? How should it set its query processing price? Adjusting these and other service parameters in order to increase the engine's performance is what we call service management. In [3], we discussed the complexity issues associated with the service management problem and outlined a possible solution approach. In this paper, we concentrate on performance evaluation of service management strategies.

Performance evaluation is an indispensable stage in the development of management algorithms. It is also a challenging problem in itself due to the complex and interactive nature of distributed search systems. We propose a simulation-based approach to this problem that provides for fast, low-cost, and verifiable performance measurements. We describe our simulation environment and present experimental results for a heuristic topic management algorithm obtained using the developed simulator.

2 Service Management

There are three basic activities performed by a centralised search system: storage and maintenance of resource (document) descriptions, resource discovery, and processing of search queries. A Web search engine needs to collect information about Web documents and store it in a form suitable for efficient processing. Discovery of Web documents can be manual or automatic. Automatic discovery is performed by first downloading a specified set of Web documents and then recursively downloading the pages linked from the originals. The software component responsible for this is frequently called a Web crawler or robot. Processing of user search queries is the main activity in a search engine. It includes receiving a query describing the Web resources that a user is interested in, searching the document index for relevant entries, and presenting the results to the user. Results may be ranked by their relevance to the query or by other properties.

In distributed search systems, two more activities are added: search engine selection and result aggregation (merging). The goal of search engine (collection) selection is to find the most suitable search engine (or set of search engines) for each user query, where suitability is determined by the quality of service that the engine(s) can provide for the query. When a query is sent to multiple search engines, the returned results need to be aggregated into a single list for presentation to the user.

Performance of a search engine as a provider of search services ultimately depends on the user queries that it receives. By associating a reward with each query, the performance of the engine can be measured in quantitative terms. In the simplest case, it may be the number of requests received (a reward of 1 is associated with each query). In the case of paid search services, it may be the total revenue generated from service provisioning. The user queries that the search engine receives are determined by the selection process which, in turn, is based on the parameters of the service provided by the engine. Therefore, the goal of service management is to adjust the service parameters of a search engine so as to increase its performance by affecting the search engine selection process.
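To make the reward-based performance measure concrete, here is a minimal sketch; the names and reward functions are our own illustrations, not part of the system described in this paper:

```python
# Sketch of the per-query reward measure described above. The reward
# function is an illustrative assumption: 1.0 per query for the
# "number of requests" case, or a price per query for paid services.
from typing import Callable, Iterable

def engine_performance(queries: Iterable[str],
                       reward: Callable[[str], float] = lambda q: 1.0) -> float:
    """Sum the reward obtained from every query the engine received."""
    return sum(reward(q) for q in queries)

received = ["cheap flights", "rock music", "rock concerts"]
print(engine_performance(received))                  # 3.0 (request count)
print(engine_performance(received, lambda q: 0.01))  # 0.03 (flat price per query)
```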

2.1 Search engine selection

A general model of the engine selection process can be represented by a function that, for a given user request and a set of the engine's service parameters, returns a value characterising the engine's suitability for processing the request. The whole selection process can then be split into two stages: ranking search engines by their suitability, and selecting search engines depending on their rankings. As mentioned in the Introduction, there may be many service parameters affecting the search engine selection process in a distributed search system. This also raises the problem of designing appropriate multi-parameter ranking functions. In this study, however, we use only one parameter for the selection of search engines: the search engine's content, or topic. This is the parameter adjusted in the service management process.

The problem of selection between multiple topic-specific search engines has already been widely addressed in the literature. Well-known examples include GlOSS and CORI [2, 1]. Most of these methods use compact descriptions of the engines' document indexes (content descriptions) to estimate the quality (relevance) of results that would be returned by a search engine for a given query. Usually, these descriptions are based on the vector space model of information retrieval [7] and contain various statistics related to the terms appearing in the indexed documents. In our study, we use the CORI selection method, for which content descriptions are represented by term frequency lists: each term appearing in the indexed documents is associated with a value equal to the number of documents in the index containing this term.
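As an illustration of selecting engines from term-frequency content descriptions, the sketch below computes a simplified CORI-style score. The constants and the exact formula differ from the tuned CORI method of [1], and the engine names are hypothetical:

```python
# Simplified CORI-style engine ranking from term-frequency content
# descriptions (an illustration only; real CORI [1] uses different
# constants and also normalises by collection sizes).
import math
from typing import Dict, List, Tuple

def rank_engines(query_terms: List[str],
                 descriptions: Dict[str, Dict[str, int]]) -> List[Tuple[str, float]]:
    n_engines = len(descriptions)
    scores = {}
    for name, df in descriptions.items():          # df: term -> document frequency
        belief = 0.0
        for term in query_terms:
            # number of engines whose description contains the term at all
            cf = sum(1 for d in descriptions.values() if term in d)
            t = df.get(term, 0) / (df.get(term, 0) + 50.0)
            i = math.log((n_engines + 0.5) / max(cf, 1)) / math.log(n_engines + 1.0)
            belief += 0.4 + 0.6 * t * i
        scores[name] = belief / max(len(query_terms), 1)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example with two hypothetical engines:
engines = {"rock-engine": {"rock": 900, "music": 700},
           "travel-engine": {"flights": 500, "hotels": 400}}
print(rank_engines(["rock", "concerts"], engines))
```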

3 Simulation-Based Evaluation

Performance evaluation is an integral part of the development of service management algorithms, as it allows us to answer the key question of whether an algorithm fulfils its requirements. An analytical approach to performance evaluation of Web search systems is extremely difficult due to the system complexity. This is especially true for distributed search systems. Complex interactions between system components make a tractable analytical treatment hardly feasible. Even if a tractable analysis were possible, it would be unrealistic due to our poor understanding of the system inputs (workload). These inputs include user queries and document sources, for which we have insufficiently developed analytical models. A measurement approach to performance evaluation of search systems is also difficult due to the need to interact with users and the Web, combined with the large number of design features and parameter choices. This makes real-world performance measurements slow, expensive, and extremely difficult to verify through repetition.

For these reasons, a simulation-based approach to performance evaluation of management algorithms seems the most appropriate choice. Simulation is less time-consuming than real-world measurements because it can be executed in simulated time rather than in real time. It also does not require the real system to be built in full prior to evaluation and does not interfere with the operation of any existing system.

3.1 Simulation model

As we explained in Section 2, there are five basic activities performed in a distributed search system: storage and maintenance of document descriptions, document discovery, processing of user search queries by search engines, search engine selection, and result aggregation. Selection of search engines also involves search engine ranking. In our simulation model we omit the processing of user search queries. From the management point of view, all that matters is what requests the search engine receives, not how they are processed. In general, the processing of search queries can affect the search engine selection (and hence the engine's performance) if the actual result quality does not match well the ranking estimates made during engine selection (for example, due to malicious manipulation of content descriptions). We assume, however, that search engines in our system provide truthful content descriptions. Figure 1 shows the simulation model of the system. It includes search engine ranking, engine selection, index population, and service management. The last two processes are simulated for each search engine in the system.

Users provide the input of search queries into the system. The ranking process receives the content descriptions of the search engines in the system and the user search queries, and produces engine rankings for each request. The engine selection process forwards each user query to a search engine based on its ranking. The index population process simulates the population of a search engine's index with documents discovered on the Web (Web robot functionality). Finally, the service management process is responsible for controlling the index population (Web robot) in order to increase the search engine's performance.

3.2 Related work

Very little research has been done to date on performance modelling and simulation of Web search engines, let alone distributed search systems. Most previous efforts have concentrated on statistical analysis of user query streams (logs) without targeting any particular performance evaluation goals [10, 9]. Recently, a Web server simulator was used to evaluate the performance effects of caching query result pages [6]. This work, however, was an extension of Web server research rather than an examination of Web search system simulation issues.

[Fig. 1: Simulation model — Users issue queries; Engine Ranking uses the Content Descriptions store to produce engine rankings; Engine Selection forwards queries to engines; Index Population (Web documents) and Service Management (content adjustments, content description updates) are simulated for each search engine.]
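The flow in Fig. 1 can be summarised by the following sketch of one simulated interval. All names are illustrative, and the concrete ranking, selection, and management implementations are described in Sections 4 and 5:

```python
# Sketch of one simulated interval following Fig. 1 (names are illustrative).
from collections import defaultdict

def simulate_interval(queries, descriptions, managers, rank, select):
    """descriptions: engine -> term-frequency content description.
       managers: engine -> function mapping received queries to a new topic."""
    received = defaultdict(list)
    for q in queries:
        rankings = rank(q, descriptions)   # engine ranking process
        engine = select(rankings)          # engine selection process
        received[engine].append(q)         # query is forwarded to that engine
    # Each service manager then derives a new topic from the queries it saw;
    # the (simulated) Web robot would repopulate the index for that topic.
    new_topics = {e: managers[e](received[e]) for e in descriptions}
    return received, new_topics

# Toy usage with stub ranking and selection:
received, topics = simulate_interval(
    ["rock music"], {"e1": {"rock": 10}},
    managers={"e1": lambda qs: {"rock": len(qs)}},
    rank=lambda q, d: [("e1", 1.0)], select=lambda r: r[0][0])
print(dict(received), topics)   # {'e1': ['rock music']} {'e1': {'rock': 1}}
```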

4 Simulator Design

The processes in the system simulation model can be implemented in different ways and with different levels of detail, depending on the degree of simulation realism needed. Therefore, one of the main requirements for the simulation environment is the flexibility to accommodate changes to the implementation of any one system process without modifications to other processes or to the simulation environment itself. To achieve this, the simulator is built from several independent modules implementing different parts of the overall simulation model. The following list enumerates and explains each module:

Engine Ranker: The engine ranker is responsible for ranking the search engines for each given search query. It accepts a search query as input and outputs a ranked list of search engine names with their corresponding rankings. It uses the store of engines' content descriptions.


Engine Selector: The engine selector chooses the search engine (or, possibly, a set of search engines) for each search query based on their rankings as provided by the engine ranker.

Request Generator: The request generator simulates the stream of user search queries to the system.

Search Engine Simulator: The search engine simulator accepts search requests from the engine selector and submits the content description for the simulated search engine to the content descriptions store. It consists of a service manager module, a store for the engine's current content description, and a Web robot simulator.

Web Robot Simulator: The Web robot simulator models the population of the search engine's document index with documents from the Web. The input to the Web robot simulator is a description of the desired content (topic) calculated by the service manager. The output of the Web robot simulator is a set of updates to the current content description of the search engine. Essentially, the Web robot simulator models how the engine's content description would change if an actual Web robot populated the index with Web documents on the specified topic.

Service Manager: The purpose of the service manager is to increase the performance of the search engine by controlling the index population process. The manager accepts the search queries forwarded to the engine (in order to calculate the engine's performance) and outputs a description of the desired topic for the search engine. This description has the same term frequency list format as the engine's content description. The service manager is also responsible for updating the engine's content description in the content descriptions store.

[Fig. 2: Simulator design — the Request Generator produces queries; the Engine Ranker reads the Content Descriptions store and passes engine rankings to the Engine Selector, which forwards queries to a Search Engine Simulator; inside each Search Engine Simulator, the Service Manager derives a new topic from the received queries, the Web Robot Simulator turns it into content description updates using a Document Searcher over the TREC collection, and the new content description is written back to the Content Descriptions store.]

The simulator design is illustrated in Figure 2. As long as the formats of the search queries and content descriptions do not change, modifications to one module do not affect the other modules of the simulator. For example, the implementation of the rest of the search engine simulator does not need to change whether the Web robot simulator produces content updates by downloading actual Web documents or estimates them in some other way.

Consider now the implementation of the individual modules. The service manager's implementation obviously depends on the particular management algorithm being evaluated; in Section 5, we describe a heuristic management algorithm and present evaluation results obtained using the simulator. The most realistic way to implement the engine ranker is to use a real ranking algorithm, and, as mentioned earlier, we have implemented the CORI-based ranking scheme. The engine selector in our system always selects exactly one search engine for each query: the highest-ranked one. If several top-ranked search engines have the same ranking, one of them is selected at random with equal probability.
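A minimal sketch of this selection rule is shown below; the function name and data shapes are ours, not the simulator's:

```python
# Pick the single highest-ranked engine; break exact ties uniformly at random.
import random

def select_engine(rankings):
    """rankings: list of (engine_name, score) pairs, e.g. from the engine ranker."""
    best = max(score for _, score in rankings)
    top = [name for name, score in rankings if score == best]
    return random.choice(top)   # equal probability among tied engines

# Example: the two tied engines are each chosen about half of the time.
print(select_engine([("e1", 0.8), ("e2", 0.8), ("e3", 0.3)]))
```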

4.1 Request generator

As we already pointed out, little research has been done to date on modelling Web search engines. Previous studies merely presented various statistics for query logs collected from an existing centralised search engine over a fairly limited time period [10, 9]. This insufficient understanding of search engine inputs makes it hard to justify the generation of synthetic search engine workloads with realistic properties. This is why the use of pre-recorded real search queries seems to us the preferable choice.

Unlike the cited log analysis studies, which collected query logs from a single search engine, we collected our queries from the Web proxy logs of a large ISP. Thus, we have been able to capture search queries to potentially all Web search engines used by the ISP's users. While this approach provided a more complete set of queries, it also brought up the problem of query extraction from the logs. Each Web search engine uses a distinct URL syntax for the submission of search requests. URL patterns and extraction rules had to be developed individually for each search engine, thus limiting the number of search engines we could analyse. How does this limitation affect the completeness of the query sets obtained? The top 8 search engines (including Google, AskJeeves, Yahoo, Excite, and MSN) out of the 47 well-known search engines and portals analysed in our study together accounted for almost 86% of all queries, with the remaining entries receiving no more than 3% each. Therefore, the engines missing from our listings are likely to have a negligible share of all queries.
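The extraction step can be illustrated as follows; the host patterns and query-parameter names are assumptions given for illustration and do not reproduce the actual rules developed in the study:

```python
# Sketch: extract search queries from proxy-log URLs using per-engine rules.
# Host patterns and query-parameter names below are illustrative assumptions.
import re
from urllib.parse import urlparse, parse_qs

ENGINE_RULES = {                  # host pattern -> query parameter name
    r"(^|\.)google\.": "q",
    r"(^|\.)excite\.": "search",
}

def extract_query(url: str):
    parsed = urlparse(url)
    for host_pattern, param in ENGINE_RULES.items():
        if re.search(host_pattern, parsed.netloc):
            values = parse_qs(parsed.query).get(param)
            if values:
                return values[0]
    return None   # not a recognised search request

print(extract_query("http://www.google.com/search?q=distributed+web+search"))
# -> 'distributed web search'
```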

4.2 Web robot

Realistic simulation of the Web robot is particularly difficult, because we have to model its interactions with the Web. In our simulator, we model the behaviour of an "ideal" topic-specific Web robot instead. For a given document index size and a topic specified by the service manager, we search a collection of Web documents for the most relevant documents to that topic, treating the topic as a query whose term frequencies serve as query term weights when estimating document relevance. These documents are then retrieved by the robot simulator in order of decreasing relevance. This process is, in fact, equivalent to a real topic-specific Web robot that uses the above-mentioned collection of documents as its "Web" and, for any given topic, knows a priori the best order in which to retrieve documents for that topic. We call this behaviour "ideal" because real topic-specific Web robots obviously cannot know the locations of the best documents for a given topic. Moreover, real robots retrieve documents by following hyperlinks, so the downloading results are affected by Web connectivity as well as by the robot's ability to predict which links lead to relevant documents and to follow them first. In our simulator, we used the TREC WT10g document collection and the SMART weighting scheme for calculating document relevance [7].
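A minimal sketch of the "ideal" robot behaviour follows. It uses a plain term-frequency dot product as a stand-in for the SMART weighting actually used with WT10g, and the data shapes and names are illustrative:

```python
# Sketch of the "ideal" topic-specific robot: pick the index_size documents
# most relevant to the topic and report the resulting content description
# (term -> number of selected documents containing the term).
# Relevance here is a simple term-frequency dot product, not SMART weighting.
from collections import Counter
from typing import Dict, List

def ideal_robot(topic: Dict[str, int], collection: List[Dict[str, int]],
                index_size: int) -> Dict[str, int]:
    def relevance(doc: Dict[str, int]) -> float:
        return sum(doc.get(term, 0) * weight for term, weight in topic.items())
    best = sorted(collection, key=relevance, reverse=True)[:index_size]
    description = Counter()
    for doc in best:
        description.update(set(doc))        # document frequency, not term count
    return dict(description)

# Tiny example with three "documents" (term-frequency maps):
docs = [{"rock": 5, "music": 2}, {"jazz": 4}, {"rock": 1, "concert": 3}]
print(ideal_robot({"rock": 3, "music": 1}, docs, index_size=2))
# -> {'rock': 2, 'music': 1, 'concert': 1}
```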



4.3 Communications and processing delays

Simulation of communication and processing delays is important for reflecting the dynamics of a distributed system with independently functioning components. To allow for the modelling of communication and processing delays in our simulator, we implemented it using a discrete-event, process-based simulation package called C++SIM [5]. The simulation processes implementing the separate modules can use C++SIM functions to specify the delays in simulated time that different actions take, while the C++SIM scheduler takes care of activating processes at the required moments in simulated time and of ordering the process execution.
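To illustrate what specifying delays in simulated time involves, a generic event-queue sketch is shown below; it does not reproduce the C++SIM API, and the process labels and delay values are arbitrary:

```python
# Generic discrete-event sketch: processes advance a simulated clock by
# scheduling their next action after a delay. This is NOT the C++SIM API,
# only an illustration of simulated-time scheduling.
import heapq

def run(events, until=100.0):
    """events: list of (time, label, delay) seeds; each action reschedules itself."""
    queue = list(events)
    heapq.heapify(queue)
    while queue:
        time, label, delay = heapq.heappop(queue)
        if time > until:
            break
        print(f"t={time:6.1f}  {label}")
        heapq.heappush(queue, (time + delay, label, delay))

# Two "processes": query arrivals every 1.5 time units, index updates every 10.
run([(0.0, "query arrives", 1.5), (0.0, "index update", 10.0)], until=5.0)
```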

5 Experimental Results

In this section, we present evaluation results for a heuristic content management algorithm obtained using the described simulator. The principle employed in the management algorithm can be expressed as "give the users what they look for". The manager calculates the number of times each term appeared in the search requests received by the managed search engine during a selected time interval. The resulting term frequency list is then used as the new topic for the search engine. The content of the old document index is fully discarded (i.e. the old content description is cleared), and the document index is populated (i.e. the content description is updated) with content for the new topic.
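A minimal sketch of this heuristic follows; the whitespace tokenisation and the function name are our own simplifications:

```python
# Heuristic service manager sketch: the new topic is simply the term
# frequency list of the queries received in the last interval.
from collections import Counter
from typing import Dict, List

def new_topic(received_queries: List[str]) -> Dict[str, int]:
    counts = Counter()
    for query in received_queries:
        counts.update(query.lower().split())   # naive whitespace tokenisation
    return dict(counts)

# Example: the engine saw three queries in the interval.
print(new_topic(["rock music", "rock concerts dublin", "live music"]))
# -> {'rock': 2, 'music': 2, 'concerts': 1, 'dublin': 1, 'live': 1}
```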

A number of assumptions were made in the simulation. In particular, we assumed that it takes the same time to populate document indexes of equal size for any topic. The list below describes the experiment sequence:

1) Initially, the content descriptions of all engines were set to the same initial content description, namely the content of a document cluster on "rock music" built from the TREC collection.

2) A stream of 10,000 search queries (100,000 queries for the case with 30 search engines) was sent into the system, and the number of search requests received by each search engine was recorded.

3) The managers calculated new topics for their engines.

4) The first 100,000 documents in the TREC collection were used by the Web robot simulators to build the new content descriptions.

5) The sequence was repeated from step 2.

[Fig. 3: 2 search engines: 1 managed, 1 static — requests received vs. iterations ("Managed" vs. "Not managed")]

The simulation results for different system configurations are presented in Figures 3-5. As we can see from the results, while the heuristic provides a sound performance increase against a static competitor, its performance against other search engines using the same algorithm can be very unstable. Once an engine is unsuccessful in selecting its topic, it gives up the most popular queries to its opponents and thus receives only limited information about the actual query popularity in the future (remember that each engine only sees the queries that it receives). This can, in turn, significantly degrade the engine's performance. This study once again emphasises the importance of strategic reasoning and of taking into account the possible actions of others (see [3] for more details).

[Fig. 4: 2 search engines: both managed — requests received vs. iterations (Search Engine 1, Search Engine 2)]

[Fig. 5: 30 search engines: 27 managed, 3 static — requests received vs. iterations (SE 1-SE 30)]

6 Conclusions

In this paper, we studied performance evaluation of a distributed Web search system using the simulation-based approach. In particular, we presented evaluation results for a heuristic engine content management algorithm whose goal is to automatically increase the engine's performance. Simulation of Web search systems in general, and of distributed search systems in particular, is a relatively new research area. While the simulation approach provides for fast, low-cost, and verifiable performance measurements, there are still serious challenges on the way to a realistic simulation. The elements of the system involving interactions with the external environment, such as search users and the Web, are the most difficult to simulate. More research and understanding is required here to provide sound models of the system inputs. In the current version of the simulator, we used trace-driven simulation where possible, while the Web robot was simulated with the "ideal" behaviour, yielding the best possible rather than realistic performance. Future work may include the development of more realistic models of the system's interactions with the Web.

7 Acknowledgements

The authors would like to acknowledge Mikhail Sogrine for the implementation of the engine ranker (CORI algorithm) and Alexander Ufimtsev for help with the C++SIM package and for processing the query logs. The support of the Informatics Research Initiative of Enterprise Ireland is gratefully acknowledged.

References

[1] J. P. Callan, Z. Lu, and W. B. Croft. Searching distributed collections with inference networks. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21-28. ACM Press, July 1995.

[2] L. Gravano and H. Garcia-Molina. GlOSS: Text-source discovery over the Internet. ACM Transactions on Database Systems, 24(2):229-264, June 1999.

[3] R. Khoussainov, T. O'Meara, and A. Patel. Independent proprietorship and competition in distributed Web search architectures. In Proceedings of the Seventh IEEE International Conference on Engineering of Complex Computer Systems (ICECCS 2001), pages 191-199, University of Skövde, Skövde, Sweden, June 11-13, 2001. IEEE Computer Society Press, Los Alamitos, California, USA.

[4] R. Khoussainov, T. O'Meara, and A. Patel. Advanced distributed search and advertising for the Web. In E. Leiss, N. Callaos, and J. Aguilar, editors, Web Computing Introduction. IIS Society, 2002. To appear.

[5] M. Little and D. McCue. Construction and use of a simulation package in C++. Technical Report 437, University of Newcastle upon Tyne, July 1993.

[6] E. Markatos. On caching search engine query results. Computer Communications, 24:137-143, Jan. 2001.

[7] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, USA, 1989.

[8] C. Sherman and G. Price. The Invisible Web: Uncovering Information Sources Search Engines Can't See. Independent Publishers Group, 2001.

[9] C. Silverstein, M. Henzinger, H. Marais, and M. Moricz. Analysis of a very large Web search engine query log. SIGIR Forum, 33(1), Fall 1999.

[10] D. Wolfram. A query-level examination of end user searching behaviour on the Excite search engine. In CAIS 2000: Dimensions of a Global Information Science, Proceedings of the 28th Annual Conference. Canadian Association for Information Science, University of Wisconsin-Milwaukee, 2000.
