A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS Fidel Cacheda, Francisco Puentes, Victor Carneiro Department of Information and Communications Technologies, University of A Coruña Facultad de Informática, Campus de Elviña s/n, 15.071 A Coruña, Spain {fidel, fpuentes, vicar}@udc.es
ABSTRACT
The performance evaluation of an information retrieval system is a decisive aspect in measuring the improvements in search technology. Our work intends to provide a framework to compare and contrast the performance of search engines in a research environment. To this end, we have designed and developed USim, a tool for the performance evaluation of Web IR systems based on the simulation of users’ behavior. This simulation tool contributes to the performance evaluation process in two ways: estimating the saturation threshold of the system and comparing different search algorithms or engines. The latter point is the most interesting because, as we demonstrate, comparisons carried out under different workload environments achieve more accurate results (avoiding erroneous conclusions derived from ideal environments). From a general point of view, USim is intended as an approach to new performance evaluation techniques specifically developed for Internet search engines.
KEYWORDS
Web Information Retrieval, performance evaluation, simulation.
1. INTRODUCTION
With the exponential growth of the Web there has also been a growing interest in the study of a variety of topics related to its use. There is special interest in finding patterns in, and computing statistics about, the users of Internet search engines, and several articles analyzing the query logs of commercial search engines have been published (Cacheda and Viña, 2001), (Jansen et al., 1998), (Kirsch, 1998), (Silverstein et al., 1999) and (Spink et al., 2002). These studies examine the search process followed by Web users: terms per query, top terms, operators and modifiers, sessions, etc.
Moreover, the performance evaluation of an Information Retrieval (IR) system is a decisive aspect in measuring the improvements in search technology, as can be seen from the Text Retrieval Conference (TREC) and the WEB-TREC conferences. Zobel, Moffat and Ramamohanarao (1996) describe guidelines for evaluating the performance of, and comparing, several indexing and retrieval techniques. The main criteria for the comparison are the following: the scalability of the system, the response time of the search process (which is perhaps the single crucial test for an indexing scheme), the disk space used by the index data structures, CPU time, disk traffic and memory requirements. Nevertheless, the response time is not an easily estimable parameter because it is the aggregate of many other parameters. The same problem arises in Web IR systems, and is further complicated because these systems must operate under different workload situations, especially with a high number of concurrent users. In fact, in the WEB-TREC one of the measures to be obtained is the response time of each request sent to the system (Hawking et al., 1999). However, these response times are computed in an ideal situation: without any workload on the system.
This work is focused on the development and testing of search engines in a research environment, in order to provide a framework to compare and contrast the performance obtained after a change in the IR system, where the change may refer to any part of the search engine: the index organization, the search algorithm, the system architecture or the system configuration. For this purpose, we have designed and developed USim, a tool for the performance evaluation of Web IR systems based on the simulation of users’ behavior, in order to compare more accurately the performance
of different search systems. Moreover, we tested this performance evaluation tool in a real environment in two ways: estimating the saturation threshold of the system and comparing the performance of different search algorithms or engines.
This paper is structured as follows. It starts with an overview of the related research. Next, we describe USim, the proposed simulation tool, and the following section details the results obtained in the performance evaluation of Web IR systems. Finally, the main conclusions are presented.
2. RELATED STUDIES
Recently there have been several studies that examine the behavior of Web search users while they are using an Internet search engine or a Web directory. The first study was performed by Kirsch (1998), who presented some search statistics of Infoseek usage. A bit later, Jansen et al. (1998) presented a study of queries, sessions and search terms obtained from the query logs of Excite. Spink et al. (2002) analyze the changes in the search topics for Excite over several years. Silverstein et al. (1999) examined a very large number of queries taken from AltaVista logs, studying not only the queries but also the co-occurrences among them. Cacheda and Viña (2001) investigated the queries, categories browsed and documents retrieved by the users of a Web directory. These studies show that most Web users enter few queries consisting of few search terms, have difficulty using effective keyword or Boolean queries and conduct little query reformulation. Also, Jansen and Pooch (2001) provide a framework for the development and comparison of future query log analyses, comparing the searching characteristics of Web users with users of other systems. More recently, Spink and Ozmultu (2002) analyzed the characteristics of question format web queries, and Ozmultu, Spink and Ozmultu (2002) explored a sampling technique for the correct statistical analysis of large data sets of web queries.
On the other hand, the importance of performance evaluation is well known, and it is fundamental to obtain the effectiveness and response time measures of an IR system. For traditional IR systems, this evaluation is performed using the methodology created in the TREC evaluation program undertaken by the US National Institute of Standards and Technology (TREC, 2004). In the case of Web IR systems, the WEB-TREC was created, allowing the measurement of speed (using the response time) and effectiveness (using the precision and recall parameters). But Web IR systems must operate under different workload situations, especially with a high number of concurrent users. Consequently, it is quite important to measure the response time of a Web IR system under different workload situations. Clearly, a search engine cannot be put into production before being evaluated, so no real workload situations (with real users) can be used. Therefore, it is fundamental to obtain, by means of the query log analysis of a Web IR system, a user profile that can be simulated, and then to examine how the simulation of different workloads (i.e. different numbers of requests) affects the response time of a search engine, in order to improve the performance evaluation methodology.
The user profile is based on the work by Cacheda and Viña (2001), where two basic conclusions are obtained for the simulation of the users’ behavior:
- The searches, categories browsed and documents visited fit an Exponential distribution, with the mean varying through time.
- There is a linear relationship between the number of searches, categories browsed and documents visited in a period of time.
3. USIM: A PERFORMANCE EVALUATION TOOL
USim (Users Simulator) is an application that simulates users’ behavior while using a search engine, building on the results of the works reviewed above. It has been designed to operate with any type of search engine using the HTTP protocol. USim sends multiple requests to an IR system in the same way that a group of users would do. Three types of requests are supported: searches, browsed categories and visited documents. The requests are sent following an Exponential distribution, as derived in the previous section, and the number of requests per minute is defined as a parameter. In this way, USim can generate different supervised workload environments over an IR system.
USim was designed and implemented for research purposes in order to evaluate the performance of a Web IR system in a LAN, before it becomes available on the Web. This guarantees that the network latency is negligible and so the times obtained measure only the search engine response times.
3.1 Design and implementation
USim was designed using an object-oriented methodology and fully developed in Java, in order to build a multi-platform application and facilitate its operation in any environment. The simulation tool is composed of three main modules that operate concurrently, associated with the three types of requests: searches, categories and documents.
The three types of requests are generated in a similar way. Each type of request is managed by its own process and, at the same time, each request is processed independently by its own thread. As described in the previous section, the time between two consecutive requests fits an Exponential distribution, and it is simulated using the inversion method over the distribution function of the Exponential distribution. A request is then generated, submitted to the search system, and the response is processed when it is received.
The generation of the request is handled in different ways depending on the type of request. In the case of the searches, a whole query must be created. Starting with the search string, the analysis carried out by Cacheda and Viña (2001) showed that the search strings examined did not fit Zipf’s law (Zipf, 1949), confirming the results of Jansen et al. (1998). Consequently, a mathematical model cannot be defined and an empirical distribution is used instead. USim was designed to operate with any empirical search string distribution; in our experiments the distribution obtained by Cacheda and Viña (2001) was used, consisting of 26,654 search strings with their respective frequencies. For the rest of the search parameters (e.g. the number of results displayed per page) the default values are used: the previous descriptive works by Jansen et al. (1998) and Cacheda and Viña (2001) conclude that the vast majority of users do not change the default value of any search parameter.
In the case of browsing a category or visiting a document, only an identifier is needed (of a category or a document, respectively). The best way to obtain a list of category or document identifiers is through the simulation process itself, because there is no mathematical distribution that fits the categories or documents visited by the users (Cacheda and Viña, 2001). Thus, the simulation tool obtains the category and document identifiers from the searching or browsing processes, and they are then stored in their respective caches. Each identifier has a finite lifetime in the cache, defined as a configuration parameter (typically the average length of a user’s session). When a request for a category or document is performed, the corresponding identifier is randomly selected from the cache.
Once the request is generated it is sent to the search engine using the HTTP protocol. The HTTPClient API (Tschalar, 2003) for Java was used for this purpose. The URL for each type of request and the names of the required parameters must be defined in the configuration of USim. When the HTTP response has been received, a simple and configurable parsing process is used to extract relevant information from the HTML document received, but only for search and category requests. As previously described, the category and document identifiers are extracted from this response and later stored in their respective simulation caches. In addition, two values are obtained from the result page: the number of categories and the total number of documents retrieved in the answer.
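As an illustration of the scheduling mechanism described above, the following minimal sketch generates Exponentially distributed inter-arrival times by the inversion method and launches each request in its own thread. The class and method names are ours and do not correspond to USim’s actual source code.

    import java.util.Random;

    // Minimal sketch of the request scheduling loop. Inter-arrival times are
    // drawn from an Exponential distribution via the inversion method:
    // if U ~ Uniform(0,1), then -ln(1 - U) / lambda ~ Exp(lambda).
    public class RequestScheduler {

        private final double requestsPerMinute;   // configured workload level
        private final Random random = new Random();

        public RequestScheduler(double requestsPerMinute) {
            this.requestsPerMinute = requestsPerMinute;
        }

        // Time until the next request, in milliseconds.
        private long nextInterArrivalMillis() {
            double lambdaPerMs = requestsPerMinute / 60000.0;
            double u = random.nextDouble();
            return (long) (-Math.log(1.0 - u) / lambdaPerMs);
        }

        // Submits requests for the given duration; each request runs in its own
        // thread so that a slow response does not delay the following requests.
        public void run(Runnable requestTask, long durationMillis) throws InterruptedException {
            long end = System.currentTimeMillis() + durationMillis;
            while (System.currentTimeMillis() < end) {
                Thread.sleep(nextInterArrivalMillis());
                new Thread(requestTask).start();
            }
        }
    }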
Finally, for each request sent to the IR system the following information is stored in an output file for subsequent analysis:
- Timestamp: date and time when the answer was received from the retrieval system.
- Request identifier: depending on the type of request, the search string or the category or document identifier.
- Response time: the time from when the request was sent until the response was received (the HTML document was completely downloaded).
- Images: the number of images included in the answer.
- Response time of images: the additional time needed to download the images.
In addition, for search and category requests, the number of categories and documents retrieved, obtained from the parsing process, is also stored.
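The exact output file format is not detailed in this paper, so the following sketch simply assumes one tab-separated line per request with the fields listed above; the field names and ordering are illustrative.

    import java.time.Instant;

    // Illustrative record for one simulated request (assumed format, not USim's).
    public record RequestLogEntry(
            Instant timestamp,          // when the answer was received
            String requestId,           // search string or category/document identifier
            long responseTimeMs,        // request sent -> HTML completely received
            int images,                 // number of images included in the answer
            long imageTimeMs,           // additional time needed to download the images
            int categoriesRetrieved,    // only for search and category requests
            int documentsRetrieved) {

        public String toLogLine() {
            return String.join("\t",
                    timestamp.toString(), requestId,
                    Long.toString(responseTimeMs), Integer.toString(images),
                    Long.toString(imageTimeMs),
                    Integer.toString(categoriesRetrieved),
                    Integer.toString(documentsRetrieved));
        }
    }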
Figure 1: General configuration for USim
Figure 2: Searches configuration for USim
3.2 Operation
In this section we briefly describe the main functionality of this simulation tool, with the aim of easing the understanding of the performance evaluation scenarios described in the next section. Firstly, it is important to mention that USim can operate with a user interface or in batch mode, using its own configuration files.
Figure 1 shows the graphical interface used to configure the general parameters related to the whole simulation process. Here the user can determine the length of the simulation and the lifetime in cache of the category and document identifiers gathered by USim during the simulation process. The configuration can be stored in order to use the simulation tool in batch mode.
When the simulation starts, the application checks which types of requests must be sent to the IR system, and each module is started independently. In this way, USim can be used with the main types of IR systems: Web directories, search engines and metasearch engines. If the system analyzed is a Web directory, USim must send searches and accesses to categories, whereas if the system analyzed is a search engine or a metasearch engine only searches must be sent. The module of visits to documents is included because some IR systems place an intermediate page between the search results and the final document, which also increases the load of the system.
The remainder of the user interface is used to configure the parameters of each type of request. These parameters are very similar for the three types of requests, so we only describe the search configuration (see Figure 2). The main parameters are the number of searches per minute and the URL of the search system. Other parameters are the names of the HTTP parameters needed to perform the search: typically the search string, the number of results to obtain and the position of the first result. The simulation tool generates the values associated with these parameters and invokes the search engine passing the parameters and the generated values, using either the GET or POST method. The searches file contains the empirical distribution of the search strings that will be used by USim, and the output file stores all the information for the search requests (timestamp, request identifier, response time and so on).
It is important to mention that the number of searches per minute is not static, but can change dynamically during the simulation using the parameters “Increase in” and “every minute”. This helps in the evaluation of Web IR systems under different workloads using only one simulation process. The configuration of categories and documents is quite similar. In this case, the number of categories browsed and documents visited per minute is automatically estimated using the linear relationship obtained in (Cacheda and Viña, 2001), although these and the rest of the parameters can be directly modified by the user.
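The dynamic workload increase controlled by the “Increase in” and “every minute” parameters could be implemented along the following lines; this is a sketch under our own assumptions about the ramp, not USim’s actual code.

    // Sketch of a stepwise workload ramp: the request rate starts at an initial
    // value and grows by a fixed increment after every configured step.
    public class WorkloadRamp {

        private final double initialPerMinute;   // e.g. 5 searches per minute
        private final double increasePerStep;    // "Increase in"
        private final long stepMinutes;          // "every minute(s)"
        private final long startMillis = System.currentTimeMillis();

        public WorkloadRamp(double initialPerMinute, double increasePerStep, long stepMinutes) {
            this.initialPerMinute = initialPerMinute;
            this.increasePerStep = increasePerStep;
            this.stepMinutes = stepMinutes;
        }

        // Current target rate: the initial rate plus one increment per completed step.
        public double currentRequestsPerMinute() {
            long elapsedMinutes = (System.currentTimeMillis() - startMillis) / 60000L;
            return initialPerMinute + (elapsedMinutes / stepMinutes) * increasePerStep;
        }
    }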
4. PERFORMANCE EVALUATION RESULTS
USim is a simulation tool designed and implemented for the performance evaluation of any type of Web IR system, especially for research purposes in a local environment with negligible network latency.
USim can be used and configured in two different ways to measure two different parameters of the performance of an IR system. The first one is the classical denial-of-service method, which measures the maximum number of requests supported by the system (named the saturation threshold). The second one is more interesting for research purposes because it measures the response time of a search engine under different workload situations. This leads directly to comparing the performance of variations of the same search engine, or even of completely different IR systems.
4.1 Saturation threshold
One of the critical measures for any Web IR system is the maximum number of requests supported in a minute. It is evident that, starting from a certain threshold, the performance of the system decreases suddenly, increasing the response times up to denial of service. This point is named the saturation threshold. Establishing the saturation threshold is fundamental in order to take preventive actions (such as application management techniques or the incorporation of new hardware) before denial of service occurs. In this case, USim can easily be configured to simulate the effect of multiple simultaneous users sending requests to the IR system in a controlled environment.
For this purpose, an experiment was designed in which the saturation threshold is measured for a prototype of a basic Web directory installed on a Sun Ultra Enterprise 250, with one 300 MHz CPU and 768 MB of main memory. This basic Web directory consists of approximately 1000 categories and more than 50,000 classified documents, and its architecture is described in (Cacheda and Viña, 2003). USim is configured to send requests to the IR system and to periodically increase the workload, measuring the response time of every request. The simulation tool is configured to start with 5 searches per minute (together with 4.1 browsed categories per minute and 7.5 viewed documents per minute). The number of searches per minute is increased by 1 every 10 minutes (with the equivalent increase in the number of categories and documents). The results are shown in Figure 3 and Figure 4.
The first graph (Figure 3) presents the response times of the searches performed on the system through the whole simulation. The picture is quite clear: above approximately 21 searches per minute the response times start increasing rapidly. At this point, every new query submitted to the system increases its load and worsens the situation, a condition that only stops when the number of requests per minute decreases. Figure 4 illustrates the number of error pages returned by the system: the IR system operates without errors until the number of searches exceeds 21 requests per minute. The simulation process also obtains the response times for the browsed categories and the viewed documents, but these graphs are quite similar and do not contribute any new relevant information. Therefore, using USim the saturation threshold is established at 21 searches per minute (with the respective values for the categories browsed and documents viewed).
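The saturation threshold itself is read off the resulting curves. As a hedged example, the output file could be post-processed as follows to flag the first workload level whose mean response time exceeds a chosen multiple of the lowest-workload mean; the factor of 5 is an arbitrary illustration, not a value used in the experiment.

    import java.util.Map;
    import java.util.TreeMap;

    // Illustrative post-processing of the simulation output: given the mean
    // response time observed at each workload level (searches per minute),
    // return the first level at which the system appears saturated.
    public class SaturationEstimator {

        public static double estimateThreshold(TreeMap<Double, Double> meanResponseByRate) {
            double baseline = meanResponseByRate.firstEntry().getValue();
            for (Map.Entry<Double, Double> entry : meanResponseByRate.entrySet()) {
                if (entry.getValue() > 5.0 * baseline) {
                    return entry.getKey();   // searches per minute where saturation starts
                }
            }
            return Double.POSITIVE_INFINITY; // no saturation observed in this run
        }
    }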
Figure 3: Response time (ms) vs. searches per minute
Figure 4: System errors vs. searches per minute
4.2 Performance comparison
The main goal of the performance evaluation of an IR system is to measure the response times in order to compare and analyze the effect of changes in the search system and to identify the real improvements obtained. Web IR systems operate under different workload levels over time, ranging from periods of low workload to situations of high workload or even saturation. Obviously, the performance of the system depends on the workload at each moment. Therefore, the performance evaluation must be done considering different workload situations in order to elaborate a more complete and representative study.
In the work described in (Cacheda and Viña, 2003), the performance of three different indexing techniques is compared. In that case, the performance of a type of search characteristic of Web directories, named restricted searches, was analyzed. A restricted search is a common search, but the results must belong to one category or any of its descendants. The first indexing technique uses a basic architecture based on inverted files and constitutes our baseline. The other two indexing techniques (named hybrid model with total information and hybrid model with partial information) are based on a hybrid model of signature files embedded into inverted files. The hybrid model with total information uses the hybrid data structure for all the categories of the Web directory, increasing the size of the index by 100%, whereas the hybrid model with partial information only applies the hybrid data structure to the categories of the first levels of the Web directory, reducing the size of the index by approximately 50%. For more details about the hybrid data model and the two variants defined, refer to (Cacheda and Viña, 2003) and (Cacheda and Baeza-Yates, 2004). Each of these search algorithms has been developed and tested on a Sun Ultra Enterprise 250, with one 300 MHz CPU and 768 MB of main memory.
The methodology for the performance comparison is based on two units of USim, which are executed simultaneously. The first one generates the workload on the IR system and the second one sends the restricted queries to test the performance of the system. The performance evaluation is performed over five different workload situations: null, low, medium, high and saturation. The first unit of USim was configured to generate these static workloads with the following average values:
- Null: 0 searches/minute, 0 browsed categories/minute and 0 viewed documents/minute.
- Low: 5 searches/minute, 4.1 browsed categories/minute and 7.5 viewed documents/minute.
- Medium: 12 searches/minute, 8.5 browsed categories/minute and 16.2 viewed documents/minute.
- High: 19 searches/minute, 12.9 browsed categories/minute and 24.9 viewed documents/minute.
- Saturation: 23 searches/minute, 15.4 browsed categories/minute and 29.9 viewed documents/minute.
For each workload, the first unit of USim sends requests to the IR system and, after a stabilization period, the second unit of USim is executed to measure the response time of the restricted queries, as sketched below. This second unit uses a reduced set of queries specifically designed to analyze the effects of some relevant parameters on the response time. In this case, two parameters are considered: the number of results obtained by the query and the number of documents associated with the restricted category.
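The two-unit methodology can be sketched as follows, reusing the hypothetical RequestScheduler class from Section 3.1; the background rate, stabilization period and measurement duration shown here are illustrative values, not the exact experimental settings.

    // Sketch of the two-unit evaluation: one instance generates the background
    // workload and, after a stabilization period, a second instance submits the
    // restricted queries whose response times are measured.
    public class PerformanceComparison {

        public static void main(String[] args) throws Exception {
            // Background workload, e.g. the "medium" level defined above.
            RequestScheduler background = new RequestScheduler(12.0);
            Thread workload = new Thread(() -> {
                try {
                    background.run(() -> { /* submit search/category/document request */ },
                                   60L * 60L * 1000L);       // run for one hour
                } catch (InterruptedException ignored) { }
            });
            workload.start();

            Thread.sleep(10L * 60L * 1000L);                  // assumed 10-minute stabilization

            // Measurement unit: submit the reduced set of restricted queries and
            // log their response times for the later analysis.
            RequestScheduler measurement = new RequestScheduler(5.0);
            measurement.run(() -> { /* submit restricted query and record response time */ },
                            30L * 60L * 1000L);               // assumed 30-minute measurement
        }
    }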
For the measurement unit, a set of eight queries retrieving from 0 to 2000 documents was selected, and three different categories (with 20000, 10000 and 5000 documents associated, respectively) were selected to restrict the queries.
Figure 5: Response time vs. number of results (null workload)
Figure 6: Response time vs. number of results (low workload)
Figure 7: Response time vs. number of results (medium workload)
Figure 8: Response time vs. number of results (high workload)
In Figure 5, Figure 6, Figure 7 and Figure 8 we show the results obtained for the comparison of the three algorithms. All the experiments were analyzed using an ANOVA test. Three factors are defined in the ANOVA: the type of model (basic, hybrid model with total information and hybrid model with partial information), the number of results retrieved by the query and the number of documents associated with the restricted category. Obviously, the number of results and the number of documents associated with the restricted category are relevant factors in the response time; the objective is to determine whether the type of model is also relevant, comparing the performance of the hybrid models against the basic model.
Figure 5 represents an ideal situation, and it is clear that the hybrid models perform much better than the baseline. In fact, if the query retrieves more than 500 results, the response times of the hybrid models are reduced by 50% with respect to the basic model. The ANOVA test considered the three factors analyzed to be relevant (R square = 0.988). The same situation is represented in Figure 6 and Figure 7, with low and medium workload, respectively. The behavior of the three algorithms is equivalent to the previous one, except that the response times are slightly higher (the three factors are also relevant in the ANOVA test, with R square = 0.923 and R square = 0.920, respectively).
But the situation changes in Figure 8, where a high workload is generated on the IR system. The ANOVA test still considers the three factors relevant, with a high R square (R square = 0.857). The most relevant aspect is that the performance of the hybrid model with total information deteriorates, performing similarly to the baseline, while the hybrid model with partial information keeps its 50% performance improvement over the baseline, and now also over the hybrid model with total information. The hybrid data structure defined in both hybrid models clearly seems to improve the performance of restricted queries; however, in a high workload environment the disk operations are the bottleneck, and so the hybrid model with total information, with its larger index, is penalized. The saturation workload is not described because its ANOVA test shows that only a small part of the variation in the response times is explained (R square = 0.536) and therefore its results are not significant.
This experiment demonstrates the importance of considering the workload in any IR system, and specifically in Web IR systems. Initially, both hybrid models behaved in a similar way, performing 50% better than the baseline. But, in the end, we found that only the hybrid model with partial information is able to keep the performance improvement under all circumstances, whereas the hybrid model with total information loses performance in high workload situations, due to its higher disk requirements.
5. CONCLUSION
With the emergence of Web IR systems, some new measures were defined for the evaluation of retrieval (relative precision, relevant/useful pages, etc.), but this must also involve the performance evaluation of these retrieval systems. Therefore, in this paper we have presented USim, a simulation tool for the performance evaluation of Web IR systems based on the simulation of users’ behavior. This simulation
tool helps in the performance evaluation of Web IR systems in two different ways: estimating the saturation threshold of the system and comparing the performance of different search algorithms or engines.
Establishing the saturation threshold is fundamental before any Internet search engine is put into production, because it estimates the maximum load that the system can bear before its performance degrades. So, before this point is reached, preventive actions can be taken to increase the processing capacity of the system or to avoid this threshold using application management techniques.
The second point is the most interesting because measuring the response time of an IR system has traditionally been fundamental in performance evaluation. However, Internet search engines must operate under different workload situations, with a high number of concurrent users. Response times must therefore be measured considering different workload situations; otherwise erroneous conclusions can be reached. The results obtained show that a comparison in a null workload environment alone is not enough: we observed how the performance of a search algorithm seemed appropriate in a low workload environment, whereas its performance dropped suddenly in a high workload situation.
From a general point of view, with USim we intend to provide an approach to new performance evaluation techniques specifically developed for Internet search engines. The use of simulation for the performance evaluation of Internet search engines seems promising, mainly because the response times can be estimated more accurately by considering different workload environments.
For further research, an interesting point is the extension of USim to operate on a WAN. At the moment, the main limitation of our simulation tool is that it operates on a LAN, where the network latency is negligible. On a WAN, the network latency must be estimated to obtain the actual search engine response time. In our future work we will also concentrate on improving USim in order to make it publicly available, trying to develop a more generic application suitable for any type of retrieval system. This implies that advances must be made in the extraction of information from the results Web pages (where the use of XML seems promising) and in handling the different parameters used in search engine URLs.
REFERENCES
Cacheda, F. and Viña, A., 2001. “Experiencies retrieving information in the World Wide Web”. In Proceedings of the 6th IEEE Symposium on Computers and Communications, Hammamet, Tunisia, pp. 72-79.
Cacheda, F. and Viña, A., 2003. “Optimization of Restricted Searches in Web Directories Using Hybrid Data Structures”. In Lecture Notes in Computer Science, Vol. 2633, pp. 436-451.
Cacheda, F. and Baeza-Yates, R., 2004. “An Optimistic Model for Searching Web Directories”. In Lecture Notes in Computer Science, Vol. 2997, pp. 394-409.
Hawking, D. et al., 1999. “Results and challenges in Web search evaluation”. In Proceedings of the 8th World Wide Web Conference, Toronto, Canada, pp. 243-252.
Jansen, B. and Pooch, U., 2001. “Web User Studies: A Review and Framework for Future Work”. In Journal of the American Society for Information Science and Technology, Vol. 52, No. 3, pp. 235-246.
Jansen, B. et al., 1998. “Real Life Information Retrieval: A Study Of User Queries On The Web”. In SIGIR Forum, Vol. 32, No. 1, pp. 5-17.
Kirsch, S., 1998. “Infoseek’s experiences searching the Internet”. In SIGIR Forum, Vol. 32, No. 2, pp. 3-7.
Ozmultu, H.C., Spink, A. and Ozmultu, S., 2002. “Analysis of large data logs: an application of Poisson sampling on Excite web queries”. In Information Processing and Management, Vol. 38, pp. 473-490.
Silverstein, C. et al., 1999. “Analysis of a Very Large Web Search Engine Query Log”. In SIGIR Forum, Vol. 33, No. 1, pp. 6-12.
Spink, A. and Ozmultu, H.C., 2002. “Characteristics of question format web queries: an exploratory study”. In Information Processing and Management, Vol. 38, pp. 453-471.
Spink, A. et al., 2002. “From E-sex to E-commerce: Web Search Changes”. In IEEE Computer, Vol. 35, No. 3, pp. 107-111.
TREC, 2004. Text REtrieval Conference, NIST, TREC home page. http://trec.nist.gov/.
Tschalar, R., 2003. HTTPClient. http://www.innovation.ch/java/HTTPClient/.
Zipf, G., 1949. Human behaviour and the principle of least effort. Addison-Wesley.
Zobel, J., Moffat, A. and Ramamohanarao, K., 1996. “Guidelines for Presentation and Comparison of Indexing Techniques”. In ACM SIGMOD Record, Vol. 25, No. 3, pp. 10-15.