DEVELOPING AN EFFICIENT MODEL FOR EVALUATING WWW SEARCH ENGINES I. DESPOTOPOULOS +, G. KORINTHIOS *, I. NASIOS *, D. REISIS *. *
University of Athens, Electronics Laboratory, Physics Dept. Panepistimiopolis, Building 4, 15784 Athens, Greece. +
National Technical University of Athens, Telecommunications Laboratory, Dept. of Electrical & Computer Eng. Polytechniopolis, Heroon Polytechniou 9, 15773 Athens, Greece.
Keywords: Web, search engines, Metadata, robots, evaluation.
ABSTRACT The purpose of this work is to design an efficient methodology for evaluating the performance of publicly available WWW search engines. The first goal is to develop a model representing WWW search engines, an objective that can only be accomplished by carefully studying the features of each engine examined. This paper presents the methodology developed for evaluating WWW search engines and, applying this model, evaluates a sample set of popular search engines. The second goal of this work was to study how each search engine indexes web pages in its database, and thus to identify how a web page should be prepared in order to be appropriately indexed by the search engines and, as a result, to rank highly in queries for information relevant to its content.
INTRODUCTION The explosive growth of the Internet has rendered the World Wide Web the primary tool for information retrieval today. However, the amount of information published is constantly increasing, which makes it impossible for anyone to monitor every change. A number of commercially available search engines have been developed to deal with the problem of indexing and retrieving published information. As a result, an evaluation of the most popular search engines acquires increasing importance, especially if it helps answer questions concerning both the way they work and the accuracy of the results they provide.
The evaluation with respect to accuracy was accomplished by developing a set of twenty (20) queries that were posed practically simultaneously to the search engines. The results were then studied and evaluated with a carefully designed process; the methodology was chosen so as to be as objective as possible. The indexing mechanisms were identified by carefully studying the hits returned by each search engine, enabling the enumeration of the means of indexing information for each of the studied engines. The results derived from this activity are presented in order to help information publishers prepare their documents appropriately, so as to be highly visible on the Web. This paper includes some novel features with respect to relevant research activities in the recent bibliography. It is the first study of the features of publicly available WWW search engines that covers both which engine is more efficient and which fields of an HTML document each engine favours when indexing information. An attempt was made to unify the two studies and adopt a common approach to processing the results. Another novelty of this research is the classification of the queries into three (3) different categories: the engines were evaluated with quoted text, natural-language queries, and queries with Boolean operators. Results were derived from the performance in each category, and general results from the overall performance were acquired as well. A further value-added feature of this research is the categorisation of the point assignment, and the exploitation of the respective results, for two potential search-engine user profiles: the easily pleased and the hardly pleased user. Finally, this study is the first attempt in the relevant bibliography to conduct an experiment and
measure the means by which search engines index information, that is, to demonstrate with figures the favouritism each search engine shows towards particular HTML fields. This paper evaluates search engines that meet the following criteria:
• they are robot-driven, i.e. they utilise special-purpose software to visit web servers and classify their contents
• they are well known and frequently used by the majority of Internet users
• they maintain large databases in which the robot findings are indexed
• they allow the submission of queries using advanced syntax rules
Issues such as response time, user-friendliness of the interface, and ease of query syntax and submission are not evaluated.
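The dual-profile point assignment mentioned above (the easily pleased and hardly pleased user, with the point values given in the evaluation steps later in the paper) can be sketched as follows. This is an illustrative sketch, not software used in the study; the function name and the duplicate-mark encoding are assumptions.

```python
# Sketch of the dual-profile scoring scheme used in this study.
# Marks: A = highly relevant, B = fairly relevant, C = irrelevant,
# X = broken link; a duplicate (D) scores mark / number of duplicates.

POINTS = {
    "easily_pleased": {"A": 5.0, "B": 3.0, "C": 1.0, "X": 0.0},
    "hardly_pleased": {"A": 5.0, "B": 2.0, "C": 0.5, "X": 0.0},
}

def score_hits(marks, profile):
    """Total points for one query's hits under a given user profile.

    `marks` is a list such as ["A", "B", ("D", "A", 2)], where a
    duplicate is encoded as (D, original_mark, number_of_duplicates).
    """
    table = POINTS[profile]
    total = 0.0
    for mark in marks:
        if isinstance(mark, tuple):           # duplicate page
            _, original, n_dup = mark
            total += table[original] / n_dup  # mark / number of duplicates
        else:
            total += table[mark]
    return total
```

For example, the same hit list ["A", "A", "B", "X"] yields 13 points for the easily pleased user but only 12 for the hardly pleased user, which is exactly the spread the two profiles are meant to expose.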
It is also important to note that each set of queries was submitted to the search engines within the shortest possible time period (the experiments were carried out over a period of ten days). Moreover, each specific query was posed to all search engines at the same time, so as to avoid favouring any of them.
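Posing one query to all engines at the same time, as described above, can be sketched with a simple thread pool. The sketch is an assumption about how one might automate this today; the study itself predates such tooling, and the `fetch` callable standing in for the actual HTTP request is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def query_all_engines(engines, fetch, query):
    """Submit `query` to every engine at (practically) the same time.

    `engines` maps an engine name to whatever `fetch` needs (e.g. a
    URL template); `fetch(engine_spec, query)` performs the request.
    Running the requests concurrently avoids favouring any engine
    with an earlier (and possibly fresher) index snapshot.
    """
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fetch, spec, query)
                   for name, spec in engines.items()}
        return {name: f.result() for name, f in futures.items()}
```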
For the purposes of the two experiments presented in this study, the following search engines were considered (listed alphabetically): AltaVista, Excite, HotBot, InfoSeek, Lycos, WebCrawler.

SEARCH ENGINE ACCURACY Evaluation of the accuracy of the results that the search engines return to a query was performed according to the following steps:
• A set of 20 queries, divided into three categories, was carefully selected. The same set of queries was submitted to each of the six search engines.
• The first ten hits returned by each engine were evaluated with respect to the following criteria: relevance or irrelevance of content, broken links, and duplication of hits. Marks were assigned as follows: A for highly relevant pages; B for fairly relevant pages or pages containing hyperlinks to A pages; C for irrelevant pages (albeit containing some of the queried keywords); X for broken links; and D for duplicate pages.
• The results were processed using a hypothetical user model. Users were classified into two categories: easily-pleased users and hardly-pleased users. Different points were assigned to each mark for the two user types: (A, B, C, X, D) = (5, 3, 1, 0, mark/number of duplicates) for the first type and (5, 2, 0.5, 0, mark/number of duplicates) for the second type.
• The results can be presented in four ways: the total marks before the score-assigning procedure (fig1); the processed scores separately for each query category; the processed scores cumulatively for the entire query set (fig2, fig3); and the percentage differences between the easily-pleased and the hardly-pleased user.

SEARCH ENGINE INDEXING MECHANISM Indexing of web pages by the engines considered in this study takes place according to information that the robots find in the Title section of the page, the contents of the Meta tags, or the content of the Body of the page; combinations of the above are also possible. To help draw conclusions about the indexing mechanism used by each search engine, the following steps were taken:
• The same set of 20 queries, as in the previous phase, was submitted to each of the six search engines.
• The first 10 hits returned by each engine were retained for evaluation, and scoring was determined with respect to the absolute position of the query string within the source text.
• Inclusion of the query string in the aforementioned parts of the document was awarded by marking them accordingly.
The criterion used for grading each search engine was whether the submitted query text was included in the Body, Title, or Meta part of the HTML source (fig4).
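The field check described above can be sketched with Python's standard `html.parser` module. This is a minimal illustration under stated assumptions: it matches the query string case-insensitively in the Title text, in `content` attributes of Meta tags, and in Body text; the search engines of the period used their own robots and parsers, not this code.

```python
from html.parser import HTMLParser

class FieldChecker(HTMLParser):
    """Record in which HTML fields (Title, Meta, Body) a query occurs."""

    def __init__(self, query):
        super().__init__()
        self.query = query.lower()
        self.found = {"Title": False, "Meta": False, "Body": False}
        self._in_title = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag == "meta":
            # Check description/keywords in the content attribute.
            content = dict(attrs).get("content") or ""
            if self.query in content.lower():
                self.found["Meta"] = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        text = data.lower()
        if self._in_title and self.query in text:
            self.found["Title"] = True
        if self._in_body and self.query in text:
            self.found["Body"] = True

def check_fields(html, query):
    """Return which of Title/Meta/Body contain the query string."""
    checker = FieldChecker(query)
    checker.feed(html)
    return checker.found
```

Running the checker over each retained hit, and tallying the fields per engine, yields the field-favouritism counts of the kind shown in the figures.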
CONCLUSIONS This paper presented a uniform approach to both evaluating the accuracy and identifying the indexing methods of Web search engines. The main advantage of the model presented is the objective manner in which the results are evaluated, introducing a set of novelties with respect to previous attempts described in the relevant bibliography. The processed results can be utilised both by information seekers, helping them decide which search engine better serves their needs, and by information publishers, in selecting the appropriate way of exposing 'highly visible' documents on the Web. Future work could include software tools that evaluate and identify the indexing methods of any search engine in a more automated way.
Figure 1: Total amount of marks
Figure 2: Points assignment, easily-pleased user
Figure 3: Field favouritism (per engine: AltaVista, Excite, HotBot, InfoSeek, Lycos, WebCrawler; fields: Title, Meta, Content)
Figure 4: Points assignment, hardly-pleased user