Journal of Documentation, Vol. 56, No. 2, March 2000 © Aslib, The Association for Information Management. All rights reserved. Except as otherwise permitted under the Copyright, Designs and Patents Act 1988, no part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying or otherwise without the prior written permission of the publisher.
PROGRESS IN DOCUMENTATION

THE EVALUATION OF WWW SEARCH ENGINES

C. OPPENHEIM*, A. MORRIS, C. McKNIGHT and S. LOWLEY
{c.oppenheim, a.morris, c.mcknight}@lboro.ac.uk
Department of Information Science, Loughborough University, Loughborough, Leics LE11 3TU

*To whom all correspondence should be addressed.
The literature of the evaluation of Internet search engines is reviewed. Although there have been many studies, there has been little consistency in the way such studies have been carried out. This problem is exacerbated by the fact that recall is virtually impossible to calculate in the fast-changing Internet environment, and therefore the traditional Cranfield type of evaluation is not usually possible. A variety of alternative evaluation methods has been suggested to overcome this difficulty. The authors recommend that a standardised set of tools be developed for the evaluation of web search engines so that, in future, comparisons can be made between search engines more effectively, and variations in the performance of any given search engine over time can be tracked. The paper itself does not provide such a standard set of tools, but it investigates the issues and makes preliminary recommendations of the types of tools needed.
INTRODUCTION
The growth of the World Wide Web (also known as WWW, or simply the web) has been one of the most remarkable developments in the history of documentation. In less than ten years, it has grown from an esoteric system for use by a small community of researchers to the de facto method of obtaining information for millions of individuals, many of whom have never encountered, and have no interest in, the issues of retrieving information from databases. Within a very short period, search engines, generally funded by advertising revenue, have proliferated, often claiming to provide better retrieval of material from the web than other engines. Many are used intensively. Indeed, arguably, services such as AltaVista are much better known than services used by librarians and information scientists, such as Dialog. Although these search engines search an enormous volume of information at apparently impressive speed, they have been the subject of widespread criticism. The reasons include slow response time, the retrieval of duplicate records, the
failure to retrieve relevant items and the vast amount of irrelevant items retrieved. Articles have appeared [1] giving advice on how to fine-tune searches to reduce the chances of excessive numbers of hits. In addition, the engines do not provide clear search instructions [2], and in some cases where the engines rank the output according to likely relevance, the ranking software creates bizarre lists. Other complaints are that search engines fail to return information on a topic that is known to be present on the web. This is combined with, ironically, the complaint that a search provides too much irrelevant or duplicated information, or out-of-date links. Often there is little or no information on what the links are, and no clear explanation of the search results is given. ‘Good stuff is hard to find’ sums up how many people feel about searching for information on the web. The fundamental problems of the web are its size, heterogeneity and inconstancy [3]. Resources keep moving and increasing. Despite the impressive speed of search engines, effective information retrieval is uncommon [4]. There is no consistency or easy transferability between services, as noted some years ago by Winship [5], though of course the meta-search engines have succeeded in overcoming the transferability issue. The content, organisation, comprehensiveness and search features of search engines vary. A great many search engines exist; some are large and incorporate multiple features, while others are smaller and offer limited functionality. Each engine, depending on the size of its database and its abilities, provides a service to its users different from all others.

In this paper, a survey of the literature of the evaluation of web search engines is reported, with the aim of identifying some of the issues and problems in this rapidly changing field. The survey shows that research on evaluating search engines has not been carried out in a consistent fashion, and, as a result, it is impossible to compare the performance of search engines as reported by different researchers. This paper does not consider the performance of search engines except in the retrieval of text. There is a need for a new generation of search engines that search the multimedia content of the web [6], but this is an issue that has not yet been addressed by the search engine owners.

THE TYPES OF SEARCH ENGINE
Search engines can be divided into four categories: robots, directories, meta-search engines and software tools. Inevitably, some search engines combine characteristics of more than one of these categories. Other authors [7] have classified the engines in a very similar way: as robots, directories, meta-search engines, geographic-specific resources and subject-specific resources.

Robot-driven search engines
‘Crawler’ or ‘worm’ programs generate databases by means of web robots. These robots are programs that reside on a host computer and retrieve information from sites on the web using standard protocols. In effect, they automatically travel the Internet, following links from documents and collecting information about the resources they come across according to the HTML structure of each document (such as the URL, document title and keywords in the text). No doubt the different search
engines follow different algorithms to index the information on the web and to output results to a user’s query. Most of the search engines do not divulge the algorithms they use for searching and ranking output, for obvious commercial reasons. Some certainly give a heavy weighting to HTML meta tags, but others deliberately ignore them because their developers believe webmasters can and do mislead users about content by using inaccurate tags. All results are indexed and stored in a database. However, indexing is not always comprehensive and the resources listed are not completely duplicated by different search engines because they start from different base documents. According to Search Engine Watch [8], a web site devoted to web page developers and end users: ‘a good search engine should gather more than a few pages from each web site it visits. The more pages it gathers, the more likely it will have a comprehensive index of the web. It should also make regular visits to the web sites in its index. This keeps the index fresh’. However, it also points out that users are often unaware of the comprehensiveness or currency of the material that search engines have indexed. Examples of such search engines include some of the best-known engines, such as AltaVista, Excite, Lycos, HotBot and Infoseek. These robot search engines appear to have been the most studied by evaluators.
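The harvesting cycle described in this subsection can be illustrated with a minimal sketch. It is not the code of any particular engine; the breadth-first queue, the simple title/link parsing and the ten-page limit are assumptions made purely for illustration.

```python
# Minimal sketch of a robot ('crawler') harvesting cycle: fetch a page,
# record its URL and <title>, extract links and queue them for later visits.
# Real engines add politeness delays, robots.txt checks, duplicate detection
# and far richer parsing; those details are omitted here.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(seed_url, max_pages=10):
    """Breadth-first crawl from a seed URL, returning simple index records."""
    queue, seen, index = deque([seed_url]), {seed_url}, []
    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # broken link or timeout: skip this page
        parser = PageParser()
        parser.feed(html)
        index.append({"url": url, "title": parser.title.strip()})
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index
```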
Directory-based search engines

These engines are also known as subject collections or as ‘subject gateways’ and are collections of links to relevant URLs created and maintained by subject experts, or by means of some automated indexing process. Some of these services also include a robot-driven search engine facility, but this is not their primary purpose. They rely mainly on people to identify and group resources [5]. These directories offer access to information that has been classified into certain categories. Hypertext links lead the user to the most appropriate URL. There is an implication that only high-quality URLs are provided. Such engines typically search manually created structured metadata (such as Dublin Core) rather than the full text of the target web sites. Yahoo!, SOSIG, EEVL, Biz/Ed and UKDirectory are examples of such directories.
Meta-search engines

These search engines utilise databases maintained by other engines. A meta-search engine accepts a single query from the user and sends it to multiple search engines. The array of web databases is accessed (sometimes in parallel and sometimes in sequence) to return the hits from each. After the underlying search engines return retrieved items, they are further processed and presented to the user. The functionality of these engines depends largely on the performance of those engines, mentioned above, that create and manage their own databases. They are attractive because users do not have to visit multiple search engines and re-enter their search terms. The most popular meta-search engines include MetaCrawler, Dogpile and ProFusion.
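The fan-out-and-merge pattern can be sketched as follows. fetch_results() is a hypothetical placeholder, since each real engine has its own request format and result pages, and the de-duplication shown is only the simplest possible approach.

```python
# Sketch of the meta-search pattern: send one query to several underlying
# engines in parallel, then merge and de-duplicate their result lists.
from concurrent.futures import ThreadPoolExecutor


def fetch_results(engine_name, query):
    """Placeholder for querying one underlying engine; returns a list of
    (url, title) tuples. A real implementation would issue an HTTP request
    and parse that engine's result page."""
    return []


def metasearch(query, engines=("engine_a", "engine_b", "engine_c")):
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = pool.map(lambda e: fetch_results(e, query), engines)
    merged, seen = [], set()
    for results in result_lists:
        for url, title in results:
            if url not in seen:          # drop duplicates found by several engines
                seen.add(url)
                merged.append((url, title))
    return merged
```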
Software tools
These tools are sometimes classified as ‘browsing companions’ or ‘browser searchbots’. They are similar to meta-search engines but need to be installed on the user’s workstation. One of their features is that results can be saved to hard disk for subsequent retrieval. They may also be able to check for broken links and prevent duplication of results. Most of these products are shareware and require payment. Examples of such software tools include Websleuth, Copernic 98, f.Search 1.3.1, Query-N Metasearch and Mata Hari.

The vast majority of evaluations of web search engines have been carried out on the robot engines. This is hardly surprising, as they are the most popular tools for searching the web. However, in principle any of the search engines can be evaluated.

METHODOLOGIES FOR THE EVALUATION OF SEARCH ENGINES
Introduction
Articles reporting the evaluation of search engines have become popular with readers as the web has grown. Some of the more significant papers that have appeared are discussed below. However, this article is not intended to be a comprehensive review of all such papers. Interested readers should also consult web-based sources of such reviews [9].

Schwartz [10] reviewed the way web search engines operate and noted the growing body of evaluation literature, much of it not systematic in nature. She concluded that performance effectiveness is difficult to assess in the context of the web. She also concluded that significant improvements in general-content search engine retrieval and ranking performance may not be possible, and she argued that they are probably not worth the effort. However, search engine providers have introduced some rudimentary attempts at personalisation, summarisation and query expansion. She suggested there might be a trend towards smaller resource collections with rich metadata and navigation tools.

Leighton and Srivastava [11] noted that much comparative research into the effectiveness of search engines arrives at conflicting conclusions as to which services deliver superior precision. In addition, many studies were either based upon small test suites or did not report their methodology. They considered that bias in evaluating web pages for relevance is a major problem for studies of search engines. They stated that an unbiased suite of queries must be developed to test search engines objectively. Fair methods for searching the services and evaluating the results of the searches are necessary. They noted that at many points in the design methodology it is possible subtly to favour one service over another. They argued that conscious or unconscious bias must be guarded against. This criticism, of course, applies to information retrieval evaluation tests in more traditional environments as well.

One of the problems associated with the evaluation of search engines is the fact that they are constantly changing and developing their search mechanisms and user interfaces. In addition, every month sees the launch of new search engines. Combining these facts with the constantly changing content of the web, it is clear that no specific evaluation of a search engine is likely to remain valid for any
length of time. Consequently, most research projects rightly emphasise that their results are only indicative of the performance of those search engines at that time. However, this does not in any way invalidate the idea of evaluation to assess engines currently in use, nor does it invalidate the idea of developing standardised evaluation techniques. However, so far there has been a reluctance to adopt standardised evaluation approaches. Each paper has its own idiosyncratic approach.

The paradigm for evaluating the effectiveness and performance of information retrieval systems is arguably the Cranfield model [12]. This method of evaluation uses the well-established measures of recall and precision. Although the use of these measures is often controversial (see, for example, the many debates about what is ‘relevance’ [13]), the assessment of most information retrieval systems would normally not be undertaken without at least an acknowledgement of the importance of recall and precision. The measurement of precision is in theory straightforward; it simply requires an individual, or group of individuals, to inspect the output from a search and to sort the output into two groups of documents: relevant and not relevant. On the other hand, measurement of recall requires that the individual or group of individuals also have access to the complete set of documents that was searched, or at the very least a representative sample of these documents. This poses problems for most evaluations, but especially when evaluating the effectiveness of web search engines. The nature of the web, as noted by Chu and Rosenthal [14], means it is impossible to calculate how many potentially relevant items there are for any particular query in the huge and ever-changing web system. The number of potentially relevant documents not retrieved by a search engine therefore cannot be estimated with any accuracy.
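In set notation, with R the set of documents in the collection that are relevant to a query and A the set of documents retrieved by the engine, the two Cranfield measures are usually written as below. The difficulty described above is that, on the web, |R| is unknowable, so the denominator of recall cannot be established.

```latex
\mathrm{precision} = \frac{|R \cap A|}{|A|}, \qquad
\mathrm{recall} = \frac{|R \cap A|}{|R|}
```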
A typology of evaluation methods

There are a number of possible ways to address this issue:
1. undertake a Cranfield type study on a very tightly defined topic where it is known that there are a very small number of relevant web pages, and that the researcher already knows each of them;
2. undertake a Cranfield type study but employ relative recall rather than absolute recall. Relative recall is calculated by adding together all the relevant hits found by a series of searches, assuming this set represents the universe of relevant items, and then assessing a given search engine’s performance in retrieving this set of pages (a short sketch of this calculation follows the list). It is worth noting that the idea of using relative recall in evaluating web search engines has been the subject of a critical review [15];
3. find some way of estimating the number of likely relevant pages by sampling the entire web in a statistically valid way. Then undertake a Cranfield type study on this sample of web pages, and estimate recall;
4. avoid recall as a measure altogether. However, recall is an obviously important measure, and adequate substitute measures need to be considered.
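A minimal sketch of the relative recall calculation in option 2 follows. The data structure (a mapping from each engine to the set of URLs it retrieved that were judged relevant) and the example figures are illustrative assumptions, not taken from any of the studies reviewed.

```python
# Relative recall: pool the relevant URLs found by all engines for a query,
# treat the pool as the universe of relevant items, and score each engine
# against that pool rather than against the (unknowable) true relevant set.
def relative_recall(relevant_by_engine):
    """relevant_by_engine maps an engine name to the set of URLs it
    retrieved that were judged relevant."""
    pool = set().union(*relevant_by_engine.values())   # surrogate universe
    if not pool:
        return {engine: 0.0 for engine in relevant_by_engine}
    return {engine: len(urls) / len(pool)
            for engine, urls in relevant_by_engine.items()}


# Illustrative figures only:
print(relative_recall({
    "EngineA": {"u1", "u2", "u3"},
    "EngineB": {"u2", "u4"},
}))   # pool has 4 items: EngineA scores 0.75, EngineB 0.5
```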
These different approaches are described in more detail below. It should be noted that the designers of web pages could attempt to influence the results of any web engine search [16]. Web page producers include particular
words, and repeat words within the text of the pages so that the page will be ranked higher when retrieved by web search engines. Sometimes they include ‘white on white’ words that do not appear on any screen as text, but that are embedded and provide words for the search engines to retrieve. Sometimes, as noted earlier, they put in misleading meta tags. This type of activity is bound to distort any evaluation of search engines.

Evaluations using a very tightly defined topic
Both Dezelar-Tiedman [17] and the ‘Erdos’ project [18] undertook searches for a relatively small number of known items on the web. In the latter case, the retrieval capabilities of six search engines were investigated by a case study of a query on Paul Erdos, a Hungarian mathematician. In this work, the author examined all 6,681 documents that the search engines identified. He assumed that the 6,681 hits represented the entire universe of web sites of potential relevance. This permitted calculation of precision, the overlap between the engines and an estimated recall for each of the search engines. The precision of the engines was high, the overlap was minimal and recall was very low. It is clear that this approach can only be used on a few, restricted topics, and that the results cannot necessarily be extrapolated to all searches run by a particular search engine.

Cranfield type evaluations on larger topics
In practice, the majority of the evaluations of search engines have involved recall and precision, or surrogates of these measures such as relative recall [19]. However, in addition, many other criteria have been used at various times.

Leighton and Srivastava [11] evaluated five robot search engines. Between 31 January and 12 March 1997, these were compared for precision on the first twenty results returned for fifteen queries. The authors rated the services based on the percentage of results within the first twenty returned that were deemed relevant or useful. The queries used in the research were either ones that users had actually asked at a university library reference desk or were taken from another study. This method reduced the bias that choice of query could produce. The authors suggest that the choice of search expression was perhaps the weakest point in the design of their study. They were particularly concerned with avoiding bias and ‘blinded’ the evaluators; during the evaluation process, the evaluator did not know which search service had returned the result. The authors only used unstructured or natural language queries. Friedman’s randomised block design was used to perform multiple comparisons for significance. Their analysis showed that, overall, AltaVista, Excite and Infoseek performed best. Correspondence analysis showed that Lycos performed better on short, unstructured queries, while HotBot performed better on structured queries.

Chu and Rosenthal [14], like many other researchers, used sample search queries that were based upon real reference questions. These queries were structured to evaluate the search engines’ abilities to deal with a variety of query syntax, for example different Boolean logic searches. The authors tested the first ten results from three search engines for precision and concluded that AltaVista outperformed Excite and Lycos in both search facilities and recall, although Lycos
had the largest claimed coverage of web resources. Another factor that was taken into consideration was the response time of the search engines, which, perhaps surprisingly, did not vary between the engines tested.

Other researchers have conducted similar experiments using different queries and analysing varying numbers of the results for relevance. Ding and Marchionini [20] considered the first twenty hits for precision using five queries. The search engines were all searched within twenty minutes of each other. Gauch and Wang [21] used twelve queries on a number of major search engines and reported precision figures for the first twenty hits. Tomaiuolo and Packer [22] studied the precision of the first ten hits from 200 queries, but did not list the query topics or the exact expressions or operators used. Links may not have been visited to check for broken links. Also, in common with many researchers, the criteria for relevance were not given. Westera [23] checked the first five and last five results (or 100–104 if more than 100 results) listed by the search engines she studied. She only evaluated the relevance of the results from Boolean logic searches. Leighton and Srivastava [11] criticised Westera’s research for only using five queries, which were all on the subject of wine. On the other hand, as already noted, one legitimate approach to the evaluation of search engines is to assess their performance in retrieving information in a small, clearly defined subject area.

In July 1998, the National Technical University of Athens, on behalf of the European Commission, completed an extensive research project [24] on search engine evaluation. This research aimed to evaluate the accuracy, coverage and indexing mechanisms of robot-driven search engines by means of objective measurement. The purpose of the research was: to provide an aid to users wanting to search the Internet; to indicate the most effective means of constructing web pages for developers to facilitate retrieval; and to study the operation/weighting methods as well as the cataloguing and indexing methods of search engines. Each engine’s performance was analysed according to the type of given query. The methodology comprised two phases. Phase one involved ranking the results from each engine according to the accuracy of response, variety and relevance. This was achieved by a scoring system, which involved awarding points for positive aspects and penalising for negative aspects of the results. Phase two attempted to identify the criteria used by the search engines to match a specific query. This was achieved by analysing the title, meta tags and keyword content of the results retrieved by each engine.

Schlichting and Nilsen [25] demonstrated a methodology that could be used to compare search engines and future intelligent resource discovery tools objectively. The focus of their research was on the quality of the results. Queries were formulated according to the real needs of five users, who described the information that they were looking for as well as offering keywords. Only the first ten hits from four engines were included in the experiment. Such a small sample clearly limits the effectiveness of their proposed methodology, which would have to be tested on much larger numbers to be proved robust and effective. Nonetheless, this research is important because it compared popular WWW search engines (AltaVista, Excite, InfoSeek and Lycos) and because it employed signal detection theory to yield two scores for each search engine.
One score measured the sensitivity of the search engine in finding useful information. The other score measured how conservative or liberal the search engine was in determining which sites to include in its
output. The use of signal detection theory [26] has been recommended from time to time for use in evaluating information retrieval systems, but this seems to have been the first time it has been used in evaluating Internet search engines. The researchers claimed their results demonstrated an overall poor performance of the current technology. They concluded: ‘Specifically, only Lycos displayed a respectable score. The negative values for Infoseek and AltaVista are especially troubling. They suggest that the performance of the search engines in this experiment is too poor to fit the standard assumptions of signal detection analysis. One central assumption is that the distribution of sites containing relevant hits is more likely to be indexed by the search than those containing irrelevant links.’
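Schlichting and Nilsen do not give their formulae in the passage summarised here. The textbook signal detection calculation, assumed below purely for illustration, derives a sensitivity score (d′) and a criterion score (how liberal or conservative the engine is about including sites) from the proportions of relevant and irrelevant sites returned.

```python
# Standard signal detection measures: d' (sensitivity) and c (criterion,
# i.e. how liberal or conservative the engine is about including sites).
# This is the textbook calculation, not necessarily the exact one used in
# the study discussed above.
from statistics import NormalDist


def signal_detection_scores(hit_rate, false_alarm_rate):
    z = NormalDist().inv_cdf          # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(false_alarm_rate)
    criterion = -0.5 * (z(hit_rate) + z(false_alarm_rate))
    return d_prime, criterion


# Illustrative figures only: an engine returning 60% of relevant sites and
# 20% of irrelevant ones has positive sensitivity and a conservative bias.
print(signal_detection_scores(0.60, 0.20))
```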
Gonçalves et al. [27] report a controlled experiment that measured the effect on retrieval effectiveness and efficiency of varying the maximum number of keywords used to represent documents in web search engines. They developed the idea of the ‘centroid’, the number of words extracted from a document that are subsequently used for indexing. As they note, centroid size varies from a few words (title only) to all the document’s words, plus meta tags and the like. The larger the centroid, the higher the recall, but the more expensive it is for the search engine to index and the slower and more expensive the retrieval. The authors undertook their experiment to measure recall, precision and computational cost for a variety of centroid sizes. They created a set of frequently visited web pages. They measured relative recall (the proportion of relevant documents retrieved with one search engine amongst all relevant documents found using all the search engines and strategies) rather than absolute recall. In their experiment, as the centroid size increased from twenty to 100, recall grew from 7.5% to 18%, and precision grew from 26% to 35%. However, the storage space required increased significantly. The authors concluded, albeit on somewhat sketchy evidence, that the ideal centroid size is 100.

Meghaghab and his co-workers [28, 29] examined the effectiveness of five World Wide Web search engines: Yahoo!, WebCrawler, Infoseek, Excite and Lycos. The study involved five queries that were run against each of the engines in original and refined formats, a total of fifty searches. The queries varied in terms of broadness, specificity and level of difficulty in finding Internet resources. The results of the study revealed that Yahoo! had the highest precision ratio for both original and refined queries. InfoSeek maintained second place with respect to refined queries, and Lycos with respect to original queries. Query refinement resulted in a higher precision ratio. The addition of promising pages, selected by the researchers as likely to be relevant, increased precision in all queries and across all engines.

In an important paper, Clarke and Willett [19] compared the effectiveness of AltaVista, Excite and Lycos using thirty different searches. The paper is important both because of the critical evaluation of earlier research that it provides, and because it offers a realistic and achievable methodology for evaluating search engines. The authors developed a method for comparing the recall of the three sets of searches, despite the fact that they were carried out upon non-identical sets of web pages, because each search engine has indexed a different set of documents. They developed an algorithm for calculating relative recall by checking
how many of the items judged relevant in one search engine’s output were present at all in the universe of documents covered by the other search engines. Following Chu and Rosenthal, Clarke and Willett scored relevance as 1 for relevant, 0.5 for partially relevant and 0 for irrelevant. However, they developed their own schema for scoring these points. The 0.5 score was given if a page consisted of a series of links that led to one or more relevant pages. Duplicate sites (same URL, same content) were scored as 0, whilst mirror sites (different URL, same content) were surprisingly not considered duplicates and were scored as if they were unique. Error messages, such as ‘file not found’, were scored 0, as were sites that could not be opened because of excessively slow response times. Foreign language pages were ignored for the purposes of the study. Clarke and Willett established that the mean precision for the three engines ranged from 0.25 (Lycos) to 0.46 (AltaVista). These differences were statistically significant. Mean relative recall ranged from 0.56 (AltaVista) to 0.66 (Excite) but the differences were found not to be statistically significant. Large differences were found for coverage, with Lycos performing far worse than Excite, which in turn was far worse than AltaVista. The difference between Lycos and Excite was not statistically significant, but the others were. Based upon this test, it is clear that AltaVista is significantly better at retrieval performance than either Lycos or Excite, a result in accord with other studies.
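The scoring scheme just described translates naturally into a small scoring function. The category labels below paraphrase Clarke and Willett’s rules, and taking mean precision as the average score over the hits examined is an assumption about their aggregation, not a statement of it.

```python
# Graded relevance scoring in the style described above: 1 for a relevant
# page, 0.5 for a page consisting of links that lead to relevant material,
# 0 for irrelevant pages, duplicates (same URL, same content), error messages
# and pages too slow to open. Mirror sites (different URL, same content) are
# simply scored on their own relevance, i.e. they are not treated as duplicates.
SCORES = {"relevant": 1.0, "links_to_relevant": 0.5,
          "irrelevant": 0.0, "duplicate": 0.0, "error": 0.0}


def mean_precision(judged_hits):
    """judged_hits is a list of category labels, one per hit examined."""
    if not judged_hits:
        return 0.0
    return sum(SCORES[label] for label in judged_hits) / len(judged_hits)


# Illustrative only:
print(mean_precision(["relevant", "links_to_relevant", "irrelevant", "error"]))  # 0.375
```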
Cranfield type studies involving statistical estimates of the size of the web

Researchers at the NEC Research Institute undertook what appears to be the largest search evaluation to date [30]. The same 575 queries were run on the robot search engines HotBot, AltaVista, Northern Light, Excite, Infoseek and Lycos. Hits were counted; duplicate pages were not counted, the maximum query limits for each engine were not exceeded, and other controls were used to normalise the results across services. Another control was to discard any page from the count if the exact search term did not appear on that page. So if the search were for ‘crystal’, then a page would not be counted unless the exact word ‘crystal’ was found. HotBot covered the most, and its coverage was used as a baseline for estimating search engine size. For example, AltaVista found only 81% of the pages that HotBot found; the researchers then arbitrarily presumed that its total page coverage was 81% of the size of the HotBot page coverage. Finally, with a size estimate for the entire web, they returned to calculate the percentage of the web covered by each search engine, which was straightforward division. HotBot had the best score: 110 million out of 320 million web pages, or 34% of the web. Lycos, estimated to have ‘only’ 8 million web pages, came in last with 3% coverage. The researchers acknowledge that the results may have been slightly distorted by the choice of queries, which were mainly academic and scientific. This study has received wide publicity. It clearly demonstrates that none of the existing search engines covers the whole web. The corollary is that the results from all web searches using a single engine should be treated with caution and, ideally, several engines should be used to achieve a reasonable coverage.

Bharat and Broder [31] also attempted to estimate the total size of the web. They describe a standardised, statistical way of measuring search engine coverage and overlap through random queries. Their work involved experiments showing size and overlap estimates for HotBot, AltaVista, Excite and Infoseek as
percentages of their total joint coverage in mid-1997 and in November 1997. They estimated that in November 1997 the numbers of pages indexed by HotBot, AltaVista, Excite and Infoseek were roughly 77 million, 100 million, 32 million and 17 million respectively, and that their joint total coverage was 160 million pages. They further conjectured that the size of the web was over 200 million pages at that time. The most startling finding was the small overlap; only 2,200,000 pages were indexed by all four engines. Similar NEC work can be found in Lawrence and Giles [32, 33].

Evaluations that avoid recall
Feldman [34] discussed the trends in web search engines based on the results of ten test queries using robot engines such as AltaVista, Excite, Lycos, HotBot, Infoseek, Northern Light, Savvy Search, Inference Find and Ask Jeeves. He reported an improvement in performance over one year earlier, in terms of search results, user interfaces, query formulation tools, search results presentation, search syntax standardisation and intelligent agents. His paper provided a table comparing the features of all the services tested. He also noted the market challenges for these services, with an emphasis on revenue generation, and reported five strategies commonly used to boost revenues. He concluded with seven barriers to effective use from the end-user perspective: over-engineered interface designs; broken links; duplicate entries; ignorance of search engine default functionality; lack of integration across applications; fleeting archives; and slow web response times.

Kimmel [35] also discussed the robot search engines. He provided details on the coverage, record content, search features and results of a given query for nine search engines. There are many other similar articles to be found with varied test suites. Tunender and Ervin [36] examined the indexing and retrieval rates of five search tools on the web. Character strings were ‘planted’ in the authors’ web site. Daily searches for the strings were performed for six weeks using Lycos, Infoseek, AltaVista, Yahoo! and Excite. The HTML title tag was indexed in four of the five search tools. They noted insufficient and inconsistent indexing amongst the engines.

An intriguing alternative methodology for measuring the effectiveness of search engines is one based on estimated (or expected) search length (ESL), an evaluation tool first suggested in 1968 [37], but rarely used. This was suggested by Harter and Hert [12] as an attractive alternative to recall and precision for Internet search engine evaluation. Only one paper of which we are aware reports work using this measure [38] for the evaluation of search engines. ESL takes the place of recall and precision, and instead calculates the cost paid by a user in terms of the number of sites that must be looked through before sufficient relevant items are found to satisfy the query. The researchers compared ESL to recall and precision by running searches on eight search engines. The authors believe that ESL is a viable tool for evaluating search engines.

In June 1995, Winship [5] compared four search tools (World Wide Web Worm, WebCrawler, Lycos and Harvest) and two directory-based engines (Galaxy and Yahoo!). He evaluated these not according to recall or precision. Instead, he considered the following criteria: content (size = quoted items; types of source
indexed: URL telnet, gopher, ftp, WWW; title headings, summary excerpts, full text, size, users able to submit URLs); search interface (use of colour, drop-down menus); search features (Boolean operators AND OR NOT implied, choice, truncation, adjacency proximity operators, approximate match, specify source types, searchable fields, URL, headings); and output of results (record includes lines of text). He also examined ‘performance’, but by this term he simply meant ‘number of items retrieved’. Even this simplistic approach to performance demonstrated some surprising differences, including some search engines that retrieved zero items. He emphasised that these tools are in fairly continuous development and therefore the conclusions could only reflect the situation at the time of testing.

Stobart and Kerridge [2] evaluated search engines’ facilities, user interface, underlying technology and future growth potential. They also compared search engines’ functionality and performance from the perspective of the web user by a survey using a web questionnaire.

EVALUATION CRITERIA
By examining the research reported, a number of criteria used by researchers for the evaluation of web search engines can be identified.

Number of web pages covered and coverage
Two studies estimated the total size of the web, and hence the coverage of a number of search engines. Both studies revealed that no single search engine indexes everything on the web. These results cast some doubt on the validity of using relative recall as a measure of engine retrieval effectiveness. Stobart and Kerridge [2] found that only 23% of respondents were aware of UK-based search engines, and that 78% of these respondents would have liked to limit their searches to country domains, for example to the UK. The reasons that users stated for using overseas search engines in preference to those offered in the UK included the UK engines’ poor searching mechanisms, interface design and facilities, and their lack of information and global perspective.

Freshness/broken links
The NEC researchers [30] aimed to rate the ‘freshness’ as well as the size of each search engine’s index, and showed the percentage of ‘bad links’ amongst the pages returned. Some commentators claim that such percentages are hard to quantify without the painstaking research of physically verifying the existence of each page listed. However, random sampling of pages combined with suitable statistical tests should suffice for assessing the proportion of invalid URLs. Nasios et al. [24] considered that broken links are due to malfunctions of the search engine, and that researchers should therefore examine the ability of each engine to produce results that are free of broken links.
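A sampling approach of the kind suggested above might look like the following sketch. The sample size, the use of HEAD requests and the normal-approximation confidence interval are all assumptions chosen for illustration.

```python
# Estimate the proportion of broken links in a search engine's index by
# checking a random sample of indexed URLs, with an approximate 95%
# confidence interval (normal approximation).
import random
from urllib.request import Request, urlopen


def is_broken(url, timeout=10):
    try:
        urlopen(Request(url, method="HEAD"), timeout=timeout)
        return False
    except OSError:   # HTTP errors, DNS failures and timeouts all derive from OSError
        return True


def broken_link_rate(indexed_urls, sample_size=200):
    sample = random.sample(indexed_urls, min(sample_size, len(indexed_urls)))
    p = sum(is_broken(u) for u in sample) / len(sample)
    margin = 1.96 * (p * (1 - p) / len(sample)) ** 0.5
    return p, margin   # e.g. (0.08, 0.038) means 8% +/- 3.8 percentage points
```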
Relevance

This topic is, of course, the subject of an entire corpus of literature. A study of end users [39] who were evaluating search results obtained by intermediaries showed
eleven types of reasons that users give when rating search results as valuable, including perceived relevance of results, completeness, output format, time, etc. Nahl [40] concluded: ‘a search engine is perceived in the context of the information content it gives access to. If the results are plentiful of pages that are of interest to the searcher, then the search engine is considered useful and easy’. Most researchers use a simple ‘relevant’ versus ‘not relevant’ criterion in these studies. There are, however, exceptions: Nasios et al. [24] assigned one of five marks to each hit. ‘A’ corresponded to the best possible result. ‘B’ corresponded to a fairly relevant result that partially or superficially covered the query theme or contained a link to an A-type page. ‘C’ corresponded to an irrelevant hit. ‘X’ indicated the search engine’s failure to retrieve a web page due to server error or broken links, and ‘D’ indicated duplicated hits.

MacCall [41] has argued that the simple relevance criteria used hitherto are not appropriate in cyberspace, and that we need a revised model for relevance based on a combination of recent relevance research and the dynamic nature of cyberspace. Clarke and Willett’s classification [19] was described earlier. It is an example of the use of relative recall commended by Chowdhury [42]. Leighton and Srivastava [11] established a draft set of criteria for categorising the links. However, these draft criteria required adjustments as the links were evaluated, to take into account the nature of the subjects involved.

The research of Schlichting and Nilsen [25] differs in its approach to relevance from other studies. They allowed each of the users, who offered real queries, to judge the usefulness of hits. These were scored on a scale of 1 to 7, with 7 being the most useful. The assessment of the performance of the search engines also took into account the ‘unreported links’, that is, irrelevant and relevant sites not found by the engines but found by others. Recent work in this Department [43] on the evaluation of the web involves asking searchers to assign a percentage relevance score to each hit. Search engine effectiveness is then calculated using an algorithm that combines the number of hits rated 100%, the percentage score for each hit and the total number of items retrieved. Humphries and Kelly [39] evaluated only the first five links returned by search engines in a set of medical searches. The hits were scored on a scale of 0 to 4, in the following manner: (0) error: some sort of network error prevented retrieval of the article; (1) unrelated (that is, irrelevant); (2) uninformative; (3) informative; and (4) authoritative.

Search syntax
Nasios et al. [24] divided the queries into three categories of different styles of syntax. These categories included ‘phrases’ (within quotation marks), natural language using multiple words (with no quotes or operators), and Boolean operators and special signs. The research found that the results from different types of queries differ for each search engine; the user ought therefore to choose a search engine according to the type of query to be posed. Humphries and Kelly [39] chose queries that represented five categories of query: single term; two-word phrase; three-word phrase; Boolean AND; and numbers/acronyms. However, they did not compare the results of different types of
query on one topic; they simply used the most appropriate search strategy for the particular topic they were searching on. Their results, therefore, do not make it possible to compare the efficiency of these different types of search statements.

Subject areas/choice of query
Surprisingly, this is a field that has not been much studied. It seems intuitively reasonable to assume that some search engines are stronger in some subject areas than others. Indeed, subject gateways are by definition strongest in their chosen subject field. Some researchers have attempted to generalise from their research and draw conclusions about a search engine’s overall performance in all subject areas. This is clearly unsatisfactory. Nasios et al. [24] attempted to include a variety of subject areas so that the research would not be too general, but also not strictly confined to EC projects. The choice of some of the queries, which covered specifically recent research areas, was aimed at testing the currency of the databases of each search engine.

Bharat and Broder [44] avoided using queries from query logs to reduce possible bias in favour of the interests of a particular group of users. They stated that random queries give a more uniform sampling. They employed 20,000 random queries, framed automatically from a lexicon of 400,000 words drawn from a wide range of web topics and built from the vocabulary of 300,000 pages present in the Yahoo! hierarchy. The queries were used to sample random pages from the index of each search engine. Each page was then checked for containment within the other engines. From the overlap estimates, the authors derived estimates for the coverage of each engine and, based on the known size of AltaVista, a lower bound for the size of the web (the logic of this estimate is sketched below). Their estimate was that as of March 1998 there were at least 275 million distinct, static pages on the web; according to their results, the web comprised 125 million pages in mid-1997 and grew to about 275 million distinct pages by March 1998. They tried to count only pages that are distinct in content, so, for instance, if the same page appeared in two locations, it was counted only once.

Lebedev [45] focused on the best search engines for finding scientific information on the web. He undertook two searches using various search engines at different points in time with a mixture of scientific terms as queries. He then compared the results to see which search engine offered the most information on the query and also which database had added new material in the period between searching.
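The logic of the Bharat and Broder overlap estimates can be sketched as follows. In practice the containment fractions would come from the random-query sampling just described, and the figures in the example are invented.

```python
# Sketch of the overlap-based size comparison: sampling random pages from
# engine A and checking containment in engine B estimates |A ∩ B| / |A|,
# and vice versa. The ratio of the two containment fractions gives the
# relative sizes of the two indexes, and a known index size turns the
# ratios into absolute estimates.
def relative_size(frac_a_pages_in_b, frac_b_pages_in_a):
    """Return the estimated ratio size(A) / size(B).

    frac_a_pages_in_b ~ |A ∩ B| / |A|   (from sampling pages of A)
    frac_b_pages_in_a ~ |A ∩ B| / |B|   (from sampling pages of B)
    """
    return frac_b_pages_in_a / frac_a_pages_in_b


# Illustrative only: if 35% of A's sampled pages are found in B, but 70% of
# B's sampled pages are found in A, then A is estimated to be twice B's size.
ratio = relative_size(0.35, 0.70)
known_size_b = 100e6                 # suppose B's index size is known
print(ratio, ratio * known_size_b)   # 2.0 and an estimate of 200 million pages for A
```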
The changing nature of the web

A page may radically change after a search engine indexes it. Listings obtained from a search may therefore reflect a situation days, weeks or even months ago; the actual page in the meantime may have changed content or may no longer exist. Likewise, some sites deliver pages tailored to search engines’ spiders. A human being visiting the site would see a completely different page. Neither the search engines nor the web itself provide controlled environments for conducting research [46]. Westera [23] compared the results of the same searches done in January 1996 and again in October 1996. In contrast, Nasios et al. [24] processed the same
query concurrently and ensured that each set of queries was given to the search engines in the briefest possible time period (within ten days). Leighton and Srivastava [11] also emphasised that queries should be executed on all the search engines at the same time. If not, then a relevant page could be made available between the time one engine was searched and the time a second was searched. The second search engine would then have the unfair advantage of being able to index and retrieve the new page. However, when a few queries were executed a week later, Leighton and Srivastava [11] noted that the first twenty results returned by a search engine for a given search expression were the same (with some minor differences in order). They checked the links to sites that gave a ‘603’ (server not responding) or a ‘Forbidden’ error several more times over a period of several weeks before recording broken links. All the search engines were searched for a given query on the same day and mostly within half an hour of each other. Bar-Ilan [18] carried out the searches six times during a period of two months to ensure validity. He aimed to learn about the changes in the number of retrieved documents over the period of time that the searches were carried out.

Response time
Monitoring the response time of search engines is a difficult task since network traffic and performance vary according to times of the day. Nasios et al. [24] stated that their research project was unable to define objective criteria to consider this aspect of search engine performance. Chu and Rosenthal [14] found little difference between response times in the search engines that they evaluated.

Different system features
The varied features of search engines influence the choice of users. The University of Albany [47] offers a table that indicates the most appropriate search engines to use if a user wants a particular feature. It offers forty-two different options, according to users’ choice of search and other needs.

Search options
Stobart and Kerridge [2] asked users which query syntax they used most frequently. The majority used simple searches, with complex searches rated as the second most used facility. Other searches conducted included hierarchical, random and the hot quick search lists.

Human factors and interface issues
This broad heading covers a number of different aspects that have been explored by researchers. It has been strongly argued that, in future, web search engines should be evaluated from a user’s perspective [48]. The way that users judge the interface of a search engine is subjective, as noted by Nasios et al. [24]. A user-friendly environment is likely to help users to get the desired results: for example, a menu of options or a template in addition to the query box can assist users who may be unfamiliar with creating effective search syntax. Evaluating search engines in terms of their interface is not only a subjective task, but also one where the results are likely soon to be out of date owing to the
evolving nature of the web and its search facilities. However, the well-established principles of interface design will surely apply to all aspects of web search engine interface design.

Another aspect is the ability of search engines to cater for those with a visual disability. Visually impaired users access the web via a screen reader, which is often combined with a voice synthesiser (or Braille output). They are dependent upon text-based interfaces, which are being phased out. However, some screen readers can work with graphical browsers [49]. The lack of uniform standards for page design on the web does not facilitate the writing of programs to aid users who are visually impaired. There are also difficulties for users in operating some of the screen access programs [50]. Preliminary research evaluating the responses of visually impaired subjects to Internet search engines is under way in this Department [51].

Another aspect is the display of results. Golovchinsky [52] measured the effectiveness of the simultaneous display of several documents retrieved by a given query. The experimental results suggest that viewed recall increases with increasing numbers of articles displayed on the screen simultaneously. He also concluded that users’ decision-making strategies appear to be independent of user interface factors. Chu and Rosenthal [14] stated that two search engines offered different output options for web records whose contents were exclusively ‘extracted’ from original web documents. It is not unusual to encounter cases where a so-called abstract or summary suddenly ends in the middle of a sentence. Some search engines may use automatic abstracting techniques to provide more information for users. The features to be assessed in a web record include the URL, outline, keys, abstract, description and other related data. Content displayed may be either redundant or of little practical value. For instance, an outline always repeats the title of a web document, and an abstract consists of the first twenty lines or 20% of the document, whichever is smaller.

Everyone has different expectations and searching strategies. Many researchers have found that evaluation of search engines must take these factors into account. Traditional approaches to information retrieval evaluation do not explicitly consider the user or do so in an extremely artificial way [12]. Harter and Hert considered that the question of how to deal with system/user interaction is a problem with the evaluation of any online IR system [12]. The issue has been addressed to a greater or lesser extent in some evaluations, and some IR systems are explicitly based on a model of user needs, such as those systems based upon Belkin’s ASK (Anomalous State of Knowledge) theory. It is clear from the papers noted below that some effort has been made to incorporate users in evaluation processes and/or to consider evaluation from the users’ perspectives. Stobart and Kerridge [2] asked users about their choice of search engine. The results revealed that the choice of engine was dictated by speed of access for the largest group (26%) of users. Other factors that influenced choice were the size of repository, habit, accuracy of data, user-friendliness and the interface. Nasios et al. [24] attempted to take into account the potential differences in the background of end users as well as to conduct experiments that reflect the diversity of ways in which a user may try to locate information.
The former was achieved by developing a hypothetical user model, which was used to process the 204
Results were judged according to whether they would satisfy one of two types of user: an 'easily-pleased user' or a 'hard-to-please user'. Several syntactic variants were created for each query in order to test the engines' handling of the range of styles in which users may search. Nasios et al. [24] found that AltaVista and HotBot did best overall, followed by InfoSeek and Excite; WebCrawler and Lycos performed relatively poorly. HotBot was found to be good for phrase searching, but less effective for natural language and Boolean queries. Excite scored evenly in all categories. Finally, InfoSeek showed a lack of ability in handling Boolean queries. Nasios et al. found that the nature of the specific query had considerable influence on search results. However, the performance of the different search engines was sufficiently close that it was difficult to conclude that one engine was clearly better than another.
Nahl [40] conducted research to understand the novice's search experience, involving content analysis of structured self-reports and interviews with users, but the study did not attempt to discover the differences between users and search services. The qualitative research involved users rating (on a seven-point scale) their self-confidence, stress level, satisfaction, usefulness and understanding of the topic. It was found that ease of use and fast response time are important elements in self-confidence, stress level and satisfaction. Nahl suggested that further research is required to discover what user characteristics and system factors limit operations.
Su et al. carried out a series of searches using staff and students at the University of Pittsburgh to assess five criteria [53]. Whilst one of them, relevance, is based on the Cranfield model, the others (efficiency, utility, user satisfaction and connectivity) were more user-centred. Whilst Lycos performed best on the traditional measure, AltaVista scored better for user satisfaction and the other user-centred criteria.
Many research papers concentrate on comparing the features and capabilities of search engines in a closed environment. Keily [54] stated that many people do not search effectively. She suggested that the relatively intuitive interfaces of some of the search engines, which take into account the way that people search on average, prevent greater disappointment with results retrieved from the web via search engines.
Quality of abstracts
In what appears to be a pioneering study, Wheatley and Armstrong [55] evaluated the quality of abstracts retrieved by twelve Internet search engines, including five subject gateways, and compared these to the quality of abstracts in three traditional online databases. The criteria for evaluation included length and readability. They concluded that subject gateways gave better abstracts than either robot engines or online databases.
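The readability criterion is not elaborated above. Purely as an illustration, a standard readability formula such as the Flesch reading ease score could be applied to each retrieved abstract, as in the following sketch; the formula, the vowel-group syllable heuristic and the sample abstract are assumptions, not a description of Wheatley and Armstrong's procedure.

    # Illustrative sketch only: scoring the length and readability of an abstract.
    # Assumptions: the Flesch reading ease formula stands in for whatever measure
    # Wheatley and Armstrong used; syllables are approximated as vowel groups.
    import re

    def count_syllables(word):
        # Crude heuristic: one syllable per group of consecutive vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_reading_ease(text):
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        if not sentences or not words:
            return 0.0
        syllables = sum(count_syllables(w) for w in words)
        return (206.835
                - 1.015 * (len(words) / len(sentences))
                - 84.6 * (syllables / len(words)))

    # Hypothetical abstract returned by a search engine.
    abstract = ("This page reviews the evaluation of web search engines. "
                "Recall is difficult to measure in a changing environment.")
    print(len(abstract.split()), "words, Flesch score",
          round(flesch_reading_ease(abstract), 1))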
FUTURE DEVELOPMENTS
Schlichting and Nilsen [25] suggested that future research should measure the effectiveness of different search strategies, that is, contrasting constructions of search queries, within single search engines. They also considered that a follow-up study was needed
to test whether examining only the first ten sites retrieved provides a valid measure of effectiveness. This could be achieved by using a computer to track eye movements in order to observe users' actions and responses to results. Eye tracking studies are well established in many areas of psychology [56].
CONCLUSIONS
Many comparative research projects have been undertaken in the field of evaluating WWW search engines; indeed, Winship publishes a regularly updated chart of these on the web [57]. Various factors are taken into account when comparing the effectiveness of search engines. Some of these are subjective, such as interface design and the presentation of results, and, of course, the relevance of the hits. Chu and Rosenthal [14] consider that the evaluation methodology for web search engines ought to include assessment of the following:
● composition of web indexes, including coverage, update frequency and the portions of web pages indexed (e.g. titles plus the first several lines, or the entire web page);
● search capability, including Boolean logic, phrase searching, truncation and limiting facilities (e.g. limit by field);
● retrieval performance, based on precision, recall (not tested) and response time, with extra caution concerning relevance judgements;
● output options, based on quantitative and qualitative analysis;
● user effort, based on analysis of documentation and interface.
Many research projects have focused on the performance of search engines in terms of the relevance of results, despite the difficulties with recall as a factor. Other methodologies used to evaluate search engines emphasise other aspects, such as:
● human factors: the range of ability and needs of different users;
● choice of queries: objective methodology;
● query syntax: categorised to cover the range of options;
● results: duplicated and broken links (a checking sketch follows this list).
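The last of these aspects lends itself to automated checking. The following sketch is illustrative only: the result URLs are invented, duplicates are detected by a deliberately crude normalisation, and any failed request is treated as a broken link; a real study would also need politeness delays and more careful URL canonicalisation.

    # Illustrative sketch only: counting duplicate and broken links in a result
    # list. The URLs are invented; duplicates are detected by a crude
    # normalisation, and any request failure or HTTP error counts as broken.
    from urllib.error import URLError, HTTPError
    from urllib.parse import urlsplit
    from urllib.request import Request, urlopen

    def normalise(url):
        parts = urlsplit(url)
        return parts.scheme + "://" + parts.netloc.lower() + parts.path.rstrip("/")

    def check_results(urls):
        seen, duplicates, broken = set(), 0, 0
        for url in urls:
            key = normalise(url)
            if key in seen:
                duplicates += 1
                continue
            seen.add(key)
            try:
                # HEAD keeps traffic low; servers that reject it are counted
                # (conservatively) as broken links.
                urlopen(Request(url, method="HEAD"), timeout=10)
            except (HTTPError, URLError, OSError):
                broken += 1
        return duplicates, broken

    hits = ["http://www.example.org/page", "http://WWW.EXAMPLE.ORG/page/"]
    dups, dead = check_results(hits)
    print(dups, "duplicate and", dead, "broken links out of", len(hits), "hits")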
The methodologies reviewed in this paper vary in many ways. Some researchers chose to study only the first five to twenty results retrieved by search engines, whereas others analysed every result retrieved. There appears to be scope for research in the area of comparing search engines for use by novice and advanced users. Nahl's [40] research focused on novice users' experience of using search engines for the first time and without guidance. Another area in which there appears to have been little academic research is the suitability of search engines for covering various subject areas. Each engine, depending on the size of its database and its abilities, provides a different service to end users. An alternative way of approaching research into the effectiveness of search engines could be to assess the needs of the user, not only in terms of user expertise but also their subject requirements, and then to match those factors to the most appropriate search engine.
The ability to handle different types of search syntax deserves more work, in particular to follow up the work by Nasios et al. [24] showing that each engine deals with different types of query syntax with differing degrees of efficiency. Schlichting and Nilsen [25] emphasised that it is important that users choose the search engines most suited to their subject domain. They attempted to establish an objective method of evaluating the effectiveness of resource discovery in isolation. They took the first ten hits into account, and assessed search engines' hit rate, false alarm rate, sensitivity and response bias.
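Their detailed calculations are given in [25]; by way of illustration only, the following sketch computes the standard signal detection quantities (hit rate, false alarm rate, sensitivity d′ and the response criterion) from invented tallies of relevant and irrelevant hits, and is not a reproduction of Schlichting and Nilsen's method.

    # Illustrative sketch only: the standard signal detection quantities that
    # Schlichting and Nilsen report (hit rate, false alarm rate, sensitivity,
    # response bias), computed here from invented tallies over ten-hit lists.
    from statistics import NormalDist

    def z(p):
        # Inverse standard normal CDF, clipped to avoid infinities at 0 and 1.
        return NormalDist().inv_cdf(min(max(p, 0.005), 0.995))

    def signal_detection(hits, misses, false_alarms, correct_rejections):
        hit_rate = hits / (hits + misses)
        fa_rate = false_alarms / (false_alarms + correct_rejections)
        d_prime = z(hit_rate) - z(fa_rate)              # sensitivity
        criterion = -0.5 * (z(hit_rate) + z(fa_rate))   # response bias
        return hit_rate, fa_rate, d_prime, criterion

    # Hypothetical judgements pooled over a set of queries for one engine.
    print(signal_detection(hits=32, misses=8, false_alarms=12, correct_rejections=48))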
OUR RECOMMENDATIONS
It is clear that much of the research is inconsistent in method and approach, which reflects the immaturity of this field of study. There is a clear need for a set of benchmarking tests for web search engines; the idea of such benchmarking is well established in other areas of electronic information [58, 59]. It is suggested that, at the very minimum, tests on web search engines should include the following criteria:
1. precision;
2. relative recall, using the Clarke and Willett method of calculation [19] (a sketch of one way to compute criteria 1 and 2 follows this list);
3. speed of response, tested several times each day, and the overall time taken for the test;
4. consistency of results over an extended period;
5. proportion of dead or out-of-date links;
6. proportion of duplicate hits;
7. overall quality of results as rated by users;
8. evaluation of the GUI for user friendliness;
9. helpfulness of the help facilities, and variations of the software for naïve and experienced users;
10. options for the display of results;
11. presence and intrusiveness of adverts;
12. coverage, using the Clarke and Willett method of calculation [19];
13. estimated/expected search length;
14. length and readability of abstracts;
15. search engine effectiveness, using the Back and Summers method [43].
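To indicate how criteria 1 and 2 might be operationalised, the following sketch computes precision over the first n hits and recall relative to the pooled relevant items found by all engines tested, an approach in the spirit of, though not necessarily identical to, the Clarke and Willett calculation [19]; the result lists and relevance judgements are invented.

    # Illustrative sketch only, in the spirit of Clarke and Willett [19]:
    # precision over the first n hits, and recall relative to the pool of
    # relevant items retrieved by any of the engines tested. Data invented.
    def precision_at_n(results, relevant, n=10):
        top = results[:n]
        return sum(1 for url in top if url in relevant) / len(top) if top else 0.0

    def relative_recall(results, relevant, pooled_relevant):
        found = {url for url in results if url in relevant}
        return len(found) / len(pooled_relevant) if pooled_relevant else 0.0

    # Hypothetical result lists and relevance judgements for a single query.
    engine_results = {
        "EngineA": ["u1", "u2", "u3", "u4"],
        "EngineB": ["u2", "u5", "u6"],
    }
    relevant = {"u1", "u2", "u5"}
    pool = relevant & {u for hits in engine_results.values() for u in hits}

    for name, hits in engine_results.items():
        print(name,
              "precision:", round(precision_at_n(hits, relevant), 2),
              "relative recall:", round(relative_recall(hits, relevant, pool), 2))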
Tests should be run using the following types of search terms (a sketch of generating such query variants follows the list):
1. single words;
2. phrases;
3. Boolean AND and OR logic.
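The following sketch shows one way such query variants might be generated automatically for a test topic; the quoting and operator conventions are assumptions, since the engines of the period differed in the syntax they accepted.

    # Illustrative sketch only: generating the three benchmark query types from
    # a list of topic terms. The quoting and operator syntax shown is a common
    # convention; individual engines accept different operators.
    def build_queries(terms):
        return {
            "single words": list(terms),
            "phrase": ['"' + " ".join(terms) + '"'],
            "Boolean AND": [" AND ".join(terms)],
            "Boolean OR": [" OR ".join(terms)],
        }

    for style, variants in build_queries(["search", "engine", "evaluation"]).items():
        print(style + ":", variants)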
Such research should be based upon the first n records examined by naïve and experienced users; the size of n is still a matter of debate. The date of the test and the subject areas searched should always be recorded. Then, when researchers wish to evaluate one or more search engines, they would select one or more of these benchmark tests to carry out the evaluation.
In this way, comparisons between different search engines, and longitudinal studies of a particular search engine, can be carried out with a degree of consistency that has hitherto been lacking.
ACKNOWLEDGEMENTS
We wish to acknowledge the help of Karen Bradley in identifying some of the articles referred to in this paper.
REFERENCES
1. Teutenberg, F. Effective searching in the World Wide Web: search engines and search techniques. Wirtschaftsinformatik, 39(4), 1997, 373–385.
2. Stobart, S. and Kerridge, S. An investigation into World Wide Web search engine use from within the UK JISC-funded project undertaken by UKERNA. 1996. http://osiris.sunderland.ac.uk/sst/se/results.html (visited 30 July 1999).
3. Furner, J. IR on the Web: an overview. VINE, 104, 1997, 4.
4. Gudivada, V.N. et al. Information retrieval on the World Wide Web. IEEE Internet Computing, September–October 1997, 65.
5. Winship, I.R. World Wide Web searching tools: an evaluation. VINE, 99, 1995, 49–54. Also available at: http://bubl.ac.uk/journals/lis/oz/vine/n09995/winship.htm (visited 30 July 1999).
6. Amato, G., Rabitti, F. and Savino, P. Multimedia document search on the Web. Computer Networks and ISDN Systems, 30(1–7), 1997, 604–606.
7. Jenkins, C., Jackson, M., Burden, P. and Wallis, J. Searching the World Wide Web: an evaluation of available tools and methodologies. Information and Software Technology, 39, 1998, 985–994.
8. Search Engine Watch. http://www.searchenginewatch.com/ (visited 30 July 1999).
9. Reviews of search engines: an in-depth comparison. http://www.dis.strath.ac.uk/business/searchindepth.html; Wilder, R. Evaluating search engines. http://www.foley.gonzaga.edu/search.html (visited 30 July 1999).
10. Schwartz, C. Web search engines. Journal of the American Society for Information Science, 49, 1998, 973–982.
11. Leighton, H.V. and Srivastava, J. Precision among World Wide Web search services (search engines): AltaVista, Excite, HotBot, Infoseek, Lycos. 1997. http://www.winona.msus.edu/library/webind2/webind2.htm (visited 30 July 1999).
12. Harter, S.P. and Hert, C.A. Evaluation of information retrieval systems: approaches, issues, and methods. Annual Review of Information Science and Technology, 32, 1997, 3–79.
13. Ellis, D. Progress and problems in information retrieval. London: Library Association Publishing, 1996.
14. Chu, H. and Rosenthal, M. Search engines for the World Wide Web: a comparative study and evaluation methodology. In: ASIS 1996 Annual Conference Proceedings, Baltimore, MD, October 19–24, 1996, 127–135. Also available at: http://www.asis.org/annual-96/ElectronicProceedings/chu.html (visited 30 July 1999).
15. Fricke, M. Measuring recall. Journal of Information Science, 24(6), 1998, 409–417.
16. Lauren, J.V. Somebody wants to get in touch with you: search engine persuasion. Database, 21(1), 1998, 42–46.
17. Dezelar-Tiedman, C. Known-item searching on the World Wide Web. Internet Reference Services Quarterly, 2(1), 1997, 5–14.
18. Bar-Ilan, J. On the overlap, the precision and estimated recall of search engines: a case study of the query "Erdos". Scientometrics, 42(2), 1998, 207–228.
19. Clarke, S.J. and Willett, P. Estimating the recall performance of Web search engines. Aslib Proceedings, 49(7), 1997, 184–189.
20. Ding, W. and Marchionini, G. A comparative study of web search service performance. In: ASIS 1996 Annual Conference Proceedings, Baltimore, MD, October 19–24, 1996, 136–142.
21. Gauch, S. and Wang, G. Information fusion with ProFusion. Webnet 96 Conference, San Francisco, CA, October 15–19, 1996. http://www.csbs.utsa.edu:80/info/webnet96/html/155.htm (visited 30 July 1999).
22. Tomaiuolo, N.G. and Packer, J.G. An analysis of Internet search engines: assessment of over 200 search queries. Computers in Libraries, 16(6), 1996, 58. Also available at: http://neal.ctstateu.edu:2001/htdocs/websearch.html (visited 30 July 1999).
23. Westera, G. Search engine comparison: testing retrieval and accuracy. November 1996. http://www.curtin.edu.au/curtin/library/staffpages/gwpersonal/senginestudy/results.htm (visited 30 July 1999).
24. Nasios, Y. et al. Evaluation of search engines: report undertaken by the National Technical University of Athens on behalf of the European Commission and Project PIPER, July 1998. http://piper.ntua.gr/reports/searching/doc.0000.htm (visited 30 July 1999).
25. Schlichting, C. and Nilsen, E. Signal detection analysis of WWW search engines. http://www.microsoft.com/usability/webconf/schlichting/schlichting.htm (visited 30 July 1999).
26. Swets, J.A. Signal detection and recognition by human observers. London: John Wiley, 1964.
27. Gonçalves, P., Robin, P., Santos, T., Miranda, O. and Meiro, S. Measuring the effect of centroid size on web search precision and recall. http://paulista.di.ufpe.br/~bright/publications/inet98_centroids/article.html (visited 30 July 1999).
28. Meghaghab, D.B. and Meghaghab, G.V. Information retrieval in cyberspace. In: The digital revolution: proceedings of the ASIS Mid-Year Meeting, San Diego, CA, May 18–22, 1996. Medford, NJ: Information Today Inc., 1998, 224–237.
29. Bilal, D. and Meghaghab, G.V. Information retrieval in cyberspace. http://www.sis.utk.edu/Dania/asis96.html (visited 30 July 1999).
30. Bharat, K. and Broder, A. A technique for measuring the relative size and overlap of public Web search engines. 1997. http://decweb.ethz.ch/WWW7/1937/com1937.htm (visited 30 July 1999).
31. Bharat, K. and Broder, A. A technique for measuring the relative size and overlap of public web search engines. Computer Networks and ISDN Systems, 30, 1998, 379–388.
32. Lawrence, S. and Giles, C.L. Searching the Web: general and scientific information access. IEEE Communications Magazine, 37(1), 1999, 116–122.
33. Lawrence, S. and Giles, C.L. Searching the World Wide Web. Science, 280(5360), 1998, 98–100.
34. Feldman, S. Web search service in 1998: trends and challenges. Searcher, 6(6), 1998, 29–39.
35. Kimmel, S. Robot-generated databases on the World Wide Web. Database, 19(1), 1996, 40–49.
36. Tunender, H. and Ervin, J. How to succeed in promoting your web site: the impact of search engine registration on retrieval of a World Wide Web site. Information Technology and Libraries, 17, 1998, 173–179.
37. Cooper, W.S. Expected search length. American Documentation, 19, 1968, 30–41.
38. Agata, T., Nozue, M., Hattori, N. and Ueda, S. A measure for evaluating search engines on the World Wide Web: retrieval test with ESL. Library and Information Science, 37, 1997, 1–11.
39. Humphries, K.A. and Kelly, J.D. Evaluation of search engines for finding medical information on the Web. 1997. http://www.icaen.uiowa.edu/~humphrie/ (visited 30 July 1999).
40. Nahl, D. Ethnography of novices' first use of Web search engines: affective control in cognitive processing. Internet Reference Services Quarterly, 3(2), 1998, 69.
41. MacCall, S.L. Relevance reliability in cyberspace. In: Proceedings of the 61st Annual Meeting of the American Society for Information Science, 35, 1998, 13–21.
42. Chowdhury, G.G. The Internet and information retrieval research: a brief review. Journal of Documentation, 55(2), 1999, 209–225.
43. Back, J. and Summers, R. Unpublished results.
44. Bharat, K. and Broder, A. Measuring the Web. March 1998 (an update to [31]). http://www.research.digital.com/src/whatsnew/sem.html (visited 30 July 1999).
45. Lebedev, A. Best search engines for finding scientific information on the Web. May 1997. http://www.chem.msu.su/eng/comparison.html (visited 30 July 1999).
46. Search engine sizes scrutinised. The Search Engine Report, 30 April 1998. http://www.searchenginewatch.com/sereport/9805-size.html (visited 30 July 1999).
47. How to choose a search engine or research database. http://www.albany.edu/library/internet/choose.html (visited 30 July 1999).
48. Dong, X.Y. and Su, L.T. Search engines on the World Wide Web and information retrieval from the Internet: a review and evaluation. Online and CD-ROM Review, 21(2), 1997, 67–82.
49. Kerr, M. Universal access by design. VINE, 106, 1997, 42.
50. Bowler, D., Okamoto, T. and Pothier, B. Internet access for the visually impaired. An interactive qualifying project submitted to the Faculty of Worcester Polytechnic Institute in partial fulfilment of the requirements for the Degree of Bachelor of Science, 1996. http://abbey.rnib.org.uk/bdp/toc.htm (visited 30 July 1999).
51. Oppenheim, C. and Selby, K. Access to information on the World Wide Web for blind and visually impaired people. Aslib Proceedings, 51(10), 1999, 335–345.
52. Golovchinsky, G. From information retrieval to hypertext and back again: the role of interaction in the information exploration interface. PhD Thesis, University of Toronto, 1996. http://anarch.ie.utoronto.ca/people/golovch/thesis/final/ (visited 30 July 1999).
53. Su, L.T., Chen, H.L. and Dong, X.Y. Evaluation of Web-based search engines from an end-user's perspective: a pilot study. Proceedings of the ASIS Annual Meeting, 35, 1998, 348–361.
54. Keily, L. Improving resource discovery on the Internet: the user perspective. In: Raitt, D.I. et al., eds. Online Information '97 Proceedings: 21st International Online Information Meeting. Oxford: Learned Information Europe Ltd, 1997, 205–212.
55. Wheatley, A. and Armstrong, C.J. Metadata, recall and abstracts: can abstracts ever be reliable indicators of document value? Aslib Proceedings, 49(8), 1997, 206–213.
56. Tikhomirov, O.K. The psychology of thinking. Moscow: Progress Publishers, 1988.
57. Web search tool features. http://www.unn.ac.uk/features.htm (visited 30 July 1999).
58. Harry, V. and Oppenheim, C. Evaluation of electronic databases. Part I: criteria for testing CD-ROM products. Online and CD-ROM Review, 17, 1993, 211–222.
59. Harry, V. and Oppenheim, C. CD-ROM evaluation. Part II: testing CD-ROM products. Online and CD-ROM Review, 17, 1993, 339–368.
(Revised version received 4 August 1999)