In Käkölä, T. (ed.), Proceedings of IRIS22, Jyväskylä, Finland, 7-10 August 1999
A Method for Intranet Search Engine Evaluations

Dick Stenmark
[email protected]
Volvo Information Technology, Dept. 9530, HD1N, SE-40508 Göteborg, Sweden
IT & Organisation, Viktoria Institute / Göteborg University, Box 620, SE-40530 Göteborg, Sweden
Abstract

Most comparative evaluations of Internet search engines provide soon-outdated answers to a handful of pre-selected questions. Such an approach is too limited to be useful for practitioners wanting to evaluate information retrieval tools. Informed by an action research-oriented study of a large organisation conducting such an evaluation, this paper first discusses what questions to ask when analysing and evaluating intranet search tools. Applying Henderson and Cooprider's approach and Saracevic's theoretical foundation to the field data, the author outlines a generic methodology for information system evaluation. The proposed methodology is refined through repeated interviews and discussions with people from a variety of organisational roles, thus taking a broader and more general approach. By suggesting that the selected features should be given different weights according to their relative importance and summarised in two dimensions, the methodology helps organisations detect the most useful tool for their specific requirements.
Keywords: evaluation methodology, search engine, intranet, information systems
1. Introduction

As the World Wide Web (hereafter, the web) started to explode in terms of numbers of users, servers, and pages, it became painfully obvious that search capabilities had to be added to this chaos of information in order for people to be able to find anything. Before long, a flourishing supply of publicly available search services emerged, and today these tools have become part of our everyday life as net citizens. Not surprisingly, a substantial body of research regarding search engines has therefore developed over the past years.

A few years later, the number of intranets, i.e. intra-organisational webs hidden from the global public behind firewalls and proxies, had grown to become an important part of the web. Since the public search engines' spiders are not allowed inside firewalls, organisations relying on such solutions must host their own engines in order to provide search capabilities. Though many commercial search engines and spiders are available on the market, it remains difficult for a particular organisation to know which solution best suits its needs and requirements.

The computer-related press regularly publishes "stand-alone" evaluations of various tools, gadgets, and services - including search engines - but such evaluations are often highly subjective and the features are examined out of context. This fact calls for more targeted research on how such evaluations should be conducted.
Though literature offers many examples of comparative studies of search tools, both from academia (e.g. Chu and Rosenthal 1996; Leighton and Srivastava 1997) and industry (e.g. Goldschmidt et al. 1997; Slot 1995), it fails to deliver a generic method useful for evaluations in general. The aim of this research is to suggest a functional method that will empower any organisation to do its own evaluation. With their functional model of CASE technology, Henderson and Cooprider (1990) provide a useful model for how to organise such an undertaking, and I will use their article as a blueprint for my own work. Though Lester and Willcocks (1993) have argued that information systems evaluations should consider all stages of the systems' life cycle, the focus of this paper is limited to the pre-installation (selection) phase only.

The use of precision and recall for Information Retrieval (IR) evaluation purposes goes back more than forty years to when Kent et al. (1955) proposed them as the primary measures of IR performance. It is therefore not surprising to find that many search engine evaluations are based solely on these measures (cf. Leighton and Srivastava 1997). However, both precision and recall require the tool(s) to be installed before these quantities can be measured. This is problematic when the very objective is to determine what tool(s) to install. Further, and more importantly, several scholars have questioned the usefulness of precision (e.g. Raghavan et al. 1989) and recall (e.g. Chu and Rosenthal 1996) as measures of IR system performance. Nielsen (1999) also points out that the use of precision and recall assumes that the users want a complete set of relevant documents. This might have been true in "traditional" (i.e. pre-web) IR, Nielsen argues, but on the web, most users are not interested in retrieving all relevant documents, or in receiving only relevant documents. As long as the answer to their query is found among the top ten returned items, most users are satisfied. This suggests that criteria other than just precision and recall should be considered. The question of what those other criteria are remains, however, unanswered.

Chu and Rosenthal (1996) claim to provide a generic answer to the above question, but their schema is unfortunately too limited. First, it only considers the usage of public search engines and not the running of intranet tools, and second, it only measures a certain set of pre-selected attributes that may be totally unimportant to some sites. Others, who have studied intranet tools (e.g. Goldschmidt et al. 1997), focus on specific features only and thus fail to provide a generic method. However, Saracevic's excellent 1995 paper on evaluation of IR evaluations offers a good starting point. He argues in favour of a multi-level approach to evaluation, suggesting six possible levels ranging from a system-oriented to a user-oriented view. According to Saracevic, evaluating on one level seldom tells anything about the other levels, and "this isolation of levels of evaluation could be considered a basic shortcoming of all IR evaluations" (ibid., p. 141, italics in original).
Based on such observations from the literature, and supported by empirical case-study data, this paper argues that different organisations have different needs. By recognising that these needs should not be defined beforehand, but as part of the evaluation process, the approach described here becomes applicable to any organisation.
2. A Multi-level Approach

During a study of a Swedish company, I participated in a search engine evaluation project aimed at finding a suitable intranet search tool for the organisation. The project stretched over six months, during which I gained an understanding of what people in different positions in the company thought about web searching and what needs they had. I have used qualitative techniques such as long interviews and field observations for data collection, and a total of 42 hours were spent on individual interviews with 14 people. Many of these people, who
included managers at different levels, IR personnel and librarians, people from operations, content providers, computer scientists and technicians, and end-users, were interviewed several times. I worked closely with the project members and participated in the project all the way from the very beginning to the final decision and implementation. In addition to the individual interviews, I also arranged three larger meetings and two workshops to which interested parties were invited to debate different needs and try various design suggestions. In parallel to the practical project work, I did a thorough literature study to see how others had carried out their evaluations. Ideas and methods found in the literature were discussed during the interviews and meetings, and strong and weak points were noted in an iterative way. At the end of the project, I analysed the collected material and generalised my experiences into a tentative methodology in a grounded theory-inspired fashion. The methodology has been further refined through insights gained from telephone calls and email contacts with practitioners at several other sites.

Based on interviews and literature studies, I identified over 80 different criteria. This large number of features, and the spread of levels to which they belong, made it difficult to get a good overview. Applying a divide-and-conquer strategy, I divided the features into different levels of objectives, as suggested by Saracevic (1995). Saracevic has six partly overlapping levels, namely the engineering, input, processing, output, user, and social levels. Though I agree that the social impacts of IT are important and in need of further research, such issues do not fit into this article and will thus be dealt with in subsequent papers. Informed by the empirical material and the suggestions received from reviewers of earlier drafts of this paper, and influenced by Saracevic's multi-level approach, I suggest the following four levels:

1. Data gathering, which deals with the spider and its particular features.
2. Index, which is about how the index is organised.
3. Search features, which include search capabilities, interface issues, and user documentation.
4. Operation and maintenance, which covers both operational issues (such as the type of hardware and software the product runs on and what technical support is available) and more administrative aspects.

These levels will now be discussed in more detail.
2.1 Data gathering

Most organisations have legacy data in formats other than HTML, e.g. Adobe's PDF, MS Office, FrameMaker, Lotus Notes, PostScript, and plain ASCII text. The spider should at least be able to correctly interpret and index the most frequently used or the most important of these formats. If meta-information and XML tags are likely to show up within the documents, the spider must be able to interpret such tags, and it would also be useful if RDF-formatted information could be gathered intelligently. If USENET newsgroups need to be indexed, the spider must be able to crawl through them. The same goes for client-side image maps, CGI scripts, ASP-generated pages, pages using frames, and Lotus Domino servers. Although frames are frequently used within many companies, spiders, which generally work their way around the net by picking up and following hypertext links, may not be able to correctly interpret the different syntax used for framed pages. These links could end up ignored. Spidering Domino servers via ordinary HTTP requests requires the search engine to be able to intelligently filter out the many collapsed/expanded versions of the same page, or the index will quickly be
filled with duplicates. Another, and arguably better, way would be to access Domino servers via the provided APIs.

Another situation that is likely to require access via APIs rather than crawling through HTTP is when Content Management (CM) systems are used. In CM tools, the actual content of a page is stored separately from the page layout information. Since pages are rendered dynamically only when requested by a user (via her browser), the spider may not be able to pick up the link information that is embedded in the page code. Without those links, the spider will not be able to find the information. Even if the information is found and indexed correctly, it might be difficult for the search engine to decide how to display a search result, since the indexed information may belong to several dynamic pages. This is an area not yet fully explored by search engine vendors, and proposed solutions should be investigated carefully.

Intelligent robots are able to detect copies or replicas of already indexed data while crawling, and advanced search engines can index "active" sites, i.e. sites that update frequently, more often than sites that are more "passive". If this is not supported, some manual means of determining time-to-live should be provided. There should be some means of restricting the robot from entering certain areas of the net, at any desired domain, sub-net, server, directory, or file level. Also, check if search depth can be set to avoid loops when indexing dynamically generated pages. Support for proxy servers and password handling can be useful, as can the ability to not only follow links but also detect directories and thus find files not linked to from other pages. The spider should be easy to set up and start. Check how the URLs from which to start are specified, as well as whether users may add URLs. Finally, the Robot Exclusion Protocol (Koster 1994) provides a way for the webmaster to tell the robot not to index a certain part of a server. This should be used to avoid indexing temporary files, caches, test or backup copies, as well as classified information such as password files. The above discussion is synthesised in table 1 below, and a minimal code sketch of a crawl restriction check follows the table.

Table 1: Questions about the data gathering process.
1. What formats other than HTML can be indexed by default?
2. Are any APIs (or equivalents) available to add more formats?
3. Can meta-data be indexed?
4. Can ad-hoc XML tags be handled?
5. Can USENET news be indexed?
6. Can image maps be handled?
7. Are CGI scripts handled?
8. Can frames be handled correctly?
9. Can Notes Domino servers be handled correctly?
10. Is crawling of Content Management systems supported?
11. Are duplicate links automatically detected?
12. Can the spider determine how often to revisit a site/page?
13. Can time-to-live be specified?
14. How can the crawling be restricted?
15. Can the search depth be specified?
16. Can the product handle proxy servers?
17. Does the product handle password-protected servers?
18. Can the spider find directories on its own?
19. Is the spider easy to set up and control?
20. Can users add URLs?
21. Does the spider honour the robot exclusion protocol?
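To make questions 14, 15, and 21 concrete, the following minimal sketch (not taken from any particular product) shows how a spider might combine the Robot Exclusion Protocol with a configurable depth limit and host exclusion list before fetching a URL. The host names and limits are hypothetical examples; Python's standard robotparser module is used for the robots.txt check.

# A minimal sketch, assuming hypothetical host names, of crawl restrictions
# combined with the Robot Exclusion Protocol (table 1, questions 14, 15, 21).
from urllib import robotparser
from urllib.parse import urlparse

EXCLUDED_HOSTS = {"backup.intranet.example.com"}   # hypothetical restriction list
MAX_DEPTH = 5                                      # guard against loops in dynamic pages

def may_fetch(url, depth, user_agent="intranet-spider"):
    """Return True if the spider is both configured and allowed to fetch this URL."""
    parsed = urlparse(url)
    if depth > MAX_DEPTH or parsed.hostname in EXCLUDED_HOSTS:
        return False                               # local crawl restrictions
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()                                      # fetch and parse the server's robots.txt
    return rp.can_fetch(user_agent, url)           # honour the Robot Exclusion Protocol

# Example call against a hypothetical intranet server:
# may_fetch("http://docs.intranet.example.com/manuals/index.html", depth=2)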
2.2 Index

Although a good index alone does not make a good search engine, the index is an essential part of a search tool. Chu and Rosenthal (1996) conclude that one of the most important issues is keeping the index up-to-date, and the best way to do that is to allow real-time updates. There is a big difference between indexing the full text and indexing just a portion of it. Though partial indexing saves disk space, it may prevent people from finding what they are looking for. The portion of text being indexed also affects the data that is presented as the search result. Some tools only show the first few lines, while others may generate an automatic abstract or use meta-information.

If the organisation consists of several sub-domains, users might only want to search their specific sub-domain. Allowing the index to be divided into multiple collections might then speed up the search. It may also prove useful to be able to split the index into several collections even though they are kept at one physical location. For example, one may want separate collections for separate topics or business areas.

Some tools support linguistic features such as automatic truncation or stemming of the search terms, where the latter is a more sophisticated form that usually performs better. If the organisation is located in non-English speaking countries, the ability to correctly handle national characters becomes important. Also, note that some products cannot handle numbers. If number searching is required, e.g. for serial numbers, this limitation should be taken into consideration. Should words that occur too frequently be removed from the index? Some engines have automatically generated stoplists, while others require the administrator to remove such words manually.

Search engines are of little use when an overview of the indexed data is wanted, unless they are able to categorise the data and present it as a table of contents. Automatic categorisation may also be used to focus in on the right sub-topic after having received too many documents. If information about when a particular URL is due for indexing is available, it is useful to make it accessible to the user. This discussion results in table 2 below; a toy indexing sketch follows the table.

Table 2: Questions about the index and how the index is structured internally.
1. Is the index updated in real-time?
2. Is the full text or just a subset indexed?
3. Can abstracts be created automatically?
4. Can the product handle several collections?
5. Does the product exploit stemming?
6. Can the product correctly index national characters?
7. Are numbers indexed correctly?
8. Are stoplists for common words supported?
9. Is there any built-in support for automatic categorisation?
10. Does the index have a URL status indicator?
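As a toy illustration of how stemming and stoplists (questions 5 and 8 above) affect the index, the sketch below builds a tiny inverted index. The stoplist, the crude suffix-stripping stemmer, and the sample documents are all hypothetical simplifications; real products use considerably more sophisticated linguistics.

# A toy sketch, not any product's actual method, of a stoplist and a crude
# suffix-stripping "stemmer" feeding an inverted index.
import re
from collections import defaultdict

STOPLIST = {"the", "and", "of", "a", "to"}          # hypothetical list of common words

def crude_stem(word):
    """Naive suffix stripping; stands in for real stemming such as Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    index = defaultdict(set)                        # term -> set of document ids
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            if token not in STOPLIST:               # the stoplist removes frequent words
                index[crude_stem(token)].add(doc_id)
    return index

docs = {"d1": "Indexing national characters like å, ä and ö",
        "d2": "The spider indexes framed pages"}
print(build_index(docs)["index"])                   # both documents match the stem "index"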
2.3 Search features

The user query and the search result interfaces are often sadly confusing and unpredictable. Shneiderman et al. (1998) argue that the text-search community would greatly benefit from a more consistent terminology. Since we do not yet have this consistency, evaluation of the search features must be done with great care: different vendors use different names for the same feature, or the same name for different features.

Though a Boolean-type search language is often offered, most users do not feel comfortable with Boolean expressions (Hearst 1997). Instead, studies have shown that the average user only enters 1.5 keywords (Pinkerton 1994). Due to the vocabulary problem
(Furnas et al. 1987), the user is likely to receive many irrelevant documents as a result of a one-keyword search. Natural language queries have been shown to yield more search terms and better retrieval results than Boolean queries, even when the latter are formulated by skilled IR personnel (Turtle 1994). Apart from Boolean operators, a number of more or less sophisticated options (e.g. full text search, fuzzy search, require/exclude, case sensitivity, field search, stemming, phrase recognition, thesaurus, or query-by-example) are usually offered. One feature to look for in particular is proximity search, which lets the user search for words that appear relatively close together in a document. Proximity search capability has been noted to have a positive influence on precision (Chu and Rosenthal 1996).

Many organisations prefer to have a common "company look" on all their intranet pages. This requires customisation that may include anything from changing a logo to replacing entire pages or chunks of code. Again, this is an aspect irrelevant to public search services but something an intranet search engine might benefit from. Sometimes a built-in option allows the user to choose a simple or an advanced interface. It should also be possible to customise the result page. The user could be given the opportunity to select the level of output, e.g. by specifying compact or summary. Further, search terms may be highlighted in the retrieved text, the individual word count can be shown, or the last modification date of the documents may be displayed. It can also be possible to restrict the search to a specific domain or server, or to search previously retrieved documents only. For the latter, relevance feedback is a very important way to improve results and increase user satisfaction (Shneiderman et al. 1998).

Ranking is usually done according to relevancy of some form. However, the true meaning of the ranking is normally hidden from the user, and only presented as a number or percentage on the result page. More sophisticated ways to communicate this important information to the user have been developed (e.g. Rao et al. 1995), but not many of the commercially available products have yet incorporated such features. However, the possibility to switch between ranking by relevancy and by date is often supported. Dividing the results into specific categories might help the user to interpret the returned result. Finally, ensure the product comes with good and extensive online user documentation. Table 3 below summarises the above discussion; a toy sketch of proximity matching follows the table.
Table 3: Questions about the search interface and the features available to the users.
1. Can Boolean-type queries be asked?
2. Are true natural language queries supported?
3. Does the product support full text search?
4. Is fuzzy matching supported?
5. Does the product support + (require, must) and - (exclude, must not)?
6. Can case sensitivity be turned on/off?
7. Can attributes or fields be searched?
8. Is stemming done automatically?
9. Does the product recognise phrases?
10. Does the index utilise a thesaurus?
11. Is "query by example" supported?
12. Does the product have proximity search (near)?
13. Can the search interface be customised?
14. Can the layout of the result page be customised?
15. Are there both simple and advanced interfaces?
16. Can the user select the level of output (compact, normal, summary)?
17. Are search words highlighted in the results?
18. Does the product show individual word count?
19. Does the index display modification date per document?
20. How can the search be restricted? (By date? By domain? By server? Not at all?)
21. Can previous results be searched (refined search)?
22. Is relevance feedback implemented?
23. Is the ranking algorithm modifiable?
24. Can the result be ranked on both relevancy and date?
25. Does the product have good on-line help and user documentation?
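As an illustration of proximity search (question 12 in table 3), the toy function below checks whether two query terms occur within a given number of words of each other. Real engines evaluate this against position lists stored in the index rather than against raw document text; the example document and window size are made up.

# A toy sketch of a proximity ("near") check over raw text; real products use
# positional indices instead. All inputs below are hypothetical.
def near(text, term_a, term_b, window=10):
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

print(near("the intranet search engine indexes framed pages",
           "search", "pages", window=5))            # True: the terms are 4 words apart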
2.4 Operation and maintenance

Hosting a search service requires considerations that are not necessary when using a public search engine. For example, operations and maintenance issues are of no importance to public search engine evaluations, but for an internal search service they are of course highly interesting. Start by checking if the product is available on many platforms or if it requires the organisation to invest in new and unfamiliar hardware. If the intranet consists of one server only, a spider is not needed, but as the web grows, crawling capabilities become essential. A spider allows the net to grow without forcing the webmasters to install indexing software locally. An intranet search engine often runs on a single machine and is operated and maintained by people with knowledge about servers, but not necessarily experts in spider technology. This suggests that a good intranet spider should be designed specifically for an intranet and not just be a ported version of an Internet spider. Still, the spider and the index must be able to handle large amounts of data without letting response times degrade, or the users will be upset. For example, a product that can take advantage of multi-processor hardware scales better as the intranet grows. The product should therefore have been tested to handle an intranet of the intended size. Running the spider should not interfere with how the index is operated; both these components need to be active simultaneously.

I found great differences in how straightforward the products were to install, set up, and operate. Some required an external HTTP server while others had a built-in web server; the latter were consistently less complicated to install. However, installation is probably something done once, while indexing and searching are done daily. This ratio suggests that indexing and searching features should be weighed higher than installation routines. It is difficult to estimate data collection time since it depends on the network, but during the test installation this activity should be clocked. Also, try to determine how query response times grow with the size of the index (a small timing sketch follows table 4). If an index in every city, state, or country where the organisation is represented is wanted, ensure the product supports this kind of distributed
operation, and check whether any bandwidth-saving technique is used. Having technical support locally is an advantage if the local support also has local competence; if questions have to be sent to a lab elsewhere, the advantage is lost.

An important feature is the ability to automatically detect links to pages that have been moved or removed. If dead links cannot be detected automatically, the links should at least be easy to remove, preferably by the end-user. Allowing end-users to add links is a feature that will off-load the administrator. Functions such as email notification to an operator, should any of the main processes die, and good logging and monitoring capabilities, are features to look for. I found that products with a graphical administrator interface were more easily and intuitively handled, though the possibility to operate the engine via line commands may sometimes be desired. It should also be possible to administer the product remotely via any standard browser. Documentation should be comprehensive and adequate.

Finally, consider the price - is it a fixed fee or is it correlated to the size of the intranet? In addition, what kind of support is offered and at what cost? Sometimes installation and training are included in the price. How long the product has been available and how often it is updated are important factors that indicate the stability of the product, and it is also important to ask about future plans and directions. These questions are synthesised in table 4 below.

Table 4: Questions about operational issues and maintenance of the product.
1. What platforms does the product run on (HW and SW)?
2. Can remote web servers be indexed?
3. Is the product designed for intranets?
4. Can the product exploit multi-processor HW?
5. How large data volumes can be handled (in MB)?
6. Can the spider and the index be run independently?
7. How easy is the product to install and maintain?
8. Does the product come with an embedded HTTP server?
9. How long will data collection take (per MB)?
10. What is the upper limit for query response times?
11. Can the load be distributed over several servers/indices?
12. Is some sort of bandwidth-saving technique utilised?
13. Where is the nearest technical support?
14. Can old/bad links be removed easily or even automatically?
15. Does the product have email notification?
16. What logging/monitoring features are included?
17. Is there a graphical admin interface?
18. Can the product also be operated in line mode (e.g. from a script or cron-tab entry)?
19. Can the product be administered remotely?
20. Is there enough documentation of good quality?
21. Is the price scalable?
22. What is included in the price? Installation? Training? Upgrades? Support?
23. Does technical support cost extra?
24. How long has the product been available?
25. How often is the product updated?
26. What is coming within the next 6-12 months?
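As mentioned above, data collection and query response times should be clocked during the test installation. The following minimal sketch times individual queries against a candidate engine; the endpoint URL and query parameter name are hypothetical and must be replaced with whatever interface the product under test actually exposes.

# A minimal sketch, assuming a hypothetical search endpoint, of clocking
# query response times during a test installation (table 4, questions 9-10).
import time
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "http://searchtest.intranet.example.com/query"   # hypothetical endpoint

def time_query(terms):
    """Return the elapsed wall-clock time, in seconds, for one search request."""
    start = time.perf_counter()
    with urlopen(f"{SEARCH_URL}?{urlencode({'q': terms})}") as response:
        response.read()                    # make sure the full result page arrives
    return time.perf_counter() - start

for query in ("assembly instructions", "serial number 4711"):
    print(query, round(time_query(query), 3), "seconds")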
3. The evaluation process

Most authors seem (tacitly) to assume that evaluations are the result of the heroic work of a lone ranger. As information systems in general become increasingly complex, in that they are no longer stand-alone black boxes but intertwined with the social realities in which they operate, this position no longer holds. Evaluations often cover so many areas in so many diversified fields that it is highly unlikely that any one person would have all the competencies required. This argues in favour of having the evaluation performed as a collaborative, cross-disciplinary process. Involving more people in the process also ensures a broader engagement and increases the chance of reaching consensus.

The large number of features discussed above may suggest a rather time-consuming evaluation process. The main point advocated here, however, and what seems to be a novel observation, is that which features are important and which are not cannot be decided in general, but must be determined for each site or installation individually. This should be done by considering each suggested aspect and deciding whether or not it is applicable or relevant. When the number of features has been reduced to a more manageable set, they should be ranked, measured, and summarised as described in the following sections.
3.1 Reducing and ranking

The organisation should, from the attributes presented above, select only those features applicable to its situation. Again, since no one person is likely to have the full picture, this process should involve a reference group consisting of people with different backgrounds. There are several ways in which the relevant features may be identified, but one rather straightforward approach is to let every group member individually list the features they think ought to be considered. Those features selected by more than N persons, where N depends on the size and composition of the group, are in, while those not selected by any person are out. The remaining features, i.e. those receiving between 1 and N votes, are discussed until consensus is reached.

The selected features should next be ranked. There are formal techniques available to support such a process. Eriksson and Roy (1999) and Karlsson et al. (1998) find the Analytic Hierarchy Process (AHP) to be particularly good. AHP is a decision-making method based on pair-wise comparison of all unique attributes or requirements. However, since this means that n attributes generate n(n-1)/2 comparisons, the approach scales rather poorly and is ill suited for projects with a large number of attributes. On the other hand, the high level of redundancy in the method makes it very insensitive to judgemental errors, and another strong feature is that the resulting priorities are relative and ratio-scale based. As the name implies, AHP assumes that the aspects can be ordered hierarchically, something that is not always true. This may however be circumvented as described by Karlsson et al. (1998).

My reference group, who felt that AHP was too laborious and academic, chose a more pragmatic but still useful method known as priority groups. This approach is also discussed by Karlsson et al. (1998). Using priority groups means that the features are not evaluated individually but instead grouped according to importance. By determining how many groups should be used, one is able to set the granularity of the ranking (and thus the amount of work required). In my case the group members were able to reach a reasonably strong consensus fairly quickly. A third option would have been to combine the two methods by first
constructing priority groups and then applying AHP to each group.
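A minimal sketch of the voting rule described at the beginning of this section is given below, assuming a reference group of five members and a threshold N of three; all feature names and votes are invented for illustration.

# A minimal sketch, assuming a five-member reference group, of the voting rule
# from section 3.1: features named by more than N members are kept, features
# named by nobody are dropped, and the rest are flagged for discussion.
from collections import Counter

N = 3                                              # threshold; depends on group size
member_lists = [                                   # one candidate feature list per member
    {"frames", "stemming", "proximity", "meta-data"},
    {"frames", "stemming", "national characters"},
    {"frames", "stemming", "proximity"},
    {"frames", "proximity", "meta-data"},
    {"frames", "stemming"},
]
all_features = {"frames", "stemming", "proximity", "meta-data",
                "national characters", "query-by-example"}

votes = Counter(f for member in member_lists for f in member)
kept      = {f for f in all_features if votes[f] > N}
dropped   = {f for f in all_features if votes[f] == 0}
discussed = all_features - kept - dropped          # 1..N votes: settle by consensus

print("kept:", kept)                               # frames (5 votes), stemming (4 votes)
print("dropped:", dropped)                         # query-by-example (no votes)
print("to discuss:", discussed)                    # proximity, meta-data, national characters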
3.2 Weighing and measuring

Each subgroup or category should thereafter be assigned a weight indicating the importance of the features in the group. For the weighting to be meaningful, there should be a significant weight difference between attributes that are significantly important and attributes that are not. For example, if there are three categories, the "Very important" category should get the weight 8, the "Important" category should get 4, and the "Useful" category should get 2. If a feature in the Very important category is not twice as important as a feature in the second category, the weights are inaccurate or the feature is wrongly categorised. It may also indicate that more categories are needed.

When the importance of each relevant attribute has been established, it is time to see how well each product scores. Many other evaluations answer questions such as "Can formats other than HTML be indexed?" with a yes or no. I recommend against that type of binary scoring, and argue that the answers should be graded to better portray the complex reality. There is a significant difference between "Yes, it can also index ASCII text" and "Yes, it can also index two hundred additional formats". Furthermore, I argue that a metric measure is preferable to words such as excellent, good, fair, or poor, which are otherwise often used (e.g. Chu and Rosenthal, 1996). When many aspects are considered, it is difficult to calculate the average of five "good", eight "fair", and three "poor". Therefore, I recommend the use of a 6-point metric scale, where 5 means full support and 0 means no support. Although the consensus approach works well when trying to establish the relative importance of each attribute, it is not equally suitable for measuring the quality of the attributes. Instead, it seems logical that the domain expert's opinion should be given higher importance, even though a layman may contribute to the discussion.

I further suggest that not only should the grand total for each product be calculated, but also that sub-totals should be produced both per importance category and per objective level. Summarising in these two dimensions allows identification not only of the overall winner, but also of particular characteristics of each product. These characteristics are just as important as the grand total, a fact overlooked in other evaluation methods.
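As a small illustration of the scheme above, the snippet below combines a 0-5 score with the weight of its importance category, producing the points/total notation (e.g. 5/40) used in the tables of the next section. The categories and weights follow the three-category example above; the attribute and score are made up.

# A small illustration of score x category weight; the scored attribute is hypothetical.
WEIGHTS = {"Very important": 8, "Important": 4, "Useful": 2}

def weighted(score, category):
    assert 0 <= score <= 5, "scores use the 6-point scale 0..5"
    return score * WEIGHTS[category]

# e.g. "Can frames be handled correctly?" judged Very important and scored 5:
print(weighted(5, "Very important"))    # 40, shown as 5/40 in table 5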
3.3 Two-dimensional summarising

To summarise per importance category, first create a table and enter the items from the most important group or category. Thereafter, enter the previously assigned score for each attribute and calculate the sub-total as indicated in the example in table 5. Each score should be multiplied by the weight factor. Do likewise for the other groups or categories.

Table 5: Example of a "Very important" category table in a 3-candidate evaluation sheet. The weight for this category is set to 8 and each aspect can receive 0-5 points. The table shows both points and total (e.g. 5/40).

Question        Candidate #1   Candidate #2   Candidate #3
Attribute #1    5/40           4/32           3/24
Attribute #2    3/24           5/40           5/40
Attribute #3    5/40           5/40           5/40
...             ...            ...            ...
Attribute #n    5/40           5/40           5/40
Sum:            290            310            280
When the sub-totals for all categories have been calculated, they should be summarised as in table 6 below. In the example in table 6, there is a draw between Candidate #1 and Candidate #2. However, by looking at the sub-scores from the importance groups it becomes clear that Candidate #2 scores best on the features considered Very important, while Candidate #1 gets its points from the Useful group. It can thus be deduced that Candidate #2 has its strength where it matters the most, while Candidate #1 has more "extras", i.e. nice-to-have features that are used if present but not missed if left out. The organisation doing the evaluation now has a richer understanding of the pros and cons of each candidate and thus a better basis for its decision.

Table 6: The overall scoring table of the example evaluation sheet. Here, the three importance categories are added up to a grand total.

Total Score      Candidate #1   Candidate #2   Candidate #3
Very important   290            310            280
Important        105            110            102
Useful           90             65             60
Sum:             485            485            442
The process may, however, be taken yet a step further by also examining the objective levels, i.e. Data gathering, Index, Search features, and Operation. To do this, create a table for each of the four objective levels, enter the pre-calculated scores and the associated weights, and compute the sub-total for each objective level. Table 7 below shows an example of the Data gathering level.

Table 7: Summarising per objective level. Items from the three different importance categories, weighted 8, 4, and 2, respectively, are included. Both points and total (e.g. 5/40) are displayed.

Data gathering   Candidate #1   Candidate #2   Candidate #3
Attribute #1     3/24           5/40           5/40
Attribute #2     5/20           5/20           5/20
Attribute #3     4/32           5/40           3/24
...              ...            ...            ...
Attribute #m     0/0            4/16           3/12
Sum:             136            150            140
Repeat this for the remaining levels (Index, Search features, and Operation) and enter the results in a table as shown in table 8. The totals should add up to the same as in table 6. Summarising per objective level and comparing these categories, as in table 8, helps identify the strengths and weaknesses of each product. From the analysis of table 6, we may have decided that Candidate #2 is the best choice according to our needs. However, table 8 now tells us that both Candidate #1 and Candidate #3 are stronger on indexing and that Candidate #1 is better on operational aspects. Armed with these insights, we can approach the vendor of Candidate #2 and ask for specific improvements in these areas. The possibility of analysing the score according to both importance and function is therefore powerful; a code sketch of the roll-up follows table 8.

Table 8: Adding all objective levels up to a grand total. The column sums should, obviously, be the same as in table 6.

Objective level   Candidate #1   Candidate #2   Candidate #3
Data gathering    136            150            140
Indexing          85             72             100
Searching         164            172            152
Operation         100            90             50
Sum:              485            485            442
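The whole two-dimensional summarising procedure can be sketched in a few lines of code, as below. Each selected attribute carries an importance category, an objective level, and one 0-5 score per candidate; weighted points are then rolled up per importance category (as in table 6), per objective level (as in table 8), and into a grand total. All attribute names, categories, and scores are hypothetical.

# A minimal sketch, with made-up attributes and scores, of the two-dimensional
# summarising described in section 3.3.
from collections import defaultdict

WEIGHTS = {"Very important": 8, "Important": 4, "Useful": 2}

# (attribute, importance category, objective level, {candidate: 0-5 score})
ATTRIBUTES = [
    ("Frames handled correctly",  "Very important", "Data gathering",
     {"Candidate #1": 5, "Candidate #2": 4, "Candidate #3": 3}),
    ("Stemming",                  "Important",      "Index",
     {"Candidate #1": 3, "Candidate #2": 5, "Candidate #3": 5}),
    ("Proximity search",          "Very important", "Search features",
     {"Candidate #1": 4, "Candidate #2": 5, "Candidate #3": 2}),
    ("Graphical admin interface", "Useful",         "Operation",
     {"Candidate #1": 5, "Candidate #2": 2, "Candidate #3": 3}),
]

by_importance = defaultdict(lambda: defaultdict(int))   # category -> candidate -> points
by_level      = defaultdict(lambda: defaultdict(int))   # level    -> candidate -> points
total         = defaultdict(int)

for _name, category, level, scores in ATTRIBUTES:
    for candidate, score in scores.items():
        points = score * WEIGHTS[category]
        by_importance[category][candidate] += points
        by_level[level][candidate] += points
        total[candidate] += points

print(dict(total))                                  # grand totals (bottom rows of tables 6 and 8)
print({k: dict(v) for k, v in by_level.items()})    # per-level breakdown, as in table 8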
4. Concluding discussion

Running an intranet search engine is different from merely using a public search service, since otherwise ignored aspects such as choice of platform, operability, maintenance, and support become important. Intranet search tools should thus be evaluated differently - a fact that seems to have gone undetected. I have collected and listed a set of features that should not be seen as a final set, but rather as a starting point to which new findings should be added. I believe that by identifying as many aspects as possible, it is less likely that something important will be missed. Saracevic (1995) argues that both system- and user-centred evaluations are needed, and with the approach suggested in this paper the whole spectrum is covered. However, not all sites must examine all of these features. My main point is that one does not know which of these features are important and which are not until one is about to implement at a particular site. I therefore argue that the focus should be on the questions rather than on soon-outdated answers.

To reduce the large number of aspects to a more manageable set, I have advocated a methodology consisting of three distinct phases: reduce and rank, weigh and measure, and two-dimensional summarising. Once the scores and sub-totals have been collected and added up, the overall winner is easily identified, while at the same time both its strong and weak sides are displayed. It is also possible to account for any decision to discard a certain product, since it is easy to show that the product in question had a low overall score, or that it scored poorly on a particular objective level. My methodology brings out and makes explicit the critical aspects on which the final decision is based. As business evolves over time and products keep improving, re-evaluation of the decision might become necessary. To have documented that a certain product was discarded due to the lack of a specific feature may be useful if it turns out that the feature is now available, or that the feature is no longer required.

It may be argued that manually assigning scores to features will bias the evaluation and open it to subjective opinions. As suggested by Saracevic, serious questions such as "Who should be the judges? How reliable are their assessments? What affects their judgement? How do they affect the results?" (1995, p. 144) must always be asked when people are involved in the evaluation process. From the field of Decision Theory it has been suggested that formal methods such as the Analytical Hierarchy Process (AHP) should be used instead of a purely
manual process (Eriksson and Roy 1999). However, AHP is based on the assumption that the features can be ordered hierarchically - an assumption that conflicts with my empirical findings. I realise that having to rely on human opinion is open to criticism, but I still believe that this methodology helps reduce those risks. Before the summarising has been done, it is difficult to know how the individual products are doing, and since attention is focused on a single feature at a time, there is less risk of unintentionally favouring any particular product. If one truly wants to favour a product, it can be done using any methodology.
5. Acknowledgement

This work has been possible thanks to the support and co-operation from several individuals at Volvo IT. Thanks are also due to reviewers and participants of the 22nd IRIS conference, where a previous version of this paper was presented. In particular I would like to thank Carsten Sørensen for valuable suggestions and insightful comments.
6. References

Chu, H., and Rosenthal, M. (1996), "Search Engines for the World Wide Web: A Comparative Study and Evaluation Methodology", In Proceedings of the 1996 ASIS Annual Meeting, October 19-24.

Eriksson, U. and Roy, R. (1999), "The selection of a Web Search engine using Multiattribute Utility Theory", In Proceedings of CE99, Bath, UK, September 1-3.

Furnas, G., Landauer, T., Gomez, L., and Dumais, S. (1987), "The Vocabulary Problem in Human-System Communication", Communications of the ACM, Vol. 30, No. 11, pp. 964-971.

Goldschmidt, R., Mokotoff, C., and Silverman, M. (1997), "Evaluation of WWW Search Engines for NIH Implementation", National Institutes of Health, Bethesda, MD, USA.

Hearst, M. A. (1997), "Interfaces for Searching the Web", Special Report Article, Scientific American, No. 3.

Henderson, J. C. and Cooprider, J. G. (1990), "Dimensions of I/S Planning and Design Aids: A Functional Model of CASE Technology", Information Systems Research, Vol. 1, No. 3, pp. 227-254.

Karlsson, J., Wohlin, C., and Regnell, B. (1998), "An evaluation of methods for prioritizing software requirements", Information and Software Technology, Vol. 39, No. 14-15, pp. 939-947.

Kent, A., Berry, M., Luehrs, F. U., and Perry, J. W. (1955), "Machine literature searching VIII. Operational criteria for designing information retrieval systems", American Documentation, Vol. 6, No. 2, pp. 93-101.

Koster, M. (1994), "A Standard for Robot Exclusion", consensus on the robots mailing list ([email protected]) as of 30 June 1994. URL: http://www.robotstxt.org/wc/robots.html [2002, May 3].

Leighton, H. V. and Srivastava, J. (1997), "Precision among World Wide Web Search Services (Search Engines): AltaVista, Excite, Hotbot, Infoseek, Lycos", URL: http://www.winona.msus.edu/library/webind2/webind2.htm [1999, August 3].

Lester, S. and Willcocks, L. (1993), "How do organizations evaluate and control information systems investments? Recent UK survey evidence", paper for the IFIP WG8.2 working conference.

Nielsen, J. (1999), "User Interface Directions for the Web", Communications of the ACM, Vol. 42, No. 1, pp. 65-72.
Pinkerton, B. (1994), "Finding What People Want: Experiences with the WebCrawler", In Proceedings of the Second International World Wide Web Conference, Chicago, Illinois, USA.

Raghavan, V., Bollmann, P., and Jung, G. S. (1989), "A critical investigation of recall and precision as measures of retrieval system performance", ACM Transactions on Information Systems, Vol. 7, No. 3, pp. 205-229.

Rao, R., Pedersen, J., Hearst, M., Mackinlay, J., Card, S., Masinter, L., Halvorsen, P.-K., and Robertson, G. (1995), "Rich interaction in the digital library", Communications of the ACM, Vol. 38, No. 4, pp. 29-39.

Saracevic, T. (1995), "Evaluation of evaluation in information retrieval", In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Special issue of SIGIR Forum, pp. 138-146.

Shneiderman, B., Byrd, D., and Croft, W. B. (1998), "Sorting Out Searching: A User-Interface Framework for Text Searches", Communications of the ACM, Vol. 41, No. 4, pp. 95-98.

Slot, M. (1995), "Web Matrix: What's the Difference? Some answers about Search Engines and Subject Catalogs", 1995-96. URL: http://www.ambrosiasw.com/~fprefect/matrix/answers.html [2002, January 19].

Turtle, H. R. (1994), "Natural Language vs. Boolean Query Evaluation: A Comparison of Retrieval Performance", In Proceedings of SIGIR '94, Dublin, Ireland, Springer-Verlag, pp. 212-220.